Thursday, October 22, 2009

Parsing files using Groovy regex

In my previous post I mentioned several ways of defining regular expressions in Groovy. Here I want to show how we can use Groovy regexes to find and replace data in files.

Parsing a properties file (simplified) [1]

Data: each line in the file has the same structure, and the entire line can be matched by a single regex. Problem: transform each line into an object. Solution: construct a regex with capturing parentheses, apply it to each line, and extract the captured data. Demonstrates: the File.eachLine method and the matrix syntax of the Matcher object, where matcher[0][1] is group 1 of the first match.

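def properties = [:]
new File('path/to/some.properties').eachLine { line ->
    // Skip comment lines; capture the key (group 1) and the value (group 2)
    def matcher = line =~ /^([^#=].*?)=(.+)$/
    if (matcher) {
        // matcher[0] is the first match; [1] and [2] are its captured groups
        properties[matcher[0][1]] = matcher[0][2]
    }
}
println properties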
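To see the matrix syntax without touching the file system, here is a minimal sketch that applies the same regex to an in-memory list. The sample lines are made up for illustration; they are not taken from any real properties file.

```groovy
// Hypothetical sample data standing in for the lines of a properties file.
def lines = [
    '# a comment line',
    'db.url=jdbc:h2:mem:test',
    'db.user=sa',
    '',
    'not a property'
]

def properties = [:]
lines.each { line ->
    def matcher = line =~ /^([^#=].*?)=(.+)$/
    if (matcher) {
        // matcher[0] is the first match as a list: [whole match, group 1, group 2]
        properties[matcher[0][1]] = matcher[0][2]
    }
}
println properties
```

Only the two key=value lines end up in the map; the comment, the empty line, and the line without an equals sign do not match.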

Parsing CSV files (simplified) [2]

Data: each line in the file has the same structure; the line consists of blocks separated by some character sequence. Problem: transform each line into a list of objects. Solution: construct a regex with capturing parentheses, then parse each line with the regex in a loop, extracting the captured data. Demonstrates: the ~/.../ Pattern definition, the Matcher.group method, and the \G regex meta-sequence.

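def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/
new File('path/to/file.csv').eachLine { line ->
    def fields = []
    def matcher = regex.matcher(line)
    // \G anchors each match where the previous one ended
    while (matcher.find()) {
        // group(1) is a quoted field, group(2) an unquoted one; the explicit
        // null check keeps an empty quoted field ("") as an empty string
        fields << (matcher.group(1) != null ? matcher.group(1) : matcher.group(2))
    }
    println fields
}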
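The loop can be tried on a single string as well. The sketch below wraps it in a closure named parseLine, which is a hypothetical helper introduced here for illustration, and the sample lines are invented. It uses an explicit null check rather than the Elvis operator, because an empty quoted field ("") is falsy in Groovy and would otherwise fall through to the unmatched second group.

```groovy
def regex = ~/\G(?:^|,)(?:"([^"]*+)"|([^",]*+))/

// parseLine is a hypothetical helper, not part of any library.
def parseLine = { String line ->
    def fields = []
    def matcher = regex.matcher(line)
    while (matcher.find()) {
        // group(1) is set for quoted fields, group(2) for unquoted ones
        fields << (matcher.group(1) != null ? matcher.group(1) : matcher.group(2))
    }
    fields
}

println parseLine('Groovy,"regex, quoted",,2009')
println parseLine('a,"",b')
```

The comma inside the quoted field stays part of that field, and the empty field between two commas comes back as an empty string. Note that this simplified regex does not handle a line that starts with an empty field.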

Finding snapshot dependencies in the pom (simplified) [3]

Data: the file contains blocks with known boundaries (possibly spanning multiple lines). Problem: extract the blocks satisfying some criteria. Solution: read the entire file into a string, construct a regex with capturing parentheses, and apply the regex to the string in a loop. Demonstrates: the File.text property, the list syntax of the Matcher object, naming the captured groups through closure parameters, the global (?x) regex modifier, and the local (?s:...) regex modifier.

def pom = new File('path/to/pom.xml').text
def matcher = pom =~ '''(?x)
    <dependency> \\s*
    <groupId>([^<]+)</groupId> \\s*
    <artifactId>([^<]+)</artifactId> \\s*
    <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''
matcher.each { matched, groupId, artifactId, version ->
    println "$groupId:$artifactId:$version"
}
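Here is a small sketch of the same regex run against an in-memory string, so the behaviour can be checked without a real pom.xml. The POM fragment and its coordinates (com.example:core and junit:junit) are made up for this example.

```groovy
// Hypothetical POM fragment used in place of a real pom.xml file.
def pom = '''
<dependencies>
  <dependency>
    <groupId>com.example</groupId>
    <artifactId>core</artifactId>
    <version>1.0-SNAPSHOT</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.7</version>
  </dependency>
</dependencies>
'''

// (?x) lets the pattern be laid out over several lines; (?s:.*?) skips
// whatever follows the version, such as <scope>, up to </dependency>.
def matcher = pom =~ '''(?x)
    <dependency> \\s*
    <groupId>([^<]+)</groupId> \\s*
    <artifactId>([^<]+)</artifactId> \\s*
    <version>(.+?-SNAPSHOT)</version> (?s:.*?)
    </dependency>
'''

def snapshots = []
matcher.each { matched, groupId, artifactId, version ->
    snapshots << "$groupId:$artifactId:$version".toString()
}
println snapshots
```

Only the first dependency is collected. The second one has a release version, so the version part of the regex fails for it.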

Finding stacktraces in the log

Data: the file contains entries, each of which starts with the same pattern and can span multiple lines. A typical example is a log4j log file:

2009-10-16 15:32:12,157 DEBUG [com.ndpar.web.RequestProcessor] Loading user
2009-10-16 15:32:13,258 ERROR [com.ndpar.web.UserController] id to load is required for loading
java.lang.IllegalArgumentException: id to load is required for loading
        at org.hibernate.event.LoadEvent.<init>(LoadEvent.java:74)
        at org.hibernate.event.LoadEvent.<init>(LoadEvent.java:56)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:839)
        at org.hibernate.impl.SessionImpl.get(SessionImpl.java:835)
        at org.springframework.orm.hibernate3.HibernateTemplate$1.doInHibernate(HibernateTemplate.java:531)
        at org.springframework.orm.hibernate3.HibernateTemplate.doExecute(HibernateTemplate.java:419)
        at org.springframework.orm.hibernate3.HibernateTemplate.executeWithNativeSession(HibernateTemplate.java:374)
        at org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:525)
        at org.springframework.orm.hibernate3.HibernateTemplate.get(HibernateTemplate.java:519)
        at com.ndpar.dao.UserManager.getUser(UserManager.java:90)
        ... 62 more
2009-10-16 15:32:14,659 DEBUG [com.ndpar.jms.MessageListener] Received message:
... multi-line message ...
2009-10-16 15:32:15,169 INFO [com.ndpar.dao.UserManager] User: ...

Problem: find the entries satisfying some criteria. Solution: read the entire file into a string [4], construct a regex with capturing parentheses and a lookahead, split the string into entries, then loop through the result and apply the criteria to each entry. Demonstrates: regex interpolation and the combined global regex modifiers (?s) and (?m).

def log = new File('path/to/your.log').text
def logLineStart = /^\d{4}-\d{2}-\d{2}/
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""
splitter.each { matched, entry ->
    if (entry =~ /(?m)^(?:\t| {8})at/) println entry
}
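The splitting can be checked on a short in-memory log. The sketch below uses a shortened version of the excerpt above and assumes the stack trace lines are indented with a tab, which is one of the two indentations the criteria regex accepts.

```groovy
// Shortened sample log; \t in a triple-single-quoted string is a real tab.
def log = '''2009-10-16 15:32:12,157 DEBUG [com.ndpar.web.RequestProcessor] Loading user
2009-10-16 15:32:13,258 ERROR [com.ndpar.web.UserController] id to load is required for loading
java.lang.IllegalArgumentException: id to load is required for loading
\tat com.ndpar.dao.UserManager.getUser(UserManager.java:90)
\t... 62 more
2009-10-16 15:32:15,169 INFO [com.ndpar.dao.UserManager] User: ...
'''

def logLineStart = /^\d{4}-\d{2}-\d{2}/
// Each entry runs from one date-stamped line up to the next one (or the end).
def splitter = log =~ """(?xms)
    ( ${logLineStart} .*?)
    (?= ${logLineStart} | \\Z)
"""

def entries = []
def withStackTrace = []
splitter.each { matched, entry ->
    entries << entry
    // Keep only entries that contain an indented "at ..." stack frame line
    if (entry =~ /(?m)^(?:\t| {8})at/) withStackTrace << entry
}
println "${entries.size()} entries, ${withStackTrace.size()} with a stack trace"
```

The log is split into three entries, and only the ERROR entry, together with its exception and stack frames, passes the stack trace check.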

Replacing text in the file

Use a Groovy one-liner to perform the replacement. Here is Tim's example in Groovy:

$ groovy -p -i -e '(line =~ /1\.6/).replaceAll("2.0-alpha-1-SNAPSHOT")' `find . -name pom.xml`
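Inside a script, the same replacement can be done with String.replaceAll. The following is only a sketch on an in-memory string; the POM fragment is invented, and the 1x6 version is there purely to show why the dot is escaped.

```groovy
// Hypothetical POM fragment; a real script would read it with new File(...).text.
def pom = '''<parent>
  <version>1.6</version>
</parent>
<dependency>
  <version>1x6</version>
</dependency>
'''

// The dot is escaped so that only the literal "1.6" is replaced, not "1x6".
def updated = pom.replaceAll(/1\.6/, '2.0-alpha-1-SNAPSHOT')
println updated
```

To change a file in place from a script, the result could be assigned back through the File.text property, at the cost of holding the whole file in memory.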


Resources

• Groovy regexes
• Groovy one-liners
• Using String.replaceAll method

Footnotes

1. This example is for demonstration purposes only. In a real program you would just use the Properties.load method.
2. The regex is simplified. If you want the real one, take a look at Jeffrey Friedl's example.
3. Again, in reality you would find snapshots using the mvn dependency:resolve | grep SNAPSHOT command.
4. This approach won't work for big files. Take a look at this script for a practical solution.
