Friday, September 25, 2009

Building regular expressions in Groovy

Thanks to its compact syntax, regular expressions in Groovy are more readable than in Java. Here is what Jeffrey Friedl's URL-matching example would look like in Groovy:

def subDomain  = '(?i:[a-z0-9]|[a-z0-9][-a-z0-9]*[a-z0-9])' // simple regex in single quotes
def topDomains = """
(?x-i : com \\b # you can put whitespace and comments
| edu \\b # inside regex in eXtended mode
| biz \\b
| in(?:t|fo) \\b # but you have to escape
| mil \\b # backslashes in multiline strings
| net \\b
| org \\b
| [a-z][a-z] \\b
)"""

def hostname = /(?:${subDomain}\.)+${topDomains}/ // variable substitution in slashy strings; + allows several subdomains

def NOT_IN = /;\"'<>()\[\]{}\s\x7F-\xFF/ // backslashes need no escaping in slashy strings
def NOT_END = /!.,?/
def ANYWHERE = /[^${NOT_IN}${NOT_END}]/
def EMBEDDED = /[$NOT_END]/ // you can omit {} around a variable name

def urlPath = "/$ANYWHERE*($EMBEDDED+$ANYWHERE+)*"

def url =
"""(?x:
\\b

# match the hostname part
(
(?: ftp | http s? ): // [-\\w]+(\\.\\w[-\\w]*)+
|
$hostname
)

# allow optional port
(?: :\\d+ )?

# rest of url is optional, and begins with /
(?: $urlPath )?
)"""

assert 'http://www.google.com/search?rls=en&q=regex&ie=UTF-8&oe=UTF-8' ==~ url
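
Note that ==~ succeeds only when the entire string matches the pattern. To pull URLs out of running text you would reach for =~ or findAll instead. The sketch below illustrates the difference; simpleUrl is a hypothetical, heavily simplified stand-in for the full url pattern (it is not Friedl's regex), and the sample text is made up.

```groovy
// Hypothetical, simplified stand-in for the full `url` pattern (http/https only).
def simpleUrl = /\bhttps?:\/\/[-\w]+(?:\.\w[-\w]*)+(?:\/[^\s<>"]*)?/

def text = 'See http://example.com/docs and https://www.example.org/search?q=regex for details'

// ==~ demands that the entire string match, so it fails on running text
assert !(text ==~ simpleUrl)

// =~ builds a java.util.regex.Matcher that can search inside the string
def matcher = text =~ simpleUrl
assert matcher.find()
assert matcher.group() == 'http://example.com/docs'

// findAll (a Groovy GDK method on String) collects every match
def urls = text.findAll(simpleUrl)
assert urls == ['http://example.com/docs', 'https://www.example.org/search?q=regex']
```

The same three calls should work with the composed url pattern from the listing, since a GString is accepted wherever a regex string is expected.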

As you can see, Groovy offers several string literal styles (single-quoted, triple-quoted, and slashy), and for each subexpression you can choose whichever is most expressive.
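
To make the trade-offs concrete, here is a small sketch that writes one toy pattern (digits, a dot, two digits) in each style. The variable names and the pattern itself are illustrative assumptions, not part of Friedl's example.

```groovy
// Single quotes: plain string, every regex backslash must be doubled
def single = '\\d+\\.\\d{2}'

// Slashy string: backslashes are taken literally, so the regex reads as-is
def slashy = /\d+\.\d{2}/

// Triple quotes plus (?x): room for whitespace and comments,
// at the cost of doubling the backslashes again
def extended = '''(?x)
    \\d+     # integer part
    \\.      # decimal point
    \\d{2}   # exactly two decimals
'''

// The first two literals contain exactly the same characters
assert single == slashy

// All three accept and reject the same inputs
assert '19.99' ==~ single
assert '19.99' ==~ slashy
assert '19.99' ==~ extended
assert !('19.9' ==~ extended)

// Slashy strings also interpolate, which is what makes composition cheap
def range = /${slashy}-${slashy}/
assert '1.00-19.99' ==~ range
```

A reasonable rule of thumb: slashy strings for short one-liners full of backslashes, triple-quoted strings in extended mode for anything long enough to deserve comments.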

Resources

• Martin Fowler on composed regexes
• Pragmatic Dave on regexes in Ruby
• Feature request to make regexes even groovier
• Mastering Regular Expressions — best regex book
• Groovy Pattern and Matcher classes
