Up to Speed with Regexes

G. Wade Johnson

You've heard the one about...

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Some random crank

The originator of this particular version of the joke was actually Jamie Zawinski. The quote was exhaustively researched by Jeffrey Friedl. Follow the link for more than you wanted to know about the subject.

What are Regular Expressions?

There are a lot of variations of regular expression syntax and functionality. Be sure you know which set you are using to avoid over/under matching.

What are they good for?

Regular expressions are really good at fuzzy descriptions of strings. This means they can be used for each of these specific purposes.

... Or More succinctly

(Recogniz|Pars|Extract|Transform|Validat)ing strings

You did see that coming, right?

Learning/Testing Tools

Hopefully, I'll get some time later to demonstrate rxrx from Perl. Its regular expressions are a little different from the ones you will be using, but its ability to show how a match happens is pretty amazing.

Terms

You can think of these as the fundamental terms in specifying a regular expression. We use these by themselves or with various operators to construct our regular expressions. Although it seems obvious, all but the last set match one character.

The anchors don't actually match a character; instead, they match a location in a string. The ^ matches at the beginning of the string or immediately after a newline character. The $ matches at the end of a string or immediately before a newline character.
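As a rough illustration of anchors matching positions, here is a small sketch in Python. Python is an assumption here, since the notes do not name the language in use, and in Python's flavor the newline behavior of ^ and $ has to be switched on with re.MULTILINE.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
print(re.findall(r"^\w+", text))                 # ['first']

# With re.MULTILINE, ^ also matches immediately after each newline...
print(re.findall(r"^\w+", text, re.MULTILINE))   # ['first', 'second']

# ...and $ also matches immediately before each newline.
print(re.findall(r"\w+$", text, re.MULTILINE))   # ['line', 'line']

# An anchor consumes no characters: the match is empty, at position 0.
m = re.search(r"^", text)
print(repr(m.group()), m.start())                # '' 0
```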

Basic Operations

The concatenation of a series of terms matches the first term, followed immediately by the second, etc.

Alternation matches either the expression to its left or the expression to its right. It will attempt a complete match on the left one before attempting the right one. If the first matches, the second won't be tried. This means it finds the first match, not the best match.
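A minimal Python sketch of "first match, not best match" (the sample strings are made up for illustration):

```python
import re

# The left alternative succeeds, so the right one is never tried,
# even though it would have matched more of the text.
m = re.search(r"Jan|January", "January 5")
print(m.group())   # Jan

# Putting the longer alternative first changes the result.
m2 = re.search(r"January|Jan", "January 5")
print(m2.group())  # January
```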

What can we do with just this?

When describing these, be very precise. Understanding exactly what you are matching is the key to effective regular expressions.

A little more Power: Character Classes

The . is quite powerful, but kind of a blunt tool. Character classes give much more precision. The first three are short, because they are very useful when matching or manipulating text. Add in the general form, and you can do precision matching of most kinds of text.

The general character class matches any character inside the square brackets. Two characters separated by a dash defines a range of characters. Metacharacters like . or * lose their special abilities inside the square brackets. The escape code character classes effectively expand to their full value inside the class. This allows you to combine and extend the built-in character classes easily.

The uppercase versions of the escape code character classes are just the inverses of the lowercase versions. So \D is any non-digit character, etc. To make the negated form of the general character class, use ^ as the first character. That is the only position where the ^ is special.

The only characters that are special in a character class are the ^ at the beginning, the - in a range, the ] that closes the class, and the \ which negates the effects of one of the special character class characters.
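The rules above can be tried out with a few one-liners. This is a sketch in Python, with sample strings invented for illustration:

```python
import re

# General class: any one of the listed characters.
print(re.findall(r"[aeiou]", "regular"))         # ['e', 'u', 'a']

# Ranges: hex-like runs of a-f and 0-9.
print(re.findall(r"[a-f0-9]+", "cafe 42 xyz"))   # ['cafe', '42']

# Negated class and the uppercase escape say the same thing here.
print(re.findall(r"[^0-9]+", "a1b22c"))          # ['a', 'b', 'c']
print(re.findall(r"\D+", "a1b22c"))              # ['a', 'b', 'c']

# Metacharacters are ordinary characters inside the brackets.
print(re.findall(r"[.*]", "a.b*c"))              # ['.', '*']

# An escape class can be extended: digits plus underscore.
print(re.findall(r"[\d_]+", "id_42 x"))          # ['_42']
```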

More Anchors

The \A anchor only matches at the beginning of the string. It's like the ^ without the special behavior around newlines. For validation, you almost always want \A instead of ^.

The next two anchors match at the end of the string. The only difference between them is that the \Z will match before a newline at the end of a string.
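Here is a sketch of why string anchors matter for validation, written in Python as an assumed language. Be aware that Python's flavor differs from the one described in these notes: ^ and $ only get their newline behavior with re.MULTILINE, and Python's \Z matches only at the very end of the string (it does not allow a trailing newline).

```python
import re

data = "not a number\n12345"

line_anchored = re.compile(r"^\d+$", re.MULTILINE)
string_anchored = re.compile(r"\A\d+\Z")

# The line anchors are satisfied by the second line alone.
print(bool(line_anchored.search(data)))        # True
# The string anchors insist that the whole string is digits.
print(bool(string_anchored.search(data)))      # False
print(bool(string_anchored.search("12345")))   # True

# In Python, $ tolerates one trailing newline; \Z does not.
print(bool(re.search(r"\A\d+$", "12345\n")))   # True
print(bool(re.search(r"\A\d+\Z", "12345\n")))  # False
```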

The word anchors are a bit different. The \b matches at the boundary of a word. This can be at the beginning or end of a string that borders a word character, or between a word character and a non-word character. The \B matches wherever there is no word boundary: either between two word characters, between two non-word characters, or between a non-word character and the beginning or end of the string.
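A short Python sketch of the word anchors, reporting where each pattern starts matching in a made-up sample string:

```python
import re

text = "cat concat cats category"

def starts(pattern):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r"\bcat\b"))  # [0]          only the standalone word
print(starts(r"\bcat"))    # [0, 11, 16]  words that begin with cat
print(starts(r"cat\b"))    # [0, 7]       words that end with cat
print(starts(r"\Bcat"))    # [7]          cat not at the start of a word
```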

Arbitrary Quantifier

When we need a quantifier, the normal *, ?, and + are also blunt tools. Sometimes you need more precision and these quantifiers give you that.
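The slide's list of quantifiers is not reproduced in these notes. Assuming the usual brace forms ({n}, {n,} and {n,m}), a Python sketch looks like this:

```python
import re

# {n}: exactly n repetitions.
print(bool(re.fullmatch(r"\d{5}", "12345")))          # True
print(bool(re.fullmatch(r"\d{5}", "1234")))           # False

# {n,m}: between n and m repetitions (whole words of 2 or 3 characters).
print(re.findall(r"\b\w{2,3}\b", "a an the word"))    # ['an', 'the']

# {n,}: n or more repetitions.
print(re.findall(r"a{2,}", "a aa aaaa"))              # ['aa', 'aaaa']
```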

Grouping/Capturing

The next power tool in your regular expressions is grouping, with or without capturing. The quantifiers only work on the preceding term. In many cases that would be a single character. Using grouping, you can apply a quantifier to an entire sub-expression. Grouping is also useful for containing alternation, as seen in my joke earlier.

The simple parentheses also capture the matched text for later use. You'll most often find that useful for regular expressions used in substitutions, but it can also be very helpful in certain matches. You use the captured matches with the \{number} notation, where the number is that of the appropriate capturing group. When the capturing groups are separate, it's easy to tell which group matches which number. Things get a little messier when parentheses nest. The key is to count the left parentheses: group n begins at the nth left parenthesis.
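A sketch of grouping and capturing in Python, where captured groups are written \1, \2 and so on. The sample strings are invented for illustration.

```python
import re

# A quantifier applies to the preceding term; grouping widens that term.
print(bool(re.fullmatch(r"ab+", "ababab")))     # False: a, then one or more b
print(bool(re.fullmatch(r"(ab)+", "ababab")))   # True: one or more "ab"

# Nested groups are numbered by counting left parentheses.
m = re.search(r"((\d{4})-(\d{2}))-(\d{2})", "on 2024-03-15")
print(m.groups())   # ('2024-03', '2024', '03', '15')

# A captured group used later in the same match: a doubled word.
dup = re.search(r"\b(\w+) \1\b", "it was the the best")
print(dup.group())  # the the

# Captured groups used in a substitution.
print(re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20"))   # 20-10
```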

More examples

Let's go over some more examples, based on some of the new things we've learned.

Be Aware

If there are two substrings that match a regular expression, the regular expression actually only matches the one that starts the furthest to the left. Each of the quantifiers we have seen will match as much as possible before testing the next term in the expression. If the next term is not a match, the quantifier will backtrack, giving up one character and retrying the match. This will repeat until the whole expression matches or the regular expression fails.
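Both rules can be seen in a small Python sketch (sample strings are invented):

```python
import re

# Leftmost wins: the earlier, shorter run of digits is the match.
print(re.search(r"\d+", "abc 12 and 3456").group())   # 12

# Greedy: .* runs to the end of the string, then backtracks one
# character at a time until the closing quote can match.
s = 'say "hi" and "bye" now'
print(re.search(r'".*"', s).group())      # "hi" and "bye"

# A more precise term leaves the quantifier nothing extra to grab.
print(re.search(r'"[^"]*"', s).group())   # "hi"
```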

Anchors match a position in the string, not an actual character. Most people find it easier to think of an anchor as matching the location before or after a character.

Character classes match a character if they match at all. Sometimes people get sloppy thinking about negated character classes and talk about matching if the appropriate character is not there. This is not correct. A negated character class matches a character that is not in the list.
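A classic way to see this, sketched in Python: the pattern q[^u] does not mean "a q not followed by u". It means "a q followed by some character that is not u".

```python
import re

# Nothing follows the q, so the negated class has no character to match.
print(re.search(r"q[^u]", "Iraq"))                 # None

# Here the class matches (and consumes) the space after the q.
print(repr(re.search(r"q[^u]", "Iraq is").group()))   # 'q '

# A q followed by u fails, as expected.
print(re.search(r"q[^u]", "quit"))                 # None
```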

Tips

If you know something for certain about a string, match it. Anchors and runs of known characters can help make certain you are matching what you think you are matching. Try not to be over-general in what you are trying to match. That may result in matching something that you did not intend to.

Many people use * when one of the other quantifiers would be more precise. Keep in mind whether matching nothing would be acceptable. Also remember that * is probably not right for an optional item.
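To make the quantifier advice concrete, here is a Python sketch comparing ? and * for an optional minus sign. The pattern names are hypothetical.

```python
import re

signed_precise = re.compile(r"-?\d+")   # at most one minus sign
signed_sloppy = re.compile(r"-*\d+")    # any number of minus signs

for s in ["42", "-42", "---42"]:
    print(s,
          bool(signed_precise.fullmatch(s)),
          bool(signed_sloppy.fullmatch(s)))
# 42 True True
# -42 True True
# ---42 False True

# Is matching nothing acceptable? \d* accepts the empty string; \d+ does not.
print(bool(re.fullmatch(r"\d*", "")))   # True
print(bool(re.fullmatch(r"\d+", "")))   # False
```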

Traps

Remember that greedy quantifiers will match everything they can before the next term is allowed to try. If the rest of the regex can't match, then the regex engages in backtracking. The quantifier is forced to give up one match and then allow the rest of the regex to try again. This may repeat until the quantifier can no longer match.

Likewise, overly general character classes can result in matching more than you expect.

Remember that some quantifiers may match nothing, which can result in unexpected behavior.

Alternation often results in surprises. The regular expression will try to match everything to the left of the | before attempting what is to the right.

Backtracking and Longest Match

Remember that greedy quantifiers will match everything they can before the next term is allowed to try. This can result in matching more than you expect. If the input string is long, it can also result in very slow matches.

Alternation

Alternation often results in surprises. The regular expression will try to match everything to the left of the | before attempting what is to the right. If the two expressions are long or would have overlapped this can result in a lot of extra work.

So, if the left side of the alternation would match and the right side would be a better match, the left one is still the only one matched. This is especially surprising if the item on the left is a subset of the one on the right.

The alternation operator also has the lowest precedence of any operation in a regular expression. Be careful about what you think is on each side.
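The precedence trap is easy to reproduce. In this Python sketch (patterns and names are made up), the anchors end up belonging to only one side of the | unless a group contains the alternation:

```python
import re

# Reads as: (^cat) OR (dog$)
loose = re.compile(r"^cat|dog$")
# Reads as: the whole string is cat or dog
strict = re.compile(r"^(?:cat|dog)$")   # (?:...) groups without capturing

print(loose.search("catalog").group())   # cat
print(loose.search("hotdog").group())    # dog
print(strict.search("catalog"))          # None
print(strict.search("dog").group())      # dog
```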

Matching Too Much

This is one of the classic examples of why you should not parse XML or HTML with a regular expression. It's much harder to get right than people think. The first expression may match multiple paragraphs. The second only matches a paragraph that contains no other tags.
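The two expressions from the slide are not reproduced in these notes. The Python sketch below uses assumed stand-ins that behave the way the text describes:

```python
import re

html = "<p>one</p><p>two <b>bold</b></p>"

# Assumed "first" pattern: greedy .* runs through to the last </p>.
print(re.findall(r"<p>.*</p>", html))
# ['<p>one</p><p>two <b>bold</b></p>']

# Assumed "second" pattern: [^<]* cannot cross any tag, so a paragraph
# containing <b> is silently skipped.
print(re.findall(r"<p>[^<]*</p>", html))
# ['<p>one</p>']
```

Neither result is what you would want from a real parser, which is the point of the warning.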

Matching Nothing

Remember a* and a? are still successful when they match nothing.

The example is based on actual code that I saw used in a real system.
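The original example is not included in these notes. As a hypothetical stand-in (the function names are invented), a Python check of this kind shows how a pattern that can match nothing ends up accepting everything:

```python
import re

def looks_numeric_buggy(s):
    # \d* happily matches zero digits at position 0 of any string.
    return re.search(r"\d*", s) is not None

def looks_numeric(s):
    # Require at least one digit and nothing else.
    return re.fullmatch(r"\d+", s) is not None

print(looks_numeric_buggy("hello"))   # True
print(looks_numeric("hello"))         # False
print(looks_numeric("12345"))         # True

# The "successful" match in the buggy version is empty.
print(repr(re.search(r"\d*", "hello").group()))   # ''
```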

Things to Consider

I said regular expressions are a language. Therefore, you should think of each expression you write as a little program or function. As such, you should think about writing them and maintaining them much as you would any other code.

Readability

Readable regular expressions are easier to maintain. Try to make each expression's meaning clear. Don't use unnecessarily clever tricks. Take pity on the person who will maintain this. If you can break an expression into smaller expressions and match them separately, or build a larger expression out of smaller ones, try to do that.
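Two ways to do this, sketched in Python with hypothetical names: build the expression from named pieces, or use the re.VERBOSE flag so the pattern can carry whitespace and comments.

```python
import re

# Build a larger expression out of small, named ones.
YEAR = r"\d{4}"
MONTH = r"\d{2}"
DAY = r"\d{2}"
ISO_DATE = re.compile(rf"\A({YEAR})-({MONTH})-({DAY})\Z")

# The same pattern, laid out with re.VERBOSE: whitespace in the
# pattern is ignored and # starts a comment.
ISO_DATE_VERBOSE = re.compile(r"""
    \A
    (\d{4}) -   # year, then a literal dash
    (\d{2}) -   # month, then a literal dash
    (\d{2})     # day
    \Z
""", re.VERBOSE)

print(ISO_DATE.match("2024-03-15").groups())          # ('2024', '03', '15')
print(ISO_DATE_VERBOSE.match("2024-03-15").groups())  # ('2024', '03', '15')
```

This only checks the shape of the string; it makes no attempt to reject impossible dates.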

Efficiency

In the normal case, a regular expression will keep trying to match in as many places in the string as possible as long as it has not succeeded. Anchors are powerful because they can tell the regular expression to give up early instead of trying to continue. Explicit strings have a similar effect. If there is an explicit string at the beginning of the regular expression, the engine will normally scan the text first looking for that string and may give up without matching anything else.

Nested quantifiers can be extremely slow due to exponential time complexity caused by multiple attempts to backtrack over the string.
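A small timing sketch, assuming a backtracking engine such as Python's re module. The helper name and the patterns are made up, and the actual timings depend on the machine, so none are shown. The input lengths are kept short on purpose; each extra character roughly doubles the work for the nested pattern.

```python
import re
import time

def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = re.match(pattern, text)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    text = "a" * n   # no trailing b, so both patterns must fail
    _, nested = time_match(r"(a+)+b", text)   # quantifier inside a quantifier
    _, flat = time_match(r"a+b", text)        # matches the same strings
    print(f"n={n}: nested {nested:.6f}s, flat {flat:.6f}s")
```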

If you have an unanchored quantified subexpression at the front or back of your regular expression, consider whether it does anything at all for you.

Character Class Surprises

Character classes have a few surprises built in. Remember that the escape sequences effectively expand inside a character class. So the first expression is basically a wordier version of \s. If the dash character is in a position that could not be part of a range, it has no special meaning. If it is in the right place, it has to be part of a range. The closing square bracket is not special as the first character of a character class (or the first after the ^ for a negated character class).

As I said earlier, the metacharacters have no special meaning in a character class.
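The slide's expressions are not reproduced here, so the patterns in this Python sketch are assumed stand-ins for each surprise:

```python
import re

# \s expands inside the class, so adding a space changes nothing.
print(re.findall(r"[\s ]", "a b\tc") == re.findall(r"\s", "a b\tc"))   # True

# A dash at the start (or end) cannot form a range: it is literal.
print(re.findall(r"[-az]", "a-z m"))   # ['a', '-', 'z']
# Between two characters it must be a range.
print(re.findall(r"[a-z]", "a-z m"))   # ['a', 'z', 'm']

# A ] placed first is literal, also right after the ^ of a negated class.
print(re.findall(r"[]x]", "x]y"))      # ['x', ']']
print(re.findall(r"[^]x]", "x]y"))     # ['y']

# Metacharacters are just characters here.
print(re.findall(r"[.+*]", "1.5+2*3")) # ['.', '+', '*']
```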

More Advanced Features

Examples

These all match different variations of 1-3 digits. Try to figure out what each one does. And be very precise when you try to describe each. The differences are sometimes quite subtle.
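The slide's variations are not included in these notes. As an assumed set of stand-ins, this Python sketch shows how much the "same" 1-3 digit pattern can differ depending on what surrounds it:

```python
import re

text = "12345 42"

# Bare quantifier: chops a long run of digits into pieces.
print(re.findall(r"\d{1,3}", text))        # ['123', '45', '42']

# Spelled out with optional digits: same behavior as above.
print(re.findall(r"\d\d?\d?", text))       # ['123', '45', '42']

# Word boundaries: only a standalone run of 1-3 digits.
print(re.findall(r"\b\d{1,3}\b", text))    # ['42']

# Whole-string match: the entire input must be 1-3 digits.
print(bool(re.fullmatch(r"\d{1,3}", "42")))      # True
print(bool(re.fullmatch(r"\d{1,3}", "12345")))   # False
```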

The Book

Mastering Regular Expressions by Jeffrey Friedl

Available from O'Reilly and on Safari

This is probably the best reference I know of for really understanding regular expressions. If you read and really understand this book, you will know more than you need for most regular expression problems.