Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
The originator of this particular version of the joke was actually Jamie Zawinski. The quote was exhaustively researched by Jeffrey Friedl. Follow the link for more than you wanted to know about the subject.
There are a lot of variations of regular expression syntax and functionality. Be sure you know which set you are using to avoid over/under matching.
Regular expressions are really good at fuzzy descriptions of strings. This means they can be used for each of these specific purposes.
(Recogniz|Pars|Extract|Transform|Validat)ing strings
You did see that coming, right?
Hopefully, I'll get some time later to demonstrate rxrx from Perl. Its regular expressions are a little different from the ones you will be using, but its ability to show how a match happens is pretty amazing.
a, %, 3, etc.
\0, \n, \r, \t, \f, etc.
.
\., \*, etc.
^, $
You can think of these as the fundamental terms in specifying a regular expression. We use these by themselves or with various operators to construct our regular expressions. Although it seems obvious, all but the last set match one character.
The anchors don't actually match a character; instead they match a location in a string. The ^ matches at the beginning of the string or immediately after a newline character. The $ matches at the end of a string or immediately before a newline character.
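Here is a minimal Python sketch of anchors matching positions rather than characters. One assumption to flag: Python's `re` module only gives `^` and `$` the per-line behavior described above when `re.MULTILINE` is passed, and the sample text is made up.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ matches at the start of the string and after each newline.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

# $ matches just before each newline and at the very end of the string.
line_ends = [m.start() for m in re.finditer(r"$", text, re.MULTILINE)]

# Every anchor match has zero width: it is a position, not a character.
widths = {m.end() - m.start() for m in re.finditer(r"^|$", text, re.MULTILINE)}

print(line_starts, line_ends, widths)
```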
abc
abc|def
a+, a*, a?
The concatenation of a series of terms matches the first term, followed immediately by the second, etc.
Alternation matches either the expression to its left or the expression to its right. It will attempt a complete match on the left one before attempting the right one. If the first matches, the second won't be tried. This means it finds the first match, not the best match.
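A tiny Python sketch of "first match, not best match". The strings are made up for illustration.

```python
import re

# At each starting position the left alternative is tried first.
# "a" succeeds immediately, so "ab" is never attempted.
left_first = re.search(r"a|ab", "ab").group()

# Reordering the alternatives changes what is matched.
longer_first = re.search(r"ab|a", "ab").group()

print(left_first, longer_first)
```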
/wade/
/cat|dog/
/"..."/
/^.+$/
/^.*$/
/^.*/ versus /.*$/
When describing these, be very precise. Understanding exactly what you are matching is the key to effective regular expressions.
\d, \w, \s
[aeiou], [0-9], [a-zA-Z0-9_]
\D, \W, \S
[^aeiou], [^0-9], [^a-zA-Z0-9_]
The . is quite powerful, but kind of a blunt tool. Character classes give much more precision. The first three are short, because they are very useful when matching or manipulating text. Add in the general form, and you can do precision matching of most kinds of text.
The general character class matches any character inside the square brackets. Two characters separated by a dash define a range of characters. Metacharacters like . or * lose their special abilities inside the square brackets. The escape code character classes effectively expand to their full value inside the class. This allows you to combine and extend the built-in character classes easily.
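The points above can be sketched in a few lines of Python. The sample strings are my own, and Python's `re` is only one dialect, so check yours.

```python
import re

# Any one of the listed characters.
vowels = re.findall(r"[aeiou]", "regular")

# A range of characters, repeated.
numbers = re.findall(r"[0-9]+", "room 42, floor 3")

# . and * are plain characters inside the brackets.
literal_meta = re.findall(r"[.*]", "a.b*c")

# \w expands inside the class, so it can be extended with a dash.
hyphenated = re.findall(r"[\w-]+", "well-known fact")

print(vowels, numbers, literal_meta, hyphenated)
```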
The uppercase version of the escape code character classes is just the inverse of the lowercase versions. So \D is any non-digit character, etc.
To make the negated form of the general character class, use ^ as the first character. That is the only position where the character is special.
The only characters that are special in a character class are the ^ at the beginning, the - in a range, the ] that closes the class, and the \ which escapes one of the special character class characters.
\A - Beginning of string
\z, \Z - End of string
\b, \B - Word/non-word boundary
The \A anchor only matches at the beginning of the string. It's like the ^ without the special behavior around newlines. For validation, you almost always want \A instead of ^.
The next two anchors match at the end of the string. The only difference between them is that the \Z will match before a newline at the end of a string.
The word anchors are a bit different. The \b matches at the boundary of a word. This can be at the beginning or end of a string that borders a word character or between a word character and a non-word character. The \B matches where there is no word boundary: either between two word characters, between two non-word characters, or between a non-word character and the beginning or end of the string.
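A short Python sketch of the word anchors, using made-up sample text. The last line shows a boundary that can never be satisfied, because both neighbors are word characters.

```python
import re

# \bcat\b only matches "cat" standing alone as a word.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", "cat concat cats")]

# \Bcat\B only matches "cat" with word characters on both sides.
inside_word = re.findall(r"\Bcat\B", "cat concatenate cats")

# There is no word boundary between "d" and "b", so this cannot match.
never = re.search(r"dead\bbeef", "deadbeef")

print(whole_word, inside_word, never)
```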
/a{n}/ - match n times
/a{n,}/ - match at least n times
/a{n,m}/ - match at least n but no more than m times
When we need a quantifier, the normal *, ?, and + are also blunt tools. Sometimes you need more precision, and these quantifiers give you that.
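A quick Python check of the three counted forms. I use `re.fullmatch` here so that the whole string has to fit the count; that choice is mine, not part of the notation.

```python
import re

# Exactly three.
exactly = re.fullmatch(r"a{3}", "aaa") is not None

# Two or more.
at_least = [bool(re.fullmatch(r"a{2,}", s)) for s in ("a", "aa", "aaaaa")]

# Between two and three, inclusive.
bounded = [bool(re.fullmatch(r"a{2,3}", s)) for s in ("a", "aa", "aaa", "aaaa")]

print(exactly, at_least, bounded)
```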
(abc) - capture and group
(?:abc) - just group
\1, \2, \3, etc.
The next power tool in your regular expressions is grouping, with or without capturing. The quantifiers only work on the preceding term. In many cases that would be a single character. Using grouping, you can apply a quantifier to an entire sub-expression. Grouping is also useful for containing alternation, as seen in my joke earlier.
The simple parentheses also capture the matched text for later use. You'll most often find that useful for regular expressions used in substitutions, but it can also be very helpful in certain matches. You use the captured matches with the \{number} notation, where the number is that of the appropriate matching group. When the capturing groups are separate, it's easy to tell which group matches which number. Things get a little messier when parentheses nest. The key is to remember to count the left parentheses.
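Here is a rough Python sketch of capturing, backreferences, and group numbering. The input strings are invented for illustration.

```python
import re

# \1 must repeat whatever group 1 captured: a doubled word.
doubled = re.search(r"(\w+)\s+\1", "it was was fine")

# Nested groups are numbered by the position of their left parenthesis.
nested = re.match(r"((\d{3})-(\d{4}))", "555-1234").groups()

# (?:...) groups for the quantifier without taking a number.
grouped_only = re.match(r"(?:ab)+(c)", "ababc").groups()

# Captures are most often used in substitutions.
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")

print(doubled.group(), nested, grouped_only, swapped)
```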
/(\d{3})0\d{3}/
/(\d{3})0\1/
/(\w+)\s+\1/
/dead\bbeef/
/^\d+$/
"Hello\n12345\nWorld"
/\A\d+\z/
/\A\d+\Z/
Let's go over some more examples, based on some of the new things we've learned.
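The anchor examples can be tried in Python, with two dialect differences that I am assuming you will want spelled out: Python needs `re.MULTILINE` for the per-line `^` and `$`, and Python's `\Z` is the strict end of string (what the list above writes as `\z`). Python's plain `$` without `re.MULTILINE` is the closest thing to the lenient `\Z` above.

```python
import re

text = "Hello\n12345\nWorld"

# Line-mode ^ and $ accept any all-digit line inside the string.
line_match = re.search(r"^\d+$", text, re.MULTILINE)

# Whole-string anchors reject it, because the string is not all digits.
strict_match = re.search(r"\A\d+\Z", text)

# A single trailing newline: Python's \Z refuses it, Python's plain $ allows it.
strict_newline = re.search(r"\A\d+\Z", "12345\n")
lenient_newline = re.search(r"\A\d+$", "12345\n")

print(line_match, strict_match, strict_newline, lenient_newline)
```

This is why validation code should use the whole-string anchors: the line-mode version happily accepts input with extra lines around the digits.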
If there are two substrings that match a regular expression, the regular expression actually only matches the one that starts the furthest to the left. Each of the quantifiers we have seen will match as much as possible before testing the next term in the expression. If the next term is not a match, the quantifier will backtrack, giving up one character and retrying the match. This will repeat until the whole expression matches or the regular expression fails.
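A small Python sketch of the leftmost rule and of a greedy quantifier giving characters back. The sample strings are my own.

```python
import re

# Two candidate substrings: the one that starts furthest left wins.
leftmost = re.search(r"\d+", "abc 12 and 34567").group()

# .* first grabs the whole string, then backtracks one character at a
# time until \d+ can match. It only has to give back a single digit.
greedy = re.search(r".*(\d+)", "id 12345")

print(leftmost, greedy.group(1))
```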
Anchors match a position in the string, not an actual character. Most people find it easier to think of an anchor as matching the location before or after a character.
Character classes match a character if they match at all. Sometimes people get sloppy thinking about negated character classes and talk about matching if the appropriate character is not there. This is not correct. A negated character class matches a character that is not in the list.
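To make the negated class point concrete, here is a minimal Python sketch with invented inputs. A negated class still has to consume one character.

```python
import re

# Nothing follows the "q", so there is no character for [^u] to match.
no_char = re.search(r"q[^u]", "Iraq")

# Any character other than "u" will do.
with_char = re.search(r"q[^u]", "Iraqi").group()

# That includes characters you may not have had in mind, like a newline.
newline_counts = re.search(r"q[^u]", "Iraq\n").group()

print(no_char, with_char, repr(newline_counts))
```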
* versus + or ?
If you know something for certain about a string, match it. Anchors and runs of known characters can help make certain you are matching what you think you are matching. Try not to be over-general in what you are trying to match. That may result in matching something that you did not intend to.
Many people use * when one of the other quantifiers would be more precise. Keep in mind whether matching nothing would be acceptable. Also remember that * is probably not right for an optional item.
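A quick Python comparison of the three quantifiers. The `colou?r` example is mine, chosen only to show an optional item.

```python
import re

# * is satisfied by nothing at all; + is not.
empty_star = re.fullmatch(r"\d*", "") is not None
empty_plus = re.fullmatch(r"\d+", "") is not None

words = ("color", "colour", "colouur")

# ? means zero or one: the right tool for an optional item.
optional = [bool(re.fullmatch(r"colou?r", w)) for w in words]

# * also accepts any number of repeats, which is rarely what was meant.
star_instead = [bool(re.fullmatch(r"colou*r", w)) for w in words]

print(empty_star, empty_plus, optional, star_instead)
```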
Remember that greedy quantifiers will match everything they can before the next term is allowed to try. If the rest of the regex can't match, then the regex engages in backtracking. The quantifier is forced to give up one match and then allow the rest of the regex to try again. This may repeat until the quantifier can no longer match.
Likewise, overly general character classes can result in matching more than you expect.
Remember that some quantifiers may match nothing, which can result in unexpected behavior.
Alternation often results in surprises. The regular expression will try to match everything to the left of the | before attempting what is to the right.
Remember that greedy quantifiers will match everything they can before the next term is allowed to try. This can result in matching more than you expect. If the input string is long, it can also result in very slow matches.
/c\w*\.com|c\w*\.con/
/cat|dog|rabbit|goat|cattle|deer/
/^cat|dog$/
Alternation often results in surprises. The regular expression will try to match everything to the left of the | before attempting what is to the right. If the two expressions are long or would have overlapped, this can result in a lot of extra work.
So, if the left side of the alternation would match and the right side would be a better match, the left one is still the only one matched. This is especially surprising if the item on the left is a subset of the one on the right.
The alternation operator also has the lowest precedence of any operation in a regular expression. Be careful about what you think is on each side.
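The precedence point is easy to see in Python. The word list is made up, and the grouped pattern is my guess at what the author of `/^cat|dog$/` probably intended.

```python
import re

words = ["cat", "dog", "catalog", "hotdog", "bobcat"]

# Reads as (^cat) or (dog$): starts with "cat" OR ends with "dog".
loose = [w for w in words if re.search(r"^cat|dog$", w)]

# Grouping the alternation makes both anchors apply to both alternatives.
exact = [w for w in words if re.search(r"^(?:cat|dog)$", w)]

print(loose)
print(exact)
```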
/<p>.+<\/p>/
/<p>[^<]+<\/p>/
The first is one of the classic examples of why you should not parse XML or HTML with a regular expression. It's much harder to get right than people think. The first may match multiple paragraphs. The second only matches a paragraph that contains no other tags.
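Both behaviors show up on a couple of made-up HTML fragments. This is a Python sketch of the problem, not a recommendation to parse HTML this way.

```python
import re

two_paragraphs = "<p>one</p><p>two</p>"
nested_tag = "<p>a <b>bold</b> word</p>"

# Greedy .+ runs to the last </p>, swallowing both paragraphs.
greedy = re.findall(r"<p>.+</p>", two_paragraphs)

# [^<]+ stops at the first "<", so each paragraph is matched separately...
restricted = re.findall(r"<p>[^<]+</p>", two_paragraphs)

# ...but a paragraph containing any other tag no longer matches at all.
with_inner_tag = re.findall(r"<p>[^<]+</p>", nested_tag)

print(greedy, restricted, with_inner_tag)
```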
Remember a* and a? are still successful when they match nothing.
s/\s*/:/g
s/\s+/:/g
The example is based on actual code that I saw used in a real system.
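Here is the same pair of substitutions in Python, using `re.sub` in place of `s///g`. How empty matches next to a real match are handled differs between engines and versions, so I only print the `\s*` result rather than claim an exact string.

```python
import re

text = "a  b c"

# \s+ needs at least one whitespace character, so only real gaps change.
plus = re.sub(r"\s+", ":", text)

# \s* also succeeds on the empty string, so colons appear at positions
# where there was no whitespace at all, starting before the first "a".
star = re.sub(r"\s*", ":", text)

print(plus)
print(star)
```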
I said regular expressions are a language. Therefore, you should think of each expression you write as a little program or function. As such, you should think about writing them and maintaining them much as you would any other code.
Readable regular expressions are easier to maintain. Try to make each expression's meaning clear. Don't use unnecessarily clever tricks. Take pity on the person who will maintain this. If you can break an expression into smaller expressions and match them separately, or build a larger expression out of smaller ones, try to do that.
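One way to build a larger expression out of smaller ones, sketched in Python. The octet pattern is the 0-255 example that appears near the end of this deck; the dotted-quad composite and the names `OCTET`, `DOTTED_QUAD`, and `is_dotted_quad` are hypothetical, made up for this illustration.

```python
import re

# One named, testable piece: a number from 0 to 255 with no leading zero.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# The larger expression is assembled from the piece: four octets with dots.
# {{3}} is a literal {3} inside the f-string; \Z is Python's end of string.
DOTTED_QUAD = re.compile(rf"\A{OCTET}(?:\.{OCTET}){{3}}\Z")


def is_dotted_quad(s: str) -> bool:
    """Return True if s is four 0-255 numbers separated by dots."""
    return DOTTED_QUAD.search(s) is not None


print(is_dotted_quad("192.168.0.1"), is_dotted_quad("256.1.1.1"))
```

The payoff is that the small piece can be read, tested, and fixed on its own, and the composite says what it means.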
/^\s+key:\s*(.*)/
/cat\w*/
/(\w*\s*)+/
In the normal case, a regular expression will keep trying to match in as many places in the string as possible as long as it has not succeeded. Anchors are powerful because they can tell the regular expression to give up early instead of trying to continue. Explicit strings have a similar effect. If there is an explicit string at the beginning of the regular expression, the engine will normally scan the text first looking for that string and may give up without matching anything else.
Nested quantifiers can be extremely slow due to exponential time complexity caused by multiple attempts to backtrack over the string.
If you have an unanchored quantified subexpression at the front or back of your regular expression, consider whether it does anything at all for you.
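A small Python check of that last point, using the `/cat\w*/` example and some made-up words. For a yes/no test, the trailing `\w*` changes nothing; it only changes the text that is matched.

```python
import re

samples = ["concatenate", "dog", "cat", "wildcat!"]

# As a yes/no test, the two patterns always agree.
plain = [bool(re.search(r"cat", s)) for s in samples]
padded = [bool(re.search(r"cat\w*", s)) for s in samples]

# The trailing \w* only matters if you use the matched text.
matched_text = re.search(r"cat\w*", "concatenate").group()

print(plain == padded, matched_text)
```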
/[\s\t\r\n]/
/[a-z-]/
/[A-]/
/[][]/
/[.*?+]/
Character classes have a few surprises built in. Remember that the escape sequences effectively expand inside a character class. So the first expression is basically a wordier version of \s. If the dash character is in a position that could not be part of a range, it has no special meaning. If it is in the right place, it has to be part of a range. The closing square bracket is not special as the first character of a character class (or the first after the ^ for a negated character class).
As I said earlier, the metacharacters have no meaning in a character class.
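The five examples can be checked in Python. Treat this as a sketch for one dialect: the leading `]` rule in particular is not universal (JavaScript, for instance, reads `[]` as an empty class), and the test strings are mine.

```python
import re

# [\s\t\r\n] accepts exactly the same characters as \s on its own.
probe = " \t\r\n\f\vab-1"
same_as_s = all(
    bool(re.fullmatch(r"[\s\t\r\n]", ch)) == bool(re.fullmatch(r"\s", ch))
    for ch in probe
)

# A dash that cannot form a range is just a dash.
trailing_dash = re.findall(r"[a-z-]", "A-b")
dash_after_a = re.findall(r"[A-]", "A-b")

# A ] in first position is a literal, so this class is "] or [".
brackets = re.findall(r"[][]", "x[y]z")

# Metacharacters are ordinary characters inside a class.
metachars = re.findall(r"[.*?+]", "a.b*c?d+e")

print(same_as_s, trailing_dash, dash_after_a, brackets, metachars)
```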
+?, *?, ??, {n,m}?
abc(?=def)
abc(?!def)
(?<=abc)def, (?<!abc)def
Sometimes you would like to have quantifiers that match as little as they can before allowing the rest of the expression to match. The non-greedy forms of the quantifiers do just that. They are the same as the normal quantifiers, modified by a trailing ?.
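Greedy and non-greedy side by side, as a short Python sketch with invented strings.

```python
import re

tags = "<a><b>"

# Greedy: .+ takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", tags).group()

# Non-greedy: .+? takes as little as it can, stopping at the first ">".
lazy = re.search(r"<.+?>", tags).group()

quoted = 'say "hi" and "bye"'
greedy_quotes = re.findall(r'"(.*)"', quoted)
lazy_quotes = re.findall(r'"(.*?)"', quoted)

print(greedy, lazy, greedy_quotes, lazy_quotes)
```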
The positive and negative lookahead assertions are like general purpose anchors. They are not part of the match itself, but they only allow the expression to match if what's in the parentheses matches (or does not match, in the case of the negative lookahead).
Some regular expression systems also have lookbehinds, which can be used before a string. Older JavaScript engines did not support them; they were added to the language in ECMAScript 2018.
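The four assertion forms from the list above, tried in Python on made-up input. Note that the end of the lookahead match stops before "def": the assertion checks the text without consuming it.

```python
import re

text = "abcdef abcxyz"

# Positive lookahead: "abc" only where "def" follows; "def" is not consumed.
followed = re.search(r"abc(?=def)", text)

# Negative lookahead: "abc" only where "def" does NOT follow.
not_followed = re.search(r"abc(?!def)", text)

other = "abcdef xyzdef"

# Lookbehinds check the text before the match instead.
preceded = re.search(r"(?<=abc)def", other)
not_preceded = re.search(r"(?<!abc)def", other)

print(followed.span(), not_followed.start(), preceded.start(), not_preceded.start())
```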
/\A(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\z/
/\A\d{1,3}\z/
/\A(?!0\d)\d{1,3}\z/
/\A[1-9]\d{0,2}\z/
These all match different variations of 1-3 digits. Try to figure out what each one does. And be very precise when you try to describe each. The differences are sometimes quite subtle.
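If you want to check your answers, here is a small Python harness, offered as a sketch. Two assumptions: Python spells the strict end-of-string anchor `\Z` where the patterns above use `\z`, and the labels and test values are my own. Try your own inputs before trusting your description.

```python
import re

# The four patterns above, in order, with \z respelled as Python's \Z.
PATTERNS = {
    "first": r"\A(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\Z",
    "second": r"\A\d{1,3}\Z",
    "third": r"\A(?!0\d)\d{1,3}\Z",
    "fourth": r"\A[1-9]\d{0,2}\Z",
}


def accepted_by(value: str) -> list[str]:
    """Return the labels of the patterns that accept value."""
    return [name for name, pat in PATTERNS.items() if re.search(pat, value)]


for value in ["0", "7", "007", "255", "256", "999", "1000"]:
    print(f"{value!r:>7}: {accepted_by(value)}")
```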
Mastering Regular Expressions by Jeffrey Friedl
Available from O'Reilly and on Safari
This is probably the best reference I know of for really understanding regular expressions. If you read and really understand this book, you will know more than you need for most regular expression problems.