Feb 222011
 

From phi.lho.free.fr I added this page which has awesome information because the original aite was unavailable when I checked it. Just to be clear, this is not my work and copyright info is listed at the bottom.

It needed reposting, so here it is in all it’s g33ky glory…

—————————————————————-

First, you may ask, if you found this page fortuitously, what are regular expressions ? (usually abbreviated as RegExp, RegEx or RE) In a few words, they are a powerful (but a bit geeky) way to manipulate text.

With them, you can see if a generic string (eg. “5 letters followed by 2 digits) is inside a text, you can extract a string out of a text (eg. getting the current version number from a software download page), check if a string meets some criteria (did the user type a date in the right format?), transform a text (morph a list of C’s #defines to a list of variable assignments in your favorite language), split a string with complex requirements (eg. get all words of a natural text, separated by spaces or punctuation signs), etc.

The drawback is its syntax, quite cryptic for the uninitiated (and sometime for the initiated…), but with practice, it appears that most of the tasks use rather simple expressions, so are not hard to master.

Now, don’t feel bad if you had to ask. I see myself as a seasoned professional programmer, with more than 15 years of paid experience, and more than 25 years of programming practice (going back to my first programmable calculator!). Yet, I started only recently to really use REs.

That’s because in most of the programming languages I used (C[++], Pascal, various Basics, assembly languages), regexps are not full part of the language, needing an external, non-standard library, and thus their use wasn’t commonplace.

I used them a bit with Unix tools like grep or sed, but only at low level of expertise. I actually started to use them with the Lua language that has a proprietary implementation, and when the SciTE source code editor integrated a simple implementation of REs. Being a bit primitive, it was less intimidating and I found myself using them more and more.

Then I learned languages fully integrating regexps. A bit of Perl, of course, but I am still a beginner here; JavaScript, which is quite close of Perl in RE syntax; PHP, with both its Posix implementation (ereg, outdated) and its PCRE-based one (preg); and Java, fully integrating REs since version 1.4.

Actually, you are probably using REs without knowing it… At least a very primitive, very simplified version of them, called meta characters in MS-Dos: in file patterns, ? replaces any character and * replaces any sequence of chars, like in *.jpg or backup.00?

What are the uses of regular expressions?

I did a quick overlook above, let’s expand these ideas here. Like with any language, be it natural language (English), programming language (Basic), or description language (HTML, REs), we are limited only by our own imagination… and our own needs! Oh well, we can be limited by the capabilities of the language itself, of course. But they often offer more than enough playground to keep us busy discovering the possibilities… and exploiting them. Actually, you will be mostly limited by your familiarity with the language. The more you use them, the more uses you will find, and you will wonder how you did without them!

A frequent use, particularly in JavaScript, is validation. It allows to quickly test if data entered by user or found in a file matches some criteria/rules. If it matches, it is likely to be valid. Note the restrictive form: You can check that an e-mail address as a proper form (less obvious than it seems, [email protected] is a valid e-mail address!), but not that it actually exists on a server… Likewise, you can check that a postal code has a correct format, but not that it is used by the postal service. And so on.

Yet, it is useful to catch early typo errors, like forgetting a digit in a postal code or a date. It is important in Web forms, because it avoids to submit the page then wait for a new page pointing out the error to be returned…

Another frequent use is to search a string in a text. Eg. you can find an e-mail address (again), a given HTML tag, a version number and/or a date in a Web page. You can just verify it is here (validation), get its position in the text, or extract the exact value.

The third main use is to replace one string by another, often derivated from the first. For example, from a file listing numbers followed by a tab and an identifier, we can create another file where these identifiers are prefixed, then followed by a equal sign and the value, or transformed in a series of C’s #defines, etc.

The last big category of use for REs is to split a string. It is often used to divide a sentence in words, whatever the separators can be (spaces, tabs, punctuation signs, etc.) and whatever their number.

The Tutorial

Hopefully, I have now whetted your appetite. So it is time to see how we do all this! There are many tutorials on regular expressions on the Net, so you may wonder why I wrote yet another one…

Well, I saw too often examples like searching the ab string in a acaadabca text… I guess you won’t use this often, unless you are a biologist studying DNA! So I will introduce mostly the bases of regexes, skipping or only mentioning advanced features but trying to be as pragmatic as possible.

I also tried to follow a logical progression, not by strict syntax grouping, but by introducing features progressively, giving examples using only the concepts already seen, and avoiding forward references (“this feature is similar to one we will see later…”).

One of the biggest problems with REs is that they are not normalized, every implementation in each language uses its own syntax.

There is basically two major definitions: Posix and Perl. I choose the most popular one (Perl, used in PCRE, thus PHP, in JavaScript, Java, etc.), but fortunately, lot of features are common to both specifications. And once you get the bases, you will just have to study the particularities of your language of choice to adapt the examples.

I want my examples to be language agnostic. So I use a simple pseudo-language where backslashes don’t have to be doubled (boo to Java!) and functions are hopefully simple and self explaining.

Useful links

There is a lot of reference material on regular expressions on the Net, a little googling will show them. I will give only three generic links:

Wikipedia article on regular expressions. Good history and overview, some links.
Regular-Expressions.info wants to be a portal about regexes. It is also a plug for the PowerGREP shareware, but it isn’t too intrusive. It has a JavaScript RE tester.
RegExLib.com is a catalog of useful regular expressions, contributed by users. There is a lot of redundancy (tons of e-mail address checkers!) and the quality of REs should be checked… But it is a good starting point to have ideas and to see some complex/clever regexes.
There are some pages that allow you to test regular expression online. The advantage is that you don’t have to install software on your computer. These can be useful to test the examples I give. Here are two of them:

Venimus’ regex tester is impressive. It does the classical JavaScript tester, but it also allow to test both PHP engines: the old Posix one (obsoleted) and the PCRE one. The strong point here is that he uses Ajax to send in real time the strings the user type, to fetch the result and to display it immediately.
The (Java) Regular Expression Test Page is quite good and allows to test the Java engine.
If you prefer, you can install a software to test off line. My two favorites (for Windows) are:

The Regex Coach, a very impressive program, written entirely in Lisp, including the RE engine itself! It follows the Perl rules and gives a good deal of information/analysis of your expressions.
JRegexpTester is a standalone Swing application that helps you test regular expressions with the Sun Java standard API (java.util.regex).” The site says it all. Obviously, you will need a JRE to run it. Not only it is a tester, but it aims to be a tool to manipulate files. Plus it integrates the entire RegExLib.com (see above) library of REs (a big XML file).
Last but not least, I would like to mention PCRE, a C library with a permissive license (BSD) which aims to be as much compatible with Perl REs as possible, and more. It is probably one of the most used RE engines (found in many applications and languages, including PHP).

Note there is an HTML version of the PCRE documentation, easier to read (and print) than the pure text format that comes with the library, in the grand Unix tradition… I used the Pattern reference as base reference for this document.

I wrote a PCRE wrapper for the AutoHotkey Windows language, using the DLL version of the library.

The first steps

RegExps are made of regular chars with no special meaning, and special chars (all of them low Ascii) whose meaning we will see soon.

Regular chars are letters, digits and some punctuation signs. They match themselves, which is pretty boring, yet essential!

In this tutorial, the IsMatch function will return true if the second argument is found in the first one. So IsMatch(“Regular expressions are cool!”, “expr”) will return true and IsMatch(“Regular expressions are cool!”, “kool”) will return false.

Note that by default case is important, ie. IsMatch(“REs”, “r”) will return false, but you can provide an option to match independently of case.

Backslash

Special chars are some punctuation signs whose meaning depends on context.

The first important special char is backslash: \

Put before a special char, including itself, it removes the special meaning and let match the ordinary character: IsMatch(“[C:]\tmp”, “\[C\:\\\]”) returns true. That’s one of the culprits of REs: with a lot of backslashes, the expression is nearly unreadable! The one before the : character is actually not needed, but doesn’t hurt.

More interesting is the capacity to match some control chars: \n \r \t are respectively newline (line feed), carriage return and tab. There are others, but less frequently meet in textual data. To match any character given by its hexadecimal Ascii code, use \xDD where DD are two hex digits.

Even more interesting is the capability to match some character classes, ie. groups of similar characters: \d matches any digit (0 to 9), \w matches any word character (a to z, A to Z, 0 to 9 and underscore (_), the list can be extended according to the system (or user) locale) and \s matches whitespace chars (definition depends on RE engine, there is at least space and tab, and maybe other control chars like line feed or carriage return).

Note that if you use capital letters, you will get the opposite: \D matches any character except digits, \W matches mostly punctuation chars and control chars, \S matches any non-space char.

We are already able to do useful searches: if I want to know if the user has typed a date in the right format in my GUI (or Web page), I write: IsMatch(userInput, “\d\d\d\d-\d\d-\d\d”) to accept 2006-05-24 but reject 2006-111-4. Well, it will also accept 5555-77-99, so it isn’t perfect yet…

Anchor this!

Actually, the regex engine will search the given expression inside the given text (unless specified otherwise by some options). That means that if you don’t restrict the size of data in the input field, the user can type 12038-11-252 and the IsMatch will return true…

You can avoid this problem by anchoring the expression to the beginning or the end of the text (or both, of course).

If you start your RE with the circumflex ^ character, the engine will match it only if the text starts with the expression.

If you end your RE with the dollar $ character, the engine will match it only if the text ends with the expression.

So, if you surround your expression between ^ and $, the RE will match only if the whole text matches the expression. So, our entry validation code will look like:
IsMatch(userInput, “^\d\d\d\d-\d\d-\d\d$”)

Other examples: Reading a script file, where line comments are marked by #, we want to check if a line is a comment. IsMatch(codeLine, “^#”)

On the same script file, we want to find out which lines end with semi-colon (;) IsMatch(codeLine, “;$”)

Dot and brackets

There is another character class, which actually matches any character, except end-of-line characters (unless you specify otherwise with the appropriate option).

That’s the dot (.). It is like the question mark in MS-Dos, matching only one character, whatever it can be. For example, “s…y” will match “silly” or “sorry” or even “s! Ty”. Beware, if you want to match the dot in a filename, for example, you must backslash it: “foo.\w\w\w” will match “foo.bar”, but also “footbal”! Write “foo\.\w\w\w” if you want an extension with three letters (or digits).

All this is nice, but a bit inflexible! What if I want to match only lower case letters? Or only some punctuation signs. Or all hexadecimal digits?

You can define your own character classes, using brackets ([ and ]): [;:!?”] will match all double punctuation signs. [abcdefABCDEF0123456789] will match any hexadecimal digit.

Obviously, if the list is long, it is annoying to type, so there is a shortcut: [a-fA-F0-9] is the same as above. The dash indicates to include all consecutive characters between the two given (consecutive in the Ascii table). Note that [z-a] will probably generate an error. If you want to include the dash to the character list, just put it at the start or the end of the list: [-_.].

Note also that most characters loose special meaning in these lists: here, dot is a plain dot. If you need to put ] in a char class, either backslash it, or put it at the start of the RE, since empty classes are not legal: [][] looks funny, but is actually a char class with just [ and ], the former loosing its special meaning once in the class.

The backslash keeps its special meaning: you can write [\r\n\t\b\a] to match some control chars, [A-F\d] to match upper case hex digits, etc.

You can invert the matching by putting the ^ char just after the opening bracket: [^a-zA-Z] will match anything except letters, [^\x00-\x1F] matches any non-control char. To match the ^ character itself, don’t put it at the start: [‘`^~] will match most diacritical marks.

Star, plus and question mark

This is becoming powerful, but still tedious to type: if I want to match an hexadecimal byte in C, I have to type IsMatch(byte, “0[xX][a-fA-F0-9][a-fA-F0-9]”. It becomes worse for wider types…

So clever RE designers defined characters to repeat a character or a character class (and some other sub-expressions as we will see later). They are called quantifiers (defining a quantity).

The star * indicates to repeat the previous expression from zero to an infinite number of times. So “Aa*h!” will match “Aah!” and “Aaaaaaah!”. But also “Ah!”! Remember the zero, it makes an expression optional. To be sure to match at least one character, you have to repeat it: “Aaa*h!”, but it is annoying if the expression to repeat is big, like 0[xX][a-fA-F0-9][a-fA-F0-9]*.

So, just use the plus + sign, it is just made for this! So we can write “Aa+h!”.

This is also useful for our hex numbers. Since 0x isn’t a legal number, we have to write: IsMatch(hexNumber, “0[xX][a-fA-F0-9]+”.

The question mark ? will match 0 or 1 occurrence of the previous expression. So, to keep with my interjection, “Aa?h!” can only match “Ah!” or “Aah!”. It is useful to mark optional stuff, eg. “,\s?” indicates there may be a space, or not, after the separator (comma).

Greedy little quantifier!

One of the most used combination is .*. It allows to match any number of any character.

Of course, it is rarely useful at start or end of a regexp, because usually, RE functions search inside strings, so already skip any non-matching characters. So, .* is mostly used between two bounds. For example, we want to get the content of a given HTML tag. Here, I will introduce the pseudo-function GetMatchString: if the htmlFragment variable contains

"<br /><a href='http://www.example.com'><img src='http://www.example.com/example.png' alt='Example' title='Canonical example site' /></a>". GetMatchString(htmlFragment, "<img \s.*/>") will return "<img src='http://www.example.com/example.png' alt='Example' title='Canonical example site' />"

as expected.

But beware of a very common pitfall most beginners will fall into (including me!): if we call

GetMatchString(htmlFragment, "<a \s.*>"), </a>

we won’t get

"<a href='http://www.example.com' title='Canonical example site'>" </a>

as probably expected, but

"<a href='http://www.example.com'><img src='http://www.example.com/example.png' alt='Example' title='Canonical example site' /></a>"!

Why? Because the * operator is “greedy”, ie. it avidly matches as many characters as it can, ending only on the last occurrence of the expression after it.

So, how can we make it stop on the first > sign? There are basically two ways. I use the first one in SciTE, whose primitive RE engine cannot handle the second way…

GetMatchString(htmlFragment, "<a \s[^>]*>") </a>

means “match as many characters different of > as you can and end on the first > sign that you find”.

GetMatchString(htmlFragment, "<a \s.*?>") </a>

Here, the interrogation point ? is a modifier of the star: it makes it a non-greedy operator, so it will end on the first occurrence of the following expression.
The second form is better, because more generic. The inverted char class trick won’t work for complex tail expressions.

Note that this non-greedy modifier applies also to + and ? quantifiers. The usage of +? is very similar to *?, but I find no real use for ??…

Capture this: parentheses

One powerful feature of regular expressions is the capacity to capture a part of the expression. This means that some parts of the expression are marked to keep the text they match. This is indicated by putting them between parentheses: ( and ).

These captures are numbered: the first one, when reading from left to right, has number 1, the second (taking in account the opening parenthesis, as captures can be nested) is 2, etc. Some RE engines limit the number of captures to 9, others handle as much as 65535…

There is a special, implicit capture, number 0: it is the whole expression itself (always captured).

So, back to our HTML fragment, if the pattern variable contains

<img \s+src='(.*?)'\s+alt='(.*?)'\s+title='(.*?)'\s*/>

we can write:

GetMatchString(htmlFragment, pattern)

and get

 "<img src='http://www.example.com/example.png' alt='Example' title='Canonical example site' />"

as result, but also:

GetMatchString(htmlFragment, pattern, 1)
GetMatchString(htmlFragment, pattern, 2)
GetMatchString(htmlFragment, pattern, 3)

to get respectively “http://www.example.com/example.png”, “Example” and “Canonical example site”.

Beside extracting the captured strings, there are two main uses for captures: they can be used internally in the regular expression itself (back references) or they can be used in a replace expression.

Back to date validation! Let say that we want to be more flexible, accepting alternatives separators: 2006-06-28, 2006/06/28, 2006\06\28, etc. We could write:

IsMatch(userInput, "^\d\d\d\d[-/\\ ]\d\d[-/\\ ]\d\d$")

but then, the user could type “2006 06/28″…

So we have to ensure both separators are identical. We can do this by capturing the first one, and using a reference to the capture in the second one:

IsMatch(userInput, "^\d\d\d\d([-/\\ ])\d\d\1\d\d$")

\1 is the reference to the first capture, \2 to the second one, etc.

Another example: in HTML, you can quote attribute values with either single quotes or double quotes. Of course, the quotes must match. So, to get an attribute value, you can use something like

"href=(['\"])(.*?)\1"

(double quote is backslashed here only because it is the host language’s string delimiter).

Note there is an ambiguity for which I saw no official solution: if the back reference is immediately followed by a digit, eg. 2, you cannot tell if \12 is the reference to the first capture followed by the digit 2, or the twelfth reference (if it exists). Of course, there is no problem with engines limited to 9 references… The solution I found is simple: use \1[2]…

There is also different ways of interpreting numbers above the number of sub-expressions (or the number of captured strings, which can be different). I won’t detail them here, check the documentation of the RE engine of your choice.

One powerful feature of REs is data morphing through the use of captures. To illustrate that, I will use the pseudo-function Replace:

Replace("See an advanced use of expressions", " [aeiou]", "y")

will return “See yn ydvanced yse yf yxpressions”…

The third parameter is the replace expression, which isn’t a regular expression. It is a plain string (no special character except the classical escapes), which can contain references to captures. The syntax of such references vary. Old applications like sed, vi or older versions of Perl used \0 to \9. Now Perl, and followers like Java, JavaScript, etc., use $0 to $9 and more. To disambiguate cases like \12 mentioned above, Perl can also use the ${0} to ${9} (or up to ${65535}…) notation (that’s actually variable dereferencing). Modern Perl and JavaScript prefer to use $& to $0. I will use the more universal $x notation from now on.

So we can play more and write:

Replace("See an advanced use of expressions", " ([aeiou])", " y\1")

to get “See yan yadvanced yuse yof yexpressions”.

References in the replace expression can be in any order. So if we have a file whose lines are in the form “0x10F5\tMSG_DOSMTHNG”, we can morph them to generate lines for our favorite language:

Replace(fileLine, "^0x([A-Z0-9])+\t(.*)$", replacePattern)

With

replacePattern being "\2 = &H\1", we will get "MSG_DOSMTHNG = &H10F5"

for Visual Basic.
With

 replacePattern being "#define \2 0x\1", we will get "#define MSG_DOSMTHNG 0x10F5"

for C.
With

replacePattern being "public final static \2 = 0x\1;", we will get "public final static MSG_DOSMTHNG = 0x10F5;"

for Java, and so on.

Parentheses again: grouping

Another usage of parentheses in regular expressions is to group RE elements to form a sub-expression. For example, to match a list of words separated by commas, we can write:

 (\w+,)*\w+:

we grouped \w+ (a series of word chars) with comma, and indicated we will repeat this pattern as much as needed. This will match “me,and,my,monkey”.

Note that here, we don’t use the captured string, so it could be interesting to have non-capturing parentheses. This is possible in some implementations, using the (?: ) notation: (?:\w+,)*\w+. It is useful if you need to add such grouping between real captures as it won’t change the numbering of captures. Or just as a kind of self-documentation. Or to keep the number of captures below 10…

There are other uses of parentheses, but they are somewhat convoluted and rarely used (and fall in the advanced stuff anyway). I will just indicate that some engines allow comments between

(?# ): (?:\w+,)*(?# Iterate on words and separators )\w+.

Some advanced stuff

Most of the expressions we saw until now are quite basic and works in most regular expression engines. Possible exceptions are \d \w \s \x classes (replaced by another notation in Posix REs), the non-greedy operator and the non-capturing/comment parentheses.

We will see now some slightly more advanced features, still useful and not too hard to understand. They may or may not be present in your favorite engine, you have to study its doc and experiment to be sure.

Embrace the braces

The star and plus notations are powerful, but matches an unlimited number of occurrences of the preceding expression. What if we want to match a precise number of matches without repeating the sub-expression, or if we want to specify a minimum number of occurrences, or put an upper limit?

Well, you can (of course…)!
{n} repeats the previous expression n times.
{n,} will match if previous expression is repeated at least n times (no upper limit).
{n,m} will match if previous expression is repeated at least n times but at most m times.
n and m must be between 0 and 65535.

You can guess that ?, * and + are actually shortcuts to this notation: ? is {0,1}, * is {0,} and + is {1,}.

And since these quantifiers are also greedy by default, you can follow them with the ? operator to make them non-greedy.

So if we want to match a number between 0 and 65535, we can use “\d{1,5}” to eliminate numbers with too many digits (but still have to check if not above 65535).

Our date expression can be simplified:

 "^\d{4}([-/\\ ])\d\d\1\d\d$":

I could have used \d{2}, but it is longer than \d\d!

Say we want our hexadecimal numbers have always 2, 4, 6 or 8 digits. We can check this with

"0[xX](?:[a-zA-Z0-9]{2}){1,4}".

We have to enclose the first expression because {2}{1,4} is not legal.

Alternatives

This is a fundamental feature of regular expressions, but that unfortunately isn’t implemented in all engines, hence its placement in advanced stuff.

So far, we had quite linear expressions, were all sub-expressions concatenate, and the whole expression matches only if sub-expr1 AND sub-expr2 AND … AND sub-exprn all match.

This is all fine and dandy, but what if I want a OR? For example, I want to match either “fast food” or “fine food”. Or, perhaps more realistically, I want to match any week day name in a date expression. How do I do that?

Alternatives to the rescue! You can specify several alternate parts, separated by a pipe | char, the RE engine will try them (from left to right) until one match.

So you can write fast food|fine food. Here, the whole expression provides two alternatives. But often you want to restrict the alternatives to a small sub-part of the expression, so you enclose the alternatives in parentheses: (fast|fine) food. As seen before, if you don’t need to capture this part, use a non-capturing group: (?:fast|fine) food, or, why not, f(?:ast|ine) food.

You are not restricted to two parts: (?:Sun|Mon|Tues|Wednes|Thurs|Fri|Satur)day. A part can be empty: (Mr\.?|Mrs\.?|Ms\.?|Miss|) March indicate one of these title, or none, can be used before the last name.

Note that (a|b|c|d|e|f) is similar to [a-f] except that it is longer to type and to execute…

Note also that if you try to match red|reddish against a sentence like “I want a red, or at worse reddish, car”, you will get two matches, both being “red”, because the engine stops on the first successful match. So in case of ambiguity, provide the longest match first: reddish|red (which can be rewritten, of course, as red(?:dish|)).

Of course, you are not restricted to plain string alternatives, each part can be a full sub-pattern, including itself other alternatives.

Just for fun, I give here a regular expression I built to match any date in European format (dd/mm/yyyy), from 1600 to 9999, checking that dates are valid, including for leap years! To be honest, I found a similar expression in RegExLib, written by dany.lauener(a)b-i.com. I deconstructed the expression, understood it, and rewrote it to be shorter and hopefully more efficient.

^((0[1-9]|1\d|2[0-8])/(0\d|1[012])/(1[6-9]|[2-9]\d)\d{2}| (29|30)/(0[13-9]|1[012])/(1[6-9]|[2-9]\d)\d{2}| 31/(0[13578]|1[02])/(1[6-9]|[2-9]\d)\d{2}| 29/02/((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|(16|[2468][048]|[3579][26])00))$

I added semi-arbitrary carriage returns to avoid scrolling but it should be in one line. It is impressive, but actually rather simple… I provide a commented, extended version at the end of the article.

Options

Depending on the RE library, a number of options (or modes) can be set when creating the expression or when matching. The most common ones are:

Option Letter Description
CASELESS i “a” will match both “a” and “A”
MULTILINE m the “start of line” (^) and “end of line” ($) constructs will also match respectively just after and just before a newline character. By default, they match only the start and end of the searched string, even if it has several lines inside.
DOTALL s the dot metacharacter can match an end-of-line character (by default, it doesn’t)
EXTENDED x whitespace data characters in the pattern are ignored, except if escaped or inside a character class. This allows to write the pattern on several lines and to indent it, to make it more readable. You can even add comments: they follow # and end with the end-of-line.
UNGREEDY U inverts the greediness of the quantifiers (specific to PCRE?)

These options can be set inside the expression, to change it for the remainder of the subpattern, ie. up to the end of the expression or the closing parenthesis at the level of the option.

This is done by the (?i), (?m), (?s), (?x) and (?U) constructs. These settings are unset by preceding them with a dash: (?-i). Several options can be set at once: (?m-ic).

You can quote me

Sometime you have to match a string full of characters used in RE syntax. If you want to search the

^\d{4}([-/\\ ])\d\d\1\d\d$

string in this HTML file, either you write:

"\^\\d\{4\}\(\[-/\\\\\ ]\)\\d\\d\\1\\d\\d\$",

which is even more unreadable than the given RE (even worse if the language requires you double all backslashes, like Java!), or you use special quotation marks:

\Q^\d{4}([-/\\ ])\d\d\1\d\d$\E.

When meeting \Q, all characters loose their special meaning, until the engine finds \E.

Bounds, Unlimited

Say you need to search variable names in a source code file. You write GetMatchVal(buffer, “lockFile[1-4]”) to get their positions and lengths in the file (yes, that’s another function).

But this will find also the unlockFile2 variable and friends, which is unwanted. So you want to write “\WlockFile[1-4]” to be sure the variable won’t be preceded by a letter. But this has two drawbacks: the position and length will be wrong (you have to adjust for them), and it won’t work if lockFile is at the start of the expression to search (no char before).

To avoid that, we can use zero-length assertions. Such assertions won’t consume any character, they just ensure some conditions are meet. In our case, we can write “\blockFile[1-4]\b” where \b forces to match at a word boundary. As usual, \B match when not at a word boundary. A word boundary is when the first or last character of the tested string match \w, or when the current character matches \w and the previous one matches \W or the reverse.

Note that now, you have enough elements to understand the logo I made for this article: \bR{5}r\b and \bE{9}s\b. The answer is at the bottom of the page…

Other bounds, useful if you use the MULTILINE mode: \A matches at start of searched string (equivalent to ^); \Z and \z matches at end of searched string (equivalent to $), the former matching before an ending newline if any, the later matches after this newline; \G matches at the end of the previous match. It is usually used like ^, at the start of an expression, to match a position depending on another match (not necessarily the same expression).

Posix character classes

The Posix syntax, supported by Perl in this case, defines more character classes that \d \w and \s. It uses a heavier syntax, but a bit more readable: [:xxx:] were xxx is one of the following words:

Class name Description
alnum letters and digits
alpha letters
ascii characters code from 0 to 127
blank space or tab only (Gnu extension)
cntrl control characters
digit decimal digits (like \d)
graph printing characters, excluding space
lower lower case letters
print printing characters, including space
punct printing characters, excluding letters and digits
space white space (not like \s: HT (9), LF (10), VT (11, not in Perl), FF (12), CR (13), and space (32))
upper upper case letters
word “word” characters (like \w, that’s a Perl extension))
xdigit hexadecimal digits

These classes can only be used inside [ ], so to match a hex number, you must write: [[:xdigit:]]+

More advanced stuff…

That I won’t explain much, mostly because I just don’t use them, probably because I don’t see how to use them… If someday I see a good example for them, perhaps I will expand this tutorial. I put them here so you know they exists and you can explore them yourself.

Subpatterns as subroutines. A kind of backreference that doesn’t refer to a previously matched string, but to a previous pattern. This one is very useful, since identical pattern portions can be found in alternative parts. Alas, it seems to be specific to PCRE. I suppose Perl can simulate this with variable expansion. Too bad this isn’t part of the classical REs.

Notation: (?1).

Example: (19([5-9][0-9])|20(?2)) will match the years 1950 to 1999 and 2050 to 2099. I know it can be written differently, it is just an example that can be expanded with different stuff in the branches: (19([5-9][0-9]-01)|20(?2)-12).

Look ahead and look behind assertions. As you know now, assertions are matched as usual, but don’t consume characters. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. Lookbehind assertions start with (?< = for positive assertions and (? ), possessive quantifiers are in PCRE and Java REs, they are marked with a plus sign (+) after a quantifier: \d++. They need some experience in REs, knowing the matching algorithms. They are used to optimize REs that proved to be inefficient when they fail (the RE engine tries to match any possible combination, with combinatorial explosion).

Named subpatterns are implemented in Python and PCRE (and some other engines). They allow to access a subpattern (a capture) by a name, which are easier to use than numbers (where you need to count the parentheses, and recount if the pattern changes). Again, not standard, but cool.

Notation: (?Psubpattern) to define the pattern, which can be used as backreference with (?P=name).

Conditional subpatterns. (?(condition)yes-pattern) and (?(condition)yes-pattern|no-pattern). Condition can be a back reference (true if the capture matched) or a look around (look ahead or look behind).

Examples:
(?:< ([biu])>)?Bold or italic or underlined or unchanged text(?(1)) will match either just the sentence alone, or the sentence surrounded by a bold, italic or underline tag.
(?(?=[JFMASOND])[A-Z][a-z]+|\d{2}) will match a month name or its two digit number, ie. it will match the month on “We meet on Julius or later” or “We meet on 07 or later”, but not on “We meet on the beach”. OK, this one is a bit contrived, and in most cases using alternates would be enough, but it should give you a taste of the syntax.

Recursive patterns. To match nested stuff (parentheses, braces…). Never tried.

Callouts. To call arbitrary code when matching… Obviously strongly dependent of implementation/language.

And any other feature that the RE engine at your disposal as come up with…

This is the end, my friend

I hope you enjoyed this brief, but rich journey into the mysterious and beautiful country of regular expressions…

I will conclude, curiously, by a word of warning: “Don’t try to use regexes exclusively for all text processing!”

You know the proverbial golden hammer: when you have such beautiful and powerful tool in your hand, you see everything in the world as nails…

REs are like this: they are so seductive that some people want to use them for all tasks to manipulate text.

But if your language has, for example, a simple text splitting function that accept only plain text as separator, you may want to still try and use it, because it may be much faster and easier to use than the regex equivalent.

Note that some RE engines can detect that an expression is just plain text, and can fall back on a fast search algorithm, like Boyer-Moore or derivatives.

I saw a typical example of this golden hammer syndrome in a forum: somebody asked a regular expression to parse something that was actually CSV lines. Everybody advised to drop REs for that and use a classical char by char scanner.

Why? Because CSV is harder to parse than it seems, and the file format depends much on the application that outputted it. Even Microsoft applications export to various CSV syntaxes depending on the program (or its version…).

For example, separators are classically comma, but might be semi-colon (French locale, for example, which uses the comma as decimal separator), or tab (sometime called TSV). If the value has no space, or perhaps just no special character (eg. the separator), the field can be unquoted (3.14,6,145). When quoted, the quotes inside can be escaped by doubling them or preceding them with backslash. Etc. I won’t even mention the case of fields with newline inside…

The problem here, and in most context-free languages (most programming languages), is that a given expression can occur in various contexts: a syntax construct has no special meaning in a string or in a comment. And regular expressions have hard time to find such context.

Another culprit is nesting. Unless the RE engine has special support for this, it is hard to find the final parenthesis of a function call that can have nested expressions: DoThis(a * (b – c), “)”, (sin(PI * x) – cos(PI *x )) / 2) which can be complicated by writing this call on several lines, adding comments, etc.

To handle this, we usually make an automaton, which is classically represented as a set of nodes (states) connected by arrows (transitions). The automaton is in a given state (in a string, in a comment, in an identifier…) which is usually a whole sub-automaton (while in a string, change state depending on the found character (plain char, escaped char, end-of-string char) until we find the real end-of-string mark).

For example, if we find a double quote in a C source, we go in the string sub-automaton, which will loop until it finds another double quote, not preceded by a backslash.

That doesn’t mean we cannot use REs in such scanner. We can use something like ([^\\]|\\.)*?” after the first double quote, to find the final one. But this means that you probably cannot make a giant RE to parse code, but several smaller REs used with the full capabilities (looping, testing…) of the host language.

Another example, holy grail of regexp, is e-mail address matching. The RFC [2]822 defines lot of rather complex rules, including addresses like “John D. O’Connor”@example.museum or [email protected], and regexlib.com shows tons of regexes trying to match one or the other subset of possible syntaxes (in general only letters, digits, dots and dashes). Often forgetting to match upper case characters!

There is a mythic RE written by Jeffrey Friedl that conforms strictly to all RFC rules. It is more than 6,300 character long! Obviously, it was hard to create, and is probably nearly unmaintainable.

And actually, adding a few lines of code with properly chosen smaller regexps can do the job much more efficiently. For example, to parse the above code, one could test if it starts with double quote. If that’s the case, find the closing double quote, process the content, then process the remainder of the address. If no quotes, check against simpler form. Etc. This way, the routine can check separately, quietly some simpler subexpressions rather than a (very) complex one.

Breaking a complex problem in simpler sub-problems, using the language facilities instead using arcane syntax or workarounds, these are sane principles.

Now, this might not be always possible. For example, you can have a validating edit field to which you give a regular expression to validate input. In this case, if it doesn’t provide callbacks, you have to give it self-sufficient expressions, no mean to add some code logic here. Idem if you have to manage user-supplied expressions.

So, while in normal case I would break a date in three parts and analyze the numbers programmatically, for such case, I can use the big expression shown above.
As promised, here is the breakdown/analysis of the expression, in EXTENDED (x) mode, which is, by the way, a good mean to understand how some impressive REs work:

(
   # Days 01 to 28, all months, all years: match quickly most cases
   ( # Days 01 to 28
      0[1-9]
      |
      1\d
      |
      2[0-8]
   )
   /
   ( # Months (any)
      0\d
      |
      1[012]
   )
   /
   ( # Years (any)
      1[6-9] # 1600 to 1999
      |
      [2-9]\d # 2000 to 9999
   )\d{2} # Any last two digits

   |

   # Case of months with 30 days
   ( # 29 or 30
      29
      |
      30
   )
   /
   ( # Months with 30 days
      0[13-9]
      |
      1[012]
   )
   /
   (1[6-9]|[2-9]\d)\d{2} # Years as above

   |

   # Case of months with 31 days
   31
   /
   ( # Months where 31th is valid
      0[13578]
      |
      1[02]
   )
   /
   (1[6-9]|[2-9]\d)\d{2} # Years as above

   |

   # Case of 29th February
   29/02/
   ( # Years
      (
         1[6-9] # 1600 to 1999
         |
         [2-9]\d # 2000 to 9999
      )
      ( # Last two digits can be divided by 4
         0[48]
         |
         [2468][048]
         |
         [13579][26]
      )
      |
      ( # Valid years ending with double zero
         16
         |
         [2468][048]
         |
         [3579][26]
      )00
   )
)

Note that the year expression ((1[6-9]|[2-9]\d)\d{2}) is used in several places, so I could have used a reference (Subpatterns as subroutines) for a shorter expression, but it would have been PCRE specific then.

The RE icon

I wanted to use a significant yet short regular expression to create an icon for this page. Short, because of limited space in the image. I chose to write both words: “Regular”, “Expressions”, in the style that is sometime used to abbreviate long words like Internationalization or Localization: take the first and last letters, and replace the middle with its number of characters, giving: I18n and L10n.

So I wrote: \bR\w{5}r\b and \bE\w{9}s\b, adding the word bounds to make them a bit more cryptic, and to avoid ambiguity. Of course, they can match other words, but that’s not a RE for real use anyway!

Copyright © 2006-2008 Philippe Lhoste / PhiLho
Created: 2006/06/24
Updated: 2008/09/30 (added reference to monster RE for e-mails)

This page is hosted at http://Phi.Lho.free.fr/development/RETutorial.en.html

From phi.lho.free.fr

Share This!
 Posted by at 10:45 am

Sorry, the comment form is closed at this time.