Ashley Sheridan.co.uk

Practical Regular Expressions

I often quote Jamie Zawinski for his brilliant line on usage of regular expressions:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Despite this, I still use them to within an inch of their life because, when used correctly, they are incredibly useful across everything from find/replace in a document or IDE, through form validation, and right the way to formatting columns of data into an HTML select list.

At first it might seem like I’m disagreeing with Jamie’s statement, but that couldn’t be further from the truth; I think he’s hit the nail right on the head. Regular expressions are often pushed beyond their capabilities, but most of the time that’s down to naivety rather than flagrant misuse.

XKCD comic about Perl regular expressions

So What Are Regular Expressions?

Simply put, a regex is a sequence of characters that describes a search pattern which can be used on regular language strings, or more simply, it’s wildcards on steroids. The key part of that first definition, though, is why regexes are unsuitable for HTML. The technical reason is that HTML is not a regular language, or more specifically it is a Chomsky Type 2 (context-free) language, whereas regular expressions describe Chomsky Type 3 (regular) languages. While this means that you can’t use them to parse all HTML, they can be used in certain, more controlled, scenarios where you know the exact format of the HTML you’re parsing. I feel like I have to explain this outright, only because parsing HTML is one of the most typical misuses of regular expressions, and an often-asked question on Stack Overflow.

What Can You Do With Them?

A good place to start is by using them in a typical find scenario in your text editor of choice. You’ve copied the list of language culture names and codes from https://msdn.microsoft.com/en-gb/library/ee825488(v=cs.20).aspx and you want to turn them into a list of <option> elements for your website.

We only want the language culture name and display name columns, and we’ll ignore everything else. First, we’ll concentrate on matching the first column.

Basic Matching

So right now, our table should look like a long list in our text editor, with each column of data separated by tabs (if your editor has used something else then adjust the instructions that follow accordingly). As we have many lines, a good place to start is forcing each match to start at the beginning of the line. This is done with:

^

Next, we start a group match. A group lets us refer back to it by number later on when we perform a replace:

^()

Now, we need to actually capture something inside our group. Looking at the first column, everything seems to be two lowercase letters, then a hyphen, followed by two uppercase letters, so let’s try that:

^([a-z]{2}-[A-Z]{2})

A few new things here. The first is the set of square brackets. These match a single character out of a range of characters; in this case our range is the letters 'a' through to 'z' (and the uppercase equivalent after the hyphen). Just after both of these there is the {2}. This says to match the previous thing exactly two times. So how did our regex fare? Mostly well, until it gets to Azeri, where it fails because those entries have 3 parts in their language culture name, which doesn’t match our pattern. So, we can change it to this:

^((?:[A-Z][a-z]-)?[a-z]{2}-[A-Z]{2})

This is a bit more complex. First we add another group with brackets inside brackets, but at the start of this one we add ?: . This tells the engine to match but not capture, so that when we come to do our replacement we still only have one capture, and that is the whole culture column. The other new part is the ? after the nested brackets. This tells the regex that the nested group is optional; it might be there, it might not. This is important because not all items in this column have 3 parts in their culture code.
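As a rough check of the non-capturing and optional behaviour described above, here is a minimal Python sketch. The two sample codes are illustrative stand-ins shaped like the table entries, not values copied from the source table.

```python
import re

# The pattern at this stage: an optional, non-capturing prefix inside one capture.
pattern = re.compile(r"^((?:[A-Z][a-z]-)?[a-z]{2}-[A-Z]{2})")

# Illustrative sample values (assumed for this sketch).
two_part = pattern.match("en-GB")
three_part = pattern.match("Cy-az-AZ")

# Two pairs of brackets, but only one of them captures.
group_count = pattern.groups

print(two_part.group(1))    # the prefix is optional, so a two-part code still matches
print(three_part.group(1))  # when the prefix is present it is part of group 1
print(group_count)
```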

Does it work? Nearly, but now we’re stuck at Chinese Simplified, because it has 3 uppercase characters at the end. How do we change our regex to allow 2 or 3 of the uppercase characters?

^((?:[A-Z][a-z]-)?[a-z]{2}-[A-Z]{2,3})

We change the last brace to {2,3}. This tells the regex to match 2 or 3 times, no more and no less. This works until Dhivehi in the Maldives, which has 3 lowercase characters in its first part (and it only contains the two parts). We update our regex by making the same change to our original lowercase match:

^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})

This works! Every single entry in that column is found.
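To see the progression from the first attempt to the final pattern, the following sketch compares the two against a handful of codes. The sample list is an assumption made for illustration, and fullmatch is used so that a partial match (such as only the first two uppercase letters) counts as a miss.

```python
import re

first_try = re.compile(r"^([a-z]{2}-[A-Z]{2})")
final = re.compile(r"^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})")

# Illustrative codes: plain, three-part, three uppercase, three lowercase.
samples = ["en-GB", "Cy-az-AZ", "zh-CHS", "div-MV"]

# fullmatch requires the whole code to be consumed by the pattern.
missed_by_first_try = [s for s in samples if first_try.fullmatch(s) is None]
missed_by_final = [s for s in samples if final.fullmatch(s) is None]

print(missed_by_first_try)
print(missed_by_final)
```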

Negative Matches

Next, we want to match the tab (or other field delimiter that your text editor might be using), but we don’t want to capture it. For this reason, we put this match outside of the brackets, like so:

^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})\t

The \t might look familiar; it’s pretty much universal across languages as the way to escape a tab. You can put the literal tab character in there, but it’s hard to type in some editors, as tab usually moves your focus instead.

Next, we want to match our second field. Scanning down the list, it looks like a mix of letters, spaces, hyphens, and brackets. Now we could match those specifically, but there’s another way:

^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})\t([^\t]+)

That’s it. Another group match with the brackets, and inside that we have our familiar square brackets. The unfamiliar part is the circumflex after the opening square bracket. This tells the regex to match anything that isn’t inside the square brackets. The only thing inside the square brackets is the tab escape character. On its own, this will only match a single character, so we add the + to the end to tell it to match multiple occurrences, which will match all characters that aren’t a tab, essentially the whole of the second field! Admittedly, the first column could have been captured in the same manner, but I wanted to illustrate the different techniques.

Lastly, we want to match the rest of the line without capturing it, which we can do with the regex wildcard:

^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})\t([^\t]+).+

The wildcard . matches any single character (other than a line break, by default), and the + tells the engine to match that one or more times, essentially as many as the regex engine can match. This part is matched but not captured, so we can’t use it later in any replacements. If we wanted to, we could make it a group by enclosing it in brackets (.+), but that's not necessary for our requirements here.
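Outside of a text editor, the same expression can be tried in code. This is a small Python sketch; the row below is a hypothetical tab-separated line, with the trailing columns invented purely to give the final .+ something to consume.

```python
import re

line_pattern = re.compile(
    r"^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})\t([^\t]+).+"
)

# Hypothetical row: culture name, display name, then columns we ignore.
row = "en-GB\tEnglish - United Kingdom\t0x0809\tENG"

match = line_pattern.match(row)

# Only the two bracketed groups are captured; the trailing .+ is matched but discarded.
code, display_name = match.groups()

print(code)
print(display_name)
```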

Performing the Replacements

So, now you have captured the parts of each line you care about, you can run a replacement to put each into its own <option> tag ready to use in your HTML.

Different editors have different ways of doing this, but typically they fall into two camps, using either the $ or a \ to denote a captured group to use as a replacement, but usually it’s the $. So what does our replacement look like?

<option value="$1">$2</option>

<option value="\1">\2</option>

That’s it, and hopefully it should be pretty self-explanatory. If you opt to replace all, then every line will now be replaced with a valid <option> tag that you can just drop into your application.
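The same find and replace can be sketched in Python, which falls into the \1 camp for replacement strings. The two rows of table text are made up for illustration, and re.MULTILINE is needed so that ^ anchors at the start of every line rather than only at the start of the whole text.

```python
import re

line_pattern = re.compile(
    r"^((?:[A-Z][a-z]-)?[a-z]{2,3}-[A-Z]{2,3})\t([^\t]+).+",
    re.MULTILINE,  # let ^ match at the start of each line
)

# Hypothetical pasted table: tab-separated, one entry per line.
table_text = (
    "en-GB\tEnglish - United Kingdom\t0x0809\tENG\n"
    "div-MV\tDhivehi - Maldives\t0x0465\tDIV"
)

# \1 and \2 refer back to the first and second captured groups.
options = line_pattern.sub(r'<option value="\1">\2</option>', table_text)

print(options)
```

Note that [^\t]+ would also run across a line break if a row had no third column, so this sketch assumes every row has at least three tab-separated fields.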

Greedy and Lazy Pattern Matching

When matching a wide selection of characters using a simple broad pattern like [a-z]+, sometimes you can accidentally capture more than you expect. This is because quantifiers are, by default, greedy. That is, they attempt to match the longest string they can for each pattern you have.

Consider the following regex, which is intended to capture every <img> tag in some source HTML:

<img .+>

At first it looks like it should work, but if you ran it against the following HTML snippet, you would actually get a lot more than you expected:

<p><img src="image.jpg"><em>Description of image</em></p>

It will actually match from the start of the <img> tag right up to the end of the line, including the closing </p> tag. This is the greedy nature of regular expressions. So what can be done to fix the expression to behave as we would like?

Well, the ? symbol has a secondary function, which turns our greedy matches into non-greedy or lazy ones, by telling the regex engine to match as few times as possible while still allowing the rest of the pattern to match:

<img .+?>

The ? can also be combined with * to match zero or more but only as much as necessary to meet the regex requirements.
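The difference between the greedy and lazy versions is easy to see side by side. This Python sketch runs both expressions from above against the same snippet.

```python
import re

html = '<p><img src="image.jpg"><em>Description of image</em></p>'

# Greedy: .+ grabs as much as it can, so the match ends at the LAST > on the line.
greedy = re.search(r"<img .+>", html).group(0)

# Lazy: .+? grabs as little as it can, so the match ends at the FIRST >.
lazy = re.search(r"<img .+?>", html).group(0)

print(greedy)
print(lazy)
```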

What Else Can Regular Expressions Do?

There are a lot of things that I’ve not yet covered in much detail, but here is a brief run-down of some of the more useful features:

Anchors

Rather than being an explicit match of a character themselves, anchors match the gaps between characters in a string. We’ve seen one already, the ^, which matches the start of the line. The common anchors are:

Anchor Behaviour
^ Start of string (or of each line in multi-line mode)
$ End of string (or of each line in multi-line mode)
\b A word boundary, useful to find the beginning of words
\B Non word boundary
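A short Python sketch of the anchors in the table. The sample text is invented, and the re.MULTILINE flag is Python's way of switching ^ and $ from string anchors to line anchors; other engines and editors may default differently.

```python
import re

text = "cat concat\ncatalog"

# Without flags, ^ only matches at the very start of the string.
starts = re.findall(r"^cat", text)

# With MULTILINE, ^ also matches at the start of each line.
starts_multiline = re.findall(r"^cat", text, re.MULTILINE)

# \b on both sides: "cat" as a whole word only.
whole_words = re.findall(r"\bcat\b", text)

# \B before, \b after: "cat" at the end of a longer word, as in "concat".
word_endings = re.findall(r"\Bcat\b", text)

print(len(starts), len(starts_multiline), whole_words, word_endings)
```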

Character Matchers

These can be used to match a specific character, from letters to numbers and more.

Matcher Behaviour
\n New line
\r Carriage return
\t Tab
\s White space
\d Digit
\D Non digit
\w Word character (this includes letters, digits, and the underscore)
\W Non word character
\ When followed by any other character it will escape it, useful for doing things like \. to match the . character, as it is treated as the wildcard character normally in regular expressions
\x44 Matches an uppercase D. The number 44 is the hexadecimal ASCII code of whichever character you want to match
\x{265E} or \u265E Matches the black knight chess character ♞ You can use this when the character you want to match goes beyond the standard ASCII range. Note that you might have to switch which style you use depending on the editor or language of the code you’re using this in
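A few of these matchers in action, as a minimal Python sketch with made-up sample strings. Python's re module uses the \u265E style for the chess knight; it does not accept the \x{265E} form.

```python
import re

numbers = re.findall(r"\d+", "Order 66: v1.5")            # runs of digits
words = re.findall(r"\w+", "snake_case, kebab-case")      # underscore is a word character, hyphen is not
fields = re.split(r"\s+", "a \t b\nc")                    # spaces, tabs and newlines are all white space

# An unescaped . is the wildcard; \. matches only a literal full stop.
unescaped_dot = bool(re.fullmatch(r"1.5", "1x5"))
escaped_dot = bool(re.fullmatch(r"1\.5", "1x5"))

hex_escape = re.search(r"\x44", "ABCD").group(0)          # hex 44 is "D"
knight = re.search(r"\u265E", "chess \u265E piece").group(0)

print(numbers, words, fields, unescaped_dot, escaped_dot, hex_escape, knight)
```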

Quantifiers

These determine how many times a match should be applied.

Quantifier Behaviour
* 0 or more of the previous match
+ 1 or more of the previous match
? 0 or 1 of the previous match
{5} Exactly 5 of the previous match
{1,10} Between 1 and 10 of the previous match
{4,} 4 or more of the previous match
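Each quantifier in the table can be checked with a whole-string match. The helper name fully_matches and the sample inputs are assumptions made for this sketch.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "a* on an empty string": fully_matches(r"a*", ""),             # zero is allowed
    "a+ on an empty string": fully_matches(r"a+", ""),             # at least one required
    "colou?r on color": fully_matches(r"colou?r", "color"),        # the u is optional
    "colou?r on colour": fully_matches(r"colou?r", "colour"),
    "5 digits on 1234": fully_matches(r"\d{5}", "1234"),           # one digit short
    "5 digits on 12345": fully_matches(r"\d{5}", "12345"),
    "1 to 10 digits on 11 digits": fully_matches(r"\d{1,10}", "12345678901"),
    "4 or more digits on 123": fully_matches(r"\d{4,}", "123"),
    "4 or more digits on 123456": fully_matches(r"\d{4,}", "123456"),
}

for label, outcome in checks.items():
    print(f"{label}: {outcome}")
```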

Ranges

Ranges allow you to pick start and end characters and match everything in between, or generate negative range matches to capture strings that aren't in your range.

Range Behaviour
[abc] Matches exactly one a, b, or c character
[a-f] Matches exactly one character between a and f inclusively
[0-5] Matches one number character between 0 and 5. Note that ranges work on single characters, not numeric values, so you can’t use [0-20] to match a number between 0 and 20
[a-fA-F] Match one character between a and f or A and F
[^abc] Match exactly one character that isn’t a, b, or c
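The point about [0-20] is worth a quick demonstration. In this Python sketch, the alternation used to match 0 to 20 is one possible workaround rather than anything prescribed above.

```python
import re

# A negated range: every character that is not a, b or c.
not_abc = re.findall(r"[^abc]", "aabbccxyz")

# [0-20] is just the characters 0, 1 and 2 (the range 0-2 plus a redundant 0),
# and a range only ever matches ONE character.
range_matches_twenty = re.fullmatch(r"[0-20]", "20") is not None

# One way to match the numbers 0 to 20: spell out the digit patterns.
zero_to_twenty = re.compile(r"20|1[0-9]|[0-9]")
accepted = [n for n in range(25) if zero_to_twenty.fullmatch(str(n))]

print(not_abc)
print(range_matches_twenty)
print(accepted)
```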

Groups

Groups allow you to capture matches to use later, either within a replace or as part of your application code.

Group Behaviour
(abc) Regular group that matches ‘abc’ and captures it as the first group
(abc(def)) Nested group that matches ‘abcdef’ and ‘def’ and captures them in the first and second groups respectively
(abc|xyz) Matches ‘abc’ or ‘xyz’ and captures it in the first group
(?:abc) Matches ‘abc’ but doesn’t capture it. This can be useful when combined with the | to match either one thing or another without capturing it
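How groups are numbered becomes clearer when you look at what the engine hands back. This is a small Python sketch; the "xyz-42" input is invented for the example.

```python
import re

# Nested groups are numbered by the position of their opening bracket.
nested = re.match(r"(abc(def))", "abcdef").groups()

# Alternation inside a capturing group: whichever branch matched is captured.
either = re.match(r"(abc|xyz)", "xyz").group(1)

# A non-capturing group does not take a number, so the digits become group 1.
non_capturing = re.match(r"(?:abc|xyz)-(\d+)", "xyz-42").groups()

print(nested)
print(either)
print(non_capturing)
```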

Unicode Matches

Many languages support advanced Unicode character matching out of the box (JavaScript only does so when the u flag is set, and only from ES2018 onwards). This allows you to target specific types of letters or numbers more easily in a multilingual way, without the need to manually specify each character range for each language.

Unicode Escape Behaviour
\p{L} Matches any single letter, including letters in other languages
\p{Ll} Matches any single lowercase letter
\p{Lu} Matches any single uppercase letter
\p{N} Matches any single numeric character, including characters like ① and ¼
\p{P} Matches any single punctuation character
\p{S} Matches any single symbol
\p{Sc} Matches any currency symbol
\p{Greek} Matches any single character from the Greek alphabet. Other script groups are available, such as Braille, Hebrew, and Tibetan. See below for more details

There is a fuller list of the Unicode character matchers available at http://php.net/manual/en/regexp.reference.unicode.php which includes a list of scripts (Unicode characters are grouped into sets called scripts) that you can use in the \p{Script} format.
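Python's built-in re module is one of the engines that does not understand \p{...} escapes (the third-party regex package does). So rather than a regex, the sketch below swaps in the standard unicodedata module to show what the categories in the table actually select. The helper name in_category and the sample text are assumptions for illustration, and the Greek check via character names is only a rough approximation of a script match.

```python
import unicodedata

def in_category(char: str, prefix: str) -> bool:
    # Rough stand-in for \p{...}: compare the Unicode general category,
    # e.g. "Lu" for uppercase letters or "Sc" for currency symbols.
    return unicodedata.category(char).startswith(prefix)

text = "Ünïcode ¼ € ♞ β"

letters = [c for c in text if in_category(c, "L")]     # like \p{L}
uppercase = [c for c in text if in_category(c, "Lu")]  # like \p{Lu}
numbers = [c for c in text if in_category(c, "N")]     # like \p{N}
symbols = [c for c in text if in_category(c, "S")]     # like \p{S}
currency = [c for c in text if in_category(c, "Sc")]   # like \p{Sc}

# Approximation of \p{Greek}: the character's official name starts with GREEK.
greek = [c for c in text if unicodedata.name(c, "").startswith("GREEK")]

print(letters, uppercase, numbers, symbols, currency, greek)
```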

Further Reading

Dave Child has put together a full regular expression cheat sheet, which is a great resource listing far more than I have given an overview of here. It is particularly useful for taking your expressions further with things like lookaheads and lookbehinds.

If you’re looking to test out your regular expressions then you might find the tester at https://regex101.com/ helpful. It allows you to test against different languages (as each has subtle differences with regard to syntax). It can even generate code in the language of your choice to drop into your application.

Another great tool to help visually build your expressions is RegExr which highlights the matches as you type, and provides plenty of helpful examples which you can use to generate your own expressions.

A Word of Warning

Earlier I alluded to the problem that regular expressions cannot be reliably used for parsing HTML. On the face of it, HTML doesn't seem that complex, and seems the perfect candidate for a regex.

Consider our earlier example of matching <img> tags in an HTML document:

<img .+?>

Now, if you have HTML that looks like this, you will obviously not get what you expect:

<p><img src="http://example.com/image.jpg" alt="an image showing something > something else"/></p>

Of course, we could modify the regex to look for the > immediately after a /, but what if our HTML tags don't use self-closing syntax? It isn't required by HTML, which leads us to create ever more complex expressions to handle the little edge cases. At that point, we're far better off using a proper HTML document parser and letting that do the work for us.
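As one illustration of the parser route, here is a minimal sketch using the HTMLParser class from Python's standard library. The class name ImgCollector is made up for this example, and a real project might well prefer a fuller HTML library.

```python
import re
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            self.images.append(dict(attrs))

html = ('<p><img src="http://example.com/image.jpg" '
        'alt="an image showing something > something else"/></p>')

# The lazy regex stops at the first >, which sits inside the alt text.
regex_result = re.search(r"<img .+?>", html).group(0)

# The parser understands quoted attribute values, so the > causes no trouble.
collector = ImgCollector()
collector.feed(html)

print(regex_result)
print(collector.images)
```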

Another example is the infamous problem of using regular expressions to validate an email address. Despite what you might think, email addresses are incredibly complicated. So complicated, in fact, that the most complete expression for validating email addresses is over 6KB long! That's not exactly something that you want in your validation code; it's terribly unreadable and impossible to maintain manually. The solution is to either rely on the browser's own validation for email addresses (results may vary, as some browsers use some awfully limiting expressions of their own) or to rely on existing server-side libraries that you can include and use without worrying about how the validation is implemented.

At the end of the day, it's always worth bearing in mind Jamie's comment regarding regular expressions. Use them when the situation allows and it makes sense, and always remember that if the only tool you have is a hammer, not everything is a nail.
