PHP Regex Part – 2 (Advanced regexes)

If you are confused already, it is probably best that you re-read the last section before continuing – the expressions only get more complicated!

We have covered the basic regexes; now we’re onto the more powerful stuff: advanced regexes. Used outside of sets, the characters +, *, ?, { }, $, and ^ take on special meaning.

The first four control how many times the previous expression should match, and the last two control where in the string the match may occur. + means “match one or more of the previous expression”, * means “match zero or more of the previous expression”, and ? means “match 0 or 1 of the previous expression”.

Here are some examples:

<?php
    preg_match("/[A-Za-z]*/", $string);
    preg_match("/-?[0-9]+/", $string);
    // Single quotes here: inside double quotes PHP would turn \$ into a bare $
    preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
?>

The first expression will match “”, “a”, “aaaa”, and any other run of uppercase and lowercase letters – the expression can be translated as “match zero or more uppercase and lowercase letters”. Because zero letters is acceptable and the pattern is not anchored, preg_match() will report a match against any string at all; given “The sun has got his hat on”, it matches just “The” and stops at the first space. The second regex will match 1, 100, 324343995, and also -1, -100, -234011, etc – the “-?” means “match exactly 0 or 1 minus symbols”.
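To see exactly which text each pattern picks up, you can pass a third argument to preg_match() and inspect it. The following is only a sketch: first_match() is a helper name made up for this example, the sample strings are invented, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the text matched by the whole pattern, or null.
function first_match(string $pattern, string $subject): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    if (preg_match($pattern, $subject, $matches) === 1) {
        return $matches[0]; // index 0 holds the text matched by the full pattern
    }
    return null;
}

var_dump(first_match("/[A-Za-z]*/", "The sun has got his hat on")); // string(3) "The"
var_dump(first_match("/[A-Za-z]*/", "1234"));                       // string(0) ""
var_dump(first_match("/-?[0-9]+/", "balance: -100"));               // string(4) "-100"
var_dump(first_match("/-?[0-9]+/", "no digits here"));              // NULL
```

Notice that the first pattern “matches” even against 1234, because an empty run of letters still counts as zero or more.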

The last regex is fairly complicated, but, as always with regexes, complexity == power. As mentioned before, $ is a regex symbol in its own right, however here we precede it with a backslash, which, unsurprisingly, works as an escape character turning the $ into a standard character and not a regex symbol. Note that the backslash only reaches the regex engine if the pattern is written in single quotes (or as \\\$ in double quotes), because PHP itself treats \$ in a double-quoted string as an escape for a plain $. We then match precisely one symbol from the range A-Z, a-z, and _, then match zero or more symbols from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That’s right – that regex will match PHP variables.
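As a rough illustration, here is the variable-matching regex used with preg_match_all() to pull every variable-like name out of a piece of text. The $source string is a made-up sample, and both it and the pattern are single-quoted so that PHP neither interpolates the variables nor swallows the backslash.

```php
<?php
// Single-quoted pattern: the backslash is passed through to the regex engine.
$pattern = '/\$[A-Za-z_][A-Za-z_0-9]*/';

// Invented sample text; the trailing "9$" has no name after the dollar sign.
$source = '$foo = $bar + $Test99; $_MyTest = $__Foo__; echo 9$;';

preg_match_all($pattern, $source, $matches);
print_r($matches[0]); // $foo, $bar, $Test99, $_MyTest, $__Foo__

// For comparison: in double quotes PHP turns \$ into a bare $, which the
// regex engine reads as the end-of-string anchor, so nothing can match.
var_dump(preg_match("/\$[A-Za-z_][A-Za-z_0-9]*/", '$foo')); // int(0)
```

The last line shows why the quoting matters: the same characters in a double-quoted string produce a different regex.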

Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.

Here is a list of advanced regular expressions using braces, with the string tested against each one and whether or not a match is made:

Regex                      String      Result
/[A-Z]{3}/                 FuZ         No match; the regex needs three uppercase letters in a row
/[A-Z]{3}/i                FuZ         Match; same as above, but case insensitive this time
/[0-9]{3}-[0-9]{4}/        555-1234    Match; precisely three digits, a dash, then precisely four digits. This will match local US telephone numbers, for example
/[a-z]+[0-9]?[a-z]{1}/     aaa1        Match; the regex is not anchored, so it matches “aaa” and ignores the trailing 1. It would only fail if anchored as /^[a-z]+[0-9]?[a-z]{1}$/, which requires the string to end with a lowercase letter
/[A-Z]{1,}99/              99          No match; at least one uppercase letter must come before the 99
/[A-Z]{1,5}99/             FINGERS99   Match; the regex is not anchored, so it matches “NGERS99”. It would only fail if anchored as /^[A-Z]{1,5}99$/, which allows a maximum of 5 uppercase letters before the 99
/[A-Z]{1,5}[0-9]{2}/i      adams42     Match
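None of these patterns is anchored, so a match may cover only part of the string. A quick way to check any row is simply to run it. The sketch below restates the table in an array ($rows and $results are names chosen for this example) and prints the substring that matched, if any; the array destructuring in the foreach assumes PHP 7.1 or later.

```php
<?php
// Each entry is [regex, string to test], copied from the table.
$rows = [
    ['/[A-Z]{3}/',             'FuZ'],
    ['/[A-Z]{3}/i',            'FuZ'],
    ['/[0-9]{3}-[0-9]{4}/',    '555-1234'],
    ['/[a-z]+[0-9]?[a-z]{1}/', 'aaa1'],
    ['/[A-Z]{1,}99/',          '99'],
    ['/[A-Z]{1,5}99/',         'FINGERS99'],
    ['/[A-Z]{1,5}[0-9]{2}/i',  'adams42'],
];

$results = [];
foreach ($rows as [$regex, $string]) {
    // $m[0] holds the matched substring when preg_match() returns 1.
    $matched = preg_match($regex, $string, $m) === 1;
    $results[] = $matched ? $m[0] : null;
    printf("%-26s %-10s %s\n", $regex, $string, $matched ? "Match: {$m[0]}" : 'No match');
}
```

Printing the matched substring, rather than just yes or no, makes it obvious when a pattern has matched only a fragment such as “NGERS99”.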

Finally, we have the dollar $ and caret ^ symbols, which mean “end of line” and “start of line” respectively. Consider the following string:

$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

As you know, \n means “new line”, so what we have there is a string containing the following text:

This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned

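By default, ^ and $ match only at the very start and end of the whole string (with $ also allowed just before a final newline). To make them match at the start and end of every line in a multi-line string, we need the “m” modifier, which goes after the final slash. Here is some PHP code – which expressions do you think will match?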

<?php
    preg_match("/is$/m", $multitest);
    preg_match("/the$/m", $multitest);
    preg_match("/^the/m", $multitest);
    preg_match("/^Symbol/m", $multitest);
    preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>

The answer is “all of them” – they all match. Line one means “return true if ‘is’ is at the end of a line”, line two is “return true if ‘the’ is at the end of a line”, and line three is “return true if ‘the’ is at the start of a line”. Line four is “return true if ‘Symbol’ is at the start of a line”, and line five is “return true if there is a capital letter followed by one or more lowercase letters at the start of a line”.
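To see what the “m” modifier actually changes, it can help to run each pattern twice, once with it and once without. This is just a sketch: $patterns and $summary are names invented for the example, and the patterns are stored without their slashes so the modifier can be added or left off.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

// The five patterns from the code above, without delimiters or modifiers.
$patterns = ['is$', 'the$', '^the', '^Symbol', '^[A-Z][a-z]{1,}'];

$summary = [];
foreach ($patterns as $p) {
    $multi  = preg_match('/' . $p . '/m', $multitest); // ^ and $ match at every line boundary
    $single = preg_match('/' . $p . '/', $multitest);  // ^ and $ match only at the ends of the string
    $summary[$p] = [$multi, $single];
    printf("%-18s with m: %d   without m: %d\n", $p, $multi, $single);
}
```

Without “m”, only the last pattern still matches, because “This” happens to sit at the very start of the whole string.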

As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.

However, we’re not finished yet, grasshopper – if you wish to attain regex nirvana, you need to understand the last few secrets of regex wisdom…
