PHP Regex Part – 2 (Advanced regexes)
If you are confused already, it is probably best that you re-read the last section before continuing – the expressions only get more complicated!
We have gone through basic and novice regexes; now we’re onto the powerful stuff: advanced regexes. Outside of sets, the characters +, *, ?, { }, $, and ^ have special meanings.
The first four control how many times the previous expression may be matched, and the last two anchor the match to a position. + means “match one or more of the previous expression”, * means “match zero or more of the previous expression”, and ? means “match 0 or 1 of the previous expression”.
Here are some examples:
<?php
preg_match("/[A-Za-z]*/", $string);
preg_match("/-?[0-9]+/", $string);
// Single quotes: in a double-quoted string PHP would consume the backslash in \$
preg_match('/\$[A-Za-z_][A-Za-z_0-9]*/', $string);
?>
The first expression can be translated as “match zero or more uppercase and lowercase letters”. It will match “”, “a”, “aaaa”, and the “The” at the start of “The sun has got his hat on”. Because zero letters is an acceptable match and the pattern is not anchored, preg_match() in fact succeeds on any string at all, finding an empty match when the string does not start with a letter. The second regex will match 1, 100, 324343995, and also -1, -100, -234011, etc. The “-?” means “match 0 or 1 minus symbols”.
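To see what these two expressions actually capture, here is a small sketch that passes a third argument to preg_match() so the matched text can be printed. The sample strings in $samples and the “Balance:” string are made up for illustration only.

```php
<?php
// Hypothetical sample inputs, chosen only for illustration.
$samples = ["", "The sun has got his hat on", "12345"];

foreach ($samples as $string) {
    // * allows zero repetitions, so the pattern succeeds on every input.
    $result = preg_match("/[A-Za-z]*/", $string, $matches);
    printf("%d '%s'\n", $result, $matches[0]);
}
// Prints:
// 1 ''
// 1 'The'
// 1 ''

// -? makes the minus sign optional; + requires at least one digit.
preg_match("/-?[0-9]+/", "Balance: -234011", $matches);
echo $matches[0], "\n"; // prints -234011
```

Note how the first pattern reports success even for “12345”, where the match is simply empty.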
The last regex is fairly complicated, but, as always with regexes, complexity == power. As mentioned before, $ is a regex symbol in its own right; here, however, we precede it with a backslash, which, unsurprisingly, works as an escape character, turning the $ into a standard character and not a regex symbol. Be careful with quoting: inside a double-quoted PHP string, \$ is consumed by PHP itself and the regex engine sees only a bare $, so write the pattern in single quotes to make sure the backslash reaches the regex. We then match precisely one symbol from the range A-Z, a-z, and _, then match zero or more symbols from the range A-Z, a-z, underscore, and 0-9. What kind of text would that match? Here are some examples: $A, $B, $C, $foo, $bar, $Test99, $_MyTest, $__Foo__. Look familiar? That’s right: that regex will match PHP variables.
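As a quick check of the variable-matching regex, the sketch below pulls every variable name out of a line of code. It uses preg_match_all(), which the text above does not cover, in place of preg_match() so that all matches are returned; the contents of $code are a made-up example.

```php
<?php
// Hypothetical snippet to scan; single quotes stop PHP from
// interpolating the variables inside it.
$code = 'echo $foo + $_MyTest * $Test99 - 7up;';

// Single-quoted pattern: the regex engine receives \$ and treats it
// as a literal dollar sign rather than an end-of-line anchor.
preg_match_all('/\$[A-Za-z_][A-Za-z_0-9]*/', $code, $found);

echo implode(", ", $found[0]), "\n"; // prints $foo, $_MyTest, $Test99
```

“7up” is skipped because it has no leading dollar sign.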
Opening braces { and closing braces } can be used to define specific repeat counts in three different ways. Firstly, {n}, where n is a positive number, will match n instances of the previous expression. Secondly, {n,} will match a minimum of n instances of the previous expression. Finally, {m,n} will match a minimum of m instances and a maximum of n instances of the previous expression. Note that there are no spaces inside the braces.
Here is a list of advanced regular expressions using braces, with the string each one is tested against, and whether or not a match is made:
Regex | String | Result |
/[A-Z]{3}/ | FuZ | No match; the regex will match precisely three uppercase letters |
/[A-Z]{3}/i | FuZ | Match; same as above, but case insensitive this time |
/[0-9]{3}-[0-9]{4}/ | 555-1234 | Match; precisely three digits, a dash, then precisely four digits. This will match local US telephone numbers, for example |
/[a-z]+[0-9]?[a-z]{1}/ | aaa1 | Match; the regex is not anchored, so it matches “aaa” (the optional digit matches nothing) and ignores the trailing 1 |
/[A-Z]{1,}99/ | 99 | No match; must start with at least one uppercase letter |
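/[A-Z]{1,5}99/ | FINGERS99 | Match; the regex is not anchored, so it matches “NGERS99”. Anchored as /^[A-Z]{1,5}99/ there would be no match, because the string starts with more than 5 uppercase letters |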
/[A-Z]{1,5}[0-9]{2}/i | adams42 | Match |
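Several of the rows above can be checked directly. This is a minimal sketch that runs a handful of the pattern/string pairs through preg_match() and collects the return values; the $cases and $results names are just for illustration.

```php
<?php
// Pattern and subject pairs copied from rows of the table above.
$cases = [
    ['/[A-Z]{3}/', 'FuZ'],
    ['/[A-Z]{3}/i', 'FuZ'],
    ['/[0-9]{3}-[0-9]{4}/', '555-1234'],
    ['/[A-Z]{1,}99/', '99'],
    ['/[A-Z]{1,5}[0-9]{2}/i', 'adams42'],
];

$results = [];
foreach ($cases as $case) {
    // preg_match() returns 1 when the pattern is found and 0 when it is not.
    $results[] = preg_match($case[0], $case[1]);
}

echo implode(" ", $results), "\n"; // prints 0 1 1 0 1
```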
Finally, we have the dollar $ and caret ^ symbols, which mean “end of line” and “start of line” respectively. Consider the following string:
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";
As you know, \n means “new line”, so what we have there is a string containing the following text:
This is
a long test
to see whether
the dollar
Symbol
and the
caret symbol
work as planned
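By default, ^ and $ match only at the very start and end of the whole string (with $ also matching just before a final newline). To make them match at the start and end of every line in a multi-line string, we need the “m” modifier, which goes after the final slash. Here is some PHP code. Which expressions do you think will match?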
<?php
preg_match("/is$/m", $multitest);
preg_match("/the$/m", $multitest);
preg_match("/^the/m", $multitest);
preg_match("/^Symbol/m", $multitest);
preg_match("/^[A-Z][a-z]{1,}/m", $multitest);
?>
The answer is “all of them”: they all match. Line one means “return true if ‘is’ is at the end of a line”, line two is “return true if ‘the’ is at the end of a line”, and line three is “return true if ‘the’ is at the start of a line”. Line four is “return true if ‘Symbol’ is at the start of a line”, and line five is “return true if there is a capital letter followed by one or more lowercase letters at the start of a line”.
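To see what the “m” modifier changes, here is a short sketch that runs the same pattern with and without it, then lists every line start that matches the capital-letter pattern. The variable names $withoutM, $withM, and $caps are only for this example, and preg_match_all() is used so that all matches are collected.

```php
<?php
$multitest = "This is\na long test\nto see whether\nthe dollar\nSymbol\nand the\ncaret symbol\nwork as planned";

// Without "m", ^ anchors to the start of the whole string only,
// so "Symbol" on an inner line is not found.
$withoutM = preg_match("/^Symbol/", $multitest);
// With "m", ^ also matches immediately after each newline.
$withM = preg_match("/^Symbol/m", $multitest);

echo $withoutM, " ", $withM, "\n"; // prints 0 1

// Collect every line start that is a capital letter followed by
// one or more lowercase letters.
preg_match_all("/^[A-Z][a-z]{1,}/m", $multitest, $caps);
echo implode(", ", $caps[0]), "\n"; // prints This, Symbol
```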
As you can see, matching the beginning and end of a line is simple with the $ and ^ characters, but when combined with +, *, ?, and { }, your regular expression-matching ability should rocket upwards.
However, we’re not finished yet, grasshopper – if you wish to attain regex nirvana, you need to understand the last few secrets of regex wisdom…