Perl 5.6.0 introduced a few more character classes into the mix. First come those defined by the POSIX (Portable Operating System Interface) standard, which are therefore present in a number of other applications. The more common character classes here are:
Shortcut     Expansion     Description
[[:alpha:]]  [a-zA-Z]      An alphabetic character.
[[:alnum:]]  [0-9A-Za-z]   An alphabetic or numeric character.
[[:digit:]]  \d            A digit, 0-9.
[[:lower:]]  [a-z]         A lower case letter.
[[:upper:]]  [A-Z]         An upper case letter.
[[:punct:]]  [!"#$%&'()*+,-./:;<=>?@\[\\\]^_`{|}~]  A punctuation character; note the escaped characters [, \, and ].
The Unicode standard also defines 'properties', which apply to some characters. For instance, the 'IsUpper' property can be used to match any upper-case character, in whichever language or alphabet. If you know the property you are trying to match, you can use the syntax \p{} to match it; for instance, an upper-case character is matched by \p{IsUpper}.
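To see these classes at work outside matchtest.plx, here is a minimal sketch. The sample strings and variable names are our own assumptions, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from matchtest.plx.
my $ascii = 'Route 66!';
my $greek = "\x{0394}elta";    # starts with GREEK CAPITAL LETTER DELTA

# A POSIX class goes inside an enclosing bracketed character class.
my @letters = $ascii =~ /[[:alpha:]]/g;
my @digits  = $ascii =~ /[[:digit:]]/g;
my @punct   = $ascii =~ /[[:punct:]]/g;

# [A-Z] only knows the 26 ASCII capitals; \p{IsUpper} accepts
# upper-case letters from any alphabet.
my $ascii_only = $greek =~ /^[A-Z]/       ? 'yes' : 'no';
my $any_script = $greek =~ /^\p{IsUpper}/ ? 'yes' : 'no';

print "letters: @letters\n";
print "digits:  @digits\n";
print "punct:   @punct\n";
print "[A-Z] matches the first character:        $ascii_only\n";
print "\\p{IsUpper} matches the first character: $any_script\n";
```

The Greek capital is written as an escape so that the script itself stays plain ASCII.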
Alternatives
Instead of giving a series of acceptable characters, you may want to say 'match either this or that'. The 'either-or' operator in a regular expression uses the same symbol as the bitwise 'or' operator, |. So, to match either 'yes' or 'maybe' in our example, we could say this:
> perl matchtest.plx
Enter some text to find: yes|maybe
The text matches the pattern 'yes|maybe'.
>
That's either 'yes' or 'maybe'. But what if we wanted either 'yes' or 'yet'? To get alternatives on part of an expression, we need to group the options. In a regular expression, grouping is always done with parentheses:
> perl matchtest.plx
Enter some text to find: ye(s|t)
The text matches the pattern 'ye(s|t)'.
>
If we had forgotten the parentheses, we would have tried to match either 'yes' or 't'. In this case, we'd still get a positive match, but it wouldn't be doing what we want: we'd get a match for any string with a 't' in it, whether the words 'yes' or 'yet' were there or not.
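The difference between the grouped and ungrouped forms can be checked with a short script. This is only a sketch: the helper name alt_found and the three sample strings are our own assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub alt_found {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

# 'not at all' contains a 't' but neither 'yes' nor 'yet'.
for my $text ('yes', 'yet', 'not at all') {
    printf "%-10s  ye(s|t): %d  yes|t: %d\n",
        $text,
        alt_found('ye(s|t)', $text),
        alt_found('yes|t',   $text);
}
```

Only the grouped pattern rejects the last string; the ungrouped one is satisfied by the stray 't'.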
You can match either 'this' or 'that' or 'the other' by adding more alternatives:
> perl matchtest.plx
Enter some text to find: (this)|(that)|(the other)
'(this)|(that)|(the other)' was not found.
>
However, in this case, it's more efficient to separate out the common elements:
> perl matchtest.plx
Enter some text to find: th(is|at|e other)
'th(is|at|e other)' was not found.
>
You can also nest alternatives. Say you want to match one of these patterns:
❑ 'the' followed by whitespace or a letter,
❑ 'or'
You might put something like this:
> perl matchtest.plx
Enter some text to find: (the(\s|[a-z]))|or
The text matches the pattern '(the(\s|[a-z]))|or'.
>
It looks fearsome, but break it down into its components. Our two alternatives are:
❑ the(\s|[a-z])
❑ or
The second part is easy, while the first contains 'the' followed by two alternatives: \s and [a-z].
Hence 'either "the" followed by either a whitespace or a lower case letter, or "or"'. We can, in fact, tidy this up a little, by replacing (\s|[a-z]) with the less cluttered [\sa-z].
> perl matchtest.plx
Enter some text to find: (the[\sa-z])|or
The text matches the pattern '(the[\sa-z])|or'.
>
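We can check that the nested form and the tidied form find the same text. The sketch below assumes a stand-in test string and a hypothetical helper, all_matches, that collects every piece of text a pattern matches.

```perl
use strict;
use warnings;

# Assumed stand-in for the test string used by matchtest.plx.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Hypothetical helper: every substring that $pattern matches in $text.
sub all_matches {
    my ($pattern, $string) = @_;
    my @hits;
    push @hits, $& while $string =~ /$pattern/g;
    return @hits;
}

for my $pattern ('(the(\s|[a-z]))|or', '(the[\sa-z])|or') {
    my @hits = all_matches($pattern, $text);
    print "$pattern found: ", join(', ', map { "'$_'" } @hits), "\n";
}
```

With this string, both patterns should pick out 'the ' (from 'the Entish') and the 'or' inside 'for'.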
Repetition
We've now moved from matching a specific character to a more general type of character – when we don't know (or don't care) exactly what the character will be. Now we're going to see what happens when we want to talk about a more general quantity of characters: more than three digits in a row; two to four capital letters, and so on. The metacharacters that we use to deal with a number of characters in a row are called quantifiers.
Indefinite Repetition
The easiest of these is the question mark. It should suggest uncertainty: something may be there, or it may not. That's exactly what it does, stating that the immediately preceding character(s), or metacharacter(s), may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either 'he' or 'she', you can put:
> perl matchtest.plx
Enter some text to find: \bs?he\b
The text matches the pattern '\bs?he\b'.
>
To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:
> perl matchtest.plx
Enter some text to find: what the Entish (word )?is
The text matches the pattern 'what the Entish (word )?is'.
>
Notice that we had to put the space inside the group: otherwise, whenever 'word' is absent, the pattern demands two spaces between 'Entish' and 'is', whereas our text only has one:
> perl matchtest.plx
Enter some text to find: what the Entish (word)? is
'what the Entish (word)? is' was not found.
>
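The effect of where the space sits can be tried on both phrasings at once. This is a sketch only; the helper optional_match and the word list are our own assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $pattern, 0 otherwise.
sub optional_match {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my @phrases = ('what the Entish is', 'what the Entish word is');

for my $pattern ('what the Entish (word )?is', 'what the Entish (word)? is') {
    for my $phrase (@phrases) {
        printf "%-28s on %-26s %s\n",
            $pattern, "'$phrase'",
            optional_match($pattern, $phrase) ? 'matches' : 'not found';
    }
}

# \bs?he\b accepts 'he' and 'she', but not the 'he' inside 'the'.
print "$_: ", optional_match('\bs?he\b', $_), "\n" for qw(he she the);
```

The second pattern only succeeds when 'word' is actually present, because the space outside the group is always required.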
As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign – to match an entire word without specifying how long it should be, you can say:
> perl matchtest.plx
Enter some text to find: \b\w+\b
The text matches the pattern '\b\w+\b'.
>
In this case, we match the first available word – I.
If, on the other hand, you have something which may be there any number of times but might not be there at all – zero or one or many – you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any – but possibly no – spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:
> perl matchtest.plx
Enter some text to find: ^\s*[A-Z]
'^\s*[A-Z]' was not found.
>
Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.
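Both the plus and the star can be tried directly. The sketch below assumes a stand-in for the test string (one that begins with a double quote, as described above); the variable names are hypothetical.

```perl
use strict;
use warnings;

# Assumed stand-in for the test string: it begins with a double quote.
my $quoted = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# The same text with the leading quote taken away.
(my $unquoted = $quoted) =~ s/^"//;

# \w+ needs at least one word character, so the first hit is a whole word.
my ($first_word) = $quoted =~ /\b(\w+)\b/;

# \s* allows zero or more spaces, but ^ still pins the match to the start.
my $quoted_ok   = $quoted           =~ /^\s*[A-Z]/ ? 1 : 0;
my $unquoted_ok = $unquoted         =~ /^\s*[A-Z]/ ? 1 : 0;
my $indented_ok = "   $unquoted"    =~ /^\s*[A-Z]/ ? 1 : 0;

print "first word: $first_word\n";
print "with leading quote: $quoted_ok\n";
print "without quote:      $unquoted_ok\n";
print "indented, no quote: $indented_ok\n";
```

The indented case shows the star doing its job: three leading spaces are skipped before the capital is found.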
Let's review the three quantifiers:
/bea?t/ Matches either 'beat' or 'bet'
/bea+t/ Matches 'beat', 'beaat', 'beaaat'…
/bea*t/ Matches 'bet', 'beat', 'beaat'…
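The table can be reproduced with a few lines of Perl. This is a sketch under our own assumptions: the candidate words are made up, and each pattern is anchored at both ends so that only whole words count.

```perl
use strict;
use warnings;

# Hypothetical candidate words with zero to three 'a's.
my @candidates = qw(bet beat beaat beaaat);

my %accepted;
for my $pattern ('bea?t', 'bea+t', 'bea*t') {
    # Anchor both ends so the whole word must fit the pattern.
    $accepted{$pattern} = [ grep { /^(?:$pattern)\z/ } @candidates ];
    print "/$pattern/ accepts: @{ $accepted{$pattern} }\n";
}
```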
Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:
A regular expression should hardly ever start or finish with a starred character.
You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
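Two small experiments illustrate both warnings. They are a sketch only, using sample strings of our own invention.

```perl
use strict;
use warnings;

# Hypothetical sample text containing two quoted words.
my $text = q('yes' and 'no');

# .* grabs as much as it can: from the first quote to the LAST quote.
my ($greedy) = $text =~ /('.*')/;
print "greedy match: $greedy\n";

# A pattern that is nothing but a starred character is satisfied by
# zero occurrences, so it 'matches' a string containing no x at all.
my ($nothing) = 'abc' =~ /(x*)/;
print "x* matched a string of length ", length($nothing), "\n";
```

The second result is why a pattern that begins or ends with a starred character rarely does what was intended: the star is content with nothing.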
Well-Defined Repetition
If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the maximum and minimum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:
> perl matchtest.plx
Enter some text to find: \s{2,3}
'\s{2,3}' was not found.
>
So we have no doubled or trebled spaces in our string. Notice how we construct that: the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more', so {2,} denotes '2 or more'. The minimum is a different matter: before Perl 5.34, {,3} is not a quantifier at all and is matched as literal text, so write '3 or fewer' as {0,3}. In these open-ended cases, the same warnings apply as for the star operator.
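A quick way to see which run lengths each range accepts is to test runs of a single letter. This is a minimal sketch: the helper run_fits is hypothetical, and it writes the 'or fewer' case with an explicit zero minimum, {0,3}, which we assume here because it behaves the same in every Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: does a run of $count letters 'a' satisfy the quantifier?
sub run_fits {
    my ($quantifier, $count) = @_;
    my $run = 'a' x $count;
    return $run =~ /^a${quantifier}\z/ ? 1 : 0;
}

for my $quantifier ('{2,3}', '{2,}', '{0,3}') {
    my @ok = grep { run_fits($quantifier, $_) } 0 .. 5;
    print "a$quantifier accepts runs of length: @ok\n";
}
```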
Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:
> perl matchtest.plx
Enter some text to find: \b\w{5}\b
'\b\w{5}\b' was not found.
>
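To see the exact count find something, we can run it over a sentence that does contain five-letter words. The sentence and variable names below are our own assumptions, not part of matchtest.plx.

```perl
use strict;
use warnings;

# Hypothetical sample sentence.
my $sentence = 'a quick brown fox jumps over lazy dogs';

# With word boundaries on both sides, only words of exactly five letters match.
my @five_letter = $sentence =~ /\b(\w{5})\b/g;
print "five-letter words: @five_letter\n";

# Without the boundaries, \w{5} is happy with the first five letters
# of a longer word.
my ($partial) = 'wonder' =~ /(\w{5})/;
my $whole     = 'wonder' =~ /\b\w{5}\b/ ? 'matches' : 'not found';
print "unanchored on 'wonder': $partial\n";
print "anchored on 'wonder':   $whole\n";
```

The word boundaries are what turn 'five word characters in a row' into 'a five-letter word'.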