POSIX and Unicode Classes - Beginning Perl

Perl 5.6.0 introduced a few more character classes into the mix. First come those defined by the POSIX (Portable Operating System Interface) standard, which are therefore present in a number of other applications. The more common character classes here are:

Shortcut       Expansion                                 Description

[[:alpha:]]    [a-zA-Z]                                  An alphabetic character.

[[:alnum:]]    [0-9A-Za-z]                               An alphabetic or numeric character.

[[:digit:]]    \d                                        A digit, 0-9.

[[:lower:]]    [a-z]                                     A lower case letter.

[[:upper:]]    [A-Z]                                     An upper case letter.

[[:punct:]]    [!"#$%&'()*+,-./:;<=>?@\[\\\]^_`{|}~]     A punctuation character. Note the escaped characters [, \, and ].

The Unicode standard also defines 'properties', which apply to some characters. For instance, the 'IsUpper' property can be used to match any upper-case character, in whichever language or alphabet. If you know the property you are trying to match, you can use the syntax \p{} to match it; for instance, an upper-case character is matched by \p{IsUpper}.
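As a rough illustration of the classes above, the following sketch filters a few words with POSIX classes and counts upper-case letters with a Unicode property. The sample words and variable names are made up for this example, and the Greek word is written with \x{} escapes so that the script does not depend on the file's encoding.

```perl
use strict;
use warnings;

# Hypothetical sample words, chosen only for this illustration.
my @words = ('Perl', 'perl5', '42', 'foo-bar');

# Keep the words made up entirely of alphabetic characters.
my @alpha_only = grep { /^[[:alpha:]]+$/ } @words;

# Keep the words that contain at least one punctuation character.
my @with_punct = grep { /[[:punct:]]/ } @words;

# \x{0393} is GREEK CAPITAL LETTER GAMMA; the other three are
# lower-case Greek letters (epsilon, iota, alpha).
my $greek = "\x{0393}\x{03B5}\x{03B9}\x{03B1}";

# Count the characters that carry the upper-case property.
my $upper_count = () = $greek =~ /\p{IsUpper}/g;

print "alphabetic only: @alpha_only\n";
print "with punctuation: @with_punct\n";
print "upper-case letters in the Greek word: $upper_count\n";
```

Note that [[:upper:]] is shown above as [A-Z], whereas \p{IsUpper} also recognizes capitals outside the English alphabet, as the Greek example suggests.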

Alternatives

Instead of giving a series of acceptable characters, you may want to say 'match either this or that'. The 'either-or' operator in a regular expression uses the same symbol as the bitwise 'or' operator, |. So, to match either 'yes' or 'maybe' in our example, we could say this:

> perl matchtest.plx

Enter some text to find: yes|maybe

The text matches the pattern 'yes|maybe'.

That's either 'yes' or 'maybe'. But what if we wanted either 'yes' or 'yet'? To get alternatives on part of an expression, we need to group the options. In a regular expression, grouping is always done with parentheses:

> perl matchtest.plx

Enter some text to find: ye(s|t)

The text matches the pattern 'ye(s|t)'.

If we had forgotten the parentheses, we would have tried to match either 'yes' or 't'. In this case, we'd still get a positive match, but it wouldn't be doing what we want: we'd get a match for any string with a 't' in it, whether the words 'yes' or 'yet' were there or not.
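The difference is easy to check in a short script. This is only a sketch: the sample strings are invented here, and only the first two contain 'yes' or 'yet'.

```perl
use strict;
use warnings;

# Hypothetical sample strings; 'tomato' contains neither 'yes' nor 'yet'.
my @samples = ('yes, please', 'not yet', 'tomato');

# Grouped: 'ye' followed by either 's' or 't'.
my @grouped = grep { /ye(s|t)/ } @samples;

# Ungrouped: either the whole word 'yes' or a lone 't' anywhere.
my @ungrouped = grep { /yes|t/ } @samples;

print "ye(s|t) matches: ", join(', ', @grouped), "\n";
print "yes|t matches:   ", join(', ', @ungrouped), "\n";
```

The ungrouped pattern accepts 'tomato' as well, simply because it contains a 't'.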

You can match either 'this' or 'that' or 'the other' by adding more alternatives:

> perl matchtest.plx

Enter some text to find: (this)|(that)|(the other)

'(this)|(that)|(the other)' was not found.

However, in this case, it's more efficient to separate out the common elements:

> perl matchtest.plx

Enter some text to find: th(is|at|e other)

'th(is|at|e other)' was not found.

You can also nest alternatives. Say you want to match one of these patterns:

❑ 'the' followed by whitespace or a letter,

❑ 'or'

You might put something like this:

> perl matchtest.plx

Enter some text to find: (the(\s|[a-z]))|or

The text matches the pattern '(the(\s|[a-z]))|or'.

It looks fearsome, but break it down into its components. Our two alternatives are:

❑ the(\s|[a-z])

❑ or

The second part is easy, while the first contains 'the' followed by two alternatives: \s and [a-z].

Hence 'either "the" followed by either a whitespace or a lower case letter, or "or"'. We can, in fact, tidy this up a little, by replacing (\s|[a-z]) with the less cluttered [\sa-z].

> perl matchtest.plx

Enter some text to find: (the[\sa-z])|or

The text matches the pattern '(the[\sa-z])|or'.
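To convince yourself that the tidied pattern behaves like the nested one, you could run both against a handful of strings and count the cases where they disagree. The strings below are hypothetical test cases picked for this sketch, not taken from the test program.

```perl
use strict;
use warnings;

my $nested = qr/(the(\s|[a-z]))|or/;
my $tidy   = qr/(the[\sa-z])|or/;

# Hypothetical samples: some contain 'the' plus a following character,
# one contains 'or', and some should match neither alternative.
my @samples = ('the end', 'bathe', 'other', 'for', 'THE', 'the');

my @matched;
my $disagreements = 0;
for my $text (@samples) {
    my $nested_hit = $text =~ $nested ? 1 : 0;
    my $tidy_hit   = $text =~ $tidy   ? 1 : 0;
    $disagreements++ if $nested_hit != $tidy_hit;
    push @matched, $text if $tidy_hit;
}

print "matched: ", join(', ', @matched), "\n";
print "disagreements between the two patterns: $disagreements\n";
```

Notice that 'bathe' and 'the' fail: both end in 'the', so there is no following whitespace or letter for the first alternative to consume.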

Repetition

We've now moved from matching a specific character to a more general type of character – when we don't know (or don't care) exactly what the character will be. Now we're going to see what happens when we want to talk about a more general quantity of characters: more than three digits in a row; two to four capital letters, and so on. The metacharacters that we use to deal with a number of characters in a row are called quantifiers.

Indefinite Repetition

The easiest of these is the question mark. It should suggest uncertainty: something may be there, or it may not. That's exactly what it does, stating that the immediately preceding character(s) or metacharacter(s) may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either 'he' or 'she', you can put:

> perl matchtest.plx

Enter some text to find: \bs?he\b

The text matches the pattern '\bs?he\b'.

To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:

> perl matchtest.plx

Enter some text to find: what the Entish (word )?is

The text matches the pattern 'what the Entish (word )?is'.

Notice that we had to put the space inside the group. Otherwise, when 'word' is absent, the pattern demands two spaces between 'Entish' and 'is', whereas our text has only one:

> perl matchtest.plx

Enter some text to find: what the Entish (word)? is

'what the Entish (word)? is' was not found.
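The effect of where the space sits can be seen by trying both patterns on both phrasings. This is a minimal sketch; the two sample phrases are taken from the question posed above, and the variable names are arbitrary.

```perl
use strict;
use warnings;

my $space_inside  = qr/what the Entish (word )?is/;
my $space_outside = qr/what the Entish (word)? is/;

my @samples = ('what the Entish is', 'what the Entish word is');

# With the space inside the group, it disappears along with 'word'.
my @inside_hits = grep { $_ =~ $space_inside } @samples;

# With the space outside, the pattern always needs a space before 'is'
# in addition to the one after 'Entish'.
my @outside_hits = grep { $_ =~ $space_outside } @samples;

print scalar(@inside_hits), " sample(s) match with the space inside the group\n";
print scalar(@outside_hits), " sample(s) match with the space outside the group\n";
```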

As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign – to match an entire word without specifying how long it should be, you can say:

> perl matchtest.plx

Enter some text to find: \b\w+\b

The text matches the pattern '\b\w+\b'.

In this case, we match the first available word – I.

If, on the other hand, you have something which may be there any number of times but might not be there at all – zero or one or many – you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any – but possibly no – spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:

> perl matchtest.plx

Enter some text to find: ^\s*[A-Z]

'^\s*[A-Z]' was not found.

Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.
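Here is a small sketch of the plus and star quantifiers at work. The sample text is an assumption made for this example: like the test string discussed above, it opens with a double quote, but it is not meant to reproduce that string.

```perl
use strict;
use warnings;

# Hypothetical sample text that opens with a double quote.
my $text = q{"I wonder," he said.};

# \w+ takes one or more word characters; the first such run is 'I'.
my ($first_word) = $text =~ /\b(\w+)\b/;

# The leading quote stops ^\s*[A-Z] from matching.
my $quoted_matches = $text =~ /^\s*[A-Z]/ ? 1 : 0;

# Swap the leading quote for three spaces: \s* absorbs the padding,
# and it would be just as happy with no spaces at all.
(my $unquoted = $text) =~ s/^"/   /;
my $unquoted_matches = $unquoted =~ /^\s*[A-Z]/ ? 1 : 0;

print "first word: $first_word\n";
print "with the quote:    ", ($quoted_matches   ? 'match' : 'no match'), "\n";
print "without the quote: ", ($unquoted_matches ? 'match' : 'no match'), "\n";
```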

Let's review the three quantifiers:

/bea?t/ Matches either 'beat' or 'bet'

/bea+t/ Matches 'beat', 'beaat', 'beaaat'…

/bea*t/ Matches 'bet', 'beat', 'beaat'…
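The summary above can be checked mechanically. In this sketch the anchors ^ and $ are an addition of ours, so that each candidate has to match as a whole word; the candidate list and hash names are likewise made up for the example.

```perl
use strict;
use warnings;

my @candidates = ('bet', 'beat', 'beaat');

# ^ and $ are added here (an assumption for this sketch) so that each
# candidate must match from start to finish.
my %pattern_for = (
    '?' => qr/^bea?t$/,
    '+' => qr/^bea+t$/,
    '*' => qr/^bea*t$/,
);

my %hits_for;
for my $quantifier ('?', '+', '*') {
    my $pattern = $pattern_for{$quantifier};
    $hits_for{$quantifier} = join ' ', grep { $_ =~ $pattern } @candidates;
    print "a$quantifier matches: $hits_for{$quantifier}\n";
}
```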

Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:

A regular expression should hardly ever start or finish with a starred character.

You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
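Both warnings can be seen in a few lines. The strings here are hypothetical and serve only to show how a starred item can match far more, or far less, than a newcomer might expect.

```perl
use strict;
use warnings;

# Hypothetical sample text with two quoted words.
my $text = 'say "yes" or "no" now';

# .* grabs as much as it can, so the capture runs from the first
# double quote all the way to the last one.
my ($greedy) = $text =~ /"(.*)"/;

# A pattern that starts with a starred item can succeed by matching
# nothing at all: here \d* matches the empty string before 'a', so
# the X is inserted at the front and the digits are left untouched.
(my $replaced = 'abc123') =~ s/\d*/X/;

print "greedy capture: $greedy\n";
print "after the substitution: $replaced\n";
```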

Well-Defined Repetition

If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the minimum and maximum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:

> perl matchtest.plx

Enter some text to find: \s{2,3}

'\s{2,3}' was not found.

So we have no doubled or trebled spaces in our string. Notice how we construct that: the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more', so {2,} denotes '2 or more'. Omitting the minimum is less portable: {,3} is read as '3 or fewer' only from Perl 5.34 onwards, and earlier versions treat it as the literal text '{,3}', so write {0,3} to be safe. In these cases, the same warnings apply as for the star operator.

Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:

> perl matchtest.plx

Enter some text to find: \b\w{5}\b

'\b\w{5}\b' was not found.
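The counted quantifiers can be tried out on strings where the answer is easy to verify by eye. Both sample strings below are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical sample strings for this illustration.
my $sentence = 'Hobbits rarely speak Entish';
my $spaced   = 'a b  c   d    e';   # runs of 1, 2, 3 and 4 spaces

# Exactly five word characters between word boundaries.
my @five_letter = $sentence =~ /\b(\w{5})\b/g;

# Length of every run of two or more spaces.
my @two_or_more = map { length($_) } $spaced =~ /(\s{2,})/g;

# With an upper bound of three, the run of four spaces yields a match
# of only three; the one space left over is too short to match again.
my @two_to_three = map { length($_) } $spaced =~ /(\s{2,3})/g;

print "five-letter words: @five_letter\n";
print "run lengths for \\s{2,}:  @two_or_more\n";
print "run lengths for \\s{2,3}: @two_to_three\n";
```

The last result shows why an upper bound is not the same as 'no more than three in a row': nothing in \s{2,3} forbids further spaces after the match.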
