
Posix and Unicode Classes

From the document Beginning Perl (pages 182-185)

Perl 5.6.0 introduced a few more character classes into the mix. The first group are those defined by the POSIX (Portable Operating System Interface) standard, which are therefore also available in a number of other applications. The more common of these classes are:

Shortcut      Expansion                                  Description

[[:alpha:]]   [a-zA-Z]                                   An alphabetic character.
[[:alnum:]]   [0-9A-Za-z]                                An alphabetic or numeric character.
[[:digit:]]   \d                                         A digit, 0-9.
[[:lower:]]   [a-z]                                      A lower case letter.
[[:upper:]]   [A-Z]                                      An upper case letter.
[[:punct:]]   [!"#$%&'()*+,-./:;<=>?@\[\\\]^_`{|}~]      A punctuation character – note the escaped characters [, \, and ].

The Unicode standard also defines 'properties', which apply to some characters. For instance, the 'IsUpper' property can be used to match any upper-case character, in whichever language or alphabet. If you know the property you are trying to match, you can use the syntax \p{} to match it, for instance, the upper-case character is \p{IsUpper}.
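As a rough illustration of the classes above, the following sketch counts how many characters of a string fall into each class. The sample string and the variable names are assumptions made for this example only; the string starts with a Greek capital sigma so that the difference between [A-Z] and \p{IsUpper} is visible.

```perl
use strict;
use warnings;

# Hypothetical sample: GREEK CAPITAL LETTER SIGMA (U+03A3), then ASCII text.
my $text = "\x{3A3}x Q7!";

# Count matches with the "= () =" idiom (list assignment in scalar context).
my $ascii_upper   = () = $text =~ /[A-Z]/g;         # ASCII range only
my $posix_digit   = () = $text =~ /[[:digit:]]/g;   # POSIX class inside [...]
my $posix_punct   = () = $text =~ /[[:punct:]]/g;
my $unicode_upper = () = $text =~ /\p{IsUpper}/g;   # Unicode property

print "ASCII upper: $ascii_upper, Unicode upper: $unicode_upper\n";
print "digits: $posix_digit, punctuation: $posix_punct\n";
```

Here [A-Z] sees only the 'Q', while \p{IsUpper} also counts the sigma.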

Alternatives

Instead of giving a series of acceptable characters, you may want to say 'match either this or that'. The 'either-or' operator in a regular expression uses the same symbol as the bitwise 'or' operator, |. So, to match either 'yes' or 'maybe' in our example, we could say this:

> perl matchtest.plx

Enter some text to find: yes|maybe

The text matches the pattern 'yes|maybe'.

>

That's either 'yes' or 'maybe'. But what if we wanted either 'yes' or 'yet'? To get alternatives on part of an expression, we need to group the options. In a regular expression, grouping is always done with parentheses:

> perl matchtest.plx

Enter some text to find: ye(s|t)

The text matches the pattern 'ye(s|t)'.

>

If we had forgotten the parentheses, we would have tried to match either 'yes' or 't'. In this case, we'd still get a positive match, but it wouldn't be doing what we want: we'd get a match for any string with a 't' in it, whether the words 'yes' or 'yet' were there or not.
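The effect of the missing parentheses can be checked with a short sketch. The helper matches() and the input strings are hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $pattern matches anywhere in $text, else 0.
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my $grouped_yet   = matches('not yet', 'ye(s|t)');  # 'yet' is present
my $grouped_tea   = matches('tea',     'ye(s|t)');  # neither 'yes' nor 'yet'
my $ungrouped_tea = matches('tea',     'yes|t');    # the lone 't' is enough

print "ye(s|t) on 'not yet': $grouped_yet\n";
print "ye(s|t) on 'tea':     $grouped_tea\n";
print "yes|t   on 'tea':     $ungrouped_tea\n";
```

The ungrouped pattern accepts 'tea' even though it contains neither word, which is the false positive described above.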

You can match either 'this' or 'that' or 'the other' by adding more alternatives:

> perl matchtest.plx

Enter some text to find: (this)|(that)|(the other)

'(this)|(that)|(the other)' was not found.

>

However, in this case, it's more efficient to separate out the common elements:

> perl matchtest.plx

Enter some text to find: th(is|at|e other)

'th(is|at|e other)' was not found.
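A small sketch can confirm that the two spellings accept the same strings. The sample list is an assumption for this example, and the sketch checks equivalence only; it makes no attempt to measure speed.

```perl
use strict;
use warnings;

# Hypothetical candidates; only the first three should be accepted.
my @samples = ('this', 'that', 'the other', 'then', 'those');

my @long_form  = grep { /(this)|(that)|(the other)/ } @samples;
my @short_form = grep { /th(is|at|e other)/ } @samples;

print "long form:  @long_form\n";
print "short form: @short_form\n";
```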

You can also nest alternatives. Say you want to match one of these patterns:

❑ 'the' followed by whitespace or a letter,

❑ 'or'

You might put something like this:

> perl matchtest.plx

Enter some text to find: (the(\s|[a-z]))|or

The text matches the pattern '(the(\s|[a-z]))|or'.

>

It looks fearsome, but break it down into its components. Our two alternatives are:

❑ the(\s|[a-z])

❑ or

The second part is easy, while the first contains 'the' followed by two alternatives: \s and [a-z].

Hence 'either "the" followed by either a whitespace or a lower case letter, or "or"'. We can, in fact, tidy this up a little, by replacing (\s|[a-z]) with the less cluttered [\sa-z].

> perl matchtest.plx

Enter some text to find: (the[\sa-z])|or

The text matches the pattern '(the[\sa-z])|or'.

>
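To see that the tidied pattern behaves like the nested one, the sketch below runs both over a few strings. The strings and variable names are hypothetical, chosen to exercise each branch.

```perl
use strict;
use warnings;

my $nested = qr/(the(\s|[a-z]))|or/;
my $tidy   = qr/(the[\sa-z])|or/;

# Hypothetical inputs: 'the' + space, 'the' + letter, 'or', and two misses.
my @samples = ('the end', 'then', 'word', 'THE', 'the');

my $nested_results = join '', map { /$nested/ ? 1 : 0 } @samples;
my $tidy_results   = join '', map { /$tidy/   ? 1 : 0 } @samples;

print "nested: $nested_results\n";
print "tidy:   $tidy_results\n";
```

A bare 'the' at the very end of a string fails under both patterns, because each requires one more character after it.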

Repetition

We've now moved from matching a specific character to a more general type of character – when we don't know (or don't care) exactly what the character will be. Now we're going to see what happens when we want to talk about a more general quantity of characters: more than three digits in a row; two to four capital letters, and so on. The metacharacters that we use to deal with a number of characters in a row are called quantifiers.

Indefinite Repetition

The easiest of these is the question mark. It should suggest uncertainty: something may be there, or it may not. That's exactly what it does, stating that the immediately preceding character or metacharacter (or group) may appear once, or not at all. It's a good way of saying that a particular character or group is optional. To match either 'he' or 'she', you can put:

> perl matchtest.plx

Enter some text to find: \bs?he\b

The text matches the pattern '\bs?he\b'.

>

To make a series of characters (or metacharacters) optional, group them in parentheses as before. Did he say 'what the Entish is' or 'what the Entish word is'? Either will do:

> perl matchtest.plx

Enter some text to find: what the Entish (word )?is

The text matches the pattern 'what the Entish (word )?is'.

>

Notice that we had to put the space inside the group: otherwise we end up with two spaces between 'Entish' and 'is', whereas our text only has one:

> perl matchtest.plx

Enter some text to find: what the Entish (word)? is

'what the Entish (word)? is' was not found.

>
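The optional character and the optional group can both be tried in a few lines. The word list and the variable names are assumptions for this sketch; the two phrases are the ones quoted above.

```perl
use strict;
use warnings;

# Optional single character: only 'he' and 'she' are whole-word matches.
my @pronouns = grep { /\bs?he\b/ } ('he', 'she', 'the', 'shed');

my $short = 'what the Entish is';
my $long  = 'what the Entish word is';

# Space inside the group: both phrasings match.
my $inside_short = $short =~ /what the Entish (word )?is/ ? 1 : 0;
my $inside_long  = $long  =~ /what the Entish (word )?is/ ? 1 : 0;

# Space outside the group: the short phrasing would need two spaces.
my $outside_short = $short =~ /what the Entish (word)? is/ ? 1 : 0;
my $outside_long  = $long  =~ /what the Entish (word)? is/ ? 1 : 0;

print "pronouns: @pronouns\n";
print "inside:  $inside_short $inside_long\n";
print "outside: $outside_short $outside_long\n";
```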

As well as matching something one or zero times, you can match something one or more times. We do this with the plus sign – to match an entire word without specifying how long it should be, you can say:

> perl matchtest.plx

Enter some text to find: \b\w+\b

The text matches the pattern '\b\w+\b'.

>

In this case, we match the first available word – I.

If, on the other hand, you have something which may be there any number of times but might not be there at all – zero or one or many – you need what's called 'Kleene's star': the * quantifier. So, to find a capital letter after any – but possibly no – spaces at the start of the string, what would you do? The start of the string, then any number of whitespace characters, then a capital:

> perl matchtest.plx

Enter some text to find: ^\s*[A-Z]

'^\s*[A-Z]' was not found.

>

Of course, our test string begins with a quote, so the above pattern won't match, but, sure enough, if you take away that first quote, the pattern will match fine.
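The two results above can be reproduced in a short sketch. The test string below is an assumption: it is chosen to be consistent with the hints in the text (it begins with a quote and its first word is 'I'), and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Assumed test string, consistent with the results described in the text.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# One or more word characters between word boundaries: the first word.
my ($first_word) = $text =~ /(\b\w+\b)/;

# Zero or more whitespace characters, then a capital, at the start.
my $capital_first = $text =~ /^\s*[A-Z]/ ? 1 : 0;

# The same test after removing the leading quote.
(my $unquoted = $text) =~ s/^"//;
my $capital_first_unquoted = $unquoted =~ /^\s*[A-Z]/ ? 1 : 0;

print "first word: $first_word\n";
print "capital at start: $capital_first (quoted), ",
      "$capital_first_unquoted (unquoted)\n";
```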

Let's review the three quantifiers:

/bea?t/   Matches either 'beat' or 'bet'

/bea+t/   Matches 'beat', 'beaat', 'beaaat'…

/bea*t/   Matches 'bet', 'beat', 'beaat'…
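The summary can be checked against a few candidate words. In this sketch the patterns are anchored with ^ and $ so that each word is tested as a whole; the anchors and the candidate list are additions for the illustration.

```perl
use strict;
use warnings;

my @candidates = ('bet', 'beat', 'beaat');

my @zero_or_one  = grep { /^bea?t$/ } @candidates;  # 'a' optional
my @one_or_more  = grep { /^bea+t$/ } @candidates;  # at least one 'a'
my @zero_or_more = grep { /^bea*t$/ } @candidates;  # any number of 'a's

print "?: @zero_or_one\n";
print "+: @one_or_more\n";
print "*: @zero_or_more\n";
```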

Novice Perl programmers tend to go to town on combinations of dot and star, and the results often surprise them, particularly when it comes to searching-and-replacing. We'll explain the rules of the regular expression matcher shortly, but bear the following in mind:

A regular expression should hardly ever start or finish with a starred character.

You should also consider the fact that .* and .+ in the middle of a regular expression will match as much of your string as they possibly can. We'll look more at this 'greedy' behavior later on.
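Both warnings can be illustrated with a minimal sketch. The sample strings are hypothetical; the point is that a starred pattern can succeed while matching nothing at all, and that .* takes as much as it can.

```perl
use strict;
use warnings;

# A pattern made only of a starred character succeeds on an empty match.
my ($leading) = 'abc' =~ /(x*)/;       # captures the empty string

# The same pattern in a global substitution fires at every position.
(my $replaced = 'abc') =~ s/x*/-/g;

# Greedy .* runs to the last '>' in the string, not the first.
my $markup = 'a <b>bold</b> word';
my ($greedy) = $markup =~ /(<.*>)/;

print "leading match: '$leading'\n";
print "replaced:      $replaced\n";
print "greedy match:  $greedy\n";
```

The substitution yields '-a-b-c-', since the empty match succeeds before each character and once more at the end of the string.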

Well-Defined Repetition

If you want to be more precise about how many times a character or group of characters might be repeated, you can specify the minimum and maximum number of repeats in curly brackets. '2 or 3 spaces' can be written as follows:

> perl matchtest.plx

Enter some text to find: \s{2,3}

'\s{2,3}' was not found.

>

So we have no doubled or trebled spaces in our string. Notice how we construct that: the minimum, a comma, and the maximum, all inside braces. Omitting the maximum signifies 'or more', so {2,} denotes '2 or more'. Omitting the minimum is less portable: {,3} is read as '3 or fewer' only from Perl 5.34 onwards, and older versions treat it as literal text, so {0,3} is the safer way to write '3 or fewer'. In these cases, the same warnings apply as for the star operator.

Finally, you can specify exactly how many things are to be in a row by simply putting that number inside the curly brackets. Here's the five-letter-word example tidied up a little:

> perl matchtest.plx

Enter some text to find: \b\w{5}\b

'\b\w{5}\b' was not found.

>
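The counted quantifiers can be tried in the same way. As before, the test string is an assumption chosen to agree with the results shown (no doubled spaces, no five-letter words), and the second string and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Assumed test string, consistent with the results described in the text.
my $text = q("I wonder what the Entish is for 'yes' and 'no'," he thought.);

# Two or three whitespace characters in a row: none in the test string.
my $has_gap = $text =~ /\s{2,3}/ ? 1 : 0;

# A hypothetical string that does contain a doubled space.
my $gap_found = "two  spaces" =~ /\s{2,3}/ ? 1 : 0;

# Exactly five word characters between word boundaries.
my @five_letters = $text =~ /\b(\w{5})\b/g;

# Open-ended form: six or more word characters.
my @six_or_more = $text =~ /\b(\w{6,})\b/g;

print "gap in test string: $has_gap, in sample: $gap_found\n";
print "five-letter words: ", scalar(@five_letters), "\n";
print "six or more: @six_or_more\n";
```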
