Grouping Patterns

3.6 <STDIN> as an Array

7. Regular Expressions

7.2 Simple Uses of Regular Expressions

7.3.2 Grouping Patterns

The true power of regular expressions comes into play when you can say "one or more of these" or "up to five of those." Let's talk about how this is done.

7.3.2.1 Sequence

The first (and probably least obvious) grouping pattern is sequence. This means that abc matches an a followed by a b followed by a c. Seems simple, but we're giving it a name so we can talk about it later.

7.3.2.2 Multipliers

We've already seen the asterisk (*) as a grouping pattern. The asterisk indicates zero or more of the immediately previous character (or character class).

Two other grouping patterns that work like this are the plus sign (+), meaning one or more of the immediately previous character, and the question mark (?), meaning zero or one of the immediately previous character. For example, the regular expression /fo+ba?r/ matches an f followed by one or more o's followed by a b, followed by an optional a, followed by an r.

In all three of these grouping patterns, the patterns are greedy. If such a multiplier has a chance to match between five and ten characters, it'll pick the 10-character string every time. For example,

$_ = "fred xxxxxxxxxx barney";

s/x+/boom/;

always replaces all consecutive x's with boom (resulting in fred boom barney), rather than just one or two x's, even though a shorter set of x's would also match the same regular expression.

If you need to say "five to ten" x's, you could get away with putting five x's followed by five x's each immediately followed by a question mark. But this looks ugly. Instead, there's an easier way: the general multiplier. The general multiplier consists of a pair of matching curly braces with one or two numbers inside, as in /x{5,10}/. The immediately preceding character (in this case, the letter "x") must be found within the indicated number of repetitions (five through ten here).[1]

[1] Of course, /\d{3}/ doesn't only match three-digit numbers. It would also match any number with more than three digits in it. To match exactly three, you need to use anchors, described later in Section 7.3.3, "Anchoring Patterns."

If you leave off the second number, as in /x{5,}/, it means "that many or more" (five or more in this case), and if you leave off the comma, as in /x{5}/, it means "exactly this many" (five x's). To get five or less x's, you must put the zero in, as in /x{0,5}/.

So, the regular expression /a.{5}b/ matches the letter a separated from the letter b by any five non-newline characters at any point in the string. (Recall that a period matches any single non-newline character, and we're matching five here.) The five characters do not need to be the same. (We'll learn how to force them to be the same in the next section.)

We could dispense with *, +, and ? entirely, since they are completely equivalent to {0,}, {1,}, and {0,1}. But it's easier to type the equivalent single punctuation character, and more familiar as well.

If two multipliers occur in a single expression, the greedy rule is augmented with "leftmost is greediest."

For example:

$_ = "a xxx c xxxxxxxx c xxx d";

/a.*c.*d/;

In this case, the first ".*" in the regular expression matches all characters up to the second c, even though matching only the characters up to the first c would still allow the entire regular expression to match. Right now, this doesn't make any difference (the pattern would match either way), but later when we can look at parts of the regular expression that matched, it'll matter quite a bit.

We can force any multiplier to be nongreedy (or lazy) by following it with a question mark:

$_ = "a xxx c xxxxxxxx c xxx d";

/a.*?c.*d/;

Here, the a.*?c now matches the fewest characters between the a and c, not the most characters. This means the leftmost c is matched, not the rightmost. You can put such a question-mark modifier after any of the multipliers (?,+,*, and {m,n}).

What if the string and regular expression were slightly altered, say, to:

$_ = "a xxx ce xxxxxxxx ci xxx d";

/a.*ce.*d/;

In this case, if the .* matches the most characters possible before the next c, the next regular expression character (e) doesn't match the next character of the string (i). In this case, we get automatic

backtracking: the multiplier is unwound and retried, stopping at someplace earlier (in this case, at the earlier c, next to the e).[2] A complex regular expression may involve many such levels of backtracking, leading to long execution times. In this case, making that match lazy (with a trailing "?") will actually simplify the work that Perl has to perform, so you may want to consider that.

[2] Well, technically there was a lot of backtracking of the * operator to find the c's in the first place. But that's a little trickier to describe, and it works on the same principle.

7.3.2.3 Parentheses as memory

Another grouping operator is a pair of open and close parentheses around any part pattern. This doesn't change whether the pattern matches, but instead causes the part of the string matched by the pattern to be remembered, so that it may be referenced later. So for example, (a) still matches an a, and ([a-z]) still matches any single lowercase letter.

To recall a memorized part of a string, you must include a backslash followed by an integer. This pattern construct represents the same sequence of characters matched earlier in the same-numbered pair of

parentheses (counting from one). For example, /fred(.)barney\1/;

matches a string consisting of fred, followed by any single non-newline character, followed by barney, followed by that same single character. So, it matches fredxbarneyx, but not fredxbarneyy. Compare that with

/fred.barney./;

in which the two unspecified characters can be the same, or different; it doesn't matter.

Where did the 1 come from? It means the first parenthesized part of the regular expression. If there's more than one, the second part (counting the left parentheses from left to right) is referenced as \2, the third as \3, and so on. For example,

/a(.)b(.)c\2d\1/;

matches an a, a character (call it #1), a b, another character (call it #2), a c, the character #2, a d, and the character #1. So it matches axbycydx, for example.

The referenced part can be more than a single character. For example, /a(.*)b\1c/;

matches an a, followed by any number of characters (even zero) followed by b, followed by that same sequence of characters followed by c. So, it would match aFREDbFREDc, or even abc, but not

aXXbXXXc.

7.3.2.4 Alternation

Another grouping construct is alternation, as in a|b|c. This means to match exactly one of the

alternatives (a or b or c in this case). This works even if the alternatives have multiple characters, as in /song|blue/, which matches either song or blue. (For single character alternatives, you're

definitely better off with a character class like /[abc]/.)

What if we wanted to match songbird or bluebird? We could write /songbird|bluebird/, but that bird part shouldn't have to be in there twice. In fact, there's a way out, but we have to talk about the precedence of grouping patterns, which is covered in Section 7.3.4, "Precedence," below.

Dans le document Learning Perl (Page 129-132)

3.6 &lt;STDIN&gt; as an Array

7. Regular Expressions

7.2 Simple Uses of Regular Expressions

7.3.2 Grouping Patterns

3.6 <STDIN> as an Array