
Perl as a (better) sed command

4.2 SHORTCOMINGS OF sed

As in the corresponding table for the grep command (see table 3.2), the topmost panel of table 4.1 shows the differences in basic matching facilities provided by sed commands and Perl. Many of the shortcomings of the classic sed are the same as those listed earlier for the classic grep, due to their mutual dependence on the classic regex dialect.

Table 4.1 Text-modification capabilities of sed and Perl

Capability (a)                             Classic sed   POSIX sed   GNU sed   Perl
Word-boundary metacharacter                –             Y           Y         Y
Compact character-class shortcuts          –             ?           –         Y
Control character representation           –             –           –         Y
Repetition ranges                          Y             Y           Y         Y
Capturing parentheses and backreferences   Y             Y           Y         Y+
Metacharacter quoting                      Y             Y           Y         Y+
Embedded commentary                        –             –           –         Y
Advanced regex features                    –             –           –         Y
Case insensitivity                         –             –           Y         Y
Arbitrary record definitions               –             –           –         Y
Line-spanning matches                      –             –           –         Y
Binary-file processing                     –             –           –         Y
Directory-file skipping                    –             –           Y         Y
Arbitrary delimiters                       Y             Y           Y         Y+
Access to match components                 –             –           –         Y
Customized replacements                    –             –           –         Y+
File modifications                         –             –           Y         Y+

a. Y: has the capability; Y+: has the capability with enhancements; ?: partially has the capability; –: doesn’t have the capability

The middle panel of the table compares the kinds of matching the individual commands support, and the lower panel compares the enhanced matching services they provide.

You’ll understand what most of the listed capabilities mean, either because they’re self-explanatory or because they were discussed earlier in chapter 3. We’ll focus on the other capabilities here.

Arbitrary delimiters give you the ability to use an arbitrary character to separate the search and replacement fields in the substitution syntax, which can greatly enhance readability. For example, the following commands all strip the leading /etc/ from each input line by substituting an empty replacement string (indicated by the adjacent second and third delimiters) for what was matched:

sed 's/^\/etc\///g' file # default delimiters
sed 's|^/etc/||g' file # custom delimiters
perl -wpl -e 's|^/etc/||g;' file # Perl

Isn’t the first sed solution hard to read, with those slanty lines falling against each other? This effect, called Leaning Toothpick Syndrome, detracts from readability, and can be easily avoided. Like sed, Perl allows the first visible character occurring immediately after the s to be used as an alternative delimiter; it’s used here to avoid the need for backslashing literal occurrences of the delimiter character within the regex field.

As the following example shows, Perl even allows the use of reflected symbol pairs such as () and {}, whose components are mirror images of each other, to delimit the separate components of the search and replacement fields:

perl -wpl -e 's{^/etc/}{}g;' # reflected symbols for delimiters

In recognition of this advanced feature, Perl is accorded a “Y+” (has the capability with enhancements) in the “Arbitrary delimiters” category of table 4.1.³

Customized replacements refers to the ability to adapt the replacement string to the specific characteristics of the string that was matched. An example would be substituting “FREDERICK” for “FRED”, but “Frederick” for “Fred”. The computed replacements mentioned earlier, which are covered later in this chapter, fall into this category, along with the mapped replacements discussed in section 10.4.4.

Perl provides automatic directory file skipping to ensure that (nonsensical) requests for directory files to be edited aren’t honored. In contrast, POSIX sed is happy to dump the edited binary data to the user’s (soon to be unreadable) screen. Like Perl, the GNU sed refuses to perform substitutions on data taken from directories, but it does so less gracefully, by issuing an error message that suggests a faulty permission setting:

sed: read error on /etc: Is a directory

3 Remember, Perl’s matching operator also supports arbitrary delimiters, as shown in table 3.3.

File modification means the utility can store changes in the file where the data originated, rather than having to send those changes to the output. Although POSIX versions of sed can’t directly modify a file, GNU sed and Perl can both do that and even make a backup copy of the original file for you before modifying it. But Perl wins over GNU sed in this category because it allows an automatically updated file extension to be used on the backed-up file, for greater reliability (see section 4.7.3).

In summary, table 4.1 shows that Perl has the richest collection of text-editing capabilities, with the GNU version of sed coming in second, the POSIX version third, and the classic sed last.

But before you can begin to realize Perl’s benefits as a premier text-processing utility, you must first learn to use it for performing simple text substitutions, which we’ll discuss next.

4.3 PERFORMING SUBSTITUTIONS

Several Unix utilities can be used to replace one string of text with another. Text substitutions are performed in sed using a substitution command:

sed 's/RE/replacement/g' file1 file2 ...

The Perl equivalent uses the substitution operator, in a command of this form:

perl -wpl -e 's/RE/replacement/g;' file1 file2 ...

And the vi editor’s substitution command for modifying the current line is

:s/RE/replacement/g

Notice the similarity? It’s no accident that Perl’s syntax is nearly identical to that of sed and vi. Larry, in his wisdom, designed it that way, to facilitate your migration to Perl.

In each of these three commands, the s before the initial slash indicates that a substitution is being requested, RE is a placeholder for the regex of interest, and the slashes delimit the search string and the replacement string. The trailing g after the third slash means global; it requests that all possible substitutions be performed, rather than just the leftmost one on each line. (You almost always want that behavior, so in Minimal Perl we use the g there by default and omit it only where it would spoil the command.)

As with all sed-like Perl commands, the one shown here uses the Primary Option Cluster that’s appropriate for “Input processing,” enhanced in this case for automatic printing with the addition of the p option (see table 2.9).

Now let’s consider a simple and practical example of the use of sed and Perl for performing text substitution. The purpose of the following commands is to edit the output of date to expand the day-name abbreviation, which makes its output more understandable:

$ date
Sun Dec 25 16:53:03 PDT 2005

$ date | sed 's/Sun/Sunday/g' # sed version
Sunday Dec 25 16:53:04 PDT 2005

$ date | perl -wpl -e 's/Sun/Sunday/g;' # Perl version
Sunday Dec 25 16:53:05 PDT 2005

As you can see, the -wpl option-cluster causes Perl to function like a sed substitution command.

That’s all well and good, until you realize that these solutions work properly for only one day of the week! It takes seven separate editing operations to do this job properly, which would be most easily handled in sed using the “get commands from a file” option, -f, along with a file full of appropriate substitution commands:

$ cat expand_daynames.sed
s/Sun/Sunday/g
s/Mon/Monday/g
... you get the idea

$ date | sed -f expand_daynames.sed # takes commands from file
Sunday Dec 25 16:53:10 PDT 2005

Perl has its own ways of handling a file full of editing commands, but the most elementary and sed-like approach is to use multiple substitution operators:

$ date |
> perl -wpl -e '
> s/\bSun\b/Sunday/g;
> s/\bMon\b/Monday/g;
... you get the idea
> '
Sunday Dec 25 16:53:14 PDT 2005

You can do as many substitutions as you want, by stacking them as shown, for execution in top-to-bottom order. But it would be more convenient to package these statements in a script:

$ cat expand_daynames
#! /usr/bin/perl -wpl
s/\bSun\b/Sunday/g;
s/\bMon\b/Monday/g;
... you get the idea

This approach works well enough, but you’ll see simpler ways to write programs that do multiple substitutions in part 2 (e.g., section 10.4.4).

In addition to allowing you to specify a delimiter of choice, Perl’s substitution operator also permits you to specify a data source other than the default ($_).

Table 4.2 shows examples of these syntax variations.

As with the matching operator discussed in chapter 3, the substitution operator recognizes several modifiers that change the way it works. These modifiers, which are typed after the closing delimiter, are listed in table 4.3. Most of them also work with the matching operator, but the e modifier shown in the bottom panel is a notable exception. Its job is to evaluate the Perl code in the replacement field to generate a replacement string (as demonstrated in section 4.9.1).

Table 4.4 provides examples of using the substitution operator, to help you see how the various syntax variations and modifiers are used together in actual applications.

As shown in the table’s last row, you can even use the $& variable (introduced in table 3.4) in the replacement field, to replace what was matched with a variation on that string.

You’ll see examples of the features presented in these tables throughout the remainder of this chapter, and also in part 2.

Table 4.2 Substitution operator syntax

Form (a)                  Explanation
s/RE/new/g                Using default “/” delimiters, substitutes the value new for all
                          matches of the regular expression RE found in the $_ variable
s:RE:new:g                Same, but uses custom “:” delimiters
$somevar =~ s/RE/new/g    Using default “/” delimiters, substitutes the value new for all
                          matches of the regular expression RE found in the $somevar variable
$somevar =~ s:RE:new:g    Same, but uses custom “:” delimiters

a. RE stands for a regular expression, and new stands for the string that replaces what RE matches. The substitution operator returns the number of substitutions it performed, rather than the modified string that sed produces.

Table 4.3 Substitution modifiers

Modifier   Meaning            Explanation (a)
i          Ignore case        Ignores case variations while matching.
x          Expanded mode      Permits whitespace and comments in the RE field.
s          Single-line mode   Allows the “.” metacharacter to match newline.
m          Multi-line mode    Changes ^ and $ to match the ends of the lines within a record
                              rather than the ends of the record.
g          Global             Allows multiple substitutions per record, and returns different
                              values for scalar and list contexts (details in part 2).
e          Eval(uate)         Evaluates new as Perl code, and substitutes its result for what
                              RE matches.

a. RE stands for a regular expression, and new stands for the string that replaces what RE matches in s/RE/new/g.

We’ll look next at how you can exercise more control over where substitutions are allowed to occur.

4.3.1 Performing line-specific substitutions: sed

The sed command can restrict its attentions to particular lines, specified either by a single line number or by a range of two (inclusive) line numbers separated by a comma, placed before the s. For example, the sed command in this example restricts its editing to Line 1 of a file, by using 1s///g instead of the unrestricted s///g:

$ cat beatles # notice contents of first line
# Beatles playlist, for Sun Jun 4 11:01:10 PDT 2006
...

$ sed '1s/Sun/Sunday/g' beatles
# Beatles playlist, for Sunday Jun 4 11:01:10 PDT 2006
Here Comes the Sun
...

The restriction of the editing operation to Line 1 allows the abbreviated day name to be processed on the heading line, while preventing “Here Comes the Sun” from getting changed into “Here Comes the Sunday” on Line 2 (a data line).

With sed, you can even specify a range of lines using context addresses, which look like two Perl matching operators separated by a comma, as in /^Start/,/^Stop/. But that capability is more closely associated with AWK, so we’ll cover it in section 5.5.
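To make the two kinds of sed addresses concrete, here is a brief sketch that edits the same lines in both ways. The five-line input is invented for the purpose, with Start and Stop lines placed so that the numeric range 2,4 and the context range select the same region.

```shell
# Hypothetical input; lines 2 through 4 run from "Start" to "Stop".
input=$(printf '%s\n' 'old a' 'Start' 'old b' 'Stop' 'old c')

# Numeric range: edit lines 2 through 4 only.
by_number=$(printf '%s\n' "$input" | sed '2,4s/old/new/g')

# Context addresses: edit from the first line matching ^Start
# through the next line matching ^Stop.
by_context=$(printf '%s\n' "$input" | sed '/^Start/,/^Stop/s/old/new/g')

printf '%s\n' "$by_context"
```

In both cases only “old b” becomes “new b”; the lines outside the range are passed through unchanged.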

Next, you’ll see the Perl counterpart to the sed command that edits the first line of the beatles file.

4.3.2 Performing line-specific substitutions: Perl

Because Perl is more versatile than sed, some operations that are easily expressed in sed require more work to specify precisely in Perl. That may sound strange, or even backward, so let’s consider an everyday analogy.

Table 4.4 Substitution operator examples

Example                          Meaning
s/perl/Perl/;                    Substitutes “Perl” for the leftmost occurrence of “perl” in $_.
                                 Global (/g) substitutions are usually preferable.
s/perl/Perl/g;                   Globally substitutes “Perl” for each occurrence of “perl” in $_.
$oyster =~ s/perl/Perl/g;        Globally substitutes “Perl” for each occurrence of “perl” in
                                 $oyster.
$oyster =~ s/\bperl\b/'$&'/ig;   Globally searches for the word “perl” in $oyster, ignoring
                                 case, and substitutes for each match that same word (via $&;
                                 see table 3.4) wrapped in single quotes (i.e., “'perl'” for
                                 “perl”, “'PERL'” for “PERL”, etc.).

When you want to boil a cup of water, you can tell your microwave oven to cook for two minutes by punching the Minute button twice and the Start button once. That’s easily specified, because it’s implicit that the operation of interest is cooking; that’s all an oven knows how to do.

However, if you want to tell your computer to do something for two minutes, instead of saying “do your thing for two minutes” (which works splendidly with the microwave), you have to tell the computer exactly what you want it to do during that period, because it has numerous options.

In the case at hand, sed is like the microwave oven, and Perl is like the computer. Sometimes you have to type a bit more to get Perl to do the same thing as sed, but it’s usually worth it. With that in mind, let’s see how to perform substitutions on specific lines in Perl.

Although Perl doesn’t support the 10,19s/old/new/g syntax of sed, it does keep track of the number of the current record (a line by default) in the special variable “$.” (see table 2.2), if the n or p invocation option is used. This being the case, processing specific lines is accomplished by composing an expression that’s true when the line number is in the desired range, and then using the logical and to make the operation of interest conditional on that result.

For example, the following commands perform substitutions only on specified lines, and they unconditionally print (courtesy of the p option) all lines. The result is that a selectively edited version of the file is sent to the output destination:

• Edit line 1:

perl -wpl -e '$. == 1 and s/old/new/g;' file

• Edit lines 3–11:

perl -wpl -e '3 <= $. and $. <= 11 and s/old/new/g;' file

• Edit lines 10–last:

perl -wpl -e '$. > 9 and s/old/new/g;' file

The relational operators (<=, etc.) might look familiar, because they’re identical to those used with certain Shell commands. They’re covered in more detail in section 5.6.1.

You can see from the first example that the Perl counterpart to the earlier sed command, which edits only the heading line of the beatles file, is

perl -wpl -e '$. == 1 and s/Sun/Sunday/g;' beatles

But Perl can also perform substitutions on records bigger than a single line, as we’ll discuss next.

4.3.3 Performing record-specific substitutions: Perl

Unlike sed, Perl lets you perform substitutions on arbitrarily defined records. Consider the following data file, called data_east:

Data for Eastern Region

updated: Sun Sep 18 18:40:51
checked: Mon Sep 19 00:00:01
updated: Sun Sep 25 18:40:52
checked: Mon Sep 26 00:00:00

42 56 778 001: Sun Myung
918 42 178 13: Mon Soon
86 574 09 108: Tue Hawt

Upper management has been making a lot of noise recently about “bold new innovations” to be announced soon, so Ashanti isn’t surprised when she hears their decree that abbreviations for day names are no longer acceptable for presentations in departmental meetings. So before her next weekly meeting, she needs to expand those suddenly taboo day-name abbreviations in the data_east file.

One apparent complication is that there could be any number of “updated/checked” lines in the file, which are the lines with the day-name abbreviations. On the other hand, they’re all guaranteed to be in the same chunk of text that’s separated by one or more blank lines from the others, and that coincides perfectly with Perl’s definition of a paragraph.

Sun, Mon, and Tue in paragraph 3 are people’s names, not day-name abbreviations, so only paragraph 2 needs to be modified. Accordingly, Ashanti composes the following command to expand its abbreviations:

$ perl -00 -wpl -e '$. == 2 and s/\bSun\b/Sunday/g;
> $. == 2 and s/\bMon\b/Monday/g;' data_east
Data for Eastern Region

updated: Sunday Sep 18 18:40:51
checked: Monday Sep 19 00:00:01
updated: Sunday Sep 25 18:40:52
checked: Monday Sep 26 00:00:00

42 56 778 001: Sun Myung
918 42 178 13: Mon Soon
86 574 09 108: Tue Hawt

The use of -00 enables paragraph mode (see table 2.1), and the equality tests (==) on the values of the “$.” variable select the record of interest. Note that the occurrences of Sun and Mon in paragraph 2 were modified as desired, while those in paragraph 3 were correctly exempted from editing.

Now all Ashanti needs to do is run the command again with output redirected to the printer, and she’ll be ready for the meeting.

Line and record numbers are also commonly used in connection with selective printing, which we’ll discuss next.

4.3.4 Using backreferences and numbered variables in substitutions

Like grep (see table 3.8), sed recognizes parentheses as literal characters unless they’re backslashed, which turns them into parentheses that capture what’s matched by the regex between them. In contrast, Perl’s parentheses are of the capturing type unless they’re backslashed, which converts them to literal characters.⁴

The regex notation common to grep and sed allows you to use numbered backreferences such as \1 and \2 to refer to the captured matches. Although you can use these backreferences in both the search and replacement fields of sed’s substitution command, with Perl’s substitution operator they work only in the search field; however, you can use their dollar-prefixed relatives ($1, $2, etc.) in the replacement field, and elsewhere in the program too!

Backreferences and numbered variables are useful in substitutions where you need to interject new text between words that match patterns. For example, consider these sed and Perl solutions to the problem of inserting an individual’s personal name between his (occasionally dotted) title and (variously spelled) surname:

$ cat invite # Note Mr. vs Mr, and Bean vs. Been
Mr. Bean hereby requests the company of his noble
companion, Teddy, for high tea today with Mr Been.

$ sed 's/\(Mr\.\) \(Be[ea]n\)/\1 Jelly \2/g;' invite
Mr. Jelly Bean hereby requests the company of his noble
companion, Teddy, for high tea today with Mr Been.

$ perl -wpl -e 's/(Mr\.) (Be[ea]n)/$1 Jelly $2/g;' invite
... (same output)

Note the backslashing of parentheses and numbered backreferences with sed versus the use of plain parentheses and dollar variables with perl.

In addition, it’s noteworthy that only one substitution was performed by each command, due to the lack of a period after the second occurrence of “Mr”. But Perl can treat that period as optional and perform the second substitution, because, unlike sed, it supports the “?” metacharacter (see table 3.9):

$ perl -wpl -e 's/(Mr\.?) (Be[ea]n)/$1 Jelly $2/g;' invite
Mr. Jelly Bean hereby requests the company of his noble
companion, Teddy, for high tea today with Mr Jelly Been.

Now, we’ll turn our attention to another popular use of sed, which doesn’t involve substitutions.

4 To keep your mind from boggling, Perl’s policy is simply this: any backslashed symbol is always treated as a literal character.
