Perl as a (better) awk command

5.2 Comparing basic features of AWK and Perl

This section provides an overview of how AWK and Perl compare in terms of their most fundamental capabilities. Later, we’ll discuss more specific differences (in built-in functions, operators, etc.) in the context of illustrative programming examples.

Because Perl provides a nearly complete⁴ re-creation of an AWK-like programming environment (albeit with a different syntax), there aren’t many ways in which Perl can be said to beat AWK at its own game. However, Perl provides features that go well beyond those of its influential predecessor, allowing the use of AWKish programming techniques with a much wider variety of applications (e.g., networked, database-oriented, and object-oriented).

3 AWK’s earliest comprehensive documentation was in The UNIX Programming Environment by Brian Kernighan and Rob Pike (Prentice-Hall, 1984). The first book devoted to AWK was The AWK Programming Language (Addison-Wesley, 1988), by AWK’s creators, Al Aho, Peter Weinberger, and Brian Kernighan (hence the name).

4 AWK does have some features Perl lacks; e.g., all AWK versions allow the field separator to be changed during execution (via the FS variable), although I’ve never heard of anyone exploiting this possibility. When I asked Larry why he didn’t include an FS-like variable in Perl, his typically enigmatic response was, “AWK has to be better at something!”

Perl also provides a richer infrastructure that makes its programmers more productive, through its module-inclusion mechanism and the availability of thousands of high-quality pre-written modules from the Comprehensive Perl Archive Network (CPAN; see chapter 12).

Because both languages are rightly famous for their pattern-matching capabilities, let’s see how they stack up in this respect.

5.2.1 Pattern-matching capabilities

Table 5.1 lists the most important differences between noteworthy AWK versions and Perl, which pertain to their fundamental capabilities for pattern matching and related operations.⁵

The comparisons in the upper panel of table 5.1 refer to the capabilities of the different regex dialects, those in the middle to the way in which matching is performed, and those in the lower panel to other special features. By observing the increasing number of Ys as you move from Classic AWK’s column to Perl’s, you can see that GAWK’s capabilities are a superset of AWK’s, whereas Perl’s capabilities are generally a superset of GAWK’s.

Perl’s additional capabilities are most clearly indicated in the top and bottom panels, which reflect its richer collection of regular expression metacharacters and other special features we’ll cover later in this chapter.

Because AWK has inherited many characteristics from grep and sed, it’s no surprise that the AWK versus Perl comparisons largely echo the findings of the grep versus Perl and sed versus Perl comparisons in earlier chapters. Most of the listed capabilities have already been discussed in chapter 3 or 4, so here we’ll concentrate on the new ones: stingy matching and record-separator matching.

Stingy matching

Stingy matching is an option provided by Perl to match as little as possible, rather than as much as possible, which is the greedy behavior used by Unix utilities (and by Perl by default). You enable it by appending a “?” to a quantifier (see table 3.9), most commonly “+”, which means “one or more of the preceding.”

5 There’s no separate column for POSIX AWK because its capabilities are duplicated in GNU AWK.

The stingy (as in miserly) matching option is valued because it makes certain patterns much easier to write. For example, stingy matching lets you use ^.+?: to capture the first field of a line in the /etc/passwd file, by matching the shortest sequence starting at the beginning that ends in a colon (the field separator for that file). In contrast, many beginners would make the mistake of using the greedy pattern ^.+: in an attempt to get the same result. This pattern matches across as many characters as needed, including colons, along its way to matching the required colon at the end, resulting in fields one through six being matched rather than only field one.

Perl’s ability to do stingy matching gives it an edge over AWK.

Record-separator matching

Table 5.1 Differences in pattern-matching capabilities of AWK versions and Perl

| Capability (a) | Classic AWK | GAWK (b) | Perl |
|---|---|---|---|
| Word boundary metacharacter | – | Y | Y |
| Compact character-class shortcuts | – | ? | Y |
| Control character representation | Y | Y | Y |
| Repetition ranges | – | Y | Y |
| Capturing parentheses and backreferences | – | ? (c) | Y |
| Metacharacter quoting | ? | ? | Y |
| Embedded commentary | – | – | Y |
| Advanced RE features | – | – | Y |
| Stingy matching | – | – | Y |
| Record-separator matching | – | – | Y |
| Case insensitivity | – | Y | Y |
| Arbitrary record definitions | Y | Y+ (d) | Y |
| Line-spanning matches | Y | Y | Y |
| Binary-file processing | Y | Y | Y |
| Directory-file skipping | – | Y | Y+ |
| Match highlighting | – | – | ? |
| Custom output formatting | Y | Y | Y |
| Arbitrary delimiters | – | – | Y+ |
| Access to match components | – | – | Y |
| Customized replacements | – | – | Y+ |
| File modifications | – | – | Y |

a. Y: has this capability; Y+: has this capability with enhancements; ?: partially has this capability; –: doesn’t have this capability
b. Using POSIX-compliant features and GNU extensions
c. Works only with certain functions
d. Allows the specification of a record separator via regex

Perl’s capability of record-separator matching allows you to match a newline (or a custom record separator), which is not allowed by any of the regex-oriented Unix utilities (grep, sed, awk, vi, etc.). You could use this option, for example, to find a “Z” occurring at the end of one line that is immediately followed by an “A” at the beginning of the next line, using Z\nA as your regex. It’s difficult to work around the absence of this capability when you really need it, which gives Perl an advantage over AWK (and every other Unix utility) for having it.

Now that we’ve compared the pattern-matching capabilities of AWK and Perl, we’ll next compare the sets of special variables provided by the languages.

5.2.2 Special variables

Both AWK and Perl provide the programmer with a rich collection of special variables whose values are set automatically in response to various program activities (see table 5.2). A syntactic difference is that almost all AWK variables are named by sequences of uppercase letters, whereas most Perl variables have $-prefixed symbols for names.

The fact that Perl provides variables that correspond to AWK’s $0, NR, RS, ORS, OFS, ARGV, and FILENAME attests to the substantial overlap between the languages and tells you that the AWKish programming mindset is well accommodated in Perl.

For instance, after an input record has been automatically read, both languages update a special variable to reflect the total number of records that have been read thus far.

Some bad news in table 5.2 for AWKiologists is that the Perl names for variables that provide the same information are different (e.g., the record-counting variables “$.” vs. NR), and the only name that is the same ($0) means something different in the languages.⁶

6 As discussed in section 2.4.4, $0 knows the name used in the Perl script’s invocation and is routinely used in warn and die messages. Perl will actually let you use AWK variable names in your Perl programs (see man English), but in the long run, you’re better off using the Perl variables.

Table 5.2 Comparison of special variables in AWK and Perl

| Modern AWKs (a) | Perl | Comments |
|---|---|---|
| $0 | $_ | AWK’s $0 holds the contents of the current input record. In Perl, $0 holds the script’s name, and $_ holds the current input record. |
| $1 | $F[0] | These variables hold the first field (b) of the current input record; $2 and $F[1] would hold the second field, and so forth. |
| NR | $. | The “record number” variable holds the ordinal number of the most recent input record. (c) After reading a two-line file followed by a three-line file, its value is 5. |
| FNR | N/A | The file-specific “record number” variable holds the ordinal number of the most recent input record from the most recently read file. After reading a two-line file followed by a three-line file, its value is 3. In Perl programs that use eof and close ARGV, (d) “$.” acts like FNR. (c) |
| RS | $/ | The “input record separator” variable defines what constitutes the end of an input record. In AWK, it’s a linefeed by default, whereas in Perl, it’s an OS-appropriate default. Note that AWK allows this variable to be set to a regex, whereas in Perl it can only be set to a literal string. |
| ORS | $\ | The “output record separator” variable specifies the character or sequence for print to append to the end of each output record. In AWK, it’s a linefeed by default, whereas in Perl, it’s an OS-appropriate default. |
| FS | N/A | AWK allows its “input field separator” to be defined via an assignment to FS or by using the -F'sep' invocation option; the former approach allows it to be set and/or changed during execution. Perl also allows the run-time setting (using the -F'sep' option) but lacks an associated variable and therefore the capability to change the input field separator during execution. |
| OFS | $, | The “output field separator” variable specifies the string to be used on output in place of the commas between print’s arguments. In Perl, this string is also used to separate elements of arrays whose names appear unquoted in print’s argument list. |
| NF | @F | The “number of fields” variable indicates the number of fields in the current record. Perl’s @F variable is used to access the same information (see section 7.1.1). |
| ARGV | @ARGV | The “argument vector” variable holds the script’s arguments. |
| ARGC | N/A | The “argument count” variable reports the script’s number of arguments. In Perl, you can use $ARGC=@ARGV; to load that value into a similarly named variable. |
| FILENAME | $ARGV | These variables contain the name of the file that has most recently provided input to the program. (c) |
| N/A | $& | This variable contains the last match. (e) |
| N/A | $` | This variable contains the portion of the matched record that comes before the beginning of the most recent match. (e) |
| N/A | $' | This variable contains the portion of the matched record that comes after the end of the most recent match. (e) |
| RSTART | N/A | This variable provides the location of the beginning of the last match. Perl uses pos()-length($&) to obtain this information. |
| RLENGTH | N/A | This variable provides the length in bytes of the last match. Perl uses length($&) to obtain this information. |

a. Some of the listed variables were not present in classic AWK.
b. Requires use of the -n or -p invocation option, along with -a, in Perl.
c. Requires use of the -n or -p invocation option in Perl.
d. For example, see the extract_cell script in section 5.4.3.
e. You can obtain the same information in AWK by applying the substr function to the matched record with suitable arguments (generally involving RSTART and/or RLENGTH).

Another difference is that in some cases one language makes certain types of information much easier to obtain than the other (e.g., see the entries for Perl’s “$`” and AWK’s RSTART in table 5.2).

Once these variations and the fundamental syntax differences between the languages are properly taken into account, it’s not difficult to write Perl programs that are equivalent to common AWK programs. For example, here are AWK and Perl programs that display the contents of file with prepended line numbers, using equivalent special variables:

awk '{ print NR ": " $0 }' file
perl -wnl -e 'print $., ": ", $_;' file

The languages differ in another respect that allows print statements to be written more concisely in Perl than in AWK. We’ll discuss it next.

5.2.3 Perl’s variable interpolation

Like the Shell, but unlike AWK, Perl allows variables to be interpolated within double-quoted strings, which means the variable names are replaced by their contents.⁷ This lets you view the double-quoted string as a template describing the format of the desired result and include variables, string escapes (such as \t), and literal text within it. As a result, many print statements become much easier to write, as well as to read.

For example, you can write a more succinct and more readable Perl counterpart to the earlier AWK line-numbering program by using variable interpolation:

perl -wnl -e 'print $., ": ", $_;' file    # literal translation
perl -wnl -e 'print "$.: $_";' file        # better translation

It’s a lot easier to see that the second version is printing the record-number variable, a colon, a space, and the current record than it is to surmise what the first version is doing, which requires mentally filtering out a lot of commas.

What’s more, Perl’s variable interpolation also occurs in regex fields, which allows variable names to be included along with other pattern elements.

For instance, to match and print an input record that consists entirely of a Zip Code, a Perl programmer can write a matching operator in this manner:

/^$zip_code$/ and print;

Note the use of the variable to insert the metacharacters that match the digits of the Zip Code between the anchor metacharacters.

In contrast, an AWK programmer, lacking variable interpolation, has to concatenate (by juxtaposition) quoted and unquoted elements to compose the same regex:⁸

$0 ~ "^" zip_code "$"

7 In Shell-speak, this process is called variable substitution rather than variable interpolation.

8 When constructing regexes in this way, AWK needs to be instructed to match against the current input line with the $0 ~ regex notation.

These statements do the same job (thanks to AWK’s automatic and print), but because Perl has variable interpolation, its solution is more straightforward.

We’ll consider some of Perl’s other advantages next.

5.2.4 Other advantages of Perl over AWK

As discussed in section 4.7, Perl provides in-place editing of input files, through the -i.ext option. This makes it easy for the programmer to save the results of editing operations back in the original file(s). AWK lacks this capability.

Another potential advantage is that in Perl, automatic field processing is disabled by default, so JAPHs only pay its performance penalty in the programs that benefit from it. In contrast, all AWK programs split input records into fields and assign them to variables, whether fields are used in the program or not.⁹

Next, we’ll summarize the results of the language comparison.

5.2.5 Summary of differences in basic features

Here are the most noteworthy differences between AWK and Perl that were touched on in the preceding discussion and in the comparisons of tables 5.1 and 5.2.

Ways in which Perl is superior to AWK

Perl alone (see tables 5.1 and 5.2) provides these useful pattern-matching capabilities:

• Metacharacter quoting, embedded commentary in regexes, stingy matching, record separator matching, and freely usable backreferences

• Arbitrary regex delimiters, access to match components, customized replacements in substitutions, and file modifications

• Easy access to the contents of the last match, and the portion of the matched record that comes before or after the match

Only Perl provides variable interpolation, which

• allows the contents of variables to be inserted into quoted strings and regex fields. This feature makes complex programs much easier to write, read, and maintain, and can be used to good advantage in most programs.

Perl alone has in-place editing.

Only Perl has a module-inclusion mechanism, which lets programmers

• package bundles of code for easy reuse;

• download many thousands of freely available modules from the CPAN.

9 Depending on the number of records being processed and the number of fields per record, it seems that AWK could waste a substantial amount of computer time in needless field processing.

Ways in which AWK is superior to Perl

Many simple AWK programs are shorter than their Perl counterparts, in part because “and print” must always be explicitly stated in grep-like Perl programs, whereas it’s implicit in AWK.

It’s easier in AWK than in Perl (see table 5.2) to

• determine a script’s number of arguments;

• obtain a file-specific record number;

• determine the position within a record where the latest match began.

However, to put these differences into proper perspective, Perl’s listed advantages are of much greater significance than AWK’s, because there’s almost nothing that AWK can do that can’t also be done with Perl, although the reverse isn’t true.

Now that you’ve had a general orientation to the most notable differences between AWK and Perl, it’s time to learn how to use Perl to write AWKish programs.

From the document Minimal Perl (pages 156-163)