5.4.3 Accessing cell data using array indexing

As you know, one convenient way to access elements of arrays is to copy them from the array into a list of named variables and then extract the values from those variables. But in cases where you’ve got dozens or even thousands of values to choose from, this approach isn’t practical, because typing the ($A, $B, etc.) = @F assignment statement is too burdensome.

The alternative is to use array indexing to extract specific values directly from the array. You do this by changing the @ symbol in the array name to a $ sign and appending a pair of square brackets with an integer number (or an expression that evaluates to one) between them.

As shown in table 5.9, Perl is unusual in providing both positive and negative indexing. The benefit is that the last element of the array, for example, can be retrieved using either the index of -1 or the index of N-1, where N is the total number of elements. (The maximum index is one less than the number of elements, because the indices start from 0, rather than 1.)
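To make the two indexing directions concrete, here is a minimal sketch. The four-word record is made up for illustration, and @F is filled by hand with split rather than by the a option, so the example can run on its own.

```perl
use strict;
use warnings;

# Hypothetical record; with the -a option Perl would fill @F in this way.
my @F = split ' ', 'alpha beta gamma delta';

my $n = @F;    # number of elements (4 for this record)

print "first element:  $F[0]\n";         # positive index, from the start
print "last element:   $F[-1]\n";        # negative index, from the end
print "last element:   $F[$n - 1]\n";    # same element, via N-1
print "second element: $F[1] (also reachable as $F[-3])\n";
```

Both "last element" lines print delta, because -1 and N-1 name the same storage position.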

Now let’s examine Patrick’s script, called mean_annual_precip, which is compact, sweet, and powerful—like his espresso:

    #! /usr/bin/perl -00 -wnla
    # Parses "Operational Climatic Data Summary" reports to extract
    # and print "mean annual precipitation" statistic for each file.

    # Find precipitation record, and print its field #33 (index 32)
    /^ 2\.  PRECIPITATION / and print "\u$ARGV: $F[32]";

The script is designed to read multiple files in sequence (via option n) using paragraph mode (-00) and to automatically load the fields of each record into @F (using a). To home in on the correct portion of each city’s data file, the matching operator is used to find the paragraph with the PRECIPITATION heading by looking at the beginning of a paragraph for a space, a 2, a literal dot, and then two spaces before that word.

Table 5.9 Illustration of array indexing syntax using the field array, @F

    Storage position     1st      2nd      3rd      4th (and last)
    Positive indexing    $F[0]    $F[1]    $F[2]    $F[3]
    Negative indexing    $F[-4]   $F[-3]   $F[-2]   $F[-1]

Next, the logical and is used to print the file’s name (via $ARGV; see table 2.7) followed by a colon, a space, and the value of the appropriate field within the record.

Notice the use of \u before $ARGV (see table 4.5), which conveniently capitalizes the first letter of each city’s name. To help you visualize the implicit lines-to-record mapping that determines the index numbers for the fields in @F, consider the following data file:

    A B
    C D

When read in line mode, each line would have two fields, and the maximum index for each record’s @F would be 1.

However, if you read the same file in paragraph mode, each field would be considered part of the same record and treated for field-numbering purposes as if the input had looked like this:

    A B C D

Therefore, field “D” would be stored under the index of 3, rather than under 1, as it would be with input records defined as lines.

Using this logic, Patrick determines that the number of the field he’s interested in is 33, which means it can be retrieved with the index 32.
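The difference between the two record definitions can be checked with a short sketch like the following. The records_as_fields helper is a hypothetical name introduced for this illustration, and it reads the sample data from an in-memory filehandle rather than from a file.

```perl
use strict;
use warnings;

# Returns one array reference of fields per record read from $text.
# $separator is the input record separator: "\n" selects line mode,
# and "" selects paragraph mode (as the -00 option does).
sub records_as_fields {
    my ($text, $separator) = @_;
    local $/ = $separator;
    open my $fh, '<', \$text or die "Cannot read in-memory data: $!";
    my @records;
    while (my $record = <$fh>) {
        push @records, [ split ' ', $record ];   # like -a filling @F
    }
    close $fh;
    return @records;
}

my $data = "A B\nC D\n";

my @lines      = records_as_fields($data, "\n");
my @paragraphs = records_as_fields($data, "");

my $index_in_line      = $#{ $lines[1] };        # D is last on line 2
my $index_in_paragraph = $#{ $paragraphs[0] };   # D is last in paragraph 1

print "line mode:      D has index $index_in_line\n";
print "paragraph mode: D has index $index_in_paragraph\n";
```

In line mode the sketch reports an index of 1 for D, and in paragraph mode it reports 3, matching the reasoning above.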

Okay, now it’s time to run the script, and determine which city is the rainiest:²⁰

    $ mean_annual_precip miami new_york seattle
    Miami: 57.9
    New_york: 41.2
    Seattle: 38.3

Patrick gently breaks the news to Guillermo about Miami being rainier than Seattle. After a brief period of shock-induced choking on his carob-iced hemp biscotti, Guillermo congratulates himself on being wise enough to relocate to (relatively) rain-free Seattle.

But Patrick is left wondering why Seattle has such an oversaturated reputation, given that even New York has more rain. In an attempt to reconcile his observations with reality, he speculates that Seattle’s distinction might be its number of days with significant rainfall, as opposed to its annual amount of rainfall.

²⁰ When perl is invoked with the -00 -wnla options, the l option strips blank lines from the end of each record, not newlines. Field 33 occurs at the end of a line, so it still has the line-ending newline attached at its end. But the l option automatically adds a newline at the end of print’s argument list, making the output double-spaced. To avoid this effect, a chomp $F[32]; statement could be added above the print statement to remove any trailing newline present in the field variable. You’ll learn more about chomp in section 7.2.4.

Comparing cities for “days of significant rain”

Patrick begins work on a new script to compare the “number of rainy days” statistic from the weather files of the three cities. But instead of writing a script specific to this particular task, he decides to make it a very general one that’s configurable by switches, to liberate him from the chore of writing another new script as each new dispute arises in the future.

With the weather-data files, the script could use a matching operator to select the desired record, as mean_annual_precip did. But that technique can’t be expected to work with every file, because it requires the record of interest to have a unique marker within it, such as the precipitation record’s PRECIPITATION. Accordingly, Patrick opts for the more general approach of selecting records by number.

To give the script the flexibility it needs, he lets the user specify the number of the record and the field of interest within it using command-line switches. Prior to testing his new script against mean_annual_precip, he identifies the number of the PRECIPITATION record in the weather files using this one-liner:

$ perl -00 -wnl -e 'print "Paragraph #$.:\n$_";' seattle

Now that Patrick knows that the paragraph number for precipitation data is 6, he can test his new script by specifying that record number²¹ along with the number of the desired field within it. The location described by those attributes is like a cell in a two-dimensional table, inasmuch as it occurs at the intersection of a horizontal element (in this case, a paragraph) and a vertical one (a column); so, he dubs the new script extract_cell. Patrick then tries it with an invocation that should produce the same results as mean_annual_precip:

    $ extract_cell -recnum=6 -fnum=33 miami new_york seattle
    Miami: 57.9
    ...

As he hoped, the new, more flexible script produced the same output as the earlier one.

²¹ It will be 6 no matter how many extra blank lines occur between the paragraphs, which is critical, because the weather files for different locations differ in that respect (as mentioned earlier).

But did you notice that the argument for the -fnum switch is 33 with extract_cell, whereas mean_annual_precip used 32 to access the same field?

That’s because the new script provides the useful service of automatically decrementing the given field number by one to convert it into an array index, so the user won’t have to remember to do that.

Here’s the extract_cell script:

    #! /usr/bin/perl -s -00 -wnla
    # Prints field indicated by the $recnum/$fnum combination,
    # preceded by filename

    BEGIN {
        $fnum and $recnum or
            warn "Usage: $0 -recnum=M -fnum=N\n" and exit 255;

        # Decrement field number, so user can say 1, and get index of 0
        $index=$fnum - 1;
    }

    $. == $recnum and print "\u$ARGV: $F[$index]";

    # Reset record counter $. after end of each file
    eof and close ARGV;

The script begins by checking that both of the obligatory switches have been supplied and by issuing a “Usage:” message and exiting if they weren’t. The last statement of the script senses whether input from the current file has been exhausted by calling the built-in eof function; if that’s true, the script closes the current file by referencing its filehandle, ARGV.²² The effect is to reset the “$.” variable back to 1 for the next file, so the program can continue to correctly identify the record number desired by the value of “$.”.
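The effect of closing ARGV at the end of each file can be seen in a small standalone sketch. It assumes two throwaway two-line files created with File::Temp (made-up data, not the weather files), and it keeps its own running counter to show what “$.” would report without the reset.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create two throwaway two-line files (hypothetical data) to read via <>.
my @files;
for my $content ("a1\na2\n", "b1\nb2\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "Cannot close $name: $!";
    push @files, $name;
}

my (@per_file, @overall);
my $total = 0;
{
    local @ARGV = @files;
    while (<>) {
        push @per_file, $.;          # restarts at 1 because of the close
        push @overall,  ++$total;    # hand-kept count across all files
        close ARGV if eof;           # resets $. before the next file
    }
}

print "per-file record numbers: @per_file\n";
print "overall record numbers:  @overall\n";
```

The per-file numbers come out as 1 2 1 2, whereas the hand-kept count runs 1 2 3 4, which is the numbering “$.” would show if the close were left out.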

It’s time to test Guillermo’s theory about Seattle being remarkable for its number of days with significant rainfall rather than its mean amount of rainfall. As it happens, section 2 of the precipitation report contains a statistic that seems well suited to this comparison—it reports the number of days per year on which at least half an inch of rain has fallen:

²² This is how Perl programmers get “$.” to behave like AWK’s FNR rather than AWK’s NR (see table 5.2).

Oops, there’s a snag—Patrick’s eyes keep glazing over²³ as he tries to determine the number of the last field within that data paragraph, which is the one on the last line and under the ANN (for “annual”) column.

However, once he realizes that there are 12 JAN-DEC columns on each line, plus an extra column at the beginning and end of that group for a total of 14 per line, plus the 3 fields for the paragraph heading and a few more for the row labels, he’s able to calculate the field number as 101.

Which of the three cities has the largest number of days per annum with at least half an inch of rain? May I have the envelope, please:

    $ extract_cell -recnum=6 -fnum=101 miami new_york seattle
    Miami: 24
    New_york: 27
    Seattle: 22

It’s New York! Vitas still can’t accept the idea that New York could be rainier than Seattle, so he asks to review the script. Although he is unable to find any fault with it, he does come up with an idea for alleviating the need for those grueling field calculations in the future.

Specifically, he points out that counting 101 fields to identify the last one in the record isn’t necessary, because Perl recognizes the index of -1 as referring to that element. Vitas goes on to say, like, in like fashion, -2 could be used to refer to the second element from the end, -20 to the twentieth from the end, and so forth (see table 5.9).

Patrick is excited to learn this, but he realizes that he must modify the script to avoid decrementing arguments to -fnum= that are already suitable index values, like -1, while still allowing conventional field-count values, such as 33, to receive the decrementing treatment they require. For additional flexibility, he modifies the script to default to line mode but to allow paragraph mode to be enabled by a command-line switch when desired. This requires removing the -00 invocation option from the shebang line and selectively enabling paragraph mode by conditionally assigning a null string to $/ (see table 2.7).

Here’s the new version, called extract_cell2:

    #! /usr/bin/perl -s -wnla
    # Prints field indicated by $recnum/$fnum, preceded by filename.
    # -fnum switch handles field numbers as well as negative indices.

    our ($p);    # -p switch for paragraph mode is optional

    BEGIN {
        $fnum and $recnum or
            warn "Usage: $0 -recnum=M -fnum=N\n" and exit 255;

        # Decrement positive fnum, so user can say 1, and get index of 0
        # But don't decrement negative values; they're indices already!
        $index=$fnum;                # initially assume $fnum is an index
        $index >= 1 and $index--;    # make it an index if it wasn't
        $p and $/="";                # set paragraph mode if requested
    }

    $. == $recnum and print "\u$ARGV: $F[$index]";

    # Reset record counter $. after end of each file
    eof and close ARGV;

²³ He can’t help but wonder: could that genetically engineered low-carb zest be to blame?

A quick test shows that the new script successfully extracts the same figures as its predecessor:

    $ extract_cell2 -p -recnum=6 -fnum=-1 miami new_york seattle
    Miami: 24
    ...

Notice the use of the -p switch to enable paragraph mode, which isn’t hard-coded within the script as it was in the earlier version, and the use of the convenient -1 index, rather than the field number of 101, to specify the last field of the record.
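The rule that lets -fnum accept both field numbers and negative indices can be isolated in a few lines. This is only a sketch of the same logic as a function; the name fnum_to_index and the five-field sample record are assumptions made for the illustration, not part of extract_cell2.

```perl
use strict;
use warnings;

# Converts a -fnum style argument into an array index: positive field
# numbers are decremented, and negative values are used as they are,
# because they are valid indices already.
sub fnum_to_index {
    my ($fnum) = @_;
    return $fnum >= 1 ? $fnum - 1 : $fnum;
}

# Hypothetical record of five fields.
my @F = qw(JAN FEB MAR APR ANN);

for my $fnum (1, 5, -1, -2) {
    my $index = fnum_to_index($fnum);
    print "-fnum=$fnum selects index $index, which holds $F[$index]\n";
}
```

With this record, -fnum=5 and -fnum=-1 both select ANN, which is why the last field can be requested without counting the fields first.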

Although the script seems to work properly for the data on the first page of the weather file, Patrick wants to be sure the script will work on the subsequent pages too, so he examines them in greater detail. While doing so, he’s excited to find that they all contain a data paragraph called FREQ RAIN AND/OR DRIZZLE, which sounds like a measure on which Seattle might really shine.

Comparing cities for “sogginess”

In preparation for comparing cities for frequency of rain and/or drizzle (which we’ll call sogginess for short), Patrick improvises a paragraph-numbering one-liner to identify the record number for the relevant data paragraph:

    $ perl -00 -wnl -e '/12\..*DRIZZLE / and
    > print "Paragraph #$.:\n$_";' seattle
    ...

The appropriate field for comparison is the one in the bottom-right corner of the paragraph, at the intersection of ALLHOURS and ANN. The 15 there indicates that over periods of an entire year, rain or drizzle has occurred 15 percent of the time. Seattle’s track record seems like a pretty formidable one to beat, but let’s see if Miami or New York is up to the challenge:

    $ extract_cell2 -p -recnum=23 -fnum=-1 miami new_york seattle
    Miami: 5
    New_york: 9
    Seattle: 15

Hooray! Finally, Seattle comes up a winner! Now you can understand why it has such a soggy reputation, despite its relatively unimpressive amounts of annual rainfall and days with significant rain. As any resident can tell you, it’s the frequent drizzling that makes Seattle so refreshingly moist and inviting!

Let’s sum up what we’ve learned about Perl from Patrick and his friends. The grep command couldn’t have done the job of mean_annual_precip or the extract_cell* scripts, because it’s specialized for extracting and displaying entire records. The sed command, on the other hand, could at least have whittled down the appropriate record to the correct field for us, if only it knew how to handle multi-line records.

But by using Perl’s AWK-like features, it was easy to arrange for the conditional printing of a specified field from a particular record, prefixed by the name of the file from which it came. Along the way, we added a useful tool to our collection—a script that extracts and prints a cell from a table.

Now it’s time to discuss a powerful yet easy-to-use feature of AWK and Perl that’s not widely known: the ability to apply processing to a range of input records.
