
CHAPTER 3: DATA PROCESSING

3.2 Data Validation

Another common use for awk programs is data validation: making sure that data is legal or at least plausible. This section contains several small programs that check input for validity. For example, consider the column-summing programs in the previous section. Are there any numeric fields where there should be nonnumeric ones, or vice versa? Such a program is very close to one we saw before, with the summing removed:

    # colcheck - check consistency of columns
    #   input:  rows of numbers and strings
    #   output: lines whose format differs from first line

    NR == 1 {
        nfld = NF
        for (i = 1; i <= NF; i++)
            type[i] = isnum($i)
    }
    {   if (NF != nfld)
            printf("line %d has %d fields instead of %d\n",
               NR, NF, nfld)
        for (i = 1; i <= NF; i++)
            if (isnum($i) != type[i])
                printf("field %d in line %d differs from line 1\n",
                   i, NR)
    }

    function isnum(n) { return n ~ /^[+-]?[0-9]+$/ }

The test for numbers is again just a sequence of digits with an optional sign; see the discussion of regular expressions in Section 2.1 for a more complete version.
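As an illustration of what a fuller numeric test might look like, here is a sketch that also accepts a decimal point and an exponent. The pattern is an assumption made for this example and may differ from the version discussed in Section 2.1; the shell function name isnum_demo and the sample values are likewise invented.

```shell
# isnum_demo: label each input line as a number or not (sketch only).
isnum_demo() {
    awk '
    # Assumed fuller test: optional sign, then digits with an optional
    # fraction (or a bare fraction), then an optional exponent.
    function isnum(n) {
        return n ~ /^[+-]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][+-]?[0-9]+)?$/
    }
    { print $0, (isnum($0) ? "number" : "not a number") }
    '
}

printf '%s\n' 42 -3.5 .5e-3 1e 12abc | isnum_demo
```

With this pattern, 42, -3.5 and .5e-3 are accepted, while 1e (an exponent with no digits) and 12abc are rejected.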

Balanced Delimiters

In the machine-readable text of this book, each program is introduced by a line beginning with .P1 and is terminated by a line beginning with .P2. These lines are text-formatting commands that make the programs come out in their distinctive font when the text is typeset. Since programs cannot be nested, these text-formatting commands must form an alternating sequence

    .P1 .P2 .P1 .P2 ... .P1 .P2

If one or the other of these delimiters is omitted, the output will be badly mangled by our text formatter. To make sure that the programs would be typeset properly, we wrote this tiny delimiter checker, which is typical of a large class of such programs:

    # p12check - check input for alternating .P1/.P2 delimiters

    /^\.P1/ { if (p != 0)
                  print ".P1 after .P1, line", NR
              p = 1
            }
    /^\.P2/ { if (p != 1)
                  print ".P2 with no preceding .P1, line", NR
              p = 0
            }
    END     { if (p != 0) print "missing .P2 at end" }

If the delimiters are in the right order, the variable p silently goes through the sequence of values 0 1 0 1 0 ... 1 0. Otherwise, the appropriate error messages are printed.
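To see the state variable at work, the checker can be tried on a deliberately faulty input. The sketch below condenses the same three rules into a shell function; the seven-line sample input is invented for this example.

```shell
# p12check condensed into a shell function for a quick trial (sketch).
p12check() {
    awk '
    /^\.P1/ { if (p != 0) print ".P1 after .P1, line", NR; p = 1 }
    /^\.P2/ { if (p != 1) print ".P2 with no preceding .P1, line", NR; p = 0 }
    END     { if (p != 0) print "missing .P2 at end" }
    '
}

# Invented input: a repeated .P1 on line 3, a repeated .P2 on line 6,
# and an unterminated .P1 on line 7.
printf '%s\n' .P1 code .P1 more .P2 .P2 .P1 | p12check
```

Here p takes the values 1, 1, 0, 0, 1 at the five delimiter lines, so each departure from the alternating sequence is reported, and the final value of 1 triggers the END message.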

Exercise 3-13. What is the best way to extend this program to handle multiple sets of delimiter pairs? □

Password-File Checking

The password file on a Unix system contains the name of and other information about authorized users. Each line of the password file has 7 fields, separated by colons:

    root:qyxRi2uhuVjrg:0:2::/:
    bwk:1L./v6iblzzNE:9:1:Brian Kernighan:/usr/bwk:
    ava:otxs1oTVoyvMQ:15:1:Al Aho:/usr/ava:
    uucp:xutiBs2hKtcls:48:1:uucp daemon:/usr/lib/uucp:uucico
    pjw:xNqy//GDc8FFg:170:2:Peter Weinberger:/usr/pjw:
    mark:jOz1fuQmqivdE:374:1:Mark Kernighan:/usr/bwk/mark:

The first field is the user's login name, which should be alphanumeric. The second is an encrypted version of the password; if this field is empty, anyone can log in pretending to be that user, while if there is a password, only people who know the password can log in. The third and fourth fields are supposed to be numeric. The sixth field should begin with /. The following program prints all lines that fail to satisfy these criteria, along with the number of the erroneous line and an appropriate diagnostic message. Running this program every night is a small part of keeping a system healthy and safe from intruders.

    # passwd - check password file

    BEGIN {
        FS = ":" }
    NF != 7 {
        printf("line %d, does not have 7 fields: %s\n", NR, $0) }
    $1 ~ /[^A-Za-z0-9]/ {
        printf("line %d, nonalphanumeric user id: %s\n", NR, $0) }
    $2 == "" {
        printf("line %d, no password: %s\n", NR, $0) }
    $3 ~ /[^0-9]/ {
        printf("line %d, nonnumeric user id: %s\n", NR, $0) }
    $4 ~ /[^0-9]/ {
        printf("line %d, nonnumeric group id: %s\n", NR, $0) }
    $6 !~ /^\// {
        printf("line %d, invalid login directory: %s\n", NR, $0) }

This is a good example of a program that can be developed incrementally: each time someone thinks of a new condition that should be checked, it can be added, so the program steadily becomes more thorough.
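As a sketch of such incremental growth, here are two further pattern-action pairs of the same shape. Both conditions (a repeated login name, and user id 0 on an account other than root) are hypothetical additions for illustration, not part of the program above; the function name passwd_extra and the sample lines are also invented.

```shell
# passwd_extra: two hypothetical extra checks in the same style (sketch).
passwd_extra() {
    awk -F: '
    # seen[$1]++ is 0 (false) the first time a name appears, nonzero after.
    seen[$1]++ {
        printf("line %d, duplicate login name: %s\n", NR, $0) }
    $3 == 0 && $1 != "root" {
        printf("line %d, user id 0 on non-root account: %s\n", NR, $0) }
    '
}

printf '%s\n' \
    'root:x:0:2::/:' \
    'bwk:x:9:1:Brian Kernighan:/usr/bwk:' \
    'bwk:y:10:1:Dup:/usr/bwk2:' \
    'toor:z:0:1:Other:/:' | passwd_extra
```

Each new rule is independent of the others, which is what makes this style of checker easy to extend one condition at a time.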

Generating Data-Validation Programs

We constructed the password-file checking program by hand, but a more interesting approach is to convert a set of conditions and messages into a checking program automatically. Here is a small set of error conditions and messages, where each condition is a pattern from the program above. The error message is to be printed for each input line where the condition is true.

    NF != 7                 does not have 7 fields
    $1 ~ /[^A-Za-z0-9]/     nonalphanumeric user id
    $2 == ""                no password

The following program converts these condition-message pairs into a checking program:

    # checkgen - generate data-checking program
    #     input:  expressions of the form: pattern tabs message
    #     output: program to print message when pattern matches

    BEGIN { FS = "\t+" }
    { printf("%s {\n\tprintf(\"line %%d, %s: %%s\\n\",NR,$0) }\n",
          $1, $2)
    }

The output is a sequence of conditions and the actions to print the corresponding messages:

    NF != 7 {
            printf("line %d, does not have 7 fields: %s\n",NR,$0) }
    $1 ~ /[^A-Za-z0-9]/ {
            printf("line %d, nonalphanumeric user id: %s\n",NR,$0) }
    $2 == "" {
            printf("line %d, no password: %s\n",NR,$0) }

When the resulting checking program is executed, each condition will be tested on each line, and if it is satisfied, the line number, error message, and input line will be printed. Note that in checkgen, some of the special characters in the printf format string must be quoted to produce a valid generated program. For example, % is preserved by writing %% and \n is created by writing \\n.

This technique in which one awk program creates another is broadly applicable (and of course it's not restricted to awk programs). We will see several more examples of its use throughout this book.
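The two-step workflow, generate and then run, might look like the following sketch. The temporary file names and the two-line sample input are assumptions made for this example. Because the generated program does not set the field separator itself, the colon is supplied here with -F on the command line.

```shell
# Sketch only: the file names under $tmp and the sample input are
# assumptions made for this example.
tmp=$(mktemp -d)

# checkgen, as given in the text.
cat > "$tmp/checkgen" <<'EOF'
BEGIN { FS = "\t+" }
{ printf("%s {\n\tprintf(\"line %%d, %s: %%s\\n\",NR,$0) }\n",
      $1, $2)
}
EOF

# Two condition-message pairs, each separated by a tab.
printf 'NF != 7\tdoes not have 7 fields\n'  > "$tmp/conds"
printf '$2 == ""\tno password\n'           >> "$tmp/conds"

# Step 1: generate the checking program.
awk -f "$tmp/checkgen" "$tmp/conds" > "$tmp/checker"

# Step 2: run the generated program on some data.
result=$(printf 'root::0:2::/:\nshort:line\n' | awk -F: -f "$tmp/checker")
printf '%s\n' "$result"

rm -r "$tmp"
```

The first sample line has seven fields but an empty password, and the second has only two fields, so each triggers exactly one of the generated rules.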

Exercise 3-14. Add a facility to checkgen so that pieces of code can be passed through verbatim, for example, to create a BEGIN action to set the field separator. □

Which Version of AWK?

Awk is often useful for inspecting programs, or for organizing the activities of other testing programs. This section contains a somewhat incestuous example: a program that examines awk programs.

The new version of the language has more built-in variables and functions, so there is a chance that an old program may inadvertently include one of these names, for example, by using as a variable name a word like sub that is now a built-in function. The following program does a reasonable job of detecting such problems in old programs:

    # compat - check if awk program uses new built-in names

    BEGIN { asplit("close system atan2 sin cos rand srand " \
                   "match sub gsub", fcns)
            asplit("ARGC ARGV FNR RSTART RLENGTH SUBSEP", vars)
            asplit("do delete function return", keys)
          }

          { line = $0 }

    /"/   { gsub(/"([^"]|\\")*"/, "", line) }     # remove strings,
    /\//  { gsub(/\/([^\/]|\\\/)+\//, "", line) } # reg exprs,
    /#/   { sub(/#.*/, "", line) }                # and comments

          { n = split(line, x, "[^A-Za-z0-9_]+")  # into words
            for (i = 1; i <= n; i++) {
                if (x[i] in fcns)
                    warn(x[i] " is now a built-in function")
                if (x[i] in vars)
                    warn(x[i] " is now a built-in variable")
                if (x[i] in keys)
                    warn(x[i] " is now a keyword")
            }
          }

    function asplit(str, arr) {  # make an assoc array from str
        n = split(str, temp)
        for (i = 1; i <= n; i++)
            arr[temp[i]]++
        return n
    }

    function warn(s) {
        sub(/^[ \t]*/, "")
        printf("file %s, line %d: %s\n\t%s\n", FILENAME, FNR, s, $0)
    }

The only real complexity in this program is in the substitution commands that attempt to remove quoted strings, regular expressions, and comments before an input line is checked. This job isn't done perfectly, so some lines may not be properly processed.

The third argument of the first split function is a string that is interpreted as a regular expression. The leftmost longest substrings matched by this regular expression in the input line become the field separators. The split command divides the resulting input line into alphanumeric strings by using nonalphanumeric strings as the field separator; this removes all the operators and punctuation at once.
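The effect of splitting on a regular expression can be tried in isolation. In the sketch below, the function name words and the sample line are invented; unlike compat, it does not strip strings, regular expressions, or comments first, and it skips the empty element that split produces when the line ends with a separator.

```shell
# words: print the alphanumeric words of each input line (sketch only).
words() {
    awk '{
        # Every run of characters outside A-Za-z0-9_ acts as one separator.
        n = split($0, x, "[^A-Za-z0-9_]+")
        for (i = 1; i <= n; i++)
            if (x[i] != "")      # skip empty leading or trailing pieces
                print x[i]
    }'
}

printf '%s\n' 'if (x[i] in vars) n = sub(/a/, "b")' | words
```

The parentheses, brackets, operators, slashes, and quotes all vanish in the single call to split, leaving only the words if, x, i, in, vars, n, sub, a, and b.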

The function asplit is just like split, except that it creates an array whose subscripts are the words within the string. Incoming words can then be tested for membership in this array.
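A minimal sketch of the membership test follows. It departs from the listing in one respect, which is an assumption of this example: n, i, and temp are declared as extra parameters so that they are local to asplit. The function name is_builtin and the three-word list are invented.

```shell
# is_builtin: report whether the first word of each line is in a set (sketch).
is_builtin() {
    awk '
    # Variation on asplit: n, i, temp are locals (extra parameters).
    function asplit(str, arr,    n, i, temp) {
        n = split(str, temp)
        for (i = 1; i <= n; i++)
            arr[temp[i]]++       # the word itself becomes the subscript
        return n
    }
    BEGIN { asplit("match sub gsub", fcns) }
    { print $1, (($1 in fcns) ? "yes" : "no") }
    '
}

printf '%s\n' sub total gsub | is_builtin
```

Because the words are subscripts rather than values, the in operator answers the membership question directly, without a loop over the array.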

This is the output of compat on itself:

    file compat, line 12: gsub is now a built-in function
            /\//  { gsub(/\/([^\/]|\\\/)+\//, "", line) } # reg exprs,
    file compat, line 13: sub is now a built-in function
            /#/   { sub(/#.*/, "", line) }                # and comments
    file compat, line 26: function is now a keyword
            function asplit(str, arr) {  # make an assoc array from str
    file compat, line 30: return is now a keyword
            return n
    file compat, line 33: function is now a keyword
            function warn(s) {
    file compat, line 34: sub is now a built-in function
            sub(/^[ \t]*/, "")
    file compat, line 35: FNR is now a built-in variable
            printf("file %s, line %d: %s\n\t%s\n", FILENAME, FNR, s, $0)

Exercise 3-15. Rewrite compat to identify keywords, etc., with regular expressions instead of the function asplit. Compare the two versions on complexity and speed. □

Exercise 3-16. Because awk variables are not declared, a misspelled name will not be detected. Write a program to identify names that are used only once. To make it truly useful, you will have to handle function declarations and variables used in functions. □
