
CHAPTER 7: LEXICAL ANALYSIS AND STOPLISTS

7.2 LEXICAL ANALYSIS

7.2.4 Implementing a Lexical Analyzer

Lexical analysis for information retrieval systems is the same as lexical analysis for other text processing systems; in particular, it is the same as lexical analysis for program translators. This problem has been studied thoroughly, so we ought to adopt the solutions in the program translation literature (Aho, Sethi, and Ullman 1986). There are three ways to implement a lexical analyzer:

Use a lexical analyzer generator, like the UNIX tool lex (Lesk 1975), to generate a lexical analyzer automatically;

Write a lexical analyzer by hand ad hoc; or

Write a lexical analyzer by hand as a finite state machine.

The first approach, using a lexical analyzer generator, is best when the lexical analyzer is complicated; if the lexical analyzer is simple, it is usually easier to implement it by hand. In our discussion of stoplists below, we present a special purpose lexical analyzer generator for automatic indexing that produces efficient lexical analyzers that filter stoplist words. Consequently, we defer further discussion of this alternative.

The second alternative is the worst. An ad hoc algorithm, written just for the problem at hand in whatever way the programmer can think to do it, is likely to contain subtle errors. Furthermore, finite state machine algorithms are extremely fast, so ad hoc algorithms are likely to be less efficient.

The third approach is the one we present in this section. We assume some knowledge of finite state machines (also called finite automata), and their use in program translation systems. Readers unfamiliar with these topics can consult Hopcroft and Ullman (1979), and Aho, Sethi, and Ullman (1986). Our example is an implementation of a query lexical analyzer as described above.

The easiest way to begin a finite state machine implementation is to draw a transition diagram for the target machine. A transition diagram for a machine recognizing tokens for our example query lexical analyzer is pictured in Figure 7.1.

In this diagram, characters fall into ten classes: space characters, letters, digits, the left and right parentheses, ampersand, bar, caret, the end of string character, and all other characters. The first step in implementing this finite state machine is to build a mechanism for classifying characters. The easiest and fastest way to do this is to preload an array with the character classes for the character set. Assuming the ASCII character set, such an array would contain 128 elements with the character classes for the corresponding ASCII characters. If such an array is called char_class, for example, then the character class for a character c is simply char_class[c]. The character classes themselves form a distinct data type best declared as an enumeration in C. Figure 7.2 contains C declarations for a character class type and array. (Note that the end of file character requires special treatment in C because it is not part of ASCII.)

Figure 7.1: Transition diagram for a query lexical analyzer

The same technique is used for fast case conversion. In Figure 7.2, an array of 128 characters called convert_case is preloaded with the printing characters, with lowercase characters substituted for uppercase characters. Nonprinting character positions will not be used, and are set to 0.
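To make the technique concrete, the following sketch (our addition; ClassifyNext is a hypothetical helper, not part of the chapter's code, and it assumes <stdio.h> and the Figure 7.2 declarations are in scope) shows that classification and case conversion each cost a single array reference:

/* Classify and case-fold the next input character with two array */
/* lookups. EOF is mapped to '\0' so that it selects the EOS_CH   */
/* entry, the same treatment GetToken uses in Figure 7.3.         */
static CharClassType
ClassifyNext( stream, folded )
FILE *stream;   /* in: where to grab input characters */
char *folded;   /* out: the case-folded character */
{
   int next_ch;   /* from the input stream */

   if ( EOF == (next_ch = getc(stream)) ) next_ch = '\0';
   *folded = convert_case[next_ch];
   return char_class[next_ch];
}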

/************** Character Classification *****************/

/* Tokenizing requires that ASCII be broken into character */
/* classes distinguished for tokenizing. White space       */
/* characters separate tokens. Digits and letters make up  */
/* the body of search terms. Parentheses group sub-        */
/* expressions. The ampersand, bar, and caret are          */
/* operator symbols.                                       */

typedef enum {
   WHITE_CH,      /* whitespace characters */
   DIGIT_CH,      /* the digits */
   LETTER_CH,     /* upper and lower case */
   LFT_PAREN_CH,  /* the "(" character */
   RGT_PAREN_CH,  /* the ")" character */
   AMPERSAND_CH,  /* the "&" character */
   BAR_CH,        /* the "|" character */
   CARET_CH,      /* the "^" character */
   EOS_CH,        /* the end of string character */
   OTHER_CH       /* catch-all for everything else */
} CharClassType;

static CharClassType char_class[128] = {
/* ^@ */ EOS_CH,       /* ^A */ OTHER_CH,     /* ^B */ OTHER_CH,     /* ^C */ OTHER_CH,
/* ^D */ OTHER_CH,     /* ^E */ OTHER_CH,     /* ^F */ OTHER_CH,     /* ^G */ OTHER_CH,
/* ^H */ WHITE_CH,     /* ^I */ WHITE_CH,     /* ^J */ WHITE_CH,     /* ^K */ WHITE_CH,
/* ^L */ WHITE_CH,     /* ^M */ WHITE_CH,     /* ^N */ OTHER_CH,     /* ^O */ OTHER_CH,
/* ^P */ OTHER_CH,     /* ^Q */ OTHER_CH,     /* ^R */ OTHER_CH,     /* ^S */ OTHER_CH,
/* ^T */ OTHER_CH,     /* ^U */ OTHER_CH,     /* ^V */ OTHER_CH,     /* ^W */ OTHER_CH,
/* ^X */ OTHER_CH,     /* ^Y */ OTHER_CH,     /* ^Z */ OTHER_CH,     /* ^[ */ OTHER_CH,
/* ^\ */ OTHER_CH,     /* ^] */ OTHER_CH,     /* ^^ */ OTHER_CH,     /* ^_ */ OTHER_CH,
/*    */ WHITE_CH,     /* !  */ OTHER_CH,     /* "  */ OTHER_CH,     /* #  */ OTHER_CH,
/* $  */ OTHER_CH,     /* %  */ OTHER_CH,     /* &  */ AMPERSAND_CH, /* '  */ OTHER_CH,
/* (  */ LFT_PAREN_CH, /* )  */ RGT_PAREN_CH, /* *  */ OTHER_CH,     /* +  */ OTHER_CH,
/* ,  */ OTHER_CH,     /* -  */ OTHER_CH,     /* .  */ OTHER_CH,     /* /  */ OTHER_CH,
/* 0  */ DIGIT_CH,     /* 1  */ DIGIT_CH,     /* 2  */ DIGIT_CH,     /* 3  */ DIGIT_CH,
/* 4  */ DIGIT_CH,     /* 5  */ DIGIT_CH,     /* 6  */ DIGIT_CH,     /* 7  */ DIGIT_CH,
/* 8  */ DIGIT_CH,     /* 9  */ DIGIT_CH,     /* :  */ OTHER_CH,     /* ;  */ OTHER_CH,
/* <  */ OTHER_CH,     /* =  */ OTHER_CH,     /* >  */ OTHER_CH,     /* ?  */ OTHER_CH,
/* @  */ OTHER_CH,     /* A  */ LETTER_CH,    /* B  */ LETTER_CH,    /* C  */ LETTER_CH,
/* D  */ LETTER_CH,    /* E  */ LETTER_CH,    /* F  */ LETTER_CH,    /* G  */ LETTER_CH,
/* H  */ LETTER_CH,    /* I  */ LETTER_CH,    /* J  */ LETTER_CH,    /* K  */ LETTER_CH,
/* L  */ LETTER_CH,    /* M  */ LETTER_CH,    /* N  */ LETTER_CH,    /* O  */ LETTER_CH,
/* P  */ LETTER_CH,    /* Q  */ LETTER_CH,    /* R  */ LETTER_CH,    /* S  */ LETTER_CH,
/* T  */ LETTER_CH,    /* U  */ LETTER_CH,    /* V  */ LETTER_CH,    /* W  */ LETTER_CH,
/* X  */ LETTER_CH,    /* Y  */ LETTER_CH,    /* Z  */ LETTER_CH,    /* [  */ OTHER_CH,
/* \  */ OTHER_CH,     /* ]  */ OTHER_CH,     /* ^  */ CARET_CH,     /* _  */ OTHER_CH,
/* `  */ OTHER_CH,     /* a  */ LETTER_CH,    /* b  */ LETTER_CH,    /* c  */ LETTER_CH,
/* d  */ LETTER_CH,    /* e  */ LETTER_CH,    /* f  */ LETTER_CH,    /* g  */ LETTER_CH,
/* h  */ LETTER_CH,    /* i  */ LETTER_CH,    /* j  */ LETTER_CH,    /* k  */ LETTER_CH,
/* l  */ LETTER_CH,    /* m  */ LETTER_CH,    /* n  */ LETTER_CH,    /* o  */ LETTER_CH,
/* p  */ LETTER_CH,    /* q  */ LETTER_CH,    /* r  */ LETTER_CH,    /* s  */ LETTER_CH,
/* t  */ LETTER_CH,    /* u  */ LETTER_CH,    /* v  */ LETTER_CH,    /* w  */ LETTER_CH,
/* x  */ LETTER_CH,    /* y  */ LETTER_CH,    /* z  */ LETTER_CH,    /* {  */ OTHER_CH,
/* |  */ BAR_CH,       /* }  */ OTHER_CH,     /* ~  */ OTHER_CH,     /* ^? */ OTHER_CH,
};

/************** Character Case Conversion *************/

/* Term text must be accumulated in a single case. This */
/* array is used to convert letter case but otherwise   */
/* preserve characters.                                 */

static char convert_case[128] = {
/* ^@ */ 0,    /* ^A */ 0,    /* ^B */ 0,    /* ^C */ 0,
/* ^D */ 0,    /* ^E */ 0,    /* ^F */ 0,    /* ^G */ 0,
/* ^H */ 0,    /* ^I */ 0,    /* ^J */ 0,    /* ^K */ 0,
/* ^L */ 0,    /* ^M */ 0,    /* ^N */ 0,    /* ^O */ 0,
/* ^P */ 0,    /* ^Q */ 0,    /* ^R */ 0,    /* ^S */ 0,
/* ^T */ 0,    /* ^U */ 0,    /* ^V */ 0,    /* ^W */ 0,
/* ^X */ 0,    /* ^Y */ 0,    /* ^Z */ 0,    /* ^[ */ 0,
/* ^\ */ 0,    /* ^] */ 0,    /* ^^ */ 0,    /* ^_ */ 0,
/*    */ ' ',  /* !  */ '!',  /* "  */ '"',  /* #  */ '#',
/* $  */ '$',  /* %  */ '%',  /* &  */ '&',  /* '  */ '\'',
/* (  */ '(',  /* )  */ ')',  /* *  */ '*',  /* +  */ '+',
/* ,  */ ',',  /* -  */ '-',  /* .  */ '.',  /* /  */ '/',
/* 0  */ '0',  /* 1  */ '1',  /* 2  */ '2',  /* 3  */ '3',
/* 4  */ '4',  /* 5  */ '5',  /* 6  */ '6',  /* 7  */ '7',
/* 8  */ '8',  /* 9  */ '9',  /* :  */ ':',  /* ;  */ ';',
/* <  */ '<',  /* =  */ '=',  /* >  */ '>',  /* ?  */ '?',
/* @  */ '@',  /* A  */ 'a',  /* B  */ 'b',  /* C  */ 'c',
/* D  */ 'd',  /* E  */ 'e',  /* F  */ 'f',  /* G  */ 'g',
/* H  */ 'h',  /* I  */ 'i',  /* J  */ 'j',  /* K  */ 'k',
/* L  */ 'l',  /* M  */ 'm',  /* N  */ 'n',  /* O  */ 'o',
/* P  */ 'p',  /* Q  */ 'q',  /* R  */ 'r',  /* S  */ 's',
/* T  */ 't',  /* U  */ 'u',  /* V  */ 'v',  /* W  */ 'w',
/* X  */ 'x',  /* Y  */ 'y',  /* Z  */ 'z',  /* [  */ '[',
/* \  */ '\\', /* ]  */ ']',  /* ^  */ '^',  /* _  */ '_',
/* `  */ '`',  /* a  */ 'a',  /* b  */ 'b',  /* c  */ 'c',
/* d  */ 'd',  /* e  */ 'e',  /* f  */ 'f',  /* g  */ 'g',
/* h  */ 'h',  /* i  */ 'i',  /* j  */ 'j',  /* k  */ 'k',
/* l  */ 'l',  /* m  */ 'm',  /* n  */ 'n',  /* o  */ 'o',
/* p  */ 'p',  /* q  */ 'q',  /* r  */ 'r',  /* s  */ 's',
/* t  */ 't',  /* u  */ 'u',  /* v  */ 'v',  /* w  */ 'w',
/* x  */ 'x',  /* y  */ 'y',  /* z  */ 'z',  /* {  */ '{',
/* |  */ '|',  /* }  */ '}',  /* ~  */ '~',  /* ^? */ 0,
};

/******************** Tokenizing ***********************/

/* The lexer distinguishes terms, parentheses, the and, or */
/* and not operators, the unrecognized token, and the end  */
/* of the input.                                           */

typedef enum {
   TERM_TOKEN      = 1,  /* a search term */
   LFT_PAREN_TOKEN = 2,  /* left parenthesis */
   RGT_PAREN_TOKEN = 3,  /* right parenthesis */
   AND_TOKEN       = 4,  /* set intersection connective */
   OR_TOKEN        = 5,  /* set union connective */
   NOT_TOKEN       = 6,  /* set difference connective */
   END_TOKEN       = 7,  /* end of the query */
   NO_TOKEN        = 8   /* the token is not recognized */
} TokenType;

Figure 7.2: Declarations for a simple query lexical analyzer

There also needs to be a type for tokens. An enumeration type is best for this as well. This type will have an element for each of the tokens: term, left parenthesis, right parenthesis, ampersand, bar, caret, end of string, and the unrecognized token.

Processing is simplified by matching the values of the enumeration type to the final states of the finite state machine. The declaration of the token type also appears in Figure 7.2.

The code for the finite state machine must keep track of the current state, and have a way of changing from state to state on input. A state change is called a transition. Transition information can be encoded in tables, or in flow of control. When there are many states and transitions, a tabular encoding is preferable; in our example, a flow of control encoding is probably clearest. Our example implementation reads characters from an input stream supplied as a parameter. The routine returns the next token from the input each time it is called. If the token is a term, the text of the term (in lowercase) is written to a term buffer supplied as a parameter. Our example code appears in Figure 7.3.
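For contrast, the following sketch (our addition, not one of the chapter's figures) shows how the same machine could be encoded in a table: a two-dimensional array indexed by current state and character class, whose negative entries mark final states.

/* Hypothetical tabular encoding of the query tokenizer DFA.   */
/* Rows are the two non-final states; columns follow the order */
/* of CharClassType in Figure 7.2. Negative entries are final  */
/* states whose absolute values are TokenType codes.           */
static int transition[2][10] = {
/*            WHITE DIGIT LETTER  (   )   &   |   ^  EOS OTHER */
/* state 0 */ {  0,  -8,    1,   -2, -3, -4, -5, -6, -7,  -8  },
/* state 1 */ { -1,   1,    1,   -1, -1, -1, -1, -1, -1,  -1  },
};

With such a table, the per-character transition would collapse to state = transition[state][char_class[next_ch]], with the same term-buffer and ungetc bookkeeping as in Figure 7.3; with only two non-final states, however, the flow of control version below is at least as clear.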

/*FN************************************************************************

   GetToken( stream, term )   Returns: TokenType

   Purpose: Get the next token from an input stream
   Plan:    Part 1: Run a state machine on the input
            Part 2: Coerce the final state to return the token type
   Notes:   Run a finite state machine on an input stream, collecting
            the text of the token if it is a term. The transition table
            for this DFA is the following (negative states are final):

            State | White Letter  (   )   &   |   ^  EOS Digit Other
            ------+-------------------------------------------------
              0   |   0     1    -2  -3  -4  -5  -6  -7   -8    -8
              1   |  -1     1    -1  -1  -1  -1  -1  -1    1    -1

            See the token type above to see what is recognized in the
            various final states.
**/

static TokenType
GetToken( stream, term )
FILE *stream;  /* in: where to grab input characters */
char *term;    /* out: the token text if the token is a term */
{
   int next_ch;  /* from the input stream */
   int state;    /* of the tokenizer DFA */
   int i;        /* for scanning through the term buffer */

   /* Part 1: Run a state machine on the input */
   state = 0;
   i = 0;
   while ( 0 <= state ) {
      if ( EOF == (next_ch = getc(stream)) ) next_ch = '\0';
      term[i++] = convert_case[next_ch];
      switch( state ) {
      case 0 :
         switch( char_class[next_ch] ) {
         case WHITE_CH     : i = 0;      break;
         case LETTER_CH    : state =  1; break;
         case LFT_PAREN_CH : state = -2; break;
         case RGT_PAREN_CH : state = -3; break;
         case AMPERSAND_CH : state = -4; break;
         case BAR_CH       : state = -5; break;
         case CARET_CH     : state = -6; break;
         case EOS_CH       : state = -7; break;
         case DIGIT_CH     : state = -8; break;
         case OTHER_CH     : state = -8; break;
         default           : state = -8; break;
         }
         break;
      case 1 :
         if ( (DIGIT_CH != char_class[next_ch])
              && (LETTER_CH != char_class[next_ch]) ) {
            ungetc( next_ch, stream );
            term[i-1] = '\0';
            state = -1;
         }
         break;
      default : state = -8; break;
      }
   }

   /* Part 2: Coerce the final state to return the token type */
   return( (TokenType) (-state) );

} /* GetToken */

Figure 7.3: Code for a simple query lexical analyzer

The algorithm begins in state 0. As each input character is consumed, a switch on the state determines the transition. Input is consumed until a final state (indicated by a negative state number) is reached. When recognizing a term, the algorithm keeps reading until some character other than a letter or a digit is found. Since this character may be part of another token, it must be pushed back on the input stream. The final state is translated to a token type value by changing its sign and coercing it to the correct type (this was the point of matching the token type values to the final machine states).

/*FN***********************************************************************

   main( argc, argv )   Returns: int -- 0 on success, 1 on failure

   Purpose: Program main function
   Plan:    Part 1: Open a file named on the command line
            Part 2: List all the tokens found in the file
            Part 3: Close the file and return
   Notes:   This program simply lists the tokens found in a single file
            named on the command line.
**/

int
main( argc, argv )
int argc;      /* in: how many arguments */
char *argv[];  /* in: text of the arguments */
{
   TokenType token;  /* next token in the input stream */
   char term[128];   /* the term recognized */
   FILE *stream;     /* where to read the data from */

   /* Part 1: Open a file named on the command line */
   if ( (2 != argc) || !(stream = fopen(argv[1],"r")) ) exit(1);

   /* Part 2: List all the tokens found in the file */
   do
      switch( token = GetToken(stream,term) ) {
      case TERM_TOKEN      : (void)printf( "term: %s\n", term );    break;
      case LFT_PAREN_TOKEN : (void)printf( "left parenthesis\n" );  break;
      case RGT_PAREN_TOKEN : (void)printf( "right parenthesis\n" ); break;
      case AND_TOKEN       : (void)printf( "and operator\n" );      break;
      case OR_TOKEN        : (void)printf( "or operator\n" );       break;
      case NOT_TOKEN       : (void)printf( "not operator\n" );      break;
      case END_TOKEN       : (void)printf( "end of string\n" );     break;
      case NO_TOKEN        : (void)printf( "no token\n" );          break;
      default              : (void)printf( "bad data\n" );          break;
      }
   while ( END_TOKEN != token );

   /* Part 3: Close the file and return */
   fclose( stream );
   return 0;

} /* main */

Figure 7.4: Test program for a query lexical analyzer

Figure 7.4 contains a small main program to demonstrate the use of this lexical analyzer. The program reads characters from a file named on the command line, and writes out a description of the token stream that it finds. In real use, the tokens returned by the lexical analyzer would be processed by a query parser, which would also probably call retrieval and display routines.
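For example, if the input file contains the query dog & (cat | mouse), the program prints:

term: dog
and operator
left parenthesis
term: cat
or operator
term: mouse
right parenthesis
end of string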

The code above, augmented with the appropriate include files, is a complete and efficient implementation of our simple lexical analyzer for queries. When tested, this code tokenized at about a third the speed that the computer could read characters--about as fast as can be expected. An even simpler lexical analyzer for automatic indexing can be constructed in the same way (see the sketch below), and it will be just as fast.
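As a sketch of such an indexing analyzer (our addition; GetTerm is a hypothetical name, and the Figure 7.2 declarations are assumed to be in scope), only two states are needed: skip separators, then collect a letter-initiated run of letters and digits. No ungetc is needed because the character that ends a term would be discarded as a separator anyway.

/* Sketch: copy the next term into the buffer and return 1, or    */
/* return 0 at end of input. A term is a letter followed by any   */
/* run of letters and digits; all other characters are separators. */
static int
GetTerm( stream, term )
FILE *stream;  /* in: where to grab input characters */
char *term;    /* out: the next term, case-folded */
{
   int next_ch;  /* from the input stream */
   int i = 0;    /* for filling the term buffer */

   /* skip everything up to the start of the next term */
   do {
      if ( EOF == (next_ch = getc(stream)) ) return 0;
   } while ( LETTER_CH != char_class[next_ch] );

   /* collect letters and digits until the term ends */
   do {
      term[i++] = convert_case[next_ch];
      next_ch = getc( stream );
   } while ( (EOF != next_ch)
             && ( (LETTER_CH == char_class[next_ch])
                  || (DIGIT_CH == char_class[next_ch]) ) );
   term[i] = '\0';
   return 1;
}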

7.3 STOPLISTS

It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of the most frequently occurring words in English (like "the," "of," "and," "to," etc.) are worthless as index terms. A search using one of these terms is likely to retrieve almost every item in a database regardless of its relevance, so their discrimination value is low (Salton and McGill 1983; van Rijsbergen 1975). Furthermore, these words make up a large fraction of the text of most documents: the ten most frequently occurring words in English typically account for 20 to 30 percent of the tokens in a document (Francis and Kucera 1982). Eliminating such words from consideration early in automatic indexing speeds processing, saves huge amounts of space in indexes, and does not damage retrieval effectiveness. A list of words filtered out during automatic indexing because they make poor index terms is called a stoplist or a negative dictionary.

One way to improve information retrieval system performance, then, is to eliminate stopwords during automatic indexing. As with lexical analysis, however, it is not clear which words should be included in a stoplist. Traditionally, stoplists are supposed to have included the most frequently occurring words. However, some frequently occurring words are too important as index terms. For example, included among the 200 most frequently occurring words in general literature in English are "time," "war," "home," "life," "water," and "world." On the other hand, specialized databases will contain many words useless as index terms that are not frequent in general English. For example, a computer literature database probably need not use index terms like "computer," "program," "source," "machine," and "language."

As with lexical analysis in general, stoplist policy will depend on the database and features of the users and the indexing process. Commercial information systems tend to take a very conservative approach, with few stopwords. For example, the ORBIT Search Service has only eight stopwords: "and," "an," "by," "from," "of," "or," "the," and "with." Larger stoplists are usually advisable. An oft-cited example of a stoplist of 250 words appears in van Rijsbergen (1975). Figure 7.5 contains a stoplist of 425 words derived from the Brown corpus (Francis and Kucera 1982) of 1,014,000 words drawn from a broad range of literature in English. Fox (1990) discusses the derivation of (a slightly shorter version of) this list, which is specially constructed to be used with the lexical analyzer generator described below.

a about above across after

again against all almost alone along already also although always among an and another any anybody anyone anything anywhere are area areas around as ask asked asking asks at away b back backed backing backs be because became become becomes been before began behind being beings best better between big both but by c came can cannot case cases certain certainly clear clearly come could

d did differ different differently do does done down downed

downing downs during e each early either end ended ending ends enough even evenly ever

every everybody everyone everything everywhere f face faces fact facts

far felt few find finds first for four from full fully further furthered furthering furthers g gave general generally get

gets give given gives go

going good goods got great greater greatest group grouped grouping groups h had has have having he her herself here high higher highest him himself his how however i if

important in interest interested interesting interests into is it its

itself j just k keep keeps kind knew know known knows l large largely last later latest least less let lets like likely long longer longest m made make making man many may me member members men might more most mostly mr mrs much must my myself n necessary need needed needing needs never new newer newest next no non not nobody noone nothing now nowhere number numbered numbering numbers o of off often old older oldest on once one

only open opened opening opens or order ordered ordering orders other others our out over p part parted parting parts per perhaps place places point pointed pointing points possible present presented presenting presents problem problems put puts q quite r

rather really right room rooms s said same saw say says second seconds see seem seemed seeming seems sees several shall she should show showed showing shows side sides since small smaller smallest so some somebody someone something somewhere state states still such sure t take taken than that the

their them then there therefore these they thing things think thinks this those though thought thoughts three through thus to

today together too took toward turn turned turning turns two u under until up upon

us use uses used v

very w want wanted wanting wants was way ways we

well wells went were what when where whether which while who whole whose why will with within without work worked working works would x y

year years yet you young younger youngest your yours z

Figure 7.5: A stoplist for general text
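As a simple illustration of putting such a list to work (our addition, written in ANSI C rather than the chapter's K&R style; the chapter's own, more efficient approach is the stoplist-filtering lexical analyzer generator described below), candidate terms can be checked against a sorted stopword array with the standard library's binary search:

#include <stdlib.h>
#include <string.h>

/* Sketch: reject stopwords by binary search in a sorted array.       */
/* This short list is a stand-in for the full 425 words of Figure 7.5. */
static const char *stoplist[] = {
   "a", "about", "above", "across", "after", "the", "with"
};

static int CompareWords( const void *key, const void *entry )
{
   return strcmp( (const char *) key, *(const char **) entry );
}

/* Returns 1 if term is a stopword and should be dropped */
static int IsStopword( const char *term )
{
   return NULL != bsearch( term, stoplist,
                           sizeof(stoplist) / sizeof(stoplist[0]),
                           sizeof(stoplist[0]), CompareWords );
}

A full implementation would load all of Figure 7.5 into the array, after which each candidate term is accepted or rejected in a logarithmic number of string comparisons.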