

In the document Monographs in Computer Science (pages 138-141)

General Non-Directional Parsing

4.2 The CYK Parsing Method

4.2.2 CYK Recognition with a Grammar in Chomsky Normal Form

Two of the restrictions that we want to impose on the grammar are obvious by now:

no unit rules and no ε-rules. We would also like to limit the maximum length of a right-hand side to 2; this would reduce the time complexity to O(n³). It turns out that there is a form for CF grammars that exactly fits these restrictions: the Chomsky Normal Form. It is as if this normal form was invented for this algorithm. A grammar is in Chomsky Normal Form (CNF) when all rules have either the form A → a or the form A → BC, where a is a terminal and A, B, and C are non-terminals. Fortunately, as we shall see later, any CF grammar can be mechanically transformed into a CNF grammar.

We will first discuss how the CYK algorithm works for a grammar in CNF. There are no ε-rules in a CNF grammar, so R_ε is empty. The sets R_{i,1} can be read directly from the rules: they are determined by the rules of the form A → a. A rule A → BC can never derive a single terminal, because there are no ε-rules.
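This base case can be sketched as follows. The grammar used here (S → AB | BC, A → BA | a, B → CC | b, C → AB | a) is an illustrative assumption for the example, not a grammar from the text:

```python
# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
# TERMINAL_RULES maps each terminal a to the set of non-terminals A
# with a rule A -> a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}

def base_sets(t):
    """Compute R_{i,1} for i = 1..n directly from the rules A -> a."""
    return [set(TERMINAL_RULES.get(sym, set())) for sym in t]

print(base_sets("baaba"))
```

For the input "baaba" this yields {B} for each b and {A, C} for each a; no rule A → BC contributes, since a binary right-hand side cannot derive a single terminal.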

Next, we proceed iteratively as before, first processing all substrings of length 2, then all substrings of length 3, etc. When a right-hand side BC is to derive a substring of length l, B has to derive the first part (which is non-empty), and C the rest (also non-empty):

    B: t_i ··· t_{i+k−1}        C: t_{i+k} ··· t_{i+l−1}

So B must derive s_{i,k}, that is, B must be a member of R_{i,k}; likewise C must derive s_{i+k,l−k}, that is, C must be a member of R_{i+k,l−k}. Determining if such a k exists is easy: just try all possibilities; they range from 1 to l−1. All sets R_{i,k} and R_{i+k,l−k} have already been computed at this point.

This process is much less complicated than the one we saw before, which worked with a general CF grammar, for two reasons. The most important one is that we do not have to repeat the process again and again until no new non-terminals are added to R_{i,l}. Here, the substrings we are dealing with are really substrings: they cannot be equal to the string we started out with. The second reason is that we have to find only one place where the substring must be split in two, because the right-hand side consists of only two non-terminals. In ambiguous grammars, there can be several different splittings, but at this point that does not worry us. Ambiguity is a parsing issue, not a recognition issue.
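The recognition process described above can be sketched in full. The grammar and input below (S → AB | BC, A → BA | a, B → CC | b, C → AB | a, input "baaba") are illustrative assumptions for the example, not taken from the text:

```python
from collections import defaultdict

# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}
BINARY_RULES = {            # maps (B, C) to every A with a rule A -> B C
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}

def cyk_recognize(t, start="S"):
    """CYK recognition: R[i, l] holds the non-terminals that derive the
    substring of t of length l starting at position i (1-based, as in
    the text)."""
    n = len(t)
    R = defaultdict(set)
    for i, sym in enumerate(t, start=1):      # base case: rules A -> a
        R[i, 1] = set(TERMINAL_RULES.get(sym, set()))
    for l in range(2, n + 1):                 # substring length
        for i in range(1, n - l + 2):         # start position
            for k in range(1, l):             # try every split point
                for B in R[i, k]:
                    for C in R[i + k, l - k]:
                        R[i, l] |= BINARY_RULES.get((B, C), set())
    return start in R[1, n]

print(cyk_recognize("baaba"))   # -> True for this grammar
```

Note that each entry R[i, l] is filled in a single pass over the split points k: no iteration to a fixed point is needed, exactly as argued above.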

The algorithm results in a complete collection of sets R_{i,l}. The sentence t consists of only n symbols, so a substring starting at position i can never have more than n+1−i symbols. This means that there are no substrings s_{i,l} with i+l > n+1.

Therefore, the R_{i,l} sets can be organized in a triangular table, as depicted in Figure 4.8. This table is called the recognition table, or the well-formed substring table.

[Triangular recognition table: the bottom row holds R_{1,1} ··· R_{i,1} ··· R_{i+l−1,1} ··· R_{n,1}, each higher row holds the sets for longer substrings, with R_{i,l} in the row for length l, and the single entry R_{1,n} at the apex. Arrow V runs upward from R_{i,1} towards R_{i,l}; arrow W runs down-right from R_{i,l} towards R_{i+l−1,1}.]

Fig. 4.8. Form of the recognition table

The entry R_{i,l} is computed from entries along the arrows V and W simultaneously, as follows. The first entry we consider is R_{i,1}, at the start of arrow V. All non-terminals B in R_{i,1} produce substrings which start at position i and have length 1. Since we are trying to obtain parsings for the substring starting at position i with length l, we are now interested in substrings starting at i+1 and having length l−1.

These should be looked for in R_{i+1,l−1}, at the start of arrow W. Now we combine each of the Bs found in R_{i,1} with each of the Cs found in R_{i+1,l−1}, and for each pair B and C for which there is a rule A → BC in the grammar, we insert A in R_{i,l}. Likewise a B in R_{i,2} can be combined into an A with a C from R_{i+2,l−2}, etc., and we continue in this way until we reach R_{i,l−1} at the end point of V and R_{i+l−1,1} at the end of W.

The entry R_{i,l} cannot be computed until all entries below it are known in the triangle of which it is the top. This restricts the order in which the entries can be computed, but still leaves some freedom. One way to compute the recognition table is depicted in Figure 4.9(a); it follows our earlier description, in which no substring of length l is recognized until all substrings of length l−1 have been recognized. We could also compute the recognition table in the order depicted in Figure 4.9(b). In this order, R_{i,l} is computed as soon as all sets and input symbols needed for its computation are

(a) off-line order (b) on-line order

Fig. 4.9. Different orders in which the recognition table can be computed

available. This order is particularly suitable for on-line parsing, where the number of symbols in the input is not known in advance, and additional information is computed each time a new symbol is read.
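The on-line order of Figure 4.9(b) can be sketched as follows: each newly read symbol at position j triggers the computation of exactly those entries R[i, l] with i+l−1 = j. The grammar used here (S → AB | BC, A → BA | a, B → CC | b, C → AB | a) is again an illustrative assumption, not from the text:

```python
from collections import defaultdict

# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}
BINARY_RULES = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},
                ("B", "A"): {"A"}, ("C", "C"): {"B"}}

def cyk_online(symbols, start="S"):
    """Fill the recognition table in on-line order: after reading the
    symbol at position n, compute every entry R[i, l] with i+l-1 = n,
    then report whether the prefix read so far is a sentence."""
    R = defaultdict(set)
    n = 0
    for sym in symbols:                    # symbols may arrive one at a time
        n += 1
        R[n, 1] = set(TERMINAL_RULES.get(sym, set()))
        for i in range(n - 1, 0, -1):      # entries ending at position n,
            l = n - i + 1                  # longest start position first
            for k in range(1, l):
                for B in R[i, k]:
                    for C in R[i + k, l - k]:
                        R[i, l] |= BINARY_RULES.get((B, C), set())
        yield start in R[1, n]

print(list(cyk_online("baaba")))           # one verdict per prefix
```

Iterating i downward inside the loop guarantees that R[i+k, l−k], which also ends at position n, is already available when R[i, l] is computed; the entries R[i, k] end at earlier positions and were filled when those symbols were read.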

Now let us examine the cost of this algorithm. Figure 4.8 shows that there are n(n+1)/2 entries to be filled. Filling in an entry requires examining all entries on the arrow V, of which there are at most n; usually there are fewer, and in practical situations many of the entries are empty and need not be examined at all. We will call the number of entries that really have to be considered n_occ, for "number of occurrences"; it is usually much smaller than n, and for many grammars it is even a constant, but for worst-case estimates it should be replaced by n. Once the entry on the arrow V has been chosen, the corresponding entry on the arrow W is fixed, so the cost of finding it does not depend on n. As a result the algorithm has a time requirement of O(n²n_occ) and operates in a time proportional to the cube of the length of the input sentence in the worst case, as already announced at the beginning of this section.

The cost of the algorithm also depends on the properties of the grammar. The entries along the V and W arrows can each contain at most |V_N| non-terminals, where |V_N| is the number of non-terminals in the grammar, the size of the set V_N from the formal definition of a grammar in Section 2.2. But again the actual number is usually much lower, since usually only a very limited subset of the non-terminals can produce a segment of the input of a given length in a given position; we will indicate this number by |V_N|_occ. So the cost of one combination step is O(|V_N|_occ²).

Finding the rule in the grammar that combines B and C into an A can be done in constant time, by hashing or precomputation, and does not add to the cost of one combination step. This gives an overall time requirement of O(|V_N|_occ² n² n_occ).

There is some disagreement in the literature over whether the second index of the recognition table should represent the length or the end position of the recognized segment. It is obvious that both carry the same information, but sometimes one is more convenient and at other times the other. There is some evidence, from Earley parsing (Section 7.2) and parsing as intersection (Chapter 13), that using the end point is more fundamental, but for CYK parsing the length is more convenient, both conceptually and for drawing pictures.
