

In the document Monographs in Computer Science (pages 138-141)

General Non-Directional Parsing

4.2 The CYK Parsing Method

4.2.2 CYK Recognition with a Grammar in Chomsky Normal Form

Two of the restrictions that we want to impose on the grammar are obvious by now:

no unit rules and no ε-rules. We would also like to limit the maximum length of a right-hand side to 2; this would reduce the time complexity to O(n³). It turns out that there is a form for CF grammars that exactly fits these restrictions: the Chomsky Normal Form. It is as if this normal form was invented for this algorithm. A grammar is in Chomsky Normal Form (CNF) when all rules have either the form A → a or the form A → BC, where a is a terminal and A, B, and C are non-terminals. Fortunately, as we shall see later, any CF grammar can be mechanically transformed into a CNF grammar.

We will first discuss how the CYK algorithm works for a grammar in CNF. There are no ε-rules in a CNF grammar, so R_ε is empty. The sets R_{i,1} can be read directly from the rules: they are determined by the rules of the form A → a. A rule A → BC can never derive a single terminal, because there are no ε-rules.
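This base case can be sketched as follows. The grammar used here (S → AB | BC, A → BA | a, B → CC | b, C → AB | a) is an illustrative assumption for the example, not a grammar from the text:

```python
# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
# TERMINAL_RULES maps each terminal a to the set of non-terminals A
# with a rule A -> a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}

def base_sets(t):
    """Compute R_{i,1} for i = 1..n directly from the rules A -> a."""
    return [set(TERMINAL_RULES.get(sym, set())) for sym in t]

print(base_sets("baaba"))
```

For the input "baaba" this yields {B} for each b and {A, C} for each a; no rule A → BC contributes, since a binary right-hand side cannot derive a single terminal.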

Next, we proceed iteratively as before, first processing all substrings of length 2, then all substrings of length 3, etc. When a right-hand side BC is to derive a substring of length l, B has to derive the first part (which is non-empty), and C the rest (also non-empty):

    B: t_i ··· t_{i+k−1}        C: t_{i+k} ··· t_{i+l−1}

So B must derive s_{i,k}, that is, B must be a member of R_{i,k}; likewise C must derive s_{i+k,l−k}, that is, C must be a member of R_{i+k,l−k}. Determining if such a k exists is easy: just try all possibilities; they range from 1 to l−1. All sets R_{i,k} and R_{i+k,l−k} have already been computed at this point.

This process is much less complicated than the one we saw before, which worked with a general CF grammar, for two reasons. The most important one is that we do not have to repeat the process again and again until no new non-terminals are added to R_{i,l}. Here, the substrings we are dealing with are really substrings: they cannot be equal to the string we started out with. The second reason is that we have to find only one place where the substring must be split in two, because the right-hand side consists of only two non-terminals. In ambiguous grammars, there can be several different splittings, but at this point that does not worry us. Ambiguity is a parsing issue, not a recognition issue.
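The recognition process described above can be sketched in full. The grammar and input below (S → AB | BC, A → BA | a, B → CC | b, C → AB | a, input "baaba") are illustrative assumptions for the example, not taken from the text:

```python
from collections import defaultdict

# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}
BINARY_RULES = {            # maps (B, C) to every A with a rule A -> B C
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}

def cyk_recognize(t, start="S"):
    """CYK recognition: R[i, l] holds the non-terminals that derive the
    substring of t of length l starting at position i (1-based, as in
    the text)."""
    n = len(t)
    R = defaultdict(set)
    for i, sym in enumerate(t, start=1):      # base case: rules A -> a
        R[i, 1] = set(TERMINAL_RULES.get(sym, set()))
    for l in range(2, n + 1):                 # substring length
        for i in range(1, n - l + 2):         # start position
            for k in range(1, l):             # try every split point
                for B in R[i, k]:
                    for C in R[i + k, l - k]:
                        R[i, l] |= BINARY_RULES.get((B, C), set())
    return start in R[1, n]

print(cyk_recognize("baaba"))   # -> True for this grammar
```

Note that each entry R[i, l] is filled in a single pass over the split points k: no iteration to a fixed point is needed, exactly as argued above.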

The algorithm results in a complete collection of sets R_{i,l}. The sentence t consists of only n symbols, so a substring starting at position i can never have more than n+1−i symbols. This means that there are no substrings s_{i,l} with i+l > n+1.

Therefore, the R_{i,l} sets can be organized in a triangular table, as depicted in Figure 4.8. This table is called the recognition table, or the well-formed substring table.

[Triangular recognition table: the bottom row holds R_{1,1} ··· R_{i,1} ··· R_{i+l−1,1} ··· R_{n,1}, each higher row holds the sets for longer substrings, with R_{i,l} in the row for length l, and the single entry R_{1,n} at the apex. Arrow V runs upward from R_{i,1} towards R_{i,l}; arrow W runs down-right from R_{i,l} towards R_{i+l−1,1}.]

Fig. 4.8. Form of the recognition table

The entry R_{i,l} is computed from entries along the arrows V and W simultaneously, as follows. The first entry we consider is R_{i,1}, at the start of arrow V. All non-terminals B in R_{i,1} produce substrings which start at position i and have length 1. Since we are trying to obtain parsings for the substring starting at position i with length l, we are now interested in substrings starting at i+1 and having length l−1.

These should be looked for in R_{i+1,l−1}, at the start of arrow W. Now we combine each of the Bs found in R_{i,1} with each of the Cs found in R_{i+1,l−1}, and for each pair B and C for which there is a rule A → BC in the grammar, we insert A in R_{i,l}. Likewise a B in R_{i,2} can be combined into an A with a C from R_{i+2,l−2}, etc., and we continue in this way until we reach R_{i,l−1} at the end point of V and R_{i+l−1,1} at the end of W.

The entry R_{i,l} cannot be computed until all entries below it are known in the triangle of which it is the top. This restricts the order in which the entries can be computed, but still leaves some freedom. One way to compute the recognition table is depicted in Figure 4.9(a); it follows our earlier description, in which no substring of length l is recognized until all substrings of length l−1 have been recognized. We could also compute the recognition table in the order depicted in Figure 4.9(b). In this order, R_{i,l} is computed as soon as all sets and input symbols needed for its computation are

(a) off-line order (b) on-line order

Fig. 4.9. Different orders in which the recognition table can be computed

available. This order is particularly suitable for on-line parsing, where the number of symbols in the input is not known in advance, and additional information is computed each time a new symbol is read.
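The on-line order of Figure 4.9(b) can be sketched as follows: each newly read symbol at position j triggers the computation of exactly those entries R[i, l] with i+l−1 = j. The grammar used here (S → AB | BC, A → BA | a, B → CC | b, C → AB | a) is again an illustrative assumption, not from the text:

```python
from collections import defaultdict

# Illustrative CNF grammar (an assumption for this sketch):
# S -> AB | BC,  A -> BA | a,  B -> CC | b,  C -> AB | a.
TERMINAL_RULES = {"a": {"A", "C"}, "b": {"B"}}
BINARY_RULES = {("A", "B"): {"S", "C"}, ("B", "C"): {"S"},
                ("B", "A"): {"A"}, ("C", "C"): {"B"}}

def cyk_online(symbols, start="S"):
    """Fill the recognition table in on-line order: after reading the
    symbol at position n, compute every entry R[i, l] with i+l-1 = n,
    then report whether the prefix read so far is a sentence."""
    R = defaultdict(set)
    n = 0
    for sym in symbols:                    # symbols may arrive one at a time
        n += 1
        R[n, 1] = set(TERMINAL_RULES.get(sym, set()))
        for i in range(n - 1, 0, -1):      # entries ending at position n,
            l = n - i + 1                  # longest start position first
            for k in range(1, l):
                for B in R[i, k]:
                    for C in R[i + k, l - k]:
                        R[i, l] |= BINARY_RULES.get((B, C), set())
        yield start in R[1, n]

print(list(cyk_online("baaba")))           # one verdict per prefix
```

Iterating i downward inside the loop guarantees that R[i+k, l−k], which also ends at position n, is already available when R[i, l] is computed; the entries R[i, k] end at earlier positions and were filled when those symbols were read.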

Now let us examine the cost of this algorithm. Figure 4.8 shows that there are n(n+1)/2 entries to be filled. Filling in an entry requires examining all entries on the arrow V, of which there are at most n; usually there are fewer, and in practical situations many of the entries are empty and need not be examined at all. We will call the number of entries that really have to be considered n_occ, for "number of occurrences"; it is usually much smaller than n, and for many grammars it is even a constant, but for worst-case estimates it should be replaced by n. Once the entry on the arrow V has been chosen, the corresponding entry on the arrow W is fixed, so the cost of finding it does not depend on n. As a result the algorithm has a time requirement of O(n²n_occ) and operates in a time proportional to the cube of the length of the input sentence in the worst case, as already announced at the beginning of this section.

The cost of the algorithm also depends on the properties of the grammar. The entries along the V and W arrows can each contain at most |V_N| non-terminals, where |V_N| is the number of non-terminals in the grammar, the size of the set V_N from the formal definition of a grammar in Section 2.2. But again the actual number is usually much lower, since usually only a very limited subset of the non-terminals can produce a segment of the input of a given length in a given position; we will indicate this number by |V_N|_occ. So the cost of one combination step is O(|V_N|_occ²).

Finding the rule in the grammar that combines B and C into an A can be done in constant time, by hashing or precomputation, and does not add to the cost of one combination step. This gives an overall time requirement of O(|V_N|_occ² n² n_occ).

There is some disagreement in the literature over whether the second index of the recognition table should represent the length or the end position of the recognized segment. It is obvious that both carry the same information, but sometimes one is more convenient and at other times the other. There is some evidence, from Earley parsing (Section 7.2) and parsing as intersection (Chapter 13), that using the end point is more fundamental, but for CYK parsing the length is more convenient, both conceptually and for drawing pictures.
