CYK Recognition with General CF Grammars

General Non-Directional Parsing

4.2 The CYK Parsing Method

4.2.1 CYK Recognition with General CF Grammars

To see how the CYK algorithm solves the recognition and parsing problem, let us consider the grammar of Figure 4.6. This grammar describes the syntax of numbers in scientific notation. An example sentence produced by this grammar is 32.5e+1. We will use this grammar and sentence as an example.

Number_s ---> Integer | Real
Integer  ---> Digit | Integer Digit
Real     ---> Integer Fraction Scale
Fraction ---> . Integer
Scale    ---> e Sign Integer | Empty
Digit    ---> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign     ---> + | -
Empty    ---> ε

Fig. 4.6. A grammar describing numbers in scientific notation

The CYK algorithm first concentrates on substrings of the input sentence, shortest substrings first, and then works its way up. The following derivations of substrings of length 1 can be read directly from the grammar:

Digit   Digit           Digit           Sign    Digit
3       2       .       5       e       +       1

This means that Digit derives 3, Digit derives 2, etc. Note, however, that this picture is not yet complete. For one thing, there are several other non-terminals deriving 3. This complication arises because the grammar contains so-called unit rules, rules of the form A → B, where A and B are non-terminals. Such rules are also called single rules or chain rules. We can have chains of them in a derivation. So the next step consists of applying the unit rules, repeatedly, for example to find out which other non-terminals derive 3. This gives us the following result:

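Number,   Number,             Number,                       Number,
Integer,  Integer,            Integer,                      Integer,
Digit     Digit               Digit               Sign      Digit
3         2         .         5         e         +         1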
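The unit-rule step can be illustrated with a minimal Python sketch. The dictionary encoding of the grammar of Figure 4.6 and the helper name `derivers_of_terminal` are assumptions made for this illustration, not part of the original text; the start-symbol marker on Number is dropped, and ε-rules are deliberately ignored here.

```python
# Grammar of Fig. 4.6 as a dict: non-terminal -> list of right-hand sides.
# Each right-hand side is a tuple of symbols; () stands for ε.
# (This encoding is an assumption of the sketch.)
GRAMMAR = {
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in "0123456789"],
    "Sign":     [("+",), ("-",)],
    "Empty":    [()],
}


def derivers_of_terminal(terminal, grammar):
    """Non-terminals deriving one terminal, directly or through unit rules."""
    # Step 1: rules of the form A -> terminal, read directly from the grammar.
    found = {lhs for lhs, alts in grammar.items() if (terminal,) in alts}
    # Step 2: apply unit rules A -> B repeatedly until nothing new is added.
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs not in found and any(
                len(rhs) == 1 and rhs[0] in found for rhs in alts
            ):
                found.add(lhs)
                changed = True
    return found


for symbol in "32.5e+1":
    print(symbol, sorted(derivers_of_terminal(symbol, GRAMMAR)))
```

For the digit 3 the loop first finds Digit, then adds Integer through Integer → Digit, and finally Number through Number → Integer, which is the chain of unit rules described above.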

Now we already see some combinations that we recognize from the grammar: For example, an Integer followed by a Digit is again an Integer, and a . (dot) followed by an Integer is a Fraction. We get (again also using unit rules):

Number, Integer     Fraction     Scale
3 2                 . 5          e + 1

At this point, we see that the rule for Real is applicable in several ways, and then the rule for Number, so we get:

Number, Real
3 2 . 5 e + 1

So we find that Number does indeed derive 32.5e+1.

In the example above, we have seen that unit rules complicate things a bit. Another complication, one that we have avoided until now, is formed by ε-rules. For example, if we want to recognize the input 43.1 according to the example grammar, we have to realize that Scale derives ε here, so we get the following picture:

Number, Real
Number, Integer     Fraction     Scale
4 3                 . 1          ε

In general this is even more complicated. We must take into account the fact that several non-terminals can derive ε between any two adjacent terminal symbols in the input sentence, and also in front of the input sentence or at the back. However, as we shall see, the problems caused by these kinds of rules can be solved, albeit at a certain cost.

In the meantime, we will not let these problems discourage us. In the example, we have seen that the CYK algorithm works by determining which non-terminals derive which substrings, shortest substrings first. Although we skipped them in the example, the shortest substrings of any input sentence are, of course, the ε-substrings. We shall have to recognize them in arbitrary position, so we first compute R_ε, the set of non-terminals that derive ε, using the following closure algorithm.

The set R_ε is initialized to the set of non-terminals A for which A → ε is a grammar rule. For the example grammar, R_ε is initially the set {Empty}. Next, we check each grammar rule: If a right-hand side consists only of symbols that are a member of R_ε, we add the left-hand side to R_ε (it derives ε, because all symbols in the right-hand side do). In the example, Scale would be added. This process is repeated until no new non-terminals can be added to the set. For the example, this results in

R_ε = {Empty, Scale}.
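The closure algorithm for R_ε is short enough to sketch in Python. The grammar encoding (a dictionary from non-terminal to a list of right-hand-side tuples, with the empty tuple standing for ε) and the function name `nullable_set` are assumptions of this sketch.

```python
# Grammar of Fig. 4.6; () stands for an ε right-hand side.
# (The dict-of-tuples encoding is an assumption of the sketch.)
GRAMMAR = {
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in "0123456789"],
    "Sign":     [("+",), ("-",)],
    "Empty":    [()],
}


def nullable_set(grammar):
    """Compute R_ε, the set of non-terminals that derive ε."""
    # Initialization: every A for which A -> ε is a grammar rule.
    r_eps = {lhs for lhs, alts in grammar.items() if () in alts}
    # Closure: add a left-hand side as soon as one of its right-hand sides
    # consists only of members of R_ε; repeat until nothing changes.
    changed = True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs in r_eps:
                continue
            if any(all(sym in r_eps for sym in rhs) for rhs in alts):
                r_eps.add(lhs)
                changed = True
    return r_eps


print(sorted(nullable_set(GRAMMAR)))  # ['Empty', 'Scale']
```

Terminals are never members of the set, so any right-hand side containing a terminal fails the `all(...)` test without special handling.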

Now we direct our attention to the non-empty substrings of the input sentence.

Suppose we have an input sentence t = t_1 t_2 ··· t_n and we want to compute the set of non-terminals that derive the substring of t starting at position i, of length l. We will use the notation s_{i,l} for this substring, so

s_{i,l} = t_i t_{i+1} ··· t_{i+l−1},

or in a different notation: s_{i,l} = t_{i...i+l−1}. Figure 4.7 presents this notation graphically, using a sentence of 4 symbols. We will use the notation R_{i,l} for the set of non-terminals deriving the substring s_{i,l}. This notation can be extended to deal with substrings of length 0: s_{i,0} = ε, and R_{i,0} = R_ε, for all i.

length
4       s_{1,4}
3       s_{1,3}   s_{2,3}
2       s_{1,2}   s_{2,2}   s_{3,2}
1       s_{1,1}   s_{2,1}   s_{3,1}   s_{4,1}
0       s_{1,0}   s_{2,0}   s_{3,0}   s_{4,0}
        t_1       t_2       t_3       t_4       position

Fig. 4.7. A graphical presentation of substrings
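As a small aid for reading the indices, the substring notation maps onto Python slicing as follows. The function name `substring` is hypothetical; the only subtlety is that positions in the text are 1-based while Python strings are 0-based.

```python
def substring(t, i, l):
    """Return s_{i,l}: the l symbols of t starting at 1-based position i."""
    return t[i - 1 : i - 1 + l]


t = "32.5e+1"
print(substring(t, 1, 2))        # the Integer part: 32
print(substring(t, 3, 2))        # the Fraction part: .5
print(substring(t, 5, 3))        # the Scale part: e+1
print(repr(substring(t, 4, 0)))  # a substring of length 0 is ε: ''
```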

Because shorter substrings are dealt with first, we can assume that we are at a stage in the algorithm where all information on substrings with length smaller than a certain l is available. Using this information, we check each right-hand side in the grammar, to see if it derives s_{i,l}, as follows: suppose we have a right-hand side A_1 ··· A_m. Then we divide s_{i,l} into m (possibly empty) segments, such that A_1 derives the first segment, A_2 the second, etc. We start with A_1. If A_1 ··· A_m is to derive s_{i,l}, A_1 has to derive a first part of it, say of length k. That is, A_1 must derive s_{i,k} (be a member of R_{i,k}), and A_2 ··· A_m must derive the rest:

A_1                     A_2 ··· A_m
t_i ··· t_{i+k−1}       t_{i+k} t_{i+k+1} ··· t_{i+l−1}

This is attempted for every k for which A_1 is a member of R_{i,k}, including k = 0. Naturally, if A_1 is a terminal, then A_1 must be equal to t_i, and k is 1. Checking if A_2 ··· A_m derives t_{i+k} ··· t_{i+l−1} is done in the same way. Unlike Unger's method, we do not have to try all partitions, because we already know which non-terminals derive which substrings.
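The segment-by-segment check can be sketched in Python as a short recursive function. Everything named here (`rhs_derives`, `NONTERMINALS`, the dictionary `R` keyed by `(i, l)`) is an assumption of the sketch, and the table is filled in by hand with only the sets already found in the worked example, so that the check itself stays in focus.

```python
# Non-terminals of the grammar of Fig. 4.6; any other symbol is a terminal.
NONTERMINALS = {"Number", "Integer", "Real", "Fraction",
                "Scale", "Digit", "Sign", "Empty"}


def rhs_derives(rhs, i, l, R, t):
    """True if the right-hand side rhs can derive s_{i,l}, given the sets in R."""
    if not rhs:
        return l == 0  # an empty right-hand side derives only ε
    first, rest = rhs[0], rhs[1:]
    if first not in NONTERMINALS:
        # A terminal must equal t_i and takes a segment of length k = 1.
        return (l >= 1 and t[i - 1] == first
                and rhs_derives(rest, i + 1, l - 1, R, t))
    # A non-terminal: try every k (including 0) with first in R_{i,k}.
    return any(
        first in R.get((i, k), set())
        and rhs_derives(rest, i + k, l - k, R, t)
        for k in range(l + 1)
    )


t = "32.5e+1"
# R_{i,0} = R_ε for all i, plus a few hand-filled sets from the example.
R = {(i, 0): {"Empty", "Scale"} for i in range(1, len(t) + 2)}
R[(1, 1)] = R[(2, 1)] = R[(4, 1)] = {"Number", "Integer", "Digit"}
R[(1, 2)] = {"Number", "Integer"}   # 32
R[(3, 2)] = {"Fraction"}            # .5
R[(5, 3)] = {"Scale"}               # e+1

REAL_RHS = ("Integer", "Fraction", "Scale")
print(rhs_derives(REAL_RHS, 1, 7, R, t))  # 32.5e+1, split as 32 | .5 | e+1
print(rhs_derives(REAL_RHS, 1, 4, R, t))  # 32.5, with Scale deriving ε
print(rhs_derives(REAL_RHS, 1, 3, R, t))  # "32." has no Fraction
```

The recursion consults only table entries, never the grammar, which is the point made above: the partitions are not searched blindly as in Unger's method but looked up.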

Nevertheless, there are two problems with this. In the first place, m could be 1 and A_1 a non-terminal, so we are dealing with a unit rule. In this case, A_1 must derive the whole substring s_{i,l}, and thus be a member of R_{i,l}, which is the set that we are computing now, so we do not know yet if this is the case. This problem can be solved by observing that if A_1 is to derive s_{i,l}, somewhere along the derivation there must be a first step not using a unit rule. So we have:

A_1 → B → ··· → C →* s_{i,l}

where C is the first non-terminal using a non-unit rule in the derivation. Disregarding ε-rules (the second problem) for a moment, this means that at a certain moment in the process of computing the set R_{i,l}, C will be added to R_{i,l}. Now, if we repeat the computation of R_{i,l} again and again, at some moment B will be added, and during the next repetition, A_1 will be added. So we have to repeat the process until no new non-terminals are added to R_{i,l}. This, like the computation of R_ε, is an example of a closure algorithm.

The second problem is caused by the ε-rules. If all but one of the A_t derive ε, we have a problem that is basically equivalent to the problem of unit rules. It too requires recomputation of the entries of R until nothing changes any more, again using a closure algorithm.

In the end, when we have computed all the R_{i,l}, the recognition problem is solved: the start symbol S derives t (= s_{1,n}) if and only if S is a member of R_{1,n}.
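Putting the pieces together, the whole recognizer might look like the following Python sketch. It is one possible reading of the description above, not the authors' implementation: the grammar encoding and the names `nullable_set`, `rhs_derives`, `cyk_table` and `recognize` are all assumptions, and no attempt is made at efficiency.

```python
# Grammar of Fig. 4.6; () stands for ε. Keys are the non-terminals.
GRAMMAR = {
    "Number":   [("Integer",), ("Real",)],
    "Integer":  [("Digit",), ("Integer", "Digit")],
    "Real":     [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale":    [("e", "Sign", "Integer"), ("Empty",)],
    "Digit":    [(d,) for d in "0123456789"],
    "Sign":     [("+",), ("-",)],
    "Empty":    [()],
}


def nullable_set(grammar):
    """R_ε: closure over rules whose right-hand sides derive ε."""
    r_eps, changed = set(), True
    while changed:
        changed = False
        for lhs, alts in grammar.items():
            if lhs not in r_eps and any(
                all(s in r_eps for s in rhs) for rhs in alts
            ):
                r_eps.add(lhs)
                changed = True
    return r_eps


def rhs_derives(rhs, i, l, R, t, grammar):
    """True if rhs derives s_{i,l} according to the sets currently in R."""
    if not rhs:
        return l == 0
    first, rest = rhs[0], rhs[1:]
    if first not in grammar:  # terminal: must equal t_i, segment length 1
        return (l >= 1 and t[i - 1] == first
                and rhs_derives(rest, i + 1, l - 1, R, t, grammar))
    return any(
        first in R.get((i, k), ())
        and rhs_derives(rest, i + k, l - k, R, t, grammar)
        for k in range(l + 1)
    )


def cyk_table(grammar, t):
    """Compute all R_{i,l}, shortest substrings first (1-based positions)."""
    n = len(t)
    r_eps = nullable_set(grammar)
    R = {(i, 0): set(r_eps) for i in range(1, n + 2)}
    for l in range(1, n + 1):
        for i in range(1, n - l + 2):
            R[(i, l)] = set()
            changed = True
            while changed:  # closure: needed for unit rules and ε-rules
                changed = False
                for lhs, alts in grammar.items():
                    if lhs not in R[(i, l)] and any(
                        rhs_derives(rhs, i, l, R, t, grammar) for rhs in alts
                    ):
                        R[(i, l)].add(lhs)
                        changed = True
    return R


def recognize(grammar, start, t):
    """S derives t if and only if S is a member of R_{1,n}."""
    return start in cyk_table(grammar, t)[(1, len(t))]


for sentence in ["32.5e+1", "43.1", "32", "3.", ".5"]:
    print(sentence, recognize(GRAMMAR, "Number", sentence))
```

The inner `while` loop is the repeated computation of R_{i,l} discussed above: a unit rule such as Number → Integer can only fire on the pass after Integer itself has been added to the set.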

This is a complicated process, where part of this complexity stems from the ε-rules and the unit rules. Their presence forces us to do the R_{i,l} computation repeatedly; this is inefficient, because after the first computation of R_{i,l} recomputations yield little new information.

Another less obvious but equally serious problem is that a right-hand side may consist of arbitrarily many non-terminals, and trying all possibilities can be a lot of work. We can see that as follows. For a rule whose right-hand side consists of m members, m−1 segment ends have to be found, each of them combining with all the previous ones. Finding a segment end costs O(n) actions, since a list proportional to the length of the input has to be scanned; so finding the required m−1 segment ends costs O(n^{m−1}). And since there are O(n^2) elements in R, filling it completely costs O(n^{m+1}), so the time requirement is exponential in the maximum length of the right-hand sides in the grammar. The longest right-hand side in Figure 4.6 is 3, so the time requirement is O(n^4). This is far more efficient than exhaustive search, which needs a time that is exponential in the length of the input sentence, but still heavy enough to worry about.
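The arithmetic behind this estimate can be written out in one line. The following LaTeX fragment only restates the counts given above (m−1 segment ends at O(n) each, O(n^2) entries in R) and applies them to the longest right-hand side of Figure 4.6.

```latex
\[
\underbrace{O\!\left(n^{m-1}\right)}_{\text{$m-1$ segment ends per entry}}
\times
\underbrace{O\!\left(n^{2}\right)}_{\text{entries in } R}
= O\!\left(n^{m+1}\right),
\qquad
m = 3 \;\Longrightarrow\; O\!\left(n^{4}\right).
\]
```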

Imposing certain restrictions on the rules may solve these problems to a large extent. However, these restrictions should not limit the generative power of the grammar significantly.

From the document Monographs in Computer Science (pages 134-138)