
CYK Recognition with General CF Grammars

From the document Monographs in Computer Science (pages 134-138)

General Non-Directional Parsing

4.2 The CYK Parsing Method

4.2.1 CYK Recognition with General CF Grammars

To see how the CYK algorithm solves the recognition and parsing problem, let us consider the grammar of Figure 4.6. This grammar describes the syntax of numbers in scientific notation. An example sentence produced by this grammar is 32.5e+1. We will use this grammar and sentence as an example.

Number   ---> Integer | Real
Integer  ---> Digit | Integer Digit
Real     ---> Integer Fraction Scale
Fraction ---> . Integer
Scale    ---> e Sign Integer | Empty
Digit    ---> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Sign     ---> + | -
Empty    ---> ε

Fig. 4.6. A grammar describing numbers in scientific notation

The CYK algorithm first concentrates on substrings of the input sentence, shortest substrings first, and then works its way up. The following derivations of substrings of length 1 can be read directly from the grammar:

Digit  Digit         Digit         Sign   Digit
  3      2      .      5      e      +      1

This means that Digit derives 3, Digit derives 2, etc. Note, however, that this picture is not yet complete. For one thing, there are several other non-terminals deriving 3. This complication arises because the grammar contains so-called unit rules, rules of the form A → B, where A and B are non-terminals. Such rules are also called single rules or chain rules. We can have chains of them in a derivation. So the next step consists of applying the unit rules, repetitively, for example to find out which other non-terminals derive 3. This gives us the following result:

[Figure: each of the digits 3, 2, 5 and 1 is now labeled Number, Integer, Digit; the + is labeled Sign.]
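The repeated application of unit rules can be sketched as a small closure loop. This is only an illustration: the list `UNIT_RULES` and the function name `unit_closure` are assumptions of this sketch, with the unit rules read off Figure 4.6.

```python
# Unit rules of the grammar in Fig. 4.6, written as (left-hand side, right-hand side).
# Both the list and the function name are hypothetical names for this sketch.
UNIT_RULES = [
    ("Number", "Integer"),
    ("Number", "Real"),
    ("Integer", "Digit"),
    ("Scale", "Empty"),
]


def unit_closure(symbols, unit_rules=UNIT_RULES):
    """Extend a set of non-terminals with everything that reaches it via unit rules."""
    result = set(symbols)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for lhs, rhs in unit_rules:
            if rhs in result and lhs not in result:
                result.add(lhs)
                changed = True
    return result


# Digit derives 3, so Integer and then Number derive 3 as well.
print(sorted(unit_closure({"Digit"})))
```

Starting from `{"Sign"}` the loop adds nothing, since no unit rule has Sign as its right-hand side.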

Now we already see some combinations that we recognize from the grammar: For example, an Integer followed by a Digit is again an Integer, and a . (dot) followed by an Integer is a Fraction. We get (again also using unit rules):

[Figure: 32 is labeled Number, Integer; .5 is labeled Fraction; e+1 is labeled Scale.]

At this point, we see that the rule for Real is applicable in several ways, and then the rule for Number, so we get:

So we find that Number does indeed derive 32.5e+1.

In the example above, we have seen that unit rules complicate things a bit. Another complication, one that we have avoided until now, is formed by ε-rules. For example, if we want to recognize the input 43.1 according to the example grammar, we have to realize that Scale derives ε here, so we get the following picture:

[Figure: 43.1 is labeled Number, Real, with Scale deriving ε after the final 1.]

In general this is even more complicated. We must take into account the fact that several non-terminals can derive ε between any two adjacent terminal symbols in the input sentence, and also in front of the input sentence or at the back. However, as we shall see, the problems caused by these kinds of rules can be solved, albeit at a certain cost.

In the meantime, we will not let these problems discourage us. In the example, we have seen that the CYK algorithm works by determining which non-terminals derive which substrings, shortest substrings first. Although we skipped them in the example, the shortest substrings of any input sentence are, of course, the ε-substrings. We shall have to recognize them in arbitrary position, so we first compute $R_\varepsilon$, the set of non-terminals that derive ε, using the following closure algorithm.

The set $R_\varepsilon$ is initialized to the set of non-terminals A for which A → ε is a grammar rule. For the example grammar, $R_\varepsilon$ is initially the set {Empty}. Next, we check each grammar rule: If a right-hand side consists only of symbols that are a member of $R_\varepsilon$, we add the left-hand side to $R_\varepsilon$ (it derives ε, because all symbols in the right-hand side do). In the example, Scale would be added. This process is repeated until no new non-terminals can be added to the set. For the example, this results in

$R_\varepsilon$ = {Empty, Scale}.
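The closure algorithm just described can be sketched in a few lines. The dictionary layout of `GRAMMAR` (an empty tuple standing for ε) and the name `nullable_set` are assumptions made for this illustration, not notation from the text.

```python
# Grammar of Fig. 4.6: non-terminal -> list of right-hand sides (tuples of symbols).
# The empty tuple () stands for ε. This encoding is an assumption of the sketch.
GRAMMAR = {
    "Number": [("Integer",), ("Real",)],
    "Integer": [("Digit",), ("Integer", "Digit")],
    "Real": [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale": [("e", "Sign", "Integer"), ("Empty",)],
    "Digit": [(d,) for d in "0123456789"],
    "Sign": [("+",), ("-",)],
    "Empty": [()],
}


def nullable_set(grammar):
    """Compute the set of non-terminals that derive ε."""
    # Initialization: non-terminals A with a rule A -> ε.
    r_eps = {lhs for lhs, rhss in grammar.items() if () in rhss}
    changed = True
    while changed:  # repeat until no new non-terminal can be added
        changed = False
        for lhs, rhss in grammar.items():
            if lhs in r_eps:
                continue
            # A right-hand side made only of members of r_eps derives ε.
            if any(all(sym in r_eps for sym in rhs) for rhs in rhss):
                r_eps.add(lhs)
                changed = True
    return r_eps


print(sorted(nullable_set(GRAMMAR)))
```

For the example grammar the loop adds Scale on its first pass and then finds nothing further on the second pass.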

Now we direct our attention to the non-empty substrings of the input sentence.

Suppose we have an input sentence $t = t_1 t_2 \cdots t_n$ and we want to compute the set of non-terminals that derive the substring of $t$ starting at position $i$, of length $l$. We will use the notation $s_{i,l}$ for this substring, so

$s_{i,l} = t_i t_{i+1} \cdots t_{i+l-1}$,

or in a different notation: $s_{i,l} = t_{i \ldots i+l-1}$. Figure 4.7 presents this notation graphically, using a sentence of 4 symbols.

length
  4    s1,4
  3    s1,3  s2,3
  2    s1,2  s2,2  s3,2
  1    s1,1  s2,1  s3,1  s4,1
  0    s1,0  s2,0  s3,0  s4,0
       t1    t2    t3    t4     position

Fig. 4.7. A graphical presentation of substrings

We will use the notation $R_{i,l}$ for the set of non-terminals deriving the substring $s_{i,l}$. This notation can be extended to deal with substrings of length 0: $s_{i,0} = \varepsilon$, and $R_{i,0} = R_\varepsilon$, for all $i$.
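To make the indexing concrete, here is a minimal sketch of the substring notation with 1-based positions, together with the shortest-first order in which non-empty substrings are visited. The helper names `substring` and `spans` are hypothetical.

```python
def substring(t, i, l):
    """Return s_{i,l}: the substring of t starting at 1-based position i, of length l."""
    return t[i - 1 : i - 1 + l]


def spans(n):
    """Yield (i, l) for all non-empty substrings of a length-n sentence, shortest first."""
    for l in range(1, n + 1):
        for i in range(1, n - l + 2):
            yield (i, l)


t = "32.5e+1"
print(substring(t, 1, 2))  # the Integer part, "32"
print(substring(t, 3, 2))  # the Fraction part, ".5"
print(list(spans(4)))      # the non-empty entries of Fig. 4.7, bottom row first
```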

Because shorter substrings are dealt with first, we can assume that we are at a stage in the algorithm where all information on substrings with length smaller than a certain $l$ is available. Using this information, we check each right-hand side in the grammar, to see if it derives $s_{i,l}$, as follows: suppose we have a right-hand side $A_1 \cdots A_m$. Then we divide $s_{i,l}$ into $m$ (possibly empty) segments, such that $A_1$ derives the first segment, $A_2$ the second, etc. We start with $A_1$. If $A_1 \cdots A_m$ is to derive $s_{i,l}$, $A_1$ has to derive a first part of it, say of length $k$. That is, $A_1$ must derive $s_{i,k}$ (be a member of $R_{i,k}$), and $A_2 \cdots A_m$ must derive the rest:

$$\underbrace{t_i \cdots t_{i+k-1}}_{A_1} \; \underbrace{t_{i+k} \, t_{i+k+1} \cdots t_{i+l-1}}_{A_2 \cdots A_m}$$

This is attempted for every $k$ for which $A_1$ is a member of $R_{i,k}$, including $k = 0$. Naturally, if $A_1$ is a terminal, then $A_1$ must be equal to $t_i$, and $k$ is 1. Checking if $A_2 \cdots A_m$ derives $t_{i+k} \cdots t_{i+l-1}$ is done in the same way. Unlike Unger's method, we do not have to try all partitions, because we already know which non-terminals derive which substrings.
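The segment check described above can be sketched as a recursive function over the right-hand side. Everything named here (`rhs_derives`, the dictionary `R` keyed by `(i, k)`, and the hand-filled sets for the input 32) is an assumption of this illustration; the sets are the ones the text derives for single digits.

```python
def rhs_derives(rhs, i, l, R, t, nonterminals):
    """Check whether the symbols in rhs derive s_{i,l}, given known sets R[(i, k)]."""
    if not rhs:
        return l == 0  # an empty right-hand side derives only the empty substring
    first, rest = rhs[0], rhs[1:]
    if first not in nonterminals:
        # A terminal must equal t_i and covers exactly one symbol (k = 1).
        return (l >= 1 and t[i - 1] == first
                and rhs_derives(rest, i + 1, l - 1, R, t, nonterminals))
    # Try every k (including 0) for which the first symbol is a member of R[(i, k)].
    return any(
        first in R.get((i, k), set())
        and rhs_derives(rest, i + k, l - k, R, t, nonterminals)
        for k in range(l + 1)
    )


# Hand-filled example for the input "32": the sets for lengths 0 and 1 only.
NONTERMINALS = {"Number", "Integer", "Real", "Fraction", "Scale", "Digit", "Sign", "Empty"}
R_EPS = {"Empty", "Scale"}
DIGIT_SET = {"Digit", "Integer", "Number"}
T = "32"
R = {(1, 0): R_EPS, (2, 0): R_EPS, (3, 0): R_EPS, (1, 1): DIGIT_SET, (2, 1): DIGIT_SET}

print(rhs_derives(("Integer", "Digit"), 1, 2, R, T, NONTERMINALS))
print(rhs_derives((".", "Integer"), 1, 2, R, T, NONTERMINALS))
```

The first call succeeds with k = 1 for Integer and k = 1 for Digit; the second fails at once because the terminal . does not match the first input symbol.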

Nevertheless, there are two problems with this. In the first place, $m$ could be 1 and $A_1$ a non-terminal, so we are dealing with a unit rule. In this case, $A_1$ must derive the whole substring $s_{i,l}$, and thus be a member of $R_{i,l}$, which is the set that we are computing now, so we do not know yet if this is the case. This problem can be solved by observing that if $A_1$ is to derive $s_{i,l}$, somewhere along the derivation there must be a first step not using a unit rule. So we have:

$$A_1 \to B \to \cdots \to C \overset{*}{\to} s_{i,l}$$

where $C$ is the first non-terminal using a non-unit rule in the derivation. Disregarding ε-rules (the second problem) for a moment, this means that at a certain moment in the process of computing the set $R_{i,l}$, $C$ will be added to $R_{i,l}$. Now, if we repeat the computation of $R_{i,l}$ again and again, at some moment $B$ will be added, and during the next repetition, $A_1$ will be added. So we have to repeat the process until no new non-terminals are added to $R_{i,l}$. This, like the computation of $R_\varepsilon$, is an example of a closure algorithm.

The second problem is caused by the ε-rules. If all but one of the $A_t$ derive ε, we have a problem that is basically equivalent to the problem of unit rules. It too requires recomputation of the entries of $R$ until nothing changes any more, again using a closure algorithm.

In the end, when we have computed all the $R_{i,l}$, the recognition problem is solved: the start symbol $S$ derives $t$ ($= s_{1,n}$) if and only if $S$ is a member of $R_{1,n}$.
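Putting the pieces together, a recognizer along these lines might look as follows. This is a sketch under stated assumptions rather than a reference implementation: the grammar encoding (empty tuple for ε), the function names, and the dictionary `R` keyed by `(i, l)` are all choices made for this illustration. Each set is recomputed until it stops growing, which handles both unit rules and ε-rules.

```python
# Grammar of Fig. 4.6; () stands for ε. The encoding is an assumption of this sketch.
GRAMMAR = {
    "Number": [("Integer",), ("Real",)],
    "Integer": [("Digit",), ("Integer", "Digit")],
    "Real": [("Integer", "Fraction", "Scale")],
    "Fraction": [(".", "Integer")],
    "Scale": [("e", "Sign", "Integer"), ("Empty",)],
    "Digit": [(d,) for d in "0123456789"],
    "Sign": [("+",), ("-",)],
    "Empty": [()],
}


def nullable_set(grammar):
    """Closure computation of the non-terminals that derive ε."""
    r_eps = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs not in r_eps and any(all(s in r_eps for s in rhs) for rhs in rhss):
                r_eps.add(lhs)
                changed = True
    return r_eps


def recognize(grammar, start, t):
    """Return True if start derives t, filling R[(i, l)] shortest substrings first."""
    n = len(t)
    r_eps = nullable_set(grammar)
    R = {(i, 0): r_eps for i in range(1, n + 2)}  # R_{i,0} = R_ε for all i

    def derives(rhs, i, l):
        if not rhs:
            return l == 0
        first, rest = rhs[0], rhs[1:]
        if first not in grammar:  # terminal: must equal t_i, and k is 1
            return l >= 1 and t[i - 1] == first and derives(rest, i + 1, l - 1)
        return any(first in R.get((i, k), ()) and derives(rest, i + k, l - k)
                   for k in range(l + 1))

    for l in range(1, n + 1):
        for i in range(1, n - l + 2):
            R[(i, l)] = set()
            changed = True
            while changed:  # closure: repeat until R[(i, l)] stops growing
                changed = False
                for lhs, rhss in grammar.items():
                    if lhs not in R[(i, l)] and any(derives(rhs, i, l) for rhs in rhss):
                        R[(i, l)].add(lhs)
                        changed = True
    return start in R[(1, n)]


print(recognize(GRAMMAR, "Number", "32.5e+1"))
print(recognize(GRAMMAR, "Number", "43.1"))
```

Both example sentences from the text are accepted; an input such as `3.` is rejected because no Integer follows the dot.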

This is a complicated process, where part of this complexity stems from the ε-rules and the unit rules. Their presence forces us to do the $R_{i,l}$ computation repeatedly; this is inefficient, because after the first computation of $R_{i,l}$ recomputations yield little new information.

Another less obvious but equally serious problem is that a right-hand side may consist of arbitrarily many non-terminals, and trying all possibilities can be a lot of work. We can see that as follows. For a rule whose right-hand side consists of $m$ members, $m-1$ segment ends have to be found, each of them combining with all the previous ones. Finding a segment end costs $O(n)$ actions, since a list proportional to the length of the input has to be scanned; so finding the required $m-1$ segment ends costs $O(n^{m-1})$. And since there are $O(n^2)$ elements in $R$, filling it completely costs $O(n^{m+1})$, so the time requirement is exponential in the maximum length of the right-hand sides in the grammar. The longest right-hand side in Figure 4.6 is 3, so the time requirement is $O(n^4)$. This is far more efficient than exhaustive search, which needs a time that is exponential in the length of the input sentence, but still heavy enough to worry about.
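Written out as a single product, the estimate above reads as follows (the underbrace labels are annotations added here for readability):

```latex
\underbrace{O(n^{2})}_{\text{entries } R_{i,l}}
\times
\underbrace{O(n^{m-1})}_{m-1 \text{ segment ends, } O(n) \text{ each}}
= O(n^{m+1}),
\qquad
m = 3 \;\Rightarrow\; O(n^{4}).
```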

Imposing certain restrictions on the rules may solve these problems to a large extent. However, these restrictions should not limit the generative power of the grammar significantly.
