Uniformizing rational relations for natural language applications using weighted determinization


Publisher’s version / Version de l'éditeur:


Implementation and Application of Automata: 15th International Conference, CIAA 2010, Winnipeg, MB, Canada, August 12-15, 2010. Revised Selected Papers, Lecture Notes in Computer Science; no. 6482, pp. 173-180, 2010-08-12


NRC Publications Archive Record / Notice des Archives des publications du CNRC :

https://nrc-publications.canada.ca/eng/view/object/?id=1deb358a-ddf8-419a-b330-7736d8f386e4

https://publications-cnrc.canada.ca/fra/voir/objet/?id=1deb358a-ddf8-419a-b330-7736d8f386e4

NRC Publications Archive

Archives des publications du CNRC

For the publisher’s version, please access the DOI link below./ Pour consulter la version de l’éditeur, utilisez le lien DOI ci-dessous.

https://doi.org/10.1007/978-3-642-18098-9_19


Uniformizing Rational Relations for Natural Language Applications using Weighted Determinization

J Howard Johnson

Institute for Information Technology, National Research Council Canada, Ottawa, Canada,

[email protected]

Abstract. Rational functions (or single-valued finite state transductions) have many applications in natural language processing. In particular, morphological and phonological processes are often easily expressed using finite state methods. However, it is often easier to specify a single-valued transduction by a transduction over the same domain that properly contains the desired one, and use a convention external to the formalism to make it single-valued. An alternative is to use weighted determinization with a novel semiring to produce a subsequential transducer in cases where the behaviour of the expected transducer is subsequential. Although there are pathological cases where this can’t be done without extra complexity, they are not likely to be as common as the failure of the resulting machine to be finite. Here the basic algorithm is discussed and the ground prepared for handling the general problem in a better way. A motivating example shows how this approach could be used in practice.

1 Introduction

Rational functions (or single-valued finite state transductions) have many applications in the computer processing of natural language, as demonstrated by the use of the Xerox finite state toolkit [2] and similar systems for morphological and phonological analysis and synthesis of many natural languages. Such toolkits have numerous techniques to help the user stay within the constraint of a functional transformation, but the inherent non-determinism in many rational operators and processes leads to situations where ambiguities creep in.

This is not a new problem. When clean mathematical formalisms are embodied in working software systems, concessions must be made to better enable software development. The LALR(1) parser generator yacc [5] adopted a convention that whenever a shift-reduce conflict occurred, the shift would be selected, and reduce-reduce conflicts would be resolved by complaining loudly but selecting the rule occurring earliest in the source text. The lexical analysis tool lex [9] adopted a convention that, in the case of ambiguities, the longest match would be selected and, in the case of ties, the earliest rule would be selected. Even the everyday application of regular expressions in languages such as Perl or Python has numerous conventions for determining the choice in case of ambiguities.

A particular context where the need for such a mechanism arises is the specification of de-tokenizers for statistical machine translation (SMT). SMT works with appropriately tokenized text in a source language, translating it to tokenized text in the target language. A tokenizer is needed for the source language and a de-tokenizer (to undo the process) for the target language.

Since it is more natural to think and experiment in terms of tokenization, de-tokenization should be expressed as the inverse operation, appropriately modified. Any of the inputs to tokenization may be appropriate but the developer of the de-tokenizer would prefer a systematic process that can be steered in an understandable manner. Choosing the shortest de-tokenized stream probably would be close to the desired output or can be easily cleaned up by a post-edit transducer if the form of output is clearly known.

Unfortunately, choosing the shortest output is not sufficient to completely disambiguate the output, but by taking the lexicographic (dictionary order) minimum of the shortest output, the answer can be made unique. This total order on strings is called the genealogical or radix order and the first element in a sequence ordered in such a manner is called the genealogical or radix minimum. Note that the lexicographic minimum by itself is often undefined for regular sets; for example, the set of words of the form $a^n b$ has no defined lexicographic minimum within the set if $a \prec b$.
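The genealogical order is easy to state in code. The following is a minimal Python sketch, not part of the original paper: the names `radix_key` and `radix_min` are illustrative, and the alphabet order is assumed to be Python's code-point order. It compares strings by length first and by dictionary order second, and it also shows why the purely lexicographic order has no minimum on words of the form $a^n b$.

```python
def radix_key(s: str):
    """Sort key for the genealogical (radix) order: length, then dictionary order."""
    return (len(s), s)


def radix_min(strings):
    """Return the genealogical minimum of a non-empty collection of strings."""
    return min(strings, key=radix_key)


# Shortest output wins; ties in length are broken lexicographically.
choices = ["ba", "ab", "aab"]
best = radix_min(choices)

# Under plain lexicographic order each longer word a^n b is smaller than the
# previous one, so the infinite set {b, ab, aab, ...} has no least element.
chain = ["a" * n + "b" for n in range(4)]
strictly_decreasing = all(chain[i + 1] < chain[i] for i in range(3))
```

Because every non-empty set of strings over a finite alphabet has a shortest member, and only finitely many strings share that length, the radix minimum always exists.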

Johnson [7] discusses a method of disambiguating a transduction when it is being used to transform a text. This often does not provide sufficient flexibility to the programmer since it would be preferable to do the disambiguation as part of the programming process rather than once at the end.

Sakarovitch [15] discusses this process of uniformization but points out that the radix uniformization might not be achievable with a rational transducer. This is bad news for applications because it means that any algorithm will fail in the general case. Furthermore, although it is possible to uniformize a rational relation, the result suffers from a certain amount of arbitrariness.

Suppose the programmer is willing to accept an algorithm that sometimes fails. Many tokenizers are inherently subsequential in execution since they do not remember more than a finite amount of left context and can manage without any lookahead. A de-tokenizer based on such a tokenizer might also be reasonably expected to be subsequential. Thus we would like to construct a subsequential transducer that outputs the genealogical minimum of the possible outputs from a rational relation. Although it cannot be done in general, with an appropriate implementation that fails gracefully, this could be a useful tool.

Section 2 introduces some definitions and background to help make the following discussion more precise. Section 3 discusses how weighted determinization with a specially crafted semiring can be used to partially address the problem. To help clarify ideas, Section 4 provides an example of the approach. Section 5 provides some concluding remarks.


2 Some Definitions and Background

Definition 1. An alphabet is a finite set of symbols.

We will identify two alphabets: Σ for input, and ∆ for output. Although mathematically disjoint, in practice they will correspond to ordinary ASCII or Unicode characters.

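Definition 2. A ∗-semiring $S$ is a set with operations $\oplus$, $\otimes$, $^{\circledast}$, $\bar{0}$ and $\bar{1}$ where:

\[
\begin{aligned}
x \oplus (y \oplus z) &= (x \oplus y) \oplus z, & x \otimes (y \otimes z) &= (x \otimes y) \otimes z,\\
x \oplus \bar{0} = \bar{0} \oplus x &= x, & x \otimes \bar{1} = \bar{1} \otimes x &= x,\\
x \oplus y &= y \oplus x, & x \otimes \bar{0} = \bar{0} \otimes x &= \bar{0},\\
x \otimes (y \oplus z) &= (x \otimes y) \oplus (x \otimes z), & (x \oplus y) \otimes z &= (x \otimes z) \oplus (y \otimes z),\\
(w \otimes w^{\circledast}) \oplus \bar{1} &= (w^{\circledast} \otimes w) \oplus \bar{1} = w^{\circledast}
\end{aligned}
\]

\[
\forall x, y, z \in S, \quad \forall w \in S - \{\bar{1}\}
\]

We won’t insist that $\bar{1}^{\circledast}$ be defined; then any field (such as the real numbers $\Re$) can satisfy the ∗-semiring axioms with $w^{\circledast} = (1 - w)^{-1}$. In the remainder of this discussion $\oplus$ will be written as $+$, $\otimes$ as $\cdot$ (or juxtaposition), $^{\circledast}$ as $^{*}$, $\bar{0}$ as 0 and $\bar{1}$ as 1.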
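To see where the star of a field element comes from, the star axiom of Definition 2 can be solved using ordinary field arithmetic. The following short derivation is a worked illustration only; the value $w = 1/2$ is chosen arbitrarily.

```latex
\begin{align*}
w \cdot w^{*} + 1 &= w^{*} \\
1 &= w^{*} - w \cdot w^{*} = (1 - w)\, w^{*} \\
w^{*} &= (1 - w)^{-1} \qquad (w \neq 1) \\[4pt]
w = \tfrac{1}{2}: \quad w^{*} &= \bigl(1 - \tfrac{1}{2}\bigr)^{-1} = 2,
\qquad \tfrac{1}{2} \cdot 2 + 1 = 2 .
\end{align*}
```

The excluded case $w = 1$ is exactly the one where $1 - w$ has no inverse, which is why the star of 1 is left undefined for fields.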

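Definition 3. A Conway ∗-semiring $S$ is a ∗-semiring with:

\[
\begin{aligned}
(x \oplus y)^{\circledast} &= (x^{\circledast} \otimes y)^{\circledast} \otimes x^{\circledast}\\
(x \otimes y)^{\circledast} &= \bar{1} \oplus (x \otimes (y \otimes x)^{\circledast} \otimes y)
\end{aligned}
\qquad \forall x, y \in S
\]

when the appropriate $^{\circledast}$ operations are defined [4].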

Definition 4. An idempotent ∗-semiring $S$ is a ∗-semiring with:

\[
x \oplus x = x \qquad \forall x \in S
\]

Since $S$ cannot be a field in this case, we can insist that $\bar{1}^{\circledast}$ be defined.

The set of regular languages $Reg(\Sigma^*)$ over an alphabet $\Sigma$ is an idempotent Conway ∗-semiring.

Definition 5. A totally ordered alphabet is one where: $a \preceq b,\ b \preceq a \implies a = b$; $a \preceq b,\ b \preceq c \implies a \preceq c$; and $a \preceq b$ or $b \preceq a$. We write $a \prec b$ if $a \preceq b$ and $a \neq b$.

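Definition 6. The genealogical order relation over $\Delta^*$, where $\Delta$ is a totally ordered alphabet, is defined as:

\[
x \prec y =
\begin{cases}
x & \text{if } |x| < |y| \text{ or } (|x| = |y| \text{ and } x = u a x_1,\ y = u b y_1)\\
y & \text{if } |y| < |x| \text{ or } (|x| = |y| \text{ and } x = u b x_1,\ y = u a y_1)
\end{cases}
\]

where $a, b \in \Delta$ with $a \prec b$ and $u, x_1, y_1 \in \Delta^*$.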


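Definition 7. $GM(\Delta^*)$ for an alphabet $\Delta$ is the set $\Delta^* \cup \{\bot\}$ with the operations $\oplus$, $\otimes$, $^{\circledast}$, $\bar{0}$, $\bar{1}$:

\[
x \oplus y =
\begin{cases}
x & \text{if } x \preceq y \text{ or } y = \bot\\
y & \text{if } y \preceq x \text{ or } x = \bot
\end{cases}
\qquad
x \otimes y =
\begin{cases}
xy & \text{if } x, y \in \Delta^*\\
\bot & \text{if } x = \bot \text{ or } y = \bot
\end{cases}
\]

\[
x^{\circledast} = \epsilon, \qquad \bar{0} = \bot, \qquad \bar{1} = \epsilon
\]

$GM(\Delta^*)$ can easily be shown to be an idempotent Conway ∗-semiring since it is the homomorphic image of $Reg(\Delta^*)$ where $\emptyset$ maps to $\bot$ and other languages are mapped to their genealogically minimal element [7].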
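The operations of Definition 7 can be sketched in a few lines of Python. This is an illustrative encoding rather than anything taken from the paper: `BOTTOM` (Python's `None`) stands for $\bot$, the empty string stands for $\epsilon$, and the alphabet order is assumed to be code-point order. The function names are hypothetical.

```python
BOTTOM = None  # assumed encoding of ⊥, the semiring zero
EPS = ""       # the empty string ε, the semiring one


def gm_plus(x, y):
    """x ⊕ y: the genealogical minimum of x and y, with ⊥ as the identity."""
    if x is BOTTOM:
        return y
    if y is BOTTOM:
        return x
    return min(x, y, key=lambda s: (len(s), s))


def gm_times(x, y):
    """x ⊗ y: concatenation, with ⊥ absorbing."""
    if x is BOTTOM or y is BOTTOM:
        return BOTTOM
    return x + y


def gm_star(x):
    """The star of every element is ε."""
    return EPS
```

With this encoding the star axiom is immediate: `gm_plus(gm_times(w, gm_star(w)), EPS)` is always `EPS`, because no string precedes the empty string in the genealogical order.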

Definition 8. The GCLD $x \wedge y = z$ of two elements of a ∗-semiring is a left divisor that is maximal. That is, $x = z \otimes x_1$, $y = z \otimes y_1$, and if there is a $v$ such that $x = v \otimes x_2$, $y = v \otimes y_2$, then $z = v \otimes v_1$. Here $x_1$, $y_1$, $x_2$, $y_2$, and $v_1$ are all from the ∗-semiring.

We will refer to a ∗-semiring where $\wedge$ is defined for any pair of elements as a GCLD ∗-semiring.
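For string weights, a left divisor is a prefix, so the greatest common left divisor is the longest common prefix and left division strips a prefix. The sketch below illustrates this reading; the function names are hypothetical, `None` is an assumed encoding of $\bot$, and the treatment of $\bot$ (every element left-divides it) is an interpretation of Definition 8 rather than something the text spells out.

```python
import os

BOTTOM = None  # assumed encoding of ⊥


def gcld(x, y):
    """Greatest common left divisor: the longest common prefix of two strings.

    Every z satisfies z ⊗ ⊥ = ⊥, so ⊥ imposes no constraint on the result.
    """
    if x is BOTTOM:
        return y
    if y is BOTTOM:
        return x
    return os.path.commonprefix([x, y])


def left_divide(z, x):
    """Left division z \\ x: the x1 such that x = z ⊗ x1."""
    if x is BOTTOM:
        return BOTTOM
    if z is BOTTOM or not x.startswith(z):
        raise ValueError("z does not left-divide x")
    return x[len(z):]
```

For example, `gcld("abc", "abd")` yields `"ab"`, and dividing both arguments by it leaves the residuals `"c"` and `"d"`.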

Definition 9. A weighted finite automaton $A$ over alphabet $\Sigma$ and weight space $S$ is an 8-tuple:

\[
A = \langle Q, \Sigma, S, I, F, E, \lambda, \rho \rangle
\]

where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $S$ is a ∗-semiring, $I \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the set of final states, $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times S \times Q$ is a set of transitions, $\lambda$ is a function from initial states to $S$, and $\rho$ is a function from final states to $S$.

Definition 10. A finite state transducer $T$ with input alphabet $\Sigma$ and output alphabet $\Delta$ is

\[
T = \langle Q, \Sigma, Reg(\Delta^*), I, F, E, \lambda, \rho \rangle
\]

where $Q$ is a finite set of states, $\Sigma$ and $\Delta$ are finite alphabets, and $I$, $F$, $E$, $\lambda$, and $\rho$ are as above with $S = Reg(\Delta^*)$.

Although standard definitions of finite state transducers choose output transitions from $\Delta^*$ without loss of expressive power, this characterization is equivalent to the usual ones [3] and opens the door to weighted finite automata with weights chosen from $GM(\Delta^*)$. Furthermore, any transducer can be converted to such a weighted automaton by applying the necessary homomorphic mapping to the weights.
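The homomorphic mapping of weights can be illustrated on a toy scale. In the sketch below, a finite Python set of strings stands in for a regular language (a simplifying assumption, since regular languages may be infinite), `None` stands for $\bot$, and transitions are tuples `(source, input, weight, destination)`. All names are hypothetical.

```python
BOTTOM = None  # assumed encoding of ⊥


def to_gm(language):
    """Map a finite language to its genealogical minimum; the empty set maps to ⊥."""
    if not language:
        return BOTTOM
    return min(language, key=lambda s: (len(s), s))


def map_weights(transitions):
    """Replace each language-valued weight by its image in GM."""
    return [(p, x, to_gm(w), q) for (p, x, w, q) in transitions]


# The mapping respects union and concatenation on this small sample.
A = {"ab", "b", "ba"}
B = {"c", "aa"}
union_image = to_gm(A | B)
product_image = to_gm({a + b for a in A for b in B})
edges = map_weights([(0, "x", A, 1), (1, "y", B, 2), (2, "z", set(), 0)])
```

Here the image of the union equals the genealogical minimum of the two images, and the image of the product equals their concatenation, which is the homomorphism property on this sample.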

3 Algorithm

The $GM(\Delta^*)$ ∗-semiring is combined with the implementation of weighted determinization as discussed by Mohri [14] to yield a method that uniformizes rational relations to (not necessarily finite) subsequential transducers. Note that there are cases where the result of the process is subsequential but a finite transducer is not found without further modification to the algorithm. An example where this occurs will be given later.

For now we will live with the fact that weighted determinization is a semi-algorithm that either computes a subsequential transducer or runs forever, exploring more and more of the infinite machine until computing resources are exhausted. We can live with the nice terminating cases that cover many practical situations, or follow Mohri by treating weighted determinization as a lazy algorithm that only expands states that are used by a particular application that only visits a finite part of the infinite state-space.

First of all, our rational relation is expressed in the form of a transducer (Definition 10). Each weight is replaced by its natural homomorphic image in $GM(\Delta^*)$. Before applying the weighted determinization algorithm, the automaton must be cleaned up a bit. It must be trim, ε-free, and accelerated through the application of weight pushing.

For a weighted automaton to be trim, any states that are not reachable by a path from a start state are removed together with any transitions they carry. Next, any states from which a final state cannot be reached are removed together with any transitions they carry. The result is an automaton in which all states are accessible (from a start state) and co-accessible (from a final state). Requiring accessibility of states is not important for us except to reduce the sizes of data structures; however, the presence of non co-accessible states can drastically affect the result leading to infinite machines where the algorithm would otherwise converge to a finite solution. This is a well understood phenomenon associated with determinization in general.
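Trimming amounts to two reachability searches. The following is a minimal sketch under assumed conventions: transitions are tuples `(source, input, weight, destination)`, and the function names are illustrative only.

```python
def reachable(starts, arcs):
    """Return all states reachable from `starts` along the (from, to) pairs in `arcs`."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        s = stack.pop()
        for a, b in arcs:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def trim(initial, final, edges):
    """Keep only states that are both accessible and co-accessible."""
    forward = [(p, q) for (p, _x, _w, q) in edges]
    backward = [(q, p) for (p, q) in forward]
    keep = reachable(initial, forward) & reachable(final, backward)
    return (
        keep,
        [q for q in initial if q in keep],
        [q for q in final if q in keep],
        [e for e in edges if e[0] in keep and e[3] in keep],
    )


# State 3 cannot reach a final state; state 4 is not reachable from the start.
edges = [(0, "a", "x", 1), (1, "b", "y", 2), (1, "c", "z", 3), (4, "a", "w", 2)]
keep, initial, final, kept = trim([0], [2], edges)
```

Intersecting the two reachable sets gives the same result as the two successive removals described above, because every state on a path from an accessible state is itself accessible.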

To be ε-free, we must first recognize that any transition with only an output is an ε-transition and its ‘weight’ must be aggregated into preceding non-ε-transitions using semiring operations. We are helped here by the fact that $GM(\Delta^*)$ is a k-closed semiring with k = 0. Effectively, this means that any ε-loops evaluate to a weight of ε, the multiplicative identity, and can be simply deleted. There are no further problems with ε-removal. The easiest general algorithms work in an obvious way.

To accelerate the automaton through weight pushing there are some more fundamental changes that need to be made. We will paraphrase Mohri with appropriate modifications:

Let A be a weighted automaton over a semiring S. Assume that S is a GCLD ∗-semiring. (This can be weakened further if necessary.) For any state q ∈ Q, assume that the following sum is defined and in S:

\[
d[q] = \bigoplus_{\pi \in P(q, F)} \bigl( w[\pi] \otimes \rho(n[\pi]) \bigr).
\]

d[q] is the weighted distance from q to F including the final weight and is well defined for all q ∈ Q when S is a k-closed semiring. The weight pushing algorithm consists of computing each weighted distance d[q] and of re-weighting the transition weights, initial weights, and final weights in the following way:

\[
\begin{aligned}
&\forall e \in E \text{ s.t. } d[p[e]] \neq 0, & w[e] &\leftarrow d[p[e]] \backslash (w[e] \otimes d[n[e]]),\\
&\forall q \in I, & \lambda(q) &\leftarrow \lambda(q) \otimes d[q],\\
&\forall q \in F \text{ s.t. } d[q] \neq 0, & \rho(q) &\leftarrow d[q] \backslash \rho(q).
\end{aligned}
\]
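Over $GM(\Delta^*)$ the weighted distance d[q] is the genealogically smallest string that can still be emitted on some path from q to a final state. The sketch below computes only this distance (not the re-weighting step) by repeated relaxation; it assumes an ε-free automaton with transitions `(source, input, weight, destination)`, string weights, and `None` as an encoding of $\bot$. The function names are hypothetical.

```python
BOTTOM = None  # assumed encoding of ⊥


def gm_plus(x, y):
    """Genealogical minimum, with ⊥ as the identity."""
    if x is BOTTOM:
        return y
    if y is BOTTOM:
        return x
    return min(x, y, key=lambda s: (len(s), s))


def gm_times(x, y):
    """Concatenation, with ⊥ absorbing."""
    if x is BOTTOM or y is BOTTOM:
        return BOTTOM
    return x + y


def distances(states, final_weight, edges):
    """Relax d[p] <- d[p] ⊕ (w ⊗ d[q]) over all transitions until nothing changes."""
    d = {q: final_weight.get(q, BOTTOM) for q in states}
    changed = True
    while changed:
        changed = False
        for p, _x, w, q in edges:
            candidate = gm_plus(d[p], gm_times(w, d[q]))
            if candidate != d[p]:
                d[p] = candidate
                changed = True
    return d


edges = [(0, "a", "xy", 1), (0, "a", "z", 2), (1, "b", "", 2), (2, "c", "q", 2)]
d = distances([0, 1, 2, 3], {2: ""}, edges)
```

The loop terminates because each update strictly decreases a value in the genealogical order, which admits no infinite descending chains. State 3 in the example has no path to a final state, so its distance stays at $\bot$.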

Here p[e], n[e], and w[e] are the source, destination, and weight, respectively, of e.

We are now ready to do weighted determinization using a suitably modified version of Mohri’s algorithm, presented as Algorithm 1. Note that the same changes made above are again necessary. Line 11 requires that the Greatest Common Left Divisor of the considered weights be calculated. In Mohri’s case, he can choose divisors more freely and chooses a sum. In this case, we must ensure left divisibility and choose the maximal element that still left divides. The change in line 12 involves replacing a left multiplication by an inverse with a straightforward left division. Left division is defined in this case even though an inverse doesn’t exist. There are also a couple of less important differences in the calculation of I′ and λ′ in lines 3 to 6. This is a small nicety that factors out as much output as can be emitted before any input is read. This bizarre idea of writing output before reading anything usually doesn’t occur in practical applications but results in a small reduction in the size of the resulting machine.

Algorithm 1 Weighted-Determinization(A)

 1  A ≡ ⟨Q, Σ, S, I, F, E, λ, ρ⟩
 2  Q′ ← ∅, F′ ← ∅, E′ ← ∅, λ′ ← ∅, ρ′ ← ∅, Z ← empty Queue
 3  w′ ← ⋀{λ(q) : q ∈ I}
 4  q′ ← {(q, w′\λ(q)) : q ∈ I}
 5  I′ ← {q′}
 6  λ′(q′) ← w′
 7  Z ← Enqueue(Z, q′)
 8  while NotEmpty(Z) do
 9      p′ ← Dequeue(Z)
10      for each x ∈ i[E[Q[p′]]] do
11          w′ ← ⋀{v ⊗ w : (p, v) ∈ p′, (p, x, w, q) ∈ E}
12          q′ ← {(q, ⊕{w′\(v ⊗ w) : (p, v) ∈ p′, (p, x, w, q) ∈ E}) : q = n[e], i[e] = x, e ∈ E[Q[p′]]}
13          E′ ← E′ ∪ {(p′, x, w′, q′)}
14          if q′ ∉ Q′ then
15              Q′ ← Q′ ∪ {q′}
16              if Q[q′] ∩ F ≠ ∅ then
17                  F′ ← F′ ∪ {q′}
18                  ρ′(q′) ← ⊕{v ⊗ ρ(q) : (q, v) ∈ q′, q ∈ F}
19              Enqueue(Z, q′)
20  return A′ ≡ ⟨Q′, Σ, S, I′, F′, E′, λ′, ρ′⟩


Note that, following Mohri, the notation Q[p′] means the states in p′, E[Q[p′]] are the transitions that have a tail in a state of p′, i[E[Q[p′]]] are the labels from Σ in transitions that have a tail in a state of p′, i[e] is the label from Σ of transition e, and n[e] is the destination state of transition e.
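To make the steps of Algorithm 1 concrete, here is a compact Python sketch specialised to string weights, where the GCLD is the longest common prefix, ⊕ is the genealogical minimum, and left division strips a prefix. It is an illustration under stated assumptions, not the INR implementation: the input is assumed trim and ε-free, transitions are tuples `(source, input, weight, destination)`, the names `determinize` and `transduce` and the `max_states` guard are hypothetical, and final weights are computed for every subset state, including the initial one.

```python
import os
from collections import deque


def radix_min(strings):
    """Genealogical minimum: shortest first, then dictionary order."""
    return min(strings, key=lambda s: (len(s), s))


def determinize(initial, lam, final, rho, edges, max_states=1000):
    # Lines 3-6: factor out the common prefix of the initial weights.
    w0 = os.path.commonprefix([lam[q] for q in initial])
    start = frozenset((q, lam[q][len(w0):]) for q in initial)
    states, queue, d_edges = {start}, deque([start]), {}
    while queue:
        subset = queue.popleft()
        by_label = {}
        for p, v in subset:
            for src, x, w, q in edges:
                if src == p:
                    by_label.setdefault(x, []).append((q, v + w))
        for x, outs in by_label.items():
            # Line 11: emit the greatest common left divisor.
            w1 = os.path.commonprefix([vw for _, vw in outs])
            # Line 12: keep the minimal residual for each destination state.
            residual = {}
            for q, vw in outs:
                r = vw[len(w1):]
                residual[q] = radix_min([residual.get(q, r), r])
            target = frozenset(residual.items())
            d_edges[(subset, x)] = (w1, target)
            if target not in states:
                if len(states) >= max_states:
                    raise RuntimeError("state limit reached; result may be infinite")
                states.add(target)
                queue.append(target)
    # Line 18: final weight is the minimal residual followed by the final output.
    d_rho = {}
    for subset in states:
        outs = [v + rho[q] for q, v in subset if q in final]
        if outs:
            d_rho[subset] = radix_min(outs)
    return start, w0, d_edges, d_rho


def transduce(machine, text):
    """Run the subsequential machine; return None if the input is rejected."""
    state, out, d_edges, d_rho = machine[0], machine[1], machine[2], machine[3]
    for x in text:
        if (state, x) not in d_edges:
            return None
        w, state = d_edges[(state, x)]
        out += w
    return out + d_rho[state] if state in d_rho else None


# Input "ab" can produce "xbc" or "xa"; the genealogical minimum is "xa".
edges = [(0, "a", "xb", 1), (0, "a", "xa", 2), (1, "b", "c", 3), (2, "b", "", 3)]
machine = determinize([0], {0: ""}, {3}, {3: ""}, edges)
result = transduce(machine, "ab")
```

The `max_states` guard is one simple way to fail gracefully when the subset construction does not converge; it is a design choice of this sketch, not the stopping criterion used by INR.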

There remains one more step in the usual determinization suite. There often is a benefit in minimizing the resulting machine by combining states that have equivalent right context. The generalization of the conventional minimization algorithm for unweighted finite state machines works correctly if the pair of letter and weight from each transition is treated as an element of an expanded alphabet Σ × S. Mohri says that weight pushing should be performed before minimization. In the situation described here this will be unnecessary because we applied weight pushing before determinization and the algorithm preserves the effect.

4 An Example

Suppose that we have text that contains four types of information: (1) Words made up of upper and lower case letters. We will restrict our alphabet to ’a’ and its upper case form ’A’. (2) Numbers are a sequence of digits starting with a non-zero digit. We will restrict our digits to ’0’ and ’1’. (3) Punctuation are individual punctuation marks. Each punctuation mark will be a separate token. We will restrict ourselves to ’,’ and ’.’. (4) White space is a sequence of blanks.

Fig. 1. Rational Relation properly containing Tokenizer (state diagram not reproduced; states 1 to 10 plus auxiliary states E1 and E2)

We wish to construct a subsequential transducer that tokenizes the text in the following way: (1) Words are converted to lower case. (2) Numbers are copied. (3) Punctuation marks are copied. (4) White space is converted to a single blank.


Every token in the tokenized text is separated by exactly one blank, whether there is white space occurring in the input or not. Extra blanks are inserted appropriately. Word and number tokens must be maximal. We also will produce a de-tokenizer from our specification that produces the genealogical minimum as output. An INR specification for the required transducer follows:

Upper = { A };
Lower = { a };
PosDigit = { 1 };
Digit = { 0, 1 };
Punct = { ',', '.' };
Blank = ' ';
Token = ( Upper | Lower )+;
Number = PosDigit Digit*;
White = Blank+;
ToLower = { ( A, a ), ( a, a ) }*;
TCopy = ( Token @@ ToLower ) [[ T ]];
NCopy = ( Number $ ( 0, 0 ) ) [[ N ]];
PCopy = ( Punct $ ( 0, 0 ) ) [[ P ]];
WCopy = ( White, Blank ) [[ W ]];
Binst = ( , Blank ) [[ B ]];
Copy = [[ S ]] ( TCopy | NCopy | PCopy | WCopy | Binst )* [[ E ]];
ZAlph = { T, N, P, W, B, S, E };
Invalid = { T, N, P } { T, N, P };
Tokenize = Copy @ ( ZAlph* Invalid ZAlph* :acomp ) :GMsseq;
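For comparison, the intended input/output behaviour of the tokenizer can be written directly as a small Python reference function. This is a hand-written sketch of the informal specification above, not a translation of the INR program and not the transducer it produces; the names `TOKEN`, `VALID`, and `tokenize` are illustrative, and rejecting inputs such as a leading '0' follows the stated rule that numbers start with a non-zero digit.

```python
import re

# Words over {A, a}, numbers starting with 1, single punctuation marks, blanks.
TOKEN = re.compile(r"[Aa]+|1[01]*|[,.]| +")
VALID = re.compile(r"(?:[Aa]+|1[01]*|[,.]| +)*")


def tokenize(text):
    """Lower-case words, copy numbers and punctuation, squeeze blanks,
    and insert one blank between adjacent non-blank tokens."""
    if VALID.fullmatch(text) is None:
        raise ValueError("input is outside the example token grammar")
    out = []
    prev_blank = True  # no blank is inserted before the first token
    for tok in TOKEN.findall(text):
        if tok[0] == " ":
            out.append(" ")       # a run of blanks becomes a single blank
            prev_blank = True
        else:
            if not prev_blank:
                out.append(" ")   # inserted blank between adjacent tokens
            out.append(tok.lower())
            prev_blank = False
    return "".join(out)


sample = tokenize("Aa1,a  A.")
```

Greedy matching makes word and number tokens maximal, which corresponds to the shortest-output choice: splitting "11" into two number tokens would force an extra blank and a longer output.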

Fig. 2. Tokenizer (state diagram not reproduced; states S, W, P, T, N)

Some of the techniques used by INR are unusual. A three-tape transducer with the third tape used for control is used here. Letters T, N, P, W, B, S, E (in double brackets to select tape 3) are tokens that appear on the control tape. The pattern ‘Invalid’ enumerates the exclusions on control tokens of types T, N, and P: no two of them can be adjacent. This will enforce the constraint that a Blank of some kind (W or B) will occur between any pair of these.


Fig. 3. De-tokenizer (state diagram not reproduced; states S, W1, P, W2, T, N, W3, W4)

The composition operator (denoted by ‘@’) causes the third tape to be removed after the constraint of disallowing invalid sequences. The resulting two-tape transducer before the application of Algorithm 1 (denoted by ‘:GMsseq’) is shown in Figure 1. Note that states ‘E1’ and ‘E2’ with their outgoing ε transitions have been added to simplify the diagram.

This automaton is trim and the ε-closure is handled by observing that only states 1, 7, 8, and 9 have outgoing arcs with labels from the input alphabet and that the extra paths that result from the loops on states 1 and 2 are easily handled by the $GM(\Delta^*)$ semiring operations that cause only the shortest path to be kept. No additional weight pushing is needed. Figure 2 shows the result of the application of Algorithm 1 and Figure 3 shows a de-tokenizer that results from applying Algorithm 1 to the inverse relation from Figure 2.

Finally, here is an example where the result of weighted determinization is subsequential but Algorithm 1 does not terminate. Suppose that in the example above, we insist that a word can be either all upper case or all lower case, and mixed case is disallowed. The de-tokenizer, faced with a word token of arbitrary length in lower case, would have to decide whether to output the upper case form or the lower case form. Of course the answer would be upper case since the ASCII letter ordering is being used. However, the decision about output is deferred until the token is complete and, with arbitrarily long input, will be deferred forever.
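The deferral can be observed on a stripped-down model of this situation. In the sketch below, two hypothetical states `U` and `L` stand for "word is being output in upper case" and "word is being output in lower case"; each reads the input letter `a` and writes `A` or `a` respectively. One subset-construction step, following lines 11 and 12 of Algorithm 1 for string weights, is applied repeatedly.

```python
import os

# Hypothetical two-path model: (source, input, output, destination).
EDGES = [("U", "a", "A", "U"), ("L", "a", "a", "L")]


def step(subset, x):
    """One determinization step: emit the common prefix, keep the residuals."""
    outs = [(q, v + w)
            for (p, v) in subset
            for (src, y, w, q) in EDGES
            if src == p and y == x]
    common = os.path.commonprefix([vw for _, vw in outs])
    return common, frozenset((q, vw[len(common):]) for q, vw in outs)


state = frozenset({("U", ""), ("L", "")})
trace = []
for _ in range(5):
    emitted, state = step(state, "a")
    trace.append((emitted, state))
```

Every step emits the empty string, because `A...` and `a...` share no prefix, and every step produces a subset state with longer residuals than the last. The construction therefore never revisits a state, even though the eventual answer (the upper case form, since `A` precedes `a` in ASCII) is known in advance.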

5 Conclusion and Future Work

A useful tool for uniformizing finite state transductions can be implemented using a variation of weighted determinization over a novel ∗-semiring. This complements other techniques for specifying rational functions in natural language applications but it remains the case that this is a challenging programming medium.

Practically and theoretically speaking, it is unsatisfying to have a procedure that fails by running until resources are exhausted. It would definitely be superior to terminate if the computation is not possible, provide some diagnostic information, and give a result that is still usable though with some flaws. In addition, the case where the expected result is subsequential but the algorithm fails should be addressed.

INR addressed these problems with an approximate stopping criterion that often worked well; however, with more care, a theoretically more satisfying approach is possible. These loose ends will be addressed in a longer version of this paper and in future work.

References

1. Abdali, S. K. and Saunders, B. D.: Transitive closure and related semiring properties via eliminants. Theoretical Computer Science, 40 (1985) 257–274

2. Beesley, K. R. and Karttunen, L.: Finite State Morphology. CSLI Publications, Stanford, CA, USA (2003) http://www.fsmbook.com

3. Berstel, J.: Transductions and context-free languages. Teubner, Stuttgart (1979)

4. Conway, J. H.: Regular algebra and finite machines. Chapman and Hall, London (1971)

5. Johnson, S. C.: YACC: yet another compiler-compiler. Unix Programmer’s Manual Vol 2b (1979)

6. Johnson, J. H.: INR: a program for computing finite state automata. Unpublished manuscript (1986)

7. Johnson, J. H.: A unified framework for disambiguating finite transductions. Theoretical Computer Science, 63 (1989) 91–111

8. Lehmann, D. J.: Algebraic structures for transitive closure. Theoretical Computer Science, 4(1) (1977) 59–76

9. Lesk, M. E. and Schmidt, E.: Lex: a lexical analyzer generator. Bell Labs., Murray Hill, NJ (1978)

10. Mohri, M.: Finite-state transducers in language and speech processing. Computational Linguistics, 23(2) (1997) 269–312

11. Mohri, M.: Generic ε-removal algorithm for weighted automata. Proceedings of CIAA 2000, London, Canada (2000) 230–242

12. Mohri, M., Pereira, F., and Riley, M.: The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231(1) (2000) 17–32

13. Mohri, M.: Generic ε-removal and input ε-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science (2002)

14. Mohri, M.: Weighted automata algorithms. Chapter 6 from: Handbook of Weighted Automata (M. Droste, W. Kuich, and H. Vogler, eds.). Springer, Berlin (2009) 213–254

15. Sakarovitch, J.: Elements of automata theory. Cambridge University Press, Cambridge, UK (2009)