• Aucun résultat trouvé

Zen and the Art of Symbolic Computing: Light and Fast Applicative Algorithms for Computational Linguistics

N/A
N/A
Protected

Academic year: 2022

Partager "Zen and the Art of Symbolic Computing: Light and Fast Applicative Algorithms for Computational Linguistics"

Copied!
2
0
0

Texte intégral

(1)

Zen and the Art of Symbolic Computing:

Light and Fast Applicative Algorithms for Computational Linguistics

G´erard Huet

INRIA Rocquencourt,

BP 105, 78153 Le Chesnay Cedex, France, [email protected],

http://pauillac.inria.fr/~huet

Abstract. Computational linguistics is an application of computer sci- ence which presents interesting challenges from the programming method- ology point of view. Developing a realistic platform for the treatment of a natural language in its phonological, morphological, syntactic, and ulti- mately semantic aspects demands a principled modular architecture with complex cooperation between the various layers. Representing large lex- ical data bases, treating sophisticated phonological and morphological transformations, and processing in real time large corpuses demands fast finite-state methods toolkits. Analysing the syntactic structure, comput- ing anaphoric relations, and dealing with the representation of informa- tion flow in dialogue understanding, demands the processing of complex constraints on graph structures, with sophisticated sharing of large non- deterministic search spaces.

The talk reports on experiments in using declarative programming for the processing of the sanskrit language, in its phonological and morphological aspects. A lexicon-based morphological tagger has been designed, using an original algorithm for the analysis of euphony (the so-called sandhi process, which glues together the words of a sentence in a continuous stream of phonemes). This work, described in [2], has been implemented in a purely applicative core subset of Objective Caml [5]. The basic structures underlying this methodology have been abstracted in the Zen toolkit, distributed as free software [3]. Two complementary techniques have been put to use. Firstly, we advocate the systematic use ofzippers [1] for the programming of mutable data structures in an applicative way.

Zippers, orlinear contexts, are related to the interaction combinators of linear logic. Secondly, asharing functorallows the uniform minimisation of inductive data structures by representing them as shared dags. This is similar to the traditional technique of bottom-up hashing, but the com- putation of the keys is left to the client invoking the functor, which has two advantages: keys are computed along with the bottom-up traversal of the structure, and more importantly their computation may profit of specific statistical properties of the data at hand, optimising the buck- ets balancing in ways which would be unattainable by generic functions.

These two complementary technologies are discussed in [4].

The talk discusses the use of these tools in the uniform representation of finite state automata and transducers asdecorated lexical trees (also

(2)

II

called tries). The trie acts as a spanning tree of the automaton search space, along a preferred deterministic skeleton. Non deterministic transi- tions are constructed as choice points with virtual addresses, which may be either absolute words (locating the target state by a path from the starting state) or relativedifferential words (bytecode of the trie zipper processor, representing the shortest path in the spanning tree of the state graph). Sharing such automata structures gives uniformly an associated equivalent minimal automaton. For instance, the lexicon is itself repre- sented by its characteristic minimal recognizer. But this applies as well to possibly non-deterministic transducers. Thus our segmenting sandhi analyser compiles a lexicon of 120000 flexed forms with a data base of 2800 string rewrite rules into a very compact transducer of 7300 states fitting in 700KB of memory, the whole computation taking 9s on a plain PC.

We believe that our experiment with functional programming applied to lexical and morphological processing of natural language is a con- vincing case that direct declarative programming techniques are often superior to more traditional imperative programming techniques using complex object-oriented methodologies. Our programs are very short, easy to maintain and debug, though efficient enough for real-scale use.

It is our belief that this extends to other areas of Computational Lin- guistics, and indeed to most areas of Symbolic Computation.

References

1. G´erard Huet. “The Zipper”. J. Functional Programming 7,5 (Sept. 1997), pp. 549–

554.

2. G´erard Huet. “Transducers as Lexicon Morphisms, Segmentation by Euphony Anal- ysis, And Application to a Sanskrit Tagger”. Draft available as http://pauillac.

inria.fr/~huet/FREE/tagger.pdf.

3. G´erard Huet. “The Zen Computational Linguistics Toolkit”. ESSLLI 2002 Lectures, Trento, Italy, Aug. 2002. Available as:http://pauillac.inria.fr/~huet/PUBLIC/

esslli.pdf.

4. G´erard Huet. “Linear Contexts and the Sharing Functor: Techniques for Symbolic Computation”. Submitted for publication, 2002.

5. Xavier Leroy et al. “Objective Caml.” See: http://caml.inria.fr/ocaml/index.

html.

Références

Documents relatifs

The Initialization function, init , computes the rst element of the chain from the identier of the node and from global information, the list of all the identiers and the size of

In this paper we show that is possible to represent NLP models such as Probabilistic Context Free Grammars, Probabilistic Left Corner Grammars and Hidden Markov Models

Natural language processing is such an important subject that can not only afford to develop the field of artificial intelli- gence [17], but also help our everyday lives, eg.: lives

The theory T G is adapted for the description of functions with fixed arity as are Partial Recursive functions. Nevertheless, functions with variable arity are programmable

In particular using only équations as axioms it is impossible to specify sufficiently complete abstract types including models containing partial recursive functions with

Postmodem theory is not the only approach which suggests that a such a dialectical relation between language and the social world exists, but its analysis of this

In order to make the transducer p-sequential, lexically ambiguous word forms must be processed in a specific way: any difference between the several word tags for such a word form

For solving different mixed-integer optimization problems metaheuristics like TS, GA, ACO, and SA have been proposed by using GPU computing and their high performance is remarkable