
2.11 Structured Language Models

The other major class of approaches to language modeling consists of those that assign structure to the sequence of words, and utilize that structure to compute probabilities.

These approaches typically require an expert to initially encode the structure, typically in the form of a grammar. In general, they have an advantage over n-gram approaches because there are far fewer parameters to estimate for these models, which means far less training data is required to robustly estimate their parameters.

However, structured models require some amount of computation to determine the structure of a particular test sentence, which is costly in comparison to the efficient n-gram mechanism. And, this computation could fail to find the structure of the sentence when the test sentence is agrammatical. With such sentences, structured models have a very difficult time assigning probability. This problem is referred to as the robustness problem.

When discussing perplexity results for such structural language models, typically the test set coverage is first quoted, i.e., what percentage of the test set parses according to the grammar, and then the perplexity of the language model is quoted for that particular subset of the test set. This makes objective comparisons of such models difficult, since the perplexity varies with the particular subset of the test set that parses.
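
To make this reporting convention concrete: writing $T$ for the test set and $C \subseteq T$ for the subset that the grammar parses (notation introduced here only for illustration, not taken from the works discussed), results are reported as the coverage together with the perplexity measured only over $C$:

\[
\text{coverage} = \frac{|C|}{|T|}, \qquad
PP(C) = \exp\!\left(-\frac{1}{\sum_{s \in C} |s|} \sum_{s \in C} \log P(s)\right),
\]

where $|s|$ is the number of words in sentence $s$.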

Some extensions have been made to these systems to try to deal in some manner with the robustness problem. For example, both the TINA and PLR models, described below, accommodate such exceptions. But the emphasis of these robust parsing techniques is to extract the correct meaning of such sentences rather than to provide a reasonable estimate of the probability of the sentence.

2.11.1 SCFG

One structural approach is the stochastic context-free grammar (SCFG) [26], which mirrors the context-freeness of the grammar by assigning an independent probability to each rule. These probabilities are estimated by examining the parse trees of all of the training sentences.
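
Under the usual relative-frequency (maximum-likelihood) estimate, which is one standard reading of "examining the parse trees", each rule's probability is its count in the training parses, normalized over all rules sharing the same left-hand side:

\[
P(A \rightarrow \alpha) = \frac{\mathrm{count}(A \rightarrow \alpha)}{\sum_{\beta} \mathrm{count}(A \rightarrow \beta)} ,
\]

so that the probabilities of all rules expanding a given non-terminal sum to one.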

A sentence is then assigned a probability by multiplying the probabilities of the rules used in parsing the sentence. If the sentence is ambiguous (has two or more parses), the probability of the sentence is then the sum of the probabilities of each of the parses. This process generates a valid probability distribution, since the sum of the probabilities of all sentences which parse will be 1.0, and all sentences which do not parse have zero probability.
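
As a minimal sketch of this computation (the toy grammar, rule probabilities, and parses below are invented purely for illustration, not taken from [26]):

```python
# Hypothetical rule probabilities for a toy SCFG; all names are illustrative only.
rule_prob = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("dogs",)):    0.5,
    ("NP", ("cats",)):    0.5,
    ("VP", ("bark",)):    0.7,
    ("VP", ("sleep",)):   0.3,
}

def parse_probability(parse):
    """Probability of one parse: the product of its rules' probabilities."""
    p = 1.0
    for rule in parse:
        p *= rule_prob[rule]
    return p

def sentence_probability(parses):
    """Probability of a sentence: the sum over all of its parses."""
    return sum(parse_probability(parse) for parse in parses)

# "dogs bark" has a single parse under this toy grammar.
parses_of_dogs_bark = [
    [("S", ("NP", "VP")), ("NP", ("dogs",)), ("VP", ("bark",))],
]
print(sentence_probability(parses_of_dogs_bark))   # 0.35
```

In practice the set of parses would of course come from a parser rather than being listed by hand.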

SCFG's have been well studied by the theoretical community, and well-known algorithms have been developed to manipulate SCFG's and to compute useful quantities for them [32, 26]. A SCFG may be converted into the corresponding n-gram model [43]. The inside/outside algorithm is used to re-estimate the rule probabilities so as to maximize the total probability of the training set [2, 29]. Given a sentence prefix, it is possible to predict the probabilities of all possible next words which may follow according to the SCFG.
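
For grammars in Chomsky normal form, the total probability of a sentence can also be computed without enumerating parses, using the inside algorithm on which inside/outside re-estimation builds. The sketch below uses an invented toy grammar and is only an illustration of the idea, not the formulation of [2, 29]:

```python
from collections import defaultdict

# Toy SCFG in Chomsky normal form; symbols and probabilities are invented.
binary_rules = {                 # A -> B C
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.6,
}
unary_rules = {                  # A -> w
    ("NP", "dogs"): 0.3,
    ("NP", "cats"): 0.3,
    ("NP", "food"): 0.4,
    ("VP", "sleep"): 0.4,
    ("V", "like"): 1.0,
}

def inside_probabilities(words):
    """CKY-style inside algorithm: beta[i][j][A] is the total probability
    that non-terminal A derives words[i..j], summed over all parses."""
    n = len(words)
    beta = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                  # length-1 spans
        for (a, term), p in unary_rules.items():
            if term == w:
                beta[i][i][a] += p
    for span in range(2, n + 1):                   # longer spans, bottom up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                  # split point
                for (a, b, c), p in binary_rules.items():
                    beta[i][j][a] += p * beta[i][k][b] * beta[k + 1][j][c]
    return beta

words = "dogs like food".split()
beta = inside_probabilities(words)
print(beta[0][len(words) - 1]["S"])                # total sentence probability
```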


2.11.2 TINA

TINA [38] is a structured probabilistic language model developed explicitly for both understanding and ease of integration with a speech recognition system. From a hand-written input CFG, TINA constructs a probabilistic recursive transition network which records probabilities on the arcs of the network. The probability of a sentence, if it parses, is the product of all probabilities along the arcs. Thus, the probabilities are conditional bigram probabilities, at all levels of the recursive network.

For each unique non-terminal NT in the input grammar, TINA constructs a small bigram network that joins all rules that have the particular left-hand-side non-terminal. This process is actually a form of generalization as it allows more sentences to be accepted than were accepted by the original grammar. Probabilities are no longer explicitly rule-dependent as in the SCFG formalism, but instead are dependent in a bigram-like manner on the unit immediately preceding a given unit.
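
A minimal sketch of this style of estimation (the toy parse trees, the START marker, and all names below are invented for illustration, not TINA's actual data structures) conditions each unit on its preceding sibling within the parent non-terminal, by relative frequency:

```python
from collections import defaultdict

# Toy parse trees as (parent, [children...]) tuples, with strings as leaves.
trees = [
    ("S", [("NP", ["dogs"]), ("VP", ["bark"])]),
    ("S", [("NP", ["cats"]), ("VP", ["sleep"])]),
]

counts = defaultdict(float)
totals = defaultdict(float)

def collect(node):
    """Count, within each parent non-terminal, which child follows which
    (START marks the first child), in the bigram-like style described above."""
    parent, children = node
    prev = "START"
    for child in children:
        label = child if isinstance(child, str) else child[0]
        counts[(parent, prev, label)] += 1
        totals[(parent, prev)] += 1
        prev = label
        if not isinstance(child, str):
            collect(child)

for t in trees:
    collect(t)

def arc_prob(parent, prev, label):
    """P(label | preceding sibling, parent), estimated by relative frequency."""
    return counts[(parent, prev, label)] / totals[(parent, prev)]

print(arc_prob("S", "START", "NP"))   # 1.0
print(arc_prob("S", "NP", "VP"))      # 1.0
```

The probability of a parsed sentence is then the product of such arc probabilities along its derivation.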

TINA parses a test sentence using a best-first strategy whereby the partial parse which looks most promising is chosen for extension. This has the effect of creating an extremely efficient parser, when the model is trained well, and was motivated by the notion that humans do not have nearly as much difficulty with ambiguous parses as computers do. TINA also has a trace mechanism and a feature-unification procedure, which complicates the probabilistic model and makes it more difficult to estimate the perplexity.
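
The control strategy itself can be sketched as a generic agenda-based search in which the highest-scoring partial parse is always extended first; the skeleton below illustrates that idea only and is not TINA's actual interface:

```python
import heapq
import itertools

def best_first_parse(initial_items, expand, is_complete):
    """Generic best-first search over partial parses.

    `initial_items` is a list of (score, partial_parse) pairs, `expand` maps
    a partial parse to (score, successor) pairs, and `is_complete` tests for
    a full parse.  All names are illustrative placeholders."""
    counter = itertools.count()   # tie-breaker so items themselves never compare
    # heapq is a min-heap, so scores are negated to pop the best item first.
    agenda = [(-score, next(counter), item) for score, item in initial_items]
    heapq.heapify(agenda)
    while agenda:
        neg_score, _, item = heapq.heappop(agenda)
        if is_complete(item):
            return -neg_score, item          # most promising complete parse
        for score, successor in expand(item):
            heapq.heappush(agenda, (-score, next(counter), successor))
    return None                              # no parse found
```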

TINA, like SCFG, only assigns probability to those sentences which it accepts. It therefore has the same difficulties with robustness as other structural approaches. Recent work has been done to address this problem, mainly by resorting to standard bigram models for words that fall outside of parsable sequences [39].

2.11.3 PLR

PLR [19] is another approach which generates a probabilistic language model from a CFG as input. It is a probabilistic extension to the efficient LR parsing algorithm [1], and is designed to achieve full coverage of a test set by implementing error recovery schemes when a sentence fails to parse.

The LR model takes a CFG as input and uses this CFG to construct a deterministic parser. This parser looks at each word in the input sentence in sequence, and decides to either shift the word onto the top of the stack, or reduce the top of the stack to a particular non-terminal. The PLR model extends this by predicting word probabilities as a function of which units are on the top of the stack. At every iteration, the stack represents the partially reduced portion of the test sentence processed so far.
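
A toy sketch of the resulting probability model (the mini-grammar, action table, and probabilities below are invented purely for illustration): each word is scored conditioned on the unit currently on top of the stack, shifted, and the stack is then reduced where possible:

```python
word_prob = {                      # P(next word | top of stack)
    (None, "dogs"):  0.5,          # nothing shifted yet
    (None, "cats"):  0.5,
    ("NP", "bark"):  0.6,
    ("NP", "sleep"): 0.4,
}

reductions = {                     # deterministic toy reduce actions
    "dogs": "NP",
    "cats": "NP",
    "bark": "VP",
    "sleep": "VP",
    ("NP", "VP"): "S",
}

def plr_probability(words):
    """Shift each word, multiply in P(word | stack top), then greedily reduce."""
    stack, prob = [], 1.0
    for w in words:
        top = stack[-1] if stack else None
        prob *= word_prob.get((top, w), 0.0)   # predictive, word by word
        stack.append(w)                        # shift
        while True:                            # apply any matching reductions
            if stack and stack[-1] in reductions:
                stack[-1] = reductions[stack[-1]]
            elif len(stack) >= 2 and (stack[-2], stack[-1]) in reductions:
                stack[-2:] = [reductions[(stack[-2], stack[-1])]]
            else:
                break
    return prob, stack

print(plr_probability("dogs bark".split()))    # (0.3, ['S'])
```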

One limitation of this approach is that it requires that the input CFG fall within a special class of CFG's known as the LR grammars. If the grammar is in this class, it is possible to construct a deterministic LR parser. An advantage of this model is that it is already formulated as a word-by-word predictive model, thus making integration with speech recognition systems easier. It is also extremely efficient as, at any given point, the LR machine knows exactly which action, shift or reduce, applies.


Chapter 3