Sequence generation
6.1 Lecture
• To generate a text, one can generate each token sequentially before taking the decision to stop the generation.
• The choice of a word in a vocabulary and the choice of stopping the generation are usually treated in a unified way using an abstract end-of-generation token. This token is usually appended at the end of all texts used by the system.
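• A minimal sketch of such a generation loop, in which stopping is just one more token choice. The names `EOS`, `generate` and `toy_distribution` are illustrative assumptions, and the probabilities are invented:

```python
import random

EOS = "<eos>"  # hypothetical name for the abstract end-of-generation token

def generate(next_token_distribution, max_len=50, rng=None):
    """Sample tokens one by one until EOS is drawn (or max_len is reached)."""
    rng = rng or random.Random(0)
    tokens = []
    while len(tokens) < max_len:
        dist = next_token_distribution(tokens)       # dict: token -> probability
        words, probs = zip(*dist.items())
        token = rng.choices(words, weights=probs, k=1)[0]
        if token == EOS:                             # stopping is an ordinary token choice
            break
        tokens.append(token)
    return tokens

def toy_distribution(prefix):
    """Toy model (invented numbers): stopping gets likelier as the text grows."""
    p_stop = min(1.0, 0.2 * len(prefix))
    return {"la": (1 - p_stop) / 2, "lo": (1 - p_stop) / 2, EOS: p_stop}

sample = generate(toy_distribution)
```

With this toy distribution, a generated text contains between one and five tokens, since the probability of stopping reaches 1 after five tokens.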
• To generate a token, one needs a probability distribution over a vocabulary. Then, one can sample a word from the distribution. If one knows which token should be generated (or, more generally, if one has a target distribution), the computed distribution can be trained by minimising a cross entropy.
• If the vocabulary is fixed, then one typically uses a network ending with a linear layer followed by a softmax.
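• As a hedged illustration of the last three points, the sketch below computes a distribution with a linear layer followed by a softmax, samples a word from it, and evaluates the cross entropy against a one-hot target. The vocabulary, the input vector and the parameters are arbitrary assumptions:

```python
import math
import random

def softmax(scores):
    """Turn unnormalised scores into a probability distribution."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    """One score per vocabulary entry: `weights` has one row per entry."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_v target[v] * log predicted[v]."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

vocab = ["cat", "dog", "<eos>"]              # toy vocabulary (assumption)
x = [1.0, -1.0]                              # some input representation
W = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]     # arbitrary parameters
b = [0.0, 0.0, 0.0]

probs = softmax(linear(x, W, b))
word = random.Random(0).choices(vocab, weights=probs, k=1)[0]   # sampling
loss = cross_entropy([1.0, 0.0, 0.0], probs)                    # the target is "cat"
```

With a one-hot target, the cross entropy reduces to the negative log-probability of the target token.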
• The input of this network must change in some way if one wants the probability distribution over the vocabulary to change from token to token.
• Let us study some specific natural language generation (NLG) systems: language models.
• Language models have two main uses:
– they assign a probability to any sequence of tokens (What is the natural probability distribution of the sentences of a given language?);
– they assign a probability to any token in a given context.
• For a long time, language models were developed for their ability to model a probability distribution over all possible sequences. This was a useful addition to models that output a list of candidate sentences, for example in speech recognition and in machine translation. More recently, neural language models have been popular for their ability to generate very powerful word embeddings.
• Typically, given an (incomplete) text $w_0 \cdots w_{t-1}$, we are interested in the probability distribution for the following token. Writing $W_i$ for the random variable (ranging over the vocabulary) representing the possible value of the $i$th token, this distribution is written:
$$P(W_t \mid W_0 = w_0, \ldots, W_{t-1} = w_{t-1})$$
By abuse of notation, we will write this distribution ‘$P(w_t \mid w_0 \cdots w_{t-1})$’.
• The first popular language models were n-gram models (or variations thereof). An n-gram model only uses the $n-1$ preceding tokens as context:
$$P(w_t \mid w_0 \cdots w_{t-1}) \approx P(w_t \mid w_{t-n+1} \cdots w_{t-1})$$
The latter is estimated by counting in a corpus.
• One of the main limitations of n-gram models is that all contexts are considered completely independent: observing $w_t$ in context $w_{t-n+1} \cdots w_{t-1}$ only helps the model refine its prediction for this specific context, while ideally this observation carries information about all similar contexts as well. For example, if one observes ‘play’ in the context ‘My cat likes to’, one could infer that ‘play’ is also likely to occur in the (similar) context ‘My dog likes to’.
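• A small sketch of estimation by counting, which also exhibits the limitation just described. The padding symbols `<s>` and `</s>`, the function names and the three-sentence corpus are assumptions made for the example:

```python
from collections import Counter, defaultdict

def train_ngram(corpus, n):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] * (n - 1) + sentence + ["</s>"]   # padding symbols (assumption)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            counts[context][tokens[i]] += 1
    return counts

def ngram_prob(counts, context, token):
    """Relative-frequency estimate of P(token | context); 0 for unseen contexts."""
    total = sum(counts[context].values())
    return counts[context][token] / total if total else 0.0

corpus = [["my", "cat", "likes", "to", "play"],
          ["my", "cat", "likes", "to", "sleep"],
          ["my", "dog", "likes", "to", "eat"]]
fourgram = train_ngram(corpus, n=4)

p_cat = ngram_prob(fourgram, ("cat", "likes", "to"), "play")   # seen once out of twice
p_dog = ngram_prob(fourgram, ("dog", "likes", "to"), "play")   # never seen in this context
```

Here `p_cat` is 0.5 while `p_dog` is 0: what was observed after ‘cat likes to’ tells this model nothing about ‘dog likes to’.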
• This limitation led Bengio, Ducharme and Vincent (2001) to map each token to a word embedding, turning the $(n-1)$-gram that makes up the context into a matrix. This matrix is then used as input to a feed-forward neural network that computes a probability distribution over the vocabulary. The whole system is trained by minimising the cross entropy between the predicted distribution and the actual (degenerate, ‘one-hot like’) distribution, in other words the negative log-likelihood of the actual word.
• [illustration]
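• The forward pass and the loss of such a feed-forward language model can be sketched as follows. This is not the exact architecture of the cited paper: the sizes, the single tanh hidden layer, the absence of biases and the random initialisation are all assumptions, and the gradient step is left out:

```python
import math
import random

rng = random.Random(0)
vocab = ["my", "cat", "dog", "likes", "to", "play"]
V, d, n, hidden = len(vocab), 4, 3, 8            # toy sizes (assumptions)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

E = rand_matrix(V, d)                            # one embedding per token
W1 = rand_matrix(hidden, (n - 1) * d)            # hidden layer
W2 = rand_matrix(V, hidden)                      # output layer, one row per word

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(context_ids):
    """Distribution over the vocabulary given the n-1 preceding token ids."""
    x = [v for i in context_ids for v in E[i]]   # context matrix, flattened row by row
    h = [math.tanh(a) for a in matvec(W1, x)]
    return softmax(matvec(W2, h))

def nll(context_ids, target_id):
    """Cross entropy against a one-hot target, i.e. negative log-likelihood."""
    return -math.log(predict(context_ids)[target_id])

context = [vocab.index("likes"), vocab.index("to")]
loss = nll(context, vocab.index("play"))
```

Because similar words can receive similar embeddings during training, what is learnt for one context can transfer to neighbouring contexts, which the counting approach cannot do.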
• The simplest way to condition the predicted probability distribution for $w_t$ on the whole left context (instead of the last $n-1$ tokens only, as in the previous models) is to use a recurrent neural network (Mikolov, Karafiát et al. 2010; Sundermeyer, Schlüter and Ney 2012).
• Typically, one defines an LSTM ($f_{\mathrm{LSTM}} : (c_t, h_t, \mathit{in}_t) \mapsto (c_{t+1}, h_{t+1})$) such that:
– the hidden state (i.e. output) $h_t$ of the LSTM at time $t$ is used to compute the probability distribution for $w_t$;
– the input $\mathit{in}_t$ of the LSTM at time $t$ is the embedding of $w_t$.
Note that some implementations start at time step $-1$ instead of $0$ and use a special vector (the embedding of a ‘start-of-prediction’ token) as input at this step (no distribution is computed for $t = -1$).
• [illustration]
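• The indexing convention above ($h_t$ gives the distribution for $w_t$, then $w_t$ is fed back as input) can be sketched with a small hand-written LSTM cell. The sizes, the zero initial state, the omission of biases and the random parameters are assumptions of this example:

```python
import math
import random

rng = random.Random(0)
V, d, H = 5, 3, 4            # toy vocabulary, embedding and state sizes (assumptions)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

E = rand_matrix(V, d)                                   # token embeddings
W = {name: rand_matrix(H, H + d) for name in "ifog"}    # one matrix per gate, no biases
W_out = rand_matrix(V, H)                               # hidden state -> vocabulary scores

def f_lstm(c, h, x):
    """One LSTM step: (c_t, h_t, in_t) -> (c_{t+1}, h_{t+1})."""
    z = h + x                                           # concatenation [h_t; in_t]
    i, f, o = ([sigmoid(a) for a in matvec(W[name], z)] for name in "ifo")
    g = [math.tanh(a) for a in matvec(W["g"], z)]
    c_next = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
    h_next = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_next)]
    return c_next, h_next

def sequence_log_prob(token_ids):
    """Sum over t of log P(w_t | w_0 ... w_{t-1})."""
    c, h = [0.0] * H, [0.0] * H                         # (c_0, h_0)
    total = 0.0
    for w in token_ids:
        probs = softmax(matvec(W_out, h))               # distribution for w_t, from h_t
        total += math.log(probs[w])
        c, h = f_lstm(c, h, E[w])                       # in_t is the embedding of w_t
    return total
```

With a zero initial state and no output bias, the distribution for $w_0$ is uniform in this sketch; a learnt initial state or a start-of-prediction step at $t = -1$ would remove that artefact.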
• With such an architecture, the output at time $t$ depends on the states at all times $t' \le t$. This means that the computation graph can grow very large, especially when texts several sentences long are used for training. As the whole computation graph is usually kept in memory in order to compute gradients, this might lead to technical difficulties during training.
• The same principle can be used to define a decoder. Given some input vector $x$ and target sequence $y$:
– use $x$ to compute an initial state $(c_0, h_0)$ for the LSTM (typically, using some other network);
– the subsequent generation is then conditioned by $x$: it is said to be a decoding of $x$.
• More generally, an architecture that first transforms some input $x$ into the initial state of a recurrent neural network, which then generates an output sequence, is called an encoder-decoder.
• In the particular case where the input $x$ is also a sequence, the encoder-decoder is said to be seq2seq (for sequence to sequence). A simple and obvious choice is to implement the sequence encoder with an RNN (Cho et al. 2014), typically a bidirectional LSTM. This kind of architecture is particularly suited to problems such as machine translation (Sutskever, Vinyals and Le 2014) or text simplification (Alva-Manchego, Scarton and Specia 2020).
• The most common way to train (encoder-)decoder models is to train them in the same way as the neural language models mentioned above: by minimising the sum/average of the negative log-likelihood of gold tokens in their gold context. This means that at training time, the input of the decoder at each time step is the correct one, independently of the decoder’s actual prediction.
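• A minimal sketch of this training criterion, assuming the model is exposed as a hypothetical function `next_distribution(prefix)` returning a list of probabilities indexed by token id. The toy model and its numbers are invented:

```python
import math

def teacher_forced_nll(next_distribution, gold_ids):
    """Average negative log-likelihood of each gold token given the gold prefix."""
    losses = []
    for t, gold in enumerate(gold_ids):
        probs = next_distribution(gold_ids[:t])   # gold context, never the model's own output
        losses.append(-math.log(probs[gold]))
    return sum(losses) / len(losses)

def free_running_prefix(next_distribution, length):
    """What the model feeds itself at prediction time (most probable token each step)."""
    prefix = []
    for _ in range(length):
        probs = next_distribution(prefix)
        prefix.append(max(range(len(probs)), key=probs.__getitem__))
    return prefix

def toy_model(prefix):
    """Invented 3-token model that prefers repeating the last token."""
    if not prefix:
        return [0.5, 0.3, 0.2]
    probs = [0.2, 0.2, 0.2]
    probs[prefix[-1]] = 0.6
    return probs

gold = [1, 1, 2]
loss = teacher_forced_nll(toy_model, gold)        # contexts used: [], [1], [1, 1]
own = free_running_prefix(toy_model, 3)           # contexts the model builds for itself
```

In this toy case the loss is computed on the prefixes of `[1, 1, 2]`, whereas left to itself the model produces `[0, 0, 0]`, a context on which it would never be trained.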
• Training on gold contexts only is unfortunately a source of error propagation: during prediction, as soon as the system makes a mistake, it finds itself in a kind of situation never encountered at training time. For example, if the system makes a grammatical error and the training data does not contain any such error, the system’s subsequent predictions will be unreliable because it has never been trained in a context containing a grammatical error. In other words, the system is not robust to its own mistakes.
• This lack of robustness is not specific to NLG systems, but existing solutions do not necessarily work well when applied to these systems.
– For labeling tasks such as part-of-speech tagging, updating the RNN using its actual predictions instead of the ground truth proves effective. Such a strategy, however, does not really make sense for NLG. Suppose for instance that the target sentence is ‘This book tells the story of the famous computer scientist Alan Turing’ and that the system first generates ‘This book tells the story of’, as in the target, but then generates ‘Alan’ instead of ‘the’. Should the system really be trained to predict ‘famous’ in the context ‘This book tells the story of Alan’?
– In syntactic parsing, it is sometimes possible to define a dynamic oracle, a function that determines the best action to perform in any situation (Goldberg and Nivre 2012). There is unfortunately no obvious way to define (let alone compute) a dynamic oracle for NLG.
• A promising direction to solve this problem is to use reinforcement learning (Sutton and Barto 2018).
– In reinforcement learning, each time an action is taken, a scalar reward is given to the system.
– There is no need for a gold sequence of actions. Models are trained based on their own predictions instead.
– The loss is defined in such a way that minimising it leads to a maximisation of the expected sum of rewards at prediction time.
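• One common way to make the last point concrete is the policy-gradient (REINFORCE) formulation. It is given here as an assumption about which loss is meant, not as the only option. Writing $\pi_\theta$ for the model’s distribution over actions, $a_t$ for the action taken at step $t$, $r_t$ for its reward and $T$ for the length of the episode (all notation introduced for this example), and assuming that rewards depend on $\theta$ only through the actions sampled:

```latex
L(\theta) = -\,\mathbb{E}_{a_0 \cdots a_{T-1} \sim \pi_\theta}\!\left[\sum_{t=0}^{T-1} r_t\right],
\qquad
\nabla_\theta L(\theta) = -\,\mathbb{E}_{a_0 \cdots a_{T-1} \sim \pi_\theta}\!\left[\left(\sum_{t=0}^{T-1} r_t\right)\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid a_0 \cdots a_{t-1})\right]
```

The gradient can be estimated from sequences sampled from the model itself, which is why no gold sequence of actions is required.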
• Reinforcement learning has been applied to some NLG tasks such as machine translation (Bahdanau, Brakel et al. 2017; Ranzato et al. 2016). In that case, it is natural to define rewards such that the sum of rewards associated with a given generation episode is the value of the actual performance metric one is interested in (e.g. the BLEU score). Reinforcement learning systems are however notoriously difficult to train.
• In the seq2seq models described so far, the flow of information was very simple: the only access the decoder had to the input sequence was through its initial state, which can be seen as a bottleneck. This means that all the information from the input sequence used to generate the end of the output sequence must be kept in the memory of the decoder’s RNN during the whole decoding process, which might be difficult, especially when generating long sequences.
• To alleviate the burden on the decoder’s RNN, a possible solution is to compute the probability distribution at each time step not from the RNN’s output alone, but from the concatenation of this output and the encoding of the input sequence.
• [illustration]
• An improvement over this first solution consists in computing a different encoding of the input sequence at each generation step. Common implementation of this idea:
1. use the output of the decoder’s RNN to compute (normalised) attention scores over the input sequence;
2. use the attention scores to compute a linear combination of the input tokens’ representations;
3. use the concatenation of this linear combination and the decoder’s RNN output to compute the output distribution.
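• The three steps above can be sketched as follows. Dot-product scoring is an assumption made for brevity (other scoring functions, such as a small feed-forward network, are possible), and the vectors are toy values:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(decoder_output, input_reprs):
    """Steps 1 to 3, with dot-product scoring (one possible scoring function)."""
    scores = [dot(decoder_output, r) for r in input_reprs]    # step 1: raw scores...
    weights = softmax(scores)                                 # ...normalised
    dim = len(input_reprs[0])
    context = [sum(w * r[k] for w, r in zip(weights, input_reprs))
               for k in range(dim)]                           # step 2: linear combination
    return weights, decoder_output + context                  # step 3: concatenation

input_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encodings of 3 input tokens
weights, features = attend([2.0, 0.0], input_reprs)
```

The vector `features` is what would be passed to the final linear layer and softmax; the weights are recomputed at every decoding step.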
• This attention mechanism allows the decoder to focus on different parts of the input at different steps of the decoding process, which has proven effective for tasks such as machine translation (Bahdanau, Cho and Bengio 2015) or text summarisation (Rush, Chopra and Weston 2015).
• Note that this attention mechanism raises the complexity of the model from $O(n + m)$ (where $n$ is the length of the input and $m$ the length of the output) to $O(n + nm)$.
• There is another limitation to the seq2seq architectures described above. This limitation lies in the output distribution being computed by a traditional feed-forward network ending with a (fixed-size) linear layer followed by a softmax. This means in particular that the output vocabulary is fixed a priori.
• Quite frequently, the input vocabulary and the (ideal) output vocabulary share some elements. This is obvious for tasks such as text summarisation or question answering, in which the input and output sequences are written in the same language, but it is also often true in machine translation (think of proper nouns, for example). So, why not dynamically add the words used in the input sentence to the output vocabulary? This seems particularly interesting in cases such as machine translation, where words missing from the output vocabulary built at training time (e.g. proper nouns) are very likely to occur in the input sentence when they are needed.
• This is done by pointing (Vinyals, Fortunato and Jaitly 2015) over the input elements. To build a probability distribution over the union of a vocabulary $V$ and the input elements, one can compute
– (unnormalised) scores for each entry in $V$ using a traditional feed-forward network
– and (unnormalised) attention scores for each input element.
The distribution is then defined by applying the softmax function to the union of all these scores (which usually corresponds to the concatenation of two vectors).
• [illustration]
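• A hedged sketch of this construction: the two score vectors are concatenated, normalised by a single softmax, and the probabilities of tokens that appear both in the vocabulary and in the input are summed. This merging is a design choice of the example, and all scores are invented:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_distribution(vocab, vocab_scores, input_tokens, attention_scores):
    """Softmax over the concatenation of vocabulary scores and attention scores."""
    probs = softmax(vocab_scores + attention_scores)
    dist = {}
    for token, p in zip(vocab + input_tokens, probs):
        dist[token] = dist.get(token, 0.0) + p     # merge duplicates (assumption)
    return dist

vocab = ["the", "visited", "<eos>"]
input_tokens = ["Turing", "visited", "Princeton"]
dist = pointer_distribution(vocab, [1.0, 0.5, 0.0], input_tokens, [2.0, 0.5, 1.0])
best = max(dist, key=dist.get)
```

Here the most probable token is ‘Turing’, a word that is absent from the fixed vocabulary and can only be produced by pointing at the input.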
• The NLG systems we have described are able to compute a probability conditioned on their input for any sequence of tokens:
$$P(w_0 \cdots w_{n-1} \mid \text{input}) = \prod_{i=0}^{n-1} P(w_i \mid w_0 \cdots w_{i-1}, \text{input})$$
• At prediction time, for a given input, one would ideally be able to find the output sequence with maximal probability. In general, this is unfortunately not doable in practice, as this could require an arbitrarily large computation cost. (Remember that there is an exponential number of sequences of a given length, and we usually don’t even know the length of the optimal sequence.)
• The simplest decoding strategy is the greedy one: at each time step, select the word with the highest probability.
• Because our systems are rarely perfect, we cannot expect the greedy strategy to work very well. As an illustration, consider a translation task from French to English. Suppose that the input sentence is ‘Le jardinier tondait les moutons.’ (with gold target ‘The gardener sheared the sheep.’) and that the system has predicted ‘The’ and then ‘gardener’ as most probable words in first and second position respectively. The system could then easily assign a higher probability to ‘mowed’ than to ‘sheared’ for the third position in this context (because ‘to mow’ is indeed a common translation of ‘tondre’, especially in gardening contexts). We can imagine the greedy strategy leading the system to generate ‘The gardener mowed the sheep.’, but with a very low overall probability, as ‘sheep’ is extremely unlikely in the context ‘The gardener mowed the’. Crucially, the probability assigned to the generated sentence could be much lower than the probability assigned to the gold sentence.
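• The example above can be reproduced with a toy model. The conditional probabilities in `TABLE` are invented for the illustration, and the function names are assumptions:

```python
def greedy_decode(next_distribution, eos="<eos>", max_len=20):
    """Pick the most probable token at each step; also return the sequence probability."""
    prefix, prob = [], 1.0
    while len(prefix) < max_len:
        dist = next_distribution(tuple(prefix))
        token = max(dist, key=dist.get)
        prob *= dist[token]
        if token == eos:
            break
        prefix.append(token)
    return prefix, prob

def sequence_prob(next_distribution, tokens, eos="<eos>"):
    """Probability of a complete sequence, as a product of conditional probabilities."""
    prob = 1.0
    for t, token in enumerate(list(tokens) + [eos]):
        prob *= next_distribution(tuple(tokens[:t])).get(token, 0.0)
    return prob

# Invented conditional probabilities mimicking the example above.
TABLE = {
    (): {"The": 1.0},
    ("The",): {"gardener": 1.0},
    ("The", "gardener"): {"mowed": 0.6, "sheared": 0.4},
    ("The", "gardener", "mowed"): {"the": 1.0},
    ("The", "gardener", "sheared"): {"the": 1.0},
    ("The", "gardener", "mowed", "the"): {"sheep": 0.2, "lawn": 0.16, "grass": 0.16,
                                          "hedge": 0.16, "field": 0.16, "meadow": 0.16},
    ("The", "gardener", "sheared", "the"): {"sheep": 1.0},
}

def toy_model(prefix):
    return TABLE.get(prefix, {"<eos>": 1.0})      # unknown prefixes simply stop

greedy_tokens, greedy_prob = greedy_decode(toy_model)
gold_prob = sequence_prob(toy_model, ["The", "gardener", "sheared", "the", "sheep"])
```

In this toy setting, greedy decoding commits to ‘mowed’ and ends with a sequence of probability about 0.12, while the gold sequence has probability 0.4.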
• The most common improvement over the greedy strategy is beam search (first introduced for speech recognition; Lowerre 1976), of which greedy decoding is a degenerate case. For a given $n \in \mathbb{N}^*$, decoding with a beam of width $n$ consists in maintaining in parallel a set of $n$ best hypotheses:
– start with a set $l_0$ containing only the empty hypothesis (i.e. the empty sequence of tokens), associated with the probability 1;
– at each time step $k \ge 0$,
∗ for each incomplete hypothesis $w_0 \cdots w_{k-1}$ in $l_k$, compute the probability of each of its possible continuations $w_0 \cdots w_k$ (multiplying the probability of the hypothesis by the probability of the next token);
∗ define $l_{k+1}$ as the set of the $n$ most likely sequences among all the continuations considered just above and all complete hypotheses in $l_k$;
∗ if all sequences in $l_{k+1}$ are complete, end the process and return the sequence with the highest probability.
In practice, one can implement beam decoding by maintaining a batch of decoder states.
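• A plain (unbatched) sketch of the procedure. It starts from a single empty hypothesis, since identical copies would be redundant. The `max_len` cut-off, the function names and the probabilities in `TABLE` are assumptions of the example:

```python
def beam_search(next_distribution, width, eos="<eos>", max_len=20):
    """Beam search over hypotheses stored as (tokens, probability, complete)."""
    beam = [((), 1.0, False)]                        # the empty hypothesis, probability 1
    for _ in range(max_len):
        candidates = [h for h in beam if h[2]]       # complete hypotheses are carried over
        for tokens, prob, _done in (h for h in beam if not h[2]):
            for token, p in next_distribution(tokens).items():
                if token == eos:
                    candidates.append((tokens, prob * p, True))
                else:
                    candidates.append((tokens + (token,), prob * p, False))
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:width]
        if all(done for _, _, done in beam):
            break
    return max(beam, key=lambda h: h[1])[:2]         # (tokens, probability)

# Invented conditional probabilities for a toy translation model.
TABLE = {
    (): {"The": 1.0},
    ("The",): {"gardener": 1.0},
    ("The", "gardener"): {"mowed": 0.6, "sheared": 0.4},
    ("The", "gardener", "mowed"): {"the": 1.0},
    ("The", "gardener", "sheared"): {"the": 1.0},
    ("The", "gardener", "mowed", "the"): {"sheep": 0.2, "lawn": 0.16, "grass": 0.16,
                                          "hedge": 0.16, "field": 0.16, "meadow": 0.16},
    ("The", "gardener", "sheared", "the"): {"sheep": 1.0},
}

def toy_model(prefix):
    return TABLE.get(prefix, {"<eos>": 1.0})         # unknown prefixes simply stop

narrow = beam_search(toy_model, width=1)             # behaves like greedy decoding
wide = beam_search(toy_model, width=2)
```

With a width of 1 the search returns the ‘mowed’ sentence (probability about 0.12); with a width of 2 the ‘sheared’ hypothesis survives long enough to win (probability 0.4).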
• On average, longer sequences of tokens are assigned lower probabilities than shorter ones. As a result, the beam search algorithm just described is biased towards shorter sequences. To eliminate this bias, a solution is to use in the selection criterion, instead of a sequence’s probability, this probability raised to the power $1/n$, where $n$ here denotes the sequence’s length (not the beam width).¹
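• A small numerical illustration of this normalisation, computed in log space. The two hypotheses and their token probabilities are invented:

```python
import math

def normalised_score(token_log_probs):
    """Average conditional log-probability, i.e. the log of P ** (1 / n)."""
    return sum(token_log_probs) / len(token_log_probs)

short = [math.log(0.5)] * 2      # 2 tokens, sequence probability 0.25
long = [math.log(0.7)] * 4       # 4 tokens, sequence probability 0.2401

raw_prefers_short = sum(short) > sum(long)
normalised_prefers_long = normalised_score(long) > normalised_score(short)
```

The raw probability favours the shorter hypothesis, whereas the normalised score favours the longer one, whose tokens are individually more probable.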
• The larger the beam, the more likely it is to contain and return the best candidate according to the model. If a beam search with a given width does not return the target sentence, increasing the width of the beam only makes sense if the target sentence is assigned a higher score than any sentence currently in the beam — otherwise, the width increase might include the target in the beam but would still not lead to it being selected in the end.
• Classical RNNs such as LSTMs are not the only choice for implementing encoders or decoders. Transformer-based models (Vaswani et al. 2017) have become popular too.
• The encoder part of a transformer is a stack of layers, each of which crucially includes a self-attention mechanism: at a given layer, each token can access the encodings of all tokens at the preceding layer. This is possible because encoding presupposes that the full input sequence is known from the start.
• Decoding, however, is done one token at a time, because we want the generation of a given token to take into consideration all previously generated tokens. The decoder part of a transformer is also a stack of layers, quite similar to the encoder’s layers, but tokens are processed one after the other, as with a classical RNN. The decoding layers of a typical transformer feature not one but two attention mechanisms:
– a traditional attention mechanism on the input sequence;
– a (left-restricted) self-attention mechanism: each token being generated is able to attend to the encodings of all previously generated tokens.
• [illustration]
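• The left-restricted self-attention can be sketched as below. This is a deliberately stripped-down version: a single head, no learnt query/key/value projections, and scaled dot-product scoring are all assumptions of the example:

```python
import math

def causal_self_attention(xs):
    """Unparameterised self-attention in which position t only sees positions <= t."""
    outputs = []
    for t, query in enumerate(xs):
        visible = xs[:t + 1]                                  # no access to future tokens
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
                  for key in visible]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[k] for w, v in zip(weights, visible))
                        for k in range(len(query))])
    return outputs

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # toy encodings of 3 generated tokens
out = causal_self_attention(xs)
changed = causal_self_attention(xs[:2] + [[9.0, 9.0]])
```

Replacing the last token leaves the outputs at the earlier positions unchanged, which is what allows tokens to be generated one at a time.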
¹ It is common in practice to work with log-probabilities. The normalisation proposed here corresponds to dividing the log-probability of a sequence by its length, in other words to using the average conditional log-probability of its tokens.