Modeling Past and Future for Neural Machine Translation

(1)

Modeling Past and Future for Neural Machine Translation

Zaixiang Zheng^∗ Nanjing University [email protected]

Hao Zhou^∗ Toutiao AI Lab [email protected]

Shujian Huang Nanjing University [email protected] Lili Mou

University of Waterloo [email protected]

Xinyu Dai Nanjing University

[email protected]

Jiajun Chen Nanjing University [email protected]

Zhaopeng Tu Tencent AI Lab [email protected]

Abstract

Existing neural machine translation systems do not explicitly model what has been translated and what has not during the decoding phase. To address this problem, we propose a novel mechanism that separates the source information into two parts: translated PAST

contents and untranslated FUTURE contents, which are modeled by two additional recurrent layers. The PAST and FUTUREcontents are fed to both the attention model and the decoder states, which provides Neural Machine Translation (NMT) systems with the knowl- edge of translated and untranslated contents.

Experimental results show that the proposed approach significantly improves the performance in Chinese-English, German-English, and English-German translation tasks. Specif- ically, the proposed model outperforms the conventional coverage model in terms of both the translation quality and the alignment error rate.^†

1 Introduction

Neural machine translation (NMT) generally adopts an encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), where the encoder summarizes the source sentence into a source context vector, and the decoder generates the target sentence word-by-word based on the given source. During translation, the decoder implicitly serves several functionalities at the same time:

*Equal contributions.

†Our code can be downloaded fromhttps://github.

com/zhengzx-nlp/past-and-future-nmt.

1. Building a language model over the target sentence for translation fluency (LM).

2. Acquiring the most relevant source-side information to generate the current target word (PRESENT).

3. Maintaining what parts in the source have been translated (PAST) and what parts have not (FUTURE).

However, it may be difficult for a single recurrent neural network (RNN) decoder to accomplish these functionalities simultaneously. A recent suc- cessful extension of NMT models is the attention mechanism (Bahdanau et al., 2015; Luong et al., 2015), which makes a soft selection over source words and yields anattentive vectorto represent the most relevant source parts for the current decoding state. In this sense, the attention mechanism separates the PRESENT functionality from the decoder RNN, achieving significant performance improvement.

In addition to PRESENT, we address the impor- tance of modeling PAST and FUTURE contents in machine translation. The PAST contents indicate translated information, whereas the FUTURE contents indicate untranslated information, both being crucial to NMT models, especially to avoid under- translation and over-translation (Tu et al., 2016).

Ideally, PAST grows and FUTURE declines during the translation process. However, it may be difficult for a single RNN to explicitly model the above processes.

In this paper, we propose a novel neural machine translation system that explicitly models PAST and FUTURE contents with two additional RNN layers.

The RNN modeling the PASTcontents (called PAST

layer) starts from scratch and accumulates the in-

145

Transactions of the Association for Computational Linguistics, vol. 6, pp. 145–157, 2018. Action Editor: Philipp Koehn.

(2)

formation that is being translated at each decoding step (i.e., the PRESENT information yielded by attention). The RNN modeling the FUTUREcontents (called FUTURE layer) begins with holistic source summarization, and subtracts the PRESENT information at each step. The two processes are guided by proposed auxiliary objectives. Intuitively, the RNN state of the PASTlayer corresponds to source contents that have been translated at a particular step, and the RNN state of the FUTURE layer corresponds to source contents of untranslated words.

At each decoding step, PASTand FUTUREtogether provide a full summarization of the source information. We then feed the PASTand FUTUREinforma- tion to both the attention model and decoder states.

In this way, our proposed mechanism not only provides coverage information for the attention model, but also gives a holistic view of the source information at each time.

We conducted experiments on Chinese-English, German-English, and English-German benchmarks.

Experiments show that the proposed mechanism yields 2.7, 1.7, and 1.1 improvements of BLEU scores in three tasks, respectively. In addition, it ob- tains an alignment error rate of 35.90%, significantly lower than the baseline (39.73%) and the coverage model (38.73%) by Tu et al. (2016). We observe that in traditional attention-based NMT, most errors occur due to over- and under-translation, which is probably because the decoder RNN fails to keep track of what has been translated and what has not.

Our model can alleviate such problems by explicitly modeling PASTand FUTUREcontents.

2 Motivation

In this section, we first introduce the standard attention-based NMT, and then motivate our model by several empirical findings.

The attention mechanism, proposed in Bahdanau et al. (2015), yields a dynamic source context vector for the translation at a particular decoding step, modeling PRESENTinformation as described in Sec- tion 1. This process is illustrated in Figure 1.

Formally, let x = {x₁, . . . , x_I} be a given input sentence. The encoder RNN—generally imple- mented as a bi-directional RNN (Schuster and Pali- wal, 1997)—transforms the sentence to a sequence

c_t

hi

hI

h1

x_i x₁

Encoder

+

α_t,1 α_{t, i} α_{t, I} s s

y_t Decoder

Initialize with source summarization source vector for present translation

x_I

Figure 1: Architecture of attention-based NMT.

of annotations with h_i = −→ h_i;←−

h_i

being the an- notation ofx_i. (−→h_i and←−h_i refer to RNN’s hidden states in both directions.)

Based on the source annotations, another decoder RNN generates the translation by predicting a target wordy_tat each time stept:

P(y_t|y_<t,x) = softmax(g(y_t−1,s_t,c_t)), (1) where g(·) is a non-linear activation, and s_t is the decoding state for time stept, computed by

s_t=f(y_t₋₁,s_t₋₁,c_t). (2) Here f(·) is an RNN activation function, e.g., the Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). c_t is a vector summarizing relevant source information. It is computed as a weighted sum of the source annotations:

c_t= XI i=1

α_t,i·h_i, (3) where the weights (α_t,i fori = 1· · · , I) are given by the attention mechanism:

α_t,i = softmax a(s_t−1,h_i)

. (4)

Here, a(·) is a scoring function, measuring the de- gree to which the decoding state and source information match to each other.

(3)

SRC 与此同时,他呼吁提高民事服务效率,这也是鼓舞民心之举。

REF

in the meanwhile he calls for better efficiency in civil service ,which helps to promote people ’s trust . NMT at the same time , he called for a

higher efficiency in civil service efficiency.

(a) Translation example. We highlightunder-translated words in bold and italicizeover-translatedwords.

Initialize Decoder States with . . . BLEU Source Summarization 35.13

All-Zero Vector 35.01

(b) Source summarization is not fully exploited by NMT decoder.

Table 1: Evidence shows that attention-based NMT fails to make full use of source information, thus los- ing the holistic picture of source contents.

Intuitively, the attention-based decoder selects source annotations that are most relevant to the decoder state, based on which the current target word is predicted. In other words,c_tis some source information for the PRESENTtranslation.

The decoder RNN is initialized with the summarization of the entire source sentence−→

h_I;←− h1

, given by:

s₀ = tanh(W_s−→h_I;←h−₁

). (5)

After we analyze existing attention-based NMT in detail, our intuition arises as follows. Ideally, with the source summarization in mind, after generating each target wordy_tfrom the source contentsc_t, the decoder should keep track of (1) translated source contents by accumulating ct, and (2) untranslated source contents by subtracting c_t from the source summarization. However, such information is not well learned in practice, as there lacks explicit mech- anisms to maintain translated and untranslated contents. Evidence shows that attention-based NMT still suffers from serious over- and under-translation problems (Tu et al., 2016; Tu et al., 2017b). Exam- ples of under-translation are shown in Table 1a.

Another piece of evidence also shows that the decoder may lack a holistic view of the source infor-

mation, as explained below. We conduct a pilot experiment by removing the initialization of the RNN decoder. If the “holistic” context is well exploited by the decoder, translation performance would significantly decrease without the initialization. As shown in Table 1b, however, translation performance only decreases slightly after we remove the initialization.

This indicates NMT decoders do not make full use of source summarization, that the initialization only helps the prediction at the beginning of the sentence.

We attribute the vanishing of such signals to the overloaded use of decoder states (e.g., LM, PAST, and FUTURE functionalities), and hence we propose to explicitly model the holistic source summarization by PAST and FUTURE contents at each decoding step.

3 Related Work

Our research is built upon an attention-based sequence-to-sequence model (Bahdanau et al., 2015), but is also related to coverage modeling, future modeling, and functionality separation. We dis- cuss these topics in the following.

Coverage Modeling. Tu et al. (2016) and Mi et al.

(2016) maintain a coverage vector to indicate which source words have been translated and which source words have not. These vectors are updated by accumulating attention probabilities at each decoding step, which provides an opportunity for the attention model to distinguish translated source words from untranslated ones. Viewing coverage vectors as a (soft) indicator of translated source contents, following this idea, we take one step further. We model translated and untranslated source contents by directly manipulating the attention vector (i.e., the source contents that are being translated) instead of attention probability (i.e., the probability of a source word being translated).

In addition, we explicitly model both translated (with P^AST-RNN) and untranslated (with FUTURE- RNN) instead of using a single coverage vector to indicate translated source words. The difference with Tu et al. (2016) is that the PAST and FUTURE

contents in our model are fed not only to the attention mechanism but also the decoder’s states.

In the context of semantic-level coverage, Wang et al. (2016) propose a memory-enhanced decoder

(4)

s_t s₁

Past Layer s₀

c_t c₁

s^P_t

Attention (Present) Layer

Decoder Layer Future Layer s_t

s₁ ^F

s₀^F ^F

s₀^P

source summarization

s^P_t-1

Figure 2: NMT decoder augmented with PASTand FUTURElayers.

and Meng et al. (2016) propose a memory-enhanced attention model. Both implement the memory with a Neural Turing Machine (Graves et al., 2014), in which the reading and writing operations are expected to erase translated contents and highlight untranslated contents. However, their models lack an explicit objective to guide such intuition, which is one of the key ingredients for the success in this work. In addition, we use two separate layers to explicitly model translated and untranslated contents, which is another distinguishing feature of the proposed approach.

Future Modeling. Standard neural sequence decoders generate target sentences from left to right, thus failing to estimate some desired properties in the future (e.g., the length of target sentence). To address this problem, actor-critic algorithms are em- ployed to predict future properties (Li et al., 2017;

Bahdanau et al., 2017), in their models, an interpola- tion of the actor (the standard generation policy) and the critic (a value function that estimates the future values) is used for decision making. Concerning the future generation at each decoding step, Weng et al.

(2017) guide the decoder’s hidden states to not only generate the current target word, but also predict the target words that remain untranslated. Along the direction of future modeling, we introduce a FUTURE

layer to maintain the untranslated source contents, which is updated at each decoding step by subtracting the source content being translated (i.e., attention vector) from the last state (i.e., the untranslated source content so far).

Functionality Separation. Recent work has re- vealed that the overloaded use of representations makes model training difficult, and such problems can be alleviated by explicitly separating these functions (Reed and Freitas, 2015; Ba et al., 2016; Miller et al., 2016; Gulcehre et al., 2016; Rockt¨aschel et al., 2017). For example, Miller et al. (2016) separate the functionality of look-up keys and memory contents in memory networks (Sukhbaatar et al., 2015). Rockt¨aschel et al. (2017) propose a key- value-predictattention model, which outputs three vectors at each step: the first is used to predict the next-word distribution; the second serves as the key for decoding; and the third is used for the attention mechanism. In this work, we further separate PAST

and FUTUREfunctionalities from the decoder’s hidden representations.

4 Modeling PASTand FUTUREfor Neural Machine Translation

In this section, we describe how to separate PAST

and FUTURE functions from decoding states. We introduce two additional RNN layers (Figure 2):

• FUTURE Layer (Section 4.1) encodes source contents to be translated.

• PAST Layer (Section 4.2) encodes translated source contents.

Let us takey={y₁, y₂, y₃, y₄}as an example of the target sentence. The initial state of the FUTURE

layer is a summarization of the whole source sentence, indicating that all source contents need to be translated. The initial state of the PAST layer is an all-zero vector, indicating no source content is yet

(5)

Neural Network Layer

✕ Element-wise Multiplication

+ Element-wise

Addition

s_t-1 s_t

r_t ✕

+

u_t

!

1-

✕

!

tanh

✕

s_t

~

c_t

F F

F

(a) GRU

Projected Minus

—

+ Element-wise

Addition

s_t-1 s_t

r_t ✕

c_t

+

u_t

!

1-

✕

!

tanh

✕

s_t

~

—

tanh

F F

F

(b) GRU-o

Projected Minus

—

+ Element-wise

Addition

s_t-1 s_t

r_t

c_t

+

u_t

!

1-

✕

!

tanh

✕

s_t

~

✕ —

F F

F

(c) GRU-i

Figure 3: Variants of activation functions for the FUTURElayer.

translated.

Afterc₁ is obtained by the attention mechanism, we (1) update the FUTURE layer by “subtracting”

c₁ from the previous state, and (2) update the PAST

layer state by “adding”c₁to the previous state. The two RNN states are updated as described above at every step of generatingy₁, y₂, y₃, and y₄. In this way, at each time step, the FUTURE layer encodes source contents to be translated in the future steps, while the PASTlayer encodes translated source contents up to the current step.

The advantages of the PASTand the FUTURElay- ers are two-fold. First, they provide coverage information, which is fed to the attention model and guides NMT systems to pay more attention to untranslated source contents. Second, they provide a holistic view of the source information, since we would anticipate “PAST + FUTURE = HOLISTIC.”

We describe them in detail in the rest of this section.

4.1 Modeling FUTURE

Formally, the FUTURE layer is a recurrent neural network (the first gray layer in Figure 2) , and its state at time steptis computed by

s^F_t =F(s^F_t−1,c_t), (6) whereF is the activation function for the FUTURE

layer. We have several variants ofF, aiming to better model the expected subtraction, as described in Section 4.1.1. The FUTURERNN is initialized with the summarization of the whole source sentence, as computed by Equation 5.

When calculating attention context at time stept, we feed the attention model with the FUTUREstate

from the last time step, which encodes source contents to be translated. We rewrite Equation 4 as

α_t,i = softmax a(s_t−1,h_i,s^F_t₋₁)

. (7)

After obtaining attention context c_t, we update FUTURE states via Equation 6, and feed both of them to decoder states:

s_t=f(s_t₋₁, y_t₋₁,c_t,s^F_t ), (8) wherec_t encodes the source context of the present translation, ands^F_t encodes the source context on the future translation.

4.1.1 Activation Functions for Subtraction We design several variants of RNN activation functions to better model the subtractive operation (Figure 3):

GRU. A natural choice is the standard GRU¹, which learns subtraction directly from the data:

s^F_t = GRU(s^F_t₋₁,c_t) (9)

=u_t·s^F_t₋₁+ (1−u_t)·˜s^F_t ;

˜s^F_t = tanh(U(r_t·s^F_t₋₁) +Wc_t); (10) r_t=σ(U_rs^F_t₋₁+W_rc_t); (11) u_t=σ(U_us^F_t₋₁+W_uc_t), (12) wherer_tis a reset gate determining the combination of the input with the previous state, andu_tis an update gate defining how much of the previous state to keep around. The standard GRU uses a feed-forward

1Our work focuses on GRU, but can be applied to any RNN architectures such as LSTM.

(6)

neural network (Equation 10) to model the subtraction without any explicit operation, which may lead to the difficulty of the training.

In the following two variants, we provide GRU with explicit subtraction operations, which are in- spired by the well known phenomenon that minus operation can be applied to the semantics of word embeddings (Mikolov et al., 2013).² Therefore we subtract the semantics being translated from the untranslated FUTUREcontents at each decoding step.

GRU with Outside Minus (GRU-o). Instead of directly feedingc_tto GRU, we compute the current untranslated contents M(s^F_t₋₁,c_t) with an explicit minus operation, and then feed it to GRU:

s^F_t = GRU(s^F_t₋₁,M(s^F_t₋₁,c_t)); (13) M(s^F_t₋₁,c_t) = tanh(U_ms^F_t₋₁−W_mc_t). (14) GRU with Inside Minus (GRU-i). We can alter- natively integrate a minus operation into the calculation of˜s^F_t :

˜

s^F_t = tanh(Us^F_t−1−W(r_t·c_t)). (15) Compared with Equation 10, the differences between GRU-iand standard GRU are:

1. Minus operation is applied to produce the en- ergy of the intermediate candidate state˜s^F_t; 2. The reset gater_tis used to control the amount

of information flowing from inputs instead of from the previous states^F_t₋₁.

Note that for both GRU-o and GRU-i, we leave enough freedom for GRU to decide the extent of integrating with subtraction operations. In other words, the information subtraction is “soft.”

4.2 Modeling PAST

Formally, the PASTlayer is another recurrent neural network (the second gray layer in Figure 2), and its state at time steptis calculated by:

s^P_t = GRU(s^P_t−1,c_t). (16) Initially,s^P_t is an all-zero vector, which denotes no source content is yet translated. We chooseGRU as the activation function for the PAST layer, since

2E(“King”)−E(“Man”) = E(“Queen”)−E(“Woman”), whereE(·)is the embedding of a word.

the internal structure ofGRUis in accord with the

“addition” operation.

We feed the PASTstate from last time step to both attention model and decoder state:

α_t,i= softmax a(s_t−1,h_i,s^P_t₋₁)

; (17) st=f(st−1, yt−1,ct,s^P_t₋₁). (18) 4.3 Modeling PASTand FUTURE

We integrate PAST and FUTURElayers together in our final model (Figure 2):

αt,i= softmax a(st−1,hi,s^F_t₋₁,s^P_t₋₁)

; (19) s_t=f(s_t₋₁, y_t₋₁,c_t,s^F_t−1,s^P_t−1). (20) In this way, both the attention model and the decoder state are aware of what has, and what has not yet been translated.

4.4 Learning

We introduce additional loss functions to estimate the semantic subtraction and addition, which guide the training of the FUTURE layer and PAST layer, respectively.

Loss Function for Subtraction. As described above, the FUTURElayer models the future semantics in a declining way: ∆^F_t = s^F_t₋₁ −s^F_t ≈ c_t. Since source and target sides contain equivalent semantic information in machine translation (Tu et al., 2017a): c_t ≈ E(y_t), we directly measure the con- sistence between ∆^F_t and E(yt), which guides the subtraction to learn the right thing:

loss(∆^F_t ,E(y_t)) =−log ^exp ^l(∆^F^t^,E(y^t⁾⁾

P

yexp l(∆^F_t,E(y)); l(u,v) =u^>Wv+b.

In other words, we explicitly guide the FUTURE

layer by this subtractive loss, expecting ∆^F_t to be discriminative of the current wordy_t.

Loss Function for Addition. Likewise, we introduce another loss function to measure the information incrementation of the PAST layer. Notice that

∆^P_t =s^P_t −s^P_t₋₁ ≈c_t, is defined similarly to∆^F_t except a minus sign. In this way, we can reasonably assume the FUTUREand PASTlayers are indeed do- ing subtraction and addition, respectively.

(7)

Training Objective. We train the proposed model θˆon a set of training examples{[xⁿ,yⁿ]}^Nn=1, and the training objective is

θˆ= arg min

θ

XN n=1

|y|

X

t=1

−logP(y_t|y_<t,x;θ)

| {z } neg. log-likelihood +loss(∆^F_t,E(y_t)|θ)

| {z } FUTUREloss +loss(∆^P_t ,E(yt)|θ)

| {z } PASTloss

.

5 Experiments

Dataset. We conduct experiments on Chinese- English (Zh-En), German-English (De-En), and English-German (En-De) translation tasks.

For Zh-En, the training set consists of 1.6m sentence pairs, which are extracted from the LDC corpora³. The NIST 2003 (MT03) dataset is our development set; the NIST 2002 (MT02), 2004 (MT04), 2005 (MT05), 2006 (MT06) datasets are test sets.

We also evaluate the alignment performance on the standard benchmark of Liu and Sun (2015), which contains 900 manually aligned sentence pairs. We measure the alignment quality with the alignment error rate (Och and Ney, 2003).

For De-En and En-De, we conduct experiments on the WMT17 (Bojar et al., 2017) corpus. The dataset consists of 5.6M sentence pairs. We usenewstest2016 as our development set, and newstest2017 as our testset. We follow Sen- nrich et al. (2017a) to segment both German and English words into subwords using byte-pair encod- ing (Sennrich et al., 2016, BPE).

We measure the translation quality with BLEU scores (Papineni et al., 2002). We use the multi-bleu script for Zh-En⁴, and the multi-bleu-detok script for De-En and En-De⁵.

3The corpora includes LDC2002E18, LDC2003E07, LDC2003E14, Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06

4https://github.com/moses-smt/

mosesdecoder/blob/master/scripts/generic/

multi-bleu.perl

5https://github.com/EdinburghNLP/

nematus/blob/master/data/

multi-bleu-detok.perl

Training Details. We use the Nematus⁶ (Sen- nrich et al., 2017b), implementing a baseline translation system, RNNSEARCH. For Zh-En, we limit the vocabulary size to 30K. For De-En and En-De, the number of joint BPE operations is 90,000. We use the total BPE vocabulary for each side.

We tie the weights of the target-side embeddings and the output weight matrix (Press and Wolf, 2017) for De-En. All out-of-vocabulary words are mapped to a special tokenUNK.

We train each model with sentences lengths of up to 50 words in the training data. The dimension of word embeddings is 512, and all hidden sizes are 1024. In training, we set the batch size to 80 for Zh- En, and 64 for De-En and En-De. We set the beam size to 12 in testing. We shuffle the training corpus after each epoch.

We use Adam (Kingma and Ba, 2014) with an- nealing (Denkowski and Neubig, 2017) as our optimization algorithm. We set the initial learning rate as 0.0005, which halves when the validation cross- entropy does not decrease.

For the proposed model, we use the same set- ting as the baseline model. The FUTUREand PAST

layer sizes are 1024. We employ a two-pass strategy for training the proposed model, which has proven useful to ease training difficulty when the model is relatively complicated (Shen et al., 2016; Wang et al., 2017; Wang et al., 2018). Model parameters shared with the baseline are initialized by the baseline model.

5.1 Results on Chinese-English

We first evaluate the proposed model on the Chinese-English translation and alignment tasks.

5.1.1 Translation Quality

Table 2 shows the translation performances on Chinese-English. Clearly the proposed approach significantly improves the translation quality in all cases, although there are still considerable differences among different variants.

FUTURE Layer. (Rows 1-4). All the activation functions for the FUTURElayer obtain BLEU score improvements: GRU +0.52, GRU-o +1.03, and GRU-i +1.12. Specifically, GRU-o is better than

6https://github.com/EdinburghNLP/nematus

(8)

# Model Dev MT02 MT04 MT05 MT06 Avg. ∆

0 RNNSEARCH 35.90 36.84 37.16 34.17 31.56 35.13 -

1 + FRNN(GRU) 36.11 36,94 38.52 34.58 32.08 35.65 +0.52

2 + FRNN(GRU-o) 36.70 37.81 38.59 35.10 32.60 36.16 +1.03

3 + FRNN(GRU-i) 36.98 38.24 38.66 34.68 32.66 36.24 +1.12

4 + FRNN(GRU-i) + LOSS 37.15 38.80 39.13 35.79 33.75 36.92 +1.80

5 + PRNN 36.90 37.62 39.04 35.24 32.80 36.32 +1.19

6 + PRNN+ LOSS 36.95 39.06 39.55 35.05 33.80 36.88 +1.76

7 + FRNN(GRU-i) + PRNN 37.44 37.26 39.10 35.29 32.78 36.37 +1.25 8 + FRNN(GRU-i) + PRNN+ LOSS 37.90 39.65 40.37 36.75 34.55 37.84 +2.71 9 RNNSEARCH-2DEC 35.56 36.74 37.38 34.09 31.82 35.12 -0.01

10 RNNSEARCH-3DEC 36.07 37.64 37.62 34.14 32.73 35.64 +0.51

11 COVERAGE(Tu et al., 2016) 36.56 37.54 38.39 34.47 32.38 35.87 +0.74 Table 2: Case-insensitive BLEU on Chinese-English Translation. “LOSS” means applying loss functions for FUTURElayer (FRNN) and PASTlayer (PRNN).

a regular GRU for its minus operation, and GRU- iis the best, which shows that our elaborately de- signed architecture is more proper for modeling the decreasing phenomenon of the future semantics.

Adding subtractive loss gives an extra 0.68 BLEU score improvement, which indicates that adding g is beneficial guided objective for FRNNto learn the minus operation.

PAST Layer. (Rows 5-6). We observe the same trend on introducing the PAST layer: using it alone achieves a significant improvement (+1.19), and with the additional objective, it further improves the translation performance (+0.57).

Stacking the FUTURE and the PAST Together.

(Rows 7-8). The model’s final architecture outperforms our intermediate models (1-6) by combin- ing FRNN and PRNN. By further separating the functionaries of past content modeling and language modeling into different neural components, the final model is more flexible, obtaining a 0.91 BLEU improvement over the best intermediate model (Row 4) and an improvement of 2.71 BLEU points over the RNNSEARCHbaseline.

Comparison with Other Work. (Rows 9-11).

We also conduct experiments with multi-layer decoders (Wu et al., 2016) to see whether the NMT system can automatically model the translated and untranslated contents with additional decoder lay-

ers (Rows 9-10). However, we find that the performance is not improved using a two-layer decoder (Row 9), until a deeper version (three-layer decoder, Row 10) is used. This indicates that enhancing performance by simply adding more RNN layers into the decoder without any explicit instruction is non- trivial, which is consistent with the observation of Britz et al. (2017).

Our model also outperforms the word-level COV-

ERAGE(Tu et al., 2016), which considers the coverage information of the source words independently.

Our proposed model can be regarded as a high-level coverage model, which captures higher level coverage information, and gives more specific signals for the decision of attention and target prediction. Our model is more deeply involved in generating target words, by being fed not only to the attention model as in Tu et al. (2016), but also to the decoder state.

5.1.2 Subjective Evaluation

Following Tu et al. (2016), we conduct subjective evaluations to validate the benefit of modeling the PASTand the FUTURE(Table 3). Four human eval- uators are asked to evaluate the translations of 100 source sentences, which are randomly sampled from the testsets without knowing from which system the translation is selected. For the BASE system, 1.7%

of the source words are over-translated and 8.8%

are under-translated. Our proposed model alleviates these problems by explicitly modeling the dynamic

(9)

System Architecture De-En En-De Dev Test Dev Test Rikters et al. (2017) cGRU + BPE + dropout 31.9 27.2 27.4 21.0 + name entity forcing +synthetic data 36.9 29.0 30.9 22.7 Escolano et al. (2017) Char2Char + Rescoring with inverse model 32.1 - 27.0 -

+synthetic data - 28.1 - 21.2

Sennrich et al. (2017a) cGRU + BPE +synthetic data 38.0 32.0 32.2 26.1

this work BASE 32.0 27.8 28.3 23.3

COVERAGE 32.2 28.7 28.9 23.6

OURS 33.5 29.7 29.5 24.3

Table 5: Results of De-En and En-De “synthetic data” denotes additional 10M monolingual sentences, which is not used in our work.

Model Over-Trans Under-Trans

Ratio ∆ Ratio ∆

BASE 1.7% – 8.8% –

COVERAGE 1.5% -11.8% 7.7% -12.4%

OURS 1.6% -5.9% 5.7% -35.2%

Table 3: Subjective evaluation on over- and under- translation for Chinese-English. “Ratio” denotes the percentage of source words which are over- or under-translated, “∆” indicates relative improvement. “BASE” denotes RNNSEARCHand “OURS” denotes “+ FRNN(GRU-i) + PRNN+ LOSS”.

source contents by the PAST and the FUTURE layers, reducing 11.8% and 35.2% of over-translation and under-translation errors, respectively. The proposed model is especially effective for alleviating the under-translation problem, which is a more serious translation problem for NMT systems, and is mainly caused by lacking necessary coverage information (Tu et al., 2016).

5.1.3 Alignment Quality

Table 4 lists the alignment performances of our proposed model. We find that the COVERAGEmodel does improve attention model. But our model can produce much better alignments compared to the word level coverage (Tu et al., 2016). Our model distinguishes the PASTand FUTUREdirectly, which is a higher level coverage mechanism than the word coverage model.

Model AER ∆

BASE 39.73 –

COVERAGE 38.73 -1.00

OURS 35.90 -3.83

Table 4: Evaluation of the alignment quality. The lower the score, the better the alignment quality.

5.2 Results on German-English

We also evaluate our model on the WMT17 benchmarks for both De-En and En-De. As shown in Table 5, our baseline gives comparable BLEU scores to the state-of-the-art NMT systems of WMT17. Our proposed model improves the strong baseline on both De-En and En-De. This shows that our proposed model works well across different language pairs.

Rikters et al. (2017) and Sennrich et al. (2017a) obtain higher BLEU scores than our model, because they use additional large scale synthetic data (about 10M) for training. It maybe unfair to compare our model to theirs directly.

5.3 Analysis

We conduct analyses on Zh-En, to better understand our model from different perspectives.

Parameters and Speeds. As shown in Table 6, the baseline model (BASE) has 80M parameters. A single FUTURE or PAST layer introduces 15M to 17M parameters, and the corresponding objective introduces 18M parameters. In this work, the most complex model introduces 65M parameters, which

(10)

Model #Para. Speed Train Test

BASE 80M 42.59 2.05

+ FRNN(GRU) 96M 31.91 1.99

+ FRNN(GRU-o) 97M 30.88 1.93 + FRNN(GRU-i) 97M 31.06 1.95

+ LOSS 115M 29.40 1.68

+ PRNN 95M 32.01 1.98

+ LOSS 113M 29.60 1.69

+ FRNN+ PRNN 110M 26.01 1.88

+ LOSS 145M 22.94 1.52

RNNSEARCH-2DEC 93M 39.89 2.00 RNNSEARCH-3DEC 105M 35.57 1.82

COVERAGE 80M 40.48 1.90

Table 6: Statistics of parameters, training and testing speeds (sentences per second).

leads to a relatively slower training speed. How- ever, our proposed model does not significant slow down the decoding speed. The most time consum- ing part is the calculation of the subtraction and addition losses. As we show in the next paragraph, our system works well by only using the losses in training, which further improve the decoding speed of our model.

Effectiveness of Subtraction and Addition Loss.

Adding subtraction and addition loss functions helps twofold: (1) guiding the training of the proposed subtraction and addition operation; (2) enabling better reranking of generated candidates in testing. Ta- ble 7 lists the improvements from the two perspectives. When applied only in training, the two loss functions lead to an improvement of 0.48 BLEU points by better modeling subtraction and addition operations. On top of that, reranking with FUTURE

and PASTloss scores in testing further improves the performance by +0.99 BLEU points.

Initialization of the FUTURE Layer. The baseline model does not obtain abundant accuracy improvement by feeding the source summarization into the decoder (Table 1). We also experiment to not feed the source summarization into the decoder of the proposed model, which leads to a significant BLEU score drop on Zh-En. This shows that our proposed model better use the source summarization

Model L^OSSused in BLEU ∆ Train Test

BASE – – 35.13 –

OURS

× × 36.37 +1.25

X × 36.85 +1.72

X X 37.84 +2.71

Table 7: Contributions of loss functions from param- eter training (“Train”) and reranking of candidates in testing (“Test”).

Initialize FRNNwith . . . BLEU Source Summarization 36.24 All-Zero Vector 35.81

Table 8: Influence of initialization of FRNN

layer (GRU-i)

with explicitly modeling the FUTUREcompared to the conventional encoder-decoder baseline.

Case Study. We also compare the translation cases for the baseline, word level coverage and our proposed models. As shown in Table 9, our baseline system suffers from the over-translation problems (case 1), which is consistent with the results of human evaluation (Section 3). The BASE system also incorrectly translates “the royal family” into

“the people of hong kong”, which is totally irrele- vant here. We attribute the former case to the lack of untranslated future modeling, and the latter one to the overloaded use of the decoder state where the language modeling of the decoder leads to the flu- ent but wrong predictions. In contrast, the proposed approach almost addresses the errors in these cases.

6 Conclusion

Modeling source contents well is crucial for encoder-decoder based NMT systems. However, current NMT models suffer from distinguishing translated and untranslated translation contents, due to the lack of explicitly modeling past and future translations. In this paper, we separate PAST and FUTUREfunctionalities from decoder states, which can maintain a dynamic yet holistic view of the source content at each decoding step. Experimen- tal results show that the proposed approach signifi-

(11)

Source 布什还表示,应巴基斯坦和印度政府的邀请,他将于3月份对巴基斯坦和印度进行访问。

Reference bush also said thatat the invitation of the pakistani and indian governments, he would visit pakistan and india in march .

BASE bush also said that he would visit pakistan and india in march .

COVERAGE bush also said that at the invitation of pakistan and india , he will visit pakistan and india in march .

OURS bush also said that at the invitation of the pakistani and indian governments , he will visit pakistan and india in march .

Source 所以有不少人认为说,如果是这样的话,对皇室、对日本的社会也是

会有很大的影响的。

Reference therefore , many people say that it will have a great impact on the royal familyand japanese society .

BASE therefore , many people are of the view that if this is the case , it will also have a great impact onthe people of hong kongand the japanese society .

COVERAGE therefore , many people think that if this is the case , there will be great impact on the royal and japanese society .

OURS therefore , many people think that if this is the case , it will have a great impact on the royal and japanese society .

Table 9: Comparison on Translation Examples. We italicize some translation errors and highlight the correct onesin bold.

cantly improves translation performances across different language pairs. With better modeling of past and future translations, our approach performs much better than the standard attention-based NMT, reducing the errors of under and over translations.

7 Acknowledgement

We would like to thank the anonymous reviewers as well as the Action Editor, Philipp Koehn, for in- sightful comments and suggestions. Shujian Huang is the corresponding author. This work is sup- ported by the National Science Foundation of China (No. 61672277, 61772261), the Jiangsu Provin- cial Research Foundation for Basic Research (No.

BK20170074).

References

Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z Leibo, and Catalin Ionescu. 2016. Using fast weights to attend to the recent past. InNIPS 2016.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben-

gio. 2015. Neural machine translation by jointly learning to align and translate. InICLR 2015.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. InICLR 2017.

Ondˇrej Bojar, Christian Buck, Rajen Chatterjee, Chris- tian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, and Julia Kreutzer. 2017. Proceedings of the second conference on machine translation. InProceedings of the Second Conference on Machine Translation. Asso- ciation for Computational Linguistics.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. InEMNLP 2017.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. InEMNLP 2014.

Michael Denkowski and Graham Neubig. 2017.

Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Work- shop on Neural Machine Translation.

(12)

Carlos Escolano, Marta R. Costa-juss`a, and Jos´e A. R.

Fonollosa. 2017. The TALP-UPC neural machine translation system for German/Finnish-English using the inverse direction model in rescoring. InProceed- ings of the Second Conference on Machine Transla- tion, Volume 2: Shared Task Papers.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014.

Neural turing machines. arXiv:1410.5401.

Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2016. Dynamic neural turing machine with soft and hard addressing schemes.

arXiv:1607.00036.

Sepp Hochreiter and J¨urgen Schmidhuber. 1997. Long short-term memory.Neural computation.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. InEMNLP 2013.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.ICLR 2014.

Jiwei Li, Will Monroe, and Daniel Jurafsky.

2017. Learning to decode for future success.

arXiv:1701.06549.

Yang Liu and Maosong Sun. 2015. Contrastive unsu- pervised word alignment with non-local features. In AAAI 2015.

Thang Luong, Hieu Pham, and Christopher D. Manning.

2015. Effective approaches to attention-based neural machine translation. InEMNLP 2015.

Fandong Meng, Zhengdong Lu, Hang Li, and Qun Liu.

2016. Interactive attention for neural machine translation. InCOLING 2016.

Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage embedding models for neural machine translation. EMNLP 2016.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space.ICLR 2013.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir- Hossein Karimi, Antoine Bordes, and Jason Weston.

2016. Key-value memory networks for directly reading documents. InEMNLP 2016.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models.

Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. InACL 2002.

Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. InEACL 2017.

Scott Reed and Nando De Freitas. 2015. Neural programmer-interpreters. Computer Science.

Mat¯ıss Rikters, Chantal Amrhein, Maksym Del, and Mark Fishel. 2017. C-3MA: Tartu-Riga-Zurich translation systems for WMT17.

Tim Rockt¨aschel, Johannes Welbl, and Sebastian Riedel.

2017. Frustratingly short attention spans in neural language modeling. InICLR 2017.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirec- tional recurrent neural networks. IEEE Transactions on Signal Processing.

Rico Sennrich, Barry Haddow, and Alexandra Birch.

2016. Neural machine translation of rare words with subword units.Computer Science.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, An- tonio Valerio Miceli Barone, and Philip Williams.

2017a. The University of Edinburgh’s Neural MT Sys- tems for WMT17. InProceedings of the Second Con- ference on Machine Translation, Volume 2: Shared Task Papers in ACL.

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexan- dra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel L¨aubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde.

2017b. Nematus: a toolkit for neural machine translation. InEACL 2017.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. InACL 2016.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS 2015.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.

Sequence to sequence learning with neural networks.

InNIPS 2014.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. InACL 2016.

Zhaopeng Tu, Yang Liu, Zhengdong Lu, Xiaohua Liu, and Hang Li. 2017a. Context gates for neural machine translation. Transactions of the Association for Computational Linguistics.

Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017b. Neural machine translation with reconstruction. InAAAI 2017.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu.

2016. Memory-enhanced decoder for neural machine translation. InEMNLP 2016.

Xing Wang, Zhengdong Lu, Zhaopeng Tu, Hang Li, Deyi Xiong, and Min Zhang. 2017. Neural machine translation advised by statistical machine translation. In AAAI 2017.

Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, and Qun Liu. 2018. Trans- lating pro-drop languages with reconstruction models.

InAAAI 2018.

(13)

Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin- Yu Dai, and Jiajun Chen. 2017. Neural machine translation with word predictions. InEMNLP 2017.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.

Le, Mohammad Norouzi, Wolfgang Macherey, Maxim

Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al.

2016. Google’s neural machine translation system:

Bridging the gap between human and machine translation.arXiv:1609.08144.

(14)