Apprentissage discriminant des mod` eles continus de traduction
Quoc-Khanh Do, Alexandre Allauzen, Fran¸cois Yvon
Univ. Paris-Sud / LIMSI-CNRS
Q.K. Do (LIMSI-CNRS) SMT 1 / 23
Contenu
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 2 / 23
Outline
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 3 / 23
Mod` eles neuronaux
R
R R
Wih Who
0 1 00 0 0 0 0 0 0 0 0 0 0 0 0 0 1
00 0 0 1 0 0 0 0
RAP Analyse
Syntaxique
Analyse S´emantique
Alignement de mots
Traduction Automatique r´eseau neurone
Q.K. Do (LIMSI-CNRS) SMT 4 / 23
Mod` eles neuronaux
R
R R
Wih Who
0 1 00 0 0 0 0 0 0 0 0 0 0 0 0 0 1
00 0 0 1 0 0 0 0
RAP Analyse
Syntaxique
Analyse S´emantique
Alignement de mots
Traduction Automatique r´eseau neurone
−0,9%
WER
90,44%
pr´ecision F1
71,3 corr´elation
+0,07 F1
+1,6 BLEU
Q.K. Do (LIMSI-CNRS) SMT 4 / 23
Mod` eles neuronaux en Traduction Automatique
Syst`eme de traduction:
assemblage de mod`eles
⇒
s : everybody talks about happiness these days .
tout le monde parle de bonheur aujourd’hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .
. . .
`
a tout le monde parle du bonheur aujourd’hui . que tout le monde parle de bonheur ces derniers .
tout le monde parle le bonheur ce .
Q.K. Do (LIMSI-CNRS) SMT 5 / 23
Mod` eles neuronaux en Traduction Automatique
Syst`eme de traduction:
assemblage de mod`eles
⇒
s : everybody talks about happiness these days .
tout le monde parle de bonheur aujourd’hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .
. . .
`
a tout le monde parle du bonheur aujourd’hui . que tout le monde parle de bonheur ces derniers .
tout le monde parle le bonheur ce .
Mod` ele de traduction continu 5h / 2000 phrases
Maximum de vraisemblance ↔ BLEU
Q.K. Do (LIMSI-CNRS) SMT 5 / 23
Outline
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 6 / 23
Approche n-gramme en TAS
s
̅8:à t
̅8:to
s
̅9:recevoir t
̅9:receive
s
̅10:le t
̅10:the
s
̅11:nobel de la paix t
̅11:nobel peace
s
̅12:prix t
̅12:prize
u
8u
9u
10u
11u
12s : ....
t : ....
à recevoir le prix nobel de la paix
org : ....
....
....
Traduction en deux ´etapes (Crego and Mari˜ no2006)
1 2
⇒ Mod`ele n-gramme sur les bisegments:
P (s, t) = Y
Li=1
P(u
i| u
i−1, ..., u
i−n+1)
See http://ncode.limsi.fr/
Q.K. Do (LIMSI-CNRS) SMT 7 / 23
Approche n-gramme en TAS
s
̅8:à t
̅8:to
s
̅9:recevoir t
̅9:receive
s
̅10:le t
̅10:the
s
̅11:nobel de la paix t
̅11:nobel peace
s
̅12:prix t
̅12:prize
u
8u
9u
10u
11u
12s : ....
t : ....
à recevoir le prix nobel de la paix
org : ....
....
....
1
Traduction en deux ´etapes (Crego and Mari˜ no2006)
1
R´eordonnance de la source
2
⇒ Mod`ele n-gramme sur les bisegments:
P (s, t) = Y
Li=1
P(u
i| u
i−1, ..., u
i−n+1)
See http://ncode.limsi.fr/
Q.K. Do (LIMSI-CNRS) SMT 7 / 23
Approche n-gramme en TAS
s
̅8:à t
̅8:to
s
̅9:recevoir t
̅9:receive
s
̅10:le t
̅10:the
s
̅11:nobel de la paix t
̅11:nobel peace
s
̅12:prix t
̅12:prize
u
8u
9u
10u
11u
12s : ....
t : ....
à recevoir le prix nobel de la paix
org : ....
....
....
1
2
Traduction en deux ´etapes (Crego and Mari˜ no2006)
1
R´eordonnance de la source
2
Traduction monotone par segment
⇒ Mod`ele n-gramme sur les bisegments:
P (s, t) = Y
Li=1
P(u
i| u
i−1, ..., u
i−n+1)
See http://ncode.limsi.fr/
Q.K. Do (LIMSI-CNRS) SMT 7 / 23
Mod` ele n-gramme factoris´ e
s̅9: recevoir t̅9: receive
s̅10: le t̅10: the s̅11: nobel de la paix
t̅11: nobel peace
P ( | )= P (
t̅11: nobel peace|
s̅11: nobel de la paix s̅9: recevoir s̅10: le t̅9: receive t̅10: the) P (
s̅11: nobel de la paix|
s̅9: recevoir s̅10: le t̅9: receive t̅10: the)
= P ( nobel | recevoir, le , receive, the )
× P( de | le, nobel , receive, the )
× P( la | nobel, de , receive, the )
× P( paix | de, la , receive, the )
× P( nobel | la, paix , receive, the )
× P( peace | la, paix , the, nobel )
Q.K. Do (LIMSI-CNRS) SMT 8 / 23
Mod` ele n-gramme factoris´ e
s̅9: recevoir t̅9: receive
s̅10: le t̅10: the s̅11: nobel de la paix
t̅11: nobel peace
P ( | )= P (
t̅11: nobel peace|
s̅11: nobel de la paix s̅9: recevoir s̅10: le t̅9: receive t̅10: the) P (
s̅11: nobel de la paix|
s̅9: recevoir s̅10: le t̅9: receive t̅10: the)
= P ( nobel | recevoir, le , receive, the )
× P( de | le, nobel , receive, the )
× P( la | nobel, de , receive, the )
× P( paix | de, la , receive, the )
× P( nobel | la, paix , receive, the )
× P( peace | la, paix , the, nobel )
paix de
la
receive the
Q.K. Do (LIMSI-CNRS) SMT 8 / 23
Mod` ele n-gramme factoris´ e
s̅9: recevoir t̅9: receive
s̅10: le t̅10: the s̅11: nobel de la paix
t̅11: nobel peace
P ( | )= P (
t̅11: nobel peace|
s̅11: nobel de la paix s̅9: recevoir s̅10: le t̅9: receive t̅10: the) P (
s̅11: nobel de la paix|
s̅9: recevoir s̅10: le t̅9: receive t̅10: the)
= P ( nobel | recevoir, le , receive, the )
× P( de | le, nobel , receive, the )
× P( la | nobel, de , receive, the )
× P( paix | de, la , receive, the )
× P( nobel | la, paix , receive, the )
× P( peace | la, paix , the, nobel )
peace la
paix
the nobel
Q.K. Do (LIMSI-CNRS) SMT 8 / 23
Structure neuronale
Couche cachée 0
10 0 00
0 0 01 0 0
0 0 0 00 1
10 0 0 00
la
paix
the
3
nobel
peace
Couche de sortie Rs
Rt Rs
Rt
Activation : a
θ(w, c) b
θ(w, c) = exp(a
θ(w, c))
Q.K. Do (LIMSI-CNRS) SMT 9 / 23
Structure neuronale
Couche cachée 0
10 0 00
0 0 01 0 0
0 0 0 00 1
10 0 0 00
la
paix
the
3
nobel
peace
Couche de sortie Rs
Rt Rs
Rt
bθ(w,c) Pw0 ∈Vbθ(w0,c)
Activation : a
θ(w, c) b
θ(w, c) = exp(a
θ(w, c))
Q.K. Do (LIMSI-CNRS) SMT 9 / 23
Structure neuronale
Couche cachée 0
10 0 00
0 0 01 0 0
0 0 0 00 1
10 0 0 00
la
paix
the
3
nobel
peace
Couche de sortie Rs
Rt Rs
Rt
b
θ(w, c)
Activation : a
θ(w, c) b
θ(w, c) = exp(a
θ(w, c))
Q.K. Do (LIMSI-CNRS) SMT 9 / 23
Outline
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 10 / 23
Maximum de vraisemblance
L
cll(θ, S
n) = P
(w,c)∈Sn
− log p
θ(w | c) + R (θ) o` u p
θ(w | c) =
Pbθ(w,c)w0 ∈V
bθ(w0,c)
Q.K. Do (LIMSI-CNRS) SMT 11 / 23
Maximum de vraisemblance
L
cll(θ, S
n) = P
(w,c)∈Sn
− log p
θ(w | c) + R (θ) o` u p
θ(w | c) =
Pbθ(w,c)w0 ∈V
bθ(w0,c)
En pratique : SOUL (Le et al.2011)
R
R R
Wih shared projection space
0 1 0 0 0 0 0 0 0
wi-1
0 0 0 0 0 0 0 1
wi-2 0
0 0 0 0 1 0 0 0 0
wi-3
}
short list The associated clustering treeSub-class layers
Word layers
}
top classesQ.K. Do (LIMSI-CNRS) SMT 11 / 23
Estimation contrastive bruit´ ee (NCE)
X
w0= 1
X
w0= 0
Q.K. Do (LIMSI-CNRS) SMT 12 / 23
Estimation contrastive bruit´ ee (NCE)
p( X
w0= 1 | w
0, c)
=
p pθ(w0|c)θ(w0|c)+KpN(w0)
p( X
w0= 0 | w
0, c)
=
p KpN(w0)θ(w0|c)+KpN(w0)
Maximum de vraisemblance pour la classification de { w, w
∗1, ..., w
∗K} → fonction objectif pour NCE
Q.K. Do (LIMSI-CNRS) SMT 12 / 23
R´ e´ evaluation des hypoth` eses
s : everybody talks about happiness these days .
tout le monde parle de bonheur aujourd'hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .
. . .
à tout le monde parle du bonheur aujourd'hui . que tout le monde parle de bonheur ces derniers jours .
tout le monde parle le bonheur ce . "
n-best hypotheses SMT system
h1 h2 h3 . . . . . hn
Q.K. Do (LIMSI-CNRS) SMT 13 / 23
R´ e´ evaluation des hypoth` eses
s : everybody talks about happiness these days .
tout le monde parle de bonheur aujourd'hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .
. . .
à tout le monde parle du bonheur aujourd'hui . que tout le monde parle de bonheur ces derniers jours .
tout le monde parle le bonheur ce . "
n-best hypotheses purpose2. In a domain adaptation context, we assume that
an existing CSTM –trained on the out-of-domain data– al- ready exists. This model is thus well suited to bootstrap the adaptation process.
3. Objective functions for adaptation In most previous works (eg. [8, 9]), CSTMs are estimated by maximizing the regularized conditional log-likelihood (CLL) on parallel training corpora. This estimation procedure is used to train a baseline CSTM on the out-of-domain cor- pus, producing a baseline model that will serve as an initial point for domain adaptation. Given a small in-domain par- allel corpus, the same training procedure can also be used.
A straightforward adaptation algorithm consists in running a few epochs of the standard back-propagation algorithm on the in-domain data to maximize the conditional likelihood using, as initial parameters, the out-of-domain model.
There is however only a loose relationship between the CLL criterion and the final translation quality. The CSTM is usually integrated in the translation process through a rerank- ing step, the goal of which is to reorder a reduced set of candidate translations, calledN-best list. Therefore, to bet- ter take advantage of the small amount of in-domain data, we propose to explore alternative objective functions that are more directly related to the translation quality (as reflected by the BLEU score) after reranking. We first present the general learning algorithm, then the various objective functions.
3.1. RescoringN-best lists with CSTMs
Due to the high computational cost of normalizing the output layer, continuous models are in most cases3introduced in a post-processing step calledN-best reranking.
We thus assume that for each source sentences, the de- coder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is as- sociated with the decoder scoreF(s,h)computed as:
F(s,h) = XK k=1
kfk(s,h), (3) whereKfeature functions (fk) are weighted by a set of co- efficients (k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).
When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f✓(s,h). As explained in Section 2.2,f✓(s,h)typically
2The following parameters can be initialized given a source and target language monolingual models: the source and target word embeddings re- spectively, and the structured output layer’s structure.
3See however [31, 32, 19] for early attempts to integrate Neural Network Translation Models within the decoder.
Algorithm 1Joint optimization procedure for✓and 1:Initialize✓and
2:foreach iterationdo
3: forMmini-batchesdo . is fixed
4: Compute the sub-gradient ofL(✓,s)for allsin the mini-batch
5: Update✓
6: end for
7: Update using dev set .✓is fixed
8:end for
corresponds to the negated log-probability of the derivation:
f✓(s,h) = logP✓(s,h), where✓is the vector containing the CSTM’s free parameters. The scoring function used in reranking is then:
G,✓(s,h) =F(s,h) + K+1f✓(s,h) (4) This scoring function depends on the CSTM’s parame- ters✓, as well as on the coefficients of the scoring func- tion. In the approach proposed here, optimizing the rerank- ing step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.
The corresponding proposed optimization procedure splits the in-domain set in mini-batches of a fixed size (typ- ically 128 subsequent sentence pairs). As sketched in Algo- rithm 1, each mini-batch is used to update the parameters✓ of the CSTM while keeping fixed. The vector is updated everyMmini-batches.
In our study, tuning is performed using standard tools (here, the K-Best Mira algorithm described in [21] as imple- mented in MOSES4). The training of CSTMs (with fixed ) is more interesting and we compare two discriminative ob- jective functions, which aim at better taking the translation quality into account. These two objectives are in turn com- pared to the conventional maximization of the conditional likelihood criterion on parallel data.
3.2. A max-margin approach
As explained above, each hypothesishiproduced by the decoder is scored according to (4). Its quality can also be evaluated by the sentence-level approximation of the BLEU scoresBLEU(hi). Leth⇤denote the hypothesis with the best sentence BLEU score. A max-margin loss func- tion [33, 34, 20] for estimating✓can then be formulated as follows:
Lmm(✓,s) = G,✓(s,h⇤) + max
1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(hj) =↵ sBLEU(h⇤) sBLEU(hj). The parameter↵mitigates the contribution of the cost function
4http://www.statmt.org/moses/
Ranked
SMT system
h1 h2 h3 . . . . . hn
Q.K. Do (LIMSI-CNRS) SMT 13 / 23
M´ethodes d’apprentissage
R´ e´ evaluation des hypoth` eses
s : everybody talks about happiness these days .
tout le monde parle de bonheur aujourd'hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .
. . .
à tout le monde parle du bonheur aujourd'hui . que tout le monde parle de bonheur ces derniers jours .
tout le monde parle le bonheur ce . "
n-best hypotheses purpose2. In a domain adaptation context, we assume that
an existing CSTM –trained on the out-of-domain data– al- ready exists. This model is thus well suited to bootstrap the adaptation process.
3. Objective functions for adaptation In most previous works (eg. [8, 9]), CSTMs are estimated by maximizing the regularized conditional log-likelihood (CLL) on parallel training corpora. This estimation procedure is used to train a baseline CSTM on the out-of-domain cor- pus, producing a baseline model that will serve as an initial point for domain adaptation. Given a small in-domain par- allel corpus, the same training procedure can also be used.
A straightforward adaptation algorithm consists in running a few epochs of the standard back-propagation algorithm on the in-domain data to maximize the conditional likelihood using, as initial parameters, the out-of-domain model.
There is however only a loose relationship between the CLL criterion and the final translation quality. The CSTM is usually integrated in the translation process through a rerank- ing step, the goal of which is to reorder a reduced set of candidate translations, calledN-best list. Therefore, to bet- ter take advantage of the small amount of in-domain data, we propose to explore alternative objective functions that are more directly related to the translation quality (as reflected by the BLEU score) after reranking. We first present the general learning algorithm, then the various objective functions.
3.1. RescoringN-best lists with CSTMs
Due to the high computational cost of normalizing the output layer, continuous models are in most cases3introduced in a post-processing step calledN-best reranking.
We thus assume that for each source sentences, the de- coder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is as- sociated with the decoder scoreF(s,h)computed as:
F(s,h) = XK k=1
kfk(s,h), (3) whereKfeature functions (fk) are weighted by a set of co- efficients ( k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).
When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f✓(s,h). As explained in Section 2.2,f✓(s,h)typically
2The following parameters can be initialized given a source and target language monolingual models: the source and target word embeddings re- spectively, and the structured output layer’s structure.
3See however [31, 32, 19] for early attempts to integrate Neural Network Translation Models within the decoder.
Algorithm 1Joint optimization procedure for✓and 1:Initialize✓and
2:foreach iterationdo
3: forMmini-batchesdo . is fixed
4: Compute the sub-gradient ofL(✓,s)for allsin the mini-batch
5: Update✓
6: end for
7: Update using dev set .✓is fixed 8:end for
corresponds to the negated log-probability of the derivation:
f✓(s,h) = logP✓(s,h), where✓is the vector containing the CSTM’s free parameters. The scoring function used in reranking is then:
G,✓(s,h) =F(s,h) + K+1f✓(s,h) (4) This scoring function depends on the CSTM’s parame- ters✓, as well as on the coefficients of the scoring func- tion. In the approach proposed here, optimizing the rerank- ing step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.
The corresponding proposed optimization procedure splits the in-domain set in mini-batches of a fixed size (typ- ically 128 subsequent sentence pairs). As sketched in Algo- rithm 1, each mini-batch is used to update the parameters✓ of the CSTM while keeping fixed. The vector is updated everyMmini-batches.
In our study, tuning is performed using standard tools (here, the K-Best Mira algorithm described in [21] as imple- mented in MOSES4). The training of CSTMs (with fixed ) is more interesting and we compare two discriminative ob- jective functions, which aim at better taking the translation quality into account. These two objectives are in turn com- pared to the conventional maximization of the conditional likelihood criterion on parallel data.
3.2. A max-margin approach
As explained above, each hypothesishiproduced by the decoder is scored according to (4). Its quality can also be evaluated by the sentence-level approximation of the BLEU scoresBLEU(hi). Leth⇤denote the hypothesis with the best sentence BLEU score. A max-margin loss func- tion [33, 34, 20] for estimating✓can then be formulated as follows:
Lmm(✓,s) = G,✓(s,h⇤) + max
1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(hj) =↵ sBLEU(h⇤) sBLEU(hj). The parameter↵mitigates the contribution of the cost function
4http://www.statmt.org/moses/
Conventional SMT scores
Ranked
an existing CSTM –trained on the out-of-domain data– al- ready exists. This model is thus well suited to bootstrap the adaptation process.
3. Objective functions for adaptation In most previous works (eg. [8, 9]), CSTMs are estimated by maximizing the regularized conditional log-likelihood (CLL) on parallel training corpora. This estimation procedure is used to train a baseline CSTM on the out-of-domain cor- pus, producing a baseline model that will serve as an initial point for domain adaptation. Given a small in-domain par- allel corpus, the same training procedure can also be used.
A straightforward adaptation algorithm consists in running a few epochs of the standard back-propagation algorithm on the in-domain data to maximize the conditional likelihood using, as initial parameters, the out-of-domain model.
There is however only a loose relationship between the CLL criterion and the final translation quality. The CSTM is usually integrated in the translation process through a rerank- ing step, the goal of which is to reorder a reduced set of candidate translations, calledN-best list. Therefore, to bet- ter take advantage of the small amount of in-domain data, we propose to explore alternative objective functions that are more directly related to the translation quality (as reflected by the BLEU score) after reranking. We first present the general learning algorithm, then the various objective functions.
3.1. RescoringN-best lists with CSTMs
Due to the high computational cost of normalizing the output layer, continuous models are in most cases3introduced in a post-processing step calledN-best reranking.
We thus assume that for each source sentences, the de- coder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is as- sociated with the decoder scoreF (s,h)computed as:
F (s,h) = XK k=1
kfk(s,h), (3) whereKfeature functions (fk) are weighted by a set of co- efficients (k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).
When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f✓(s,h). As explained in Section 2.2,f✓(s,h)typically
2The following parameters can be initialized given a source and target language monolingual models: the source and target word embeddings re- spectively, and the structured output layer’s structure.
3See however [31, 32, 19] for early attempts to integrate Neural Network Translation Models within the decoder.
1:Initialize✓and 2:foreach iterationdo
3: forMmini-batchesdo . is fixed
4: Compute the sub-gradient ofL(✓,s)for allsin the mini-batch
5: Update✓
6: end for
7: Update using dev set .✓is fixed
8:end for
corresponds to the negated log-probability of the derivation:
f✓(s,h) = logP✓(s,h), where✓is the vector containing the CSTM’s free parameters. The scoring function used in reranking is then:
G,✓(s,h) =F (s,h) + K+1f✓(s,h) (4) This scoring function depends on the CSTM’s parame- ters✓, as well as on the coefficients of the scoring func- tion. In the approach proposed here, optimizing the rerank- ing step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.
The corresponding proposed optimization procedure splits the in-domain set in mini-batches of a fixed size (typ- ically 128 subsequent sentence pairs). As sketched in Algo- rithm 1, each mini-batch is used to update the parameters✓ of the CSTM while keeping fixed. The vector is updated everyMmini-batches.
In our study, tuning is performed using standard tools (here, the K-Best Mira algorithm described in [21] as imple- mented in MOSES4). The training of CSTMs (with fixed ) is more interesting and we compare two discriminative ob- jective functions, which aim at better taking the translation quality into account. These two objectives are in turn com- pared to the conventional maximization of the conditional likelihood criterion on parallel data.
3.2. A max-margin approach
As explained above, each hypothesishiproduced by the decoder is scored according to (4). Its quality can also be evaluated by the sentence-level approximation of the BLEU scoresBLEU(hi). Leth⇤denote the hypothesis with the best sentence BLEU score. A max-margin loss func- tion [33, 34, 20] for estimating✓can then be formulated as follows:
Lmm(✓,s) = G ,✓(s,h⇤) + max
1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(hj) =↵ sBLEU(h⇤) sBLEU(hj). The parameter↵mitigates the contribution of the cost function
4http://www.statmt.org/moses/
SMT system Continuous space translation model
h1 h2 h3 . . . . . hn
Q.K. Do (LIMSI-CNRS) SMT 13 / 23
Algorithme d’optimisation
1:
Init. de θ et λ
2:
for chaque it´eration do
3:
for M paquets do
4:
Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet
5:
Mise ` a jour de θ . λ fix´e
6:
end for
7:
Mise ` a jour de λ en utilisant le dev. . θ fix´e
8:
end for
Q.K. Do (LIMSI-CNRS) SMT 14 / 23
Algorithme d’optimisation
1:
Init. de θ et λ
2:
for chaque it´eration do
3:
for M paquets do
4:
Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet
5:
Mise `a jour de θ . λ fix´e
6:
end for
7:
Mise ` a jour de λ en utilisant le dev. . θ fix´e
8:
end for
Q.K. Do (LIMSI-CNRS) SMT 14 / 23
Algorithme d’optimisation
1:
Init. de θ et λ
2:
for chaque it´eration do
3:
for M paquets do
4:
Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet
5:
Mise ` a jour de θ . λ fix´e
6:
end for
7:
Mise ` a jour de λ en utilisant le dev. . θ fix´e
8:
end for
Q.K. Do (LIMSI-CNRS) SMT 14 / 23
Algorithme d’optimisation
1:
Init. de θ et λ
2:
for chaque it´eration do
3:
for M paquets do
4:
Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet
5:
Mise `a jour de θ . λ fix´e
6:
end for
7:
Mise `a jour de λ en utilisant le dev. . θ fix´e
8:
end for
KBMIRA (Cherry and Foster2012)
Q.K. Do (LIMSI-CNRS) SMT 14 / 23
Algorithme d’optimisation
1:
Init. de θ et λ
2:
for chaque it´eration do
3:
for M paquets do
4:
Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet
5:
Mise ` a jour de θ . λ fix´e
6:
end for
7:
Mise ` a jour de λ en utilisant le dev. . θ fix´e
8:
end for
KBMIRA (Cherry and Foster2012) L (θ, s) ???
Q.K. Do (LIMSI-CNRS) SMT 14 / 23
Fonction objectif
Fonction de coˆ ut:
cost
α(h) = α sBLEU(h
∗) − sBLEU (h)), avec h
∗= argmax
h
sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU
Max-margin
n-best hypotheses sBLEU
Q.K. Do (LIMSI-CNRS) SMT 15 / 23
Fonction objectif
Fonction de coˆ ut:
cost
α(h) = α sBLEU(h
∗) − sBLEU (h)), avec h
∗= argmax
h
sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU
Max-margin
n-best hypotheses h*
sBLEU
G
,✓(s, h
⇤) + cost
↵(h
⇤)
0
Q.K. Do (LIMSI-CNRS) SMT 15 / 23
Fonction objectif
Fonction de coˆ ut:
cost
α(h) = α sBLEU(h
∗) − sBLEU (h)), avec h
∗= argmax
h
sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU
Max-margin
n-best hypotheses h*
argmax
h⇥
G
,✓(s, h) + cost
↵(h) ⇤
sBLEU
G
,✓(s, h
⇤) + cost
↵(h
⇤)
0
Q.K. Do (LIMSI-CNRS) SMT 15 / 23
Fonction objectif
Fonction de coˆ ut:
cost
α(h) = α sBLEU(h
∗) − sBLEU (h)), avec h
∗= argmax
h
sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU
Max-margin
n-best hypotheses h*
argmax
h⇥
G
,✓(s, h) + cost
↵(h) ⇤
sBLEU
G
,✓(s, h
⇤) + cost
↵(h
⇤)
0
En pratique: consid´erer plusieurs paires d’hypoth`eses!!!
Q.K. Do (LIMSI-CNRS) SMT 15 / 23
Outline
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 16 / 23
Adaptation du syst` eme avec mod` ele continu
Configuration exp´erimentale
train N−best dev N−best
+ modele log−lineaire Modele NN
entrainement
entrainement discriminant
dev train
Du domaine
(CLL, NCE)
train dev
Hors domaine
Systeme baseline
(Hors domaine)
discriminant entrainement
λ θ
Q.K. Do (LIMSI-CNRS) SMT 17 / 23
Adaptation du syst` eme avec mod` ele continu
Configuration exp´erimentale
train N−best dev N−best
+ modele log−lineaire Modele NN
entrainement
entrainement discriminant
dev train
Du domaine
(CLL, NCE)
train dev
Hors domaine
Systeme baseline
(Hors domaine)
discriminant entrainement
λ θ
WMT en-fr
TEDTalks
WMT’13
300-best
KBMIRA
Q.K. Do (LIMSI-CNRS) SMT 17 / 23
R´ e´ evaluation des hypoth` eses
NCE vs. SOUL
dev test
Syst`eme de base 33, 9 27, 6 Ajout d’un mod`ele continu
+ SOUL 35, 1 (+1, 2) 28, 9 (+1, 3) + NCE 35, 0 (+1, 1) 28, 8 (+1, 2)
Vitesse d’entraˆınement Vitesse d’inf´erence
SOUL 1000 mots/s 20000 mots/s
NCE 1000 mots/s 25000 mots/s
Q.K. Do (LIMSI-CNRS) SMT 18 / 23
Entraˆınement discriminant
Mod`ele NCE
Utilisant les scores non-normalis´es du mod`ele NCE:
dev test
Syst`eme de traduction 33, 9 27, 6 Ajout d’un mod`ele continu
+ NCE 35, 0 28, 8
Ajout d’un mod`ele continu discriminant
+ Discriminant 35, 3 (+1, 4) 29, 0 (+1, 4) + Init. NCE + discriminant 35,4 (+1, 5) 29,7 (+2, 1)
Q.K. Do (LIMSI-CNRS) SMT 19 / 23
Entraˆınement discriminant
Mod`ele SOUL
Utilisant les scores normalis´es par SOUL:
dev test
Syst`eme de traduction 33, 9 27, 6 Ajout d’un mod`ele continu
+ SOUL 35, 1 28, 9
Ajout d’un mod`ele continu discriminant
+ Discriminant 35, 0 (+1, 1) 28, 9 (+1, 3) + Init. SOUL + discriminant 35,7 (+1, 8) 29,3 (+1, 7)
Q.K. Do (LIMSI-CNRS) SMT 20 / 23
Comment fonctionne le NCE?
0 2 4 6 8 10 12 14 16
Iteration 2
1 0 1 2 3 4
5 Coef de normalisation moyen
(a) Evolution du coefficient de ´ normalisation (moyenne et ´ ecart type)
0 2 4 6 8 10 12 14 16
Iteration 101
102 103 104 105
Perplexite
Perplexite
(b) Evolution de la perplexit´ ´ e mesur´ ee sur les donn´ ees de validation.
Q.K. Do (LIMSI-CNRS) SMT 21 / 23
Outline
1
Introduction
2
Mod`eles continus de traduction
3
M´ethodes d’apprentissage
4
R´esultats exp´erimentaux
5
Conclusions
Q.K. Do (LIMSI-CNRS) SMT 22 / 23
Conclusions
Avantages du nouveau cadre d’apprentissage du mod`ele neuronal:
Optimisation corr´ el´ ee avec BLEU
Apprendre ` a corriger les erreurs du syst` eme de traduction
La normalisation des scores de sortie n’est plus n´ ecessaire (avec NCE) Recommandation: pr´ e-apprentissage du mod` ele NCE!!!
Q.K. Do (LIMSI-CNRS) SMT 23 / 23
Colin Cherry and George Foster.
2012.
Batch tuning strategies for statistical machine translation.
In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
(NAACL-HLT), pages 427–436, June.
Josep M. Crego and Jos´e B. Mari˜ no.
2006.
Improving statistical MT by coupling reordering and decoding.
Machine Translation, 20(3):199–215.
Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and Fran¸cois Yvon.
2011.
Structured output layer neural network language model.
In Proceedings of ICASSP, pages 5524–5527.
Q.K. Do (LIMSI-CNRS) SMT 23 / 23