Apprentissage discriminant des modèles continus de traduction

(1)

Apprentissage discriminant des mod` eles continus de traduction

Quoc-Khanh Do, Alexandre Allauzen, Fran¸cois Yvon

Univ. Paris-Sud / LIMSI-CNRS

Q.K. Do (LIMSI-CNRS) SMT 1 / 23

(2)

Contenu

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(3)

Outline

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(4)

Mod` eles neuronaux

R

R R

Wih Who

0 1 00 0 0 0 0 0 0 0 0 0 0 0 0 0 1

00 0 0 1 0 0 0 0

RAP Analyse

Syntaxique

Analyse S´emantique

Alignement de mots

Traduction Automatique r´eseau neurone

(5)

Mod` eles neuronaux

R

R R

Wih Who

0 1 00 0 0 0 0 0 0 0 0 0 0 0 0 0 1

00 0 0 1 0 0 0 0

RAP Analyse

Syntaxique

Analyse S´emantique

Alignement de mots

Traduction Automatique r´eseau neurone

−0,9%

WER

90,44%

pr´ecision F1

71,3 corr´elation

+0,07 F1

+1,6 BLEU

(6)

Mod` eles neuronaux en Traduction Automatique

Syst`eme de traduction:

assemblage de mod`eles

⇒

s : everybody talks about happiness these days .

tout le monde parle de bonheur aujourd’hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .

. . .

`

a tout le monde parle du bonheur aujourd’hui . que tout le monde parle de bonheur ces derniers .

tout le monde parle le bonheur ce .

(7)

Mod` eles neuronaux en Traduction Automatique

Syst`eme de traduction:

assemblage de mod`eles

⇒

s : everybody talks about happiness these days .

tout le monde parle de bonheur aujourd’hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .

. . .

`

a tout le monde parle du bonheur aujourd’hui . que tout le monde parle de bonheur ces derniers .

tout le monde parle le bonheur ce .

Mod` ele de traduction continu 5h / 2000 phrases

Maximum de vraisemblance ↔ BLEU

(8)

Outline

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(9)

Approche n-gramme en TAS

s

̅_8:

à t

̅₈:

to

s

̅_9:

recevoir t

̅₉:

receive

s

̅_10:

le t

̅₁₀:

the

s

̅_11:

nobel de la paix t

̅₁₁:

nobel peace

s

̅_12:

prix t

̅₁₂:

prize

u

₈

u

₉

u

₁₀

u

₁₁

u

₁₂

s : ....

t : ....

à recevoir le prix nobel de la paix

org : ....

....

Traduction en deux ´etapes (Crego and Mari˜ no2006)

1 2

⇒ Mod`ele n-gramme sur les bisegments:

P (s, t) = Y

L

i=1

P(u

i

| u

i−1

, ..., u

i−n+1

)

See http://ncode.limsi.fr/

(10)

Approche n-gramme en TAS

s

̅_8:

à t

̅₈:

to

s

̅_9:

recevoir t

̅₉:

receive

s

̅_10:

le t

̅₁₀:

the

s

̅_11:

nobel de la paix t

̅₁₁:

nobel peace

s

̅_12:

prix t

̅₁₂:

prize

u

₈

u

₉

u

₁₀

u

₁₁

u

₁₂

s : ....

t : ....

à recevoir le prix nobel de la paix

org : ....

....

1

Traduction en deux ´etapes (Crego and Mari˜ no2006)

1

R´eordonnance de la source

2

⇒ Mod`ele n-gramme sur les bisegments:

P (s, t) = Y

L

i=1

P(u

i

| u

i−1

, ..., u

i−n+1

)

See http://ncode.limsi.fr/

(11)

Approche n-gramme en TAS

s

̅_8:

à t

̅₈:

to

s

̅_9:

recevoir t

̅₉:

receive

s

̅_10:

le t

̅₁₀:

the

s

̅_11:

nobel de la paix t

̅₁₁:

nobel peace

s

̅_12:

prix t

̅₁₂:

prize

u

₈

u

₉

u

₁₀

u

₁₁

u

₁₂

s : ....

t : ....

à recevoir le prix nobel de la paix

org : ....

....

1

2

Traduction en deux ´etapes (Crego and Mari˜ no2006)

1

R´eordonnance de la source

2

Traduction monotone par segment

⇒ Mod`ele n-gramme sur les bisegments:

P (s, t) = Y

L

i=1

P(u

i

| u

i−1

, ..., u

i−n+1

)

See http://ncode.limsi.fr/

(12)

Mod` ele n-gramme factoris´ e

s̅₉: recevoir t̅₉: receive

s̅₁₀: le t̅₁₀: the s̅₁₁: nobel de la paix

t̅₁₁: nobel peace

P ( | )= P (

^t^̅¹¹^:nobel peace

|

s̅₁₁: nobel de la paix ^s^̅⁹^:recevoir s̅₁₀: le ^t^̅⁹^:receive t̅₁₀: the

) P (

s̅_11:nobel de la paix

|

^s^̅^9:recevoir s̅_10:le ^t^̅^9:receive t̅_10:the

)

= P ( nobel | recevoir, le , receive, the )

× P( de | le, nobel , receive, the )

× P( la | nobel, de , receive, the )

× P( paix | de, la , receive, the )

× P( nobel | la, paix , receive, the )

× P( peace | la, paix , the, nobel )

(13)

Mod` ele n-gramme factoris´ e

P ( | )= P (

|

) P (

|

)

= P ( nobel | recevoir, le , receive, the )

× P( de | le, nobel , receive, the )

× P( la | nobel, de , receive, the )

× P( paix | de, la , receive, the )

× P( nobel | la, paix , receive, the )

× P( peace | la, paix , the, nobel )

paix de

la

receive the

(14)

Mod` ele n-gramme factoris´ e

P ( | )= P (

|

) P (

|

)

= P ( nobel | recevoir, le , receive, the )

× P( de | le, nobel , receive, the )

× P( la | nobel, de , receive, the )

× P( paix | de, la , receive, the )

× P( nobel | la, paix , receive, the )

× P( peace | la, paix , the, nobel )

peace la

paix

the nobel

(15)

Structure neuronale

Couche cachée 0

10 0 00

0 0 01 0 0

0 0 0 00 1

10 0 0 00

la

paix

the

3

nobel

peace

Couche de sortie Rs

Rt Rs

Rt

Activation : a

θ

(w, c) b

θ

(w, c) = exp(a

θ

(w, c))

(16)

Structure neuronale

Couche cachée 0

10 0 00

0 0 01 0 0

0 0 0 00 1

10 0 0 00

la

paix

the

3

nobel

peace

Couche de sortie Rs

Rt Rs

Rt

b_θ(w,c) Pw0 ∈Vb_θ(w⁰,c)

Activation : a

θ

(w, c) b

θ

(w, c) = exp(a

θ

(w, c))

(17)

Structure neuronale

Couche cachée 0

10 0 00

0 0 01 0 0

0 0 0 00 1

10 0 0 00

la

paix

the

3

nobel

peace

Couche de sortie Rs

Rt Rs

Rt

b

θ

(w, c)

Activation : a

θ

(w, c) b

θ

(w, c) = exp(a

θ

(w, c))

(18)

Outline

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(19)

Maximum de vraisemblance

L

^cll

(θ, S

ⁿ

) = P

(w,c)∈Sn

− log p

_θ

(w | c) + R (θ) o` u p

_θ

(w | c) =

^P^b^θ^(w,c)

w0 ∈V

bθ(w⁰,c)

(20)

Maximum de vraisemblance

L

^cll

(θ, S

ⁿ

) = P

(w,c)∈Sn

− log p

_θ

(w | c) + R (θ) o` u p

_θ

(w | c) =

^P^b^θ^(w,c)

w0 ∈V

b_θ(w⁰,c)

En pratique : SOUL (Le et al.2011)

R

R R

Wih shared projection space

0 1 0 0 0 0 0 0 0

wi-1

0 0 0 0 0 0 0 1

wi-2 0

0 0 0 0 1 0 0 0 0

wi-3

}

^{short list} The associated clustering tree

Sub-class layers

Word layers

}

top classes

(21)

Estimation contrastive bruit´ ee (NCE)

X

^w⁰

= 1

X

^w⁰

= 0

(22)

Estimation contrastive bruit´ ee (NCE)

p( X

^w⁰

= 1 | w

⁰

, c)

=

_p ^p^θ^(w⁰^|c)

θ(w⁰|c)+Kp_N(w⁰)

p( X

^w⁰

= 0 | w

⁰

, c)

=

_p ^Kp^N^(w⁰⁾

θ(w⁰|c)+Kp_N(w⁰)

Maximum de vraisemblance pour la classification de { w, w

^∗₁

, ..., w

^∗_K

} → fonction objectif pour NCE

(23)

R´ e´ evaluation des hypoth` eses

s : everybody talks about happiness these days .

tout le monde parle de bonheur aujourd'hui . tout le monde parle de bonheur actuellement . tout le monde parle de bonheur ces derniers jours .

. . .

à tout le monde parle du bonheur aujourd'hui . que tout le monde parle de bonheur ces derniers jours .

tout le monde parle le bonheur ce . "

n-best hypotheses SMT system

h1 h2 h₃ . . . . . hn

(24)

R´ e´ evaluation des hypoth` eses

. . .

n-best hypotheses purpose². In a domain adaptation context, we assume that

an existing CSTM –trained on the out-of-domain data– al- ready exists. This model is thus well suited to bootstrap the adaptation process.

3. Objective functions for adaptation In most previous works (eg. [8, 9]), CSTMs are estimated by maximizing the regularized conditional log-likelihood (CLL) on parallel training corpora. This estimation procedure is used to train a baseline CSTM on the out-of-domain corpus, producing a baseline model that will serve as an initial point for domain adaptation. Given a small in-domain parallel corpus, the same training procedure can also be used.

A straightforward adaptation algorithm consists in running a few epochs of the standard back-propagation algorithm on the in-domain data to maximize the conditional likelihood using, as initial parameters, the out-of-domain model.

There is however only a loose relationship between the CLL criterion and the final translation quality. The CSTM is usually integrated in the translation process through a reranking step, the goal of which is to reorder a reduced set of candidate translations, calledN-best list. Therefore, to better take advantage of the small amount of in-domain data, we propose to explore alternative objective functions that are more directly related to the translation quality (as reflected by the BLEU score) after reranking. We first present the general learning algorithm, then the various objective functions.

3.1. RescoringN-best lists with CSTMs

Due to the high computational cost of normalizing the output layer, continuous models are in most cases³introduced in a post-processing step calledN-best reranking.

We thus assume that for each source sentences, the decoder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is associated with the decoder scoreF(s,h)computed as:

F(s,h) = XK k=1

kfk(s,h), (3) whereKfeature functions (fk) are weighted by a set of coefficients (k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).

When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f✓(s,h). As explained in Section 2.2,f✓(s,h)typically

2The following parameters can be initialized given a source and target language monolingual models: the source and target word embeddings re- spectively, and the structured output layer’s structure.

3See however [31, 32, 19] for early attempts to integrate Neural Network Translation Models within the decoder.

Algorithm 1Joint optimization procedure for✓and 1:Initialize✓and

2:foreach iterationdo

3: forMmini-batchesdo . is fixed

4: Compute the sub-gradient ofL(✓,s)for allsin the mini-batch

5: Update✓

6: end for

7: Update using dev set .✓is fixed

8:end for

corresponds to the negated log-probability of the derivation:

f✓(s,h) = logP✓(s,h), where✓is the vector containing the CSTM’s free parameters. The scoring function used in reranking is then:

G,✓(s,h) =F(s,h) + K+1f✓(s,h) (4) This scoring function depends on the CSTM’s parameters✓, as well as on the coefficients of the scoring function. In the approach proposed here, optimizing the reranking step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.

The corresponding proposed optimization procedure splits the in-domain set in mini-batches of a fixed size (typically 128 subsequent sentence pairs). As sketched in Algo- rithm 1, each mini-batch is used to update the parameters✓ of the CSTM while keeping fixed. The vector is updated everyMmini-batches.

In our study, tuning is performed using standard tools (here, the K-Best Mira algorithm described in [21] as imple- mented in MOSES⁴). The training of CSTMs (with fixed ) is more interesting and we compare two discriminative objective functions, which aim at better taking the translation quality into account. These two objectives are in turn com- pared to the conventional maximization of the conditional likelihood criterion on parallel data.

3.2. A max-margin approach

As explained above, each hypothesishiproduced by the decoder is scored according to (4). Its quality can also be evaluated by the sentence-level approximation of the BLEU scoresBLEU(hi). Leth^⇤denote the hypothesis with the best sentence BLEU score. A max-margin loss function [33, 34, 20] for estimating✓can then be formulated as follows:

Lmm(✓,s) = G,✓(s,h^⇤) + max

1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(hj) =↵ sBLEU(h^⇤) sBLEU(hj). The parameter↵mitigates the contribution of the cost function

4http://www.statmt.org/moses/

Ranked

SMT system

h1 h2 h3 . . . . . hn

(25)

M´ethodes d’apprentissage

R´ e´ evaluation des hypoth` eses

. . .

n-best hypotheses purpose². In a domain adaptation context, we assume that

We thus assume that for each source sentences, the decoder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is associated with the decoder scoreF(s,h)computed as:

F(s,h) = XK k=1

kf_k(s,h), (3) whereKfeature functions (fk) are weighted by a set of coefficients ( k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).

When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f_✓(s,h). As explained in Section 2.2,f_✓(s,h)typically

Algorithm 1Joint optimization procedure for✓and 1:Initialize✓and

2:foreach iterationdo

5: Update✓

6: end for

7: Update using dev set .✓is fixed 8:end for

G_,✓(s,h) =F(s,h) + _K+1f_✓(s,h) (4) This scoring function depends on the CSTM’s parameters✓, as well as on the coefficients of the scoring function. In the approach proposed here, optimizing the reranking step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.

Lmm(✓,s) = G_,✓(s,h^⇤) + max

1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(h_j) =↵ sBLEU(h^⇤) sBLEU(h_j). The parameter↵mitigates the contribution of the cost function

Conventional SMT scores

Ranked

We thus assume that for each source sentences, the decoder can generate anN-best list{h1,h2, ...,hN}of N top translation candidates. Each hypothesishi= (ti,ai)is associated with the decoder scoreF (s,h)computed as:

F (s,h) = XK k=1

kfk(s,h), (3) whereKfeature functions (fk) are weighted by a set of coefficients (k). Then-gram approach differs from other ap- proaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).

When reranking with a continuous space model,F(.) is augmented to also include an additional feature denoted f✓(s,h). As explained in Section 2.2,f✓(s,h)typically

1:Initialize✓and 2:foreach iterationdo

5: Update✓

6: end for

7: Update using dev set .✓is fixed

8:end for

G,✓(s,h) =F (s,h) + K+1f✓(s,h) (4) This scoring function depends on the CSTM’s parameters✓, as well as on the coefficients of the scoring function. In the approach proposed here, optimizing the reranking step will thus requiresto alternatively tune the vector of coefficients and to adapt the CSTM’s weight vector✓: the former procedure uses the development data, while the latter will use the in-domain parallel corpus.

Lmm(✓,s) = G ,✓(s,h^⇤) + max

1jN(G,✓(s,hj) +cost↵(hj)), (5) where cost↵(hj) =↵ sBLEU(h^⇤) sBLEU(hj). The parameter↵mitigates the contribution of the cost function

SMT system Continuous space translation model

h1 h2 h3 . . . . . hn

(26)

Algorithme d’optimisation

1:

Init. de θ et λ

2:

for chaque it´eration do

3:

for M paquets do

4:

Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet

5:

Mise ` a jour de θ . λ fix´e

6:

end for

7:

Mise ` a jour de λ en utilisant le dev. . θ fix´e

8:

end for

(27)

Algorithme d’optimisation

1:

Init. de θ et λ

2:

for chaque it´eration do

3:

for M paquets do

4:

Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet

5:

Mise `a jour de θ . λ fix´e

6:

end for

7:

Mise ` a jour de λ en utilisant le dev. . θ fix´e

8:

end for

(28)

Algorithme d’optimisation

1:

Init. de θ et λ

2:

for chaque it´eration do

3:

for M paquets do

4:

Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet

5:

Mise ` a jour de θ . λ fix´e

6:

end for

7:

Mise ` a jour de λ en utilisant le dev. . θ fix´e

8:

end for

(29)

Algorithme d’optimisation

1:

Init. de θ et λ

2:

for chaque it´eration do

3:

for M paquets do

4:

Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet

5:

Mise `a jour de θ . λ fix´e

6:

end for

7:

Mise `a jour de λ en utilisant le dev. . θ fix´e

8:

end for

KBMIRA (Cherry and Foster2012)

(30)

Algorithme d’optimisation

1:

Init. de θ et λ

2:

for chaque it´eration do

3:

for M paquets do

4:

Calcul du sous-gradient de L (θ) pour chaque phrase s du paquet

5:

Mise ` a jour de θ . λ fix´e

6:

end for

7:

Mise ` a jour de λ en utilisant le dev. . θ fix´e

8:

end for

KBMIRA (Cherry and Foster2012) L (θ, s) ???

(31)

Fonction objectif

Fonction de coˆ ut:

cost

α

(h) = α sBLEU(h

^∗

) − sBLEU (h)), avec h

^∗

= argmax

h

sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU

Max-margin

n-best hypotheses sBLEU

(32)

Fonction objectif

Fonction de coˆ ut:

cost

α

(h) = α sBLEU(h

^∗

) − sBLEU (h)), avec h

^∗

= argmax

h

sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU

Max-margin

n-best hypotheses h*

sBLEU

G

,✓

(s, h

^⇤

) + cost

↵

(h

^⇤

)

0

(33)

Fonction objectif

Fonction de coˆ ut:

cost

α

(h) = α sBLEU(h

^∗

) − sBLEU (h)), avec h

^∗

= argmax

h

sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU

Max-margin

argmax

_h

⇥

G

,✓

(s, h) + cost

↵

(h) ⇤

sBLEU

G

,✓

(s, h

^⇤

) + cost

↵

(h

^⇤

)

0

(34)

Fonction objectif

Fonction de coˆ ut:

cost

α

(h) = α sBLEU(h

^∗

) − sBLEU (h)), avec h

^∗

= argmax

h

sBLEU (h), la meilleure hypoth`ese sBLEU est le sentence-BLEU

Max-margin

argmax

_h

⇥

G

,✓

(s, h) + cost

↵

(h) ⇤

sBLEU

G

,✓

(s, h

^⇤

) + cost

↵

(h

^⇤

)

0 En pratique: consid´erer plusieurs paires d’hypoth`eses!!!

(35)

Outline

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(36)

Adaptation du syst` eme avec mod` ele continu

Configuration exp´erimentale

train N−best dev N−best

+ modele log−lineaire Modele NN

entrainement

entrainement discriminant

dev train

Du domaine

(CLL, NCE)

train dev

Hors domaine

Systeme baseline

(Hors domaine)

discriminant entrainement

λ θ

(37)

Adaptation du syst` eme avec mod` ele continu

Configuration exp´erimentale

train N−best dev N−best

+ modele log−lineaire Modele NN

entrainement

entrainement discriminant

dev train

Du domaine

(CLL, NCE)

train dev

Hors domaine

Systeme baseline

(Hors domaine)

discriminant entrainement

λ θ

WMT en-fr

TEDTalks

WMT’13

300-best

KBMIRA

(38)

R´ e´ evaluation des hypoth` eses

NCE vs. SOUL

dev test

Syst`eme de base 33, 9 27, 6 Ajout d’un mod`ele continu

+ SOUL 35, 1 (+1, 2) 28, 9 (+1, 3) + NCE 35, 0 (+1, 1) 28, 8 (+1, 2)

Vitesse d’entraˆınement Vitesse d’inf´erence

SOUL 1000 mots/s 20000 mots/s

NCE 1000 mots/s 25000 mots/s

(39)

Entraˆınement discriminant

Mod`ele NCE

Utilisant les scores non-normalis´es du mod`ele NCE:

dev test

Syst`eme de traduction 33, 9 27, 6 Ajout d’un mod`ele continu

+ NCE 35, 0 28, 8

Ajout d’un mod`ele continu discriminant

+ Discriminant 35, 3 (+1, 4) 29, 0 (+1, 4) + Init. NCE + discriminant 35,4 (+1, 5) 29,7 (+2, 1)

(40)

Entraˆınement discriminant

Mod`ele SOUL

Utilisant les scores normalis´es par SOUL:

dev test

Syst`eme de traduction 33, 9 27, 6 Ajout d’un mod`ele continu

+ SOUL 35, 1 28, 9

Ajout d’un mod`ele continu discriminant

+ Discriminant 35, 0 (+1, 1) 28, 9 (+1, 3) + Init. SOUL + discriminant 35,7 (+1, 8) 29,3 (+1, 7)

(41)

Comment fonctionne le NCE?

0 2 4 6 8 10 12 14 16

Iteration 2

1 0 1 2 3 4

5 Coef de normalisation moyen

(a) Evolution du coefficient de ´ normalisation (moyenne et ´ ecart type)

0 2 4 6 8 10 12 14 16

Iteration 10¹

10² 10³ 10⁴ 10⁵

Perplexite

(b) Evolution de la perplexit´ ´ e mesur´ ee sur les donn´ ees de validation.

(42)

Outline

1

Introduction

2

Mod`eles continus de traduction

3

M´ethodes d’apprentissage

4

R´esultats exp´erimentaux

5

Conclusions

(43)

Conclusions

Avantages du nouveau cadre d’apprentissage du mod`ele neuronal:

Optimisation corr´ el´ ee avec BLEU

Apprendre ` a corriger les erreurs du syst` eme de traduction

La normalisation des scores de sortie n’est plus n´ ecessaire (avec NCE) Recommandation: pr´ e-apprentissage du mod` ele NCE!!!

(44)