
Discriminative training of continuous translation models

Quoc-Khanh Do, Alexandre Allauzen, François Yvon

Univ. Paris-Sud / LIMSI-CNRS

Q.K. Do (LIMSI-CNRS) SMT 1 / 23

Contents

1. Introduction
2. Continuous translation models
3. Training methods
4. Experimental results
5. Conclusions



Neural models

[Figure: a neural network with one-hot inputs, projections R and weight matrices W_ih and W_ho, surrounded by the tasks it has been applied to.]

Tasks and reported results, in the order shown on the slide:

Automatic speech recognition: −0.9% WER
Syntactic parsing: 90.44% F1
Semantic analysis: 71.3 correlation
Word alignment: +0.07 F1
Machine translation: +1.6 BLEU


Neural models in machine translation

Translation system: a combination of models

s : everybody talks about happiness these days .

tout le monde parle de bonheur aujourd’hui .
tout le monde parle de bonheur actuellement .
tout le monde parle de bonheur ces derniers jours .
. . .
à tout le monde parle du bonheur aujourd’hui .
que tout le monde parle de bonheur ces derniers .
tout le monde parle le bonheur ce .

Continuous translation model: 5 h / 2000 sentences

Maximum likelihood ↔ BLEU



The n-gram approach in SMT

org : .... à recevoir le prix nobel de la paix ....

Bilingual units (source segment s̄, target segment t̄):

u8: s̄8 = à, t̄8 = to
u9: s̄9 = recevoir, t̄9 = receive
u10: s̄10 = le, t̄10 = the
u11: s̄11 = nobel de la paix, t̄11 = nobel peace
u12: s̄12 = prix, t̄12 = prize

Translation in two steps (Crego and Mariño 2006):

1. Reordering of the source
2. Monotone translation, segment by segment

⇒ n-gram model over the bilingual segments:

$$P(s, t) = \prod_{i=1}^{L} P(u_i \mid u_{i-1}, \ldots, u_{i-n+1})$$

See http://ncode.limsi.fr/
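The product over bilingual units can be sketched in a few lines. This is only an illustration under stated assumptions: `tuple_ngram_logprob` and the `cond_prob` callback are hypothetical names, and the toy model that returns 0.5 for every unit stands in for a real estimated n-gram model.

```python
import math

def tuple_ngram_logprob(units, cond_prob, n=3):
    """Return log P(s, t) = sum_i log P(u_i | u_{i-1}, ..., u_{i-n+1}).

    units: list of bilingual units (source segment, target segment).
    cond_prob: hypothetical callback giving P(unit | history).
    """
    total = 0.0
    for i, unit in enumerate(units):
        # History holds at most n-1 previous units, oldest first.
        history = tuple(units[max(0, i - (n - 1)):i])
        total += math.log(cond_prob(unit, history))
    return total

# Units u8..u12 from the slide; the constant model is a toy assumption.
units = [("à", "to"), ("recevoir", "receive"), ("le", "the"),
         ("nobel de la paix", "nobel peace"), ("prix", "prize")]
uniform = lambda unit, history: 0.5
logp = tuple_ngram_logprob(units, uniform)
```

Working in log space avoids underflow when the number of units L grows.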


Factored n-gram model

u9: s̄9 = recevoir, t̄9 = receive
u10: s̄10 = le, t̄10 = the
u11: s̄11 = nobel de la paix, t̄11 = nobel peace

$$
\begin{aligned}
P(u_{11} \mid u_{10}, u_{9}) ={}& P(\bar{t}_{11} \mid \bar{s}_{11}, \bar{s}_{9}, \bar{s}_{10}, \bar{t}_{9}, \bar{t}_{10}) \; P(\bar{s}_{11} \mid \bar{s}_{9}, \bar{s}_{10}, \bar{t}_{9}, \bar{t}_{10}) \\
={}& P(\text{nobel} \mid \text{recevoir}, \text{le}, \text{receive}, \text{the}) \\
&\times P(\text{de} \mid \text{le}, \text{nobel}, \text{receive}, \text{the}) \\
&\times P(\text{la} \mid \text{nobel}, \text{de}, \text{receive}, \text{the}) \\
&\times P(\text{paix} \mid \text{de}, \text{la}, \text{receive}, \text{the}) \\
&\times P(\text{nobel} \mid \text{la}, \text{paix}, \text{receive}, \text{the}) \\
&\times P(\text{peace} \mid \text{la}, \text{paix}, \text{the}, \text{nobel})
\end{aligned}
$$
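The word-level factors of the example can be enumerated mechanically. The sketch below assumes what the example suggests: each word is conditioned on the two most recent source words and the two most recent target words (n = 3), source words are generated before target words, and the name `factor_tuple` is hypothetical.

```python
def factor_tuple(src_hist, tgt_hist, src_words, tgt_words, n=3):
    """List (word, source context, target context) for one bilingual unit.

    src_hist / tgt_hist: words of the previous units on each side.
    The context keeps the n-1 most recent words on each side.
    """
    k = n - 1
    src, tgt = list(src_hist), list(tgt_hist)
    factors = []
    for w in src_words:                      # source words first
        factors.append((w, tuple(src[-k:]), tuple(tgt[-k:])))
        src.append(w)
    for w in tgt_words:                      # then target words
        factors.append((w, tuple(src[-k:]), tuple(tgt[-k:])))
        tgt.append(w)
    return factors

factors = factor_tuple(["recevoir", "le"], ["receive", "the"],
                       ["nobel", "de", "la", "paix"], ["nobel", "peace"])
for word, src_ctx, tgt_ctx in factors:
    print(f"P({word} | {', '.join(src_ctx + tgt_ctx)})")
```

Running it on unit u11 prints the six factors of the product, in the same order as on the slide.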

Neural architecture

[Figure: the context words (la, paix, the, nobel) enter as one-hot vectors, are projected with the matrices R^s and R^t, pass through the hidden layer, and the output layer scores the predicted word (peace).]

Activation: $a_\theta(w, c)$, with $b_\theta(w, c) = \exp(a_\theta(w, c))$

Output layer, normalized: $\dfrac{b_\theta(w, c)}{\sum_{w' \in V} b_\theta(w', c)}$

Output layer, unnormalized: $b_\theta(w, c)$



Maximum likelihood

$$\mathcal{L}_{cll}(\theta, S_n) = \sum_{(w, c) \in S_n} -\log p_\theta(w \mid c) + \mathcal{R}(\theta), \quad \text{where} \quad p_\theta(w \mid c) = \frac{b_\theta(w, c)}{\sum_{w' \in V} b_\theta(w', c)}$$

In practice: SOUL (Le et al. 2011)

[Figure: SOUL architecture. The context words w_{i-1}, w_{i-2}, w_{i-3} enter as one-hot vectors into a shared projection space (R), followed by W_ih. The output layer follows the associated clustering tree: a short list and top classes, then sub-class layers and word layers.]
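To make the cost of this criterion concrete, here is a minimal sketch of the conditional log-likelihood loss with a full softmax. The function names are hypothetical, the activations are given as a plain dictionary over a toy vocabulary, and the regularizer R(θ) is passed in as a precomputed number because the slide does not specify its form.

```python
import math

def softmax_prob(activations, w):
    """p(w | c) = b(w, c) / sum_w' b(w', c), with b = exp(a)."""
    m = max(activations.values())            # shift for numerical stability
    b = {v: math.exp(a - m) for v, a in activations.items()}
    return b[w] / sum(b.values())             # the sum runs over the whole vocabulary

def cll_loss(examples, regularizer=0.0):
    """Sum of -log p(w | c) over (w, activations) pairs, plus R(theta)."""
    nll = -sum(math.log(softmax_prob(acts, w)) for w, acts in examples)
    return nll + regularizer

# Toy example: two-word vocabulary with equal activations, so p = 0.5.
loss = cll_loss([("peace", {"peace": 0.0, "paix": 0.0})])
```

The denominator touches every word of the vocabulary for every training example, which is the cost that structured output layers such as SOUL, or NCE, are meant to reduce.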


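Noise-contrastive estimation (NCE)

$$p(X_{w'} = 1 \mid w', c) = \frac{p_\theta(w' \mid c)}{p_\theta(w' \mid c) + K\, p_N(w')}$$

$$p(X_{w'} = 0 \mid w', c) = \frac{K\, p_N(w')}{p_\theta(w' \mid c) + K\, p_N(w')}$$

Here $X_{w'} = 1$ marks a word drawn from the data and $X_{w'} = 0$ one of the $K$ words drawn from the noise distribution $p_N$.

Maximum likelihood for the classification of $\{w, w_1, \ldots, w_K\}$ → objective function for NCE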
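The two posteriors translate directly into a binary classification loss. The sketch below is an illustration with hypothetical function names; the model and noise probabilities are passed in as plain numbers, whereas in a real model the first would be the network output for the word in its context.

```python
import math

def nce_posterior(p_model, p_noise, k):
    """P(X = 1 | w, c): probability that w came from the data, not the noise."""
    return p_model / (p_model + k * p_noise)

def nce_loss(p_model_true, p_noise_true, noise_samples, k):
    """Negative log-likelihood of classifying {w, w_1, ..., w_K}.

    noise_samples: list of (p_model, p_noise) pairs for the K noise words.
    """
    # The observed word should be classified as data (X = 1).
    loss = -math.log(nce_posterior(p_model_true, p_noise_true, k))
    # Each noise word should be classified as noise (X = 0).
    for p_model, p_noise in noise_samples:
        loss -= math.log(1.0 - nce_posterior(p_model, p_noise, k))
    return loss

# Toy numbers: with K = 2, p_model = 0.2 and p_noise = 0.1 the posterior is 0.5.
posterior = nce_posterior(0.2, 0.1, 2)
loss = nce_loss(0.2, 0.1, [(0.2, 0.1), (0.2, 0.1)], 2)
```

Only K + 1 words are scored per training example, so no sum over the vocabulary is needed.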


Rescoring the hypotheses

s : everybody talks about happiness these days .

tout le monde parle de bonheur aujourd'hui .
tout le monde parle de bonheur actuellement .
tout le monde parle de bonheur ces derniers jours .
. . .
à tout le monde parle du bonheur aujourd'hui .
que tout le monde parle de bonheur ces derniers jours .
tout le monde parle le bonheur ce .

[...] purpose². In a domain adaptation context, we assume that a CSTM trained on the out-of-domain data already exists. This model is thus well suited to bootstrap the adaptation process.

3. Objective functions for adaptation

In most previous works (e.g. [8, 9]), CSTMs are estimated by maximizing the regularized conditional log-likelihood (CLL) on parallel training corpora. This estimation procedure is used to train a baseline CSTM on the out-of-domain corpus, producing a baseline model that will serve as an initial point for domain adaptation. Given a small in-domain parallel corpus, the same training procedure can also be used. A straightforward adaptation algorithm consists in running a few epochs of the standard back-propagation algorithm on the in-domain data to maximize the conditional likelihood, using the out-of-domain model as initial parameters.

There is, however, only a loose relationship between the CLL criterion and the final translation quality. The CSTM is usually integrated in the translation process through a reranking step, the goal of which is to reorder a reduced set of candidate translations, called the N-best list. Therefore, to take better advantage of the small amount of in-domain data, we propose to explore alternative objective functions that are more directly related to the translation quality (as reflected by the BLEU score) after reranking. We first present the general learning algorithm, then the various objective functions.

3.1. Rescoring N-best lists with CSTMs

Due to the high computational cost of normalizing the output layer, continuous models are in most cases³ introduced in a post-processing step called N-best reranking.

We thus assume that for each source sentence s, the decoder can generate an N-best list {h_1, h_2, ..., h_N} of the N top translation candidates. Each hypothesis h_i = (t_i, a_i) is associated with the decoder score F_λ(s, h) computed as:

$$F_\lambda(s, h) = \sum_{k=1}^{K} \lambda_k f_k(s, h), \qquad (3)$$

where K feature functions (f_k) are weighted by a set of coefficients (λ_k). The n-gram approach differs from other approaches by the hidden variables associated to derivations, such as the source word reordering and the segmentation of the resulting parallel sentence. The basic feature functions used in this study are very similar to those used by standard phrase-based SMT systems (see [30] for instance).

When reranking with a continuous space model, F_λ(.) is augmented to also include an additional feature denoted f_θ(s, h). As explained in Section 2.2, f_θ(s, h) typically corresponds to the negated log-probability of the derivation: f_θ(s, h) = −log P_θ(s, h), where θ is the vector containing the CSTM's free parameters. The scoring function used in reranking is then:

$$G_{\lambda,\theta}(s, h) = F_\lambda(s, h) + \lambda_{K+1} f_\theta(s, h) \qquad (4)$$

This scoring function depends on the CSTM's parameters θ, as well as on the coefficients λ of the scoring function. In the approach proposed here, optimizing the reranking step thus requires alternately tuning the vector of coefficients λ and adapting the CSTM's weight vector θ: the former procedure uses the development data, while the latter uses the in-domain parallel corpus.

The corresponding optimization procedure splits the in-domain set into mini-batches of a fixed size (typically 128 consecutive sentence pairs). As sketched in Algorithm 1, each mini-batch is used to update the parameters θ of the CSTM while keeping λ fixed. The vector λ is updated every M mini-batches.

Algorithm 1: Joint optimization procedure for θ and λ
1: Initialize θ and λ
2: for each iteration do
3:   for M mini-batches do    ▷ λ is fixed
4:     Compute the sub-gradient of L(θ, s) for all s in the mini-batch
5:     Update θ
6:   end for
7:   Update λ using dev set    ▷ θ is fixed
8: end for

In our study, λ tuning is performed using standard tools (here, the K-Best MIRA algorithm described in [21] as implemented in MOSES⁴). The training of CSTMs (with fixed λ) is more interesting, and we compare two discriminative objective functions, which aim at better taking the translation quality into account. These two objectives are in turn compared to the conventional maximization of the conditional likelihood criterion on parallel data.

² The following parameters can be initialized given source and target monolingual language models: the source and target word embeddings, respectively, and the structure of the structured output layer.

³ See however [31, 32, 19] for early attempts to integrate Neural Network Translation Models within the decoder.
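The reranking score in (4) can be sketched as follows. This is an illustration only: the name `rerank`, the triple layout of the N-best entries and all numeric values are assumptions, and the sketch assumes that a higher score G means a better hypothesis.

```python
def rerank(nbest, lambdas, lambda_cstm):
    """Sort hypotheses by G = sum_k lambda_k * f_k + lambda_{K+1} * f_theta, best first.

    nbest: list of (hypothesis, decoder_features, cstm_feature) triples,
    where cstm_feature plays the role of f_theta(s, h) = -log P_theta(s, h).
    """
    def score(entry):
        _, feats, f_theta = entry
        decoder_score = sum(l * f for l, f in zip(lambdas, feats))   # F(s, h), Eq. (3)
        return decoder_score + lambda_cstm * f_theta                 # G(s, h), Eq. (4)
    return sorted(nbest, key=score, reverse=True)

# Toy 2-best list with two decoder features; a negative weight on the
# negated log-probability rewards hypotheses the CSTM finds probable.
nbest = [("h1", [1.0, 0.0], 5.0), ("h2", [0.5, 0.5], 1.0)]
ranked = rerank(nbest, lambdas=[1.0, 1.0], lambda_cstm=-0.5)
```

In this toy list both hypotheses have the same decoder score, so the continuous model feature alone decides the order and h2 moves to the top.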

3.2. A max-margin approach

As explained above, each hypothesis h_i produced by the decoder is scored according to (4). Its quality can also be evaluated by the sentence-level approximation of the BLEU score, sBLEU(h_i). Let h* denote the hypothesis with the best sentence BLEU score. A max-margin loss function [33, 34, 20] for estimating θ can then be formulated as follows:

$$\mathcal{L}_{mm}(\theta, s) = -G_{\lambda,\theta}(s, h^*) + \max_{1 \le j \le N} \left( G_{\lambda,\theta}(s, h_j) + \mathrm{cost}_\alpha(h_j) \right), \qquad (5)$$

where cost_α(h_j) = α (sBLEU(h*) − sBLEU(h_j)). The parameter α mitigates the contribution of the cost function [...]

⁴ http://www.statmt.org/moses/
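The loss in (5) can be computed from the model scores and the sentence-level BLEU values of one N-best list. The sketch below uses a hypothetical function name and toy numbers, and assumes the cost is α times the sBLEU gap to the best hypothesis.

```python
def max_margin_loss(scores, sbleu, alpha=1.0):
    """L_mm = -G(h*) + max_j (G(h_j) + cost(h_j)).

    scores: G(s, h_j) for each hypothesis of the N-best list.
    sbleu:  sentence-level BLEU of each hypothesis.
    cost(h_j) = alpha * (sBLEU(h*) - sBLEU(h_j)), where h* has the best sBLEU.
    """
    best = max(range(len(sbleu)), key=lambda j: sbleu[j])    # index of h*
    augmented = [g + alpha * (sbleu[best] - b) for g, b in zip(scores, sbleu)]
    return -scores[best] + max(augmented)

# h* (sBLEU 0.6) is outscored by a worse hypothesis: positive loss.
loss_violated = max_margin_loss([1.0, 2.0], [0.6, 0.3])
# h* wins by more than the cost: the loss is zero.
loss_satisfied = max_margin_loss([3.0, 1.0], [0.6, 0.3])
```

Because h* is itself part of the list and has zero cost, the loss can never be negative; it vanishes once h* outscores every other hypothesis by at least that hypothesis's cost.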

[Figure: the SMT system produces the n-best hypotheses h1, h2, h3, ..., hn, which are ranked using the conventional SMT scores together with the continuous space translation model.]


Optimization algorithm

1: Initialize θ and λ
2: for each iteration do
3:   for M mini-batches do
4:     Compute the sub-gradient of L(θ) for each sentence s of the mini-batch
5:     Update θ    ▷ λ fixed
6:   end for
7:   Update λ using the dev set    ▷ θ fixed
8: end for

Update of λ: KBMIRA (Cherry and Foster 2012)

Which loss L(θ, s)?
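The alternating structure of the algorithm can be sketched generically. Everything concrete here is an assumption made for illustration: the function name, the plain gradient step with a fixed learning rate, the averaging of sub-gradients over a mini-batch, and the `subgradient` and `tune_lambda` callbacks, which stand in for the CSTM loss and for a dev-set tuner such as KBMIRA.

```python
def alternate_training(theta, lam, batches, m, n_iterations,
                       subgradient, tune_lambda, lr=0.1):
    """Alternate updates of theta (lam fixed) and of lam (theta fixed)."""
    for it in range(n_iterations):
        start = (it * m) % len(batches)
        for step in range(m):
            batch = batches[(start + step) % len(batches)]
            # Average the per-sentence sub-gradients over the mini-batch.
            grads = [subgradient(theta, lam, s) for s in batch]
            avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(theta))]
            theta = [t - lr * g for t, g in zip(theta, avg)]   # lam stays fixed
        lam = tune_lambda(theta, lam)                          # theta stays fixed
    return theta, lam

# Toy problem: minimise (theta - s)^2 / 2 for s = 1; "tuning" just counts calls.
theta, lam = alternate_training(
    theta=[0.0], lam=0, batches=[[1.0, 1.0]], m=3, n_iterations=2,
    subgradient=lambda th, l, s: [th[0] - s],
    tune_lambda=lambda th, l: l + 1,
    lr=0.5)
```

After two outer iterations of three mini-batches each, the toy parameter has taken six gradient steps towards 1 and the tuner has been called twice.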


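Objective function

Cost function:

$$\mathrm{cost}_\alpha(h) = \alpha \left( \mathrm{sBLEU}(h^*) - \mathrm{sBLEU}(h) \right), \quad \text{with} \quad h^* = \operatorname*{argmax}_h \mathrm{sBLEU}(h) \text{ the best hypothesis}$$

sBLEU is the sentence-level BLEU.

Max-margin

[Figure: the n-best hypotheses along an sBLEU axis, with h*, $G_{\lambda,\theta}(s, h^*) + \mathrm{cost}_\alpha(h^*)$ and $\operatorname{argmax}_h \left( G_{\lambda,\theta}(s, h) + \mathrm{cost}_\alpha(h) \right)$ marked.]

In practice: consider several pairs of hypotheses.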



Adapting the system with a continuous model

Experimental setup

[Figure: experimental pipeline. The out-of-domain data (train, dev) are used to build the baseline system and to train the NN model (CLL, NCE). The in-domain data (train, dev) yield the train and dev N-best lists, which are used for the discriminative training of θ and for tuning the weights λ of the log-linear model. Labels on the figure: WMT en-fr, TEDTalks, WMT'13, 300-best, KBMIRA.]

Rescoring the hypotheses

NCE vs. SOUL

                                dev            test
Baseline system                 33.9           27.6
Adding a continuous model
  + SOUL                        35.1 (+1.2)    28.9 (+1.3)
  + NCE                         35.0 (+1.1)    28.8 (+1.2)

                                Training speed    Inference speed
SOUL                            1000 words/s      20000 words/s
NCE                             1000 words/s      25000 words/s

Discriminative training

NCE model

Using the unnormalized scores of the NCE model:

                                            dev            test
Translation system                          33.9           27.6
Adding a continuous model
  + NCE                                     35.0           28.8
Adding a discriminative continuous model
  + Discriminative                          35.3 (+1.4)    29.0 (+1.4)
  + NCE init. + discriminative              35.4 (+1.5)    29.7 (+2.1)

Discriminative training

SOUL model

Using the scores normalized by SOUL:

                                            dev            test
Translation system                          33.9           27.6
Adding a continuous model
  + SOUL                                    35.1           28.9
Adding a discriminative continuous model
  + Discriminative                          35.0 (+1.1)    28.9 (+1.3)
  + SOUL init. + discriminative             35.7 (+1.8)    29.3 (+1.7)

How does NCE work?

[Figure (a): Evolution of the normalization coefficient (mean and standard deviation). x-axis: iteration (0 to 16); y-axis: mean normalization coefficient.]

[Figure (b): Evolution of the perplexity measured on the validation data. x-axis: iteration (0 to 16); y-axis: perplexity (logarithmic scale).]


Conclusions

Advantages of the new training framework for the neural model:

- Optimization correlated with BLEU
- Learning to correct the errors of the translation system
- Normalizing the output scores is no longer necessary (with NCE)

Recommendation: pre-train the NCE model.

References

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 427–436, June.

Josep M. Crego and José B. Mariño. 2006. Improving statistical MT by coupling reordering and decoding. Machine Translation, 20(3):199–215.

Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. 2011. Structured output layer neural network language model. In Proceedings of ICASSP, pages 5524–5527.
