
Chapter 5 Unit Selection

5.3 Target feature selection and weight tuning

5.3.1 Unit selection and concatenation

We briefly revisit the unit selection framework for speech synthesis. A typical TTS (text-to-speech synthesis) algorithm can be broadly divided into two steps: generation of the specification and the actual synthesis. This division is made to separate the steps which perform a target cost calculation from those which do not. In the first stage, the text to be synthesized is analyzed. This stage produces the specification of the phoneme sequence to be synthesized, $t_1^n = (t_1, \ldots, t_j, \ldots, t_n)$, $n$ phonemes starting from $1$, for the input text. The second stage does the actual synthesis of the required phoneme sequence in two steps: pre-selection and final selection through lattice resolution. This second synthesis stage depends on the target cost calculation for its synthesis performance. The target cost calculation is done by the comparison ...

tel-00927121, version 1 - 1 1 Jan 2014

... are 'perceptually' similar are pre-selected for the final search based on this target cost. A general target function is calculated as follows:

\[ C(t_i, u_i) = \sum_{\rho=1}^{F} w_\rho \, C_\rho(t_i, u_i) \qquad (5.4) \]

where $t_i$, $u_i$ are the target and a candidate; $F$ is the number of target features; $C_\rho(t_i, u_i)$ $(\rho = 1, \ldots, F)$ are the target feature costs between the elements of the target and candidate feature vectors; and $w_\rho$ is the weight of feature $\rho$.
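As an illustration of equation (5.4), the weighted sum can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names and the 0/1 `mismatch_cost` are hypothetical, and the feature values in the example are invented for illustration.

```python
def mismatch_cost(target_value, candidate_value):
    """Assumed per-feature cost C_rho: 0 if the values match, 1 otherwise."""
    return 0.0 if target_value == candidate_value else 1.0


def target_cost(target, candidate, weights, feature_costs=None):
    """Weighted sum of per-feature costs, C(t, u) = sum_rho w_rho * C_rho(t, u)."""
    if feature_costs is None:
        # Assumption: every feature uses the same 0/1 mismatch cost.
        feature_costs = [mismatch_cost] * len(weights)
    return sum(
        w * cost(t_val, u_val)
        for w, cost, t_val, u_val in zip(weights, feature_costs, target, candidate)
    )


# Hypothetical feature vectors (phoneme label, stress, syllable position).
t = ("a", "stressed", "onset")
u = ("a", "unstressed", "onset")
print(target_cost(t, u, weights=[0.5, 0.3, 0.2]))  # only the second feature differs
```

Only the relative sizes of the weights matter for ranking candidates, since scaling all weights by a positive constant scales every cost equally.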

The selection among the set of pre-selected candidates is performed by resolving a lattice of candidates using the Viterbi algorithm. The result of this selection is a path in the lattice of candidates which minimizes a weighted linear combination of three costs: the target cost, the acoustic join cost and the visual join cost, where $w$, $w_{aj}$ and $w_{vj}$ are the weights for the component target cost, acoustic join cost and visual join cost. We choose these weights as explained in (Toutios et al., 2011) (see section 6.1.2).
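The lattice resolution described above can be sketched with a small dynamic-programming search. This is a generic Viterbi sketch under stated assumptions, not the system's implementation: the function name, the argument layout and the parameter name `w_tc` (standing for the target cost weight $w$) are hypothetical, and the toy costs in the example are invented.

```python
def viterbi_select(lattice, target_costs, acoustic_join, visual_join,
                   w_tc=1.0, w_aj=1.0, w_vj=1.0):
    """Return (path, cost) minimising the weighted sum of target and join costs.

    lattice[j]         : candidates for target position j
    target_costs[j][k] : target cost of lattice[j][k]
    acoustic_join(a, b), visual_join(a, b) : join costs of concatenating a then b
    """
    def join(a, b):
        return w_aj * acoustic_join(a, b) + w_vj * visual_join(a, b)

    # best[k] = (cost of the cheapest path ending in lattice[j][k], that path)
    best = [(w_tc * target_costs[0][k], [u]) for k, u in enumerate(lattice[0])]
    for j in range(1, len(lattice)):
        new_best = []
        for k, u in enumerate(lattice[j]):
            # Cheapest way to reach candidate u from any candidate at position j - 1.
            cost, path = min(
                ((c + join(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + w_tc * target_costs[j][k], path + [u]))
        best = new_best
    cost, path = min(best, key=lambda item: item[0])
    return path, cost


# Toy example: candidates are numbers, the acoustic join cost is their distance.
lattice = [[1.0, 5.0], [2.0, 6.0], [3.0]]
target_costs = [[0.5, 0.1], [0.2, 0.2], [0.0]]
path, cost = viterbi_select(lattice, target_costs,
                            acoustic_join=lambda a, b: abs(a - b),
                            visual_join=lambda a, b: 0.0)
print(path, cost)
```

Storing the whole path with each partial cost keeps the sketch short; a back-pointer table would be the usual choice for long lattices.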

An ideal target cost function

The purpose of the target cost function is to rank candidates in the order of their suitability to fit a target position during synthesis. Each candidate is assigned a cost (a positive real number) by the target cost function: the lower the cost, the more suitable the candidate is for a target position. If we assume that there is a metric to measure the perceptual dissimilarity between a target and a candidate, then ideally the ranking of candidates based on their target costs should be the same as the ordering based on their perceptual dissimilarities to the target.

At the time of synthesis, the target specification only has the target feature description, but no acoustic or visual speech realization. So, the decision is made based on the target cost. Hence, an optimum target cost function is very important for good synthesis results. A good set of target features and well-tuned weights define a good target cost function. The following section presents a simple and robust iterative algorithm to simultaneously eliminate redundant features and tune the feature weights.


5.3.2 Target feature selection and weight tuning

The algorithm to be described alleviates the problem of redundancy and noise that sets in due to the exhaustive set of features considered. Its importance is also due to the fact that, with a large set of features, it is practically infeasible to have a corpus which covers all the possible feature combinations. The algorithm uses the corpus, for which we have both actual speech realizations and target feature descriptions for each of the candidates present in it.

For any speech segment, there are possible variants which are perceptually considered good alternatives. However, it is practically impossible to rank candidates in terms of their absolute perceptual quality with respect to a target. Being 'similar' to an already existing speech unit is a reasonable way to say how well a candidate will fit in a 'target' position. If we devise a way to measure the dissimilarity between two units, it can be used on the candidates in the corpus, since they have both the target feature description and the speech realization available. The comparison between the ordering obtained by this measure and the ranking using the target cost can be used to evaluate the target function. In the following paragraphs, we define two things necessary for the evaluation of a target function: disorder with respect to a target cost function and dissimilarity between two speech realizations.

5.3.2.1 Disorder

The disagreement between the ranking of candidates given by the target cost function and the ordering given by the dissimilarity measure needs to be quantified. With respect to a particular target $t$ whose speech is available, the candidate ranking based on the target cost function should be in agreement with the dissimilarity based ordering. We refer to the ordering based on the target cost as ranking. Consider a target $t$ and two candidates $u$ and $v$. With respect to the target $t$, let their dissimilarity measures be $D(t, u)$ and $D(t, v)$, and their target costs be $C(t, u)$ and $C(t, v)$. Then, for an ideal target cost function, one of the following three conditions should be true:

1. $C(t, u) < C(t, v) \Leftrightarrow D(t, u) < D(t, v)$

2. $C(t, u) = C(t, v) \Leftrightarrow D(t, u) = D(t, v)$

3. $C(t, u) > C(t, v) \Leftrightarrow D(t, u) > D(t, v)$

The dissimilarity measure is based on the comparison of two speech realizations. We assume that similar speech realizations are perceptually similar. This assumption implies that, with the dissimilarity measure, we are expressing the difference in their speech realizations. Our approach is based on the idea that the ordering given by an ideal target cost function should agree with the ordering given by this dissimilarity measure. During pre-selection, the target cost function assigns a ranking to the available candidates, for pruning the less suitable candidates. For this reason, we refer to the target cost based ordering as ranking. Unlike some systems, we do not train the target cost function to compute the dissimilarity (Hunt and Black, 1996). We only focus on the candidate ordering given by the target cost function. The above three conditions state that the comparison of two candidates for a target position based on their target costs would be similar to that based on their dissimilarity to the target, if the target were to have a speech realization available (hypothetical). We denote the above three conditions by the following:

\[ C(t, u) \ast C(t, v) \Leftrightarrow D(t, u) \ast D(t, v) \qquad (5.6) \]

where $\ast \in \{<, =, >\}$.

[Table 5.4; surviving column headings: Target Cost, Dissimilarity]

Table 5.4: This table illustrates the idea of comparing a dissimilarity measure based ordering with the ranking assigned based on the target cost. A target $t$ and four candidates $\{c_1, c_2, c_3, c_4\}$ are assumed. It is assumed that, for the target and the candidates, the speech realization is available for comparison. $D(t, c_i)$ is the dissimilarity between the speech realizations of the target $t$ and candidate $c_i$, which is a symmetric function. $C(t, c_i)$ is the target cost between the target specification of $t$ and candidate $c_i$. For the given target and with respect to each available candidate, the dissimilarity based ordering of candidates and the target cost based ranking are compared to calculate the disorder. The total disorder is the sum of the fourth column.

We define the disorder with respect to this target and the two candidates as follows:

$\delta_t(u, v) = \ldots$

For each of the phonemes $p$ in the phoneme set, let $U_p$ be the complete set of candidates in the corpus with that phonemic label. Using the leave-one-out technique, considering each of the elements from this set as a target and all the others as candidates, the total disorder for that phoneme is calculated for a particular target cost function as follows:

$\Delta = \sum \ldots$
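The leave-one-out computation can be sketched as follows. Since the exact forms of $\delta_t(u, v)$ and $\Delta$ are not legible in this copy, the sketch rests on two loudly stated assumptions: the pairwise disorder is taken to be 0 when condition (5.6) holds and 1 otherwise, and the total is taken over unordered pairs of candidates. All function names are hypothetical.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pair_disorder(c_u, c_v, d_u, d_v):
    """Assumed 0/1 disorder: 0 if costs and dissimilarities order u and v alike."""
    return 0 if sign(c_u - c_v) == sign(d_u - d_v) else 1


def total_disorder(units, cost, dissimilarity):
    """Leave-one-out: each unit in turn is the target, all others are candidates."""
    total = 0
    for j, t in enumerate(units):
        candidates = [u for k, u in enumerate(units) if k != j]
        for u, v in combinations(candidates, 2):
            total += pair_disorder(cost(t, u), cost(t, v),
                                   dissimilarity(t, u), dissimilarity(t, v))
    return total


# Toy units are numbers; the dissimilarity is their distance.
units = [0.0, 1.0, 3.0, 7.0]
dist = lambda a, b: abs(a - b)
print(total_disorder(units, cost=dist, dissimilarity=dist))                     # perfect agreement
print(total_disorder(units, cost=lambda a, b: -dist(a, b), dissimilarity=dist)) # reversed ranking
```

With four units there are three candidate pairs per target, so a fully reversed ranking disagrees on all twelve comparisons in this toy example.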

We take a dissimilarity measure similar to that in (Latacz et al., 2011) for the acoustic modality. Here, we describe the function that we have used to compare two speech segments. It gives an estimate of their dissimilarity. We considered four components to constitute the dissimilarity measure $D(u, v)$ between units $u$ and $v$ of a particular phoneme $p$, as follows:

\[ D(u, v) = w_{dur} D_{dur}(u, v) + w_{ac} D_{ac}(u, v) + w_{vs} D_{vs}(u, v) + w_{f0} D_{f0}(u, v) \qquad (5.9) \]

$D_{dur}$, $D_{ac}$, $D_{vs}$ and $D_{f0}$ are the components in terms of the duration, acoustic speech, visual speech and f0 of the units, and $w_{dur}$, $w_{ac}$, $w_{vs}$ and $w_{f0}$ are the weights given to these respective components. The duration dissimilarity $D_{dur}$ is calculated as the difference between the durations of the two units $u$ and $v$, $dur_u$ and $dur_v$ respectively, normalized to make the value lie in the range $[0, 1]$. Let $dur_{min}(p) = \min_{u,v \in U_p} |dur_u - dur_v|$ and $dur_{max}(p) = \max_{u,v \in U_p} |dur_u - dur_v|$ be the minimum and maximum duration differences among the units of phoneme $p$. Then, the duration dissimilarity component is calculated as follows:

\[ D_{dur}(u, v) = \frac{|dur_u - dur_v| - dur_{min}(p)}{dur_{max}(p) - dur_{min}(p)} \qquad (5.10) \]
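Equation (5.10) is a min-max normalization over the pairwise duration differences of one phoneme. A minimal sketch follows, assuming that the minimum and maximum are taken over pairs of distinct units and that a degenerate range maps to 0; the function name, the unit identifiers and the durations are hypothetical.

```python
from itertools import combinations


def make_duration_dissimilarity(durations):
    """Build D_dur(u, v) for one phoneme from a dict unit id -> duration."""
    # Assumption: min and max run over pairs of distinct units.
    diffs = [abs(a - b) for a, b in combinations(durations.values(), 2)]
    d_min, d_max = min(diffs), max(diffs)
    span = d_max - d_min

    def d_dur(u, v):
        if span == 0:
            return 0.0  # assumption: all differences are equal, nothing to rank
        return (abs(durations[u] - durations[v]) - d_min) / span

    return d_dur


# Hypothetical durations in milliseconds for three units of the same phoneme.
d_dur = make_duration_dissimilarity({"a": 80, "b": 100, "c": 140})
print(d_dur("a", "b"), d_dur("b", "c"), d_dur("a", "c"))
```

The closest pair maps to 0 and the most distant pair to 1, so the component is comparable with the other normalized components of equation (5.9).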

For the other three components (acoustic, visual and f0), the RMSE (root mean squared error) is calculated between the two trajectories of the respective features, after making the duration or number of samples $N$ equal by simple linear interpolation.

$d_{rmse}(u, v) = \ldots$

... MFCC as explained in section 5.3.3. Let $d_{min}(p) = \min_{u,v \in U_p} d_{rmse}(u, v)$ and $d_{max}(p) = \max_{u,v \in U_p} d_{rmse}(u, v)$ be the minimum and maximum RMSEs among all the units of phoneme $p$. The RMSE is normalized, similarly to $D_{dur}$, to make the value lie in the range $[0, 1]$ using $d_{min}(p)$ and $d_{max}(p)$:

\[ D_{rmse}(u, v) = \frac{d_{rmse}(u, v) - d_{min}(p)}{d_{max}(p) - d_{min}(p)} \qquad (5.12) \]
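The RMSE between two trajectories of different lengths can be sketched as below. This is an illustration only: it handles one-dimensional trajectories, resamples both to the longer length (an assumption, since the choice of $N$ is not stated here), and all function names are hypothetical.

```python
import math


def resample(track, n):
    """Linearly interpolate a 1-D trajectory to n samples."""
    if len(track) == 1 or n == 1:
        return [track[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(track) - 1) / (n - 1)   # fractional index into track
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out


def rmse(track_u, track_v):
    """Root mean squared error after equalising the number of samples."""
    n = max(len(track_u), len(track_v))  # assumption: use the longer length
    a, b = resample(track_u, n), resample(track_v, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)


def normalise(d, d_min, d_max):
    """Min-max normalisation of an RMSE value, as in equation (5.12)."""
    return (d - d_min) / (d_max - d_min)


print(resample([0.0, 2.0], 3))
print(rmse([0.0, 0.0], [3.0, 3.0, 3.0]))
print(normalise(2.0, 1.0, 3.0))
```

For multi-dimensional features such as MFCC vectors, the squared difference would have to be summed over the dimensions as well; that extension is not shown.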

5.3.2.3 Primitives of the algorithm

The main idea behind the algorithm to be described is that each target feature has some contributing information which gets reflected in speech. If a useful feature is removed from the target cost, then the performance of the target cost function should deteriorate. The extent to which it deteriorates when a target feature is excluded quantifies the feature's importance. We estimate the relative importance of a target feature based on the deterioration of selection performance when that feature is excluded from the target cost. This is explained in detail in the following discussion. For simplicity of notation, we stop showing a candidate and a target with the target cost function. Let us assume that the current set of target features is $\mathcal{F}$, and that the current feature being considered is $f$. Let us denote the singleton feature set $\{f\}$ by $F$, and let $F^c = \mathcal{F} - F$. Let us express the target cost function as follows:

\[ \underbrace{TC}_{(1)} = w_F \, TC_F + (1 - w_F) \, \underbrace{TC_{F^c}}_{(2)} \]

The target cost (TC) shown above is the weighted sum of the following two components:

(a) The target cost function with one feature $f$, $TC_F$.

(b) A target cost function which excludes feature $f$ from the target feature set, $TC_{F^c}$.

The target cost function highlighted as (1) in the above equation takes all the features into account, and the target cost function highlighted as (2) excludes feature $f$. Using (1) and (2) as the two target costs, two disorders are calculated. The disorder calculated using (1) is referred to as the Combined Disorder (CD), which also depends on $w_F$. The disorder calculated using (2) is referred to as the Exclusion Disorder (ED). The following can be said with respect to the comparison of CD and ED:

• A feature $f$ is considered to contribute information if the disorder increases when it is excluded from the target cost: $ED_f > CD$.

• A feature $f$ is considered to contribute noise if the disorder decreases when it is excluded from the target cost: $ED_f < CD$.

The weights of features which contribute information should be increased in proportion to their contribution. The weights of features which seem to contribute noise should be decreased until they become contributing features; if a feature contributes only noise for long, it is eliminated from the feature set.
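The comparison of ED and CD can be sketched as below. This is a minimal sketch, assuming that a hypothetical callable `disorder(weights)` returns the total disorder of the target cost built from the given feature weights; excluding a feature is modelled by dropping it from the weight dictionary, which leaves the relative weights of the remaining features unchanged. The toy disorder function and feature names are invented for illustration.

```python
def classify_features(weights, disorder, eps=0.0):
    """Label each feature by comparing its exclusion disorder (ED) with CD.

    weights  : dict feature -> weight
    disorder : callable taking such a dict and returning the total disorder
    """
    cd = disorder(weights)  # combined disorder, all features included
    labels = {}
    for f in weights:
        reduced = {g: w for g, w in weights.items() if g != f}
        ed = disorder(reduced)  # exclusion disorder for feature f
        if ed - cd > eps:
            labels[f] = "information"   # ED > CD
        elif cd - ed > eps:
            labels[f] = "noise"         # ED < CD
        else:
            labels[f] = "neutral"       # ED = CD
    return cd, labels


# Invented stand-in for a real disorder computation on a corpus.
def toy_disorder(w):
    return 10 - 4 * ("stress" in w) + 3 * ("random" in w)


cd, labels = classify_features({"stress": 1.0, "random": 1.0, "dup": 1.0}, toy_disorder)
print(cd, labels)
```

The tolerance `eps` is an added convenience for deciding when ED and CD count as equal.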

The following possibilities need to be considered while classifying features as informative:

(1) Features might provide information if given an optimum weight (in the weight combination). Excluding these features might modify the disorder compared to their inclusion, and the increase or decrease depends on the combination of weights.

(2) Features which do not provide any information will not affect the disorder with their exclusion and inclusion, even with a change in their relative weight in the target cost.

(3) Features which contribute only noise by their inclusion in the target cost: regardless of the non-zero weight given to them, the combined disorder will always be greater than the disorder with their exclusion.

Based on this analysis we developed an iterative algorithm. At any iteration, the weights are updated based on the comparison of the ED of the different features and the CD as follows:

• Those features for which $ED > CD$: their weights increase. The increase is proportional to the difference between ED and CD.

• Those features for which $ED < CD$: they can belong to either category (1) or (3). The feature weights are updated in proportion to the difference between CD and ED. A feature which shows this trend ($ED < CD$) for long is eliminated from the feature set.

• Features belonging to category (2) are also eliminated ($ED = CD$).

• A fraction of the total weight from the set of features for which $ED < CD$ is distributed among the features for which $ED > CD$.

• To make the change in the weights slow, the weights at each iteration are made a function of the previous iteration. Any new weight after an iteration is a fraction (fixed parameter) of the weight in the previous iteration plus a redistributed component.

5.3.2.4 Algorithm

We provide the precise details of the algorithm here. Notation: for any iteration $i$, the complete set of features is $\mathcal{F}_i$; a singleton set having feature $f$ is denoted by the set $F$; the set of features excluding a feature $f$ from set $\mathcal{F}_i$ is $F_i^c = \mathcal{F}_i - F$; the disorder with the complete set of features and their weights at iteration $i$ (from the previous iteration), i.e., the combined disorder CD, is $\Delta(i)$; the disorder with a feature $f$ excluded from the target cost (ED) is $\Delta_{F_i^c}(i)$; the set of all the features for which $\Delta_{F_i^c}(i) > \Delta(i)$ is denoted by $\mathcal{F}_i^+$, and $\mathcal{F}_i^-$ denotes those which are qualified to remain in the feature set with $\Delta_{F_i^c}(i) < \Delta(i)$; the set of all features which are being eliminated is $\mathcal{F}_i^0$. For a feature $f$, $t_f(i)$ is the number of iterations it has been in $\mathcal{F}_i^-$ consecutively till iteration $i$ without being eliminated.

At every iteration $i$, the following quantities are calculated for updating the feature weights:

Information Component ($I_F(i)$): for a feature $f \in \mathcal{F}_i^+$, i.e. $\Delta(i) < \Delta_{F_i^c}(i)$:

Based on this, $N_F(i)$ is calculated as follows to update the weight at every iteration.

$N_F(i) = (1 - N_F(i)) \ldots$

Features which contribute more noise will subsequently lose more weight in the target function. In case there is only one feature in $\mathcal{F}_i^-$, then $N_F(i) = 1$.

The following are the parameters of the algorithm:

• $T$, the maximum number of tolerant iterations for a noisy feature. A feature $f$ for which $t_f(i) > T$ is eliminated from the feature list. If a feature $f$ changes from set $\mathcal{F}_i^-$ to set $\mathcal{F}_i^+$ in an iteration $i$, then $t_f(i)$ is set to $0$.

• $\alpha^-$ and $\alpha^+$, the fractions of the weights of any features in $\mathcal{F}_i^-$ and $\mathcal{F}_i^+$ respectively that are carried forward from the weight in the previous iteration. This makes the updated weight in the current iteration a function of the weight in the previous iteration. It is done to make the change in weights slow.

• $\beta$, the fraction of the total changeable weight in $\mathcal{F}_i^-$ that is gained by features in $\mathcal{F}_i^+$. The logic behind this distribution is that features in $\mathcal{F}_i^-$ lose weight while features in $\mathcal{F}_i^+$ gain weight.

• Maximum allowed iterations, for which the algorithm is executed. This is fixed based on the rate of change in total disorder (decrease in combined disorder per iteration).

The goal of the algorithm is to select the set of features and tune their respective weights in such a way that the disorder $\Delta$ described by equation (5.8) is minimized:

• Beginning: Target cost function with the complete set of features, which are assigned equal weights.

• At every iteration $i$:

  ◦ The following are first determined:

    ◦ $\Delta(i)$.

    ◦ for all $f \in \mathcal{F}_i$: $\Delta_{F^c}(i)$.

  ◦ Elimination of all those features $f$ for which one of the following conditions is satisfied:

    1. $(\Delta(i) - \Delta_{F^c}(i)) \approx 0$

    2. $(\Delta(i) - \Delta_{F^c}(i)) > 0$ and $t_f(i) > T$

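  ◦ Update weights: The update is such that the change is slow. For that, a fraction of the weight ($\alpha^+$ for features in $\mathcal{F}_i^+$ and $\alpha^-$ for features in $\mathcal{F}_i^-$) remains constant with respect to the previous iteration.

    For a feature $f \in \mathcal{F}_i^+$: the more information in the feature, the higher the weight.

    \[ w_F(i) = \underbrace{\alpha^+ w_F(i-1)}_{(1)} + \underbrace{W_{\mathcal{F}_i^+} \, I_F(i)}_{(2)} \qquad (5.17) \]

    The first component (1) depends on the feature weight in the previous iteration; the second component (2) depends on the information component of the feature. $W_{\mathcal{F}_i^+}$ is the total weight that will be redistributed in $\mathcal{F}_i^+$. $W_{\mathcal{F}_i^+}$ is calculated as follows:

    \[ W_{\mathcal{F}_i^+} = \underbrace{(1 - \alpha^+) \sum_{a \in \mathcal{F}_i^+} w_A(i-1)}_{(i)} + \underbrace{\beta (1 - \alpha^-) \sum_{a \in \mathcal{F}_i^-} w_A(i-1)}_{(ii)} + \underbrace{\sum_{a \in \mathcal{F}_i^0} w_A(i-1)}_{(iii)} \qquad (5.18) \]

    The first component (i) is the total changeable weight of features in $\mathcal{F}_i^+$; the second component (ii) is the total changeable weight of features in $\mathcal{F}_i^-$ that is gained by features in $\mathcal{F}_i^+$; the third component (iii) is the total weight of the features being eliminated, $\mathcal{F}_i^0$. The total weight of the features being eliminated, $\mathcal{F}_i^0$, is redistributed among features in $\mathcal{F}_i^+$.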

    For a feature $f \in \mathcal{F}_i^-$: the lesser the noise contribution, the higher the weight.

    \[ w_F(i) = \underbrace{\alpha^- w_F(i-1)}_{(1)} + \underbrace{W_{\mathcal{F}_i^-} \, N_F(i)}_{(2)} \qquad (5.19) \]

    The first component (1) depends on the weight of the feature $f$ in the previous iteration; the second component (2) depends on the Noise Component of feature $f$. $W_{\mathcal{F}_i^-}$ is the fraction of the total changeable weight of features in $\mathcal{F}_i^-$ that is redistributed to features in $\mathcal{F}_i^-$ itself. It is calculated as follows:

    \[ W_{\mathcal{F}_i^-} = (1 - \alpha^-)(1 - \beta) \sum_{a \in \mathcal{F}_i^-} w_A(i-1) \qquad (5.20) \]

• Termination: The algorithm is terminated when the maximum number of allowed iterations has been executed, or when there is no improvement (decrease in combined disorder) in an iteration beyond a certain $\epsilon$. The best weights with respect to the least disorder along all the iterations are chosen for the final target cost for the phoneme.
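One iteration of the elimination and weight update steps above can be sketched as follows. This is a hedged sketch, not the implementation used in this work. Several parts are assumptions because the corresponding definitions are not legible in this copy: the information component is taken as the feature's share of the total $ED - CD$ over $\mathcal{F}_i^+$; the noise component is taken as one minus the feature's share of the total $CD - ED$, divided by $|\mathcal{F}_i^-| - 1$ so that it sums to one (and set to 1 for a single feature, as stated above); and the weight redistributed in $\mathcal{F}_i^+$ is built from the three components described in the text. The function and parameter names are hypothetical.

```python
def update_weights(weights, cd, ed, t, alpha_plus=0.5, alpha_minus=0.5,
                   beta=0.5, T=3, eps=1e-9):
    """One iteration: returns (new weights, new noisy-iteration counters).

    weights : dict feature -> weight from the previous iteration
    cd      : combined disorder; ed : dict feature -> exclusion disorder
    t       : dict feature -> consecutive iterations spent in the noisy set
    """
    plus, minus, removed, new_t = [], [], [], {}
    for f in weights:
        gain = ed[f] - cd
        if abs(gain) <= eps:              # ED = CD: no information, eliminate
            removed.append(f)
        elif gain > 0:                    # ED > CD: informative, counter reset
            plus.append(f)
            new_t[f] = 0
        else:                             # ED < CD: noisy for one more iteration
            count = t.get(f, 0) + 1
            if count > T:
                removed.append(f)
            else:
                minus.append(f)
                new_t[f] = count

    s_plus = sum(weights[f] for f in plus)
    s_minus = sum(weights[f] for f in minus)
    s_removed = sum(weights[f] for f in removed)
    # Assumed three-component total for the informative set, and equation (5.20).
    w_plus = (1 - alpha_plus) * s_plus + beta * (1 - alpha_minus) * s_minus + s_removed
    w_minus = (1 - alpha_minus) * (1 - beta) * s_minus

    info_total = sum(ed[f] - cd for f in plus)
    noise_total = sum(cd - ed[f] for f in minus)
    new = {}
    for f in plus:    # equation (5.17) with the assumed information component
        new[f] = alpha_plus * weights[f] + w_plus * (ed[f] - cd) / info_total
    for f in minus:   # equation (5.19) with the assumed noise component
        if len(minus) == 1:
            n = 1.0
        else:
            n = (1 - (cd - ed[f]) / noise_total) / (len(minus) - 1)
        new[f] = alpha_minus * weights[f] + w_minus * n
    # Note: if no feature is informative, w_plus is not redistributed here.
    return new, new_t


weights = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
new, counters = update_weights(weights, cd=10, ed={"a": 14, "b": 12, "c": 9, "d": 10}, t={})
print(new, counters)
```

Under these assumptions the total weight is conserved whenever at least one feature is informative, which is the property the redistribution terms are designed to provide.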
