Unit selection and concatenation - Target feature selection and weight tuning

Chapter 5 Unit Selection

5.3 Target feature selection and weight tuning

5.3.1 Unit selection and concatenation

We briefly revisit the unit selection framework for speech synthesis. A typical TTS (text-to-speech synthesis) algorithm can be broadly divided into two steps: generation of the specification and the actual synthesis. This division is made to separate the steps which perform a target cost calculation from those which do not. In the first stage, the text to be synthesized is analyzed. This stage produces the specification of the phoneme sequence to be synthesized, $t_1^n = (t_1, \ldots, t_j, \ldots, t_n)$, that is, $n$ phonemes starting from $1$, for the input text. The second stage does the actual synthesis of the required phoneme sequence in two steps: pre-selection and final selection through lattice resolution. This second synthesis stage depends on the target cost calculation for its synthesis performance. The target cost calculation is done by the comparison

tel-00927121, version 1 - 1 1 Jan 2014

are 'perceptually' similar are pre-selected for the final search based on this target cost. A general target function is calculated as follows:

$$C(t_i, u_i) = \sum_{\rho=1}^{F} w_\rho \, C_\rho(t_i, u_i) \tag{5.4}$$

where $t_i$ and $u_i$ are the target and a candidate; $F$ is the number of target features; $C_\rho(t_i, u_i)$ $(\rho = 1, \ldots, F)$ are the target feature costs between the elements of the target and candidate feature vectors; and $w_\rho$ is the weight of feature $\rho$.
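As a minimal sketch of equation (5.4), the weighted sum can be written as below. The 0/1 `mismatch` cost and the example feature names are assumptions made for illustration only; the text does not specify how the individual feature costs $C_\rho$ are computed.

```python
from typing import Callable, Sequence

# Hypothetical signature: a per-feature cost maps
# (target value, candidate value) to a number.
FeatureCost = Callable[[object, object], float]


def mismatch(target_value: object, candidate_value: object) -> float:
    """Assumed 0/1 cost for a categorical feature (not specified in the text)."""
    return 0.0 if target_value == candidate_value else 1.0


def target_cost(
    target: Sequence[object],
    candidate: Sequence[object],
    weights: Sequence[float],
    feature_costs: Sequence[FeatureCost],
) -> float:
    """Weighted sum of the F per-feature costs, following equation (5.4)."""
    if not (len(target) == len(candidate) == len(weights) == len(feature_costs)):
        raise ValueError("all arguments must have one entry per target feature")
    return sum(
        w * cost(t, u)
        for w, cost, t, u in zip(weights, feature_costs, target, candidate)
    )


# Illustrative feature vectors: (phoneme, stress, syllable position).
example = target_cost(
    ["a", "stressed", "onset"],
    ["a", "unstressed", "onset"],
    [0.5, 0.3, 0.2],
    [mismatch, mismatch, mismatch],
)
```

Only the second feature differs in this example, so the resulting cost is the weight of that feature.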

The selection among the set of pre-selected candidates is performed by resolving a lattice of candidates using the Viterbi algorithm. The result of this selection is a path in the lattice of candidates which minimizes a weighted linear combination of three costs: the target cost (

where $w$, $w_{aj}$ and $w_{vj}$ are the weights for the component target cost, acoustic join cost and visual join cost. We choose these weights as explained in (Toutios et al., 2011) (see section 6.1.2).
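The lattice resolution can be illustrated with a small Viterbi search. This is a sketch under stated assumptions, not the system's implementation: `viterbi_select`, its arguments and the string candidate labels are hypothetical, and the two callables are assumed to return costs that are already weighted (the join cost standing for the weighted sum of the acoustic and visual join costs).

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_select(
    lattice: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    join_cost: Callable[[str, str], float],
) -> Tuple[List[str], float]:
    """Return the cheapest path through the lattice and its total cost.

    lattice[j] holds the pre-selected candidates for target position j.
    """
    if not lattice or any(len(column) == 0 for column in lattice):
        raise ValueError("every target position needs at least one candidate")
    # best[k] = (cost of the cheapest path ending in candidate k, that path)
    best = [(target_cost(0, u), [u]) for u in lattice[0]]
    for j in range(1, len(lattice)):
        new_best = []
        for u in lattice[j]:
            # Cheapest way to reach u from any candidate of the previous position.
            cost, path = min(
                (
                    (prev_cost + join_cost(prev_path[-1], u), prev_path)
                    for prev_cost, prev_path in best
                ),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(j, u), path + [u]))
        best = new_best
    total, path = min(best, key=lambda item: item[0])
    return path, total
```

Because join costs are accumulated along the path, the selected sequence can differ from the one obtained by taking the candidate with the lowest target cost at each position independently.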

An ideal target cost function

The purpose of the target cost function is to rank candidates in the order of their suitability to fit a target position during synthesis. Each candidate is assigned a cost (a positive real number) by the target cost function; the lower the cost, the more suitable the candidate is for a target position. If we assume that there is a metric to measure the perceptual dissimilarity between a target and a candidate, then ideally the ranking of candidates based on their target costs should be the same as the ordering based on their perceptual dissimilarities to the target.

At the time of synthesis, the target specification only has the target feature description, but no acoustic or visual speech realization. So the decision is made based on the target cost. Hence, an optimum target cost function is very important for good synthesis results. A good set of target features and well-tuned weights define a good target cost function. The following section presents a simple and robust iterative algorithm to simultaneously eliminate redundant


5.3.2 Target feature selection and weight tuning

The algorithm to be described alleviates the problem of redundancy and noise that sets in due to the exhaustive set of features considered. Its importance is also due to the fact that, with a large set of features, it is practically infeasible to have a corpus which covers all possible feature combinations. The algorithm uses the corpus, for which we have both actual speech realizations and target feature descriptions for each of the candidates present in it.

For any speech segment, there are possible variants which are perceptually considered good alternatives. But it is practically impossible to rank candidates in terms of their absolute perceptual quality with respect to any target. Being 'similar' to an already existing speech unit is a reasonable way to say how well a candidate will fit in a 'target' position. If we devise a way to measure the dissimilarity between two units, it can be used on the candidates in the corpus, since they have both the target feature description and the speech realization available. The comparison between the ordering obtained by this measure and the ranking using the target cost can be used to evaluate the target function. In the following paragraphs, we define two things necessary for the evaluation of a target function: disorder with respect to a target cost function and dissimilarity between two speech realizations.

5.3.2.1 Disorder

The disagreement between the ranking of candidates given by the target cost function and the ordering given by the dissimilarity measure needs to be quantified. With respect to a particular target $t$ whose speech is available, the candidate ranking based on the target cost function should be in agreement with their dissimilarity-based ordering. We refer to the ordering based on the target cost as ranking. Consider a target $t$ and two candidates $u$ and $v$. With respect to the target $t$, let their dissimilarity measures be $D(t, u)$ and $D(t, v)$, and their target costs be $C(t, u)$ and $C(t, v)$. Then for an ideal target cost function, one of the following three conditions should be true:

$$\begin{aligned}
C(t, u) < C(t, v) &\Leftrightarrow D(t, u) < D(t, v) \\
C(t, u) = C(t, v) &\Leftrightarrow D(t, u) = D(t, v) \\
C(t, u) > C(t, v) &\Leftrightarrow D(t, u) > D(t, v)
\end{aligned}$$

The dissimilarity measure is based on the comparison of two speech realizations. We assume that similar speech realizations are perceptually similar. This assumption implies that the


[Table 5.4, column headers: Target Cost, Dissimilarity]

Table 5.4: This table illustrates the idea of comparing a dissimilarity-based ordering with the ranking assigned based on the target cost. A target $t$ and four candidates $\{c_1, c_2, c_3, c_4\}$ are assumed. It is assumed that for the target and the candidates, the speech realization is available for comparison. $D(t, c_i)$ is the dissimilarity between the speech realizations of the target $t$ and candidate $c_i$, which is a symmetric function. $C(t, c_i)$ is the target cost between the target specification of $t$ and candidate $c_i$. For the given target and with respect to each available candidate, the dissimilarity-based ordering of candidates and the target-cost-based ranking are compared to calculate the disorder. The total disorder is the sum of the fourth column.

the dissimilarity measure we are expressing the difference in their speech realizations. Our approach is based on the idea that the ordering given by an ideal target cost function should agree with the ordering given by this dissimilarity measure. During pre-selection, the target cost function assigns a ranking to the available candidates, for pruning the less suitable candidates. For this reason, we refer to the target-cost-based ordering as ranking. Unlike some systems, we do not train the target cost function to compute the dissimilarity (Hunt and Black, 1996). We only focus on the candidate ordering given by the target cost function. The above three conditions state that the comparison of two candidates for a target position based on their target costs would be similar to that based on their dissimilarity to the target, if the target were to have a speech realization available (hypothetical). We denote the above three conditions by the following:

$$C(t, u) \ast C(t, v) \Leftrightarrow D(t, u) \ast D(t, v) \tag{5.6}$$

where $\ast \in \{<, =, >\}$.

We define the disorder with respect to this target and the two candidates as follows:

$\delta_t(u, v) =$


For each of the phonemes $p$ in the phoneme set, let $U_p$ be the complete set of candidates in the corpus with that phonemic label. Using the leave-one-out technique, considering each of the elements from this set as a target and all the others as candidates, the total disorder for that phoneme is calculated for a particular target cost function as follows:

$\Delta = \sum$
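The leave-one-out computation can be sketched as follows. The right-hand sides of the two definitions above are not legible in this copy, so the sketch assumes that the pairwise disorder is a 0/1 indicator of disagreement with condition (5.6) and that the total disorder is the plain sum over all targets and all candidate pairs; both choices, and the function names, are assumptions.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

Unit = TypeVar("Unit")


def sign(x: float) -> int:
    """Return -1, 0 or 1, encoding the relations <, = and >."""
    return (x > 0) - (x < 0)


def pair_disorder(
    t: Unit,
    u: Unit,
    v: Unit,
    cost: Callable[[Unit, Unit], float],
    dissim: Callable[[Unit, Unit], float],
) -> int:
    """Assumed 0/1 disorder: 1 when the relation between the two target costs
    differs from the relation between the two dissimilarities (cf. 5.6)."""
    same = sign(cost(t, u) - cost(t, v)) == sign(dissim(t, u) - dissim(t, v))
    return 0 if same else 1


def total_disorder(
    units: Sequence[Unit],
    cost: Callable[[Unit, Unit], float],
    dissim: Callable[[Unit, Unit], float],
) -> int:
    """Leave-one-out: each unit is the target once, all others are candidates."""
    pool = list(units)
    total = 0
    for i, t in enumerate(pool):
        candidates = pool[:i] + pool[i + 1:]
        for u, v in combinations(candidates, 2):
            total += pair_disorder(t, u, v, cost, dissim)
    return total
```

A target cost that orders candidates exactly like the dissimilarity measure yields a total disorder of zero under these assumptions.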

We take a dissimilarity measure similar to that in (Latacz et al., 2011) for the acoustic modality. Here, we describe a function that we have used to compare two speech segments. It gives an estimate of their dissimilarity. We considered four components to constitute the dissimilarity measure $D(u, v)$ between units $u$ and $v$ of a particular phoneme $p$ as follows:

$$D(u, v) = w_{dur} D^{dur}(u, v) + w_{ac} D^{ac}(u, v) + w_{vs} D^{vs}(u, v) + w_{f0} D^{f0}(u, v) \tag{5.9}$$

$D^{dur}$, $D^{ac}$, $D^{vs}$ and $D^{f0}$ are the components in terms of the duration, acoustic speech, visual speech and f0 of the units, and $w_{dur}$, $w_{ac}$, $w_{vs}$ and $w_{f0}$ are the weights given to these respective components. The duration dissimilarity $D^{dur}$ is calculated as the difference between the durations of the two units $u$ and $v$, $dur_u$ and $dur_v$ respectively, normalized to make the value lie in the range $[0, 1]$. $dur_{min}(p) = \min_{u,v \in U_p} |dur_u - dur_v|$ and $dur_{max}(p) = \max_{u,v \in U_p} |dur_u - dur_v|$ are the minimum and maximum duration differences among the units of phoneme $p$. Then, the duration dissimilarity component is calculated as follows:

$$D^{dur}(u, v) = \frac{|dur_u - dur_v| - dur_{min}(p)}{dur_{max}(p) - dur_{min}(p)} \tag{5.10}$$
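A minimal sketch of the normalization in equation (5.10) follows. It assumes that the minimum and maximum are taken over pairs of distinct units, and it returns zero for the degenerate case where all pairs differ by the same amount, which the text does not cover; the function name and the dictionary layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def duration_dissimilarities(durations: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Min-max normalised |dur_u - dur_v| for every unordered pair of units
    of one phoneme, following equation (5.10)."""
    diffs = {
        (u, v): abs(durations[u] - durations[v])
        for u, v in combinations(durations, 2)
    }
    if not diffs:
        raise ValueError("need at least two units")
    dur_min, dur_max = min(diffs.values()), max(diffs.values())
    if dur_max == dur_min:
        # Degenerate case (an assumption): every pair differs equally.
        return {pair: 0.0 for pair in diffs}
    return {pair: (d - dur_min) / (dur_max - dur_min) for pair, d in diffs.items()}


# Durations in milliseconds for three hypothetical units of one phoneme.
scores = duration_dissimilarities({"u1": 80.0, "u2": 100.0, "u3": 160.0})
```

The pair with the smallest duration difference maps to 0 and the pair with the largest to 1, so the component is comparable across phonemes with different typical durations.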

For the other three components (acoustic, visual and f0), the RMSE (root mean squared error) is calculated between the two trajectories of the respective features, after making the duration or number of samples $N$ equal by simple linear interpolation.

$d^{rmse}(u, v) =$


MFCC as explained in section 5.3.3.

$d_{min}(p) = \min_{u,v \in U_p} d^{rmse}(u, v)$ and $d_{max}(p) = \max_{u,v \in U_p} d^{rmse}(u, v)$ are the minimum and maximum RMSEs among all the units of phoneme $p$. The RMSE is normalized, similarly to $D^{dur}$, to make the value lie in the range $[0, 1]$ using $d_{min}(p)$ and $d_{max}(p)$:

$$D^{rmse}(u, v) = \frac{d^{rmse}(u, v) - d_{min}(p)}{d_{max}(p) - d_{min}(p)} \tag{5.12}$$
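The RMSE between two trajectories of unequal length, followed by the normalization of equation (5.12), can be sketched as below. The sketch handles one-dimensional trajectories only (for example f0), resamples both to the longer of the two lengths, and uses hypothetical function names; the choice of the common length $N$ is an assumption, since the text only states that the lengths are equalized by linear interpolation.

```python
import math
from typing import List, Sequence


def resample(track: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a trajectory to n equally spaced samples."""
    if n < 2 or len(track) < 2:
        raise ValueError("need at least two samples")
    out = []
    for k in range(n):
        pos = k * (len(track) - 1) / (n - 1)  # fractional index into track
        lo = min(int(pos), len(track) - 2)    # left neighbour of pos
        frac = pos - lo
        out.append(track[lo] * (1.0 - frac) + track[lo + 1] * frac)
    return out


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root mean squared error after equalising the number of samples."""
    n = max(len(a), len(b))  # assumed choice of the common length N
    ra, rb = resample(a, n), resample(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)) / n)


def normalise(d: float, d_min: float, d_max: float) -> float:
    """Min-max normalisation of equation (5.12)."""
    return (d - d_min) / (d_max - d_min)
```

For multi-dimensional features such as MFCC vectors, the same resampling would be applied to each dimension before the error is accumulated.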

5.3.2.3 Primitives of the algorithm

The main idea behind the algorithm to be described is that each target feature has some contributing information which gets reflected in speech. If a useful feature is removed from the target cost, then the performance of the target cost function should deteriorate. The extent to which it deteriorates when a target feature is excluded quantifies the feature's importance. We estimate the relative importance of a target feature based on the deterioration of selection performance when that feature is excluded from the target cost. This is explained in detail in the following discussion. For simplicity of notation, we stop showing a candidate and a target with the target cost function. Let us assume that the current set of target features is $\mathcal{F}$, and the current feature being considered is $f$. Let us denote the singleton feature set $\{f\}$ by $F$, and let $\mathcal{F}^c = \mathcal{F} - F$. Let us express the target cost function as follows:

$$TC = \underbrace{w_F \, TC_F + (1 - w_F) \underbrace{TC_{\mathcal{F}^c}}_{(2)}}_{(1)}$$

The target cost (TC) shown above is the weighted sum of the following two components:

(a) The target cost function with one feature $f$, $TC_F$

(b) A target cost function which excludes feature $f$ from the target feature set, $TC_{\mathcal{F}^c}$

The target cost function highlighted as (1) in the above equation takes all the features into account, and the target cost function highlighted as (2) excludes feature $f$. Using (1) and (2) as the two target costs, two disorders are calculated. The disorder calculated using (1) is referred to as Combined Disorder (CD), which also depends on $w_F$. The disorder calculated using (2) is referred to as Exclusion Disorder (ED). The following can be said with respect to the comparison of CD and ED:

• A feature $f$ is considered to contribute information if the disorder increases when it is excluded from the target cost: $ED_f > CD$

• A feature $f$ is considered to contribute noise if the disorder decreases when it is excluded from the target cost: $ED_f < CD$
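The comparison of ED and CD can be sketched as a small classification routine. It assumes a caller-supplied `disorder` function that evaluates the total disorder for a given feature-to-weight mapping; that function, the label strings and the tolerance `tol` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Mapping


def classify_features(
    weights: Mapping[str, float],
    disorder: Callable[[Mapping[str, float]], float],
    tol: float = 0.0,
) -> Dict[str, str]:
    """Label each feature by comparing its exclusion disorder (ED)
    with the combined disorder (CD)."""
    cd = disorder(weights)  # disorder with the complete feature set
    labels = {}
    for f in weights:
        reduced = {g: w for g, w in weights.items() if g != f}
        ed = disorder(reduced)  # disorder with feature f excluded
        if ed - cd > tol:
            labels[f] = "information"  # ED > CD
        elif cd - ed > tol:
            labels[f] = "noise"        # ED < CD
        else:
            labels[f] = "neutral"      # ED = CD within tol
    return labels
```

Dropping a feature leaves the relative weights of the remaining features unchanged, which is sufficient here because only the ordering induced by the target cost matters for the disorder.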

The weights of features which contribute information should be increased in proportion to their contribution. The weights of features which seem to contribute noise should be decreased until they become contributing features; if a feature contributes only noise for long, it is eliminated from the feature set.

The following possibilities need to be considered while classifying features as informative:

(1) Features might provide information if given an optimum weight (in the weight combination). Excluding these features might modify the disorder compared to their inclusion, and the increase or decrease depends on the combination of weights.

(2) Features which do not provide any information will not affect the disorder with their exclusion or inclusion, even with a change in their relative weight in the target cost.

(3) For features which contribute only noise by their inclusion in the target cost, regardless of the non-zero weight given to them, the combined disorder will always be greater than the disorder with their exclusion.

Based on this analysis we developed an iterative algorithm. At any iteration, the weights are updated based on the comparison of the ED of different features and the CD as follows:

• For those features for which $ED > CD$, the weights increase. The increase is proportional to the difference between ED and CD.

• Those features for which $ED < CD$ can belong to either category (1) or (3). The feature weights are updated in proportion to the difference between CD and ED. A feature which shows this trend ($ED < CD$) for long is eliminated from the feature set.

• Features belonging to category (2) are also eliminated ($ED = CD$).

• A fraction of the total weight from the set of features for which $ED < CD$ is distributed among the features for which $ED > CD$.

• To make the change in the weights slow, the weights at each iteration are made a function of the previous iteration. Any new weight after an iteration is a fraction (fixed parameter)


5.3.2.4 Algorithm

We provide the precise details of the algorithm here. Notation: for any iteration $i$, the complete set of features is $\mathcal{F}_i$; a singleton set having feature $f$ is denoted by the set $F$; the set of features excluding a feature $f$ from set $\mathcal{F}_i$ is $\mathcal{F}_i^c = \mathcal{F}_i - F$; the disorder with the complete set of features and their weights at iteration $i$ (from the previous iteration), i.e., the combined disorder CD, is $\Delta(i)$; the disorder with a feature $f$ excluded from the target cost (ED) is $\Delta_{\mathcal{F}_i^c}(i)$; the set of all the features for which $\Delta_{\mathcal{F}_i^c}(i) > \Delta(i)$ is denoted by $\mathcal{F}_i^+$, and $\mathcal{F}_i^-$ denotes those which are qualified to remain in the feature set with $\Delta_{\mathcal{F}_i^c}(i) < \Delta(i)$; the set of all features which are being eliminated is $\mathcal{F}_i^0$. For a feature $f$, $t_f(i)$ is the number of iterations it has been in $\mathcal{F}_i^-$ consecutively until iteration $i$ without being eliminated.

At every iteration $i$ the following quantities are calculated for updating the feature weights:

Information Component ($I_F(i)$): For a feature $f \in \mathcal{F}_i^+$, i.e. $\Delta(i) < \Delta_{\mathcal{F}_i^c}(i)$

Based on this, $N'_F(i)$ is calculated as follows to update the weight at every iteration.

$N'_F(i) = (1 - N_F(i))$

features which contribute more noise will lose more weight in the target functions subsequently. In case there is only one feature in $\mathcal{F}_i^-$, then $N'_F(i) = 1$.

The following are the parameters of the algorithm:

• $T$, the maximum number of tolerated iterations for a noisy feature. A feature $f$ for which $t_f(i) > T$ is eliminated from the feature list. If a feature $f$ changes from set $\mathcal{F}_i^-$ to set $\mathcal{F}_i^+$ in an iteration $i$, then $t_f(i)$ is set to $0$.

• $\alpha_-$ and $\alpha_+$, the fractions of the weights of any features in $\mathcal{F}_i^-$ and $\mathcal{F}_i^+$ respectively that are carried forward from the weight in the previous iteration. This makes the updated weight in the current iteration a function of the weight in the previous iteration. It is done to make the change in weights slow.

• $\beta$, the fraction of the total changeable weight in $\mathcal{F}_i^-$ that is gained by features in $\mathcal{F}_i^+$. The logic behind this distribution is that features in $\mathcal{F}_i^-$ lose weight while features in $\mathcal{F}_i^+$ gain weight.

• Maximum allowed iterations, for which the algorithm is executed. This is fixed based on the rate of change in total disorder (decrease in combined disorder per iteration).

The goal of the algorithm is to select the set of features and tune their respective weights in such a way that the disorder $\Delta$ described by equation (5.8) is minimized:

• Beginning: Target cost function with the complete set of features, which are assigned equal weights.

• At every iteration $i$:

⋆ The following are first determined:

◦ $\Delta(i)$

◦ for all $f \in \mathcal{F}_i$, $\Delta_{\mathcal{F}_i^c}(i)$

⋆ Elimination of all those features $f$ for which one of the following conditions is satisfied:

$(\Delta(i) - \Delta_{\mathcal{F}_i^c}(i)) \approx 0$

$(\Delta(i) - \Delta_{\mathcal{F}_i^c}(i)) > 0$ and $t_f(i) > T$

⋆ Update weights: The update is such that the change is slow. For that, a fraction of the weight ($\alpha_+$ for features in $\mathcal{F}_i^+$ and $\alpha_-$ for features in $\mathcal{F}_i^-$) remains constant with respect to the previous iteration.

◦ For a feature $f \in \mathcal{F}_i^+$: the more information in the feature, the higher the weight.

$$w_F(i) = \underbrace{\alpha_+ \, w_F(i-1)}_{(1)} + \underbrace{W_{\mathcal{F}_i^+} \, I_F(i)}_{(2)} \tag{5.17}$$

The first component (1) depends on the feature weight in the previous iteration; the second component (2) depends on the information component of the feature. $W_{\mathcal{F}_i^+}$ is the total weight that will be redistributed in $\mathcal{F}_i^+$. $W_{\mathcal{F}_i^+}$ is calculated as follows:

The first component (i) is the total changeable weight of features in $\mathcal{F}_i^+$; the second component (ii) is the total changeable weight of features in $\mathcal{F}_i^-$ that is gained by features in $\mathcal{F}_i^+$; the third component (iii) is the total weight of the features being eliminated, $\mathcal{F}_i^0$. The total weight of the features being eliminated, $\mathcal{F}_i^0$, is redistributed among features in $\mathcal{F}_i^+$.

◦ For a feature $f \in \mathcal{F}_i^-$: the lesser the noise contribution, the higher the weight.

$$w_F(i) = \underbrace{\alpha_- \, w_F(i-1)}_{(1)} + \underbrace{W_{\mathcal{F}_i^-} \, N'_F(i)}_{(2)} \tag{5.19}$$

The first component (1) depends on the weight of the feature $f$ in the previous iteration; the second component (2) depends on the Noise Component of feature $f$. $W_{\mathcal{F}_i^-}$ is the fraction of the total changeable weight of features in $\mathcal{F}_i^-$ that is redistributed to features in $\mathcal{F}_i^-$ itself. It is calculated as follows:

$$W_{\mathcal{F}_i^-} = (1 - \alpha_-)(1 - \beta) \sum_{a \in \mathcal{F}_i^-} w_A(i-1) \tag{5.20}$$

• Termination: The algorithm is terminated when the maximum number of allowed iterations has been executed or when there is no improvement (decrease in combined disorder) in an iteration beyond a certain $\epsilon$. The best weights with respect to the least disorder along all the iterations are chosen for the final target cost for the phoneme.
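One weight update step can be sketched as follows, using equations (5.17), (5.19) and (5.20). Two things are assumptions: the information and noise components are passed in as already computed values, because their defining equations are not legible in this copy, and the total weight redistributed in $\mathcal{F}_i^+$ is taken to be the sum of the three components (i), (ii) and (iii) described in the text. The function and argument names are hypothetical.

```python
from typing import Dict, Iterable, Mapping


def update_weights(
    prev: Mapping[str, float],
    info: Mapping[str, float],
    noise_prime: Mapping[str, float],
    eliminated: Iterable[str],
    alpha_plus: float,
    alpha_minus: float,
    beta: float,
) -> Dict[str, float]:
    """One weight update.

    prev        -- weights w(i-1) of all features, including eliminated ones
    info        -- I_F(i) for the features in F+ (given, not computed here)
    noise_prime -- N'_F(i) for the features in F- (given, not computed here)
    eliminated  -- the features in F0
    """
    s_plus = sum(prev[f] for f in info)
    s_minus = sum(prev[f] for f in noise_prime)
    s_zero = sum(prev[f] for f in eliminated)
    # Equation (5.20): share of the changeable weight of F- kept inside F-.
    w_minus = (1.0 - alpha_minus) * (1.0 - beta) * s_minus
    # ASSUMPTION: sum of components (i), (ii) and (iii) described in the text.
    w_plus = (
        (1.0 - alpha_plus) * s_plus
        + (1.0 - alpha_minus) * beta * s_minus
        + s_zero
    )
    new = {}
    for f, i_f in info.items():
        new[f] = alpha_plus * prev[f] + w_plus * i_f      # equation (5.17)
    for f, n_f in noise_prime.items():
        new[f] = alpha_minus * prev[f] + w_minus * n_f    # equation (5.19)
    return new
```

Under these assumptions, if the information components sum to one over $\mathcal{F}_i^+$ and the noise components sum to one over $\mathcal{F}_i^-$, the total weight is preserved from one iteration to the next, with the weight of eliminated features passing to $\mathcal{F}_i^+$.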

From the document: Synthèse Acoustico-Visuelle de la Parole par Séléction d'Unités Bimodales ~ Association Francophone de la Communication Parlée (pages 86-95)
