Chapter 5 Unit Selection
5.3 Target feature selection and weight tuning
5.3.1 Unit selection and concatenation
We briefly revisit the unit selection framework for speech synthesis. A typical TTS (text-to-speech synthesis) algorithm can be broadly divided into two steps: generation of the specification and the actual synthesis. This division is made to separate the steps which perform a target cost calculation from those which do not. In the first stage, the text to be synthesized is analyzed. This stage produces the specification of the phoneme sequence to be synthesized, $t_1^n = (t_1, \ldots, t_j, \ldots, t_n)$, $n$ phonemes starting from 1, for the input text. The second stage does the actual synthesis of the required phoneme sequence in two steps, pre-selection and final selection through lattice resolution. This second synthesis stage depends on the target cost calculation for its synthesis performance. The target cost calculation is done by the comparison
tel-00927121, version 1 - 1 1 Jan 2014
[…] are `perceptually' similar are pre-selected for the final search based on this target cost. A general target function is calculated as follows:
$$C(t_i, u_i) = \sum_{\rho=1}^{F} w_\rho \, C_\rho(t_i, u_i) \tag{5.4}$$
where $t_i$, $u_i$ are the target and a candidate; $F$ is the number of target features; $C_\rho(t_i, u_i)$ $(\rho = 1, \ldots, F)$ are the target feature costs between the elements of the target and candidate feature vectors; $w_\rho$ is the weight of feature $\rho$.
The selection among the set of pre-selected candidates is performed by resolving a lattice of candidates using the Viterbi algorithm. The result of this selection is a path in the lattice of candidates which minimizes a weighted linear combination of three costs: the target cost, the acoustic join cost and the visual join cost, where $w$, $w_{aj}$ and $w_{vj}$ are the weights for the target cost, acoustic join cost and visual join cost components. We choose these weights as explained in (Toutios et al., 2011) (see section 6.1.2).
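As a rough illustration of the two ingredients above, the following Python sketch computes the weighted target cost of equation (5.4) and resolves a small candidate lattice with a Viterbi search. It is only a sketch under stated assumptions: a single generic join cost stands in for the separate acoustic and visual join costs, and the function names, the weights `w_target` and `w_join`, and the cost callbacks are hypothetical and not taken from the system described here.

```python
from typing import Callable, List, Sequence, Tuple


def target_cost(weights: Sequence[float], feature_costs: Sequence[float]) -> float:
    """Weighted sum of per-feature costs, in the form of equation (5.4)."""
    return sum(w * c for w, c in zip(weights, feature_costs))


def viterbi(
    lattice: List[List[str]],
    tcost: Callable[[int, str], float],
    jcost: Callable[[str, str], float],
    w_target: float = 1.0,
    w_join: float = 1.0,
) -> Tuple[List[str], float]:
    """Return the candidate path with the minimum total weighted cost.

    ASSUMPTION: one generic join cost replaces the acoustic and visual ones.
    """
    # best[k][j]: minimum cost of any path ending at candidate j of position k
    best = [[w_target * tcost(0, u) for u in lattice[0]]]
    back: List[List[int]] = [[-1] * len(lattice[0])]
    for k in range(1, len(lattice)):
        row, ptr = [], []
        for u in lattice[k]:
            # cost of reaching u through each candidate of the previous position
            costs = [
                best[k - 1][j] + w_join * jcost(prev, u)
                for j, prev in enumerate(lattice[k - 1])
            ]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + w_target * tcost(k, u))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)
    # trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for k in range(len(lattice) - 1, -1, -1):
        path.append(lattice[k][j])
        j = back[k][j]
    return path[::-1], total
```

With two positions and two candidates each, `viterbi` returns the path whose summed target and join costs are smallest, even when another candidate has the lowest target cost at its own position.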
An ideal target cost function
The purpose of the target cost function is to rank candidates in the order of their suitability to fit a target position during synthesis. Each candidate is assigned a cost (a positive real number) by the target cost function: the lower the cost, the more suitable the candidate is for a target position. If we assume that there is a metric to measure the perceptual dissimilarity between a target and a candidate, then ideally the ranking of candidates based on their target costs should be the same as the ordering based on their perceptual dissimilarities to the target.
At the time of synthesis, the target specification only has the target feature description, but no acoustic or visual speech realization. So, the decision is made based on the target cost. Hence, an optimum target cost function is very important for good synthesis results. A good set of target features and well-tuned weights define a good target cost function. The following section presents a simple and robust iterative algorithm to simultaneously eliminate redundant features and tune the weights of the remaining ones.
5.3.2 Target feature selection and weight tuning
The algorithm described here alleviates the problem of redundancy and noise that sets in due to the exhaustive set of features considered. Its importance is also due to the fact that, with a large set of features, it is practically infeasible to have a corpus which covers all the possible feature combinations. The algorithm uses the corpus, for which we have both actual speech realizations and target feature descriptions for each of the candidates present in it.
For any speech segment, there are possible variants which are perceptually considered good alternatives. However, it is practically impossible to rank candidates in terms of their absolute perceptual quality with respect to any target. Being `similar' to an already existing speech unit is a reasonable way to say how well a candidate will fit in a `target' position. If we devise a way to measure the dissimilarity between two units, it can be used on the candidates in the corpus, since they have both the target feature description and the speech realization available. The comparison between the ordering obtained by this measure and the ranking using the target cost can be used to evaluate the target function. In the following paragraphs, we define two things necessary for the evaluation of a target function: disorder with respect to a target cost function and dissimilarity between two speech realizations.
5.3.2.1 Disorder
The disagreement between the ranking of candidates given by the target cost function and the ordering given by the dissimilarity measure needs to be quantified. With respect to a particular target $t$ whose speech is available, the candidate ranking based on the target cost function should be in agreement with their dissimilarity-based ordering. We refer to the ordering based on the target cost as ranking. Consider a target $t$ and two candidates $u$ and $v$. With respect to the target $t$, let their dissimilarity measures be $D(t, u)$ and $D(t, v)$, and their target costs be $C(t, u)$ and $C(t, v)$. Then, for an ideal target cost function, one of the following three conditions should be true:
1. $C(t, u) < C(t, v) \Leftrightarrow D(t, u) < D(t, v)$
2. $C(t, u) = C(t, v) \Leftrightarrow D(t, u) = D(t, v)$
3. $C(t, u) > C(t, v) \Leftrightarrow D(t, u) > D(t, v)$
The dissimilarity measure is based on the comparison of two speech realizations. We assume that similar speech realizations are perceptually similar. This assumption implies that the
[Table 5.4: body not recovered; surviving column headings: "Target Cost", "Dissimilarity".]
Table 5.4: This table illustrates the idea of comparing a dissimilarity-based ordering with the ranking assigned based on the target cost. A target $t$ and four candidates $\{c_1, c_2, c_3, c_4\}$ are assumed. It is assumed that, for the target and the candidates, the speech realization is available for comparison. $D(t, c_i)$ is the dissimilarity between the speech realizations of the target $t$ and candidate $c_i$, which is a symmetric function. $C(t, c_i)$ is the target cost between the target specification of $t$ and candidate $c_i$. For the given target and with respect to each available candidate, the dissimilarity-based ordering of candidates and the target-cost-based ranking are compared to calculate the disorder. The total disorder is the sum of the fourth column.
[…] the dissimilarity measure we are expressing the difference in their speech realizations. Our approach is based on the idea that the ordering given by an ideal target cost function should agree with the ordering given by this dissimilarity measure. During pre-selection, the target cost function assigns a ranking to the available candidates, for pruning the less suitable candidates. For this reason, we refer to the target-cost-based ordering as ranking. Unlike some systems, we do not train the target cost function to compute the dissimilarity (Hunt and Black, 1996). We only focus on the candidate ordering given by the target cost function. The above three conditions state that the comparison of two candidates for a target position based on their target costs would be similar to that based on their dissimilarity to the target, if the target were to have a (hypothetical) speech realization available. We denote the above three conditions by the following:
$$C(t, u) \ast C(t, v) \Leftrightarrow D(t, u) \ast D(t, v) \tag{5.6}$$
where $\ast \in \{<, =, >\}$. We define the disorder with respect to this target and the two candidates as follows:
$\delta_t(u, v) =$ […]
For each of the phonemes $p$ in the phoneme set, let $U_p$ be the complete set of candidates in the corpus with that phonemic label. Using the leave-one-out technique, considering each of the elements of this set as a target and all the others as candidates, the total disorder for that phoneme is calculated for a particular target cost function as follows:
$\Delta = \sum$ […]
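The leave-one-out computation can be sketched in Python as below. This is a minimal sketch, not the definition used in this work: the right-hand sides of the pairwise disorder $\delta_t(u, v)$ and of the total disorder $\Delta$ are not available here, so the code assumes that a pair contributes 0 when the comparison of target costs agrees with the comparison of dissimilarities (condition (5.6)) and 1 otherwise, and that the total is the plain sum over all targets and all unordered candidate pairs. The names `pair_disorder` and `total_disorder` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

U = TypeVar("U")


def sign(x: float) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pair_disorder(c_u: float, c_v: float, d_u: float, d_v: float) -> int:
    """ASSUMPTION: 0 if costs and dissimilarities compare the same way, else 1."""
    return 0 if sign(c_u - c_v) == sign(d_u - d_v) else 1


def total_disorder(
    units: Sequence[U],
    cost: Callable[[U, U], float],
    dissim: Callable[[U, U], float],
) -> int:
    """Leave-one-out disorder over all units of one phoneme.

    Each unit is taken in turn as the target t; all other units are the
    candidates, and every unordered candidate pair (u, v) is compared.
    """
    total = 0
    for i, t in enumerate(units):
        others = [u for j, u in enumerate(units) if j != i]
        for u, v in combinations(others, 2):
            total += pair_disorder(cost(t, u), cost(t, v), dissim(t, u), dissim(t, v))
    return total
```

A target cost that orders candidates exactly as the dissimilarity does yields a total of zero; a cost that reverses the ordering yields one count per target and candidate pair.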
We take a dissimilarity measure similar to that in (Latacz et al., 2011) for the acoustic modality. Here, we describe a function that we have used to compare two speech segments. It gives an estimate of their dissimilarity. We considered four components to constitute the dissimilarity measure $D(u, v)$ between units $u$ and $v$ of a particular phoneme $p$, as follows:
$$D(u, v) = w_{dur} D_{dur}(u, v) + w_{ac} D_{ac}(u, v) + w_{vs} D_{vs}(u, v) + w_{f0} D_{f0}(u, v) \tag{5.9}$$
$D_{dur}$, $D_{ac}$, $D_{vs}$ and $D_{f0}$ are the components in terms of the duration, acoustic speech, visual speech and f0 of the units, and $w_{dur}$, $w_{ac}$, $w_{vs}$ and $w_{f0}$ are the weights given to these respective components. The duration dissimilarity $D_{dur}$ is calculated as the difference between the durations of the two units $u$ and $v$, $dur_u$ and $dur_v$ respectively, normalized to make the value lie in the range $[0, 1]$. Let $dur_{min}(p) = \min_{u,v \in U_p} |dur_u - dur_v|$ and $dur_{max}(p) = \max_{u,v \in U_p} |dur_u - dur_v|$ be the minimum and maximum duration differences among the units of phoneme $p$. Then the duration dissimilarity component is calculated as follows:
$$D_{dur}(u, v) = \frac{|dur_u - dur_v| - dur_{min}(p)}{dur_{max}(p) - dur_{min}(p)} \tag{5.10}$$
For the other three components (acoustic, visual and f0), the RMSE (root mean squared error) is calculated between the two trajectories of the respective features, after making the duration, i.e. the number of samples $N$, equal by simple linear interpolation:
$d_{rmse}(u, v) =$ […]
[…] MFCC as explained in section 5.3.3.
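The interpolation-then-RMSE step can be illustrated with the short Python sketch below. It rests on assumptions that are not stated in the text: trajectories are one-dimensional lists of samples (real acoustic or visual features would be multi-dimensional), the common length $N$ is taken to be the length of the longer trajectory, and the function names `resample` and `rmse` are hypothetical.

```python
import math
from typing import List, Sequence


def resample(traj: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a 1-D trajectory to exactly n samples."""
    if n == 1 or len(traj) == 1:
        return [traj[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(traj) - 1) / (n - 1)  # fractional index into traj
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(traj) - 1)
        frac = pos - lo
        out.append((1 - frac) * traj[lo] + frac * traj[hi])
    return out


def rmse(u: Sequence[float], v: Sequence[float]) -> float:
    """RMSE between two trajectories after equalising their lengths.

    ASSUMPTION: the common length N is that of the longer trajectory.
    """
    n = max(len(u), len(v))
    ru, rv = resample(u, n), resample(v, n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ru, rv)) / n)
```

Because the shorter trajectory is stretched rather than truncated, two units that follow the same contour at different speaking rates obtain a small RMSE, and the difference in their durations is left to the separate duration component.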
Let $d_{min}(p) = \min_{u,v \in U_p} d_{rmse}(u, v)$ and $d_{max}(p) = \max_{u,v \in U_p} d_{rmse}(u, v)$ be the minimum and maximum RMSEs among all the units of phoneme $p$. The RMSE is normalized similarly to $D_{dur}$, to make the value lie in the range $[0, 1]$, using $d_{min}(p)$ and $d_{max}(p)$:
$$D_{rmse}(u, v) = \frac{d_{rmse}(u, v) - d_{min}(p)}{d_{max}(p) - d_{min}(p)} \tag{5.12}$$
5.3.2.3 Primitives of the algorithm
The main idea behind the algorithm is that each target feature carries some information which gets reflected in speech. If a useful feature is removed from the target cost, then the performance of the target cost function should deteriorate. The extent to which it deteriorates when a target feature is excluded quantifies the feature's importance. We estimate the relative importance of a target feature based on the deterioration of selection performance when that feature is excluded from the target cost. This is explained in detail in the following discussion. For simplicity of notation, we stop showing a candidate and a target with the target cost function. Let us assume that the current set of target features is $\mathcal{F}$, and the current feature being considered is $f$. Let us denote the singleton feature set $\{f\}$ by $F$, and let $F^c = \mathcal{F} - F$. Let us express the target cost function as follows:
$$\underbrace{TC}_{(1)} = w_F \, TC_F + (1 - w_F) \, \underbrace{TC_{F^c}}_{(2)}$$
The target cost (TC) shown above is the weighted sum of the following two components:
(a) The target cost function with the single feature $f$, $TC_F$.
(b) A target cost function which excludes feature $f$ from the target feature set, $TC_{F^c}$.
The target cost function marked as (1) in the above equation takes all the features into account, and the target cost function marked as (2) excludes feature $f$. Using (1) and (2) as the two target costs, two disorders are calculated. The disorder calculated using (1) is referred to as the Combined Disorder (CD), which also depends on $w_F$. The disorder calculated using (2) is referred to as the Exclusion Disorder (ED). The following can be said with respect to the comparison of CD and ED:
• A feature $f$ is considered to contribute information if the disorder increases when it is excluded from the target cost: $ED_f > CD$.
• A feature $f$ is considered to contribute noise if the disorder decreases when it is excluded from the target cost: $ED_f < CD$.
The weights of those features which contribute information should be increased in proportion to their contribution; the weights of features which seem to contribute noise should be decreased until they become contributing features; if a feature contributes only noise (for long), it is eliminated from the feature set.
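The decomposition of the target cost and the comparison of CD and ED can be sketched as follows. This is an illustrative sketch only: it assumes that the feature weights sum to one and that $w_F < 1$, so that the cost without feature $f$ can be obtained by renormalising the remaining weights, and it takes the disorders as already computed numbers. The function names, the labels and the tolerance `tol` are hypothetical.

```python
from typing import Dict, Tuple


def split_target_cost(
    weights: Dict[str, float], costs: Dict[str, float], f: str
) -> Tuple[float, float]:
    """Return (TC_F, TC_Fc) for feature f.

    ASSUMPTION: weights sum to 1 and weights[f] < 1, so that
    TC = w_F * TC_F + (1 - w_F) * TC_Fc holds with TC the full weighted sum.
    """
    w_f = weights[f]
    tc_f = costs[f]
    rest = sum(weights[g] * costs[g] for g in weights if g != f)
    return tc_f, rest / (1.0 - w_f)


def classify_features(
    cd: float, ed: Dict[str, float], tol: float = 1e-9
) -> Dict[str, str]:
    """Label each feature by comparing its exclusion disorder with CD."""
    labels = {}
    for f, ed_f in ed.items():
        if abs(ed_f - cd) <= tol:
            labels[f] = "neutral"      # exclusion does not change the disorder
        elif ed_f > cd:
            labels[f] = "informative"  # disorder increases when f is excluded
        else:
            labels[f] = "noisy"        # disorder decreases when f is excluded
    return labels
```

Since only the ordering of candidates matters for the disorder, the renormalisation by $(1 - w_F)$ does not change ED; it is included only so that the two components recombine into the full target cost.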
The following possibilities need to be considered while classifying features as informative:
(1) Features might provide information if given an optimum weight (in the weight combination). Excluding these features might modify the disorder compared to their inclusion, and whether it increases or decreases depends on the combination of weights.
(2) Features which do not provide any information will not affect the disorder with their exclusion or inclusion, even with a change in their relative weight in the target cost.
(3) Features which contribute only noise by their inclusion in the target cost: regardless of the non-zero weight given to them, the combined disorder will always be greater than the disorder with their exclusion.
Based on this analysis we developed an iterative algorithm. At any iteration, the weights are updated based on the comparison of the ED of different features and the CD, as follows:
• Those features for which $ED > CD$: their weights increase. The increase is proportional to the difference between ED and CD.
• Those features for which $ED < CD$: they can belong to either category (1) or (3). The feature weights are updated in proportion to the difference between CD and ED. A feature which shows this trend ($ED < CD$) for long is eliminated from the feature set.
• Features belonging to category (2) are also eliminated ($ED = CD$).
• A fraction of the total weight of the set of features for which $ED < CD$ is distributed among the features for which $ED > CD$.
• To make the change in the weights slow, the weights at each iteration are made a function of the previous iteration. Any new weight after an iteration is a fraction (fixed parameter) […]
5.3.2.4 Algorithm
We provide the precise details of the algorithm here. Notation: for any iteration $i$, the complete set of features is $\mathcal{F}_i$; a singleton set containing feature $f$ is denoted by $F$; the set of features excluding a feature $f$ from $\mathcal{F}_i$ is $F_i^c = \mathcal{F}_i - F$; the disorder with the complete set of features and their weights at iteration $i$ (from the previous iteration), i.e. the combined disorder CD, is $\Delta(i)$; the disorder with a feature $f$ excluded from the target cost (ED) is $\Delta_{F_i^c}(i)$; the set of all the features for which $\Delta_{F_i^c}(i) > \Delta(i)$ is denoted by $F_i^+$, and $F_i^-$ denotes those which are qualified to remain in the feature set with $\Delta_{F_i^c}(i) < \Delta(i)$; the set of all features which are being eliminated is $F_i^0$. For a feature $f$, $t_f(i)$ is the number of iterations it has been in $F_i^-$ consecutively until iteration $i$ without being eliminated.
At every iteration $i$ the following quantities are calculated for updating the feature weights:
Information Component ($I_F(i)$): for a feature $f \in F_i^+$, i.e. $\Delta(i) < \Delta_{F_i^c}(i)$: […]
[…] Based on this, $N'_F(i)$ is calculated as follows to update the weight at every iteration:
$N'_F(i) = (1 - N_F(i))$ […]
[…] features which contribute more noise will lose more weight in the target function subsequently. In case there is only one feature in $F_i^-$, then $N'_F(i) = 1$.
The following are the parameters of the algorithm:
• $T$, the maximum number of tolerated iterations for a noisy feature. A feature $f$ for which $t_f(i) > T$ is eliminated from the feature list. If a feature $f$ changes from set $F_i^-$ to set $F_i^+$ in an iteration $i$, then $t_f(i)$ is set to 0.
• $\alpha^-$ and $\alpha^+$, the fractions of the weights of any features in $F_i^-$ and $F_i^+$ respectively that are carried forward from the weight in the previous iteration. This makes the updated weight in the current iteration a function of the weight in the previous iteration. It is done to make the change in weights slow.
• $\beta$ is the fraction of the total changeable weight in $F_i^-$ that is gained by features in $F_i^+$. The logic behind this distribution is that features in $F_i^-$ lose weight while features in $F_i^+$ gain weight.
• Maximum allowed iterations, for which the algorithm is executed. This is fixed based on the rate of change in total disorder (decrease in combined disorder per iteration).
The goal of the algorithm is to select the set of features and tune their respective weights in such a way that the disorder $\Delta$ described by equation (5.8) is minimized:
• Beginning: the target cost function with the complete set of features, which are assigned equal weights.
• At every iteration $i$:
  ⋆ The following are first determined:
    ◦ $\Delta(i)$.
    ◦ for all $f \in \mathcal{F}_i$: $\Delta_{F_i^c}(i)$.
  ⋆ Elimination of all those features $f$ for which one of the following conditions is satisfied:
    1. $(\Delta(i) - \Delta_{F_i^c}(i)) \approx 0$
    2. $(\Delta(i) - \Delta_{F_i^c}(i)) > 0$ and $t_f(i) > T$
  ⋆ Update weights: the update is such that the change is slow. For that, a fraction of the weight ($\alpha^+$ for features in $F_i^+$ and $\alpha^-$ for features in $F_i^-$) remains constant with respect to the previous iteration.
    ◦ For a feature $f \in F_i^+$: the more information in the feature, the higher the weight.
      $$w_F(i) = \underbrace{\alpha^+ w_F(i-1)}_{(1)} + \underbrace{W_{F_i^+} I_F(i)}_{(2)} \tag{5.17}$$
      The first component (1) depends on the feature weight in the previous iteration; the second component (2) depends on the information component of the feature. $W_{F_i^+}$ is the total weight that will be redistributed in $F_i^+$. $W_{F_i^+}$ is calculated as follows: […]
      The first component (i) is the total changeable weight of features in $F_i^+$; the second component (ii) is the total changeable weight of features in $F_i^-$ that is gained by features in $F_i^+$; the third component (iii) is the total weight of the features being eliminated, $F_i^0$. The total weight of the features being eliminated, $F_i^0$, is redistributed among the features in $F_i^+$.
    ◦ For a feature $f \in F_i^-$: the lesser the noise contribution, the higher the weight.
      $$w_F(i) = \underbrace{\alpha^- w_F(i-1)}_{(1)} + \underbrace{W_{F_i^-} N'_F(i)}_{(2)} \tag{5.19}$$
      The first component (1) depends on the weight of the feature $f$ in the previous iteration; the second component (2) depends on the noise component of feature $f$. $W_{F_i^-}$ is the fraction of the total changeable weight of features in $F_i^-$ that is redistributed to features in $F_i^-$ itself. It is calculated as follows:
      $$W_{F_i^-} = (1 - \alpha^-)(1 - \beta) \sum_{a \in F_i^-} w_A(i-1) \tag{5.20}$$
• Termination: the algorithm is terminated when the maximum number of allowed iterations has been executed or when there is no improvement (decrease in combined disorder) in an iteration beyond a certain $\epsilon$. The best weights, i.e. those giving the least disorder over all the iterations, are chosen for the final target cost for the phoneme.