Objective evaluation of synthesis results

Chapter 5 Unit Selection 71

5.2 Corpus based visual target features

5.2.3 Objective evaluation of synthesis results

In this subsection we describe the objective evaluation done to compare the various visual target costs. To evaluate the synthesis results, we used a method based on the leave-one-out cross-validation technique. We synthesized each of the sentences in the corpus, a total of 319 sentences, excluding the sentence being synthesized from the selection corpus. Each synthesized sentence is then compared with the corresponding real sentence. The advantage of this method is that it avoids building a specific test corpus for evaluation. However, it marginally reduces the choice available for selection, since some diphones are excluded from the selection process.
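The leave-one-out procedure described above can be sketched as follows. This is only an illustration: `leave_one_out`, the `corpus` dictionary and the `synthesize` callback are hypothetical names, and the actual unit-selection synthesizer is not shown.

```python
from typing import Callable, Dict, List, Sequence


def leave_one_out(
    corpus: Dict[str, Sequence[float]],
    synthesize: Callable[[str, List[str]], Sequence[float]],
) -> Dict[str, Sequence[float]]:
    """Synthesize every corpus sentence without using its own units.

    `corpus` maps a sentence id to its real trajectory (hypothetical layout).
    `synthesize(target_id, selection_ids)` is a stand-in for the synthesizer:
    it may only select units from the sentences listed in `selection_ids`.
    """
    results: Dict[str, Sequence[float]] = {}
    for sentence_id in corpus:
        # Exclude the sentence being synthesized from the selection corpus.
        selection_ids = [sid for sid in corpus if sid != sentence_id]
        results[sentence_id] = synthesize(sentence_id, selection_ids)
    return results
```

Each synthesized trajectory in the returned dictionary can then be compared with the real trajectory stored under the same sentence id.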

After synthesizing a given sentence, all the half-phones (two half-phones in each diphone) of the synthesized sentence and the actual sentence were re-sampled individually to make the number of visual samples equal in both the real and synthesized sentences (see Fig. 5.5). This was done using a simple linear interpolation of the 12 PCA coefficients. After this, Pearson's correlation coefficients between the 12 PCA coefficients of all the synthesized sentences and of the real sentences actually present in the corpus were determined. Similarly, Pearson's correlation coefficients between the 4 articulatory parameters were determined. The root mean square error (RMSE) between the articulatory feature and PCA coefficient trajectories of the synthesized and the real sentences present in the corpus was also determined.
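A minimal sketch of the half-phone re-sampling step is given below. It assumes that each half-phone is stored as a list of frames, each frame being a list of coefficients, and that the synthesized half-phone is re-sampled to the length of the real one. The text does not state which length is used as the target, so that choice and all function names are assumptions.

```python
from typing import List, Sequence, Tuple


def resample_linear(track: Sequence[float], n_out: int) -> List[float]:
    """Resample a 1-D trajectory to n_out samples by linear interpolation."""
    n_in = len(track)
    if n_in == 0 or n_out <= 0:
        raise ValueError("need a non-empty track and a positive target length")
    if n_in == 1 or n_out == 1:
        return [float(track[0])] * n_out
    out = []
    for k in range(n_out):
        # Position of output sample k on the input axis [0, n_in - 1].
        pos = k * (n_in - 1) / (n_out - 1)
        lo = min(int(pos), n_in - 2)
        frac = pos - lo
        out.append((1.0 - frac) * track[lo] + frac * track[lo + 1])
    return out


def resample_frames(frames: Sequence[Sequence[float]], n_out: int) -> List[List[float]]:
    """Resample a (frames x coefficients) segment, one coefficient at a time."""
    n_coeff = len(frames[0])
    columns = [resample_linear([f[d] for f in frames], n_out) for d in range(n_coeff)]
    return [[col[k] for col in columns] for k in range(n_out)]


def match_lengths(
    real: Sequence[Sequence[float]], synth: Sequence[Sequence[float]]
) -> Tuple[List[List[float]], List[List[float]]]:
    """Assumption: the synthesized half-phone is stretched to the real length."""
    return [list(f) for f in real], resample_frames(synth, len(real))
```

Applying `match_lengths` to every pair of corresponding half-phones and concatenating the results gives two sentence-level trajectories with the same number of samples.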

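Figure 5.5: Adjusting diphone lengths. Each of the corresponding half-phones which are part of the diphones in the synthesized and real sentences is re-sampled through linear interpolation to make the number of visual samples equal. (Panel labels: synthesized sentence, real sentence; diphones ba and ac.)

tel-00927121, version 1 - 11 Jan 2014

Let $x_d$ and $y_d$ be the sequences of the $d$-th PCA coefficient of a real and a synthesized sentence, each having $n$ samples:

• Pearson's correlation coefficient is calculated as follows:

$$r_{x_d y_d} = \frac{\sum_{i=1}^{n} (x_{d,i} - \bar{x}_d)(y_{d,i} - \bar{y}_d)}{\sqrt{\sum_{i=1}^{n} (x_{d,i} - \bar{x}_d)^2}\,\sqrt{\sum_{i=1}^{n} (y_{d,i} - \bar{y}_d)^2}}$$

where $\bar{x}_d$ and $\bar{y}_d$ are the means of the two sequences.

• The Root Mean Squared Error (RMSE) is calculated as follows:

$$\mathrm{rmse}_{x_d, y_d} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{d,i} - y_{d,i})^2}$$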
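The two measures can be computed for one pair of length-matched trajectories as in the sketch below. It uses the standard textbook definitions of Pearson's correlation and RMSE. The function names are illustrative and this is not the evaluation code used in this work.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / n)
```

Correlation is insensitive to a constant offset or scale between the two trajectories, whereas RMSE is expressed in the units of the feature, which is why the two measures are reported side by side.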

Though it is almost impossible to have a perfect correlation between the real and the synthesized sentence, it seems reasonable to assume that the trajectories of two diphones selected with similar phonetic context and linguistic description would be significantly correlated. For the visual target cost, we performed objective evaluation of the visual speech animation alone. This was based on the assumption that the visual speech animation would be strongly correlated with the underlying acoustic speech. Besides, the features modified account predominantly for the visual modality of speech, unlike some others, such as phoneme articulation and voicing, which account for the acoustics of speech. An example of the trajectories of the first principal component of a synthesized sentence and of the corresponding real sentence is shown in Figure 5.6.

Evaluation results

Based on the objective evaluation technique explained above, the performance of the various visual target cost techniques was determined (see Tables 5.2 and 5.3). The target cost techniques initial phonetic description (IPD) and modified phonetic description (MPD) performed comparably to each other ($r_{x_d y_d} = 0.813$ for PC 1). Similarly, the two continuous visual target costs, the contextual phoneme difference based approach (CPD) and the phoneme difference based on contextual significance (PDCS), performed comparably to each other ($r_{x_d y_d} = 0.816$ for PC 1). The continuous visual target costs consistently gave marginally better results than the binary visual target cost approaches, even when different weights for the visual target cost component were used. This is also apparent when observing the performance with respect to the articulatory features. In fact, the correlation for the first two methods, IPD and MPD, is 0.70, and it increases up to 0.72 for CPD and PDCS for jaw opening (see Table 5.2). Table 5.3 shows the RMSE between real and synthetic trajectories for the articulatory features. The RMSE is almost the same for the 4 methods. We should note that each of the examined methods affects the ranking of the selected candidates, although the differences between them are not obvious. We should emphasize that the relative importance of the examined visual target cost component in the overall target cost is 1%, as we have a large set of features. This can explain the marginal variation in performance.
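The effect of such a small weight can be illustrated with a toy computation. The sketch assumes that the overall target cost is a weighted sum of per-feature costs normalized to [0, 1]. That form, the grouping of all remaining features into one term, and every number below are hypothetical and chosen only to show the order of magnitude.

```python
from typing import Dict


def total_target_cost(costs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-feature target costs (weights assumed to sum to 1)."""
    return sum(weights[name] * costs[name] for name in costs)


# Hypothetical weights: the visual component carries 1% of the total.
weights = {"other_features": 0.99, "visual": 0.01}

# Two hypothetical candidates with costs normalized to [0, 1].
cand_a = {"other_features": 0.50, "visual": 1.0}  # worst possible visual cost
cand_b = {"other_features": 0.52, "visual": 0.0}  # best possible visual cost

cost_a = total_target_cost(cand_a, weights)
cost_b = total_target_cost(cand_b, weights)

# The visual term can shift a total by at most its weight (0.01 here),
# so it only reorders candidates whose other costs are nearly tied.
```

In this toy case candidate A keeps the lower total cost even though its visual cost is the worst possible, so replacing a binary visual cost with a continuous one changes the ranking only between near-ties.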

Hence, these results indicate that a continuous target cost component represents the differences between phonemes better than discrete binary target cost components, and that optimizing the synthesis performance for a particular corpus with such a component has to be contemplated. Given the limited generalizing power, for a corpus of small size and without a very well balanced diphone coverage, the categorical target cost based on classical knowledge can be considered sufficient. One should observe that the objective evaluation used in this work is purely visual.

Examining the results of the objective evaluation presented here, it can be said that they are quite good. The overall correlation is quite high. In addition, the RMSE is very low and acceptable. In fact, the RMSE is around 2 mm for jaw opening, 2.7 mm for lip opening, 1.38 mm for lip spreading and 4 mm for lip protrusion. This is a good indication that our synthesis method provides trajectories similar to those of real sentences. This is quite interesting, as we know that the purpose of synthesis is not to generate the exact speaker articulation (unlike acoustic-to-articulatory inversion). Natural speech realization is variable, so good synthesis can also be obtained with different trajectories which do not exactly match one real reference. But as our system takes the specificity of the speaker into account, we manage to obtain a similar result which is closer to the speaker's articulation. Thus, it seems that our acoustic-visual synthesis, based on the main idea of considering the speech signal as bimodal, was able to capture the speaker-specific articulation finely. This can be clearly seen in Figure


modified/optimized to take any particular corpus they describe.

Table 5.2: Correlation coefficients between the real and synthesized trajectories of the first 3 principal component coefficients and the three articulatory features by various target cost strategies. IPD: initial phoneme description, MPD: modified phoneme description, CPD: contextual phoneme difference, PDCS: phoneme difference based on contextual significance. The articulatory features: JO (jaw opening), LP (lip protrusion), LO (lip opening) and LS (lip spreading). The first four principal components account for about 58%, 24% and 7% respectively.

PC IPD MPD CPD PDCS

Table 5.3: Root Mean Square Error (RMSE) in millimeters between the real and synthesized trajectories of the four articulatory features (same notations as Table 5.2).

From the document Synthèse Acoustico-Visuelle de la Parole par Séléction d'Unités Bimodales ~ Association Francophone de la Communication Parlée (pages 82-85)
