Chapter 5 Unit Selection
5.2 Corpus based visual target features
5.2.3 Objective evaluation of synthesis results
In this subsection we describe the objective evaluation done to compare the various visual target
costs. To evaluate the synthesis results, we used a method based on the
leave-one-out cross-validation technique. We synthesized each of the sentences in the corpus, a total of 319
sentences, each time excluding the sentence being synthesized from the selection corpus.
Each synthesized sentence is then compared with the corresponding real sentence. The advantage of this
method is that it avoids building a specific test corpus for evaluation. However, it marginally
reduces the choice available for selection, since some diphones are excluded from the selection process.
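The leave-one-out loop described above can be sketched as follows. This is only an illustrative outline: `leave_one_out` and the `synthesize` callback are hypothetical names that stand in for the actual unit-selection system, which is not specified at this level of detail in the text.

```python
from __future__ import annotations

from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # a corpus sentence (hypothetical placeholder type)
R = TypeVar("R")  # whatever the synthesizer returns


def leave_one_out(corpus: Sequence[S],
                  synthesize: Callable[[S, Sequence[S]], R]) -> list[tuple[S, R]]:
    """Synthesize every sentence using a corpus that excludes that sentence."""
    results = []
    for i, target in enumerate(corpus):
        # Selection corpus: all sentences except the one being synthesized.
        selection_corpus = [s for j, s in enumerate(corpus) if j != i]
        results.append((target, synthesize(target, selection_corpus)))
    return results
```

Each returned pair holds the real sentence and its synthesized counterpart, ready to be compared with the measures defined later in this subsection.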
After synthesizing a given sentence, all the half-phones (two half-phones in each diphone)
of the synthesized sentence and the actual sentence were re-sampled individually to make the
number of visual samples equal in both the real and synthesized sentences (see Fig. 5.5). This
was done using a simple linear interpolation of the 12 PCA coefficients. After this, the Pearson's
correlation coefficients between the 12 PCA coefficients of all the synthesized sentences and the
real sentences actually present in the corpus were determined. Similarly, Pearson's correlation
coefficients between the 4 articulatory parameters were also determined. Finally, the root mean square error
(RMSE) between the articulatory feature and PCA coefficient trajectories of the synthesized and
the real sentences present in the corpus was determined.
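A minimal sketch of the half-phone re-sampling step is given below. The text does not say which length is used as the common target, so this sketch assumes that each synthesized half-phone is re-sampled to the frame count of its real counterpart. The function names are hypothetical.

```python
from __future__ import annotations


def resample_linear(values: list[float], n_out: int) -> list[float]:
    """Resample a 1-D sequence to n_out samples by linear interpolation."""
    n_in = len(values)
    if n_in == 0 or n_out <= 0:
        raise ValueError("need at least one input and one output sample")
    if n_in == 1 or n_out == 1:
        # Degenerate case: nothing to interpolate, repeat the first sample.
        return [float(values[0])] * n_out
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index into `values`
        lo = min(int(pos), n_in - 2)        # left neighbour, clamped
        frac = pos - lo
        out.append((1.0 - frac) * values[lo] + frac * values[lo + 1])
    return out


def align_half_phone(real: list[list[float]],
                     synth: list[list[float]]) -> list[list[float]]:
    """Resample a synthesized half-phone (frames x coefficients) to the
    number of frames of the real half-phone (assumed target length)."""
    n_coeffs = len(synth[0])
    columns = [resample_linear([frame[d] for frame in synth], len(real))
               for d in range(n_coeffs)]
    return [list(frame) for frame in zip(*columns)]
```

Applying `align_half_phone` to every pair of corresponding half-phones and concatenating the results gives two trajectories with the same number of visual samples.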
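tel-00927121, version 1 - 1 1 Jan 2014

[Figure 5.5 schematic: the synthesized sentence and the real sentence, each segmented into the diphones "ba" and "ac".]
Figure 5.5: Adjusting diphone lengths. The corresponding half-phones of the diphones in the synthesized and real sentences are re-sampled through linear interpolation to make the number of visual samples equal.

If $x_d$ and $y_d$ are the sequences of the $d$-th PCA coefficient of a real and a synthesized sentence having $n$ samples $x_{d,i}$ and $y_{d,i}$ ($i = 1, \dots, n$), with means $\bar{x}_d$ and $\bar{y}_d$:
• The Pearson's correlation coefficient is calculated as follows:
$$r_{x_d y_d} = \frac{\sum_{i=1}^{n} (x_{d,i} - \bar{x}_d)(y_{d,i} - \bar{y}_d)}{\sqrt{\sum_{i=1}^{n} (x_{d,i} - \bar{x}_d)^2}\,\sqrt{\sum_{i=1}^{n} (y_{d,i} - \bar{y}_d)^2}}$$
• The Root Mean Squared Error (RMSE) is calculated as follows:
$$\mathrm{rmse}_{x_d, y_d} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{d,i} - y_{d,i})^2}$$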
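The two measures can be computed as in the sketch below, which uses the standard textbook definitions of Pearson's correlation and RMSE for two equal-length sequences. The `1/n` normalization in the RMSE and the function names are assumptions of this sketch.

```python
from __future__ import annotations

import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rmse(x: list[float], y: list[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

In the evaluation described here, `x` and `y` would be the re-sampled trajectories of one PCA coefficient (or one articulatory feature) for the real and the synthesized sentence.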
Though it is almost impossible to have a perfect correlation between the real and synthesized
sentence, it seems to be a reasonable assumption that the trajectories for two diphones selected
with similar phonetic context and linguistic description would be significantly correlated. For the
visual target cost, we performed objective evaluation of the visual speech animation alone. This
was based on the assumption that the visual speech animation would be strongly correlated with
the underlying acoustic speech. Besides, the modified features account predominantly for the
visual modality of speech, unlike others such as phoneme articulation and voicing, which account
for the acoustics of speech. An example of the trajectories of the first principal component of a
synthesized sentence and the corresponding real sentence is shown in Figure 5.6.
Evaluation results
Based on the objective evaluation technique explained above, the performance of the various
visual target cost techniques was determined (see Tables 5.2 and 5.3). The target cost
techniques initial phoneme description (IPD) and modified phonetic description (MPD) performed comparably to each
other ($r_{x_d y_d} = 0.813$ for PC1). Similarly, the two continuous visual target costs, the contextual phoneme difference based approach (CPD) and the phoneme difference based on contextual
significance (PDCS), performed comparably to each other ($r_{x_d y_d} = 0.816$ for PC1). The continuous visual target costs consistently gave marginally better results than the binary visual
target cost approaches, even when different weights for the visual target cost component were used.
This is also apparent when observing the performance with respect to articulatory features. In
fact, the correlation for the first two methods, IPD and MPD, is 0.70, and it increases up to 0.72
for CPD and PDCS for jaw opening (see Table 5.2). Table 5.3 shows the RMSE between
real and synthetic trajectories for the articulatory features. The RMSE is almost the same for
the 4 methods. We should notice that each of the examined methods affects the ranking of the
selected candidates, though it is not obvious that there are differences between them. We
should emphasize that the relative importance of the examined visual target cost component in
the overall target cost is 1%, as we have a large set of features. This can explain the marginal variation in performance.
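The effect of such a small weight can be illustrated with the sketch below. It assumes that the overall target cost is a weighted sum of per-feature costs, which is common in unit selection but is not stated explicitly here. The feature names, weights and cost values are all hypothetical.

```python
from __future__ import annotations


def total_target_cost(costs: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-feature target costs (assumed form of the cost)."""
    return sum(weights[name] * cost for name, cost in costs.items())


# Hypothetical weights: the visual component carries 1% of the total,
# and all remaining features are lumped together under "other".
weights = {"other": 0.99, "visual": 0.01}

# Two hypothetical candidates with per-feature costs in [0, 1].
candidate_a = {"other": 0.300, "visual": 1.0}
candidate_b = {"other": 0.305, "visual": 0.0}

cost_a = total_target_cost(candidate_a, weights)
cost_b = total_target_cost(candidate_b, weights)
```

With these numbers candidate B overtakes candidate A even though A is slightly better on the other features. The visual component can therefore change the ranking of close candidates, while it shifts the total cost by at most 0.01.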
Hence, these results indicate that a continuous target cost component, which represents the
differences between phonemes better than discrete binary target cost components and optimizes the synthesis performance for a particular corpus,
has to be contemplated. However, given its limited generalizing
power, for a corpus of small size without a very well balanced diphone coverage,
the categorical target cost based on classical knowledge can be considered sufficient. One
should observe that the objective evaluation used in this work is purely visual.
Examining the results of the objective evaluation presented here, it can be said that they
are quite good. The overall correlation is quite high. In addition, the RMSE is low and
acceptable. In fact, the RMSE is around 2 mm for jaw opening, 2.7 mm for lip opening,
1.38 mm for lip spreading and 4 mm for lip protrusion. This is a good indication that our synthesis method provides trajectories similar to those of real sentences. This is quite interesting, as we know
that the purpose of synthesis is not to generate the exact speaker articulation (unlike
acoustic-to-articulatory inversion). Natural speech realization is variable, so good synthesis can
also be obtained with different trajectories which do not exactly match one real reference.
But as our system takes the specificity of the speaker into account, we manage
to obtain a similar result which is closer to the speaker's articulation. Thus, it seems that our
acoustic-visual synthesis, based on the main idea of considering the speech signal as bimodal,
was able to capture the speaker-specific articulation finely. This can be clearly seen in Figure
modified/optimized to take any particular corpus they describe.
Table 5.2: Correlation coefficients between the real and synthesized trajectories of the first 3 principal component
coefficients and the four articulatory features for the various target cost strategies. IPD: initial phoneme description,
MPD: modified phoneme description, CPD: contextual phoneme difference, PDCS: phoneme difference based on
contextual significance. The articulatory features: JO (jaw opening), LP (lip protrusion), LO (lip opening) and
LS (lip spreading). The first three principal components account for about 58%, 24% and 7% respectively.
Columns: PC, IPD, MPD, CPD, PDCS.
Table 5.3: Root Mean Square Error (RMSE) in millimeters between the real and synthesized trajectories of
the four articulatory features (same notations as Table 5.2).