Chapter 4 Phoneme Classification Based on Facial Data
4.2 Learning phoneme kinematics using EMA data
4.2.3 Results
The HMM training and alignment are done in exactly the same way as explained for the facial
data. Two sets of HMMs are trained using the two feature sets extracted from EMA data. Only
monophone HMMs were trained and used for segmentation. This is because of the coverage
tel-00927121, version 1 - 1 1 Jan 2014
Figure 4.5: EMA data acquisition: Location of sensors, frontal view (sensor labels A, B, C, D, E, F, G, H, I, K, L). Lip opening and lip spread are given by the distances $\|\overrightarrow{EL}\|$ and $\|\overrightarrow{GK}\|$. Lip protrusion is given by the displacement of the center of gravity of the four points (E, G, K, L) along the normal vector to the plane formed by vectors $\overrightarrow{EL}$ and $\overrightarrow{GK}$. Figure adapted and modified from (Pfitzinger, J
Figure 4.6: EMA data acquisition: Sensor locations on the mid-sagittal plane (sensor labels A, C, D, E, F, L; axes x, y, z). The following tongue related features are calculated: 1. Tongue tip movement, $\|\overrightarrow{AJ}\|$; 2. Horizontal displacement of the tongue, $(\|\overrightarrow{JF}\|)_x$; 3. Tongue shape, $(\|\overrightarrow{AD}\|)_{(x,z)}$ and $(\|\overrightarrow{AC}\|)_{(x,z)}$; 4. Tongue height, $(\|\overrightarrow{JF}\|)_z$. Figure adapted and
is used for the analysis. The segmentation results are obtained for the two sets of HMMs. The
recognition errors are determined for each phoneme class for the segmentation predicted by the
two HMM sets, along the same lines as explained in the case of facial marker data. The results,
in comparison with those obtained by HMMs trained using features extracted from the facial
marker data, are given in Figure 4.7. Facial data and EMA data have many differences besides
just the phonetic transcript, duration and coverage of phonemes. There are other significant
differences, such as the following. First, unlike facial data, where the articulation is completely
uninhibited and natural, the effect of the presence of sensors on articulation cannot be completely
ruled out. In addition, the facial deformation happening during the articulation of speech
cannot be completely captured through just 5 points (4 on the lips and 1 on the chin); in this respect
facial data can be considered better. Besides, trajectories of just 4 points on the tongue are
captured, and parameters were extracted subsequently. This cannot capture the complexity
of the articulatory deformation of the tongue. These differences and factors account for the
marginal improvement with the addition of tongue related information, which is contrary to
what one would expect. Broadly, the addition of tongue features improves the alignment results
for most of the phonemes which do not fall in the category of visible phonemes (see Figure 4.2).
For the phonemes which fall in the category of visible phonemes, rather predictably, the addition
of tongue information does not improve the recognition.
Figures 4.8 to 4.11 give the start and end statistics of the phonemes based on the alignment
results without and with tongue related data added to the articulatory features. Considering those
phonemes for which the recognition errors have reduced with the addition of tongue data, the
following observations can be made. For velars, the expectation of the acoustic to visual start
difference is positive, i.e. $E(D_s) > 0$, which indicates the co-articulation effect on their left contextual phonemes. For alveolars and dentals, the variance of the difference in acoustic and visual start ($D_s$) has reduced. Besides, for the phoneme /l/, the difference in the acoustic and visual ends ($E(D_e) < 0$) shows an influence on the following phonemes. For other phonemes, these figures show that there is no significant change in the statistics with the inclusion of the
tongue data. This can be accounted for by the recognition errors, which have not improved with the
addition of tongue data.
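The start and end statistics discussed above can be illustrated with a minimal sketch. It assumes that $D_s$ and $D_e$ are the acoustic minus visual boundary times in milliseconds (as the axis labels (As−Vs) and (Ae−Ve) of Figures 4.8 to 4.11 suggest) and that the population variance is used; the function name and the tuple layout of the input are hypothetical and not taken from the original system.

```python
from collections import defaultdict
from statistics import mean, pvariance


def start_end_difference_stats(segments):
    """Mean and variance of acoustic-minus-visual boundary differences per class.

    `segments` is an iterable of tuples (assumed layout, for illustration only):
        (phoneme_class, acoustic_start, visual_start, acoustic_end, visual_end)
    with all times in milliseconds.
    """
    start_diffs = defaultdict(list)  # D_s = A_s - V_s
    end_diffs = defaultdict(list)    # D_e = A_e - V_e
    for cls, a_start, v_start, a_end, v_end in segments:
        start_diffs[cls].append(a_start - v_start)
        end_diffs[cls].append(a_end - v_end)

    stats = {}
    for cls in start_diffs:
        stats[cls] = {
            "E(Ds)": mean(start_diffs[cls]),
            "Var(Ds)": pvariance(start_diffs[cls]),  # population variance (assumption)
            "E(De)": mean(end_diffs[cls]),
            "Var(De)": pvariance(end_diffs[cls]),
        }
    return stats
```

With such a table, a positive `E(Ds)` for a class corresponds to the visual segment starting before the acoustic one on average, which is the pattern reported above for velars.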
4.3 Conclusion
The results of segmentation using EMA data which includes tongue related features, in
comparison with those obtained by facial features, show only a marginal improvement. This is in
Figure 4.7: Forced alignment results using HMMs trained on different data-based features. Face:PCA+Art are the feature vectors extracted from the facial marker data, comprising the four articulatory features and the first 3 PCA coefficients. EMA:Art are the articulatory feature vectors extracted from the EMA data, and EMA:Art+tongue are the articulatory and tongue movement related feature vectors from EMA data. [Plot: values from 0 to 0.4 for Face:Art+PCA, EMA:Art and EMA:Art+tongue, per phoneme and per phoneme class (sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V), grouped as consonants and vowels, plus a TE column.]
Figure 4.8: Means and variances of the phoneme start differences calculated for the alignment based on articulatory parameters of EMA data. [Plot: (As−Vs) in ms, from −150 to 200, against phoneme classes sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V, grouped as consonants and vowels.]
Figure 4.9: Means and variances of the phoneme end differences calculated for the alignment based on articulatory parameters of EMA data. [Plot: (Ae−Ve) in ms, from −250 to 250, against phoneme classes sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V, grouped as consonants and vowels.]
Figure 4.10: Means and variances of the phoneme start differences calculated for the alignment based on both articulatory and tongue related parameters of EMA data. [Plot: (As−Vs) in ms, from −150 to 150, against phoneme classes sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V, grouped as consonants and vowels.]
Figure 4.11: Means and variances of the phoneme end differences calculated for the alignment based on both articulatory and tongue related parameters of EMA data. [Plot: (Ae−Ve) in ms, from −200 to 150, against phoneme classes sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V, grouped as consonants and vowels.]
based on these automatic segmentation results. This classification is used to analyze the
perceptual evaluation results. It is useful for bringing out the correlation between objective and
perceptual evaluation results, thus paving the way for better objective evaluation techniques.
Chapter 5 Unit Selection
In the previous chapter we presented an overview of our text-to-acoustic-visual speech
synthesis system called ViSAC. It synthesizes speech using unit selection and concatenation of
speech segments from a pre-recorded speech corpus. Speech synthesis systems which are
based on unit selection typically have three stages. For a given text to be synthesized, the NLP
module first generates the specification of the required target phoneme sequence. The
specification is then converted in terms of the synthesis unit. For example, the synthesis unit in the
case of our system is the diphone. It is necessary that the target specification has all the important
information which affects speech realization. Then, for each required target in the specification,
all the candidates in the corpus are ranked based on a target cost function. This cost function
is generally defined as the weighted sum of individual feature costs. At the end of this
candidate ranking, for each required target in the specification, at most a fixed maximum number
of candidates are pre-selected and the rest are pruned. This scenario of multiple possible candidates
for each required target in the sequence defines a lattice. Finally, the sequence of those final
candidates which optimizes a total cost function is selected for concatenation. This is done by
resolving the lattice with the Viterbi algorithm. The total cost function is the weighted
sum of the target cost and the concatenation costs.
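The three stages described above (target specification, candidate pre-selection, lattice resolution) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the ViSAC implementation: the dictionary layout of targets and candidates, the use of a simple mismatch count as the individual feature cost, the `concat_cost` callback and all parameter names are hypothetical. It also assumes that every target has at least one candidate in the corpus.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of individual feature costs (here: 1 on mismatch, else 0)."""
    return sum(
        w * (target["features"][f] != candidate["features"][f])
        for f, w in weights.items()
    )


def preselect(target, corpus, weights, max_candidates):
    """Rank candidates of the same unit by target cost and prune the rest."""
    same_unit = [c for c in corpus if c["unit"] == target["unit"]]
    ranked = sorted(same_unit, key=lambda c: target_cost(target, c, weights))
    return ranked[:max_candidates]


def select_units(targets, corpus, weights, concat_cost,
                 w_target=1.0, w_concat=1.0, max_candidates=3):
    """Resolve the candidate lattice with a Viterbi search.

    Total cost = w_target * (sum of target costs)
               + w_concat * (sum of concatenation costs).
    """
    lattice = [preselect(t, corpus, weights, max_candidates) for t in targets]

    # best[i][j] = (cumulative cost, index of best predecessor in column i-1)
    best = [[(w_target * target_cost(targets[0], c, weights), None)
             for c in lattice[0]]]
    for i in range(1, len(targets)):
        column = []
        for cand in lattice[i]:
            tc = w_target * target_cost(targets[i], cand, weights)
            prev_cost, prev_index = min(
                (best[i - 1][k][0] + w_concat * concat_cost(prev, cand), k)
                for k, prev in enumerate(lattice[i - 1])
            )
            column.append((prev_cost + tc, prev_index))
        best.append(column)

    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

A caller would supply `concat_cost(previous, current)` returning a non-negative number; in the system described here that role is played by the acoustic and visual concatenation costs.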
For all three stages mentioned above, the `specification of targets' or `description of
candidates' is crucial. This also shows that the target feature structure and the calculation of the target
cost play a central role. In the pre-selection stage, it is necessary that the ranking given to
the candidates present in the corpus is consistent with the ordering based on their perceptual
suitability for any required target. This is also important to ensure that no good candidates
get pruned. This depends on the target cost. Besides pre-selection, the target cost also influences
the final selection of the candidate sequence from the lattice. The set of target features and their
optimum weights, which define the target cost, decide the efficiency of the target cost function
and hence the synthesis performance. With respect to the target cost, the following two aspects
need to be explored:
• Deciding the set of target features that will be used for target specification or candidate description.
• Tuning the weights of the target features to optimize the overall synthesis performance, for a given corpus.

In addition to the target cost, the concatenation cost also needs to be considered. The
concatenation cost estimates the perceptual discontinuity due to the concatenation of two
candidates. The calculation of the acoustic and visual concatenation cost in our system was
explained in the previous chapter. The objective of unit selection is to have a final synthesized
speech which is perceptually similar to a (hypothetical) natural speech sequence rendered by
the speaker. This requires at least continuous speech without perceptible discontinuities, and
constituent speech segments which are locally suitable for each required target. This requires an
optimum combination of target and concatenation costs. This indicates the need to tune the
total cost function besides optimizing the total cost.
This chapter deals with these different aspects of unit selection. In the following sections, we
describe experiments that were performed with the objective of optimizing the synthesis results.
We first give an account of the set of target features in section 5.1. In
section 5.2, we detail experiments that were performed to modify target feature values or design
new target features for the visual modality. In section 5.3, we explain a target cost tuning approach
that we have developed, before concluding.
5.1 Target features
At the time of synthesis, targets are specified using a set of features, generally called target
features. This set of target features is generally decided based on the linguistic and phonetic
studies which explain various patterns in speech. Consequently, the classically used target
features include linguistic, phonetic and prosodic context. Some of these features are relevant
irrespective of the language and some might be language-specific. For example, unlike phoneme
voicing, which is usually relevant irrespective of the language, the observation of the rhythm group
(RG) pattern is relevant for French. This is because in French the end of an RG gives the position
of the stressed syllable, which is usually the last syllable of the RG. Hence, the features related to RG
that are relevant to French might not be relevant or equally important for other languages. For
on the text analysis. In the case of a text to be synthesized, the description of a target in terms
of these features provides `abstract' information about speech. The target feature cost for a
particular candidate is based on the feature value of the target and that of the candidate being
considered. The expectation is that the same feature values account for a hypothetical similarity in
the speech realization and hence also the candidate suitability.
In our system, these features describe a phoneme at the various logical levels into which a sentence
can be sub-divided (see Fig. 3.6). Some of the features are more specific to the French language.
This set of features, especially the linguistic features, is predominantly generic and can be
directly applied irrespective of the corpus being used. The set of linguistic features includes
phoneme number in the syllable; syllable kind; syllable position in the rhythm group (RG) and
sentence; syllable number in the word, RG and sentence; word position in RG and sentence;
word number in RG and sentence; RG position in sentence; proximity of the nearest left and
right silence; kind of sentence.
They have either finite integer values or categorical values, depending on the feature. These
features are used to describe the characteristic of a target or a candidate, of a contextual
(left/right) phoneme, or both. The phonetic features include, besides the phoneme identity, the
list of features given in Table 5.1. Except for the phoneme identity, the other phonetic features are
used to define context (left and right phoneme). This set of generic target features, which are
extracted through the text analysis, is augmented by additional corpus-based target features.
This is done to take the speaker characteristics into account, which is important especially for
the visual modality. Hence, the corpus specific features designed mainly account for the visual
modality of speech.
5.2 Corpus based visual target features
We have described the set of generic target features in the previous section, which are generally
assumed to depend solely on text analysis. The set of target features related to phonetic context
also belongs to this category. The phonetic context of any particular phoneme influences its
articulation significantly. This is well known as coarticulation. The degree by which a phoneme
influences its surrounding phonemes or is influenced by them varies (Löfqvist, 1990). The
established phonetic knowledge regarding coarticulation holds almost all the time (Ladefoged, 1982;
Ladefoged and Maddieson, 1995). Hence, these target features and their values for different
phonemes are usually based on the characterization defined by phoneticians that is found in the
literature, and their values are set based on the information extracted through text analysis.
Table 5.1: This set of features defines the phonetic context of a phoneme, target or candidate.
These feature values describe either the previous or the following phoneme. The target feature costs
for these features are binary valued functions taking either 0 or 1 based on whether the feature values being compared are the same or different, respectively.
Feature Name    Possible values
Voicing         voiced, unvoiced
ences and idiosyncrasies. Due to the usage of a recorded audio-visual corpus, in case the speaker
has any peculiar articulation, it might be visually or acoustically perceived in the synthesized
speech and present some incoherence. For example, let us assume that candidates are being
looked up for a target phoneme whose left contextual phoneme is considered to have lip
protrusion during its articulation. Then, obviously, those candidates whose left contextual phoneme is
considered to have lip protrusion during its articulation will get a higher ranking. If this target
contextual phoneme is actually articulated differently and not actually protruded, then selecting
a candidate with a protruded left contextual phoneme might be inappropriate. It is well known that this kind of
categorization might vary slightly from person to person (Johnson et al.,
1993; Raphael and Bell-Berti, 1975; Maeda, 1989). Hence, in case these feature values have any
inconsistency in comparison with the actual characteristic in the corpus, it will be visible in
the synthesized speech. We have performed two experiments which aim at a phonetic context
adaptation that is based on the characteristics observed in the corpus. They can be divided into
the following two categories:
• Changing target feature values for some phonemes based on the articulatory characteristics estimated from the corpus. We refer to this approach as phonetic category modification.
• Replacing categorical phonetic target features by real valued target features to represent corpus specific characteristics. These features encode the same information accounted for by
the categorical features, with higher precision. We refer to this approach as continuous
Figure 5.1: Jaw Opening statistics. Each segment represents a phoneme, centered at the mean and its
length being twice the standard deviation. The number of occurrences of each phoneme is presented. [Plot: one segment per phoneme: i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _]
In the following subsections, we describe these experiments. The modified feature values or
introduced features are those which mainly characterize the visual modality of speech. Hence, we
refer to them as the visual target cost. The main goal is to see whether these experiments improve
the performance of selection and consequently of synthesis. The objective evaluation results of
these two methods are then presented in subsequent subsections.
5.2.1 Phonetic category modification
All the target features which provide the information related to phonetic context are categorical
(see Table 5.1). The corresponding phonetic feature costs are binary: they take the value 0 when the target and candidate feature values are the same, and 1 when they are different. Among these target features, two features account for the patterns in visual speech animation. They are `Place of
articulation' and `Lip shape during articulation'. We refer to the latter feature as `Lip
Shape'. `Place of articulation' information is encoded only for labial phonemes, and their
place of articulation is visibly unambiguous. Hence, we focus on `Lip Shape'.
We want to determine the characteristic lip shapes of phonemes as observed and directly
measurable from the recorded audiovisual corpus. In case the observed `Lip Shape' is different
from the expected classical categorization, the category is modified accordingly. This information
Figure 5.2: Lip Protrusion statistics. The phonemes of interest are framed: the `protruded' phonemes are {y, ø, œ, @, œ̃, u, o, õ, O, ã, w, 4}. The segments plotted in red, green and brown seem to violate the general pattern. The segments plotted in red correspond to the phonemes whose category was modified. The brown and green segments are those of phonemes whose statistics were recalculated with candidates without a `protruded' context. [Plot: LipProtrusion, from 5 to 30, one segment per phoneme: i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _]
more accurately. The expectation was that their synthesized visual speech component would
be more similar to the real visual speech after the changes. This modification of the phonetic
context should modify the visual target cost, which is a part of the target cost (TC). The visual
target cost of a phoneme (left or right phoneme of a diphone) is calculated by summing the
visual feature differences of the left and the right contextual phonemes.
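As a minimal sketch of this summation, assuming the two visual features named in this section (`Place of articulation' and `Lip Shape') and the binary feature cost described earlier (0 when the values are the same, 1 when they differ): the dictionary layout, the key names and the example feature values are hypothetical and only serve the illustration.

```python
# Visual context features considered in this sketch (key names are assumptions).
VISUAL_FEATURES = ("place_of_articulation", "lip_shape")


def binary_cost(target_value, candidate_value):
    """Binary feature cost: 0 if the two values are the same, 1 otherwise."""
    return 0 if target_value == candidate_value else 1


def visual_target_cost(target, candidate, features=VISUAL_FEATURES):
    """Sum the visual feature differences over the left and right contexts.

    `target` and `candidate` are dicts with "left" and "right" entries, each
    mapping a feature name to the categorical value of that contextual phoneme.
    """
    cost = 0
    for side in ("left", "right"):
        for feature in features:
            cost += binary_cost(target[side][feature], candidate[side][feature])
    return cost
```

Under this sketch, changing the `Lip Shape' category of a phoneme changes the cost of every candidate that has this phoneme as a left or right context, which is how a category modification would propagate into the selection.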
We performed a statistical analysis of the articulatory features. This set of articulatory
features included lip protrusion, lip opening, lip spreading and jaw opening (see Fig. 3.4) (Robert
et al., 2005). The statistics were calculated by considering the articulatory feature vectors at
the center of the phoneme articulation. This is also the place of concatenation in the visual
and acoustic domain. The statistics of the phonetic articulatory features are shown in Figures
5.1 to 5.4. We considered the mean, variance and the number of occurrences of each phoneme.
For any given phoneme, the lip shape can be either `Protruded' or `Spread', or might not have
any typical shape, in which case we classify it as `not protruded and not spread', which we refer
to simply as `none'. The range of articulatory feature statistics for each of these categories is
determined first. This depends on the pattern that the majority of phonemes belonging to each
Figure 5.3: Lip Opening statistics. [Plot: LipOpening, from 12 to 30, one segment per phoneme: i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _]
determined. We looked more closely at Lip Protrusion and Lip Spread, as the others are related.
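The per-phoneme statistics used in this analysis (mean, spread and number of occurrences of an articulatory feature measured at the phoneme center) could be computed along the following lines. This is a sketch only: the record layout, the key names and the use of the population standard deviation are assumptions, not details given in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phoneme_feature_stats(instances, feature):
    """Mean, standard deviation and occurrence count of one feature per phoneme.

    `instances` is an iterable of dicts (assumed layout) such as
        {"phoneme": "y", "center_features": {"lip_protrusion": 21.5, ...}}
    where "center_features" holds the articulatory feature vector taken at the
    center of the phoneme articulation.
    """
    values = defaultdict(list)
    for inst in instances:
        values[inst["phoneme"]].append(inst["center_features"][feature])

    stats = {}
    for phoneme, vals in values.items():
        m, s = mean(vals), pstdev(vals)  # population std (assumption)
        stats[phoneme] = {
            "mean": m,
            "std": s,
            "count": len(vals),
            # Segment as drawn in Figures 5.1 to 5.3: centered at the mean,
            # with a length of twice the standard deviation.
            "segment": (m - s, m + s),
        }
    return stats
```

Comparing the resulting segments across phonemes is one way to see which phonemes fall outside the range of their classical `Lip Shape' category.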
Typically, by classical phonetic knowledge, the set of phonemes {y, ø, œ, @, œ̃, u, o, õ, O, ã, w, 4} was classified as `protruded' and the set of phonemes {i, e, a, E, ẽ} was categorized as `spread'. All the other phonemes were considered as `not spread and not protruded' based on the shape of the lips. This categorization generally
holds. Nevertheless, we can observe that some phonemes need to be reconsidered. For this
purpose, and to be more accurate, the coarticulation effects of the surrounding phonemes should
be removed. In fact, if one of the neighboring phonemes is protruded, for instance, it is very
likely that the surrounded phoneme will be protruded too because of coarticulation, even if protrusion is not its main articulatory
characteristic. Therefore, for phonemes whose visual articulation
seemed to be different from their initial classification, their articulatory feature statistics were
recalculated by considering a subset of phoneme instances in the corpus. For example, the
phoneme /f/ seemed to be `spread', unlike its classical phonetic classification of `not spread'. phoneme /