Results - Learning phoneme kinematics using EMA data

Chapter 4 Phoneme Classification Based on Facial Data

4.2 Learning phoneme kinematics using EMA data

4.2.3 Results

The HMM training and alignment are done in exactly the same way as explained for the facial data. Two sets of HMMs are trained using the two feature sets extracted from EMA data. Only monophone HMMs were trained and used for segmentation. This is because of the coverage

tel-00927121, version 1 - 1 1 Jan 2014


[Image: sensor positions labeled I, B, H, L, A, D, F, G, C, K, E.]

Figure 4.5: EMA data acquisition: Location of sensors, frontal view. Lip opening and lip spread are given by the distances $\|\vec{EL}\|$ and $\|\vec{GK}\|$. Lip protrusion is given by the displacement of the center of gravity of the four points (E, G, K, L) along the normal vector to the plane formed by vectors $\vec{EL}$ and $\vec{GK}$. Figure adapted and modified from (Pfitzinger,

[Image: sensor positions labeled J, A, D, F, C, L, E, with x, y and z axes.]

Figure 4.6: EMA data acquisition: Sensor locations on the mid-sagittal plane. The following tongue related features are calculated: 1. Tongue tip movement, $\|\vec{AJ}\|$, 2. Horizontal displacement of the tongue, $(\|\vec{JF}\|)_x$, 3. Tongue shape, $(\|\vec{AD}\|)_{(x,z)}$ and $(\|\vec{AC}\|)_{(x,z)}$, 4. Tongue height, $(\|\vec{JF}\|)_z$. Figure adapted and


is used for the analysis. The segmentation results are obtained for the two sets of HMMs. The recognition errors are determined for each phoneme class for the segmentation predicted by the two HMM sets, along the same lines as explained for the facial marker data. The results, in comparison with those obtained by HMMs trained using features extracted from the facial marker data, are given in Figure 4.7. Facial data and EMA data differ in many respects besides the phonetic transcript, duration and coverage of phonemes. The other significant differences are the following. First, unlike facial data, where the articulation is completely uninhibited and natural, the effect of the presence of sensors on articulation cannot be completely ruled out. In addition, the facial deformation that happens during the articulation of speech cannot be completely captured through just 5 points (4 on the lips and 1 on the chin); in this respect facial data can be considered better. Besides, the trajectories of just 4 points on the tongue are captured, and parameters are extracted from them subsequently. This cannot capture the complexity of the articulatory deformation of the tongue. These differences and factors account for the marginal improvement with the addition of tongue related information, which is contrary to what one would expect. Broadly, the addition of tongue features improves the alignment results for most of the phonemes which do not fall in the category of visible phonemes (see Figure 4.2). For the phonemes which fall in the category of visible phonemes, rather predictably, the addition of tongue information does not improve the recognition.

Figures 4.8 to 4.11 give the start and end statistics of the phonemes based on the alignment results without and with tongue related data added to the articulatory features. Considering those phonemes for which the recognition errors have reduced with the addition of tongue data, the following observations can be made. For velars, the expectation of the acoustic to visual start difference is positive, i.e. $E(D_s) > 0$, which indicates the co-articulation effect on their left contextual phonemes. For alveolars and dentals, the variance of the difference in acoustic and visual start ($D_s$) has reduced. Besides, for the phoneme /l/, the difference in the acoustic and visual ends ($E(D_e) < 0$) shows an influence on the following phonemes. For other phonemes, these figures show that there is no significant change in the statistics with the inclusion of the tongue data. This can be accounted for by the recognition errors, which have not improved with the addition of tongue data.
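As a rough illustration of the statistics discussed here, the following sketch computes the mean and variance of the acoustic-minus-visual boundary difference per phoneme class, following the axis labels (As−Vs) and (Ae−Ve) of Figures 4.8 to 4.11. The function name, the tuple layout and the sample values are assumptions made for this example; they are not taken from the thesis data.

```python
from collections import defaultdict
from statistics import mean, pvariance


def boundary_difference_stats(segments):
    """Mean and variance of (acoustic - visual) boundary times per class.

    `segments` is an iterable of (phoneme_class, acoustic_ms, visual_ms)
    tuples, where both times refer to the same boundary (start or end).
    """
    diffs = defaultdict(list)
    for phoneme_class, acoustic_ms, visual_ms in segments:
        diffs[phoneme_class].append(acoustic_ms - visual_ms)
    # Population variance is used so that a class with one instance gets 0.
    return {c: (mean(d), pvariance(d)) for c, d in diffs.items()}


# Hypothetical start boundaries in ms, invented for illustration only.
starts = [("Vlr", 120.0, 100.0), ("Vlr", 340.0, 330.0), ("Alv", 200.0, 205.0)]
stats = boundary_difference_stats(starts)
```

A positive mean for a class, as in the invented `Vlr` entries, corresponds to the case $E(D_s) > 0$ described above.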

4.3 Conclusion

The results of segmentation using EMA data, which includes tongue related features, show only a marginal improvement in comparison with those obtained using facial features. This is in


[Plot: scale 0 to 0.4; legend Face:Art+PCA, EMA:Art, EMA:Art+tongue; phoneme classes sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V under Consonants and Vowels, plus TE.]

Figure 4.7: Forced alignment results using HMMs trained on different data-based features. Face:PCA+Art are the feature vectors extracted from the facial marker data, comprising the four articulatory features and the first 3 PCA coefficients. EMA:Art are the articulatory feature vectors extracted from the EMA data, and EMA:Art+tongue are the articulatory and tongue movement related feature vectors from the EMA data.


[Plot: (As−Vs) in ms, from −150 to 200, against phonemes grouped as sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V under Consonants and Vowels.]

Figure 4.8: Means and variances of the phoneme start differences calculated for the alignment based on articulatory parameters of EMA data.

[Plot: (Ae−Ve) in ms, from −250 to 250, against phonemes grouped as sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V under Consonants and Vowels.]

Figure 4.9: Means and variances of the phoneme end differences calculated for the alignment based on articulatory parameters of EMA data.


[Plot: (As−Vs) in ms, from −150 to 150, against phonemes grouped as sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V under Consonants and Vowels.]

Figure 4.10: Means and variances of the phoneme start differences calculated for the alignment based on both articulatory and tongue related parameters of EMA data.

[Plot: (Ae−Ve) in ms, from −200 to 150, against phonemes grouped as sil, B.L, L.D, Alv, Plt, Vlr, Uvl, U.V, R.V under Consonants and Vowels.]

Figure 4.11: Means and variances of the phoneme end differences calculated for the alignment based on both articulatory and tongue related parameters of EMA data.


based on these automatic segmentation results. This classification is used to analyze the perceptual evaluation results. It is useful for bringing out the correlation between objective and perceptual evaluation results, thus paving the way for better objective evaluation techniques.


Unit Selection

In the previous chapter we presented an overview of our text to acoustic-visual speech synthesis system called ViSAC. It synthesizes speech using unit selection and concatenation of speech segments from a pre-recorded speech corpus. Speech synthesis systems based on unit selection typically have three stages. For a given text to be synthesized, the NLP module first generates the specification of the required target phoneme sequence. The specification is then converted in terms of the synthesis unit. For example, the synthesis unit in the case of our system is the diphone. It is necessary that the target specification has all the important information which affects speech realization. Then, for each required target in the specification, all the candidates in the corpus are ranked based on a target cost function. This cost function is generally defined as the weighted sum of individual feature costs. At the end of this candidate ranking, for each required target in the specification, at most a fixed maximum number of candidates are pre-selected and the rest are pruned. This scenario of multiple possible candidates for each required target in the sequence defines a lattice. Finally, the sequence of those final candidates which optimizes a total cost function is selected for concatenation. This is done by resolving the lattice with the Viterbi algorithm. The total cost function is the weighted sum of the target cost and the concatenation costs.
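The three stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the ViSAC implementation: units are plain dictionaries, every feature cost is binary, and the names `target_cost`, `select_units`, `concat_cost`, `max_candidates`, `w_target` and `w_concat` are hypothetical.

```python
def target_cost(target, candidate, weights):
    """Weighted sum of binary feature costs (0 if equal, 1 if different)."""
    return sum(w * (target[f] != candidate[f]) for f, w in weights.items())


def select_units(targets, corpus, weights, concat_cost,
                 max_candidates=3, w_target=1.0, w_concat=1.0):
    """Pre-select candidates per target, then resolve the lattice (Viterbi)."""
    # Ranking and pruning: keep at most `max_candidates` per target.
    lattice = []
    for t in targets:
        matching = [c for c in corpus if c["unit"] == t["unit"]]
        if not matching:
            raise ValueError(f"no candidate for unit {t['unit']!r}")
        matching.sort(key=lambda c: target_cost(t, c, weights))
        lattice.append(matching[:max_candidates])

    # Viterbi: for each candidate keep the cheapest path that ends in it.
    best = [(w_target * target_cost(targets[0], c, weights), [c])
            for c in lattice[0]]
    for t, candidates in zip(targets[1:], lattice[1:]):
        step = []
        for c in candidates:
            cost, path = min(
                ((prev_cost + w_concat * concat_cost(prev_path[-1], c), prev_path)
                 for prev_cost, prev_path in best),
                key=lambda item: item[0])
            step.append((cost + w_target * target_cost(t, c, weights),
                         path + [c]))
        best = step
    return min(best, key=lambda item: item[0])


# Toy corpus, invented for illustration: two diphones, one binary feature.
targets = [{"unit": "b-a", "stress": 1}, {"unit": "a-l", "stress": 0}]
corpus = [
    {"id": 0, "unit": "b-a", "stress": 0},
    {"id": 1, "unit": "a-l", "stress": 0},
    {"id": 5, "unit": "b-a", "stress": 1},
    {"id": 9, "unit": "a-l", "stress": 1},
]
weights = {"stress": 1.0}


def concat_cost(left, right):
    """Assumed cost: free if the units were adjacent in the corpus."""
    return 0.0 if right["id"] == left["id"] + 1 else 1.0


total, path = select_units(targets, corpus, weights, concat_cost, w_concat=2.0)
selected_ids = [c["id"] for c in path]
```

With the concatenation weight set to 2, the search in this toy case prefers the two units that were adjacent in the corpus even though the first one has a feature mismatch, which illustrates how the two weighted costs trade off against each other.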

For all the three stages mentioned above, the `specification of targets' or `description of candidates' is crucial. This also shows that the target feature structure and the calculation of the target cost play a central role. In the pre-selection stage, it is necessary that the ranking given to the candidates present in the corpus is consistent with the ordering based on their perceptual suitability for any required target. This is also important to ensure that no good candidates get pruned. This depends on the target cost. Besides pre-selection, the target cost also influences the final selection of the candidate sequence from the lattice. The set of target features and their optimum weights, which define the target cost, decide the efficiency of the target cost function


and hence the synthesis performance. With respect to the target cost, the following two aspects need to be explored:

• Deciding the set of target features that will be used for target specification or candidate description.

• Tuning the weights of the target features to optimize the overall synthesis performance, for a given corpus.

In addition to the target cost, the concatenation cost also needs to be considered. The concatenation cost estimates the perceptual discontinuity due to the concatenation of two candidates. The calculation of the acoustic and visual concatenation cost in our system was explained in the previous chapter. The objective of unit selection is to obtain final synthesized speech which is perceptually similar to a (hypothetical) natural speech sequence rendered by the speaker. This requires at least continuous speech without perceptible discontinuities, and constituent speech segments which are locally suitable for each required target. This in turn requires an optimum combination of target and concatenation costs, which indicates the need to tune the total cost function besides optimizing the total cost.

This chapter deals with these different aspects of unit selection. In the following sections, we describe experiments that were performed with the objective of optimizing the synthesis results. We first give an account of the set of target features in section 5.1. In section 5.2, we detail experiments that were performed to modify target feature values or design new target features for the visual modality. In section 5.3, we explain a target cost tuning approach that we have developed, before concluding.

5.1 Target features

At the time of synthesis, targets are specified using a set of features, generally called target features. This set of target features is generally decided based on the linguistic and phonetic studies which explain various patterns in speech. Consequently, the classically used target features include linguistic, phonetic and prosodic context. Some of these features are relevant irrespective of the language and some might be language-specific. For example, unlike phoneme voicing, which is usually relevant irrespective of the language, the observation of the rhythm group (RG) pattern is relevant for French. This is because in French the end of the RG gives the position of the stressed syllable, which is usually the last syllable of the RG. Hence, the features related to RG that are relevant to French might not be relevant or equally important for other languages. For


on the text analysis. In the case of a text to be synthesized, the description of a target in terms of these features provides `abstract' information about speech. The target feature cost for a particular candidate is based on the feature value of the target and that of the candidate being considered. The expectation is that the same feature values account for a hypothetical similarity in the speech realization and hence also for the candidate suitability.

In our system, these features describe a phoneme at the various logical levels into which a sentence can be sub-divided (see Fig. 3.6). Some of the features are more specific to the French language. This set of features, especially the linguistic features, is predominantly generic and can be directly applied irrespective of the corpus being used. The set of linguistic features includes phoneme number in the syllable; syllable kind; syllable position in the rhythm group (RG) and sentence; syllable number in the word, RG and sentence; word position in RG and sentence; word number in RG and sentence; RG position in sentence; proximity of the nearest left and right silence; kind of sentence.

They have either finite integral values or categorical values, depending on the feature. These features are used to describe the characteristic of a target or a candidate, of a contextual (left/right) phoneme, or both. The phonetic features include, besides the phoneme identity, the list of features given in Table 5.1. Except for the phoneme identity, the phonetic features are used to define context (left and right phoneme). This set of generic target features, which are extracted through the text analysis, is augmented by additional corpus-based target features. This is done to take the speaker characteristics into account, which is important especially for the visual modality. Hence, the corpus specific features are designed mainly to account for the visual modality of speech.

5.2 Corpus based visual target features

In the previous section we described the set of generic target features, which are generally assumed to depend solely on text analysis. The set of target features related to phonetic context also belongs to this category. The phonetic context of any particular phoneme influences its articulation significantly. This is well known as coarticulation. The degree to which a phoneme influences its surrounding phonemes or is influenced by them varies (Löfqvist, 1990). The established phonetic knowledge regarding coarticulation holds almost all the time (Ladefoged, 1982; Ladefoged and Maddieson, 1995). Hence, these target features and their values for different phonemes are usually based on the characterization defined by phoneticians that is found in the literature, and their values are set based on the information extracted through text analysis.


Table 5.1: This set of features defines the phonetic context of a phoneme, target or candidate. These feature values describe either the previous or the following phoneme. The target feature costs for these features are binary valued functions taking either 0 or 1, based on whether the feature values being compared are the same or different, respectively.

Feature name    Possible values
Voicing         voiced, unvoiced

ences and idiosyncrasies. Because a recorded audio-visual corpus is used, any peculiar articulation of the speaker might be visually or acoustically perceived in the synthesized speech and introduce some incoherence. For example, let us assume that candidates are being looked up for a target phoneme whose left contextual phoneme is considered to have lip protrusion during its articulation. Then obviously, those candidates whose left contextual phoneme is considered to have lip protrusion during its articulation will get a higher ranking. If this target contextual phoneme is actually articulated differently and is not actually protruded, then selecting a candidate with a protruded left contextual phoneme might be inappropriate. It is well known that this kind of categorization might vary slightly from person to person (Johnson et al., 1993; Raphael and Bell-Berti, 1975; Maeda, 1989). Hence, if these feature values are inconsistent with the actual characteristics in the corpus, it will be visible in the synthesized speech. We have performed two experiments which aim at a phonetic context adaptation that is based on the characteristics observed in the corpus. They can be divided into the following two categories:

• Changing target feature values for some phonemes based on the articulatory characteristics estimated from the corpus. We refer to this approach as phonetic category modification.

• Replacing categorical phonetic target features by real valued target features to represent corpus specific characteristics. These features encode the same information accounted for by the categorical features, with higher precision. We refer to this approach as continuous


[Plot: one segment per phoneme, in the order i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _.]

Figure 5.1: Jaw Opening statistics. Each segment represents a phoneme, centered at the mean and its length being twice the standard deviation. The number of occurrences of each phoneme is presented.

In the following subsections, we describe these experiments. The modified feature values or introduced features are those which mainly characterize the visual modality of speech. Hence, we refer to them as visual target cost. The main goal is to see whether these experiments improve the performance of selection and consequently of synthesis. The objective evaluation results of these two methods are then presented in subsequent subsections.

5.2.1 Phonetic category modification

All the target features which provide the information related to phonetic context are categorical (see Table 5.1). The corresponding phonetic feature costs are binary: they take 0 when the target and candidate feature values are the same, and 1 when they are different. Among these target features, two features account for the patterns in visual speech animation. They are `Place of articulation' and `Lip shape during articulation'. We refer to the latter feature as `Lip Shape'. `Place of articulation' information is encoded only for labial phonemes, and their place of articulation is visibly unambiguous. Hence, we focus on `Lip Shape'.

We want to determine the characteristic lip shapes of phonemes as observed and directly measurable from the recorded audiovisual corpus. In case the observed `Lip Shape' is different from the expected classical categorization, the category is modified accordingly. This information


[Plot: Lip Protrusion (axis from 5 to 30), one segment per phoneme, in the order i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _.]

Figure 5.2: Lip Protrusion statistics. The phonemes of interest are framed: the `protruded' phonemes are {y ø œ @ œ̃ u o õ O ã w ɥ}. The segments plotted in red, green and brown seem to violate the general pattern. The segments plotted in red correspond to the phonemes whose category was modified. The brown and green segments are of those phonemes where statistics were recalculated with candidates without a `protruded' context.

more accurately. The expectation was that their synthesized visual speech component would be more similar to the real visual speech after the changes. This modification of the phonetic context should modify the visual target cost, which is a part of the target cost (TC). The visual target cost of a phoneme (left or right phoneme of a diphone) is calculated by summing the visual feature differences of the left and the right contextual phonemes.
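To make the effect of a category change on this cost concrete, here is a minimal sketch that assumes a single visual feature (`Lip Shape') with a binary cost. The category table is a small hypothetical excerpt, and moving /f/ to `spread' is used only as an illustrative modification; the names `LIP_SHAPE`, `binary_cost` and `visual_target_cost` are not from the thesis.

```python
# Hypothetical excerpt of a lip-shape table; a real one covers every phoneme.
LIP_SHAPE = {"y": "protruded", "u": "protruded",
             "i": "spread", "a": "spread",
             "f": "none", "t": "none"}


def binary_cost(target_value, candidate_value):
    """0 when the two categorical values agree, 1 otherwise."""
    return 0 if target_value == candidate_value else 1


def visual_target_cost(target_context, candidate_context, lip_shape=LIP_SHAPE):
    """Sum the lip-shape costs over the (left, right) contextual phonemes."""
    return sum(binary_cost(lip_shape[t], lip_shape[c])
               for t, c in zip(target_context, candidate_context))


# Target context (f, u) compared with a candidate whose context is (i, y).
cost_before = visual_target_cost(("f", "u"), ("i", "y"))

# Illustrative category modification: treat /f/ as `spread'.
modified = dict(LIP_SHAPE, f="spread")
cost_after = visual_target_cost(("f", "u"), ("i", "y"), modified)
```

In this toy case the candidate is penalized under the original table and is no longer penalized after the change, so its rank in the pre-selection would improve.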

We performed a statistical analysis of the articulatory features. This set of articulatory features included lip protrusion, lip opening, lip spreading and jaw opening (see Fig. 3.4) (Robert et al., 2005). The statistics were calculated by considering the articulatory feature vectors at the center of the phoneme articulation. This is also the place of concatenation in the visual and acoustic domain. The statistics of the phonetic articulatory features are shown in Figures 5.1 to 5.4. We considered the mean, the variance and the number of occurrences of each phoneme. For any given phoneme, the lip shape can be either `Protruded' or `Spread', or might not have any typical shape, in which case we classify it as `not protruded and not spread', which we refer to simply as `none'. The range of articulatory feature statistics for each of these categories is determined first. This depends on the pattern that the majority of phonemes belonging to each


[Plot: Lip Opening (axis from 12 to 30), one segment per phoneme, in the order i e E a e~ y 2 9 @ 9~ u o O o~ a~ w H j p b m f v t d n s z S Z J k g N R l _.]

Figure 5.3: Lip Opening statistics.

determined. We looked more closely at Lip Protrusion and Lip Spread, as the others are related.
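The per-phoneme statistics plotted in these figures can be computed along the following lines. This is a sketch under the assumption that each phoneme instance contributes one scalar feature value sampled at the center of its articulation; the function names and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def phoneme_statistics(instances):
    """Return {phoneme: (mean, standard deviation, occurrences)}.

    `instances` is an iterable of (phoneme, value) pairs, where `value`
    is one articulatory feature measured at the phoneme center.
    """
    values = defaultdict(list)
    for phoneme, value in instances:
        values[phoneme].append(value)
    return {p: (mean(v), pstdev(v), len(v)) for p, v in values.items()}


def plotted_segment(entry):
    """Segment centered at the mean, twice the standard deviation long."""
    m, s, _ = entry
    return (m - s, m + s)


# Hypothetical lip protrusion values, invented for illustration.
samples = [("u", 24.0), ("u", 26.0), ("i", 10.0)]
stats = phoneme_statistics(samples)
u_segment = plotted_segment(stats["u"])
```

Comparing such segments across phonemes is what allows a range to be associated with each lip-shape category.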

Typically, by classical phonetic knowledge, the set of phonemes {y ø œ @ œ̃ u o õ O ã w ɥ} was classified as `protruded' and the set of phonemes {i e a E ẽ} was categorized as `spread'. All the other phonemes were considered as `not spread and not protruded' based on the shape of the lips. This categorization generally holds. Nevertheless, we can observe that some phonemes need to be reconsidered. For this purpose, and to be more accurate, the coarticulation effects of the surrounding phonemes should be removed. In fact, if one of the neighboring phonemes is protruded, for instance, it is very likely that the surrounded phoneme will be protruded too because of coarticulation, even if protrusion is not its main articulatory characteristic. Therefore, for phonemes whose visual articulation seemed to be different from their initial classification, the articulatory feature statistics were recalculated by considering a subset of phoneme instances in the corpus. For example, the phoneme /f/ seemed to be `spread', unlike its classical phonetic classification of `not spread'.
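The recalculation on a subset of instances can be sketched as follows, assuming the subset is obtained by discarding instances whose left or right neighbor belongs to the `protruded' set. The instance layout, the reduced `PROTRUDED` set and the values are assumptions made for this example only.

```python
from statistics import mean

# Hypothetical subset of the `protruded' phonemes listed above.
PROTRUDED = {"y", "u", "o", "w"}


def context_free_mean(instances, phoneme, excluded=PROTRUDED):
    """Mean feature value of `phoneme` over instances without excluded context.

    `instances` holds (left, phoneme, right, value) tuples. Returns None
    when no instance of the phoneme survives the filtering.
    """
    kept = [value for left, current, right, value in instances
            if current == phoneme
            and left not in excluded and right not in excluded]
    return mean(kept) if kept else None


# Invented lip protrusion values for three instances of /f/.
instances = [("u", "f", "a", 20.0), ("a", "f", "i", 12.0), ("i", "f", "e", 14.0)]
filtered = context_free_mean(instances, "f")
unfiltered = mean(value for _, p, _, value in instances if p == "f")
```

Here the instance preceded by /u/ raises the unfiltered mean, while the filtered mean reflects only the instances whose context is not protruded.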
