École doctorale IAEM Lorraine
Département de formation doctorale en informatique

Synthèse Acoustico-Visuelle de la Parole par Sélection d'Unités Bimodales
(Acoustic-Visual Speech Synthesis by Bimodal Unit Selection)

THÈSE
pour l'obtention du
Doctorat de l'Université de Lorraine
(spécialité informatique)
présentée par
Utpala MUSTI

Composition du jury

Rapporteurs : Jean-Claude MARTIN - Professeur en Informatique, Université Paris-Sud
              Piero COSI - Senior Researcher, CNR, ISTC, Italie
Examinateurs : Catherine PELACHAUD - Directeur de recherche, CNRS-TELECOM ParisTech
              Bernd MÖBIUS - Professeur, Universität des Saarlandes
              Anne BOYER - Professeur, Université de Lorraine
              Yves LAPRIE - Directeur de recherche, CNRS-LORIA
              Vincent COLOTTE - Maître de conférences, Université de Lorraine
              Slim OUNI - Maître de conférences, Université de Lorraine

Laboratoire Lorrain de Recherche en Informatique et ses Applications — UMR 7503
tel-00927121, version 1 - 11 Jan 2014
To my daughter, Samyukta.
Audio-Visual Speech

Chapter 1 Audio-Visual Speech Synthesis: An Introduction
1.1 Face modeling and animation
1.2 Separate visual speech synthesis
1.3 Simultaneous synthesis of audio-visual speech
1.4 Conclusion

Chapter 2 Speech Synthesis Using Unit Selection: Literature Survey
2.1 Unit selection paradigm
2.2 Segmentation
2.3 Target cost function
2.3.1 Visual target features
2.3.2 Target feature weighting
2.3.3 Alternatives to conventional target cost function
2.4 Concatenation cost function
2.5 Evaluation
2.5.1 Objective automatic evaluation of acoustic and audio-visual speech
2.5.2 Human-centered evaluation of acoustic and audio-visual speech
2.6 Conclusion

Chapter 3 Acoustic-Visual Speech Synthesis System: An Overview
3.1 Corpus preparation
3.1.1 Text selection
3.1.2 Acquisition
3.1.3 Data processing and parameter extraction
3.1.4 Segmentation
3.1.5 Bimodal speech database
3.2 Bimodal speech synthesis
3.2.1 Natural language processing
3.2.2 Target unit description
3.2.3 Bimodal unit selection and concatenation
3.3 Visual speech rendering
3.4 Conclusion

Chapter 4 Phoneme Classification Based on Facial Data
4.1 Visual speech segmentation using facial data
4.1.1 Recognition error
4.1.2 Forced alignment results
4.2 Learning phoneme kinematics using EMA data
4.2.1 Data acquisition
4.2.2 Feature extraction
4.2.3 Results
4.3 Conclusion

Chapter 5 Unit Selection
5.1 Target features
5.2 Corpus-based visual target features
5.2.1 Phonetic category modification
5.2.2 Continuous visual target cost function
5.2.3 Objective evaluation of synthesis results
5.3 Target feature selection and weight tuning
5.3.1 Unit selection and concatenation
5.3.2 Target feature selection and weight tuning
5.3.3 Application to AV target cost function tuning
5.3.4 Analysis of selected features and their relative importance
5.4 Conclusion

Chapter 6 Evaluation
6.1 Objective evaluation
6.1.1 Objective evaluation based on comparison of two signals
6.1.2 Objective evaluation based on statistical analysis and thresholds
6.2 Human-centered evaluation
6.2.1 Intelligibility tests
6.2.2 Quality evaluation tests
6.3 Analysis of perceptual evaluation for better objective metrics
6.4 Conclusion

Chapter 7 Conclusion

Publications

Appendices

Bibliography
Enacted or animated stories are more popular than audio narrations or printed ones. It is easy to conclude that this is due to their audio-visual nature, which provides a rich experience. Beyond entertainment, we generally perceive everything through our ears and eyes simultaneously. The visual information perceived through the eyes either complements or reinforces the auditory information. This applies to speech as well, which is one of the prime modes of communication. Speech perception in day-to-day life is primarily bimodal: we see and hear what is being spoken, and understand the speech if it is in a known language. Whenever the auditory input is ambiguous or noise-ridden, we try to supplement the received information by looking at the source, i.e., the speaker. This bimodal nature of speech is illustrated by the observation that we humans prefer face-to-face conversation when discussing issues of high importance. This is because face-to-face communication conveys the complementary information related to speech articulation and emotions more effectively than voice alone. Hence, bimodal speech can be considered more effective in confidence building. Besides entertainment and communication, the basic milestone towards verbal communication, i.e., speech development in babies, also draws significantly on the observation of visual speech along with the corresponding sound (Teinonen et al., 2008; Andersen et al., 1984).

Some of the above general observations about the advantages of audio-visual speech over acoustic-only speech have been experimentally verified. It has been shown that the addition of visual speech enhances speech detection and recognition, thus improving intelligibility when the audio is missing, degraded by noise, or mixed with multiple sources of speech (Sumby and Pollack, 1954; Ouni et al., 2007; Summerfield, 1979; Schwartz et al., 2004). The evaluation of visual speech intelligibility by Le Goff et al. (1994) shows that a natural face presented without audio, or with degraded audio, restores two-thirds of the acoustic intelligibility, while a facial model without a tongue and a lip-only model restore half and one-third of it, respectively. Speech presented along with facial animation has been observed to be a preferred interface over voice-only presentation, and has been shown to increase the interactive experience of users (Pandzic et al., 1999).
These advantages of audio-visual speech over acoustic speech indicate its vast application possibilities. It has been widely used in entertainment and e-commerce for developing virtual agents. These applications do not necessarily need high accuracy of speech articulation. Other applications, however, require accuracy comparable to that of natural audio-visual speech. These include applications for pedagogical activities, for example, virtual language tutors for e-learning and teaching speech articulation to the hearing impaired (Massaro, 2006). It can also be used to develop virtual announcers for public places, which are usually noisy.

Considering the preceding discussion, audio-visual speech synthesis is a significant domain to pursue. However, the advantages of natural bimodal speech can be realized through synthesized audio-visual speech only if the latter is comparable to the former. This is because humans have implicit expectations of audio-visual speech based on their learning and experience of face-to-face communication. These expectations relate to the temporal alignment and coherence between the acoustic and visual modalities. For instance, while hearing sounds like /p/, we expect a closure of the lips just in time before the onset of that sound. Similarly, we expect to hear a high-pitched voice in a conversation where somebody is seen to be in extreme fear. This means that the synthesized audio-visual speech must have acoustic and visual streams that are temporally synchronous and coherent with each other.
A majority of approaches to audio-visual (AV) speech synthesis synthesize the facial animation over the speech acoustics, and then perform additional processing to synchronize the two wherever necessary. This is based on the assumption that AV speech synthesis is a set of two different problems, which are then addressed sequentially by synthesizing visual speech over synthesized speech acoustics. There are two problems with this approach. To begin with, synchronizing two streams synthesized separately is not straightforward. Humans are extremely sensitive to any asynchrony between the audio and the speech animation. In fact, this sensitivity to discriminate synchronous speech from asynchronous speech develops very early in human infancy, with a significant preference for synchronous speech (Dodd, 1979). Results from (Grant and Greenberg, 2001, 2004) show that human speech perception is extremely sensitive to any lag in the visual domain compared to the audio, unlike the other way around. It has also been observed that this asynchrony causes a surge in the intelligibility of asynchronous audio-visual speech. Moreover, this also brings in the issue of inconsistency between the visual and acoustic domains, which might cause discomfort (Mattheyses et al., 2009). This inconsistency can also affect the final perception of the audio-visual speech, as illustrated by some of the experimental data in (Green and Kuhl, 1989, 1991). These experimental results show that the perception of place and […] acoustic modality. The worst case, where the perception of AV speech can be highly affected, is the McGurk effect (McGurk and MacDonald, 1976). In fact, when different facial animation and acoustics are presented synchronously, subjects may experience a fusion or combination effect. The fusion effect is seen, for example, when a visual /g/ is presented synchronously with an acoustic /b/: the result is perceived as /d/. Similarly, when a visual /b/ is presented synchronously with an acoustic /g/, it is perceived as /bg/, which is an example of the combination effect. This indicates that synthesizing audio-visual speech by separating the synthesis of the two modalities might not always ensure the best result in terms of synchrony and coherence of the two modalities. In general, simultaneous processing of acoustic and visual speech has been shown to yield advantages with respect to audio-visual integration that are not available with independent processing (Chen and Rao, 1998).
To ensure perfect alignment and coherence between the acoustic and visual modalities, we advocate synthesizing audio-visual speech simultaneously, treating the two modalities as a single entity. In this thesis, we present our method for audio-visual speech synthesis based on this principle. We base our speech synthesis on the unit selection paradigm, and perform simultaneous synthesis of the acoustic and visual modalities by concatenating bimodal units. In doing so, we keep the natural association between the two modalities intact, as the visual and acoustic modalities belong to the same speech segment. It should be emphasized that this approach implicitly addresses the above-mentioned issues of asynchrony and incoherence. This work can be considered a crucial first step towards a comprehensive talking head. Our main focus is to synthesize the audio-visual speech dynamics accurately; the result is not a complete talking head yet. Our facial representation is limited to a sparse mesh describing the outer surface of the face, including the lips. The audio-visual speech does not include information related to the internal articulators, like the tongue and teeth, or other components necessary for expressive speech. In the course of this work, we first studied the bimodal speech corpus that we acquired by designing and analyzing visual speech segmentation experiments. Then, we developed the basic system implementing our idea of bimodal unit concatenation. Building on this basic bimodal unit-selection framework, we developed methodologies to improve the bimodal synthesis. In our work, we address the following problems: (1) unit selection taking both acoustic and visual considerations into account, which can drastically increase the complexity; and (2) weight tuning, which is a difficult problem in speech synthesis. To this end, we developed corpus-specific visual target costs and an iterative target feature weighting algorithm. Finally, we performed perceptual and subjective evaluation experiments […]
This thesis is organized as follows. We begin by reviewing the field of audio-visual speech synthesis in chapter 1. In this chapter, we discuss the ways in which the face has been modeled and animated. We also discuss the various approaches to audio-visual speech synthesis based on separate or joint synthesis of the two modalities. Our speech synthesis system is built on the generic paradigm of unit selection, which is the topic of chapter 2. We review the literature related to several aspects of unit selection, including the segmentation performed during corpus preparation. The various building blocks of selection are also examined: target description, and target and concatenation costs. Finally, we review the ways of evaluating synthesized speech. In chapter 3, we present our work, first providing an overview of our audio-visual speech synthesis system. The chapter also details our audio-visual corpus recording and database preparation for the synthesis system. The resulting audio-visual database is an interesting resource which can be used for studying various phonemes. As a first step in this direction, we have performed segmentation of the visual data. We describe these segmentation experiments, their results, and the analysis of these results in chapter 4. In chapter 5, we detail the different strategies that we developed to optimize our system, including the design of new visual target features and target feature weighting. Finally, in chapter 6, we present the objective evaluation, the perceptual evaluation, and the analysis done to bring out the relation between the two. We conclude in chapter 7 and outline our future work.
Audio-Visual Speech Synthesis: An Introduction
In this chapter, we look at some of the earlier approaches to audio-visual speech synthesis. For any speech, acoustic or audio-visual, to be synthesized from text, the underlying phoneme sequence corresponding to the text has to be specified first. Given this specification, various approaches can be followed for AV speech synthesis. Firstly, these approaches can be divided based on whether the visual and acoustic modalities are synthesized separately or simultaneously. Secondly, the synthesis of the acoustic or visual modality, in the case of separate synthesis, can be divided based on the synthesis paradigm: rule-based, articulatory, or concatenative (Theobald, 2007). Thirdly, the approaches can be classified based on their facial rendering technique: 3D modeling of the face or image-based rendering.

In a rule-based synthesis system, the well-known representative characteristics of speech are simulated using predefined rules. Articulatory synthesis, in contrast, simulates the natural process of speech production using models of human anatomy. For instance, air flow is simulated through a controlled model of the human vocal tract, and the skin of the face is deformed using bones and muscles. Concatenative speech synthesis is performed by concatenating segments of recorded human speech, generally called a corpus. It can be put into a broader category called corpus-based speech synthesis, which also includes HMM-based speech synthesis. HMM-based synthesis depends on learning the patterns of speech parameters from a given corpus, which are then used to generate speech parameters. The concatenative approach is like memorizing the whole data, and then accessing the memory at the time of synthesis.

In the following sections, we focus on audio-visual speech synthesis. First, we briefly describe the facial rendering techniques (section 1.1). Then, we discuss the approaches which synthesize the acoustic and visual modalities separately and simultaneously, in sections 1.2 and 1.3.
1.1 Face modeling and animation
The face has been encoded and presented in two ways for the purpose of facial animation. The first approach is 3D modeling of the face. The outer surface of the face is modeled using a mesh of connected polygons. These polygons are made of predefined edges connecting a set of 3D point vertices. Changes in the 3D point locations, and the consequent changes in the mesh, account for the deformations of the face. The first 3D facial model was developed by Parke (Parke, 1972, 1975, 1982). In this model, the 3D points were defined and controlled by a set of parameters. These parameters were conceptually divided into two distinct sets (functionally they might overlap): conformation parameters and expression parameters. The conformation parameters were the ones which define the dimensions of the 3D face. That is, if 3D faces are modeled based on real human subjects, for instance, then the conformation parameters define the basic 'differentiating' dimensions of that particular human face. These included parameters like the aspect ratio of the face (height to width) and relative sizes specifying the forehead, eye separation, nose height, cheek, chin, etc. The expression parameters were those which described mainly the movements of the eyes and mouth. They included deformations like jaw rotation, width of the mouth, and position of the upper lip and corners of the mouth. These deformations might be related to speech or to emotional expressions. From these two categories of parameters, the positions of the 3D points on the face were determined using different types of operations, applied independently to some regions or to the whole face. The eyes were controlled by specific procedures; the other operations included interpolation, rotation, translation, and scaling. The final rendering was done through Phong interpolation (Phong, 1975) based on the parameter specifying the direction of the light source. There are many virtual characters which are descendants of this Parke model (Cohen and Massaro, 1993; Beskow, 1995; Olives et al., 1999). These descendants of Parke's model feature various additions to improve the appearance of the face and the animation, like the addition of the tongue, ears, or the back of the head, and the addition of control parameters. The advantage of this kind of parametric model is that the whole mesh is specified using a small set of parameters. Parke's parametric model differs from some other parametric models, which are based on modeling the underlying anatomical structure, like bones, muscles, skin, and the forces acting on them (Waters and Terzopoulos, 1990; Waters, 1987; Lee et al., 1995; Ekman and Friesen, 1978). This kind of modeling has been observed to be computationally intensive (Bailly et al., 2003). Some talking heads which present emotional facial animations are based on pseudo-muscle contractions (Cosi et al., 2003; Pelachaud et al., 2001). MPEG-4 standardizes the parametric models by defining a minimum set of 84 feature points (FPs) located on the face.
The animation of such models is controlled by a set of parameters called facial animation parameters (FAPs) (Ostermann, 1998).
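To make the parametric idea concrete, the sketch below shows how a single expression parameter (here a jaw-rotation angle) can drive one region of a vertex mesh while the rest stays fixed. The toy mesh, the region split, and the pivot point are invented for this illustration; they are not Parke's actual parameter set.

```python
import math

# Toy mesh: a few 3D vertices; the last two belong to the "jaw" region.
vertices = [(0.0, 1.0, 0.0), (0.5, 0.5, 0.1), (0.0, -0.5, 0.1), (0.3, -0.6, 0.0)]
jaw_region = {2, 3}           # vertex indices deformed by the jaw parameter
jaw_pivot = (0.0, 0.0, -0.5)  # rotation axis (parallel to x) through the jaw joint

def apply_jaw_rotation(verts, angle_deg):
    """Rotate jaw-region vertices about the x-axis through jaw_pivot; leave the rest."""
    a = math.radians(angle_deg)
    out = []
    for i, (x, y, z) in enumerate(verts):
        if i in jaw_region:
            dy, dz = y - jaw_pivot[1], z - jaw_pivot[2]
            y = jaw_pivot[1] + dy * math.cos(a) - dz * math.sin(a)
            z = jaw_pivot[2] + dy * math.sin(a) + dz * math.cos(a)
        out.append((x, y, z))
    return out

# One scalar parameter repositions the whole jaw region of the mesh.
opened = apply_jaw_rotation(vertices, 15.0)
```

The point of the example is the economy Parke's model exploits: a full mesh state is reconstructed from a handful of parameter values, rather than stored vertex by vertex.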
Besides 3D modeling of the face, the second approach for representing a face is through the usage of facial images, most often images of real people. Hence, image-based approaches are generally data-driven. Facial animations using images are generated in two ways. First, they can be generated by interpolating a few specific images that are representative of the typical articulation of visually identical phonemes, called visemes (Ezzat and Poggio, 1998). Alternatively, they can be generated by concatenating image sequences (Bregler et al., 1997; Cosatto et al., 2000).

The image-based approaches present more realistic faces. This is because of their proximity to the real facial appearance, which is often described as being photo-realistic. However, this way of encoding or presenting a face is most often limited to a straight-head frontal view. Besides, the storage of images incurs a significantly higher memory requirement than the storage of a few parameter trajectories. On the other hand, the 3D-model-based approach is flexible in terms of the views and head orientations in which a face can be rendered, but an additional processing step is required to add the internal articulators, like the tongue and teeth, to render the complete articulatory information. It is possible to augment the 3D model by adding textural information to make the final facial animation flexible and comparatively photo-realistic (Elisei et al., 2001). Another alternative for modeling the face is the morphable models presented in (Cootes et al., 1998; Blanz and Vetter, 1999). These models also embed both geometric and texture-related information to present a relatively photo-realistic and flexible facial model.
1.2 Separate visual speech synthesis
Conventionally, AV speech synthesis is considered as two separate problems: the generation of speech acoustics, and the generation of facial animation for a given speech acoustics (real or synthesized). Consequently, it has been performed by synthesizing the two modalities separately. Facial animation is generated over a given speech acoustics, which is either synthesized or recorded. This approach requires additional processing to correct the alignment between the two modalities in the case of concatenative visual speech synthesis (Bregler et al., 1997). We refer to the facial animation related to speech as visual speech. We focus here on the visual speech synthesis stage, considering the acoustic speech already available. Two concepts which surface in the discussion of visual speech are visemes and coarticulation. In the following paragraphs, we first explain these two concepts before going ahead with the synthesis techniques.
Visemes: Visible speech articulation presents similarities for many phonemes. Based on this similarity, phonemes can be grouped into sets, and these sets are defined as visemes. The viseme is the fundamental unit in the context of visual speech (Fisher, 1968). For example, the perception of visual speech while the phonemes in the set {p, b, m} are being articulated is almost the same; hence, they belong to one viseme set. In the current discussion, by viseme we mean a sequence of visual speech parameters describing a complete segment, rather than static targets. By contrast, we refer to a single sample of these parameters, describing a snapshot of a particular target face, as a 'key frame'. The visual speech parameters can be image frames, or trajectories of control parameters or of 3D points on the face. This many-to-one mapping of visual speech makes the separation of visual speech synthesis from acoustic speech synthesis advantageous: the system becomes more concise due to the reduction in the number of distinct units. In the case of concatenative visual speech synthesis, this increases the number of possible candidates.
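The many-to-one phoneme-to-viseme mapping can be sketched as a simple lookup. The {p, b, m} bilabial class comes from the text above; the other classes below are invented groupings for illustration, not the thesis's actual viseme inventory.

```python
# Illustrative many-to-one phoneme-to-viseme map. Only the {p, b, m}
# bilabial class is taken from the text; the rest are assumed examples.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
}

def viseme_sequence(phonemes):
    """Collapse a phoneme sequence into its viseme classes."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

def candidates(viseme, corpus):
    """All corpus segments whose phoneme label falls in the given viseme class.

    In a concatenative visual synthesizer, every recorded instance of any
    phoneme in the class is a candidate unit for that viseme.
    """
    return [seg for seg in corpus if PHONEME_TO_VISEME.get(seg, "other") == viseme]
```

This is the conciseness argument made above in miniature: collapsing classes shrinks the set of distinct unit types while enlarging the candidate pool per type.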
Coarticulation: Coarticulation is the phenomenon in which the articulation of a phoneme is influenced by the articulation of the neighboring phonemes. Synthesized visual speech needs to represent coarticulation accurately. In the case of parametric 3D facial models, the parameters for animating them have been generated taking coarticulation into account using rules (Beskow, 1995; Pelachaud et al., 1994) or mathematical coarticulation models (Öhman, 1967; Cohen and Massaro, 1993; Cosi et al., 2002). Beskow (1995) mentions that each phoneme has a target vector specifying the typical articulatory gestures. These target vectors are under-specified for some phonemes and are interpolated based on the context to account for coarticulation. Pelachaud et al. (1994) divide phonemes into clusters based on their deformability in different contexts. Phonemes with lower deformability serve as the key frames for coarticulation. Öhman (1967) accounts for the changes during the transformation of a V1CV2 (vowel-consonant-vowel) sequence. Cohen and Massaro (1993) implement Löfqvist's gestural theory, where phonemes are specified with target feature vectors. Coarticulation is defined as the superimposition of time-varying dominance functions describing different articulators. These dominance functions are negative exponential functions which peak at the target feature vectors. This coarticulation model was further augmented by Cosi et al. (2002) with the addition of resistance functions. These resistance functions ensure that some specific target configurations are attained by suppressing the dominance of neighboring phonemes. This is especially important for phonemes like labials and bilabials. Beskow (2004) reports an experimental comparison of various approaches to account for coarticulation. He reports that the mathematical model proposed by Cohen and Massaro (1993) performs well in comparison with the real data, whereas, with respect to intelligibility, rule-based techniques perform better. These models can be optimized through hand-tuning or […] (Elisei et al., 2001). Ezzat et al. (2002) also perform tuning of a coarticulation model through statistical learning on a recorded corpus. Their coarticulation model is similar to that of Cohen and Massaro (1993), but instead of using motion data, they used an image-based corpus for tuning their model.
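The dominance-function idea just described can be sketched numerically: each phoneme contributes a target value weighted by a negative-exponential dominance peaking at its own time, and the rendered parameter is the normalized weighted average. This is a simplified one-articulator, one-dimensional reading of the Cohen-Massaro model; the functional form and constants below are assumptions for illustration.

```python
import math

def dominance(t, center, alpha=1.0, theta=20.0):
    """Negative-exponential dominance, peaking at the phoneme's center time.

    alpha and theta (magnitude and decay rate) are illustrative constants.
    """
    return alpha * math.exp(-theta * abs(t - center))

def blended_parameter(t, segments):
    """Weighted average of per-phoneme targets; segments = [(center_time, target), ...]."""
    num = sum(dominance(t, c) * target for c, target in segments)
    den = sum(dominance(t, c) for c, _ in segments)
    return num / den

# Two phonemes with scalar targets 0.2 and 0.9 (e.g. a lip-opening parameter),
# centered at 100 ms and 250 ms:
segs = [(0.10, 0.2), (0.25, 0.9)]
trajectory = [blended_parameter(i / 100.0, segs) for i in range(0, 35)]
```

Near each phoneme's center its own dominance wins and the trajectory approaches that target; between centers the targets blend smoothly, which is exactly how the superimposed dominance functions encode coarticulation.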
Corpus-based approaches:
Instead of using explicit coarticulation models, coarticulation can be implicitly encoded in the synthesized visual speech. This is done in corpus-based approaches. Firstly, the complete trajectories of visual speech parameters can be generated using models like HMMs, which are trained on real data (Brand, 1999; Masuko et al., 1998). In this case, the HMM can be modeled as a triphone, which describes a phoneme in the required phonetic context. Alternatively, the complete sequences of visual speech parameters from real motion capture data can be stored and concatenated for synthesis (Minnis and Breen, 2000). In this approach, coarticulation is encoded through the synthesis unit, like a triphone or diphone.
In the case of concatenative approaches, the visual speech database has to be prepared. Besides acquisition, the corpus needs processing to annotate the individual units in terms of their phonetic labels, segment boundaries, and information related to the geometric properties of the faces, for ensuring smooth transitions at the concatenation points. One of the concatenative approaches, for dubbing applications, is presented in Bregler et al. (1997). They prepare the visual database by phonetically segmenting an unconstrained video sequence. This segmented video is annotated to include information on the orientation of the head and the shape and position of the mouth. They use eigenpoints to estimate the fiduciary points on the face (mouth, teeth, chin and jawline) using 26 hand-annotated images. The synthesis is done by the concatenation of triphone video clips. The synthesized mouth sequences are then morphed onto the background video sequence. The resulting video sequence is compressed or stretched to time-align with the target audio between phoneme boundaries.
The synthesis described in (Cosatto et al., 2000) is based on the concatenation of variable-length video sequences of mouth images (and also of other facial parts). The database is described in terms of 3D geometric features of the head and appearance features extracted by Principal Component Analysis (PCA). They further subdivide the facial parts into cheeks, teeth, tongue, jaw, etc., to make the synthesis more flexible. The final synthesis is done by overlaying bitmaps of the facial parts present in the database onto a background video, as in (Cosatto and Graf, 1998). There are other similar image-based concatenative approaches, such as (Weissenfeld et al., […]), who use Locally Linear Embedding (LLE) to describe the appearance parameters of the mouth images, unlike Cosatto and Graf (1998), who use PCA. Liu and Ostermann (2009) use PCA to extract appearance parameters and Active Appearance Models (AAM) to extract the geometric parameters of the face (lip width, lip height, visibility of teeth and tongue). A similar approach, but based on a parametric 3D facial model, is presented in (Ma et al., 2006). In this approach, the control parameters extracted from recorded 3D facial marker data are concatenated using unit selection. The resultant trajectories are used to animate virtual conversational agents.
Some approaches combine HMMs with concatenation for visual speech synthesis. One such work is presented in (Lijuan et al., 2010). It is an image-based approach where the selection process is guided by the trajectory of lip movements generated by trained HMMs. These HMMs are trained on the AV speech corpus. This approach is similar to an earlier work by Govokhina et al. (2006), in which phonetically aligned trajectories of 3D facial markers are selected based on the trajectories generated by trained HMMs. A hybrid unit-selection and HMM-based approach for visual speech synthesis is presented in (Edge et al., 2009). This work uses the selected units to train state-based models and searches through these learned models with a Viterbi-type algorithm. The similarity in speech acoustics (acoustic parameters) was used to guide the unit selection. The final sequence of state-based models is used to generate smooth trajectories for visual speech. Bailly et al. (2009) describe a system which generates articulatory gestures (control parameters) for a video-realistic (image-based) facial animation using HMMs. They incorporate a phasing model to learn the lag between visual gestures and the corresponding speech acoustics. They compare this HMM-based technique, which includes the phasing model, with three other techniques: (1) concatenation of articulatory gestures selected based on the phonetic context, (2) concatenation of articulatory gestures based on selection guided by the phasing-model-based HMM, and (3) trajectories generated by HMM models trained on audio-synchronized articulatory gestures. They conclude that the phasing-model-based HMMs improve the synthesis.
Almost all of these works report lip-synchronization problems. Bregler et al. (1997) report that plosives were observed to have occasional lip-synchronization problems; Cosatto and Graf (2000) report lip synchronization being criticized in subjective evaluation. Geiger et al. (2003) present the perceptual evaluation of the synthesis approach presented in (Ezzat et al., 2002). They report that the synthesized audio-visual speech is not comparable to natural audio-visual speech to the extent that is required for developing applications for teaching language or […]
1.3 Simultaneous synthesis of audio-visual speech
The potential applications of audio-visual speech hinge not only on the accuracy of the synthesized visual speech, but also on the extent to which the acoustic and visual streams agree with each other in terms of synchrony and coherence. It is obvious from the previous section that, with separate synthesis of the acoustic and visual modalities, these conditions are not always guaranteed. In this section, we look at approaches which synthesize audio and visual speech simultaneously. The central mechanism of all these approaches is to keep the association between the visual and acoustic modalities, thereby preserving the natural synchrony and coherence. The majority of approaches in this category are based on the concatenation of synchronous bimodal units. One approach, presented by Tamura et al. (1999), uses HMM models trained on synchronous audio-visual speech data to generate bimodal speech parameters. However, it should be said that this approach was still at a rather preliminary level, as the generated visual speech parameters were related only to the lip contours.

The concept of synchronous bimodal unit concatenation for Swedish AV speech synthesis has been presented in (Hallgren and Lyberg, 1998). The visual speech information is recorded as trajectories of 3D markers all over the face, especially around the lips. The recorded marker information is used to control a 3D model of the head. This head model is further textured to make it look more natural.
Two recent image-based approaches that use concatenation of bimodal units are (Fagel, 2006; Mattheyses et al., 2009). In (Fagel, 2006), AV speech synthesis is done for German by concatenating synchronous bimodal polyphone segments. This was done with a 4-minute corpus consisting of bimodal speech: video of speech aligned with the corresponding phonetic transcript. The selection of polyphone segments for concatenation was based on a concatenation cost calculated as a weighted sum of acoustic and visual concatenation costs. The pre-selection of possible polyphone segments from the corpus is based on chunks (the longest polyphone segments available in the corpus), and the visual join cost calculation is based on the pixel-to-pixel color differences in the end frames of the video clips to be concatenated. Hence, it is quite clear that the synthesis incurs a large overall processing time. In (Mattheyses et al., 2009), the conventional unit-selection technique, which has been widely used for acoustic speech synthesis, is extended to perform AV speech synthesis. This is done by including an additional join cost term for visual join discontinuities. Their system is similar to the one explained in (Liu and Ostermann, 2009) in terms of the visual features extracted and used to describe the facial geometry and appearance. These methods, like any image-based technique, incur a high storage requirement when compared […]
1.4 Conclusion
In this chapter, we have discussed various techniques to model the face, based on either a 3D or an image-based representation, along with the pros and cons of each technique. Further, we have examined approaches to AV speech synthesis based on either the sequential synthesis of the two modalities (synthesizing facial animation after acoustic speech synthesis) or their simultaneous synthesis, and we have highlighted the disadvantages of the former. Consequently, we are in favor of synchronous, data-driven synthesis of audio-visual speech, and our approach follows this line of synthesis. As can be seen in chapter 3, our approach uses a unit-selection paradigm to synthesize both visual and acoustic modalities simultaneously. In the following chapter, we present a survey of various aspects of unit selection, and then we introduce our system in chapter 3.
Chapter 2
Speech Synthesis Using Unit Selection:
Literature Survey
Speech synthesis is a well-established field of research with significant progress in the past three decades. Though synthesized speech is getting closer to human speech, it is still far from being considered a solved problem. In addition, we are still away from a perfect all-purpose speech synthesizer. This is true for both acoustic-only and audio-visual speech. Among the synthesis techniques, concatenative techniques have become very popular in recent times. These methods have been widely used and have evolved for acoustic synthesis. Nevertheless, the paradigm is generic and has been extended to visual or audio-visual speech synthesis. In early concatenative acoustic synthesis, few instances of each diphone were stored in the inventory. The synthesis specification included the prosodic description related to the duration and pitch of targets in the sentence to be synthesized. At the time of synthesis, these diphones were modified using signal processing techniques to bring in the changes related to prosody and then concatenated. This kind of intensive signal processing done on the waveform distorts its naturalness. The advantage of this system was the small size of the diphone inventory, which was a necessary requirement at the time of its usage. Moreover, in spite of the signal processing, such a system does not account for all the variations of speech accurately.
As computer storage is getting cheaper and faster, it has become possible to store huge speech databases many times larger than the earlier small inventories of diphones. The usage of a huge corpus makes it possible to include a large set of candidate diphones with large variability in their waveforms. Moreover, it is even possible to have longer synthesis units than a diphone. In fact, it is even possible to search for whole sentences or big chunks of sentences. This drastically reduces the need to process the speech signal. Consequently, the resultant speech preserves the naturalness of the original speech, as the speech segments are concatenated with
little to no signal processing.
Nevertheless, the usage of a large speech corpus brings different problems. A large variance in the synthesis candidates means that selection has to be done carefully, to synthesize speech which is similar to a natural utterance. This is the classical unit selection problem. We discuss some of the issues of unit selection techniques, and the approaches that have been applied to resolve them. In the following sections, we first give a brief introduction to the emergence of the unit selection framework and its basic paradigm (in section 2.1). In section 2.2 we give a short description of the segmentation techniques used in corpus preparation, then a description of the pre-selection of candidates and the conventional target cost formulation based on the independent feature space assumption and its tuning (in section 2.3). Next (in section 2.4), we give a brief account of the ways join evaluation techniques have been analyzed for their correlation with human perception of discontinuity when non-contiguous units are concatenated. Finally (in section 2.5), we deal with the objective and perceptual evaluation methodologies that are generally employed to estimate and sometimes qualify a text-to-speech synthesis (acoustic or audio-visual) for its use in a specific domain.
2.1 Unit selection paradigm
Unit selection depends on the selection of the best possible set of units from the different variants available in the corpus. Consequently, the first requirement is to have a corpus that not only has a good coverage of the possible speech variants, but which is also comparatively small to keep the search time short (Möbius, 2000). Given a particular speech corpus, the quality of the speech synthesized using unit selection depends on how the corpus is used. Many factors affect the synthesis results. For example, concatenation of units can be said to be the most obvious reason for audible disruption, and many initial systems were based on the reduction of concatenation points (Sagisaka, 1988). In (Sagisaka, 1988), the selection of the longest segments is given preference, and concatenation at certain locations, like at CV (consonant-vowel) boundaries or in the middle of vowels, is penalized. Alternatively, when it is not possible to avoid concatenation of non-contiguous units, minimization of distortion at the concatenation point minimizes the quality degradation (Takeda et al., 1990; Iwahashi et al., 1992). Besides reducing the concatenation of non-contiguous units, there are other necessary factors that need to be considered. For example, the phonetic context of the selected unit and the speech realization of the unit itself seem important (Takeda et al., 1990; Iwahashi et al., 1992).
The search procedure proposed in (Hunt and Black, 1996) for unit selection offers a unified, possibly optimal solution to the selection-concatenation problem. For a sequence of candidate units u and a sequence of required target units t, the paradigm presented by Hunt and Black (1996) optimizes a total cost function which is a weighted sum of the following:

• The perceptual suitability of u for t, which is called the target cost, denoted by $TC(t, u)$.

• The total discontinuity at all the concatenation points, called the join cost, denoted by $JC(u)$.

Denoting the weights of the target cost and the join cost by $w_{tc}$ and $w_{jc}$ respectively, the search over a given corpus for the final sequence of candidates returns the candidate sequence which minimizes the total cost $C$, as shown below:

$$C = \min_{u} \left[ w_{tc}\, TC(t, u) + w_{jc}\, JC(u) \right] \qquad (2.1)$$

Here, the pre-selection of units is based on same-size units, like phones or diphones, for each target position. This pre-selection is based on the target cost determining the suitability of the candidate and its context. Also, in this general framework, the selection of the longest contiguous candidates is enforced implicitly by making the individual join cost for any two contiguous units in the corpus zero (Balestri et al., 1999). This has the advantage of taking into account the variability of speech realization, besides reducing the concatenation artifacts, in the selection of the best possible set of candidates. In contrast, some methods explicitly search for the longest contiguous units for concatenation, called non-uniform unit selection, where the units sought for concatenation are not of the same size or type (Taylor and Black, 1999; Boëffard, 2001; Schweitzer et al., 2003). This is different from the earlier paradigm, which is implicitly non-uniform unit selection, as there might be many contiguous segments of variable size in the final synthesized speech. Clark et al. (2004) give a good description of the practical aspects of building a unit-selection based speech synthesizer. Taylor (2009) gives a comprehensive overview of the different approaches addressing various aspects of unit-selection based speech synthesis. Our approach is based on the first paradigm, which is an implicit non-uniform unit selection.
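The minimization of the total cost in (2.1) is typically carried out with a Viterbi-style dynamic programming search over the lattice of candidates. A minimal sketch of such a search follows; the cost functions and candidates are toy placeholders, not those of any system discussed here:

```python
def unit_selection(targets, candidates_per_target, target_cost, join_cost,
                   w_tc=1.0, w_jc=1.0):
    """Viterbi search for the candidate sequence minimizing
    sum_i w_tc*TC(t_i, u_i) + sum_i w_jc*JC(u_{i-1}, u_i)."""
    n = len(targets)
    # best[i][k] = (accumulated cost, backpointer) for candidate k at position i
    best = [{} for _ in range(n)]
    for k, u in enumerate(candidates_per_target[0]):
        best[0][k] = (w_tc * target_cost(targets[0], u), None)
    for i in range(1, n):
        for k, u in enumerate(candidates_per_target[i]):
            tc = w_tc * target_cost(targets[i], u)
            # cheapest predecessor, accounting for the join cost
            cost, prev = min(
                (best[i - 1][j][0]
                 + w_jc * join_cost(candidates_per_target[i - 1][j], u), j)
                for j in best[i - 1])
            best[i][k] = (tc + cost, prev)
    # backtrack from the cheapest final state
    k = min(best[-1], key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = [k]
    for i in range(n - 1, 0, -1):
        k = best[i][k][1]
        path.append(k)
    path.reverse()
    return [candidates_per_target[i][k] for i, k in enumerate(path)], total
```

Note that making the join cost zero for naturally contiguous units, as described above, would simply be a property of the `join_cost` function passed in.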
Extending unit selection to audio-visual speech synthesis
In the majority of AV speech synthesis approaches, visual speech is synthesized over an available acoustic speech that is either synthesized or real. In the case of visual or audio-visual speech synthesis using unit selection, the selection of segments has to be done considering the requirements
in the target cost function, and also additional join criteria in the join cost function to account for the discontinuities related to the visual modality.
2.2 Segmentation
It is obvious that unit selection depends on a speech database. Segmentation is one of the steps of this database preparation, in which recorded speech is divided into phonetic segments by demarcating their temporal boundaries. These phonetic segments constitute the basic building blocks for synthesis. Speech segmentation without any other specifier conventionally refers to acoustic speech segmentation. Though the most accurate option is manual segmentation (Cosi et al., 1991; Ljolje and Riley, 1993; Ljolje et al., 1997), it is time-consuming, laborious and hence costly. For this reason, automatic speech segmentation is considered a good alternative. The most popular and widely used technique for automatic speech segmentation is to force an HMM-based phonetic speech recognizer to recognize the speech according to a given phonetic transcript. The demarcation of phonetic boundaries is a result of this forced recognition, which is conventionally called forced alignment. This alignment technique has avoided the need for manual alignment to some extent and is also considered good enough for the HMM training that is required in speech recognition. But segmentation needs to be more accurate for concatenative speech synthesis, especially for systems which are based on concatenation at phoneme boundaries. Consequently, various methods have been used for the further refinement of the phonetic segment boundaries (Toledano et al., 2003). Some of the recent works use a combination of segmentation methods to derive multiple time marks to arrive at a more accurate segmentation (Kominek and Black, 2004; Park and Kim, 2007).
For concatenative visual or AV speech synthesis, generally the boundary time-marks determined by the acoustic speech segmentation of an audio-visual corpus are used while defining the candidates in the corpus (Bregler et al., 1997; Hallgren and Lyberg, 1998; E. Cosatto et al., 2000). This way of segmentation is widely followed and has practically been shown to work for visual speech synthesis. Nevertheless, this is not in accordance with the underlying principle of speech production. The speech articulators have to be ready with the target configurations required for the production of a sound (phone) for it to happen. That is, the start and end in the visual and acoustic modalities may not necessarily be the same. Some works have tried to learn this time lag between the acoustic and visual modalities by adding phasing models (Govokhina et al., 2007; Bailly et al., 2009). These phasing models are arrived at through an iterative process involving HMM learning, forced alignment of trajectories of articulatory gestures, and comparison with the acoustic
works through recognition of the speech segment; it provides an interesting tool to study the unique characteristics of phonemes. We exploit this idea to characterize phonemes (Chapter 4).
2.3 Target cost function
Measuring the suitability of a candidate in the corpus for a target position in the speech to be synthesized is a necessary step in unit selection. The efficiency of a target cost function in ranking and pre-selecting candidates also affects the probability of a good join and thus the quality of the synthesized speech. Generally, the target and the candidate are defined in terms of factors which are known to account for the variation in speech realization, based on phonetic and linguistic studies. These factors are at an abstract level and are not directly expressible quantitatively in terms of the actual speech parameters. They are referred to as high-level features. These features can take either non-negative integer values or can be categorical. They might include:
• Phonetic features like the phonemic identity of the current unit and the neighboring units (context), type of phoneme (vowel, consonant), voicing of phoneme (voiced, unvoiced), manner of articulation, etc.

• Linguistic features like the position of a syllable at various levels (word, rhythm group, sentence, etc.); the position of a word in a rhythm group or sentence; the type of sentence, etc. These features generally account for the various suprasegmental prosodic patterns. Some of the features in this category might be language specific.
The target feature set can also include features that are based on the statistical analysis of speech-related parameters extracted from the corpus, which are referred to as low-level features. For example, some systems use prosody prediction models that mainly provide the duration and pitch specification of the segments to be selected. These prosody prediction models are trained on a real corpus. This helps in reducing the number of high-level target features needed to describe prosody (Latacz et al., 2010). The low-level target features are also used to speed up the pre-selection by reducing the search space (Black and Taylor, 1997).
Many systems use a target feature set which consists mostly of high-level features (Hunt and Black, 1996; Coorman et al., 2000; Latacz et al., 2010). Some systems use high-level target features exclusively, to allow the automatic selection of candidates with suitable prosodic characteristics rather than prediction based on prosodic models (Prudon and d'Alessandro, 2001;
individual feature costs. Three kinds of target feature costs have been generally used (Coorman et al., 2000):

1. Categorical distance measures: where the distance is either a binary-valued or non-negative integer-valued function between categorical features.

2. Scalar distance measures: non-negative real-valued functions for features like duration, F0, etc.

3. Vector distance measures: distance calculations for multi-dimensional features, like the acoustic and visual feature vectors.
Categorical distance measures are calculated for the high-level target features, while the other two are based on the low-level features. For AV speech synthesis, the set of target features has to be augmented to include information regarding speech realization in the visual modality. Besides the target feature description, the weighting of the features of a given target set in the order of their relative importance is crucial for selection. These aspects are presented in the following two sections. Besides the conventional target cost, alternatives have been proposed, which we review in subsection 2.3.3.
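As an illustration of how such individual feature costs combine into a single weighted target cost, a minimal sketch follows; the feature set, distance functions and weight values are invented for illustration and do not correspond to any particular system:

```python
import math

def categorical_cost(a, b):
    """Binary-valued distance between categorical features."""
    return 0.0 if a == b else 1.0

def scalar_cost(a, b):
    """Non-negative real-valued distance for features like duration or F0."""
    return abs(a - b)

def vector_cost(a, b):
    """Euclidean distance for multi-dimensional (e.g. visual) feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def target_cost(target, candidate, weights):
    """Weighted sum of the individual feature costs."""
    cost = 0.0
    cost += weights["phoneme"] * categorical_cost(target["phoneme"], candidate["phoneme"])
    cost += weights["duration"] * scalar_cost(target["duration"], candidate["duration"])
    cost += weights["visual"] * vector_cost(target["visual"], candidate["visual"])
    return cost
```

The choice of the weights in this sum is exactly the tuning problem addressed in subsection 2.3.2.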
2.3.1 Visual target features
For visual speech synthesis, many of the high-level target features used are those which describe the visual or audio-visual target. These features might include typical articulatory characteristics like lip closures in bilabials. They might also include rate-of-speech related characteristics. Besides features which are equally important for visual and acoustic speech realization (e.g., place of articulation), or those which account more for the acoustic realization (e.g., voicing), there are some features which are more important for describing a visual target (e.g., the shape of the lips during the articulation of a phoneme). Many of the concatenative AV speech synthesis systems use a visual target cost based on the similarity of two phonemes in terms of visible facial deformations, as described below.
In (Bregler et al., 1997), a categorical phoneme context distance is used for the selection of the triphone, which accounts for the visual target cost. Phonemes with the same label are assigned a cost of 0, phonemes belonging to two different viseme classes are assigned 1, and different phonemes of the same viseme class are assigned a cost between 0 and 1, derived from the confusion matrices described in (Owens and Blazek, 1985).
In (E. Cosatto et al., 2000), a viseme distance matrix is used for the calculation of the target cost
domain, irrespective of the differences in the acoustic domain. The selection of the visual segment is based on the duration and phonetic label of the target segment, which are obtained from the acoustic speech. Each target frame is specified in terms of the phonetic annotation of a window of frame sequences consisting of some fixed number of frames, including itself, to account for context. The window length is different for each phoneme. The candidate selected is the one with the most proximate context, as measured by the target cost. The target cost weight vector is based on the exponentially decaying influence inspired by (Cohen and Massaro, 1993). Weissenfeld et al. (2005) use a similar visual target cost, where the difference matrix is populated using the Euclidean distance in the visual feature space. It is based on the assumption that each phoneme can be described by its mean visual feature vector, which is speaker and corpus specific. In (Mattheyses et al., 2010), a similar visual target cost calculated from the corpus is included. The difference matrix that is calculated represents the inter-phoneme visual distances based on the mean and variance of the visual parameters at the middle of the phoneme units present in the corpus. These kinds of cost functions, which are calculated for a specific corpus, do not guarantee optimum performance for any other corpus in general.
2.3.2 Target feature weighting
Target cost tuning involves the determination of the relative importance of the target features and assigning weights to the individual target feature costs to be used for target cost calculation. Ideally, it is done in such a way that the ordering of candidates based on the target cost corresponds to their perceptual suitability as a target. Since the synthesized speech has to be at least acceptable, intelligible and near-natural for human listeners, some system tuning techniques are based on human listening tests (Coorman et al., 2000; Alías et al., 2004). Listening tests are time-consuming and require human subjects, which makes them practically costly. Moreover, the scope of this kind of tuning is limited to a small set of sentences, and hence it cannot guarantee consistent synthesis results. It becomes further difficult when the set of target features is large. Hence, automatic weight tuning has been applied in many of the works (Hunt and Black, 1996; Meron and Hirose, 1999; Park et al., 2003; Alías and Llorà, 2003; Colotte and Beaufort, 2005; Latacz et al., 2010).
The target feature weighting techniques can be divided into two categories: (1) joint weight tuning of the concatenation and target feature cost functions, either at the individual unit level by using pairs of synthesis units or at the sentence level; (2) separate weight tuning of the target and concatenation cost functions, generally by tuning the target feature costs at the
included for selection is treated as the target, and selected or synthesized from the corpus. The target and the selected units are compared using objective distance measures to perform the tuning.
One of the two techniques presented by Hunt and Black (1996), called `weight space search' (WSS), belongs to the first category of weight tuning. It is based on the usage of targets from real sentences held out from the synthesis database for training. The weight tuning is done by searching the weight space in such a way that the waveforms of the synthesized sentences and those of the real sentences are similar. The weight space search is limited to a finite set of weight combinations, and the best weights among the searched combinations are chosen for defining the target cost function. This method is computationally very expensive in the case of a large number of features and a large set of possible target feature cost values. Meron and Hirose (1999) presented acceleration techniques for WSS based on partial synthesis and comparison. Alías and Llorà (2003) performed target tuning by using a genetic algorithm for the weight space search. The advantage of this is that the search space is randomized and the search evolves towards better weight combinations, unlike in the former works, where a fixed finite set of combinations was searched. Latacz et al. (2010) also present an automatic weighting technique for tuning target features and concatenation costs together. In their technique, the ordering given by the weighted sum of target cost and concatenation cost, and the ordering given by an acoustic distance metric, are compared. A selection error is calculated based on the mismatch in this ordering. They refer to this technique as Minimum Selection Error training. Further, they propose that the sets of weights obtained for all the candidates treated as targets be clustered using decision trees.
One of the techniques which performs target feature weighting separately from concatenation cost weighting is based on multiple linear regression (Hunt and Black, 1996). Using this method, the target feature weights for each phoneme in a language's phoneme set are tuned separately, to come up with different target costs for different phonemes. Each of the candidates in the database is considered as a target in turn, and the n most similar candidates are selected from the phoneme's candidate set, leaving the target out. The ordering of candidates for the pre-selection of the n candidates is based on an objective distance measure. The target weights are determined using linear regression such that the target cost predicts the objective distance measure. Meron and Hirose (1999) presented a way to extend this regression training (RT) to weight the target features and concatenation costs together, using target pairs instead of single targets. They also propose clustering of phonetic contexts by using a decision tree to split the phoneme pairs into different clusters. This is done with a phonetic contextual question which
Each target feature accounts for variations in speech and their duration. Based on the discriminative information accounted for by each of the features, they have been weighted in Colotte and Beaufort (2005). The acoustic representations of the units of a particular phoneme were divided into clusters through the K-means algorithm, using the Kullback-Leibler divergence as the similarity index. The weight of a feature is based on its discriminative information between the different clusters. This is applied to all the phonemes in the phoneme set of the language separately. Another approach to weight tuning is to view unit selection as a classification problem (Park et al., 2003), in which, instead of defining an objective function to account for the subjective speech quality, the classification error is taken as the objective function to be optimized. It is difficult to compare these methods in terms of their synthesis results. There are many factors which vary in these approaches, like the speech corpus, test sentences, evaluation methodologies, etc. Hence, it is not straightforward to judge their relative performance.
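The regression-training idea discussed in this section can be sketched as an ordinary least-squares problem: each row of the design matrix holds the individual feature costs for one target/candidate pair, and the regressand is the objective distance measure the target cost should predict. The data and shapes here are synthetic placeholders, not the setup of any cited system:

```python
import numpy as np

def fit_target_weights(feature_costs, objective_distances):
    """Least-squares fit of weights w such that feature_costs @ w
    approximates the objective distances, in the spirit of
    regression-based target feature weight training."""
    X = np.asarray(feature_costs, dtype=float)
    y = np.asarray(objective_distances, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

In a per-phoneme setup, such a fit would be repeated for every phoneme's candidate set separately.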
2.3.3 Alternatives to the conventional target cost function
The target cost put forth by (Hunt and Black, 1996) was a weighted sum of individual feature costs (differences). Whenever a candidate with the exact target feature description is not available, the candidate selected for synthesis based on this simple formulation for measuring target-candidate similarity, or rather dissimilarity, might not always reflect actual human perception. The following two cases need a little more consideration: (1) where a candidate with the required exact feature description is not available, but a candidate with a speech realization similar to the required one, yet with a different feature description, is available; (2) where neither a candidate with the exact feature description nor one with a similar speech realization is available, in which case the best possible alternative(s) have to be selected. To consider the speech realization of candidates besides the target combination alone, alternative approaches for target cost calculation have been proposed which base the selection on the perceptual similarity estimated through acoustic distances (Taylor, 2006). The main idea behind the proposed method is to obtain a representation of the segment to be selected in terms of the low-level features by using the high-level features. This was done by clustering the candidates of a particular phoneme using acoustic distances and using decision trees to choose a cluster for unit selection (Taylor, 2006).
2.4 Concatenation cost function
It is known that acoustic speech quality degrades due to the concatenation of non-contiguous speech segments. Also, studies have shown that considering the spectral smoothness at the
et al., 1992). This holds for visual speech as well. Hence, any abrupt jump in the visual speech sequence can create perceptual discomfort and confusion. Consequently, the focus on the reduction of concatenation artifacts arguably dates back to the onset of concatenative speech synthesis itself. Especially in unit-selection based speech synthesis, there is wide variability in the candidates for each required target. This results in a large variance at the concatenation points as well, such as in the middle of a phone when the diphone is the synthesis unit. Good concatenation is important not only for good synthesis quality, but also for intelligibility (Clark et al., 2007).

While designing good concatenation strategies for unit selection, different approaches have been followed. Candidate preference for concatenation is based on the observation that naturally contiguous units automatically join well. Hence, all systems give preference to contiguous units in the corpus, besides considering important phonetic and prosodic characteristics. In fact, some systems go further and search for the longest possible units from the corpus, so as to reduce the number of concatenation points (Schweitzer et al., 2003). Since it is infeasible to have naturally contiguous speech in the corpus for every target sequence to be synthesized, various join optimization techniques have been developed.
The most widely followed approach for concatenation is to minimize the differences at the concatenation points. This strategy is based on the observation that huge differences in the waveforms at the concatenation points account for perceptible degradation. Various distance metrics calculated using various acoustic parameters have been explored for estimating the perceptual degradation due to joins. Cepstra, line spectral frequencies, log area ratios, mel-frequency cepstral coefficients, multiple centroid analysis (MCA) coefficients and linear predictive coding coefficients are a few of them. Euclidean, absolute, Kullback-Leibler and Mahalanobis are some of the distance measures explored. Given these many alternatives, it becomes necessary to base the join difference estimation on those measures that correlate well with human perception. Hence, there have been many attempts to evaluate the parameter and distance measure combinations, to rank them based on their correlation with the human perception of join discontinuity. Some of these works ask listeners to evaluate joins on a 5-point MOS scale and compare these scores with the distances calculated using various metrics and acoustic parameters (Wouters and Macon, 1998; Vepa et al., 2002, 2004; Donovan, 2001; Bellegarda, 2004). In some other works, the comparison between human perception and distance metrics is based on the detection of a join, i.e. a binary score (Klabbers and Veldhuis, 1998, 2001; Stylianou and Syrdal, 2001; Pantazis et al., 2005). The results presented in the various works do not agree much with each other. The Kullback-Leibler divergence has been reported to perform well with different parameters in some of the
reported between the objective distance measures and the perceptual evaluation results is 0.66, which has been deemed low. Hence, the choice of any particular speech parameterization and distance measure does not ensure an accurate estimate of the perceptual disruption at the join.
While trying to reduce the join disruption due to concatenation, naturally contiguous units can be used to determine the set of units which can naturally join well. This can be based on their proximity to naturally good joins, i.e., contiguous units in the corpus. The work done by Vepa and King (2003) can be considered to be in this direction. In their work, the natural evolution patterns in the acoustic parameters are learned from the corpus, and used as the basis for the evaluation of a join and for defining a join cost function. Naturally contiguous speech samples are never perceived as discontinuous, though they are seldom exactly the same. From this observation, it can be concluded that humans are insensitive to a slight disruption at the concatenation point. This has been used as a basis for the formulation of the evaluation of joins by Coorman et al. (2000). They describe a masking function to evaluate a join. Consequently, below a certain transparency threshold the join cost is zero.
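A join cost combining a spectral distance with such a masking threshold might look as follows; the parameter vectors and the threshold value are assumptions for illustration, not the settings of any cited system:

```python
import math

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between the parameter vectors (e.g. cepstra)
    of the two frames adjacent to a concatenation point."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))

def join_cost(frame_a, frame_b, transparency=0.1):
    """Join cost with a masking threshold: discontinuities small enough
    to be imperceptible contribute nothing (threshold is illustrative)."""
    d = spectral_distance(frame_a, frame_b)
    return 0.0 if d <= transparency else d
```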
Irrespective of the distance between two concatenation points, it has been observed that join disruption is not perceived uniformly across all phonetic contexts. In other words, the perceptual degradation of speech is higher in some phonetic units and contexts than in others. Syrdal (2001, 2005) reports a systematic study of human sensitivity to disruption in various contexts; a summary of the results presented is as follows: discontinuities are perceived more with female-voice based speech synthesis than with male-voice based speech synthesis, more in vowels than in consonants, more in diphthongs than in other vowels, and more in sonorant phonemes than in non-sonorants. They also reported a comprehensive list of join discontinuity detection rates (%) based on the phoneme type. This shows that phonemic context is important, and that concatenation in certain contexts or phonemes is less preferable than in others; hence phoneme-independent handling of concatenation strategies might not be the best.
Concatenation of audio-visual units
All the salient points considered for acoustic unit concatenation are equally applicable to visual or audio-visual unit concatenation. Here, the way the distances are calculated for units at concatenation points depends on the visual features. For example, in (Bregler et al., 1997), a distance to measure the difference in lip shapes in the overlapping segments of adjacent triphones is included to account for the concatenation cost. It is calculated as the Euclidean distance (frame-by-frame) between four-element feature vectors of articulatory features: outer-lip-width,
decided based on the place of least difference in the lip shapes. In (E. Cosatto et al., 2000), the visual concatenation cost has two components: the skip cost and the transition cost. The skip cost is a penalty for any two frames which are not contiguous in the corpus, calculated based on the ordering of frames in the corpus; it is 0 for any two naturally contiguous units or frames. The transition cost is calculated based on the visual distance between the two frames. It is calculated as the Euclidean distance of the two PCA feature vectors extracted based on appearance. Similarly, in (Ma et al., 2006), two frames are given zero concatenation cost when they are contiguous in the original corpus; for those frames which are not contiguous, it is calculated as the sum of a minimum constant value and a variable component calculated from the frames. The variable component in turn has two components, one of which is calculated based on the distance between the two frames. The second component of this variable concatenation cost ensures that the visemic transitions in the synthesized speech and the original corpus are the same. For example, two frames i and j can be concatenated if the preceding frame of j belongs to the same visemic label as that of i. The trajectories at the joins are made smooth by applying a low-pass filter and cubic splines. In (Fagel, 2006), the video join cost calculation is based on the pixel-to-pixel color differences in the border frames of the segments to be concatenated (computationally expensive).
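The skip-cost plus transition-cost scheme can be sketched as follows; the frame representation (corpus index plus PCA-like vector) and the penalty value are illustrative assumptions:

```python
import math

def visual_concat_cost(frame_i, frame_j, skip_penalty=1.0):
    """Skip cost plus transition cost. Frames are (corpus_index, feature_vector)
    pairs; naturally contiguous frames cost nothing, otherwise a constant
    penalty plus the Euclidean distance of the appearance vectors is charged."""
    idx_i, feat_i = frame_i
    idx_j, feat_j = frame_j
    if idx_j == idx_i + 1:  # naturally contiguous in the corpus
        return 0.0
    transition = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_i, feat_j)))
    return skip_penalty + transition
```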
2.5 Evaluation
We have considered various aspects of unit-selection based speech synthesis. In this section, we present the ways of evaluating synthesized speech. This is necessary for exploring different approaches to improve synthesis quality, in which case changes need to be quantified, and for the comparative evaluation of different synthesis systems. These changes can be related to selection, concatenation and overall system tuning. As synthesized speech is targeted at human perception, the most accurate way to evaluate synthesized speech is perceptual evaluation by human subjects. In spite of its accuracy, automatic evaluation is often done instead, by comparing synthesized speech with a reference speech. This reference is generally recorded real speech which is not included in the corpus. This comparison is quantified using objective evaluation metrics. In the following, we present the objective evaluation metrics and then the perceptual evaluation by human subjects. The evaluation of synthesized speech by human subjects is done in two
2.5.1 Objective automatic evaluation of acoustic and audio-visual speech
Various distance measures have been proposed for comparing real and synthesized speech signals. For example, the cepstral distance is used as a distance measure in many works for acoustic speech (Hunt and Black, 1996; Meron and Hirose, 1999; Alías and Llorà, 2003). (Latacz et al., 2010) used constituent distance measures for duration, F0 and spectrum. Objective evaluation of audio-visual speech is generally done based on an independent objective evaluation of the visual and acoustic modalities. Alternatively, the objective evaluation of only one modality is sometimes performed, based on the focus of the analysis. For instance, in (Huang et al., 2002) only the synthesized visual speech is evaluated. It was done using three objective evaluation metrics. These were developed for estimating the precision (naturalness) and smoothness of visual speech, and the synchronization between the acoustic and visual modalities. Firstly, precision was estimated using the sum of the Euclidean distances between the real and synthesized sentences, calculated on visual parameters. Secondly, smoothness was estimated using the sum of the Euclidean distances calculated between adjacent frames in the synthesized speech which come from non-contiguous locations in the corpus. Lastly, audio-visual synchronization was estimated based on the phonetic labels of the synthesized frames. For this, only a few important phonemes were considered, belonging to one of the following two categories. The first category was of those phonemes which have a change in the direction of the mouth movement, i.e., from closing to opening or vice versa. The second category included those phonemes which have maximal mouth shapes, like open or closed mouths. Similarly, the Euclidean distance measure has been used by others (Weissenfeld et al., 2005).
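The precision and smoothness metrics described above can be sketched as follows; the visual parameter vectors and corpus indices are placeholders for illustration:

```python
import math

def euclid(a, b):
    """Euclidean distance between two visual parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def precision_error(real_frames, synth_frames):
    """Sum of frame-wise Euclidean distances between the real and
    synthesized visual parameter trajectories."""
    return sum(euclid(r, s) for r, s in zip(real_frames, synth_frames))

def smoothness_error(synth_frames, corpus_indices):
    """Sum of Euclidean distances between adjacent synthesized frames
    that come from non-contiguous locations in the corpus."""
    total = 0.0
    for k in range(1, len(synth_frames)):
        if corpus_indices[k] != corpus_indices[k - 1] + 1:
            total += euclid(synth_frames[k], synth_frames[k - 1])
    return total
```

Lower values mean closer-to-real and smoother trajectories, respectively; both are error measures, not similarity scores.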
Instead of comparing real and synthesized speech, Liu and Ostermann (2009) use the average target cost, the average segment length and the average visual difference between frames as the objective evaluation metrics, and minimize them during total cost tuning. This is based on the assumption that the average target cost is representative of the lip synchronization (audio-visual synchronization), and that the other two metrics represent the smoothness of the speech animation. But finally, for evaluating the weights resulting from the tuning process, the cross-correlation coefficient between the PCA coefficients of the synthesized and real sentences was calculated, to represent the subjective quality of the synthesized visual speech. Similarly, (Bailly et al., 2009) report the comparison of different articulatory gesture prediction techniques using the correlation coefficient between the original and predicted gestures. For the objective evaluation of the synthesized visual speech, Ma et al. (2006) use the average errors of the normalized articulatory parameters (lip height, lip width, lip protrusion) between the original and synthesized speech. Though these techniques