Acoustic-Visual Speech Synthesis by Bimodal Unit Selection

(1)

HAL Id: tel-01749331

https://tel.archives-ouvertes.fr/tel-01749331v2

Submitted on 11 Jan 2014

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Selection

Utpala Musti

To cite this version:

Utpala Musti. Acoustic-Visual Speech Synthesis by Bimodal Unit Selection. Machine Learning

[cs.LG]. Université de Lorraine, 2013. English. �NNT : 2013LORR0003�. �tel-01749331v2�

(2)

´

Ecole doctorale IAEM Lorraine

D´

epartement de formation doctorale en informatique

Synth`

ese Acoustico-Visuelle de la Parole par

S´

el´

ection d’Unit´

es Bimodales

(Acoustic-Visual Speech Synthesis by Bimodal Unit Selection)

TH`

ESE

pour l’obtention du

Doctorat de l’Universit´

e de Lorraine

(sp´

ecialit´

e informatique)

pr´esent´ee par

Utpala MUSTI

Composition du jury

Rapporteurs :

Jean-Claude MARTIN

- Professeur en Informatique, Universit´e Paris-Sud

Piero COSI

- Senior Researcher, CNR, ISTC, Italie

Examinateurs :

Catherine PELACHAUD

- Directeur de recherche, CNRS-TELECOM ParisTech

Bernd M ¨

OBIUS

- Professeur, Universit¨

at des Saarlandes

Anne BOYER

- Professeur, Universit´e de Lorraine

Yves LAPRIE

- Directeur de recherche, CNRS-loria

Vincent COLOTTE

- Maˆıtre de conf´erences, Universit´e de Lorraine

Slim OUNI

- Maˆıtre de conf´erences, Universit´e de Lorraine

(3)

(4)

(5)

(6)

Audio-Visual Spee h 7

Chapter 1 Audio-Visual Spee h Synthesis: An Introdu tion 11

1.1 Fa e modeling and animation . . . 12

1.2 Separate visual spee h synthesis . . . 13

1.3 Simultaneous synthesis of audio-visual spee h . . . 17

1.4 Con lusion. . . 18

Chapter 2 Spee h Synthesis Using Unit Sele tion: Literature Survey 19 2.1 Unit sele tion paradigm . . . 20

2.2 Segmentation . . . 22

2.3 Target ost fun tion . . . 23

2.3.1 Visualtarget features . . . 24

2.3.2 Target featureweighting . . . 25

2.3.3 Alternatives to onventional target ostfun tion . . . 27

2.4 Con atenation ost fun tion. . . 27

2.5 Evaluation . . . 30

2.5.1 Obje tive automati evaluation ofa ousti andaudio-visual spee h . . . . 31

2.5.2 Human- entered evaluationof a ousti and audio-visual spee h . . . 32

Chapter 3 A ousti -Visual Spee h Synthesis System: An Overview 37 3.1 Corpus preparation . . . 38

3.1.1 Text sele tion . . . 38

3.1.2 A quisition . . . 39

3.1.3 Datapro essing andparameter extra tion . . . 40

3.1.4 Segmentation . . . 43

3.1.5 Bimodalspee h database . . . 44

3.2 Bimodal spee h synthesis . . . 44

3.2.1 Naturallanguage pro essing . . . 44

(7)

3.2.3 Bimodalunitsele tion and on atenation . . . 46

3.3 Visual spee h rendering . . . 50

Chapter 4 PhonemeClassi ation Based on Fa ial Data 55 4.1 Visual spee h segmentation using fa ial data . . . 55

4.1.1 Re ognition error . . . 57

4.1.2 For edalignment results . . . 57

4.2 Learning phoneme kinemati s using EMA data . . . 62

4.2.1 Dataa quisition . . . 62

4.2.2 Feature extra tion . . . 63

4.2.3 Results . . . 63

Chapter 5 Unit Sele tion 71 5.1 Target features . . . 72

5.2 Corpus based visual target features . . . 73

5.2.1 Phoneti ategory modi ation . . . 75

5.2.2 Continuous visual target ost fun tion . . . 78

5.2.3 Obje tive evaluationofsynthesisresults . . . 80

5.3 Target feature sele tion and weight tuning . . . 83

5.3.1 Unitsele tion and on atenation . . . 84

5.3.2 Target featuresele tion andweight tuning . . . 86

5.3.3 Appli ation to AV target ostfun tion tuning . . . 93

5.3.4 Analysisof sele tedfeaturesand their relative importan e . . . 94

Chapter 6 Evaluation 101 6.1 Obje tive evaluation . . . 101

6.1.1 Obje tive evaluationbasedon omparison oftwo signals . . . 102

6.1.2 Obje tive evaluationbasedon statisti alanalysisand thresholds . . . 103

6.2 Human- entered evaluation . . . 104

6.2.1 Intelligibility tests . . . 104

6.2.2 Quality evaluationtests . . . 105

6.3 Analysis of per eptual evaluation for better obje tive metri s . . . 107

6.4 Con lusion. . . 110

Chapter 7 Con lusion 113

(8)

Appendixs 121

(9)

(10)

Ena tedoranimatedstories aremorepopularthanaudionarrations orthoseinthebooks. Itis easyto on ludethatthisisduetoitsaudio-visualnatureasitprovidesari hexperien e. Besides entertainment,ingeneralweper eiveeverythingthroughourearsandeyes,simultaneously. The visual information thatis per eived througheyeseither ompliments or reinfor es theauditory information. Thisappliesto spee haswell,whi hisone oftheprimemodesof ommuni ation. Spee h per eption in thedayto day life is primarily bimodal. We see and hear, what isbeing spokenbypeopleandunderstandthespee hifitisinaknownlanguage. Whenever,theauditory inputisambiguousornoise-ridden,wetrytosupplement there eivedinformationbylookingat thesour e,i.e.,thespeaker. Thisbimodalnatureofspee hisillustratedbytheobservationthat, we humans try to have a fa e-to-fa e onversation while dis ussing issues of high importan e. Thisis be ause,fa e-to-fa e ommuni ation onveys the omplementary information relatedto spee h arti ulation, emotions, more ee tively than just voi e. Hen e, bimodal spee h an be onsidered more ee tive in onden e building. Besides entertainment and ommuni ation, the basi milestone towards verbal ommuni ation, i.e., spee h development in babies also has signi ant ontribution of theobservation of visual spee halong withthe orrespondingsound (Teinonenetal.,2008;Andersen etal.,1984).

Some of these above mentioned general observations about the advantages of audio-visual spee h over a ousti -only spee h have been experimentally veried. It has been shown that additionofvisualspee henhan esspee hdete tionandre ognition,thusimprovingintelligibility whenaudioismissing,degradedwithnoise,orwheretherearemultiplesour esofspee h(Sumby and Polla k, 1954;Ouni etal.,2007;Summereld, 1979;S hwartz etal.,2004). Theevaluation resultsofvisualspee hintelligibilitybyLeGoetal.(1994)showthatthenaturalfa epresented `without'or`withdegraded' audiorestorestwo-thirdsofthea ousti intelligibility; withafa ial modelwithoutatongueandjustalipmodelrestoreshalfandone-thirdofitrespe tively. Spee h presentedalongwithfa ial animationhasbeenobservedto bemorepreferredinterfa eto voi e-onlypresentation. Theyhavebeenshowntoin reasetheintera tiveexperien eofusers(Pandzi etal.,1999).

(11)

These advantages of audio-visual spee h over a ousti spee h indi ate its vast appli ation possibilities. It has been widely used inentertainment and e- ommer e for developing virtual agents. These appli ation do not ne essarily need higha ura y of spee h arti ulation. There are other appli ations whi h require high a ura y omparable to that of natural audio-visual spee h. Thesein lude appli ationsforpedagogi a tivities,for example,virtuallanguagetutors for e-learning,tea hingspee h arti ulationto hearingimpaired et (Massaro,2006). It an also beusedto develop virtual announ ers for publi pla es thatareusually noisy.

Considering allthe pre edingdis ussion, it an besaid thataudio-visual spee hsynthesisis a signi ant domain to pursue. But, theadvantages of natural bimodalspee h an be realized throughsynthesizedaudio-visualspee h,onlyifitis omparableto theformer. Itissobe ause, humanshaveimpli itexpe tationsfromaudio-visualspee hbasedonthelearningandexperien e ofgeneral fa e-to-fa e ommuni ations. Thesearerelatedto temporal alignment and oheren e between the a ousti and visual modalities. For instan e, while hearing sounds like `p', we expe t a losureof thelips justin timebefore the onset of thatsound. Similarly, we expe tto hear high-pit hed voi e for a onversation where somebodyisseen to be inextreme fear. This means that the synthesized audio-visual spee h has to have thea ousti and visual streams to betemporallysyn hronous and oherent withea h other.

A majority of approa hes for audio-visual (AV) spee h synthesis, synthesize the fa ial an-imation over spee h a ousti s, and then perform additional pro essing for syn hronizing the two wherever ne essary. This is based on the assumption that AV spee h synthesis is a set of twodierentproblems,therebyaddressingthem sequentiallybysynthesizing visualspee hover synthesizedspee h a ousti s. There aretwo problems with this approa h. To begin with, syn- hronizingthetwostreamssynthesizedseparatelyisnotstraight-forward. Humansareextremely sensitive to any asyn hrony between the audio and spee h animation. In fa t, this sensitivity to dis riminatesyn hronous spee hfromasyn hronousspee h developsvery earlyinhumansin their infan y with a signi ant preferen e to syn hronous spee h (Dodd, 1979). Results from (Grant and Greenberg, 2001, 2004) show that human spee h per eption is extremely sensitive toanylaginthe visualdomain when ompared toaudiounlike theother wayaround. Itisalso observed that this asyn hrony auses a surge inthe intelligibility of asyn hronous audio-visual spee h. Moreover, this also brings in the issue of in onsisten y in visual and a ousti domain whi hmightbringindis omfort(Mattheysesetal.,2009). Thisin onsisten y analsoae tthe nal per eption of the audio-visual spee h, as illustrated by some of theexperimental data in (GreenandKuhl,1989,1991). Theseexperimentalresultsshowthattheper eptionofpla eand

(12)

a ousti modality. Theworst ase, whereper eptionofAVspee h anbehighlyae tedisthat ofM Gurkee t (M Gurkand Ma Donald,1976). Infa t,whendierentfa ial animation and a ousti s arepresented syn hronously, subje ts would experien e fusion or ombination ee t. Fusionsee tisseen,forexample,whenvisual/

g

/ispresentedsyn hronouslywitha ousti /

b

/. The resultis per eived as /

d

/. Similarly, whenvisual /

b

/ is presentedwith a ousti /

g

/) syn- hronously,itisper eivedas/

bg

/,whi hisanexampleofthe ombinationee t. Thisindi ates that synthesizing audio-visual spee h by separating thesynthesis of the two modalities, might not always ensure the best result in terms of syn hrony and oheren e of the two modalities. In general, simultaneous pro essing of a ousti and visual spee h is shown to be advantageous withrespe tto audio-visualintegrationthatarenotavailablewiththeirindependentpro essing (Chen andRao,1998).

To ensure a perfe t alignment and oheren e between a ousti and visual modalities, we advo ate synthesizing audio-visual spee h simultaneously by treating the two modalities as a single entity. In this thesis, we present our method for audio-visual spee h synthesis based on this prin iple. We base our spee h synthesis on the unit sele tion paradigm. We perform simultaneous synthesis of a ousti and visual modalities by on atenating bimodal units. We keep the natural asso iation between the two modalities inta t while doing so, as the visual and a ousti modalitiesbelongto the same spee hsegment. It should be emphasized that this approa h impli itly addresses theabove mentioned issues ofasyn hrony and in oheren e. This work anbe onsideredasthe ru ialrststeptowardsa omprehensivetalking-head. A tually, our main fo us is to synthesize the audio-visual spee h dynami s a urately. The resultant is not a omplete talkinghead yet. Our fa ial representation islimitedto sparse meshdes ribing the outer surfa e of the fa e in luding the lips. The audio-visual spee h does not in lude the informationrelatedtotheinternalarti ulatorsliketongue,teethandother omponentsne essary for expressive spee h. In the ourse of this work rst we studied the bimodal spee h orpus, thatwea quired, bydesigningandanalyzingvisualspee hsegmentationexperiments. Then,we developedthebasi systemwhi himplementedourideaofbimodalunit on atenation. Byusing thebasi synthesisframeworkof bimodalunit-sele tion system,we developedmethodologiesto improve the bimodal synthesis. In our work, we are addressing the following problems: (1) unit-sele tiontakingbotha ousti andvisual onsiderationsinto a ount whi h an drasti ally in reases the omplexity,(2)weight tuning, whi h isa di ult probleminspee hsynthesis. In fa t, we developed orpus spe i visual target osts and an iterative target feature weighting algorithm. Finally, we performed per eptual and subje tive evaluation experiments through

(13)

This thesis is organized as follows. We begin by reviewing theeld of audio-visual spee h synthesis,in hapter1. Inthis hapter, wedis usstheways inwhi hthefa ehasbeen modeled andanimated. Wealsodis ussthevariousapproa hesofaudio-visual spee hsynthesisbasedon separate or joint synthesis of the two modalities. Our spee h synthesis system is built on the generi paradigmofunitsele tionandthisisthetopi of hapter2. Wereviewliteraturerelated to some aspe ts of unit sele tion. It in ludes, segmentation, that is performed during orpus preparation. Besides,the various building blo ksof sele tion areexamined: target des ription, target and on atenation osts. Finally,wereviewthewaysofevaluatingsynthesizedspee h.In hapter3,wepresentourworkbyprovidingrstanoverviewofouraudio-visualspee hsynthesis system. It also details our audio-visual orpus re ording and database preparation for our synthesis system. The resultant audio-visual database that we have is an interesting resour e whi h an be used for studying various phonemes. As a rst step in this dire tion, we have performedsegmentation of the visual data. We des ribe thesesegmentation experiments, their results and analysis of these results in hapter 4. In hapter 5, we detail dierent strategies thatwe developed to optimize oursystem. It in ludesdesigning newvisual targetfeatures and target feature weighting. Finally in hapter 6, we present the obje tive evaluation, per eptual evaluation and the analysis done to bring out the relation between the two. We on lude in hapter7 andexplain our futurework.

(14)

Audio-Visual Spee h Synthesis: An

Introdu tion

Inthis hapter, we lookat someof its earliersynthesisapproa hes. For anyspee h,a ousti or audio-visual, to be synthesized from text, the underlying phoneme sequen e orresponding to the text hasto be rst spe ied. Given this spe i ation, various approa hes an be followed for AV spee h synthesis. Firstly, these approa hes an be divided basedon whetherthe visual and a ousti modalities are synthesized separately or simultaneously. Se ondly, the synthesis of a ousti or visual modalities in the ase of separate synthesis an be divided based on the synthesis paradigm: rule based, arti ulatory or on atenative (Theobald, 2007). Thirdly, the approa hes an be lassied based on their fa ial renderingte hnique: 3D modeling of fa eor image-based.

In arule-basedsynthesissystem,thewell knownrepresentative hara teristi s ofspee hare simulated using predened rules. Whereas, arti ulatory synthesis is done bythe simulation of natural pro ess of spee hprodu tion using models of human anatomy. For instan e, air owis simulatedthrougha ontrolledmodelofhumanvo altra t,andskinofthefa eisdeformedusing bonesand mus les. Con atenative spee h synthesisis performedby on atenating segments of re orded humanspee h,generally alled orpus. This anbeput intoa broader ategory alled orpus-based spee h synthesis whi h also in ludes HMM-based spee h synthesis. HMM-based synthesisdependsonthelearningofpatternsofspee hparametersfromagiven orpus,whi his thenusedtogenerate spee hparameters. Con atenative approa h islikememorizing thewhole data, andthena essing thememory at the timeofsynthesis.

Inthefollowingse tions,wefo usonaudio-visualspee hsynthesis. First,webrieydes ribe thefa ial renderingte hniques (se tion1.1). Then, we dis usstheapproa hes whi hsynthesize thea ousti andvisual modalitiesseparately andsimultaneously inse tions 1.2 and1.3.

(15)

1.1 Fa e modeling and animation

Thefa ehasbeen en oded andpresentedintwo waysfor thepurposeof fa ial animation. The rst approa h is the 3Dmodeling ofthe fa e. Theouter surfa e of thefa e ismodeled using a mesh of onne ted polygons. Thesepolygons aremade of predened edges onne ting a set of 3D point verti es. Also, hanges inthe 3D point lo ations and the onsequent hanges inthe mesha ountforthedeformationsinthefa e. Therst3D-fa ialmodelwasdevelopedbyParke (Parke, 1972, 1975, 1982). In this model, the 3D points were dened and ontrolled by a set of parameters. Theseparameters were on eptually dividedinto two distin t sets(fun tionally theymight have an overlap): onformation parameters and expressionparameters. The onfor-mationparametersweretheoneswhi hdenethedimensionsofthe3Dfa e. Thatis,if3Dfa es are modeled based on real human subje ts for instan e, then onformation parameters dene thebasi `dierentiating' dimensionsof thatparti ularhumanfa e. Thesein luded parameters like aspe t ratio of fa e (height to width), relative sizes spe ifying forehead, eye separation, nose height, heek, hin, et . The expression parameters were those whi h des ribed mainly themovements ofeyesand mouth. Theyin luded deformations like jawrotation, width ofthe mouth,positionofupperlipand ornersofthemouth,et . Thesedeformationsmightberelated to spee h or emotional expressions. From these two ategories of parameters, the3Dpoints on the fa e positions were determined using dierent types of operations, applied independently to some regions or to the whole fa e. Eyes were ontrolled by spe i pro edures. The other operations in luded, interpolation, rotation, translation and s aling. The nal rendering was done through Phong interpolation (Phong, 1975) based on theparameter spe ifying the dire -tion of light sour e. There are many virtual hara ters whi h are des endants of this Parke's model (Cohen and Massaro, 1993; Beskow, 1995; Olives et al., 1999). These des endants of Parke'smodelhavevarious additionstoimprove theappearan eoffa eandanimation: likethe additionofthe tongue,earsortheba koftheheadandtheadditionof ontrol parameters. The advantage of these kind of parametri models is that thewholemesh is spe ied using a small set of parameters. Parke's parametri model is dierent from some other parametri models, whi harebasedonmodelingtheunderlyinganatomi al stru turelike bones, mus les,skinand for es a ting on them (Waters and Terzopoulous, 1990;Waters, 1987; Lee et al., 1995; Ekman and Friesen, 1978). This kind of modeling has been observed to be omputationally intensive (Bailly et al., 2003). Some talking heads whi h present emotional fa ial animations are based onpseudo-mus le ontra tions (Cosietal.,2003;Pela haud etal.,2001). MPEG-4standardizes theparametri modelsbydeningaminimumsetof84featurepoints(FPs)lo atedonthefa e.

(16)

alledfa ial a tion parameters (FAPs)(Ostermann,1998).

Besides 3D modeling of the fa e, the se ond approa h for representing a fa e is through the usage of fa ial images. These are most often images of real people. Hen e, image-based approa hes are generally data-driven. Fa ial animations using images are generated in two ways. First, it an be done by interpolating few spe i images that are representative of the typi al arti ulation of visually identi al phonemes alled visemes (Ezzat and Poggio, 1998). Alternatively, it anbedone by on atenatingimage sequen es (Bregler etal.,1997;E.Cosatto etal.,2000).

The image-based approa hes of modeling present more realisti fa es. This is be ause of their proximity to the real fa ial appearan e, whi h isoften des ribed as being photo-realisti . But, this way of en oding or presenting a fa e is most often limited to a straight-head frontal view of the fa e. Besides, storage of images in urs signi antly higher memory requirement to storage of a few parameter traje tories. On the other hand, 3D-model-based approa h is exible in terms of the view and head orientations in whi h a fa e an be rendered. But, an additional pro essing step is required to add the internal arti ulators like tongue and teeth to renderthe ompletearti ulatory information. Itispossibleto augment the3Dmodelbyadding texturalinformationtomakethenalfa ialanimationexibleand omparativelyphoto-realisti

Elisie etal. (2001). Another alternative of modeling the fa eis morphable-models presentedin (Cootes et al., 1998; Blanz and Vetter, 1999). These models also embed both geometri and texture relatedinformation to present a relatively photo-realisti andexible fa ial model.

1.2 Separate visual spee h synthesis

Conventionally, AV spee h synthesis is onsidered as two separate problems; the generation of spee h a ousti s and the generation of fa ial animation to a given spee h a ousti s (real or synthesized). Consequently, it has been performed by synthesizing the two modalities sepa-rately. Fa ial animation is generated over a given spee h a ousti s, whi h is eithersynthesized or re orded. Thisapproa h requiresadditionalpro essing to orre t thealignment between the twomodalitiesinthe aseof on atenativevisualspee hsynthesis(Bregleretal.,1997). Werefer to the fa ial animation relatedto spee has visual spee h. We fo us on visualspee h synthesis stage, onsidering the a ousti spee h already available. Two on epts, whi h might surfa e in thedis ussionofvisual spee hare: visemes and oarti ulation. In thefollowing paragraphs,we rstexplain these two on epts beforegoing ahead withthesynthesis te hniques.

(17)

sets are dened as visemes. It is the fundamental unit in the ontext of visual spee h (Fisher,

1968). For example, per eption of visual spee hwhile phonemes inthe set{

p

,

b

,

m

}arebeing arti ulatedisalmostthesame. Hen e, theybelongtoone visemeset. Inthe urrent dis ussion, wemeanbyviseme,asequen eofvisualspee hparametersdes ribinga ompletesegmentrather than stati targets. Onthe ontrary,werefer to asingle sampleof theseparameters des ribing a snapshot of a parti ular target fa e as `key frame'. The visual spee h parameters an be image frames or traje tories of ontrol parameters or 3Dpoints onthe fa e. Thismany-to-one mapping of visual spee h makesthe separation of visual spee h synthesis from a ousti spee h synthesisadvantageous. Itisbe ause,thesystemgets on iseduetotheredu inginthenumber ofdistin t units. Inthe aseof on atenativevisual spee hsynthesis, thisin reases thepossible andidates.

Coarti ulation: Coarti ulationis thephenomenonin whi hthearti ulation ofa phoneme is inuen ed bythe arti ulation ofthe neighboring phonemes. Synthesized visual spee h needs to a urately represent oarti ulation. In ase of parametri 3D-fa ial-models, theparameters foranimatingthemhavebeengeneratedtaking oarti ulationinto a ountusingrules(Beskow,

1995;Pela haud etal.,1994) or mathemati al oarti ulation models(Öhman,1967;Cohen and Massaro,1993;Cosietal.,2002). Beskow(1995)mentionsthatea hphonemehasatargetve tor spe ifying thetypi al arti ulatory gestures. These target ve tors are under-spe ied for some phonemes whi hareinterpolated basedonthe ontext to a ount for oarti ulation. Pela haud et al. (1994) divide phonemes into lusters based on their deformability in dierent ontexts. Phonemeswithlowerdeformabilityserveasthekeyframesfor oarti ulation. Öhman(1967) a - ountsforthe hangesduringthe transformationofa

V

1 CV

2

(vowel- onsonant-vowel ) sequen e.

Cohen and Massaro (1993) implement Löfqvist gestural theory, where phonemes are spe ied with target feature ve tors. Coarti ulation is dened as the super-imposition of time-varying dominan e fun tions des ribing dierent arti ulators. These dominan e fun tionsare negative exponential fun tions whi h peak at the target feature ve tors. This oarti ulation model has beenfurther augmented byCosi etal.(2002) bythe addition ofresistan efun tions. These re-sistan efun tionsensurethatsomespe i target ongurationsareattainedbysuppressingthe dominan eofneighboring phonemes. Thisisespe iallyimportantfor phonemes likelabials and bilabials. Beskow (2004) reports an experimental omparison of various approa hes to a ount for oarti ulation. He reports that the mathemati al model proposed by Cohen and Massaro

(1993) performs well in omparison with the real data; whereas, with respe t to intelligibility, rule-based te hniques perform better. These models an be optimized through hand-tuning or

(18)

Elisie et al., 2001). Ezzat etal. (2002) also perform tuning of a oarti ulation model through statisti al learning on re orded orpus. Their oarti ulation model is similar to that of Cohen and Massaro (1993). Instead of using motion data, they used image-based orpus for tuning their model.

Corpus-based approa hes:

Insteadofusingsomeexpli it oarti ulationmodels,the oarti ulation anbeimpli itlyen oded inthesynthesizedvisualspee h. Thisisdonein orpusbasedapproa hes. Firstly,the omplete traje tories of visual spee h parameters an be generated using models like HMMs, whi h are trained onrealdata(Brand,1999;Masukoetal.,1998). Inthis ase,theHMM anbemodeled as a triphone, whi h des ribes a phoneme in the required phoneti ontext. Alternatively, the omplete sequen e of visual spee h parameters for real motion apture data an be stored and used by on atenating them for synthesis (Minnis and Breen, 2000). In this approa h, oarti ulation isen oded through thesynthesis unit, liketriphone or diphone.

In ase of on atenative approa hes,thevisual spee hdatabasehasto beprepared. Besides a quisition, the orpusneeds pro essingto annotatethe individualunits interms oftheir pho-neti labels,segmentboundaries, informationrelatedtothegeometri propertiesofthefa esfor ensuring smooth transition at the on atenation points. One of the on atenative approa hes for dubbingappli ationsispresentedinBregler etal.(1997). Theypreparethevisual database by phoneti ally segmenting an un onstrained video sequen e. This segmented video is anno-tatedto in ludethe informationbasedon theorientationof thehead,theshapeandpositionof mouth. They use eigenpoints to estimatethe du iary points on the fa e (mouth, teeth, hin andjawline)using26hand annotedimages. Also,thesynthesisisdonebythe on atenationof triphone video lips. The synthesized mouthsequen es are thenmorphed onto the ba kground video sequen e. Theresulting video sequen e is ompressed or stret hed to time-align withthe target audiobetween phoneme boundaries.

The synthesisdes ribed in(E.Cosattoetal.,2000) isbasedonthe on atenationofvariable lengthvideosequen es ofmouthimages(andalsootherfa ialparts). Thedatabase isdes ribed in terms of 3D geometri features of thehead and appearan e features extra ted by Prin ipal ComponentAnalysis(PCA). Theyfurther subdividethefa ialparts into heeks, teeth,tongue, jaw,et tomakethesynthesismoreexible. Thenalsynthesisisdonebyoverlayingbitmapsof thefa ial partspresentinthedatabaseontoaba kgroundvideoasin(CosattoandGraf,1998). There are other similar works of image based on atenative approa hes (Weissenfeld et al.,

(19)

Embedding (LLE) to des ribe theappearan e parameters of themouth images unlike Cosatto and Graf (1998) who use PCA. Liu and Ostermann (2009) use PCA to extra t appearan e parameters and A tive Appearan e Models (AAM) to extra t thegeometri parameters of the fa e (lip width, lip height, visibility of teeth and tongue). A similar approa h, but whi h is based on parametri 3D fa ial model is presented in (Ma et al., 2006). In this approa h, the ontrol parameters extra ted fromre orded 3Dfa ial marker data are on atenated using unit sele tion. The resultant traje tories areusedto animatevirtual onversational agents.

Some on atenative approa hes ombine HMM and on atenative approa hes for visual spee h synthesis. One su h work is presented in (Lijuan et al., 2010). It is image-based ap-proa h where the sele tion pro ess is guided by the traje tory of lip movements generated by trained HMMs. These HMMs are trained by the AV-spee h orpus. This approa h is similar to an earlier work by Govokhina et al. (2006). In that, phoneti ally aligned traje tories of 3D fa ial markers are sele ted based on the traje tories generated by trained HMMs. A hybrid unitsele tion andHMMbasedapproa hfor visualspee hsynthesisispresentedin(Edge etal.,

2009). Thiswork uses thesele ted units to train state-based models and sear h through these learned models through Viterbi type algorithm. The similarity in spee h a ousti s (a ousti parameters)wasusedto guidethroughunitsele tion. Thenalsequen e ofstate-basedmodels isused to generate smooth traje tories for visual spee h. Baillyetal. (2009) des ribe asystem whi h generates arti ulatory gestures ( ontrol parameters) for a video realisti (image based) fa ialanimationusingHMMs. Theyin orporateaphasingmodeltolearnthelagbetweenvisual gestures and orresponding spee h a ousti s. They ompare this HMM-based te hnique whi h in ludes thephasing model with3 other te hniques: (1) on atenation ofarti ulatory gestures sele tedbasedonthephoneti ontext,(2) on atenationofarti ulatorygesturesbasedon sele -tion thatis guided throughthe phasing modelbased HMM,(3) traje tory generated by HMM models trained on audio-syn hronized arti ulatory gestures. They on lude that the phasing modelbased HMMsimprove thesynthesis.

Almost all of these works report lip-syn hronization problems. Bregler et al. (1997) report that plosives were observed to have o asional lip-syn hronization problem, Cosatto and Graf

(2000),reportlip-syn hronization being riti izedinsubje tive evaluation. Geiger etal. (2003) present the per eptual evaluation of the synthesis approa h presented in (Ezzat et al., 2002). They report that the synthesized audio-visual spee h is not omparable to the natural audio-visualspee h,tothe extentthatisrequiredfor developing appli ationsfortea hinglanguageor

(20)

1.3 Simultaneous synthesis of audio-visual spee h

The potential appli ation of audio-visual spee h hinges not only on the a ura y of the syn-thesized visual spee h, but also on the extent to whi h the a ousti and visual streams agree with ea h other in terms of syn hrony and oheren e. It is obvious from the previous se tion that, through the separatesynthesisof a ousti and visualmodalities,these onditions arenot always guaranteed. In this se tion, we look at approa hes whi h synthesize audio and visual spee hsimultaneously. The entral me hanismofalltheseapproa hesistokeep theasso iation between the visual and a ousti modalities, thereby preserving the natural syn hrony and o-heren e. Majorityofapproa hesinthis ategoryarebasedonthe on atenationofsyn hronous bimodal units. One approa h presented by Tamura et al. (1999), uses HMM models trained using syn hronous audio-visual spee h data to generate bimodal spee h parameters. But, it should besaid that this approa h was still at a mu h preliminarylevel asthegenerated visual spee h parameters wererelated onlyto the lip ontours.

The on ept of syn hronous bimodal unit on atenation for Swedish AV spee h synthesis has been presented in(Hallgren and Lyberg, 1998). The visual spee h information isre orded astraje toriesof 3Dmarkers alloverthefa e, espe iallyaround thelips. There orded marker information is used to ontrol a3D model ofthe head. Thishead model is further texturedto makeit lookmore natural.

Twore entimage-basedapproa hesthatuse on atenationofbimodalunitsare(Fagel,2006;

Mattheysesetal.,2009). In(Fagel,2006),AVspee hsynthesisis donefor German by on ate-nating syn hronous bimodal polyphone segments. Thiswas with a 4-minute orpus onsisting of bimodal spee h: video of spee h aligned with the orresponding phoneti trans ript. The sele tion ofpolyphonesegmentsfor on atenationwasbasedona on atenation ost al ulated as a weighted sum of a ousti and visual on atenation osts. The pre-sele tion of possible polyphonesegments from the orpus is based on hunks (longest polyphone segments thatare available inthe orpus),and thevisual joint ost al ulation isbasedonthepixelto pixel olor dieren esinthe end framesofthevideo lipsto be on atenated. Hen e, itisquite lear that synthesisin urs a large overall pro essing time. In (Mattheyses et al.,2009), the onventional unit-sele tionte hnique whi h hasbeenwidelyusedfor a ousti spee hsynthesisisextendedto performAV spee hsynthesis. Itisdonebyin ludinganadditional join ostterm forvisualjoin dis ontinuities. Their systemis similar to the one explained in (Liu and Ostermann, 2009) in termsofthevisualfeaturesextra ted andusedtodes ribethefa ialgeometry andappearan e. Thesemethods like anyimage-based te hnique in ur highstorage requirement when ompared

(21)

1.4 Con lusion

Inthis hapter, we have dis ussedvariouste hniquestomodelthefa ethatarebasedoneither its 3Dor image-basedrepresentation. We have also dis ussedthevarious prosand onsof ea h te hnique. Further, we have also examined some approa hes of AV spee h synthesis that are based on either the sequential (synthesizing fa ial animation after a ousti spee h synthesis) or simultaneous synthesis ofthe two modalities. We have highlighted the disadvantages of the former. Consequently, we are in favor of syn hronous, data-driven synthesis of audio-visual spee h. Our approa h is based on this line of synthesis. As an be seen in hapter 3, our approa h is using a unit-sele tion paradigm to synthesize both visual and a ousti modalities simultaneously. Inthefollowing hapter, wepresentasurveyofvariousaspe tsofunitsele tion and thenwe introdu e our systemin hapter3.

(22)

Spee h Synthesis Using Unit Sele tion:

Literature Survey

Spee hsynthesisisa wellestablishedeldofresear h withsigni antprogress inthepastthree de ades. Though synthesized spee his getting loserto human spee h,itisstill far frombeing onsidered a solved problem. In addition, we are still away from a perfe t all-purpose spee h synthesizer. This is true for both a ousti -onlyand audio-visual spee h. Among thesynthesis te hniques on atenative te hniqueshave be ome very popular inre ent times. Thesemethods havebeenwidelyusedandevolved fora ousti synthesis. Nevertheless, theparadigmisgeneri and has been extended to visual or audio-visual spee h synthesis. In the earlier on atenative a ousti synthesis, fewerinstan es ofea h diphone were stored inthe inventory. Thesynthesis spe i ation in luded the prosodi des ription related to duration and pit h of targets in the senten e to be synthesized. At the time ofsynthesis, these diphones weremodiedusing signal pro essing te hniques to bring inthe hanges relatedto prosody and then on atenated. This kindofintensivesignalpro essingdoneonthewaveformdistortsitsnaturalness. Theadvantage of this system was the small size of the diphone inventory whi h was a ne essary requirement at the timeof its usage. Moreover, it an be said that inspite of usageof signal pro essing, it doesnot a ount for allthevariations ofspee ha urately.

As omputer storage is getting heaper and faster, it has be ome possible to store huge spee h database many times larger than the earlier smaller inventory of diphones. Usage of a huge orpus,makesitpossibleto in ludealargesetof andidate diphoneswithlargevariability intheir waveforms. Moreover, itisevenpossible tohave longersynthesisunits than adiphone. Infa t,itisevenpossibletosear hforwholesenten esorbig hunksofsenten es. Thisindi ates thedrasti redu tionintheneedtopro essthespee hsignal. Consequently,theresultantspee h preserves thenaturalness of the original spee h as the spee h segments are on atenated with

(23)

little to nosignal pro essing.

Nevertheless, the usage of a large spee h orpus has dierent problems. A large varian e inthe synthesis andidates means that sele tion has to be done arefully, to synthesize spee h whi h issimilarto a natural utteran e. This isthe lassi al unitsele tion problem. We dis uss some of the issues of unit sele tion te hniques, and the approa hes that have been applied to resolve them. In the following se tions, we rst give a brief introdu tion of the emergen e of theframework ofunitsele tion and itsbasi paradigm(inse tion 2.1). Inse tion 2.2wegive a shortdes riptionofthesegmentationte hniquesusedin orpuspreparation,thenades riptionof pre-sele tion of andidates and the onventional target ost formulation based on independent feature spa e assumption and its tuning (in se tion 2.3). Next, (in se tion 2.4) we give a brief a ount of the ways join evaluation te hniques have been analyzed for their orrelation with human per eption of dis ontinuity when non- ontiguous units are on atenated. Finally, (in se tion 2.5), we deal with the obje tive and per eptual evaluation methodologies that are generally employed to estimate and sometimes qualify a text-to-spee h synthesis (a ousti or audio-visual)for its useina spe i domain.

2.1 Unit sele tion paradigm

Unit sele tion depends on the sele tion of the best possible set of units from dierent variants available in the orpus. Consequently, the rst requirement is to have a orpus that not only has a good overage of the possible spee h variants, but whi h is also omparatively small to keep thesear htime short(Möbius,2000). Given aparti ular spee h orpus,thequalityofthe synthesizedspee h using unitsele tion depends on its usage. Manyfa torsae t thesynthesis results. For example, on atenation of units an be said to be the most obvious reason for audibledisruptionandmanyinitialsystemswerebasedontheredu tionof on atenationpoints (Sagisaka, 1988). In (Sagisaka, 1988), thesele tion oflongest segments isgiven preferen e and the on atenation at ertainlo ationslikeatCV( onsonant-vowel) boundariesorinthemiddle of vowels is penalized. Alternatively, when it is not possible to avoid on atenation of non- ontiguous units, minimization of distortion at the on atenation point minimizes the quality degradation (Takeda etal.,1990; Iwahashi etal.,1992). Besides redu ing the on atenation of non- ontiguous units,thereareotherne essaryfa torsthatneedtobe onsidered. Forexample, the phoneti ontext of the sele ted unit and the spee h realization of the unit itself seems important (Takeda etal.,1990;Iwahashi etal.,1992).

(24)

possibleoptimal solution to the sele tion- on atenation problem. For a sequen e of andidates

u

,and asequen e ofrequired target units

t

;the paradigmpresentedbyHunt and Bla k(1996) optimizesa total ostfun tion whi h isa weightedsum ofthefollowing:

•

Theper eptualsuitabilityof

u

,for

t

,whi his alledthetarget ost,denotedby

T C(t, c)

.

•

The total dis ontinuity at all the on atenation points, alled the join ost denoted by

J C(c)

.

Denoting the weights ofthe target ost andthe join ost by

w

tc

and

w

jc

respe tively; from a given orpus, the sear h for the nal sequen e of andidates is done based on the optimum andidate sequen ewhi h minimizes thetotal ost(

C

)asshown below:

C

= min

u

w

tc

T C(t, u) + w

jc

J C

(u)

(2.1) Here, the pre-sele tion of units is based on a same-size units like phones or diphones for ea h target position. This pre-sele tion is based on the target ost determining thesuitability of the andidate and its ontext. Also, in this general framework, the sele tion of longest ontiguous andidates is enfor ed impli itly by making the individual join osts for any two ontiguous units in the orpus zero (Balestri et al., 1999). This has the advantage of taking into a ount the variability of spee h realization besides redu ing the on atenation artifa ts for thesele tion ofpossible bestset of andidates. In ontrast, some methods expli itly sear h forlongest ontiguousunitsfor on atenation allednon-uniformunitsele tion,wheretheunits sought for on atenation are not of same size or type (Taylor and Bla k, 1999; Boëard, 2001;

S hweitzer et al., 2003). This is dierent from the earlier paradigm whi h is impli itly non-uniform unitsele tion, asthere might bemany ontiguous segmentsof variablesizeinthenal synthesizedspee h. Clarketal.(2004)giveagooddes riptionofthepra ti alaspe tsofbuilding a unitsele tion basedspee h synthesizer. Taylor (2009), gives a omprehensive overview ofthe dierent approa hes addressing various aspe ts of unit sele tion based spee h synthesis. Our approa h isbased ontherst paradigm, whi his animpli it non-uniform unitsele tion.

Extending unit sele tion to audio-visual spee h synthesis

In majority of AV spee h synthesis approa hes, visual spee h is synthesized over an available a ousti spee hthatiseithersynthesizedorreal. Inthe aseofvisualoraudio-visualspee h syn-thesisusingunitsele tion,thesele tionofsegmentshastobedone onsideringtherequirements

(25)

in thetarget ost fun tion, and also additional join riteria to a ount for the visual modality relateddis ontinuities inthejoin ost fun tion.

2.2 Segmentation

Itisobviousthatunitsele tion dependsonaspee hdatabase. Segmentationisone ofthesteps of this database preparation, in whi h re orded spee h is divided into phoneti segments by demar ating their temporal boundaries. Thesephoneti segments onstitute thebasi building blo ksfor synthesis. Spee h segmentation without any otherspe ier is onventionally usedto refer to a ousti spee h segmentation. Though the best way in terms of a ura y is manual segmentation(Cosietal.,1991;LjoljeandRiley,1993;Ljoljeetal.,1997),itistime- onsuming, laboriousandhen e ostly. Forthisreason,automati spee hsegmentationis onsideredagood alternative. Themost popular and widelyused te hnique for automati spee h segmentation is to for e a HMM based phoneti spee h re ognizer to re ognize the spee h to a given phoneti trans ript. Demar ation of phoneti boundaries is a result of this for ed-re ognition whi h is onventionally alledfor edalignment. Thisalignmentte hniquehasavoidedtheneedfor man-ualalignmenttosomeextentandalso onsideredgoodenoughforHMMtrainingthatisrequired in spee h re ognition. But, segmentation needs to be more a urate for on atenative spee h synthesisespe iallyforthosewhi h arebasedon on atenation at phonemeboundaries. Conse-quently,variousmethods have been usedforthe renement ofthephoneti segment boundaries further (Toledano et al., 2003). Some of the re ent works use a ombination of segmentation methods to derive multiple timemarks to arrive at more a urate segmentation (Kominek and Bla k, 2004;Park and Kim,2007).

For on atenative visual or AV spee h synthesis, generally theboundary time-marks deter-mined by the a ousti spee h segmentation of an audio-visual orpus are used while dening the andidates inthe orpus (Bregler etal.,1997;Hallgren andLyberg,1998;E.Cosattoetal.,

2000). This way of segmentation is widely followed and pra ti ally shown to work for visual spee hsynthesis. Nevertheless,thisisnot ina ordan e withtheunderlyingprin ipal ofspee h produ tion. The spee h arti ulators have to beready witha target ongurations required for the produ tion of a sound (phone) for it to happen. That is, the start and end in the visual and a ousti modalities may not ne essarily be thesame. Some works have tried to learn this timelag between a ousti and visualbyadding phasingmodels(Govokhina etal., 2007;Bailly et al., 2009). These phasing models are arrived at through iterative pro ess involving HMM learning, for edalignment oftraje toriesof arti ulatory gestures, omparison withthea ousti

(26)

works through re ognition of the spee h segment, it provides an interesting tool to study the unique hara teristi s ofphonemes. We exploitthis ideato hara terize phonemes (Chapter4).

2.3 Target ost fun tion

Measuring the suitability of a andidate in the orpus for a target position in the spee h to be synthesized is a ne essarystep inunit sele tion. The e ien y of a target ost fun tion in ranking and pre-sele ting andidates also ae ts the probability of a good join and thus the qualityof thesynthesized spee h. Generally, the target and the andidate are dened interms of fa torswhi hareknown to a ount for thevariationinspee h realizationbased onphoneti andlinguisti studies. Thesefa torsareattheabstra tlevelwhi harenotdire tlyexpressiblein termsofthea tualspee hparametersquantitatively. Thesearereferredtoashigh-levelfeatures. Thesefeatures antakeeithernon-negativeintegral valuesor anbe ategori al. Thesefeatures might in lude:

•

Phoneti featureslike thephonemi identityofthe urrent unitand theneighboring units ( ontext), type of phoneme (vowel, onsonant), voi ing of phoneme (voi ed, unvoi ed), manner ofarti ulation et .

•

Linguisti features like position of a syllable at various levels (word, rhythm group, sen-ten e, et ); position of word in a rhythm group or senten e; type of senten e et . These featuresgenerally a ount for the various suprasegmental prosodi patterns. Some ofthe featuresinthis ategorymight be language spe i .

Targetfeatureset analsoin ludefeaturesthatarebasedonthestatisti alanalysisofspee h relatedparameters whi hareextra ted from orpus,whi h arereferredto aslow-level features. For example, some systems use prosody predi tion models that mainly provide duration and pit h spe i ation ofthe segments to be sele ted. Theseprosodypredi tion models aretrained on real orpus. It helps inredu ing the numberof high-leveltarget featuresneededto des ribe prosody(Lata z etal., 2010). The low-level target featuresare also used to speed-up the pre-sele tion byredu ing thesear h spa e (Bla kand Taylor,1997).

Lot ofsystemsusetargetfeaturesetwhi h onsistsofmajorityofhigherlevelfeatures(Hunt and Bla k, 1996; Coorman et al., 2000; Lata z et al., 2010). Some systems use higher-level target featuresex lusively to allowthe automati sele tion of andidateswithsuitable prosodi hara teristi sratherthanpredi tionbasedonprosodi models(Prudonandd'Alessandro,2001;

(27)

individualfeature osts. Three kindsoftarget feature ostshave been generallyused(Coorman etal.,2000):

1. Categori al distan e measures: Where the distan e is either a binary valued or non-negative integer-valued fun tion between ategori alfeatures.

2. S alardistan e measures: Non-negativereal valuedfun tion for featureslike duration,F0 et .

3. Ve tor distan e measures: Distan e al ulation for multi-dimensional features, like the a ousti and visual featureve tors.

Categori aldistan emeasuresare al ulatedforthehigh-leveltargetfeatureswhiletheother two are basedon the low-levelfeatures. For AV spee h synthesisthe set of target featureshas to be augmentedto in ludetheinformation regardingspee h realizationinthevisual modality. Besidesthetargetfeaturedes ription,theweightingoffeaturesforagiventargetsetintheorder oftheir relativeimportan eis ru ialforsele tion. Theseaspe tsarepresentedinthefollowing two se tions. Besides the onventional target ost, alternatives have been proposed whi h we reviewinsubse tion 2.3.3.

2.3.1 Visual target features

For the visual spee h synthesis many of the high level target features used are those whi h des ribe the visual or audio-visual target. These features might in lude typi al arti ulatory hara teristi s like lip losures in bilabials. They might also in lude rate of spee h related hara teristi s. Besides features whi h are equally important for visual and a ousti spee h realization(e.g., pla eof arti ulation),or those whi ha ount more forthea ousti realization (e.g., voi ing), there are some featureswhi h aremore important for des ribing a visualtarget (e.g., shape of the lips during the arti ulation of a phoneme). Many of the on atenative AV spee h synthesis systems use a visual target ost based on the similarity of two phonemes in termsof visiblefa ial deformations, asdes ribed below.

In (Bregler etal., 1997), a ategori al phoneme ontext distan e isusedfor the sele tion of triphone whi ha ounts forthe visual target ost. Phonemes of samelabelareassigned

0

ost, and phonemes belongingto two dierent viseme lasses areassigned

1

,and dierent phonemes of same viseme lass are assigned a ost between

0

and

1

whi h are derived from onfusion matri esdes ribed in(Owens andBlazek,1985).

(28)

domainirrespe tiveofthedieren esinthea ousti domain. Thesele tionofthevisualsegment isbasedondurationandphoneti labelofthetargetsegmentwhi hisobtainedfromthea ousti spee h. Ea htargetframe isspe iedintermsofthephoneti annotationof awindowof frame sequen es onsisting ofsome xed numberin luding itself to a ount for ontext. Thewindow lengthisdierentforea hphoneme. The andidate issele tedwiththemostproximate ontext whi hismeasuredbythe target ost. Thetarget ostweight ve torisbasedon theexponential de ayinginuen einspiredby(CohenandMassaro,1993). Weissenfeldetal.(2005)useasimilar visualtarget ostwhere thedieren e matrixis al ulatedbasedonthevisualdieren ematrix populated using the Eu lidean distan e in visual feature spa e. It is based on theassumption that ea h phoneme an be des ribed by its mean visual feature ve tor, whi h is speaker and orpus spe i . In (Mattheyses et al., 2010), a similar visual target ost al ulated based on orpusis in luded. Thedieren ematrix thatis al ulated representstheinter-phoneme visual distan es based on the mean and varian e of visual parameters at themiddle of the phoneme units present in the orpus. These kind of ost fun tions whi h are al ulated for a spe i orpus don'tguarantee optimumperforman e for anyother orpus ingeneral.

2.3.2 Target feature weighting

Thetarget osttuning involves the determination ofrelative importan e of target featuresand assigning weights to the individual target feature osts to be used for target ost al ulation. Ideally, itisdone insu ha waythat theordering of andidatesbased onthe target ost orre-spondstotheirper eptualsuitabilityasatarget. Sin ethesynthesizedspee hhastobeatleast a eptable, intelligible and near natural spee h for human listeners, some system tuning te h-niques arebased on human listeningtests (Coorman etal.,2000; Alíasetal.,2004). Listening testsaretime-takingandrequirehumansubje tswhi hmakethempra ti ally ostly. Moreover thes opeofthiskindoftuningislimitedtoafewsetofsenten esandhen eit annotguarantee onsistent synthesis results. It be omes furtherdi ult whenthesetof target featuresislarge. Hen eautomati weight tuning hasbeen applied inmanyof theworks(Hunt and Bla k,1996;

Meron and Hiros, 1999; Park et al., 2003; Alías and Llorà, 2003; Colotte and Beaufort, 2005;

Lata z etal.,2010).

The targetfeature weighting te hniques an bedividedinto two ategories: (1)joint weight tuning of on atenation and target feature ost fun tions, either at the individual unit level sele tion by using pairs of synthesis units or at senten e level, (2) separate weight tuning of target and on atenation ost fun tions, generally by tuning the target feature osts at the

(29)

in luded for sele tion istreated asthetarget,and sele tedor synthesizedfromthe orpus. The target and the sele ted units are ompared using obje tive distan e measures to perform the tuning.

One ofthe two te hniquespresentedbyHunt and Bla k (1996) alled `weight spa esear h' (WSS) is basedon the rst ategory of weight tuning. It isbased on theusage oftargets from real senten es held out for training from the synthesis database. The weight tuning is done by sear hing the weight spa e, in su h a way that the waveforms of synthesized senten es and that of real senten es are similar. The weight spa e sear h is limited to a nite set of weight ombinations and hoose the best weights among the sear hed ombinations for dening the target ost fun tion. This method is omputationally very expensive in ase of large number of features and possible set of target feature ost values. Meron and Hiros (1999) presented a eleration te hniques for WSS by partial synthesis and omparison. Alías and Llorà (2003) performed target tuning by using geneti algorithm for doing the weight spa e sear h. The advantage of this is that the sear h spa e is randomized and sear h evolves towards better weight ombination, unlikeintheformerworkswhereaxednite ombinationsweresear hed.

Lata z et al. (2010) also present an automati weighting te hnique for tuning target features and on atenation osts together. In their te hnique the ordering given by weighted sum of target ost and on atenation ost, and the ordering given by an a ousti distan e metri are ompared. Asele tederroris al ulatedbasedonthemismat hinthisordering. Theyreferthis te hnique asMinimum Sele tion Error training. Further, they propose that theset of weights obtained for allthe andidates treated astargetsbeing lustered usingde ision trees.

One ofthete hniques whi hperformstarget feature weighting separatefrom on atenation ostsweightingisbasedonmultiplelinearregression(HuntandBla k,1996). Usingthismethod, the target feature weights for ea h phoneme ina language's phoneme set aretuned separately to ome up with dierent target osts for dierent phonemes. Ea h of the andidate in the database is onsidered as a target ea h time and the

n

most similar andidates are sele ted from the phoneme's andidate set leaving the target out. The ordering of andidates for the pre-sele tion of

n

andidatesisbasedon anobje tivedistan e measure. The targetweightsare determined using Linear Regression su h that the target ost predi ts the obje tive distan e measure. Meronand Hiros(1999) presented a way to extend this regression training (RT) for weighting the target features and on atenation osts together using target pairs unlike single targets. Theyalso propose lustering ofphoneti ontextsby using ade ision tree to splitthe phoneme pairs into dierent lusters. This is done with a phoneti ontextual question whi h

(30)

Ea h target feature a ounts for variations in spee h, and their duration. Based on the dis riminativeinformationa ountedbyea hofthefeatures,theyhavebeenweightedinColotte and Beaufort (2005). A ousti representation of a parti ular phoneme units were divided into lusters through K-Means algorithm using Kullba k-Leibler divergen e as thesimilarity index. Theweightofthefeatureisbasedonitsdis riminativeinformationbetweenthedierent lusters. This is applied to all the phonemes in the phoneme set of the language separately. Another approa h to weight tuning is to view unit sele tion as a lassi ation problem (Park et al.,

2003), in whi h instead of dening an obje tive fun tion to a ount for the subje tive spee h quality,the lassi ation error istaken as theobje tive fun tion to be optimized. It is di ult to omparethesemethodsintermsoftheirsynthesisresults. Therearemanyfa torswhi hvary intheseapproa hes,like, spee h orpus,testsenten es, evaluationmethodologies et . Hen e, it isnot straightforward to relatively judge their performan e.

2.3.3 Alternatives to onventional target ost fun tion

Thetarget ostputforthby(HuntandBla k,1996)wasweightedsumofindividualfeature osts (dieren es). Whenevera andidatewiththeexa ttargetfeaturedes riptionisnotavailable,the andidatesele tedforsynthesisbasedonthissimpleformulationformeasuringtarget- andidate similarity or rather dissimilarity might not always ree t the a tual human per eption. The following two ases need little more onsideration: (1) where a andidate with required exa t feature des ription is not available, but, a andidate with a spee h realization similar to the requiredonebutwithadierentfeaturedes riptionisavailable;(2)whereneitherthea andidate with exa t feature des ription nor witha similar spee h realization is available, inwhi h ase, a better possible alternative(s) have to be sele ted. To onsider the spee h realization besides thetarget ombination aloneof andidates,alternateapproa hesfortarget ost al ulation have beenproposedwhi hbasethe sele tion ontheper eptual similarityestimatedthrougha ousti distan es (Taylor,2006). Themain ideabehindtheproposed method isto have representation of thesegment to besele ted intermsof thelow-levelfeaturesbyusing thehigh-level features. Thiswasdoneby lusteringthe andidatesofaparti ularphonemeusinga ousti distan esand using de isiontrees to hoose a luster for unitsele tionsbyTaylor(2006).

2.4 Con atenation ost fun tion

Itisknownthatthea ousti spee hqualitydegradesduetothe on atenationofnon- ontiguous spee h segments. Also, studies have shown that onsidering the spe tral smoothness at the

(31)

etal.,1992). Thisholdsfor visual spee haswell. Hen e, anyabruptjumpinthevisual spee h sequen e an reate per eptual dis omfortand onfusion. Consequently,thefo usonredu tion of on atenation artifa ts arguably dates ba k to the onset of on atenative spee h synthesis itself. Espe ially in unit sele tion based spee h synthesis, there is a wide variability in the andidatesfor ea htarget required. Thisresults ina largevarian e inthe on atenationpoints aswell,likeinthe middleofaphonewhendiphoneisthesynthesisunit. Good on atenationis important not onlyfor a good synthesis quality, butalso for intelligibility(Clarketal.,2007).

While designing good on atenation strategies for unit sele tion, dierent approa hes have beenfollowed. The andidatepreferen efor on atenation isbasedontheobservationthat nat-urally ontiguousunitsautomati allyjoinwell. Hen e,allsystemsgivepreferen eto ontiguous unitsinthe orpus,besides onsideringimportantphoneti andprosodi hara teristi s. Infa t, some systems go further and sear h the longestpossible units fromthe orpus,soas to redu e the number of on atenation points (S hweitzer et al., 2003). Sin e it is infeasible to have a naturally ontiguous spee h in the orpus for every target sequen e to be synthesized, various joinoptimization te hniques have been developed.

The most widely followed approa h for on atenation is to minimize the dieren es at the on atenation points. This strategy is based on the observation that huge dieren es in the waveforms at the on atenation points a ount for per eptible degradation. Various distan e metri s al ulated using various a ousti parameters have been explored for estimating the per eptual degradation due to joins. Cepstra, line spe tral frequen ies, log area ratios, mel frequen y epstral oe ients, multiple entroid analysis (MCA) oe ients, linear predi tive oding oe ients area few of them. Eu lidean, Absolute, Kullba k-Leibler, Mahalanobis are some of the distan e measures explored. Given these many alternatives, it be omes ne essary to base the join dieren e estimation using those measures that orrelated well with human per eption. Hen e, there are many attempts to evaluate the parameter and distan e measure ombinationsto rankthembasedontheir orrelationtohumanper eptionofjoindis ontinuity. Some of these works ask listenersto evaluate joins on a 5-point MOS s ale and ompare these s oreswiththedistan es al ulatedusingvariousmetri sanda ousti parameters(Woutersand Ma on,1998,Vepaetal.,2002,2004,Donovan,2001,Bellegarda,2004). Insomeotherworks,the omparison between humanper eption anddistan emetri s isbasedonthedete tionof ajoin, i.e. abinarys ore(KlabbersandVeldhuis,1998,2001,StylianouandSyrdal,2001,Pantazisetal.,

2005). Theresults presented inthevarious worksdon't agreemu h withea hother. Kullba k-Leibler divergen e has been reported to perform well withdierent parameters in some of the

(32)

reported between the obje tive distan emeasures and the per eptual evaluationresults is 0.66 whi hhasbeen deemedlow. Hen e,the hoi e ofanyparti ularspee hparameterizationand a distan emeasure doesnot ensure an a urate estimateof per eptualdisruption at thejoin.

While tryingto redu e the joindisruption due to on atenation,naturally ontiguous units an be used to determine the set of units whi h an naturally join well. This an be based on their proximity to naturally good joins, i.e., ontiguous units in the orpus. The work done by Vepa and King (2003) an be onsidered to be in this dire tion. In their work, the natural evolution patternsinthea ousti parameters arelearned from the orpus,and usedas the basis for the evaluation of a join and dening a join ost fun tion. Naturally ontiguous spee hsamples areneverper eivedasdis ontinuous, though they areseldomexa tlythesame. From this observation, it an be on luded that humans are insensitive to a slight disruption at the on atenation point. This has been used as a basis for formulation of the evaluation of joins by Coorman et al. (2000). They have des ribed a masking fun tion to evaluate a join . Consequently,belowa ertaintransparen ythreshold thejoin ostis zero.

Irrespe tive of the distan e between two on atenation points, it has been observed that join disruption is not per eived uniformly a ross all the phoneti ontexts. In other words, the per eptual degradation of spee h is high in some phoneti units and ontexts than some others. Syrdal, 2001, 2005 report a systemati study of the human sensitivity to disruption at various ontexts, asummaryof theresultspresentedis asfollows: dis ontinuities areper eived more with femalevoi e based spee h synthesisto male voi e based spee h synthesis, higher in vowels than in onsonants, higher in diphthongs than to other vowels and higher in sonorant phonemes than non-sonorants. They also reported a omprehensive list of join dis ontinuity dete tion (

%

) based on the phoneme type. This shows that phonemi ontext is important and on atenation in ertain ontextsor phonemes arelesspreferableto someothers andhen e phoneme independent handling of on atenation strategies might not bethebest.

Con atenation of audio-visual units

All the salient points onsidered for a ousti unit on atenation are equally appli able for vi-sual or audio-visual unit on atenation. Here, the waythedistan es are al ulated for unitsat on atenationpointsdependsonthevisualfeatures. Forexample,in(Bregleretal.,1997),a dis-tan e to measurethedieren e inlip shapesintheoverlapping segments of adja ent triphones is in luded to a ount for the on atenation ost. It is al ulated as the Eu lidean distan e (frame-by-frame) between four element feature ve tor of arti ulatory features, outer-lip-width,

(33)

ided based on the pla e of least dieren e in the lip shapes. In (E.Cosatto et al., 2000), the visual on atenation osthastwo omponents,theskip ostandatransition ost. Skip ostisa penaltyfor anytwoframeswhi h arenot ontiguous inthe orpusand al ulated basedonthe orderingofframesinthe orpus,

0

for anytwo naturally ontiguous units orframes. The tran-sition ostis al ulated based on thevisual distan ebetween two frames. Its al ulated asthe Eu lidean distan e of two PCA feature ve tors extra ted based on the appearan e. Similarly, in (Ma et al., 2006), two frames are given zero on atenation ost when they are ontiguous in the original orpus, for those frames whi h are not ontiguous its al ulated as a sum of a minimum onstant value and a variable omponent al ulated based on the frames. The vari-able omponent inturn hastwo omponents, one of whi h is al ulated based on the distan e al ulated between thetwo frames. The se ond omponent of this variable on atenation ost ensures that the visemi transition in the synthesized and original orpus are the same. For exampletwoframes

i

and

j

anbe on atenatedifthepre edingframeof

j

belongstothesame visemi labelasthatof

i

. Thetraje toriesat thejoins aremadesmoothbyapplyingalowpass lterand ubi splines. In(Fagel,2006),the videojoint ost al ulation isbasedonthepixelto pixel olordieren esintheborderframesinthesegmentstobe on atenated( omputationally expensive).

2.5 Evaluation

We have onsidered various aspe ts of unit-sele tion based spee h synthesis. In this se tion, we present the ways of evaluating synthesized spee h. This is ne essaryfor exploring dierent approa hes to improve synthesis quality, in whi h ase hanges need to be quantied and for omparativeevaluationofdierentsynthesissystems. These anberelatedtosele tion, on ate-nation and overall systemtuning. As synthesized spee h istargetedfor human per eption, the mosta uratewaytoevaluateasynthesizedspee hisper eptualevaluationbyhumansubje ts. In-spite of its a ura y, automati evaluation is often done instead, by omparing synthesized spee h with a referen e spee h. This referen e is generally re orded real spee h whi h is not in luded in the orpus. This omparison isquantied using some obje tive evaluation metri s. Inthefollowing,wepresenttheobje tive evaluationmetri s andthentheper eptualevaluation by human subje ts. The evaluation of synthesized spee h by human subje ts is done in two

(34)

2.5.1 Obje tive automati evaluation of a ousti and audio-visual spee h Variousdistan emeasureshavebeenproposedfor omparingrealandsynthesizedspee hsignals. For example, epstral distan eisusedasa distan emeasureinmany worksfora ousti spee h (Hunt and Bla k, 1996; Meron and Hiros, 1999; Alías and Llorà, 2003). (Lata z et al., 2010) used onstituent distan es measures for duration, f0 and spe trum. Obje tive evaluation of audio-visualspee hisgenerallydonebasedonanindependentobje tiveevaluationofvisualand a ousti modalities. Alternatively, the obje tive evaluation of only one modality is performed sometimes, based on the fo us of analysis. For instan e, in (Huang et al., 2002) only the synthesized visual spee h is evaluated. It was done using three obje tive evaluation metri s . Theseweredevelopedforestimatingthepre ision(naturalness)andsmoothnessofvisualspee h; andsyn hronizationbetweena ousti andvisualmodality. Firstly,pre isionwasestimatedusing thesum ofEu lidean distan e between the real andsynthesizedsenten es, al ulated on visual parameters. Se ondly,smoothnesswasestimatedusingthesumofEu lideandistan e al ulated between adja ent framesin thesynthesized spee h whi h are from non- ontiguous lo ationsin the orpus. Lastly,audio-visual syn hronization wasestimated basedon the phoneti labels of synthesized frames. For this, only a few important phonemes were onsidered, whi h belong to one ofthe following two ategories. The rst ategory wasof those phonemes whi h have a hangeinthedire tionof themouth movement, i.e.,from losingto openingorvi eversa. The se ond ategoryin ludedthosephonemeswhi hhavemaximalmouthshapeslikeopenor losed mouths. SimilarlyEu lidean distan emeasurehasbeenusedbysomeothers(Weissenfeldetal.,

2005).

Instead of omparing real and synthesized spee h, Liu and Ostermann (2009) use average target ost,averagesegmentlengthandaveragevisualdieren ebetweenframesastheobje tive evaluationmetri sandminimizethemduringtotal osttuning. Thisisbasedontheassumption that the average target ost is representative of the lip-syn hronization (audio-visual syn hro-nization) and the other two metri s represent the smoothness of the spee h animation. But nally,for evaluatingtheweightsresulting fromthe tuningpro ess, ross orrelation oe ient between thePCA oe ients ofthe synthesized and real senten es was al ulated to represent thesubje tivequalityofthesynthesizedvisualspee h. Similarly,(Baillyetal.,2009) reportthe omparison of dierent arti ulatory gesture predi tion te hniques using the orrelation oe- ient between originalandpredi tedgestures. For obje tiveevaluationofthesynthesizedvisual spee h, Ma et al. (2006) use average errors of normalized arti ulatory parameters (lip-height, lip-width,lip-protrusion)betweentheoriginalandsynthesizedspee h. Thoughthesete hniques

(35)

withhumanper eption hasnot been quantied systemati ally.

2.5.2 Human- entered evaluation of a ousti and audio-visual spee h

For any text-to-spee h synthesis system, an evaluation of the overall system performan e by human subje ts is inevitable irrespe tive of whi h domain it is to be deployed. This is so be ause, the nal users of any synthesized spee h are humans. Manual evaluation of text to spee h synthesis system is generally done to evaluate at least two aspe ts of its synthesized spee h: quality(espe iallynaturalness)andintelligibility. Thesespe i aspe tstobeevaluated andtheevaluationte hniquesdependonthetargetappli ations. Thesepossibleappli ations an be, onversationalagentsforhearingimpaired,orformoviedubbingwithdierentaudioorvideo tra k, human- omputer intera tion to mention just a few to name. Some of the aspe ts whi h areappli ationspe i arethe following: (1)suitabilityofthespeakerwhi hdependsonhis/her voi e larity, ethni ity and native language whi h ae t pronun iation and also pleasantness for e- ommer e related appli ation, (2) time required for synthesis, (3) prosodi omponent a ura y, (4)overall intelligibility.

Generally,thequalityofsynthesizedspee hisevaluatedintermsofthesubje tiveevaluation measures, Mean Opinion S ore (MOS) or DMOS (degradation Mean Opinion s ore). These are also know as Absolute Category Rating (ACR) or Degradation Category Rating (DCR) respe tively. Intheseevaluations,humansubje tsaregenerallyaskedtogivea ategori als ore with respe t to some parti ular aspe t of the spee h whether it be a ousti , visual or audio-visualspee h. Thedieren ebetween thetwo (MOSand DMOS)isthatinthese ond ase,the s ore is generally given with respe t to a referen e, generally thereal utteran e. The dierent aspe tsofquality anbebroadly lassiedintonaturalness,pronun iation, pleasantness,overall omprehension and intelligibility. Their dierent ategories depend on the attribute that is being evaluated. The dierent aspe ts to be evaluated also depend on the method used for fa emodelingand rendering,besidesthetarget appli ation domain. Forexample,for a human- omputer intera ting experien e like virtual avatar for e- ommer e, the likabilityof thevirtual hara ter and its expressiveness of emotions are also important for onden e building. For example,in(Maetal.,2006)thea ura yandnaturalnessofthesynthesizedspee harereported in omparison with that of natural audio-visual spee h using the usual 5 point MOS s ale. Similarly,Baillyetal.(2009)report subje tive evaluationofaudio-visual spee hbysynthesized imagesequen eovernaturalaudiobypreferen etestsbasedon5-s aleMOStest(5-verygood, 4-good,3-average,2-insu ient,1-veryinsu ient). Alternatively,naturalnesstestsare ondu ted

(36)

alledTuring tests(Geiger etal.,2003;Liuand Ostermann,2009).

Theevaluationofintelligibilityisdonebytheper eptualevaluationatvariouslevels,phoneme, wordandsenten e. Forphonemelevelintelligibilitytesting,rhymetestsandnonsensewordsare utilized. Inrhymetests, wordsdieringinasinglephonemesegmentarepresentedandaskedto report thea tual word thatis heard bya human subje t. Diagnosti Rhyme Test (Fairbanks,

1958),ModiedRhyme Test(MRT)(Houseetal.,1963)aretwoofthewellknownrhyme tests. Bothusesinglesyllabi wordsets,former onsistsofword pairs,whereasthelaterhassetsofsix words ea h. Senten e-level tests are ondu ted to assay the intelligibility of words in ontext. The most ommonlyusedtest is withsemanti allyunpredi table senten es (SUS)proposed by

Benoîtetal. (1996). Inthese testsspe ial senten es are onstru ted whi h followthe synta ti rules of a language but don't have a oherent meaning as a whole whi h makes it di ult to ontextually predi t the word. (Lemmetty, 1999), gives a good a ount of the evaluationtests for syntheti spee hintelligibility.

It is di ult to evaluate theintelligibility of audio-visual spee h. Synthesized AV spee h is oftentested for itsmost itedadvantageovera ousti -onlyspee h,i.eimprovement in intelligi-bilityinnoisy onditions (LeGoetal.,1994). Consequently,theadditionofvisual modalityis evaluatedbyaddingnoisetothe a ousti modality. Thisisbe ausetheintelligibilityresultsofa visual-onlyspee hwouldbeverylow, espe iallyforSUS.Onthe ontrary,in aseof learspee h withoutanynoise,theintelligibilityus losetothebestpossibleanddoesnotaddanyadditional advantage of visual modality. For instan e, E.Cosatto etal. (2000) report thatthe AV spee h showssigni antimprovement intermsof theintelligibilityinnoisewhen omparedto a ousti spee h, with an error rate of

4%

for AV spee h ompared to

20%

with a ousti spee h. Fagel

(2006)reportsintelligibilitytestsofsynthesizedaudioandAVspee hin omparisonwithnatural audioandAVspee h. Itwasreportedintermsoftheper entageofvowel+ onsonant,voweland onsonant re ognition errors. Ounietal.(2007) presentmetri s toquantify theimprovementin intelligibilitybetween two visual onditionsin omparison witha ousti -only spee h.

Inthemethodswhi hperformvisualspee hsynthesisover a ousti spee h,the syn hroniza-tion of the two modalities is an additional aspe t whi h needs to be evaluated. For example,

Bregleretal.(1997)per eptuallyevaluatethelip-utteran esyn hronization,triphone-video syn- hronizationi.e. thedisruptionleveldueto on atenationofunitsbesides oarti ulationee ts. They report that there are o asional visible timing errors inthe ase of stop onsonants and thevisiblearti ulation isunsatisfa tory ompared to thenatural arti ulation ofphoneme when therequiredphoneme sequen eisnot availableinthe orpus. Mattheysesetal.(2009) reporta