École doctorale IAEM Lorraine
Département de formation doctorale en informatique

Synthèse Acoustico-Visuelle de la Parole par Sélection d'Unités Bimodales
(Acoustic-Visual Speech Synthesis by Bimodal Unit Selection)

THÈSE
pour l'obtention du
Doctorat de l'Université de Lorraine
(spécialité informatique)
présentée par
Utpala MUSTI

Composition du jury

Rapporteurs : Jean-Claude MARTIN - Professeur en Informatique, Université Paris-Sud
              Piero COSI - Senior Researcher, CNR, ISTC, Italie
Examinateurs : Catherine PELACHAUD - Directeur de recherche, CNRS-TELECOM ParisTech
              Bernd MÖBIUS - Professeur, Universität des Saarlandes
              Anne BOYER - Professeur, Université de Lorraine
              Yves LAPRIE - Directeur de recherche, CNRS-LORIA
              Vincent COLOTTE - Maître de conférences, Université de Lorraine
              Slim OUNI - Maître de conférences, Université de Lorraine

Laboratoire Lorrain de Recherche en Informatique et ses Applications — UMR 7503
tel-00927121, version 1 - 11 Jan 2014
To my daughter, Samyukta.
Audio-Visual Speech

Chapter 1 Audio-Visual Speech Synthesis: An Introduction
1.1 Face modeling and animation
1.2 Separate visual speech synthesis
1.3 Simultaneous synthesis of audio-visual speech
1.4 Conclusion

Chapter 2 Speech Synthesis Using Unit Selection: Literature Survey
2.1 Unit selection paradigm
2.2 Segmentation
2.3 Target cost function
2.3.1 Visual target features
2.3.2 Target feature weighting
2.3.3 Alternatives to conventional target cost function
2.4 Concatenation cost function
2.5 Evaluation
2.5.1 Objective automatic evaluation of acoustic and audio-visual speech
2.5.2 Human-centered evaluation of acoustic and audio-visual speech
2.6 Conclusion

Chapter 3 Acoustic-Visual Speech Synthesis System: An Overview
3.1 Corpus preparation
3.1.1 Text selection
3.1.2 Acquisition
3.1.3 Data processing and parameter extraction
3.1.4 Segmentation
3.1.5 Bimodal speech database
3.2 Bimodal speech synthesis
3.2.1 Natural language processing
3.2.2 Target unit description
3.2.3 Bimodal unit selection and concatenation
3.3 Visual speech rendering
3.4 Conclusion

Chapter 4 Phoneme Classification Based on Facial Data
4.1 Visual speech segmentation using facial data
4.1.1 Recognition error
4.1.2 Forced alignment results
4.2 Learning phoneme kinematics using EMA data
4.2.1 Data acquisition
4.2.2 Feature extraction
4.2.3 Results
4.3 Conclusion

Chapter 5 Unit Selection
5.1 Target features
5.2 Corpus-based visual target features
5.2.1 Phonetic category modification
5.2.2 Continuous visual target cost function
5.2.3 Objective evaluation of synthesis results
5.3 Target feature selection and weight tuning
5.3.1 Unit selection and concatenation
5.3.2 Target feature selection and weight tuning
5.3.3 Application to AV target cost function tuning
5.3.4 Analysis of selected features and their relative importance
5.4 Conclusion

Chapter 6 Evaluation
6.1 Objective evaluation
6.1.1 Objective evaluation based on comparison of two signals
6.1.2 Objective evaluation based on statistical analysis and thresholds
6.2 Human-centered evaluation
6.2.1 Intelligibility tests
6.2.2 Quality evaluation tests
6.3 Analysis of perceptual evaluation for better objective metrics
6.4 Conclusion

Chapter 7 Conclusion

Publications

Appendices

Bibliography
Enacted or animated stories are more popular than audio narrations or printed ones. It is easy to conclude that this is due to their audio-visual nature, which provides a rich experience. Beyond entertainment, we generally perceive everything through our ears and eyes simultaneously. The visual information perceived through the eyes either complements or reinforces the auditory information. This applies to speech as well, which is one of the prime modes of communication. Speech perception in day-to-day life is primarily bimodal: we see and hear what is being spoken, and understand the speech if it is in a known language. Whenever the auditory input is ambiguous or noise-ridden, we try to supplement the received information by looking at the source, i.e., the speaker. This bimodal nature of speech is illustrated by the observation that we humans prefer face-to-face conversation when discussing issues of high importance. This is because face-to-face communication conveys the complementary information related to speech articulation and emotions more effectively than voice alone. Hence, bimodal speech can be considered more effective in confidence building. Besides entertainment and communication, the basic milestone towards verbal communication, i.e., speech development in babies, also draws significantly on the observation of visual speech along with the corresponding sound (Teinonen et al., 2008; Andersen et al., 1984).

Some of the above general observations about the advantages of audio-visual speech over acoustic-only speech have been experimentally verified. It has been shown that the addition of visual speech enhances speech detection and recognition, thus improving intelligibility when the audio is missing, degraded by noise, or mixed with multiple sources of speech (Sumby and Pollack, 1954; Ouni et al., 2007; Summerfield, 1979; Schwartz et al., 2004). The evaluation of visual speech intelligibility by Le Goff et al. (1994) shows that a natural face presented without audio, or with degraded audio, restores two-thirds of the acoustic intelligibility, while a facial model without a tongue and a lip-only model restore half and one-third of it, respectively. Speech presented along with facial animation has been observed to be a preferred interface over voice-only presentation, and has been shown to increase the interactive experience of users (Pandzic et al., 1999).
These advantages of audio-visual speech over acoustic speech indicate its vast application possibilities. It has been widely used in entertainment and e-commerce for developing virtual agents. These applications do not necessarily need high accuracy of speech articulation. Other applications, however, require accuracy comparable to that of natural audio-visual speech. These include applications for pedagogical activities, for example, virtual language tutors for e-learning and teaching speech articulation to the hearing impaired (Massaro, 2006). It can also be used to develop virtual announcers for public places, which are usually noisy.

Considering the preceding discussion, audio-visual speech synthesis is a significant domain to pursue. However, the advantages of natural bimodal speech can be realized through synthesized audio-visual speech only if the latter is comparable to the former. This is because humans have implicit expectations of audio-visual speech based on their learning and experience of face-to-face communication. These expectations relate to the temporal alignment and coherence between the acoustic and visual modalities. For instance, while hearing sounds like /p/, we expect a closure of the lips just in time before the onset of that sound. Similarly, we expect to hear a high-pitched voice in a conversation where somebody is seen to be in extreme fear. This means that the synthesized audio-visual speech must have acoustic and visual streams that are temporally synchronous and coherent with each other.
A majority of approaches to audio-visual (AV) speech synthesis synthesize the facial animation over the speech acoustics, and then perform additional processing to synchronize the two wherever necessary. This is based on the assumption that AV speech synthesis is a set of two different problems, which are then addressed sequentially by synthesizing visual speech over synthesized speech acoustics. There are two problems with this approach. To begin with, synchronizing two streams synthesized separately is not straightforward. Humans are extremely sensitive to any asynchrony between the audio and the speech animation. In fact, this sensitivity to discriminate synchronous speech from asynchronous speech develops very early in human infancy, with a significant preference for synchronous speech (Dodd, 1979). Results from (Grant and Greenberg, 2001, 2004) show that human speech perception is extremely sensitive to any lag in the visual domain compared to the audio, unlike the other way around. It has also been observed that this asynchrony causes a surge in the intelligibility of asynchronous audio-visual speech. Moreover, this also brings in the issue of inconsistency between the visual and acoustic domains, which might cause discomfort (Mattheyses et al., 2009). This inconsistency can also affect the final perception of the audio-visual speech, as illustrated by some of the experimental data in (Green and Kuhl, 1989, 1991). These experimental results show that the perception of place and […] acoustic modality. The worst case, where the perception of AV speech can be highly affected, is the McGurk effect (McGurk and MacDonald, 1976). In fact, when different facial animation and acoustics are presented synchronously, subjects may experience a fusion or combination effect. The fusion effect is seen, for example, when a visual /g/ is presented synchronously with an acoustic /b/: the result is perceived as /d/. Similarly, when a visual /b/ is presented synchronously with an acoustic /g/, it is perceived as /bg/, which is an example of the combination effect. This indicates that synthesizing audio-visual speech by separating the synthesis of the two modalities might not always ensure the best result in terms of synchrony and coherence of the two modalities. In general, simultaneous processing of acoustic and visual speech has been shown to yield advantages with respect to audio-visual integration that are not available with independent processing (Chen and Rao, 1998).
To ensure perfect alignment and coherence between the acoustic and visual modalities, we advocate synthesizing audio-visual speech simultaneously, treating the two modalities as a single entity. In this thesis, we present our method for audio-visual speech synthesis based on this principle. We base our speech synthesis on the unit selection paradigm, and perform simultaneous synthesis of the acoustic and visual modalities by concatenating bimodal units. In doing so, we keep the natural association between the two modalities intact, as the visual and acoustic modalities belong to the same speech segment. It should be emphasized that this approach implicitly addresses the above-mentioned issues of asynchrony and incoherence. This work can be considered a crucial first step towards a comprehensive talking head. Our main focus is to synthesize the audio-visual speech dynamics accurately; the result is not a complete talking head yet. Our facial representation is limited to a sparse mesh describing the outer surface of the face, including the lips. The audio-visual speech does not include information related to the internal articulators, like the tongue and teeth, or other components necessary for expressive speech. In the course of this work, we first studied the bimodal speech corpus that we acquired by designing and analyzing visual speech segmentation experiments. Then, we developed the basic system implementing our idea of bimodal unit concatenation. Building on this basic bimodal unit-selection framework, we developed methodologies to improve the bimodal synthesis. In our work, we address the following problems: (1) unit selection taking both acoustic and visual considerations into account, which can drastically increase the complexity; and (2) weight tuning, which is a difficult problem in speech synthesis. To this end, we developed corpus-specific visual target costs and an iterative target feature weighting algorithm. Finally, we performed perceptual and subjective evaluation experiments […]
This thesis is organized as follows. We begin by reviewing the field of audio-visual speech synthesis in chapter 1. In this chapter, we discuss the ways in which the face has been modeled and animated. We also discuss the various approaches to audio-visual speech synthesis based on separate or joint synthesis of the two modalities. Our speech synthesis system is built on the generic paradigm of unit selection, which is the topic of chapter 2. We review the literature related to several aspects of unit selection, including the segmentation performed during corpus preparation. The various building blocks of selection are also examined: target description, and target and concatenation costs. Finally, we review the ways of evaluating synthesized speech. In chapter 3, we present our work, first providing an overview of our audio-visual speech synthesis system. The chapter also details our audio-visual corpus recording and database preparation for the synthesis system. The resulting audio-visual database is an interesting resource which can be used for studying various phonemes. As a first step in this direction, we have performed segmentation of the visual data. We describe these segmentation experiments, their results, and the analysis of these results in chapter 4. In chapter 5, we detail the different strategies that we developed to optimize our system, including the design of new visual target features and target feature weighting. Finally, in chapter 6, we present the objective evaluation, the perceptual evaluation, and the analysis done to bring out the relation between the two. We conclude in chapter 7 and outline our future work.
Audio-Visual Speech Synthesis: An Introduction
In this chapter, we look at some of the earlier approaches to audio-visual speech synthesis. For any speech, acoustic or audio-visual, to be synthesized from text, the underlying phoneme sequence corresponding to the text has to be specified first. Given this specification, various approaches can be followed for AV speech synthesis. Firstly, these approaches can be divided based on whether the visual and acoustic modalities are synthesized separately or simultaneously. Secondly, the synthesis of the acoustic or visual modality, in the case of separate synthesis, can be divided based on the synthesis paradigm: rule-based, articulatory, or concatenative (Theobald, 2007). Thirdly, the approaches can be classified based on their facial rendering technique: 3D modeling of the face or image-based rendering.

In a rule-based synthesis system, the well-known representative characteristics of speech are simulated using predefined rules. Articulatory synthesis, in contrast, simulates the natural process of speech production using models of human anatomy. For instance, air flow is simulated through a controlled model of the human vocal tract, and the skin of the face is deformed using bones and muscles. Concatenative speech synthesis is performed by concatenating segments of recorded human speech, generally called a corpus. It can be put into a broader category called corpus-based speech synthesis, which also includes HMM-based speech synthesis. HMM-based synthesis depends on learning the patterns of speech parameters from a given corpus, which are then used to generate speech parameters. The concatenative approach is like memorizing the whole data, and then accessing the memory at the time of synthesis.

In the following sections, we focus on audio-visual speech synthesis. First, we briefly describe the facial rendering techniques (section 1.1). Then, we discuss the approaches which synthesize the acoustic and visual modalities separately and simultaneously, in sections 1.2 and 1.3.
1.1 Face modeling and animation
The face has been encoded and presented in two ways for the purpose of facial animation. The first approach is 3D modeling of the face. The outer surface of the face is modeled using a mesh of connected polygons. These polygons are made of predefined edges connecting a set of 3D point vertices. Changes in the 3D point locations, and the consequent changes in the mesh, account for the deformations of the face. The first 3D facial model was developed by Parke (Parke, 1972, 1975, 1982). In this model, the 3D points were defined and controlled by a set of parameters. These parameters were conceptually divided into two distinct sets (functionally they might overlap): conformation parameters and expression parameters. The conformation parameters were the ones which define the dimensions of the 3D face. That is, if 3D faces are modeled based on real human subjects, for instance, then the conformation parameters define the basic 'differentiating' dimensions of that particular human face. These included parameters like the aspect ratio of the face (height to width) and relative sizes specifying the forehead, eye separation, nose height, cheek, chin, etc. The expression parameters were those which described mainly the movements of the eyes and mouth. They included deformations like jaw rotation, width of the mouth, and position of the upper lip and corners of the mouth. These deformations might be related to speech or to emotional expressions. From these two categories of parameters, the positions of the 3D points on the face were determined using different types of operations, applied independently to some regions or to the whole face. The eyes were controlled by specific procedures; the other operations included interpolation, rotation, translation, and scaling. The final rendering was done through Phong interpolation (Phong, 1975) based on the parameter specifying the direction of the light source. There are many virtual characters which are descendants of this Parke model (Cohen and Massaro, 1993; Beskow, 1995; Olives et al., 1999). These descendants of Parke's model feature various additions to improve the appearance of the face and the animation, like the addition of the tongue, ears, or the back of the head, and the addition of control parameters. The advantage of this kind of parametric model is that the whole mesh is specified using a small set of parameters. Parke's parametric model differs from some other parametric models, which are based on modeling the underlying anatomical structure, like bones, muscles, skin, and the forces acting on them (Waters and Terzopoulos, 1990; Waters, 1987; Lee et al., 1995; Ekman and Friesen, 1978). This kind of modeling has been observed to be computationally intensive (Bailly et al., 2003). Some talking heads which present emotional facial animations are based on pseudo-muscle contractions (Cosi et al., 2003; Pelachaud et al., 2001). MPEG-4 standardizes the parametric models by defining a minimum set of 84 feature points (FPs) located on the face.
The animation of such models is controlled by a set of parameters called facial animation parameters (FAPs) (Ostermann, 1998).
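To make the parametric idea concrete, the sketch below shows how a single expression parameter (here a jaw-rotation angle) can drive one region of a vertex mesh while the rest stays fixed. The toy mesh, the region split, and the pivot point are invented for this illustration; they are not Parke's actual parameter set.

```python
import math

# Toy mesh: a few 3D vertices; the last two belong to the "jaw" region.
vertices = [(0.0, 1.0, 0.0), (0.5, 0.5, 0.1), (0.0, -0.5, 0.1), (0.3, -0.6, 0.0)]
jaw_region = {2, 3}           # vertex indices deformed by the jaw parameter
jaw_pivot = (0.0, 0.0, -0.5)  # rotation axis (parallel to x) through the jaw joint

def apply_jaw_rotation(verts, angle_deg):
    """Rotate jaw-region vertices about the x-axis through jaw_pivot; leave the rest."""
    a = math.radians(angle_deg)
    out = []
    for i, (x, y, z) in enumerate(verts):
        if i in jaw_region:
            dy, dz = y - jaw_pivot[1], z - jaw_pivot[2]
            y = jaw_pivot[1] + dy * math.cos(a) - dz * math.sin(a)
            z = jaw_pivot[2] + dy * math.sin(a) + dz * math.cos(a)
        out.append((x, y, z))
    return out

# One scalar parameter repositions the whole jaw region of the mesh.
opened = apply_jaw_rotation(vertices, 15.0)
```

The point of the example is the economy Parke's model exploits: a full mesh state is reconstructed from a handful of parameter values, rather than stored vertex by vertex.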
Besides 3D modeling of the face, the second approach for representing a face is through the usage of facial images, most often images of real people. Hence, image-based approaches are generally data-driven. Facial animations using images are generated in two ways. First, they can be generated by interpolating a few specific images that are representative of the typical articulation of visually identical phonemes, called visemes (Ezzat and Poggio, 1998). Alternatively, they can be generated by concatenating image sequences (Bregler et al., 1997; Cosatto et al., 2000).

The image-based approaches present more realistic faces. This is because of their proximity to the real facial appearance, which is often described as being photo-realistic. However, this way of encoding or presenting a face is most often limited to a straight-head frontal view. Besides, the storage of images incurs a significantly higher memory requirement than the storage of a few parameter trajectories. On the other hand, the 3D-model-based approach is flexible in terms of the views and head orientations in which a face can be rendered, but an additional processing step is required to add the internal articulators, like the tongue and teeth, to render the complete articulatory information. It is possible to augment the 3D model by adding textural information to make the final facial animation flexible and comparatively photo-realistic (Elisei et al., 2001). Another alternative for modeling the face is the morphable models presented in (Cootes et al., 1998; Blanz and Vetter, 1999). These models also embed both geometric and texture-related information to present a relatively photo-realistic and flexible facial model.
1.2 Separate visual speech synthesis
Conventionally, AV speech synthesis is considered as two separate problems: the generation of speech acoustics, and the generation of facial animation for a given speech acoustics (real or synthesized). Consequently, it has been performed by synthesizing the two modalities separately. Facial animation is generated over a given speech acoustics, which is either synthesized or recorded. This approach requires additional processing to correct the alignment between the two modalities in the case of concatenative visual speech synthesis (Bregler et al., 1997). We refer to the facial animation related to speech as visual speech. We focus here on the visual speech synthesis stage, considering the acoustic speech already available. Two concepts which surface in the discussion of visual speech are visemes and coarticulation. In the following paragraphs, we first explain these two concepts before going ahead with the synthesis techniques.
Visemes: Visible speech articulation presents similarities for many phonemes. Based on this similarity, phonemes can be grouped into sets, and these sets are defined as visemes. The viseme is the fundamental unit in the context of visual speech (Fisher, 1968). For example, the perception of visual speech while the phonemes in the set {p, b, m} are being articulated is almost the same; hence, they belong to one viseme set. In the current discussion, by viseme we mean a sequence of visual speech parameters describing a complete segment, rather than static targets. By contrast, we refer to a single sample of these parameters, describing a snapshot of a particular target face, as a 'key frame'. The visual speech parameters can be image frames, or trajectories of control parameters or of 3D points on the face. This many-to-one mapping of visual speech makes the separation of visual speech synthesis from acoustic speech synthesis advantageous: the system becomes more concise due to the reduction in the number of distinct units. In the case of concatenative visual speech synthesis, this increases the number of possible candidates.
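The many-to-one phoneme-to-viseme mapping can be sketched as a simple lookup. The {p, b, m} bilabial class comes from the text above; the other classes below are invented groupings for illustration, not the thesis's actual viseme inventory.

```python
# Illustrative many-to-one phoneme-to-viseme map. Only the {p, b, m}
# bilabial class is taken from the text; the rest are assumed examples.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
    "k": "velar", "g": "velar",
}

def viseme_sequence(phonemes):
    """Collapse a phoneme sequence into its viseme classes."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

def candidates(viseme, corpus):
    """All corpus segments whose phoneme label falls in the given viseme class.

    In a concatenative visual synthesizer, every recorded instance of any
    phoneme in the class is a candidate unit for that viseme.
    """
    return [seg for seg in corpus if PHONEME_TO_VISEME.get(seg, "other") == viseme]
```

This is the conciseness argument made above in miniature: collapsing classes shrinks the set of distinct unit types while enlarging the candidate pool per type.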
Coarticulation: Coarticulation is the phenomenon in which the articulation of a phoneme is influenced by the articulation of the neighboring phonemes. Synthesized visual speech needs to represent coarticulation accurately. In the case of parametric 3D facial models, the parameters for animating them have been generated taking coarticulation into account using rules (Beskow, 1995; Pelachaud et al., 1994) or mathematical coarticulation models (Öhman, 1967; Cohen and Massaro, 1993; Cosi et al., 2002). Beskow (1995) mentions that each phoneme has a target vector specifying the typical articulatory gestures. These target vectors are under-specified for some phonemes and are interpolated based on the context to account for coarticulation. Pelachaud et al. (1994) divide phonemes into clusters based on their deformability in different contexts. Phonemes with lower deformability serve as the key frames for coarticulation. Öhman (1967) accounts for the changes during the transformation of a V1CV2 (vowel-consonant-vowel) sequence. Cohen and Massaro (1993) implement Löfqvist's gestural theory, where phonemes are specified with target feature vectors. Coarticulation is defined as the superimposition of time-varying dominance functions describing different articulators. These dominance functions are negative exponential functions which peak at the target feature vectors. This coarticulation model was further augmented by Cosi et al. (2002) with the addition of resistance functions. These resistance functions ensure that some specific target configurations are attained by suppressing the dominance of neighboring phonemes. This is especially important for phonemes like labials and bilabials. Beskow (2004) reports an experimental comparison of various approaches to account for coarticulation. He reports that the mathematical model proposed by Cohen and Massaro (1993) performs well in comparison with the real data, whereas, with respect to intelligibility, rule-based techniques perform better. These models can be optimized through hand-tuning or […] (Elisei et al., 2001). Ezzat et al. (2002) also perform tuning of a coarticulation model through statistical learning on a recorded corpus. Their coarticulation model is similar to that of Cohen and Massaro (1993), but instead of using motion data, they used an image-based corpus for tuning their model.
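The dominance-function idea just described can be sketched numerically: each phoneme contributes a target value weighted by a negative-exponential dominance peaking at its own time, and the rendered parameter is the normalized weighted average. This is a simplified one-articulator, one-dimensional reading of the Cohen-Massaro model; the functional form and constants below are assumptions for illustration.

```python
import math

def dominance(t, center, alpha=1.0, theta=20.0):
    """Negative-exponential dominance, peaking at the phoneme's center time.

    alpha and theta (magnitude and decay rate) are illustrative constants.
    """
    return alpha * math.exp(-theta * abs(t - center))

def blended_parameter(t, segments):
    """Weighted average of per-phoneme targets; segments = [(center_time, target), ...]."""
    num = sum(dominance(t, c) * target for c, target in segments)
    den = sum(dominance(t, c) for c, _ in segments)
    return num / den

# Two phonemes with scalar targets 0.2 and 0.9 (e.g. a lip-opening parameter),
# centered at 100 ms and 250 ms:
segs = [(0.10, 0.2), (0.25, 0.9)]
trajectory = [blended_parameter(i / 100.0, segs) for i in range(0, 35)]
```

Near each phoneme's center its own dominance wins and the trajectory approaches that target; between centers the targets blend smoothly, which is exactly how the superimposed dominance functions encode coarticulation.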
Corpus-based approaches:
Instead of using explicit coarticulation models, coarticulation can be implicitly encoded in the synthesized visual speech. This is done in corpus-based approaches. Firstly, the complete trajectories of visual speech parameters can be generated using models like HMMs, which are trained on real data (Brand, 1999; Masuko et al., 1998). In this case, the HMM can be modeled as a triphone, which describes a phoneme in the required phonetic context. Alternatively, the complete sequences of visual speech parameters from real motion capture data can be stored and concatenated for synthesis (Minnis and Breen, 2000). In this approach, coarticulation is encoded through the synthesis unit, like a triphone or diphone.
In the case of concatenative approaches, the visual speech database has to be prepared. Besides acquisition, the corpus needs processing to annotate the individual units in terms of their phonetic labels, segment boundaries, and information related to the geometric properties of the faces, for ensuring smooth transitions at the concatenation points. One of the concatenative approaches, for dubbing applications, is presented in Bregler et al. (1997). They prepare the visual database by phonetically segmenting an unconstrained video sequence. This segmented video is annotated to include information on the orientation of the head and the shape and position of the mouth. They use eigenpoints to estimate the fiduciary points on the face (mouth, teeth, chin and jawline) using 26 hand-annotated images. The synthesis is done by the concatenation of triphone video clips. The synthesized mouth sequences are then morphed onto the background video sequence. The resulting video sequence is compressed or stretched to time-align with the target audio between phoneme boundaries.
The synthesis described in (Cosatto et al., 2000) is based on the concatenation of variable-length video sequences of mouth images (and also of other facial parts). The database is described in terms of 3D geometric features of the head and appearance features extracted by Principal Component Analysis (PCA). They further subdivide the facial parts into cheeks, teeth, tongue, jaw, etc., to make the synthesis more flexible. The final synthesis is done by overlaying bitmaps of the facial parts present in the database onto a background video, as in (Cosatto and Graf, 1998). There are other similar image-based concatenative approaches, such as (Weissenfeld et al., […]), who use Locally Linear Embedding (LLE) to describe the appearance parameters of the mouth images, unlike Cosatto and Graf (1998), who use PCA. Liu and Ostermann (2009) use PCA to extract appearance parameters and Active Appearance Models (AAM) to extract the geometric parameters of the face (lip width, lip height, visibility of teeth and tongue). A similar approach, but based on a parametric 3D facial model, is presented in (Ma et al., 2006). In this approach, the control parameters extracted from recorded 3D facial marker data are concatenated using unit selection. The resultant trajectories are used to animate virtual conversational agents.
Some approaches combine HMMs with concatenation for visual speech synthesis. One such work is presented in (Lijuan et al., 2010). It is an image-based approach where the selection process is guided by the trajectory of lip movements generated by trained HMMs. These HMMs are trained on the AV speech corpus. This approach is similar to an earlier work by Govokhina et al. (2006), in which phonetically aligned trajectories of 3D facial markers are selected based on the trajectories generated by trained HMMs. A hybrid unit-selection and HMM-based approach for visual speech synthesis is presented in (Edge et al., 2009). This work uses the selected units to train state-based models and searches through these learned models with a Viterbi-type algorithm. The similarity in speech acoustics (acoustic parameters) was used to guide the unit selection. The final sequence of state-based models is used to generate smooth trajectories for visual speech. Bailly et al. (2009) describe a system which generates articulatory gestures (control parameters) for a video-realistic (image-based) facial animation using HMMs. They incorporate a phasing model to learn the lag between visual gestures and the corresponding speech acoustics. They compare this HMM-based technique, which includes the phasing model, with three other techniques: (1) concatenation of articulatory gestures selected based on the phonetic context, (2) concatenation of articulatory gestures based on selection guided by the phasing-model-based HMM, and (3) trajectories generated by HMM models trained on audio-synchronized articulatory gestures. They conclude that the phasing-model-based HMMs improve the synthesis.
Almost all of these works report lip-synchronization problems. Bregler et al. (1997) report that plosives were observed to have occasional lip-synchronization problems; Cosatto and Graf (2000) report lip synchronization being criticized in subjective evaluation. Geiger et al. (2003) present the perceptual evaluation of the synthesis approach presented in (Ezzat et al., 2002). They report that the synthesized audio-visual speech is not comparable to natural audio-visual speech to the extent that is required for developing applications for teaching language or […]
1.3 Simultaneous synthesis of audio-visual speech
The potential applications of audio-visual speech hinge not only on the accuracy of the synthesized visual speech, but also on the extent to which the acoustic and visual streams agree with each other in terms of synchrony and coherence. It is obvious from the previous section that, with separate synthesis of the acoustic and visual modalities, these conditions are not always guaranteed. In this section, we look at approaches which synthesize audio and visual speech simultaneously. The central mechanism of all these approaches is to keep the association between the visual and acoustic modalities, thereby preserving the natural synchrony and coherence. The majority of approaches in this category are based on the concatenation of synchronous bimodal units. One approach, presented by Tamura et al. (1999), uses HMM models trained on synchronous audio-visual speech data to generate bimodal speech parameters. However, it should be said that this approach was still at a rather preliminary level, as the generated visual speech parameters were related only to the lip contours.

The concept of synchronous bimodal unit concatenation for Swedish AV speech synthesis has been presented in (Hallgren and Lyberg, 1998). The visual speech information is recorded as trajectories of 3D markers all over the face, especially around the lips. The recorded marker information is used to control a 3D model of the head. This head model is further textured to make it look more natural.
Two recent image-based approaches that use concatenation of bimodal units are (Fagel, 2006; Mattheyses et al., 2009). In (Fagel, 2006), AV speech synthesis is done for German by concatenating synchronous bimodal polyphone segments. This was done with a 4-minute corpus consisting of bimodal speech: video of speech aligned with the corresponding phonetic transcript. The selection of polyphone segments for concatenation was based on a concatenation cost calculated as a weighted sum of acoustic and visual concatenation costs. The pre-selection of possible polyphone segments from the corpus is based on chunks (the longest polyphone segments available in the corpus), and the visual join cost calculation is based on the pixel-to-pixel color differences in the end frames of the video clips to be concatenated. Hence, it is quite clear that the synthesis incurs a large overall processing time. In (Mattheyses et al., 2009), the conventional unit-selection technique, which has been widely used for acoustic speech synthesis, is extended to perform AV speech synthesis. This is done by including an additional join cost term for visual join discontinuities. Their system is similar to the one explained in (Liu and Ostermann, 2009) in terms of the visual features extracted and used to describe the facial geometry and appearance. These methods, like any image-based technique, incur a high storage requirement when compared […]
1.4 Conclusion
In this chapter, we have discussed various techniques to model the face, based on either a 3D or an image-based representation, along with the pros and cons of each technique. Further, we have examined approaches to AV speech synthesis based on either the sequential synthesis of the two modalities (synthesizing facial animation after acoustic speech synthesis) or their simultaneous synthesis, and we have highlighted the disadvantages of the former. Consequently, we are in favor of synchronous, data-driven synthesis of audio-visual speech, and our approach follows this line of synthesis. As can be seen in chapter 3, our approach uses a unit-selection paradigm to synthesize both visual and acoustic modalities simultaneously. In the following chapter, we present a survey of various aspects of unit selection, and then we introduce our system in chapter 3.
Chapter 2
Speech Synthesis Using Unit Selection:
Literature Survey
Speech synthesis is a well-established field of research with significant progress in the past three decades. Though synthesized speech is getting closer to human speech, it is still far from being considered a solved problem. In addition, we are still away from a perfect all-purpose speech synthesizer. This is true for both acoustic-only and audio-visual speech. Among the synthesis techniques, concatenative techniques have become very popular in recent times. These methods have been widely used and have evolved for acoustic synthesis. Nevertheless, the paradigm is generic and has been extended to visual or audio-visual speech synthesis. In early concatenative acoustic synthesis, few instances of each diphone were stored in the inventory. The synthesis specification included the prosodic description related to the duration and pitch of targets in the sentence to be synthesized. At the time of synthesis, these diphones were modified using signal processing techniques to bring in the changes related to prosody and then concatenated. This kind of intensive signal processing done on the waveform distorts its naturalness. The advantage of this system was the small size of the diphone inventory, which was a necessary requirement at the time of its usage. Moreover, in spite of the signal processing, such a system does not account for all the variations of speech accurately.
As computer storage is getting cheaper and faster, it has become possible to store huge speech databases many times larger than the earlier small inventories of diphones. The usage of a huge corpus makes it possible to include a large set of candidate diphones with large variability in their waveforms. Moreover, it is even possible to have longer synthesis units than a diphone. In fact, it is even possible to search for whole sentences or big chunks of sentences. This drastically reduces the need to process the speech signal. Consequently, the resultant speech preserves the naturalness of the original speech, as the speech segments are concatenated with
little to no signal processing.
Nevertheless, the usage of a large speech corpus brings different problems. A large variance in the synthesis candidates means that selection has to be done carefully, to synthesize speech which is similar to a natural utterance. This is the classical unit selection problem. We discuss some of the issues of unit selection techniques, and the approaches that have been applied to resolve them. In the following sections, we first give a brief introduction to the emergence of the unit selection framework and its basic paradigm (in section 2.1). In section 2.2 we give a short description of the segmentation techniques used in corpus preparation, then a description of the pre-selection of candidates and the conventional target cost formulation based on the independent feature space assumption and its tuning (in section 2.3). Next (in section 2.4), we give a brief account of the ways join evaluation techniques have been analyzed for their correlation with human perception of discontinuity when non-contiguous units are concatenated. Finally (in section 2.5), we deal with the objective and perceptual evaluation methodologies that are generally employed to estimate and sometimes qualify a text-to-speech synthesis (acoustic or audio-visual) for its use in a specific domain.
2.1 Unit selection paradigm
Unit selection depends on the selection of the best possible set of units from the different variants available in the corpus. Consequently, the first requirement is to have a corpus that not only has a good coverage of the possible speech variants, but which is also comparatively small to keep the search time short (Möbius, 2000). Given a particular speech corpus, the quality of the speech synthesized using unit selection depends on how the corpus is used. Many factors affect the synthesis results. For example, concatenation of units can be said to be the most obvious reason for audible disruption, and many initial systems were based on the reduction of concatenation points (Sagisaka, 1988). In (Sagisaka, 1988), the selection of the longest segments is given preference, and concatenation at certain locations, like at CV (consonant-vowel) boundaries or in the middle of vowels, is penalized. Alternatively, when it is not possible to avoid concatenation of non-contiguous units, minimization of distortion at the concatenation point minimizes the quality degradation (Takeda et al., 1990; Iwahashi et al., 1992). Besides reducing the concatenation of non-contiguous units, there are other necessary factors that need to be considered. For example, the phonetic context of the selected unit and the speech realization of the unit itself seem important (Takeda et al., 1990; Iwahashi et al., 1992).
The search procedure proposed in (Hunt and Black, 1996) for unit selection offers a unified, possibly optimal solution to the selection-concatenation problem. For a sequence of candidate units u and a sequence of required target units t, the paradigm presented by Hunt and Black (1996) optimizes a total cost function which is a weighted sum of the following:

• The perceptual suitability of u for t, which is called the target cost, denoted by $TC(t, u)$.

• The total discontinuity at all the concatenation points, called the join cost, denoted by $JC(u)$.

Denoting the weights of the target cost and the join cost by $w_{tc}$ and $w_{jc}$ respectively, the search over a given corpus for the final sequence of candidates returns the candidate sequence which minimizes the total cost $C$, as shown below:

$$C = \min_{u} \left[ w_{tc}\, TC(t, u) + w_{jc}\, JC(u) \right] \qquad (2.1)$$

Here, the pre-selection of units is based on same-size units, like phones or diphones, for each target position. This pre-selection is based on the target cost determining the suitability of the candidate and its context. Also, in this general framework, the selection of the longest contiguous candidates is enforced implicitly by making the individual join cost for any two contiguous units in the corpus zero (Balestri et al., 1999). This has the advantage of taking into account the variability of speech realization, besides reducing the concatenation artifacts, in the selection of the best possible set of candidates. In contrast, some methods explicitly search for the longest contiguous units for concatenation, called non-uniform unit selection, where the units sought for concatenation are not of the same size or type (Taylor and Black, 1999; Boëffard, 2001; Schweitzer et al., 2003). This is different from the earlier paradigm, which is implicitly non-uniform unit selection, as there might be many contiguous segments of variable size in the final synthesized speech. Clark et al. (2004) give a good description of the practical aspects of building a unit-selection based speech synthesizer. Taylor (2009) gives a comprehensive overview of the different approaches addressing various aspects of unit-selection based speech synthesis. Our approach is based on the first paradigm, which is an implicit non-uniform unit selection.
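The minimization of the total cost in (2.1) is typically carried out with a Viterbi-style dynamic programming search over the lattice of candidates. A minimal sketch of such a search follows; the cost functions and candidates are toy placeholders, not those of any system discussed here:

```python
def unit_selection(targets, candidates_per_target, target_cost, join_cost,
                   w_tc=1.0, w_jc=1.0):
    """Viterbi search for the candidate sequence minimizing
    sum_i w_tc*TC(t_i, u_i) + sum_i w_jc*JC(u_{i-1}, u_i)."""
    n = len(targets)
    # best[i][k] = (accumulated cost, backpointer) for candidate k at position i
    best = [{} for _ in range(n)]
    for k, u in enumerate(candidates_per_target[0]):
        best[0][k] = (w_tc * target_cost(targets[0], u), None)
    for i in range(1, n):
        for k, u in enumerate(candidates_per_target[i]):
            tc = w_tc * target_cost(targets[i], u)
            # cheapest predecessor, accounting for the join cost
            cost, prev = min(
                (best[i - 1][j][0]
                 + w_jc * join_cost(candidates_per_target[i - 1][j], u), j)
                for j in best[i - 1])
            best[i][k] = (tc + cost, prev)
    # backtrack from the cheapest final state
    k = min(best[-1], key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = [k]
    for i in range(n - 1, 0, -1):
        k = best[i][k][1]
        path.append(k)
    path.reverse()
    return [candidates_per_target[i][k] for i, k in enumerate(path)], total
```

Note that making the join cost zero for naturally contiguous units, as described above, would simply be a property of the `join_cost` function passed in.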
Extending unit selection to audio-visual speech synthesis
In the majority of AV speech synthesis approaches, visual speech is synthesized over an available acoustic speech that is either synthesized or real. In the case of visual or audio-visual speech synthesis using unit selection, the selection of segments has to be done considering the requirements
in the target cost function, and also additional join criteria in the join cost function to account for the discontinuities related to the visual modality.
2.2 Segmentation
It is obvious that unit selection depends on a speech database. Segmentation is one of the steps of this database preparation, in which recorded speech is divided into phonetic segments by demarcating their temporal boundaries. These phonetic segments constitute the basic building blocks for synthesis. Speech segmentation without any other specifier conventionally refers to acoustic speech segmentation. Though the most accurate option is manual segmentation (Cosi et al., 1991; Ljolje and Riley, 1993; Ljolje et al., 1997), it is time-consuming, laborious and hence costly. For this reason, automatic speech segmentation is considered a good alternative. The most popular and widely used technique for automatic speech segmentation is to force an HMM-based phonetic speech recognizer to recognize the speech according to a given phonetic transcript. The demarcation of phonetic boundaries is a result of this forced recognition, which is conventionally called forced alignment. This alignment technique has avoided the need for manual alignment to some extent and is also considered good enough for the HMM training that is required in speech recognition. But segmentation needs to be more accurate for concatenative speech synthesis, especially for systems which are based on concatenation at phoneme boundaries. Consequently, various methods have been used for the further refinement of the phonetic segment boundaries (Toledano et al., 2003). Some of the recent works use a combination of segmentation methods to derive multiple time marks to arrive at a more accurate segmentation (Kominek and Black, 2004; Park and Kim, 2007).
For concatenative visual or AV speech synthesis, generally the boundary time-marks determined by the acoustic speech segmentation of an audio-visual corpus are used while defining the candidates in the corpus (Bregler et al., 1997; Hallgren and Lyberg, 1998; E. Cosatto et al., 2000). This way of segmentation is widely followed and has practically been shown to work for visual speech synthesis. Nevertheless, this is not in accordance with the underlying principle of speech production. The speech articulators have to be ready with the target configurations required for the production of a sound (phone) for it to happen. That is, the start and end in the visual and acoustic modalities may not necessarily be the same. Some works have tried to learn this time lag between the acoustic and visual modalities by adding phasing models (Govokhina et al., 2007; Bailly et al., 2009). These phasing models are arrived at through an iterative process involving HMM learning, forced alignment of trajectories of articulatory gestures, and comparison with the acoustic
works through recognition of the speech segment; it provides an interesting tool to study the unique characteristics of phonemes. We exploit this idea to characterize phonemes (Chapter 4).
2.3 Target cost function
Measuring the suitability of a candidate in the corpus for a target position in the speech to be synthesized is a necessary step in unit selection. The efficiency of a target cost function in ranking and pre-selecting candidates also affects the probability of a good join and thus the quality of the synthesized speech. Generally, the target and the candidate are defined in terms of factors which are known to account for the variation in speech realization, based on phonetic and linguistic studies. These factors are at an abstract level and are not directly expressible quantitatively in terms of the actual speech parameters. They are referred to as high-level features. These features can take either non-negative integer values or can be categorical. They might include:
• Phonetic features like the phonemic identity of the current unit and the neighboring units (context), type of phoneme (vowel, consonant), voicing of phoneme (voiced, unvoiced), manner of articulation, etc.

• Linguistic features like the position of a syllable at various levels (word, rhythm group, sentence, etc.); the position of a word in a rhythm group or sentence; the type of sentence, etc. These features generally account for the various suprasegmental prosodic patterns. Some of the features in this category might be language specific.
The target feature set can also include features that are based on the statistical analysis of speech-related parameters extracted from the corpus, which are referred to as low-level features. For example, some systems use prosody prediction models that mainly provide the duration and pitch specification of the segments to be selected. These prosody prediction models are trained on a real corpus. This helps in reducing the number of high-level target features needed to describe prosody (Latacz et al., 2010). The low-level target features are also used to speed up the pre-selection by reducing the search space (Black and Taylor, 1997).
Many systems use a target feature set which consists mostly of high-level features (Hunt and Black, 1996; Coorman et al., 2000; Latacz et al., 2010). Some systems use high-level target features exclusively, to allow the automatic selection of candidates with suitable prosodic characteristics rather than prediction based on prosodic models (Prudon and d'Alessandro, 2001;
individual feature costs. Three kinds of target feature costs have been generally used (Coorman et al., 2000):

1. Categorical distance measures: where the distance is either a binary-valued or non-negative integer-valued function between categorical features.

2. Scalar distance measures: non-negative real-valued functions for features like duration, F0, etc.

3. Vector distance measures: distance calculations for multi-dimensional features, like the acoustic and visual feature vectors.
Categorical distance measures are calculated for the high-level target features, while the other two are based on the low-level features. For AV speech synthesis, the set of target features has to be augmented to include information regarding speech realization in the visual modality. Besides the target feature description, the weighting of the features of a given target set in the order of their relative importance is crucial for selection. These aspects are presented in the following two sections. Besides the conventional target cost, alternatives have been proposed, which we review in subsection 2.3.3.
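As an illustration of how such individual feature costs combine into a single weighted target cost, a minimal sketch follows; the feature set, distance functions and weight values are invented for illustration and do not correspond to any particular system:

```python
import math

def categorical_cost(a, b):
    """Binary-valued distance between categorical features."""
    return 0.0 if a == b else 1.0

def scalar_cost(a, b):
    """Non-negative real-valued distance for features like duration or F0."""
    return abs(a - b)

def vector_cost(a, b):
    """Euclidean distance for multi-dimensional (e.g. visual) feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def target_cost(target, candidate, weights):
    """Weighted sum of the individual feature costs."""
    cost = 0.0
    cost += weights["phoneme"] * categorical_cost(target["phoneme"], candidate["phoneme"])
    cost += weights["duration"] * scalar_cost(target["duration"], candidate["duration"])
    cost += weights["visual"] * vector_cost(target["visual"], candidate["visual"])
    return cost
```

The choice of the weights in this sum is exactly the tuning problem addressed in subsection 2.3.2.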
2.3.1 Visual target features
For visual speech synthesis, many of the high-level target features used are those which describe the visual or audio-visual target. These features might include typical articulatory characteristics like lip closures in bilabials. They might also include rate-of-speech related characteristics. Besides features which are equally important for visual and acoustic speech realization (e.g., place of articulation), or those which account more for the acoustic realization (e.g., voicing), there are some features which are more important for describing a visual target (e.g., the shape of the lips during the articulation of a phoneme). Many of the concatenative AV speech synthesis systems use a visual target cost based on the similarity of two phonemes in terms of visible facial deformations, as described below.
In (Bregler et al., 1997), a categorical phoneme context distance is used for the selection of the triphone, which accounts for the visual target cost. Phonemes with the same label are assigned a cost of 0, phonemes belonging to two different viseme classes are assigned 1, and different phonemes of the same viseme class are assigned a cost between 0 and 1, derived from the confusion matrices described in (Owens and Blazek, 1985).
In (E. Cosatto et al., 2000), a viseme distance matrix is used for the calculation of the target cost
domain, irrespective of the differences in the acoustic domain. The selection of the visual segment is based on the duration and phonetic label of the target segment, which are obtained from the acoustic speech. Each target frame is specified in terms of the phonetic annotation of a window of frame sequences consisting of some fixed number of frames, including itself, to account for context. The window length is different for each phoneme. The candidate selected is the one with the most proximate context, as measured by the target cost. The target cost weight vector is based on the exponentially decaying influence inspired by (Cohen and Massaro, 1993). Weissenfeld et al. (2005) use a similar visual target cost, where the difference matrix is populated using the Euclidean distance in the visual feature space. It is based on the assumption that each phoneme can be described by its mean visual feature vector, which is speaker and corpus specific. In (Mattheyses et al., 2010), a similar visual target cost calculated from the corpus is included. The difference matrix that is calculated represents the inter-phoneme visual distances based on the mean and variance of the visual parameters at the middle of the phoneme units present in the corpus. These kinds of cost functions, which are calculated for a specific corpus, do not guarantee optimum performance for any other corpus in general.
2.3.2 Target feature weighting
Target cost tuning involves the determination of the relative importance of the target features and assigning weights to the individual target feature costs to be used for target cost calculation. Ideally, it is done in such a way that the ordering of candidates based on the target cost corresponds to their perceptual suitability as a target. Since the synthesized speech has to be at least acceptable, intelligible and near-natural for human listeners, some system tuning techniques are based on human listening tests (Coorman et al., 2000; Alías et al., 2004). Listening tests are time-consuming and require human subjects, which makes them practically costly. Moreover, the scope of this kind of tuning is limited to a small set of sentences, and hence it cannot guarantee consistent synthesis results. It becomes further difficult when the set of target features is large. Hence, automatic weight tuning has been applied in many of the works (Hunt and Black, 1996; Meron and Hirose, 1999; Park et al., 2003; Alías and Llorà, 2003; Colotte and Beaufort, 2005; Latacz et al., 2010).
The target feature weighting techniques can be divided into two categories: (1) joint weight tuning of the concatenation and target feature cost functions, either at the individual unit level by using pairs of synthesis units or at the sentence level; (2) separate weight tuning of the target and concatenation cost functions, generally by tuning the target feature costs at the
included for selection is treated as the target, and selected or synthesized from the corpus. The target and the selected units are compared using objective distance measures to perform the tuning.
One of the two techniques presented by Hunt and Black (1996), called `weight space search' (WSS), belongs to the first category of weight tuning. It is based on the usage of targets from real sentences held out from the synthesis database for training. The weight tuning is done by searching the weight space in such a way that the waveforms of the synthesized sentences and those of the real sentences are similar. The weight space search is limited to a finite set of weight combinations, and the best weights among the searched combinations are chosen for defining the target cost function. This method is computationally very expensive in the case of a large number of features and a large set of possible target feature cost values. Meron and Hirose (1999) presented acceleration techniques for WSS based on partial synthesis and comparison. Alías and Llorà (2003) performed target tuning by using a genetic algorithm for the weight space search. The advantage of this is that the search space is randomized and the search evolves towards better weight combinations, unlike in the former works, where a fixed finite set of combinations was searched. Latacz et al. (2010) also present an automatic weighting technique for tuning target features and concatenation costs together. In their technique, the ordering given by the weighted sum of target cost and concatenation cost, and the ordering given by an acoustic distance metric, are compared. A selection error is calculated based on the mismatch in this ordering. They refer to this technique as Minimum Selection Error training. Further, they propose that the sets of weights obtained for all the candidates treated as targets be clustered using decision trees.
One of the techniques which performs target feature weighting separately from concatenation cost weighting is based on multiple linear regression (Hunt and Black, 1996). Using this method, the target feature weights for each phoneme in a language's phoneme set are tuned separately, to come up with different target costs for different phonemes. Each of the candidates in the database is considered as a target in turn, and the n most similar candidates are selected from the phoneme's candidate set, leaving the target out. The ordering of candidates for the pre-selection of the n candidates is based on an objective distance measure. The target weights are determined using linear regression such that the target cost predicts the objective distance measure. Meron and Hirose (1999) presented a way to extend this regression training (RT) to weight the target features and concatenation costs together, using target pairs instead of single targets. They also propose clustering of phonetic contexts by using a decision tree to split the phoneme pairs into different clusters. This is done with a phonetic contextual question which
Each target feature accounts for variations in speech and their duration. Based on the discriminative information accounted for by each of the features, they have been weighted in Colotte and Beaufort (2005). The acoustic representations of the units of a particular phoneme were divided into clusters through the K-means algorithm, using the Kullback-Leibler divergence as the similarity index. The weight of a feature is based on its discriminative information between the different clusters. This is applied to all the phonemes in the phoneme set of the language separately. Another approach to weight tuning is to view unit selection as a classification problem (Park et al., 2003), in which, instead of defining an objective function to account for the subjective speech quality, the classification error is taken as the objective function to be optimized. It is difficult to compare these methods in terms of their synthesis results. There are many factors which vary in these approaches, like the speech corpus, test sentences, evaluation methodologies, etc. Hence, it is not straightforward to judge their relative performance.
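The regression-training idea discussed in this section can be sketched as an ordinary least-squares problem: each row of the design matrix holds the individual feature costs for one target/candidate pair, and the regressand is the objective distance measure the target cost should predict. The data and shapes here are synthetic placeholders, not the setup of any cited system:

```python
import numpy as np

def fit_target_weights(feature_costs, objective_distances):
    """Least-squares fit of weights w such that feature_costs @ w
    approximates the objective distances, in the spirit of
    regression-based target feature weight training."""
    X = np.asarray(feature_costs, dtype=float)
    y = np.asarray(objective_distances, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```

In a per-phoneme setup, such a fit would be repeated for every phoneme's candidate set separately.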
2.3.3 Alternatives to the conventional target cost function
The target cost put forth by (Hunt and Black, 1996) was a weighted sum of individual feature costs (differences). Whenever a candidate with the exact target feature description is not available, the candidate selected for synthesis based on this simple formulation for measuring target-candidate similarity, or rather dissimilarity, might not always reflect actual human perception. The following two cases need a little more consideration: (1) where a candidate with the required exact feature description is not available, but a candidate with a speech realization similar to the required one, yet with a different feature description, is available; (2) where neither a candidate with the exact feature description nor one with a similar speech realization is available, in which case the best possible alternative(s) have to be selected. To consider the speech realization of candidates besides the target combination alone, alternative approaches for target cost calculation have been proposed which base the selection on the perceptual similarity estimated through acoustic distances (Taylor, 2006). The main idea behind the proposed method is to obtain a representation of the segment to be selected in terms of the low-level features by using the high-level features. This was done by clustering the candidates of a particular phoneme using acoustic distances and using decision trees to choose a cluster for unit selection (Taylor, 2006).
2.4 Concatenation cost function
It is known that acoustic speech quality degrades due to the concatenation of non-contiguous speech segments. Also, studies have shown that considering the spectral smoothness at the
et al., 1992). This holds for visual speech as well. Hence, any abrupt jump in the visual speech sequence can create perceptual discomfort and confusion. Consequently, the focus on the reduction of concatenation artifacts arguably dates back to the onset of concatenative speech synthesis itself. Especially in unit-selection based speech synthesis, there is wide variability in the candidates for each required target. This results in a large variance at the concatenation points as well, such as in the middle of a phone when the diphone is the synthesis unit. Good concatenation is important not only for good synthesis quality, but also for intelligibility (Clark et al., 2007).

While designing good concatenation strategies for unit selection, different approaches have been followed. Candidate preference for concatenation is based on the observation that naturally contiguous units automatically join well. Hence, all systems give preference to contiguous units in the corpus, besides considering important phonetic and prosodic characteristics. In fact, some systems go further and search for the longest possible units from the corpus, so as to reduce the number of concatenation points (Schweitzer et al., 2003). Since it is infeasible to have naturally contiguous speech in the corpus for every target sequence to be synthesized, various join optimization techniques have been developed.
The most widely followed approach for concatenation is to minimize the differences at the concatenation points. This strategy is based on the observation that huge differences in the waveforms at the concatenation points account for perceptible degradation. Various distance metrics calculated using various acoustic parameters have been explored for estimating the perceptual degradation due to joins. Cepstra, line spectral frequencies, log area ratios, mel-frequency cepstral coefficients, multiple centroid analysis (MCA) coefficients and linear predictive coding coefficients are a few of them. Euclidean, absolute, Kullback-Leibler and Mahalanobis are some of the distance measures explored. Given these many alternatives, it becomes necessary to base the join difference estimation on those measures that correlate well with human perception. Hence, there have been many attempts to evaluate the parameter and distance measure combinations, to rank them based on their correlation with the human perception of join discontinuity. Some of these works ask listeners to evaluate joins on a 5-point MOS scale and compare these scores with the distances calculated using various metrics and acoustic parameters (Wouters and Macon, 1998; Vepa et al., 2002, 2004; Donovan, 2001; Bellegarda, 2004). In some other works, the comparison between human perception and distance metrics is based on the detection of a join, i.e. a binary score (Klabbers and Veldhuis, 1998, 2001; Stylianou and Syrdal, 2001; Pantazis et al., 2005). The results presented in the various works do not agree much with each other. The Kullback-Leibler divergence has been reported to perform well with different parameters in some of the
reported between the objective distance measures and the perceptual evaluation results is 0.66, which has been deemed low. Hence, the choice of any particular speech parameterization and distance measure does not ensure an accurate estimate of the perceptual disruption at the join.
While trying to reduce the join disruption due to concatenation, naturally contiguous units can be used to determine the set of units which can naturally join well. This can be based on their proximity to naturally good joins, i.e., contiguous units in the corpus. The work done by Vepa and King (2003) can be considered to be in this direction. In their work, the natural evolution patterns in the acoustic parameters are learned from the corpus, and used as the basis for the evaluation of a join and for defining a join cost function. Naturally contiguous speech samples are never perceived as discontinuous, though they are seldom exactly the same. From this observation, it can be concluded that humans are insensitive to a slight disruption at the concatenation point. This has been used as a basis for the formulation of the evaluation of joins by Coorman et al. (2000). They describe a masking function to evaluate a join. Consequently, below a certain transparency threshold the join cost is zero.
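A join cost combining a spectral distance with such a masking threshold might look as follows; the parameter vectors and the threshold value are assumptions for illustration, not the settings of any cited system:

```python
import math

def spectral_distance(frame_a, frame_b):
    """Euclidean distance between the parameter vectors (e.g. cepstra)
    of the two frames adjacent to a concatenation point."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)))

def join_cost(frame_a, frame_b, transparency=0.1):
    """Join cost with a masking threshold: discontinuities small enough
    to be imperceptible contribute nothing (threshold is illustrative)."""
    d = spectral_distance(frame_a, frame_b)
    return 0.0 if d <= transparency else d
```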
Irrespective of the distance between two concatenation points, it has been observed that join disruption is not perceived uniformly across all phonetic contexts. In other words, the perceptual degradation of speech is higher in some phonetic units and contexts than in others. Syrdal (2001, 2005) reports a systematic study of human sensitivity to disruption in various contexts; a summary of the results presented is as follows: discontinuities are perceived more with female-voice based speech synthesis than with male-voice based speech synthesis, more in vowels than in consonants, more in diphthongs than in other vowels, and more in sonorant phonemes than in non-sonorants. They also reported a comprehensive list of join discontinuity detection rates (%) based on the phoneme type. This shows that phonemic context is important, and that concatenation in certain contexts or phonemes is less preferable than in others; hence phoneme-independent handling of concatenation strategies might not be the best.
Concatenation of audio-visual units
All the salient points considered for acoustic unit concatenation are equally applicable to visual or audio-visual unit concatenation. Here, the way the distances are calculated for units at concatenation points depends on the visual features. For example, in (Bregler et al., 1997), a distance to measure the difference in lip shapes in the overlapping segments of adjacent triphones is included to account for the concatenation cost. It is calculated as the Euclidean distance (frame-by-frame) between four-element feature vectors of articulatory features: outer-lip-width,
decided based on the place of least difference in the lip shapes. In (E. Cosatto et al., 2000), the visual concatenation cost has two components: the skip cost and the transition cost. The skip cost is a penalty for any two frames which are not contiguous in the corpus, calculated based on the ordering of frames in the corpus; it is 0 for any two naturally contiguous units or frames. The transition cost is calculated based on the visual distance between the two frames. It is calculated as the Euclidean distance of the two PCA feature vectors extracted based on appearance. Similarly, in (Ma et al., 2006), two frames are given zero concatenation cost when they are contiguous in the original corpus; for those frames which are not contiguous, it is calculated as the sum of a minimum constant value and a variable component calculated from the frames. The variable component in turn has two components, one of which is calculated based on the distance between the two frames. The second component of this variable concatenation cost ensures that the visemic transitions in the synthesized speech and the original corpus are the same. For example, two frames i and j can be concatenated if the preceding frame of j belongs to the same visemic label as that of i. The trajectories at the joins are made smooth by applying a low-pass filter and cubic splines. In (Fagel, 2006), the video join cost calculation is based on the pixel-to-pixel color differences in the border frames of the segments to be concatenated (computationally expensive).
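The skip-cost plus transition-cost scheme can be sketched as follows; the frame representation (corpus index plus PCA-like vector) and the penalty value are illustrative assumptions:

```python
import math

def visual_concat_cost(frame_i, frame_j, skip_penalty=1.0):
    """Skip cost plus transition cost. Frames are (corpus_index, feature_vector)
    pairs; naturally contiguous frames cost nothing, otherwise a constant
    penalty plus the Euclidean distance of the appearance vectors is charged."""
    idx_i, feat_i = frame_i
    idx_j, feat_j = frame_j
    if idx_j == idx_i + 1:  # naturally contiguous in the corpus
        return 0.0
    transition = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_i, feat_j)))
    return skip_penalty + transition
```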
2.5 Evaluation
We have considered various aspects of unit-selection based speech synthesis. In this section, we present the ways of evaluating synthesized speech. This is necessary for exploring different approaches to improve synthesis quality, in which case changes need to be quantified, and for the comparative evaluation of different synthesis systems. These changes can be related to selection, concatenation and overall system tuning. As synthesized speech is targeted at human perception, the most accurate way to evaluate synthesized speech is perceptual evaluation by human subjects. In spite of its accuracy, automatic evaluation is often done instead, by comparing synthesized speech with a reference speech. This reference is generally recorded real speech which is not included in the corpus. This comparison is quantified using objective evaluation metrics. In the following, we present the objective evaluation metrics and then the perceptual evaluation by human subjects. The evaluation of synthesized speech by human subjects is done in two
2.5.1 Objective automatic evaluation of acoustic and audio-visual speech
Various distance measures have been proposed for comparing real and synthesized speech signals. For example, the cepstral distance is used as a distance measure in many works for acoustic speech (Hunt and Black, 1996; Meron and Hirose, 1999; Alías and Llorà, 2003). (Latacz et al., 2010) used constituent distance measures for duration, F0 and spectrum. Objective evaluation of audio-visual speech is generally done based on an independent objective evaluation of the visual and acoustic modalities. Alternatively, the objective evaluation of only one modality is sometimes performed, based on the focus of the analysis. For instance, in (Huang et al., 2002) only the synthesized visual speech is evaluated. It was done using three objective evaluation metrics. These were developed for estimating the precision (naturalness) and smoothness of visual speech, and the synchronization between the acoustic and visual modalities. Firstly, precision was estimated using the sum of the Euclidean distances between the real and synthesized sentences, calculated on visual parameters. Secondly, smoothness was estimated using the sum of the Euclidean distances calculated between adjacent frames in the synthesized speech which come from non-contiguous locations in the corpus. Lastly, audio-visual synchronization was estimated based on the phonetic labels of the synthesized frames. For this, only a few important phonemes were considered, belonging to one of the following two categories. The first category was of those phonemes which have a change in the direction of the mouth movement, i.e., from closing to opening or vice versa. The second category included those phonemes which have maximal mouth shapes, like open or closed mouths. Similarly, the Euclidean distance measure has been used by others (Weissenfeld et al., 2005).
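The precision and smoothness metrics described above can be sketched as follows; the visual parameter vectors and corpus indices are placeholders for illustration:

```python
import math

def euclid(a, b):
    """Euclidean distance between two visual parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def precision_error(real_frames, synth_frames):
    """Sum of frame-wise Euclidean distances between the real and
    synthesized visual parameter trajectories."""
    return sum(euclid(r, s) for r, s in zip(real_frames, synth_frames))

def smoothness_error(synth_frames, corpus_indices):
    """Sum of Euclidean distances between adjacent synthesized frames
    that come from non-contiguous locations in the corpus."""
    total = 0.0
    for k in range(1, len(synth_frames)):
        if corpus_indices[k] != corpus_indices[k - 1] + 1:
            total += euclid(synth_frames[k], synth_frames[k - 1])
    return total
```

Lower values mean closer-to-real and smoother trajectories, respectively; both are error measures, not similarity scores.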
Instead of comparing real and synthesized speech, Liu and Ostermann (2009) use the average target cost, the average segment length and the average visual difference between frames as the objective evaluation metrics, and minimize them during total cost tuning. This is based on the assumption that the average target cost is representative of the lip synchronization (audio-visual synchronization), and that the other two metrics represent the smoothness of the speech animation. But finally, for evaluating the weights resulting from the tuning process, the cross-correlation coefficient between the PCA coefficients of the synthesized and real sentences was calculated, to represent the subjective quality of the synthesized visual speech. Similarly, (Bailly et al., 2009) report the comparison of different articulatory gesture prediction techniques using the correlation coefficient between the original and predicted gestures. For the objective evaluation of the synthesized visual speech, Ma et al. (2006) use the average errors of the normalized articulatory parameters (lip height, lip width, lip protrusion) between the original and synthesized speech. Though these techniques