
Chapter 3 Acoustic-Visual Speech Synthesis System: An Overview

3.1.3 Data processing and parameter extraction

The sampling rate of the acquired 3D marker data was around 188 Hz. There was a slight
variance in the sampling rate across sentences. A set of sentences was recorded in different
sessions with short pauses between successive sessions. This variance in the acquired data is
due to a slight variable lag between the time instant at which the images were captured and
the instant at which they were sent to the computer for storage. The data was filtered using
a low-pass filter with a cut-off frequency of 25 Hz. Such processing removes additive noise
from the visual trajectories without suppressing important positional information.
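The filtering step described above can be sketched as follows. The text does not say which low-pass design was used, so this example assumes a Hamming-windowed sinc FIR filter built with NumPy; the function names, the number of taps and the edge padding are illustrative choices, not the author's implementation. Only the 188 Hz sampling rate and the 25 Hz cut-off come from the text.

```python
import numpy as np

def lowpass_fir(fs, cutoff, num_taps=101):
    """Hamming-windowed sinc low-pass kernel (odd length, unity gain at DC)."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2.0 * cutoff / fs * n) * np.hamming(num_taps)
    return h / h.sum()

def lowpass_trajectory(x, fs=188.0, cutoff=25.0, num_taps=101):
    """Filter one marker coordinate trajectory without shifting it in time.

    The kernel is symmetric and centered, so the output stays aligned with
    the input. Edges are padded by repeating the first and last samples
    (an assumption made here to keep the output length unchanged).
    """
    x = np.asarray(x, dtype=float)
    h = lowpass_fir(fs, cutoff, num_taps)
    pad = num_taps // 2
    padded = np.pad(x, pad, mode="edge")
    return np.convolve(padded, h, mode="valid")

# Toy trajectory: a slow 3 Hz movement plus 60 Hz additive noise.
fs = 188.0
t = np.arange(0, 2, 1 / fs)
clean = np.sin(2 * np.pi * 3 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60 * t)
smoothed = lowpass_trajectory(noisy, fs=fs, cutoff=25.0)
print(np.max(np.abs(smoothed[100:-100] - clean[100:-100])))
```

Since the sampling rate varied slightly across sentences, `fs` would in practice be set per sentence rather than fixed at 188 Hz.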

Principal Component Analysis (PCA) was applied on a subset of markers of the lower part of
the face (jaw, lips, and cheeks; see Fig. 3.2). The reason for this choice was that the movements
of markers on the lower part of the face are tightly connected to speech gestures. Markers on
the upper part of the face either do not move, or their movements are of no direct relevance to
speech. This can be said because the speech is recorded with a neutral voice with no strong
prosodic effects. We have not used any guided PCA as it does not provide a significant advantage.
Besides, the projection onto principal components and the reconstruction are straightforward and
fast. This unified approach keeps it simple and straightforward for the synthesis purpose. The
facial deformations when each of the principal components is set at −3 and 3 z-scores are shown
in Figure 3.3. The first two components account for 79.6% of facial speech data variance. It is
difficult to draw definite conclusions about the influence of each principal component on facial [...]

tel-00927121, version 1 - 1 1 Jan 2014

Figure 3.2: PCA is applied on 178 (plotted as blue circles) out of 252 painted markers.

terms of the perceived facial deformations. Broadly, the following observations can be made by
looking at the visual speech animation while varying a single principal component. The first two
principal components mainly account for combined jaw opening/closing and lip protrusion gestures.
The third component accounts for lip opening, after removal of the jaw contribution. Some of
the components, though related to speech, are augmented by some gestures that are specific to
the speaker's facial expressions. This seems to be the case for components 4 and 5. They seem
to capture lip spreading. However, due to some asymmetry in our speaker's articulation, lip
spreading is divided into two modes: one accounting for spreading toward the left side of the
lips and one for spreading toward the right side. Component 6 is a smiling gesture; however, it is
difficult to classify it as belonging to speech articulation or pure facial expression. Components
7 to 12 seem to account for very subtle lip deformations, which we believe are idiosyncratic
characteristics of our speaker.
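The projection, reconstruction and "component set at ±3 z-scores" operations mentioned above can be illustrated with a minimal NumPy sketch. It assumes plain (unguided) PCA computed by SVD on a matrix with one row per frame and one column per marker coordinate; the function names and the synthetic data are hypothetical and stand in for the real marker recordings.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA on X (n_frames x n_features) via SVD of the centered data."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    variances = s ** 2 / (X.shape[0] - 1)   # variance along each component
    ratios = variances / variances.sum()    # fraction of the total variance
    return mean, vt[:n_components], variances[:n_components], ratios[:n_components]

def project(X, mean, components):
    """Marker frames -> PCA coefficient trajectories (n_frames x n_components)."""
    return (X - mean) @ components.T

def reconstruct(scores, mean, components):
    """PCA coefficients -> approximate marker coordinates."""
    return scores @ components + mean

def deformation(mean, components, variances, k, z):
    """Shape obtained with component k at z standard deviations, others at 0."""
    return mean + z * np.sqrt(variances[k]) * components[k]

# Synthetic stand-in for marker data: 500 frames, 10 markers x 3 coordinates,
# generated from 4 latent gestures plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 30))
X = X + 0.01 * rng.normal(size=(500, 30))

mean, comps, var, ratio = fit_pca(X, n_components=4)
scores = project(X, mean, comps)            # trajectories to be synthesized
X_hat = reconstruct(scores, mean, comps)    # fast linear reconstruction
low = deformation(mean, comps, var, k=0, z=-3.0)
high = deformation(mean, comps, var, k=0, z=3.0)
print(ratio.sum(), np.abs(X - X_hat).max())
```

Both projection and reconstruction are single matrix products, which is consistent with the remark that they are straightforward and fast.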

Several experiments indicated that retaining as few as three components could lead to an
animation which would be acceptable, in the sense that it would capture the basic speech gestures
and would filter out almost all the speaker-specific gestures. However, such an animation would
lack some naturalness, which is mostly captured by secondary components. We are also in favor
of keeping the specificity of the speaker-specific gestures. Retaining 12 components leads to
animations that are natural enough for all purposes. One of the goals of our proposed system
is to synthesize trajectories corresponding to the PCA-reduced visual information, for these [...]


[Figure 3.3 panels, each showing the face at −3σ and +3σ: PC 1 (57.75%), PC 2 (21.93%), PC 3 (6.46%), PC 4 (2.27%), PC 5 (1.55%), PC 6 (1.07%), PC 7 (0.93%), PC 8 (0.56%), PC 9 (0.44%), PC 10 (0.38%), PC 11 (0.33%), PC 12 (0.32%)]

Figure 3.3: Facial deformations when each of the principal components is set at −3 and 3 z-scores.
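As a quick arithmetic check, the per-component percentages printed with the Figure 3.3 panels can be accumulated to see how they relate to the figures quoted in the text (79.6% for the first two components, about 94% for the first twelve). The list below simply copies those printed values; the helper name is illustrative.

```python
# Per-component variance (%) as printed in the Figure 3.3 panel titles.
pc_variance = [57.75, 21.93, 6.46, 2.27, 1.55, 1.07,
               0.93, 0.56, 0.44, 0.38, 0.33, 0.32]

def cumulative(values):
    """Running total of the variance percentages, rounded to 2 decimals."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(round(total, 2))
    return out

cum = cumulative(pc_variance)
print(cum[1])    # first two components  -> 79.68
print(cum[-1])   # first twelve components -> 93.99
```

The two totals agree, up to rounding, with the values stated in the running text.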


Figure 3.4: Calculation of labial features is done using the 4 points on the face: A, B, C and D. Lip opening and lip spread are given by the distances $\|\overrightarrow{CD}\|$ and $\|\overrightarrow{AB}\|$. Lip protrusion is given by the displacement of O, the center of gravity of the four points (A, B, C, D), along the normal vector ($\overrightarrow{OF_p}$) to the plane formed by vectors $\overrightarrow{AB}$ and $\overrightarrow{CD}$. Jaw opening is calculated as the distance between the center of the chin and a fixed point on the head.

information can be reconstructed using these 12 trajectories. The mean values of the positions of
the markers at the upper part of the face may then be added to complete the face visualization.
Hence, the first 12 principal components, which explain about 94% of the variance of the
lower part of the face, are retained for storage and reconstruction at runtime. Besides the 12
PCA coefficients, four articulatory parameters (lip protrusion, lip opening, lip spread and jaw
opening) are calculated as explained in Figure 3.4 (Robert et al., 2005). These articulatory
features are used for the analysis of the visual speech corpus and implicitly during selection,
as visual target costs are designed based on these features.
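The four articulatory parameters can be computed from marker positions roughly as follows. This is a sketch of the geometry described in the Figure 3.4 caption, not the author's code: the function and argument names are hypothetical, and it assumes that protrusion is measured as the displacement of O from a reference (rest) position `O_rest`, which the text does not specify.

```python
import numpy as np

def labial_features(A, B, C, D, O_rest, chin, head_ref):
    """Lip spread, lip opening, lip protrusion and jaw opening from 3D points.

    A, B, C, D: the four lip points of Figure 3.4.
    O_rest:     assumed reference position of the center of gravity O.
    chin, head_ref: center of the chin and a fixed point on the head.
    """
    A, B, C, D = (np.asarray(p, dtype=float) for p in (A, B, C, D))
    AB, CD = B - A, D - C
    lip_spread = float(np.linalg.norm(AB))     # ||AB||
    lip_opening = float(np.linalg.norm(CD))    # ||CD||

    O = (A + B + C + D) / 4.0                  # center of gravity of A, B, C, D
    normal = np.cross(AB, CD)                  # normal to the plane of AB and CD
    normal = normal / np.linalg.norm(normal)
    # Signed displacement of O along the normal; the sign depends on the
    # orientation chosen for AB and CD.
    protrusion = float(np.dot(O - np.asarray(O_rest, dtype=float), normal))

    jaw_opening = float(np.linalg.norm(
        np.asarray(chin, dtype=float) - np.asarray(head_ref, dtype=float)))
    return {"lip_spread": lip_spread, "lip_opening": lip_opening,
            "lip_protrusion": protrusion, "jaw_opening": jaw_opening}

# Toy frame: lips lie in the plane z = 0.5, rest position of O at the origin.
feats = labial_features(A=(-2, 0, 0.5), B=(2, 0, 0.5),
                        C=(0, 1, 0.5), D=(0, -1, 0.5),
                        O_rest=(0, 0, 0), chin=(0, -5, 0), head_ref=(0, 7, 0))
print(feats)
```

Applied frame by frame, this yields four extra trajectories alongside the 12 PCA coefficients.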

The acoustic speech parameters extracted included the LPC (linear predictive coding)
coefficients, f0, and energy.

3.1.4 Segmentation

We perform segmentation based on the forced alignment of acoustic speech. These predicted
segment boundaries are considered as the synchronous bimodal segment boundaries, and are chosen
to represent speech segments in the corpus. The synthesis unit of target search and synthesis
is the diphone. Besides making the storage and indexing of bimodal speech segments extremely
simple, it reinforces the principal idea of keeping synchronous, inseparable bimodal speech intact. A
diphone extends from the middle of one phone to the middle of the next phone. The middle of
the phone is a relatively stationary region. Hence, by using the diphone as the synthesis unit, the
acoustic artifacts due to any segmentation errors are reduced. Diphone units also account for [...]


The diphone as a synthesis unit is reported to produce comparatively good quality speech (Moulines
and Charpentier, 1990). The segmentation based on speech acoustics and the annotation of data
were done using scripts developed by Colotte (2009). The monophone HMMs which are used
by these scripts are trained on a very large acoustic speech corpus and provide highly accurate
segmentation.
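The diphone definition used here (from the middle of one phone to the middle of the next) can be made concrete with a short sketch. It assumes the forced alignment is available as a list of `(label, start, end)` tuples in seconds; this representation, the labels and the function name are illustrative assumptions, not the format of the scripts cited above.

```python
def diphone_segments(phones):
    """Turn phone boundaries into diphone boundaries.

    phones: list of (label, start_s, end_s) tuples in temporal order.
    Returns a list of (diphone_label, start_s, end_s), where each diphone
    runs from the midpoint of one phone to the midpoint of the next.
    """
    diphones = []
    for (l1, s1, e1), (l2, s2, e2) in zip(phones, phones[1:]):
        diphones.append((f"{l1}-{l2}", (s1 + e1) / 2.0, (s2 + e2) / 2.0))
    return diphones

# Hypothetical alignment of a short utterance ("#" marks silence).
alignment = [("#", 0.00, 0.10), ("b", 0.10, 0.18), ("a", 0.18, 0.30)]
for label, start, end in diphone_segments(alignment):
    print(label, round(start, 3), round(end, 3))
```

Because the same time boundaries are applied to the acoustic signal and to the visual trajectories, each stored unit stays synchronous across both modalities.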

From the document: Synthèse Acoustico-Visuelle de la Parole par Séléction d'Unités Bimodales ~ Association Francophone de la Communication Parlée (pages 42-46)