Chapter 3. Acoustic-Visual Speech Synthesis System: An Overview
3.1.3 Data processing and parameter extraction
The sampling rate of the acquired 3D marker data was around 188 Hz, with a slight
variance across sentences. A set of sentences was recorded in different sessions,
with short pauses between successive sessions. The variance in the acquired data is
due to a slight, variable lag between the time instant at which the images were
captured and the instant at which they were sent to the computer for storage. The
data was filtered using a low-pass filter with a cut-off frequency of 25 Hz. Such
processing removes additive noise from the visual trajectories without suppressing
important positional information.
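As an illustration of the smoothing step, the following is a minimal sketch of a 25 Hz low-pass filter applied to one marker coordinate. The text does not state which filter design was used, so the Hamming-windowed sinc FIR design, the 101-tap length, the nominal 188 Hz rate and the function names `lowpass_kernel` and `filter_trajectory` are all assumptions made for this example.

```python
import math

def lowpass_kernel(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc low-pass FIR kernel, normalised to unit DC gain."""
    fc = cutoff_hz / fs_hz            # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            sinc = 2 * fc
        else:
            sinc = math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def filter_trajectory(samples, taps):
    """Centred convolution with a symmetric kernel (no phase shift).

    The ends are padded by repeating the first and last sample.
    """
    half = len(taps) // 2
    padded = [samples[0]] * half + list(samples) + [samples[-1]] * half
    return [sum(t * padded[i + j] for j, t in enumerate(taps))
            for i in range(len(samples))]

# Assumed nominal rate; the actual rate varied slightly across sentences.
FS_HZ = 188.0
taps = lowpass_kernel(25.0, FS_HZ)
```

Each coordinate of each marker would be passed through `filter_trajectory` independently. Because the kernel is symmetric and centred, the filtered trajectory stays time-aligned with the acoustic signal, which matters for a bimodal corpus.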
Principal Component Analysis (PCA) was applied on a subset of markers of the lower part
of the face (jaw, lips, and cheeks; see Fig. 3.2). The reason for this choice was that
the movements of markers on the lower part of the face are tightly connected to speech
gestures. Markers on the upper part of the face either do not move, or their movements
are of no direct relevance to speech. This can be said because the speech was recorded
with a neutral voice with no strong prosodic effects. We have not used any guided PCA,
as it does not provide a significant advantage. Besides, the projection onto principal
components and the reconstruction are straightforward and fast. This unified approach
keeps things simple and straightforward for the synthesis purpose. The facial
deformations when each of the principal components is set at $-3$ and $3$ z-scores are
shown in figure 3.3. The first two components account for 79.6% of the variance of the
facial speech data. It is difficult to draw definite conclusions about the influence of
each principal component on facial
tel-00927121, version 1 - 1 1 Jan 2014
Figure 3.2: PCA is applied on 178 (plotted as blue circles) out of 252 painted markers.
in terms of the perceived facial deformations. Broadly, the following observations can
be made by looking at the visual speech animation obtained by varying a single principal
component. The first two principal components mainly account for combined jaw
opening/closing and lip protrusion gestures. The third component accounts for lip
opening, after removal of the jaw contribution. Some of the components, though related
to speech, are augmented by gestures that are specific to the speaker's facial
expressions. This seems to be the case for components 4 and 5, which seem to capture lip
spreading. However, due to some asymmetry in our speaker's articulation, lip spreading
is divided into two modes: one accounting for spreading toward the left side of the lips
and one for spreading toward the right side. Component 6 is a smiling gesture; however,
it is difficult to classify it as belonging to speech articulation or to pure facial
expression. Components 7 to 12 seem to account for very subtle lip deformations, which
we believe are idiosyncratic characteristics of our speaker.
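Varying a single principal component, as done for the observations above, amounts to reconstructing marker coordinates from a vector of z-scores in which only one entry is non-zero. The sketch below shows this under the assumption that the components are orthonormal directions with a known standard deviation each. The toy values, the 3-dimensional coordinate space and the function names `reconstruct` and `project` are hypothetical and serve only as illustration.

```python
def reconstruct(mean, components, stds, z_scores):
    """Rebuild a coordinate vector from PCA z-scores.

    mean:       mean coordinate vector (length d)
    components: unit-length principal directions, each of length d
    stds:       standard deviation of the data along each component
    z_scores:   score on each component, in units of its standard deviation
    """
    frame = list(mean)
    for direction, std, z in zip(components, stds, z_scores):
        for i, d in enumerate(direction):
            frame[i] += z * std * d
    return frame

def project(frame, mean, components, stds):
    """z-scores of a frame; inverts reconstruct when components are orthonormal."""
    centred = [x - m for x, m in zip(frame, mean)]
    return [sum(c * d for c, d in zip(centred, direction)) / std
            for direction, std in zip(components, stds)]

# Hypothetical toy model: 2 components in a 3-dimensional coordinate space.
mean = [1.0, 2.0, 3.0]
components = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
stds = [2.0, 0.5]

# Deformation for component 1 at -3 and +3 z-scores, all other components at 0.
low = reconstruct(mean, components, stds, [-3.0, 0.0])
high = reconstruct(mean, components, stds, [3.0, 0.0])
```

In the real system the coordinate vector would hold the 3D positions of the 178 lower-face markers and there would be 12 components, but the arithmetic is the same.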
Several experiments indicated that retaining as few as three components could lead to an
animation which would be acceptable, in the sense that it would capture the basic speech
gestures and would filter out almost all the speaker-specific gestures. However, such an
animation would lack some naturalness, which is mostly captured by secondary components.
We are also in favor of keeping the specificity of the speaker-specific gestures.
Retaining 12 components leads to animations that are natural enough for all purposes.
One of the goals of our proposed system is to synthesize trajectories corresponding to
the PCA-reduced visual information, for these
Figure 3.3: Facial deformations when each of the principal components is set at $-3$ and
$3$ z-scores. Each panel shows one component at −3σ and +3σ: PC 1 (57.75%),
PC 2 (21.93%), PC 3 (6.46%), PC 4 (2.27%), PC 5 (1.55%), PC 6 (1.07%), PC 7 (0.93%),
PC 8 (0.56%), PC 9 (0.44%), PC 10 (0.38%), PC 11 (0.33%), PC 12 (0.32%).
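The per-component percentages listed for Figure 3.3 can be accumulated to check the variance figures quoted in the text. This is plain arithmetic on the published numbers; only the variable and function names are introduced here.

```python
# Percentage of variance per component, copied from the panels of Figure 3.3.
explained = [57.75, 21.93, 6.46, 2.27, 1.55, 1.07,
             0.93, 0.56, 0.44, 0.38, 0.33, 0.32]

def cumulative(values):
    """Running sum of a list of numbers."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out

cum = cumulative(explained)
for k in (2, 3, 12):
    print(f"first {k:2d} components: {cum[k - 1]:.2f}% of variance")
```

The running totals are 79.68% for two components, 86.14% for three and 93.99% for twelve, which agrees with the values of 79.6% and about 94% given in the text.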
Figure 3.4: Calculation of labial features is done using the 4 points on the face: A, B,
C and D. Lip opening and lip spread are given by the distances $\|\vec{CD}\|$ and
$\|\vec{AB}\|$. Lip protrusion is given by the displacement of O, the center of gravity
of the four points (A, B, C, D), along the normal vector ($\vec{OF_p}$) to the plane
formed by vectors $\vec{AB}$ and $\vec{CD}$. Jaw opening is calculated as the distance
between the center of the chin and a fixed point on the head.
information can be reconstructed using these 12 trajectories. The mean values of the
positions of the markers at the upper part of the face may then be added to complete the
face visualization. Hence, the first 12 principal components, which explain about 94% of
the variance of the lower part of the face, are retained for storage and reconstruction
at runtime. Besides the 12 PCA coefficients, four articulatory parameters (lip
protrusion, lip opening, lip spread and jaw opening) are calculated as explained in
figure 3.4 (Robert et al., 2005). These articulatory features are used for the analysis
of the visual speech corpus and implicitly during selection, as the visual target costs
are designed based on these features.
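The geometric definitions of the four articulatory parameters given with Figure 3.4 can be written out as a short sketch. The reference position `O_rest` from which the displacement of O is measured, the sign convention of the normal, and the function names are assumptions, since the text does not specify them.

```python
import math

def sub(p, q):
    return [a - b for a, b in zip(p, q)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def labial_features(A, B, C, D, O_rest):
    """Lip spread, opening and protrusion from four 3D lip points.

    O_rest is an assumed reference position of the centre of gravity O.
    The sign of the protrusion depends on the ordering of the points.
    """
    AB, CD = sub(B, A), sub(D, C)
    O = [(a + b + c + d) / 4.0 for a, b, c, d in zip(A, B, C, D)]
    n = cross(AB, CD)                      # normal to the plane of AB and CD
    unit_n = [x / norm(n) for x in n]
    return {
        "spread": norm(AB),
        "opening": norm(CD),
        "protrusion": dot(sub(O, O_rest), unit_n),
    }

def jaw_opening(chin_centre, head_fixed_point):
    """Distance between the centre of the chin and a fixed point on the head."""
    return norm(sub(chin_centre, head_fixed_point))
```

Called once per frame, these functions would yield four trajectories alongside the 12 PCA coefficient trajectories.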
The acoustic speech parameters extracted included the LPC (linear predictive coding)
coefficients, f0, and energy.
3.1.4 Segmentation
We perform segmentation based on the forced alignment of acoustic speech. The predicted
segment boundaries are considered as the synchronous bimodal segment boundaries, and are
chosen to represent speech segments in the corpus. The unit of target search and
synthesis is the diphone. Besides making the storage and indexing of bimodal speech
segments extremely simple, this reinforces the principal idea of keeping synchronous,
inseparable bimodal speech intact. A diphone extends from the middle of one phone to the
middle of the next phone. The middle of a phone is a relatively stationary region.
Hence, by using the diphone as the synthesis unit, the acoustic artifacts due to any
segmentation errors are reduced. Diphone units also account for
Diphone as a synthesis unit is reported to produce comparatively good quality speech
(Moulines and Charpentier, 1990). The segmentation based on speech acoustics and the
annotation of data were done using scripts developed by Colotte (2009). The monophone
HMMs which are used by these scripts are trained on a very large acoustic speech corpus
and provide highly accurate segmentation.
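The diphone definition used in this section (from the middle of one phone to the middle of the next) can be sketched as a conversion from phone-level boundaries to diphone boundaries. The `(label, start, end)` tuple layout, the toy labels and times, and the function name `diphones` are assumptions made for illustration; they do not reflect the output format of the alignment scripts.

```python
def diphones(phones):
    """Convert a time-ordered phone alignment into diphone units.

    phones: list of (label, start_s, end_s) tuples.
    Returns a list of (name, start_s, end_s), each unit running from the
    midpoint of one phone to the midpoint of the next.
    """
    units = []
    for (l1, s1, e1), (l2, s2, e2) in zip(phones, phones[1:]):
        units.append((f"{l1}-{l2}", (s1 + e1) / 2.0, (s2 + e2) / 2.0))
    return units

# Hypothetical alignment of three phones (times in seconds).
alignment = [("s", 0.0, 0.1), ("a", 0.1, 0.3), ("l", 0.3, 0.4)]
units = diphones(alignment)
```

Since the same boundaries are applied to the acoustic and the visual streams, each diphone keeps its audio and its marker trajectories together as one synchronous segment.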