Chapter 5 Unit Selection
5.3 Target feature selection and weight tuning
5.3.4 Analysis of selected features and their relative importance
In this section, we present the analysis of target features based on their relative importance
for each of the constituent aspects included in the dissimilarity metric: pitch, local acoustic
speech, duration and visual speech. The results are based on target feature weighting that takes
one constituent metric at a time in the dissimilarity metric. Features with low weights
(< 0.01) are not shown in this analysis. The results are presented for vowels and consonants
separately. Linguistic features can describe the current candidate or its left or right context.
Phonetic features can describe a candidate's left or right context (see section 5.1). To analyze
the results, we calculate the mean and standard deviation of the weights assigned to each feature
by taking together the context and the current candidate. The weights are assigned such that
the sum of the weights over all the target features is 1. These results are shown in tables 5.5 to
5.12.
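The pooling step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the data layout (a dictionary keyed by hypothetical `(feature, position)` pairs), the function name `pooled_feature_stats`, the treatment of eliminated features as weight 0, and the use of the population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pooled_feature_stats(weights_per_phoneme):
    """Return {feature: (mean, std)} of pooled weights across phonemes.

    weights_per_phoneme maps a phoneme label to a dict whose keys are
    (feature, position) pairs, with position in {"left", "current", "right"},
    and whose values are tuned weights (hypothetical layout).
    """
    pooled = {}  # phoneme -> {feature: pooled, normalized weight}
    for phoneme, weights in weights_per_phoneme.items():
        total = sum(weights.values())
        if total <= 0:
            raise ValueError(f"weights for {phoneme!r} must have a positive sum")
        per_feature = defaultdict(float)
        for (feature, _position), weight in weights.items():
            # Normalize so the weights of one phoneme sum to 1, then pool the
            # left/current/right variants of the same feature.
            per_feature[feature] += weight / total
        pooled[phoneme] = per_feature

    features = sorted({f for per_feature in pooled.values() for f in per_feature})
    stats = {}
    for feature in features:
        # Assumption: a feature eliminated for a phoneme counts as weight 0.
        values = [per_feature.get(feature, 0.0) for per_feature in pooled.values()]
        stats[feature] = (mean(values), pstdev(values))
    return stats
```

With this layout, the weight of, say, voicing of the left phoneme and voicing of the right phoneme are added into a single "voicing" entry per phoneme before the mean and standard deviation are taken over the phonemes of a class (vowels or consonants).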
• Pitch: For vowels, the mean total weight given to linguistic features is 0.19 and 0.81 to
phonetic features [...]. For consonants, linguistic features get 0.36 as the mean total weight
and 0.64 for phonetic features, with a standard deviation of 0.26. The preceding context is
important in terms of both phonetic and linguistic features for pitch prediction. The list of
important linguistic and phonetic features with the mean and standard deviation of weights
for vowels and consonants is given in tables 5.5 and 5.6.
tel-00927121, version 1 - 1 1 Jan 2014

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Voicing                     0.71  0.28    Voicing                     0.26  0.32
Kind                        0.08  0.13    Kind                        0.13  0.15
                                          Lip shape                   0.13  0.19
                                          Manner of articulation      0.11  0.14
Table 5.5: Phonetic features important for pitch

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Left silence                0.05  0.07    Right silence               0.14  0.21
Syllable position in RG     0.04  0.06    Syllable position in RG     0.07  0.08
Word position in sentence   0.03  0.07    Syllable position in word   0.04  0.07
Phoneme number in syllable  0.03  0.09    Word position in RG         0.03  0.05
Right silence               0.02  0.03    Word position in sentence   0.02  0.06
Syllable position in word   0.01  0.01    Phoneme number in syllable  0.02  0.02
                                          Syllable number in sentence 0.01  0.01
                                          Syllable kind               0.01  0.01
                                          Word number in RG           0.01  0.03
Table 5.6: Linguistic features important for pitch
◦ Phonetic features: For both vowels and consonants, contextual phoneme voicing and
phoneme kind are important features. For consonants, lip shape during articulation
and manner of articulation are also important.
◦ Linguistic features: For both vowels and consonants, relative position of the nearest
following and preceding silence, syllable position in rhythm group (RG) and word,
phoneme number in a syllable and word position in a sentence are important.
• Local speech acoustics: The acoustic features considered (MFCCs) can be assumed to
describe local speech acoustics. For vowels, phonetic features get a total mean weight of
0.67 and linguistic features 0.33, with a standard deviation of 0.26. For consonants,
the total mean weight for linguistic features is 0.19 and 0.81 for phonetic features, with a
standard deviation of 0.12. The list of important linguistic and phonetic features with the
mean and standard deviation of weights for vowels and consonants is given in tables 5.7
and 5.8.

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Voicing                     0.26  0.25    Lip shape                   0.32  0.20
Place of articulation       0.21  0.22    Place of articulation       0.20  0.27
Manner of articulation      0.13  0.11    Voicing                     0.12  0.19
Kind                        0.04  0.06    Manner of articulation      0.10  0.12
Lip shape                   0.03  0.04    Kind                        0.07  0.10
Table 5.7: Phonetic features important for local speech acoustics

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Syllable position in word   0.07  0.11    Syllable position in RG     0.04  0.05
Left silence                0.05  0.05    Right silence               0.04  0.06
Syllable position in RG     0.04  0.04    Left silence                0.02  0.02
Word position in sentence   0.04  0.06    Syllable position in word   0.02  0.02
Phoneme number in syllable  0.04  0.06    Word position in RG         0.01  0.01
Syllable kind               0.04  0.05    Phoneme number in syllable  0.01  0.01
Right silence               0.02  0.03    Word position in sentence   0.01  0.01
Table 5.8: Linguistic features important for local speech acoustics
◦ Phonetic features: For vowels, voicing of the preceding phonemes, place and manner
of articulation of the following phoneme are the most important features, followed
by place of articulation of the preceding and voicing of the following phoneme. For
consonants, lip shape of the following phonemes seems to be the most important
feature besides place of articulation and kind of the following phonemes. Just as in
the case of f0, voicing of the preceding phoneme is also an important feature.
◦ Linguistic features: For both vowels and consonants, syllable position in word and
RG, relative position of the nearest left and right silence, phoneme number in a
syllable and word position in a sentence are important. Syllable kind and word position
in sentence are also important for vowels and consonants respectively.
• Duration: For duration, linguistic features are dominant and invariably the most important
compared to phonetic features. The pattern is even more pronounced in the case of
vowels. For vowels and consonants, the total mean weight assigned to linguistic features is
0.65 and 0.62 respectively, and the standard deviation is 0.18 and 0.25 respectively. The
list of important linguistic and phonetic features with the mean and standard deviation of
weights for vowels and consonants is given in tables 5.9 and 5.10.
Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Kind                        0.25  0.16    Kind                        0.15  0.13
Lip shape                   0.05  0.06    Manner of articulation      0.10  0.16
Place of articulation       0.03  0.08    Voicing                     0.08  0.13
Manner of articulation      0.02  0.04    Lip shape                   0.04  0.09
Voicing                     0.01  0.02    Place of articulation       0.02  0.02
Table 5.9: Phonetic features important for duration

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Syllable position in RG     0.41  0.22    Syllable position in RG     0.23  0.15
Syllable position in word   0.07  0.09    Right silence               0.16  0.23
Right silence               0.04  0.09    Left silence                0.08  0.12
Syllable kind               0.04  0.08    Syllable position in word   0.03  0.05
Left silence                0.02  0.04    Word position in RG         0.03  0.08
Phoneme number in syllable  0.02  0.04    Phoneme number in syllable  0.02  0.05
Word position in RG         0.02  0.03    Syllable number in RG       0.02  0.02
Word number in RG           0.01  0.02    [column not recoverable]
Syllable number in sentence 0.01  0.02    [column not recoverable]
Table 5.10: Linguistic features important for duration

◦ Phonetic features: For both vowels and consonants, kind of the following phoneme is the
most important feature. For consonants, the manner of articulation and voicing of
the following contextual phoneme are also important.
◦ Linguistic features: For both vowels and consonants, the syllable position in the RG
is the most important feature, followed by relative positions of left and right silence,
syllable position in word, phoneme number in a syllable and word position in a RG.
• Visual features: For visual speech, the total mean weight assigned to linguistic features
is 0.31 for vowels and 0.12 for consonants, with a standard deviation of 0.17 and 0.10
respectively. The list of important linguistic and phonetic features with the mean and
standard deviation of weights for vowels and consonants is given in tables 5.11 and 5.12.
Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Place of articulation       0.36  0.18    Lip shape                   0.77  0.16
Lip shape                   0.14  0.19    Place of articulation       0.04  0.05
Manner of articulation      0.09  0.09    Voicing                     0.02  0.03
Voicing                     0.07  0.09
Kind                        0.04  0.06
Table 5.11: Phonetic features important for visual speech

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Syllable position in word   0.11  0.11    Syllable position in word   0.03  0.05
Syllable kind               0.04  0.07    Syllable number in sentence 0.02  0.02
Syllable number in sentence 0.04  0.02    Right silence               0.01  0.02
Phoneme number in syllable  0.02  0.03    Word position in sentence   0.01  0.02
Right silence               0.02  0.04
Word position in sentence   0.02  0.01
Word number in RG           0.02  0.05
Table 5.12: Linguistic features important for visual speech

◦ Phonetic features: For vowels, place of articulation of the following and preceding
phonemes are the most important features in the decreasing order of importance. The
lip shape during articulation and manner of articulation of the contextual phonemes
are also observed to be important. For consonants, lip shape of the following phoneme,
lip shape of the preceding phoneme and place of articulation of the preceding phoneme
are observed to be the 3 most important features in the decreasing order of
importance.
◦ Linguistic features: For vowels, syllable position in a word is an important feature.

The analysis of these selected features is in itself an interesting problem. The relative importance
of the contextual features indicates that the right context is more important than the left. This
is more pronounced in the phonetic feature weights. One possible interpretation of this
is that instances of anticipatory coarticulation are more frequent than instances of carryover
coarticulation in French. Word number in sentence has been eliminated for most of the phonemes,
as the corpus is not sufficient to establish any such relation. Numeric features in general have
got lower weights, which shows that the relative position is more important than the exact
position. Relative-position features are size invariant. For example, `syllable position in RG' does
not depend on the total number of syllables in RG, but `syllable number in RG' depends on
the total number of syllables in RG. The selected features and their relative weights implicitly
indicate the validity of the algorithm. For example, for pitch and duration, syllable position
in RG, relative position of nearest left and right silence, and syllable position in word are shown to
be important. These features are known to be important for explaining many of the prosodic
patterns in French.
With the fifth combination, with equal weights to all the four constituents of the dissimilarity
metric, the selected features contain the features which are important for all the four constituent
[...] and consonants are 0.28 and 0.26 respectively, and the standard deviation is 0.24 and 0.17
respectively. We use these features and their determined weights in our synthesis system. We
present the objective and perceptual evaluation done for the synthesized speech using these
feature weights.

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Voicing                     0.48  0.27    Lip shape                   0.35  0.15
Kind                        0.13  0.16    Voicing                     0.17  0.22
Place of articulation       0.06  0.05    Place of articulation       0.10  0.10
Manner of articulation      0.03  0.02    Kind                        0.08  0.10
Lip shape                   0.02  0.02    Manner of articulation      0.04  0.04
Table 5.13: Phonetic features for acoustic-visual speech

Vowels                                    Consonants
                            Weight                                        Weight
Feature                     µ     σ       Feature                     µ     σ
Syllable position in RG     0.09  0.08    Right silence               0.10  0.13
Right silence               0.04  0.05    Syllable position in RG     0.06  0.07
Left silence                0.04  0.06    Syllable position in word   0.02  0.02
Syllable position in word   0.04  0.07    Left silence                0.01  0.02
Phoneme number in syllable  0.03  0.05    Phoneme number in syllable  0.01  0.03
Syllable kind               0.02  0.03    Word position in sentence   0.01  0.01
Word position in sentence   0.01  0.01
Table 5.14: Linguistic features for acoustic-visual speech
5.4 Conclusion
In this chapter, we have presented the set of corpus-independent target features and explained the
corpus-based visual target features that we developed for improving synthesis with our corpus.
We detailed the iterative target feature weighting technique that we have designed. It assigns
weights and performs elimination of redundant features simultaneously. We finally presented the
analysis of the patterns that were observed in the selected features and their weights. The relative
weighting of the target features affects selection and hence the synthesis results. The majority of the
observations with respect to selected features and their relative weights are in agreement with
phonetic and linguistic studies, which shows the strength of this algorithm. It also indicates
that the constituent metrics included to represent pitch, duration, local speech acoustics and
visual speech are indeed correlated to these aspects.
The weight tuning algorithm that we presented (section 5.3.2) performs automatic weight
[...]alisations and (2) a set of target features used to describe the targets and candidates. The
performance of selection based on the resultant target cost depends on various factors. Firstly,
for the various aspects included, different distance measures can be investigated with respect to
their correlation with human perception. Such studies have been done with respect to acoustic
concatenation costs (Wouters and Macon, 1998; Vepa et al., 2002; Klabbers and Veldhuis, 1998).
Secondly, the importance of the different aspects of the dissimilarity metric varies among phonemes.
For example, it is known that vowel durations are more important for good prosody. The two
above-mentioned factors require substantial investigation. Lastly, the weights given to these
constituent metrics might further improve through systematic and extensive perceptual experiments
with human participants. It can be argued that this process is inefficient and slow. But a good
justification for such an approach is that the weight tuning problem in the huge dimensional space
of target features is mitigated by setting the weights of the constituents of the dissimilarity
metric, which is a much smaller dimension. Also, since the synthesized speech is targeted for
humans, reinforcement from human participants is advantageous. We performed evaluations
through human subjects to assess the final system with the resultant target features and their
weights obtained using the weight tuning algorithm. In the following chapter, we describe these tests
besides summarizing objective evaluation techniques that we have used while developing selection
strategies 3.
3
Evaluation
Throughout the development process, the different methodologies being used to improve
synthesis were systematically and automatically evaluated. This objective evaluation was based on
some metrics that we defined. This evaluation can be performed either by comparing
synthesized AV speech signals to real speech signals, or based on a comparison with corpus statistics.
However, as this acoustic-visual speech synthesis system is targeted for humans, the system
should be evaluated using perceptual experiments where human beings are the center of this
evaluation. In the context of audio-visual speech, the evaluation of both the channels is not
straightforward and requires a careful consideration of the various factors which might affect the
synthesis quality and the limitations of the system while setting benchmarks for comparison.
In this chapter, we first describe the various objective evaluation metrics used for evaluating
different selection techniques (in section 6.1). In section 6.2, we describe the perceptual and
subjective evaluations done along with their results. Finally, we present a preliminary analysis
of the subjective evaluation results in comparison with the objective evaluation metrics in section
6.3.
4
6.1 Objective evaluation
For a fast automatic evaluation of the synthesized speech, it is a general practice to leave
some of the sentences outside the synthesis corpus for comparison purposes. They are generally
either specially designed or chosen based on some necessary conditions. They are considered as
references for comparative evaluation. We have a set of 20 test sentences which are not part of
the synthesis corpus for comparative evaluation.
4 A short overview of our system and the evaluation results presented in this chapter were published in (Musti
et al., 2012).
6.1.1 Objective evaluation based on comparison of two signals
We have utilized three objective evaluation metrics which have been introduced in the previous
chapter (section 5.3), as well as the correlation coefficient and root mean squared error (RMSE)
between real and synthesized test sentences. To make the duration (number of samples) equal in
both sentences, a simple linear interpolation is applied for each demi-phone wherever necessary
(see Fig. 5.5). Let us assume that $x_d$ and $y_d$ are the sequences of the $d$-th acoustic or
visual parameter of a real and a synthesized sentence respectively, each having $n$ samples. Then,
the first two metrics are calculated as follows:
• Pearson's Correlation Coefficient: the correlation coefficient $r_{x_d y_d}$ is calculated as follows:
$$r_{x_d y_d} = \frac{\sum_{i=1}^{n} (x_d(i) - \bar{x}_d)(y_d(i) - \bar{y}_d)}{\sqrt{\sum_{i=1}^{n} (x_d(i) - \bar{x}_d)^2}\,\sqrt{\sum_{i=1}^{n} (y_d(i) - \bar{y}_d)^2}}$$
where $\bar{x}_d$ and $\bar{y}_d$ are the means of the two sequences.
The considered acoustic parameters were the first 13 MFCCs and F0. The considered visual
parameters were the first 12 PCA coefficients.
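The signal comparison above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the resampling is applied to a whole parameter track rather than demi-phone by demi-phone as in the text, and the RMSE is written in its usual textbook form because its exact expression is not reproduced here.

```python
import math


def resample_linear(values, n):
    """Linearly interpolate `values` to exactly `n` samples."""
    if n < 1 or len(values) == 0:
        raise ValueError("need at least one input sample and n >= 1")
    if len(values) == 1 or n == 1:
        return [float(values[0])] * n
    scale = (len(values) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * scale
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have the same length (>= 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must have the same non-zero length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

In use, one parameter track of the synthesized sentence (for example one MFCC or one PCA coefficient) would first be resampled to the length of the real track, after which `pearson` and `rmse` are computed per parameter.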
The duration-based metrics are calculated as follows:
1. For the purpose of comparing any two candidates $u$ and $v$ of the same phonetic label
during target weight tuning, the following metric was used:
$$D_{dur}(u, v) = \frac{|dur_u - dur_v| - dur_{min}(p)}{dur_{max}(p) - dur_{min}(p)} \qquad (6.3)$$
where $dur_{max}(p)$ and $dur_{min}(p)$ are the maximum and minimum of the durations of all the
candidates for phoneme $p$, and $dur_u$ and $dur_v$ are the durations of candidates $u$ and $v$.
2. For the purpose of comparing two whole sentences (segment-wise), the following duration
metric was used:
[...]
where $s$ and $r$ are the synthesized and real sentences respectively, having $N$ phonetic
segments, and $dur_s(j)$ and $dur_r(j)$ are the durations of the $j$-th phonetic segments of real and
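As a worked illustration of the candidate-level duration metric, the sketch below transcribes equation (6.3) as it is printed. The function names and the example durations (in seconds) are hypothetical, and the guard against a zero range is an added assumption.

```python
def duration_range(candidate_durations):
    """Minimum and maximum duration over all candidates of one phoneme."""
    return min(candidate_durations), max(candidate_durations)


def duration_dissimilarity(dur_u, dur_v, dur_min, dur_max):
    """Equation (6.3): normalized duration difference of candidates u and v."""
    if dur_max <= dur_min:
        # Assumption: with a single distinct duration the metric is undefined.
        raise ValueError("dur_max must be greater than dur_min")
    return (abs(dur_u - dur_v) - dur_min) / (dur_max - dur_min)


# Hypothetical durations (seconds) of all candidates of one phoneme.
durations = [0.03, 0.08, 0.12, 0.23]
dur_min, dur_max = duration_range(durations)
d = duration_dissimilarity(0.12, 0.08, dur_min, dur_max)
```

Here the result is (0.04 - 0.03) / 0.20, that is about 0.05. Note that, as written, the expression becomes negative whenever the absolute duration difference is smaller than the minimum candidate duration.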
6.1.2 Objective evaluation based on statistical analysis and thresholds
Sometimes objective evaluation metrics which are based on statistical analysis of the corpus are
developed and utilized for various purposes. For the purpose of total cost weight tuning, that is,
setting the weights of the target cost and the acoustic and visual join costs, we utilized three objective
evaluation metrics which belong to this category. We first calculated the standard deviation of
the first PCA coefficient (denoted by $\sigma_{PC1}$) from the whole corpus. Similarly, the standard
deviation of its first order derivative (denoted by $\sigma_{\Delta PC1}$) from the whole corpus was
also calculated. Then, for a set of synthesized sentences, the sentences were scanned at all the
concatenation boundaries to count the following:
• Total instances where the differences between first PCA coefficients exceed $\epsilon_{pc1}$.
• Total instances where the differences between the first order derivatives of first PCA coefficients
exceed $\epsilon_{\Delta PC1}$.
• Total instances where the differences in f0 exceed $\epsilon_{f0}$. Bark was chosen as the suitable
perceptual scale.
The first principal component and its derivative were chosen as the first principal component
itself accounts for about 57% of the data variance and also gives an indication of the
discontinuity in the subsequent components. These values give an indication of the visual and acoustic
discontinuity at the concatenation boundaries. These values, along with a duration metric, were used
for evaluating the efficiency of the total cost function. Besides the above 3 metrics, a duration
metric based on the comparison of real and synthesized sentences was also used, as explained
below.
• Total instances of vowels where the difference in duration ratio of synthesized and real
sentences is greater than $\epsilon_{dur}$.
The thresholds were chosen empirically by perceptual experimentation. In this case the
considered thresholds were $\epsilon_{pc1} = 0.5\,\sigma_{PC1}$, $\epsilon_{\Delta PC1} = 0.5\,\sigma_{\Delta PC1}$,
$\epsilon_{f0} = 0.25$ Barks and $\epsilon_{dur} = 150\%$. Throughout the development process, this method
was applied for the tuning of the total cost weights whenever we modified the components of the target
cost function or the concatenation cost function. The following weights were used for the total cost
function for selection: $w = 1$, $w_{aj} = 0.943$ and $w_{vj} = 0.897$, where $w$, $w_{aj}$ and $w_{vj}$
are the weights assigned to the target, acoustic concatenation and visual concatenation cost functions
respectively.
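The boundary counts described above can be sketched as follows. This is only an illustration: the function names and the data layout (paired lists holding the value just before and just after each concatenation boundary) are hypothetical, the population standard deviation is assumed, and f0 values are assumed to be already expressed in Barks.

```python
from statistics import pstdev


def threshold_from_corpus(corpus_values, factor=0.5):
    """Threshold defined as a fraction of the corpus standard deviation."""
    return factor * pstdev(corpus_values)


def count_discontinuities(before, after, threshold):
    """Count boundaries where |after - before| exceeds the threshold."""
    if len(before) != len(after):
        raise ValueError("need one (before, after) pair per boundary")
    return sum(1 for b, a in zip(before, after) if abs(a - b) > threshold)


# Hypothetical first-PCA-coefficient values over a tiny "corpus".
corpus_pc1 = [0.0, 2.0, 4.0, 6.0]
eps_pc1 = threshold_from_corpus(corpus_pc1)  # 0.5 * sigma_PC1

# Hypothetical PC1 values on each side of three concatenation boundaries.
pc1_before = [0.0, 1.0, 3.0]
pc1_after = [0.5, 3.0, 1.0]
n_pc1 = count_discontinuities(pc1_before, pc1_after, eps_pc1)

# f0 on each side of the same boundaries, in Barks, with the 0.25 Bark threshold.
n_f0 = count_discontinuities([1.10, 1.20, 1.00], [1.15, 1.50, 1.30], 0.25)
```

The same counting function serves for the derivative of the first PCA coefficient once its corpus standard deviation is available; lower counts for a given set of total cost weights indicate fewer audible or visible discontinuities under these thresholds.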
5
5
6.2 Human-centered evaluation
To evaluate our overall audio-visual speech synthesis system, the following perceptual
intelligibility and subjective quality evaluation tests were conducted. In these tests a total of 39
participants between 19 and 65 years of age with normal auditory and visual abilities
participated. Among the participants, 15 were female and the rest were male. All these
participants were native French speakers. The tests were conducted across the internet, where each
of the participants heard and saw the stimuli on their computers with the available hardware. A
set of basic instructions was played at the beginning of these tests.
6.2.1 Intelligibility tests
The intelligibility test was at the word level. Each human subject was presented with 50 one- or
two-syllable French words and asked to recognize and report the word. Some examples of the
words that were presented include {anneau (ring), bien (good), chance (luck), pince (clip), laine
(wool), cuisine (kitchen)}. Among these words, 11 were those which are present in the corpus.
These in-corpus words were included to set a benchmark for the best possible intelligibility by
the recorded data.
These tests were done at two levels: (1) acoustic-only speech, (2) audio-visual speech. In
each of these categories, the acoustic speech component was degraded to two noise levels. Hence,
each word was played 4 times: (1) acoustic-only with low noise component (SNR of -6 dB),
(2) acoustic-only with high noise component (SNR of -10 dB), (3) audio-visual with low noise
(SNR of -6 dB), (4) audio-visual speech with high noise (SNR of -10 dB). The addition of noise
also ensures that the listener pays attention to the visual modality of speech. The aim is to
evaluate both visual and acoustic modalities, and also to estimate the advantage of audio-visual
speech over acoustic-only speech. These noise thresholds were decided based on several
audio-visual perceptual experiments to strike a trade-off between these two objectives. The facial
animation is shown as the 3D surface of the face using a sparse mesh, which has the dynamics of
facial deformations, but without the texture and color information (see Fig. 3.9). Besides, the
information regarding internal articulators, teeth and tongue is also missing from the animations.
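Degrading speech to a target SNR, as in the four conditions above, is commonly done by scaling a noise signal before adding it. The sketch below shows one such way; it is not the procedure used for these tests. The function names are hypothetical, and white Gaussian noise is an assumption, since the type of noise is not stated here.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def make_stimuli(speech, seed=0):
    """Return the two degraded versions of one word: SNR -6 dB and -10 dB."""
    rng = random.Random(seed)
    # Assumption: white Gaussian noise stands in for the unspecified noise.
    noise = [rng.gauss(0.0, 1.0) for _ in speech]
    return {snr_db: mix_at_snr(speech, noise, snr_db) for snr_db in (-6.0, -10.0)}
```

At a negative SNR the noise carries more power than the speech (about four times more at -6 dB and ten times more at -10 dB), which is consistent with the stated intent of pushing participants to rely on the visual modality.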
Table 6.2 includes the intelligibility scores in terms of the fraction of the total words
recognized in each of the four categories by different users. Table 6.1 shows the mean intelligibility