
Chapter 5 Unit Selection

5.3 Target feature selection and weight tuning

5.3.4 Analysis of selected features and their relative importance

In this section, we present the analysis of target features based on their relative importance for each of the constituent aspects included in the dissimilarity metric: pitch, local acoustic speech, duration and visual speech. The analyses are based on target feature weighting, taking one constituent metric at a time in the dissimilarity metric. Features with low weights (< 0.01) are not shown in this analysis. The results are presented for vowels and consonants separately. Linguistic features can describe a current candidate or its left or right context. Phonetic features can describe a candidate's left or right context (see section 5.1). To analyze the results, we calculate the mean and standard deviation of the weights assigned to each feature by taking together the context and the current candidate. The weights are assigned such that the sum of the weights over all the target features is 1. These results are shown in tables 5.5 to 5.12.
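The aggregation described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the per-phoneme dictionaries keyed by `(feature, context)`, the function name `summarize_weights`, the use of the population standard deviation, and all numeric values are hypothetical.

```python
from statistics import mean, pstdev


def summarize_weights(per_phoneme_weights, threshold=0.01):
    """Pool the context variants of each feature, then average over phonemes.

    per_phoneme_weights: list of dicts mapping (feature, context) -> weight,
    one dict per phoneme; context is 'left', 'current' or 'right'
    (an assumed data layout).
    """
    pooled = []
    for weights in per_phoneme_weights:
        total = sum(weights.values())
        by_feature = {}
        for (feature, _context), w in weights.items():
            # Normalize so the weights of one phoneme sum to 1, and add up
            # the left/current/right variants of the same feature.
            by_feature[feature] = by_feature.get(feature, 0.0) + w / total
        pooled.append(by_feature)

    summary = {}
    for feature in {f for d in pooled for f in d}:
        values = [d.get(feature, 0.0) for d in pooled]
        mu = mean(values)
        if mu >= threshold:  # features with a mean weight < 0.01 are not reported
            summary[feature] = (mu, pstdev(values))
    return summary


# Hypothetical weights for two phonemes.
example = [
    {("voicing", "left"): 0.5, ("voicing", "right"): 0.3, ("kind", "right"): 0.2},
    {("voicing", "left"): 0.6, ("kind", "right"): 0.39, ("syllable_kind", "current"): 0.01},
]
summary = summarize_weights(example)
```

In this toy input, `voicing` pools its left and right variants before averaging, and `syllable_kind` is dropped because its mean weight falls below the 0.01 reporting threshold.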

Pitch: For vowels, the mean total weight given to linguistic features is 0.19 and 0.81 to phonetic features.

tel-00927121, version 1 - 1 1 Jan 2014

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Voicing                      0.71  0.28    Voicing                      0.26  0.32
Kind                         0.08  0.13    Kind                         0.13  0.15
                                           Lip shape                    0.13  0.19
                                           Manner of articulation       0.11  0.14

Table 5.5: Phonetic features important for pitch

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Left silence                 0.05  0.07    Right silence                0.14  0.21
Syllable position in RG      0.04  0.06    Syllable position in RG      0.07  0.08
Word position in sentence    0.03  0.07    Syllable position in word    0.04  0.07
Phoneme number in syllable   0.03  0.09    Word position in RG          0.03  0.05
Right silence                0.02  0.03    Word position in sentence    0.02  0.06
Syllable position in word    0.01  0.01    Phoneme number in syllable   0.02  0.02
                                           Syllable number in sentence  0.01  0.01
                                           Syllable kind                0.01  0.01
                                           Word number in RG            0.01  0.03

Table 5.6: Linguistic features important for pitch

For consonants, linguistic features receive 0.36 as the mean total weight and phonetic features 0.64, with a standard deviation of 0.26. The preceding context is important in terms of both phonetic and linguistic features for pitch prediction. The list of important linguistic and phonetic features with the mean and standard deviation of weights for vowels and consonants is given in tables 5.5 and 5.6.

Phonetic features: For both vowels and consonants, contextual phoneme voicing and phoneme kind are important features. For consonants, lip shape during articulation and manner of articulation are also important.

Linguistic features: For both vowels and consonants, the relative position of the nearest following and preceding silence, syllable position in rhythm group (RG) and word, phoneme number in a syllable and word position in a sentence are important.

Local speech acoustics: The acoustic features considered (MFCCs) can be assumed to describe local speech acoustics. For vowels, phonetic features get a total mean weight of 0.67 and linguistic features 0.33, with a standard deviation of 0.26. For consonants, the total mean weight is 0.19 for linguistic features and 0.81 for phonetic features, with a standard deviation of 0.12. The list of important linguistic and phonetic features with the mean and standard deviation of weights for vowels and consonants is given in tables 5.7 and 5.8.


Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Voicing                      0.26  0.25    Lip shape                    0.32  0.20
Place of articulation        0.21  0.22    Place of articulation        0.20  0.27
Manner of articulation       0.13  0.11    Voicing                      0.12  0.19
Kind                         0.04  0.06    Manner of articulation       0.10  0.12
Lip shape                    0.03  0.04    Kind                         0.07  0.10

Table 5.7: Phonetic features important for local speech acoustics

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Syllable position in word    0.07  0.11    Syllable position in RG      0.04  0.05
Left silence                 0.05  0.05    Right silence                0.04  0.06
Syllable position in RG      0.04  0.04    Left silence                 0.02  0.02
Word position in sentence    0.04  0.06    Syllable position in word    0.02  0.02
Phoneme number in syllable   0.04  0.06    Word position in RG          0.01  0.01
Syllable kind                0.04  0.05    Phoneme number in syllable   0.01  0.01
Right silence                0.02  0.03    Word position in sentence    0.01  0.01

Table 5.8: Linguistic features important for local speech acoustics

Phonetic features: For vowels, voicing of the preceding phoneme and place and manner of articulation of the following phoneme are the most important features, followed by place of articulation of the preceding phoneme and voicing of the following phoneme. For consonants, lip shape of the following phoneme seems to be the most important feature, besides place of articulation and kind of the following phoneme. Just as in the case of f0, voicing of the preceding phoneme is also an important feature.

Linguistic features: For both vowels and consonants, syllable position in word and RG, relative position of the nearest left and right silence, phoneme number in a syllable and word position in a sentence are important. Syllable kind and word position in sentence are also important for vowels and consonants respectively.

Duration: For duration, linguistic features are dominant and invariably the most important compared to phonetic features. The pattern is even more pronounced in the case of vowels. For vowels and consonants, the total mean weight assigned to linguistic features is 0.65 and 0.62 respectively, and the standard deviation is 0.18 and 0.25 respectively. The list of important linguistic and phonetic features with the mean and standard deviation of weights for vowels and consonants is given in tables 5.9 and 5.10.

Phonetic features: For both vowels and consonants, the kind of the following phoneme is the


Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Kind                         0.25  0.16    Kind                         0.15  0.13
Lip shape                    0.05  0.06    Manner of articulation       0.10  0.16
Place of articulation        0.03  0.08    Voicing                      0.08  0.13
Manner of articulation       0.02  0.04    Lip shape                    0.04  0.09
Voicing                      0.01  0.02    Place of articulation        0.02  0.02

Table 5.9: Phonetic features important for duration

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Syllable position in RG      0.41  0.22    Syllable position in RG      0.23  0.15
Syllable position in word    0.07  0.09    Right silence                0.16  0.23
Right silence                0.04  0.09    Left silence                 0.08  0.12
Syllable kind                0.04  0.08    Syllable position in word    0.03  0.05
Left silence                 0.02  0.04    Word position in RG          0.03  0.08
Phoneme number in syllable   0.02  0.04    Phoneme number in syllable   0.02  0.05
Word position in RG          0.02  0.03    Syllable number in RG        0.02  0.02
Word number in RG            0.01  0.02
Syllable number in sentence  0.01  0.02

Table 5.10: Linguistic features important for duration

most important feature. For consonants, the manner of articulation and voicing of the following contextual phoneme are also important.

Linguistic features: For both vowels and consonants, the syllable position in the RG is the most important feature, followed by the relative positions of the left and right silence, syllable position in word, phoneme number in a syllable and word position in an RG.

Visual features: For visual speech, the total mean weight assigned to linguistic features is 0.31 for vowels and 0.12 for consonants, with a standard deviation of 0.17 and 0.10 respectively. The list of important linguistic and phonetic features with the mean and standard deviation of weights for vowels and consonants is given in tables 5.11 and 5.12.

Phonetic features: For vowels, place of articulation of the following and of the preceding phoneme are the most important features, in decreasing order of importance. The lip shape during articulation and the manner of articulation of the contextual phonemes are also observed to be important. For consonants, lip shape of the following phoneme, lip shape of the preceding phoneme and place of articulation of the preceding phoneme are observed to be the 3 most important features, in decreasing order of importance.

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Place of articulation        0.36  0.18    Lip shape                    0.77  0.16
Lip shape                    0.14  0.19    Place of articulation        0.04  0.05
Manner of articulation       0.09  0.09    Voicing                      0.02  0.03
Voicing                      0.07  0.09
Kind                         0.04  0.06

Table 5.11: Phonetic features important for visual speech

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Syllable position in word    0.11  0.11    Syllable position in word    0.03  0.05
Syllable kind                0.04  0.07    Syllable number in sentence  0.02  0.02
Syllable number in sentence  0.04  0.02    Right silence                0.01  0.02
Phoneme number in syllable   0.02  0.03    Word position in sentence    0.01  0.02
Right silence                0.02  0.04
Word position in sentence    0.02  0.01
Word number in RG            0.02  0.05

Table 5.12: Linguistic features important for visual speech

Linguistic features: For vowels, syllable position in a word is an important feature.

The analysis of these selected features is in itself an interesting problem. The relative importance of the contextual features indicates that the right context is more important than the left. This is more pronounced in the weights of the phonetic features. One possible interpretation is that instances of anticipatory coarticulation are more frequent than instances of carryover coarticulation in French. Word number in sentence has been eliminated for most of the phonemes, as the corpus is not sufficient to establish any such relation. Numeric features in general have received lower weights, which shows that the relative position is more important than the exact position. Relative-position features are size invariant. For example, `syllable position in RG' does not depend on the total number of syllables in the RG, whereas `syllable number in RG' does. The selected features and their relative weights implicitly indicate the validity of the algorithm. For example, for pitch and duration, syllable position in RG, relative position of the nearest left and right silence, and syllable position in word are shown to be important. These features are known to be important for explaining many of the prosodic patterns in French.
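The difference between a relative (size-invariant) feature and a numeric one can be illustrated with a small sketch. The position classes (`single`, `initial`, `medial`, `final`) and both function names are assumptions made for this example; the actual feature definitions are those of section 5.1.

```python
def syllable_number_in_rg(index):
    """Numeric feature: 1-based count of the syllable within its rhythm group."""
    return index + 1


def syllable_position_in_rg(index, n_syllables):
    """Relative feature: position class of the syllable within its rhythm group."""
    if n_syllables == 1:
        return "single"
    if index == 0:
        return "initial"
    if index == n_syllables - 1:
        return "final"
    return "medial"


# Last syllable of a 3-syllable RG and of a 6-syllable RG (0-based indices).
short_rg = (syllable_position_in_rg(2, 3), syllable_number_in_rg(2))
long_rg = (syllable_position_in_rg(5, 6), syllable_number_in_rg(5))
```

Both syllables fall in the same relative class even though their numeric values differ with the length of the group, which is one way to read why a limited corpus supports the relative feature better.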

With the fifth combination, which gives equal weights to all four constituents of the dissimilarity metric, the selected features contain the features which are important for all the four constituent


Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Voicing                      0.48  0.27    Lip shape                    0.35  0.15
Kind                         0.13  0.16    Voicing                      0.17  0.22
Place of articulation        0.06  0.05    Place of articulation        0.10  0.10
Manner of articulation       0.03  0.02    Kind                         0.08  0.10
Lip shape                    0.02  0.02    Manner of articulation       0.04  0.04

Table 5.13: Phonetic features for acoustic-visual speech

Vowels (weight)                            Consonants (weight)
Feature                      µ     σ       Feature                      µ     σ
Syllable position in RG      0.09  0.08    Right silence                0.10  0.13
Right silence                0.04  0.05    Syllable position in RG      0.06  0.07
Left silence                 0.04  0.06    Syllable position in word    0.02  0.02
Syllable position in word    0.04  0.07    Left silence                 0.01  0.02
Phoneme number in syllable   0.03  0.05    Phoneme number in syllable   0.01  0.03
Syllable kind                0.02  0.03    Word position in sentence    0.01  0.01
Word position in sentence    0.01  0.01

Table 5.14: Linguistic features for acoustic-visual speech

and consonants are 0.28 and 0.26 respectively, and the standard deviation is 0.24 and 0.17 respectively. We use these features and the weights thus determined in our synthesis system. We present the objective and perceptual evaluation done for the synthesized speech using these feature weights.

5.4 Conclusion

In this chapter, we have presented the set of corpus-independent target features and explained the corpus-based visual target features that we developed for improving synthesis with our corpus. We detailed the iterative target feature weighting technique that we have designed. It assigns weights and eliminates redundant features simultaneously. We finally presented the analysis of the patterns that were observed in the selected features and their weights. The relative weighting of the target features affects selection and hence the synthesis results. The majority of the observations with respect to the selected features and their relative weights are in agreement with phonetic and linguistic studies, which shows the strength of this algorithm. It also indicates that the constituent metrics included to represent pitch, duration, local speech acoustics and visual speech are indeed correlated to these aspects.

The weight tuning algorithm that we presented (section 5.3.2) performs automatic weight


alisations and (2) a set of target features used to describe the targets and candidates. The performance of selection based on the resultant target cost depends on various factors. Firstly, for the various aspects included, different distance measures can be investigated with respect to their correlation with human perception. Such studies have been done with respect to acoustic concatenation costs (Wouters and Macon, 1998; Vepa et al., 2002; Klabbers and Veldhuis, 1998). Secondly, the importance of the different aspects of the dissimilarity metric varies among phonemes. For example, it is known that vowel durations are more important for good prosody. The two above-mentioned factors require substantial investigation. Lastly, the weights given to these constituent metrics might be further improved by systematic and extensive perceptual experiments with human participants. It can be argued that this process is inefficient and slow. A good justification for such an approach, however, is that the weight tuning problem in the high-dimensional space of target features is mitigated by setting only the weights of the constituents of the dissimilarity metric, which is a much smaller dimension. Also, since the synthesized speech is targeted at humans, reinforcement from human participants is advantageous. We performed evaluations with human subjects to assess the final system with the resultant target features and their weights obtained using the weight tuning algorithm. In the following chapter, we describe these tests, besides summarizing the objective evaluation techniques that we have used while developing selection strategies.


Evaluation

Throughout the development process, the different methodologies being used to improve synthesis were systematically and automatically evaluated. This objective evaluation was based on some metrics that we defined. This evaluation can be performed either by comparing synthesized AV speech signals to real speech signals, or based on a comparison with corpus statistics. However, as this acoustic-visual speech synthesis system is targeted at humans, the system should be evaluated using perceptual experiments where human beings are the center of the evaluation. In the context of audio-visual speech, the evaluation of both channels is not straightforward and requires a careful consideration of the various factors which might affect the synthesis quality, and of the limitations of the system, while setting benchmarks for comparison.

In this chapter, we first describe the various objective evaluation metrics used for evaluating different selection techniques (in section 6.1). In section 6.2, we describe the perceptual and subjective evaluations done, along with their results. Finally, we present a preliminary analysis of the subjective evaluation results in comparison with the objective evaluation metrics in section 6.3.

4

6.1 Objective evaluation

For a fast automatic evaluation of the synthesized speech, it is general practice to leave some of the sentences outside the synthesis corpus for comparison purposes. They are generally either specially designed or chosen based on some necessary conditions. They are considered as references for comparative evaluation. We have a set of 20 test sentences which are not part of the synthesis corpus for comparative evaluation.

4 A short overview of our system and the evaluation results presented in this chapter was published in (Musti et al., 2012).


6.1.1 Objective evaluation based on comparison of two signals

We have utilized three objective evaluation metrics which have been introduced in the previous chapter (section 5.3), together with the correlation coefficient and the root mean squared error (RMSE) between real and synthesized test sentences. To make the duration (number of samples) equal in both sentences, a simple linear interpolation is applied for each demi-phone wherever necessary (see Fig. 5.5). Let us assume that $x_d$ and $y_d$ are the sequences of the $d$-th acoustic or visual parameter of a real and a synthesized sentence respectively, each having $n$ samples. Then, the first two metrics are calculated as follows:

Pearson's correlation coefficient: the correlation coefficient $r_{x_d y_d}$ is calculated as follows:

$$r_{x_d y_d} = \frac{\sum_{i=1}^{n} (x_d(i) - \bar{x}_d)(y_d(i) - \bar{y}_d)}{\sqrt{\sum_{i=1}^{n} (x_d(i) - \bar{x}_d)^2}\,\sqrt{\sum_{i=1}^{n} (y_d(i) - \bar{y}_d)^2}}$$

where $\bar{x}_d$ and $\bar{y}_d$ are the means of the two sequences.

The considered acoustic parameters were the first 13 MFCCs and F0. The considered visual parameters were the first 12 PCA coefficients.
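As a rough illustration of the two signal-comparison metrics, the following sketch computes Pearson's correlation coefficient and the RMSE after bringing both sequences to the same number of samples. It uses the standard textbook definitions of both quantities. The function names and sample values are hypothetical, and resampling a whole sequence at once is a simplification of the per-demi-phone interpolation described above.

```python
import math


def resample_linear(seq, n):
    """Linearly interpolate seq to exactly n samples (requires n >= 2, len(seq) >= 2)."""
    m = len(seq)
    out = []
    for i in range(n):
        pos = i * (m - 1) / (n - 1)  # fractional index into seq
        lo = int(math.floor(pos))
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def pearson_r(x, y):
    """Pearson's correlation coefficient (undefined if either sequence is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical trajectories of one parameter (e.g. one MFCC or PCA coefficient).
real = [0.0, 1.0, 2.0, 3.0, 4.0]
synth = resample_linear([0.0, 2.0, 4.0], len(real))
r = pearson_r(real, synth)
err = rmse(real, synth)
```

In practice the two measures would be computed separately for each parameter $d$ and then summarized over the test sentences.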

The duration-based metrics are calculated as follows:

1. For the purpose of target weight tuning, the following metric was used to compare any two candidates $u$ and $v$ with the same phonetic label:

$$D_{dur}(u, v) = \frac{|dur_u - dur_v| - dur_{min}(p)}{dur_{max}(p) - dur_{min}(p)} \qquad (6.3)$$

where $dur_{max}(p)$ and $dur_{min}(p)$ are the maximum and minimum of the durations of all the candidates for phoneme $p$, and $dur_u$ and $dur_v$ are the durations of candidates $u$ and $v$.

2. For the purpose of comparing two whole sentences (segment-wise), the following duration metric was used:

where $s$ and $r$ are the synthesized and real sentences respectively, having $N$ phonetic segments, and $dur_s(j)$ and $dur_r(j)$ are the durations of the $j$-th phonetic segments of the synthesized and real sentences.
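Equation (6.3) can be transcribed directly. The sketch below is a minimal illustration: the function name, the millisecond unit and the candidate durations are hypothetical.

```python
def duration_distance(dur_u, dur_v, candidate_durations):
    """Equation (6.3): normalized duration difference between two candidates.

    candidate_durations holds the durations of all candidates of the phoneme p,
    from which dur_min(p) and dur_max(p) are taken.
    """
    dur_min = min(candidate_durations)
    dur_max = max(candidate_durations)
    return (abs(dur_u - dur_v) - dur_min) / (dur_max - dur_min)


# Hypothetical candidate durations (in ms) for one phoneme.
durations = [40.0, 60.0, 100.0]
d_far = duration_distance(40.0, 100.0, durations)
d_near = duration_distance(60.0, 100.0, durations)
```

Taken literally, the expression becomes negative whenever two candidates differ by less than $dur_{min}(p)$, so an implementation may need to decide how to treat such values.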


6.1.2 Objective evaluation based on statistical analysis and thresholds

Objective evaluation metrics based on a statistical analysis of the corpus are sometimes developed and utilized for various purposes. For the purpose of total cost weight tuning, that is, setting the weights of the target cost and of the acoustic and visual join costs, we utilized three objective evaluation metrics which belong to this category. We first calculated the standard deviation of the first PCA coefficient (denoted by $\sigma_{PC1}$) from the whole corpus. Similarly, the standard deviation of its first-order derivative (denoted by $\sigma_{\Delta PC1}$) was also calculated from the whole corpus. Then, for a set of synthesized sentences, the sentences were scanned at all the concatenation boundaries to count the following:

- Total instances where the differences between first PCA coefficients exceed $\epsilon_{pc1}$.
- Total instances where the differences between the first-order derivatives of the first PCA coefficients exceed $\epsilon_{\Delta PC1}$.
- Total instances where the differences in f0 exceed $\epsilon_{f0}$. Bark was chosen as the suitable perceptual scale.

The first principal component and its derivative were chosen because the first principal component itself accounts for about 57% of the data variance and also gives an indication of the discontinuity in the subsequent components. These values give an indication of the visual and acoustic discontinuity at the concatenation boundaries. These values, along with a duration metric, were used for evaluating the efficiency of the total cost function. Besides the above 3 metrics, a duration metric based on the comparison of real and synthesized sentences was also used, as explained below.

- Total instances of vowels where the difference in the duration ratio of synthesized and real sentences is greater than $\epsilon_{dur}$.

The thresholds were chosen empirically by perceptual experimentation. In this case the considered thresholds were $\epsilon_{pc1} = 0.5\,\sigma_{PC1}$, $\epsilon_{\Delta PC1} = 0.5\,\sigma_{\Delta PC1}$, $\epsilon_{f0} = 0.25$ Barks and $\epsilon_{dur} = 150\%$. Throughout the development process, this method was applied for the tuning of the total cost weights whenever we modified the components of the target cost function or the concatenation cost function. The following weights were used for the total cost function for selection: $w = 1$, $w_{aj} = 0.943$ and $w_{vj} = 0.897$, where $w$, $w_{aj}$ and $w_{vj}$ are the weights assigned to the target, acoustic concatenation and visual concatenation cost functions respectively.
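The boundary counts can be sketched as below. This is an illustration only: the Hz-to-Bark conversion uses Traunmüller's approximation, which is an assumed choice since the text does not say which Bark formula was used, and the function names and all numeric values are hypothetical. The duration-ratio count is left out because its exact definition is not given here.

```python
def hz_to_bark(f_hz):
    """Traunmüller's approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def count_exceeding(left_values, right_values, threshold):
    """Count concatenation boundaries where |left - right| exceeds the threshold."""
    return sum(1 for a, b in zip(left_values, right_values) if abs(a - b) > threshold)


# Hypothetical corpus statistic and the thresholds derived from it.
sigma_pc1 = 2.0
eps_pc1 = 0.5 * sigma_pc1
eps_f0 = 0.25  # Barks

# Hypothetical values on the left and right side of three concatenation boundaries.
pc1_left, pc1_right = [0.1, 1.5, -0.3], [0.2, 0.2, 1.0]
f0_left_hz, f0_right_hz = [120.0, 200.0, 180.0], [122.0, 260.0, 181.0]

n_pc1 = count_exceeding(pc1_left, pc1_right, eps_pc1)
n_f0 = count_exceeding(
    [hz_to_bark(f) for f in f0_left_hz],
    [hz_to_bark(f) for f in f0_right_hz],
    eps_f0,
)
```

The same `count_exceeding` helper would apply to the first-order derivative of the first PCA coefficient with the threshold $0.5\,\sigma_{\Delta PC1}$; lower counts for a given set of cost weights suggest smoother concatenations.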


6.2 Human-centered evaluation

To evaluate our overall audio-visual speech synthesis system, the following perceptual intelligibility and subjective quality evaluation tests were conducted. A total of 39 participants between 19 and 65 years of age, with normal auditory and visual abilities, took part in these tests. Among the participants, 15 were female and the rest were male. All participants were native French speakers. The tests were conducted over the internet, where each participant heard and saw the stimuli on their own computer with the available hardware. A set of basic instructions was played at the beginning of these tests.

6.2.1 Intelligibility tests

The intelligibility test was at the word level. Each human subject was presented with 50 one- or two-syllable French words and asked to recognize and report the word. Some examples of the words that were presented include {anneau (ring), bien (good), chance (luck), pince (clip), laine (wool), cuisine (kitchen)}. Among these words, 11 are present in the corpus. These in-corpus words were included to set a benchmark for the best possible intelligibility achievable with the recorded data.

These tests were done at two levels: (1) acoustic-only speech, (2) audio-visual speech. In each of these categories, the acoustic speech component was degraded to two noise levels. Hence, each word was played 4 times: (1) acoustic-only with low noise (SNR of -6 dB), (2) acoustic-only with high noise (SNR of -10 dB), (3) audio-visual with low noise (SNR of -6 dB), (4) audio-visual with high noise (SNR of -10 dB). The addition of noise also ensures that the listener pays attention to the visual modality of speech. The aim is to evaluate both the visual and acoustic modalities, and also to estimate the advantage of audio-visual speech over acoustic-only speech. These noise thresholds were decided based on several audio-visual perceptual experiments, to strike a trade-off between these two objectives. The facial animation is shown as the 3D surface of the face using a sparse mesh, which has the dynamics of the facial deformations but lacks texture and color information (see Fig. 3.9). Besides, the information regarding the internal articulators, teeth and tongue, is also missing from the animations.
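Degrading speech to a target SNR amounts to scaling the noise so that the ratio of speech power to noise power matches $10^{SNR/10}$. The sketch below shows that computation under stated assumptions: the function names and the toy sample values are hypothetical, and the kind of noise used in the tests is not specified here.

```python
import math


def mean_power(samples):
    """Average power of a sampled signal."""
    return sum(s * s for s in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain g such that 10*log10(P_speech / P_(g*noise)) equals snr_db."""
    return math.sqrt(mean_power(speech) / (mean_power(noise) * 10 ** (snr_db / 10)))


def mix_at_snr(speech, noise, snr_db):
    """Add scaled noise to speech so that the mixture has the requested SNR."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical toy signals of equal length.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
low_noise = mix_at_snr(speech, noise, -6.0)    # low-noise condition
high_noise = mix_at_snr(speech, noise, -10.0)  # high-noise condition
```

A negative SNR means the noise carries more power than the speech: roughly 4 times more at -6 dB and 10 times more at -10 dB.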

Table 6.2 includes the intelligibility scores in terms of the fraction of the total words recognized in each of the four categories by different users. Table 6.1 shows the mean intelligibility
