
Chapter 2

2.1 Percentage distributions of tongue shapes by context and country based on data presented in Delattre and Freeman (1968) . . . 51
2.2 Simplified summary of temporal and spatial patterns in English /r/ and /l/ . . . 53

Chapter 4

4.1 Stimuli and fillers from Experiment 1 . . . 109
4.2 Participant demographics from production experiments . . . 114
4.3 Observed tongue configurations divided into three categories ordered from most bunched to most retroflex . . . 127
4.4 Output of a linear mixed-effects regression model predicting lip protrusion . . . 134
4.5 Mean formant values and their standard deviations for all tongue configurations in women . . . 137
4.6 Mean F3 values and their standard deviations for /r/ according to the following vowel . . . 140
4.7 Output of a linear mixed-effects regression model predicting F3 . . . 142
4.8 Mean F2 values and their standard deviations for /r/ according to the following vowel . . . 143
4.9 Output of a linear mixed-effects regression model predicting F2 . . . 145
4.10 Output of a generalised mixed-effects logistic regression predicting hyperarticulation . . . 159


Chapter 5

5.1 Test words from Experiment 2 . . . 174
5.2 Evaluation metrics for semantic segmentation using a Convolutional Neural Network (CNN) . . . 180
5.3 Global evaluation metrics for semantic segmentation of the mouth from front camera images using a CNN . . . 180
5.4 Class evaluation metrics for semantic segmentation of the mouth from front camera images using a CNN . . . 180
5.5 Ellipse measures and their corresponding lip dimensions resulting from automatic semantic segmentation of the lips using a CNN . . . 181
5.6 Mean formant values and their standard deviations for /w/ and /r/ in female subjects . . . 184
5.7 Output of a generalised linear mixed-effects model predicting the probability a token is a /w/ according to the first three formants . . . 186
5.8 Mean and standard deviation percentage change from a neutral lip posture in lip protrusion, width and height for /w/ and /r/ according to manual lip measures . . . 187
5.9 Output of a generalised linear mixed-effects model predicting the probability a token is a /w/ according to hand-measured lip dimensions . . . 188
5.10 Mean and standard deviation lip dimensions for /w/ and /r/ from automatic semantic segmentation using a CNN . . . 190
5.11 Output of a generalised linear mixed-effects model predicting the probability a token is a /w/ according to the lip dimensions acquired automatically from semantic segmentation using a CNN . . . 192

Chapter 6

6.1 Participant demographics from the perception experiment . . . 218
6.2 Experiment 3 test words . . . 219


6.3 Mean formant values and their standard deviations for /r/ and /w/ produced by the speaker who supplied stimuli for Experiment 3 and by the speakers in Experiment 2 . . . 235
6.4 Mean lip dimensions and their standard deviations for /r/, /w/ and a neutral lip setting in the speaker who supplied stimuli for the perception experiment . . . 236
6.5 Raw stimulus-response confusion matrices for the identification of /r/, /w/ and /l/ in unimodal and congruous audio-visual modalities . . . 237
6.6 Categorisation of hits, misses, false alarms and correct rejections in the /r/-/w/ and /w/-/r/ stimulus-response pairs . . . 239
6.7 Summary statistics for sensitivity, bias and the proportion of correct responses in each contrast (/l/-/w/, /l/-/r/, /r/-/w/) in each presentation modality . . . 240
6.8 Output of a generalised linear mixed-effects model predicting the probability a token is accurately identified . . . 243
6.9 Post-hoc pairwise comparisons of the significant interaction between Stimulus and Modality on identification accuracy from a generalised linear mixed-effects model . . . 243
6.10 Output of a linear mixed-effects model predicting perceptual sensitivity . . . 246
6.11 Post-hoc pairwise comparisons of the significant interaction between Contrast and Modality on perceptual sensitivity from a linear mixed-effects model . . . 247
6.12 Confusion matrices presenting responses to incongruent audio-visual trials . . . 251
6.13 Output of a generalised linear mixed-effects model predicting the probability of a visual response in incongruous audio-visual stimuli . . . 253

Appendix B

B.1 Experiment 3 filler and control words . . . 300
B.2 Experiment 3 test words presented in the auditory-only modality for Group 1 and in the visual-only modality for Group 2 . . . 301
B.3 Experiment 3 test words presented in the auditory-only modality for Group 2 and in the visual-only modality for Group 1 . . . 301
B.4 Experiment 3 test words presented in the congruous audio-visual modality for Group 1 . . . 301
B.5 Experiment 3 test words presented in the congruous audio-visual modality for Group 2 . . . 302
B.6 Experiment 3 test words presented in the incongruous audio-visual modality for both groups (Groups 1 and 2) . . . 302

Abbreviations

AAA Articulate Assistant Advanced
CNN Convolutional Neural Network
CU Curled Up
DNN Deep Neural Network
EMA Electromagnetic Articulography
EMG Electromyography
FB Front Bunched
fps frames per second
FU Front Up
H&H Theory ‘Hyper’- and ‘Hypo’-articulation Theory
MB Mid Bunched
MRI Magnetic Resonance Imaging
SNR Signal-to-Noise Ratio
SSBE Standard Southern British English
TU Tip Up
UTI Ultrasound Tongue Imaging


Glossary

American English The rhotic variety of English spoken in North America.

Anglo-English The non-rhotic variety of English spoken in England.

approximant A consonant whose articulators approach each other but not to such an extent as to create turbulent airflow.

bunched An articulation whose primary constriction occurs at the tongue dorsum. The tongue tip is generally lowered.

clear speech (or hyperspeech) Speech produced with the goal of improving intelligibility in the listener.

covert articulations Articulations which are visibly different from one another but do not produce an audible difference. Covert articulations are therefore not perceptible or recoverable from listening to the auditory signal alone.

endolabial A type of close lip rounding, termed by Catford, which is produced with the inner surfaces of the lips. This type of rounding is associated with back vowels such as [u] and the semi-vowel [w] and is equivalent to our label horizontal labialisation. Another equivalent term is inner rounding, coined by Sweet. As Trask describes in his Dictionary of Phonetics and Phonology, outrounding is also an unfortunate synonym.

exolabial A type of lip rounding, termed by Catford, which is produced with the outer surfaces of the lips. This type of rounding is associated with front vowels such as [y] and is equivalent to our label vertical labialisation. Another equivalent term is outer rounding, coined by Sweet. As Trask describes in his Dictionary of Phonetics and Phonology, inrounding is also an unfortunate synonym.

fiducial A fixed line used as a basis of reference and measure.

focalisation The convergence of neighbouring formants in the spectrum of a vowel, resulting in spectral prominence in that focalised region. Vowels which exhibit focalisation are known as focal vowels and are generally considered to be more perceptually salient than their non-focal counterparts (Schwartz, Abry, Boë, Ménard, & Vallée, 2005).

horizontal labialisation A type of labialisation generally associated with back vowels. The lips are pouted by drawing the lip corners together to form a small, round opening.

hyperarticulation A type of clear speech which helps the listener to retrieve and decode phonetic cues. At the segmental level, hyperarticulation may involve modifications to articulation with the goal of enhancing the phonetic contrasts between sounds.

hypercorrection Proposed by Ohala in his perception-oriented account of sound change: the phonetically experienced listener erroneously corrects for acoustic variation from the speaker, resulting in misperception. This scenario may trigger sound change when the listener turns speaker.

hypocorrection Proposed by Ohala in his perception-oriented account of sound change: the listener takes the acoustic signal at face value and fails to correct for phonetic variation, resulting in misperception. This scenario may trigger sound change when the listener turns speaker.


intrusive /r/ A type of /r/-sandhi and an extension of linking /r/ in which /r/ is pronounced at the end of words which do not end with an etymological or orthographic /r/ (e.g., saw it [sO:ô It]).

labialisation A secondary labial articulation occurring in consonants and vowels, resulting in a reduction in the overall lip area.

linking /r/ A type of /r/-sandhi in which /r/ is pronounced in words which end with an etymological and orthographic /r/ (e.g., car and driver [kA:ô @n "dôaIv@]).

lip protrusion A type of labialisation which may accompany both horizontal labialisation and vertical labialisation. The lips are pushed forward, extending the length of the vocal tract.

magnetic resonance imaging (MRI) A tool for speech production research which provides dynamic images of the vocal tract in its entirety, although constriction generally images rather poorly. Recent advances in technology at the University of Southern California have increased the spatiotemporal resolution and quality of the data, capturing videos at around 83 fps, which is a dramatic increase from the previous 23 fps obtained in their earlier MRI datasets (as discussed in Toutios et al., 2016).

McGurk Effect A perceptual illusion occurring in incongruous audio-visual stimuli presented in the laboratory, in which the listener reports hearing neither the auditory nor the visually presented sound, but a combination of the phonetic properties of the two, e.g., auditory /ga/ combined with visual /ba/ is perceived as /da/.

motor equivalence The ability to use a variety of movements to achieve the same goal under different conditions. In speech, different vocal tract shapes may be employed to achieve the same acoustic goal. For example, the primary acoustic cue of the vowel /u/ is a low second formant, which may be produced with a narrow constriction at the lips and/or at the palate. Perkell, Matthies, Svirsky, and Jordan (1993) observed a negative correlation between the two constrictions: if the palatal constriction is too large, the labial constriction compensates by narrowing, and vice versa. This negative correlation corresponds to a phonetic trading relation.

non-rhotic A variety of English in which /r/ may be pronounced only directly before a vowel.

perceptual compensation Proposed by Ohala in his perception-oriented account of sound change: the listener factors out phonetic variation from the speaker and successfully reconstructs the speaker’s intended phoneme. Perceptual compensation prevents sound change from occurring.

perceptually salient Although multiple phonetic cues may be used to distinguish one sound from another, a perceptually salient cue is one which provides particularly important information to the listener about the identity of the sound in question. Listeners are more sensitive to salient cues than to less salient ones; as a result, manipulations of salient speech cues have a substantial impact on perception, whereas changes to less salient cues have comparatively little effect.

/r/-sandhi A hiatus-filling (or linking) phenomenon which is generally associated with non-rhotic Englishes, occurring at word boundaries in connected speech. In non-rhotic varieties, /r/ is only pronounced when directly followed by a vowel. /r/-sandhi is the name given to a realisation of /r/ which is not normally pronounced in an isolated word (e.g., car [kA:]), but is realised in connected speech when directly followed by a word beginning with a vowel (e.g., car and driver [kA:ô @n "dôaIv@]). A distinction is made between two sub-phenomena of /r/-sandhi: linking /r/ and intrusive /r/.

Received Pronunciation The accent traditionally considered the prestige standard in England.

retroflex An articulation whose primary constriction occurs at the tongue tip. The tongue dorsum is generally lowered.

rhotic A variety of English in which /r/ may be pronounced in all syllable contexts.


semantic segmentation A type of image classification which involves the training of a Convolutional Neural Network (CNN) to classify each pixel in an image according to a predefined set of classes.
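The per-pixel classification described in this entry can be sketched as follows. This is purely an illustration, not taken from the thesis: the class names and scores below are invented, and a real system would take the scores from a trained CNN rather than hard-coding them.

```python
# Illustrative sketch only: semantic segmentation assigns every pixel in
# an image to one of a predefined set of classes. A trained CNN produces
# a score per class for each pixel; the predicted label is simply the
# highest-scoring class.

CLASSES = ["background", "lips", "teeth"]  # hypothetical label set

# Fake per-pixel class scores for a 2x2 image: scores[row][col] is a
# list with one score per class (stand-ins for CNN outputs).
scores = [
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
    [[0.2, 0.7, 0.1],   [0.3, 0.3, 0.4]],
]

def segment(scores):
    """Return a label map: the highest-scoring class name for every pixel."""
    return [[CLASSES[max(range(len(px)), key=px.__getitem__)] for px in row]
            for row in scores]

print(segment(scores))  # [['background', 'lips'], ['lips', 'teeth']]
```

The evaluation metrics listed in Tables 5.2-5.4 (e.g., per-class accuracy) are then computed by comparing such a predicted label map against a hand-annotated one.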

singular fit A warning message occurring in linear mixed models, which is generally indicative of overfitting. It often occurs when the random-effects structure is too complex to be supported by the data.

sublaminal Associated with extreme retroflex tongue shapes: the underside of the tongue tip forms the main palatal constriction.

sublingual space Generally associated with apicals, particularly alveolar, dental and retroflex ones: a space or cavity formed underneath the tongue when the tongue tip is raised towards the palate.

sulcalization (or tongue-dorsum concavity) Associated with bunched tongue shapes; creates a visible concave dip in the midsagittal tongue surface.

trading relations When different articulatory manoeuvres reciprocally contribute to a perceptually important acoustic cue, these manoeuvres may covary in order to maintain the cue in question at a constant level. As a result, greater reliance on one of these manoeuvres is accompanied by less of another, and vice versa. See motor equivalence for an example.

vertical labialisation A type of labialisation generally associated with front vowels. The lips come together by raising the bottom lip and closing the jaw, resulting in a small, slit-like opening.

viseme A set of phonemes that have identical appearance on the lips, e.g., English /p/, /b/ and /m/.

visual capture A perceptual illusion occurring in incongruous audio-visual stimuli, in which the listener reports hearing the visually presented sound instead of the auditory one, e.g., auditory /ba/ paired with visual /va/ is perceived as /va/. Note the difference between visual capture and the McGurk Effect.

visual enhancement Speech perception is generally more accurate when listeners can both hear and see the speaker, as opposed to just listening to them. Visual enhancement is the advantage for audio-visual speech compared to auditory-only speech.