
1 Audio-visual speech perception

1.1 Information provided by the visible articulators

Although non-facial movements involving the head, hands and, in some respects, the entire body are used meaningfully in face-to-face communication, whether in signed or in spoken languages, we will focus on the articulatory information provided by the face, and more specifically the lips, and on the role it plays in the perception of speech sounds. Indeed, most research on visible speech signals concentrates on the movements of the lower face, which convey the primary articulatory cues to speech events (Brooke, 1998). However, for the sake of completeness, we wish to mention that certain body movements which are not directly related to speech articulation have been shown to convey prosodic cues that supplement the auditory ones. For example, movements of both the head and eyebrows provide visual prosodic cues, or ‘visual prosody’ (Graf, Cosatto, Strom, & Huang, 2002), involved in stress, prominence, rhythm and phrasing (e.g., Cvejic, Kim, Davis, & Gibert, 2010; Granström, House, & Lundeberg, 1999; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Scarborough, Dmitrieva, Hall-Lew, Zhao, & Brenier, 2007).

Turning our attention to primary articulatory cues, at the most basic level, seeing the movements of the visible articulators, i.e., the lips, jaw, face, tongue tip and teeth (Badin, Tarabalka, Elisei, & Bailly, 2010), indicates to the listener that a person is speaking. This visual information is particularly useful in noisy conditions, where knowing when a person is speaking enables listeners to direct their attention to the target signal. Fitzroy et al. (2018) recorded electroencephalography (EEG) data and compared the auditory evoked potentials elicited by acoustic onsets in attended and unattended live speech in a room with multiple live speakers. Their results indicated that a visible talker is both easier to attend to and harder to ignore perceptually than an unseen one. In noisy conditions, seeing that a person is speaking has also been found to aid the segmentation of multiple auditory streams (Castellanos, Benedí, & Casacuberta, 1996, cited in Peelle and Sommers 2015).

However, the contribution of the visual speech cues generated by a talker in face-to-face interactions goes far beyond simply facilitating attention to the speaker. By presenting information about the position of a speaker’s articulators, visible speech gestures may provide cues to the place of articulation of vowels and to the place and manner of articulation of consonants (Summerfield, 1983, cited in Hazan et al. 2006). Visual cues to place of articulation may be particularly beneficial when the auditory conditions are degraded, e.g., due to hearing loss or environmental noise. Because the acoustic cues for place of articulation are easily masked in noise, visual cues may actually be more robust than acoustic ones in some cases (Brooke, 1998). The availability of place information in the visual signal thus provides a source of information complementary to the auditory one (Peelle & Sommers, 2015) and may allow for enhanced perception of phonetic contrasts which are not very audible but are very visible, such as [m]-[n].

Contrary to the cues for place of articulation, cues for manner of articulation and voicing are not very visible but are very audible. As a result, Summerfield (1983) suggested that there is ‘a fortunate complementary relationship between what is lost in noise or impairment, and what can be provided by vision’ (p. 183), which allows people with hearing impairments and people communicating in noisy conditions to supplement their perception of speech with lip reading.

However, when the acoustic information from speech is masked or removed entirely and perceivers have to rely solely on visual cues, speech perception performance is heavily reduced. For example, in cases of profound or total hearing loss, very few people are capable of understanding speech fluently by lip reading alone (Summerfield, Bruce, Cowey, Ellis, & Perrett, 1992). This is no doubt due to the ambiguous nature of the information provided by visual speech. In auditory speech, the phoneme is considered the minimal unit of contrast in the sound system of any given language: replacing one phoneme with another changes the meaning of the spoken word. The equivalent of the phoneme in the visual domain is the viseme (Fisher, 1968). Although its definition is somewhat disputed, Bear, Harvey, Theobald, and Lan (2014) have provided a working definition which states that a viseme is a set of phonemes that have an identical appearance on the lips. Therefore, although each phoneme belongs to only one viseme class, many phonemes may share the same viseme. For example, while the acoustic difference between realisations of /p/ and /b/ in English is readily perceptible due to contrasts in voice onset time, visually they are almost identical (Peelle & Sommers, 2015). Consequently, this many-to-one mapping between phonemes and visemes results in perceptual ambiguity in visual speech cues. At present, agreement has yet to be reached concerning the exact number of visemes in English, perhaps due to inter- and intra-speaker variation. Indeed, Bear et al. (2014) reviewed the phoneme-to-viseme maps for consonants presented in 15 previous studies, and the number of visemes ranges from 4 to 10. Even at the most liberal estimate of 10, there are evidently far fewer consonant visemes than there are consonant phonemes in English, and the same can be said for the vowels. However, as Peelle and Sommers (2015) explained, although visual speech cues do not offer additional information beyond auditory-only speech for every phoneme, in many cases visual cues may help disambiguate similar-sounding speech sounds.
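To make the many-to-one phoneme-to-viseme relationship concrete, the short Python sketch below groups a handful of English consonant phonemes into viseme classes. The particular grouping and labels are assumptions chosen purely for illustration, since, as noted above, published maps disagree on the exact classes; only the bilabial grouping of /p/, /b/ and /m/ reflects the example discussed in the text.

    # Illustrative sketch only: one possible phoneme-to-viseme grouping for a
    # few English consonants. The class labels and groupings are assumptions;
    # published maps report between 4 and 10 consonant visemes.
    PHONEME_TO_VISEME = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",   # lips fully closed
        "f": "labiodental", "v": "labiodental",              # lower lip to teeth
        "t": "alveolar", "d": "alveolar", "n": "alveolar",
        "k": "velar", "g": "velar",
    }

    def visually_confusable(phoneme_a: str, phoneme_b: str) -> bool:
        """Phonemes sharing a viseme look (near-)identical on the lips."""
        return PHONEME_TO_VISEME[phoneme_a] == PHONEME_TO_VISEME[phoneme_b]

    # /p/ and /b/ differ acoustically (voice onset time) but share a viseme:
    print(visually_confusable("p", "b"))   # True
    print(visually_confusable("p", "t"))   # False

The mapping is a plain dictionary: looking up a phoneme always yields exactly one viseme class, while several phonemes map onto the same class, which is precisely the many-to-one structure that makes visual-only speech ambiguous.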