1.2 Visual cues enhance auditory perception

A large body of research has shown that speech perception is more accurate when listeners can both hear and see a speaker than when they can only listen to them. One of the first and most widely cited studies to explicitly demonstrate the utility of visual cues in the perception of speech was that of Sumby and Pollack (1954). In this study, a large cohort of participants (n = 129) was asked to identify bi-syllabic words produced by a speaker seated in front of them. White noise was presented to the subjects through a headset at intensity levels corresponding to Signal-to-Noise Ratios (SNRs) ranging from 0 dB to -30 dB. Half of the subjects faced away from the speaker, while the other half watched the speaker’s facial movements. In the absence of noise, subjects correctly identified nearly all of the bi-syllabic words in both the auditory-only and the audio-visual conditions, with no obvious difference in performance between the two conditions.
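For reference (this formulation is added here for clarity and is not part of Sumby and Pollack’s own description), the SNR in decibels relates the power of the speech signal, $P_s$, to that of the masking noise, $P_n$, as
\[
\mathrm{SNR_{dB}} = 10 \log_{10}\!\left(\frac{P_s}{P_n}\right),
\]
so that 0 dB corresponds to speech and noise of equal power, while -30 dB corresponds to noise roughly a thousand times more powerful than the speech.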

However, as the SNR decreased, i.e., as the speech signal became less audible, the visual cues played an increasingly critical role in allowing subjects to accurately identify spoken words. Indeed, their results showed that adding the visual cue was equivalent to improving the SNR by 15 dB. As a result, Sumby and Pollack concluded that visual speech cues contribute the most to speech intelligibility in noisy conditions.
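To put this figure in perspective (an interpretation added here rather than one reported by the authors), an improvement of 15 dB corresponds to a power ratio of $10^{15/10} \approx 31.6$; that is, with the speaker’s face visible, listeners reached the same identification accuracy under noise roughly thirty times more powerful relative to the speech signal.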

The advantage for audio-visual speech compared to auditory-only speech, frequently known as visual enhancement, has since been replicated countless times. It is now widely accepted that visual speech is one of the most robust cues that people use when listening to speech in noisy environments (Lalonde & Werner, 2019). Seeing the speaker’s face even from very far away (e.g., 30 m) has been shown to improve auditory speech recognition (Jordan & Sergeant, 2000). Speech perception may also be enhanced by visual cues in optimal listening conditions.

Reisberg, McLean, and Goldfield (1987) demonstrated that vision enhances the perception of speech in a foreign language, of speech produced by a non-native speaker, and of semantically complex utterances (cited in Dohen, 2009). However, perceptual performance varies, and the degree of sensitivity to visual speech cues has been linked to factors related to the perceiver’s linguistic experience and development, age and sex, as well as to the style of speech and the visual salience of the cues presented. Visual cues are reported to be less beneficial to speech intelligibility in typically developing children than in adults (Desjardins, Rogers, & Werker, 1997; Lalonde & Werner, 2019; Ross et al., 2011), although older adults have been found to show no difference in their ability to perceive audio-visual speech in noise relative to younger adults (Smayda, Van Engen, Maddox, & Chandrasekaran, 2016; Sommers, Tye-Murray, & Spehar, 2005).

Both children and adults with developmental disorders such as dyslexia have been shown to present deficits in their ability to gain from visual speech information relative to those without learning disorders (e.g., van Laarhoven, Keetels, Schakel, & Vroomen, 2018). Women have also been shown to be more sensitive to visual cues than men in some studies (e.g., Dancer, Krain, Thompson, Davis, et al., 1994; Traunmüller & Öhrström, 2007; Watson, Qiu, Chamberlain, & Li, 1996), although in others, effects of speaker sex have not been reliably observed (e.g., Auer & Bernstein, 2007; Tye-Murray, Sommers, & Spehar, 2007). Shaywitz et al. (1995) examined brain activation in male and female participants during orthographic, phonological and semantic language tasks and found that their activations differed significantly. They concluded that their data provide evidence for a sex difference in the functional organisation of the brain for language, including phonological processing. Differences in brain activity may thus account for the reported female advantage in lip reading and visual enhancement in audio-visual speech (Desjardins & Werker, 2004).

In a way, the perception of non-native sound contrasts could be considered on a par with the perception of native sounds in noisy conditions, as it puts non-native perceivers at a disadvantage relative to native ones. Just as the benefits of visual cues vary in the perception of native speech sounds in noise, so do the results from studies assessing the benefits of visual cues in non-native speech perception. Pereira (2013) compared the sensitivity to visual cues in the perception of English vowels by Spanish learners with that of native English speakers in auditory-only, visual-only and audio-visual modalities. The results indicated that while the native speakers performed better in the audio-visual modality than in the auditory-only one, no significant difference was observed between the two modalities for the Spanish learners. However, in the visual-only modality, the learners could use visual speech cues to some extent, but they failed to integrate visual information with the auditory input in the audio-visual condition.

In contrast, Navarra and Soto-Faraco (2007) found that while Spanish-dominant bilinguals could not distinguish the /e/-/ɛ/ contrast in Catalan in auditory-only presentation, when presented with the accompanying visual cues, their discrimination not only improved but also no longer differed significantly from that of Catalan-dominant bilinguals. Furthermore, it has been suggested that the perceiver’s native language may affect sensitivity to non-native visual speech. For example, Hazan et al. (2006) found that Spanish learners showed much greater sensitivity to visual cues than Japanese learners in their audio-visual perception of the non-native labial/labiodental contrast in English, although perception did improve with the presence of visual cues in both learner groups.

Research has also linked the observed variation in the benefit of visual cues to the perceptual salience of the speech cues under presentation. In a second experiment, Hazan et al. (2006) examined the perception of the /l/-/r/ contrast in learners of English and found that neither Korean nor Japanese learners showed evidence of making use of visual cues in their perception of the contrast. The authors suggested that this lack of visual enhancement is due to the fact that the /l/-/r/ contrast is not particularly visually salient. Similar results have been observed in the perception of native speech contrasts. Traunmüller and Öhrström (2007) observed a difference in the visual enhancement effect between lip rounding and mouth opening in the perception of Swedish vowels. They presented Swedish subjects with auditory, visual and audio-visual nonsense syllables containing rounded and non-rounded vowels of different heights, in optimal listening conditions. They found that subjects relied more heavily on visual cues for vowel rounding than for vowel height, which they concluded may be due to the fact that lip rounding is more visually salient than mouth opening. As a result, Traunmüller and Öhrström suggested that the perception of any given feature is dominated by the modality which provides the most reliable information. In their data, unlike for contrasts involving height, the visual modality was more salient than the acoustic one for rounding, which explains the improved perceptual performance with the presence of visual cues.

Various studies have demonstrated that speakers may well be aware of the benefits of producing visually salient cues to improve their speech intelligibility. It has been suggested that speakers adapt their articulation in noisy environments as an intentional communication strategy to facilitate the transmission of the speech signal to the listener (e.g., Fitzpatrick, Kim, & Davis, 2015). Speech adaptations in noise, known as Lombard speech (Lombard, 1911), may result in changes to both acoustic (e.g., Junqua, 1993) and visual speech cues (Fitzpatrick et al., 2015), and studies have shown that these changes can make speech more intelligible to listeners (e.g., Gagné et al., 2002; Van Summers, Pisoni, Bernacki, Pedlow, & Stokes, 1988). With regard to articulation, clear speech has been shown to present more salient visual cues, with larger and more extreme articulatory movements, including increased lip protrusion and jaw movement (Tang et al., 2015), although strategies may be speaker-specific and not all speakers make use of the visual modality to improve their speech intelligibility in noise (Garnier, Ménard, & Alexandre, 2018). It has also been observed that clear speech improves speech intelligibility to a greater extent in audio-visual than in auditory-only presentation (Kim, Sironic, & Davis, 2011; Van Engen, Phelps, Smiljanic, & Chandrasekaran, 2014), suggesting that the enhanced articulatory gestures made when speaking in noise may serve to make speech more visually intelligible.

Finally, it is worth pointing out that although there is an assumption that the less auditory information is available to listeners, the more they will rely on visual cues, research has shown that this is not necessarily the case. Early studies of speech perception in noise indicated that visual cues benefit speech perception the most in the noisiest conditions (e.g., Erber, 1975; Sumby & Pollack, 1954). However, maximal benefit may actually occur midway between extreme noise and no noise at all. Contrary to past studies, Ross et al. (2011) used a large word list to which participants were not exposed prior to the experiment. Their results indicated that word recognition is considerably poorer at low SNRs than previously shown. The maximal gain from audio-visual stimulation was found at an SNR of around -12 dB, where performance was up to three times higher relative to auditory-only presentation. They concluded that maximum audio-visual multisensory integration occurs between the extremes where subjects have to rely mostly on lip reading (-24 dB) and where information from articulation is largely