
Audio-visual speech perception

1.3 Visual cues influence auditory perception

1.3.2 Visual capture

According to Alsius et al. (2018), another reason for the variability reported in the frequency of McGurk Effect illusions may be confusion surrounding the effect's exact definition. In the original study (McGurk & Macdonald, 1976), the effect results in either a fused percept or a combination of the auditory and visual cues. In the typical stimuli used in McGurk paradigms, where auditory /ba/ is combined with visual /ga/, perception responses of /da/ or /bga/ would thus be considered possible McGurk illusions. However, in instances in which the visual component overrides the auditory one, e.g., perceiving /ba/ in the context of auditory /ga/ paired with visual /ba/, some researchers have revised McGurk and Macdonald's original definition to incorporate these visual responses as possible manifestations of the McGurk Effect, on the grounds that they are visually influenced (e.g., Colin et al., 2002; Rosenblum & Saldaña, 1992; Sams, Manninen, Surakka, Helin, & Kättö, 1998, as cited in Alsius et al., 2018). To distinguish between illusory audio-visual responses and purely visual ones, however, other researchers have avoided the McGurk Effect terminology and instead employed the term visual capture.2

2 The term visual dominance also appears in the literature.


Visual capture, a less well-known illusion than the McGurk Effect, occurs when listeners perceiving incongruous audio-visual speech report hearing the visually presented sound instead of the auditory one (Mattheyses & Verhelst, 2015). This effect is arguably even more dramatic than the McGurk Effect, as 'it is in visual capture that the impact of the visible articulation of speech on the resulting percept is most obvious' (Desjardins et al., 1997, p. 86).

It has been remarked that for visual capture to occur, the phonetic cues in the visual signal need to be more perceptually salient than the ones in the acoustic signal (Masapollo, Polka, & Ménard, 2017). When adults aged 18-40 years were presented with incongruous auditory /ga/ paired with visual /ba/ and auditory /ka/ paired with visual /pa/, McGurk and Macdonald (1976) found higher proportions of visual responses (31% and 37%, respectively) than auditory ones (11% and 13%, respectively), indicating that visual capture occurred in some subjects in these contexts. On the other hand, in the opposing incongruous pairings (i.e., auditory /ba/ paired with visual /ga/ and auditory /pa/ paired with visual /ka/), fused percepts were much more common and visual responses were extremely rare. This disparity is probably due to the fact that the labial articulation for /p/ and /b/ is more visually salient than that of /k/ and /g/.

In a later study by McGurk (1981), adult subjects were presented with auditory /ba/ paired with visual /ba, va, ða, da, za, ga/. In the case of the three most frontal articulations, /ba, va, ða/, the ones with clearly visible articulations, there was complete visual capture (cited in Werker, Frost, & McGurk, 1992). As a result, Werker et al. (1992) state:

in bimodal speech perception, when the visible articulation – the viseme [...] – unambiguously specifies a particular place of articulation, visual capture can be anticipated. On the other hand, where the viseme is associated with a range of possible places of articulation, visual bias (as shown in “blends”) is the more likely result. (p. 553)

Indeed, as far as we are aware, high rates of visual capture have never been reported in cases where the place of articulation is not visible.3 McGurk (1981) reported some instances of visual capture occurring for visual /da/ paired with auditory /ba/, although the fused percept of /va/ was much more likely. Moreover, we know of no existing study which presents evidence of visual capture occurring in vowels, although the McGurk fusion effect has been shown to take place (e.g., Traunmüller & Öhrström, 2007). Based on observations from previous studies, it seems, then, that visual capture may be anticipated when the phonetic cues in the visual signal are more perceptually salient than the ones in the acoustic signal (i.e., for certain place-of-articulation cues in consonants) and when the visual cue unambiguously specifies the phoneme under presentation (i.e., is a viseme), making visual capture with vowels arguably unlikely.

3 Visible articulations generally include labial and dental articulations.

Visual dominance over the auditory modality has been shown to occur with non-speech signals too, indicating that there may be an underlying bias to pay special attention to visual cues more generally. One of the most famous examples is the Colavita visual dominance effect (Colavita, 1974). The basic experimental paradigm involves presenting subjects with a random order of (non-speech) auditory, visual and audio-visual stimuli; subjects are instructed to make one response whenever they see a visual target and another whenever they hear an auditory target. For example, participants press one button in response to an auditory stimulus and another button in response to a visual one. In the original experiment (Colavita, 1974), participants were not informed that the auditory and visual stimuli might occur together, while in more recent studies participants were explicitly told that trials containing both modalities might occur and that, in these instances, they should press both the auditory and the visual buttons (e.g., Koppen & Spence, 2007). Regardless of how informed participants were, many studies have shown that while subjects respond to unimodal auditory and visual trials without difficulty, they fail to respond to auditory targets when auditory and visual targets are presented at the same time (Spence, 2009). Subjects generally respond to bimodal audio-visual tokens with the visual response only.
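The trial logic of this paradigm can be sketched as a small simulation. The function names, trial counts and response coding below are our own illustrative assumptions, not details of Colavita's (1974) actual procedure; the sketch only shows how a trial list is randomized and how the signature failure on bimodal trials would be scored.

```python
import random

def make_trial_list(n_unimodal=40, n_bimodal=10, seed=0):
    """Build a randomly ordered trial list: mostly unimodal trials,
    with a smaller number of bimodal (audio-visual) trials mixed in.
    Counts are illustrative, not taken from the original study."""
    trials = (["auditory"] * n_unimodal
              + ["visual"] * n_unimodal
              + ["bimodal"] * n_bimodal)
    random.Random(seed).shuffle(trials)
    return trials

def colavita_rate(trials, responses):
    """Proportion of bimodal trials answered with the visual button only --
    the signature of the Colavita visual dominance effect.  Each response
    is coded as a set of pressed buttons, e.g. {"visual"} or
    {"auditory", "visual"}."""
    bimodal = [r for t, r in zip(trials, responses) if t == "bimodal"]
    if not bimodal:
        return 0.0
    return sum(r == {"visual"} for r in bimodal) / len(bimodal)
```

A participant showing the full effect would answer every bimodal trial with the visual button alone, yielding a `colavita_rate` of 1.0; a participant following the instructions perfectly (pressing both buttons on bimodal trials) would score 0.0.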

In the original study by Colavita (1974), subjects reported that they had not noticed that the experiment contained bimodal audio-visual tokens as well as unimodal ones. Hecht and Reiner (2009) considered multimodal presentations of various senses, including vision, audition and touch. Interestingly, they found the same visual dominance effect in bi-sensory visual-tactile stimuli, but no bias towards either modality in bi-sensory audio-tactile stimuli, suggesting that dominance may be specifically visual in nature (Spence, 2009).

1.4 Visual cues and theories on the objects of speech perception