
Audio-visual speech perception 1

1.4 Visual cues and theories on the objects of speech perception

It is now widely accepted that the perception of speech is influenced by what we see as well as by what we hear. As a result, audio-visual speech perception has played a role in the ongoing debate over the objects of speech perception (Rosenblum, 2008a). Notably, researchers are divided on whether the mechanism for audio-visual integration is innate or whether it develops with linguistic experience. Proponents of gestural accounts of speech perception such as Motor Theory (Liberman & Mattingly, 1985) and Direct Realism (Fowler, 1986) have interpreted audio-visual integration as direct evidence that speech is represented as articulatory gestures (and not sounds). In their view, as speech is underlyingly represented as articulatory gestures, it is not surprising that speech perception is enhanced by visual cues to these gestures (Desjardins et al., 1997; Rosenblum, 2008a). On the other hand, supporters of less controversial auditory-based theories of speech perception (e.g., Diehl & Kluender, 1989; Massaro, 1987; Ohala, 1996; Stevens, 1989) suggest that visual speech input comes to be integrated with the acoustic input over the course of development as linguistic experience increases (Rosenblum, 2008a).

Given children’s lack of experience in comparison to adults, one way in which researchers have addressed the question of whether the underlying representation of visual speech requires linguistic experience to develop is to consider the perception of speech in young children and infants (Desjardins et al., 1997). However, as we will show, perceptual evidence from children is mixed and is therefore open to interpretation. Studies have found that pre-linguistic infants younger than 7 months are sensitive to the correspondence between the auditory and visual speech signals (P. Kuhl & Meltzoff, 1982; Patterson & Werker, 1999). Others have suggested that pre-linguistic infants show evidence of the McGurk Effect (Burnham & Dodd, 2004) and may use visual information about speech articulation to learn phoneme boundaries (Teinonen, Aslin, Alku, & Csibra, 2008). These results would therefore support an integrated, multimodal representation of articulatory and acoustic phonetic information at a very young age (Patterson & Werker, 1999).

However, as we briefly indicated in Section 1.2 (p. 11), a variety of researchers have observed that children are less sensitive to visual speech cues than adults, which would suggest that visual cues may not initially be well specified in children’s representations of speech. In the original demonstration of the McGurk Effect, in addition to adults, McGurk and MacDonald (1976) also considered the impact of visual speech cues on the perception of children aged 3-5 and 7-8 years. The number of non-auditory percepts (i.e., visual capture, fused and combination responses) was smaller in children than in adults in all stimulus contexts. These results have since been replicated in other studies. For example, Massaro (1984) found that children aged 4-9 years showed about half of the visual influence shown by adults in incongruous audio-visual combinations of /ba/ and /da/, and Desjardins et al. (1997) report nearly 60% less visual capture in incongruous audio-visual combinations of /ba, va, da, Da/ in children aged 3-5 years than in adults.

The fact that children benefit less from visual cues than adults has also been observed in congruous audio-visual speech. Ross et al. (2011) tested the audio-visual speech recognition abilities of typically developing children aged between 5 and 14 years and compared them to those of adults. They found that children benefited less from observing visual articulations of speech in noise and that this difference tended to become more pronounced as the amount of noise increased. Even children between the ages of 12 and 17 years performed less well than adults. As a result, Ross et al. (2011) concluded that visual enhancement of speech continues to increase until adolescence, and perhaps even into adulthood. Finally, Lalonde and Frush Holt (2015) examined developmental differences in the ability to use visually salient speech cues and visual phonological knowledge in 3- and 4-year-old typically developing children. They found that visual saliency contributed to audio-visual speech discrimination benefit in all age groups.

In a speech recognition task where participants listened to a word presented in noise and were asked to repeat it out loud, 4-year-olds’ and adults’ substitution errors were more likely to involve visually confusable phonemes in the audio-visual condition than in the auditory-only one, suggesting that they used visual phonological representations and knowledge to take advantage of visually salient speech cues. In contrast, 3-year-olds showed no evidence of this visual phonological knowledge in their substitution errors. As a result, Lalonde and Frush Holt (2015) concluded that there may be developmental differences in the mechanisms of audio-visual benefit.

1.4.1 The perception-production link

Given the results from the aforementioned studies, it seems that even very young infants are sensitive to visual information from speech, but that audio-visual speech perception and visual phonological representations take time and linguistic experience to fully form. This is perhaps not that surprising, as the same could be said for the development of auditory phonological representations of speech. But what is it about linguistic experience that makes audio-visual integration possible? Do underlying representations of audio-visual speech emerge from the experience of seeing speech, or does the experience of producing speech also play a role? This question has been addressed, once again, by looking at the perception and production of speech in children. Desjardins et al. (1997) tested the hypothesis that young children have not yet had the opportunity to fully specify their representations of visible speech because they have had less experience of correctly producing speech than adults have. They divided a group of 16 4-year-olds into two groups according to whether or not they made substitution errors for the consonants /T, D, b, d, v/ in their production. The results indicated that children who substitute are poorer lip-readers and are less influenced by the visual component in incongruous audio-visual syllables (i.e., they report less visual capture) than those who do not substitute. They concluded that the underlying representation of visible speech is mediated by a child’s ability to correctly produce consonants. As the authors remarked, their study does not address whether experience of producing speech is actually required for the establishment of an underlying representation that includes visual information. However, Desjardins et al. noted that, as very young infants’ percepts are influenced by visual speech cues despite their not being able to produce consonants themselves, experience of producing consonants cannot be absolutely essential.

While Desjardins et al. (1997) considered the impact of production on perception, other researchers have considered the impact of perception on production. It has been suggested that access to visual speech cues may help children acquire an adult-like articulation of certain speech sounds. It is generally agreed that children produce consonants with observable labial articulations, such as /p, b, m/, before non-labial consonants (Steinberg & Sciarini, 2013).

Lin and Demuth (2015) presented articulatory data for the acquisition of /l/ in 25 typically developing Australian English-speaking children aged 3;0 to 7;11. Onset /w/ was also included as a control. Lin and Demuth found that children’s /w/ productions were dominated by lip rounding, which they argued is due to the visual accessibility of the labial articulation in /w/ productions. In coda /l/, the most common articulation in children was vocalised, i.e., it was produced with a posterior lingual constriction accompanied by a labial constriction. An intermediate articulation between vocalised and adult-like coda /l/ was also observed, in which children drop the labial constriction and add or enhance the adult-like lingual constriction. Lin and Demuth speculated that lip rounding may be dropped during acquisition in accordance with visual feedback that a labial constriction is not typical of coda /l/. Visual cues of adult articulations may thus be utilised by children as visible feedback during the acquisition process.

Similarly, in congenitally blind speakers, it has been suggested that the lack of visible speech cues has an impact on both the perception and production of speech. Ménard, Dupont, Baum, and Aubin (2009) investigated the production and perception of Canadian French vowels in blind and sighted speakers and found that, while visually impaired speakers showed greater auditory acuity than sighted speakers, their vowel space was significantly smaller, perhaps due to a reduced magnitude of rounding contrasts. The authors interpreted these results as an indication that the availability of visible speech cues influences speech perception and production. In another study, Ménard, Trudeau-Fisette, Côté, and Turgeon (2016) observed that in clear speech, lip movements were larger in sighted speakers but not in visually impaired speakers, which again indicates that having access to visual cues influences the perception and production of speech.