
1.5 Visual cues and spoken language evolution and change

1.5.2 Sound change

We propose that Rosenblum’s (2008a) evolutionary account of multimodal speech perception could also be extended to diachronic sound change. If the perception of speech is inherently multimodal, visual cues may also be implicated in sound change. However, models of sound change tend to neglect the role of visual cues, focusing instead on auditory ones. While the best-known models of sound change are divided on whether the listener or the speaker initiates sound change, they generally converge on the notion that speech is perceived from the auditory signal alone. In this section, we will briefly describe two of the best-known models of sound change, categorised as perception-oriented (i.e., listener-based) and production-oriented (i.e., speaker-based) accounts, and will present empirical evidence which suggests that visual cues may well play a part in sound change and in the shaping of the sound systems of the world’s languages.

The most famous perception-oriented account of sound change is provided by Ohala, who asserts that the main source of variation in speech, and hence the driving force behind sound change, is the misperception of the acoustic signal by the listener (e.g., Ohala, 1981). In his view, much of the variation which underpins the acoustic speech signal is phonetically predictable. When the phonetically experienced listener is able to factor out this variation, sound change does not occur. In contrast, sound change can be triggered when the listener takes the acoustic signal at face value and fails to apply their phonetic knowledge of how speech sounds interact in perception (Chitoran, 2012). When the listener turns speaker, they may thus produce a new form, different to the one intended by the original speaker; Ohala termed this hypocorrection. Another scenario which may result in sound change, labelled hypercorrection, occurs when the listener performs an erroneous correction of the acoustic speech signal, again resulting in a new form in their production.

Ohala (1981) provided the example of the vowel /u/, which may be subject to assimilation in the context of an adjacent anterior consonant such as /t/; for example, /ut/ may have the surface form [yt]. In the case of hypocorrection, as schematised in Figure 1.2a, the listener fails to reconstruct [yt] as the intended /ut/: the signal is interpreted as /yt/ and then, when the listener turns speaker, is produced as [yt], triggering a sound change. In the case of hypercorrection, schematised in Figure 1.2b, the speaker intends to produce /yt/ and does so appropriately, resulting in the surface form [yt]. The listener incorrectly reconstructs the intended /yt/ as /ut/, given their phonetic knowledge of assimilation in this particular context, which results in a production of [ut] when it is the listener’s turn to speak. As Chitoran (2012) pointed out, both of these scenarios imply a mismatch between production and perception in the listener.
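To make the two scenarios concrete, the following minimal sketch casts the listener’s perceptual correction as a single rule applied to the surface form. This is our own illustration, not part of Ohala’s account: the forms and the single assimilation rule are invented for the example.

```python
# Minimal sketch of Ohala's (1981) hypo-/hypercorrection scenarios.
# The forms and the single assimilation rule are illustrative assumptions.

def produce(intended: str) -> str:
    """Map an intended form to its surface form: /u/ fronts to [y] before /t/."""
    return intended.replace("ut", "yt")

def perceive(surface: str, applies_correction: bool) -> str:
    """Reconstruct the intended form from the surface form.

    A phonetically experienced listener 'undoes' the assimilation;
    a listener who takes the signal at face value does not.
    """
    if applies_correction:
        return surface.replace("yt", "ut")  # factor out the coarticulation
    return surface                          # face-value interpretation

# Hypocorrection: the speaker intends /ut/; the listener fails to correct.
heard = produce("ut")                               # surface form [yt]
print(perceive(heard, applies_correction=False))    # 'yt': change towards /yt/

# Hypercorrection: the speaker intends /yt/; the listener 'corrects' anyway.
heard = produce("yt")                               # surface form [yt]
print(perceive(heard, applies_correction=True))     # 'ut': change towards /ut/
```

In both runs, the output of the listener-turned-speaker diverges from the form intended by the original speaker, which is exactly the production-perception mismatch that Chitoran (2012) highlights.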

Production-oriented accounts of sound change, notably the one proposed by Lindblom (1990), converge with perception-oriented ones in that they too consider phonetic variation in speech to be the impetus for sound change. However, the source of this variation is taken to originate from the speaker as opposed to the listener. In his ‘Hyper’- and ‘Hypo’-articulation (H&H) Theory⁴, Lindblom proposes that speech varies on a continuum from hyperarticulated, listener-oriented clear speech to hypoarticulated, speaker-oriented casual speech. The speaker’s aim is to produce utterances that are intelligible to the listener while expending as little energy as possible. As J. F. Hay, Sato, Coren, Moran, and Diehl (2006) noted, speakers try to achieve sufficient, as opposed to maximal, distinctiveness in their articulation of speech sounds, and thus make active adjustments to their production of speech according to the predicted perceptual needs of the listener and to their own articulatory needs. In hyperarticulated speech, the listener’s perceptual needs take precedence over the speaker’s articulatory needs, which requires more effort on the part of the speaker.

⁴ H&H Theory will be revisited later in the thesis, notably in Experiment 1, where we discuss hyperarticulation in more detail.


Figure 1.2: Listener-oriented sound change scenarios according to Ohala (1981), including (a) hypocorrection and (b) hypercorrection.

In hypoarticulated speech, conversely, the speaker uses minimal articulatory effort to conserve energy, but the listener’s perception may suffer as a consequence. Hyperarticulation is therefore at odds with hypoarticulation: the former increases perceptibility for the listener but requires additional effort from the speaker. Sound change on this account is goal-driven (i.e., teleological) and predicted to arise when a speaker feels the need to adjust their articulation towards a form which is either easier to perceive or easier to produce.
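Lindblom’s trade-off can be read as a small optimisation problem: choose the least effortful degree of articulation whose distinctiveness still meets the listener’s predicted needs. The sketch below is a minimal illustration of that reading, not Lindblom’s own formalism; the distinctiveness and effort functions and the threshold values are invented for the example.

```python
# Toy reading of H&H Theory: pick the least effortful degree of
# articulation that is still 'sufficiently' distinctive for the
# predicted listener. Functions and thresholds are invented here.

def distinctiveness(degree: float) -> float:
    """Perceptual distinctiveness grows with articulatory degree (0-1)."""
    return degree

def effort(degree: float) -> float:
    """Articulatory effort also grows with degree, faster at the top end."""
    return degree ** 2

def choose_articulation(listener_threshold: float,
                        candidates=tuple(i / 10 for i in range(11))) -> float:
    """Return the cheapest degree whose distinctiveness suffices."""
    viable = [d for d in candidates if distinctiveness(d) >= listener_threshold]
    return min(viable, key=effort)  # sufficient, not maximal, distinctiveness

print(choose_articulation(0.3))  # favourable conditions: hypoarticulate (0.3)
print(choose_articulation(0.8))  # adverse conditions: hyperarticulate (0.8)
```

The point of the sketch is simply that the chosen degree of articulation tracks the listener’s predicted needs rather than defaulting to a maximum, which is what makes the resulting variation, and hence any ensuing sound change, goal-driven.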

Although the models proposed by Ohala and Lindblom do not converge on who initiates sound change, the listener or the speaker, both view speech perception as the transformation of the auditory input signal into neural representations of speech sounds in the listener. Other modalities involved in the perception of speech, such as visual cues, are notably absent from both models. Up to this point we have considered how sound change is modelled in phonological theory. Like many good theories, these approaches have been built on extensive experimental work, as Chitoran (2012) noted. We now need to consider whether incorporating visual cues into models of sound change is actually necessary, based on empirical evidence from the literature. We will present two cases from English which suggest that visual speech cues may indeed be implicated in sound change. This evidence demonstrates the need to consider visual as well as auditory speech perception in models of sound change.

The phonetic realisation of the /f/-/θ/ contrast in English is well known for being acoustically ambiguous. In acoustic terms, [f] and [θ] lack spectral peaks and have very low intensity, which makes them difficult to differentiate (Tabain, 1998). In native speakers of English, [θ] is regularly fronted to [f], particularly in British accents. Listener-driven models of sound change would explain the change from /θ/ to /f/ as the misperception of [θ] by the listener, given its acoustic similarity to [f]. However, McGuire and Babel (2012) noted that listener-driven models cannot account for the fact that while the sound change from /θ/ to /f/ is widely attested cross-linguistically, there are no known cases of the reverse change from /f/ to /θ/ in the literature on language typology.⁵ McGuire and Babel (2012) therefore described an ‘asymmetry’ in the /f/-/θ/ substitution pattern. They proposed that a bias towards /f/ originates in the greater visual saliency and stability of /f/.

⁵ Interdental fricatives are also typologically rare more generally: only 7% of the 451 languages included in the UCLA Phonological Segment Inventory Database (UPSID) have interdental fricatives (Maddieson, 1984; Maddieson & Precoda, 1989).


As McGuire and Babel noted, it has been remarked in previous studies that the visual cue of the lips may be more informative than the acoustic cues in disambiguating /θ/ and /f/ (e.g., Jongman, Wang, & Kim, 2003; Miller & Nicely, 1955).

McGuire and Babel (2012) considered how visual information may be implicated in the sound change involving /θ/ and /f/ by examining the role of visual cues in the perception of the contrast across multiple speakers of American English. Their results suggested that /θ/ is more variable than /f/ in both articulation and acoustics. For example, the visibility of the tongue gesture for /θ/ varied across the speakers who provided the perception stimuli, because the sound was produced both interdentally and dentally. Furthermore, the acoustics of /θ/ in the same speakers differed substantially across vowel environments. McGuire and Babel therefore proposed that it is this variability which has contributed to the unstable nature of /θ/ over time, and argued that it offers an explanation for the asymmetry in the patterning of /f/ and /θ/. In their view, listeners are faced with unpredictable inter-speaker variability in the production of /θ/, and failure to perceive either an auditory or a visual /θ/ cue will lead to the sound being categorised as /f/, based on their acoustic and visual phonetic similarities.
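The decision rule implied by this account can be stated compactly. The sketch below is our paraphrase of McGuire and Babel’s proposal, not their model: cue detection is reduced to invented booleans, whereas real cues are gradient and noisy.

```python
# Sketch of the default-to-/f/ rule implied by McGuire and Babel (2012).
# Cue detection is reduced to invented booleans for illustration.

def categorise(auditory_theta_cue: bool, visual_theta_cue: bool) -> str:
    """Categorise an ambiguous non-sibilant fricative as 'θ' or 'f'.

    /θ/ is recovered only when at least one modality supplies a clear
    /θ/ cue (e.g., a visible interdental tongue gesture); otherwise the
    acoustically and visually similar /f/ wins by default.
    """
    return "θ" if auditory_theta_cue or visual_theta_cue else "f"

# A speaker with a hidden (dental) tongue gesture and vowel-dependent
# acoustics may supply neither cue, so the listener categorises /f/:
print(categorise(auditory_theta_cue=False, visual_theta_cue=False))  # f
```

The asymmetry falls out of the default: a missed /θ/ cue yields /f/, but there is no corresponding route by which a clearly cued /f/ is recategorised as /θ/.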

McGuire and Babel therefore concluded that their results demonstrate the need to consider multimodal phonetic information when theorising about sound change, as well as in discussions of acquisition and of the typological distribution of sounds in the world’s languages.

Another acoustically ambiguous contrast is /ɔ/-/ɑ/ in certain varieties of American English affected by the Northern Cities Vowel Shift, in which both vowels undergo fronting, resulting in a merger. Havenhill and Do (2018), presenting work from Havenhill’s (2018) thesis, considered both the production and the perception of the /ɔ/-/ɑ/ contrast in American English. Articulatory data indicated that some speakers distinguish /ɔ/ from /ɑ/ with a combination of tongue position and lip rounding, while others use either tongue position or lip rounding alone, which has acoustic consequences: /ɑ/ and /ɔ/ are more similar when only one articulatory dimension varies than when two do. While all speakers maintained some degree of acoustic contrast between the vowels, Havenhill and Do also considered the impact of visual cues on the perception of the /ɑ/-/ɔ/ contrast. They found that, despite having a similar acoustic output, the articulatory configurations in which /ɔ/ is produced with unrounded lips are perceptually weaker than those produced with visible rounding: unrounded /ɔ/ was more likely to be (mis)perceived as /ɑ/ than rounded /ɔ/ when listeners had access to visual speech cues.

Havenhill and Do argued that their results show that visual cues may play a role in shaping phonological systems through misperception-based sound change. They proposed that visual speech cues can inhibit misperception of the speech signal in cases where two sounds are acoustically similar, which suggests that phonological systems may be ‘optimised’ for both auditory and visual perceptibility. Like McGuire and Babel (2012), Havenhill and Do (2018) concluded that theories of language variation and sound change must consider how speech is conveyed across multiple perceptual modalities.