
In many languages the voicing distinction between stop consonants in initial position depends on laryngeal timing. The time interval between closure release and voicing onset (Voice Onset Time) has been extensively investigated since it is the major acoustic correlate of voicing. This chapter reviews the literature on the neurophysiological correlates of voicing perception with an emphasis on the basic auditory sensitivity common to human newborns and animals. Since it has been shown that the acoustic level of perception is indexed by late cortical evoked-potentials, we recorded the N100 component in French-speaking subjects, for whom the phonological boundary corresponds to none of the acoustic ones. We hypothesized that in French, the basic acoustic boundaries should be revealed by the N100 characteristics without any contamination by the phonological one. This hypothesis was confirmed by the results, which open the way to a better understanding of the link between the acoustic and the phonological levels of speech perception.

1 Hoonhorst I, Colin C, Markessis E, Radeau M, Deltenre P, Serniclaes W. N100 component: an electrophysiological cue of voicing perception. In: Fuchs S, Loevenbruck H, Pape D, Perrier P, editors. Some aspects of speech and the brain. Bern: Peter Lang Verlagsgruppe, 2009:5-34.

Introduction

VOICING

In 1977, Abramson opened the way to a wide field of research by defining Voice Onset Time (VOT) as the “temporal relation between the onset of glottal pulsing and the release of the initial stop consonant” (p. 296) and by underlining the importance of the temporal order between these two events, i.e. the temporal order between voicing onset and closure release.

Voicing lead was used to characterize productions in which periodic energy precedes closure release, while voicing lag described the inverse pattern.

The huge number of subsequent studies conducted on VOT can certainly be explained by the particularly straightforward relation between VOT and the perception of voicing: VOT is the most important cue for voicing perception, at least in initial position (Lisker et al., 1978). In medial position, it is combined with other temporal cues related to laryngeal timing (Saerens et al., 1989). Other acoustic cues, such as the value of the fundamental frequency, the duration of the formant transitions and the frequency of F1, only affect voicing perception when the VOT is ambiguous and are therefore considered secondary cues (Summerfield & Haggard, 1977; Hillenbrand, 1984).

There are multiple reports showing that the VOT perceptual boundary shifts as a function of place of articulation (Abramson & Lisker, 1973; Sharma et al., 2000) and vocalic context (Summerfield & Haggard, 1974). This provides supplementary degrees of freedom for the design of experiments on voicing perception in various contexts.

TWO OR THREE LEVELS OF SPEECH PERCEPTION?

Based on their empirical findings showing an effect of inter-stimulus-interval (ISI) on discrimination performances, Werker and Logan (1985), followed by other authors (Burnham, 1986; Flege, 1988; 1992), developed a three-factor model of speech perception in which the acoustic, phonetic and phonological levels of speech processing are dissociated. They showed a correlation between the ISI and the complexity of the level of perception reached: ISIs of 250, 500 and 1500 ms were respectively associated with the acoustic, phonetic and phonological levels of perception. The acoustic level was related to the ability to discriminate acoustic contrasts that are not phonological in any language of the world; the phonetic level was tied to the ability to discriminate contrasts which are phonological in languages not spoken by the subject, and the phonological level was linked to the capacity to discriminate phonemes belonging to his/her own language. The key finding of this study was the demonstration that the experimental setting may influence the subject's discrimination and lead him/her to perceive contrasts that are phonological neither in his/her language nor in any of the other languages.

This study raised two main issues in the field of the development of perceptual abilities. First, by underlining the impact of experimental conditions on the ability to discriminate fine-grained contrasts, one must accept the existence of successive levels of speech processing which we may expect to correspond to a hierarchical functional organization of the perceptual system. Secondly, even though Werker and Logan (ibid) postulated a phonetic level of perception, they did not specify its nature and cautiously concluded “it is not clear whether phonetically relevant perception is a function of a specific linguistic processor or the result of second-order auditory factors […]” (p. 43).

The suggestion of an intermediary phonetic mode of perception comes from the observation that human perceptual boundaries are built up as a combination of trade-offs between acoustic cues, which Werker and Logan (1985) called “second-order auditory factors”. Indeed, some authors have shown that categories of voicing come from the combination of VOT and first formant (F1) transition (Simon & Fourcin, 1978) or that categories for place of articulation arise from the combination of the second and third formants (F2 and F3). Since this debate remains unresolved, we will use a simplified scheme with two levels of speech perception: an acoustical and a phonological one. The acoustic level is characterized as non-speech-specific and is common to non-human animals and human babies before six months of age. As for the phonological level of perception, it is both speech- and language-specific.

Combined results gathered with diverse methods (Multi-Unit Activity, Current Source Density, intra-cortical and scalp recordings of cortical activity) have provided evidence of the neurophysiological correlates of these two levels of voicing perception. Single-unit activity recordings in the chinchilla auditory nerve (Sinex & McDonald, 1988; 1989), in the inferior colliculus (Chen et al., 1996; Chen & Sinex, 1999) and in the primary auditory cortex of monkeys (Steinschneider et al., 2003) and of humans (Liégeois-Chauvel et al., 1999) have provided some support for a hierarchical functional organization of the perceptual system.

Behavioural experiments have demonstrated that animals (Kuhl & Miller, 1975; 1978) and infants below six months of age raised in diverse linguistic backgrounds (Eimas et al., 1971; Lasky et al., 1975) share common voicing categories delineated by -30 ms and +30 ms VOT boundaries (Abramson & Lisker, 1970). By analogy with the “language-universal” categories of Miller and Eimas (1996), defined as the initial categories according to which the initial acoustic space is partitioned, we will use “universal boundaries” throughout the chapter to refer to the initial boundaries (-30 and +30 ms VOT) that delineate the voicing continuum, i.e. the acoustic boundaries to which animals and infants before six months of age are sensitive.

Other behavioural and electrophysiological studies have found evidence of the maturation from a universal and acoustic mode to a phonological mode of perception (e.g. Werker & Tees, 1984; Hoonhorst et al., in press) according to the rules of the mother tongue. Indeed, it has been shown for voicing that the two universal boundaries (-30 and +30 ms) are used in Thai (Donald, 1978), whereas in English (Lisker & Abramson, 1967) only the +30 ms VOT boundary remains relevant. In French, Spanish, Polish, Dutch, Hebrew and Arabic (Serniclaes, 1987; Williams, 1977; Flege & Eefting, 1986; Maassen et al., 2001; Horev et al., 2007; Yeni-Komshian et al., 1977), none of the universal boundaries is relevant: a new phonological boundary emerges at 0 ms VOT, just midway between the two universal predispositions for perceiving voicing. The emergence of this new boundary centred on 0 ms VOT is the most common mechanism in two-category languages. With its single boundary at +30 ms, English appears to be an exception among the languages with two VOT categories. It is nevertheless on English that the vast majority of studies have been conducted.

CATEGORICAL PERCEPTION

Categorical perception (CP), by which an infinite number of acoustic realizations is mapped onto a finite set of percepts, is the major characteristic of the hierarchical system for speech processing. The study of perceptual boundaries and their mechanisms is therefore at the heart of research on speech perception.

Categorical perception is frequently presented as a transform by which continuous physical variations are mapped onto discrete perceptual categories, i.e. an analog-to-digital transformation (Harnad, 1987). Identification and discrimination tasks have been extensively used to find evidence of CP. While identification requires labelling auditory stimuli, discrimination requires the subject to determine whether pairs of stimuli are identical or different. Categorical perception is often presented as the fact that we are only able to discriminate stimuli that do not share the same label. However, according to Liberman et al. (1957), who first defined this phenomenon, CP does not imply an all-or-none capacity but refers to the observation that contrasts between sounds separated by the same acoustic distance are more easily discriminated when the sounds belong to different phoneme categories than when they fall within the same category. Carney et al. (1977) reinforced this relative nature of the CP phenomenon by emphasizing “the improved discrimination of stimuli near a category boundary” (p. 969).
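To make this relative definition concrete, the following minimal sketch (not taken from the chapter) assumes a hypothetical logistic identification function along a VOT continuum and derives a Haskins-style ABX prediction from the labelling probabilities; the boundary location, slope and 20 ms step size are arbitrary illustrative values.

```python
# Illustrative sketch only: a hypothetical logistic identification function and
# a Haskins-style prediction of ABX discrimination derived from it.
# The boundary (30 ms), slope and step size are arbitrary illustration values.
import numpy as np

def p_voiceless(vot_ms, boundary=30.0, slope=0.25):
    """Probability of a 'voiceless' label for a stimulus with the given VOT."""
    return 1.0 / (1.0 + np.exp(-slope * (vot_ms - boundary)))

step = 20  # fixed acoustic distance (ms) between the members of each pair
for vot in range(-10, 80, 10):
    p1, p2 = p_voiceless(vot), p_voiceless(vot + step)
    # Covert-labelling (Haskins-style) prediction: chance (0.5) plus a term
    # that grows only when the two stimuli tend to receive different labels.
    predicted_abx = 0.5 + 0.5 * (p1 - p2) ** 2
    print(f"{vot:+4d} vs {vot + step:+4d} ms VOT -> predicted ABX accuracy = {predicted_abx:.2f}")
```

Under these assumptions, pairs straddling the hypothetical boundary yield the highest predicted accuracy, whereas equally spaced pairs located well inside a category remain close to chance, reproducing the within- versus between-category asymmetry described above.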

Although the hierarchical organization and the categorical mode of speech perception are firmly established, their physiological correlates remain to be better delineated. Determining the link between basic auditory mechanisms and the perception of voicing is the core interest of the research presented below.

NEURAL ENCODING OF VOICING

The comparison between humans and non-human animals suggests that the universal VOT boundaries, located at -30 and +30 ms, have developed according to constraints imposed by the auditory system. Several studies have been devoted to understanding these constraints. The main findings obtained with the different stimuli presented in Hirsh’s study were that, while only 2 ms are sufficient for a subject to detect the presence of two sounds, about 20 ms are needed to determine the temporal order of the same two sounds. This difference by a factor of 10 is, according to Hirsh, related to the involvement of two different mechanisms, i.e. “more central structures for the anatomical and physiological correlates” (p. 767) must be involved in the second task. This higher threshold for determining the temporal order between two successive events is not specific to auditory stimuli, since Piéron (1964) reached similar conclusions for the visual modality. Considering that temporal order is the critical cue to perceive the sign of VOT or TOT2 (Pisoni, 1977) and that at least 20 ms are needed to determine the temporal order between two sounds, several studies have sought to find a correlate of the corresponding neural code.

2 Tone Onset Time is the non-speech analogue of VOT and corresponds to the time elapsed between two tone onsets.
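As a concrete, purely hypothetical illustration of such non-speech analogues, the sketch below synthesizes a TOT stimulus as two pure tones whose onsets are staggered by the desired TOT value. The tone frequencies, duration and sampling rate are arbitrary choices for illustration and are not the parameters of the studies cited in this chapter; mapping the higher tone to the release and the lower tone to voicing is likewise an assumption made only for the example.

```python
# Hypothetical TOT stimulus generator: two pure tones whose onsets are
# separated by `tot_ms`.  All parameter values are illustrative only.
import numpy as np

def tot_stimulus(tot_ms, f_low=500.0, f_high=1500.0, dur_ms=250.0, fs=44100):
    """Return a mono waveform; positive TOT delays the low tone (analogue of
    voicing lag), negative TOT makes it lead (analogue of voicing lead)."""
    n_tone = int(fs * dur_ms / 1000)
    n_offset = int(fs * abs(tot_ms) / 1000)
    t = np.arange(n_tone) / fs
    low = np.sin(2 * np.pi * f_low * t)
    high = np.sin(2 * np.pi * f_high * t)
    wave = np.zeros(n_tone + n_offset)
    first, second = (high, low) if tot_ms >= 0 else (low, high)
    wave[:n_tone] += first        # leading tone starts at time zero
    wave[n_offset:] += second     # lagging tone starts after |TOT|
    return wave / np.max(np.abs(wave))

stimulus_plus30 = tot_stimulus(30)    # +30 ms TOT exemplar
stimulus_minus30 = tot_stimulus(-30)  # -30 ms TOT exemplar
```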

The neural code is defined by Eggermont (2001) as the link between behaviour (discrimination performances, for our purpose) and neural activity. Because discrimination by definition implies two percepts, discrimination between two stimuli is expected to occur as soon as their neural representations are sufficiently different to evoke different percepts.

Regarding the +30 ms VOT boundary, several studies have suggested that the neural code for VOT values greater than 30 ms is a double-peaked (“double-on”) response time-locked to closure release and voicing onset, as opposed to a single-peaked (“single-on”) response for shorter VOT values. In this respect, it is worth noting that the majority of the world’s languages exploit this neurophysiological signature by producing voiced and voiceless phonemes with VOT values strategically positioned with respect to the single- vs double-on response boundaries.

The following section presents some of the key results in the field of neural encoding of voicing.

Data on animals

Testing the discrimination abilities of macaque monkeys on a /bae-dae-gae/ continuum, Kuhl and Padden (1983) showed that they were sensitive to the same boundaries as human beings, demonstrating in this way that neither categorical perception nor the sensitivity to acoustic contrasts is human-specific. Acoustic boundaries were therefore associated with natural psychophysical boundaries (Kuhl & Miller, 1975) and attributed to auditory constraints that could serve as natural markers to shape the perceptual map.

Further evidence of the contribution of universal boundaries to categorical perception was given by Sinex and McDonald in numerous studies (e.g. 1988; 1989) in which they recorded the discharge pattern of auditory nerve fibre responses in chinchillas. Results of these studies highlighted the role of neural variability in perceiving acoustic categories. Indeed, Sinex and McDonald (ibid) showed that neural representations of voicing in the auditory nerve are non-linear, since the variance in the latency of neural responses is far greater for within-category exemplars of a stimulus than for between-category exemplars. This high degree of uncertainty within categories was held responsible for the weaker discrimination performance within categories.

This non-linear neural encoding of linear acoustic changes was also shown by Chen et al. (1996) in the discharge pattern of the inferior colliculus of chinchillas and by Steinschneider et al. (1994; 1995) and Steinschneider et al. (2003) in the primary auditory cortex of monkeys.

More specifically, Steinschneider et al. (1995) described two patterns of neuron discharge responses when animals were presented with stimuli varying in VOT from 0 to 60 ms in 20 ms steps. Stimuli like /da/, with a VOT value shorter than 20 ms, elicited a “single-on response” time-locked to closure release whereas /ta/ stimuli elicited a “double-on response” time-locked to both closure release and voicing onset. Interestingly, this differential pattern of responses according to the VOT values is the same as that observed by Sinex and McDonald (1988) in the auditory nerve fibres but is different from the one recorded in thalamocortical fibres (Steinschneider et al., 1994). Responses recorded in the primary auditory cortex are indeed characterized by transient responses time-locked to both closure release and voicing onset, whereas neural responses in thalamocortical fibres are characterized by a response time-locked to closure release followed by a phase-locked repetitive response during vocal fold vibrations. This led Steinschneider et al. (1994) to state that the transformation of the pattern of response between the thalamus and the auditory cortex, i.e. the accentuation of the transient onset components, is at the root of the voiced-voiceless distinction.

Whatever the relative roles of auditory nerve and cortical mechanisms in the build-up of the single- vs double-on neural code of VOT categories, the fact that it is expressed in the time-locked response pattern of many cortical neurons allows it to be studied by means of non-invasive scalp recordings.

Data on humans

Turning to the perception of human beings, Steinschneider et al. (1999) presented /ba-da-ga/ and /pa-ta-ka/ stimuli to English-speaking epileptic subjects while recording intra-cortical activity (Multi-Unit Activity, Current Source Density, Auditory Evoked-Potentials – AEP) from Heschl’s gyrus, the planum temporale and the superior temporal gyrus. Results showed a differential pattern according to the VOT value: stimuli characterized by a VOT of 0 or 20 ms elicited a single-on response whereas a VOT of 40, 60 or 80 ms generated a double-on response time-locked to closure release and voicing onset. These results led the authors to suggest that a 20 ms refractory period after closure release is at the root of this differential response pattern: only stimuli with a VOT longer than 20 ms would be able to elicit the second transient response associated with voicing onset. These results stressed the link between the psychoacoustical boundary (+30 ms in English), typically obtained with identification and discrimination tasks, and the auditory constraints that underlie this sensitivity. Simos et al. (1998a; 1998b) also highlighted this link by showing a non-linear decrease in the N100 magnetic response amplitude when subjects were presented with TOT stimuli with values increasing from +20 to +40 ms. Simos et al. (1998) showed the same positive correlation between the amplitude of the evoked N100 magnetic field (N100m) and VOT values with /ga-ka/ stimuli.
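Read literally, the refractory-period account lends itself to a toy model. The sketch below is a minimal illustration assuming only the 20 ms refractory window quoted above, not the analysis used in any of the cited studies: it emits a transient time-locked to closure release and a second transient time-locked to voicing onset only when the VOT exceeds the refractory window.

```python
# Toy reading of the refractory-period account: a transient is emitted at
# closure release, and a second transient at voicing onset only if the VOT
# exceeds the refractory window.  A sketch, not the cited studies' analysis.
def onset_response_pattern(vot_ms, refractory_ms=20.0):
    """Latencies (ms, relative to closure release) of the predicted transients."""
    peaks = [0.0]                       # response time-locked to closure release
    if vot_ms > refractory_ms:          # voicing onset falls outside the refractory window
        peaks.append(float(vot_ms))     # second response time-locked to voicing onset
    return peaks

for vot in (0, 20, 40, 60, 80):
    peaks = onset_response_pattern(vot)
    label = "double-on" if len(peaks) == 2 else "single-on"
    print(f"VOT = {vot:2d} ms -> {label} response, transients at {peaks} ms")
```

On this toy reading, single-on responses are predicted at 0 and 20 ms and double-on responses at 40, 60 and 80 ms, so the transition falls between 20 and 40 ms and brackets the +30 ms psychoacoustic boundary discussed above.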

Regarding the representation of voicing in the auditory cortex, Steinschneider et al. (1999; 2005) showed that categorical perception of voicing relies in particular on primary and secondary auditory cortical fields. They found evidence that the anterior portion of Heschl’s gyrus, which corresponds to the primary auditory cortex, is more specialized than the posterior portion in the representation of the two transient events of VOT, i.e. closure release and voicing onset. More specifically, electrodes localized in the lateral sites of the anterior portion of Heschl’s gyrus provided larger responses time-locked to voicing onset relative to responses time-locked to closure release than electrodes localized in central and medial sites.

Using surface-recorded cerebral evoked magnetic fields, Simos et al. (1998a; b) demonstrated a more medial localization of the N100m for stimuli with VOT values of 40 and 60 ms than for those with 0 and 20 ms VOT values. Although they used quite different methods, these two studies suggest that the detailed representation of voicing differs across cortical regions.

As far as hemispheric lateralization is concerned, Liégeois-Chauvel et al. (1999) demonstrated the specialization of the left Heschl’s gyrus for processing the transient acoustic information contained in voiced and voiceless syllables in human beings, whereas general temporal processing in animals is carried out bilaterally in the primary auditory cortex. Trébuchon-Da Fonséca et al. (2005) reached the same conclusions with /ba-pa/ stimuli. This left hemispheric dominance for the processing of VOT was, however, not supported by Steinschneider et al. (2005), who obtained the same results when recording neural responses in the right Heschl’s gyrus. Given the possible role of the preexisting epileptogenic lesions in the subjects participating in the latter two studies, one must remain cautious before drawing definite conclusions on the lateralization of the VOT-related double-on response.

Concerning the correspondence between intra-cortical recordings and scalp recordings, Trébuchon-Da Fonséca et al. (2005) compared both methodologies and obtained comparable results. As different locations in the Heschl’s gyrus respond differently to the transient events contained in voiced and voiceless stimuli, one must be prepared when performing scalp recordings to obtain, as stated by Steinschneider et al. (2003), “a composite wave that reflects activation of multiple auditory cortical fields, each with its own capacity to follow temporal features of complex sounds” (p. 318).

Sharma and Dorman (1999) recorded scalp auditory evoked-potentials and found a single-on response with short VOT values (perceived as /da/) as opposed to a double-on response pattern (perceived as /ta/) for longer VOT values. The first peak, labelled N100’, was evoked by the closure release, and the second, labelled N100, by voicing onset. Although this result led Sharma and Dorman to suggest that the morphology of the N100 component was a neurophysiological correlate of voicing perception, they later provided evidence for a different view (Sharma & Dorman, 2000). In the latter study, they presented syllables with different places of articulation. When performing identification and discrimination tasks, English-speaking subjects showed a VOT perceptual boundary centred on 27.5 ms for the /ba-pa/ continuum and on 46 ms for the /ga-ka/ continuum. This well-known shift of the VOT boundary according to the place of articulation (the consonants with more back articulation display longer VOTs: Lisker & Abramson, 1967; for the trade-off between F1 and VOT associated with this shift, see Summerfield & Haggard, 1977; Parker, 1988) was not associated with a corresponding shift in the neurophysiological boundary between the double- and single-on patterns. It was therefore concluded that the N100 morphology (single vs double-peak) was not a reliable indicator of voicing contrast. Although a correspondence between the psychoacoustic boundary and the physiological one was found by Steinschneider et al. (2005) with intra-cortical recordings when varying place of articulation, they acknowledged that the matching between these two types of boundaries was not perfect, showing in this way the complexity of the trade-offs between acoustic events involved in the perception of voicing.
