A controversial theory of speech perception is the motor theory (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; updated by Liberman & Mattingly, 1985; then by Liberman & Whalen, 2000), which claims that for both perception and production the primary representations of speech are the phonetic gestures of the articulatory apparatus. In other words, there is a shared code between speech production and perception.

Recent research in speech perception using modern neuroimaging techniques has provided some evidence for the involvement of motor regions in speech processing, which, I suggest, supports the motor theory of speech perception. Wilson, Saygin, Sereno and Iacoboni (2004) demonstrated that overlapping motor and premotor areas are activated when producing and listening to speech sounds. In this fMRI study, participants were asked to listen to blocks of the same meaningless monosyllable and, subsequently, to produce this same stimulus. Common activations were observed across conditions in the superior part of the ventral premotor cortex bilaterally, extending towards the primary motor cortex.

Although this study did not use real words, it provides strong evidence for an important role of the premotor cortex in the sublexical processing of speech.

Tremblay and Small (2011) used fMRI to investigate whether there are brain regions conjointly active during speech perception and production and whether the activity of these regions is modulated by articulatory complexity. They found that the left ventral premotor cortex was significantly activated in both tasks (i.e. speech production and perception), but activation in this region was modulated by the articulatory complexity of the syllables only during speech production.

A series of studies using Transcranial Magnetic Stimulation (TMS) have also shed light on the role of the motor cortex in speech perception. Watkins and Paus (2004) demonstrated that motor evoked potentials (MEPs) recorded from the lip muscles and elicited by TMS pulses over the face primary motor cortex were enhanced when participants heard speech sounds or viewed speech-related lip movements, but not when they listened to non-verbal sounds or viewed brow movements. Given that the excitability of the face primary motor cortex changes while attending to speech-related stimuli, it can be argued that this part of the primary motor cortex not only underlies speech production but is also implicated in speech perception.

Another study (Meister, Wilson, Deblieck, Wu, & Iacoboni, 2007) used fMRI to localise a part of the premotor cortex that was equally activated during speech perception and production. Once localised, the activity of this region was inhibited using repetitive TMS (rTMS) and participants were asked to perform a simple speech perception task. Results revealed that their performance significantly deteriorated compared to the control condition, in which rTMS was not applied.

Further, Möttönen and Watkins (2009) demonstrated that inhibiting the articulatory motor cortex (more specifically, the part of the primary motor cortex controlling lip movements) impairs categorical perception of phonemes, whereas inhibition of the hand motor cortex does not.

The studies mentioned above provide strong evidence that motor regions are, indeed, activated while listening to speech sounds and play an active role in understanding speech, at least at a sub-lexical level. The involvement of motor regions could be particularly useful during effortful speech comprehension, for example while listening to speech in noise, in which case the acoustic signal may not be sufficient to support comprehension unless it is coupled with additional information.

The above overview of theories of speech perception described phenomena which suggest that speech comprehension is not a simple bottom-up process. Several factors have an impact on speech comprehension, such as previous exposure to a certain speech stimulus (i.e. repetition priming), familiarity with the speaker’s voice and the presence of visual (McGurk & MacDonald, 1976) or tactile (Fowler, 1995) cues. These top-down influences can have a facilitatory impact on speech perception in general, but particularly under challenging listening conditions (e.g. background noise) or when the received acoustic input has been degraded (e.g. sine-wave speech) or distorted (e.g. time-compressed speech), as in these cases the information the listener is able to extract from the acoustic signal may not be sufficient for comprehension. In this report I will focus on the role of two top-down influences, namely native language and semantic context, in effortful speech comprehension.

3. The benefit of native language and semantic context while listening to speech in noise and the neural correlates of speech intelligibility

The first aim of the present study is to investigate the role of native language and semantic context in the comprehension of speech in noise. Additionally, using words mixed with noise at different signal-to-noise ratios, the present study aims to shed light on the neural correlates of listening to masked speech.

3. A. Neural correlates of effortful speech comprehension

There are different types of challenges in the speech input we receive in everyday life. For example, the acoustic signal can be compressed, spectral information can be eliminated (e.g. filtered speech, vocoded speech), or the signal can be masked (e.g. speech in noise). In order to study how the human brain successfully deals with challenging speech input, one can present artificially distorted or degraded speech.

The manipulation used in the present study is masking speech with noise. Speech in noise is created by embedding a speech sound (e.g. a phoneme, word or sentence) in noise. By varying the signal-to-noise ratio (i.e. the ratio of the levels of the signal and the noise), the intelligibility of the stimulus changes. Various types of noise, such as speech-shaped noise (Golestani, Rosen & Scott, 2009) or multi-speaker babble tracks (Holle, Obleser, Rueschemeyer & Gunter, 2010), have been used in order to simulate the noisy backgrounds we face in everyday life. Being potentially intelligible but challenging, speech in noise has been used in the literature to determine the role of various factors, such as the listener’s age (Pichora-Fuller, Schneider & Daneman, 1995), native language (Golestani et al., 2009) and hearing impairment (Bronkhorst & Plomp, 1989), in speech comprehension.
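
To make the manipulation concrete, the following is a minimal Python sketch of how a word might be embedded in noise at a target SNR. It is only an illustration, not the procedure of any study cited here: the soundfile library, the file names and the assumption of mono recordings sharing a sampling rate are mine.

import numpy as np
import soundfile as sf  # assumed available; any WAV reader would do

def mix_at_snr(speech, noise, snr_db):
    """Embed a speech signal in noise at a given signal-to-noise ratio (dB).

    The noise is rescaled so that the ratio of speech power to noise power
    matches the requested SNR; the speech itself is left untouched.
    """
    # Match the noise length to the speech (tile and trim as needed)
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Required noise gain: snr_db = 10 * log10(speech_power / (gain**2 * noise_power))
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Hypothetical file names, purely for illustration
speech, fs = sf.read("word.wav")
noise, _ = sf.read("speech_shaped_noise.wav")
masked = mix_at_snr(speech, noise, snr_db=-5.0)  # e.g. a -5 dB condition
sf.write("word_snr_minus5dB.wav", masked, fs)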

The question of which brain regions are involved in understanding speech in noise was directly addressed by Wong, Uppunda, Parrish and Dhar (2008). In this study, participants were asked to listen to words either in a quiet background or embedded in noise at two different signal-to-noise ratios (SNRs), -5 and +20 decibels (dB), and to match them with pictures. Results revealed additional activation in the superior temporal gyrus (STG) as well as in parietal and frontal regions for speech embedded in noise compared to the quiet background. The comparison between the two noisy conditions (-5 dB versus +20 dB) revealed increased activation in posterior STG and left anterior insula. Thus, the intelligibility of the acoustic signal modulates activity in auditory and motor regions as well as in an attentional network, suggesting, according to the authors, that understanding speech in noise requires not only a finer spectro-temporal analysis of the acoustic signal but also higher-level cognitive and attentional processes.

The involvement of motor regions in effortful speech comprehension has also been demonstrated by studies using other types of signal distortion or degradation, such as time-compression and noise-vocoding. Time-compression of speech increases the number of speech units presented over a given period of time, similarly to artificially increasing the speech rate. This distortion is based on the pitch-synchronous overlap-add (PSOLA) technique, which averages temporally adjacent pitch periods for voiced speech segments and uses an arbitrary time window to average unvoiced segments (Mehler et al., 1993). In a recent fMRI experiment (Adank & Devlin, 2010), individuals were asked to listen carefully to time-compressed sentences and say whether their content was true or false. Increased activation in the left premotor cortex and posterior superior temporal sulcus was observed while listening to time-compressed compared to normal speech during the first block of 16 sentences. This was followed by a decrease in activation in these regions, without a return to the level of the normal speech condition, probably reflecting adaptation to the distorted input.
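
For readers who wish to generate comparable stimuli, the sketch below compresses a recording to half its duration in Python. It uses librosa's phase-vocoder time stretch rather than PSOLA, so it is only a rough stand-in for the technique described above, and the file names are hypothetical.

import librosa
import soundfile as sf

# Hypothetical input file; librosa and soundfile are assumed to be installed.
y, sr = librosa.load("sentence.wav", sr=None)  # keep the original sampling rate

# rate=2.0 halves the duration (i.e. doubles the presentation rate)
# while approximately preserving the pitch.
y_compressed = librosa.effects.time_stretch(y, rate=2.0)

sf.write("sentence_compressed.wav", y_compressed, sr)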

A well-studied type of degraded speech is noise-vocoded speech. Noise-vocoding (Shannon, Zeng, Kamath, Wygonski and Ekelid, 1995) involves decomposing an acoustic signal into a defined number of bandpass-filtered channels, from which the time-varying amplitude envelope is extracted. The extracted amplitude envelopes are smoothed by a low-pass filter and each is then used to modulate a separate wideband noise. Each modulated noise is band-pass filtered into the frequency range of its source channel, and the amplitude-modulated noise bands are then recombined. The number of frequency bands used to process the signal is crucial to the intelligibility of the noise-vocoded output: the greater the number of bands, the more intelligible the output. Noise-vocoding removes much of the spectral information from the signal but preserves the slowly varying temporal cues. It has been used to simulate the sound transduced by cochlear implant processors (Shannon et al., 1995; Loizou, Dorman, & Tu, 1999; Faulkner, Rosen, & Smith, 2000).
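
The steps above can be summarised in a short sketch. The following is a simplified noise vocoder in Python; the logarithmic band spacing, filter orders and envelope cutoff are illustrative choices rather than the parameters of Shannon et al. (1995), and scipy, numpy, soundfile and the file names are assumptions.

import numpy as np
from scipy.signal import butter, filtfilt
import soundfile as sf

def noise_vocode(signal, fs, n_bands=6, f_lo=100.0, f_hi=5000.0, env_cutoff=30.0):
    """Simplified noise vocoder: band-pass analysis, envelope extraction,
    modulation of noise carriers, band-limiting of the carriers, recombination."""
    rng = np.random.default_rng(0)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)  # logarithmically spaced band edges

    def bandpass(x, lo, hi):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        return filtfilt(b, a, x)

    def lowpass(x, cutoff):
        b, a = butter(4, cutoff / (fs / 2), btype="low")
        return filtfilt(b, a, x)

    out = np.zeros_like(signal)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass(signal, lo, hi)
        # Amplitude envelope: half-wave rectification followed by low-pass smoothing
        envelope = lowpass(np.maximum(band, 0.0), env_cutoff)
        # Modulate a wideband noise carrier and restrict it to the source band
        carrier = rng.standard_normal(len(signal))
        out += bandpass(envelope * carrier, lo, hi)

    # Rescale the output to the RMS level of the original signal
    return out * np.sqrt(np.mean(signal ** 2) / np.mean(out ** 2))

# Hypothetical usage
speech, fs = sf.read("word.wav")
vocoded = noise_vocode(speech, fs, n_bands=6)
sf.write("word_vocoded_6band.wav", vocoded, fs)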

In a recent fMRI experiment (Hervais-Adelman, Carlyon, Johnsrude & Davis, submitted), participants were asked to listen to completely incomprehensible and potentially comprehensible noise-vocoded words (vocoded using one and six frequency bands, respectively), as well as to clear words. The contrast between degraded but potentially comprehensible words and clear words revealed significant activation in the left premotor cortex as well as in the left anterior insula. These regions may be recruited in order to provide the system with supplementary or alternative representations of speech at the sub-lexical or lexical level, which could potentially contribute to the understanding of a highly degraded input. Remarkably, no difference was observed in these regions when contrasting unintelligible with clear speech, suggesting that the activations mentioned above can be attributed to the extra cost of processing the challenging input and not to differences in spectral information.

Very few studies have used methods with high temporal resolution, such as electroencephalography (EEG) and magnetoencephalography (MEG), to examine cortical electrical activity during effortful speech comprehension. A recent study by Millman, Woods and Quinlan (2010) used MEG to investigate how noise-vocoded words are represented in terms of power changes and functional asymmetries between the two hemispheres. The results showed localised low-frequency (delta and theta) changes in the left hemisphere and high-frequency changes in the right hemisphere for noise-vocoded compared to clear words, suggesting differential roles for each hemisphere in speech processing. In another study (Obleser & Kotz, 2010), participants were asked to attend to noise-vocoded sentences containing a verb of either high (e.g. “she sifts the flour”) or low (e.g. “she weighs the flour”) cloze probability while EEG was recorded. The amplitude of the N400 component was shown to depend on the intelligibility of the acoustic signal, being larger for 4 than for 16 frequency bands. The N400 is a negative component of the event-related potential (ERP) peaking at around 400 ms post stimulus onset and is thought to reflect cognitive-linguistic integration processes.
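
As a hedged illustration of how such an ERP component might be quantified, the sketch below epochs EEG around word onset and measures the mean amplitude in a 300-500 ms window at centro-parietal electrodes, where the N400 is typically largest. It uses MNE-Python; the file name, event codes and channel names are hypothetical and would differ in any real dataset.

import mne  # MNE-Python is assumed to be installed

# Hypothetical recording and trigger codes
raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
events = mne.find_events(raw)  # assumes a stimulus trigger channel

# Epoch around the onset of the critical word
epochs = mne.Epochs(raw, events, event_id={"high_cloze": 1, "low_cloze": 2},
                    tmin=-0.2, tmax=0.8, baseline=(None, 0), preload=True)

evoked_high = epochs["high_cloze"].average()
evoked_low = epochs["low_cloze"].average()

def mean_amplitude(evoked, channels=("Cz", "CPz", "Pz"), tmin=0.3, tmax=0.5):
    # Mean voltage over centro-parietal channels in the 300-500 ms window
    cropped = evoked.copy().pick(list(channels)).crop(tmin, tmax)
    return cropped.data.mean()

# A more negative value for low-cloze words indicates a larger N400 effect
n400_effect = mean_amplitude(evoked_low) - mean_amplitude(evoked_high)
print(f"N400 effect (low minus high cloze): {n400_effect:.2e} V")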

To sum up, there is already some evidence from fMRI studies for the involvement of mainly auditory and motor regions in effortful speech comprehension, as well as evidence for power changes and hemispheric asymmetries from EEG and MEG studies. However, more research is needed to better characterise the neural correlates of speech comprehension when the received acoustic input is not clear.

3. B. Semantic priming

Semantic priming was first reported by Meyer and Schvaneveldt (1971) in a lexical decision task, in which individuals were presented with pairs of written words or pseudowords and were asked to indicate whether both stimuli were words or not. The authors found that when the words were semantically related, participants responded faster and more accurately compared to unrelated word pairs.

Semantic priming is consistent with Collins and Loftus’ (1975) spreading-activation theory of semantic processing. Their theory proposes that recognition of a word involves not only activation of the concept corresponding to that word but also activation of related concepts. Thus, if one of these concepts is subsequently presented, it will be processed faster, as it will have already been activated. Semantic priming has been extensively studied with visually presented words and different tasks, such as lexical decision (e.g. Shelton & Martin, 1992) and reading tasks (e.g. Masson, 1995). Non-linguistic tasks, in which participants were, for example, asked to make pleasantness or gender judgements on words (e.g. Draine & Greenwald, 1998), have also revealed a semantic priming effect, as these judgements were more accurate when stimuli were preceded by semantically related words.

In the auditory domain, semantic priming has recently been investigated using a lexical decision task (Daltrozzo, Signoret, Tillmann & Perrin, 2011), in which individuals heard two semantically related or unrelated words separated by a pause of 50 ms. The first word, which participants were asked to ignore, was the prime, while the second word, on which they had to make a lexical decision, was the target. Results revealed faster and more accurate performance for related compared to unrelated pairs, even when the prime was presented at a very low intensity level; in a post-test, participants failed to correctly categorize the prime at this intensity, providing evidence that they were unaware of the word and unable to consciously process it. It can therefore be concluded that preceding semantic information constitutes a strong top-down influence which facilitates subsequent processes in speech perception.

3. C. Native language benefit in speech comprehension

Previous research (Nabelek and Donahue, 1984; Takata and Nabelek, 1990; van Wijngaarden, Steeneken and Houtgast, 2002; Golestani, Rosen and Scott, 2009) has demonstrated that effortful speech comprehension is more difficult for non-native compared to native listeners of a given language. More specifically, bilingual individuals are able to make better use of the linguistic context in their native compared to the non-native language under adverse listening conditions. Note that in this context the term “bilingual” refers to people who started speaking a second language after the age of three.

The use of contextual information in bilinguals has been extensively studied using the Speech Perception in Noise (SPiN) sentences (Kalikow, Stevens & Elliott, 1977). These are affirmative sentences in which the last word is of either high or low predictability (high or low cloze probability sentences). SPiN sentences are embedded in different levels of noise and listeners are asked to identify the last word. The rationale of this paradigm is that, because listeners can usually predict the final word of high cloze probability sentences, it is identified faster and more accurately than the final word of low cloze probability sentences, even when the signal-to-noise ratio is low.

Mayo, Florentine and Buus (1997) demonstrated that the age of second language acquisition plays a key role in understanding speech in noise in that language. The authors compared the performance of four groups of listeners (monolingual English speakers, bilinguals since infancy, bilinguals who had learnt English before the age of 6, and “late”, post-puberty bilinguals) in identifying the last word of SPiN sentences of either high or low cloze probability. Results revealed a main effect of context, as listeners performed better for high compared to low cloze probability sentences, but this difference in performance was negatively correlated with the age of second language acquisition. Hence, contextual information plays a facilitatory role in understanding speech in noise, but only for the native language or, to a lesser extent, for non-native languages acquired very early in life.

Another related study (Bradlow & Alexander, 2007) tested native and non-native English speakers with high and low cloze probability sentences under two different conditions: clear and plain speech. Clear speech was characterized by very clear intonation, as if the speaker were addressing a person with hearing loss, while in the plain speech condition the speaker used a more conversational style. Results revealed that native listeners benefit from both acoustic (i.e. clear versus plain speech) and semantic (i.e. high versus low predictability of the last word) information, while non-native listeners benefit from semantic information only in the clear speech condition. The authors propose that non-native listeners can also make use of context to facilitate speech comprehension, but only under optimal listening conditions.

Although it is clear that listening to speech in one’s native language facilitates the use of contextual information, thereby supporting speech comprehension, the question of which aspects of the context are used remains open. Given that sentences carry semantic, syntactic and prosodic information, it is difficult to investigate the contribution of each of these sources separately in a study using sentences.

The role of semantics in language comprehension has been investigated in a semantic priming study by Bernstein, Bissonnette, Vyas and Barclay (1989). Using a visual retroactive semantic priming paradigm, the authors demonstrated that when a word (the target), presented for a very limited amount of time, is followed by a visual mask and then a semantically related word (the prime), it is identified or recognized more accurately than when it is followed only by a mask. Interestingly, when the target is followed by a semantically unrelated prime, its identification deteriorates compared to the condition in which it is followed only by a mask. In other words, priming does not work only forwards, as in the experiments presented above, but also backwards: a word can affect the perception of a previously presented one.

In order to investigate the native language benefit in speech comprehension, Golestani, Rosen and Scott (2009) used a modified, “auditory” version of the backwards semantic priming paradigm. In this version, participants heard a word masked with noise (the “target”), followed by a semantically related or unrelated word (the “prime”) which was always clear (not embedded in noise). Their task was to select the target from two semantically related words, the target itself and a foil, in a two-alternative forced choice (2AFC) task taking place at the end of each trial (see figure 2 below). The experiment included native French speakers with school knowledge of English, and took place in both French and English (i.e. the participants' native and non-native languages, respectively). The targets were mixed with speech-shaped noise at signal-to-noise ratios (SNRs) of -7 dB, -6 dB and -5 dB; decreasing SNR values were expected to produce stimuli that were harder to understand. An infinite-SNR condition, in which targets were not mixed with noise, was also included.

Figure 2: Schematic representation of a trial in Golestani, Rosen and Scott (2009). The target word was embedded in noise, whose intensity varied between mini-blocks of 6 trials. It was followed by a semantically related or unrelated unmasked word (the prime). At the end of each trial, a 2AFC screen appeared and participants had to recognise the target previously presented.
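
To make the trial structure explicit, the sketch below assembles the audio for a single trial of such a paradigm, reusing the mix_at_snr helper from the earlier sketch. The word pair, the gap duration and the file names are hypothetical and are not the stimuli or timings of Golestani, Rosen and Scott (2009).

import numpy as np
import soundfile as sf

# Hypothetical stimuli
target, fs = sf.read("target_cabbage.wav")       # word to be masked
prime, _ = sf.read("prime_lettuce.wav")          # clear, semantically related
noise, _ = sf.read("speech_shaped_noise.wav")

masked_target = mix_at_snr(target, noise, snr_db=-6.0)  # helper defined earlier
gap = np.zeros(int(0.5 * fs))                    # assumed half-second silent gap

trial_audio = np.concatenate([masked_target, gap, prime])
sf.write("trial_01.wav", trial_audio, fs)

# The subsequent 2AFC screen would present the target ("cabbage") together with
# a semantically related foil; the listener chooses which word was masked.
two_afc_options = ("cabbage", "carrot")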

Results revealed that participants were more accurate in recognizing the target in their mother tongue than in their non-native language. Moreover, recognition was better at higher SNR levels in both languages. Critically for the authors' hypothesis, according to which participants would make better use of semantic context in their native language, semantic relatedness played a facilitatory role in the native language condition, with higher performance for targets followed by a semantically related word. The opposite effect was observed in the non-native language trials, in which participants demonstrated better performance for targets followed by semantically unrelated compared to related words (see figure 3 below). This language by semantic-relatedness interaction suggests a benefit of the native language in the use of semantic context to understand speech in noise. The reverse effect, observed in the English trials, although difficult to interpret, could be related to the organisation of the mental lexicon in bilinguals. For example, one could imagine that all the words composing a person’s vocabulary are grouped into clusters according to certain criteria, which could differ between the native and non-native languages.
