

4.1.1 Aims and predictions

Despite the abundance of articulatory studies on English /r/, Anglo-English remains largely unexplored, as our review of the literature in Chapter 2 indicated. There is an assumption that Anglo-English /r/ is produced with the tongue tip raised, which is perhaps due to the data presented in Delattre and Freeman (1968). However, with only three English subjects, their dataset can hardly be described as representative, and Delattre and Freeman never claimed it to be so. We therefore aim to determine whether the tip-up tongue shape is indeed typical of Anglo-English pre-vocalic /r/ by using a larger cohort of speakers. In non-rhotic Englishes, /r/ is produced in more retroflex-compatible contexts than in rhotic Englishes.

Higher rates of retroflexion have been found in New Zealand English than in American English.

We intend to directly compare results from Anglo-English with those presented in Heyne et al. (2018) for New Zealand English and in Mielke et al. (2016) for American English. All three studies utilise the same imaging technique (UTI) and speakers were recorded at a similar time (2016–2018). We will also assess whether phonetic factors similar to those observed in American English constrain tongue shape, focusing in particular on the impact of the following vowel. Retroflexion rates have been found to increase in the context of open-back vowels as opposed to close-front ones in American English, which is probably due to articulatory ease (Mielke et al., 2016). It has been shown in other varieties of English that the different tongue shapes associated with English /r/ do not differ with respect to the first three formants. We will assess whether the same can be said for Anglo-English. On a methodological level, there is currently no single technique that researchers use to classify tongue shapes for /r/ with UTI data, and published descriptions vary in detail. We aim to ensure that our classification technique can be replicated by other researchers working with similar data. It is thus hypothesised that in Anglo-English:

Hypothesis 1 /r/ is produced with higher rates of retroflexion than in American English.

Hypothesis 2 /r/ tongue shapes are affected by coarticulation with the following vowel.

Hypothesis 3 Different tongue shapes for /r/ result in similar formant values – at least up to F3.
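Hypothesis 3 presupposes consistent measurement of the first three formants. As a purely illustrative sketch, and not the measurement procedure used in this study, F1–F3 could be read off at the temporal midpoint of /r/ with the praat-parselmouth Python bindings; the file name and time point below are hypothetical placeholders:

```python
import parselmouth  # Python bindings to the Praat speech analyser

# Hypothetical recording and /r/ midpoint; placeholders, not study data.
snd = parselmouth.Sound("speaker01_reed.wav")
t_mid = 0.150  # seconds

# Burg-method formant tracking; the 5000 Hz ceiling is a common starting
# point and would normally be adjusted per speaker.
formants = snd.to_formant_burg(time_step=0.01,
                               max_number_of_formants=5,
                               maximum_formant=5000)

for n in (1, 2, 3):
    value = formants.get_value_at_time(n, t_mid)
    print(f"F{n} at /r/ midpoint: {value:.0f} Hz")
```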

After establishing how /r/ is articulated in Anglo-English with respect to its lingual component, we will turn our attention to the lips. As Chapter 2 indicated, it is clear that our understanding of the contribution of the lips to English /r/ acoustics is incomplete. While it is generally agreed that F3, which is associated with front cavity resonances, is the main acoustic correlate of /r/, we do not know to what extent the lips may influence /r/ acoustics. As Espy-Wilson et al.'s (2000) multi-tube models indicate, the addition of a separate lip protrusion channel would extend the front cavity and lower F3. However, do speakers actually put this articulatory strategy into practice? To test to what extent lip protrusion contributes to /r/, we will present data from both non-hyper- and hyperarticulated speech. If the final goal of speech movements is the correct perception of speech by the listener, the goal of hyperarticulation must be to enhance the discriminability of phonetic categories (as expressed by H&H Theory, Lindblom, 1990). If the acoustic goal of English /r/ is indeed a low F3, hyperarticulated /r/ should reach even lower F3 values than those observed in non-hyperarticulated speech. If lip protrusion contributes to the lowering of F3, and therefore to the discernibility of /r/, we expect to find more lip protrusion in hyperarticulated speech than in non-hyperarticulated speech. We therefore postulate that:

Hypothesis 4 Lip protrusion contributes to the lowering of the third formant of /r/.
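The acoustic reasoning behind Hypothesis 4 can be made concrete with a deliberately crude quarter-wavelength approximation, in which the front cavity is treated as a tube closed at the tongue constriction and open at the lips. This is our simplification, not Espy-Wilson et al.'s (2000) multi-tube model, and the cavity lengths are illustrative values rather than measurements:

```python
C = 35000.0  # approximate speed of sound in warm, moist vocal-tract air (cm/s)

def front_cavity_resonance(length_cm: float) -> float:
    """Quarter-wavelength resonance (Hz) of a tube closed at one end: F = c / (4L)."""
    return C / (4.0 * length_cm)

# Illustrative front-cavity lengths (cm); lengthening the cavity, e.g.
# by lip protrusion, lowers the resonance affiliated with it.
for length in (5.0, 6.0, 7.0):
    print(f"L = {length:.0f} cm -> resonance ~ {front_cavity_resonance(length):.0f} Hz")
```

On this approximation, each additional centimetre of effective front-cavity length, such as that gained through lip protrusion, lowers the front-cavity resonance by roughly 200–300 Hz in this length range, which is the direction of change Hypothesis 4 predicts for F3.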

The lips may also contribute to maintaining a stable acoustic output across different lingual articulations of /r/. As we pointed out in Chapter 2 (Section 2.8, p. 64), a trading relation between the tongue and lips may be a possibility. As the size of the sublingual space varies across tongue shapes, /r/ productions with little to no sublingual space may compensate by employing other articulatory manoeuvres which result in an increase in the size of the front cavity. Front cavity lengthening may be accomplished through a more posterior placement of the tongue, an extension of the sublingual space, or increased lip protrusion. Given that labiodental articulations are rapidly gaining currency in England, we predict that Anglo-English /r/ has a labial component that may be related to the size of the sublingual space: articulations with little sublingual space, i.e., tip-down bunched ones, may compensate with increased lip protrusion. Thus, if the trading relation between the sublingual space and lip protrusion exists, we may observe a larger degree of lip protrusion in bunched /r/ than in retroflex /r/. In hyperarticulated speech, retroflexers may attain lower F3 values by increasing the size of the sublingual space (i.e., with more retroflexion), a strategy which would not necessarily be available to bunchers. We therefore predict that hyperarticulated bunched /r/ will be accompanied by more lip protrusion than hyperarticulated retroflex variants. If these arguments are valid, the following hypothesis can be derived:

Hypothesis 5 A trading relation exists between the size of the sublingual space and the degree of lip protrusion, which manifests itself through a negative correlation between the two.
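In its simplest operationalisation, Hypothesis 5 amounts to a negative correlation between per-token measurements of the two articulatory variables. The following is a minimal sketch of such a test, assuming the measurements are already available; the variable names and numbers are placeholders rather than data from this study:

```python
import numpy as np
from scipy import stats

def trading_relation_test(sublingual_space_mm2, lip_protrusion_mm):
    """Pearson correlation between sublingual-space size and lip protrusion;
    Hypothesis 5 predicts a negative coefficient (r < 0)."""
    r, p = stats.pearsonr(sublingual_space_mm2, lip_protrusion_mm)
    return r, p

# Placeholder values, included only so the sketch runs end to end.
sublingual = np.array([12.0, 30.5, 8.2, 25.1, 18.7])
protrusion = np.array([6.1, 3.2, 7.0, 3.9, 4.8])
r, p = trading_relation_test(sublingual, protrusion)
print(f"r = {r:.2f}, p = {p:.3f}")
```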

4.1.2 Hyperarticulation

In order to assess the contribution of the lips, articulatory and acoustic data from both non-hyperarticulated and hyperarticulated productions of /r/ will be presented. Speech communication is often characterised as a constant trade-off between ease of production and the successful transfer of information. For example, as described in Chapter 1 (Section 1.5.2, page 27), Lindblom's 'Hyper'- and 'Hypo'-articulation Theory (H&H Theory) states that speakers adapt their production according to the demands of the listener and the situation, which may account for the variable nature of the phonetics of speech (Lindblom, 1990). Thus, ease of articulation for the speaker is in direct opposition to the requirement for sufficient perceptual contrast for the listener (Bradlow, 2002). In fact, it has been shown that phonetic cues are often highly reduced in casual speech, which may actually result in the loss of contrastive sound categories (Ernestus & Warner, 2011). Reduction may be related to the predictability of an utterance.

Aylett and Turk (2004) found that phrase-medial syllables with high language redundancy (i.e., highly predictable from lexical, syntactic, semantic, and pragmatic factors) are shorter in duration than less predictable elements. They argued that the need for efficient information transfer while effectively expending articulatory effort leads to an 'inverse relationship between language redundancy and duration' (p. 31). This 'inverse relationship' improves communication robustness by spreading information more evenly across the speech signal, yielding a 'smoother signal redundancy profile' (p. 31).

If the communicative situation places extra demands on the listener, we can expect the speaker to spontaneously adjust their articulatory patterns in order to produce speech that is 'clearer' (Bradlow, 2002). Types of speech that are produced with the goal of improving intelligibility are commonly referred to as clear speech or hyper-speech (Cooke, King, Garnier, & Aubanel, 2014). Speakers may adjust their speech to accommodate environmental demands when audibility is affected, or is perceived by the speaker to be affected. For example, speech is often modified in noisy conditions, known as Lombard speech (Lombard, 1911) (e.g., Castellanos et al., 1996; Garnier, Heinrich, & Dubois, 2010; Junqua, 1993; Van Summers et al., 1988), or when addressed to a distant person (e.g., Cheyne, Kalgaonkar, Clements, & Zurek, 2009; Pelegrín-García, Smits, Brunskog, & Jeong, 2011). Speech modifications may also be motivated by demands made by the target audience when they are perceived by the speaker to have intrinsically reduced comprehension, regardless of context (Cooke et al., 2014). Such instances include, but are not limited to, infant-directed speech (e.g., Burnham, Kitamura, & Vollmer-Conna, 2002; P. K. Kuhl et al., 1997; Lindblom, Brownlee, Davis, & Moon, 1992; Stern, Spieker, Barnett, & MacKain, 1983), hearing-impaired-directed speech (e.g., Bradlow, 2002; Howell & Bonnett, 1997; Picheny, Durlach, & Braida, 1985), speech addressed to non-native listeners (e.g., Scarborough et al., 2007; C. L. Smith, 2007; Uther, Knoll, & Burnham, 2007), machine-directed speech (e.g., Burnham, Joeffry, & Rice, 2010a, 2010b; Oviatt, Levow, MacEachern, & Kuhn, 1996), and speech used when correcting (e.g., Beckford Wassink, Wright, & Franklin, 2007; Burnham et al., 2010a, 2010b; Schertz, 2013; Stent, Huffman, & Brennan, 2008).

Speech changes induced by environmental factors are primarily characterised by modifications to prosodic cues, including increases in intensity, fundamental frequency, and word duration (e.g., Castellanos et al., 1996; Garnier, Bailly, Dohen, Welby, & Loevenbruck, 2006; Van Summers et al., 1988). Some languages have even developed a whistled form of language in response to the necessity to communicate across very large physical distances (Meyer, 2005).

In contrast, as Cooke et al. (2014) noted, listener-based speech modifications typically result in changes which may be considered communicative strategies that help the listener to retrieve and decode phonetic cues. One such technique is exaggerated articulation, or hyperarticulation.

On a segmental level, speakers have been shown to enhance phonetic contrasts between vowels and between consonants. Enhancement strategies may include expansion of the vowel space, exaggerated jaw and lip movement, and changes to length contrasts in vowels and voicing contrasts in consonants (a review of known speech modifications is presented in Cooke et al., 2014).

Speech has been found to be hyperarticulated in computer- compared with human-directed speech (Burnham et al., 2010a), particularly in speech following recognition errors (Maniwa, Jongman, & Wade, 2009; Oviatt et al., 1996; Schertz, 2013). If only one segment is incorrectly identified, or is likely to be misunderstood, speakers may limit and target their adaptations to that particular segment in subsequent productions (Schertz, 2013), i.e., targeted hyperarticulation.1 A number of studies have elicited targeted hyperarticulation by employing an experimental paradigm in which participants interact with a simulated automatic speech recogniser and receive text feedback about what the programme 'recognised'. Stent et al. (2008) found that American English speakers make repairs after recognition errors and that hyperarticulation increases after evidence of misrecognition and then gradually decays in the absence of further misrecognitions: speakers' pre-error speaking style usually returns 4–7 utterances after evidence of misrecognition. The authors found repairs to typically include the use of canonical forms rather than reduced or assimilated ones, e.g., flapped /t/ was modified to [t]. In Schertz (2013), participants interacted with a simulated automatic speech recognition system and had to repeat words which were incorrectly identified. Target words included voiced and voiceless plosive onsets (e.g., pit, bit). More extreme voice onset time (VOT) values were elicited by an incorrect computer recognition in which the error was a minimal pair in voicing with the target plosive (e.g., the subject reads bit, the computer responds with 'pit'). However, when the computer gave an open-ended request for repetition (e.g., the subject reads bit, the computer responds with 'What did you say?'), hyperarticulation did not occur. In Buz, Tanenhaus, and Jaeger (2016), subjects were recorded interacting with a simulated human partner over the web.

Subjects were asked to say one of three words which appeared on a screen and were informed that their partner would select the word they understood from the three options. Target words contained voiceless plosive onsets. The results indicate that speakers hyperarticulate the target word when a voiced competitor is present and that the size of the hyperarticulation effect nearly doubled when simulated partners occasionally misunderstood the word.

The results from previous studies suggest that speakers make judgements based on the 'perceived communicative success' (Buz et al., 2016) of their utterances and adapt their speech accordingly. The properties of speech that speakers modify in order to improve intelligibility do not all occur at the same time and under the same conditions (Stent et al., 2008). As previously discussed, environmentally driven modifications tend to occur globally in order to improve audibility. In contrast, listener-oriented adaptations tend to occur more locally, with the goal of enhancing segmental distinctiveness. As a result, hyperarticulation may be considered a gradient process.

1 Other labels have also been employed, including contrastive, focal and localised hyperarticulation.

4.2. Methodology 105

Possible enhancement strategies may arise from speakers learning from experience which techniques most effectively convey their intended message in a given situation. Indeed, some studies have shown that spontaneous speech adaptations improve intelligibility for listeners (e.g., Junqua, 1993; Krause & Braida, 2003), although not all reported enhancement strategies have necessarily proven to be beneficial (see Cooke et al., 2014, for a review of the perceptual effects of speech adaptation).

While previous studies have been interested in how and why speech enhancement modifications occur, we intend to elicit adaptive behaviour in order to answer a specific research question relating to the phonetic implementation of a particular segment, English /r/. If the final goal of speech movements is the correct perception of speech by the listener, and if the acoustic goal of /r/ is a low third formant, articulatory enhancement should result in further F3 lowering. We will assess which articulatory parameters are available to speakers to enhance English /r/ by eliciting targeted hyperarticulation at a segmental level. Our methodology will draw on the results of previous studies, which indicate that the highest rates of targeted hyperarticulation occur in computer- rather than human-directed speech, in speech repairs directly following recognition errors, and in the 4–7 utterances following the initial error.

4.2 Methodology

4.2.1 Procedure

In order to elicit targeted hyperarticulation specifically at a segmental level, we engaged speakers in error resolution with a simulated speech recognition programme. Speakers were deceptively informed that the aim of the experiment was to test a new automatic 'silent speech' reader, which used information from speech movements to recognise the words they said without referring to the auditory signal. They were told that the silent speech reader was having difficulties with certain speech sounds, and that the aim of the recordings was to test the programme on these sounds. However, the sounds of interest were never explicitly revealed to subjects. The experiment was divided into two parts. During the first, speakers were informed that the computer had access to both visual and auditory cues from their speech. During this part, the programme correctly 'identified' every word uttered, which provided us with baseline, non-hyperarticulated productions of /r/. During the second part, participants were informed that the audio would be 'turned off' and that the programme would only have access to visual speech information from their lingual and labial movements. During this second part, the computer 'incorrectly' identified one third of the stimuli. Whenever computer errors occurred, participants were instructed to repeat the word to try to 'make the computer understand'.

Each 'incorrectly' identified word was repeated two more times in a row, the first of which elicited the same 'incorrect' response before the word was finally 'correctly' recognised. Recording sessions lasted no longer than 30 minutes and the stimuli were presented in a randomised order. By telling participants that the programme could not hear them, it was hoped that articulatory adaptations would be made locally, at a segmental level, rather than across the entire word, which might have involved prosodic changes. Participants were told to use their normal speaking voice throughout the recording session.
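The resulting repetition schedule for a single target word can be summarised schematically as follows. This is our reconstruction of the protocol logic, not the software used in the experiment; the word pair is taken from the examples given below:

```python
def feedback_sequence(target: str, error: str, misrecognised: bool) -> list[str]:
    """Ordered feedback messages from the simulated recogniser for one item."""
    if not misrecognised:
        return [f"Did you say {target}?"]   # recognised on the first attempt
    return [f"Did you say {error}?",        # initial 'error'
            f"Did you say {error}?",        # first repetition: same 'error'
            f"Did you say {target}?"]       # second repetition: 'correct'

# An error trial thus yields three productions of the target word:
print(feedback_sequence("reed", "weed", misrecognised=True))
```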

The target word and computer feedback were presented on a computer screen to the participant, who was seated in a sound-attenuated room. The participant first saw the target stimulus, e.g., reed, and the experimenter initiated the recording, which produced a beep sound in the sound-attenuated room, signalling to the participant to say the word on the screen. The participant then saw the message 'processing...please wait', which gave the experimenter, who was seated in an adjacent control room, time to select the appropriate computer response.

There were three possible computer feedback responses:

1. Recognition not possible: ‘Word not recognised, please wait.’

2. Incorrect identification: ‘Did you say weed?’

3. Correct identification: ‘Did you say reed?’

Although the simulated feedback responses had been pre-determined, the first possibility (i.e., 'Word not recognised, please wait') was included in case a subject made a mistake, in which case a target word could be repeated without jeopardising the believability of the simulated programme. We had originally considered using a technique in which the simulated feedback response was automatically presented to the participant as soon as the experimenter had pressed the stop button. However, pilot testing indicated that subjects very quickly realised that the automatic speech reader was simulated if they made a mistake or did not respond in time but the programme was still able to correctly 'recognise' the word they had been asked to say.

Pilot testing also indicated that if the computer recognition feedback was simply presented directly after the participant had produced the word, some participants paid little attention to the feedback response, focusing instead on the words they were asked to say. In order to elicit targeted hyperarticulation of /r/, it was vital that participants believed that their production of /r/ was the source of computer misrecognitions. As a result, after each recording, participants were asked to confirm whether the computer had correctly identified the word they had just said, as in the following: 'Did you say reed?'. Participants then responded with yes or no, which they were told would trigger the programme either to move on to the next word in the word list or to repeat the original target word if automatic recognition had been incorrect.

A schema depicting the order of possible computer responses is presented in Figure 4.1.

Figure 4.1: Possible responses from the simulated automatic silent speech reader (presented in grey) after a target word (presented in green); here the target word is reed. If the computer feedback was a misrecognition (here, weed), the target word was repeated two more times in a row, the first of which elicited the same 'incorrect' response (here, weed). The second repetition resulted in a correct recognition, after which a new target word was presented (here, room).

In order to ensure the believability of the simulated programme, a simulated programme interface (presented in Figure 4.2) was created and presented to speakers on a separate screen throughout the recordings. Fake on/off buttons were shown next to the words 'audio', 'video' and 'ultrasound'. Just before the second 'silent speech' part started, the experimenter 'turned off' the audio by clicking on the corresponding fake button.

Figure 4.2: Simulated 'Silent Speech Reader' interface presented to subjects during the non-hyperarticulation (top) and hyperarticulation (bottom) sessions.

4.2.2 Stimuli

Stimuli comprised nine /r/-initial monosyllabic words followed by the vowels fleece, goose, kit, dress, trap, strut2, thought, lot. Fillers were /w/-initial words followed by the same monophthongs. In the non-hyperarticulated session, all target words were 'correctly' identified by the simulated programme. To ensure believability, one repetition per item was recorded in the first session. For the second, hyperarticulated session, /r/ productions in the words reed, red, and room were 'incorrectly' identified as 'w' and 'l' (e.g., red was identified as 'wed' or 'led').

2 Some speakers, particularly those from the north and the Midlands of England, may not present the foot-strut split. As a result, we expect the foot-strut vowel to be variable, with linguistic Northerners likely producing the near-close near-back rounded foot vowel rather than the open-mid back unrounded strut vowel.


When an 'incorrect' response was given, the original word was repeated two more times. The same method was used for /w/-initial filler words, where /w/ productions were mistaken for 'r' or 'l'. A total of 24 productions of /r/ were recorded in the second, hyperarticulated session. Stimuli were presented to subjects in a semi-randomised order. In the hyperarticulated