The visual world eye-tracking paradigm

In the document Prediction in Interpreting (Page 61-71)

1 Literature Review

1.10 The visual world eye-tracking paradigm

All of the studies reported in this thesis use eye-tracking in the visual world paradigm in order to measure prediction. Eye-tracking in the visual world paradigm exploits the fact that people’s eye movements are closely linked to their comprehension processes. Cooper (1974) was the first to show that when people listen to spoken stimuli, their eyes spontaneously fixate objects on the screen that are linked to what they are hearing.

Participants listened to short stories which directly or indirectly referred to one or several of nine images on a monitor. When participants heard the word “dog”, they were more likely to fixate the picture of a dog. In addition, they fixated images that were contextually related to what they heard: for example, upon hearing the word “safari”, they were more likely to fixate images of a lion, a zebra and a snake. Crucially, Cooper showed that the timing of these fixations was very closely linked to what participants heard: 55% of all fixations elicited by words began while these words were being spoken.

Tanenhaus, Spivey-Knowlton, Eberhard and Sedivy (1995) subsequently monitored comprehension with millisecond precision, showing that participants integrated visual and linguistic information incrementally and without appreciable delay. Participants wore a head-mounted eye-tracker, sat in front of a set of real objects, and followed instructions that were either initially ambiguous, such as “Put the apple on the towel in the box”, or unambiguous, such as “Put the apple that’s on the towel in the box”. Participants viewed a visual display showing an apple on a towel, a pencil, a towel and a box. In the ambiguous condition, participants looked towards the empty towel upon hearing “Put the apple on the towel”, whereas in the unambiguous condition they looked directly from the apple on the towel to the box. In an alternative visual display condition, participants viewed an apple on a towel, an apple which was not on a towel, an empty towel and a box. In this condition, participants did not look at the empty towel distractor in either the ambiguous or the unambiguous condition, as they used the visual information to disambiguate the garden-path sentence more rapidly. This demonstrates that eye movements follow incremental comprehension processes without appreciable delay, and that listeners use visual context to disambiguate spoken language.
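The referential logic behind these two displays can be sketched as a toy illustration (this code is hypothetical and not drawn from any cited study): with two possible referents in view, the phrase “on the towel” is needed to pick out which apple is meant, so it is read as a modifier; with only one referent, it is free to name a destination, inviting the garden path.

```python
# Toy sketch of the referential contrast in Tanenhaus et al. (1995).
# The display is represented as a list of object labels (hypothetical format).

def likely_parse(display, noun):
    """Guess how 'on the towel' is interpreted, given the objects in view."""
    referents = [obj for obj in display if obj.startswith(noun)]
    # Two candidate referents: the phrase is needed to distinguish them,
    # so it is read as a modifier. One referent: it can name a destination.
    return "modifier" if len(referents) > 1 else "destination"

one_referent = ["apple on towel", "pencil", "towel", "box"]
two_referents = ["apple on towel", "apple", "towel", "box"]
print(likely_parse(one_referent, "apple"))   # destination (garden path arises)
print(likely_parse(two_referents, "apple"))  # modifier (garden path avoided)
```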

The visual world paradigm has since been widely used to study online comprehension processes. Just as in these first studies, eye-tracking studies using the visual world paradigm typically involve participants listening to aural stimuli while observing a visual scene (for a review, see Huettig, Rommers, & Meyer, 2011). Visual scenes are used to tap into different aspects of comprehension processes. The time-course of comprehension is studied by considering when listeners look towards objects mentioned in the sentence (as in the early visual-world studies). Participants’ fixations on objects depicting predictable words before the onset of the predictable word are used to measure prediction of meaning (Altmann & Kamide, 1999; Dijkgraaf et al., 2017; Ito et al., 2017; Kamide et al., 2003). For instance, Altmann and Kamide (1999) showed a visual scene of a boy sitting on the floor of a room next to four different objects, paired with the sentence “The boy will eat/move the cake”. Participants looked predictively at the cake (an edible object) when they heard the constraining verb “eat”, but not when they heard the non-constraining verb “move”, which could have applied to any of the four objects.
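The scoring logic for such predictive looks can be sketched as follows (a minimal illustration with a hypothetical data format, not code from any cited study): a trial counts as predictive if a fixation on the target picture begins before the onset of the predictable word.

```python
# Minimal sketch of scoring predictive looks. The Fixation record and the
# interest-area labels are assumptions for illustration, not a standard format.

from dataclasses import dataclass

@dataclass
class Fixation:
    start_ms: int  # fixation start, relative to sentence onset
    end_ms: int    # fixation end
    area: str      # interest area the fixation falls in, e.g. "target"

def has_predictive_look(fixations, target_onset_ms):
    """True if any fixation on the target begins before the predictable word."""
    return any(f.area == "target" and f.start_ms < target_onset_ms
               for f in fixations)

# "The boy will eat the cake": suppose "cake" starts at 1200 ms, and the
# listener fixates the cake picture from 450 ms, during the verb "eat".
trial = [Fixation(0, 300, "distractor"), Fixation(450, 900, "target")]
print(has_predictive_look(trial, target_onset_ms=1200))  # True
```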

The pictures may also depict words that are not mentioned in the spoken sentence, but which relate to the predictable word in some way. These objects are called competitor objects, as they may compete for attention that would otherwise be accorded to the target object. Such objects are used to investigate which aspects of a word might be activated before, or upon, hearing that word. For instance, studies have included objects that share the target object’s shape (Huettig & McQueen, 2007; Rommers, Meyer, Praamstra, & Huettig, 2013), word-form (Huettig & McQueen, 2007; Ito et al., 2018) and semantic field (Huettig & Altmann, 2005; Huettig & McQueen, 2007; Mirman & Magnuson, 2009; Toon & Kukona, 2020). Word-form competitors have been used to investigate activation across languages in bilinguals in a monolingual setting. Marian and Spivey (2003) had participants sit in front of a display and respond to instructions in Russian, e.g., “Pick up the balloon”. One of the objects was the target object, e.g., a balloon (sharik in Russian), one was a Russian competitor object, e.g., a hat (shapka in Russian), one was an English competitor object, e.g., a shark, and one was an unrelated object. The paradigm makes it possible to test for cross-language activation without introducing linguistic stimuli from participants’ other language (see Mishra & Singh, 2014, for a related experiment using printed words).

Why do participants direct their gaze towards certain objects?

When testing for competitor effects, the competitor and the target object may both be present on the same display. This is known as a target-present design. However, some studies suggest that when the semantic context is constraining, participants direct their gaze only to the semantically appropriate image presented on the screen. Dahan and Tanenhaus (2004) had participants listen to sentences in Dutch in which a target noun was preceded by either a constraining or a non-constraining verb, such as “climb” or “has”, before the noun goat (bok in Dutch). Participants viewed a screen showing four objects, one of which depicted the target noun (e.g., a goat) and one of which was a phonological competitor (e.g., a bone; bot in Dutch). At and after word onset, participants fixated both bok and bot in the non-constraining condition, showing that there was activation of word form. However, at and after word onset in the constraining condition, they fixated bok more, and bot less, than in the non-constraining condition. This suggests that in the constraining condition, participants had narrowed their expectations to the target noun only at or before word onset (see also Barr, 2008b; Weber & Crocker, 2012). The authors of these studies (Dahan & Tanenhaus, 2004; Weber & Crocker, 2012) conclude that listeners form expectations based on the constraining linguistic context (see also Tanenhaus, 2005, for further discussion). However, Huettig and Altmann (2005; 2007) argue convincingly that the objects present in the visual display are activated when participants view the display, and that the linguistic input subsequently modulates the level of activation of each object, leading the listener to fixate more on the most relevant object.

There is thus some debate about whether a phonological competitor may be considered in the presence of the target object. However, Rommers et al. (2013) and Ito et al. (2018) found evidence of predictive activation of shape and word form respectively using target-absent designs. We thus also adopt a target-absent design in Chapter 3 and Chapter 5 of this thesis in order to maximise the likelihood of recording competitor effects (Huettig & Altmann, 2005).

Types of visual stimuli

Eye-tracking studies may use different kinds of visual stimuli. Some studies use natural scenes, which might be either the real environment itself, or else a depiction of a real-world environment (cf. Henderson & Ferreira, 2004). Such scenes depict semantically coherent environments containing both background elements and objects which can or could be moved (Henderson & Hollingworth, 1999). For instance, Altmann and Kamide (1999) used basic black-and-white depictions of real-world environments, such as a boy sitting on the floor of the living room next to four different objects, paired with the sentence “The boy will eat the cake”.

In other experiments, an array of objects is shown. This array might consist of different physical objects placed on a grid (e.g., Chambers, Tanenhaus, Eberhard, Filip, & Carlson, 2002). Alternatively, objects might be shown in different parts of a screen, typically one in each of the four quadrants. The objects might be black-and-white line drawings (e.g., Huettig & McQueen, 2007), coloured drawings (e.g., Rommers et al., 2013) or photographs (e.g., Borovsky, Elman, & Fernald, 2012; Kukona, Fang, Aicher, Chen, & Magnuson, 2011). Other visual world studies show printed words in different quadrants of the screen (Huettig & McQueen, 2007; Ito, 2019). In the studies in this thesis, participants are presented with coloured clip-art pictures shown in the four quadrants of the screen.

Saccades and fixations

Typical dependent variables in eye-tracking studies in the visual-world paradigm are either saccades to, or fixations on, the visual stimuli shown on the screen. Saccades are rapid eye movements during which the retinal image is unstable (Pollatsek & Rayner, 1992). Most saccades last under 100ms (Termsarasab, Thammongkolchai, Rucker, & Frucht, 2015). Fixations are brief periods, lasting a fraction of a second, during which the eye rests on a point on the screen (Pollatsek & Rayner, 1992). It is during fixations, rather than saccades, that new visual information is usually encoded (Parker, Slattery, & Kirkby, 2019). In this thesis, we use fixations as our dependent variable in all studies.

The speed of language-mediated eye movements
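One practical consequence of saccade-programming time for analysis can be sketched as follows (an illustration only: the function and window convention are hypothetical, and the 200ms figure is the common estimate discussed in this section): fixations driven by a spoken word are typically sought in a window shifted rightwards by the programming delay, since a fixation observed at time t was programmed roughly 200ms earlier.

```python
# Illustrative sketch: shifting the analysis window by the estimated
# saccade-programming delay (~200 ms; see, e.g., Saslow, 1967, as cited here).

SACCADE_DELAY_MS = 200  # assumed delay between programming and fixation onset

def analysis_window(word_onset_ms, word_offset_ms, delay_ms=SACCADE_DELAY_MS):
    """Fixation times plausibly driven by a word spoken in [onset, offset]."""
    return (word_onset_ms + delay_ms, word_offset_ms + delay_ms)

# A word spoken from 1000-1400 ms is taken to drive fixations from ~1200-1600 ms.
print(analysis_window(1000, 1400))  # (1200, 1600)
```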

As noted above, eye movements and comprehension processes are closely time-locked. However, there is a slight delay between the time at which the cognitive system programmes an eye movement and the time at which this eye movement can be recorded as a saccade or a fixation on the screen. Several estimates suppose a delay of around 200ms between the programming of saccadic eye movements and the point at which fixations begin (Allopenna, Magnuson, & Tanenhaus, 1998; Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Saslow, 1967). Altmann (2011) provides a lower estimate of approximately 100ms for the launch of a language-mediated saccade. At any rate, it is important to note that eye movements and comprehension processes are closely, but not exactly, time-locked, and that according to most estimates, fixations begin around 200ms after they are programmed.

The preview period

The time during which participants look at a visual scene before hearing target stimuli is known as the preview period. During the preview period, the visual system can extract information about the visual context, meaning that people can process a scene before encountering sentences about it (Ferreira, Foucart, & Engelhardt, 2013). The preview period can give the listener time to encode the visual information and then generate expectations about the form and content of the sentence. This is relevant in studies of prediction, because the visual scene can provide clues about the upcoming content of the sentence. The extent to which the visual scene provides a context depends both on the type of visual stimuli and on the preview period. A visual scene of the type shown in Altmann and Kamide (1999) provides more contextual information than four unrelated objects shown in different corners of the screen (Huettig, Rommers, et al., 2011). A longer preview time also gives participants more opportunity to form expectations about linguistic content based on visual information. In a study based on Tanenhaus et al. (1995), Ferreira et al. (2013) found that comprehenders process linguistic information differently depending on preview time.

Participants listened to ambiguous or unambiguous sentences, e.g., “Put the book on the chair in the bucket” or “Put the book that’s on the chair in the bucket”, and viewed visual scenes that either allowed, or did not allow, for the disambiguation of the sentence, for instance containing a book, a chair, a bucket, and a book on a chair, or else a chair, a bucket, a book on a chair and a final unrelated object. When they gave participants a preview time of 3000ms, Ferreira et al. (2013) replicated the earlier findings by Tanenhaus et al. (1995): participants used the visual information, together with the linguistic information, to direct their gaze to the correct referent (the book on the chair).

However, with no preview time, fixations on the incorrect goal (the chair, rather than the bucket) were high and did not differ between the four conditions, suggesting that participants were not able to use the visual information to disambiguate the linguistic information when there was no preview time.

Huettig and Guerra (2019) also compared predictive eye movements to target objects under different preview conditions. Participants listened to simple Dutch sentences containing gender-marked targets, such as “Look at the displayed bicycle!” and viewed four objects, which included a bicycle and three distractors that did not agree in gender with the target. With a 4000ms preview period, participants looked predictively at the target when they heard the gender-marked determiner, whereas with a 1000ms preview, participants only looked predictively at the target when the aural stimuli were presented at a slower than normal speech rate. This suggests that predictive eye movements in this study depended on a longer preview time.

Huettig and McQueen (2007) considered the effect of preview time on the activation of phonological, shape and semantic competitor objects. Participants heard non-constraining sentences in Dutch, e.g., “Eventually she looked at the beaker that was in front of her”, and viewed visual objects including the target word (e.g., beaker), a phonological competitor (e.g., beaver), a shape competitor (e.g., bobbin), and a semantic competitor (e.g., fork). When visual display and sentence onset were synchronised, participants looked first towards phonological, and then towards shape and semantic competitors, whereas when the preview time was reduced to 200ms before the target word, there were no fixations on phonological competitors. Huettig and McQueen suggest that this relates to the time it takes to retrieve the name of a single picture, estimated at 385ms by Levelt, Schriefers, Vorberg, Meyer, Pechmann and Havinga (1991).

In order to reduce the effects of the visual context on prediction in the studies in this thesis, we showed participants a visual array of four objects (without a visual context) with a preview time of 1000ms. These measures should reduce (although not eliminate) the effect of the visual context on any predictive eye movements.

Limitations of the paradigm

As we have discussed, when listeners view the visual scene, they activate the linguistic representations of the depicted objects, regardless of what they hear, and may use the visual scene to aid their prediction of upcoming words (as the number of potential candidates might constrain predictions). This means that eye movements may be directed both by comprehension processes and by lexical access based on processing visual information (Huettig & McQueen, 2007).

Conversely, the information present in the visual scene may not be activated, or may not be activated as expected. Thus, the absence of a measurable effect does not necessarily rule out the presence of an underlying effect. For example, when visual information is used to investigate the activation of phonological aspects of a word (as in Huettig and McQueen, 2007, or Ito et al., 2018), the time taken to retrieve the name of the picture might play a role. Ito et al. (2018) found that L1, but not L2, speakers predictively activated word-form when they heard highly predictable sentences such as “The tourists expected rain when the sun went behind the cloud but the weather got better later” and saw pictures including a word-form competitor (e.g., clown) that appeared 1000ms before the onset of the predictable word. A possible explanation is that L2 speakers take longer to retrieve word-form from pictures than L1 speakers (who take an estimated 385ms, see Levelt et al., 1991). Slower retrieval of word-form from the visual array could thus lead to an apparent lack of a word-form effect. In addition, L2 speakers may not always activate the expected phonological form. For instance, Weber and Cutler (2004) found that Dutch listeners, who sometimes confuse the English vowels /æ/ and /ε/, were more likely to look at a pencil when they heard the word panda than at a beetle when they heard the word bottle, as they do not tend to confuse the vowels /ɒ/ and /i/. This shows that L2 listeners may not always activate the expected phonological representation of the visual information.

A final consideration is that cognitive resources may be required to drive eye movements. For instance, Huettig, Olivers and Hartsuiker (2011) propose that working memory is required to connect visual and linguistic representations. This is relevant when we consider combining eye-tracking with a simultaneous interpreting task, because such a task may also burden working memory, and potentially reduce eye movements to critical objects. It is also feasible that load manipulations (e.g., Ito et al., 2017) affect predictive eye movements rather than predictive processing itself.

Summary of eye-tracking in the visual-world paradigm

The visual-world paradigm has been widely used in recent decades to study a variety of processes linked to comprehension, including prediction (Altmann & Kamide, 1999; Ito et al., 2017; Ito et al., 2018; Kamide et al., 2003), activation of different aspects of words (Huettig & McQueen, 2007; Rommers et al., 2013), and cross-language activation (Chambers & Cooke, 2009; Marian & Spivey, 2003). It is a non-invasive method for studying the time-course of language comprehension processes very precisely, and it does not introduce any linguistic interference into the process of comprehension. It is conceivable that viewing a visual scene or array leads to differences in comprehension as compared with comprehension in a more ecological setting (although the contextual effect of the visual scene can be reduced by shortening the preview time and showing images without context). On the other hand, a visual context is also present in many, if not most, ecological settings. To sum up, the paradigm is a reliable means of investigating the time course of spoken language comprehension.
