
NEURAL CORRELATES OF VISUAL RECOGNITION AND LEARNING

1. MENTAL REPRESENTATIONS AND BRAIN CIRCUITS UNDERLYING VISUAL RECOGNITION

1.1. LEARNING TO RECOGNIZE OBJECTS

1.1.2 NEURAL CORRELATES OF VISUAL RECOGNITION AND LEARNING

Humans are often described as visual animals, and this becomes obvious when one looks at the functional distribution of cortical areas. Felleman and Van Essen (1991) first studied this distribution in the macaque cortex and found that about half (52%) of the cortical surface area was predominantly or exclusively visual. More recently, Van Essen and colleagues made interspecies comparisons between the macaque and human cortices and found high anatomical and functional overlap in visual areas (Van Essen et al., 2001).

Ungerleider and Mishkin (1982) also studied the macaque visual system and observed that lesions of the inferior temporal cortex caused impairments in visual discrimination tasks, whereas posterior parietal lesions caused impairments in visuospatial tasks. They proposed that the visual areas fall into two cortical visual pathways, each originating in the primary visual cortex, as schematized in Figure 4. The dorsal stream projects into the inferior parietal area and plays a crucial role in the appreciation of the spatial relationships among objects and in visual guidance toward them, whereas the ventral stream projects into the inferior temporal cortex and is involved in form recognition and object representation.


Figure 4. Lateral view of the left hemisphere of a Rhesus monkey. The grey area defines the cortical visual surface in the occipital, temporal and parietal lobes. Arrows schematize two cortical visual pathways, each beginning in primary visual cortex (area OC–V1), diverging within prestriate cortex (areas OB–V2), and then coursing either ventrally into OA–V4 and the inferior temporal cortex (areas TEO and TE; IT) or dorsally into OA–MT and the inferior parietal cortex (area PG). The ventral stream is crucial for object vision and the dorsal one for spatial vision. Source: Adapted from Mishkin et al., 1983.

In humans, neuropsychological studies revealed a similar functional organization of the visual areas. Patients with brain damage in the dorsal pathway are generally unable to reach for objects accurately but can still name and describe them (‘optic ataxia’). Conversely, patients with a lesion in the ventral pathway (‘visual form agnosia’) are unable to name what they see but can easily grasp or pick up an object (Farah, 1990; Goodale et al., 1994).

A simple way to refer to these two streams is the what and the where pathways. However, more recently, it has been proposed that the ventral stream constructs our conscious perception in interaction with the memory systems, whereas the dorsal stream transforms, in a more ‘bottom-up’ way, visual information in order to guide our actions (Milner & Goodale, 1995). Numerous dorsal–ventral interactions might also exist (Pisella et al., 2009). Although the dorsal stream is also involved in visual information processing, this work focuses on the ventral stream and its implication in the formation and consolidation of visual representations.

Functional organization of the ventral stream

Early neurophysiological studies in monkeys found neurons in the temporal lobe that were preferentially activated by complex visual stimuli such as faces, and even neurons that responded differently to different faces (Desimone et al., 1984; Perrett et al., 1984). More recently, Tsao et al. (2006) recorded single neurons in a region of the macaque superior temporal sulcus and found that 97% were face-responsive cells. In humans, a similar region has been found, the so-called fusiform face area (FFA; Kanwisher et al., 1997), which will be presented in further detail in section 1.2.3. Thus it appears that there is some specialization of function across the different temporal cortical visual areas (for a review, see Grill-Spector & Malach, 2004). Category-selective regions for faces, places and body parts, and a category-general region for objects, have been identified in both monkeys and humans (for a human–monkey comparison, see Bell et al., 2009; for selective areas in humans, see Epstein & Kanwisher, 1998; Downing et al., 2001). The category-general region for objects, the lateral occipital cortex (LOC), seems to play an important role in 2-D shape and, interestingly, also in 3-D shape representations (Kourtzi et al., 2003; Kourtzi & Kanwisher, 2001).

However, the functional organization of all these cortical regions remains controversial. Some authors argue for a more distributed organization with intermingled cortical regions selective for particular categories (Haxby et al., 2001).

This work focuses on the representations of objects and faces. Thus, the regions of main interest here are the LOC for object learning and the FFA for faces.

The role of the inferior temporal lobe in object recognition

The inferior temporal (IT) cortex is the final stage of the so-called what pathway and has been hypothesized to play a pivotal role in various aspects of visual object recognition. It has been extensively investigated in order to determine how features are extracted and how objects are represented (for a review, see Gross, 2008). Some advances provided by neurophysiological and neuroimaging studies in the understanding of visual object recognition are described below.

In the attempt to understand the role of the inferior temporal cortex in object recognition, two of the first goals were to identify the level of complexity of the features extracted in this region and to determine whether objects were coded in local (‘grandmother cell’ – cells that only respond to a specific visual concept, such as your own grandmother; see Barlow, 1972) or distributed representations. The first evidence of sensitivity to complex stimuli came from Gross and colleagues (Gross et al., 1969), who showed that IT cortex neurons of monkeys were more activated by complex stimuli such as hands and faces than by simple stimuli. These early single-unit studies, together with that of Perrett et al. (1984), suggested that the ‘grandmother cell’ might exist. Tanaka and colleagues (for a review see Tanaka, 2003) further investigated how object features are represented in IT. They recorded the activity of TE cells and identified the objects that led to maximal activity. They then simplified these objects down to the minimal image features maintaining the same firing rate as measured for the whole complex object. The features leading to the same activation as the object tended to be moderately complex (Tanaka et al., 1991). Tanaka and colleagues also found that adjacent cells were activated by similar simplified features (‘minicolumns’; Fujita et al., 1992).

Finally, they observed that objects were coded by the combined activation of multiple cells, each representing a different feature of the object image (Tanaka et al., 1991). This result argued for a distributed rather than a local object representation.
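The difference between a local (‘grandmother cell’) code and the distributed code suggested by these results can be sketched with a toy simulation. The cell patterns and object labels below are invented for illustration, not real recordings; the point is only that identity is carried by the joint activation pattern across feature-selective cells, so the readout survives noise in any single cell:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented binary 'feature preference' patterns for 8 IT-like cells.
# Each object is coded by the combined activation of several cells,
# each cell standing for one moderately complex feature.
codebook = {
    "face":  np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=float),
    "hand":  np.array([0, 1, 1, 0, 1, 0, 0, 0], dtype=float),
    "chair": np.array([0, 0, 0, 1, 1, 1, 0, 1], dtype=float),
}

def decode(population_response):
    """Read out the object whose stored pattern best matches the response."""
    sims = {name: float(pattern @ population_response)
            for name, pattern in codebook.items()}
    return max(sims, key=sims.get)

# A noisy presentation of a 'hand' is still decoded correctly: the identity
# is carried by the pattern across many cells, not by one dedicated cell.
noisy_hand = codebook["hand"] + rng.normal(0.0, 0.2, size=8)
print(decode(noisy_hand))
```

A local code, by contrast, would dedicate one cell per object and fail as soon as that single cell's response is corrupted.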

The degree to which visual representations in the inferior temporal cortex are view-dependent or view-independent, and how these representations allow invariant recognition, has been addressed by several single-unit recording studies in monkeys but also by a few neuroimaging studies in humans. Recordings from IT cells provide evidence of representations that tolerate changes in the visual input produced by an object (or face) (e.g., viewing angle, Hasselmo et al., 1989; size, Lueschow et al., 1994). Other neurophysiological studies found cells that were sensitive to viewpoint or size (e.g., Perrett et al., 1982; Ashbridge et al., 2000), and yet other studies reported cells with both view-specific and view-invariant responses (Perrett et al., 1991; Ito et al., 1995).

More recently, some neuroimaging studies showed both object-based and view-based activations and argued for the existence of multiple mechanisms. Sawamura et al. (2005) used an fMRI adaptation paradigm in monkeys (IT) and humans (lateral occipital cortex, LOC) and found the greatest decrease in neural responses for repetitions of the same object at the same size, intermediate response levels for repetitions of the same object at different sizes, and the least adaptation (highest responses) for repetitions of different objects. They could not find complete invariance in any of the shape-selective regions. However, some degree of size invariance was found, with a tendency for anterior regions to be more invariant than posterior ones and for the left LOC to be more size-invariant than the right one. This is concordant with other neuroimaging studies showing more abstract and invariant representations in the left fusiform gyrus (FG) than in the right one (Koutstaal et al., 2001; Vuilleumier, Henson et al., 2002; Garoff et al., 2005). A recent study also showed both view-dependent and view-independent adaptation effects in the FG during repeated presentations of faces, with view-dependence in the right FFA and view-independence in the medial FG (Pourtois et al., 2009). View-independence was also found in the right medial FG for other object categories (chairs, houses).
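The logic of such fMRI adaptation comparisons can be illustrated with a small numerical sketch. The response values and the index formula below are illustrative assumptions mimicking the ordering of conditions just described, not the study's actual data or analysis:

```python
# Made-up mean responses for the three repetition conditions
# (ordering mimics the fMRI adaptation logic described in the text).
r_same_obj_same_size = 0.6   # strongest adaptation -> lowest response
r_same_obj_diff_size = 0.8   # partial adaptation
r_diff_obj           = 1.0   # no adaptation -> full response

def size_invariance_index(r_same, r_size, r_diff):
    """1.0 = fully size-invariant (a size change adapts as much as an exact
    repeat); 0.0 = fully size-specific (a size change releases all adaptation)."""
    return (r_diff - r_size) / (r_diff - r_same)

idx = size_invariance_index(r_same_obj_same_size, r_same_obj_diff_size, r_diff_obj)
print(round(idx, 2))  # 0.5 -> partial size invariance, as in the text
```

With these toy numbers the index falls midway between 0 and 1, which is the kind of incomplete invariance the studies above report: size changes release adaptation partially, never fully.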

Thus, the functional architecture of the inferior temporal lobe is complex and seems to allow the formation of both view-dependent and view-invariant representations. However, the nature of these representations, how they are built, and which processes underlie 3-D object recognition in humans are not clearly understood yet.

Experience-dependent changes in brain responses underlying visual learning

Logothetis and colleagues investigated in monkeys how experience could shape behavior and neural representations in IT for novel 3-D objects similar to those used in psychophysical studies in humans (Bulthoff & Edelman, 1992). They found that in early learning stages, behavioral performance was view-dependent. Monkeys were able to recognize the learned views and to generalize, with decreased performance, to rotated views, but not farther than 40° around the learned views (Logothetis et al., 1994). These results were in agreement with psychophysical studies in humans suggesting the existence of view-based representations and interpolation processes between learned views (Bulthoff & Edelman, 1992). Logothetis and Pauls (1995) hypothesized that extensive learning with a large set of views would lead to invariant recognition. They therefore extensively trained monkeys with 5 objects over 4 to 6 months. Behavioral performance became view-invariant, but they still found multiple cells tuned to different views; only very few cells showed invariance. Again, these results supported the view-dependent account of object recognition. However, Booth and Rolls (1998) showed that extensive training is not necessary and that ‘natural exposure’ is sufficient to form invariant representations. Indeed, they placed real objects in the monkeys' cage for several weeks and observed that after natural inspection, sometimes only a few seconds long, both view-specific and view-invariant responses could be recorded in STS. Thus, it is not clear yet whether view-dependent and view-invariant representations coexist in the long term, or whether view-dependency is only the first step of learning.
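The early-learning pattern above (recognition of learned views, with generalization limited to roughly 40° around them) can be sketched with a population of Gaussian view-tuned units. The tuning width, training views and recognition threshold below are illustrative assumptions chosen to reproduce that qualitative behavior, not fitted parameters:

```python
import math

LEARNED_VIEWS = [0.0, 120.0, 240.0]   # assumed training views (degrees)
SIGMA = 20.0                          # assumed tuning width of a view cell
THRESHOLD = 0.15                      # assumed recognition criterion

def view_cell_response(view, preferred):
    """Gaussian tuning around a cell's preferred view (wrap-around angles)."""
    d = (view - preferred + 180.0) % 360.0 - 180.0   # shortest angular distance
    return math.exp(-d * d / (2.0 * SIGMA * SIGMA))

def recognized(view):
    """Recognized if any view-tuned cell responds above threshold."""
    return max(view_cell_response(view, p) for p in LEARNED_VIEWS) > THRESHOLD

print(recognized(30.0))   # ~30 deg from a learned view -> True
print(recognized(60.0))   # midway between learned views -> False
```

With these parameters, recognition fails between roughly 35° and 60° away from a learned view, mirroring the ~40° generalization limit; adding more learned views (as under extensive training or prolonged natural exposure) would make the behavioral readout view-invariant even though each unit stays view-tuned.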

In humans, psychophysical studies demonstrated that structural (view-invariant) representations may derive from view-specific representations, provided that sufficient prior knowledge and input information are available (Gschwind et al., 2007; Rentschler et al., 2008). However, this issue has never been tested using neuroimaging techniques.

Overall, these data confirm that the inferior temporal lobe plays a crucial role in visual recognition and in 3-D object learning. Both view-dependent and view-invariant representations might coexist. However, it is not clear yet how these representations are built and how they evolve through learning. Thus, further investigation is needed to understand how knowledge acquired through a set of defined views can be generalized to novel views of newly learned objects.