MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY

CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING

and

DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES

A.I. Memo No. 1479 April 18, 1994

C.B.C.L. Paper No. 96

How are three-dimensional objects represented in the brain?

Heinrich H. Bülthoff, Shimon Y. Edelman & Michael J. Tarr

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu.

The pathname for this publication is: ai-publications/1994/AIM-1479.ps.Z

Abstract

We discuss a variety of psychophysical experiments that explore different aspects of the problem of object recognition and representation in human vision. In all experiments, subjects were presented with realistically rendered images of computer-generated three-dimensional objects, with tight control over stimulus shape, surface properties, illumination, and viewpoint, as well as subjects' prior exposure to the stimulus objects. Contrary to the predictions of the paradigmatic theory of recognition, which holds that object representations are viewpoint invariant, performance in all experiments was consistently viewpoint dependent, was only partially aided by binocular stereo and other depth information, was specific to viewpoints that were familiar, and was systematically disrupted by rotation in depth more than by deforming the two-dimensional images of the stimuli. The emerging concept of multiple-views representation supported by these results is consistent with recently advanced computational theories of recognition based on view interpolation. Moreover, in several simulated experiments employing the same stimuli used in experiments with human subjects, models based on multiple-views representations replicated many of the psychophysical results concerning the observed pattern of human performance.

Copyright © Massachusetts Institute of Technology, 1994

This report describes research done at the Center for Biological and Computational Learning and the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. This research is sponsored by grants from the Office of Naval Research under contracts N00014-92-J-1879 and N00014-93-1-0385. Support for the Center is provided in part by a grant from the National Science Foundation under contract ASC-9217041 (funds provided by this award include funds from DARPA provided under the HPCC program) and by a grant from the National Institutes of Health under contract NIH 2-S07-RR07047. Support for the laboratory's artificial intelligence research is provided by ARPA contract N00014-91-J-4038. Heinrich H. Bülthoff is now at the Max-Planck-Institut für biologische Kybernetik, D-72076 Tübingen, Germany; Shimon Edelman is at the Dept. of Applied Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel; and Michael J. Tarr is at the Department of Psychology, Yale University, New Haven, CT 06520-8205. SE was supported by the Basic Research Foundation, administered by the Israel Academy of Arts and Sciences. MJT was supported by the Air Force Office of Scientific Research, contract number F49620-91-J-0169, and the Office of Naval Research, contract number N00014-93-1-0305.


1 Introduction

How does the human visual system represent three-dimensional objects for recognition? Object recognition is carried out by the human visual system with such expediency that to introspection it normally appears to be immediate and effortless (Fig. 1, canonical). Computationally, recognition of a three-dimensional object seen from an arbitrary viewpoint is complex because its image structure may vary considerably depending on its pose relative to the observer (Fig. 1, non-canonical).

Because of this variability across viewpoint, simple two-dimensional template matching is unlikely to account for human performance in recognizing three-dimensional objects, since it would require that a discrete template be stored for each of the infinite number of view-specific images that may arise for even a single object. Consequently, the most prominent computational theories of object recognition (see Ullman, 1989 for a survey) have rejected the notion of view-specific representations.

Other approaches, rooted in pattern recognition theory, have postulated that objects are represented as lists of viewpoint-invariant properties or by points in abstract multidimensional feature spaces (Duda and Hart, 1973).

Another, more commonly held, alternative is characterized by the postulate that objects are represented by three-dimensional viewpoint-invariant part-based descriptions (Marr and Nishihara, 1978; Biederman, 1987), similar to the solid geometrical models used in computer-aided design.

Surprisingly, theories that rely on viewpoint-invariant three-dimensional object representations fail to account for a number of important characteristics of human performance in recognition. In particular, across a wide range of tasks, recognition performance, as measured by response times and error rates, has been found to vary systematically with the viewpoint of the perceiver relative to the target object. Such results provide converging evidence in favor of an alternative theory of recognition, which is based on multiple viewpoint-specific, largely two-dimensional representations. To support this interpretation of the psychophysical results, we review briefly several computational theories of object recognition, each of which generates specific behavioral predictions that the experiments were designed to test.

Many of the psychophysical results are accompanied by data from simulated experiments, in which central characteristics of human performance were replicated by computational models based on viewpoint-specific two-dimensional representations. More about these theories and about the implemented computational models of recognition used in our simulations can be found in (Lowe, 1986; Biederman, 1987; Ullman, 1989; Ullman and Basri, 1991; Poggio and Edelman, 1990; Bülthoff and Edelman, 1992; Edelman and Weinshall, 1991).

2 Computational theories of object recognition

Explicit computational theories of recognition serve as good starting points for inquiry into the nature of object representation, by providing concrete hypotheses that may be refuted or refined through appropriately designed experiments. More than any other single issue, the question of whether object representations are viewpoint invariant or viewpoint dependent has been identified as the crucial distinction on which theories of recognition stand or fall.

One can use the viewpoint-invariant/viewpoint-dependent distinction to make specific psychophysical predictions as follows. Intuitively, if the representation is viewpoint invariant, and if an object-centered reference frame can be recovered independently of object pose, then neither recognition time nor accuracy should be related to the viewpoint of the observer with respect to the object. In contrast, if the representation is viewpoint dependent, and as long as the complexity of the normalization procedure scales with the magnitude of the transformation, then both recognition time and accuracy should be systematically related to the viewpoint of the observer with respect to the object. Subtler predictions may be derived from a closer examination of specific theories.
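The two qualitative response-time predictions can be made concrete in a small numerical sketch. This is illustrative only: the baseline latency and per-degree cost below are hypothetical parameters, not values drawn from the experiments.

```python
import numpy as np

# Hypothetical response-time (RT) curves for the two classes of theories,
# assuming an incremental normalization process running at a constant rate.
angles = np.arange(0, 181, 30)  # misorientation from the nearest stored view, deg

# Viewpoint-invariant representation: RT is flat across viewpoint.
rt_invariant = np.full(angles.shape, 600.0)      # ms, hypothetical baseline

# Viewpoint-dependent representation: RT grows with transformation magnitude.
base_ms, ms_per_deg = 600.0, 2.5                 # hypothetical parameters
rt_dependent = base_ms + ms_per_deg * angles

for a, ri, rd in zip(angles, rt_invariant, rt_dependent):
    print(f"{a:3d} deg   invariant: {ri:5.0f} ms   dependent: {rd:5.0f} ms")
```

The flat versus monotonically increasing RT profile is the coarse signature the experiments below look for; the subtler error-rate predictions follow in Section 3.2.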

2.1 Theories that rely on three-dimensional object representations

Theories of the first kind we mention attempt to achieve a computer-vision equivalent of complete object constancy, the apparent ability of humans to perceive and recognize three-dimensional objects irrespective of factors such as viewpoint (Ellis et al., 1989). Two major approaches to object constancy can be discerned. The first approach uses fully three-dimensional viewpoint-invariant representations, and requires that a similar three-dimensional representation of the input be recovered from the image before it is matched to like representations in visual memory. The second approach uses viewpoint-specific three-dimensional representations (e.g., selected views that include depth information), and requires that three-dimensional representations of the input be normalized (by an appropriate spatial transformation) from the viewpoint of the image to the viewpoint of a view-specific representation in visual memory.

2.1.1 Viewpoint-invariant three-dimensional representations

The notion that the processing of the visual input culminates in a full restoration of its three-dimensional structure, which may then be matched to three-dimensional viewpoint-invariant representations in memory, was popularized by Marr and Nishihara (1978).

Representation by reconstruction, which became known in computer vision under the name of intrinsic images (Barrow and Tenenbaum, 1978; Tenenbaum et al., 1981), was never implemented, due to persistent difficulties in solving the problem of a general reconstruction of the three-dimensional representation from input images. Despite the failure of this approach in computer vision, in psychology it has become widely accepted as a plausible model of recognition, following the work of Biederman and his associates.

Figure 1 (panels: NON-CANONICAL, CANONICAL): Canonical views: certain views of three-dimensional objects are consistently easier to recognize or process in a variety of visual tasks. Once this object is identified as a tricycle seen from the front, we find it difficult to believe its recognition was anything less than immediate. Nevertheless, recognition is at times prone to errors, and even familiar objects take longer to recognize if they are seen from unusual (non-canonical) viewpoints (Palmer et al., 1981). Exploring this and other related phenomena can help elucidate the nature of the representation of three-dimensional objects in the human visual system.

Biederman's theory, known as Recognition By Components (or more recently, Geon Structural Descriptions, or GSD (Hummel and Biederman, 1992)), postulates that the human visual system represents basic-level object categories by three-dimensional structural relationships between a restricted class of volumetric primitives known as "geons" (Biederman, 1987). The crucial property of the GSD approach is that the part descriptions upon which object representations are built are qualitative: the same object representation is derived, regardless of viewpoint, so long as the same configuration of perceptual features is present in the image. A consequence of this is that GSDs actually exhibit only view-restricted invariance, in that a change in the visibility or occlusion of parts will alter the feature configurations present in the image (Hummel and Biederman, 1992; Biederman and Gerhardstein, 1993). Therefore, the representation of a single object will necessarily include several characteristic (Freeman and Chakravarty, 1980) or qualitative views, each composed of a distinct GSD and each viewpoint-invariant only for a limited range of viewpoints.

2.1.2 Viewpoint-specic three-dimensional representations in conjunction with normalization

As a representative of this class of theories we consider recognition by viewpoint normalization, of which Ullman's method of alignment is an instance (Ullman, 1989). In the alignment approach the two-dimensional input image is compared with the projection of a stored three-dimensional model, much as in template matching, but only after the two are brought into register. The transformation necessary to achieve alignment is computed by matching a small number of features in the image with the corresponding features in the complete three-dimensional model. The aligning transformation is computed separately for each of the models stored in visual memory (but only one per object). The outcome of the recognition process is the model whose projection matches the input image most closely after the two are aligned. Related schemes (Lowe, 1986; Thompson and Mundy, 1987) select the most appropriate model in visual memory by using the "viewpoint consistency constraint," which projects each model to a hypothesized viewpoint and then relates the projected locations of the resultant image features to the input image, thereby deriving a mapping of the image to the three-dimensional structure of stored object representations.
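The core computation of such schemes, estimating an aligning transformation from a few feature correspondences and scoring each model by the residual mismatch after alignment, can be sketched as follows. This is a minimal two-dimensional illustration with made-up point data and a least-squares affine fit; Ullman's scheme operates on full three-dimensional models.

```python
import numpy as np

def estimate_affine(model_pts, image_pts):
    """Least-squares 2-D affine transform mapping model_pts -> image_pts.
    model_pts, image_pts: (N, 2) arrays, N >= 3 correspondences."""
    n = model_pts.shape[0]
    A = np.hstack([model_pts, np.ones((n, 1))])            # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(A, image_pts, rcond=None)  # (3, 2) parameters
    return params

def alignment_residual(model_pts, image_pts):
    """RMS distance between the aligned model and the image features."""
    params = estimate_affine(model_pts, image_pts)
    aligned = np.hstack([model_pts, np.ones((len(model_pts), 1))]) @ params
    return float(np.sqrt(np.mean(np.sum((aligned - image_pts) ** 2, axis=1))))

# A toy "stored model" and an input that is a rotated, shifted copy of it.
rng = np.random.default_rng(0)
model = rng.standard_normal((6, 2))
theta = np.deg2rad(40)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
image = model @ R.T + np.array([1.5, -0.5])

wrong_model = rng.standard_normal((6, 2))      # a different stored model
r_correct = alignment_residual(model, image)   # small: the model aligns well
r_wrong = alignment_residual(wrong_model, image)  # larger: wrong model
print(r_correct, r_wrong)
```

Recognition then amounts to picking the model with the smallest post-alignment residual, which is the selection rule the alignment approach describes.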

Ullman (1989) distinguishes between a full alignment scheme, which employs complete three-dimensional models and attempts to compensate for all possible three-dimensional transformations that objects may undergo, such as rotation in depth, and a partial alignment scheme, which employs pictorial descriptions that decompose objects into (non-generic) parts and uses multiple views rather than a single viewpoint-invariant description to compensate for some three-dimensional transformations.

Ullman notes (ibid., p. 228) that this latter multiple-views approach to alignment involves a representation that is "view-dependent, since a number of different models of the same object from different viewing positions will be used," but at the same time is "view-insensitive, since the differences between views are partially compensated by the alignment process." As such, this approach is similar to Biederman's (Hummel and Biederman, 1992) most recent version of GSD theory, in which multiple viewpoint-invariant GSDs are used to represent a single object (although because GSDs are considered to be qualitative descriptions, no alignment process is ever postulated to compensate for differences in viewpoint). Regardless of these subtle differences, both versions of alignment theory (hereafter referred to simply as alignment) may include the assumption that normalization procedures do not depend on the magnitude of the transformation; consequently, viewpoint-invariant performance in recognition tasks (e.g., response times and error rates) may be considered their central distinguishing feature. Alternatively, the complexity of normalization may scale with the magnitude of transformation, and as such, viewpoint-invariant performance is predicted only for error rates, with viewpoint-dependent patterns predicted for response times.

2.2 Theories that rely on viewpoint-dependent two-dimensional object representations

Theories of the second kind we mention here each attempt to achieve object constancy by storing multiple two-dimensional viewpoint-specific representations (e.g., image-based views) and including mechanisms for matching input images to stored views or to views derived computationally from stored views. While the specific mechanisms postulated for accomplishing this match vary among theories (and have consequences for the subtler predictions of each), they may all be considered as computational variants of the empirically based multiple-views-plus-transformation (MVPT) theory of recognition (Tarr and Pinker, 1989). MVPT postulates that objects are represented as linked collections of viewpoint-specific images ("views"), and that recognition is achieved when the input image activates the view (or set of views) that corresponds to a familiar object transformed to the appropriate pose. There is evidence (Edelman and Weinshall, 1991; Tarr, 1989; Tarr and Pinker, 1989) indicating that this process can result in the same dependence of the response time on the pose of the stimulus object as obtained in the mental rotation experiments (Shepard and Cooper, 1982). We consider MVPT as a psychological model of human performance that predicts recognition behavior under specific conditions; the computational models reviewed below provide details on how this performance may be achieved.

2.2.1 Linear combination of views (LC)

Several recently proposed approaches to recognition dispense with the need to represent objects as three-dimensional models. The first of these, recognition by linear combination of views (Ullman and Basri, 1991), is built on the observation that, under orthographic projection, the two-dimensional coordinates of an object point can be represented as a linear combination of the coordinates of the corresponding points in a small number of fixed two-dimensional views of the same object. The required number of views depends on the allowed three-dimensional transformations of the objects and on the representation of an individual view. For a polyhedral object that can undergo a general linear transformation, three views are required if separate linear bases are used to represent the x and the y coordinates of a new view. Two views suffice if a mixed x, y basis is used (Ullman and Basri, 1991). A system that relies solely on the linear combination approach (LC) should achieve uniformly high performance on those views that fall within the space spanned by the stored set of model views, and should perform poorly on views that belong to an orthogonal space.
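The observation behind the LC scheme is easy to verify numerically. The sketch below uses hypothetical random objects and rotation-only viewing transformations (no translation or image noise): it checks whether a coordinate vector of a test view lies in the span of the four coordinate vectors taken from two stored views, i.e., a mixed x, y basis.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(rng):
    """A random proper 3-D rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.linalg.det(q))

def view(points, R):
    """Orthographic projection of rotated 3-D points: (N, 2) image coordinates."""
    return points @ R.T[:, :2]

target = rng.standard_normal((10, 3))          # 10 feature points of one object

# Mixed x, y basis from two stored views: four length-10 coordinate vectors.
v1 = view(target, random_rotation(rng))
v2 = view(target, random_rotation(rng))
basis = np.column_stack([v1[:, 0], v1[:, 1], v2[:, 0], v2[:, 1]])

def residual(image_coords):
    """Distance from a coordinate vector to the span of the stored basis."""
    coeffs, *_ = np.linalg.lstsq(basis, image_coords, rcond=None)
    return float(np.linalg.norm(basis @ coeffs - image_coords))

novel = view(target, random_rotation(rng))     # unfamiliar view of the target
other = view(rng.standard_normal((10, 3)), random_rotation(rng))  # distractor

r_novel = residual(novel[:, 0])   # essentially zero: in the span of stored views
r_other = residual(other[:, 0])   # larger: a different object is not in the span
print(r_novel, r_other)
```

The small residual for the novel view of the familiar object, against the large residual for the distractor, is exactly the generalization pattern the basic LC scheme predicts.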

2.2.2 View interpolation by basis functions (HyperBF)

Another approach that represents objects by sets of two-dimensional views is view interpolation by regularization networks (Poggio and Edelman, 1990; Poggio and Girosi, 1990). In this approach, generalization from stored to novel views is regarded as a problem of multivariate function interpolation in the space of all possible views. The interpolation is performed in two stages.

In the first stage, intermediate responses are formed by a collection of nonlinear receptive fields (these can be, e.g., multidimensional Gaussians). The output of the second stage is a linear combination of the intermediate receptive field responses.

More explicitly, a Gaussian-shaped basis function is placed at each of the prototypical stored views of the object, so that an appropriately weighted sum of the Gaussians approximates the desired characteristic function for that object over the entire range of possible views (see (Poggio and Edelman, 1990; Edelman and Poggio, 1992) for details). Recognition of the object represented by such a characteristic function amounts to a comparison between the value of the function computed for the input image and a threshold.
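A minimal numerical sketch of this characteristic-function idea follows. The simplifications are ours: made-up feature vectors stand in for views, the weights are uniform, and the threshold is arbitrary; the published scheme fits the weights to the training views.

```python
import numpy as np

def gaussian_rbf(x, centers, sigma):
    """Response of one Gaussian basis function per stored view (center)."""
    d2 = np.sum((x - centers) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(2)
stored_views = rng.standard_normal((5, 4))   # 5 prototypical views, 4 features each
weights = np.ones(5) / 5                     # uniform weights (simplification)
sigma, threshold = 1.0, 0.15

def characteristic(x):
    """Weighted sum of Gaussians: high near stored views, low far from them."""
    return float(weights @ gaussian_rbf(x, stored_views, sigma))

near = stored_views[0] + 0.05 * rng.standard_normal(4)  # view close to a stored one
far = stored_views[0] + 5.0 * rng.standard_normal(4)    # very unfamiliar view

print(characteristic(near))   # high: exceeds the recognition threshold
print(characteristic(far))    # low: falls off with distance from stored views
```

The graded falloff of the characteristic function with distance from the nearest stored view is what later yields the theory's prediction of higher error rates for novel than for familiar test views.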

2.2.3 Conjunction of localized features (CLF)

The third scheme we mention is also based on interpolation among two-dimensional views and, in addition, is particularly suitable for modeling the time course of recognition, including long-term learning effects (Edelman and Weinshall, 1991; Edelman, 1991b; Tarr, 1989; Tarr and Pinker, 1989). The scheme is implemented as a two-layer network of thresholded summation units. The input layer of the network is a retinotopic feature map (thus the model's name). The distribution of the connections from the first layer to the second, or representation, layer is such that the activity in the second layer is a blurred version of the input. Unsupervised Hebbian learning augmented by a winner-take-all operation ensures that each sufficiently distinct input pattern (such as a particular view of a three-dimensional object) is represented by a dedicated small clique of units in the second layer. Units that stand for individual views are linked together in an experience-driven fashion, again through Hebbian learning, to form a multiple-view representation of the object. When presented with a novel view, the CLF network can recognize it through a process that amounts to blurred template matching and is related to nonlinear basis function interpolation.
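The blurred-template-matching character of the scheme can be sketched numerically. This is an illustrative reduction of our own: the Hebbian learning and winner-take-all dynamics of the published model are omitted, leaving only the blurred retinotopic maps and their comparison.

```python
import numpy as np

def blur(feature_map, sigma=1.5):
    """Separable Gaussian blur of a 2-D map (plain NumPy, no SciPy)."""
    half = int(3 * sigma)
    xs = np.arange(-half, half + 1)
    k = np.exp(-xs ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, feature_map)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def make_view(points, shape=(32, 32)):
    """Binary retinotopic map with ones at the given feature locations."""
    m = np.zeros(shape)
    for r, c in points:
        m[r, c] = 1.0
    return m

def match(a, b):
    """Normalized correlation between two blurred maps."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

stored = blur(make_view([(8, 8), (16, 20), (24, 10)]))      # a familiar view
probe_near = blur(make_view([(9, 8), (15, 21), (24, 11)]))  # features shifted ~1 px
probe_far = blur(make_view([(2, 30), (30, 2), (16, 16)]))   # very different view

print(match(stored, probe_near))  # high: blur tolerates small displacements
print(match(stored, probe_far))   # low: distinct views would recruit distinct units
```

The blur is what buys tolerance to small feature displacements, while sufficiently different views produce low overlap and so, in the full model, recruit their own representation units.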


3 Recognition behavior as predicted by the different theories

3.1 Experimental issues

A wide range of psychophysical experiments have been reported that assess the impact of changes of viewpoint on the recognition of both familiar and novel stimuli. The core issue in all such studies is whether response times and/or error rates are equivalent for all changes in viewpoint or are systematically related to the magnitude of changes in viewpoint. Such behavioral patterns can help to decide which representations (viewpoint-invariant or viewpoint-dependent) are used in object recognition. However, one must be cautious in interpreting such patterns: there are instances of both viewpoint-invariant and viewpoint-dependent behavior that do not necessarily imply correspondingly viewpoint-invariant or viewpoint-dependent representations. In particular, there is an asymmetry in what may be concluded from viewpoint-invariant patterns of responses. For novel objects, because of the limited stimulus set sizes employed in many experiments, a viewpoint-invariant pattern may simply indicate that in the context of the experimentally defined recognition set, subjects were able to recognize objects via localized viewpoint-invariant features within each object (Eley, 1982). In contrast, in the context of all potentially recognizable objects in the world, such features would not be unique and consequently would not support viewpoint-invariant recognition. Thus, one of the many challenges that must be overcome in assessing recognition mechanisms in humans is the development of novel stimuli that do not facilitate the reliance on unique features (to the extent that such features are unlikely to be unique in the real world). A similar problem of interpretation exists for familiar objects: a viewpoint-invariant pattern may arise as a result of multiple familiar stored views (distributed across viewpoint so as to mask most effects of viewpoint).

Thus, another challenge that must be overcome is how to assess the possible existence of multiple-views in cases where objects are very familiar, presumably leading to the instantiation of many views.

Examples of difficulties of interpretation may also be found in patterns of performance that are viewpoint-dependent. For instance, initial viewpoint dependency for novel objects may occur because viewpoint-invariant representations may arise only over experience. Thus, learning processes must be considered in assessing recognition. Viewpoint-dependent patterns may arise because of reliance on perceptual information possibly irrelevant to recognition; for example, mirror-image discrimination requires left/right handedness information defined only in our egocentric frame of reference, and therefore mental rotation is apparently used to normalize objects to this frame (Shepard and Cooper, 1982). Thus, a final challenge is to ensure that extraneous factors, for instance handedness, do not produce behavioral patterns that are not typical of recognition judgments. As discussed in Sections 4.3 and 5, these challenges are addressed in experiments conducted by Bülthoff and Edelman (Bülthoff and Edelman, 1992; Edelman and Bülthoff, 1992a) and by Tarr (Tarr, 1989; Tarr and Pinker, 1989). Briefly, these experiments employed the following manipulations:

- Novel stimulus objects that shared similar parts in different spatial relationships (typical of subordinate-level recognition discriminations), thereby reducing the possibility of localized unique features mediating recognition (see Fig. 2).

- Measures assessing both the initial recognition of novel objects and recognition following extensive familiarization.

- Restricted sets of viewpoints during initial training, or other controls (see below), to investigate the degree of viewpoint specificity encoded in object representations of familiar objects or novel objects following extensive familiarization.

- The introduction of unfamiliar "test" views to assess the underlying organization of views instantiated during learning.

- Recognition tasks that reduced the likelihood of extraneous influences on recognition performance. For instance, some studies controlled for handedness by using bilaterally symmetrical objects or treating both members of mirror-pairs as equivalent.

Additionally, to differentiate between the more subtle predictions of viewpoint-dependent theories of recognition, we have investigated the performance in three distinct cases, each corresponding to a different kind of test views. In the first and easiest case, the test views are familiar to the subject (that is, test views are shown during training). In the second case, the test views are unfamiliar, but are related to the training views through a rigid three-dimensional transformation of the target. In this case the problem can be regarded as generalization of recognition to novel views. In the third case, which is especially relevant in the recognition of articulated or flexible objects, the test views are obtained through a combination of rigid transformation and non-rigid deformation of the target object. To better place the results of such experiments in a theoretical context, we first review the specific theoretical predictions generated by each theory of recognition.

3.2 Theoretical predictions

The theories discussed in Section 2 make different predictions about the effect of factors such as viewpoint on the accuracy and latency of recognition under the various conditions outlined above. As mentioned, at the most general level, theories that rely on viewpoint-invariant representations predict no systematic effect of viewpoint on either response times or error rates, both for familiar and for novel test views, provided that the representational primitives (i.e., invariant features or generic parts) can be readily extracted from the input image. In comparison, theories that rely on viewpoint-dependent representations naturally predict viewpoint-dependent performance. However, the details of such predictions vary according to the specifics of the approach postulated by each particular theory.


Figure 2: The appearance of a three-dimensional object can depend strongly on the viewpoint. The image in the center represents one view of a computer graphics object (wire-, amoeba-, or cube-like). The other images are derived from the same object by 75° rotation around the vertical or horizontal axis. The difference between the images illustrates the difficulties encountered by any straightforward template matching approach to three-dimensional object recognition. Thin wire-like objects have the nice property that the negligible amount of occlusion provides any recognition system with an equal amount of information for any view. A realistic recognition system has to deal with the more difficult situation of self-occlusion, as demonstrated with the amoeba-like objects.

3.2.1 Viewpoint-invariant three-dimensional representations

A recognition scheme based on viewpoint-invariant three-dimensional representations may be expected to perform poorly only for those views which, by an accident of perspective, lack the information necessary for the recovery of the reference frame in which the viewpoint-invariant description is to be formed (Marr and Nishihara, 1978; Biederman, 1987). In a standard example of this situation, an elongated object is seen end-on, causing a foreshortening of its major axis, and an increased error rate, due presumably to a failure to achieve a stable description of the object in terms of its parts (Marr and Nishihara, 1978; Biederman, 1987). In all other cases this theory predicts independence of response time from orientation, and a uniformly low error rate across different views. Furthermore, the error rate should remain low even for deformed objects, as long as the deformation does not alter the make-up of the object in terms of its parts and their qualitative spatial relations.

Similar predictions are made by the most recent version of GSD theory (Biederman and Gerhardstein, 1993; Hummel and Biederman, 1992), to the extent that a given GSD is considered to be viewpoint invariant up to changes in the visibility or occlusion of specific geons. Therefore, as long as the complete set of GSDs is familiar for a given object, recognition behavior will be completely viewpoint invariant. However, under conditions where some GSDs are unfamiliar or, more generally, under conditions where the GSD recovered from an image must be matched to a different GSD in memory, recognition behavior will degrade qualitatively, that is, without any systematic relationship to the magnitude of changes in viewpoint (Biederman and Gerhardstein, 1993). Thus, GSD theory predicts viewpoint invariance for the recognition of familiar objects and only step-like viewpoint-dependent patterns for the recognition of unfamiliar objects undergoing extreme changes in visible part structure.

3.2.2 Viewpoint-dependent three-dimensional representations

Consider next the predictions of those theories that explicitly compensate for viewpoint-related variability of the apparent shape of objects, by normalizing or transforming the object to a standard viewpoint. As mentioned, if the recognition system represents an object by multiple views and uses an incremental transformation process for viewpoint normalization, response times are expected to vary monotonically with the viewpoint of the test view relative to one of the stored views. This pattern of response times will hold for many of the familiar, as well as for novel test views, since the system may store selectively only some of the views it encounters for each object, and may rely on normalization for the recognition of other views, either familiar or novel. In contrast to the expected dependence of response times on viewpoint, the error rate under the viewpoint normalization approach will be uniformly low for any test view, either familiar or novel, in which the information necessary for pose estimation is not lost (thereby leading to successful recognition). Alternatively, if normalizing or transforming the object uses a "one-shot" transformation process for viewpoint normalization, response times will likewise be viewpoint invariant. In either case, the predictions of this theory may be differentiated from theories that rely on two-dimensional representations and normalization procedures in that the latter predict effects of viewpoint for both response times and error rates (as discussed in the following sections). By comparison, theories based on three-dimensional representations predict that error rates will not vary with viewpoint (regardless of the pattern of response times).

3.2.3 Linear combination of views

The predictions of the LC scheme vary according to the particular version used. The basic LC scheme predicts uniformly successful generalization to those views that belong to the space spanned by the stored set of model views. It is expected to perform poorly on views that belong to an orthogonal space. In contrast, the mixed-basis LC (MLC) is expected to generalize perfectly, just as the three-dimensional viewpoint-invariant schemes do. Furthermore, the varieties of the LC scheme should not benefit significantly from the availability of depth cues, because they require that the views be encoded as lists of coordinates of object features in two dimensions and cannot accommodate depth information.

Regarding the recognition of deformed objects, the LC method will generalize to any view that belongs to a hyperplane spanned by the training views (Ullman and Basri, 1991). For the LC+ scheme (that is, LC augmented by quadratic constraints verifying that the transformation in question is rigid), the generalization will be correctly restricted to the space of the rigid transformations of the object, which is a nonlinear subspace of the hyperplane that is the space of all linear transformations of the object.

3.2.4 View interpolation

Finally, consider the predictions of the view interpolation theory. First, as with theories that rely on three-dimensional representations, effects of viewpoint on response times are expected to vary with specific implementation details. In one instance, there will be no systematic increase in response times with changes in viewpoint if the transformation (in this case, interpolation) mechanism is "one-shot" instead of incremental. In the other instance, response times will increase with increasing changes in viewpoint if the interpolation involves an incremental process, for example, a time-consuming spread of activation in a distributed implementation.

We note that while activation-spread models have been proposed as accounts of viewpoint-dependent response times in object recognition (Edelman and Weinshall, 1991), they may also offer a plausible mechanism for many so-called mental transformation phenomena. For instance, it is well documented that at the behavioral level, humans employ a transformation process commonly referred to as "mental rotation" during some perceptual judgments (Shepard and Cooper, 1982).

The explanation offered by Shepard is that such transformations are mental analogs of actual physical transformations, a hypothesis which still stimulates a major debate in cognitive science but does not seem to lead to a plausible neural or computational theory. Instead, we propose that, to the extent that a given theory of view interpolation relies on an incremental process, it may provide a plausible account of mental transformation behavioral patterns across many tasks.1

Another prediction of the view interpolation theory is a lower error rate for familiar test views than for novel test views, depending on the distance from the novel view to the nearest stored familiar view. Some variation in the error rate among the familiar views is also possible, if the stored prototypical views form a proper subset of the previously seen ones (in which case views that are closest to the stored ones will be recognized more reliably than views that have been previously seen but were not included in the representation). For deformed objects, generalization is expected to be as significant as for novel views produced by rigid transformations. Furthermore, better generalization should be obtained for test views produced by the same deformation method used in training.
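A minimal sketch of this distance-dependent prediction (illustrative feature space, centers, and σ; not a model fitted to data): familiarity is the summed activation of Gaussian view-tuned units, so it falls off smoothly with distance from the stored views:

```python
import numpy as np

def rbf_familiarity(test_view, stored_views, sigma=0.5):
    """Summed Gaussian activation of view-tuned units (a minimal HyperBF-style measure).
    Generalization falls off smoothly with distance from the nearest stored view."""
    d = np.linalg.norm(stored_views - test_view, axis=1)
    return float(np.sum(np.exp(-d ** 2 / (2 * sigma ** 2))))

# Views encoded as 2-D feature vectors for illustration; two stored views.
stored = np.array([[0.0, 0.0], [1.0, 0.0]])
near = np.array([0.1, 0.0])     # close to one familiar view
between = np.array([0.5, 0.0])  # interpolated between familiar views
far = np.array([3.0, 0.0])      # well outside the familiar range

for name, v in [("near", near), ("between", between), ("far", far)]:
    print(name, round(rbf_familiarity(v, stored), 3))
```

Views near or between the stored views yield high activation (and hence low predicted error rates), while views far from every stored view yield activation near zero.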

4 Psychophysical background

4.1 Basic vs. subordinate-level recognition

Numerous studies in cognitive science (see Rosch et al., 1976 for a review) reveal that in the hierarchical structure of object categories there exists a level of category organization, referred to as the basic level, which is the most salient according to a variety of psychological criteria (such as the ease and preference of access). Taking as an example the hierarchy "quadruped, mammal, cat, Siamese", the basic level is that of "cat". While basic-level categorical structure is unlikely to be a product of either purely definitional or perceptual mechanisms (Armstrong et al., 1983), there is some evidence that basic-level categories are organized to some extent around perceptual properties of objects. For instance, Tversky and Hemenway (1984) have proposed that the presence of common parts in similar configurations is one of the essential properties in determining category membership.

However, given this conjecture, it is clear that some apparent members of a particular basic-level category are inappropriate. For example, while robins, bluejays, and penguins all share membership in the category "bird," only the first two actually share many common parts. Both the shape and, consequently, the parts of penguins are dissimilar to those of prototypical birds. Likewise, in terms of naming performance, it is clear that the basic level fails to capture some aspects of categorization behavior; for example, the first label assigned to an image of a penguin is likely to be "penguin" rather than "bird",

1Indeed, a view interpolation account of Tarr's data on object recognition supports this proposal. Tarr (1989; Tarr and Pinker, 1989) compared directly the response time patterns obtained in recognition tasks to those obtained using identical objects in identical viewpoints in perceptual judgments known to elicit the use of mental transformations. The comparison revealed that recognition and transformation tasks yield highly similar putative rates of "rotation" as well as deviations from monotonicity. While such evidence is necessarily only circumstantial, it provides some indication that well-specified computational theories of recognition may also inform us as to the mechanisms used in other aspects of visual cognition.


a behavior consistent with the dissociation at the perceptual level. Consequently, it has been suggested that for purposes of characterizing recognition performance, the basic level should be supplanted by the entry level, the first categorical label generally assigned to a given object (Jolicoeur et al., 1984). To the extent that theories of recognition attempt to account for classificatory behavior, they do so for entry-level performance (e.g., Biederman, 1987; Hummel and Biederman, 1992).

In contrast to the entry level, objects whose recognition implies finer distinctions than those required for entry-level categorization are said to belong to a subordinate level. In terms of perceptual content, the subordinate level may be characterized by objects having similar overall shape as a consequence of sharing similar parts in similar spatial relationships. Typical examples of subordinate-level or within-category discriminations include recognizing individual faces or specific models of cars.

Crucially, the pattern of response times and error rates in recognition experiments appears to be influenced to a large extent by the category level at which the distinction between the different stimuli is to be made (Edelman, 1992). Specifically, if the subjects are required to classify the stimulus (that is, to determine its entry-level category), error rates and response times are often found to be viewpoint invariant (except in instances where the three-dimensional structure of the object is severely distorted, e.g., due to foreshortening; see Biederman 1987). In contrast, if the task is to identify a specific object (that is, to discriminate one individual from other, visually similar objects sharing parts and spatial relations), error rates and response times are normally viewpoint dependent. While this distinction is certainly true in its extreme form (for instance, objects having no parts in common will almost certainly be members of different entry-level categories and, likewise, may be discriminated by viewpoint-invariant unique features), it is less clear that "everyday" entry-level performance is mediated by viewpoint-invariant mechanisms. For example, as discussed in the following section, naming times (generally at the entry level) for familiar common objects have been found to be viewpoint dependent. More importantly, because entry-level categories are only acquired over extensive experience with many instances of each class, it is possible that multiple viewpoint-dependent representations are acquired as the category is learned. As discussed in Section 3.1, this leads to an asymmetry in the kind of conclusions that can be drawn from viewpoint-invariant performance: for familiar entry-level categories, the reliance on multiple views may mask the operation of any viewpoint-dependent mechanisms. Thus, it is difficult to assess the underlying structure of object representations through entry-level tasks employing familiar objects as stimuli.
To address this problem, we are currently undertaking several psychophysical studies in which the acquisition of entry-level categories for novel objects is manipulated in conjunction with viewpoint. To the extent that entry-level categorization is normally viewpoint invariant, such performance should be found regardless of which views have been displayed; alternatively, to the extent that entry-level categorization relies on multiple views, performance should vary systematically in relation to the views that are familiar.

4.2 Canonical views

Most familiar common objects such as houses, animals, or vehicles are recognized faster or more slowly depending on the viewpoint of the observer (as demonstrated in Figure 1). This phenomenon was originally defined in purely descriptive and qualitative terms. For instance, Palmer, Rosch and Chase (1981) found that subjects consistently labeled one or two views of such objects, designated as canonical views, as subjectively "better" than all other views. Consistent with such ratings, a naming task revealed that subjects tended to respond fastest when the stimulus was shown in a canonical view (as determined independently in the aforementioned subjective judgment experiment), with response times increasing monotonically with changes in viewpoint relative to this view. This demonstration of viewpoint-dependent naming is consistent with the hypothesis that multiple views mediate recognition even at the entry level; in particular, theories of recognition that rely on viewpoint-specific representations may accommodate such results quite naturally, while theories that rely on viewpoint-invariant representations will require added complexity solely to account for this behavior. It should be noted, however, that at the entry level, canonical views are largely a response time phenomenon (the error rate for basic-level naming, as found by Palmer et al., was very low, with the errors being slightly more frequent for the worst views than for others). In comparison, at the subordinate level canonical views are apparent in the distribution of error rates as well as response times, where they constitute strong and stable evidence in favor of the viewpoint-dependent nature of object representations (see Section 5.1). Thus, while entry-level and subordinate-level recognition may share some common representational structures, they may differ at some level of processing, for instance, in the threshold for what constitutes a correct match.

4.3 Mental transformation and its disappearance with practice

As discussed in Section 3.1, the body of evidence documenting the monotonic dependency of recognition time on viewpoint has recently been interpreted (Tarr, 1989; Tarr and Pinker, 1989; Tarr and Pinker, 1990) as an indication that objects are represented by a few specific views, and that recognition involves viewpoint normalization (via alignment, linear combinations, or HyperBFs) to the nearest stored view, by a process similar to mental rotation (Shepard and Cooper, 1982). A number of researchers have shown the differences in response time among familiar views to be transient, with much of the variability disappearing with practice (see, e.g., Jolicoeur, 1985; Koriat and Norman, 1985; Tarr, 1989; Tarr and Pinker, 1989). Thus, experience with many viewpoints of an object leads to apparent viewpoint invariance. However, to reiterate the point made


in Section 3.1, such performance is not diagnostic in that it may arise as a result either of multiple views or of a single viewpoint-invariant representation.

To distinguish between these two possibilities, Tarr and Pinker (1989; also Tarr, 1989) investigated the effect of practice on the pattern of responses in the recognition of novel objects, which are particularly suitable for this purpose because they offer the possibility of complete control over the subjects' prior exposure to the stimuli. Specifically, their experiments included three phases: training, practice, and surprise. Feedback about the correctness of their responses was provided to subjects in all phases. During training, subjects learned to identify three or four novel objects from a single viewpoint. Crucially, the stimulus objects shared similar parts in different spatial relationships, a perceptual discrimination characteristic of subordinate-level recognition. To assess the initial effects of changes of viewpoint on recognition, during practice subjects named the objects in a small select set of viewpoints.2 Consistent with the hypothesis that objects are recognized by normalization to viewpoint-specific two-dimensional representations, initial naming times and accuracy were both monotonically related to the change in viewpoint (a finding also consistent with the results of Palmer et al., 1981, and Jolicoeur, 1985). In particular, the magnitude of this effect was comparable in terms of putative rate of rotation (as measured by the slope of the response time function) to the rates found in classic studies of mental rotation (Shepard and Cooper, 1982) and to control experiments in which the same novel stimuli were discriminated on the basis of left/right handedness in identical viewpoints. However, as expected, this effect of viewpoint diminished to near-equivalent performance at all familiar viewpoints with extensive practice. At this point, the surprise phase was introduced, during which subjects named the same now-familiar objects in new, never-before-seen viewpoints as well as in previously practiced familiar viewpoints (see Fig. 3).

The surprise phase manipulation is diagnostic for distinguishing between viewpoint-invariant and viewpoint-dependent theories: the former class of theories predicts that the mechanisms used to achieve invariance for the familiar viewpoints may be used to recognize stimuli in the unfamiliar viewpoints as well, independent of viewpoint; in contrast, the latter class predicts that no such generalization will occur; rather, the viewpoint-dependent mechanisms used to match stimuli to stored familiar views will necessitate that stimuli in unfamiliar views be normalized to the stored views.

Consistent with this latter prediction, numerous experiments have revealed patterns in both response times and error rates that vary monotonically with the distance between the unfamiliar viewpoint and the nearest familiar view (Fig. 3). Importantly, the magnitude of such effects was comparable to the viewpoint effects found in the initial practice phase of each experiment, indicating that the same viewpoint-dependent mechanism was employed both when the stimuli were relatively novel and when they were highly familiar (the crucial difference being the number of views encoded per object).

2To ensure that subjects did not rely on unique features, several "distractor" objects were also included. Rather than naming such objects, subjects simply made a "none-of-the-above" response.

Indeed, as before, further experience with a wide range of views (all of the viewpoints in the surprise phase) once again led to a diminution in the effect of viewpoint on performance for those specific viewpoints, presumably because additional views were acquired with experience. Similar findings have been observed under numerous stimulus manipulations that controlled for the possibility that effects of viewpoint were the result of superfluous handedness checks, including experiments employing bilaterally symmetrical objects and cases where mirror-image pairs were treated as equivalent. Overall, these results provide strong evidence that, at least for purposes of subordinate-level recognition, objects are represented as viewpoint-specific multiple views and recognized via viewpoint-dependent normalization processes.

4.4 Limited generalization

The pattern of error rates in experiments by Rock and his collaborators (Rock and DiVita, 1987) indicates that when the recognition task can only be solved through relatively precise shape matching (such as that required for subordinate-level recognition of the bent wire-forms used as stimuli), the error rate reaches chance level already at a misorientation of about 40° relative to a familiar attitude (Rock and DiVita, 1987); see also Figure 6. A similar limitation seems to hold for people's ability to imagine the appearance of such wire-forms from unfamiliar viewpoints (Rock, Wheeler and Tudor, 1989). However, such results may represent an extreme case in terms of performance. Farah (Farah et al., 1994) observed that when Rock's wire-forms were interpolated with a smooth clay surface (creating "potato-chip" objects), subjects' recognition accuracy increased dramatically for changes in viewpoint equivalent to those tested by Rock. Thus, object shape and structure play a significant role in the ability of humans to compensate for variations in viewpoint (for instance, see Koenderink and van Doorn, 1979). One possibility is that as the structure of objects becomes more regular (in terms of properties such as spatial relations and symmetries), the ability to compensate efficiently for changes in viewpoint is enhanced, in that the resultant image structure is predictable (Vetter et al., 1994). One consequence is that error rates may be reduced and performance enhanced, although it is possible that mixed strategies or verification procedures will yield response times that are still dependent on viewpoint (as seen in the naming of familiar common objects in non-canonical views; Palmer et al., 1981).

5 Psychophysics of subordinate-level recognition

Despite the availability of data indicating that multiple views and normalization mechanisms play some role in subordinate-level recognition (Section 4.3), psychophysical research has left unanswered many of the questions vital to a computational understanding of recognition.


[Figure 3: three panels plotting response time (500-2500 ms) against degrees of rotation (10° to 340°).]

Figure 3: Mean response times for correctly naming familiar "cube" objects in familiar and unfamiliar viewpoints. Viewpoints were generated by rotations in depth (around the x or y axis) or in the picture plane (around the z axis). Filled data points represent familiar viewpoints learned during training and extensive practice; open points represent unfamiliar viewpoints introduced in the "surprise" phase of the experiment. Prior to this phase, extensive practice resulted in equivalent naming performance at all familiar viewpoints, a pattern consistent both with the acquisition of multiple viewpoint-dependent "views" and with the acquisition of a single viewpoint-invariant description. Performance in the surprise phase distinguishes between these two possibilities: naming times (and error rates) increased systematically with angular distance from the nearest familiar viewpoint, indicating that subjects represented familiar objects as multiple views and employed a time-consuming normalization process to match unfamiliar viewpoints to familiar views. One of the 7 "cube" objects is shown along with the axis of rotation to the right of each plot (data and stimuli adapted from Tarr, 1989).


[View-sphere visualization of RT = f(view angle): Session 1 (top), Session 2 (bottom).]

Figure 4: Canonical views and practice: the advantage of some views over others, as manifested in the pattern of response times (RTs) to different views of wire-like objects, is reduced with repeated exposure. The spheroid surrounding the target is a three-dimensional stereo-plot of response time vs. aspect (local deviations from a perfect sphere represent deviations of response time from the mean). The three-dimensional plot may be viewed by free-fusing the two images in each row, or by using a stereoscope. Top: target object and response time distribution for Session 1. Canonical aspects (e.g., the broadside view, corresponding to the visible pole of the spheroid) can be easily visualized using this display method. Bottom: the response time differences between views are much smaller in Session 2. Note that not only did the protrusion in the spheroid from Session 1 disappear, but the dip in the polar view is also much smaller in Session 2. Adapted from Edelman and Bülthoff, 1992.

For example, it is still unclear whether the canonical views phenomenon reflects a basic viewpoint dependence of recognition, or is due to particular patterns of the subjects' exposure to the stimuli.3 More importantly, existing data are insufficient for testing the subtler predictions of the many computational theories concerning generalization to novel views and across object deformations. Finally, the role of depth cues in recognition has been largely unexplored. The experiments described in this section were designed to address many such issues, concentrating on subordinate-level identification, which, unlike entry-level classification (Biederman, 1987), has been relatively unexplored.

All the experiments described below employed tasks in which subjects were asked to explicitly recall whether a currently displayed object had been previously presented.4 Each experiment consisted of two phases: training and testing. In the training phase subjects were shown a novel object defined as the target, usually as a motion sequence of two-dimensional views that led to an impression of three-dimensional shape through structure from motion. In the testing phase the subjects were presented with single static views of either the target or a distractor (one of a relatively large set of similar objects).

3Recent psychophysical and computational studies indicate that viewpoint dependence may be to a large extent an intrinsic characteristic of 3D shapes (Cutzu and Edelman, 1992; Weinshall et al., 1993).

The subject's task was to press a "yes"-button if the displayed object was the current target and a "no"-button otherwise, and to do so as quickly and as accurately as possible. No feedback was provided as to the correctness of the response.

4Such a judgment is commonly referred to as an "explicit" memory task. While some dissociations in performance have been found between similar explicit tasks and so-called "implicit" tasks such as priming or naming (Schacter, 1987), there is little evidence to indicate that this dissociation holds for changes across viewpoint (Cooper and Schacter, 1992). Moreover, Palmer et al.'s (1981) and Tarr's (1989; Tarr and Pinker, 1989) studies employed implicit tasks, yet still revealed robust effects of viewpoint.

5.1 Canonical views and their development with practice

To explore the first issue raised above, that of the determinants of canonical views, we tested the recognition of views all of which had been previously seen as part of the training sequence (for further details see Edelman and Bülthoff, 1992a, Experiment 1). Our stimuli proved to possess canonical views, despite the fact that in training all views appeared with equal frequency. We also found that the response times for the different views became more uniform with practice. The development of canonical views with practice is shown in Figure 4 as a three-dimensional stereo-plot of response time vs. orientation, in which local deviations from a perfect sphere represent deviations of response time from the mean. For example, the difference in response time between a "good" and a "bad" view in the first session (the dip at the pole of the sphere and the large protrusion in Fig. 4, top) decreases in the second session (Fig. 4, bottom). The pattern of error rates, in comparison, remained largely unaffected by repeated exposure.

5.2 Role of depth cues

5.2.1 Depth cues and the recognition of familiar views

A second set of experiments explored the role of three different cues to depth in the recognition of familiar views (for details, see Edelman and Bülthoff, 1992a, Experiment 2). Whereas in the previous experiment test views were two-dimensional and the only available depth cues were the shading of the objects and the interposition of their parts, we now added texture and binocular stereo to some of the test views, and manipulated the position of the simulated light source to modulate the strength of the shape-from-shading cue (cf. Bülthoff and Mallot, 1988).

The stimuli were rendered under eight different combinations of values of three parameters: surface texture (present or absent), simulated light position (at the simulated camera or to the left of it) and binocular disparity (present or absent). Training was done with maximal depth information (oblique light, texture and stereo present). Stimuli were presented using a noninterlaced stereo viewing system (StereoGraphics Corp.). A fixed set of views of each object was used both in training and in testing. We found that both binocular disparity and, to a smaller extent, light position affected performance. The error rate was lower in stereo trials than in mono trials (11.5% as opposed to 18.0%) and lower under oblique lighting than under head-on lighting (13.7% compared to 15.8%).

5.2.2 Depth cues and the generalization to novel views

A second manipulation probed the influence of binocular disparity (shown to be the strongest contributor of depth information to recognition) on the generalization of recognition to novel views (for details, see Edelman and Bülthoff, 1992, Experiment 4). The subjects were first trained on a sequence of closely spaced views of the stimuli, then tested repeatedly on a different set of views, spaced at 10° intervals (0° to 120° from a reference view at the center of the training sequence).

Figure 5: Generalization to novel views: An illustration of the inter, extra and ortho conditions. Computational theories of recognition outlined in Section 2 generate different predictions as to the relative degree of generalization in each of the three conditions. We have used this to distinguish experimentally between the different theories.

The mean error rate in this experiment was 14.0% under mono and 8.1% under stereo. In the last session of the experiment, by which time the transient learning effects had disappeared, the error rate under mono approached the error rate under stereo, except for the range of misorientation between 50° and 80°, where mono was much worse than stereo. Notably, the error rate in each of the two conditions in the last session was still significantly dependent on misorientation.

5.3 Generalization to novel views

A related experiment used an elaborate generalization task to distinguish among three classes of object recognition theories mentioned in Section 2: alignment, linear combination of views (LC), and view interpolation by basis functions (HyperBF). Specifically, we explored the dependence of generalization on the relative position of training and test views on the viewing sphere (for details, see Bülthoff and Edelman, 1992). We presented the subjects with the target from two viewpoints on the equator of the viewing sphere, 75° apart. Each of the two training sequences was produced by letting the camera oscillate with an amplitude of ±15° around a fixed axis (Fig. 5). Target test views were situated either on the equator (on the 75° or on the 360° − 75° = 285° portion of the great circle, called the inter and extra conditions), or on the meridian passing through one of the training views (ortho condition; see Fig. 5).

Figure 6: Generalization to novel views. Left panels: human subjects; right panels: RBF model. Top left: error rate vs. misorientation relative to the reference ("view-0" in Fig. 5) for the three types of test views (inter, extra and ortho), horizontal training plane. Top right: performance of the HyperBF model in a simulated replica of this experiment. Bottom left and right: same as above, except vertical training plane. Adapted from Bülthoff and Edelman, 1992.

The results of the generalization experiment, along with those of its replica involving the HyperBF model, appear in Figure 6. As expected, the subjects' generalization ability was far from perfect. The mean error rates for the inter, extra and ortho view types were 9.4%, 17.8% and 26.9%, respectively. Repeated experiments involving the same subjects and stimuli, as well as control experiments under a variety of conditions, yielded an identical pattern of error rates. The order of the mean error rates changed, however, when the training views lay in the vertical instead of the horizontal plane. The means for the inter, extra and ortho conditions were in that case 17.9%, 35.1% and 21.7%.

The experimental results fit most closely the predictions of the HyperBF scheme and contradict theories that involve three-dimensional viewpoint-invariant models or viewpoint alignment models that do not allow for errors in recognition. In particular, the differences in generalization performance between the horizontal and the vertical arrangements of training views can be accommodated within the HyperBF framework by assigning different weights to the horizontal and the vertical dimensions (equivalent to using non-radial basis functions).
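A toy computation illustrates how such dimension weights can produce an inter/ortho asymmetry (the coordinates, weights, and σ below are arbitrary assumptions; the first dimension runs along the training great circle, the second orthogonal to it):

```python
import math

def weighted_rbf(test, centers, w=(1.0, 3.0), sigma=1.0):
    """Non-radial (elliptical) basis functions: the orthogonal dimension is
    weighted more heavily, so activation falls off faster off the training plane."""
    total = 0.0
    for c in centers:
        d2 = sum(wi * (t - ci) ** 2 for wi, t, ci in zip(w, test, c))
        total += math.exp(-d2 / (2 * sigma ** 2))
    return total

training = [(0.0, 0.0), (1.3, 0.0)]  # two training views, ~75° apart on the equator (radians)
inter = (0.65, 0.0)   # test view between the training views
ortho = (0.0, 0.65)   # equally distant test view on the orthogonal meridian

print(weighted_rbf(inter, training) > weighted_rbf(ortho, training))
```

With the orthogonal dimension weighted more heavily, a test view between the training views (inter) yields higher activation, and hence better predicted generalization, than an equally distant view off the training plane (ortho).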

5.4 Generalization across deformations

In the last experiment reported in this section, we compared the generalization of recognition to novel views belonging to several different categories: those obtained from the original target object by rigid rotation, by three-dimensional affine transformation, and by non-uniform deformation (Edelman and Bülthoff, 1990; Sklar et al., 1993; Spectorov, 1993). The views in the rigid rotation category were obtained by rotation around the X axis (that is, in the sagittal plane), around the Y axis, and in the image plane. In the deformation category, the methods were shear, stretch, quadratic stretch, and non-uniform stretch, all in depth. Altogether, views obtained through seven different transformation and deformation classes were tested.

From the experimental results it appears that the degree of generalization exhibited by the human visual system is determined more by the amount of (two-dimensional) deformation as measured in the image plane (cf. Cutzu and Edelman, 1992) than by the direction and the distance between the novel and the training views in the abstract space of all views of the target object. The HyperBF scheme was recently shown to produce a similar pattern of performance (Spectorov, 1993). More generally, such findings are consistent with the conception of multiple-views object representations as being exemplar-based, and consequently, with recognition performance showing sensitivity to variations in two-dimensional image properties such as global shape, color, or illumination (Wurm et al., 1993).

Figure 7: Human performance in the recognition of rotated and deformed objects. The subjects had to attribute briefly displayed static images of isolated objects to one of two classes (17 subjects participated; data are from 24 experimental sessions, which involved 5 different object pairs; for details, see Spectorov, 1993). The four curves show mean error (miss) rate for views related to the single training view by rotation around the X, Y, and Z axes (the latter is image-plane rotation), and by deformation along the X axis (data from four deformation methods, all of which produced similar results, are collapsed for clarity). Note that both image-plane rotation and deformation were easy, and elicited near-floor error rates.

5.5 Interpretation of the experimental data: support for a view interpolation theory of recognition

The experimental findings reported above are incompatible with theories of recognition that postulate viewpoint-invariant representations. Such theories predict no differences in recognition performance across different views of objects, and therefore cannot account either for the canonical views phenomenon or for the limited generalization to novel views, without assuming that, for some reason, certain views are assigned a special status.

Modifying the thesis of viewpoint-invariant representation to allow privileged views and a built-in limit on generalization greatly weakens it, by breaking the symmetry that holds for truly viewpoint-invariant representations, in which all views, including novel ones, are equivalent.

[Figure 8 plot: classification threshold (0 to 1) vs. deformation level (1 to 4), four curves: X rot, Y rot, Z rot, X def.]

Figure 8: RBF model performance (measured by the classification threshold needed to achieve correct acceptance of all test views) in the recognition of rotated and deformed objects (for details, see Spectorov, 1993). The four curves are as in Figure 7. The wire-frame stimuli were encoded by vectors of angles formed by the various segments. Consequently, the image-plane rotation (which leaves these angles invariant) was as easy for the model as for the human subjects, but the deformations elicited somewhat worse performance (the rotations in depth were the most difficult, as they were for the humans). A choice of features other than angles may bring the performance of the model closer to that of humans.
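The point made in the Figure 8 caption, that angle-based features are untouched by image-plane rotation but not by rotation in depth, can be checked directly. The sketch below (our own illustrative code, with a made-up four-vertex wire object; not the stimuli of Spectorov, 1993) projects a 3D wire orthographically and compares the angles between consecutive segments before and after the two kinds of rotation.

```python
import numpy as np

def project(points3d):
    """Orthographic projection: drop the depth (z) coordinate."""
    return points3d[:, :2]

def segment_angles(points2d):
    """Angles between consecutive wire segments in the image plane."""
    vecs = np.diff(points2d, axis=0)
    angles = []
    for a, b in zip(vecs[:-1], vecs[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

def rot_z(t):
    """Rotation about the line of sight (image-plane rotation)."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rot_y(t):
    """Rotation in depth, about the vertical axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# A small illustrative 3D wire with four vertices.
wire = np.array([[0, 0, 0], [1, 0, 1], [1, 1, 0], [2, 1, 1]], dtype=float)
base = segment_angles(project(wire))
plane = segment_angles(project(wire @ rot_z(0.7).T))  # angles preserved
depth = segment_angles(project(wire @ rot_y(0.7).T))  # angles change
```

Here `base` and `plane` agree to machine precision, while `depth` differs, which is why an angle-encoding model finds Z-rotation trivially easy but depth rotation hard.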

Part of the findings on viewpoint-dependent recognition, including mental rotation and its disappearance with practice, and the lack of transfer of the practice effects to novel orientations or to novel objects (Tarr, 1989; Tarr and Pinker, 1989), can be accounted for in terms of viewpoint alignment (Ullman, 1989). According to Ullman's (1989) alignment explanation, the visual system represents objects by small sets of canonical views and employs a variant of mental rotation to recognize objects at attitudes other than the canonical ones. Furthermore, practice causes more views to be stored, making response times shorter and more uniform. At the same time, the pattern of error rates across views, determined largely by the second stage of the recognition process, in which the aligned model is compared to the input, remains stable due to the absence of feedback to the subject.
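The alignment account just described can be caricatured as a search over transformations that bring the input into register with a stored canonical view, with the magnitude of the required rotation standing in for mental-rotation time. The sketch below is our own toy version restricted to image-plane rotation of 2D point sets; it is not Ullman's (1989) implementation, and all names and parameters are illustrative.

```python
import numpy as np

def rotate2d(points, theta):
    """Rotate a set of 2D points about the origin by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return points @ np.array([[c, -s], [s, c]]).T

def align(input_view, canonical_views, n_steps=180):
    """Search image-plane rotations that best align the input with each
    stored canonical view; return the smallest residual mismatch and the
    rotation magnitude (a stand-in for 'mental rotation' effort)."""
    best_err, best_rot = np.inf, 0.0
    for canon in canonical_views:
        for theta in np.linspace(-np.pi, np.pi, n_steps):
            err = np.linalg.norm(rotate2d(input_view, theta) - canon)
            if err < best_err:
                best_err, best_rot = err, abs(theta)
    return best_err, best_rot

# One stored canonical view; the probe is the same shape rotated 0.5 rad.
canonical = [np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])]
probe = rotate2d(canonical[0], 0.5)
err, rotation_needed = align(probe, canonical)
```

After the search, `err` is near zero and `rotation_needed` is close to 0.5 rad; in the alignment account, storing more views after practice shrinks the rotation needed for any given input, shortening and equalizing response times.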

This explanation, however, is not compatible with the results of the generalization experiments (nor with Tarr's
