2. Detecting articulate pattern classes with less predictable boundaries. Human faces and eyes are target classes with highly predictable spatial boundaries. Human bodies, on the other hand, are an example of an articulate pattern class with less predictable spatial boundaries. Our proposed object and pattern detection approach does not handle articulate pattern classes with arbitrarily shaped boundaries well, because one cannot simply "segment" these patterns from their variable background images using one of several fixed shape masks. In general, pattern segmentation is still an unsolved computer vision problem, so one may not even have a reliable means of extracting entire articulate patterns from an image for a "template" based detection approach like ours. For some time, vision researchers have considered a hierarchical approach, similar to the classifier combination scheme above, for detecting arbitrarily shaped pattern classes without explicitly performing segmentation. The approach represents an arbitrarily shaped target class as simpler components with predictable spatial boundaries. One can then use our proposed object and pattern detection scheme to deal with these simpler components, and later analyze the detection results to identify and locate instances of the complex target.

Clearly, the key issue in this hierarchical pattern detection approach is one of integrating intermediate output results for identifying full target patterns. In general, implementing an output combination stage involves: (1) finding sets of individual sub-patterns that arise from the same target; (2) devising a scheme for representing geometric relationships between the individually detected sub-patterns; and (3) determining the significance of each sub-pattern and co-occurrences of sub-patterns as cues for identifying and locating the full target. We now look at some useful tools for implementing an output combination stage.

5.2.1 Combining Sub-Pattern Detection Results with Multi-Layer Perceptron Nets

Multi-layer perceptron nets are a convenient machine architecture for learning and encoding complex relationships between individual features or sets of features in a classification task.

For a highly structured and relatively inarticulate target class like human faces, there is usually very little variation in the position and orientation of its sub-pattern components.

When detecting such target classes, one can simply take advantage of their highly predictable spatial structure to search for sub-patterns only at specific image locations; i.e., the first problem of finding and grouping together sub-patterns from the same target is trivial.

To identify instances of the target from its sub-pattern components, one can train an appropriately structured multi-layer perceptron net whose input features are all the output values from the individual sub-pattern detectors. Because we are applying each sub-pattern detector at a fixed, spatially offset image location, there is no need to recover position and orientation values for each image sub-pattern as input features to the multi-layer perceptron net classifier. Hence, the classifier training process takes care of the third output combination task above, while the second task is usually irrelevant for highly structured and relatively inarticulate target pattern classes.
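The combination step described above can be sketched as a small perceptron net whose only inputs are sub-pattern detector outputs. This is a minimal illustration, not the thesis implementation: the class name `TinyMLP`, the choice of three detectors (eye, nose, mouth), the hidden layer size, the learning rate and the toy training values are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One-hidden-layer perceptron net; inputs are sub-pattern detector outputs."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden unit holds one weight per detector plus a trailing bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out[:-1], hidden))
                      + self.w_out[-1])
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        # Cross-entropy loss with a sigmoid output: the error signal is out - target.
        delta_out = out - target
        for j, h in enumerate(hidden):
            # Back-propagate through hidden unit j before changing its output weight.
            delta_h = delta_out * self.w_out[j] * h * (1.0 - h)
            self.w_out[j] -= lr * delta_out * h
            for i, v in enumerate(x):
                self.w_hidden[j][i] -= lr * delta_h * v
            self.w_hidden[j][-1] -= lr * delta_h
        self.w_out[-1] -= lr * delta_out


# Hypothetical training data: each row holds (eye, nose, mouth) detector outputs
# sampled at fixed offsets inside a candidate window; label 1.0 = target present.
SAMPLES = [
    ([0.90, 0.80, 0.95], 1.0),
    ([0.85, 0.90, 0.70], 1.0),
    ([0.10, 0.20, 0.05], 0.0),
    ([0.20, 0.10, 0.30], 0.0),
]

net = TinyMLP(n_inputs=3, n_hidden=4)
for _ in range(2000):
    for features, label in SAMPLES:
        net.train_step(features, label)
```

Because every detector is applied at a fixed offset, the input vector carries no position or orientation terms; the net only has to learn how much weight each detector output, and each co-occurrence of outputs, deserves.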

5.2.2 Handling Sub-Patterns with Variable Position and Orientation

Unlike human faces, human bodies are an example of an articulate pattern class with less predictable spatial boundaries. One can still represent an articulate target class as simpler components with predictable spatial boundaries that can each be independently detected in an image. However, there can be a significant amount of variation in the position and orientation of each sub-pattern component. So, to identify and locate these arbitrarily shaped target patterns from their simpler components, one must also deal with the first and second output combination issues described above; i.e., the problems of finding sub-patterns in an image that belong to the same target, and representing geometric relationships between them as additional cues for identifying the target.

We shall first address the second and simpler issue by introducing position and orientation attributes for each sub-pattern as additional classification features. If one uses a "template matching" like detection paradigm that tests for the target at candidate image locations and orientations, then one can represent each sub-pattern's position in a translationally invariant fashion as its spatial offset from the hypothesized target center. Similarly, one can also describe the orientation of each sub-pattern in a rotationally invariant fashion: define a fixed reference direction for each sub-pattern class, and compute for each image sub-pattern its angular displacement with respect to the hypothesized target orientation.

Thus, for an articulate target class, we recover for each of its components a set of output combination features that includes a detection output value and an additional vector of translationally and rotationally invariant position and orientation attributes. The additional position and orientation attributes help capture geometric relationships between the individually detected sub-patterns, which the output combination stage relies on as additional cues for identifying the target.
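As a rough sketch of how such output combination features might be assembled, the code below computes, for each detected sub-pattern, its spatial offset from a hypothesized target center and its angular displacement from a hypothesized target orientation. The function names, the dictionary layout of a detection and the choice to keep the offset in image coordinates are assumptions made for illustration; the text does not prescribe a particular encoding.

```python
import math


def wrap_angle(theta):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (theta + math.pi) % (2.0 * math.pi) - math.pi


def invariant_attributes(sub_pos, sub_angle, target_center, target_angle):
    """Return (dx, dy, dtheta) for one sub-pattern under one target hypothesis.

    sub_angle is the direction of the sub-pattern's fixed reference axis in the
    image; target_angle is the hypothesized target orientation (both radians).
    """
    # Offset from the hypothesized center: unchanged if the whole scene shifts.
    dx = sub_pos[0] - target_center[0]
    dy = sub_pos[1] - target_center[1]
    # Angular displacement: unchanged if sub-pattern and target rotate together.
    dtheta = wrap_angle(sub_angle - target_angle)
    return dx, dy, dtheta


def combination_features(detections, target_center, target_angle):
    """Concatenate [score, dx, dy, dtheta] for every detected sub-pattern."""
    features = []
    for det in detections:
        dx, dy, dtheta = invariant_attributes(
            det["pos"], det["angle"], target_center, target_angle)
        features.extend([det["score"], dx, dy, dtheta])
    return features
```

A call such as `combination_features([{"pos": (12.0, 7.0), "angle": 0.3, "score": 0.9}], (10.0, 5.0), 0.1)` yields one four-element block per sub-pattern, which could then be passed to a classifier like the perceptron net of Section 5.2.1.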

5.2.3 Finding Sets of Sub-Patterns from an Articulate Target

The discussion in Section 5.2.2 assumes a reasonable scheme for finding and grouping together sub-patterns in an image that belong to the same articulate target object. Unfortunately, we believe existing techniques for performing such a task in general are still very much ad hoc and unreliable at best. We conclude this thesis by referring to two current areas of research that may lead to feasible and robust search paradigms for image sub-patterns. Our discussion here will, however, only be speculative and brief, since current research trends in both these areas still appear open and highly exploratory in nature.

Grouping

Grouping [57] [49] is a process that identifies sets of features in a cluttered image likely to have arisen from a single object. It serves mainly as a pre-processing stage that speeds up object recognition by reducing the number of different image feature combinations the recognition stage has to consider while testing for the target. Lowe [57] first demonstrated the idea on an early computational recognition system that makes explicit use of grouping. Jacobs [49] later extended the idea to a geometric model-based object recognition domain. Typical grouping schemes operate on simple low-level image features that their object recognition systems use, such as intensity edges, junctions and corners. Often, these schemes rely heavily on prior assumptions about feature configurations that make them likely target candidates. Such assumptions may be based on domain specific knowledge, like common features that tend to co-exist in the target, or "general" observations like certain image feature configurations being more "salient" in real scenes.

In the hierarchical pattern detection scheme we are considering here, one can treat the sub-pattern components of an articulate target class as highly specific and sophisticated model features, similar in spirit to the simpler features used by current object recognition and grouping systems. We have argued earlier that one can reliably detect these sophisticated features using our proposed object and pattern detection approach from Chapter 2. We believe one can also develop similar grouping based techniques to identify salient sets of these sophisticated image features that are likely parts from the same target.
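To make the idea of grouping sub-pattern detections concrete, here is one deliberately simple possibility: single-linkage grouping by spatial proximity. This is not the method of Lowe [57] or Jacobs [49], and the text does not commit to any particular grouping cue; the function name `group_by_proximity` and the distance threshold are hypothetical and serve only to illustrate the kind of pre-processing step being discussed.

```python
import math


def group_by_proximity(points, max_dist):
    """Single-linkage grouping: points within max_dist of each other share a group.

    points is a list of (x, y) detection centers; the result is a sorted list
    of groups, each group being a list of indices into points.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each resulting group is a candidate set of sub-patterns that a later stage could test against a target hypothesis, so the recognition stage considers far fewer combinations than the full set of detections. A practical scheme would likely need richer cues than distance alone, such as the domain specific co-occurrence knowledge mentioned above.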

Source document: Massachusetts Institute of Technology, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, A.I.T.R. No. January, (pages 168-171)