
In the document MULTIMEDIA INFORMATION EXTRACTION (Page 65-70)

VISUAL FEATURE LOCALIZATION FOR DETECTING UNIQUE OBJECTS

3.2 SCENE MATCHING IN CONSUMER IMAGES

3.2.2 Computing Image Matches

This subsection describes our approach in more detail. In particular, we examine the steps in Figure 3.1 and the algorithmic components used to determine a match score between two images. Our experimental test bench places the two images side by side, with the target image on the left and a reference image on the right. Scene matching begins by extracting scale-invariant features from each image.

Figure 3.1. Matching reference and target image by adding postprocessing steps to SIFT keypoint matches to eliminate most false positives.

3.2.2.1 Extracting and Matching Features We generate SIFT keypoints using Lowe's (2004) SIFT algorithm. Each feature has a corresponding descriptor, which considers location and localized image gradients at some scale. Multiple orientations may be extracted from the gradient directions, with one orientation assigned to each feature. Hence, there may be multiple features described for the same location, and indeed for the same location and scale. A variation of Lowe's "Fast Nearest Neighbor" search is used to find corresponding pairs of matched SIFT keypoints between the target and reference images. For each keypoint in the target image, the algorithm finds the reference keypoint that minimizes the Euclidean distance between the corresponding target and reference descriptors. If that nearest reference keypoint has no close competitors (subject to an inlier threshold), then the search returns a match to the given target keypoint.
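The nearest-neighbor descriptor matching described above can be sketched as follows. This is a minimal illustration rather than the chapter's implementation: the distance-ratio test (with Lowe's published 0.8 default) stands in for the inlier threshold mentioned above, and the function name is our own.

```python
import numpy as np

def match_descriptors(target_desc, ref_desc, ratio=0.8):
    """Nearest-neighbor matching of SIFT descriptors with a distance-ratio
    test. Returns (target_index, ref_index) pairs.

    The 0.8 ratio is Lowe's published default, standing in for the inlier
    threshold; ref_desc needs at least two rows for the runner-up check.
    """
    target_desc = np.asarray(target_desc, dtype=float)
    ref_desc = np.asarray(ref_desc, dtype=float)
    matches = []
    for i, d in enumerate(target_desc):
        dists = np.linalg.norm(ref_desc - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        best, runner_up = order[0], order[1]
        # Accept only if the nearest neighbor is clearly closer than the
        # second nearest, i.e. the best match has no close competitor.
        if dists[best] < ratio * dists[runner_up]:
            matches.append((i, int(best)))
    return matches
```

In practice the descriptors would be 128-dimensional SIFT vectors and the linear scan would be replaced by an approximate nearest-neighbor structure, as in Lowe's fast search.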

For spatial clustering of SIFT keypoints, only the subpixel location information of participating matched pairs is used. However, the set of matched keypoints must be filtered for subpixel location redundancy before proceeding to the spatial clustering step. For different target locations that map to the same reference location, the matched keypoint pair with the lowest Euclidean distance is retained; the remaining redundant match pairs are eliminated. Likewise, for redundantly matched target locations (differing only in scale and/or orientation), the keypoint pair with the lowest Euclidean distance is retained while the others are eliminated. The remaining keypoint pairs represent either probable matching object features between the two images or false positives.
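The redundancy filtering can be sketched as a two-pass deduplication. The (target_xy, ref_xy, dist) tuple representation and the function name are illustrative assumptions, not the chapter's data structures:

```python
def dedupe_matches(matches):
    """Remove redundant matched pairs: for each reference location, and then
    for each target location, keep only the pair with the smallest descriptor
    distance.

    matches: iterable of (target_xy, ref_xy, dist) tuples, where *_xy are
    hashable (x, y) subpixel locations.
    """
    # Pass 1: one surviving pair per reference location.
    by_ref = {}
    for t, r, d in matches:
        if r not in by_ref or d < by_ref[r][2]:
            by_ref[r] = (t, r, d)
    # Pass 2: one surviving pair per target location.
    by_target = {}
    for t, r, d in by_ref.values():
        if t not in by_target or d < by_target[t][2]:
            by_target[t] = (t, r, d)
    return list(by_target.values())
```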

Figure 3.2 shows a typical example of a pair of consumer images taken at different locations, showing SIFT matches without using our clustering step or applying our constraints. Note the dense match that was obtained due to the cluttered nature of these types of scenes. Using our approach described below, no matches are found between these two scenes (the correct conclusion).

3.2.2.2 Clustering Matched Features Scene matching continues by forming spatial clusters within the set of matched keypoint pairs. Keypoints within the target image are clustered first. Variance-based partitions are formed using 2D subpixel location data, while the Euclidean metric is employed for measuring the distance between keypoints. The clustering algorithm defines a criterion function, which attempts to find, for each keypoint, the partition that minimizes the mean distance of member keypoints from the partition's center. If the number of scene objects is known, then the k-means algorithm (Duda et al. 2001) may be leveraged to form spatial clusters. Typically, though, the number of feature-rich scene objects within the image is unknown.

Figure 3.2. Typical image pair (from different locations) showing SIFT matches without using our approach. No matches are found after using our approach.

Instead of using k-means to form spatial clusters of keypoints within the target image, a different iterative optimization algorithm is employed. Like k-means, ISODATA ("iterative self-organizing data") (Shapiro and Stockman 2001) is a "hard clustering" algorithm that assigns each keypoint to a specific partition.

ISODATA attempts to find, for each keypoint, the partition that minimizes the mean distance of member keypoints from the partition's center. Once ISODATA partitions all keypoints, it examines the variances of the resulting k clusters and makes one of three decisions: discard, split, or merge. Upon convergence (or termination), ISODATA yields a set of partitions that may differ in number from its initial value for k. When examining the resulting clusters, ISODATA will discard any cluster that fails a minimum membership threshold. ISODATA declassifies data members from unviable clusters (if any) and reduces the value of k to k − 1. This process continues until a minimum value of k results in all valid clusters. Thereafter, ISODATA examines each of the cluster variances, as well as the distance between neighboring clusters. ISODATA will split a large cluster into two separate groups if the collections of keypoints' attributes are too dissimilar.

Currently, our approach uses only a keypoint's 2D subpixel location attribute, so ISODATA splits a large cluster if the variation of the partition's dissimilarity metric (membership distance from the partition center) exceeds some threshold. If ISODATA indeed splits a cluster, it reclassifies the keypoint members (of the original large cluster) into one of two new partitions and increases the value of k to k + 1. If ISODATA does not split any large clusters, it goes on to examine the distance between neighboring clusters. ISODATA will merge two proximate clusters into a single partition if the two groups of keypoints share similar features; in our case, ISODATA merges two proximate clusters if their partition centers are separated by less than some distance threshold. If ISODATA indeed merges two clusters, it reclassifies the keypoint members (of those two clusters) into a single partition and reduces the value of k to k − 1. ISODATA seeks a stable value for k. If, after determining a minimum value for k (by discarding invalid clusters), the value of k changes because of a split or merge decision, ISODATA repartitions the keypoints: the algorithm starts again by recalculating the criterion function for each keypoint, with a new value for k and a new guess for the partition centers. If the value of k does not change and no keypoints change membership from one partition to another, then k has stabilized and the algorithm terminates.
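The discard/split/merge loop can be sketched as follows. This is a heavily simplified illustration of ISODATA over 2D keypoint locations: the thresholds (min_members, split_var, merge_dist) are illustrative values, not the chapter's, and the split and merge heuristics are pared down to a median split and a pairwise center merge.

```python
import numpy as np

def isodata(points, k_init=3, min_members=3, split_var=25.0,
            merge_dist=5.0, max_iter=20):
    """Simplified ISODATA over 2D keypoint locations (illustrative only).

    Each pass: hard-assign points to the nearest center, then discard
    clusters below min_members, split clusters whose summed variance
    exceeds split_var, and merge centers closer than merge_dist.
    Terminates when the set of centers stops changing.
    """
    points = np.asarray(points, dtype=float)
    centers = points[:k_init].copy()   # deterministic init for this sketch
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)          # hard clustering
        new_centers = []
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) < min_members:     # discard unviable cluster
                continue
            var = members.var(axis=0)
            if var.sum() > split_var and len(members) >= 2 * min_members:
                # Split along the axis of greatest spread, at the median.
                axis = var.argmax()
                med = np.median(members[:, axis])
                lo = members[members[:, axis] <= med]
                hi = members[members[:, axis] > med]
                if len(lo) and len(hi):
                    new_centers += [lo.mean(axis=0), hi.mean(axis=0)]
                    continue
            new_centers.append(members.mean(axis=0))
        if not new_centers:
            break
        # Merge any center into an earlier one closer than merge_dist.
        merged = []
        for c in new_centers:
            for i, m in enumerate(merged):
                if np.linalg.norm(c - m) < merge_dist:
                    merged[i] = (m + c) / 2
                    break
            else:
                merged.append(c)
        merged = np.array(merged)
        if len(merged) == len(centers) and np.allclose(merged, centers):
            break                              # k and the centers are stable
        centers = merged
    dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
    return centers, dists.argmin(axis=1)
```

A production implementation would follow the full ISODATA decision rules (per-iteration split limits, repeated merge passes) as described by Shapiro and Stockman (2001).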

Keypoints within the reference image are clustered next, using the same ISODATA clustering as the target image. After clustering the target image, though, the number of expected feature-rich objects k is known. However, some of these objects may not be visible in the reference image because of occlusion or change of viewpoint. The scenes may also not be related at all, and thus have a different number of interesting regions. So clustering the matched keypoints in the reference image is done independently of the results of clustering in the target image.

3.2.2.3 Applied Constraints Scene matching continues by creating a pseudo-confusion matrix where target clusters form rows and clusters in the reference image form columns. The matrix entries show the number of keypoint matches between the cluster indicated by the row and the one indicated by the column. For clusters in the reference image i = 1 . . . N, and clusters in the target image j = 1 . . . M, matrix entry cij is the number of point matches between cluster i and cluster j.

Thus, this matrix shows the mapping of keypoints between the clusters of the two images. For each row, membership within the target cluster is correlated to membership within each reference cluster. Our matrix construction is motivated by the use of a confusion matrix (Duda et al. 2001) in determining the precision of matching.

In a typical scenario, the confusion matrix shows the correlation between the "ground truth" and the algorithm output. The results are highly correlated when the matrix diagonal approaches identity. In our approach, however, clusters within the reference image may not be enumerated in the same order as those within the target. In addition to inconsistent cluster enumeration, the actual number of clusters may differ. That is, when ISODATA is used to build reference partitions, the number of resulting partitions in the reference image may not agree with the number of partitions in the target. This can occur when some objects are not visible in both images, or when a single cluster in one image is represented by more than one cluster in the other. Therefore, the cluster map we construct is not a true confusion matrix, and the matrix diagonal alone may not be used to judge the quality of cluster matches. However, the pseudo-confusion matrix may be leveraged to form the basis of a scene match score.
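Building the pseudo-confusion matrix from the matched keypoints and the two clusterings can be sketched as follows; the data layout (index pairs plus a cluster label per keypoint) and the function name are illustrative assumptions:

```python
import numpy as np

def cluster_map(matches, target_labels, ref_labels, n_target, n_ref):
    """Build the pseudo-confusion matrix: rows are target clusters, columns
    are reference clusters, and each entry counts the keypoint matches
    falling between that pair of clusters.

    matches: (target_keypoint_index, ref_keypoint_index) pairs;
    *_labels: cluster label per keypoint in each image.
    """
    C = np.zeros((n_target, n_ref), dtype=int)
    for t, r in matches:
        C[target_labels[t], ref_labels[r]] += 1
    return C
```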

The next step in filtering the matched keypoints is the determination of the best match for each cluster in the target image from among the clusters in the reference image. A correlation score is determined for each cluster in the target image, computed as the proportion of points in this cluster that match points in the cluster's strongest match (i.e., the reference cluster sharing the largest number of point matches with this cluster).

Clusters that have correlation scores below a threshold (empirically determined to be 0.5) are eliminated in the target image. In addition, our approach applies a minimum membership threshold as a filter. That is, scene regions are defined using some minimum number of keypoints. Entire rows and columns of the cluster map may be excluded when their respective sums fail the membership threshold. Using the terminology defined in the previous paragraph, cluster i in the reference image is eliminated if Σj cij < T, where T is the minimum membership threshold.
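The correlation-score and membership filters can be sketched together as one pass over the matrix. The min_members value is an illustrative placeholder (the chapter does not state its threshold), and a full implementation might re-check the sums after each elimination; this sketch applies each filter once:

```python
import numpy as np

def filter_clusters(C, corr_thresh=0.5, min_members=4):
    """Filter the pseudo-confusion matrix C (target rows x reference cols).

    A target cluster (row) is dropped when its strongest reference match
    holds less than corr_thresh of its matched points; rows and columns
    whose sums fall below min_members are then zeroed out.
    """
    C = C.copy()
    for j in range(C.shape[0]):
        total = C[j].sum()
        if total == 0 or C[j].max() / total < corr_thresh:
            C[j] = 0                           # correlation score too low
    C[C.sum(axis=1) < min_members, :] = 0      # target clusters too small
    C[:, C.sum(axis=0) < min_members] = 0      # reference clusters too small
    return C
```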

The remaining matrix cells are filtered further using two additional criteria. Further filtering of spurious keypoint matches is obtained by checking the direction of movement of each matched point from the target to the reference. Recall that our experimental test bench places the target and reference images horizontally side by side. Each matching keypoint forms a Cartesian angle with respect to its host image's frame of reference. Using the remaining clusters, an average keypoint angle is determined that represents the general trajectory of regions from one image to the other. To ensure that the global trajectory of points from the reference to the target is consistent (i.e., all objects in the scene move in the same general direction), keypoints that form a Cartesian angle more than a certain number of standard deviations (empirically determined to be 1.0 σ) away from the average keypoint angle are eliminated from the pseudo-confusion matrix row or column.
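The trajectory-consistency filter can be sketched as follows, assuming keypoint locations are expressed in the side-by-side test-bench coordinate frame; note that this simple sketch ignores angle wraparound at ±π:

```python
import numpy as np

def filter_by_angle(pairs, n_sigma=1.0):
    """Drop matched pairs whose displacement angle deviates from the mean
    angle by more than n_sigma standard deviations (chapter: 1.0 sigma).

    pairs: rows of (tx, ty, rx, ry) — target and reference keypoint
    locations in the shared side-by-side frame.
    """
    pairs = np.asarray(pairs, dtype=float)
    # Cartesian angle of each target -> reference displacement.
    angles = np.arctan2(pairs[:, 3] - pairs[:, 1], pairs[:, 2] - pairs[:, 0])
    keep = np.abs(angles - angles.mean()) <= n_sigma * angles.std()
    return pairs[keep]
```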

Target and reference partitions are defined as having both a spatial center and a spatial size. The size of each cluster is determined by the result of the criterion function applied by the particular iterative optimization strategy (i.e., ISODATA). The current criterion leverages only members' subpixel distance from a respective center, so the size of each scene region is inversely proportional to the density of features distributed normally about the center. Given in units of Mahalanobis distance, a spatial threshold (empirically set at 2.0 σ) eliminates keypoints from a cluster that exceed a maximum distance from the partition center.
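The Mahalanobis-distance trim can be sketched assuming a diagonal covariance (independent x and y spread), which matches the per-axis sample standard deviations used to draw the ellipses in Figure 3.3:

```python
import numpy as np

def trim_cluster(points, max_dist=2.0):
    """Drop keypoints more than max_dist Mahalanobis units from the cluster
    center (chapter threshold: 2.0 sigma), using a diagonal covariance.
    """
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    std = points.std(axis=0)
    std[std == 0] = 1.0                      # guard against zero spread
    # Per-axis standardized offsets combined into a Mahalanobis distance.
    d = np.sqrt((((points - center) / std) ** 2).sum(axis=1))
    return points[d <= max_dist]
```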

Figure 3.3 shows an example of the effect of these postprocessing steps on false keypoint matches. Note that our test bench scene-matching application draws ellipses around target and reference partition centers, where the width and height of each oval represent the sample standard deviations in x and y pixels, respectively.

The denser the partition of feature points, the smaller the ellipse. Our experiments show that the order in which the constraints are applied during the filtering process does not affect the accuracy of the final match.

3.2.2.4 Match Scores If any keypoints remain after the filtering process, this indicates a match between the reference and target image. The larger the number of keypoints left, the more reliable the match, so the likelihood of a match is proportional to the number of matching keypoints that remain. The scene match score we currently use is simply a count of the remaining features that survive all four threshold tests, that is, score = Σi Σj cij after the filtering process has removed clusters and points. However, more sophisticated match scores could be formulated that take into account the confidence level of each of the remaining matches.
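The match score is then just the sum of the surviving pseudo-confusion-matrix entries:

```python
import numpy as np

def match_score(C):
    """Scene match score: the count of keypoint matches surviving all four
    threshold tests, i.e. score = sum over i, j of c_ij."""
    return int(C.sum())
```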

Figure 3.3. Example of filtering false keypoint matches by adding constraints. For a matched pair (top) vs. a mismatched pair (bottom), the initial set of keypoint matches is shown first, and the filtered set of keypoint matches (if any) is shown next.
