
HAL Id: hal-01152483

https://hal.inria.fr/hal-01152483

Submitted on 17 May 2015


Akihiko Torii, Josef Sivic, Masatoshi Okutomi, Tomas Pajdla

To cite this version:

Akihiko Torii, Josef Sivic, Masatoshi Okutomi, Tomas Pajdla. Visual Place Recognition with Repetitive Structures. IEEE Transactions on Pattern Analysis and Machine Intelligence, Institute of Electrical and Electronics Engineers, 2015, pp. 1-14. 10.1109/TPAMI.2015.2409868. hal-01152483


Visual Place Recognition with Repetitive Structures

Akihiko Torii, Member, IEEE, Josef Sivic, Member, IEEE, Masatoshi Okutomi, Member, IEEE, and Tomas Pajdla, Member, IEEE

Abstract

Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. They violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. In this work we show that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval and geometric verification. The retrieval is based on robust detection of repeated image structures and a suitable modification of weights in the bag-of-visual-word model. We also demonstrate that the explicit detection of repeated patterns is beneficial for robust visual word matching for geometric verification. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline as well as the more recently proposed burstiness weighting and Fisher vector encoding.

Index Terms

Place Recognition, Bag of Visual Words, Geometric Verification, Image Retrieval.

A. Torii is with the Department of Mechanical and Control Engineering, Graduate School of Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan.

E-mail: torii@ctrl.titech.ac.jp

J. Sivic is with Inria, WILLOW, Département d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548, Paris, France.

E-mail: Josef.Sivic@ens.fr

M. Okutomi is with the Department of Mechanical and Control Engineering, Graduate School of Science and Engineering, Tokyo Institute of Technology, Tokyo, Japan.

E-mail: mxo@ctrl.titech.ac.jp

T. Pajdla is with the Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague.

E-mail: pajdla@cmp.felk.cvut.cz



1 INTRODUCTION

GIVEN a query image of a particular street or a building, we seek to find one or more images in the geotagged database depicting the same place. The ability to visually recognize a place depicted in an image has a range of potential applications including automatic registration of images taken by a mobile phone for augmented reality applications [1] and accurate visual localization for robotics [2]. Scalable place recognition methods [2], [3], [4], [5], [6] often build on the efficient bag-of-visual-words representation developed for object and image retrieval [7], [8], [9], [10], [11], [12]. In an offline pre-processing stage, local invariant descriptors are extracted from each image in the database and quantized into a pre-computed vocabulary of visual words. Each image is represented by a sparse (weighted) frequency vector of visual words, which can be stored in an efficient inverted file indexing structure. At query time, after the visual words are extracted from the query image, the retrieval proceeds in two steps. First, a short-list of top-ranked candidate images is obtained from the database using the bag-of-visual-words representation. Then, in the second verification stage, candidates are re-ranked based on the spatial layout of visual words.

A number of extensions of this basic architecture have been proposed. Examples include: (i) learning better visual vocabularies [13], [14]; (ii) developing quantization methods less prone to quantization errors [15], [16], [17]; (iii) combining returns from multiple query images depicting the same scene [7], [18]; (iv) exploiting the 3D or graph structure of the database [19], [20], [21], [22], [23], [24]; or (v) indexing on spatial relations between visual words [25], [26], [27].

In this work we develop a scalable representation for large-scale matching of repeated structures. While repeated structures often occur in man-made environments – examples include building facades, fences, or road markings – they are usually treated as a nuisance and down-weighted at the indexing stage [4], [8], [28], [29]. In contrast, we develop a simple but efficient representation of repeated structures and demonstrate its benefits for place recognition in urban environments. In detail, we first robustly detect repeated structures in images by finding spatially localized groups of visual words with similar appearance. Next, we modify the weights of the detected repeated visual words in the bag-of-visual-words model, where multiple occurrences of repeated elements in the same image provide a natural soft-assignment of features to visual words. In addition, the contribution of repetitive structures is controlled to prevent them from dominating the matching scores. This is illustrated in Figure 1.

Fig. 1. Overview of visual place recognition with repetitive structures. Left: We detect groups of repeated local features (overlaid in colors). Middle: Repetitive features (shown in red) implicitly provide soft-assignment to multiple visual words (here A and C). Right: Truncating large weights (shown in red) in the bag-of-visual-words vectors prevents repetitions from dominating the matching score.

Finally, we develop a geometric verification method that takes into account the detected repetitions and suppresses ambiguous tentative correspondences between repeated image patterns.

The paper is organized as follows. After describing related work on finding and matching repeated structures (Section 1.1), we review in detail (Section 2) the common tf-idf visual word weighting scheme and its extensions to soft-assignment [16] and repeated structure suppression [8]. In Section 3, we describe our method for detecting repeated visual words in images. In Section 4, we describe the proposed model for scalable matching of repeated structures, and in Section 5, we detail the proposed geometric matching that takes into account the detected repetitions. Experiments demonstrating the benefits of the developed representations are given in Section 6.

1.1 Related work

Detecting repeated patterns in images is a well-studied problem. Repetitions are often detected based on an assumption of a single pattern repeated on a 2D (deformed) lattice [30], [31], [32]. Special attention has been paid to detecting planar patterns [33], [34], [35] and in particular building facades [3], [36], [37], for which highly specialized grammar models, learnt from labelled data, were developed [38], [39].

Detecting planar repeated patterns can be useful for single-view facade rectification [3] or even single-view 3D reconstruction [40]. However, the local ambiguity of repeated patterns often presents a significant challenge for geometric image matching [34], [41] and image retrieval [8].


Schindler et al. [34] detect repeated patterns on building facades and then use the rectified repetition elements together with the spatial layout of the repetition grid to estimate the camera pose of a query image, given a database of building facades. Results are reported on a dataset of 5 query images and 9 building facades. In a similar spirit, Doubek et al. [42] detect the repeated patterns in each image and represent the pattern using a single shift-invariant descriptor of the repeated element together with a simple descriptor of the 2D spatial layout. Their matching method is not scalable as they have to exhaustively compare repeated patterns in all images. In scalable image retrieval, Jégou et al. [8] observe that repeated structures violate the feature independence assumption in the bag-of-visual-words model and test several schemes for down-weighting the influence of repeated patterns.

This paper is an extended version of [43] with a detailed description of the proposed algorithms, a new geometric verification method that takes into account the detected repetitive structures (Section 5), and additional experiments.

2 REVIEW OF VISUAL WORD WEIGHTING STRATEGIES

In this section we first review the basic tf-idf weighting scheme proposed in text retrieval [44] and also commonly used for bag-of-visual-words retrieval and place recognition [3], [4], [7], [8], [10], [11], [12], [26].

Then, we discuss the soft-assignment weighting [16] to reduce quantization errors and the "burstiness" model proposed by Jégou et al. [8], which explicitly down-weights repeated visual words in an image.

2.1 Term frequency–inverse document frequency weighting

The standard "term frequency–inverse document frequency" (tf-idf) weighting [44] is computed as follows. Suppose there is a vocabulary of $M$ visual words; then each image is represented by a vector

$$\mathbf{y}_d = (y_1, \dots, y_t, \dots, y_M) \qquad (1)$$

of weighted visual word frequencies with components

$$y_t = \frac{n_{td}}{n_d}\,\log\frac{N}{N_t}, \qquad (2)$$

where $n_{td}$ is the number of occurrences of visual word $t$ in image $d$, $n_d$ is the total number of visual words in image $d$, $N_t$ is the number of images containing term $t$, and $N$ is the number of images in the whole database.

The weighting is a product of two terms: the visual word frequency, $n_{td}/n_d$, and the inverse document (image) frequency, $\log(N/N_t)$. The word frequency weights words occurring more often in a particular image higher (compared to visual word present/absent), whilst the inverse document frequency down-weights visual words that appear often in the database, and therefore do not help to discriminate between different images. The generalization of the idf weighting has been recently proposed by Zheng et al. [45].

At the retrieval stage, images are ranked by the normalized scalar product (cosine of the angle)

$$\mathrm{score}_d = \frac{\mathbf{y}_q^\top \mathbf{y}_d}{\|\mathbf{y}_q\|_2\,\|\mathbf{y}_d\|_2} \qquad (3)$$

between the query vector $\mathbf{y}_q$ and all image vectors $\mathbf{y}_d$ in the database, where $\|\mathbf{y}\|_2 = \sqrt{\mathbf{y}^\top\mathbf{y}}$ is the $L_2$ norm of $\mathbf{y}$. The scalar product in Equation (3) can be implemented efficiently using inverted file indexing schemes.
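To make Equations (1)-(3) concrete, here is a minimal illustrative sketch (not the authors' code; the variable names and toy data are ours) that builds tf-idf vectors from visual word counts and ranks database images by the normalized scalar product. A real system would use a sparse inverted file rather than dense vectors.

```python
import numpy as np

def tfidf_vector(word_counts, idf):
    """Eq. (2): y_t = (n_td / n_d) * log(N / N_t); idf[t] = log(N / N_t) is precomputed."""
    return (word_counts / word_counts.sum()) * idf

def rank_database(query_counts, db_counts, idf):
    """Rank database images by the cosine similarity of Eq. (3)."""
    yq = tfidf_vector(query_counts, idf)
    scores = []
    for counts in db_counts:
        yd = tfidf_vector(counts, idf)
        denom = np.linalg.norm(yq) * np.linalg.norm(yd)
        scores.append(yq @ yd / denom if denom > 0 else 0.0)
    return np.argsort(scores)[::-1]          # best-matching images first

# Toy example: a vocabulary of 5 visual words and 3 database images.
db = np.array([[2, 0, 1, 0, 0], [0, 3, 0, 1, 0], [1, 1, 1, 1, 1]], dtype=float)
N, Nt = db.shape[0], np.count_nonzero(db, axis=0)
idf = np.log(N / Nt)
print(rank_database(np.array([2.0, 0, 1, 0, 0]), db, idf))   # image 0 ranks first
```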

2.2 Soft-assignment weighting

Visual words generated through descriptor clustering often suffer from quantization errors, where local feature descriptors that should be matched but lie close to the Voronoi boundary are incorrectly assigned to different visual words. To overcome this issue, Philbin et al. [16] soft-assign each descriptor to several (typically 3) closest cluster centers with weights set according to $\exp\!\left(-\frac{d^2}{2\sigma^2}\right)$, where $d$ is the Euclidean distance of the descriptor from the cluster center and $\sigma$ is a parameter of the method.
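For reference, a small sketch of the Gaussian soft-assignment weights described above (illustrative only; the value of sigma and the normalization to unit sum are our assumptions, as these choices are dataset dependent):

```python
import numpy as np

def soft_assign(descriptor, centers, k=3, sigma=100.0):
    """Assign a descriptor to its k closest cluster centers with weights
    exp(-d^2 / (2 sigma^2)), where d is the Euclidean distance to the center."""
    d = np.linalg.norm(centers - descriptor, axis=1)
    nearest = np.argsort(d)[:k]
    w = np.exp(-d[nearest] ** 2 / (2 * sigma ** 2))
    return nearest, w / w.sum()               # unit-sum normalization is a common choice
```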

2.3 Burstiness weighting

Jégou et al. [8] studied the effect of visual "burstiness", i.e. that a visual word is much more likely to appear in an image if it has appeared in the image already. Burstiness has also been studied for words in text [46]. By counting visual word occurrences in a large corpus of 1M images, Jégou et al. observe that visual words occurring multiple times in an image (e.g. on repeated structures) violate the assumption that visual word occurrences in an image are independent. Further, they observe that the bursted visual words can negatively affect retrieval results. The intuition is that the contribution of visual words with a high number of occurrences towards the scalar product in Equation (3) is too high. In the voting interpretation of the bag-of-visual-words model [26], bursted visual words vote multiple times for the same image. To see this, consider an example where a particular visual word occurs twice in a query and five times in a database image. Ignoring the normalization of the visual word vectors for simplicity, multiplying the numbers of occurrences as in (3) would result in 10 votes, whereas in practice only up to two matches (correspondences) can exist.

To address this problem, Jégou et al. proposed to down-weight the contribution of visual words occurring multiple times in an image, which is referred to as intra-image burstiness. They experimented with different weighting strategies and empirically observed that down-weighting repeated visual words by multiplying the term frequency in Equation (2) by the factor $1/\sqrt{n_{td}}$, where $n_{td}$ is the number of occurrences, performed best. Similar strategies to discount repeated structures when matching images were also used in [28], [29].
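As a one-line illustration of the intra-image burstiness discount described above (our reading of [8], assuming the $1/\sqrt{n_{td}}$ factor; not the authors' code):

```python
import numpy as np

def burstiness_tf(word_counts):
    """Discounted term frequency: n_td * (1 / sqrt(n_td)) = sqrt(n_td)."""
    return np.sqrt(np.asarray(word_counts, dtype=float))
```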

Note that Jégou et al. also considered a more precise description of local invariant regions quantized into visual words using an additional binary signature (Hamming embedding) [26], more precisely localizing the descriptor in the visual word Voronoi cell. The Hamming embedding is complementary to the method developed in this paper.

Contrary to down-weighting repeated structures based on globally counting feature repetitions across the entire image, we (i) explicitly detect localized image areas with repetitive structures, and (ii) use the detected local repetitions to adaptively adjust the visual word weights in the soft-assigned bag-of-visual-words model.

The two steps are described next.

3 DETECTION OF REPETITIVE STRUCTURES

The goal is to segment local invariant features detected in an image into localized groups of repetitive patterns and a layer of non-repeated features (see Figure 2). Examples include detecting repeated patterns of windows on different building facades, fences, road markings or trees in an image (see Figure 3). We will operate directly on the extracted local features (rather than using specially designed features [36]) as the detected groups will be used to adjust feature weights in the bag-of-visual-words model for efficient indexing. The feature segmentation problem is posed as finding connected components in a graph.

In detail, we build an (undirected) feature graph $G = (\mathcal{F}, \mathcal{E})$ with $N$ vertices $\mathcal{F} = \{f_i\}_{i=1}^{N}$ consisting of local invariant features at locations $x_i$, scales $s_i$ and with corresponding SIFT descriptors $d_i$. Each SIFT descriptor is further assigned to the $K$ nearest visual words $\mathcal{W}_i^K = \{w_i^k\}_{k=1}^{K}$ in a pre-computed visual vocabulary (see Section 6 for details). Two vertices (features) $f_i$ and $f_j$ are connected by an edge if they have close-by image positions as well as similar scale and appearance. More formally, a pair of vertices $f_i$ and $f_j$ is connected by an edge if the following three conditions are satisfied:

1) The spatial $L_2$ distance $\|x_i - x_j\|_2$ between the features satisfies $\|x_i - x_j\|_2 < \gamma(s_i + s_j)$, where $\gamma$ is a constant (we set $\gamma = 10$ throughout all our experiments);

2) The ratio $\sigma$ of the scales of the two features satisfies $0.5 < \sigma < 2$;

3) The features share at least one common visual word in their top $K$ visual word assignments, where $K$ is typically 50. Note that this condition avoids directly thresholding the distance between the SIFT descriptors of the two features, which we found unreliable.

Having built the graph, we group the vertices (image features) into disjoint groups by finding connected components of the graph [47]. These connected components group together features that are spatially close and similar in appearance as well as in scale.

Fig. 2. Examples of detected repetitive patterns of local invariant features ("repttiles"). The different repetitive patterns detected in each image are shown in different colors. The detection is robust against local deformation of the repeated element and makes only weak assumptions on the spatial structure of the repetition.

Algorithm 1: Repttile detection

Input: $\mathcal{C}$: visual word centroids; $\mathcal{F}$: $N$ vertices (features) consisting of $x_i$: location, $s_i$: scale, $d_i$: descriptor.
Output: $\mathcal{E}$: edges of the undirected graph $G$.

1: Initialize $e_{ij} \in \mathcal{E} := \emptyset$.
2: for $i = 1, \dots, N$ do
3:   For descriptor $d_i$ find the $K$ nearest visual words $\mathcal{W}_i^K = \{w_i^k\}_{k=1}^{K}$ from vocabulary $\mathcal{C}$.
4: for $i = 1, \dots, N$ do
5:   For feature $f_i$ retrieve matching features $f_j$ s.t. $j \in \mathcal{J} : \mathcal{W}_i^K \cap \mathcal{W}_j^K \neq \emptyset$.
6:   for $j \in \mathcal{J}$ do
7:     if $e_{ij} = \mathrm{FALSE}$ then
8:       if $0.5 < s_i/s_j < 2$ then
9:         if $\|x_i - x_j\| < \gamma(s_i + s_j)$ then
10:          $e_{ij} = e_{ji} := \mathrm{TRUE}$.  % Create edge.

In the following, we will call the detected feature groups "repttiles", for "tiles (regions) of repetitive features". The repttile detection is summarized in Algorithm 1.
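A compact sketch of the repttile detection of Algorithm 1 (illustrative only; it assumes feature positions, scales and precomputed K-nearest visual word indices as NumPy arrays, and uses a simple union-find to extract the connected components):

```python
import numpy as np
from collections import defaultdict

def repttile_detection(xy, scales, knn_words, gamma=10.0):
    """Group features into repttiles: connected components of the feature graph.

    xy        : (N, 2) feature locations x_i
    scales    : (N,)   feature scales s_i
    knn_words : (N, K) indices of the K nearest visual words of each feature
    Returns an array of component labels, one per feature.
    """
    N = len(scales)
    parent = np.arange(N)                              # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]              # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Index features by visual word so that edge candidates share at least one word.
    by_word = defaultdict(list)
    for i in range(N):
        for w in knn_words[i]:
            by_word[int(w)].append(i)

    for members in by_word.values():
        for a in range(len(members)):
            i = members[a]
            for b in range(a + 1, len(members)):
                j = members[b]
                if not (0.5 < scales[i] / scales[j] < 2.0):       # similar scale
                    continue
                if np.linalg.norm(xy[i] - xy[j]) < gamma * (scales[i] + scales[j]):
                    union(i, j)                                    # spatially close

    return np.array([find(i) for i in range(N)])
```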

Figures 2 and 3 show a variety of examples of detected patterns of repeated features. Only connected components with more than 20 image features are shown as colored dots.


Fig. 3. Examples of detected repetitive patterns of local invariant features (“repttiles”) in images from the INRIA Holidays dataset [8]. The different repetitive patterns detected in each image are shown in different colors. The color indicates the number of features in each group (red indicates large and blue indicates small groups). Note the variety of detected repetitive structures such as different building facades, trees, indoor objects, window tiles or floor patterns.

Note that the proposed method makes only weak assumptions on the type and spatial structure of the repetitions, not requiring or attempting to detect, for example, feature symmetry or an underlying spatial lattice.

4 REPRESENTING REPETITIVE STRUCTURES FOR SCALABLE RETRIEVAL

In this section we describe our image representation for efficient indexing, taking into account the repetitive patterns. The proposed representation is built on two ideas. First, we aim at representing the presence of a repetition, rather than measuring the actual number of matching repeated elements. Second, we note that different occurrences of the same visual element (such as a facade window) are often quantized to different visual words because of noise in the description and quantization process as well as other non-modeled effects such as complex illumination (shadows) or perspective deformation. We take advantage of this fact and design a descriptor quantization procedure that adaptively soft-assigns local features with more repetitions in the image to fewer nearest cluster centers. The intuition is that the multiple examples of a repeated feature provide a natural and accurate soft-assignment to multiple visual words.

Formally, an image is represented by a bag-of-visual-words vector

$$\mathbf{z}_d = (z_1, \dots, z_t, \dots, z_M) \qquad (4)$$

where the $t$-th visual word weight

$$z_t = \begin{cases} r_t & \text{if } 0 \le r_t < T \\ T & \text{if } T \le r_t \end{cases} \qquad (5)$$

is obtained by thresholding the weight $r_t$ by a threshold $T$. Note that the weighting described in Equation (5) is similar to burstiness weighting, which down-weights repeating visual words. Here, however, we represent highly weighted (repeating) visual words with a constant $T$, as the goal is to represent the occurrence (presence/absence) of the visual word, rather than measuring the actual number of occurrences (matches).

The weight $r_t$ of the $t$-th visual word in image $d$ is obtained by aggregating weights from adaptively soft-assigned features across the image, taking into account the repeated image patterns. In particular, each feature $f_i \in \mathcal{F}_d$ detected in image $d$ is assigned to the $\alpha_i$ nearest (in the feature space) visual words $\mathcal{W}_i^{\alpha_i} = \{w_i^k\}_{k=1}^{\alpha_i}$. Thus, $w_i^k$, for $1 \le k \le \alpha_i$, is the index of the $k$-th nearest visual word of $f_i$. The number $\alpha_i$, which varies between 1 and $\alpha_{\max}$, will be defined below. The weight $r_t$ is computed as

$$r_t = \sum_{i=1}^{N} \sum_{k=1}^{\alpha_i} \mathbb{1}[w_i^k = t]\,\frac{1}{2^{k-1}} \qquad (6)$$

where the indicator function $\mathbb{1}[w_i^k = t]$ is equal to 1 if visual word $t$ is present at the $k$-th position in $\mathcal{W}_i^{\alpha_i}$. This means that the weight $r_t$ is obtained as the sum of contributions from all assignments of visual word $t$ over all features in $\mathcal{F}_d$. The contribution of an individual assignment depends on the order $k$ of the assignment in $\mathcal{W}_i^{\alpha_i}$ through the weight $1/2^{k-1}$. The number $\alpha_i$ is computed by the following formula

$$\alpha_i = \left\lceil \alpha_{\max}\,\frac{\log(n_d/m_i + 1)}{\max_{i \in \mathcal{F}_d}\log(n_d/m_i + 1)} \right\rceil \qquad (7)$$

where $\alpha_{\max}$ is the maximum number (upper bound) of assignments ($\alpha_{\max} = 3$ in all our experiments), $m_i$ is the number of features in the repttile to which $f_i$ belongs, and $n_d$ is the total number of features in the image $d$. We use $\lceil x \rceil = \mathrm{ceiling}(x)$, i.e. $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. This adaptive soft-assignment is summarized in Algorithm 2.


Algorithm 2: BoVW weighting with adaptive assignment

Input: $\mathcal{W}_i^K$: $K$ nearest visual words for each feature $i$ in $\mathcal{F}$.
Output: $r_t$: weights of the bag-of-visual-words vector.

1: Initialize $r_t := 0$ for $t = 1, \dots, M$.
2: for $i = 1, \dots, N$ do
3:   Compute the number of assignments $\alpha_i$ by Eq. (7).
4:   for $k = 1, \dots, \alpha_i$ do
5:     $p := w_i^k \in \mathcal{W}_i^K$.  % Retrieve the k-th visual word.
6:     $r_p := r_p + 1/2^{k-1}$.
7: Apply the thresholding in Eq. (5) to $r_t$ for $t = 1, \dots, M$.

Note that image features belonging to relatively larger repttiles are soft-assigned to fewer visual words, as image repetitions provide a natural soft-assignment of the particular repeating scene element to multiple visual words (see Figures 4 and 5). This adaptive soft-assignment is more precise and less ambiguous than the standard soft-assignment to multiple nearest visual words [16], as will be demonstrated in Section 6.
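To make Equations (5)-(7) and Algorithm 2 concrete, a minimal sketch (ours, following the reconstruction of Eq. (7) above; repttile sizes m_i are derived from the component labels returned by the detection step):

```python
import numpy as np

def adaptive_weights(knn_words, labels, M, alpha_max=3, T=1.0):
    """Bag-of-visual-words weights with adaptive soft-assignment.

    knn_words : (N, K) nearest visual words per feature, K >= alpha_max
    labels    : (N,)   repttile label of each feature
    M         : vocabulary size
    """
    N = len(labels)
    # m_i: number of features in the repttile containing feature i
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    m = counts[inverse].astype(float)
    n_d = float(N)

    # Eq. (7): features in larger repttiles receive fewer assignments.
    score = np.log(n_d / m + 1.0)
    alpha = np.ceil(alpha_max * score / score.max()).astype(int)

    # Eq. (6): accumulate contributions 1 / 2^(k-1) over the alpha_i assignments.
    r = np.zeros(M)
    for i in range(N):
        for k in range(alpha[i]):
            r[knn_words[i, k]] += 1.0 / 2 ** k

    return np.minimum(r, T)                    # Eq. (5): truncate large weights
```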

5 GEOMETRIC VERIFICATION WITH REPTTILE DETECTION

In this section we describe our geometric verification method that takes advantage of the detected repeated patterns. Similarly to standard image retrieval, the place recognition accuracy can be significantly improved by re-ranking the retrieved images based on the geometric consistency of local features [11]. The retrieved images in the initial shortlist are typically re-ranked using the number of inliers that are consistent with a 2D affine geometric transformation, a homography, or the epipolar constraint.

The geometric verification has two steps: (i) generating tentative matches based on local feature appearance, and (ii) finding subsets of matches consistent with the geometric transformation. The tentative matches are typically found by matching raw (SIFT) feature descriptors [48] or via visual word matching [11], [12]. For large-scale matching problems, visual word matching is preferred due to its memory efficiency, as we can store only 4 bytes for an unsigned-int32 visual word ID instead of 128 bytes for an unsigned-int8 SIFT descriptor.

However, as reported in [11], [49], visual word matching usually results in some drop in matching accuracy due to quantization effects. In addition, when using visual words it is not straightforward to find the most distinctive matches, as can be done with raw descriptors using Lowe's first-to-second nearest neighbor ratio test [48]. We develop a method for generating tentative matches for geometric verification that addresses both these issues.

First, to overcome the quantization effects, we assign descriptors to multiple visual words (multiple assignment). This is similar to [16] but, to minimize false matches, we take advantage of the detected repeated patterns.

Fig. 4. Examples of adaptive soft-assignment with "repttile" detection in images from the Pittsburgh dataset. (Top) The repttiles composed of more than 20 image features are shown in different colors (red indicates large and blue indicates small groups, similarly to Figure 3). (Bottom) The number of visual word assignments of each feature is adaptively defined by the number of features in the repttile, as in Equation (7). The color indicates the number of multiple assignments: red = 1, green = 2 and blue = 3. Features belonging to larger repttiles are assigned to fewer visual words (red), but discriminative features (blue and green) are assigned to multiple visual words (up to 3).

Fig. 5. Examples of adaptive soft-assignment with "repttile" detection in images from the San Francisco dataset. See the caption of Figure 4 for details.

In particular, we only compute tentative matches from the distinct visual words with a limited amount of repetition in the image. This is implemented by removing features from the largest repttiles (those where $\alpha_i = 1$, see Section 4). The remaining features are assigned to the multiple ($K$) nearest visual words, where $K$ is typically 50. The multiple assignment is performed only on the query side, and therefore it is not necessary to store $K$ visual words for all database images as in [16]. Note that no additional computations are required, as the $K$ nearest visual words are already computed during the repttile detection (Section 3). The results in Section 6 demonstrate that it is beneficial to remove the heavily repeated features during geometric verification as they are highly ambiguous for the task of establishing feature-to-feature correspondences. The visual word matching with repttile removal and multiple assignments is summarized in Algorithm 3.

Second, we develop a variant of Lowe's first-to-second nearest neighbor ratio test [48] for visual word matching.


Algorithm 3: Visual word matching with repttile removal and multiple assignments

Input:
  $\mathcal{W}_{iq}^K$: visual words of $\mathcal{F}_q$ in query image $I_q$.
  $\alpha_{iq}$: numbers of adaptive assignments for $\mathcal{F}_q$.
  $\mathcal{W}_{id}^1$: visual words of $\mathcal{F}_d$ in image $I_d$.
  $\alpha_{id}$: numbers of adaptive assignments for $\mathcal{F}_d$.
Output:
  $m_i$: indices of matches to the query features.

1: Initialize $m_i := \emptyset$ for $i = 1, \dots, N_q$.
2: Remove $f_{id} \in \mathcal{F}_d$ if $\alpha_{id} = 1$.  % Remove large repttiles in I_d.
3: for $i = 1, \dots, N_q$ do
4:   if $\alpha_{iq} > 1$ then  % Use only small repttiles in I_q.
5:     $k := 0$
6:     while $k < K$ do
7:       $k := k + 1$
8:       Seek a set $\mathcal{J}$ where $\{j \in \mathcal{J} \mid w_{iq}^k \cap w_{jd}^1 \neq \emptyset\}$.  % Find visual word match.
9:       if $\mathcal{J} \neq \emptyset$ then
10:        Pick an index $j \in \mathcal{J}$ randomly.
11:        $m_i := j$
12:        break
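An illustrative Python version of Algorithm 3 (ours; query features carry their K nearest visual words and adaptive assignment counts, database features only their single nearest word ID):

```python
import random
from collections import defaultdict

def match_with_repttile_removal(query_words, alpha_q, db_words, alpha_d, K=50):
    """Tentative visual word matches from query to database features.

    query_words : list of length-K lists of nearest visual words per query feature
    alpha_q     : adaptive assignment counts of the query features
    db_words    : single (nearest) visual word ID per database feature
    alpha_d     : adaptive assignment counts of the database features
    Returns {query feature index: matched database feature index}.
    """
    # Inverted file over database features; drop large repttiles (alpha_id == 1).
    inverted = defaultdict(list)
    for j, (w, a) in enumerate(zip(db_words, alpha_d)):
        if a > 1:
            inverted[w].append(j)

    matches = {}
    for i, (words, a) in enumerate(zip(query_words, alpha_q)):
        if a <= 1:                      # use only small repttiles on the query side
            continue
        for w in words[:K]:             # closer visual words are tried first
            candidates = inverted.get(w)
            if candidates:
                matches[i] = random.choice(candidates)
                break
    return matches
```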

This variant applies when the raw (SIFT) descriptors for the images in the database are not available. The main idea of the original ratio test [48] is to compute, for each query image descriptor $d_{iq}$, the ratio

$$L = \frac{\|d_{iq} - d_{NN1}\|_2}{\|d_{iq} - d_{NN2}\|_2}, \qquad (8)$$

where $d_{NN1}$ is the first nearest neighbor and $d_{NN2}$ is the second nearest neighbor of the query descriptor $d_{iq}$ in the candidate database image. The ratio $L$ lies between 0 and 1, and is close to 1 if the query feature $d_{iq}$ is non-distinctive. While this test works extremely well in practice, it requires storing (or re-computing) local feature descriptors for both the query image and all database images, which, as argued above, is prohibitive for large databases.

To address this issue, we develop an asymmetric version of the ratio test, which is inspired by the asymmetric distance computation in product quantization [15], and which only requires knowing the local feature descriptors for the query image. The local feature descriptors in the database images are represented by their quantized representation, i.e. the centroids of their closest visual words. In detail, we replace each feature descriptor $d_{id}$ in database image $d$ by its quantized descriptor corresponding to the centroid of its closest visual word $c_t \in \mathcal{C}$. The approximate ratio is then computed as

$$L_a = \frac{\|d_{iq} - c_{NC1}\|_2}{\|d_{iq} - c_{NC2}\|_2}, \qquad (9)$$

where $d_{iq}$ is the descriptor of the local feature in the query image, $c_{NC1}$ is its closest (quantized) descriptor in the database image and $c_{NC2}$ is its second closest (quantized) descriptor in the database image. $L_a$ also lies between 0 and 1 and is close to 1 if the descriptor match is ambiguous. The intuition is that $NC1$ and $NC2$ are approximations of the first and second nearest neighbors $NN1$ and $NN2$ in the original ratio test given by Equation (8). Note that the asymmetric ratio (9) can be computed efficiently, as the distances $\|d_{iq} - c_{NC}\|_2$ between the query descriptor and the multiple (up to 50) closest cluster centers are pre-computed during the assignment of query descriptors to visual words. In addition, $NC1$ and $NC2$ can be efficiently found by only accessing the inverted file list.

Results demonstrating the benefits of both the tentative matching with repttile detection and the asymmetric ratio test are shown in Section 6.
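A sketch of the asymmetric ratio test of Equation (9) (illustrative; the 0.9 threshold is a hypothetical value, and in practice the distances would be re-used from the visual word assignment rather than recomputed):

```python
import numpy as np

def passes_asymmetric_ratio(query_desc, db_word_ids, centroids, threshold=0.9):
    """Eq. (9): ratio of distances from the query descriptor to the closest and
    second closest quantized database descriptors (visual word centroids)."""
    quantized = centroids[np.unique(db_word_ids)]      # quantized database descriptors
    d = np.sort(np.linalg.norm(quantized - query_desc, axis=1))
    if len(d) < 2:
        return True                                    # cannot form a ratio
    return d[0] / d[1] < threshold                     # distinctive if the ratio is small
```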

6 EXPERIMENTS

In this section we describe the experimental validation of our approach. First, we describe the experimental set-up (Section 6.1), describe the evaluation metric (Section 6.2) and give the implementation details (Section 6.3). Next, we evaluate the place recognition performance of the proposed representation with repttile detection and adaptive soft-assignment and compare it with state-of-the-art BoVW baseline methods (Section 6.4). Then we evaluate the sensitivity to the different method parameters (Section 6.5) and compare performance with compact image descriptors (Section 6.6). Finally, we experimentally evaluate the proposed geometric verification method (Section 6.7).

6.1 Datasets

We perform experiments on the following two geotagged image databases.

6.1.1 Pittsburgh dataset

The Pittsburgh dataset is formed by 254,064 perspective images generated from 10,586 Google Street View panoramas of the Pittsburgh area downloaded from the Internet. From each panorama of 6,656×3,328 pixels, we generate 24 perspective images of 640×480 pixels (corresponding to 60 degrees of horizontal FoV) with two pitch directions [4, 26.5] and 12 yaw directions [0, 30, ..., 360]. This is a similar setup to [3].

As testing query images, we use 24,000 perspective images generated from 1,000 panoramas randomly selected from the 8,999 panoramas of the Google Pittsburgh Research Data Set.¹ The datasets are visualized on a map in Figure 6(a). This is a very challenging place recognition set-up as the query images were captured in a different session than the database images and depict the same places from different viewpoints, under very different illumination conditions and, in some cases, in different seasons. Note also the high number of test query images compared to other existing datasets [3], [4].

1. Provided and copyrighted by Google.


Fig. 6. Evaluation on the Pittsburgh dataset. (a) Locations of query (blue dots) and database (gray dots) images. (b-c) The fraction of correctly recognized queries (Recall, y-axis) vs. the number of top N retrieved database images (x-axis) for the proposed method (AA thr-idf) compared to several other methods, including tf-idf, brst-idf, SA tf-idf, Fisher vectors (FV 512/4096) and VLAD (512/4096).

Fig. 7. Evaluation on the San Francisco dataset. (a) Locations of query (blue dots) and database (gray dots) images. (b-c) The fraction of correctly recognized queries (Recall, y-axis) vs. the number of top N retrieved database images (x-axis) for the proposed method (AA thr-idf) compared to several other methods, including tf-idf, brst-idf, SA tf-idf, Fisher vectors (FV 512/4096), VLAD (512/4096) and Chen NoGPS [3].

The ground truth is derived from the (known) GPS positions of the query images. We have observed that GPS positions of Street View panoramas are often snapped to the middle of the street. The accuracy of the GPS positions hence seems to be somewhere between 7 and 15 meters.

6.1.2 San Francisco visual place recognition benchmark

This dataset consists of geotagged images formed by 1,062,468 perspective central images (PCIs), 638,090 perspective frontal images (PFIs), and 803 cell phone images. We use the PCIs as the geotagged image database and the cell phone images as testing query images. The dataset is visualized on a map in Figure 7(a). This dataset involves challenges similar to the Pittsburgh dataset.

Some query images undergo more severe viewpoint changes than those in the Pittsburgh dataset because the cell phone query images are captured on the street-side whereas the PCI database images are captured from the middle of the street.

Each image in this dataset has labels of the visible buildings computed using 3D building models [50]. The ground truth is given by matching the building IDs between query and database images.²

6.2 Evaluation metric

Similarly to [3], [4], [51], we measure the place recognition performance by the fraction of correctly recognized queries. Following [3], we denote the fraction of correctly recognized queries as "recall".³ We measure performance by recall on both datasets. However, the definition of a correctly recognized query differs between the two datasets due to the different types of available ground truth annotations. For the Pittsburgh dataset, a query is deemed correctly recognized if at least one of the top N retrieved database images is within m meters from the ground truth position of the query.

2. We note that 60 query images do not have the corresponding ground truth building label in the dataset.

3. While this terminology is slightly different from image retrieval, it meets the definition of recall, i.e. it measures the fraction of true positives (correctly recognized queries) over all positives (all query images).


Fig. 8. Examples of place recognition results on the Pittsburgh dataset. Each figure shows the query image (left column) and the three best matching database images (2nd to 4th columns) using the baseline burstiness method [8] (top row) and the proposed repttile detection and adaptive assignment (bottom row). The green borders indicate correctly recognized images. The orange dots show the visual word matches between the query image (1st column) and the best matching database image (2nd column). The visual word matches are also displayed on the 2nd and 3rd best matching images. Note that, for improved clarity, the matching visual words in the query are not shown for the second and third match.


Fig. 9. Examples of place recognition results on the San Francisco dataset. See the caption of Figure 8 for details.

For the San Francisco dataset, the query is deemed correctly recognized if at least one of the top N retrieved database images has a ground truth building ID matching the query. We evaluate the fraction of correctly recognized queries (recall) for different lengths N of the candidate shortlist.
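For clarity, a small sketch of the recall measure used here (ours; `ranked_lists[q]` holds the database indices ranked for query q, and `is_correct` encapsulates the dataset-specific ground-truth test, i.e. the distance threshold for Pittsburgh or the building ID match for San Francisco):

```python
def recall_at_n(ranked_lists, is_correct, N):
    """Fraction of queries with at least one correct database image in the top N."""
    hits = sum(
        any(is_correct(q, db_idx) for db_idx in ranked[:N])
        for q, ranked in enumerate(ranked_lists)
    )
    return hits / len(ranked_lists)
```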

6.3 Implementation details

We build a visual vocabulary of 200,000 visual words by approximate k-means clustering [11], [52]. We have not observed any significant improvements from using larger vocabularies. The vocabulary is built from features detected in a subset of 100,000 randomly selected database images in each dataset. We use the upright SIFT descriptors [48], [53] for each feature (assuming the upright image gravity vector) followed by the RootSIFT normalization [54], i.e. L1 normalization and square root weighting. We have not used the histogram equalization suggested by [3] as it did not improve results with our visual word setup.
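The RootSIFT normalization mentioned above amounts to an L1 normalization followed by an element-wise square root; a minimal sketch:

```python
import numpy as np

def root_sift(desc, eps=1e-12):
    """RootSIFT [54]: L1-normalize the SIFT descriptor, then take square roots."""
    desc = np.asarray(desc, dtype=float)
    return np.sqrt(desc / (desc.sum() + eps))    # SIFT entries are non-negative
```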

6.4 Comparison with BoVW baselines

We compare results of the proposed adaptive soft-assignment approach (AA thr-idf) with several baselines: the standard tf-idf weighting (tf-idf) [11], burstiness weights (brst-idf) [8], and the standard soft-assignment weights (SA tf-idf) [16]. Results for the two datasets are summarized below.

Pittsburgh. Results for the different methods for m = 25 meters and varying values of N are shown in Figure 6(b). Figure 8 shows examples of place recognition results. Notice that the top-ranked results of the baseline burstiness method [8] still suffer from repetitive structures that dominate the visual word matching (Figure 8, top row). However, the proposed repttile detection and adaptive soft-assignment effectively down-weight repetitive structures, while mitigating "quantization effects" for the distinctive features (Figure 8, bottom row).

San Francisco. Results for the different methods are shown in Figure 7. The results of [3] were obtained directly from the authors, but to remove the effect of geometric verification we ignored the threshold on the minimum number of inliers by setting $T_{PCI} = 0$. Note also that the GPS position of the query image was not used for any of the compared methods. The pattern of results is similar to the Pittsburgh data, with our adaptive soft-assignment method (AA thr-idf) performing best and significantly better than the method of [3], underlining the importance of handling repetitive structures for place recognition in urban environments.


Fig. 10. Sensitivity to different parameters on the Pittsburgh dataset: (a) the threshold T, (b) K and α_max, (c) AA and T. The fraction of correctly recognized queries (Recall, y-axis) vs. the number of top N retrieved database images (x-axis) for different parameter setups.

Fig. 11. Sensitivity to different parameters on the San Francisco dataset: (a) the threshold T, (b) K and α_max, (c) AA and T. The fraction of correctly recognized queries (Recall, y-axis) vs. the number of top N retrieved database images (x-axis) for different parameter setups.

Example place recognition results demonstrating the benefits of the proposed approach are shown in Figure 9.

6.5 Sensitivity to parameters

Here we list the main parameters of our method and evaluate its sensitivity to their settings.

The weight threshold T. The weight threshold $T$ in Equation (5) is an important parameter of the method and its setting may depend on the dataset and the size of the visual vocabulary. Since 97% of the visual word weights $r_t$ (see Equation (6)) are $\le 1$ (measured on the Pittsburgh database), setting $T = 1$ effectively down-weights the bursted visual words. Figures 10(a) and 11(a) show the evaluation of place recognition performance for different values of $T$. We use $T = 1$ (unless stated otherwise).

The number of visual word assignments K. Figures 10(b) and 11(b) show that the method is fairly insensitive to the choice of the number $K$ of visual word assignments for repttile detection, where values of 20 (AA thr-idf (K = 20)) and the standard 50 (AA thr-idf (K = 50)) result in a similar performance. We use $K = 50$ (unless stated otherwise).

The maximum repttile soft-assignment α_max. We have also tested different values of the adaptive soft-assignment parameter $\alpha_{\max}$ (Equations (6) and (7)). Figures 10(b) and 11(b) again show that the method is fairly insensitive to its choice, where values of 2 (AA thr-idf (amax = 2)), 3 (AA thr-idf (K = 50)), and 5 (AA thr-idf (amax = 5)) result in a similar performance. We use $\alpha_{\max} = 3$ following [16] (unless stated otherwise). Note that the base of the exponential in Equation (6) is chosen so that the weights decrease with increasing $k$, and we found 1/2 to work well. In general, this value needs to be set experimentally, similarly to the sigma parameter in the standard soft-assignment [16].

Which stage helps. Here, we evaluate separately the benefits of the two components of the proposed method (weight thresholding and adaptive soft-assignment) compared to the baseline burstiness weights. Figures 10(c) and 11(c) show separately the results for (i) thresholding using Equation (5) (thr-idf) and (ii) adaptive soft-assignment using Equations (6) and (7) (AA brst-idf). Combining the two components results in a further improvement (AA thr-idf).
