
(1)

Search as a source coding problem, with application to large scale image retrieval

Hervé Jégou

GDR ISIS, « passage à l'échelle » (scalability) workshop, November 3rd, 2010

Joint work with Matthijs Douze, Patrick Pérez, Cordelia Schmid, Harsimrat Singh

(2)

Problem statement

Problem: find the nearest Euclidean neighbor(s) of a query

under the following (contradictory) optimization criteria

►  search quality

►  speed

►  memory usage

Many algorithms optimize only the first two criteria: LSH, FLANN, NV-tree, etc. Memory becomes critical when we want to handle datasets of billions of vectors.

(3)

Hamming Embedding techniques

Several solutions address this problem by designing or learning an embedding function e : ℝ^D → {0,1}^b that maps the Euclidean space into the Hamming space.

The objective is that the neighborhood in the Hamming space reflects the neighborhood in the original Euclidean space: the Hamming distance between e(x) and e(y) should be small when ‖x - y‖ is small.

Related works:

►  Spectral Hashing, Weiss et al.’09

►  Kernelized locality sensitive hashing, Kulis & Grauman ’09

►  Hamming Embedding and weak geometric consistency for large scale image search, Jégou et al. ’08

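For illustration only, here is a minimal sketch (not the embedding of any of the works cited above) of a random-hyperplane binary embedding: vectors are mapped to b-bit codes and compared with the Hamming distance. The dimensions D and b and the Gaussian projection matrix are arbitrary assumptions.

```python
import numpy as np

# Sketch of a binary embedding e: R^D -> {0,1}^b using random hyperplanes
# (LSH-style); the projection is an arbitrary choice for illustration.
rng = np.random.default_rng(0)
D, b = 128, 64
W = rng.normal(size=(b, D))                 # b random hyperplanes

def embed(x):
    """Binary code: sign of the projections onto the random hyperplanes."""
    return (W @ x > 0).astype(np.uint8)

def hamming(c1, c2):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(c1 != c2))

# neighborhood in Hamming space should reflect Euclidean neighborhood
x = rng.normal(size=D)
y_close, y_far = x + 0.05 * rng.normal(size=D), rng.normal(size=D)
print(hamming(embed(x), embed(y_close)), hamming(embed(x), embed(y_far)))
```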

(4)

Drawbacks of current techniques

  The Hamming space is too constrained

►  only D+1 possible distances

►  quite different from Euclidean space in nature:

not clear that the best possible embedding is « good » w.r.t. the code size

  Comparing e(x) and e(yi) introduces two approximations

►  of the database vector yi → cannot be avoided

►  of the query vector x → not stored, no reason to approximate it!

[Figure: the 3-bit binary codes 000, 001, 010, 011, 100, 101, 110, 111 on the vertices of the Hamming cube]

(5)

Preliminary: a source coding system

Simple source-coding system:

►  decorrelation: fixed transforms (DCT, DWT) or the Karhunen-Loève transform (PCA)

►  scalar quantization is often used

►  bits are allocated to the quantizers in order to minimize the overall distortion

To each code e(x) is associated a unique reconstruction value q(x); the distortion is measured by the reconstruction mean square error (MSE) E[‖x - q(x)‖²]

[Diagram: decorrelation → scalar quantization → entropy coding, driven by rate-distortion optimization]

(6)

Mimicking source coding for neighbor search: formulation

  Search is seen as a distance approximation problem

  The distance between two vectors can be estimated using

►  SDC (symmetric distance computation): code-to-code distance

►  ADC (asymmetric distance computation): vector-to-code distance

  Error on square distances is bounded by the approximation error (from Huygens)

  The key is to be able to compute the SDC and ADC estimators efficiently
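A minimal sketch of the two estimators with a single k-means codebook, assuming the codebook ('centroids') has already been learned: SDC reads a precomputed code-to-code table, while ADC keeps the query exact and compares it to the reconstruction of the database vector.

```python
import numpy as np

# Sketch of the SDC / ADC estimators with one codebook; database vectors are
# stored only through their code (index of the nearest centroid).

def encode(x, centroids):
    """Return the index of the centroid closest to x (the code of x)."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

def sdc_distance(code_x, code_y, centroid_dists):
    """Symmetric estimator: both vectors are replaced by their reconstruction,
    so the distance is read from a precomputed code-to-code table."""
    return centroid_dists[code_x, code_y]

def adc_distance(x, code_y, centroids):
    """Asymmetric estimator: the query x is kept exact, only the database
    vector is approximated by its reconstruction q(y) = centroids[code_y]."""
    return np.linalg.norm(x - centroids[code_y])

# toy usage with an assumed (randomly generated) codebook
rng = np.random.default_rng(0)
centroids = rng.normal(size=(256, 16))
centroid_dists = np.linalg.norm(
    centroids[:, None, :] - centroids[None, :, :], axis=2)   # k x k table
x, y = rng.normal(size=16), rng.normal(size=16)
cy = encode(y, centroids)
print(sdc_distance(encode(x, centroids), cy, centroid_dists),
      adc_distance(x, cy, centroids))
```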

(7)

Biased estimator

These estimators are biased:

►  the bias can be removed by adding quantization error terms

►  but correcting it does not improve the NN search quality

(8)

First approach: distance rate-distortion optimization (ICASSP’10)

  Optimization criterion: minimize the distance approximation error subject to a code length constraint of b bits

  First approach (symmetrical): copy/paste of source coding

►  Karhunen-Loeve transform

►  scalar quantization (Lloyd-Max)

►  the number of quantization intervals allocated to each component aims at minimizing the simplified criterion ∑j E[ej(c1, c2)] under the total bit budget, where

►  ej(c1, c2) is the partial reconstructed square distance associated with codes c1 and c2

►  nj is the number of intervals allocated to component j

(9)

Example

  Right: 2D anisotropic Gaussian

►  greedy approach selects the dimension that best reduces the distance error per allocated bit

►  Non-integer number of bits per dimension

SDC: the number of possible distances is finite, but much higher than for Hamming Embedding (up to 2^D values instead of D+1)

Left: 5 bits → clearly more than 6 distances!
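A hedged sketch of the greedy allocation described above, assuming a simple high-resolution model (distortion of a component of variance v quantized with n intervals ≈ v/n²); it illustrates the idea of spending each extra bit where it reduces the error most, not the exact criterion of the ICASSP’10 method.

```python
import numpy as np

def greedy_interval_allocation(variances, budget_bits):
    """Greedily add quantization intervals to the component offering the best
    distortion reduction per extra bit, until the bit budget is exhausted."""
    variances = np.asarray(variances, dtype=float)
    levels = np.ones(len(variances), dtype=int)   # 1 interval = 0 bits each

    def distortion(j, n):
        # high-resolution approximation for component j with n intervals
        return variances[j] / n ** 2

    while True:
        best, best_ratio = None, 0.0
        used_bits = np.sum(np.log2(levels))
        for j in range(len(variances)):
            extra_bits = np.log2(levels[j] + 1) - np.log2(levels[j])
            if used_bits + extra_bits > budget_bits:
                continue
            gain = distortion(j, levels[j]) - distortion(j, levels[j] + 1)
            if gain / extra_bits > best_ratio:
                best, best_ratio = j, gain / extra_bits
        if best is None:                           # no affordable improvement left
            break
        levels[best] += 1
    return levels, np.log2(levels)                 # intervals, (non-integer) bits

# toy example: anisotropic 3-D Gaussian, 5-bit budget
intervals, bits = greedy_interval_allocation([4.0, 1.0, 0.25], budget_bits=5)
print(intervals, bits, bits.sum())
```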

(10)

Second approach: product quantization [PAMI’11]

  Vector split into m subvectors: y = [y1 | … | ym]

  Subvectors are quantized separately by quantizers q1, …, qm, where each qj is learned by k-means with a limited number of centroids

  Example: y = 128-dim vector split into 8 subvectors of dimension 16, each quantized with 256 centroids (8 bits) → 64-bit quantization index

[Figure: y1 … y8 (16 components each) → q1(y1) … q8(y8), concatenated into the 64-bit code]
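A minimal sketch of product quantization encoding and ADC search in Python/NumPy, assuming a representative training set X; SciPy's kmeans2 stands in for the sub-quantizer trainer, and the helper names train_pq, pq_encode and adc_search are ours, not from the paper's code.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def train_pq(X, m=8, k=256):
    """Learn one k-means codebook per subvector (assumes X.shape[1] % m == 0)."""
    d = X.shape[1] // m
    return [kmeans2(X[:, j*d:(j+1)*d].astype(np.float64), k, minit='++')[0]
            for j in range(m)]

def pq_encode(codebooks, y):
    """Encode y as m small centroid indices (m bytes when k=256)."""
    d = codebooks[0].shape[1]
    return np.array([np.argmin(np.linalg.norm(cb - y[j*d:(j+1)*d], axis=1))
                     for j, cb in enumerate(codebooks)], dtype=np.uint8)

def adc_search(codebooks, query, codes):
    """ADC: per-subvector lookup tables of squared distances from the (exact)
    query to all centroids, then one table lookup per subcode, summed."""
    d = codebooks[0].shape[1]
    tables = [np.sum((cb - query[j*d:(j+1)*d]) ** 2, axis=1)
              for j, cb in enumerate(codebooks)]
    return np.array([sum(tables[j][c] for j, c in enumerate(code)) for code in codes])

# toy usage: 128-dim vectors -> 8 sub-quantizers of 256 centroids = 64-bit codes
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 128)).astype(np.float32)
codebooks = train_pq(X, m=8, k=256)
codes = np.array([pq_encode(codebooks, y) for y in X[:1000]])
print(np.argsort(adc_search(codebooks, X[0], codes))[:5])   # approximate NNs of X[0]
```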

(11)

Combination with an inverted file: avoid exhaustive search

Coarse partitioning of the feature space using k-means.

The distance is then estimated between residual vectors (the vector minus its coarse centroid), and only for the few database vectors assigned to the visited cell(s).
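A hedged sketch of this inverted-file idea ("IVFADC"-style): a coarse k-means quantizer partitions the space, each database vector is stored in its cell's list as (id, PQ code of the residual), and only a few lists are scanned at query time. It reuses the hypothetical pq_encode / adc_search helpers sketched above; in practice the residual codebooks would be trained on residual vectors.

```python
import numpy as np

class InvertedFile:
    def __init__(self, coarse_centroids, codebooks):
        self.coarse = coarse_centroids                # (k_coarse, D)
        self.codebooks = codebooks                    # PQ codebooks for residuals
        self.lists = [[] for _ in range(len(coarse_centroids))]

    def add(self, ids, vectors):
        for i, y in zip(ids, vectors):
            c = int(np.argmin(np.linalg.norm(self.coarse - y, axis=1)))
            residual = y - self.coarse[c]
            self.lists[c].append((i, pq_encode(self.codebooks, residual)))

    def search(self, x, nprobe=4):
        # visit only the nprobe closest coarse cells (non-exhaustive search)
        order = np.argsort(np.linalg.norm(self.coarse - x, axis=1))[:nprobe]
        results = []
        for c in order:
            if not self.lists[c]:
                continue
            ids, codes = zip(*self.lists[c])
            dists = adc_search(self.codebooks, x - self.coarse[c], list(codes))
            results.extend(zip(dists, ids))
        return sorted(results)[:10]                   # 10 approximate neighbors
```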

(12)

Results: GIST descriptors

(13)

Image indexing

  Retrieval of images representing the same object/scene:

►  different viewpoints, backgrounds, cameras, times of day/year

►  transformed images (crop, rotation, overlay, photocopy, etc.)

►  interactive time (result in < 1 s), large scale (billions of images)

►  limited resources (memory, CPU)


(14)

Objective and proposed approach [CVPR’10]

•  Aim: optimizing the trade-off between

►  search quality

►  search speed

►  memory usage

•  Approach: joint optimization of three stages

►  local descriptor aggregation

►  dimension reduction

►  indexing algorithm

[Pipeline: image → extract SIFT (n descriptors, 128 dim) → aggregate descriptors → D-dim vector → dimension reduction → D’-dim vector → vector encoding / indexing → code]

(15)

Aggregation of local descriptors

  Problem: represent an image by a single fixed-size vector:

set of n local descriptors → 1 vector

  Most popular idea: BoF representation [Sivic & Zisserman 03]

►  sparse vector

►  very high-dimensional

→ strong dimensionality reduction introduces loss

  Alternative: Fisher Kernels [Perronnin et al 07]

►  non-sparse vector

►  excellent results with a small vector dimensionality

→ our method is in the spirit of this representation

(16)

VLAD : Vector of Locally Aggregated Descriptors

  Learning: k-means

►  output: k centroids c1, …, ci, …, ck

  VLAD computation (see the sketch after this slide's figure):

►  assign each local descriptor x to its nearest centroid ci

►  compute the residual x - ci

►  accumulate, per cell i, vi = Σ (x - ci) over the descriptors assigned to ci

⇒ dimension D = k × 128

  The VLAD vector is L2-normalized

  Typical parameter: k = 64 (D = 8192)

[Figure: local descriptors assigned to centroids c1 … c5; residuals x - ci accumulated into v1 … v5]
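A minimal sketch of the VLAD computation above, assuming the k centroids have already been learned by k-means on a training set of SIFT descriptors.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate a set of local descriptors (n x 128) into one k*128 vector."""
    k, d = centroids.shape
    v = np.zeros((k, d))
    # assign each descriptor to its nearest centroid
    assignments = np.argmin(
        np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2),
        axis=1)
    # accumulate the residuals x - c_i per cell
    for x, i in zip(descriptors, assignments):
        v[i] += x - centroids[i]
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)    # L2 normalization

# toy usage: k=64 centroids -> D = 64*128 = 8192 dimensions
rng = np.random.default_rng(0)
centroids = rng.normal(size=(64, 128))        # assumed learned by k-means
descriptors = rng.normal(size=(300, 128))     # n SIFT-like descriptors
print(vlad(descriptors, centroids).shape)     # (8192,)
```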

(17)

VLADs for corresponding images

SIFT-like representation per centroid (+ components: blue, - components: red)

  good coincidence of energy & orientations

  VLAD is a simplified version of Fisher Kernels!

  comparable performance (depends on the dataset)

(18)

VLAD performance and dimensionality reduction

  We compare VLAD descriptors with BoF: INRIA Holidays Dataset (mAP,%)

  Dimension is reduced from D to D’ dimensions with PCA

  Observations:

►  VLAD better than BoF for a given descriptor size

→ comparable to Fisher kernels for these operating points

►  Choose a small D if output dimension D’ is small

Aggregator   k         D         D’=D (no reduction)   D’=128   D’=64
BoF          1,000     1,000     41.4                  44.4     43.4
BoF          20,000    20,000    44.6                  45.2     44.5
BoF          200,000   200,000   54.9                  43.2     41.6
VLAD         16        2,048     49.6                  49.5     49.4
VLAD         64        8,192     52.6                  51.0     47.7
VLAD         256       32,768    57.5                  50.8     47.6

(19)

Optimizing the dimension reduction and quantization together

  We use the product quantization indexing scheme

  VLAD vectors suffer two approximations

►  mean square error from PCA projection: ep(D’)

►  mean square error from quantization: eq(D’)

  Given k and a budget of bytes per image, choose D’ minimizing their sum. Example, k = 16:

D’    ep(D’)    eq(D’)    ep(D’)+eq(D’)
32    0.0632    0.0164    0.0796
48    0.0508    0.0248    0.0757
64    0.0434    0.0321    0.0755
80    0.0386    0.0458    0.0844
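The selection rule amounts to an argmin over the candidate values of D’; a tiny sketch using the example values from the table above:

```python
# Pick D' minimizing the sum of projection and quantization errors
# (values copied from the slide's example, k=16).
errors = {32: (0.0632, 0.0164), 48: (0.0508, 0.0248),
          64: (0.0434, 0.0321), 80: (0.0386, 0.0458)}
best_Dp = min(errors, key=lambda Dp: sum(errors[Dp]))
print(best_Dp)   # -> 64 for this example
```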

(20)

Searching in 1 million images: how good is the short-list

  Fisher + source coding indexing significantly outperforms BoF with 256 bytes/image. Fisher is a bit better than VLAD in this evaluation.

(21)

Results on standard datasets [new results, improving on our CVPR’10 paper]

  Datasets

►  University of Kentucky benchmark, score: number of relevant images in the top 4 (max: 4)

►  INRIA Holidays dataset score: mAP (%)

Method            bytes   UKB    Holidays
BoF, k=20,000     10K     2.92   44.6
BoF, k=200,000    12K     3.06   54.9
miniBOF           20      2.07   25.5
miniBOF           160     2.72   40.3
FV + ADC          16      3.10   50.6
FV + ADC          320     3.47   63.4

miniBOF: “Packing Bag-of-Features”, ICCV’09

(22)

Large scale experiments (10 million images)

[Figure: recall@100 as a function of database size (1,000 to 10M images). Database: Holidays + distractor images from Flickr. Curves: BoF D=200k; VLAD k=64; VLAD k=64, D’=96; VLAD k=64, ADC 16 bytes; VLAD + Spectral Hashing, 16 bytes]

Timings: 4.768 s; ADC: 0.286 s; IVFADC: 0.014 s; SH ≈ 0.267 s

(23)

Conclusion

  The Bag-of-features + inverted file framework is not the ultimate solution

►  reached its limits w.r.t. the number of indexed images

►  competing representations do exist: Fisher kernels and variants

  Can scale up to 1 billion images (16 GB RAM, < 1 s search on 8 cores)

  ANN based on source coding: a rich framework

  10M images indexed on a laptop. See the demo now!

(24)

Combination with an inverted file
