Search as a source coding problem,
with application to large-scale image retrieval
Hervé Jégou
GDR ISIS, journée « passage à l'échelle » (scalability day), November 3rd, 2010
Joint work with Matthijs Douze, Patrick Pérez, Cordelia Schmid, Harsimrat Singh
Problem statement
Problem: find the nearest Euclidean neighbor(s) of a query
under the following (contradictory) optimization criteria
► search quality
► speed
► memory usage
Many algorithms optimize only the first two criteria: LSH, FLANN, NV-tree, etc.
Memory is very important if we want to handle datasets with billions of vectors.
Hamming Embedding techniques
Several solutions address this problem by designing/learning an embedding function e : ℝ^d → {0,1}^D mapping the Euclidean space into the Hamming space.
The objective is that the neighborhood in the Hamming space reflects the neighborhood in the original Euclidean space: the Hamming distance between e(x) and e(y) should be small iff ‖x − y‖ is small.
Related works:
► Spectral Hashing, Weiss et al.’09
► Kernelized locality-sensitive hashing, Kulis & Grauman '09
► Hamming Embedding and weak geometric consistency for large scale image search, Jégou et al. '08
► …
Drawbacks of current techniques
The Hamming space is too constrained:
► only D+1 possible distances
► quite different from the Euclidean space in nature:
it is not clear that the best possible embedding is "good" w.r.t. the code size
Comparing e(x) and e(yi) introduces two approximations:
► of the database vector yi → cannot be avoided
► of the query vector x → the query is not stored, so there is no reason to approximate it!
[Figure: the 3-bit Hamming cube with vertices 000, 001, 010, 011, 100, 101, 110, 111]
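A minimal sketch of the first limitation (illustrative, not from the talk): comparing two D-bit codes can only ever produce D+1 distinct distance values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                         # code length in bits
codes = rng.integers(0, 2, size=(1000, D))    # database binary codes
query = rng.integers(0, 2, size=D)            # quantized query e(x)

# Symmetric Hamming comparison: both sides are quantized, so the
# distance can only take the values 0, 1, ..., D.
distances = np.sum(codes != query, axis=1)
print(sorted(set(distances)))                 # subset of {0, ..., 8}
```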
Preliminary: a source coding system
Simple source-coding system:
► decorrelation: fixed transforms (DCT, DWT) or Karhunen-Loève (PCA)
► scalar quantization is often used
► bits are allocated to the quantizers in order to minimize the overall distortion
To a code e(x) is associated a unique reconstruction value q(x).
Reconstruction mean square error: MSE = E[‖x − q(x)‖²]
[Block diagram: input → decorrelation → scalar quantization → entropy coding → code, with rate-distortion optimization driving the bit allocation]
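A hedged numpy sketch of such a chain (PCA decorrelation, then a plain uniform scalar quantizer; real systems use Lloyd-Max quantizers, a non-uniform bit allocation, and an entropy coder, all omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 16)) @ rng.normal(size=(16, 16))  # correlated data

# Decorrelation with the Karhunen-Loeve transform (PCA)
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt.T                      # decorrelated coefficients

# Uniform scalar quantization, 4 bits per component
bits = 4
lo, hi = Z.min(axis=0), Z.max(axis=0)
step = (hi - lo) / 2 ** bits
code = np.floor((Z - lo) / step).clip(0, 2 ** bits - 1)   # code e(x)
Zq = lo + (code + 0.5) * step                             # reconstruction q(x)

# Reconstruction mean square error
Xq = Zq @ Vt + mean
print("MSE:", np.mean(np.sum((X - Xq) ** 2, axis=1)))
```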
Mimic source coding for neighbor search: formulation
Search is seen as a distance approximation problem.
The distance between two vectors can be estimated using:
► SDC: code-to-code (symmetric) distance, d(q(x), q(y))
► ADC: vector-to-code (asymmetric) distance, d(x, q(y))
The error on squared distances is bounded by the approximation error (from Huygens' theorem).
The key is to be able to compute the SDC and ADC estimators efficiently.
Biased estimator
These estimators are biased:
► the bias can be removed by adding quantization error terms
► but this does not improve the NN search quality
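A toy numeric sketch of the two estimators; a single shared 1D codebook stands in for the real quantizer, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(v, centroids):
    # nearest-centroid quantizer q(.), applied per dimension
    return centroids[np.argmin((v[:, None] - centroids[None, :]) ** 2, axis=1)]

centroids = np.linspace(-3, 3, 16)    # toy codebook shared by all dimensions
x = rng.normal(size=128)              # query vector (kept exact for ADC)
y = rng.normal(size=128)              # database vector (always quantized)

d_true = np.sum((x - y) ** 2)
d_sdc = np.sum((quantize(x, centroids) - quantize(y, centroids)) ** 2)  # code-to-code
d_adc = np.sum((x - quantize(y, centroids)) ** 2)                       # vector-to-code
print(d_true, d_sdc, d_adc)
```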
First approach: distance rate-distortion optimization [ICASSP'10]
Optimization criterion: minimize the expected squared error of the distance estimate, subject to a code length constraint of b bits.
First approach (symmetric): copy/paste of source coding
► Karhunen-Loève transform
► scalar quantization (Lloyd-Max)
► the number of quantization intervals allocated to each component aims at minimizing the simplified criterion: the expected error between the true squared distance and its reconstruction Σ_j e_j(c1, c2), subject to Σ_j log2(n_j) ≤ b,
where
► e_j(c1, c2) is the partial reconstructed squared distance associated with codes c1 and c2
► n_j is the number of intervals allocated to component j
Example
Right: 2D anisotropic Gaussian
► the greedy approach selects the dimension that best reduces the distance error per allocated bit (see the sketch below)
► a non-integer number of bits per dimension is possible
SDC: the number of possible distances is finite, but much higher than for Hamming Embedding: up to 2^D instead of D+1
Left: 5 bits → clearly more than 6 distances!
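A minimal sketch of such a greedy allocation, under the standard high-resolution assumption that the distortion of a component with variance σ² and n intervals behaves like σ²/n² (integer bits only, for simplicity; the actual method allows fractional bits):

```python
import numpy as np

def greedy_bit_allocation(variances, total_bits):
    """Give each extra bit to the component where it reduces the
    distortion the most (high-resolution model: sigma^2 / n^2, so one
    extra bit divides the component's distortion by 4)."""
    bits = np.zeros(len(variances), dtype=int)
    for _ in range(total_bits):
        # distortion decrease obtained by doubling the number of intervals
        gain = variances / 4.0 ** bits - variances / 4.0 ** (bits + 1)
        bits[np.argmax(gain)] += 1
    return bits

# 2D anisotropic Gaussian: the high-variance axis gets more bits
print(greedy_bit_allocation(np.array([9.0, 1.0]), 5))   # -> [3 2]
```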
Second approach: product quantization [PAMI'11]
The vector is split into m subvectors: y = [y1, …, ym]
Subvectors are quantized separately by quantizers q1, …, qm,
where each qj is learned by k-means with a limited number of centroids
Example: a 128-dim vector y split into 8 subvectors of 16 components,
each quantized with 256 centroids (8 bits) ⇒ 64-bit quantization index
[Diagram: y1 y2 … y8 → q1 q2 … q8 → q1(y1) q2(y2) … q8(y8)]
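A compact sketch of the encoder and of the asymmetric distance, assuming numpy; the codebooks here are random placeholders where the actual method uses k-means-trained ones:

```python
import numpy as np

rng = np.random.default_rng(2)
m, dsub, ksub = 8, 16, 256          # 8 subvectors of dim 16, 256 centroids each

# Placeholder codebooks, one per subquantizer (in practice: k-means)
codebooks = rng.normal(size=(m, ksub, dsub))

def pq_encode(y):
    """Return the m subquantizer indices: a 64-bit code (8 x 8 bits)."""
    code = np.empty(m, dtype=np.uint8)
    for j in range(m):
        sub = y[j * dsub:(j + 1) * dsub]
        code[j] = np.argmin(np.sum((codebooks[j] - sub) ** 2, axis=1))
    return code

def adc_distance(x, code):
    """Asymmetric distance estimate: the query x stays unquantized."""
    d = 0.0
    for j in range(m):
        sub = x[j * dsub:(j + 1) * dsub]
        d += np.sum((sub - codebooks[j][code[j]]) ** 2)
    return d

y = rng.normal(size=m * dsub)       # database vector: stored as 8 bytes
x = rng.normal(size=m * dsub)       # query vector
print(adc_distance(x, pq_encode(y)))
```

In practice the squared distances between each subvector of x and the 256 centroids are precomputed once per query, so estimating the distance to a database code reduces to m table lookups and additions.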
Combination with an inverted file: avoid exhaustive search
► coarse partition of the feature space using k-means
► the distance is estimated between residual vectors, and for a few database vectors only
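A rough sketch of this inverted-file variant (IVFADC), reusing pq_encode and adc_distance from the previous sketch; the coarse codebook is again a random stand-in for a k-means one:

```python
import numpy as np

rng = np.random.default_rng(3)
kc, D, n_probe = 1024, 128, 8              # illustrative sizes
coarse = rng.normal(size=(kc, D))          # stand-in for coarse k-means centroids

inverted_lists = {c: [] for c in range(kc)}

def add(vec_id, y):
    cell = int(np.argmin(np.sum((coarse - y) ** 2, axis=1)))
    # store the PQ code of the residual, not of the vector itself
    inverted_lists[cell].append((vec_id, pq_encode(y - coarse[cell])))

def search(x, topk=10):
    # visit only the n_probe closest coarse cells: no exhaustive scan
    cells = np.argsort(np.sum((coarse - x) ** 2, axis=1))[:n_probe]
    candidates = []
    for c in cells:
        residual = x - coarse[c]           # query residual for this cell
        for vec_id, code in inverted_lists[c]:
            candidates.append((adc_distance(residual, code), vec_id))
    return sorted(candidates)[:topk]
```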
Results: GIST descriptors
Image indexing
Retrieval of images representing the same object/scene:
► different viewpoints, backgrounds, cameras, times of day/year
► transformed images (crop, rotation, overlay, photocopy, etc.)
► interactive time (result in < 1 s), large scale (billions of images)
► limited resources (memory, CPU)
Objective and proposed approach [CVPR'10]
Aim: optimize the trade-off between
► search quality
► search speed
► memory usage
Approach: joint optimization of three stages
► local descriptor aggregation
► dimension reduction
► indexing algorithm
[Pipeline: image → extract SIFT (n SIFTs, 128 dim) → aggregate descriptors (dimension D) → dimension reduction (dimension D') → vector encoding/indexing (code)]
Aggregation of local descriptors
Problem: represent an image by a single fixed-size vector:
set of n local descriptors → 1 vector
Most popular idea: BoF representation [Sivic & Zisserman 03]
► sparse vector
► high-dimensional
→ strong dimensionality reduction introduces loss
Alternative: Fisher kernels [Perronnin et al 07]
► non-sparse vector
► excellent results with a small vector dimensionality
→ our method is in the spirit of this representation
VLAD: Vector of Locally Aggregated Descriptors
Learning: k-means
► output: k centroids c1, …, ci, …, ck
VLAD computation:
► ① assign each descriptor to its closest centroid
► ② compute the residual x − ci
► ③ vi = sum of the residuals x − ci for cell i
⇒ dimension D = k × 128
The resulting vector is L2-normalized.
Typical parameter: k = 64 (D = 8192)
[Figure: descriptors assigned to centroids c1 … c5; a descriptor x in cell 5 contributes its residual x − c5 to v5]
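The three steps above translate directly into a short function; a minimal numpy sketch:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate n local descriptors (n, 128) into one VLAD vector,
    given k centroids (k, 128) learned by k-means."""
    k, d = centroids.shape
    # step 1: assign each descriptor to its closest centroid
    assign = np.argmin(
        ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        if np.any(assign == i):
            # steps 2-3: accumulate the residuals x - c_i for cell i
            v[i] = (descriptors[assign == i] - centroids[i]).sum(axis=0)
    v = v.reshape(-1)                 # dimension D = k * 128
    return v / np.linalg.norm(v)      # L2 normalization
```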
VLADs for corresponding images
[Figure: VLAD components v1 v2 v3 … shown as a SIFT-like representation per centroid (+ components: blue, − components: red); good coincidence of energy & orientations across matching images]
VLAD is a simplified version of Fisher kernels!
► comparable performance (depends on the dataset)
VLAD performance and dimensionality reduction
We compare VLAD descriptors with BoF on the INRIA Holidays dataset (mAP, %).
The dimension is reduced from D to D' dimensions with PCA.
Observations:
► VLAD is better than BoF for a given descriptor size
→ comparable to Fisher kernels for these operating points
► choose a small D if the output dimension D' is small
Aggregator   k         D         D'=D (no reduction)   D'=128   D'=64
BoF          1,000     1,000     41.4                  44.4     43.4
BoF          20,000    20,000    44.6                  45.2     44.5
BoF          200,000   200,000   54.9                  43.2     41.6
VLAD         16        2,048     49.6                  49.5     49.4
VLAD         64        8,192     52.6                  51.0     47.7
VLAD         256       32,768    57.5                  50.8     47.6
Optimizing the dimension reduction and quantization together
We use the product quantization indexing scheme.
VLAD vectors suffer two approximations:
► mean square error from the PCA projection: ep(D')
► mean square error from quantization: eq(D')
Given k and a budget of bytes per image, choose D' minimizing their sum. Example (k=16):
D'    ep(D')   eq(D')   ep(D') + eq(D')
32    0.0632   0.0164   0.0796
48    0.0508   0.0248   0.0757
64    0.0434   0.0321   0.0755
80    0.0386   0.0458   0.0844
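In code, the selection is just an argmin over the measured error terms (values copied from the table above):

```python
# (ep, eq) measured for each candidate D' (k=16 example above)
errors = {32: (0.0632, 0.0164), 48: (0.0508, 0.0248),
          64: (0.0434, 0.0321), 80: (0.0386, 0.0458)}
best = min(errors, key=lambda d: sum(errors[d]))
print(best)   # -> 64: projection and quantization errors are balanced
```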
Searching in 1 million images: how good is the short-list?
Fisher + source coding indexing significantly outperforms BoF with 256 bytes/image.
Fisher is a bit better than VLAD in this evaluation.
Results on standard datasets [new results, improving on our CVPR'10]
Datasets:
► University of Kentucky benchmark; score: average number of relevant images in the top 4 (max: 4)
► INRIA Holidays dataset; score: mAP (%)
Method            bytes   UKB    Holidays
BoF, k=20,000     10K     2.92   44.6
BoF, k=200,000    12K     3.06   54.9
miniBOF           20      2.07   25.5
miniBOF           160     2.72   40.3
FV + ADC          16      3.10   50.6
FV + ADC          320     3.47   63.4
miniBOF: “Packing Bag-of-Features”, ICCV’09
Large scale experiments (10 million images)
[Plot: recall@100 vs. database size (1000 → 10M); database: Holidays + distractor images from Flickr; curves: BOF D=200k, VLAD k=64, VLAD k=64 D'=96, VLAD k=64 ADC 16 bytes, VLAD + Spectral Hashing 16 bytes; timing annotations: 4.768 s, ADC: 0.286 s, IVFADC: 0.014 s, SH ≈ 0.267 s]
Conclusion
The bag-of-features + inverted file framework is not the ultimate solution:
► it has reached its limits w.r.t. the number of indexed images
► competing representations do exist: Fisher kernels and variations
Can scale up to 1 billion images (16 GB RAM, < 1 s search on 8 cores)
ANN based on source coding: a rich framework
10M images indexed on a laptop. See the demo now!