Search as a source coding problem,
with application to large-scale image retrieval
Hervé Jégou
GDR ISIS, journée « passage à l'échelle » (scalability day), November 3rd, 2010
Joint work with Matthijs Douze, Patrick Pérez, Cordelia Schmid, Harsimrat Singh
Problem statement
Problem: find the nearest Euclidean neighbor(s) of a query
under the following (contradictory) optimization criteria
► search quality
► speed
► memory usage
Many algorithms optimize only the first two criteria: LSH, FLANN, NV-tree, etc.
Memory is very important if we want to handle datasets with billions of vectors.
Hamming Embedding techniques
Several solutions address this problem by designing/learning an embedding function e : ℝ^d → {0,1}^D mapping the Euclidean space into the Hamming space.
The objective is that the neighborhood in the Hamming space reflects the neighborhood in the original Euclidean space: the Hamming distance between e(x) and e(y) should be small iff ‖x − y‖ is small.
Related works:
► Spectral Hashing, Weiss et al.’09
► Kernelized locality-sensitive hashing, Kulis & Grauman '09
► Hamming Embedding and weak geometric consistency for large scale image search, Jégou et al. '08
► …
Drawbacks of current techniques
The Hamming space is too constrained:
► only D+1 possible distances
► quite different from the Euclidean space in nature:
it is not clear that the best possible embedding is "good" w.r.t. the code size
Comparing e(x) and e(yi) introduces two approximations:
► of the database vector yi → cannot be avoided
► of the query vector x → the query is not stored, so there is no reason to approximate it!
[Figure: the 3-bit Hamming cube with vertices 000, 001, 010, 011, 100, 101, 110, 111]
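A minimal sketch of the first limitation (illustrative, not from the talk): comparing two D-bit codes can only ever produce D+1 distinct distance values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                         # code length in bits
codes = rng.integers(0, 2, size=(1000, D))    # database binary codes
query = rng.integers(0, 2, size=D)            # quantized query e(x)

# Symmetric Hamming comparison: both sides are quantized, so the
# distance can only take the values 0, 1, ..., D.
distances = np.sum(codes != query, axis=1)
print(sorted(set(distances)))                 # subset of {0, ..., 8}
```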
Preliminary: a source coding system
Simple source-coding system:
► decorrelation: fixed transforms (DCT, DWT) or Karhunen-Loève (PCA)
► scalar quantization is often used
► bits are allocated to the quantizers in order to minimize the overall distortion
To a code e(x) is associated a unique reconstruction value q(x).
Reconstruction mean square error: MSE = E[‖x − q(x)‖²]
[Block diagram: input → decorrelation → scalar quantization → entropy coding → code, with rate-distortion optimization driving the bit allocation]
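A hedged numpy sketch of such a chain (PCA decorrelation, then a plain uniform scalar quantizer; real systems use Lloyd-Max quantizers, a non-uniform bit allocation, and an entropy coder, all omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 16)) @ rng.normal(size=(16, 16))  # correlated data

# Decorrelation with the Karhunen-Loeve transform (PCA)
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt.T                      # decorrelated coefficients

# Uniform scalar quantization, 4 bits per component
bits = 4
lo, hi = Z.min(axis=0), Z.max(axis=0)
step = (hi - lo) / 2 ** bits
code = np.floor((Z - lo) / step).clip(0, 2 ** bits - 1)   # code e(x)
Zq = lo + (code + 0.5) * step                             # reconstruction q(x)

# Reconstruction mean square error
Xq = Zq @ Vt + mean
print("MSE:", np.mean(np.sum((X - Xq) ** 2, axis=1)))
```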
Mimic source coding for neighbor search: formulation
Search is seen as a distance approximation problem.
The distance between two vectors can be estimated using:
► SDC: code-to-code (symmetric) distance, d(q(x), q(y))
► ADC: vector-to-code (asymmetric) distance, d(x, q(y))
The error on squared distances is bounded by the approximation error (from Huygens' theorem).
The key is to be able to compute the SDC and ADC estimators efficiently.
Biased estimator
These estimators are biased:
► the bias can be removed by adding quantization error terms
► but this does not improve the NN search quality
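A toy numeric sketch of the two estimators; a single shared 1D codebook stands in for the real quantizer, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(v, centroids):
    # nearest-centroid quantizer q(.), applied per dimension
    return centroids[np.argmin((v[:, None] - centroids[None, :]) ** 2, axis=1)]

centroids = np.linspace(-3, 3, 16)    # toy codebook shared by all dimensions
x = rng.normal(size=128)              # query vector (kept exact for ADC)
y = rng.normal(size=128)              # database vector (always quantized)

d_true = np.sum((x - y) ** 2)
d_sdc = np.sum((quantize(x, centroids) - quantize(y, centroids)) ** 2)  # code-to-code
d_adc = np.sum((x - quantize(y, centroids)) ** 2)                       # vector-to-code
print(d_true, d_sdc, d_adc)
```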
First approach: distance rate-distortion optimization [ICASSP'10]
Optimization criterion: minimize the expected squared error of the distance estimate, subject to a code length constraint of b bits.
First approach (symmetric): copy/paste of source coding
► Karhunen-Loève transform
► scalar quantization (Lloyd-Max)
► the number of quantization intervals allocated to each component aims at minimizing the simplified criterion: the expected error between the true squared distance and its reconstruction Σ_j e_j(c1, c2), subject to Σ_j log2(n_j) ≤ b,
where
► e_j(c1, c2) is the partial reconstructed squared distance associated with codes c1 and c2
► n_j is the number of intervals allocated to component j
Example
Right: 2D anisotropic Gaussian
► the greedy approach selects the dimension that best reduces the distance error per allocated bit (see the sketch below)
► a non-integer number of bits per dimension is possible
SDC: the number of possible distances is finite, but much higher than for Hamming Embedding: up to 2^D instead of D+1
Left: 5 bits → clearly more than 6 distances!
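A minimal sketch of such a greedy allocation, under the standard high-resolution assumption that the distortion of a component with variance σ² and n intervals behaves like σ²/n² (integer bits only, for simplicity; the actual method allows fractional bits):

```python
import numpy as np

def greedy_bit_allocation(variances, total_bits):
    """Give each extra bit to the component where it reduces the
    distortion the most (high-resolution model: sigma^2 / n^2, so one
    extra bit divides the component's distortion by 4)."""
    bits = np.zeros(len(variances), dtype=int)
    for _ in range(total_bits):
        # distortion decrease obtained by doubling the number of intervals
        gain = variances / 4.0 ** bits - variances / 4.0 ** (bits + 1)
        bits[np.argmax(gain)] += 1
    return bits

# 2D anisotropic Gaussian: the high-variance axis gets more bits
print(greedy_bit_allocation(np.array([9.0, 1.0]), 5))   # -> [3 2]
```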
Second approach: product quantization [PAMI'11]
The vector is split into m subvectors: y = [y1, …, ym]
Subvectors are quantized separately by quantizers q1, …, qm,
where each qj is learned by k-means with a limited number of centroids
Example: a 128-dim vector y split into 8 subvectors of 16 components,
each quantized with 256 centroids (8 bits) ⇒ 64-bit quantization index
[Diagram: y1 y2 … y8 → q1 q2 … q8 → q1(y1) q2(y2) … q8(y8)]
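A compact sketch of the encoder and of the asymmetric distance, assuming numpy; the codebooks here are random placeholders where the actual method uses k-means-trained ones:

```python
import numpy as np

rng = np.random.default_rng(2)
m, dsub, ksub = 8, 16, 256          # 8 subvectors of dim 16, 256 centroids each

# Placeholder codebooks, one per subquantizer (in practice: k-means)
codebooks = rng.normal(size=(m, ksub, dsub))

def pq_encode(y):
    """Return the m subquantizer indices: a 64-bit code (8 x 8 bits)."""
    code = np.empty(m, dtype=np.uint8)
    for j in range(m):
        sub = y[j * dsub:(j + 1) * dsub]
        code[j] = np.argmin(np.sum((codebooks[j] - sub) ** 2, axis=1))
    return code

def adc_distance(x, code):
    """Asymmetric distance estimate: the query x stays unquantized."""
    d = 0.0
    for j in range(m):
        sub = x[j * dsub:(j + 1) * dsub]
        d += np.sum((sub - codebooks[j][code[j]]) ** 2)
    return d

y = rng.normal(size=m * dsub)       # database vector: stored as 8 bytes
x = rng.normal(size=m * dsub)       # query vector
print(adc_distance(x, pq_encode(y)))
```

In practice the squared distances between each subvector of x and the 256 centroids are precomputed once per query, so estimating the distance to a database code reduces to m table lookups and additions.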
Combination with an inverted file: avoid exhaustive search
► coarse partition of the feature space using k-means
► the distance is estimated between residual vectors, and for a few database vectors only
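A rough sketch of this inverted-file variant (IVFADC), reusing pq_encode and adc_distance from the previous sketch; the coarse codebook is again a random stand-in for a k-means one:

```python
import numpy as np

rng = np.random.default_rng(3)
kc, D, n_probe = 1024, 128, 8              # illustrative sizes
coarse = rng.normal(size=(kc, D))          # stand-in for coarse k-means centroids

inverted_lists = {c: [] for c in range(kc)}

def add(vec_id, y):
    cell = int(np.argmin(np.sum((coarse - y) ** 2, axis=1)))
    # store the PQ code of the residual, not of the vector itself
    inverted_lists[cell].append((vec_id, pq_encode(y - coarse[cell])))

def search(x, topk=10):
    # visit only the n_probe closest coarse cells: no exhaustive scan
    cells = np.argsort(np.sum((coarse - x) ** 2, axis=1))[:n_probe]
    candidates = []
    for c in cells:
        residual = x - coarse[c]           # query residual for this cell
        for vec_id, code in inverted_lists[c]:
            candidates.append((adc_distance(residual, code), vec_id))
    return sorted(candidates)[:topk]
```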
Results: GIST descriptors
Image indexing
Retrieval of images representing the same object/scene:
► different viewpoints, backgrounds, cameras, times of day/year
► transformed images (crop, rotation, overlay, photocopy, etc.)
► interactive time (result in < 1 s), large scale (billions of images)
► limited resources (memory, CPU)
Objective and proposed approach [CVPR'10]
Aim: optimize the trade-off between
► search quality
► search speed
► memory usage
Approach: joint optimization of three stages
► local descriptor aggregation
► dimension reduction
► indexing algorithm
[Pipeline: image → extract SIFT (n SIFTs, 128 dim) → aggregate descriptors (dimension D) → dimension reduction (dimension D') → vector encoding/indexing (code)]
Aggregation of local descriptors
Problem: represent an image by a single fixed-size vector:
set of n local descriptors → 1 vector
Most popular idea: BoF representation [Sivic & Zisserman 03]
► sparse vector
► high-dimensional
→ strong dimensionality reduction introduces loss
Alternative: Fisher kernels [Perronnin et al 07]
► non-sparse vector
► excellent results with a small vector dimensionality
→ our method is in the spirit of this representation
VLAD: Vector of Locally Aggregated Descriptors
Learning: k-means
► output: k centroids c1, …, ci, …, ck
VLAD computation:
► ① assign each descriptor to its closest centroid
► ② compute the residual x − ci
► ③ vi = sum of the residuals x − ci for cell i
⇒ dimension D = k × 128
The resulting vector is L2-normalized.
Typical parameter: k = 64 (D = 8192)
[Figure: descriptors assigned to centroids c1 … c5; a descriptor x in cell 5 contributes its residual x − c5 to v5]
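The three steps above translate directly into a short function; a minimal numpy sketch:

```python
import numpy as np

def vlad(descriptors, centroids):
    """Aggregate n local descriptors (n, 128) into one VLAD vector,
    given k centroids (k, 128) learned by k-means."""
    k, d = centroids.shape
    # step 1: assign each descriptor to its closest centroid
    assign = np.argmin(
        ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        if np.any(assign == i):
            # steps 2-3: accumulate the residuals x - c_i for cell i
            v[i] = (descriptors[assign == i] - centroids[i]).sum(axis=0)
    v = v.reshape(-1)                 # dimension D = k * 128
    return v / np.linalg.norm(v)      # L2 normalization
```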
VLADs for corresponding images
[Figure: VLAD components v1 v2 v3 … shown as a SIFT-like representation per centroid (+ components: blue, − components: red); good coincidence of energy & orientations across matching images]
VLAD is a simplified version of Fisher kernels!
► comparable performance (depends on the dataset)
VLAD performance and dimensionality reduction
We compare VLAD descriptors with BoF on the INRIA Holidays dataset (mAP, %).
The dimension is reduced from D to D' dimensions with PCA.
Observations:
► VLAD is better than BoF for a given descriptor size
→ comparable to Fisher kernels for these operating points
► choose a small D if the output dimension D' is small
Aggregator   k         D         D'=D (no reduction)   D'=128   D'=64
BoF          1,000     1,000     41.4                  44.4     43.4
BoF          20,000    20,000    44.6                  45.2     44.5
BoF          200,000   200,000   54.9                  43.2     41.6
VLAD         16        2,048     49.6                  49.5     49.4
VLAD         64        8,192     52.6                  51.0     47.7
VLAD         256       32,768    57.5                  50.8     47.6
Optimizing the dimension reduction and quantization together
We use the product quantization indexing scheme.
VLAD vectors suffer two approximations:
► mean square error from the PCA projection: ep(D')
► mean square error from quantization: eq(D')
Given k and a budget of bytes per image, choose D' minimizing their sum. Example (k=16):
D'    ep(D')   eq(D')   ep(D') + eq(D')
32    0.0632   0.0164   0.0796
48    0.0508   0.0248   0.0757
64    0.0434   0.0321   0.0755
80    0.0386   0.0458   0.0844
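In code, the selection is just an argmin over the measured error terms (values copied from the table above):

```python
# (ep, eq) measured for each candidate D' (k=16 example above)
errors = {32: (0.0632, 0.0164), 48: (0.0508, 0.0248),
          64: (0.0434, 0.0321), 80: (0.0386, 0.0458)}
best = min(errors, key=lambda d: sum(errors[d]))
print(best)   # -> 64: projection and quantization errors are balanced
```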
Searching in 1 million images: how good is the short-list?
Fisher + source coding indexing significantly outperforms BoF with 256 bytes/image.
Fisher is a bit better than VLAD in this evaluation.
Results on standard datasets [new results, improving on our CVPR'10]
Datasets:
► University of Kentucky benchmark; score: average number of relevant images in the top 4 (max: 4)
► INRIA Holidays dataset; score: mAP (%)
Method            bytes   UKB    Holidays
BoF, k=20,000     10K     2.92   44.6
BoF, k=200,000    12K     3.06   54.9
miniBOF           20      2.07   25.5
miniBOF           160     2.72   40.3
FV + ADC          16      3.10   50.6
FV + ADC          320     3.47   63.4
miniBOF: “Packing Bag-of-Features”, ICCV’09
Large scale experiments (10 million images)
[Plot: recall@100 vs. database size (1000 → 10M); database: Holidays + distractor images from Flickr; curves: BOF D=200k, VLAD k=64, VLAD k=64 D'=96, VLAD k=64 ADC 16 bytes, VLAD + Spectral Hashing 16 bytes; timing annotations: 4.768 s, ADC: 0.286 s, IVFADC: 0.014 s, SH ≈ 0.267 s]
Conclusion
The bag-of-features + inverted file framework is not the ultimate solution:
► it has reached its limits w.r.t. the number of indexed images
► competing representations do exist: Fisher kernels and variations
Can scale up to 1 billion images (16 GB RAM, < 1 s search on 8 cores)
ANN based on source coding: a rich framework
10M images indexed on a laptop. See the demo now!