

For the degree of

DOCTOR OF THE UNIVERSITÉ DE POITIERS

(Faculté des Sciences Fondamentales et Appliquées)

(National Diploma - Decree of 7 August 2006)

Doctoral school: Sciences et Ingénierie pour l'Information, Mathématiques (S2IM)

Research field: Signal and Image Processing

Presented by:

Huu Ton LE

************************

Improving Image Representation Using Image Saliency and Information Gain

Thesis supervisor: Christine FERNANDEZ-MALOIGNE

Thesis co-supervisor: Thierry URRUTY

************************

Defended on 23 November 2015 before the examination committee composed of:

Jury members

Pr. Liming CHEN, Université de Lyon, Reviewer
MCF. Benoit HUET, EURECOM, Reviewer
Pr. Joemon JOSE, University of Glasgow, Examiner (President of the jury)
MCF. Muriel VISANI, Université de La Rochelle, Examiner
Pr. Christine FERNANDEZ-MALOIGNE, Université de Poitiers, Thesis supervisor
MCF. Thierry URRUTY, Université de Poitiers, Thesis co-supervisor


Contents

List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Context
  1.2 Scientific contributions
  1.3 Organisation

2 State of the Art
  2.1 CBIR framework with BoVW implementation
    2.1.1 Local feature detection
    2.1.2 Descriptor extraction
    2.1.3 Image representation
      2.1.3.1 The idea of BoVW
      2.1.3.2 Visual Vocabulary Construction
      2.1.3.3 Image representation
    2.1.4 Similarity matching
  2.2 From BoVW to BoVP
  2.3 Term Weighting Scheme
  2.4 Visual Attention Model
    2.4.1 Introduction of Visual Attention Model
    2.4.2 Visual Attention Model of Itti and Koch
  2.5 Conclusion

3 Using Information Gain for Visual Vocabulary Construction
  3.1 Iterative Random Visual Word Selection
    3.1.1 Proposed method
    3.1.2 Experiments
    3.1.3 Conclusion on the ItRaSel approach
  3.2 Information gain evaluation
    3.2.1 Evaluation methodology
    3.2.2 Experiments
    3.2.3 Mixing vocabularies

4 Improving BoVW with Visual Attention Model
  4.1 Improving local feature extraction using saliency information
    4.1.1 Evaluating the saliency of key-point detectors
    4.1.2 Impact of local feature filtering based on visual saliency
    4.1.3 Conclusion on improving feature detection with saliency
  4.2 Saliency histogram
    4.2.1 Proposed method
    4.2.2 Experiments and Results
    4.2.3 Conclusion on saliency histogram
  4.3 Saliency weighting
    4.3.1 Proposed method
    4.3.2 Experiments and Results
    4.3.3 Combining saliency weighting and saliency histogram
    4.3.4 Conclusion on saliency weighting
  4.4 Conclusion

5 Bag of Visual Phrases and Query Expansion
  5.1 Bag of Visual Phrases
    5.1.1 BoVP implementation
    5.1.2 BoVW and BoVP comparison
    5.1.3 Refining the visual phrase vocabulary with the ItRaSel framework
    5.1.4 ItRaSel vocabulary on BoVP
    5.1.5 Saliency weighting for BoVP
    5.1.6 Conclusion on the BoVP model
  5.2 Query expansion
    5.2.1 Introduction of the proposed query method
    5.2.2 Experiment results
    5.2.3 Conclusion on query expansion
  5.3 Conclusion

6 Conclusion and perspective
  6.1 Conclusion
  6.2 Perspectives


List of Figures

2.1 A typical CBIR framework
2.2 Harris detector [1]
2.3 Cornerness depends on the scale factor
2.4 DoG interest point detector [2]
2.5 Searching for extreme points over scale and space [2]
2.6 FAST feature detection
2.7 128 dimension SIFT descriptor [3]
2.8 Images are represented by local features
2.9 Vocabulary construction step
2.10 Images are represented by histograms of visual words
2.11 Vocabulary construction using a clustering method [4]
2.12 k-means algorithm
2.13 Assign the local features to the closest visual word
2.14 Calculate the difference between local features and visual words
2.15 VLAD vector presentation
2.16 Spatial Pyramid Representation [5]
2.17 Illustration of the GVP visual phrase descriptor [6]
2.18 Example of creation of 2-grams visual phrases [7]
2.19 Architecture of the Visual Attention Model of Itti and Koch [8]
2.20 An example of the saliency model of Itti and Koch. The input image is presented on the left and its saliency map on the right
3.1 The curse of dimensionality [9]. Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart. Additional dimensions spread the data even further, making high-dimensional data extremely sparse.
3.2 Using Information Gain for Vocabulary Construction
3.3 Stabilization Process
3.4 An example set of 4 images showing the same object in UKB
3.5 An example set of images in different classes of the Pascal data set: images (a)-(c) belong to class motorbike, images (d)-(f) to class horse and images (g)-(i) to class person
3.6 An example set of images in the Holidays database: the image at the top is the query image; four corresponding results are shown at the bottom
3.7 UKB score with respect to different distance formulations
3.8 Iterative selection scores for 2048 random initial words
3.9 Iterative selection mean scores for 1024 to 65536 random initial words
3.11 Mean scores for 1024 to 4096 random initial words, group 1 × 1 to 10 × 10
3.12 UKB scores with respect to number of mixed features on UKB data set
3.13 Precision with respect to number of mixed features on Pascal data set
3.14 Information gain evaluation framework
3.15 Score on Holidays with OppSIFT descriptor with respect to number of visual words in vocabulary
3.16 Score on UKB using CMI descriptor with respect to number of visual words in vocabulary
3.17 Score on Pascal using SIFT descriptor with respect to number of visual words in vocabulary
4.1 Image quantised with different saliency thresholds
4.2 The algorithm used to perform local feature filtering based on visual saliency. In the last step of the algorithm, the four images correspond to different percentages of the least salient deleted features.
4.3 Score on UKB using CMI descriptor, with visual vocabulary of 300 elements
4.4 Score on UKB using SIFT descriptor, with visual vocabulary of 10000 elements
4.5 Score on UKB using CMI descriptor, with visual vocabulary of 300 elements and dense local features
4.6 Score on UKB using SIFT descriptor, with visual vocabulary of 10000 elements and dense local features
4.7 Score on Holidays using SIFT descriptor, with a visual vocabulary of 10000 elements and dense local features
4.8 The results obtained deleting less salient local features
4.9 Replacing the less salient points detected by Harris-Laplace by the most salient selected with dense quantization
4.10 Different levels of spatial pyramid representation
4.11 Grid-based representation
4.12 Retrieval score on UKB database with respect to weighting α
4.13 Retrieval score on UKB database with 2 levels of spatial pyramid representation
4.14 Retrieval scores on UKB database with 3 levels of spatial pyramid representation
4.15 Retrieval score on UKB database with grid of 4 sub-regions
4.16 Retrieval scores on UKB database with grid of 9 sub-regions


List of Tables

2.1 BoVP model summarization
3.1 Complexity analysis for one iterative step
3.2 UKB scores
3.3 UKB scores: IteRaSel vs other methods
3.4 Pascal VOC2012 scores
3.5 k-means and Information Gain comparison
3.6 Mean differences with respect to the data set
3.7 Visual vocabularies obtained by different information gain models
3.8 Visual Word Ranking Calculation
3.9 Scores using mixing vocabulary methods after the iterative step
3.10 Scores using mixing vocabulary methods at the iterative step
4.1 Distribution of the salient features for each detector and data set
4.2 Computation of the area between the curves of deleting most salient and less salient local features
4.3 Results on UKB database with different numbers of bins
4.4 Results on UKB database with different descriptors
4.5 Results on UKB database using 3 levels spatial pyramid representation of saliency histogram
4.6 Results on UKB database using grid based representation of saliency histogram
4.7 Results on UKB database using saliency histogram
4.8 Results on UKB database with saliency weighting using BoVW model
4.9 Results on UKB database with different methods embedding saliency information
4.10 Results on UKB database with ItRaSel framework using different methods embedding saliency information
5.1 Results on UKB database using Harris-Laplace detector and Dense Sampling method
5.2 Results on UKB database using different distance formulations
5.3 Results on UKB database with BoVW and BoVP models
5.4 Results on UKB database with BoVW and BoVP models with ItRaSel framework
5.5 Scores on UKB database with different vocabularies
5.6 Scores on UKB data set with saliency weighting using BoVP model
5.7 Results on UKB database using query expansion technique with t = 1 using different values of α1
5.9 Results on UKB database with ItRaSel framework and Query Expansion
5.10 Results on BoVP with Query Expansion


Abbreviations

BBoV - Bag of Bags of Visual Words
BoVP - Bag of Visual Phrases
BoVW - Bag of Visual Words
BRIEF - Binary Robust Independent Elementary Features
BRISK - Binary Robust Invariant Scalable Key-point
CBIR - Content Based Image Retrieval
CM - Colour Moment
CMI - Colour Moment Invariant
CSIFT - Colour SIFT
DoG - Difference of Gaussian
EBoF - Extended Bag of Features
FAST - Features from Accelerated Segment Test
FREAK - Fast Retina Key-point
GBVS - Graph Based Visual Saliency
GMM - Gaussian Mixture Model
GVP - Geometry-preserving Visual Phrases
HSOG - Histogram of Second-Order Gradient
IG - Information Gain
ItRaSel - Iterative Random Selection
mAP - Mean Average Precision
MSER - Maximally Stable Extremal Region
OppSIFT - Opponent SIFT
PASCAL VOC2012 - PASCAL Visual Object Classes challenge 2012
PINS - Prediction of INterest point Saliency
SIFT - Scale Invariant Feature Transform
SURF - Speeded Up Robust Features
UKB - University of Kentucky Benchmark
VLAD - Vector of Locally Aggregated Descriptors


1 Introduction

1.1 Context

The Information Retrieval (IR) field has been very active since the middle of the twentieth century, a time when people relied on other people's knowledge whenever they had an information need. The Text REtrieval Conferences, held since 1992, greatly increased the amount of research and engineering innovation in this area before IR became a necessity with the development of the World Wide Web. Many researchers in this field have become references and a great source of inspiration for the next generations [10], [11], [12].

As human knowledge is not bound to textual information, a research field has appeared in parallel to IR: Multimedia Information Retrieval (MIR). The exponential growth of available multimedia information, coming along with the development of new technologies, leads to the need for powerful tools and systems to store, manage, access and share information. Indeed, capturing pictures and videos and sharing them online has become intuitive for any end user with a smartphone with a built-in camera, or any camera and personal computer combination. The increasing capacity of storage devices and the bandwidth of communication networks have greatly helped the growth of multimedia system needs.

Nowadays, recent studies on social networks and picture/video sharing websites show remarkable statistics on multimedia information. We give here some well-known examples: there are in total 219 billion pictures on Facebook [13]. This number increases rapidly, with 300 million pictures uploaded every day. On Instagram, 70 million photos are uploaded daily by more than 300 million users [14]. For videos, YouTube has over a billion registered users and 300 hours of video are uploaded every minute [15].


Some videos are internationally known, for example "Gangnam Style" with more than 2.4 billion views.

Other studies [13, 16] highlight that visual content gets more interaction from end users than text-only content, showing the increasing interest of the MIR field. One important objective when dealing with such a voluminous amount of pictures and videos is to retrieve the multimedia documents that match a query given by a user to represent an information need. There are two categories of search methods for multimedia information:

- Text-based retrieval: query by metadata such as keywords, subject headings, caption tags, annotations or text descriptions of the images.

- Content-based retrieval: query by example with the visual contents of images/videos using any existing descriptor e.g. colour, texture or any other information extracted from the image itself.

The difference between the information extracted automatically from an image by a computer and the interpretation of the same data by a user is known as the semantic gap [17]. For example, a user looking for a "bat" will probably first be shown the animal. Was he looking for it, or for a baseball bat? The query may have to be refined. In image retrieval research, researchers are moving from keyword-based to content-based search, and the main problem encountered in content-based image retrieval is the semantic gap between the low-level features (colour, shape, texture, ...) representing the images and the high-level semantics of the images [18, 19]. Bridging the semantic gap for image retrieval is a very challenging problem yet to be solved.

Compared to text-based retrieval, content-based retrieval has several advantages:

- Content-based retrieval does not depend on the accuracy of metadata.
- It is difficult for metadata to cover all the contents of an image.
- It takes a lot of effort to manually build up metadata for a big image database.

For those reasons, Content-Based Image Retrieval (CBIR) has become an interesting and active research topic in the MIR field, with an increasing number of application domains: image indexing and retrieval [20, 21], face recognition [22, 23], event detection [24-27], handwriting scanning [28, 29], object detection and tracking [30, 31], image classification [32], landmark detection [33], bio-medical imaging [34], remote sensing [35]...


Among these, we focus our research work on trying to improve the image representation for better indexing and retrieval.

1.2 Scientific contributions

In this section, we present our scientific contributions to the MIR field, and more particularly to the indexing and retrieval domain. First, we decided to focus our interest on how to better represent an image. One of the most popular models in CBIR is the Bag of Visual Words (BoVW) model [36], which is inspired by the Bag of Words model in document retrieval. The BoVW model uses local features to represent the images. It first builds a visual vocabulary and uses that vocabulary to quantise all the local features of the images. Each image is represented by a histogram of visual words, and by comparing these histograms we can tell how different the images are. The implementation of the BoVW model involves the following steps: local feature detection [2, 37-40], descriptor extraction [41-46], visual vocabulary construction [36, 47, 48], image representation [5, 36, 49, 50] and similarity matching [20, 51-53]. Each step has its own state of the art, and new solutions are frequently introduced to adapt to the increasing size of image databases.

Psycho-visual science, with the study of the human visual system, is also involved in improving the performance of CBIR systems. One example of this approach is the visual attention model [8, 54-57], which addresses the observed and/or predicted behaviour of human and non-human primate visual attention. The most common application of visual attention models in CBIR is to take advantage of the saliency map to decrease the amount of information to be processed [21, 58, 59]. Those methods usually take the information given by the visual attention model at an early stage: based on the saliency values, image information is either discarded or picked as input for the next stages.

The first contribution of this PhD research work to the CBIR field is a new framework, named ItRaSel, for visual vocabulary construction. In this framework, local features are selected based on their information gain (IG) value. The IG values are computed by a weighting scheme combined with a visual attention model. Experiments [60] have demonstrated that, for small vocabularies, this framework outperforms the classical BoVW model not only in effectiveness but also in efficiency. Another contribution of our research is the study of the effects of different information gain models on the creation of a visual vocabulary.


Then, we propose to use a visual attention model to improve the performance of the proposed BoVW model. This contribution addresses the importance of salient key-points in the images through a study of the saliency of local feature detectors. Inspired by the fact that key-points with higher saliency usually return better retrieval scores than those with lower saliency [61], we use saliency as a weighting or as an additional histogram for image representation. Our experiments have shown that visual attention helps to improve the BoVW model in multiple ways.

Some research works have extended BoVW into the Bag of Visual Phrases (BoVP) model [62-66] to obtain a more discriminative representation. Most BoVP models find co-occurrences of visual words, link them together into visual phrases and use a histogram of visual phrases to represent the images. There are various methodologies in the literature to identify visual phrases; they can be classified into two approaches: either linking a visual word with its nearest neighbours or with other visual words lying within a neighbouring region. In general, BoVP outperforms the BoVW model in effectiveness, but due to the bigger size of its vocabularies, it suffers from increased computation time and storage requirements. The last contribution of this thesis to CBIR shows how our framework enhances the BoVP model. We first introduce a methodology to implement the BoVP model using a dense sampling local feature detector. Then, we extend the two frameworks that were proved to enhance the performance of BoVW, ItRaSel and saliency weighting, to the BoVP model. Finally, a query expansion technique is employed to increase the retrieval scores of both the BoVW and BoVP models.

1.3 Organisation

This manuscript is organised as follows:

First, we present in Chapter 2 different solutions for BoVW-based CBIR systems in the literature, followed by a brief introduction of BoVP and of the visual attention model. It is not a comprehensive state of the art, but focuses on the fundamental works that inspired our research.

In Chapter 3, we propose a new framework for building the visual vocabulary. The idea of this framework is to select the most informative visual words from a set of random local features to create the vocabulary. We employ an iterative process to reduce, step by step, the number of visual words from the initial vocabulary until it reaches the expected size. At each step, the information gain value of each visual word is calculated and these values are used as indicators to refine the vocabulary. In this chapter, we also evaluate the effect of different information gain models on the selection of visual words.


In Chapter 4, we use visual attention to improve the performance of the BoVW model. Our work on visual attention starts with a study of the saliency of different local feature detectors and of how salient key-points affect the retrieval scores of the BoVW model. After that, we embed the visual attention information into the BoVW model to obtain a better image representation. Two methods of using visual attention are introduced in this chapter: saliency histogram and saliency weighting.

We extend the BoVW model to the BoVP model in Chapter 5. In the BoVP framework, we use a dense sampling feature detection method and link visual words that are close to each other into phrases, then use a histogram of visual phrases to represent the images. We also apply the two frameworks introduced in Chapter 3 and Chapter 4 to the BoVP model. The second part of this chapter concerns query expansion techniques, which help to enhance the performance of the BoVW and BoVP models.


2 State of the Art

In this chapter, we introduce our literature study on Content Based Image Retrieval (CBIR), which has been the foundation of the research works presented in the next chapters. As CBIR is a very active field, its state of the art is too vast to be exposed entirely in this manuscript. Thus, we focus our literature review on selected topics, including the image representation framework detailed in Figure 2.1. Given an input query image and an image collection, the goal of the system is to find in the image collection the images most similar to the query one. A typical CBIR system consists of two stages:

- The offline stage: its purpose is to index the image data set. To do so, it extracts the features or descriptors from the images and uses them to build the signatures of the images, so that we can distinguish images by comparing their signatures. There are two types of features: local features and global features. Global features characterise the image in general and are usually related to texture and colour information. Local features describe the local behaviour of a point or a region in the image.

- The online stage: once the query image is given, the same process as in the offline stage is applied to the query image in order to compute its signature. Then, a similarity matching process is used to compute the similarity between the query image and all other images in the collection. The most similar images are returned as the results of the retrieval process.

We found our research interest and inspiration while reading about the Bag of Visual Words (BoVW) model, one of the most popular models for image representation in CBIR. In the BoVW model, images are represented as histograms of local features, and by comparing the histograms of images we can tell how different they are. In Section 2.1, we present our study of the BoVW-based CBIR framework, which consists of the following steps: local feature detection, descriptor extraction, image representation and similarity matching.


Figure 2.1: A typical CBIR framework

Bag of Visual Phrases (BoVP), an extended version of the BoVW model, is introduced in Section 2.2. The term weighting scheme and the visual attention model, two models that are used to improve the performance of the BoVW model, are presented in Sections 2.3 and 2.4.

2.1 CBIR framework with BoVW implementation

We present in this section the implementation of an image retrieval system based on the BoVW model. The system consists of four main steps: local feature detection, descriptor extraction, image representation and similarity matching.


2.1.1 Local feature detection

In the BoVW model, images are represented by a set of local features, and we use those local features to distinguish images. To achieve this goal, we first need to identify the local features in the images. This task is done by a local feature detector. In the BoVW model, the main objective is to find the images that are similar to, or contain the same object as, the query image, based on the appearance of local features in the images. To achieve this goal, similar images should contain similar local features.

The purpose of a local feature detector is to detect the most distinctive features in the image. Local feature detection is the first step when processing an image, in both the online and offline stages. According to [67], a good local feature detector should hold the following properties:

- Repeatability: when we apply the feature detector on two images of the same object or scene, taken under different viewing conditions, most of the local features detected in one image should be found in the other.

- Distinctiveness/informativeness: the local features can be distinguished or matched together. Similar objects or details should return similar features. On the contrary, different objects or scenes should have different local features.

- Locality: the features should be local, as this reduces the probability of occlusion. They should also keep their spatial relations under simple geometric and photometric deformations.

- Quantity: feature detectors should return a reasonable number of features per image, so that even small objects are described by a sufficient number of features. The number of features may affect the execution time as well as the effectiveness of the system. Different applications may require different densities of features, which also reflect the informative content of the images.

- Accuracy: the detected features should be accurately localised, not only in image location but also with respect to scale and possibly shape.

- Efficiency: image collections are getting bigger and bigger, which leads to a heavy computational cost for image retrieval systems. Thus, feature detection should be fast enough to be useful in real-time applications.

Many local feature detectors have been introduced in the literature pursuing those characteristics. In this section, we introduce some popular local feature detectors which have been widely applied in many applications.


The Hessian Detector [37] assumes that interesting key-points in the image should have strong derivatives in two orthogonal directions. The Hessian detector calculates the second derivatives $I_{xx}$, $I_{xy}$ and $I_{yy}$ at every image point to create the Hessian matrix as follows:

$$H(x, \sigma) = \begin{bmatrix} I_{xx}(x, \sigma) & I_{xy}(x, \sigma) \\ I_{xy}(x, \sigma) & I_{yy}(x, \sigma) \end{bmatrix} \qquad (2.1)$$

After that, the determinant of the Hessian matrix is used as the indicator to detect interest points in the image:

$$\det(H) = I_{xx} I_{yy} - I_{xy}^2 \qquad (2.2)$$

A pixel in the image is considered an interest point if the value of det(H) at this pixel is higher than a predefined threshold and higher than the det(H) values of its 8 neighbouring pixels.
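To make this criterion concrete, here is a minimal NumPy/SciPy sketch of the Hessian test; the function name, the smoothing scale and the threshold are illustrative choices, not values taken from the thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def hessian_keypoints(image, sigma=2.0, threshold=0.01):
    """Toy Hessian detector: det(H) thresholding + 8-neighbour non-max suppression."""
    smoothed = gaussian_filter(image.astype(float), sigma)
    # Second derivatives of the smoothed image (Equation 2.1).
    Ix = np.gradient(smoothed, axis=1)
    Iy = np.gradient(smoothed, axis=0)
    Ixx = np.gradient(Ix, axis=1)
    Ixy = np.gradient(Ix, axis=0)
    Iyy = np.gradient(Iy, axis=0)
    det_h = Ixx * Iyy - Ixy ** 2          # Equation 2.2
    # Keep pixels above the threshold that dominate their 3x3 neighbourhood.
    local_max = maximum_filter(det_h, size=3) == det_h
    ys, xs = np.nonzero(local_max & (det_h > threshold))
    return list(zip(xs, ys))
```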

The Harris detector [38] looks for corner points in the image. It builds the second-moment matrix C at a pixel x from the first derivatives in a window around it, weighted by a Gaussian function G(x, σ):

$$C(x) = G(x, \sigma) \times \begin{bmatrix} I_x^2(x) & I_x I_y(x) \\ I_x I_y(x) & I_y^2(x) \end{bmatrix} \qquad (2.3)$$

where $I_x$ and $I_y$ are the derivatives of the pixel intensity in the x and y directions at point x.

The Harris detector uses the two eigenvalues $\lambda_1$ and $\lambda_2$ of matrix C to decide whether a point lies in a flat region, on an edge or on a corner, as illustrated in Figure 2.2. A point is a corner point if the moment matrix C at this point has large values of $\lambda_1$ and $\lambda_2$.
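The eigenvalue criterion can be sketched as follows with NumPy and SciPy; the function name and the value of sigma are hypothetical, used only for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_eigenvalues(image, sigma=1.5):
    """Per-pixel eigenvalues of the second-moment matrix C (Equation 2.3)."""
    img = image.astype(float)
    Iy, Ix = np.gradient(img)
    # Gaussian-weighted sums of the derivative products.
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[Sxx, Sxy], [Sxy, Syy]].
    trace = Sxx + Syy
    delta = np.sqrt(((Sxx - Syy) / 2.0) ** 2 + Sxy ** 2)
    lam1, lam2 = trace / 2.0 + delta, trace / 2.0 - delta
    return lam1, lam2

# A pixel is corner-like when both eigenvalues are large:
# corners = (lam1 > t) & (lam2 > t) for some threshold t.
```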

Scale is an important factor in determining the cornerness of a pixel in the image. As illustrated in Figure 2.3, a corner point at a specific scale may not be a corner at other scales. To deal with this problem, many scale-invariant local feature detectors have been introduced, such as the Difference of Gaussians (DoG) detector [2] and the Harris-Laplacian detector [39].

The Difference of Gaussians detector finds extreme points in the image based on the Laplacian, using a multi-scale space pyramid. The scale space of an image is defined as a function L(x, y, σ) which is the convolution of a variable-scale Gaussian G(x, y, σ) with the input image I(x, y):

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \qquad (2.4)$$


Figure 2.2: Harris detector [1]

Figure 2.3: Cornerness depends on the scale factor

where σ denotes the scale, I(x, y) denotes the input image, ∗ denotes the convolution operation and G(x, y, σ) is the Gaussian function at scale σ:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2} \qquad (2.5)$$

The Gaussian pyramid is the collection of images built by repeatedly smoothing and sub-sampling the input image. The differences of Gaussians are then computed by taking the differences between adjacent levels in the Gaussian pyramid, as illustrated in Figure 2.4:

$$DoG(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \qquad (2.6)$$

where k is the constant multiplicative factor in scale between two adjacent blurred images.


Figure 2.4: DoG interest point detector [2]

Figure 2.5: Searching for extreme points over scale and space [2]

Once the DoG is computed, the detector searches for local extreme points in the image over all scales. Each point is compared with its 8 neighbouring pixels at the current scale, with 9 pixels at the next scale and with 9 pixels at the previous scale, as illustrated in Figure 2.5. If the DoG value of a pixel is a local extremum in scale and space, then it is a potential key-point.
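A minimal sketch of building a DoG stack and testing an interior pixel against its 26 neighbours is given below, assuming SciPy is available; sigma0, k and the number of levels are illustrative, not the values used in [2].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Build a stack of Difference-of-Gaussians images (Equation 2.6)."""
    img = image.astype(float)
    sigmas = [sigma0 * k ** i for i in range(levels)]
    blurred = [gaussian_filter(img, s) for s in sigmas]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(levels - 1)])

def is_local_extremum(dog, s, y, x):
    """Compare an interior pixel with its 26 neighbours in scale and space."""
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    value = dog[s, y, x]
    return value == cube.max() or value == cube.min()
```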

The Harris-Laplace detector combines the Harris corner detector introduced above with the automatic scale selection method of [68] to achieve scale invariance. First, the Gaussian scale space representation is created by convolving Gaussian kernels of various sizes with the original image. Then, the Harris function is used at each scale to detect corner points, before an iterative algorithm is employed to automatically select the characteristic scale of each corner point.

Harris affine region detector: inspired by the work of Tony Lindeberg et al. [69] on affine shape adaptation, the authors of [70] further improve the Harris-Laplace detector to make it affine invariant. The detector starts with the circular region around each corner point returned by the Harris-Laplace detector; the radius of each region is defined by the characteristic scale obtained with the method in [71]. An iterative process is then used to estimate the affine region. At each step, we build the region's second-moment matrix and calculate its eigenvalues, which yield an elliptical shape corresponding to a local affine deformation. We select the proper integration scale and differentiation scale, spatially localise the interest points, and use those parameters to update the affine region. This procedure is repeated until the eigenvalues of the second-moment matrix are equal.

FAST detector: the feature detectors introduced previously have their own advantages and disadvantages; one disadvantage they have in common is that they are too computationally intensive to be used in real-time applications where users need a response as soon as possible. In [40, 72], Edward Rosten et al. introduce a new feature detector named FAST (Features from Accelerated Segment Test) which is much more efficient. The implementation of FAST is only based on intensity comparisons and, for this reason, it greatly reduces the execution time of the feature detector. Given a pixel p in the image, the FAST detector performs intensity tests between the pixel p and 16 other pixels on a circle around p, as illustrated in Figure 2.6. A pixel is considered as an interest point if the intensities of at least 12 contiguous pixels around it are all above or all below the intensity of p by a threshold t. Because of the contiguity requirement, the speed of the test can be further increased by pre-examining pixels 1, 5, 9 and 13, because a pixel only passes the test if three of those four pixels are all above or all below the intensity of p with respect to the threshold. The number of features detected by FAST is easily managed by adjusting the threshold t. FAST is much faster than other existing feature detectors but still holds high levels of repeatability under large aspect changes and for different kinds of features. However, FAST also suffers from several disadvantages, such as a lack of ability to deal with high levels of noise; the threshold t needs to be carefully chosen to obtain good quality feature detection, and it may respond to 1-pixel-wide lines at certain angles, when the quantisation of the circle misses the line.
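As an illustration, OpenCV ships a FAST implementation; the sketch below assumes the cv2 Python bindings are installed, and the file name and threshold value are placeholders.

```python
import cv2

# Load a grayscale image (the path is illustrative).
image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# FAST detector with intensity threshold t; non-maximum suppression keeps
# only the strongest corner in each neighbourhood.
fast = cv2.FastFeatureDetector_create(threshold=25, nonmaxSuppression=True)
keypoints = fast.detect(image, None)
print(len(keypoints), "FAST key-points detected")
```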

Dense Sampling: dense sampling is not a feature detector, but we present it here because it is also a method to select feature points in images [73-75].


Figure 2.6: FAST feature detection

The idea of dense sampling is simple: instead of extracting local features at corner or interest points, the dense sampling method extracts a local feature every n-th pixel in the image. The parameter n is the dense spacing, and it can be used to adjust the number of features extracted by the dense sampling method. Every point selected by this method carries the same scale and orientation. Although the retrieval precision with dense sampling is higher than with a key-point detector [74], its computational cost is considered a drawback of this approach, since the number of local features employed in each image is greatly increased.
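A possible way to generate such a dense grid of key-points with OpenCV is sketched below; the spacing and the key-point diameter are arbitrary illustrative values.

```python
import cv2

def dense_keypoints(image, spacing=8, diameter=8.0):
    """Place a key-point every `spacing` pixels, all with the same scale."""
    height, width = image.shape[:2]
    return [cv2.KeyPoint(float(x), float(y), diameter)
            for y in range(0, height, spacing)
            for x in range(0, width, spacing)]

# The resulting key-points can be fed to any descriptor extractor, e.g.:
# sift = cv2.SIFT_create()
# _, descriptors = sift.compute(image, dense_keypoints(image))
```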

We have presented in this section several popular methods from the literature for detecting local features in an image. Each method has its own advantages and disadvantages; thus, the selection of the ideal local feature detector depends mainly on the objectives of the application and on the data set used. Five detectors are selected for the experiments in the next chapters, mainly for their popularity: Harris, Harris-Laplace, DoG, dense sampling and FAST. Local feature detection is the first step of the BoVW model when processing an image. Once we have the positions of the local features, the next step is to encode those local features into vectors, which is often referred to as descriptor extraction. The implementation of descriptor extraction is introduced in the next section.

2.1.2 Descriptor extraction

As introduced in the previous section, the first step of the BoVW model is feature detection, which returns a list of key-points or interest points. The next step in the BoVW pipeline is descriptor extraction. The purpose of descriptor extraction is to capture the most important and distinctive information enclosed in the regions around the feature locations and to encode it into descriptors which are suitable for discriminative matching.


In this section, we introduce selected approaches from the state of the art of the descriptor extraction stage.

SIFT: the Scale Invariant Feature Transform (SIFT) was introduced by David Lowe [41] and has become one of the most popular descriptors. SIFT is said to be invariant to image scaling and rotation, and robust to changes in illumination as well as to 3D camera viewpoint. The implementation of SIFT is described as follows:

- Scale-space extremum detection: this is the first step of SIFT descriptor extraction; its purpose is to look for extreme points in the image at different scales. The potential interest points are identified using the Difference of Gaussians (DoG) detector presented in the previous section. All the potential key-points found by the DoG detector are then refined to discard the ones that have low contrast or are poorly localised along an edge.

The Taylor series expansion of the scale space is used to get a more accurate location of each extreme point, and if the intensity at this detected key-point is less than a threshold, the key-point is discarded.

To eliminate key-points that belong to an edge, Lowe et al. [41] use a 2×2 Hessian matrix to calculate the curvature at each potential key-point and use this indication to decide whether to keep or discard it.

To achieve rotation invariance, each key-point is assigned an orientation. The gradient magnitude and direction are calculated in the neighbourhood of the key-point; the size of the neighbourhood is determined by the scale at which the key-point has been detected. An orientation histogram with 36 bins is then created to cover the 360 degrees of possible orientations.

Finally, the key-point descriptor is computed as presented in Figure 2.7. The 16 × 16 neighbourhood around the key-point is divided into sixteen 4 × 4 sub-blocks, and an 8-bin orientation histogram is created for each sub-block, which results in a 128-dimensional SIFT descriptor for each key-point. Many works have been introduced in the literature to enhance the effectiveness as well as the efficiency of the SIFT descriptor.
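Before turning to those extensions, here is a minimal example of extracting DoG key-points and their 128-dimensional SIFT descriptors, assuming a recent OpenCV build where cv2.SIFT_create is available; the image path is a placeholder.

```python
import cv2

image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# Detect DoG key-points and compute their 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

print(descriptors.shape)   # (number of key-points, 128)
```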

CSIFT: one of the disadvantages of SIFT is that it works only on the grayscale image and discards the colour information. CSIFT [42] and Opponent-SIFT [75] make use of colour information to further improve the performance of the SIFT descriptor. The CSIFT descriptor is built using the colour invariance model developed by Geusebroek et al. [76].


Figure 2.7: 128 dimension SIFT descriptor [3]

The photometric reflectance is modelled by the following equation:

$$E(\lambda, \vec{x}) = e(\lambda, \vec{x})\,(1 - \rho_f(\vec{x}))^2 R_\infty(\lambda, \vec{x}) + e(\lambda, \vec{x})\,\rho_f(\vec{x}) \qquad (2.7)$$

where $\lambda$ is the wavelength and $\vec{x}$ is a 2D vector denoting the image position; $e(\lambda, \vec{x})$ denotes the illumination spectrum, $\rho_f(\vec{x})$ is the Fresnel reflectance at $\vec{x}$, $R_\infty(\lambda, \vec{x})$ denotes the material reflectivity and $E(\lambda, \vec{x})$ represents the reflected spectrum at the observed position.

CSIFT constructs the first and second spectral derivatives of $E(\lambda, \vec{x})$ and defines the colour image invariant H, which is independent of viewpoint, surface orientation, illumination direction, intensity and the Fresnel reflectance coefficient. H is calculated as follows:

$$H = \frac{E_\lambda}{E_{\lambda\lambda}} \qquad (2.8)$$

Abdel-Hakim et al. [42] calculate $(E, E_\lambda, E_{\lambda\lambda})$ from the RGB values using a mapping matrix. CSIFT uses the same procedure as SIFT to create the descriptor, but applies it to H(x, y) instead of I(x, y).

Opponent SIFT [75] is also a colour version of SIFT. Images are first represented by 3 channels in an opponent colour space, as shown in Equation (2.9):

$$\begin{bmatrix} O_1 \\ O_2 \\ O_3 \end{bmatrix} = \begin{bmatrix} \frac{R - G}{\sqrt{2}} \\ \frac{R + G - 2B}{\sqrt{6}} \\ \frac{R + G + B}{\sqrt{3}} \end{bmatrix} \qquad (2.9)$$

SIFT descriptors are then computed for each channel, which triples the number of dimensions of the descriptor vector, from 128 (SIFT) to 384.
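A small NumPy sketch of the opponent colour transform of Equation (2.9) is given below; it assumes an H × W × 3 RGB array and uses the √2, √6, √3 normalisation of the standard opponent colour space.

```python
import numpy as np

def opponent_channels(rgb):
    """Convert an RGB image (H x W x 3, float) to the O1, O2, O3 channels."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    o1 = (r - g) / np.sqrt(2.0)
    o2 = (r + g - 2.0 * b) / np.sqrt(6.0)
    o3 = (r + g + b) / np.sqrt(3.0)
    return np.stack([o1, o2, o3], axis=-1)

# SIFT descriptors are then computed on each of the three channels and
# concatenated, giving a 3 x 128 = 384-dimensional OpponentSIFT descriptor.
```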


The high dimensionality of SIFT, CSIFT and OpponentSIFT comes with a large amount of computation, which slows down the response time of image matching or retrieval. Many other descriptors have been introduced to improve the efficiency of feature extraction. One approach is to create lower-dimensional descriptors such as PCA-SIFT [77] with 36 dimensions, SURF (Speeded Up Robust Features) [78] with 64 dimensions, Colour Moment Invariants [43] with 24 dimensions or Colour Moments [43] with 30 dimensions. Another approach is to use binary codes to represent image features, such as Binary Robust Independent Elementary Features (BRIEF) [45], Oriented FAST and Rotated BRIEF (ORB) [46], Fast Retina Key-point (FREAK) [79] and Binary Robust Invariant Scalable Key-points (BRISK) [80]. One advantage of binary descriptors is that the similarity between descriptors can be calculated using the Hamming distance [81], which is very efficient compared with the usual L2 norm.

PCA-SIFT uses the Principal Component Analysis algorithm to convert the 128-dimensional SIFT vector into a new vector of 36 dimensions. The purpose of this algorithm is to minimise the information lost during the conversion. Although PCA-SIFT is not as distinctive as the original SIFT, its computational cost is greatly reduced [77].

SURF: the ideal Gaussian-derivative-based key-point detector used in SIFT is replaced by a Hessian-Laplace detector applied on integral images. The Gaussian pyramid structure is no longer employed, which speeds up the algorithm. Despite its simple representation with only 64 dimensions, SURF shows performances comparable to SIFT and PCA-SIFT.

Colour moments are measures that can be used to differentiate images based on their colour features. Once computed, these moments provide a measurement for colour similarity between images. They are based on generalised colour moments [43] and are 30-dimensional. Given a colour image represented by a function I with RGB triplets, for image position (x, y), the generalised colour moments are defined by Equation (2.10).

$$M_{pq}^{abc} = \iint x^p\, y^q\, [I_R(x, y)]^a\, [I_G(x, y)]^b\, [I_B(x, y)]^c \; dx\, dy \qquad (2.10)$$

$M_{pq}^{abc}$ is referred to as a generalised colour moment of order p + q and degree a + b + c. Only generalised colour moments up to the first order and the second degree are considered; thus the resulting invariants are functions of the generalised colour moments $M_{00}^{abc}$, $M_{10}^{abc}$ and $M_{01}^{abc}$, with

$$(a, b, c) \in \left\{ \begin{array}{ccc} (1,0,0), & (0,1,0), & (0,0,1) \\ (2,0,0), & (0,2,0), & (0,0,2) \\ (1,1,0), & (1,0,1), & (0,1,1) \end{array} \right\}.$$
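A discrete NumPy version of Equation (2.10) could look as follows; the function name and the assumption of an RGB image with values in [0, 1] are illustrative.

```python
import numpy as np

def generalised_colour_moment(rgb, p, q, a, b, c):
    """Discrete version of Equation (2.10) for an RGB image in [0, 1]."""
    height, width = rgb.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    red, green, blue = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.sum(xs ** p * ys ** q * red ** a * green ** b * blue ** c)

# The 30-dimensional colour moment descriptor gathers the moments of order
# p + q <= 1 and degree a + b + c <= 2 listed in the set above.
```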


Colour Moment Invariants are computed with the algorithm proposed by Mindru et al. [43]. The authors use generalised colour moments to construct combined invariants to affine transformations of the coordinates and to contrast changes. There are 24 basis invariants involving generalised colour moments in all 3 colour bands.

HSOG [44] stands for Histogram of Second-Order Gradients. HSOG is inspired by the suggestion, from studies on human vision, that the neural image is a landscape or a surface whose geometric properties can be characterised by the local curvatures of differential geometry through second-order gradients. The implementation of HSOG is described as follows:

- Calculate a set of first order Oriented Gradient Maps (OGMs) for different quantised orientations.

- Compute the histograms of second order gradients for all OGMs.

- Concatenate the histograms found in the previous step to create the HSOG descriptor.
- Finally, apply the PCA algorithm to reduce the dimensionality of HSOG to 128.

Experiments on different applications (descriptor matching, object categorisation and scene classification) indicate that the HSOG descriptor has a good discriminative power to distinguish different visual contents.

BRIEF was introduced in [45]; it is a binary descriptor where each element is either 0 or 1. Given an image patch p of size S × S, the test τ on p is defined as:

$$\tau(p; x, y) := \begin{cases} 1 & \text{if } p(x) < p(y) \\ 0 & \text{otherwise} \end{cases} \qquad (2.11)$$

where p(x) is the pixel intensity in a smoothed version of p at $x = (u, v)^\top$. The number $n_d$ of location pairs used in the test determines the dimension of the BRIEF descriptor, which is built as the $n_d$-dimensional bit string:

$$f_{n_d}(p) := \sum_{1 \le i \le n_d} 2^{i-1}\, \tau(p; x_i, y_i) \qquad (2.12)$$

Taking advantage of its binary representation, the BRIEF descriptor saves a lot of storage as well as computation time.

BRIEF also suffers from some drawbacks, as it is not designed to be scale and rotation invariant, so in general the performance of BRIEF is not as good as SIFT and SURF. However, in some specific cases where scale and orientation are not taken into account, BRIEF returns similar or even better scores than SIFT and SURF in a very short time.
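The sketch below illustrates the tests of Equations (2.11)-(2.12) in NumPy; unlike the original BRIEF, it draws the test locations uniformly (the paper uses a Gaussian sampling pattern), and it assumes the input is an already-smoothed grayscale patch.

```python
import numpy as np

PATCH_SIZE = 31
N_D = 256
rng = np.random.default_rng(0)
# Fixed random test pattern, drawn once and reused for every patch.
TEST_X = rng.integers(0, PATCH_SIZE, size=(N_D, 2))
TEST_Y = rng.integers(0, PATCH_SIZE, size=(N_D, 2))

def brief_descriptor(patch):
    """n_d binary intensity tests on a smoothed patch (Equations 2.11-2.12)."""
    bits = patch[TEST_X[:, 1], TEST_X[:, 0]] < patch[TEST_Y[:, 1], TEST_Y[:, 0]]
    return bits.astype(np.uint8)

def hamming_distance(d1, d2):
    """Binary descriptors are compared with the Hamming distance."""
    return int(np.count_nonzero(d1 != d2))
```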


ORB [46] is the combination of the BRIEF descriptor with a FAST key-point detector. Ethan Rublee et al. [46] first use the FAST detector to find potential key-points in the image and then use the Harris corner measure [38] to rank them, keeping only the top key-points. Because the FAST detector does not have a multi-scale property, the authors of ORB create a scale pyramid of the image and apply the FAST key-point detector at every level of the pyramid.

To make the descriptor invariant to rotation, the authors of [46] assign an orientation to each key-point using intensity centroids [82] and use that orientation to steer the BRIEF descriptor. Compared to the popular SIFT descriptor, ORB is more robust to image noise and much faster, so it can be applied in real-time applications.

BRISK shares the same idea as the ORB descriptor: it assigns an orientation to the BRIEF-like pattern, but with a different method. For each key-point, BRISK selects a set A of sampling-point pairs to create the binary descriptor:

$$\mathcal{A} = \left\{ (p_i, p_j) \in \mathbb{R}^2 \times \mathbb{R}^2 \mid i < N \wedge j < i \wedge i, j \in \mathbb{N} \right\} \qquad (2.13)$$

The authors of [80] also define a subset S of short-distance pairings and a subset L of long-distance pairings, and use the pairs in the long-distance set to estimate the overall characteristic direction of the key-point:

$$S = \{ (p_i, p_j) \in \mathcal{A} \mid \|p_j - p_i\| < \delta_{max} \} \subseteq \mathcal{A} \qquad (2.14)$$

$$L = \{ (p_i, p_j) \in \mathcal{A} \mid \|p_j - p_i\| > \delta_{min} \} \subseteq \mathcal{A} \qquad (2.15)$$

where $\delta_{max}$ and $\delta_{min}$ are two chosen thresholds. The overall characteristic direction of the key-point is calculated as follows:

$$g = \frac{1}{L} \sum_{(p_i, p_j) \in L} (p_j - p_i) \cdot \frac{I(p_j, \sigma_j) - I(p_i, \sigma_i)}{\|p_j - p_i\|^2} \qquad (2.16)$$

where $I(p_j, \sigma_j)$ and $I(p_i, \sigma_i)$ are the smoothed intensity values at $p_j$ and $p_i$ respectively.

In comparison to SIFT and SURF, BRISK is said to be dramatically faster while returning a comparable matching performance [80].

FREAK was presented by Alahi et al. [79]. Different binary descriptors have different methods for picking the pairs of points used for comparison: BRIEF and ORB use random pairs within a circular pattern around the key-points, whereas FREAK, inspired by the structure of the retina, proposes to use a retinal sampling grid which is also circular but has a higher density of points near the centre. FREAK uses different kernel sizes to smooth each sample point in the image; the kernel sizes change exponentially to match the retina model. The experiments in [79] show that this design increases the performance of the binary descriptor.

OSRI was introduced by Xianwei Xu et al. [83]. Instead of using pixel intensity comparisons to build the descriptors, as done in BRISK, ORB and FREAK, OSRI uses difference tests of regional invariants over pairwise sampling regions. OSRI also deals with rotation and illumination changes by ordering pixels according to their intensities and gradient orientations. The experiments in [83] show that OSRI significantly outperforms two state-of-the-art binary descriptors: FREAK and ORB.

Many descriptor extraction methods have been introduced in the literature, each with its own advantages and disadvantages. It is difficult to find the perfect descriptor for all applications; many factors may affect the selection of the proper descriptor, such as the objective of the system, the characteristics of the image database or real-time response requirements. In our framework, we use five descriptors, chosen mostly for their popularity: SIFT, Colour Moments, Colour Moment Invariants, Opponent SIFT and SURF.

2.1.3 Image representation

We present in this section the methods for representing an image using the BoVW model. First, the idea of the BoVW model is introduced, followed by its implementation, which consists of visual vocabulary construction and BoVW representation.

2.1.3.1 The idea of BoVW

The BoVW model was inspired by the Bag of Words (BoW) model in text retrieval, and many similarities can be found between them. In the BoW model, each document is represented by its important and prominent key-words, and documents are distinguished from each other by the frequency of appearance of these key-words. The same idea is applied in the BoVW model, where each image is represented by a set of local image features, or visual words. The BoVW model uses the appearance of those visual words in the images to represent the images.

The BoVW model involves the following steps:

First, local feature detection is used to detect the key-points in the image. The image is then considered as a set of features. Images with different objects contain different sets of local features; conversely, images with the same objects or scenes should contain similar ones. Figure 2.8 illustrates this process: the images of a car, a tree and a house are represented by different local features.


Figure 2.8: Images are represented by local features

Figure 2.9: Vocabulary construction step

The next step of the BoVW model is to determine a set of distinctive features called the visual vocabulary, and to use this vocabulary to represent every feature found in the images. Figure 2.9 demonstrates the vocabulary construction step: a clustering algorithm is used to divide the feature space into 4 groups, and the centres of those groups are used to build the visual vocabulary. Once the vocabulary is defined, each local feature is assigned to the most similar visual word in the vocabulary and images are represented by histograms of visual words. Each bin of the histogram is the frequency of appearance of one word in the image, as illustrated in Figure 2.10. Images containing similar objects with similar backgrounds should have similar histograms, since they share similar details or local features, and vice versa. By comparing the histograms, we can tell the difference between images. The implementation of the BoVW model consists of vocabulary construction and image representation.
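The assignment-and-counting step can be sketched as follows in NumPy; the hard assignment and the L1 normalisation are common choices, not necessarily the exact settings used later in this thesis.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Hard-assign each local descriptor to its nearest visual word and count.

    descriptors: (n, d) array of local descriptors of one image.
    vocabulary:  (k, d) array of visual words (cluster centres).
    """
    # Pairwise Euclidean distances between descriptors and visual words.
    distances = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = distances.argmin(axis=1)                  # index of the closest word
    histogram = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return histogram / max(histogram.sum(), 1.0)      # L1-normalised histogram
```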

2.1.3.2 Visual Vocabulary Construction

The purpose of the visual vocabulary construction step is to define a set of distinctive features and to use this set to represent all the images in the collection.


Figure 2.10: Images are represented by histograms of visual words

Figure 2.11: Vocabulary construction using a clustering method [4]

By doing that, all the features in the images are quantised into a common base and they become simpler to compare.

We use a training set of images to build the visual vocabulary. We extract all the features of the images in the training set and use a clustering algorithm to divide those features into groups; the centroid of each group is used as a visual word in the final vocabulary, as illustrated in Figure 2.11. This figure presents the feature space where each green dot represents a single local feature; the clustering algorithm groups the feature space into regions separated by the blue lines, the centroid of each region is represented by a red star, and those centroids are used to create the visual vocabulary.

There are two main methods that have been used for visual vocabulary construction: k-means [36,84,85] and Gaussian Mixture Model (GMM) [47,48].


Figure 2.12: k-means algorithm

k-means algorithm: given a set of local features $x_1, x_2, \dots, x_n$ extracted from the training set, each local feature being a d-dimensional vector, the purpose of the k-means algorithm is to partition those n features into k sets $S = \{S_1, S_2, \dots, S_k\}$ so as to minimise the within-cluster sum of squares. In other words, its objective is to find:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2$$

where $\mu_i$ is the mean of the points in $S_i$, also known as the gravity centre of cluster $S_i$.

The implementation of the k-means algorithm is illustrated in Figure 2.12. First, we initialise k clusters, each represented by a centre. An iterative process is then applied to adjust those clusters. At each step, we assign each of the features $x_1, x_2, \dots, x_n$ to one of the k clusters based on the minimum distance, and then recompute the mean of each cluster from the features that belong to it. At this step, a feature may move from one cluster to another. This process is repeated until no more features move.
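As an illustration, the vocabulary construction step can be written with scikit-learn's MiniBatchKMeans (assumed to be available); the vocabulary size k is a placeholder.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(all_descriptors, k=1000, seed=0):
    """Cluster the training descriptors and keep the k centroids as visual words.

    all_descriptors: (N, d) array stacking the local descriptors of the
    training images (e.g. 128-dimensional SIFT vectors).
    """
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed)
    kmeans.fit(all_descriptors)
    return kmeans.cluster_centers_      # (k, d) visual vocabulary
```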

In [84], Nister et al. used a different version of k-means, named hierarchical k-means, to build the visual vocabulary. Hierarchical k-means is a faster version of k-means and it enables the creation of a visual vocabulary with over 1 million visual words using the SIFT descriptor. The drawback of this method is that it produces poorer clusters compared to normal k-means. Another variant is the approximate k-means method presented by Philbin et al. [85]. This method returns a clustering quality that is very close to the normal k-means algorithm with a computational time similar to the hierarchical k-means technique.

The Gaussian Mixture Model based method assumes that the local features are generated by a combination of several Gaussian distributions with different means and variances. Unlike the k-means method, where each feature is assigned to a single cluster, a GMM allows a feature to belong to several clusters with a probability function. GMM uses the Expectation-Maximization technique to determine the clusters. This technique repeatedly alternates two steps:

- Expectation step: the purpose of the expectation step is to calculate the probability that each feature belongs to each cluster. This step is similar to the feature assignment step in k-means; it is based on a distance metric.

- Maximization step: in this step, the means and covariances of each cluster are re-calculated based on the probabilities found in the expectation step. This is somewhat equivalent to the centroid re-computation step of the k-means algorithm.

In this section, two methods for building a visual vocabulary were presented. Both are based on a clustering algorithm, which is not well adapted to high-dimensional spaces [9]. In the next chapter, we propose a new method for visual vocabulary construction based on information gain.

2.1.3.3 Image representation

The purpose of image representation is to create the image's signature using the local features contained in the image and the visual vocabulary that has already been built in the previous step. Each local feature in the image is assigned to the most similar visual word in the visual vocabulary, and we then count how many times each visual word appears in the image to build its representation. For this purpose, plenty of solutions have been proposed. In this section, several methods for representing an image based on its local features are introduced.

Bag of Visual Words Approach

This approach [36, 86] was inspired by the bag of words used in the Information Retrieval field. Rather than directly using the appearance frequency of visual words in the image, this method applies a standard weighting scheme called term frequency - inverse document frequency (tf-idf), which is computed as described in Equation (2.17).

Suppose there is a vocabulary of k words; then each document is represented by a k-vector $V_d = (t_1, \dots, t_i, \dots, t_k)^\top$ of weighted word frequencies with components:

$$t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i} \qquad (2.17)$$

where $n_{id}$ is the number of occurrences of word i in document d, $n_d$ is the total number of words in document d, $n_i$ is the number of occurrences of term i in the data set and N is the number of documents in the data set.
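A NumPy sketch of this weighting is given below; it reads n_i as the number of images containing word i, which is one common interpretation of Equation (2.17).

```python
import numpy as np

def tfidf_signatures(histograms):
    """Apply the tf-idf weighting of Equation (2.17) to raw BoVW word counts.

    histograms: (N, k) array, one row of visual word counts per image.
    """
    n_d = histograms.sum(axis=1, keepdims=True)        # words per document
    n_i = np.count_nonzero(histograms, axis=0)          # documents containing word i
    n_images = histograms.shape[0]
    idf = np.log(n_images / np.maximum(n_i, 1))
    return (histograms / np.maximum(n_d, 1)) * idf
```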

Fisher Vector Approaches

One of the first research works using Fisher Vector was proposed by Perronnin and Dance [87] on visual vocabularies for image categorisation. They proposed to apply Fisher kernels to visual vocabularies represented by means of a GMM. In comparison to the BoVW representation, fewer visual words are required by this more sophisticated representation.

Vector of Locally Aggregated Descriptors (VLAD) has been introduced by Jégou et al. [49] and can be seen as a simplification of the Fisher kernel. Consider a visual vocabulary $C = \{c_1, \ldots, c_k\}$ of k visual words generated with the k-means algorithm. To represent an image that contains T local features $X = \{x_1, \ldots, x_T\}$, the idea of the VLAD descriptor is to accumulate, for each visual word $c_i$, the differences between $c_i$ and the local features assigned to it. The VLAD method performs the following steps:

- For each $x_t \in X$, find the closest $c_i \in C$ such that:

\[ NN(x_t) = \arg\min_{c_i \in C} \| x_t - c_i \| \tag{2.18} \]

- Compute the aggregation vector $v_i$ and normalise it to obtain the VLAD encoding of the image:

\[ v_i = \sum_{x_t : NN(x_t) = c_i} (x_t - c_i), \qquad v_i := v_i / \| v_i \|_2 \tag{2.19} \]

Figures 2.13, 2.14 and 2.15 [88] show an example of a VLAD vector. Figure 2.13 shows the feature vector space, where the black dots represent the local features of the image, the red dots $c_1, c_2, \ldots, c_5$ represent the visual words of the vocabulary, and the dashed lines connecting a local feature to a visual word represent the assignment of that feature to its closest visual word. Figure 2.14 shows the differences between the visual words of the vocabulary and all the features assigned to them. The VLAD vector accumulation is presented in Figure 2.15.
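The following minimal sketch illustrates Equations (2.18) and (2.19), assuming the vocabulary centroids and the local descriptors of one image are given; it is a plain NumPy illustration, not the original implementation of [49].

    # Minimal sketch of VLAD encoding (Equations 2.18 and 2.19); `centroids`
    # is a hypothetical (k x d) vocabulary and `features` a (T x d) array of
    # local descriptors of one image.
    import numpy as np

    def vlad_encode(features, centroids):
        k, d = centroids.shape
        # Hard assignment: nearest centroid for every feature (Eq. 2.18).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        nn = dists.argmin(axis=1)
        # Accumulate residuals per visual word (Eq. 2.19).
        v = np.zeros((k, d))
        for i in range(k):
            assigned = features[nn == i]
            if len(assigned):
                v[i] = (assigned - centroids[i]).sum(axis=0)
        # L2-normalise each block, then flatten into the final signature.
        norms = np.linalg.norm(v, axis=1, keepdims=True)
        v = v / np.maximum(norms, 1e-12)
        return v.reshape(-1)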


Figure 2.13: Assign the local features to the closest visual word

Figure 2.14: Calculate the difference between local features and visual words

Figure 2.15: VLAD vector presentation

Geometrical Representation Approach

All the methods presented above only take into consideration the appearance of local features, regardless of where they are located in the images. To improve the performance of the BoVW model, some researchers have proposed to add spatial information to the image representation. We can name here some very well known approaches such as the spatial pyramid representation [5], Bag of Bags of Visual Words (BBoV) [89] and Extended Bag of Features (EBoF) [50].

- Spatial Pyramid Representation [5]: the idea of this method is to divide the image into multiple sub-regions and calculate the signature of each sub-region. The final signature of the image is the concatenation of the signatures of all sub-regions, weighted differently according to their level in the spatial pyramid (a sketch of this weighting is given after this list). Figure 2.16 demonstrates this method.

Figure 2.16: Spatial Pyramid Representation [5]

The weight associated with level l of the pyramid is given by the following equation:

\[ w_l = \frac{1}{2^{L-l}} \tag{2.20} \]

where L denotes the total number of levels of the pyramid.

- Bag of Bags of Visual Words [89]: first, images are represented as connected graphs by segmenting them into partitions using the Normalised Cut method. The classical BoVW model is then applied to each individual sub-graph, and each sub-graph has its own histogram of visual words. The signature of the image is the concatenation of the histograms of all sub-graphs. By using several resolutions, which define the number of sub-graphs per image, the authors of [89] introduced a new approach named Irregular Pyramid Matching (IPM) for image representation.

- Extended Bag of Features [50]: in comparison to the classical BoVW, EBoF is claimed to be more robust to rotation, translation and scale changes thanks to its circular-correlation based algorithm. The EBoF model divides the image into fan-shaped sub-images and the BoVW model is applied to each of them. A 2D Gaussian weighting is then applied to suppress the contribution of visual words located far away from the centre. The histograms of all sub-images are combined to build the image representation.
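As announced in the first item of this list, the following sketch illustrates the level weighting of Equation (2.20) and the concatenation of sub-region histograms. The `histogram_of` helper is a hypothetical function returning the BoVW histogram of an image region, and the regular grid layout is a simplification of the scheme of [5].

    # Sketch of the spatial pyramid weighting (Equation 2.20).
    import numpy as np

    def spatial_pyramid_signature(image, histogram_of, levels=3):
        """Concatenate weighted sub-region histograms (weights from Eq. 2.20)."""
        parts = []
        L = levels - 1                            # index of the finest level
        height, width = image.shape[:2]
        for l in range(levels):
            w_l = 1.0 / (2 ** (L - l))            # Equation (2.20)
            n = 2 ** l                            # n x n grid at level l
            for r in range(n):
                for c in range(n):
                    region = image[r * height // n:(r + 1) * height // n,
                                   c * width // n:(c + 1) * width // n]
                    parts.append(w_l * histogram_of(region))
        return np.concatenate(parts)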

In this section, several methods for building image signatures have been introduced. Once images are represented by signatures, the matching of images uses only these signatures. Several techniques have been used in the literature; they are introduced in the next section.

2.1.4 Similarity matching

The purpose of similarity matching is to compare the signature of the query image with the signatures of all images in the collection, in order to find the images most similar to the query. The signatures are usually histograms of visual words. Once we have the signatures of the query image and of the images in the collection, it is straightforward to compute a histogram distance. If two images have a small histogram distance, they contain similar visual words with similar distributions and should therefore be similar to each other. The images of the collection with the smallest histogram distances are returned in a ranked list as the result of the retrieval process.

There are many distance metrics to calculate the histogram distance between two images. The choice of the best histogram distance metric depends on the specific image collection, the length of the visual vocabulary, etc. We introduce here some popular histogram distances used in the literature, such as the L1-norm, L2-norm, χ², Jaccard and Bhattacharyya distances [90]. Given two histograms p and q of n dimensions, $p = \{p_1, \ldots, p_n\}$ and $q = \{q_1, \ldots, q_n\}$, the distance between p and q can be calculated as follows (a short sketch of these distances is given after this enumeration):

- L1-norm distance:
\[ d_{l1}(p, q) = \sum_{i=1}^{n} |p_i - q_i| \tag{2.21} \]

- L2-norm distance:
\[ d_{l2}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{2.22} \]

- Chi-squared distance:
\[ d_{\chi^2}(p, q) = \frac{1}{2} \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{p_i + q_i} \tag{2.23} \]

- Bhattacharyya distance:
\[ d_{Bhattacharyya}(p, q) = 1 - \sum_{i=1}^{n} \sqrt{p_i q_i} \tag{2.24} \]

- Jaccard distance:
\[ d_{Jaccard}(p, q) = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}} \tag{2.25} \]

where:
$M_{11}$ denotes the total number of indexes i such that $p_i \neq 0$ and $q_i \neq 0$,
$M_{01}$ denotes the total number of indexes i such that $p_i = 0$ and $q_i \neq 0$,
$M_{10}$ denotes the total number of indexes i such that $p_i \neq 0$ and $q_i = 0$.

- In [91], Ofir Pele et al. introduced a new histogram distance family, the Quadratic-Chi (QC) distances, which are a combination of Quadratic-Form distances and a cross-bin χ²-like normalization. Consider two non-negative bounded histograms $P, Q \in [0, U]^N$ and a non-negative symmetric bounded bin-similarity matrix A such that each diagonal element is greater than or equal to every other element in its row, that is, $A \in [0, U]^{N \times N}$ and $\forall i, j \; A_{ii} \geq A_{ij}$, and let $0 \leq m < 1$ be the normalization factor. Experiments in [91] demonstrated that QC members outperform many state-of-the-art distances while having a short running time. The Quadratic-Chi distance is defined as:

\[ QC^A_m(P, Q) = \sqrt{ \sum_{ij} \left( \frac{P_i - Q_i}{\left( \sum_c (P_c + Q_c) A_{ci} \right)^m} \right) \left( \frac{P_j - Q_j}{\left( \sum_c (P_c + Q_c) A_{cj} \right)^m} \right) A_{ij} } \tag{2.26} \]

- Earth Mover's Distance [92] first converts the fixed-size histograms into variable-size signatures and then calculates the distance between those signatures. The distance between two signatures is the cost of transforming one signature into the other. A signature $P = \{(sp_1, w_1), \ldots, (sp_m, w_m)\}$ of a histogram $p = \{p_1, \ldots, p_n\}$ is a set of m clusters $(sp_j, w_j)$ ($m \leq n$), where $sp_j$ is the mean and $w_j$ the weight of cluster j. If a cluster j is mapped to a bin $p_i$ of histogram p, then $sp_j$ is the central value at $p_i$ and $w_j = p_i$.

Let us consider two signatures $P = \{(sp_1, w_1), \ldots, (sp_m, w_m)\}$ and $Q = \{(sq_1, w_1), \ldots, (sq_n, w_n)\}$, and let $D = [d_{ij}]$ be the ground distance matrix, where $d_{ij}$ is the distance between $sp_i$ and $sq_j$. We have to find a flow $F = [f_{ij}]$ that minimises the cost of transformation $T(P, Q)$:

\[ T(P, Q) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij} \tag{2.27} \]

subject to the following constraints:

\[ f_{ij} \geq 0 \qquad 1 \leq i \leq m, \; 1 \leq j \leq n \tag{2.28} \]
\[ \sum_{j=1}^{n} f_{ij} \leq w_i \qquad 1 \leq i \leq m \tag{2.29} \]
\[ \sum_{i=1}^{m} f_{ij} \leq w_j \qquad 1 \leq j \leq n \tag{2.30} \]
\[ \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\left( \sum_{i=1}^{m} w_i, \; \sum_{j=1}^{n} w_j \right) \tag{2.31} \]

Once the flow F is determined, the Earth Mover's Distance between P and Q is calculated as:

\[ d_{EMD}(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}} \tag{2.32} \]

Earth Mover's Distance usually gives a high retrieval precision, but it suffers from a huge computational cost. Thus, we will not use it in our experiments.
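As a rough illustration of Equations (2.21) to (2.26), the following sketch implements the classic histogram distances and the Quadratic-Chi distance for non-negative histograms of equal length. The Earth Mover's Distance is omitted since it requires a linear-programming solver and is not used in our experiments; zero denominators are treated as zero, following the convention of [91], and all function names are ours.

    # Sketch of the histogram distances of Equations (2.21)-(2.26); p, q and
    # P, Q are assumed to be same-length, non-negative numpy arrays, and A a
    # symmetric bin-similarity matrix.
    import numpy as np

    def l1_distance(p, q):
        return np.abs(p - q).sum()                        # Eq. (2.21)

    def l2_distance(p, q):
        return np.sqrt(((p - q) ** 2).sum())              # Eq. (2.22)

    def chi2_distance(p, q):
        denom = p + q
        # Bins where both histograms are empty contribute 0.
        terms = np.where(denom > 0, (p - q) ** 2 / np.where(denom > 0, denom, 1), 0.0)
        return 0.5 * terms.sum()                          # Eq. (2.23)

    def bhattacharyya_distance(p, q):
        return 1.0 - np.sqrt(p * q).sum()                 # Eq. (2.24)

    def jaccard_distance(p, q):
        m11 = np.sum((p != 0) & (q != 0))
        m01 = np.sum((p == 0) & (q != 0))
        m10 = np.sum((p != 0) & (q == 0))
        return (m01 + m10) / max(m01 + m10 + m11, 1)      # Eq. (2.25)

    def quadratic_chi_distance(P, Q, A, m=0.9):
        Z = ((P + Q) @ A) ** m                            # (sum_c (P_c+Q_c) A_ci)^m
        D = np.where(Z > 0, (P - Q) / np.where(Z > 0, Z, 1), 0.0)
        return np.sqrt(max(float(D @ A @ D), 0.0))        # Eq. (2.26)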

To improve the performance of similarity matching, different approaches have been introduced in the literature, such as re-ranking and query expansion techniques.

Re-ranking technique: the similarity matching returns the result as a list of images similar to the query one, ranked according to the chosen histogram distance to the query image. The idea of re-ranking techniques is to use additional indicators or tests to refine the results obtained by the similarity matching step. Two examples of re-ranking techniques used to enhance the performance of the BoVW model are geometrical re-ranking [51] and spatial re-ranking [20]. Jégou et al. [51] first use similarity matching to obtain a short-list of images similar to the query image and then match each descriptor of the query image with its 10 closest descriptors in every image of the short-list. An affine 2D transformation estimation is then used as an additional indicator to refine the short-list: the images that pass the geometrical estimation filter are moved to the first positions of the list and ranked with a score based on the number of inliers. The spatial re-ranking method of [20] shares the same idea of using spatial constraints to refine the results of similarity matching: a transformation between the query region and each target image is estimated, each image is scored based on how well its feature locations are predicted by the estimated transformation, and the discriminability of the spatially verified visual words is then used to re-rank the result list.
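The following sketch illustrates the inlier-counting idea behind geometrical re-ranking; it uses OpenCV's RANSAC homography estimation as a stand-in for the transformation models estimated in [51] and [20], and assumes the matched keypoint coordinates between the query and each candidate are already available.

    # Illustrative sketch of geometric re-ranking by inlier counting (not the
    # exact procedure of [51] or [20]); query_pts and candidate_pts are
    # (n x 2) arrays of matched keypoint locations.
    import cv2
    import numpy as np

    def inlier_count(query_pts, candidate_pts):
        if len(query_pts) < 4:
            return 0
        H, mask = cv2.findHomography(np.float32(query_pts),
                                     np.float32(candidate_pts),
                                     cv2.RANSAC, 5.0)
        return int(mask.sum()) if mask is not None else 0

    # Re-rank a short-list by the number of spatially verified matches:
    # short_list.sort(key=lambda img: inlier_count(q_pts[img], c_pts[img]), reverse=True)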

Query expansion technique: the idea of query expansion techniques is to use the highly ranked images of the results as new query images. By doing so, we can find new relevant images that were not returned by the initial similarity matching. A drawback of this technique is that it may return incorrect results if the expanded query image is not relevant. Many researchers have applied query expansion techniques to their BoVW systems [51–53].
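A minimal sketch of a simple average query expansion is given below; it is one common variant and not necessarily the exact scheme used in [51–53]. The `search` function, returning a ranked list of image identifiers for a given signature, and the `signatures` container are assumptions.

    # Sketch of average query expansion: the query signature is averaged with
    # the signatures of its top-ranked results and the search is re-run.
    import numpy as np

    def expanded_search(query_sig, signatures, search, top_k=5):
        first_pass = search(query_sig)
        expanded = np.mean([query_sig] + [signatures[i] for i in first_pass[:top_k]],
                           axis=0)
        return search(expanded)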

2.2 From BoVW to BoVP

One disadvantage the BoVW model suffers from is the ambiguity of visual words. This ambiguity has two aspects:

- Synonymy: Some visual words are synonymous when they share the same semantic meanings. The consequence of synonymy is over-representation.

- Polysemy: Polysemy means that one single visual word may represent different things, different details in the image. Polysemy leads to under-representation.

One solution to deal with the polysemy of the BoVW model is to link visual words together to make them less ambiguous. This idea is also inspired by text retrieval. For example, consider the following words: University, Poitiers, image, processing. Obviously, if we link those words together to obtain "University Poitiers" and "image processing", we have clearer meanings than each single word alone. The model that links visual words together is called the Bag of Visual Phrases (BoVP) model. This section introduces some BoVP models from the literature.

Sivic et al. [63] have introduced doublets of visual words, which are combinations of two visual words located close to each other. For each local feature, the method of [63] finds the 5 nearest features in the neighbouring region around it to form 5 doublets. The size of the doublet vocabulary depends on the size of the initial visual vocabulary.
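The following sketch builds doublets of visual words in the spirit of [63], although it is not their exact implementation; the keypoint coordinates and the visual word assigned to each feature are assumed to be given.

    # Sketch of visual doublet construction: pair each feature's visual word
    # with the words of its spatial nearest neighbours.
    import numpy as np
    from scipy.spatial import cKDTree

    def build_doublets(keypoints, words, n_neighbors=5):
        """keypoints: (n x 2) feature locations; words: visual word id per feature."""
        if len(keypoints) < 2:
            return []
        k = min(n_neighbors + 1, len(keypoints))   # +1: the nearest point is itself
        tree = cKDTree(keypoints)
        _, idx = tree.query(keypoints, k=k)
        doublets = []
        for i, neighbours in enumerate(idx):
            for j in neighbours[1:]:
                # A doublet is an unordered pair of visual word ids.
                doublets.append(tuple(sorted((int(words[i]), int(words[j])))))
        return doublets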

The method in [62] also proposes to create visual phrases of 2 visual words. Images are represented as histograms of pairs of visual words. Visual pairs are the combination of the local feature at a key-point with the other features within a region whose size is proportional to the scale at which that key-point is detected. The neighbouring region is the circular area whose radius equals the minor axis plus the major axis of the elliptical region of the interest point detected by the Hessian affine detector [93]:

\[ R = l_1 + l_2 \tag{2.33} \]

This method also extends the neighbourhood spatial information from the local level towards the global level by expanding the neighbouring region used to define the visual pairs by a factor n:

\[ R_n = n \cdot R \tag{2.34} \]

Experiments with different values of n demonstrate that different categories of images yield different performances depending on the size of the neighbourhood region.

In [64], L. C. Zitnick et al. create visual phrases of 3 visual words called triplet features. First, local features are detected using DoG feature detection [2]; each local feature k is assigned a pixel location $p_k$, a scale $\sigma_k$ and a rotation $\theta_k$. The method in [64] then searches for all sets of three features that satisfy space and scale constraints: in scale, the ratio between the feature scales must be greater than 0.5 and smaller than 2.0, and in space, the distance between the features of a phrase must be less than $8 \times \sigma_k$.

Unlike the three methods above, which build the visual phrase vocabularies from the vocabularies of visual words, the method proposed in [65] creates a vocabulary of visual phrases directly. In this method, each pair of spatially close features is concatenated and considered as a data point in the joint feature space. A clustering algorithm is then applied to these data to create the visual phrase vocabulary, called the local pairwise codebook. By doing so, the size of the visual phrase vocabulary can be managed directly by the clustering algorithm.
