Information-theoretic analysis of identification systems in large-scale databases

Thesis

Reference

Information-theoretic analysis of identification systems in large-scale databases

FARHADZADEH, Farzad

Abstract

This thesis is concerned with the theory and applications of an identification problem that arises in various multimedia management and security applications. In many of these applications, data under analysis might be severely distorted. Consequently, an important issue of content identification systems is their ability to deal with distorted data. To address this issue, we introduce a new identification setup by using a fixed maximum list size decoder.

In order to solve search and memory complexity issues in content identification with large-scale databases, we analyze a simple digital fingerprinting approach based on random projections. To address the search and memory complexity trade-off in identification systems, we introduce a decoding scheme capable of achieving the identification capacity. We introduce a database organization, based on assigning entries of a database to a set of overlapping clusters. We introduce a new framework called active content fingerprinting, which takes the best of content fingerprinting and digital watermarking to overcome some of the fundamental restrictions of these techniques in [...]

FARHADZADEH, Farzad. Information-theoretic analysis of identification systems in large-scale databases. Thèse de doctorat : Univ. Genève, 2014, no. Sc. 4635

URN : urn:nbn:ch:unige-343009

DOI : 10.13097/archive-ouverte/unige:34300

Available at:

http://archive-ouverte.unige.ch/unige:34300

Disclaimer: layout of this document may differ from the published version.

UNIVERSITÉ DE GENÈVE FACULTÉ DES SCIENCES

Département d'Informatique Professeur S. Voloshynovskiy

Information-theoretic analysis of identification systems in large-scale databases

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in Computer Science

by

Farzad FARHADZADEH

from Yazd (Iran)

Thesis no. 4635

GENÈVE

Repro-Mail - Université de Genève

2014

To my dear Neda

Acknowledgement

It would not have been possible to write this Thesis without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here.

Above all, I would like to thank my wife Neda for her personal support and great patience at all times. Her tolerance for my occasional vulgar moods is a testament in itself to her unyielding devotion and love.

This Thesis would not have been possible without the help, support and patience of my supervisor, Professor Sviatoslav Voloshynovskiy. His guidance helped me throughout the research and writing of this Thesis.

My sincere thanks also go to Professor Frans M.J. Willems, for giving me the opportunity to visit his group and for guiding my work on diverse exciting projects. He has been helpful in providing advice many times during my career.

I will forever be thankful to my former research advisor, Professor Hamidreza Amindavar.

Special thanks to my jury committee, Professor Mauro Barni, Dr. Teddy Furon, Dr. Ton Kalker, Professor Frans M.J. Willems, and Professor Stéphane Marchand-Maillet for their support, guidance and helpful suggestions. Their guidance has served me well and I owe them my heartfelt appreciation.

I thank my fellow groupmates in the Stochastic Information Processing group, Fokko Beekhof, Maurits Diephuis, Taras Holotyak and Oleksiy Koval, for the stimulating discussions, for the sleepless nights working together before deadlines, and for all the enjoyment. Also I thank Professor Thierry Pun and my labmates in the Computer Vision Multimedia Laboratory, Edgar Roman-Rangel, April Morton, Li Weng, Theodoris Kostoulas, Hisham Mohamed, Juan Diego Gomez, and Guillaume Chanel for their constructive comments during the CVML meetings and for the great memories, playing foosball and drinking on Friday evenings. In particular, I am grateful to Ke Sun, who is not only a great mathematician and ping-pong player but also has been kind enough to supply endless LaTeX support.

April, many thanks for the corrections! Christophe Gusthiot, many thanks for the corrections!

support and for all the fun we have had in the last years.

Last but not least, I would like to thank my parents, brother and sister who have given me their unequivocal support throughout, as always, for which my mere expression of thanks likewise does not suffice.

For any errors or inadequacies that may remain in this work, of course, the responsibility is entirely my own.

Abstract

This thesis is concerned with the theory and applications of a content identification problem that arises in various multimedia management applications (broadcast monitoring, tracing and tracking, copy detection) and security applications (biometric person identification, anti-counterfeiting, document authentication). Due to an exponential growth of interest in such applications along with significant progress during recent decades in redistribution channels (radio, TV, Internet), accurate yet computationally efficient content identification tools are very much in demand.

In many of the aforementioned applications, data under analysis might be severely distorted due to the habitual chain of processing, transcoding, communication and storage. Consequently, an important issue of content identification systems is their ability to deal with highly distorted data. To address this issue, we introduce a new identification setup by using a fixed maximum list size decoder, based on an order statistics list decoder. We analyze the proposed list decoding versus unique decoding in the identification problem. To achieve this goal, theoretical upper bounds for probabilities of false and incorrect identification are evaluated.

In order to solve search and memory complexity issues in content identification with large-scale databases, we analyze a simple digital fingerprinting approach based on random projections. The idea behind exploiting random projections is threefold. First, randomized mappers are well-known for their valuable distance preservation properties. Secondly, as we show, using appropriate random projections not only helps us reduce the dimensionality of data but can also eliminate any correlation among data samples asymptotically. Furthermore, contrary to conventional orthogonal mappers such as the Karhunen-Loève transform, random projection keeps data stationary in the random subspace domain.

To address the search and memory complexity tradeoff in identification systems, we introduce a generalized decoding scheme capable of achieving the identification capacity. In this scheme, we introduce a special database organization, based on assigning entries of a database to a set of possibly overlapping clusters. The cluster centroids are generated according to statistics of both entries of the database and queries. The proposed scheme not only generalizes several practical searching algorithms in identification systems but also makes it possible to approach a new achievable region of search-memory complexity tradeoff.

We introduce a new framework called `active content fingerprinting', which takes the best of the two worlds of content fingerprinting and digital watermarking to overcome some of the fundamental restrictions of these techniques in terms of performance and complexity. We consider several encoding and modulation strategies, examine the performance of the proposed schemes in terms of bit error rate and probability of correct identification, and compare it with that of conventional fingerprinting and digital watermarking.

Summary

This thesis addresses the theory and applications of a content identification problem that arises in many multimedia management applications (broadcast monitoring, tracing and tracking, copy detection) and security applications (biometric person identification, anti-counterfeiting, document authentication). Owing to exponentially growing interest in these applications, together with the significant progress made in recent decades in redistribution channels (radio, television, Internet), accurate and efficient content identification tools are currently in high demand.

In most of the aforementioned applications, the data under analysis may be severely distorted by the usual chain of processing, transcoding, communication and storage. Consequently, a major issue for identification systems is their ability to handle highly distorted data. To address this issue, we introduce a new identification setup using a fixed maximum list size decoder, based on an order statistics list decoder. We analyze the proposed list decoding against unique decoding in the identification problem. To this end, theoretical upper bounds on the probabilities of false and incorrect identification are evaluated.

To solve the search and memory complexity problems in identification with large-scale databases, we analyze a simple digital fingerprinting approach based on random projections. The benefit of exploiting random projections is threefold. First, random projections are well known for their distance preservation properties. Secondly, as we show, using appropriate random projections not only helps reduce the dimensionality of the data but can also asymptotically eliminate any correlation among data samples. Furthermore, unlike conventional orthogonal projections such as the Karhunen-Loève transform, random projection keeps the data stationary in the random subspace domain.

To address the tradeoff between search and memory complexity in identification systems, we introduce a generalized decoding scheme capable of achieving the identification capacity. In this scheme, we introduce a special database organization, based on assigning database entries to a set of possibly overlapping clusters. The cluster centroids are generated according to the statistics of both the database entries and the queries. The proposed scheme not only generalizes several practical search algorithms in identification systems but also makes it possible to reach a new achievable region of the search-memory complexity tradeoff.

We introduce a new framework called `active content fingerprinting', which takes the best of the two worlds of content fingerprinting and digital watermarking to overcome some of the fundamental restrictions of these techniques in terms of performance and complexity. We consider several encoding and modulation strategies, examine the performance of the proposed schemes in terms of bit error rate and probability of correct identification, and compare it with that of conventional fingerprinting and digital watermarking.

1 Introduction

Multimedia consumption via the Internet has increased radically over the last few years. The Internet has also emerged as an important medium for distribution of multimedia content such as video and audio products. Video streaming services are available from such providers as Netflix, Blockbuster, Hulu, and Amazon. Services such as Google TV and Apple TV are gaining momentum. Fueling this trend is the technological improvement in the bandwidth of network connections, and the growing popularity of User Generated Content (UGC) websites, such as YouTube, whose new offerings have changed the expectations of both content providers and consumers with regard to the Internet.

On the other hand, this ease of access has brought significant challenges to intellectual property protection, as the improved technology has made it easier to redistribute copyrighted multimedia content to a large number of users. The popularity of UGC websites has also raised concerns about the posting of copyrighted content by users. The movie industry has recently estimated that piracy and illicit redistribution have caused over $6 billion in lost revenue annually (see e.g. McBride and Fowler 2006). Of further concern is the fact that most of the Internet multimedia files are unlabeled and provided by different users. Therefore, there is a great interest in the development of systems that will allow flexible management of these collections such as content-based retrieval, content filtering and automatic tagging (see e.g. Hua and Tian 2009, Kalantidis et al. 2009). Moreover, some multimedia security applications require multimedia copyright protection, content origin identification, content tracking and broadcast monitoring (see e.g. Haitsma et al. 2000, Lu and Hsu 2005).

Similar problems also exist in the physical world when there is need for positive, reliable identification of people or physical objects based on unique features or characteristics. In the identification of human beings such a system will involve biometric data (fingerprint, iris, etc.) which must be handled with special care to satisfy privacy requirements (Ignatenko and Willems 2009, Kalker et al. 2010). In the case of physical objects, reliable identification requires specific unclonable characteristics, which can be acquired but cannot be duplicated or reproduced with sufficient precision (Tuyls et al. 2007).

Finally, numerous genetics and proteomics applications require either accurate identification of DNA sequences, proteins or peptides or the detection of certain post-translational modifications which are considered to be deviations from the baseline templates (see e.g. Halperin et al. 2003, Seo and Lee 2004). Currently, any system that offers fast and accurate identification of these sequences also has some side effects that are of growing concern to the search community: large-scale databases, noise modifications and distortions, to name a few.

Common to all these problems, despite their different domains and origins, is the necessity to find the best matches to a given query according to a certain defined measure of similarity. The result of the search is given in the form of either the best unique match or a list of matches. Additionally, the list size may vary or be fixed to a certain value considered to be feasible for further manual processing. Identification with a unique match or a list of matches can be considered as a `nearest neighbor(s)' search problem.

In principle, an identification system can perform an exhaustive search on all database entries to find the best matches. However, this is not practical in modern applications where the size of databases can be in the billions. Several multidimensional indexing methods, such as the popular KD-tree (Friedman et al. 1977) or branch-and-bound techniques, have been proposed to reduce search complexity. Although these access methods generally work well for low-dimensional spaces, their performance degrades as the number of dimensions increases, a phenomenon known as the curse of dimensionality (see e.g. Weber et al. 1998).

The current state-of-the-art techniques overcome this issue by performing approximate matching (see e.g. Datar et al. 2004, Gionis et al. 1999, Muja and Lowe 2009). The key idea shared by these algorithms is to find the best matches only with high probability, close to $1-\epsilon$, where $\epsilon$ is a small positive value, instead of the exact match with probability 1.

One of the first techniques of approximate matching in Euclidean space is Euclidean Locality Sensitive Hashing (LSH) (Datar et al. 2004, Shakhnarovich et al. 2006b), which has been successfully used for image search based on local descriptors (Ke et al. 2004), 3D object indexing (Matei et al. 2006) and manually prefiltered proteomics data (Li et al. 2010). However, for real data, LSH can be outperformed by randomized KD-trees or hierarchical k-means tree (Muja and Lowe 2009).

Identification systems in general are facing not only an accuracy-complexity tradeoff, but serious practical problems in terms of memory storage as well. Only recently have researchers tried to design memory-limited identification systems. Memory usage is a key criterion in large-scale applications (see e.g. Nister and Stewenius 2006, Silpa-Anan and Hartley 2008), where millions to billions of images have to be indexed.

One final problem, though certainly not the least, is the evaluation of performance (identification accuracy) under the above requirements. When identification systems are intended for large-scale applications, it is not always sufficient to validate system performance on small test databases, as is often done in most scientific publications, with rare exceptions (Torralba et al. 2008). Furthermore, it is infeasible for a small group of researchers to test such enormous billion-size applications by themselves. Therefore, the development of accurate information-theoretic models of these systems and corresponding methods for predicting the system's performance is of great practical importance.

Current state-of-the-art information-theoretic contributions to the identification problem can be classified in three groups: (a) investigation of theoretical performance limits; (b) investigation of a performance-complexity trade-off; (c) investigation of a performance-memory storage trade-off.

Information-theoretical performance limits of content identification have been investigated by Willems et al. (2003). An identification system usually consists of two main phases: enrollment and identification. In the enrollment phase, feature vectors representing digital contents, humans or physical objects are extracted and stored in a database. In the identification phase, a query, i.e., a noisy and degraded counterpart of an enrolled item, is presented for identification, which is accomplished by comparing it to the feature vectors stored in the database. Willems et al. (2003) investigated the capacity of an identification system $C_{id}$, which is defined as the maximum achievable exponential rate of the number of distinguishable objects in a database. They showed that approximately $2^{N R_{id}}$ objects can be distinguished from each other if, and only if, $R_{id} < C_{id}$ and $N$, the dimensionality of the feature space, becomes very large. They presented a single-letter characterization for the identification capacity $C_{id}$, which is equal to the mutual information between the outputs of the enrollment and identification channels. Crucial to obtaining this result is the fact that a set of enrolled feature vectors can be regarded as a random channel code.
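As a small numerical illustration of the capacity statement above, the following sketch evaluates the mutual information for one simple case. It assumes a binary model that is not taken from the text: a uniform binary source observed through two independent binary symmetric channels, with hypothetical crossover probabilities `p_enroll` and `p_ident`. Under that assumption the enrolled bit and the query bit are related by a single binary symmetric channel, so the capacity is one minus the binary entropy of the combined crossover probability.

```python
import math

def binary_entropy(p):
    """Binary entropy h2(p) in bits; h2(0) = h2(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def identification_capacity_bsc(p_enroll, p_ident):
    """C_id = I(X;Y) in bits per dimension for a uniform binary source
    seen through two independent binary symmetric channels (assumed model)."""
    # Effective crossover probability between enrolled bit X and query bit Y.
    p = p_enroll * (1.0 - p_ident) + (1.0 - p_enroll) * p_ident
    return 1.0 - binary_entropy(p)

def log2_distinguishable_items(n, rate):
    """log2 of the number of items, roughly 2^(n * rate), at rate R_id < C_id."""
    return n * rate

c_id = identification_capacity_bsc(0.0, 0.1)
print(round(c_id, 3))                          # prints 0.531
print(log2_distinguishable_items(256, 0.5))    # 128.0, i.e. about 2^128 items
```

In this toy setting, a noiseless enrollment and a 10% bit-flip rate at identification give about 0.53 bits per dimension, so any rate below that value is admissible as $N$ grows.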

Similarly, the detection-theoretic limits have been studied by Voloshynovskiy et al. (2007) under geometrical desynchronizing distortions. A further extension of this framework was proposed by Varna et al. (2008) for the case of finite-length fingerprinting and null hypothesis. The decision rule used is based on the minimum Hamming distance decoder with a fidelity constraint under a binary symmetric channel model. However, finding the best match in terms of minimum Hamming distance requires computing the Hamming distance between the query and every database entry, which is not feasible in practical schemes.

The second group of identification-theoretic methods addresses the performance-complexity tradeoff: a problem that has received little attention in current literature. Only recently, Willems (2009) published the first paper dedicated to this problem. Willems introduced a two-stage identification scheme to speed up the search process by means of data clustering in which the system, upon observing a query, first detects to which cluster the related item belongs, and then decides about the item itself.

Figure 1.1: Identification capacity, search and memory complexities tradeoff in identification setup.

Finally, the third group of methods addresses the tradeoff between identification rate and memory complexity (or storage rate). Westover and O'Sullivan (2008) considered this trade-off in the pattern recognition formulation, and Tuncel et al. (2004) analyzed it in the large-scale database management setup. They applied quantization during enrollment and considered the fundamental tradeoff between compression rate and reconstruction distortion. Later Tuncel (2009) considered the tradeoff between enrollment compression rate and identification rate as an extension of (Willems et al. 2003).

1.1 Scope of the Thesis

To address the above issues, we will consider data stored in a database as digital fingerprints. A digital fingerprint represents a short, robust and distinctive content description.

In summary, an efficient identification system should satisfy several important requirements. First, users should be able to reliably identify objects or individuals, i.e., find the most similar and related objects or individuals in a database with a low probability of error. Secondly, the decoding method should be as fast as possible. Finally, it should require the least possible amount of memory for both items and indexing structure. These three conditions (reliability, search complexity and memory complexity) require solving an information-theoretic problem that considers the following tradeoff: (a) achieving identification capacity, (b) minimizing search complexity and (c) minimizing memory complexity. This triple tradeoff (Figure 1.1) is still an open and emerging research problem.

For this reason, one of the objectives of this Thesis is to introduce an information-theoretic framework able to properly model, analyze and finally optimally trade off these requirements.

As mentioned, upon receiving a query, which can be a noisy and degraded version of an enrolled item, identification starts by finding the most similar database entry. However, in many identification applications, data can be severely distorted, so a unique decoder might not be able to handle the corrupted data reasonably, resulting in a high error rate. To tackle this, one can exploit a list decoder. List decoding, which can be considered a generalization of unique decoding, was first proposed by Elias (1955) in communication theory. The main feature of this type of decoding is that it produces a fixed-size list of the most likely candidates rather than a single one. The result of Elias (1955) was generalized by Forney Jr. (1968) to a variable list size. Using a Neyman-Pearson optimality criterion, it was demonstrated that the proposed decoder guarantees maximal Gallager's error exponents. In many identification problems, the final sink of information is a human being. This restriction makes variable list size decoding undesirable, due to the high variability of the list size: for very noisy environments the list might be exceedingly long.

As mentioned, these types of decoders have been used in a communication setup, where the decoder estimates the sent message from a fixed codebook. It is also believed that list decoding might bring additional benefits for identification systems that operate in very noisy environments. However, contrary to digital communications, in the identification setup the decoder should determine whether a given query is related to some elements of the database, and if so, which one. Therefore, just using a list decoder is not sufficient to ensure that the estimated indices are really related to the query, i.e., without restricting the probability of false acceptance. To generalize the list decoder to the identification setup, we must add an erasure option to the decoding rule, which means that the decision regions are not exhaustive.

In this Thesis, we improve the identification performance by introducing a setup that uses a fixed maximum list size decoder based on an Order Statistics List Decoder (OSLD) and analyze its performance versus unique decoding for the identification problem.
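The idea of a list decoder with a fixed maximum list size and an erasure option can be sketched as follows. This is only an illustration under stated assumptions, not the OSLD decision rule analyzed in Chapter 2: binary fingerprints, the Hamming distance as the ranking statistic, and a hypothetical acceptance `threshold` are all choices made here for the example.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))

def list_decode(query, database, max_list_size, threshold):
    """Return at most max_list_size indices, ordered by increasing distance.

    Only entries within `threshold` of the query are eligible, so an empty
    list plays the role of an erasure (the query is declared unrelated).
    """
    # Ordering the distances is the "order statistics" step of this sketch.
    scored = sorted((hamming(query, fp), idx) for idx, fp in enumerate(database))
    return [idx for dist, idx in scored[:max_list_size] if dist <= threshold]

database = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 1, 1, 1, 1, 1),
]
noisy_query = (1, 1, 1, 0, 0, 0, 0, 0)   # entry 1 with one bit flipped

print(list_decode(noisy_query, database, max_list_size=2, threshold=3))  # [1, 0]
print(list_decode(noisy_query, database, max_list_size=1, threshold=3))  # [1]
print(list_decode(noisy_query, database, max_list_size=2, threshold=0))  # []
```

Setting `max_list_size=1` recovers unique decoding, while the threshold bounds how far a returned candidate may be from the query, which is what limits false acceptance in this toy version.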

Identification systems usually deal with high-dimensional data, which might be highly correlated in spatial and time coordinates, as multimedia data are. There exist several approaches to resolve such problems, such as robust hashing and digital fingerprinting. The main idea behind digital fingerprinting approaches is to extract digital fingerprints of a lower dimensionality with a maximum possible entropy, i.e., in the binary case, bits of digital fingerprints should be independent and equally likely 0's and 1's. However, since real data usually are correlated, one of the principal tasks of a dimensionality reduction transform is to eliminate correlation between data samples. A mapper that possesses such properties is the Karhunen-Loève Transform (KLT) (see e.g. Jain 1989). However, the price that must be paid for this optimality is its data dependence and the necessity of updating the transform matrix for new entries. In order to allay this dependence, several approximations of the KLT were proposed such as the Discrete Cosine Transform (DCT) and Discrete Wavelet Transform (DWT) (see e.g. Jain 1989). The basis vectors of these transforms are fixed and independent of the statistics of their inputs. The basis vectors of DCT and DWT are optimized for locally correlated data. However, the main drawback of such fixed basis transforms consists in the public disclosure of the basis vectors, which is rarely acceptable for multimedia security applications (see e.g. Voloshynovskiy et al. 2010).

One solution to overcome this privacy/security shortcoming is a randomized mapper that can be designed based on Random Projections (RP) (see e.g. Fridrich 1999). The RP have been the object of much interest due to their ability for distance preservation (Johnson and Lindenstrauss 1984). Although the decorrelation property of orthogonal transforms is well-known (see e.g. Jain 1989), the RP are based on approximately orthogonal bases. Therefore, the statistics of projected data, i.e., the covariance matrix, are not well justified. On the other hand, prior knowledge of the statistics of extracted digital fingerprints is crucial for evaluation of the performance of Content Based Identification (CBI) systems.

In this Thesis, we evaluate statistics of data, which can be modelled by a Gauss-Markov process or equivalently a first-order autoregressive (AR(1)) Gaussian process, projected to a random subspace domain. To achieve this goal, we analyze convergence of the covariance matrix of the projected data to an identity matrix. The main idea behind the use of the Gauss-Markov model is that it is considered one of the simple yet powerful models that accurately represent the local correlations present in images (Jain 1989).
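The convergence described above can be observed numerically with a short sketch. It assumes a particular construction that is not specified in this chapter: a projection matrix with i.i.d. zero-mean Gaussian entries of variance 1/N, and a unit-variance AR(1) covariance with entries `rho**|i-j|`. The function names and parameter values are hypothetical. The code computes the covariance of the projected data exactly for one random draw of the projection and reports the size of its off-diagonal entries.

```python
import math
import random

def projected_covariance(n, l, rho, rng):
    """Covariance W^T K W of AR(1) data after projection onto l random directions.

    K[i][j] = rho**|i-j| is the unit-variance AR(1) covariance and W has
    i.i.d. Gaussian entries with variance 1/n (an assumed construction).
    """
    powers = [rho ** d for d in range(n)]
    sigma = 1.0 / math.sqrt(n)
    w = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(l)]
    # kw[a][i] = sum_j K[i][j] * w[a][j]
    kw = [[sum(powers[abs(i - j)] * col[j] for j in range(n)) for i in range(n)]
          for col in w]
    return [[sum(w[a][i] * kw[b][i] for i in range(n)) for b in range(l)]
            for a in range(l)]

def offdiag_rms(cov):
    """Root-mean-square of the off-diagonal entries of a square matrix."""
    l = len(cov)
    vals = [cov[a][b] for a in range(l) for b in range(l) if a != b]
    return math.sqrt(sum(v * v for v in vals) / len(vals))

rng = random.Random(0)
for n in (32, 256):
    cov = projected_covariance(n, l=6, rho=0.5, rng=rng)
    print(n, round(offdiag_rms(cov), 3))
```

Under these assumptions the off-diagonal entries have variance on the order of 1/N, so the residual correlation between projected coefficients shrinks as the input dimension grows while the diagonal stays close to one.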

In modern identification applications, the size of a database might be on the order of several billions. For example, the latest estimations show Picasa or Flickr to have about 3 billion images; a similar number of video clips are currently on YouTube and an even larger number of images in the Google Image Search database (see e.g. Hays and Efros 2007). Therefore, the demand is great for the theoretical investigation and development of practical methods of identification systems that can achieve the identification capacity.

In this Thesis, we perform an information-theoretic analysis of search and memory complexity in identification systems, and introduce a generalized search scheme capable of achieving the identification capacity. Considering a two-stage search scheme based on vector quantization and clustering, we derive the achievable region for the number of clusters and the cluster size for such a search scheme. Moreover, we derive the size of the clusters that should be decoded at the first stage of identification.

In conventional content fingerprinting, the fingerprint is extracted directly from the original content and does not require any content modification to preserve the original content quality and integrity. In this sense, it can be considered Passive Content Fingerprinting (pCFP). Another approach to content protection and identification is based on Digital Watermarking (DWM). These days, DWM is a well-studied domain (see e.g. Cox et al. 2002, Pérez-González et al. 2003). The essential difference between pCFP and DWM is that, in fingerprinting, a content owner only assigns some ID number to the content, while in digital watermarking one can mark every individual copy of the content by embedding a unique message or mark. DWM possesses two advantages over the pCFP: (a) each copy of a content can be marked independently and (b) there is no need for complex search procedures due to the usage of structured Error Correction Codes (ECC), as there is with the random fingerprint in pCFP. However, as mentioned, the pCFP does not require embedding of messages into the host data, which degrades content quality. This can be considered as an advantage of pCFP over DWM. Therefore, it is advantageous to investigate new strategies in content identification that would benefit from the strengths of DWM and pCFP.

Figure 1.2: Outline of the Thesis. Content identification (passive and active): list decoder (Chapter 2), digital fingerprint (Chapter 2), IT framework (Chapter 3), aCFP (Chapter 4).

In this Thesis, we introduce a new hybrid technique that combines pCFP and DWM to achieve a better tradeoff between performance and complexity. More particularly, we will address the performance of the proposed technique and investigate low-complexity identification strategies. We refer to this technique as Active Content Fingerprinting (aCFP). The aCFP essentially obeys the structure of pCFP with the only difference occurring at the enrollment phase, where both fingerprint and modified content are generated. We will extend the identification of standard pCFP to more elaborate strategies that benefit from the statistics of modulated contents.

1.2 Outline of the Thesis

This Thesis consists of the following steps toward the analysis and design of optimal identification systems (see Figure 1.2):

1.2.1 Fingerprint Statistics and Order Statistics List Decoder

Chapter 2 is dedicated to the performance analysis of content-based identification using binary fingerprints and OSLD. We formulate content-based identification as a multiple hypothesis test and develop analytical models of its performance in terms of probabilities of correct identification/miss and false acceptance for a class of statistical models, which captures the correlation between elements of either the content or its extracted features. Furthermore, in order to determine the impact of the block/codeword length on the identification accuracy, we analyze the exponents of these error probabilities. In addition, we develop a probabilistic model, justifying the accuracy of identification based on list decoding by evaluating the position of the queried entry on the output list. The obtained results make it possible to characterize the performance of traditional unique decoding, based on the maximum likelihood, for the situations when the decoder fails to produce the correct index. Finally, we validate our theoretical findings by applying the proposed fingerprinting and identification procedures to synthetic data and a real image database.

1.2.2 Identification Rate, Search and Memory Complexity Tradeoff

In Chapter 3, within an information-theoretic framework, we introduce different two-stage decoding schemes capable of achieving identification capacity to address search and memory complexities in large-scale identification systems. These two-stage decoding procedures are accomplished as follows. For a given query, at the first stage, a list of cluster indices is estimated. Then, at the second stage, refinement checks are performed on all members of the clusters to produce a unique index. This chapter presents the achievable quadruple rate region when, at the first stage, a list of cluster indices is decoded. Finally, we evaluate the proposed two-stage decoding using a real image database and some binary clustering methods.
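The two-stage procedure described above can be sketched in a few lines. The example assumes binary fingerprints, Hamming distance at both stages, and hand-made centroids and cluster assignments; none of these details, nor the names `two_stage_identify` and `list_size`, come from Chapter 3, and the way centroids are actually generated there is not reproduced here.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit sequences."""
    return sum(x != y for x, y in zip(a, b))

def two_stage_identify(query, centroids, clusters, database, list_size):
    """Stage 1: keep the list_size closest clusters. Stage 2: refine over members.

    Returns the estimated index and the number of entries actually compared.
    """
    ranked = sorted(range(len(centroids)),
                    key=lambda c: hamming(query, centroids[c]))
    candidates = set()
    for c in ranked[:list_size]:
        candidates.update(clusters[c])       # clusters may overlap
    best = min(candidates, key=lambda i: (hamming(query, database[i]), i))
    return best, len(candidates)

database = [
    (0, 0, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 0, 1, 1),
    (1, 1, 1, 1, 0, 0, 0, 0),
    (1, 1, 1, 1, 0, 0, 1, 1),
    (1, 1, 1, 1, 1, 1, 1, 1),
    (0, 0, 1, 1, 1, 1, 1, 1),
]
centroids = [
    (0, 0, 0, 0, 0, 0, 0, 1),
    (1, 1, 1, 1, 0, 0, 0, 1),
    (1, 0, 1, 1, 1, 1, 1, 1),
]
clusters = [[0, 1], [2, 3], [3, 4, 5]]       # entry 3 belongs to two clusters

query = (1, 1, 1, 0, 0, 0, 1, 1)             # entry 3 with one bit flipped
print(two_stage_identify(query, centroids, clusters, database, list_size=1))  # (3, 2)
```

Here the correct entry is found after comparing the query with three centroids and two database entries instead of all six entries. Increasing `list_size` trades more second-stage comparisons for a lower chance of missing the right cluster.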

1.2.3 Active Content Fingerprinting

Content fingerprinting and digital watermarking are techniques used for content protection and distribution monitoring and, more recently, for interaction with physical objects. Over the past few years, both techniques have been well studied and their shortcomings understood. In Chapter 4, we introduce a new framework called active content fingerprinting, which takes the best of the two worlds of content fingerprinting and digital watermarking in order to overcome some of the fundamental restrictions of these techniques in terms of performance and complexity. The proposed framework extends the encoding process of conventional content fingerprinting in a way similar to digital watermarking, thus allowing the extraction of fingerprints from the modified cover data. We consider several encoding strategies, examine the performance of the proposed schemes in terms of bit error rate and the probabilities of correct identification and false acceptance, and compare it with that of conventional fingerprinting and digital watermarking. Finally, we extend the proposed framework to the multidimensional case based on lattices and demonstrate its performance on both synthetic data and real images.

1.3 Main Contributions

The main contributions in the present work are:

• We improve the identification performance by introducing a new identification setup that uses a fixed maximum list size decoder based on an OSLD, and we analyze its performance versus unique decoding for the identification problem.

• We evaluate the statistics of data that can be modeled by a Gauss-Markov process and are projected onto a random subspace. We show that, using appropriate random projections (RP), we can not only reduce the dimensionality of the data but also reduce the correlation between its elements.

• We perform an information-theoretic analysis of search and memory complexity in identification systems, and introduce a generalized search scheme capable of achieving the identification capacity and reducing the search complexity with respect to exhaustive search.

• We introduce a new hybrid technique that combines pCFP and DWM to achieve a better trade-off between performance and complexity. More particularly, we address the performance of the proposed technique and investigate low-complexity identification strategies.


2 Fingerprint Statistics and Order Statistics List Decoder

In today's world, digital reproduction tools and user-generated content (UGC) websites such as YouTube, which enable massive distribution, sharing and storage of multimedia content, have undergone an impressive evolution, providing professional solutions to various groups of users. Besides these obvious advantages, these tools at the same time offer unprecedented possibilities for counterfeiters to reproduce virtually any physical or digital item, i.e., images, videos, audio files, documents in electronic or printed form, fake biometrics, or any luxury goods or art objects. Thus, the issue of integrity in content identification becomes a critical one, demanding an urgent solution for various applications.

The Content-Based Identification (CBI) problem can be considered as a multiple hypothesis testing problem based on the Neyman-Pearson criterion (see e.g. Varna and Wu 2011), where the cost of making the wrong decision should be adjusted for each particular application. Since most CBI systems deal with critical and sensitive decisions in security applications, such as biometrics, content identification for copyright protection and illegal copy detection, this cost is relatively high. No less important are the consequences of the wrong identification of physical objects such as expired or fake medications, objects of art or luxury goods. Therefore, under these conditions, the identification problem is defined as a multiple hypothesis test with $|\mathcal{M}|+1$ alternatives, where $|\mathcal{M}|$ is the number of contents to be identified and the additional hypothesis stands for the erasure, if no match can be found. The performance of the CBI system is characterized by the probability of miss, i.e., when the genuine content is wrongly rejected, and the probability of false acceptance, when a fake or content-independent entry is falsely accepted as one of the $|\mathcal{M}|$ genuine contents. In each considered application, both probabilities should be very small.

On the other hand, CBI systems face numerous additional requirements related to identification complexity, privacy, security and memory storage. The trade-off between these requirements is a quite complex problem that still remains unsolved. To address this trade-off, digital fingerprints are used (see e.g. Fridrich 1999, Haitsma and Kalker 2002). A digital fingerprint is a short, robust and distinctive content description. The main idea behind digital fingerprinting is the extraction of a lower-dimensional content representation, which is usually accomplished as follows. First, a lower-dimensional data representation is obtained from a content or its extracted feature (dimensionality reduction). Secondly, to address complexity, security, privacy and memory storage requirements, the transformed data are converted to a binary format. At the identification phase, either a binary query (hard decoding) or a real-valued query (soft decoding) can be used (Voloshynovskiy et al. 2010).

One key factor that restricts progress in this direction is related to the analysis of CBI system performance, which in turn requires tractable analytical models for CBI. Moreover, in many applications, data can be severely distorted and classical unique decoding might not be capable of reasonably handling noisy inputs, thus resulting in a high rate of erroneous decisions. However, it is known in digital communication that replacing the unique decoding decision rule by list decoding with a variable (see Forney Jr. 1968) or fixed (see Elias 1955) list size might help in such a situation. The reason for this enhancement is that content degradation might change the rank of the likelihood of the correct content. Since most identification techniques using unique decoding are based on the Maximum Likelihood (ML) principle, a change in the rank of the correct likelihood will incur an error. However, this change might only flip the position of the correct likelihood to nearby positions in the sorted list.

Consequently, providing a list of the most likely candidates might resolve the problem as long as the correct candidate is on the list. Such a situation is mostly acceptable for the above-mentioned multimedia security, biometrics and physical object security applications, where the final decision is made by a human. Obviously, the change of decoding rule from 'unique' to 'list' decoding should be considered along with the relaxation of the constraint on the probability of false acceptance. Nevertheless, the potential of list decoding in CBI systems has been little investigated and remains largely unexplored, with a few exceptions (see e.g. Farhadzadeh et al. 2010b,c, Moulin 2010). Therefore, an investigation of the impact of list decoding in CBI applications is of great theoretical interest and practical importance. In this chapter, we analyze the CBI for still images.

2.1 State-of-the-art

One of the first attempts to establish the theoretical limits of CBI systems in biometrics applications was made by Willems et al. (2003). The authors demonstrated that, by using unique decoding under the assumption of an infinite length of sequences, one can attain the upper achievable rate given by the mutual information between the outputs of the enrollment and identification channels in the class of Discrete Memoryless Channels (DMCs). This result was derived using the concept of typicality (see e.g. Cover and Thomas 1991). The false acceptance event was not considered by Willems et al. (2003), because the probability of two independent sequences being jointly typical is asymptotically vanishing. However, the obtained result cannot be directly applied to correlated contents, as that would violate the principle of independence in the concept of typicality. To address this problem, as well as to relax the typicality constraint on the infinite length of sequences, Varna and Wu (2011) considered the CBI problem based on the ML criterion with a fidelity constraint for images possessing local correlations and finite-length fingerprint representations. The preservation of correlation in the binary data representation unavoidably leads to a decrease in the entropy of fingerprints and thus to a decrease in identification rate as well as to privacy leakage. Moreover, distortions should also be treated with special care due to their dependence on the original data. These factors considerably impact the accuracy of the analysis, which is performed under certain assumptions. Independently, Voloshynovskiy et al. (2010) and Willems (2010) considered the CBI for independent and identically distributed (i.i.d.) binary data of finite length based on a Bounded Distance Decoder (BDD) that can operate in erasure or list decoding modes similarly to Forney Jr. (1968). However, the main focus of the above-mentioned papers was on the analysis of unique decoding under privacy and complexity constraints for finite-length sequences. Thus, the impact of real data statistics still remains unexplored.

Therefore, targeting an accurate performance analysis of CBI systems, we consider the performance of the CBI based on digital fingerprints, taking into account the statistics of real images. Our analysis is accomplished in several stages.

First, in order to guarantee the optimal discriminative power of binary fingerprints, one should maximize the entropy of the fingerprinting output, which requires independence between fingerprint bits. Usually, such a property is satisfied by the proper selection of a linear mapper that is followed by binarization. The selection of such a mapper plays a crucial role at this stage due to the following argument: if the input to the binarization procedure is a vector with uncorrelated components, the output is composed of pairwise independent bits (see Papoulis and Pillai 2002). Moreover, if this input has a jointly Gaussian distribution, the elements of the output are mutually independent. A mapper that possesses such properties is the Karhunen-Loève Transform (KLT) (see Jain 1989), which optimally decorrelates its input for a given covariance matrix and optimally compacts its energy into a few components, making dimensionality reduction a straightforward process. However, the price that must be paid for this optimality is its data dependence and the necessity of updating the transform matrix for new entries. The latter issue gains importance due to the high computational complexity of this transform, which can be evaluated as $O(N^3)$, where $N$ is the dimension of its input (see Golub and van Loan 1983).

Additionally, the estimation of covariance matrices for large databases can be prohibitively expensive. Besides the drawbacks indicated above, the public disclosure of the basis vectors for a given class of data models makes this transform undesirable in secure identification applications.

In order to ameliorate the issue of complexity, several approximations of the KLT have been proposed. These include, for example, the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT) (see Jain 1989), which demonstrate a nearly optimal decorrelation of locally correlated data. The basis vectors of these transforms are fixed and independent of the statistics of their inputs. Due to their decorrelation and energy compaction capabilities, as well as the existence of fast implementation algorithms, they are used as a common tool in various signal and image processing applications. However, the main drawback of such fixed-basis transforms is the public disclosure of the basis vectors, which is rarely acceptable for multimedia security applications.

One possible solution to this privacy/security shortcoming is a mapper designed on the basis of Random Projections (RP) (see Fridrich 1999). The RP have been the object of much interest because they are capable of providing approximate distance preservation, something also recently recognized in the Compressed Sensing community for sparse data (see Johnson and Lindenstrauss 1984, Davenport et al. 2010). While the decorrelation property of orthogonal transforms is well known (see Jain 1989), the RP are based on approximately orthogonal bases. Therefore, the statistics of the projected data, i.e., the covariance matrix, are not well established. On the other hand, prior knowledge of the statistics of the extracted digital fingerprints is crucial for the evaluation of the performance of CBI systems. It is also interesting to explore the possibility of combining the DCT with the RP to benefit from both energy compaction and decorrelation, as well as security.

As mentioned above, the other important issue of CBI systems is their ability to deal with highly distorted data. As a possible solution, one can envision the use of the list decoding approach introduced by Forney Jr. (1968). However, in many identification applications, the final sink of information will be a human being. This constraint makes this type of list decoding undesirable, due to the high variability of the list size. Another solution, proposed by Farhadzadeh et al. (2010b), is the Constrained List-Based (CLB) decoding approach. The CLB decoding is a combination of the list decoding techniques of Elias (1955) and Forney Jr. (1968) from information transmission and coding applications. In this decoding scheme, a limited number of candidates with the largest likelihood functions that satisfy a specific threshold is selected. The analysis by Farhadzadeh et al. (2010b) is based on the assumption that the contents are independent and identically distributed. Thus, one of the main goals of this chapter is the extension of this analysis to a broader class of statistical models with correlation. Moreover, one is often interested in choosing system parameters, i.e., the length of digital fingerprints, the decision threshold and the maximum number of candidates, to ensure that the probabilities of miss and false acceptance are below certain bounds. Hence, in this chapter, besides computing the exact probabilities of correct identification and false acceptance, we derive bounds on the probabilities of miss (the complement of the probability of correct identification) and false acceptance for digital fingerprints of a given length. Furthermore, to show the impact of list decoding, we investigate the probability that the correct entry of a database falls in a given position of the list, depending on the level of query degradation.

2.1.1 Contribution to the State-of-the-art

The main contribution of this chapter can be summarized as follows. We analyze an identification setup based on binary i.i.d. fingerprints. In this identification setup, we exploit the CLB decoding in the binarized projected domain for either contents or their extracted features that can be modelled by a correlation-based model such as a first-order autoregressive (AR(1)) process, which captures the correlation between elements of data (see Jain 1989). Then, we investigate the fundamental performance limits of this setup by analysing the probabilities of errors and establishing error exponent bounds, as well as deriving achievable identification rates. Finally, we consider the order statistics of the position of the correct entry on the list in order to justify the optimal list size for various operational modes. These results extend and deepen our preliminary findings in Farhadzadeh et al. (2010b,c) regarding the analysis of the CBI based on the CLB, as well as the previously considered contribution by Willems et al. (2003).

To the best of our knowledge, the only work dealing with list decoding in content fingerprinting applications is that of Moulin (2010). The closest relevant work addressing the theoretical analysis of correlated contents and binary fingerprints under unique decoding is that of Varna and Wu (2011). The principal differences from these papers can be summarised as follows:

• the CLB decoding proposed in this chapter differs from the one analysed by Moulin (2010) in two ways:¹

  - the type of list decoding: the list decoder proposed by Moulin (2010) produces a variable-size list based on thresholding of the likelihood functions computed for all items, while the list decoder considered in this chapter always outputs a list of candidates that does not exceed the predefined list size. The list decoding analysed by Moulin (2010) achieves better performance in terms of probability of miss in exchange for an unbounded list size, which is not always desirable in applications where the final sink is a human being;

  - prior knowledge about channel statistics: the decoder considered by Moulin (2010) is based on some generic distance, which can be matched to the channel statistics, while the CLB considered in this chapter is based on the Hamming distance deduced for the binary fingerprints.

• contrary to the work of Varna and Wu (2011), we consider a decorrelation approach based on the RP, which makes it possible to generate binary fingerprints with asymptotically independent and equally likely bits; this property could be of advantage for the maximization of the achievable rate of binary fingerprint identification, efficient fingerprint storage and privacy preservation, as well as for the extension of unique decoding to more general list decoding rules².

¹ It should be pointed out that, due to the different decoding strategies, i.e., the constrained list size in the CLB case and the variable list size used by Moulin (2010), the performance measures in terms of probability of miss are different, which makes a direct comparison infeasible.

The main extensions of the results published earlier by Farhadzadeh et al. (2010b,c) are as follows:

• in Farhadzadeh et al. (2010b,c), we assumed that the contents to be identified can be modeled as an i.i.d. Gaussian process. Moreover, the impact of RP, which are approximate orthoprojectors, was not considered. In this chapter, we extend this assumption from an i.i.d. process to an AR(1) Gaussian process and analyze the impact of RP on the statistics of the projected data by deriving upper bounds;

• the performance analysis of identification systems by Farhadzadeh et al. (2010b,c) was based only on exact formulae for the probabilities of miss and false acceptance, where the distortion channel was assumed to be a Binary Symmetric Channel (BSC). In this chapter, we derive upper bounds on the probabilities of miss and false acceptance for a more general DMC distortion model;

• the numerical evaluations by Farhadzadeh et al. (2010b,c) were based on synthetic data generated by an i.i.d. Gaussian process. Here, we extend our validation to simulations using a real image database, the Uncompressed Colour Image Database (UCID) introduced by Schaefer and Stich (2004).

The rest of this chapter is organized as follows. In Section 2.2, we introduce the definitions used throughout this chapter. Section 2.3 defines the structure of the identification setup. In Section 2.4, we consider the statistics of the data used in the identification setup and demonstrate the decorrelation and independence-preserving properties of RP. Section 2.5 elaborates on the fundamental limits of the introduced identification setup. Simulation results are presented in Section 2.6. Finally, conclusions are presented in Section 2.7.

² In some applications, the extra correlation between fingerprint bits is favoured in order to make the method more robust to geometrical transforms or to avoid the computational complexity of decorrelation in large-scale applications.


2.2 Definitions and Preliminaries

2.2.1 Order Statistics

Let $V_1, V_2, \ldots, V_n$ be $n$ i.i.d. random variables, each with Cumulative Distribution Function (CDF) $F(v)$. The $r$-th order statistic of these $n$ i.i.d. random variables is denoted by $V_{(r:n)}$, which corresponds to the $r$-th position of $v_{(1:n)} \le v_{(2:n)} \le \ldots \le v_{(r:n)} \le \ldots \le v_{(n:n)}$ for a specific outcome. $F_{(r:n)}(v)$, the CDF of $V_{(r:n)}$, is given by (see David and Nagaraja 2003)

$$
\begin{aligned}
F_{(r:n)}(v) &= \Pr\{V_{(r:n)} \le v\} \\
&= \Pr\{\text{at least } r \text{ of the } V_i \text{ are less than or equal to } v\} \\
&= \sum_{i=r}^{n} \binom{n}{i} F^{i}(v)\,[1-F(v)]^{n-i},
\end{aligned}
\tag{2.1}
$$

since the term in the summand is the binomial probability that exactly $i$ of $V_1, \ldots, V_n$ are less than or equal to $v$.
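As an illustration, (2.1) can be evaluated directly as a finite binomial sum. The following is a minimal sketch; the function name `order_statistic_cdf` and its argument `F_v` (the parent CDF already evaluated at the point of interest) are hypothetical choices made for this example only.

```python
from math import comb


def order_statistic_cdf(F_v: float, r: int, n: int) -> float:
    """CDF of the r-th order statistic of n i.i.d. variables, following (2.1).

    F_v is the parent CDF evaluated at the point of interest, i.e. F(v).
    """
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    if not 0.0 <= F_v <= 1.0:
        raise ValueError("F(v) must lie in [0, 1]")
    # Sum the binomial probabilities that exactly i of the n variables are <= v.
    return sum(
        comb(n, i) * F_v**i * (1.0 - F_v) ** (n - i) for i in range(r, n + 1)
    )


if __name__ == "__main__":
    # Largest of three variables: the sum collapses to F(v)**3.
    print(order_statistic_cdf(0.5, 3, 3))
    # Smallest of three variables: the sum equals 1 - (1 - F(v))**3.
    print(order_statistic_cdf(0.5, 1, 3))
```

The two special cases in the usage lines are the familiar expressions for the maximum and the minimum of $n$ i.i.d. variables, which gives a quick sanity check of the general formula.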

2.2.2 Random Projections

In RP, the original $N$-dimensional data are projected onto an $L$-dimensional ($L \le N$) random subspace by a linear mapper $W$ of size $N \times L$, drawn from a specified probability distribution. The key idea behind dimensionality reduction using RP is based on the lemma of Johnson and Lindenstrauss (1984): if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved. The choice of the random matrix $W$ is very important for satisfying the conditions of this lemma. The elements $W_{ij}$ of $W$ are often Gaussian distributed, but Achlioptas (2003) has shown that the Gaussian distribution can be replaced by a much simpler Bernoulli distribution $\Pr\{W_{ij} = \pm 1/\sqrt{N}\} = 1/2$. We also consider the RP based on the above Bernoulli distribution due to the simplicity of the statistical analysis of the projected data.
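As a short worked step that is not part of the original derivation, the scaling implied by the $\pm 1/\sqrt{N}$ entries can be checked directly. The symbols $w_j$ (the $j$-th column of $W$) and $x$ (an arbitrary fixed vector in $\mathbb{R}^N$) are introduced here only for illustration. Since the entries are independent, zero-mean and satisfy $W_{ij}^2 = 1/N$, the cross terms vanish in expectation:

```latex
\begin{aligned}
E\big[(w_j^{\dagger} x)^2\big]
  &= \sum_{i=1}^{N}\sum_{k=1}^{N} E[W_{ij} W_{kj}]\, x_i x_k
   = \frac{1}{N}\sum_{i=1}^{N} x_i^2
   = \frac{\|x\|^2}{N}, \\
E\big[\|W^{\dagger} x\|^2\big]
  &= \sum_{j=1}^{L} E\big[(w_j^{\dagger} x)^2\big]
   = \frac{L}{N}\,\|x\|^2 .
\end{aligned}
```

With this normalization, each column of $W$ has exactly unit norm, and squared distances between projected points equal the original squared distances scaled by the common factor $L/N$ in expectation.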

2.3 The Identification Setup

The identification setup under analysis, shown in Figure 2.1, consists of two main phases: content enrollment and content identification.

Figure 2.1: The identification setup for CBI based on binary fingerprints. [Block diagram. Enrollment: the contents $x^N(1), x^N(2), \ldots, x^N(|\mathcal{M}|)$ pass through the fingerprint extraction $\psi(\cdot)$ and are stored in the database as $\bar{x}^L(1), \bar{x}^L(2), \ldots, \bar{x}^L(|\mathcal{M}|)$. Identification: $x^N(m)$ passes through the acquisition channel $P(y^N|x^N)$ to give $y^N$, the fingerprint extraction $\psi(\cdot)$ gives $\bar{y}^L$, and the decoder outputs the list of candidates.]

In the content enrollment phase, the digital fingerprints are extracted from either contents or their extracted features and stored in a database. The database is a collection of $|\mathcal{M}|$ labelled binary vectors, indexed by $\mathcal{M} = \{1, 2, \ldots, |\mathcal{M}|\}$ and denoted by

$$\bar{x}^L(m) \in \{0,1\}^L, \quad m \in \{1, \ldots, |\mathcal{M}|\},$$

where $\bar{x}^L(m) = \psi(x^N(m))$ is a digital fingerprint extracted from either the content or its extracted feature $x^N(m) \in \mathcal{X}^N$, which is drawn from a common stationary distribution $P(x^N)$, and $\psi(\cdot)$ is a digital fingerprint extraction function that can be key-dependent. Conversion to binary in the fingerprint extraction is applied so as to cope with storage, privacy, security and complexity constraints. Since the use of a secret key does not affect the statistical analysis of the setup, owing to its symmetric presence in the enrollment and identification phases, we consider only key-independent digital fingerprint generation.

In the content identification phase, the digital fingerprint of a given query $y^N$ is extracted in the same way as in the enrollment phase, i.e., $\bar{y}^L = \psi(y^N)$. Then, the decoder decides whether the query is related to some entries of the database and, if so, to which ones. Otherwise, it produces an erasure.

2.3.1 Identification Problem as a Decoding Problem

If the query digital fingerprint $\bar{y}^L$ is related to some element $\bar{x}^L(m)$ of the database, this relationship can be modeled as a binary channel with the transition probability $P(\bar{y}^L|\bar{x}^L(m))$. If the query digital fingerprint $\bar{y}^L$ is unrelated to any entry of the database, we assume that $\bar{y}^L$ is drawn from $P(\bar{y}^L) = \sum_{\bar{x}^L \in \{0,1\}^L} P(\bar{x}^L) P(\bar{y}^L|\bar{x}^L)$. Therefore, we can define the content identification problem as a statistical test with $|\mathcal{M}|+1$ hypotheses

$$
\begin{aligned}
H_0 &: \bar{Y}^L \sim P(\bar{y}^L), \\
H_m &: \bar{Y}^L \sim P(\bar{y}^L|\bar{x}^L(m)),
\end{aligned}
\tag{2.2}
$$

where $H_0$ and $H_m$ correspond to the cases that $\bar{y}^L$ is unrelated to any entry of the database, and that $\bar{y}^L$ is related to the $m$-th entry of the database, respectively.
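To make the test (2.2) concrete, consider a special case assumed here purely for illustration: the fingerprint bits are i.i.d. and equiprobable, and the binary channel is a BSC with crossover probability $p < 1/2$. The symbols $p$ and $d_m$ (the Hamming distance between $\bar{y}^L$ and $\bar{x}^L(m)$) are introduced only for this example. In that case $P(\bar{y}^L) = 2^{-L}$, and the log-likelihood ratio between $H_m$ and $H_0$ becomes:

```latex
\begin{aligned}
\ln\frac{P(\bar{y}^L|\bar{x}^L(m))}{P(\bar{y}^L)}
  &= d_m \ln p + (L - d_m)\ln(1-p) + L\ln 2 \\
  &= L \ln\big(2(1-p)\big) - d_m \ln\frac{1-p}{p} .
\end{aligned}
```

Because $\ln\frac{1-p}{p} > 0$ for $p < 1/2$, this quantity is strictly decreasing in $d_m$, so under these assumptions ranking the database entries by likelihood is the same as ranking them by increasing Hamming distance to the query.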


2.3.2 Constrained List-Based Decoder

We define the CLB decoding as follows:

1. For each entry $\bar{x}^L(m)$, $1 \le m \le |\mathcal{M}|$, of the database, the decoder computes the log-normalized-likelihood $L_m = \ln \frac{P(\bar{y}^L|\bar{x}^L(m))}{P(\bar{y}^L)}$.

2. The computed log-normalized-likelihoods are sorted in ascending order.

3. The $N_l$ entries with the largest log-normalized-likelihoods are chosen. Their indices are then put in the primary list $\mathcal{N}_l$ one by one, i.e., the first index in $\mathcal{N}_l$ corresponds to the largest log-normalized-likelihood, and so forth. The parameter $N_l$ is referred to as the primary list size.

4. The final list of candidates is defined as

$$\mathcal{N}'_l = \{m \in \mathcal{N}_l : L_m \ge \gamma L\}, \tag{2.3}$$

where $\gamma \ge 0$ controls the number of final candidates and defines the rejection option.
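The four steps above can be sketched in a few lines of code. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: it assumes a BSC with crossover probability `p` and equiprobable i.i.d. fingerprint bits, so that $P(\bar{y}^L) = 2^{-L}$. The names `clb_decode`, `hamming`, `list_size`, `gamma` and `p` are hypothetical. The sketch sorts in descending order and keeps the head of the list, which selects the same entries as sorting in ascending order and keeping the tail.

```python
import math


def hamming(a, b):
    """Number of positions in which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def clb_decode(query, database, list_size, gamma, p):
    """Return the final candidate list (1-based indices, most likely first).

    Assumes a BSC with crossover probability p and uniform i.i.d. fingerprint
    bits; an empty list corresponds to an erasure.
    """
    length = len(query)
    scored = []
    for m, entry in enumerate(database, start=1):
        d = hamming(query, entry)
        # Step 1: log-normalized-likelihood ln P(y|x(m)) / P(y) with P(y) = 2**-L.
        score = d * math.log(p) + (length - d) * math.log(1.0 - p) + length * math.log(2.0)
        scored.append((score, m))
    # Steps 2-3: order by likelihood (ties broken by index) and keep the N_l best.
    scored.sort(key=lambda item: (-item[0], item[1]))
    primary = scored[:list_size]
    # Step 4: keep only primary-list entries that pass the threshold gamma * L.
    return [m for score, m in primary if score >= gamma * length]


if __name__ == "__main__":
    db = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
    ]
    print(clb_decode([0, 0, 0, 0, 0, 0, 0, 1], db, list_size=2, gamma=0.0, p=0.1))
    print(clb_decode([1, 0, 1, 0, 1, 0, 1, 0], db, list_size=2, gamma=0.0, p=0.1))
```

In the first usage line the query is one bit away from the first entry, which is the only one whose score exceeds the threshold; in the second, the query is at distance four from every entry, all scores fall below the threshold, and the decoder returns an empty list (an erasure).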

The performance metrics of the CBI are defined by the probability of correct identification, $P_{ci}$,

$$P_{ci} = 1 - P_m = \sum_{m=1}^{|\mathcal{M}|} \Pr\{(m \in \mathcal{N}_l) \cap (L_m \ge \gamma L) \,|\, H_m\} \Pr\{H_m\}, \tag{2.4}$$

where $P_m$ denotes the probability of miss, and the probability of false acceptance

$$P_{fa} = \Pr\{\mathcal{N}'_l \ne \emptyset \,|\, H_0\}. \tag{2.5}$$
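For intuition about (2.5), the false acceptance probability has a simple closed form in one idealized case, which is assumed here only for illustration and is not the general model analysed in this chapter. Suppose the database fingerprints are independent with i.i.d. equiprobable bits, the query under $H_0$ is independent of the database, the primary list size is at least one, and the threshold test is equivalent to requiring a Hamming distance of at most `d_max` (a hypothetical parameter, which is the form the test takes for a BSC). Then the final list is non-empty exactly when the closest entry is within `d_max`, the distances are i.i.d. Binomial($L$, 1/2), and $P_{fa}$ is the CDF of the smallest of $|\mathcal{M}|$ such variables.

```python
from math import comb


def binomial_half_cdf(length, d_max):
    """Pr{D <= d_max} for D ~ Binomial(length, 1/2)."""
    if d_max < 0:
        return 0.0
    upper = min(d_max, length)
    return sum(comb(length, d) for d in range(upper + 1)) / 2**length


def false_acceptance_probability(length, num_entries, d_max):
    """P_fa under the idealized assumptions stated in the lead-in.

    The final list is non-empty iff at least one of the num_entries i.i.d.
    Hamming distances is <= d_max.
    """
    q = binomial_half_cdf(length, d_max)
    return 1.0 - (1.0 - q) ** num_entries


if __name__ == "__main__":
    # Hypothetical parameters chosen only to exercise the function.
    print(false_acceptance_probability(length=32, num_entries=1000, d_max=4))
```

The expression `1 - (1 - q)**num_entries` is the order-statistics formula for the minimum of `num_entries` i.i.d. variables, which is how the order statistics of Section 2.2.1 enter even this simplified setting.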

2.4 Statistical Analysis of Digital Fingerprint

The digital fingerprint extraction function $\psi(\cdot)$ works as follows:

1. The dimensionality of a content or its extracted feature $x^N(m)$ and of a query $y^N$ is reduced from $N$ to $L$ by applying the RP matrix $W$. Note that RP are approximately orthoprojectors, i.e., $W^\dagger W \approx I_L$, where $W \in \{\pm 1/\sqrt{N}\}^{N \times L}$ and its elements are drawn according to the Probability Mass Function (PMF) $\Pr\{W_{ij} = \pm 1/\sqrt{N}\} = 1/2$, $1 \le i \le N$ and $1 \le j \le L$. For a given $W$, the projections $\tilde{x}^L(m)$ and $\tilde{y}^L$ are obtained by $\tilde{x}^L(m) = W^\dagger x^N(m)$ and $\tilde{y}^L = W^\dagger y^N$.

2. The $L$-length binary digital fingerprints $\bar{y}^L$ and $\bar{x}^L(m)$ are derived by taking the sign of the projected data, i.e., $\bar{x}^L(m) = (\mathrm{sign}(\tilde{x}_1(m)), \ldots, \mathrm{sign}(\tilde{x}_L(m)))$ and $\bar{y}^L = (\mathrm{sign}(\tilde{y}_1), \ldots, \mathrm{sign}(\tilde{y}_L))$.
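The two steps above can be sketched as follows. This is a minimal illustration rather than the extraction code used in the experiments. The function names are hypothetical, and mapping a non-negative projection to bit 1 and a negative one to bit 0 is an assumption made here to obtain fingerprints in $\{0,1\}^L$.

```python
import math
import random


def random_projection_matrix(n, l, rng):
    """N x L matrix with i.i.d. entries +1/sqrt(N) or -1/sqrt(N), each w.p. 1/2."""
    s = 1.0 / math.sqrt(n)
    return [[s if rng.random() < 0.5 else -s for _ in range(l)] for _ in range(n)]


def extract_fingerprint(x, w):
    """Project x (length N) with W (N x L) and binarize by sign.

    Assumption for this sketch: sign >= 0 maps to bit 1, sign < 0 to bit 0.
    """
    n, l = len(w), len(w[0])
    if len(x) != n:
        raise ValueError("input length must match the number of rows of W")
    # Step 1: x_tilde = W^T x (W is real, so the dagger is a plain transpose).
    projected = [sum(w[i][j] * x[i] for i in range(n)) for j in range(l)]
    # Step 2: keep only the sign of each projected coefficient.
    return [1 if value >= 0.0 else 0 for value in projected]


if __name__ == "__main__":
    rng = random.Random(0)
    w = random_projection_matrix(64, 16, rng)
    content = [rng.gauss(0.0, 1.0) for _ in range(64)]
    # A mildly distorted query should agree with the enrolled fingerprint in most bits.
    query = [c + 0.1 * rng.gauss(0.0, 1.0) for c in content]
    print(extract_fingerprint(content, w))
    print(extract_fingerprint(query, w))
```

The same matrix `w` has to be used at enrollment and at identification, which mirrors the symmetric role of $\psi(\cdot)$ in the setup of Figure 2.1.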


2.4.1 The Statistics of Digital Fingerprints Extracted from Correlated Data

In this section, we investigate the statistics of digital fingerprints obtained by the RP. We assume that the input $X^N$ can be modelled as an AR(1) Gaussian process. The justification for this model is as follows. First, $X^N$ can be considered as an image that is characterized by local correlations between neighbouring pixels. To capture these correlations, a number of statistical models, such as autoregressive and Markov random field models, are presented by Jain (1989). The AR(1) Gaussian process is one of the simple yet powerful models that accurately represent the local correlations present in images $X^N$ (see Jain 1989). Secondly, in the case where $X^N$ represents some robust features extracted from the original content to cope with potential malicious attacks, such an assumption might still be valid. For example, the Scale-Invariant Feature Transform (SIFT) by Lowe (1999), SURF by Bay et al. (2006) or the Fourier-Mellin transform by Reddy and Chatterji (1996), used for image description, exhibit a certain level of correlation among samples of the extracted features that can be captured by an AR(1) model. Finally, many fingerprinting algorithms operate in decorrelation domains such as the DCT or DWT (see Kim 2003), where the residual correlation among the transform coefficients can be modelled as AR(1) (see Juan and Moulin 2001). Assuming this model, for a given RP matrix $W$, the covariance matrix in the projected domain is given by

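$$K_{\tilde{x}\tilde{x}} = E\big[W^\dagger X^N X^{N\dagger} W\big] = W^\dagger K_{xx} W, \tag{2.6}$$

where $K_{xx}$ is defined by (see Jain 1989)

$$
K_{xx} = \sigma_X^2
\begin{bmatrix}
1 & \rho & \cdots & \rho^{N-1} \\
\rho & 1 & \cdots & \rho^{N-2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho^{N-1} & \rho^{N-2} & \cdots & 1
\end{bmatrix},
\tag{2.7}
$$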

where $\sigma_X^2$ and $0 \le \rho < 1$ are the variance and the normalized correlation coefficient, respectively. We use the following Proposition for statistical modeling of projected data.
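As a numerical aside on (2.6) and (2.7), the projected covariance can be computed for a small example to see how the off-diagonal entries behave before and after projection. The sketch below is illustrative only; the helper names and the parameter values in the usage lines are hypothetical, and plain nested lists are used to keep it free of dependencies.

```python
import math
import random


def ar1_covariance(n, rho, sigma2=1.0):
    """Covariance matrix of (2.7): entry (i, j) equals sigma2 * rho**|i - j|."""
    return [[sigma2 * rho ** abs(i - j) for j in range(n)] for i in range(n)]


def bernoulli_projection(n, l, rng):
    """N x L matrix with i.i.d. entries +1/sqrt(N) or -1/sqrt(N), each w.p. 1/2."""
    s = 1.0 / math.sqrt(n)
    return [[s if rng.random() < 0.5 else -s for _ in range(l)] for _ in range(n)]


def projected_covariance(w, k):
    """Evaluate W^T K W of (2.6); W is real, so the dagger is a transpose."""
    n, l = len(w), len(w[0])
    kw = [[sum(k[i][t] * w[t][j] for t in range(n)) for j in range(l)] for i in range(n)]
    return [[sum(w[t][a] * kw[t][b] for t in range(n)) for b in range(l)] for a in range(l)]


def max_abs_offdiagonal(m):
    """Largest absolute off-diagonal entry of a square matrix."""
    size = len(m)
    return max(abs(m[i][j]) for i in range(size) for j in range(size) if i != j)


if __name__ == "__main__":
    rng = random.Random(1)
    k_xx = ar1_covariance(128, rho=0.9)
    k_proj = projected_covariance(bernoulli_projection(128, 8, rng), k_xx)
    print("largest off-diagonal before projection:", max_abs_offdiagonal(k_xx))
    print("largest off-diagonal after projection: ", max_abs_offdiagonal(k_proj))
```

For $\rho = 0$ the projected covariance reduces to $\sigma_X^2 W^\dagger W$, whose diagonal equals $\sigma_X^2$ because every column of $W$ has unit norm; this is a convenient check of the helper functions.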

Proposition 2.1. Let the elements of the RP matrix $W$, of size $N \times L$ with $1 < L \le N$, be i.i.d. with PMF $\Pr\{W_{ij} = \pm 1/\sqrt{N}\} = 1/2$, and let $X^N$ be a real zero-mean random vector modelled as the AR(1) Gaussian process with variance $\sigma_X^2$ and normalized correlation coefficient $\rho$. Then, we have

$$\Pr\Big\{\max_{i \ne j} \big|[K_{\tilde{x}\tilde{x}}]_{ij}\big| > \beta\sigma_X^2\Big\} < \frac{1}{L}, \quad \text{(off-diagonal elements)} \tag{2.8a}$$
