(1)

HAL Id: tel-00005685
https://pastel.archives-ouvertes.fr/tel-00005685
Submitted on 5 Apr 2004

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Contribution à la vérification multi-modale de l'identité en utilisant la fusion de décisions

Patrick Verlinde

To cite this version:
Patrick Verlinde. Contribution à la vérification multi-modale de l'identité en utilisant la fusion de décisions. Interface homme-machine [cs.HC]. Télécom ParisTech, 1999. Français. ⟨tel-00005685⟩


46, rue Barrault
75634 Paris Cedex 13
France

A CONTRIBUTION TO MULTI-MODAL IDENTITY VERIFICATION USING DECISION FUSION

by
Patrick Verlinde

Dissertation submitted to obtain the degree of
Docteur de l'École Nationale Supérieure des Télécommunications
Spécialité: Signal et Images

Composition of the thesis committee:
Jean-Paul Haton (LORIA) - President
Gérard Chollet (ENST) - Director
Marc Acheroy (RMA) - Reporter
Isabelle Bloch (ENST) - Reporter
Paul Delogne (UCL) - Examiner
Josef Kittler (UOS) - Examiner


my twin daughters


In the first place I wish to thank my thesis director dr. Gérard Chollet from CNRS/URA 820 (FR), for his driving force, for his critical and very useful advice, for having involved my research in every suitable project he could find, and for the huge amounts of information he provided me with.

I also would like to thank prof. dr. ir. Jean-Paul Haton from LORIA (FR) for the useful advice he gave me and for having honored me by accepting to be the president of my thesis committee.

Special thanks go to prof. dr. ir. Marc Acheroy, head of the electrical engineering department of the RMA and director of the Signal and Image Centre (SIC), for believing in me, for having supported and motivated me all the time, for his continuous flow of advice, and, last but not least, for having accepted to be a reporter for this thesis.

I also want to express my sincere gratitude towards prof. dr. ir. Isabelle Bloch from ENST/TSI (FR), not only for helping me in a very friendly way to effectively control my "uncertainties", but also for having accepted to be a reporter for my work.

I am very proud to have prof. dr. ir. Josef Kittler from the University of Surrey (UK) in my thesis committee, and I would like to thank him especially for his helpful comments with respect to the statistical aspects of this study.

Thank you prof. dr. ir. Paul Delogne from UCL/TELE (BE), for your many "personalized" comments which I'm sure have improved the contents as well as the readability of this work, and for having accepted to be a [...]

[...] Military Academy (RMA), for having granted me the time and the means to finish this thesis, and for having accepted to be a member of my thesis committee.

Thanks to ir. Charles Beumier from RMA/SIC (BE) for his help in the field of machine vision in general and in the framework of the M2VTS project in particular, and to dr. ir. Stéphane Pigeon from RMA/SIC (BE) for his critical remarks in the general field of fusion, for his help in the design of the experimental protocols, and for writing the software for the NIST99 evaluation and for the evaluation of the mixture of experts.

Thanks to dr. ir. Jan Černocký from VUT (CZ) for his hospitality during the NIST99 evaluations and for his help in generating the data for the experiments involving the mixture of experts.

Thanks to dr. ir. Gilbert Maître, formerly from the IDIAP vision group (CH), for his very positive help in the design of experimental protocols and in the field of machine vision; to dr. ir. Eddy Mayoraz from the IDIAP machine learning group (CH) for his friendly guidance and for his help in formalizing the paradigm of the multi-linear classifier; to ir. Frédéric Gobry from the IDIAP machine learning group (CH) for his help in writing the code for the multi-linear classifier; and to dr. ir. Dominique (Doms) Genoud from the IDIAP speech group (CH) for his help in the field of speaker verification.

Thanks also to ir. Guillaume (Guig) Gravier from ENST/TSI (FR) for his friendship and for all his help in the fields of speaker verification and information technology.

I also would like to thank Bruno, Chris, Dirk, Florence, Idrissa, Lionel, Marc, Michel, Monica, Nada, Pascal, Vincianne, Wim, Xavier, Yann, Youssef, and all my other colleagues from RMA/ELTE (BE) and from ENST/TSI (FR) for the wonderful working atmosphere they all have contributed to.

I am grateful for the contributions of the following students: Benny Tops and Dang Van Thuong.

Finally I would like to thank Renate for her love, support, and so much [...]

1 Introduction
1.1 Introduction
1.2 Subject of the thesis
1.3 Identity determination concepts
1.4 Structure of the thesis
1.5 Original contributions of this thesis

I General issues related to automatic biometric multi-modal identity verification systems

2 Biometric verification systems
2.1 Introduction
2.2 Requirements for biometrics
2.3 Classification of biometrics
2.4 General structure of a mono-modal biometric system
2.5 The need for multi-modal biometric systems
2.6 Characterization of a verification system
2.7 State of the art
2.7.1 General overview
2.7.2 Results obtained on the M2VTS database
2.8 Comments

3 Experimental setup
3.1 Introduction
3.2 The M2VTS audio-visual person database
3.3 Experimental protocol
3.3.1 General issues
[...]
3.4.2 Performances
3.4.3 Statistical analysis of the different experts
3.5 Comments

4 Data fusion concepts
4.1 Introduction
4.2 Taxonomy of data fusion levels
4.3 Decision fusion architectures
4.4 Parallel decision fusion as a particular classification problem
4.5 Comments

II Combining the different experts in automatic biometric multi-modal identity verification systems

5 Introduction to part two
5.1 Goal
5.2 Parametric or non-parametric methods?
5.3 Comments

6 Parametric methods
6.1 Introduction
6.2 A simple classifier: the multi-linear classifier
6.2.1 Decision fusion as a particular classification problem
6.2.2 Principle
6.2.3 Training
6.2.4 Testing
6.2.5 Results
6.2.6 Partial conclusions and future work
6.3 A statistical framework for decision fusion
6.3.1 Bayesian decision theory
6.3.2 Neyman-Pearson theory
6.3.3 Application of Bayesian decision theory to decision fusion
6.3.4 The naive Bayes classifier
6.3.5 Applications of the naive Bayes classifier
6.3.6 The issue of the a priori probabilities
[...]
6.4.2 Results
6.4.3 Mixture of Experts
6.5 Comments

7 Non-parametric methods
7.1 Introduction
7.2 Voting techniques
7.3 A classical k-NN classifier
7.4 A k-NN classifier using distance weighting
7.5 A k-NN classifier using vector quantization
7.6 A decision tree based classifier
7.7 Comments

8 Comparing the different methods
8.1 Introduction
8.2 Parametric versus non-parametric methods
8.3 Experimental comparison of classifiers
8.3.1 Test results
8.3.2 Validation results
8.3.3 Statistical significance
8.4 Visual interpretations
8.5 Comments

9 Multi-level strategy
9.1 Introduction
9.2 A multi-level decision fusion strategy
9.3 Mono-modal mono-expert fusion
9.3.1 Introduction
9.3.2 Results
9.4 Mono-modal multi-expert fusion
9.4.1 Introduction
9.4.2 Methods
9.4.3 Results
9.4.4 Combining the outputs of segmental vocal experts
9.4.5 Combining the outputs of global vocal experts
9.5 Multi-modal multi-expert fusion
[...]
10.2 Future work

Bibliography

A A monotone multi-linear classifier
B The iterative goal function
C The global goal function
D Proof of equivalence
E Expression of the conditional probabilities
F Visual interpretations


ATM Automatic Teller Machine
AVBPA Audio- and Video-based Biometric Person Authentication
BDT Binary Decision Tree
DET Detection Error Tradeoff
EER Equal Error Rate (FAR = FRR)
FA False Acceptance
FAR False Acceptance Rate
FE Frontal Expert
FR False Rejection
FRR False Rejection Rate
GMM Gaussian Mixture Model
HMM Hidden Markov Model
k-NN k-Nearest Neighbor
LC Linear Classifier
LDA Linear Discriminant Analysis
LR Naive Bayes classifier using a Logistic Regression model
M2VTS Multi Modal Verification for Teleservices and Security applications
MAJ Majority voting
MAP Maximum A posteriori Probability
MCP Maximum Conditional Probability
ML Maximum Likelihood
MLP Multi-Layer Perceptron
NBG Naive Bayes classifier using Gaussian distributions
NIST National Institute for Standards and Technology (USA)
NN Nearest Neighbor
NSA National Security Agency (USA)
PE Profile Expert
PIN Personal Identification Number
PLC Piece-wise Linear Classifier
QC Quadratic Classifier
ROC Receiver Operating Characteristic
TD Temporal Decomposition
TER Total Error Rate
VE Vocal Expert


Introduction

1.1 Introduction

The first chapter starts by introducing the subject of the thesis. To avoid confusion, this introduction is followed by an explanation of the differences and/or similarities between terms that are often encountered in the literature related to the field of automatic identity "determination": authentication, recognition, identification, and verification. These definitions are followed by a presentation of the structure of the thesis, and this chapter is ended by clearly stating the original contributions of this thesis.

1.2 Subject of the thesis

This thesis deals with the automatic verification of the identity of a cooperative person under test, by combining the results of analyses of his or her face, profile, and voice. This specific application, which is used throughout this work, has been defined in the framework of the M2VTS (Multi-Modal Verification for Tele-services and Security applications) project of the European Union ACTS program [1]. The exact definition of verification, and the differences with other often encountered terms such as identification, authentication, or recognition, will be explained hereafter. The key idea in this thesis is to analyze the possibilities of using data fusion techniques to combine the results obtained by different biometric (face, profile, and voice) experts that each have analyzed the identity claim of the person under test. In this work we are explicitly avoiding issues such as ethics, responsibility, or privacy. The interested reader can find an introduction to these delicate [...]

The automatic verification of a person is more and more becoming an important tool in several applications, such as controlled access to restricted (physical and virtual) environments. Just think about secure tele-shopping, accessing the safe room of your bank, tele-banking, accessing the services of interactive dialogue systems [175], or withdrawing money from automatic teller machines (ATM).

A number of different readily available techniques, such as passwords, magnetic stripe cards, and Personal Identification Numbers (PIN), are already widely used in this context, but the only thing they really verify is, in the best case, a combination of a certain possession (for instance the possession of the correct magnetic stripe card) and of a certain knowledge, through the correct restitution of a character and/or digit combination. As is well known, these intrinsically simple (access) control mechanisms can very easily lead to abuses, induced for instance by the loss or theft of the magnetic stripe card and the corresponding PIN. Therefore a new kind of method is emerging, based on so-called biometric characteristics or measures, such as voice, face (including profile), eye (iris pattern, retina scan), fingerprint, palm-print, hand shape, or some other (preferably) unique and measurable physiological or behavioral characteristic of the person to be verified.

In this work, a biometric measure will also be called a modality. This means that an identity verification system which uses several biometric measures or modalities (for instance a visual and a vocal biometric modality) is a multi-modal identity verification system.

Another term which will be used very often in this work is expert. In this thesis, an expert is any algorithm or method using characteristic features coming from a particular modality to verify the identity of a person under test. In this sense, one single biometric measure or modality can lead to the use of more than one expert (the visual modality can for instance lead to the use of two experts: a profile and a frontal face expert). This means that a mono-modal identity verification system can still be a multi-expert system.

Biometric measures in general, and non-invasive, user-friendly (vocal, visual) biometric measures in particular, are very attractive because they have the huge advantage that one cannot lose or forget them, and they are really personal (one cannot pass them to someone else), since they are based on a physical appearance measure. We can start using these [...]

[...] applications use a classical technique (password, or magnetic stripe card) to claim a certain identity, which is then verified using one or more biometric measures.

If one uses only a single (user-friendly) biometric measure, the results obtained may be found to be not good enough. This is due to the fact that these user-friendly biometric measures tend to vary with time for one and the same person and, to make it even worse, the importance of this variation is itself very variable from one person to another. This is especially true for the vocal (speech) modality, which shows an important intra-speaker variability. One possible solution to try to cope with the problem of this intra-person variability is to use more than one biometric measure. In this new multi-modal context, it is thus becoming important to be able to combine (or fuse) the outcomes of different modalities or experts. There is currently a significant international interest in this topic. The organization of already two international conferences on the specific subject of Audio- and Video-based Biometric Person Authentication (AVBPA) is probably the best proof of this [16, 38].

Combining the outcomes of different experts can be done by using classical data fusion techniques [2, 46, 70, 71, 101, 170, 172, 181], but the major drawback of the bulk of all these methods is their rather high degree of complexity, which is expressed, amongst else, by the fact that these methods tend to incorporate a lot of parameters that have to be estimated. If this estimation is not done using enough training data (i.e. if the estimation is not done properly), this places a serious constraint on the ability of the system to correctly generalize [9, 121]. But actually a major difficulty of this particular estimation problem is the scarcity of multi-modal training data. Indeed, to keep the automatic verification system user-friendly, the enrollment of a (new) client should not take too much time, and as a direct consequence of this, the amount of client training data tends to be limited. To try to deal with this lack of training data, one possibility is to develop simple classifiers (i.e. for instance classifiers that use only few parameters), so that their parameters can be estimated using only limited amounts of training data. The price to be paid when using simple methods [...]

1.3 Identity determination concepts

Automatic systems for recognizing a person or for authenticating his identity (which is equivalent) all have a database of N so-called authorized persons or clients. Authentication or recognition is the general term, which covers on the one hand identification and on the other hand verification. These two processes are quite different, as the following more detailed description will show.

Identification in the strict sense of the word supposes a closed-world context. This means that we are sure that the person under test is a client. The only thing we need to find out is which client of the database of authorized persons matches "the best" the person under test. There is no criterion (such as a threshold, for instance) to define how good the match has to be in order to be acceptable. Identification is thus a 1-out-of-N matching process, and it is clear that the performance decreases with N.

Verification in the strict sense of the word operates in an open-world context. This means that we are no longer sure that the person under test is a client. In this case, the person under test claims a certain identity, which of course has to be the identity of an authorized person. If the person under test is no member of the database of authorized persons, he is a so-called impostor. Verification is thus a 1-out-of-1 matching process, where it is important that the mismatch between the reference model from the database and the measured characteristics of the person under test stays below a certain threshold. The verification performance is independent of N.

Sometimes people do refer to identification in the large sense of the word as the (sequential) process of identification followed by a verification of the identified identity. Sometimes this double process is also called identification in an open-world context.

In this thesis we will only consider verification problems. This means that the decision problem we are confronted with is a typical binary hypothesis test. Indeed, the decision we have to take is either to accept or to reject the identity claim of the person under test.
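The two matching processes can be contrasted in a short sketch (a minimal illustration with hypothetical data, assuming each client is represented by a single reference feature vector and that matching is a plain Euclidean distance; the actual experts used in this work are introduced in chapter 3):

```python
import numpy as np

def identify(probe, references):
    """Closed-world identification: 1-out-of-N matching.
    Return the client whose reference matches the probe best;
    no threshold is involved."""
    names = list(references)
    distances = [np.linalg.norm(probe - references[n]) for n in names]
    return names[int(np.argmin(distances))]

def verify(probe, claimed_id, references, threshold):
    """Open-world verification: 1-out-of-1 matching against the claimed
    identity; accept only if the mismatch stays below a threshold."""
    return np.linalg.norm(probe - references[claimed_id]) < threshold

references = {"alice": np.array([0.0, 1.0]), "bob": np.array([1.0, 0.0])}
probe = np.array([0.1, 0.9])
print(identify(probe, references))                    # → alice
print(bool(verify(probe, "alice", references, 0.5)))  # genuine claim accepted
print(bool(verify(probe, "bob", references, 0.5)))    # impostor claim rejected
```

Note that identification gets harder as N grows (more candidates to confuse), whereas the verification decision involves only the claimed reference and its threshold, independently of N.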

1.4 Structure of the thesis

This thesis has been divided into two parts. In the first part, general issues related to automatic multi-modal identity verification systems, such as a [...] set-up (including the presentation and the analysis of our experts) and a general overview of data fusion related concepts, are treated. In the second part, the fusion of the different experts in a multi-modal identity verification system is implemented on the decision level, using parametric and non-parametric methods. These different methods are then compared with each other, and a structured hierarchical approach for gradually upgrading the performance of automatic biometric verification systems is presented. At the end of these two parts, we conclude this thesis by summarizing our contributions to the field and by looking at possible extensions of the work done.

To be more specific, the first part is organized as follows. In chapter 2 we deal with biometric modalities, and we start by listing some theoretical and practical requirements that biometrics in general should conform to. This is followed by a section which presents a tentative classification of the most commonly found biometrics into two classes: the so-called physiological and behavioral biometrics. In the following section the general structure of an automatic mono-modal biometric verification system is presented, while in the next section some general arguments for using multi-modal biometric verification systems are developed. The following section is meant to introduce and define the classical performance characteristics used in the field of automatic identity verification, and the final section gives an overview of the state of the art in multi-modal biometric identity verification systems. Chapter 3 gives details about the experimental set-up. It starts by presenting the M2VTS databases used in this work. After this, the experimental protocol is described. Finally, the three different biometric experts we have been using throughout this work are briefly introduced, and their individual performances are highlighted and statistically analyzed. Chapter 4 introduces some elementary data fusion concepts, such as the different data fusion levels and architectures, and shows how it is possible, by making some well-founded choices, to transform a general data fusion problem into a particular classification problem.

The second part of this work deals more particularly with the parallel combination or fusion of the partial (soft) decisions of the different experts. Chapter 5 explains why we have chosen to experiment with parametric as well as with non-parametric methods. Chapter 6 deals with parametric techniques, but to show the usefulness of these parametric methods, first of all a trivial but original method is presented: the monotone multi-linear [...]

[...] information with respect to the probability density functions of the different populations is thrown away. Therefore, in a fairly early stage of this work, it has been decided to stop developing this simple method and to fall back instead on the less original, but more fundamental, statistical decision theory, by using so-called parametric techniques. In this parametric class, classifiers based on the general Bayesian decision theory (Maximum A-posteriori Probability and Maximum Likelihood) and on a simplified version of it (the naive Bayes classifier, which has been applied in the case of simple Gaussians and in the case of a logistic regression model) have been studied. Furthermore, experiments have also been done using Linear and Quadratic classifiers. Neural networks form a special case of the parametric family, since the number of parameters to be estimated can be very large. Therefore neural networks are sometimes classified as semi-parametric classifiers. Still, we will present neural networks in the chapter on parametric techniques, by means of their most popular representative: the Multi-Layer Perceptron. Chapter 7 deals with non-parametric techniques. This chapter starts by presenting a very simple family of non-parametric techniques. These voting techniques are sometimes referred to as k-out-of-n voting techniques, where k relates to the number of experts that have to decide that the person under test is a client before the global voting classifier accepts the person under test as a client. After the voting methods, another simple but very popular technique, the k-Nearest Neighbor (k-NN) technique, is presented with a number of variants. These variants include a distance-weighted and a vector-quantized version of the classical k-NN rule. This chapter ends by presenting the category of (binary) decision trees, by means of an implementation of the C4.5 algorithm, which is probably the most popular method of its kind. Chapter 8 deals with the comparison between the different parametric and non-parametric methods that have been presented in the second part of the thesis. Chapter 9 presents a multi-level decision fusion strategy that allows one to gradually improve the performance of an automatic biometric identity verification system while limiting the initial investments.
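The k-out-of-n voting rule mentioned above is simple enough to sketch in a few lines (a minimal illustration assuming each expert already delivers a hard accept/reject decision; with n = 3 and k = 2 this reduces to the majority voting rule listed as MAJ in the glossary):

```python
def k_out_of_n_vote(decisions, k):
    """Accept the identity claim iff at least k of the n experts accept it.
    `decisions` holds one boolean (hard) decision per expert."""
    return sum(decisions) >= k

# Three experts (e.g. frontal face, profile, and voice) judge one claim:
decisions = [True, False, True]
print(k_out_of_n_vote(decisions, k=2))  # → True (majority voting)
print(k_out_of_n_vote(decisions, k=3))  # → False (unanimity required)
```

Raising k trades False Acceptances for False Rejections: a strict rule (k = n) makes it hard for an impostor to fool every expert, but also rejects clients whenever a single expert fails.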

Chapter 10 finally concludes this thesis and formulates some recommendations for developing automatic multi-modal biometric identity verification [...]

1.5 Original contributions of this thesis

The original contributions of this thesis are the following ones:

1. the formulation (in the framework of a multi-modal biometric identity verification system) of the fusion of the partial (soft) decisions of d experts in parallel as a particular classification problem in the d-dimensional space [179];

2. the systematic and detailed statistical analysis of the different experts that have been used;

3. the development of a simple decision fusion method, based on a monotone multi-linear classifier [179, 180];

4. the analysis of the applicability, the characteristics, and the performance of the logistic regression method in a Bayesian framework [177];

5. the development of a Vector Quantization version of the classical k-Nearest Neighbor algorithm [173];

6. the systematic comparison of a large number of parametric as well as non-parametric techniques to solve the particular classification problem [174];

7. the introduction of either the non-parametric Cochran's Q test for binary responses, or the non-parametric Page test for ordered alternatives, to measure the statistical significance of the differences in performance of several (i.e. more than two) fusion modules at the same time;

8. the formulation of a multi-level fusion strategy which allows one to gradually improve the performance of an automatic (biometric) identity verification system [176, 178];

9. the formulation of the mixture of experts paradigm in the framework of mono-modal multi-expert data fusion, applied to a segmental approach to text-independent speaker verification [171];

10. the introduction of the use of multi-modal identity verification in [...]

I General issues related to automatic biometric multi-modal identity verification systems

Biometric verification systems

2.1 Introduction

This chapter starts by defining the ideal theoretical and practical requirements for any biometric. This is followed by a section which presents a tentative classification (according to [120]) of the most commonly found biometrics into two classes: the so-called physiological and behavioral biometrics. In the following section the general structure of an automatic mono-modal biometric verification system is presented, while in the next section some general arguments for using multi-modal biometric verification systems are developed. The following section then presents the main characteristics of identity verification systems. In the final section, an overview of the state of the art of multi-modal biometric person verification systems is given.

2.2 Requirements for biometrics

Automatic biometric systems have to identify an individual or to verify his or her identity using measurements of the (living) human body. (As already mentioned in chapter 1, we will consider in this work only verification problems.) According to [88, 89], in theory any human characteristic can be used to make an identity verification, as long as it satisfies the following desirable (ideal) requirements:

universality: this means that every person should have the characteristic;

uniqueness: this indicates that no two persons should be the same in terms of the characteristic;

permanence: this means that the characteristic does not vary with time;

collectability: this indicates that the characteristic can be measured quantitatively.

In practice, there are some other important requirements:

performance: this specifies not only the achievable verification accuracy, but also the resource requirements to achieve an acceptable verification accuracy;

robustness: this refers to the influence of the working or environmental factors (channel, noise, distortions, ...) that affect the verification accuracy;

acceptability: this indicates to what extent people are willing to accept the biometric verification system;

circumvention: this refers to how easy it is to fool the system by fraudulent techniques (make sure that the individual owns the data and is not transforming it; this could also include a so-called liveliness test).

As mentioned before, these requirements should be regarded as ideal. In other words, the better a biometric satisfies these requirements, the better it will perform. In practice, however, there is no single biometric which fulfills all these ideal requirements perfectly. This observation is one of the main reasons why combining several biometric modalities in multi-modal systems is gaining ground.

2.3 Classification of biometrics

A range of mono-modal biometric systems is in development or on the market, because no one biometric meets all the needs. The tradeoffs in developing these systems involve cost, reliability, discomfort in using a device, and the amount of data needed. Fingerprints, for instance, have a long track [...] the amount of data that needs to be stored to describe a fingerprint (the template) tended to be rather large. In contrast, the hardware for capturing the voice is cheap (relying on low-cost microphones or on an already existing telephone), but the voice varies when emotions and states of health change.

According to [120], biometrics encompasses both physiological and behavioral characteristics. This is illustrated for a number of frequently used biometrics in Figure 2.1.

[Figure 2.1 groups automated biometrics into physiological characteristics (face, fingerprint, hand, eye) and behavioral characteristics (signature, voice, keystroke).]

Figure 2.1: Classification of a number of biometrics in physiological and behavioral characteristics.

A physiological characteristic is a relatively stable physical feature such as a fingerprint [89, 130, 153], hand geometry [190], palm-print [188], infrared facial and hand vein thermograms [141], iris pattern [184], retina pattern [74], or facial feature [11, 12, 34, 39, 102, 116, 183, 189]. Indeed, all these characteristics are basically unalterable without trauma to the individual. A behavioral trait, on the other hand, has some physiological basis, but also reflects a person's psychological (emotional) condition. The most common behavioral trait used in automated biometric verification systems is the human voice [3, 10, 20, 22, 31, 35, 36, 52, 60, 62, 63, 64, 65, 66, 69, 72, 73, 76, 81, 80, 105, 111, 112, 131, 132, 133, 134, 151, 154, 160]. Other behavioral traits are gait [126], keystroke dynamics [127], and (dynamic) [...].

Systems that rely on behavioral characteristics should ideally update their enrolled reference template(s) on a regular basis. This could be done either in an automatic manner, each time a reference is used successfully (i.e. the system decides that an access claim is an authentic client claim), or in a supervised manner, by re-enrolling each client periodically. The former method has the advantage of being user-friendly, but has the drawback that one updates the client references with a template from an impostor in the case that the system commits a False Acceptance. The latter approach has the advantage of always updating the client references with client templates, but has the drawback that it is not very user-friendly, since the clients need to do additional training sessions.
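The automatic update policy, and its False Acceptance risk, can be sketched as follows; this is only an illustrative scheme using a hypothetical exponential-moving-average update, not an algorithm prescribed in this work:

```python
def update_template(template, accepted_sample, alpha=0.1):
    """Move the enrolled reference template slightly toward the sample
    behind a successful access claim. If that claim was in fact a False
    Acceptance, the template drifts toward the impostor instead."""
    return [(1.0 - alpha) * t + alpha * s
            for t, s in zip(template, accepted_sample)]

template = [0.0, 1.0]                              # enrolled client reference
template = update_template(template, [0.2, 0.8])   # sample behind an accepted claim
print(template)  # drifts slightly toward the accepted sample
```

The supervised policy would instead replace the template with freshly enrolled client samples at each re-enrollment session, avoiding impostor contamination at the cost of user convenience.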

The differences between physiological and behavioral methods are important. On the one hand, the degree of intra-person variability is smaller in a physiological than in a behavioral characteristic. On the other hand, machines that measure physiological characteristics tend to be larger and more expensive, and may seem more threatening or invasive to users (this is for instance the case for retina scanners). Because of these differences, no one biometric will serve all needs.

2.4 General structure of a mono-modal biometric system

Automated mono-modal biometric verification systems usually work according to the following principles. In a typical functional system a sensor, adapted to the specific biometric, generates measurement data. From these data, features that may be used for verification are extracted, using image and/or signal processing techniques. In general, each biometric has its own feature set. Pattern matching techniques compare the features coming from the person under test with those stored in the database under the claimed identity, to provide likely matches. Last but not least, decision theory including statistics provides a mechanism for answering the question "Is the person under test who he or she claims to be?" and for evaluating biometric technology [77, 78, 158]. Automatic mono-modal biometric verification systems are usually built by arranging two main modules in series: (1) a module which compares the measured features from the person under test with a reference client model and gives a scalar number as output, followed by (2) a decision module realized by a thresholding operation. This threshold can be a function of the claimed identity.

The architecture of an automatic mono-modal biometric verification system is represented in figure 2.2.

[Figure 2.2 shows the biometric signal passing through feature extraction and matching (against a model selected via the identification key) to produce a score, followed by decision forming, which outputs the decision.]

Figure 2.2: Typical mono-modal biometric verification system architecture.
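The series arrangement of figure 2.2 can be sketched as follows (a minimal illustration; the matching function, the model store, and the client-dependent thresholds are hypothetical placeholders rather than the actual modules of this work):

```python
import numpy as np

def matching_module(features, client_model):
    """Module (1): compare the measured features with the reference
    client model and output a scalar score (a negated Euclidean
    distance here, so that higher means a better match)."""
    return -float(np.linalg.norm(features - client_model))

def decision_module(score, claimed_id, thresholds):
    """Module (2): a thresholding operation; the threshold may be a
    function of the claimed identity."""
    return score >= thresholds[claimed_id]

models = {"alice": np.array([0.0, 1.0])}   # selected via the identification key
thresholds = {"alice": -0.5}               # client-dependent threshold
features = np.array([0.1, 0.9])            # extracted from the biometric signal
score = matching_module(features, models["alice"])
print(bool(decision_module(score, "alice", thresholds)))  # → True
```

Keeping the score and the decision in separate modules is what makes decision-level fusion possible later on: several such score outputs can be combined before, or instead of, a single thresholding step.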

2.5 The need for multi-modal biometric systems

There can be several reasons why one would prefer multi-modal biometric verification systems over mono-modal ones. Generally, the criterion to choose between mono- and multi-modal systems will be system performance. The end-user typically desires a guarantee that the classification errors (FAR and FRR) will be limited by maximal values that will depend on the application. And although there exist mono-modal biometric verification techniques that do offer very small classification errors, the main problem with this category of biometrics is that they are either too expensive to be used in a general purpose context (for instance identity verification in the case of credit card payments over the Internet using a PC) or perceived by the user as too invasive. So very often one is confronted with the obligation of using inexpensive hardware and non-invasive user-friendly biometrics. Two of the most popular biometrics that can conform

to these constraints are faces and voices. However, the drawback of using inexpensive hardware (cheap black and white CCD-cameras and low-cost microphones) to obtain the raw data measurements of these biometrics, has as a direct consequence that the measurements generally will be corrupted. Other problems with these two biometrics are that the visual modality is rather sensitive to lighting conditions and that the vocal modality tends to vary with time (since it is a behavioral biometric). This makes the use of a mono-modal biometric verification system based solely either on the facial or on the vocal modality a very big challenge, especially since it is usually not possible to update the database references of the authorized users on a regular basis.

One possible solution to cope with this problem is to use not one single mono-modal biometric system, but to use several of them in parallel to form a so-called multi-modal biometric verification system. It can be felt intuitively that such a strategy can be helpful, if one considers complementary biometrics. This complementarity can be achieved with respect to the different requirements as they were presented in section 2.2. A possible example of complementary biometrics with respect to the permanence requirement would be the combined use of a physiological (face: more invariant in time) and a behavioral (voice: less invariant) biometric. The main and very general idea of using multi-modal biometric verification systems instead of mono-modal ones is thus the ability to use more (complementary) information with respect to the person under test in the former approach than in the latter approach. In chapter 9, a more detailed step-by-step analysis of a multi-level strategy to gradually improve the performances of an automated biometric system is presented.

A possible and straightforward way of building a multi-modal verification system from d such mono-modal systems is to input the d scores provided in parallel into a fusion module, which combines the d scores and passes the fused score on to the decision forming module. This module then has to take the decision accept or reject, based on a threshold. Just as in the case of the mono-modal system, this threshold can be a function of the claimed identity. However, two alternatives remain for the fusion module: a global (i.e. the same for all persons) or a personal (i.e. tailored to the specific characteristics of each authorized person) approach. For the sake of simplicity and because the personal approach needs more training data (since in this case the fusion module needs to be optimized for each client), we have opted in this work for a global fusion module.
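A global fusion module of this kind can be sketched as follows; the weighted average and its uniform default weights are illustrative assumptions, not the fusion methods studied later in this work.

```python
def fuse_scores(scores, weights=None):
    """Global fusion module: combine the d expert scores into one fused
    score.  Here a weighted average with uniform default weights; the
    same weights serve all clients (global, not personal, fusion)."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

def decide(fused_score, threshold):
    """Decision forming module: threshold the fused score."""
    return "accept" if fused_score >= threshold else "reject"
```

Being global, the same weights and threshold serve all clients; a personal variant would fit them per client, at the cost of the extra training data mentioned above.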

Figure 2.3 shows the typical architecture of a general multi-expert verification system, including the possible use of personalized fusion or decision modules.

[Figure 2.3 block diagram: the physical appearance and the identification key feed the different experts (expert i, expert j, ...); their scores enter the fusion module, and the fused score drives the decision forming module, which outputs the decision.]

Figure 2.3: Multi-expert architecture.

2.6 Characterization of a verification system

In this work, we will consider the verification of the identity of a person as a typical two-class problem: either the person is the one (in this case he is called a client), or is not the one (in that case he is called an impostor) he claims to be. This means that we are going to work with a binary {accept, reject} decision scheme.

When dealing with binary hypothesis testing, it is trivial to understand that the decision module can make two kinds of errors. Applied to this problem of the verification of the identity of a person, these two errors are called:

- False Rejection (FR): i.e. when an actual client is rejected as being an impostor;

- False Acceptance (FA): i.e. when an actual impostor is accepted as being a client.

The performances of a speaker verification system are usually given in terms of the global error rates computed during tests: the False Rejection Rate (FRR) and the False Acceptance Rate (FAR) [18]. These error rates are defined as follows:

FRR = number of FR / number of client accesses    (2.1)

FAR = number of FA / number of impostor accesses    (2.2)

A perfect identity verification (FAR = 0 and FRR = 0) is in practice unachievable. However, as shown by the study of binary hypothesis testing [167], any of the two FAR, FRR can be reduced to an arbitrary small value by changing the decision threshold, with the drawback of increasing the other one. A unique measure can be obtained by combining these two errors into the Total Error Rate (TER) or its complement, the Total Success Rate (TSR):

TER = (number of FA + number of FR) / total number of accesses    (2.3)

TSR = 1 - TER    (2.4)
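The four rates of equations (2.1) to (2.4) follow directly from the raw error counts; the sketch below is a direct transcription of the definitions (the counts used with it are hypothetical).

```python
def error_rates(n_fr, n_client_accesses, n_fa, n_impostor_accesses):
    """FRR, FAR, TER and TSR as defined in equations (2.1)-(2.4)."""
    frr = n_fr / n_client_accesses
    far = n_fa / n_impostor_accesses
    ter = (n_fr + n_fa) / (n_client_accesses + n_impostor_accesses)
    tsr = 1.0 - ter
    return frr, far, ter, tsr
```

Because the TER weights the two error types by their numbers of accesses, feeding it many more impostor than client accesses pulls it towards the FAR.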

However, care should be taken when using one of these two unique measures. Indeed, from the definition just given it follows directly that these two unique numbers could be heavily biased by one or the other type of error (FAR or FRR), depending solely on the number of accesses that have been used in obtaining these respective errors. As a matter of fact, due to the proportional weighting as specified in the definition, the TER will always be closer to that type of error (FAR or FRR) which has been obtained using the largest number of accesses.

The overall performance of an identity verification system is however better characterized by its so-called Receiver Operating Characteristic (ROC), which represents the FAR as a function of the FRR [167]. The Detection Error Tradeoff (DET) curve is a convenient non-linear transformation of the ROC curve, which has become the standard method for comparing performances of speaker verification methods used in the annual NIST evaluation campaigns [142]. In a DET curve, the horizontal axis shows the normal deviate of the False Alarm probability (in %), which is a non-linear transformation of the horizontal False Acceptance axis of the classical ROC curve. The vertical axis of the DET curve represents the normal deviate of the Miss probability (in %), which is a non-linear transformation of the False Rejection axis of the classical ROC curve. The use of the normal deviate scale moves the curves away from the lower left when performance is high, making comparisons between different systems easier. It can also be noted that DET curves tend to be close to linear over a large portion of their range. Further details of this non-linear transformation are presented in [115]. Figures 2.4 and 2.5 give respectively an example of a typical ROC and a typical DET curve.
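The normal deviate transformation that turns a ROC into a DET curve is the inverse of the standard Normal cumulative distribution function; a minimal sketch using only the Python standard library:

```python
from statistics import NormalDist

def normal_deviate(p):
    """Inverse standard Normal CDF (probit) of an error probability p,
    0 < p < 1.  Probabilities below 50% map to negative deviates, so
    good systems sit in the lower-left of the DET plane."""
    return NormalDist().inv_cdf(p)

def det_point(far, frr):
    """One DET-curve point for a given decision threshold: the normal
    deviates of the False Alarm and Miss probabilities."""
    return normal_deviate(far), normal_deviate(frr)
```

Sweeping the decision threshold and plotting the resulting points gives the full DET curve.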

[Figure 2.4 plots false rejection (%) against false acceptance (%), both axes running from 0 to 50, for the curves labeled "NIST EPFL MALES" and "GMM EPFL", with the EER point marked.]

Figure 2.4: Typical example of a ROC curve.

Each point on a ROC or a DET characteristic corresponds with a particular decision threshold. The Equal Error Rate (EER: i.e. when FAR = FRR) is often used as the only performance measure of an identity verification method, although this measure gives just one point of the ROC and comparing different systems solely based on this single number can be very misleading [129].

High security access applications are concerned about break-ins and hence operate at a point on the ROC with small FAR. Forensic applications desire to catch a criminal even at the expense of examining a large number of false accepts and hence operate at small FRR/high FAR. Civilian applications attempt to operate at the operating points with both low FRR and low FAR. These concepts are shown in Figure 2.6, which was found in [88].

Unfortunately in practice, as will be shown further in the study of the fusion modules presented in this thesis, it is not always possible to explicitly identify a continuous decision threshold in a certain fusion module, which means that in that case it will a fortiori not be possible to vary the decision threshold.

Figure 2.5: Typical example of a DET curve.

[Figure 2.6 sketches the False Acceptance Rate versus the False Rejection Rate, marking the Equal Error Rate point, the operating region of high security access applications, and that of forensic applications.]

Figure 2.6: Typical examples of different operating points for different applications.

Reporting results at a single fixed operating point is, however, also the only correct way of determining the performance of an operational system, since in such systems the decision threshold has been fixed.

All verification results in this thesis will be given in terms of FRR, FAR, and TER. For each error the 95% level confidence interval will be given between square brackets. The concept of confidence intervals refers to the inherent uncertainty in test results owing to small sample size. These intervals are a posteriori estimates of the uncertainty in the results on the test population. They do not include the uncertainties caused by errors (mislabeled data, for example) in the test process. The confidence intervals do not represent a priori estimates of performance in different applications or with different populations [182].

These confidence levels will be calculated assuming that the probability distribution for the number of errors is binomial. But since the binomial law cannot be easily handled analytically, the calculation of confidence intervals can not be done directly in an analytical way. Therefore we have used the Normal law as an approximation of the binomial law. This large sample approach is already statistically justified starting from 30 samples. Using this approximation, the 95% confidence interval of an error E based on N tests, is defined by the following lower (given by the minus sign) and upper (given by the plus sign) bounds:

E ± 1.96 √( E (1 − E) / N )

More detailed information about the calculation of confidence intervals can be found in [41, 44, 155].
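This interval translates directly into code; clipping the bounds to [0, 1] is an assumption of this sketch, since the raw formula can leave the unit interval for small N.

```python
from math import sqrt

def confidence_interval_95(error_rate, n_tests):
    """95% confidence interval of an error rate E measured on N tests,
    using the Normal approximation of the binomial law (N >= 30)."""
    half_width = 1.96 * sqrt(error_rate * (1.0 - error_rate) / n_tests)
    return (max(0.0, error_rate - half_width),
            min(1.0, error_rate + half_width))
```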

2.7 State of the art

2.7.1 General overview

Some work on multi-modal biometric identity verification systems has already been reported in the literature. Hereafter, an overview is given of the most important contributions, with a brief description of the work performed.

1. As early as 1993, Chibelushi et Al. have proposed in [40] to integrate acoustic and visual speech (motion of visible articulators) for speaker recognition. The combination scheme used is a simple linear one.

2. In 1995, Brunelli and Falavigna have proposed in [33] a person identification system based on acoustic and visual features. The voice modality is based on a text-independent vector quantization and it uses two types of information: static and dynamic acoustic features. The face modality implements a template matching technique on three distinct areas of the face (eyes, nose, and mouth). They use a database containing up to three sessions of 87 persons. One session was used for training, the others for testing, which did lead to a total number of 155 tests. The most performing fusion module is a neural network. The best results obtained on this particular database are: FAR = 0.5% and FRR = 1.5%.

3. In 1997, Dieckmann et Al. have proposed in [50] a decision level fusion scheme, based on a 2-out-of-3 majority voting. This approach integrates two biometric modalities (face and voice), which are analyzed by three different experts: (static) face, (dynamic) lip motion, and (dynamic) voice. The authors have tested their approach on a specific database of 15 persons, where the best verification results obtained were FAR = 0.3% and FRR = 0.2%.

4. In 1997, Duc et Al. did propose in [55] a simple averaging technique and compared it with the Bayesian integration scheme presented by Bigun et Al. in [13]. In this multi-modal system the authors use a frontal face identification expert based on Elastic Graph Matching, and a text-dependent speech expert based on person-dependent Hidden Markov Models (HMMs) for isolated digits. All experiments are performed on the M2VTS database, and the best results are obtained for the Bayesian fusion module: FAR = 0.54% and FRR = 0.00%.

5. In 1997, Jourlin et Al. have proposed in [93] an acoustic-labial speaker verification method. Their approach uses two classifiers. One is based on a lip tracker using visual features, and the other one is based on a text-dependent person-dependent HMM modeling of isolated digits using acoustic features. The fused score is computed as the weighted sum of the scores generated by the two experts. All experiments are performed on the M2VTS database, and the best results obtained for the weighted fusion module are: FAR = 0.5% and FRR = 2.8%.

6. In 1998, Kittler et Al. have proposed in [98] a multi-modal person verification approach combining frontal face, profile and vocal experts. The profile expert is using a chamfer matching algorithm, and the voice expert is based on the use of text-dependent person-dependent HMM models for isolated digits. All these experts give their soft decisions (scores between zero and one) to the fusion module. All experiments are performed on the M2VTS database, and the best combination results are obtained for a simple sum rule: EER = 0.7%.

7. In 1998, Hong and Jain have proposed in [82] a multi-modal personal identification system which integrates two different biometrics (face and fingerprints) that complement each other. The face verification is done using the eigenfaces approach, and the fingerprint expert is based on a so-called elastic matching algorithm. The fusion algorithm operates at the expert decision level, where it combines the scores from the different experts (under the statistical independence hypothesis), by simply multiplying them. The {accept, reject} decision is then taken by comparing the fused score to a threshold. The databases used in this work are the Michigan State University fingerprint database containing 1500 images from 150 persons, and a face database coming from the Olivetti Research Lab, the University of Bern, and the MIT Media Lab, which contains 1132 images from 86 persons. The results obtained for the fusion approach on this database are: FAR = 1.0% and FRR = 1.8%.

8. In 1998, Ben-Yacoub did propose in [7] a multi-modal data fusion approach for person authentication, based on Support Vector Machines (SVM). In his multi-modal system he uses the same experts and the same database as Duc et Al. in the work presented above. The best results which he obtained for the SVM fusion module are FAR = 0.07% and FRR = 0.00%.

9. In 1999, Pigeon did propose in [135] a multi-modal person authentication approach based on simple fusion algorithms. In this multi-modal system the author uses a face identification expert based on template matching, a profile identification expert based on a chamfer matching algorithm, and a text-dependent speech expert based on person-dependent HMM models for isolated digits. All experiments are performed on the M2VTS database, and the best results are obtained for a fusion module based on a linear discriminant function: FRR = 0.78% and FAR = 0.07%.

10. Finally, a person recognition system using unconstrained audio and video has been proposed. The system does not need fully frontal face images or clean speech as input. The face expert is based on the eigenfaces approach, and the audio expert uses a text-independent HMM using Gaussian Mixture Models (GMMs). The combination of these two experts is performed using a Bayes net. The system was tested on a specific database containing 26 persons and the best results obtained using the best images and audio clips from an entire session are: FAR = 0.00% and FRR = 0.00%.
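Two of the simplest combination rules reviewed above, the 2-out-of-3 majority voting of Dieckmann et Al. and the score product used by Hong and Jain under the independence hypothesis, can be sketched as follows (a minimal illustration, not the authors' implementations):

```python
def majority_vote(decisions):
    """Decision-level fusion: accept when a strict majority of the
    binary expert decisions (True = accept) is positive, e.g. 2 out
    of 3 as in Dieckmann et Al."""
    return sum(decisions) > len(decisions) / 2

def product_fusion(scores):
    """Score-level fusion by simple multiplication, the rule obtained
    under the hypothesis that the experts are statistically
    independent; the fused score is then thresholded."""
    fused = 1.0
    for score in scores:
        fused *= score
    return fused
```

Majority voting discards the score magnitudes, while the product rule keeps them but is pulled strongly towards rejection by any single low score.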

2.7.2 Results obtained on the M2VTS database

To facilitate the comparison with the work presented in this thesis, we have isolated from the previous state of the art the results which have been obtained on the same M2VTS database as the one we have been working on. These results are presented in Table 2.1 hereafter. Where available, the confidence interval is indicated between square brackets. Care should be taken however when comparing these results, since the experts used are not necessarily the same for all methods. The last line in this Table represents the best results obtained in this thesis, using a logistic regression model.

Table 2.1: State of the art of the verification results obtained on the M2VTS database.

Author(s)       Experts                  FRR (%)      FAR (%)
Duc et Al.      frontal, vocal           0.00         0.54
Jourlin et Al.  lips, vocal              2.80         0.50
Kittler et Al.  frontal, profile, vocal  0.70 (EER)   0.70 (EER)
Ben-Yacoub      frontal, vocal           0.00         0.07
Pigeon          frontal, profile, vocal  0.78         0.07
Verlinde        frontal, profile, vocal  0.00         0.00

2.8 Comments

No single biometric can meet the limits of all applications. We have seen that voice is one of the most popular biometrics, thanks to its high acceptability and its user-friendliness [88]. Since voice is a behavioral biometric modality and since in a multi-modal approach it is wise to complement a behavioral modality with a physiological one, we wanted to add a physiological modality which also was highly acceptable. These considerations have led to choose the visual modality. In the framework of the M2VTS application, another important criterion for choosing the different biometrics was the availability of the hardware. With respect to the tele-services, the idea was to use so-called multi-media PC's, which are equipped with low-cost microphones and CCD-cameras. These considerations reinforce each other and they explain why in the multi-modal system presented in this work, voice and vision were used as the two (complementary) biometric modalities. Analyzing the state of the art in automatic biometric multi-modal identity verification systems, it has been shown that on the M2VTS database, the best method presented in this thesis (the logistic regression model of Table 2.1) outperforms the previously reported results.

Experimental setup

3.1 Introduction

This chapter starts by presenting the M2VTS database used in this work. After this, the experimental protocol used for testing the individual experts and the fusion modules is described. Finally, the three different biometric experts (a frontal, a profile and a vocal one) we have been using throughout this work are briefly introduced and their individual performances are highlighted. This is followed by a thorough statistical analysis of the results given by these three different experts for both client and impostor accesses. In this analysis it is shown that the distribution of the scores per expert and per type of access (the so-called conditional probability density functions) do not satisfy the Normality hypothesis. Furthermore it is shown that the chosen experts do have good discriminatory power, and are complementary. The potential gain obtained by combining the results of these three different experts is shown by means of a simple linear classifier.

3.2 The M2VTS audio-visual person database

The M2VTS [1] multi-modal database comprises 37 different persons and provides 5 shots for each person. These shots were taken at intervals of at least one week. During each shot, people were asked (1) to count from "0" to "9" in French (which was the native language for most of the people) and (2) to rotate their head from 0 to -90 degrees, back to 0 and further to +90 degrees, and finally back again to 0 degrees. The most difficult shot to recognize is the 5th shot. This shot mainly differs from the others by face variations (presence of a hat/scarf, ...), voice variations or shot imperfections (poor focus, different zoom factor, poor voice signal to noise ratio, ...). More details with respect to this database can be found in [136, 137, 135].

Taking into account the specificity of our problem (i.e. combining outputs of several experts) we are not going to use this 5th shot, since we are not interested in developing individual powerful experts that work well even under these extreme conditions as presented by shot number 5.

To show the quality of the pictures contained in the small M2VTS database, Figures 3.1, 3.2, and 3.3 show respectively the frontal views of some persons, the rotation sequence and the 5 different shots for one and the same person [135].

3.3 Experimental protocol

3.3.1 General issues

In the most general (but rich) case, three different data sets are needed for training, fine-tuning and testing the individual experts. The first data set is called the training set and is used by each expert to model the different persons. The second data set is called the development or validation set and is used to fine-tune the different experts, for instance by calculating the decision thresholds. The third data set is called the test set and it is used to test the performances of the obtained experts. For the fusion module, we can define in the most general case exactly the same data sets as in the case of the individual experts. This general concept of the use of the different data sets is illustrated in Figure 3.4. This does not necessarily mean that one always will need six completely separated data sets, since the fact that the test set for the individual experts is completely dissociated from the development of the experts, makes it suitable to be reused for the fusion module. Furthermore, not all types of experts, nor all fusion modules do include the modeling of the persons. This means that in the particular case of experts and fusion modules which do not use data to model persons and in the obvious case in which we do reuse the expert test set as a data set for the fusion module, one only needs three different data sets instead of six in the most general case. This is illustrated in Figure 3.5. In the intermediate case, where the experts do need separate training and development data, but the fusion module does not need any development data, one needs four different data sets, as illustrated in Figure 3.6.

Figure 3.2: M2VTS database: views taken from a rotation sequence.

Figure 3.3: M2VTS database: frontal views of one person coming from the 5 different shots.

Figure 3.4: The most general case where 6 different data sets are used.

[Figure 3.5 block diagram: the expert is trained on data set 1 and tested on data set 2; the fusion module is trained on data set 2 and tested on data set 3.]

Figure 3.5: The case where only three different data sets are needed.

Figure 3.6: The intermediate case where four different data sets are needed.

- If the test data is the same as the training data, performances will be overestimated. This is true for both the individual experts and the fusion module. This is of course due to the fact that the experts and the fusion module will generate the best results for the same data they have been trained on.

- If the training data for the experts is the same as for the fusion module, the fusion module will be underperforming. The reason for this is that the fusion module doesn't get enough information. Indeed, in the extreme case of experts that perform perfectly on their training data, the outcome of such an expert would be either 0 or 1, which leaves the fusion module with the arbitrary choice of setting the threshold somewhere in between.

3.3.2 Experimental protocol

For our experiments, we have opted for a very simple experimental protocol. In this protocol we use only the first four sessions of the M2VTS database:

1. The first enrollment session has been used to train the individual experts. This means that each access has been used to model the respective client, yielding 37 different client models.

2. Then the accesses from each person in the second enrollment session have been used to generate validation data in two different manners. Once to derive one single client access by matching the shot of a specific person with its own reference model, and once to generate 36 impostor accesses by matching it to the 36 models of the other persons of the database. This simple strategy thus leads to 37 client and 36 × 37 = 1,332 impostor accesses, which have been used for validating the performance of the individual experts and for calculating thresholds.

3. The third enrollment session has been used to test these experts, using the thresholds calculated on the validation data set. This same data set has also been used to train the fusion modules, which again leads to 37 client and 1,332 impostor reference points.

4. Finally, the fourth enrollment session has been used to test the fusion modules, yielding once more the same number of client and impostor claims.
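The client and impostor access counts produced by this protocol follow directly from the number of enrolled persons, as this small sketch shows:

```python
def protocol_accesses(n_persons):
    """Client and impostor access counts per session under the protocol
    above: each person is matched once against his own model (client
    access) and once against each of the other persons' models
    (impostor accesses)."""
    n_client = n_persons
    n_impostor = n_persons * (n_persons - 1)
    return n_client, n_impostor
```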

The drawback of this simple protocol is that the impostors are known at the expert and supervisor training time. In section 8.3.2, validation results will be presented using a protocol that does not suffer from the same drawback. This validation protocol is implemented using a so-called leave-one-out method [49].

3.4 Identity verification experts

3.4.1 Short presentation

All the experiments in this thesis have been performed using three different identity verification experts. Each one of these experts will be described briefly hereafter.

Profile image expert

The profile image verification expert is described in detail in [138], which has inspired the description hereafter. The profile of the person under test is matched against the stored reference profile corresponding to the claimed identity. The candidate image profile is extracted from the profile images by means of color-based segmentation. The similarity of the two profiles is measured using the Chamfer distance computed sequentially [28]. The efficiency of the verification process is aided by pre-computing a distance map for each reference profile. The map stores the distance of each pixel in the profile image to the nearest point on the reference profile. As the candidate profile can be subject to translation, rotation and scaling, the objective of the matching stage is to compensate for such geometric transformations. The parameters of the compensating transformation are determined by minimizing the chamfer distance between the template and the transformed candidate profile. The optimization is carried out using a simplex algorithm which requires only the distance function evaluation and no derivatives. The convergence of the simplex algorithm to a local minimum is prevented by a careful initialization of the transformation parameters. The translation parameters are estimated by comparing the position of the nose tip in the two matched profiles. The scale factor is derived from the comparison of the profile heights and the rotation is initially set to zero. Once the optimal set of transformation parameters is determined, the user is accepted or rejected depending on the relationship of the minimal chamfer distance to a pre-specified threshold. The system can be trained very easily. It is sufficient to store one profile per client in the training set.
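The core of this matching can be illustrated with a toy sketch. Its assumptions: profiles are given as small point sets, the distance map is computed by brute force instead of the sequential algorithm of [28], and the simplex search over translation, rotation and scale is left out.

```python
from math import hypot

def distance_map(reference_profile, width, height):
    """Pre-computed map: for every pixel, the Euclidean distance to the
    nearest point of the reference profile (brute force for clarity)."""
    return {(x, y): min(hypot(x - rx, y - ry) for rx, ry in reference_profile)
            for x in range(width) for y in range(height)}

def chamfer_distance(candidate_profile, dmap):
    """Average distance from the candidate profile points to the
    reference profile, read off the pre-computed map."""
    return sum(dmap[p] for p in candidate_profile) / len(candidate_profile)
```

A real implementation would evaluate `chamfer_distance` inside the simplex loop, once per candidate transformation.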

Frontal image expert

The frontal image verification expert is described in detail in [116] and the description hereafter was based on the presentation of this expert in [98]. This frontal image expert is based on robust correlation of a frontal face image of the person under test and the stored face template corresponding to the claimed identity. A search for the optimum correlation is performed in the space of all valid geometric and photometric transformations of the input image to obtain the best possible match with respect to the template. The geometric transformation includes translation, rotation and scaling, whereas the photometric transformation corrects for a change of the mean level of illumination. The search technique for the optimal transformation parameters is based on random exponential distributions. Accordingly, at each stage the transformation between the test and reference images is perturbed by a random vector drawn from an exponential distribution. The matching criterion reflects both the agreement between the transformed face image and the template, and the similarity of the intensity distributions of the two images. The degree of similarity is measured with a robust kernel. This ensures that gross errors due to, for instance, hair style changes do not swamp the cumulative error between the matched images. In other words, the matching is benevolent, aiming to find as large areas of the face as possible, supporting a close agreement between the respective gray-level histograms of the two images. The gross errors will be reflected in a reduced overlap between the two images, which is taken into account in the overall matching criterion. The system is trained very easily by means of storing one template for each client. Each reference image is segmented to create a face mask which excludes the background and the torso as these are likely to change over time.

Vocal expert

The vocal identity verification expert is presented in detail in [22]. This text-independent speaker verification expert is based on a similarity measure between speakers, calculated on second order statistics [21].

In this algorithm a first covariance matrix X is generated from a reference sequence, consisting of M m-dimensional acoustical vectors, and pronounced by the person whose identity is claimed:

X = (1/M) Σ_{i=1}^{M} X_i X_i^T ,  where X_i^T is X_i transposed.

A second covariance matrix Y is then generated in the same way from a sequence, consisting of M m-dimensional acoustical vectors, and pronounced by the person under test.

Then a similarity measure between these two speakers is performed, based on the sphericity measure μ_AH(X, Y):

μ_AH(X, Y) = log(A / H),

A(λ_1, λ_2, ..., λ_m) = (1/m) Σ_{i=1}^{m} λ_i = m^{-1} tr(Y X^{-1}),

H(λ_1, λ_2, ..., λ_m) = m ( Σ_{i=1}^{m} 1/λ_i )^{-1} = m [ tr(X Y^{-1}) ]^{-1},

where λ_1, ..., λ_m are the eigenvalues of Y X^{-1}.

It can be shown that this sphericity measure is always non-negative and it is equal to zero only in the case that the two covariance matrices X and Y are the same. The verification process consists then of comparing the obtained sphericity measure with a decision threshold, calculated on a validation database.

One of the great advantages of this algorithm is that no explicit extraction of the m eigenvalues λ_i is necessary, since the sphericity measure only needs the calculation of the trace tr(·) of the matrix product Y X^{-1} or X Y^{-1}.
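As a toy numerical illustration of this trace-only computation (an assumption of the sketch is to restrict the matrices to 2×2 nested lists so that the inversion stays explicit; in practice m is the dimension of the acoustic vectors and a linear algebra library would be used):

```python
from math import log

def _trace_product_inverse(a, b):
    """tr(A B^-1) for 2x2 matrices given as nested lists."""
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    b_inv = [[ b[1][1] / det, -b[0][1] / det],
             [-b[1][0] / det,  b[0][0] / det]]
    # trace of the product A . B^-1
    return sum(a[i][0] * b_inv[0][i] + a[i][1] * b_inv[1][i] for i in range(2))

def sphericity(x, y, m=2):
    """Arithmetic-harmonic sphericity mu_AH(X, Y) = log(A / H), computed
    from the two traces only, without extracting the eigenvalues."""
    arithmetic = _trace_product_inverse(y, x) / m   # A = tr(Y X^-1) / m
    harmonic = m / _trace_product_inverse(x, y)     # H = m / tr(X Y^-1)
    return log(arithmetic / harmonic)
```

When X and Y coincide, both traces equal m, so A = H = 1 and the measure is zero; any spread between the eigenvalues makes A exceed H and the measure positive.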

3.4.2 Performances

The performances achieved by the three mono-modal identity verification systems which have been used in these experiments are given in Table 3.1. The results have been obtained by adjusting the threshold at the EER on the validation set and applying this threshold as an a priori threshold on the test set. Observing the results for the profile and the frontal experts it can be seen that, although the optimization has been done according to the EER criterion, the FRR and the FAR are very different. This indicates that for these two experts, the training and validation sets are not very representative of the test set.

Table 3.1: Verification results for individual experts.

Expert   FRR (%)             FAR (%)            TER (%)
         (37 tests)          (1,332 tests)      (1,369 tests)
Profile  21.6 [11.4, 37.2]   8.5 [7.1, 10.1]    8.9 [7.5, 10.5]
Frontal  21.6 [11.4, 37.2]   8.3 [6.9, 9.9]     8.7 [7.3, 10.3]
Vocal     5.4 [1.5, 17.7]    3.6 [2.7, 4.7]     3.7 [2.8, 4.8]

3.4.3 Statistical analysis of the different experts

Introduction

A statistical analysis of the individual experts is important to get an idea on one hand of their individual discriminatory power, and of their complementarity on the other hand.

The power of an expert to discriminate between clients and impostors will increase (for given variances) with the difference between the mean value of the scores obtained for client accesses and the mean value of the scores obtained for impostor accesses. The typical statistical test to see if there exist significant differences between the means (or more generally between the statistical moments of first order) of several populations is the so-called analysis of variance (ANOVA). In the general case, this analysis is implemented using an F-test. In the specific case of two populations, this ANOVA could also be performed using an independent samples t-test [123]. Another important characteristic of an expert is its variance (or more generally the statistical moment of second order). The equality of variances can be tested by a Levene test, which is also implemented using an F-test [114]. It is advantageous that the variance of an expert is the same for clients and for impostors, because this leads to simpler methods to combine the different experts (see chapter 6). Obviously we will need to perform t- and F-tests to analyze the means and the variances of the different experts. However, the t- and F-tests give only exact results if the populations have a Normal distribution. So before we can use t- or F-tests, we need to verify the Normality of the different populations. Thus this is the first statistical analysis that we need to perform. Since the ANOVA is only valid if the variances of the different populations per expert are equal, we have to check the equality of variances before performing the ANOVA. These remarks explain the forced order of the first three analyses that are presented below.

We can get an idea of the independence of the different experts (and thus of the amount of extra information that each expert brings in), by analyzing their correlation. And a linear discriminant analysis gives us a first idea of the combined discriminatory power of the experts.

Last but not least, the analysis of the extreme values gives us insight into the possible use of personalized approaches.
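The first of these comparisons can be illustrated with a pooled independent-samples t statistic computed by hand (a minimal sketch: any score lists fed to it are illustrative, and the Levene, F- and Normality tests are not sketched here).

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(client_scores, impostor_scores):
    """Pooled (equal-variance) independent-samples t statistic for the
    difference between the mean client and the mean impostor scores of
    one expert; large absolute values indicate good discrimination."""
    na, nb = len(client_scores), len(impostor_scores)
    pooled_var = ((na - 1) * variance(client_scores)
                  + (nb - 1) * variance(impostor_scores)) / (na + nb - 2)
    return ((mean(client_scores) - mean(impostor_scores))
            / sqrt(pooled_var * (1.0 / na + 1.0 / nb)))
```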

Analysis of Normality

The purpose of a Normality analysis is to check whether the observed data do or do not support the hypothesis H0 that the underlying probability density function is Normal. There exist two types of tests to perform this analysis: objective (numerical) and subjective (graphical) tests. An important remark related to the verification of H0 is that the assumption of Normality is much more difficult to verify when using small sample sizes.
