
Bayesian Distance Metric Learning on i-vector for Speaker Verification

by

Xiao Fang

B. S., Electrical Engineering

University of Science and Technology of China, 2011

Submitted to the Department of Electrical Engineering and Computer Science

in partial fulfillment of the requirements for the degree of

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


September 2013

© Massachusetts Institute of Technology 2013. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 30, 2013

Certified by: James R. Glass, Senior Research Scientist, Thesis Supervisor

Certified by: Najim Dehak, Research Scientist, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Students


Bayesian Distance Metric Learning on i-vector for Speaker Verification

by

Xiao Fang

Submitted to the Department of Electrical Engineering and Computer Science on August 30, 2013, in partial fulfillment of the requirements for the degree of Master of Science

Abstract

This thesis explores the use of Bayesian distance metric learning (Bayes-dml) for the task of speaker verification using the i-vector feature representation. We propose a framework that exploits the distance constraints between i-vector pairs from the same speaker and from different speakers. With the distance metric approximated as a weighted combination of the top eigenvectors of the data covariance matrix, variational inference is used to estimate a posterior distribution of the distance metric. Given speaker labels, we select the different-speaker data pairs with the highest cosine scores to form a different-speaker constraint set. This set captures the most discriminative between-speaker variability that exists in the training data. The system is evaluated on the female part of the 2008 NIST SRE dataset. Cosine similarity scoring, as the state-of-the-art approach, is compared to Bayes-dml. Experimental results show that Bayes-dml performs comparably to cosine similarity scoring. Furthermore, Bayes-dml is insensitive to score normalization, in contrast to cosine similarity scoring. Because it does not require a large number of labeled examples per speaker, Bayes-dml performs better in the context of limited training data.

Thesis Supervisor: James R. Glass
Title: Senior Research Scientist

Thesis Supervisor: Najim Dehak
Title: Research Scientist


Acknowledgments

First and foremost, I would like to thank Jim Glass and Najim Dehak for offering me the opportunity to do research in their group. I am grateful to Jim for his consideration and patience all the time, for always guiding me down the right path. I appreciate Najim's passion and brilliance. Najim's broad knowledge and thoughtful insights in this field have inspired me a lot through this thesis work. This thesis would not have been possible without them.

The Spoken Language Systems group has provided me a research home in the past two years. Thank you to all the members for your energy, creativity, and friendship. The birthday celebrations, the spectrum reading seminars, and the defense suit-ups would be unforgettable memories in my life.

I would like to thank all my friends, old and new, locally available and geographically separated, for accompanying and encouraging me, for sharing tears and joy with me. Lastly, I would like to thank my parents and my elder brother for their love, inspiration, and support all these years.

This thesis was supported by DARPA under the RATS (Robust Automatic Transcription of Speech) Program.


Contents

1. Introduction

2. Background and Related Work
   2.1. Speech Parameterization
   2.2. The GMM-UBM Approach
   2.3. Data Sets and Evaluations
   2.4. Chapter Summary

3. Factor Analysis Based Speaker Verification
   3.1. Joint Factor Analysis
   3.2. Total Variability Approach
   3.3. Cosine Similarity Scoring
   3.4. Probabilistic Linear Discriminant Analysis
        3.4.1. Training
        3.4.2. Recognition
   3.5. Chapter Summary

4. Compensation Techniques
   4.1. Session Compensation
        4.1.1. Linear Discriminant Analysis (LDA)
        4.1.2. Within-Class Covariance Normalization (WCCN)
        4.1.3. Nuisance Attribute Projection (NAP)
   4.2. Score Normalization
   4.3. Chapter Summary

5. Distance Metric Learning
   5.1. Neighborhood Component Analysis
   5.2. Bayesian Distance Metric Learning Framework
   5.3. Variational Approximation
   5.4. Chapter Summary

6. Experimental Results
   6.1. Experimental Set-up
   6.2. Parameter Selection
   6.3. Results Comparison
   6.4. Results on Limited Training Data
   6.5. Results on Short Duration Data
   6.6. Chapter Summary

7. Conclusion and Future Work
   7.1. Summary and Contributions


List of Figures

2.1. Pipeline for MFCC
2.2. Maximum a posteriori (MAP) adaptation [3] [9]
3.1. Essentials of Joint Factor Analysis [7] [9]
6.1. minDCF on the female part of the core condition of the NIST 2008 SRE based on the LDA technique for dimensionality reduction with different score normalization methods
6.2. Comparison of score histograms from Cosine Score (blue: non-target scores, red: target scores)
6.3. Comparison of score histograms from Bayes-dml (blue: non-target scores, red: target scores)


List of Tables

6.1. Comparison of cosine score, Bayes-dml, and GPLDA w/o score normalization on the female part of the core condition of the NIST 2008 SRE
6.2. Comparison of Cosine Score-combined norm and Bayes-dml with different preprocessing techniques on the female part of the core condition of the NIST 2008 SRE
6.3. Comparison with results from other literature on the female part of the core condition of the NIST 2008 SRE
6.4. Comparison of Cosine Score-combined norm and Bayes-dml on the female part of the core condition of the NIST 2008 SRE with limited training data (three training utterances per speaker)
6.5. Comparison of Cosine Score-combined norm and Bayes-dml on the female part of the core condition of the NIST 2008 SRE with short duration data


Chapter 1

Introduction

Speaker verification is the use of a machine to verify a person's claimed identity from his/her voice. The applications of speaker verification cover almost all the areas where it is necessary to secure actions, transactions, or any type of interaction by identifying the person. Currently, most applications are in the banking and telecommunication areas. Compared to other biometric systems, which are based on different modalities such as a fingerprint or face image, the voice has some compelling advantages [1]. First, speech is easy to acquire at low cost. The telephone system provides a ubiquitous means of obtaining and delivering speech signals. For telephone-based applications, there is no need to install special signal transducers or networks at application access points, since a cell phone gives one access almost everywhere. For non-telephone applications, sound cards and microphones are also cheap devices that are readily available. Second, speech is a natural signal that users do not find threatening; providing a speech sample for authentication is not perceived as an intrusive step.

In the last decade, research in speaker verification has made great progress, and successful commercial applications have appeared in some products. Depending on whether the spoken phrase is fixed or not, a speaker verification system can be classified as text-dependent or text-independent [1]. We focus on the text-independent speaker verification system in this research.

A typical speaker verification system involves two steps: feature extraction from the speech signal, and statistical modeling of the feature parameters. Since they were proposed in the mid-1990s, Gaussian Mixture Models (GMMs) have been the dominant approach for modeling text-independent speaker verification [2]. In the past decade, the GMM-based system with Bayesian adaptation of speaker models from a universal background model and score normalization has achieved the top performance in the NIST (National Institute of Standards and Technology) speaker recognition evaluations (SRE) [3]. This system is referred to as the Gaussian Mixture Model-Universal Background Model (GMM-UBM) speaker verification system.

In the GMM-UBM approach, the speaker's model is derived from the UBM via maximum a posteriori (MAP) adaptation. When the speaker training data is limited, some Gaussian components may be prevented from being adapted [10]. In order to address this problem, the theory of Joint Factor Analysis (JFA) is used for speaker modeling [6]. JFA-based methods model both speaker and channel/session variability in the context of a GMM [4] [5]. A more recent approach represents all the variabilities in a single low-dimensional space named the total variability space, with no distinction between speaker and channel subspaces [13]. A speech utterance is represented by a new vector called total factors (also referred to as an i-vector) in this new space. The i-vector contains both speaker and channel variability. We can generally treat i-vectors as input to common classifiers such as Support Vector Machines (SVMs), a cosine distance classifier, or probabilistic linear discriminant analysis (PLDA). In [13], the authors show that cosine distance scoring achieves state-of-the-art performance. The i-vector training and verification scoring process does not use speaker labels at all, which suggests that algorithms making full use of speaker labels might achieve better performance.

Note that the basic speaker verification task is to determine whether the test utterance and the target utterance are from the same speaker. Thus we can view the speaker verification system as a distance metric learning problem: given speaker labels of training utterances, we aim to find an appropriate distance metric that brings "similar" utterances (belonging to the same speaker) close together while separating "dissimilar" utterances (belonging to different speakers) [32]. In this thesis, we present a speaker verification system based on the distance metric learning framework. In [33], Yang and Jin present a Bayesian framework for distance metric learning, which has achieved high classification accuracy in image classification. In addition, this approach is insensitive to the number of labeled examples for each class, whereas most algorithms require a large number of labeled examples [33]. This advantage is particularly important for realistic speaker verification systems, as it can be difficult to collect plenty of samples from every speaker in many industrial applications, although it is possible to collect samples from a large number of different speakers.


The rest of this thesis is organized as follows: Chapter 2 gives a background review of speech parameterization and Gaussian Mixture Models. Chapter 3 introduces the theory of factor analysis in speaker verification. The compensation techniques that remove nuisance variabilities among different trials are explained in Chapter 4. The Bayesian distance metric learning framework is then presented in Chapter 5. Chapter 6 describes the experimental set-up for the system and presents results. Finally, Chapter 7 concludes this thesis and suggests possible directions for future work.


Chapter 2

Background and Related Work

The speech signal conveys rich information, such as the words or message being spoken, the language spoken, the topic of the conversation, and the emotion, gender, and identity of the speaker. Automatic speaker recognition aims to recognize the identity of the speaker from a person's voice. The general area of speaker recognition involves two fundamental tasks. The Speaker Identification task is to determine who produced the speech test segment. Usually it is assumed that the unknown voice must come from a fixed set of known speakers; thus, the system performs a 1 : N classification, referred to as closed-set identification. The Speaker Verification task is to determine whether the claimed identity of the speaker is the same as the identity of the person who produced the speech segment. In other words, given a segment of speech and a hypothesized speaker Q, the task of speaker verification is to determine whether this segment was spoken by the speaker Q. Since the impostors who falsely claim to be a target speaker are generally not known to the system, this task is referred to as an open-set classification. This thesis studies the problem of Speaker Verification.

In most speaker verification systems, an input speech utterance is compared to an enrolled target speaker model, resulting in a similarity measure computed between them, also called a similarity score. The process of computing a score from a speaker model and a test speech utterance is usually called a trial. Trials are classified as target or non-target depending on whether the training and test speech were generated by the same individual. The users attempting to access the system are referred to as target users when their identity is the same as the claimed one; otherwise they are called impostors.

Human speech contains numerous discriminative features that can be used to identify speakers. The objective of automatic speaker verification is to extract, characterize, and recognize the information about speaker identity. The speech signal is first transformed to a set of feature vectors in a front-end processing step. The aim of this transformation is to obtain a new representation that is more compact, less redundant, and more suitable for statistical modeling. The output of this stage is typically a sequence of feature vectors x = {x_1, x_2, ..., x_L}, where x_l is a feature vector indexed by l ∈ {1, 2, ..., L}. An implicit assumption often used is that x contains speech from only one speaker; thus, this task is better termed single-speaker verification. The single-speaker verification task can be stated as a basic hypothesis test between two hypotheses:

H0: x is from the hypothesized speaker Q,

H1: x is not from the hypothesized speaker Q.

The optimal test to decide between these two hypotheses is the likelihood ratio test:

$$\frac{P(x \mid H_0)}{P(x \mid H_1)} \;\begin{cases} \ge \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \end{cases}$$

where θ is the decision threshold. Since the likelihood values are usually very small and may underflow the available numerical precision, the log-likelihood ratio is often used instead:

$$\log P(x \mid H_0) - \log P(x \mid H_1) \;\begin{cases} \ge \theta & \text{accept } H_0 \\ < \theta & \text{accept } H_1 \end{cases}$$

The main goal in designing a speaker detection system is to determine techniques to compute values for the two likelihoods, P(x | H0) and P(x | H1).

This chapter introduces the commonly used speech parameterization techniques and the statistical modeling used to calculate these likelihoods.


2.1 Speech Parameterization

Most current speech parameterizations used in speaker verification systems rely on a cepstral representation of speech [1]. The extraction and selection of the best parametric representation of acoustic signals is an important task in the design of any speaker verification system [36]. Among the audio features that have been successfully used in the field are Mel-frequency cepstral coefficients (MFCC), which are derived from the logarithm of the short-term energy spectrum expressed on a mel-frequency scale, and linear predictive coding (LPC) [36]. The calculation of the MFCC includes the following steps.

A. Mel-frequency warping

The human perception of sound frequency does not follow a linear scale. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another [41]. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1000 Hz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mel frequency for a given frequency f in Hz:

$$\mathrm{Mel}(f) = 2595 \times \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2.1)$$

The common approach to approximating the subjective spectrum is to use a filter bank. The speech signal is first sent through a pre-emphasis (high-pass) filter to compensate for the high-frequency part that was suppressed during sound production and to amplify the importance of high-frequency formants, and is then segmented into frames. Each frame is multiplied by a Hamming window in order to preserve continuity at the boundaries. We perform a discrete Fourier transform on each frame and map the result to the mel-frequency spectrum via the filter bank. The filter bank has triangular band-pass frequency responses, whose center frequency spacing and bandwidth are determined by a constant mel-frequency interval. The mel-scale filter bank used in this thesis is a series of 23 triangular band-pass filters designed to approximate the band-pass filtering believed to occur in the auditory system. The log energy within each filter is a log mel-frequency spectral coefficient, denoted S_j, j = 1, 2, ..., 23.

B. Cepstrum

In the final step, we convert the log mel spectrum back to "time". The result is called the Mel-Frequency Cepstral Coefficients (MFCC). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel-frequency spectral coefficients (and their logarithms) are real numbers, we can convert them to a time-like domain, called the quefrency domain, using the discrete cosine transform (DCT):

$$C_i = \sum_{j=1}^{23} S_j \cos\left[\frac{i\pi}{23}\left(j - \frac{1}{2}\right)\right] \qquad (2.2)$$

The complete process for the calculation of MFCC is shown in Figure 2.1.

Figure 2.1: Pipeline for MFCC
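To make the pipeline concrete, the following NumPy sketch implements the steps described above (pre-emphasis, framing, Hamming windowing, DFT, a 23-filter triangular mel filter bank, log, and DCT). The frame length, hop size, pre-emphasis coefficient, and sample rate are illustrative assumptions, not values specified in the thesis.

```python
import numpy as np

def hz_to_mel(f):
    # Equation 2.1: Mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose centers are equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fb[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(signal, sr=8000, frame_len=200, hop=80, n_filters=23, n_ceps=13):
    # Pre-emphasis boosts the high-frequency part suppressed in production.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    window = np.hamming(frame_len)
    fb = mel_filterbank(n_filters, frame_len, sr)
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, frame_len)) ** 2
        s = np.log(fb @ power + 1e-10)             # log mel spectrum S_j
        # Equation 2.2: DCT of the log mel spectrum gives the cepstrum.
        i = np.arange(n_ceps)[:, None]
        j = np.arange(1, n_filters + 1)[None, :]
        feats.append((s * np.cos(i * np.pi / n_filters * (j - 0.5))).sum(axis=1))
    return np.array(feats)                          # (n_frames, n_ceps)
```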

2.2 The GMM-UBM Approach

The next step after obtaining the parametric representation is the selection of the likelihood functions P(x | H0) and P(x | H1). For notational purposes, we let H0 be represented by a probabilistic model λ_Q that characterizes the hypothesized speaker Q, and we use λ_Q̄ to represent the probabilistic model of the alternative hypothesis H1. The classical approach is to model each speaker as a probabilistic source with an unknown but fixed probability density function. While the model λ_Q is well defined and can usually be estimated via some enrollment speech from the speaker Q, the model λ_Q̄ is less well defined, since it potentially must represent the entire space of possible alternatives to the hypothesized speaker.

The approach typically used to tackle the problem of alternative hypothesis modeling is to pool speech from many non-target speakers and train a single model known as the Universal Background Model (UBM). The advantage of this approach is that a single speaker-independent model can be trained once for a particular task and then used for all hypothesized speakers in the task [9].

The GMM is a generative model used widely in speaker verification to model the feature distribution. A GMM is composed of a finite mixture of multivariate Gaussian components. Given a GMM θ consisting of C components, the likelihood of observing an F-dimensional feature vector x is defined as

$$P(x \mid \theta) = \sum_{c=1}^{C} \pi_c \, \mathcal{N}_c(x \mid \mu_c, \Sigma_c) \qquad (2.3)$$

where the mixture weights π_c > 0 are constrained by Σ_c π_c = 1, and N_c(x | μ_c, Σ_c) is a multivariate Gaussian with F-dimensional mean vector μ_c and F × F covariance matrix Σ_c:

$$\mathcal{N}_c(x \mid \mu_c, \Sigma_c) = \frac{1}{(2\pi)^{F/2} |\Sigma_c|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c)\right\} \qquad (2.4)$$

The parameters of the model are denoted θ = {θ_1, θ_2, ..., θ_C}, where θ_c = {π_c, μ_c, Σ_c}. While the general model form supports full covariance matrices, typically only diagonal covariance matrices are used. This is because the density modeling of an M-th order full-covariance GMM can generally be equally well achieved using a larger-order diagonal-covariance GMM, and diagonal-covariance GMMs are more computationally efficient than full-covariance GMMs [1].

For a sequence of feature vectors x = {x_1, x_2, ..., x_L}, we assume that each observation vector is independent of the others. The likelihood of the given utterance is then the product of the likelihoods of the L frames, and the log-likelihood is computed as

$$\log P(x \mid \theta) = \sum_{l=1}^{L} \log P(x_l \mid \theta) \qquad (2.5)$$
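As an illustration of Equations 2.3-2.5, here is a minimal NumPy sketch that evaluates the log-likelihood of a frame sequence under a diagonal-covariance GMM; the log-sum-exp trick guards against the numerical underflow that motivates working in the log domain.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Log-likelihood of frames X (L x F) under a diagonal-covariance GMM,
    Equations 2.3-2.5. weights: (C,), means/variances: (C, F)."""
    L, F = X.shape
    # Per-component log N_c(x_l | mu_c, Sigma_c) for every frame.
    diff = X[:, None, :] - means[None, :, :]                   # (L, C, F)
    log_gauss = -0.5 * (F * np.log(2 * np.pi)
                        + np.log(variances).sum(axis=1)[None, :]
                        + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_weighted = np.log(weights)[None, :] + log_gauss        # (L, C)
    # log sum_c pi_c N_c(x_l) via log-sum-exp, then sum over frames.
    m = log_weighted.max(axis=1, keepdims=True)
    log_px = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return log_px.sum()
```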

The maximum likelihood (ML) parameters of a model θ can be estimated via the Expectation-Maximization (EM) algorithm; the equations for the ML parameter updates can be found in [8]. The UBM is trained on a selection of speech that is reflective of the expected alternative speech to be encountered during recognition. It represents the speaker-independent distribution of features.

For the speaker model, a single GMM could be trained on the speaker's enrollment data; however, the amount of speaker-specific data would be much too limited to give a good representation of the speaker, and we might end up modeling the channel characteristics or other aspects of the data instead. In contrast, the larger abundance of speech data used to estimate the UBM is a better starting point for modeling a specific speaker. Thus we derive the speaker's model via maximum a posteriori (MAP) adaptation from the well-trained parameters of the UBM. This provides a tighter coupling between the speaker's model and the UBM, which not only produces better performance than separate (decoupled) models, but also allows for a fast-scoring technique.

The MAP adaptation is similar to the EM algorithm, and it also allows the fast log-likelihood ratio scoring technique. Given a UBM parameterized by θ_UBM and training feature vectors from a speaker x = {x_1, x_2, ..., x_L}, we first calculate the probabilistic alignment between each training frame and the UBM mixture components. For UBM mixture c, we compute

$$\gamma_l(c) = P(c \mid x_l, \theta_{\mathrm{UBM}}) = \frac{\pi_c \, \mathcal{N}_c(x_l \mid \mu_c, \Sigma_c)}{\sum_{c'=1}^{C} \pi_{c'} \, \mathcal{N}_{c'}(x_l \mid \mu_{c'}, \Sigma_{c'})} \qquad (2.6)$$

and the relevant Baum-Welch statistics for the weight, mean, and covariance parameters of the UBM are

$$N_c(x) = \sum_{l=1}^{L} \gamma_l(c) \qquad (2.7)$$

$$E_c(x) = \frac{1}{N_c(x)} \sum_{l=1}^{L} \gamma_l(c) \, x_l \qquad (2.8)$$

$$E_c(xx^T) = \frac{1}{N_c(x)} \sum_{l=1}^{L} \gamma_l(c) \, x_l x_l^T \qquad (2.9)$$

The UBM parameters for mixture c are then updated from these sufficient statistics of the training data to generate adapted parameters:

$$\hat{\pi}_c = \left[\alpha_c N_c(x)/L + (1 - \alpha_c)\pi_c\right]\beta \qquad (2.10)$$

$$\hat{\mu}_c = \alpha_c E_c(x) + (1 - \alpha_c)\mu_c \qquad (2.11)$$

$$\hat{\Sigma}_c = \alpha_c E_c(xx^T) + (1 - \alpha_c)(\Sigma_c + \mu_c\mu_c^T) - \hat{\mu}_c\hat{\mu}_c^T \qquad (2.12)$$

Here β is a scale factor computed over all adapted mixture weights to ensure that Σ_c π̂_c = 1, and the α_c are data-dependent adaptation coefficients controlling the balance between the old and new estimates of the GMM parameters. The coefficients are defined as

$$\alpha_c = \frac{N_c(x)}{N_c(x) + r} \qquad (2.13)$$

where r is a constant relevance factor.

The data-dependent adaptation coefficient allows mixture-dependent adaptation of parameters. For mixture components with a low probabilistic count N_c(x) of the user data, α_c → 0 de-emphasizes the new parameters and emphasizes the old parameters. For mixture components with a high probabilistic count N_c(x), α_c → 1 favors the new speaker-dependent parameters. The relevance factor controls how much new data must be observed in a mixture before the new parameters begin to replace the old ones. Thus this approach is robust to limited training data.
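The following NumPy sketch illustrates Equations 2.6-2.13 for mean-only adaptation (weights and covariances are left at their UBM values, as discussed below); the relevance factor r = 16 is an illustrative choice, not a value taken from the thesis.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """MAP adaptation of UBM means (Equations 2.6-2.13).
    X: (L, F) speaker frames; only the mean vectors are adapted."""
    # Posterior responsibilities gamma_l(c), Equation 2.6.
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (means.shape[1] * np.log(2 * np.pi)
                        + np.log(variances).sum(axis=1)[None, :]
                        + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    logp = np.log(weights)[None, :] + log_gauss
    logp -= logp.max(axis=1, keepdims=True)
    gamma = np.exp(logp)
    gamma /= gamma.sum(axis=1, keepdims=True)               # (L, C)
    # Baum-Welch statistics, Equations 2.7-2.8.
    n_c = gamma.sum(axis=0)                                 # N_c(x)
    e_c = gamma.T @ X / np.maximum(n_c, 1e-10)[:, None]     # E_c(x)
    # Data-dependent adaptation coefficients, Equation 2.13.
    alpha = (n_c / (n_c + r))[:, None]
    # Adapted means, Equation 2.11.
    return alpha * e_c + (1.0 - alpha) * means
```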

The adaptation of the mean and covariance parameters of the observed Gaussians is displayed in Figure 2.2. In practice, only the mean vectors μ_c, c = 1, ..., C, are adapted, since updating the weights and covariance matrices does not significantly affect system performance. The selection of the number of Gaussian components depends on the type and amount of training data, such as telephone data or microphone data, gender-independent or gender-dependent.

Figure 2.2: Maximum a posteriori (MAP) adaptation [3] [9]

2.3 Data Sets and Evaluations

Our experiments are carried out on the NIST 2008 speaker recognition evaluation (SRE) dataset [39]. NIST SRE is an ongoing series of evaluations focusing on the core technology issues in the field of text-independent speaker recognition. The systems have to answer the question, "Did speaker X produce the speech recording Y, and to what degree?" Each trial requires a decision score reflecting the system's estimate of the probability that the test segment contains speech from the target speaker.

Detection system performance is usually characterized in terms of two error measures, the miss probability P_Miss|Target and the false alarm probability P_FalseAlarm|Nontarget. These respectively correspond to the probability of not detecting the target speaker when present, and the probability of falsely detecting the target speaker when not present. Different operating points generate different P_Miss|Target and P_FalseAlarm|Nontarget. We care most about the operating point where the two error rates are equal; the resulting rate is called the equal error rate (EER). Another formal evaluation measure is the detection cost function (DCF), defined as a weighted sum of the miss and false alarm probabilities:

$$C_{\mathrm{Det}} = C_{\mathrm{Miss}} \times P_{\mathrm{Miss|Target}} \times P_{\mathrm{Target}} + C_{\mathrm{FalseAlarm}} \times P_{\mathrm{FalseAlarm|Nontarget}} \times (1 - P_{\mathrm{Target}})$$

The parameters C_Miss and C_FalseAlarm are the relative costs of the detection errors, and P_Target is the a priori probability of the specified target speaker. The primary evaluation uses C_Miss = 10, C_FalseAlarm = 1, and P_Target = 0.01.

In addition to the single-number measures of minDCF and EER, more information can be shown in a graph plotting all the operating points. An individual operating point corresponds to a score threshold separating actual decisions of true or false, and all possible operating points are generated by sweeping over all possible threshold values. NIST has used Detection Error Tradeoff (DET) curves since the 1996 evaluation [3]; these plot the two error rates of the receiver operating characteristic (ROC) curve on the x and y axes on a normal deviate scale. DET curves are widely used to represent detection system performance, and the C_Det value and EER correspond to specific operating points on the DET curve.
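A minimal sketch of how the EER and the minimum DCF can be computed from a set of trial scores by sweeping all thresholds; the default cost parameters are the ones stated above.

```python
import numpy as np

def eer_and_min_dcf(target_scores, nontarget_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep all score thresholds to find the EER and the minimum DCF."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # With the threshold just above the i-th sorted score: targets at or
    # below it are misses, non-targets above it are false alarms.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    eer_idx = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[eer_idx] + p_fa[eer_idx]) / 2.0
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return eer, dcf.min()
```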

2.4 Chapter Summary

In this chapter, we described the speech parameterization used to transform a speech utterance into a sequence of feature vectors for statistical modeling. Our focus was on the computation of MFCCs, since they are used in subsequent chapters. We also presented the GMM-UBM approach, the classical statistical modeling approach for speaker recognition. The maximum a posteriori approach for obtaining the speaker model from the UBM was described in detail, and the benefit of this adaptation was explained. Finally, the datasets and evaluation metrics for the experiments were introduced.


Chapter 3

Factor Analysis Based Speaker Verification

The GMM-UBM approach achieved great success, but suffers from data sparsity in MAP adaptation [9]. Since each Gaussian component is updated independently, some components of the UBM may be prevented from being adapted, and thus fail to capture a thorough and complete representation of the speaker's true model in the presence of limited speaker training data [6]. It is necessary to correlate or link together the different Gaussian components of the UBM. The theory of Joint Factor Analysis (JFA) is used to achieve this goal [25].

This chapter presents a thorough description of the idea and mechanism of Joint Factor Analysis. A good overview of JFA for speech processing can be found in [9]. Two scoring approaches, cosine similarity scoring and probabilistic linear discriminant analysis, are introduced afterwards.

3.1 Joint Factor Analysis

In the JFA framework, a speaker model obtained by adaptation from a UBM (parameterized with C mixture components in a feature space of dimension F) can also be viewed as a single supervector of dimension C·F along with a diagonal super-covariance matrix of dimension CF × CF [7] [9]. The supervector is generated by concatenating the mean vectors of the Gaussian mixtures, while the super-covariance matrix is generated by placing the diagonal covariance matrix of each mixture along its diagonal.

The idea behind factor analysis is that a measured high-dimensional vector, i.e., the speaker supervector, may be believed to lie in a lower-dimensional subspace. Another assumption in JFA is that the speaker- and channel-dependent supervector M for a given utterance can be decomposed into the sum of two supervectors:

$$M = s + c \qquad (3.1)$$

where the supervector s depends on the speaker and the supervector c depends on the channel. They are modeled as

$$s = m + Vy + Dz \qquad (3.2)$$

$$c = Ux \qquad (3.3)$$

where m is the speaker- and channel-independent supervector, interpreted as the initial UBM supervector. V and U are low-rank matrices that represent the lower-dimensional subspaces in which the speakers and channels lie, known as the eigenvoices and the eigenchannels, respectively. Lastly, D is a diagonal CF × CF matrix that models the residual variabilities of the speakers not captured by V. The vectors y, z, and x are the speaker- and session-dependent factors in their respective subspaces, and each is assumed to be a random variable with a normal distribution N(0, I).

The basic idea is displayed in Figure 3.1, and a detailed explanation can be found in [8]. The fact that the three latent variables y, z, and x are estimated jointly accounts for the terminology Joint Factor Analysis.

Figure 3.1: Essentials of Joint Factor Analysis [7] [9]

The JFA approach represents speaker variabilities and compensates for channel variabilities better than the GMM-UBM approach, but it is complex in both theory and implementation. A simplified solution, called total variability, was subsequently developed with superior performance.

3.2 Total Variability Approach

Experiments show that the channel factors in Joint Factor Analysis also contain information about the speakers [8]. Based on this, an approach was proposed that does not distinguish between speaker variability and channel variability. Given an utterance, the speaker- and channel-dependent GMM supervector M is represented as

$$M = m + Tw \qquad (3.4)$$

where m is the speaker- and channel-independent supervector (which can be taken to be the UBM supervector), T is a rectangular matrix of low rank, and w is a random vector having a standard normal distribution N(0, I). T defines the new total variability space, and the remaining variabilities not captured by T are accounted for in a diagonal covariance matrix Σ. In this model, the high-dimensional supervectors lie around m in a relatively lower-dimensional subspace, and w is the speaker- and channel-dependent factor in the total variability space. The mean of the posterior distribution of w corresponds to a total factor vector, or an i-vector, which can be seen as a low-dimensional speaker verification feature. The name i-vector is short for Intermediate Vector, for the intermediate representation between an acoustic feature vector and a supervector, or Identity Vector, for its compact representation of a speaker's identity [13].

The parameter training in the total variability approach is also based on the EM algorithm [13] [31]. The main difference from learning the eigenvoice matrix V is that every recording in a given speaker's set of utterances is regarded as having been produced by a different speaker when training T, whereas all the utterances of a given speaker are considered to belong to the same person when training V. The speaker characteristics are not learned explicitly in the total variability approach, whereas the latent variable y represents the speaker variability in Joint Factor Analysis. A thorough explanation of the key details for estimating T and extracting w can be found in [9].

From here on, we use the posterior mean of w as a low-dimensional representation of the utterance. The fixed-length i-vectors can be used as input to standard recognition algorithms to produce the desired likelihood score [11] [26]. Two scoring approaches are introduced next: cosine similarity scoring and probabilistic linear discriminant analysis.

3.3 Cosine Similarity Scoring

As the only latent variable learned from each utterance, the low-dimensional i-vector is a full and final representation of a speaker's and channel's identity. Thus, total variability can be used as a front-end feature extraction method, and there is no need to calculate a log-likelihood ratio scoring function as in the GMM-UBM and JFA approaches [26]. Recently, cosine similarity scoring has been applied to compare two i-vectors for making a speaker detection decision [13]. With i-vectors of the target speaker utterance w_target and the test utterance w_test in hand, verification is carried out using the cosine similarity score:

$$\mathrm{score}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{w_{\mathrm{target}}^T \, w_{\mathrm{test}}}{\|w_{\mathrm{target}}\| \cdot \|w_{\mathrm{test}}\|} \;\gtrless\; \theta \qquad (3.5)$$

where θ is the decision threshold.

Since the i-vector contains both speaker and session variabilities, we need session compensation for cosine similarity scoring, which is explained in detail in Section 4.1. A more sophisticated approach that directly models session variability within i-vectors was recently introduced by Kenny [17] [20] as Probabilistic Linear Discriminant Analysis (PLDA) [18].
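A minimal sketch of Equation 3.5; the optional projection argument stands in for the session-compensation transforms of Chapter 4 and is an illustrative assumption.

```python
import numpy as np

def cosine_score(w_target, w_test, projection=None):
    """Cosine similarity score between two i-vectors (Equation 3.5).
    `projection` is an optional session-compensation matrix, such as the
    LDA, WCCN, or NAP projections described in Chapter 4."""
    if projection is not None:
        w_target, w_test = projection @ w_target, projection @ w_test
    return (w_target @ w_test) / (np.linalg.norm(w_target)
                                  * np.linalg.norm(w_test))

# The target hypothesis is accepted when the score exceeds a decision
# threshold theta, which in practice is tuned on development data.
```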

3.4 Probabilistic Linear Discriminant Analysis

Probabilistic Linear Discriminant Analysis (PLDA) is similar to the JFA approach, but uses i-vectors rather than GMM supervectors as the basis for factor modeling [18]. Suppose there are I speakers, each with J utterances, in the training set. The j-th i-vector of the i-th speaker is denoted by w_ij. We model data generation by the process

$$w_{ij} = m + F s_i + G u_{ij} + \epsilon_{ij} \qquad (3.6)$$

Each utterance is comprised of two parts: the signal component m + F s_i, which depends only on the speaker identity and not on the particular utterance; and the noise component G u_ij + ε_ij, which is different for every utterance of the speaker and represents session variability. In Equation 3.6, m is the overall mean of all the training utterances, F is the eigenvoice matrix, and G is the eigenchannel matrix. The columns of F and G contain bases for the between-speaker subspace and within-speaker subspace, respectively, and s_i and u_ij represent the positions in the corresponding subspaces. The remaining variability not captured is explained by the residual noise term ε_ij, which follows a Gaussian prior with diagonal covariance Σ. Usually we define Gaussian priors on the latent variables s_i and u_ij. Kenny [17] investigated both Gaussian and heavy-tailed prior distributions for s_i, u_ij, and ε_ij, but we only investigate Gaussian priors. This model can also be described in terms of the following conditional probabilities:

$$P(w_{ij} \mid s_i, u_{ij}, \theta) = \mathcal{N}(m + F s_i + G u_{ij}, \Sigma) \qquad (3.7)$$

$$P(s_i) = \mathcal{N}(0, I) \qquad (3.8)$$

$$P(u_{ij}) = \mathcal{N}(0, I) \qquad (3.9)$$

Using this model involves two steps: a training phase to learn the parameters θ = {m, F, G, Σ}, and a recognition phase to infer whether two utterances come from the same speaker. The latent variable s_i identifies the speaker; thus recognition evaluates the likelihood that two utterances are generated from the same underlying s_i.

3.4.1 Training

The parameters θ = {m, F, G, Σ} are obtained by maximizing the likelihood of the training dataset. As in the total variability approach, the latent variables and the parameters are both unknown and must be estimated, so we again use the EM algorithm to estimate the two sets of parameters.

E-step: Calculate the full posterior distribution over the latent variables s_i and u_ij, given the current parameter values. We simultaneously estimate the joint probability distribution of all the latent variables s_i, u_i1, ..., u_iN that pertain to each speaker. First we combine the generative equations for all of the N utterances of speaker i:

$$\begin{bmatrix} w_{i1} \\ w_{i2} \\ \vdots \\ w_{iN} \end{bmatrix} = \begin{bmatrix} m \\ m \\ \vdots \\ m \end{bmatrix} + \begin{bmatrix} F & G & 0 & \cdots & 0 \\ F & 0 & G & \cdots & 0 \\ \vdots & & & \ddots & \\ F & 0 & 0 & \cdots & G \end{bmatrix} \begin{bmatrix} s_i \\ u_{i1} \\ u_{i2} \\ \vdots \\ u_{iN} \end{bmatrix} + \begin{bmatrix} \epsilon_{i1} \\ \epsilon_{i2} \\ \vdots \\ \epsilon_{iN} \end{bmatrix} \qquad (3.10)$$

Renaming these composite matrices, we write

$$w_i = m' + A y_i + \epsilon_i \qquad (3.11)$$

This compound model can be rewritten in terms of probabilities:

$$P(w_i \mid y_i) = \mathcal{N}(m' + A y_i, \Sigma') \qquad (3.12)$$

$$P(y_i) = \mathcal{N}(0, I) \qquad (3.13)$$

where

$$\Sigma' = \begin{bmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & & \ddots & \\ 0 & 0 & \cdots & \Sigma \end{bmatrix} \qquad (3.14)$$

Applying Bayes' rule, we obtain the posterior distribution

$$P(y_i \mid w_i, \theta) \propto P(w_i \mid y_i, \theta) \, P(y_i) \qquad (3.15)$$

The posterior on the left must be Gaussian, since both terms on the right are Gaussians. It can be shown that the first two moments of this Gaussian are

$$E[y_i] = (A^T \Sigma'^{-1} A + I)^{-1} A^T \Sigma'^{-1} (w_i - m') \qquad (3.16)$$

$$E[y_i y_i^T] = (A^T \Sigma'^{-1} A + I)^{-1} + E[y_i] E[y_i]^T \qquad (3.17)$$
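The posterior moments in Equations 3.16-3.17 can be computed directly; the sketch below stacks one speaker's i-vectors as in Equation 3.10. Variable names are illustrative, and for large N one would exploit the block structure rather than forming A explicitly.

```python
import numpy as np

def plda_posterior_moments(W, m, F, G, Sigma):
    """Posterior moments of y_i = [s_i; u_i1; ...; u_iN] (Eqs. 3.16-3.17).
    W: (N, d) i-vectors of one speaker; F: (d, q); G: (d, p); Sigma: (d, d)."""
    N, d = W.shape
    q, p = F.shape[1], G.shape[1]
    # Stack the N generative equations as w_i = m' + A y_i + eps_i (Eq. 3.10).
    A = np.zeros((N * d, q + N * p))
    for n in range(N):
        A[n * d:(n + 1) * d, :q] = F
        A[n * d:(n + 1) * d, q + n * p:q + (n + 1) * p] = G
    Sp_inv = np.kron(np.eye(N), np.linalg.inv(Sigma))   # block-diag Sigma'^-1
    w = (W - m).reshape(-1)                             # w_i - m'
    precision = A.T @ Sp_inv @ A + np.eye(q + N * p)
    Ey = np.linalg.solve(precision, A.T @ Sp_inv @ w)           # Eq. 3.16
    EyyT = np.linalg.inv(precision) + np.outer(Ey, Ey)          # Eq. 3.17
    return Ey, EyyT
```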

M-step: Optimize the point estimates of the parameters θ = {m, F, G, Σ}. We rewrite Equation 3.6 as

$$w_{ij} = m + \begin{bmatrix} F & G \end{bmatrix} \begin{bmatrix} s_i \\ u_{ij} \end{bmatrix} + \epsilon_{ij} = m + B z_{ij} + \epsilon_{ij} \qquad (3.18)$$

The M-step aims to optimize

$$Q(\theta_t, \theta_{t-1}) = \sum_{i,j} \int P(z_{ij} \mid w_{i1}, \ldots, w_{iJ}, \theta_{t-1}) \, \log\left[P(w_{ij} \mid z_{ij}, \theta_t) \, P(z_{ij})\right] dz_{ij} \qquad (3.19)$$

where t is the iteration index. Taking the derivatives with respect to B and Σ and setting them to zero yields the update rules

$$B = \left(\sum_{i,j} (w_{ij} - m) \, E[z_{ij}]^T\right) \left(\sum_{i,j} E[z_{ij} z_{ij}^T]\right)^{-1} \qquad (3.20)$$

$$\Sigma = \frac{1}{IJ} \sum_{i,j} \mathrm{diag}\left[(w_{ij} - m)(w_{ij} - m)^T - B \, E[z_{ij}] (w_{ij} - m)^T\right] \qquad (3.21)$$

The expectation terms E[z_ij] and E[z_ij z_ij^T] can be obtained from Equations 3.16 and 3.17 via the correspondence between y_i and z_ij. The updates for F and G are then read off from B according to the partitioning in Equation 3.18.

3.4.2 Recognition

Given two i-vectors w_target and w_test, the similarity score is computed as the logarithm of the ratio of the likelihoods of the two hypotheses: H0, that w_target and w_test belong to the same speaker (same s), and H1, that w_target and w_test belong to different speakers (different s). This score can be expressed as

$$S(w_{\mathrm{target}}, w_{\mathrm{test}}) = \log \frac{P(w_{\mathrm{target}}, w_{\mathrm{test}} \mid H_0)}{P(w_{\mathrm{target}} \mid H_1) \, P(w_{\mathrm{test}} \mid H_1)} = \log \frac{\int P(w_{\mathrm{target}}, w_{\mathrm{test}} \mid s) P(s) \, ds}{\int P(w_{\mathrm{target}} \mid s_1) P(s_1) \, ds_1 \int P(w_{\mathrm{test}} \mid s_2) P(s_2) \, ds_2} \qquad (3.23)$$

Each term in the denominator can be rewritten as

$$\int P(w \mid s) P(s) \, ds = \int \left[\int P(w \mid s, u) P(u) \, du\right] P(s) \, ds \qquad (3.24)$$

The numerator can be rewritten as

$$\int P(w_i, w_j \mid s) P(s) \, ds = \int \left[\int P(w_i \mid s, u_i) P(u_i) \, du_i\right] \left[\int P(w_j \mid s, u_j) P(u_j) \, du_j\right] P(s) \, ds \qquad (3.25)$$

Note that all the conditional probabilities in Equations 3.24 and 3.25 are defined in Equations 3.7, 3.8, and 3.9, so the log-ratio score in Equation 3.23 can be readily calculated. In effect, we decompose the likelihood by writing the joint likelihood of all observed and hidden variables, and then marginalize over the unknown hidden variables.

3.5 Chapter Summary

In this chapter, we explained factor analysis based speaker verification. We first introduced Joint Factor Analysis (JFA), which correlates or links together the different Gaussian components of the UBM. Based on JFA, a simplified solution, called total variability, was presented to give the i-vector representation. Cosine similarity scoring and probabilistic linear discriminant analysis for scoring i-vectors were also introduced.


Chapter 4

Compensation Techniques

There are many variabilities among different trials, e.g., speaker identity, transmission channel, utterance length, speaking style, etc. It has been shown that these variations have a negative impact on the system performance [15]. Thus compensation techniques are needed to cope with speech variability. Successful compensation techniques have been proposed at different levels, e.g., at the feature, model, session, or score level [19] [24].

In this chapter, we will talk about the compensation techniques at the session and score level. Three approaches for session compensation are introduced. Score normalization is explained as the compensation technique at the score level. We present the motivation of score normalization and the formulations of three score normalization methods.

4.1 Session Compensation

In the i-vector representation, there is no explicit compensation for inter-session variability. However, the low-dimensional representation lends itself to compensation techniques applied in the new space, with the benefit of less expensive computation as well.

4.1.1 Linear Discriminant Analysis (LDA)

LDA attempts to define new axes that minimize the within-class variance caused by session/channel effects while maximizing the variance between classes. The LDA optimization problem is to find the direction q that maximizes the Fisher criterion

$$J(q) = \frac{q^T S_b \, q}{q^T S_w \, q} \qquad (4.1)$$

where S_b and S_w are the between-class and within-class covariance matrices:

$$S_b = \sum_{r=1}^{R} (\bar{w}_r - \bar{w})(\bar{w}_r - \bar{w})^T \qquad (4.2)$$

$$S_w = \sum_{r=1}^{R} \sum_{i=1}^{n_r} (w_i^r - \bar{w}_r)(w_i^r - \bar{w}_r)^T \qquad (4.3)$$

Here w̄_r = (1/n_r) Σ_{i=1}^{n_r} w_i^r is the mean of the i-vectors of speaker r, n_r is the number of utterances for speaker r, w̄ is the speaker population mean vector (the mean of all the available training i-vectors), and R is the number of speakers. The projection matrix A is obtained by maximizing the Fisher criterion; it is composed of the top eigenvectors of the matrix S_w^{-1} S_b [13].

The new cosine kernel between two i-vectors w_1 and w_2 can be rewritten as

$$k(w_1, w_2) = \frac{(A w_1)^T (A w_2)}{\sqrt{(A w_1)^T (A w_1)} \, \sqrt{(A w_2)^T (A w_2)}} \qquad (4.4)$$
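A sketch of the LDA projection of Equations 4.1-4.3, solving the generalized eigenproblem S_b v = λ S_w v with SciPy; it assumes S_w is positive definite, which holds when the number of training utterances is much larger than the i-vector dimension.

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(ivectors, speaker_ids, n_dims):
    """LDA projection (Equations 4.1-4.3): rows of the returned matrix A
    are the top eigenvectors of Sw^{-1} Sb."""
    d = ivectors.shape[1]
    mean_all = ivectors.mean(axis=0)
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for spk in np.unique(speaker_ids):
        W = ivectors[speaker_ids == spk]
        mean_spk = W.mean(axis=0)
        diff = (mean_spk - mean_all)[:, None]
        Sb += diff @ diff.T                        # between-class scatter
        Sw += (W - mean_spk).T @ (W - mean_spk)    # within-class scatter
    # Generalized eigenproblem Sb v = lambda Sw v; keep the top n_dims.
    vals, vecs = eigh(Sb, Sw)
    return vecs[:, np.argsort(vals)[::-1][:n_dims]].T   # (n_dims, d)
```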

4.1.2 Within-Class Covariance Normalization (WCCN)

WCCN is used as a channel compensation technique that scales the space to attenuate dimensions of high within-class variance [14]. It is a linear feature projection which aims to minimize the risk of misclassification of SVM classifiers [16]. The projection matrix B is obtained such that B B^T = W^{-1}, where W is the average within-class covariance matrix of all the impostor speakers:

$$W = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{n_r} \sum_{i=1}^{n_r} (w_i^r - \bar{w}_r)(w_i^r - \bar{w}_r)^T \qquad (4.5)$$

In Equation 4.5, w̄_r = (1/n_r) Σ_{i=1}^{n_r} w_i^r is the mean of the i-vectors of speaker r, n_r is the number of utterances of speaker r, and R is the number of speakers. Detailed proofs and analysis for deriving the WCCN projection are provided in [14].

The cosine kernel based on the WCCN projection is given by

$$k(w_1, w_2) = \frac{(B^T w_1)^T (B^T w_2)}{\sqrt{(B^T w_1)^T (B^T w_1)} \, \sqrt{(B^T w_2)^T (B^T w_2)}} \qquad (4.6)$$
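A sketch of the WCCN estimate in Equation 4.5 and the Cholesky factor B with B B^T = W^{-1}; speaker_ids is an assumed label array aligned with the i-vector rows.

```python
import numpy as np

def wccn_projection(ivectors, speaker_ids):
    """WCCN projection (Equation 4.5): returns B with B B^T = W^{-1},
    where W is the average within-class covariance over training speakers."""
    d = ivectors.shape[1]
    speakers = np.unique(speaker_ids)
    W = np.zeros((d, d))
    for spk in speakers:
        X = ivectors[speaker_ids == spk]
        Xc = X - X.mean(axis=0)
        W += Xc.T @ Xc / len(X)
    W /= len(speakers)
    # Cholesky factor of W^{-1}; i-vectors are then mapped as B^T w,
    # which realizes the kernel w1^T B B^T w2 of Equation 4.6.
    return np.linalg.cholesky(np.linalg.inv(W))
```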


4.1.3 Nuisance Attribute Projection (NAP)

NAP is a technique that modifies the kernel distance between two feature vectors by removing subspaces that cause undesired kernel variability [16]. The projection matrix is formulated as

$$P = I - R R^T \qquad (4.7)$$

where R is a low-rank rectangular matrix whose columns are the k eigenvectors with the largest eigenvalues of the within-class covariance matrix [13].

The new cosine kernel can be rewritten as

$$k(w_1, w_2) = \frac{(P w_1)^T (P w_2)}{\sqrt{(P w_1)^T (P w_1)} \, \sqrt{(P w_2)^T (P w_2)}} \qquad (4.8)$$
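A sketch of Equation 4.7, where the within-class covariance matrix is assumed to have been computed as in Equation 4.3 or 4.5.

```python
import numpy as np

def nap_projection(within_class_cov, k):
    """NAP projection (Equation 4.7): remove the k directions of largest
    within-class (nuisance) variability via P = I - R R^T."""
    vals, vecs = np.linalg.eigh(within_class_cov)
    R = vecs[:, np.argsort(vals)[::-1][:k]]   # top-k nuisance eigenvectors
    return np.eye(within_class_cov.shape[0]) - R @ R.T
```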

4.2 Score Normalization

Variability compensation at the score level is also referred to as score normalization. These techniques apply a transformation to the output scores of a speaker verification system in order to reduce misalignments in the score ranges due to variations in the conditions of a trial. Score normalization is introduced to make a speaker-independent decision threshold more robust and effective.

The decision-making process used in speaker verification based on GMM-UBMs compares the likelihood ratio obtained from the claimed speaker model and the UBM model with a decision threshold. Due to score variability between verification trials, the choice of decision threshold is an important and troublesome problem. Score variability mainly comes from two sources: the varying quality of speaker modeling caused by variation in the enrollment data, and the possible mismatches and environment changes among test utterances.

Researchers use z-norm and t-norm to obtain a calibrated score [21]. We assume the length-normalized target speaker i-vector is w_target and the length-normalized test i-vector is w_test.

In the z-norm, the target speaker model is scored against a set of impostor utterances. The mean μ_znorm and standard deviation σ_znorm of these scores are estimated and used to normalize the target speaker score; each target speaker has an associated μ_znorm and σ_znorm. The z-normalized score is

$$\mathrm{score}_{\mathrm{znorm}}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{\mathrm{score}(w_{\mathrm{target}}, w_{\mathrm{test}}) - \mu_{\mathrm{znorm}}}{\sigma_{\mathrm{znorm}}} \qquad (4.9)$$

In cosine similarity scoring, μ_znorm = w_target^T w̄′ and σ_znorm = √(w_target^T C w_target) [12], where w̄′ is the mean of the "impostor" i-vectors and C is the impostor covariance matrix, C = E[(w − w̄′)(w − w̄′)^T]. Thus the z-normalized score can be rewritten as

$$\mathrm{score}_{\mathrm{znorm}}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{w_{\mathrm{target}}^T (w_{\mathrm{test}} - \bar{w}')}{\sqrt{w_{\mathrm{target}}^T C \, w_{\mathrm{target}}}} \qquad (4.10)$$

Similarly, the t-norm parameters are estimated from the scores of each test segment against a set of impostor speaker models. The mean μ_tnorm and standard deviation σ_tnorm of these scores are used to adjust the target speaker score; each test segment has an associated μ_tnorm and σ_tnorm. The t-normalized score is

$$\mathrm{score}_{\mathrm{tnorm}}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{\mathrm{score}(w_{\mathrm{target}}, w_{\mathrm{test}}) - \mu_{\mathrm{tnorm}}}{\sigma_{\mathrm{tnorm}}} \qquad (4.11)$$

In cosine similarity scoring, μ_tnorm = w_test^T w̄′ and σ_tnorm = √(w_test^T C w_test). Thus the t-normalized score can be rewritten as

$$\mathrm{score}_{\mathrm{tnorm}}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{(w_{\mathrm{target}} - \bar{w}')^T w_{\mathrm{test}}}{\sqrt{w_{\mathrm{test}}^T C \, w_{\mathrm{test}}}} \qquad (4.12)$$

In [12], Dehak proposed a new cosine similarity scoring, given as

$$\mathrm{score}(w_{\mathrm{target}}, w_{\mathrm{test}}) = \frac{(w_{\mathrm{target}} - \bar{w}')^T (w_{\mathrm{test}} - \bar{w}')}{\sqrt{w_{\mathrm{target}}^T C \, w_{\mathrm{target}}} \, \sqrt{w_{\mathrm{test}}^T C \, w_{\mathrm{test}}}} \qquad (4.13)$$


It can be treated as the combination of z-norm and t-norm score normalization, since it captures both the variabilities of different speaker models and the mismatches among different test utterances. This normalization is referred to as "combined norm" in the following chapters.
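A sketch of the combined-norm score (Equation 4.13, as reconstructed above), estimating the impostor mean and covariance from a held-out impostor i-vector set; all inputs are assumed to be length-normalized.

```python
import numpy as np

def combined_norm_score(w_target, w_test, impostor_ivectors):
    """Combined z-/t-norm cosine score (Equation 4.13). The impostor mean
    and covariance are estimated once from a held-out impostor set."""
    w_bar = impostor_ivectors.mean(axis=0)
    C = np.cov(impostor_ivectors, rowvar=False)
    num = (w_target - w_bar) @ (w_test - w_bar)
    den = np.sqrt(w_target @ C @ w_target) * np.sqrt(w_test @ C @ w_test)
    return num / den
```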

4.3 Chapter Summary

In this chapter, we introduced three techniques for session compensation. These inter-session compensation methods can remove the session variabilities between different trials. Two score normalization methods, t-norm and z-norm, were introduced, along with their corresponding representations in cosine similarity scoring. The new cosine similarity scoring proposed by Dehak [12] was also introduced and will be used in the following experiments.


Chapter 5

Distance Metric Learning

With i-vectors as low-dimensional representations of speech utterances, a cosine distance classifier measures the distance between the target user utterance and the test utterance. Although cosine similarity scoring has proven effective in speaker verification, we would like to exploit the hidden structure of the i-vector space. Defining the distance metric between vectors in a feature space is a crucial problem in machine learning [38]. A learned metric can significantly improve performance in classification, clustering, and retrieval tasks [32] [33]. The objective of distance metric learning is to learn, from a given collection of pairs of similar/dissimilar points, a distance metric that preserves the distance relations among the training data [32]. Since the basic speaker verification task is to determine whether the test utterance and the target utterance are from the same speaker, a good distance metric can differentiate utterances from different speakers well, and thus achieve good performance in speaker verification.

In this chapter, we explore two supervised distance metric learning methods. As a classical distance metric learning algorithm, Neighborhood Component Analysis (NCA) is introduced first. However, its point estimation of the distance metric and its unreliability with limited training examples make NCA not as powerful as expected. Thus the Bayesian framework is presented to estimate a posterior distribution for the distance metric, which places no requirement on the number of training examples.

5.1 Neighborhood Component Analysis

Neighborhood Component Analysis (NCA) [34] learns a distance metric that minimizes the average leave-one-out (LOO) K-nearest-neighbor (KNN) classification error under a stochastic selection rule. The k-nearest-neighbor classifier identifies the labeled data points that are closest to a given test data point, which requires a distance metric. An appropriately designed distance metric can significantly benefit KNN classification accuracy compared to the standard Euclidean distance. We briefly review the key idea of NCA below.

Given a labeled data set consisting of i-vectors w_1, w_2, ..., w_n and corresponding speaker labels y_1, y_2, ..., y_n, we want to find a distance metric that maximizes the performance of nearest-neighbor classification. Ideally, we would like to optimize performance on future test data, but since we do not know the true data distribution, we instead attempt to optimize the leave-one-out (LOO) performance on the training data. In what follows, we restrict ourselves to learning Mahalanobis (quadratic) distance metrics, which can always be represented by symmetric positive semi-definite matrices. We estimate such metrics through their inverse square roots, by learning a linear transformation of the input space such that KNN performs well in the transformed space. If we denote the transformation by a matrix B, we are effectively learning a metric Q = B^T B such that

$$d(w_i, w_j) = (w_i - w_j)^T Q \, (w_i - w_j) = (B w_i - B w_j)^T (B w_i - B w_j)$$

The actual LOO classification error of KNN is a discontinuous function of the transformation B, since an infinitesimal change in B may change the neighbor graph and thus affect LOO classification performance by a large amount. Instead, we adopt a better-behaved measure of nearest-neighbor performance by introducing a differentiable cost function based on stochastic (soft) neighbor assignments in the transformed space. In particular, each utterance w_i selects another utterance w_j as its neighbor with some probability p_ij, and inherits its speaker label from the utterance it selects. We define p_ij using a softmax over Euclidean distances in the transformed space:

$$p_{ij} = \frac{\exp(-\|B w_i - B w_j\|^2)}{\sum_{k \neq i} \exp(-\|B w_i - B w_k\|^2)}, \qquad p_{ii} = 0 \qquad (5.1)$$

The probability that utterance w_i selects a neighbor from the same speaker is p_i = Σ_{j∈C_i} p_ij, where C_i is the set of utterances from the same speaker as i. The projection matrix B is chosen to maximize the expected number of utterances that select neighbors from the same speaker:

$$B = \arg\max_B f(B) = \arg\max_B \sum_i \sum_{j \in C_i} p_{ij} = \arg\max_B \sum_i p_i \qquad (5.2)$$


A conjugate gradient method is used to obtain the optimal B. Differentiating f with respect to the projection matrix B gives the gradient

$$\frac{\partial f}{\partial B} = 2B \sum_i \left( p_i \sum_k p_{ik} \, w_{ik} w_{ik}^T - \sum_{j \in C_i} p_{ij} \, w_{ij} w_{ij}^T \right) \qquad (5.3)$$

where w_ij = w_i − w_j.
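A NumPy sketch of the NCA objective and gradient in Equations 5.2-5.3, suitable for plugging into a gradient-based optimizer such as conjugate gradient; it is a didactic O(n²) implementation, not an optimized one.

```python
import numpy as np

def nca_objective_and_grad(B, W, labels):
    """NCA objective f(B) (Equation 5.2) and gradient (Equation 5.3).
    B: (r, d) linear transform; W: (n, d) i-vectors; labels: (n,) speakers."""
    n = W.shape[0]
    Z = W @ B.T                                     # transformed i-vectors
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                    # enforces p_ii = 0
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)               # soft neighbor probs p_ij
    same = labels[:, None] == labels[None, :]
    p_i = (P * same).sum(axis=1)                    # p_i = sum_{j in C_i} p_ij
    grad = np.zeros_like(B)
    for i in range(n):
        diffs = W[i] - W                            # rows are w_ij = w_i - w_j
        term_all = (diffs.T * P[i]) @ diffs         # sum_k p_ik w_ik w_ik^T
        m = same[i]
        term_same = (diffs[m].T * P[i][m]) @ diffs[m]
        grad += 2 * B @ (p_i[i] * term_all - term_same)
    return p_i.sum(), grad
```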

i k jECi

5.2 Bayesian Distance Metric Learning Framework

NCA provides a point estimate of the distance metric and can be unreliable when the number of training examples is small. The work in [33] presents a Bayesian framework that estimates a posterior distribution for the distance metric by placing a prior distribution on it.

Given the speaker label of each utterance, we can form two sets of constraints: a same-speaker set S and a different-speaker set D. The probability that two utterances w_i and w_j belong to the same speaker or to different speakers is defined, for a given distance matrix A, as

$$P(y_{ij} \mid w_i, w_j, A, \alpha) = \frac{1}{1 + \exp\left(y_{ij}\left(\|w_i - w_j\|_A^2 - \alpha\right)\right)} \qquad (5.4)$$

where

$$y_{ij} = \begin{cases} +1 & (w_i, w_j) \in S \\ -1 & (w_i, w_j) \in D \end{cases}$$

The parameter α is the threshold used to differentiate same-speaker pairs from different-speaker pairs: two utterances are more likely to be identified as coming from the same speaker only when their distance with respect to the distance matrix A is less than α. The complete likelihood function for all the constraints in S and D is

$$P(S, D \mid A, \alpha) = \prod_{(i,j) \in S} \frac{1}{1 + \exp\left(\|w_i - w_j\|_A^2 - \alpha\right)} \times \prod_{(i,j) \in D} \frac{1}{1 + \exp\left(-\|w_i - w_j\|_A^2 + \alpha\right)} \qquad (5.5)$$


We introduce a Wishart prior for the distance metric A and a Gamma prior for the threshold α:

$$P(A) = \frac{|A|^{(\nu - m - 1)/2}}{Z_\nu(W)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(W^{-1} A)\right) \qquad (5.6)$$

$$P(\alpha) = \frac{\alpha^{b-1}}{Z(b)} \exp(-\alpha) \qquad (5.7)$$

where Z_ν(W) and Z(b) are the normalization factors. Plugging the priors into the likelihood function, we obtain the posterior distribution

$$P(A, \alpha \mid S, D) = \frac{P(A)\, P(\alpha)\, P(S, D \mid A, \alpha)}{\int_A \int_0^\infty P(A)\, P(\alpha)\, P(S, D \mid A, \alpha) \, d\alpha \, dA} \qquad (5.8)$$

The optimal A and α are obtained by maximizing the posterior distribution above. But the integration over the space of positive semi-definite matrices makes the estimation computationally intractable, so an efficient algorithm is needed to compute P(A, α | S, D).

To simplify the computation, the distance metric A is modeled in a parametric form based on the top eigenvectors of the observed data points [33]. Let X = (w_1, w_2, ..., w_n) denote all the available utterances, and let v_l, l = 1, ..., K, be the top K eigenvectors of XX^T. If we assume A = Σ_{l=1}^{K} γ_l v_l v_l^T, where γ_l ≥ 0, l = 1, 2, ..., K, the likelihood P(y_ij | w_i, w_j) can be rewritten as

$$P(y_{ij} \mid w_i, w_j, A, \alpha) = \frac{1}{1 + \exp\left(y_{ij}\left(\sum_{l=1}^{K} \gamma_l \, \omega_{ij}^l - \alpha\right)\right)} = \sigma\left(-y_{ij} \, \tilde{\gamma}^T \tilde{\omega}_{ij}\right) \qquad (5.9)$$

where

$$\omega_{ij}^l = \left[(w_i - w_j)^T v_l\right]^2, \quad \tilde{\omega}_{ij} = (-1, \omega_{ij}^1, \ldots, \omega_{ij}^K)^T, \quad \tilde{\gamma} = (\alpha, \gamma_1, \ldots, \gamma_K)^T, \quad \sigma(z) = \frac{1}{1 + \exp(-z)}$$
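A sketch of the eigenvector parametrization and the pairwise log-likelihood of Equation 5.9; V holds the top-K eigenvectors of XX^T as columns, and the input names are illustrative assumptions.

```python
import numpy as np

def pair_features(w_i, w_j, V):
    """omega_ij^l = ((w_i - w_j)^T v_l)^2 for eigenvectors V (d, K),
    extended with the constant -1 that multiplies the threshold alpha."""
    proj = (w_i - w_j) @ V                  # projections onto v_1..v_K
    return np.concatenate([[-1.0], proj ** 2])

def pair_log_likelihood(pairs, y, V, gamma_ext):
    """Log-likelihood of labeled pairs under Equation 5.9.
    gamma_ext = (alpha, gamma_1, ..., gamma_K); y in {+1, -1}
    (+1: same speaker, -1: different speakers)."""
    ll = 0.0
    for (w_i, w_j), y_ij in zip(pairs, y):
        z = -y_ij * (gamma_ext @ pair_features(w_i, w_j, V))
        ll -= np.logaddexp(0.0, -z)         # log sigma(z), numerically stable
    return ll
```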

