• Aucun résultat trouvé

EURECOM submission to the Albayzin 2016 speaker diarization evaluation

N/A
N/A
Protected

Academic year: 2022

Partager "EURECOM submission to the Albayzin 2016 speaker diarization evaluation"

Copied!
1
0
0

Texte intégral

(1)

Acknowledgments

This work was partially supported with funding from the French Agence Nationale  de la Recherche Project ODESSA (ANR­15­CE39­0010)

EURECOM submission to the Albayzin  EURECOM submission to the Albayzin 

2016 Speaker Diarization Evaluation 2016 Speaker Diarization Evaluation

Jose Patino

1

, Héctor Delgado

1

, Nicholas Evans

1

, Xavier Anguera

2

1EURECOM, Sophia­Antipolis, France

2ELSA Corp., Portugal

{patino,delgado,evans}@eurecom.fr, [email protected]

Speaker diarization is the task of segmenting an audio document into speaker­homogeneous  segments and clustering those segments according to speaker identities 

Applications:

   Enabling speaker adaptation in ASR systems

Enabling speaker recognition in multi­speaker data

Spoken document indexing and retrieval

Albayzin Evaluation: Segmenting broadcast audio documents according to different speakers and  attributing those segments to the speaker who uttered them, without any prior information about  the speaker identities nor their number

EURECOM submission: 

   Infinite impulse response – constant Q, Mel­frequency cepstral coefficients (ICMC) [1]

   Speaker diarization system based on the binary key modelling [2], an efficient and compact  speech and speaker representation

   External training data is not required: the test data itself is used for training

   System does not perform overlapping speech detection

Introduction

DER may be decomposed as:

Speaker error: time wrongfully assigned to a speaker

Missed speech: time where speech is present but is not labelled  by the system

False alarm: amount of time that has been assigned to speech  which is not present

Evaluation metric

Error = average

i

durmiss

i

 durfa

i

durref

i

SER=

 n/

Diarization error time for each segment n

Diarization Error Rate (DER)

Database: 100 hours split into two parts: 

   87 hours for training from the Catalan  broadcast news database

  4 and 16 hours for development and test  respectively from the Corporación 

Aragonesa de Radio y Televisión (CARTV)

Acoustic classes annotations provided

Speech  

Music 

Background noise

Database

Results

Primary system Contrastive system

Task Time xRT Time xRT

Feature extraction 00:49:11 0.046 00:49:11 0.046

Speaker diarization 00:39:59 0.044 00:32:01 0.035

Overall 01:29:10 0.045 01:21:12 0.0405

All participants results Execution time

Execution time taken by primary and contrastive systems  when processing the test data (around 16 hours of audio). 

Real time factor (xRT) is also provided

Conclusions

T(n): Duration of segment n

Nref(n): Number of speakers that are present in segment n Nsys(n): Number of system speakers present in segment n

Ncorrect(n): Number of reference speakers in segment n which 

are correctly assigned by the diarization system

EURECOM AHOLAB GTM­UVIGO ATVS­UAM 0

5 10 15 20 25 30 35

4.27 4.30

5.24 13.84 13.96 6.80

17.00

20.00

18.16 18.32

25.68

32.62

3.44 5.82

 

Speaker error  time (%)

Missed speech  time (%)

False alarm  time (%)

DER of submitted systems on the development and test data (with 0.5% and 0.05% false alarm rates, respectively) 

All participants primary systems results

(with 0.05% and 0.06% false alarm rates for systems 1 and 2)

 KBM size as a percentage of the initial           number of Gaussian components, which is    related to the length of the speech content

 Exploration of relative KBM size led to          two working points, submitted as primary     and contrastive system

Very low execution times, suitable for  processing big amounts of data and for  online operation with low latencies

Contrastive system provides a faster  alternative at the cost of decreased  performance

The top two systems performed very  similarly with only an absolute 0.16% 

difference

Systems 3 and 4 exhibited higher missed  speech and false alarm error rates, possibly  influencing speaker error rates

System results

Speaker diarization system based on the ICMC features and binary key modelling

   Data from the training set was not employed. System tuned on the development set only

Two different working points were chosen in order to compose the KBM in the primary  and contrastive systems

Best result of experiments on development data is 11.93% DER

Official result on the Albayzin Speaker Diarization Evaluation is 18.16% DER

System was the 1st ranked among the submissions, with a slight difference over the 2nd  system and an approximate absolute 7% over the 3rd ranked

Binary key speaker diarization system

... ...

Gaussian pool

Select 1st Gaussian Initialize

cosine distances

Update cosine distances

Select Gaussian With the biggest

cosine distance

Reached 1 cluster

Acoustic processing Binary processing

Feature

extraction KBM

training

Feature

binarization Cluster

initialization

Agglomerative clustering

Best clustering selection

no

yes

Best speaker clustering

... ...

...

...

1 8 15 26 3 19 3 2 ... 17 . . .

0 0 1 1 0 1 0 0 1

KBM N Gauss x[1] x[2] x[M]

x[n]

Lkld(x[i]|λ1) Lkld(x[i]|λ2)

Lkld(x[i]|λN)

Initial clustering

Initial clusters' training

Segment assignment1

Clusters' training

Closest cluster pair merging1 and new

cluster training

Reach one cluster?

Resegmentation No Yes

Cumulative Vector (CV) Binary Key

(BK)

Repeat until getting N Gaussians

References

19­order ICMC

25ms window

10ms rate

20­channel filterbank

Liftering

0 1 2 3 4 5 6 7 8 9 10 11 12 13 141516 17 18 19 2021 22 23 20

40 60 80 100 120 140 160

Number of clusters

Within-class sum-of-squares

Elbow

Selected number of clusters

Longest distance to the

straght line Straight line linking first and last points

of the curve

Primary Contrastive 0

5 10 15 20

4.27 4.27

13.84 15.94 18.16

20.26

Test  

Speaker     error time  (%)

Missed  speech time  (%)

False alarm  time (%)

5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 11

12 13 14 15 16 17 18

KBM Size (%)

DER (%)

Primary system Contrastive system

[1] Delgado, H., Todisco, M., Sahidullah, M., Sarkar, A. K., Evans, N., Kinnunen, T., & 

Tan, Z. H. (2016). Further optimisations of constant Q cepstral processing for 

integrated utterance and text­dependent speaker verification. In IEEE Spoken Language  Technology (SLT) Workshop

[2] Delgado, H., Anguera, X., Fredouille, C., & Serrano, J. (2015). Fast single­and cross­

show speaker diarization using binary key speaker modeling. IEEE/ACM Transactions  on Audio, Speech and Language Processing (TASLP), 23(12), 2286­2297

Primary Contrastive 0

5 10 15 20

2.03 2.03

9.37 9.81

11.93 12.37 Development

DER (%)

Best clustering selection

Within­class sum of squares  (WCSS) using the cosine  distance is computed for all  clustering solutions

 A trade off between the  WCSS and the number of  clusters is reached through  the elbow criterion

 Primary system, with a bigger relative           KBM, outperforms contrastive system on       development and test

 DER on test data is an approx. absolute         8% worse than on the development set for    both primary and contrastive systems

Resegmentation 1Decided through the cosine distance

DER trend for different KBM sizes on the development set

Références

Documents relatifs

Novel contributions include: (i) a new approach to BK extraction based on multi-resolution spectral analysis; (ii) an approach to speaker change detection using BKs; (iii)

However, when training data is plentiful (open-set condition), neural embeddings provide more robust segmentations, giving speaker representations which lead to better

Therefore, ODESSA submissions to the “speaker” part of the multimodal diarization challenge rely on the same systems used for its open-set submissions to the speaker

Exper- imental results obtained using the standard DIHARD database show that the contributions reported in this paper deliver relative improvements of 39% in terms of the

On the development data, the overall speaker diarization error was reduced from 24.8% for the best setting of the baseline system to 13.2% using BIC clustering, and to 7.1% with

In this paper, we propose to perform speaker diariza- tion within the dialogue scenes of TV series by combining the au- dio and video modalities: speaker diarization is first

Both bottom-up and top-down approaches are generally based on Hidden Markov Models (HMMs) where each state is a Gaussian Mixture Model (GMM) and corresponds to a speaker.

Some strategies such as grid algorithm or genetic algorithm (8) have been used. This is beyond the scope of this paper and may be a future extension. 3 INFINITE GAUSSIAN