Incremental MAP - Semi-supervised on-line diarization

5.2 Semi-supervised on-line diarization

5.2.1 Incremental MAP

The initialisation and update of speaker models in the unsupervised on-line diarization system described in Chapter 4 rely on a sequential MAP adaptation procedure, in which the speaker models are MAP adapted as soon as each new speech segment becomes available. In a semi-supervised scenario, however, seeding the speaker models with labelled training data allows them to be updated by means of a more robust incremental MAP adaptation procedure.

For a given speaker, let there be a sequence of D speech segments (D = 4 in Figure 5.4), where each segment i is parametrised by a set of acoustic features O⁽ⁱ⁾ = o₁, . . . , o_{M_i}. As explained in Chapter 4, Section 4.1.2, in the sequential MAP adaptation procedure (the second algorithm illustrated in Figure 5.4), the sufficient statistics N_{i+1}, F_{i+1} and S_{i+1} for speaker segment i + 1 are calculated against the previous model s⁽ⁱ⁾ and depend non-linearly on N_i, F_i and S_i through the Gaussian occupation probabilities.

Accordingly, even given the same observations in the same segments, the speaker models obtained from the conventional off-line and the sequential MAP adaptation procedures are not the same. In the newly proposed incremental MAP adaptation approach (the third algorithm illustrated in Figure 5.4), however, the sufficient statistics are always calculated with respect to the general speech UBM λ_{UBM} and accumulated over time.

Here, the initial speaker model s⁽¹⁾ is obtained in the same way as with sequential MAP adaptation. In order to update the speaker model s⁽ⁱ⁾, the sufficient statistics for speaker segment i + 1 are now always calculated with the original λ_{UBM} model and accumulated with the sufficient statistics N_i, F_i and S_i:

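N_{i+1} = N_i + N(O⁽ⁱ⁺¹⁾ | λ_{UBM})
F_{i+1} = F_i + F(O⁽ⁱ⁺¹⁾ | λ_{UBM})
S_{i+1} = S_i + S(O⁽ⁱ⁺¹⁾ | λ_{UBM})

where N(O⁽ⁱ⁺¹⁾ | λ_{UBM}), F(O⁽ⁱ⁺¹⁾ | λ_{UBM}) and S(O⁽ⁱ⁺¹⁾ | λ_{UBM}) denote the sufficient statistics of segment i + 1 calculated against λ_{UBM}.

The mean, variance and weights of the updated model s⁽ⁱ⁺¹⁾ are then obtained according to Equation (4.2) in Chapter 4, Section 4.1.1. This procedure is linear and thus, given the same data, the incremental MAP procedure will produce the same models as the off-line procedure, while still being suited to on-line processing.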
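The linearity argument can be checked numerically. The following is a minimal sketch under simplifying assumptions that are not taken from the text: features are one-dimensional, the UBM is a plain tuple of weights, means and variances, and the mean update is the standard relevance-factor MAP formula, which may differ in detail from Equation (4.2). All function names and the `relevance` parameter are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def occupation_probs(x, ubm):
    """Gaussian occupation probabilities of observation x under the UBM."""
    weights, means, variances = ubm
    likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    return [like / total for like in likes]

def sufficient_stats(segment, ubm):
    """Zeroth-, first- and second-order statistics (N, F, S) of one segment.

    The statistics are always computed against the fixed UBM, never against
    a previously adapted speaker model (assumption matching incremental MAP).
    """
    n_comp = len(ubm[0])
    N = [0.0] * n_comp
    F = [0.0] * n_comp
    S = [0.0] * n_comp
    for x in segment:
        for c, gamma in enumerate(occupation_probs(x, ubm)):
            N[c] += gamma
            F[c] += gamma * x
            S[c] += gamma * x * x
    return N, F, S

def accumulate(stats_a, stats_b):
    """Component-wise sum of two (N, F, S) triples."""
    return tuple([a + b for a, b in zip(sa, sb)] for sa, sb in zip(stats_a, stats_b))

def map_adapt_means(stats, ubm, relevance=16.0):
    """Mean-only relevance MAP update (hypothetical stand-in for Equation (4.2))."""
    N, F, _ = stats
    means = ubm[1]
    return [(F[c] + relevance * means[c]) / (N[c] + relevance) for c in range(len(means))]
```

Accumulating `sufficient_stats` segment by segment with `accumulate` gives, up to floating-point rounding, the same triple as a single call on the concatenation of all segments, so `map_adapt_means` returns the same model in both cases. This holds only because the occupation probabilities are computed against the fixed UBM; replacing `ubm` with the previously adapted model, as in sequential MAP, would break the equality.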

5.2.2 System implementation

The proposed semi-supervised on-line diarization system is illustrated in Fig. 5.5. It is based on the baseline top-down or divisive hierarchical clustering approach to off-line diarization reported in Chapter 3, Section 3.4 and the unsupervised on-line diarization approach described in Chapter 4.

The system is characterised by four stages: (i) feature extraction; (ii) off-line speaker models enrolment; (iii) speech activity detection; and (iv) on-line classification.

Feature extraction

Each audio stream is first parametrised by a series of acoustic observations o₁, . . . , o_T. Critically, for any time τ ∈ {1, . . . , T}, only those observations with t < τ are used for diarization.

Off-line speaker models enrolment

A brief round-table phase in which each speaker introduces himself is used to seed the speaker models. The first T_{SPK} seconds of active speech for each speaker are set aside as labelled seed training data. An inventory Δ̃ of speaker models s_j, where j = 1, . . . , M and M is the number of speakers in the meeting, is then trained using these T_{SPK} seconds of seed data per speaker: each speaker model is MAP adapted from the UBM using its seed data. For each speaker model s_j, the sufficient statistics N₁^(j), F₁^(j) and S₁^(j) obtained during the MAP adaptation are stored so that they can be used to update the speaker models during the on-line classification phase. The resulting set of seed speaker models is then used to diarize the remaining speech segments in an unsupervised fashion.
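The selection of seed data can be sketched as follows. This is only an illustration, assuming that the round-table phase is available as a time-ordered list of hypothetical `(start, end, speaker)` tuples in seconds, with non-speech already excluded; the function name and data layout are not taken from the system described here.

```python
def select_seed_data(labelled_segments, t_spk):
    """Keep the first t_spk seconds of active speech for each speaker.

    labelled_segments: time-ordered list of (start, end, speaker) tuples.
    Returns a dict mapping each speaker to the list of (start, end)
    intervals that make up its seed data.
    """
    seed = {}
    used = {}  # seconds of seed data already collected per speaker
    for start, end, speaker in labelled_segments:
        remaining = t_spk - used.get(speaker, 0.0)
        if remaining <= 0:
            continue  # this speaker already has t_spk seconds of seed data
        take = min(end - start, remaining)
        seed.setdefault(speaker, []).append((start, start + take))
        used[speaker] = used.get(speaker, 0.0) + take
    return seed
```

Each speaker's intervals would then be passed to MAP adaptation from the UBM, and the resulting statistics stored for later updates. A segment that crosses the T_{SPK} boundary is truncated here; whether the remainder is then diarized or discarded is a design choice left open by this sketch.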

Speech activity detection and on-line classification

Non-speech segments are removed according to the output of a conventional model-based speech activity detector (SAD) derived from the baseline top-down diarization system described in Chapter 3, Section 3.4.

Fig. 5.5 An illustration of the semi-supervised on-line speaker diarization system, comprising a supervised speaker models training stage and an unsupervised on-line classification stage.

The remaining speech segments are then divided into smaller sub-segments whose duration is no longer than an a-priori fixed maximum duration T_S. Higher values of T_S imply higher system latency. On-line diarization is then applied in sequence to each sub-segment. The optimised speaker sequence S̃ and segmentation G̃ are obtained by assigning each segment i in turn to one of the M speaker models according to:

s_j = \arg\max_{l \in (1, \dots, M)} \sum_{k=1}^{K} L(o_k \mid s_l)    (5.2)

where o_k is the k-th acoustic feature in the segment i, K represents the number of acoustic features in the i-th segment and where L(o_k|s_l) denotes the log-likelihood of the k-th feature in segment i given the speaker model s_l. The segment is then labelled according to the recognised speaker j as per (5.2). The updated speaker model s_j is obtained by either sequential or incremental MAP adaptation as described above.
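The segmentation and classification steps can be sketched together. This is a minimal illustration rather than the system's implementation: it assumes one-dimensional features, speaker models stored as hypothetical `(weights, means, variances)` tuples, and sub-segments cut at fixed intervals of T_S, none of which is specified in the text. Speaker indices are 0-based here, whereas (5.2) counts from 1.

```python
import math

def split_segment(start, end, t_s):
    """Split [start, end) into consecutive sub-segments no longer than t_s seconds."""
    if t_s <= 0:
        raise ValueError("t_s must be positive")
    subs = []
    t = start
    while t < end:
        nxt = min(t + t_s, end)
        subs.append((t, nxt))
        t = nxt
    return subs

def log_likelihood(o, model):
    """Log-likelihood L(o | s) of one feature under a 1-D diagonal GMM."""
    weights, means, variances = model
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - 0.5 * (o - m) ** 2 / v
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify(features, speaker_models):
    """Return the index maximising the summed log-likelihood, as in (5.2)."""
    scores = [sum(log_likelihood(o, s) for o in features) for s in speaker_models]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

In use, each speech segment would be passed through `split_segment`, the features of each sub-segment through `classify`, and the winning model then updated by sequential or incremental MAP adaptation before the next sub-segment is processed.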

5.3 Performance evaluation

In order to evaluate the performance of the proposed semi-supervised on-line diarization system, average global DERs are assessed as a function of the amount of labelled training data T_{SPK} and for different maximum segment durations T_S. The evaluation aims to determine what quantity of manually labelled seed data is needed to match the performance of the state-of-the-art, entirely off-line baseline system reported in Chapter 3, Section 3.4, the associated cost in terms of system latency, and the benefit of incremental MAP adaptation.

5.3.1 Semi-supervised on-line diarization against off-line
