

5.2 Semi-supervised on-line diarization

5.2.1 Incremental MAP

In the unsupervised on-line diarization system described in Chapter 4, speaker models are initialised and updated by a sequential MAP adaptation procedure, in which each model is re-adapted as soon as a new speech segment becomes available. In a semi-supervised scenario, however, seeding the speaker models with labelled training data allows them to be updated by means of a more robust incremental MAP adaptation procedure.

For a given speaker, let there be a sequence of D speech segments (D = 4 in Figure 5.4), where each segment i is parametrised by a set of acoustic features O(i) = o_1, ..., o_{M_i}. As explained in Chapter 4, Section 4.1.2, in the sequential MAP adaptation procedure (the second algorithm illustrated in Figure 5.4) the sufficient statistics N_{i+1}, F_{i+1} and S_{i+1} for speaker segment i+1 are calculated against the previous model s(i) and depend non-linearly on N_i, F_i and S_i through the Gaussian occupation probabilities.

Accordingly, even given the same observations in the same segments, the speaker models obtained from the conventional off-line and the sequential MAP adaptation procedures are not the same. In the newly proposed incremental MAP adaptation approach (the third algorithm illustrated in Figure 5.4), however, the sufficient statistics are always calculated with respect to the general speech UBM λ_UBM and accumulated over time.

Here, the initial speaker model s(1) is obtained in the same way as with sequential MAP adaptation. In order to update the speaker model s(i), the sufficient statistics for speaker segment i+1 are now always calculated with the original λ_UBM model and accumulated with the sufficient statistics N_i, F_i and S_i:

N_{i+1} = N_i + …

The mean, variance and weights of the updated model s(i+1) are then obtained according to Equation (4.2) in Chapter 4, Section 4.1.1. This procedure is linear and thus, given the same data, the incremental MAP procedure will produce the same models as the off-line procedure, while still being suited to on-line processing.
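The linearity argument can be illustrated with a minimal Python sketch. It assumes diagonal-covariance GMMs stored as lists of `(weight, mean, variance)` tuples; all function names, the toy UBM and the relevance factor of 16 are hypothetical, and the mean update uses the common relevance-MAP form as a stand-in, since Equation (4.2) is not reproduced in this section. Because every segment is scored against the same fixed UBM, summing per-segment statistics gives the same totals as a single off-line pass over all the data.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def posteriors(x, gmm):
    """Occupation probability of each Gaussian component for frame x."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def sufficient_stats(frames, gmm):
    """Zeroth-, first- and second-order statistics (N, F, S) against gmm."""
    dim = len(gmm[0][1])
    N = [0.0] * len(gmm)
    F = [[0.0] * dim for _ in gmm]
    S = [[0.0] * dim for _ in gmm]
    for x in frames:
        for c, g in enumerate(posteriors(x, gmm)):
            N[c] += g
            for d in range(dim):
                F[c][d] += g * x[d]
                S[c][d] += g * x[d] * x[d]
    return N, F, S

def accumulate(stats_a, stats_b):
    """Element-wise sum of two (N, F, S) triples: the incremental update."""
    Na, Fa, Sa = stats_a
    Nb, Fb, Sb = stats_b
    N = [a + b for a, b in zip(Na, Nb)]
    F = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(Fa, Fb)]
    S = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(Sa, Sb)]
    return N, F, S

def map_means(stats, gmm, relevance=16.0):
    """Relevance-MAP mean update (assumed form, not taken from Equation (4.2))."""
    N, F, _ = stats
    means = []
    for c, (_, m, _) in enumerate(gmm):
        alpha = N[c] / (N[c] + relevance)
        means.append([
            alpha * (F[c][d] / N[c]) + (1 - alpha) * m[d] if N[c] > 0 else m[d]
            for d in range(len(m))
        ])
    return means

# Toy two-component UBM and two speech segments (illustrative values only).
ubm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [3.0, 3.0], [1.0, 1.0])]
seg1 = [[0.1, -0.2], [2.9, 3.1]]
seg2 = [[0.3, 0.4], [3.2, 2.7], [2.8, 3.3]]

# Incremental: statistics of each segment against the fixed UBM, then summed.
incremental = accumulate(sufficient_stats(seg1, ubm), sufficient_stats(seg2, ubm))
# Off-line: one pass over all the data at once.
offline = sufficient_stats(seg1 + seg2, ubm)
```

In this sketch `incremental` and `offline` agree up to floating-point rounding, so `map_means` returns the same adapted means for both. A sequential variant would instead call `sufficient_stats` against the previously adapted model, which changes the occupation probabilities and breaks this equality.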

5.2.2 System implementation

The proposed semi-supervised on-line diarization system is illustrated in Fig. 5.5. It is based on the baseline top-down or divisive hierarchical clustering approach to off-line diarization reported in Chapter 3, Section 3.4 and the unsupervised on-line diarization approach described in Chapter 4.

The system is characterised by four stages: (i) feature extraction; (ii) off-line speaker models enrolment; (iii) speech activity detection and (iv) on-line classification.

Feature extraction

Each audio stream is first parametrised by a series of acoustic observations o_1, ..., o_T. Critically, for any time τ ∈ {1, ..., T}, only those observations with t < τ are used for diarization.

Off-line speaker models enrolment

A brief round-table phase in which each speaker introduces themselves is used to seed the speaker models. The first T_SPK seconds of active speech for each speaker are set aside as labelled seed training data. An inventory Δ̃ of speaker models s_j, where j = 1, ..., M and M is the number of speakers in a particular meeting, is then trained using T_SPK seconds of seed data per speaker. Speaker models are MAP adapted from the UBM using the seed data. For each speaker model s_j, the sufficient statistics N_1^(j), F_1^(j) and S_1^(j) obtained during MAP adaptation are stored so that they can be used to update the speaker models during the on-line classification phase. The resulting set of seed speaker models is then used to diarize the remaining speech segments in an unsupervised fashion.
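The selection of seed data can be sketched as follows. This is only an illustration under stated assumptions: the labelled round-table speech is taken to be available as chronological `(start, end, speaker)` tuples in seconds, and the function name `select_seed` and its truncation rule for the last segment are hypothetical rather than taken from the system described here.

```python
def select_seed(segments, t_spk):
    """Keep the first t_spk seconds of labelled speech for each speaker.

    segments: chronological list of (start, end, speaker) tuples, in seconds.
    Returns a dict mapping each speaker to a list of (start, end) seed spans.
    """
    seed = {}
    used = {}  # seconds of seed data already collected per speaker
    for start, end, spk in segments:
        remaining = t_spk - used.get(spk, 0.0)
        if remaining <= 0:
            continue  # this speaker already has t_spk seconds of seed data
        take = min(end - start, remaining)
        seed.setdefault(spk, []).append((start, start + take))
        used[spk] = used.get(spk, 0.0) + take
    return seed

# Hypothetical round-table annotation with two speakers and T_SPK = 3 s.
labelled = [(0.0, 2.0, "A"), (2.0, 5.0, "B"), (5.0, 8.0, "A")]
seed_spans = select_seed(labelled, 3.0)
```

The features inside each speaker's seed spans would then be used to MAP adapt that speaker's model from the UBM, with the resulting sufficient statistics stored for later updates.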

Speech activity detection and on-line classification

Non-speech segments are removed according to the output of a conventional model-based speech activity detector (SAD) derived from the baseline top-down diarization system described in Chapter 3, Section 3.4.

Fig. 5.5 An illustration of the semi-supervised on-line speaker diarization system. Panels: speaker models training (supervised) and on-line classification (unsupervised).

The remaining speech segments are then divided into smaller sub-segments whose duration is no longer than an a-priori fixed maximum duration T_S. Higher values of T_S imply a higher-latency system. On-line diarization is then applied in sequence to each sub-segment. The optimised speaker sequence S̃ and segmentation G̃ are obtained by assigning each segment i, in sequence, to one of the M speaker models according to:

$$ s_j = \arg\max_{l \in (1,\ldots,M)} \sum_{k=1}^{K} L(o_k \mid s_l) \qquad (5.2) $$

where o_k is the k-th acoustic feature in segment i, K is the number of acoustic features in the i-th segment, and L(o_k|s_l) denotes the log-likelihood of the k-th feature in segment i given the speaker model s_l. The segment is then labelled with the recognised speaker j as per (5.2). The updated speaker model s_j is obtained by either sequential or incremental MAP adaptation, as described above.
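The sub-segmentation and the decision rule in (5.2) can be sketched in Python. The sketch assumes diagonal-covariance GMM speaker models stored as lists of `(weight, mean, variance)` tuples; the names `split_segment`, `gmm_log_likelihood` and `classify_segment` are hypothetical, and leaving the final sub-segment shorter than T_S is an assumption about how the remainder is handled.

```python
import math

def split_segment(start, end, max_duration):
    """Split [start, end) into consecutive pieces no longer than max_duration."""
    if max_duration <= 0:
        raise ValueError("max_duration must be positive")
    pieces = []
    t = start
    while t < end:
        nxt = min(t + max_duration, end)
        pieces.append((t, nxt))
        t = nxt
    return pieces

def gmm_log_likelihood(x, gmm):
    """Log-likelihood L(o_k | s_l) of one feature vector under a diagonal GMM."""
    logs = [
        math.log(w) + sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        for w, mean, var in gmm
    ]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def classify_segment(frames, speaker_models):
    """Index of the speaker model maximising the summed log-likelihood, as in (5.2)."""
    scores = [
        sum(gmm_log_likelihood(x, gmm) for x in frames)
        for gmm in speaker_models
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: two single-Gaussian speaker models on 1-D features.
models = [[(1.0, [0.0], [1.0])], [(1.0, [5.0], [1.0])]]
pieces = split_segment(0.0, 7.0, 3.0)
label = classify_segment([[4.8], [5.3]], models)
```

After each decision, the model at the returned index would be updated from the sub-segment's features by sequential or incremental MAP adaptation. A larger `max_duration` gives each decision more frames to sum over, at the cost of waiting longer before a label is produced.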

5.3 Performance evaluation

In order to evaluate the performance of the proposed semi-supervised on-line diarization system, average global DERs are assessed as a function of the amount of labelled training data T_SPK and of the maximum segment duration T_S. The evaluation aims to determine the quantity of manually labelled seed data needed to match the performance of the state-of-the-art, entirely off-line baseline system reported in Chapter 3, Section 3.4, the associated cost in terms of system latency, and the benefit of incremental MAP adaptation.

5.3.1 Semi-supervised on-line diarization against off-line