
In the work outlined in this manuscript, the majority of the experiments are performed in the meeting domain, i.e. involving the NIST RT meeting corpus. However, in order to assess the robustness of the systems to different data, some additional work involving a corpus of TV talk-shows, known as the Grand Échiquier dataset, is also described in Section 3.3.2.

[Figure 3.1: plot; x-axis: year of the evaluation (2003–2010); left axis: % of overlap speech; right axis: average time of the turn in sec.; series: average time per turn, average time per turn without overlap, % of overlap speech]

Figure 3.1: Analysis of the percentage of overlap speech and the average duration of the turns for each of the 5 NIST RT evaluation datasets. Percentages of overlap speech are given over the total speech time.

3.3.1 RT Meeting Corpus

For each NIST RT evaluation since 2004, a new database of annotated audio meetings was collected¹. A total of five conference meeting evaluation datasets are available.

Figure 3.1 shows the differences between the RT evaluation datasets in terms of the percentage of overlap speech and the average turn duration. For RT‘04, RT‘05 and RT‘09 the percentage of overlap speech is in the order of 15%, whereas the 2006 and 2007 datasets involve around 8%. Turning to the average turn duration, defined as the average time during which there is no change in speaker activity (same speaker, same condition: overlap/no overlap), we observe that the last three evaluations, RT‘06, ‘07 and ‘09, have shorter average turns, even when overlap speech is excluded. This suggests that the speech in these three evaluations is more spontaneous and more interactive, leading to shorter turns. Based on these first observations we therefore expect the RT‘06, ‘07 and ‘09 datasets to be more challenging.

¹ The ground-truth keys are released later so that they may be used by the community for their own research and development, independently of official NIST evaluations.
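To make this turn definition concrete, the sketch below computes the average turn duration from reference speaker segments. It is a minimal illustration only, assuming a hypothetical input of (speaker, start, end) tuples (e.g. as parsed from RTTM annotation files); the NIST evaluation tooling itself is not reproduced here.

```python
from itertools import groupby

def average_turn_duration(segments):
    """Average turn duration, where a 'turn' is a maximal interval with
    no change in speaker activity (same speakers, same condition:
    overlap/no overlap). segments: (speaker, start, end) in seconds."""
    # Every segment start or end is a potential change point.
    times = sorted({t for _, s, e in segments for t in (s, e)})
    # Active-speaker set on each elementary interval between change points.
    intervals = []
    for t0, t1 in zip(times, times[1:]):
        active = frozenset(sp for sp, s, e in segments if s < t1 and e > t0)
        intervals.append((t0, t1, active))
    # Merge contiguous intervals sharing the same active-speaker set;
    # silence (empty set) separates turns and is not itself a turn.
    turns = []
    for active, group in groupby(intervals, key=lambda iv: iv[2]):
        group = list(group)
        if active:
            turns.append(group[-1][1] - group[0][0])
    return sum(turns) / len(turns) if turns else 0.0
```

Restricting the merged intervals to those with exactly one active speaker would give something close to the "average time per turn without overlap" series of Figure 3.1.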

For the work reported in this thesis, and for consistency with previous work [Fredouille & Evans, 2008; Fredouille et al., 2004], all the experimental systems were optimized on a development dataset of 23 meetings from the NIST RT‘04, ‘05 and ‘06 evaluations. Performance was then assessed on the independent RT‘07 and RT‘09 datasets. Note that there is no overlap between the development and evaluation datasets, although they may contain meetings recorded at the same sites and possibly involving the same speakers.

3.3.2 GE TV-Talk Shows Corpus

In other work [Bozonnet et al., 2010] we also conducted speaker diarization assessments on a database of TV talk-shows known as the Grand Échiquier (GE) database. Since these results allow us to evaluate the robustness of a speaker diarization system (i.e. to variations in dominant speaker floor time), the database is described here. Baseline results for the GE database are reported in Section 3.5.

This corpus comprises over 50 French-language TV talk-show programs from the 1970s–80s and has been used in both national and European multimedia research projects, e.g. the European K-Space network of excellence [K-Space]. Each show focuses on a main guest, with other supporting guests, all of whom are interviewed by a host presenter. The interviews are punctuated with film excerpts, live music, audience applause and laughter. In addition, silences between speaker turns can be very short or almost negligible; compared to meetings, where speakers often pause to collect their thoughts or to reflect before responding to a question, TV-show speech tends to be more fluent and sometimes almost scripted. This is perhaps due to the fact that the main themes and discussions are prepared in advance and known to the speakers.

Table 3.1 highlights more quantitative differences between NIST RT conference meetings from the RT‘09 dataset and the 7 TV shows from the GE database which have thus far been annotated manually according to standard NIST RT protocols [NIST, 2009].

Upon comparison of the first three lines of Table 3.1 we observe that TV talk-shows are on average much longer than conference meetings (147 minutes vs. 25 minutes) and that, with noise (e.g. applause) and music removed, the quantity of speech is more than twice that of the RT data (50 minutes vs. 21 minutes). Note, however, that the average segment duration is slightly smaller for RT‘09 than for GE (2 sec. vs. 3 sec.). These preliminary findings suggest that TV shows may present more of a challenge due to the greater levels of intra-speaker variability within a single show.

Attribute                               GE          NIST RT‘09
No. of shows                            7           7
Avg. evaluation time                    147 min.    25 min.
Total speech                            50 min.     21 min.
Avg. no. of segments                    1033        882
Avg. segment length                     3 sec.      2 sec.
Avg. overlap                            5 min.      3 min.
Avg. % overlap / total speech           10 %        14 %
Avg. no. of speakers                    13          5
Avg. floor time, most active speaker    1476 sec.   535 sec.
Avg. floor time, least active speaker   7 sec.      146 sec.

Table 3.1: A comparison of Grand Échiquier (GE) and NIST RT‘09 database characteristics.

Moreover, differences in terms of speaker statistics must also be considered. Indeed, the average number of speakers and the average floor time for the most and least active speakers in each show differ markedly between the two domains. On average there are 13 speakers per TV show but only 5 speakers per conference meeting. This might be expected given the greater average length of TV shows. With a larger number of speakers we can expect smaller average inter-speaker differences than for meetings, and hence increased difficulty for speaker diarization.

Furthermore, we see that the spread in floor time is much greater for the GE dataset than for the RT‘09 dataset. The average speaking time of the most active speaker is 1476 seconds for the GE dataset (cf. 535 sec. for RT‘09) and corresponds in each case to the host presenter. The average speaking time of the least active speaker is only 7 seconds for GE (cf. 146 sec. for RT‘09) and corresponds to one of the minor supporting guests.

Speakers with so little data are extremely difficult to detect, and this aspect of the TV-show dataset is therefore likely to pose significant difficulties for speaker diarization. Note, however, that the overall DER is not very sensitive to such speakers, insofar as each speaker’s contribution to the diarization performance metric is time-weighted. Additionally, the presence of one or two dominant speakers means that less active speakers will be comparatively harder to detect, even if they too have significant floor time.
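To see why, recall that the DER is, in the standard NIST formulation, a ratio of time-weighted errors (the notation below is ours, as a reminder rather than a formal definition):

\[
\mathrm{DER} = \frac{T_{\text{miss}} + T_{\text{fa}} + T_{\text{spk}}}{T_{\text{total}}}
\]

where \(T_{\text{miss}}\), \(T_{\text{fa}}\) and \(T_{\text{spk}}\) are the scored times attributed to missed speech, false-alarm speech and speaker confusion respectively, and \(T_{\text{total}}\) is the total scored speech time. A speaker active for only a few seconds can contribute at most those few seconds to the numerator, and so has little effect on the overall score.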

Finally, the amount of overlap speech, averaging 5 minutes per show for GE cf. 3 minutes for RT‘09, corresponds to 10% vs. 14% of the total amount of speech (5/50 min. and 3/21 min. respectively). There is therefore proportionally slightly less overlap speech in the GE dataset than in the RT‘09 dataset but, compared to other RT datasets, the overlap speech rate can still be considered quite high.
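Both Figure 3.1 and Table 3.1 express overlap as a fraction of the total speech time. The sketch below, using the same hypothetical (speaker, start, end) segment format as before, makes the convention explicit; we assume here that overlapped regions count once towards the total speech time.

```python
def overlap_fraction(segments):
    """Percentage of overlap speech over total speech time.
    segments: (speaker, start, end) tuples in seconds."""
    times = sorted({t for _, s, e in segments for t in (s, e)})
    speech = overlap = 0.0
    for t0, t1 in zip(times, times[1:]):
        n_active = sum(1 for _, s, e in segments if s < t1 and e > t0)
        if n_active >= 1:
            speech += t1 - t0   # overlapped regions count once here (assumption)
        if n_active >= 2:
            overlap += t1 - t0  # two or more speakers active: overlap
    return 100.0 * overlap / speech if speech else 0.0
```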

Even if there is slightly less overlap speech, the nature of TV shows thus presents unique challenges not seen in meeting data, namely: the presence of music and other background non-speech sounds, a greater spread in speaker floor time, a greater number of speakers and shorter pauses.