Automatic alignment with variants - Language identification test in code-switching speech


5.1.1 Automatic alignment with variants

Method. Figure 5.1 gives a schematic overview of forced alignment, which can be considered a sub-process of an automatic speech recognition system. The top part of the figure illustrates the basic task of forced alignment: given the speech signal and its transcription as input, the forced alignment process locates the words and their component phones in the signal, thus providing the time stamps of their hypothesized boundaries. If pronunciation variants are proposed for a given word, the forced alignment process also chooses the best matching variant. This is illustrated in the bottom part of Figure 5.1. In particular, the illustration highlights the variants paradigm as we propose to use it in the following investigations, which we sometimes refer to as the parallel variants paradigm.

For a given sound category a, the (competing) sound category b is added as an alternative in all positions where a appears, and the system decides for the incoming signal x which of the two categories, a (target) or b (competing), matches best. In the illustration of Figure 5.1 (bottom), b is added as an alternative to all occurrences of a in the input stream, and the first occurrence of a in the output stream is replaced by b to show that b was found to be more similar to the signal x in this position.

Figure 5.1: Schematic representation of the forced alignment process. Input includes signal, transcription and phonemic representation. Output gives time stamps to words and phones. Top: no variants in phonemic representation. Bottom: with local variants, highlighted by red circles. In addition to time stamps, the system chooses the best matching acoustic phone model among proposed variants a or b.
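The parallel variants paradigm described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the toy phone sequence and the toy scores are hypothetical, and each position is decided independently with fixed boundaries, whereas a real aligner chooses variants and boundaries jointly during decoding.

```python
from typing import Callable, List, Sequence


def add_parallel_variant(phones: Sequence[str], target: str,
                         competitor: str) -> List[List[str]]:
    """Return one list of candidate labels per position.

    Every occurrence of `target` receives `competitor` as an alternative;
    all other phones keep a single candidate.
    """
    return [[p, competitor] if p == target else [p] for p in phones]


def choose_variants(candidates: List[List[str]],
                    score: Callable[[int, str], float]) -> List[str]:
    """Pick, at each position, the candidate with the highest score.

    `score(position, label)` stands in for an acoustic match score
    (hypothetical here; higher means a better match to the signal).
    """
    return [max(options, key=lambda label: score(i, label))
            for i, options in enumerate(candidates)]


# Toy input stream with two occurrences of the target category "a".
phones = ["a", "t", "a"]
candidates = add_parallel_variant(phones, target="a", competitor="b")

# Invented scores: "b" matches better at position 0, "a" at position 2.
toy_scores = {(0, "a"): -5.0, (0, "b"): -3.0,
              (1, "t"): -2.0,
              (2, "a"): -1.0, (2, "b"): -4.0}
chosen = choose_variants(candidates, lambda i, label: toy_scores[(i, label)])
print(candidates)
print(chosen)
```

With these invented scores, the first occurrence of a is relabelled b and the second is kept, mirroring the situation drawn in the bottom part of the figure.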

Implementation. The automatic forced alignment of the parallel variants, across all speech corpora and for each experiment, was carried out using a set of position-independent monophone acoustic models similar to those described in (Gauvain et al., 2002; Lamel et al., 2004; Gelly et al., 2016; Lamel et al., 2009). This setup was preferred to context-dependent acoustic models, as previous studies showed that the large sets of context-dependent models typically used in speech recognition systems tend to capture very specific co-articulation variation, which may reach beyond the location of a single segment. For example, in French spontaneous speech, shortened and devoiced high vowels may be typical in some obstruent contexts (as for /y/ in French tu sais "you know", typically produced as [tsE], with no [y] segment between [t] and [s]). We therefore prefer context-independent phone models, as they average and represent the spectral characteristics of all occurrences of a given sound, rather than only a subset extracted from a specific left/right context (Adda-Decker and Lamel, 1999; Mareüil and Adda-Decker, 2002).

The alignment system locates word and phone boundaries using the orthographic transcriptions and the best matching pronunciations, chosen among the pronunciation variants included in its dictionary. For technical reasons, the segmentation resolution is limited to 10 ms and the minimum duration of a segment is 30 ms. The phone labelling is not strictly phonetic, but rather phonological or phonemic (corresponding in most cases to standard word pronunciations).
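To make the two timing constraints concrete, the sketch below converts a frame-level labelling into time-stamped segments and flags segments shorter than the minimum. The 10 ms and 30 ms values come from the text; everything else (function names, the toy frame sequences, and the assumption that one label is available per 10 ms frame) is hypothetical and not a description of the actual aligner.

```python
from typing import List, Tuple

FRAME_MS = 10        # segmentation resolution stated in the text
MIN_SEGMENT_MS = 30  # minimum segment duration stated in the text

Segment = Tuple[str, int, int]  # (label, start_ms, end_ms)


def frames_to_segments(frame_labels: List[str]) -> List[Segment]:
    """Collapse runs of identical frame labels into time-stamped segments.

    Simplification: two adjacent segments with the same label would be
    merged, which a real aligner would keep apart.
    """
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start],
                             start * FRAME_MS, i * FRAME_MS))
            start = i
    return segments


def too_short(segments: List[Segment]) -> List[Segment]:
    """Return the segments that violate the minimum duration."""
    return [s for s in segments if s[2] - s[1] < MIN_SEGMENT_MS]


# Toy labelling: each phone occupies at least three 10 ms frames.
valid = frames_to_segments(["t"] * 3 + ["y"] * 3 + ["s"] * 5)
# Toy labelling with a 20 ms segment, which the constraint rules out.
invalid = frames_to_segments(["a"] * 2 + ["b"] * 4)
print(valid)
print(too_short(invalid))
```

One consequence worth keeping in mind is that any phone present in the selected pronunciation occupies at least 30 ms of signal, even when it is acoustically very reduced, and all boundaries fall on multiples of 10 ms.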

Typical pronunciation variants in French are due to optional liaison consonants and schwa vowels (which may be described as sequential variants), allowing for one more or one fewer phone symbol in the pronunciation. For example, the word facile (easy) might provide the following choices to the system: [fasil], [fasil@] (and, in the case of a vowel /i/ centralization experiment, the additional variants [fas@l] and [fas@l@], which allow for parallel [i, @] variants). Other typical variants are due to word-final consonant cluster simplifications, as in the word autre (other), which provides the following choices to the system: [otr], [Otr], [otr@], [Otr@], [ot], [Ot]. This particular example combines both parallel ([o, O]) and sequential (optional [r] and [@]) variants. In general, however, most lexical entries tend to be described by their canonical (full form) pronunciation.
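The two examples above can be reproduced with a small variant generator. The sketch below is an assumed representation, not the format of the actual pronunciation dictionary: each word is a list of slots, a slot lists alternative phone sequences, a slot with several non-empty alternatives encodes a parallel variant, and an empty alternative encodes an optional (sequential) phone.

```python
from itertools import product
from typing import List, Sequence, Tuple

# A slot is a list of alternative phone sequences; () means "nothing here".
Slot = Sequence[Tuple[str, ...]]


def expand(template: Sequence[Slot]) -> List[str]:
    """Enumerate every pronunciation licensed by a slot template."""
    return ["".join(phone for part in combo for phone in part)
            for combo in product(*template)]


# facile: parallel [i, @] variant plus optional final schwa.
FACILE = [[("f",)], [("a",)], [("s",)],
          [("i",), ("@",)],
          [("l",)],
          [(), ("@",)]]

# autre: parallel [o, O] variant; the final schwa is only available
# when [r] is kept, so the tail is listed as r / r@ / nothing.
AUTRE = [[("o",), ("O",)],
         [("t",)],
         [("r",), ("r", "@"), ()]]

print(expand(FACILE))
print(expand(AUTRE))
```

Listing the tail of autre as three explicit alternatives, rather than as two independent optional phones, is what keeps a form such as [ot@] out of the generated set, in line with the six variants given in the text.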

For the vowel variation experiments, the automatic alignment system makes use of standard French acoustic models. As the French inventory contains more vowels than Algerian Arabic, the French acoustic models divide the vowel space into a larger number of smaller regions, and thus allow a finer quantification than the Arabic acoustic models would. In particular, the larger set of French vowel acoustic models allows us to quantify whether the Algerian Arabic vowels are realized in a similar way to the corresponding French vowels or whether they tend to be shifted and, if so, in what direction. Note that the use of French acoustic models should not lead to the interpretation that aligned variants correspond to realizations of French phonemes, but rather that the realization of an Arabic vowel is acoustically close to that of a given French vowel.
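As an illustration of how such alignment output could be turned into a measure of vowel shift, the sketch below tallies, for each target vowel, the share of tokens aligned with each French vowel model. The function name and the (target, chosen) pairs are invented for the example and are not results from the corpora.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def variant_rates(
        alignments: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Share of tokens per chosen model, for each target vowel.

    `alignments` holds (target_vowel, chosen_model) pairs, one per token.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for target, chosen in alignments:
        counts[target][chosen] += 1
    return {target: {model: n / sum(c.values()) for model, n in c.items()}
            for target, c in counts.items()}


# Invented toy pairs (not real data): target vowel, model chosen by alignment.
toy_alignments = [("a", "a"), ("a", "a"), ("a", "E"), ("a", "a"),
                  ("i", "e"), ("i", "i")]
rates = variant_rates(toy_alignments)
print(rates)
```

In line with the caution above, a high rate for a competing model would be read as acoustic proximity to that French vowel, not as the production of a French phoneme.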

For the consonant variation experiments, the automatic alignment system is based on Arabic acoustic models, as Arabic has a larger inventory of consonants than French and all studied French consonants have a counterpart in the Arabic model set. As for French, the Arabic acoustic models consist of position-independent monophone acoustic models similar to those described for French.

Discussion. Forced alignment with specific variants may be considered globally more objective than human annotation, as exactly the same acoustic models and the same decision measure are applied in all positions over time. Above all, this method is far more time-efficient than human labelling. However, the variant choices are categorical and limited to the options offered by the variants included in the pronunciation dictionary, which motivates the proposed term of ABX-like categorization for this variant alignment approach. In order to explore gradual variations, additional analyses such as formant measurements and perceptual tests using human listeners are necessary. In this work some automatic formant analyses are proposed; perceptual tests, however, go beyond the scope of this thesis.

From the document: Linguistic and phonetic investigations of French-Algerian Arabic code-switching: Large corpus studies using automatic speech processing ~ Association Francophone de la Communication Parlée (pages 133-136)