Automatic Code-switching processing - Linguistic and phonetic investigations of French-Algerian

studies; automatic speech processing is thus a helpful method for carrying out such variation studies. Among the available tools is automatic speech alignment with pronunciation variants (Adda-Decker, 2006; Lamel et al., 2009), which allows each word to be recognized with one of several pronunciation variants at the phone level (see Sections 5.1.1 and 3.7).
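To illustrate what a dictionary with pronunciation variants contains, the following Python sketch uses a hypothetical toy lexicon and enumerates the phone sequences that a variant-aware aligner could choose between. The words, the SAMPA-like phone symbols and the function name `expand_variants` are assumptions for illustration, not the lexicon used in this thesis. Real aligners usually encode such alternatives compactly in a decoding graph rather than listing every combination.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to one or more phone-level
# pronunciation variants (symbols chosen for illustration only).
LEXICON: dict[str, list[list[str]]] = {
    "petit": [["p", "@", "t", "i"], ["p", "t", "i"]],  # with / without schwa
    "chat": [["S", "a"]],
}


def expand_variants(words: list[str],
                    lexicon: dict[str, list[list[str]]]) -> list[list[list[str]]]:
    """Return every combination of pronunciation variants for a word sequence."""
    missing = [w for w in words if w not in lexicon]
    if missing:
        raise KeyError(f"words not in lexicon: {missing}")
    # One variant is picked per word; product() enumerates all combinations.
    return [list(combo) for combo in product(*(lexicon[w] for w in words))]


if __name__ == "__main__":
    for combo in expand_variants(["petit", "chat"], LEXICON):
        print(combo)
```

For "petit chat" this yields two candidates, one with and one without the schwa; the number of candidates grows multiplicatively with the number of variants per word.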

Sociophonetic studies of CS have examined phonetic variation in speech. Among the questions addressed is the phonetic co-influence of the two languages in bilingual performance, resulting from their simultaneous activation during speech. Comparing early English-Spanish bilinguals with two groups of English-Spanish speakers whose L1 and L2 statuses were inverted, Bullock (2009) showed that the phonetic influence of L1 on L2 in CS is observable in the voice onset time (VOT) of voiceless stops. The study concluded that this influence can be bi-directional, affecting both L1 and L2, and that the variation is observed in each L1/L2 group.

Likewise, a previous study of VOT variation in stop consonants in English-Spanish CS revealed that bilinguals speak at a lower rate around language switches and show phonological modifications of VOT values (Deuchar et al., 2014). Other similar studies reported reduced VOT values for Spanish stop consonants in Spanish-English CS productions (Balukas and Koops, 2015; Piccinini, 2016; Botero et al., 2004). In addition, Piccinini and Garellek (2014) reported that English-Spanish bilinguals produce different prosodic contours depending on whether their speech is monolingual or bilingual.
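Since the studies above compare VOT values across conditions, a minimal sketch of the measurement may help. VOT is the interval between the release burst of a stop and the onset of voicing: it is positive when voicing starts after the release and negative for prevoicing. The Python code below assumes that burst and voicing-onset landmarks have already been labelled (in seconds); the function names, context labels and token values are hypothetical and invented for illustration, not data from the cited studies.

```python
from collections import defaultdict
from statistics import mean


def vot_ms(burst_release_s: float, voicing_onset_s: float) -> float:
    """Voice onset time in milliseconds.

    Positive: voicing begins after the stop release (lag VOT).
    Negative: voicing begins before the release (prevoicing).
    """
    return (voicing_onset_s - burst_release_s) * 1000.0


def mean_vot_by_context(tokens: list[tuple[str, float, float]]) -> dict[str, float]:
    """Average VOT (ms) per context label, from (context, burst_s, onset_s) tokens."""
    groups: dict[str, list[float]] = defaultdict(list)
    for context, burst_s, onset_s in tokens:
        groups[context].append(vot_ms(burst_s, onset_s))
    return {context: mean(values) for context, values in groups.items()}


# Invented toy tokens (hypothetical labels and times, not measured data)
TOKENS = [
    ("monolingual", 0.100, 0.125),
    ("monolingual", 0.900, 0.935),
    ("near_switch", 1.500, 1.518),
]

if __name__ == "__main__":
    print(mean_vot_by_context(TOKENS))
```

Grouping by context label is the simplest way to set up the kind of monolingual versus switch-adjacent comparison these studies make; a real analysis would add speaker and phonetic-context controls.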

Given that CS is a speech style that may influence phonetic production in bilingual mode in general and at language switches in particular, phonetic variation is worth studying, especially in large-scale CS corpora and for both consonants and vowels. It should be noted that CS variation has been little studied, even in large-scale and spontaneous CS speech corpora.

1.5 Automatic Code-switching processing

With the development of speech technology tools and extensive research on language identification and Automatic Speech Recognition (ASR), a large body of research has been devoted to CS speech (Vu et al., 2012; Yılmaz et al., 2016; Solorio and Liu, 2008; Lyu and Lyu, 2008). These works propose methods to process switches at the clause and word levels, using different cues such as acoustic, prosodic and phonetic features. In this section, we briefly report the ASR methods for processing CS that made our studies possible, with a focus on automatic speech alignment.

1.5.1 Automatic speech recognition

Automatic speech recognition develops methodologies that enable computers to transcribe spoken language into text (Mariani, 1990).

For French, ASR evaluation campaigns were organized in 1997 and in the 2000s, and they contributed to the development of systems for this language (Dolmazon et al., 1997; Gravier et al., 2004). These works are based on various speech types: controlled French speech, broadcast speech and spontaneous speech.

An Algerian Arabic system has been developed as an extension of a Modern Standard Arabic (MSA) system named ALASR (Arabic Loria Automatic Speech Recognition) (Menacer et al., 2017).

This work concluded that Algerian Arabic (AA) is strongly influenced by French and that French acoustic and language models need to be added to the MSA system. Recent interest has therefore focused on ASR for code-switched AA and French (FR) (Amazouz et al., 2016, 2017).

The ASR system of the Laboratoire d’informatique pour la mécanique et les sciences de l’ingénieur (LIMSI-CNRS, Orsay, France) used in this thesis was built on conversational broadcast and telephone speech for both French and Arabic (Gauvain et al., 2002, 2003) (see Section 3.7).

The major components of an ASR system are the acoustic model, the pronunciation dictionary and the language model. The acoustic model is used to predict the phonemes of a language in a given audio input. It uses deep neural networks for frame-level predictions and a statistical model, the Hidden Markov Model (HMM) (Rabiner, 1989), to turn these units into sequential predictions. The pronunciation dictionary gives the phonological pronunciation of each word in the language; combined with the acoustic model, it is used to decode the pronounced phones.
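The division of labour between these components is commonly summarized by the standard statistical formulation of ASR, given here as a general sketch rather than a description of a particular system. X denotes the acoustic observations, W a word sequence, Q a phone sequence and \mathcal{L}(W) the set of pronunciations that the dictionary lists for W. The second equality applies Bayes' rule and drops p(X), which does not depend on W; the last step assumes that X depends on W only through Q and replaces the sum over pronunciations by the best one (the Viterbi approximation).

```latex
\begin{aligned}
\hat{W} &= \operatorname*{arg\,max}_{W} P(W \mid X)
         = \operatorname*{arg\,max}_{W} \, p(X \mid W)\, P(W) \\
        &\approx \operatorname*{arg\,max}_{W} \, \max_{Q \in \mathcal{L}(W)} \,
          \underbrace{p(X \mid Q)}_{\text{acoustic model}}\,
          \underbrace{P(Q \mid W)}_{\text{pronunciation dictionary}}\,
          \underbrace{P(W)}_{\text{language model}}
\end{aligned}
```

When the word sequence W is fixed to a reference transcription, the search over W disappears and only the choice of Q and its time alignment remain.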

The language model assigns probabilities to word sequences in order to assess how plausible a given word sequence is, syntactically and semantically, in a language. This is done by modeling the probability of each word in a sentence given its predecessors.
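By the chain rule, P(w1 ... wn) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn-1); n-gram models approximate each factor by conditioning only on the previous n-1 words. The Python sketch below is a maximum-likelihood bigram model (n = 2) trained on a hypothetical two-sentence toy corpus of transliterated AA/FR tokens. The corpus, the boundary symbols and the function names are assumptions for illustration, and the sketch omits the smoothing that any practical language model needs.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence-boundary markers

# Hypothetical toy corpus of code-switched token sequences
CORPUS = [["rani", "fatigué"], ["rani", "mlih"]]


def train_bigram(sentences: list[list[str]]) -> tuple[Counter, Counter]:
    """Count histories and bigrams, padding each sentence with boundary markers."""
    history_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sentence in sentences:
        tokens = [BOS] + sentence + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts


def sentence_logprob(sentence: list[str], history_counts: Counter,
                     bigram_counts: Counter) -> float:
    """log P(sentence) under a maximum-likelihood bigram model (no smoothing)."""
    tokens = [BOS] + sentence + [EOS]
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        count = bigram_counts[(prev, word)]
        if count == 0:
            return float("-inf")  # unseen bigram: zero probability without smoothing
        logp += math.log(count / history_counts[prev])
    return logp


if __name__ == "__main__":
    hist, bigr = train_bigram(CORPUS)
    print(math.exp(sentence_logprob(["rani", "fatigué"], hist, bigr)))
```

With this corpus, "rani fatigué" receives probability 1 × 0.5 × 1 = 0.5, while any sentence containing an unseen bigram receives zero, which is why real systems smooth their estimates.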

1.5.2 Code-switching forced alignment

Segmenting speech audio into words and phones and labelling it with phone symbols are important steps for phonetic and ASR research. Many programs can assist manual segmentation, for example PRAAT (Boersma et al., 2002; Boersma, 2017), which also enables a wide range of acoustic analyses. However, automatic speech alignment processes large quantities of speech in a very short time, whereas manual processing is estimated at around 800 × real time (Schiel et al., 2012).
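To make the real-time factor concrete, the arithmetic below applies the 800 × real-time estimate to one hour of audio; the eight-hour working day used in the last step is our own assumption.

```latex
t_{\text{manual}} \approx 800 \times t_{\text{audio}}
\quad\Rightarrow\quad
t_{\text{audio}} = 1\ \text{h} \;\mapsto\; t_{\text{manual}} \approx 800\ \text{h}
\approx 100 \text{ working days of } 8\ \text{h}
```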

Forced alignment (FA) of speech, also called forced Viterbi alignment (Forney, 1973) or automatic segmentation and labelling of speech (Ljolje and Riley, 1991; Brugnara et al., 1993), is an automatic method for aligning a text with the audio signal and consequently segmenting the audio signal into words and phones. FA consists in using an ASR system in a restricted mode: the system is given not only the acoustic signal as input, for which it would normally determine the best-matching word sequence, but also the reference transcription. The system's only task is then to locate the boundaries of these words within the acoustic signal and, more interestingly, the phone segments within the words. Forced alignment thus makes it possible to automatically derive phone transcriptions and segmentation from an orthographically transcribed speech signal. To this end, the ASR system requires a pronunciation dictionary including all the words occurring in the transcriptions, together with a set of acoustic phone models.

An interesting aspect of forced alignment is the possibility of introducing multiple pronunciation variants into the dictionary and letting the system choose the variant that best matches the acoustic input signal. The FA method, which is illustrated in Figure 1.2, has been validated in several acoustic-phonetic studies dealing with pronunciation variants (e.g. French liaison, schwa, voicing assimilation, word-final devoicing, regional variants).
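The core of forced alignment with variants can be sketched as a Viterbi search over a fixed phone sequence, run once per pronunciation variant. The Python code below is a simplified illustration, not the LIMSI aligner: it assumes one state per phone (real systems use multi-state HMMs), no transition probabilities, and precomputed per-frame acoustic log-likelihoods. The functions `viterbi_align` and `choose_variant` and the toy scores are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_align(frame_loglik: list[dict[str, float]],
                  phones: list[str]) -> tuple[float, list[tuple[str, int, int]]]:
    """Force-align a phone sequence to frames (one state per phone, each phone >= 1 frame).

    frame_loglik[t][p] is an acoustic log-likelihood for phone p at frame t.
    Returns the best total score and (phone, first_frame, last_frame) segments.
    """
    n_frames, n_phones = len(frame_loglik), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        return NEG_INF, []  # every phone needs at least one frame
    # score[t][j]: best log score of frames 0..t with frame t assigned to phone j
    score = [[NEG_INF] * n_phones for _ in range(n_frames)]
    entered = [[False] * n_phones for _ in range(n_frames)]  # phone j starts at t
    score[0][0] = frame_loglik[0][phones[0]]
    for t in range(1, n_frames):
        for j in range(n_phones):
            stay = score[t - 1][j]
            advance = score[t - 1][j - 1] if j > 0 else NEG_INF
            best = max(stay, advance)
            if best == NEG_INF:
                continue
            entered[t][j] = advance > stay
            score[t][j] = best + frame_loglik[t][phones[j]]
    # Backtrack from the last frame, which must belong to the last phone.
    segments = []
    j, end = n_phones - 1, n_frames - 1
    for t in range(n_frames - 1, 0, -1):
        if entered[t][j]:
            segments.append((phones[j], t, end))
            j, end = j - 1, t - 1
    segments.append((phones[j], 0, end))
    return score[-1][-1], segments[::-1]


def choose_variant(frame_loglik, variants):
    """Align every pronunciation variant and keep the best-scoring one."""
    best = None
    for variant in variants:
        total, segments = viterbi_align(frame_loglik, variant)
        if best is None or total > best[1]:
            best = (variant, total, segments)
    return best


def toy_frame(target: str, inventory=("p", "@", "t", "i")) -> dict[str, float]:
    # Hypothetical scores: 0.0 for the phone "heard" in this frame, -5.0 otherwise
    return {ph: (0.0 if ph == target else -5.0) for ph in inventory}


FRAMES = [toy_frame(ph) for ph in ["p", "p", "t", "i", "i"]]
VARIANTS = [["p", "@", "t", "i"], ["p", "t", "i"]]  # "petit" with / without schwa

if __name__ == "__main__":
    print(choose_variant(FRAMES, VARIANTS))
```

On these toy frames, which contain no schwa-like frame, the schwa-less variant wins with segments p (frames 0-1), t (frame 2) and i (frames 3-4); the schwa variant is forced to spend one poorly matching frame on /@/.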

Figure 1.2: Forced alignment process with variants using an ASR system (Adda-Decker and Lamel, 2017)

Thus, the data obtained with FA constitute a valuable database for linguistic and phonetic studies (Yuan et al., 2013; Adda-Decker and Snoeren, 2011). The methods used in this thesis to perform CS FA are described in Section 3.7. The FA of a given language relies on its linguistic and phonetic units: phones (monophones, diphones, triphones), words, syllables and other phonetic classes such as tones. Work on automatic segmentation of CS shows that the FA of CS data can be carried out in different ways, depending first on the final goal of the task (phone, syllable, word, sentence or language alignment, etc.).
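As an example of how one alignment can serve several of these goals, the Python sketch below assumes a phone-level alignment whose segments already carry word and language labels, and derives coarser word and language tiers by merging consecutive segments. The `PhoneSeg` structure, the function names, the "AA"/"FR" tags and the timings of the two-word utterance are hypothetical and are not the procedure described in Section 3.7.

```python
from itertools import groupby
from typing import Callable, NamedTuple


class PhoneSeg(NamedTuple):
    """One phone-level alignment segment (times in seconds)."""
    phone: str
    start: float
    end: float
    word: str
    word_index: int  # position of the word in the utterance
    lang: str  # hypothetical language tag, e.g. "AA" or "FR"


def merge_runs(segs: list[PhoneSeg], key: Callable[[PhoneSeg], object],
               label: Callable[[PhoneSeg], str]) -> list[tuple[str, float, float]]:
    """Merge consecutive segments sharing the same key into (label, start, end) intervals."""
    tier = []
    for _, run in groupby(segs, key=key):
        group = list(run)
        tier.append((label(group[0]), group[0].start, group[-1].end))
    return tier


def word_tier(segs: list[PhoneSeg]) -> list[tuple[str, float, float]]:
    # Grouping on word_index keeps two identical adjacent words apart.
    return merge_runs(segs, key=lambda s: s.word_index, label=lambda s: s.word)


def language_tier(segs: list[PhoneSeg]) -> list[tuple[str, float, float]]:
    # Consecutive words in the same language collapse into one language segment.
    return merge_runs(segs, key=lambda s: s.lang, label=lambda s: s.lang)


# Hypothetical phone alignment of a two-word AA-FR utterance
ALIGNMENT = [
    PhoneSeg("r", 0.00, 0.05, "rani", 0, "AA"),
    PhoneSeg("a", 0.05, 0.12, "rani", 0, "AA"),
    PhoneSeg("n", 0.12, 0.18, "rani", 0, "AA"),
    PhoneSeg("i", 0.18, 0.25, "rani", 0, "AA"),
    PhoneSeg("f", 0.25, 0.33, "fatigué", 1, "FR"),
    PhoneSeg("a", 0.33, 0.40, "fatigué", 1, "FR"),
    PhoneSeg("t", 0.40, 0.45, "fatigué", 1, "FR"),
    PhoneSeg("i", 0.45, 0.52, "fatigué", 1, "FR"),
    PhoneSeg("g", 0.52, 0.58, "fatigué", 1, "FR"),
    PhoneSeg("e", 0.58, 0.70, "fatigué", 1, "FR"),
]

if __name__ == "__main__":
    print(word_tier(ALIGNMENT))
    print(language_tier(ALIGNMENT))
```

Keeping the finest (phone) level as the primary output and deriving the other tiers from it avoids running a separate alignment for each granularity.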

In the document Linguistic and phonetic investigations of French-Algerian Arabic code-switching: Large corpus studies using automatic speech processing ~ Association Francophone de la Communication Parlée (pages 47-50)