
3.4 Integrative CALL

As the world became more globalized and foreign languages grew increasingly important in both the professional and personal aspects of people's lives, the need to learn languages grew from year to year, and with it the need for more sophisticated CALL applications.

Although early CALL applications were fairly successful – as we saw earlier in this chapter – in recent years there has been increasing demand for more realistic CALL programs with which language learners can train not only their writing skills but also their L2 speaking skills. L2 training with real native speakers is often infeasible, since it involves either traveling to a region where the L2 is spoken or organizing training sessions with natives living nearby. Both solutions are certainly favorable for L2 acquisition; however, they are both time- and cost-intensive.

As we saw in the introduction of this chapter, the two components that characterize the integrative CALL epoch are, on the one hand, the use of coherent courses, preferably enhanced by multimedia elements, and, on the other hand, the integration of such courses into a bigger framework.

Although there were many technological advances to meet the requirements of the new paradigms of coherent multimedia CALL courses, we are mostly interested in the ways in which such courses can meet the needs of L2 speech training. In this particular field, the most significant progress has been made with the introduction of natural language processing technologies and, most importantly, speech recognition technology.

With the integration of speech recognition into CALL tools, it has finally become possible to offer more realistic conversation scenarios in which language learners can speak reasonably freely with the computer.

Next to creating coherent language courses, blended learning scenarios are the second most important component of integrative CALL. With the introduction of the Internet and of widely used learning platforms (such as Moodle), applications can nowadays be deployed on the web and are hence available to the users anytime and anywhere, provided minimal technical requirements are met. This flexible use of CALL makes it possible to easily integrate CALL applications as a supplement to traditional face-to-face teaching, which in turn allows for more varied language teaching that takes advantage of new technologies paired with the pedagogical and subject matter expertise of human teachers.

In this section we will mainly discuss the new technological advances in CALL, whereas the blended learning aspect will be discussed in detail in Chapter 5. Furthermore, in Chapters 4 and 5 we will see how the integrative approach to CALL was implemented in our CALL-SLT tool. By creating realistic dialogue settings, we provide a platform for students to train their spoken L2 competences in coherent courses that are enhanced with multimedia elements, such as video prompts, and which are available in a blended learning environment.

3.4.1 Natural Language Processing

Natural Language Processing (NLP) has played an important role in the development of CALL in recent years. It first came into the picture between 1985 and 1995, when there were some important but still crude attempts at integrating NLP into CALL. The general goal of NLP for SLA is to build computational models of natural languages in order to analyze and generate them and, ultimately, to provide realistic exercises for L2 learners.

Since the field of NLP is extremely vast and there are countless uses of its various components, we will limit our interest to the technology of speech recognition, as it was used in our CALL-SLT application (cf. Chapter 4). Speech recognition allows CALL programs to appear more realistic, since virtual communication partners can be developed. Furthermore, speech recognition makes it possible to integrate spoken L2 production into CALL programs, as we will see in a few examples discussed in subsection 3.4.2, as well as in the following chapters. But first we will more thoroughly investigate the technology of speech recognition in the following subsection.

3.4.1.1 Speech Recognition

Although the first automatic speech recognizers were developed as early as the 1950s (as discussed, for example, in Juang and Rabiner (2005)), they have only become powerful enough for actual use in recent years.

According to Ehsani and Knodt (1998), understanding human speech requires a large amount of linguistic knowledge, such as a command of the phonological, lexical, semantic, grammatical, and pragmatic conventions of a language. Ehsani and Knodt (1998, p. 47) furthermore describe the difference between human and machine speech processing as follows:

Complex cognitive processes account for the human ability to associate acoustic signals with meanings and intentions. For a computer, on the other hand, speech is essentially a series of digital values. However, despite these differences, the core problem of speech recognition is the same for both humans and machines: namely, of finding the best match between a given speech sound and its corresponding word string.
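
The “best match” that Ehsani and Knodt refer to can be stated more formally. The following is the standard probabilistic formulation found in ASR textbooks, added here for illustration rather than taken from the cited source: the recognizer searches for the word string $\hat{W}$ that is most probable given the acoustic observations $X$,

\[
\hat{W} \;=\; \operatorname*{arg\,max}_{W} P(W \mid X) \;=\; \operatorname*{arg\,max}_{W} P(X \mid W)\, P(W),
\]

where the acoustic model $P(X \mid W)$ scores how well the observed sounds fit the hypothesized words, and the language model $P(W)$ scores how plausible the word string is in the language.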

Nowadays, speech recognition is used in many different domains. Examples of this broad usage are telephone applications, such as those used in call centers, and car navigation systems. However, the interest that automatic speech recognition (ASR) holds for this work is that, when integrated into CALL applications, it allows language learners to effectively train their oral productive skills, such as speaking fluency and pronunciation. Although it is still difficult to develop CALL applications that allow free speech in a wide-ranging domain, there are efforts that go beyond simple imitation exercises.

In CALL, the quality of speech recognition technology is particularly important if real spoken interaction with the computer is to be achieved. In order to provide a useful environment in which language learners can effectively train their L2 skills, it is hence important to realistically assess the skill level of the respective target users.

This topic will be discussed in detail in Chapter 5. Another aspect that needs to be considered when developing speech-enhanced CALL is that (beginner) L2 speech typically comes with a more or less pronounced accent, so that system designers need to find an adequate level of tolerance for recognizing potentially erroneous speech.

It is important to find a balance in recognition that is neither too strict nor too forgiving; the right balance may differ between exercise types, as we will see in some examples below.
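
To make the idea of this balance concrete, one can think of it as a per-exercise confidence threshold on the recognizer's output. The following sketch is purely illustrative; the exercise names and threshold values are invented and do not come from any of the systems discussed in this chapter.

```python
# Illustrative sketch: per-exercise acceptance thresholds for ASR output.
# All names and numbers are hypothetical, chosen only to make the idea concrete.

THRESHOLDS = {
    "pronunciation": 0.90,  # strict: even subtle deviations should be flagged
    "conversation": 0.55,   # forgiving: accented but intelligible speech passes
}

def accept(hypothesis_confidence: float, exercise_type: str) -> bool:
    """Accept a recognition hypothesis if its confidence score meets
    the threshold chosen for this type of exercise."""
    return hypothesis_confidence >= THRESHOLDS[exercise_type]

# The same utterance passes in a conversational exercise...
assert accept(0.62, "conversation")
# ...but is rejected in a pronunciation drill, where strictness is desirable.
assert not accept(0.62, "pronunciation")
```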

Before taking a closer look at the processes involved in speech recognition in Chapter 4, we will now focus on the current trends in voice-interactive CALL. In recent years there have been various trends in CALL exercises involving speech technology, such as pronunciation training and literacy exercises, as well as training of linguistic structures and limited conversation.

Computer Assisted Pronunciation Training (CAPT) usually consists of prompting students to repeat isolated words or entire sentences in the L2 in order to practice correct pronunciation and intonation. In contrast to speech recognizers used in communicative L2 exercises, such as for example conversational training, speech recognition for pronunciation training should not account for the wide range of accents with which L2 speech is typically associated. On the contrary, its goal is to detect even subtle deviations from native pronunciation patterns and provide corrective feedback on the correct pronunciation of single phonemes and entire words. A commonly used model to provide feedback on pronunciation exercises is to compare the student's L2 production against a native model, which is typically trained on both native and non-native speech data. When using entire segments or sentences, feedback can additionally be provided on stress and intonation.
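
One widely cited way of scoring such a comparison at the phone level is the Goodness of Pronunciation (GOP) measure of Witt and Young; it is sketched here as an illustration of how native-model comparison can be quantified, without implying that the systems discussed in this chapter use exactly this measure. For a phone $p$ aligned to the acoustic segment $O^{(p)}$ spanning $NF(p)$ frames, and assuming equal phone priors,

\[
\mathrm{GOP}(p) \;=\; \frac{1}{NF(p)} \left| \log \frac{P\!\left(O^{(p)} \mid p\right)}{\max_{q \in Q} P\!\left(O^{(p)} \mid q\right)} \right|,
\]

where $Q$ is the phone inventory; a score beyond a phone-specific threshold flags $p$ as likely mispronounced.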

L2 literacy can also be trained by using speech recognition in reading-aloud exercises. Such exercises are of special interest for languages using non-phonetic writing, as is the case for many Asian languages, where the association between sounds and their written forms needs to be trained (Ehsani and Knodt (1998)). Although the recognition is less challenging than for free speech, one of the main challenges in using technology for such reading-aloud exercises, according to Ehsani and Knodt (1998), is handling disfluencies common in L2 speech, such as hesitation, mispronunciation, false starts, or self-corrections.

To train linguistic structures and conversational skills with the help of speech recognition, one usually distinguishes between closed and open response designs. The main difference between the two designs is that in closed designs the students get a pre-defined choice of possible responses, whereas in open response designs there is no visible restriction and the students need to generate their answers freely. The challenges in closed response designs are largely similar to the ones outlined for reading-aloud exercises, since the system again knows what the student is trying to say but must take into account non-native disfluency. In contrast, accurate speech recognition in open response designs is much more challenging, since a wider range of possible responses needs to be considered during the recognition process.

Although the technological means are at hand today, it is still very laborious and expensive to develop programs that offer truly spontaneous dialogue between the computer and the language learner. This is why most CALL systems of the early 21st century use a closed response design in which possible responses are limited – either by a very tight question design or by different answer proposals from which the learner can choose (Nerbonne (2002)). An example of a closed response question is “Would you like a drink?”, for which the response can be either “yes” or “no”; closed response questions with more scope than yes/no responses are questions of the type “What would you like to drink?” with a multiple-choice list of possible answers, such as a) I would like a tea, b) I would like a coke, or c) I would like a beer.
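
The drinks example can be made concrete with a minimal sketch. The code below is a hypothetical illustration of the closed response principle, not an excerpt from any of the cited systems: the recognizer's output only needs to be matched against the finite set of answers the exercise allows.

```python
# Minimal closed-response matcher (illustrative only).
# In a closed design, the recognition result is compared against the
# pre-defined answer set; anything outside it is simply rejected.

ALLOWED_RESPONSES = {
    "i would like a tea",
    "i would like a coke",
    "i would like a beer",
}

def match_closed_response(asr_hypothesis: str) -> str | None:
    """Return the matched answer, or None if the utterance is not
    one of the pre-defined choices."""
    normalized = asr_hypothesis.lower().strip()
    return normalized if normalized in ALLOWED_RESPONSES else None

print(match_closed_response("I would like a tea"))       # matched
print(match_closed_response("A glass of wine, please"))  # None: outside the design
```

An open response design, by contrast, would have to recognize the second utterance as well, which is precisely what makes it so much harder.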

3.4.2 Integrative CALL Projects

Thanks to the Web 2.0, there is a multitude of CALL programs today that integrate NLP technologies and are available to a large audience. In this section we will introduce some representative projects that use speech recognition for the various goals introduced above.

In Chapter 4 we will discuss how we implemented NLP components – most importantly speech recognition – in CALL-SLT. We will then illustrate that our main focus lies on open response conversational training (as discussed below in subsection 3.4.2.3) and the concept of the Translation Game (as introduced in subsection 3.4.2.4).

3.4.2.1 Pronunciation Training

The SPELL project was designed to automatically assess and improve L2 pronunciation by providing immediate corrective feedback. The project's focus was on French and Italian as the L1 and English as the L2. In order to effectively detect pronunciation errors, the project included typical L1 pronunciations (next to the correct L2 pronunciation) in the grammar network to map L2 errors to the respective L1 (Hiller et al. (1994)). This method allowed for reliable error detection, such as the incorrect pronunciation of the English /th/ sound by substituting it with either a /t/ or /s/ sound, as is often done by non-native speakers. This approach proved to be efficient for language pairs that use similar phonemes (e.g. French/Italian – English) and for which typical errors can be predicted. The downside of mapping L1 errors to L2 production is that it limits the group of potential users; furthermore, this approach is not feasible if there is no information on typical L1-L2 errors (Ehsani and Knodt (1998)).
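
The SPELL approach can be illustrated with a small sketch. This is a hypothetical reconstruction of the idea, not the project's actual grammar network: for each target phone, the recognition grammar also lists the substitutions typical of the learners' L1, so that recognizing a substitution both detects and diagnoses the error.

```python
# Illustrative sketch of SPELL-style L1 error mapping (hypothetical code).
# Alongside the correct target phone, the grammar contains the typical
# L1 substitutions; recognizing one of them identifies the error directly.

PRONUNCIATION_VARIANTS = {
    # target phone: {recognized phone: corrective feedback}
    "th": {
        "th": None,                                                   # correct
        "t": "You said /t/; put the tongue between the teeth for /th/.",
        "s": "You said /s/; put the tongue between the teeth for /th/.",
    },
}

def diagnose(target_phone: str, recognized_phone: str) -> str | None:
    """Return corrective feedback, or None if the phone was pronounced correctly."""
    return PRONUNCIATION_VARIANTS[target_phone][recognized_phone]

print(diagnose("th", "s"))  # the typical French/Italian substitution from above
```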

A project that tries to tackle this problem is the Dutch-CAPT project. The project first created a list of pronunciation errors that should be addressed by the CAPT tool, based on the six criteria that errors are “1) common across speakers of various L1s; 2) perceptually salient; 3) potentially hampering to communication; 4) frequent; 5) persistent, and 6) automatically detectable with sufficient reliability.” (Neri et al. (2007), p. 160) A comprehensive analysis led to eleven Dutch phonemes that met the cited criteria and that were respected in the Dutch-CAPT exercises, which typically consist of role-plays, closed question answering, or repetition of words and minimal pairs. The program includes immediate corrective feedback displayed as a combination of a smiley and a short written comment. In case of mispronunciation the erroneous part is highlighted and the student is prompted to try again.

Another academic project is the Fluency Pronunciation Trainer (Eskenazi and Hansma (1998)), a pronunciation trainer using an open response design. The idea here was to let the users participate more actively by eliciting their answers rather than letting them read a pre-defined structure from the screen. A prompt might look like this: “When did you meet her? (Yesterday).” The Fluency Pronunciation Trainer focused on the training and correction of both phonetics and prosody, arguing that phones that are pronounced perfectly correctly might still be unintelligible if they lack correct timing and pitch (Eskenazi and Hansma (1998)). Whereas pronunciation errors are measured by comparing the mean recognition scores of students against those of native speakers, prosody is measured in various steps. First the relative duration of the phones is measured to detect errors in duration, before moving on to the evaluation of the user's pitch in a second, more advanced step. Corrective feedback is provided to the user on all elements (e.g. indicating if a phone was too long or too short), and in addition a native speaker's sample is provided for comparison.
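
The first, duration-based step could look roughly like the following sketch. It is an illustrative reconstruction with invented tolerance values, not the trainer's actual algorithm: phone durations are normalized by utterance length and compared against a native reference.

```python
# Illustrative duration check (hypothetical reconstruction, invented thresholds).
# Relative phone durations are compared against a native reference;
# phones far off the reference are reported as too long or too short.

def duration_feedback(student: dict[str, float],
                      native: dict[str, float],
                      tolerance: float = 0.3) -> dict[str, str]:
    """Compare relative phone durations (as fractions of total utterance
    duration) and report phones outside the tolerance band."""
    total_student = sum(student.values())
    total_native = sum(native.values())
    feedback = {}
    for phone, duration in student.items():
        rel_student = duration / total_student
        rel_native = native[phone] / total_native
        if rel_student > rel_native * (1 + tolerance):
            feedback[phone] = "too long"
        elif rel_student < rel_native * (1 - tolerance):
            feedback[phone] = "too short"
    return feedback

# Example: the student drags out the stressed vowel relative to the native model.
print(duration_feedback({"h": 0.06, "eh": 0.35, "l": 0.08, "ow": 0.24},
                        {"h": 0.05, "eh": 0.18, "l": 0.07, "ow": 0.22}))
# -> {'eh': 'too long'}
```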

A system developed by Tsubota et al. (2004) for Japanese learners of English provides exercises for role-play conversation and training of specific phones. During the role-play part, students watch a short video clip with a question and then answer following a script provided by the system. During the first phase, the system tracks all pronunciation errors made by the user without informing him or her about them, in order to keep up the conversation flow. Based on the created pronunciation profile, the student then moves on to the second phase, in which the erroneous phones are trained separately (based on a speech recognition module trained on Japanese non-native speakers of English). In this specialized pronunciation training immediate corrective feedback is provided.

A popular commercial CALL program that includes speech recognition for pronunciation training is Auralog's Tell Me More. In contrast to the Fluency Pronunciation Trainer, the Tell Me More software uses a closed response design to train correct pronunciation of L2 phones. Before each training session the user can choose between (1) a “Free-To-Roam Mode”, in which the learners can choose the topics themselves, as well as the nature of the exercises they want to complete; (2) a “Guided Mode”, which is a programmed mode adapted to the user's goals and time constraints; or (3) a “Dynamic Mode”, in which the learners' progress is evaluated automatically by the system in order to adapt the exercises to the learners' needs as they progress.

The exercises are roughly divided into pronunciation, activity and video exercises, as defined below:

1. Pronunciation exercises include dialogues, sentence and word pronunciation, as well as phonetic exercises, where the user can choose between different dialects (e.g. British or American English).

2. Activity exercises include picture-word association, word searches, word association, find the right word, fill-in-the-blanks, words and topics, words and functions, grammar practice, mystery phrases, crossword puzzles, word order, sentence practice, dictation and written expression exercises.

3. Video exercises include different video scenarios, for which the language learner has to answer questions.

For all exercise types, Tell Me More focuses on real-world scenarios in order to motivate the learner to train foreign language skills in useful contexts.

Thanks to the integrated speech recognition technology, the system continuously corrects the learner while he or she completes a pronunciation exercise. The user also has the option of training certain phonemes specifically, as can be seen in Figure 3.1. This option is especially useful if the language learner experiences difficulties with a particular phoneme, such as the “th” in English.

Figure 3.1: Tell Me More phoneme pronunciation exercise

3.4.2.2 Closed Response Conversational Training

A representative project of closed response conversational training is the Voice Interactive Language Training System (VILTS), which combines listening comprehension, reading and speaking exercises for French as an L2 (Rypa and Price (1999)). The speaking exercises range from simple to more complex levels of difficulty, depending on the student's L2 level. A session usually starts by prompting the user to read out loud one of three possible answers to an orally posed question; more difficult exercises consist of reading turns in a dialogue; and the highest level consists of choosing and reading out loud one of various proposed responses in a branched dialogue (Price and Rypa (1998)). System feedback is based on whether the response was understood by the system and whether it is correct.

Another, more recent approach to closed response conversational training is represented by the GOBL (Games Online for Basic Language Learning) project (Strik et al. (2013)). This project uses a gamified approach to SLA, with online mini games that train fluency and accuracy in a task-based fashion. Such mini games have been developed for learners of French, Dutch and English. Three different games were created within the project scope, namely a lie-detector game (where the user has to decide whether a pronounced sentence is correct), a finger print collector (where the user collects words that can be used to fill in the gap of a provided sentence) and a roof-surfing parrot (where the user guides a parrot by pronouncing the correct sentence that continues the dialogue logically). If played in succession, the first two games are intended to teach vocabulary and grammar that can then be applied during the dialogue game.

The games are time-paced and contain a scoring mechanism that attributes points for correct answers. The games provide both implicit and explicit feedback. Implicit feedback is given during the game by attributing or deducting points, and explicit feedback is provided once the game is finished, in the form of an overview of the committed errors.

One of the best-known commercial representatives that include closed response conversational training is Rosetta Stone (RosettaStone (2012)). Next to traditional exercises, such as word-picture association, Rosetta Stone also includes speech recognition technology to engage users to speak in the L2. Such speech exercises are typically repetitions of a native speaker's output in the form of either isolated words or entire sentences. More advanced exercises prompt the student with unambiguous images that appeared earlier in the lesson and require the student to describe what is depicted (e.g. an image showing a man drinking a coffee, where the user needs to say, “The man drinks a coffee”) or to answer simple questions. In this kind of exercise, only words that have been taught by Rosetta Stone in earlier exercises are accepted. Next to those “core lessons,” Rosetta Stone also offers “focused activities” with which specific L2 skills can be trained, such as pronunciation, grammar, listening/reading, vocabulary, speaking or writing. In speech exercises the users have the option of using the speech analysis function, which maps the user's input against a native speaker's output model. This option is, however, only available for mimicry exercises in which a native speaker's sample is reproduced one-to-one by the student. Feedback is given on emphasis and pitch by means of a graph comparing student and native speaker. In addition, Rosetta Stone
