
CALL-SLT

4.2 CALL-SLT Architecture

The CALL-SLT tool was developed on the open-source Regulus platform for building grammar-based speech understanding systems (Rayner et al. (2006)), which helps developers write and compile speech grammars. CALL-SLT makes use of two main technologies: a grammar-based speech recognition module to process users’ spoken input, and an interlingua-based machine translation module to verify the grammatical and semantic correctness of the users’ input. These two modules will be discussed in detail in sections 4.2.1 and 4.2.2.

To begin, we will introduce the main functioning of CALL-SLT as shown in the schematic overview in Figure 4.1.

1. For every lesson, the course developer first assembles a corpus in which all sentences are collected that the system should recognize as correct responses. If, for example, we design a course to teach animal names to students learning English as an L2, the corpus should contain sentences such as “This is an elephant,” “It is a cat,” “That is a dog,” etc. The goal of the corpus is to cover all possible answers, in terms of both vocabulary and syntactic structures.

The corpus has two functions in the CALL-SLT game: it first serves as a base for choosing a prompt that will be shown to the student, and later on it is used for choosing the help example that is shown to the student in case he or she needs assistance in responding.

2. Before choosing one of the sentences in the corpus, the system first translates the corpus into an interlingua form – an abstract format independent of both the L1 and the L2 – that will later help verify whether the student’s response matches the initial prompt.

3. From this translated corpus, the system then picks one sentence (following the structure indicated by the course designer) and presents it to the student in the form of a prompt. This prompt is transformed into a surface representation of the interlingua (gloss), in order to ensure that the prompt is in a user-readable format. CALL-SLT makes use of an abstract yet still easily readable form that prevents the student from translating the prompt word for word into the L2.

4. Once the student is shown a prompt, he or she then produces a spoken sentence in the L2 that corresponds to the semantic value shown in the prompt.

5. As the next step, the system performs speech recognition by transforming the student’s oral input into a string of characters and translating the result back into the interlingua, in order to generate the response in the same format as the initial prompt.

6. Once the response is translated into the interlingua, the system matches it against the underlying interlingua of the initial prompt and decides whether or not to accept the student’s utterance.

7. If the two fragments of interlingua in step 6 correspond, the student is given positive feedback and is led to the next prompt (back to step 1).

8. If the two fragments of interlingua do not correspond, the student is given negative feedback and is asked to repeat his or her response. Should the student not know how to respond, the help function can be used, where the system picks a sentence from the corpus that corresponds to the interlingua of the initial prompt and presents it to the student in both written and spoken form (as defined by the course designer).
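The game loop in steps 1 to 8 can be sketched in strongly simplified form. All names and data structures below are hypothetical placeholders: the interlingua is mocked as a set of semantic key-value pairs, and both recognition and translation as a single lookup table, whereas the real system uses Regulus and Nuance components.

```python
import random

# Toy stand-in for the real components: the "interlingua" is modeled as a
# frozenset of semantic key-value pairs, and recognition/translation as a
# lookup from transcript to interlingua. All entries are invented examples.
INTERLINGUA = {
    "this is an elephant": frozenset({("deixis", "this"), ("animal", "elephant")}),
    "it is a cat": frozenset({("deixis", "it"), ("animal", "cat")}),
}

def pick_prompt(corpus):
    """Steps 2-3: pick a corpus sentence and return it with its interlingua."""
    sentence = random.choice(corpus)
    return sentence, INTERLINGUA[sentence]

def recognize_and_translate(spoken_input):
    """Step 5: speech recognition plus translation back into the interlingua
    (real ASR is skipped; the transcript is looked up directly)."""
    return INTERLINGUA.get(spoken_input.lower())

def judge(prompt_semantics, response_semantics):
    """Steps 6-8: accept iff the two interlingua fragments match."""
    return response_semantics == prompt_semantics

# One round of the game: prompt the student, then judge a (perfect) response.
sentence, semantics = pick_prompt(list(INTERLINGUA))
accepted = judge(semantics, recognize_and_translate(sentence))
```

The essential design point survives the simplification: prompt and response are never compared as strings, only as interlingua fragments, which is what makes the comparison L1- and L2-independent.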

As can be seen from the overview given above, NLP is evidently one of the most important influences on CALL-SLT. We will now more closely examine the two main NLP technologies used in CALL-SLT – grammar-based speech recognition and interlingua-based machine translation – in the two subsequent subsections 4.2.1 and 4.2.2.

4.2.1 Grammar-based Speech Recognizer

As was already mentioned in Chapter 3, section 3.4, speech recognition is an important technology that allows CALL applications to create more realistic learning environments in which students can train their oral productive skills in the L2. As discussed,

Figure 4.1: CALL-SLT scheme

CALL-SLT attempts to offer a reasonable range of freedom to the learner in creating responses to a given system prompt in order to maximize the authenticity of his or her experience.

In the previous chapter we discussed the importance of striking a balance between recognition that is too forgiving and recognition that is too strict. Whereas we want to prevent users from receiving positive feedback on clearly incorrect utterances (e.g. incorrect grammar, incorrect use of vocabulary), depending on the type of course (e.g. dialogue games), we might want to accept utterances with small forgivable glitches (such as slightly incorrect pronunciation, wrong intonation, etc.) that do not negatively influence the communication flow. For dialogue games, we always need to bear in mind that in real-life conversations, students would also be understood by native speakers if they had an accent and slightly incorrect pronunciation or intonation.

The speech recognition used in CALL-SLT is a grammar-based speech recognizer, which is built on top of the Regulus framework (Rayner et al. (2006)). The speech recognition process itself rests on three components: signal processing, an acoustic model with a pronunciation lexicon, and a language model.

Figure 4.2: Speech recognition process (Rayner et al. (2006))

Figure 4.2 shows a schematic model of the various processes taking place in every speech recognition event: the acoustic model and the pronunciation dictionary suggest the various possible sequences of words based on the input speech signal, and these are then filtered by the language model to pick the most probable ones (Bouillon et al. (in press)). In the CALL-SLT application we make use of the Nuance 8.5 Toolkit (Nuance (2015)) for most of the displayed components. Only the language model is rule-based and directly derived from the CALL-SLT lessons. We will now address the components illustrated in Figure 4.2 more thoroughly.

The acoustic model is responsible for defining the correspondence between the phonemes of a given language and their spoken realizations (Bouillon et al. (in press)).

These correspondences depend on the one hand on the context and on the other hand on the speaker who utters the word(s) in question. A classic example of speaker dependence is the word tomato, which is pronounced differently depending on the speaker’s dialect: an American user might pronounce it as [t@'meI,toU], whereas in British English it would be pronounced as [t@'ma:t@U].

To train such an acoustic model, large annotated speech corpora are used together with machine learning methods. One downside to the acoustic model is that it depends largely on its training data. If a speaker has a dialect or an accent that was not included in the training data, the chances are high that he or she will not reliably be recognized by the system. To counter this problem, the Nuance acoustic model used in this project additionally offers a pronunciation lexicon to which unrecognized pronunciation variants could be added manually by the system developers.
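The role of such a pronunciation lexicon can be illustrated with a minimal sketch, assuming a simple word-to-phoneme-string mapping. The transcriptions and the add_variant helper are invented for illustration and do not reflect Nuance's actual lexicon format.

```python
# Hypothetical pronunciation lexicon: each word maps to a list of phoneme
# strings, and developers add unrecognized variants by hand. The ARPAbet-like
# transcriptions are illustrative only, not Nuance's actual phone set.
LEXICON = {"tomato": ["t ax m ey t ow"]}  # American variant from training data

def add_variant(lexicon, word, phonemes):
    """Manually register an additional pronunciation variant for a word."""
    lexicon.setdefault(word, [])
    if phonemes not in lexicon[word]:  # avoid duplicate entries
        lexicon[word].append(phonemes)

# A developer adds the British variant that the acoustic model misses:
add_variant(LEXICON, "tomato", "t ax m aa t ow")
```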

For a CALL scenario we need to consider the fact that most acoustic models are trained on L1 speaker data and might hence perform worse for L2 speakers who may not pronounce all sounds of the L2 correctly (an evaluation of this topic is discussed in Rayner et al. (2012)). The most promising way to tackle this problem would be to use L2 speech data to train the acoustic model, but since this is an expensive solution (given that L1 data is more commonly available than annotated L2 speech data), an alternative option is to improve recognition via the language model, as we will see below.

Another important technology involved in the speech recognition process – next to the acoustic model – is the language model, which is responsible for restricting the recognition of all possible words to a subset of probable words. In other words, it “defines the relative probabilities of different possible sequences of words, whose possible realizations as sequences of phonemes are defined by the pronunciation dictionary” (Bouillon et al. (in press)). This restriction makes the recognition process considerably faster, since only a small subset of all possibilities needs to be searched upon each recognition event, and at the same time the recognition result is more accurate if the number of choices is tightly constrained.

The language model specifies which words should be recognized by the system and in which order they are allowed to appear. There are two different approaches that can be applied to distinguish acceptable from unacceptable sequences (e.g. fifty nine vs. *fifteen nine). The first approach is a data-driven or statistical approach, and the second is a linguistic or rule-based approach, as is used in our project. To better understand the advantages of using a linguistic instead of a statistical language model for the CALL-SLT speech recognition module, we will first briefly introduce both approaches along with their advantages and disadvantages for CALL, and we will then discuss the reasoning behind choosing a linguistic approach for the project at hand.

4.2.1.1 Statistical Language Models

Data-driven language models are also often referred to as “corpus-based” or “statistical” models, since they are based on the statistical probability that a word will be followed by a certain other word. In order to calculate this probability, n-grams – in most cases either bigrams (2-word sequences) or trigrams (3-word sequences) – are computed on the basis of large corpora. If we take the example cited above, we can determine that the probability of “fifteen” being followed by “nine” is far lower than the probability of “fifty” being followed by “nine” (2% vs. 98% respectively, according to a simple Google search). A speech recognizer would therefore recognize “fifty nine” as a correct utterance, due to its high statistical probability, and “*fifteen nine” as an incorrect one.
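A minimal sketch of how such bigram probabilities are estimated from a corpus, using a toy maximum-likelihood estimate; the corpus and the resulting numbers are invented for illustration and are not the figures cited above.

```python
from collections import Counter

def bigram_probs(sentences):
    """Estimate P(w2 | w1) by maximum likelihood from a list of
    tokenized training sentences: count(w1 w2) / count(w1 as left word)."""
    left_counts, pair_counts = Counter(), Counter()
    for tokens in sentences:
        for w1, w2 in zip(tokens, tokens[1:]):
            left_counts[w1] += 1
            pair_counts[(w1, w2)] += 1
    return lambda w1, w2: (pair_counts[(w1, w2)] / left_counts[w1]
                           if left_counts[w1] else 0.0)

# Toy training corpus (invented):
corpus = [
    "fifty nine euros".split(),
    "fifty two euros".split(),
    "fifteen euros".split(),
]
p = bigram_probs(corpus)
# p("fifty", "nine") is high, p("fifteen", "nine") is zero, so a recognizer
# guided by this model prefers "fifty nine" over "*fifteen nine".
```

Note that the model only reflects frequency, not grammar: an ungrammatical sequence that happens to be frequent in the training data would receive a high probability all the same, which is exactly the CALL-specific weakness discussed below.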

One of the most important advantages of statistical language models is that they are relatively easy to train from a set of training data (sequences of observations). Once the data is available, the necessary calculations can easily be programmed.

This advantage leads us directly to the first disadvantage, however, which is that a large quantity of data in the form of corpora is needed in order to create a representative probability chart. Another disadvantage of data-driven language models, which is especially problematic for CALL applications, is that grammatically incorrect n-grams might be recognized as correct (due to their high probability), since there is no underlying grammar controlling for grammatical correctness. Since a CALL application aims to teach grammatical correctness, this model is clearly not favorable for our domain.

4.2.1.2 Linguistic Language Models

Rule-based language models (also called “grammar-based” or “linguistic” language models) differ from data-driven models as they do not require extensive training data, but instead use an underlying grammar (usually in a context-free notation) in combination with a coverage corpus to describe all acceptable structures. In this approach, the speech recognition system only attempts to recognize strings that are covered by the underlying grammar and that appear in the corpus (Rayner et al. (2006)).

Regulus deals with grammars in an abstracted way, by compiling unification grammars (or feature grammars) into annotated Context-Free Grammar (CFG) language models that are then expressed in the Nuance Grammar Specification Language (GSL), which is also a kind of CFG (Rayner et al. (2004)). By using the Nuance Toolkit compiler utility, GSL models can then be converted into runnable speech recognizers. This ultimately means that a unification grammar can be compiled into a speech recognizer (for more information on this topic, cf. Rayner et al. (2004)). The advantage of this approach is that unification grammars are much more compact. Furthermore, the grammars are specialized (as we will see in the following subsection 4.2.2) in order to reduce the size of the entire grammar and to restrict it to the domain-specific content.

This helps to improve the language model as there is less room for ambiguity and, at the same time, the difficulty level of the recognition can be varied (the bigger the specialized sub-content, the more difficult it is for a correct response to be recognized).

To summarize the advantages and disadvantages of the linguistic language model, we can state that the main advantages of rule-based language models are that the underlying grammar guarantees there are no grammatical errors in the recognition result, that no training data is needed, and that the rules can be reused across all contents and lessons. A disadvantage of this approach, however, is that a large number of rules must be written in order to cover all possible structures that are grammatically correct. This disadvantage might also have an impact on the recognition result, since only utterances that are covered by the grammar can potentially be recognized by the system. As a consequence, utterances that use structures which are not covered by the grammar (or even only slightly deviate from the covered structures) will be rejected by the system. This needs to be kept in mind when evaluating the system’s recognition performance.
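The defining behavior of a rule-based model, that only strings derivable from the grammar are ever recognized, can be illustrated with a toy context-free recognizer. The grammar, vocabulary and function names below are invented for illustration; real Regulus grammars are unification grammars compiled into Nuance GSL, not Python dictionaries.

```python
# Toy context-free grammar for a restaurant-style domain. Uppercase symbols
# are non-terminals (keys of GRAMMAR); everything else is a terminal word.
GRAMMAR = {
    "S":      [["NP_I", "VP"]],
    "NP_I":   [["i"]],
    "VP":     [["would", "like", "NP"]],
    "NP":     [["DET", "NOUN"]],
    "DET":    [["a"], ["an"]],
    "NOUN":   [["lemonade"], ["coffee"]],
}

def derive(symbol, tokens):
    """Yield every number of tokens that some derivation of `symbol`
    can consume from the front of `tokens`."""
    if symbol not in GRAMMAR:                      # terminal word
        if tokens and tokens[0] == symbol:
            yield 1
        return
    for production in GRAMMAR[symbol]:
        consumed = {0}
        for part in production:                    # consume parts left to right
            consumed = {c + n for c in consumed
                        for n in derive(part, tokens[c:])}
        yield from consumed

def accepted(sentence):
    """Accept iff the whole sentence is derivable from the start symbol."""
    tokens = sentence.lower().split()
    return len(tokens) in derive("S", tokens)
```

With this model, “I would like a lemonade” is accepted, while the ungrammatical “I would like lemonade” is rejected outright, mirroring the point above: anything outside the grammar's coverage, even a small deviation, cannot be recognized.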

If we apply the above-mentioned advantages and disadvantages of statistical and linguistic models to the CALL scenario, our reasoning for choosing a linguistic approach for CALL-SLT is based on the following points:

• Linguistic models can ensure that only grammatically correct utterances are accepted by the system, which is crucial for the SLA/CALL scenario.

• Since linguistic models are clearly narrowed down to the taught content, it is generally easier for answers to be recognized by the system.

• Linguistic models allow the course designers to derive various recognition levels with a relatively small amount of data. By varying the size of the corpus, the recognition difficulty level can be controlled efficiently. For a less challenging level, only the corpus content that belongs to the smallest unit (e.g. current lesson) is considered, whereas more challenging levels can include the entire corpus (containing the content of multiple lessons).
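The third point can be sketched as follows, assuming a hypothetical corpus format in which each entry is tagged with the lesson it belongs to; the field names and helper are invented, not the actual CALL-SLT data format.

```python
# Hypothetical lesson-tagged corpus (invented field names and content).
CORPUS = [
    {"lesson": 1, "sentence": "this is a cat"},
    {"lesson": 1, "sentence": "this is a dog"},
    {"lesson": 2, "sentence": "that is an elephant"},
]

def language_model_sentences(corpus, level, current_lesson):
    """Choose which sentences feed the language model: 'easy' restricts
    recognition to the current lesson, 'hard' searches the whole corpus."""
    if level == "easy":
        return [e["sentence"] for e in corpus if e["lesson"] == current_lesson]
    return [e["sentence"] for e in corpus]
```

The smaller the sentence set fed to the model, the more tightly constrained (and hence easier) the recognition task, which is the mechanism the bullet point describes.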

Once a user’s utterance has been recognized by the system as we have discussed in this subsection, the input needs to be analyzed in order to determine whether or not the utterance corresponds to the initial prompt. In the following subsection we will discuss how this analysis works.

4.2.2 Interlingua-based Machine Translation

The second technology involved in the CALL-SLT system that we discuss in more detail is an interlingua-based machine translation module. As we saw earlier in this chapter, the interlingua is mainly responsible for creating a common base on which a student’s utterance can be judged. This machine translation module is based on the Regulus toolkit, which contains a number of general resource grammars (one per language) that are written in a unification-based feature grammar notation (as discussed in subsection 4.2.1.2). These resource grammars are combined with a domain-specific content-word lexicon, whereas function words are taken from existing resource function-word lexica (Rayner et al. (2010b)).

These resources first come into play when translating the development corpus into the interlingua, in order to extract all possible prompts for a given lesson (as seen in step 1 under section 4.2). They are also the basis for the linguistic language model used for speech recognition, as discussed in subsection 4.2.1.2. To extract the possible prompts, the relevant grammar and content-word lexicon are used to parse the domain corpus, generating a number of analysis trees that are then processed by the Explanation-Based Learning (EBL) algorithm. This processing step produces a specialized sub-grammar, specific to the chosen domain, that only takes into account the rules and declarations used for that specific domain. The specialized grammar is then further compiled into CFG notation and finally into a Nuance grammar, using Nuance utilities that allow it to be run on the Nuance platform. The latter is necessary for the speech recognition processing (cf. 4.2.1).

Figure 4.3: The two levels of abstraction in Regulus (Rayner et al. (2006), p. 15)

The Regulus approach uses two levels of abstraction, which can be seen in Figure 4.3. The first level of abstraction is the specialization of the general Feature Grammar (FG). Since the extensive general FG is too complex to be compiled into a CFG form, an appropriate subset of this grammar is extracted via the EBL algorithm. This method takes into consideration the general FG, the added domain lexicon, a corpus of training examples and a set of declarations. The goal of this grammar specialization is to reduce the size and the ambiguity of the initial FG and to create a manageable domain FG (Rayner et al. (2006)). The second level of abstraction is the compilation of the abstract and complex specialized FG into a CFG grammar. Once the CFG grammar has been transformed into a GSL format, it is then compiled into a Nuance recognition package.
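The first level of abstraction, grammar specialization via EBL, can be caricatured in a few lines: parse the training corpus with the general grammar, record which rules each parse uses, and retain only those. This toy sketch is an assumption-laden simplification; it elides the parser itself and the flattening of rule combinations that real EBL performs, and the rule notation is invented.

```python
# Invented rule names standing in for a general feature grammar.
GENERAL_GRAMMAR = {
    "S -> NP VP", "NP -> DET NOUN", "NP -> PRONOUN",
    "VP -> V NP", "VP -> V PP", "PP -> P NP",
}

def specialize(general_grammar, rules_used_per_sentence):
    """Keep exactly the rules exercised by at least one training sentence;
    everything else is dropped from the domain grammar."""
    used = set().union(*rules_used_per_sentence)
    return general_grammar & used

# We assume the parse trees (hence the used-rule sets) are already given:
training_usage = [
    {"S -> NP VP", "NP -> DET NOUN", "VP -> V NP"},
    {"S -> NP VP", "NP -> PRONOUN", "VP -> V NP"},
]
domain_grammar = specialize(GENERAL_GRAMMAR, training_usage)
# "VP -> V PP" and "PP -> P NP" drop out: the domain corpus never needs them.
```

Even in this caricature, the two goals stated above are visible: the resulting grammar is smaller and less ambiguous, because structures the domain never uses can no longer be derived.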

In this thesis we will not go into detail about the underlying processing, but rather look at some concrete examples of the grammar writing process. For more information on the processing, see Rayner et al. (2006) and Santaholma (2010). In the following subsections we will first see how grammar and lexicon rules are constructed, before discussing interlingua rules.

4.2.2.1 Grammar and Lexicon Rules

As we have already seen, the rule-based grammars are a core element of CALL-SLT as they are responsible for various processes. For the work at hand we used a pre-existing grammar and adapted it to the content covered in the customized CALL-SLT lessons; these lessons are discussed in Chapter 5, section 5.4. The goal of this subsection is to give a short introduction to how such grammar and lexicon rules are written. More detailed information on grammar development in general can be found in Rayner et al. (2006), and specific information on how such grammars can be shared across languages for speech recognition purposes can be found in Bouillon et al. (2006).

The grammar rules written in Regulus are based on the Definite Clause Grammar (DCG) formalism and consist of categories (e.g. noun phrase = np, determiner = det or noun = noun) that are followed by a list of feature-value pairs. Feature-value pairs can either take fixed values (e.g. det_type=indef) or variables that are substituted by the same value within one rule (e.g. sem_np_type=SemType).

The example given below describes a noun phrase consisting of a determiner and a noun. The determiner is further defined by the feature det_type with the corresponding value indef, indicating that the noun is used with an indefinite article. The noun description also contains a series of feature-value pairs, namely sem (the semantic value), sem_np_type (the semantic type of the noun phrase), takes_det_type (indicating what kind of determiner is used), can_be_pp (indicating if it can be used as a prepositional phrase) and agr (giving information on agreement, here the third person singular).

With the following rule a noun phrase of the type “a lemonade” could be analyzed:

np:[sem=concat(Det, Noun), sem_np_type=SemType, can_be_pp=false] -->
   det:[sem=Det, det_type=indef],
   noun:[sem=Noun, sem_np_type=SemType, takes_det_type=Det, can_be_pp=false, agr=sg/\3]

As we can see from the grammar rule example above, Regulus makes use of semantic representations (sem). With this mechanism higher-level semantics are built from the semantic representations of the various components, which pass their values upwards to the respective mother node (Rayner et al. (2008b)).
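The feature matching that drives such rules can be illustrated with a minimal unification function over flat feature dictionaries; this is a sketch of the general idea only, since real Regulus feature structures additionally support variables and upward value passing, which this simplification omits.

```python
def unify(features_a, features_b):
    """Unify two flat feature dictionaries: every feature present in both
    must carry the same value, otherwise unification fails (returns None).
    On success, the merged feature set is returned."""
    for key in features_a.keys() & features_b.keys():
        if features_a[key] != features_b[key]:
            return None                    # clash: e.g. det_type def vs indef
    return {**features_a, **features_b}

# Invented feature values: the determiner's type must match what the
# noun's entry expects before the np rule can apply.
det = {"det_type": "indef"}
noun_expects = {"det_type": "indef", "agr": "3sg"}
compatible = unify(det, noun_expects)      # succeeds: merged features
clash = unify({"det_type": "def"}, noun_expects)  # fails: None
```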

Lexicon rules are written in the same way as grammar rules. An example of a lexicon rule for the entry lemonade is displayed below, additionally making use of the macro mechanism that allows for generalizations in both grammar and lexicon rules. The macro in our example defines the rule for all “drink” nouns, such as lemonade, orange