
Evaluation Methodology

6.3 Evaluation Designs

Having discussed our experiment methodology in section 6.2, we now turn our attention to how the collected data can best be evaluated. In the following subsections we introduce our two main evaluation designs, namely the control group design and the single-case design.

6.3.1 Control Group Experiment Design

As we discussed in the introduction to this chapter, the control group experiment design (e.g. Fisher (1929)) is one of the most commonly used quantitative experiment designs.

It dates back a long way and is still a very popular approach today. In recent years, however, critical voices have grown louder in questioning the suitability of control group designs for behavioral and educational studies, arguing that the individual should be the focus of interest (e.g. Shapiro (1961)).

In this section we will examine the historical background of the two research designs, as well as the methodology behind the two approaches. We will then discuss their respective advantages and their suitability for our setting and the evaluation of CALL-SLT.

6.3.1.1 Historical Background

Control group experiments date back to Fisher, who introduced Null Hypothesis Significance Testing (NHST) to behavioral science during the 1920s and 1930s with his experiments to test the usefulness of agricultural innovations (Fisher (1929)).

His goal was to conduct experiments with two randomly selected subgroups of a specific population, where he applied an intervention or treatment to one group and not the other. He then monitored the change in behavior in the two groups over a specified period of time. At the end of the experiment, he compared the mean for a given dependent variable of the control group (no treatment) with the mean of the experimental group (with treatment). With this method Fisher wanted to rule out the possibility that accidental occurrences that were due to uncontrolled circumstances could have a significant effect and would be generalized to an entire population. Fisher (Fisher (1929), p. 189) explained this as follows:

In the investigation of living beings by biological methods, statistical tests of significance are essential. Their function is to prevent us being deceived by accidental occurrences, due not to causes we wish to study, or are trying to detect, but to a combination of many other circumstances which we cannot control. An observation is judged significant, if it would rarely have been produced, in the absence of a real cause of the kind we are seeking. It is common practice to judge a result significant, if it is of such a magnitude that it would have been produced by chance not more frequently than once in twenty trials. This is an arbitrary, but convenient, level of significance for the practical investigator, but it does not mean that he allows himself to be deceived once every twenty experiments. The test of significance only tells him what to ignore, namely all experiments in which significant results are not obtained. He should only claim that a phenomenon is experimentally demonstrable when he knows how to design an experiment so that it will rarely fail to give a significant result. Consequently, isolated significant results which he does not know how to reproduce are left in suspense pending further investigation.

As can be seen from this quote, Fisher believed that an effect is demonstrated if the difference is significant at p < 0.05. At the same time, he relativized this statement by noting that only significant results that can be reproduced carry real weight when making generalizations to the population.

Lindquist was one of the first researchers to apply a slightly modified version of Fisher’s significance testing to the educational field. In the cluster randomized controlled trial method, pre-existing groups of participants are chosen for control and experimental groups (as is the case with school classes, for example), instead of the random grouping of individuals proposed by Fisher. Lindquist devised this adapted method following an observation made in the educational setting, as described in Lindquist (1940), p. v:

[...] we have overlooked the very significant fact that most of our samples, however large in terms of numbers of individual observations, are not simple random samples, but consist of relatively homogeneous and intact subgroups, such as the pupils in a single school or under a single teacher.

We faced the same situation when applying control group designs in our CALL evaluation, since we could not randomly assign students to a control and an experimental group for pedagogical reasons, and hence needed to work with pre-defined clusters, which in our case were school classes. Technically speaking, those units are also a random selection of the entire population, since class attribution does not follow a specific pattern. Given that we worked with a variety of teachers at different schools for our CALL-SLT evaluation, this factor should not impact the validity of our experiments.

In the following subsection we will investigate the methodology necessary to conduct valid control group – or cluster – evaluations.

6.3.1.2 Methodology

In practice, any control group experiment design starts by forming two randomly selected sample groups out of a bigger population of possible subjects. In the specific case of a cluster randomized controlled trial, the groups are not formed by randomized groupings of individuals, but rather by a randomized choice of pre-defined groups. One of the two groups then acts as the control group (a group without intervention), while the other acts as the experimental group, to which the researcher applies an intervention. The researcher defines a hypothesis and a null hypothesis against which the hypothesis is tested. The hypothesis states that an independent variable has a noticeable effect on a dependent variable, whereas the null hypothesis states that there is no difference between the groups on the dependent variable (Morgan and Morgan (2009)). By comparing the control group against the experimental group, the researcher tries to reject the established null hypothesis in favor of the hypothesis. This is done by means of statistical inference, comparing the group means.
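The comparison of group means described here can be sketched with a Welch two-sample t-test. This is only an illustration, not the analysis used in the thesis: the scores below are invented, and the critical value is the two-tailed 5% cut-off for the resulting degrees of freedom.

```python
import math
from statistics import mean, variance

def welch_t(control, experimental):
    """Welch's two-sample t-statistic and degrees of freedom."""
    n1, n2 = len(control), len(experimental)
    v1, v2 = variance(control), variance(experimental)  # sample variances
    se = math.sqrt(v1 / n1 + v2 / n2)                   # standard error of the mean difference
    t = (mean(experimental) - mean(control)) / se
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical post-test scores for a control and an experimental class
control = [52, 55, 50, 53, 51, 54]
experimental = [58, 61, 57, 60, 59, 62]

t, df = welch_t(control, experimental)
# 2.228 is the two-tailed critical t at p = 0.05 for df = 10
reject_null = abs(t) > 2.228
print(round(t, 3), round(df, 1), reject_null)  # → 6.481 10.0 True
```

Since the difference in group means is large relative to its standard error, the null hypothesis of "no difference between groups" would be rejected here.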

The comparison of the average response of the subjects in the two groups provides an objective evaluation of the mean effectiveness of the applied intervention. One important condition is therefore that the two groups (control and experimental) are comparable (with the exception of the independent variable).

An important issue when applying control group designs is that of comparability. Although the control and experimental groups are chosen randomly, there is always a remaining risk that the groups differ in some respect. This risk cannot be fully controlled for and must be considered when evaluating results from control group experiments. However, the more parallel subgroups are contained in both the control and the experimental group, the smaller the risk.

6.3.1.3 Control Group Design in CALL-SLT

In our work we mainly applied the control group evaluation design to compare different versions of CALL-SLT. More concretely, we compared classes that used the gamified CALL-SLT version (with scores and badges) with classes that used a standard CALL-SLT version without gamification elements. Students using the former version acted as our experimental group, whereas students using the latter version constituted our control group. This evaluation served to address our hypothesis that gamification increases students’ motivation to use CALL-SLT, as will be discussed in Chapter 7, section 7.10.

As we discussed in the previous subsection, the issues that we needed to account for were mainly those of comparability and of representativeness of the entire population in both the control and the experimental group. Unfortunately, we could not choose our groups completely randomly, since we were dependent on the collaboration of teachers.

Since experiments such as the one conducted in this thesis cannot be imposed on school classes at the researcher’s will, the choice was somewhat limited. When generalizing to the population we need to be careful, since we must assume that most of the teachers who participated in our experiments were generally in favor of the use of CALL in traditional language teaching. However, this should not negatively influence our study, since in our cross-version evaluation all groups (control and experimental) were biased by the same characteristic.

6.3.2 Single-Case Experiment Design

As opposed to control group designs, single-case research (SCR) is the study of the individual, where an “individual” can either be one or many individuals who form a unit (Vannest et al. (2013)). The idea of SCR is to control an individual against him- or herself rather than against a control group. A definition of SCR is given by Horner et al. (2005), p. 166:

Single-subject research is experimental rather than correlational or descriptive, and its purpose is to document causal, or functional, relationships between independent and dependent variables.

Another characterization of single-case designs is given by Kratochwill et al. (2010), p. 2, in the following points:

• An individual “case” is the unit of intervention and unit of data analysis. A case may be a single participant or a cluster of participants (e.g. a classroom or a community).

• Within the design, the case provides its own control for purposes of comparison. For example, the case’s series of outcome variables are measured prior to the intervention and compared with measurements taken during (and after) the intervention.

• The outcome variable is measured repeatedly within and across different conditions or levels of the independent variable. These different conditions are referred to as phases (e.g. baseline phase, intervention phase).

Barlow and Herson (1973), p. 319, further summarize a number of advantages of the single-case experimental method in the educational context with respect to the generality of findings and statistical change, as listed below.

Generality of Findings:

• In comparison to between-group designs, single-case designs avoid statistical falsifications that arise when a small number of strongly deviating individuals in one group considerably influence the group’s mean value.

• Single-case designs can take into account individual participants’ characteristics, which can be the reason for specific improvement or deterioration.

Statistical Change:

• When working with an experimental and a control group, there must be a statistically significant difference between the two groups in order for the hypothesis to be proven. However, even small changes (which might not reach statistical significance in the whole group) might be significant for the development of an individual subject.

In the subsequent sections we will discuss the historical background of this research design and its advantages for our field of study, before examining the methodology of SCR.

6.3.2.1 Historical Background

Single-case experimental designs have their roots in psychology and physiology, two research fields in which single-subject experiments have been in use since the early 19th century. One prominent example of an early single-case experiment was the one led by Broca in 1861, during which he located the speech center of the brain by examining a man who had lost his ability to speak intelligibly (Broca (1865)). This early example already demonstrated that important findings with a wide generality could be obtained from single subjects (Barlow et al. (2009)).

But it was only much later that this experiment design was taken seriously by a wider audience and applied to other areas of research. A pioneer of SCR in the field of education was Ebbinghaus, who introduced a new measurement for learning in the late 19th century, called the nonsense syllable. In his experiment he tested (mostly his own) memory capacity with supposedly meaningless letter combinations (Ebbinghaus (1913)). With his experiments on memory capacity, he was the first researcher to conduct repeated performance measurements over a long period of time on a single subject.

After its initial popularity in the late 19th century, the single-case experimental design experienced a period of abandonment when Fisher introduced his inferential statistics and thereby promoted between-group experiment designs during the 1930s (Hopkins et al. (1998)).

In the 1940s and 1950s the case study method became popular in clinical investigations (e.g. Bolger (1965)), which set the basis for the revival of the single-case experimental design in the 1960s (Barlow and Herson (1973)).

Allport was an important supporter of single-subject experiments, with the aim of attending to the uniqueness of each individual. In 1962 he reflected on the following rhetorical question, which finally led to a more extensive use and acceptance of single-case experimental designs (Allport (1962), p. 407):

Why should we not start with individual behavior as a source of hunches (as we have in the past) and then seek our generalization (also as we have in the past) but finally come back to the individual not for the mechanical application of laws (as we do now) but for a fuller and more accurate assessment than we are now able to give?

The starting point for today’s use of the SCR experiment design was marked by Shapiro (e.g. Shapiro (1961); Shapiro (1966)), who constructed one of the first scientifically based methodologies for the use of single-case experimental designs in the 1950s and 1960s. In his experiments he was able to demonstrate that “repeated measures within an individual could be extended to a logical end point and that this end point was the outcome of treatment” (Barlow et al. (2009), p. 27). Shapiro furthermore applied a correlational approach and demonstrated that independent variables could be systematically manipulated within single subjects, thus fulfilling the requirements of a “true” experimental approach (Underwood (1957)).

6.3.2.2 Methodology

As we have already mentioned, in single-case research the individual participant is the unit of analysis, where performance prior to intervention is compared to performance during and/or after intervention with each individual serving as its own control (Horner et al. (2005)). As is the case for most research designs, two of the most important issues that must be considered when conducting single-case experiments are validity and generality.

The question of generality can be somewhat relativized in our research scenario, since we are not conducting traditional single-case experiments, but rather large-scale experiments in which we evaluate the progress of various units individually, as will be discussed in subsection 6.3.2.3.

As far as internal validity in SCR is concerned, the following nine threats, as stated by Shadish et al. (2002), p. 55, must be accounted for in the experiment design.

1. Ambiguous Temporal Precedence: Lack of clarity about which variable occurred first may yield confusion about which variable is the cause and which is the effect.

2. Selection: Systematic differences between/among conditions in participant characteristics could cause the observed effect.

3. History: Events occurring concurrently with the intervention could cause the observed effect.

4. Maturation: Naturally occurring changes over time could be confused with an intervention effect.

5. Statistical Regression (Regression towards the Mean): When cases (e.g. single participants, classrooms, schools) are selected on the basis of their extreme scores, their scores on other measured variables (including re-measured initial variables) typically will be less extreme, a psychometric occurrence that can be confused with an intervention effect.

6. Attrition: Loss of respondents during a single-case time-series intervention study can produce artifactual effects if that loss is systematically related to the experimental conditions.

7. Testing: Exposure to a test can affect scores on subsequent exposures to that test, an occurrence that can be confused with an intervention effect.

8. Instrumentation: The conditions or nature of a measure might change over time in a way that could be confused with an intervention.

9. Additive and Interactive Effects of Threats to Internal Validity: The impact of a threat can be added to that of another threat or may be moderated by levels of another threat.
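Threat 5 (regression towards the mean) is easy to reproduce in a small simulation. The sketch below is purely illustrative: the “true” abilities and measurement noise are invented, and it shows that units selected for extreme first-test scores look less extreme on a second, equally noisy measurement even though no intervention took place.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Each unit has a stable "true" ability; each test adds independent noise.
true_ability = [random.gauss(50, 10) for _ in range(1000)]
test1 = [a + random.gauss(0, 8) for a in true_ability]
test2 = [a + random.gauss(0, 8) for a in true_ability]

# Select the 100 units with the highest (most extreme) first-test scores.
top = sorted(range(1000), key=lambda i: test1[i], reverse=True)[:100]

mean_t1 = sum(test1[i] for i in top) / len(top)
mean_t2 = sum(test2[i] for i in top) / len(top)

# With no intervention at all, the selected group's second score drifts
# back towards the population mean -- pure regression towards the mean.
print(round(mean_t1, 1), round(mean_t2, 1))
```

A naive single-case analysis that selected units on an extreme baseline would misread this drift as a treatment effect, which is exactly why the threat must be controlled for.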

In our experiment design, we hence needed to pay special attention to the points listed above, in order not to compromise the validity of our results.

Most single-case experiments use a baseline phase, which is carried out at the beginning of the experiment. This phase serves as an initial observation period during which a stable trend in the subjects’ behavior can be established (Barlow and Herson (1973)). As a general rule, three observations should be carried out during this first experiment phase, in order to obtain a clear result of improvement or deterioration in the end. In our setting, this baseline was established by a written pre-test (giving a baseline for vocabulary, grammar and basic written interactional competences) and the first interactional session with CALL-SLT (acting as a baseline for oral productive competences, such as pronunciation and fluency). This baseline condition acted as the control condition for each individual unit, against which we compared the results of the post-test and the interactions recorded at the end of the experiment phase.
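One simple way to quantify a baseline-versus-intervention contrast of this kind is the percentage of non-overlapping data (PND), a common single-case metric: the share of intervention-phase observations that exceed the best baseline observation. The numbers below are invented, and this is only a sketch of the general idea, not the scoring actually used in our evaluation.

```python
def pnd(baseline, intervention):
    """Percentage of intervention points exceeding the highest baseline point
    (for an outcome that is expected to increase)."""
    if not baseline or not intervention:
        raise ValueError("both phases need at least one observation")
    ceiling = max(baseline)
    above = sum(1 for x in intervention if x > ceiling)
    return 100.0 * above / len(intervention)

# Hypothetical fluency scores: three baseline sessions, five intervention sessions
baseline = [40, 42, 41]
intervention = [45, 48, 41, 50, 47]

print(pnd(baseline, intervention))  # → 80.0 (4 of 5 points exceed the baseline maximum of 42)
```

Because each unit is scored only against its own baseline, the metric respects the core SCR principle of the individual serving as its own control.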

6.3.2.3 Single-Case in CALL-SLT

For our CALL-SLT evaluations, we applied the single-case experimental design to compare the influence of our independent variable (the use of CALL-SLT) on the dependent variables (various L2 competences, measured with the placement test scores and the logged system interactions). Since it was not feasible to find control groups for pedagogical reasons (we would risk making students in the control group feel discriminated against) and the SCR design is an often-used and well-proven experiment design for the educational context, we compared our individual units against themselves. In our evaluation, activity groups (c.f. 6.2.5) were used as our units. Another approach might have been to analyze each individual student separately, but with a group size of 207 participants, this approach seemed too fine-grained for our purpose.
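A unit-against-itself comparison of this kind can be sketched as a paired t-test on pre- and post-treatment scores. The scores below are invented for illustration (this is not the thesis’s actual data), and the critical value is the two-tailed 5% cut-off for four degrees of freedom.

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t-statistic on per-unit score differences (post - pre)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se

# Hypothetical placement test scores for five activity groups
pre = [45, 50, 42, 55, 48]
post = [52, 58, 47, 60, 55]

t = paired_t(pre, post)
# 2.776 is the two-tailed critical t at p = 0.05 for df = n - 1 = 4
print(round(t, 3), abs(t) > 2.776)  # → 10.667 True
```

Pairing each unit with itself removes between-unit variability from the comparison, which is precisely the advantage the SCR design offers over a between-group analysis.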

The SCR approach was used for various evaluations, including the comparison of pre- and post-treatment placement test scores to investigate written L2 skills (c.f. Chapter 7, section 7.2), the development of the students’ oral performance (c.f. Chapter 7, section 7.3), and the influence of the groups’ personal backgrounds on their score development, such as bilingualism, students’ L1, or previous knowledge of CALL and speech recognition technologies (c.f. Chapter 7, section 7.4).