
VALIDITY? RELIABILITY? DIFFERENT TERMINOLOGY ALTOGETHER?

Several years ago I wrote an article entitled “Validity, reliability, and neither” (Knapp, 1985) in which I discussed some researchers’ identifications of investigations as validity studies or reliability studies when they were actually neither. In what follows I pursue the matter of confusion regarding the terms “validity” and “reliability” and suggest the possibility of alternative terms for referring to the characteristics of measuring instruments. I am not the first person to recommend this. As long ago as 1936, Goodenough suggested that the term “reliability” be done away with entirely. Concerns about both “reliability” and “validity” have been expressed by Stallings & Gillmore (1971), Feinstein (1985, 1987), Suen (1988), Brown (1989), and many others.

The problems

The principal problem, as expressed so succinctly by Ennis (1999), is that the word “reliability” as used in ordinary parlance is what measurement experts subsume under “validity”. (See also Feldt & Brennan, 1989.) For example, if a custodian falls asleep on the job every night, most laypeople would say that he (she) is unreliable, i.e., a poor custodian; whereas psychometricians would say that he (she) is perfectly reliable, i.e., a consistently poor custodian.

But there’s more. Even within the measurement community there are all kinds of disagreements regarding the meaning of validity. For example, some contend that the consequences of misuses of a measuring instrument should be taken into account when evaluating its validity; others disagree. (Pro: Messick, 1995, and others; Anti: Lees-Haley, 1996, and others.) And there is the associated problem of the awful (in my opinion) terms “internal validity” and “external validity” that have little or nothing to do with the concept of validity in the measurement sense, since they apply to the characteristics of a study or its design and not to the properties of the instrument(s) used in the study. [“Internal validity” is synonymous with “causality” and “external validity” is synonymous with “generalizability.” ‘nuff said.]

The situation is even worse with respect to reliability. In addition to matters such as the (un?)reliable custodian, there are the competing definitions of the term “reliability” within the field of statistics in general (a sample statistic is reliable if it has a tight sampling distribution with respect to its counterpart population parameter) and within engineering (a piece of equipment is reliable if there is a small probability of its breaking down while in use). Some people have even talked about the reliability of a study. For example, an article I recently came across on the internet claimed that a study of the reliability (in the engineering sense) of various laptop computers was unreliable, and so was its report!

Some changes in, or retentions of, terminology and the reasons for same

There have been many thoughtful and some not so thoughtful recommendations regarding change in terminology. Here are a few of the thoughtful ones:

1. I’ve already mentioned Goodenough (1936). She was bothered by the fact that the test-retest reliability of examinations (same form or parallel forms) administered a day or two apart is almost always lower than the split-halves reliability of those forms when stepped up by the Spearman-Brown formula, despite the fact that both approaches are concerned with estimating the reliability of the instruments. She suggested that the use of the term “reliability” be relegated to “the limbo of outworn concepts” (p. 107) and that the results of psychometric investigations be expressed in terms of whatever procedures were used in estimating the properties of the instruments in question.

2. Adams (1936). In that same year he tried to sort out the distinctions among the usages of the terms “validity”, “reliability”, and “objectivity” in the measurement literature of the time. [Objectivity is usually regarded as a special kind of reliability: “inter-rater reliability” if more than one person is making the judgments; “intra-rater reliability” for a single judge.] He found the situation to be chaotic and argued that validity, reliability, and objectivity are qualities of measuring instruments (which he called “scales”). He suggested that “accuracy” should be added as a term to refer to the quantitative aspects of test scores.

3. Thorndike (1951), Stanley (1971), Feldt and Brennan (1989), and Haertel (2006). They are the authors of the chapter on reliability in the various editions of the Educational Measurement compendium. Although they all commented upon various terminological problems, they were apparently content to keep the term “reliability” as is [judging from the retention of the single word “Reliability” in the chapter title in each of the four editions of the book].

4. Cureton (1951), Cronbach (1971), Messick (1989), and Kane (2006). They were the authors of the corresponding chapters on validity in Educational Measurement. They too were concerned about some of the terminological confusion regarding validity [and the chapter titles went from “Validity” to “Test Validation” back to “Validity” and thence to “Validation”, in that chronological order], but the emphasis changed from various types of validity in the first two editions to an amalgam under the heading of Construct Validity in the last two.

5. Ennis (1999). I’ve already referred to his clear perception of the principal problem with the term “reliability”. He suggested the replacement of “reliability” with “consistency”. He was also concerned about the terms “true score” and “error of measurement” (see below).

6. The AERA/APA/NCME Standards for educational and psychological testing (1999; in press). Like the authors of the chapters in the various editions of Educational Measurement, the authors of the sections on validity express some concerns about confusions in terminology, but they appear to want to stick with “validity”, whereas the authors of the section on reliability prefer to expand the term “reliability”. [In the previous (1999) version of the Standards the title was “Reliability and Errors of Measurement”.]
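For reference, the Spearman-Brown “stepping up” that Goodenough objected to (item 1 above) has the following standard textbook form; this is the formula as usually stated in the psychometric literature, not as quoted from Goodenough’s article:

```latex
% Spearman-Brown prophecy formula: if r is the reliability of a test
% and the test is lengthened by a factor k, the projected reliability
% of the lengthened test is
r_{kk} = \frac{k\,r}{1 + (k - 1)\,r}

% Split-halves case: the correlation r_h between two half-tests is
% "stepped up" to the full length with k = 2:
r_{\mathrm{full}} = \frac{2\,r_h}{1 + r_h}
```

For example, a half-test correlation of .60 steps up to (2)(.60)/(1.60) = .75, which will typically exceed a test-retest coefficient obtained a day or two later, as Goodenough noted.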

My personal recommendations

1. I prefer “relevance” to “validity”, especially given my opposition to the terms “internal validity” and “external validity”. I realize that “relevance” is a word that is over-used in the English language, but what could be a better measuring instrument than one that is completely relevant to the purpose at hand?

Examples: a road test for measuring the ability to drive a car; a stadiometer for measuring height; and a test of arithmetic items all of the form a + b = ___ for measuring the ability to add.

2. I’m mostly with Ennis (1999) regarding changing “reliability” to “consistency”, even though in my unpublished book on the reliability of measuring instruments (Knapp, 2015) I come down in favor of keeping it “reliability”. [Ennis had nothing to say one way or the other about changing “validity” to something else.]

3. I don’t like to lump techniques such as Cronbach’s alpha under either “reliability” or “consistency”. For those I prefer the term “homogeneity”, as did Kelley (1942); see Traub (1997). I suggest that time must pass (even if just a few minutes; see Horst, 1954) between the measure and the re-measure.

4. I also don’t like to subsume “objectivity” under “reliability” (either inter-rater or intra-rater). Keep it as “objectivity”.

5. Two terms I recommend for Goodenough’s limbo are “accuracy” and “precision”, at least as far as measurement is concerned. The former term is too ambiguous. [How can you ever determine whether or not something is accurate?] The latter term should be confined to the number of digits that are defensible to report when making a measurement.
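For concreteness, the coefficient alpha referred to in recommendation 3 above is conventionally computed from a single administration as follows (a standard formula from the literature, not taken from this chapter):

```latex
% Cronbach's alpha for a k-item test, where sigma_i^2 is the variance
% of item i and sigma_X^2 is the variance of the total score X:
\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sigma_X^2}\right)
```

Because all k items come from one sitting, no time passes between “measure” and “re-measure”, which is why alpha belongs under “homogeneity” rather than “consistency” in the scheme recommended above.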

True score and error of measurement

As I indicated above, Ennis (1999) doesn’t like the terms “true score” and “error of measurement”. Both terms are used in the context of reliability. The former refers to (1) the score that would be obtained if there were no unreliability; and (2) the average (arithmetic mean) of all of the possible obtained scores for an individual. The latter is the difference between an obtained score and the corresponding true score. What bothers Ennis is that the term “true score” would seem to indicate the score that was actually deserved in a perfectly valid test, whereas the term is associated only with reliability.

I don’t mind keeping both “true score” and “error of measurement” under “consistency”, as long as there is no implication that the measuring instrument is also necessarily “relevant”. The instrument chosen to provide an operationalization of a particular attribute such as height or the ability to add or to drive a car might be a lousy one (that’s primarily a judgment call), but it always needs to produce a tight distribution of errors of measurement for any given individual.
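The definitions of “true score” and “error of measurement” given above correspond to the classical test theory decomposition, shown here in standard notation (the notation is the literature’s, not Knapp’s or Ennis’s):

```latex
% Classical test theory: an obtained score X is the sum of a true
% score T (the mean of all possible obtained scores for an individual)
% and an error of measurement E:
X = T + E, \qquad T = \mathbb{E}(X), \qquad \mathbb{E}(E) = 0

% With errors uncorrelated with true scores, the variances add, and
% the reliability ("consistency") coefficient is the proportion of
% obtained-score variance that is true-score variance:
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
```

A “tight distribution of errors of measurement” is a small sigma_E, and hence a coefficient near 1, regardless of whether the instrument is relevant to the attribute it is supposed to measure.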

References

Adams, H.F. (1936). Validity, reliability, and objectivity. Psychological Monographs, 47, 329-350.

American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). (in press). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Brown, G.W. (1989). Praise for useful words. American Journal of Diseases of Children, 143, 770.

Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443-507). Washington, DC: American Council on Education.

Cureton, E. F. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621-694). Washington, DC: American Council on Education.

Ennis, R.H. (1999). Test reliability: A practical exemplification of ordinary language philosophy. Yearbook of the Philosophy of Education Society.

Feinstein, A.R. (1985). Clinical epidemiology: The architecture of clinical research. Philadelphia: Saunders.

Feinstein, A.R. (1987). Clinimetrics. New Haven, CT: Yale University Press.

Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 65-110). Westport, CT: American Council on Education/Praeger.

Horst, P. (1954). The estimation of immediate retest reliability. Educational and Psychological Measurement, 14, 705-708.

Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: American Council on Education/Praeger.

Kelley, T.L. (1942). The reliability coefficient. Psychometrika, 7, 75-83.

Knapp, T.R. (1985). Validity, reliability, and neither. Nursing Research, 34, 189-192.

Knapp, T.R. (2015). The reliability of measuring instruments. Available free of charge at www.tomswebpage.net.

Lees-Haley, P.R. (1996). Alice in validityland, or the dangerous consequences of consequential validity. American Psychologist, 51 (9), 981-983.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). Washington, DC: American Council on Education.

Messick, S. (1995). Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50 (9), 741-749.

Stallings, W.M., & Gillmore, G.M. (1971). A note on “accuracy” and “precision”. Journal of Educational Measurement, 8, 127-129.

Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 356-442). Washington, DC: American Council on Education.

Suen, H.K. (1988). Agreement, reliability, accuracy, and validity: Toward a clarification. Behavioral Assessment, 10, 343-366.

Thorndike, R.L. (1951). Reliability. In E.F. Lindquist (Ed.), Educational measurement (1st ed., pp. 560-620). Washington, DC: American Council on Education.

Traub, R.E. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16 (4), 8-14.

CHAPTER SEVEN: A COMMENTARY REGARDING CRONBACH’S