www.elsevier.es/rpto
Journal of Work and Organizational Psychology
Corrections for criterion reliability in validity generalization: The consistency of Hermes, the utility of Midas
Jesús F. Salgado a,∗, Silvia Moscoso a, Neil Anderson b
a University of Santiago de Compostela, Spain
b Brunel University, U.K.
Article info
Article history:
Received 23 November 2015
Accepted 3 December 2015
Available online 4 February 2016
Keywords:
Interrater Reliability
Validity generalization
Job performance
Ratings
Abstract
There is criticism in the literature about the use of interrater coefficients to correct for criterion reliability in validity generalization (VG) studies, and dispute over whether .52 is an accurate and non-dubious estimate of the interrater reliability of overall job performance (OJP) ratings. We present a second-order meta-analysis of three independent meta-analytic studies of the interrater reliability of job performance ratings and make a number of comments and reflections on LeBreton et al.'s paper. The results of our meta-analysis indicate that the interrater reliability for a single rater is .52 (k = 66, N = 18,582, SD = .105). Our main conclusions are: (a) the value of .52 is an accurate estimate of the interrater reliability of overall job performance for a single rater; (b) it is not reasonable to conclude that past VG studies that used .52 as the criterion reliability value have a less than secure statistical foundation; (c) based on interrater reliability, test-retest reliability, and coefficient alpha, supervisor ratings are a useful and appropriate measure of job performance and can be confidently used as a criterion; (d) validity correction for criterion unreliability has been unanimously recommended by "classical" psychometricians and I/O psychologists as the proper way to estimate predictor validity, and is still recommended at present; (e) the substantive contribution of VG procedures to inform HRM practices in organizations should not be lost in these technical points of debate.
© 2015 Colegio Oficial de Psicólogos de Madrid. Published by Elsevier España, S.L.U. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
∗ Corresponding author. Department of Organizational Psychology. Faculty of Labor Relations. University of Santiago de Compostela. Campus Vida. 15782 Santiago de Compostela, A Coruña, Spain.
E-mail address: [email protected] (J.F. Salgado).
http://dx.doi.org/10.1016/j.rpto.2015.12.001
1576-5962/© 2015 Colegio Oficial de Psicólogos de Madrid. Published by Elsevier España, S.L.U.
LeBreton, Scherer, and James (2014) have written a challenging lead article in which they make a series of criticisms about the use of interrater coefficients to correct for criterion reliability in validity generalization (VG) studies and dispute whether .52 is an accurate and non-dubious estimate of the interrater reliability of overall job performance (OJP) ratings. As researchers who have conducted several meta-analytical (MA) and VG studies in which the value of the interrater reliability was estimated, we here make a number of comments and reflections on LeBreton et al.'s paper. We organize our comments under six points: (1) whether .52 is in fact a dubious interrater reliability value of OJP, (2) their criticism that corrected coefficients were wrongly labelled as uncorrected coefficients, (3) the labelling errors in LeBreton et al.'s own paper, (4) whether it is appropriate to correct observed validity for criterion reliability, (5) whether interrater reliability is the appropriate coefficient to correct for criterion reliability in VG studies, and (6) wider issues over the value of VG studies for informing policies and practices in organizations.
In combination, we argue that these points indicate unequivocally that the case of LeBreton et al. (2014) is logically flawed, and indeed on closer inspection has been built up piecemeal on a number of outlier interpretations, non sequiturs of logical progression, and impractical calls for data set treatment in VG studies. Following their recommendations risks "throwing the baby out with the bathwater" and reducing the likelihood that VG studies will continue to have important positive benefits for practice in employee selection and other areas of I/O psychology.
Is .52 a Dubious Interrater Reliability Value?
LeBreton et al. (2014) doubt whether .52 is a legitimate and accurate estimate of the interrater reliability. To quote, they argue that "the past VG studies which relied on this dubious criterion reliability value have a less than secure statistical foundation", and that they "suspect that researchers would conclude that .52 is not a credible estimate". The problem here is that these are simply opinions without empirical basis or, in fact, any supporting rationale. LeBreton et al. do not provide any empirical support for rejecting .52 as a credible value beyond their suspicion. Should we accept this opinion and unilaterally jettison this well-established and widely used value without any supporting reasoning or empirical foundation? We believe absolutely not, especially when one considers the evidence upon which use of this interrater reliability value has been based.
Viswesvaran, Ones, and Schmidt (1996), for instance, found values of .52 (k = 40, N = 14,650) for interrater reliability, .81 for coefficients of stability (k = 12, N = 1,374), and .86 for coefficient alpha (k = 89, N = 17,899). These coefficients estimate three different sources of measurement error (Schmidt & Hunter, 1996; Viswesvaran, Schmidt, & Ones, 2002). Not all researchers agree that the interrater coefficient is the appropriate estimate of reliability. For instance, Murphy and DeShon (2000) suggested that it is not the appropriate coefficient. However, it is one thing to believe that another coefficient is the appropriate one, as Murphy and DeShon have suggested, and quite another to dispute that .52 is a credible and non-dubious estimate of interrater reliability, as LeBreton et al. (2014) have suggested. The only way to support this claim is to demonstrate beyond reasonable doubt that Viswesvaran et al. (1996) made errors when they calculated their estimates or, alternatively, to provide another estimate of the interrater correlation based on an independent database. In her large-sample study (N = 9,975) of the interrater reliability of overall performance ratings, Rothstein (1990) found that the average interrater reliability was .52. The meta-analysis by Salgado et al. (2003, Table 2) provided another estimate of the interrater reliability of overall job performance with a European set of interrater coefficients. They found exactly the same value of .52 (k = 18, N = 1,936). In a third and more recent meta-analysis, Salgado and Tauriz (2014) found that the interrater reliability of overall performance ratings was .52 (k = 8, N = 1,996), using an independent data set. The estimates of Viswesvaran et al. (1996), Salgado et al. (2003), and Salgado and Tauriz (2014) differed only in their standard deviations, which were .095, .19, and .05, respectively. That three MAs produced an identical interrater reliability estimate using entirely different samples of primary studies is more than just coincidental; it suggests that this estimate is reasonable and accurate. In a previous meta-analysis, Salgado and Moscoso (1996) estimated the interrater reliability for composite and single supervisory ratings criteria. They found mean interrater reliabilities of .618 and .402, respectively (average ryy = .51). Table 1 reports the results of a second-order meta-analysis of the first three independent studies; Salgado and Moscoso's (1996) meta-analysis was not included because it does not report the sample sizes. As can be seen, the interrater reliability is .52 and the combined standard deviation is .105, which is very close to the figure found by Viswesvaran et al. (1996). In the present case, we used the formula given by McNemar (1962, p. 24) to determine the standard deviation for three distributions combined.

Table 1
Second-order Meta-analysis of the Interrater Reliability of Job Performance Ratings.

N         k     ryy    SD       99% CI
18,582    66    .52    .1056    .518/.522

Note. N = total sample size; k = number of independent coefficients; ryy = sample-size-weighted average interrater reliability; SD = standard deviation of ryy; 99% CI = 99% confidence interval of interrater reliability.
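The pooling behind Table 1 can be checked with a short script. This is a minimal sketch rather than the authors' own computation: it assumes sample-size (N) weights, uses the standard textbook formula for the standard deviation of several distributions combined, and assumes the confidence interval is built from a standard error of SD / sqrt(N). The names `studies` and `pool` are illustrative.

```python
import math

# (label, k, N, mean interrater reliability, SD) as reported in the text
studies = [
    ("Viswesvaran et al. (1996)", 40, 14650, 0.52, 0.095),
    ("Salgado et al. (2003)", 18, 1936, 0.52, 0.19),
    ("Salgado & Tauriz (2014)", 8, 1996, 0.52, 0.05),
]

def pool(studies):
    """N-weighted mean and SD of several distributions combined."""
    total_n = sum(n for _, _, n, _, _ in studies)
    total_k = sum(k for _, k, _, _, _ in studies)
    mean = sum(n * r for _, _, n, r, _ in studies) / total_n
    # Combined variance = weighted within-study variance plus the
    # weighted squared deviation of each study mean from the grand mean.
    var = sum(n * (sd ** 2 + (r - mean) ** 2)
              for _, _, n, r, sd in studies) / total_n
    return total_n, total_k, mean, math.sqrt(var)

total_n, total_k, mean_ryy, sd_ryy = pool(studies)

# Assumption: 99% CI from a normal approximation with SE = SD / sqrt(N).
se = sd_ryy / math.sqrt(total_n)
ci = (mean_ryy - 2.576 * se, mean_ryy + 2.576 * se)

print(total_n, total_k, round(mean_ryy, 2), round(sd_ryy, 4))
print(round(ci[0], 3), round(ci[1], 3))
```

Under these assumptions the script returns N = 18,582, k = 66, ryy = .52, SD = .1056, and a 99% interval of .518 to .522, which matches Table 1. Because the three study means are identical, the between-study term contributes nothing and the combined SD depends only on the within-study SDs.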
Murphy and DeShon (2000, p. 896) suggested that the correlation of .52 may result from contexts that encourage disagreement among raters and that encourage substantial rating inflation and, consequently, range restriction. Assuming that one rater uses the entire scale and the other only the top half of the scale, Murphy and DeShon estimated that the correlation among raters corrected for range restriction alone would be .68 and, corrected for unreliability using Viswesvaran et al.'s (1996) coefficient alpha estimate of .86, would be .79. Assuming that one rater uses the entire scale and another only the top third of the scale, their estimated values would be .91 and 1, respectively.
A problematic point in Murphy and DeShon's (2000) examples is that, in addition to assuming that the interrater correlation is a validity coefficient, they applied Thorndike's formula for Case II (Thorndike, 1949, p. 173) to correct for range restriction. However, in their examples, the proper formula to correct for range restriction would be Thorndike's formula for Case I, because the restriction is in the criterion (see the formula in the Appendix). Applying this formula, the correlation corrected for range restriction alone would be .82, and corrected for unreliability in Y1 and Y2 would be .95 (using alpha = .86) in the first example of Murphy and DeShon. In the second example, the correlation corrected for range restriction would be .96 and corrected for unreliability would be 1.12 (using alpha = .86), which is an impossible value. Moreover, it should be taken into account that, if the ratings are restricted in range, Murphy and DeShon should have attenuated the reliability proportionally before correcting for unreliability. This can be done using the formula developed independently by Otis (1922) and Kelley (1921) and reproduced in many books and articles (see the Otis-Kelley formula in the Appendix). The application of this formula would result in an alpha coefficient of .68 in the first case and of -.26 in the second. Repeating the calculations with the attenuated reliability value for the first example, the result would be .82 divided by the square root of .86 multiplied by .68, which equals 1.08. In other words, the corrections carried out for the two examples give two impossible values, which cast doubt both on Murphy and DeShon's rationale and on the realism of the assumed values of range restriction.
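The arithmetic in this passage can be checked with a short sketch. The Appendix formulas are not reproduced in this section, so the code assumes the usual textbook forms: Thorndike's Case I as R = sqrt(1 - (1 - r^2) / U^2), and the Otis-Kelley relation solved for the restricted-group reliability as r_xx = 1 - U^2 (1 - R_xx), where U is the ratio of the unrestricted to the restricted standard deviation. It further assumes U = 1.5 for the first example (the value the text gives for it) and U = 3 for the second (an assumption, chosen because it reproduces the reported figures), and it rounds at each step, as the reported values appear to have been. Function names are illustrative.

```python
import math

def case1_correct(r, U):
    """Thorndike Case I (assumed form): the restricted variable is the criterion."""
    return math.sqrt(1 - (1 - r ** 2) / U ** 2)

def otis_kelley_restrict(rel, U):
    """Otis-Kelley (assumed form): reliability expected in the restricted group."""
    return 1 - U ** 2 * (1 - rel)

r_obs, alpha = 0.52, 0.86
results = {}

for U in (1.5, 3.0):  # U = 3.0 is an assumption for the second example
    r_rr = round(case1_correct(r_obs, U), 2)   # corrected for range restriction only
    r_full = round(r_rr / alpha, 2)            # then for unreliability in both ratings
    alpha_restricted = otis_kelley_restrict(alpha, U)
    results[U] = (r_rr, r_full, alpha_restricted)
    print(f"U={U}: r_rr={r_rr}, r_full={r_full}, restricted alpha={alpha_restricted:.3f}")

# First example repeated with the attenuated (restricted) alpha.
r_rr_1, _, alpha_r_1 = results[1.5]
overcorrected = r_rr_1 / math.sqrt(alpha * alpha_r_1)
print(f"first example with attenuated alpha: {overcorrected:.2f}")
```

With these assumptions the sketch gives .82 and .95 for the first example and .96 and 1.12 for the second, with restricted alphas of about .685 and -.26. The final quantity comes out at about 1.07, close to the 1.08 reported in the text (the small gap presumably reflects rounding), and in either case it exceeds 1.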
If one accepts Murphy and DeShon's (2000) rationale, then it is surely unrealistic to think that, if the context produces range restriction, one rater uses the entire scale and the second rater only a fraction of the scale. It would be more realistic to think that both raters were affected by range restriction and that, consequently, both would use a fraction of the scale, for example, one using the top ¾ of the scale and the second the top half of the scale. However, this case would need a correction for double range restriction. In this regard, six years after developing the formulas popularized by Thorndike (1949), Pearson (1908) developed the formula for correcting for double range restriction, which is the formula to be applied in this case (see Formula 3 in the Appendix). Applying this formula, the corrected interrater correlation would be .88.
However, if we accept that the interrater correlation is a reliability estimate (as LeBreton et al., 2014, do), and we apply the Otis-Kelley formula, this gives values of .79 and .95 as interrater reliability coefficients for the first and second cases of Murphy and DeShon (2000), respectively. In order to estimate the predictive validity of a test, and accepting that the criterion distribution is restricted in range, the correction should be made to both the criterion reliability and the predictor's restricted validity. For example, if the observed correlation between a predictor (e.g., GMA) and overall job performance ratings is .25, and the value of range restriction is U = 1.5, as in Murphy and DeShon's first example, the fully corrected validity would be .77 (rounded), using the Case I formula (because the restriction is in the criterion). This requires three steps: (a) correct the interrater reliability of .52 for range restriction using the Otis-Kelley formula, which results in .79; (b) correct the validity of .25 by the square root of .79, which gives .28; and (c) correct .28 for range restriction using a U value of 1.5, which produces a corrected validity of .77. If we use the formula for disattenuation only, with the criterion reliability value of .52, the corrected validity would be .35 (rounded). In other words, correcting the interrater reliability for range restriction implies correcting the validity coefficient for range restriction using the Case I formula, and the consequence is that a larger (and unrealistic) validity value is obtained. Moreover, the correction for range restriction in the predictor would still be missing.
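The three steps, and the disattenuation-only alternative, can be traced numerically. This is a minimal sketch that assumes the usual textbook forms of the two formulas (the Appendix is not reproduced here): Otis-Kelley as R_yy = 1 - (1 - r_yy) / U^2 and Thorndike's Case I as R = sqrt(1 - (1 - r^2) / U^2). Function and variable names are illustrative.

```python
import math

def otis_kelley_unrestrict(rel_restricted, U):
    """Otis-Kelley (assumed form): reliability in the unrestricted group."""
    return 1 - (1 - rel_restricted) / U ** 2

def case1_correct(r, U):
    """Thorndike Case I (assumed form): restriction on the criterion."""
    return math.sqrt(1 - (1 - r ** 2) / U ** 2)

r_xy = 0.25   # observed predictor-criterion correlation
r_yy = 0.52   # observed interrater reliability
U = 1.5       # unrestricted SD / restricted SD

step_a = otis_kelley_unrestrict(r_yy, U)   # (a) reliability corrected for range restriction
step_b = r_xy / math.sqrt(step_a)          # (b) validity corrected for criterion unreliability
step_c = case1_correct(step_b, U)          # (c) validity corrected for range restriction

# Alternative: disattenuation only, using .52 directly.
disattenuated_only = r_xy / math.sqrt(r_yy)

print(round(step_a, 2), round(step_b, 2), round(step_c, 2))
print(round(disattenuated_only, 2))
```

Under these assumed formulas the steps give .79, .28, and .77, against .35 for disattenuation alone, in line with the figures in the text. The gap between .77 and .35 is what the argument turns on: treating the criterion as range restricted more than doubles the corrected validity.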
With regard to criterion reliability, sixty-five years ago, Thorndike (1949, pp. 106-107) wrote that "it is not of critical importance that the reliability of a criterion be high as long as it is established as definitely greater than zero. Even when the reliability of a criterion is quite low, given that it is definitely greater than zero, it is still possible to obtain fairly substantial correlations between that criterion and reliable tests and to carry out useful statistical analyses in connection with the prediction of that criterion. Given a test or composite of tests with a reliability of .90 and a criterion with reliability of .40, it is theoretically possible to obtain a correlation of .60 between the two... It is more important that the reliability of a criterion measure be known than that it be high." According to Thorndike (1949) and many classical psychometricians and I/O psychologists (e.g., Ghiselli, Campbell, & Zedeck, 1981; Guilford, 1954; Guion, 1965, 1998; Gulliksen, 1950; Nunnally, 1978, among others), when the criterion measure is unreliable, what is of critical importance is that the sample size be increased in order to allow for sampling fluctuations and to obtain stability in the relative size of the validity coefficients.
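Thorndike's numerical example follows from the classical attenuation bound, under which the observed correlation between two measures cannot exceed the square root of the product of their reliabilities. A minimal sketch (the function name is illustrative):

```python
import math

def max_observed_correlation(rel_x, rel_y):
    """Upper bound on an observed correlation given the two reliabilities."""
    return math.sqrt(rel_x * rel_y)

# Test reliability .90, criterion reliability .40 (Thorndike's example).
bound = max_observed_correlation(0.90, 0.40)
print(round(bound, 2))
```

The bound is sqrt(.90 x .40) = sqrt(.36) = .60, the value Thorndike cites.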
In summary, as long as no more accurate estimate of the interrater reliability is available, researchers can be confident that .52 is currently a robust, accurate, and useful estimate of the interrater reliability of overall job performance for a single rater.
Were Corrected Coefficients Labelled as Uncorrected Coefficients?
LeBreton et al. (2014, p. 492) write that "coefficients that have been corrected should be so denoted (ρ̂) rather than simply labelled as observed correlation coefficients (r) or referred to as 'validities' without clearly articulating the various corrections made to these correlations (cf. Hunter & Hunter, 1984, Table 10; Schmidt & Hunter, 1998, Table 1). Labelling corrected coefficients as uncorrected coefficients could lead some psychologists (or HR managers) to draw improper inferences from meta-analyses." We are not aware that this constitutes an endemic or even frequent problem in VG studies, and many published papers can be cited to demonstrate this. In addition, we strongly disagree with the use that LeBreton et al. have made of Table 1 of Schmidt and Hunter (1998) and Table 10 of Hunter and Hunter (1984). In the footnote of Table 1, Schmidt and Hunter wrote the following (the same text is repeated in Table 2):
All of the validities in this table are for the criterion of overall job performance. Unless otherwise noted, all validity estimates are corrected for the downward bias due to measurement error in the measure of job performance (emphasis added) and range restriction on the predictor in incumbent samples relative to applicant populations. The correlations between GMA and other predictors are corrected for range restriction but not for measurement error in either measure (thus they are smaller than fully corrected mean values in the literature). These correlations represent observed score correlations between selection methods in applicant populations.
With regard to Hunter and Hunter's (1984) Table 10, once again, LeBreton et al. (2014) are not fair, and instead adopt an extreme interpretation and position. Hunter and Hunter wrote:
If the predictors are to be compared, the criterion for job performance must be the same for all. This necessity inexorably leads to the choice of supervisor ratings (with correction for measurement error) as the criterion because there are prediction studies for supervisory ratings for all predictors (p. 89).
Therefore, Hunter and Hunter (1984) explained that they made corrections for criterion unreliability. Consequently, Hunter and Hunter (1984) and Schmidt and Hunter (1998) properly labelled the corrected coefficients, and they have not led psychologists (or HR managers) to draw improper inferences from meta-analyses. In other words, if a reader draws improper inferences, it is because he/she has not properly read the footnote and the explanations. The responsibility is that of the reader, not of the writers, and it is, ironically, LeBreton et al. (2014) who may have misguided readers in the