
www.elsevier.es/rpto

Journal of Work and Organizational Psychology

Corrections for criterion reliability in validity generalization: The consistency of Hermes, the utility of Midas

Jesús F. Salgado^a,∗, Silvia Moscoso^a, Neil Anderson^b

^a University of Santiago de Compostela, Spain

^b Brunel University, U.K.

Article info

Article history:

Received 23 November 2015
Accepted 3 December 2015
Available online 4 February 2016

Keywords:

Interrater
Reliability
Validity generalization
Job performance
Ratings

Abstract

There is criticism in the literature about the use of interrater coefficients to correct for criterion reliability in validity generalization (VG) studies, and dispute over whether .52 is an accurate and non-dubious estimate of the interrater reliability of overall job performance (OJP) ratings. We present a second-order meta-analysis of three independent meta-analytic studies of the interrater reliability of job performance ratings and make a number of comments and reflections on LeBreton et al.'s paper. The results of our meta-analysis indicate that the interrater reliability for a single rater is .52 (k = 66, N = 18,582, SD = .105). Our main conclusions are: (a) the value of .52 is an accurate estimate of the interrater reliability of overall job performance for a single rater; (b) it is not reasonable to conclude that past VG studies that used .52 as the criterion reliability value have a less than secure statistical foundation; (c) based on interrater reliability, test-retest reliability, and coefficient alpha, supervisor ratings are a useful and appropriate measure of job performance and can be confidently used as a criterion; (d) validity correction for criterion unreliability has been unanimously recommended by "classical" psychometricians and I/O psychologists as the proper way to estimate predictor validity, and is still recommended at present; (e) the substantive contribution of VG procedures to inform HRM practices in organizations should not be lost in these technical points of debate.

Correction for criterion reliability in validity generalization: The coherence of Hermes, the utility of Midas

Keywords:

Interrater
Reliability
Validity generalization
Job performance
Ratings

Summary

The literature contains criticism of the use of interrater coefficients to correct for criterion reliability in validity generalization (VG) studies, and questions whether .52 is an accurate and non-dubious estimate of the interrater reliability of overall job performance ratings. In this article, we present a second-order meta-analysis of three independent meta-analytic studies of the interrater reliability of job performance ratings and offer several comments and reflections on the article by LeBreton et al. The results of our meta-analysis indicate that the interrater reliability is .52 (k = 66, N = 18,582, SD = .105) for a single supervisor. Our main conclusions are: (a) the value of .52 is an accurate estimate of the interrater reliability of overall job performance for a single rater; (b) it is not reasonable to conclude that VG studies that have used .52 as the criterion reliability value have an insecure statistical foundation; (c) on the basis of interrater reliability, test-retest reliability, and coefficient alpha, supervisor judgments are a useful and appropriate measure of job performance and can be used with confidence as a criterion; (d) correcting validity for criterion unreliability has been unanimously recommended by "classical" psychometricians and industrial psychologists as the correct method of estimating predictor validity, and is still recommended today; and (e) the substantive contribution of VG procedures to guiding human resources practices in organizations should not be lost in these technical points of debate.

∗ Corresponding author. Department of Organizational Psychology. Faculty of Labor Relations. University of Santiago de Compostela. Campus Vida. 15782 Santiago de Compostela, A Coruña, Spain.

E-mail address: [email protected] (J.F. Salgado).

http://dx.doi.org/10.1016/j.rpto.2015.12.001

LeBreton, Scherer, and James (2014) have written a challenging lead article in which they make a series of criticisms about the use of interrater coefficients to correct for criterion reliability in validity generalization (VG) studies and dispute whether .52 is an accurate and non-dubious estimate of the interrater reliability of overall job performance (OJP) ratings. As researchers who have conducted several meta-analytic (MA) and VG studies in which the value of the interrater reliability was estimated, we here make a number of comments and reflections on LeBreton et al.'s paper. We organize our comments under six points: (1) whether .52 is in fact a dubious interrater reliability value of OJP, (2) their criticism that corrected coefficients were wrongly labelled as uncorrected coefficients, (3) some labelling errors in LeBreton et al.'s own paper, (4) whether it is appropriate to correct observed validity for criterion reliability, (5) whether interrater reliability is the appropriate coefficient to correct for criterion reliability in VG studies, and (6) wider issues over the value of VG studies for informing policies and practices in organizations.

In combination, we argue that these points indicate unequivocally that the case of LeBreton et al. (2014) is logically flawed, and indeed on closer inspection has been built up piecemeal on a number of outlier interpretations, non sequiturs of logical progression, and impractical calls for data set treatment in VG studies. Following their recommendations risks "throwing the baby out with the bathwater" and reducing the likelihood that VG studies will continue to have important positive benefits for practice in employee selection and other areas of I/O psychology.

Is .52 a Dubious Interrater Reliability Value?

LeBreton et al. (2014) doubt whether .52 is a legitimate and accurate estimate of the interrater reliability. To quote, they argue that "the past VG studies which relied on this dubious criterion reliability value have a less than secure statistical foundation", and that they "suspect that researchers would conclude that .52 is not a credible estimate". The problem here is that these are simply opinions without empirical basis or, in fact, any supporting rationale being proffered. LeBreton et al. do not provide any empirical support for rejecting .52 as a credible value beyond their suspicion. Should we accept this opinion and unilaterally jettison this well-established and widely used value without any supporting reasoning or empirical foundation? We believe absolutely not, especially when one considers the evidence upon which use of this interrater reliability value has been based.

Viswesvaran, Ones, and Schmidt (1996), for instance, found values of .52 (k = 40, N = 14,650) for interrater reliability, .81 for coefficients of stability (k = 12, N = 1,374), and .86 for coefficient alpha (k = 89, N = 17,899). These coefficients estimate three different sources of measurement error (Schmidt & Hunter, 1996; Viswesvaran, Schmidt, & Ones, 2002). Not all researchers agree that the interrater coefficient is the appropriate estimate of reliability. For instance, Murphy and DeShon (2000) suggested that it is not the appropriate coefficient. However, it is one thing to believe that another coefficient is the appropriate one, as Murphy and DeShon have suggested, and quite another to dispute that .52 is a credible and non-dubious estimate of interrater reliability, as LeBreton et al. (2014) have suggested. The only way to support this claim is to demonstrate beyond reasonable doubt that Viswesvaran et al. (1996) made errors when they calculated their estimates or, alternatively, to provide another estimate of the interrater correlation based on an independent database. In her large-sample study (N = 9,975) of the interrater reliability of overall performance ratings, Rothstein (1990) found that the average interrater reliability was .52. The meta-analysis by Salgado et al. (2003, Table 2) provided another estimate of the interrater reliability of overall job performance with a European set of interrater coefficients. They found exactly the same value of .52 (k = 18, N = 1,936). In a third and more recent meta-analysis, Salgado and Tauriz (2014) found that the interrater reliability of overall performance ratings was .52 (k = 8, N = 1,996), using an independent data set. The difference among the estimates of Viswesvaran et al. (1996), Salgado, Anderson, and Tauriz (2015), and Salgado and Tauriz lay in the standard deviations, which were .095, .19, and .05, respectively. That three MAs produced an identical interrater reliability estimate using entirely different samples of primary studies is more than just coincidental; it suggests that this estimate is reasonable and accurate. In a previous meta-analysis, Salgado and Moscoso (1996) estimated the interrater reliability for composite and single supervisory ratings criteria. They found mean interrater reliabilities of .618 and .402, respectively (average ryy = .51). Table 1 reports the results of a second-order meta-analysis of the first three independent studies; Salgado and Moscoso's (1996) meta-analysis was not included because it does not report the sample sizes. As can be seen, the interrater reliability is .52 and the combined standard deviation is .105, which is very close to the figure found by Viswesvaran et al. (1996). In the present case, we used the formula given by McNemar (1962, p. 24) to determine the standard deviation for three distributions combined.

Table 1

Second-order Meta-analysis of the Interrater Reliability of Job Performance Ratings.

N        k    ryy   SD      99% CI
18,582   66   .52   .1056   .518/.522

Note. N = total sample size; k = number of independent coefficients; ryy = sample-weighted average interrater reliability; SD = standard deviation of ryy; 99% CI = 99% confidence interval of interrater reliability.
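The pooled values reported in Table 1 can be checked with a short calculation. The sketch below is not the authors' code. It assumes that the three meta-analytic estimates are weighted by total sample size N, that the combined standard deviation follows the usual rule for merging several distributions (each study's variance plus the squared deviation of its mean from the grand mean), and that the 99% interval is the mean plus or minus 2.576 standard errors with SE = SD / sqrt(N). The function name `combine` is hypothetical.

```python
import math

# (label, k, N, mean interrater reliability, SD) as quoted in the text
studies = [
    ("Viswesvaran et al. (1996)", 40, 14650, 0.52, 0.095),
    ("Salgado et al. (2003)", 18, 1936, 0.52, 0.19),
    ("Salgado & Tauriz (2014)", 8, 1996, 0.52, 0.05),
]

def combine(studies):
    """Pool several distributions, weighting each by its sample size N (assumption)."""
    total_n = sum(n for _, _, n, _, _ in studies)
    total_k = sum(k for _, k, _, _, _ in studies)
    mean = sum(n * r for _, _, n, r, _ in studies) / total_n
    # Combined variance: within-study variance plus squared distance
    # of each study mean from the grand mean, weighted by N.
    var = sum(n * (sd**2 + (r - mean) ** 2) for _, _, n, r, sd in studies) / total_n
    return total_n, total_k, mean, math.sqrt(var)

total_n, total_k, mean, sd = combine(studies)
# 99% interval under a normal approximation with SE = SD / sqrt(N) (assumption)
half_width = 2.576 * sd / math.sqrt(total_n)

print(total_n, total_k, round(mean, 2), round(sd, 4))
print(round(mean - half_width, 3), round(mean + half_width, 3))
```

Under these assumptions the calculation gives N = 18,582, k = 66, a mean of .52, a combined SD of about .1056, and an interval of about .518 to .522, in line with Table 1. Weighting by k instead of N would give a noticeably larger SD, which is why N weighting is assumed here.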

Murphy and DeShon (2000, p. 896) suggested that the correlation of .52 can be a result of using contexts that encourage disagreement among raters and that encourage substantial rating inflation and, consequently, range restriction. Assuming that one rater uses the entire scale and the other only the top half of the scale, Murphy and DeShon estimated that the correlation among raters corrected for range restriction alone will be .68 and, corrected for unreliability using Viswesvaran et al.'s (1996) coefficient alpha estimate of .86, would be .79. Assuming that one rater uses the entire scale and another only the top third of the scale, their estimated values would be .91 and 1, respectively.

A problematic point in Murphy and DeShon's (2000) examples is that, in addition to assuming that the interrater correlation is a validity coefficient, they applied Thorndike's formula for Case II (Thorndike, 1949, p. 173) to correct for range restriction. However, in their examples, the proper formula to correct for range restriction would be Thorndike's formula for Case I, because the restriction is in the criterion (see the formula in the Appendix). Applying this formula, the correlation corrected for range restriction alone would be .82, and corrected for unreliability in Y1 and Y2 would be .95 (using alpha = .86) in the first example of Murphy and DeShon. In the second example, the correlation corrected for range restriction would be .96 and corrected for unreliability would be 1.12 (using alpha = .86), which is an impossible value. Moreover, it should be taken into account that if the ratings are restricted in range, Murphy and DeShon should have attenuated the reliability proportionally in order to correct for unreliability. This can be done using the formula developed independently of each other by Otis (1922) and Kelley (1921) and reproduced in many books and articles (see the Otis-Kelley formula in the Appendix). The application of this formula would result in an alpha coefficient of .68 in the first case and of -.26 in the second. Repeating the calculations with the attenuated reliability value for the first example, this would be .82 divided by the square root of .86 multiplied by .68, which equals 1.08. In other words, the corrections carried out for the two examples give two impossible values, which casts doubt both on Murphy and DeShon's rationale and on the realism of the assumed values of range restriction.
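The Case I and Otis-Kelley calculations described above can be sketched as follows. The Appendix is not reproduced in this section, so the code assumes the common textbook forms of both formulas, with U defined as the unrestricted SD divided by the restricted SD. U = 1.5 is the value the text gives for the first example; U = 3 for the second example is an assumption, as is rounding at each step. All function names are hypothetical.

```python
import math

def case1_correct(r, U):
    """Textbook Thorndike Case I correction (assumed form).
    U = unrestricted SD / restricted SD of the directly restricted variable."""
    return math.sqrt(1 - (1 - r**2) / U**2)

def otis_kelley_restricted(rel_unrestricted, U):
    """Otis-Kelley relation (assumed form) solved for the reliability
    that would be observed in the range-restricted group."""
    return 1 - U**2 * (1 - rel_unrestricted)

def murphy_deshon_example(U, r_obs=0.52, alpha=0.86):
    # Correct the observed interrater correlation for range restriction only.
    r_rr = round(case1_correct(r_obs, U), 2)  # rounded at each step (assumption)
    # Disattenuate for unreliability in both ratings: sqrt(alpha * alpha).
    r_full = round(r_rr / math.sqrt(alpha * alpha), 2)
    # Value alpha would take once it is attenuated for range restriction.
    alpha_restricted = otis_kelley_restricted(alpha, U)
    return r_rr, r_full, alpha_restricted

for U in (1.5, 3.0):  # U = 3.0 is an assumed value for the second example
    print(U, murphy_deshon_example(U))
```

With these inputs the sketch gives .82 and .95 for the first example and .96 and 1.12 for the second, with attenuated alpha values of about .68 and -.26, which agrees with the figures quoted in the text.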

If one accepts Murphy and DeShon's (2000) rationale, then it is surely unrealistic to think that, if the context produces range restriction, one rater uses the entire scale and the second rater only a fraction of the scale. It would be more realistic to think that the two raters were affected by range restriction, and consequently both would use a fraction of the scale, for example, one using the top ¾ of the scale and the second the top half of the scale. This appears to us a more realistic case. However, this case would need a correction for double range restriction. In this regard, six years after developing the formulas popularized by Thorndike (1949), Pearson (1908) developed the formula for correcting for double range restriction, which is the formula to be applied in this case (see Formula 3 in the Appendix). Applying this formula, the corrected interrater correlation would be .88.

However, if we accept that the interrater correlation is a reliability estimate (as LeBreton et al., 2014, do), and we apply the Otis-Kelley formula, this gives values of .79 and .95 as interrater reliability coefficients for the first and second cases of Murphy and DeShon (2000), respectively. In order to estimate the predictive validity of a test, and accepting that the criterion distribution is restricted in range, the correction should be made to both the criterion reliability and the restricted predictor validity. For example, if the observed correlation between a predictor (e.g., GMA) and overall job performance ratings is .25, and the value of range restriction is U = 1.5, as in Murphy and DeShon's first example, the fully corrected validity would be .77 (rounded), using the Case I formula (because the restriction is in the criterion). This requires three steps: (a) to correct the interrater reliability of .52 for range restriction using the Otis-Kelley formula, which results in .79; (b) to correct the validity of .25 by the square root of .79, which gives .28; and (c) to correct .28 for range restriction using a U value of 1.5, which produces a corrected validity of .77. If we use the formula for disattenuation only, with the criterion reliability value of .52, the corrected validity would be .35 (rounded). In other words, the correction for range restriction of the interrater reliability implies the correction for range restriction of the validity coefficient using the Case I formula, and the consequence is that a larger (and unrealistic) validity value is obtained. Moreover, the correction for range restriction in the predictor would still be lacking.
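The three steps above can be traced numerically. This is a minimal sketch that assumes the textbook forms of the Otis-Kelley and Case I formulas (the Appendix is not reproduced in this section), with U defined as the unrestricted SD divided by the restricted SD. The function names are hypothetical.

```python
import math

def otis_kelley_correct(rel_restricted, U):
    """Otis-Kelley relation (assumed form): reliability in the unrestricted
    group, given the reliability observed in the restricted group."""
    return 1 - (1 - rel_restricted) / U**2

def case1_correct(r, U):
    """Textbook Thorndike Case I correction (assumed form)."""
    return math.sqrt(1 - (1 - r**2) / U**2)

def corrected_validity(r_xy, ryy_restricted, U):
    ryy = otis_kelley_correct(ryy_restricted, U)  # step (a): .52 -> about .79
    r_dis = r_xy / math.sqrt(ryy)                 # step (b): .25 -> about .28
    return case1_correct(r_dis, U)                # step (c): about .77

full = corrected_validity(r_xy=0.25, ryy_restricted=0.52, U=1.5)
disattenuated_only = 0.25 / math.sqrt(0.52)       # no range-restriction step

print(round(full, 2), round(disattenuated_only, 2))
```

The contrast between the two printed values (about .77 against about .35) is the point of the passage: treating the criterion as range restricted more than doubles the corrected validity relative to simple disattenuation with .52.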

With regard to criterion reliability, sixty-five years ago, Thorndike (1949, pp. 106-107) wrote that "it is not of critical importance that the reliability of a criterion be high as long as it is established as definitely greater than zero. Even when the reliability of a criterion is quite low, given that it is definitely greater than zero, it is still possible to obtain fairly substantial correlations between that criterion and reliable tests and to carry out useful statistical analyses in connection with the prediction of that criterion. Given a test or composite of tests with a reliability of .90 and a criterion with reliability of .40, it is theoretically possible to obtain a correlation of .60 between the two... It is more important that the reliability of a criterion measure be known than that it be high." According to Thorndike (1949) and many classical psychometricians and I/O psychologists (e.g., Ghiselli, Campbell, & Zedeck, 1981; Guilford, 1954; Guion, 1965, 1998; Gulliksen, 1950; Nunnally, 1978, among others), when the criterion measure is unreliable, what is of critical importance is that the sample size be increased in order to allow for sampling fluctuations and to obtain stability in the relative size of the validity coefficients.
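Thorndike's figure of .60 follows from the classical attenuation relation, in which the observed correlation cannot exceed the square root of the product of the two reliabilities. The sketch below assumes that reading of the quotation; the function name is hypothetical.

```python
import math

def max_observed_correlation(rxx, ryy):
    """Upper bound on the observed correlation when the true-score
    correlation is 1.0 (classical attenuation formula)."""
    return math.sqrt(rxx * ryy)

# Test reliability .90 and criterion reliability .40, as in the quotation
print(round(max_observed_correlation(0.90, 0.40), 2))
```

With reliabilities of .90 and .40 the bound is .60, the value Thorndike cites.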

In summary, while no other more accurate estimate of the interrater reliability is available, researchers can be confident that .52 is currently a robust, accurate, and useful estimate of the interrater reliability of overall job performance for a single rater.

Were Corrected Coefficients Labelled as Uncorrected Coefficients?

LeBreton et al. (2014, p. 492) write that "coefficients that have been corrected should be so denoted (ρ̂) rather than simply labelled as observed correlation coefficients (r) or referred to as "validities" without clearly articulating the various corrections made to these correlations (cf. Hunter & Hunter, 1984, Table 10; Schmidt & Hunter, 1998, Table 1). Labelling corrected coefficients as uncorrected coefficients could lead some psychologists (or HR managers) to draw improper inferences from meta-analyses." We are not aware that this constitutes an endemic or even frequent problem in VG studies, and many published papers can be cited to demonstrate this. In addition, we strongly disagree with the use that LeBreton et al. have made of Table 1 of Schmidt and Hunter (1998) and Table 10 of Hunter and Hunter (1984). In the footnote of Table 1, Schmidt and Hunter wrote the following (the same text is repeated in Table 2):

All of the validities in this table are for the criterion of overall job performance. Unless otherwise noted, all validity estimates are corrected for the downward bias due to measurement error in the measure of job performance (emphasis added) and range restriction on the predictor in incumbent samples relative to applicant populations. The correlations between GMA and other predictors are corrected for range restriction but not for measurement error in either measure (thus they are smaller than fully corrected mean values in the literature). These correlations represent observed score correlations between selection methods in applicant populations.

With regard to Hunter and Hunter's (1984) Table 10, once again, LeBreton et al. (2014) are not fair, and instead adopt an extreme interpretation and position. Hunter and Hunter wrote:

If the predictors are to be compared, the criterion for job performance must be the same for all. This necessity inexorably leads to the choice of supervisor ratings (with correction for measurement error) as the criterion because they are prediction studies for supervisory ratings for all predictors (p. 89).

Therefore, Hunter and Hunter (1984) explained that they made corrections for criterion unreliability. Consequently, Hunter and Hunter and Schmidt and Hunter (1998) properly labelled the corrected coefficients, and they have not led psychologists (or HR managers) to draw improper inferences from meta-analyses. In other words, if a reader draws improper inferences, it is because he/she has not properly read the footnote and the explanations. The responsibility is that of the reader, not of the writers, and it is, ironically, LeBreton et al. (2014) who may have misguided readers in the