HAL Id: hal-02267943
https://hal.archives-ouvertes.fr/hal-02267943v2
Preprint submitted on 22 Aug 2019
HAL
is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire
HAL, estdestinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
A relation between log-likelihood and cross-validation log-scores
P.G.L. Porta Mana
To cite this version:
P.G.L. Porta Mana. A relation between log-likelihood and cross-validation log-scores. 2019. �hal-
02267943v2�
and cross-validation log-scores
P.G.L.PortaMana
KavliInstitute,Trondheim,Norway <piero.mana ntnu.no>
18August2019;updated19August2019
Itisshownthatthelog-likelihoodofahypothesisormodelgivensomedatais equaltoanaverageofallleave-one-outcross-validationlog-scoresthatcanbe calculatedfromallsubsetsofthedata.Thisrelationcanbegeneralizedtoany k-foldcross-validationlog-scores.
Note:DearReader&Peer,thismanuscriptisbeingpeer-reviewedbyyou.Thankyou.
1 Log-likelihoodsandcross-validationlog-scores
Theprobabilitycalculusunequivocallytellsushowourdegreeofbe- liefinahypothesisHh givendataDandbackgroundinformationor assumptionsI,thatis,P(Hh| DI),isrelatedtoourdegreeofbeliefin observingthosedatawhenweentertainthathypothesisastrue,thatis, P(D|HhI):
P(Hh|DI) P(D|HhI)P(Hh|I)
P(D|I) (1a)
PP(D|HhI)P(Hh|I)
h′P(D|Hh′I)P(Hh′|I). (1b) D,Hh,Idenotepropositions,whichareusuallyaboutnumericquantities.
Iusetheterms‘degreeofbelief’,‘belief’,and‘probability’assynonyms.
By ‘hypothesis’ I mean either a scientific (physical, biological, etc.) hypothesis–astateordevelopmentofthingscapableofexperimental verification,atleastinathoughtexperiment–ormoregenerallysome proposition,oftennotpreciselyspecified,whichleadstoquantitatively specificdistributionsofbeliefsforanycontemplateddataset.Inthelatter caseweoftencallHh a‘(probabilistic)model’ratherthana‘hypothesis’.
Expression(1b)assumesthatwehaveaset{Hh}ofmutuallyexclusive andexhaustivehypothesesunderconsideration,whichisimplicitinour knowledgeI.Infactit’sonlyvalidif
PW
hHh|I
1, P(Hh∧Hh′|I)0 ifh,h′. (2) Onlyrarelydoesthesetofhypotheses{Hh}encompassandreflectthe
1 This
documentisdesignedforscreenreadingandtwo-upprintingonA4orLetterpaper
PortaMana log-likelihoodandcross-validation
extremelycomplexandfuzzyhypotheseslyinginthebacksofourminds.
They’resimplifiedpictures.That’salsowhythey’recalled‘models’.
Expression(1a)isuniversallyvalidinstead,butit’srarelypossible toquantifyitsdenominatorP(D|I)unlesswesimplifyourinferential problembyintroducingapossiblyunrealisticexhaustivesetofhypo- theses,thusfallingbackto(1b).Wecanbypassthisproblemifweare contentwithcomparingourbeliefsaboutanytwohypothesesthrough theirratio,sothatthetermP(D|I)cancelsout.SeeJaynes’s1insightful remarksaboutsuchbinarycomparisons,andalsoGood’s2.
ThetermP(D|HhI)ineq.(1)iscalledthelikelihoodofthehypothesis giventhedata3.Itslogarithmissurprisinglycalledlog-likelihood:
logP(D|HhI), (3)
wherethelogarithmcanbetakeninanarbitrarybasis(Turing,Good4, Jaynes5recommendbase101/10,leadingtoameasurementindecibels;
seethecitedworksforthepracticaladvantagesofsuchchoice).
Theratioofthelikelihoodsoftwohypotheses,calledrelativeBayes factor,or itslogarithm, the relativeweight ofevidence,6 are oftenused to quantify howmuch the datafavourour belief in one versus the otherhypothesis(thatis,assumingatleastmomentarilythattheybe exhaustive).‘Itishistoricallyinterestingthattheexpression“weightof evidence”,initstechnicalsense,anticipatedtheterm“likelihood”by overfortyyears’7.
Recentliterature8seemstoexclusivelydealwithrelativeBayesfactors.I’dliketorecall,lest itfadesfromthememory,thedefinitionofthenon-relativeBayesfactorforahypothesis HhprovidedbydataD:9
P(D|HhI) P(D|¬HhI)≡
O(Hh|DI) O(Hh|I)
P(D|HhI) [1−P(Hh|I)]
Ph′,h
h′ P(D|Hh′I)P(Hh′|I), (4) wheretheoddsOisdefinedasO:P/(1−P).Lookingattheexpressionontheright,which canbederivedfromtheprobabilityrules,it’sclearthattheBayesfactorforahypothesis involvesthelikelihoodsofallotherhypothesesaswellastheirpre-dataprobabilities.This quantityanditslogarithm,the(non-relative)weightofevidence,haveimportantproperties whichrelativeBayesfactorsandrelativeweightsofevidencedon’tenjoy.Forexample,the
1Jaynes2003§§4.3–4.4. 2Good1950§6.3–6.6. 3Good1950§6.1p.62. 4e.g.Good 1985,1950,1969. 5Jaynes2003§4.2. 6Good1950ch.6,1975,1981,1985,andmany otherworksinGood1983;Osteyeeetal.1974§1.4,MacKay1992,Kassetal.1995;seealso Jeffreys1983chsV,VI,A. 7Osteyeeetal.1974§1.4.2p.12. 8forexampleKassetal.1995.
9Good1981§2.
2
expectedweightofevidenceforacorrecthypothesisisalwayspositive,andforawrong hypothesesalwaysnegative10.SeeJaynes11forfurtherdiscussionandanumericexample.
Theliteratureinprobabilityandstatisticshasalsoemployedand debatedotherad-hocmeasurestoquantifyhowthedatarelatetothe hypotheses–oreventoselectonehypothesisforfurtheruse,discarding theothers12.HereIconsideronemeasureinparticular:theleave-one-out cross-validationlog-score12,whichI’lljustcall‘log-score’forbrevity:
1 d
Xd i1
logP(Di|D−iHhI) (5) whereeveryDiisonedatuminthedataD≡Vd
i1Di,andD−idenotes thedatawithdatumDi excluded.Theintuitionbehindthisscorecan becolloquiallyexpressedthus:‘let’sseewhatmybeliefinonedatum wouldbe,onaverage,onceI’veobservedtheotherdata,ifIconsider Hhastrue’.‘Onaverage’meansconsideringsuchbeliefforeverysingle datumin turn,andthentakingthe geometricmean ofthe resulting beliefs.Othervariantsofthisscoreusemoregeneralpartitionsofthe dataintotwodisjointsubsets12.
Ifyoufindthisyoucanclaimapostcardfromme.
Mypurposeistoshowanexactrelationbetweenthelog-likelihood(3) andtheleave-one-outcross-validationlog-score(5).Thisrelationdoesn’t seemtoappearintheliterature,andIfinditveryintriguingbecause itportraysthelog-likelihoodasasortoffull-scaleuseofthelog-score:
itsaysthatthelog-likelihoodisthesumofallaveragedlog-scoresthatcanbe formedfromalldatasubsets.Therelationcanbeextendedtomoregeneral cross-validationlog-scores,anditcanbeofinterestforthedebateabout thesoundnessoflog-scoresindecidingamonghypotheses.
2 Arelationbetweenlog-likelihoodandlog-score
Wecanobviouslywritethelikelihoodasthedthrootofitsdthpower:
P(D|HI)≡
P(D|HI)×···×P(D|HI)
dtimes
1/d
(6)
10Good1950§6.7. 11Jaynes2003§§4.3–4.4. 12Bernardoetal.2000§§3.4,6.1.6givesthe clearestmotivationandexplanation,seealsoStone1977,Geisseretal.1979,Vehtarietal.
2012,2002,Krnjajićetal.2011,2014,Gelmanetal.2014,Gronauetal.2019,Chandramouli etal.2019.
PortaMana log-likelihoodandcross-validation
wherewehavedroppedthesubscripth forsimplicity.Bytherulesof probabilitywehave
P(D|HI)P(Di|D−iHhI)×P(D−i|HhI) (7) nomatterwhichspecifici ∈ {1,...,d}wechoose(temporalordering andsimilarmattersarecompletelyirrelevantintheformulaabove:it’s alogicalrelationbetweenpropositions).Solet’sexpandeachofthed factorsintheidentity(6)usingtheproductrule(7),usingadifferenti foreachofthem.Theresultcanbethusdisplayed:
P(D|HI)≡
P(D1|D−1HI)×P(D−1|HI)× P(D2|D−2HI)×P(D−2|HI)×
... ×
P(Dd|D−d
x
thiscolumnleadstothelog-score
HI)×P(D−d|HI)1/d .
(8)
Upontakingthelogarithmofthisexpression,the d factorsvertically alignedon theleft addup tothe log-score(5),as indicated.But the mathematicalreshaping wejustdidforP(D| HI) –thatis,theroot- productidentity(6)andthe expansion(8)–canbedoneforeachof theremainingfactorsP(D−i|HI)verticallyalignedontherightinthe expressionabove;andsoonrecursively.Hereisanexplicitexamplefor d3:
P(D|HI)≡
(P(D1|D2D3HI)×
P(D2|D3HI)×P(D3|HI)× P(D3|D2HI)×P(D2|HI)1/2× P(D2|D1D3HI)×
P(D1|D3HI)×P(D3|HI)× P(D3|D1HI)×P(D1|HI)1/2
×
P(D3|D1D2HI)×
P(D1|D2HI)×P(D2|HI)× P(D2|D1HI)×P(D1|HI)1/2)1/3
. (9)
Inthisexamplethelogarithmofthethreeverticallyalignedfactorsin theleftcolumnis,asalreadynoted,thelog-score(5).Thelogarithmof
4
thesixverticallyalignedfactorsinthecentralcolumnisanaverageof thelog-scorescalculatedforthethreedistinctsubsetsofpairsofdata {D1D2},{D1D3},{D2D3}.Likewise,thelogarithmofthesixfactors verticallyalignedontherightistheaverageofthelog-scoresforthe threesubsetsofdatasingletons{D1},{D2},{D3}.
Inthegeneralcasewithddatathereare dk
subsetswithkdatapoints.
Wethereforeobtain
logP(D|HI)≡ 1 d
Xd i1
logP(Di|D−iHI)+
1 d
X
i∈{1,...,d}
1 d−1
j,i
X
j∈{1,...,d}
logP(D−i,j|D−i,−jHI)+
d d−2
!−1 Xi<j i,j∈{1,...,d}
1 d−2
kX,i,j
k∈{1,...,d}
logP(D−i,−j,k|D−i,−j,−kHI)+
···+
d 2
!−1 Xi<j i,j∈{1,...,d}
1 2
logP(Di|DjHI)+logP(Dj|DiHI) +
1 d
Xd i1
logP(Di|HI), (10) whichcanbecompactlywritten
logP(D|HI)≡ Xd k1
d k
!−1X
ordered k-tuples
1 k
X
cyclic permutations
logP(Di1|Di2···DikHI). (11)
Thatis,thelog-likelihoodisthesumofallaveragedlog-scoresthatcanbeformed fromall(non-empty)datasubsetswithkelements,theaverageforlog-scores overkdatabeingtakenoverthe dk
subsetshavingthesamecardinality k.
There’s also an equivalent form with a slightly different cross- validatinginterpretation:WetakeeachdatumDjinturnandcalculate ourlog-beliefinitconditionalonallpossiblesubsetsofremainingdata,
PortaMana log-likelihoodandcross-validation
fromtheemptysubsetwithnodata(termk0),totheonlysubsetD−j
withalldataexceptDj(termkd−1).Theselog-beliefsareaveraged overthe d−k1
subsetshavingthesamecardinalityk.Theresultcanbe expressedas
logP(D|HI)≡ 1 d
Xd j1
d−1
X
k0
d−1 k
!−1X
ordered k-tuples, jexcluded
logP(Dj|Di1···DikHI). (12)
3 Briefdiscussion
It’sremarkablethattheindividuallog-scoresinexpressions(11)and (12)abovearecomputationallyexpensive,buttheirsumresultsinthe log-likelihood,whichislessexpensive.
Therelation(11)invitesustoseethelog-likelihoodasarefinement andimprovement of the log-score. Thelog-likelihoodtakes into ac- countnotonlythelog-scoreforthewholedata,butalsothelog-scores forallpossiblesubsetsofdata.Figurativelyspeakingitexaminesthe relationshipbetweendataandhypothesislocally,globally,andonall intermediatescales.Tomethispropertymakesthelog-likelihoodprefer- abletoanysinglelog-score(besidesthefactthatthelog-likelihoodis directlyobtainedfromthe principlesofthe probabilitycalculus),be- causeourinterestisusuallyinhowthehypothesisHrelatestosingle datapointsaswellastoanycollectionofthem.Ihopetodiscussthis point,whichalsoinvolvesthedistinctionbetweensimpleandcomposite hypotheses13,moreindetailelsewhere14.
Byapplyingtheidentity(6)andgeneralizingtheexpansion(7)to differentdivisionsofthedata–leave-two-out,leave-three-out,andso on–weseethattherelation(11)canbegeneralizedtoanyk-foldcross- validationlog-scores.Thusthelog-likelihoodisalsoequivalenttoan averageofallconceivablecross-validationlog-scoresforallsubsets of data,thoughIhaven’tcalculatedtheweightsofsuchaverage.
Thanks
...totheKavliFoundationandtheCentreofExcellenceschemeofthe ResearchCouncilofNorway(YasserRoudigroup)forfinancialsupport.
13Bernardoetal.2000§6.1.4. 14PortaMana2019.
6
ToAkiVehtariforsomereferences.TothestaffoftheNTNUlibraryfor theiralwaysprompt assistance.ToMari,Miri,Emmaforcontinuous encouragementandaffection,andto BusterKeatonandSaitamafor fillinglifewithaweandinspiration.Tothedevelopersandmaintainers ofOpenScienceFramework,LATEX,Emacs,AUCTEX,Python,Inkscape, Sci-Hubformakingafreeandimpartialscientificexchangepossible.
Bibliography
Bernardo,J.-M.,DeGroot,M.H.,Lindley,D.V.,Smith,A.F.M.,eds.(1985): Bayesian Statistics2.(ElsevierandValenciaUniversityPress,AmsterdamandValencia).https ://www.uv.es/~bernardo/valenciam.html.
Bernardo,J.-M.,Smith,A.F.(2000):BayesianTheory,reprint.(Wiley,NewYork).Firstpubl.
1994.
Chandramouli,S.H.,Shiffrin,R.M.,Vehtari,A.,Simpson,D.P.,Yao,Y.,Gelman,A.,Navarro, D.J.,Gronau,Q.F.,etal.(2019):CommentaryonGronauandWagenmakers.Limitationsof
“LimitationsofBayesianleave-one-outcross-validationformodelselection”.Betweenthedevil andthedeepbluesea:tensionsbetweenscientificjudgementandstatisticalmodelselection.
Rejoinder:morelimitationsofBayesianleave-one-outcross-validation.Comput.BrainBehav.
21,12–47.SeeGronau,Wagenmakers(2019).
Geisser,S.,Eddy,W.F.(1979):Apredictiveapproachtomodelselection.J.Am.Stat.Assoc.
74365,153–160.
Gelman,A.,Hwang,J.,Vehtari,A.(2014):Understandingpredictiveinformationcriteriafor Bayesianmodels.Stat.Comput.246,997–1016.
Good,I.J.(1950):ProbabilityandtheWeighingofEvidence.(Griffin,London).
— (1969):AsubjectiveevaluationofBode’slawandan‘objective’testforapproximatenumerical rationality.J.Am.Stat.Assoc.64325,23–49.Partlyrepr.inGood(1983)ch.13.
— (1975): Explicativity,corroboration,andtherelativeoddsofhypotheses. Synthese301–2, 39–73.Partlyrepr.inGood(1983)ch.15.
— (1981):Somelogicandhistoryofhypothesistesting. In:Philosophyineconomics.Ed.by J.C.Pitt(Reidel),149–174.Repr.inGood(1983)ch.14pp.129–148.
— (1983):GoodThinking:TheFoundationsofProbabilityandItsApplications.(Universityof MinnesotaPress,Minneapolis,USA).
— (1985):Weightofevidence:abriefsurvey.In:Bernardo,DeGroot,Lindley,Smith(1985), 249–270.WithdiscussionbyH.Rubin,T.Seidenfeld,andreply.
Gronau,Q.F.,Wagenmakers,E.-J.(2019):LimitationsofBayesianleave-one-outcross-validation formodelselection.Comput.BrainBehav.21,1–11.Seealsocommentsandrejoinder inChandramouli,Shiffrin,Vehtari,Simpson,Yao,Gelman,Navarro,Gronau,etal.
(2019).
Jaynes,E.T.(2003):ProbabilityTheory:TheLogicofScience.(CambridgeUniversityPress, Cambridge).Ed.byG.LarryBretthorst.Firstpubl.1994.https://archive.org/detai ls/XQUHIUXHIQUHIQXUIHX2,http://www-biba.inrialpes.fr/Jaynes/prob.html. Jeffreys,H.(1983):TheoryofProbability,thirded.withcorrections. (OxfordUniversity
Press,London).Firstpubl.1939.
PortaMana log-likelihoodandcross-validation
Kass,R.E.,Raftery,A.E.(1995):Bayesfactors.J.Am.Stat.Assoc.90430,773–795.https://w ww.stat.washington.edu/raftery/Research/PDF/kass1995.pdf;https://www.andr ew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf.
Krnjajić,M.,Draper,D.(2011):Bayesianmodelspecification:someproblemsrelatedtomodel choiceandcalibration.http://hdl.handle.net/10379/3804.
— (2014):Bayesianmodelcomparison:logscoresandDIC.Stat.Probab.Lett.88,9–14.
MacKay,D.J.C.(1992):Bayesianinterpolation.NeuralComp.43,415–447.http://www.inf erence.phy.cam.ac.uk/mackay/PhD.html.
Osteyee,D.B.,Good,I.J.(1974): Information,WeightofEvidence,theSingularitybetween ProbabilityMeasuresandSignalDetection.(Springer,Berlin).
PortaMana,P.G.L.(2019):Probabilisticmodels:modelsofwhat?Inpreparation.
Stone,M.(1977):Anasymptoticequivalenceofchoiceofmodelbycross-validationandAkaike’s criterion.J.Roy.Stat.Soc.B391,44–47.
Vehtari,A.,Lampinen,J.(2002): Bayesianmodelassessmentandcomparisonusingcross- validationpredictivedensities.NeuralComp.1410,2439–2468.
Vehtari,A.,Ojanen,J.(2012):AsurveyofBayesianpredictivemethodsformodelassessment, selectionandcomparison.Statist.Surv.6,142–228.
8