A relation between log-likelihood and cross-validation log-scores


HAL Id: hal-02267943

https://hal.archives-ouvertes.fr/hal-02267943v2

Preprint submitted on 22 Aug 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



P.G.L. Porta Mana

To cite this version:

P.G.L. Porta Mana. A relation between log-likelihood and cross-validation log-scores. 2019. ⟨hal-02267943v2⟩


P.G.L. Porta Mana

Kavli Institute, Trondheim, Norway <piero.mana ntnu.no>

18 August 2019; updated 19 August 2019

It is shown that the log-likelihood of a hypothesis or model given some data is equal to an average of all leave-one-out cross-validation log-scores that can be calculated from all subsets of the data. This relation can be generalized to any k-fold cross-validation log-scores.

Note: Dear Reader & Peer, this manuscript is being peer-reviewed by you. Thank you.

1 Log-likelihoods and cross-validation log-scores

The probability calculus unequivocally tells us how our degree of belief in a hypothesis $H_h$ given data $D$ and background information or assumptions $I$, that is, $P(H_h \mid D I)$, is related to our degree of belief in observing those data when we entertain that hypothesis as true, that is, $P(D \mid H_h I)$:

$$
\begin{aligned}
P(H_h \mid D I) &= \frac{P(D \mid H_h I)\, P(H_h \mid I)}{P(D \mid I)} && \text{(1a)} \\
&= \frac{P(D \mid H_h I)\, P(H_h \mid I)}{\sum_{h'} P(D \mid H_{h'} I)\, P(H_{h'} \mid I)}. && \text{(1b)}
\end{aligned}
$$

$D$, $H_h$, $I$ denote propositions, which are usually about numeric quantities. I use the terms ‘degree of belief’, ‘belief’, and ‘probability’ as synonyms.

By ‘hypothesis’ I mean either a scientific (physical, biological, etc.) hypothesis, that is, a state or development of things capable of experimental verification, at least in a thought experiment, or more generally some proposition, often not precisely specified, which leads to quantitatively specific distributions of beliefs for any contemplated data set. In the latter case we often call $H_h$ a ‘(probabilistic) model’ rather than a ‘hypothesis’.

Expression (1b) assumes that we have a set $\{H_h\}$ of mutually exclusive and exhaustive hypotheses under consideration, which is implicit in our knowledge $I$. In fact it’s only valid if

$$
P\Bigl(\bigvee\nolimits_h H_h \Bigm| I\Bigr) = 1, \qquad P(H_h \land H_{h'} \mid I) = 0 \quad \text{if } h \ne h'. \tag{2}
$$

Only rarely does the set of hypotheses $\{H_h\}$ encompass and reflect the extremely complex and fuzzy hypotheses lying in the backs of our minds. They’re simplified pictures. That’s also why they’re called ‘models’.

Expression (1a), by contrast, is universally valid, but it’s rarely possible to quantify its denominator $P(D \mid I)$ unless we simplify our inferential problem by introducing a possibly unrealistic exhaustive set of hypotheses, thus falling back to (1b). We can bypass this problem if we are content with comparing our beliefs about any two hypotheses through their ratio, so that the term $P(D \mid I)$ cancels out. See Jaynes’s¹ insightful remarks about such binary comparisons, and also Good’s².

The term $P(D \mid H_h I)$ in eq. (1) is called the likelihood of the hypothesis given the data³. Its logarithm is surprisingly called log-likelihood:

$$
\log P(D \mid H_h I), \tag{3}
$$

where the logarithm can be taken in an arbitrary base (Turing, Good⁴, Jaynes⁵ recommend base $10^{1/10}$, leading to a measurement in decibels; see the cited works for the practical advantages of such a choice).
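The decibel convention can be illustrated with a minimal Python sketch. The function name `log_likelihood_db` and the likelihood values are assumptions made only for this illustration; they do not come from the cited works. The sketch relies on the fact that a logarithm in base $10^{1/10}$ equals ten times the base-10 logarithm.

```python
import math

def log_likelihood_db(likelihood):
    """Return log P(D | H_h I) in decibels, i.e. the logarithm in base 10**(1/10)."""
    return 10.0 * math.log10(likelihood)

# Hypothetical likelihood values, chosen only for illustration.
for likelihood in (0.5, 0.1, 0.001):
    print(likelihood, log_likelihood_db(likelihood))
```

With this convention a tenfold change in the likelihood corresponds to a change of 10 decibels in the log-likelihood.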

The ratio of the likelihoods of two hypotheses, called relative Bayes factor, or its logarithm, the relative weight of evidence,⁶ is often used to quantify how much the data favour our belief in one versus the other hypothesis (that is, assuming at least momentarily that they be exhaustive). ‘It is historically interesting that the expression “weight of evidence”, in its technical sense, anticipated the term “likelihood” by over forty years’⁷.

Recent literature⁸ seems to exclusively deal with relative Bayes factors. I’d like to recall, lest it fade from the memory, the definition of the non-relative Bayes factor for a hypothesis $H_h$ provided by data $D$:⁹

$$
\frac{P(D \mid H_h I)}{P(D \mid \lnot H_h I)} \equiv \frac{O(H_h \mid D I)}{O(H_h \mid I)} = \frac{P(D \mid H_h I)\, [1 - P(H_h \mid I)]}{\sum_{h' \ne h} P(D \mid H_{h'} I)\, P(H_{h'} \mid I)}, \tag{4}
$$

where the odds $O$ is defined as $O := P/(1-P)$. Looking at the expression on the right, which can be derived from the probability rules, it’s clear that the Bayes factor for a hypothesis involves the likelihoods of all other hypotheses as well as their pre-data probabilities. This quantity and its logarithm, the (non-relative) weight of evidence, have important properties which relative Bayes factors and relative weights of evidence don’t enjoy. For example, the expected weight of evidence for a correct hypothesis is always positive, and for a wrong hypothesis always negative¹⁰. See Jaynes¹¹ for further discussion and a numeric example.

¹ Jaynes 2003 §§4.3–4.4.
² Good 1950 §6.3–6.6.
³ Good 1950 §6.1 p. 62.
⁴ e.g. Good 1985, 1950, 1969.
⁵ Jaynes 2003 §4.2.
⁶ Good 1950 ch. 6, 1975, 1981, 1985, and many other works in Good 1983; Osteyee et al. 1974 §1.4, MacKay 1992, Kass et al. 1995; see also Jeffreys 1983 chs V, VI, A.
⁷ Osteyee et al. 1974 §1.4.2 p. 12.
⁸ for example Kass et al. 1995.
⁹ Good 1981 §2.
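The two sides of definition (4) can be compared numerically. The following is a minimal Python sketch, assuming three mutually exclusive and exhaustive hypotheses with made-up likelihoods and pre-data probabilities; the names `odds` and `bayes_factor` are hypothetical helpers introduced only for this illustration.

```python
import math

def odds(p):
    """Odds O := P / (1 - P)."""
    return p / (1.0 - p)

def bayes_factor(likelihoods, priors, h):
    """Right-hand side of eq. (4) for the hypothesis with index h."""
    others = sum(
        likelihoods[k] * priors[k] for k in range(len(priors)) if k != h
    )
    return likelihoods[h] * (1.0 - priors[h]) / others

# Hypothetical numbers, chosen only for illustration.
likelihoods = [0.20, 0.05, 0.01]  # P(D | H_h I)
priors = [0.5, 0.3, 0.2]          # P(H_h | I), summing to 1

# P(D | I) and the post-data probabilities, as in eq. (1b).
evidence = sum(l * p for l, p in zip(likelihoods, priors))
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

for h in range(len(priors)):
    odds_ratio = odds(posteriors[h]) / odds(priors[h])
    print(h, bayes_factor(likelihoods, priors, h), odds_ratio)
```

For each hypothesis the two printed numbers should agree up to rounding, and each depends on the likelihoods and pre-data probabilities of the other hypotheses.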

The literature in probability and statistics has also employed and debated other ad-hoc measures to quantify how the data relate to the hypotheses, or even to select one hypothesis for further use, discarding the others¹². Here I consider one measure in particular: the leave-one-out cross-validation log-score¹², which I’ll just call ‘log-score’ for brevity:

$$
\frac{1}{d} \sum_{i=1}^{d} \log P(D_i \mid D_{-i} H_h I) \tag{5}
$$

where every $D_i$ is one datum in the data $D \equiv \bigwedge_{i=1}^{d} D_i$, and $D_{-i}$ denotes the data with datum $D_i$ excluded. The intuition behind this score can be colloquially expressed thus: ‘let’s see what my belief in one datum would be, on average, once I’ve observed the other data, if I consider $H_h$ as true’. ‘On average’ means considering such belief for every single datum in turn, and then taking the geometric mean of the resulting beliefs. Other variants of this score use more general partitions of the data into two disjoint subsets¹².
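The log-score (5) can be made concrete with a minimal Python sketch. It assumes, purely for illustration, binary data and a hypothetical model $H$ in which the data are exchangeable with a uniform prior on the long-run frequency; the names `log_joint` and `loo_log_score` are not from the paper. Each conditional probability is obtained from the product rule as a ratio of joint probabilities.

```python
import math

def log_joint(values):
    """log P(values | H I) for binary data under the assumed model:
    exchangeable trials with a uniform prior on the frequency."""
    n, s = len(values), sum(values)
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)

def loo_log_score(data):
    """Leave-one-out cross-validation log-score, eq. (5)."""
    d = len(data)
    total = 0.0
    for i in range(d):
        rest = data[:i] + data[i + 1:]                # D_{-i}
        total += log_joint(data) - log_joint(rest)    # log P(D_i | D_{-i} H I)
    return total / d

data = [1, 0, 1, 1, 0, 1]  # hypothetical data
score = loo_log_score(data)
print(score, math.exp(score))  # log-score and geometric mean of the beliefs
```

The exponential of the log-score is the geometric mean of the $d$ conditional probabilities, in line with the colloquial reading given above.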

If you find this you can claim a postcard from me.

My purpose is to show an exact relation between the log-likelihood (3) and the leave-one-out cross-validation log-score (5). This relation doesn’t seem to appear in the literature, and I find it very intriguing because it portrays the log-likelihood as a sort of full-scale use of the log-score: it says that the log-likelihood is the sum of all averaged log-scores that can be formed from all data subsets. The relation can be extended to more general cross-validation log-scores, and it can be of interest for the debate about the soundness of log-scores in deciding among hypotheses.

2 A relation between log-likelihood and log-score

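We can obviously write the likelihood as the $d$th root of its $d$th power:

$$
P(D \mid H I) \equiv \Bigl[\, \underbrace{P(D \mid H I) \times \dotsm \times P(D \mid H I)}_{d \text{ times}} \,\Bigr]^{1/d} \tag{6}
$$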

¹⁰ Good 1950 §6.7.
¹¹ Jaynes 2003 §§4.3–4.4.
¹² Bernardo et al. 2000 §§3.4, 6.1.6 gives the clearest motivation and explanation, see also Stone 1977, Geisser et al. 1979, Vehtari et al. 2012, 2002, Krnjajić et al. 2011, 2014, Gelman et al. 2014, Gronau et al. 2019, Chandramouli et al. 2019.


where we have dropped the subscript $h$ for simplicity. By the rules of probability we have

$$
P(D \mid H I) = P(D_i \mid D_{-i} H I) \times P(D_{-i} \mid H I) \tag{7}
$$

no matter which specific $i \in \{1, \dots, d\}$ we choose (temporal ordering and similar matters are completely irrelevant in the formula above: it’s a logical relation between propositions). So let’s expand each of the $d$ factors in the identity (6) using the product rule (7), using a different $i$ for each of them. The result can be thus displayed:

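$$
P(D \mid H I) \equiv
\left[
\begin{array}{l@{}l}
P(D_1 \mid D_{-1} H I) & {} \times P(D_{-1} \mid H I) \times {} \\
P(D_2 \mid D_{-2} H I) & {} \times P(D_{-2} \mid H I) \times {} \\
\quad\vdots & \\
P(D_d \mid D_{-d} H I) & {} \times P(D_{-d} \mid H I)
\end{array}
\right]^{1/d}.
\tag{8}
$$

(The left column of factors leads to the log-score.)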

Upon taking the logarithm of this expression, the $d$ factors vertically aligned on the left add up to the log-score (5), as indicated. But the mathematical reshaping we just did for $P(D \mid H I)$ (that is, the root-product identity (6) and the expansion (8)) can be done for each of the remaining factors $P(D_{-i} \mid H I)$ vertically aligned on the right in the expression above; and so on recursively. Here is an explicit example for $d = 3$:

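$$
P(D \mid H I) \equiv
\left\{
\begin{array}{lll}
P(D_1 \mid D_2 D_3 H I) \times {} & \bigl[ P(D_2 \mid D_3 H I) \times {} & P(D_3 \mid H I) \times {} \\
 & \phantom{\bigl[} P(D_3 \mid D_2 H I) \times {} & P(D_2 \mid H I) \bigr]^{1/2} \times {} \\
P(D_2 \mid D_1 D_3 H I) \times {} & \bigl[ P(D_1 \mid D_3 H I) \times {} & P(D_3 \mid H I) \times {} \\
 & \phantom{\bigl[} P(D_3 \mid D_1 H I) \times {} & P(D_1 \mid H I) \bigr]^{1/2} \times {} \\
P(D_3 \mid D_1 D_2 H I) \times {} & \bigl[ P(D_1 \mid D_2 H I) \times {} & P(D_2 \mid H I) \times {} \\
 & \phantom{\bigl[} P(D_2 \mid D_1 H I) \times {} & P(D_1 \mid H I) \bigr]^{1/2}
\end{array}
\right\}^{1/3}.
\tag{9}
$$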

In this example the logarithm of the three vertically aligned factors in the left column is, as already noted, the log-score (5). The logarithm of the six vertically aligned factors in the central column is an average of the log-scores calculated for the three distinct subsets of pairs of data $\{D_1 D_2\}$, $\{D_1 D_3\}$, $\{D_2 D_3\}$. Likewise, the logarithm of the six factors vertically aligned on the right is the average of the log-scores for the three subsets of data singletons $\{D_1\}$, $\{D_2\}$, $\{D_3\}$.
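The three-column reading of the $d = 3$ example can be checked numerically. The sketch below is only an illustration under stated assumptions: the hypothesis $H$ is represented by an arbitrary, randomly generated joint distribution over three binary data, the observed values are made up, and the helper names `log_p` and `log_cond` are hypothetical.

```python
import itertools
import math
import random

random.seed(0)
d = 3
observed = (1, 0, 1)  # hypothetical observed values of D_1, D_2, D_3

# Assumed hypothesis H: an arbitrary joint distribution over three binary data.
outcomes = list(itertools.product((0, 1), repeat=d))
weights = [random.random() for _ in outcomes]
norm = sum(weights)
joint = {x: w / norm for x, w in zip(outcomes, weights)}

def log_p(subset):
    """log-probability, given H I, that the data indexed by `subset` take their observed values."""
    return math.log(sum(p for x, p in joint.items()
                        if all(x[i] == observed[i] for i in subset)))

def log_cond(i, given):
    """log P(D_i | data in `given`, H I), from the product rule."""
    return log_p(set(given) | {i}) - log_p(given)

# Left column: log-score of the full data set.
left = sum(log_cond(i, [j for j in range(d) if j != i]) for i in range(d)) / 3
# Central column: six factors, averaged log-scores of the three pairs.
central = sum(log_cond(a, [b]) for a, b in itertools.permutations(range(d), 2)) / 6
# Right column: averaged log-scores of the three singletons.
right = sum(log_cond(i, []) for i in range(d)) / 3

print(left + central + right, log_p(range(d)))
```

The two printed numbers should coincide up to rounding, whatever joint distribution is used.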

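In the general case with $d$ data there are $\binom{d}{k}$ subsets with $k$ data points. We therefore obtain

$$
\begin{aligned}
\log P(D \mid H I) \equiv{}
& \frac{1}{d} \sum_{i=1}^{d} \log P(D_i \mid D_{-i} H I) + {} \\
& \frac{1}{d} \sum_{i \in \{1,\dots,d\}} \frac{1}{d-1} \sum_{\substack{j \in \{1,\dots,d\} \\ j \ne i}} \log P(D_j \mid D_{-i,-j} H I) + {} \\
& \binom{d}{d-2}^{-1} \sum_{\substack{i,j \in \{1,\dots,d\} \\ i<j}} \frac{1}{d-2} \sum_{\substack{k \in \{1,\dots,d\} \\ k \ne i,j}} \log P(D_k \mid D_{-i,-j,-k} H I) + {} \\
& \dotsb + {} \\
& \binom{d}{2}^{-1} \sum_{\substack{i,j \in \{1,\dots,d\} \\ i<j}} \frac{1}{2} \bigl[ \log P(D_i \mid D_j H I) + \log P(D_j \mid D_i H I) \bigr] + {} \\
& \frac{1}{d} \sum_{i=1}^{d} \log P(D_i \mid H I),
\end{aligned}
\tag{10}
$$

where $D_{-i,-j}$ denotes the data with $D_i$ and $D_j$ excluded, and similarly for $D_{-i,-j,-k}$. This can be compactly written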

$$
\log P(D \mid H I) \equiv \sum_{k=1}^{d} \binom{d}{k}^{-1} \sum_{\text{ordered $k$-tuples}} \frac{1}{k} \sum_{\text{cyclic permutations}} \log P(D_{i_1} \mid D_{i_2} \dotsm D_{i_k} H I). \tag{11}
$$

That is, the log-likelihood is the sum of all averaged log-scores that can be formed from all (non-empty) data subsets with $k$ elements, the average for log-scores over $k$ data being taken over the $\binom{d}{k}$ subsets having the same cardinality $k$.
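Relation (11) can be checked numerically for a small data set. The following Python sketch is an illustration under stated assumptions, not a proof: the hypothesis $H$ is an arbitrary random joint distribution over $d = 5$ binary data, the observed values are made up, and `log_p` and `log_score` are hypothetical helper names. The inner sum over cyclic permutations is implemented as a sum over which datum of the subset is predicted from the others.

```python
import itertools
import math
import random

random.seed(1)
d = 5
observed = tuple(random.randint(0, 1) for _ in range(d))  # hypothetical data

# Assumed hypothesis H: an arbitrary joint distribution over d binary data.
outcomes = list(itertools.product((0, 1), repeat=d))
weights = [random.random() for _ in outcomes]
norm = sum(weights)
joint = {x: w / norm for x, w in zip(outcomes, weights)}

def log_p(subset):
    """log-probability, given H I, that the data indexed by `subset` take their observed values."""
    return math.log(sum(p for x, p in joint.items()
                        if all(x[i] == observed[i] for i in subset)))

def log_score(subset):
    """Leave-one-out log-score (5), restricted to the data indexed by `subset`."""
    k = len(subset)
    return sum(log_p(subset) - log_p([j for j in subset if j != i])
               for i in subset) / k

rhs = 0.0
for k in range(1, d + 1):
    subsets = list(itertools.combinations(range(d), k))  # binom(d, k) subsets
    rhs += sum(log_score(s) for s in subsets) / len(subsets)

lhs = log_p(range(d))  # log-likelihood of the full data
print(lhs, rhs)
```

The agreement of the two printed values does not depend on the particular distribution or data chosen here; changing the seed gives another instance of the same identity.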

There’s also an equivalent form with a slightly different cross-validating interpretation: We take each datum $D_j$ in turn and calculate our log-belief in it conditional on all possible subsets of remaining data, from the empty subset with no data (term $k = 0$), to the only subset $D_{-j}$ with all data except $D_j$ (term $k = d-1$). These log-beliefs are averaged over the $\binom{d-1}{k}$ subsets having the same cardinality $k$. The result can be expressed as

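$$
\log P(D \mid H I) \equiv \frac{1}{d} \sum_{j=1}^{d} \sum_{k=0}^{d-1} \binom{d-1}{k}^{-1} \sum_{\substack{\text{ordered $k$-tuples,} \\ \text{$j$ excluded}}} \log P(D_j \mid D_{i_1} \dotsm D_{i_k} H I). \tag{12}
$$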
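The form (12) can be checked in the same spirit. This minimal Python sketch assumes, only for illustration, binary data and a hypothetical exchangeable model with a uniform prior on the frequency, for which the joint probability of any subset has a closed form; `log_joint` and the data values are assumptions, not part of the paper.

```python
import itertools
import math

def log_joint(values):
    """log P(values | H I) for binary data under the assumed model:
    exchangeable trials with a uniform prior on the frequency."""
    n, s = len(values), sum(values)
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)

data = [1, 0, 1, 1, 0]  # hypothetical data
d = len(data)

rhs = 0.0
for j in range(d):
    rest = [i for i in range(d) if i != j]
    for k in range(d):  # k = 0, ..., d-1 conditioning data
        subsets = list(itertools.combinations(rest, k))  # binom(d-1, k) subsets
        term = 0.0
        for s in subsets:
            given = [data[i] for i in s]
            # log P(D_j | D_{i_1} ... D_{i_k} H I), from the product rule
            term += log_joint(given + [data[j]]) - log_joint(given)
        rhs += term / len(subsets)
rhs /= d

lhs = log_joint(data)
print(lhs, rhs)
```

Both numbers should agree up to rounding; for these particular data and this assumed model the log-likelihood is $\log(1/60)$.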

3 Brief discussion

It’s remarkable that the individual log-scores in expressions (11) and (12) above are computationally expensive, but their sum results in the log-likelihood, which is less expensive.
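One rough way to see the expense is to count the conditional log-probabilities that appear on the right-hand side of (12). The sketch below assumes a naive evaluation in which every term is computed separately; the function name `num_terms` is hypothetical, and the count says nothing about the cost of any single term.

```python
import math

def num_terms(d):
    """Number of conditional log-probabilities on the right-hand side of eq. (12):
    for each of the d data, binom(d-1, k) conditioning subsets for k = 0, ..., d-1."""
    return sum(d * math.comb(d - 1, k) for k in range(d))

for d in (3, 10, 20):
    print(d, num_terms(d))
```

The count equals $d\,2^{d-1}$, so it grows exponentially with the number of data, whereas the left-hand side is a single log-likelihood.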

The relation (11) invites us to see the log-likelihood as a refinement and improvement of the log-score. The log-likelihood takes into account not only the log-score for the whole data, but also the log-scores for all possible subsets of data. Figuratively speaking it examines the relationship between data and hypothesis locally, globally, and on all intermediate scales. To me this property makes the log-likelihood preferable to any single log-score (besides the fact that the log-likelihood is directly obtained from the principles of the probability calculus), because our interest is usually in how the hypothesis $H$ relates to single data points as well as to any collection of them. I hope to discuss this point, which also involves the distinction between simple and composite hypotheses¹³, more in detail elsewhere¹⁴.

By applying the identity (6) and generalizing the expansion (7) to different divisions of the data (leave-two-out, leave-three-out, and so on) we see that the relation (11) can be generalized to any k-fold cross-validation log-scores. Thus the log-likelihood is also equivalent to an average of all conceivable cross-validation log-scores for all subsets of data, though I haven’t calculated the weights of such an average.

Thanks

... to the Kavli Foundation and the Centre of Excellence scheme of the Research Council of Norway (Yasser Roudi group) for financial support.

¹³ Bernardo et al. 2000 §6.1.4.
¹⁴ Porta Mana 2019.


To Aki Vehtari for some references. To the staff of the NTNU library for their always prompt assistance. To Mari, Miri, Emma for continuous encouragement and affection, and to Buster Keaton and Saitama for filling life with awe and inspiration. To the developers and maintainers of Open Science Framework, LaTeX, Emacs, AUCTeX, Python, Inkscape, Sci-Hub for making a free and impartial scientific exchange possible.

Bibliography

Bernardo, J.-M., DeGroot, M. H., Lindley, D. V., Smith, A. F. M., eds. (1985): Bayesian Statistics 2. (Elsevier and Valencia University Press, Amsterdam and Valencia). https://www.uv.es/~bernardo/valenciam.html.

Bernardo, J.-M., Smith, A. F. (2000): Bayesian Theory, reprint. (Wiley, New York). First publ. 1994.

Chandramouli, S. H., Shiffrin, R. M., Vehtari, A., Simpson, D. P., Yao, Y., Gelman, A., Navarro, D. J., Gronau, Q. F., et al. (2019): Commentary on Gronau and Wagenmakers. Limitations of “Limitations of Bayesian leave-one-out cross-validation for model selection”. Between the devil and the deep blue sea: tensions between scientific judgement and statistical model selection. Rejoinder: more limitations of Bayesian leave-one-out cross-validation. Comput. Brain Behav. 2(1), 12–47. See Gronau, Wagenmakers (2019).

Geisser, S., Eddy, W. F. (1979): A predictive approach to model selection. J. Am. Stat. Assoc. 74(365), 153–160.

Gelman, A., Hwang, J., Vehtari, A. (2014): Understanding predictive information criteria for Bayesian models. Stat. Comput. 24(6), 997–1016.

Good, I. J. (1950): Probability and the Weighing of Evidence. (Griffin, London).

Good, I. J. (1969): A subjective evaluation of Bode’s law and an ‘objective’ test for approximate numerical rationality. J. Am. Stat. Assoc. 64(325), 23–49. Partly repr. in Good (1983) ch. 13.

Good, I. J. (1975): Explicativity, corroboration, and the relative odds of hypotheses. Synthese 30(1–2), 39–73. Partly repr. in Good (1983) ch. 15.

Good, I. J. (1981): Some logic and history of hypothesis testing. In: Philosophy in economics. Ed. by J. C. Pitt (Reidel), 149–174. Repr. in Good (1983) ch. 14 pp. 129–148.

Good, I. J. (1983): Good Thinking: The Foundations of Probability and Its Applications. (University of Minnesota Press, Minneapolis, USA).

Good, I. J. (1985): Weight of evidence: a brief survey. In: Bernardo, DeGroot, Lindley, Smith (1985), 249–270. With discussion by H. Rubin, T. Seidenfeld, and reply.

Gronau, Q. F., Wagenmakers, E.-J. (2019): Limitations of Bayesian leave-one-out cross-validation for model selection. Comput. Brain Behav. 2(1), 1–11. See also comments and rejoinder in Chandramouli, Shiffrin, Vehtari, Simpson, Yao, Gelman, Navarro, Gronau, et al. (2019).

Jaynes, E. T. (2003): Probability Theory: The Logic of Science. (Cambridge University Press, Cambridge). Ed. by G. Larry Bretthorst. First publ. 1994. https://archive.org/details/XQUHIUXHIQUHIQXUIHX2, http://www-biba.inrialpes.fr/Jaynes/prob.html.

Jeffreys, H. (1983): Theory of Probability, third ed. with corrections. (Oxford University Press, London). First publ. 1939.


Kass, R. E., Raftery, A. E. (1995): Bayes factors. J. Am. Stat. Assoc. 90(430), 773–795. https://www.stat.washington.edu/raftery/Research/PDF/kass1995.pdf; https://www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf.

Krnjajić, M., Draper, D. (2011): Bayesian model specification: some problems related to model choice and calibration. http://hdl.handle.net/10379/3804.

Krnjajić, M., Draper, D. (2014): Bayesian model comparison: log scores and DIC. Stat. Probab. Lett. 88, 9–14.

MacKay, D. J. C. (1992): Bayesian interpolation. Neural Comp. 4(3), 415–447. http://www.inference.phy.cam.ac.uk/mackay/PhD.html.

Osteyee, D. B., Good, I. J. (1974): Information, Weight of Evidence, the Singularity between Probability Measures and Signal Detection. (Springer, Berlin).

Porta Mana, P. G. L. (2019): Probabilistic models: models of what? In preparation.

Stone, M. (1977): An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. Roy. Stat. Soc. B 39(1), 44–47.

Vehtari, A., Lampinen, J. (2002): Bayesian model assessment and comparison using cross-validation predictive densities. Neural Comp. 14(10), 2439–2468.

Vehtari, A., Ojanen, J. (2012): A survey of Bayesian predictive methods for model assessment, selection and comparison. Statist. Surv. 6, 142–228.
