• Aucun résultat trouvé

A relation between log-likelihood and cross-validation log-scores

N/A
N/A
Protected

Academic year: 2021

Partager "A relation between log-likelihood and cross-validation log-scores"

Copied!
9
0
0

Texte intégral

(1)

HAL Id: hal-02267943

https://hal.archives-ouvertes.fr/hal-02267943v2

Preprint submitted on 22 Aug 2019

HAL

is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire

HAL, est

destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

A relation between log-likelihood and cross-validation log-scores

P.G.L. Porta Mana

To cite this version:

P.G.L. Porta Mana. A relation between log-likelihood and cross-validation log-scores. 2019. �hal-

02267943v2�

(2)

and cross-validation log-scores

P.G.L.PortaMana

KavliInstitute,Trondheim,Norway <piero.mana ntnu.no>

18August2019;updated19August2019

Itisshownthatthelog-likelihoodofahypothesisormodelgivensomedatais equaltoanaverageofallleave-one-outcross-validationlog-scoresthatcanbe calculatedfromallsubsetsofthedata.Thisrelationcanbegeneralizedtoany k-foldcross-validationlog-scores.

Note:DearReader&Peer,thismanuscriptisbeingpeer-reviewedbyyou.Thankyou.

1 Log-likelihoodsandcross-validationlog-scores

Theprobabilitycalculusunequivocallytellsushowourdegreeofbe- liefinahypothesisHh givendataDandbackgroundinformationor assumptionsI,thatis,P(Hh| DI),isrelatedtoourdegreeofbeliefin observingthosedatawhenweentertainthathypothesisastrue,thatis, P(D|HhI):

P(Hh|DI) P(D|HhI)P(Hh|I)

P(D|I) (1a)

PP(D|HhI)P(Hh|I)

hP(D|HhI)P(Hh|I). (1b) D,Hh,Idenotepropositions,whichareusuallyaboutnumericquantities.

Iusetheterms‘degreeofbelief’,‘belief’,and‘probability’assynonyms.

By ‘hypothesis’ I mean either a scientific (physical, biological, etc.) hypothesis–astateordevelopmentofthingscapableofexperimental verification,atleastinathoughtexperiment–ormoregenerallysome proposition,oftennotpreciselyspecified,whichleadstoquantitatively specificdistributionsofbeliefsforanycontemplateddataset.Inthelatter caseweoftencallHh a‘(probabilistic)model’ratherthana‘hypothesis’.

Expression(1b)assumesthatwehaveaset{Hh}ofmutuallyexclusive andexhaustivehypothesesunderconsideration,whichisimplicitinour knowledgeI.Infactit’sonlyvalidif

PW

hHh|I

1, P(Hh∧Hh|I)0 ifh,h. (2) Onlyrarelydoesthesetofhypotheses{Hh}encompassandreflectthe

1 This

documentisdesignedforscreenreadingandtwo-upprintingonA4orLetterpaper

(3)

PortaMana log-likelihoodandcross-validation

extremelycomplexandfuzzyhypotheseslyinginthebacksofourminds.

They’resimplifiedpictures.That’salsowhythey’recalled‘models’.

Expression(1a)isuniversallyvalidinstead,butit’srarelypossible toquantifyitsdenominatorP(D|I)unlesswesimplifyourinferential problembyintroducingapossiblyunrealisticexhaustivesetofhypo- theses,thusfallingbackto(1b).Wecanbypassthisproblemifweare contentwithcomparingourbeliefsaboutanytwohypothesesthrough theirratio,sothatthetermP(D|I)cancelsout.SeeJaynes’s1insightful remarksaboutsuchbinarycomparisons,andalsoGood’s2.

ThetermP(D|HhI)ineq.(1)iscalledthelikelihoodofthehypothesis giventhedata3.Itslogarithmissurprisinglycalledlog-likelihood:

logP(D|HhI), (3)

wherethelogarithmcanbetakeninanarbitrarybasis(Turing,Good4, Jaynes5recommendbase101/10,leadingtoameasurementindecibels;

seethecitedworksforthepracticaladvantagesofsuchchoice).

Theratioofthelikelihoodsoftwohypotheses,calledrelativeBayes factor,or itslogarithm, the relativeweight ofevidence,6 are oftenused to quantify howmuch the datafavourour belief in one versus the otherhypothesis(thatis,assumingatleastmomentarilythattheybe exhaustive).‘Itishistoricallyinterestingthattheexpression“weightof evidence”,initstechnicalsense,anticipatedtheterm“likelihood”by overfortyyears’7.

Recentliterature8seemstoexclusivelydealwithrelativeBayesfactors.I’dliketorecall,lest itfadesfromthememory,thedefinitionofthenon-relativeBayesfactorforahypothesis HhprovidedbydataD:9

P(D|HhI) P(D|¬HhI)

O(Hh|DI) O(Hh|I)

P(D|HhI) [1P(Hh|I)]

Ph,h

h P(D|HhI)P(Hh|I), (4) wheretheoddsOisdefinedasO:P/(1P).Lookingattheexpressionontheright,which canbederivedfromtheprobabilityrules,it’sclearthattheBayesfactorforahypothesis involvesthelikelihoodsofallotherhypothesesaswellastheirpre-dataprobabilities.This quantityanditslogarithm,the(non-relative)weightofevidence,haveimportantproperties whichrelativeBayesfactorsandrelativeweightsofevidencedon’tenjoy.Forexample,the

1Jaynes2003§§4.3–4.4. 2Good1950§6.3–6.6. 3Good1950§6.1p.62. 4e.g.Good 1985,1950,1969. 5Jaynes2003§4.2. 6Good1950ch.6,1975,1981,1985,andmany otherworksinGood1983;Osteyeeetal.1974§1.4,MacKay1992,Kassetal.1995;seealso Jeffreys1983chsV,VI,A. 7Osteyeeetal.1974§1.4.2p.12. 8forexampleKassetal.1995.

9Good1981§2.

2

(4)

expectedweightofevidenceforacorrecthypothesisisalwayspositive,andforawrong hypothesesalwaysnegative10.SeeJaynes11forfurtherdiscussionandanumericexample.

Theliteratureinprobabilityandstatisticshasalsoemployedand debatedotherad-hocmeasurestoquantifyhowthedatarelatetothe hypotheses–oreventoselectonehypothesisforfurtheruse,discarding theothers12.HereIconsideronemeasureinparticular:theleave-one-out cross-validationlog-score12,whichI’lljustcall‘log-score’forbrevity:

1 d

Xd i1

logP(Di|DiHhI) (5) whereeveryDiisonedatuminthedataD≡Vd

i1Di,andDidenotes thedatawithdatumDi excluded.Theintuitionbehindthisscorecan becolloquiallyexpressedthus:‘let’sseewhatmybeliefinonedatum wouldbe,onaverage,onceI’veobservedtheotherdata,ifIconsider Hhastrue’.‘Onaverage’meansconsideringsuchbeliefforeverysingle datumin turn,andthentakingthe geometricmean ofthe resulting beliefs.Othervariantsofthisscoreusemoregeneralpartitionsofthe dataintotwodisjointsubsets12.

Ifyoufindthisyoucanclaimapostcardfromme.

Mypurposeistoshowanexactrelationbetweenthelog-likelihood(3) andtheleave-one-outcross-validationlog-score(5).Thisrelationdoesn’t seemtoappearintheliterature,andIfinditveryintriguingbecause itportraysthelog-likelihoodasasortoffull-scaleuseofthelog-score:

itsaysthatthelog-likelihoodisthesumofallaveragedlog-scoresthatcanbe formedfromalldatasubsets.Therelationcanbeextendedtomoregeneral cross-validationlog-scores,anditcanbeofinterestforthedebateabout thesoundnessoflog-scoresindecidingamonghypotheses.

2 Arelationbetweenlog-likelihoodandlog-score

Wecanobviouslywritethelikelihoodasthedthrootofitsdthpower:

P(D|HI)

P(D|HI)×···×P(D|HI)

dtimes

1/d

(6)

10Good1950§6.7. 11Jaynes2003§§4.3–4.4. 12Bernardoetal.2000§§3.4,6.1.6givesthe clearestmotivationandexplanation,seealsoStone1977,Geisseretal.1979,Vehtarietal.

2012,2002,Krnjajićetal.2011,2014,Gelmanetal.2014,Gronauetal.2019,Chandramouli etal.2019.

(5)

PortaMana log-likelihoodandcross-validation

wherewehavedroppedthesubscripth forsimplicity.Bytherulesof probabilitywehave

P(D|HI)P(Di|DiHhI)×P(Di|HhI) (7) nomatterwhichspecifici ∈ {1,...,d}wechoose(temporalordering andsimilarmattersarecompletelyirrelevantintheformulaabove:it’s alogicalrelationbetweenpropositions).Solet’sexpandeachofthed factorsintheidentity(6)usingtheproductrule(7),usingadifferenti foreachofthem.Theresultcanbethusdisplayed:

P(D|HI)

P(D1|D1HI)×P(D1|HI)× P(D2|D2HI)×P(D2|HI)×

... ×

P(Dd|Dd

x

thiscolumnleadstothelog-score

HI)×P(Dd|HI)1/d .

(8)

Upontakingthelogarithmofthisexpression,the d factorsvertically alignedon theleft addup tothe log-score(5),as indicated.But the mathematicalreshaping wejustdidforP(D| HI) –thatis,theroot- productidentity(6)andthe expansion(8)–canbedoneforeachof theremainingfactorsP(Di|HI)verticallyalignedontherightinthe expressionabove;andsoonrecursively.Hereisanexplicitexamplefor d3:

P(D|HI)

(P(D1|D2D3HI)×

P(D2|D3HI)×P(D3|HI)× P(D3|D2HI)×P(D2|HI)1/2× P(D2|D1D3HI)×

P(D1|D3HI)×P(D3|HI)× P(D3|D1HI)×P(D1|HI)1/2

×

P(D3|D1D2HI)×

P(D1|D2HI)×P(D2|HI)× P(D2|D1HI)×P(D1|HI)1/2)1/3

. (9)

Inthisexamplethelogarithmofthethreeverticallyalignedfactorsin theleftcolumnis,asalreadynoted,thelog-score(5).Thelogarithmof

4

(6)

thesixverticallyalignedfactorsinthecentralcolumnisanaverageof thelog-scorescalculatedforthethreedistinctsubsetsofpairsofdata {D1D2},{D1D3},{D2D3}.Likewise,thelogarithmofthesixfactors verticallyalignedontherightistheaverageofthelog-scoresforthe threesubsetsofdatasingletons{D1},{D2},{D3}.

Inthegeneralcasewithddatathereare dk

subsetswithkdatapoints.

Wethereforeobtain

logP(D|HI) 1 d

Xd i1

logP(Di|DiHI)+

1 d

X

i{1,...,d}

1 d−1

j,i

X

j{1,...,d}

logP(Di,j|Di,jHI)+

d d−2

!1 Xi<j i,j{1,...,d}

1 d−2

kX,i,j

k{1,...,d}

logP(Di,j,k|Di,j,kHI)+

···+

d 2

!1 Xi<j i,j{1,...,d}

1 2

logP(Di|DjHI)+logP(Dj|DiHI) +

1 d

Xd i1

logP(Di|HI), (10) whichcanbecompactlywritten

logP(D|HI) Xd k1

d k

!1X

ordered k-tuples

1 k

X

cyclic permutations

logP(Di1|Di2···DikHI). (11)

Thatis,thelog-likelihoodisthesumofallaveragedlog-scoresthatcanbeformed fromall(non-empty)datasubsetswithkelements,theaverageforlog-scores overkdatabeingtakenoverthe dk

subsetshavingthesamecardinality k.

There’s also an equivalent form with a slightly different cross- validatinginterpretation:WetakeeachdatumDjinturnandcalculate ourlog-beliefinitconditionalonallpossiblesubsetsofremainingdata,

(7)

PortaMana log-likelihoodandcross-validation

fromtheemptysubsetwithnodata(termk0),totheonlysubsetDj

withalldataexceptDj(termkd−1).Theselog-beliefsareaveraged overthe dk1

subsetshavingthesamecardinalityk.Theresultcanbe expressedas

logP(D|HI) 1 d

Xd j1

d1

X

k0

d−1 k

!1X

ordered k-tuples, jexcluded

logP(Dj|Di1···DikHI). (12)

3 Briefdiscussion

It’sremarkablethattheindividuallog-scoresinexpressions(11)and (12)abovearecomputationallyexpensive,buttheirsumresultsinthe log-likelihood,whichislessexpensive.

Therelation(11)invitesustoseethelog-likelihoodasarefinement andimprovement of the log-score. Thelog-likelihoodtakes into ac- countnotonlythelog-scoreforthewholedata,butalsothelog-scores forallpossiblesubsetsofdata.Figurativelyspeakingitexaminesthe relationshipbetweendataandhypothesislocally,globally,andonall intermediatescales.Tomethispropertymakesthelog-likelihoodprefer- abletoanysinglelog-score(besidesthefactthatthelog-likelihoodis directlyobtainedfromthe principlesofthe probabilitycalculus),be- causeourinterestisusuallyinhowthehypothesisHrelatestosingle datapointsaswellastoanycollectionofthem.Ihopetodiscussthis point,whichalsoinvolvesthedistinctionbetweensimpleandcomposite hypotheses13,moreindetailelsewhere14.

Byapplyingtheidentity(6)andgeneralizingtheexpansion(7)to differentdivisionsofthedata–leave-two-out,leave-three-out,andso on–weseethattherelation(11)canbegeneralizedtoanyk-foldcross- validationlog-scores.Thusthelog-likelihoodisalsoequivalenttoan averageofallconceivablecross-validationlog-scoresforallsubsets of data,thoughIhaven’tcalculatedtheweightsofsuchaverage.

Thanks

...totheKavliFoundationandtheCentreofExcellenceschemeofthe ResearchCouncilofNorway(YasserRoudigroup)forfinancialsupport.

13Bernardoetal.2000§6.1.4. 14PortaMana2019.

6

(8)

ToAkiVehtariforsomereferences.TothestaffoftheNTNUlibraryfor theiralwaysprompt assistance.ToMari,Miri,Emmaforcontinuous encouragementandaffection,andto BusterKeatonandSaitamafor fillinglifewithaweandinspiration.Tothedevelopersandmaintainers ofOpenScienceFramework,LATEX,Emacs,AUCTEX,Python,Inkscape, Sci-Hubformakingafreeandimpartialscientificexchangepossible.

Bibliography

Bernardo,J.-M.,DeGroot,M.H.,Lindley,D.V.,Smith,A.F.M.,eds.(1985): Bayesian Statistics2.(ElsevierandValenciaUniversityPress,AmsterdamandValencia).https ://www.uv.es/~bernardo/valenciam.html.

Bernardo,J.-M.,Smith,A.F.(2000):BayesianTheory,reprint.(Wiley,NewYork).Firstpubl.

1994.

Chandramouli,S.H.,Shiffrin,R.M.,Vehtari,A.,Simpson,D.P.,Yao,Y.,Gelman,A.,Navarro, D.J.,Gronau,Q.F.,etal.(2019):CommentaryonGronauandWagenmakers.Limitationsof

“LimitationsofBayesianleave-one-outcross-validationformodelselection”.Betweenthedevil andthedeepbluesea:tensionsbetweenscientificjudgementandstatisticalmodelselection.

Rejoinder:morelimitationsofBayesianleave-one-outcross-validation.Comput.BrainBehav.

21,12–47.SeeGronau,Wagenmakers(2019).

Geisser,S.,Eddy,W.F.(1979):Apredictiveapproachtomodelselection.J.Am.Stat.Assoc.

74365,153–160.

Gelman,A.,Hwang,J.,Vehtari,A.(2014):Understandingpredictiveinformationcriteriafor Bayesianmodels.Stat.Comput.246,997–1016.

Good,I.J.(1950):ProbabilityandtheWeighingofEvidence.(Griffin,London).

(1969):AsubjectiveevaluationofBode’slawandan‘objective’testforapproximatenumerical rationality.J.Am.Stat.Assoc.64325,23–49.Partlyrepr.inGood(1983)ch.13.

(1975): Explicativity,corroboration,andtherelativeoddsofhypotheses. Synthese301–2, 39–73.Partlyrepr.inGood(1983)ch.15.

(1981):Somelogicandhistoryofhypothesistesting. In:Philosophyineconomics.Ed.by J.C.Pitt(Reidel),149–174.Repr.inGood(1983)ch.14pp.129–148.

(1983):GoodThinking:TheFoundationsofProbabilityandItsApplications.(Universityof MinnesotaPress,Minneapolis,USA).

(1985):Weightofevidence:abriefsurvey.In:Bernardo,DeGroot,Lindley,Smith(1985), 249–270.WithdiscussionbyH.Rubin,T.Seidenfeld,andreply.

Gronau,Q.F.,Wagenmakers,E.-J.(2019):LimitationsofBayesianleave-one-outcross-validation formodelselection.Comput.BrainBehav.21,1–11.Seealsocommentsandrejoinder inChandramouli,Shiffrin,Vehtari,Simpson,Yao,Gelman,Navarro,Gronau,etal.

(2019).

Jaynes,E.T.(2003):ProbabilityTheory:TheLogicofScience.(CambridgeUniversityPress, Cambridge).Ed.byG.LarryBretthorst.Firstpubl.1994.https://archive.org/detai ls/XQUHIUXHIQUHIQXUIHX2,http://www-biba.inrialpes.fr/Jaynes/prob.html. Jeffreys,H.(1983):TheoryofProbability,thirded.withcorrections. (OxfordUniversity

Press,London).Firstpubl.1939.

(9)

PortaMana log-likelihoodandcross-validation

Kass,R.E.,Raftery,A.E.(1995):Bayesfactors.J.Am.Stat.Assoc.90430,773–795.https://w ww.stat.washington.edu/raftery/Research/PDF/kass1995.pdf;https://www.andr ew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf.

Krnjajić,M.,Draper,D.(2011):Bayesianmodelspecification:someproblemsrelatedtomodel choiceandcalibration.http://hdl.handle.net/10379/3804.

(2014):Bayesianmodelcomparison:logscoresandDIC.Stat.Probab.Lett.88,9–14.

MacKay,D.J.C.(1992):Bayesianinterpolation.NeuralComp.43,415–447.http://www.inf erence.phy.cam.ac.uk/mackay/PhD.html.

Osteyee,D.B.,Good,I.J.(1974): Information,WeightofEvidence,theSingularitybetween ProbabilityMeasuresandSignalDetection.(Springer,Berlin).

PortaMana,P.G.L.(2019):Probabilisticmodels:modelsofwhat?Inpreparation.

Stone,M.(1977):Anasymptoticequivalenceofchoiceofmodelbycross-validationandAkaike’s criterion.J.Roy.Stat.Soc.B391,44–47.

Vehtari,A.,Lampinen,J.(2002): Bayesianmodelassessmentandcomparisonusingcross- validationpredictivedensities.NeuralComp.1410,2439–2468.

Vehtari,A.,Ojanen,J.(2012):AsurveyofBayesianpredictivemethodsformodelassessment, selectionandcomparison.Statist.Surv.6,142–228.

8

Références

Documents relatifs

2 Venn diagrams of differentially expressed genes (DEGs) identified in pairwise comparisons between heifers (H) and post-partum lactating (L) and non-lactating (NL) cows in the

In the case when σ = enf (τ ), the equivalence classes at type σ will not be larger than their original classes at τ , since the main effect of the reduction to exp-log normal form

Furthermore, the relativiser effect interacted with NP1 animacy: for lequel-RCs, there was a strong NP2 bias regardless of NP1 animacy, but qui-RCs attached more often to animate

In this paper we present how such data can be transformed into relational data, introduce a novel approach for anonymization of IPv4 address in our dataset using several

Hence in the log-concave situation the (1, + ∞ ) Poincar´e inequality implies Cheeger’s in- equality, more precisely implies a control on the Cheeger’s constant (that any

One of the most important open problems in the theory of minimal models for higher-dimensional algebraic varieties is the finite generation of log canonical rings for lc

UNIVERSITY OF NATURAL SCIENCES HO CHI MINH City http://www.hcmuns.edu.vn/?. Introduction to

Based on the ex- periments we have performed, an important factor is that the approximation of the target model by the n-gram used for sampling should be sufficiently good,