
Statistical learning

Chapter 20, Sections 1–3


Outline

♦ Bayesian learning

♦ Maximum a posteriori and maximum likelihood learning

♦ Bayes net learning
  – ML parameter learning with complete data
  – linear regression


Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space

H is the hypothesis variable, values h_1, h_2, …, prior P(H)
The jth observation d_j gives the outcome of random variable D_j
Training data d = d_1, …, d_N

Given the data so far, each hypothesis has a posterior probability:

P(h_i | d) = α P(d | h_i) P(h_i)

where P(d | h_i) is called the likelihood

Predictions use a likelihood-weighted average over the hypotheses:

P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)

No need to pick one best-guess hypothesis!


Example

Suppose there are five kinds of bags of candies:
10% are h_1: 100% cherry candies
20% are h_2: 75% cherry candies + 25% lime candies
40% are h_3: 50% cherry candies + 50% lime candies
20% are h_4: 25% cherry candies + 75% lime candies
10% are h_5: 100% lime candies

Then we observe candies drawn from some bag:

What kind of bag is it? What flavour will the next candy be?
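As a concrete illustration, here is a minimal Python sketch of the updating and prediction rules above, applied to the five bag hypotheses. The observation sequence (ten limes in a row) is an assumption, chosen to match the plots that follow.

    # Full Bayesian learning on the candy-bag example.
    priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_1), ..., P(h_5)
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for each bag type

    def posterior(observations, priors, p_lime):
        """P(h_i | d) = alpha P(d | h_i) P(h_i) for i.i.d. lime/cherry draws."""
        post = list(priors)
        for obs in observations:
            for i, pl in enumerate(p_lime):
                post[i] *= pl if obs == "lime" else 1.0 - pl
        alpha = 1.0 / sum(post)            # normalizing constant
        return [alpha * p for p in post]

    def predict_lime(post, p_lime):
        """P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
        return sum(pl * p for pl, p in zip(p_lime, post))

    d = ["lime"] * 10                      # assumed data: ten lime candies
    post = posterior(d, priors, p_lime)
    print(post)                            # P(h_5 | d) comes to dominate
    print(predict_lime(post, p_lime))      # approaches 1 as limes accumulate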


Posterior probability of hypotheses

[Figure: posterior probabilities P(h_1 | d), …, P(h_5 | d) plotted against the number of samples in d, from 0 to 10.]


Prediction probability

[Figure: P(next candy is lime | d) plotted against the number of samples in d, from 0 to 10.]



MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 2^{2^6} = 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d)

I.e., maximize P(d | h_i) P(h_i) or log P(d | h_i) + log P(h_i)

Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis. This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
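A short sketch of MAP selection in log space on the same candy example (the data, ten limes, is again an assumption):

    import math

    priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h_i), as in the example
    p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i)

    def log_likelihood(observations, pl):
        """log P(d | h) for i.i.d. lime/cherry draws; -inf if inconsistent."""
        total = 0.0
        for obs in observations:
            p = pl if obs == "lime" else 1.0 - pl
            if p == 0.0:
                return -math.inf           # deterministic mismatch with data
            total += math.log(p)
        return total

    def map_index(observations):
        """Index of the hypothesis maximizing log P(d | h_i) + log P(h_i)."""
        scores = [log_likelihood(observations, pl) + math.log(pr)
                  for pl, pr in zip(p_lime, priors)]
        return max(range(len(scores)), key=lambda i: scores[i])

    print(map_index(["lime"] * 10))        # 4, i.e. hypothesis h_5

With a uniform prior the log P(h_i) term is constant, so the same maximization also yields the ML hypothesis discussed next.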


ML approximation

For large data sets, the prior becomes irrelevant

Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i)

I.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method


ML parameter learning in Bayes nets

Bag from a new manufacturer; fraction θ of cherry candies?

[Bayes net: a single node Flavor, with P(F = cherry) = θ]

Any θ is possible: continuum of hypotheses h_θ
θ is a parameter for this simple (binomial) family of models

Suppose we unwrap N candies, c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so

P(d | h_θ) = ∏_{j=1}^{N} P(d_j | h_θ) = θ^c (1 − θ)^ℓ

Maximize this w.r.t. θ; it is easier to work with the log-likelihood:

L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + ℓ log(1 − θ)

dL(d | h_θ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
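A tiny sketch of the resulting estimator; the add-one (Laplace) smoothing shown alongside is one standard remedy for the 0-count problem, not something taken from these slides:

    def theta_ml(c, N):
        """ML estimate for the binomial model: theta = c / N."""
        return c / N

    def theta_laplace(c, N):
        """Add-one smoothing, a common fix for zero counts (assumption)."""
        return (c + 1) / (N + 2)

    print(theta_ml(3, 10))        # 0.3
    print(theta_ml(0, 10))        # 0.0: declares cherries impossible
    print(theta_laplace(0, 10))   # ~0.083: keeps some probability mass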


Multiple parameters

Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ, P(W = red | F = cherry) = θ_1, and P(W = red | F = lime) = θ_2]

Likelihood for, e.g., a cherry candy in a green wrapper:

P(F = cherry, W = green | h_{θ,θ_1,θ_2})
  = P(F = cherry | h_{θ,θ_1,θ_2}) P(W = green | F = cherry, h_{θ,θ_1,θ_2})
  = θ (1 − θ_1)

N candies, r_c red-wrapped cherry candies, etc.:

P(d | h_{θ,θ_1,θ_2}) = θ^c (1 − θ)^ℓ · θ_1^{r_c} (1 − θ_1)^{g_c} · θ_2^{r_ℓ} (1 − θ_2)^{g_ℓ}

L = [c log θ + ℓ log(1 − θ)] + [r_c log θ_1 + g_c log(1 − θ_1)] + [r_ℓ log θ_2 + g_ℓ log(1 − θ_2)]


Multiple parameters contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ)

∂L/∂θ_1 = r_c/θ_1 − g_c/(1 − θ_1) = 0  ⇒  θ_1 = r_c/(r_c + g_c)

∂L/∂θ_2 = r_ℓ/θ_2 − g_ℓ/(1 − θ_2) = 0  ⇒  θ_2 = r_ℓ/(r_ℓ + g_ℓ)

With complete data, parameters can be learned separately
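A minimal sketch of this separability: each parameter is estimated from its own counts only. The data set is hypothetical, for illustration:

    from collections import Counter

    def ml_params(candies):
        """ML estimates from complete (flavor, wrapper) observations."""
        n = Counter(candies)
        c = n["cherry", "red"] + n["cherry", "green"]   # cherry count
        l = n["lime", "red"] + n["lime", "green"]       # lime count
        theta  = c / (c + l)                 # P(F = cherry)
        theta1 = n["cherry", "red"] / c      # P(W = red | F = cherry)
        theta2 = n["lime", "red"] / l        # P(W = red | F = lime)
        return theta, theta1, theta2

    data = ([("cherry", "red")] * 6 + [("cherry", "green")] * 2
            + [("lime", "red")] * 1 + [("lime", "green")] * 3)
    print(ml_params(data))                   # (8/12, 6/8, 1/4)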


Example: linear Gaussian model

[Figure: left, the conditional density P(y | x) for the linear Gaussian model; right, sample (x, y) points with the straight-line fit.]

Maximizing P(y | x) = 1/(√(2π) σ) · e^{−(y − (θ_1 x + θ_2))^2 / (2σ^2)} w.r.t. θ_1, θ_2

= minimizing E = Σ_{j=1}^{N} (y_j − (θ_1 x_j + θ_2))^2

That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance
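A self-contained sketch of that ML solution, using the closed-form least-squares formulas (the function name and test points are assumptions):

    def linear_fit(xs, ys):
        """ML parameters (theta_1, theta_2) of y = theta_1 x + theta_2
        under Gaussian noise of fixed variance: the least-squares fit."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        theta1 = sxy / sxx
        theta2 = my - theta1 * mx
        return theta1, theta2

    # Noiseless check: points on y = 2x + 1 recover (2.0, 1.0) exactly.
    print(linear_fit([0.0, 0.5, 1.0], [1.0, 2.0, 3.0]))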



Summary

Full Bayesian learning gives best possible predictions but is intractable

MAP learning balances complexity with accuracy on training data

Maximum likelihood assumes uniform prior, OK for large data sets

1. Choose a parameterized family of models to describe the data; requires substantial insight and sometimes new models

2. Write down the likelihood of the data as a function of the parameters; may require summing over hidden variables, i.e., inference

3. Write down the derivative of the log likelihood w.r.t. each parameter

4. Find the parameter values such that the derivatives are zero; may be hard/impossible, but modern optimization techniques help

