Statistical learning
Chapter 20, Sections 1–3
Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space

H is the hypothesis variable, values h1, h2, ..., prior P(H)
The jth observation dj gives the outcome of random variable Dj; training data d = d1, ..., dN

Given the data so far, each hypothesis has a posterior probability:

P(hi | d) = α P(d | hi) P(hi)

where P(d | hi) is called the likelihood
Predictions use a likelihood-weighted average over the hypotheses:

P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)

No need to pick one best-guess hypothesis!
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag.

What kind of bag is it? What flavour will the next candy be?
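A minimal sketch of the updating and prediction formulas from the previous slide, applied to this example; feeding it a run of lime candies is an illustrative assumption (it mirrors the curves on the next two slides):

```python
# Bayesian updating for the candy-bag example: the priors P(h_i) and the
# per-hypothesis lime probabilities come from the slide above.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def update(posterior, obs):
    """One step of P(h_i | d) = alpha * P(d_j | h_i) * P(h_i | d so far)."""
    unnorm = [(p if obs == "lime" else 1.0 - p) * post
              for p, post in zip(p_lime, posterior)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(posterior):
    """P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * post for p, post in zip(p_lime, posterior))

posterior = priors
for _ in range(10):                   # observe ten lime candies in a row
    posterior = update(posterior, "lime")
    print(predict_lime(posterior), posterior)
```

Because the observations are i.i.d. given each hypothesis, updating one candy at a time gives the same posterior as conditioning on the whole data set at once.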
Posterior probability of hypotheses
[Figure: posterior probabilities P(h1 | d), ..., P(h5 | d) vs. number of samples in d]
Prediction probability
[Figure: P(next candy is lime | d) vs. number of samples in d]
MAP approximation
Summing over the hypothesis space is often intractable (e.g., there are 2^64 = 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi | d)
I.e., maximize P(d | hi) P(hi) or log P(d | hi) + log P(hi)

Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P(d | hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
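A minimal sketch of MAP selection on the candy hypotheses above (with ML as the uniform-prior special case, as on the next slide); the all-lime data is an illustrative assumption:

```python
import math

# Candy-bag hypotheses: (prior P(h), P(lime | h)) -- from the example slide.
hypotheses = {
    "h1": (0.10, 0.00),
    "h2": (0.20, 0.25),
    "h3": (0.40, 0.50),
    "h4": (0.20, 0.75),
    "h5": (0.10, 1.00),
}

def log_likelihood(p_lime, data):
    """log P(d | h) for i.i.d. lime/cherry draws."""
    ll = 0.0
    for d in data:
        p = p_lime if d == "lime" else 1.0 - p_lime
        if p == 0.0:
            return -math.inf  # hypothesis is inconsistent with this observation
        ll += math.log(p)
    return ll

data = ["lime"] * 5  # five lime candies observed (illustrative)

# MAP: maximize log P(d | h) + log P(h); ML: drop the prior term.
h_map = max(hypotheses, key=lambda h: log_likelihood(hypotheses[h][1], data)
            + math.log(hypotheses[h][0]))
h_ml = max(hypotheses, key=lambda h: log_likelihood(hypotheses[h][1], data))
print(h_map, h_ml)  # after 5 limes, both pick h5
```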
ML approximation
For large data sets, prior becomes irrelevant

Maximum likelihood (ML) learning: choose hML maximizing P(d | hi)

I.e., simply get the best fit to the data; identical to MAP for uniform prior (which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method
ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry candies?

[Bayes net: a single node Flavor, with P(F = cherry) = θ]

Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models

Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so
P(d | hθ) = ∏_{j=1}^{N} P(dj | hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(d | hθ) = log P(d | hθ) = Σ_{j=1}^{N} log P(dj | hθ) = c log θ + ℓ log(1 − θ)

dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
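A small numeric check of the closed-form estimate θ = c/N; the counts are made up for illustration, and the smoothing remark in the final comment is a standard fix, not from the slides:

```python
import math

def ml_theta(c, ell):
    """ML estimate for the cherry fraction: theta = c / (c + ell)."""
    return c / (c + ell)

def log_likelihood(theta, c, ell):
    """L(d | h_theta) = c log(theta) + ell log(1 - theta)."""
    return c * math.log(theta) + ell * math.log(1.0 - theta)

c, ell = 7, 3           # 7 cherries, 3 limes out of N = 10 (illustrative)
theta = ml_theta(c, ell)
print(theta)            # 0.7

# The closed form really is the maximizer: nearby values score lower.
assert all(log_likelihood(theta, c, ell) >= log_likelihood(t, c, ell)
           for t in (0.5, 0.65, 0.75, 0.9))

# 0-count problem: c = 0 gives theta = 0, so a cherry would get probability 0.
# A common fix (not from the slides) is Laplace smoothing: (c + 1) / (N + 2).
```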
Multiple parameters
Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) given by θ1 for F = cherry, θ2 for F = lime]

Likelihood for, e.g., cherry candy in green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)

N candies, r_c red-wrapped cherry candies, etc.:

P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^{r_c} (1 − θ1)^{g_c} · θ2^{r_ℓ} (1 − θ2)^{g_ℓ}

L = [c log θ + ℓ log(1 − θ)] + [r_c log θ1 + g_c log(1 − θ1)] + [r_ℓ log θ2 + g_ℓ log(1 − θ2)]
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ)

∂L/∂θ1 = r_c/θ1 − g_c/(1 − θ1) = 0 ⇒ θ1 = r_c/(r_c + g_c)

∂L/∂θ2 = r_ℓ/θ2 − g_ℓ/(1 − θ2) = 0 ⇒ θ2 = r_ℓ/(r_ℓ + g_ℓ)
With complete data, parameters can be learned separately
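Because the three maximizations decouple, each ML estimate is just a ratio of the relevant counts; a minimal sketch with made-up counts:

```python
def ml_params(c, ell, rc, gc, rl, gl):
    """With complete data, each parameter is estimated independently
    from its own counts."""
    theta  = c  / (c + ell)   # P(F = cherry)
    theta1 = rc / (rc + gc)   # P(W = red | F = cherry)
    theta2 = rl / (rl + gl)   # P(W = red | F = lime)
    return theta, theta1, theta2

# Illustrative counts: 6 cherries (4 red-, 2 green-wrapped),
# 4 limes (1 red-, 3 green-wrapped).
print(ml_params(6, 4, 4, 2, 1, 3))  # (0.6, 0.666..., 0.25)
```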
Example: linear Gaussian model
[Figure: the conditional density P(y | x) for the linear Gaussian model, and data points (x, y) with a fitted straight line]
Maximizing P(y | x) = (1/(√(2π) σ)) e^{−(y − (θ1x + θ2))² / (2σ²)} w.r.t. θ1, θ2

= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear fit assuming Gaussian noise of fixed variance
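A minimal sketch checking this equivalence on synthetic data; the true parameters and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = theta1*x + theta2 + Gaussian noise of fixed variance.
theta1_true, theta2_true, sigma = 2.0, 0.5, 0.1
x = rng.uniform(0.0, 1.0, size=100)
y = theta1_true * x + theta2_true + rng.normal(0.0, sigma, size=100)

# ML fit under this noise model = ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])      # columns for theta1, theta2
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta1, theta2)                          # close to 2.0 and 0.5
```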
Summary
Full Bayesian learning gives best possible predictions but is intractable

MAP learning balances complexity with accuracy on training data

Maximum likelihood assumes uniform prior, OK for large data sets
1. Choose a parameterized family of models to describe the data
   requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   may be hard/impossible; modern optimization techniques help (see the sketch after this list)
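A minimal sketch of the four steps on the binomial candy model, solving step 4 numerically by bisection on the derivative (a stand-in for the optimization techniques mentioned above); the counts are illustrative:

```python
# Step 1 (model family): candy flavors are i.i.d. Bernoulli(theta).
# Step 2 (likelihood):   L(theta) = c*log(theta) + ell*log(1 - theta).
# Step 3 (derivative):   dL/dtheta = c/theta - ell/(1 - theta).
# Step 4 (solve dL/dtheta = 0), here numerically.
c, ell = 7, 3  # illustrative counts

def dL(theta):
    return c / theta - ell / (1.0 - theta)

lo, hi = 1e-9, 1.0 - 1e-9
for _ in range(60):               # bisection: dL is decreasing on (0, 1)
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if dL(mid) > 0 else (lo, mid)

print((lo + hi) / 2.0)            # ~0.7 = c / (c + ell), the closed form
```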