Statistical learning
Chapter 20, Sections 1–3
Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
– ML parameter learning with complete data
– linear regression
Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space

H is the hypothesis variable, values h1, h2, ..., prior P(H)
The jth observation dj gives the outcome of random variable Dj; training data d = d1, ..., dN

Given the data so far, each hypothesis has a posterior probability:

P(hi | d) = α P(d | hi) P(hi)

where P(d | hi) is called the likelihood
Predictions use a likelihood-weighted average over the hypotheses:

P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)

No need to pick one best-guess hypothesis!
Example
Suppose there are five kinds of bags of candies:
10% are h1: 100% cherry candies
20% are h2: 75% cherry candies + 25% lime candies
40% are h3: 50% cherry candies + 50% lime candies
20% are h4: 25% cherry candies + 75% lime candies
10% are h5: 100% lime candies

Then we observe candies drawn from some bag.

What kind of bag is it? What flavour will the next candy be?
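A minimal sketch of the updating and prediction formulas from the previous slide, applied to this example; feeding it a run of lime candies is an illustrative assumption (it mirrors the curves on the next two slides):

```python
# Bayesian updating for the candy-bag example: the priors P(h_i) and the
# per-hypothesis lime probabilities come from the slide above.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def update(posterior, obs):
    """One step of P(h_i | d) = alpha * P(d_j | h_i) * P(h_i | d so far)."""
    unnorm = [(p if obs == "lime" else 1.0 - p) * post
              for p, post in zip(p_lime, posterior)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(posterior):
    """P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(p * post for p, post in zip(p_lime, posterior))

posterior = priors
for _ in range(10):                   # observe ten lime candies in a row
    posterior = update(posterior, "lime")
    print(predict_lime(posterior), posterior)
```

Because the observations are i.i.d. given each hypothesis, updating one candy at a time gives the same posterior as conditioning on the whole data set at once.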
Posterior probability of hypotheses
[Figure: posterior probabilities P(h1 | d), ..., P(h5 | d) vs. number of samples in d]
Prediction probability
[Figure: P(next candy is lime | d) vs. number of samples in d]
MAP approximation
Summing over the hypothesis space is often intractable (e.g., there are 2^64 = 18,446,744,073,709,551,616 Boolean functions of 6 attributes)

Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi | d)
I.e., maximize P(d | hi) P(hi) or log P(d | hi) + log P(hi)

Log terms can be viewed as (negative of) bits to encode data given hypothesis + bits to encode hypothesis
This is the basic idea of minimum description length (MDL) learning

For deterministic hypotheses, P(d | hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)
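A minimal sketch of MAP selection on the candy hypotheses above (with ML as the uniform-prior special case, as on the next slide); the all-lime data is an illustrative assumption:

```python
import math

# Candy-bag hypotheses: (prior P(h), P(lime | h)) -- from the example slide.
hypotheses = {
    "h1": (0.10, 0.00),
    "h2": (0.20, 0.25),
    "h3": (0.40, 0.50),
    "h4": (0.20, 0.75),
    "h5": (0.10, 1.00),
}

def log_likelihood(p_lime, data):
    """log P(d | h) for i.i.d. lime/cherry draws."""
    ll = 0.0
    for d in data:
        p = p_lime if d == "lime" else 1.0 - p_lime
        if p == 0.0:
            return -math.inf  # hypothesis is inconsistent with this observation
        ll += math.log(p)
    return ll

data = ["lime"] * 5  # five lime candies observed (illustrative)

# MAP: maximize log P(d | h) + log P(h); ML: drop the prior term.
h_map = max(hypotheses, key=lambda h: log_likelihood(hypotheses[h][1], data)
            + math.log(hypotheses[h][0]))
h_ml = max(hypotheses, key=lambda h: log_likelihood(hypotheses[h][1], data))
print(h_map, h_ml)  # after 5 limes, both pick h5
```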
ML approximation
For large data sets, prior becomes irrelevant

Maximum likelihood (ML) learning: choose hML maximizing P(d | hi)

I.e., simply get the best fit to the data; identical to MAP for uniform prior (which is reasonable if all hypotheses are of the same complexity)

ML is the “standard” (non-Bayesian) statistical learning method
ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry candies?

[Bayes net: a single node Flavor, with P(F = cherry) = θ]

Any θ is possible: continuum of hypotheses hθ
θ is a parameter for this simple (binomial) family of models

Suppose we unwrap N candies, c cherries and ℓ = N − c limes
These are i.i.d. (independent, identically distributed) observations, so
P(d | hθ) = ∏_{j=1}^{N} P(dj | hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(d | hθ) = log P(d | hθ) = Σ_{j=1}^{N} log P(dj | hθ) = c log θ + ℓ log(1 − θ)

dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
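A small numeric check of the closed-form estimate θ = c/N; the counts are made up for illustration, and the smoothing remark in the final comment is a standard fix, not from the slides:

```python
import math

def ml_theta(c, ell):
    """ML estimate for the cherry fraction: theta = c / (c + ell)."""
    return c / (c + ell)

def log_likelihood(theta, c, ell):
    """L(d | h_theta) = c log(theta) + ell log(1 - theta)."""
    return c * math.log(theta) + ell * math.log(1.0 - theta)

c, ell = 7, 3           # 7 cherries, 3 limes out of N = 10 (illustrative)
theta = ml_theta(c, ell)
print(theta)            # 0.7

# The closed form really is the maximizer: nearby values score lower.
assert all(log_likelihood(theta, c, ell) >= log_likelihood(t, c, ell)
           for t in (0.5, 0.65, 0.75, 0.9))

# 0-count problem: c = 0 gives theta = 0, so a cherry would get probability 0.
# A common fix (not from the slides) is Laplace smoothing: (c + 1) / (N + 2).
```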
Multiple parameters
Red/green wrapper depends probabilistically on flavor:

[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) given by θ1 for F = cherry, θ2 for F = lime]

Likelihood for, e.g., cherry candy in green wrapper:

P(F = cherry, W = green | hθ,θ1,θ2)
  = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
  = θ · (1 − θ1)

N candies, r_c red-wrapped cherry candies, etc.:

P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^{r_c} (1 − θ1)^{g_c} · θ2^{r_ℓ} (1 − θ2)^{g_ℓ}

L = [c log θ + ℓ log(1 − θ)] + [r_c log θ1 + g_c log(1 − θ1)] + [r_ℓ log θ2 + g_ℓ log(1 − θ2)]
Multiple parameters contd.
Derivatives of L contain only the relevant parameter:

∂L/∂θ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ)

∂L/∂θ1 = r_c/θ1 − g_c/(1 − θ1) = 0 ⇒ θ1 = r_c/(r_c + g_c)

∂L/∂θ2 = r_ℓ/θ2 − g_ℓ/(1 − θ2) = 0 ⇒ θ2 = r_ℓ/(r_ℓ + g_ℓ)
With complete data, parameters can be learned separately
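Because the three maximizations decouple, each ML estimate is just a ratio of the relevant counts; a minimal sketch with made-up counts:

```python
def ml_params(c, ell, rc, gc, rl, gl):
    """With complete data, each parameter is estimated independently
    from its own counts."""
    theta  = c  / (c + ell)   # P(F = cherry)
    theta1 = rc / (rc + gc)   # P(W = red | F = cherry)
    theta2 = rl / (rl + gl)   # P(W = red | F = lime)
    return theta, theta1, theta2

# Illustrative counts: 6 cherries (4 red-, 2 green-wrapped),
# 4 limes (1 red-, 3 green-wrapped).
print(ml_params(6, 4, 4, 2, 1, 3))  # (0.6, 0.666..., 0.25)
```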
Example: linear Gaussian model
[Figure: the conditional density P(y | x) for the linear Gaussian model, and data points (x, y) with a fitted straight line]
Maximizing P(y | x) = (1/(√(2π) σ)) e^{−(y − (θ1x + θ2))² / (2σ²)} w.r.t. θ1, θ2

= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear fit assuming Gaussian noise of fixed variance
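A minimal sketch checking this equivalence on synthetic data; the true parameters and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = theta1*x + theta2 + Gaussian noise of fixed variance.
theta1_true, theta2_true, sigma = 2.0, 0.5, 0.1
x = rng.uniform(0.0, 1.0, size=100)
y = theta1_true * x + theta2_true + rng.normal(0.0, sigma, size=100)

# ML fit under this noise model = ordinary least squares.
A = np.column_stack([x, np.ones_like(x)])      # columns for theta1, theta2
(theta1, theta2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta1, theta2)                          # close to 2.0 and 0.5
```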
Summary
Full Bayesian learning gives best possible predictions but is intractable

MAP learning balances complexity with accuracy on training data

Maximum likelihood assumes uniform prior, OK for large data sets
1. Choose a parameterized family of models to describe the data
   requires substantial insight and sometimes new models
2. Write down the likelihood of the data as a function of the parameters
   may require summing over hidden variables, i.e., inference
3. Write down the derivative of the log likelihood w.r.t. each parameter
4. Find the parameter values such that the derivatives are zero
   may be hard/impossible; modern optimization techniques help (see the sketch after this list)
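A minimal sketch of the four steps on the binomial candy model, solving step 4 numerically by bisection on the derivative (a stand-in for the optimization techniques mentioned above); the counts are illustrative:

```python
# Step 1 (model family): candy flavors are i.i.d. Bernoulli(theta).
# Step 2 (likelihood):   L(theta) = c*log(theta) + ell*log(1 - theta).
# Step 3 (derivative):   dL/dtheta = c/theta - ell/(1 - theta).
# Step 4 (solve dL/dtheta = 0), here numerically.
c, ell = 7, 3  # illustrative counts

def dL(theta):
    return c / theta - ell / (1.0 - theta)

lo, hi = 1e-9, 1.0 - 1e-9
for _ in range(60):               # bisection: dL is decreasing on (0, 1)
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if dL(mid) > 0 else (lo, mid)

print((lo + hi) / 2.0)            # ~0.7 = c / (c + ell), the closed form
```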