Model Assessment and Selection
We assume i.i.d. data:
independent: the samples are drawn independently,
identically distributed: all samples come from the same distribution.
Assume many i.i.d. datasets. In the figure, one line is the train/test curve for one set of samples (one dataset).
Training error is computed on the data used to fit the model:
$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i)).$$
Test error is the same average, $\frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$, computed on data not used to fit the model.
Test error is a point estimate of the generalization error.
The dark blue line is an average of test errors, a more robust estimate.
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Prediction error versus model complexity (df); low complexity gives high bias and low variance, high complexity gives low bias and high variance.]
FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error $\overline{\mathrm{err}}$, while the light red curves show the conditional test error $\mathrm{Err}_{\mathcal{T}}$ for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error $\mathrm{Err}$ and the expected training error $E[\overline{\mathrm{err}}]$.
Definition (Generalization error): The generalization error is the expected prediction error over an independent test sample,
$$\mathrm{Err} = E[L(Y, \hat f(X))],$$
where both $X$ and $Y$ are drawn randomly from their joint distribution (population).
Machine Learning Model Assessment and Selection 6 1 - 35 April 4, 2021 147 / 1
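Loss functions for regression
Definition
Squared error loss: $L(y, \hat y) = (y - \hat y)^2$
Absolute error loss: $L(y, \hat y) = |y - \hat y|$
Huber error loss:
$$L_\delta(y, \hat y) = \begin{cases} \frac{1}{2}(y - \hat y)^2 & \text{for } |y - \hat y| \le \delta, \\ \delta\,(|y - \hat y| - \frac{1}{2}\delta) & \text{otherwise.} \end{cases}$$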
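The three regression losses can be sketched in a few lines of Python. This is only an illustration: the function names and the default `delta=1.0` are assumptions of the sketch, not part of the slides.

```python
def squared_error(y, y_hat):
    """Squared error loss (y - y_hat)^2."""
    return (y - y_hat) ** 2


def absolute_error(y, y_hat):
    """Absolute error loss |y - y_hat|."""
    return abs(y - y_hat)


def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


# Compare the losses for a small, a boundary, and a large residual.
for residual in (0.5, 1.0, 3.0):
    print(residual,
          squared_error(residual, 0.0),
          absolute_error(residual, 0.0),
          huber(residual, 0.0, delta=1.0))
```

At the boundary `|y - y_hat| == delta` both branches of the Huber loss give the same value, so the loss is continuous there.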
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 10. Loss versus $y - f$ for squared error, absolute error, and Huber loss.]
FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when $|y - f|$ is large.
Classifier Assessment and Selection
The qualitative response $G$ takes one of $K$ values labeled $1, \dots, K$. Typically
$$\hat G(X) = \arg\max_k \hat p_k(X).$$
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 10. Loss versus margin $y \cdot f$ for misclassification, exponential, binomial deviance, squared error, and support vector losses.]
FIGURE 10.4. Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are misclassification: $I(\mathrm{sign}(f) \ne y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point $(0, 1)$.
Definition (Loss functions for classification)
0-1 loss (misclassification): $L(G, \hat G(X)) = I(G \ne \hat G(X))$
log-likelihood (cross-entropy, deviance): $L(G, \hat p(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat p_k(X)$
For two classes encoded as $\{-1, +1\}$:
exponential: $e^{-y f(x)}$
support vector: $\max(0, 1 - y f(x))$
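The classification losses can be written down directly. The sketch below is illustrative only: it assumes the predicted class probabilities are given as a dictionary `p_hat` mapping class label to probability, and all function names are hypothetical.

```python
import math


def zero_one_loss(g, g_hat):
    """0-1 loss: 1 for a misclassification, 0 otherwise."""
    return 0 if g == g_hat else 1


def deviance_loss(g, p_hat):
    """-2 * sum_k I(g == k) * log p_hat[k], which reduces to -2 * log p_hat[g]."""
    return -2.0 * math.log(p_hat[g])


def exponential_loss(y, f):
    """Exponential loss for y in {-1, +1} and a real-valued prediction f."""
    return math.exp(-y * f)


def hinge_loss(y, f):
    """Support vector (hinge) loss for y in {-1, +1}."""
    return max(0.0, 1.0 - y * f)


print(zero_one_loss(1, 2), deviance_loss(2, {1: 0.5, 2: 0.5}),
      exponential_loss(1, 0.0), hinge_loss(-1, 2.0))
```

Only the probability assigned to the true class enters the deviance, because the indicator $I(G = k)$ is zero for every other class.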
Data Rich Situation
If we have enough data, we split the dataset into three parts:
Train: train the model.
Validate: select the appropriate model parameter ($\alpha$, $\lambda$), usually the model complexity or cost penalty.
Test: estimate the test error on an independent sample.
Recommended ratios:
$\frac{1}{2} : \frac{1}{4} : \frac{1}{4}$ (train : validate : test) with a validation set,
$\frac{2}{3} : \frac{1}{3}$ (train : test) when no validation set is needed.
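A minimal sketch of such a split, assuming a 50/25/25 partition into train, validation, and test parts. The function name, the `fractions` parameter, and the fixed seed are assumptions made for illustration.

```python
import random


def train_val_test_split(data, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the data and cut it into train / validation / test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test part
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 50 25 25
```

The test part is touched only once, after the model and its parameters have been chosen on the other two parts.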
Stratified Selection
Assume a small disease prevalence.
We split the data into healthy and sick and split each group separately.
Consider a model over different branches of your company:
to estimate the error on a new branch, all data from some branches should be held out as test data,
not a random sample from each branch.
The other methods are needed only because we almost never have enough data.
Error Estimation with Little Data
The amount of data needed depends on:
the complexity of the true function,
the signal-to-noise ratio.
Model selection:
the absolute value of the error is not necessary; the difference between models is crucial,
any method can be used (AIC, BIC, cross-validation, ...).
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Misclassification error versus subset size $p$.]
FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.
Test error estimation:
the absolute value of the error is necessary; direct estimation (cross-validation, leave-one-out) is preferred.
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Two panels versus the number of basis functions (2 to 128): log-likelihood loss (left) and 0-1 loss, i.e. misclassification error (right); curves: Train, Test, AIC.]
FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$ is modeled as an expansion in $M$ spline basis functions. In the left panel we see the AIC statistic used to estimate $\mathrm{Err}_{in}$ using log-likelihood loss. Included is an estimate of $\mathrm{Err}$ based on an independent test sample. It does well except for the extremely over-parametrized case ($M = 256$ parameters for $N = 1000$ observations). In the right panel the same is done for 0-1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.
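Bias Variance Decomposition
Assume $Y = f(X) + \varepsilon$ where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$.
We derive the expected prediction error of the regression fit $\hat f(X)$ at an input point $X = x_0$ using squared-error loss:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}$$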
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 5. Two panels: "NMR Signal" (spline and wavelet fits) and "Smooth Function (Simulated)" (spline and wavelet fits with the true function).]
FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples. Each panel compares the SURE-shrunk wavelet fit to the cross-validated smoothing spline fit.
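Bias Variance Decomposition
K-nearest neighbours:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma^2}{k}.
\end{aligned}$$
Linear fit:
$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f_p(x_0))^2 \mid X = x_0] \\
&= \sigma^2 + [E\hat f_p(x_0) - f(x_0)]^2 + \|h(x_0)\|^2 \sigma^2,
\end{aligned}$$
where $h(x_0)\,y = x_0^T (X^T X)^{-1} X^T y = \hat f_p(x_0)$.
Hence $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2 \sigma^2$, and its average over the sample points $x_i$ is $\frac{p}{N}\sigma^2$, hence
$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma^2 + \frac{1}{N}\sum_{i=1}^{N} [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\sigma^2.$$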
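The k-nearest-neighbour formula can be checked by simulation. The sketch below is a Monte Carlo illustration under assumed settings that are not in the slides: a true function $f(x) = x^2$, 21 fixed inputs on a grid in $[0, 1]$, $\sigma = 0.3$, $k = 5$, and $x_0 = 0.5$. It compares the simulated mean and variance of $\hat f(x_0)$ with the closed-form squared bias and $\sigma^2 / k$.

```python
import random
import statistics


def f(x):
    """True regression function (an assumption of this sketch)."""
    return x * x


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k, nearest


rng = random.Random(0)
N, k, sigma, x0 = 21, 5, 0.3, 0.5
xs = [i / (N - 1) for i in range(N)]  # fixed design points

preds = []
for _ in range(5000):
    # New noisy responses at the same inputs for every simulated training set.
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    pred, nearest = knn_predict(xs, ys, x0, k)
    preds.append(pred)

mc_mean = statistics.fmean(preds)
mc_var = statistics.pvariance(preds)

theory_mean = sum(f(xs[i]) for i in nearest) / k
theory_bias2 = (f(x0) - theory_mean) ** 2
theory_var = sigma ** 2 / k

print("mean of fits:", mc_mean, "theory:", theory_mean)
print("variance of fits:", mc_var, "theory:", theory_var)
print("squared bias (theory):", theory_bias2)
```

Increasing `k` lowers the variance term $\sigma^2 / k$ but averages over more distant neighbours, which typically raises the squared bias.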
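Penalized model complexity
Consider the ridge regression fit $\hat f_\alpha(x_0)$ with $h(x_0)\,y = x_0^T (X^T X + \alpha I)^{-1} X^T y$. Here we can break the bias down more finely. Let $\beta_*$ be the fit with $\alpha = 0$,
$$\beta_* = \arg\min_\beta E\,(f(X) - \beta^T X)^2.$$
Then
$$\begin{aligned}
E_{x_0}[f(x_0) - E\hat f_\alpha(x_0)]^2 &= E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat\beta_\alpha^T x_0]^2 \\
&= \mathrm{Ave}[\text{Model Bias}]^2 + \mathrm{Ave}[\text{Estimation Bias}]^2.
\end{aligned}$$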
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Schematic with labels: truth, realization, closest fit, closest fit in population, shrunken fit, model bias, estimation bias, estimation variance, model space, restricted model space.]
FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population." A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.
Example: Bias-Variance Tradeoff
50 observations, 20 predictors, uniformly distributed in the hypercube $[0, 1]^{20}$.
Left: $Y$ is 0 if $X_1 \le \frac{1}{2}$ and 1 if $X_1 > \frac{1}{2}$, and we apply k-NN.
Right: $Y$ is 1 if $\sum_{j=1}^{10} X_j > 5$ and 0 otherwise, and we use best subset linear regression of size $p$.
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Four panels: k-NN Regression and k-NN Classification (versus number of neighbors $k$), Linear Model Regression and Linear Model Classification (versus subset size $p$).]
FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0-1 loss. The models are k-nearest neighbors (left) and best subset regression of size $p$ (right). The variance and bias ...
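Optimism of the Training Error Rate
The training error $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$ is usually less than the true error $\mathrm{Err} = E[L(Y, \hat f(X))]$.
We keep the points $x_i$ fixed and take a new sample of $y$ at these points.
In-sample error: $\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_y E_{Y^{new}} L(Y_i^{new}, \hat f(x_i))$.
Optimism: $op \equiv \mathrm{Err}_{in} - E_y(\overline{\mathrm{err}})$.
For squared error, 0-1, and other loss functions: $op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$.
For a linear fit with $d = p$ inputs or basis functions and the additive error model $Y = f(X) + \varepsilon$ it simplifies to
$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\sigma^2, \qquad \mathrm{Err}_{in} = E_y\,\overline{\mathrm{err}} + 2\,\frac{d}{N}\,\sigma^2 = C_p = \mathrm{AIC}_{gauss}.$$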
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 5, Figure 5.19: wavelet smoothing compared with smoothing splines on the "NMR Signal" and "Smooth Function (Simulated)" examples.]
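The Optimism (Just for fun)
$$y_i - \hat y_i = (y_i - f(x_i)) + (f(x_i) - E\hat f(x_i)) + (E\hat f(x_i) - \hat y_i).$$
We get six terms of the sum of squares:
$A_1 = \sum_i (y_i - f(x_i))^2$, $\quad D_1 = 2\sum_i (y_i - f(x_i))(f(x_i) - E\hat f(x_i))$,
$B = \sum_i (f(x_i) - E\hat f(x_i))^2$, $\quad E = 2\sum_i (f(x_i) - E\hat f(x_i))(E\hat f(x_i) - \hat y_i)$,
$C = \sum_i (E\hat f(x_i) - \hat y_i)^2$, $\quad F_1 = 2\sum_i (y_i - f(x_i))(E\hat f(x_i) - \hat y_i)$.
For the new sample $Y'$, taking the expectation:
$A_2 = \sum_i E_{Y'}(Y'_i - f(x_i))^2$, and similarly $D_2$, $F_2$.
$$\begin{aligned}
N(\mathrm{Err}_{in} - \overline{\mathrm{err}}) &= (A_2 + B + C + D_2 + E + F_2) - (A_1 + B + C + D_1 + E + F_1) \\
&= (A_2 - A_1) + (D_2 - D_1) + (F_2 - F_1).
\end{aligned}$$
$E(A_1) = E(A_2) = N\sigma^2$.
$E(D_1) = 2\sum_i (E(y_i) - f(x_i))(f(x_i) - E\hat f(x_i)) = 0$ since $E(y_i) = f(x_i)$; $E(D_2) = 0$ as well.
$F_2 = 2\sum_i E_{Y'}\big[(Y'_i - f(x_i))(E\hat f(x_i) - \hat y_i)\big] = 0$, as $E(Y'_i) = f(x_i)$ and $Y'_i$ and $\hat y_i$ are independent.
$E(F_1) = -2\sum_{i=1}^{N} \mathrm{cov}(y_i, \hat y_i)$.
Taking expectations, only the $F$ terms survive, so $N \cdot op = 2\sum_{i=1}^{N} \mathrm{cov}(\hat y_i, y_i)$.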
The Covariance (Just for fun)
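$$\mathrm{cov}(\hat y, y) = \mathrm{cov}(X(X^T X)^{-1} X^T y,\; y) = X(X^T X)^{-1} X^T\, \mathrm{cov}(y, y), \qquad \mathrm{cov}(y, y) = \sigma^2 I.$$
The values $\mathrm{cov}(\hat y_i, y_i)$ are the diagonal values of the above matrix $\mathrm{cov}(\hat y, y)$.
Thus
$$\begin{aligned}
\sum_{i=1}^{N} \mathrm{cov}(\hat y_i, y_i) &= \mathrm{trace}(X(X^T X)^{-1} X^T)\,\sigma^2 \\
&= \mathrm{trace}((X^T X)^{-1} X^T X)\,\sigma^2 \\
&= \mathrm{trace}(I_d)\,\sigma^2 = d\sigma^2.
\end{aligned}$$
Similarly, for a linear smoother $\hat y = S_\lambda y$,
$$\sum_{i=1}^{N} \mathrm{cov}((S_\lambda y)_i, y_i) = \mathrm{trace}(S_\lambda)\,\sigma^2.$$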
https://www.waxworksmath.com/Authors/G_M/Hastie/WriteUp/Weatherwax_Epstein_Hastie_Solution_Manual.pdf
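AIC Criterion for Gaussian error
$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\sigma^2$$
$$\mathrm{AIC}(\alpha) = E_y\,\overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\sigma^2.$$
The effective number of parameters for a linear fit $\hat y = S y$ is $d(S) = \mathrm{trace}(S)$, the sum of the diagonal elements.
For 0-1 loss this does not hold in general, only as an approximation.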
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7, Figure 7.4: AIC used for model selection for the phoneme recognition example; left panel log-likelihood loss, right panel 0-1 loss, both versus the number of basis functions; curves: Train, Test, AIC.]
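BIC Bayesian Information Criterion
$$\begin{aligned}
P(M_m \mid Z) &= \frac{P(Z \mid M_m) \cdot P(M_m)}{P(Z)} \\
&\propto P(Z \mid M_m) \cdot P(M_m) \\
&\propto P(M_m) \cdot \int P(Z \mid \theta_m, M_m)\, P(\theta_m \mid M_m)\, d\theta_m.
\end{aligned}$$
The Laplace approximation to the integral (with other simplifications) gives
$$\begin{aligned}
\log P(Z \mid M_m) &= \log P(Z \mid \hat\theta_m, M_m) - \frac{d_m}{2} \cdot \log N + O(1) \\
&= \mathrm{loglik}_m - \frac{d_m}{2} \cdot \log N + O(1),
\end{aligned}$$
where $\hat\theta_m$ is the maximum-likelihood estimate and $d_m$ the number of free parameters of model $M_m$.
$$\mathrm{BIC}_m = -2\,\mathrm{loglik}_m + (\log N) \cdot d_m$$
BIC may be used to compare the model posterior probabilities:
$$\frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{\ell=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_\ell}}.$$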
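The conversion from BIC values to approximate posterior probabilities can be sketched as follows. The sketch assumes equal prior probabilities $P(M_m)$ for all models; the function names and the log-likelihood values in the example are hypothetical.

```python
import math


def bic(loglik, d, n):
    """BIC = -2 * loglik + log(N) * d for a model with d parameters."""
    return -2.0 * loglik + math.log(n) * d


def posterior_from_bic(bics):
    """exp(-BIC_m / 2) normalised over all models (equal priors assumed)."""
    best = min(bics)  # subtract the minimum to avoid numerical underflow
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical maximised log-likelihoods of three nested models on N = 200 points.
bics = [bic(-310.0, 2, 200), bic(-300.0, 4, 200), bic(-299.0, 8, 200)]
print(bics)
print(posterior_from_bic(bics))
```

Subtracting the smallest BIC before exponentiating does not change the ratios, since the common factor cancels in the normalisation.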
Crossvalidation
Split the data into $K$ roughly equal-sized parts. Usually $K = 5$ or $10$, or $K = N$.
For $K = 10$ it is called tenfold cross-validation.
For $K = N$ it is called leave-one-out cross-validation.
$\kappa : \{1, \dots, N\} \to \{1, \dots, K\}$ is the partition function.
$\hat f^{-k}$ is the function fitted with the $k$th part removed.
For $k = 1, \dots, K$: fit the model to the other $K - 1$ parts of the data, and calculate the prediction error of the fitted model when predicting the $k$th part of the data.
Average the error estimates:
$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f^{-\kappa(i)}(x_i)).$$
10 times 10 cross-validation: tenfold cross-validation repeated 10 times with different random partitions.
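The procedure can be sketched in plain Python. The sketch makes some assumptions for illustration: `fit` is any function that takes training inputs and responses and returns a predictor, folds are numbered from 0, and the constant "predict the mean" model stands in for a real learner.

```python
import random


def kfold_partition(n, K, seed=0):
    """Partition function kappa: kappa[i] is the fold (0..K-1) of sample i."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    kappa = [0] * n
    for pos, i in enumerate(idx):
        kappa[i] = pos % K
    return kappa


def cross_validate(xs, ys, fit, loss, K=5, seed=0):
    """CV = (1/N) * sum_i loss(y_i, f^{-kappa(i)}(x_i))."""
    n = len(xs)
    kappa = kfold_partition(n, K, seed)
    total = 0.0
    for k in range(K):
        train = [i for i in range(n) if kappa[i] != k]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum(loss(ys[i], model(xs[i])) for i in range(n) if kappa[i] == k)
    return total / n


def fit_mean(xs, ys):
    """Toy learner: always predict the mean training response."""
    m = sum(ys) / len(ys)
    return lambda x: m


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
# K = N gives leave-one-out cross-validation.
print(cross_validate(xs, ys, fit_mean, squared_error, K=3))
```

Every sample is predicted exactly once, by a model that never saw it during fitting.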
Parameter Tuning by Crossvalidation
Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, we define
$$\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)).$$
$\mathrm{CV}(\alpha)$ provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it.
Our final model is $f(x, \hat\alpha)$.
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7, Figure 7.9: prediction error (orange) and tenfold cross-validation curve (blue) versus subset size $p$, estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.]
Definition (One standard error rule): We choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
Here, $p = 9$.
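A minimal sketch of the one standard error rule. It assumes that a smaller complexity value means a more parsimonious model and that the cross-validation means and standard errors are already computed; the numbers in the example are made up.

```python
def one_standard_error_rule(complexities, cv_means, cv_ses):
    """Return the smallest complexity whose CV error is within one
    standard error of the best (minimum) CV error."""
    best = min(range(len(cv_means)), key=lambda i: cv_means[i])
    threshold = cv_means[best] + cv_ses[best]
    eligible = [i for i in range(len(cv_means)) if cv_means[i] <= threshold]
    simplest = min(eligible, key=lambda i: complexities[i])
    return complexities[simplest]


# Hypothetical CV curve: the minimum is at complexity 4, but complexity 3
# is within one standard error of it.
complexities = [1, 2, 3, 4, 5]
cv_means = [0.40, 0.30, 0.25, 0.24, 0.26]
cv_ses = [0.03, 0.03, 0.03, 0.03, 0.03]
print(one_standard_error_rule(complexities, cv_means, cv_ses))  # 3
```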
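Generalized Crossvalidation
For a linear fitting method we can write $\hat y = S y$.
For many linear fitting methods,
$$\frac{1}{N}\sum_{i=1}^{N} [y_i - \hat f^{-i}(x_i)]^2 = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\right]^2,$$
where $S_{ii}$ is the $i$th diagonal element of $S$. The generalized cross-validation (GCV) approximation is
$$\mathrm{GCV} = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2.$$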
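The leave-one-out identity can be verified numerically for a straight-line least-squares fit, where the diagonal of $S$ has the closed form $S_{ii} = 1/N + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$. The sketch below uses made-up toy data and hypothetical helper names; it compares brute-force leave-one-out refitting with the shortcut and also prints the GCV value.

```python
def fit_line(xs, ys):
    """Least-squares fit y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def leverages(xs):
    """Diagonal elements S_ii of the smoother matrix for the line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]


# Hypothetical toy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
n = len(xs)

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
s_diag = leverages(xs)
trace_s = sum(s_diag)  # equals the number of parameters, here 2

# Brute force: refit N times, each time leaving one point out.
loo_brute = 0.0
for i in range(n):
    a_i, b_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    loo_brute += (ys[i] - (a_i + b_i * xs[i])) ** 2
loo_brute /= n

# Shortcut: a single fit, residuals rescaled by 1 - S_ii.
loo_shortcut = sum((r / (1.0 - s)) ** 2 for r, s in zip(residuals, s_diag)) / n
# GCV replaces each S_ii by the average trace(S) / N.
gcv = sum((r / (1.0 - trace_s / n)) ** 2 for r in residuals) / n

print(loo_brute, loo_shortcut, gcv, trace_s)
```

The shortcut needs one fit instead of $N$, and GCV needs only the trace of $S$ rather than its individual diagonal elements.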
Bad and good Crossvalidation
$p = 500$ dimensions, $N = 20$ samples,
random data, $y$ independent of $x$.
We search for the best 'decision stump'
(a decision rule based on a single attribute).
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Four panels: error on the full training set versus predictor; error on 1/5 versus error on 4/5; class label versus predictor 436 (full and 4/5 fits); CV errors.]
FIGURE 7.11. Simulation study to investigate the performance of cross validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the effect of ...
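Wrong way:
Choose 100 good predictors using all of the samples.
Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold $k$.
Use the classifier to predict the class labels for the samples in fold $k$.
The right way repeats the predictor selection inside every fold, using only the samples outside fold $k$.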
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Two histograms of the correlations of selected predictors with the outcome (frequency versus correlation from -1 to 1): "Wrong way" (top) and "Right way" (bottom).]
FIGURE 7.10. Cross-validation the wrong and right way: histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.
Leave-one-out error estimate
We hold out one sample, train the model on the remaining ones, and check whether the held-out sample is predicted correctly.
We repeat this for every sample and average.
Advantages:
The training set is large.
The computation is deterministic (so there is no point in repeating it).
Disadvantages:
It is computationally expensive.
It can occasionally fail: for example, when the target is generated at random with an exact 50/50 class split, a predictor of the majority class has a leave-one-out error of 100%.
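The failure case can be reproduced in a few lines. The sketch assumes a classifier that ignores the inputs and always predicts the majority class of its training labels; the names are illustrative. With an exactly balanced target, removing one sample always makes the other class the majority, so every held-out sample is misclassified.

```python
def majority_class(labels):
    """Predict 1 if ones are the strict majority, otherwise 0."""
    ones = sum(labels)
    return 1 if ones > len(labels) - ones else 0


def leave_one_out_error(labels):
    """Leave-one-out error of the majority-class predictor."""
    errors = 0
    for i, y in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]
        if majority_class(rest) != y:
            errors += 1
    return errors / len(labels)


labels = [0, 1] * 10  # exactly 10 zeros and 10 ones
print(leave_one_out_error(labels))  # 1.0
```

The true error of this predictor on such data is 50%, so the leave-one-out estimate is badly off here.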
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. $1 - \mathrm{Err}$ versus the size of the training set (0 to 200).]
FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a plot of $1 - \mathrm{Err}$ versus the size of the training set $N$. With a dataset of 200 observations, 5-fold cross-validation would use training sets of size 160, which would behave much like the full set. However, with a dataset of 50 observations fivefold cross-validation would use training sets of size 40, and this would result in a considerable overestimate of prediction error.
Bootstrap
We draw the samples of the training set with replacement.
Given $N$ data points, we draw $N$ samples with replacement; some are drawn several times, so some are not drawn at all. Those are used for testing.
The probability that a given sample is never drawn is $\left(1 - \frac{1}{N}\right)^N \approx e^{-1} = 0.368$.
The error estimate is rather pessimistic, because we train on $N$ samples that in fact come from only about 0.632 of the data. The final error estimate is therefore adjusted by the error on the training data:
$$\mathrm{error} = 0.632 \cdot e_{test} + 0.368 \cdot e_{train}.$$
Again, it can be led astray, e.g. by the same data as leave-one-out.
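The numbers above can be checked with a short sketch. The helper names are hypothetical, and the resampling uses Python's standard `random` module with a fixed seed.

```python
import math
import random


def prob_not_selected(n):
    """Probability that a given sample is never drawn in n draws."""
    return (1.0 - 1.0 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; never-drawn indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def error_632(e_test, e_train):
    """Combined estimate: 0.632 * test error + 0.368 * training error."""
    return 0.632 * e_test + 0.368 * e_train


print(prob_not_selected(1000), math.exp(-1))
train, test = bootstrap_split(1000, random.Random(0))
print(len(test) / 1000)  # fraction of samples left for testing
```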
[Figure: Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 7. Schematic: training sample $Z = (z_1, z_2, \dots, z_N)$, bootstrap samples $Z^{*1}, Z^{*2}, \dots, Z^{*B}$, and bootstrap replications $S(Z^{*1}), S(Z^{*2}), \dots, S(Z^{*B})$.]
FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity $S(Z)$ computed from our dataset. $B$ training sets $Z^{*b}$, $b = 1, \dots, B$, each of size $N$, are drawn with replacement from the original dataset. The quantity of interest $S(Z)$ is computed from each bootstrap training set, and the values $S(Z^{*1}), \dots, S(Z^{*B})$ are used to assess the statistical accuracy of $S(Z)$.
Comparing Different Types of Models
Suppose we trained decision trees and k-NN classifiers, ran 10x10 cross-validation, and obtained two numbers. Can we conclude that one type of model is better than the other on the given data?
If the values are close, no: the difference may be due to chance.
For classification, use e.g. McNemar's test: the statistic
$$\frac{(|\mathrm{err}_A - \mathrm{err}_B| - 1)^2}{\mathrm{err}_A + \mathrm{err}_B}$$
is compared with the $\chi^2$ distribution with 1 degree of freedom.
$\mathrm{err}_A$ denotes the number of samples where A is wrong and B is correct; $\mathrm{err}_B$ is defined analogously.
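McNemar's statistic with continuity correction, $(|\mathrm{err}_A - \mathrm{err}_B| - 1)^2 / (\mathrm{err}_A + \mathrm{err}_B)$, and its p-value can be sketched with the standard library only. The sketch uses the fact that for 1 degree of freedom the $\chi^2$ tail probability equals $\mathrm{erfc}(\sqrt{s/2})$; the function names and the counts in the example are hypothetical.

```python
import math


def mcnemar_statistic(n_a_wrong_only, n_b_wrong_only):
    """(|err_A - err_B| - 1)^2 / (err_A + err_B), where err_A counts samples
    misclassified only by A and err_B those misclassified only by B."""
    diff = abs(n_a_wrong_only - n_b_wrong_only)
    return (diff - 1) ** 2 / (n_a_wrong_only + n_b_wrong_only)


def chi2_1df_p_value(stat):
    """P(chi2 with 1 degree of freedom > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: A alone is wrong on 15 samples, B alone on 5.
stat = mcnemar_statistic(15, 5)
print(stat, chi2_1df_p_value(stat))
```

Samples on which both models are right, or both are wrong, do not enter the statistic.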
For regression:
Statistics suggests a t-test; if the error can be measured on the same splits into training and test data, use the paired t-test, which is more precise.
The mean difference $\bar d$ of the individual estimates, after normalization by the variance, follows Student's distribution with $N - 1$ degrees of freedom:
$$\frac{\bar d}{\sqrt{\sigma_d^2 / (N - 1)}}.$$
We use the two-sided test of whether $\bar d$ is zero, because we do not know in advance which model is better, i.e. on which side of the interval we will fall.
The unpaired test uses the variance estimate $\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{l}$.
Evaluating Numeric Prediction (Regression)
Here $p_i$ are the predicted values, $a_i$ the actual values, and $n$ the number of test samples.
mean squared error: $\frac{\sum_i (p_i - a_i)^2}{n}$
root mean squared error: $\sqrt{\frac{\sum_i (p_i - a_i)^2}{n}}$
mean absolute error: $\frac{\sum_i |p_i - a_i|}{n}$
The measures above can be applied either to absolute differences or to relative errors. The following ones, except for the correlation, are relative to the baseline that always predicts the most likely value (the mean $\bar a$).
relative squared error: $\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar a)^2}$, where $\bar a = \frac{1}{n}\sum_i a_i$
root relative squared error: $\sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar a)^2}}$
relative absolute error: $\frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar a|}$
correlation coefficient: $\frac{S_{PA}}{\sqrt{S_P S_A}}$, where $S_{PA} = \frac{\sum_i (p_i - \bar p)(a_i - \bar a)}{n - 1}$, $S_P = \frac{\sum_i (p_i - \bar p)^2}{n - 1}$, $S_A = \frac{\sum_i (a_i - \bar a)^2}{n - 1}$.
For the correlation coefficient, higher is better, unlike for the preceding measures.
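All of these measures can be computed in one pass over predicted values `p` and actual values `a`. The sketch below is illustrative; the function name and dictionary keys are assumptions.

```python
import math


def regression_metrics(p, a):
    """Error measures for predictions p against actual values a."""
    n = len(a)
    a_bar = sum(a) / n
    p_bar = sum(p) / n
    sse = sum((pi - ai) ** 2 for pi, ai in zip(p, a))  # squared errors
    sae = sum(abs(pi - ai) for pi, ai in zip(p, a))    # absolute errors
    sst = sum((ai - a_bar) ** 2 for ai in a)           # baseline: predict the mean
    sat = sum(abs(ai - a_bar) for ai in a)
    s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - p_bar) ** 2 for pi in p) / (n - 1)
    s_a = sst / (n - 1)
    return {
        "mse": sse / n,
        "rmse": math.sqrt(sse / n),
        "mae": sae / n,
        "rse": sse / sst,
        "rrse": math.sqrt(sse / sst),
        "rae": sae / sat,
        "corr": s_pa / math.sqrt(s_p * s_a),
    }


# Predictions that are always too high by exactly 1.
metrics = regression_metrics(p=[2.0, 3.0, 4.0, 5.0], a=[1.0, 2.0, 3.0, 4.0])
print(metrics)
```

In this example the correlation is perfect even though every prediction is off by one, which shows why correlation alone does not measure prediction error.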
Evaluating Classification Performance
The basic statistics reported when evaluating classification models (e.g. in Weka) are derived from the confusion matrix:
true class \ predicted class    +                      -
+                               TP (true positive)     FN (false negative)
-                               FP (false positive)    TN (true negative)
(In Czech: správně/falešně positivní/negativní.)
Basic measures:
accuracy: $\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$
error: $\mathrm{Err} = \frac{FP + FN}{TP + TN + FP + FN}$
precision: $\mathrm{Prec} = \frac{TP}{TP + FP}$
recall, sensitivity: $\mathrm{Rec} = \frac{TP}{TP + FN}$
specificity: $\mathrm{Specificity} = \frac{TN}{TN + FP}$
F measure: $F = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$
TP rate: $\frac{TP}{TP + FN}$
FP rate (= 1 - Specificity): $\frac{FP}{FP + TN}$
Different fields report different measures: medicine uses sensitivity and specificity; document retrieval uses precision, recall, and the F measure, or alternatively the 3-point average of recall at precision 20, 50, and 80 percent, or the 11-point average for 0, 10, 20, ..., 90, 100 percent.
Marketing: TP and $\frac{TP + FP}{TP + TN + FP + FN} \times 100\%$.
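The basic measures follow directly from the four cells of the confusion matrix. In the sketch below the function name, the dictionary keys, and the example counts are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Basic measures computed from the confusion matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or TP rate
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),  # equals 1 - specificity
    }


# Hypothetical confusion matrix.
m = classification_metrics(tp=40, fn=10, fp=20, tn=30)
print(m)
```

The sketch does not guard against empty rows or columns of the confusion matrix, where some denominators would be zero.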
ROC Curve
ROC = receiver operating characteristic, a term used to characterize the detection of a received signal.
x axis = FP (false positives) in percent, y axis = TP (true positives) in percent,
i.e. each example classified as positive moves the curve by one step, either up (true positive) or to the right (false positive). In practice we average over several tests.
If we have several tests whose ROC curves cross, the tests can be combined so as to reach the boundary of the convex hull of the ROC curves.
Note: the axis labels in the figure are swapped.
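The step-by-step construction can be sketched as follows: examples are sorted by decreasing classifier score, and each one moves the curve up (if it is truly positive) or to the right (if it is truly negative). The sketch assumes labels in {0, 1}, no tied scores, and the conventional orientation with the FP rate on the x axis and the TP rate on the y axis; the function name and data are illustrative.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FP rate, TP rate) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Lower the decision threshold one example at a time.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1  # true positive: step up
        else:
            fp += 1  # false positive: step right
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical scores for two positive and two negative examples.
points = roc_points(scores=[0.9, 0.8, 0.7, 0.6], labels=[1, 0, 1, 0])
print(points)
```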
Precision-Recall Curve
The x axis shows recall (TP, in percent); the y axis shows precision, i.e. TP relative to TP+FP, in percent. The best operating point is then at the top right.
Note: the axis labels in the figure are swapped.
Cost of False Positives and False Negatives
Example: predicting ovulation in cows. On 97% of days a cow is infertile, so I always predict "infertile" unless missing a fertile day carries a different cost than falsely labeling a day as fertile.
Another example: advertising. Advertisers do not want to miss a customer, so a needlessly sent leaflet does not matter much.
A 2x2 (fourfold) table; in general, a confusion matrix for a multi-valued target attribute.
Learning with different error costs (cost-sensitive learning)
We give a greater weight to the data we should be more sensitive to (e.g. by repeating them several times) and then train an ordinary model with no cost sensitivity.