Academic year: 2022

Model Assessment and Selection

We assume i.i.d. data:

independent (independent samples),

identically distributed (the same distribution).

Assume many i.i.d. datasets.

One line = the train/test curve for one set of samples (one dataset).

Training error is the average loss over the data used to fit the model:

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i)).$$

Test error is the same average, $\frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$, with the sum taken over data not used to fit the model.

Test error is a point estimate of the generalization error.

Dark blue line is an average of test errors, a more robust estimate.

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.1. x axis: Model Complexity (df); y axis: Prediction Error. Low complexity: high bias, low variance. High complexity: low bias, high variance.]

FIGURE 7.1. Behavior of test sample and training sample error as the model complexity is varied. The light blue curves show the training error $\overline{\mathrm{err}}$, while the light red curves show the conditional test error $\mathrm{Err}_T$ for 100 training sets of size 50 each, as the model complexity is increased. The solid curves show the expected test error $\mathrm{Err}$ and the expected training error $E[\overline{\mathrm{err}}]$.

Definition (Generalization error)

Generalization error is the expected prediction error over an independent test sample,

$$\mathrm{Err} = E[L(Y, \hat f(X))],$$

where both $X$ and $Y$ are drawn randomly from their joint distribution (population).

Machine Learning Model Assessment and Selection 6 1 - 35 April 4, 2021 147 / 1


Loss functions for regression

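Definition

Squared error loss: $L(y, \hat y) = (y - \hat y)^2$

Absolute error loss: $L(y, \hat y) = |y - \hat y|$

Huber error loss:

$$L_\delta(y, \hat y) = \begin{cases} \frac{1}{2}(y - \hat y)^2 & \text{for } |y - \hat y| \le \delta, \\ \delta\left(|y - \hat y| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$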
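The three losses can be compared numerically. The following is a minimal sketch in plain Python; the function names and the default $\delta = 1$ are illustrative choices, not part of the slides.

```python
def squared_loss(y, y_hat):
    """Squared error loss (y - y_hat)^2."""
    return (y - y_hat) ** 2


def absolute_loss(y, y_hat):
    """Absolute error loss |y - y_hat|."""
    return abs(y - y_hat)


def huber_loss(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r ** 2
    return delta * (r - 0.5 * delta)


# Compare the losses for a small, a borderline and a large residual.
for residual in (0.5, 1.0, 3.0):
    print(residual,
          squared_loss(residual, 0.0),
          absolute_loss(residual, 0.0),
          huber_loss(residual, 0.0))
```

For a large residual the Huber loss grows linearly, so a single outlier influences the fit much less than under squared error.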

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 10

[Figure 10.5. x axis: $y - f$; y axis: Loss. Curves: Squared Error, Absolute Error, Huber.]

FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when $|y - f|$ is large.


Classifier Assessment and Selection

Qualitative response $G$ taking one of $K$ values labeled $1, \dots, K$. Typically

$$\hat G(X) = \arg\max_k \hat p_k(X).$$

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 10

[Figure 10.4. x axis: $y \cdot f$; y axis: Loss. Curves: Misclassification, Exponential, Binomial Deviance, Squared Error, Support Vector.]

FIGURE 10.4. Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\mathrm{sign}(f)$. The losses are misclassification: $I(\mathrm{sign}(f) \ne y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point $(0, 1)$.

Definition (Loss functions for classification)

0-1 loss, misclassification:

$$L(G, \hat G(X)) = I(G \ne \hat G(X))$$

log-likelihood, cross-entropy, deviance:

$$L(G, \hat p(X)) = -2 \sum_{k=1}^{K} I(G = k) \log \hat p_k(X)$$

For two classes encoded $\{-1, +1\}$:

exponential: $e^{-y f(x)}$

support vector: $\max(0, 1 - y f(x))$
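The classification losses above can be sketched in a few lines of Python. The function names are illustrative; classes are assumed to be labeled $1, \dots, K$ for the deviance and $\pm 1$ for the margin-based losses, as in the definitions.

```python
import math


def zero_one_loss(y, f):
    """Misclassification loss for y in {-1, +1}: 1 unless sign(f) equals y."""
    return 0.0 if y * f > 0 else 1.0


def exponential_loss(y, f):
    """Exponential loss exp(-y f)."""
    return math.exp(-y * f)


def hinge_loss(y, f):
    """Support vector (hinge) loss max(0, 1 - y f)."""
    return max(0.0, 1.0 - y * f)


def deviance(label, probs):
    """-2 log p_G for a true class `label` in 1..K and estimated probabilities."""
    return -2.0 * math.log(probs[label - 1])


# A confident correct prediction, a weak correct one and a wrong one.
for margin_input in (2.0, 0.5, -0.5):
    print(margin_input,
          zero_one_loss(1, margin_input),
          exponential_loss(1, margin_input),
          hinge_loss(1, margin_input))
```

Only the indicator term of the true class survives in the deviance sum, which is why the sketch evaluates a single logarithm.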


Data Rich Situation

If we have enough data, we split the dataset into three parts:

Train: train the model.

Validate: select the appropriate model parameter $\alpha$, $\lambda$, usually the model complexity or cost penalty.

Test: estimate the test error on an independent sample.

Recommended ratios:

$\frac{1}{2} : \frac{1}{4} : \frac{1}{4}$ with the validation set,

$\frac{2}{3} : \frac{1}{3}$ when no validation set is needed.
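A random three-way split can be sketched as follows. The helper name, the default fractions (one half, one quarter, one quarter) and the fixed seed are assumptions made for illustration.

```python
import random


def train_validation_test_split(data, fractions=(0.5, 0.25, 0.25), seed=0):
    """Shuffle the data and cut it into train, validation and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    # The test part takes whatever remains after train and validation.
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


train, val, test = train_validation_test_split(range(100))
print(len(train), len(val), len(test))
```

The test part must be touched only once, after the model and its parameters have been chosen on the other two parts.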


Stratified Selection

Assume small disease prevalence.

We split the data into healthy and sick groups and split each group separately.

Consider a model over different branches of your company:

to estimate the performance on a new branch, all data from some branches should be held out as test data,

not a random sample from each branch.

The other methods that follow are needed only because we almost never have enough data.


Error Estimation with Little Data

Amount of data needed depends on:

the complexity of the true function,

the noise ratio.

Model selection

the absolute value of the error is not necessary; the difference between models is crucial,

any method can be used (AIC, BIC, cross-validation, ...).

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.9. x axis: Subset Size p; y axis: Misclassification Error.]

FIGURE 7.9. Prediction error (orange) and tenfold cross-validation curve (blue) estimated from a single training set, from the scenario in the bottom right panel of Figure 7.3.

Test error estimation

the absolute value of the error is necessary; direct estimation (cross-validation, leave-one-out) is preferred.

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.4. Two panels, both with x axis: Number of Basis Functions (2 to 128) and curves Train, Test, AIC. Left panel: Log-likelihood Loss. Right panel: 0-1 Loss (Misclassification Error).]

FIGURE 7.4. AIC used for model selection for the phoneme recognition example of Section 5.2.3. The logistic regression coefficient function $\beta(f) = \sum_{m=1}^{M} h_m(f)\theta_m$ is modeled as an expansion in $M$ spline basis functions. In the left panel we see the AIC statistic used to estimate $\mathrm{Err}_{in}$ using log-likelihood loss. Included is an estimate of $\mathrm{Err}$ based on an independent test sample. It does well except for the extremely over-parametrized case ($M = 256$ parameters for $N = 1000$ observations). In the right panel the same is done for 0-1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.


Bias Variance Decomposition

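Assume $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$.

We derive the expected prediction error of the regression fit $\hat f(X)$ at an input point $X = x_0$ using squared-error loss:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat f(x_0)) + \mathrm{Var}(\hat f(x_0)) \\
&= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}.
\end{aligned}$$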
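The second equality can be checked by splitting the prediction error into three parts. The following is a sketch of that step, assuming that the new noise term $\varepsilon$ is independent of the training sample used to build $\hat f$:

```latex
\begin{aligned}
Y - \hat f(x_0) &= \varepsilon
  + \bigl(f(x_0) - E\hat f(x_0)\bigr)
  + \bigl(E\hat f(x_0) - \hat f(x_0)\bigr), \\
E\bigl[(Y - \hat f(x_0))^2 \mid X = x_0\bigr]
  &= E[\varepsilon^2]
  + \bigl(f(x_0) - E\hat f(x_0)\bigr)^2
  + E\bigl[(\hat f(x_0) - E\hat f(x_0))^2\bigr].
\end{aligned}
```

All three cross terms vanish: those containing $\varepsilon$ because $E(\varepsilon) = 0$ and $\varepsilon$ is independent of $\hat f(x_0)$, and the remaining one because $f(x_0) - E\hat f(x_0)$ is a constant and $E[E\hat f(x_0) - \hat f(x_0)] = 0$.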

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 5

[Figure 5.19. Two panels: "NMR Signal" (curves: spline, wavelet) and "Smooth Function (Simulated)" (curves: spline, wavelet, true).]

FIGURE 5.19. Wavelet smoothing compared with smoothing splines on two examples. Each panel compares the SURE-shrunk wavelet fit to the cross-validated smoothing spline fit.


Bias Variance Decomposition

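K-nearest neighbours

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f_k(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f_k(x_0) - f(x_0)]^2 + E[\hat f_k(x_0) - E\hat f_k(x_0)]^2 \\
&= \sigma_\varepsilon^2 + \Bigl[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Bigr]^2 + \frac{\sigma_\varepsilon^2}{k}.
\end{aligned}$$

Linear fit

$$\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat f_p(x_0))^2 \mid X = x_0] \\
&= \sigma_\varepsilon^2 + [E\hat f_p(x_0) - f(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2,
\end{aligned}$$

where $h(x_0)$ is the vector of linear weights that produces the fit:

$$h(x_0)^T y = x_0^T (X^T X)^{-1} X^T y = \hat f_p(x_0).$$

Hence $\mathrm{Var}[\hat f_p(x_0)] = \|h(x_0)\|^2 \sigma_\varepsilon^2$, and its average over the sample points $x_i$ is $\frac{p}{N}\sigma_\varepsilon^2$, hence

$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N} [f(x_i) - E\hat f(x_i)]^2 + \frac{p}{N}\sigma_\varepsilon^2.$$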


Penalized model complexity

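Consider a ridge regression fit $\hat f_\alpha(x_0)$ with the fit $h(x_0)^T y = x_0^T (X^T X + \alpha I)^{-1} X^T y$; here we can break the bias down more finely. Let $\beta_*$ be the fit with $\alpha = 0$, $\beta_* = \arg\min_\beta E(f(X) - \beta^T X)^2$.

$$\begin{aligned}
E_{x_0}[f(x_0) - E\hat f_\alpha(x_0)]^2 &= E_{x_0}[f(x_0) - \beta_*^T x_0]^2 + E_{x_0}[\beta_*^T x_0 - E\hat\beta_\alpha^T x_0]^2 \\
&= \mathrm{Ave}[\text{Model Bias}]^2 + \mathrm{Ave}[\text{Estimation Bias}]^2.
\end{aligned}$$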

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.2. Schematic with labels: Truth, Realization, Closest fit, Closest fit in population, Shrunken fit, Model bias, Estimation Bias, Estimation Variance, MODEL SPACE, RESTRICTED MODEL SPACE.]

FIGURE 7.2. Schematic of the behavior of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population." A shrunken or regularized fit is also shown, having additional estimation bias, but smaller prediction error due to its decreased variance.


Example: Bias-Variance Tradeoff

50 observations, 20 predictors, uniformly distributed in the hypercube $[0,1]^{20}$.

Left: $Y$ is 0 if $X_1 \le \frac{1}{2}$ and 1 if $X_1 > \frac{1}{2}$, and we apply k-NN.

Right: $Y$ is 1 if $\sum_{j=1}^{10} X_j > 5$ and 0 otherwise, and we use best subset linear regression of size $p$.

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.3. Four panels. Top row: k-NN Regression (x axis: Number of Neighbors k, from 50 down to 0) and Linear Model Regression (x axis: Subset Size p). Bottom row: k-NN Classification and Linear Model Classification, with the same x axes.]

FIGURE 7.3. Expected prediction error (orange), squared bias (green) and variance (blue) for a simulated example. The top row is regression with squared error loss; the bottom row is classification with 0-1 loss. The models are k-nearest neighbors (left) and best subset regression of size $p$ (right). The variance and bias [...]


Optimism of the Training Error Rate

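The training error $\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f(x_i))$

is usually less than the true error $\mathrm{Err} = E[L(Y, \hat f(X))]$.

We keep the points $x_i$ fixed and take a new sample of $y$ at these points.

In-sample error: $\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_y E_{Y^{new}} L(Y_i^{new}, \hat f(x_i))$.

Optimism: $op \equiv \mathrm{Err}_{in} - E_y(\overline{\mathrm{err}})$.

For squared error, 0-1, and other loss functions: $op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i)$.

For a linear fit with $d = p$ inputs or basis functions and the additive error model $Y = f(X) + \varepsilon$, it simplifies to

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2, \qquad \mathrm{Err}_{in} = E_y \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\sigma_\varepsilon^2 = C_p = \mathrm{AIC}_{gauss}.$$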

[Figure 5.19 (Elements of Statistical Learning, 2nd Ed., Chap. 5), shown again: wavelet smoothing compared with smoothing splines on the NMR signal and on a simulated smooth function.]


The Optimism (Just for fun)

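$$y_i - \hat y_i = (y_i - f(x_i)) + (f(x_i) - E\hat f(x_i)) + (E\hat f(x_i) - \hat y_i).$$

We get six terms of the sum of squares:

$A_1 = \sum_i (y_i - f(x_i))^2$, $\quad D_1 = 2\sum_i (y_i - f(x_i))(f(x_i) - E\hat f(x_i))$,

$B = \sum_i (f(x_i) - E\hat f(x_i))^2$, $\quad E = 2\sum_i (f(x_i) - E\hat f(x_i))(E\hat f(x_i) - \hat y_i)$,

$C = \sum_i (E\hat f(x_i) - \hat y_i)^2$, $\quad F_1 = 2\sum_i (y_i - f(x_i))(E\hat f(x_i) - \hat y_i)$.

For the new sample $Y'$ we take the expectation over $Y'$:

$A_2 = \sum_i E_{Y'}(Y'_i - f(x_i))^2$, and similarly $D_2$, $F_2$.

$$\begin{aligned}
N(\mathrm{Err}_{in} - \overline{\mathrm{err}}) &= (A_2 + B + C + D_2 + E + F_2) - (A_1 + B + C + D_1 + E + F_1) \\
&= (A_2 - A_1) + (D_2 - D_1) + (F_2 - F_1).
\end{aligned}$$

$E(A_1) = E(A_2) = N\sigma_\varepsilon^2$.

$E(D_1) = 2\sum_i (E(y_i) - f(x_i))(f(x_i) - E\hat f(x_i)) = 0$ since $E(y_i) = f(x_i)$; $E(D_2) = 0$ as well.

$F_2 = 2\sum_i E_{Y'}\bigl[(Y'_i - f(x_i))(E\hat f(x_i) - \hat y_i)\bigr] = 0$, as $E(Y'_i) = f(x_i)$ and $Y'_i$ and $\hat y_i$ are independent.

$E(F_1) = -2\sum_{i=1}^{N} \mathrm{cov}(y_i, \hat y_i)$.

Hence $N \cdot E_y(\mathrm{Err}_{in} - \overline{\mathrm{err}}) = -E(F_1) = 2\sum_{i=1}^{N} \mathrm{cov}(y_i, \hat y_i)$.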


The Covariance (Just for fun)

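$$\mathrm{cov}(\hat y, y) = \mathrm{cov}\bigl(X(X^T X)^{-1}X^T y,\; y\bigr) = X(X^T X)^{-1}X^T \,\mathrm{cov}(y, y), \qquad \mathrm{cov}(y, y) = \sigma_\varepsilon^2 I.$$

The values $\mathrm{cov}(\hat y_i, y_i)$ are the diagonal values of the above matrix $\mathrm{cov}(\hat y, y)$.

Thus

$$\begin{aligned}
\sum_{i=1}^{N} \mathrm{cov}(\hat y_i, y_i) &= \mathrm{trace}\bigl(X(X^T X)^{-1}X^T\bigr)\,\sigma_\varepsilon^2 \\
&= \mathrm{trace}\bigl((X^T X)^{-1}X^T X\bigr)\,\sigma_\varepsilon^2 \\
&= \mathrm{trace}(I_d)\,\sigma_\varepsilon^2 = d\,\sigma_\varepsilon^2.
\end{aligned}$$

Similarly, for a linear smoother $\hat y = S_\lambda y$,

$$\sum_{i=1}^{N} \mathrm{cov}\bigl((S_\lambda y)_i, y_i\bigr) = \mathrm{trace}(S_\lambda)\,\sigma_\varepsilon^2.$$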
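The trace identity can be checked numerically in the simplest case. The sketch below assumes simple linear regression with an intercept ($d = 2$) and uses the standard closed form for the diagonal of the hat matrix, $S_{ii} = \frac{1}{N} + \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}$; this closed form and the sample inputs are not part of the slides.

```python
def hat_diagonal(xs):
    """Diagonal of X (X^T X)^{-1} X^T for a straight-line fit with intercept."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1.0 / n + (x - mean) ** 2 / sxx for x in xs]


xs = [0.0, 1.0, 2.0, 4.0, 7.0]          # hypothetical input points
leverages = hat_diagonal(xs)
effective_df = sum(leverages)            # trace of the hat matrix
print(leverages)
print(effective_df)                      # equals d = 2 up to rounding
```

The trace does not depend on where the points $x_i$ lie, only on the number of fitted parameters.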

https://www.waxworksmath.com/Authors/G_M/Hastie/WriteUp/Weatherwax_Epstein_Hastie_Solution_Manual.pdf


AIC Criterion for Gaussian error

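$$\sum_{i=1}^{N} \mathrm{Cov}(\hat y_i, y_i) = d\,\sigma_\varepsilon^2$$

$$\mathrm{AIC}(\alpha) = E_y \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\sigma_\varepsilon^2.$$

The effective number of parameters

for a linear fit $\hat y = Sy$ is $d(S) = \mathrm{trace}(S)$, the sum of the diagonal elements.

For 0-1 loss this does not hold in general, only as an approximation.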

[Figure 7.4 (Elements of Statistical Learning, 2nd Ed., Chap. 7), shown again: training error, test error and AIC versus the number of basis functions for log-likelihood loss (left) and 0-1 loss (right) in the phoneme recognition example.]


BIC Bayesian Information Criterion

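$$\begin{aligned}
P(M_m \mid Z) &= \frac{P(Z \mid M_m) \cdot P(M_m)}{P(Z)} \\
&\propto P(Z \mid M_m) \cdot P(M_m) \\
&\propto P(M_m) \cdot \int P(Z \mid \theta_m, M_m)\, P(\theta_m \mid M_m)\, d\theta_m.
\end{aligned}$$

A Laplace approximation to the integral (and some other simplifications) gives:

$$\begin{aligned}
\log P(Z \mid M_m) &= \log P(Z \mid \hat\theta_m, M_m) - \frac{d_m}{2} \cdot \log N + O(1) \\
&= \mathrm{loglik}_m - \frac{d_m}{2} \cdot \log N + O(1),
\end{aligned}$$

$$\mathrm{BIC}_m = -2\,\mathrm{loglik}_m + (\log N) \cdot d_m.$$

BIC may be used to compare the model posterior probabilities:

$$\frac{e^{-\frac{1}{2} \cdot \mathrm{BIC}_m}}{\sum_{\ell=1}^{M} e^{-\frac{1}{2} \cdot \mathrm{BIC}_\ell}}.$$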
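The BIC formula and the resulting posterior weights can be sketched as follows. The log-likelihoods, parameter counts and sample size are made-up numbers for illustration, and equal prior probabilities $P(M_m)$ are assumed.

```python
import math


def bic(loglik, n_params, n_obs):
    """BIC = -2 loglik + (log N) d."""
    return -2.0 * loglik + math.log(n_obs) * n_params


def posterior_from_bic(bics):
    """Approximate posterior model probabilities exp(-BIC/2) / sum exp(-BIC/2)."""
    best = min(bics)
    # Subtracting the minimum leaves the ratios unchanged and avoids underflow.
    weights = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical models fitted to N = 100 observations.
bics = [bic(-120.0, 2, 100), bic(-118.5, 3, 100), bic(-118.0, 6, 100)]
probs = posterior_from_bic(bics)
print(bics)
print(probs)
```

In this example the small gain in log-likelihood of the larger models does not pay for their $\log N$ penalty per parameter, so the smallest model gets the highest weight.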


Crossvalidation

Split the data into $K$ roughly equal-sized parts. Usually $K = 5$ or $10$, or $K = N$.

For $K = 10$ it is called tenfold cross-validation.

For $K = N$ it is called leave-one-out cross-validation.

$\kappa: \{1, \dots, N\} \to \{1, \dots, K\}$ is the partition function; it assigns each observation to its part.

$\hat f^{-k}$ is the function fitted with the $k$th part removed.

For $k = 1, \dots, K$:

fit the model to the other $K - 1$ parts of the data, and calculate the prediction error of the fitted model when predicting the $k$th part of the data.

Average the error estimates:

$$\mathrm{CV} = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f^{-\kappa(i)}(x_i)).$$

10 times 10 cross-validation: tenfold cross-validation repeated 10 times with different random partitions.
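The procedure can be sketched in plain Python. The interface is an assumption made for illustration: `fit` takes training inputs and targets and returns a prediction function, and `loss` takes the true value and the prediction; `fit_mean` is a deliberately trivial stand-in model.

```python
import random


def kfold_partition(n, k, seed=0):
    """Return kappa as a list: kappa[i] is the fold (0..k-1) of observation i."""
    kappa = [i % k for i in range(n)]
    random.Random(seed).shuffle(kappa)
    return kappa


def cross_validation(xs, ys, fit, loss, k, seed=0):
    """K-fold CV estimate: average loss of each point under the model
    fitted without that point's fold."""
    n = len(xs)
    kappa = kfold_partition(n, k, seed)
    total = 0.0
    for fold in range(k):
        train_x = [x for x, f in zip(xs, kappa) if f != fold]
        train_y = [y for y, f in zip(ys, kappa) if f != fold]
        predict = fit(train_x, train_y)      # the fit with this fold removed
        for x, y, f in zip(xs, ys, kappa):
            if f == fold:
                total += loss(y, predict(x))
    return total / n


def fit_mean(train_x, train_y):
    """Trivial model: always predict the mean of the training targets."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def squared(y, prediction):
    return (y - prediction) ** 2


xs = list(range(10))
ys = [2.0] * 10
cv_constant = cross_validation(xs, ys, fit_mean, squared, k=5)
print(cv_constant)
```

Setting `k` equal to the number of observations gives leave-one-out cross-validation; calling the function with several different seeds and averaging gives the repeated variant.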


Parameter Tuning by Crossvalidation

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, we define

$$\mathrm{CV}(\alpha) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)).$$

$\mathrm{CV}(\alpha)$ provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it.

Our final model is $f(x, \hat\alpha)$.

[Figure 7.9 (Elements of Statistical Learning, 2nd Ed., Chap. 7), shown again: prediction error (orange) and tenfold cross-validation curve (blue) versus subset size $p$, estimated from a single training set.]

Definition (One standard error rule)

We choose the most parsimonious model whose error is no more than one standard error above the error of the best model.

Here, $p = 9$.
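The one standard error rule can be sketched as a small selection function. It assumes that a smaller complexity value means a more parsimonious model; the error values and standard errors below are invented for illustration and are not read off Figure 7.9.

```python
def one_standard_error_rule(complexities, cv_errors, cv_std_errors):
    """Return the smallest complexity whose CV error is within one
    standard error of the minimum CV error."""
    best = min(range(len(cv_errors)), key=lambda i: cv_errors[i])
    threshold = cv_errors[best] + cv_std_errors[best]
    candidates = [c for c, e in zip(complexities, cv_errors) if e <= threshold]
    return min(candidates)


sizes = [1, 2, 3, 4, 5, 6]                         # hypothetical subset sizes
errors = [0.40, 0.30, 0.245, 0.24, 0.23, 0.235]    # hypothetical CV errors
std_errors = [0.03, 0.03, 0.02, 0.02, 0.02, 0.02]  # hypothetical standard errors
chosen = one_standard_error_rule(sizes, errors, std_errors)
print(chosen)
```

The minimum of the curve is at size 5, but the rule prefers the smaller model whose error is statistically indistinguishable from it.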


Generalized Crossvalidation

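For a linear fitting method we can write $\hat y = Sy$.

For many linear fitting methods,

$$\frac{1}{N}\sum_{i=1}^{N} [y_i - \hat f^{-i}(x_i)]^2 = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\right]^2,$$

so the leave-one-out error can be computed from a single fit to all the data. The generalized cross-validation (GCV) approximation is

$$\mathrm{GCV} = \frac{1}{N}\sum_{i=1}^{N} \left[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\right]^2.$$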
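The leave-one-out identity can be verified numerically. The sketch below assumes a straight-line least-squares fit, for which $S_{ii} = \frac{1}{N} + \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}$ (a standard closed form, not taken from the slides); the data are made up.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope of a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def loo_brute_force(xs, ys):
    """Leave-one-out error by refitting N times."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)


def loo_shortcut_and_gcv(xs, ys):
    """Leave-one-out error from one fit via S_ii, and the GCV approximation."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    s_diag = [1.0 / n + (x - mx) ** 2 / sxx for x in xs]
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    loo = sum((r / (1.0 - s)) ** 2 for r, s in zip(resid, s_diag)) / n
    mean_s = sum(s_diag) / n                 # trace(S) / N
    gcv = sum((r / (1.0 - mean_s)) ** 2 for r in resid) / n
    return loo, gcv


xs = [0.0, 1.0, 2.0, 3.0, 5.0, 8.0]
ys = [1.0, 2.9, 5.2, 6.8, 11.1, 17.3]
loo, gcv = loo_shortcut_and_gcv(xs, ys)
print(loo, loo_brute_force(xs, ys), gcv)
```

GCV replaces each $S_{ii}$ by their average, so it differs from the exact leave-one-out value whenever the leverages are unequal.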


Bad and good Crossvalidation

$p = 500$ dimensions, $N = 20$ samples,

random data, $y$ independent of $x$.

We search for the best "decision stump" (a decision rule based on a single attribute).

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.11. Four panels: Error on Full Training Set versus Predictor; Error on 1/5 versus Error on 4/5; Class Label versus Predictor 436 (blue), with the full and 4/5 split points marked; CV Errors.]

FIGURE 7.11. Simulation study to investigate the performance of cross validation in a high-dimensional problem where the predictors are independent of the class labels. The top-left panel shows the number of errors made by individual stump classifiers on the full training set (20 observations). The top right panel shows the errors made by individual stumps trained on a random split of the dataset into 4/5ths (16 observations) and tested on the remaining 1/5th (4 observations). The best performers are depicted by colored dots in each panel. The bottom left panel shows the effect of [...]

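Wrong way

Choose 100 good predictors using all of the samples.

Using just this subset of predictors, build a multivariate classifier, using all of the samples except those in fold $k$.

Use the classifier to predict the class labels for the samples in fold $k$.

This is wrong because the predictors were chosen after seeing the labels of all samples, including those in fold $k$; the selection has to be repeated inside each fold, without the held-out samples.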

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.10. Two histograms, "Wrong way" (upper) and "Right way" (lower). x axis: Correlations of Selected Predictors with Outcome; y axis: Frequency.]

FIGURE 7.10. Cross-validation the wrong and right way: histograms show the correlation of class labels, in 10 randomly chosen samples, with the 100 predictors chosen using the incorrect (upper red) and correct (lower green) versions of cross-validation.


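Leave-one-out error estimate

We hide one sample, train the model on the remaining ones, and check whether the hidden sample is predicted correctly.

We repeat this for every sample and average.

Advantages:

We have a large training set.

The computation is deterministic (so there is no point in repeating it).

Disadvantages:

Computationally expensive.

It can occasionally fail: e.g., for a target generated at random 50/50, the leave-one-out error can be 100% (for instance with a majority-class predictor on a perfectly balanced sample, where removing one sample always makes the other class the majority).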

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.8. x axis: Size of Training Set; y axis: 1-Err.]

FIGURE 7.8. Hypothetical learning curve for a classifier on a given task: a plot of $1 - \mathrm{Err}$ versus the size of the training set $N$. With a dataset of 200 observations, 5-fold cross-validation would use training sets of size 160, which would behave much like the full set. However, with a dataset of 50 observations fivefold cross-validation would use training sets of size 40, and this would result in a considerable overestimate of prediction error.


Bootstrap

We sample the training set with replacement.

We have $N$ data points and draw $N$ samples with replacement; some are drawn several times, so some are not drawn at all. The latter are used for testing.

The probability that a given sample is not drawn is $\left(1 - \frac{1}{N}\right)^N \approx e^{-1} = 0.368$.

The error estimate is rather pessimistic, because we train on $N$ samples that in fact come from only about 0.632 of the data; therefore the final error estimate is corrected by the error on the training data:

$$\text{error} = 0.632 \cdot e_{\text{test}} + 0.368 \cdot e_{\text{train}}.$$

Again it can be led astray, e.g., by the same data as leave-one-out.
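The constant 0.368 can be checked both analytically and by simulation. This is a small sketch; the function names, sample sizes and seed are arbitrary choices.

```python
import math
import random


def prob_not_drawn(n):
    """Probability that a given sample is missed by n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def out_of_bag_fraction(n, seed=0):
    """Fraction of samples not drawn in one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


def corrected_error(e_test, e_train):
    """The 0.632 combination of test (out-of-bag) and training error."""
    return 0.632 * e_test + 0.368 * e_train


for n in (10, 100, 10000):
    print(n, prob_not_drawn(n))
print(math.exp(-1))
print(out_of_bag_fraction(1000))
```

The analytic probability approaches $e^{-1}$ quickly as $N$ grows, and a single simulated bootstrap sample already leaves out roughly that fraction of the data.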

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 7

[Figure 7.12. Diagram: training sample $Z = (z_1, z_2, \dots, z_N)$; bootstrap samples $Z^{*1}, Z^{*2}, \dots, Z^{*B}$; bootstrap replications $S(Z^{*1}), S(Z^{*2}), \dots, S(Z^{*B})$.]

FIGURE 7.12. Schematic of the bootstrap process. We wish to assess the statistical accuracy of a quantity $S(Z)$ computed from our dataset. $B$ training sets $Z^{*b}$, $b = 1, \dots, B$, each of size $N$, are drawn with replacement from the original dataset. The quantity of interest $S(Z)$ is computed from each bootstrap training set, and the values $S(Z^{*1}), \dots, S(Z^{*B})$ are used to assess the statistical accuracy of $S(Z)$.


Comparing different types of models

Say we trained decision trees and k-NN classifiers, ran 10x10 cross-validation, and obtained two numbers. Can we conclude that one type of model is better than the other on the given data?

If the values are close, then no; the difference may be due to chance.

For classification, e.g., McNemar's test: the statistic

$$\frac{(|\mathrm{err}_A - \mathrm{err}_B| - 1)^2}{\mathrm{err}_A + \mathrm{err}_B}$$

is compared with the $\chi^2$ distribution with 1 degree of freedom,

where $\mathrm{err}_A$ denotes the number of samples where A is wrong and B is correct (and $\mathrm{err}_B$ the reverse).
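McNemar's statistic with the continuity correction can be computed directly from the two disagreement counts. In this sketch the counts are made up, and 3.841 is the 95% quantile of the $\chi^2$ distribution with 1 degree of freedom.

```python
def mcnemar_statistic(a_wrong_b_right, b_wrong_a_right):
    """(|err_A - err_B| - 1)^2 / (err_A + err_B) from the disagreement counts."""
    n_disagree = a_wrong_b_right + b_wrong_a_right
    if n_disagree == 0:
        return 0.0                      # the classifiers never disagree
    return (abs(a_wrong_b_right - b_wrong_a_right) - 1) ** 2 / n_disagree


CHI2_1DF_95 = 3.841   # 95% quantile of chi-squared with 1 degree of freedom

# Hypothetical counts: A wrong and B right on 25 samples, the reverse on 10.
stat = mcnemar_statistic(25, 10)
significant = stat > CHI2_1DF_95
print(stat, significant)
```

Samples on which both classifiers are right, or both are wrong, do not enter the statistic at all.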

For regression

Statistics suggests using the t-test; if we can measure the error on the same splits into training and test data, then the paired t-test, which is more precise.

After normalization by the variance, the difference of the individual estimates follows Student's distribution with $N - 1$ degrees of freedom:

$$\frac{\bar d}{\sqrt{\sigma_d^2 / (N - 1)}}.$$

We use a two-sided test of whether the mean difference $\bar d$ is zero, because we do not know in advance which model is better, i.e., on which side we would fall out of the interval.

The unpaired test has the variance estimate $\frac{\sigma_x^2}{k} + \frac{\sigma_y^2}{l}$.


Evaluating numeric prediction (regression)

Here $p_i$ are the predicted values, $a_i$ the actual values, and $n$ the number of samples.

mean squared error: $\frac{\sum_i (p_i - a_i)^2}{n}$

root mean squared error: $\sqrt{\frac{\sum_i (p_i - a_i)^2}{n}}$

mean absolute error: $\frac{\sum_i |p_i - a_i|}{n}$

The measures above can be applied either to absolute differences or to relative errors; the following ones, except for the correlation, are relative to always predicting the most probable value.

relative squared error: $\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar a)^2}$, where $\bar a = \frac{1}{n}\sum_i a_i$

root relative squared error: $\sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar a)^2}}$

relative absolute error: $\frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar a|}$

correlation coefficient: $\frac{S_{PA}}{\sqrt{S_P S_A}}$,

where $S_{PA} = \frac{\sum_i (p_i - \bar p)(a_i - \bar a)}{n - 1}$, $S_P = \frac{\sum_i (p_i - \bar p)^2}{n - 1}$, $S_A = \frac{\sum_i (a_i - \bar a)^2}{n - 1}$.

For the correlation coefficient, higher is better, unlike for the previous measures.
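The measures can be computed together in one pass over the predictions. This is a minimal sketch; the dictionary keys and the four sample values are illustrative.

```python
import math


def regression_metrics(p, a):
    """Error measures for predicted values p and actual values a."""
    n = len(a)
    a_bar = sum(a) / n
    p_bar = sum(p) / n
    sq = sum((pi - ai) ** 2 for pi, ai in zip(p, a))
    ab = sum(abs(pi - ai) for pi, ai in zip(p, a))
    sq_base = sum((ai - a_bar) ** 2 for ai in a)    # error of predicting the mean
    ab_base = sum(abs(ai - a_bar) for ai in a)
    s_pa = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a)) / (n - 1)
    s_p = sum((pi - p_bar) ** 2 for pi in p) / (n - 1)
    s_a = sq_base / (n - 1)
    return {
        "mse": sq / n,
        "rmse": math.sqrt(sq / n),
        "mae": ab / n,
        "rse": sq / sq_base,
        "rrse": math.sqrt(sq / sq_base),
        "rae": ab / ab_base,
        "corr": s_pa / math.sqrt(s_p * s_a),
    }


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(predicted, actual)
for name, value in metrics.items():
    print(name, value)
```

A relative error below 1 means the model does better than always predicting the mean of the actual values. The correlation is undefined when all predictions are identical, which the sketch does not guard against.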


Evaluating classification performance

The basic statistics reported when evaluating classification models (e.g., in Weka) are derived from the confusion matrix:

true class \ classification    +                      -
+                              TP (true positive)     FN (false negative)
-                              FP (false positive)    TN (true negative)

In Czech these are called správně/falešně positivní/negativní.

Basic measures:

accuracy: $\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$

error: $\mathrm{Err} = \frac{FP + FN}{TP + TN + FP + FN}$

precision: $\mathrm{Prec} = \frac{TP}{TP + FP}$

recall, sensitivity: $\mathrm{Rec} = \frac{TP}{TP + FN}$

specificity: $\mathrm{Specificity} = \frac{TN}{TN + FP}$

F measure: $F = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$

TP rate: $\frac{TP}{TP + FN}$

FP rate (= 1 - Specificity): $\frac{FP}{FP + TN}$

Different fields report different measures: medicine uses sensitivity and specificity; document retrieval uses precision, recall and the F measure, or the 3-point average recall at precision 20, 50 and 80 percent, or the 11-point average for 0, 10, 20, ..., 90, 100 percent.

marketing: TP versus $\frac{TP + FP}{TP + TN + FP + FN} \times 100\%$
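The basic measures follow directly from the four cells of the confusion matrix. This is a small sketch with made-up counts; the function and key names are illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Basic measures computed from the cells of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,                  # also sensitivity and TP rate
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * recall / (precision + recall),
        "fp_rate": fp / (fp + tn),
    }


# Hypothetical confusion matrix.
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(name, value)
```

The sketch assumes that at least one example is predicted positive and at least one is actually positive; otherwise precision or recall is undefined.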


ROC curve

ROC = receiver operating characteristic, a term used for characterizing the detection of a received signal.

x axis = TP (true positives) in percent, y axis = FP (false positives) in percent,

i.e., each positively classified example moves the curve by 1, either up (true positive) or to the right (false positive). In practice we average over several tests.

If we have several tests whose ROC curves intersect, the tests can be combined so as to reach the boundary of the convex hull of the ROC curves.

!! The axis labels are swapped !!
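The step-by-step construction can be sketched as follows. The sketch uses the usual convention in which a true positive moves the curve up and a false positive moves it to the right, so each point is (FP rate, TP rate); the scores and labels are made up, and ties between scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return the ROC curve as a list of (FP rate, TP rate) points.

    Examples are sorted by decreasing score; labels are 1 (positive)
    or 0 (negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1          # step up
        else:
            fp += 1          # step to the right
        points.append((fp / n_neg, tp / n_pos))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # hypothetical classifier scores
labels = [1, 1, 0, 1, 0, 0]                # hypothetical true classes
points = roc_points(scores, labels)
for point in points:
    print(point)
```

A curve that climbs to the top before moving right corresponds to a ranking that places all positives ahead of all negatives.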


Precision-recall curve

The x axis is as before; the y axis shows precision, i.e., TP over TP+FP in percent. The best place is then the top right.

!! The axis labels are swapped !!


Cost of false positives and false negatives

Example: predicting ovulation in cows. 97% of days are infertile, so I always predict "infertile" unless missing a fertile day is given a different cost from falsely labelling a day as fertile.

Another example: advertising. Advertisers do not want to miss a customer, while a needlessly sent leaflet does little harm.

A two-by-two table, in general a confusion matrix for a multi-valued target attribute.

Cost-sensitive learning (learning with different error costs)

Data to which we should be more sensitive are given a larger weight (e.g., we repeat them several times), and then we train an ordinary model without cost sensitivity.
