Kernel Methods

Kernel methods estimate the regression function $f(x) \in \mathbb{R}$ by fitting

a different but simple model separately at each query point $x_0$. The resulting $\hat f(X)$ is smooth in $\mathbb{R}^p$. Localization is achieved via a weighting function or kernel $K_\lambda(x_0, x_i)$, which

assigns a weight to $x_i$ based on its distance from $x_0$.

$\lambda$ is a parameter that dictates the width of the neighbourhood.

These are memory-based methods that need little or no training:

the model is the entire training data set.

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 6

[Figure 6.1: two scatter-plot panels, "Nearest-Neighbor Kernel" (left) and "Epanechnikov Kernel" (right), each marking the target point $x_0$ and the fit $\hat f(x_0)$.]

FIGURE 6.1. In each panel 100 pairs $x_i, y_i$ are generated at random from the blue curve with Gaussian errors: $Y = \sin(4X) + \varepsilon$, $X \sim U[0, 1]$, $\varepsilon \sim N(0, 1/3)$. In the left panel the green curve is the result of a 30-nearest-neighbor running-mean smoother. The red point is the fitted constant $\hat f(x_0)$, and the red circles indicate those observations contributing to the fit at $x_0$. The solid yellow region indicates the weights assigned to observations. In the right panel, the green curve is the kernel-weighted average, using an Epanechnikov kernel with (half) window width $\lambda = 0.2$.

Machine Learning Kernel Methods, Basis Expansion and regularization 4 141 - 172 March 23, 2021 77 / 315


k-NN, Epanechnikov Kernel

k-nearest-neighbour kernel: $N_k(x)$ is the set of the $k$ points nearest to $x$ in squared distance; all of them have equal weight.

$$\hat f(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

$\hat f(x)$ is bumpy and discontinuous.

Nadaraya-Watson kernel-weighted average

$$\hat f(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$$

with the Epanechnikov quadratic kernel

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right), \qquad D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1, \\ 0 & \text{otherwise.} \end{cases}$$
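The two estimators above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the random seed and the simulated data (chosen to resemble Figure 6.1) are assumptions, not part of the slides.

```python
import numpy as np

def epanechnikov(t):
    """D(t) = 3/4 (1 - t^2) for |t| <= 1, and 0 otherwise."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def knn_mean(x0, x, y, k):
    """Equal-weight average of the y_i whose x_i are the k nearest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

def nadaraya_watson(x0, x, y, lam):
    """Kernel-weighted average at x0 with (half) window width lam."""
    w = epanechnikov(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

# Simulated data: Y = sin(4X) + eps, X ~ U[0, 1], eps ~ N(0, 1/3).
rng = np.random.default_rng(0)  # arbitrary seed
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), 100)

print("30-NN running mean at 0.5:", knn_mean(0.5, x, y, k=30))
print("Nadaraya-Watson at 0.5:   ", nadaraya_watson(0.5, x, y, lam=0.2))
```

Evaluating either function over a grid of $x_0$ values traces out the green curves of the figure; the k-NN curve jumps whenever the neighbour set changes, while the kernel-weighted curve varies continuously.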


Kernels - variable width, shapes

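The width may vary with $x_0$: $h_\lambda(x_0)$.

A more general formula for the kernel is

$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{h_\lambda(x_0)}\right)$$

For k-NN, $h_k(x_0) = |x_0 - x_{[k]}|$, where $x_{[k]}$ is the $k$-th closest $x_i$ to $x_0$.

Tri-cube kernel:

$$D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Gaussian kernel:

$$K_\lambda(x_0, x) = \frac{1}{\lambda}\, e^{-\|x - x_0\|^2 / (2\lambda)}$$

Epanechnikov kernel:

$$D(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1, \\ 0 & \text{otherwise.} \end{cases}$$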
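As a small illustration of a variable width, the sketch below computes the k-NN width $h_k(x_0)$ and the resulting tri-cube weights. The helper names and the toy data are assumptions made for this example.

```python
import numpy as np

def tricube(t):
    """D(t) = (1 - |t|^3)^3 for |t| <= 1, and 0 otherwise."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= 1.0, (1.0 - t**3) ** 3, 0.0)

def knn_width(x0, x, k):
    """h_k(x0): distance from x0 to its k-th closest training point."""
    return np.sort(np.abs(x - x0))[k - 1]

def kernel_weights(x0, x, D, h):
    """K(x0, x_i) = D(|x_i - x0| / h) for every training point x_i."""
    return D(np.abs(x - x0) / h)

x = np.array([0.0, 0.1, 0.15, 0.4, 0.9, 1.0])
h = knn_width(0.12, x, k=3)          # adaptive width at x0 = 0.12
w = kernel_weights(0.12, x, tricube, h)
print("h_3(0.12) =", h)
print("weights   =", w)
```

With the tri-cube kernel the k-th neighbour itself sits on the edge of the support and receives weight zero, so only the k - 1 closer points contribute.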

[Figure 6.2: the Epanechnikov, tri-cube and Gaussian kernels $K_\lambda(x_0, x)$ plotted on the same axes.]

FIGURE 6.2. A comparison of three popular kernels for local smoothing. Each has been calibrated to integrate to 1. The tri-cube kernel is compact and has two continuous derivatives at the boundary of its support, while the Epanechnikov kernel has none. The Gaussian kernel is continuously differentiable, but has infinite support.


Local Linear Regression

Locally weighted averages can be badly biased on the boundaries of the domain, or whenever the $X$ values are not equally spaced.

Fitting straight lines may help (a bit).

Locally weighted regression solves, at each target point $x_0$,

$$\min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, [y_i - \alpha(x_0) - \beta(x_0) x_i]^2$$

The estimate is $\hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0)\, x_0$.

With each input augmented as $x^T \to (1, x)$, $X$ is the $N \times (p + 1)$ regression matrix and $W(x_0)$ is the $N \times N$ diagonal matrix with $i$-th diagonal element $K_\lambda(x_0, x_i)$.

Then

$$\hat f(x_0) = x_0^T \left(X^T W(x_0) X\right)^{-1} X^T W(x_0)\, y,$$

which is a linear function of $y$.
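The closed form above translates directly into a weighted least-squares solve. The following is a minimal sketch for one-dimensional $x$ with a tri-cube kernel; the function names and the noise-free test data are assumptions chosen to make the boundary bias visible.

```python
import numpy as np

def tricube(t):
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= 1.0, (1.0 - t**3) ** 3, 0.0)

def local_constant(x0, x, y, lam):
    """Kernel-weighted average (Nadaraya-Watson) at x0."""
    w = tricube(np.abs(x - x0) / lam)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, lam):
    """Kernel-weighted straight-line fit, evaluated at x0."""
    w = tricube(np.abs(x - x0) / lam)
    X = np.column_stack([np.ones_like(x), x])   # rows (1, x_i)
    XtW = X.T * w                               # X^T W(x0) with diagonal W
    alpha, beta = np.linalg.solve(XtW @ X, XtW @ y)
    return alpha + beta * x0

# Noise-free linear truth, evaluated at the left boundary x0 = 0.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x
print("local constant at 0:", local_constant(0.0, x, y, 0.3))  # biased upwards
print("local linear at 0:  ", local_linear(0.0, x, y, 0.3))    # recovers f(0) = 2
```

Because all neighbours of the boundary point lie on one side, the weighted average is pulled towards their higher mean, whereas the local line removes this bias to first order.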

[Figure 6.3: two panels, "N-W Kernel at Boundary" (left) and "Local Linear Regression at Boundary" (right), each marking the target point $x_0$ and the fit $\hat f(x_0)$.]

FIGURE 6.3. The locally weighted average has bias problems at or near the boundaries of the domain. The true function is approximately linear here, but most of the observations in the neighborhood have a higher mean than the target point, so despite weighting, their mean will be biased upwards. By fitting a locally weighted linear regression (right panel), this bias is removed to first order.


Local Polynomial Regression

Local linear fits can reduce bias dramatically at the boundaries.

Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.

It is recommended to select the degree according to the application, rather than combining linear fits at the boundaries with quadratic fits in the interior.
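The bias gained by raising the degree is paid for in variance. Since a local polynomial fit is linear in $y$, $\hat f(x_0) = l(x_0)^T y$, and its variance is proportional to $\|l(x_0)\|^2$. The sketch below computes $l(x_0)$ for degrees 0, 1 and 2; the function names, the equally spaced grid and the bandwidth are assumptions for illustration.

```python
import numpy as np

def tricube(t):
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= 1.0, (1.0 - t**3) ** 3, 0.0)

def equivalent_kernel(x0, x, lam, degree):
    """Weights l(x0) with f_hat(x0) = l(x0) @ y for a local polynomial fit."""
    w = tricube(np.abs(x - x0) / lam)
    B = np.vander(x, degree + 1, increasing=True)            # columns 1, x, ..., x^d
    b0 = np.vander(np.array([x0]), degree + 1, increasing=True)[0]
    BtW = B.T * w                                            # B^T W(x0)
    return b0 @ np.linalg.solve(BtW @ B, BtW)

x = np.linspace(0.0, 1.0, 101)
for d in (0, 1, 2):
    v_boundary = np.sum(equivalent_kernel(0.0, x, 0.2, d) ** 2)
    v_interior = np.sum(equivalent_kernel(0.5, x, 0.2, d) ** 2)
    print(f"degree {d}: ||l(0)||^2 = {v_boundary:.3f}, ||l(0.5)||^2 = {v_interior:.3f}")
```

On this grid the variance at the boundary grows with the degree, which is the trade-off to weigh when choosing between local linear and local quadratic fits.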

[Figure 6.6: plot titled "Variance" with one curve each for the Constant, Linear and Quadratic fits.]

FIGURE 6.6. The variance functions $\|l(x)\|^2$ for local constant, linear and quadratic regression, for a metric bandwidth ($\lambda = 0.2$) tri-cube kernel.

[Figure 6.5: two panels, "Local Linear in Interior" (left) and "Local Quadratic in Interior" (right), each marking the fit $\hat f(x_0)$.]

FIGURE 6.5. Local linear fits exhibit bias in regions of curvature of the true function. Local quadratic fits tend to eliminate this bias.


Selecting the Width of the Kernel

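The width is selected by cross-validation. The fit is linear in $y$: $\hat f = S_\lambda y$.

The effective degrees of freedom are $\mathrm{df} = \mathrm{trace}(S_\lambda)$.

Right: comparison of the equivalent kernels of a tri-cube local linear regression (orange) and a smoothing spline (blue) with matching degrees of freedom, 5.86.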
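A sketch of this selection procedure for the Nadaraya-Watson smoother is given below. Several choices here are assumptions rather than the slides' method: a Gaussian kernel with standard deviation `lam`, the candidate grid of widths, the simulated data, and the use of the standard leave-one-out identity $(y_i - \hat f(x_i)) / (1 - S_{ii})$, which holds for this kernel-weighted average.

```python
import numpy as np

def nw_smoother_matrix(x, lam):
    """S_lambda for a Nadaraya-Watson smoother (Gaussian kernel, sd = lam)."""
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (d / lam) ** 2)
    return K / K.sum(axis=1, keepdims=True)     # every row sums to 1

def loocv_score(x, y, lam):
    """Mean squared leave-one-out residual, via (y_i - f_hat_i) / (1 - S_ii)."""
    S = nw_smoother_matrix(x, lam)
    resid = (y - S @ y) / (1.0 - np.diag(S))
    return np.mean(resid ** 2)

rng = np.random.default_rng(0)  # arbitrary seed
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), 100)

lams = [0.02, 0.05, 0.1, 0.2, 0.5]  # hypothetical candidate widths
for lam in lams:
    df = np.trace(nw_smoother_matrix(x, lam))
    print(f"lambda = {lam}: df = {df:.2f}, LOOCV = {loocv_score(x, y, lam):.4f}")
best = min(lams, key=lambda lam: loocv_score(x, y, lam))
print("selected width:", best)
```

The trace falls as the width grows, so reporting df alongside the cross-validation score makes smoothers with different kernels comparable.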

[Figure 6.7: equivalent kernels plotted as point clouds at several target points.]

FIGURE 6.7. Equivalent kernels for a local linear regression smoother (tri-cube kernel; orange) and a smoothing spline (blue), with matching degrees of freedom. The vertical spikes indicate the target points.


(Structured Local Regression in $\mathbb{R}^p$)

$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{h_\lambda(x_0)}\right)$$

Structured local regression: a positive semidefinite matrix $A$ weighs the different coordinates:

$$K_\lambda(x_0, x) = D\!\left(\frac{(x - x_0)^T A\, (x - x_0)}{h_\lambda(x_0)}\right)$$
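To see what the matrix $A$ does, the sketch below evaluates the structured kernel for two choices of $A$. The tri-cube shape for $D$, the constant width `h` and the toy points are assumptions for this example.

```python
import numpy as np

def tricube(t):
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= 1.0, (1.0 - t**3) ** 3, 0.0)

def structured_kernel(x0, x, A, h):
    """D((x_i - x0)^T A (x_i - x0) / h) for each row x_i of x."""
    diff = x - x0
    q = np.einsum("ij,jk,ik->i", diff, A, diff)   # one quadratic form per row
    return tricube(q / h)

x0 = np.zeros(2)
pts = np.array([[0.5, 0.0],    # displaced along coordinate 1
                [0.0, 0.5]])   # displaced along coordinate 2

A_iso = np.eye(2)              # both coordinates count equally
A_down = np.diag([1.0, 0.0])   # coordinate 2 is ignored entirely

print("isotropic: ", structured_kernel(x0, pts, A_iso, 1.0))
print("structured:", structured_kernel(x0, pts, A_down, 1.0))
```

Setting a diagonal entry of $A$ to zero makes the kernel blind to that coordinate, so points that differ only in it receive full weight.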

[Figure 6.8: two three-dimensional plots with axes East-West, South-North and Velocity.]

FIGURE 6.8. The left panel shows three-dimensional data, where the response is the velocity measurements on a galaxy, and the two predictors record positions on the celestial sphere. The unusual "star"-shaped design indicates the way the measurements were made, and results in an extremely irregular boundary. The right panel shows the results of local linear regression smoothing in $\mathbb{R}^2$, using a nearest-neighbor window with 15% of the data.


Computational Consideration

The model is the entire training data set.

The fitting is done at evaluation or prediction time.

Fitting at a single observation $x_0$ costs $O(N)$.

By comparison, an expansion in $M$ basis functions costs $O(M)$ for one evaluation, and typically $M \sim O(\log N)$.

Basis function methods have an initial cost of at least $O(NM^2 + M^3)$.

The smoothing parameter $\lambda$ is usually determined off-line by cross-validation, at a cost of $O(N^2)$.

Popular implementations of local regression, such as loess in S-PLUS, compute the fit exactly at $M$ locations, at cost $O(NM)$, and interpolate to obtain the fit elsewhere ($O(M)$ per evaluation).


Basis expansion and regularization

Linear and logistic regression assume a linear function of $X$.

Regression: we estimate $f(X) = E(Y \mid X)$.

Classification: we estimate $\log \dfrac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}$.

Linear basis expansion in $X$:

we augment or replace the vector of inputs $X$ with additional variables $h_m$, $h_m(X): \mathbb{R}^p \to \mathbb{R}$, $m = 1, \ldots, M$.

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

"The only change" is a different matrix of features; the rest of the fit is the same.


Simple derived features

We fit the model:

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X)$$

$h_m(X) = X_m$, $m = 1, \ldots, M$, recovers the original linear model.

$h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$: polynomial terms to achieve higher-order Taylor expansions.

Warning: the number of variables grows exponentially in the degree of the polynomial.

$h_m(X) = \log(X_j)$, $\sqrt{X_j}$, $\|X\|$, ...: other nonlinear transformations.

$h_m(X) = I(L_m \le X_k < U_m)$: an indicator for a region of $X_k$, giving a piecewise-constant contribution for $X_k$.

With non-overlapping regions, these are used in regression trees.

$h_m(X) = \max((X_j - \xi_k)^3, 0)$: piecewise-polynomial spline basis; wavelet bases.
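Once the derived features are stacked into a matrix, the fit is ordinary least squares. The sketch below illustrates this with a hypothetical basis for two inputs; the particular basis functions, interval bounds and simulated response are assumptions for the example.

```python
import numpy as np

def basis_matrix(X, basis):
    """Stack h_1(X), ..., h_M(X) as the columns of the N x M feature matrix."""
    return np.column_stack([h(X) for h in basis])

# Hypothetical basis for inputs X[:, 0] and X[:, 1].
basis = [
    lambda X: np.ones(len(X)),                                       # intercept
    lambda X: X[:, 0],                                               # original inputs
    lambda X: X[:, 1],
    lambda X: X[:, 0] ** 2,                                          # squared term
    lambda X: X[:, 0] * X[:, 1],                                     # cross product
    lambda X: ((0.25 <= X[:, 1]) & (X[:, 1] < 0.75)).astype(float),  # region indicator
]

rng = np.random.default_rng(1)  # arbitrary seed
X = rng.uniform(0.0, 1.0, (200, 2))
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] + rng.normal(0.0, 0.1, 200)

H = basis_matrix(X, basis)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # same least-squares fit as before
print("estimated coefficients:", np.round(beta, 2))
```

Only the construction of `H` depends on the chosen basis; the solver call is the one used for a plain linear model.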


Piecewise Polynomials and Splines

For most of today we assume a one-dimensional feature $X$.

A piecewise polynomial function $f(X)$ is obtained by

dividing the domain of $X$ into contiguous intervals by the knots $\xi_1, \ldots, \xi_{M-1}$

and representing $f$ by a separate polynomial in each interval.

Examples (two knots $\xi_1, \xi_2$):

Three basis functions (piecewise constant):

$$h_1(X) = I(X < \xi_1), \quad h_2(X) = I(\xi_1 \le X < \xi_2), \quad h_3(X) = I(\xi_2 \le X)$$

Additional linear functions:

$$h_{m+3}(X) = h_m(X)\, X, \quad m = 1, \ldots, 3$$

Additional quadratic and cubic functions:

$$h_{m+6}(X) = h_m(X)\, X^2, \quad h_{m+9}(X) = h_m(X)\, X^3, \quad m = 1, \ldots, 3$$

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 5

[Figure 5.1: four panels, "Piecewise Constant", "Piecewise Linear", "Continuous Piecewise Linear" and "Piecewise-linear Basis Function", with the knots $\xi_1$ and $\xi_2$ marked; the last panel shows $(X - \xi_1)_+$.]

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots $\xi_1$ and $\xi_2$. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise-linear basis function, $(X - \xi_1)_+$.

[Figure 5.2: four panels under the title "Piecewise Cubic Polynomials": "Discontinuous", "Continuous", "Continuous First Derivative" and "Continuous Second Derivative", with the knots $\xi_1$ and $\xi_2$ marked.]

FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of continuity.


Continuous functions

We add the continuity restriction: the function must take a single value at each knot $\xi_j$.

Continuous piecewise linear basis:

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+$$

We have saved two parameters, one for each of the two continuity conditions.
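The four-function basis can be checked numerically: any continuous piecewise-linear function with breaks at the two knots should be reproduced exactly. In the sketch below the knot positions, the grid and the target function $|x - \xi_1|$ are assumptions chosen for illustration.

```python
import numpy as np

def positive_part(u):
    """(u)_+ = max(u, 0)."""
    return np.maximum(u, 0.0)

def continuous_pw_linear_basis(x, xi1, xi2):
    """Columns h_1 = 1, h_2 = x, h_3 = (x - xi1)_+, h_4 = (x - xi2)_+."""
    return np.column_stack([
        np.ones_like(x), x, positive_part(x - xi1), positive_part(x - xi2),
    ])

xi1, xi2 = 1.0 / 3.0, 2.0 / 3.0
x = np.linspace(0.0, 1.0, 200)
B = continuous_pw_linear_basis(x, xi1, xi2)

# |x - xi1| = xi1 - x + 2 (x - xi1)_+, so the exact coefficients are (xi1, -1, 2, 0).
coef, *_ = np.linalg.lstsq(B, np.abs(x - xi1), rcond=None)
print("coefficients:", np.round(coef, 6))
```

The unrestricted piecewise-linear fit would need six columns (three indicators and three indicator-times-$X$ terms); the continuity conditions reduce this to four.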


For the cubic fit, the figure looks ugly; we need continuous first and second derivatives.


Cubic spline

A cubic spline is a piecewise-cubic fit with continuous first and second derivatives at the knots $\xi_i$.

The basis functions with knots $\xi_1, \xi_2$ are:

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = X^2, \quad h_4(X) = X^3, \quad h_5(X) = (X - \xi_1)^3_+, \quad h_6(X) = (X - \xi_2)^3_+$$


Parameter count:

(3 regions) × (4 parameters per region) − (2 knots) × (3 constraints per knot) = 6.


Order-M splines

A cubic spline is an order-4 spline.

Generally, an order-$M$ spline with knots $\xi_j$, $j = 1, \ldots, K$, is a

piecewise polynomial of order $M$ that has continuous derivatives up to order $M - 2$.

The general truncated power basis functions are:

$$h_j(X) = X^{j-1}, \quad j = 1, \ldots, M, \qquad h_{M+\ell}(X) = (X - \xi_\ell)^{M-1}_+, \quad \ell = 1, \ldots, K$$

Regression splines are splines with fixed knots,

usually placed at percentiles of the data $X$.

The number of knots is determined by the order and the degrees of freedom ($\mathrm{df} - M$); $h_0$ does not count.

B-splines use a different basis that spans the same linear function space.

B-splines are more stable numerically, which is useful for a large number of knots $K$.
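The truncated power basis is easy to build directly, which makes a regression spline a plain least-squares problem. The following is a minimal sketch; the function name, the percentile choice for the knots and the test function are assumptions, and the helper is only meaningful for $M \ge 2$ (for $M = 1$ the zero-th power would need an explicit indicator).

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Columns x^(j-1), j = 1..M, followed by (x - xi_l)_+^(M-1), l = 1..K."""
    x = np.asarray(x, dtype=float)
    poly = [x ** j for j in range(M)]
    trunc = [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]
    return np.column_stack(poly + trunc)

rng = np.random.default_rng(2)  # arbitrary seed
x = np.sort(rng.uniform(0.0, 1.0, 50))
y = np.sin(4.0 * x) + rng.normal(0.0, 0.2, 50)

knots = np.percentile(x, [25, 50, 75])      # fixed knots at percentiles of the data
H = truncated_power_basis(x, knots, M=4)    # cubic spline: N x (M + K) matrix
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print("basis shape:", H.shape)
print("residual sum of squares:", np.sum((y - H @ beta) ** 2))
```

For many knots the columns of this matrix become nearly collinear, which is the numerical reason for preferring the B-spline basis in practice.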

[Figure 5.20: four panels, "B-splines of Order 1" through "B-splines of Order 4", each plotted on [0, 1].]

FIGURE 5.20. The sequence of B-splines up to order four with ten knots evenly spaced from 0 to 1. The B-splines have local support; they are nonzero on an interval spanned by $M + 1$ knots.


Computational complexity

With $N$ observations and $K + M$ variables (basis functions), the fit takes $O(N(K + M)^2 + (K + M)^3)$ operations.

Sort the values of $X$.

B-splines have local support, so $B$ is lower 4-banded.

The Cholesky decomposition $M = LL^T$ can then be computed easily.

The solution $\hat f$ is obtained in $O(N)$ operations.


Natural Cubic Spline

Polynomial fit tends to be erratic near the boundaries.

[Figure 5.3: "Pointwise Variances" plotted against $X$ for four models: Global Linear, Global Cubic Polynomial, Cubic Spline - 2 knots, Natural Cubic Spline - 6 knots.]

FIGURE 5.3. Pointwise variance curves for four different models, with $X$ consisting of 50 points drawn at random from $U[0, 1]$, and an assumed error model with constant variance. The linear and cubic polynomial fits have two and four degrees of freedom, respectively, while the cubic spline and natural cubic spline each have six degrees of freedom. The cubic spline has two knots at 0.33 and 0.66, while the natural spline has boundary knots at 0.1 and 0.9, and four interior knots uniformly spaced between them.

A natural cubic spline is a cubic spline with the additional constraint that the function is linear beyond the boundary knots.

Basis functions $N_i$, $i = 1, \ldots, K$:

$$N_1(X) = 1, \quad N_2(X) = X, \quad N_{k+2}(X) = d_k(X) - d_{K-1}(X), \quad k = 1, \ldots, K - 2,$$

where

$$d_k(X) = \frac{(X - \xi_k)^3_+ - (X - \xi_K)^3_+}{\xi_K - \xi_k}.$$
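The basis formula can be coded directly and checked for the defining property, linearity beyond the boundary knots. The sketch below is an illustration; the function name and the knot placement (six knots between 0.1 and 0.9, as in Figure 5.3) are assumptions.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """N_1 = 1, N_2 = x, N_{k+2} = d_k - d_{K-1} for k = 1, ..., K-2."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    K = len(knots)

    def d(k):
        # k is a 0-based knot index here, so knots[K - 1] is xi_K.
        num = (np.maximum(x - knots[k], 0.0) ** 3
               - np.maximum(x - knots[K - 1], 0.0) ** 3)
        return num / (knots[K - 1] - knots[k])

    cols = [np.ones_like(x), x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

knots = np.linspace(0.1, 0.9, 6)
N = natural_cubic_basis(np.linspace(0.0, 1.0, 50), knots)
print("basis shape:", N.shape)                       # K columns for K knots

# Second differences vanish for a linear function: check beyond the last knot.
right = natural_cubic_basis(np.array([1.0, 1.5, 2.0]), knots)
print("second differences:", right[0] - 2.0 * right[1] + right[2])
```

Compared with the unconstrained cubic spline on the same knots, the four linearity constraints (two at each boundary) free up four degrees of freedom, which can be spent on more interior knots.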


Smoothing Splines

Maximal number of knots: $N$, the number of examples.

But we need a penalty for model complexity:

$$\mathrm{RSS}(f, \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt$$

$\lambda$ is the smoothing parameter.

$\lambda = 0$: $f$ can be any function that interpolates the data.

$\lambda = \infty$: the simple least squares line fit, since no second derivative can be tolerated.

The criterion has a unique finite-dimensional minimizer: a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \ldots, N$.

The solution has the form $f(x) = \sum_{j=1}^{N} N_j(x)\, \theta_j$. The criterion reduces to

$$\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^T (y - N\theta) + \lambda\, \theta^T \Omega_N\, \theta$$

where $\{N\}_{ij} = N_j(x_i)$ and $\{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt$.


STAT602X Homework 4

2011-03-16

Keys

14. Suppose that with p = 1,

y|x ∼ N

( sin(12(x + .2)) x + .2 , 1

)

and N = 101 training data pairs are available with x i = (i − 1)/100 for i = 1, 2, . . . ,101. Vardeman will send you a text file with a data set like this in it. Use it in the following.

a) Fit all of the following using first 5 and then 9 effective degrees of freedom i) a cubic smoothing spline,

ii) a locally weighted linear regression smoother based on a normal density kernel, and iii) a locally weighted linear regression smoother based on a tricube kernel.

Plot for 5 effective degrees of freedom all of y i and the 3 sets of smoothed values against x i . Connect the consecutive (x i , y ˆ i ) for each fit with line segments so that they plot as ”functions.” Then redo the plotting for 9 effective degrees of freedom.

Neither the R function smooth.spline() nor the package pspline returns the coefficient matrix, so I decide to implement the whole function from scratches based on the Section 4.1 of the lecture outline. (Yes, my code is really messy where using global variable is not a good idea, and I would not recommend this way except for computing efficiency.) Please let me know if you find an easy way to get c i of the part b) from any cubic smoothing spline function or JMP. As HTF5.2, I use a truncated-power basis set easily obtained second derivatives, and let a = x 1 = 0, b = x 101 = 1, and knots ξ l = x l+1 for l = 1, . . . , K and K = 99. Also, the basis functions for a cubic spline M = 4 are

h j (x) = x j−1 j = 1, . . . , M, h M+l (x) = (x − ξ l ) M−1 + l = 1, . . . , K.

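Then $H = (h_j(x_i))_{N \times (M+K)}$, where $h_j(x_i)$ is the entry in the i-th row and the j-th column. Let $\Omega = (\omega_{i,j})_{(M+K) \times (M+K)}$ be a symmetric matrix whose upper-triangular entries $\omega_{i,j} = \int_a^b h_i''(t)\,h_j''(t)\,dt$ are

$$\omega_{i,j} = 0 \quad \text{for } i < M,$$

$$\omega_{M,j} = \tfrac{1}{3} b^3 - \tfrac{1}{2} b^2 \xi_{j-M} + \tfrac{1}{6} \xi_{j-M}^3 \quad \text{for } j > M, \text{ and}$$

$$\omega_{i,j} = \tfrac{1}{3}(b^3 - \xi_0^3) - \tfrac{1}{2}(b^2 - \xi_0^2)(\xi_{i-M} + \xi_{j-M}) + (b - \xi_0)\,\xi_{i-M}\,\xi_{j-M} \quad \text{for } j \ge i > M,$$

where $\xi_0 = \max\{\xi_{i-M}, \xi_{j-M}\}$.

Given a $\lambda$,

$$\hat{Y}_\lambda = H(H^T H + \lambda\Omega)^{-1} H^T Y$$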
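The penalized fit above can be sketched in a few lines. The sketch makes several assumptions that are not in the original solution: it uses 9 equally spaced knots instead of K = 99 (the truncated-power basis becomes badly conditioned with many knots), it evaluates every entry of Ω directly from the integral definition with constant factors kept, using two-point Gauss-Legendre quadrature on each interval between knots (exact here, because the integrand is piecewise quadratic), and it runs on simulated data. The function names are hypothetical.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Columns 1, x, x^2, x^3, then (x - xi_l)_+^3 for each knot xi_l."""
    x = np.asarray(x, dtype=float)
    poly = np.vander(x, 4, increasing=True)
    trunc = np.clip(x[:, None] - knots[None, :], 0.0, None) ** 3
    return np.hstack([poly, trunc])

def second_derivatives(t, knots):
    """Second derivatives of the basis functions evaluated at t."""
    t = np.asarray(t, dtype=float)
    poly = np.column_stack([np.zeros_like(t), np.zeros_like(t),
                            2.0 * np.ones_like(t), 6.0 * t])
    trunc = 6.0 * np.clip(t[:, None] - knots[None, :], 0.0, None)
    return np.hstack([poly, trunc])

def penalty_matrix(knots, a, b):
    """Omega[i, j] = integral over [a, b] of h_i''(t) h_j''(t) dt.

    Assumes the knots are sorted and lie strictly inside (a, b).
    """
    breaks = np.concatenate([[a], knots, [b]])
    nodes, weights = np.polynomial.legendre.leggauss(2)
    half = 0.5 * np.diff(breaks)
    mid = 0.5 * (breaks[:-1] + breaks[1:])
    t = (mid[:, None] + half[:, None] * nodes[None, :]).ravel()
    w = (half[:, None] * weights[None, :]).ravel()
    D = second_derivatives(t, knots)
    return D.T @ (w[:, None] * D)

def spline_smoother(x, y, knots, lam, a=0.0, b=1.0):
    """Return fitted values and the smoother matrix H (H'H + lam*Omega)^{-1} H'."""
    H = truncated_power_basis(x, knots)
    Omega = penalty_matrix(knots, a, b)
    S = H @ np.linalg.solve(H.T @ H + lam * Omega, H.T)
    return S @ y, S

# Simulated stand-in for the course data set (assumption).
rng = np.random.default_rng(0)
x = np.arange(101) / 100.0
y = np.sin(12 * (x + 0.2)) / (x + 0.2) + rng.standard_normal(101)

knots = np.linspace(0.1, 0.9, 9)
y_hat, S = spline_smoother(x, y, knots, lam=1e-4)
df = np.trace(S)   # effective degrees of freedom of this fit
```

Computing Ω by quadrature avoids transcribing closed forms by hand; a bisection on `lam` against `np.trace(S)` would then pick out the fits with 5 and 9 effective degrees of freedom.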


https://vardeman.public.iastate.edu/stat602/602x_hw4_sol.pdf

Machine Learning Kernel Methods, Basis Expansion and regularization 4 173 - 211 March 23, 2021 94 / 315


Smoothing Splines solution

The smoothing spline solution is a generalized ridge regression:

$$\hat{\theta} = (N^T N + \lambda\Omega_N)^{-1} N^T y$$

The fitted smoothing spline is given by:

$$\hat{f}(x) = \sum_{j=1}^{N} N_j(x)\,\hat{\theta}_j$$

Bone mineral density (BMD) in adolescents.

Response: the change in BMD over two consecutive visits, typically about one year apart.

Coded by gender; the growth spurt for females precedes that for males by about two years.

$\lambda \approx 0.00022$, $\mathrm{df}_\lambda = 12$.

Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap. 5

[Figure: relative change in spinal BMD (vertical axis) against age (horizontal axis, 10 to 25), with separate fits for males and females.]

FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.


Degrees of Freedom and Smoother Matrices

The smoothing spline is a linear smoother:

$$\hat{f} = N(N^T N + \lambda\Omega_N)^{-1} N^T y = S_\lambda y$$

$S_\lambda$ is known as the smoother matrix.

$\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$, the sum of the diagonal elements of $S_\lambda$.

$\lambda \approx 0.00022$ is derived numerically by solving $\mathrm{trace}(S_\lambda) = 12$.
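Solving trace(S_λ) = 12 numerically can be illustrated with a small sketch. It is not the computation behind the BMD figure: as an assumption, a discrete second-difference penalty on an equally spaced grid stands in for the spline penalty, so the smoother is S_λ = (I + λK)^{-1} and its trace is a sum over the eigenvalues of K. The function names, the grid size n = 100, and the bisection bracket are all hypothetical.

```python
import numpy as np

def second_difference_penalty(n):
    """K = D^T D, where D takes second differences of an n-vector."""
    D = np.diff(np.eye(n), n=2, axis=0)   # (n - 2) x n, rows [1, -2, 1]
    return D.T @ D

def df_of_lambda(lam, d):
    """trace((I + lam*K)^{-1}) computed from the eigenvalues d of K."""
    return np.sum(1.0 / (1.0 + lam * d))

def solve_lambda(target_df, d, lo=1e-8, hi=1e10, iters=200):
    """Bisection on log(lambda); df decreases as lambda grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)            # geometric midpoint
        if df_of_lambda(mid, d) > target_df:
            lo = mid                      # still too flexible: more penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)

n = 100
K = second_difference_penalty(n)
d = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # remove tiny negative round-off
lam12 = solve_lambda(12.0, d)
S = np.linalg.inv(np.eye(n) + lam12 * K)        # smoother matrix at that lambda
```

Because K annihilates constant and linear sequences, `df_of_lambda` tends to 2 for large `lam` and to n for small `lam`, so any target strictly between those limits has a solution.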


Pollution data example

128 observations of pressure and ozone.

Two fitted smoothing splines.

The third to sixth eigenvectors $u_k$ of the spline smoother matrices, plotted against x.

Eigendecomposition of $S_\lambda$:

$$S_\lambda = \sum_{k=1}^{N} \rho_k(\lambda)\, u_k u_k^T$$
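The dependence of the eigenvalues on λ can be made explicit with a short derivation. It assumes that the basis matrix N is square and invertible, as it is for a natural-spline basis with knots at the N distinct $x_i$; the symbols K and $d_k$ are introduced here only for illustration.

```latex
\begin{aligned}
S_\lambda &= N\,(N^T N + \lambda\Omega_N)^{-1} N^T
           = \bigl(N^{-T}(N^T N + \lambda\Omega_N)\,N^{-1}\bigr)^{-1}
           = (I + \lambda K)^{-1},
           \qquad K = N^{-T}\,\Omega_N\,N^{-1}, \\
\rho_k(\lambda) &= \frac{1}{1 + \lambda d_k},
           \qquad \text{where } d_k \text{ is the eigenvalue of } K \text{ belonging to } u_k .
\end{aligned}
```

The eigenvectors $u_k$ therefore do not depend on λ; only the shrinkage factors $\rho_k(\lambda)$ do, and $\mathrm{df}_\lambda = \sum_k \rho_k(\lambda)$.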

[Figure: top panel, ozone concentration against Daggot pressure gradient (−50 to 100) with two smoothing spline fits, df = 5 and df = 11; lower panels, eigenvalues against order (5 to 25) and further panels plotted against the pressure gradient (−50 to 100).]

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by $\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda)$. (Lower


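Smoother Matrix

Left: the rows of $S_\lambda$, ordered with x. Right: selected rows.

$\lambda \to 0$ means $\mathrm{df}_\lambda \to N$ and $S_\lambda \to I$.

$\lambda \to \infty$ means $\mathrm{df}_\lambda \to 2$ and $S_\lambda \to H$, the hat matrix for linear regression on x:

$H = X(X^T X)^{-1} X^T$, since $\hat{y} = Hy$.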

[Figure: left panel, image of the smoother matrix with rows 12, 25, 50, 75, 100 and 115 marked; right panel, "Equivalent Kernels" for those rows.]

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the elements of S as an image. The right panel shows the equivalent kernel or weighting function in detail for the indicated rows.


Selection of degrees of freedom

$$f(X) = \frac{\sin(12(X + 0.2))}{X + 0.2}, \qquad Y = f(X) + \varepsilon,$$

$X \sim U[0, 1]$, $\varepsilon \sim N(0, 1)$, $N = 100$.

The df selected by cross-validation is 9.
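For a linear smoother of this penalized least-squares type, the leave-one-out cross-validation score can be computed from a single fit, using the diagonal of the smoother matrix. The sketch below illustrates that shortcut under several assumptions: a second-difference penalty on an equally spaced grid stands in for the smoothing spline, the x values are a fixed grid rather than draws from U[0, 1], and the function names and the λ grid are hypothetical. It is not expected to reproduce the value df = 9 quoted above.

```python
import numpy as np

def difference_penalty_smoother(n, lam):
    """Smoother matrix (I + lam * D^T D)^{-1} with second differences D."""
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + lam * (D.T @ D))

def loocv(S, y):
    """Leave-one-out CV score from one fit: mean of ((y - S y) / (1 - S_ii))^2."""
    resid = (y - S @ y) / (1.0 - np.diag(S))
    return np.mean(resid ** 2)

# Simulated data from the model on this slide, on a fixed grid (assumption).
rng = np.random.default_rng(1)
n = 100
x = (np.arange(n) + 0.5) / n
y = np.sin(12 * (x + 0.2)) / (x + 0.2) + rng.standard_normal(n)

lams = np.logspace(-2, 6, 41)
scores, dfs = [], []
for lam in lams:
    S = difference_penalty_smoother(n, lam)
    scores.append(loocv(S, y))
    dfs.append(np.trace(S))       # effective degrees of freedom at this lambda

best = int(np.argmin(scores))
best_lam, best_df = lams[best], dfs[best]
```

Plotting `scores` against `dfs` gives a curve analogous to the CV(λ) curve in Figure 5.9; `best_df` is the effective degrees of freedom at its minimum for this simulated sample.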

[Figure: four panels. Top left, "Cross-Validation": EPE(λ) and CV(λ) against $\mathrm{df}_\lambda$ (6 to 14). Remaining panels: y against X (0 to 1) for $\mathrm{df}_\lambda = 5$, $\mathrm{df}_\lambda = 9$ and $\mathrm{df}_\lambda = 15$.]

FIGURE 5.9. The top left panel shows the EPE(λ) and CV(λ) curves for a realization from a nonlinear additive error model (5.22). The remaining panels show the data, the true functions (in purple), and the fitted curves (in green) with yellow shaded ±2 × standard error bands, for three different values of $\mathrm{df}_\lambda$.
