solution

(1)

SOLUTIONS - KERNEL METHODS

STATISTICAL LEARNING COURSE, ECOLE NORMALE SUP ´ERIEURE

Rapha¨el Berthier [email protected]

1. Examples of positive definite kernels (1) (a) Denotek=k1+k2. Let α1, . . . , αn∈R and x1, . . . , xn∈ X. Then

X

i,j

α_iα_jk(x_i, x_j) =X

i,j

α_iα_jk₁(x_i, x_j)

| {z }

≥0

+X

i,j

α_iα_jk₂(x_i, x_j)

| {z }

≥0

≥0.

(b) Denote k = k1k2. Let x1, . . . , xn ∈ X and K1, K2 the Gram matrices associated to kernels k₁, k₂ at the pointsx₁, . . . , x_n. We show that K = K₁K₂, the Gram matrix associated tok, is positive definite. Here, denotes the Hadamard product (i.e., the pointwise product). As K₁ is a symmetric positive semi-definite matrix, one can diagonalize K1=P

iλiuiu^T_i . Then K =X

i

λiuiu^T_i K2

But for any vector u :

X

ij

α_iα_j(uu^T K₂)_ij =X

ij

α_iα_j(K₂)_iju_iu_j = (αu)^TK₂(αu)≥0

Thus by summing non-negative terms, P

i,jαiαjKij ≥0.

(2) Consider H=L²(R) andφ(x) =1_{t≤x}. (3) Similarly,

1 x+y =

Z 1 0

t^x−¹²t^y−¹²dt=hφ(x), φ(y)i_L₂_([0,1]).

Thus _x+y¹ is a positive definite kernel. Nowxy is the standard scalar product onR, and by productk is a positive definite kernel.

(4) Denotenthe cardinal ofX. ForA⊂X, denote Φ(A) the indicator function ofA. Then

|A∩B|=φ(A)^Tφ(B),

thus|A∩B|is a positive definite kernel. Further, denotingA^cthe complement ofA,

(2)

2 STATISTICAL LEARNING COURSE, ECOLE NORMALE SUP ´ERIEURE

1

|A∪B| = 1 n− |A^c∩B^c|

= 1

n

1−^|A^c^∩B_n ^c^|

= 1

n(1−^φ(A^c⁾^T_n^φ(B^c⁾)

= 1 n

∞

X

i=0

(φ(A^c)^Tφ(B^c) n )ⁱ

which is a positive definite kernel by sum and products of positive definite kernels.

Finally, by a final product, K is a positive definite kernel.

(5)

GCD(n, m) =Y

pi

p^min(ψ_i ⁱ^(m),ψⁱ⁽ⁿ⁾⁾,

where the p_i are the prime numbers and where ψ_i(m) give the power of p_i in the decomposition of m. We see this as a product of kernels : indeed, consider the feature map

2. Distance in the feature space (1) (a)

kφ(x)−φ(y)k²=k(x, x)−2k(x, y) +k(y, y) (b)

kφ(x)−φ(y)k² = (x−y)² x+y . (2) (a)

kφ(x)−µ₊k²=k(x, x) + 1 n²₊

X

i,yi=1

X

j,yj=1

k(x_i, x_j)− 2 n+

X

i,yi=1

k(x, x_i) (b) We outputy= +1 if

1 n²₊

X

i,yi=1

X

j,yj=1

k(xi, xj)− 2 n₊

X

i,yi=1

k(x, xi)≤ 1 n²₋

X

i,yi=−1

X

j,yj=−1

k(xi, xj)− 2 n−

X

i,yi=−1

k(x, xi) and −1 otherwise.

(c) kφ(x)− 1

n₊ X

i,yi=1

φ(x_i)k² ≤ kφ(x)− 1 n−

X

i,yi=−1

φ(x_i)k²⇔ X

i,yi=1

k(x, x_i)≥ X

i,yi=−1

k(x, x_i)

(d) The application is straightforward. However, it sheds light on the importance of the choice of the kernelk. It decides how samples influence the classification rule of other points. The choice of the kernel can thus be seen as a choice of “similarity between points”.