Kernel change-point detection


Sylvain Arlot¹,² (joint work with Alain Celisse³ & Zaïd Harchaoui⁴)

¹ CNRS

² École Normale Supérieure (Paris), DIENS, Équipe Sierra

³ Université Lille 1

⁴ INRIA Grenoble

Workshop Kernel methods for big data, Lille, 1st April 2014.


Outline: Framework; Which change-points? (D known); How many change-points?; Empirical assessment


1-D signal (example): Find abrupt changes in the mean

[Figure: signal and regression function (Reg. func.) plotted against position t, for t from 0 to 100; two locations are marked with question marks.]


Estimation rather than identification

With a finite sample, it is impossible to recover some change-points in noisy regions.

[Figure: zoom on positions 55 to 100, showing the signal Y and the regression function s.]

Purpose:

1. Estimate the regression function.

2. Use the quadratic loss $\ell(u,v) = \|u - v\|^2$.

Rk: When the noise is not too strong, all change-points are recovered.


Detect abrupt changes...

General purposes:

1. Detect changes in the whole distribution (not only in the mean)

Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007)

2. High-dimensional data of different nature:

Vectorial: measures in $\mathbb{R}^d$, curves (sound recordings, ...)

Non vectorial: phenotypic data, graphs, DNA sequence, ...

Both vectorial and non vectorial data.


3. Efficient algorithm able to handle large data sets



Kernel and Reproducing Kernel Hilbert Space (RKHS)

$\mathcal{X}$: initial input space.

$X_1, \ldots, X_n$: initial observations.

$k(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: reproducing kernel ($\mathcal{H}$: RKHS).

$\phi(\cdot): \mathcal{X} \to \mathcal{H}$ s.t. $\phi(x) = k(x,\cdot)$: canonical feature map.

Asset:

Makes it possible to work with high-dimensional heterogeneous data.

Rk:

Estimators depend on the Gram matrix $K := \{k(X_i, X_j)\}_{1 \le i,j \le n}$.
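Since the estimators are driven by the Gram matrix, a minimal sketch of its computation may help. It assumes vectorial observations stored as rows of a NumPy array and uses the Gaussian kernel that appears later in the experiments; the function name `gaussian_gram` and the toy data are illustrative only.

```python
import numpy as np

def gaussian_gram(X, h=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - X[j]||^2 / (2 h^2)) for X of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * h ** 2))

# Toy data (assumption): three observations in dimension 1.
X = np.array([[0.0], [1.0], [3.0]])
K = gaussian_gram(X, h=1.0)
```

The matrix is symmetric with ones on the diagonal, and storing it costs O(n²) memory, which is worth keeping in mind for large n.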


Model

Mapping of the initial data

$\forall 1 \le i \le n, \quad Y_i = \phi(X_i) \in \mathcal{H}$.

$\longrightarrow (t_1, Y_1), \ldots, (t_n, Y_n) \in [0,1] \times \mathcal{H}$: independent.


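$\forall 1 \le i \le n, \quad Y_i = s_i^\star + \varepsilon_i \in \mathcal{H}$, where

$s_i^\star \in \mathcal{H}$: mean element of $P_{X_i}$ (distribution of $X_i$): $\langle s_i^\star, f \rangle_{\mathcal{H}} = \mathbb{E}_{X_i}\left[\langle \phi(X_i), f \rangle_{\mathcal{H}}\right]$, $\forall f \in \mathcal{H}$.

$\forall i$, $\varepsilon_i := Y_i - s_i^\star$ with $\mathbb{E}[\varepsilon_i] = 0$ and $v_i := \mathbb{E}\left[\|\varepsilon_i\|_{\mathcal{H}}^2\right]$.

Assumptions

1. $\max_i \|Y_i\|_{\mathcal{H}} \le M$ a.s. (Db).

2. $\max_i v_i \le v_{\max}$ (Vmax).

3. $s^\star = (s_1^\star, \ldots, s_n^\star) \in \mathcal{H}^n$: piecewise constant.

$\|s^\star - \mu\|^2 := \sum_{i=1}^n \|s_i^\star - \mu_i\|_{\mathcal{H}}^2$.

Goal: $\longrightarrow$ Estimate $s^\star$ to recover change-points.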
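As a worked illustration of assumptions (Db) and (Vmax), the reproducing property gives the norm of each mapped observation directly from the kernel. The sketch below takes the Gaussian kernel used in the experiments as an example; for other bounded kernels the constant $M$ changes accordingly.

```latex
\begin{align*}
\|Y_i\|_{\mathcal{H}}^2
  &= \langle \phi(X_i), \phi(X_i) \rangle_{\mathcal{H}} = k(X_i, X_i), \\
k_h(x, x) &= \exp\!\left[-\frac{\|x - x\|^2}{2h^2}\right] = 1
  \quad\Longrightarrow\quad \text{(Db) holds with } M = 1, \\
v_i &= \mathbb{E}\bigl[\|Y_i\|_{\mathcal{H}}^2\bigr] - \|s_i^\star\|_{\mathcal{H}}^2 \le M^2
  \quad\Longrightarrow\quad \text{(Vmax) holds with } v_{\max} = M^2 .
\end{align*}
```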



Least-squares estimator

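Empirical risk minimizer over $S_m$ (= model):

$\hat{s}_m \in \arg\min_{u \in S_m} \hat{R}_n(u)$ where $\hat{R}_n(u) = \frac{1}{n} \|u - Y\|^2 = \frac{1}{n} \sum_{i=1}^n \|u_i - Y_i\|_{\mathcal{H}}^2$.

Regressogram:

$\hat{s}_m = \sum_{\lambda \in m} \hat{\beta}_\lambda \mathbf{1}_\lambda$, $\quad \hat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{t_i \in \lambda\}} \sum_{t_i \in \lambda} Y_i$.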
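The regressogram lives in the RKHS, but its empirical risk on one segment can be evaluated from the Gram matrix alone, because expanding the squared norm gives $\sum_{i \in \lambda} \|Y_i - \hat{\beta}_\lambda\|_{\mathcal{H}}^2 = \sum_{i \in \lambda} k(X_i, X_i) - \frac{1}{\mathrm{Card}(\lambda)} \sum_{i,j \in \lambda} k(X_i, X_j)$. A minimal sketch follows, where the helper name `segment_cost` and the half-open index convention `[a, b)` are assumptions made for illustration.

```python
import numpy as np

def segment_cost(K, a, b):
    """Sum over i in [a, b) of ||Y_i - segment mean||_H^2, computed from the Gram matrix K."""
    block = K[a:b, a:b]
    return np.trace(block) - block.sum() / (b - a)

# Linear kernel on real data (assumption), so the result can be checked by hand.
x = np.array([1.0, 3.0, 10.0])
K = np.outer(x, x)
cost = segment_cost(K, 0, 2)  # segment {1, 3}: mean 2, squared deviations 1 + 1
```

With the linear kernel the cost reduces to the usual within-segment sum of squares, which gives a quick sanity check of the formula.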


Model selection

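Models:

$\mathcal{M}_n = \{m, \text{ segmentation of } \{1, \ldots, n\}\}$, $\quad D_m = \mathrm{Card}(m)$.

$m \Leftrightarrow I_1 = [0, t_{m_1}],\ I_2 = (t_{m_1}, t_{m_2}],\ \ldots,\ I_{D_m} = (t_{m_{D_m - 1}}, 1]$.

$S_m = \{\mu: (t_1, \ldots, t_n) \to \mathcal{H}, \text{ piecewise constant on all } \lambda \in m\}$ $\Leftrightarrow$ subspace of $\mathcal{H}^n$.

Strategy:

$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \hat{s}_{\hat{m}}$ ???

Oracle model: $m^\star \in \arg\min_{m \in \mathcal{M}_n} \|s^\star - \hat{s}_m\|^2$.

Goal: Oracle inequality (in expectation, or with large probability):

$\|s^\star - \hat{s}_{\hat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s^\star - \hat{s}_m\|^2 + R(m,n) \right\}$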
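To get a feeling for the size of the collection $\mathcal{M}_n$, one can count segmentations: cutting $\{1, \ldots, n\}$ into $D$ non-empty contiguous segments amounts to choosing $D - 1$ cut positions among $n - 1$. The short sketch below (function name chosen for illustration) does this count.

```python
from math import comb

def n_segmentations(n, D):
    """Number of ways to cut {1, ..., n} into D non-empty contiguous segments."""
    return comb(n - 1, D - 1)

# Summing over D gives every segmentation of 10 points, i.e. 2 ** 9 of them.
total = sum(n_segmentations(10, D) for D in range(1, 11))
```

The count grows exponentially with n, which is why exhaustive search over all models is not an option.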


Choose (D − 1) change-points...

Assumption: (Harchaoui & Cappé (2007)) The number $(D-1)$ of change-points is known.

Question:

How to find the locations of the $(D-1)$ change-points? ($D$ is given.)

Strategy:

The “best” segmentation in $D$ pieces is obtained by applying the ERM algorithm over $\bigcup_{m : D_m = D} S_m$:

ERM algorithm:

$\hat{m}_{\mathrm{ERM}}(D) = \arg\min_{m \,|\, D_m = D} \hat{R}_n(\hat{s}_m)$.

Rk: Based on dynamic programming.
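The dynamic programming step can be sketched as follows. This is an illustrative O(D n²) implementation working on the Gram matrix, not necessarily the one used by the authors; the function name `best_segmentations`, the prefix-sum trick and the toy data are assumptions.

```python
import numpy as np

def best_segmentations(K, D_max):
    """For D = 1..D_max, return the minimal empirical risk (times n) and the change-points."""
    n = K.shape[0]
    diag_cum = np.concatenate(([0.0], np.cumsum(np.diag(K))))
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)  # 2-D prefix sums of K

    def cost(a, b):
        # Sum over i in [a, b) of ||Y_i - segment mean||_H^2.
        block_sum = S[b, b] - S[a, b] - S[b, a] + S[a, a]
        return diag_cum[b] - diag_cum[a] - block_sum / (b - a)

    # best[D, t] = minimal cost of cutting the first t points into D segments.
    best = np.full((D_max + 1, n + 1), np.inf)
    arg = np.zeros((D_max + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for D in range(1, D_max + 1):
        for t in range(D, n + 1):
            cands = [best[D - 1, s] + cost(s, t) for s in range(D - 1, t)]
            j = int(np.argmin(cands))
            best[D, t] = cands[j]
            arg[D, t] = j + D - 1  # start index of the last segment
    costs, cps = {}, {}
    for D in range(1, D_max + 1):
        starts, t = [], n
        for d in range(D, 0, -1):  # backtrack the start of each segment
            t = int(arg[d, t])
            starts.append(t)
        costs[D] = float(best[D, n])
        cps[D] = sorted(starts)[1:]  # drop the leading 0
    return costs, cps

# Toy signal with one jump, linear kernel (assumption).
x = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0])
costs, cps = best_segmentations(np.outer(x, x), D_max=3)
```

Here each change-point is reported as the index of the first observation of the new segment.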


Quality of the segmentations


Elementary calculations

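Ideal criterion: ($\Pi_m$: orthogonal projection operator onto $S_m$)

$\|s^\star - \hat{s}_m\|^2 = \|s^\star - \Pi_m s^\star\|^2 + \|\Pi_m \varepsilon\|^2$.

Empirical risk:

$\|Y - \hat{s}_m\|^2 = \|s^\star - \Pi_m s^\star\|^2 - \|\Pi_m \varepsilon\|^2 + 2 \langle (I - \Pi_m) s^\star, \varepsilon \rangle + \|\varepsilon\|^2$.

Expectations ($v_\lambda = \frac{1}{\mathrm{Card}(\lambda)} \sum_{i \in \lambda} v_i$):

$\mathbb{E}\left[\|s^\star - \hat{s}_m\|^2\right] = \|s^\star - \Pi_m s^\star\|^2 + \sum_{\lambda \in m} v_\lambda$,

$\mathbb{E}\left[\|Y - \hat{s}_m\|^2\right] = \|s^\star - \Pi_m s^\star\|^2 - \sum_{\lambda \in m} v_\lambda + \mathrm{Cst}$.

Conclusion:

$\longrightarrow$ ERM prefers models with large $\sum_{\lambda \in m} v_\lambda$ (overfitting).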
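The expectation of the quadratic term can be checked in two lines. The sketch below uses only the independence and centering of the $\varepsilon_i$, together with the fact that $\Pi_m$ replaces each coordinate by the average over its segment.

```latex
\begin{align*}
\|\Pi_m \varepsilon\|^2
  &= \sum_{\lambda \in m} \mathrm{Card}(\lambda)
     \Bigl\| \frac{1}{\mathrm{Card}(\lambda)} \sum_{i \in \lambda} \varepsilon_i \Bigr\|_{\mathcal{H}}^2
   = \sum_{\lambda \in m} \frac{1}{\mathrm{Card}(\lambda)}
     \Bigl\| \sum_{i \in \lambda} \varepsilon_i \Bigr\|_{\mathcal{H}}^2 , \\
\mathbb{E}\Bigl[ \Bigl\| \sum_{i \in \lambda} \varepsilon_i \Bigr\|_{\mathcal{H}}^2 \Bigr]
  &= \sum_{i \in \lambda} \mathbb{E}\bigl[ \|\varepsilon_i\|_{\mathcal{H}}^2 \bigr]
   = \sum_{i \in \lambda} v_i
  \quad\Longrightarrow\quad
  \mathbb{E}\bigl[ \|\Pi_m \varepsilon\|^2 \bigr] = \sum_{\lambda \in m} v_\lambda .
\end{align*}
```

The cross terms $\mathbb{E}\langle \varepsilon_i, \varepsilon_j \rangle_{\mathcal{H}}$, $i \ne j$, vanish by independence and centering.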


Choose the number of change-points

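From $(\hat{s}_{\hat{m}_D})_D$, choosing $D$ amounts to choosing the “best model”.

Ideal penalty:

$m^\star \in \arg\min_{m \in \mathcal{M}_n} \|s^\star - \hat{s}_m\|^2 = \arg\min_{m \in \mathcal{M}_n} \left\{ \|Y - \hat{s}_m\|^2 + \mathrm{pen}_{\mathrm{id}}(m) \right\}$,

with $\mathrm{pen}_{\mathrm{id}}(m) := 2 \|\Pi_m \varepsilon\|^2 - 2 \langle (I - \Pi_m) s^\star, \varepsilon \rangle$.

Strategy

1. Concentration inequalities for linear and quadratic terms.

2. Derive a tight upper bound $\mathrm{pen} \ge \mathrm{pen}_{\mathrm{id}}$ with high probability.

Previous work:

Birgé & Massart (2001): Gaussian assumption + real valued functions.

$\longrightarrow$ cannot be extended to Hilbert framework.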


Concentration of the linear term

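Theorem (Linear term)

Assume (Db)–(Vmax) hold true.

Then, for every segmentation $m \in \mathcal{M}_n$, for every $x > 0$, with probability at least $1 - 2e^{-x}$,

$|\langle \Pi_m s^\star - s^\star, \varepsilon \rangle| \le \theta \|\Pi_m s^\star - s^\star\|^2 + \left( \frac{v_{\max}}{\theta} + \frac{4M^2}{3} \right) x$, for every $\theta > 0$.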


Concentration of the quadratic term

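Theorem (Quadratic term)

Assume (Db)–(Vmax), and

$\exists \kappa \ge 1, \quad 0 < \frac{M^2}{\kappa} \le \min_i v_i$ (Vmin).

Then, for every $m \in \mathcal{M}_n$, $x > 0$, and $\theta \in (0,1]$,

$\|\Pi_m \varepsilon\|^2 - \mathbb{E}\left[\|\Pi_m \varepsilon\|^2\right] \le \theta\, \mathbb{E}\left[\|\Pi_m s^\star - \hat{s}_m\|^2\right] + \theta^{-1} L(\kappa)\, v_{\max}\, x$,

with probability at least $1 - 2e^{-x}$, where $L(\kappa)$ is a constant.

Idea of the proof:

Pinelis-Sakhanenko’s inequality ($\left\| \sum_{i \in \lambda} \varepsilon_i \right\|_{\mathcal{H}}$).

Bernstein’s inequality (upper bounding moments).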


Oracle inequality

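Theorem

Assume (Db)-(Vmin)-(Vmax) and define

$\hat{m} \in \arg\min_m \left\{ \frac{1}{n} \|Y - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\}$,

where $\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left[ C_1 \ln\left(\frac{n}{D_m}\right) + C_2 \right]$ for constants $C_1, C_2 > 0$. Then, for every $x \ge 1$, with probability at least $1 - 2e^{-x}$,

$\frac{1}{n} \|s^\star - \hat{s}_{\hat{m}}\|^2 \le \Delta_1 \inf_m \left\{ \frac{1}{n} \|s^\star - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\} + \Delta_2 \frac{v_{\max}\, x}{n}$,

where $\Delta_1 \ge 1$ and $\Delta_2 > 0$ are absolute constants.

In Birgé & Massart (2001), $\mathrm{pen}(m) = \frac{\sigma^2 D_m}{n} \left[ c_1 \ln\left(\frac{n}{D_m}\right) + c_2 \right]$.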


Model selection procedure

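$\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left[ C_1 \ln\left(\frac{n}{D_m}\right) + C_2 \right] = \mathrm{pen}(D_m)$.

Algorithm

1. For every $1 \le D \le D_{\max}$,

$\hat{m}_D \in \arg\min_{m,\, D_m = D} \left\{ \|Y - \hat{s}_m\|^2 \right\}$.

2. Define

$\hat{D} = \arg\min_D \left\{ \frac{1}{n} \|Y - \hat{s}_{\hat{m}_D}\|^2 + \frac{v_{\max} D}{n} \left[ C_1 \ln\left(\frac{n}{D}\right) + C_2 \right] \right\}$,

where $C_1, C_2$ are computed by simulation experiments.

3. Final estimator:

$\hat{s}_{\hat{m}} := \hat{s}_{\hat{m}_{\hat{D}}}$.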
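Step 2 of the procedure can be sketched in a few lines once the minimal values $\|Y - \hat{s}_{\hat{m}_D}\|^2$ are available for each $D$. The function names, the risk values and the constants `v_max`, `C1`, `C2` below are hypothetical placeholders chosen for illustration; the slides state that $C_1, C_2$ come from simulation experiments.

```python
import math

def penalty(D, n, v_max, C1, C2):
    """pen(D) = v_max * D / n * (C1 * ln(n / D) + C2)."""
    return v_max * D / n * (C1 * math.log(n / D) + C2)

def select_D(costs, n, v_max, C1, C2):
    """costs[D] = ||Y - s_hat_{m_hat_D}||^2; return the minimizer of the penalized criterion."""
    crit = {D: c / n + penalty(D, n, v_max, C1, C2) for D, c in costs.items()}
    return min(crit, key=crit.get), crit

# Hypothetical risk values: a sharp drop from D = 1 to D = 2, then little gain.
costs = {1: 50.0, 2: 10.0, 3: 9.5, 4: 9.2}
D_hat, crit = select_D(costs, n=100, v_max=1.0, C1=1.0, C2=1.0)
```

The penalty grows roughly linearly in D, so the criterion stops rewarding extra segments once the empirical risk flattens out.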


Changes in the distribution (synthetic data)

Description:

1. $n = 1\,000$, $D^* - 1 = 9$, $N_{\mathrm{rep}} = 100$.

2. In each segment, observations are generated according to one distribution from a pool of 10 distributions with the same mean and variance.

3. The kernel-based approach makes it possible to distinguish them (higher-order moments).

4. Gaussian kernel: $k_h(x,y) = \exp\left[ -\|x - y\|^2 / (2h^2) \right]$.
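A minimal sketch of this kind of synthetic signal is given below. The slides do not list the 10 distributions of the pool, so the three standardized distributions used here (Gaussian, uniform, Laplace, all with mean 0 and variance 1), the change-point positions and the function names are assumptions made for illustration.

```python
import numpy as np

def sample_segment(rng, kind, size):
    """Draw `size` points with mean 0 and variance 1 from the named distribution."""
    if kind == "normal":
        return rng.normal(0.0, 1.0, size)
    if kind == "uniform":
        return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size)  # variance (2*sqrt(3))^2 / 12 = 1
    if kind == "laplace":
        return rng.laplace(0.0, 1.0 / np.sqrt(2.0), size)      # variance 2 * scale^2 = 1
    raise ValueError(kind)

def simulate(rng, change_points, kinds, n):
    """Piecewise i.i.d. signal: segment j is drawn from distribution kinds[j]."""
    bounds = [0, *change_points, n]
    return np.concatenate([
        sample_segment(rng, kind, b - a)
        for kind, a, b in zip(kinds, bounds[:-1], bounds[1:])
    ])

rng = np.random.default_rng(0)
X = simulate(rng, change_points=[300, 700], kinds=["normal", "uniform", "laplace"], n=1000)
```

A detector that only looks at the mean or the variance has nothing to find in such a signal, which is the point of the experiment.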


Changes in the distribution (synthetic data), cont.

Results

[Figure: two panels. First panel: curves labelled Emp. risk, Crit. and True risk, horizontal axis from 0 to 40. Second panel: horizontal axis from 0 to 1000.]

Hausdorff distance: 0.053 ± 0.006
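For reference, the Hausdorff distance between two sets of change-points can be computed as sketched below. The function name and the toy positions are illustrative, and the result is in the same units as the inputs; any normalization used for the value reported on the slide is not specified here.

```python
import numpy as np

def hausdorff(cp_a, cp_b):
    """Hausdorff distance between two non-empty sets of change-point positions."""
    a = np.asarray(cp_a, dtype=float)[:, None]
    b = np.asarray(cp_b, dtype=float)[None, :]
    d = np.abs(a - b)  # all pairwise distances
    # Largest distance from a point of one set to the nearest point of the other set.
    return max(d.min(axis=1).max(), d.min(axis=0).max())

true_cp = [100, 200, 300]
est_cp = [105, 190, 300, 450]
dist = hausdorff(true_cp, est_cp)  # dominated by the spurious change-point at 450
```

The distance penalizes both missed and spurious change-points, since it takes the worse of the two directions.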


Results: estimated number of change-points

[Figure: estimated number of change-points; axis ticks from 0 to 30 and from 0 to 250.]


“Le grand échiquier”, 70s-80s French talk show

Audio and video recordings.

Audio: different situations can be distinguished from sound recordings (music, applause, speech, ...).

Video: different video scenes can be distinguished by their backgrounds or specific actions of people (clapping hands, discussing, ...).


Audio signal

Description:

$n = 500$, $D^* - 1 = 4$.

At each $t_i$, one observes a multivariate vector of dimension 12.

Gaussian kernel: $k_h(x,y) = \exp\left[ -\|x - y\|^2 / (2h^2) \right]$.

Results: Hausdorff distance 0.079 ± 0.006


Video sequence

Description:

$n = 10\,000$, $D^* - 1 = 4$.

Each image summarized by a histogram with 1 024 bins.

$\chi^2$ kernel: $k_d(x,y) = \sum_{i=1}^d \frac{(x_i - y_i)^2}{x_i + y_i}$.

Results: Hausdorff distance 0.093 ± 0.007


Conclusion

Take-home message:

Change-point detection algorithm for possibly high-dimensional or complex data

Data-driven choice of the number of change-points

Non-asymptotic oracle inequality (guarantee on the risk)

Experiments: changes in less usual properties of the distribution, audio or video data

Open questions:

1. Influence of the choice of kernel

2. Data-driven choice of the kernel

3. Relax the assumption on the variance

4. Extend our model selection theorem to other regression settings
