Kernel change-point detection

Sylvain Arlot [1,2] (joint work with Alain Celisse [3] & Zaïd Harchaoui [4])

[1] CNRS

[2] École Normale Supérieure (Paris), DIENS, Équipe Sierra

[3] Université Lille 1

[4] INRIA Grenoble

Workshop Kernel methods for big data, Lille, 1st April 2014.


Framework | Which change-points? (D known) | How many change-points? | Empirical assessment

1-D signal (example)

[Figure: signal plotted against position t, for t from 0 to 100.]

Kernel change-point detection Sylvain Arlot

1-D signal (example): Find abrupt changes in the mean

[Figure: signal against position t, for t from 0 to 100, with the regression function overlaid (legend: Signal, Reg. func.; two locations marked "?").]


Estimation rather than identification

With a finite sample, it is impossible to recover some change-points in noisy regions.

[Figure: zoom on positions 55 to 100, showing the signal Y and the regression function s.]

Purpose:

1 Estimate the regression function.

2 Use the quadratic loss $\ell(u,v) = \|u - v\|^2$.

Remark: when the noise is not too strong, all change-points are recovered.


Detect abrupt changes...

General purposes:

1 Detect changes in the whole distribution (not only in the mean)

Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007)

2 High-dimensional data of different nature:

Vectorial: measures in $\mathbb{R}^d$, curves (sound recordings, ...)

Non vectorial: phenotypic data, graphs, DNA sequence, ...

Both vectorial and non vectorial data.

3 Efficient algorithm able to handle large data sets

Kernel and Reproducing Kernel Hilbert Space (RKHS)

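$\mathcal{X}$: initial input space.

$X_1, \ldots, X_n$: initial observations.

$k(\cdot,\cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$: reproducing kernel ($\mathcal{H}$: RKHS).

$\phi(\cdot) : \mathcal{X} \to \mathcal{H}$ s.t. $\phi(x) = k(x,\cdot)$: canonical feature map.

Asset:

Makes it possible to work with high-dimensional heterogeneous data.

Remark:

Estimators depend on the Gram matrix $K := \{k(X_i, X_j)\}_{1 \le i,j \le n}$.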
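Since everything downstream only needs the Gram matrix, a minimal sketch of how it could be computed may help. The function name `gaussian_gram`, the bandwidth default `h=1.0`, and the choice of the Gaussian kernel (the one used later in the experiments) are assumptions made for illustration, not part of the talk.

```python
import numpy as np

def gaussian_gram(X, h=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - X[j]||^2 / (2 h^2)).

    Hypothetical helper: X holds one observation per row; h is the bandwidth.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D array as n scalar observations
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 <a, b>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negative round-off
    return np.exp(-sq_dists / (2.0 * h ** 2))
```

The matrix is symmetric with ones on the diagonal, and it costs O(n^2) memory, which is worth keeping in mind for large n.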


Model

Mapping of the initial data

$\forall 1 \le i \le n, \quad Y_i = \phi(X_i) \in \mathcal{H}$.

$\longrightarrow (t_1, Y_1), \ldots, (t_n, Y_n) \in [0,1] \times \mathcal{H}$: independent.


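$\forall 1 \le i \le n, \quad Y_i = s_i^\star + \varepsilon_i \in \mathcal{H}$, where

$s_i^\star \in \mathcal{H}$: mean element of $P_{X_i}$ (distribution of $X_i$): $\langle s_i^\star, f \rangle_{\mathcal{H}} = \mathbb{E}_{X_i}[\langle \phi(X_i), f \rangle_{\mathcal{H}}]$, $\forall f \in \mathcal{H}$.

$\forall i$, $\varepsilon_i := Y_i - s_i^\star$ with $\mathbb{E}[\varepsilon_i] = 0$ and $v_i := \mathbb{E}[\|\varepsilon_i\|_{\mathcal{H}}^2]$.

Assumptions

1 $\max_i \|Y_i\|_{\mathcal{H}} \le M$ a.s. (Db).

2 $\max_i v_i \le v_{\max}$ (Vmax).

3 $s^\star = (s_1^\star, \ldots, s_n^\star) \in \mathcal{H}^n$: piecewise constant.

Norm on $\mathcal{H}^n$: $\|s^\star - \mu\|^2 := \sum_{i=1}^n \|s_i^\star - \mu_i\|_{\mathcal{H}}^2$.

Goal: $\longrightarrow$ Estimate $s^\star$ to recover change-points.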

Least-squares estimator

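Empirical risk minimizer over $S_m$ (= model):

$\hat{s}_m \in \arg\min_{u \in S_m} \hat{R}_n(u)$ where $\hat{R}_n(u) = \frac{1}{n}\|u - Y\|^2 = \frac{1}{n}\sum_{i=1}^n \|u_i - Y_i\|_{\mathcal{H}}^2$.

Regressogram:

$\hat{s}_m = \sum_{\lambda \in m} \hat{\beta}_\lambda \mathbf{1}_\lambda$, $\quad \hat{\beta}_\lambda = \frac{1}{\operatorname{Card}\{t_i \in \lambda\}} \sum_{t_i \in \lambda} Y_i$.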
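Because $Y_i = \phi(X_i)$, the squared distance of a segment to its own mean expands into kernel evaluations: $\sum_{i \in \lambda} \|Y_i - \hat{\beta}_\lambda\|_{\mathcal{H}}^2 = \sum_{i \in \lambda} K_{ii} - \frac{1}{\operatorname{Card}(\lambda)} \sum_{i,j \in \lambda} K_{ij}$. The sketch below evaluates the empirical risk of a segmentation from the Gram matrix alone. The names `segment_cost` and `empirical_risk`, and the encoding of a segmentation as a list of index boundaries over time-ordered observations, are assumptions made for illustration.

```python
import numpy as np

def segment_cost(K, start, end):
    """Sum over i in [start, end) of ||Y_i - segment mean||_H^2, computed from the Gram matrix K."""
    block = K[start:end, start:end]
    return float(np.trace(block) - block.sum() / (end - start))

def empirical_risk(K, boundaries):
    """(1/n) ||Y - s_hat_m||^2 for the segmentation described by `boundaries`.

    boundaries: increasing indices [0, ..., n]; consecutive pairs delimit
    half-open segments [a, b) of the time-ordered observations.
    """
    n = K.shape[0]
    total = sum(segment_cost(K, a, b)
                for a, b in zip(boundaries[:-1], boundaries[1:]))
    return total / n
```

With a linear kernel on scalar data (K = outer product of the observations), these quantities reduce to the usual within-segment sums of squares.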

Model selection

Models:

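$\mathcal{M}_n = \{m, \text{ segmentation of } \{1, \ldots, n\}\}$, $\quad D_m = \operatorname{Card}(m)$.

$m \Leftrightarrow I_1 = [0, t_{m_1}],\ I_2 = (t_{m_1}, t_{m_2}],\ \ldots,\ I_{D_m} = (t_{m_{D_m - 1}}, 1]$.

$S_m = \{\mu : (t_1, \ldots, t_n) \to \mathcal{H}, \text{ piecewise constant on all } \lambda \in m\}$ $\Leftrightarrow$ subspace of $\mathcal{H}^n$.

Strategy:

$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \hat{s}_{\hat{m}}$ ???

Oracle model: $m^\star \in \arg\min_{m \in \mathcal{M}_n} \|s^\star - \hat{s}_m\|^2$.

Goal: Oracle inequality (in expectation, or with large probability):

$\|s^\star - \hat{s}_{\hat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s^\star - \hat{s}_m\|^2 + R(m,n) \right\}$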


Choose (D − 1) change-points...

Assumption (Harchaoui & Cappé (2007)): the number (D − 1) of change-points is known.

Question:

Where are the (D − 1) change-points located? (D is given.)

Strategy:

The “best” segmentation into $D$ pieces is obtained by applying the ERM algorithm over $\bigcup_{D_m = D} S_m$:

ERM algorithm:

$\hat{m}_{\mathrm{ERM}}(D) = \arg\min_{m \mid D_m = D} \hat{R}_n(\hat{s}_m)$.

Remark: based on dynamic programming.
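The dynamic-programming remark can be sketched as follows. This is one standard way to minimize an additive segment cost, not necessarily the implementation used by the authors. The function name `best_segmentation`, the index-based boundaries, and the use of 2-D prefix sums of the Gram matrix are assumptions made for illustration; the running time is O(D n^2).

```python
import numpy as np

def best_segmentation(K, D):
    """Minimize ||Y - s_hat_m||^2 over segmentations with exactly D segments.

    Returns (boundaries, cost), where boundaries = [0, t_1, ..., t_{D-1}, n]
    delimit half-open index segments. K is the n x n Gram matrix.
    """
    n = K.shape[0]
    if not 1 <= D <= n:
        raise ValueError("D must satisfy 1 <= D <= n")
    # 2-D prefix sums: S[a, b] = sum of K[:a, :b].
    S = np.zeros((n + 1, n + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)
    diag = np.concatenate(([0.0], np.cumsum(np.diag(K))))

    def cost(a, b):
        # Cost of the segment [a, b): sum of K_ii minus block sum / length.
        block_sum = S[b, b] - S[a, b] - S[b, a] + S[a, a]
        return diag[b] - diag[a] - block_sum / (b - a)

    # best[d, j]: minimal cost of splitting the first j points into d segments.
    best = np.full((D + 1, n + 1), np.inf)
    arg = np.zeros((D + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):
                c = best[d - 1, i] + cost(i, j)
                if c < best[d, j]:
                    best[d, j] = c
                    arg[d, j] = i
    # Backtrack from (D, n) to recover the segment boundaries.
    boundaries = [n]
    j = n
    for d in range(D, 0, -1):
        j = int(arg[d, j])
        boundaries.append(j)
    return boundaries[::-1], float(best[D, n])
```

The table `best` also contains the optimal costs for every smaller number of segments, so a single run with the largest D of interest yields the whole family of segmentations.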


Quality of the segmentations

Elementary calculations

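Ideal criterion ($\Pi_m$: orthogonal projection operator onto $S_m$):

$\|s^\star - \hat{s}_m\|^2 = \|s^\star - \Pi_m s^\star\|^2 + \|\Pi_m \varepsilon\|^2$.

Empirical risk:

$\|Y - \hat{s}_m\|^2 = \|s^\star - \Pi_m s^\star\|^2 - \|\Pi_m \varepsilon\|^2 + 2\langle (I - \Pi_m) s^\star, \varepsilon \rangle + \|\varepsilon\|^2$.

Expectations ($v_\lambda = \frac{1}{\operatorname{Card}(\lambda)} \sum_{i \in \lambda} v_i$):

$\mathbb{E}\left[\|s^\star - \hat{s}_m\|^2\right] = \|s^\star - \Pi_m s^\star\|^2 + \sum_{\lambda \in m} v_\lambda$,

$\mathbb{E}\left[\|Y - \hat{s}_m\|^2\right] = \|s^\star - \Pi_m s^\star\|^2 - \sum_{\lambda \in m} v_\lambda + \mathrm{Cst}$,

Conclusion:

$\longrightarrow$ ERM prefers models with large $\sum_{\lambda \in m} v_\lambda$ (overfitting).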
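The two decompositions can be checked numerically in the simplest case $\mathcal{H} = \mathbb{R}$ (scalar observations, which corresponds to a linear kernel). Everything in this sketch is an illustrative assumption: the helper `projection_matrix`, the signal values, the noise level, and the candidate segmentation.

```python
import numpy as np

def projection_matrix(boundaries, n):
    """Orthogonal projection onto signals that are constant on each segment [a, b)."""
    P = np.zeros((n, n))
    for a, b in zip(boundaries[:-1], boundaries[1:]):
        P[a:b, a:b] = 1.0 / (b - a)
    return P

rng = np.random.default_rng(0)
n = 12
s_star = np.repeat([0.0, 2.0, -1.0], 4)   # piecewise-constant mean (hypothetical values)
eps = rng.normal(scale=0.5, size=n)        # centered noise
Y = s_star + eps

P = projection_matrix([0, 6, 12], n)       # a model that misses the true change-points
s_hat = P @ Y                              # regressogram on this segmentation

bias = np.sum((s_star - P @ s_star) ** 2)            # ||s* - Pi_m s*||^2
proj_noise = np.sum((P @ eps) ** 2)                  # ||Pi_m eps||^2
cross = 2.0 * np.dot(s_star - P @ s_star, eps)       # 2 <(I - Pi_m) s*, eps>

ideal = np.sum((s_star - s_hat) ** 2)                # ||s* - s_hat_m||^2
empirical = np.sum((Y - s_hat) ** 2)                 # ||Y - s_hat_m||^2
```

Up to floating-point error, `ideal` equals `bias + proj_noise`, and `empirical` equals `bias - proj_noise + cross + np.sum(eps ** 2)`. The projected-noise term enters the two criteria with opposite signs, which is the source of the overfitting noted in the conclusion.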


Choose the number of change-points

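From $(\hat{s}_{\hat{m}_D})_D$, choosing $D$ amounts to choosing the “best model”.

Ideal penalty:

$m^\star \in \arg\min_{m \in \mathcal{M}} \|s^\star - \hat{s}_m\|^2 = \arg\min_{m \in \mathcal{M}} \left\{ \|Y - \hat{s}_m\|^2 + \mathrm{pen}_{\mathrm{id}}(m) \right\}$,

with $\mathrm{pen}_{\mathrm{id}}(m) := 2\|\Pi_m \varepsilon\|^2 - 2\langle (I - \Pi_m) s^\star, \varepsilon \rangle$.

Strategy

1 Concentration inequalities for linear and quadratic terms.

2 Derive a tight upper bound $\mathrm{pen} \ge \mathrm{pen}_{\mathrm{id}}$ with high probability.

Previous work:

Birgé & Massart (2001): Gaussian assumption + real-valued functions.

$\longrightarrow$ cannot be extended to the Hilbert framework.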


Concentration of the linear term

Theorem (Linear term)

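Assume (Db)–(Vmax) hold true.

Then, for every segmentation $m \in \mathcal{M}_n$, for every $x > 0$, with probability at least $1 - 2e^{-x}$,

$|\langle \Pi_m s^\star - s^\star, \varepsilon \rangle| \le \theta \|\Pi_m s^\star - s^\star\|^2 + \left( \frac{v_{\max}}{\theta} + \frac{4M^2}{3} \right) x$, for every $\theta > 0$.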


Concentration of the quadratic term

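Theorem (Quadratic term)

Assume (Db)–(Vmax), and

$\exists \kappa \ge 1, \quad 0 < \frac{M^2}{\kappa} \le \min_i v_i$ (Vmin).

Then, for every $m \in \mathcal{M}_n$, $x > 0$, and $\theta \in (0,1]$,

$\left| \|\Pi_m \varepsilon\|^2 - \mathbb{E}\left[\|\Pi_m \varepsilon\|^2\right] \right| \le \theta\, \mathbb{E}\left[\|\Pi_m s^\star - \hat{s}_m\|^2\right] + \theta^{-1} L(\kappa)\, v_{\max}\, x$,

with probability at least $1 - 2e^{-x}$, where $L(\kappa)$ is a constant.

Idea of the proof:

Pinelis-Sakhanenko’s inequality ($\left\| \sum_{i \in \lambda} \varepsilon_i \right\|_{\mathcal{H}}$).

Bernstein’s inequality (upper bounding moments).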


Oracle inequality

Theorem

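Assume (Db)-(Vmin)-(Vmax) and define

$\hat{m} \in \arg\min_m \left\{ \frac{1}{n} \|Y - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\}$,

where $\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left[ C_1 \ln\left(\frac{n}{D_m}\right) + C_2 \right]$ for constants $C_1, C_2 > 0$. Then, for every $x \ge 1$, with probability at least $1 - 2e^{-x}$,

$\frac{1}{n} \|s^\star - \hat{s}_{\hat{m}}\|^2 \le \Delta_1 \inf_m \left\{ \frac{1}{n} \|s^\star - \hat{s}_m\|^2 + \mathrm{pen}(m) \right\} + \Delta_2 \frac{v_{\max}\, x}{n}$,

where $\Delta_1 \ge 1$ and $\Delta_2 > 0$ are absolute constants.

In Birgé & Massart (2001), $\mathrm{pen}(m) = \frac{\sigma^2 D_m}{n} \left[ c_1 \ln\left(\frac{n}{D_m}\right) + c_2 \right]$.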


Model selection procedure

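$\mathrm{pen}(m) = \frac{v_{\max} D_m}{n} \left( C_1 \ln\frac{n}{D_m} + C_2 \right) = \mathrm{pen}(D_m)$.

Algorithm

1 For every $1 \le D \le D_{\max}$,

$\hat{m}_D \in \arg\min_{m,\, D_m = D} \left\{ \|Y - \hat{s}_m\|^2 \right\}$,

2 Define

$\hat{D} = \arg\min_D \left\{ \frac{1}{n} \left\| Y - \hat{s}_{\hat{m}_D} \right\|^2 + \frac{v_{\max} D}{n} \left[ C_1 \ln\left(\frac{n}{D}\right) + C_2 \right] \right\}$,

where $C_1, C_2$: computed by simulation experiments.

3 Final estimator:

$\hat{s}_{\hat{m}} := \hat{s}_{\hat{m}_{\hat{D}}}$.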
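Step 2 of the algorithm above can be sketched as follows, assuming the minimal costs $\|Y - \hat{s}_{\hat{m}_D}\|^2$ have already been computed for $D = 1, \ldots, D_{\max}$. The function names and the numerical values of `v_max`, `C1`, and `C2` in the usage lines are hypothetical; the talk only states that the constants are calibrated by simulation.

```python
import math

def penalty(D, n, v_max, C1, C2):
    """pen(D) = v_max * D / n * (C1 * ln(n / D) + C2)."""
    return v_max * D / n * (C1 * math.log(n / D) + C2)

def select_number_of_segments(costs, n, v_max, C1, C2):
    """Return D_hat minimizing costs[D - 1] / n + pen(D) over D = 1..len(costs).

    costs[D - 1] is assumed to hold ||Y - s_hat_{m_D}||^2 for the best
    segmentation with D segments.
    """
    criterion = [cost / n + penalty(D, n, v_max, C1, C2)
                 for D, cost in enumerate(costs, start=1)]
    return 1 + min(range(len(criterion)), key=criterion.__getitem__)

# Usage with made-up costs: the drop from D = 1 to D = 2 is large,
# later improvements are too small to offset the penalty.
costs = [100.0, 10.0, 9.5, 9.4]
D_hat = select_number_of_segments(costs, n=100, v_max=1.0, C1=1.0, C2=1.0)
```

Because the penalty depends on $m$ only through $D_m$, the search over all segmentations splits into the two steps above: one cost-only minimization per D, then a one-dimensional search over D.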


Changes in the distribution (synthetic data)

Description:

1 $n = 1000$, $D - 1 = 9$, $N_{\mathrm{rep}} = 100$.

2 In each segment, observations are generated according to one distribution from a pool of 10 distributions with the same mean and variance.

3 The kernel-based approach makes it possible to distinguish them (higher-order moments).

4 Gaussian kernel: $k_h(x,y) = \exp\left[ -\|x - y\|^2 / (2h^2) \right]$.


Changes in the distribution (synthetic data), cont.

Results

[Figure: curves labeled Emp. risk, Crit., and True risk; axis tick values not reproduced.]

Hausdorff distance: 0.053 ± 0.006
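For readers unfamiliar with the metric, here is a minimal sketch of the Hausdorff distance between two sets of change-point positions. The slides do not spell out the exact convention behind the reported value (for example, whether positions are normalized to [0, 1]), so the function below is a generic definition under that stated uncertainty, with a hypothetical name.

```python
def hausdorff_distance(a, b):
    """Hausdorff distance between two non-empty collections of change-point positions.

    It is the largest distance from a point of one collection to the
    nearest point of the other, taken in both directions.
    """
    if not a or not b:
        raise ValueError("both collections must be non-empty")
    d_ab = max(min(abs(x - y) for y in b) for x in a)
    d_ba = max(min(abs(x - y) for x in a) for y in b)
    return max(d_ab, d_ba)
```

A spurious or missed change-point far from all others dominates the value, so the metric penalizes both over- and under-segmentation.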


Changes in the distribution (synthetic data), cont.

Results: estimated number of change-points

[Figure: estimated number of change-points; axis tick values not reproduced.]


“Le grand échiquier”, 1970s-80s French talk show

Audio and video recordings.

Audio: different situations can be distinguished from sound recordings (music, applause, speech, ...).

Video: different video scenes can be distinguished by their backgrounds or specific actions of people (clapping hands, discussing, ...).


Audio signal

Description:

$n = 500$, $D - 1 = 4$.

At each $t_i$, one observes a multivariate vector of dimension 12.

Gaussian kernel: $k_h(x,y) = \exp\left[ -\|x - y\|^2 / (2h^2) \right]$.

Results: Hausdorff distance 0.079 ± 0.006


Video sequence

Description:

$n = 10\,000$, $D - 1 = 4$.

Each image is summarized by a histogram with 1024 bins.

$\chi^2$ kernel: $k_d(x,y) = \sum_{i=1}^d \frac{(x_i - y_i)^2}{x_i + y_i}$.

Results: Hausdorff distance 0.093 ± 0.007
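The $\chi^2$ expression, exactly as written on the slide, can be evaluated on two histograms as in the sketch below. The function name `chi2_statistic` and the convention that bins empty in both histograms contribute zero are assumptions made for illustration; the slide does not say how such bins are handled.

```python
import numpy as np

def chi2_statistic(x, y):
    """Sum over bins of (x_i - y_i)^2 / (x_i + y_i) for non-negative histograms.

    Assumption: bins where x_i + y_i == 0 contribute 0 to the sum.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = (x - y) ** 2
    den = x + y
    # Divide only where the denominator is positive; other entries stay 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(terms.sum())
```

For 1024-bin histograms this costs O(d) per pair of images, so filling a full Gram matrix for n = 10 000 images requires on the order of n^2 d operations.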


Conclusion

Take-home message:

Change-point detection algorithm for possibly high-dimensional or complex data

Data-driven choice of the number of change-points

Non-asymptotic oracle inequality (guarantee on the risk)

Experiments: changes in less usual properties of the distribution, audio or video data

Open questions:

1 Influence of the choice of kernel

2 Data-driven choice of the kernel

3 Relax the assumption on the variance

4 Extend our model selection theorem to other regression settings
