Consistent kernel change-point detection

(1)

1/29

Introduction D=D_{τ ?} D=

b^D Experiments Conclusion

Consistent kernel change-point detection

Sylvain Arlot¹ (joint works with Alain Celisse², Damien Garreau³ & Zaïd Harchaoui⁴)

1Université Paris-Sud

2Université Lille 1

3Max Planck Institute for Intelligent Systems, Tübingen

4University of Washington

Séminaire MODAL’X, Nanterre, 9 Novembre 2017

(2)

2/29

Example 1: 1-D signal

0 10 20 30 40 50 60 70 80 90 100

−0.8

−0.6

−0.4

−0.2 0 0.2 0.4 0.6 0.8 1 1.2

Position t

Signal

Consistent kernel change-point detection Sylvain Arlot

(3)

2/29

Example 1: 1-D signal: Find abrupt changes in the mean

0 10 20 30 40 50 60 70 80 90 100

−0.8

−0.6

−0.4

−0.2 0 0.2 0.4 0.6 0.8 1 1.2

Position t

Signal

Signal Reg. func.

? ?

(4)

3/29

Example 2: shot detection in a movie

0 200 400 600 800 1000 1200 1400

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

(5)

3/29

Example 2: shot detection in a movie

0 200 400 600 800 1000 1200 1400

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

(6)

4/29

The change-point problem

Observation: X₁, . . . ,X_n∈ X independent random variables (X: arbitrary measurable set).

P_X_i: distribution ofX_i.

⇒ find where are theabrupt changes in the sequence P_X₁, . . . ,P_X_n?

Notation:

τ ∈ T_n^D :=(τ0, . . . , τ_D)∈N^D+1,0 =τ0 < τ1 <· · ·< τ_D =n segmentation (of{1, . . . ,n}) into Dτ =D ∈ {1, . . . ,n}segments.

(7)

5/29

Challenges for (multiple) change-point detection

1 Detect changes in the whole distribution (not only the mean) Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007), Fryzlewicz and Subba Rao (2014)

Full distribution: Zou et al. (2014) inR, Matteson & James (2014) inR^d

2 High-dimensional data of different nature:

Vectorial: measures inR^d, curves (sound recordings,. . . ) Non vectorial: phenotypic data, graphs, DNA sequence,. . . Both vectorial and non vectorial data.

3 Efficient algorithm allowing to deal with large data sets

(8)

5/29

Challenges for (multiple) change-point detection

(9)

5/29

Challenges for (multiple) change-point detection

(10)

6/29

Kernels: a quick reminder

k :X × X →Rmeasurable is apositive semidefinite kernel if

∀x₁, . . . ,xm∈ X, the matrix (k(xi,xj))16i,j6m ispositive semidefinite.

Examples:

linearkernel: k(x,y) =hx,yi,

polynomialkernel: k(x,y) = (1 +hx,yi)^p,

Gaussiankernel: k(x,y) = exp(− kx−yk²/(2h²)), χ² kernel on ∆^d: k(x,y) = exp

−_h·d¹ Pd i=1

(x_i−yi)² xi+yi

. . .

(11)

7/29

The kernel least-squares criterion

Least-squares criterion (whenX =R): ∀τ ∈ T_n:=^S_D>1T_n^D,

Rb_n(τ) := 1 n

D

X

`=1 τ_`

X

i=τ`−1+1

(X_i−X_τ_`−1_+1,τ_`)².

Kernel least-squares criterion:

Rb_n(τ) := 1 n

n

X

i=1

k(Xi,Xi)

− 1 n

D

X

`=1



 1 τ`−τ`−1

τ`

X

i=τ`−1+1 τ`

X

j=τ`−1+1

k(X_i,X_j)



.

The two definitions coincide whenX =R andk(x,y) =xy.

(12)

8/29

Kernel change-point detection (KCP)

(A., Celisse & Harchaoui, 2012)

bτ ∈argmin

τ∈Tn

(

kernel least-squares

criterion

z }| {

Rb_n(τ) +pen(τ)

| {z }

penalty function

)

where pen is a function increasing withDτ, such as:

pen(τ) = 1

n c1log n−1 Dτ−1

!

+c2Dτ

!

pen(τ) = D_τ n

c₁log

n D_τ

+c₂

pen(τ) = c₁D_τ n .

ForX =R, linear kernel, Birgé & Massart (2001) and Lebarbier (2005) take pen(τ) = ^σ²_n^D^τ ^hc₁log_Dⁿ

τ

+c₂ⁱ.

(13)

9/29

(Abstract) intuition on KCP

KCP ⇔kernelized version of (penalized) least-squares change-point detection

Canonical feature map Φ :x ∈ X 7→k(x,·)∈ Hreproducing kernel Hilbert space (RKHS)

Yi = Φ(Xi)∈ H are independentH-valued r.v.

E[^pk(X_i,X_i)]<∞ ⇒ can define µ^?_i ∈ H the “mean” ofY_i

⇒ KCP detects jumps of the “mean” µ^?_i of Yi

Remark: ifk is characteristic (eg, Gaussian kernel), µ^?_i characterizesP_X_i.

(14)

9/29

(Abstract) intuition on KCP

(15)

9/29

(Abstract) intuition on KCP

(16)

9/29

(Abstract) intuition on KCP

(17)

10/29

KCP for fixed D (Harchaoui & Cappé, 2007)

τb(D)∈argmin

τ∈T_n^D

Rb_n(τ) Dynamic programming algorithm

No computation inH, only needs to compute the k(X_i,X_j) (cost C_k)

Complexity of computing τ_b(D)

16D6Dmax:

time O (C_k+D_max)n² and space O(D_maxn) (Celisse et al., 2016).

(18)

11/29

Main assumptions

H separable

Bounded kernel/data:

∃M <+∞,∀i ∈ {1, . . . ,n}, k(X_i,X_i)6M² a.s. (Db)

⇒ always satisfied for Gaussian and χ² kernel.

(19)

12/29

D = D

τ^?

known: notations

True segmentation τ^?: µ^?₁ =· · ·=µ^?_τ^?

1 6= µ^?_τ^?

1+1=· · ·=µ^?_τ^?

2 6= · · · 6= µ^?_τ^?

Dτ ?−1+1=· · ·=µ^?_n. Smallest jump size: ∆ := min_i_{/ µ}^?

i6=µ^?_i+1

µ^?_i −µ^?_i+1_H (MMD, Gretton et al. 2006).

Smallest segment length: Λ_τ := ¹_nmin_16`6D_τ|τ_`−τ`−1|.

Loss between segmentations τ¹, τ² ∈ T_n: d_∞,n(τ¹, τ²) := 1

n max

16i6D_τ1−1

16j6Dmin_τ2−1

τ_i¹−τ_j²

= 1

n max

16i6D_τ1−1

τ_i¹−τ_i² if D_τ¹ =D_τ² andτ¹, τ² “close”

(20)

13/29

D = D

τ^?

known: estimation of change-points locations

Theorem (A. & Garreau, 2016)

Assume: Hseparable,(Db), y >0 and Λ_τ^? >vn(y) := 148D_τ^?M²

∆² ·y+ logn+ 1

n .

Then, with probability1−e^−y,

∀τ_b(Dτ^?)∈argmin

τ∈T_n^D^{τ ?}

Rb_n(τ) , d_∞,n τ^?,τ_b(Dτ^?)6vn(y).

(21)

14/29

D = D

τ^?

known: estimation of change-points locations (2)

Corollary (A. & Garreau, 2016, simplified result) Assume: Hseparable,(Db) and ∆²

M² & D_τ^? Λ_τ^? ·logn

n . Then, with probability1−n⁻²,

∀_bτ(D_τ^?)∈argmin

τ∈T_n^D^{τ ?}

Rb_n(τ) , d∞,n τ^?,_bτ(D_τ^?). D_τ^?M²

∆² ·logn n .

∆²

M² ≈signal-to-noise ratio.

Matches minimax lower bound log(n)/n (Brunel, 2014).

Remark: no log(n) factor in the standard “asymptotic” setting (Korostelev & Tsybakov, 2012).

(22)

15/29

KCP: data-driven D by model selection

Notation: Y = (Y₁, . . . ,Y_n)∈ Hⁿ,µ^? = (µ^?₁, . . . , µ^?_n)∈ Hⁿ For anyτ ∈ T_n, Πτ :Hⁿ→ Hⁿ orthogonal projection onto F_τ ={(f₁, . . . ,f_n)∈ Hⁿ/f_τ_`−1₊₁ =· · ·=f_τ_`∀`= 1, . . . ,D_τ}

⇒ Least-squares estimator µ_bτ = ΠτY and least-squares criterion:

Rb_n(τ) = ¹_nkY −µ_bτk² = ¹_n^Pⁿ_i=1kY_i−(µ_bτ)ik²_H

Quadratic riskof µ∈ Hⁿ: R(µ) = 1

nkµ−µ^?k² = 1 n

n

X

i=1

kµ_i −µ^?_ik²_H .

Usual approach formodel selection: take a penalty such that

∀τ ∈ T_n, pen(τ)>pen_id(τ) :=R(µ)−Rb_n(τ) + cst.

(23)

15/29

KCP: data-driven D by model selection

Notation: Y = (Y₁, . . . ,Y_n)∈ Hⁿ,µ^? = (µ^?₁, . . . , µ^?_n)∈ Hⁿ For anyτ ∈ T_n, Πτ :Hⁿ→ Hⁿ orthogonal projection onto F_τ ={(f₁, . . . ,f_n)∈ Hⁿ/f_τ_`−1₊₁ =· · ·=f_τ_`∀`= 1, . . . ,D_τ}

⇒ Least-squares estimator µ_bτ = ΠτY and least-squares criterion:

Rb_n(τ) = ¹_nkY −µ_bτk² = ¹_n^Pⁿ_i=1kY_i−(µ_bτ)ik²_H Quadratic riskof µ∈ Hⁿ:

R(µ) = 1

nkµ−µ^?k² = 1 n

n

X

i=1

kµ_i −µ^?_ik²_H .

Usual approach formodel selection: take a penalty such that

∀τ ∈ T_n, pen(τ)>pen_id(τ) :=R(µ)−Rb_n(τ) + cst.

(24)

16/29

Oracle inequality for KCP

Theorem (A., Celisse & Harchaoui, 2012–2016) Assume: Hseparable,(Db), y >0, C >119and

∀τ ∈ T_n, pen(τ)> CM² n

"

log n−1 Dτ−1

! +D_τ

# .

Then, with probability1−e^−y,

∀τ_b∈argmin

τ∈Tn

n

Rb_n(τ) + pen(τ)^o, R(µ_b

bτ)62 inf

τ∈Tn

R(µ_b_τ) + pen(τ) +83yM² n . applies to pen(τ) = CM²D_τ

n if C >465 log(n).

X =R, linear kernel: Birgé & Massart (2001), Lebarbier (2005).

(25)

17/29

Change-point estimation performance of KCP

Theorem (A. & Garreau, 2016)

Assume: Hseparable,(Db), y >0 and Cmin:= 74

3 (Dτ^?+ 1)(y+ logn+ 1) < C < Cmax := ∆² M²

Λ_τ^? 6Dτ^?

n.

Then, with probability1−e^−y:

∀τ_b∈argmin

τ∈T_n

(

Rb_n(τ) +CM²D_τ n

)

, D

bτ =D_τ^? and d_∞,n τ^?,τ_b6v_n(y) := 148D_τ^?M²

∆² ·y+ logn+ 1

n .

Previous works (Lavielle & Moulines, 2000, among many others):

real case (H=R) only (with dependent data).

(26)

18/29

Change-point estimation performance of KCP (2)

Corollary (A. & Garreau, 2016, simplified result) Assume: Hseparable,(Db) and

Dτ^?logn.C . ∆² M²

Λ_τ^? Dτ^?

n.

Then, with probability1−n⁻²:

∀τ_b∈argmin

τ∈Tn

(

Rb_n(τ) +CM²Dτ

n )

, D

bτ =Dτ^?

and d_∞,n τ^?,τ_b. Dτ^?M²

∆² ·logn n .

∆²

M² ≈signal-to-noise ratio.

Lower bound onC: log(n) necessary (Birgé & Massart, 2007).

(27)

19/29

Oracle inequality: proof ideas

Notation: ε=Y −µ^? ∈ Hⁿ Ideal penalty:

pen_id(τ) :=R(µ)−Rb_n(τ) +1 nkεk²

= 2

nhΠ_τµ^?−µ^?, εi

| {z }

=−Lτ(linear term)

+ 2

n kΠ_τεk²

| {z }

=Qτ (quadratic term)

Concentration forL_τ and Q_τ around their expectations

⇒ show that pen(τ)>pen_id(τ) simultaneously for all τ ∈ T_n, with probability >1−e^−y.

Previous work (Birgé & Massart, 2001): Gaussian assumption + real-valued functions ⇒ does not apply to RKHS case.

(28)

20/29

Concentration of the quadratic term

Proposition (A., Celisse & Harchaoui, 2012–2016)

Assume: Hseparable and(Db). Then, for every τ ∈ T_n, x >0:

kΠ_τεk²−E

hkΠ_τεk²i

6 14M²

3 (x+ 2√

2x D_τ) , with probability at least 1−e^−x.

Proof ideas:

Pinelis-Sakhanenko’s inequality (k^P_i∈λε_ik_H).

Bernstein’s inequality (upper bounding moments).

(29)

21/29

Concentration of the linear term

Proposition

Assume: Hseparable and(Db). Then, for every τ ∈ T_n, x >0, with probability at least1−2e^−x:

|hΠ_τµ^?−µ^?, εi|6θkΠ_τµ^?−µ^?k²+ 1

2θ+4 3

M²x ,

for everyθ >0.

Proof: Bernstein’s inequality.

(30)

22/29

Identification of change-points: proof ideas

τb∈argmin

τ∈Tn

Rb_n(τ) + pen(τ) Empirical risk:

Rb_n(τ) = 1

n kµ^?−Πτµ^?k²

| {z }

=Aτ(approximation)

+2

nhµ^?−Πτµ^?, εi

| {z }

=Lτ(linear term)

− 1

n kΠ_τεk²

| {z }

=Qτ (quadratic term)

+ 1 nkεk²

| {z }

(constant)

Previous concentration inequalities forL_τ,Q_τ. Deterministic bounds onA_τ:

Dτ <Dτ^? ⇒ ¹_nAτ > ¹₂Λ_τ^?∆² (for showingD

bτ >Dτ^?)

1

nAτ > 1 2min

Λ_τ^?,d∞,n(τ^?, τ)

∆² (forτ_b(Dτ^?))

(31)

23/29

Constant mean and variance: data

Constant mean and variance: the distribution ofXi is chosen amongB(0.5), N(0.5,0.25) andE(0.5).

0 100 200 300 400 500 600 700 800 900 1000

−1 0 1 2 3 4

(32)

24/29

Constant mean and variance: results (D

τ^?

)

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Linear Hermite Gaussian

KCP withDτ^? known.

(33)

25/29

Constant mean and variance: results ( D)

^c

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Linear Hermite Gaussian

KCP withD^b data-driven.

(34)

26/29

Histogram-valued data

Xi ∈d-dimensional simplex, Dirichlet distribution (p₁^`, . . . ,p_d^`) on the`-th segment, with p_i^` independent∼ U([0,0.2]).

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

(first three coordinates)

(35)

27/29

Histogram-valued data: results (D

τ^?

)

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

χ² kernel Gaussian kernel

KCP withD_τ^? known.

(36)

28/29

Histogram-valued data: results ( D)

^c

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

χ² kernel Gaussian kernel

KCP withD^b data-driven.

(37)

29/29

Conclusion

Take-home message:

Kernelized version of penalized least-squares change-point detection (eg, Lebarbier, 2005).

Detection ofchanges in the distribution, not only the first moments.

Can deal withstructured data.

Under reasonable assumptions and for a class of penalty functions:

oracle inequality;

identifies the correctnumber of change-points;

estimates at the correct rate thechange-points locations.

Future work:

Unbounded data/kernel.

Dependent data?

Learn how to choose the kernel.