• Aucun résultat trouvé

Consistent kernel change-point detection

N/A
N/A
Protected

Academic year: 2022

Partager "Consistent kernel change-point detection"

Copied!
37
0
0

Texte intégral

(1)

1/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Consistent kernel change-point detection

Sylvain Arlot1 (joint works with Alain Celisse2, Damien Garreau3 & Zaïd Harchaoui4)

1Université Paris-Sud

2Université Lille 1

3Max Planck Institute for Intelligent Systems, Tübingen

4University of Washington

Séminaire MODAL’X, Nanterre, 9 Novembre 2017

(2)

2/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Example 1: 1-D signal

0 10 20 30 40 50 60 70 80 90 100

−0.8

−0.6

−0.4

−0.2 0 0.2 0.4 0.6 0.8 1 1.2

Position t

Signal

Consistent kernel change-point detection Sylvain Arlot

(3)

2/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Example 1: 1-D signal: Find abrupt changes in the mean

0 10 20 30 40 50 60 70 80 90 100

−0.8

−0.6

−0.4

−0.2 0 0.2 0.4 0.6 0.8 1 1.2

Position t

Signal

Signal Reg. func.

? ?

(4)

3/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Example 2: shot detection in a movie

0 200 400 600 800 1000 1200 1400

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

Consistent kernel change-point detection Sylvain Arlot

(5)

3/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Example 2: shot detection in a movie

0 200 400 600 800 1000 1200 1400

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

(6)

4/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

The change-point problem

Observation: X1, . . . ,Xn∈ X independent random variables (X: arbitrary measurable set).

PXi: distribution ofXi.

⇒ find where are theabrupt changes in the sequence PX1, . . . ,PXn?

Notation:

τ ∈ TnD :=0, . . . , τD)∈ND+1,0 =τ0 < τ1 <· · ·< τD =n segmentation (of{1, . . . ,n}) into Dτ =D ∈ {1, . . . ,n}segments.

Consistent kernel change-point detection Sylvain Arlot

(7)

5/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Challenges for (multiple) change-point detection

1 Detect changes in the whole distribution (not only the mean) Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007), Fryzlewicz and Subba Rao (2014)

Full distribution: Zou et al. (2014) inR, Matteson & James (2014) inRd

2 High-dimensional data of different nature:

Vectorial: measures inRd, curves (sound recordings,. . . ) Non vectorial: phenotypic data, graphs, DNA sequence,. . . Both vectorial and non vectorial data.

3 Efficient algorithm allowing to deal with large data sets

(8)

5/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Challenges for (multiple) change-point detection

1 Detect changes in the whole distribution (not only the mean) Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007), Fryzlewicz and Subba Rao (2014)

Full distribution: Zou et al. (2014) inR, Matteson & James (2014) inRd

2 High-dimensional data of different nature:

Vectorial: measures inRd, curves (sound recordings,. . . ) Non vectorial: phenotypic data, graphs, DNA sequence,. . . Both vectorial and non vectorial data.

3 Efficient algorithm allowing to deal with large data sets

Consistent kernel change-point detection Sylvain Arlot

(9)

5/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Challenges for (multiple) change-point detection

1 Detect changes in the whole distribution (not only the mean) Mean:

homoscedastic: Birgé & Massart (2001), Comte & Rozenholc (2002, 2004), Baraud, Giraud & Huet (2010)...

heteroscedastic: A. & Celisse (2011)

Mean and variance: Picard et al. (2007), Fryzlewicz and Subba Rao (2014)

Full distribution: Zou et al. (2014) inR, Matteson & James (2014) inRd

2 High-dimensional data of different nature:

Vectorial: measures inRd, curves (sound recordings,. . . ) Non vectorial: phenotypic data, graphs, DNA sequence,. . . Both vectorial and non vectorial data.

3 Efficient algorithm allowing to deal with large data sets

(10)

6/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Kernels: a quick reminder

k :X × X →Rmeasurable is apositive semidefinite kernel if

∀x1, . . . ,xm∈ X, the matrix (k(xi,xj))16i,j6m ispositive semidefinite.

Examples:

linearkernel: k(x,y) =hx,yi,

polynomialkernel: k(x,y) = (1 +hx,yi)p,

Gaussiankernel: k(x,y) = exp(− kxyk2/(2h2)), χ2 kernel on ∆d: k(x,y) = exp

h·d1 Pd i=1

(xi−yi)2 xi+yi

. . .

Consistent kernel change-point detection Sylvain Arlot

(11)

7/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

The kernel least-squares criterion

Least-squares criterion (whenX =R): ∀τ ∈ Tn:=SD>1TnD,

Rbn(τ) := 1 n

D

X

`=1 τ`

X

i=τ`−1+1

(XiXτ`−1+1,τ`)2.

Kernel least-squares criterion:

Rbn(τ) := 1 n

n

X

i=1

k(Xi,Xi)

− 1 n

D

X

`=1

1 τ`τ`−1

τ`

X

i=τ`−1+1 τ`

X

j=τ`−1+1

k(Xi,Xj)

.

The two definitions coincide whenX =R andk(x,y) =xy.

(12)

8/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Kernel change-point detection (KCP)

(A., Celisse & Harchaoui, 2012)

bτ ∈argmin

τ∈Tn

(

kernel least-squares

criterion

z }| {

Rbn(τ) +pen(τ)

| {z }

penalty function

)

where pen is a function increasing withDτ, such as:

pen(τ) = 1

n c1log n−1 Dτ−1

!

+c2Dτ

!

pen(τ) = Dτ n

c1log

n Dτ

+c2

pen(τ) = c1Dτ n .

ForX =R, linear kernel, Birgé & Massart (2001) and Lebarbier (2005) take pen(τ) = σ2nDτ hc1logDn

τ

+c2i.

Consistent kernel change-point detection Sylvain Arlot

(13)

9/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

(Abstract) intuition on KCP

KCP ⇔kernelized version of (penalized) least-squares change-point detection

Canonical feature map Φ :x ∈ X 7→k(x,·)∈ Hreproducing kernel Hilbert space (RKHS)

Yi = Φ(Xi)∈ H are independentH-valued r.v.

E[pk(Xi,Xi)]<∞ ⇒ can define µ?i ∈ H the “mean” ofYi

⇒ KCP detects jumps of the “mean” µ?i of Yi

Remark: ifk is characteristic (eg, Gaussian kernel), µ?i characterizesPXi.

(14)

9/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

(Abstract) intuition on KCP

KCP ⇔kernelized version of (penalized) least-squares change-point detection

Canonical feature map Φ :x ∈ X 7→k(x,·)∈ Hreproducing kernel Hilbert space (RKHS)

Yi = Φ(Xi)∈ H are independentH-valued r.v.

E[pk(Xi,Xi)]<∞ ⇒ can define µ?i ∈ H the “mean” ofYi

⇒ KCP detects jumps of the “mean” µ?i of Yi

Remark: ifk is characteristic (eg, Gaussian kernel), µ?i characterizesPXi.

Consistent kernel change-point detection Sylvain Arlot

(15)

9/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

(Abstract) intuition on KCP

KCP ⇔kernelized version of (penalized) least-squares change-point detection

Canonical feature map Φ :x ∈ X 7→k(x,·)∈ Hreproducing kernel Hilbert space (RKHS)

Yi = Φ(Xi)∈ H are independentH-valued r.v.

E[pk(Xi,Xi)]<∞ ⇒ can define µ?i ∈ H the “mean” ofYi

⇒ KCP detects jumps of the “mean” µ?i of Yi

Remark: ifk is characteristic (eg, Gaussian kernel), µ?i characterizesPXi.

(16)

9/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

(Abstract) intuition on KCP

KCP ⇔kernelized version of (penalized) least-squares change-point detection

Canonical feature map Φ :x ∈ X 7→k(x,·)∈ Hreproducing kernel Hilbert space (RKHS)

Yi = Φ(Xi)∈ H are independentH-valued r.v.

E[pk(Xi,Xi)]<∞ ⇒ can define µ?i ∈ H the “mean” ofYi

⇒ KCP detects jumps of the “mean” µ?i of Yi

Remark: ifk is characteristic (eg, Gaussian kernel), µ?i characterizesPXi.

Consistent kernel change-point detection Sylvain Arlot

(17)

10/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

KCP for fixed D (Harchaoui & Cappé, 2007)

τb(D)∈argmin

τ∈TnD

Rbn(τ) Dynamic programming algorithm

No computation inH, only needs to compute the k(Xi,Xj) (cost Ck)

Complexity of computing τb(D)

16D6Dmax:

time O (Ck+Dmax)n2 and space O(Dmaxn) (Celisse et al., 2016).

(18)

11/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Main assumptions

H separable

Bounded kernel/data:

∃M <+∞,∀i ∈ {1, . . . ,n}, k(Xi,Xi)6M2 a.s. (Db)

⇒ always satisfied for Gaussian and χ2 kernel.

Consistent kernel change-point detection Sylvain Arlot

(19)

12/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

D = D

τ?

known: notations

True segmentation τ?: µ?1 =· · ·=µ?τ?

1 6= µ?τ?

1+1=· · ·=µ?τ?

2 6= · · · 6= µ?τ?

Dτ ?−1+1=· · ·=µ?n. Smallest jump size: ∆ := mini/ µ?

i6=µ?i+1

µ?iµ?i+1H (MMD, Gretton et al. 2006).

Smallest segment length: Λτ := 1nmin16`6Dτ`τ`−1|.

Loss between segmentations τ1, τ2 ∈ Tn: d∞,n1, τ2) := 1

n max

16i6Dτ1−1

16j6Dminτ2−1

τi1τj2

= 1

n max

16i6Dτ1−1

τi1τi2 if Dτ1 =Dτ2 andτ1, τ2 “close”

(20)

13/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

D = D

τ?

known: estimation of change-points locations

Theorem (A. & Garreau, 2016)

Assume: Hseparable,(Db), y >0 and Λτ? >vn(y) := 148Dτ?M2

2 ·y+ logn+ 1

n .

Then, with probability1−e−y,

τb(Dτ?)∈argmin

τ∈TnDτ ?

Rbn(τ) , d∞,n τ?b(Dτ?)6vn(y).

Consistent kernel change-point detection Sylvain Arlot

(21)

14/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

D = D

τ?

known: estimation of change-points locations (2)

Corollary (A. & Garreau, 2016, simplified result) Assume: Hseparable,(Db) and2

M2 & Dτ? Λτ? ·logn

n . Then, with probability1−n−2,

bτ(Dτ?)∈argmin

τ∈TnDτ ?

Rbn(τ) , d∞,n τ?,bτ(Dτ?). Dτ?M2

2 ·logn n .

2

M2 ≈signal-to-noise ratio.

Matches minimax lower bound log(n)/n (Brunel, 2014).

Remark: no log(n) factor in the standard “asymptotic” setting (Korostelev & Tsybakov, 2012).

(22)

15/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

KCP: data-driven D by model selection

Notation: Y = (Y1, . . . ,Yn)∈ Hn,µ? = (µ?1, . . . , µ?n)∈ Hn For anyτ ∈ Tn, Πτ :Hn→ Hn orthogonal projection onto Fτ ={(f1, . . . ,fn)∈ Hn/fτ`−1+1 =· · ·=fτ`∀`= 1, . . . ,Dτ}

⇒ Least-squares estimator µbτ = ΠτY and least-squares criterion:

Rbn(τ) = 1nkY −µbτk2 = 1nPni=1kYi−(µbτ)ik2H

Quadratic riskof µ∈ Hn: R(µ) = 1

nkµ−µ?k2 = 1 n

n

X

i=1

iµ?ik2H .

Usual approach formodel selection: take a penalty such that

∀τ ∈ Tn, pen(τ)>penid(τ) :=R(µ)−Rbn(τ) + cst.

Consistent kernel change-point detection Sylvain Arlot

(23)

15/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

KCP: data-driven D by model selection

Notation: Y = (Y1, . . . ,Yn)∈ Hn,µ? = (µ?1, . . . , µ?n)∈ Hn For anyτ ∈ Tn, Πτ :Hn→ Hn orthogonal projection onto Fτ ={(f1, . . . ,fn)∈ Hn/fτ`−1+1 =· · ·=fτ`∀`= 1, . . . ,Dτ}

⇒ Least-squares estimator µbτ = ΠτY and least-squares criterion:

Rbn(τ) = 1nkY −µbτk2 = 1nPni=1kYi−(µbτ)ik2H Quadratic riskof µ∈ Hn:

R(µ) = 1

nkµ−µ?k2 = 1 n

n

X

i=1

iµ?ik2H .

Usual approach formodel selection: take a penalty such that

∀τ ∈ Tn, pen(τ)>penid(τ) :=R(µ)−Rbn(τ) + cst.

(24)

16/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Oracle inequality for KCP

Theorem (A., Celisse & Harchaoui, 2012–2016) Assume: Hseparable,(Db), y >0, C >119and

∀τ ∈ Tn, pen(τ)> CM2 n

"

log n−1 Dτ−1

! +Dτ

# .

Then, with probability1−e−y,

τb∈argmin

τ∈Tn

n

Rbn(τ) + pen(τ)o, R(µb

bτ)62 inf

τ∈Tn

R(µbτ) + pen(τ) +83yM2 n . applies to pen(τ) = CM2Dτ

n if C >465 log(n).

X =R, linear kernel: Birgé & Massart (2001), Lebarbier (2005).

Consistent kernel change-point detection Sylvain Arlot

(25)

17/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Change-point estimation performance of KCP

Theorem (A. & Garreau, 2016)

Assume: Hseparable,(Db), y >0 and Cmin:= 74

3 (Dτ?+ 1)(y+ logn+ 1) < C < Cmax := ∆2 M2

Λτ? 6Dτ?

n.

Then, with probability1−e−y:

τb∈argmin

τ∈Tn

(

Rbn(τ) +CM2Dτ n

)

, D

bτ =Dτ? and d∞,n τ?b6vn(y) := 148Dτ?M2

2 ·y+ logn+ 1

n .

Previous works (Lavielle & Moulines, 2000, among many others):

real case (H=R) only (with dependent data).

(26)

18/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Change-point estimation performance of KCP (2)

Corollary (A. & Garreau, 2016, simplified result) Assume: Hseparable,(Db) and

Dτ?logn.C . ∆2 M2

Λτ? Dτ?

n.

Then, with probability1−n−2:

τb∈argmin

τ∈Tn

(

Rbn(τ) +CM2Dτ

n )

, D

bτ =Dτ?

and d∞,n τ?b. Dτ?M2

2 ·logn n .

2

M2 ≈signal-to-noise ratio.

Lower bound onC: log(n) necessary (Birgé & Massart, 2007).

Consistent kernel change-point detection Sylvain Arlot

(27)

19/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Oracle inequality: proof ideas

Notation: ε=Yµ? ∈ Hn Ideal penalty:

penid(τ) :=R(µ)−Rbn(τ) +1 nkεk2

= 2

nτµ?µ?, εi

| {z }

=−Lτ(linear term)

+ 2

nτεk2

| {z }

=Qτ (quadratic term)

Concentration forLτ and Qτ around their expectations

⇒ show that pen(τ)>penid(τ) simultaneously for all τ ∈ Tn, with probability >1−e−y.

Previous work (Birgé & Massart, 2001): Gaussian assumption + real-valued functions ⇒ does not apply to RKHS case.

(28)

20/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Concentration of the quadratic term

Proposition (A., Celisse & Harchaoui, 2012–2016)

Assume: Hseparable and(Db). Then, for every τ ∈ Tn, x >0:

τεk2E

hτεk2i

6 14M2

3 (x+ 2

2x Dτ) , with probability at least 1−e−x.

Proof ideas:

Pinelis-Sakhanenko’s inequality (kPi∈λεikH).

Bernstein’s inequality (upper bounding moments).

Consistent kernel change-point detection Sylvain Arlot

(29)

21/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Concentration of the linear term

Proposition

Assume: Hseparable and(Db). Then, for every τ ∈ Tn, x >0, with probability at least1−2e−x:

|hΠτµ?µ?, εi|6θτµ?µ?k2+ 1

2θ+4 3

M2x ,

for everyθ >0.

Proof: Bernstein’s inequality.

(30)

22/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Identification of change-points: proof ideas

τb∈argmin

τ∈Tn

Rbn(τ) + pen(τ) Empirical risk:

Rbn(τ) = 1

n?−Πτµ?k2

| {z }

=Aτ(approximation)

+2

n?−Πτµ?, εi

| {z }

=Lτ(linear term)

− 1

nτεk2

| {z }

=Qτ (quadratic term)

+ 1 nkεk2

| {z }

(constant)

Previous concentration inequalities forLτ,Qτ. Deterministic bounds onAτ:

Dτ <Dτ?1nAτ > 12Λτ?2 (for showingD

bτ >Dτ?)

1

nAτ > 1 2min

Λτ?,d∞,n?, τ)

2 (forτb(Dτ?))

Consistent kernel change-point detection Sylvain Arlot

(31)

23/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Constant mean and variance: data

Constant mean and variance: the distribution ofXi is chosen amongB(0.5), N(0.5,0.25) andE(0.5).

0 100 200 300 400 500 600 700 800 900 1000

−1 0 1 2 3 4

(32)

24/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Constant mean and variance: results (D

τ?

)

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

Linear Hermite Gaussian

KCP withDτ? known.

Consistent kernel change-point detection Sylvain Arlot

(33)

25/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Constant mean and variance: results ( D)

c

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

Linear Hermite Gaussian

KCP withDb data-driven.

(34)

26/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Histogram-valued data

Xid-dimensional simplex, Dirichlet distribution (p1`, . . . ,pd`) on the`-th segment, with pi` independent∼ U([0,0.2]).

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

0 100 200 300 400 500 600 700 800 900 1000

0 0.5 1 1.5

(first three coordinates)

Consistent kernel change-point detection Sylvain Arlot

(35)

27/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Histogram-valued data: results (D

τ?

)

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

χ2 kernel Gaussian kernel

KCP withDτ? known.

(36)

28/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Histogram-valued data: results ( D)

c

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

100 200 300 400 500 600 700 800 900 1000 0

0.1 0.2 0.3 0.4 0.5 0.6

Position

Freq. of selected chgpts

χ2 kernel Gaussian kernel

KCP withDb data-driven.

Consistent kernel change-point detection Sylvain Arlot

(37)

29/29

Introduction D=Dτ ? D=

bD Experiments Conclusion

Conclusion

Take-home message:

Kernelized version of penalized least-squares change-point detection (eg, Lebarbier, 2005).

Detection ofchanges in the distribution, not only the first moments.

Can deal withstructured data.

Under reasonable assumptions and for a class of penalty functions:

oracle inequality;

identifies the correctnumber of change-points;

estimates at the correct rate thechange-points locations.

Future work:

Unbounded data/kernel.

Dependent data?

Learn how to choose the kernel.

Références

Documents relatifs

Under very general conditions for the observations, we show that the popular Shiryaev–Roberts procedure is asymptotically optimal, as the local probability of false alarm goes to

In this paper we propose an algorithm for a particular change- point detection problem where the frequency band of the signal changes at some points in the time axis.. Apart

Abstract Elementary techniques from operational calculus, differential al- gebra, and noncommutative algebra lead to a new approach for change-point detection, which is an

For large sample size, we show that the median heuristic behaves approximately as the median of a distribution that we describe completely in the setting of kernel two-sample test

In this case the resi- dual is the difference between the measured level and the level calculated using the NRC Traffic Noise Prediction model.' This was done by assuming that

Comme le taux de survivants du cancer infantile est en augmentation, les rapports sur les troubles du sommeil et la fatigue.. au cours des années suivant le traitement et durant

Adding or removing data from the structure requires careful reor- ganisation to maintain the integrity of the database: the ordering of the features in accordance with their weight

L’objectif de ce travail est de montrer la place de la spectrophotométrie infrarouge dans la prise en charge de la lithiase urinaire, en particulier dans la