Segmentation in the mean of heteroscedastic data via resampling or cross-validation

40  Download (0)

Full text

(1)

HAL Id: hal-02816806

https://hal.inrae.fr/hal-02816806

Submitted on 6 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Segmentation in the mean of heteroscedastic data via resampling or cross-validation

Alain Celisse, Sylvain Arlot

To cite this version:

Alain Celisse, Sylvain Arlot. Segmentation in the mean of heteroscedastic data via resampling or cross- validation. Workshop Change-Point Detection Methods and Applications, Sep 2008, Paris, France.

�hal-02816806�

(2)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Segmentation of the mean of heterosedasti data via

ross-validation

SylvainArlotand Alain Celisse

April8, 2009

Abstrat

Thispapertaklestheproblem ofdetetingabrupthangesinthemeanofahet-

erosedasti signal by model seletion, without knowledge on the variations of the

noise. A newfamilyof hange-pointdetetionproedures is proposed,showing that

ross-validationmethodsanbesuessfulintheheterosedastiframework,whereas

mostexisting proeduresarenotrobusttoheterosedastiity. Therobustnesstohet-

erosedastiity of the proposed proedures is supported by an extensive simulation

study, together with reent theoretial results. An appliation to ComparativeGe-

nomiHybridization(CGH)dataisprovided,showingthatrobustnesstoheterosedas-

tiityanindeed berequiredfortheiranalysis.

1 Introdution

Theproblemtakledinthepaperisthedetetionofabrupthangesinthemeanofasignal

withoutassumingitsvarianeisonstant. Modelseletion andross-validation tehniques

are usedfor buildinghange-point detetion proedures that signiantly improve on ex-

isting proedures when the variane of the signal is not onstant. Before detailing the

approahand themainontributions ofthepaper,letusmotivatetheproblemandbriey

reallsome relatedworksinthehange-point detetionliterature.

1.1 Change-point detetion

Thehange-point detetionproblem,alsoalledone-dimensionalsegmentation,dealswith

astohastiproessthedistributionofwhihabruptlyhangesatsomeunknowninstants.

The purpose is to reover theloation of these hanges and their number. This problem

is motivated by a wide range of appliations, suh as voie reognition, nanial time-

series analysis[29℄ andComparative Genomi Hybridization (CGH) dataanalysis[35℄. A

large literature existsabout hange-point detetionin manyframeworks [see 12 ,17 , for a

omplete bibliography℄.

Therstpapersonhange-pointdetetionweredevotedtothesearhfortheloationof

a unique hange-point, alsonamed breakpoint [see 34 ,for instane℄. Looking formultiple

hange-points is a harder task and has been studied later. For instane, Yao [49 ℄ used

the BICriterion fordeteting multiplehange-pointsinaGaussian signal,and Miaoand

Zhao [33℄proposed anapproahrelying on rankstatistis.

(3)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

The setting of the paper is the following. The values

Y

1

, . . . , Y

n

∈ R

of a noisy signal

at points

t

1

, . . . , t

n areobserved, with

Y

i

= s(t

i

) + σ(t

i

i

, E [ǫ

i

] = 0

and

Var(ǫ

i

) = 1 .

(1)

Thefuntion

s

isalledtheregression funtionandisassumedtobepieewise-onstant,or atleastwellapproximatedbypieewiseonstantfuntions,thatis,

s

issmootheverywhere

exept at a few breakpoints. The noise terms

ǫ

1

, . . . , ǫ

n are assumed to be independent and identially distributed. No assumption is made on

σ : [0, 1] 7→ [0, ∞ )

. Note that all

data

(t

i

, Y

i

)

1≤i≤nareobservedbeforedetetingthehange-points, asettingwhihisalled o-line.

As pointedout byLavielle [28 ℄, multiple hange-point detetion proedures generally

takle one amongthefollowing three problems:

1. Deteting hangesinthe mean

s

assuming the standard-deviation

σ

is onstant,

2. Deteting hangesinthestandard-deviation

σ

assuming themean

s

is onstant,

3. Detetinghangesinthewholedistributionof

Y

,withnodistintionbetweenhanges

inthe mean

s

, hanges in the standard-deviation

σ

and hanges inthe distribution of

ǫ

.

In appliations suh as CGH data analysis, hanges in the mean

s

have an important

biologial meaning, sine they orrespond to the limits of amplied or deleted areas of

hromosomes. However in the CGH setting, the standard-deviation

σ

is not always on-

stant, as assumed in problem 1. See Setion 6 for more details on CGH data, for whih

heterosedastiitythatis, variations of

σ

orrespond to experimental artefats or bio- logial nuisane that shouldberemoved.

Therefore,CGHdataanalysisrequiresto solve afourthproblem,whihisthepurpose

of thepresent artile:

4. Deteting hanges in the mean

s

with no onstraint on the standard-deviation

σ : [0, 1] 7→ [0, ∞ )

.

Comparedtoproblem1,the diereneisthepreseneofanadditional nuisane parameter

σ

making problem4 harder. Up to the best of our knowledge, no hange-point detetion proedurehas everbeen proposed forsolving problem 4with no prior information on

σ

.

1.2 Model seletion

Modelseletion isa suessfulapproah for multiplehange-point detetion, asshown by

Lavielle [28℄ and by Lebarbier [30℄ for instane. Indeed, a set of hange-pointsalled a

segmentationisnaturallyassoiatedwiththesetofpieewise-onstantfuntionsthatmay

onlyjumpatthesehange-points. Givenasetoffuntions(alledamodel),estimationan

be performed by minimizing the least-squares riterion (or other riteria, see Setion 3).

Therefore,detetinghangesinthe meanof asignal, thatisthehoie ofa segmentation,

amounts to seletsuh amodel.

Morepreisely,givenaolletionofmodels

{ S

m

}

m∈Mn

andtheassoiatedolletionof

least-squares estimators

{b s

m

}

m∈Mn,the purposeof modelseletion isto provide a model

index

m b

suhthat

s b

mb reahes thebestperformane amongall estimators

{b s

m

}

m∈Mn.

(4)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Model seletion an target two dierent goals. On the one hand, a model seletion

proedureiseient whenits quadratiriskissmallerthan thesmallestquadrati riskof

the estimators

{b s

m

}

m∈Mn, up to a onstant fator

C

n

≥ 1

. Suh a property is alled an

orale inequality when it holds for every nite sample size. The proedure is said to be

asymptoti eient whenthe previous property holdswith

C

n

→ 1

as

n

tendsto innity.

Asymptoti eieny isthegoal ofAIC[2, 3℄and Mallows'

C

p [32℄,among manyothers.

On the otherhand, assuming that

s

belongs to one of themodels

{ S

m

}

m∈Mn, a pro-

edureismodelonsistentwhenithoosesthesmallestmodelontaining

s

asymptotially withprobability one. Model onsistenyis the goal of BIC [39℄ for instane. See also the

artile byYang [46℄ about the distintion between eieny andmodel onsisteny.

In the present paper as in [30 ℄, the quality of a multiple hange-point detetion pro-

edure isassessed bythequadrati risk; hene, a hange in the mean hidden by the noise

should not be deteted. Thishoie is motivatedbyappliations where the signal-to-noise

ratio maybe small,sothat exatlyreovering everytrue hange-point ishopeless. There-

fore,eientmodelseletion proedureswillbeusedinordertodetet thehange-points.

Withoutprior informationonthe loationsofthehange-points, thenatural olletion

of models for hange-point detetion depends on the sample size

n

. Indeed, there exist

n−1 D−1

dierentpartitionsofthe

n

designpointsinto

D

intervals,eahpartitionorrespond-

ingto a set of

(D − 1)

hange-points. Sine

D

an take anyvaluebetween 1 and

n

,

2

n−1

modelsanbeonsidered. Therefore,modelseletionproeduresusedformultiplehange-

pointdetetionhaveto satisfynon-asymptotiorale inequalities: theolletionofmodels

annotbeassumedto bexedwiththesamplesize

n

tendingto innity. (SeeSetion 2.3

for apreise denitionof the olletion

{ S

m

}

m∈Mn usedfor hange-point detetion.) Most model seletion results onsider polynomial olletions of models

{ S

m

}

m∈Mn,

thatis

Card( M

n

) ≤ Cn

α for some onstants

C, α ≥ 0

. For polynomial olletions,proe-

dureslikeAICorMallows'

C

pareprovedtosatisfyoraleinequalitiesinvariousframeworks [9, 15,10,16 ℄,assuming thatdataarehomosedasti, thatis,

σ(t

i

)

doesnot depend on

t

i.

Howeverasshownin[6℄,Mallows'

C

p issuboptimalwhendataareheterosedasti,that is the variane is non-onstant. Therefore, other proedures must be used. For instane,

resampling penalization is optimal with heterosedasti data [5℄. Another approah has

been explored by Gendre [25 ℄, whih onsists insimultaneously estimating the mean and

thevariane,using apartiular polynomial olletionofmodels.

However in hange-point detetion, the olletion of models is exponential, that is

Card( M

n

)

isoforder

exp(αn)

for some

α > 0

. For suhlargeolletions,espeiallylarger

than polynomial, the above penalization proedures fail. Indeed, Birgé and Massart [16℄

proved that the minimal amount of penalization required for a proedure to satisfy an

orale inequality isofthe form

pen(m) = c

1

σ

2

D

m

n + c

2

σ

2

D

m

n log

n D

m

,

(2)

where

c

1 and

c

2 are positive onstants and

σ

2 is the variane of the noise, assumed to

be onstant. Lebarbier [30℄ proposed

c

1

= 5

and

c

2

= 2

for optimizing the penalty (2)

in the ontext of hange-point detetion. Penalties similar to (2) have been introdued

independentlybyotherauthors[38 ,1,11,45 ℄andareshowntoprovidesatisfatoryresults.

(5)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Nevertheless,alltheseresultsassumethatdataarehomosedasti. Atually,themodel

seletion problem with heterosedasti data and an exponential olletion of models has

neverbeen onsideredinthe literature, up to the best ofour knowledge.

Furthermore,penaltiesoftheform(2)areverylosetobeproportionalto

D

m,atleast

forsmallvaluesof

D

m. Therefore,the resultsof[6 ℄leadtoonjeturethatthepenalty(2)

is suboptimalfor model seletion overanexponential olletionof models,when dataare

heterosedasti. Thesuggestof thispaperisto use ross-validation methods instead.

1.3 Cross-validation

Cross-validation(CV)methods allowtoestimate(almost)unbiasedly thequadratiriskof

anyestimator, suh as

b s

m (see Setion 3.2 about the heuristis underlying CV).Classial

examples of CV methods are the leave-one-out [Loo, 27, 43 ℄ and

V

-fold ross-validation

[VFCV, 23,24 ℄. Morerefereneson ross-validationan befound in[7 , 19℄for instane.

CV an be used for model seletion, by hoosing the model

S

m for whih the CV

estimate of the risk of

b s

m is minimal. The properties of CV for model seletion with

a polynomial olletion of models and homosedasti data have been widely studied. In

short,CVisknowntoadapttoawiderangeofstatistialsettings,fromdensityestimation

[42 ,20℄toregression[44 ,48 ℄andlassiation[26,47 ℄. Inpartiular,Looisasymptotially

equivalent to AIC or Mallows'

C

p in several frameworks where they are asymptotially optimal,andotherCVmethodshavesimilarperformanes,providedthesizeofthetraining

sample is lose enough to the sample size [see for instane 31, 40, 22℄. In addition, CV

methodsarerobust toheterosedastiityof data[5, 7℄,aswell asseveral otherresampling

methods [6 ℄. Therefore, CV is a natural alternative to penalization proedures assuming

homosedastiity.

Nevertheless,nearlynothingisknownaboutCVformodelseletionwithanexponential

olletionofmodels,suhasinthehange-pointdetetionsetting. Theliteratureonmodel

seletion and CV [14, 40 ,16, 21℄only suggests thatminimizing diretlythe Loo estimate

of the riskover

2

n−1 models wouldleadto overtting.

In thispaper,a remarkmade byBirgéand Massart[16 ℄ about penalization proedure

is used for solving this issue in the ontext of hange-point detetion. Model seletion is

perfomed in two steps: First, hoose a segmentation given the number of hange-points;

seond,hoosethenumberofhange-points. CVmethodsanbeusedateahstep,leading

to Proedure 6 (Setion 5). The papershows that suh an approah is indeed suessful

for detetinghanges inthemeanof aheterosedastisignal.

1.4 Contributions of the paper

The main purpose of the present work is to design a CV-based model seletion proe-

dure (Proedure 6) that an be used for deteting multiple hanges in the mean of a

heterosedastisignal. Suhaproedureexperimentallyadaptstoheterosedastiitywhen

the olletion of models is exponential, whih hasneverbeen obtained before. In parti-

ular, Proedure 6 is a reliable alternative to Birgé and Massart's penalization proedure

[15 ℄ whendataan beheterosedasti.

Another major diultytakledinthis paperisthe omputational ostof resampling

methods when seleting among

2

n models. Even when the number

(D − 1)

of hange-

(6)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

points isgiven, exploring the

n−1 D−1

partitions of

[0, 1]

into

D

intervals and performing a

resampling algorithm for eah partition is not feasible when

n

is large and

D > 0

. An

implementation of Proedure 6 witha tratable omputational omplexity is proposedin

thepaper,usinglosed-formformulasforLeave-

p

-out(Lpo)estimatorsoftherisk,dynami

programming, and

V

-foldross-validation.

The paper also points out that least-squares estimators are not reliable for hange-

point detetion when the number of breakpoints is given, although they are widely used

to this purpose in the literature. Indeed, experimental and theoretial results detailed in

Setion3.1showthatleast-squaresestimatorssuerfromloaloverttingwhenthevariane

of the signalis varyingoverthe sequene of observations. Onthe ontrary,minimizers of

the Lpo estimator of the risk do not suer from this drawbak, whih emphasizes the

interestof using ross-validation methods inthe ontext ofhange-point detetion.

The paperis organizedasfollows. ThestatistialframeworkisdesribedinSetion 2.

First,theproblemofseleting thebest segmentation giventhenumberof hange-points

is takled in Setion 3. Theoretial results and an extensive simulation study show that

the usual minimization of the least-squares riterion an be misleading when data are

heterosedasti, whereas ross-validation-based proedures provide satisfatory results in

the same framework.

Then, the problem of hoosing the number of breakpoints from data is addressed in

Setion4 . Assupportedbyanextensivesimulation study,

V

-foldross-validation(VFCV) leads to a omputationally feasible and statistially eient model seletion proedure

whendataareheterosedasti,ontraryto proedures impliitly assuminghomosedasti-

ity.

The resampling methods of Setions 3 and 4 are ombined in Setion 5, leading to a

familyofresampling-basedproeduresfordetetinghangesinthemeanofaheterosedas-

tisignal. Awide simulationstudy showstheyperform wellwithboth homosedasti and

heterosedastidata, signiantly improving theperformaneof proedures whih impli-

itlyassume homosedastiity.

Finally,Setion 6illustratesonarealdatasetthepromisingbehaviouroftheproposed

proedures for analyzing CGH miroarray data, ompared to proedures previously used

inthis setting.

2 Statistial framework

In this setion, the statistial framework of hange-point detetion via model seletion is

introdued, aswell assome notation.

2.1 Regression on a xed design

Let

S

denote the set of measurable funtions

[0, 1] 7→ R

. Let

t

1

< · · · < t

n

∈ [0, 1]

be

some deterministi design points,

s ∈ S

and

σ : [0, 1] 7→ [0, ∞ )

be some funtions and

dene

∀ i ∈ { 1, . . . , n } , Y

i

= s(t

i

) + σ(t

i

i

,

(3)

where

ǫ

1

, . . . , ǫ

nareindependentandidentiallydistributedrandomvariableswith

E [ǫ

i

] = 0

and

E

ǫ

2i

= 1

.

(7)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

As explained inSetion 1.1 , thegoal is to nd from

(t

i

, Y

i

)

1≤i≤n a pieewise-onstant funtion

f ∈ S

lose to

s

intermsof the quadrati loss

k s − f k

2n

:= 1 n

X

n i=1

(f (t

i

) − s(t

i

))

2

.

2.2 Least-squares estimator

A lassial estimator of

s

is the least-squares estimator, dened as follows. For every

f ∈ S

,the least-squares riterion at

f

isdened by

P

n

γ (f ) := 1 n

X

n i=1

(Y

i

− f (t

i

))

2

.

The notation

P

n

γ(f )

means that the funtion

(t, Y ) 7→ γ(f; (t, Y )) := (Y − f (t))

2 is

integrated with respet to the empirial distribution

P

n

:= n

−1

P

n

i=1

δ

(ti,Yi).

P

n

γ(f)

is

also alledthe empirial risk of

f

.

Then, given a set

S ⊂ S

of funtions

[0, 1] 7→ R

(alled a model), the least-squares estimatoron model

S

is

ERM(S; P

n

) := arg min

f∈S

{ P

n

γ(f) } .

The notation

ERM(S; P

n

)

stresses that the least-squares estimator is the output of the empirial risk minimization algorithm over

S

,whih takes a model

S

and a data sample

as inputs. When a olletion of models

{ S

m

}

m∈Mn

is given,

b s

m

(P

n

)

or

b s

m are shortuts

for

ERM(S

m

; P

n

)

.

2.3 Colletion of models

Sine thegoal is to detet jumps of

s

,every modelonsidered in this artile isthe set of

pieewise onstant funtionswith respetto some partition of

[0, 1]

.

Forevery

K ∈ { 1, . . . , n − 1 }

andeverysequeneofintegers

α

0

= 1 < α

1

< α

2

< · · · <

α

K

≤ n

(thebreakpoints),

(I

λ

)

λ∈Λ

1,...αK)

denotes the partition

[t

α0

; t

α1

), . . . , [t

αK−1

; t

αK

), [t

αK

; 1]

of

[0, 1]

into

(K + 1)

intervals. Then, themodel

S

1,...αK) isdenedasthesetofpieewise

onstant funtionsthatan only jumpat

t = t

αj for some

j ∈ { 1, . . . , K }

.

For every

K ∈ { 1, . . . , n − 1 }

, let

M f

n

(K + 1)

denote the set of suh sequenes

1

, . . . α

K

)

of length

K

, so that

{ S

m

}

m∈Mfn(K+1) is the olletion of models of piee-

wise onstant funtions with

K

breakpoints. When

K = 0

,

M f

n

(1) := {∅}

and the

model

S

is the linear spae of onstant funtions on

[0, 1]

. Remark that for every

K

and

m ∈ M f

n

(K + 1)

,

S

m isavetorspaeofdimension

D

m

= K + 1

. Intherestofthepa-

per,therelationship betweenthenumberofbreakpoints

K

and thedimension

D = K + 1

ofthemodel

S

1,...αK)isusedrepeatedly;inpartiular,estimatingofthenumberofbreak- points(Setion4) isequivalent tohoosingthedimension ofa model. Inaddition, sine a

model

S

m isuniquely dened by

m

,theindex

m

isalso alleda model.

Figure

Updating...

References