Segmentation in the mean of heteroscedastic data via resampling or cross-validation

(1)

HAL Id: hal-02816806

https://hal.inrae.fr/hal-02816806

Submitted on 6 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Segmentation in the mean of heteroscedastic data via resampling or cross-validation

Alain Celisse, Sylvain Arlot

To cite this version:

Alain Celisse, Sylvain Arlot. Segmentation in the mean of heteroscedastic data via resampling or cross- validation. Workshop Change-Point Detection Methods and Applications, Sep 2008, Paris, France.

�hal-02816806�

(2)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Segmentation of the mean of heterosedasti data via

ross-validation

SylvainArlotand Alain Celisse

April8, 2009

Abstrat

Thispapertaklestheproblem ofdetetingabrupthangesinthemeanofahet-

erosedasti signal by model seletion, without knowledge on the variations of the

noise. A newfamilyof hange-pointdetetionproedures is proposed,showing that

ross-validationmethodsanbesuessfulintheheterosedastiframework,whereas

mostexisting proeduresarenotrobusttoheterosedastiity. Therobustnesstohet-

erosedastiity of the proposed proedures is supported by an extensive simulation

study, together with reent theoretial results. An appliation to ComparativeGe-

nomiHybridization(CGH)dataisprovided,showingthatrobustnesstoheterosedas-

tiityanindeed berequiredfortheiranalysis.

1 Introdution

Theproblemtakledinthepaperisthedetetionofabrupthangesinthemeanofasignal

withoutassumingitsvarianeisonstant. Modelseletion andross-validation tehniques

are usedfor buildinghange-point detetion proedures that signiantly improve on ex-

isting proedures when the variane of the signal is not onstant. Before detailing the

approahand themainontributions ofthepaper,letusmotivatetheproblemandbriey

reallsome relatedworksinthehange-point detetionliterature.

1.1 Change-point detetion

Thehange-point detetionproblem,alsoalledone-dimensionalsegmentation,dealswith

astohastiproessthedistributionofwhihabruptlyhangesatsomeunknowninstants.

The purpose is to reover theloation of these hanges and their number. This problem

is motivated by a wide range of appliations, suh as voie reognition, nanial time-

series analysis[29℄ andComparative Genomi Hybridization (CGH) dataanalysis[35℄. A

large literature existsabout hange-point detetionin manyframeworks [see 12 ,17 , for a

omplete bibliography℄.

Therstpapersonhange-pointdetetionweredevotedtothesearhfortheloationof

a unique hange-point, alsonamed breakpoint [see 34 ,for instane℄. Looking formultiple

hange-points is a harder task and has been studied later. For instane, Yao [49 ℄ used

the BICriterion fordeteting multiplehange-pointsinaGaussian signal,and Miaoand

Zhao [33℄proposed anapproahrelying on rankstatistis.

(3)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

The setting of the paper is the following. The values

Y

₁

, . . . , Y

_n

∈ R

^of ^a ^noisy ^signal

at points

t

₁

, . . . , t

_n ^are^observed, ^with

Y

i

= s(t

i

) + σ(t

i

)ǫ

i

, E [ǫ

i

] = 0

^and

Var(ǫ

i

) = 1 .

⁽¹⁾

Thefuntion

s

îsâlled^the^regression ^funtionândîsâssumed^to^bepieewise-onstant,or atleastwellapproximatedbypieewiseonstantfuntions,thatis,

s

^is^smooth^everywhere

exept at a few breakpoints. The noise terms

ǫ

1

, . . . , ǫ

n ^are ^assumed ^to ^be independent and identially distributed. No assumption is made on

σ : [0, 1] 7→ [0, ∞ )

^. ^Note ^that ^all

data

(t

_i

, Y

_i

)

_1≤i≤n^are^observed^before^deteting^thehange-points, asettingwhihisalled o-line.

As pointedout byLavielle [28 ℄, multiple hange-point detetion proedures generally

takle one amongthefollowing three problems:

1. Deteting hangesinthe mean

s

^assuming ^the standard-deviation

σ

^is ^onstant,

2. Deteting hangesinthestandard-deviation

σ

^assuming ^the^mean

s

^is ^onstant,

3. Detetinghangesinthewholedistributionof

Y

^,^with^no^distintion^between^hanges

inthe mean

s

^, ^hanges ⁱⁿ ^the standard-deviation

σ

^and ^hanges ⁱⁿ^the distribution of

ǫ

^.

In appliations suh as CGH data analysis, hanges in the mean

s

^have ^an ^important

biologial meaning, sine they orrespond to the limits of amplied or deleted areas of

hromosomes. However in the CGH setting, the standard-deviation

σ

îs ^not âlways ôn-

stant, as assumed in problem 1. See Setion 6 for more details on CGH data, for whih

heterosedastiitythatis, variations of

σ

^orrespond ^to experimental artefats or biologial nuisane that shouldberemoved.

Therefore,CGHdataanalysisrequiresto solve afourthproblem,whihisthepurpose

of thepresent artile:

4. Deteting hanges in the mean

s

^with ^no ^onstraint ^on ^the standard-deviation

σ : [0, 1] 7→ [0, ∞ )

^.

Comparedtoproblem1,the diereneisthepreseneofanadditional nuisane parameter

σ

^making ^problem⁴ ^harder. Ûp ^to ^the ^best ôf ôur ^knowledge, ^no hange-point detetion proedurehas everbeen proposed forsolving problem 4with no prior information on

σ

^.

1.2 Model seletion

Modelseletion isa suessfulapproah for multiplehange-point detetion, asshown by

Lavielle [28℄ and by Lebarbier [30℄ for instane. Indeed, a set of hange-pointsalled a

segmentationisnaturallyassoiatedwiththesetofpieewise-onstantfuntionsthatmay

onlyjumpatthesehange-points. Givenasetoffuntions(alledamodel),estimationan

be performed by minimizing the least-squares riterion (or other riteria, see Setion 3).

Therefore,detetinghangesinthe meanof asignal, thatisthehoie ofa segmentation,

amounts to seletsuh amodel.

Morepreisely,givenaolletionofmodels

{ S

_m

}

m∈Mn

andtheassoiatedolletionof

least-squares estimators

{b s

m

}

_m∈M_n^,^the ^purposeôf ^model^seletion îs^to ^provide â ^model

index

m b

^suh^that

s b

_m_b ^reahes ^the^best^performane âmongâll êstimators

{b s

_m

}

m∈Mn^.

(4)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Model seletion an target two dierent goals. On the one hand, a model seletion

proedureiseient whenits quadratiriskissmallerthan thesmallestquadrati riskof

the estimators

{b s

m

}

m∈Mn^, ûp ^to â ônstant ^fator

C

n

≥ 1

^. ^Suh â ^property îs âlled ân

orale inequality when it holds for every nite sample size. The proedure is said to be

asymptoti eient whenthe previous property holdswith

C

_n

→ 1

^as

n

^tends^to ^innity^.

Asymptoti eieny isthegoal ofAIC[2, 3℄and Mallows'

C

p ^[32℄,^among ^many^others.

On the otherhand, assuming that

s

^belongs ^to ^one ^of ^the^models

{ S

_m

}

m∈Mn^, ^a ^pro-

edureismodelonsistentwhenithoosesthesmallestmodelontaining

s

asymptotially withprobability one. Model onsistenyis the goal of BIC [39℄ for instane. See also the

artile byYang [46℄ about the distintion between eieny andmodel onsisteny.

In the present paper as in [30 ℄, the quality of a multiple hange-point detetion pro-

edure isassessed bythequadrati risk; hene, a hange in the mean hidden by the noise

should not be deteted. Thishoie is motivatedbyappliations where the signal-to-noise

ratio maybe small,sothat exatlyreovering everytrue hange-point ishopeless. There-

fore,eientmodelseletion proedureswillbeusedinordertodetet thehange-points.

Withoutprior informationonthe loationsofthehange-points, thenatural olletion

of models for hange-point detetion depends on the sample size

n

^. ^Indeed, ^there ^exist

n−1 D−1

dierentpartitionsofthe

n

^design^points^into

D

întervals,êah^partitionôrrespond-

ingto a set of

(D − 1)

hange-points. Sine

D

ân ^take âny^value^between ¹ ând

n

^,

2

ⁿ⁻¹

modelsanbeonsidered. Therefore,modelseletionproeduresusedformultiplehange-

pointdetetionhaveto satisfynon-asymptotiorale inequalities: theolletionofmodels

annotbeassumedto bexedwiththesamplesize

n

^tending^to ^innity^. ^(See^Setion ^2.3

for apreise denitionof the olletion

{ S

_m

}

_m∈M_n ^used^for hange-point detetion.) Most model seletion results onsider polynomial olletions of models

{ S

_m

}

m∈Mn^,

thatis

Card( M

n

) ≤ Cn

^α ^for ^some ^onstants

C, α ≥ 0

^. ^F^or ^polynomial ^olletions,^proe-

dureslikeAICorMallows'

C

p^are^proved^to^satisfy^oraleinequalitiesinvariousframeworks [9, 15,10,16 ℄,assuming thatdataarehomosedasti, thatis,

σ(t

_i

)

^does^not ^depend ^on

t

_i^.

Howeverasshownin[6℄,Mallows'

C

_p ^is^suboptimal^when^data^areheterosedasti,that is the variane is non-onstant. Therefore, other proedures must be used. For instane,

resampling penalization is optimal with heterosedasti data [5℄. Another approah has

been explored by Gendre [25 ℄, whih onsists insimultaneously estimating the mean and

thevariane,using apartiular polynomial olletionofmodels.

However in hange-point detetion, the olletion of models is exponential, that is

Card( M

n

)

îsôfôrder

exp(αn)

^for ^some

α > 0

^. ^Fôr ^suh^largeôlletions,êspeially^larger

than polynomial, the above penalization proedures fail. Indeed, Birgé and Massart [16℄

proved that the minimal amount of penalization required for a proedure to satisfy an

orale inequality isofthe form

pen(m) = c

₁

σ

²

D

_m

n + c

₂

σ

²

D

_m

n log

n D

_m

,

⁽²⁾

where

c

₁ ^and

c

₂ âre ^positive ônstants ând

σ

² îs ^the ^variane ôf ^the ^noise, âssumed ^to

be onstant. Lebarbier [30℄ proposed

c

₁

= 5

^and

c

₂

= 2

^for ^optimizing ^the ^penalty ⁽²⁾

in the ontext of hange-point detetion. Penalties similar to (2) have been introdued

independentlybyotherauthors[38 ,1,11,45 ℄andareshowntoprovidesatisfatoryresults.

(5)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

Nevertheless,alltheseresultsassumethatdataarehomosedasti. Atually,themodel

seletion problem with heterosedasti data and an exponential olletion of models has

neverbeen onsideredinthe literature, up to the best ofour knowledge.

Furthermore,penaltiesoftheform(2)areverylosetobeproportionalto

D

_m^,^at^least

forsmallvaluesof

D

_m^. ^Therefore,^the ^results^of^{[6 ℄}^lead^to^onjeture^that^the^penalty⁽²⁾

is suboptimalfor model seletion overanexponential olletionof models,when dataare

heterosedasti. Thesuggestof thispaperisto use ross-validation methods instead.

1.3 Cross-validation

Cross-validation(CV)methods allowtoestimate(almost)unbiasedly thequadratiriskof

anyestimator, suh as

b s

_m ^(see ^Setion ^3.2 ^about ^the ^heuristis ^underlying ^CV).^Classial

examples of CV methods are the leave-one-out [Loo, 27, 43 ℄ and

V

^-fold ^ross-v^alidation

[VFCV, 23,24 ℄. Morerefereneson ross-validationan befound in[7 , 19℄for instane.

CV an be used for model seletion, by hoosing the model

S

_m ^for ^whih ^the ^CV

estimate of the risk of

b s

_m ^is ^minimal. ^The ^properties ^of ^CV ^for ^model ^seletion ^with

a polynomial olletion of models and homosedasti data have been widely studied. In

short,CVisknowntoadapttoawiderangeofstatistialsettings,fromdensityestimation

[42 ,20℄toregression[44 ,48 ℄andlassiation[26,47 ℄. Inpartiular,Looisasymptotially

equivalent to AIC or Mallows'

C

p ⁱⁿ ^several ^frameworks ^where ^they ^are asymptotially optimal,andotherCVmethodshavesimilarperformanes,providedthesizeofthetraining

sample is lose enough to the sample size [see for instane 31, 40, 22℄. In addition, CV

methodsarerobust toheterosedastiityof data[5, 7℄,aswell asseveral otherresampling

methods [6 ℄. Therefore, CV is a natural alternative to penalization proedures assuming

homosedastiity.

Nevertheless,nearlynothingisknownaboutCVformodelseletionwithanexponential

olletionofmodels,suhasinthehange-pointdetetionsetting. Theliteratureonmodel

seletion and CV [14, 40 ,16, 21℄only suggests thatminimizing diretlythe Loo estimate

of the riskover

2

ⁿ⁻¹ ^models ^would^lead^to ^overtting.

In thispaper,a remarkmade byBirgéand Massart[16 ℄ about penalization proedure

is used for solving this issue in the ontext of hange-point detetion. Model seletion is

perfomed in two steps: First, hoose a segmentation given the number of hange-points;

seond,hoosethenumberofhange-points. CVmethodsanbeusedateahstep,leading

to Proedure 6 (Setion 5). The papershows that suh an approah is indeed suessful

for detetinghanges inthemeanof aheterosedastisignal.

1.4 Contributions of the paper

The main purpose of the present work is to design a CV-based model seletion proe-

dure (Proedure 6) that an be used for deteting multiple hanges in the mean of a

heterosedastisignal. Suhaproedureexperimentallyadaptstoheterosedastiitywhen

the olletion of models is exponential, whih hasneverbeen obtained before. In parti-

ular, Proedure 6 is a reliable alternative to Birgé and Massart's penalization proedure

[15 ℄ whendataan beheterosedasti.

Another major diultytakledinthis paperisthe omputational ostof resampling

methods when seleting among

2

ⁿ ^models. ^Even ^when ^the ^number

(D − 1)

^of ^hange-

(6)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

points isgiven, exploring the

n−1 D−1

partitions of

[0, 1]

^into

D

întervals ând ^performing â

resampling algorithm for eah partition is not feasible when

n

^is ^large ^and

D > 0

^. ^An

implementation of Proedure 6 witha tratable omputational omplexity is proposedin

thepaper,usinglosed-formformulasforLeave-

p

^-out^(Lpo)^estimators^of^the^risk,^dynami

programming, and

V

^-foldross-validation.

The paper also points out that least-squares estimators are not reliable for hange-

point detetion when the number of breakpoints is given, although they are widely used

to this purpose in the literature. Indeed, experimental and theoretial results detailed in

Setion3.1showthatleast-squaresestimatorssuerfromloaloverttingwhenthevariane

of the signalis varyingoverthe sequene of observations. Onthe ontrary,minimizers of

the Lpo estimator of the risk do not suer from this drawbak, whih emphasizes the

interestof using ross-validation methods inthe ontext ofhange-point detetion.

The paperis organizedasfollows. ThestatistialframeworkisdesribedinSetion 2.

First,theproblemofseleting thebest segmentation giventhenumberof hange-points

is takled in Setion 3. Theoretial results and an extensive simulation study show that

the usual minimization of the least-squares riterion an be misleading when data are

heterosedasti, whereas ross-validation-based proedures provide satisfatory results in

the same framework.

Then, the problem of hoosing the number of breakpoints from data is addressed in

Setion4 . Assupportedbyanextensivesimulation study,

V

^-foldross-validation(VFCV) leads to a omputationally feasible and statistially eient model seletion proedure

whendataareheterosedasti,ontraryto proedures impliitly assuminghomosedasti-

ity.

The resampling methods of Setions 3 and 4 are ombined in Setion 5, leading to a

familyofresampling-basedproeduresfordetetinghangesinthemeanofaheterosedas-

tisignal. Awide simulationstudy showstheyperform wellwithboth homosedasti and

heterosedastidata, signiantly improving theperformaneof proedures whih impli-

itlyassume homosedastiity.

Finally,Setion 6illustratesonarealdatasetthepromisingbehaviouroftheproposed

proedures for analyzing CGH miroarray data, ompared to proedures previously used

inthis setting.

2 Statistial framework

In this setion, the statistial framework of hange-point detetion via model seletion is

introdued, aswell assome notation.

2.1 Regression on a xed design

Let

S

^∗ ^denote ^the ^set ^of ^measurable ^funtions

[0, 1] 7→ R

^. ^Let

t

₁

< · · · < t

_n

∈ [0, 1]

^be

some deterministi design points,

s ∈ S

^∗ ^and

σ : [0, 1] 7→ [0, ∞ )

^be ^some ^funtions ^and

dene

∀ i ∈ { 1, . . . , n } , Y

i

= s(t

i

) + σ(t

i

)ǫ

i

,

⁽³⁾

where

ǫ

₁

, . . . , ǫ

_n^areindependentandidentiallydistributedrandomvariableswith

E [ǫ

_i

] = 0

^and

E

ǫ

²_i

= 1

^.

(7)

Version définitive du manuscrit publié dans / Final version of the manuscript published in :

nuscript Manuscrit d’auteur / Author manuscript Manuscrit d’auteur / Author manuscript

Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x

As explained inSetion 1.1 , thegoal is to nd from

(t

_i

, Y

_i

)

1≤i≤n ^a pieewise-onstant funtion

f ∈ S

^∗ ^lose ^to

s

ⁱⁿ^terms^of ^the ^quadrati ^loss

k s − f k

²_n

:= 1 n

X

n i=1

(f (t

i

) − s(t

i

))

²

.

2.2 Least-squares estimator

A lassial estimator of

s

^is ^the least-squares estimator, dened as follows. For every

f ∈ S

^∗^,^the least-squares riterion at

f

^is^dened ^by

P

_n

γ (f ) := 1 n

X

n i=1

(Y

_i

− f (t

_i

))

²

.

The notation

P

_n

γ(f )

^means ^that ^the ^funtion

(t, Y ) 7→ γ(f; (t, Y )) := (Y − f (t))

² ^is

integrated with respet to the empirial distribution

P

_n

:= n

⁻¹

P

n

i=1

δ

_(t_i_,Y_i₎^.

P

_n

γ(f)

^is

also alledthe empirial risk of

f

^.

Then, given a set

S ⊂ S

^∗ ^of ^funtions

[0, 1] 7→ R

^(alled ^a ^model), ^the least-squares estimatoron model

S

^is

ERM(S; P

_n

) := arg min

f∈S

{ P

_n

γ(f) } .

The notation

ERM(S; P

_n

)

^stresses ^that ^the least-squares estimator is the output of the empirial risk minimization algorithm over

S

^,^whih ^takes ^a ^model

S

^and ^a ^data ^sample

as inputs. When a olletion of models

{ S

_m

}

m∈Mn

is given,

b s

_m

(P

_n

)

^or

b s

_m ^are ^shortuts

for

ERM(S

_m

; P

_n

)

^.

2.3 Colletion of models

Sine thegoal is to detet jumps of

s

^,êvery ^modelônsidered ⁱⁿ ^this ârtile îs^the ^set ôf

pieewise onstant funtionswith respetto some partition of

[0, 1]

^.

Forevery

K ∈ { 1, . . . , n − 1 }

ândêvery^sequeneôfîntegers

α

₀

= 1 < α

₁

< α

₂

< · · · <

α

_K

≤ n

^(thebreakpoints),

(I

_λ

)

_λ∈Λ

(α1,...αK)

denotes the partition

[t

_α₀

; t

_α₁

), . . . , [t

_α_K₋₁

; t

_α_K

), [t

_α_K

; 1]

of

[0, 1]

^into

(K + 1)

^intervals. ^Then, ^the^model

S

_(α₁_,...α_K₎ îs^denedâs^the^setôf^pieewise

onstant funtionsthatan only jumpat

t = t

_α_j ^for ^some

j ∈ { 1, . . . , K }

^.

For every

K ∈ { 1, . . . , n − 1 }

^, ^let

M f

n

(K + 1)

^denote ^the ^set ^of ^suh ^sequenes

(α

₁

, . . . α

_K

)

^of ^length

K

^, ^so ^that

{ S

_m

}

_m∈Mfn(K+1) îs ^the ôlletion ôf ^models ôf ^piee-

wise onstant funtions with

K

breakpoints. When

K = 0

^,

M f

n

(1) := {∅}

^and ^the

model

S

_∅ îs ^the ^linear ^spae ôf ônstant ^funtions ôn

[0, 1]

^. ^Remark ^that ^for ^every

K

and

m ∈ M f

n

(K + 1)

^,

S

m îsâ^vetor^spaeôf^dimension

D

m

= K + 1

^. ^In^the^rest^of^the^pa-

per,therelationship betweenthenumberofbreakpoints

K

^and ^the^dimension

D = K + 1

ofthemodel

S

_(α₁_,...α_K₎^is^usedrepeatedly;inpartiular,estimatingofthenumberofbreak- points(Setion4) isequivalent tohoosingthedimension ofa model. Inaddition, sine a

model

S

_m ^is^uniquely ^dened ^by

m

^,^the^index

m

îsâlso âlledâ ^model.