HAL Id: hal-02816806
https://hal.inrae.fr/hal-02816806
Submitted on 6 Jun 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Segmentation in the mean of heteroscedastic data via resampling or cross-validation
Alain Celisse, Sylvain Arlot
To cite this version:
Alain Celisse, Sylvain Arlot. Segmentation in the mean of heteroscedastic data via resampling or cross-validation. Workshop Change-Point Detection Methods and Applications, Sep 2008, Paris, France.
⟨hal-02816806⟩
Author manuscript. Final version of the manuscript published in:
Statistics and Computing, 2010, vol. 21, n° 4, 613-632, 10.1007/s11222-010-9196-x
Segmentation of the mean of heteroscedastic data via cross-validation
Sylvain Arlot and Alain Celisse
April 8, 2009
Abstract
This paper tackles the problem of detecting abrupt changes in the mean of a heteroscedastic signal by model selection, without knowledge on the variations of the noise. A new family of change-point detection procedures is proposed, showing that cross-validation methods can be successful in the heteroscedastic framework, whereas most existing procedures are not robust to heteroscedasticity. The robustness to heteroscedasticity of the proposed procedures is supported by an extensive simulation study, together with recent theoretical results. An application to Comparative Genomic Hybridization (CGH) data is provided, showing that robustness to heteroscedasticity can indeed be required for their analysis.
1 Introduction
The problem tackled in the paper is the detection of abrupt changes in the mean of a signal without assuming its variance is constant. Model selection and cross-validation techniques are used for building change-point detection procedures that significantly improve on existing procedures when the variance of the signal is not constant. Before detailing the approach and the main contributions of the paper, let us motivate the problem and briefly recall some related works in the change-point detection literature.
1.1 Change-point detection
The change-point detection problem, also called one-dimensional segmentation, deals with a stochastic process the distribution of which abruptly changes at some unknown instants. The purpose is to recover the location of these changes and their number. This problem is motivated by a wide range of applications, such as voice recognition, financial time-series analysis [29] and Comparative Genomic Hybridization (CGH) data analysis [35]. A large literature exists about change-point detection in many frameworks [see 12, 17, for a complete bibliography].

The first papers on change-point detection were devoted to the search for the location of a unique change-point, also named breakpoint [see 34, for instance]. Looking for multiple change-points is a harder task and has been studied later. For instance, Yao [49] used the BIC criterion for detecting multiple change-points in a Gaussian signal, and Miao and Zhao [33] proposed an approach relying on rank statistics.
The setting of the paper is the following. The values $Y_1, \dots, Y_n \in \mathbb{R}$ of a noisy signal at points $t_1, \dots, t_n$ are observed, with

$$Y_i = s(t_i) + \sigma(t_i)\,\epsilon_i, \qquad \mathbb{E}[\epsilon_i] = 0 \quad \text{and} \quad \mathrm{Var}(\epsilon_i) = 1. \tag{1}$$

The function $s$ is called the regression function and is assumed to be piecewise-constant, or at least well approximated by piecewise constant functions, that is, $s$ is smooth everywhere except at a few breakpoints. The noise terms $\epsilon_1, \dots, \epsilon_n$ are assumed to be independent and identically distributed. No assumption is made on $\sigma : [0, 1] \mapsto [0, \infty)$. Note that all data $(t_i, Y_i)_{1 \leq i \leq n}$ are observed before detecting the change-points, a setting which is called off-line.

As pointed out by Lavielle [28], multiple change-point detection procedures generally tackle one among the following three problems:
1. Detecting changes in the mean $s$ assuming the standard-deviation $\sigma$ is constant,
2. Detecting changes in the standard-deviation $\sigma$ assuming the mean $s$ is constant,
3. Detecting changes in the whole distribution of $Y$, with no distinction between changes in the mean $s$, changes in the standard-deviation $\sigma$ and changes in the distribution of $\epsilon$.

In applications such as CGH data analysis, changes in the mean $s$ have an important biological meaning, since they correspond to the limits of amplified or deleted areas of chromosomes. However in the CGH setting, the standard-deviation $\sigma$ is not always constant, as assumed in problem 1. See Section 6 for more details on CGH data, for which heteroscedasticity (that is, variations of $\sigma$) corresponds to experimental artefacts or biological nuisance that should be removed.

Therefore, CGH data analysis requires solving a fourth problem, which is the purpose of the present article:
4. Detecting changes in the mean $s$ with no constraint on the standard-deviation $\sigma : [0, 1] \mapsto [0, \infty)$.

Compared to problem 1, the difference is the presence of an additional nuisance parameter $\sigma$, which makes problem 4 harder. To the best of our knowledge, no change-point detection procedure has ever been proposed for solving problem 4 with no prior information on $\sigma$.

1.2 Model selection
Model selection is a successful approach for multiple change-point detection, as shown by Lavielle [28] and by Lebarbier [30] for instance. Indeed, a set of change-points, called a segmentation, is naturally associated with the set of piecewise-constant functions that may only jump at these change-points. Given a set of functions (called a model), estimation can be performed by minimizing the least-squares criterion (or other criteria, see Section 3). Therefore, detecting changes in the mean of a signal, that is, choosing a segmentation, amounts to selecting such a model.
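To illustrate the link between a segmentation and its least-squares fit, here is a minimal Python sketch. It assumes the segmentation is given as 0-based indices at which a new segment starts; the function names `fit_piecewise_constant` and `empirical_risk` are hypothetical and not taken from the paper. On such a model, the least-squares minimizer is simply the average of the observations within each segment.

```python
from typing import List, Sequence


def fit_piecewise_constant(y: Sequence[float], breakpoints: Sequence[int]) -> List[float]:
    """Least-squares fit over functions that may jump only at `breakpoints`.

    `breakpoints` holds 0-based indices, strictly increasing, each one being
    the first index of a new segment (an illustrative convention).
    """
    n = len(y)
    bounds = [0, *breakpoints, n]
    fitted: List[float] = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        if not 0 <= start < stop <= n:
            raise ValueError("breakpoints must be strictly increasing and lie in (0, n)")
        # The least-squares minimizer on a segment is the segment mean.
        segment_mean = sum(y[start:stop]) / (stop - start)
        fitted.extend([segment_mean] * (stop - start))
    return fitted


def empirical_risk(y: Sequence[float], fitted: Sequence[float]) -> float:
    """Average squared residual between the observations and the fitted values."""
    return sum((a - b) ** 2 for a, b in zip(y, fitted)) / len(y)


y = [1.0, 3.0, 10.0, 12.0]
fitted = fit_piecewise_constant(y, [2])  # one change-point between index 1 and 2
risk = empirical_risk(y, fitted)
```

Selecting a model then means comparing such fits across candidate segmentations, which is where the choice of criterion matters.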
More precisely, given a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ and the associated collection of least-squares estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$, the purpose of model selection is to provide a model index $\widehat{m}$ such that $\widehat{s}_{\widehat{m}}$ reaches the best performance among all estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$.
Model selection can target two different goals. On the one hand, a model selection procedure is efficient when its quadratic risk is smaller than the smallest quadratic risk of the estimators $\{\widehat{s}_m\}_{m \in \mathcal{M}_n}$, up to a constant factor $C_n \geq 1$. Such a property is called an oracle inequality when it holds for every finite sample size. The procedure is said to be asymptotically efficient when the previous property holds with $C_n \to 1$ as $n$ tends to infinity. Asymptotic efficiency is the goal of AIC [2, 3] and Mallows' $C_p$ [32], among many others.

On the other hand, assuming that $s$ belongs to one of the models $\{S_m\}_{m \in \mathcal{M}_n}$, a procedure is model consistent when it chooses the smallest model containing $s$ asymptotically with probability one. Model consistency is the goal of BIC [39] for instance. See also the article by Yang [46] about the distinction between efficiency and model consistency.
In the present paper as in [30], the quality of a multiple change-point detection procedure is assessed by the quadratic risk; hence, a change in the mean hidden by the noise should not be detected. This choice is motivated by applications where the signal-to-noise ratio may be small, so that exactly recovering every true change-point is hopeless. Therefore, efficient model selection procedures will be used in order to detect the change-points.
Without prior information on the locations of the change-points, the natural collection of models for change-point detection depends on the sample size $n$. Indeed, there exist $\binom{n-1}{D-1}$ different partitions of the $n$ design points into $D$ intervals, each partition corresponding to a set of $(D - 1)$ change-points. Since $D$ can take any value between 1 and $n$, $2^{n-1}$ models can be considered. Therefore, model selection procedures used for multiple change-point detection have to satisfy non-asymptotic oracle inequalities: the collection of models cannot be assumed to be fixed with the sample size $n$ tending to infinity. (See Section 2.3 for a precise definition of the collection $\{S_m\}_{m \in \mathcal{M}_n}$ used for change-point detection.)

Most model selection results consider polynomial collections of models $\{S_m\}_{m \in \mathcal{M}_n}$, that is $\mathrm{Card}(\mathcal{M}_n) \leq C n^{\alpha}$ for some constants $C, \alpha \geq 0$. For polynomial collections, procedures like AIC or Mallows' $C_p$ are proved to satisfy oracle inequalities in various frameworks [9, 15, 10, 16], assuming that data are homoscedastic, that is, $\sigma(t_i)$ does not depend on $t_i$.

However as shown in [6], Mallows' $C_p$ is suboptimal when data are heteroscedastic, that is the variance is non-constant. Therefore, other procedures must be used. For instance, resampling penalization is optimal with heteroscedastic data [5]. Another approach has been explored by Gendre [25], which consists in simultaneously estimating the mean and the variance, using a particular polynomial collection of models.
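As a quick numerical check of the counting argument for the collection of models (the $\binom{n-1}{D-1}$ segmentations of dimension $D$ and their total of $2^{n-1}$), the following Python sketch may help. The function names are hypothetical and only serve the illustration.

```python
from math import comb


def n_segmentations(n: int, d: int) -> int:
    """Number of ways to split n ordered design points into d consecutive intervals."""
    return comb(n - 1, d - 1)


def n_models(n: int) -> int:
    """Total number of segmentations when the dimension d ranges over 1..n."""
    return sum(n_segmentations(n, d) for d in range(1, n + 1))


# The total grows like 2**(n - 1), far faster than any bound of the form C * n**alpha.
for n in (5, 10, 20):
    print(n, n_models(n), 2 ** (n - 1))
```

Even for moderate $n$ the collection cannot be explored exhaustively, which is why it is not covered by results stated for polynomial collections.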
However in change-point detection, the collection of models is exponential, that is $\mathrm{Card}(\mathcal{M}_n)$ is of order $\exp(\alpha n)$ for some $\alpha > 0$. For such large collections, especially larger than polynomial, the above penalization procedures fail. Indeed, Birgé and Massart [16] proved that the minimal amount of penalization required for a procedure to satisfy an oracle inequality is of the form

$$\mathrm{pen}(m) = c_1 \sigma^2 \frac{D_m}{n} + c_2 \sigma^2 \frac{D_m}{n} \log\left(\frac{n}{D_m}\right), \tag{2}$$

where $c_1$ and $c_2$ are positive constants and $\sigma^2$ is the variance of the noise, assumed to be constant. Lebarbier [30] proposed $c_1 = 5$ and $c_2 = 2$ for optimizing the penalty (2) in the context of change-point detection. Penalties similar to (2) have been introduced independently by other authors [38, 1, 11, 45] and are shown to provide satisfactory results.
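For concreteness, penalty (2) can be evaluated with a few lines of Python. This is a minimal sketch: the function name `penalty` is hypothetical, the defaults are the constants $c_1 = 5$ and $c_2 = 2$ quoted above, and the noise variance `sigma2` is assumed known and constant, which is precisely the assumption that fails for heteroscedastic data.

```python
from math import log


def penalty(d: int, n: int, sigma2: float, c1: float = 5.0, c2: float = 2.0) -> float:
    """Penalty (2): c1*sigma2*d/n + c2*sigma2*(d/n)*log(n/d), for a model of dimension d."""
    if not 1 <= d <= n:
        raise ValueError("the dimension d must satisfy 1 <= d <= n")
    return c1 * sigma2 * d / n + c2 * sigma2 * (d / n) * log(n / d)


# Penalty as a function of the dimension for n = 100 and unit noise variance.
values = [penalty(d, 100, 1.0) for d in (1, 2, 5, 10)]
```

The logarithmic term vanishes when $D_m = n$, so the penalty then reduces to $c_1 \sigma^2$.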
Nevertheless, all these results assume that data are homoscedastic. Actually, to the best of our knowledge, the model selection problem with heteroscedastic data and an exponential collection of models has never been considered in the literature.

Furthermore, penalties of the form (2) are very close to being proportional to $D_m$, at least for small values of $D_m$. Therefore, the results of [6] lead to the conjecture that the penalty (2) is suboptimal for model selection over an exponential collection of models when data are heteroscedastic. The suggestion of this paper is to use cross-validation methods instead.
1.3 Cross-validation
Cross-validation (CV) methods provide (almost) unbiased estimates of the quadratic risk of any estimator, such as $\widehat{s}_m$ (see Section 3.2 about the heuristics underlying CV). Classical examples of CV methods are the leave-one-out [Loo, 27, 43] and $V$-fold cross-validation [VFCV, 23, 24]. More references on cross-validation can be found in [7, 19] for instance.

CV can be used for model selection, by choosing the model $S_m$ for which the CV estimate of the risk of $\widehat{s}_m$ is minimal. The properties of CV for model selection with a polynomial collection of models and homoscedastic data have been widely studied. In short, CV is known to adapt to a wide range of statistical settings, from density estimation [42, 20] to regression [44, 48] and classification [26, 47]. In particular, Loo is asymptotically equivalent to AIC or Mallows' $C_p$ in several frameworks where they are asymptotically optimal, and other CV methods have similar performances, provided the size of the training sample is close enough to the sample size [see for instance 31, 40, 22]. In addition, CV methods are robust to heteroscedasticity of data [5, 7], as well as several other resampling methods [6]. Therefore, CV is a natural alternative to penalization procedures assuming homoscedasticity.
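To make the leave-one-out idea concrete for a fixed segmentation, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the paper's implementation: breakpoints are given as 0-based indices starting a new segment, every segment must contain at least two points, and the name `loo_risk` is hypothetical. Each observation is predicted by the mean of the other observations of its own segment.

```python
from typing import Sequence


def loo_risk(y: Sequence[float], breakpoints: Sequence[int]) -> float:
    """Leave-one-out estimate of the risk of the piecewise-constant least-squares fit."""
    n = len(y)
    bounds = [0, *breakpoints, n]
    total = 0.0
    for start, stop in zip(bounds[:-1], bounds[1:]):
        size = stop - start
        if size < 2:
            raise ValueError("each segment needs at least two points for leave-one-out")
        segment_sum = sum(y[start:stop])
        for i in range(start, stop):
            # Prediction for y[i] from the remaining points of the same segment.
            prediction = (segment_sum - y[i]) / (size - 1)
            total += (y[i] - prediction) ** 2
    return total / n


y = [1.0, 3.0, 10.0, 12.0]
risk_with_jump = loo_risk(y, [2])   # segmentation with one change-point
risk_constant = loo_risk(y, [])     # constant model
```

Because the left-out point never contributes to its own prediction, the estimate does not reward a segmentation merely for fitting the noise of individual observations.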
Nevertheless, nearly nothing is known about CV for model selection with an exponential collection of models, such as in the change-point detection setting. The literature on model selection and CV [14, 40, 16, 21] only suggests that directly minimizing the Loo estimate of the risk over $2^{n-1}$ models would lead to overfitting.

In this paper, a remark made by Birgé and Massart [16] about penalization procedures is used for solving this issue in the context of change-point detection. Model selection is performed in two steps: first, choose a segmentation given the number of change-points; second, choose the number of change-points. CV methods can be used at each step, leading to Procedure 6 (Section 5). The paper shows that such an approach is indeed successful for detecting changes in the mean of a heteroscedastic signal.
1.4 Contributions of the paper
The main purpose of the present work is to design a CV-based model selection procedure (Procedure 6) that can be used for detecting multiple changes in the mean of a heteroscedastic signal. Such a procedure experimentally adapts to heteroscedasticity when the collection of models is exponential, which has never been obtained before. In particular, Procedure 6 is a reliable alternative to Birgé and Massart's penalization procedure [15] when data can be heteroscedastic.
Another major difficulty tackled in this paper is the computational cost of resampling methods when selecting among $2^n$ models. Even when the number $(D - 1)$ of change-points is given, exploring the $\binom{n-1}{D-1}$ partitions of $[0, 1]$ into $D$ intervals and performing a resampling algorithm for each partition is not feasible when $n$ is large and $D > 0$. An implementation of Procedure 6 with a tractable computational complexity is proposed in the paper, using closed-form formulas for Leave-$p$-out (Lpo) estimators of the risk, dynamic programming, and $V$-fold cross-validation.

The paper also points out that least-squares estimators are not reliable for change-point detection when the number of breakpoints is given, although they are widely used to this purpose in the literature. Indeed, experimental and theoretical results detailed in Section 3.1 show that least-squares estimators suffer from local overfitting when the variance of the signal is varying over the sequence of observations. On the contrary, minimizers of the Lpo estimator of the risk do not suffer from this drawback, which emphasizes the interest of using cross-validation methods in the context of change-point detection.
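Dynamic programming, mentioned above as one ingredient of a tractable implementation, can be sketched as follows for any criterion that is a sum of per-segment costs. This is the classical segmentation recursion, not the paper's implementation: the least-squares cost `segment_sse` is used purely as an illustrative default (the paper argues for Lpo-based criteria instead), and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Cost = Callable[[Sequence[float], int, int], float]


def segment_sse(y: Sequence[float], start: int, stop: int) -> float:
    """Residual sum of squares of y[start:stop] around its own mean."""
    seg = y[start:stop]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)


def best_segmentation(y: Sequence[float], d: int,
                      cost: Cost = segment_sse) -> Tuple[float, List[int]]:
    """Minimize the summed segment cost over all splits of y into d segments.

    Returns the minimal cost and the 0-based start indices of segments 2..d.
    """
    n = len(y)
    if not 1 <= d <= n:
        raise ValueError("the number of segments d must satisfy 1 <= d <= n")
    inf = float("inf")
    # best[k][j]: minimal cost of splitting y[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(d + 1)]
    arg = [[0] * (n + 1) for _ in range(d + 1)]
    best[0][0] = 0.0
    for k in range(1, d + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # i is the start of the k-th segment
                candidate = best[k - 1][i] + cost(y, i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    arg[k][j] = i
    # Backtrack the optimal segment starts.
    breakpoints: List[int] = []
    j = n
    for k in range(d, 1, -1):
        j = arg[k][j]
        breakpoints.append(j)
    breakpoints.reverse()
    return best[d][n], breakpoints


cost_value, found = best_segmentation([0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 1.0, 1.0], 3)
```

The recursion visits on the order of $D n^2$ candidate segments instead of enumerating $\binom{n-1}{D-1}$ partitions; as written each cost call is recomputed from scratch, which cumulative sums would avoid.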
The paper is organized as follows. The statistical framework is described in Section 2. First, the problem of selecting the best segmentation given the number of change-points is tackled in Section 3. Theoretical results and an extensive simulation study show that the usual minimization of the least-squares criterion can be misleading when data are heteroscedastic, whereas cross-validation-based procedures provide satisfactory results in the same framework.
Then, the problem of choosing the number of breakpoints from data is addressed in Section 4. As supported by an extensive simulation study, $V$-fold cross-validation (VFCV) leads to a computationally feasible and statistically efficient model selection procedure when data are heteroscedastic, contrary to procedures implicitly assuming homoscedasticity.
The resampling methods of Sections 3 and 4 are combined in Section 5, leading to a family of resampling-based procedures for detecting changes in the mean of a heteroscedastic signal. A wide simulation study shows they perform well with both homoscedastic and heteroscedastic data, significantly improving the performance of procedures which implicitly assume homoscedasticity.

Finally, Section 6 illustrates on a real data set the promising behaviour of the proposed procedures for analyzing CGH microarray data, compared to procedures previously used in this setting.
2 Statistical framework
In this section, the statistical framework of change-point detection via model selection is introduced, as well as some notation.
2.1 Regression on a fixed design
Let $S^*$ denote the set of measurable functions $[0, 1] \mapsto \mathbb{R}$. Let $t_1 < \cdots < t_n \in [0, 1]$ be some deterministic design points, $s \in S^*$ and $\sigma : [0, 1] \mapsto [0, \infty)$ be some functions and define

$$\forall i \in \{1, \dots, n\}, \qquad Y_i = s(t_i) + \sigma(t_i)\,\epsilon_i, \tag{3}$$

where $\epsilon_1, \dots, \epsilon_n$ are independent and identically distributed random variables with $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] = 1$.
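A small simulation may help fix ideas about the fixed-design regression model with a location-dependent noise level. The Python sketch below makes several assumptions that are not in the text: the design is the regular grid $t_i = i/n$, the noise is standard Gaussian (the model only requires zero mean and unit variance), and the particular functions `s` and `sigma` as well as the name `simulate_signal` are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def simulate_signal(n: int,
                    s: Callable[[float], float],
                    sigma: Callable[[float], float],
                    seed: int = 0) -> Tuple[List[float], List[float]]:
    """Draw Y_i = s(t_i) + sigma(t_i) * eps_i on the regular design t_i = i / n."""
    rng = random.Random(seed)
    t = [(i + 1) / n for i in range(n)]
    # Gaussian noise is an illustrative choice with mean 0 and variance 1.
    y = [s(ti) + sigma(ti) * rng.gauss(0.0, 1.0) for ti in t]
    return t, y


def s(u: float) -> float:
    """Piecewise-constant mean with a single jump at 0.5 (illustrative)."""
    return 0.0 if u < 0.5 else 2.0


def sigma(u: float) -> float:
    """Noise level that changes at 0.7, away from the jump of the mean (illustrative)."""
    return 0.1 if u < 0.7 else 1.0


t, y = simulate_signal(200, s, sigma, seed=42)
```

Here the change in the noise level is a nuisance: only the jump of `s` at 0.5 is a change-point in the sense of this paper.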
As explained in Section 1.1, the goal is to find from $(t_i, Y_i)_{1 \leq i \leq n}$ a piecewise-constant function $f \in S^*$ close to $s$ in terms of the quadratic loss

$$\|s - f\|_n^2 := \frac{1}{n} \sum_{i=1}^{n} \left(f(t_i) - s(t_i)\right)^2.$$
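As a small worked example of the quadratic loss, the Python sketch below evaluates it from the values of $f$ and $s$ at the design points. The name `quadratic_loss` is hypothetical; note that the loss depends on the true $s$ and is therefore only computable in simulations, not on real data.

```python
from typing import Sequence


def quadratic_loss(f_values: Sequence[float], s_values: Sequence[float]) -> float:
    """Average of (f(t_i) - s(t_i))**2 over the n design points."""
    if len(f_values) != len(s_values):
        raise ValueError("f and s must be evaluated at the same design points")
    n = len(s_values)
    return sum((f - s) ** 2 for f, s in zip(f_values, s_values)) / n


# A constant candidate f compared with a signal s that jumps halfway.
loss = quadratic_loss([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 3.0, 3.0])
```

Missing a jump of height $h$ over a fraction $q$ of the design points costs about $q h^2$, so a small jump hidden by the noise contributes little to the loss.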
2.2 Least-squares estimator
A classical estimator of $s$ is the least-squares estimator, defined as follows. For every $f \in S^*$, the least-squares criterion at $f$ is defined by

$$P_n \gamma(f) := \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - f(t_i)\right)^2.$$

The notation $P_n \gamma(f)$ means that the function $(t, Y) \mapsto \gamma(f; (t, Y)) := (Y - f(t))^2$ is integrated with respect to the empirical distribution $P_n := n^{-1} \sum_{i=1}^{n} \delta_{(t_i, Y_i)}$. $P_n \gamma(f)$ is also called the empirical risk of $f$.

Then, given a set $S \subset S^*$ of functions $[0, 1] \mapsto \mathbb{R}$ (called a model), the least-squares estimator on model $S$ is

$$\mathrm{ERM}(S; P_n) := \arg\min_{f \in S} \left\{ P_n \gamma(f) \right\}.$$

The notation $\mathrm{ERM}(S; P_n)$ stresses that the least-squares estimator is the output of the empirical risk minimization algorithm over $S$, which takes a model $S$ and a data sample as inputs. When a collection of models $\{S_m\}_{m \in \mathcal{M}_n}$ is given, $\widehat{s}_m(P_n)$ or $\widehat{s}_m$ are shortcuts for $\mathrm{ERM}(S_m; P_n)$.

2.3 Collection of models
Since the goal is to detect jumps of $s$, every model considered in this article is the set of piecewise constant functions with respect to some partition of $[0, 1]$.

For every $K \in \{1, \dots, n - 1\}$ and every sequence of integers $\alpha_0 = 1 < \alpha_1 < \alpha_2 < \cdots < \alpha_K \leq n$ (the breakpoints), $(I_\lambda)_{\lambda \in \Lambda_{(\alpha_1, \dots, \alpha_K)}}$ denotes the partition

$$[t_{\alpha_0}; t_{\alpha_1}), \; \dots, \; [t_{\alpha_{K-1}}; t_{\alpha_K}), \; [t_{\alpha_K}; 1]$$

of $[0, 1]$ into $(K + 1)$ intervals. Then, the model $S_{(\alpha_1, \dots, \alpha_K)}$ is defined as the set of piecewise constant functions that can only jump at $t = t_{\alpha_j}$ for some $j \in \{1, \dots, K\}$.

For every $K \in \{1, \dots, n - 1\}$, let $\widetilde{\mathcal{M}}_n(K + 1)$ denote the set of such sequences $(\alpha_1, \dots, \alpha_K)$ of length $K$, so that $\{S_m\}_{m \in \widetilde{\mathcal{M}}_n(K+1)}$ is the collection of models of piecewise constant functions with $K$ breakpoints. When $K = 0$, $\widetilde{\mathcal{M}}_n(1) := \{\emptyset\}$ and the model $S_\emptyset$ is the linear space of constant functions on $[0, 1]$. Remark that for every $K$ and $m \in \widetilde{\mathcal{M}}_n(K + 1)$, $S_m$ is a vector space of dimension $D_m = K + 1$. In the rest of the paper, the relationship between the number of breakpoints $K$ and the dimension $D = K + 1$ of the model $S_{(\alpha_1, \dots, \alpha_K)}$ is used repeatedly; in particular, estimating the number of breakpoints (Section 4) is equivalent to choosing the dimension of a model. In addition, since a model