• Aucun résultat trouvé

SegAnnot: an R package for fast segmentation of annotated piecewise constant signals

N/A
N/A
Protected

Academic year: 2021

Partager "SegAnnot: an R package for fast segmentation of annotated piecewise constant signals"

Copied!
14
0
0

Texte intégral

(1)

HAL Id: hal-00759129

https://hal.inria.fr/hal-00759129

Preprint submitted on 30 Nov 2012

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

SegAnnot: an R package for fast segmentation of

annotated piecewise constant signals

Toby Dylan Hocking, Guillem Rigaill

To cite this version:

Toby Dylan Hocking, Guillem Rigaill. SegAnnot: an R package for fast segmentation of annotated

piecewise constant signals. 2012. �hal-00759129�

(2)

SegAnnot: an R package for fast segmentation of

annotated piecewise constant signals

Toby Dylan Hocking

Guillem Rigaill

November 30, 2012

Abstract

We describe and propose an implementation of a dynamic programming

algo-rithm for the segmentation of annotated piecewise constant signals. The algoalgo-rithm

is exact in the sense that it recovers the best possible segmentation w.r.t. the

quadratic loss that agrees with the annotations.

Contents

1 Introduction: annotations motivate dedicated segmentation algorithms

2

2 Motivating examples

2

3 Segmentation via annotation-aware optimization

5

3.1

Definition of a segmentation . . . .

5

3.2

Definition of the breakpoint annotations . . . .

5

3.3

Definition of a consistent segmentation . . . .

6

3.4

The annotation-aware segmentation problem . . . .

6

4 Only 0 and 1 annotations

7

4.1

Number of changes and number of possible segmentations . . . .

8

4.2

Algorithm . . . .

8

5 Only 0, 1 and b annotations

9

6 Results

10

6.1

Speed of annotation-aware DP . . . .

10

6.2

Fitting 0/1 annotations perfectly . . . .

11

7 0, 0+, 1 and 1+ annotations

12

(3)

1

Introduction: annotations motivate dedicated

seg-mentation algorithms

In bioinformatics, there are many noisy assays that attempt to measure chromosomal

copy number of a sample of cells. For example, one can recover a piecewise constant

signal of copy number from array CGH or SNP microarrays.

Many algorithms have been proposed to analyze these data, and in practice an expert

biologist will examine scatterplots to judge if the model segmentation is a good fit to

the data. Hocking et al. [2012] make this visual criterion for model selection concrete

by defining annotated regions that can quantify the accuracy of a segmentation. That

study showed that the maximum likelihood segmentation given by the pruned dynamic

programming (DP) algorithm of Rigaill [2010] or PELT [Killick et al., 2011] resulted in

the best breakpoint learning (see extended results of Hocking et al. [2012]). However,

there are two drawbacks to that approach:

• (Speed) The pruned DP or PELT algorithms were not the fastest examined in

that comparison.

• (Fitting) We may have more than one breakpoint annotation per chromosome. In

that case, there may be no maximum likelihood segmentation that agrees with all

the annotations.

Furthermore, no optimization algorithms have been specifically designed to exploit

the breakpoint annotations. The goal of this work is to characterize a new

optimiza-tion problem and segmentaoptimiza-tion algorithm, SegAnnot. It exploits the structure of the

annotated region data to increase both Speed and Fitting.

2

Motivating examples

We begin by showing two annotated copy number profiles from low and high density

microarrays of neuroblastoma tumors. In both cases, there is no maximum likelihood

segmentation that agrees with the breakpoint annotations.

(4)

Let us first give a few notations. We analyze a chromosome with D base pairs X =

{1, . . . , D}, so the set of all possible breakpoints is B = {1, . . . , D − 1}. The microarray

gives us a vector of logratio observations y ∈ R

d

at positions p ∈ X

d

, sorted in increasing

order p

1

< · · · < p

d

. We define the estimated signal with k segments as

ˆ

y

k

= arg min

x

∈R

d

||y − x||

2

2

subject to k − 1 =

d−1

X

j=1

1

x

j

6=x

j+1

.

(1)

Note that we can quickly calculate ˆ

y

k

for k ∈ {1, . . . , k

max

} using pruned DP [Rigaill,

2010]. In the figure below, we plot the noisy observations y for one chromosome on a

low-density array as d = 123 black points.

● ●●●●●● ●●●● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ●●●● ●●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ●● ●●● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ● ●●● ● ● ●●●●●● ●●●● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ●●●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ●●● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ● ●●● ● ● ●●●●●● ●●●●● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ●●●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●●● ● ● ●● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ● ●●● ● ● ●●●●●● ●●●●● ●● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ●●●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ●● ●●● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ● ●●● ● ● ●●●●●● ●●●● ● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ●●●● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ●● ●●● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ● ●●● ●

0.0

0.3

0.6

0.9

0.0

0.3

0.6

0.9

0.0

0.3

0.6

0.9

0.0

0.3

0.6

0.9

0.0

0.3

0.6

0.9

1 segment

2 segments

3 segments

4 segments

5 segments

0

20

40

60

80

position on chromosome 17 (mega base pairs)

logr

atio

annotation

1breakpoint

In the figure above, the estimated signals ˆ

y

k

are drawn as green lines for k = 1 to

5 segments. The cghseg model tells us the points after which a break occurs, not the

precise bases. So we define the estimated breakpoint locations shown as vertical green

dashed lines using the mean

φ(ˆ

y

k

, p) =

⌊(p

j

+ p

j+1

)/2⌋ for all j ∈ {1, . . . , d − 1} such that ˆ

y

j

k

6= ˆ

y

k

j+1

.

(2)

Note that this is a function φ : R

d

× X

d

→ 2

B

that gives the positions after which

there is a break in the estimated signal ˆ

y

k

. Clearly, none of the models ˆ

y

1

, . . . , ˆ

y

5

agree

with all 3 of the breakpoint annotations.

Références

Documents relatifs

The binomial model is a discrete time financial market model, composed of a non-risky asset B (bond), corresponding to an investment into a savings account in a bank and of a

Key words: Mulholland inequality, Hilbert-type inequality, weight function, equiv- alent form, the best possible constant factor.. The above inequality includes the best

An algorithm for piecewise continuous approximation of structural experimental multidimensional signals with a previously unknown number of intervals for splitting

In summary, the Cramér-Rao lower bound can be used to assess the minimum expected error of a maximum likelihood estimate based on the inverse of the expected Fisher information

Specific actions and interventions, manifested in the form of farming skills workshops, farming practice initiatives, mentoring programs, scholarships schemes, improved

Each unique examination was coded for the frequency of utterance type being spoken (i.e., questions identified as open-ended, probing, closed yes-no, leading, forced choice,

Figure 1: Chest Radiograph with heart outline showing the maximum internal diameter of the thorax (ID) and maximum transverse diameter of the heart that is the

If the given connected loopless bridgeless planar graph corresponds to the black regions of a knot (link) then to construct its companion graph, all one needs to