
Outline of the OriSeq project


Academic year: 2022


(1)

Adaptive sparse Poisson functional regression for the analysis of NGS Data.

S. Ivanoff, F. Picard†, V. Rivoirard

CEREMADE, Univ. Paris Dauphine, F-75775 Paris, France

†LBBE, Univ. Lyon 1, F-69622 Villeurbanne, France

April 2016

(2)

Outline

1 Studying replication by high throughput sequencing

2 Poisson functional regression

3 Calibration of the Lasso weights

4 Simulation study

5 Application on OriSeq data

(3)

Next Generation Sequencing Data

Massively parallel sequencing of DNA molecules

Can be used to quantify DNA in a sample

Expression, copy numbers, DNA–protein interactions

Focus on mapped data

[Figure: DNA fragments of unknown abundance are cut, amplified, sequenced, and mapped back to the genome; mapped read counts quantify each fragment's abundance (e.g. DNA1 ×5, DNA2 ×3).]

(4)

Outline of the OriSeq project

DNA replication: duplication of one molecule into two daughter molecules

The exact duplication of mammalian genomes is strongly controlled:

Spatial control (choice of loci)

Temporal control (firing timing)

⇒ What are the (epi)genetic determinants of these controls?

[Figure: a replication origin within its chromatin context.]

(5)

Mapping human replication origins: a technical challenge

“bubbles” are small and unstable (they last only minutes per cycle)

no clear consensus sequence (unlike in S. cerevisiae)

their specification is associated with both DNA sequence and chromatin structure

⇒ Origin-omics: SNS sequencing

[Protocol: extraction and purification of Short Nascent Strands (SNS); selection of 1.5–2 kb single-stranded fragments; lambda exonuclease digestion; then qPCR analysis (local), DNA tiling arrays, or sequencing (Ori-Seq).]

(6)

Example of OriSeq data

[Figure: example OriSeq profile; number of reads in 2.5 kb windows along genomic coordinates.]

(7)

Outline

1 Studying replication by high throughput sequencing

2 Poisson functional regression

3 Calibration of the Lasso weights

4 Simulation study

5 Application on OriSeq data

(8)

Introduction to Poisson functional Regression

Yt is the observed number of reads at position Xt on the genome, with t = 1, . . . , n.

Model sequencing data by functional Poisson regression: Yt | Xt ∼ P(f0(Xt)).

Goal: estimate f0.

Let f be a candidate estimator of f0, decomposed on a functional dictionary with p elements {ϕ1, . . . , ϕp}:

log f(x) = ∑_{j=1}^{p} βj ϕj(x)
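This model is easy to simulate from. The sketch below draws Poisson read counts from a log-linear intensity built on a tiny dictionary; the atoms, coefficients, and grid size are illustrative, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1024
X = np.linspace(0.0, 1.0, n)          # regular grid of genomic positions

# Hypothetical true intensity: log f0 is a combination of two dictionary
# atoms (a constant and a Gaussian bump), so f0 = exp(b1*phi1 + b2*phi2).
phi1 = np.ones(n)
phi2 = np.exp(-((X - 0.5) ** 2) / (2 * 0.01))
beta = np.array([1.0, 2.0])
f0 = np.exp(beta[0] * phi1 + beta[1] * phi2)

# Observed read counts: Y_t | X_t ~ Poisson(f0(X_t))
Y = rng.poisson(f0)
```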

(9)

Dictionaries vs. basis approaches

The basis approach is designed to catch specific features of the signal (p = n)

What if many features are present simultaneously?

Consider overcomplete dictionaries (p > n)

Typical dictionaries: histograms, Daubechies wavelets, Fourier

How to select the dictionary elements?


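The overcomplete-dictionary idea amounts to stacking several families as columns of one design matrix. A minimal sketch with hypothetical sizes, using only histogram and Fourier atoms (wavelet atoms, e.g. via PyWavelets, could be appended the same way):

```python
import numpy as np

n = 256
X = np.linspace(0.0, 1.0, n, endpoint=False)

# Histogram atoms: indicators of 32 equal-width bins
n_bins = 32
hist = np.zeros((n, n_bins))
hist[np.arange(n), (X * n_bins).astype(int)] = 1.0

# Fourier atoms: cosines and sines at frequencies 1..128
freqs = np.arange(1, 129)
fourier = np.concatenate(
    [np.cos(2 * np.pi * np.outer(X, freqs)),
     np.sin(2 * np.pi * np.outer(X, freqs))], axis=1)

# Stacking the families gives an overcomplete design: p = 288 > n = 256
A = np.concatenate([hist, fourier], axis=1)
```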

(12)

A Penalized likelihood framework

We consider a likelihood-based penalized criterion to select β.

Denote by A the n × p design matrix with Aij = ϕj(Xi), and Y = (Y1, . . . , Yn)ᵀ.

The log-likelihood of the model is:

log L(β) = ∑_{j∈J} βj (AᵀY)j − ∑_{i=1}^{n} exp( ∑_{j∈J} βj Aij ) − ∑_{i=1}^{n} log(Yi!),

Selection can be performed by the lasso such that:

β̂ ∈ argmin_{β∈ℝᵖ} { −log L(β) + ∑_{j=1}^{p} λj |βj| }.

(13)

Outline

1 Studying replication by high throughput sequencing

2 Poisson functional regression

3 Calibration of the Lasso weights

4 Simulation study

5 Application on OriSeq data

(14)

Weights calibration using concentration inequalities

In the Gaussian framework with noise variance σ², the weights for the Lasso are ∝ σ√(log p).

λj is used to control the fluctuations of AjᵀY around its mean,

Key role of Vj, a variance term (the analog of σ²) defined by

Vj = Var(AjᵀY) = ∑_{i=1}^{n} f0(Xi) ϕj²(Xi).

For any j, we choose a data-driven value for λj as small as possible so that, with high probability, for any j ∈ {1, . . . , p},

|Ajᵀ(Y − E[Y])| ≤ λj.

(15)

Form of the data-driven weights for the Lasso

Let j be fixed and γ > 0 be a constant. Define

V̂j = ∑_{i=1}^{n} ϕj²(Xi) Yi,

the natural unbiased estimator of Vj, and

Ṽj = V̂j + √( 2γ log p · V̂j · max_i ϕj²(Xi) ) + 3γ log p · max_i ϕj²(Xi).

Let

λj = √( 2γ log p · Ṽj ) + (γ log p / 3) max_i |ϕj(Xi)|,

then

Pr( |Ajᵀ(Y − E[Y])| ≥ λj ) ≤ 3 p^(−γ).
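These weights are cheap to compute from the data alone. The sketch below is a direct transcription of the displayed formulas, assuming A[i, j] = ϕj(Xi):

```python
import numpy as np

def lasso_weights(A, Y, gamma=1.0):
    """Data-driven Lasso weights lambda_j from hat-V_j and tilde-V_j."""
    n, p = A.shape
    logp = np.log(p)
    phi2_max = (A ** 2).max(axis=0)           # max_i phi_j^2(X_i)
    V_hat = (A ** 2).T @ Y                    # hat V_j = sum_i phi_j^2(X_i) Y_i
    V_tilde = (V_hat
               + np.sqrt(2 * gamma * logp * V_hat * phi2_max)
               + 3 * gamma * logp * phi2_max)
    return (np.sqrt(2 * gamma * logp * V_tilde)
            + (gamma * logp / 3) * np.abs(A).max(axis=0))

# Illustrative use on simulated counts
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))
Y = rng.poisson(3.0, size=100)
lam = lasso_weights(A, Y, gamma=1.0)
```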

(16)

Using the group Lasso for Poisson Functional Regression

In some situations, coefficients can be grouped:

{1, . . . ,p}=G1∪. . .∪GK

For wavelets: group by scales for instance (or adjacent positions)

In this case selection can be performed by the group-Lasso such that:

β̂ ∈ argmin_{β∈ℝᵖ} { −log L(β) + ∑_{k=1}^{K} λk ‖β_{Gk}‖₂ }.

The block ℓ1-norm on ℝᵖ is defined by:

‖β‖_{1,2} = ∑_{k=1}^{K} ‖β_{Gk}‖₂ = ∑_{k=1}^{K} √( ∑_{j∈Gk} |βj|² )
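The block ℓ1-norm is straightforward to compute once a partition of the coefficients is fixed. A minimal sketch, with an illustrative dyadic grouping of the kind used for wavelet scales:

```python
import numpy as np

def block_l12_norm(beta, groups):
    """Block l1-l2 norm: sum over groups of the group's Euclidean norm."""
    return sum(np.linalg.norm(beta[g]) for g in groups)

# Example: p = 8 coefficients grouped by dyadic "scales"
groups = [[0], [1], [2, 3], [4, 5, 6, 7]]
beta = np.array([3.0, 0.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0])
print(block_l12_norm(beta, groups))   # 3 + 0 + 5 + 2 = 10
```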

(17)

Form of the data-driven weights for the group-Lasso

λk^g should depend on sharp estimates of the variance parameters (Vj)_{j∈Gk}.

Let k ∈ {1, . . . , K} be fixed and γ > 0 be a constant. Assume that there exists M > 0 such that for any x, |f0(x)| ≤ M.

Let

ck = sup_{x∈ℝⁿ} ‖A_{Gk} A_{Gk}ᵀ x‖₂ / ‖A_{Gk}ᵀ x‖₂.

For all j ∈ Gk, still with V̂j = ∑_{i=1}^{n} ϕj²(Xi) Yi, define

Ṽj^g = V̂j + √( 2(γ log p + log |Gk|) · V̂j · max_i ϕj²(Xi) ) + 3(γ log p + log |Gk|) · max_i ϕj²(Xi).

(18)

Form of the data-driven weights for the group-Lasso

Let γ > 0 be fixed. Define bk^i = √( ∑_{j∈Gk} ϕj²(Xi) ) and bk = max_i bk^i. Finally, we set

λk^g = ( 1 + 1/(2√(2γ log p)) ) √( ∑_{j∈Gk} Ṽj^g ) + 2√( γ log p · Dk ),

where

Dk = 8M ck² + 16 bk² γ log p.

Then

Pr( ‖A_{Gk}ᵀ(Y − E[Y])‖₂ ≥ λk^g ) ≤ 2 p^(−γ).
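These group weights are also fully data-driven. The sketch below transcribes the displayed definitions; the bound M on f0 is a user-supplied assumption, and the exact constants should be checked against the preprint. The sup defining ck equals the spectral norm of A restricted to the group's columns, since the unit-norm maximizers lie in the range of A_{Gk}ᵀ:

```python
import numpy as np

def group_lasso_weights(A, Y, groups, gamma=1.0, M=10.0):
    """Data-driven group-Lasso weights lambda_k^g (illustrative transcription)."""
    n, p = A.shape
    logp = np.log(p)
    lams = []
    for g in groups:
        Ag = A[:, g]                                  # n x |G_k| sub-design
        c_k = np.linalg.norm(Ag, 2)                   # spectral norm = sup in c_k
        b_k = np.sqrt((Ag ** 2).sum(axis=1)).max()    # b_k = max_i b_k^i
        phi2_max = (Ag ** 2).max(axis=0)
        V_hat = (Ag ** 2).T @ Y
        shift = gamma * logp + np.log(len(g))
        V_tilde = (V_hat + np.sqrt(2 * shift * V_hat * phi2_max)
                   + 3 * shift * phi2_max)
        D_k = 8 * M * c_k ** 2 + 16 * b_k ** 2 * gamma * logp
        lam_k = ((1 + 1 / (2 * np.sqrt(2 * gamma * logp))) * np.sqrt(V_tilde.sum())
                 + 2 * np.sqrt(gamma * logp * D_k))
        lams.append(lam_k)
    return np.array(lams)

# Illustrative use
rng = np.random.default_rng(3)
A = rng.normal(size=(40, 8))
Y = rng.poisson(2.0, size=40).astype(float)
lam_g = group_lasso_weights(A, Y, groups=[[0, 1], [2, 3, 4], [5, 6, 7]])
```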

(19)

The form of the weights is analogous to the weights in the Gaussian setting.

We show that the associated Lasso / group-Lasso procedures are theoretically optimal (oracle inequalities).

With a theoretical form for the weights, much computing power is spared!

(20)

Outline

1 Studying replication by high throughput sequencing

2 Poisson functional regression

3 Calibration of the Lasso weights

4 Simulation study

5 Application on OriSeq data

(21)

Simulation settings

We considered the classical Donoho & Johnstone test functions (Blocks, Bumps, Doppler, HeaviSine).

The intensity function f0 is set such that (with α ∈ {1, . . . , 7}) f0 = α exp(g0).

Observations are sampled on a fixed regular grid (n = 2¹⁰) with Yt ∼ P(f0(Xt)).

Use Daubechies, Haar and Fourier as elements of the dictionary.

Check the normalized reconstruction error:

MSE = ‖f̂ − f0‖₂² / ‖f0‖₂²

Compete with the Haar–Fisz transform and cross-validation.
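The error criterion above is a one-liner on the grid of function values; a minimal sketch:

```python
import numpy as np

def normalized_mse(f_hat, f0):
    """Normalized reconstruction error ||f_hat - f0||_2^2 / ||f0||_2^2."""
    return np.sum((f_hat - f0) ** 2) / np.sum(f0 ** 2)

# Sanity checks on a constant intensity
f0 = np.ones(8)
print(normalized_mse(f0, f0))        # 0.0
print(normalized_mse(1.5 * f0, f0))  # 0.25
```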

(22)

Reconstruction errors

[Figure: reconstruction error (MSE) as a function of α for the four test functions (blocks, bumps, doppler, heavi), comparing lasso.exact, lasso.cvj, grplasso.2, grplasso.4, grplasso.8 and HaarFisz.]

(23)

Estimated intensity functions (Lasso)

[Figure: estimated intensity functions for blocks, bumps, doppler and heavi, comparing lasso.exact, lasso.cvj and HaarFisz against the true f0.]

(24)

Estimated intensity functions (group-Lasso)

[Figure: estimated intensity functions for blocks, bumps, doppler and heavi, comparing lasso.exact, grplasso.2, grplasso.4 and grplasso.8 against the true f0.]

(25)

Choosing the best dictionary by cross-validation

[Figure: MSE versus degrees of freedom (df) for each dictionary combination (D = Daubechies, F = Fourier, H = Haar, and the unions DF, DH, FH, DFH), for lasso, grplasso.2 and HaarFisz on the blocks, bumps, doppler and heavi test functions.]

(26)

Outline

1 Studying replication by high throughput sequencing

2 Poisson functional regression

3 Calibration of the Lasso weights

4 Simulation study

5 Application on OriSeq data

(27)

Promising results on OriSeq

[Figure: number of reads along genomic location (Mb) for an OriSeq sample.]

(28)

Perspective for the integration of multiple datasets

This model-based framework offers many perspectives for the analysis of peak-like data

Comparison of two conditions (A vs B) and/or normalization:

log fA(x) = ∑_{j∈J} βj ϕj(x), with βj = αj^B + γj

Extend to varying-coefficient models (βj(x)).

Consider the Negative Binomial distribution

Preprint: http://arxiv.org/abs/1412.6966
