Adaptive sparse Poisson functional regression for the analysis of NGS Data.
S. Ivanoff‡, F. Picard*, V. Rivoirard‡
‡ CEREMADE, Univ. Paris Dauphine, F-75775 Paris, France
* LBBE, Univ. Lyon 1, F-69622 Villeurbanne, France
April 2016
Outline
1 Studying replication by high throughput sequencing
2 Poisson functional regression
3 Calibration of the Lasso weights
4 Simulation study
5 Application on OriSeq data
Next Generation Sequencing Data
• Massively parallel sequencing of DNA molecules
• Can be used to quantify DNA in a sample
• Expression, copy numbers, DNA-protein interactions
• Focus on mapped data
[Diagram: two DNA samples (DNA1, DNA2) are cut, amplified, sequenced and mapped back to the genome, yielding read counts (e.g. DNA1: 5X, DNA2: 3X).]
Outline of the OriSeq project
• DNA replication: duplication of one molecule into two daughter molecules
• The exact duplication of mammalian genomes is strongly controlled
• Spatial control (choice of loci)
• Temporal control (firing timing)
⇒ What are the (epi)genetic determinants of these controls?
[Figure: origin of replication and surrounding chromatin.]
Mapping human replication origins: a technical challenge
• “bubbles” are small and unstable (they last only minutes per cycle)
• no clear consensus sequence (unlike in S. cerevisiae)
• their specification is associated with both DNA sequence and chromatin structure
⇒ Origin-Omics: SNS Sequencing
[Workflow: extraction and purification of Short Nascent Strands (SNS): selection of 1.5-2 kb single-stranded fragments, lambda exonuclease digestion; analysis by local qPCR, DNA tiling arrays, or sequencing (Ori-Seq).]
Example of OriSeq data
[Figure: number of reads in 2.5 kb windows along genomic coordinates.]
Outline
1 Studying replication by high throughput sequencing
2 Poisson functional regression
3 Calibration of the Lasso weights
4 Simulation study
5 Application on OriSeq data
Introduction to Poisson functional Regression
• $Y_t$: the observed number of reads at position $X_t$ on the genome, with $t = 1, \dots, n$.
• Model sequencing data by functional Poisson regression: $Y_t \mid X_t \sim \mathcal{P}(f_0(X_t))$.
• Goal: estimate $f_0$.
• $f$, a candidate estimator of $f_0$, is decomposed on a functional dictionary with $p$ elements $\{\varphi_1, \dots, \varphi_p\}$:
$$\log f(x) = \sum_{j=1}^{p} \beta_j \varphi_j(x)$$
(a small simulation sketch of this model is given below)
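The following minimal sketch, under assumptions of my own (a small Fourier-type dictionary, arbitrary sparse coefficients, and a grid of 1024 positions, none of which come from the talk), simulates counts $Y_t \sim \mathcal{P}(f_0(X_t))$ with $\log f_0 = \sum_j \beta_j \varphi_j$:

```python
import numpy as np

# Minimal sketch of the model: log f(x) = sum_j beta_j * phi_j(x),
# and Y_t | X_t ~ Poisson(f(X_t)).  The dictionary (a few Fourier atoms),
# the coefficients and the grid size are illustrative choices only.

rng = np.random.default_rng(0)

n = 1024
X = np.linspace(0.0, 1.0, n)             # sampling positions on [0, 1]

def dictionary(x):
    """A small illustrative dictionary {phi_1, ..., phi_p}."""
    atoms = [np.ones_like(x)]
    for k in range(1, 4):
        atoms.append(np.cos(2 * np.pi * k * x))
        atoms.append(np.sin(2 * np.pi * k * x))
    return np.column_stack(atoms)        # shape (n, p)

A = dictionary(X)                        # design matrix A_ij = phi_j(X_i)
beta = np.array([2.0, 0.8, 0.0, 0.0, -0.5, 0.0, 0.3])  # sparse coefficients

f0 = np.exp(A @ beta)                    # intensity f0(X_i)
Y = rng.poisson(f0)                      # observed read counts
```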
Dictionaries vs. basis approaches
• A basis approach is designed to capture specific features of the signal ($p = n$)
• What if several types of features are present simultaneously?
• Consider overcomplete dictionaries ($p > n$) (a sketch of such a dictionary follows below)
• Typical dictionary elements: histograms, Daubechies wavelets, Fourier atoms
• How to select the dictionary elements?
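As a rough illustration of an overcomplete dictionary, the sketch below stacks a discrete Haar basis and a set of Fourier atoms into a single design matrix with $p > n$ columns; the helper names, normalisations and number of frequencies are my own illustrative choices, not the dictionary used in the talk.

```python
import numpy as np

# Overcomplete dictionary on a regular grid of size n = 2^J:
# concatenate the columns of a discrete Haar basis with Fourier atoms.

def haar_basis(n):
    """Discrete Haar basis: columns of an n x n orthonormal matrix."""
    H = [np.full(n, 1.0 / np.sqrt(n))]          # scaling function
    J = int(np.log2(n))
    for j in range(J):                          # coarse to fine scales
        block = n // 2 ** j
        for k in range(2 ** j):
            psi = np.zeros(n)
            start = k * block
            psi[start:start + block // 2] = 1.0
            psi[start + block // 2:start + block] = -1.0
            H.append(psi / np.linalg.norm(psi))
    return np.column_stack(H)

def fourier_atoms(n, n_freq):
    """Constant + cosine/sine atoms up to frequency n_freq."""
    x = np.arange(n) / n
    cols = [np.ones(n) / np.sqrt(n)]
    for k in range(1, n_freq + 1):
        cols.append(np.cos(2 * np.pi * k * x) * np.sqrt(2.0 / n))
        cols.append(np.sin(2 * np.pi * k * x) * np.sqrt(2.0 / n))
    return np.column_stack(cols)

n = 256
A = np.hstack([haar_basis(n), fourier_atoms(n, n // 4)])
print(A.shape)   # (256, 385): p > n, i.e. overcomplete
```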
A Penalized likelihood framework
• We consider a likelihood-based penalized criterion to select $\beta$.
• We denote by $A$ the $n \times p$ design matrix with $A_{ij} = \varphi_j(X_i)$, and $Y = (Y_1, \dots, Y_n)^T$.
• The log-likelihood of the model is:
$$\log L(\beta) = \sum_{j \in J} \beta_j (A^T Y)_j - \sum_{i=1}^{n} \exp\Big(\sum_{j \in J} \beta_j A_{ij}\Big) - \sum_{i=1}^{n} \log(Y_i!)$$
• Selection can be performed by the Lasso such that:
$$\widehat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ -\log L(\beta) + \sum_{j=1}^{p} \lambda_j |\beta_j| \Big\}.$$
(a proximal-gradient sketch of this criterion follows below)
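A minimal sketch of how such a weighted-Lasso criterion can be minimised, here with a plain proximal-gradient (ISTA) loop; the fixed step size, iteration count and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# ISTA sketch for  -log L(beta) + sum_j lambda_j |beta_j|  in the
# Poisson model, where -log L(beta) = sum_i exp((A beta)_i) - Y^T A beta
# up to a constant in beta.

def neg_loglik_grad(beta, A, Y):
    """Gradient of -log L(beta): A^T (exp(A beta) - Y)."""
    mu = np.exp(A @ beta)
    return A.T @ (mu - Y)

def soft_threshold(z, t):
    """Proximal operator of the weighted l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def poisson_lasso(A, Y, lam, step=1e-3, n_iter=5000):
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = neg_loglik_grad(beta, A, Y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Usage with a design matrix A, counts Y, and a weight vector lam:
# beta_hat = poisson_lasso(A, Y, lam)
```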
Outline
1 Studying replication by high throughput sequencing
2 Poisson functional regression
3 Calibration of the Lasso weights
4 Simulation study
5 Application on OriSeq data
Weights calibration using concentration inequalities
• In the Gaussian framework with noise variance $\sigma^2$, the Lasso weights are $\propto \sigma \sqrt{\log p}$.
• $\lambda_j$ is used to control the fluctuations of $A_j^T Y$ around its mean.
• Key role of $V_j$, a variance term (the analog of $\sigma^2$) defined by
$$V_j = \mathrm{Var}(A_j^T Y) = \sum_{i=1}^{n} f_0(X_i)\, \varphi_j^2(X_i).$$
• For any $j$, we choose a data-driven value for $\lambda_j$ as small as possible so that, with high probability, for any $j \in \{1, \dots, p\}$,
$$|A_j^T (Y - \mathbb{E}[Y])| \leq \lambda_j.$$
Form of the data-driven weights for the Lasso
• Let $j$ be fixed and $\gamma > 0$ be a constant. Define
$$\widehat{V}_j = \sum_{i=1}^{n} \varphi_j^2(X_i)\, Y_i,$$
the natural unbiased estimator of $V_j$, and
$$\widetilde{V}_j = \widehat{V}_j + \sqrt{2\gamma \log p\, \widehat{V}_j \max_i \varphi_j^2(X_i)} + 3\gamma \log p \max_i \varphi_j^2(X_i).$$
• Let
$$\lambda_j = \sqrt{2\gamma \log p\, \widetilde{V}_j} + \frac{\gamma \log p}{3} \max_i |\varphi_j(X_i)|,$$
• then
$$\Pr\Big(|A_j^T (Y - \mathbb{E}[Y])| \geq \lambda_j\Big) \leq \frac{3}{p^{\gamma}}.$$
(a sketch computing these weights follows below)
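The weights above can be computed directly from the design matrix and the counts. The sketch below transcribes the formulas of this slide; the value $\gamma = 1.01$ is an arbitrary example choice.

```python
import numpy as np

# Data-driven Lasso weights lambda_j from A (A_ij = phi_j(X_i)) and Y.

def lasso_weights(A, Y, gamma=1.01):
    n, p = A.shape
    logp = np.log(p)
    phi_max2 = np.max(A ** 2, axis=0)        # max_i phi_j(X_i)^2
    phi_max = np.max(np.abs(A), axis=0)      # max_i |phi_j(X_i)|
    V_hat = (A ** 2).T @ Y                   # V^_j = sum_i phi_j(X_i)^2 Y_i
    V_tilde = (V_hat
               + np.sqrt(2.0 * gamma * logp * V_hat * phi_max2)
               + 3.0 * gamma * logp * phi_max2)
    return np.sqrt(2.0 * gamma * logp * V_tilde) + (gamma * logp / 3.0) * phi_max

# lam = lasso_weights(A, Y)   # then plug into the weighted Lasso criterion
```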
Using the group Lasso for Poisson Functional Regression
• In some situations, coefficients can be grouped: $\{1, \dots, p\} = G_1 \cup \dots \cup G_K$
• For wavelets: group by scale, for instance (or by adjacent positions)
• In this case selection can be performed by the group-Lasso such that:
$$\widehat{\beta} \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ -\log L(\beta) + \sum_{k=1}^{K} \lambda_k \|\beta_{G_k}\|_2 \Big\}.$$
• The block $\ell_1$-norm on $\mathbb{R}^p$ is defined by:
$$\|\beta\|_{1,2} = \sum_{k=1}^{K} \|\beta_{G_k}\|_2 = \sum_{k=1}^{K} \sqrt{\sum_{j \in G_k} |\beta_j|^2}$$
(a block-thresholding sketch follows below)
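A minimal sketch of the proximal operator of this weighted group penalty (block soft-thresholding), which replaces the coordinatewise soft-thresholding in the ISTA sketch above, together with one illustrative grouping of Haar coefficients by scale; the grouping and helper names are assumptions of mine.

```python
import numpy as np

def group_soft_threshold(beta, groups, lam, step):
    """Prox of step * sum_k lam_k ||beta_{G_k}||_2.
    groups: list of index arrays G_1, ..., G_K; lam: weights lambda_k."""
    out = beta.copy()
    for Gk, lk in zip(groups, lam):
        norm = np.linalg.norm(beta[Gk])
        shrink = max(0.0, 1.0 - step * lk / norm) if norm > 0 else 0.0
        out[Gk] = shrink * beta[Gk]
    return out

def haar_scale_groups(n):
    """Group Haar coefficients by scale, following the ordering of the
    haar_basis sketch above (scaling coefficient first, then each level)."""
    J = int(np.log2(n))
    groups, start = [np.array([0])], 1
    for j in range(J):
        groups.append(np.arange(start, start + 2 ** j))
        start += 2 ** j
    return groups
```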
Form of the data-driven weights for the group-Lasso
• $\lambda_k^g$ should depend on sharp estimates of the variance parameters $(V_j)_{j \in G_k}$.
• Let $k \in \{1, \dots, K\}$ be fixed and $\gamma > 0$ be a constant. Assume that there exists $M > 0$ such that $|f_0(x)| \leq M$ for any $x$.
• Let
$$c_k = \sup_{x \in \mathbb{R}^n} \frac{\|A_{G_k} A_{G_k}^T x\|_2}{\|A_{G_k}^T x\|_2}.$$
• For all $j \in G_k$, still with $\widehat{V}_j = \sum_{i=1}^{n} \varphi_j^2(X_i) Y_i$, define
$$\widetilde{V}_j^g = \widehat{V}_j + \sqrt{2(\gamma \log p + \log|G_k|)\, \widehat{V}_j \max_i \varphi_j^2(X_i)} + 3(\gamma \log p + \log|G_k|) \max_i \varphi_j^2(X_i).$$
Form of the data-driven weights for the group-Lasso
• Let $\gamma > 0$ be fixed. Define $b_{k,i} = \sqrt{\sum_{j \in G_k} \varphi_j^2(X_i)}$ and $b_k = \max_i b_{k,i}$. Finally, we set
$$\lambda_k^g = \Big(1 + \frac{1}{2\sqrt{2\gamma \log p}}\Big) \sqrt{\sum_{j \in G_k} \widetilde{V}_j^g} + 2\sqrt{\gamma \log p\, D_k},$$
• where $D_k = 8 M c_k^2 + 16 b_k^2 \gamma \log p$. Then
$$\Pr\Big(\|A_{G_k}^T (Y - \mathbb{E}[Y])\|_2 \geq \lambda_k^g\Big) \leq \frac{2}{p^{\gamma}}.$$
• The form of the weights is analogous to the weights in the Gaussian setting.
• We show that the associated Lasso / group-Lasso procedures are theoretically optimal (oracle inequalities).
• With a closed-form expression for the weights, much computing power is spared! (a sketch assembling these group weights follows below)
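A sketch assembling the group weights from the quantities defined on the two previous slides (as reconstructed here). It assumes a known bound $M$ on $f_0$, uses the spectral norm of $A_{G_k}$ for the supremum defining $c_k$, and treats $\gamma$ and the grouping as user inputs; this is a transcription of the slide expressions, not the authors' code.

```python
import numpy as np

def group_lasso_weights(A, Y, groups, M, gamma=1.01):
    """Data-driven group weights lambda_k^g for the Poisson group-Lasso."""
    n, p = A.shape
    logp = np.log(p)
    lam_g = []
    for Gk in groups:
        Ak = A[:, Gk]                          # columns of group G_k
        phi_max2 = np.max(Ak ** 2, axis=0)     # max_i phi_j(X_i)^2, j in G_k
        V_hat = (Ak ** 2).T @ Y                # V^_j for j in G_k
        t = gamma * logp + np.log(len(Gk))
        V_tilde = V_hat + np.sqrt(2 * t * V_hat * phi_max2) + 3 * t * phi_max2
        c_k = np.linalg.norm(Ak, 2)            # spectral norm of A_{G_k} (the sup c_k)
        b_k = np.max(np.sqrt(np.sum(Ak ** 2, axis=1)))   # max_i b_{k,i}
        D_k = 8 * M * c_k ** 2 + 16 * b_k ** 2 * gamma * logp
        lam = ((1 + 1 / (2 * np.sqrt(2 * gamma * logp))) * np.sqrt(V_tilde.sum())
               + 2 * np.sqrt(gamma * logp * D_k))
        lam_g.append(lam)
    return np.array(lam_g)
```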
Outline
1 Studying replication by high throughput sequencing
2 Poisson functional regression
3 Calibration of the Lasso weights
4 Simulation study
5 Application on OriSeq data
Simulation settings
• We considered the classical Donoho & Johnstone test functions (Blocks, Bumps, Doppler, HeaviSine)
• The intensity function $f_0$ is set such that $f_0 = \alpha \exp(g_0)$, with $\alpha \in \{1, \dots, 7\}$
• Observations are sampled on a fixed regular grid ($n = 2^{10}$) with $Y_t \sim \mathcal{P}(f_0(X_t))$
• Daubechies wavelets, Haar wavelets and Fourier atoms are used as elements of the dictionary
• We check the normalized reconstruction error:
$$\mathrm{MSE} = \frac{\|\widehat{f} - f_0\|_2^2}{\|f_0\|_2^2}$$
• Comparison with the Haar-Fisz transform and with cross-validation (a sketch of this protocol follows below)
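A sketch of this simulation protocol for one test function; the Bumps parametrisation below is one standard version of the Donoho & Johnstone function, and $\alpha = 3$ is an arbitrary value within the stated range.

```python
import numpy as np

# One test intensity f0 = alpha * exp(g0) on a regular grid of size 2^10,
# Poisson counts, and the normalised reconstruction error.

def bumps(x):
    """Standard Donoho-Johnstone Bumps function."""
    pos = np.array([.1, .13, .15, .23, .25, .4, .44, .65, .76, .78, .81])
    hgt = np.array([4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2])
    wth = np.array([.005, .005, .006, .01, .01, .03, .01, .01, .005, .008, .005])
    return sum(h * (1 + np.abs((x - p) / w)) ** -4
               for p, h, w in zip(pos, hgt, wth))

rng = np.random.default_rng(1)
n = 2 ** 10
x = np.arange(n) / n
g0 = bumps(x) / bumps(x).max()          # normalised shape g0
alpha = 3.0                             # signal strength, alpha in {1,...,7}
f0 = alpha * np.exp(g0)                 # intensity
Y = rng.poisson(f0)                     # simulated counts

def normalised_mse(f_hat, f0):
    return np.sum((f_hat - f0) ** 2) / np.sum(f0 ** 2)
```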
Reconstruction errors
[Figure: normalized MSE as a function of α for the Blocks, Bumps, Doppler and HeaviSine intensities; methods compared: lasso.exact, lasso.cvj, grplasso.2, grplasso.4, grplasso.8, HaarFisz.]
Estimated intensity functions (Lasso)
[Figure: estimated intensity functions (Lasso) for Blocks, Bumps, Doppler and HeaviSine on [0, 1]; methods: lasso.exact, lasso.cvj, HaarFisz, against the true f0.]
Estimated intensity functions (group-Lasso)
[Figure: estimated intensity functions (group-Lasso) for Blocks, Bumps, Doppler and HeaviSine on [0, 1]; methods: lasso.exact, grplasso.2, grplasso.4, grplasso.8, against the true f0.]
Choosing the best dictionary by cross-validation
[Figure: MSE versus df on Blocks, Bumps, Doppler and HeaviSine for dictionaries built from Daubechies (D), Fourier (F), Haar (H) and their combinations (DF, DH, FH, DFH); methods: lasso, grplasso.2, HaarFisz.]
Outline
1 Studying replication by high throughput sequencing
2 Poisson functional regression
3 Calibration of the Lasso weights
4 Simulation study
5 Application on OriSeq data
Promising results on OriSeq
[Figure: number of reads along genomic location (Mb) for OriSeq data.]
Perspective for the integration of multiple datasets
• This model-based framework offers many perspectives for the analysis of peak-like data
• Comparison of two conditions (A vs. B) and/or normalization:
$$\log f_A(x) = \sum_{j \in J} \beta_j \varphi_j(x), \qquad \beta_j = \alpha_j^B + \gamma_j$$
• Extend to varying coefficients models (βj(x)).
• Consider the Negative Binomial distribution
• Preprint: http://arxiv.org/abs/1412.6966