(1)

Outline: Biological motivation and our model; Our method and general results; Application to genomic data

Nonparametric estimation in a Poissonian interaction model and application to genomic data

Laure SANSONNET - Université Paris-Sud 11

Tuesday, 17 April 2012

Colloque des jeunes probabilistes et statisticiens - CIRM Marseille

(2)

Biological motivation

Two given events modeled by point processes P1 and P2:

how does P1 influence P2?

Any type of interaction, for example:

in neuroscience, in economics, in genomics, ...

"DNA case": study of favored or avoided distances between two given motifs along a genome.

motif = sequence of letters in the alphabet {a,c,g,t}.

Genomes are long and motifs of interest are short.

→ we work in a continuous framework.

→ occurrences of a motif = a point process lying in [0;T], where T is the normalized length of the studied genome.

To study the influence of P2 on P1, we just invert their roles in the model.

(4)

Poisson process on the real line

Let N be a random countable set of points of R (here).

N_A: number of points of N in A; $dN = \sum_{X \in N} \delta_X$.

Poisson process:

N_A obeys a Poisson law $\mathcal{P}(\nu(A))$,

if $A_1, \dots, A_\ell$ are disjoint measurable sets, $N_{A_1}, \dots, N_{A_\ell}$ are independent random variables.

ν is a measure called the "mean measure".

Generally, $d\nu(t) = h(t)\,dt$.

If h = constant, N is a homogeneous Poisson process.

(5)

Our model

[Figure: four parents U_1, U_2, U_3, U_4 on a line, each with its shifted function h(· − U_1), ..., h(· − U_4), and a second row of observed points.]

We observe the occurrences of both given motifs:

Parents: U_1, ..., U_n i.i.d. uniform random variables on [0;T].

Children: Poisson process N with intensity

$$t \mapsto \sum_{i=1}^{n} h(t - U_i).$$

Aim: Estimate h.
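The model can be simulated directly. The sketch below is only an illustration under stated assumptions: the interaction function is taken to be h = mass / (b − a) on [a, b] (a choice made here, not in the talk), and the helper names `sample_poisson` and `simulate` are hypothetical. It uses the fact that, given the parents, a Poisson process with intensity Σ h(· − U_i) is the superposition of independent Poisson processes with intensities h(· − U_i).

```python
import random


def sample_poisson(mean, rng):
    """Draw a Poisson(mean) variate by counting unit-rate exponential gaps."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate(n, T, mass, a, b, rng):
    """Simulate parents and children for the illustrative h = mass/(b-a) on [a, b]."""
    parents = [rng.uniform(0.0, T) for _ in range(n)]
    children = []
    for u in parents:
        # Parent u contributes Poisson(mass) children, where mass is the
        # integral of h; each child sits at u plus an offset with density h/mass.
        for _ in range(sample_poisson(mass, rng)):
            children.append(u + rng.uniform(a, b))
    return sorted(parents), sorted(children)


rng = random.Random(0)
parents, children = simulate(n=200, T=1000.0, mass=2.0, a=0.0, b=1.0, rng=rng)
print(len(parents), len(children))
```

With these illustrative values each parent has two children on average, all located within distance 1 to its right; children may fall slightly beyond T, since N lives on the whole real line.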

(10)

In genomics:

The first motif of interest is a rare word and is modeled by a homogeneous Poisson process $N^0$ on [0;T].

Conditionally on the event "the number of points falling into [0;T] is n", the points of $N^0$ (i.e. the parents) obey the same law as an n-sample of uniform random variables on [0;T].

With very high probability, n is proportional to T.

→ the asymptotics considered in genomics: the "DNA case".

Hawkes process:

Gusto and Schbath (2005), Reynaud-Bouret and Schbath (2010), Carstensen et al. (2010).

Our model:

no spontaneous appearance or self-excitation phenomena, but a nonparametric estimation method using a wavelet thresholding rule (no sparsity issues) and double asymptotics.

(13)

Framework

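Assumption: $h \in L^1(\mathbb{R}) \cap L^\infty(\mathbb{R})$.

Decomposition of h on the Haar basis (obtained by dilations and translations of $\phi = 1_{[0;1]}$ and $\psi = 1_{]1/2;1]} - 1_{[0;1/2]}$):

$$h = \sum_{\lambda \in \Lambda} \beta_\lambda \varphi_\lambda \quad \text{with} \quad \beta_\lambda = \int_{\mathbb{R}} h(x)\,\varphi_\lambda(x)\,dx,$$

where $\Lambda = \{\lambda = (j,k) : j \ge -1,\ k \in \mathbb{Z}\}$ and, for all $x \in \mathbb{R}$ and all $\lambda = (j,k) \in \Lambda$,

$$\varphi_\lambda(x) = \begin{cases} \phi(x-k) & \text{if } j = -1, \\ 2^{j/2}\,\psi(2^j x - k) & \text{otherwise.} \end{cases}$$

For the theoretical results, we have used the decomposition of h on a particular biorthogonal wavelet basis, built by Cohen et al. (1992).

Aim: Estimate the $\beta_\lambda$'s.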
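As a small worked example (the function below is chosen purely for illustration and does not come from the talk), take h equal to the indicator of [0;1/2]. Only two Haar coefficients are non-zero, since every finer wavelet either integrates to zero over its support inside [0;1/2] or lives where h vanishes:

```latex
\beta_{(-1,0)} = \int_0^{1/2} \phi(x)\,dx = \frac{1}{2},
\qquad
\beta_{(0,0)} = \int_0^{1/2} \psi(x)\,dx = -\frac{1}{2},
\qquad\text{so}\qquad
1_{[0;1/2]} = \tfrac{1}{2}\,\phi - \tfrac{1}{2}\,\psi
\quad\text{almost everywhere.}
```

On [0;1/2] the right-hand side equals 1/2 + 1/2 = 1, and on ]1/2;1] it equals 1/2 − 1/2 = 0, as expected.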

(15)

Framework

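For all λ in Λ, $\hat\beta_\lambda = \dfrac{G(\varphi_\lambda)}{n}$, with

$$G(\varphi_\lambda) = \int_{\mathbb{R}} \sum_{i=1}^{n} \left[ \varphi_\lambda(t - U_i) - \frac{n-1}{n}\, \mathbb{E}_\pi\big(\varphi_\lambda(t - U)\big) \right] dN_t.$$

Lemma

For all λ ∈ Λ, $\mathbb{E}(G(\varphi_\lambda)) = n \int_{\mathbb{R}} \varphi_\lambda(x)\,h(x)\,dx$, i.e. $\hat\beta_\lambda$ is an unbiased estimator for $\beta_\lambda$.

Furthermore, its variance is upper bounded as follows:

$$\mathrm{Var}(\hat\beta_\lambda) \le C \left[ \frac{1}{n} + \frac{n}{T^2} \right].$$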
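Because N is a point measure, the integral against dN_t is simply a sum over the observed children. The following is a minimal sketch for the Haar basis, assuming that π denotes the uniform law on [0;T] (the slides do not spell this out) so that the centering term has a closed form through the primitive of the wavelet. All function names are illustrative, and this direct evaluation is not the cascade algorithm mentioned later in the talk.

```python
def phi(x):
    """Haar father function: indicator of [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def psi(x):
    """Haar mother function: -1 on [0, 1/2], +1 on ]1/2, 1]."""
    if 0.0 <= x <= 0.5:
        return -1.0
    return 1.0 if 0.5 < x <= 1.0 else 0.0


def varphi(j, k, x):
    """Basis function indexed by lambda = (j, k), with j >= -1."""
    return phi(x - k) if j == -1 else 2.0 ** (j / 2) * psi(2.0 ** j * x - k)


def primitive(j, k, x):
    """Integral of varphi_(j,k) over ]-inf, x]."""
    if j == -1:
        return min(max(x - k, 0.0), 1.0)
    y = 2.0 ** j * x - k
    if y <= 0.0 or y >= 1.0:
        return 0.0
    return 2.0 ** (-j / 2) * (-y if y <= 0.5 else y - 1.0)


def mean_shift(j, k, t, T):
    """E_pi(varphi(t - U)) for U uniform on [0, T] (assumed meaning of pi)."""
    return (primitive(j, k, t) - primitive(j, k, t - T)) / T


def centered_sum(j, k, t, parents, T):
    """Sum over parents of varphi(t - U_i) - (n-1)/n * E_pi(varphi(t - U))."""
    n = len(parents)
    return sum(varphi(j, k, t - u) for u in parents) - (n - 1) * mean_shift(j, k, t, T)


def beta_hat(j, k, parents, children, T):
    """G(varphi_lambda) / n, the dN integral being a sum over the children."""
    return sum(centered_sum(j, k, t, parents, T) for t in children) / len(parents)


# One parent at 1.0 and one child at 1.25: the child sits in the "-1" half of psi.
print(beta_hat(-1, 0, [1.0], [1.25], 10.0), beta_hat(0, 0, [1.0], [1.25], 10.0))
```

With a single parent the centering term vanishes (n − 1 = 0), so the two printed values are just φ(0.25) and ψ(0.25).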

(17)

Description of our method

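Assumption: h is compactly supported in [−A;A], with A > 0 (A = the maximal memory along DNA sequences).

$\Gamma = \{\lambda = (j,k) \in \Lambda : -1 \le j \le j_0,\ k \in K_j\}$, a deterministic subset of Λ with $j_0 \in \mathbb{N}^*$ → $|\Gamma| \simeq 2^{j_0}$.

Given some parameter γ > 0, for any λ ∈ Γ, the threshold:

$$\eta_\lambda(\gamma,\Delta) = \sqrt{2\gamma j_0\, \widetilde{V}\!\left(\frac{\varphi_\lambda}{n}\right)} + \frac{\gamma j_0}{3}\, B\!\left(\frac{\varphi_\lambda}{n}\right) + \Delta\, \frac{N_{\mathbb{R}}}{n}$$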

Δ a positive quantity (of order $\frac{j_0^2\, 2^{j_0/2}}{n} + \frac{\sqrt{j_0}}{T} + \frac{\sqrt{j_0 n}}{T}$ for theoretical results),

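$N_{\mathbb{R}}$ = number of points of the process N lying in R,

$$B\!\left(\frac{\varphi_\lambda}{n}\right) = \frac{1}{n} \left\| \sum_{i=1}^{n} \left[ \varphi_\lambda(\cdot - U_i) - \frac{n-1}{n}\, \mathbb{E}_\pi\big(\varphi_\lambda(\cdot - U)\big) \right] \right\|_\infty,$$

$$\widetilde{V}\!\left(\frac{\varphi_\lambda}{n}\right) = \frac{1}{n^2} \left[ \hat{V}(\varphi_\lambda) + \sqrt{2\gamma j_0\, \hat{V}(\varphi_\lambda)\, B^2(\varphi_\lambda)} + 3\gamma j_0\, B^2(\varphi_\lambda) \right],$$

$$\hat{V}(\varphi_\lambda) = \int_{\mathbb{R}} \left( \sum_{i=1}^{n} \left[ \varphi_\lambda(t - U_i) - \frac{n-1}{n}\, \mathbb{E}_\pi\big(\varphi_\lambda(t - U)\big) \right] \right)^{2} dN_t.$$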

(21)

Description of our method

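$$\eta_\lambda(\gamma,\Delta) = \sqrt{2\gamma j_0\, \widetilde{V}\!\left(\frac{\varphi_\lambda}{n}\right)} + \frac{\gamma j_0}{3}\, B\!\left(\frac{\varphi_\lambda}{n}\right) + \Delta\, \frac{N_{\mathbb{R}}}{n}$$

B, $\hat{V}$ and $\widetilde{V}$ only depend on the observations and can be exactly computed.

$\tilde\beta$, the estimator of $\beta = (\beta_\lambda)_{\lambda \in \Lambda}$ associated with the previous thresholding rule:

$$\tilde\beta = \left( \hat\beta_\lambda\, 1_{|\hat\beta_\lambda| > \eta_\lambda(\gamma,\Delta)}\, 1_{\lambda \in \Gamma} \right)_{\lambda \in \Lambda}.$$

$\tilde h = \sum_{\lambda \in \Lambda} \tilde\beta_\lambda \varphi_\lambda$, an estimator of h that only depends on the choice of (γ,Δ) and $j_0$, fixed later.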
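The thresholding rule can be sketched in a few lines of Python. This is an illustration only: it assumes that the unnormalized quantities V̂(φ_λ) and B(φ_λ) have already been computed from the data, and the function and argument names (`threshold`, `hard_threshold`, `delta_cap` for Δ) are hypothetical.

```python
import math


def threshold(v_hat, b, n_points, n, j0, gamma, delta_cap):
    """eta_lambda(gamma, Delta) from the unnormalized V_hat(varphi) and B(varphi).

    n_points is N_R, the total number of children; delta_cap stands for Delta.
    """
    b_n = b / n               # B(varphi / n)
    v_n = v_hat / n ** 2      # V_hat(varphi) / n^2
    # V_tilde(varphi / n), written in terms of the normalized quantities.
    v_tilde = v_n + math.sqrt(2 * gamma * j0 * v_n * b_n ** 2) + 3 * gamma * j0 * b_n ** 2
    return (math.sqrt(2 * gamma * j0 * v_tilde)
            + gamma * j0 / 3 * b_n
            + delta_cap * n_points / n)


def hard_threshold(beta_hat, eta):
    """Keep a coefficient only when its absolute value exceeds its threshold."""
    return {lam: (b if abs(b) > eta[lam] else 0.0) for lam, b in beta_hat.items()}


# Illustrative numbers only: two coefficients indexed by lambda = (j, k) in Gamma.
beta_hat = {(-1, 0): 0.5, (0, 0): 0.01}
eta = {lam: threshold(0.0, 0.0, 50, 100, 5, 1.0, 0.1) for lam in beta_hat}
beta_tilde = hard_threshold(beta_hat, eta)
print(beta_tilde)
```

Coefficients outside Γ are simply absent from the dictionaries, which plays the role of the indicator of λ ∈ Γ.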

(22)

Main results

An oracle type inequality.

Theorem

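We assume that $n \ge 2$, $j_0 \in \mathbb{N}^*$ such that $2^{j_0} \le n < 2^{j_0+1}$, $\gamma > 2\log 2$ and Δ is defined in a technical way.

Then the estimator $\tilde h$, previously defined, satisfies

$$\mathbb{E}\big(\|\tilde h - h\|_2^2\big) \le C_1 \inf_{m \subset \Gamma} \left\{ \sum_{\lambda \notin m} \beta_\lambda^2 + |m| \left[ \frac{1}{n} + \frac{n}{T^2} \right] (\log n)^4 \right\} + C_2 \left[ \frac{1}{n} + \frac{n}{T^2} \right].$$

"DNA case" (n proportional to T):

$$\mathbb{E}\big(\|\tilde h - h\|_2^2\big) \le C_1 \inf_{m \subset \Gamma} \left\{ \sum_{\lambda \notin m} \beta_\lambda^2 + \frac{(\log T)^4}{T}\, |m| \right\} + \frac{C_2}{T}.$$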
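The passage from the general bound to the "DNA case" is a one-line substitution. Writing n = cT for some constant c > 0 (the symbol c is introduced here only for illustration):

```latex
\frac{1}{n} + \frac{n}{T^2}
  = \frac{1}{cT} + \frac{cT}{T^2}
  = \Big( \frac{1}{c} + c \Big) \frac{1}{T},
\qquad
\log n = \log T + \log c .
```

Both bracketed terms are therefore of order 1/T and (log n)^4 is of order (log T)^4 for large T, the dependence on c being absorbed into the constants C_1 and C_2.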

(24)

Main results

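A minimax result on Besov balls, still with n proportional to T.

$$\mathcal{B}^s_{2,\infty}(R) = \left\{ f = \sum_{\lambda \in \Lambda} \beta_\lambda \varphi_\lambda \ :\ \forall j \ge -1,\ \sum_{k \in K_j} \beta_{(j,k)}^2 \le R^2\, 2^{-2js} \right\}$$

Corollary ("DNA case")

Let R > 0 and $s \in \mathbb{R}$ such that 0 < s < r+1. Assume that $h \in \mathcal{B}^s_{2,\infty}(R)$ and n is proportional to T.

Then the estimator $\tilde h$ satisfies

$$\mathbb{E}\big(\|\tilde h - h\|_2^2\big) \le C \left( \frac{(\log T)^4}{T} \right)^{\frac{2s}{2s+1}}.$$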
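A heuristic sketch of where this rate comes from, assuming that K_j contains of order 2^j indices (the slides do not define K_j, so this is an assumption) and choosing m = {λ = (j,k) ∈ Γ : j ≤ J} in the oracle inequality for some level J ≤ j_0:

```latex
\sum_{\lambda \notin m} \beta_\lambda^2
  \le \sum_{j > J} R^2\, 2^{-2js} \lesssim R^2\, 2^{-2Js},
\qquad
|m| \lesssim 2^{J},
\\[4pt]
2^{-2Js} \asymp 2^{J}\, \frac{(\log T)^4}{T}
\iff
2^{J} \asymp \Big( \frac{T}{(\log T)^4} \Big)^{\frac{1}{2s+1}}
\implies
2^{-2Js} \asymp \Big( \frac{(\log T)^4}{T} \Big)^{\frac{2s}{2s+1}} .
```

Balancing the bias term against the variance term in this way gives the stated rate; the remaining C_2/T term is of smaller order.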

(26)

Implementation procedure

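From now on, we consider the "DNA case": n is proportional to T.

Computation of the family of random thresholds $(\eta_\lambda(\gamma,\delta))_{\lambda \in \Gamma}$:

$$\eta_\lambda(\gamma,\delta) = \sqrt{2\gamma j_0\, \hat{V}\!\left(\frac{\varphi_\lambda}{n}\right)} + \frac{\gamma j_0}{3}\, B\!\left(\frac{\varphi_\lambda}{n}\right) + \frac{\delta}{\sqrt{T}}\, \frac{N_{\mathbb{R}}}{n},$$

where $\Delta = \delta / \sqrt{T}$ (because n is proportional to T).

We set $j_0 = 5$ in the sequel.

Computation of

$$\sum_{i=1}^{n} \left[ \varphi_\lambda(t - U_i) - \frac{n-1}{n}\, \mathbb{E}_\pi\big(\varphi_\lambda(t - U)\big) \right]$$

with a cascade algorithm (inspired by Mallat (1989)), in order to compute the coefficients $\hat\beta_\lambda$, $\hat{V}$ and B.

Choice of the parameters γ and δ?

→ calibration of parameters from a practical point of view.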

(30)

Influence promoters/genes in E. coli

Data:

the sequence composed of both strands of the E. coli genome, of length 4 639 221 bases (we took 10 000 bases for the maximal memory)

→ a sequence of length 9 288 442 (= 2 × 4 639 221 + 10 000),

locations of 4 290 genes (we took the positions of the first base of coding sequences),

locations of 1 036 occurrences of the major promoter: tataat.

For convenience, we work on a scale of 1:1000 and we set T = 9289 and so A = 10.
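The rescaling can be checked with a few lines of arithmetic. The constant names are illustrative, and rounding T up to the next integer is an assumption made here to match the value T = 9289 quoted above.

```python
import math

GENOME_LENGTH = 4_639_221   # bases on one strand of the E. coli genome
MAX_MEMORY = 10_000         # bases taken for the maximal memory
SCALE = 1000                # the data are handled on a 1:1000 scale

# Both strands plus the maximal memory.
total_length = 2 * GENOME_LENGTH + MAX_MEMORY
T = math.ceil(total_length / SCALE)   # assumption: rounded up to an integer
A = MAX_MEMORY // SCALE

print(total_length, T, A)   # prints: 9288442 9289 10
```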

(32)

Influence promoters/genes in E. coli

How does the DNA motif tataat influence genes?

parents = tataat, children = genes.

(34)

Influence promoters/genes in E. coli

How do genes influence the DNA motif tataat?

parents = genes, children = tataat.

(36)

Conclusion

Our random thresholding procedure is optimal in the oracle and minimax setting.

Some simulations illustrate the robustness of our procedure.

The application to genomic data validates our procedure, with a good detection of favored or avoided distances between occurrences of tataat and genes along the E. coli genome.

Further possible extensions of our model:

a more sophisticated model that takes into account the phenomena of spontaneous appearance and self-excitation,

an extension of our cascade algorithm to general wavelet bases and not only to Haar bases,

a study of similar processes in the spatial framework.

Thanks for your attention!
