• Aucun résultat trouvé

Biological motivation

N/A
N/A
Protected

Academic year: 2022

Partager "Biological motivation"

Copied!
38
0
0

Texte intégral

(1)

Biological motivation and our model Our method and general results Application to genomic data

Estimation non-paramétrique dans un modèle d’interactions poissoniennes et

application à des données génomiques

Laure SANSONNET - Université Paris-Sud 11

Mardi 17 avril 2012

Colloque des jeunes probabilistes et statisticiens - CIRM Marseille -

(2)

Biological motivation and our model Our method and general results Application to genomic data

Contents

1 Biological motivation and our model

2 Our method and general results

3 Application to genomic data

(3)

Biological motivation and our model Our method and general results Application to genomic data

Biological motivation

Two given events modeled by point processesP1 andP2:

how does P1 influence P2?

Any type of interaction, for example:

in neurosciences, in economics, in genomics, . . .

"DNA case": study of favored or avoided distances between two given motifs along a genome.

motif=sequence of letters in the alphabet {a,c,g,t}.

Genomes are long and motifs of interest are short.

→ we work in a continuous framework.

→ occurrences of a motif=a point process lying in [0;T], where T is the normalized length of the studied genome.

To study the influence ofP2 onP1, we just invert their roles in the model.

(4)

Biological motivation and our model Our method and general results Application to genomic data

Poisson process on the real line

LetN be a random countable set of points ofR(here).

NA number of points of N inA, dN =P

X∈NδX. Poisson process

NA obeys a Poisson law P(ν(A)),

ifA1, . . . ,A` are disjoint measurable sets,NA1, . . . ,NA` are independent random variables.

ν is a measure called "mean measure".

Generally,dν(t) =h(t)dt.

Ifh =constant,N is a homogeneous Poisson process.

(5)

Biological motivation and our model Our method and general results Application to genomic data

Our model

• • • •

U1 U2 U3 U4

h(· −U1) h(· −U2) h(· −U3) h(· −U4)

• • •• • •• • • • • • • •

We observe the occurrences of both given motifs:

Parents : U1, . . . ,Un i.i.d. uniform random variables on[0;T]. Children : Poisson process N with intensity

n

X

i=1

h(t−Ui).

Aim: Estimate h.

(6)

Biological motivation and our model Our method and general results Application to genomic data

Our model

• • • •

U1 U2 U3 U4

h(· −U1) h(· −U2) h(· −U3) h(· −U4)

• • •• • •• • • • • • • •

We observe the occurrences of both given motifs:

Parents : U1, . . . ,Uni.i.d. uniform random variables on [0;T].

Children : Poisson process N with intensity

n

X

i=1

h(t−Ui). Aim: Estimate h.

(7)

Biological motivation and our model Our method and general results Application to genomic data

Our model

• • • •

U1 U2 U3 U4

h(· −U1) h(· −U2) h(· −U3) h(· −U4)

• • •• • •• • • • • • • •

We observe the occurrences of both given motifs:

Parents : U1, . . . ,Uni.i.d. uniform random variables on [0;T].

Children : Poisson process N with intensity

n

X

i=1

h(t−Ui). Aim: Estimate h.

(8)

Biological motivation and our model Our method and general results Application to genomic data

Our model

• • • •

U1 U2 U3 U4

h(· −U1) h(· −U2) h(· −U3) h(· −U4)

• • •• • •• • • • • • • •

We observe the occurrences of both given motifs:

Parents : U1, . . . ,Uni.i.d. uniform random variables on [0;T].

Children : Poisson process N with intensity

n

X

i=1

h(t−Ui).

Aim: Estimate h.

(9)

Biological motivation and our model Our method and general results Application to genomic data

Our model

• • • •

U1 U2 U3 U4

h(· −U1) h(· −U2) h(· −U3) h(· −U4)

• • •• • •• • • • • • • •

We observe the occurrences of both given motifs:

Parents : U1, . . . ,Uni.i.d. uniform random variables on [0;T].

Children : Poisson process N with intensity

n

X

i=1

h(t−Ui).

Aim: Estimate h.

(10)

Biological motivation and our model Our method and general results Application to genomic data

In genomics:

The first motif of interest is a rare word and is modeled by a homogeneous Poisson processN0 on [0;T].

Conditionally to the event "the number of points falling into [0;T]isn", the points ofN0 (i.e. the parents) obey the same law as a n-sample of uniform random variables on[0;T].

With very high probability,n is proportional to T.

→ the asymptotic considered in genomics: "DNA case".

Hawkes process:

Gusto and Schbath (2005), Reynaud-Bouret and Schbath (2010), Carstensen et al.(2010).

Our model:

no phenomenons of spontaneous apparition and self-excitation, but a nonparametric method of estimation, using a wavelet thresholding rule (no sparsity issues) and a double asymptotic.

(11)

Biological motivation and our model Our method and general results Application to genomic data

In genomics:

The first motif of interest is a rare word and is modeled by a homogeneous Poisson processN0 on [0;T].

Conditionally to the event "the number of points falling into [0;T]isn", the points ofN0 (i.e. the parents) obey the same law as a n-sample of uniform random variables on[0;T].

With very high probability,n is proportional to T.

→ the asymptotic considered in genomics: "DNA case".

Hawkes process:

Gusto and Schbath (2005), Reynaud-Bouret and Schbath (2010), Carstensen et al.(2010).

Our model:

no phenomenons of spontaneous apparition and self-excitation, but a nonparametric method of estimation, using a wavelet thresholding rule (no sparsity issues) and a double asymptotic.

(12)

Biological motivation and our model Our method and general results Application to genomic data

In genomics:

The first motif of interest is a rare word and is modeled by a homogeneous Poisson processN0 on [0;T].

Conditionally to the event "the number of points falling into [0;T]isn", the points ofN0 (i.e. the parents) obey the same law as a n-sample of uniform random variables on[0;T].

With very high probability,n is proportional to T.

→ the asymptotic considered in genomics: "DNA case".

Hawkes process:

Gusto and Schbath (2005), Reynaud-Bouret and Schbath (2010), Carstensen et al.(2010).

Our model:

no phenomenons of spontaneous apparition and self-excitation, but a nonparametric method of estimation, using a wavelet thresholding rule (no sparsity issues) and a double asymptotic.

(13)

Biological motivation and our model Our method and general results Application to genomic data

Framework

Assumption: h∈L1(R)∩L(R).

Decomposition of h on the Haar basis (obtained by dilatations and translations of φ=1[0;1] andψ=1]1

2;1]−1[0;1 2]):

h=X

λ∈Λ

βλϕλ with βλ = Z

R

h(x)ϕλ(x)dx,

where Λ ={λ= (j,k) :j >−1,k ∈Z} and∀x ∈R,

∀λ= (j,k)∈Λ, ϕλ(x) =

φ(x−k) ifj =−1 2j/2ψ(2jx −k) otherwise . For the theoretical results, we have used the decomposition of h on a particular biorthogonal wavelet basis, built by Cohen et al.(1992).

Aim: Estimate theβλ’s.

(14)

Biological motivation and our model Our method and general results Application to genomic data

Framework

Assumption: h∈L1(R)∩L(R).

Decomposition of h on the Haar basis (obtained by dilatations and translations of φ=1[0;1] andψ=1]1

2;1]−1[0;1 2]):

h=X

λ∈Λ

βλϕλ with βλ = Z

R

h(x)ϕλ(x)dx,

where Λ ={λ= (j,k) :j >−1,k ∈Z} and∀x ∈R,

∀λ= (j,k)∈Λ, ϕλ(x) =

φ(x−k) ifj =−1 2j/2ψ(2jx −k) otherwise . For the theoretical results, we have used the decomposition of h on a particular biorthogonal wavelet basis, built by Cohen et al.(1992).

Aim: Estimate theβλ’s.

(15)

Biological motivation and our model Our method and general results Application to genomic data

Framework

For allλinΛ,βˆλ= G(ϕλ) n , with G(ϕλ) =

Z

R n

X

i=1

ϕλ(t−Ui)−n−1

n Eπλ(t−U))

dNt.

Lemma

For allλ∈Λ,E(G(ϕλ)) =n Z

R

ϕλ(x)h(x)dx, i.e.βˆλ is an unbiased estimatorforβλ.

Furthermore, itsvarianceis upper boundedas follows: Var( ˆβλ)6C

1 n + n

T2

.

(16)

Biological motivation and our model Our method and general results Application to genomic data

Framework

For allλinΛ,βˆλ= G(ϕλ) n , with G(ϕλ) =

Z

R n

X

i=1

ϕλ(t−Ui)−n−1

n Eπλ(t−U))

dNt.

Lemma

For allλ∈Λ,E(G(ϕλ)) =n Z

R

ϕλ(x)h(x)dx, i.e.βˆλ is an unbiased estimatorforβλ.

Furthermore, itsvarianceisupper bounded as follows:

Var( ˆβλ)6C 1

n + n T2

.

(17)

Biological motivation and our model Our method and general results Application to genomic data

Description of our method

Assumption: h is compactly supported in[−A;A], with A>0 (A=the maximal memory along DNA sequences).

Γ =

λ= (j,k)∈Λ :−16j 6j0,k∈ Kj a deterministic subset ofΛ withj0∈N → |Γ| '2j0.

Given some parameterγ >0, for anyλ∈Γ, the threshold: ηλ(γ,∆) =

r

2γj0Veϕλ

n

+γj0 3 Bϕλ

n

+ ∆NR n

a positive quantity (of order j022nj0/2 +j0

T +

j0n T for theoretical results),

NR=number of points of the processNlying in R, B ϕnλ

= 1n Pn

i=1

ϕλ(· −Ui)n−1n Eπλ(· −U)) , Ve ϕnλ

= n12

Vˆλ) + q

2γj0Vˆλ)B2λ) +3γj0B2λ)

, Vˆλ) =R

R

Pn i=1

ϕλ(tUi)n−1n Eπλ(tU))2

dNt.

(18)

Biological motivation and our model Our method and general results Application to genomic data

Description of our method

Assumption: h is compactly supported in[−A;A], with A>0 (A=the maximal memory along DNA sequences).

Γ =

λ= (j,k)∈Λ :−16j 6j0,k∈ Kj a deterministic subset ofΛ withj0∈N → |Γ| '2j0.

Given some parameterγ >0, for anyλ∈Γ, the threshold: ηλ(γ,∆) =

r

2γj0Veϕλ

n

+γj0 3 Bϕλ

n

+ ∆NR n

a positive quantity (of order j022nj0/2 +j0

T +

j0n T for theoretical results),

NR=number of points of the processNlying in R, B ϕnλ

= 1n Pn

i=1

ϕλ(· −Ui)n−1n Eπλ(· −U)) , Ve ϕnλ

= n12

Vˆλ) + q

2γj0Vˆλ)B2λ) +3γj0B2λ)

, Vˆλ) =R

R

Pn i=1

ϕλ(tUi)n−1n Eπλ(tU))2

dNt.

(19)

Biological motivation and our model Our method and general results Application to genomic data

Description of our method

Assumption: h is compactly supported in[−A;A], with A>0 (A=the maximal memory along DNA sequences).

Γ =

λ= (j,k)∈Λ :−16j 6j0,k∈ Kj a deterministic subset ofΛ withj0∈N → |Γ| '2j0.

Given some parameterγ >0, for anyλ∈Γ, the threshold:

ηλ(γ,∆) = r

2γj0Veϕλ

n

+γj0 3 Bϕλ

n

+ ∆NR n

a positive quantity (of order j022nj0/2 +j0

T +

j0n T for theoretical results),

NR=number of points of the processNlying in R, B ϕnλ

= 1n Pn

i=1

ϕλ(· −Ui)n−1n Eπλ(· −U)) , Ve ϕnλ

= n12

Vˆλ) + q

2γj0Vˆλ)B2λ) +3γj0B2λ)

, Vˆλ) =R

R

Pn i=1

ϕλ(tUi)n−1n Eπλ(tU))2

dNt.

(20)

Biological motivation and our model Our method and general results Application to genomic data

Description of our method

Assumption: h is compactly supported in[−A;A], with A>0 (A=the maximal memory along DNA sequences).

Γ =

λ= (j,k)∈Λ :−16j 6j0,k∈ Kj a deterministic subset ofΛ withj0∈N → |Γ| '2j0.

Given some parameterγ >0, for anyλ∈Γ, the threshold:

ηλ(γ,∆) = r

2γj0Veϕλ

n

+γj0 3 Bϕλ

n

+ ∆NR n

a positive quantity (of order j022nj0/2 +j0

T +

j0n T for theoretical results),

NR=number of points of the processNlying in R, B ϕnλ

= 1n Pn

i=1

ϕλ(· −Ui)n−1n Eπλ(· −U)) , Ve ϕnλ

= n12

Vˆλ) + q

2γj0Vˆλ)B2λ) +3γj0B2λ)

, Vˆλ) =R

R

Pn i=1

ϕλ(tUi)n−1n Eπλ(tU))2 dNt.

(21)

Biological motivation and our model Our method and general results Application to genomic data

Description of our method

ηλ(γ,∆) = r

2γj0Veϕλ

n

+ γj0 3 Bϕλ

n

+∆NR n B,Vˆ andVe only depend on the observations and can be exactly computed.

β˜the estimator of β= (βλ)λ∈Λ associated with the previous thresholding rule:

β˜= βˆλ1|βˆ

λ|λ(γ,∆)1λ∈Γ

λ∈Λ. h˜=X

λ∈Λ

β˜λϕλ an estimator ofh that only depends on the choice of(γ,∆) andj0 fixed later.

(22)

Biological motivation and our model Our method and general results Application to genomic data

Main results

An oracle type inequality.

Theorem

We assume that n>2, j0 ∈N such that 2j0 6n<2j0+1, γ >2 log 2and∆is defined in a technical way.

Then the estimator˜h, previously defined, satisfies E

k˜h−hk22

6C1 inf

m⊂Γ

( X

λ6∈m

βλ2+|m|

"

1 n + n

T2

# (logn)4

) +C2

"

1 n + n

T2

# .

"DNA case" (n proportional to T) E

k˜h−hk22

6C1 inf

m⊂Γ

( X

λ6∈m

βλ2+(logT)4

T |m|

) +C2

T

(23)

Biological motivation and our model Our method and general results Application to genomic data

Main results

An oracle type inequality.

Theorem

We assume that n>2, j0 ∈N such that 2j0 6n<2j0+1, γ >2 log 2and∆is defined in a technical way.

Then the estimator˜h, previously defined, satisfies E

k˜h−hk22

6C1 inf

m⊂Γ

( X

λ6∈m

βλ2+|m|

"

1 n + n

T2

# (logn)4

) +C2

"

1 n + n

T2

# .

"DNA case" (n proportional to T) E

k˜h−hk22

6C1 inf

m⊂Γ

( X

λ6∈m

βλ2+(logT)4

T |m|

) +C2

T

(24)

Biological motivation and our model Our method and general results Application to genomic data

Main results

A minimax result on Besov balls still withn proportional toT. B2,∞s (R) =

f =X

λ∈Λ

βλϕλ,∀j >−1, X

k∈Kj

β(j2,k)6R22−2js

Corollary ("DNA case")

Let R>0and s ∈Rsuch that 0<s <r+1. Assume that h∈ Bs2,∞(R)and n is proportional to T .

Then the estimator˜h satisfies

E

kh˜−hk22 6C

(logT)4 T

2s+12s .

(25)

Biological motivation and our model Our method and general results Application to genomic data

Main results

A minimax result on Besov balls still withn proportional toT. B2,∞s (R) =

f =X

λ∈Λ

βλϕλ,∀j >−1, X

k∈Kj

β(j2,k)6R22−2js

Corollary ("DNA case")

Let R>0and s ∈R such that 0<s <r+1. Assume that h∈ Bs2,∞(R)and n is proportional to T .

Then the estimator˜h satisfies

E

k˜h−hk22 6C

(logT)4 T

2s+12s .

(26)

Biological motivation and our model Our method and general results Application to genomic data

Implementation procedure

From now on, we consider"DNA case": n is proportional toT. Computation of the family of random thresholds(ηλ(γ,δ))λ∈Γ:

ηλ(γ,δ) = r

2γj0Vˆϕλ n

+γj0

3 Bϕλ n

+ δ

√T NR

n , where∆ = δ

T (becausen is proportional to T).

We set j0=5in the sequel. Computation of

n

X

i=1

ϕλ(t−Ui)−n−1

n Eπλ(t−U))

, with a cascade algorithm (inspired by Mallat (1989)), in order to compute the coefficients βˆλ,Vˆ andB. Choice of the parameters γ andδ?

→ calibration of parameters from a practical point of view.

(27)

Biological motivation and our model Our method and general results Application to genomic data

Implementation procedure

From now on, we consider"DNA case": n is proportional toT. Computation of the family of random thresholds(ηλ(γ,δ))λ∈Γ:

ηλ(γ,δ) = r

2γj0Vˆϕλ n

+γj0

3 Bϕλ n

+ δ

√T NR

n , where∆ = δ

T (becausen is proportional to T).

We setj0=5in the sequel.

Computation of

n

X

i=1

ϕλ(t−Ui)−n−1

n Eπλ(t−U))

, with a cascade algorithm (inspired by Mallat (1989)), in order to compute the coefficients βˆλ,Vˆ andB. Choice of the parameters γ andδ?

→ calibration of parameters from a practical point of view.

(28)

Biological motivation and our model Our method and general results Application to genomic data

Implementation procedure

From now on, we consider"DNA case": n is proportional toT. Computation of the family of random thresholds(ηλ(γ,δ))λ∈Γ:

ηλ(γ,δ) = r

2γj0Vˆϕλ n

+γj0

3 Bϕλ n

+ δ

√T NR

n , where∆ = δ

T (becausen is proportional to T).

We setj0=5in the sequel.

Computation of

n

X

i=1

ϕλ(t−Ui)− n−1

n Eπλ(t−U))

, with a cascade algorithm (inspired by Mallat (1989)), in order to compute the coefficients βˆλ,Vˆ andB.

Choice of the parameters γ andδ?

→ calibration of parameters from a practical point of view.

(29)

Biological motivation and our model Our method and general results Application to genomic data

Implementation procedure

From now on, we consider"DNA case": n is proportional toT. Computation of the family of random thresholds(ηλ(γ,δ))λ∈Γ:

ηλ(γ,δ) = r

2γj0Vˆϕλ n

+γj0

3 Bϕλ n

+ δ

√T NR

n , where∆ = δ

T (becausen is proportional to T).

We setj0=5in the sequel.

Computation of

n

X

i=1

ϕλ(t−Ui)− n−1

n Eπλ(t−U))

, with a cascade algorithm (inspired by Mallat (1989)), in order to compute the coefficients βˆλ,Vˆ andB.

Choice of the parameters γ andδ?

→ calibration of parameters from a practical point of view.

(30)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

Data:

the sequence composed of both strands of E. coligenomeof length 4 639 221 bases (we took 10 000 bases for the maximal memory)

→ a sequence of length 9 288 442 (=2∗4639221+10000), locations of 4 290genes (we took the positions of the first base of coding sequences),

locations of 1 036 occurrences of the major promoter: tataat.

For convenience, we work on a scale of 1:1000 and we set T =9289and soA=10.

(31)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

Data:

the sequence composed of both strands of E. coligenomeof length 4 639 221 bases (we took 10 000 bases for the maximal memory)

→ a sequence of length 9 288 442 (=2∗4639221+10000), locations of 4 290genes (we took the positions of the first base of coding sequences),

locations of 1 036 occurrences of the major promoter: tataat.

For convenience, we work on a scale of 1:1000 and we set T =9289and soA=10.

(32)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

How does the DNA motif tataat influence genes?

parents =tataat, children= genes.

(33)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

How does the DNA motif tataat influence genes?

parents =tataat, children= genes.

(34)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

How does genes influence the DNA motif tataat?

parents =genes, children= tataat.

(35)

Biological motivation and our model Our method and general results Application to genomic data

Influence promoters/genes in E. coli

How does genes influence the DNA motif tataat?

parents =genes, children= tataat.

(36)

Biological motivation and our model Our method and general results Application to genomic data

Conclusion

Our random thresholding procedure is optimal in the oracle and minimax setting.

Some simulations illustrate the robustness of our procedure.

The application to genomic data validates our procedure with a good detection of favored or avoided distances between occurrences of tataatand genes along theE. coli genome.

Further possible extensions of our model:

a more sophisticated model that takes into account the phenomenons of spontaneous apparition and self-excitation, an extension of our cascade algorithm to general wavelet bases and not only to Haar bases,

a study of similar processes in the spatial framework. Thanks for your attention!

(37)

Biological motivation and our model Our method and general results Application to genomic data

Conclusion

Our random thresholding procedure is optimal in the oracle and minimax setting.

Some simulations illustrate the robustness of our procedure.

The application to genomic data validates our procedure with a good detection of favored or avoided distances between occurrences of tataatand genes along theE. coli genome.

Further possible extensions of our model:

a more sophisticated model that takes into account the phenomenons of spontaneous apparition and self-excitation, an extension of our cascade algorithm to general wavelet bases and not only to Haar bases,

a study of similar processes in the spatial framework.

Thanks for your attention!

(38)

Biological motivation and our model Our method and general results Application to genomic data

Conclusion

Our random thresholding procedure is optimal in the oracle and minimax setting.

Some simulations illustrate the robustness of our procedure.

The application to genomic data validates our procedure with a good detection of favored or avoided distances between occurrences of tataatand genes along theE. coli genome.

Further possible extensions of our model:

a more sophisticated model that takes into account the phenomenons of spontaneous apparition and self-excitation, an extension of our cascade algorithm to general wavelet bases and not only to Haar bases,

a study of similar processes in the spatial framework.

Thanks for your attention!

Références

Documents relatifs

We describe a fast algorithm for counting points on elliptic curves dened over nite elds of small characteristic, following Satoh.. Our main contribution is an extension

All models incorporate a full suite of year effects and subfield age effects, as well as a term common to both treated and control subfields that switches from 0 to 1 after the

scenario, an increase in the uptake of HOBr does not lead to a signif- icant increase in bromine release from the sea salt aerosol, because it already is mainly partitioned to the

The bursting noise gives rise to higher or slower autocorrelation time depending on the respective value of the mean production rate and the degradation rate.

English: We extend Zeilberger’s fast algorithm for definite hypergeometric summa- tion to non-hypergeometric holonomic sequences. The algorithm generalizes to the differential case

The present model, which is an extension of the one developed by Lennard- Jones and Devonshire, uses a twin lattice, along with the conjugate lattices, and hence allows for two

Abstract. This note is devoted to an analysis of the so-called peeling algorithm in wavelet denoising. Assuming that the wavelet coefficients of the signal can be modeled by

As another important evidence that vesseloids organize into mature blood vessels within 1 day, we observed that vascular endothelial (VE)- cadherin was strongly expressed