
Constant time estimation of ranking statistics by analytic combinatorics


HAL Id: hal-00567091

https://hal.archives-ouvertes.fr/hal-00567091

Submitted on 18 Feb 2011



Cyril Banderier, Pierre Nicodème

To cite this version:

Cyril Banderier, Pierre Nicodème. Constant time estimation of ranking statistics by analytic combinatorics. Statistical Methods for Post-Genomic Data, Jan 2011, Paris, France. no page numbering.

⟨hal-00567091⟩


C. Banderier and P. Nicodème

December 14, 2010

Abstract

We consider i.i.d. increments (or jumps) $X_i$ that are integers in $J \subseteq [-c, \dots, +d]$ for $c, d \in \mathbb{N}$, the partial sums $S_j = \sum_{1 \le i \le j} X_i$, and the discrete walks $((j, S_j))_{1 \le j \le n}$. Late conditioning by a return of the walk to zero at time $n$ provides discrete bridges that we denote $(B_j)_{1 \le j \le n}$. We give in this extended abstract the asymptotic law in the central domain of the height $\max_{1 \le j \le n} B_j$ of the bridges as $n$ tends to infinity. As expected, this law converges to the Rayleigh law, which is the law of the maximum of a standard Brownian bridge. In the case where $c = 1$ (only one negative jump), we provide a full expansion of the asymptotic limit, which improves upon the rate of convergence $O(\log(n)/\sqrt{n})$ given by Borisov [4] for lattice jumps; this applies in particular to the case where $X_i \in \{-1, +d\}$, in which case the expansion is expressible as a function of $n$, $d$ and the height of the bridge. Applying this expansion for $X_i \in \{-1, d/c\}$ gives an excellent approximation of the case $X_i \in \{-c, +d\}$ and provides in constant time an indicator used in ranking statistics; this indicator can be used for medical diagnosis and bioinformatics analysis (see Keller et al. [8], who compute it in time $O(n \times \min(c, d))$ by dynamic programming).

1 Generating function of upper bounded bridges

We use generating functions as the main tool for the precise analysis of the behaviour of the walks. Asymptotic methods then provide excellent results for bioinformatics applications (see Section 3 and Figure 2).

We first consider the characteristic (Laurent) polynomial of the jumps,

$$P(u) = p_d u^d + \cdots + p_{-c} \frac{1}{u^c},$$

where the $p_i$ are weights. Then we define the generating function of the altitude of the walk at time $k$ as

$$f_k(u) = \sum_{-kc \le j \le kd} f_{k,j}\, u^j, \qquad (1)$$

where $f_{k,j}$ is the number of walks at altitude $j$ at time $k$ if $p_i = 1$ for all $i$, or the probability of this event if $P(1) = 1$.

We consider walks that are forbidden to go above a barrier $h$ (the level $h$ itself is permitted). We can write a recurrence for the Laurent polynomials $f_k(u)$ by removing the cases that make the walk go above the barrier,

$$f_{k+1}(u) = f_k(u) P(u) - \sum_{i=1}^{d} u^{h+i}\, [u^{h+i}]\, f_k(u) P(u). \qquad (2)$$

LIPN, CNRS-UMR 7030, Université Paris-Nord, 93430 Villetaneuse, France. http://lipn.fr/~banderier

LIX, CNRS-UMR 7161, École polytechnique, 91128 Palaiseau and AMIB Team, INRIA-Saclay, France.


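[Figure 1, two panels. Left panel (plot of $P(u) = u^3 + \frac{1}{u}$ against $u$): marked abscissae $u_1(\zeta)$, $\tau$, $v_1(\zeta)$; marked levels $1/\zeta$ and $P(\tau) = 1/\rho$. Right panel (roots against $z$): curves $u_1(z)$, $v_1(z)$ and $|v_2(z)|, |v_3(z)|$; marked abscissae $z = \zeta = 1/3$ and $z = \rho$; marked level $\tau$.]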

Figure 1: See [1, 2] for the relevant proofs (in particular, we do not comment upon the fact that some walks are periodic while others are not; see also [3]). If the equation $1 - zP(u) = 0$ has $c$ small roots $u_i(z)$ (that tend to zero at the origin) and $d$ large roots $v_j(z)$ (that tend to infinity at the origin), there are two real dominant roots of $1 - zP(u) = 0$ for $z \in ]0, \rho]$, a small root (that we call $u_1(z)$) and a large root (that we call $v_1(z)$), such that $\max_{2 \le i \le c} |u_i(z)| < u_1(|z|) < v_1(|z|) < \min_{2 \le j \le d} |v_j(z)|$ in the disk $|z| < \rho$. (Figure 1 Left): behaviour of the particular characteristic polynomial $P(u) = u^3 + \frac{1}{u}$. (Figure 1 Right): a visual rendering of the domination property of the roots of $1 - zP(u) = 1 - z\left(u^3 + \frac{1}{u}\right)$ in the real interval $]0, \rho]$, where the number $\tau$ is the unique positive solution of $P'(u) = 0$ and $\rho = 1/P(\tau)$. For $\frac{1}{z} > \frac{1}{\rho}$, or $z < \rho$, the two dominant real solutions $u_1(z)$ and $v_1(z)$ are such that (i) $\lim_{z \to 0^+} u_1(z) = 0$ (small root) and $\lim_{z \to 0^+} v_1(z) = +\infty$ (dominant large root) and (ii) $u_1(z) < v_1(z)$. As a consequence of the identity $P'(\tau) = 0$, we have $u_1(\rho) = v_1(\rho)$. The non-dominant large roots $v_2(z)$ and $v_3(z)$ are algebraically conjugate and we have $u_1(z) < v_1(z) < |v_2(z)| = |v_3(z)|$ for $z \in ]0, \rho[$.
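As a rough numerical illustration of this caption, the two dominant real roots can be located by bisection for the example $P(u) = u^3 + 1/u$. The sketch below is not taken from the paper: the helper names (`bisect`, `dominant_real_roots`) are hypothetical, and it only uses the fact that this $P$ decreases on $]0, \tau[$ and increases on $]\tau, +\infty[$.

```python
def P(u: float) -> float:
    """Characteristic polynomial of the Figure 1 example: P(u) = u^3 + 1/u."""
    return u ** 3 + 1.0 / u

# tau solves P'(u) = 3u^2 - 1/u^2 = 0, i.e. u^4 = 1/3; rho = 1/P(tau).
TAU = 3.0 ** (-0.25)
RHO = 1.0 / P(TAU)

def bisect(f, lo, hi, iters=200):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if (fmid > 0) == (flo > 0):
            lo, flo = mid, fmid   # the root lies in [mid, hi]
        else:
            hi = mid              # the root lies in [lo, mid]
    return 0.5 * (lo + hi)

def dominant_real_roots(z: float):
    """Return (u1, v1), the positive real roots of 1 - z*P(u) = 0, for 0 < z < RHO.

    Assumption of this sketch: z is not smaller than about 1e-12, so that
    the fixed lower bracket 1e-12 lies to the left of u1.
    """
    if not 0.0 < z < RHO:
        raise ValueError("z must lie in ]0, rho[")

    def g(u):
        return P(u) - 1.0 / z     # 1 - z*P(u) = 0  <=>  P(u) = 1/z

    u1 = bisect(g, 1e-12, TAU)          # small root, on the decreasing branch
    v1 = bisect(g, TAU, 1.0 / z + 1.0)  # large root, on the increasing branch
    return u1, v1

if __name__ == "__main__":
    # z = 1/3 is the sample value marked on Figure 1.
    print(TAU, RHO, dominant_real_roots(1.0 / 3.0))
```

The two roots returned for $z = 1/3$ sit on either side of $\tau$, and they merge at $\tau$ as $z$ increases to $\rho$, which is the equality $u_1(\rho) = v_1(\rho)$ of the caption.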

We can multiply both sides of Equation (2) by $z^{k+1}$ and sum from $k = 0$ to $k = \infty$; assuming that the walk starts at zero at time zero, that is $f_0(u) = u^0 = 1$, this gives

$$F^{[\le h]}(z, u)\,(1 - zP(u)) = 1 - z u^{h+1} F_{h+1}(z) - \cdots - z u^{h+d} F_{h+d}(z), \qquad (3)$$

where $F^{[\le h]}(z, u) = \sum_{k \ge 0} z^k f_k(u)$ and the functions $F_{h+1}(z), \dots, F_{h+d}(z)$ are unknown.
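The recurrence (2) behind Equation (3) can also be iterated directly on the coefficients $f_{k,j}$, which is convenient for checking small cases. The following is a minimal sketch under our own naming (`bounded_walk_counts` and `bounded_bridges` are hypothetical helpers, not from the paper); it assumes $h \ge 0$, so that the starting point is below the barrier.

```python
from collections import defaultdict
from math import comb

def bounded_walk_counts(weights, n, h):
    """Iterate f_{k+1}(u) = f_k(u) P(u) minus the monomials above the barrier h.

    `weights` maps each jump to its weight p_i; the result maps every
    altitude j <= h to the coefficient f_{n,j}. Assumes h >= 0.
    """
    f = {0: 1}  # f_0(u) = u^0 = 1
    for _ in range(n):
        nxt = defaultdict(int)
        for altitude, value in f.items():
            for jump, weight in weights.items():
                target = altitude + jump
                if target <= h:  # drop the terms u^{h+1}, ..., u^{h+d}
                    nxt[target] += value * weight
        f = dict(nxt)
    return f

def bounded_bridges(weights, n, h):
    """Coefficient [u^0] of f_n(u): bridges of length n that never exceed h."""
    return bounded_walk_counts(weights, n, h).get(0, 0)

# Sanity check on the +-1 walk: by the reflection principle there are
# C(2m, m) - C(2m, m + h + 1) bridges of length 2m staying at or below h.
assert bounded_bridges({1: 1, -1: 1}, 10, 2) == comb(10, 5) - comb(10, 8)
```

With unit weights the function counts walks; with weights summing to one (for instance `fractions.Fraction` values) it returns probabilities, in line with the two readings of $f_{k,j}$ given after Equation (1).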

The popular kernel method² uses the fact that the roots $v(z)$ of the equation $1 - zP(u) = 0$ cancel the left-hand side of Equation (3); there are $d$ such solutions, the large roots $v_j(z)$ ($1 \le j \le d$), that tend to infinity as $z$ tends to zero. These solutions then provide a linear system of $d$ equations of the type

$$v_j(z)^{h+1} F_{h+1}(z) + \cdots + v_j(z)^{h+d} F_{h+d}(z) = 1/z.$$

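By solving the system we obtain explicit expressions $F_{h+j}(z) = D_j(z)/V(z)$, where $V(z)$ is a Vandermonde determinant in the roots $v_j(z)$ and the $D_j(z)$ are variants of it. For the generating function $F^{[>h]}(z, u) = \frac{1}{1 - zP(u)} - F^{[\le h]}(z, u)$ of the walks that go above the barrier, we obtain

$$F^{[>h]}(z, u) = \frac{1}{1 - zP(u)} \sum_{j=1}^{d} \frac{u^{h+1}}{v_j(z)^{h+1}} \frac{Q_j(u)}{Q_j(v_j)}, \quad \text{where } Q_j(t) = \prod_{\substack{1 \le i \le d \\ i \ne j}} (t - v_i(z)). \qquad (4)$$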

Considering the small roots $u_j(z)$ of $1 - zP(u) = 0$ that tend to zero at the origin, we have from [2], for $-k < -c$,

$$[u^{-k}] \frac{1}{1 - zP(u)} = z \sum_{j=1}^{c} \frac{u_j'(z)}{u_j(z)^{-k+1}} = [u^0] \frac{u^k}{1 - zP(u)},$$

which gives for the generating function of bridges

$$[u^0] F^{[>h]}(z, u) = z \sum_{j=1}^{d} \frac{1}{v_j(z)} \sum_{i=1}^{c} \left(\frac{u_i(z)}{v_j(z)}\right)^{h} \frac{u_i'(z)\, Q_j(u_i(z))}{Q_j(v_j(z))}. \qquad (5)$$

² For more information about this method, see [1, 2, 6].


2 Asymptotics

We now assume that (i) $P(1) = 1$ and (ii) $E(X_i) = 0$ (or $P'(1) = 0$); these two conditions imply that $\tau = \rho = 1$ (see the legend of Figure 1 for the relevant definitions). We then let $n$ and $h$ tend to infinity with the condition that $h = x\sigma\sqrt{n}$, where $x \in \mathbb{R}$ and $\sigma^2 = P''(1)$ is the variance of the jumps.

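Using the domination properties of the roots (see Figure 1), we obtain asymptotically

$$[u^0] F^{[>h]}(z, u) = z \left(\frac{u_1(z)}{v_1(z)}\right)^{h} \times \frac{u_1'(z)\, Q_1(u_1(z))}{v_1(z)\, Q_1(v_1(z))} \times \left(1 + O(C^h)\right) \quad \text{for } |z| < 1, \qquad (6)$$

with

$$C = \max\left( \max_{j \ge 2} \sup_{|z| < \rho^-} \frac{|v_1(z)|}{|v_j(z)|},\ \max_{j \ge 2} \sup_{|z| < \rho^-} \frac{|u_j(z)|}{|u_1(z)|} \right).$$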

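Considering the expansion of $1/P(u(z)) = z$ at $z = \rho = 1$, we get, as $z \sim 1$,

$$\begin{cases} u_1(z) = 1 - \sqrt{\frac{2}{\sigma^2}(1 - z)} + O(1 - z), \qquad v_1(z) = 1 + \sqrt{\frac{2}{\sigma^2}(1 - z)} + O(1 - z), \\[4pt] \dfrac{Q_1(u_1(z))}{Q_1(v_1(z))} = \dfrac{Q_1(1) + O(\sqrt{1 - z})}{Q_1(1) + O(\sqrt{1 - z})} = 1 + O(\sqrt{1 - z}). \end{cases}$$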
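These local expansions can be checked by hand on the symmetric $\pm 1$ walk, an example of our choosing that is not worked out in the paper: with $P(u) = \frac{1}{2}(u + 1/u)$ we have $P(1) = 1$, $P'(1) = 0$ and $\sigma^2 = P''(1) = 1$, and the kernel equation $1 - zP(u) = 0$ reduces to $z u^2 - 2u + z = 0$, so both roots are explicit. The function names below are hypothetical.

```python
import math

def u1(z: float) -> float:
    """Small root of z*u^2 - 2u + z = 0 (kernel equation of the +-1 walk)."""
    return (1.0 - math.sqrt(1.0 - z * z)) / z

def v1(z: float) -> float:
    """Large root of the same equation; note that u1(z) * v1(z) = 1."""
    return (1.0 + math.sqrt(1.0 - z * z)) / z

def u1_local(z: float, sigma2: float = 1.0) -> float:
    """Leading terms of the expansion of u1 at z = 1."""
    return 1.0 - math.sqrt(2.0 * (1.0 - z) / sigma2)

def v1_local(z: float, sigma2: float = 1.0) -> float:
    """Leading terms of the expansion of v1 at z = 1."""
    return 1.0 + math.sqrt(2.0 * (1.0 - z) / sigma2)

if __name__ == "__main__":
    for z in (0.9, 0.99, 0.999):
        # The last two columns should shrink like O(1 - z).
        print(z, u1(z) - u1_local(z), v1(z) - v1_local(z))
```

On this example the differences between the exact roots and their square-root approximations decrease proportionally to $1 - z$, as the $O(1 - z)$ terms predict.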

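Equation (6) is valid in a Delta-domain (see [7]), and we have in such a domain

$$[u^0] F^{[>x\sigma\sqrt{n}]}(z, u) = \frac{z}{\sigma\sqrt{2}} \cdot \frac{\left(1 - 2\sqrt{\frac{2}{\sigma^2}(1 - z)}\right)^{x\sigma\sqrt{n}}}{\sqrt{1 - z}} \times \left(1 + O(\sqrt{1 - z})\right) \times \left(1 + O(C^{x\sigma\sqrt{n}})\right).$$

Using now the Semi-Large powers theorem (see [7] again), we obtain

$$[z^n][u^0] F^{[>x\sigma\sqrt{n}]} = \frac{1}{\sigma\sqrt{2\pi n}} \times e^{-2x^2} \times \left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right).$$

Since [2] previously obtained

$$[z^n][u^0] F^{]-\infty, +\infty[} = \frac{1}{\sigma\sqrt{2\pi n}} \times \left(1 + O\left(\frac{1}{n}\right)\right),$$

we get

$$\Pr\left(\max_{0 \le i \le n} B_i > x\sigma\sqrt{n}\right) = \mathrm{Rayleigh}(x) \times \left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right),$$

where $\mathrm{Rayleigh}(x) = e^{-2x^2}$.

Łukasiewicz bridges: $J = \{-1, \dots, +d\}$.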

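We have in this case

$$Q_1(u_1(z)) = -\frac{1}{p_d z} \left. \frac{\partial}{\partial u} \frac{u(1 - zP(u))}{u - v_1(z)} \right|_{u = u_1(z)} = -\frac{1}{p_d z^2} \cdot \frac{u_1(z)}{u_1'(z)\,(u_1(z) - v_1(z))}. \qquad (7)$$

The value of $Q_1(v_1(z))$ follows by interchanging the roles of $u_1$ and $v_1$. This leads to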

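$$\frac{\Pr\left(\max_{1 \le j \le n} B_j > x\sigma\sqrt{n}\right)}{\exp(-2x^2)} = 1 + \frac{-\frac{2}{3}\, x\, \xi/\zeta^{3/2} - 6x/\sqrt{\zeta}}{\sqrt{n}} + \frac{1}{n} \Bigl( [\ldots]\, x^4 + [\ldots]\, x^2 + [\ldots] \Bigr) + O\left(\frac{1}{n^{3/2}}\right), \qquad (8)$$

(the bracketed coefficients of the $1/n$ term, which are built from $1/\zeta$, $\xi/\zeta^2$, $\xi^2/\zeta^3$ and $\theta/\zeta^2$, are not legible in this copy)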

where $\zeta = \sigma^2 = P''(1)$, $\xi = P'''(1)$ and $\theta = P''''(1)$. The algolib Maple package (more precisely, the gdev and equivalent functions developed by Bruno Salvy, see algo.inria.fr/librairies) can naturally push the expansion to higher orders.
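The Rayleigh limit, and the fact that the corrections vanish as $n$ grows, can be observed on the simplest Łukasiewicz case, the symmetric $\pm 1$ walk ($\sigma^2 = 1$). The sketch below is our own illustration and not the paper's method: it uses the classical reflection principle, under which the number of $\pm 1$ bridges of length $2m$ whose maximum exceeds $h$ is $\binom{2m}{m+h+1}$, and compares the exact tail with the leading term $e^{-2x^2}$ only. The function names are hypothetical.

```python
import math
from math import comb

def exact_tail(n: int, h: int) -> float:
    """Pr(max B_j > h) for +-1 bridges of even length n, by the reflection principle."""
    if n % 2 or h < 0:
        raise ValueError("n must be even and h non-negative")
    m = n // 2
    # Python divides arbitrarily large integers exactly before rounding to float.
    return comb(n, m + h + 1) / comb(n, m)

def rayleigh_tail(n: int, h: float, sigma2: float = 1.0) -> float:
    """Leading-order estimate exp(-2 x^2) with x = h / (sigma * sqrt(n))."""
    x = h / math.sqrt(sigma2 * n)
    return math.exp(-2.0 * x * x)

if __name__ == "__main__":
    # Same x = 1 at two different lengths: the ratio moves towards 1 as n grows.
    for n, h in ((100, 10), (10_000, 100)):
        print(n, h, exact_tail(n, h) / rayleigh_tail(n, h))
```

The ratio printed for the longer walk is much closer to 1 than for the shorter one, which is the qualitative content of the $1/\sqrt{n}$ and $1/n$ terms of the expansion.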

The particular case: $J = \{-1, +d\}$.

In this case, it is possible to compute the values of the derivatives of $P(u)$ evaluated at 1 as functions of $d$. For $P(u) = \frac{1}{1+d} u^d + \frac{d}{1+d} \frac{1}{u}$, we have in particular $P''(1) = d$, $P'''(1) = d(d - 4)$, ...; substituting these


[Figure 2: two plots of the tail probability $\Pr(\max_{1 \le j \le n} B_j > h)$ against $x = h/\sqrt{ncd}$, with vertical axes from 0 to 5e-05. First panel: $n = 104$, $d = 104$, $c = 11$, $d/c \approx 9.4545$, $x$ from 2.2 to 2.8. Second panel: $n = 1049$, $d = 1049$, $c = 111$, $d/c \approx 9.4504$, $x$ from 2.4 to 3.]

Figure 2: Convergence of the height of discrete bridges to the Rayleigh distribution (top continuous curve); the dotted curves correspond to simulations; the lower continuous curve follows from a refinement of Equation (8), computed up to $O(1/n^2)$, with $P(u) = \frac{1}{1+d/c} u^{d/c} + \frac{d/c}{1+d/c} \frac{1}{u}$ (heuristics). We emphasize the excellent precision of our asymptotics for small $n$.
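The quantities $\zeta = P''(1)$, $\xi = P'''(1)$ and $\theta = P''''(1)$ needed by Equation (8) are easy to evaluate exactly for the polynomial of this caption. The sketch below writes $a = d/c$ and differentiates $P(u) = \frac{1}{1+a} u^a + \frac{a}{1+a} \frac{1}{u}$ term by term; the function name `derivatives_at_one` is hypothetical, and the use of exact fractions is our choice.

```python
from fractions import Fraction
from math import factorial

def derivatives_at_one(a, orders=(1, 2, 3, 4)):
    """k-th derivatives at u = 1 of P(u) = u^a/(1+a) + a/((1+a) u).

    `a` plays the role of d (or of the ratio d/c) and may be a Fraction.
    """
    a = Fraction(a)
    values = {}
    for k in orders:
        falling = Fraction(1)
        for i in range(k):
            falling *= a - i                  # a(a-1)...(a-k+1): derivative of u^a at 1
        inverse = (-1) ** k * factorial(k)    # k-th derivative of 1/u at 1
        values[k] = (falling + a * inverse) / (1 + a)
    return values

if __name__ == "__main__":
    a = Fraction(104, 11)                     # the ratio d/c of the first panel
    derivs = derivatives_at_one(a)
    zeta, xi, theta = derivs[2], derivs[3], derivs[4]
    print(derivs[1], zeta, xi, theta)
```

For integer $a = d$ this reproduces the values $P''(1) = d$ and $P'''(1) = d(d - 4)$ quoted for $J = \{-1, +d\}$, and the first derivative vanishes, which is the zero-drift condition $P'(1) = 0$.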

3 Bioinformatics application

A set of $G$ genes is expressed in a given tissue; this provides a ranking of the genes by level of expression. Consider now the same ranking and a subset of specific interest of $g$ genes: if these $g$ genes have a high level of expression, they will mostly appear at the top of the ranking. The aim is to provide a statistical estimator for exceptional behaviours. Keller et al. [8] proposed the following approach: while scanning the ranking of the $G$ genes from left to right, build a random walk $(B_i)_{0 \le i \le G}$ starting at zero and such that its jump at time $i$ is

$G - g$ if the gene at rank $i$ belongs to the subset of $g$ genes,

$-g$ if the gene at rank $i$ belongs to the remaining $G - g$ genes.

By construction, we have $B_0 = B_G = 0$ and these walks are therefore bridges. The tail probability (referred to as p-value) that Keller et al. choose as statistical indicator is p-value $= \Pr(\max_{1 \le i \le G} B_i > h)$, for any chosen $h$. They provide a dynamic programming algorithm computing this indicator in complexity $O(G \times g)$. We compute the indicator heuristically in constant time by setting $n = G$ and $P(u) = \frac{1}{1+d/c} u^{d/c} + \frac{d/c}{1+d/c} \frac{1}{u}$ with $d = G - g$ and $c = g$, and applying Equation (8); see Figure 2 for an example.
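The construction of the walk and the constant-time idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it keeps only the leading Rayleigh term $e^{-2x^2}$ instead of the full Equation (8), with $x = h/\sqrt{ncd}$ as on the axes of Figure 2 (for the rescaled jumps $\{-1, d/c\}$ the variance is $d/c$ and the height is $h/c$). The function names and the toy gene labels are hypothetical.

```python
import math

def enrichment_walk(ranking, subset):
    """Partial sums B_0, ..., B_G: jump G - g on a gene of the subset, -g otherwise.

    Assumes every gene of `subset` occurs exactly once in `ranking`.
    """
    subset = set(subset)
    G, g = len(ranking), len(subset)
    walk = [0]
    for gene in ranking:
        walk.append(walk[-1] + (G - g if gene in subset else -g))
    return walk

def rayleigh_p_value(G: int, g: int, h: float) -> float:
    """Leading-order estimate of Pr(max B_i > h), with n = G, c = g, d = G - g."""
    x = h / math.sqrt(G * g * (G - g))
    return math.exp(-2.0 * x * x)

if __name__ == "__main__":
    ranking = list("abcdefgh")    # toy ranking of G = 8 genes
    subset = {"a", "c"}           # g = 2 genes of interest
    walk = enrichment_walk(ranking, subset)
    print(walk, rayleigh_p_value(len(ranking), len(subset), max(walk)))
```

The cost of `rayleigh_p_value` does not depend on $G$ or $g$, in contrast with the $O(G \times g)$ dynamic programming; for values of $n$ as small as this toy example the correction terms of Equation (8) matter and the leading term alone is only indicative.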

References

[1] Banderier, C. Combinatoire analytique des chemins et des cartes. Ph.D. thesis, Université Paris VI, 2001.

[2] Banderier, C., and Flajolet, P. Basic analytic combinatorics of directed lattice paths. Theoretical Computer Science 281, 1-2 (2002), 37–80. (Special volume dedicated to M. Nivat).

[3] Banderier, C., and Nicodème, P. Bounded discrete walks. In Proc. AofA 2010 (2010), DMTCS, pp. 35–48. Vienna, June 2010.

[4] Borisov, I. S. On the rate of convergence in the "conditional" invariance principle. Theory of Probability and its Applications 23, 1 (1978), 63–76.

[5] Bousquet-Mélou, M. Discrete excursions. Sém. Lothar. Combin. 57 (2008).

[6] Bousquet-Mélou, M., and Petkovšek, M. Linear recurrences with constant coefficients: the multivariate case. Discrete Math. 225, 1-3 (2000), 51–75.

[7] Flajolet, P., and Sedgewick, R. Analytic Combinatorics. Cambridge University Press, 2009. (810 pages). See also http://algo.inria.fr/flajolet/Publications/books.html.

[8] Keller, A., Backes, C., and Lenhof, H. P. Computation of significance scores of unweighted gene set enrichment analyses. BMC Bioinformatics 8, 290 (2007). Also available as http://www.biomedcentral.com/1471-2105/8/290.
