STATISTICS OF BOSE SAMPLES FROM DIRICHLET PROPORTIONS


HAL Id: hal-00136315

https://hal.archives-ouvertes.fr/hal-00136315

Submitted on 13 Mar 2007


Thierry Huillet

To cite this version:

Thierry Huillet. STATISTICS OF BOSE SAMPLES FROM DIRICHLET PROPORTIONS. Sankhya: The Indian Journal of Statistics, Indian Statistical Institute, 2007, 69 (2), pp.190-220. ⟨hal-00136315⟩


Laboratoire de Physique Théorique et Modélisation, CNRS-UMR 8089 et Université de Cergy-Pontoise, 2, Avenue Adolphe Chauvin, 95302, Cergy-Pontoise, France

October 18, 2006

Abstract

To fix the background and notations, we shall first briefly revisit some aspects of the following Ewens-like randomized occupancy problem: assume distinguishable particles are to be placed at random into the cells of the unit interval which was previously broken into random pieces according to the (Poisson-)Dirichlet partitioning model. Particles being distinguishable, the statistical structure of the problem can be understood within the Maxwell-Boltzmann setup.

In this note, we shall address the following sampling problem of a different nature: assume now that indistinguishable particles are to be placed at random within the cells with (Poisson-)Dirichlet distributed sizes. Then the statistical formalism to be used is the one of Bose-Einstein. We show that in the grand canonical ensemble, the Bose sampling procedure from (Poisson-)Dirichlet proportions is, to a large extent, amenable to exact analytic calculations. This concerns for example the full Bose occupancy distributions, the distribution of the number of distinct occupied fragments, and the number of cells with a prescribed amount of particles. Using a grand canonical approach, a phase transition phenomenon is shown to take place provided the disorder parameter of the (Poisson-)Dirichlet partition is large enough; we describe this phase transition in some detail.

Keywords and Phrases: random discrete distribution, Dirichlet, sampling, Ewens, urns, Maxwell-Boltzmann and Bose-Einstein statistics, disordered systems, phase transition.

1 Introduction

Sampling from random Dirichlet and Poisson-Dirichlet partitions has long been a subject of recurrent interest (see Tavaré and Ewens (1997) and references therein for historical background and applications to various fields).

In one model, a number $k$ of distinguishable particles (balls) are sequentially and uniformly thrown on the interval which has been previously partitioned at random into $n$ pieces (fragments, bins or cells) according to the Dirichlet law with “disorder” parameter $\theta > 0$. Since particles are distinguishable, the statistical structure of the problem can nicely be understood within the Maxwell-Boltzmann setup. Many interesting questions can be (and have been) developed within this randomized occupancy framework, for instance and to cite only a few:

- What is the joint cells occupancy distribution?
- What is the state of cell occupancies if the sequential sampling process is stopped when some cell has received $c > 1$ particles for the first time? (the randomized Banach match box problem if $n = 2$).
- What is the sample size (particle number) till the first visit to the smallest fragment?
- What is the sample size till some fragment has been visited twice for the first time? (the birthday problem).
- What is the sample size till all fragments have been visited at least once ($r$ times)? (the coupon collector problem).
- What is the sample size between consecutive visits to distinct fragments?
- What is the number of distinct fragments visited by the $k$-sample?
- The laws of succession and Pólya urn scheme....

These problems were also naturally investigated within the extended framework of the Poisson-Dirichlet partitioning model. This model may be viewed by taking an appropriate Kingman weak limit ($n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$) of the Dirichlet partitioning model after reordering the random fragment sizes.

For instance, the occupancy distribution in this context yields the celebrated Ewens Sampling Formula.

The purpose of this note is to address the following sampling problem: assume now that $k$ indistinguishable particles are to be placed at random into the cells with Dirichlet distributed random sizes. Then the statistical formalism to be used is the one of Bose-Einstein. Since the image of a sequential throw is lost (in a way, particles are now thrown all at once), the questions relative to stopping times become meaningless, to some extent. However, the questions relative to the statistics of occupancies remain pertinent. We shall essentially be concerned here with this distributional problem and its specificities. It turns out that the canonical Bose occupancy distributions (particle number $k$ is fixed) are difficult to handle analytically. However, we will show that in the grand canonical ensemble (where particle number is appropriately randomized), the Bose sampling procedure from Dirichlet and Poisson-Dirichlet proportions is, to a large extent, amenable to exact analytic calculations. We shall also show that when the Dirichlet disorder parameter $\theta$ is large enough (namely when $\theta > 1/(n-1)$), a phase transition occurs which is reminiscent of a Bose-Einstein like condensation phenomenon in a different random energy levels occupancy context. In the Poisson-Dirichlet situation, a similar phase transition occurs under the condition that $\gamma > 1$. No such critical phenomena arise in the classical Maxwell-Boltzmann formulation of the sampling problem.

2 Preliminaries on Ewens-sampling from Dirichlet populations

To fix the ideas, notations and analogies, we start by recalling some ingredients of the classical occupancy statistics when particles are distinguishable (Maxwell-Boltzmann) before turning in the next section to the main purpose of this work: the statistical properties of Bose samples from Dirichlet populations (when particles to be placed are indistinguishable).

2.1 Dirichlet partition of the interval

Consider the following random partition into $n$ fragments (cells or states) of the unit interval. Let $\theta > 0$ be some ‘disorder’ parameter and assume that the random fragment sizes $\mathbf{S}_n := (S_1, .., S_n)$ (with $\sum_{m=1}^n S_m = 1$) are distributed according to the (exchangeable) Dirichlet density function on the simplex, that is to say

$$f_{S_1,..,S_n}(s_1, .., s_n) = \frac{\Gamma(n\theta)}{\Gamma(\theta)^n} \prod_{m=1}^n s_m^{\theta-1} \cdot \delta_{\left(\sum_{m=1}^n s_m - 1\right)}. \tag{2.1}$$

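Alternatively, with $(\theta)_q := \Gamma(\theta+q)/\Gamma(\theta)$ and $\Gamma(.)$ the Euler-gamma function, using well-known properties of Dirichlet integrals, the law of $\mathbf{S}_n$ can also be characterized by its joint moment function

$$E\left(\prod_{m=1}^n S_m^{q_m}\right) = \frac{1}{(n\theta)_{\sum_{m=1}^n q_m}} \prod_{m=1}^n (\theta)_{q_m}, \quad \text{with } q_m > -\theta. \tag{2.2}$$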

If this is so, we shall say $\mathbf{S}_n \overset{d}{\sim} D_n(\theta)$. There are two ways to generate a Dirichlet partition:

- (normalizing) Firstly, $\mathbf{S}_n$ can be obtained by considering the independent and identically distributed (iid) random vector $\mathbf{X}_n := (X \overset{d}{=} X_1, .., X_n)$, satisfying $X \overset{d}{\sim} \text{gamma}(\theta)$, and by letting $S_m = X_m/(X_1 + .. + X_n)$, $m = 1, .., n$.

- (conditioning) Secondly, with $x > 0$, consider the partitioning of the interval $[0, x]$ obtained by conditioning as follows:

$$\mathbf{S}_n(x) := (X_1, .., X_n \mid X_1 + .. + X_n = x)$$

where $\mathbf{X}_n$ is as above. Then $\mathbf{S}_n := \mathbf{S}_n(1)$ has Dirichlet distribution and the following important scaling property holds: $\mathbf{S}_n(x) \overset{d}{=} x\,\mathbf{S}_n(1)$.

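If $\mathbf{S}_n \overset{d}{\sim} D_n(\theta)$, then $S_m \overset{d}{=} S_n$, $m = 1, .., n$, independently of $m$: the individual fragment sizes are all identically distributed. Their common density on the interval $(0,1)$ is a beta$(\theta, (n-1)\theta)$ density, with $E(S_n) = 1/n$ and $\sigma^2(S_n) = \frac{n-1}{n^2(n\theta+1)}$. In particular, $\phi(q) := E(S_n^q) = (\theta)_q/(n\theta)_q$ is the moment function of the typical fragment size $S_n$. Further, as $n \uparrow \infty$,

$$n S_n \overset{d}{\to} \Gamma_{\theta,\theta} \overset{d}{\sim} \text{gamma}(\theta, \theta), \quad \text{with density } f_{\Gamma_{\theta,\theta}}(t) = \frac{\theta^\theta}{\Gamma(\theta)}\, t^{\theta-1} e^{-\theta t},\ t > 0.$$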
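The normalizing construction and the marginal moments above can be illustrated numerically. The following is a minimal sketch, assuming only the Python standard library; the function names `sample_dirichlet` and `fragment_variance` are illustrative and do not come from the paper.

```python
import random

def sample_dirichlet(n, theta, rng):
    """Draw (S_1, .., S_n) ~ D_n(theta) by normalizing n iid gamma(theta) variables."""
    # gammavariate(alpha, beta): shape alpha, scale beta; the scale cancels after normalizing.
    x = [rng.gammavariate(theta, 1.0) for _ in range(n)]
    total = sum(x)
    return [xm / total for xm in x]

def fragment_variance(n, theta):
    """Variance (n - 1) / (n^2 (n theta + 1)) of a typical fragment size S_n."""
    return (n - 1) / (n ** 2 * (n * theta + 1))

if __name__ == "__main__":
    rng = random.Random(2007)
    n, theta = 4, 2.0
    first = [sample_dirichlet(n, theta, rng)[0] for _ in range(20000)]
    mean = sum(first) / len(first)
    var = sum((s - mean) ** 2 for s in first) / len(first)
    print("empirical mean", mean, "expected", 1 / n)
    print("empirical variance", var, "expected", fragment_variance(n, theta))
```

With a few thousand draws the empirical mean and variance of any single coordinate should be close to $1/n$ and $\frac{n-1}{n^2(n\theta+1)}$, respectively.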

For each $m_1 \neq m_2 \in [n]$, as conventional wisdom suggests, $(S_{m_1}, S_{m_2})$ are negatively correlated with $\mathrm{Cov}(S_{m_1}, S_{m_2}) = -\frac{\sigma^2(S_n)}{n-1} = -\frac{1}{n^2(n\theta+1)}$. When $\theta = 1$, the partition model Eqs. (2.1, 2.2) corresponds to the standard uniform random partitioning model of the interval. When $\theta \uparrow \infty$, $\mathbf{S}_n = (1/n, .., 1/n)$, the deterministic uniform partition. Consider next the sequence $\mathbf{S}_{(n)} := (S_{(m)}; m = 1, .., n)$ obtained by ranking the spacings vector $\mathbf{S}_n$ according to descending sizes, hence with $S_{(1)} > .. > S_{(m)} > .. > S_{(n)}$. The distribution of the $S_{(m)}$s can hardly be derived in closed form. However, one can prove that, as $n \uparrow \infty$,

$$n^{(1+\theta)/\theta} S_{(n)} \overset{d}{\to} W_\theta \quad \text{and} \quad n\theta \left( S_{(1)} - \frac{1}{n\theta} \log\left(n \log^{\theta-1} n\right) \right) \overset{d}{\to} G_\theta$$

where $W_\theta$ is a Weibull random variable and $G_\theta$ a Gumbel random variable such that $P(W_\theta > t) = \exp\left(-\frac{t^\theta}{s_\theta}\right)$, $t > 0$, and $P(G_\theta \leq t) = \exp\left(-\frac{1}{s_\theta} \exp(-t)\right)$, $t \in \mathbb{R}$, with $s_\theta := \Gamma(1+\theta)/\theta^\theta > 0$ a scale parameter. Note that $s_\theta > 1$ if $\theta \in (0,1)$, $s_{\theta=1} = 1$, $s_\theta < 1$ if $\theta > 1$, and $s_\theta \to 1$ as $\theta \downarrow 0$. In the random division of the interval as in Eq. (2.1), although all fragments are identically distributed with sizes of order $n^{-1}$, the smallest fragment size scales like $n^{-(\theta+1)/\theta}$ while the largest is of order $\frac{1}{n\theta} \log\left(n \log^{\theta-1} n\right)$. The smaller the disorder $\theta$ is, the larger (smaller) the largest (smallest) fragment size is: hence, the smaller $\theta$ is, the more disparate the values of the $S_m$s are, with high probability. When $\theta$ is small, the size of the largest fragment $S_{(1)}$ tends to dominate the other ones. On the contrary, large values of $\theta$ correspond to situations in which the range of fragment sizes is narrower: the fragment sizes look more homogeneous and, in the limit $\theta \uparrow \infty$, distribution Eq. (2.1) concentrates on its centre $\left(\frac{1}{n}, .., \frac{1}{n}\right)$. For large disorder $\theta$, the diversity of the partition is small.

Although $\mathbf{S}_n$ has a degenerate weak limit when $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$, this limit is worth considering (see Kingman (1975), (1978) and (1993)). Indeed, in this $*$-limit, $\mathbf{S}_{(n)} \overset{*}{\to} \mathbf{S}_{(\infty)} \overset{d}{\sim} PD(\gamma)$, which is the Poisson-Dirichlet distribution with parameter $\gamma$; see Kingman (1993) and Tavaré-Ewens (1997). We shall also call $\gamma$ the “disorder” parameter since $PD$ partitions with large $\gamma$ approach the ‘uniform distribution’ on the infinite-dimensional simplex whereas for small values of $\gamma$, the largest fragment is the dominant one.

2.2 Maxwell-Boltzmann approach to sampling problems from Dirichlet partition

Before discussing the specific statistical features of Bose samples drawn from Dirichlet populations, we first revisit the Ewens approach to sampling formulae which is akin to a Maxwell-Boltzmann sampling procedure.

• Sampled fragment size when sample size is 1:

Assume first that the sample size is $k = 1$ and suppose the sampled tagged fragment is the one hit by a uniform random throw of a particle on the interval. Under our hypothesis, this particle will land on fragment number $m$ with conditional (given $\mathbf{S}_n$) probability $S_m$. The corresponding fragment size attached to this single particle, say $\mathcal{S}_n$, has conditional law given by

$$P_{\mathbf{S}_n}(\mathcal{S}_n = S_m) = S_m, \quad m = 1, .., n.$$

(Here and throughout, the subscript $\mathbf{S}_n$ in $P_{\mathbf{S}_n}$ (or $E_{\mathbf{S}_n}$) will denote conditional probability (or expectation) given $\mathbf{S}_n$.) Let $E_{\mathbf{S}_n}(\mathcal{S}_n^q)$ be its moment function and put $\phi_{\mathbf{S}_n}(q) := \sum_{m=1}^n S_m^q$. Then $E_{\mathbf{S}_n}(\mathcal{S}_n^q) = \phi_{\mathbf{S}_n}(q+1)/\phi_{\mathbf{S}_n}(1)$. Averaging over $\mathbf{S}_n$,

$$E(\mathcal{S}_n^q) := E\,E_{\mathbf{S}_n}(\mathcal{S}_n^q) = n E\left(S_n^{q+1}\right) =: n\phi(q+1)$$

characterizes the distribution of the fragment size of this single particle. In particular,

$$E(\mathcal{S}_n) = n\phi(2) = \frac{\theta+1}{n\theta+1} > \frac{1}{n}.$$

In this size-biased picking procedure $\mathbf{S}_n \to \mathcal{S}_n$, states with large size are clearly favored. One therefore expects (and this is indeed true) that $\mathcal{S}_n$ is stochastically larger than the typical fragment size $S_n$ from $\mathbf{S}_n$.

• Maxwell-Boltzmann-Ewens sampling (sample size is $k > 1$):

The full Maxwell-Boltzmann sampling version of the randomized occupancy problem proceeds as follows: let $(U_1, .., U_k)$ be $k$ iid uniform throws on $[0,1]$ partitioned by $\mathbf{S}_n$. Let $(B_{n,k}(1), .., B_{n,k}(n))$ be an integral-valued random vector which counts the number of visits of the particles thus thrown to the different fragments in a $k$-sample in the following sense: if $M_l$ is the random state label which the $l$-th trial hits, then $B_{n,k}(m) := \sum_{l=1}^k I(M_l = m)$, $m = 1, .., n$ (where $I(A)$ stands for the set indicator of the event $A$). As stated above, $P_{\mathbf{S}_n}(M_l = m) = S_m$ and state $m$ is chosen proportionally to its size $S_m$. With $(b_1, .., b_n) \in \mathbb{N}_0^n$ (where $\mathbb{N}_0 := \{0, 1, 2, ..\}$) satisfying $\sum_{m=1}^n b_m = k$, $(B_{n,k}(m) = b_m; m = 1, .., n)$ clearly follows the conditional multinomial distribution with randomized probabilities $\mathbf{S}_n$:

$$P_{\mathbf{S}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{k!}{\prod_{m=1}^n b_m!} \prod_{m=1}^n S_m^{b_m}. \tag{2.3}$$

Proceeding in this way to fill up sequentially the states $\mathbf{S}_n$, particles are clearly assumed distinguishable. Indeed, in the above expression of the probability, the multinomial factor $\frac{k!}{\prod_{m=1}^n b_m!}$ represents the number of ways to distribute $k$ labelled particles into $n$ distinguishable boxes with respective occupancies $(b_1, .., b_n)$.

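We note from Eq. (2.3) that, given $\mathbf{S}_n$,

$$(B_{n,k}(m); m = 1, .., n) \overset{d}{=} (\xi_1, .., \xi_n \mid \zeta_n = k)$$

where $(\xi_1, .., \xi_n)$ are mutually independent on $\mathbb{N}_0^n$ with sum $\zeta_n := \sum_{m=1}^n \xi_m$ and

$$P_{\mathbf{S}_n}(\xi_m = b_m) = \frac{S_m^{b_m} e^{-S_m}}{b_m!}, \quad b_m \in \mathbb{N}_0,$$

which are Poisson distributions with random means $S_m \overset{d}{\sim} \text{beta}(\theta, (n-1)\theta)$, for each $m = 1, .., n$.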

Averaging over $\mathbf{S}_n$ gives the Maxwell-Boltzmann distribution

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = E\,P_{\mathbf{S}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{k!}{\prod_{m=1}^n b_m!} \frac{1}{(n\theta)_k} \prod_{m=1}^n (\theta)_{b_m},$$

also known in the statistical context as the Dirichlet multinomial distribution.
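The Dirichlet multinomial probabilities can be evaluated exactly for small $n$ and $k$. The sketch below uses rational arithmetic so that normalization can be checked without rounding error; the helper names (`rising`, `occupancies`, `dirichlet_multinomial_pmf`) are illustrative choices, not notation from the paper.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def rising(a, q):
    """Rising factorial (a)_q = a (a+1) .. (a+q-1), with (a)_0 = 1."""
    out = Fraction(1)
    for i in range(q):
        out *= a + i
    return out

def occupancies(n, k):
    """All (b_1, .., b_n) in N_0^n with b_1 + .. + b_n = k (brute force, small cases only)."""
    return [b for b in product(range(k + 1), repeat=n) if sum(b) == k]

def dirichlet_multinomial_pmf(b, theta):
    """k! / prod(b_m!) * prod (theta)_{b_m} / (n theta)_k for occupancies b."""
    n, k = len(b), sum(b)
    theta = Fraction(theta)
    prob = Fraction(factorial(k)) / rising(n * theta, k)
    for bm in b:
        prob *= rising(theta, bm) / factorial(bm)
    return prob

if __name__ == "__main__":
    theta = Fraction(1, 2)
    total = sum(dirichlet_multinomial_pmf(b, theta) for b in occupancies(3, 4))
    print("total mass:", total)
    print("theta = 1, b = (4, 0, 0):", dirichlet_multinomial_pmf((4, 0, 0), 1))
```

For $\theta = 1$ every occupancy vector receives the same probability, which is the uniform case discussed in the examples.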

Examples: When $\theta \uparrow \infty$, the partition $\mathbf{S}_n$ reduces to $\left(\frac{1}{n}, .., \frac{1}{n}\right)$ which is not random. It follows from Eq. (2.3) that

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{k!}{\prod_{m=1}^n b_m!}\, n^{-k}.$$

Unless otherwise specified, we shall assume in the sequel that $\theta < \infty$, which means that sampling really is from Dirichlet probabilities which are indeed random.

When $\theta = 1$, the partition $\mathbf{S}_n$ reduces to the random uniform partition. It follows from Eq. (2.3) that

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{k!}{(n)_k} = \frac{1}{\binom{n+k-1}{k}},$$

the uniform distribution on the set $\{b_m \in \mathbb{N}_0, m = 1, .., n : \sum_{m=1}^n b_m = k\}$. ♦

We also recall the following almost sure convergence which follows from the conditional strong law of large numbers:

Lemma 1 It holds that

$$(B_{n,k}(m); m = 1, .., n)/k \to \mathbf{S}_n \quad \text{as } k \uparrow \infty, \tag{2.4}$$

in distribution and almost surely.

In a Maxwell-Boltzmann approach to the sampling problem from Dirichlet proportions, the proportions of sampled fragments when the sample size is large are balanced (no concentration phenomenon within a specific fragment) and the Dirichlet partition is recovered in the limit.

• Sampling distribution as a random allocation scheme:

Let $(\eta_m)_{m \geq 1}$ be an iid sequence of negative-binomial (or Pólya) distributed random variables on $\mathbb{N}_0$ with mean 1 and distribution

$$P(\eta_1 = b_1) = \frac{(\theta)_{b_1}}{b_1!}\, x^{b_1} \bar{x}^{\theta}, \quad b_1 = 0, 1, ... \tag{2.5}$$

with $x = (1+\theta)^{-1}$, $\bar{x} := 1 - x$. The generating function of $\eta_1$ is

$$E\left(u_1^{\eta_1}\right) = \left(\frac{1 - x u_1}{\bar{x}}\right)^{-\theta}, \quad 0 \leq u_1 < 1/x. \tag{2.6}$$

The random variable $\eta_1$ has mean 1 and variance $1 + \frac{1}{\theta}$, exceeding 1 for finite $\theta$ (it is over-dispersed compared to a mean 1 Poisson distribution). Its distribution can be obtained by randomizing the intensity of a Poisson distribution by a gamma$(\theta, \theta)$-distributed independent random variable (with mean 1 and variance $1/\theta$); it is a gamma-Poisson mixture. Let $\mu_n := \sum_{m=1}^n \eta_m$, $n \geq 1$, be the partial sum sequence of $(\eta_m)_{m \geq 1}$ with $\mu_0 := 0$. Then, one can check that

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = P(\eta_1 = b_1, .., \eta_n = b_n \mid \mu_n = k).$$

The unconditional multinomial-Dirichlet distribution is in the class of random allocation schemes such as the ones obtained by conditioning a random walk by its terminal value (see Kolchin (1986), Johnson and Kotz (1977) for instance).

• The number of distinct visited fragments:

Let now $P_{n,k} := \sum_{m=1}^n I(B_{n,k}(m) > 0)$ count the number of distinct fragments which have been visited in the $k$-sampling process. With $1 \leq m_1 < .. < m_p \leq n$ a subset of $p$ labels from $\{1, .., n\}$, with $b_q \in \mathbb{N} := \{1, 2, ..\}$, $q = 1, .., p$, we clearly have

$$P_{\mathbf{S}_n}(M_1, .., M_k \in \{m_1, .., m_p\}; B_{n,k}(m_1) = b_1, .., B_{n,k}(m_p) = b_p; P_{n,k} = p) = \frac{k!}{\prod_{q=1}^p b_q!} \prod_{q=1}^p S_{m_q}^{b_q}. \tag{2.7}$$

Define next $B_{n,k}(q) > 0$, $q = 1, .., p$, to be the numbers of type-$q$ fragments where the $P_{n,k} = p$ fragments observed were labelled in an arbitrary way (independently of the sampling mechanism). Averaging the last formula over $\mathbf{S}_n$, summing over the $\binom{n}{p}$ sequences of hit labels and making use of its exchangeability, we easily obtain (see Huillet (2005), for details)

Theorem 2 (i) With $b_q \in \mathbb{N}$: $\sum_{q=1}^p b_q = k$, we have

$$P(B_{n,k}(1) = b_1, .., B_{n,k}(p) = b_p; P_{n,k} = p) = \binom{n}{p} \frac{k!}{\prod_{q=1}^p b_q!} \frac{1}{(n\theta)_k} \prod_{q=1}^p (\theta)_{b_q}. \tag{2.8}$$

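(ii) With $(\theta)_\bullet := ((\theta)_1, (\theta)_2, ..)$ and

$$B_{k,p}((\theta)_\bullet) := \frac{k!}{p!} \sum_{b_q \in \mathbb{N}:\ \sum_{q=1}^p b_q = k}\ \prod_{q=1}^p \frac{(\theta)_{b_q}}{b_q!}$$

the Bell polynomials in the indeterminates $(\theta)_\bullet$, it holds that

$$P(P_{n,k} = p) = \frac{n!}{(n-p)!} \frac{1}{(n\theta)_k} B_{k,p}((\theta)_\bullet) \tag{2.9}$$

where $p = 1, .., n \wedge k$.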

Concerning the distribution of $P_{n,k}$, we also have the conditional transition probabilities

$$P(P_{n,k+1} = p+1 \mid P_{n,k} = p) = \frac{(n-p)\theta}{n\theta + k},$$

$$P(P_{n,k+1} = p \mid P_{n,k} = p) = \frac{\sum_{r=1}^p (\theta + b_r)}{n\theta + k} = \frac{p\theta + k}{n\theta + k}.$$

Therefore, the following recurrence holds

$$P(P_{n,k+1} = p) = \frac{(n-p+1)\theta}{n\theta + k} P(P_{n,k} = p-1) + \frac{p\theta + k}{n\theta + k} P(P_{n,k} = p).$$

As a simple application of the inclusion-exclusion principle, we shall finally recall a straightforward representation of the probability $P(P_{n,k} = p)$ in the form of an alternating sum (see for example Keener et al (1987), pages 1471-1472). This is an explicit expression of this probability in contrast with Eq. (2.9) which, as just shown, is recursive.

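Corollary 3 (i) With $\langle\theta\rangle_{n,k;m} := \frac{((n-m)\theta)_k}{(n\theta)_k}$, $m = 0, .., n-1$, the generating function of $P_{n,k}$ reads

$$E\left(u^{P_{n,k}}\right) = \sum_{m=0}^{n-1} \binom{n}{m} u^{n-m} (1-u)^m \langle\theta\rangle_{n,k;m}. \tag{2.10}$$

In particular, the mean and variance are given by

$$E(P_{n,k}) = n\left(1 - \langle\theta\rangle_{n,k;1}\right) = n\left(1 - \frac{((n-1)\theta)_k}{(n\theta)_k}\right), \tag{2.11}$$

$$\sigma^2(P_{n,k}) = n\left(\langle\theta\rangle_{n,k;1} + (n-1)\langle\theta\rangle_{n,k;2} - n\langle\theta\rangle_{n,k;1}^2\right).$$

(ii)

$$P(P_{n,k} = p) = \sum_{q=1}^{p} (-1)^{p-q} \binom{n}{p} \binom{p}{q} \langle\theta\rangle_{n,k;n-q}. \tag{2.12}$$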

Remark: When $\theta = 1$, one can check from Eq. (2.11) that $E(P_{n,k}) = nk/(n+k-1)$, which for large $n$ and $k$ is close to half the harmonic mean of $n$ and $k$. ♦
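The recurrence in $k$ and the inclusion-exclusion formula (2.12) give two independent routes to the law of $P_{n,k}$, so they can be cross-checked. The following is a minimal sketch in exact rational arithmetic; the function names are illustrative, and the recurrence is started from the assumed convention $P(P_{n,0} = 0) = 1$ (no particle, no visited fragment).

```python
from fractions import Fraction
from math import comb

def rising(a, q):
    """Rising factorial (a)_q = a (a+1) .. (a+q-1), with (a)_0 = 1."""
    out = Fraction(1)
    for i in range(q):
        out *= a + i
    return out

def distinct_by_recurrence(n, k, theta):
    """[P(P_{n,k} = p) for p = 0..n], iterating the one-step recurrence k times."""
    theta = Fraction(theta)
    dist = [Fraction(1)] + [Fraction(0)] * n   # assumed start: zero particles placed
    for j in range(k):                          # j particles already placed
        new = [Fraction(0)] * (n + 1)
        denom = n * theta + j
        for p in range(n + 1):
            if dist[p] == 0:
                continue
            new[p] += dist[p] * (p * theta + j) / denom          # old fragment hit
            if p < n:
                new[p + 1] += dist[p] * (n - p) * theta / denom  # new fragment hit
        dist = new
    return dist

def distinct_by_inclusion_exclusion(n, k, theta):
    """[P(P_{n,k} = p) for p = 0..n] from Eq. (2.12), for k >= 1."""
    theta = Fraction(theta)

    def h(m):  # <theta>_{n,k;m} = ((n - m) theta)_k / (n theta)_k
        return rising((n - m) * theta, k) / rising(n * theta, k)

    return [
        sum((-1) ** (p - q) * comb(n, p) * comb(p, q) * h(n - q) for q in range(1, p + 1))
        for p in range(n + 1)
    ]

if __name__ == "__main__":
    n, k, theta = 4, 6, Fraction(1, 3)
    print(distinct_by_recurrence(n, k, theta))
    print(distinct_by_inclusion_exclusion(n, k, theta))
```

Both lists should coincide entry by entry, and their mean should agree with Eq. (2.11).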

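• Kingman limit:

With $s_{k,p} := B_{k,p}((\bullet - 1)!)$ the absolute value of the first kind Stirling numbers, taking the $*$-limit $n \uparrow \infty$, $\theta \downarrow 0$, $n\theta = \gamma > 0$, using $B_{k,p}((\theta)_\bullet) \sim \theta^p B_{k,p}((\bullet - 1)!)$, we easily get

$$P(P_{n,k} = p) \to P(P_k = p) = \frac{\gamma^p s_{k,p}}{(\gamma)_k}, \quad p = 1, .., k \tag{2.13}$$

and

$$P(B_{n,k}(1) = b_1, .., B_{n,k}(p) = b_p \mid P_{n,k} = p) \to P(B_k(1) = b_1, .., B_k(p) = b_p \mid P_k = p) = \frac{k!}{p!} \frac{1}{s_{k,p} \prod_{q=1}^p b_q}. \tag{2.14}$$

We note that the law of $P_k$ in this case is in the class of exponential families. Further, the generating function of $P_k$ takes the simple form

$$E\left(u^{P_k}\right) = \frac{(\gamma u)_k}{(\gamma)_k}. \tag{2.15}$$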

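In particular, the mean and variance are given by

$$E(P_k) = \sum_{l=0}^{k-1} \frac{\gamma}{\gamma + l}, \quad \sigma^2(P_k) = \sum_{l=0}^{k-1} \frac{\gamma l}{(\gamma + l)^2}.$$

In this context, we recall the important result of Korwar and Hollander (1973):

$$\frac{P_k}{\log k} \to \gamma, \quad k \uparrow \infty, \text{ almost surely.} \tag{2.16}$$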
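The limiting law (2.13) is easy to tabulate once the unsigned Stirling numbers of the first kind are available. The sketch below builds them from the standard recurrence $s_{k+1,p} = k\, s_{k,p} + s_{k,p-1}$ and checks the mean and variance formulas in exact arithmetic; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

def rising(a, q):
    """Rising factorial (a)_q = a (a+1) .. (a+q-1), with (a)_0 = 1."""
    out = Fraction(1)
    for i in range(q):
        out *= a + i
    return out

def stirling_first_abs(k):
    """Row [s_{k,0}, .., s_{k,k}] of unsigned Stirling numbers of the first kind."""
    row = [1]                      # s_{0,0} = 1
    for j in range(k):             # build row j + 1 from row j
        new = [0] * (j + 2)
        for p, val in enumerate(row):
            new[p] += j * val      # s_{j+1,p} gets j * s_{j,p}
            new[p + 1] += val      # s_{j+1,p+1} gets s_{j,p}
        row = new
    return row

def distinct_species_pmf(k, gamma):
    """[P(P_k = p) for p = 0..k] = gamma^p s_{k,p} / (gamma)_k."""
    gamma = Fraction(gamma)
    s = stirling_first_abs(k)
    return [gamma ** p * s[p] / rising(gamma, k) for p in range(k + 1)]

if __name__ == "__main__":
    k, gamma = 6, Fraction(3, 2)
    pmf = distinct_species_pmf(k, gamma)
    mean = sum(p * pr for p, pr in enumerate(pmf))
    print("pmf:", pmf)
    print("mean:", mean, "formula:", sum(gamma / (gamma + l) for l in range(k)))
```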

• The second Ewens formula for Dirichlet populations:

Let now $A_{n,k}(i)$, $i \in \{0, .., k\}$, count the number of fragments in the $k$-sample with $i$ representatives, that is

$$A_{n,k}(i) = \#\{m \in \{1, .., n\} : B_{n,k}(m) = i\} = \sum_{m=1}^n I(B_{n,k}(m) = i). \tag{2.17}$$

Then $\sum_{i=0}^k A_{n,k}(i) = n$, $\sum_{i=1}^k A_{n,k}(i) = p$ is the number of fragments visited by the $k$-sample and $A_{n,k}(0)$ the number of unvisited ones. Note that $\sum_{i=1}^k i A_{n,k}(i) = k$ is the sample size.

The vector (An,k(1), .., An,k(k)) is called the fragments vector count or the species vector count in biology, see Ewens (1990). In Sibuya (1993), it is called the size-index vector and in Good (1968), the frequency of frequencies.

In this case (see Huillet (2005) for computational details), with $\{n\}_p := n(n-1)..(n-p+1)$ the order $p$ falling factorial of $n$, we have

Theorem 4 For any $a_i \geq 0$, $i = 1, .., k$, satisfying $\sum_{i=1}^k i a_i = k$ and $\sum_{i=1}^k a_i = p$, we have

$$P(A_{n,k}(1) = a_1, .., A_{n,k}(k) = a_k; P_{n,k} = p) = \{n\}_p \frac{k!}{\prod_{i=1}^k i!^{a_i} a_i!} \frac{1}{(n\theta)_k} \prod_{i=1}^k (\theta)_i^{a_i}. \tag{2.18}$$

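Considering the Kingman limit $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$, using $(\theta)_i \sim_{\theta \downarrow 0} \theta (i-1)!$ and $\{n\}_p \sim_{n \uparrow \infty} n^p$, we recover the celebrated Ewens Sampling Formula (1972):

Corollary 5 In the Kingman limit, the probability displayed in (2.18) converges to

$$P(A_k(1) = a_1, .., A_k(k) = a_k; P_k = p) = \frac{k!\, \gamma^p}{(\gamma)_k \prod_{i=1}^k i^{a_i} a_i!}. \tag{2.19}$$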
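As a sanity check on Eq. (2.19), one can enumerate all size-index vectors $(a_1, .., a_k)$ with $\sum_i i a_i = k$ and verify that the Ewens probabilities add up to one. This is a minimal sketch with illustrative function names; the brute-force enumeration is only meant for small $k$.

```python
from fractions import Fraction
from math import factorial

def rising(a, q):
    """Rising factorial (a)_q = a (a+1) .. (a+q-1), with (a)_0 = 1."""
    out = Fraction(1)
    for i in range(q):
        out *= a + i
    return out

def size_index_vectors(k):
    """All (a_1, .., a_k) in N_0^k with sum_i i * a_i = k (integer partitions of k)."""
    def rec(i, remaining):
        if i > k:
            if remaining == 0:
                yield ()
            return
        for ai in range(remaining // i + 1):
            for tail in rec(i + 1, remaining - i * ai):
                yield (ai,) + tail
    return list(rec(1, k))

def ewens_probability(a, gamma):
    """Eq. (2.19): k! gamma^p / ((gamma)_k prod_i i^{a_i} a_i!) with p = sum_i a_i."""
    gamma = Fraction(gamma)
    k = sum(i * ai for i, ai in enumerate(a, start=1))
    p = sum(a)
    denom = rising(gamma, k)
    for i, ai in enumerate(a, start=1):
        denom *= i ** ai * factorial(ai)
    return factorial(k) * gamma ** p / denom

if __name__ == "__main__":
    gamma = Fraction(5, 2)
    vectors = size_index_vectors(5)
    print(len(vectors), "size-index vectors for k = 5")
    print("total mass:", sum(ewens_probability(a, gamma) for a in vectors))
```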

• Ewens grand-canonical sampling formula:

Although Poissonization of sample size in occupancy problems was addressed in Cesaroli (1983), this point has not been discussed in the specific random context of Dirichlet partitioning, to the best of the author's knowledge. As it will prove essential when considering Bose samples, we shall briefly introduce this topic.

The problem here is to randomize the sample size $k$. Let $z > 0$ be some “activity” parameter. Let $K_{n,z}$ be the random sample size and assume it has Poisson distribution with mean $\kappa := z > 0$.

Multiplying the probability displayed in Eq. (2.3) by $\frac{z^k}{k!} e^{-z}$, with $b_m \in \mathbb{N}_0$, $m = 1, .., n$, we get

$$P_{\mathbf{S}_n}(B_{n,z}(m) = b_m; m = 1, .., n) = \prod_{m=1}^n \frac{e^{-zS_m} (zS_m)^{b_m}}{b_m!}$$

where the random occupancies $B$ are now indexed by $z$ instead of $k$. In this formulation, the annoying restriction that $\sum_{m=1}^n b_m = k$ has been lifted, which is the usual trick used in the grand-canonical ensemble of equilibrium statistical mechanics. Given $\mathbf{S}_n$, the grand canonical distribution of $B_{n,z}(m)$; $m = 1, .., n$ turns out to be the product of $n$ independent Poisson random variables with intensities $zS_m$, $m = 1, .., n$. Averaging the last formula over $\mathbf{S}_n$ and making use of its exchangeability, we get

$$P(B_{n,z}(m) = b_m; m = 1, .., n) = \frac{e^{-z}}{(n\theta)_{b_1 + .. + b_n}} \prod_{m=1}^n \frac{z^{b_m} (\theta)_{b_m}}{b_m!}.$$

The unconditional distribution of $B_{n,z}(m) = b_m$; $m = 1, .., n$ is still exchangeable but independence is lost.

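Multiplying now the probability displayed in Eq. (2.7) by $\frac{z^k}{k!} e^{-z}$, with $b_q \in \mathbb{N}$, $q = 1, .., p$, we get

$$P_{\mathbf{S}_n}(M_1, .., M_k \in \{m_1, .., m_p\}; B_{n,z}(m_1) = b_1, .., B_{n,z}(m_p) = b_p; P_{n,z} = p) = e^{-z} \prod_{q=1}^p \frac{\left(zS_{m_q}\right)^{b_q}}{b_q!}.$$

Summing over $b_q \in \mathbb{N}$,

$$P_{\mathbf{S}_n}(M_1, .., M_k \in \{m_1, .., m_p\}; P_{n,z} = p) = e^{-z} \prod_{q=1}^p \left(e^{zS_{m_q}} - 1\right).$$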

Averaging the last formula over $\mathbf{S}_n$ and making use of its exchangeability, we get the unconditional grand-canonical probability for the number $P_{n,z}$ of distinct visited fragments

$$P(P_{n,z} = p) = \binom{n}{p} e^{-z} E\left[\prod_{q=1}^p \left(e^{zS_q} - 1\right)\right].$$

Here $p \in \{0, 1, .., n\}$ with the convention that $P(P_{n,z} = 0) = e^{-z}$ which, as required, is the probability that there is no particle in the system: the event $K_{n,z} = 0$. Clearly, for each $p \in \{1, .., n\}$, one can check that

$$P(P_{n,z} = p) = \sum_{k \geq p} \frac{z^k e^{-z}}{k!} P(P_{n,k} = p)$$

where $P(P_{n,k} = p)$ is the canonical distribution given that the sample size is $k$, displayed above in Eq. (2.9) or Eq. (2.12).

3 Bose samples from Dirichlet populations

We now come to the announced Bose-Einstein version of the sampling process from Dirichlet proportions. In this problem, particles are assumed to be indistinguishable.

3.1 The statistical structure of the Bose model

Let there now be $k$ indistinguishable particles to place at random on the states $\mathbf{S}_n$. Conditionally on $\mathbf{S}_n$ (quenched disorder), let $B_{n,k}(m)$ denote the occupancy of state $m$ with Bose equilibrium collective law given by

$$P_{\mathbf{S}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{1}{Z_{k,\mathbf{S}_n}(\theta)} \prod_{m=1}^n S_m^{b_m}. \tag{3.1}$$

With $[z^k] f(z)$ the coefficient of $z^k$ in the power-series expansion of $f(z)$, the normalizing partition function term reads

$$Z_{k,\mathbf{S}_n}(\theta) = \sum_{b'_1 + .. + b'_n = k}\ \prod_{m=1}^n S_m^{b'_m} = [z^k] \prod_{m=1}^n (1 - zS_m)^{-1}. \tag{3.2}$$

The distribution thus defined favors configurations with minimal (interaction free) “energy”: $H_{n,k}(\mathbf{S}_n) := -\sum_{m=1}^n b_m \log S_m$.
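For a fixed realization of the fragment sizes, the canonical partition function (3.2) can be computed by extracting the coefficient of $z^k$ one factor at a time, which avoids enumerating all occupancy vectors. The sketch below compares this with the brute-force sum; the function names are illustrative, and the sizes are passed as exact fractions purely so that the checks are free of rounding error.

```python
from fractions import Fraction
from itertools import product

def occupancies(n, k):
    """All (b_1, .., b_n) in N_0^n with b_1 + .. + b_n = k (brute force, small cases only)."""
    return [b for b in product(range(k + 1), repeat=n) if sum(b) == k]

def weight(b, sizes):
    """Unnormalized Bose weight prod_m S_m^{b_m}."""
    out = Fraction(1)
    for bm, sm in zip(b, sizes):
        out *= sm ** bm
    return out

def bose_partition(sizes, k):
    """Z_k = [z^k] prod_m 1 / (1 - z S_m), by multiplying in one geometric series at a time."""
    coeff = [Fraction(1)] + [Fraction(0)] * k
    for s in sizes:
        # In-place update in ascending degree multiplies the series by 1 / (1 - s z).
        for d in range(1, k + 1):
            coeff[d] += s * coeff[d - 1]
    return coeff[k]

def bose_partition_brute(sizes, k):
    """Same quantity as the explicit sum over occupancy vectors."""
    return sum(weight(b, sizes) for b in occupancies(len(sizes), k))

def bose_pmf(b, sizes):
    """Canonical Bose probability (3.1) of the occupancy vector b, given the sizes."""
    return weight(b, sizes) / bose_partition(sizes, sum(b))

if __name__ == "__main__":
    sizes = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
    print(bose_partition(sizes, 4), bose_partition_brute(sizes, 4))
    print(bose_pmf((4, 0, 0), sizes), bose_pmf((0, 0, 4), sizes))
```

The series-multiplication route costs on the order of $nk$ operations, whereas the brute-force sum grows like $\binom{n+k-1}{k}$.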

In Eq. (3.1), $b_m \in \mathbb{N}_0$, $m = 1, .., n$, with no restriction but $b_1 + .. + b_n = k$. Imposing the additional condition that $b_m \in \{0, 1\}$, $m = 1, .., n$ (the Pauli exclusion principle), would lead to a Fermi-Dirac occupancy problem which we shall not further develop specifically.

Example: When $\theta \uparrow \infty$, the limiting partition $\mathbf{S}_n$ reduces to $\left(\frac{1}{n}, .., \frac{1}{n}\right)$ which is not random. It follows from the above equation that

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{1}{\binom{n+k-1}{k}},$$

the uniform distribution on the set $\{b_m \in \mathbb{N}_0, m = 1, .., n : \sum_{m=1}^n b_m = k\}$. This distribution is known as the Bose-Einstein distribution (see Feller (1971) and Holst (1985)). Curiously, it coincides with the Maxwell-Boltzmann sampling formula from the random uniform partition $\mathbf{S}_n$ (the Dirichlet partition obtained when $\theta = 1$). ♦

Thanks to the representation of $\mathbf{S}_n$ in terms of ratios of iid gamma distributed random variables $\mathbf{X}_n$, this is also

$$P_{\mathbf{S}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = P_{\mathbf{X}_n}(B_{n,k}(m) = b_m; m = 1, .., n)$$

where

$$P_{\mathbf{X}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = \frac{1}{Z_{k,\mathbf{X}_n}(\theta)} \prod_{m=1}^n X_m^{b_m}, \quad Z_{k,\mathbf{X}_n}(\theta) = \sum_{b'_1 + .. + b'_n = k}\ \prod_{m=1}^n X_m^{b'_m}.$$

Given there are $k$ particles, averaging over disorder $\mathbf{S}_n$ (or $\mathbf{X}_n$), the Bose unconditional occupancy probability now is

$$P(B_{n,k}(m) = b_m; m = 1, .., n) = E\,P_{\mathbf{S}_n}(B_{n,k}(m) = b_m; m = 1, .., n) = E\,P_{\mathbf{X}_n}(B_{n,k}(m) = b_m; m = 1, .., n).$$

As a symmetric function of the $b_m$s, this distribution is exchangeable. In particular, $E(B_{n,k}(m)) = k/n$, $m = 1, .., n$. Even though, by using this ‘ratio trick’, the average to perform can be over the simpler sequence of iid random variables $\mathbf{X}_n$ (rather than over $\mathbf{S}_n$ on the simplex), these canonical occupancy distributions conditioned on sample size being equal to $k$ remain clearly hard to evaluate in practice.

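• One-dimensional distribution:

We here briefly give the occupancy law of any cell. With $b_1 \in \{0, .., k\}$ and $\mathbf{X}_{n \backslash 1} := (X_2, .., X_n)$,

$$P_{\mathbf{X}_n}(B_{n,k}(1) = b_1) = \frac{X_1^{b_1}}{Z_{k,\mathbf{X}_n}(\theta)} \sum_{b'_2 + .. + b'_n = k - b_1}\ \prod_{m=2}^n X_m^{b'_m} = \frac{X_1^{b_1} Z_{k-b_1,\mathbf{X}_{n \backslash 1}}(\theta)}{Z_{k,\mathbf{X}_n}(\theta)}$$

and

$$P(B_{n,k}(1) = b_1) = E\left[\frac{X_1^{b_1} Z_{k-b_1,\mathbf{X}_{n \backslash 1}}(\theta)}{Z_{k,\mathbf{X}_n}(\theta)}\right]$$

is the one-dimensional marginal of $(B_{n,k}(m); m = 1, .., n)$.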

• Random allocation scheme representation of Bose distribution:

We first observe from Eqs. (3.1, 3.2) that, given $\mathbf{S}_n$:

$$(B_{n,k}(m) = b_m; m = 1, .., n) \overset{d}{=} (\xi_1, .., \xi_n \mid \zeta_n = k)$$

where $(\xi_1, .., \xi_n)$ are mutually independent on $\mathbb{N}_0^n$ with sum $\zeta_n := \sum_{m=1}^n \xi_m$ and

$$P_{\mathbf{S}_n}(\xi_m = b_m) = S_m^{b_m} (1 - S_m), \quad b_m \in \mathbb{N}_0,$$

geometric distributions with random success probabilities $S_m \overset{d}{\sim} \text{beta}(\theta, (n-1)\theta)$, for each $m = 1, .., n$. Such a representation of the occupancies is called a random allocation scheme property in Kolchin (1986).

• A concentration phenomenon:

Let us now show that, when the number of fragments is fixed, the proportions of particles tend to concentrate on the ground state (which is the fragment with largest size) when the number of particles increases. This result should be compared with the one displayed in Lemma 1 when sampling uses a Maxwell-Boltzmann procedure.

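Proposition 6 Let $\mathbf{S}_n \overset{d}{\sim} D_n(\theta)$ with $0 < \theta < \infty$. For each $m \in \{1, .., n\}$, let $\mathbf{S}_{n \backslash m} := (S_1, .., S_{m-1}, S_{m+1}, .., S_n)$. With $n$ fixed, as the number of particles grows, conditionally given $\mathbf{S}_n$, we have

$$\left(\frac{B_{n,k}(m)}{k}; m = 1, .., n\right) \underset{k \uparrow \infty}{\to} \left(P_{m,n} := I\left(S_m > \mathbf{S}_{n \backslash m}\right); m = 1, .., n\right) \tag{3.3}$$

in distribution.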

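Proof: Let us first consider the ordered version $\mathbf{S}_{(n)}$ of the energy sequence $\mathbf{S}_n$, namely: $\mathbf{S}_{(n)} := (S_{(1)}, .., S_{(n)})$ with $S_{(1)} > .. > S_{(n)}$. Developing the product partition function

$$\prod_{m=1}^n (1 - zS_m)^{-1} = \prod_{m=1}^n \left(1 - zS_{(m)}\right)^{-1}$$

appearing in Eq. (3.2) into a sum of $n$ rational fractions, extracting its coefficient of $z^k$, we easily get (after obvious identification of the coefficients)

$$Z_{k,\mathbf{S}_n}(\theta) = Z_{k,\mathbf{S}_{(n)}}(\theta) = \sum_{m=1}^n C_{(m)} S_{(m)}^k \quad \text{where } C_{(m)} := \prod_{l \neq m} \left(1 - \frac{S_{(l)}}{S_{(m)}}\right)^{-1}.$$

Isolate the ground state term and factorize $S_{(1)}$. Then

$$Z_{k,\mathbf{S}_{(n)}}(\theta) = S_{(1)}^k \left(C_{(1)} + \sum_{m=2}^n C_{(m)} \widetilde{S}_{(m)}^k\right)$$

where $\widetilde{S}_{(m)} := S_{(m)}/S_{(1)}$, $m = 1, .., n$. With $b_1 + .. + b_n = k$, we want to compute the law of the occupancies $B_{(n),k}(m)$ of $S_{(m)}$, which is

$$P_{\mathbf{S}_{(n)}}\left(B_{(n),k}(m) = b_m; m = 1, .., n\right) = \frac{1}{Z_{k,\mathbf{S}_{(n)}}(\theta)} \prod_{m=1}^n S_{(m)}^{b_m}.$$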

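Since $b_1 = k - (b_2 + .. + b_n)$, using the expression of $Z_{k,\mathbf{S}_{(n)}}(\theta)$, the occupancy distribution of all states but the ground state reads

$$P_{\mathbf{S}_{(n)}}\left(B_{(n),k}(m) = b_m; m = 2, .., n\right) = \frac{\prod_{m=2}^n \widetilde{S}_{(m)}^{b_m}}{C_{(1)} \left(1 + \sum_{m=2}^n \frac{C_{(m)}}{C_{(1)}} \widetilde{S}_{(m)}^k\right)} = \frac{\prod_{m=2}^n \widetilde{S}_{(m)}^{b_m} \left(1 - \widetilde{S}_{(m)}\right)}{1 + \sum_{m=2}^n \frac{C_{(m)}}{C_{(1)}} \widetilde{S}_{(m)}^k},$$

using $C_{(1)} = \prod_{m \neq 1} \left(1 - \widetilde{S}_{(m)}\right)^{-1}$.

Developing the denominator in power series, we finally obtain

$$P_{\mathbf{S}_{(n)}}\left(B_{(n),k}(m) = b_m; m = 2, .., n\right) = (1 - \varepsilon(k)) \prod_{m=2}^n \left\{\widetilde{S}_{(m)}^{b_m} \left(1 - \widetilde{S}_{(m)}\right)\right\}. \tag{3.4}$$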

Since $\widetilde{S}_{(n)} < .. < \widetilde{S}_{(3)} < \widetilde{S}_{(2)} < 1$, the corrective term $\varepsilon(k) := \widetilde{S}_{(2)}^k C_{(2)}/C_{(1)} < 0$ is dominant to the second order. It goes to 0 exponentially fast as $k$ becomes large. When $k$ is large, a good approximation of the occupancies of all ordered states but the ground state therefore is a product of geometrically distributed random variables with normalized success probabilities $\widetilde{S}_{(m)}$.

Suppose $b_m = \lfloor k x_m \rfloor$ for some fixed $x_m \in (0,1]$; $m = 2, .., n$. In this case, the probability displayed in Eq. (3.4) goes to 0 when $k$ goes to $\infty$: in other words, the probabilities of $B_{(n),k}(m)/k$; $m = 2, .., n$ all concentrate at 0 and therefore all the probability mass goes to the ground state ($m = 1$). This is the content of the statement displayed in Eq. (3.3), where by the event $S_m > \mathbf{S}_{n \backslash m}$ it is meant that $S_m$ is larger than all entries constituting the random vector $\mathbf{S}_{n \backslash m}$. Note that almost surely $\sum_{m=1}^n P_{m,n} = 1$ and that for each $m$, $E(P_{m,n}) = 1/n$ and $\sigma^2(P_{m,n}) = (1 - 1/n)/n$.
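The two ingredients of the proof, the partial-fraction form of $Z_k$ and the drift of the occupancy proportions towards the largest fragment, can be checked numerically for a fixed set of sizes. The following is a minimal sketch with illustrative function names; the three sizes used in the demo are arbitrary distinct values chosen for illustration, not values from the paper.

```python
from fractions import Fraction

def bose_partition_row(sizes, k):
    """[Z_0, .., Z_k]: coefficients of prod_m 1 / (1 - z S_m) up to degree k."""
    coeff = [Fraction(1)] + [Fraction(0)] * k
    for s in sizes:
        for d in range(1, k + 1):
            coeff[d] += s * coeff[d - 1]
    return coeff

def partition_by_partial_fractions(sizes, k):
    """Z_k = sum_m C_m S_m^k with C_m = prod_{l != m} (1 - S_l / S_m)^{-1}; sizes must be distinct."""
    total = Fraction(0)
    for m, sm in enumerate(sizes):
        c = Fraction(1)
        for l, sl in enumerate(sizes):
            if l != m:
                c /= 1 - sl / sm
        total += c * sm ** k
    return total

def mean_fraction(sizes, m, k):
    """E(B_{n,k}(m)) / k given the sizes, from P(B(m) = b) = S_m^b Z_{k-b}(others) / Z_k."""
    others = sizes[:m] + sizes[m + 1:]
    z_others = bose_partition_row(others, k)
    z_all = bose_partition_row(sizes, k)[k]
    mean = sum(b * sizes[m] ** b * z_others[k - b] for b in range(k + 1)) / z_all
    return mean / k

if __name__ == "__main__":
    sizes = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]   # arbitrary illustrative sizes
    print(partition_by_partial_fractions(sizes, 5) == bose_partition_row(sizes, 5)[5])
    for k in (5, 50, 200):
        print(k, [float(mean_fraction(sizes, m, k)) for m in range(3)])
```

As $k$ grows, the expected proportion attached to the largest size should approach 1 while the others shrink towards 0, in line with Eq. (3.3).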

•Gibbs randomization of sample size (variable particle number):

As can be guessed from the above, the canonical conditional distributions given that the sample size is $k$ are difficult to evaluate in general (except for $k = 1$). To circumvent this drawback, we shall again assume that the number of particles is variable and so randomize the sample size. In this way, we shall obtain a tractable grand-canonical version of Bose samples from Dirichlet proportions.

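Let $\alpha > 0$ stand for fugacity and let $z = e^{-\alpha} \in (0, z_c := 1)$ be the activity parameter. Assume the number of particles $K_{n,z}$ is now random with law given by the Gibbs model

$$P_{\mathbf{S}_n}(K_{n,z} = k) = \frac{z^k Z_{k,\mathbf{S}_n}(\theta)}{Z_{z,\mathbf{S}_n}(\theta)}$$

where the grand canonical partition function now reads

$$Z_{z,\mathbf{S}_n}(\theta) = \sum_{k \geq 0} z^k Z_{k,\mathbf{S}_n}(\theta) = \prod_{m=1}^n \frac{1}{1 - zS_m}.$$

Alternatively, with $u \in [0,1]$, we clearly have

$$E_{\mathbf{S}_n}\left(u^{K_{n,z}}\right) = \frac{Z_{uz,\mathbf{S}_n}(\theta)}{Z_{z,\mathbf{S}_n}(\theta)} = \prod_{m=1}^n \frac{1 - zS_m}{1 - uzS_m},$$

$$E\left(u^{K_{n,z}}\right) = E\left(\prod_{m=1}^n \frac{1 - zS_m}{1 - uzS_m}\right).$$

In this form, this shows that, given $\mathbf{S}_n$, $K_{n,z}$ is the sum of $n$ independent geometric random variables with respective success probabilities $zS_m$, $m = 1, .., n$.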

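To each value $z \in (0,1)$ there is a unique corresponding value of $\kappa := E(K_{n,z})$ through:

$$E_{\mathbf{S}_n}(K_{n,z}) = -\partial_\alpha \log Z_{e^{-\alpha},\mathbf{S}_n}(\theta) = \sum_{m=1}^n \frac{zS_m}{1 - zS_m},$$

$$E(K_{n,z}) =: \kappa = E\,E_{\mathbf{S}_n}(K_{n,z}) = n E\left(\frac{zS_n}{1 - zS_n}\right).$$

As will be checked below, $\kappa$ is an increasing function of $z \in (0,1)$, possibly diverging when $z \uparrow z_c$, depending on the range of the variables $\theta$ and $n$ parameterizing the Dirichlet model (see below, where the condition $(n-1)\theta \leq 1$ versus $(n-1)\theta > 1$ appears, separating two phases depending on whether $\kappa \uparrow \infty$ or not when $z \uparrow 1$).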

Indexing now cell occupancies by $z$ rather than $k$, with $b_m \in \mathbb{N}_0$; $m = 1, .., n$, the joint occupancies probability takes the product form

$$P_{\mathbf{S}_n}(B_{n,z}(m) = b_m; m = 1, .., n) = \prod_{m=1}^n \left\{(zS_m)^{b_m} (1 - zS_m)\right\}$$

where each $B_{n,z}(m)$ now follows a geometric distribution with success probability $zS_m$.

In other words, given $\mathbf{S}_n$, the grand canonical distribution of $B_{n,z}(m)$; $m = 1, .., n$ now turns out to be the product of $n$ independent geometric random variables with success probabilities $zS_m$, $m = 1, .., n$.

• The Bose sample grand-canonical distribution:

First we note that $P_{\mathbf{S}_n}(B_{n,z}(m) \geq b_m; m = 1, .., n) = \prod_{m=1}^n (zS_m)^{b_m}$. Averaging over $\mathbf{S}_n$, with $k := \sum_{m=1}^n b_m$,

$$P(B_{n,z}(m) \geq b_m; m = 1, .., n) = \frac{1}{(n\theta)_k} \prod_{m=1}^n z^{b_m} (\theta)_{b_m}.$$

In particular,

$$P(B_{n,z}(m) \geq 1; m = 1, .., n) = \frac{(\theta z)^n}{(n\theta)_n}$$

is the probability that all fragments have been visited at least once in a Bose sample (the coupon collector problem, see Feller (1971)). In other related applications (reminiscent of the Banach match box problem), with $c \in \mathbb{N}$ some cell capacity parameter, one may find it useful to estimate the probability $P(B_{n,z}(m) \leq c; m = 1, .., n)$ of having at most $c$ particles in all cells. To compute this quantity, we first need to estimate $P(B_{n,z}(m) = b_m; m = 1, .., n)$ and then sum over each $b_m \in \{0, .., c\}$. Averaging over $\mathbf{S}_n$, this unconditional occupancy probability is

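Theorem 7 With $k := \sum_{m=1}^{n} b_m$,

$$\mathbf{P}\left(B_{n,z}(m) = b_m;\ m = 1, .., n\right) := \mathbf{E}\,\mathbf{P}_{\mathbf{S}_n}\left(B_{n,z}(m) = b_m;\ m = 1, .., n\right)$$

$$= \frac{z^k}{(n\theta)_k} \prod_{m=1}^{n} (\theta)_{b_m} \sum_{q=0}^{n} \frac{(-z)^q}{(n\theta + k)_q} \sum_{1 \le m_1 < .. < m_q \le n}\ \prod_{r=1}^{q} (\theta + b_{m_r}).$$

When all the $b_m$ are equal to a common value $b$, the inner sum reduces to $\binom{n}{q} (\theta + b)^q$.

Proof: Expanding the product $\prod_{m=1}^{n} (1 - zS_m)$ and using the Dirichlet moments of $\mathbf{S}_n$,

$$\mathbf{P}\left(B_{n,z}(m) = b_m;\ m = 1, .., n\right) = \mathbf{E}\left[ \prod_{m=1}^{n} \left\{ (zS_m)^{b_m} (1 - zS_m) \right\} \right]$$

$$= \sum_{q=0}^{n} (-1)^q \sum_{1 \le m_1 < .. < m_q \le n} \mathbf{E}\left[ \prod_{r=1}^{q} (zS_{m_r})^{b_{m_r}+1} \prod_{m \notin \{m_1, .., m_q\}} (zS_m)^{b_m} \right]$$

$$= z^k \sum_{q=0}^{n} \frac{(-z)^q}{(n\theta)_{k+q}} \sum_{1 \le m_1 < .. < m_q \le n}\ \prod_{r=1}^{q} (\theta)_{b_{m_r}+1} \prod_{m \notin \{m_1, .., m_q\}} (\theta)_{b_m}$$

$$= \frac{z^k}{(n\theta)_k} \prod_{m=1}^{n} (\theta)_{b_m} \sum_{q=0}^{n} \frac{(-z)^q}{(n\theta + k)_q} \sum_{1 \le m_1 < .. < m_q \le n}\ \prod_{r=1}^{q} (\theta + b_{m_r}).$$

We used $(\theta)_{k+q} = (\theta)_k (\theta + k)_q$ and $(\theta)_0 = 1$.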


It is exchangeable but not of product form. From this, we would get the law of $K_{n,z}$ itself

$$\mathbf{P}(K_{n,z} = k) = \sum_{b_1 + .. + b_n = k} \mathbf{P}\left(B_{n,z}(m) = b_m;\ m = 1, .., n\right).$$
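The joint occupancy law and the resulting law of $K_{n,z}$ can be evaluated numerically for small $n$. The sketch below is an illustration rather than the author's procedure: it computes $\mathbf{E}\prod_m (zS_m)^{b_m}(1 - zS_m)$ by expanding $\prod_m (1 - zS_m)$ over subsets of cells and applying the Dirichlet moment formula, then sums over all compositions of $k$. The function names are hypothetical, and the brute-force enumeration is only practical for small $n$ and $k$.

```python
from itertools import combinations
from math import prod

def rising(x, k):
    """Rising factorial (x)_k = x (x+1) .. (x+k-1), with (x)_0 = 1."""
    return prod(x + i for i in range(k))

def joint_pmf(b, theta, z):
    """P(B_{n,z}(m) = b_m; m = 1..n), obtained by expanding prod(1 - z S_m)
    over subsets of cells and applying the Dirichlet moment formula."""
    n, k = len(b), sum(b)
    total = 0.0
    for q in range(n + 1):
        # sum over subsets {m_1 < .. < m_q} of prod_r (theta + b_{m_r})
        inner = sum(prod(theta + b[m] for m in subset)
                    for subset in combinations(range(n), q))
        total += (-z) ** q * inner / rising(n * theta + k, q)
    return z ** k * prod(rising(theta, bm) for bm in b) / rising(n * theta, k) * total

def compositions(k, n):
    """All (b_1, .., b_n) of non-negative integers summing to k."""
    if n == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, n - 1):
            yield (first,) + rest

def law_of_K(k, n, theta, z):
    """P(K_{n,z} = k) as the sum of the joint pmf over b_1 + .. + b_n = k."""
    return sum(joint_pmf(b, theta, z) for b in compositions(k, n))

# The joint law is symmetric in (b_1, .., b_n) but does not factorize.
print(joint_pmf((2, 0, 1), 0.7, 0.5), joint_pmf((1, 2, 0), 0.7, 0.5))
print([law_of_K(k, 3, 0.7, 0.5) for k in range(5)])
```

For $n = 1$ the proportion is $S_1 = 1$ and the routine reduces to the plain geometric law $z^{b}(1 - z)$, which gives a quick sanity check.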

This result allows one to extract some information of interest on the grand-canonical equilibrium law of individual cell occupancy. Indeed, with $b_1 \in \mathbb{N}_0$

$$\mathbf{P}\left(B_{n,z}(1) = b_1\right) = \mathbf{E}\left[ (zS_1)^{b_1} (1 - zS_1) \right] = z^{b_1} \left[ \frac{(\theta)_{b_1}}{(n\theta)_{b_1}} - z \frac{(\theta)_{b_1+1}}{(n\theta)_{b_1+1}} \right]$$

is the one-dimensional law of $B_{n,z}(1)$. Therefore,

Corollary 8 (i) The probability that a given cell contains at least $b_1$ particles is

$$\mathbf{P}\left(B_{n,z}(1) \ge b_1\right) = z^{b_1} \frac{(\theta)_{b_1}}{(n\theta)_{b_1}}, \quad b_1 \in \mathbb{N}_0.$$

(ii) When $z < z_c := 1$, this one-dimensional probability decays exponentially.

(iii) At the critical point $z = z_c$, $B_{n,z}(1)$ has power law tails with exponent $(n-1)\theta > 0$: therefore, $\mathbf{E}(B_{n,z}(1)) = \infty$ if and only if $(n-1)\theta \in (0,1]$ and $\sigma^2(B_{n,z}(1)) < \infty$ if and only if $(n-1)\theta > 2$.

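Proof: (iii) We have $\frac{(\theta)_{b_1}}{(n\theta)_{b_1}} = \frac{\Gamma(n\theta)}{\Gamma(\theta)} \frac{(b_1)_{\theta}}{(b_1)_{n\theta}}$ and when $b_1$ is large, using Stirling formula, $(b_1)_{\theta} \sim b_1^{\theta}$. This shows that $\frac{(\theta)_{b_1}}{(n\theta)_{b_1}} \sim \frac{\Gamma(n\theta)}{\Gamma(\theta)} b_1^{-(n-1)\theta}$ and $B_{n,z}(1)$ has power law tails with exponent $(n-1)\theta > 0$.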
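The tail formula of Corollary 8 and its power-law behaviour at the critical point are easy to probe numerically. The following sketch, with hypothetical function names and arbitrarily chosen parameter values, evaluates $\mathbf{P}(B_{n,z}(1) \ge b)$ through log-gamma functions (to avoid overflow for large $b$) and prints the ratio of the exact tail at $z = 1$ to its asymptotic form $\frac{\Gamma(n\theta)}{\Gamma(\theta)} b^{-(n-1)\theta}$; this ratio should approach one as $b$ grows.

```python
from math import exp, lgamma, log

def tail_prob(b, n, theta, z):
    """P(B_{n,z}(1) >= b) = z**b * (theta)_b / (n*theta)_b, computed with log-gammas."""
    log_ratio = (lgamma(theta + b) - lgamma(theta)
                 - lgamma(n * theta + b) + lgamma(n * theta))
    return exp(b * log(z) + log_ratio)

def pmf(b, n, theta, z):
    """P(B_{n,z}(1) = b) as the difference of two consecutive tail probabilities."""
    return tail_prob(b, n, theta, z) - tail_prob(b + 1, n, theta, z)

n, theta = 4, 0.3                    # illustrative values, so that (n-1)*theta = 0.9
# At the critical point z = 1 the tail should behave like C * b**(-(n-1)*theta),
# with C = Gamma(n*theta) / Gamma(theta).
C = exp(lgamma(n * theta) - lgamma(theta))
for b in (10, 1000, 100000):
    print(b, tail_prob(b, n, theta, 1.0) * b ** ((n - 1) * theta) / C)
```

With these values $(n-1)\theta \le 1$, so the critical mean occupancy of a cell is infinite, in line with point (iii).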

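• The number of distinct visited fragments in a Bose sample:

Let $P_{n,z} := \sum_{m=1}^{n} I\left(B_{n,z}(m) > 0\right)$ be the number of distinct visited fragments in a Bose sample of the grand canonical ensemble. With $p \le n$, assume that $P_{n,z} = p$. Let $m_1 < .. < m_p$ be a realization of the labels $M_q$; $q = 1, .., p$ of these $p$ visited fragments. With $b_q \in \mathbb{N}$, $q = 1, .., p$, we have

$$\mathbf{P}_{\mathbf{S}_n}\left(M_q = m_q;\ B_{n,z}(m_q) = b_q;\ q = 1, .., p;\ P_{n,z} = p\right) = \prod_{q=1}^{p} (zS_{m_q})^{b_q} \prod_{m=1}^{n} (1 - zS_m).$$

Summing over $b_q \in \mathbb{N}$, $q = 1, .., p$

$$\mathbf{P}_{\mathbf{S}_n}\left(M_q = m_q;\ q = 1, .., p;\ P_{n,z} = p\right) = \prod_{q=1}^{p} zS_{m_q} \prod_{m \in [n] \setminus \{m_1, .., m_p\}} (1 - zS_m)$$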


showing that, given $\mathbf{S}_n$, $P_{n,z}$ is the sum of $n$ independent Bernoulli distributed random variables with success probabilities $zS_m$, $m = 1, .., n$. Averaging over $\mathbf{S}_n$ and using exchangeability of $\mathbf{S}_n$, with $p \in \{0, .., n\}$

$$\mathbf{P}(P_{n,z} = p) = \binom{n}{p} \mathbf{E}\left[ \prod_{q=1}^{p} (zS_q) \prod_{q=p+1}^{n} (1 - zS_q) \right].$$

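Differentiating this expression with respect to $z$, we obtain

$$z \partial_z \mathbf{P}(P_{n,z} = p) = p\,\mathbf{P}(P_{n,z} = p) - (p+1)\,\mathbf{P}(P_{n,z} = p+1)$$

so that if $\Phi_{n,z}(u) := \mathbf{E}\left(u^{P_{n,z}}\right)$ is the generating function of $P_{n,z}$, it satisfies $z \partial_z \Phi_{n,z}(u) = -(1-u) \partial_u \Phi_{n,z}(u)$. In particular, $\partial_u \Phi_{n,z}(1) =: \mathbf{E}(P_{n,z})$ satisfies $z \partial_z \mathbf{E}(P_{n,z}) = \mathbf{E}(P_{n,z})$, suggesting $\mathbf{E}(P_{n,z}) \propto z$. In fact, as shown below

$$\mathbf{E}(P_{n,z}) = z \in (0,1),$$

independently of $n$: the Bose grand-canonical expected number of visited fragments is at most one.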
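The value of the mean can also be read off directly from the conditional Bernoulli structure. The short computation below is added as a cross-check and uses only the facts that, given $\mathbf{S}_n$, fragment $m$ is visited with probability $zS_m$, and that the proportions sum to one.

```latex
\mathbf{E}(P_{n,z})
  = \sum_{m=1}^{n} \mathbf{E}\,\mathbf{P}_{\mathbf{S}_n}\bigl(B_{n,z}(m) > 0\bigr)
  = \sum_{m=1}^{n} \mathbf{E}(zS_m)
  = z\,\mathbf{E}\Bigl(\sum_{m=1}^{n} S_m\Bigr)
  = z .
```

In particular, the proportionality constant in $\mathbf{E}(P_{n,z}) \propto z$ is one, whatever the values of $n$ and $\theta$.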

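Developing the second product in the expression of $\mathbf{P}(P_{n,z} = p)$ and making use of Eq. (2.2), we get the alternating sum representation

$$\mathbf{P}(P_{n,z} = p) = (z\theta)^p \binom{n}{p} \sum_{q=0}^{n-p} \binom{n-p}{q} \frac{(-\theta z)^q}{(n\theta)_{p+q}} = \binom{n}{p} \sum_{r=p}^{n} (-1)^{r-p} \binom{n-p}{r-p} \frac{(\theta z)^r}{(n\theta)_r}.$$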

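Multiplying by $p$, summing over $p \in \{0, .., n\}$ and reversing the summation order, this gives

$$\mathbf{E}(P_{n,z}) = n! \sum_{p=1}^{n} \frac{1}{(p-1)!} \sum_{r=p}^{n} \frac{(-1)^{r-p}}{(r-p)!\,(n-r)!} \frac{(\theta z)^r}{(n\theta)_r}$$

$$= n! \sum_{r=1}^{n} \frac{1}{(n-r)!} \frac{(\theta z)^r}{(n\theta)_r} \frac{1}{(r-1)!} \sum_{q=0}^{r-1} (-1)^q \binom{r-1}{q} = z \in (0,1),$$

since the inner alternating sum vanishes unless $r = 1$. More generally, proceeding similarly

$$\mathbf{E}\left(u^{P_{n,z}}\right) = \sum_{r=0}^{n} \binom{n}{r} \frac{(-\theta z (1-u))^r}{(n\theta)_r}.$$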
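The alternating-sum representation and the generating function can be checked against each other in exact rational arithmetic. The sketch below is an illustration with hypothetical function names and arbitrarily chosen rational parameters; it uses Python's `fractions.Fraction` so that no rounding enters the alternating sums.

```python
from fractions import Fraction
from math import comb

def rising(x, k):
    """Rising factorial (x)_k, kept exact when x is a Fraction."""
    out = Fraction(1)
    for i in range(k):
        out *= x + i
    return out

def pmf_P(p, n, theta, z):
    """P(P_{n,z} = p) from the alternating-sum representation."""
    return comb(n, p) * sum(
        (-1) ** (r - p) * comb(n - p, r - p) * (theta * z) ** r / rising(n * theta, r)
        for r in range(p, n + 1))

def pgf_P(u, n, theta, z):
    """E(u**P_{n,z}) from the closed-form generating function."""
    return sum(comb(n, r) * (-theta * z * (1 - u)) ** r / rising(n * theta, r)
               for r in range(n + 1))

n, theta, z = 6, Fraction(1, 3), Fraction(3, 4)   # illustrative values
law = [pmf_P(p, n, theta, z) for p in range(n + 1)]
mean = sum(p * w for p, w in enumerate(law))
print("total mass:", sum(law), "mean:", mean)
u = Fraction(1, 2)
print(sum(u ** p * w for p, w in enumerate(law)) == pgf_P(u, n, theta, z))
```

Working with fractions makes the identities $\sum_p \mathbf{P}(P_{n,z} = p) = 1$ and $\mathbf{E}(P_{n,z}) = z$ hold exactly rather than up to floating-point error.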

From this, the variance of $P_{n,z}$ is $\sigma^2(P_{n,z}) = z - z^2 (\theta + 1)/(n\theta + 1) > 0$ and more generally,

$$\mathbf{E}\left(\{P_{n,z}\}_r\right) = \{n\}_r \frac{(\theta z)^r}{(n\theta)_r}$$

are the falling factorial moments of $P_{n,z}$. To summarize, we obtained


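Theorem 9 The law of $P_{n,z}$ is characterized by any of the three equivalent properties

(i) For $p \in \{0, .., n\}$, with $z \in (0,1)$

$$\mathbf{P}(P_{n,z} = p) = \binom{n}{p} \sum_{r=p}^{n} (-1)^{r-p} \binom{n-p}{r-p} \frac{(\theta z)^r}{(n\theta)_r}.$$

(ii) With $u \in [0,1]$, $P_{n,z}$ has the generating function

$$\mathbf{E}\left(u^{P_{n,z}}\right) = \sum_{r=0}^{n} \binom{n}{r} \frac{(-\theta z (1-u))^r}{(n\theta)_r}.$$

(iii) With $r \in \{0, .., n\}$, the falling factorial moments of $P_{n,z}$ are

$$\mathbf{E}\left(\{P_{n,z}\}_r\right) = \{n\}_r \frac{(\theta z)^r}{(n\theta)_r}.$$

In particular: $\mathbf{E}(P_{n,z}) = z$ and $\sigma^2(P_{n,z}) = z - z^2 (\theta + 1)/(n\theta + 1)$.

In the ∗-Kingman limit, we clearly obtain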

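Corollary 10 We have $P_{n,z} \overset{d}{\to} P_z$. Specifically,

(i) With $p \in \{0, .., n\}$

$$\mathbf{P}(P_{n,z} = p) \to \mathbf{P}(P_z = p) = \sum_{r \ge p} \frac{(-1)^{r-p}}{r!} \binom{r}{p} \frac{(\gamma z)^r}{(\gamma)_r}$$

where $p \in \mathbb{N}_0$.

(ii)

$$\mathbf{E}\left(u^{P_{n,z}}\right) \to \mathbf{E}\left(u^{P_z}\right) = \sum_{r \ge 0} \frac{(-\gamma z (1-u))^r}{r! \cdot (\gamma)_r}.$$

(iii) From this, $\mathbf{E}(P_z) = z$ and $\sigma^2(P_z) = z - z^2/(\gamma + 1) > 0$ and more generally, with $r \in \mathbb{N}_0$

$$\mathbf{E}\left(\{P_z\}_r\right) = \frac{(\gamma z)^r}{(\gamma)_r}$$

are the $r$-falling factorial moments of $P_z$.

(iv) When $\gamma \uparrow \infty$

$$P_z \overset{d}{\to} \mathrm{Poisson}(z).$$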
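The limiting law of point (i) and the Poisson approximation of point (iv) can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, the infinite series is truncated after a fixed number of summands (an assumption of this sketch, harmless here because the terms decay factorially), and the weights $(\gamma z)^r / (r!\,(\gamma)_r)$ are built recursively to avoid overflow.

```python
from math import comb, exp, factorial

def pmf_limit(p, gamma, z, terms=60):
    """P(P_z = p) = sum_{r >= p} (-1)**(r-p) C(r,p) (gamma z)**r / (r! (gamma)_r),
    truncated after `terms` summands."""
    total, w = 0.0, 1.0
    for r in range(p + terms):
        if r > 0:
            # update w to (gamma z)**r / (r! (gamma)_r)
            w *= gamma * z / (r * (gamma + r - 1))
        if r >= p:
            total += (-1) ** (r - p) * comb(r, p) * w
    return total

def poisson_pmf(p, z):
    """Poisson(z) probability mass at p."""
    return exp(-z) * z ** p / factorial(p)

z = 0.8
for gamma in (2.0, 20.0, 2000.0):
    gap = max(abs(pmf_limit(p, gamma, z) - poisson_pmf(p, z)) for p in range(8))
    print(gamma, gap)   # the gap to the Poisson law should shrink as gamma grows
```

For moderate $\gamma$ the law of $P_z$ is underdispersed relative to the Poisson law, since its variance $z - z^2/(\gamma + 1)$ is smaller than its mean $z$.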

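Proof: Points (i) to (iii) can easily be derived from the latter Theorem.

Point (iv) is obtained by passing to the limit $\gamma \uparrow \infty$ in $\mathbf{E}\left(u^{P_z}\right)$: recalling $(\gamma)_r \sim \gamma^r$, we get $\mathbf{E}\left(u^{P_z}\right) \underset{\gamma \uparrow \infty}{\to} \exp\{-z(1-u)\}$, the generating function of the Poisson($z$) distribution.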
