
The Cube Method

In the document Springer Series in Statistics (Page 166-173)

is (0,0,1,1,0,1,1,0) and has a cost equal to 0.00003, is not in this sampling design.

Remark 13. Example 33 shows that the sampling design that minimizes the average cost does not necessarily allocate a nonnull probability to the most balanced sample.

Table 3.1, page 32, shows the limits of the enumerative methods. Even if we limit ourselves to samples of fixed size, the enumerative methods cannot deal with populations larger than 30 units. In order to select a balanced sample, the cube method provides the shortcut needed to avoid the enumeration of samples.

8.6 The Cube Method

The cube method is composed of two phases, called the flight phase and the landing phase. In the flight phase, the constraints are always exactly satisfied.

The objective is to randomly round off almost all of the inclusion probabilities to 0 or 1, that is, to randomly select a vertex of K = Q ∩ C. The landing phase consists of managing as well as possible the fact that the balancing equations (8.1) cannot always be exactly satisfied.

8.6.1 The Flight Phase

The aim of the flight phase is to randomly choose a vertex of K = [0,1]^N ∩ Q, where Q = π + Ker A and A = (x̌_1 ··· x̌_k ··· x̌_N), in such a way that the inclusion probabilities π_k, k ∈ U, and the balancing equations (8.1) are exactly satisfied. Note that, by Result 34, a vertex of K has at most p noninteger values.

The landing phase is necessary only if the vertex of K attained is not a vertex of C; it consists of relaxing the constraints (8.1) as little as possible in order to select a sample, that is, a vertex of C.

The general algorithm for completing the flight phase is to use a balancing martingale.

Definition 58. A discrete-time stochastic process π(t) = [π_k(t)], t = 0, 1, ..., in R^N is said to be a balancing martingale for a vector of inclusion probabilities π and the auxiliary variables x_1, ..., x_p if

(i) π(0) = π,
(ii) E[π(t) | π(t−1), ..., π(0)] = π(t−1), t = 1, 2, ...,
(iii) π(t) ∈ K = [0,1]^N ∩ (π + Ker A),

where A is the p×N matrix given by A = (x̌_1 ··· x̌_k ··· x̌_N).

A balancing martingale therefore satisfies the property that π(t−1) is the mean of the possible values of π(t).

Result 37. If π(t) is a balancing martingale, then we have the following:

(i) E[π(t)] = E[π(t−1)] = ··· = E[π(0)] = π;
(ii) ∑_{k∈U} x̌_k π_k(t) = ∑_{k∈U} x̌_k π_k = X, t = 0, 1, 2, ...;
(iii) when the balancing martingale reaches a face of C, it does not leave it.

Proof. Part (i) is obvious. Part (ii) results from the fact that π(t) ∈ K. To prove (iii), note that π(t−1) belongs to a face. It is the mean of the possible values of π(t), which therefore must also belong to this face. □

Part (iii) of Result 37 directly implies that (i) if π_k(t) = 0, then π_k(t+h) = 0, h ≥ 0; (ii) if π_k(t) = 1, then π_k(t+h) = 1, h ≥ 0; and (iii) the vertices of K are absorbing states.

The practical problem is to find a method that rapidly reaches a vertex.

Algorithm 8.3 allows us to attain a vertex of K in at most N steps.

Algorithm 8.3 General balanced procedure: Flight phase

Initialize π(0) = π.
For t = 0, ..., T, and until it is no longer possible to carry out Step 1, do

1. Generate any vector u(t) = [u_k(t)] ≠ 0, random or not, such that u(t) is in the kernel of matrix A, and u_k(t) = 0 if π_k(t) is an integer.
2. Compute λ_1(t) and λ_2(t), the largest values such that 0 ≤ π(t) + λ_1(t)u(t) ≤ 1 and 0 ≤ π(t) − λ_2(t)u(t) ≤ 1. Note that λ_1(t) > 0 and λ_2(t) > 0.
3. Select

   π(t+1) = π(t) + λ_1(t)u(t) with probability q_1(t),
            π(t) − λ_2(t)u(t) with probability q_2(t),        (8.7)

   where q_1(t) = λ_2(t)/{λ_1(t) + λ_2(t)} and q_2(t) = λ_1(t)/{λ_1(t) + λ_2(t)}.

EndFor.

Figure 8.4 shows the geometric representation of the first step of a balancing martingale in the case of N = 3. The only constraint is the fixed sample size. Now, Algorithm 8.3 defines a balancing martingale. Clearly, π(0) = π. Also, from Expression (8.7), we obtain

E[π(t) | π(t−1), ..., π(0)] = π(t−1), t = 1, 2, ...,

because

E[π(t) | π(t−1), u(t)] = π(t−1), t = 1, 2, ....


Fig. 8.4. Example of the first step of a balancing martingale in S_2 and a population of size N = 3: from π(0), the process moves to π(0) + λ_1(0)u(0) or to π(0) − λ_2(0)u(0) within the cube with vertices (000), (100), (010), (001), (110), (101), (011), (111).

Finally, because u(t) is in the kernel of A, from (8.7) we obtain that π(t) always remains in K = [0,1]^N ∩ (π + Ker A).

At each step, at least one component of the process is rounded to 0 or 1.

Thus, π(1) is on a face of the N-cube, that is, on a cube of dimension at most N−1; π(2) is on a cube of dimension at most N−2; and so on. Let T be the time at which the flight phase stops. The fact that Step 1 is no longer possible shows that the balancing martingale has attained a vertex of K and thus, by Result 34, page 153, that card{k : 0 < π_k(T) < 1} ≤ p.
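The flight phase described above can be sketched in NumPy. This is a minimal illustration under our own conventions, not the book's reference code: the function name `flight_phase`, the SVD-based kernel computation, and the tolerance `eps` are assumptions; `X` holds the auxiliary variable vectors x_k as rows, and A is built with columns x̌_k = x_k/π_k as in the text.

```python
import numpy as np

def null_space(M, tol=1e-10):
    """Orthonormal basis of Ker M via SVD (columns of the returned array)."""
    _, s, vh = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return vh[rank:].T

def flight_phase(pi, X, rng=None, eps=1e-9):
    """Flight phase sketch (Algorithm 8.3): random walk toward a vertex of
    K = [0,1]^N ∩ (pi + Ker A).

    pi : inclusion probabilities (length N), all in (0, 1).
    X  : N x p array of auxiliary variables; A has columns x̌_k = x_k / pi_k.
    Returns pi(T) with at most p noninteger components; A pi(T) = A pi
    up to floating-point error, and E[pi(T)] = pi over the randomness.
    """
    rng = np.random.default_rng(rng)
    pi = np.asarray(pi, dtype=float).copy()
    A = (np.asarray(X, dtype=float) / pi[:, None]).T      # p x N

    while True:
        free = (pi > eps) & (pi < 1 - eps)                # noninteger coordinates
        if not free.any():
            break                                         # vertex of the cube reached
        ns = null_space(A[:, free])
        if ns.shape[1] == 0:                              # Step 1 no longer possible
            break
        u = np.zeros_like(pi)
        u[free] = ns[:, 0]                                # u in Ker A, u_k = 0 off 'free'
        pos, neg = u > eps, u < -eps
        # largest lambda_1, lambda_2 keeping 0 <= pi ± lambda u <= 1
        lam1 = min(np.min((1 - pi[pos]) / u[pos], initial=np.inf),
                   np.min(-pi[neg] / u[neg], initial=np.inf))
        lam2 = min(np.min(pi[pos] / u[pos], initial=np.inf),
                   np.min((pi[neg] - 1) / u[neg], initial=np.inf))
        # choose the branch so that E[pi(t+1) | pi(t), u] = pi(t), as in (8.7)
        if rng.random() < lam2 / (lam1 + lam2):
            pi += lam1 * u
        else:
            pi -= lam2 * u
        np.clip(pi, 0.0, 1.0, out=pi)                     # remove rounding noise
    return pi
```

For instance, with N = 8, π_k = 1/2 and the single balancing variable x_k = π_k (fixed sample size n = 4), every component is rounded because the sum of the inclusion probabilities is integer and p = 1.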

8.6.2 Fast Implementation of the Flight Phase

Chauvet and Tillé (2006, 2005a,b) proposed an implementation of the flight phase that provides a very fast algorithm. In Algorithm 8.3, the search for a vector u in Ker A can be expensive. The basic idea is to use a submatrix B containing only p+1 columns of A. Note that the number of variables p is smaller than the population size N and that rank B ≤ p. The dimension of the kernel of B is thus larger than or equal to 1.

A vector v of Ker B can then be used to construct a vector u of Ker A by complementing v with zeros for the columns of A that are not in B. With this idea, all the computations can be done on B alone. This method is described in Algorithm 8.4.

If T* is the last step of the algorithm and π* = π(T*), then we have:

1. E(π*) = π;
2. Aπ* = Aπ;
3. if q* = card{k | 0 < π*_k < 1}, then q* ≤ p, where p is the number of auxiliary variables.

Algorithm 8.4 Fast algorithm for the flight phase

1. Initialization
   a) The units with inclusion probabilities equal to 0 or 1 are removed from the population before applying the algorithm, in such a way that all the remaining units satisfy 0 < π_k < 1.
   b) The inclusion probabilities are loaded into vector π.
   c) The vector ψ is made up of the first p+1 elements of π.
   d) A vector of ranks is created: r = (1 2 ··· p p+1).
   e) Matrix B is made up of the first p+1 columns of A.
   f) Initialize k = p+2.
2. Basic loop
   a) A vector u is taken in the kernel of B.
   b) Only ψ is modified (and not the vector π) according to the basic technique: compute λ_1 and λ_2, the largest values such that 0 ≤ ψ + λ_1 u ≤ 1 and 0 ≤ ψ − λ_2 u ≤ 1. Note that λ_1 > 0 and λ_2 > 0.
   c) Select

      ψ = ψ + λ_1 u with probability q,
          ψ − λ_2 u with probability 1 − q,

      where q = λ_2/(λ_1 + λ_2).
   d) (The units that correspond to integer values ψ(i) are removed from B and are replaced by new units. The algorithm stops at the end of the file.)
      For i = 1, ..., p+1, do
        If ψ(i) = 0 or ψ(i) = 1 then
          If k ≤ N then
            π(r(i)) = ψ(i); r(i) = k; ψ(i) = π(k);
            For j = 1, ..., p, do B(i, j) = A(k, j); EndFor;
            k = k + 1;
          Else Goto Step 3(a);
          EndIf;
        EndIf;
      EndFor.
   e) Goto Step 2(a).
3. End of the first part of the flight phase
   a) For i = 1, ..., p+1, do π(r(i)) = ψ(i); EndFor.

In the case where some of the constraints can still be satisfied exactly, the flight phase can be continued. Suppose that C is the matrix containing the columns of A that correspond to noninteger values of π*, and φ is the vector of noninteger values of π*. If C is not of full rank, one or several steps of the general Algorithm 8.3 can still be applied to C and φ. A return to the general Algorithm 8.3 is thus necessary for the last steps.

The implementation of the fast algorithm is quite simple. Matrix A never has to be completely loaded in memory and can thus remain in a file that is read sequentially. For this reason, there is no restriction on the population size, because the execution time depends linearly on the population size. The search for a vector u in the submatrix B limits the choice of the direction u. In most cases, only one direction is possible. In order to increase the randomness of the sampling design, the units can be randomly mixed before applying Algorithm 8.4.

Another option consists of sorting the units in decreasing order of size. Indeed, from experience with the general Algorithm 8.3, the rounding problem often concerns units of large size, that is, units with large inclusion probabilities or large values of x̌_k. With the fast Algorithm 8.4, the rounding problem often concerns the units at the end of the file. If the units are sorted in decreasing order of size, the fast algorithm first tries to balance the big units, and the rounding problem instead concerns small units. Analogously, to reach an exact fixed weight of potatoes, it is more efficient to first put the large potatoes on the scale and to finish with the smallest ones.

This popular idea can also be used to balance a sample, even if the problem is more complex because it is multivariate.

The idea of considering only a subset of the units already underlay the moving stratification procedure (see Tillé, 1996b), which provides a smoothed stratification effect. When p = 1 and the only auxiliary variable is x_k = π_k, the problem of balanced sampling amounts to sampling with unequal probabilities and fixed sample size. In this case, A = (1 ··· 1). At each step, matrix B = (1 1) and u = (1, −1). Algorithm 8.4 can therefore be simplified dramatically and is identical to the pivotal method (see Algorithm 6.6, page 107).
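The p = 1 special case can be sketched directly: since Ker B is spanned by u = (1, −1) when B = (1 1), each step moves a pair of probabilities in opposite directions until one of them is rounded. A minimal sketch (the function name and interface are ours, not the book's):

```python
import numpy as np

def pivotal_step(pi_i, pi_j, rng):
    """One step of the pivotal method: move (pi_i, pi_j) along u = (1, -1),
    which spans the kernel of B = (1 1), until one coordinate reaches 0 or 1.
    The sum pi_i + pi_j is preserved, and each input probability is the
    expectation of the corresponding output."""
    lam1 = min(1 - pi_i, pi_j)        # largest step along +u staying in [0, 1]^2
    lam2 = min(pi_i, 1 - pi_j)        # largest step along -u staying in [0, 1]^2
    if rng.random() < lam2 / (lam1 + lam2):
        return pi_i + lam1, pi_j - lam1
    return pi_i - lam2, pi_j + lam2
```

Repeatedly applying this step to pairs of noninteger components yields a fixed-size sample with the prescribed unequal inclusion probabilities.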

8.6.3 The Landing Phase

At the end of the flight phase, the balancing martingale has reached a vertex of K, which is not necessarily a vertex of C. This vertex is denoted by π* = [π*_k] = π(T). Let q be the number of noninteger components of this vertex. If q = 0, the algorithm is complete. If q > 0, some constraints cannot be exactly attained.

Landing phase by an enumerative algorithm

Definition 59. A sample s is said to be compatible with a vector π* if π*_k = s_k for all k such that π*_k is an integer. Let C(π*) denote the set, with 2^q elements, of samples compatible with π*.

It is clear that we can limit ourselves to finding a design with mean value π* and whose support is included in C(π*).

The landing phase can be completed by an enumerative algorithm on C(π*), as developed in Section 8.5. The following linear program provides a sampling design on C(π*):

min_{p(.)} ∑_{s∈C(π*)} Cost(s) p(s),        (8.8)

subject to

∑_{s∈C(π*)} p(s) = 1,
∑_{s∈C(π*)} s p(s) = π*,
0 ≤ p(s) ≤ 1, for all s ∈ C(π*).

Next, a sample is selected with the sampling design p(.). Because q ≤ p, this linear program no longer depends on the population size but only on the number of balancing variables. It is thus restricted to 2^q possible samples and, with a modern computer, can be applied without difficulty to a balancing problem with a score of auxiliary variables. If the inclusion probabilities are an auxiliary variable and the sum of the inclusion probabilities is an integer, then the linear program can be applied only to

C_n(π*) = { s ∈ C(π*) | ∑_{k∈U} s_k = ∑_{k∈U} π_k = n },

which dramatically limits the number of samples.
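Assuming SciPy is available, the linear program (8.8) can be sketched by enumerating C(π*) explicitly. The function name `landing_lp`, its `cost` argument, and the toy cost in the usage note are illustrative assumptions, not the book's code:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def landing_lp(pi_star, cost):
    """Landing phase sketch: solve the linear program (8.8) over the 2^q
    samples compatible with pi_star. `cost` maps a 0/1 sample vector
    to Cost(s)."""
    pi_star = np.asarray(pi_star, dtype=float)
    frac = [k for k, v in enumerate(pi_star) if 1e-9 < v < 1 - 1e-9]
    # enumerate C(pi*): every 0/1 completion of the q fractional coordinates
    samples = []
    for bits in itertools.product((0.0, 1.0), repeat=len(frac)):
        s = pi_star.copy()
        s[frac] = bits
        samples.append(s)
    S = np.array(samples)                           # 2^q x N, one sample per row
    c = np.array([cost(s) for s in S])
    # constraints: probabilities sum to 1 and the design has mean pi*
    A_eq = np.vstack([np.ones(len(S)), S.T])
    b_eq = np.concatenate([[1.0], pi_star])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    return S, res.x                                 # samples and probabilities p(s)
```

For example, with π* = (1, 0, 0.5, 0.5) and Cost(s) = |n(s) − 2|, the optimal design puts all its mass on the two compatible samples of size 2.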

Landing phase by suppression of variables

If the number of balancing variables is too large for the linear program to be solved by a simplex algorithm (q > 20), then, at the end of the flight phase, a balancing variable can simply be suppressed. A constraint is thus relaxed, allowing a return to the flight phase until it is no longer possible to "move" within the constraint subspace. The constraints are thus successively relaxed. For this reason, it is necessary to order the balancing variables according to their importance, so that the least important constraints are relaxed first. This ordering naturally depends on the context of the survey.

8.6.4 Quality of Balancing

The rounding problem can arise with any balanced sampling design. For instance, in stratification, the rounding problem arises when the sum of the inclusion probabilities within a stratum is not an integer, which is almost always the case in proportional or optimal stratification.

In practice, the stratum sample sizes n_h are rounded either deterministically or randomly. Random rounding is used to satisfy the values of n_h in expectation; its purpose is to respect the initial inclusion probabilities.
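The random rounding of stratum sizes described above can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def random_round(nh, rng=None):
    """Round each (possibly noninteger) stratum size n_h down or up at random
    so that the expectation of the rounded size equals n_h."""
    rng = np.random.default_rng(rng)
    nh = np.asarray(nh, dtype=float)
    low = np.floor(nh)
    # round up with probability equal to the fractional part of n_h
    return (low + (rng.random(nh.shape) < nh - low)).astype(int)
```

For instance, a stratum with n_h = 2.3 receives 3 units with probability 0.3 and 2 units with probability 0.7, so the expected sample size is exactly 2.3.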

The cube method also uses random rounding. In the particular case of stratification, it provides exactly the well-known method of random rounding of the sample sizes in the strata. With any variant of the landing phase, the deviation between the Horvitz-Thompson estimator and the total can be bounded, because the rounding problem depends on only q ≤ p values.

Result 38. For any application of the cube method,

|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj / π_k|.

Result 39. If the sum of the inclusion probabilities is an integer and if the sampling design has a fixed sample size, then, for any application of the cube method,

|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj / π_k − X_j / n|.

Proof. With the cube method, we can always satisfy the fixed sample size constraint when the sum of the inclusion probabilities is an integer, which can be written ∑_{k∈U} s_k = ∑_{k∈U} π_k = n.

This bound is a conservative bound of the rounding error because we consider the worst case. Moreover, this bound is computed for a total and must be considered relative to the population size. Let α_k = π_k N/n, k ∈ U. For almost all the usual sampling designs, we can admit that 1/α_k is bounded when n → ∞. Note that, for a fixed sample size,

(1/N) ∑_{k∈U} α_k = 1.

The bound for the estimation of the mean can thus be written

|X̂_jHT − X_j| / N ≤ (p/n) max_{k∈U} |x_kj / α_k − X_j / N| = O(p/n),

where O(1/n) is a quantity that remains bounded when multiplied by n. The bound thus very quickly becomes negligible when the sample size is large with respect to the number of balancing variables.

For comparison, note that with a single-stage sampling design such as simple random sampling or Bernoulli sampling, we generally have

|X̂_jHT − X_j| / N = O_p(1/√n)

(see, for example, Rosén, 1972; Isaki and Fuller, 1982).

Despite the overstatement of the bound, the gain obtained by balanced sampling is very important: the rate of convergence is much faster for balanced sampling than for a usual sampling design. In practice, except for very small sample sizes, the rounding problem is thus negligible. Note that the rounding problem is equally problematic in stratification with very small sample sizes. In addition, this bound corresponds to the "worst case", whereas the landing phase is used to find the best one.
