
The Cube Method

In the document Springer Series in Statistics (Page 166-173)

is (0,0,1,1,0,1,1,0) and has a cost equal to 0.00003, is not in this sampling design.

Remark 13. Example 33 shows that the sampling design that minimizes the average cost does not necessarily allocate a nonnull probability to the most balanced sample.

Table 3.1, page 32, shows the limits of the enumerative methods. Even if we limit ourselves to samples of fixed size, the enumerative methods cannot deal with populations larger than 30 units. In order to select a balanced sample, the cube method provides the shortcut needed to avoid the enumeration of samples.

8.6 The Cube Method

The cube method is composed of two phases, called the flight phase and the landing phase. In the flight phase, the constraints are always exactly satisfied.

The objective is to randomly round off almost all of the inclusion probabilities to 0 or 1, that is, to randomly select a vertex of K = Q ∩ C. The landing phase consists of managing as well as possible the fact that the balancing equations (8.1) cannot always be exactly satisfied.

8.6.1 The Flight Phase

The aim of the flight phase is to randomly choose a vertex of K = [0,1]^N ∩ Q, where Q = π + Ker A and A = (x̌_1 ··· x̌_k ··· x̌_N), in such a way that the inclusion probabilities π_k, k ∈ U, and the balancing equations (8.1) are exactly satisfied. Note that, by Result 34, a vertex of K has at most p noninteger values.

The landing phase is necessary only if the vertex of K attained is not a vertex of C; it consists of relaxing the constraints (8.1) as little as possible in order to select a sample, that is, a vertex of C.

The general algorithm for completing the flight phase is to use a balancing martingale.

Definition 58. A discrete-time stochastic process π(t) = [π_k(t)], t = 0, 1, ..., in R^N is said to be a balancing martingale for a vector of inclusion probabilities π and the auxiliary variables x_1, ..., x_p if

(i) π(0) = π,
(ii) E[π(t) | π(t−1), ..., π(0)] = π(t−1), t = 1, 2, ...,
(iii) π(t) ∈ K = [0,1]^N ∩ (π + Ker A),

where A is the p×N matrix given by A = (x̌_1 ··· x̌_k ··· x̌_N).

A balancing martingale therefore satisfies the property that π(t−1) is the mean of the possible values of π(t).

Result 37. If π(t) is a balancing martingale, then we have the following:

(i) E[π(t)] = E[π(t−1)] = ··· = E[π(0)] = π;
(ii) ∑_{k∈U} x̌_k π_k(t) = ∑_{k∈U} x̌_k π_k = X, t = 0, 1, 2, ...;
(iii) when the balancing martingale reaches a face of C, it does not leave it.

Proof. Part (i) is obvious. Part (ii) results from the fact that π(t) ∈ K. To prove (iii), note that π(t−1) belongs to a face. It is the mean of the possible values of π(t), which therefore must also belong to this face. □

Part (iii) of Result 37 directly implies that (i) if π_k(t) = 0, then π_k(t+h) = 0, h ≥ 0; (ii) if π_k(t) = 1, then π_k(t+h) = 1, h ≥ 0; and (iii) the vertices of K are absorbing states.

The practical problem is to find a method that rapidly reaches a vertex.

Algorithm 8.3 allows us to attain a vertex of K in at most N steps.

Algorithm 8.3 General balanced procedure: Flight phase

Initialize π(0) = π.
For t = 0, ..., T, and until it is no longer possible to carry out Step 1, do

1. Generate any vector u(t) = [u_k(t)] ≠ 0, random or not, such that u(t) is in the kernel of matrix A, and u_k(t) = 0 if π_k(t) is an integer.
2. Compute λ_1(t) and λ_2(t), the largest values such that 0 ≤ π(t) + λ_1(t)u(t) ≤ 1 and 0 ≤ π(t) − λ_2(t)u(t) ≤ 1. Note that λ_1(t) > 0 and λ_2(t) > 0.
3. Select

   π(t+1) = π(t) + λ_1(t)u(t) with probability q_1(t),
            π(t) − λ_2(t)u(t) with probability q_2(t),        (8.7)

   where q_1(t) = λ_2(t)/{λ_1(t) + λ_2(t)} and q_2(t) = λ_1(t)/{λ_1(t) + λ_2(t)}.

EndFor.

Figure 8.4 shows the geometric representation of the first step of a balancing martingale in the case of N = 3. The only constraint is the fixed sample size. Now, Algorithm 8.3 defines a balancing martingale. Clearly, π(0) = π. Also, from Expression (8.7), we obtain

E[π(t) | π(t−1), ..., π(0)] = π(t−1), t = 1, 2, ...,

because

E[π(t) | π(t−1), u(t)] = π(t−1), t = 1, 2, ....


Fig. 8.4. Example of the first step of a balancing martingale in S_2 and a population of size N = 3: from π(0), the process moves to π(0) + λ_1(0)u(0) or to π(0) − λ_2(0)u(0) within the cube with vertices (000), (100), (010), (001), (110), (101), (011), (111).

Finally, because u(t) is in the kernel of A, from (8.7) we obtain that π(t) always remains in K = [0,1]^N ∩ (π + Ker A).

At each step, at least one component of the process is rounded to 0 or 1.

Thus, π(1) is on a face of the N-cube, that is, on a cube of dimension at most N−1; π(2) is on a cube of dimension at most N−2; and so on. Let T be the time at which the flight phase stops. The fact that Step 1 is no longer possible shows that the balancing martingale has attained a vertex of K and thus, by Result 34, page 153, that card{k : 0 < π_k(T) < 1} ≤ p.
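The flight phase described above can be sketched in NumPy. This is a minimal illustration under our own conventions, not the book's reference code: the function name `flight_phase`, the SVD-based kernel computation, and the tolerance `eps` are assumptions; `X` holds the auxiliary variable vectors x_k as rows, and A is built with columns x̌_k = x_k/π_k as in the text.

```python
import numpy as np

def null_space(M, tol=1e-10):
    """Orthonormal basis of Ker M via SVD (columns of the returned array)."""
    _, s, vh = np.linalg.svd(M)
    rank = int((s > tol).sum())
    return vh[rank:].T

def flight_phase(pi, X, rng=None, eps=1e-9):
    """Flight phase sketch (Algorithm 8.3): random walk toward a vertex of
    K = [0,1]^N ∩ (pi + Ker A).

    pi : inclusion probabilities (length N), all in (0, 1).
    X  : N x p array of auxiliary variables; A has columns x̌_k = x_k / pi_k.
    Returns pi(T) with at most p noninteger components; A pi(T) = A pi
    up to floating-point error, and E[pi(T)] = pi over the randomness.
    """
    rng = np.random.default_rng(rng)
    pi = np.asarray(pi, dtype=float).copy()
    A = (np.asarray(X, dtype=float) / pi[:, None]).T      # p x N

    while True:
        free = (pi > eps) & (pi < 1 - eps)                # noninteger coordinates
        if not free.any():
            break                                         # vertex of the cube reached
        ns = null_space(A[:, free])
        if ns.shape[1] == 0:                              # Step 1 no longer possible
            break
        u = np.zeros_like(pi)
        u[free] = ns[:, 0]                                # u in Ker A, u_k = 0 off 'free'
        pos, neg = u > eps, u < -eps
        # largest lambda_1, lambda_2 keeping 0 <= pi ± lambda u <= 1
        lam1 = min(np.min((1 - pi[pos]) / u[pos], initial=np.inf),
                   np.min(-pi[neg] / u[neg], initial=np.inf))
        lam2 = min(np.min(pi[pos] / u[pos], initial=np.inf),
                   np.min((pi[neg] - 1) / u[neg], initial=np.inf))
        # choose the branch so that E[pi(t+1) | pi(t), u] = pi(t), as in (8.7)
        if rng.random() < lam2 / (lam1 + lam2):
            pi += lam1 * u
        else:
            pi -= lam2 * u
        np.clip(pi, 0.0, 1.0, out=pi)                     # remove rounding noise
    return pi
```

For instance, with N = 8, π_k = 1/2 and the single balancing variable x_k = π_k (fixed sample size n = 4), every component is rounded because the sum of the inclusion probabilities is integer and p = 1.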

8.6.2 Fast Implementation of the Flight Phase

Chauvet and Tillé (2006, 2005a,b) proposed an implementation of the flight phase that provides a very fast algorithm. In Algorithm 8.3, the search for a vector u in Ker A can be expensive. The basic idea is to use a submatrix B containing only p+1 columns of A. Note that the number of variables p is smaller than the population size N and that rank B ≤ p. The dimension of the kernel of B is thus larger than or equal to 1.

A vector v of Ker B can then be used to construct a vector u of Ker A by complementing v with zeros for the columns of A that are not in B. With this idea, all the computations can be done on B alone. This method is described in Algorithm 8.4.

If T* is the last step of the algorithm and π* = π(T*), then we have:

1. E(π*) = π;
2. Aπ* = Aπ;
3. if q* = card{k | 0 < π*_k < 1}, then q* ≤ p, where p is the number of auxiliary variables.

Algorithm 8.4 Fast algorithm for the flight phase

1. Initialization
   a) The units with inclusion probabilities equal to 0 or 1 are removed from the population before applying the algorithm, in such a way that all the remaining units satisfy 0 < π_k < 1.
   b) The inclusion probabilities are loaded into vector π.
   c) The vector ψ is made up of the first p+1 elements of π.
   d) A vector of ranks is created: r = (1 2 ··· p p+1).
   e) Matrix B is made up of the first p+1 columns of A.
   f) Initialize k = p+2.
2. Basic loop
   a) A vector u is taken in the kernel of B.
   b) Only ψ is modified (and not the vector π) according to the basic technique: compute λ_1 and λ_2, the largest values such that 0 ≤ ψ + λ_1 u ≤ 1 and 0 ≤ ψ − λ_2 u ≤ 1. Note that λ_1 > 0 and λ_2 > 0.
   c) Select

      ψ = ψ + λ_1 u with probability q,
          ψ − λ_2 u with probability 1 − q,

      where q = λ_2/(λ_1 + λ_2).
   d) (The units that correspond to integer values ψ(i) are removed from B and are replaced by new units. The algorithm stops at the end of the file.)
      For i = 1, ..., p+1, do
        If ψ(i) = 0 or ψ(i) = 1 then
          If k ≤ N then
            π(r(i)) = ψ(i); r(i) = k; ψ(i) = π(k);
            For j = 1, ..., p, do B(i, j) = A(k, j); EndFor;
            k = k + 1;
          Else Goto Step 3(a);
          EndIf;
        EndIf;
      EndFor.
   e) Goto Step 2(a).
3. End of the first part of the flight phase
   a) For i = 1, ..., p+1, do π(r(i)) = ψ(i); EndFor.

In the case where some of the constraints can still be satisfied exactly, the flight phase can be continued. Suppose that C is the matrix containing the columns of A that correspond to noninteger values of π*, and φ is the vector of noninteger values of π*. If C is not of full rank, one or several steps of the general Algorithm 8.3 can still be applied to C and φ. A return to the general Algorithm 8.3 is thus necessary for the last steps.

The implementation of the fast algorithm is quite simple. Matrix A never has to be completely loaded in memory and can thus remain in a file that is read sequentially. For this reason, there is no restriction on the population size, because the execution time depends linearly on the population size. The search for a vector u in the submatrix B limits the choice of the direction u. In most cases, only one direction is possible. In order to increase the randomness of the sampling design, the units can be randomly mixed before applying Algorithm 8.4.

Another option consists of sorting the units in decreasing order of size. Indeed, from experience with the general Algorithm 8.3, the rounding problem often concerns units of large size, that is, units with large inclusion probabilities or large values of x̌_k. With the fast Algorithm 8.4, the rounding problem often concerns the units at the end of the file. If the units are sorted in decreasing order of size, the fast algorithm first tries to balance the big units, and the rounding problem instead concerns small units. Analogously, to reach an exact fixed weight of potatoes, it is more efficient to first put the large potatoes on the scale and to finish with the smallest ones.

This popular idea can also be used to balance a sample, even if the problem is more complex because it is multivariate.

The idea of considering only a subset of the units already underlay the moving stratification procedure (see Tillé, 1996b), which provides a smoothed stratification effect. When p = 1 and the only auxiliary variable is x_k = π_k, the problem of balanced sampling amounts to sampling with unequal probabilities and fixed sample size. In this case, A = (1 ··· 1). At each step, matrix B = (1 1) and u = (1, −1). Algorithm 8.4 can therefore be simplified dramatically and is identical to the pivotal method (see Algorithm 6.6, page 107).
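The p = 1 special case can be sketched directly: since Ker B is spanned by u = (1, −1) when B = (1 1), each step moves a pair of probabilities in opposite directions until one of them is rounded. A minimal sketch (the function name and interface are ours, not the book's):

```python
import numpy as np

def pivotal_step(pi_i, pi_j, rng):
    """One step of the pivotal method: move (pi_i, pi_j) along u = (1, -1),
    which spans the kernel of B = (1 1), until one coordinate reaches 0 or 1.
    The sum pi_i + pi_j is preserved, and each input probability is the
    expectation of the corresponding output."""
    lam1 = min(1 - pi_i, pi_j)        # largest step along +u staying in [0, 1]^2
    lam2 = min(pi_i, 1 - pi_j)        # largest step along -u staying in [0, 1]^2
    if rng.random() < lam2 / (lam1 + lam2):
        return pi_i + lam1, pi_j - lam1
    return pi_i - lam2, pi_j + lam2
```

Repeatedly applying this step to pairs of noninteger components yields a fixed-size sample with the prescribed unequal inclusion probabilities.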

8.6.3 The Landing Phase

At the end of the flight phase, the balancing martingale has reached a vertex of K, which is not necessarily a vertex of C. This vertex is denoted by π* = [π*_k] = π(T). Let q be the number of noninteger components of this vertex. If q = 0, the algorithm is complete. If q > 0, some constraints cannot be exactly attained.

Landing phase by an enumerative algorithm

Definition 59. A sample s is said to be compatible with a vector π* if π*_k = s_k for all k such that π*_k is an integer. Let C(π*) denote the set, with 2^q elements, of samples compatible with π*.

It is clear that we can limit ourselves to finding a design with mean value π* and whose support is included in C(π*).

The landing phase can be completed by an enumerative algorithm on C(π*), as developed in Section 8.5. The following linear program provides a sampling design on C(π*):

min_{p(.)} ∑_{s∈C(π*)} Cost(s) p(s),        (8.8)

subject to

∑_{s∈C(π*)} p(s) = 1,
∑_{s∈C(π*)} s p(s) = π*,
0 ≤ p(s) ≤ 1, for all s ∈ C(π*).

Next, a sample is selected with the sampling design p(.). Because q ≤ p, this linear program no longer depends on the population size but only on the number of balancing variables. It is thus restricted to 2^q possible samples and, with a modern computer, can be applied without difficulty to a balancing problem with a score of auxiliary variables. If the inclusion probabilities are an auxiliary variable and the sum of the inclusion probabilities is an integer, then the linear program can be applied only to

C_n(π*) = { s ∈ C(π*) | ∑_{k∈U} s_k = ∑_{k∈U} π_k = n },

which dramatically limits the number of samples.
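Assuming SciPy is available, the linear program (8.8) can be sketched by enumerating C(π*) explicitly. The function name `landing_lp`, its `cost` argument, and the toy cost in the usage note are illustrative assumptions, not the book's code:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def landing_lp(pi_star, cost):
    """Landing phase sketch: solve the linear program (8.8) over the 2^q
    samples compatible with pi_star. `cost` maps a 0/1 sample vector
    to Cost(s)."""
    pi_star = np.asarray(pi_star, dtype=float)
    frac = [k for k, v in enumerate(pi_star) if 1e-9 < v < 1 - 1e-9]
    # enumerate C(pi*): every 0/1 completion of the q fractional coordinates
    samples = []
    for bits in itertools.product((0.0, 1.0), repeat=len(frac)):
        s = pi_star.copy()
        s[frac] = bits
        samples.append(s)
    S = np.array(samples)                           # 2^q x N, one sample per row
    c = np.array([cost(s) for s in S])
    # constraints: probabilities sum to 1 and the design has mean pi*
    A_eq = np.vstack([np.ones(len(S)), S.T])
    b_eq = np.concatenate([[1.0], pi_star])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    return S, res.x                                 # samples and probabilities p(s)
```

For example, with π* = (1, 0, 0.5, 0.5) and Cost(s) = |n(s) − 2|, the optimal design puts all its mass on the two compatible samples of size 2.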

Landing phase by suppression of variables

If the number of balancing variables is too large for the linear program to be solved by a simplex algorithm (q > 20), then, at the end of the flight phase, a balancing variable can simply be suppressed. A constraint is thus relaxed, allowing a return to the flight phase until it is no longer possible to "move" within the constraint subspace. The constraints are thus successively relaxed. For this reason, it is necessary to order the balancing variables according to their importance, so that the least important constraints are relaxed first. This ordering naturally depends on the context of the survey.

8.6.4 Quality of Balancing

The rounding problem can arise with any balanced sampling design. For instance, in stratification, the rounding problem arises when the sum of the inclusion probabilities within a stratum is not an integer, which is almost always the case in proportional or optimal stratification.

In practice, the stratum sample sizes n_h are rounded either deterministically or randomly. Random rounding is used to satisfy the values of n_h in expectation; its purpose is to respect the initial inclusion probabilities.
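The random rounding of stratum sizes described above can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def random_round(nh, rng=None):
    """Round each (possibly noninteger) stratum size n_h down or up at random
    so that the expectation of the rounded size equals n_h."""
    rng = np.random.default_rng(rng)
    nh = np.asarray(nh, dtype=float)
    low = np.floor(nh)
    # round up with probability equal to the fractional part of n_h
    return (low + (rng.random(nh.shape) < nh - low)).astype(int)
```

For instance, a stratum with n_h = 2.3 receives 3 units with probability 0.3 and 2 units with probability 0.7, so the expected sample size is exactly 2.3.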

The cube method also uses random rounding. In the particular case of stratification, it provides exactly the well-known method of random rounding of the sample sizes in the strata. With any variant of the landing phase, the deviation between the Horvitz-Thompson estimator and the total can be bounded, because the rounding problem depends on only q ≤ p values.

Result 38. For any application of the cube method,

|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj / π_k|.

Result 39. If the sum of the inclusion probabilities is an integer and if the sampling design has a fixed sample size, then, for any application of the cube method,

|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj / π_k − X_j / n|.

Proof. With the cube method, we can always satisfy the fixed sample size constraint when the sum of the inclusion probabilities is an integer, which can be written ∑_{k∈U} s_k = ∑_{k∈U} π_k = n.

This bound is a conservative bound of the rounding error because we consider the worst case. Moreover, this bound is computed for a total and must be considered relative to the population size. Let α_k = π_k N/n, k ∈ U. For almost all the usual sampling designs, we can admit that 1/α_k is bounded when n → ∞. Note that, for a fixed sample size,

(1/N) ∑_{k∈U} α_k = 1.

The bound for the estimation of the mean can thus be written

|X̂_jHT − X_j| / N ≤ (p/n) max_{k∈U} |x_kj / α_k − X_j / N| = O(p/n),

where O(1/n) is a quantity that remains bounded when multiplied by n. The bound thus very quickly becomes negligible when the sample size is large with respect to the number of balancing variables.

For comparison, note that with a single-stage sampling design such as simple random sampling or Bernoulli sampling, we generally have

|X̂_jHT − X_j| / N = O_p(1/√n)

(see, for example, Rosén, 1972; Isaki and Fuller, 1982).

Despite the overstatement of the bound, the gain obtained by balanced sampling is very important: the rate of convergence is much faster for balanced sampling than for a usual sampling design. In practice, except for very small sample sizes, the rounding problem is thus negligible. Note that the rounding problem is equally problematic in stratification with very small sample sizes. In addition, this bound corresponds to the "worst case", whereas the landing phase is used to find the best one.
