a) The units with inclusion probabilities equal to 0 or 1 are removed from the population before applying the algorithm, so that every remaining unit satisfies 0 < π_k < 1.
b) The inclusion probabilities are loaded into the vector π.
c) The vector ψ is made up of the first p + 1 elements of π.
d) A vector of ranks is created: r = (1 2 · · · p p+1).
e) Matrix B is made up of the first p + 1 columns of A.
f) Initialize k = p + 2.
2. Basic loop
a) A vector u is taken in the kernel of B.
b) Only ψ is modified (and not the vector π) according to the basic technique. Compute λ1* and λ2*, the largest values of λ1 and λ2 such that 0 ≤ ψ + λ1 u ≤ 1 and 0 ≤ ψ − λ2 u ≤ 1. Note that λ1* > 0 and λ2* > 0.
c) Select
ψ = ψ + λ1* u with probability q,
ψ = ψ − λ2* u with probability 1 − q,
where q = λ2*/(λ1* + λ2*).
d) (The units that correspond to integer values ψ(i) are removed from B and are replaced by new units. The algorithm stops at the end of the file.)
For i = 1, . . . , p + 1, do
  If ψ(i) = 0 or ψ(i) = 1 then
    If k ≤ N then
      π(r(i)) = ψ(i);
      r(i) = k;
      ψ(i) = π(k);
      For j = 1, . . . , p, do B(i, j) = A(k, j); EndFor;
      k = k + 1;
    Else Goto Step 3(a);
    EndIf;
  EndIf;
EndFor.
e) Goto Step 2(a).
3. End of the first part of the flight phase
a) For i = 1, . . . , p + 1, do π(r(i)) = ψ(i); EndFor.
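A single iteration of the basic loop (Steps 2(b)-(c)) can be sketched in Python. This is an illustrative helper, not code from the book; `flight_step` and its signature are assumptions, and the vector u is assumed to have already been taken in the kernel of B.

```python
import random

def flight_step(psi, u, rnd=random.random):
    """One basic step of the flight phase: move psi along +u or -u as far
    as the cube [0,1]^(p+1) allows, preserving E[psi] (Steps 2(b)-(c))."""
    # lambda1* : largest lambda1 with 0 <= psi + lambda1*u <= 1
    lam1 = min((1 - y) / v if v > 0 else y / -v
               for y, v in zip(psi, u) if v != 0)
    # lambda2* : largest lambda2 with 0 <= psi - lambda2*u <= 1
    lam2 = min(y / v if v > 0 else (1 - y) / -v
               for y, v in zip(psi, u) if v != 0)
    q = lam2 / (lam1 + lam2)
    if rnd() < q:
        psi = [y + lam1 * v for y, v in zip(psi, u)]
    else:
        psi = [y - lam2 * v for y, v in zip(psi, u)]
    # round off floating-point noise so components that hit 0 or 1 are exact
    return [round(y, 12) for y in psi]
```

After each call, at least one component of ψ is integer, which is what Step 2(d) exploits to retire that unit and bring a fresh column into B.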
In the case where some of the constraints can still be satisfied exactly, the flight phase can be continued. Suppose that C is the matrix containing the columns of A that correspond to the noninteger values of π̃, and φ is the vector of those noninteger values of π̃. If C is not of full rank, one or several steps of the general Algorithm 8.3 can still be applied to C and φ. A return to the general Algorithm 8.3 is thus necessary for the last steps.
The implementation of the fast algorithm is quite simple. Matrix A never has to be completely loaded in memory and thus remains in a file that can be read sequentially. For this reason, there is no restriction on the population size, because the execution time depends linearly on the population size. The search for a vector u in the submatrix B limits the choice of the direction u. In most cases, only one direction is possible. In order to increase the randomness of the sampling design, the units can be randomly mixed before applying Algorithm 8.4.
Another option consists of sorting the units in decreasing order of size. Indeed, from experience with the general Algorithm 8.3, the rounding problem often concerns units of large size; that is, large inclusion probabilities or large values of x̌_k. With the fast Algorithm 8.4, the rounding problem often concerns the units at the end of the file. If the units are sorted in decreasing order of size, the fast algorithm first tries to balance the big units, and the rounding problem instead concerns the small units. Analogously, if we want to obtain an exact fixed weight of potatoes, it is more efficient to put the large potatoes on the balance first and to finish with the smallest ones.
This popular idea can also be used to balance a sample, even if the problem is more complex because it is multivariate.
The idea of considering only a subset of the units already underlay the moving stratification procedure (see Tillé, 1996b), which provides a smoothed stratification effect. When p = 1 and the only auxiliary variable is x_k = π_k, the problem of balanced sampling amounts to sampling with unequal probabilities and fixed sample size. In this case, A = (1 · · · 1). At each step, matrix B = (1 1) and u = (−1, 1). Algorithm 8.4 can therefore be simplified dramatically and is identical to the pivotal method (see Algorithm 6.6, page 107).
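As an illustration of this special case, here is a minimal Python sketch of the pivotal method (an assumed implementation, not code from the book): each step confronts two residual probabilities, which is exactly a flight step with B = (1 1) and u = (−1, 1).

```python
import random

def pivotal_sample(pik, rnd=random.random):
    """Fixed-size sampling with unequal probabilities pi_k via pivotal steps."""
    pik = list(pik)
    live = [k for k, p in enumerate(pik) if 0 < p < 1]
    while len(live) >= 2:
        i, j = live[0], live[1]
        a, b = pik[i], pik[j]
        if a + b <= 1:
            # one of the two probabilities is pushed down to 0
            if rnd() < b / (a + b):
                pik[i], pik[j] = 0.0, a + b
            else:
                pik[i], pik[j] = a + b, 0.0
        else:
            # one of the two probabilities is pushed up to 1
            if rnd() < (1 - b) / (2 - a - b):
                pik[i], pik[j] = 1.0, a + b - 1
            else:
                pik[i], pik[j] = a + b - 1, 1.0
        live = [k for k in live if 0 < pik[k] < 1]
    for k in live:  # at most one undecided unit remains if sum(pik) is noninteger
        pik[k] = 1.0 if rnd() < pik[k] else 0.0
    return [int(round(p)) for p in pik]
```

Each confrontation makes at least one of the two probabilities integer, so the loop terminates after at most N − 1 steps while preserving the expectations.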
8.6.3 The Landing Phase
At the end of the flight phase, the balancing martingale has reached a vertex of K, which is not necessarily a vertex of C. This vertex is denoted by π* = [π*_k] = π(T). Let q be the number of noninteger components of this vertex. If q = 0, the algorithm is completed. If q > 0, some constraints cannot be exactly attained.
Landing phase by an enumerative algorithm
Definition 59. A sample s is said to be compatible with a vector π* if π*_k = s_k for all k such that π*_k is an integer. Let C(π*) denote the set, with 2^q elements, of the samples compatible with π*.
It is clear that we can limit ourselves to finding a design with mean value π* and whose support is included in C(π*).
The landing phase can be completed by an enumerative algorithm on the subpopulation C(π*) as developed in Section 8.5. The following linear program provides a sampling design on C(π*):
min_{p*(·)} Σ_{s∈C(π*)} Cost(s) p*(s),    (8.8)
subject to
Σ_{s∈C(π*)} p*(s) = 1,
Σ_{s∈C(π*)} s p*(s) = π*,
and 0 ≤ p*(s) ≤ 1, for all s ∈ C(π*).
Next, a sample is selected with the sampling design p*(·). Because q ≤ p, this linear program no longer depends on the population size but only on the number of balancing variables. It is restricted to 2^q possible samples and, with a modern computer, can be applied without difficulty to a balancing problem with a score of auxiliary variables. If the inclusion probabilities are an auxiliary variable and the sum of the inclusion probabilities is an integer, then the linear program can be applied only to
C_n(π*) = { s ∈ C(π*) | Σ_{k∈U} s_k = Σ_{k∈U} π_k = n },
which dramatically limits the number of samples.
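The enumeration that feeds this linear program can be sketched as follows; the function names are illustrative, not from the book, and the cost function and balancing constraints of (8.8) would then be handed to any LP solver over these candidates.

```python
from itertools import product

def compatible_samples(pi_star):
    """C(pi*): integer components of pi* are fixed; each of the q
    noninteger components may be 0 or 1, giving 2**q candidate samples."""
    options = [(int(p),) if p in (0.0, 1.0) else (0, 1) for p in pi_star]
    return [list(s) for s in product(*options)]

def fixed_size_compatible(pi_star):
    """C_n(pi*): compatible samples whose size equals n = sum(pi*)."""
    n = round(sum(pi_star))
    return [s for s in compatible_samples(pi_star) if sum(s) == n]
```

For instance, with π* = (1, 0, 0.5, 0.5) there are 2² = 4 compatible samples, but only two of fixed size n = 2, which is the reduction the text describes.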
Landing phase by suppression of variables
If the number of balancing variables is too large for the linear program to be solved by a simplex algorithm (q > 20), then, at the end of the flight phase, a balancing variable can simply be suppressed. A constraint is thus relaxed, allowing a return to the flight phase until it is no longer possible to “move” within the constraint subspace. The constraints are thus successively relaxed. For this reason, it is necessary to order the balancing variables according to their importance, so that the least important constraints are relaxed first. This ordering naturally depends on the context of the survey.
8.6.4 Quality of Balancing
The rounding problem can arise with any balanced sampling design. For instance, in stratification, the rounding problem arises when the sum of the inclusion probabilities within the strata is not an integer, which is almost always the case in proportional stratification or optimal stratification.
In practice, the stratum sample sizes n_h are rounded either deterministically or randomly. Random rounding is used so as to satisfy the values of n_h in expectation; its purpose is to respect the initial inclusion probabilities.
The cube method also uses a random rounding. In the particular case of stratification, it provides exactly the well-known method of randomly rounding the sample sizes in the strata. With any variant of the landing phase, the deviation between the Horvitz-Thompson estimator and the total can be bounded, because the rounding problem only depends on q ≤ p values.
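The random rounding of a noninteger expected stratum size n_h can be sketched as follows (an illustrative helper; the name is an assumption):

```python
import math
import random

def random_round(nh, rnd=random.random):
    """Round nh down or up so that the expected value of the result is nh."""
    lo = math.floor(nh)
    # go up with probability equal to the fractional part of nh
    return lo + (1 if rnd() < nh - lo else 0)
```

For example, an expected size of 2.3 becomes 3 with probability 0.3 and 2 with probability 0.7, so the realized sizes average to 2.3 and the initial inclusion probabilities are respected.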
Result 38. For any application of the cube method,
|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj/π_k|.
Result 39. If the sum of the inclusion probabilities is an integer and if the sampling design has a fixed sample size, then, for any application of the cube method,
|X̂_jHT − X_j| ≤ p × max_{k∈U} |x_kj/π_k − X_j/n|.
Proof. With the cube method, we can always satisfy the fixed sample size constraint when the sum of the inclusion probabilities is an integer, which can be written Σ_{k∈U} s_k = Σ_{k∈U} π_k.
This bound is a conservative bound of the rounding error because we consider the worst case. Moreover, this bound is computed for a total and must be considered relative to the population size. Let α_k = π_k N/n, k ∈ U. For almost all the usual sampling designs, we can admit that 1/α_k is bounded when n → ∞. Note that, for a fixed sample size,
(1/N) Σ_{k∈U} α_k = 1.
The bound for the estimation of the mean can thus be written:
|X̂_jHT − X_j| / N ≤ (p/n) × max_{k∈U} |x_kj/α_k − X_j/N| = O(p/n),
where O(1/n) is a quantity that remains bounded when multiplied by n. The bound thus very quickly becomes negligible if the sample size is large with respect to the number of balancing variables.
For comparison, note that with a single-stage sampling design such as simple random sampling or Bernoulli sampling, we generally have
|X̂_jHT − X_j| / N = O_p(1/√n)
(see, for example, Rosén, 1972; Isaki and Fuller, 1982).
Despite the overstatement of the bound, the gain obtained by balanced sampling is very important. The rate of convergence is much faster for balanced sampling than for a usual sampling design. In practice, except for very small sample sizes, the rounding problem is thus negligible. Note that the rounding problem is also an issue in stratification with very small sample sizes. In addition, this bound corresponds to the worst case, whereas the landing phase is used to find the best possible sample.
8.7 Application of the Cube Method to Particular Cases
8.7.1 Simple Random Sampling
Simple random sampling is a particular case of the cube method. Suppose that π = (n/N · · · n/N · · · n/N) and that the balancing variable is x_k = n/N, k ∈ U. We thus have A = (1 · · · 1) ∈ R^N and
Ker A = { v ∈ R^N | Σ_{k∈U} v_k = 0 }.
There are at least three ways to select a simple random sampling without replacement.
1. The first way consists of beginning the first step by using
u(1) = ( (N−1)/N  −1/N  · · ·  −1/N ).
Then λ1(1) = (N−n)/(N−1), λ2(1) = n/(N−1), and
π(1) = ( 1  (n−1)/(N−1)  · · ·  (n−1)/(N−1) ) with probability q1(1),
π(1) = ( 0  n/(N−1)  · · ·  n/(N−1) ) with probability q2(1),
where q1(1) = π1 = n/N and q2(1) = 1 − π1 = (N − n)/N. This first step corresponds exactly to the selection-rejection method for SRSWOR described in Algorithm 4.3, page 48.
2. The second way consists of sorting the data randomly before applying the cube method with any vectors v(t). Indeed, any choice of v(t) provides a fixed-size sampling with inclusion probabilities π_k = n/N. A random sort applied before any equal-probability sampling provides a simple random sampling (see Algorithm 4.5, page 50).
3. The third way consists of using a random vector v = (v_k), where the v_k are N independent, identically distributed variables. Next, this vector is projected on Ker A, which gives
u_k = v_k − (1/N) Σ_{ℓ∈U} v_ℓ.
Note that, for such v_k, it is obvious that a preliminary sorting of the data does not change the sampling design, which is thus a simple random sampling design.
An interesting problem occurs when the design has equal inclusion probabilities π_k = π, k ∈ U, such that Nπ is not an integer. If the only constraint implies a fixed sample size, that is, x_k = 1, k ∈ U, then the balancing equation can only be approximately satisfied. Nevertheless, the flight phase of the cube method works until N − p = N − 1 elements of π* = π(N−1) are integers. The landing phase consists of randomly deciding whether the last unit is drawn. The sample size is therefore equal to one of the two integers nearest to Nπ.
8.7.2 Stratification
Stratification can be achieved by taking x_kh = δ_kh n_h/N_h, h = 1, . . . , H, where N_h is the size of stratum U_h, n_h is the stratum sample size, and
δ_kh = 1 if k ∈ U_h, 0 if k ∉ U_h.
In the first step, we use
u_k(1) = v_k(1) − (1/N_h) Σ_{ℓ∈U_h} v_ℓ(1), k ∈ U_h.
The three strategies described in Section 8.7.1 for simple random sampling allow us to obtain directly a stratified random sample with simple random sampling within the strata. If the sums of the inclusion probabilities are not integer within the strata, the cube method randomly rounds the sample sizes of the strata so as to ensure that the given inclusion probabilities are exactly satisfied.
The interesting aspect of the cube method is that the stratification can be generalized to overlapping strata, which can be called “quota random design”
or “cross-stratification”. Suppose that two stratification variables are available, for example, “activity sector” and “region” in a business survey. The strata defined by the first variable are denoted by U_h·, h = 1, . . . , H, and the strata defined by the second variable are denoted by U_·i, i = 1, . . . , K.
Next, define the p = H + K balancing variables
x_kj = π_k × I[k ∈ U_j·] for j = 1, . . . , H,
x_kj = π_k × I[k ∈ U_·(j−H)] for j = H + 1, . . . , H + K,
where I[·] is an indicator variable that takes the value 1 if the condition is true and 0 otherwise. The sample can now be selected directly by means of the cube method. The generalization to a multiple quota random design follows immediately. It can be shown (Deville and Tillé, 2000) that the quota random design can be exactly satisfied.
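Building the p = H + K balancing variables for two overlapping stratifications can be sketched as follows (illustrative code; the function and argument names are assumptions):

```python
def quota_balancing_matrix(pik, row_stratum, col_stratum, H, K):
    """x_kj = pi_k * I[k in U_{j.}] for j = 0..H-1, and
    x_kj = pi_k * I[k in U_{.(j-H)}] for j = H..H+K-1."""
    X = [[0.0] * (H + K) for _ in pik]
    for k, p in enumerate(pik):
        X[k][row_stratum[k]] = p       # first stratification (e.g. region)
        X[k][H + col_stratum[k]] = p   # second stratification (e.g. sector)
    return X
```

Each column sum of X equals the sum of the π_k within the corresponding stratum, so a sample balanced on X reproduces the expected quota in every row stratum and every column stratum simultaneously.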
8.7.3 Unequal Probability Sampling with Fixed Sample Size
The unequal inclusion probability problem can be solved by means of the cube method. Suppose that the objective is to select a sample of fixed size n with inclusion probabilities π_k, k ∈ U, such that Σ_{k∈U} π_k = n. In this case, the only balancing variable is x_k = π_k. In order to satisfy this constraint, we must have
u ∈ Ker A = { v ∈ R^N | Σ_{k∈U} v_k = 0 },
and thus
Σ_{k∈U} u_k(t) = 0.    (8.11)
Each choice, random or not, of vectors u(t) that satisfy (8.11) produces another unequal probability sampling method. Nearly all existing methods, except the rejective ones and the variations of systematic sampling, can easily be expressed by means of the cube method. In this case, the cube method is identical to the splitting method based on the choice of a direction described in Section 6.2.3, page 102.
The techniques of unequal probability sampling can always be improved.
Indeed, in all the available unequal probability sampling methods with fixed sample size, the design is only balanced on a single variable. Nevertheless, two balancing variables are always available, namely x_k1 = π_k, k ∈ U, and x_k2 = 1, k ∈ U. The first variable implies a fixed sample size, and the second one implies that
N̂_HT = Σ_{k∈U} S_k/π_k = N.
In all methods, the sample is balanced on x_k1 but not on x_k2. The cube method allows us to satisfy both constraints approximately.
8.8 Variance Approximations in Balanced Sampling
8.8.1 Construction of an Approximation
The variance of the Horvitz-Thompson estimator is
var(Ŷ_HT) = Σ_{k∈U} Σ_{ℓ∈U} y̌_k y̌_ℓ ∆_kℓ = y̌′ ∆ y̌,
where y̌_k = y_k/π_k, ∆_kℓ = π_kℓ − π_k π_ℓ, and ∆ = [∆_kℓ]. Matrix ∆ is called the variance-covariance operator. Thus, the variance of Ŷ_HT can theoretically be expressed and estimated by using the joint inclusion probabilities. Unfortunately, even in very simple cases like fixed sample sizes, the computation of ∆ is practically impossible.
Deville and Tillé (2005) have proposed approximating the variance by supposing that the balanced sampling can be viewed as a conditional Poisson sampling. A similar idea was also developed by Hájek (1981, p. 26; see also Section 7.5.1, page 139) for sampling with unequal probabilities and fixed sample size. In the case of Poisson sampling, which is a sampling design with no balancing variables, the variance of Ŷ_HT is easy to derive and can be estimated because only first-order inclusion probabilities are needed. If S̃ is the random sample selected by a Poisson sampling design and π̃_k, k ∈ U, are the first-order inclusion probabilities of the Poisson design, then
var_POISSON(Ŷ_HT) = Σ_{k∈U} y̌_k² π̃_k (1 − π̃_k).    (8.13)
Note that Expression (8.13) contains π_k and π̃_k because the variance of the usual estimator (a function of the π_k's) is computed under Poisson sampling (a function of the π̃_k's). The π_k's are always known, but the π̃_k's are not necessarily known.
If we suppose that, under Poisson sampling, the vector (Ŷ_HT, X̂_HT) has approximately a multinormal distribution, we obtain the approximation
var_POISSON(Ŷ_HT | X̂_HT = X),    (8.14)
which again contains both π_k and π̃_k, because we compute the variance of the usual Horvitz-Thompson estimator (a function of the π_k's) under the Poisson sampling design (a function of the π̃_k's). If
b_k = π̃_k (1 − π̃_k),
Expression (8.14) can also be written
var_APPROX(Ŷ_HT) = Σ_{k∈U} b_k (y̌_k − y̌*_k)²,    (8.15)
where y̌*_k is the regression predictor of y̌_k on the x̌_k's with the weights b_k. When p = 1 and the only auxiliary variable is x_k = π_k, the problem of balanced sampling amounts to sampling with unequal probabilities and fixed sample size. The approximation of variance given in (8.15) is then equal to the approximation given in Expression (7.13), page 138. In this case, y̌*_k is simply the mean of the y̌_k's with the weights b_k.
The weights b_k unfortunately are unknown because they depend on the π̃_k's, which are not exactly equal to the π_k's. We thus propose to approximate the b_k's. Note that Expression (8.15) can also be written
var_APPROX(Ŷ_HT) = y̌′ ∆_APPROX y̌,
where ∆_APPROX = [∆app_kℓ] is the approximated variance-covariance operator, whose diagonal elements are
∆app_kk = b_k − b_k x̌′_k ( Σ_{ℓ∈U} b_ℓ x̌_ℓ x̌′_ℓ )⁻¹ x̌_k b_k.    (8.16)
Four variance approximations can be obtained by various definitions of the b_k's. These four definitions are denoted b_k1, b_k2, b_k3, and b_k4, and permit the definition of four variance approximations denoted V_α, α = 1, 2, 3, 4, and four variance-covariance operators denoted ∆_α, α = 1, 2, 3, 4, by replacing b_k in (8.15) and (8.16) with, respectively, b_k1, b_k2, b_k3, and b_k4.
1. The first approximation is obtained by considering that, at least for large sample sizes, π_k ≈ π̃_k, k ∈ U. Thus, we take b_k1 = π_k(1 − π_k).
2. The second approximation is obtained by applying a correction for the loss of degrees of freedom:
b_k2 = π_k(1 − π_k) N/(N − p).
This correction yields the exact expression for simple random sampling with fixed sample size.
3. The third approximation is derived from the fact that the diagonal elements of the variance-covariance operator ∆ of the true variance are always known and are equal to π_k(1 − π_k). Thus, by defining
b_k3 = π_k(1 − π_k) × trace ∆ / trace ∆_1,
we can define the approximated variance-covariance operator ∆_3 that has the same trace as ∆.
4. Finally, the fourth approximation is derived from the fact that the diagonal elements of ∆_APPROX can be computed and are given in (8.16). The b_k4's are constructed in such a way that ∆_kk = ∆app_kk, or in other words, that
π_k(1 − π_k) = b_k − b_k x̌′_k ( Σ_{ℓ∈U} b_ℓ x̌_ℓ x̌′_ℓ )⁻¹ x̌_k b_k, k ∈ U.    (8.17)
The determination of the b_k4's then requires solving this nonlinear equation system. This fourth approximation is the only one that provides the exact variance expression for stratification.
In Deville and Tillé (2005), a set of simulations is presented which shows that b_k4 is indisputably the most accurate approximation.
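For the fixed-size case (p = 1, x_k = π_k, hence x̌_k = 1), the four weights can be sketched as follows. The diagonal of the approximated operator (8.16) reduces to b_k − b_k²/Σb, and b_k4 is found here by a simple fixed-point iteration, which is an assumed solution method; the book only states the system (8.17).

```python
def b_weights(pik, iters=200):
    """b_k1..b_k4 of Section 8.8.1 in the single-variable case x_k = pi_k."""
    N, p = len(pik), 1
    bk1 = [q * (1 - q) for q in pik]
    bk2 = [b * N / (N - p) for b in bk1]
    # diagonal of the approximated operator when every x-check_k equals 1
    diag = lambda b: [bi - bi * bi / sum(b) for bi in b]
    bk3 = [b * sum(bk1) / sum(diag(bk1)) for b in bk1]   # trace matching
    # b_k4 solves pi_k(1 - pi_k) = b_k - b_k**2 / sum(b), i.e. system (8.17)
    bk4 = list(bk1)
    for _ in range(iters):
        s = sum(bk4)
        bk4 = [t + bi * bi / s for t, bi in zip(bk1, bk4)]
    return bk1, bk2, bk3, bk4
```

As a sanity check, with equal probabilities π_k = n/N the solution of the system is b_k4 = π_k(1 − π_k) N/(N − 1), the familiar finite-population correction of SRSWOR.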
8.8.2 Application of the Variance Approximation to Stratification
Suppose that the sampling design is stratified; that is, the population can be split into H nonoverlapping strata denoted U_h, h = 1, . . . , H, of sizes N_h, h = 1, . . . , H. The balancing variables are
x_k1 = δ_k1, . . . , x_kH = δ_kH,
where
δ_kh = 1 if k ∈ U_h, 0 if k ∉ U_h.
If a simple random sample is selected in each stratum with sizes n_1, . . . , n_H, then the variance can be computed exactly:
var(Ŷ_HT) = Σ_{h=1}^H N_h² × (N_h − n_h)/N_h × V²_yh/n_h,
where
V²_yh = 1/(N_h − 1) Σ_{k∈U_h} (y_k − Ȳ_h)², with Ȳ_h = (1/N_h) Σ_{k∈U_h} y_k.
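The exact stratified variance above is straightforward to compute; a small illustrative helper (names are assumptions), with V²_yh computed from the stratum values:

```python
def stratified_variance(strata):
    """var(Y_HT) = sum_h N_h^2 (N_h - n_h)/N_h * V_yh^2 / n_h,
    where strata is a list of (y_values_in_stratum, n_h) pairs."""
    total = 0.0
    for ys, nh in strata:
        Nh = len(ys)
        ybar = sum(ys) / Nh
        # corrected population variance of y within the stratum
        Vyh2 = sum((y - ybar) ** 2 for y in ys) / (Nh - 1)
        total += Nh ** 2 * (Nh - nh) / Nh * Vyh2 / nh
    return total
```

This is the benchmark against which the four approximations below can be compared on any small example.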
It is thus interesting to compute the four approximations given in Section 8.8.1 in this particular case.
1. The first approximation gives
b_k1 = π_k(1 − π_k) = (n_h/N_h)(1 − n_h/N_h).
2. The second approximation gives
b_k2 = π_k(1 − π_k) N/(N − H).
3. The third approximation gives
b_k3 = π_k(1 − π_k) × trace ∆ / trace ∆_1.
4. The fourth approximation gives
b_k4 = (n_h/N_h) × (N_h − n_h)/(N_h − 1).
Although the differences between var_APPROX1, var_APPROX2, var_APPROX3, and var_APPROX4 are small relative to the population size, var_APPROX4 is the only approximation that gives the exact variance of a stratified sampling design.
8.9 Variance Estimation
8.9.1 Construction of an Estimator of Variance
Because Expression (8.15) is a function of totals, we can substitute each total with its Horvitz-Thompson estimator (see, for instance, Deville, 1999) in order to obtain an estimator of (8.15). The resulting estimator is
var̂(Ŷ_HT) = Σ_{k∈U} c_k S_k (y̌_k − ŷ̌*_k)²,    (8.18)
where
ŷ̌*_k = x̌′_k ( Σ_{ℓ∈U} c_ℓ S_ℓ x̌_ℓ x̌′_ℓ )⁻¹ Σ_{ℓ∈U} c_ℓ S_ℓ x̌_ℓ y̌_ℓ
is the estimator of the regression predictor of y̌_k.
Note that (8.18) can also be written as a quadratic form in the y̌_k's with a matrix D = [D_kℓ] that depends on the c_k's. Five definitions of the c_k's allow defining five variance estimators by replacing c_k in Expression (8.18) by, respectively, c_k1, c_k2, c_k3, c_k4, and c_k5.
1. The first estimator is obtained by taking c_k1 = 1 − π_k.
2. The second estimator is obtained by applying a correction for the loss of degrees of freedom:
c_k2 = (1 − π_k) n/(n − p).
This correction for the loss of degrees of freedom gives the unbiased estimator in simple random sampling with fixed sample size.
3. The third estimator is derived from the fact that the diagonal elements of the true matrix [∆_kℓ/π_kℓ] are always known and are equal to 1 − π_k. Thus, we can use
c_k3 = (1 − π_k) × Σ_{k∈U} (1 − π_k) S_k / Σ_{k∈U} D_kk1 S_k,
where D_kk1 is obtained by plugging c_k1 into D_kk.
4. The fourth estimator can be derived from the b_k4's obtained by solving the equation system (8.17):
c_k4 = (b_k4/π_k) × n/(n − p) × (N − p)/N.
5. Finally, the fifth estimator is derived from the fact that the diagonal elements D_kk are known. The c_k5's are constructed in such a way that
1 − π_k = D_kk, k ∈ U,    (8.19)
or in other words, that
1 − π_k = c_k − c_k x̌′_k ( Σ_{i∈U} c_i S_i x̌_i x̌′_i )⁻¹ x̌_k c_k, k ∈ U.
A necessary condition for the existence of a solution to equation system (8.19) is that
max_k (1 − π_k) / Σ_{i∈U} S_i (1 − π_i) < 1/2.
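This necessary condition can be checked directly on the selected sample; a one-function illustrative sketch (the name is an assumption):

```python
def ck5_solvable(pik, sample):
    """Check max_k (1 - pi_k) / sum_i S_i (1 - pi_i) < 1/2 over sampled units."""
    terms = [1 - p for p, s in zip(pik, sample) if s]
    return max(terms) / sum(terms) < 0.5
```

Intuitively, the condition fails when a single sampled unit with a very small π_k dominates the sum, in which case the system (8.19) has no solution and one of the other c_k choices must be used.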
The choice of the weights c_k is tricky. Although the estimators are very similar, an evaluation by means of a set of simulations should still be run.
8.9.2 Application to Stratification of the Estimators of Variance
The case of stratification is interesting because the unbiased estimator of variance in a stratified sampling design (with a simple random sampling in each stratum) is known and is equal to
var̂(Ŷ_HT) = Σ_{h=1}^H N_h² × (N_h − n_h)/N_h × v²_yh/n_h,
where
v²_yh = 1/(n_h − 1) Σ_{k∈U_h} S_k (y_k − ȳ_h)², with ȳ_h = (1/n_h) Σ_{k∈U_h} S_k y_k.
It is thus interesting to compute the five estimators in the stratification case.
1. The first estimator gives
c_k1 = 1 − π_k = (N_h − n_h)/N_h.
2. The second estimator gives
c_k2 = (1 − π_k) n/(n − H) = (N_h − n_h)/N_h × n/(n − H).
3. The third estimator gives
c_k3 = (N_h − n_h)/N_h × Σ_{k∈U} (1 − π_k) S_k / Σ_{k∈U} D_kk1 S_k.
4. The fourth estimator gives, applying the general definition with p = H,
c_k4 = (N_h − n_h)/(N_h − 1) × n/(n − H) × (N − H)/N.
5. The fifth estimator gives
c_k5 = n_h/(n_h − 1) × (N_h − n_h)/N_h.
The five estimators are very similar, but var̂_APPROX5 is the only one that recovers exactly the unbiased estimator of variance of a stratified sampling design.
8.10 Recent Developments in Balanced Sampling
Balanced sampling is at present a common sampling procedure, used, for instance, for the selection of the master sample at the Institut National de la Statistique et des Études Économiques (INSEE, France). The implementation of the cube method has posed new, interesting problems. The question of the determination of the optimal inclusion probabilities is