

From Springer Series in Statistics (pages 173–183)

$$\sum_{k\in U} \alpha_k = 1.$$

The bound for the estimation of the mean can thus be written:

$$\frac{\left|\hat{X}_{jHT} - X_j\right|}{N} \le \frac{p}{nN} \max_{k\in U}\left|\frac{x_{kj}}{\alpha_k} - X_j\right| = O(p/n),$$

where $O(1/n)$ is a quantity that remains bounded when multiplied by $n$. The bound thus very quickly becomes negligible if the sample size is large with respect to the number of balancing variables.

For comparison, note that with a single-stage sampling design such as simple random sampling or Bernoulli sampling, we generally have

$$\frac{\left|\hat{X}_{jHT} - X_j\right|}{N} = O_p(1/\sqrt{n})$$

(see, for example, Rosén, 1972; Isaki and Fuller, 1982).

Despite the overstatement of the bound, the gain obtained by balanced sampling is very important. The rate of convergence is much faster for balanced sampling than for a usual sampling design. In practice, except in the case of very small sample sizes, the rounding problem is thus negligible. Furthermore, the rounding problem also arises in stratification with very small sample sizes. In addition, this bound corresponds to the "worst case," whereas the landing phase is used to find the best one.

8.7 Application of the Cube Method to Particular Cases

8.7.1 Simple Random Sampling

Simple random sampling is a particular case of the cube method. Suppose that $\boldsymbol{\pi} = (n/N \; \cdots \; n/N \; \cdots \; n/N)'$ and that the balancing variable is $x_k = n/N$, $k \in U$. We thus have $\mathbf{A} = (1 \; \cdots \; 1) \in \mathbb{R}^N$ and

$$\operatorname{Ker}\mathbf{A} = \left\{ \mathbf{v} \in \mathbb{R}^N \;\Big|\; \sum_{k\in U} v_k = 0 \right\}.$$

There are at least three ways to select a simple random sample without replacement.

1. The first way consists of beginning the first step by using

$$\mathbf{u}(1) = \left( \frac{N-1}{N},\; -\frac{1}{N},\; \cdots,\; -\frac{1}{N} \right)'.$$

Then, $\lambda_1(1) = (N-n)/(N-1)$, $\lambda_2(1) = n/(N-1)$, and

$$\boldsymbol{\pi}(1) = \begin{cases} \left(1,\; \dfrac{n-1}{N-1},\; \cdots,\; \dfrac{n-1}{N-1}\right)' & \text{with probability } q_1(1) \\[2mm] \left(0,\; \dfrac{n}{N-1},\; \cdots,\; \dfrac{n}{N-1}\right)' & \text{with probability } q_2(1), \end{cases}$$

where $q_1(1) = \pi_1 = n/N$ and $q_2(1) = 1 - \pi_1 = (N-n)/N$. This first step corresponds exactly to the selection-rejection method for SRSWOR described in Algorithm 4.3, page 48.
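The selection-rejection method that this first step reproduces can be sketched directly. The following is a minimal illustration, not the book's code; the function name and interface are ours.

```python
import random

def selection_rejection(N, n, rng=random):
    """Draw an SRSWOR of size n from {0, ..., N-1}: examine the units in
    order and keep unit k with probability (n - j) / (N - k), where j is
    the number of units already kept (cf. Algorithm 4.3)."""
    sample, j = [], 0
    for k in range(N):
        if rng.random() < (n - j) / (N - k):
            sample.append(k)
            j += 1
    return sample
```

Each run returns exactly $n$ distinct units, and every unit has inclusion probability $n/N$.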

2. The second way consists of sorting the data randomly before applying the cube method with any vectors $\mathbf{v}(t)$. Indeed, any choice of $\mathbf{v}(t)$ provides a fixed-size sampling with inclusion probabilities $\pi_k = n/N$. A random sort applied before any equal-probability sampling provides a simple random sampling (see Algorithm 4.5, page 50).

3. The third way consists of using a random vector $\mathbf{v} = (v_k)$, where the $v_k$ are $N$ independent identically distributed variables. Next, this vector is projected onto $\operatorname{Ker}\mathbf{A}$, which gives

$$u_k = v_k - \frac{1}{N} \sum_{\ell\in U} v_\ell.$$

Note that, for such $v_k$, it is obvious that a preliminary sorting of the data will not change the sampling design, which is thus a simple random sampling design.
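The projection in the third way is just a centering of the i.i.d. vector, as this small sketch (names ours, standard normal components assumed for illustration) shows:

```python
import random

def centered_direction(N, rng=random):
    """Project an i.i.d. vector v onto Ker A = {v : sum(v) = 0} by
    subtracting the mean: u_k = v_k - (1/N) * sum(v)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(N)]
    vbar = sum(v) / N
    return [vk - vbar for vk in v]
```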

An interesting problem occurs when the design has equal inclusion probabilities $\pi_k = \pi$, $k \in U$, such that $N\pi$ is not an integer. If the only constraint is a fixed sample size, that is, $x_k = 1$, $k \in U$, then the balancing equation can only be approximately satisfied. Nevertheless, the flight phase of the cube method works until $N - p = N - 1$ elements of $\boldsymbol{\pi}(N-1)$ are integers. The landing phase consists of randomly deciding whether the last unit is drawn. The sample size is therefore equal to one of the two integers nearest to $N\pi$.

8.7.2 Stratification

Stratification can be achieved by taking $x_{kh} = \delta_{kh}\, n_h/N_h$, $h = 1, \ldots, H$, where $N_h$ is the size of stratum $U_h$, $n_h$ is the sample stratum size, and

$$\delta_{kh} = \begin{cases} 1 & \text{if } k \in U_h \\ 0 & \text{if } k \notin U_h. \end{cases}$$

In the first step, we use

$$u_k(1) = v_k(1) - \frac{1}{N_h} \sum_{\ell\in U_h} v_\ell(1), \quad k \in U_h.$$

The three strategies described in Section 8.7.1 for simple random sampling allow us to obtain directly a stratified random sample with simple random sampling within the strata. If the sums of the inclusion probabilities are not integers within the strata, the cube method randomly rounds the sample sizes of the strata so as to ensure that the given inclusion probabilities are exactly satisfied.

The interesting aspect of the cube method is that the stratification can be generalized to overlapping strata, which can be called "quota random design" or "cross-stratification". Suppose that two stratification variables are available, for example, in a business survey, "activity sector" and "region". The strata defined by the first variable are denoted by $U_{h.}$, $h = 1, \ldots, H$, and the strata defined by the second variable are denoted by $U_{.i}$, $i = 1, \ldots, K$.

Next, define the $p = H + K$ balancing variables

$$x_{kj} = \pi_k \times \begin{cases} I\left[k \in U_{j.}\right] & j = 1, \ldots, H \\ I\left[k \in U_{.(j-H)}\right] & j = H+1, \ldots, H+K, \end{cases}$$

where $I[\cdot]$ is an indicator variable that takes the value 1 if the condition is true and 0 otherwise. The sample can now be selected directly by means of the cube method. The generalization to a multiple quota random design follows immediately. It can be shown (Deville and Tillé, 2000) that the quota random design can be exactly satisfied.
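The $p = H + K$ balancing variables can be assembled mechanically from the two stratum labels. The sketch below is our own illustration; it assumes hypothetical integer labels $0, \ldots, H-1$ and $0, \ldots, K-1$ for "activity sector" and "region".

```python
def quota_balancing_variables(pik, sector, region, H, K):
    """Build x_kj = pi_k * I[k in U_j.] for j < H and
    x_kj = pi_k * I[k in U_.(j-H)] for j >= H (one row per unit k)."""
    X = []
    for k, p in enumerate(pik):
        row = [p if sector[k] == h else 0.0 for h in range(H)]
        row += [p if region[k] == i else 0.0 for i in range(K)]
        X.append(row)
    return X

# Balancing on these variables fixes (up to rounding) the sample count of
# each row stratum and each column stratum: the j-th balancing total is
# the sum of pi_k over the corresponding stratum.
```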

8.7.3 Unequal Probability Sampling with Fixed Sample Size

The unequal inclusion probability problem can be solved by means of the cube method. Suppose that the objective is to select a sample of fixed size $n$ with inclusion probabilities $\pi_k$, $k \in U$, such that $\sum_{k\in U} \pi_k = n$. In this case, the only balancing variable is $x_k = \pi_k$. In order to satisfy this constraint, we must have

$$\mathbf{u} \in \operatorname{Ker}\mathbf{A} = \left\{ \mathbf{v} \in \mathbb{R}^N \;\Big|\; \sum_{k\in U} v_k = 0 \right\},$$

and thus

$$\sum_{k\in U} u_k(t) = 0. \qquad (8.11)$$

Each choice, random or not, of vectors $\mathbf{u}(t)$ that satisfy (8.11) produces another unequal probability sampling method. Nearly all existing methods, except the rejective ones and the variations of systematic sampling, can easily be expressed by means of the cube method. In this case, the cube method is identical to the splitting method based on the choice of a direction described in Section 6.2.3, page 102.
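One splitting/flight-phase step under the fixed-size constraint can be sketched as follows. This is a simplified illustration of our own, assuming no component of $\boldsymbol{\pi}$ is yet 0 or 1:

```python
import random

def flight_step(pik, u, rng=random):
    """One cube/splitting step for a direction u with sum(u) = 0, so that
    sum(pik) = n is preserved.  lambda1 and lambda2 are the largest steps
    keeping every component of pik in [0, 1]; the choice
    q1 = lambda2 / (lambda1 + lambda2) preserves E[pik] (martingale)."""
    lam1 = min((1.0 - p) / uk if uk > 0 else p / (-uk)
               for p, uk in zip(pik, u) if uk != 0)
    lam2 = min(p / uk if uk > 0 else (1.0 - p) / (-uk)
               for p, uk in zip(pik, u) if uk != 0)
    if rng.random() < lam2 / (lam1 + lam2):
        return [p + lam1 * uk for p, uk in zip(pik, u)]
    return [p - lam2 * uk for p, uk in zip(pik, u)]
```

After each step, at least one more component is rounded to 0 or 1 while $\sum_{k} \pi_k = n$ is kept exactly.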

The techniques of unequal probability sampling can always be improved.

Indeed, in all the available unequal probability sampling methods with fixed sample size, the design is only balanced on a single variable. Nevertheless, two balancing variables are always available, namely, $x_{k1} = \pi_k$, $k \in U$, and $x_{k2} = 1$, $k \in U$. The first variable implies a fixed sample size and the second one implies that

$$\hat{N}_{HT} = \sum_{k\in U} \frac{S_k}{\pi_k} = N.$$

In all methods, the sample is balanced on $x_{k1}$ but not on $x_{k2}$. The balanced cube method allows us to satisfy both constraints approximately.


8.8 Variance Approximations in Balanced Sampling

8.8.1 Construction of an Approximation

The variance of the Horvitz-Thompson estimator is

$$\operatorname{var}\left(\hat{Y}_{HT}\right) = \sum_{k\in U}\sum_{\ell\in U} \frac{y_k}{\pi_k}\frac{y_\ell}{\pi_\ell}\,\Delta_{k\ell} = \check{\mathbf{y}}'\boldsymbol{\Delta}\check{\mathbf{y}}, \qquad (8.12)$$

where $\check{y}_k = y_k/\pi_k$, $\Delta_{k\ell} = \pi_{k\ell} - \pi_k\pi_\ell$, and $\boldsymbol{\Delta} = [\Delta_{k\ell}]$. Matrix $\boldsymbol{\Delta}$ is called the variance-covariance operator. Thus, the variance of $\hat{Y}_{HT}$ can theoretically be expressed and estimated by using the joint inclusion probabilities. Unfortunately, even in very simple cases like fixed sample sizes, the computation of the $\pi_{k\ell}$ is practically impossible.

Deville and Tillé (2005) have proposed approximating the variance by supposing that the balanced sampling can be viewed as a conditional Poisson sampling. A similar idea was also developed by Hájek (1981, p. 26; see also Section 7.5.1, page 139) for sampling with unequal probabilities and fixed sample size. In the case of Poisson sampling, which is a sampling design with no balancing variables, the variance of $\hat{Y}_{HT}$ is easy to derive and can be estimated because only first-order inclusion probabilities are needed. If $\tilde{S}$ is the random sample selected by a Poisson sampling design and $\tilde{\pi}_k$, $k \in U$, are the first-order inclusion probabilities of the Poisson design, then

$$\operatorname{var}_{\text{POISSON}}\left(\hat{Y}_{HT}\right) = \sum_{k\in U} \frac{y_k^2}{\pi_k^2}\, \tilde{\pi}_k (1 - \tilde{\pi}_k). \qquad (8.13)$$

Note that Expression (8.13) contains $\pi_k$ and $\tilde{\pi}_k$ because the variance of the usual estimator (a function of the $\pi_k$'s) is computed under Poisson sampling (a function of the $\tilde{\pi}_k$'s). The $\pi_k$'s are always known, but the $\tilde{\pi}_k$'s are not necessarily known.
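Expression (8.13) is a plain weighted sum and is straightforward to compute; a minimal sketch (the function name is ours):

```python
def var_poisson(y, pik, pik_tilde):
    """Poisson variance (8.13) of the HT estimator built with the pi_k's,
    evaluated under a Poisson design with probabilities pi~_k."""
    return sum((yk / pk) ** 2 * pt * (1.0 - pt)
               for yk, pk, pt in zip(y, pik, pik_tilde))
```

When $\tilde{\pi}_k = \pi_k$, this reduces to $\sum_k y_k^2 (1-\pi_k)/\pi_k$, the usual Poisson variance.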

If we suppose that, under Poisson sampling, the vector $(\hat{Y}_{HT} \;\; \hat{\mathbf{X}}'_{HT})'$ has approximately a multinormal distribution, we obtain

$$\operatorname{var}_{\text{POISSON}}\left(\hat{Y}_{HT} \,\Big|\, \hat{\mathbf{X}}_{HT} = \mathbf{X}\right) = \operatorname{var}_{\text{POISSON}}\left(\hat{Y}_{HT}\right) - \operatorname{cov}\left(\hat{Y}_{HT}, \hat{\mathbf{X}}_{HT}\right)' \left[\operatorname{var}\left(\hat{\mathbf{X}}_{HT}\right)\right]^{-1} \operatorname{cov}\left(\hat{\mathbf{X}}_{HT}, \hat{Y}_{HT}\right), \qquad (8.14)$$

where again $\pi_k$ and $\tilde{\pi}_k$ both appear because we compute the variance of the usual Horvitz-Thompson estimator (function of $\pi_k$) under the Poisson sampling design (function of $\tilde{\pi}_k$).

If $b_k = \tilde{\pi}_k(1 - \tilde{\pi}_k)$, Expression (8.14) can also be written

$$\operatorname{var}_{\text{APPROX}}\left(\hat{Y}_{HT}\right) = \sum_{k\in U} b_k \left(\check{y}_k - \check{y}_k^*\right)^2, \qquad (8.15)$$

where

$$\check{y}_k^* = \check{\mathbf{x}}'_k \left( \sum_{\ell\in U} b_\ell \check{\mathbf{x}}_\ell \check{\mathbf{x}}'_\ell \right)^{-1} \sum_{\ell\in U} b_\ell \check{\mathbf{x}}_\ell \check{y}_\ell.$$

When the only balancing variable is $x_k = \pi_k$, balanced sampling amounts to sampling with unequal probabilities and fixed sample size. The approximation of variance given in (8.15) is then equal to the approximation given in Expression (7.13), page 138. In this case, $\check{y}_k^*$ is simply the mean of the $\check{y}_k$'s with the weights $b_k$.
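Approximation (8.15) is a $b$-weighted regression residual sum of squares, which can be sketched as follows (a helper of our own, using NumPy):

```python
import numpy as np

def var_approx(y, pik, X, b):
    """Approximation (8.15): regress ycheck = y/pi on xcheck = x/pi with
    weights b, then sum the b-weighted squared residuals."""
    pik = np.asarray(pik, float)
    b = np.asarray(b, float)
    yc = np.asarray(y, float) / pik                 # ycheck_k
    Xc = np.asarray(X, float) / pik[:, None]        # xcheck_k
    A = Xc.T @ (b[:, None] * Xc)                    # sum_l b_l xcheck xcheck'
    beta = np.linalg.solve(A, Xc.T @ (b * yc))
    return float(np.sum(b * (yc - Xc @ beta) ** 2))
```

With stratum indicators as balancing variables and constant weights within strata, $\check{y}_k^*$ reduces to the per-stratum mean of the $\check{y}_k$'s, as noted above.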

The weights $b_k$ unfortunately are unknown because they depend on the $\tilde{\pi}_k$'s, which are not exactly equal to the $\pi_k$'s. We thus propose to approximate the $b_k$'s. Note that Expression (8.15) can also be written

$$\operatorname{var}_{\text{APPROX}}\left(\hat{Y}_{HT}\right) = \check{\mathbf{y}}'\boldsymbol{\Delta}_{\text{APPROX}}\check{\mathbf{y}},$$

where $\boldsymbol{\Delta}_{\text{APPROX}} = [\Delta_{k\ell}^{\text{app}}]$ is the approximated variance-covariance operator and

$$\Delta_{k\ell}^{\text{app}} = b_k \delta_{k\ell} - b_k \check{\mathbf{x}}'_k \left( \sum_{i\in U} b_i \check{\mathbf{x}}_i \check{\mathbf{x}}'_i \right)^{-1} \check{\mathbf{x}}_\ell b_\ell, \qquad (8.16)$$

with $\delta_{k\ell} = 1$ if $k = \ell$ and 0 otherwise. Four variance approximations can be obtained by various definitions of the $b_k$'s. These four definitions are denoted $b_{k1}, b_{k2}, b_{k3}$, and $b_{k4}$ and permit the definition of four variance approximations denoted $V_\alpha$, $\alpha = 1, 2, 3, 4$, and four variance-covariance operators denoted $\boldsymbol{\Delta}_\alpha$, $\alpha = 1, 2, 3, 4$, by replacing $b_k$ in (8.15) and (8.16) with, respectively, $b_{k1}, b_{k2}, b_{k3}$, and $b_{k4}$.

1. The first approximation is obtained by considering that, at least for large sample sizes, $\pi_k \approx \tilde{\pi}_k$, $k \in U$. Thus, we take $b_{k1} = \pi_k(1 - \pi_k)$.

2. The second approximation is obtained by applying a correction for the loss of degrees of freedom:

$$b_{k2} = \pi_k(1 - \pi_k) \frac{N}{N - p}.$$

This correction allows obtaining the exact expression for simple random sampling with fixed sample size.

3. The third approximation is derived from the fact that the diagonal elements of the variance-covariance operator of the true variance are always known and are equal to $\pi_k(1 - \pi_k)$. Thus, by defining

$$b_{k3} = \pi_k(1 - \pi_k) \frac{\operatorname{trace}\boldsymbol{\Delta}}{\operatorname{trace}\boldsymbol{\Delta}_1},$$

we can define the approximated variance-covariance operator $\boldsymbol{\Delta}_3$ that has the same trace as $\boldsymbol{\Delta}$.

4. Finally, the fourth approximation is derived from the fact that the diagonal elements of $\boldsymbol{\Delta}_{\text{APPROX}}$ can be computed and are given in (8.16). The $b_{k4}$'s are constructed in such a way that $\Delta_{kk} = \Delta_{kk}^{\text{app}}$, or in other words, that

$$\pi_k(1 - \pi_k) = b_k - b_k \check{\mathbf{x}}'_k \left( \sum_{\ell\in U} b_\ell \check{\mathbf{x}}_\ell \check{\mathbf{x}}'_\ell \right)^{-1} \check{\mathbf{x}}_k b_k, \quad k \in U. \qquad (8.17)$$

The determination of the $b_{k4}$'s then requires the resolution of a nonlinear equation system. This fourth approximation is the only one that provides the exact variance expression for stratification.

In Deville and Tillé (2005), a set of simulations is presented which shows that $b_{k4}$ is indisputably the most accurate approximation.
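System (8.17) can be attacked with a simple fixed-point iteration, $b \leftarrow \pi(1-\pi) + b^2\, \check{\mathbf{x}}'(\sum_\ell b_\ell \check{\mathbf{x}}_\ell\check{\mathbf{x}}'_\ell)^{-1}\check{\mathbf{x}}$. This scheme is our own suggestion, not the book's, and is only a sketch for well-behaved cases:

```python
import numpy as np

def solve_bk4(pik, X, n_iter=200):
    """Iterate b <- pi(1-pi) + b_k^2 * xcheck_k' A(b)^{-1} xcheck_k,
    whose fixed point satisfies (8.17); started from b_k1 = pi(1-pi)."""
    pik = np.asarray(pik, float)
    Xc = np.asarray(X, float) / pik[:, None]
    b = pik * (1.0 - pik)
    for _ in range(n_iter):
        A = Xc.T @ (b[:, None] * Xc)
        h = np.einsum('ki,ij,kj->k', Xc, np.linalg.inv(A), Xc)
        b = pik * (1.0 - pik) + b ** 2 * h
    return b
```

For stratification this converges to $b_{k4} = n_h(N_h-n_h)/(N_h(N_h-1))$, the value that makes the approximation exact.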

8.8.2 Application of the Variance Approximation to Stratification

Suppose that the sampling design is stratified; that is, the population can be split into $H$ nonoverlapping strata denoted $U_h$, $h = 1, \ldots, H$, of sizes $N_h$, $h = 1, \ldots, H$. The balancing variables are

$$x_{k1} = \delta_{k1}, \ldots, x_{kH} = \delta_{kH},$$

where

$$\delta_{kh} = \begin{cases} 1 & \text{if } k \in U_h \\ 0 & \text{if } k \notin U_h. \end{cases}$$

If a simple random sample is selected in each stratum with sizes $n_1, \ldots, n_H$, then the variance can be computed exactly:

$$\operatorname{var}\left(\hat{Y}_{HT}\right) = \sum_{h=1}^{H} N_h^2\, \frac{N_h - n_h}{N_h}\, \frac{V_{yh}^2}{n_h},$$

where $V_{yh}^2$ is the population variance of $y$ in stratum $U_h$:

$$V_{yh}^2 = \frac{1}{N_h - 1} \sum_{k\in U_h} \left( y_k - \bar{Y}_h \right)^2, \quad \text{with } \bar{Y}_h = \frac{1}{N_h} \sum_{k\in U_h} y_k.$$

It is thus interesting to compute the four approximations given in Section 8.8.1 in this particular case.

1. The first approximation gives

$$b_{k1} = \pi_k(1 - \pi_k) = \frac{n_h(N_h - n_h)}{N_h^2}, \quad k \in U_h.$$

2. The second approximation gives

$$b_{k2} = \frac{n_h(N_h - n_h)}{N_h^2} \frac{N}{N - H}.$$

3. The third approximation gives

$$b_{k3} = \pi_k(1 - \pi_k) \frac{\operatorname{trace}\boldsymbol{\Delta}}{\operatorname{trace}\boldsymbol{\Delta}_1} = \frac{n_h(N_h - n_h)}{N_h^2} \times \frac{\sum_{i=1}^{H} n_i(N_i - n_i)/N_i}{\sum_{i=1}^{H} (N_i - 1)\, n_i(N_i - n_i)/N_i^2}.$$

4. The fourth approximation gives

$$b_{k4} = \frac{n_h(N_h - n_h)}{N_h(N_h - 1)}, \quad k \in U_h.$$

Although the differences between $\operatorname{var}_{\text{APPROX1}}$, $\operatorname{var}_{\text{APPROX2}}$, $\operatorname{var}_{\text{APPROX3}}$, and $\operatorname{var}_{\text{APPROX4}}$ are small relative to the population size, $\operatorname{var}_{\text{APPROX4}}$ is the only approximation that gives the exact variance of a stratified sampling design.

8.9 Variance Estimation

8.9.1 Construction of an Estimator of Variance

Because Expression (8.15) is a function of totals, we can substitute each total with its Horvitz-Thompson estimator (see, for instance, Deville, 1999) in order to obtain an estimator of (8.15). The resulting estimator is:

$$\widehat{\operatorname{var}}\left(\hat{Y}_{HT}\right) = \sum_{k\in S} c_k \left( \check{y}_k - \hat{\check{y}}_k^* \right)^2, \qquad (8.18)$$

where

$$\hat{\check{y}}_k^* = \check{\mathbf{x}}'_k \left( \sum_{\ell\in S} c_\ell \check{\mathbf{x}}_\ell \check{\mathbf{x}}'_\ell \right)^{-1} \sum_{\ell\in S} c_\ell \check{\mathbf{x}}_\ell \check{y}_\ell$$

is the estimator of the regression predictor of $\check{y}_k$.

Note that (8.18) can also be written

$$\widehat{\operatorname{var}}\left(\hat{Y}_{HT}\right) = \check{\mathbf{y}}'\mathbf{D}\check{\mathbf{y}},$$

where $\mathbf{D} = [D_{k\ell}]$. Five definitions of the $c_k$'s allow defining five variance estimators by replacing $c_k$ in Expression (8.18) with, respectively, $c_{k1}, c_{k2}, c_{k3}, c_{k4}$, and $c_{k5}$.

1. The first estimator is obtained by taking $c_{k1} = (1 - \pi_k)$.

2. The second estimator is obtained by applying a correction for the loss of degrees of freedom:

$$c_{k2} = (1 - \pi_k) \frac{n}{n - p}.$$

This correction for the loss of degrees of freedom gives the unbiased estimator in simple random sampling with fixed sample size.

3. The third estimator is derived from the fact that the diagonal elements $\Delta_{kk}/\pi_k$ of the true matrix are always known and are equal to $1 - \pi_k$. Thus, we can use

$$c_{k3} = (1 - \pi_k) \frac{\sum_{k\in U} (1 - \pi_k) S_k}{\sum_{k\in U} D_{kk1} S_k},$$

where $D_{kk1}$ is obtained by plugging $c_{k1}$ into $D_{kk}$.

4. The fourth estimator can be derived from the $b_{k4}$'s obtained by solving the equation system (8.17):

$$c_{k4} = \frac{b_{k4}}{\pi_k}\, \frac{n}{n - p}\, \frac{N - p}{N}.$$

5. Finally, the fifth estimator is derived from the fact that the diagonal elements $D_{kk}$ are known. The $c_{k5}$'s are constructed in such a way that

$$1 - \pi_k = D_{kk}, \quad k \in U, \qquad (8.19)$$

or, in other words, that

$$1 - \pi_k = c_k - c_k \check{\mathbf{x}}'_k \left( \sum_{i\in U} c_i S_i \check{\mathbf{x}}_i \check{\mathbf{x}}'_i \right)^{-1} \check{\mathbf{x}}_k c_k, \quad k \in U.$$

A necessary condition for the existence of a solution to equation system (8.19) is that

$$\max_k \frac{1 - \pi_k}{\sum_{i\in U} S_i (1 - \pi_i)} < \frac{1}{2}.$$
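This necessary condition is cheap to check on the sampled units before attempting to solve (8.19); a small helper of our own:

```python
def ck5_condition(pik_sample):
    """Check max_k (1 - pi_k) / sum_{i in S} (1 - pi_i) < 1/2 over the
    sampled units, a necessary condition for system (8.19) to be solvable."""
    r = [1.0 - p for p in pik_sample]
    return max(r) / sum(r) < 0.5
```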

The choice of the weights $c_k$ is tricky. Although they are very similar, an evaluation by means of a set of simulations should still be run.

8.9.2 Application to Stratification of the Estimators of Variance

The case of stratification is interesting because the unbiased estimator of variance in a stratified sampling design (with a simple random sampling in each stratum) is known and is equal to

$$\widehat{\operatorname{var}}\left(\hat{Y}_{HT}\right) = \sum_{h=1}^{H} N_h^2\, \frac{N_h - n_h}{N_h}\, \frac{v_{yh}^2}{n_h},$$

where $v_{yh}^2$ is the sample variance of $y$ within stratum $h$.

It is thus interesting to compute the five estimators in the stratification case.

1. The first estimator gives

$$c_{k1} = (1 - \pi_k) = \frac{N_h - n_h}{N_h}, \quad k \in U_h.$$

2. The second estimator gives

$$c_{k2} = \frac{N_h - n_h}{N_h} \frac{n}{n - H}.$$

3. The third estimator gives

$$c_{k3} = \frac{N_h - n_h}{N_h} \times \frac{\sum_{i=1}^{H} n_i(N_i - n_i)/N_i}{\sum_{i=1}^{H} (n_i - 1)(N_i - n_i)/N_i}.$$

4. The fourth estimator gives

$$c_{k4} = \frac{N_h - n_h}{N_h - 1}\, \frac{n}{n - H}\, \frac{N - H}{N}, \quad k \in U_h.$$

5. The fifth estimator gives

$$c_{k5} = \frac{n_h(N_h - n_h)}{(n_h - 1)N_h}, \quad k \in U_h.$$

The five estimators are very similar, but $\widehat{\operatorname{var}}_{\text{APPROX5}}$ is the only one that coincides with the classical unbiased variance estimator of a stratified sampling design.
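A quick numerical check that $c_{k5}$ reproduces the classical stratified estimator; the function below implements (8.18) on the sampled units (helper names and the toy data are ours):

```python
import numpy as np

def var_est(y_s, pik_s, X_s, c):
    """Estimator (8.18): c-weighted regression of ycheck on xcheck over
    the sample, then the c-weighted sum of squared residuals."""
    pik_s = np.asarray(pik_s, float)
    c = np.asarray(c, float)
    yc = np.asarray(y_s, float) / pik_s
    Xc = np.asarray(X_s, float) / pik_s[:, None]
    A = Xc.T @ (c[:, None] * Xc)
    beta = np.linalg.solve(A, Xc.T @ (c * yc))
    return float(np.sum(c * (yc - Xc @ beta) ** 2))

# Two strata, (N1, n1) = (4, 2) and (N2, n2) = (3, 2), sampled y values;
# c_k5 = n_h (N_h - n_h) / ((n_h - 1) N_h) within each stratum.
y_s = [1.0, 3.0, 2.0, 5.0]
pik_s = [0.5, 0.5, 2 / 3, 2 / 3]
X_s = [[1, 0], [1, 0], [0, 1], [0, 1]]
c5 = [1.0, 1.0, 2 / 3, 2 / 3]
# Classical estimator: sum_h N_h^2 (N_h - n_h)/N_h * v_yh^2 / n_h
#                    = 8 + 6.75 = 14.75, matched by var_est with c5.
```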
