Development in small area estimation and major issues confronted


International Association for Official Statistics

United Nations Economic Commission for Africa

African

Statistical Association

JOINT IAOS/AFSA CONFERENCE

Addis Ababa, 22-24 May 1995

Managing Development in the 1990s and Beyond: New Trends In Statistics

IAOS/AFSA/IS.4/12

INVITED SESSION 4:

DATA FOR DIVERSE AREAS; THE GEOGRAPHIC DIMENSION IN DATA

DEVELOPMENTS IN SMALL AREA ESTIMATION AND MAJOR ISSUES CONFRONTED

By

Alfredo ALIAGA Macro International Inc

Calverton, Maryland

ADDIS ABABA MAY 1995


Developments in Small Area Estimation and Major Issues Confronted

Alfredo Aliaga

Macro International Inc.

Calverton, Maryland

April 1995


1. Introduction

Although national-level direct estimates of sociodemographic characteristics can be obtained from all Demographic and Health Surveys (DHS) sample surveys, small area direct estimates are usually not possible. This is because the samples for small geographical areas are usually not large enough to be representative of those areas; i.e., the national sample was not designed to be of sufficient size to permit estimates for small domains. Likewise, the level of clustering designed for the national sample is not adequate for small areas. Recently, however, program managers and policymakers in developing countries have expressed interest in the estimation of basic sociodemographic indicators for small areas in order to set targets, allocate resources, and monitor the performance of health and family planning programs.

Carrying out national surveys that are also representative of lower level geographical areas (or particular subclasses) is possible, but may not be cost effective. In addition, nonsampling errors tend to increase as the sample size increases. An alternative solution is to rely on indirect estimates for small areas. This paper reviews the various approaches to small area estimation and the major issues associated with them.

2. Data Sources

Population census. Population censuses (usually carried out at intervals of 10 years) provide baseline values for selected sociodemographic indicators. In DHS survey countries, census information can be used to determine basic characteristics for small areas; various procedures can then be applied to develop statistical estimates for the small areas selected. Census information can also be used to obtain censal weighting factors for characteristics not collected in the census, and it can be used as background information in modelling procedures.

Administrative records. Administrative records, e.g., vital statistics or health records, are an important source of information on small area statistics. The DHS program has not used them extensively, however, because such records are not often readily available in survey countries. In addition, administrative records are collected for programmatic rather than statistical purposes, so the definitions used by individual sources are generally not compatible and may not be easily used in combination with data having standardized definitions, such as DHS data. Thus, use of administrative records as a source of information for small area statistics may be limited.

Large household surveys. Large household surveys are often carried out by national statistical offices in survey countries. When available, data from these surveys can provide baseline information about small area statistics.

DHS surveys. The Demographic and Health Surveys (DHS) project provides information on selected sociodemographic and health indicators for the total population, for urban and rural populations, and for some major regional populations. Thus, the DHS surveys, either individually or in combination with other databases, provide sociodemographic and health data for small area estimation.

3. Major Issues in Small Area Estimation

Any sample survey is designed to provide large-domain estimation. The issues discussed below are critical for small areas but become less controversial for major domains. A good discussion of these issues can be found in Small Area Statistics: An International Symposium (Platek et al., 1987).

Definitions and concepts. Sometimes there is neither a common definition nor the same set of definitions and concepts among the data sources. For example, there can be differences in the eligibility of respondents for each data source, differences in the procedures used for data collection, and differences in the way the estimation approach is applied to the database. Such disparity makes it difficult to combine the data sources. These problems are more critical in developing countries.

Changes in government policies. Changes in government policies can influence the production of data from the sources. Such changes will be reflected in the population covered by the source, the depth of coverage on various issues, and the quality of data obtained, as reflected by the time and resources allocated for data collection.


Estimation procedure problems. Estimation procedures depend on the status quo in relation to the above issues. The type of estimation will depend on the information available in the country. In many developing countries, it is advisable to rely on simple procedures rather than complex ones. Complexity requires highly trained personnel, greater resource allocation, and intensive data processing.

Privacy. Given the differences between sources and the need for combining information from various databases, it is often necessary to examine individual records or batches of records. The issue of privacy and its implications is a legitimate concern of the public. Researchers should establish clear procedures to follow in order to maintain information privacy.

Size. A population census is the best source for estimating basic statistics for a domain, regardless of the size of the domain. However, more complex statistics, which are collected by limited studies (not censuses), cannot be estimated everywhere because of data limitations (no data or very little data). Sample surveys can play an important role by complementing such information; however, they cannot provide estimates for every domain size. A survey sample is usually designed to provide estimates for large domains. The definition of small area is therefore related to the sample design objectives and the sample size.

In general, the DHS program provides estimates of basic demographic parameters that pertain to the country as a whole, to urban and rural areas as separate subgroups, and to the major regions of the country. Areas that fall below the major regions are barely represented in the DHS surveys and end up with a small sample or none at all. In the DHS program, the term small area is restricted to the next lower geographical level (usually a province or district), which is not intended for data publication. Currently, the DHS program has no plans to develop estimates for small areas.

Estimation. Domain estimation is part of all sample surveys. The type of estimation procedure depends on the sample design. Because bias is difficult to measure, procedures that provide unbiased estimates are more acceptable than procedures that provide biased estimates. Some authors prefer estimation procedures that are consistent because, when the sample size becomes sufficiently large, the estimate is more reliable for the domain (i.e., the bias tends to be small). With such consistency, some authors try to guarantee desirable properties at least for a large sample.

Another general classification of estimation procedures is based on the selection probabilities of the sample design and/or the use of auxiliary data from other sources. A direct estimate is made from the survey data, using only the units in the small area domain. A modified direct estimate uses data from the main survey for other domains, on both the auxiliary and the study variables, but retains design properties such as consistency and lack of bias. Finally, there is growing research on indirect (or modelling) estimates, which use external data on both the auxiliary and the study variables from outside the domain and/or the time period of interest, with or without considering the survey design properties.

Evaluation. A generally accepted way to evaluate any estimate is to determine its relative error, calculated as the ratio of the standard error of the estimate to the value of the estimate. A major issue for any sample survey is evaluating a large number of estimates made with design estimators against a small number of indirect estimates that were made because of the small sample size.
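As a small illustration of the relative error described above, the sketch below divides a sampling error by its estimate. The function name is hypothetical, and the figures are the direct estimate and sampling error shown for the Distrito Nacional in Table 1.

```python
def relative_error(estimate: float, standard_error: float) -> float:
    """Relative error: standard error of the estimate divided by the estimate."""
    if estimate == 0:
        raise ValueError("relative error is undefined for a zero estimate")
    return standard_error / abs(estimate)


# Direct estimate 60.7 with sampling error 2.0 (Distrito Nacional, Table 1)
print(f"{relative_error(60.7, 2.0):.3f}")
```

A smaller relative error indicates a more precise estimate; the same calculation can be applied to direct and indirect estimates so that they are compared on a common scale.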

Several other issues are involved in estimation procedures that provide consistent estimates. Consistent estimates have desirable properties when calculated with a large sample size, but do not have them when the sample size is small. Again, a major issue is providing reliable measures of quality (sampling errors or mean square errors) when using these estimation procedures in a particular survey.

A major problem in using some model-based estimation procedures is model failure for some small areas. A similar situation occurs with direct estimation: despite having a reasonable sample size, there is the possibility of having an unreliable estimate due to sampling variation.

Another issue in small area estimation is associated with using information from different time periods. If the researcher is interested in looking at changes between single points in time, there is some question as to the reliability of the estimation of change. This problem is less evident when only a point-in-time estimate is required.


Ideally, the goal is to develop a reasonable estimation procedure that will satisfy every possible domain estimate under some quality procedure. A suggested evaluation procedure would be to produce a validation study to measure the performance of every direct versus indirect value under several environmental conditions. However, a great deal of effort would be required to carry out such a study.

Sample design. The sample design complies with the needs of the major domains. However, one must also consider clear approaches for other domains where estimation may be required in the near future. Estimation is well established for planned domains, but becomes more difficult for unplanned domains. Various compromises in the sample design are necessary to resolve the problem. Partial solutions include modifications to the sample size allocation strategy, to the clustering level, and to the sample stratification.

By knowing the small area in advance, the sample allocation can be modified through the sample design (and the area treated as a planned domain) by increasing the sample size. (This assumes the availability of sufficient human and financial resources.) If the study requires providing only national estimates, then a proportional allocation of the sample may be enough. However, if the sample is required to provide provincial or district estimates, then a comparable sample size among provinces could be achieved by assigning an equal sample size to each to obtain equal reliability. Increasing the sample size of the provinces as needed would provide a larger sample size than the one for the country as a whole, with proportional allocation. A similar situation could happen by considering provincial estimation first and then district estimation within provinces. Here the sample design will have to be a compromise between reliability at different domain levels. In general, the compromise will be between different types of allocations: proportional versus equal allocation, optimal versus proportional allocation, optimal versus equal allocation, etc.
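The contrast between proportional and equal allocation can be sketched as follows. This is only an illustration: the function names, the three provinces, their population sizes, and the total sample of 3,000 are all hypothetical.

```python
from typing import Dict


def proportional_allocation(populations: Dict[str, int], total_sample: int) -> Dict[str, int]:
    """Allocate the sample to each domain in proportion to its population."""
    total_population = sum(populations.values())
    return {
        domain: round(total_sample * size / total_population)
        for domain, size in populations.items()
    }


def equal_allocation(populations: Dict[str, int], total_sample: int) -> Dict[str, int]:
    """Give every domain the same sample size (aiming at equal reliability)."""
    per_domain = total_sample // len(populations)
    return {domain: per_domain for domain in populations}


# Hypothetical provinces and population sizes
provinces = {"A": 600_000, "B": 300_000, "C": 100_000}
print(proportional_allocation(provinces, 3000))
print(equal_allocation(provinces, 3000))
```

Under proportional allocation the smallest province receives a tenth of the sample, whereas equal allocation gives it a third; raising the small provinces to an adequate size without cutting the large ones is what increases the total sample beyond the proportional design.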

The clustering level is another factor that can be used to improve small area estimates. In general, the level of clustering that provides estimates for national and subnational domains is not adequate for providing representation for small-domain areas. Given the amount of resources available for the survey, a compromise solution is to decrease the clustering level at the first stages of the sample design. In the DHS program, the clustering level generally is adequate for national, subnational and regional levels. However, to improve the sample size for small areas, a larger number of smaller, first-stage units must be brought into the sample design.


Sample stratification can be used to determine more representative small areas by decreasing the strata size. When this is done in conjunction with decreasing the clustering level, the estimates for small unplanned domains will be improved.

Redefining small areas. If neither increasing the allocation level nor decreasing the clustering level helps to improve small domain estimates, a compromise would be to redefine the domain by collapsing the closest small areas and assigning enough sample size at that level. Such a compromise would improve the estimation process, with the hope that the redefined small area would provide information similar to that of the original area. This approach, which is convenient in practice, may be questionable, since the homogeneity condition over the redefined domain would also be assumed for the original small domain.

There could be a redefinition problem if the country is planning to change the boundaries of administrative domains in the near future, thereby also affecting the small area's boundaries. Such a possibility could be considered in the design by defining stable area units for selection (units not affected by such a change) that could later be joined together to form the redefined small domain.

Although the problems of redefinition discussed here are neither exhaustive nor exclusive, they may interact together and a solution to the problem of redefining small areas may not, in fact, be possible (particularly when the sample size was not originally designed to provide small domain estimates). If the need for small area estimation was not considered during the design process, it may be necessary at a later stage of analysis to develop estimation procedures for small domains.

4. Estimation Procedures

The following is a presentation of some approaches for small domain estimation. It is not the intention of the author to provide an exhaustive review of small area estimation; a good description of such procedures can be found in Small Area Statistics: An International Symposium (Platek et al., 1987).

Direct estimation. As discussed earlier, small area direct estimation is based on survey data only from the small area domain. A generally acceptable procedure is design-based, particularly if the procedure is unbiased (or approximately unbiased). However, such a procedure is not satisfactory when the sample size is small, because the estimates then have large sampling errors. It should be strongly emphasized that direct design-based estimation should be used only when the sample size is large enough to provide reliable estimation.

A general ratio estimation formula for a direct estimate is given as:

$$r = \frac{\sum_i w_i y_i}{\sum_i w_i x_i}$$

where

$w_i$ is the design weight factor for unit $i$;

$y_i$ is the value of $y$ for unit $i$; and

$x_i$ is the value of $x$ for unit $i$.

In the case where the variable $X$ is a counting variable, the ratio estimator reduces to the weighted mean of $Y$ and can be expressed as:

$$r = \frac{\sum_i w_i y_i}{\sum_i w_i}$$

When the population size $N$ is known, a direct estimate of an expanded total is given as:

$$\hat{Y} = N \cdot \frac{\sum_i w_i y_i}{\sum_i w_i}$$

In general, if the total value for variable $X$ is known for every stratum, we can define the expanded total as:

$$\hat{Y} = \sum_h X_h \, \frac{\sum_i w_{hi} y_{hi}}{\sum_i w_{hi} x_{hi}}$$

where

$X_h$ is the known total value of the auxiliary variable in stratum $h$, and the sums over $i$ run over the sample units in stratum $h$.
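The direct estimators described above (the weighted ratio, its weighted-mean special case, and the expanded totals) can be sketched in a few lines. The function names and the sample data are hypothetical, and the stratified total assumes the expanded total is the sum over strata of the known auxiliary total times the stratum ratio.

```python
from typing import Sequence, Tuple


def ratio_estimate(w: Sequence[float], y: Sequence[float], x: Sequence[float]) -> float:
    """Weighted ratio r = sum(w_i * y_i) / sum(w_i * x_i)."""
    numerator = sum(wi * yi for wi, yi in zip(w, y))
    denominator = sum(wi * xi for wi, xi in zip(w, x))
    return numerator / denominator


def weighted_mean(w: Sequence[float], y: Sequence[float]) -> float:
    """Special case in which x_i = 1 for every unit (X is a counting variable)."""
    return ratio_estimate(w, y, [1.0] * len(w))


def expanded_total(population_size: float, w: Sequence[float], y: Sequence[float]) -> float:
    """Expanded total when the population size N is known: N times the weighted mean."""
    return population_size * weighted_mean(w, y)


def stratified_total(strata: Sequence[Tuple[float, Sequence[float], Sequence[float], Sequence[float]]]) -> float:
    """Assumed form: sum over strata of X_h * r_h, with strata given as (X_h, w, y, x)."""
    return sum(x_total * ratio_estimate(w, y, x) for x_total, w, y, x in strata)


# Hypothetical sample: design weights and a 0/1 indicator of contraceptive use
weights = [100.0, 150.0, 250.0]
uses_method = [1.0, 0.0, 1.0]
print(weighted_mean(weights, uses_method))
print(expanded_total(1000.0, weights, uses_method))
```

For a small domain, the same functions would be applied to the sample units falling in that domain only, which is what makes the estimate direct and also what makes it unstable when few units are available.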

A regression direct estimate can be provided as an approximately design-unbiased estimator, and it is given by

$$\hat{Y}_{reg,a} = \hat{Y}_a + \hat{\beta}_a \, (X_a - \hat{X}_a)$$

where

$\hat{Y}_{reg,a}$ is the expanded predicted total for small domain $a$;

$\hat{\beta}_a$ is the coefficient in the regression for small domain $a$, calculated over the units $i$ in the sample for domain $a$ [the formula for $\hat{\beta}_a$ is not legible in the source];

$X_a$ is the total value of $X$ for small domain $a$;

$\hat{Y}_a$ is the expansion value in small domain $a$ for $Y$; and

$\hat{X}_a$ is the expansion value in small domain $a$ for $X$.

A modified approximate direct estimator can be provided by replacing the coefficient $\hat{\beta}_a$ with a value calculated over the entire sample, denoted $\hat{\beta}$.


Indirect estimate. This type of estimation is associated with the synthetic estimation procedure, which is based on the assumption that the small area is similar in some sense to another area, often a larger area in which it is contained. Although this type of estimate would have smaller variance, it may be biased if the assumption is violated.

The synthetic estimate for a small area can be written as:

$$\hat{Y}_i = \sum_j W_{ij} \, y_{.j}, \qquad W_{ij} = \frac{N_{ij}}{N_i}$$

where

$\hat{Y}_i$ = the estimated total or ratio value in the $i$th small area;

$y_{.j}$ = the observed total or ratio value in the $j$th auxiliary category;

$N_{ij}$ = the estimated number of units in the $j$th auxiliary category, for the $i$th area, as observed in the external source;

$N_i$ = the estimated number of units in the $i$th area, as observed in the external source; and

$W_{ij}$ = the adjustment weights.
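A minimal sketch of a synthetic estimate follows, assuming the common form in which the category values observed in the larger area are weighted by the small area's share of units in each category. The function name, the two auxiliary categories, and all figures are hypothetical, and the area total is taken here as the sum of the category counts.

```python
from typing import Dict


def synthetic_estimate(category_values: Dict[str, float], area_counts: Dict[str, float]) -> float:
    """Weight each category value y_.j by the area's share N_ij / N_i of that category."""
    area_total = sum(area_counts.values())  # N_i, assumed equal to the sum of the N_ij
    return sum(
        (count / area_total) * category_values[category]
        for category, count in area_counts.items()
    )


# Hypothetical prevalence by auxiliary category, observed in the larger area (survey)
prevalence_by_category = {"urban": 0.60, "rural": 0.40}
# Hypothetical number of women by category in the small area (external source)
women_in_area = {"urban": 2000.0, "rural": 8000.0}

print(round(synthetic_estimate(prevalence_by_category, women_in_area), 3))
```

The estimate is only as good as the assumption that each category behaves in the small area as it does in the larger one; if urban women in this area differ from urban women elsewhere, the result is biased no matter how small its variance.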

Among the indirect methods for obtaining small area estimates are those based on a model approach, in particular the regression approach. Generally, the strength of the regression approach is finding some stability in the model by using information near and/or around the small domain.

Aliaga and Muhuri (1994) have developed a modelling procedure (with several desirable characteristics) that estimates the contraceptive prevalence rate as a dependent variable associated with 11 independent variables:

■ Each variable is categorical at the individual level (two or more categories).

■ Each variable was transformed into a binary variable; the dichotomization was chosen to maximize the power to discriminate contraceptive use.

■ A continuous variable (proportions or percentages) was constructed at the next higher level (cluster) unit for each variable, i.e., each having the same scale and being continuous.


■ Around each small area domain, groups were formed by combining it with neighboring domains in a circular fashion, with around 30 clusters in each group. Each group contains from two to five small domains; going around the small domain in this way brings in information from every neighboring area that is likely to share similar behavior.

■ A regression model was fitted at the cluster level for each group, gaining stability by borrowing strength (each group having around 30 clusters).

■ The distribution of the ratio of the sample mean to the estimated sampling error can be approximated by the normal distribution.

■ An estimate of the prevalence rate was calculated for each group around the small domain area using the corresponding predicted values at cluster level.

■ A final estimation for the small domain was provided by averaging the estimates for all groups.
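The last steps of the procedure above (a cluster-level regression within each neighboring group, prediction for the small domain, and averaging over groups) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it uses a single cluster-level covariate instead of 11, takes an unweighted mean of the predicted values, and all names and data are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def domain_estimate(groups: List[Tuple[Sequence[float], Sequence[float]]],
                    domain_x: Sequence[float]) -> float:
    """Average, over groups, of the mean predicted value for the domain's clusters.

    groups: one (covariate values, prevalence values) pair per neighboring group,
            each observed at cluster level.
    domain_x: covariate values for the clusters of the small domain itself.
    """
    group_estimates = []
    for xs, ys in groups:
        intercept, slope = fit_line(xs, ys)
        predictions = [intercept + slope * x for x in domain_x]
        group_estimates.append(mean(predictions))
    return mean(group_estimates)


# Hypothetical cluster-level proportions for two neighboring groups
group_1 = ([0.2, 0.4, 0.6], [0.3, 0.5, 0.7])
group_2 = ([0.2, 0.4, 0.6], [0.2, 0.4, 0.6])
print(round(domain_estimate([group_1, group_2], [0.3, 0.5]), 3))
```

In practice each group would hold around 30 clusters, which is what gives each regression enough observations to be stable.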

No matter what model is used, several problems or constraints could be present. For instance, the estimation is based on the model assumptions, which can be subject to failure; also, the estimation should be restricted to some particular types of variables.

Combined estimators. When at least two different types of estimates have been calculated, an alternative procedure is to calculate an estimate that is the linear combination of both. For example, if there is both a direct and an indirect estimate, then a new estimate can be computed as:

$$r_c = a \, r_d + (1 - a) \, r_i$$

where

$a$ is a value between 0 and 1;

$r_c$ is the combined estimator;

$r_d$ is the direct estimator; and

$r_i$ is the indirect estimator.


The a value can be considered a weighting factor assigned to the direct estimator. Therefore, the term (1-a) is the weight for the indirect estimation. Among different solutions, a particular one is when a is proportional to the direct estimation's precision (inverse of its sampling error), i.e., (1-a) is proportional to the indirect estimation's precision. The combined estimate will always be relatively closer to the one having more precision, and of a value between the two.
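The precision-weighted solution described above can be sketched as follows, taking precision as the inverse of the sampling error, as in the text. The function name is hypothetical. The example uses the figures shown for the Distrito Nacional in Table 1 (direct 60.7 with sampling error 2.0, regression 61.7 with sampling error 1.8) and treats the regression estimate as the indirect component, which is an assumption made here for illustration.

```python
from typing import Tuple


def combined_estimate(direct: float, se_direct: float,
                      indirect: float, se_indirect: float) -> Tuple[float, float]:
    """Linear combination a * direct + (1 - a) * indirect.

    The weight a is proportional to the precision of the direct estimate,
    with precision taken as the inverse of the sampling error.
    Returns (combined value, a).
    """
    precision_direct = 1.0 / se_direct
    precision_indirect = 1.0 / se_indirect
    a = precision_direct / (precision_direct + precision_indirect)
    return a * direct + (1.0 - a) * indirect, a


value, weight = combined_estimate(60.7, 2.0, 61.7, 1.8)
print(f"a = {weight:.3f}, combined = {value:.1f}")
```

Because the regression estimate has the smaller sampling error in this example, it receives slightly more than half of the weight, and the combined value lies between the two estimates, closer to the regression one.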

Other procedures. There are a number of other estimation procedures based on statistical principles, such as the Bayesian approach, which is similar in form to the combined estimator, with a being a probability value. If past information is available to estimate the distribution of a, then a reasonable estimate for the small domain can be computed. Although the principle is quite attractive, a major difficulty is establishing such a prior distribution. Using the Bayesian approach via Gibbs sampling (generating a Gibbs sequence of random variables from conditional distributions, a computationally intensive process) can overcome this problem by empirically generating the posterior distribution of the estimate. Important issues in Gibbs sampling surround the implementation and comparison of the various ways to extract information from the Gibbs sequence.

5. Applications

Any estimation procedure based on the sample design is more acceptable than one that doesn't take into account the sample design. Among these, any procedure that is design-based and unbiased (or approximately unbiased) is more desirable than a biased one. For direct estimation, a design-based, unbiased (or approximately unbiased) procedure should be used as much as possible. As the domain level becomes smaller, however, such an approach becomes less reliable. Although constraints exist for small domains, the direct estimation procedure does give a general idea of the estimate for such critical domains (small sample size, i.e., small number of clusters).

In DHS surveys, for any domain having 30 or more clusters, the direct estimate procedure will provide a consistent and reliable estimate with a small (5%) chance that the estimate will be undesirable.

For domains having 10 to 30 clusters, the direct estimation procedure may be useful if combined with other outcomes. For domains having 10 clusters or less, the direct estimate procedure is highly unreliable; such estimates should not be used directly, but can be combined with other estimate types.


In the case of the provincial estimate for the Dominican Republic (see Table 1), and the district estimate for Kenya (see Table 2), it was found that:

1. The estimates for the Distrito Nacional and Santiago provinces in the Dominican Republic and Nairobi district in Kenya, each based on around 30 clusters or more, are quite consistent across the different approaches. Therefore, the direct estimate of contraceptive prevalence should be used without any other consideration.

2. For each province in the Dominican Republic and each district in Kenya with a sample of 10 to 20 clusters, the regression approach provides results quite consistent with the combined estimates. Therefore, the recommendation is to use the regression estimate as a reasonable choice when the absolute difference between the regression and the combined estimates is less than 10 percent of the regression estimate. Perhaps the combined estimate should be used when such a difference is larger than 10 percent.

3. For each province in the Dominican Republic and each district in Kenya with a sample of 10 clusters or less, an average value between the regression and the synthetic estimates would provide a more reasonable estimate.
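The three recommendations above can be read as a simple decision rule, sketched below. The function name and the exact cut-offs are assumptions: the text speaks of around 30 clusters or more, of 10 to 20 clusters, and of 10 clusters or less, so the treatment of the boundaries and of domains with 21 to 29 clusters is a choice made here for illustration only.

```python
from typing import Tuple


def choose_estimate(n_clusters: int, direct: float, regression: float,
                    combined: float, synthetic: float) -> Tuple[str, float]:
    """Pick an estimate for a domain according to its number of sample clusters."""
    if n_clusters >= 30:
        # Large domains: the direct estimate is used as is.
        return "direct", direct
    if n_clusters > 10:
        # Intermediate domains: regression, unless it differs from the
        # combined estimate by 10 percent of the regression value or more.
        if abs(regression - combined) < 0.10 * abs(regression):
            return "regression", regression
        return "combined", combined
    # Very small domains: average of the regression and synthetic estimates.
    return "regression/synthetic average", (regression + synthetic) / 2.0


print(choose_estimate(35, 60.7, 61.7, 61.2, 60.6))
print(choose_estimate(15, 48.7, 54.6, 52.6, 51.5))
print(choose_estimate(8, 50.0, 40.0, 45.0, 44.0))
```

Such a rule is only a starting point; as the conclusions stress, the final choice for domains with a critical number of clusters should follow a careful evaluation rather than an automatic calculation.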


Table 1. Direct, Indirect, Regression and Combined Estimates of Contraceptive Prevalence for Married Women: The Dominican Republic, 1991

                         Direct estimate (1)    Indirect       Regression estimate    Combined
Region and province      Value    Sampling      estimate (2)   Value    Sampling      estimate
                                  error                                 error

REGION 0                 60.7
Distrito Nacional        60.7     2.0           60.6           61.7     1.8           61.2
REGION I                 50.8
Peravia                  48.7     5.8           51.5           54.6     3.0           52.6
San Cristobal            55.9     3.7           53.7           52.3     2.6           47.9
Monte Plata              43.1     3.9           51.4           52.3     3.5           53.8
REGION II                61.0
Santiago                 62.1     3.5           62.1           60.7     4.4           61.4
Puerto Plata             62.8     5.3           62.0           58.6     2.1           59.8
La Vega                  59.4     4.4           61.5           58.0     2.0           58.4
Espaillat                56.2     9.1           62.0           57.3     3.5           57.0
Monseñor Nouel           61.7     1.5           61.2           56.7     2.0           59.6
REGION III               57.4
Salcedo                  53.4     5.6           57.1           56.1     3.4           55.1
Duarte                   57.2     4.5           56.1           55.3     2.3           55.9
Maria T. Sanchez         65.9     7.5           57.8           57.9     1.6           59.3
Samana                   44.6     10.2          55.0           57.9     1.6           56.1
Sanchez Ramirez          56.7     8.2           58.5           53.7     4.3           54.7
REGION IV                47.1
Barahona                 51.9     2.7           47.9           50.0     3.3           51.0
Pedernales               48.3     9.6           45.1           50.0     2.4           49.7
Bahoruco                 38.8     6.5           47.7           43.4     4.8           41.4
Independencia            44.8     9.1           46.2           49.1     2.5           48.2
REGION V                 50.6
La Romana                44.0     7.9           50.1           50.1     4.0           48.0
La Altagracia            57.5     3.8           51.4           49.8     4.1           53.8
El Seybo                 46.5     9.7           51.0           49.5     3.4           48.7
San P. de Macorís        54.2     3.5           52.3           50.8     3.3           52.5
Hato Mayor               58.0     7.0           51.9           49.6     2.7           51.9
REGION VI                39.7
San Juan                 34.1     8.2           39.5           41.5     6.6           38.2
Azua                     52.3     4.5           39.0           52.7     3.2           52.5
Elías Piña               32.1     5.9           39.9           40.5     8.0           35.7
REGION VII               58.6
Valverde                 53.9     2.4           59.1           58.4     2.9           55.9
Santiago Rodriguez       73.6     7.4           62.7           55.1     4.9           62.5
Dajabón                  51.8     5.6           56.5           55.8     6.3           53.7
Monte Cristi             62.8     4.4           56.3           59.3     2.6           60.6

(1) 1991 DHS survey data
(2) 1991 DHS survey data and 1991 Expanded Household Survey data


Table 2. Direct, Indirect, Regression and Combined Estimates of Contraceptive Prevalence for Married Women: Kenya, 1992

                         Indirect       Regression estimate
Province and district    estimate (2)   Value    Sampling
                                                 error

NAIROBI
Nairobi                  34.9           38.5     3.6
CENTRAL
Kiambu                   38.1           37.9     2.0
Kirinyaga                38.5           44.4     3.2
Murang'a                 39.2           38.8     2.5
Nyandarua                38.4           40.4     3.2
Nyeri                    39.0           43.3     2.8
COAST
Kilifi                   17.2           15.2     2.6
Kwale                    17.0           17.3     4.3
Mombasa                  17.0           15.0     2.0
Taita                    18.1           29.0     1.9
EASTERN
Embu                     40.0           42.1     2.4
Kitui                    39.6           41.6     2.7
Machakos                 40.0           41.0     1.7
Meru                     39.9           43.4     3.8
NYANZA
Kisii                    13.5           16.9     1.4
Kisumu                   13.3           14.8     1.3
Siaya                    13.7           10.1     1.5
South Nyanza             13.2           11.2     1.0
RIFT VALLEY
Baringo                  29.4           27.1     3.1
Elgeyo Marakwet          29.4           15.0     2.8
Kajiado                  28.8           41.5     4.6
Kericho                  29.1           19.3     1.6
Laikipia                 29.4           45.0     4.5
Nakuru                   29.2           34.0     5.2
Nandi                    29.3           23.3     2.2
Narok                    28.8           18.0     4.5
Trans Nzoia              29.2           15.1     2.5
Uasin Gishu              29.2           15.7     2.3
West Pokot               29.1           U        U
WESTERN
Bungoma                  13.2           11.8     1.8
Busia                    13.2           11.7     2.2
Kakamega                 U              14.4     1.7

The remaining three columns of the table are listed below in source order; the source does not preserve which row each value belongs to.

Direct estimate (1), value: 33.5 39.5 37.3 54.2 32.1 39.2 40.8 18.1 [illegible].8 24.5 30.7 40.2 47.2 41.3 40.4 36.3 13.8 21.5 17.8 8.5 6.1 29.6 12.5 16.7 52.0 24.1 68.5 47.2 16.7 22.6 28.9 13.4 0.0 13.7 9.4 16.1 14.9

Direct estimate (1), sampling error: 2.1 | 3.2 5.4 3.2 | 1.7 3.6 | 2.9 3.5 | 2.6 2.8 2.2 1.4 | 2.6 5.4 | 2.5 | 1.5 2.0

Combined estimate: 35.3 | 48.2 36.7 42.1 | 12.5 18.4 | 40.8 39.7 | 18.5 15.8 9.5 9.1 | 21.1 40.5 | 14.6 | 10.5 14.6

Note: Sampling errors for direct estimates were computed only for those districts with more than 9 PSUs.
U = Unknown (data not available)
(1) 1989 DHS survey data
(2) 1989 DHS survey data, and 1990 Population Census


6. Conclusions

Studies indicate that there is no single approach to small domain estimation that gives satisfactory results in all situations. The size and homogeneity of the small area domain, the availability of external information, whether the variable of interest is continuous or discrete, the explanatory power of any assumed relationship, and the power to discriminate event occurrences will all affect the properties of any estimate. An approach that is suitable in one situation may not be suitable in another.

Other types of estimates can be considered as possible alternatives (individually or combined).

For those domains having a critical number of clusters, the final estimate should be given only after very careful evaluation of the variable type and the neighboring conditions around the small domain.

Utilization of the combined estimate procedure should follow a careful evaluation and should not be done automatically (i.e., mathematically). Information available in the public sector may have been used for different purposes; therefore, combining different information sources must be done carefully to avoid possible controversy.


Bibliography

Aliaga, A. and T. Le. 1991. Methodology for small area estimation with DHS samples. In Proceedings of the Demographic and Health Surveys World Conference, Washington DC, 1991, Vol. 1, 497-512. Columbia, Maryland: IRD/Macro International Inc.

Aliaga, A. and P. Muhuri. 1994. Methods of estimating contraceptive prevalence rates for small areas: Applications in the Dominican Republic and Kenya. Methodological Reports No. 3. Calverton, Maryland: Macro International Inc.

Gonzalez, M.E. 1973. Use and evaluation of synthetic estimates. In Proceedings of the American Statistical Association (Social Statistics Section), 33-36. Washington, D.C.: American Statistical Association.

Platek, R., J.N.K. Rao, C.E. Särndal, and M.P. Singh, eds. 1987. Small area statistics: An international symposium. New York, NY: Wiley & Sons.
