Statistical testing by resampling methods


5.4 Statistical testing by resampling methods

Resampling methods such as permutation testing or the bootstrap are robust methods for estimating the significance level of a test statistic. A useful practical text on resampling methods and permutation tests is provided by Good (1993). Efron & Tibshirani (1998) and Davison & Hinkley (1997) describe bootstrapping methods. Resampling methods are very useful for testing hydrological data because they require relatively few assumptions to be made about the data, yet they are also quite powerful tests.

5.4.1 Understanding resampling

The basic idea behind resampling methods is very straightforward. Consider testing a series for trend: a possible test statistic is the regression gradient. If there is no trend in the data (the null hypothesis) then the order of the data values should make little difference. Thus shuffling (permuting) the data series should not change the gradient very much. Under a permutation approach the data are shuffled very many times. After each shuffle (permutation) the test statistic is recalculated. After very many permutations, the original test statistic is compared to the generated test statistic values. If the original test statistic is rather different from most of the generated values then this suggests that the ordering of the data affects the gradient and thus that there was trend. If the original test statistic lies somewhere in the middle of the generated values then it seems reasonable that the null hypothesis was correct (the order of the values does not matter, so there is no evidence of trend). In other words, if an observer (or in this case, the statistical test) can distinguish between the original data and the resampled (permuted) data, then the observed data are judged not to satisfy the null hypothesis.
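The idea can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the function names, the synthetic series and the choice of 1000 shuffles are all assumptions made for the example.

```python
import random

def gradient(values):
    """Least-squares slope of the values against their positions 0, 1, 2, ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    return sxy / sxx

def permuted_gradients(values, n_permutations=1000, seed=0):
    """Gradients obtained from n_permutations random shuffles of the series."""
    rng = random.Random(seed)
    shuffled = list(values)
    results = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        results.append(gradient(shuffled))
    return results

# Hypothetical series: an upward trend of 0.5 per time step plus random noise.
data_rng = random.Random(1)
series = [0.5 * t + data_rng.gauss(0, 1) for t in range(30)]

observed = gradient(series)
generated = permuted_gradients(series)

# Fraction of shuffled series whose gradient is at least as large as the observed one.
fraction_as_large = sum(g >= observed for g in generated) / len(generated)
print(observed, fraction_as_large)
```

For a series with a strong trend such as this one, very few (if any) shuffled series should reproduce a gradient as large as the original, which is what marks the original ordering as unusual.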

5.4.2 Permutation and the bootstrap

The bootstrap and permutation methods are two slightly different approaches to resampling the data. In permutation methods the data are reordered, each of the data points in the original data series appearing once in each resampled (generated) data series. In bootstrap methods, the original data series is sampled with replacement to give a new series with the same number of values as the original data. With this method, the generated series may contain more than one of some values in the original series and none of other values. In both cases, the generated series has the same distribution as the empirical (i.e., observed) distribution of the data.

The bootstrap is generally, but not always, less powerful than a permutation test (Good, 1993). However, bootstrap methods are often to be preferred where a test is looking for change in variance. Further, permutation tests cannot be applied with test statistics that do not change when the data are permuted, e.g. tests for which the test statistic is the mean or median. The tests given here can be used with either method. In general, bootstrap methods are more flexible than permutation methods and can be used in a wider range of circumstances.
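The difference between the two resampling schemes can be illustrated with a short Python sketch. The function names and the six data values are hypothetical and serve only to show the mechanics.

```python
import random

def permutation_resample(values, rng):
    """Reorder the series: every original value appears exactly once."""
    return rng.sample(values, len(values))

def bootstrap_resample(values, rng):
    """Draw len(values) values with replacement: repeats and omissions can occur."""
    return rng.choices(values, k=len(values))

rng = random.Random(42)
flows = [12.1, 8.4, 15.3, 9.9, 11.0, 7.6]  # hypothetical annual values

permuted = permutation_resample(flows, rng)
bootstrapped = bootstrap_resample(flows, rng)

# A permuted series always has the same mean as the original series,
# which is why the mean cannot serve as a permutation test statistic.
print(sum(permuted) / len(permuted), sum(flows) / len(flows))

# The mean of a bootstrapped series is free to differ from the original.
print(sum(bootstrapped) / len(bootstrapped))
```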

5.4.3 Determining significance level

To determine the significance level, the data are resampled, by either permutation or bootstrapping, a large number of times, S. For each of these generated series, the test statistic, T, is calculated to give S artificial values of T. These are then ordered as

T_1 ≤ T_2 ≤ … ≤ T_S (5.2)

If the original test statistic is T_0 and

T_k ≤ T_0 ≤ T_{k+1} (5.3)

then the probability of the test statistic being less than or equal to T_0 under the null hypothesis is approximately

p = k/S (5.4)

p may also be estimated as

p = (k + 0.5)/(S + 1) (5.5)

or even

p = (k + 1)/(S + 2) (5.6)

Assuming that large values of T indicate departure from the null hypothesis, the significance level for this test is then

100 × 2 min(p, 1 − p) % (5.7)
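As a rough sketch, the calculation in equations 5.3, 5.5 and 5.7 could be coded as follows. The function name is hypothetical, the estimate of p in equation 5.5 is chosen arbitrarily from the three alternatives, and ties are handled by counting every generated value that is less than or equal to the original statistic.

```python
from bisect import bisect_right

def significance_level(t_observed, t_generated):
    """Two-sided significance level in percent, following equations 5.3, 5.5 and 5.7."""
    ordered = sorted(t_generated)
    s = len(ordered)
    # k is the number of generated values less than or equal to the observed value.
    k = bisect_right(ordered, t_observed)
    p = (k + 0.5) / (s + 1)
    return 100 * 2 * min(p, 1 - p)

# Hypothetical generated statistics: the 999 values 1, 2, ..., 999.
generated_values = [float(i) for i in range(1, 1000)]

# An observed value in the extreme upper tail gives a small significance level.
print(significance_level(990.5, generated_values))  # close to 1.9

# An observed value in the middle of the generated values gives a level near 100%.
print(significance_level(500.5, generated_values))  # close to 99.9
```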

5.4.4 Number of resamples

The number of resamples that need to be generated depends on the level of significance required and on the degree of change seen in the data. Typically around 100 to 2000 resamples might be generated. The larger the number of resamples, the more accurate the estimate of significance. More resamples will be required to accurately determine significance levels of 1% than significance levels of 10%. A simple approach to check whether the number of resamples is sufficient is to rerun a test a few times and check that the required percentiles of the generated test statistic values are not varying too much.
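One possible form of this rerun check is sketched below in Python. The test statistic, the synthetic data, the five reruns and the simple percentile rule are all assumptions chosen for illustration; any order-dependent statistic could be substituted.

```python
import random

def upper_percentile_of_shuffles(values, statistic, n_resamples, seed, fraction=0.95):
    """Approximate upper percentile of the statistic over random shuffles."""
    rng = random.Random(seed)
    work = list(values)
    stats = []
    for _ in range(n_resamples):
        rng.shuffle(work)
        stats.append(statistic(work))
    stats.sort()
    return stats[int(fraction * (n_resamples - 1))]

def neighbour_product(values):
    """Hypothetical order-dependent statistic: sum of products of adjacent values."""
    return sum(a * b for a, b in zip(values, values[1:]))

data_rng = random.Random(0)
data = [data_rng.gauss(0, 1) for _ in range(40)]

for n_resamples in (100, 2000):
    # Rerun the resampling five times with different random seeds.
    estimates = [
        upper_percentile_of_shuffles(data, neighbour_product, n_resamples, seed)
        for seed in range(5)
    ]
    # The spread between reruns indicates how stable the percentile estimate is.
    print(n_resamples, max(estimates) - min(estimates))
```

If the spread between reruns is large relative to the difference that matters for the test, the number of resamples is probably too small.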

For permutation testing, all permutations could, theoretically, be evaluated. However, typically there are too many to be evaluated (for a series of length n there are n! permutations) and a random selection of possible permutations is used instead.

Note that if confidence intervals are required then sample sizes of 199, 999, 1999 etc. give exact confidence intervals (Faulkner & Jones, 1999, Appendix 2), e.g. for 199 samples, the 95% confidence interval is bounded by the 5th smallest and 5th largest values; for 1999 samples, the 95% confidence interval is bounded by the 50th smallest and 50th largest values.
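The rule can be written as a short Python sketch. The function name is hypothetical, and the check that the tail count is a whole number is an added assumption that reflects why sample sizes such as 199 or 1999 are convenient.

```python
def resampling_interval(generated, level=0.95):
    """Interval bounded by the m-th smallest and m-th largest generated values,
    where m = (S + 1) * (1 - level) / 2 must be a whole number."""
    ordered = sorted(generated)
    s = len(ordered)
    m_exact = (s + 1) * (1 - level) / 2
    m = round(m_exact)
    if m < 1 or abs(m_exact - m) > 1e-9:
        raise ValueError("sample size does not give an exact interval at this level")
    return ordered[m - 1], ordered[s - m]

# With 199 generated values the 95% interval uses the 5th smallest and 5th largest.
print(resampling_interval(range(1, 200)))   # (5, 195)

# With 1999 generated values it uses the 50th smallest and 50th largest.
print(resampling_interval(range(1, 2000)))  # (50, 1950)
```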

5.4.5 Resampling when data are not independent

The above resampling methods are applicable only in the case where it can be assumed that the data are independent. For series with serial dependency, or series with seasonal structure, different techniques should be used.

If data show serial correlation, or additional structure such as seasonality, then it is necessary that the generated series should replicate this structure. A straightforward means of achieving this is to permute, or bootstrap, the data in blocks. For example, for a 40 year series of monthly values, it would be sensible to treat the data as consisting of 40 blocks of one year. Each year's worth of data is left intact and is moved around together as a block, thus maintaining the seasonal and temporal dependencies within each year. The 40 blocks are then reordered many times. The resampled series would then preserve the original seasonality and serial correlation seen in the data. It is important that the size of the blocks should be sensibly selected. If there is seasonality then the block should contain a whole number of seasonal cycles. If there is autocorrelation then the block should be chosen so that data points one block apart are approximately independent. For the block-bootstrap there are also other more sophisticated methods of resampling such as the wild block bootstrap (Shao & Tu, 1995).
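A minimal Python sketch of simple non-overlapping block resampling is given below, using the 40 year monthly example. The function names are hypothetical, the values are placeholders that record only the year and month, and the more sophisticated schemes mentioned above are not attempted.

```python
import random

def split_into_blocks(values, block_length):
    """Cut the series into consecutive, non-overlapping blocks of equal length."""
    if len(values) % block_length != 0:
        raise ValueError("series length must be a multiple of the block length")
    return [values[i:i + block_length] for i in range(0, len(values), block_length)]

def block_permutation(values, block_length, rng):
    """Reorder whole blocks; the order of values within each block is unchanged."""
    blocks = split_into_blocks(values, block_length)
    rng.shuffle(blocks)
    return [value for block in blocks for value in block]

def block_bootstrap(values, block_length, rng):
    """Sample whole blocks with replacement to build a series of the same length."""
    blocks = split_into_blocks(values, block_length)
    chosen = rng.choices(blocks, k=len(blocks))
    return [value for block in chosen for value in block]

# Placeholder data: 40 years of monthly values, each labelled (year, month).
monthly = [(year, month) for year in range(40) for month in range(1, 13)]

rng = random.Random(7)
permuted = block_permutation(monthly, 12, rng)
bootstrapped = block_bootstrap(monthly, 12, rng)

# The first twelve values of each resampled series all come from a single year,
# with the months still in calendar order.
print(permuted[:12])
print(bootstrapped[:12])
```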

Section 5.5 discusses issues related to dependence more generally and points out alternative approaches.

Note that block-bootstrap and block-permutation methods can be used with rank-based tests, which, although distribution-free, would otherwise depend on assumptions of independence.

Note also that blocking methods can be useful when there is spatial dependency in a set of multi-site data that is to be tested as a group. In this case, the usual choice of blocks would be to group data across all sites that occurred in the same time interval. Experience suggests that allowance for multi-site dependencies can be very important for estimated significance levels (e.g. Robson et al., 1998).
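For multi-site data, one way to keep values from the same time interval together is to resample the time positions once and apply the same order to every site. The Python sketch below assumes a hypothetical layout in which each site has a series of equal length covering the same time intervals; the function and site names are illustrative only.

```python
import random

def resample_time_blocks(site_series, rng, with_replacement=False):
    """Apply one common reordering of time positions to every site."""
    lengths = {len(series) for series in site_series.values()}
    if len(lengths) != 1:
        raise ValueError("all sites must cover the same time intervals")
    n = lengths.pop()
    if with_replacement:
        order = rng.choices(range(n), k=n)   # bootstrap of time positions
    else:
        order = rng.sample(range(n), n)      # permutation of time positions
    return {site: [series[i] for i in order] for site, series in site_series.items()}

# Placeholder data: two sites, each value labelled with its year and site.
years = list(range(1990, 2000))
sites = {
    "site_a": [(year, "a") for year in years],
    "site_b": [(year, "b") for year in years],
}

rng = random.Random(3)
resampled = resample_time_blocks(sites, rng)

# At every position the two sites still hold values from the same year.
print([a[0] == b[0] for a, b in zip(resampled["site_a"], resampled["site_b"])])
```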

5.4.6 Summary of method for resampling

The basic method for carrying out a permutation or bootstrap test is as follows:

• Select one or more suitable test statistics (see Section 5.6).

• Calculate the test statistic for the observed data.

• Resample the data series many times (e.g. 1000) to generate new data series, using blocking methods if appropriate, and recalculate the test statistic for each of these series.

• Estimate the significance levels using Section 5.4.3.
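The steps above might be combined into a single routine along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, the test statistic (difference between the means of the two halves of the series) is chosen purely for illustration and is not taken from Section 5.6, and p is estimated with equation 5.5.

```python
import random
from bisect import bisect_right

def resampling_test(values, statistic, n_resamples=1000, block_length=1,
                    with_replacement=False, seed=0):
    """Return the observed statistic and its two-sided significance level (%)."""
    if len(values) % block_length != 0:
        raise ValueError("series length must be a multiple of the block length")
    rng = random.Random(seed)
    blocks = [values[i:i + block_length] for i in range(0, len(values), block_length)]
    t_observed = statistic(values)
    t_generated = []
    for _ in range(n_resamples):
        if with_replacement:
            chosen = rng.choices(blocks, k=len(blocks))  # block bootstrap
        else:
            chosen = rng.sample(blocks, len(blocks))     # block permutation
        series = [value for block in chosen for value in block]
        t_generated.append(statistic(series))
    t_generated.sort()
    k = bisect_right(t_generated, t_observed)
    p = (k + 0.5) / (n_resamples + 1)
    return t_observed, 100 * 2 * min(p, 1 - p)

def half_difference(values):
    """Illustrative statistic: mean of the second half minus mean of the first half."""
    half = len(values) // 2
    return sum(values[half:]) / (len(values) - half) - sum(values[:half]) / half

# Hypothetical series with an upward step of 3 units halfway through.
data_rng = random.Random(11)
series = ([data_rng.gauss(0, 1) for _ in range(20)]
          + [data_rng.gauss(3, 1) for _ in range(20)])

t_obs, level = resampling_test(series, half_difference)
print(t_obs, level)
```

Setting block_length to a value greater than 1 gives the blocked version of the test, and with_replacement=True switches from permutation to the bootstrap.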

Source: WORLD CLIMATE PROGRAMME DATA and MONITORING (pages 61-64)