

In the document WORLD CLIMATE PROGRAMME DATA and MONITORING (Pages 158-166)

SPECIAL TOPICS

SOFTWARE FOR DETECTING CHANGES IN HYDROLOGICAL DATA

Maciej Radziejewski & Zbigniew W. Kundzewicz

Note: The software described below was developed under the WCP-Water framework. It is free software implementing a number of parametric and non-parametric tests commonly used in hydrological studies. It does not currently support the resampling methods recommended in this report; it is hoped that the software will be extended at a future date to allow resampling methods to be used.

Hydrospect is a software package for detecting changes in long time series of hydrological data. It makes use of a set of eight different tests for change detection (linear regression, Mann-Kendall, distribution-free CUSUM, cumulative deviations, Worsley’s likelihood ratio, Kruskal-Wallis, Spearman’s rank coefficient, normal scores regression) and allows the user to create customized derived series, all in a user-friendly Windows environment.

The source code of Hydrospect, Version 1.0, has been written in Visual C. The package, delivered free to users, consists of the executable programme file hydrospect.exe, three sample data files and the User’s Manual (a .doc file in MS Word 97 format).

Hydrospect is a contribution to Project A (Analyzing Long Time Series of Hydrological Data and Indices with Respect to Climate Variability and Change) of the World Climate Programme - Water, prepared for the World Meteorological Organization. The software was developed by Maciej Radziejewski under the supervision of Zbigniew W. Kundzewicz.

The programme can be run on a personal computer compatible with the IBM PC standard; it requires a 486 or Pentium processor and the MS Windows 95 or 98 operating system.

Creating or editing the data set is not supported within Hydrospect: the package follows a simple philosophy of working with the data as they stand, without modification. Changes to the data (for example, correcting errors) can be made with another tool, e.g. a spreadsheet.

Hydrospect can read data from text files. The import feature has been designed for flexibility, taking a number of possible formats into account, and provision has been made for missing values and for lines containing comments and other information.
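As an illustration of the kind of import logic described above, the following sketch reads one value per line from a text file, skipping comment lines and flagging missing values. The comment characters, missing-value codes and column layout are assumptions made for the example; Hydrospect's actual import options may differ.

```python
def read_series(path, missing_codes=("-999", "-999.0"), comment_chars="#;"):
    """Read one value per line, skipping comments and flagging missing data.

    Missing observations are returned as None so that the series keeps
    its original length and time ordering.
    """
    series = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line[0] in comment_chars:
                continue  # skip blank lines and comment lines
            # assume the last whitespace-separated column holds the value
            token = line.split()[-1]
            series.append(None if token in missing_codes else float(token))
    return series
```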

If one wishes to study a time series with another program (e.g. to create a graph), one can save it to a text file. This is particularly useful because Hydrospect allows one to perform a number of operations on a time series, i.e. to derive new series from existing ones, and those derived series may be of use elsewhere too.

Hydrospect returns the value of the test statistic and the significance level actually achieved. A high value of the achieved significance level means that the hypothesis of no change is rejected in the light of the evidence, i.e. a change is detected at a high significance level.
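For instance, the Mann-Kendall test listed among the eight tests can be sketched as below. This is an illustrative implementation using the normal approximation without tie correction, not Hydrospect's actual code; the two-sided p-value it returns is small when a change is detected, so the "significance level actually achieved" in the sense used above roughly corresponds to 1 - p.

```python
import math

def mann_kendall(x):
    """Mann-Kendall trend test (normal approximation, no tie correction)."""
    n = len(x)
    # S counts concordant minus discordant pairs over all i < j
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)   # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # two-sided p-value from the standard normal distribution
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return s, z, p
```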

It is essential to emphasize that all the tests included in Hydrospect are based on strong assumptions about the time series: temporal independence for all the tests, and a normal distribution for some. The validity of these assumptions in a particular case governs the interpretation of the significance levels (returned to the user). If the assumptions are not fulfilled, the tests can only be interpreted as exploratory data analysis tools rather than rigorous statistical methods.

Hydrospect does not include tests for checking assumptions. In fact, the assumption of normality is rarely valid in the realm of heavily skewed hydrological data. Such data can, however, easily be transformed to normality by computing normal scores (see below), so that the tests can be applied to the transformed, normally distributed series. The assumption of temporal independence depends on the time step: a time series of annual values may consist of independent elements, whereas the same process represented by monthly or daily data is likely to show an increasing level of temporal dependence.
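The normal-scores transformation mentioned here can be sketched as follows. The Blom-type plotting position used below is an assumption made for the example; neither this report nor the text above specifies Hydrospect's exact formula.

```python
from statistics import NormalDist

def normal_scores(series):
    """Transform a series to normal scores: preserve the relative ranks,
    impose a standard normal (zero mean, unit variance) marginal distribution."""
    n = len(series)
    # order the indices by value (1 = smallest); ties keep original order here
    order = sorted(range(n), key=lambda i: series[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Blom-type plotting position mapped through the N(0,1) quantile function
        scores[i] = NormalDist().inv_cdf((rank - 0.375) / (n + 0.25))
    return scores
```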

Standard procedures, like ranking the observations in a sample or removing the annual regime from a series of flows, can be seen as producing a new, derived time series from an existing one. Hydrospect implements a number of such procedures. One can produce a number of derived series from any series: for example, one can compute annual means and monthly means for a series of daily data, as well as select a subseries consisting only of observations recorded in June. One can then rank the annual means, etc. The relationships between derived time series are presented in the left pane in the form of a tree.

A list of time series operations supported by Hydrospect includes:

- Derived series with ranks assigned to each observation.

- Aggregation, resulting in division of the series into subperiods and replacement of the values in each subperiod by a single value: the mean, the sum, the minimum, the maximum, the median or Tukey’s trimean.

- Creating a subseries of the time series of concern by restricting it to a subperiod or to a certain part of the year, e.g. December and January.

- Some tests for changes in the mean can be applied to detect changes in variance. This involves computing the distance of each value in the series from the overall mean and applying the test to the series of distances.

- Computing normal scores is supported in that the series is transformed in such a way that the marginal distribution becomes normal (with zero mean and unit standard deviation), while the relative ranks of the values are preserved.

- Deseasonalisation: the seasonal means (regime) are subtracted from each value and the remainder is divided by the seasonal standard deviation. The means and deviations are smoothed using harmonic functions.
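The deseasonalisation operation, in simplified form, can be sketched as below. Unlike Hydrospect, this sketch uses raw monthly means and standard deviations without the harmonic smoothing described above, and it assumes at least two observations per calendar month.

```python
from statistics import mean, stdev

def deseasonalize(values, months):
    """Subtract the seasonal mean from each value and divide by the
    seasonal standard deviation.

    values[i] is the observation for calendar month months[i] (1..12).
    Simplified sketch: raw monthly statistics, no harmonic smoothing,
    and each month must appear at least twice.
    """
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    mu = {m: mean(vs) for m, vs in by_month.items()}
    sigma = {m: stdev(vs) for m, vs in by_month.items()}
    return [(v - mu[m]) / sigma[m] for v, m in zip(values, months)]
```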

The Report menu entry allows one to create a text file containing the results of all the tests performed during a session.

References

Radziejewski, M. & Kundzewicz, Z. W., 1999. Hydrospect, Version 1.0, User’s Manual. Unpublished manuscript.

The software package Hydrospect and the User’s Manual are distributed free of charge by Dr Arthur J. Askew, Director of the Hydrology and Water Resources Department, World Meteorological Organization, Geneva, Switzerland (e-mail: askew_a@gateway.wmo.ch).

GLOSSARY

Acceptance region – Set of test statistic values for which the null hypothesis is accepted.

Alternative hypothesis – Counterpart to the null hypothesis (Section 5.2).

Autocorrelation – A series shows autocorrelation if there is dependency between one value and the next. This is also called serial correlation, or temporal dependence.
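As an illustration of this definition, a minimal sketch of the sample lag-1 autocorrelation coefficient:

```python
from statistics import mean

def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation: the correlation between the series
    and itself shifted by one time step."""
    m = mean(x)
    # covariance between neighbouring values, over the variance of the series
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(len(x) - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den
```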

Biased – An estimate is biased if the average value of the estimate differs from the true value of the quantity or parameter that it estimates.

Bootstrapping – A resampling method used for testing data (Section 5.4).

Box plot – A visual data summary which shows the smallest value, the three quartiles (see: quartiles) and the highest value (cf. Chapter 4).

Climate change – A long term alteration in the climate (Chapter 1, Appendix).

Climate variability – The inherent variability in the climate (Chapter 1, Appendix).

Confidence interval – Range of values that a parameter is likely to fall within. If the 95% confidence interval is (A, B) then there is a 95% chance that the true value lies between A and B.

Correlated – Variables are said to be correlated if they are related to each other in some way.

Correlation coefficient – A measure of the association of two variables. The correlation coefficient is the covariance of two random variables (or data sets) divided by the product of their standard deviations.

Covariance – A measure of the association between two variables. It is the expected value of the product of departures from the means (the mixed second central moment of two stochastic processes).

Critical value – Value that separates the rejection and acceptance regions for a test statistic.

Dependent – A variable is dependent if it is correlated with another variable.

Deseasonalize – Remove the seasonal variation in a time series (e.g. annual seasonal pattern).

Distribution-free approach – A method that does not require a specific probability distribution to be assumed (can be used for all distributions).

Entropy (information-theoretic) – Measure of the degree of uncertainty or disorder. (A concept, related to thermodynamic entropy, introduced in Shannon, C. E. (1948) A mathematical theory of communication, Bell System Tech. J., 27, 379-423, 623-659.)
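As an illustration of this definition, a minimal sketch of Shannon entropy for a discrete distribution:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution.

    Zero-probability outcomes contribute nothing (the 0 * log 0 = 0 convention)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A fair coin has one bit of entropy (maximum uncertainty for two outcomes), while a certain outcome has zero.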

Extremes (hydrological) – Values that are substantially different from the ordinary or usual (e.g. floods and droughts are extreme hydrological events).

Fourier transform – A time series technique that is used to break a series up into a sequence of sinusoidal waves.

Heterogeneity – A sample shows heterogeneity if the values come from a number of different distributions (e.g. corresponding to different underlying physical mechanisms).

Heteroscedasticity – Indicates that the variance of the data is not constant.

Homogeneity – Property whereby a whole sample comes from the same distribution.

Homoscedasticity – Indicates that the variance of the data is constant.

Independent – Data points are independent if they are not related to each other. If data are correlated, or show serial correlation or spatial correlation they are not independent.

Interpolation – Estimating values of a variable between known values.

Jump: see step change.

Kurtosis – A measure of the peakiness of a distribution. See moments.

Lag – A term used in time series methods. A lag 1 observation is the observation from the previous time step. A lag 2 observation is an observation from 2 time steps before.

Loess / LOWESS / locally weighted smoothing – A robust smoothing method (see Chapter 4).

Maximum likelihood – Method of parameter estimation in which the likelihood function is maximized.

Mean – The average of a series. See moments.

Median – The middle ranking value of a series. See second quartile.

Moments: Absolute moment – expected value of a power of a random variable. Central moment – expected value of a power of the difference between a random variable and its mean. The first (absolute) moment is the mean; the second central moment (about the mean) is the variance; the next two central moments are related to skewness and kurtosis. Mixed moments – see covariance.

Monotonic change – A change that is consistently in one direction (either always upwards or always downwards).

Multivariate analysis – A method for simultaneous analysis of a number of dependent variables (of time and / or space), including links between these variables.

Non-parametric test – A test that does not involve estimation of parameters. Rank-based tests are non-parametric.

Non-stationarity – Indicates that the distribution of a random variable is changing with time (e.g. increasing mean).

Normal scores – The expected values of a sample from the normal distribution.

Null hypothesis – Hypothesis to be tested.

One-sided test – Statistical test where the rejection region is located in one tail of the distribution.

Outliers – Values that appear unusual because they are distant from the bulk of data.

Parametric test - A test that involves estimation of one or more parameters (linear regression is a parametric test because it involves estimation of the gradient of change).

Persistence – Property of long memory of the system, whereby high or low values (excursions to high or low states) are clustered over longer time periods. Persistence is also referred to as autocorrelation or serial correlation.

Phase randomisation – Data generation technique that preserves the autocorrelation structure of the data series.

Power – A measure of how effective a test is. A test with high power is good at accepting the alternative hypothesis when the alternative hypothesis is true.

Principal component analysis – A method of analysis in which multivariate data can be simplified. A small number of new variables (principal components) can be used to represent the original dataset.

Probability density function – Function describing the distribution of data - it expresses the relative likelihood that a random variable attains different values.

Quartiles – The quartiles mark the ¼, ½ and ¾ positions in a dataset when the data are placed in size order. The interpretation of the quartiles is as follows: lower (first) quartile – a value whose probability of exceedance is 75%; second quartile (median) – a value whose probability of exceedance is 50%; upper (third) quartile – a value whose probability of exceedance is 25%.

Random variable – A numerically valued function defined over a sample space. Can be thought of as something that provides observations.

Rank – The position of a data point when values are ordered by size. An observation has rank r if it is the rth largest (or smallest) observation (depending on the direction of ordering).

Rejection region – If the value of the test statistics is in the rejection region, the null hypothesis is rejected.

Resampling – A method for testing data in which artificial data series are generated from the original dataset and these series are used to obtain an estimate of significance level.
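A minimal sketch of this idea, using permutation resampling (shuffling the series, which destroys any time ordering) to estimate the significance of an observed statistic. The statistic function, the number of resamples and the seed are left to the caller; this is one of several possible resampling schemes, not a prescription.

```python
import random

def resampling_significance(series, statistic, n_resamples=999, seed=1):
    """Estimate the significance (p-value) of an observed test statistic
    by permutation resampling: shuffle the series and count how often the
    shuffled statistic is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistic(series))
    count = 0
    work = list(series)
    for _ in range(n_resamples):
        rng.shuffle(work)  # one artificial series with the same values
        if abs(statistic(work)) >= observed:
            count += 1
    # +1 in numerator and denominator: the observed series counts as one resample
    return (count + 1) / (n_resamples + 1)
```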

Sample size – Number of data elements in the sample.

Seasonality – Seasonal (periodic) behaviour of a variable.

Segmentation – Method of finding abrupt changes in the time series by fitting a step-wise function (parameters: amplitude and time instant of a change).

Serial correlation: see autocorrelation.

Significance level – Probability of rejecting the null hypothesis when it is true (type I error).

Skewness – A measure of the asymmetry of a distribution. See moments.

Smoothing – Replacement of the raw series by a more regular function of time that has less variability.

Standardization – Transformation of data by subtracting the mean and dividing by the standard deviation.

Stationarity – Property of a random variable whose statistical properties do not change with time. (Strict stationarity – all joint distributions are invariant under shifts in time. Weak stationarity – the first and second moments are constant with time.)

Statistical test of a hypothesis – Formal procedure to check whether the hypothesis can be rejected in the light of available evidence.

Step change – Abrupt change in a time series.

Temporal dependence: see autocorrelation.

Test: see statistical test of a hypothesis.

Test statistic – Numerical value calculated from information contained in the sample. The test statistic is used to determine whether or not to reject the null hypothesis.

Tie – Situation where at least two elements in the data set have the same value.

Time series – A sequence through time of observed values of a variable. Time series techniques include spectral methods and ARMA (auto-regressive moving average) approaches and related tests.

Trend – Gradual change in a variable. Often assumed to be monotonic.

Two-sided test – Statistical test where the rejection region is located in two tails of the distribution.

Type I error for a statistical test – Rejecting the null hypothesis when it is true.

Type II error for a statistical test – Accepting the null hypothesis when it is false.

Univariate analysis – Analysis in which a single dependent variable is considered (of time and/or space).

Unbiased (estimate) – An estimate is unbiased if its average (expected) value is equal to the quantity or parameter it estimates.

Variance – A measure of the spread in a sample/distribution. (See moments).

Variogram (also called semivariogram) – A technique used with spatial data to show how the variance of the difference between values at two points changes as a function of the distance between them.
