Improving Single Variable and Multi-Variable Techniques for Estimating Missing Hydrological Data

ELSEVIER Journal of Hydrology 191 (1997) 87-105

Improving single-variable and multivariable techniques for estimating missing hydrological data

S. Bennis a,*, F. Berrada b, N. Kang c

a École de Technologie Supérieure, 4750, Henri-Julien, Montréal, Qué. H2T 2C8, Canada
b Université Aïn Chock, Casablanca, Morocco
c Hydro-Québec, 870, Maisonneuve Est, Montréal, Qué. H2L 4Y7, Canada

Received 3 April 1995; revised 4 December 1995; accepted 9 March 1996

Abstract

A highly efficient technique is developed to obtain the best least-squares approximation of missing hydrological data in the single-variable case. The technique is based on an appropriate weighting of the estimated values generated by two autoregressive processes operating, respectively, in the forward and backward directions of time.

For the multivariable case, the originality of the work presented here consists in the use of the linear regression model with variable coefficients to estimate missing data. As for the single-variable case, two multivariable regression models are calibrated recursively on available data preceding and following the period of missing data.

The use of the Kalman filter (KF) has improved the accuracy of the estimates of the first missing data, including the peak flow. For subsequent missing data, the confidence of the estimates is greater when using a static model identified by the ordinary least squares (OLS) technique. It has been found that there is a critical rank at which the performance of the KF and OLS techniques is inverted. When the period of missing data is shorter than the critical rank, only the KF technique is used. When the period of missing data extends past the critical rank, it is recommended that KF be used to estimate the first missing data and that the OLS technique then be used to estimate the data coming after the critical rank. © 1997 Elsevier Science B.V.

* Corresponding author.

0022-1694/97/$17.00 © 1997 Elsevier Science B.V. All rights reserved. PII S0022-1694(96)03076-4

1. Introduction

Engineers and specialists at Hydro-Québec, one of the hydroelectric companies in Canada, have to manage reservoirs with a capacity of 200 000 hm³ that produce 95% of the energy consumed in Quebec. To optimize hydroelectric production, hydrologists have to know accurately the amount of energy stored as water in the reservoirs. They are then able to use these data to plan and manage the hydroelectric plants according to four time horizons: very short, short, medium and long term.

Knowledge of the available hydrometric data is not sufficient to decide effectively on the optimal management system to adopt. It is indispensable to anticipate how phenomena will evolve in the future. Hydrologists have shown that missing and erroneous data can mislead forecasts and compromise the optimal management of water resources (Turgeon, 1985). In the context of new power plants, long series of hydrometric data are indispensable to evaluate the potential power of a site and to make an economic and secure design of spillways and control gates. Given the remoteness of these sites, especially those not yet instrumented, the series of available hydrometric data are very limited.

One of the objectives of this paper is to show how it is possible to use a dynamic model to estimate missing data as one usually does in the area of forecasting. We will especially show how to use two dynamic models, one operating forward in time and one backward in time, to estimate missing data for single-variable and multivariable models.

Although the new techniques proposed here present statistical advantages, the simple techniques commonly used have also been integrated into the software ValiDeb (Bennis et al., 1994). This integration allows, on the one hand, the comparison of results obtained by different techniques and, on the other hand, the availability of an alternative solution in case of failure of one of the techniques. Models for reconstituting monthly or annual flow series abound in the literature (Dragan et al., 1989), whereas daily flow reconstitution is much less well documented. The most widely used model takes the form of a multiple regression, where the flow at the station to be reconstituted is expressed as a function of the flows at the reconstituting stations (Haan, 1977). The parameters of the regression model are estimated by means of the classical least-squares method (Young, 1974). Generally speaking, the least-squares technique yields good results; however, it has three disadvantages:

1. the technique does not take into account the variation in the regression parameters as a function of time;

2. the independent variables are often mutually correlated, which violates one of the conditions of application of the technique;

3. the measurement errors rarely obey the Gaussian noise hypothesis used in the standard least-squares solution.

The latter point introduces a source of bias in estimating the parameters of the regression model used (O'Connell, 1980).

Other, more sophisticated algorithms, such as the instrumental-variable and maximum-likelihood methods, applied to chronological series artificially contaminated with random noise of null mean and constant variance, produce better, unbiased solutions than those obtained by the standard least-squares method (Young, 1970). Unfortunately, these algorithms do not produce any significant improvement in the case of an experimental signal contaminated with real noise of unknown statistical characteristics, which no longer satisfies the ideal hypotheses imposed (Davies, 1983).

Principal components analysis provides a solution to the problem of the mutual dependence of independent variables (McCuen and Snyder, 1986). Through axis rotation, new, truly independent orthogonal variables are found, called principal components, which are expressed as linear combinations of the initial variables. The weighting coefficients are the elements of the eigenvectors of the correlation matrix of the flows.

Ridge regression (Kachroo and Liang, 1992) can also be used to solve the problem of the multicollinearity of the system input data. Unfortunately, this technique requires an arbitrary weight factor which must be determined by trial and error, while our objective is to develop an automated methodology for general application. In addition, the weight factor used in ridge regression introduces a bias in the solution, which is no longer optimal in the least-squares sense. This bias, which evolves in the same direction as the weight factor, can be moderate in certain situations (Bruen and Dooge, 1984), but can become quite significant in others (Kachroo et al., 1992).

All the above techniques yield good results; however, they produce residuals with significant autocorrelation functions, at least for the first three lags (Bennis and Bruneau, 1993). This is explained by the strong correlation that exists between successive daily river flows, which is not explicitly taken into account in the models used. This correlation can be considered explicitly by including an autoregressive component in the flow estimation model. However, the use of previous flows as independent variables causes problems. First of all, because these variables are strongly correlated with the dependent variable (they are the same variable, shifted by 1, 2, 3, ..., n days), they acquire a very large weight during regression coefficient estimation, to the detriment of the other independent variables, which are the flows at neighboring stations. In addition, this type of model can be ineffective during a period of flooding, when the flow can vary considerably. It can also be ineffective in estimating missing data over a period of several successive days since, with this type of model, the flow estimate for day j depends on knowledge of the flows on days j - 1, j - 2, .... Thus, in order to determine the missing data for several successive days, these flows must be replaced with their estimated values.

The use of the generalized least-squares method can also be an effective way to eliminate the autocorrelation of residuals (Kachroo and Liang, 1992). This method consists of introducing a so-called 'whitening' filter to eliminate residual autocorrelation and transform the residuals into Gaussian noise with null mean and constant variance.

Beauchamp et al. (1989) compared the results obtained by a regression-based model with those of a model using time-series techniques to estimate missing data. They found that both techniques produce reasonably good estimates and forecasts of the flow. The regression model was found to have a significant amount of autocorrelation in the residuals, whereas the transfer-function model was able to eliminate this autocorrelation. Unfortunately, the stages necessary for the elaboration of a time-series model do not lend themselves well to systematic modelling and must be treated case by case.

Nash and Barsi (1983) have emphasized that the difficulty in modeling hydrological phenomena derives more from their non-stationarity than from their non-linearity. On the other hand, it is easy to verify that the coefficient of determination of the regression equation can vary considerably with the season and the year for a given station. It is clear, therefore, that the spatial correlation between the flows at neighboring stations varies as a function of time, and would be better represented by a non-stationary model.

The application of the Kalman filter (Kalman, 1960), which is used in hydrological forecasting, is justified by the fact that it explicitly takes into account the continuous evolution of the parameters as a function of time by means of the state equation of the system (Bergman and Delleur, 1985).

All the methods presented above are indirect, or involve spatial relationships, in that they take into account information collected outside the station which is missing the data. For the case where this information is not available, we propose direct methods, which use only information collected at the station to be reconstituted. This involves single-variable models. Autoregressive integrated moving average (ARIMA) models (Box and Jenkins, 1976) are appropriate for representing the evolution of flows and for reconstituting missing data. This type of model can yield good results as long as the number of missing values is not unduly high.

The problem of estimating missing data is different in certain respects from the problem of forecasting. The principal difference is that a measured target value must be met at the end of the missing data estimation process. Bennis and Bruneau (1993) have proposed a smoothing technique for hydrographs at the transition points between estimates and measurements. However, this heuristic method does not provide the optimal solution.

2. Models used

2.1. Multivariable model

In a very general way, the flow $Q_{k,t}$ at 'k', a station to be reconstituted, can be expressed as a function of the flows $Q_{i,t}$ of the 'n - 1' neighboring reconstituting stations by:

$$Q_{k,t} = \sum_{\substack{i=1 \\ i \neq k}}^{n} \sum_{j=1}^{m^i} b^i_j\, Q_{i,t-c_i-j+1} + v_{k,t} \tag{1}$$

where $c_i$ is the lag ($c_i$ positive) or the lead ($c_i$ negative), expressed in days, between the peak flow at reconstituting station 'i' and at the station of interest to be reconstituted 'k'; $b^i_j$ represents the coefficients of the model; $m^i$ represents the order of the autoregressive model at station 'i'; and $v_{k,t}$ is the term of uncertain disturbances corrupting the observations. The delays $c_i$ can be determined by using the cross-correlation functions between the flow time series of the station to be reconstituted and those of the neighboring reconstituting stations. The rank of the first significant value of this function indicates the delay $c_i$. As these delays are more significant in periods of flooding, one can also determine them through visual inspection of the graphs, mainly by comparing the dates of occurrence of the peak flow.

An autoregressive component of the measured flow at the reconstituted station in the model can lead to difficulties. The main problem resides in the use of estimated values in place of measured values when extrapolating the calibrated model to estimate missing data. This is why we have taken $i \neq k$.

An autoregressive component of the measured flow at the reconstituting stations in the model may be appropriate in some cases. This is, for example, the case when the reconstituting stations and the reconstituted station are on the same river at a certain distance from each other. In this case, the flow hydrograph measured at the downstream station is related to the flow hydrograph measured at the upstream station by physical laws. Flood-routing equations can be obtained under certain simplifying hypotheses from the Saint-Venant equations (Chow, 1988). As the order of the autoregressive model is equal to 2 in these equations, we suggest the systematic use of $m^i = 2$ in all applications.

When there is no causal relationship between the reconstituted station and the reconstituting stations, in other words, when the different stations are not situated on the same river, the use of the autoregressive component is no longer justified. In that case, we take $m^i = 1$. To keep the text general, we will continue with the order $m^i$ of the model.
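The regression of Eq. (1) can be illustrated with a short sketch. It is not the ValiDeb implementation: the function names `lagged_design_matrix` and `fit_ols` are hypothetical, a single common order is assumed for all stations, and only non-negative lags are handled.

```python
import numpy as np

def lagged_design_matrix(neighbor_flows, lags, order):
    """Build the regressors Q_{i,t-c_i-j+1}, j = 1..order, for each neighbor i.

    neighbor_flows: array of shape (T, n_neighbors) of daily flows.
    lags: non-negative integer lag c_i (days) per neighbor (leads not handled).
    order: common order m of the lagged component (assumption: same for all i).
    Returns (X, first_t), where row r of X corresponds to day first_t + r.
    """
    flows = np.asarray(neighbor_flows, dtype=float)
    T, n = flows.shape
    first_t = max(lags) + order - 1  # earliest day with all regressors available
    cols = []
    for i in range(n):
        for j in range(1, order + 1):
            shift = lags[i] + j - 1
            cols.append(flows[first_t - shift:T - shift, i])
    return np.column_stack(cols), first_t

def fit_ols(X, y):
    """Static least-squares coefficients b minimizing ||y - X b||^2."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

With `order=1` the design matrix reduces to the lagged neighbor flows alone, which corresponds to the case without an autoregressive component.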

A more condensed expression results from defining the vector of flows (or of any other hydrometeorological variable) to be reconstituted by:

$$Q(t) = [Q_{1,t},\ Q_{2,t},\ \ldots,\ Q_{n,t}]^T$$

The vector of coefficients $B(t-1)$, of dimension $[m \times n] \times n$ elements, is defined by:

$$B(t-1) = [0\ 0 \ldots 0;\ b^2_1\ b^2_2 \ldots b^2_m;\ b^3_1\ b^3_2 \ldots b^3_m;\ \ldots;\ 0\ 0 \ldots 0]^T$$

The diagonal matrix of flows $M$, of dimension $n \times (m \times n \times n)$, contains terms of the form:

$$d = [Q_{1,t-c_1}\ Q_{1,t-c_1-1} \ldots Q_{1,t-c_1-m^1+1};\ \ldots;\ Q_{n,t-c_n}\ Q_{n,t-c_n-1} \ldots Q_{n,t-c_n-m^n+1}]$$

Eq. (1) can then be generalized to all river sections, in the form:

$$Q(t) = M(t-1)\,B(t-1) + V(t-1) \tag{2}$$

where $V(t-1)$ is an n-vector containing $V_{1,t-1}, \ldots, V_{n,t-1}$. Denoting the vector $Q(t) = Z(t-1)$, Eq. (2) is written:

$$Z(t) = M(t)\,B(t) + V(t) \tag{3}$$

Eq. (3) can be interpreted as the measurement equation for the general formulation of the Kalman filter (Kalman and Bucy, 1961).

We assume that the parameters vary in a manner that can be described by the following stochastic matrix equation, called the random walk model:

$$B(t+1) = B(t) + W(t) \tag{4}$$

where $W(t)$ is the vector of uncertain disturbances 'driving' the system. The standard assumption in the foregoing formulation is that $W_t$ and $V_t$ are normally distributed white noise sequences with the following covariance matrices:

$$E[(W_t - \bar{W})(W_{t'} - \bar{W})^T] = S_t\,\delta_{tt'}, \qquad E[(V_t - \bar{V})(V_{t'} - \bar{V})^T] = R_t\,\delta_{tt'} \tag{5}$$

where $\bar{V} = E(V_t)$ and $\bar{W} = E(W_t)$.

$E(\cdot)$ designates the expectation; $S_t$ and $R_t$ denote the covariance matrices of the modelling noise and the measurement noise, respectively. The symbol $\delta_{tt'}$ stands for the Kronecker delta and $T$ denotes the transpose of a matrix. In addition, the two noises are assumed to be mutually independent, as expressed in the following equation:

$$E[(V_t - \bar{V})(W_{t'} - \bar{W})^T] = 0 \quad \forall\, t \text{ and } t' \tag{6}$$

The Kalman filtering algorithm (Fitch and McBean, 1991) allows, at each instant of time, the measurements to be filtered and the parameters of the model $B(t)$ to be identified. As for the recursive least-squares method (Bennis and Rassan, 1991), one can choose $\hat{B}(0)$ to be an arbitrary finite value ($\hat{B}(0) = 0$ seems most useful) and the corresponding estimation error covariance matrix to be a diagonal matrix with large diagonal elements (of the order of $10^3$ or larger), indicating little confidence in the initial estimate and no knowledge of the cross-covariance properties of the estimates. Experience shows that, initialized in this manner, the algorithm converges very fast.
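As an illustration of the recursive identification described above, the following sketch tracks the coefficients of a single regression equation under the random walk model of Eq. (4), using the textbook Kalman filter recursion. It is a simplified, assumed form (scalar measurement, constant noise covariances), and the function name `kf_random_walk_regression` and its parameters are hypothetical.

```python
import numpy as np

def kf_random_walk_regression(X, y, S, R, p0=1e3):
    """Track time-varying coefficients B(t) with B(t+1) = B(t) + W(t).

    X: (T, p) regressors (each row plays the role of M(t)); y: (T,) measured flows.
    S: (p, p) covariance of the parameter noise W; R: variance of the measurement noise V.
    p0: large initial variance, expressing little confidence in the initial guess B(0) = 0.
    Returns the filtered coefficients (T, p) and the one-step-ahead errors (T,).
    """
    T, p = X.shape
    b = np.zeros(p)            # initial estimate B(0) = 0
    P = p0 * np.eye(p)         # diagonal covariance with large elements
    coefs = np.empty((T, p))
    errors = np.empty(T)
    for t in range(T):
        P = P + S                          # time update of the covariance (random walk)
        h = X[t]
        err = y[t] - h @ b                 # error of the first extrapolation
        gain = P @ h / (h @ P @ h + R)     # Kalman gain for a scalar measurement
        b = b + gain * err                 # parameter update
        P = P - np.outer(gain, h) @ P      # covariance update
        coefs[t] = b
        errors[t] = err
    return coefs, errors
```

Setting `S` to zero reduces this recursion to recursive least squares, so the coefficients converge to the static OLS solution; a non-zero `S` lets them drift over the observation interval.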

2.2. Single-variable models

In the case where there are no reference stations on which to base the reconstitution of missing data, single-variable models must be used. The ARMA models (Box and Jenkins, 1976), which are very popular in forecasting, are used here to estimate these data.

2.2.1. The AR forecasting model

The AR model has been used systematically because of its simplicity and the possibility of using linear methods to estimate its parameters. The optimal order of the AR model may be determined, for each processed series, by using the Akaike criterion (Akaike, 1974). As we are concerned with an automated software program to estimate hydrometric data, we have arbitrarily fixed the order of the AR model at 2. This is justified by the need to use a parsimonious model (Abraham and Ledolter, 1983). In fact, we have preferred to take an order greater than 1 because, in the case of daily hydrometric data, there is always a correlation between the measurements taken on day j and on day j - 2. The presentation of the AR(2) model used here has been adapted to the general formulation of the Kalman filter (KF) algorithm (Fitch and McBean, 1991). The AR(2) model is presented in the form:

$$Q_t = b_1 Q_{t-1} + b_2 Q_{t-2} + w_{t-1} \tag{7}$$

where $Q_t$ represents the stream flow at time t and $w_{t-1}$ is the system noise. Suppose that we observe $Q_t$ through a noisy measurement:

$$Z_t = Q_t + v_t \tag{8}$$

with $v_t$ the measurement noise.

Let

$$X_t = [Q_t\ \ Q_{t-1}]^T, \quad \Phi_{t/t-1} = \begin{bmatrix} b_1 & b_2 \\ 1 & 0 \end{bmatrix}, \quad W_{t-1} = [w_{t-1}\ \ 0]^T, \quad H_t = [1\ \ 0]$$

then

$$\begin{bmatrix} Q_t \\ Q_{t-1} \end{bmatrix} = \begin{bmatrix} b_1 & b_2 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} Q_{t-1} \\ Q_{t-2} \end{bmatrix} + \begin{bmatrix} w_{t-1} \\ 0 \end{bmatrix} \tag{9}$$

and

$$Z_t = [1\ \ 0] \begin{bmatrix} Q_t \\ Q_{t-1} \end{bmatrix} + v_t \tag{10}$$

Eq. (9) and Eq. (10) can then be written as:

$$X_t = \Phi_{t/t-1} X_{t-1} + W_{t-1} \tag{11}$$

$$Z_t = H_t X_t + v_t \tag{12}$$

Eq. (11) represents the AR model in forward time. The equation of the AR model backward in time is obtained by interchanging t and t - 1 in the same equation. At each time t, the KF algorithm is used twice in series: the first time to filter the measurements and the second time to identify the model parameters, which are assumed to be variable. This is done for both models: the AR models in forward and backward time.
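The state-space form of the AR(2) model can be checked numerically with a few lines. This is only a sketch of the noise-free extrapolation of the state equation; the helper names `ar2_state_space` and `extrapolate` are hypothetical, and the coefficient values in the usage note are arbitrary.

```python
import numpy as np

def ar2_state_space(b1, b2):
    """Return the transition matrix and the measurement matrix of the AR(2) model."""
    Phi = np.array([[b1, b2],
                    [1.0, 0.0]])
    H = np.array([[1.0, 0.0]])
    return Phi, H

def extrapolate(Phi, x, steps):
    """Propagate the state X = [Q_t, Q_{t-1}] forward without noise.

    Returns the successive extrapolated flows (first component of the state).
    """
    x = np.asarray(x, dtype=float)
    flows = []
    for _ in range(steps):
        x = Phi @ x          # X_t = Phi X_{t-1}, with the noise set to its zero mean
        flows.append(x[0])
    return np.array(flows)
```

For example, with `b1 = 0.5`, `b2 = 0.3` and a starting state `[2.0, 1.0]`, the first extrapolated flow is `0.5 * 2.0 + 0.3 * 1.0`, which is the scalar AR(2) recursion written in matrix form.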

2.2.2. Fraser smoothing technique

Before presenting our proposed technique of missing data estimation for the single-variable case, we briefly recall the Fraser smoothing technique (Radix, 1984). To smooth data, the Fraser algorithm makes an optimal estimate of the value at time t from the N + 1 measurements taken at times $t_0, t_1, t_2, \ldots, t_N$.

The Fraser smoothing technique consists of the following three stages (Radix, 1984):

Stage one: apply the KF algorithm combined with the forward AR model, going from $t_0$ to t, using the initial conditions $P_{t_0}$ and $X_{t_0}$, where $P_{t_0}$ is the estimation error covariance matrix of $X_{t_0}$, in order to obtain $P_t^F$ and $\hat{X}_t^F$.

Stage two: apply the KF algorithm combined with the backward AR model, going from $t_N$ to t, using the initial conditions $P_{t_N}$ and $X_{t_N}$, to obtain $P_t^B$ and $\hat{X}_t^B$.

Stage three: calculate an optimal combination of the two estimates using the following relationships:

$$P_t^{-1} = (P_t^F)^{-1} + (P_t^B)^{-1} \tag{13}$$

and

$$\hat{X}_t = P_t\,[(P_t^F)^{-1} \hat{X}_t^F + (P_t^B)^{-1} \hat{X}_t^B] \tag{14}$$

The optimal estimate is therefore a linear combination of the forward and backward estimates. As the weighting coefficients are the precision matrices $(P_t^F)^{-1}$ and $(P_t^B)^{-1}$, the technique always favors the estimate with the smallest variance.

Relationships Eq. (13) and Eq. (14) are similar to those proposed by Winkler and Makridakis (1983) to combine estimates provided by several models.
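The combination of Eq. (13) and Eq. (14) amounts to a precision-weighted average of the forward and backward estimates. A minimal sketch, with the hypothetical function name `combine_estimates`, could look as follows.

```python
import numpy as np

def combine_estimates(x_fwd, P_fwd, x_bwd, P_bwd):
    """Precision-weighted combination of forward and backward estimates.

    x_fwd, x_bwd: state estimates (vectors); P_fwd, P_bwd: their error covariance matrices.
    Returns the combined estimate and its error covariance matrix.
    """
    prec_fwd = np.linalg.inv(P_fwd)          # forward precision matrix
    prec_bwd = np.linalg.inv(P_bwd)          # backward precision matrix
    P = np.linalg.inv(prec_fwd + prec_bwd)   # Eq. (13)
    x = P @ (prec_fwd @ x_fwd + prec_bwd @ x_bwd)  # Eq. (14)
    return x, P
```

When the forward covariance is much smaller than the backward one, the result stays close to the forward estimate, which is the behavior described in the text: the estimate with the smallest variance is favored.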

2.2.3. Technique of interpolation

We propose adopting the same methodology to estimate missing data using two AR models.

The first one functions in the forward direction of time and is expressed by the relationship:

$$X_t^F = \Phi^F_{t/t-1} X^F_{t-1} + W^F_{t-1} \tag{15}$$

Exponent F designates the forward direction of time.

The second functions in the backward direction of time, and is expressed by the relationship:

$$X_t^B = \Phi^B_{t/t+1} X^B_{t+1} + W^B_{t+1} \tag{16}$$

Exponent B designates the backward direction of time.

Each missing datum can be estimated twice: the first time by extrapolating model Eq. (15) in the forward direction of time and the second time by extrapolating model Eq. (16) in the backward direction of time. The optimal use of these two estimates involves weighting them by means of the forecasting errors in both the forward and backward directions of time, using a relationship similar to Eq. (13) and Eq. (14). There is a difference between smoothing data using the Fraser technique and the technique of estimating missing data proposed here. For the former, the measurement available at time t is used in the smoothing process. For the latter, there is no measurement during the period of missing data and all values on the right side of Eq. (14) are forecasts.
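The interpolation principle can be sketched for the scalar flow as follows. This is a simplified illustration rather than the authors' algorithm: the AR(2) coefficients and the extrapolation error variances by rank are taken as given inputs, the weighting is the scalar counterpart of Eq. (13) and Eq. (14), and the names `ar2_extrapolate` and `fill_gap` are hypothetical.

```python
import numpy as np

def ar2_extrapolate(last_two, b1, b2, steps):
    """Successive AR(2) extrapolations starting from (most recent, previous) values."""
    q1, q2 = last_two
    out = []
    for _ in range(steps):
        q = b1 * q1 + b2 * q2
        out.append(q)
        q1, q2 = q, q1       # estimated values replace the missing measurements
    return np.array(out)

def fill_gap(before, after, n_missing, coef_fwd, coef_bwd, var_fwd, var_bwd):
    """Estimate a gap of n_missing values from both sides.

    before, after: measured flows preceding and following the gap (chronological order).
    coef_fwd, coef_bwd: (b1, b2) of the forward and backward AR(2) models.
    var_fwd[l-1], var_bwd[l-1]: error variance of the extrapolation of rank l.
    """
    fwd = ar2_extrapolate((before[-1], before[-2]), *coef_fwd, n_missing)
    # The backward model runs from the first measurement after the gap toward the past,
    # so its output is reversed to restore chronological order.
    bwd = ar2_extrapolate((after[0], after[1]), *coef_bwd, n_missing)[::-1]
    vf = np.asarray(var_fwd, dtype=float)[:n_missing]
    vb = np.asarray(var_bwd, dtype=float)[:n_missing][::-1]
    w_fwd = (1.0 / vf) / (1.0 / vf + 1.0 / vb)   # inverse-variance weight
    return w_fwd * fwd + (1.0 - w_fwd) * bwd
```

Near the start of the gap the forward extrapolation has the lower rank and therefore the larger weight; near the end of the gap the backward extrapolation dominates, which pulls the estimates toward the measured value that follows the gap.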

3. Estimation of error covariance matrix of successive missing data estimates

3.1. Single-variable model

As in the domain of forecasting, the accuracy die-off in multi-day missing data estimation refers to the decrease in estimate accuracy with increasing missing data period when using the ARMA model. It is to be expected, for example, that an estimate of missing data of the 3rd rank will be less accurate than an estimate of missing data of the 1st rank.

Indeed, when using the AR(2) model, the flow estimate for day j depends on knowledge of the flows on days j - 1 and j - 2. Thus, in order to determine the missing data for several successive days, these figures must be replaced with their estimated values.

When using Eq. (13) and Eq. (14) to estimate successive missing data, we have, on the one hand, to extrapolate Eq. (15) and Eq. (16) as many times as necessary to calculate successive values of $\hat{X}_t^F$ and $\hat{X}_t^B$ and, on the other hand, to calculate the corresponding precision matrices $(P_t^F)^{-1}$ and $(P_t^B)^{-1}$. The KF algorithm provides only the precision matrices of the first extrapolation. The succeeding precision matrices are estimated as follows.

The error covariance matrix is defined by:

$$P_t = E[(X_t - \hat{X}_t)(X_t - \hat{X}_t)^T]$$

where

$$X_t - \hat{X}_t = \begin{pmatrix} Q_t - \hat{Q}_t \\ Q_{t-1} - \hat{Q}_{t-1} \end{pmatrix} = \begin{pmatrix} e_t \\ e_{t-1} \end{pmatrix}$$

The error variance $\sigma^2(e_t)$ for an autoregressive model can be calculated for successive missing data estimates. It is easy to show (Box and Jenkins, 1976) that an autoregressive model of finite order may be transformed into a moving average model of infinite order of the form:

$$Q_t = \bar{Q} + \psi_0 a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \ldots \tag{17}$$

where $\bar{Q}$ is the mean flow, $a_t$ is a Gaussian noise of null mean and variance $\sigma_a^2$, and the $\psi_i$ are constants related to the AR model coefficients $b_i$.

The error variance of the extrapolation $l$ steps ahead is obtained by the relation (Box and Jenkins, 1976):

$$\sigma^2(e(l)) = \sigma_a^2\,(1 + \psi_1^2 + \psi_2^2 + \ldots + \psi_{l-1}^2) \tag{18}$$
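For an AR(2) model, the constants of Eq. (17), written `psi` below, follow the standard recursion psi_0 = 1, psi_1 = b1, psi_k = b1 psi_{k-1} + b2 psi_{k-2}, so the error variance of Eq. (18) can be tabulated by rank. The sketch below assumes this AR(2) case; the function names are hypothetical.

```python
import numpy as np

def psi_weights(b1, b2, n):
    """First n weights of the infinite moving-average form of an AR(2) model."""
    psi = [1.0]
    for k in range(1, n):
        prev1 = psi[k - 1]
        prev2 = psi[k - 2] if k >= 2 else 0.0
        psi.append(b1 * prev1 + b2 * prev2)
    return np.array(psi)

def extrapolation_variance(b1, b2, sigma_a2, max_rank):
    """Error variance of the extrapolations of rank 1..max_rank, as in Eq. (18)."""
    psi = psi_weights(b1, b2, max_rank)
    return sigma_a2 * np.cumsum(psi ** 2)
```

The returned variances are non-decreasing with the rank, which is the accuracy 'die-off' described above; their inverses can serve as the scalar weights when combining forward and backward estimates.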

3.2. Multivariable model

As for the single-variable model, each missing datum at the reconstituted station is estimated twice using the reconstituting stations. The first time, Eq. (2) is written in forward time:

$$\hat{Q}^F(t) = M(t-1)\,B^F(t-1) + V^F(t-1) \tag{19}$$

and the second time, Eq. (2) is written in backward time:

$$\hat{Q}^B(t) = M(t+1)\,B^B(t+1) + V^B(t+1) \tag{20}$$

The two estimates are weighted by the corresponding error covariance matrices as in Eq. (13) and Eq. (14). Because the reconstitution of missing data can be based on neighboring stations, we do not use the autoregressive component at the reconstituted station, in order to avoid the accuracy 'die-off' phenomenon already mentioned.

Then, a missing data estimate of any rank uses only data measured at the neighboring stations. When using a stationary model whose coefficients do not vary over time, the error covariance matrix is the same for all successive missing data estimates. In the case where the model coefficients vary over the observation time interval, the model accuracy will evidently decrease as the period of missing data lengthens. Unlike the AR model, there is no general rule for estimating the successive extrapolation errors of a multivariable model with variable coefficients following a random walk model. The KF algorithm gives only the error covariance matrix of the first extrapolation. The successive error covariance matrices are calculated here by simulation on the time series data used. At each time t, we use the identified model to calculate the extrapolation errors of estimating missing data of different ranks, and these errors are evaluated statistically at the end of the simulation. The error covariance matrices thus calculated are used to weight the estimates in forward and backward time when using Eq. (13) and Eq. (14).

4. Performance criteria

4.1. Coefficient of determination

The first comparison criterion is the sum of the quadratic deviations between the estimated, $Q_{est,j}$, and the observed, $Q_{obs,j}$, daily flows:

$$E_Q = \sum_{j=1}^{N} (Q_{est,j} - Q_{obs,j})^2 \tag{21}$$

Now, in order to be able to compare the methods independently of the value of the flow, the coefficient of determination $R^2$ will be used:

$$R^2 = 1 - \frac{E_Q}{\sigma_Q^2} \tag{22}$$

where $\sigma_Q^2$ is defined by

$$\sigma_Q^2 = \sum_{j=1}^{N} (Q_{obs,j} - \bar{Q}_{obs})^2 \tag{23}$$

with $\bar{Q}_{obs}$ the mean of the observed flows.
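The criterion can be computed in a few lines. The sketch below assumes the usual definition of the coefficient of determination, in which the normalizing term is the sum of squared deviations of the observed flows from their mean; the function name is hypothetical.

```python
import numpy as np

def coefficient_of_determination(q_est, q_obs):
    """R^2 = 1 - E_Q / sigma_Q^2 for estimated and observed daily flows."""
    q_est = np.asarray(q_est, dtype=float)
    q_obs = np.asarray(q_obs, dtype=float)
    e_q = np.sum((q_est - q_obs) ** 2)               # sum of quadratic deviations
    sigma_q2 = np.sum((q_obs - q_obs.mean()) ** 2)   # assumed normalizing term
    return 1.0 - e_q / sigma_q2
```

A perfect estimate gives a value of 1, while an estimate no better than the mean observed flow gives 0, so the criterion does not depend on the magnitude of the flow.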

4.2. Residual autocorrelation function

The second comparison criterion has to do with the residual autocorrelation function. The methods and solutions proposed are based on the hypothesis that the residuals take random values of null mean and constant variance. Consequently, we must verify that their autocorrelation function approaches zero.

Designating by $\rho_k$ the real value of the residual autocorrelation function and by $r_k$ its estimated value, hypothesis $H_0$ is:

$$\rho_k = 0, \quad k = 1, 2, 3, \ldots$$

Coefficient $t_k$ is calculated by the formula:

$$t_k = \frac{r_k - \rho_k}{s(r_k)}, \quad k = 1, 2, 3, \ldots$$

where $s(r_k)$ is the approximate estimation error of $r_k$.

In practice, if the absolute value of $t_k$ is less than 1.25 for delays 1, 2 and 3, and less than 1.6 for subsequent delays, then we can conclude that the residuals are independent (Pankratz, 1983).
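The test can be sketched as follows. The text does not give the expression of the approximate estimation error, so the sketch assumes Bartlett's approximation, s(r_k) = sqrt((1 + 2 * sum of r_j^2 for j < k) / n); the function names are hypothetical.

```python
import numpy as np

def residual_autocorrelation_t(residuals, max_lag):
    """Estimated autocorrelations r_k and t-values under H0 (rho_k = 0), k = 1..max_lag."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = e.size
    denom = np.sum(e ** 2)
    r = np.array([np.sum(e[k:] * e[:-k]) / denom for k in range(1, max_lag + 1)])
    # Assumption: Bartlett's approximation for the standard error of r_k.
    partial = np.concatenate(([0.0], np.cumsum(r[:-1] ** 2)))
    s = np.sqrt((1.0 + 2.0 * partial) / n)
    return r, r / s

def residuals_look_independent(t_values):
    """Apply the practical thresholds: 1.25 for lags 1 to 3, 1.6 for later lags."""
    t = np.abs(np.asarray(t_values, dtype=float))
    return bool(np.all(t[:3] < 1.25) and np.all(t[3:] < 1.6))
```

Residuals that alternate in sign from day to day, for instance, produce a large negative first autocorrelation and fail the test at the first lag.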

5. Application

ValiDeb software (Bennis et al., 1994) is designed for general application. It contains no predetermined parameter or hypothesis which would limit its use to a particular basin or region. For this study, the Saint Francis River basin was selected as the reference basin (Fig. 1). The basin extends over an area of 10 230 km², 85% of which is in Quebec and 15% in the United States.

Fig. 1. Saint Francis River - location of the stations.

The three stations considered are all situated on the main waterway of the Saint Francis River. The station coded as number 030203 by the Ministry of Environment of Quebec is situated farthest downstream of the Hemming Fall plant. It drains an area of 9610 km². Station 030204, situated 62.5 km upstream, drains an area of 8680 km². The last station, 030206, is situated 101.5 km upstream from the first station and drains an area of 4120 km². Station 030203 has been recording measurements since 1 January 1925. The data used in the present study end on 1 March 1989. Since station 030203 has the longest period of operation and no missing data have been observed during its operation, its period of operation is taken as the reference. Station 030204 was operational between 1 November 1935 and 24 October 1973, and 9632 missing data were observed relative to the reference period. Station 030206 was operational between 1 October 1938 and 1 October 1988, and 5759 missing data were observed for this station relative to the reference period.

6. Results

6.1. Multivariable model

As station 030203 contains no missing data, Q(t) in Eq. (2) is a 2 x 1 vector containing the measured flows at stations 030204 and 030206. When the station to be reconstituted is 030204, the reconstituting stations are 030203 and 030206. To estimate missing data at 030206, we use 030203 and 030204 as reconstituting stations. As measurements at stations 030204 and 030206 are sometimes missing simultaneously, these stations cannot always be used as reconstituting stations. For this reason, we use two models in parallel: one containing only data from the reconstituting station 030203, and the other containing data from two reconstituting stations, 030203 and either 030204 or 030206, depending on the station to be reconstituted. The order $m^i$ of the AR component in the multivariable model has systematically been taken equal to two, for the reason given in Section 2.1.

As the stations are very near to each other and the travel time between them is less than 6 h, all delay terms $c_i$ are null. This is confirmed by visual examination of the graphs as well as by calculation of the cross-correlation function between the different stations.

It is very important to distinguish between the error of model calibration and the error of missing data estimation, that is, the error of extrapolation of the model. It is for this reason that we have subdivided the series of data into two parts. The first part is used to calibrate the model according to the two techniques: ordinary least squares (OLS) and Kalman filter (KF). The second part is used to test the capacity of the calibrated model to estimate missing data. When the reconstituted station is 030206, the total common period that can be used for calibration extends from 1 October 1938 to 25 October 1973. Only data from 1 October 1938 to 1 January 1965 have been used for calibration. The remaining period has been used to validate the calibrated model according to the two techniques, OLS and KF. At the stage of model calibration, the coefficient $R^2$ = 0.877 for the OLS technique. The use of KF according to the (MISP) technique gives a coefficient of $R^2$ = 0.91 during the period of calibration. It is worth recalling that, when using the OLS technique, the coefficients of regression are obtained by inverting a matrix containing simultaneously all the data used for calibration. The KF technique, by contrast, starts with a tentative model, which may be obtained using the initial conditions considered in Section 2.1 or, more obviously, by first reading a part of the data and identifying a tentative model using the OLS technique (Bennis, 1987). The remaining data are used recursively, in chronological order, to modify, at each time t, the parameters of the model depending on the error of the first extrapolation. Therefore, with the OLS technique there is only one regression model, while with the KF technique we have, at each instant t, a new model, due to the fact that the parameters may vary over the observation interval. In the period of testing the OLS technique, namely the period between 1 January 1965 and 25 October 1973, the coefficient $R^2$ = 0.872 is the same as the one obtained during the period of calibration. For the KF technique, it is different. The coefficient $R^2$ obtained in the period of validation is only 0.87, which is much smaller than the value 0.91 obtained during the period of calibration and even smaller than the value 0.872 obtained by the OLS technique.

Fig. 2. Comparison between measured flows (time step in days, 1973/05/01 to 1973/07/01).

Considering the fact that the coefficients obtained with KF vary over the observation time interval and that they are recursively modified to minimixe the error of estimation of the first missing data, we have decided to proceed differently with the KF technique.

[Figure: measured, OLS and KF flows plotted against time in days, 1973/05/01 to 1973/07/01.]

Fig. 3. Comparison of the Kalman filter vs OLS in the estimation of missing data in 030206.

We use the identified model, at each time step, to calculate not only the error of estimation for the first missing data, which allows the modification of the coefficients of the model, but also the errors of estimation of the 2nd, 3rd, ... missing data, which in turn allow us to calculate, statistically, the accuracy of the estimation of missing data of different ranks. The use of the KF technique to estimate 11353 missing data of the 1st rank leads to a coefficient R² = 0.91. For the missing data of the 2nd rank the R² coefficient is 0.907, which is slightly smaller. For missing data of successive ranks the coefficient R² decreases and reaches the value R² = 0.8729 for the 30th rank. After the 30th rank, the R² coefficient continues to decrease and becomes smaller than the value obtained by the OLS technique. The critical rank for which there is an inversion of performance between the KF and OLS techniques depends on the case. It can be determined statistically by simulation for each case study. When the period of missing data is shorter than the rank for which there is an inversion of performance between the two techniques, the 30th in our example, we use only the KF technique. When the period of missing data extends past the critical rank, it is recommended that the KF be used to estimate the first missing data, the first 30 data in our example, and that the OLS technique then be used to estimate the data coming after the critical rank.
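The selection rule based on the critical rank can be sketched as follows. This is an illustration under stated assumptions: R² is taken to be the usual coefficient of determination, the critical rank is taken to be the last rank at which the KF value is still at least equal to the OLS value, and the function names are hypothetical.

```python
def r_squared(measured, estimated):
    """Coefficient of determination (assumed definition: 1 - SSE / SST)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def critical_rank(kf_r2_by_rank, ols_r2):
    """Last rank (1-based) before the KF R2 drops below the OLS R2; 0 if none."""
    rank = 0
    for r2 in kf_r2_by_rank:
        if r2 < ols_r2:
            break
        rank += 1
    return rank


def choose_technique(gap_length, crit):
    """Technique to apply to each missing value of a gap, rank by rank."""
    return ["KF" if k <= crit else "OLS" for k in range(1, gap_length + 1)]
```

For a gap no longer than the critical rank, choose_technique returns only "KF"; for a longer gap, the values past the critical rank are assigned to OLS.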

Up to now, we have examined only the impact of using a dynamic model in which the parameter vector may vary over the observation interval. We have not yet considered the situation where measurements are available before and after the period of missing data. To consider this case, we have simulated a period of missing data and have estimated it using the OLS and KF techniques. Fig. 2 shows the comparison of the series of flows measured at the three stations for this period. We have deliberately eliminated 44 successive measured flows, alternately at stations 030206 and 030204, and estimated them using the OLS and KF techniques.

[Figure: measured, OLS and KF flows plotted against time in days, 1973/05/01 to 1973/07/01.]

Fig. 4. Comparison of the Kalman filter vs OLS in the estimation of missing data in 030204.

Fig. 3 and Fig. 4 show the comparison of results obtained by the OLS and KF techniques to estimate these simulated missing data. It is evident from these figures that the KF gives the best fit between estimated and measured values. For Fig. 3 the two techniques overestimate the peak flow. However, the KF succeeds in reproducing it with a 6.7% error compared with a 69% error with the OLS technique. The OLS technique systematically overestimates all values of flow, thus overestimating the total volume by 25%. The KF underestimates the total volume by 9%, which is acceptable compared with the 25% overestimation produced by the OLS technique. This difference between the measured and estimated total volume of flow depends on the duration of the period of missing data. When we simulated the estimation of missing data on the total period of validation that extends from 1 January 1965 to 25 October 1973, the ratio between measured and estimated total volume of flow was 0.999 for the KF technique compared with 0.96 for the OLS technique. Unlike the OLS technique, the KF technique produces a continuity in the estimated hydrograph at the transition point between measured and estimated values.
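The two criteria used in this comparison, the error on the peak flow and the error on the total volume, can be computed as in the following sketch. It assumes a constant daily time step, so that the volume is proportional to the sum of the flows, and it evaluates the estimate on the day of the measured peak; both choices and the function names are assumptions made for illustration.

```python
def peak_error_percent(measured, estimated):
    """Signed relative error (%) of the estimate on the day of the measured peak."""
    day = measured.index(max(measured))
    return 100.0 * (estimated[day] - measured[day]) / measured[day]


def volume_error_percent(measured, estimated):
    """Signed relative error (%) of the total volume over the gap.

    With a constant time step the volume is proportional to the sum of flows.
    Positive values indicate overestimation, negative values underestimation.
    """
    total = sum(measured)
    return 100.0 * (sum(estimated) - total) / total
```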

In the case of the estimation of missing data at station 030204 (Fig. 4), the conclusions on the performance of the two techniques are similar. The KF overestimates the peak flow and the total volume by 7.2% and 1.5%, respectively. The OLS technique underestimates the peak flow and the total volume by 14.7% and 13.5%, respectively.

[Figure: measured and estimated flows plotted against time in days, 1973/05/01 to 1973/07/01.]

Fig. 5. Estimation of missing data with AR model. Station 030206 (one missing value: the peak flow).

The main reason we used the KF technique was to avoid autocorrelation of the residuals. Table 1 compares the residual autocorrelation functions obtained by the standard least-squares method and by the mutually interactive state parameter (MISP) estimation technique. After examining this table, we conclude that the objective has been achieved: the residual autocorrelation coefficients have been reduced very considerably through the use of the KF technique.
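The residual autocorrelation coefficient at a given delay can be computed as in the sketch below. It uses the standard sample estimator; the Student t statistic reported in Table 1 is not reproduced here, since the formula used for it is not stated in this section. The function name is hypothetical.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation coefficient r of the residuals at the given delay."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    denom = sum(d * d for d in dev)
    # Sum of products of deviations separated by `lag` time steps.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom
```

Coefficients close to zero at all delays indicate residuals that are close to independent.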

6.2. Single-variable model

Traditionally, to estimate missing data over a short period of time in the absence of reference stations, we proceed by linear interpolation. We have therefore adopted this approach as the baseline for assessing the performance of the autoregressive model coupled with the Kalman filter. The autoregressive model has been applied in the forward and backward directions of time, with the two estimates weighted as explained in Section 2.2.

Table 1
Comparison of residual autocorrelation function at station 030204

Delay   Least-squares multiple regression    Kalman filter
        r        Student t                   r        Student t
1       0.51     44.22                       0.08      5.84
2       0.27     19.09                       0.07      5.70
3       0.20     13.39                       0.02      2.11
4       0.14      9.54                       0        -0.59
5       0.11      7.19                       0        -2.19
6       0.09      6.16                      -0.04     -3.14
7       0.07      6.61                      -0.04     -3.40
8       0.06      6.34                      -0.03     -2.90
9       0.05      6.21                      -0.01     -1.48
10      0.02      6.18                       0        -0.90

[Figure: measured and estimated flows plotted against time in days, 1973/05/01 to 1973/07/01.]

Fig. 6. Estimation of missing data with AR model. Station 030206 (two successive missing values).

The most critical piece of missing information to estimate is the peak flow. Fig. 5 shows a simulation of the estimation of two peaks in the two hydrographs of flows measured at station 030206. The relative errors on the first and second peaks are 1.08% and 3.6%, respectively, compared with 14.6% and 10.15% using linear interpolation.

Results obtained by the single-variable model are in this case even better than those obtained by the multivariable model using 030204 and 030203 as the reconstituting stations to estimate the two missing data. The relative errors of estimation of the first and second missing data by such a multivariable model are, respectively, 5% and 9%.

Fig. 6 shows the simulation of estimating two successive missing data, the peak flow and the value that precedes it, for two hydrographs. The relative errors in estimating the first two values are 12% and 10.8% compared with 33% and 26% using linear interpolation. The relative errors in estimating the second two values are 0.68% and 4% compared with 6.7% and 13.5% using linear interpolation.
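The combination of the forward and backward estimates, and the linear interpolation used as the baseline, can be sketched as follows. The weighting of Section 2.2 is not reproduced in this section; the sketch assumes weights inversely proportional to the error variance of each estimate, which is one common way of combining two estimates, and all function names are hypothetical.

```python
def combine_estimates(forward, backward, var_forward, var_backward):
    """Weighted mean of forward and backward estimates of one missing value.

    Assumption: each weight is inversely proportional to the error variance
    of the corresponding estimate, so the more reliable estimate dominates.
    """
    w_forward = var_backward / (var_forward + var_backward)
    return w_forward * forward + (1.0 - w_forward) * backward


def linear_interpolation(before, after, gap_length):
    """Baseline: values on the straight line joining the two measured flows."""
    step = (after - before) / (gap_length + 1)
    return [before + step * (k + 1) for k in range(gap_length)]
```

For a gap of several values, the forward estimate is expected to be more reliable near the start of the gap and the backward estimate near its end, which the variances passed to combine_estimates would have to reflect.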

7. Conclusion

This work is concerned with improving the classical techniques for estimating missing data. The advantage of the classical methods, such as the standard least-squares method applied to multiple regression, is that they are simple, yet they offer a reasonable solution when reference stations are available. The disadvantage of the standard least-squares method is, however, that it produces autocorrelated residuals. In addition, the estimated values are biased and the total volume is either considerably overestimated or underestimated. The peak flow is estimated with poor accuracy. The more robust Kalman filter technique offers statistical advantages. The residuals are independent and their autocorrelation function approaches zero. This makes it possible to consider non-stationary models, the parameters of which vary as a function of the flow and therefore improve the accuracy in estimating missing data. The peak flow is well reproduced and the total volume of flow is preserved when considering long periods of missing data.

The use of the dynamic model with time-variable parameters has improved the accuracy in the estimation of the first missing data, including the peak flow. For subsequent missing data the confidence of the estimates is greater when using a static model identified by the ordinary least squares (OLS) technique. It has been found that there is a critical rank for which there is an inversion of performance between the KF and OLS techniques. When the period of missing data is shorter than the critical rank, we use only the KF technique. When the period of missing data extends past the critical rank, it is recommended that the KF be used to estimate the first missing data and that the OLS technique then be used to estimate the data coming after the critical rank.

Acknowledgements

We wish to thank the reviewers for several useful comments and suggestions. This project was supported by Hydro-Québec, the Natural Sciences and Engineering Council of Canada and the École de Technologie Supérieure.

References

Abraham, B. and Ledolter, J., 1983. Statistical Methods for Forecasting. John Wiley & Sons, New York.

Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Autom. Control, AC-19: 716-723.

Beauchamp, J.J., Downing, D.J. and Railsback, S.F., 1989. Comparison of regression and time-series methods for synthesizing missing streamflow records. Water Resour. Bull., 25(5): 961-975.

Bennis, S., 1987. Techniques de prévision des crues par l'analyse des séries temporelles. Ph.D. Thesis, Sherbrooke University, Canada.

Bennis, S. and Bruneau, P., 1993. Comparaison de méthodes d'estimation des débits journaliers. Rev. Can. Génie Civil, 20(3): 480-489.

Bennis, S. and Rassam, J.C., 1991. Utilisation d'un modèle ARMAX pour la prévision du débit à Carvillon. Rev. Can. Génie Civil, 18(5): 864-870.

Bennis, S., Côté, S., Berrada, F., Gagnon, L. and Kang, N., 1994. Logiciel de validation des données hydrométriques "ValiDeb". Rapport technique final soumis à Hydro-Québec. École de Technologie Supérieure, Montréal, Canada.

Bergman, M.J. and Delleur, J.W., 1985. Kalman filter estimation and prediction of daily stream flows. I. Review, algorithm and simulation experiments. Water Resour. Bull., 21: 815-825.

Box, G.E.P. and Jenkins, G.M., 1976. Time Series Analysis: Forecasting and Control. 2nd edn., Holden-Day, San Francisco.

Bruen, M. and Dooge, J.C.I., 1984. An effective and robust method for estimating unit hydrograph ordinates. J. Hydrol., 70: 1-24.

Chow, V.T., 1988. Open-channel Hydraulics. McGraw-Hill, pp. 586-617.

Davies, P., 1983. A recursive approach to Prony parameter estimation. J. Sound Vib., 89: 571-583.

Dragan, A., Savic, D., Burn, H. and Zrinji, Z., 1989. A comparison of streamflow generation models for reservoir capacity - yield analysis. Water Resour. Bull., 25: 977-983.

Fitch, M. and McBean, E., 1991. Multi-day flow forecasting using the Kalman filter. Can. J. Civil Eng., 18: 320-327.

Haan, C.T., 1977. Statistical Methods in Hydrology. The Iowa State University Press, Ames, IA, pp. 1-20.

Kachroo, R.K. and Liang, L.G.C., 1992. River flow forecasting. Part II. Algebraic development of linear modelling technique. J. Hydrol., 133: 17-40.

Kachroo, R.K., Sea, C.H., Warsi, M.S., Jemenez, H. and Saxena, R.P., 1992. River flow forecasting. Part 3. Application of linear techniques in modelling rainfall-runoff transformations. J. Hydrol., 133: 41-97.

Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Basic Eng. (Trans. ASME, Serie D), 82: 34-45.

Kalman, R.E. and Bucy, R.S., 1961. New results in linear filtering and prediction theory. J. Basic Eng. (Trans. ASME, Serie D), 83: 95-107.

McCuen, R.H. and Snyder, W.M., 1986. Hydrologic Modelling: Statistical Methods and Applications. Prentice-Hall, Englewood Cliffs.

Nash, J.E. and Barsi, B.I., 1983. A hybrid model for flow forecasting on large catchments. J. Hydrol., 65: 125-137.

O'Connell, P.E. (Editor), 1980. Real-time Hydrological Forecasting and Control - Proceeding 1st International Workshop, Institute of Hydrology, Wallingford.

Pankratz, A., 1983. Forecasting with Univariate Box-Jenkins Models. Concepts and Cases. John Wiley & Sons, New York.

Radix, C., 1984. Filtrage et Lissage Statistiques Optimaux Linéaires. Cepadues Editions, Toulouse, France, pp. 46-76, 163-165.

Turgeon, A., 1985. Étude des pertes dans le réseau d'Hydro-Québec. Rapport interne de l'IREQ. Québec, Canada.

Winkler, R.L. and Makridakis, S., 1983. The combination of forecasts. J.R. Stat. Soc. A, 46: 150-157.

Young, P.C., 1970. An instrumental variable method for real-time identification of a noisy process. Automatica, 6: 271-287.

Young, P.C., 1974. Applying parameter estimation to dynamic systems. Control Eng., 10: 119-124.
