
2.7 Numerical evaluation

The most relevant evaluation of a model depends on the final task it is designed for. Generally speaking, there is no easy way to order forecasting methods, especially models with non-stationary multivariate outputs. However, we chose to assess the quality of the models with the numerical criteria presented in this section, which guided our research by summarizing in a single number the discrepancy between the observed target variables and the predictions.

Note that there might be a post-processing of the forecasts by RTE, in particular when rare events occur. Besides, the day-ahead predictions of the models are always corrected during the intraday forecasting process, once more recent observations are available. Neither of these two post-processing steps is considered in this manuscript.

2.7.1 Evaluation criteria for a single-task problem

In a single-task setting, the $\ell_1$ and $\ell_2$ distances usually provide relevant means to evaluate a forecasting model. Consider a zone $\mathcal{Z}_k$ that corresponds to a subset of the substations as introduced in Section 2.5.2, a time period denoted:
$$\mathcal{T}_b := [\![1, n]\!], \qquad (2.10)$$
indexed by $b \in \mathbb{N}$, and a batch of observations $\ell^{(k)} := (\ell_{ki})_{i \in \mathcal{T}_b} \in \mathbb{R}^n$. To assess the quality of a prediction $\hat{\ell}^{(k)} := (\hat{\ell}_{ki})_{i \in \mathcal{T}_b}$, we introduce the following performance criteria.

Mean Squared Error  The Mean Squared Error (MSE) is a simple quadratic penalization of the residuals. Up to the factor $1/n$, it corresponds to the squared Euclidean distance between the forecasts and the observations. It also has the notable advantage of being a convex and differentiable loss. It is defined as:
$$\mathrm{MSE}_{k,b} = \frac{1}{n}\sum_{i=1}^{n} \left(\ell_{ki} - \hat{\ell}_{ki}\right)^2. \qquad (2.11)$$

The coefficient of determination $r^2$  The coefficient of determination $r^2$ is a relative error that makes the comparison of different tasks more convenient. It is obtained with an affine transformation of the MSE based on the empirical variance of the time series $(\ell_{ki})_{i=1,\dots,n}$. It is unitless and, in particular, invariant to affine transformations of the target variables. Let
$$\bar{\ell}_k := \frac{1}{n}\sum_{i=1}^{n} \ell_{ki}$$
denote an estimate of the average load for the considered area. The coefficient of determination is then defined as:
$$r^2_{k,b} = 1 - \frac{\sum_{i=1}^{n} \left(\ell_{ki} - \hat{\ell}_{ki}\right)^2}{\sum_{i=1}^{n} \left(\ell_{ki} - \bar{\ell}_k\right)^2}. \qquad (2.12)$$
We always have $r^2_{k,b} \leq 1$ and a negative score means that the predictions are worse than the constant prediction $\bar{\ell}_k$.
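To make these criteria concrete, here is a minimal NumPy sketch of the MSE of Equation (2.11) and of the coefficient of determination of Equation (2.12); the synthetic hourly load series and all variable names are purely illustrative and do not come from the experiments of this manuscript.

```python
import numpy as np

def mse(load, forecast):
    """Mean Squared Error of Equation (2.11), in MW^2 if the load is in MW."""
    return np.mean((load - forecast) ** 2)

def r2(load, forecast):
    """Coefficient of determination of Equation (2.12): 1 - MSE / empirical variance."""
    return 1.0 - mse(load, forecast) / np.var(load)

# Purely synthetic example: four weeks of hourly load (MW) and a noisy forecast.
rng = np.random.default_rng(0)
hours = np.arange(24 * 28)
load = 5000.0 + 800.0 * np.sin(2.0 * np.pi * hours / 24.0)
forecast = load + rng.normal(scale=100.0, size=load.shape)

print(mse(load, forecast))   # around 100^2 = 10000 MW^2
print(r2(load, forecast))    # dimensionless, <= 1; 0 would match the constant prediction
```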

Normalized Mean Squared Error  Proportional to the MSE, the Normalized Mean Squared Error (NMSE) is defined as:
$$\mathrm{NMSE}_{k,b} = \frac{\mathrm{MSE}_{k,b}}{\bar{\ell}_k^{\,2}} = \frac{1}{n\,\bar{\ell}_k^{\,2}}\sum_{i=1}^{n} \left(\ell_{ki} - \hat{\ell}_{ki}\right)^2. \qquad (2.13)$$

Related to the NMSE, we also define the Root Normalized Mean Squared Error (RNMSE):
$$\mathrm{RNMSE}_{k,b} = 100 \times \sqrt{\mathrm{NMSE}_{k,b}}. \qquad (2.14)$$

MAPE  Finally, the Mean Absolute Percentage Error (MAPE) is related to the $\ell_1$-distance between the observations and the predictions, which makes it more robust (less sensitive) to outliers. It is defined for time series without any zero value, which is the case of well-collected loads:
$$\mathrm{MAPE}_{k,b} = 100 \times \frac{1}{n}\sum_{i=1}^{n} \left|\frac{\ell_{ki} - \hat{\ell}_{ki}}{\ell_{ki}}\right|. \qquad (2.15)$$
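The relative criteria can be sketched in the same way; this illustrative implementation assumes in particular that the NMSE of Equation (2.13) normalizes the MSE by the squared average load.

```python
import numpy as np

def nmse(load, forecast):
    """NMSE of Equation (2.13): the MSE normalized by the squared average load (assumption)."""
    return np.mean((load - forecast) ** 2) / np.mean(load) ** 2

def rnmse(load, forecast):
    """RNMSE of Equation (2.14), a relative error expressed in percent."""
    return 100.0 * np.sqrt(nmse(load, forecast))

def mape(load, forecast):
    """MAPE of Equation (2.15); assumes the observed series contains no zero value."""
    return 100.0 * np.mean(np.abs((load - forecast) / load))
```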

Comments  In the single-task setting, the MSE, the coefficient of determination and the NMSE are affine transformations of each other. They are commonly used as loss functions, in large part because they are $\mathcal{C}^\infty$ functions. However, as quadratic functions, they are sensitive to outliers, while the MAPE is a more robust criterion.

In practice, we minimize the NMSE in this manuscript and mainly compare single-task models using the RNMSE. Also, note that the coefficient of determination and the NMSE are highly sensitive to the empirical mean and variance of the time series. In particular, computing these criteria for a prediction over one year is not equivalent to averaging the same criteria computed separately for each of the 12 months: one should always compare performances on test sets that correspond exactly to the same period, since the mean and the variance of the electrical load are generally about twice as large during the winter as during the summer.

2.7.2 Evaluation criteria for the multi-task setting

We now discuss how to evaluate a forecasting model on several, possibly heterogeneous, time series. We consider $K \in \mathbb{N}$ tasks that correspond to different substations or areas of consumption.

We require the numerical criteria to be representative of the cost induced for the TSO by the errors of the model. In particular, the TSO asked that the numerical criteria reflect the following: an error of $x$ MWh in the prediction for a small area is more damaging to the network management than an error of the same amplitude in a larger area. Therefore, we introduce, for a time period $\mathcal{T}_b$, the Mean NMSE (MNMSE) and the Root MNMSE (RMNMSE):
$$\mathrm{MNMSE}_b := \frac{1}{K}\sum_{k=1}^{K} \mathrm{NMSE}_{k,b}, \qquad (2.16)$$
$$\mathrm{RMNMSE}_b := 100 \times \sqrt{\mathrm{MNMSE}_b}. \qquad (2.17)$$

The MNMSE is the criterion used for optimization in the multi-task problems cast in this manuscript. The first mean is taken in Equation (2.13) with respect to the observation instants and the second mean in Equation (2.16) with respect to the different tasks. Our results present the RMNMSE along with the average coefficient of determination and the average MAPE:
$$\mathrm{Mr}^2_b := \frac{1}{K}\sum_{k=1}^{K} r^2_{k,b}, \qquad (2.18)$$
$$\mathrm{MMAPE}_b := \frac{1}{K}\sum_{k=1}^{K} \mathrm{MAPE}_{k,b}. \qquad (2.19)$$

We could go even further and require that the criteria also reflect that an error of $x\,\%$ in a large area is more damaging than an error of $x\,\%$ in a smaller area of consumption. For this reason, we also introduce the Weighted RNMSE (WRNMSE) and the Weighted MAPE (WMAPE), with weights proportional to the average load $\bar{\ell}_k$ of each area:
$$\mathrm{WRNMSE}_b := \sum_{k=1}^{K} \frac{\bar{\ell}_k}{\sum_{j=1}^{K} \bar{\ell}_j}\, \mathrm{RNMSE}_{k,b}, \qquad (2.20)$$
$$\mathrm{WMAPE}_b := \sum_{k=1}^{K} \frac{\bar{\ell}_k}{\sum_{j=1}^{K} \bar{\ell}_j}\, \mathrm{MAPE}_{k,b}. \qquad (2.21)$$
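For one batch of multivariate forecasts, the criteria of Equations (2.16) to (2.21) can be computed as in the following sketch; the $(n, K)$ array layout and the load-proportional weights are assumptions made for the illustration.

```python
import numpy as np

def multitask_criteria(loads, forecasts):
    """Multi-task criteria of Equations (2.16)-(2.21) for one batch.

    loads, forecasts: arrays of shape (n, K), one column per task (assumed layout).
    """
    mse_k = np.mean((loads - forecasts) ** 2, axis=0)             # Eq. (2.11) per task
    mean_k = loads.mean(axis=0)                                   # average load per area
    nmse_k = mse_k / mean_k ** 2                                  # Eq. (2.13)
    rnmse_k = 100.0 * np.sqrt(nmse_k)                             # Eq. (2.14)
    r2_k = 1.0 - mse_k / loads.var(axis=0)                        # Eq. (2.12)
    mape_k = 100.0 * np.mean(np.abs((loads - forecasts) / loads), axis=0)  # Eq. (2.15)

    mnmse = nmse_k.mean()                                         # Eq. (2.16)
    rmnmse = 100.0 * np.sqrt(mnmse)                               # Eq. (2.17)
    mr2, mmape = r2_k.mean(), mape_k.mean()                       # Eqs. (2.18)-(2.19)

    weights = mean_k / mean_k.sum()   # weights proportional to the average loads (assumption)
    wrnmse = np.sum(weights * rnmse_k)                            # Eq. (2.20)
    wmape = np.sum(weights * mape_k)                              # Eq. (2.21)
    return {"MNMSE": mnmse, "RMNMSE": rmnmse, "Mr2": mr2,
            "MMAPE": mmape, "WRNMSE": wrnmse, "WMAPE": wmape}
```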

It is not clear which of these criteria is the most relevant, industrially speaking. Ultimately, we mainly focus on the RMNMSE.

To compute a unique quantity from the performances of the model for the different tasks, we chose to compute a uniform or weighted mean over the tasks in the above formulas. Alternatively, we could compute the median, which has the notable advantage of being less sensitive to outliers than the mean and, in particular, less sensitive to the choice of the substations discarded by the correction procedure of Section B.2. Still, we use the average and illustrate the distribution of the performances over the different tasks.

2.7.3 Experimental process

In operational conditions, the forecasts are made day by day and we are allowed to modify the model every day by incorporating into the training data the most recent days that have already been forecast. However, updating the model every day and forecasting the days of the test set one by one is tedious in the experiments and not necessarily relevant: the fitted model does not change drastically if we only add the last observed day to the training data. Instead, we choose to update the model every $s$ observations with $s \in \mathbb{N}$, and perform repeated experiments with sliding training and test sets. Empirically, we estimated that updating the models every 4 weeks is reasonable. This is discussed in more detail in Section 3.7.4.

Additionally, because of the obsolescence of data, it is not true that the bigger the training set, the better the fitted model. For instance, we estimated empirically that training sets with 3 years of data are reasonable for the national forecasting problem, and we also discuss this decision in Section 3.7.4. As an additional remark, we also observed that the optimal length of the training sets may vary between the years in the database. We denote by $h$ the length of the training sets and consider it as a hyperparameter of the models.

In summary, the models are repeatedly learned with sliding training sets of size $h$ and evaluated on smaller test sets of size $s$. More formally, let $[\![L^{\mathrm{test}}, R^{\mathrm{test}}]\!]$ denote an interval in the database corresponding to the whole period used to test the models. We consider that the older observation instants $[\![L^{\mathrm{test}} - h, L^{\mathrm{test}} - 1]\!]$ are available for training. For simplicity, we also assume that there exists $B \in \mathbb{N}$ such that $R^{\mathrm{test}} - L^{\mathrm{test}} + 1 = B \times s$. For $b \in [\![0, B-1]\!]$, we train the models with the $b$-th training set:
$$\mathcal{T}_b^{\mathrm{train}} := [\![L_b^{\mathrm{test}} - h,\; L_b^{\mathrm{test}} - 1]\!], \qquad (2.22)$$
and evaluate them with the $b$-th test set:
$$\mathcal{T}_b^{\mathrm{test}} := [\![L_b^{\mathrm{test}},\; R_b^{\mathrm{test}}]\!], \qquad (2.23)$$
where $L_b^{\mathrm{test}} := L^{\mathrm{test}} + b \times s$ and $R_b^{\mathrm{test}} := L_b^{\mathrm{test}} + s - 1$.
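The construction of the sliding training and test sets of Equations (2.22) and (2.23) can be sketched as follows, with observation instants represented by integer indices; the function name and its arguments are illustrative.

```python
def sliding_windows(l_test, r_test, h, s):
    """Yield the B pairs (training set, test set) of Equations (2.22)-(2.23).

    l_test, r_test: bounds of the whole test period; h: training-set length;
    s: test-set length, assuming r_test - l_test + 1 = B * s.
    """
    B = (r_test - l_test + 1) // s
    for b in range(B):
        l_b = l_test + b * s                  # L_b^test
        r_b = l_b + s - 1                     # R_b^test
        train = range(l_b - h, l_b)           # [[L_b^test - h, L_b^test - 1]]
        test = range(l_b, r_b + 1)            # [[L_b^test, R_b^test]]
        yield train, test

# Example with daily indices: 3-year training sets, 4-week test sets, B = 13 batches.
# for train, test in sliding_windows(l_test=1096, r_test=1459, h=1095, s=28): ...
```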

Example 2. For instance, consider that we want to evaluate the performances of a model in 2016, which corresponds to $[\![L^{\mathrm{test}}, R^{\mathrm{test}}]\!]$, and we choose, first, to use $h = 3$ years of data in the training sets and, secondly, to update the model every $s = 4$ weeks. The first test subset starts at 0 a.m. on January 1st, 2016, ends at 11 p.m. on January 28th, 2016, and is used to evaluate a model trained with data from January 2013 to December 2015. The second model is trained with data from February 2013 to January 28th, 2016 and is evaluated with the data from 0 a.m. on January 29th, 2016 to 11 p.m. on February 25th, and so on until December 2016.
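Assuming hourly observations, the batch boundaries of this example can be reproduced with a short pandas sketch; the sampling rate and variable names are assumptions made for the illustration.

```python
import pandas as pd

# Hourly timestamps covering the 2016 test period (hourly sampling is an assumption).
test_index = pd.date_range("2016-01-01 00:00", "2016-12-31 23:00", freq="h")
s = 4 * 7 * 24   # update the model every 4 weeks, i.e. every 672 hourly observations

first_test = test_index[:s]
second_test = test_index[s:2 * s]
print(first_test[0], first_test[-1])    # 2016-01-01 00:00 -> 2016-01-28 23:00
print(second_test[0], second_test[-1])  # 2016-01-29 00:00 -> 2016-02-25 23:00
```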

In the end, we compute for each time period $\mathcal{T}_b^{\mathrm{test}}$ with $b \in [\![0, B-1]\!]$ the quantities $\mathrm{Mr}^2_b$, $\mathrm{MMAPE}_b$, $\mathrm{RMNMSE}_b$, $\mathrm{WRNMSE}_b$ and $\mathrm{WMAPE}_b$ defined in Section 2.7.2 for the forecasts $(\hat{\ell}_{ki})_{i \in \mathcal{T}_b^{\mathrm{test}},\, k=1,\dots,K}$ of the time series $(\ell_{ki})_{i \in \mathcal{T}_b^{\mathrm{test}},\, k=1,\dots,K}$. We also define the average performances over the batches:
$$\mathrm{MMr}^2 := \frac{1}{B}\sum_{b=0}^{B-1} \mathrm{Mr}^2_b,$$
and likewise for the averages of the other criteria over the batches.

Finally, to compare the predictions of different areas at the same aggregation level, we define the Mean RNMSE (MRNMSE) for the forecasts of one time series indexed by $k$:
$$\mathrm{MRNMSE}_k := \frac{1}{B}\sum_{b=0}^{B-1} \mathrm{RNMSE}_{k,b},$$
where $\mathrm{RNMSE}_{k,b}$ is the RNMSE defined in Equation (2.14) for time series $k$ over the time period $\mathcal{T}_b^{\mathrm{test}}$.

Size of the batches  Two comments are in order after the presentation of the performance measures in Section 2.7.3. First, having a batch of results for repeated experiments is more convenient for comparing results: since the means over the batches for different models might be quite close, having repeated samples lets us use a Wilcoxon signed-rank test [Wilcoxon, 1992] to determine whether the differences are significant. This is why we do not compute the prediction errors over the entire test set at once.

Secondly, we had to determine the size of the batches. Although comparing the errors of different models at every single observation instant would generate more samples for the Wilcoxon test, these samples would be highly correlated, because the errors of one model at subsequent instants are generally highly correlated. Therefore, we had to choose a batch size over which to aggregate the errors. For convenience, we chose it so that it matches the frequency $s$ of the updates of the models.
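For instance, the per-batch scores of two models can be compared with the implementation of the Wilcoxon signed-rank test available in scipy; the scores below are invented for the illustration and are not results from this manuscript.

```python
from scipy.stats import wilcoxon

# Illustrative per-batch RMNMSE scores (in %) of two models over thirteen 4-week batches.
rmnmse_model_a = [3.1, 2.9, 3.4, 3.0, 2.8, 3.2, 3.3, 2.7, 3.0, 3.1, 2.9, 3.5, 3.2]
rmnmse_model_b = [3.3, 3.0, 3.6, 3.1, 2.9, 3.4, 3.4, 2.9, 3.2, 3.2, 3.0, 3.6, 3.3]

# Paired, non-parametric test: are the per-batch differences significant?
statistic, p_value = wilcoxon(rmnmse_model_a, rmnmse_model_b)
print(statistic, p_value)
```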
