
Data cleansing procedure

From the thesis of Benjamin Dubois (pages 175-178)

B.1 Detection of anomalies

In this section, we show that the database contains a significant number of irrelevant values, and we present ad hoc tools to detect these anomalies.

These irrelevant values can be due to errors in measurements, errors in the correction procedure of Section 2.1, or modifications of the network configuration.

Errors in measurements Three types of errors in the database are particularly easy to detect: the Not-a-Number (NaN) values, the negative values, and the zero or very-close-to-zero values. The NaN values and the zero values probably correspond to measurement errors, while the negative values may follow from an overestimation of the local renewable production in the procedure of Section 2.1. The distribution of the number of anomalous values is presented in Figure B.1.
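These three categories can be counted directly. The following is a minimal sketch in pandas, assuming the loads are stored as a (time × substation) table; the variable names and the zero tolerance of 10⁻⁶ are illustrative assumptions, not the thesis's code:

```python
import numpy as np
import pandas as pd

# Hypothetical example: `loads` is a (time x substation) table of loads in MWh.
rng = np.random.default_rng(0)
loads = pd.DataFrame(rng.uniform(40, 100, size=(1000, 3)),
                     columns=["s1", "s2", "s3"])
loads.iloc[10, 0] = np.nan   # measurement error (NaN value)
loads.iloc[20, 1] = -5.0     # e.g. over-corrected renewable production
loads.iloc[30, 2] = 1e-9     # (near-)zero value

eps = 1e-6  # tolerance below which a load is considered "zero" (assumption)
n_nan = loads.isna().sum()
n_negative = (loads < 0).sum()
n_zero = ((loads.abs() < eps) & loads.notna()).sum()

print(n_nan.sum(), n_negative.sum(), n_zero.sum())
```

Counting per substation (the per-column sums above) is what allows drawing a cumulative distribution like Figure B.1.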

FIGURE B.1: Number of anomalous values in the database. Repartition of the anomalous values (negative, missing, and zero values) among the substations: for each count X (log scale, from 1 to 10^4), the curve shows the number of substations with at least X anomalous observations (from 0 to 1500). For instance, 200 substations have at least 300 anomalous values.

Load reports and anomalies There are inconsistencies in the database that are more difficult to detect. Load reports, for instance, correspond to the transfer of a fraction of the load of one substation onto another substation. This mechanism leads to load curves like the one in Figure B.2. Other anomalies in the database cannot be attributed to load reports with certainty, but the tools we use to detect them are the same.

FIGURE B.2: Illustration of a load report. Average load per day (in MWh) over one year at one of the substations in the database. A fraction of the load of the substation is reported on other substations from August to November, leading to the jumps and the decrease of the load during this period.

Detection with trimmed means To detect anomalies in the database, we first use trimmed means. Given an observation instant i and a measurement of the load ℓ_i at a substation whose mean load is denoted ℓ̄, we extract from the database the loads {ℓ_{i+24j}}, j = -14, ..., 14, at the same hour of the day during the preceding two weeks and the following two weeks. From this set, we remove the maximum and minimum values and compute the mean μ_i of the remaining samples. Given a threshold τ = ℓ̄/10, the observation instant i is classified as an error if |ℓ_i - μ_i| > τ. The choice of the threshold and of the one-month-long window have not been optimized.
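The trimmed-mean rule can be sketched as follows, assuming a complete hourly series (the function name and the handling of the series boundaries are our own assumptions):

```python
import numpy as np

def trimmed_mean_anomalies(load, tau_factor=0.1):
    """Flag instants whose load deviates from a trimmed mean of the loads
    observed at the same hour of day over the surrounding four weeks.

    `load` is a 1-D array of hourly loads; this is a minimal sketch assuming
    the series is complete (no gaps) and sampled hourly."""
    load = np.asarray(load, dtype=float)
    tau = tau_factor * load.mean()            # threshold: mean load / 10
    flags = np.zeros(load.size, dtype=bool)
    for i in range(load.size):
        # Same hour of day, from 14 days before to 14 days after.
        idx = i + 24 * np.arange(-14, 15)
        idx = idx[(idx >= 0) & (idx < load.size)]
        window = np.sort(load[idx])
        if window.size <= 2:
            continue
        mu_i = window[1:-1].mean()            # drop the min and max, average
        if abs(load[i] - mu_i) > tau:
            flags[i] = True
    return flags

# Illustration on a synthetic series: 60 days of constant load, one spike.
load = np.full(24 * 60, 50.0)
load[24 * 30] = 120.0
flags = trimmed_mean_anomalies(load)
```

Because the minimum and the maximum are trimmed from each window, a single spike does not contaminate the reference mean μ_i of its neighbouring instants.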

Detection with middle-term models Another way to detect anomalies relies on the residuals of a load forecasting model that is based only on calendar and weather information and does not include the recent loads. Such models are called middle-term models and are detailed in Section 2.5.1.

Empirically, we observed that large residuals at a given observation instant usually correspond to a jump in the load time series. This procedure is not automatic, but it still allowed us to manually identify substations with irregularities.
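The screening principle can be sketched with a deliberately crude calendar-only predictor; here the mean load per hour of day stands in for the middle-term models of Section 2.5.1, and the 5-standard-deviation cutoff is an illustrative assumption:

```python
import numpy as np

def residual_flags(load, k=5.0):
    """Sketch of residual-based screening: fit a crude calendar-only model
    (the mean load per hour of day, a stand-in for a middle-term model that
    would also use weather) and flag instants with unusually large residuals."""
    load = np.asarray(load, dtype=float)
    hours = np.arange(load.size) % 24
    pred = np.empty_like(load)
    for h in range(24):
        pred[hours == h] = load[hours == h].mean()  # per-hour mean prediction
    residuals = load - pred
    tau = k * residuals.std()                       # cutoff on |residual|
    return np.abs(residuals) > tau

# A single aberrant observation stands out against the calendar prediction.
load = np.full(24 * 60, 50.0)
load[100] = 100.0
flags = residual_flags(load)
```

In the thesis this step is not automatic: the flagged instants are inspected manually, since a large residual may also reflect a genuine but unmodeled event.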

Altogether, about 800 substations present notable anomalies. A large part of them are corrected as explained in Section B.2.

B.2 Correction of anomalous values

After detecting the irrelevant values with the procedure described in Section B.1, we consider two possibilities: either modify these values to make the load curves more consistent, or simply remove the concerned substation from the database.

To propose a correction when an irrelevant value is detected, we could resort to the trimmed mean presented in Section B.1. However, the corrupted data are often consecutive and occur over periods of several days or even weeks, which makes the trimmed mean an irrelevant substitute. Instead, we take advantage of the following observation: given a set K_0 of substations in the database where no irrelevant value was detected, and a substation κ with an irrelevant value at the observation instant i ∈ ℕ, a remarkably accurate way to forecast the load ℓ_i^κ at the substation κ and instant i is to regress it on the loads (ℓ_i^κ′), κ′ = 1, ..., K, κ′ ≠ κ, at the other substations at the same instant i. Of course, this method cannot be applied to load forecasting because it requires the oracles (ℓ_i^κ′), κ′ ≠ κ, but we can use it to correct the irrelevant values in the database, with a model estimated on a different time period.

In practice, we indeed choose for K_0 the set of substations where no irrelevant value was detected (there are about 1200 such substations) and, given a substation κ with irrelevant values at the observation instants I, we randomly partition the set of sane observation instants into two subsets I_train and I_test, respectively containing 80 % and 20 % of the sane observations. Then, we train a regression model with the data in I_train to predict the load at κ from the loads at the substations in K_0

and compute the coefficient of determination (presented in Section 2.7.1) on the test set I_test. Given a threshold τ = 0.8, we keep the substation κ in the database if the coefficient of determination on I_test is above τ, and we modify the irrelevant values with the trained model for the observation instants in I. Otherwise, the substation is eliminated from the database.
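The keep-or-discard decision can be sketched as follows; plain least squares stands in here for the random-forest or LASSO models actually used, and the function name and synthetic data are our own assumptions:

```python
import numpy as np

def correct_or_discard(target, predictors, bad, r2_threshold=0.8, seed=0):
    """Regress the load of a corrupted substation on the loads of the sane
    substations at the same instants, evaluate R^2 on a held-out 20 % split
    of the sane instants, then either impute the corrupted instants or
    discard the substation (returning None)."""
    rng = np.random.default_rng(seed)
    sane = np.flatnonzero(~bad)               # instants outside I
    rng.shuffle(sane)
    n_train = int(0.8 * sane.size)
    train, test = sane[:n_train], sane[n_train:]

    X = np.column_stack([predictors, np.ones(len(target))])  # add intercept
    coef, *_ = np.linalg.lstsq(X[train], target[train], rcond=None)

    pred_test = X[test] @ coef                # coefficient of determination
    ss_res = ((target[test] - pred_test) ** 2).sum()
    ss_tot = ((target[test] - target[test].mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot

    if r2 < r2_threshold:
        return None                           # eliminate the substation
    corrected = target.copy()
    corrected[bad] = X[bad] @ coef            # impute the corrupted instants
    return corrected

# Synthetic illustration: the target is an exact combination of 4 sane loads.
rng = np.random.default_rng(1)
predictors = rng.uniform(40, 100, size=(500, 4))
true = predictors @ np.array([0.5, 0.2, 0.1, 0.3]) + 5.0
target = true.copy()
bad = np.zeros(500, dtype=bool)
bad[:10] = True
target[bad] = np.nan                          # corrupted instants I
corrected = correct_or_discard(target, predictors, bad)
```

The train/test split uses only the sane instants, so the R² score measures how trustworthy the imputations at the instants in I are likely to be.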

Ideally, we would keep as many substations as possible in the database with consistent values but, since our final objective is to study a multi-task forecasting model, the irrelevant values at some substations can represent a significant hindrance for this model. Therefore, adopting a pragmatic approach, we allow ourselves to choose the second option. In practice, 10 to 15 % of the substations are thereby discarded: the resulting database contains 1751 substations.

We do not claim that these detection and correction mechanisms are optimal, but we consider them sufficient to clean the database of significant errors. We used random forests or regression models with a LASSO penalty for the correction, but did not search for the best hyperparameters. This requires further work.
