
4.4 Clustering of the substations

This section is dedicated to studying the possibility of clustering the substations such that in each group, the coefficient vectors used to forecast the normalized loads are close, where the normalization is given in Equation (3.33). In Section 4.4.1, we impose a strict equality condition: the same model is learned for all the tasks in the same cluster. A possible relaxation of this hard constraint is proposed in Section 4.4.2. Finally, experiments are presented in Section 4.4.3.

The models and results of this section are extracted from the internal report written by Duchemin [2018], following our collaboration during his internship in our team at École Nationale des Ponts et Chaussées. For pragmatic and confidentiality reasons, the database used for the experiments in Section 4.4.3 is different from the database provided by RTE. Instead, the experiments were performed with the GEFCom 2012 database [Hong et al., 2014].

This database was produced for a competition and little information is given about the substations and the weather stations. As a consequence, selecting the relevant weather stations to forecast each load was one of the expected challenges and the models described in this section precisely perform variable selection.

4.4.1 Formulation of the hard clustering problem

Clustering For the hard clustering problem, we consider the empirical risk given in Equation (4.24) and we make the assumption, similarly to [Bakker and Heskes, 2003], that the different tasks can be grouped into $R \in \mathbb{N}$ clusters such that all the tasks in one cluster are forecast with an identical coefficient vector. The set of constraints that we consider consequently has the non-separable form:

$$\Theta_{HC} := \left\{ B \in \mathbb{R}^{p,K} \;\middle|\; \exists\, u^{(1)}, \dots, u^{(R)} \in \mathbb{R}^p \text{ s.t. } \forall k \in [\![1,K]\!],\ \exists\, r \in [\![1,R]\!],\ b^{(k)} = u^{(r)} \right\}. \quad (4.29)$$

The cluster assignment of each task $k \in [\![1,K]\!]$ can be encoded with a vector $v_k \in \{0,1\}^R$ with a non-zero coefficient in position $r$ when task $k$ belongs to cluster $r$. Gathering the centers $u^{(r)}$ in a matrix $U$ and the assignment vectors $v_k$ in a matrix $V$, we can write $B = UV^T$. In other words, an explicit parametrization of $\Theta_{HC}$ with $pR + K$ degrees of freedom is given by:

$$\Theta_{HC} = \left\{ UV^T \;\middle|\; U \in \mathbb{R}^{p,R},\ V \in \{0,1\}^{K,R} \text{ s.t. } V \mathbf{1}_R = \mathbf{1}_K \right\}, \quad (4.30)$$

where $\mathbf{1}_R$ and $\mathbf{1}_K$ are constant vectors only containing ones. We use this parametrization below to minimize the empirical risk.
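To make this parametrization concrete, the following minimal sketch (in Python with NumPy; the sizes and variable names are ours for illustration, not taken from the report) builds a coefficient matrix $B = UV^T$ from cluster centers and one-hot assignments:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, R = 5, 20, 4  # illustrative sizes: features, tasks, clusters

# Cluster centers: one coefficient vector u^(r) per column of U.
U = rng.normal(size=(p, R))

# One-hot assignment matrix V: row k selects the cluster of task k,
# so V @ 1_R = 1_K holds by construction.
labels = rng.integers(0, R, size=K)
V = np.zeros((K, R))
V[np.arange(K), labels] = 1.0

# Each column b^(k) of B equals the center of the cluster of task k.
B = U @ V.T  # shape (p, K)
assert np.allclose(B[:, 0], U[:, labels[0]])
```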

Likelihood and Regularization Since the parametrization of the elements of $\Theta_{HC}$ depends on hidden variables, we introduce a Bayesian formulation with latent classes as in [Hofmann and Puzicha, 1999; Kass and Steffey, 1989] and apply an Expectation-Maximization (EM) algorithm to learn the coefficient matrix $B$. To this end, we formulate below the minimization of the error as a maximum-likelihood problem.

Let $X := (X^{(k)})_{k=1,\dots,K} \in \mathbb{R}^{n,p,K}$ be the local design matrices for the $K$ different tasks, with $n$ the number of observations and $p$ the common number of features, and let $Y \in \mathbb{R}^{n,K}$ be the target observation matrix. To model the cluster assignment of the substations and write the maximum-likelihood problem, we consider a coefficient matrix $U \in \mathbb{R}^{p,R}$ and a cluster assignment matrix $V \in \{0,1\}^{K,R}$ such that $V \mathbf{1}_R = \mathbf{1}_K$ like in Equation (4.30). Besides, we introduce an a priori discrete probability distribution of the cluster assignments $(\pi_r)_{r=1,\dots,R}$ and the variances within each cluster $(\sigma_r^2)_{r=1,\dots,R} \in \mathbb{R}_+^R$. This is formalized with the following assumptions:

1. The random vectors $(v_k)_{k=1,\dots,K}$ are independent and such that for each $k \in [\![1,K]\!]$, the vector $v_k$ follows a multinomial distribution $\mathcal{M}(1, \pi_1, \dots, \pi_R)$.

2. Conditionally on $v_k^r = 1$, the vector $y^{(k)}$ follows a normal distribution with mean $X^{(k)} u^{(r)}$ and covariance matrix $\sigma_r^2 I_n$. Formally, we have:

$$\left(y^{(k)} \,\middle|\, v_k^r = 1\right) = X^{(k)} u^{(r)} + \sigma_r \varepsilon^{(k)}, \quad (4.31)$$

where $\varepsilon^{(k)} \in \mathbb{R}^n$ follows the normal distribution $\mathcal{N}(0, I_n)$ with mean $0$ and the identity matrix $I_n$ as covariance.

3. Conditionally on the cluster assignment matrix encoded in the matrix $V$, the random variables $(y_i^k)_{i=1,\dots,n,\,k=1,\dots,K}$ are independent.
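As a concrete illustration of these three assumptions, the following sketch samples targets from the hard-clustering generative model (all sizes, noise levels and design matrices are illustrative placeholders, not values from the report):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K, R = 50, 5, 20, 4

X = rng.normal(size=(n, p, K))          # local design matrices X^(k)
U = rng.normal(size=(p, R))             # cluster centers u^(r)
pi = np.full(R, 1.0 / R)                # a priori cluster distribution
sigma = np.array([0.5, 1.0, 1.5, 2.0])  # within-cluster noise levels

Y = np.empty((n, K))
for k in range(K):
    r = rng.choice(R, p=pi)             # assumption 1: v_k ~ M(1, pi)
    eps = rng.standard_normal(n)        # assumption 2: Gaussian noise
    Y[:, k] = X[:, :, k] @ U[:, r] + sigma[r] * eps
```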

In order to regularize the coefficient vector of each cluster and perform variable selection, we add an Elastic Net penalization [Zou and Hastie, 2005] with hyper-parameters $\lambda > 0$ and $\alpha \in [0,1]$. The regularized maximum-likelihood problem is:

$$\max \; \mathbb{E}_V\left[\log p(Y, V)\right] - \lambda \sum_{r=1}^{R} \left( \alpha\, \|u^{(r)}\|_1 + \frac{1-\alpha}{2}\, \|u^{(r)}\|_2^2 \right), \quad (4.32)$$

where $p(Y,V)$ denotes the density function of the pair $(Y,V)$, the assignments $(v_k^r)_{r=1,\dots,R,\,k=1,\dots,K}$ are unknown random variables, $Y$ and $X$ are fixed observations while $(\pi_r)_{r=1,\dots,R}$, $(u^{(r)})_{r=1,\dots,R}$ and $(\sigma_r)_{r=1,\dots,R}$ are unknown parameters.

EM algorithm The E-step of the EM algorithm applied to Problem (4.32) consists in computing, for all $r \in [\![1,R]\!]$ and $k \in [\![1,K]\!]$, the a posteriori probability of assignment $\gamma_k^r := \mathbb{E}[v_k^r \mid Y]$ while the parameters $(\pi_r)_{r=1,\dots,R}$, $U$ and $(\sigma_r^2)_{r=1,\dots,R}$ are kept fixed.
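Under the Gaussian assumptions above, this E-step reduces to Bayes' rule on the cluster-conditional likelihoods. A minimal sketch (our own notation, computed in the log domain for numerical stability; the report may organize the computation differently):

```python
import numpy as np
from scipy.special import logsumexp

def e_step(Y, X, U, sigma2, pi):
    """Posterior responsibilities gamma[r, k] = E[v_k^r | Y]."""
    n, K = Y.shape
    R = U.shape[1]
    log_gamma = np.empty((R, K))
    for k in range(K):
        for r in range(R):
            resid = Y[:, k] - X[:, :, k] @ U[:, r]
            # log of pi_r * N(y^(k); X^(k) u^(r), sigma_r^2 I_n)
            log_gamma[r, k] = (np.log(pi[r])
                               - 0.5 * n * np.log(2 * np.pi * sigma2[r])
                               - 0.5 * resid @ resid / sigma2[r])
        log_gamma[:, k] -= logsumexp(log_gamma[:, k])  # normalize over r
    return np.exp(log_gamma)
```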

As for the M-step of the EM algorithm, it consists in minimizing, with respect to $U \in \mathbb{R}^{p,R}$, $(\sigma_r^2)_{r=1,\dots,R} \in \mathbb{R}_+^R$ and $(\pi_r)_{r=1,\dots,R}$, all other variables being fixed, the regularized expectation of the negative complete log-likelihood.
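For $(\pi_r)_{r=1,\dots,R}$ and $(\sigma_r^2)_{r=1,\dots,R}$, the updates have closed forms that follow from standard mixture-model algebra; a sketch in our notation (the report's exact expressions may differ):

```python
import numpy as np

def m_step_pi_sigma(Y, X, U, gamma):
    """Closed-form M-step updates of the mixture weights and the
    within-cluster variances, given responsibilities gamma[r, k]."""
    n, K = Y.shape
    R = U.shape[1]
    pi = gamma.sum(axis=1) / K  # average responsibility per cluster
    sigma2 = np.empty(R)
    for r in range(R):
        sq = np.array([np.sum((Y[:, k] - X[:, :, k] @ U[:, r]) ** 2)
                       for k in range(K)])
        sigma2[r] = gamma[r] @ sq / (n * gamma[r].sum())
    return pi, sigma2
```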

Given the presence of the non-differentiable Elastic Net regularization, a proximal-gradient algorithm is applied to update $U$. The details of the computations and of the algorithms can be found in [Duchemin, 2018, Section 1.2].
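The non-smooth $\ell_1$ part of the penalty is handled by its proximal operator, which is plain soft-thresholding; below is a schematic single update of a cluster center (step size and gradient are placeholders, not the report's exact algorithm):

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_grad_step(u, grad_smooth, step, lam, alpha):
    """One proximal-gradient step on a cluster center for a smooth loss
    plus the Elastic Net penalty lam*(alpha*||u||_1 + (1-alpha)/2*||u||_2^2).
    The l2 part is smooth and folded into the gradient step; the l1 part
    is handled by its proximal operator."""
    u = u - step * (grad_smooth + lam * (1.0 - alpha) * u)
    return soft_threshold(u, step * lam * alpha)
```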

4.4.2 Formulation of the soft clustering problem

Likelihood The exact Equality (4.29) on the coefficient vectors of the substations within a cluster is a relatively strong assumption, as it drastically reduces the number of degrees of freedom from $pK$ to $pR + K$. Besides, we see in Section 4.4.3 that empirically, the results with this strict assumption are not fully satisfying. Therefore, we introduce in this section a relaxed version of this constrained model, where the columns of the coefficient matrix lie close to one of a few cluster centers:

$$B = UV^T + F, \quad (4.34)$$

where $UV^T \in \Theta_{HC}$ and the norm of $F$ is penalized such that $B$ is close to $\Theta_{HC}$. Put differently, we assume, with the same notations as in Section 4.4.1, that the coefficient vectors $(b^{(k)})_{k=1,\dots,K}$ are gathered around the centers $(u^{(r)})_{r=1,\dots,R}$ of a few clusters such that for each $r \in [\![1,R]\!]$, the within-cluster variance, given by $\sum_{k=1}^{K} v_k^r\, \|b^{(k)} - u^{(r)}\|_2^2$, is small.

Formally, in place of Equation (4.31), we assume in this section the following Bayesian model for each task $k \in [\![1,K]\!]$:

$$\left(b^{(k)} \,\middle|\, v_k^r = 1\right) = u^{(r)} + \tau \nu^{(k)},$$
$$\left(y^{(k)} \,\middle|\, b^{(k)}\right) = X^{(k)} b^{(k)} + \sigma_k \varepsilon^{(k)}, \quad (4.35)$$

where $\tau > 0$ and $(\sigma_k)_{k=1,\dots,K}$ are unknown parameters and for each $k \in [\![1,K]\!]$, the vector $\nu^{(k)} \in \mathbb{R}^p$ follows a normal distribution $\mathcal{N}(0, I_p)$ while the vector $\varepsilon^{(k)} \in \mathbb{R}^n$ follows a normal distribution $\mathcal{N}(0, I_n)$.

With this model, the regularized expected likelihood is:

$$\mathbb{E}_{V,B}\left[\log p(Y, B, V)\right] - \lambda \sum_{r=1}^{R} \left( \alpha\, \|u^{(r)}\|_1 + \frac{1-\alpha}{2}\, \|u^{(r)}\|_2^2 \right),$$

where $p(Y,B,V)$ denotes the density function of the triplet $(Y,B,V)$, the assignments $(v_k^r)_{r=1,\dots,R,\,k=1,\dots,K}$ and the coefficient matrix $B$ are unknown random variables, $Y$ and $X$ are fixed observations and $(\pi_r)_{r=1,\dots,R}$, $(u^{(r)})_{r=1,\dots,R}$, $\tau$ and $(\sigma_k)_{k=1,\dots,K}$ are unknown parameters. The details of the EM algorithm used to maximize this log-likelihood can be found in [Duchemin, 2018, Section 3.2].
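One useful consequence of the Gaussian conjugacy in Equation (4.35) is that, conditionally on a cluster assignment, the posterior mean of $b^{(k)}$ is a ridge-like estimate shrunk toward the cluster center. A sketch of this standard computation (not necessarily the exact update used in the report):

```python
import numpy as np

def posterior_mean_b(Xk, yk, u_r, sigma_k, tau):
    """Posterior mean of b^(k) given v_k^r = 1 under model (4.35):
    prior b^(k) ~ N(u^(r), tau^2 I_p) and likelihood
    y^(k) ~ N(X^(k) b^(k), sigma_k^2 I_n). Standard Gaussian conjugacy
    yields a ridge-like estimate shrunk toward the center u^(r)."""
    p = Xk.shape[1]
    A = Xk.T @ Xk / sigma_k**2 + np.eye(p) / tau**2
    rhs = Xk.T @ yk / sigma_k**2 + u_r / tau**2
    return np.linalg.solve(A, rhs)
```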

4.4.3 Experiments

GEFCom 2012 The database presented in [Hong et al., 2014] and used in this section contains the measurements of the weather at 11 weather stations and the load of 20 substations in British Columbia, with an average load of 82 MWh. In terms of aggregation and load levels, this setting lies between the level of districts and the level of substations described in Table 2.1.

Numerical performances The results for the hard and soft clustering methods are presented in Table 4.1. They are compared with the independent models of Chapter 3. We expected the hard clustering assumption in Equation (4.31) to be too restrictive and indeed, the results obtained are not the best. Although the Bayesian model introduced in Section 4.4.2 leads to an improvement, it does not perform better than the independent models.

The soft clustering method presented in this section leads to a small but still tangible improvement over its hard counterpart. Clearly though, the clustering assumption that we made is too restrictive and does not lead to a better generalization performance.

                                          r²      MAPE    RMNMSE
    Independent models                    0.87    7.11    8.81
    Hard clustering in Equation (4.29)    0.852   7.73    9.70
    Soft clustering in Equation (4.35)    0.862   7.49    9.34

TABLE 4.1: Results of the clustering methods. Numerical performances on the GEFCom 2012 database of the clustering methods of Section 4.4, compared with the independent models of Chapter 3. While there are 20 load time series in the database, a clustering with 4 groups typically led to the best results with the soft clustering model.

Difficulties In terms of optimization, the two main drawbacks of these clustering methods are the slow speed of convergence of the algorithms that we implemented and the existence of local minima. For the latter, we explored three possible solutions, namely a non-random initialization of the algorithm with Ward's method [Ward Jr, 1963], an annealing method [Ueda and Nakano, 1995] to guide the algorithms during the first iterations, and resuscitating empty clusters when they occur. In spite of these patches, the algorithms can still converge towards local minima.
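As an illustration, the non-random initialization can seed the cluster labels by running Ward's method on coefficient vectors fitted independently for each task (a sketch using SciPy; the exact procedure of the report may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def ward_init(B_indep, R):
    """Initial cluster labels from Ward's hierarchical clustering,
    applied to independently fitted coefficient vectors b^(k)
    stored as the columns of B_indep (shape (p, K))."""
    Z = linkage(B_indep.T, method="ward")  # one point per task
    return fcluster(Z, t=R, criterion="maxclust") - 1  # labels in [0, R-1]
```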

The priorities for further investigation seem to be the possibility of clustering only the coefficients corresponding to inputs that are common to all the substations and an improvement of the speed of the algorithms to allow larger experiments.
