4.5 Low-rank models

In this section, we also discuss the possibility of imposing a structural constraint on the coefficient matrix in order to reduce the number of parameters of the models.

However, instead of the clustering constraints presented in Equation (4.31) and Equation (4.35), which turned out to be too restrictive, we introduce in this section a low-rank constraint. This is motivated by the similarity of the tasks highlighted in Section 2.4.2 and Section 4.1, in particular by the rapid decrease of the singular values of the coefficient matrices learned with independent models in Figure 4.1.

In order to obtain a low-dimensional representation of the features and improve the generalization performance on different tasks, the low-rank assumption on the coefficient matrix was studied 20 years ago by Intrator and Edelman [1996]. An in-depth theoretical analysis [Ando and Zhang, 2007] later confirmed the interest of this structural assumption and motivated the study of the resulting optimization problems [Bunea et al., 2011, and references therein].

In Section 4.5.1, we introduce a model where the coefficient matrix must be exactly low-rank. In order to select covariates that are relevant to all the tasks, we define in Section 4.5.2 a non-separable regularization that performs joint variable selection. In Section 4.5.3, we restrict the low-rank constraints to blocks of coefficients. Finally, experiments with the database provided by RTE are discussed in Section 4.5.4.

4.5.1 The low-rank constraint

In the low-rank formulation, we constrain the set of coefficient vectors of the different tasks, concatenated into the matrix $B \in \mathbb{R}^{p \times K}$, not to be of cardinality $R \in \mathbb{N}$ as in Equation (4.29), but to span a subspace of dimension at most $R$:

\[
\operatorname{rank}(B) \le R, \tag{4.37}
\]

where $\operatorname{rank}(B)$ denotes the rank of the matrix $B$. Of course, this constraint is effective only if $R < \min(p, K)$.

To highlight the implications of this constraint and how it couples the different tasks, consider $B \in \mathbb{R}^{p \times K}$ with $\operatorname{rank}(B) \le R$ and a pair $(U, V) \in \mathbb{R}^{p \times R} \times \mathbb{R}^{K \times R}$ such that:

\[
B = U V^\top. \tag{4.38}
\]

Consider a task $k \in [\![1, K]\!]$ and a vector of covariates $x^{(k)} \in \mathbb{R}^p$. With the linear model defined in Section 3.2, denoting $b^{(k)}$ the $k$-th column of $B$, $v^{(k)}$ the $k$-th row of $V$ and $u^{(1)}, \ldots, u^{(R)}$ the columns of $U$, we have:

\[
\hat{y}^{(k)} = \big\langle x^{(k)}, b^{(k)} \big\rangle = \big\langle x^{(k)}, U v^{(k)} \big\rangle = \sum_{r=1}^{R} v^{(k)}_r \big\langle x^{(k)}, u^{(r)} \big\rangle. \tag{4.39}
\]

The right member of Equation (4.39) is a linear combination of the new covariates $(\langle x^{(k)}, u^{(1)} \rangle, \ldots, \langle x^{(k)}, u^{(R)} \rangle)$. In other words, by enforcing the low-rank constraint of Equation (4.37), we force the different tasks to collectively choose a linear transformation of the covariates $x \mapsto (\langle x, u^{(1)} \rangle, \ldots, \langle x, u^{(R)} \rangle)$, characterized by the matrix $U \in \mathbb{R}^{p \times R}$, that performs a dimensionality reduction since $R < \min(p, K)$, and such that all tasks can be forecast correctly with the coefficients given by the matrix $V \in \mathbb{R}^{K \times R}$.
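A small numpy sketch can make this coupling concrete; the dimensions and random data below are purely illustrative and not taken from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, R = 50, 30, 5            # illustrative sizes with R < min(p, K)

U = rng.normal(size=(p, R))    # shared transformation of the covariates
V = rng.normal(size=(K, R))    # task-specific coefficients in dimension R
B = U @ V.T                    # coefficient matrix with rank(B) <= R

k = 7                          # some task
x = rng.normal(size=p)         # its vector of covariates x^(k)

# Predicting with the k-th column of B ...
y_full = x @ B[:, k]
# ... is the same as predicting from the R new covariates <x, u^(r)>.
z = U.T @ x                    # dimensionality reduction x -> U^T x
y_reduced = z @ V[k]

assert np.isclose(y_full, y_reduced)
```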

Therefore, the regularized empirical risk minimization problem that we consider follows from Problem (3.25) with an additional low-rank constraint:

\[
\min_{B \in \mathbb{R}^{p \times K}} \; \sum_{k=1}^{K} \frac{1}{2 n_k} \big\| y^{(k)} - X^{(k)} b^{(k)} \big\|_2^2 + \Omega(B) \quad \text{subject to} \quad \operatorname{rank}(B) \le R, \tag{4.40}
\]

where $\Omega$ denotes the regularization of Problem (3.25).
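The constraint set $\{\operatorname{rank}(B) \le R\}$ is non-convex, but projecting onto it only requires a truncated SVD (Eckart-Young theorem), so problems of the form (4.40) can be tackled by projected gradient descent. The sketch below assumes, purely for illustration, a squared loss with no additional regularization and a single design matrix X shared by all the tasks:

```python
import numpy as np

def project_rank(B, R):
    """Project B onto the set of matrices of rank at most R (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s[R:] = 0.0                       # zero out all but the R largest singular values
    return (U * s) @ Vt

def low_rank_erm(X, Y, R, n_iter=500):
    """Projected gradient descent on (1/2n) * ||Y - X B||_F^2 s.t. rank(B) <= R."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ B - Y) / n       # gradient of the quadratic loss
        B = project_rank(B - step * grad, R)
    return B
```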

Remark 6. The low-rank constraint is a strong restriction on the model and, just like we introduced the soft version of Equation (4.29) in Equation (4.35), we can define a less restrictive formulation of the low-rank constraint, similarly to [Ando and Zhang, 2007]. Instead of imposing the rank of the coefficient matrix, we may require that it is close to a low-rank matrix, i.e. that it is the sum of a low-rank matrix and a small perturbation:

\[
B = E + F, \tag{4.41}
\]

where $E \in \mathbb{R}^{p \times K}$ is such that $\operatorname{rank}(E) \le R$ and $F \in \mathbb{R}^{p \times K}$. To pull the component $F$ towards zero, we may add to the objective a regularization on $F$, the regularized empirical risk minimization problem being in this case:

\[
\min_{\substack{E, F \in \mathbb{R}^{p \times K} \\ \operatorname{rank}(E) \le R}} \; \sum_{k=1}^{K} \frac{1}{2 n_k} \big\| y^{(k)} - X^{(k)} \big( e^{(k)} + f^{(k)} \big) \big\|_2^2 + \mu \, \| F \|^2, \tag{4.42}
\]

where $\mu > 0$ is a regularization hyperparameter and $e^{(k)}$, $f^{(k)}$ denote the columns of $E$ and $F$. In [Ando and Zhang, 2007], this relaxation leads to interesting empirical results.
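For a fixed coefficient matrix, the split of Equation (4.41) is easy to exhibit: taking $E$ as the best rank-$R$ approximation of $B$ makes $F$ the smallest possible perturbation in Frobenius norm. A short illustration on synthetic data (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, R = 50, 30, 5
# An approximately low-rank matrix: rank-R signal plus a small perturbation.
B = rng.normal(size=(p, R)) @ rng.normal(size=(R, K)) + 0.01 * rng.normal(size=(p, K))

U, s, Vt = np.linalg.svd(B, full_matrices=False)
E = (U[:, :R] * s[:R]) @ Vt[:R]   # best rank-R approximation of B
F = B - E                         # the perturbation penalized in (4.42)
print(np.linalg.norm(F) / np.linalg.norm(B))   # small relative residual
```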

4.5.2 Joint variable selection

In addition to the low-rank constraint set in Problem (4.40), we have considered a group-Lasso regularization [Bakin et al., 1999; Obozinski et al., 2010; Yuan and Lin, 2006] like in Equation (2.2). It is defined for any matrix $B \in \mathbb{R}^{p \times K}$ by:

\[
\Omega_{\mathrm{GL}}(B) = \sum_{j=1}^{p} \| b_j \|_2, \tag{4.43}
\]

where $(b_j)_{j \in [\![1, p]\!]}$ are the rows of the matrix $B$. The group-Lasso regularization is known for encouraging some of the groups to have zero norm. Thereby, it induces a common sparsity structure among the tasks. Effectively, that a row $b_j$ of the matrix $B$ with $j \in [\![1, p]\!]$ is zero means that none of the tasks uses the associated covariates $(x^{(k)}_j)_{k=1,\ldots,K}$. The minimization problem of the regularized empirical risk is in this case:

\[
\min_{\substack{B \in \mathbb{R}^{p \times K} \\ \operatorname{rank}(B) \le R}} \; \sum_{k=1}^{K} \frac{1}{2 n_k} \big\| y^{(k)} - X^{(k)} b^{(k)} \big\|_2^2 + \lambda \sum_{j=1}^{p} \| b_j \|_2, \tag{4.44}
\]

where $\lambda > 0$ is a regularization hyperparameter. Chapter 5 is dedicated to a detailed analysis of this optimization Problem (4.44), with the important additional assumption that the design matrix is the same for all the tasks, i.e. $X^{(1)} = \cdots = X^{(K)}$.
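The group-Lasso term of Equation (4.43) is non-smooth, but its proximal operator has a closed form, a row-wise soft-thresholding, which is the standard building block of proximal gradient methods for problems like (4.44); a minimal sketch, leaving the rank constraint aside:

```python
import numpy as np

def prox_group_lasso(B, tau):
    """Prox of tau * sum_j ||b_j||_2, applied row by row.

    Each row is shrunk towards zero and set exactly to zero when its norm
    falls below tau, discarding the corresponding covariate for all tasks.
    """
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return B * scale
```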

4.5.3 Partial low-rank constraints

In Section 4.5.1, we have considered a low-rank constraint on the whole matrix $B \in \mathbb{R}^{p \times K}$: in effect, it constrains the coefficients associated to all the features. Yet, it appears also legitimate to constrain only the coefficients corresponding to the inputs that are shared by all the tasks (e.g. the hour of the week and not the past loads). In the experiments of Section 4.5.4, the constraint that we use is even more detailed, as we consider disjoint blocks of rows $A_1, \ldots, A_\ell$ of the coefficient matrix, here denoted $A$, and we impose independent rank constraints on the different blocks. This is formulated as the following optimization problem with partial low-rank constraints:

\[
\min_{A \in \mathbb{R}^{p \times K}} \; \sum_{k=1}^{K} \frac{1}{2 n_k} \big\| y^{(k)} - X^{(k)} a^{(k)} \big\|_2^2 \quad \text{subject to} \quad \operatorname{rank}(A_i) \le R_i, \quad i \in [\![1, \ell]\!], \tag{4.46}
\]

where $a^{(k)}$ denotes the $k$-th column of $A$ and $R_1, \ldots, R_\ell \in \mathbb{N}$ are the rank bounds.
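Enforcing the partial constraints numerically again reduces to SVD truncations, applied independently to each block of rows; a sketch, where the blocks are given as row slices and the variable names are illustrative:

```python
import numpy as np

def project_blocks(A, blocks, ranks):
    """Project each block of rows A_i onto {rank <= R_i}, independently.

    `blocks` is a list of row slices and `ranks` the bounds R_1, ..., R_l;
    rows outside every block are left unconstrained.
    """
    A = A.copy()
    for rows, R in zip(blocks, ranks):
        U, s, Vt = np.linalg.svd(A[rows], full_matrices=False)
        s[R:] = 0.0
        A[rows] = (U * s) @ Vt
    return A

# E.g. with the blocks of Section 4.5.4 (hour of the week, day of the year):
# A = project_blocks(A, [slice(0, 168), slice(168, 200)], [20, 20])
```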

4.5.4 Experiments with partially low-rank models

So far, we have not obtained satisfactory empirical results with a rank constraint on the whole matrix like in Equation (4.40). Although the difference between the results with and without a low-rank constraint on the entire matrix of coefficients is less significant when working with middle-term models, we have concluded that this constraint is not relevant.

Instead, we focus directly on the partially low-rank models of Equation (4.46).

Again, we could not improve the generalization performance of the local models with low-rank constraints. Still, we have considerably reduced the number of degrees of freedom for a minor degradation of the performance.

Indeed, in Figure 4.5, we compare the performances of the local models without any rank constraints with the partially low-rank problem where the block of coefficients related to the hour of the week is constrained to be of rank $r_h$ and the block related to the day of the year is constrained to be of rank $r_d$. The best results with the constraints are obtained for $r_h = r_d = 20$. Given that for each of the $K = 1751$ substations there are $p_h = 168$ coefficients for the hour of the week and $p_d = 32$ for the day of the year, the unconstrained model has about $(p_h + p_d) K \approx 350\,000$ degrees of freedom for these inputs, while the constrained model has approximately $(p_h + K) r_h + (p_d + K) r_d \approx 74\,000$.
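These counts follow from storing each constrained block in factored form, a rank-$r$ factorization of a $p \times K$ block requiring about $(p + K) r$ parameters:

```python
ph, pd, K = 168, 32, 1751   # coefficients per substation and number of substations
rh = rd = 20                # best rank constraints in Figure 4.5

unconstrained = (ph + pd) * K                  # 350_200 parameters
constrained = (ph + K) * rh + (pd + K) * rd    # 74_040 parameters
print(unconstrained, constrained)
```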

The penalizations that we have used to obtain Figure 4.5 are the same as in Chapter 3. Unfortunately, the group-Lasso regularization that we have introduced in Equation (4.44) does not lead to better results. It was effectively introduced to enable a potential variable selection procedure, which is not useful here since, first, the past loads and the weather stations have already been selected in Section 3.6.5 and, secondly, the family of features introduced for each input in Section 3.1 is not redundant.

While it is disappointing not to improve the generalization performance, this result still proves that the number of degrees of freedom in the independent models is unnecessarily large. Besides, the analysis of the estimated low-rank matrix is a potential way of understanding the underlying structure.
