HAL Id: hal-00654753

https://hal.archives-ouvertes.fr/hal-00654753

Submitted on 22 Dec 2011

To cite this version: Nathalie Villa-Vialaneix, Marco Follador, Marco Ratto, Adrian Leip. A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops. Environmental Modelling and Software, Elsevier, 2012, 34, pp. 51-66. doi:10.1016/j.envsoft.2011.05.003. hal-00654753.

A comparison of eight metamodeling techniques for the simulation of N2O fluxes and N leaching from corn crops

Nathalie Villa-Vialaneix (a,b,*), Marco Follador (c), Marco Ratto (d), Adrian Leip (c)

(a) IUT de Perpignan (Dpt STID, Carcassonne), Univ. Perpignan Via Domitia, France
(b) Institut de Mathématiques de Toulouse, Université de Toulouse, France
(c) European Commission, Institute for Environment and Sustainability, CCU, Ispra, Italy
(d) European Commission, Econometrics and Applied Statistics Unit, Ispra, Italy

Abstract

The environmental costs of intensive farming activities are often underestimated or not traded by the market, even though they play an important role in addressing future society's needs. The estimation of nitrogen (N) dynamics is thus an important issue which demands detailed simulation-based methods and their integrated use to correctly represent the complex and nonlinear interactions in cropping systems. To calculate the N2O flux and N leaching from European arable lands, a modeling framework has been developed by linking the CAPRI agro-economic dataset with the DNDC-EUROPE bio-geo-chemical model. But, despite the great power of modern computers, the use of such detailed models at continental scale is often too computationally costly. By comparing several statistical methods, this paper aims to design a metamodel able to approximate the expensive code of the detailed modeling approach, devising the best compromise between estimation performance and simulation speed.

We describe the use of two parametric (linear) models and six nonparametric approaches: two methods based on splines (ACOSSO and SDR), one method based on kriging (DACE), a neural network method (multilayer perceptron, MLP), SVM and a bagging method (random forest, RF). This analysis shows that, as long as few data are available to train the model, splines approaches lead to the best results, while when the size of the training dataset increases, SVM and RF provide faster and more accurate solutions.

(*) Corresponding author. Email address: nathalie.villa@math.univ-toulouse.fr (Nathalie Villa-Vialaneix)

Keywords: Metamodeling, splines, SVM, neural network, random forest, N2O flux, N leaching, agriculture

1. Introduction

The impact of modern agriculture on the environment is well documented (Power, 2010; Tilman et al., 2002; Scherr and Sthapit, 2009; FAO, 2007, 2005; Singh, 2000; Matson et al., 1997). Intensive farming has a high consumption of nitrogen, which is often inefficiently used, particularly in livestock production systems (Leip et al., 2011b; Webb et al., 2005; Oenema et al., 2007; Chadwick, 2005). This leads to a large surplus of nitrogen which is lost to the environment. Up to 95% of ammonia emissions in Europe originate from agricultural activities (Kirchmann et al., 1998; Leip et al., 2011a), contributing to eutrophication, loss of biodiversity and health problems. Besides NH3, nitrate leaching below the soil root zone and entering the groundwater poses a particular problem for the quality of drinking water (van Grinsven et al., 2006). Additionally, the agricultural sector is the major source of anthropogenic emissions of N2O from soils, mainly as a consequence of the application of mineral fertilizer or manure nitrogen (Del Grosso et al., 2006; Leip et al., 2011c; European Environment Agency, 2010; Leip et al., 2005). N2O is a potent greenhouse gas (GHG): each kilogram emitted contributes about 300 times more to global warming than the same mass emitted as CO2, on the basis of a 100-year time horizon (Intergovernmental Panel on Climate Change, 2007).

Various European legislative acts attempt to reduce the environmental impact of the agricultural sector, particularly the Nitrates Directive (European Council, 1991) and the Water Framework Directive (European Council, 2000). Initially, however, compliance with these directives was poor (Oenema et al., 2009; European Commission, 2002). Therefore, with the last reform of the Common Agricultural Policy (CAP) in the year 2003 (European Council, 2003), the European Union introduced a compulsory Cross-Compliance (CC) mechanism to improve compliance with 18 environmental, food safety, animal welfare, and animal and plant health standards (Statutory Management Requirements, SMRs), as well as with requirements to maintain farmlands in good agricultural and environmental condition (Good Agricultural and Environment Condition requirements, GAECs), as a prerequisite for receiving direct payments (European Union Commission, 2004; European Council, 2009; European Union Commission, 2009; Dimopoulus et al., 2007; Jongeneel et al., 2007). The SMRs are based on pre-existing EU Directives and Regulations such as the Nitrates Directive. The GAECs focus on soil erosion, soil organic matter, soil structure and a minimum level of maintenance; for each of these issues a number of standards are listed (Alliance Environnement, 2007).

It nevertheless remains a challenge to monitor compliance and to assess the impact of the cross-compliance legislation not only on the environment, but also on animal welfare, farmers' income, production levels, etc. In order to help with this task, the EU project Cross-Compliance Assessment Tool (CCAT) developed a simulation platform to provide scientifically sound and regionally differentiated responses to various farming scenarios (Elbersen et al., 2010; Jongeneel et al., 2007).

CCAT integrates complementary models to assess changes in organic carbon and nitrogen fluxes from soils (De Vries et al., 2008). Carbon and nitrogen turnover are very complex processes, characterized by a high spatial variability and a strong dependence on environmental factors such as meteorological conditions and soils (Shaffer and Ma, 2001; Zhang et al., 2002). Quantification of fluxes, and specifically a meaningful quantification of the response to mitigation measures at the regional level, requires the simulation of farm management and the soil/plant/atmosphere continuum at the highest possible resolution (Anderson et al., 2003; Leip et al., 2011c). For the simulation of N2O fluxes and N leaching, the process-based biogeochemistry model DNDC-EUROPE (Leip et al., 2008; Li et al., 1992; Li, 2000) was used. As DNDC-EUROPE is a complex model imposing high computational costs, the time needed to obtain simulation results in large-scale applications (such as the European scale) can be restrictive. In particular, the deterministic model cannot be used directly to efficiently extract estimations of the evolution of N2O fluxes and N leaching under changing conditions. Hence, there is a need for a second level of abstraction, modeling the DNDC-EUROPE model itself; this is called a metamodel (see Section 2 for a more specific definition of the concept of metamodeling). Metamodels are defined from a limited number of deterministic simulations for specific applications and/or scenarios and allow fast estimations to be obtained.

This issue is a topic of high interest that has previously been tackled in several papers: among others, (Bouzaher et al., 1993) develop a parametric model, including spatial dependency, to model water pollution. (Krysanova and Haberlandt, 2002; Haberlandt et al., 2002) describe a two-step approach to address the issue of N leaching and water pollution: they use a process-based model followed by a localization of the results with a fuzzy rule. More recently, (Pineros Garcet et al., 2006) compare RBF neural networks with kriging modeling to build a metamodel for a deterministic N leaching model called WAVE (Vanclooster et al., 1996). The present article compares in detail different modeling tools in order to select the most reliable one to metamodel the DNDC-EUROPE tasks in the CCAT project (Follador and Leip, 2009). This study differs from the work of Vanclooster et al. (1996) because of the adopted European scale and the analysis of eight metamodeling approaches (also including a kriging and a neural network method). The comparison has been based on the evaluation of metamodel performances, in terms of accuracy and computational costs, for different sizes of the training dataset.

The rest of the paper is organized as follows: Section 2 introduces the general principles and advantages of using a metamodel; Section 3 reviews in detail the different types of metamodels compared in this study; Section 4 explains the Design Of Experiments (DOE) and shows the results of the comparison, highlighting how the availability of training data can play an important role in the selection of the best type and form of the approximation. The supplementary material of this paper can be found at: http://afoludata.jrc.ec.europa.eu/index.php/dataset/detail/232.

2. From model to metamodel

A model is a simplified representation (abstraction) of reality developed for a specific goal; it may be deterministic or probabilistic. An integrated use of simulation-based models is necessary to approximate our perception of the complex and nonlinear interactions existing in human-natural systems by means of mathematical input-output (I/O) relationships. Despite the continuous increase of computer performance, the development of large simulation platforms often remains prohibitive because of computational needs and parametrization constraints. More precisely, every model in a simulation platform such as DNDC-EUROPE is characterized by several parameters, whose near-optimum set is defined during the calibration. A constraint applies restrictions to the kind of data that the model can use or to specific boundary conditions. The flux of I/O in the simulation platform can thus be impeded by the type of data/boundaries that constraints allow, or do not allow, for the models at hand.

The use of this kind of simulation platform is therefore not recommended for applications which require many runs, such as sensitivity analyses or what-if studies. To overcome this limit, the process of abstraction can be applied to the model itself, obtaining a model of the model (a 2nd level of abstraction from reality) called a metamodel (Blanning, 1975; Kleijnen, 1975; Sacks et al., 1989; van Gigch, 1991; Santner et al., 2003). A metamodel is an approximation of the detailed model's I/O transformations, built through a moderate number of computer experiments.

Replacing a detailed model with a metamodel generally brings some payoffs (Britz and Leip, 2009; Simpson et al., 2001):

• easier integration into other processes and simulation platforms;

• faster execution and reduced storage needs to estimate one specific output;

• easier applicability across different spatial and/or temporal scales and site-specific calibrations, as long as data corresponding to the new system parametrization are available.

As a consequence, a higher number of simulation runs becomes possible: using its interpolatory action makes a thorough sensitivity analysis more convenient and leads to a better understanding of I/O relationships. A metamodel also usually offers a higher flexibility and can quickly be adapted to achieve a wide range of goals (prediction, optimization, exploration, validation). However, despite these advantages, metamodels suffer from a few drawbacks: internal variables or outputs not originally considered cannot be inspected, and prediction for input regimes outside the training/test set is impossible. Hence, a good metamodeling methodology should be able to provide fast predictions. But, considering these limitations, it must also have a low computational cost, so that a new metamodel can be built from a new data set including new variables and/or a different range for these input variables.

Let $(X, y)$ be the dataset consisting of $N$ input/output pairs $(x_i, y_i)$, where $x_i = (x_i^1, \ldots, x_i^d)^T \in \mathbb{R}^d$ ($i = 1, \ldots, N$) are the model inputs and $y_i \in \mathbb{R}$ ($i = 1, \ldots, N$) are the model responses for $N$ experimental runs of the simulation platform. The mathematical representation of the I/O relationships described by the detailed model can be written as

$$y_i = f(x_i), \quad i = 1, \ldots, N \qquad (1)$$

which corresponds to a first abstraction from the real system. From the values of $X$ and $y$, also called the training set, $f$ is approximated by a function $\hat{f}: \mathbb{R}^d \to \mathbb{R}$, called the metamodel, whose responses can be written as

$$\hat{y}_i = \hat{f}(x_i),$$

and which correspond to a second abstraction from reality. In this second abstraction, some of the input variables of Eq. (1) might not be useful, and one issue of metamodeling can be to find the smallest subset of input variables needed to achieve a good approximation of model (1).

Finally, the difference between the real system and the metamodel response will be the sum of two approximations (Simpson et al., 2001): the first one introduced by the detailed model (1st abstraction) and the second one due to metamodeling (2nd abstraction). Of course, the validity and accuracy of a metamodel are conditioned by the validity of the original model: in the following, it is thus supposed that the 1st level of abstraction induces a small error compared to reality. Then, in this paper, we only focus on the second error, $|\hat{y}_i - y_i|$, to assess the performance of the different metamodels vs. the detailed DNDC-EUROPE model, in order to select the best statistical approach to approximate the complex bio-geo-chemical model at a lower computational cost. Defining a correct metamodeling strategy is very important to provide an adequate fitting to the model, as suggested by (Kleijnen and Sargent, 2000; Meckesheimer et al., 2002).

Recent works, such as (Forrester and Keane, 2009; Wang and Shan, 2007), review the most widely used metamodeling methods: splines-based methods (e.g., MARS, kriging) (Wahba, 1990; Friedman, 1991; Cressie, 1990), neural networks (Bishop, 1995), kernel methods (SVM, SVR) (Vapnik, 1998; Christmann and Steinwart, 2007) and Gaussian processes such as GEM (Kennedy and O'Hagan, 2001), among others. Some of these metamodeling strategies were selected and others added to be compared in this paper. The comparison is made on a specific case study related to N leaching and N2O flux prediction, which is described in Section 4. The next section briefly describes each of the metamodels compared in this paper.

3. Review of the selected metamodels

Several methods were developed and compared to assess their performance according to increasing dataset sizes. We provide a brief description of the approaches studied in this paper: two linear models (Section 3.1) and six nonparametric methods (two based on splines, in Sections 3.2.1 and 3.2.2, one based on a kriging approach, in Section 3.2.3, which is known to be efficient when analyzing computer experiments, a neural network method, in Section 3.2.4, SVM, in Section 3.2.5, and random forest, in Section 3.2.6).

3.1. Linear methods

The easiest way to handle the estimation of the model given in Eq. (1) is to suppose that $f$ has a simple parametric form. For example, the linear model supposes that $f(x) = \beta^T x + \beta_0$, where $\beta \in \mathbb{R}^d$ is a vector and $\beta_0$ is a real number, both of which have to be estimated from the observations $((x_i, y_i))_i$. An estimate is given by minimizing the sum of squared errors

$$\sum_{i=1}^{N} \left( y_i - (\beta^T x_i + \beta_0) \right)^2,$$

which leads to $\hat{\beta} = (X^T X)^{-1} X^T y$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}^T \bar{X}$, with $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ and $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} x_i$.

In this paper two linear models were used:

• in the first one, the explanatory variables were the 11 inputs described in Section 4.2. This model is referred to as "LM1";

• the second one was developed starting from the work of (Britz and Leip, 2009) and includes the 11 inputs of Section 4.2 as well as their nonlinear transformations (square, square root, logarithm) and interaction components. A total of 120 coefficients were involved in this approach, which is denoted by "LM2". Transformations and combinations of the 11 inputs were included in an attempt to better capture possible nonlinearities of the original model.

In the second case, due to the large number of explanatory variables, the model can be over-specified, especially if the training set is small. Indeed, if the matrix of explanatory variables, $X$, has a large dimension, $X^T X$ can be non-invertible or ill-conditioned (leading to numerical instability). Hence, a stepwise selection based on the AIC criterion (Akaike, 1974) has been used to select an optimal subset of explanatory variables during the training step, in order to obtain an accurate solution with a small number of parameters. This has been performed by using the stepAIC function of the R package MASS.
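As a minimal sketch only (not the authors' exact script), the two linear metamodels can be reproduced in R along the following lines; the data frame names train and test, the output column y and the input column names used in the transformations (e.g. N_FR, SOC) are assumptions made for illustration.

    # LM1: plain linear model on the 11 raw inputs
    library(MASS)
    lm1 <- lm(y ~ ., data = train)

    # LM2-style model: main effects, pairwise interactions and a few nonlinear
    # transformations, pruned by stepwise selection on the AIC criterion
    lm2_full <- lm(y ~ .^2 + I(N_FR^2) + sqrt(N_FR) + log(SOC + 1e-6), data = train)
    lm2 <- stepAIC(lm2_full, direction = "both", trace = FALSE)

    # Fast predictions on new observations
    y_hat <- predict(lm2, newdata = test)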


3.2. Nonparametric methods

In many modeling problems, linear methods are not sufficient to capture the complexity of a phenomenon which is, per se, nonlinear. In such situations, nonparametric methods are often better suited to obtain accurate approximations of the phenomenon under study. In this section, six nonparametric approaches are described; they are compared in Section 4 to model N2O fluxes and N leaching.

3.2.1. ACOSSO

Among nonparametric estimation approaches, smoothing splines (Wahba, 1990; Gu, 2002) are among the most famous and widely used. Recently, (Storlie et al., 2011) presented ACOSSO, an adaptive approach based on the COSSO method (Lin and Zhang, 2006), which is in the same line as smoothing splines: it is described as "a new regularization method for simultaneous model fitting and variable selection in nonparametric regression models in the framework of smoothing spline ANOVA". This method penalizes the sum of component norms, instead of the squared norm employed in the traditional smoothing spline method. More precisely, in splines metamodeling, it is useful to consider the ANOVA decomposition of $f$ into terms of increasing dimensionality:

$$f(x) = f(x^1, x^2, \ldots, x^d) = f_0 + \sum_j f^{(j)} + \sum_{k>j} f^{(jk)} + \ldots + f^{(12\ldots d)} \qquad (2)$$

where $x^j$ is the $j$-th explanatory variable and where each term is a function only of the factors in its index, i.e. $f^{(j)} = f(x^j)$, $f^{(jk)} = f(x^j, x^k)$ and so on. The terms $f^{(j)}$ represent the additive part of the model $f$, while all higher order terms $f^{(jk)}, \ldots, f^{(12\ldots d)}$ are denoted as "interactions". The simplest example of a smoothing spline ANOVA model is the additive model, where only the $(f^{(j)})_{j=0,\ldots,d}$ are used.

To estimate $f$, we make the usual assumption that $f \in \mathcal{H}$, where $\mathcal{H}$ is a RKHS (Reproducing Kernel Hilbert Space) (Berlinet and Thomas-Agnan, 2004). The space $\mathcal{H}$ can be written as an orthogonal decomposition $\mathcal{H} = \{1\} \oplus \{\bigoplus_{j=1}^{q} \mathcal{H}^j\}$, where each $\mathcal{H}^j$ is itself a RKHS, $\oplus$ is the direct sum of Hilbert spaces and $j = 1, \ldots, q$ spans ANOVA terms of various orders. Typically $q$ includes the main effects plus relevant interaction terms. $f$ is then estimated by the $\hat{f}$ that minimizes a criterion which is a trade-off between accuracy to the data (empirical mean squared error) and a penalty which aims at minimizing each ANOVA term:

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 + \lambda_0 \sum_{j=1}^{q} \frac{1}{\theta_j} \| P^j \hat{f} \|_{\mathcal{H}}^2 \qquad (3)$$

where $P^j \hat{f}$ is the orthogonal projection of $\hat{f}$ onto $\mathcal{H}^j$ and the $q$-dimensional vector of smoothing parameters $(\theta_j)_j$ needs to be tuned in such a way that each ANOVA component has the most appropriate degree of smoothness.

This statistical estimation problem requires the tuning of the $q$ hyper-parameters $\theta_j$ (the ratios $\lambda_0/\theta_j$ are also denoted as smoothing parameters). Various ways of doing this are available in the literature, by applying generalized cross-validation (GCV), generalized maximum likelihood procedures (GML) and so on (Wahba, 1990; Gu, 2002). But, in Eq. (3), $q$ is often large and the tuning of all the $\theta_j$ is a formidable problem, implying that in practice the problem is simplified by setting $\theta_j$ to 1 for every $j$, so that only $\lambda_0$ is tuned. This simplification, however, strongly limits the flexibility of the smoothing spline model, possibly leading to poor estimates of the ANOVA components.

Problem (3) also poses the issue of the selection of the $\mathcal{H}^j$ terms: this is tackled rather effectively within the COSSO/ACOSSO framework. The COSSO (Lin and Zhang, 2006) penalizes the sum of norms, using a LASSO-type penalty (Tibshirani, 1996) for the ANOVA model: LASSO penalties are $L_1$ penalties that lead to sparse parameters (i.e., parameters whose coordinates are all equal to zero except for a few ones). Hence, using this kind of penalty allows us to automatically select the most informative predictor terms $\mathcal{H}^j$, with an estimate $\hat{f}$ that minimizes

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 + \lambda \sum_{j=1}^{Q} \| P^j \hat{f} \|_{\mathcal{H}} \qquad (4)$$

using a single smoothing parameter $\lambda$, and where $Q$ includes all ANOVA terms potentially included in $\hat{f}$, e.g. with a truncation at 2nd or 3rd order interactions.

It can be shown that the COSSO estimate is also the minimizer of

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 + \sum_{j=1}^{Q} \frac{1}{\theta_j} \| P^j \hat{f} \|_{\mathcal{H}}^2 \qquad (5)$$

subject to $\sum_{j=1}^{Q} 1/\theta_j < M$ (where there is a one-to-one mapping between $M$ and $\lambda$). So we can think of the COSSO penalty as the traditional smoothing spline penalty plus a penalty on the $Q$ smoothing parameters used for each component. This can also be framed as a linear-quadratic problem, i.e. a quadratic objective (5) plus a linear constraint on the $1/\theta_j$. The LASSO-type penalty has the effect of setting some of the functional components ($\mathcal{H}^j$'s) equal to zero (e.g. some variables $x^j$ and some interactions $(x^j, x^k)$ are not included in the expression of $\hat{f}$). Thus it "automatically" selects the appropriate subset $q$ of terms out of the $Q$ "candidates". The key property of COSSO is that with one single smoothing parameter ($\lambda$ or $M$) it provides estimates of all $\theta_j$ parameters in one shot: it therefore improves considerably on the simplified version of problem (3) with $\theta_j = 1$ (still with one single smoothing parameter $\lambda_0$) and is much more computationally efficient than the full problem (3) with optimized $\theta_j$'s. An additional improvement of the COSSO is that the single smoothing parameter $\lambda$ can be tuned to minimize the BIC (Bayesian Information Criterion) (Schwarz, 1978), thus allowing to target the most appropriate degree of parsimony of the metamodel. This is done by a simple grid-search algorithm as follows (see (Lin and Zhang, 2006) for details):

1. for each trial $\lambda$ value, the COSSO estimate provides the corresponding values for $\theta_j$ and subsequently its BIC;

2. the grid-search algorithm then provides the $\hat{\lambda}$ with the smallest BIC.

The adaptive COSSO (ACOSSO) of (Storlie et al., 2011) is an improvement of the COSSO method: in ACOSSO, $\hat{f} \in \mathcal{H}$ minimizes

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 + \lambda \sum_{j=1}^{q} w_j \| P^j \hat{f} \|_{\mathcal{H}} \qquad (6)$$

where $0 < w_j \le \infty$ are weights that depend on an initial estimate, $\hat{f}^{(0)}$, of $f$, obtained either using (3) with $\theta_j = 1$ or from the COSSO estimate (4). The adaptive weights are obtained as $w_j = \| P^j \hat{f}^{(0)} \|_{L_2}^{-\gamma}$, typically with $\gamma = 2$ and the $L_2$ norm $\| P^j \hat{f}^{(0)} \|_{L_2} = \left( \int (P^j \hat{f}^{(0)}(x))^2 \, dx \right)^{1/2}$. The use of adaptive weights improves the predictive capability of ANOVA models with respect to the COSSO case: in fact, it allows for more flexibility in estimating important functional components while giving a heavier penalty to unimportant functional components. The R scripts for ACOSSO can be found at http://www.stat.lanl.gov/staff/CurtStorlie/index.html. In the present paper we used a MATLAB translation of this R script. The algorithm for tuning the hyper-parameters is then modified as follows:

1. an initial estimate of the ANOVA model, $\hat{f}^{(0)}$, is obtained either using (3) with $\theta_j = 1$ or from the COSSO estimate (4);

2. given this trial ANOVA model $\hat{f}^{(0)}$, the weights are computed as $w_j = \| P^j \hat{f}^{(0)} \|_{L_2}^{-\gamma}$;

3. given $w_j$ and for each trial $\lambda$ value, the ACOSSO estimate (6) provides the corresponding values for $\theta_j$ and subsequently its BIC;

4. the grid-search algorithm then provides the $\hat{\lambda}$ with the smallest BIC.
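ACOSSO itself is distributed as the R/MATLAB scripts referenced above. As a rough, hedged illustration of the underlying smoothing spline ANOVA idea only (it does not implement the COSSO component-norm penalty or the adaptive weights), an additive spline model with per-term smoothing parameters can be fitted with the mgcv package; the column names below are assumptions for illustration.

    library(mgcv)

    # Additive smoothing-spline terms plus one tensor-product interaction term,
    # each with its own smoothing parameter selected automatically
    ss_anova <- gam(y ~ s(N_FR) + s(Rain) + s(SOC) + ti(N_FR, Rain),
                    data = train, method = "GCV.Cp")
    y_hat <- predict(ss_anova, newdata = test)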

3.2.2. SDR-ACOSSO

In a "parallel" stream of research with respect to COSSO-ACOSSO, using the state-dependent parameter regression (SDR) approach of (Young, 2001), (Ratto et al., 2007) have developed a non-parametric approach, very similar to smoothing splines and kernel regression methods, based on recursive filtering and smoothing estimation (the Kalman filter combined with "fixed interval smoothing"). Such a recursive least-squares implementation has some key characteristics: (a) it is combined with optimal maximum likelihood estimation, thus allowing for an estimation of the smoothing hyper-parameters based on the estimation of a quality criterion rather than on cross-validation, and (b) it provides greater flexibility in adapting to local discontinuities, heavy non-linearity and heteroscedastic error terms. Recently, (Ratto and Pagano, 2010) proposed a unified approach to smoothing spline ANOVA models that combines the best of SDR and ACOSSO: the use of the recursive algorithms in particular can be very effective in identifying the important functional components and in providing good estimates of the weights $w_j$ to be used in (6), adding valuable information to the ACOSSO framework and in many cases improving ACOSSO performance. The Matlab script for this method can be found at http://eemc.jrc.ec.europa.eu/Software-SS_ANOVA_R.htm.

We summarize here the key features of Young's recursive algorithms for SDR by considering the case $d = 1$ and $f(x^1) = f^{(1)}(x^1) + e$, with $e \sim N(0, \sigma^2)$. To do so, we rewrite the smoothing problem as $y_i = s_i^1 + e_i$, where $i = 1, \ldots, N$ and $s_i^1$ is the estimate of $f^{(1)}(x_i^1)$. To make the recursive approach meaningful, the MC sample needs to be sorted in ascending order with respect to $x^1$: i.e. the subscripts $i$ and $i-1$ denote adjacent elements under such an ordering, implying $x_1^1 < x_2^1 < \ldots < x_i^1 < \ldots < x_N^1$.

To recursively estimate the $s_i^1$ in SDR, it is necessary to characterize it in some stochastic manner, borrowing from non-stationary time series processes (Young and Ng, 1989; Ng and Young, 1990). In the present context, the integrated random walk (IRW) process provides the same smoothing properties as a cubic spline, in the overall state-space formulation:

Observation equation: $y_i = s_i^1 + e_i$
State equations: $s_i^1 = s_{i-1}^1 + d_{i-1}^1$, $\quad d_i^1 = d_{i-1}^1 + \eta_i^1$ \qquad (7)

where $d_i^1$ is the "slope" of $s_i^1$, $\eta_i^1 \sim N(0, \sigma_{\eta_1}^2)$, and $\eta_i^1$ (the "system disturbance" in systems terminology) is assumed to be independent of the "observation noise" $e_i \sim N(0, \sigma^2)$.

Given the ascending ordering of the MC sample, $s_i^1$ can be estimated by using the recursive Kalman Filter (KF) and the associated recursive Fixed Interval Smoothing (FIS) algorithm (see e.g. (Kalman, 1960; Young, 1999) for details). First, it is necessary to optimize the hyper-parameter associated with the state-space model (7), namely the Noise Variance Ratio (NVR), where $\mathrm{NVR}_1 = \sigma_{\eta_1}^2 / \sigma^2$. This is accomplished by maximum likelihood optimization (ML) using prediction error decomposition (Schweppe, 1965). The NVR plays the inverse role of a smoothing parameter: the smaller the NVR, the smoother the estimate of $s_i^1$. Given the NVR, the FIS algorithm then yields an estimate $\hat{s}_{i|N}^1$ of $s_i^1$ at each data sample, and it can be seen that the $\hat{s}_{i|N}^1$ from the IRW process is the equivalent of $\hat{f}^{(1)}(x_i^1)$ in the cubic smoothing spline model. At the same time, the recursive procedures provide, in a natural way, standard errors of the estimated $\hat{s}_{i|N}^1$ that allow for testing their relative significance. Finally, it can be easily verified (Ratto and Pagano, 2010) that by setting $\lambda/\theta_1 = 1/(\mathrm{NVR}_1 \cdot N^4)$, and with evenly spaced $x_i^1$ values, the $\hat{f}^{(1)}(x_i^1)$ estimate in the cubic smoothing spline model equals the $\hat{s}_{i|N}^1$ estimate from the IRW process.
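As a hedged illustration (not the authors' SDR/Matlab code), the univariate IRW smoothing step of Eq. (7) can be reproduced with the R package dlm: sort the sample by $x^1$, set up an integrated-random-walk state-space model, estimate the NVR by maximum likelihood and run the Kalman filter / fixed interval smoother. The vectors x and y are assumed names.

    library(dlm)

    ord <- order(x)                      # the recursion runs over x sorted in ascending order
    y_s <- y[ord]

    # IRW model: second-order polynomial DLM with a disturbance on the slope only;
    # the observation variance is fixed to 1 so that exp(par) plays the role of the NVR
    build_irw <- function(par) dlmModPoly(order = 2, dV = 1, dW = c(0, exp(par)))

    fit <- dlmMLE(y_s, parm = log(1e-3), build = build_irw)   # ML estimate of the NVR
    sm  <- dlmSmooth(y_s, build_irw(fit$par))                 # fixed interval smoothing
    s_hat <- dropFirst(sm$s)[, 1]        # smoothed estimate of f(x^1) at the sorted points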

The most interesting aspect of the SDR approach is that it is not limited to the univariate case, but can be effectively extended to the most relevant multivariate one. In the general additive case, for example, the recursive procedure needs to be applied, in turn, to each term $f^{(j)}(x_i^j) = \hat{s}_{i|N}^j$, requiring a different sorting strategy for each $\hat{s}_{i|N}^j$. Hence the "back-fitting" procedure is applied, as described in (Young, 2000) and (Young, 2001). This procedure provides both ML estimates of all the $\mathrm{NVR}_j$'s and the smoothed estimates of the additive terms $\hat{s}_{i|N}^j$. The estimated $\mathrm{NVR}_j$'s can then be converted into $\lambda_0/\theta_j$ values using $\lambda_0/\theta_j = 1/(\mathrm{NVR}_j \cdot N^4)$, allowing us to put the additive model into the standard cubic spline form.

In the SDR context, (Ratto and Pagano, 2010) formalized an interaction function as the product of two states, $s^1 \cdot s^2$, each of them characterized by an IRW stochastic process. Hence the estimation of a single interaction term $y_i = f^{(12)}(x_i^1, x_i^2) + e_i$ is expressed as:

Observation equation: $y_i = s_{1,i}^I \cdot s_{2,i}^I + e_i$
State equations ($j = 1, 2$): $s_{j,i}^I = s_{j,i-1}^I + d_{j,i-1}^I$, $\quad d_{j,i}^I = d_{j,i-1}^I + \eta_{j,i}^I$ \qquad (8)

where $y$ is the model output after having taken out the main effects, $I = 1,2$ is the multi-index denoting the interaction term under estimation and $\eta_{j,i}^I \sim N(0, \sigma_{\eta_j^I}^2)$. The two terms $s_{j,i}^I$ are estimated iteratively by running the recursive procedure in turn.

The SDR recursive algorithms are usually very efficient in identifying each ANOVA component individually in the most appropriate way; hence (Ratto and Pagano, 2010) proposed to exploit this in the ACOSSO framework as follows.

We define $K_{\langle j \rangle}$ to be the reproducing kernel (r.k.) of an additive term $\mathcal{F}^j$ of the ANOVA decomposition of the space $\mathcal{F}$. In the cubic spline case, this is constructed as the sum of two terms, $K_{\langle j \rangle} = K_{01\langle j \rangle} \oplus K_{1\langle j \rangle}$, where $K_{01\langle j \rangle}$ is the r.k. of the parametric (linear) part and $K_{1\langle j \rangle}$ is the r.k. of the purely non-parametric part. The second order interaction terms are constructed as the tensor product of the first order terms, for a total of four elements, i.e.

$$K_{\langle i,j \rangle} = (K_{01\langle i \rangle} \oplus K_{1\langle i \rangle}) \otimes (K_{01\langle j \rangle} \oplus K_{1\langle j \rangle}) = (K_{01\langle i \rangle} \otimes K_{01\langle j \rangle}) \oplus (K_{01\langle i \rangle} \otimes K_{1\langle j \rangle}) \oplus (K_{1\langle i \rangle} \otimes K_{01\langle j \rangle}) \oplus (K_{1\langle i \rangle} \otimes K_{1\langle j \rangle}) \qquad (9)$$

This suggests that a natural use of the SDR identification and estimation in the ACOSSO framework is to apply specific weights to each element of the r.k. $K_{\langle \cdot,\cdot \rangle}$ in (9). In particular, the weights are the $L_2$ norms of each of the four elements estimated in (8):

$$\hat{s}_i^I \cdot \hat{s}_j^I = \hat{s}_{01\langle i \rangle}^I \hat{s}_{01\langle j \rangle}^I + \hat{s}_{01\langle i \rangle}^I \hat{s}_{1\langle j \rangle}^I + \hat{s}_{1\langle i \rangle}^I \hat{s}_{01\langle j \rangle}^I + \hat{s}_{1\langle i \rangle}^I \hat{s}_{1\langle j \rangle}^I. \qquad (10)$$

As shown in (Ratto and Pagano, 2010), this choice can lead to a significant improvement in the accuracy of ANOVA models with respect to the original ACOSSO approach. Overall, the algorithm for tuning the hyper-parameters in the combined SDR-ACOSSO reads:

1. the recursive SDR algorithm is applied to get an initial estimate of each ANOVA term in turn (back-fitting algorithm);

2. the weights are computed as the $L_2$ norms of the parametric and non-parametric parts of the cubic spline estimates;

3. given $w_j$ and for each trial $\lambda$ value, the ACOSSO estimate (6) provides the corresponding values for $\theta_j$ and subsequently its BIC;

4. the grid-search algorithm then provides the $\hat{\lambda}$ with the smallest BIC.

3.2.3. Kriging metamodel: DACE

DACE (Lophaven et al., 2002) is a Matlab toolbox used to construct kriging approximation models on the basis of data coming from computer experiments. Once we have this approximate model, we can use it as a metamodel (emulator, surrogate model). We briefly highlight the main features of DACE. The kriging model can be expressed as a regression

$$\hat{f}(x) = \beta_1 \phi_1(x) + \cdots + \beta_q \phi_q(x) + \zeta(x) \qquad (11)$$

where the $\phi_j$, $j = 1, \ldots, q$, are deterministic regression terms (constant, linear, quadratic, etc.), the $\beta_j$ are the related regression coefficients and $\zeta$ is a zero-mean random process whose variance depends on the process variance $\omega^2$ and on the correlation $R(v, w)$ between $\zeta(v)$ and $\zeta(w)$. In kriging, product correlation functions are typically used, defined as:

$$R(\theta, v - w) = \prod_{j=1}^{d} R_j(\theta_j, w_j - v_j).$$

In particular, for the generalized exponential correlation function used in the present paper, one has

$$R_j(\theta_j, w_j - v_j) = \exp\left(-\theta_j |w_j - v_j|^{\theta_{d+1}}\right).$$

Then, we can define $R$ as the correlation matrix at the training points (i.e., the matrix with coordinates $r_{i,j} = R(\theta, x_i - x_j)$) and the vector $r_x = [R(\theta, x_1 - x), \ldots, R(\theta, x_N - x)]$, $x$ being an untried point. Similarly, we define the vector $\phi_x = [\phi_1(x), \ldots, \phi_q(x)]^T$ and the matrix $\Phi = [\phi_{x_1} \cdots \phi_{x_N}]^T$ (i.e., $\Phi$ stacks in matrix form all values of $\phi_x$ at the training points). Then, considering the linear regression problem $\Phi \beta \approx y$ coming from Eq. (11), with parameter $\beta = [\beta_1, \ldots, \beta_q]^T \in \mathbb{R}^q$, the GLS solution is given by

$$\beta = (\Phi^T R^{-1} \Phi)^{-1} \Phi^T R^{-1} y,$$

which gives the predictor at an untried $x$:

$$\hat{f}(x) = \phi_x^T \beta + r_x^T \gamma,$$

where $\gamma$ is the $N$-dimensional vector computed as $\gamma = R^{-1}(y - \Phi \beta)$.

The proper estimation of the kriging metamodel requires, of course, optimizing the hyper-parameters $\theta$ of the correlation function: this is typically performed by maximum likelihood. It is easy to check that the kriging predictor interpolates the data at $x_j$ if the latter is a training point.

It seems useful to underline that one major difference between DACE and ANOVA smoothing is the absence of any "observation error" in (11). This is a natural choice when analyzing computer experiments and it aims to exploit the "zero-uncertainty" feature of this kind of data. This, in principle, makes the estimation of kriging metamodels very efficient, as confirmed by the many successful applications described in the literature, and justifies the great success of this kind of metamodel among practitioners. It also seems interesting to mention the so-called "nugget" effect, which is also used in the kriging literature (Montès, 1994; Kleijnen, 2009). This is nothing other than a "small" error term in (11), and it often reduces some of the numerical problems encountered in the estimation of kriging metamodels of the form (11). The addition of a nugget term leads to kriging metamodels that smooth, rather than interpolate, making them more similar to the other metamodels presented here.
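A minimal sketch of the kriging predictor defined by the formulas above (not the DACE toolbox itself) can be written directly in R, assuming a constant regression term $\phi(x) = 1$, fixed user-supplied correlation parameters theta, a fixed exponent p, and a small nugget added for numerical stability; X, y and Xnew are assumed objects (input matrices and output vector).

    # Generalized exponential correlation between the rows of A and B
    gen_exp_corr <- function(A, B, theta, p = 2) {
      R <- matrix(1, nrow(A), nrow(B))
      for (j in seq_along(theta)) {
        D <- abs(outer(A[, j], B[, j], "-"))
        R <- R * exp(-theta[j] * D^p)
      }
      R
    }

    krige_fit <- function(X, y, theta, nugget = 1e-8) {
      R  <- gen_exp_corr(X, X, theta) + diag(nugget, nrow(X))
      Ri <- solve(R)
      Phi   <- matrix(1, nrow(X), 1)                               # constant trend
      beta  <- solve(t(Phi) %*% Ri %*% Phi, t(Phi) %*% Ri %*% y)   # GLS solution
      gamma <- Ri %*% (y - Phi %*% beta)
      list(X = X, theta = theta, beta = beta, gamma = gamma)
    }

    krige_predict <- function(fit, Xnew) {
      r <- gen_exp_corr(Xnew, fit$X, fit$theta)                    # correlations r_x
      as.vector(matrix(1, nrow(Xnew), 1) %*% fit$beta + r %*% fit$gamma)
    }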

3.2.4. Multilayer perceptron

"Neural network" is a general name for statistical methods dedicated to data mining. They consist of a combination of simple computational elements (neurons or nodes) densely interconnected through synapses. The number and organization of the neurons and synapses define the network topology. One of the most popular neural network classes is the "multilayer perceptron" (MLP), commonly used to solve a wide range of classification and regression problems. In particular, MLP are known to be able to approximate any (smooth enough) complex function (Hornik, 1991). Perceptrons were introduced at the end of the 1950s by Rosenblatt, but they only became very appealing more recently, thanks to the soaring computational capacities of computers. The works of (Ripley, 1994) and (Bishop, 1995) provide a general description of these methods and their properties.

For the experiments presented in Section 4.3, one-hidden-layer perceptrons were used. They can be expressed as functions of the form

$$f_w: x \in \mathbb{R}^p \mapsto g_1\left( \sum_{i=1}^{Q} w_i^{(2)} \, g_2\left( x^T w_i^{(1)} + w_i^{(0)} \right) + w_0^{(2)} \right)$$

where:

• $w := \left[ (w_i^{(0)})_i, ((w_i^{(1)})^T)_i, w_0^{(2)}, (w_i^{(2)})_i \right]^T$ are the parameters of the model, called weights. They have to be learned in $\mathbb{R}^Q \times (\mathbb{R}^p)^Q \times \mathbb{R} \times \mathbb{R}^Q$ during the training;

• $Q$ is a hyper-parameter indicating the number of neurons on the hidden layer;

• $g_1$ and $g_2$ are the activation functions of the neural network. Generally, in regression cases (when the outputs to be predicted are real values rather than classes), $g_1$ is the identity function (hence the outputs are a linear combination of the neurons on the hidden layer) and $g_2$ is the logistic activation function $z \mapsto \frac{e^z}{1+e^z}$.

The weights are learned in order to minimize the mean square error on the training set:

$$\hat{w} := \arg\min_w \sum_{i=1}^{N} \| y_i - f_w(x_i) \|^2. \qquad (12)$$

Unfortunately, this error is not a quadratic function of $w$ and thus no exact algorithm is available to find the global minimum of this optimization problem (and the existence of such a global minimum is not even guaranteed). Gradient-descent-based approximation algorithms are usually used to find an approximate solution, where the gradient of $w \mapsto f_w(x_i)$ is calculated by the back-propagation principle (Werbos, 1974).

Moreover, to avoid overfitting, a penalization strategy called weight decay (Krogh and Hertz, 1992) was introduced. It consists of replacing the minimization problem (12) by its penalized version:

$$\hat{w} := \arg\min_w \sum_{i=1}^{N} \| y_i - f_w(x_i) \|^2 + C \|w\|^2$$

where $C$ is the penalization parameter. The solution of this penalized mean square error problem is designed to be smoother than that given by Eq. (12). The nnet R function, provided in the R package nnet (Venables and Ripley, 2002), was used to train and test the one-hidden-layer MLP. As described in Section 4.3, a single validation approach was used to tune the hyper-parameters $Q$ and $C$, which were selected on a grid search ($Q \in \{10, 15, 20, 25, 30\}$ and $C \in \{0, 0.1, 1, 5, 10\}$).
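As a minimal, hedged sketch (not the authors' exact script), a one-hidden-layer perceptron with weight decay can be fitted as follows with the nnet package, where size corresponds to $Q$ and decay to $C$; the data frame names train and test and the output column y are assumptions, and scaling the inputs beforehand is usually advisable.

    library(nnet)

    mlp_fit <- nnet(y ~ ., data = train,
                    size = 20,        # Q: number of hidden neurons
                    decay = 1,        # C: weight decay penalty
                    linout = TRUE,    # identity activation g1 on the output
                    maxit = 500, trace = FALSE)
    y_hat <- predict(mlp_fit, newdata = test)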

3.2.5. SVM (Support Vector Machines)

SVM were introduced by (Boser et al., 1992), originally to address classification problems. Subsequently, (Vapnik, 1995) presented an application to regression problems, to predict dependent real-valued variables from given inputs. In SVM, the estimate $\hat{f}$ is chosen among the family of functions

$$f: x \in \mathbb{R}^d \mapsto \langle w, \phi(x) \rangle_{\mathcal{H}} + b$$

where $\phi$ is a function from $\mathbb{R}^d$ into a given Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$, here a RKHS, and $w \in \mathcal{H}$ and $b \in \mathbb{R}$ are parameters to be learned from the training dataset. Although several strategies have been developed to learn the parameters $w$ and $b$ (Steinwart and Christmann, 2008), we opted for the original approach, which consists of using the $\epsilon$-insensitive loss function as a quality criterion for the regression:

$$L_\epsilon(X, y, \hat{f}) = \sum_{i=1}^{N} \max\left( |\hat{f}(x_i) - y_i| - \epsilon,\; 0 \right).$$

This loss function ignores the error when it is small enough (smaller than $\epsilon$). Its main interest, compared to the usual squared error, is its robustness (see (Steinwart and Christmann, 2008) for a discussion). The SVM regression is based on the minimization of this loss function on the learning sample while penalizing the complexity of the obtained $\hat{f}$. More precisely, the idea of SVM regression is to find $w$ and $b$ solutions of:

$$\arg\min_{w, b} \; L_\epsilon(X, y, \hat{f}) + \frac{1}{C} \|w\|_{\mathcal{H}}^2 \qquad (13)$$

where the term $\|w\|_{\mathcal{H}}^2$ is the regularization term that controls the complexity of $\hat{f}$ and $C$ is the regularization parameter: when $C$ is small, $\hat{f}$ is allowed to make bigger errors in favor of a smaller complexity; if the value of $C$ is high, $\hat{f}$ makes (almost) no error on the training data, but it could have a large complexity and thus not be able to give good estimations for new observations (e.g., those of the test set). A good choice must devise a compromise between the accuracy required by the project and an acceptable metamodel complexity.

(Vapnik, 1995) demonstrates that, using the Lagrangian and Karush-Kuhn-Tucker conditions, $w$ takes the form

$$w = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\,\phi(x_i)$$

where $\alpha_i$ and $\alpha_i^*$ solve the so-called dual optimization problem:

$$\arg\max_{\alpha, \alpha^*} \left( -\frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}} - \epsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*) \right) \qquad (14)$$

subject to:

$$\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C].$$

This is a classical quadratic optimization problem that can be explicitly solved. (Keerthi et al., 2001) provide a detailed discussion of the way to compute $b$ once $w$ is found; for the sake of clarity, in this paper we skip the full description of this step.

In Eq. (14), $\phi$ is only used through the dot products $(\langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}})_{i,j}$. Hence, $\phi$ is never explicitly given but only accessed through the dot product by defining a kernel, $K$:

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}. \qquad (15)$$

This is the so-called kernel trick. As long as $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is symmetric and positive, it is ensured that an underlying Hilbert space $\mathcal{H}$ and an underlying $\phi: \mathbb{R}^d \to \mathcal{H}$ exist satisfying the relation of Eq. (15). The very common Gaussian kernel, $K_\gamma(u, v) = e^{-\gamma \|u - v\|^2}$ for a $\gamma > 0$, was used in the simulations.

Finally, three hyper-parameters have to be tuned to use SVM regression:

• $\epsilon$, the parameter of the loss function;

• $C$, the regularization parameter of the SVM;

• $\gamma$, the parameter of the Gaussian kernel.

As described in Section 4.3, a single validation approach was used to tune the hyper-parameters $C$ and $\gamma$, which were selected on a grid search ($C \in \{10, 100, 1000, 2000\}$ and $\gamma \in \{0.001, 0.01, 0.1\}$). To reduce the computational costs, and also to limit the number of hyper-parameters to the same value as in the MLP case (and thus to prevent the global method from being too flexible), we avoided tuning $\epsilon$ by setting it equal to 1, which corresponds approximately to the second decile of the target variable for each scenario and output. This choice fitted the standard proposed by (Mattera and Haykin, 1998), which suggests having a number of Support Vectors smaller than 50% of the training set. Simulations were done by using the function svm from the R package e1071, based on the libsvm library (Chang and Lin, 2001).
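As a minimal, hedged sketch (not the authors' exact script), the SVM regression described above corresponds to a call of this kind, with the Gaussian (radial) kernel and $\epsilon$ fixed to 1; the data frame names train and test and the output column y are assumed for illustration.

    library(e1071)

    svm_fit <- svm(y ~ ., data = train, type = "eps-regression",
                   kernel = "radial", cost = 100, gamma = 0.01, epsilon = 1)
    y_hat <- predict(svm_fit, newdata = test)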

3.2.6. Random Forest

Random forests (RF) were first introduced by (Breiman, 2001) on the basis of his studies on bagging and of the works of (Amit and Geman, 1997; Ho, 1998) on feature selection. Basically, bagging consists of computing a large number of elementary regression functions and of averaging them. In random forests, the elementary regression functions involved in the bagging procedure are regression trees (Breiman et al., 1984). Building a regression tree aims at finding a series of splits, each deduced from one of the $d$ variables, $x^k$ (for a $k \in \{1, \ldots, d\}$), and a threshold, $\tau$, that divides the training set into two subsamples, called nodes: $\{i : x_i^k < \tau\}$ and $\{i : x_i^k \ge \tau\}$. The split of a given node, $\mathcal{N}$, is chosen, among all the possible splits, by minimizing the sum of the homogeneities of the two corresponding child nodes, $\mathcal{N}_{c_1}$ and $\mathcal{N}_{c_2}$, where the homogeneity of a child node $\mathcal{N}_{c}$ is measured by

$$\sum_{i \in \mathcal{N}_{c}} \left( y_i - \bar{y}_{\mathcal{N}_{c}} \right)^2,$$

where $\bar{y}_{\mathcal{N}_{c}} = \frac{1}{|\mathcal{N}_{c}|} \sum_{i \in \mathcal{N}_{c}} y_i$ is the mean value of the output variable for the observations belonging to $\mathcal{N}_{c}$ (i.e., the criterion is the intra-node variance, up to the factor $1/|\mathcal{N}_{c}|$).

The growth of the tree stops when the child nodes are homogeneous enough (for a previously fixed value of homogeneity) or when the number of observations in the child nodes is smaller than a fixed number (generally chosen between 1 and 5). The prediction obtained for a new input, $x$, is then simply the mean of the outputs, $y_i$, of the training observations that belong to the same terminal node (a leaf). The pros of this method are its easy readability and interpretability; the main drawback is its limited flexibility, especially for regression problems. To overcome this limit, random forests combine a large number (several hundreds or several thousands) of regression trees, $T$. In the forest, each tree is built according to the following algorithm, which introduces random perturbations of the original procedure to make each tree deliberately sub-optimal (i.e., so that none of the trees in the forest is the optimal one for the training dataset):

1. A given number of observations, $m$, are randomly chosen from the training set: this subset is called the in-bag sample, whereas the other observations are called out-of-bag and are used to check the error of the tree;

2. For each node of the tree, a given number of variables, $q$, are randomly selected among all the possible explanatory variables. The best split is then calculated on the basis of these $q$ variables for the $m$ chosen observations.

All trees in the forest are fully grown: the final leaves all have homogeneity equal to 0. Once the $T$ regression trees, $T_1, \ldots, T_T$, have been defined, the regression forest prediction for a new input, $x$, is equal to the mean of the individual predictions obtained by each tree of the forest for $x$.

Several hyper-parameters can be tuned for random forests, such as the number of trees in the final forest or the number of variables randomly selected to build a given split. But, as this method is less sensitive to parameter tuning than the other ones (i.e., SVM and MLP), we opted for keeping the default values implemented in the R package randomForest, which are based on useful heuristics: 500 trees were trained, each defined from a bootstrap sample built with replacement and having the size of the original dataset. Each node was defined from three randomly chosen variables, and the trees were grown until the number of observations in each node was smaller than five. Moreover, the full learning process always led to a stabilized out-of-bag error.
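As a hedged illustration only (not the authors' exact script), this setup corresponds to the default heuristics of the randomForest package, with mtry set to the three randomly chosen variables per split and nodesize set to five; train, test and the output column y are assumed names.

    library(randomForest)

    rf_fit <- randomForest(y ~ ., data = train,
                           ntree = 500, mtry = 3, nodesize = 5)
    y_hat <- predict(rf_fit, newdata = test)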


4. Simulations and results

4.1. Application to the Cross Compliance Assessment Tool

As described above in Section 1, the impact assessment of Cross Compliance (CC) measures on the EU27 farmlands required the development of a simulation platform called the Cross Compliance Assessment Tool (CCAT). The CCAT framework integrates different models, such as Miterra (Velthof et al., 2009), DNDC-EUROPE (Follador et al., 2011), EPIC (van der Velde et al., 2009) and CAPRI (Britz and Witzke, 2008; Britz, 2008), in order to guarantee an exhaustive evaluation of the effects of agro-environmental standards for different inputs, scenario assumptions, compliance rates and space-time resolutions (Elbersen et al., 2010; De Vries et al., 2008). The simulated outputs are indicators for nitrogen (N) and carbon (C) fluxes, biodiversity and landscape, market response and animal welfare. The selection of the CC scenarios, as well as the definition of the environmental indicators to be considered in this project, are described by (Jongeneel et al., 2008). The CCAT tool evaluates the effect of agricultural measures on N2O fluxes and N leaching by means of the metamodel of the mechanistic model DNDC-EUROPE (Follador et al., 2011). N2O is an important greenhouse gas (Intergovernmental Panel on Climate Change, 2007). Agriculture, and in particular agricultural soils, contribute significantly to anthropogenic N2O emissions (European Environment Agency, 2010). N2O fluxes from soils are characterized by a high spatial variability, and the accuracy of estimates can be increased if spatially explicit information is taken into consideration (Leip et al., 2011a).

Similarly, leaching of nitrogen from agricultural soils is an important source of surface and groundwater pollution (European Environment Agency, 1995).

The main limits to using DNDC-EUROPE directly in the CCAT platform are the high computational costs and memory requirements, due to the large size of the input datasets and the complexity and high number of equations to solve. To mitigate this problem and make the integration easier, we decided to develop a metamodel of DNDC-EUROPE (Follador and Leip, 2009). The choice of the best metamodeling approach has been based on the analysis of the performance of different algorithms, as described in detail in Section 4.4. The best metamodel is expected to have low computational costs and an acceptable accuracy for all the dataset sizes.

4.2. Input and Output data description

The set of training observations (around 19,000 observations) used to define a metamodel $\hat{f}$ was created by linking the agro-economic CAPRI dataset with the bio-geochemical DNDC-EUROPE model at Homogeneous Spatial Mapping Unit (HSMU) resolution, as described in (Leip et al., 2008). We opted for corn cultivation as a case study, since it covers almost 4.6% of the UAA (utilized agricultural area) in EU27, playing an important role in human and animal food supply (European Union Commission, 2010)[1] and representing one of the main cropping systems in Europe. To obtain a representative sample of situations for the cultivation of corn in EU27, we selected about 19,000 HSMUs on which at least 10% of the agricultural land was used for corn (Follador et al., 2011).

The input observations used to train the metamodels were drawn from the whole DNDC-EUROPE input database (Leip et al., 2008; Li et al., 1992; Li, 2000), in order to meet the need of simplifying the I/O flux of information between models in the CCAT platform. This screening was based on a preliminary sensitivity analysis of input data through the importance function of the R package randomForest, and was subsequently refined by expert evaluations (Follador et al., 2011; Follador and Leip, 2009). In the end, 11 input variables were used:

• variables related to N input [kg N ha^-1 yr^-1], such as mineral fertilizer (N FR) and manure (N MR) amendments, N from biological fixation (Nfix) and N in crop residue (Nres);

• variables related to soil: soil bulk density, BD [g cm^-3], topsoil organic carbon, SOC [mass fraction], clay content, clay [fraction], and topsoil pH, pH;

• variables related to climate: annual precipitation, Rain [mm yr^-1], annual mean temperature, Tmean [°C], and N in rain, Nr [ppm].

They refer to the main driving forces taking part in the simulation of N2O and N leaching with DNDC-EUROPE, such as farming practices, soil attributes and climate information. In this contribution we only show the results for the corn baseline scenario, that is, the conventional corn cultivation in EU27, as described by (Follador et al., 2011). Note that a single metamodel was developed for each CC scenario and for each simulated output in CCAT, as described in (Follador and Leip, 2009). Figure 1 summarizes the relations between the DNDC-EUROPE model and the metamodel.

As the number of input variables was not large, they were all used in all the metamodeling methods described in Section 3, without additional variable selection. The only exception is the second linear model (Section 3.1), which uses a more complete list of input variables obtained by various combinations of the original 11 variables and thus includes a variable selection process to avoid collinearity issues.

Two output variables were studied: the emissions of N2O ([kg N yr^-1 ha^-1] for each HSMU), a GHG whose reduction is a leading matter in climate change mitigation strategies, and the nitrogen leaching ([kg N yr^-1 ha^-1] for each HSMU), which has to be monitored to meet drinking water quality standards (Askegaard et al., 2005). A metamodel was developed for each single output variable. The flux of information through the DNDC-EUROPE model and its relationship with the metamodel's one are summarized in Figure 1. The data were extracted using a high performance computer cluster, and the extraction process took more than one day for all the 19,000 observations.

[1] http://epp.eurostat.ec.europa.eu

4.3. Training, validation and test approach

The training observations were randomly partitioned (without replacement) into two groups: 80% of the observations (i.e., N_L ≈ 15,000 HSMUs) were used for training (i.e., for defining a convenient $\hat{f}$) and the remaining 20% of the observations (i.e., N_T ≈ 4,000 HSMUs) were used for validating the metamodels (i.e., for calculating an error score). Additionally, in order to understand the impact of the training dataset on the goodness of the estimations ($\hat{y}_i$) and to compare the performance of the different metamodels according to data availability, we randomly selected from the entire training set a series of subsets having, respectively, N_L = 8,000, 4,000, 2,000, 1,000, 500, 200 and 100 observations, each consecutive training subset being a subset of the previous one.
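A minimal sketch of this partitioning (with the assumed object name dataset for the roughly 19,000 observations) could read as follows; each training subset is nested in the previous one, as described above.

    set.seed(1)
    n      <- nrow(dataset)
    i_all  <- sample(n)                            # random permutation, no replacement
    n_test <- round(0.2 * n)
    i_test <- i_all[1:n_test]                      # ~4,000 test HSMUs
    i_full <- i_all[-(1:n_test)]                   # ~15,000 training HSMUs

    sizes   <- c(8000, 4000, 2000, 1000, 500, 200, 100)
    subsets <- list(full = i_full)
    for (s in sizes) subsets[[as.character(s)]] <- subsets[[length(subsets)]][1:s]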

The methodology used to assess the behavior of the different metamodels under various experimental conditions (size of the dataset and nature of the output) is summarized in Description 1.

More precisely, for some metamodels, Step 2 requires the tuning of some hyper-parameters (e.g., SVM have three hyper-parameters, see Section 3).

Figure 1: Flow of data through the DNDC-EUROPE model (M) and its relationship with the metamodel (MM). The input variables of the metamodel (fertilizers, manure, soil bulk density, soil organic carbon, clay fraction, rain, temperature, N in rain, topsoil pH, N fixation, N residue) were selected from the original DNDC-EUROPE input database (screening). The estimated (*) outputs (N2O flux, N leaching) were compared with the detailed model's outputs during the training and test phases to improve the metamodel and to evaluate the goodness of the approximation.

Description 1: Methodology used to compare the metamodels under various experimental conditions.

1: for each metamodel, each output and each size N_L do
2:   train the metamodel with the N_L training observations (definition of $\hat{f}$);
3:   estimate the outputs for the N_T ≈ 4,000 inputs of the test set from $\hat{f}$ (calculation of the $\hat{y}_i$);
4:   calculate the test error by comparing the estimated outputs, $\hat{y}_i$, with the outputs of the DNDC-EUROPE model for the same test observations, $y_i$;
5: end for

These hyper-parameters were tuned as follows:

• for ACOSSO and SDR: a grid search to minimize the BIC plus an algorithm to get the weights $w_j$. In these cases, an efficient formula, which does not require computing each leave-one-out estimate of $f$, can be used to compute the BIC; moreover, the COSSO penalty provides all $\theta_j$ in a single shot given $\lambda$ and $w_j$. In the SDR identification steps, a maximum likelihood strategy is applied to optimize the NVRs;

• for DACE, a maximum likelihood strategy;

• for MLP, SVM and RF, a simple validation strategy (preferred to a cross-validation strategy to reduce the computational time, especially with the largest training datasets): half of the data were used to define several metamodels depending on the values of the hyper-parameters on a grid search, and the remaining data were used to select the best set of hyper-parameters by minimizing a mean squared error criterion (see the sketch after this list).
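As a minimal, hedged sketch (not the authors' exact script), the simple validation used for SVM could look as follows: half of the training data fits candidate metamodels on the $(C, \gamma)$ grid, the other half selects the pair with the smallest mean squared error; train and the output column y are assumed names.

    library(e1071)

    idx   <- sample(nrow(train), floor(nrow(train) / 2))
    fit_d <- train[idx, ]                          # half used to fit the candidates
    val_d <- train[-idx, ]                         # half used to validate them
    grid  <- expand.grid(C = c(10, 100, 1000, 2000), gamma = c(0.001, 0.01, 0.1))

    mse <- apply(grid, 1, function(p) {
      m <- svm(y ~ ., data = fit_d, cost = p["C"], gamma = p["gamma"], epsilon = 1)
      mean((predict(m, newdata = val_d) - val_d$y)^2)
    })
    best <- grid[which.min(mse), ]                 # retained hyper-parameters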

Hence, depending on which features are the most important (easy tuning of the hyper-parameters, size of the training dataset, size of the dataset needing new predictions, ...), the use of one method is more or less recommended. Table 1 summarizes the main characteristics of the training and validation steps of each method, as well as the characteristics of making new predictions. For instance, linear models are more

Table 1: Summary of the main features of the training, validation (hyper-parameter tuning) and test (new predictions) steps of each method.

LM1
Training: very fast to train.
Validation: there is no hyper-parameter to tune.
New predictions: very fast.

LM2
Training: fast to train, but much slower than LM1 because of the number of parameters to learn.
Validation: there is no hyper-parameter to tune.
New predictions: very fast.

ACOSSO
Training: fast to train only if the number of observations is very low: the dimension of the kernel matrix is N_L x N_L and it is obtained as the sum of the kernels of each [N_L x N_L] ANOVA term, which can be long to calculate.
Validation: one hyper-parameter (lambda) is tuned twice by minimizing the BIC, the first time to get the weights w_j and the second to get the final estimate (given lambda and w_j, the COSSO penalty automatically provides all theta_j in a single shot).
New predictions: the time needed to obtain new predictions can be high, depending on the sizes of both the training dataset and the test dataset; it requires computing a kernel matrix of dimension N_L x N_T.

SDR
Training: fast to train only if the number of observations is very low: the dimension of the kernel matrix is N_L x N_L and it is obtained as the sum of the kernels of each [N_L x N_L] ANOVA term, which can be long to calculate.
Validation: as for ACOSSO, the single hyper-parameter (lambda) is tuned by minimizing the BIC; the SDR identification step that provides the w_j also optimizes hyper-parameters for each ANOVA component, but this can be done efficiently by the SDR recursive algorithms (given lambda and w_j, the COSSO penalty automatically provides all theta_j in a single shot).
New predictions: the time needed to obtain new predictions can be high, depending on the sizes of both the training dataset and the test dataset; it requires computing a kernel matrix of dimension N_L x N_T.

DACE
Training: fast to train only if the number of observations is very low: the dimension of the kernel matrix is N_L x N_L, and the inversion of an N_L x N_L matrix is required in the GLS procedure.
Validation: d + 1 hyper-parameters are tuned by ML, which becomes intractable already for moderate d: at each step of the optimization an N_L x N_L matrix has to be inverted.
New predictions: the time needed to obtain new predictions can be high, depending on the sizes of both the training dataset and the test dataset; it requires computing a kernel matrix of dimension N_L x N_T.

MLP
Training: hard to train: because the error to minimize is not quadratic, the training step faces local-minima problems and thus has to be performed several times with various initialization values. It is also very sensitive to the dimensionality of the data (which strongly increases the number of weights to train) and, to a lesser extent, to the number of observations.
Validation: two hyper-parameters have to be tuned, but one is discrete (the number of neurons on the hidden layer), which is easier. Nevertheless, cross validation is not suited: tuning is performed by simple validation and can thus be less accurate. It can be time consuming.
New predictions: the time needed to obtain new predictions is very low: it depends on the number of predictions.

SVM
Training: fast to train if the number of observations is low: SVM are almost insensitive to the dimensionality of the data, but the dimension of the kernel matrix is N_L x N_L and can be long to calculate.
Validation: three hyper-parameters have to be tuned and, when the size of the training dataset is large, cross validation is not suited. Tuning is performed by simple validation and can thus be less accurate. It is also time consuming.
New predictions: the time needed to obtain new predictions can be high, depending on the sizes of both the training dataset and the test dataset; it requires computing a kernel matrix of dimension N_L x N_T.

RF
Training: fast to train: almost insensitive to the size or the dimensionality of the training dataset, thanks to the random selections of observations and variables. Most of the training time is due to the number of trees required to stabilize the algorithm, which can sometimes be large.
Validation: almost insensitive to the hyper-parameters, so no extensive tuning is required.
New predictions: the time needed to obtain new predictions is low: it depends on the number of predictions to do and also on the number of trees in the forest.

In Step 4, the test quality criterion was evaluated by calculating several quantities (a short computational sketch follows the list):

• the Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{N_T} \sum_{i=1}^{N_T} (\hat{y}_i - y_i)^2$$

where $y_i$ and $\hat{y}_i$ are, respectively, the model outputs in the test dataset and the corresponding approximated outputs given by the metamodel;

• the $R^2$ coefficient:

$$R^2 = 1 - \frac{\sum_{i=1}^{N_T} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{N_T} (y_i - \bar{y})^2} = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}$$

where $\bar{y}$ and $\mathrm{Var}(y)$ are the mean and the variance of all $y_i$ in the test dataset. $R^2$ is equal to 1 if the predictions are perfect and thus gives a way to relate the accuracy of the predictions to the variability of the variable to predict.
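A minimal sketch of these two criteria in R, with y_test the DNDC-EUROPE outputs on the test set and y_hat the corresponding metamodel predictions (assumed vector names):

    mse <- mean((y_hat - y_test)^2)
    r2  <- 1 - sum((y_hat - y_test)^2) / sum((y_test - mean(y_test))^2)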
