
HAL Id: hal-03108628

https://hal.archives-ouvertes.fr/hal-03108628

Preprint submitted on 13 Jan 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Kernel-based ANOVA decomposition and Shapley effects - Application to global sensitivity analysis

Sébastien da Veiga

To cite this version:

Sébastien da Veiga. Kernel-based ANOVA decomposition and Shapley effects - Application to global sensitivity analysis. 2021. ⟨hal-03108628⟩


Kernel-based ANOVA decomposition and Shapley effects – Application to global sensitivity analysis

Sébastien Da Veiga

Safran Tech, Modeling & Simulation

Rue des Jeunes Bois, Châteaufort, 78114 Magny-Les-Hameaux, France

Abstract

Global sensitivity analysis is the main quantitative technique for identifying the most influential input variables in a numerical simulation model. In particular, when the inputs are independent, Sobol' sensitivity indices attribute a portion of the variance of the output of interest to each input and to all possible interactions in the model, thanks to a functional ANOVA decomposition. On the other hand, moment-independent sensitivity indices focus on the impact of input variables on the whole output distribution instead of the variance only, thus providing complementary insight on the inputs/output relationship. Unfortunately they do not enjoy the nice decomposition property of Sobol' indices and are consequently harder to analyze. In this paper, we introduce two moment-independent indices based on kernel embeddings of probability distributions and show that the RKHS framework used for their definition makes it possible to exhibit a kernel-based ANOVA decomposition. This is the first time such a desirable property is proved for sensitivity indices apart from Sobol' ones. When the inputs are dependent, we also use these new sensitivity indices as building blocks to design kernel-embedding Shapley effects which generalize the traditional variance-based ones used in sensitivity analysis. Several estimation procedures are discussed and illustrated on test cases with various output types such as categorical variables and probability distributions. All these examples show their potential for enhancing traditional sensitivity analysis with a kernel point of view.

1 Introduction

In the computer experiments community, global sensitivity analysis (GSA) has now emerged as a central tool for exploring the inputs/outputs relationship of a numerical simulation model. Starting from the pioneering work of Sobol (Sobol', 1993) and Saltelli (Saltelli et al., 1999) on the interpretation and estimation of Sobol' indices, the last two decades have been a fertile ground for the development of advanced statistical methodologies and extensions of the original Sobol' indices:

new estimation procedures (Da Veiga et al. (2009), Da Veiga and Gamboa (2013), Solís (2019), Gamboa et al. (2020)), multivariate outputs with aggregation (Gamboa et al., 2013) and dimensionality reduction (Lamboni et al., 2011), goal-oriented sensitivity analysis (Fort et al., 2016) or moment-independent sensitivity measures (Borgonovo (2007), Da Veiga (2015)), among others. At the heart of the popularity of Sobol' indices is the fundamental functional analysis of variance (ANOVA) decomposition, which opens the path for their interpretation as parts of the output variance and makes it possible to pull apart the input main effects and all their potential interactions, up to their whole influence measured by total Sobol' indices. This decomposition however has two drawbacks. First, it is only valid when the inputs are independent, although some generalizations were investigated (Chastaing et al., 2012). Secondly, it only concerns the original Sobol' indices, meaning that it is not possible to split the input effects with goal-oriented or moment-independent sensitivity analysis in general.

When the inputs are dependent, total Sobol' indices can still be used to discriminate them when the objective is to build a surrogate model of the system, and other Sobol'-related indices have also been proposed for interpretability (Mara et al., 2015). But the major breakthrough came when Shapley effects were defined for GSA by Owen (Owen, 2014). Indeed, owing to their mathematical foundations in game theory, Shapley effects do not require the independence assumption to enjoy nice properties: each input is assigned a Shapley effect lying between 0 and 1, while the sum of all effects is equal to 1. For a given input, all interactions and correlations with other inputs are mixed up, but the interpretation as parts of the output variance is kept and input rankings remain sensible. For these reasons Shapley effects are now commonly regarded as central importance measures in GSA for dealing with dependence, and their estimation has been thoroughly investigated recently (Song et al., 2016; Iooss and Prieur, 2019; Broto et al., 2020; Plischke et al., 2020).

From an interpretability perspective, other importance measures introduced in the context of goal-oriented and moment-independent sensitivity analysis have proven useful to gain additional insights on a given model. For example, quantile-oriented (Fort et al., 2016; Maume-Deschamps and Niang, 2018) or reliability-based measures (Ditlevsen and Madsen, 1996) can help understand which inputs lead to the failure of the system, while optimization-related indices enable dimension reduction for optimization problems (Spagnol et al., 2019). On the other hand, moment-independent sensitivity indices, which quantify the input impact on the whole output distribution instead of the variance only, are powerful complementary tools to grasp further types of input influence.

Among them are the f-divergence indices (Da Veiga (2015), Rahman (2016)), with particular cases corresponding to the sensitivity index introduced by Borgonovo (Borgonovo, 2007), and the class of kernel-based sensitivity indices, which rely on the embedding of probability distributions in reproducing kernel Hilbert spaces (RKHS) (Da Veiga, 2015, 2016). Unfortunately an ANOVA-like decomposition is not available yet for any of these indices, even in the independent setting. As a consequence this limits the interpretation of their formulation for interactions, since without ANOVA it is not possible to remove the main effects; at the same time the natural normalization constant (equivalent to the total output variance for Sobol' indices) is not known.

In this paper we focus on a general RKHS framework for GSA and prove that an ANOVA decomposition actually exists for two previously introduced kernel-based sensitivity indices in the independent setting. To the best of our knowledge this is the first time such a decomposition is available for sensitivity indices other than the original Sobol' ones. Not only does this make it possible to properly define higher-order indices, it further gives access to their natural normalization constant. We also demonstrate that these measures are generalizations of Sobol' indices, in the sense that they are recovered with specific kernels. But the RKHS point of view additionally comes with a large body of work on kernels specifically designed for particular target applications, such as multivariate, functional, categorical or time-series case studies, thus defining a unified framework for many real GSA test cases. When inputs are not independent, we finally introduce a kernel-based version of Shapley effects similar to the ones proposed by Owen.

The paper is organized as follows. Section 2 first briefly introduces the traditional functional ANOVA decomposition with Sobol' indices and moment-independent indices. In Section 3 we then discuss the elementary tools from RKHS theory needed to build the kernel-based sensitivity indices which are at the core of this work. We further investigate these indices and prove that they also arise from an ANOVA decomposition. In addition we define Shapley effects with kernels, and the benefits of the RKHS framework for GSA are studied through several examples. Several estimation procedures are then discussed in Section 4, where we generalize some of the recent estimators for Sobol' indices. Finally, Section 5 illustrates the potential of these sensitivity indices with various numerical experiments corresponding to typical GSA applications.

2 Global sensitivity analysis

Let $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ denote the numerical simulation model, which is a function of $d$ input variables $X_l \in \mathcal{X}_l$, $l = 1, \ldots, d$, and let $Y \in \mathcal{Y}$ be the model output given by $Y = \eta(X_1, \ldots, X_d)$. In standard GSA the inputs $X_l$ are further assumed to be independent with known probability distributions $P_{X_l}$, meaning that the vector of inputs $X = (X_1, \ldots, X_d)$ is distributed as $P_X = P_{X_1} \otimes \ldots \otimes P_{X_d}$. For any subset $A = \{l_1, \ldots, l_{|A|}\} \in \mathcal{P}_d$ of indices taken from $\{1, \ldots, d\}$, we denote by $X_A = (X_{l_1}, \ldots, X_{l_{|A|}}) \in \mathcal{X}_A = \mathcal{X}_{l_1} \times \ldots \times \mathcal{X}_{l_{|A|}}$ the vector of inputs with indices in $A$ and by $X_{-A}$ the complementary vector with indices not in $A$. In this setting, the main objective of global sensitivity analysis is to quantify the impact of any group of input variables $X_A$ on the model output $Y$.

In this section we first recall the functional ANOVA decomposition and the definition of Sobol' indices, which fall into the category of variance-based indices. Sensitivity indices that account for the whole output distribution, referred to as moment-independent indices, are then discussed. Note that in the following, we adopt the notation $S$ for a properly normalized sensitivity index, while $\mathcal{S}$ stands for an unnormalized index, where normalization is to be understood as the end result of an ANOVA-like decomposition.

2.1 ANOVA decomposition and variance-based sensitivity indices

Here we first assume that $Y \in \mathcal{Y} \subset \mathbb{R}$ is a square-integrable scalar output. If the inputs are independent, the function $\eta$ can then be decomposed according to the ANOVA decomposition:

Theorem 1 (ANOVA decomposition (Hoeffding, 1948; Antoniadis, 1984)). Assume that $\eta : \mathcal{X}_1 \times \ldots \times \mathcal{X}_d \to \mathcal{Y}$ is a square-integrable function of $d$ independent random variables $X_1, \ldots, X_d$. Then $\eta$ admits a decomposition
$$Y = \eta(X_1, \ldots, X_d) = \sum_{A \subseteq \mathcal{P}_d} \eta_A(X_A),$$
with $\eta_A$ depending only on the variables $X_A$ and satisfying

(a) $\eta_\emptyset = \mathbb{E}(Y)$,

(b) $\mathbb{E}_{X_l}\left(\eta_A(X_A)\right) = 0$ if $l \in A$,

(c) $\eta_A(X_A) = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}(Y \mid X_B)$.

Furthermore, (b) implies that all the terms $\eta_A$ in the decomposition are mutually orthogonal. As a consequence, the output variance can be decomposed as
$$\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} \operatorname{Var} \eta_A(X_A) = \sum_{A \subseteq \mathcal{P}_d} V_A \qquad (1)$$
where
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y \mid X_B). \qquad (2)$$

When this decomposition holds, it is then straightforward to quantify the influence of any subset of inputs $X_A$ on the output variance by normalizing each term with $\operatorname{Var} Y$.

Definition 1 (Sobol' indices (Sobol', 1993)). Under the same assumptions as Theorem 1, the Sobol' sensitivity index associated to a subset $A$ of input variables is defined as
$$S_A = \frac{V_A}{\operatorname{Var} Y}, \qquad (3)$$
while the total Sobol' index associated to $A$ is
$$S_A^T = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S_B. \qquad (4)$$

In particular, the first-order Sobol' index of an input $X_l$ writes
$$S_l = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l)}{\operatorname{Var} Y}$$
and its total Sobol' index is given by
$$S_l^T = \sum_{B \subseteq \mathcal{P}_d,\, l \in B} S_B = 1 - \frac{\operatorname{Var} \mathbb{E}(Y \mid X_{-l})}{\operatorname{Var} Y}.$$

Finally, the ANOVA decomposition (1) readily provides an interpretation of Sobol' indices as a percentage of explained output variance, i.e.
$$\sum_{A \subseteq \mathcal{P}_d} S_A = 1. \qquad (5)$$

With these definitions, the impact of each input variable can be quantitatively assessed: the first-order Sobol' index measures the main effect of an input, while the total Sobol' index aggregates all its potential interactions with other inputs. As an illustration, an input variable with a low total Sobol' index is unimportant and one can freeze it at a default value. When both the first-order and total Sobol' indices of a given input are close, this input has no interactions, while a large gap indicates strong interactions in the model. Furthermore, due to the summation property (5), the interpretation of Sobol' indices as shares of the output variance is an efficient tool for practitioners who aim at understanding precisely the impact and interactions of the inputs of a model on the output. For example the interaction of two inputs $X_l$ and $X_{l'}$ writes
$$S_{ll'} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'}) - \operatorname{Var} \mathbb{E}(Y \mid X_l) - \operatorname{Var} \mathbb{E}(Y \mid X_{l'})}{\operatorname{Var} Y} = \frac{\operatorname{Var} \mathbb{E}(Y \mid X_l, X_{l'})}{\operatorname{Var} Y} - S_l - S_{l'}. \qquad (6)$$
Note that to compute this interaction one subtracts the first-order indices $S_l$ and $S_{l'}$ from the sensitivity index of the subset $(X_l, X_{l'})$ in order to remove the main effects and highlight the interaction only.
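To make the estimation of these quantities concrete, the sketch below implements a classical pick-freeze Monte Carlo scheme (with Jansen's formulas) for first-order and total Sobol' indices on the Ishigami test function; the test function, sample size and estimator variant are illustrative choices and not the estimators developed later in this paper.

```python
import numpy as np

def ishigami(x, a=7.0, b=0.1):
    """Ishigami test function, a classical GSA benchmark with inputs uniform on [-pi, pi]."""
    return np.sin(x[:, 0]) + a * np.sin(x[:, 1]) ** 2 + b * x[:, 2] ** 4 * np.sin(x[:, 0])

def sobol_pick_freeze(model, d, n, rng):
    """First-order and total Sobol' indices via a pick-freeze design (Jansen's estimators)."""
    A = rng.uniform(-np.pi, np.pi, size=(n, d))
    B = rng.uniform(-np.pi, np.pi, size=(n, d))
    yA, yB = model(A), model(B)
    var = np.var(np.concatenate([yA, yB]), ddof=1)
    S1, ST = np.empty(d), np.empty(d)
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                      # replace only column i of A by that of B
        yABi = model(ABi)
        S1[i] = (var - 0.5 * np.mean((yB - yABi) ** 2)) / var   # Jansen first-order estimator
        ST[i] = 0.5 * np.mean((yA - yABi) ** 2) / var           # Jansen total-effect estimator
    return S1, ST

rng = np.random.default_rng(0)
S1, ST = sobol_pick_freeze(ishigami, d=3, n=50_000, rng=rng)
print("first-order:", np.round(S1, 3), "total:", np.round(ST, 3))
```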


2.2 Moment-independent sensitivity indices

Despite their appealing properties, Sobol' indices rank the input variables according to their impact on the output variance only. In a parallel line of work, several authors proposed to investigate instead how inputs influence the whole output distribution, thus introducing a different insight on the inputs/outputs relationship. The starting point (Baucells and Borgonovo (2013), Da Veiga (2015)) is to consider that a given input $X_l$ is important in the model if the probability distribution $P_Y$ of the output changes when $X_l$ is fixed, i.e. if the conditional probability distribution $P_{Y|X_l}$ is different from $P_Y$. More precisely, if $d(\cdot,\cdot)$ denotes a dissimilarity measure between probability distributions, one can define a sensitivity index for variable $X_l$ given by
$$\mathcal{S}_l = \mathbb{E}_{X_l}\left[ d\!\left(P_Y, P_{Y|X_l}\right) \right]. \qquad (7)$$

Such a general formulation is flexible, in the sense that many choices for $d(\cdot,\cdot)$ are available. As an illustration, it is straightforward to show that the unnormalized first-order Sobol' index is retrieved with the naive dissimilarity measure $d(P, Q) = \left( \mathbb{E}_{\xi \sim P}(\xi) - \mathbb{E}_{\xi \sim Q}(\xi) \right)^2$, which compares probability distributions only through their means. A large class of dissimilarity measures is also given by the so-called f-divergence family: assuming that $(X_l, Y)$ has an absolutely continuous distribution with respect to the Lebesgue measure on $\mathbb{R}^2$, the f-divergence between $P_Y$ and $P_{Y|X_l}$ is
$$d_f\!\left(P_Y, P_{Y|X_l}\right) = \int f\!\left( \frac{p_Y(y)}{p_{Y|X_l}(y)} \right) p_{Y|X_l}(y)\, dy$$
where $f$ is a convex function such that $f(1) = 0$ and $p_Y$ and $p_{Y|X_l}$ are the probability density functions of $Y$ and $Y|X_l$, respectively. The corresponding sensitivity index is then
$$\mathcal{S}_l^f = \int f\!\left( \frac{p_Y(y)\, p_{X_l}(x)}{p_{X_l,Y}(x,y)} \right) p_{X_l,Y}(x,y)\, dx\, dy$$
with $p_{X_l}$ and $p_{X_l,Y}$ the probability density functions of $X_l$ and $(X_l, Y)$, respectively. This index has been studied for example in Da Veiga (2015) and Rahman (2016). A notable special case is obtained with the total-variation distance corresponding to $f(t) = |t - 1|$, leading to the sensitivity index proposed by Borgonovo (Borgonovo, 2007):
$$\mathcal{S}_l^{TV} = \int \left| p_Y(y)\, p_{X_l}(x) - p_{X_l,Y}(x,y) \right| dx\, dy.$$

Obviously, definition (7) can easily be extended to measure the influence of any subset of inputs, $\mathcal{S}_A = \mathbb{E}_{X_A}\left[ d\!\left(P_Y, P_{Y|X_A}\right) \right]$. But in this case, since there is no ANOVA-like decomposition, there is no longer any guarantee that an interaction index defined following (6):
$$\mathcal{S}_{ll'}^{TV} = \int \left| p_Y(y)\, p_{X_l}(x)\, p_{X_{l'}}(x') - p_{X_l, X_{l'}, Y}(x, x', y) \right| dx\, dx'\, dy \; - \; \mathcal{S}_l^{TV} - \mathcal{S}_{l'}^{TV}$$
really measures the pure interaction between $X_l$ and $X_{l'}$. The interpretation of higher-order moment-independent sensitivity indices is therefore cumbersome. On the other hand, even if normalization constants have been proposed through general inequalities on f-divergences (Borgonovo (2007), Rahman (2016)), the lack of an ANOVA decomposition once again impedes the definition of a natural normalization constant equivalent to the output variance for Sobol' indices.
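As an illustration of how such moment-independent indices can be evaluated numerically, the following sketch computes a crude plug-in estimate of the total-variation index $\mathcal{S}_l^{TV}$ by binning $X_l$ and comparing kernel density estimates of the conditional and marginal output densities; the binning scheme, bandwidths and test function are illustrative assumptions, and more refined estimators are available in the cited literature.

```python
import numpy as np
from scipy.stats import gaussian_kde

def borgonovo_tv_index(x_l, y, n_bins=20, n_grid=512):
    """Plug-in estimate of E_{X_l}[ int |p_Y - p_{Y|X_l}| dy ] by binning X_l."""
    grid = np.linspace(y.min(), y.max(), n_grid)
    step = grid[1] - grid[0]
    p_y = gaussian_kde(y)(grid)                              # marginal output density
    edges = np.quantile(x_l, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x_l, side="right") - 1, 0, n_bins - 1)
    index = 0.0
    for b in range(n_bins):
        y_b = y[idx == b]
        if len(y_b) < 10:
            continue
        p_cond = gaussian_kde(y_b)(grid)                     # density of Y given X_l in bin b
        weight = len(y_b) / len(y)                           # empirical P(X_l in bin b)
        index += weight * np.sum(np.abs(p_y - p_cond)) * step
    return index

rng = np.random.default_rng(1)
x = rng.uniform(-np.pi, np.pi, size=(20_000, 3))
y = np.sin(x[:, 0]) + 7.0 * np.sin(x[:, 1]) ** 2 + 0.1 * x[:, 2] ** 4 * np.sin(x[:, 0])
print([round(borgonovo_tv_index(x[:, l], y), 3) for l in range(3)])
```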


Recently, new moment-independent indices built upon the framework of RKHS embedding of probability distributions have also been investigated (Da Veiga, 2015, 2016). Though originally introduced as an alternative to reduce the curse of dimensionality and make the most of the vast kernel literature, we will see in what follows that they actually exhibit an ANOVA-like decomposition and can therefore be seen as a general kernelized version of Sobol' indices.

3 Kernel-based sensitivity analysis

Before introducing the kernel-based sensitivity indices, we first review some elements of the RKHS embedding of probability distributions (Smola et al., 2007), which will serve as a building block for their definition.

3.1 RKHS embedding of distributions

We first introduce a RKHS $\mathcal{H}$ of functions $\mathcal{X} \to \mathbb{R}$ with kernel $k_\mathcal{X}$ and inner product $\langle \cdot, \cdot \rangle_\mathcal{H}$. The kernel mean embedding $\mu_P \in \mathcal{H}$ of a probability distribution $P$ on $\mathcal{X}$ is defined as
$$\mu_P = \mathbb{E}_{\xi \sim P}\, k_\mathcal{X}(\xi, \cdot) = \int_\mathcal{X} k_\mathcal{X}(\xi, \cdot)\, dP(\xi)$$
if $\mathbb{E}_{\xi \sim P}\, k_\mathcal{X}(\xi, \xi) < \infty$, see Smola et al. (2007). The representation $\mu_P$ is appealing because, if the kernel $k_\mathcal{X}$ is characteristic, the map $P \to \mu_P$ is injective (Sriperumbudur et al., 2009, 2010). Consequently, the kernel mean embedding can be used in lieu of the probability distribution for several comparisons and manipulations of probability measures, but using only inner products or distances in the RKHS. For example, a distance between two probability measures $P_1$ and $P_2$ on $\mathcal{X}$ can simply be obtained by computing the distance between their representations in $\mathcal{H}$, i.e.
$$\mathrm{MMD}(P_1, P_2) = \left\| \mu_{P_1} - \mu_{P_2} \right\|_\mathcal{H},$$
which is a distance if the kernel $k_\mathcal{X}$ is characteristic (Sriperumbudur et al., 2009, 2010). This distance is called the maximum mean discrepancy (MMD) and it has recently been used in many applications (Muandet et al., 2012; Szabó et al., 2016). Indeed, using the reproducing property of a RKHS one may show (Song, 2008) that
$$\mathrm{MMD}^2(P_1, P_2) = \mathbb{E}_{\xi, \xi'}\, k_\mathcal{X}(\xi, \xi') - 2\, \mathbb{E}_{\xi, \zeta}\, k_\mathcal{X}(\xi, \zeta) + \mathbb{E}_{\zeta, \zeta'}\, k_\mathcal{X}(\zeta, \zeta')$$
where $\xi, \xi' \sim P_1$ and $\zeta, \zeta' \sim P_2$ with $\xi, \xi', \zeta, \zeta'$ independent, this notation being used throughout the rest of the paper. This means that the MMD can be computed with expectations of kernels only, unlike other distances between probability distributions which typically require density estimation.
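The following minimal sketch illustrates this point by computing the standard unbiased (U-statistic) estimate of MMD$^2$ between two samples, assuming a Gaussian kernel with a fixed bandwidth:

```python
import numpy as np

def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two 1-D samples."""
    d2 = (u[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(u, v, bandwidth=1.0):
    """Unbiased estimate of MMD^2(P1, P2) from samples u ~ P1 and v ~ P2."""
    Kuu = gaussian_kernel(u, u, bandwidth)
    Kvv = gaussian_kernel(v, v, bandwidth)
    Kuv = gaussian_kernel(u, v, bandwidth)
    n, m = len(u), len(v)
    # drop diagonal terms so that each expectation uses independent pairs
    term_uu = (Kuu.sum() - np.trace(Kuu)) / (n * (n - 1))
    term_vv = (Kvv.sum() - np.trace(Kvv)) / (m * (m - 1))
    return term_uu + term_vv - 2.0 * Kuv.mean()

rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0.0, 1.0, 500), rng.normal(0.5, 1.0, 500)))
```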

Another significant application of kernel embeddings concerns the problem of measuring the dependence between random variables. Given a pair of random vectors $(U, V) \in \mathcal{X} \times \mathcal{Y}$ with probability distribution $P_{UV}$, we define the product RKHS $\mathcal{H} = \mathcal{F} \times \mathcal{G}$ with kernel $k_\mathcal{H}((u, v), (u', v')) = k_\mathcal{X}(u, u')\, k_\mathcal{Y}(v, v')$. A measure of the dependence between $U$ and $V$ can then be defined as the distance between the mean embeddings of $P_{UV}$ and $P_U \otimes P_V$, the joint distribution with independent marginals $P_U$ and $P_V$:
$$\mathrm{MMD}^2(P_{UV}, P_U \otimes P_V) = \left\| \mu_{P_{UV}} - \mu_{P_U} \otimes \mu_{P_V} \right\|_\mathcal{H}^2.$$


This measure is the so-called Hilbert-Schmidt independence criterion (HSIC, see Gretton et al. (2005a,b)) and can be expanded as
$$\begin{aligned} \mathrm{HSIC}(U, V) = \mathrm{MMD}^2(P_{UV}, P_U \otimes P_V) &= \mathbb{E}_{U,U',V,V'}\, k_\mathcal{X}(U, U')\, k_\mathcal{Y}(V, V') + \mathbb{E}_{U,U'}\, k_\mathcal{X}(U, U')\; \mathbb{E}_{V,V'}\, k_\mathcal{Y}(V, V') \\ &\quad - 2\, \mathbb{E}_{U,V}\!\left[ \mathbb{E}_{U'}\, k_\mathcal{X}(U, U')\; \mathbb{E}_{V'}\, k_\mathcal{Y}(V, V') \right] \end{aligned} \qquad (8)$$
where $(U', V')$ is an independent copy of $(U, V)$. Once again, the reproducing property implies that HSIC can be expressed as expectations of kernels, which facilitates its estimation when compared to other dependence measures such as mutual information.

3.2 Kernel-based ANOVA decomposition

The RKHS framework introduced above can readily be used to define kernel-based sensitivity indices. The first approach relies on the MMD, while the second one builds upon HSIC. We discuss them below and show that in particular both of them admit an ANOVA-like decomposition.

3.2.1 MMD-based sensitivity index

The first natural idea is to come back to the general formulation for moment-independent indices (7) and use the MMD as the dissimilarity measure to compare $P_Y$ and $P_{Y|X_l}$, as proposed in Da Veiga (2016):
$$\begin{aligned} \mathcal{S}_l^{\mathrm{MMD}} &= \mathbb{E}_{X_l}\, \mathrm{MMD}^2(P_Y, P_{Y|X_l}) \\ &= \mathbb{E}_{X_l} \mathbb{E}_{\xi,\xi' \sim P_Y}\, k_\mathcal{Y}(\xi, \xi') - 2\, \mathbb{E}_{X_l} \mathbb{E}_{\xi \sim P_Y,\, \zeta \sim P_{Y|X_l}}\, k_\mathcal{Y}(\xi, \zeta) + \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim P_{Y|X_l}}\, k_\mathcal{Y}(\zeta, \zeta') \\ &= \mathbb{E}_{X_l} \mathbb{E}_{\zeta,\zeta' \sim P_{Y|X_l}}\, k_\mathcal{Y}(\zeta, \zeta') - \mathbb{E}_{\xi,\xi' \sim P_Y}\, k_\mathcal{Y}(\xi, \xi') \end{aligned}$$
where we have defined a RKHS $\mathcal{G}$ of functions $\mathcal{Y} \to \mathbb{R}$ with kernel $k_\mathcal{Y}$. More generally, we can also consider the unnormalized MMD-based sensitivity index for a group of variables $X_A$ given by
$$\mathbb{E}_{X_A}\, \mathrm{MMD}^2(P_Y, P_{Y|X_A}) = \mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim P_{Y|X_A}}\, k_\mathcal{Y}(\zeta, \zeta') - \mathbb{E}_{\xi,\xi' \sim P_Y}\, k_\mathcal{Y}(\xi, \xi'),$$
provided the following assumption holds:

Assumption 1. $\forall A \subseteq \mathcal{P}_d$ and $\forall x_A \in \mathcal{X}_A$, $\mathbb{E}_{\xi \sim P_{Y|X_A = x_A}}\, k_\mathcal{Y}(\xi, \xi) < \infty$, with the convention $P_{Y|X_A} = P_Y$ if $A = \emptyset$.

First note that if we focus on the scalar output case $\mathcal{Y} \subset \mathbb{R}$ with the linear kernel $k_\mathcal{Y}(y, y') = y y'$, we have
$$\mathcal{S}_A^{\mathrm{MMD}} = \mathbb{E}_{X_A}\left( \mathbb{E}_{\xi \sim P_Y}(\xi) - \mathbb{E}_{\zeta \sim P_{Y|X_A}}(\zeta) \right)^2 = \mathbb{E}_{X_A}\left( \mathbb{E} Y - \mathbb{E}(Y \mid X_A) \right)^2 = \operatorname{Var} \mathbb{E}(Y \mid X_A),$$
that is, we recover the unnormalized Sobol' index for $X_A$. $\mathcal{S}_A^{\mathrm{MMD}}$ can thus be seen as a kernelized version of Sobol' indices since the latter can be retrieved with a specific kernel. However, since the linear kernel is not characteristic, the MMD in this case is not a distance, which means that $\mathcal{S}_A^{\mathrm{MMD}}$ is no longer a moment-independent index.

To make another connection with Sobol' indices, we now recall Mercer's theorem, a notable representation theorem for kernels.

Theorem 2 (Mercer, see Aubin (2000)). Suppose $k_\mathcal{Y}$ is a continuous symmetric positive definite kernel on a compact set $\mathcal{Y}$ and consider the integral operator $T_{k_\mathcal{Y}} : L^2(\mathcal{Y}) \to L^2(\mathcal{Y})$ defined by
$$\left( T_{k_\mathcal{Y}} f \right)(x) = \int_\mathcal{Y} k_\mathcal{Y}(y, u)\, f(u)\, du.$$
Then there is an orthonormal basis $\{e_r\}$ of $L^2(\mathcal{Y})$ consisting of eigenfunctions of $T_{k_\mathcal{Y}}$ such that the corresponding sequence of eigenvalues $\{\lambda_r\}$ is non-negative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on $\mathcal{Y}$ and $k_\mathcal{Y}$ has the representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \lambda_r\, e_r(y)\, e_r(y'),$$
where the convergence is absolute and uniform.

Assume now that the output $Y \in \mathcal{Y}$ with $\mathcal{Y}$ a compact set, meaning that Mercer's theorem holds. Then $k_\mathcal{Y}$ admits a representation
$$k_\mathcal{Y}(y, y') = \sum_{r=1}^{\infty} \phi_r(y)\, \phi_r(y')$$
where $\phi_r(y) = \sqrt{\lambda_r}\, e_r(y)$ are orthogonal functions in $L^2(\mathcal{Y})$. In this setting we can write
$$\begin{aligned} \mathcal{S}_A^{\mathrm{MMD}} = \mathbb{E}_{X_A}\, \mathrm{MMD}^2(P_Y, P_{Y|X_A}) &= \mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim P_{Y|X_A}}\, k_\mathcal{Y}(\xi, \xi') - \mathbb{E}_{\zeta,\zeta' \sim P_Y}\, k_\mathcal{Y}(\zeta, \zeta') \\ &= \mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim P_{Y|X_A}}\!\left( \sum_{r=1}^{\infty} \phi_r(\xi)\, \phi_r(\xi') \right) - \mathbb{E}_{\zeta,\zeta' \sim P_Y}\!\left( \sum_{r=1}^{\infty} \phi_r(\zeta)\, \phi_r(\zeta') \right). \end{aligned}$$
Now, since the convergence of the series is absolute, we can interchange the expectations and the summations to get
$$\begin{aligned} \mathcal{S}_A^{\mathrm{MMD}} &= \sum_{r=1}^{\infty} \left[ \mathbb{E}_{X_A} \mathbb{E}_{\xi,\xi' \sim P_{Y|X_A}}\!\left( \phi_r(\xi)\, \phi_r(\xi') \right) - \mathbb{E}_{\zeta,\zeta' \sim P_Y}\!\left( \phi_r(\zeta)\, \phi_r(\zeta') \right) \right] \\ &= \sum_{r=1}^{\infty} \left[ \mathbb{E}_{X_A}\, \mathbb{E}\!\left( \phi_r(Y) \mid X_A \right)^2 - \mathbb{E}\!\left( \phi_r(Y) \right)^2 \right] \\ &= \sum_{r=1}^{\infty} \operatorname{Var} \mathbb{E}\!\left( \phi_r(Y) \mid X_A \right). \end{aligned} \qquad (9)$$

In other words, the MMD-based sensitivity index $\mathcal{S}_A^{\mathrm{MMD}}$ generalizes the Sobol' one in the sense that it measures the impact of the inputs not only on the conditional expectation of the output, but on a possibly infinite number of transformations $\phi_r$ of the output, given by the eigenfunctions of the kernel.

We can now state the main theorem of this section on the ANOVA-like decomposition for $\mathcal{S}_A^{\mathrm{MMD}}$. Recall that the variance decomposition (1) states that the variance of the output can be decomposed as $\operatorname{Var} Y = \sum_{A \subseteq \mathcal{P}_d} V_A$ where each term is given by
$$V_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \operatorname{Var} \mathbb{E}(Y \mid X_B).$$
The MMD-based equivalent is obtained with the following theorem.

Theorem 3 (ANOVA decomposition for MMD). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumption 1, denote $\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}\, k_\mathcal{Y}(Y, Y) - \mathbb{E}\, k_\mathcal{Y}(Y, Y')$ where $Y'$ is an independent copy of $Y$. Then the total MMD can be decomposed as
$$\mathrm{MMD}^2_{\mathrm{tot}} = \sum_{A \subseteq \mathcal{P}_d} \mathrm{MMD}^2_A$$
where each term is given by
$$\mathrm{MMD}^2_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathbb{E}_{X_B}\, \mathrm{MMD}^2(P_Y, P_{Y|X_B}).$$

The proof is given in Appendix A.1. Theorem 3 is very similar to the ANOVA decomposition (1): one can note that the total variance of the output is replaced by a generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$ defined by the kernel, and that each subset effect is obtained by removing lower-order ones in the MMD distance of the conditional distributions (instead of the variance of the conditional expectations in the ANOVA). The following corollary states that these two decompositions coincide when the kernel is chosen as the linear one.

Corollary 1. When $Y \in \mathcal{Y} \subset \mathbb{R}$ and $k_\mathcal{Y}(y, y') = y y'$ in Theorem 3, the decomposition is identical to the decomposition (1), which means that
$$\mathrm{MMD}^2_{\mathrm{tot}} = \operatorname{Var} Y \quad \text{and} \quad \forall B \in \mathcal{P}_d,\ \ \mathbb{E}_{X_B}\, \mathrm{MMD}^2(P_Y, P_{Y|X_B}) = \operatorname{Var} \mathbb{E}(Y \mid X_B).$$
It further implies $\forall A \subseteq \mathcal{P}_d$, $\mathrm{MMD}^2_A = V_A$.

Thanks to Theorem 3 we can now define properly normalized MMD-based indices.

Definition 2 (MMD-based sensitivity indices). In the frame of Theorem 3, let $A \subseteq \mathcal{P}_d$. The normalized MMD-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S_A^{\mathrm{MMD}} = \frac{\mathrm{MMD}^2_A}{\mathrm{MMD}^2_{\mathrm{tot}}},$$
while the total MMD-based index associated to $A$ is
$$S_A^{T,\mathrm{MMD}} = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S_B^{\mathrm{MMD}} = 1 - \frac{\mathbb{E}_{X_{-A}}\, \mathrm{MMD}^2(P_Y, P_{Y|X_{-A}})}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$


From Theorem 3, we have the fundamental identity providing the interpretation of MMD-based indices as percentages of the explained generalized variance $\mathrm{MMD}^2_{\mathrm{tot}}$:
$$\sum_{A \subseteq \mathcal{P}_d} S_A^{\mathrm{MMD}} = 1.$$

Finally, we exhibit a generalized law of total variance for $\mathrm{MMD}^2_{\mathrm{tot}}$ which yields another formulation for the total MMD-based index.

Proposition 1 (Generalized law of total variance). Assuming Assumption 1 holds, we have
$$\mathrm{MMD}^2_{\mathrm{tot}} = \mathbb{E}_{X_A}\!\left[ \mathbb{E}_{\xi \sim P_{Y|X_A}}\, k_\mathcal{Y}(\xi, \xi) - \mathbb{E}_{\xi,\xi' \sim P_{Y|X_A}}\, k_\mathcal{Y}(\xi, \xi') \right] + \mathbb{E}_{X_A}\, \mathrm{MMD}^2(P_Y, P_{Y|X_A}).$$
The proof is to be found in Appendix A.2. This is a generalization in the sense that the total variance is replaced by $\mathrm{MMD}^2_{\mathrm{tot}}$, the conditional variance by $\mathbb{E}_{\xi \sim P_{Y|X_A}}\, k_\mathcal{Y}(\xi, \xi) - \mathbb{E}_{\xi,\xi' \sim P_{Y|X_A}}\, k_\mathcal{Y}(\xi, \xi')$ and the variance of the conditional expectation by $\mathbb{E}_{X_A}\, \mathrm{MMD}^2(P_Y, P_{Y|X_A})$. In particular, all these terms reduce to the ones in the classical law of total variance if one uses the linear kernel $k_\mathcal{Y}(y, y') = y y'$ in the scalar case. This gives the following corollary.

Corollary 2 (Other formulation of the total MMD-based index). In the frame of Theorem 3, we have for all $A \subseteq \mathcal{P}_d$
$$S_A^{T,\mathrm{MMD}} = \frac{\mathbb{E}_{X_{-A}}\!\left[ \mathbb{E}_{\xi \sim P_{Y|X_{-A}}}\, k_\mathcal{Y}(\xi, \xi) - \mathbb{E}_{\xi,\xi' \sim P_{Y|X_{-A}}}\, k_\mathcal{Y}(\xi, \xi') \right]}{\mathrm{MMD}^2_{\mathrm{tot}}}.$$
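For independent inputs, $\mathbb{E}_{X_A} \mathbb{E}_{\zeta,\zeta' \sim P_{Y|X_A}}\, k_\mathcal{Y}(\zeta, \zeta') = \mathbb{E}\, k_\mathcal{Y}(Y, Y^A)$, where $Y^A$ is the output obtained by keeping $X_A$ fixed and resampling $X_{-A}$ independently; a pick-freeze design therefore yields a simple Monte Carlo sketch of the first-order MMD-based indices. The Gaussian output kernel, its bandwidth and the Ishigami test function below are illustrative assumptions, and this is not meant as the estimators studied in Section 4.

```python
import numpy as np

def gauss_ky(y1, y2, bw=1.0):
    """Gaussian output kernel k_Y evaluated elementwise."""
    return np.exp(-(y1 - y2) ** 2 / (2.0 * bw ** 2))

def ishigami(x, a=7.0, b=0.1):
    return np.sin(x[:, 0]) + a * np.sin(x[:, 1]) ** 2 + b * x[:, 2] ** 4 * np.sin(x[:, 0])

def first_order_mmd_indices(model, d, n, rng, bw=1.0):
    """Normalized first-order MMD-based indices via a pick-freeze design.

    Relies on E_{X_l} E_{zeta,zeta'~P_{Y|X_l}} k_Y = E k_Y(Y, Y^l), where Y^l shares X_l with Y
    and has all other inputs resampled, and on MMD^2_tot = E k_Y(Y,Y) - E k_Y(Y,Y')
    (= 1 - E k_Y(Y,Y') for a Gaussian kernel)."""
    A = rng.uniform(-np.pi, np.pi, size=(n, d))
    B = rng.uniform(-np.pi, np.pi, size=(n, d))
    yA, yB = model(A), model(B)
    base = np.mean(gauss_ky(yA, yB, bw))            # estimate of E k_Y(Y, Y'), independent copies
    s = np.empty(d)
    for l in range(d):
        BAl = B.copy()
        BAl[:, l] = A[:, l]                         # Y^l: same X_l as the A design, rest from B
        s[l] = np.mean(gauss_ky(yA, model(BAl), bw)) - base
    return s / (1.0 - base)                         # divide by the estimate of MMD^2_tot

rng = np.random.default_rng(0)
print(np.round(first_order_mmd_indices(ishigami, d=3, n=50_000, rng=rng), 3))
```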

3.2.2 HSIC-based sensitivity indices

Another approach for combining kernel embeddings with sensitivity analysis consists in directly using HSIC as a sensitivity index. For example Da Veiga (2015) considers the unnormalized index
$$\mathcal{S}_A^{\mathrm{HSIC}} = \mathrm{HSIC}(X_A, Y)$$
relying on a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_{\mathcal{H}_A}((x, y), (x', y')) = k_{\mathcal{X}_A}(x_A, x'_A)\, k_\mathcal{Y}(y, y')$, provided the following assumption holds:

Assumption 2. $\forall A \subseteq \mathcal{P}_d$, $\mathbb{E}_{\xi \sim P_{X_A}}\, k_{\mathcal{X}_A}(\xi, \xi) < \infty$ and $\mathbb{E}_{\xi \sim P_Y}\, k_\mathcal{Y}(\xi, \xi) < \infty$.

In Da Veiga (2015) an empirical normalization inspired by the definition of the distance correlation criterion (Székely et al., 2007) was also proposed. But similarly to the MMD decomposition above, it is actually possible to exhibit an ANOVA-like decomposition for HSIC, thus providing a natural normalization constant. The main ingredient is an assumption on the kernel $k_\mathcal{X}$ associated to the input variables.

Assumption 3. The reproducing kernel $k_\mathcal{X}$ of $\mathcal{F}$ is of the form
$$k_\mathcal{X}(x, x') = \prod_{l=1}^{d} \left( 1 + k_l(x_l, x'_l) \right) \qquad (10)$$
where for each $l = 1, \ldots, d$, $k_l(\cdot,\cdot)$ is the reproducing kernel of a RKHS $\mathcal{F}_l$ of real functions depending only on variable $x_l$ and such that $1 \notin \mathcal{F}_l$. In addition, for all $l = 1, \ldots, d$ and $\forall x_l \in \mathcal{X}_l$, we have
$$\int_{\mathcal{X}_l} k_l(x_l, x'_l)\, dP_{X_l}(x'_l) = 0. \qquad (11)$$
The first part (10) of Assumption 3 may seem stringent, but it can easily be fulfilled by using univariate Gaussian kernels, since they define a RKHS which does not contain constant functions (Steinwart et al., 2006).

On the contrary, the second assumption (11) is more subtle. It requires kernels defining a so-called RKHS of zero-mean functions (Wahba et al., 1995). A prominent example of such a RKHS is obtained if (a) all input variables are uniformly distributed on $[0,1]$ and (b) the univariate kernels are chosen among the Sobolev kernels with smoothness parameter $r \geq 1$:
$$k_l(x_l, x'_l) = \frac{B_{2r}(|x_l - x'_l|)}{(-1)^{r+1} (2r)!} + \sum_{j=1}^{r} \frac{B_j(x_l)\, B_j(x'_l)}{(j!)^2} \qquad (12)$$
where $B_j$ is the Bernoulli polynomial of degree $j$. Even though applying a preliminary transformation on the inputs in order to get uniform variables is conceivable (with e.g. the probability integral transform), a more general and elegant procedure has been proposed by Durrande et al. (2012). Starting from an arbitrary univariate kernel $k(\cdot,\cdot)$, they build a zero-mean kernel $k_0^D(\cdot,\cdot)$ given by
$$k_0^D(x, x') = k(x, x') - \frac{\int k(x, t)\, dP(t)\, \int k(x', t)\, dP(t)}{\iint k(s, t)\, dP(s)\, dP(t)}$$
where $k_0^D(\cdot,\cdot)$ satisfies $\forall x$, $\int k_0^D(x, t)\, dP(t) = 0$. Interestingly, they also show that the RKHS $\mathcal{H}_0$ associated to $k_0^D(\cdot,\cdot)$ is orthogonal to the constant functions, thus directly satisfying the requirements for the product kernel (10).
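In practice the Durrande et al. (2012) construction can be carried out numerically when the integrals against $P_{X_l}$ are not available in closed form; the sketch below builds such a zero-mean kernel from a univariate Gaussian kernel, approximating the integrals with a Monte Carlo sample from $P_{X_l}$ (base kernel, bandwidth and quadrature are illustrative assumptions).

```python
import numpy as np

def make_zero_mean_kernel(base_kernel, sample_of_P):
    """Zero-mean version of a univariate kernel (Durrande et al., 2012).

    Integrals against the input distribution P are approximated with a Monte Carlo
    sample drawn from P, an illustrative quadrature choice."""
    t = np.asarray(sample_of_P)
    denom = base_kernel(t[:, None], t[None, :]).mean()       # iint k(s,t) dP(s) dP(t)

    def k0(x, xp):
        x, xp = np.atleast_1d(x), np.atleast_1d(xp)
        mx = base_kernel(x[:, None], t[None, :]).mean(axis=1)     # int k(x,t) dP(t)
        mxp = base_kernel(xp[:, None], t[None, :]).mean(axis=1)   # int k(x',t) dP(t)
        return base_kernel(x[:, None], xp[None, :]) - np.outer(mx, mxp) / denom

    return k0

gauss = lambda a, b: np.exp(-(a - b) ** 2 / (2.0 * 0.2 ** 2))    # base Gaussian kernel
rng = np.random.default_rng(0)
k0 = make_zero_mean_kernel(gauss, rng.uniform(0.0, 1.0, 2000))   # here P = U[0, 1]

# sanity check: int k0(x, t) dP(t) should be close to zero for any x (up to Monte Carlo error)
x_test = np.array([0.1, 0.5, 0.9])
t_check = rng.uniform(0.0, 1.0, 5000)
print(k0(x_test, t_check).mean(axis=1))
```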

More recently, several works made use of the Stein operator (Stein et al., 1972) to define the Stein discrepancy in a RKHS (Chwialkowski et al., 2016), which showed great potential for Monte Carlo integration (Oates et al., 2017) or goodness-of-fit tests (Gorham and Mackey, 2015; Chwialkowski et al., 2016; Jitkrittum et al., 2017) when the target distribution is either impossible to sample or is known up to a normalization constant. More precisely, given a RKHS $\mathcal{H}$ with kernel $k(\cdot,\cdot)$ of functions in $\mathbb{R}^d$ and a (target) probability distribution with density $p(\cdot)$, they define a new RKHS $\mathcal{H}_0$ with kernel $k_0^S(\cdot,\cdot)$ which writes
$$k_0^S(x, x') = \nabla_x \cdot \nabla_{x'} k(x, x') + \frac{\nabla_x p(x)}{p(x)} \cdot \nabla_{x'} k(x, x') + \frac{\nabla_{x'} p(x')}{p(x')} \cdot \nabla_x k(x, x') + \frac{\nabla_x p(x)}{p(x)} \cdot \frac{\nabla_{x'} p(x')}{p(x')}\, k(x, x')$$
and it can be proved that $\forall x \in \mathbb{R}^d$, $\int k_0^S(x, x')\, p(x')\, dx' = 0$. Unlike $k_0^D(\cdot,\cdot)$, the kernel $k_0^S(\cdot,\cdot)$ can still be defined when $p(\cdot)$ is known only up to a constant: this property may find interesting applications in GSA when the input distributions are obtained via a preliminary Bayesian data calibration, since it would no longer be required to perform a costly sampling step of their posterior distribution and one could easily use the unnormalized posterior distribution instead.

With Assumption 3, we can now state a decomposition for HSIC-based sensitivity indices.


Theorem 4 (ANOVA decomposition for HSIC). Under the same assumptions as Theorem 1 (in particular, the random vector $X$ has independent components) and with Assumptions 2 and 3, the HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$ can be decomposed as
$$\mathrm{HSIC}(X, Y) = \sum_{A \subseteq \mathcal{P}_d} \mathrm{HSIC}_A$$
where each term is given by
$$\mathrm{HSIC}_A = \sum_{B \subset A} (-1)^{|A|-|B|}\, \mathrm{HSIC}(X_B, Y)$$
and $\mathrm{HSIC}(X_B, Y)$ is defined with the product RKHS $\mathcal{H}_B = \mathcal{F}_B \times \mathcal{G}$ with kernel $k_B(x_B, x'_B)\, k_\mathcal{Y}(y, y') = \prod_{l \in B} \left( 1 + k_l(x_l, x'_l) \right) k_\mathcal{Y}(y, y')$ as in (10).

The proof, which mainly relies on Mercer’s theorem and on Theorem 4.1 from Kuo et al. (2010), is given in Appendix A.3. Once again, this decomposition resembles the ANOVA decomposition (1), where the conditional variances are replaced with HSIC dependence measures between subsets of inputs and the output.

Properly normalized HSIC-based indices can then be defined:

Definition 3 (HSIC-based sensitivity indices). In the frame of Theorem 4, let $A \subseteq \mathcal{P}_d$. The normalized HSIC-based sensitivity index associated to a subset $A$ of input variables is defined as
$$S_A^{\mathrm{HSIC}} = \frac{\mathrm{HSIC}_A}{\mathrm{HSIC}(X, Y)},$$
while the total HSIC-based index associated to $A$ is
$$S_A^{T,\mathrm{HSIC}} = \sum_{B \subseteq \mathcal{P}_d,\, B \cap A \neq \emptyset} S_B^{\mathrm{HSIC}} = 1 - \frac{\mathrm{HSIC}(X_{-A}, Y)}{\mathrm{HSIC}(X, Y)}.$$

From Theorem 4, we have the fundamental identity providing the interpretation of HSIC-based indices as percentages of the explained HSIC dependence measure between $X = (X_1, \ldots, X_d)$ and $Y$:
$$\sum_{A \subseteq \mathcal{P}_d} S_A^{\mathrm{HSIC}} = 1.$$

Finally, a noteworthy asymptotic result yields a link between HSIC-based indices and MMD-based ones when the input kernel $k_\mathcal{X}$ degenerates to a Dirac kernel, as elaborated in the following proposition.

Proposition 2. For any subset $A \subseteq \mathcal{P}_d$, let us define a product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ with kernel $k_A(x_A, x'_A)\, k_\mathcal{Y}(y, y')$. We further assume that $\forall x_A \in \mathcal{X}_A$, $p_{X_A}(x_A) > 0$ and that
$$k_A(x_A, x'_A) = \frac{1}{\sqrt{p_{X_A}(x_A)}\, \sqrt{p_{X_A}(x'_A)}} \prod_{l \in A} \frac{1}{h}\, K\!\left( \frac{x_l - x'_l}{h} \right) \qquad (13)$$
where $K : \mathbb{R} \to \mathbb{R}$ is a symmetric kernel function satisfying $\int K(u)\, du = 1$, and $h > 0$. Then we have, $\forall A \subseteq \mathcal{P}_d$,
$$\lim_{h \to 0} \mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A}\, \mathrm{MMD}^2(P_Y, P_{Y|X_A})$$
where $\mathrm{HSIC}(X_A, Y)$ is defined with the product RKHS $\mathcal{H}_A = \mathcal{F}_A \times \mathcal{G}$ and $\mathrm{MMD}^2(P_Y, P_{Y|X_A})$ with the RKHS $\mathcal{G}$.


The proof is given in Appendix A.4. As a particular case of Proposition 2, one can for example choose a (normalized) Gaussian kernel for $k_\mathcal{X}$ with a standard deviation tending to 0, or the sinc kernel associated to the RKHS of band-limited continuous functions with a cutoff frequency tending to infinity. Obviously the result also holds if one uses different kernels $K$ for each input $X_l \in X_A$ in Eq. (13).

Although they may seem trivial, Proposition 2 and Corollary 1 actually justify our claim that both the MMD- and the HSIC-based sensitivity indices are natural generalizations of Sobol' indices, in the sense that a degenerate HSIC-based index with a Dirac kernel for the input variables gives the MMD-based index which, in turn, is equal to the Sobol' index when using the dot-product kernel for the output.

3.3 Kernel-embedding Shapley effects

In this section, we discuss how the previously introduced indices can still be valuable when the input variables are no longer independent. In this setting, the Shapley effects introduced in the context of GSA by Owen (Owen, 2014) and based on Shapley values (Shapley, 1953) from game theory have appealing properties, since they provide a proper allocation of the output variance to each input variable without requiring independence. We recall their definition below.

Definition 4 (Shapley effects (Shapley, 1953)). For any $l = 1, \ldots, d$, the Shapley effect of input $X_l$ is given by
$$\mathrm{Sh}_l = \frac{1}{\operatorname{Var} Y}\, \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1} \left( \operatorname{Var} \mathbb{E}\!\left(Y \mid X_{A \cup \{l\}}\right) - \operatorname{Var} \mathbb{E}(Y \mid X_A) \right). \qquad (14)$$
This definition corresponds to the Shapley value (Shapley, 1953)
$$\phi_l = \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1} \left( \mathrm{val}(A \cup \{l\}) - \mathrm{val}(A) \right)$$
with value function $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}_+$ equal to $\mathrm{val}(A) = \operatorname{Var} \mathbb{E}(Y \mid X_A) / \operatorname{Var} Y$. Moreover, we have the decomposition
$$\sum_{l=1}^{d} \mathrm{Sh}_l = 1.$$

The only requirement is that the value function $\mathrm{val} : \mathcal{P}_d \to \mathbb{R}_+$ satisfies $\mathrm{val}(\emptyset) = 0$. Combining this result with the kernel-based sensitivity indices is consequently straightforward, which leads to the definition of kernel-embedding Shapley effects:

Definition 5 (Kernel-embedding Shapley effects). For any $l = 1, \ldots, d$, we define

(a) The MMD-Shapley effect
$$\mathrm{Sh}_l^{\mathrm{MMD}} = \frac{1}{\mathrm{MMD}^2_{\mathrm{tot}}}\, \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1} \left( \mathbb{E}_{X_{A \cup \{l\}}}\, \mathrm{MMD}^2\!\left(P_Y, P_{Y|X_{A \cup \{l\}}}\right) - \mathbb{E}_{X_A}\, \mathrm{MMD}^2\!\left(P_Y, P_{Y|X_A}\right) \right) \qquad (15)$$
provided Assumption 1 holds.


(b) The HSIC-Shapley effect
$$\mathrm{Sh}_l^{\mathrm{HSIC}} = \frac{1}{\mathrm{HSIC}(X, Y)}\, \frac{1}{d} \sum_{A \subseteq \mathcal{P}_d,\, A \not\ni l} \binom{d-1}{|A|}^{-1} \left( \mathrm{HSIC}\!\left(X_{A \cup \{l\}}, Y\right) - \mathrm{HSIC}(X_A, Y) \right) \qquad (16)$$
provided Assumptions 2 and 3 hold.

We further have the decompositions
$$\sum_{l=1}^{d} \mathrm{Sh}_l^{\mathrm{MMD}} = \sum_{l=1}^{d} \mathrm{Sh}_l^{\mathrm{HSIC}} = 1.$$

Just like in the independent setting, kernel-embedding Shapley effects (15) and (16) can be seen as general kernelized versions of Shapley effects, since Proposition 2 and Corollary 1 are still valid when the inputs are dependent.
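Since Definitions 4 and 5 share the same combinatorial structure and only differ through the value function, a single routine can compute variance-based, MMD- or HSIC-Shapley effects once estimates of val(A) are available for every subset; the sketch below implements the exact subset-weighted formula for small $d$, with a toy analytical value function (correlated Gaussian inputs and a linear model) standing in for such estimates.

```python
import numpy as np
from itertools import combinations
from math import comb

def shapley_from_value_function(val, d):
    """Exact Shapley effects from a value function val(A) defined on subsets of {0, ..., d-1}.

    Implements Sh_l = (1/d) * sum_{A not containing l} C(d-1, |A|)^{-1} (val(A u {l}) - val(A)).
    The same routine applies to variance-based, MMD- or HSIC-based value functions."""
    sh = np.zeros(d)
    for l in range(d):
        others = [p for p in range(d) if p != l]
        for size in range(d):
            for A in combinations(others, size):
                w = 1.0 / (d * comb(d - 1, size))
                sh[l] += w * (val(frozenset(A) | {l}) - val(frozenset(A)))
    return sh

# toy value function: explained-variance share of a linear model with correlated Gaussian
# inputs, val(A) = Var E(Y | X_A) / Var Y (illustrative stand-in for estimated values)
beta = np.array([1.0, 0.8, 0.3])
Sigma = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 1.0]])

def val(A):
    A = sorted(A)
    if not A:
        return 0.0
    S_AA = Sigma[np.ix_(A, A)]
    S_xA = Sigma[:, A]
    # for Gaussian inputs and Y = beta^T X: Var E(Y|X_A) = beta^T Sigma_{.A} Sigma_{AA}^{-1} Sigma_{A.} beta
    cond_var = beta @ S_xA @ np.linalg.solve(S_AA, S_xA.T) @ beta
    return cond_var / (beta @ Sigma @ beta)

sh = shapley_from_value_function(val, d=3)
print(np.round(sh, 3), "sum =", round(sh.sum(), 3))
```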

Remark 1. In the machine learning community dedicated to the interpretability of black-box models, an importance measure called Kernel SHAP has been recently introduced (Lundberg and Lee, 2017). Although its name resembles ours, the two designate clearly separate approaches, since the Kernel SHAP measure is a local Shapley effect, and the "Kernel" denomination only refers to an estimation procedure without any link to RKHS.

3.4 Enhancing traditional GSA with kernels

Beyond their intrinsic theoretical interest, the kernel-ANOVA decompositions and the associated sensitivity indices also appear powerful from a practical point of view when one carefully examines the potential of using kernels. We give below some insights on how they could enhance traditional GSA studies in several settings.

Categorical model outputs and target sensitivity analysis. In some applications, the model output $Y$ is categorical, meaning that $\mathcal{Y} = \{1, \ldots, K\}$ when the output can take $K$ levels. A simple common instance involves two levels, corresponding to a failure/success situation. Similarly, even if $Y$ is not categorical, the objective may be to measure the impact of each input on the fact that the output reaches disjoint regions of interest $\mathcal{R}_1, \ldots, \mathcal{R}_K \subset \mathcal{Y}$, as for example when one focuses on events $\{t_{i+1} > Y > t_i\}$ for thresholds $t_i$, $i = 1, \ldots, K$. Such an objective is called target sensitivity analysis (TSA, see Marrel and Chabridon (2020)) and can be reformulated in a categorical framework by the change of variable $Z = i$ if $Y \in \mathcal{R}_i$.

The case where $\mathcal{Y} = \{0, 1\}$ (or equivalently $Z = \mathbf{1}_{\{Y \in \mathcal{R}\}}$) is frequent in TSA. A straightforward approach is to use Sobol' indices with a 0/1 output, yielding a first-order Sobol' index equal to
$$S_l^{\mathrm{TSA}} = \frac{\mathbb{E}_{X_l}\left( \mathbb{P}(Y = 1 \mid X_l) - \mathbb{P}(Y = 1) \right)^2}{\mathbb{P}(Y = 1)\left( 1 - \mathbb{P}(Y = 1) \right)}, \qquad (17)$$
see Li et al. (2012). But to the best of our knowledge, no systematic procedure is available when the number of levels is greater than two. Without resorting yet to our kernel-based indices, there are at least two roads, which actually lead to the same indices:


(a) The one-versus-all approach, where we compute several Sobol' indices by repeatedly considering $Z = \mathbf{1}_{\{Y = i\}}$ for all $i = 1, \ldots, K$. We thus have a collection of indices
$$S_l^{\mathrm{TSA},[i]} = \frac{\mathbb{E}_{X_l}\left( \mathbb{P}(Y = i \mid X_l) - \mathbb{P}(Y = i) \right)^2}{\mathbb{P}(Y = i)\left( 1 - \mathbb{P}(Y = i) \right)},$$
and we can aggregate them by weighting each of them by its own variance, yielding
$$S_l^{\mathrm{TSA}} = \frac{\sum_{i=1}^{K} \mathbb{P}(Y = i)(1 - \mathbb{P}(Y = i))\, S_l^{\mathrm{TSA},[i]}}{\sum_{i=1}^{K} \mathbb{P}(Y = i)(1 - \mathbb{P}(Y = i))} = \frac{\sum_{i=1}^{K} \mathbb{E}_{X_l}\left( \mathbb{P}(Y = i \mid X_l) - \mathbb{P}(Y = i) \right)^2}{\sum_{i=1}^{K} \mathbb{P}(Y = i)(1 - \mathbb{P}(Y = i))}.$$

(b) The one-hot encoding approach, which consists in encoding the categorical output into a multivariate vector of 0/1 variables and using the aggregated Sobol' indices defined in Gamboa et al. (2013) on these transformed variables. More precisely, if $\mathcal{Y} = \{1, \ldots, K\}$, $Y$ is encoded as a $K$-dimensional vector $(Z_1 = \mathbf{1}_{\{Y = 1\}}, \ldots, Z_K = \mathbf{1}_{\{Y = K\}})$. The aggregated Sobol' indices are then
$$S_l^{\mathrm{TSA}} = \frac{\sum_{i=1}^{K} \operatorname{Var} \mathbb{E}(Z_i \mid X_l)}{\sum_{i=1}^{K} \operatorname{Var} Z_i} = \frac{\sum_{i=1}^{K} \mathbb{E}_{X_l}\left( \mathbb{P}(Y = i \mid X_l) - \mathbb{P}(Y = i) \right)^2}{\sum_{i=1}^{K} \mathbb{P}(Y = i)(1 - \mathbb{P}(Y = i))},$$
which is exactly the index obtained with the one-versus-all approach.

As for the kernel-based indices, the process is less cumbersome, since the only ingredient that requires attention is the choice of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ adapted to categorical outputs, which has already been investigated in the kernel literature (Song et al., 2007, 2012). We focus here on the simple Dirac kernel defined as $k_\mathcal{Y}(y, y') = \delta(y, y')$ for categorical values $y, y' \in \{1, \ldots, K\}$, and the corresponding kernel-based indices are then

• The first-order MMD-based index:
$$\begin{aligned} S_l^{\mathrm{MMD}} &= \frac{\mathbb{E}_{X_l}\!\left[ \sum_{i=1}^{K}\sum_{j=1}^{K} \delta(i,j)\, \mathbb{P}(Y = i \mid X_l)\, \mathbb{P}(Y = j \mid X_l) \right] - \sum_{i=1}^{K}\sum_{j=1}^{K} \delta(i,j)\, \mathbb{P}(Y = i)\, \mathbb{P}(Y = j)}{\sum_{i=1}^{K} \mathbb{P}(Y = i) - \sum_{i=1}^{K}\sum_{j=1}^{K} \delta(i,j)\, \mathbb{P}(Y = i)\, \mathbb{P}(Y = j)} \\ &= \frac{\mathbb{E}_{X_l}\!\left[ \sum_{i=1}^{K} \mathbb{P}(Y = i \mid X_l)^2 \right] - \sum_{i=1}^{K} \mathbb{P}(Y = i)^2}{\sum_{i=1}^{K} \mathbb{P}(Y = i) - \sum_{i=1}^{K} \mathbb{P}(Y = i)^2} \\ &= \frac{\sum_{i=1}^{K} \mathbb{E}_{X_l}\left( \mathbb{P}(Y = i \mid X_l) - \mathbb{P}(Y = i) \right)^2}{\sum_{i=1}^{K} \mathbb{P}(Y = i)\left( 1 - \mathbb{P}(Y = i) \right)}, \end{aligned}$$
where we retrieve again the one-versus-all Sobol' index.

• The first-order HSIC-based index:
$$\begin{aligned} \mathcal{S}_l^{\mathrm{HSIC}} &= \int_{\mathcal{X}_l \times \mathcal{X}_l} \sum_{i=1}^{K}\sum_{j=1}^{K} k_{\{l\}}(x, x')\, \delta(i,j) \left( p_{X_l|Y=i}(x) - p_{X_l}(x) \right)\left( p_{X_l|Y=j}(x') - p_{X_l}(x') \right) \mathbb{P}(Y = i)\, \mathbb{P}(Y = j)\, dx\, dx' \\ &= \sum_{i=1}^{K} \mathbb{P}(Y = i)^2 \int_{\mathcal{X}_l \times \mathcal{X}_l} k_{\{l\}}(x, x') \left( p_{X_l|Y=i}(x) - p_{X_l}(x) \right)\left( p_{X_l|Y=i}(x') - p_{X_l}(x') \right) dx\, dx' \\ &= \sum_{i=1}^{K} \mathbb{P}(Y = i)^2\, \mathrm{MMD}^2\!\left( P_{X_l|Y=i}, P_{X_l} \right), \end{aligned}$$
thus extending the result of Spagnol et al. (2019) to any number of levels.

• The MMD- and HSIC-Shapley effects, using one of the above indices as a building block.

Interestingly, it has been shown that Eq. (17) can also be written, up to a constant, as the Pearson $\chi^2$ divergence between $P_{X_l|Y=1}$ and $P_{X_l}$ (Perrin and Defaux, 2019; Spagnol, 2020). This means that $S_l^{\mathrm{TSA}} = S_l^{\mathrm{MMD}}$ and $\mathcal{S}_l^{\mathrm{HSIC}}$ essentially have the same interpretation as weighted sums of distances between the initial input distribution and the conditional input distributions (when restricted to an output level), with the Pearson $\chi^2$ divergence and the MMD distance, respectively. But we will see in Section 4 that the estimation of HSIC-based sensitivity indices is much less prone to the curse of dimensionality and does not require density estimation, as opposed to $S_l^{\mathrm{TSA}}$ (Perrin and Defaux, 2019). Finally, note that another kernel for categorical variables has also been proposed (Song et al., 2007, 2012), but it is actually a normalized Dirac kernel which would only modify the weights in the indices above.
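As a minimal illustration of the categorical case, the sketch below plugs the Dirac output kernel into the standard V-statistic HSIC estimator of Gretton et al. (2005) (a simplifying assumption here, rather than the zero-mean-kernel estimators of Section 4), on a thresholded Ishigami output playing the role of a two-level target variable.

```python
import numpy as np

def hsic_biased(Kx, Ky):
    """Standard V-statistic HSIC estimate (Gretton et al., 2005): tr(Kx H Ky H) / n^2."""
    n = Kx.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(Kx @ H @ Ky @ H) / n ** 2

def dirac_kernel(y):
    """Dirac kernel matrix for a categorical output: k(y, y') = 1 if y == y', else 0."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

def gaussian_kernel(x, bw=0.5):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * bw ** 2))

# toy target sensitivity analysis: a two-level variable obtained by thresholding the Ishigami output
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.uniform(-np.pi, np.pi, size=(n, d))
Y = np.sin(X[:, 0]) + 7.0 * np.sin(X[:, 1]) ** 2 + 0.1 * X[:, 2] ** 4 * np.sin(X[:, 0])
Z = (Y > 5.0).astype(int)                       # categorical output with K = 2 levels
Kz = dirac_kernel(Z)
print([round(hsic_biased(gaussian_kernel(X[:, l]), Kz), 5) for l in range(d)])
```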

Beyond scalar model outputs. In many numerical simulation models, some of the outputs are curves representing the temporal evolution of physical quantities of the system, such as temperatures or pressures. One can also encounter industrial applications which involve spatial outputs (Marrel et al., 2008). In such cases, the two main approaches in GSA are (a) the ubiquitous point of view, where one sensitivity index is computed for each time step or each spatial location (Terraz et al., 2017) and (b) the dimension reduction angle, in which one first projects the output onto a low-dimensional vector space and then calculates aggregated sensitivity indices for this new multivariate vector (Lamboni et al., 2011; Gamboa et al., 2013).

However, the kernel perspective for such structured outputs can bring new insights for GSA.

Indeed the kernel literature has already proposed several ways to handle curves or images in regression or classification tasks. For instance the PCA-kernel (Ferraty and Vieu, 2006) can be used as an equivalent of (b), as illustrated in Da Veiga (2015). More interestingly, kernels dedicated to time series have been designed, such as the global alignment kernel (Cuturi, 2011) inspired by the dynamic time-warping kernel (Sakoe and Chiba, 1978). Such a kernel could be employed in industrial applications where one is interested in the impact of an input variable on the shape of the output curve. On the other hand, for dealing with spatial outputs similar to images, such as in Marrel et al. (2008), one may consider a kernel designed for image classification (Harchaoui and Bach, 2007), which would be better suited to analyze the impact of inputs on the change of the shapes appearing inside the image output.

Finally, numerical models involving graphs as inputs or outputs (e.g. electricity networks or molecules) may now be tractable with GSA by employing kernels specifically tailored for graphs (Gärtner et al., 2003; Ramon and Gärtner, 2003).

Stochastic numerical models. On occasion one has to deal with stochastic simulators, where internally the numerical model relies on random draws to compute the output. Typical industrial applications include models dedicated to the optimization of maintenance costs, where random failures are simulated during the system life cycle, or molecular modeling to predict macroscopic properties based on statistical mechanics, where several microstates of the system are generated at random (Moutoussamy et al., 2015). For fixed values of the input variables, the output is therefore a probability distribution, meaning that $\mathcal{Y} \subset \mathcal{M}_1^+$, the set of probability measures. In this setting GSA aims at measuring how changes in the inputs modify the output probability distribution, which is clearly outside the traditional scope of GSA.

Once again the kernel point of view makes it possible to easily recycle the MMD- and HSIC-based sensitivity indices in this context, since they only require the definition of a kernel $k_\mathcal{Y}(\cdot,\cdot)$ on probability distributions. This can be achieved through one of the two following kernels:
$$k_\mathcal{Y}(P, Q) = \sigma^2 e^{-\lambda\, \mathrm{MMD}^2(P, Q)} \qquad (18)$$
introduced in Song (2008), or
$$k_\mathcal{Y}(P, Q) = \sigma^2 e^{-\lambda\, W_2^2(P, Q)}$$
discussed in Bachoc et al. (2017), where $P, Q \in \mathcal{M}_1^+$, $W_2$ is the Wasserstein distance and $\sigma^2, \lambda > 0$ are parameters.
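With distribution-valued outputs represented by samples of internal draws, the kernel (18) can be evaluated directly between empirical distributions, as in the following sketch (the Gaussian kernel inside the MMD and the values of $\sigma^2$ and $\lambda$ are illustrative assumptions):

```python
import numpy as np

def mmd2_vstat(u, v, bw=1.0):
    """Biased (V-statistic) MMD^2 between two empirical samples with a Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bw ** 2))
    return k(u, u).mean() + k(v, v).mean() - 2.0 * k(u, v).mean()

def distribution_kernel(u, v, sigma2=1.0, lam=1.0, bw=1.0):
    """Kernel between probability distributions, Eq. (18): sigma^2 exp(-lambda MMD^2(P, Q))."""
    return sigma2 * np.exp(-lam * mmd2_vstat(u, v, bw))

# two runs of a (toy) stochastic simulator at different input values,
# each output being an empirical distribution over internal random draws
rng = np.random.default_rng(0)
out_x1 = rng.normal(0.0, 1.0, size=300)
out_x2 = rng.normal(0.4, 1.2, size=300)
print(distribution_kernel(out_x1, out_x2))
```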

4 Estimation

With properly normalized kernel-based sensitivity indices defined above, we now discuss their estimation. The HSIC-based index is examined first, as we only consider previously proposed estimators. The MMD-based index is then analyzed more thoroughly, since several estimators can be envisioned given its close links with Sobol' indices. Finally we investigate the estimation of kernel-embedding Shapley effects.

4.1 HSIC-based index estimation

We start by observing that if Assumption 3 holds, for any subset $A \subseteq \mathcal{P}_d$ we have $\mathbb{E}_{X_A}\, k_A(X_A, x'_A) = 1$ for all $x'_A \in \mathcal{X}_A$, which means that HSIC in Eq. (8) simplifies into
$$\mathrm{HSIC}(X_A, Y) = \mathbb{E}_{X_A, X'_A, Y, Y'}\, k_A(X_A, X'_A)\, k_\mathcal{Y}(Y, Y') - \mathbb{E}_{Y, Y'}\, k_\mathcal{Y}(Y, Y').$$
Given a sample $\left( x^{(i)}, y^{(i)} \right)$, $i = 1, \ldots, n$, and following Song et al. (2007) and Gretton et al. (2008), two estimators $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$, based on U- and V-statistics respectively, can be introduced:
$$\widehat{\mathrm{HSIC}}_u(X_A, Y) = \frac{1}{n(n-1)} \sum_{i,j=1,\, i \neq j}^{n} \left( k_A\!\left(x_A^{(i)}, x_A^{(j)}\right) - 1 \right) k_\mathcal{Y}\!\left(y^{(i)}, y^{(j)}\right),$$
$$\widehat{\mathrm{HSIC}}_b(X_A, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n} \left( k_A\!\left(x_A^{(i)}, x_A^{(j)}\right) - 1 \right) k_\mathcal{Y}\!\left(y^{(i)}, y^{(j)}\right),$$
where we assume that $P_{X_A}$ is known and is used to compute analytically the zero-mean kernels of Section 3.2.2. The study of the version of the above estimators where the sample $\left( x^{(i)} \right)_{i=1,\ldots,n}$ also serves to estimate $k_A$ is left as future work. Both $\widehat{\mathrm{HSIC}}_u(X_A, Y)$ and $\widehat{\mathrm{HSIC}}_b(X_A, Y)$ converge in probability to $\mathrm{HSIC}(X_A, Y)$ at rate $1/\sqrt{n}$, and one can show (Song et al., 2007) that if $k_A$ and $k_\mathcal{Y}$ are bounded almost everywhere by 1 and are nonnegative, then with probability at least $1 - \delta$ we have
$$\left| \widehat{\mathrm{HSIC}}_u(X_A, Y) - \mathrm{HSIC}(X_A, Y) \right| \leq 8 \sqrt{\log(2/\delta)/n}.$$
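A direct transcription of the two estimators above is given in the sketch below, assuming the univariate zero-mean Sobolev kernel (12) with $r = 1$ for inputs uniform on $[0,1]$ and a Gaussian output kernel (both illustrative choices; any kernels satisfying Assumptions 2 and 3 would do):

```python
import numpy as np

def hsic_u_and_v(KA, KY):
    """U- and V-statistic estimators of HSIC(X_A, Y) under Assumption 3.

    KA is the n x n Gram matrix of the product kernel k_A of Eq. (10) built from
    zero-mean univariate kernels (so that E_{X_A} k_A(X_A, .) = 1, hence the '- 1'),
    and KY is the Gram matrix of the output kernel k_Y."""
    n = KA.shape[0]
    M = (KA - 1.0) * KY                                   # (k_A(x_i, x_j) - 1) * k_Y(y_i, y_j)
    hsic_v = M.sum() / n ** 2                             # V-statistic (biased)
    hsic_u = (M.sum() - np.trace(M)) / (n * (n - 1))      # U-statistic (unbiased)
    return hsic_u, hsic_v

def sobolev_kernel_r1(x, xp):
    """Sobolev kernel (12) with r = 1 for U[0,1] inputs: B2(|x - x'|)/2 + B1(x) B1(x')."""
    B1 = lambda t: t - 0.5
    B2 = lambda t: t ** 2 - t + 1.0 / 6.0
    return B2(np.abs(x[:, None] - xp[None, :])) / 2.0 + np.outer(B1(x), B1(xp))

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.uniform(0.0, 1.0, size=(n, d))
Y = X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)     # toy model with a noisy output
KY = np.exp(-(Y[:, None] - Y[None, :]) ** 2 / (2.0 * 0.5 ** 2))

for l in range(d):
    KA = 1.0 + sobolev_kernel_r1(X[:, l], X[:, l])        # product kernel (10) for A = {l}
    print("input", l, hsic_u_and_v(KA, KY))
```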
