Variance-based global sensitivity analysis and beyond in life cycle assessment: an application to geothermal heating networks

JAXA-ROZEN, Marc, PRATIWI, Astu Sam, TRUTNEVYTE, Evelina

JAXA-ROZEN, Marc, PRATIWI, Astu Sam, TRUTNEVYTE, Evelina. Variance-based global sensitivity analysis and beyond in life cycle assessment: an application to geothermal heating networks. International Journal of Life Cycle Assessment, 2021

DOI : 10.1007/s11367-021-01921-1

Available at:

http://archive-ouverte.unige.ch/unige:151614

Disclaimer: layout of this document may differ from the published version.

Variance-based global sensitivity analysis and beyond in life cycle assessment: an application to geothermal heating networks

Marc Jaxa-Rozen¹*, Astu Sam Pratiwi¹, Evelina Trutnevyte¹

1 Renewable Energy Systems, Institute for Environmental Sciences (ISE), Section of Earth and Environmental Sciences, University of Geneva, Switzerland

* corresponding author (Uni Carl Vogt, Boulevard Carl Vogt 66, CH-1211 Geneva 4, Switzerland; +41 22 379040; [email protected])

Contents

1. Sobol indices for variance-based global sensitivity analysis
2. PAWN indices for distribution-based global sensitivity analysis
3. Spectral clustering
4. Patient Rule Induction Method for scenario discovery

A companion Python code notebook for these method descriptions is provided separately:

Jaxa-Rozen M, Pratiwi AS, Trutnevyte E (2021) Analysis workflow for sensitivity analysis and scenario discovery.

http://doi.org/10.5281/zenodo.4201064


1. Sobol indices for variance-based global sensitivity analysis

The Sobol method for global sensitivity analysis (GSA) uses a variance decomposition scheme to identify each uncertain input's contribution to the output variance of a model (Saltelli et al., 2010; Sobol, 1993, 2001). Taking a model with k parameters and the general form $Y = f(\mathbf{X}) = f(X_1, \ldots, X_k)$, the model can be decomposed into a total of $2^k$ terms:

$$f(X_1, \ldots, X_k) = f_0 + \sum_{i=1}^{k} f_i(X_i) + \sum_{i=1}^{k} \sum_{j=i+1}^{k} f_{ij}(X_i, X_j) + \ldots + f_{1,\ldots,k}(X_1, \ldots, X_k)$$

The total unconditional variance of the model V(Y) can be decomposed into corresponding partial variances for each term, assuming that the parameters are independent (Oakley & O’Hagan, 2004). The implications of this assumption for sensitivity analysis, which may need to represent correlated parameters, are explored in e.g. Groen & Heijungs (2017).

$$V(Y) = \sum_{i=1}^{k} V_i + \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} V_{ij} + \ldots + V_{1,\ldots,k}$$

Using these partial variances, Sobol sensitivity indices can then be defined in relation to the total variance of the model. We assume the total variance is the result of propagating uncertainty in the k input parameters, by computing the model output Y over X when the latter is drawn from specified probability distributions for X1, ..., Xk. The first-order index S1i, or main effect, measures the fraction by which the output variance would be reduced on average by fixing parameter Xi within its range. Taking the input matrix X~i that excludes parameter Xi, the first-order index can be expressed as follows; the inner expectation represents the mean of Y over all values of X~i, keeping Xi fixed. The variance of this mean is then taken over all values of Xi.

$$S1_i = \frac{V_i}{V(Y)} = \frac{V_{X_i}\left(E_{\mathbf{X}_{\sim i}}[Y \mid X_i]\right)}{V(Y)}$$

Second-order indices S2ip represent the fraction of output variance contributed by the pairwise interaction between two parameters Xi and Xp. The inner operator takes the mean of Y over all values of X~i,p, keeping Xi and Xp fixed. The variance of this mean is then taken over all values of Xi and Xp. We subtract the main effects of Xi and Xp to isolate the contribution of the higher-order interaction.

$$S2_{ip} = \frac{V_{ip}}{V(Y)} = \frac{V_{X_i, X_p}\left(E_{\mathbf{X}_{\sim i,p}}[Y \mid X_i, X_p]\right)}{V(Y)} - S1_i - S1_p$$

Finally, the total index STi represents the sum of the main effect of Xi and all its higher-order interactions up to order k. In this case, the inner operator takes the variance over the values of Xi when the input matrix X~i is fixed. This variance is then averaged over all values of X~i. The total index can be interpreted as the average fraction of the output variance that would remain if all parameters except Xi were fixed.

$$ST_i = 1 - \frac{V_{\sim i}}{V(Y)} = \frac{E_{\mathbf{X}_{\sim i}}\left(V_{X_i}(Y \mid \mathbf{X}_{\sim i})\right)}{V(Y)}$$


These indices offer several useful properties to interpret model output. The sum of first-order indices indicates the overall importance of interactions between parameters; for a perfectly additive model without interactions, in which variance would only be contributed by the main effects of the parameters, S1 and ST would be equal, with $\sum_{i=1}^{k} S1_i = 1$. Differences between ST and S1 will otherwise highlight parameters for which higher-order interactions are significant, with $\sum_{i=1}^{k} S1_i < 1$ and $\sum_{i=1}^{k} ST_i > 1$ (as interactions will be double-counted across the parameters involved). These indices can be used for factor prioritization, in which the input parameters with the highest first-order index S1 are assessed as the most influential. These influential factors can for instance be prioritized for further research to narrow their uncertainty range, as fixing them would lead to the greatest expected reduction in output variance. Conversely, input parameters with a small total index ST contribute little to output variance either through their main effect or through interactions, and they can be considered non-influential (while keeping in mind that this is conditional on the assumed input distributions).
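As a minimal worked illustration of these properties (a hypothetical example, not one of the test cases of this work), consider the product model Y = X1·X2, with X1 and X2 independent and uniformly distributed over [0, 1]. Under these assumptions, the indices can be derived by hand:

$$
\begin{aligned}
V(Y) &= E[Y^2] - E[Y]^2 = \tfrac{1}{9} - \tfrac{1}{16} = \tfrac{7}{144} \\
V_1 = V_2 &= V\left(\tfrac{1}{2} X_i\right) = \tfrac{1}{48} = \tfrac{3}{144} \\
V_{12} &= V(Y) - V_1 - V_2 = \tfrac{1}{144} \\
S1_1 = S1_2 &= \tfrac{3}{7} \approx 0.43, \qquad S2_{12} = \tfrac{1}{7} \approx 0.14, \qquad ST_1 = ST_2 = \tfrac{4}{7} \approx 0.57
\end{aligned}
$$

The first-order indices sum to 6/7 < 1 and the total indices sum to 8/7 > 1, since the single interaction term is counted once in the total index of each parameter.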

In practice, the computation of model variances typically requires a numerical approximation, rather than an analytical form. We here summarize the sampling design presented by Saltelli et al. (2010), which can be automated using the SALib package (Herman & Usher, 2017). This sampling design estimates S1, ST and S2 indices with a total of N=n(2k+2) samples, with n a baseline sample size, and k the number of model parameters.

If only S1 and ST are required, the total computational cost is reduced to N=n(k+2). The sampling is based on two independent matrices A and B of size n × k, with row indices j = 1, ..., n and column indices i = 1, ..., k.

These matrices contain the values of parameters X1...k for which the model should be computed, with each column of parameter values being sampled from a user-specified distribution:

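$$
A = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\cdots & \cdots & \cdots & \cdots \\
X_{n1} & X_{n2} & \cdots & X_{nk}
\end{bmatrix}
\qquad
B = \begin{bmatrix}
X^*_{11} & X^*_{12} & \cdots & X^*_{1k} \\
X^*_{21} & X^*_{22} & \cdots & X^*_{2k} \\
\cdots & \cdots & \cdots & \cdots \\
X^*_{n1} & X^*_{n2} & \cdots & X^*_{nk}
\end{bmatrix}
$$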

These parameter values should preferably be sampled from a quasi-random low-discrepancy sequence, such as the Sobol sequence implemented in SALib based on Joe & Kuo (2008). This sequence is designed to place samples as uniformly as possible (Figure 1), and is typically more efficient than normal Monte Carlo sampling for the estimation of Sobol indices (e.g. Becker et al., 2018, section 3.3). The rate of convergence of the Sobol estimation is close to 1/n with this quasi-random sequence, compared to 1/√n with Monte Carlo (Saltelli et al., 2008, p. 84).

Nonetheless, the large number of samples required to reliably estimate sensitivity indices remains a key downside of variance-based GSA for complex models, as the total number of model executions N may commonly exceed 1e5 even with quasi-random sequences (e.g. Butler et al., 2014; Nossent et al., 2011).
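The difference between random and low-discrepancy sampling can be sketched with a few lines of code. The example below is only an illustration: it uses a two-dimensional Halton sequence (built from van der Corput sequences in bases 2 and 3) as a simpler stand-in for the Sobol sequence used in this work, and the helper names `van_der_corput`, `halton_2d` and `empty_cells` are hypothetical. It counts how many cells of a 10 × 10 grid over the unit square receive no sample, as a rough indicator of uniformity.

```python
import numpy as np

def van_der_corput(n, base):
    """First n points of the van der Corput sequence in the given base."""
    points = np.zeros(n)
    for idx in range(n):
        i, frac, value = idx + 1, 1.0 / base, 0.0
        # Mirror the base-b digits of i around the radix point
        while i > 0:
            i, digit = divmod(i, base)
            value += digit * frac
            frac /= base
        points[idx] = value
    return points

def halton_2d(n):
    """Two-dimensional Halton sequence (stand-in for a Sobol sequence)."""
    return np.column_stack([van_der_corput(n, 2), van_der_corput(n, 3)])

def empty_cells(points, bins=10):
    """Number of cells of a bins x bins grid on [0,1]^2 without any sample."""
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=bins, range=[[0, 1], [0, 1]])
    return int(np.sum(counts == 0))

halton = halton_2d(150)
monte_carlo = np.random.default_rng(0).random((150, 2))
print("empty cells, Halton:     ", empty_cells(halton))
print("empty cells, Monte Carlo:", empty_cells(monte_carlo))
```

A lower count of empty cells suggests a more regular coverage of the input space for the same number of samples.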

Figure 1: Left panel: example of a random Monte Carlo sample for 150 points, with x1 and x2 drawn from a uniform distribution over [0,1]. Right panel: quasi-random sampling for 150 points using a Sobol sequence and the same input distributions. Marginal histograms show the resulting distribution of the samples across x1 and x2.


Matrices A and B can be obtained by generating a (quasi-)random sequence of size n × 2k, then splitting it into halves. From the resulting matrices of size n × k, two new matrices are introduced for each parameter Xi to compute its sensitivity indices: $A_{B,i}$ is copied from A, but its column i (i.e. containing values for parameter Xi) is replaced by column i of matrix B. Conversely, a matrix $B_{A,i}$ is copied from B, but its column i is replaced with the corresponding column of matrix A.

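$$
A_{B,i} = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X^*_{1i} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X^*_{2i} & \cdots & X_{2k} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
X_{n1} & X_{n2} & \cdots & X^*_{ni} & \cdots & X_{nk}
\end{bmatrix}
\qquad
B_{A,i} = \begin{bmatrix}
X^*_{11} & X^*_{12} & \cdots & X_{1i} & \cdots & X^*_{1k} \\
X^*_{21} & X^*_{22} & \cdots & X_{2i} & \cdots & X^*_{2k} \\
\cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\
X^*_{n1} & X^*_{n2} & \cdots & X_{ni} & \cdots & X^*_{nk}
\end{bmatrix}
$$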

Using these matrices, the sensitivity indices S1, ST and S2 are then computed using the following estimators; $f(B)_j$ is for example the output value of the model that is obtained using the input parameter values of row j from matrix B. The denominator yields the sample variance taken over A and B, following the implementation of the estimators in the SALib package and Theorem 2 of Saltelli (2002).

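$$S1_i = \frac{\frac{1}{n} \sum_{j=1}^{n} f(B)_j \left( f(A_{B,i})_j - f(A)_j \right)}{\frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j^2 + f(B)_j^2 \right) - \left( \frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j + f(B)_j \right) \right)^2}$$

$$S2_{ip} = \frac{\frac{1}{n} \sum_{j=1}^{n} \left( f(B_{A,i})_j f(A_{B,p})_j - f(A)_j f(B)_j \right)}{\frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j^2 + f(B)_j^2 \right) - \left( \frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j + f(B)_j \right) \right)^2} - S1_i - S1_p$$

$$ST_i = \frac{\frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j - f(A_{B,i})_j \right)^2}{\frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j^2 + f(B)_j^2 \right) - \left( \frac{1}{2n} \sum_{j=1}^{n} \left( f(A)_j + f(B)_j \right) \right)^2}$$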
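The first-order and total estimators can be sketched directly with NumPy. This is a minimal illustration under stated assumptions rather than the SALib implementation: it uses plain Monte Carlo sampling for A and B instead of a Sobol sequence, centers the outputs on the pooled mean to reduce estimator noise (the indices are unaffected by a constant shift), and is checked on a hypothetical additive model Y = X1 + 2·X2 with uniform inputs over [0, 1], for which S1 = ST = (0.2, 0.8) analytically. The function names are illustrative.

```python
import numpy as np

def sobol_first_and_total(f, k, n, rng):
    """Estimate S1 and ST from matrices A, B and the k matrices A_B,i."""
    A = rng.random((n, k))          # two independent n x k sample matrices
    B = rng.random((n, k))
    f_A, f_B = f(A), f(B)
    pooled = np.concatenate([f_A, f_B])
    mean, var = pooled.mean(), pooled.var()   # sample variance over A and B
    f_A, f_B = f_A - mean, f_B - mean         # centering (assumed, for stability)
    S1, ST = np.empty(k), np.empty(k)
    for i in range(k):
        A_Bi = A.copy()
        A_Bi[:, i] = B[:, i]                  # column i taken from B
        f_ABi = f(A_Bi) - mean
        S1[i] = np.mean(f_B * (f_ABi - f_A)) / var
        ST[i] = 0.5 * np.mean((f_A - f_ABi) ** 2) / var
    return S1, ST

def additive_model(X):
    # Hypothetical test model with known indices S1 = ST = (0.2, 0.8)
    return X[:, 0] + 2.0 * X[:, 1]

S1, ST = sobol_first_and_total(additive_model, k=2, n=2**15,
                               rng=np.random.default_rng(0))
print("S1:", S1.round(3), "ST:", ST.round(3))
```

The model is evaluated n(k+2) times in total, which matches the cost stated above when only S1 and ST are required.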

We use an idealized bivariate function to illustrate the computation and interpretation of the indices, with the following form and unconditional output distribution (Figure 2). We assume x1 and x2 are uniformly distributed. This test case and its analysis are demonstrated in Python in the Jupyter Notebook provided with this work (Jaxa-Rozen et al., 2021).

$$Y = \cos(x_1) + \left[2 \cdot x_2 + (x_1 - 0.5)\right]^2; \qquad x_1 \sim U(0, 1), \quad x_2 \sim U(0, 1)$$

Figure 2: Left panel: output surface for Y as a function of x1 and x2, approximated using 1000 Monte Carlo samples. x1 and x2 are drawn from uniform distributions over [0,1]. Right panel: Gaussian kernel density estimate for the distribution of Y.


Approximating the function with n=1000 as a baseline sample size, we use SALib to compute S1 and ST (Figure 3, left panel), with S2x1x2 = 0.407. The estimated indices converge more quickly using a quasi-random Sobol sequence to sample input values instead of Monte Carlo sampling (Figure 3, right panel). As we have k=2 parameters, the total computational cost in this case is N=n(2k+2) = 6000 samples. x2 contributes over 90% of the output variance in total, with a strong main effect of approximately half of the variance. However, the difference between S1 and ST (which in this case can be attributed entirely to S2x1x2) highlights the presence of interactions. As can be observed from the model response (Figure 2, left panel), x2 has a much greater effect on output when x1 is fixed to 1 than when x1 is fixed to 0. In this case, this can be explained analytically by the multiplicative term between x1 and x2.

Figure 3: Left panel: First-order (S1) and total (ST) Sobol indices for x1 and x2. Black lines show the 95% confidence interval for each index. Right panel: convergence of the first-order Sobol indices as a function of the base sample size n, using a Sobol sampling sequence (solid lines) and Monte Carlo sampling (dashed lines).

As an additional example, we illustrate a basic estimation procedure for S1x2, by drawing 10,000 Monte Carlo samples and artificially constraining the values of x2 to 20 levels within [0,1]; we then plot the output Y against x2 (Figure 4). The conditional output of the function across the independent values of x1 clearly shifts over the different levels of x2. The blue markers show the mean value of Y at each level of x2.

The main effect S1x2 is then simply the variance of these mean values, divided by the unconditional variance of Y (i.e. over all the values of x1 and x2). The code notebook compares this "manual" approach for estimating S1 and ST with the SALib computation.

Figure 4: Conditional mean values obtained for Y (blue markers) across 20 levels of x2, for 10,000 Monte Carlo samples. The first-order index S1 of x2 is the variance of these conditional mean values, divided by the unconditional variance of Y.


2. PAWN indices for distribution-based global sensitivity analysis

Distribution-based methods for global sensitivity analysis (GSA) typically define the importance of model parameters using the change in the output distribution obtained by fixing each of these parameters. As with other GSA approaches, this assumes the model output distribution is produced by propagating uncertainty in the k input parameters by computing the model output Y over input X, when the latter is drawn from specified probability distributions for parameters X1, ..., Xk. For a parameter Xi, the sensitivity index Si is then proportional to the difference between the unconditional output distribution obtained by varying all parameters X simultaneously, and the conditional output distribution Y | Xi obtained when Xi is fixed. The PAWN technique (Pianosi & Wagener, 2015, 2018) applies this approach by systematically computing a Kolmogorov-Smirnov (K-S) statistic to measure the difference between conditional and unconditional output distributions, using the empirical cumulative distribution functions (CDFs) of the model output samples.

We illustrate the variant of PAWN described in Pianosi & Wagener (2018), available in Matlab, Python or R in the open-source SAFE Toolbox package (Noacco et al., 2019). Using this variant, sensitivity indices can be estimated from generic datasets sampled with Monte Carlo or a Latin hypercube design, or a specific design like the quasi-random Sobol sequence used in variance-based GSA. We note that Latin hypercube sampling (available in the SALib package; Herman & Usher, 2017) generally provides better performance than Monte Carlo by regularly covering the range of each input. This strategy generates a desired number of n equiprobable intervals (so that, for a uniformly distributed parameter, they are uniformly spaced over the range of the parameter). A sample is then randomly placed within each interval (Figure 5, right panel). For a small n, Monte Carlo sampling may under-explore certain regions of the input space (Figure 5, left panel), and yield unstable results for the estimation of sensitivity indices with distribution-based GSA. The stability of these indices can be assessed during the analysis by using bootstrap resamples of the data, i.e. resampling with replacement, as implemented in the SAFE Toolbox.

Figure 5: Left panel: random Monte Carlo sampling of five points, with x1 and x2 drawn from a uniform distribution over [0,1]. Right panel: Latin hypercube sampling for five points, with the same input distributions. Dashed lines illustrate the division of each input range into equiprobable sampling intervals.
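The Latin hypercube strategy described above can be sketched in a few lines for uniform inputs over [0, 1]. This is a minimal illustration with a hypothetical function name, not the SALib implementation: each parameter range is divided into n equal intervals, the intervals are shuffled independently for each parameter, and one sample is placed at a random position within each interval.

```python
import numpy as np

def latin_hypercube(n, k, rng):
    """n samples of k parameters, with one sample in each of n intervals per parameter."""
    # One random ordering of the n intervals for each parameter
    strata = np.column_stack([rng.permutation(n) for _ in range(k)])
    # Random position within each interval, rescaled to [0, 1)
    return (strata + rng.random((n, k))) / n

X = latin_hypercube(5, 2, np.random.default_rng(1))
print(X.round(3))
```

Non-uniform input distributions could then be obtained by passing each column through the inverse cumulative distribution function of the corresponding parameter.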

We consider a model f(Xj), with k uncertain input parameters Xj = (Xj1, ..., Xji, ..., Xjk) sampled from arbitrary distributions over multiple samples j = 1, ..., N. We obtain the corresponding model output Y = (Y1, ..., Yj, ..., YN).

We then define the empirical cumulative distribution $\hat{F}_Y$, and divide the range of each input parameter into n conditioning intervals Ii,t that are equiprobable, i.e. contain an equal number of samples (so that, for a parameter sampled from a uniform distribution U(a, b), the intervals would be equally spaced over [a, b]).


For parameter Xi, the value of the K-S statistic in each interval Ii,t is the maximum absolute difference between the empirical CDF of the unconditional output Y, and the empirical CDF of the output when only considering the samples for which the value of Xi falls in the interval Ii,t :

$$KS_i(I_{i,t}) = \max_{Y} \left| \hat{F}_Y(Y) - \hat{F}_{Y|X_i}(Y \mid X_i \in I_{i,t}) \right|$$

This computation is repeated across all intervals Ii,t. The resulting K-S values from each interval are then summarized using a specified statistic (e.g. the mean, median or maximum value of the K-S values across intervals), yielding the PAWN sensitivity index $\hat{S}_i$ for Xi.

$$\hat{S}_i = \underset{t=1,\ldots,n}{\mathrm{stat}}\; KS_i(I_{i,t})$$
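The computation of the K-S statistic in each conditioning interval, and its aggregation into a PAWN index, can be sketched with NumPy. This is a simplified illustration with hypothetical names and is not the SAFE Toolbox code: the conditioning intervals are built by splitting the samples, sorted along each input, into groups of equal size, and the empirical CDFs are compared at all observed output values. The example model, in which the output depends only on the second input, is an assumption chosen so that the expected ranking is obvious.

```python
import numpy as np

def pawn_ks(X, Y, n_intervals=10):
    """K-S statistic between unconditional and conditional output CDFs.

    Returns an array of shape (k, n_intervals), one row per input parameter.
    """
    N, k = X.shape
    y_eval = np.sort(Y)                                  # evaluation points
    F_unc = np.searchsorted(y_eval, y_eval, side="right") / N
    ks = np.empty((k, n_intervals))
    for i in range(k):
        order = np.argsort(X[:, i])
        # Equiprobable intervals: equal numbers of samples along X_i
        for t, idx in enumerate(np.array_split(order, n_intervals)):
            y_cond = np.sort(Y[idx])
            F_cond = np.searchsorted(y_cond, y_eval, side="right") / y_cond.size
            ks[i, t] = np.max(np.abs(F_unc - F_cond))
    return ks

rng = np.random.default_rng(0)
X = rng.random((5000, 2))
Y = X[:, 1] ** 2            # hypothetical model: only the second input matters

ks = pawn_ks(X, Y)
pawn_median = np.median(ks, axis=1)   # median PAWN index for each input
pawn_max = ks.max(axis=1)             # maximum PAWN index for each input
print("median:", pawn_median.round(3), "max:", pawn_max.round(3))
```

Because the K-S values are computed from an existing set of samples, changing `n_intervals` or the summary statistic does not require any additional model evaluations.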

We visually demonstrate the PAWN computation with the analytical function presented in section 1, using 10,000 samples from a Latin hypercube design to estimate Y, and dividing the input range of x1 and x2 into 10 conditioning intervals (Figure 6). This test case is demonstrated in Python in the Jupyter Notebook provided with this work (Jaxa-Rozen et al., 2021). We first compute the empirical CDFs of the samples in each interval of x1 and x2, and compare them to the unconditional CDF over all values of x1 and x2 (Figure 7).

$$Y = \cos(x_1) + \left[2 \cdot x_2 + (x_1 - 0.5)\right]^2; \qquad x_1 \sim U(0, 1), \quad x_2 \sim U(0, 1)$$

Figure 6: Scatter plots showing the output Y of the analytical test function, as a function of x1 (left panel) and x2 (right panel), for 10,000 Latin hypercube samples. Shaded intervals show the 10 intervals used to compute the conditional cumulative distribution function of the output across values of x1 and x2.

Figure 7: Conditional cumulative distribution of the output (gray lines), for x1 (left panel) and x2 (right panel). Red lines show the unconditional cumulative distribution. The shade of the gray lines corresponds to the intervals in Figure 6.


The K-S statistics for these comparisons are then taken for each interval, and aggregated using their median and maximum, yielding the median and maximum PAWN indices for x1 and x2 (Figure 8). In this case, x1 and x2 have roughly equal influence on the output distribution when considering the median K-S value, but x2 has a stronger influence on the output distribution when it is fixed in its lowest or highest interval (reflected in its higher maximum K-S value). In a practical case, this could for instance suggest further research on the reliability of the assumed parameter bounds.

Figure 8: Kolmogorov-Smirnov statistic in each interval used to compute conditional cumulative distributions, for x1 (black line) and x2 (blue line).

The computational cost of estimating the PAWN indices from an existing dataset is negligible, so the robustness of these indices can easily be tested by evaluating a different number of computation intervals on the same dataset (Puy et al., 2020). Bootstrap resamples of the dataset can also be used to find a confidence interval for the indices; this is especially relevant to check whether the dataset is sufficiently large to yield a stable estimation of the indices, or whether more model executions are needed.


3. Spectral clustering

Clustering methods aim to identify underlying patterns in a dataset, by aggregating data samples into a certain number of subgroups (or clusters) based on their similarity. Measures of similarity can be derived from relative relations between samples (such as their closeness when the data is represented as a graph; Hastie et al., 2009), or from direct attributes of the data samples, such as their quantitative values. For example, the k-means clustering algorithm (Lloyd, 1982) uses the squared Euclidean distance between sample values (measured across an arbitrary number of dimensions) as a measure of their similarity, assigning samples to a prespecified number of k clusters based on the distance between each sample and the nearest of k cluster centroids.

However, k-means clustering assumes that the clusters are spherical and separable, so that this algorithm cannot easily identify irregularly-shaped or non-convex clusters (Hastie et al., 2009). For these cases, methods based on spectral clustering may perform more reliably. These methods represent the data samples in the form of a graph, i.e. as a set of nodes connected by a set of (optionally weighted) edges. This approach is comprehensively detailed in von Luxburg (2007). In this representation, each data sample is a node, and two nodes are connected by an edge if their similarity meets a certain condition. This edge may additionally be weighted using a similarity measure (so that similar nodes are connected by an edge of higher weight). The clustering approach then aims to partition this graph into a specified number of groups of nodes, in a way that identifies groups that contain strongly connected (i.e. similar) nodes, but that are weakly connected to each other (i.e. by a smaller number of minimally weighted edges, corresponding to a low similarity score across groups).

We describe a basic implementation of spectral clustering using a k-nearest neighbors algorithm to convert the data into a graph representation. Starting with a dataset of samples {x1, ..., xn} and the corresponding set of nodes {v1, ..., vn}, an edge of weight wij = 1 is generated between two nodes vi and vj if vi is within the k nearest neighbors of vj, or if vj is within the k nearest neighbors of vi. For continuous variables, the nearest neighbors are commonly found using Euclidean distance. We represent these edges in the adjacency matrix W of size n × n, $W = (w_{ij})_{i,j=1,\ldots,n}$, in which two connected nodes have wij = 1, and unconnected nodes (i.e. that are not within each other's nearest neighbors) have wij = 0. The weights wij could be further modified with a similarity measure, such as a radial-basis function kernel $w_{ij} = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, where γ is a kernel parameter that is typically best adjusted empirically to the properties of the data, e.g. using a grid search (von Luxburg, 2007). This weighted similarity approach can for instance be applied using the spectral clustering module of the scikit-learn Python package (Pedregosa et al., 2011). By default, this approach generates a fully connected graph (i.e. where all nodes are connected by edges) then weights each edge using the radial-basis function kernel. We use this weighted similarity approach for the LCA case studies described in the main text, in which the nodes used for clustering correspond to output samples of interest selected in each case study. Each of these nodes is defined by the vector of eight environmental impacts computed for each sample using uncertain input parameters; the squared Euclidean distance $\lVert x_i - x_j \rVert^2$ between nodes xi and xj is thus computed from the vectors of resulting impact values for each node, then used to weight the edge between these nodes.
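The radial-basis function weighting can be sketched as follows for a fully connected graph. The function name, the three example samples and the value of gamma are illustrative assumptions; in practice, gamma would be tuned to the data as noted above.

```python
import numpy as np

def rbf_affinity(samples, gamma=1.0):
    """Fully connected affinity matrix with w_ij = exp(-gamma * ||x_i - x_j||^2)."""
    diff = samples[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)      # squared Euclidean distances
    return np.exp(-gamma * sq_dist)

# Three hypothetical samples described by two values each
samples = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
W = rbf_affinity(samples, gamma=0.5)
print(W.round(3))
```

Samples that are close to each other receive a weight near 1, while the weight decays towards 0 as the squared distance increases.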

Here assuming an adjacency matrix W generated by the k-nearest neighbors approach without additional weighting, we compute the degree d of each node (i.e. the number of edges by which it is connected to other nodes) by summing the rows of the adjacency matrix:

$$d_i = \sum_{j=1}^{n} w_{ij}$$

We take the degree matrix D as a diagonal matrix with the degrees d1, ..., dn on the diagonal, and define the graph Laplacian matrix, L = D - W (several definitions are possible for this matrix, summarized in Section 3 of von Luxburg, 2007). We then compute the eigenvectors and eigenvalues of this matrix L, which have several useful properties for clustering (Proposition 1 of von Luxburg, 2007). In particular, the number of eigenvalues that are equal to zero indicates the number of connected components in the graph. This number then suggests a suitable number of consistent clusters that could be found in the original data. In practical cases, eigenvalues nearly equal to zero should also indicate consistent clusters.


For a desired number of k clusters (which can be prespecified, or based on the analysis of the eigenvalues), we then take the eigenvectors u1, ..., uk that correspond to the k smallest eigenvalues of L. We assemble these eigenvectors in a matrix U of size n × k, with the eigenvectors u1, ..., uk as columns. Finally, we apply a conventional k-means clustering algorithm on U, taking the rows of U as samples to be clustered based on their values across the k eigenvectors. We then label each of the original data samples using this cluster classification obtained by k-means for each sample across the eigenvectors of L.
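The steps above can be sketched end to end with NumPy. This is a minimal illustration under assumptions, not the scikit-learn implementation: the dataset consists of two hypothetical concentric circles generated directly (rather than with make_circles), the adjacency matrix is an unweighted 5-nearest neighbors graph, and the final k-means step is a bare Lloyd iteration with a farthest-point initialization chosen to keep the example deterministic. All function names are illustrative.

```python
import numpy as np

def knn_adjacency(X, k):
    """Unweighted adjacency: w_ij = 1 if i or j is among the other's k nearest neighbors."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Column 0 of the sorted indices is the sample itself (distance 0)
    neighbours = np.argsort(dist, axis=1)[:, 1:k + 1]
    W = np.zeros_like(dist)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, neighbours.ravel()] = 1.0
    return np.maximum(W, W.T)

def simple_kmeans(U, k, n_iter=20):
    """Bare Lloyd iterations with farthest-point initialization (assumed choice)."""
    centroids = [U[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(U - c, axis=1) for c in centroids], axis=0)
        centroids.append(U[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        d = np.linalg.norm(U[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(d, axis=1)
        centroids = np.array([U[labels == c].mean(axis=0) for c in range(k)])
    return labels

def spectral_clustering(X, n_clusters, n_neighbours=5):
    W = knn_adjacency(X, n_neighbours)
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # sorted in increasing order
    U = eigenvectors[:, :n_clusters]               # k smallest eigenvalues
    return simple_kmeans(U, n_clusters), eigenvalues

# Two concentric circles of 100 samples each, with a small jitter
rng = np.random.default_rng(0)
angles = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * circle, 3.0 * circle]) + rng.normal(scale=0.01, size=(200, 2))

labels, eigenvalues = spectral_clustering(X, n_clusters=2)
n_zero = int(np.sum(eigenvalues < 1e-8))
print("near-zero eigenvalues:", n_zero)
print("labels on inner / outer circle:", set(labels[:100]), set(labels[100:]))
```

In this sketch, the number of near-zero eigenvalues corresponds to the number of disconnected components of the nearest-neighbors graph, and each circle is expected to receive a single label.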

We illustrate this approach on a stylized two-dimensional dataset generated using the make_circles function of the scikit-learn package, generating 300 samples grouped in six circular clusters (demonstrated in the Jupyter Notebook provided with this work, Jaxa-Rozen et al., 2021). We also use scikit-learn to apply k-means clustering, using the package's default parameters. Directly using k-means on this dataset yields poor results (Figure 9, left panel), as a distance-based approach is unable to recover concentric clusters. However, when instead using k-means across the eigenvectors of the graph Laplacian matrix, the original clusters are correctly identified using a 5-nearest neighbors adjacency matrix (Figure 9, right panel).

Figure 9: Left panel: results of k-means clustering on a two-dimensional test dataset; different marker colors correspond to the six identified clusters. Right panel: results of spectral clustering using a 5-nearest neighbors adjacency matrix on the same dataset.

To explain this result with the underlying eigenvector-based classification, we visualize the eigenvalues of the graph Laplacian, sorted by increasing value (Figure 10, left panel). Six of these values are equal to zero, so that we could expect to find six consistent clusters. We then visualize the matrix U, which contains as columns the six eigenvectors corresponding to these lowest eigenvalues (Figure 10, right panel). The 300 rows of U (with one row for each original data sample) all fall into six clearly distinct patterns across the eigenvectors. Unlike the original representation of the data in its two-dimensional space, the eigenvector patterns corresponding to the circular clusters can effectively be recovered using the Euclidean distance used in k-means clustering.

Figure 10: Left panel: eigenvalues of the graph Laplacian matrix used for spectral clustering, sorted in increasing order. Right panel: values of the eigenvectors corresponding to the six smallest eigenvalues of the graph Laplacian matrix, plotted across the 300 samples of the test dataset.


4. Patient Rule Induction Method for scenario discovery

Methods for scenario discovery (Bryant & Lempert, 2010; Groves & Lempert, 2007) aim to identify the combinations and values of uncertain model input parameters associated with a specified region of the model output. These methods have primarily been developed and applied in the context of the literature on decision-making under deep uncertainty (reviewed in Kwakkel & Haasnoot, 2019), and rely on statistical rule-induction algorithms such as the Patient Rule Induction Method (PRIM; Friedman & Fisher, 1999) or Classification and Regression Trees (CART; Breiman et al., 1984). This work uses PRIM, which yields more easily interpretable results for models with a relatively large number of input parameters (Bryant & Lempert, 2010). Implementations of this algorithm are available in Python (EM Workbench package; Kwakkel, 2017) and R (sdtoolkit or OpenMORDM; Bryant, 2014, Hadka et al., 2015). We first describe the PRIM algorithm following Friedman & Fisher (1999), then illustrate it with a stylized example.

We consider a model f(Xj) with k uncertain input parameters, Xj = (Xj1, ..., Xji, ..., Xjk), sampled from arbitrary distributions over multiple samples j = 1, ..., N. When using PRIM for scenario discovery, the output of the model Y = (Y1, ..., Yj, ..., YN) is first pre-processed to specify samples that are of interest for the analysis. Typically, the original model output values will be transformed to binary values that indicate whether a given sample is of interest (1) or not (0). These samples of interest can be labeled using a logical condition, based on any relevant criterion for the analysis: for instance, a performance threshold that should be met, or the cluster labeling produced by a clustering analysis. The PRIM algorithm then searches for a region of the input space that maximizes the average value of the corresponding model output, i.e. that leads to a high density of output samples of interest. This assumes that the identified region of the input space is described by a set of k restrictions on the range of each parameter Xi. These restrictions are in turn defined by bounds mini and maxi on the value of the parameter:

$$\bigcap_{i=1}^{k} \left( \mathit{min}_i \le X_i \le \mathit{max}_i \right)$$

If a parameter Xi is highly influential towards the output samples of interest, we could expect the restricted range between mini and maxi to be narrow relative to the original sampling bounds. For a non-influential parameter, mini and maxi may be unrestricted, i.e. matching the original sampling bounds for the parameter. The search for these restricted parameter ranges uses an iterated hill-climbing optimization procedure. Starting from an initial unrestricted box B which contains all model samples $\{Y_j, \mathbf{X}_j\}, j = 1, \ldots, N$, each step of the algorithm "peels" a slice of data by choosing an optimal sub-box of samples b^* to be removed. Each of these peeling steps is applied along a single dimension i (i.e. one of the model input parameters), creating a new, smaller box Bnew. The sub-boxes available for peeling depend on the data type of the input parameters: for categorical inputs, sub-boxes for each discrete value of the parameter are assessed for peeling. For real-valued inputs, two candidate sub-boxes $b^*_{i-}$ and $b^*_{i+}$ are assessed for each input Xi:

$$b^*_{i-} = \{\mathbf{X}_j \mid X_i < X_{i(\alpha)}\} \qquad b^*_{i+} = \{\mathbf{X}_j \mid X_i > X_{i(1-\alpha)}\}$$

$b^*_{i-}$ contains the samples that are below a specified quantile α for the values of Xi in the current box, so that removing this sub-box would peel "upwards" from the current box along Xi. Conversely, $b^*_{i+}$ contains the samples that are above the quantile 1-α for Xi, and removing this sub-box would peel "downwards" along Xi. For each step of the algorithm, a set C is created containing the sub-boxes available for peeling across all inputs.

The sub-box b^* to be removed is then picked from this set using a specified objective function. The original objective function proposed by Friedman & Fisher (1999) maximizes the average value of the output samples Yj contained in the box Bnew that remains after removing b^*:

$$b^* = \underset{b^* \in C}{\arg\max}\; \frac{\sum_{\mathbf{X}_j \in B_{new}} Y_j}{|B_{new}|}$$

Alternative objective functions can be used to make the peeling more "lenient" in the case of categorical inputs (Kwakkel & Jaxa-Rozen, 2016). This procedure is repeated until the number of samples contained in Bnew reaches a specified minimum threshold, or fraction of the original box B. The bounds of the box along each dimension i then correspond to the constrained ranges of values mini and maxi for each input parameter, that are associated with samples of interest. A limitation of the PRIM algorithm is that it describes (hyper-) rectangular boxes of the input space, which may not always be an effective approximation of the input-output mapping. To evaluate the quality of the identified boxes, metrics for coverage and density are typically used (Lempert et al., 2008; these metrics are equivalent to recall and precision in the statistical learning literature, respectively). Coverage is the fraction of all samples of interest that is contained in Bnew, and density is the fraction of samples in Bnew that is of interest. There is usually a trade-off between these two metrics, so that a box with high coverage (such as the initial box B that contains all samples) may have relatively low density, and vice versa. The implementation of PRIM used in this work (Kwakkel, 2017) interactively visualizes this trade-off along the "peeling trajectory" followed by the algorithm, so that a suitable box can be picked depending on the purposes of the analysis.
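The peeling procedure for real-valued inputs can be sketched with NumPy. This is a simplified illustration under assumptions and not the EM Workbench implementation: it only handles continuous inputs, uses the original objective (mean of the binary output in the remaining box), expresses the minimum mass as a number of samples, and records coverage and density after each peeling step. The function name and the rectangular region of interest in the example are hypothetical.

```python
import numpy as np

def prim_peel(X, y, alpha=0.05, min_samples=50):
    """Greedy PRIM peeling for real-valued inputs and a binary output y."""
    n, k = X.shape
    inside = np.ones(n, dtype=bool)               # samples in the current box
    lower, upper = X.min(axis=0), X.max(axis=0)   # current box bounds
    trajectory = []                               # (coverage, density) per step
    while True:
        best = None
        for i in range(k):
            q_low, q_high = np.quantile(X[inside, i], [alpha, 1.0 - alpha])
            candidates = [("lower", q_low, X[:, i] >= q_low),    # remove b*_i-
                          ("upper", q_high, X[:, i] <= q_high)]  # remove b*_i+
            for side, bound, keep in candidates:
                remaining = inside & keep
                size = remaining.sum()
                if size < min_samples or size == inside.sum():
                    continue
                mean_y = y[remaining].mean()      # objective: mean of y in B_new
                if best is None or mean_y > best[0]:
                    best = (mean_y, i, side, bound, remaining)
        if best is None:                          # no admissible peel left
            break
        _, i, side, bound, inside = best
        if side == "lower":
            lower[i] = bound
        else:
            upper[i] = bound
        trajectory.append((y[inside].sum() / y.sum(), y[inside].mean()))
    return lower, upper, inside, trajectory

# Hypothetical example: samples of interest lie in a rectangular corner
rng = np.random.default_rng(0)
X = rng.random((2000, 2))
y = ((X[:, 0] > 0.6) & (X[:, 1] < 0.4)).astype(float)

lower, upper, inside, trajectory = prim_peel(X, y, alpha=0.1, min_samples=200)
coverage, density = trajectory[-1]
print("box:", lower.round(2), upper.round(2))
print("coverage:", round(coverage, 2), "density:", round(density, 2))
```

The list `trajectory` corresponds to the peeling trajectory mentioned above: earlier entries have higher coverage and lower density, so an analyst could stop at any intermediate box rather than at the final one.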

For situations where samples of interest are associated with a highly non-rectangular region of the input, the samples can be pre-processed using principal component analysis to more effectively describe the input-output mapping, using linear combinations of the original input parameters (Dalal et al., 2013). If the samples are associated with multiple disjoint regions of the input, the analysis can simply be repeated after identifying the first PRIM box, to find a different combination of constrained input ranges. The samples contained in the first box can either be removed from the analysis altogether before running a second iteration of the algorithm, or relabeled as samples that are not of interest (Guivarch et al., 2016).

To illustrate the algorithm, we use the bivariate function introduced in section 1 and demonstrated in the Jupyter Notebook provided with this work (Jaxa-Rozen et al., 2021):

𝑌 = 𝑐𝑜𝑠(𝑥1) + [2 ∙ 𝑥2 + (𝑥1 − 0.5)]^O ; 𝑥1~𝑈[0, 1] 𝑥2~𝑈[0, 1]

We assume we are interested in finding the ranges of values of x1 and x2 for which Y ≤ 0.8 (Figure 11, left panel). In a real case, this threshold on Y could for instance be a performance threshold that should be met. We first take a sequence of Monte Carlo input samples Xj = (x1j, x2j) for j = 1, ..., 1000, sampling from uniform distributions over the domain of x1 and x2. As with PAWN indices (section 2), Latin hypercube sampling would offer more regular coverage of the input ranges, but the technique is applicable regardless of the sampling method. We then compute the output vector Y = (Y1, Y2, ..., Y1000) from the sequence of Monte Carlo samples.

Based on the specified threshold, we assign a binary vector Yb = (b1, b2, ..., b1000) to identify samples of interest, where bj = 1 if Yj ≤ 0.8, and bj = 0 otherwise. These Monte Carlo samples reasonably approximate the analytical boundary (Figure 11, right panel), with 126 samples being of interest (Yj ≤ 0.8).

Figure 11: Left panel: Gaussian kernel density estimate of the output distribution for Y, with a shaded area showing the region of interest on this distribution (Y ≤ 0.8). Right panel: region of interest for Y as a function of x1 and x2, approximated using 1000 Monte Carlo samples and overlaid with the analytical solution for Y = 0.8. Blue markers show samples with Y ≤ 0.8. x1 and x2 are drawn from uniform distributions over [0,1].
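The sampling and labeling steps described above can be sketched in a few lines of Python. This is a hedged illustration: `placeholder_model` is a stand-in introduced here and is not the bivariate test function of this work, so the number of samples of interest it yields will differ from the 126 reported above. The function name `label_samples` and its arguments are likewise assumptions.

```python
import numpy as np

def placeholder_model(x1, x2):
    # Stand-in only: NOT the bivariate test function used in this work.
    return x1 + x2

def label_samples(model, n_samples=1000, threshold=0.8, seed=0):
    """Draw uniform Monte Carlo samples over [0, 1]^2 and flag samples of interest."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, 2))   # rows are (x1j, x2j)
    Y = model(X[:, 0], X[:, 1])                      # output vector Y
    labels = (Y <= threshold).astype(int)            # bj = 1 if Yj <= threshold
    return X, Y, labels
```

Any model that accepts two NumPy arrays can be passed in place of the stand-in; the binary vector `labels` plays the role of Yb.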

We then apply the PRIM algorithm, passing the sequence of Monte Carlo samples and the corresponding binary output vector Yb that identifies samples of interest. We set a relatively coarse peeling parameter α = 30% for clearer visualization, and a minimum mass parameter of 100 samples. After 6 peeling steps, removing the sub-boxes b^*1, ..., b^*6 (Figure 12, left panel), the algorithm terminates with a box containing 117 samples. This box provides 83% coverage (i.e. it contains 104 of the 126 samples for which Yj ≤ 0.8) and 89% density (i.e. 104 of the 117 samples contained in the box are cases of interest, where Yj ≤ 0.8). The bounds of the box (gray lines in Figure 12, right panel) reasonably approximate the analytical solution. Depending on the preferences of the analyst, the last sub-box b^*6 could have been kept to retain higher coverage; conversely, peeling steps could be added to improve density by reducing the minimum mass parameter. As the region containing cases of interest is non-rectangular, an improvement in density would in this case come at the expense of coverage.

Figure 12: Left panel: PRIM peeling steps. Shaded boxes show the individual sub-boxes removed in sequence. Right panel: restricted ranges of the parameters x1 and x2 corresponding to the box of samples of interest identified after 6 peeling steps (gray lines), visualized in relation to the overall sampling bounds of x1 and x2.
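The peeling loop used in this example can be approximated by the following simplified sketch, which at each step removes the fraction α of samples at the lower or upper end of one input and keeps the candidate that maximizes the density of the remaining box. It is an assumption-laden illustration for continuous inputs, not the implementation used in this work (Kwakkel, 2017); the function name `prim_peel` and its return values are introduced here for illustration.

```python
import numpy as np

def prim_peel(X, labels, alpha=0.3, min_mass=100):
    """Simplified PRIM peeling (illustrative sketch, continuous inputs only).

    Returns the lower and upper box bounds, a boolean mask of the samples
    inside the final box, and the number of peeling steps performed.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=float)
    lower, upper = X.min(axis=0), X.max(axis=0)
    inside = np.ones(len(X), dtype=bool)
    n_steps = 0
    while True:
        best = None  # (density, dim, side, cut, mask) of the best candidate peel
        for dim in range(X.shape[1]):
            values = X[inside, dim]
            candidates = (
                ("lower", np.quantile(values, alpha)),
                ("upper", np.quantile(values, 1.0 - alpha)),
            )
            for side, cut in candidates:
                if side == "lower":
                    keep = inside & (X[:, dim] >= cut)
                else:
                    keep = inside & (X[:, dim] <= cut)
                n_keep = keep.sum()
                # Skip peels that violate the minimum mass or remove nothing.
                if n_keep < min_mass or n_keep == inside.sum():
                    continue
                density = labels[keep].mean()
                if best is None or density > best[0]:
                    best = (density, dim, side, cut, keep)
        if best is None:
            break  # no admissible peel left: stop
        _, dim, side, cut, inside = best
        if side == "lower":
            lower[dim] = cut
        else:
            upper[dim] = cut
        n_steps += 1
    return lower, upper, inside, n_steps
```

With 1000 samples, α = 0.3 and a minimum mass of 100, each peel retains about 70% of the current box, so roughly 1000 × 0.7^6 ≈ 118 samples remain after six peels, while a seventh peel would leave about 82 samples, below the minimum mass. This is consistent with the six peeling steps reported above. The sketch omits categorical inputs and the interactive selection of a box along the peeling trajectory.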

References

Becker, W. E., Tarantola, S., & Deman, G. (2018). Sensitivity analysis approaches to high-dimensional screening problems at low sample size. Journal of Statistical Computation and Simulation, 88(11), 2089–2110. https://doi.org/10.1080/00949655.2018.1450876

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. CERN Document Server. http://cds.cern.ch/record/2253780

Bryant, B. (2014). Sdtoolkit: Scenario Discovery Tools to Support Robust Decision Making (v2.33-1). Retrieved from cran.r-project.org/web/packages/sdtoolkit/index.html

Bryant, B., & Lempert, R. J. (2010). Thinking inside the box: A participatory, computer-assisted approach to scenario discovery. Technological Forecasting and Social Change, 77(1), 34–49. https://doi.org/10.1016/j.techfore.2009.08.002

Butler, M. P., Reed, P. M., Fisher-Vanden, K., Keller, K., & Wagener, T. (2014). Identifying parametric controls and dependencies in integrated assessment models using global sensitivity analysis. Environmental Modelling & Software, 59, 10–29. https://doi.org/10.1016/j.envsoft.2014.05.001

Dalal, S., Han, B., Lempert, R., Jaycocks, A., & Hackbarth, A. (2013). Improving scenario discovery using orthogonal rotations. Environmental Modelling & Software, 48, 49–64. https://doi.org/10.1016/j.envsoft.2013.05.013

Friedman, J. H., & Fisher, N. I. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9(2), 123–143. https://doi.org/10.1023/A:1008894516817

Groen, E. A., & Heijungs, R. (2017). Ignoring correlation in uncertainty and sensitivity analysis in life cycle assessment: What is the risk? Environmental Impact Assessment Review, 62, 98–109. https://doi.org/10.1016/j.eiar.2016.10.006

Groves, D. G., & Lempert, R. J. (2007). A new analytic method for finding policy-relevant scenarios. Global Environmental Change, 17(1), 73–85. https://doi.org/10.1016/j.gloenvcha.2006.11.006

Guivarch, C., Rozenberg, J., & Schweizer, V. (2016). The diversity of socio-economic pathways and CO2 emissions scenarios: Insights from the investigation of a scenarios database. Environmental Modelling & Software, 80, 336–353. https://doi.org/10.1016/j.envsoft.2016.03.006

Hadka, D., Herman, J., Reed, P., & Keller, K. (2015). An open source framework for many-objective robust decision making. Environmental Modelling & Software, 74, 114–129. https://doi.org/10.1016/j.envsoft.2015.07.014

Hastie, T., Tibshirani, R., & Friedman, J. (2009). 14—Unsupervised Learning. In The Elements of Statistical Learning (pp. 485–585). Springer New York. http://link.springer.com/chapter/10.1007/978-0-387-84858-7_14

Herman, J., & Usher, W. (2017). SALib: An open-source Python library for Sensitivity Analysis. The Journal of Open Source Software, 2(9). https://doi.org/10.21105/joss.00097

Jaxa-Rozen, M., Pratiwi, A. S., & Trutnevyte, E. (2021). Analysis workflow for sensitivity analysis and scenario discovery. http://doi.org/10.5281/zenodo.4201064

Joe, S., & Kuo, F. Y. (2008). Constructing Sobol Sequences with Better Two-Dimensional Projections. SIAM Journal on Scientific Computing, 30(5), 2635–2654. https://doi.org/10.1137/070709359

Kwakkel, J. H. (2017). The Exploratory Modeling Workbench: An open source toolkit for exploratory modeling, scenario discovery, and (multi-objective) robust decision making. Environmental Modelling & Software, 96, 239–250.

Kwakkel, J. H., & Haasnoot, M. (2019). Supporting DMDU: A Taxonomy of Approaches and Tools. In V. A. W. J. Marchau, W. E. Walker, P. J. T. M. Bloemen, & S. W. Popper (Eds.), Decision Making under Deep Uncertainty: From Theory to Practice (pp. 355–374). Springer International Publishing. https://doi.org/10.1007/978-3-030-05252-2_15

Kwakkel, J. H., & Jaxa-Rozen, M. (2016). Improving scenario discovery for handling heterogeneous uncertainties and multinomial classified outcomes. Environmental Modelling & Software, 79, 311–321. https://doi.org/10.1016/j.envsoft.2015.11.020

Lempert, R., Bryant, B., & Bankes, S. (2008). Comparing Algorithms for Scenario Discovery (WR-557-NSF). RAND Corporation.

Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.

Noacco, V., Sarrazin, F., Pianosi, F., & Wagener, T. (2019). Matlab/R workflows to assess critical choices in Global Sensitivity Analysis using the SAFE toolbox. MethodsX, 6, 2258–2280.

Nossent, J., Elsen, P., & Bauwens, W. (2011). Sobol sensitivity analysis of a complex environmental model. Environmental Modelling & Software, 26(12), 1515–1525. https://doi.org/10.1016/j.envsoft.2011.08.010

Oakley, J. E., & O’Hagan, A. (2004). Probabilistic sensitivity analysis of complex models: A Bayesian approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(3), 751–769. https://doi.org/10.1111/j.1467-9868.2004.05304.x

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Pianosi, F., & Wagener, T. (2015). A simple and efficient method for global sensitivity analysis based on cumulative distribution functions. Environmental Modelling & Software, 67, 1–11. https://doi.org/10.1016/j.envsoft.2015.01.004

Pianosi, F., & Wagener, T. (2018). Distribution-based sensitivity analysis from a generic input-output sample. Environmental Modelling & Software, 108, 197–207. https://doi.org/10.1016/j.envsoft.2018.07.019

Puy, A., Lo Piano, S., & Saltelli, A. (2020). A sensitivity analysis of the PAWN sensitivity index. Environmental Modelling & Software, 127, 104679. https://doi.org/10.1016/j.envsoft.2020.104679

Saltelli, A. (2002). Making best use of model evaluations to compute sensitivity indices. Computer Physics Communications, 145, 280–297.

Saltelli, A., Annoni, P., Azzini, I., Campolongo, F., Ratto, M., & Tarantola, S. (2010). Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index. Computer Physics Communications, 181(2), 259–270. https://doi.org/10.1016/j.cpc.2009.09.018

Saltelli, A., Ratto, M., Andres, T., Campolongo, F., Cariboni, J., Gatelli, D., Saisana, M., & Tarantola, S. (2008). Global sensitivity analysis: The primer. John Wiley & Sons.

Sobol, I. M. (1993). Sensitivity estimates for nonlinear mathematical models. Mathematical Modelling and Computational Experiments, 1(4), 407–414.

Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55(1–3), 271–280. https://doi.org/10.1016/S0378-4754(00)00270-6

von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416.