
3.2 Model development procedure

3.2.1 Create conceptual framework: model selection

3.2.1.6 Summary of advantages and drawbacks

A summary of the advantages and drawbacks of the selected modelling techniques is provided in Table 3.2. General drawbacks of each approach are mentioned despite the existence of several recently developed techniques that, at least partially, compensate for these weaknesses. However, most compensating techniques negatively affect the main advantages, which highlights the need for a well-balanced and carefully considered implementation.

Table 3.2: Summary of model advantages and drawbacks. An overview is provided of the five modelling techniques (decision trees (DT), generalised linear models (GLM), artificial neural networks (ANN), fuzzy logic (FL) and Bayesian belief networks (BBN)) discussed in previous subsections.

Technique: DT
Advantages:
- Transparent modelling technique;
- Able to deal with small data sets;
- Able to identify interactions between explanatory variables.

Technique: GLM
Advantages:
- […] predicting probability of occurrence with statistical substantiation.

Technique: ANN
Advantages:
- […] between predictors and response variables when knowledge on the system’s functioning is lacking.
Drawbacks:
- Acts as black box model;
- Lack of guidelines for optimal design;
- Low ecological relevance;
- Limited explanatory power.

Technique: FL
Advantages:
- Absence of strict boundary values;
- Ability to complement empirical data with expert knowledge;
- Ability to incorporate uncertainty scenarios (e.g. climate change) by […]
Drawbacks:
- Construction of knowledge-based rules is time intensive.

Technique: BBN
Advantages:
- Accounts for uncertainties explicitly;
- Ability to incorporate uncertainty scenarios (e.g. climate change) by probability approach;
- Straightforward propagation of uncertainties related to model inputs;
- Ability to complement empirical data with expert knowledge.
Drawbacks:
- Inability to implement temporal dynamics;
- Information loss due to data discretisation;
- Construction of knowledge-based rules is time intensive.

3.2.2 Data collection and exploration

Following modelling technique selection based on the abovementioned advantages and drawbacks, data needs to be collected for model training (Step 2, Table 3.1). Environmental monitoring collects information on a myriad of variables, commonly classified as explanatory and response variables. Explanatory variables typically include all variables considered to explain the observed pattern in the response variable of interest and can be biotic or abiotic. However, the majority of HSMs focuses on defining suitable abiotic conditions and thereby restricts the extent of the explanatory variable space.

Variables can be discrete or continuous, with the former representing a limited number of possibilities (e.g. five different classes of land use), while the latter is not characterised by fixed thresholds to distinguish classes (e.g. river width expressed in metres). Most HSMs aim to accurately predict habitat suitability of a single species, relying on a presence/absence statement (discrete) or a measure of abundance (continuous) as response variable. Typically, models developed with a continuous response variable tend to be more sensitive than presence-absence models, despite potential biases related to seasonality, long-term fluctuations and differing sampling techniques (Ysebaert et al., 2002).

Still, the majority of HSMs is trained with a discrete response variable, ranging from basic presence-only (PO) data to complete presence-absence (PA) data. Presence-only data sets describe the locations where a specific species has been observed, occasionally making use of records from museums or herbaria (Graham et al., 2004), though without providing any information on unsuitable conditions (Ward et al., 2009). In contrast, presence-absence data include information on species absences, yet these do not necessarily reflect effective absences. More specifically, reported absences combine true and false absences, the latter comprising species that are present without being observed (non-detectability) and species that are absent due to historical or dispersal limitations (future potential) (Anderson and Raza, 2010). These false absences negatively affect model accuracy by providing ambiguous information (Lobo et al., 2010).

Obtained data sets are rarely perfect and often contain one (or more) variables with missing values, erroneous notations, redundant variables and an unbalanced response variable (Gibert et al., 2018a). Data exploration and pre-processing are therefore crucial tools to characterise and improve the quality of the obtained data and, thereby, increase the reliability of model outcomes (Zhang et al., 2003; Zuur et al., 2010). Data exploration provides a graphical representation of a variable’s distribution, with boxplots, dotplots and histograms being frequently applied to identify potentially deviating instances (Zuur et al., 2010). Transformation or removal of these outliers are common approaches to improve data quality, yet the lack of uniform guidelines makes this step relatively subjective and open to further study.
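A minimal Python sketch of such an exploration, assuming a single hypothetical predictor ("river_width") and using the common 1.5 × IQR rule, which is only one of several possible outlier criteria:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data frame of environmental observations; the variable
# name "river_width" is only an illustrative placeholder.
rng = np.random.default_rng(42)
df = pd.DataFrame({"river_width": rng.lognormal(mean=2.0, sigma=0.6, size=200)})

# Graphical exploration: histogram and boxplot of the variable's distribution.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df["river_width"], bins=20)
axes[0].set_xlabel("river width (m)")
axes[1].boxplot(df["river_width"], vert=False)
plt.tight_layout()
plt.show()

# One common (but not unique) rule of thumb: flag values further than
# 1.5 times the interquartile range from the quartiles as potential outliers.
q1, q3 = df["river_width"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["river_width"] < q1 - 1.5 * iqr) |
              (df["river_width"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers flagged for inspection")
```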


Similarly, predictor assessment by means of a correlation index or principal component analysis (PCA) helps to identify explanatory variables that contain similar information. Such correlated variables are helpful in understanding ecological interactions and processes, yet they potentially compromise model development because they add little information relative to their computational cost. Reducing data dimensionality, either by removing correlated variables or by creating a new set of independent variables based on the PCA axes, generally supports the development of simpler and more transparent models (Guo et al., 2015; Wilson et al., 2011). The intensity of these effects depends on data characteristics and varies among modelling techniques.
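Both dimensionality-reduction options can be sketched with scikit-learn on a small set of hypothetical, partly correlated predictors; the 0.9 correlation threshold and the 95 % variance target below are arbitrary illustrative choices:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix; column names are placeholders.
rng = np.random.default_rng(0)
depth = rng.normal(1.0, 0.3, 300)
X = pd.DataFrame({
    "depth": depth,
    "velocity": 0.8 * depth + rng.normal(0, 0.05, 300),  # strongly correlated with depth
    "temperature": rng.normal(15, 2, 300),
})

# Option 1: drop one variable of each highly correlated pair.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Option 2: replace the predictors by uncorrelated principal components
# that together retain (here) 95 % of the total variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(StandardScaler().fit_transform(X))
print("dropped:", to_drop, "| PCA components kept:", components.shape[1])
```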

Aside from the abovementioned pre-processing, additional changes are potentially required prior to model training. For instance, ANN requires predictors to be rescaled to a predefined interval, typically from 0 (or -1) to 1, in order to make reliable predictions. Without this rescaling, predictors with an extensive range can exert a disproportionately high influence, which can even be artificially altered by changing the measurement unit. Similarly, balancing of the response variable within the training data is highly recommended for DTs in order to avoid model preference towards the class with the highest frequency. To this end, a balanced ratio can be obtained via (i) random subsampling of the class(es) with higher abundance (Araújo and Guisan, 2006), (ii) oversampling of the class(es) with lower abundance or (iii) a combination of both (see Figure 3.6).
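The rescaling step can be sketched as follows, assuming two hypothetical predictors and using scikit-learn's MinMaxScaler to map each predictor onto [0, 1]:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical predictors on very different scales; names are placeholders.
X_train = pd.DataFrame({
    "river_width_m": [2.5, 40.0, 12.3, 75.0],
    "ph": [6.8, 7.2, 7.9, 6.5],
})

# Rescale each predictor to the [0, 1] interval based on the training data;
# the same fitted scaler must later be applied to any new data.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
print(X_train_scaled)
```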

Figure 3.6: Balancing of data describing the occurrence of a non-specified organism. Occurrence was assessed at 170 locations, resulting in 120 absence statements and 50 presence statements. A balanced dataset is created by A: randomly omitting data related to the absence of the organism (subsampling); B: randomly duplicating data related to the presence of the organism (oversampling); or C: applying a combination of subsampling and oversampling.
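The three balancing strategies of Figure 3.6 can be sketched as follows, assuming the 170 records are simply identified by row indices; the 85-record midpoint used for strategy C is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Indices of the 120 absence and 50 presence records from Figure 3.6.
absence_idx = np.arange(120)
presence_idx = np.arange(120, 170)

# A: subsampling - randomly omit absences until both classes hold 50 records.
sub_absence = rng.choice(absence_idx, size=50, replace=False)

# B: oversampling - randomly duplicate presences until both classes hold 120 records.
over_presence = rng.choice(presence_idx, size=120, replace=True)

# C: combination - meet halfway, e.g. 85 records per class.
combo_absence = rng.choice(absence_idx, size=85, replace=False)
combo_presence = rng.choice(presence_idx, size=85, replace=True)

print(len(sub_absence), len(over_presence), len(combo_absence), len(combo_presence))
```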

3.2.3 Model application

A variety of parameters is linked with model development, distinguishing between algorithm parameters (or ‘hyperparameters’) and model parameters (e.g. regression coefficients). Hyperparameters are often subjectively defined prior to model training and remain unaltered regardless of the provided data, while model parameters are an intrinsic element of the final model and highly dependent on the training data.
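The distinction can be illustrated with a small, purely illustrative scikit-learn example, here a logistic regression (a GLM) in which the regularisation strength C acts as hyperparameter and the fitted coefficients as model parameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameter: fixed before training and unaffected by the data
# (here the regularisation strength C).
model = LogisticRegression(C=1.0)

# Model parameters: estimated from the training data during fitting
# (here the regression coefficients and intercept).
model.fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```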

Subjective selection of hyperparameter settings can affect model performance drastically, hence preliminary optimisation is highly recommended. Ideally, all potential hyperparameter combinations are tested to identify the best-performing one(s), though their number tends to increase exponentially with every additional hyperparameter considered. Alternatively, random selection of a subset of combinations (e.g. 60) provides a first overview of potential performance and identifies a starting point for further hyperparameter optimisation, while being generally faster than a traditional grid search (Bergstra and Bengio, 2012). This procedure reduces overall calculation time as it does not require all combinations to be assessed, yet it risks missing the globally optimal hyperparameter combination.
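Both strategies can be sketched with scikit-learn on a synthetic data set; the decision tree hyperparameter grid below is hypothetical, and the random search is limited to 20 of the 40 possible combinations purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical hyperparameter space for a decision tree.
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}

# Exhaustive grid search: every combination (5 x 4 x 2 = 40 here) is assessed.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)

# Random search: only a fixed number of randomly drawn combinations is assessed,
# which is faster but may miss the globally optimal combination.
rand = RandomizedSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)

print("grid search best:", grid.best_params_)
print("random search best:", rand.best_params_)
```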