
6.2 Materials and method

6.2.3 Data pre-processing techniques

With the settings inferred from the preliminary assessment, CRFs were developed to test the effect of further data pre-processing on model performance. Owing to the specific construction of the random forest algorithm, the inclusion of both outliers and correlated variables is expected to have only a limited effect (Breiman, 2001; Fox et al., 2017; Vezza et al., 2015). Nevertheless, removing incorrect and irrelevant information reduces model complexity and thereby improves model regularisation, a balance governed by the trade-off between data complexity and model complexity.

Consequently, model regularisation encompasses appropriate instance and variable selection (i.e. identifying and eliminating outliers, false absences, correlated and irrelevant variables).

6.2.3.1 Selection of instances

Detection of outliers

Practical implementation of outlier identification and removal starts with considering the original 𝑁𝑖𝑛𝑠𝑡× 𝑁𝑣𝑎𝑟 dataset (D) and creating a new, equally-dimensioned matrix O.

For each variable Xj (j ≤ Nvar), the first and third quartiles are defined (Qj,1 and Qj,3, respectively), as well as a user-specified range threshold (τo,j). Subsequently, Equation 6.1 is applied to di,j (∈ D) and an outlier dummy score (1 if considered an outlier, 0 if not) is assigned to oi,j (∈ O). Finally, outlier dummy scores are summed for each instance; an instance with an outlier score in at least one variable is considered less reliable and is therefore removed from the data.

$$
o_{i,j} =
\begin{cases}
1, & d_{i,j} < Q_{j,1} - \tau_{o,j}\,(Q_{j,3} - Q_{j,1}) \\
0, & Q_{j,1} - \tau_{o,j}\,(Q_{j,3} - Q_{j,1}) \le d_{i,j} \le Q_{j,3} + \tau_{o,j}\,(Q_{j,3} - Q_{j,1}) \\
1, & d_{i,j} > Q_{j,3} + \tau_{o,j}\,(Q_{j,3} - Q_{j,1})
\end{cases}
\quad \text{(Equation 6.1)}
$$

With di,j the value of the j-th variable of the i-th instance, oi,j the outlier dummy score of di,j, Qj,1 the first quartile of the j-th variable, Qj,3 the third quartile of the j-th variable and 𝜏𝑜,𝑗 the user-specified threshold for the j-th variable.
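The quartile-based rule of Equation 6.1 can be sketched in Python with NumPy. The function name `remove_outliers` and the default threshold τ = 1.5 (the conventional Tukey fence) are illustrative assumptions, not part of the original workflow:

```python
import numpy as np

def remove_outliers(D, tau_o=1.5):
    """Flag values outside [Q1 - tau*(Q3 - Q1), Q3 + tau*(Q3 - Q1)] per
    variable (Equation 6.1) and drop instances with >= 1 flagged value."""
    Q1 = np.percentile(D, 25, axis=0)   # Q_{j,1} for each variable
    Q3 = np.percentile(D, 75, axis=0)   # Q_{j,3} for each variable
    iqr = Q3 - Q1
    lower = Q1 - tau_o * iqr
    upper = Q3 + tau_o * iqr
    O = (D < lower) | (D > upper)       # outlier dummy matrix O
    keep = O.sum(axis=1) == 0           # retain instances with no outlier score
    return D[keep]
```

Applied to a toy matrix with one extreme value, the corresponding instance (row) is dropped while all others are retained.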

Detection of pseudo-absences

Instance selection based on false absence identification started with the separation of presences and absences. Based on the absence data set Dabs ($N_{abs} \times N_{var}$), a new, equally-dimensioned matrix A is created. For each variable Xj (j ≤ Nvar), the distribution percentiles $P_{\tau_{a,j}/2}$ and $P_{1-\tau_{a,j}/2}$ of the occupied environmental domain (i.e. the presence data set, Dpres) are defined, based on a user-specified range threshold (τa,j). Subsequently, Equation 6.2 is applied to di,j (∈ Dabs) and an absence dummy score (1 if considered a potential true absence, 0 if not) is assigned to ai,j (∈ A). Finally, absence dummy scores are summed for each instance, and instances that reach the pre-specified threshold αa (i.e. $\sum_{j=1}^{N_{var}} a_{i,j} \ge \alpha_a$) are maintained in the absence data set. This approach is visualised in Figure 6.1 for two variables, but can easily be extended to higher dimensions. Lastly, the presence and updated absence data are merged into a single data set for model training.

Figure 6.1: Illustration of the false absence concept. Each absence can be classified as a true absence, a potential true absence or a false absence relative to the occupied environmental niche. A: Position of the observed absences along two environmental gradients (X1 and X2) with respect to the observed environmental domain (light grey) and the occupied environmental domain (dark grey). B: Classification of the observed absences from (A) based on being true or false along the individual environmental gradients. The value of τa determines the extent of the occupied range in (A), while the value of αa influences which potential true absences from (B) are ultimately included in the model training data.


To assess the effect of range selection, τa,j was set to range from 0 (high degree of removal) up to 0.15 (low degree of removal) with a step size equal to 0.01, without being variable-specific (i.e. τa,j = τa). Meanwhile, αa was fixed at 1, reflecting the idea that an instance with an absence score in at least one variable is situated outside the realised environmental niche and should be kept in the data set. Hence, potential true absences are included in the resulting absence data set (see Figure 6.1) and subsequently used as training data.

$$
a_{i,j} =
\begin{cases}
1, & d_{i,j} < P_{\tau_{a,j}/2} \\
0, & P_{\tau_{a,j}/2} \le d_{i,j} \le P_{1-\tau_{a,j}/2} \\
1, & d_{i,j} > P_{1-\tau_{a,j}/2}
\end{cases}
\quad \text{(Equation 6.2)}
$$

With di,j the value of the j-th variable of the i-th instance, ai,j the absence dummy score of di,j, $P_{\tau_{a,j}/2}$ the lower percentile of the j-th variable, $P_{1-\tau_{a,j}/2}$ the upper percentile of the j-th variable and τa,j the user-specified threshold for the j-th variable.
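The percentile-based absence screening described above might be sketched as follows; the function name `filter_absences` and its default parameter values are hypothetical:

```python
import numpy as np

def filter_absences(D_pres, D_abs, tau_a=0.1, alpha_a=1):
    """Keep absences that fall outside the occupied environmental domain
    (defined by the presence data) in at least alpha_a variables."""
    lo = np.percentile(D_pres, 100 * tau_a / 2, axis=0)        # lower percentile
    hi = np.percentile(D_pres, 100 * (1 - tau_a / 2), axis=0)  # upper percentile
    A = (D_abs < lo) | (D_abs > hi)     # absence dummy matrix A
    keep = A.sum(axis=1) >= alpha_a     # potential true absences
    return D_abs[keep]
```

With αa = 1, an absence is retained as soon as a single variable places it outside the occupied range, mirroring the conservative setting used in the text.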

6.2.3.2 Selection of variables

Identification of correlated variables

Correlation-based dimensionality reduction starts by considering the original $N_{inst} \times N_{var}$ data set (D) and constructing a $N_{var} \times N_{var}$ correlation matrix C. For each variable Xj (j ≤ Nvar), the Pearson correlation coefficient with variable Xi (i ≤ Nvar) is stored in ci,j (with special cases cj,j = 1 and ci,j = cj,i). Subsequently, variable pairs with a correlation score exceeding the threshold value (τc) are identified and individually correlated with the response.

Here, the variable with the highest correlation with the response was maintained in the data set. In short, the procedure as shown by Algorithm 6.1 was applied. To assess the effect of correlation threshold selection, τc was set to range from 0.25 (high degree of removal) up to 0.95 (low degree of removal) with a step size equal to 0.05.

Algorithm 6.1: Correlation-based variable removal

    Calculate correlation matrix C from data set D
    FOR each element c in C
        IF c is greater than or equal to correlation threshold τc
            Store the unique variable-variable combination in an overall list L
        END IF
    END FOR
    Sort list L according to decreasing correlation score
    FOR each pair in list L
        Determine the correlation of each variable with the response
        Remove the variable with the lowest correlation from L and D
    END FOR
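A minimal Python sketch of Algorithm 6.1, assuming a numeric matrix `D` and response vector `y`; the function name and the use of absolute correlations are illustrative choices:

```python
import numpy as np

def remove_correlated(D, y, tau_c=0.7):
    """Drop one variable from each highly correlated pair, keeping the
    variable that correlates more strongly with the response y."""
    n_var = D.shape[1]
    C = np.abs(np.corrcoef(D, rowvar=False))     # N_var x N_var matrix
    # unique upper-triangle pairs exceeding the threshold, sorted descending
    pairs = [(C[i, j], i, j)
             for i in range(n_var) for j in range(i + 1, n_var)
             if C[i, j] >= tau_c]
    pairs.sort(reverse=True)
    removed = set()
    for _, i, j in pairs:
        if i in removed or j in removed:
            continue                             # pair already resolved
        ri = abs(np.corrcoef(D[:, i], y)[0, 1])  # correlation with response
        rj = abs(np.corrcoef(D[:, j], y)[0, 1])
        removed.add(i if ri < rj else j)         # drop the weaker variable
    keep = [k for k in range(n_var) if k not in removed]
    return D[:, keep], keep
```

Processing pairs in order of decreasing correlation ensures the most redundant pair is resolved first, matching the sort step in the pseudocode.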

Identification of irrelevant variables

The identification of irrelevant variables contrasts with the straightforward correlation-based variable selection, as it requires the development of a basic model to derive importance scores for the incorporated variables. More specifically, variable importance was derived by developing CRFs and assessing the decrease in accuracy following permutation of the variable values, with higher scores assigned to more important variables. As patterns and the type of information differed among species, model-specific importance scores were divided by the highest obtained importance score and subsequently checked against a user-specified threshold (τi). All variables with a relative importance score below the threshold were removed from the dataset (see Algorithm 6.2). To assess the effect of threshold selection, τi was set to range from 0 (low degree of removal) up to 0.5 (high degree of removal) with a step size equal to 0.05.

Algorithm 6.2: Importance-based variable removal

    Develop basic model m
    FOR each variable in D
        Derive variable importance score from m
        Calculate relative variable importance
        IF relative importance is lower than threshold τi
            Remove variable from D
        END IF
    END FOR
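Algorithm 6.2 can be sketched with scikit-learn's permutation importance. Note that a standard `RandomForestClassifier` is used here as a stand-in for the CRFs (conditional inference forests) of the original workflow, and the function name is hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def remove_irrelevant(X, y, tau_i=0.05, random_state=0):
    """Drop variables whose permutation importance, scaled by the maximum
    obtained score, falls below tau_i (cf. Algorithm 6.2)."""
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X, y)                      # basic model m
    imp = permutation_importance(model, X, y, n_repeats=10,
                                 random_state=random_state).importances_mean
    rel = imp / imp.max()                # relative variable importance
    keep = np.where(rel >= tau_i)[0]     # variables meeting the threshold
    return X[:, keep], keep
```

Scaling by the maximum score makes the threshold comparable across species-specific models, as described in the text.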