
The aim of feature selection is to identify the most appropriate subset of features. By filtering out unnecessary features, feature selection increases the classification model's accuracy while minimising processing time. Some of the benefits of using feature selection methods are listed below.

a. Simplifies the model by reducing the amount of data, lowering storage requirements, and improving visualisation.
b. Decreases training time.
c. Prevents over-fitting.
d. Increases model accuracy.
e. Avoids the curse of dimensionality.

The commonly used methods of feature selection can be divided into three groups, which are as follows.

1. Filter-based Methods
2. Wrapper-based Methods
3. Embedded Methods


4.1.1 Filter-based Methods

In this method, a score is computed for each feature using divergence, correlation, or another measure, and a threshold or filter is then applied to select or exclude features. Filter-based approaches are particularly useful for high-dimensional datasets because they are computationally cheaper than the other methods. The following are some examples:

i. Information gain
ii. Chi-square test
iii. Fisher score
iv. Correlation coefficient
v. Variance threshold
vi. Principal Component Analysis
vii. Gain ratio
viii. ReliefF
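
As a brief illustration of the filter approach above, the sketch below scores features independently of any model using scikit-learn; the synthetic dataset, the ANOVA F-score, and k = 10 are illustrative assumptions rather than the settings used in this study.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data standing in for a real high-dimensional dataset.
X, y = make_classification(n_samples=500, n_features=50, n_informative=8, random_state=0)

# Drop (near-)constant features first, then keep the 10 best-scoring ones.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X_var, y)
print(X_selected.shape)  # (500, 10)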

4.1.2 Wrapper-based Methods

Wrapper-based methods use a hold-out technique: for each candidate subset, a model is trained on the training set and the subset is then evaluated on the test set according to its error. The computational cost of these methods varies, but they are all time-consuming.

The following are some examples:

i. Genetic algorithms
ii. Recursive feature elimination
iii. Sequential feature selection algorithms (forward selection or backward elimination)
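
The sketch below illustrates the wrapper idea with scikit-learn's SequentialFeatureSelector performing forward selection; the logistic-regression estimator, the target of five features, and five-fold cross-validation are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Each candidate subset is evaluated by cross-validating the wrapped model,
# which is why wrapper methods are accurate but slow.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features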

4.1.3 Embedded Methods

The embedded method trains a machine learning model to obtain a weight coefficient for each feature, and then ranks and selects features from largest to smallest coefficient. Because the coefficients are obtained during training, the selection step itself resembles the filter technique. These methods require much less computation than wrapper methods. The following are some examples:

i. L1 (LASSO) regularization
ii. Decision-tree-based embedded method
iii. Random-Forest-based embedded method
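
As an illustration of the first example above, the sketch below uses L1-penalised logistic regression inside scikit-learn's SelectFromModel, so that features whose coefficients shrink to zero are discarded; the dataset and the regularisation strength C are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=6, random_state=0)

# The L1 penalty drives uninformative coefficients to zero during training.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lasso).fit(X, y)
print(np.flatnonzero(selector.get_support()))  # indices of the retained features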

In our study we used Principal Component Analysis (PCA), Correlation-based Feature Selection (CFS), Mutual Information-based selection, ReliefF, Chi-Squared, Recursive Feature Elimination (RFE), Select From Model, and Logistic Regression and Random Forest based embedded methods.

4.2 Feature Selection Methods

4.2.1 Chi-Squared

Chi-Squared (χ²) is a filter method that evaluates individual features by computing their chi-squared statistic with respect to the classes. In statistics, the χ² test is used to determine whether two events are independent.
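
The sketch below computes the chi-squared statistic of each feature against the class labels with scikit-learn; note that chi2 requires non-negative feature values (for example counts or min-max-scaled data). The digits dataset and k = 5 are illustrative assumptions.

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)      # pixel intensities, all non-negative
scores, p_values = chi2(X, y)            # one chi-squared statistic per feature
X_top = SelectKBest(chi2, k=5).fit_transform(X, y)
print(X_top.shape)                       # (1797, 5)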


4.2.2 Mutual information (MI)

Mutual Information (MI) is another filter-based technique; it measures the amount of information that one random variable carries about another. The MI of two random variables is a non-negative value that quantifies their dependency: it is zero if and only if the two variables are independent, and higher values indicate stronger dependency/association. Unlike the correlation coefficient, MI does not require the univariate relationship to be linear.
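
The sketch below illustrates this with scikit-learn's mutual_info_classif on a purely non-linear dependency; the synthetic data are an illustrative assumption.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
informative = rng.normal(size=500)
X = np.column_stack([informative, rng.normal(size=500)])  # second column is pure noise
y = (informative ** 2 > 1).astype(int)                    # non-linear relation with column 0

# Column 0 receives a clearly positive MI score; the noise column scores near zero.
print(mutual_info_classif(X, y, random_state=0))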

4.2.3 Correlation-based feature selection (CFS)

Correlation-based Feature Selection (CFS) is a classical filter feature selection algorithm that combines a search algorithm with an evaluation function to determine the merit of a subset of features. CFS scores subsets with a heuristic that considers the usefulness of the individual features for predicting the class along with the degree of inter-correlation among them.
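
The sketch below gives one possible implementation of that heuristic (Hall's merit formula), using absolute Pearson correlations as the feature-class and feature-feature measures; it is an illustrative reading of the method, not a library API.

import numpy as np

def cfs_merit(X, y, subset):
    # Merit = k * mean|feature-class corr| / sqrt(k + k*(k-1) * mean|feature-feature corr|)
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Higher merit favours subsets whose features correlate with the class but not
# with each other; a search algorithm explores candidate subsets using this score.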

4.2.4 Principal Component Analysis (PCA)

Principal Component Analysis is a dimensionality reduction technique that reduces a dataset to a small number of dimensions while retaining the significant information. It is a statistical method for converting a large number of correlated variables into a smaller number of uncorrelated variables (the principal components).
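
The sketch below reduces the 64-dimensional digits dataset with scikit-learn's PCA; standardising first and keeping enough components to explain 95% of the variance are illustrative choices.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales

pca = PCA(n_components=0.95)                  # keep 95% of the explained variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)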


4.2.5 ReliefF

The ReliefF algorithm is a simple, efficient, and widely used method for estimating feature weights. ReliefF performs a probabilistic estimation: the weight learned for a feature approximates the difference between two conditional probabilities, namely the probability that the feature's value differs given the nearest miss and given the nearest hit. Because it exploits neighbourhood information in the same way as a nearest-neighbour classifier, ReliefF usually outperforms other filter-based approaches.
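
The sketch below implements a simplified version of the basic Relief update rule (one nearest hit and one nearest miss per sampled instance) to make the weight update concrete; full ReliefF averages over k neighbours per class and handles multi-class problems, so this is an illustration rather than the complete algorithm.

import numpy as np

def relief_weights(X, y, n_iterations=100, random_state=0):
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iterations):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to every instance
        dist[i] = np.inf                      # exclude the sampled instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # nearest same-class neighbour
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # nearest other-class neighbour
        # A feature gains weight when it separates the classes (differs at the miss)
        # and loses weight when it varies within a class (differs at the hit).
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iterations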

4.2.6 Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a wrapper method that selects features by recursively considering smaller and smaller sets of features, guided by an external estimator that assigns weights to features (e.g., the coefficients of a linear model). First, the estimator is trained on the initial set of features and the importance of each feature is obtained through either a coefficient attribute or a feature-importance attribute. The least important features are then pruned from the current set. The procedure is repeated recursively on the pruned set until the desired number of features to select is reached.
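
The sketch below applies scikit-learn's RFE around a linear support-vector classifier; the estimator, the step size of one feature per round, and the target of five features are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=25, n_informative=5, random_state=0)

# Each round refits the estimator and prunes the lowest-weighted feature.
rfe = RFE(LinearSVC(C=1.0, max_iter=5000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.get_support(indices=True))  # indices of the features kept
print(rfe.ranking_)                   # rank 1 marks a selected feature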

4.2.7 Embedded Methods – Variable Importance Based Methods

To remove redundant features from high-dimensional datasets, algorithms such as Select-K-Best based on Logistic Regression, or embedded methods based on Decision Trees or Random Forests, can be used. These techniques are typically based on the following approaches.


• Permutation-based algorithm: this algorithm permutes each variable in turn and examines the resulting loss of accuracy on the test data. The strategy belongs to the wrapper family of algorithms, so it can be applied to any learning algorithm.

• Node-split-based algorithm: in the Random Forest algorithm, importance is derived from the quality of the node splits. It is similar to Gini importance, the difference being that depth importance also takes into account the position of the node within the trees.
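
The sketch below combines both ideas with scikit-learn: an embedded selection from Random Forest importances via SelectFromModel, followed by permutation importance measured on held-out data; the synthetic dataset, the train/test split, and the forest settings are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Embedded selection: keep features whose impurity-based importance exceeds the mean.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
selector.fit(X_train, y_train)
print(selector.get_support(indices=True))

# Permutation importance: shuffle each feature on the test set and measure the accuracy drop.
perm = permutation_importance(selector.estimator_, X_test, y_test, n_repeats=10, random_state=0)
print(perm.importances_mean.round(3))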
