
Predicting Head and Neck Cancer in Patients using Epigenomics Data and Advanced Machine Learning Methods

By Virali Oza

A thesis submitted in partial fulfilment of the requirements for the degree of

Master of Science (MSc) in Computational Sciences

The Faculty of Graduate Studies
Laurentian University
Sudbury, Ontario, Canada

© Virali Oza, 2021


THESIS DEFENCE COMMITTEE/COMITÉ DE SOUTENANCE DE THÈSE
Laurentian University/Université Laurentienne

Faculty of Graduate Studies/Faculté des études supérieures

Title of Thesis / Titre de la thèse: Predicting Head and Neck Cancer in Patients using Epigenomics Data and Advanced Machine Learning Methods

Name of Candidate / Nom du candidat: Oza, Virali

Degree / Diplôme: Master of Science

Department/Program / Département/Programme: Computational Sciences

Date of Defence / Date de la soutenance: July 28, 2021

APPROVED/APPROUVÉ

Thesis Examiners/Examinateurs de thèse:

Dr. Kalpdrum Passi (Supervisor/Directeur(trice) de thèse)

Dr. Ratvinder Grewal (Committee member/Membre du comité)

Dr. Mazen Saleh (Committee member/Membre du comité)

Approved for the Office of Graduate Studies / Approuvé pour le Bureau des études supérieures

Tammy Eger, PhD
Vice-President Research (Office of Graduate Studies) / Vice-rectrice à la recherche (Bureau des études supérieures)

Dr. Gulshan Wadhwa (External Examiner/Examinateur externe)

Laurentian University / Université Laurentienne

ACCESSIBILITY CLAUSE AND PERMISSION TO USE

I, Virali Oza, hereby grant to Laurentian University and/or its agents the non-exclusive license to archive and make accessible my thesis, dissertation, or project report in whole or in part in all forms of media, now or for the duration of my copyright ownership. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also reserve the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that this copy is being made available in this form by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.


Abstract

Epigenomics is the field of biology dealing with modifications of the phenotype that do not alter the DNA sequence of the cell. The word combines epi and genomics, where the Greek prefix epi- refers to features that are "on top of" or "in addition to" the traditional genetic basis for inheritance. In effect, such marks sit on top of the DNA and modify its characteristics, silencing or activating particular DNA behaviours without changing the sequence itself. Such modifications occur in cancer cells and contribute to cancer development. Head and neck squamous cell carcinoma (HNSCC) is one of the leading causes of cancer death, accounting for more than 650,000 cases and 330,000 deaths yearly throughout the world. Males are more frequently affected than females, with male-to-female ratios ranging from 2:1 to 4:1. Four different types of data are used in this research to predict cancerous cells in HNSCC patients, namely methylation, histone, human genome and RNA-Seq data. Nine feature selection methods and ten classifiers were used in this study. All data were obtained through open-source technologies in R. The data were processed to produce features, and fine-tuned models were used to predict head and neck cancer using statistical analysis and advanced machine learning techniques. In addition, with the help of cluster analysis and variable importance measures, we identified the top 50 features that are important in the prediction of cancerous cells in HNSCC.

Keywords: Epigenomics, DNA methylation, Histone, Human Genome, RNA, Feature Selection, Classifiers, Cluster Analysis


Acknowledgments

I would like to express my deepest gratitude and appreciation to my supervisor, Dr. Kalpdrum Passi, who gave me the opportunity and every possible support to make this research successful. I believe I have had the finest supervisor, who stood by me throughout my educational journey, was always available to clear my doubts, and handled all my queries with patience.

I also want to thank my family and all my friends who supported and encouraged me throughout this time. The journey outside my home country would never have been enjoyable without them.

Finally, and most significantly, I offer special thanks to my parents for the financial and moral support they provided by trusting in my calibre.


Table of Contents

Abstract
Acknowledgments

Chapter 1. Introduction
1.1 Head and Neck Cancer
1.2 Cancer risk factors
1.3. Cancer staging
1.4. Significance of the Bioinformatics Study
1.5 Objectives of the Study

Chapter 2. Literature Review

Chapter 3. Data Preparation
3.1 Dataset Selection
3.2 Data Processing
3.2.1 DNA Methylation Data
3.2.2 Histone Data
3.2.3 Human Genome Data
3.2.4 RNA-Seq Data

Chapter 4. Feature Selection and Machine Learning Methods
4.1 Feature Selection Methods
4.1.1 Filter-based Methods
4.1.2 Wrapper-based Methods
4.1.3 Embedded Methods
4.2 Feature Selection Methods Used in the Study
4.2.1 Chi-Squared
4.2.2 Mutual Information (MI)
4.2.3 Correlation-based feature selection (CFS)
4.2.4 Principal Component Analysis (PCA)
4.2.5 ReliefF
4.2.6 Recursive Feature Elimination (RFE)
4.2.7 Embedded Methods – Variable Importance based methods
4.3 Classification Methods
4.3.1 Decision Trees
4.3.2 Random Forests
4.3.3 Support Vector Machines
4.3.4 Naive Bayes
4.3.5 K-Nearest Neighbors
4.3.6 Multilayer Perceptron (Neural Network)
4.3.7 Artificial Neural Network
4.3.8 Recurrent Neural Network
4.4 Model assessment

Chapter 5. Results and Discussion
5.1 Model Comparison using 5-fold Cross Validation
5.1.1 Correlation-based Feature Selection (CFS)
5.1.2 Chi-square-based Feature Selection
5.1.3 Mutual Information (MI) based Feature Selection
5.1.4 Principal Component Analysis (PCA)
5.1.5 ReliefF Feature Selection
5.1.6 Recursive Feature Elimination (RFE) Feature Selection
5.1.7 Embedded Method with Logistic Regression Feature Selection
5.1.8 Embedded Method with Random Forest Feature Selection
5.1.9 Select K Feature Selection
5.3 Grid-search based Hyper-Parameter Tuning of the Models with Train-Test Split
5.3.1 Correlation-based feature selection (CFS)
5.3.2 Chi-squared-based Feature Selection
5.3.3 Mutual Information (MI) based Feature Selection
5.3.4 Principal Component Analysis (PCA)
5.3.5 ReliefF
5.3.6 Recursive Feature Elimination (RFE)
5.3.7 Embedded Method with Logistic Regression
5.3.8 Embedded Method with Random Forest
5.3.9 SelectK
5.4 Final Best Features Selection using Correlation and Cluster Analysis
5.4.1 CFS Feature Selection
5.4.2 Chi-square Feature Selection
5.4.3 Mutual Information Feature Selection
5.4.4 PCA Feature Selection
5.4.5 ReliefF Feature Selection
5.4.6 RFE Feature Selection
5.4.7 Embedded Method with Logistic Regression
5.4.8 Embedded Method with Random Forest
5.4.9 Select K Feature Selection
5.5 Discussion
5.6 Statistical Analysis of Results

Chapter 6. Conclusions
6.1 Conclusions
6.2 Future Work

References


List of Tables

Table 1 Stages of cancer
Table 5.1 Feature Selection Techniques implemented in the research
Table 5.2 Machine Learning Algorithms implemented in the research
Table 5.3 Evaluation Metrics used in the research
Table 5.4 Comparison of Models and Feature Selection Techniques
Table 5.5 Model Comparison based on CFS Feature Selection method
Table 5.6 Model Comparison based on Chi-Square Feature Selection method
Table 5.7 Model Comparison based on Mutual Information Feature Selection method
Table 5.8 Model Comparison based on PCA Feature Selection method
Table 5.9 Model Comparison based on ReliefF Feature Selection method
Table 5.10 Model Comparison based on RFE Feature Selection method
Table 5.11 Model Comparison based on Embedded Method with Logistic Regression Feature Selection method
Table 5.12 Model Comparison based on Embedded Method with Random Forest Feature Selection method
Table 5.13 Model Comparison based on SelectK Feature Selection method
Table 5.14 Best Models Comparison on reduced Features with best Feature Selection Methods
Table 5.15 List of gene locations containing the number of best features to gain best AUC and accuracy


Table of Figures

Figure 5.1 Distribution of classes in the Processed Data
Figure 5.2 Comparison of models for the Correlation-based feature selection (CFS) method
Figure 5.3 Comparison of models for the Chi-squared based feature selection method
Figure 5.4 Comparison of models for the Mutual Information score-based feature selection method
Figure 5.5 Comparison of models for the PCA based feature selection method
Figure 5.6 Comparison of models for the ReliefF-based feature selection method
Figure 5.7 Comparison of models for the RFE based feature selection method
Figure 5.8 Comparison of models for the embedded with Logistic Regression based feature selection method
Figure 5.9 Comparison of models for the embedded with Random Forest based feature selection method
Figure 5.10 Comparison of models for the SelectK-based feature selection method
Figure 5.11 Correlation-based feature selection (CFS) Tuning Parameter Results
Figure 5.12 Chi-squared-based Feature Selection Tuning Parameter Results
Figure 5.13 Mutual Information based Feature Selection Tuning Parameter Results
Figure 5.14 Principal Component Analysis (PCA) Tuning Parameter Results
Figure 5.15 ReliefF Tuning Parameter Results
Figure 5.16 RFE Tuning Parameter Results
Figure 5.17 Embedded Method with Logistic Regression Tuning Parameter Results
Figure 5.18 Embedded with Random Forest Tuning Parameter Results
Figure 5.19 Select K Tuning Parameter Results
Figure 5.20 Comparison of Models based on CFS
Figure 5.21 Comparison of Models based on Chi-Square Method
Figure 5.22 Comparison of Models based on MI score
Figure 5.23 Comparison of Models based on PCA
Figure 5.24 Comparison of Models based on ReliefF
Figure 5.25 Comparison of Models based on RFE
Figure 5.26 Comparison of Models based on Select from LR
Figure 5.27 Comparison of Models based on Select from RF
Figure 5.28 Comparison of Models based on SelectK
Figure 5.29 Correlation among the best features selected from CFS
Figure 5.30 Variable Importance among the best features selected from CFS
Figure 5.31 Correlation among the best features selected from Chi-Square
Figure 5.32 Variable Importance among the best features selected from Chi-Square
Figure 5.33 Correlation among the best features selected from Mutual Information
Figure 5.34 Variable Importance among the best features selected from Mutual Information
Figure 5.35 Correlation among the best features selected from PCA
Figure 5.36 Variable Importance among the best features selected from PCA
Figure 5.37 Correlation among the best features selected from ReliefF
Figure 5.38 Variable Importance among the best features selected from ReliefF
Figure 5.39 Correlation among the best features selected from RFE
Figure 5.40 Variable Importance among the best features selected from RFE
Figure 5.41 Correlation among the best features selected from Select from Model (Logistic Regression)
Figure 5.42 Variable Importance among the best features selected from Select from Model (Logistic Regression)
Figure 5.43 Correlation among the best features selected from Select from Model (Random Forest)
Figure 5.44 Variable Importance among the best features selected from Select from Model (Random Forest)
Figure 5.45 Correlation among the best features selected from SelectK
Figure 5.46 Variable Importance among the best features selected from SelectK
Figure 5.47 QQ plot Normality check for the distribution


Chapter 1

Introduction

1.1 Head and Neck Cancer

Head and neck squamous cell carcinoma (HNSCC) is a widespread heterogeneous malignancy that accounts for 500,000 new cases globally per year and involves cancers of the oral cavity, oropharynx, nasopharynx, hypopharynx, and larynx [1,2]. Traditional HNSCC treatment involves surgery, radiotherapy and chemotherapy, used alone or in combination, depending on the stage of the tumour and the primary site [3]. Clinical treatment responses differ greatly among patients with HNSCC and remain disappointing, especially in advanced-stage disease [4]. Moreover, chemotherapy and radiation confer major toxicity [3].

Therefore, defining biomarkers from which to choose the right care plan for each patient is important. More than a decade ago, several studies showed a causal correlation between the development of HNSCC and human papillomavirus (HPV) infection [5,6,7]. HPV-positive (HPV+) HNSCC has been associated with a better prognosis and response to treatments, including immunotherapy, compared to HPV-negative (HPV-) disease. Present efforts are focused on explicitly designing therapeutic options for this subtype; for example, HPV+ oropharyngeal cancers are now being considered for de-intensified treatment regimens [7].


Worldwide, head and neck squamous cell carcinoma (HNSCC) accounts for about 300,000 deaths per year [8]. The key risk factors for the development of HNSCC are smoking, alcohol, and high-risk human papillomavirus (HPV) infections. HPV-associated HNSCC makes up about 25% of the cases recorded worldwide, with an even higher proportion among oropharyngeal cancers, and among those cases the HPV-16 and HPV-18 types of infection predominate.

Historically, squamous cell carcinoma of the head and neck has been linked with tobacco and alcohol use, with a growing proportion of head and neck tumours, especially in the oropharynx, also associated with human papillomavirus (HPV). Recurrent/metastatic disease carries a poor prognosis, and there is an unmet need for the development of biomarkers for early disease detection, specific prognosis assessment, and efficient selection of therapy.

Furthermore, epidemiological, molecular pathology and cell line evidence suggest that a large percentage of oropharyngeal cancers are sexually transmitted and causally associated with high-risk human papillomavirus (HPV), especially type 16 [9,10,11]. A distinct biological and clinical entity, HPV-associated oropharyngeal cancers (HPV-OSCCs) have a distinct mutational landscape and are distinguished by significantly enhanced survival [12]. The majority of patients with HNSCC present with locoregionally advanced (LA) disease and are managed with a multimodality clinical approach. The 5-year progression-free survival (PFS) of HPV-negative patients with LA disease is ~40-50 percent, despite improvements in diagnosis, care and monitoring, and the survival rates for recurrent/metastatic (R/M) disease have not improved substantially in recent years.

Low HNSCC-related survival rates are attributed in part to early diagnostic failure. In fact, only one-third of HNSCC patients are diagnosed early on [13]; early diagnosis is lacking primarily due to a lack of proper screening. As per the National Cancer Institute (NCI), biomarkers are defined as "a biological molecule found in the blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a medical condition." A biomarker can also be used to assess how well the body responds to a treatment for a disease or condition [14].

Basically, biomarkers are valuable instruments that aid diagnosis, indicate the likely course of the disease, and predict therapeutic response. Thus, finding a biomarker for head and neck squamous cell carcinoma (HNSCC), which comprises a heterogeneous category of malignancies that arise in the oral cavity, larynx and pharynx, is required [15]. Regarding HNSCC, although it has been proposed that certain biomarkers have a potential effect on diagnosis and prognosis, few have been confirmed for use in clinical practice [16].

Cancer is a disorder that develops as a result of genetic and epigenetic changes. Recently, epigenetics has become a rapidly growing biological field. Existing research shows that epigenetic modification has a major impact on gene expression, as it is a reversible change on DNA or histones that affects gene expression without changing the DNA sequence. DNA methylation and histone modification are two of the most well-known epigenetic changes: DNA methylation is a biological process of adding methyl groups to the DNA molecule, while a post-translational modification (PTM) of histone proteins is known as a histone modification. By modifying chromatin structure or recruiting histone modifiers, PTMs to histones can affect gene expression [17]. When a gene/DNA expresses itself to form proteins, this process is called gene expression. DNA controls the activity of the cell by forming proteins in two major steps, transcription and translation. Transcription is the process in which the coded message from DNA is copied to mRNA, while translation is the process in which mRNA is decoded to form proteins. We found that computational models that use epigenetic data to predict gene expression with high accuracy are uncommon. Therefore, experiments are needed to build a model which can help find important features that affect HNSCC.

The Cancer Genome Atlas (TCGA) contains several large-scale genomic profiling experiments and is a paramount resource for data on more than 30 types of cancer. Its diverse omics layers, such as RNA sequencing, methylation, miRNA, proteomic, therapeutic, copy number variation and mutation data, can be studied in various ways, including statistical simulation, statistics and machine learning.

1.2 Cancer risk factors

There are different risk factors for cancer. These are:

i. Inherited gene mutations, which cause only 5-10% of all cancers.

ii. The remaining 90-95% of all cancers, which result from environmental factors and lifestyle patterns, including:

Tobacco smoking
Alcohol consumption
Diet (excessive consumption of junk food and red meat)
Obesity
Environmental pollutants
Physical inactivity
Exposure to sun

iii. Some biological agents, which can cause cancer.


1.3. Cancer staging

Two agencies, the Union Internationale Contre le Cancer (UICC) and the American Joint Committee for Cancer Staging and End Results Reporting (AJCCS), introduced the international TNM classification system to describe the stage of disease [18]. Table 1 shows the stages of cancer, where T represents the primary tumour, N represents local lymph node involvement and M represents the presence or absence of metastasis.

Table 1. Stages of cancer

Stage | Classification | Location                                | Nodal spread                         | Treatment                         | Survival chances
I     | T1N0M0         | Limited to origin                       | Not spread                           | Operable                          | 70-90%
II    | T2N1M0         | Spread to surroundings                  | Spread                               | Removable, but may not completely | 45-55%
III   | T3N2M0         | Spread and fixed in tissues up to 3 cm  | Spread and fixed deeper              | Removable, but not completely     | 15-25%
IV    | T4N3M+         | Spread and fixed in tissues up to 10 cm | Spread, fixed deeper, and metastasis | Not operable                      | Below 5%


1.4. Significance of the Bioinformatics Study

The main significances are:

To raise the performance of models
To predict sequences
To achieve a better range of detection
To provide quick models at low cost
To take data from the basic level to a higher level

1.5 Objectives of the Study

The main goal of this research is to study epigenomics data using gene expression data from head and neck cancer (HNSCC) patients. Using multiple classifiers and feature selection approaches, this research proposes to create a more reliable model. The key objective is to assess and predict HNSCC by machine learning techniques and find the best model(s). It is also an objective to find the most relevant genes that affect HNSCC through clustering the features. Nine different feature selection methods (Correlation-based feature selection (CFS), Chi-squared-based feature selection, Mutual Information based feature selection, Principal Component Analysis (PCA), ReliefF, Recursive Feature Elimination (RFE), Embedded Method with Logistic Regression, Embedded Method with Random Forest, and Select K features based on highest scores) and ten different classifiers (Naïve Bayes, K-Nearest Neighbour (KNN), Support Vector Machine (SVM) with Polynomial Kernel and with Radial Kernel, Decision Tree, Random Forest, XGBoost Linear, Multi-Layer Perceptron (MLP), Artificial Neural Network (ANN) as a fully connected neural net (FCN), and Recurrent Neural Network (RNN)) are used to accomplish this objective.

The features were chosen from the raw data with 5-fold cross-validation on training and validation data. Accuracy, AUC and the confusion matrix are the key performance metrics used to assess the models. Moreover, cluster analysis was done to visualize the features and obtain better results. Statistical analysis was used to determine whether there was a significant difference between the models.


Chapter 2

Literature Review

Head and neck squamous cell carcinoma (HNSCC) is the 6th most common type of cancer [19] and is associated with 650,000 new cases and 330,000 deaths annually worldwide [20][21].

The majority of HNSCC cases are Oral Squamous Cell Carcinomas (OSCC) [22]; more than 90% of HNSCC cases are OSCC [23]. The incidence rates (mainly of OSCC (Collaboration, 2019)) are higher in South Asian countries such as India [24], Bangladesh (Collaboration, 2019), and Pakistan [25] as compared to other parts of the world. There are several known risk factors for HNSCC, such as chewing tobacco, smoking cigarettes, excessive alcohol consumption [26] and oncogenic viruses such as human papillomavirus (HPV) [27].

Additionally, epigenetic regulation, mutation, copy number variation (CNV) and immune host response also play a key role in carcinogenesis [28]. Despite current advancements in cancer diagnosis and treatment, the overall 5-year survival rate in HNSCC is less than 50% due to a lack of proper diagnostic markers and targeted therapies [29]. In multiple cancers, including HNSCC, it has been well established that identification of cancer in the early stages leads to a better survival rate than detection in the later stages. According to American Joint Committee on Cancer (TNM) staging, early-stage primary tumours (TNM stages I and II) are those with a diameter of 2–4 cm and no lymph node involvement or metastasis. If the tumour is greater than 5 cm and has spread to surrounding lymph nodes only (TNM stage III) or to other areas of the body (TNM stage IV), it is called advanced (late stage) [30].


2.1 Role of miRNA

In the last few decades, a lot of new research has been done in the field of HNSCC, but no clinically significant findings have been made. Patients would benefit from a fast and correct diagnosis in a number of ways, including proper care, which would mitigate morbidity and improve treatment outcomes. Unfortunately, there is no universally accepted biomarker for HNSCC approved for clinical use, and more appropriate therapies and clinically important biomarkers to stratify patients with HNSCC are urgently needed. The micro-RNAs (miRNAs) are 18–25 nucleotide long non-coding RNAs. They can regulate mRNA expression by interacting with the 3′ untranslated region (3′ UTR), leading to mRNA degradation. By virtue of their control over mRNA expression, these miRNAs have important regulatory roles, such as regulation of cell division, cell maturation, angiogenesis, proliferation, migration, invasion, metastasis, autophagy, and apoptosis [31].

However, in various diseases, especially cancers, these miRNAs can themselves become dysregulated, leading to pathological conditions [32]. A significant number of miRNAs have been well described for their biological role in cancer and their capacity to control the expression of various cancer pathways [33]. According to research, changes in the miRNA expression profile can be observed well before clinical signs occur in certain cancers [34]. These miRNAs provide a fair route to the discovery of excellent biomarkers due to their stability and ease of identification (in tissues as well as biological fluids) [35]. MiRNA expression profiles can also provide insight into underlying tumour progression and/or lead to the discovery of new therapeutic targets.

2.2 Need for pre-processing the data

The Cancer Genome Atlas (TCGA) Research Network [36] is a coordinated effort to gather, share, and analyse next-generation molecular sequencing data to improve our understanding of cancer mechanisms at the molecular level [37]. Data utilized in our analysis were obtained from the Genomic Data Commons Portal [38] and contained 528 TCGA HNSC cases, including genotyping, solid-tumour RNA expression, whole exome sequencing, methylation data, and clinical information. Only RNA expression variables and clinical data are included in this study. Clinical records include tumour grading statistics, patient demographic information, smoking/alcohol history, and other characteristics linked to disease development, such as lympho-vascular invasion and margin status. Human papillomavirus (HPV) status (based on ISH and P16 testing) was also included, as HPV status has strong implications for prognosis and tumour development [39, 40]. These data come from a variety of experiments at various universities, many of which used different platforms and assays over long periods of time. The study discussed here describes the difficulties that this common type of dataset poses in oncological research. Large, multi-institutional databases pose a number of obstacles for the development of clinical decision support approaches and tools. Specifically, TCGA-HNSC had problems with sparsity and inconsistency across many clinical data fields. Out of 15 identified clinical characteristics relevant to treatment regimen, none were populated for every patient. More specifically, of a total possible 564 cases in TCGA, 520 patients did not have tumour. In addition to the problem of missing data, several fields were populated inconsistently due to human error. Such complications required extensive pre-processing and an expert system built using domain-specific knowledge to determine whether each patient had received a specific type of therapy. Even after this pre-processing and condensing of treatment fields, issues of missing data persisted. Whether a patient had received radiotherapy and/or chemotherapy was unclear for 47% and 27% of cases, respectively. One possible technique for handling such problems is to exclude cases or variables with missing data, as was done previously with this dataset [41]. Given the importance of these attributes to our decision-support targets, as well as the finite number of cases from which to choose, we seek to optimise use of the available data by imputing missing values. Molecular datatypes are often extremely high-dimensional, so feature selection and dimensionality reduction techniques are necessary steps when utilizing such data to best employ available computational resources.

There are several strategies for selection and dimensionality reduction, including feature filtering, feature transformations, and wrapper methods such as sequential selection [42]. In this work, feature filtering and an unsupervised sparse PCA feature space transformation of approximately 20,000 solid-tumour RNA expression variables were employed and evaluated in the context of TCGA-HNSC. Methods including pre-processing, balancing of the imbalanced classes, and missing-data imputation were performed, resulting in a much more concise and usable dataset.

2.3 ML Technology

With the rapid growth of data to support its diverse applications in a variety of cancers, Machine Learning (ML) has seen an increase in popularity. ML algorithms can make more reliable and effective predictions as the amount of training data grows [43]. Mathematical simulation has long been commonly used in disease modelling, classification, and estimation of molecular structure. Thanks to advances in next-generation sequencing (NGS) technologies and the availability of large sequencing datasets, machine learning has opened new opportunities for understanding and tracking disease processes. Risk stratification, mutational frequency estimation, copy number variation (CNV) analysis, and novel target recognition are all popular applications of machine learning techniques. It has been suggested that machine learning (ML) techniques can be utilized for diagnosis and prognosis in cancers [44]. For example, ML techniques have been used to diagnose Rat Sarcoma (RAS) activation pathways in cancers using expression data, SNPs (single nucleotide polymorphisms), and CNVs (copy number variations) [45]. A multi-parametric Decision Support System (DSS) was used to forecast OSCC progression and possible relapse (local or metastatic) using a variety of heterogeneous data (clinical knowledge, genomics, and imaging data) [46].

In a report by Avissar [47] aimed at identifying miRNAs that predict the presence of HNSCC, miR-221 and miR-375 were identified as predictors. The research, however, did not consider stages or grades. A recent machine learning study used neural networks to forecast tongue cancer recurrence [48].

Kim developed predictive models for survival prediction in oral cancer patients [49]. In one study predicting the survival of oral squamous cell carcinoma with machine learning based prediction models, 80.0% of the data were randomly selected from the National Cancer Data Base (NCDB) for training. The classification models included a 2-class decision forest, 2-class decision jungle, 2-class logistic regression, and 2-class neural network. Amongst all four, decision forest classification was the most robust, with an AUC of 0.80 and accuracy of 71% [50]. Another report stratifies the clinical stages of HNSCC patients and predicts risk through miRNA expression profiles using machine learning algorithms such as Random Forest (RF), Support Vector Machine with Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM), where Adaptive Boost performed best with AUC = 0.929 and accuracy = 0.872 [30].

Machine learning methods play a pivotal role in decision making that helps cancer diagnosis and analysis. Asri et al. [51] predicted breast cancer risk and compared the performance of various machine learning methods, including Support Vector Machine (SVM), Decision Tree, Naive Bayes (NB) and k-Nearest Neighbours (KNN). All the experiments were performed in WEKA, where the SVM model gave the best accuracy of 93.17% with the minimum error rate.

Khalilia et al. [52] predicted eight different disease risks, including breast cancer, from imbalanced data with the help of a Random Forest classifier. In their study, they found that Random Forest (RF) performed very well in comparison with Support Vector Machine (SVM). They used National Inpatient Sample (NIS) data, which is highly imbalanced, and as a solution they used an ensemble learning method that relied on repeated random subsampling. This method splits the training data into several sub-samples while ensuring that each one is completely balanced. They then compared the performance of RF and SVM with the subsampling. RF provided the best results in predicting the eight disease risks, with an average AUC of 88.79%.

Parmar et al. [53] performed radiomics analysis for predicting overall survival in patients with head and neck cancer (HNSCC) on two HNSCC cohorts of 196 patients using machine learning techniques. Different feature selection methods were used, such as ReliefF, T-score, Chi-square, Gini index, and mutual information. After feature selection, various classifiers were applied, namely decision trees, neural networks, random forests, and support vector machines, using 10-fold cross-validation. Their results show that mutual information based feature selection with neural networks and random forests gave the best AUC scores.

The identification of late stages often involves cumbersome examination and invasive tests; therefore, such studies could be immensely useful in accomplishing better diagnosis and clinical outcomes. Such a strategy can also be extended to cancers that are difficult to reach and detect, such as pancreatic cancer. It has to be remembered that the clinical stage is known to play an important role in the overall survival of patients. These studies have led to the identification of signatures that can efficiently stratify HNSCC patients into early and late clinical stages as well as stratify patients' risk, which may help in the decision of treatment regimens, e.g., whether it is effective to use a less aggressive treatment rather than highly toxic therapies.

HNSCC literature often focuses on the association of the regulation of specific genes with prognosis [54, 55]. Other groups, however, acknowledge the need for large-scale integrative analysis to capture potential novel biomarkers [56–58]. In other cancers, unsupervised transformations of molecular data (e.g., RNA sequencing, DNA methylation, miRNA sequencing) are known to be useful in machine learning-based survival prediction [59, 60]. HNSCC has received little attention in this regard. Similarly, there is little literature available on machine learning imputation of sparse clinical records.


Chapter 3

Data Preparation

3.1 Dataset Selection

To acquire the head and neck squamous cell carcinoma (HNSCC) data, we downloaded all of the required datasets from The Cancer Genome Atlas (TCGA) Data Portal (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) using R and the open-source library RTCGA. For data integration, TCGA provides clinical data, genomic characterization data, and high-level gene analysis of tumour genomes, ranging from DNA methylation, RNA and mRNA sequences to histone modification data with epigenetics, human genome data, and clinical data of patients. Academic researchers and scientists can use The Cancer Genome Atlas (TCGA) Data Portal to browse, import, and analyse datasets provided by TCGA.

Understanding genomics is useful for strengthening cancer treatment, as it helps establish which genetic and epigenetic alterations cause the cancer so that appropriate therapies can be discovered. RTCGA is an open-source R package distributed through Bioconductor (https://www.bioconductor.org/). It serves as an interface for combining ChIP-seq and RNA-seq or microarray-based gene transcription and histone modification data, and methods for data pre-processing are included in the package. TCGAbiolinks, SummarizedExperiment, ShortRead, Rsubread, BSgenome.Hsapiens.UCSC.hg19, EnsDb.Hsapiens.v75, DESeq2 and other packages were also used. Apart from providing an interface to TCGA in R, the RTCGA package also helps researchers convert The Cancer Genome Atlas data into a format that can be used in the R statistical environment.
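As an illustration, the minimal R sketch below shows how such a download might look with the RTCGA family of packages; the data objects (HNSC.rnaseq, HNSC.clinical) come from the RTCGA data packages, while the destination directory is an assumption.

# Minimal sketch: pulling TCGA HNSC data with RTCGA (Bioconductor)
library(RTCGA)
library(RTCGA.clinical)    # bundled clinical data snapshots
library(RTCGA.rnaseq)      # bundled RNA-seq expression snapshots

# Expression values for the HNSC cohort, one row per sample
rnaseq <- expressionsTCGA(HNSC.rnaseq)

# Clinical extract (survival-style columns) for the same cohort
clinical <- survivalTCGA(HNSC.clinical)

# Raw archives can also be fetched directly from the TCGA mirror
downloadTCGA(cancerTypes = "HNSC", destDir = "data/tcga_hnsc")  # path is illustrative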

3.2 Data Processing

3.2.1 DNA Methylation Data

Methylation data from Illumina's Infinium HumanMethylation450 BeadChip (Illumina 450k) was collected from TCGA, The Cancer Genome Atlas. 5-methylcytosine (5mC), also known as the fifth base, is formed when a methyl group is attached to the fifth position of genomic cytosine. It is a well-studied epigenetic mark [61]. In mammals, 5mC occurs mostly in CpG contexts, but it is also found in CHG and CHH contexts in other species, where H is an A, C, or T. About 60% to 80% of CpGs are methylated in different tissues, indicating that 5mC is widespread. The genomic annotations for CpGs as well as exons from the coding regions were derived from an Illumina annotation document. Since the annotation document provided only transcript-level data, we re-annotated the protein-coding features using the exons and coding DNA sequences (CDS), eventually deriving detailed annotations for the various transcript regions: all introns (with separate first and last intron classes), the non-translated positions in the 5' and 3' directions (5' UTR and 3' UTR, respectively), the first and last exons, and a "single exon" or "single intron" class for transcripts with only one exon or one intron.

3.2.2 Histone Data

The data for head and neck squamous cell carcinoma (HNSCC) were taken from the UCSC Genome Browser (http://genome.ucsc.edu), a joint initiative with the ENCODE project. The histone ChIP-seq data of three marks, H3K4me3, H3K27me3, and H3K36me3, in the FaDu and Detroit360 cell lines were used as a dataset. Raw ChIP-seq data were downloaded from the large set of histone marker data. Using a custom R script, the data was processed to ensure consistent standardisation across all platforms.
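As one possible way of turning such ChIP-seq data into per-gene features, the hedged sketch below counts H3K4me3 peaks over gene promoters; the peak file name and the promoter window size are assumptions, and the thesis's own custom script need not match this approach.

library(rtracklayer)                        # import() reads BED peak files
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)  # hg19 gene models

# H3K4me3 peak calls for one cell line (hypothetical file name)
peaks <- import("H3K4me3_FaDu_peaks.bed")

# 2 kb windows centred on transcription start sites
txdb  <- TxDb.Hsapiens.UCSC.hg19.knownGene
proms <- promoters(genes(txdb), upstream = 1000, downstream = 1000)

# One feature per gene: number of H3K4me3 peaks overlapping its promoter
h3k4me3_feature <- countOverlaps(proms, peaks)
head(h3k4me3_feature)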

3.2.3 Human Genome Data

We used Bioconductor's open-access library BSgenome.Hsapiens.UCSC.hg19 to retrieve nucleotide composition data from the hg19 genome assembly, originally obtained from the UCSC Genome Browser, which hosts genomes for vertebrates, including primates and placental mammals. The data was processed in a custom R script using Rsubread along with other open-source packages.
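A minimal sketch of retrieving nucleotide composition with this library is shown below; the genomic coordinates are illustrative only.

library(BSgenome.Hsapiens.UCSC.hg19)
library(Biostrings)

# Pull a genomic window from hg19 (coordinates are an assumption)
seq <- getSeq(Hsapiens, "chr1", start = 1000000, end = 1001000)

# Base counts and GC content as simple sequence-derived features
letterFrequency(seq, letters = c("A", "C", "G", "T"))
letterFrequency(seq, letters = "GC", as.prob = TRUE)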

3.2.4 RNA-Seq Data

RNA-Seq gene expression data from head and neck cancer samples with paired CpG methylation data were downloaded from the TCGA Research Network through RTCGA. The DESeq2 package in R was used to conduct the differential expression analysis. A gene's expression was classified as up-regulated or down-regulated when it crossed two thresholds: an absolute log2 fold change (logFC) greater than 0.5 and an adjusted p-value less than 0.05, with an FDR cutoff of 0.01. As a result, 6794 genes were identified as being differentially expressed.
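A hedged sketch of this DESeq2 step is given below; the count matrix counts and the sample table coldata with its condition column are assumptions, while the thresholds follow the text.

library(DESeq2)

# counts: genes x samples integer matrix; coldata: sample annotations with a
# 'condition' column (e.g., tumour vs normal)
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.01)   # FDR cutoff used in the text

# Flag genes as differentially expressed: |log2FC| > 0.5 and adjusted p < 0.05
deg <- subset(res, abs(log2FoldChange) > 0.5 & padj < 0.05)
nrow(deg)   # 6794 genes met these criteria in this study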

The four datasets, namely the methylation data, histone data, human genome data and RNA-seq data, were obtained through the RTCGA library, and the data was preprocessed with the help of R packages to reduce the dimensionality so that it could provide better results in minimum time. After that, all four datasets were combined to generate one final dataset on which nine different feature selection methods were applied to get the relevant features. Those feature selection methods are: Correlation-based feature selection (CFS), Chi-squared-based feature selection, Mutual Information based feature selection, Principal Component Analysis (PCA), ReliefF, Recursive Feature Elimination (RFE), Embedded Method with Logistic Regression, Embedded Method with Random Forest, and Select K features based on highest scores. Moreover, ten classification methods were used: Naïve Bayes, K-Nearest Neighbour (KNN), Support Vector Machine (SVM) with Polynomial Kernel and with Radial Kernel, Decision Tree, Random Forest, XGBoost Linear, Multi-Layer Perceptron (MLP), Artificial Neural Network (ANN) as a fully connected neural net (FCN), and Recurrent Neural Network (RNN). We then used 5-fold cross-validation on the training data, with train-test ratios of 90:10, 80:20, 70:30 and 60:40. We also performed modelling on different subsets of the data, first taking 100% of the data, then 90%, and then 75%, 60% and 50% accordingly.


Chapter 4

Feature Selection and Machine Learning Methods

4.1 Feature Selection Methods

The aim of feature selection is to identify the most appropriate subset of features. By filtering out unnecessary features, feature selection increases the classification model's accuracy while minimising processing time. The following are some of the benefits of using feature selection methods:

a. It simplifies the model by reducing data, reducing storage, and enhancing visualisation
b. It decreases training time
c. It prevents over-fitting
d. It increases model accuracy
e. It avoids the curse of dimensionality

The commonly used methods of feature selection can be divided into three groups, which are as follows:

1. Filter-based Methods
2. Wrapper-based Methods
3. Embedded Methods


4.1.1 Filter-based Methods

In this method, a score for each feature is computed using divergence, correlation, or another measure, and then a threshold or filter is applied to select or exclude features. Filter-based approaches are particularly useful for high-dimensional datasets because they are cheaper to compute than other methods. The following are some examples:

i. Information gain
ii. Chi-square test
iii. Fisher score
iv. Correlation coefficient
v. Variance threshold
vi. Principal Component Analysis
vii. Gain ratio
viii. ReliefF

4.1.2 Wrapper-based Methods

Wrapper-based models use the hold-out technique: for each candidate subset, the model is trained on the training set, and the feature subset is then chosen based on the size of the error on the test set. The computational costs of these methods differ, but they are all time-consuming. The following are some examples:

i. Genetic algorithms
ii. Recursive feature elimination
iii. Sequential feature selection algorithms (backward or forward elimination)

4.1.3 Embedded Methods

The embedded method trains machine learning models to obtain the weight coefficients of each feature, and then selects features from largest to smallest based on the coefficients. The coefficients are obtained during training, which makes this approach similar in spirit to the filter technique, and these methods use much less computing power than wrapper methods. The following are some examples:

i. L1 (LASSO) regularization
ii. Decision tree based embedded methods
iii. Random Forest based embedded methods

Principal Component Analysis (PCA), Correlation-based feature selection (CFS), Mutual Information based selection, ReliefF, Chi-Squared, Recursive Feature Elimination (RFE), Select from Model, and Logistic Regression and Random Forest based embedded methods were all used in our study.

4.2 Feature Selection Methods Used in the Study

4.2.1 Chi-Squared

Chi-squared (χ²) is a filter method for evaluating individual features by estimating their chi-squared statistic with respect to the classes. In statistics, the χ² measure is used to determine whether two events are independent.
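For completeness, the standard form of the statistic, in terms of the observed count O_i and the expected count E_i under the independence assumption, is:

\[
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
\]

Features whose class-conditional counts deviate most from what independence would predict receive the highest scores and are retained.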


4.2.2 Mutual Information (MI)

Mutual Information (MI), a measure of the amount of information that one random variable carries about another, is another filter-based technique. The mutual information between two random variables is a non-negative value that quantifies the dependency between them. It is zero if and only if the two random variables are independent, and higher values indicate a higher degree of dependency or association. Mutual information does not require the univariate relationship to be linear.
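For completeness, the standard definition for two discrete random variables X and Y with joint distribution p(x, y) and marginals p(x) and p(y) is:

\[
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
\]

In feature selection, each feature X is scored by its mutual information with the class label Y, and the highest-scoring features are kept.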

4.2.3 Correlation-based feature selection (CFS)

Correlation-based Feature Selection (CFS) is a conventional filter-based feature selection algorithm that couples a search algorithm with an evaluation function to determine the merit of a subset of features. CFS uses a heuristic that considers the usefulness of individual features as well as the degree of inter-correlation between them to measure the quality of feature subsets.

4.2.4 Principal Component Analysis (PCA)

Principal Component Analysis is a dimension reduction technique that can be used to reduce the dimensionality of a dataset to a small number of components while still retaining the significant information. It is a statistical method for converting a large number of correlated variables into a smaller number of uncorrelated variables.


4.2.5 ReliefF

The ReliefF algorithm is a simple, efficient, and widely used method for estimating feature weights. ReliefF performs a probabilistic estimation in which the weight learned for a feature approximates the difference between two conditional probabilities: the probability that the feature's value differs given the nearest miss and given the nearest hit. Because it exploits local information in the same way as a nearest-neighbour classifier, ReliefF usually outperforms other filter-based approaches.

4.2.6 Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a wrapper method that selects features by recursively considering smaller and smaller sets of features, based on an external estimator that assigns weights to features (e.g., the coefficients of a linear model). First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through either a coefficient attribute or a feature importance attribute. The least important features are then pruned from the current set. The procedure is repeated recursively on the pruned set until the desired number of features to select is eventually reached.
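A minimal sketch of RFE with the caret package in R is given below; the feature matrix X, the label vector y and the candidate subset sizes are assumptions.

library(caret)

ctrl <- rfeControl(functions = rfFuncs,   # random-forest-based feature ranking
                   method    = "cv",
                   number    = 5)         # 5-fold cross-validation, as in the thesis

rfe_fit <- rfe(x = X, y = y,
               sizes      = c(10, 25, 50, 100),   # subset sizes to evaluate
               rfeControl = ctrl)

predictors(rfe_fit)   # the selected feature subset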

4.2.7 Embedded Methods – Variable Importance based methods

To remove redundant features from high-dimensional datasets, algorithms such as Select K best features based on Logistic Regression, Decision Trees or Random Forests can be used. These techniques are typically based on the following approaches.


• Permutation-based algorithm: this algorithm permutes the variables and examines the resulting loss of accuracy during testing. This strategy belongs to the wrapper family of algorithms and can thus be applied to any learning algorithm.

• Node-split-based algorithm: in the Random Forest variant, importance reflects the quality of the node splits. It is similar to Gini importance, except that depth importance also takes into account the position of the node in the trees.
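As a sketch of both approaches, the randomForest package in R reports a permutation-based measure alongside a node-split (Gini) measure; the feature matrix X and factor labels y are assumptions here.

library(randomForest)

# y must be a factor for classification; importance = TRUE enables the
# permutation-based measure in addition to the Gini-based one
rf_fit <- randomForest(x = X, y = y, importance = TRUE)

# MeanDecreaseAccuracy: permutation-based importance
# MeanDecreaseGini: importance based on the quality of node splits
imp <- importance(rf_fit)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ])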

4.3 Classification Methods

Before a machine learning algorithm can be used to make predictions on data, it must first learn. Learning implies that the algorithm must be trained on many data examples before making any predictions; the number of examples fed to the algorithm can be in the thousands or even millions. Once the machine learning algorithm has learned from the training features, it can be used to make predictions on other data. In a classification model like ours, where the goal is to predict whether the patient has cancer or not, the machine learning algorithm attempts to sort data into classes after learning the patterns hidden in the features. This chapter introduces the machine learning algorithms used in this research.

4.3.1 Decision Trees

Decision tree learning is a data mining method that maps observations of input data using a decision tree as a predictive model. These observations aid in reaching a conclusion about the data's target value. A classification tree is a tree model whose target takes a finite set of values, while a regression tree is a decision tree whose target takes continuous values. A decision tree does not represent a decision explicitly; rather, the resulting classification tree can be used as input for decision-making. In a decision tree model, a leaf represents a class label, and a branch represents a conjunction of features that leads to a class label.

XGBoost is a gradient boosting algorithm that uses a decision-tree-based ensemble. Gradient boosting is a supervised learning algorithm that combines the estimates of a series of simpler, weaker models (decision trees) to attempt to predict a target variable accurately. It is used for better speed and performance in settings where a single decision tree behaves as a weak learner.

4.3.2 Random Forests

A tree may be trained by splitting the source dataset into subsets based on attribute value tests. To obtain the complete decision tree model, this procedure is repeated until a subset at a node has only one value, or until splitting no longer adds value to the predictions. This technique is known as recursive partitioning, and it is the most widely used approach for training decision trees. Many extensions of the straightforward decision tree combine several decision trees; Random Forests, Rotation Forests, and Bagged Decision Trees are examples of such algorithms. There are also different algorithms for implementing decision trees. They are quite easy to understand and reason with, and they require comparatively little data preparation. An additional advantage is that, even on large datasets, the necessary computing resources are quite low.


4.3.3 Support Vector Machines

Support Vector Machines, or SVMs, are a supervised learning algorithm that provides an alternative view to logistic regression, the most basic classification algorithm. Support vector machines look for a model that separates the classes with the largest possible margin on both sides, where the samples on the margin are referred to as support vectors. Three kernels are commonly used in SVMs: linear, radial basis function (RBF), and polynomial.

4.3.4 Naive Bayes

Naive Bayes classifiers are straightforward probabilistic classifiers based on Bayes' theorem with the naive assumption that the features are independent. On an abstract level, given a vector representing some features, naive Bayes assigns a probability to each possible outcome (also known as class) given this vector; it is a conditional probability model. Based on Bayes' theorem, a credible and computable model for all the required probabilities can be built. The naive Bayes probability model can then be used to create a classifier by combining it with a decision rule, which determines which hypothesis is selected. The most common rule is to choose the class with the greatest probability, known as the maximum a posteriori (MAP) rule. The simplicity of naive Bayes classifiers is one of their advantages, and they converge faster than other models when the conditional independence assumptions are fully satisfied. One drawback of conditional independence is that it cannot appropriately model feature-to-feature relationships.


4.3.5 K-Nearest Neighbors

The K-Nearest Neighbours or KNN algorithm is a supervised learning algorithm that classifies a sample by looking at the classes of its k nearest neighbours in the training data. During training, the algorithm simply stores the input data and their classes. There are a variety of techniques for determining the distance between records: Euclidean distance can be used with continuous data, while for discrete variables another metric, such as the Hamming distance, can be used. The most prevalent distance measures, defined after this list, are the following:

Euclidean distance
Squared Euclidean distance
Manhattan distance
Chessboard or Chebyshev distance
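For two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), the standard definitions of these distances are:

\[
d_{\mathrm{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}, \qquad
d_{\mathrm{Manhattan}}(x,y) = \sum_{i=1}^{n}\lvert x_i - y_i\rvert, \qquad
d_{\mathrm{Chebyshev}}(x,y) = \max_{i}\lvert x_i - y_i\rvert
\]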

Increasing the value of k reduces the impact of noisy data on the prediction. Large values of k, on the other hand, mean that the boundaries between classes become less distinct. In general, the presence of noise reduces the algorithm's accuracy considerably. Another issue that may limit the algorithm's accuracy is irrelevant features. There are a variety of methods for selecting and scaling features to maximize accuracy; one method is to use evolutionary algorithms.


4.3.6 Multilayer Perceptron (Neural Network)

The multilayer perceptron is one kind of artificial neural network. Artificial neural networks are based on a simplified model of how humans think. They are built from layers of nodes (also called neurons), each with its own set of input and output values. Each node is turned on by an activation function, which can be triggered by a combination of inputs in a variety of ways. When the activation function fires, the neuron sends its output signal through its outgoing connections. This in turn activates the next layer of activation functions in the network, until the network output is obtained.

A multilayer perceptron is a feedforward network made up of multiple layers, each of which is connected only to the next. Every neuron in a layer has a non-linear activation function, modelled on the way neurons in the biological brain are stimulated. A multilayer perceptron has at least three layers: an input layer, an output layer, and one or more hidden layers; a network with multiple hidden layers is referred to as a deep neural network. A multilayer perceptron is trained using backpropagation: at the start of the training stage, all of the neurons' weights are set to default values, and the weights are then adjusted based on the prediction error on each training example.

4.3.7 Artificial Neural Network

The artificial neural network (ANN) is a data processing model based on the biological neuron system. To solve problems, it is made up of a vast number of strongly interconnected computing components known as neurons. It takes a non-linear approach and processes data in parallel through all nodes. A neural network is a complex, adaptive system: it is adaptive in that it can change the weights of its inputs and change its internal structure. Neural networks were created to solve problems that are simple for humans but complex for computers, such as recognizing pictures of cats and dogs or recognizing images of digits; such problems are referred to as pattern recognition.

Artificial neural networks are divided into two types: feedforward and feedback artificial neural networks. A feedforward neural network is a non-recursive network: the neurons in a layer are connected only to neurons in the next layer and do not form a cycle, so signals can pass in one direction only, towards the output layer. Feedback neural networks contain cycles: by inserting loops in the network, signals can propagate in both directions, and the network's behaviour may change over time because of the feedback loops. Feedback neural networks are also known as recurrent neural networks.

4.3.8 Recurrent Neural Network

Recurrent neural networks (RNN) are more difficult to understand. They save the output of processing nodes and feed it back into the model, so information does not pass in one direction only. In this way, the model is said to learn to predict the outcome of a layer. Each node in the RNN model serves as a memory cell, carrying information forward as computation proceeds. During backpropagation, if the network's forecast is incorrect, the network self-corrects and adjusts its weights towards the right prediction. RNNs capture the sequential information in the input data, and their parameters are shared across time steps; this is generally referred to as parameter sharing. Therefore, there are fewer parameters to train, and the computational cost is smaller.


4.4 Model assessment

The dataset was separated into a training set and a test set. To find the best model, the dataset was tested with different feature selection methods and with the following training-to-testing ratios (see the sketch after this list):

1. 90:10 where 90% data is used for training the model and 10% data is used for testing.

2. 80:20 where 80% data is used for training the model and 20% data is used for testing.

3. 70:30 where 70% data is used for training the model and 30% data is used for testing.

4. 60:40 where 60% data is used for training the model and 40% data is used for testing.
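As an illustration of these splits, a minimal sketch is given below. It assumes Python with scikit-learn and uses a synthetic stand-in for the processed data (564 observations with an imbalance similar to the real dataset); it is not the exact code used in this thesis.

```python
# Minimal sketch (assumptions: scikit-learn; synthetic data mimics the
# 564-observation dataset with a roughly 92:8 class split).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=564, n_features=50, weights=[0.92],
                           random_state=0)

for test_size in (0.10, 0.20, 0.30, 0.40):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0)
    print(f"{round((1 - test_size) * 100)}:{round(test_size * 100)} split ->",
          f"{len(X_tr)} train / {len(X_te)} test")
```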


Chapter 5

Results and Discussion

In this chapter, the results obtained by applying the feature selection methods and different classifiers are presented and analysed. The experiments were performed on the epigenomics data obtained by combining four datasets: methylation data, histone H3 marker ChIP-Seq data, human genome data and RNA-Seq data.

Nine feature selection techniques were used to select the best features, to prevent the model from overfitting on non-significant features, and to reduce the computational cost. Table 5.1 shows the feature selection methods used in this study (a brief sketch of two of them follows the table).

Table 5.1 Feature Selection Technique implemented in the research.

Method Feature Selection Technique Name

CFS Correlation-based feature selection (CFS)

Chi-Square Chi-squared-based Feature Selection

MI Mutual Information Based Feature Selection

PCA Principal Component Analysis (PCA)

RF ReliefF

RFE Recursive Feature Elimination (RFE)

EMLR Embedded Method with Logistic Regression

EMRF Embedded Method with Random Forest

SelectK Select K features based on highest scores
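A brief sketch of two of the listed methods, Chi-square-based selection and Recursive Feature Elimination, is given below. It assumes Python with scikit-learn and synthetic data; keeping 50 features is an illustrative choice, not necessarily the number retained in this study.

```python
# Minimal sketch (assumptions: scikit-learn, synthetic data, illustrative
# feature count): Chi-square selection (SelectKBest with chi2) and RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=564, n_features=100, random_state=0)

# chi2 requires non-negative inputs, hence the scaling step.
X_pos = MinMaxScaler().fit_transform(X)
X_chi2 = SelectKBest(chi2, k=50).fit_transform(X_pos, y)

# RFE repeatedly fits a model and drops the weakest features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=50)
X_rfe = rfe.fit_transform(X, y)
print(X_chi2.shape, X_rfe.shape)
```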


The machine learning algorithms used to predict the occurrence of cancer are shown in Table 5.2.

Table 5.2 Machine Learning Algorithms implemented in the research.

Algorithm Full Name

NB Naïve Bayes

KNN K-Nearest Neighbour

SVM-Poly Support Vector Machine with Polynomial Kernel

SVM-RBF Support Vector Machine with Radial Kernel

DT Decision Tree

RF Random Forest

XGB-Linear Xgboost Linear

MLP Multi-Layer Perceptron

ANN Artificial Neural Network

RNN Recurrent Neural Network

The models were trained using 5-fold cross-validation, as there are relatively few observations in the data. Five-fold cross-validation trains the model on four folds and tests it on the fifth fold. Each time, a different fold is selected for testing, and an average of the results is taken. The processed data has 564 observations in total, of which 520 have no tumor and 44 have a tumor.
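A minimal sketch of the 5-fold procedure is shown below, assuming Python with scikit-learn; the synthetic data mimics the 564-observation dataset, and Random Forest is used only as an example classifier.

```python
# Minimal sketch (assumptions: scikit-learn; synthetic data mimics the
# 564 observations with a 520/44-like split): stratified 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=564, n_features=50, weights=[0.92],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print("fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```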

The data is highly imbalanced and therefore needs to be balanced. Figure 5.1 shows the imbalance in the data.


Figure 5.1. Distribution of classes in the Processed Data

To balance the data, we used the SMOTE balancing technique. SMOTE is a well-known approach for artificially generating new minority class instances from the nearest neighbours of existing cases [62]. SMOTE can be combined with oversampling and undersampling: oversampling generates new samples for the minority class, while undersampling removes samples from the majority class, to balance the data evenly. In our dataset, as shown in Figure 5.1, we do not want the model to be trained on a sample with 520 cases of no cancer and only 44 cases of cancer, because with such a super-majority class the model will predict no cancer almost every time. Therefore, we synthesized data for the minority class to balance the dataset using SMOTE.
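A minimal sketch of SMOTE balancing is shown below, assuming the Python imbalanced-learn package; the synthetic data mimics the 520/44 class split.

```python
# Minimal sketch (assumptions: imbalanced-learn; synthetic data mimics
# the 520/44 class split): balancing the classes with SMOTE.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=564, n_features=50, weights=[0.92],
                           random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))   # minority class synthetically upsampled
```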

Evaluation metrics used to compare the results of the different feature selection methods and classifiers include Accuracy, Sensitivity, Specificity and the Area Under the Receiver Operating Characteristic (ROC) curve (AUC). The confusion matrix is used to calculate the evaluation metrics. High values of accuracy, AUC, sensitivity and specificity indicate a good model. A brief description of each evaluation metric is given in Table 5.3 (a short computation sketch follows the table).

Table 5.3. Evaluation Metrics used in the Research.

Evaluation Metrics Description

Confusion matrix Representation of the correctly and incorrectly classified instances in the test set for each class, in matrix form. The confusion matrix is one of the most important representations in classification: it describes the full performance of a model and can be considered the basis of all other evaluation metrics.

Accuracy Percentage of correctly classified observations.

AUC The area under the ROC curve (AUC) is a performance measurement. The AUC score combines the True Positive Rate and the False Positive Rate (FPR) into one single metric, where FPR = 1 − Specificity. An ideal model with zero error and complete accuracy has an AUC close to 1.

Sensitivity Sensitivity, also known as the True Positive Rate or Recall, is the proportion of actual positive cases that are predicted as positive by the model.

Sensitivity = True Positive / (True Positive + False Negative)

Specificity Specificity, also known as the True Negative Rate, is the proportion of actual negative cases that are predicted as negative by the model.

Specificity = True Negative / (True Negative + False Positive)
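To show how these metrics follow from the confusion matrix, a minimal sketch is given below, assuming Python with scikit-learn and toy predictions in place of real model output.

```python
# Minimal sketch (assumptions: scikit-learn, toy predictions): computing
# the Table 5.3 metrics from a confusion matrix and predicted scores.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 0, 1, 1, 0, 1, 0]                   # toy labels (1 = tumor)
y_pred  = [0, 0, 1, 1, 1, 0, 0, 0]                   # toy predicted classes
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.3, 0.4, 0.2]   # toy predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)                          # True Positive Rate (Recall)
specificity = tn / (tn + fp)                          # True Negative Rate
auc         = roc_auc_score(y_true, y_score)
print(accuracy, sensitivity, specificity, round(auc, 3))
```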


5.1 Model Comparison using 5-fold Cross Validation

First, all the models were compared with each of the feature selection methods using 5-fold cross-validation. Given below are the evaluations of the models based on the accuracy of the classifiers for each of the feature selection methods.

5.1.1 Correlation-based Feature Selection (CFS)

Figure 5.2 shows the comparison of the 10 models (classifiers) for the Correlation-based Feature Selection (CFS) method. Random Forest and XGBoost performed the best across the 5-fold training, both giving an average accuracy of 98%. The Multi-Layer Perceptron (MLP) gave the worst performance, with an average accuracy of 79%.

Figure 5.2 Comparison of models for the Correlation-based feature selection (CFS) method


5.1.2 Chi-square-based Feature Selection

Figure 5.3 shows the comparison of the 10 models (classifiers) for the Chi-square-based feature selection method. SVM Linear, Random Forest, ANN and XGBoost performed the best across the 5-fold training, with average accuracies of 98%, 99%, 97% and 97%, respectively. Almost all models perform well with Chi-square-based feature selection.

Figure 5.3 Comparison of models for the Chi-squared based feature selection method

5.1.3 Mutual Information (MI) based Feature Selection

Figure 5.4 shows the comparison of the models (classifiers) for the Mutual Information (MI) based feature selection method. Almost all models performed well with the MI score-based feature selection method across the 5-fold training, except MLP. So far, MLP has shown lower accuracy than the other models; with Mutual Information based feature selection its accuracy is 85%.

Figure 5.4 Comparison of models for the Mutual Information score-based feature selection method

5.1.4 Principal Component Analysis (PCA)

Figure 5.5 shows the comparison of the models (classifiers) for the PCA-based feature selection method. It can be observed from the graph that almost all models performed well, this time with the exception of Naïve Bayes, which had an average accuracy of 85%. MLP, which did not perform well with the CFS feature selection method, performs better with PCA-based features across the various folds.


Figure 5.5 Comparison of models for the PCA-based feature selection method

5.1.5 ReliefF Feature Selection

Figure 5.6 provides a comparison of the models for the ReliefF-based feature selection method. The SVM (Radial and Linear), Random Forest and XGBoost models outperformed the others, with an average accuracy of 98%. Moreover, by the fifth fold, Random Forest, XGBoost and SVM Linear reached an accuracy of 100%; this was not the case for MLP, which fit poorly across the various folds, although it performed slightly better between folds 2 and 4.


Figure 5.6 Comparison of models for the ReliefF-based feature selection method

5.1.6 Recursive Feature Elimination (RFE) Feature Selection

Figure 5.7 shows the comparison of the models for the Recursive Feature Elimination (RFE) based feature selection method. Random Forest, SVM (Linear and Radial), ANN and XGBoost performed well, with an accuracy of 99%; the exception again was MLP. The accuracy of XGBoost was slightly lower in the first three folds, but by the fifth fold it was on par with Random Forest. MLP again did not perform well across the 5 folds.


Figure 5.7 Comparison of models for the RFE-based feature selection method

5.1.7 Embedded Method with Logistic Regression Feature Selection

Figure 5.8 depicts the comparison of the models for the Embedded Method with Logistic Regression based feature selection method. Random Forest, ANN and XGBoost performed well, with an accuracy of 99%, while MLP lagged behind with an accuracy of 80%. MLP does not perform well across the 5 folds compared to the other models; the same is the case with RNN and KNN.
