Objectives of the Study - Predicting head and neck cancer in patients using epigenomics data and advanced machine learning methods

The main goal of this research is to study epigenomics data, in the form of gene expression data, from patients with head and neck squamous cell carcinoma (HNSCC). Using multiple classifiers and feature selection approaches, this research proposes to create a more reliable model. The key objective is to assess and predict HNSCC with machine learning techniques and to find the best model(s). A further objective is to find the genes most relevant to HNSCC by clustering the features.

Nine feature selection methods (Correlation-based Feature Selection (CFS), Chi-squared-based Feature Selection, Mutual Information-based Feature Selection, Principal Component Analysis (PCA), ReliefF, Recursive Feature Elimination (RFE), Embedded Method with Logistic Regression, Embedded Method with Random Forest, and Select K features based on highest scores) and ten classifiers (Naïve Bayes, K-Nearest Neighbour (KNN), Support Vector Machine (SVM) with Polynomial Kernel, SVM with Radial Kernel, Decision Tree, Random Forest, XGBoost Linear, Multi-Layer Perceptron (MLP), Artificial Neural Network (ANN) in the form of a fully connected neural net (FCN), and Recurrent Neural Network (RNN)) are used to accomplish these objectives.

Features were selected from the raw data using 5-fold cross-validation on the training and validation data. Accuracy, AUC, and the confusion matrix are the key performance metrics used to assess the models. Cluster analysis was also performed to visualize the features and obtain better results. Statistical analysis was used to determine whether the differences between models were significant.
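The evaluation protocol described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code used in this study: the function names are hypothetical, labels are assumed to be coded 0 (control) and 1 (case), and the folds are plain shuffled folds rather than stratified ones.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists, holding out each fold once."""
    folds = k_fold_indices(n_samples, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tn = pairs.count((0, 0))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    tp = pairs.count((1, 1))
    return tn, fp, fn, tp


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return (tn + tp) / (tn + fp + fn + tp)


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice each classifier would be fitted on the training indices of every split and scored on the held-out fold, and the per-fold accuracy and AUC values would then be averaged.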

Chapter 2 Literature Review

Head and neck squamous cell carcinoma (HNSCC) is the 6th most common type of cancer [19] and is associated with 650,000 new cases and 330,000 deaths annually worldwide [20][21].

The majority of HNSCC cases are oral squamous cell carcinomas (OSCC) [22]; more than 90% of HNSCC cases are associated with OSCC [23]. Incidence rates (mainly of OSCC (Collaboration, 2019)) are higher in South Asian countries such as India [24], Bangladesh (Collaboration, 2019), and Pakistan [25] than in other parts of the world. Several risk factors for HNSCC are known, including chewing tobacco, cigarette smoking, excessive alcohol consumption [26], and oncogenic viruses such as human papillomavirus (HPV) [27].

Additionally, epigenetic regulation, mutation, copy number variation (CNV), and the host immune response also play key roles in carcinogenesis [28]. Despite current advances in cancer diagnosis and treatment, the overall 5-year survival rate in HNSCC is less than 50%, owing to a lack of proper diagnostic markers and targeted therapies [29]. In multiple cancers, including HNSCC, it is well established that identifying cancer at an early stage leads to better survival than detection at a later stage. According to the American Joint Committee on Cancer (AJCC) TNM staging system, early-stage primary tumours (TNM stages I and II) are described as those with a diameter of 2–4 cm and no spread to lymph nodes or metastasis. A tumour is called advanced (late stage) if it is greater than 5 cm and has spread only to surrounding lymph nodes (TNM stage III) or has spread to other areas of the body (TNM stage IV) [30].
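The early/late grouping of TNM stages described above can be expressed as a small labelling function. This is only an illustrative sketch: the function name and the assumed input format (strings such as "Stage I" or "stage iva", with an optional A/B/C sub-stage letter) are assumptions, not part of any staging standard or of the data used in this study.

```python
import re

# Assumed input format: a Roman-numeral stage I-IV with an optional A/B/C suffix,
# e.g. "Stage I", "Stage IIB", "stage iva", or just "III".
_STAGE = re.compile(r"\b(IV|III|II|I)[ABC]?\b")


def stage_group(stage):
    """Return 'early' for TNM stage I-II and 'late' for TNM stage III-IV."""
    match = _STAGE.search(stage.upper())
    if match is None:
        raise ValueError(f"unrecognised stage: {stage!r}")
    return "early" if match.group(1) in ("I", "II") else "late"
```

A binary label of this kind is what a classifier would be trained to predict when stratifying patients into early and late clinical stages.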

2.1 Role of miRNA

In the last few decades, a great deal of research has been carried out on HNSCC, but no clinically significant findings have been made. Patients would benefit from a fast and correct diagnosis in a number of ways, including appropriate care, which would reduce morbidity and improve treatment outcomes. Unfortunately, there is no universally accepted biomarker for HNSCC in clinical use. More appropriate therapies and clinically important biomarkers to stratify patients with HNSCC are urgently needed. MicroRNAs (miRNAs) are non-coding RNAs 18–25 nucleotides long. They can regulate mRNA expression by interacting with the 3′ untranslated region (3′ UTR), leading to mRNA degradation. By virtue of this control over mRNA expression, miRNAs have important regulatory roles in cell division, cell maturation, angiogenesis, proliferation, migration, invasion, metastasis, autophagy, and apoptosis [31].

However, in various diseases, especially cancers, miRNAs themselves can become dysregulated, leading to pathological conditions [32]. A significant number of miRNAs have been well described for their biological role in cancer and their capacity to control the expression of various cancer pathways [33]. Research indicates that, in certain cancers, changes in the miRNA expression profile can be observed well before clinical signs occur [34]. Because of their stability and ease of detection (in tissues as well as biological fluids), miRNAs are promising candidates for biomarker discovery [35]. miRNA expression profiles can also provide insight into underlying tumour progression and/or lead to the discovery of new therapeutic targets.

2.2 Need for pre-processing the data

The Cancer Genome Atlas (TCGA) Research Network [36] is a coordinated effort to gather, share, and analyse next-generation molecular sequencing data to improve our understanding of cancer mechanisms at the molecular level [37]. The data used in our analysis were obtained from the Genomic Data Commons (GDC) Data Portal [38] and contained 528 TCGA-HNSC cases, including genotyping, solid-tumour RNA expression, whole-exome sequencing, methylation data, and clinical information. Only RNA expression variables and clinical data are included in this study. The clinical records include tumour grading data, patient demographic information, smoking and alcohol history, and other characteristics linked to disease development, such as lympho-vascular invasion and margin status. Human papillomavirus (HPV) status (based on ISH and p16 testing) was also included, as HPV status has strong implications for prognosis and tumour development [39, 40]. These data come from a variety of experiments at various universities, many of which used different platforms and assays over long periods of time. The study discussed here describes the difficulties that this common type of dataset poses in oncological research. Large, multi-institutional databases pose a number of obstacles to the development of clinical decision support approaches and tools. Specifically, TCGA-HNSC had problems with sparsity and inconsistency across many clinical data fields. Out of 15 identified clinical characteristics relevant to treatment regimen, none were populated for every patient. More specifically, the number of cases (from a total possible 564 cases in TCGA) where 520 patients did not have tumour. In addition to the problem of missing data, several fields were populated inconsistently due to human error. Such complications required extensive pre-processing and an expert system built using domain-specific knowledge to determine whether each patient had received a specific type of therapy. Even after this pre-processing and condensing of treatment fields, issues of missing data persisted.
Whether a patient had received radiotherapy and/or chemotherapy was unclear for 47% and 27% of cases, respectively. One possible technique for handling such problems is to exclude cases or variables with missing data, as was done previously with this dataset [41]. Because these attributes are important to our decision support goals and the number of cases to choose from is finite, we instead seek to optimise use of the available data by imputing missing values.

Molecular data types are often extremely high-dimensional. Feature selection and dimensionality reduction are therefore necessary steps when using such data, in order to make the best use of available computational resources.

There are several strategies for feature selection and dimensionality reduction, including feature filtering, feature transformations, and wrapper methods such as sequential selection [42]. In this work, feature filtering and an unsupervised sparse PCA feature-space transformation of approximately 20,000 solid-tumour RNA expression variables were employed and evaluated in the context of TCGA-HNSC. Pre-processing, class balancing, and missing-data imputation were performed, resulting in a much more concise and usable dataset.
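Two of the pre-processing steps named in this section, missing-value imputation and feature filtering, can be sketched with the Python standard library. The text does not state which imputation rule or filter criterion was used, so column-median imputation and a variance-based filter are assumptions chosen purely for illustration, and the function names are hypothetical. The sparse PCA transformation is not shown.

```python
from statistics import median, pvariance


def impute_median(rows):
    """Replace None entries with the median of the observed values in that column."""
    fills = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        # A column with no observed values cannot be imputed and stays None.
        fills.append(median(observed) if observed else None)
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def top_variance_features(rows, n_keep):
    """Return the indices of the n_keep highest-variance columns, in column order."""
    variances = [pvariance(col) for col in zip(*rows)]
    ranked = sorted(range(len(variances)), key=variances.__getitem__, reverse=True)
    return sorted(ranked[:n_keep])


# Toy matrix: 3 patients x 3 variables, with None marking missing entries.
raw = [[1.0, 5.0, None], [1.0, None, 10.0], [1.0, 9.0, 20.0]]
complete = impute_median(raw)
kept = top_variance_features(complete, n_keep=2)
```

Here the constant first column carries no information and is dropped by the filter, while the two missing entries are filled from the values observed for the other patients.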

2.3 ML Technology

With the rapid growth of data to support its diverse applications in a variety of cancers, Machine Learning (ML) has seen an increase in popularity. ML algorithms can make more reliable and effective predictions as the amount of training data grows [43]. Mathematical modelling has long been used in disease modelling, classification, and estimation of molecular structure. Thanks to advances in next-generation sequencing (NGS) technologies and the availability of large sequencing datasets, machine learning has opened new opportunities for understanding and tracking disease processes. Risk stratification, mutational frequency estimation, copy number variation (CNV) analysis, and new target recognition are all popular applications of machine learning techniques. It has been suggested that ML techniques can be used for diagnosis and prognosis in cancers [44]. For example, using expression data, SNPs (single nucleotide polymorphisms), and CNVs (copy number variations), ML techniques have been used to detect activation of Rat Sarcoma (RAS) pathways in cancers [45]. A multi-parametric Decision Support System (DSS) was used to forecast OSCC progression and possible relapse (local or metastatic) using a variety of heterogeneous data (clinical, genomic, and imaging data) [46].

In a report by Avissar [47] that sought miRNAs able to predict the presence of HNSCC, miR-221 and miR-375 were identified as predictors. The research, however, did not consider stages or grades. A recent machine learning study used neural networks to forecast tongue cancer recurrences [48].

Kim developed predictive models for survival prediction in oral cancer patients [49]. In one study that predicted survival in oral squamous cell carcinoma with a machine-learning-based prediction model, 80.0% of the data were randomly selected from the National Cancer Data Base (NCDB) and used. The classification models were a 2-class decision forest, a 2-class decision jungle, 2-class logistic regression, and a 2-class neural network. Of the four, decision forest classification was the most robust, with an AUC of 0.80 and an accuracy of 71% [50]. Another report stratifies the clinical stages of HNSCC patients and predicts risk from miRNA expression profiles using machine learning algorithms such as Random Forest (RF), Support Vector Machine with Radial Kernel (svmR), Adaptive Boost (AdaBoost), averaged Neural Network (avNNet), and Gradient Boosting Machine (GBM); Adaptive Boost performed best of all, with AUC = 0.929 and accuracy = 0.872 [30].

Machine learning methods play a pivotal role in decision making that supports cancer diagnosis and analysis. Asri et al. [51] predicted breast cancer risk and compared the performance of various machine learning methods, including Support Vector Machine (SVM), Decision Tree, Naive Bayes (NB), and k-Nearest Neighbours (KNN). All the experiments were performed in WEKA, where the SVM model gave the best accuracy of 97.13% with the minimum error rate.

Khalilia et al. [52] predicted the risk of 8 different diseases, including breast cancer, from imbalanced data with the help of a Random Forest classifier. In their study, they found that Random Forest (RF) performed very well in comparison with Support Vector Machine (SVM). They used National Inpatient Sample (NIS) data, which are highly imbalanced; as a solution, they used an ensemble learning method that relied on repeated random subsampling. This method splits the training data into several sub-samples while ensuring that each one is completely balanced. They then compared the performance of RF and SVM with the subsampling. RF provided the best result in predicting the 8 disease risks, with an average AUC of 88.79%.
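The idea of repeated random subsampling can be sketched as follows. This is not the implementation of Khalilia et al.: a nearest-centroid rule stands in for the Random Forest base learner so that the example needs only the standard library, each balanced subset is formed by an independent random draw from the majority class, and all function names are hypothetical.

```python
import random


def balanced_subsamples(X, y, n_subsamples, seed=0):
    """Yield balanced subsets: every minority row plus an equal-size random draw of majority rows."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_subsamples):
        idx = minority + rng.sample(majority, len(minority))
        yield [X[i] for i in idx], [y[i] for i in idx]


def fit_centroids(X, y):
    """Stand-in base learner (assumption): the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def train_ensemble(X, y, n_models=5, seed=0):
    """Fit one base learner per balanced subsample."""
    return [fit_centroids(Xs, ys) for Xs, ys in balanced_subsamples(X, y, n_models, seed)]


def ensemble_predict(models, x):
    """Majority vote over the base learners (use an odd n_models to avoid ties)."""
    votes = [predict_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)
```

Because every base learner sees the two classes in equal proportion, no single model is dominated by the majority class, and the vote combines what the different majority-class draws have captured.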

Parmar et al. [53] performed a radiomics analysis to predict overall survival in patients with head and neck cancer (HNSCC), using machine learning techniques on two HNSCC cohorts comprising 196 patients. Several feature selection methods were used, such as Relief, T-score, Chi-square, Gini index, and mutual information. After feature selection, various classifiers were applied, namely decision trees, neural networks, random forest, and support vector machine, using 10-fold cross-validation. Their results show that mutual-information-based feature selection with neural networks and random forest gave the best AUC scores.
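Mutual-information-based feature selection, one of the filters mentioned above, can be illustrated with a short standard-library sketch. It assumes that features have already been discretized (for example, into low/high expression), which the text does not specify; the function and feature names are hypothetical and this is not the procedure of Parmar et al.

```python
from collections import Counter
from math import log2


def mutual_information(feature, labels):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def select_k_best(columns, labels, k):
    """Return the names of the k features with the highest mutual information."""
    scores = {name: mutual_information(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: three discretized features (hypothetical names) and a binary outcome.
labels = [0, 0, 1, 1]
columns = {
    "gene_a": [0, 0, 1, 1],  # identical to the outcome, so 1 bit of information
    "gene_b": [0, 1, 0, 1],  # independent of the outcome, so 0 bits
    "gene_c": [0, 0, 1, 0],  # partially informative
}
best = select_k_best(columns, labels, k=2)
```

The selected features would then be passed to the classifiers, with the selection repeated inside each cross-validation fold so that the held-out data do not influence which features are kept.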

The identification of late stages often involves cumbersome examination and invasive tests; therefore, such studies could be immensely useful in achieving better diagnosis and clinical outcomes. Such a strategy can also be extended to cancers that are difficult to reach and detect, such as pancreatic cancer. It should be remembered that the clinical stage is known to play an important role in the overall survival of patients. These studies have led to the identification of signatures that can efficiently stratify HNSCC patients into early and late clinical stages, and can stratify patients' risk in a way that may inform the choice of treatment regimen, e.g., whether a less aggressive treatment can be used effectively instead of highly toxic therapies.

The HNSCC literature often focuses on the association between the regulation of specific genes and prognosis [54, 55]. Other groups, however, acknowledge the need for large-scale integrative analysis to capture potential novel biomarkers [56–58]. In other cancers, unsupervised transformations of molecular data (e.g., RNA sequencing, DNA methylation, miRNA sequencing) are known to be useful in machine-learning-based survival prediction [59, 60]. In this respect, HNSCC has received little attention. Similarly, little literature is available on machine learning imputation of sparse clinical data.

Chapter 3 Data Preparation
