Semi-orthogonal non-negative factorization as a feature extraction method to improve prediction accuracy of microarray cancer data


Semi-Orthogonal Non-Negative Factorization as a Feature Extraction Method to Improve Prediction Accuracy of Microarray Cancer Data

by

Nakul Patel

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science (MSc) in Computational Sciences

The Faculty of Graduate Studies Laurentian University Sudbury, Ontario, Canada


THESIS DEFENCE COMMITTEE/COMITÉ DE SOUTENANCE DE THÈSE Laurentian Université/Université Laurentienne

Faculty of Graduate Studies/Faculté des études supérieures

Title of Thesis / Titre de la thèse: Semi-Orthogonal Non-Negative Factorization as a Feature Extraction Method to Improve Prediction Accuracy of Microarray Cancer Data

Name of Candidate / Nom du candidat: Patel, Nakul

Degree / Diplôme: Master of Science

Department/Program / Département/Programme: Computational Sciences

Date of Defence / Date de la soutenance: April 15, 2020

APPROVED/APPROUVÉ

Thesis Examiners/Examinateurs de thèse:

Dr. Kalpdrum Passi

(Supervisor/Directeur de thèse)

Dr. Ratvinder Grewal

(Committee member/Membre du comité)

Dr. Peter Adamic

(Committee member/Membre du comité)

Dr. Gulshan Wadhwa

(External Examiner/Examinateur externe)

Approved for the Faculty of Graduate Studies / Approuvé pour la Faculté des études supérieures

Dr. David Lesbarrères / Monsieur David Lesbarrères

Dean, Faculty of Graduate Studies / Doyen, Faculté des études supérieures

ACCESSIBILITY CLAUSE AND PERMISSION TO USE

I, Nakul Patel, hereby grant to Laurentian University and/or its agents the non-exclusive license to archive and make accessible my thesis, dissertation, or project report in whole or in part in all forms of media, now or for the duration of my copyright ownership. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also reserve the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that this copy is being made available in this form by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.


Abstract

Abnormal cell growth with the potential to spread to other parts of the human body can occur for multiple reasons, such as changes in the activity of DNA segments. Altered DNA methylation is known to be an important factor in cancer development, changing DNA activity by suppressing some of the normal activities of DNA. Feature selection and feature extraction are used to reduce the dimensionality of high-dimensional datasets and to filter the most useful features for predicting gene expression for a cancer. A number of feature extraction methods have been used in the literature for selecting the most useful features. In this study, Semi-orthogonal Non-negative Matrix Factorization (SONMF) was studied and tested on four microarray cancer datasets for feature extraction and compared with FFT features, Symmetry of Methylation Density features, Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF). Five classifiers, namely Naïve Bayes, Support Vector Machine (SVM), K-nearest Neighbour (KNN), Random Forest and Neural Network, were used to predict the gene expression of the four cancer microarray datasets. The experiments show that, for the colon cancer dataset, SONMF and NMF with the Naïve Bayes classifier gave the highest accuracy among the feature extraction methods. A one-way analysis of variance showed that the accuracy, specificity and sensitivity of SONMF were significantly higher than those of PCA. For the oral cancer dataset, the highest accuracy was observed with SONMF and the Neural Network classifier. For the leukemia dataset, the highest accuracy of 100% was observed with NMF, SONMF and PCA using the Neural Network and SVM classifiers. However, comparing the medians for the best classifier shows that the medians of SONMF and NMF were slightly higher than that of PCA. For the prostate cancer dataset, SONMF with Naïve Bayes


from PCA and NMF. Overall, the results of SONMF were more consistent than those of the other feature extraction methods.

Keywords: DNA Methylation, Feature Selection, Feature Extraction, Non-negative Matrix Factorization, Semi-orthogonal Non-negative Matrix Factorization, Principal Component Analysis, Enhanced Fourier Transform, Symmetry percentage error.


Acknowledgments

Many thanks to my supervisor, Dr. Passi, who guided me step by step with his deep knowledge and was compassionate and kind to me.

Thanks to my family for helping me and for providing the conditions that allowed me to do this study. I would not have been able to finish my studies without their help.

I really appreciate the help of my friends and family at every step of working on this thesis.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1 Introduction
1.1 Prostate Cancer
1.2 Colon Cancer
1.3 Leukemia cancer
1.4 Oral cancer
1.5 DNA Gene Expression
1.6 Feature Selection
1.7 Feature extraction
1.8 Classification
1.9 Objectives of the study and outline of the thesis

Chapter 2 Literature Survey

Chapter 3 Data Processing
3.1. Datasets
3.2. Data Processing
3.2.1. Colon Cancer Data
3.2.2. Leukemia Cancer Data
3.2.3. Prostate Cancer Data
3.2.4 Oral Cancer Data
3.3. Methodology

Chapter 4 Feature Selection and Feature Extraction Methods
4.1 Feature Selection
4.1.1 F-Score
4.2. Feature Extraction
4.2.1. FFT Features
4.2.2. Symmetry of Methylation Density Features
4.2.3. Non-negative Matrix Factorization
4.2.4. Semi-orthogonal Non-negative Matrix Factorization
4.2.5. Principal Component Analysis (PCA)
4.3. Feature Selection & Model Assessment

Chapter 5 Classification Methods
5.1 Classification Methods
5.1.1. Naïve Bayes Classifier
5.1.2. Support Vector Machine (SVM)
5.1.3. K-Nearest Neighbors (KNN)
5.1.4. Random Forest (RF)
5.1.5. Neural Network
5.2. Tools
5.3 Feature Selection Ratio
5.4. 10-Fold Cross Validation

Chapter 6 Results and Discussion
6.1. Results
6.2. Discussion

Chapter 7 Conclusions
7. Conclusions

References

Appendix
1. Results of Prostate Cancer Data
2. Results of Oral Cancer Data
3. Results of Leukemia Cancer Data

List of Figures:

Figure 3-1: Flow chart for data analysis

Figure 4-1: Mean methylation kernel density of some normal and cancer samples

Figure 4-2: Mean methylation kernel density of some normal and cancer samples

Figure 4-3: Mean methylation kernel density of some normal and cancer samples

Figure 4-4: Mean methylation kernel density of some normal and cancer samples

Figure 4-5: Algorithm of Semi-Orthogonal NMF for continuous [9]

Figure 6-1: Sensitivity, Specificity and Accuracy for Colon Dataset

Figure 6-2: Accuracy of the model for various feature extraction methods (Colon Data)

Figure 6-3: Post-hoc test for multiple comparison (Colon Data)

Figure 6-4: Box-plot for sensitivity and specificity of feature extraction methods (Colon Data)

Figure 6-5: Sensitivity, Specificity and Accuracy for Prostate Dataset

Figure 6-6: Sensitivity, Specificity and Accuracy for Leukemia Dataset

Figure 6-7: Sensitivity, Specificity and Accuracy for Oral Dataset

List of tables:

Table 3.1: Descriptive statistics for five gene expressions with widest interquartile range (colon data)

Table 3.2: Descriptive statistics for five gene expressions with widest interquartile range (Leukemia)

Table 3.3: Descriptive statistics for five gene expressions with widest interquartile range (prostate)

Table 3.4: Descriptive statistics for five gene expressions with widest interquartile range (Oral data)

Table 4.1: Fscore for first 10 features with highest Fscore

Table 4.2: Extracted features and comparison for normal and cancer group (colon dataset)

Table 4.3: Extracted features and comparison for normal and cancer group (leukemia dataset)

Table 4.4: Extracted features and comparison for normal and cancer group (prostate dataset)

Table 4.5: Extracted features and comparison for normal and cancer group (oral dataset)

Table 4.6: Symmetry methylation features and comparison for normal and cancer group (colon data)

Table 4.7: Symmetry methylation features, comparison for normal and cancer group (leukemia data)

Table 4.8: Symmetry methylation features, comparison for normal and cancer group (prostate data)

Table 4.9: Symmetry methylation features, comparison for normal and cancer group (oral dataset)

Table 6.1: Classification results of colon dataset for various split ratio and feature selection ratio

Table 6.2: Total accuracy for the Naïve Bayes classifier (Colon Data)

Table 6.3: Total accuracy for the Naïve Bayes classifier (Prostate data)

Table 6.4: Total accuracy for the SVM and NNET (Leukemia data)

Table 6.5: Total accuracy for the NNET (Oral data)

Table 6.6: Post-hoc test for difference between accuracy of SONMF and PCA in NNET (Colon Data)

Table 6.7: Post-hoc test, comparing SONMF with PCA in sensitivity & specificity (NNET) (Colon)

Table 6.8: Post-hoc test for SONMF vs PCA in RF classification model (Colon Data)

Table 6.9: Post-hoc test for comparison of SONMF vs PCA, in KNN, SVM and NB models (Colon)

Chapter 1 Introduction

1 Introduction

Cancer has been one of the most fundamental health problems of human society in past decades. Every year, between 100 and 350 out of every 100,000 people worldwide die of cancer. Understanding the nature of cancer, which is caused by the malfunction of the mechanisms that regulate cell growth and division, has long been a topic of interest to researchers. The development of molecular biology in recent decades has improved the understanding of the complex interactions among genetic variants, transcription and translation. Proteomic studies can play a critical role in the prevention, early detection and treatment of cancer: because they help identify cancer biomarkers, their results help to establish the presence of cancer in the human body and so support early detection and treatment [1].

Cancer detection from gene expression information is challenging because the data are high dimensional and include redundant information. Despite the many studies on diagnosing various types of cancer, there are still uncertainties in the clinical diagnosis of specific human tumors.

The cell generation cycle has a regular mechanism, the master cells (stem cells) are those cells from


lead to cancer stem cells which have the ability to spread and divide into daughter cells [2]. Hence, these cancer stem cells can recreate the tumor from a limited number of cells.

Self-renewal, expansion and division were not seen only in stem cells; it was also seen that acute myeloid leukemia (AML) in patients with severe immunodeficiency disease could cause a fast expansion. These AML tumor cells are multipotent progenitors [3]. Cancer cells are created in organs that contain stem cells, lineage-committed cells and progenitors. The multipotent progenitors lack the ability to self-renew. The transition from stem cells to multipotent progenitors was shown in [4,5]. In 2005, researchers separated the different types of cells that grow into leukemia [6]. Leukemia was successfully transplanted with granulocyte–macrophage progenitor cells in mice that had AML. They also found that the progression of leukemia depends on rare genetic and epigenetic events, some of which occur in self-renewing cells and are sufficient to create similar leukemic cells [7]. Cancer stem cells originate from cells that have acquired oncogenic mutations. Genetic programs related to self-renewal can be activated by specific oncoproteins, which leads to the creation of cancer stem cells from malignant cells. This finding was also confirmed in in vivo studies on mice with AML. The presence of the oncoprotein and the gene dosage are both important factors in the generation of a tumor.

Cancer stem cells are powerful cells with properties similar to those of normal stem cells. These normal stem cell properties allow cancer stem cells to survive and spread into more cells even after anticancer treatment. In this study, four cancer datasets were analyzed. A brief introduction to these cancers is presented below.

1.1 Prostate Cancer

After lung cancer, prostate cancer is the second most common cancer among males, and in developed countries it is the most common. In European countries this cancer is the most significant cause of death for males [8]. The long treatment process affects patients, causing urinary incontinence, sexual dysfunction and proctitis due to radiation. The best and easiest way to fight the disease is prevention. One way of prevention is to identify the people who have a high probability of getting the disease and to reduce the factors that cause it. Because prostate cancer often occurs in old age, predicting it in young people gives those at risk sufficient opportunity to prevent the disease by changing their lifestyle. Prostate cancer, like many other cancers, depends not only on genetics and family history but also strongly on lifestyle, type of nutrition and occupation. Occupations that bring men into contact with harmful substances in the workplace, such as chemical materials and heavy metals, are important examples of factors that expose men to prostate cancer.

Hence, building predictive models could help to identify the people who are more likely to get cancer and help them to prevent the disease by changing their occupation or lifestyle.

1.2 Colon Cancer

After skin, breast and gastric cancer, colon cancer is one of the most common cancers. Although it is progressive and deadly, it is fortunately preventable. According to the available statistics, it is equally distributed among males and females and often occurs after 50 years of age. In congenital and familial cases it may occur earlier. Evidence and experience show that environmental and hereditary factors play a role in the disease. This cancer presents in three ways:

1- Sporadic case: this is the most common type, which occurs without any congenital basis or family history and normally happens after 50 years of age. For this reason, research centers for


2- Congenital and genetic case: around 10 to 15 percent of cases of this cancer are of this type. Since the disease occurs in immediate family members, it can fortunately be avoided in other family members by training them in preventive activities. By identifying most of the genes that cause this disease, doctors are capable of early detection and prevention in the family and relatives.

3- Family case: around 25 percent of cases are of this type. It occurs in first-, second- and third-degree relatives. Accordingly, other relatives are at risk.

Polyps are wart-like appendages that grow on the mucosal surface and are normally benign. Experience shows that malignant tumors can form in the mucosal region from these benign polyps, and that it takes several years to change from the benign to the malignant stage. This provides an opportunity for early detection and for a preventive operation that removes the polyps, which can prevent the patient from getting colon cancer. The symptoms of colon cancer depend on the affected region of the colon. The symptoms of rectal cancer are completely different from those of cancer in the beginning part of the colon, and symptoms in the left part of the colon are more pronounced than in the right part. The tests used to detect colon cancer include the fecal occult blood test (FOBT), sigmoidoscopy, colonoscopy, X-ray barium enema and CT scan, among others. Screening of the colon and rectum is done by colonoscopy and by genetic tests that investigate gene mutations (a gene mutation is a permanent change in the DNA sequence that makes it differ from the ordinary sequence found in most people).

1.3 Leukemia cancer

Leukemia is a cancer of the blood and bone marrow. Bone marrow creates the blood cells, and leukemia occurs when blood cell production is disrupted. This cancer usually affects the white blood cells. It is more likely to occur after 55 years of age, but it is also a common cancer in people less than 15 years old. Leukemia is described using four categories: chronic, acute, lymphocytic and myelogenous. Chronic leukemia progresses more slowly than acute leukemia. Chronic leukemia produces mature and useful cells, whereas in acute leukemia immature and useless cells are produced very quickly and crowd the blood much faster. The type of blood cell affected is also used in classifying leukemia. In lymphocytic leukemia the cancer affects the bone marrow that produces lymphocytes, a type of white blood cell that acts in the immune system. In myelogenous leukemia the changes affect the marrow cells that create red blood cells, other white blood cells and platelets. Combining acute or chronic leukemia with the kind of blood cell affected gives four types of leukemia: ALL (Acute Lymphocytic Leukemia), CLL (Chronic Lymphocytic Leukemia), AML (Acute Myelogenous Leukemia) and CML (Chronic Myelogenous Leukemia) [22].

1.4 Oral cancer

Oral cancer occurs in various parts of the mouth; it can form in the tongue, lips, cheeks, gums, tonsils, salivary glands, and the floor or roof of the mouth. Surgery, radiation therapy, chemotherapy or targeted drug therapy are used to treat it. This cancer has four stages and occurs when a genetic mutation causes cells to grow without control. The exact cause of this cancer is unknown; it has only been observed that some risk factors increase the probability of getting it, the leading ones being the consumption of alcohol and tobacco. As in other cancer types, genetic mutation leads to changes in cell creation and cell growth, which cause the cancer.

1.5 DNA Gene Expression

Progress in DNA gene expression technologies has made the simulation of gene expression levels possible under various experimental conditions. These gene expressions could be used for


performing analogous biological functions under different experimental conditions are expected to have a high similarity measure. In unsupervised learning, similarity and distance measures are used to find the observations that are most similar to each other and clearly distant from the observations that are not similar to them. Using a similarity or distance measure, the observations can therefore be distinguished from each other and assigned to the appropriate cluster, whose members have low distance (high similarity) to one another. Clustering gene expressions in this way helps to find the genes that belong to the same group (cluster) and are expected to perform the same function. Common approaches to cluster analysis of gene expression include non-hierarchical clustering, such as k-means, and hierarchical agglomerative or divisive clustering. These clustering methods are based on a similarity or distance measure. Another approach, based on probability distributions, is model-based hierarchical clustering. In model-based clustering, maximum likelihood estimation or maximum a posteriori estimation is carried out with expectation maximization algorithms. In a Bayesian framework, Bayesian statistical inference can be used for clustering the data by using Bayes' rule and the Markov Chain Monte Carlo algorithm.
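The distance-based agglomerative approach described above can be illustrated with a minimal sketch. It assumes Euclidean distance and average linkage, which are illustrative choices rather than the settings used in this thesis, and the expression values and the names `agglomerative` and `genes` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(profiles, n_clusters):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per profile and repeatedly merges the two
    closest clusters until n_clusters remain. Returns lists of row indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(euclidean(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical expression levels of five genes under three conditions.
genes = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 4.8, 5.0],
    [0.9, 1.0, 1.1],
]
clusters = agglomerative(genes, 2)
print(clusters)
```

Genes with similar profiles end up in the same cluster, which is the sense in which low distance is read as shared function.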

Gene expression can also be treated as a supervised learning problem: when the biological function of the genes is known, or when the persons with altered gene functionality are known, the problem is a supervised one. Classification models can be used for training the supervised learning model. A suitable cross-validated model that achieves the required accuracy can be used to classify the gene expression observations into the various classes. Supervised learning models can also be trained using traditional statistical methods and a Bayesian framework. Among the many existing theoretical models, linear models perform well in some instances and nonlinear models are suitable for other problems. The bias-variance tradeoff needs to be considered when deciding whether a linear or a nonlinear model is required for the classification. The mean squared error (for regression) or the classification error (for classification) needs to be calculated on the testing data. The best degree of nonlinearity, or flexibility, of the model is the level that gives the minimum error on the testing data. More flexible (nonlinear) models have very small bias but high variance, whereas less flexible models have high bias and low variance [8]. Flexible models include K-nearest neighbors, support vector machines and neural networks.

1.6 Feature Selection

Most traditional statistical learning methods for both classification and regression are designed for low-dimensional data, in which the number of observations is much larger than the number of features. In some experiments, especially in the biological sciences, the dataset may contain a large number of features while, owing to the high cost and limited availability of samples, the number of observations is much lower. Data of this kind, with fewer observations than features, are high dimensional. Classical approaches such as the least squares method are not suitable for such datasets. The bias-variance tradeoff becomes an issue, and the risk of an overfitted classification model is high. An overfitted model fits the training data quite well but performs poorly in predicting the test data. Overfitting can arise with high-dimensional data that have many features, and it occurs when the model has more flexibility, as in nonparametric and nonlinear models. To avoid overfitting, parameters are selected to constrain the details learnt by the model. Noise (outliers) in the dataset and redundant features (features that are not statistically significant in distinguishing between the classes of observations) need to be dealt with to obtain a robust model. Outliers can be problematic in both model-based and distance-based models, because existence of an outlier,


are actually located in the same group. In the model-based case, the existence of outliers might also affect the feature weights of the classes and give inaccurate feature weights. Hence, it is essential to use a feature selection method to keep the effective features in the model, especially for epigenetic data, which have a large number of DNA features. In this research the following feature selection methods were used, based on the percentage of total features involved.

1- Symmetry of Methylation Density

2- Fast Fourier Transform (FFT)

3- Non-negative Matrix Factorization (NMF)

4- Semi-orthogonal Non-negative Matrix Factorization (SONMF)

5- Principal Component Analysis (PCA)

For each of the above methods, the ordered features were selected in proportions from 5% to 100% and used in the classification analysis.

1.7 Feature extraction

Feature extraction is used to obtain useful information from data in signal and image processing, in cryptography and also in biology. The “Symmetry of Methylation Density” and “Fast Fourier Transform (FFT)” feature extraction methods were used in [8] as hybridized feature extraction techniques to detect cancer from DNA methylation data. To increase the speed and accuracy of the machine learning algorithms, the authors used a methodology that keeps fewer than 10 features in the model, so that the model no longer suffers from high dimensionality. They used two stages. First, F-scores are calculated to rank the features, and the highest-ranked features are selected using a percentage of the features. The two feature extraction methods, “Fast Fourier Transform” and “Symmetry of Methylation Density”, are then applied to the selected percentage of features.
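The first stage, ranking by F-score and keeping a percentage of the features, might be sketched as follows. The sketch assumes the commonly used two-class F-score (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances); the exact definition used in this thesis is given in Section 4.1.1 and may differ. The data and the names `f_score`, `rank_features` and `top_fraction` are hypothetical.

```python
import math
from statistics import mean, variance

def f_score(pos, neg):
    """Two-class F-score of one feature (assumed definition, see lead-in)."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within if within > 0 else 0.0

def rank_features(rows, labels):
    """Feature indices ordered from highest to lowest F-score."""
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(f_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def top_fraction(ranking, fraction):
    """Keep the given fraction of the ranked features (at least one)."""
    keep = max(1, math.ceil(len(ranking) * fraction))
    return ranking[:keep]

# Hypothetical data: 6 samples (3 cancer = 1, 3 normal = 0) and 4 features.
rows = [
    [5.0, 2.0, 3.0, 0.0],
    [5.2, 3.0, 4.0, 1.0],
    [4.8, 2.5, 5.0, 2.0],
    [1.0, 2.4, 2.0, 4.0],
    [1.2, 2.6, 3.0, 5.0],
    [0.8, 2.5, 4.0, 6.0],
]
labels = [1, 1, 1, 0, 0, 0]

ranking = rank_features(rows, labels)
selected = top_fraction(ranking, 0.5)
print(ranking, selected)
```

The selected columns would then be passed to the second-stage extraction methods.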


Two other feature extraction methods used are Semi-orthogonal Non-negative Matrix Factorization (SONMF) and Non-negative Matrix Factorization (NMF) methods proposed in [9]. These proposed methods are available as an R package called MatrixFact.

The last feature extraction method used in this research is Principal Component Analysis (PCA). It is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The first principal component has the highest variance.
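To make the idea concrete, the sketch below finds the first principal component of a small, hypothetical two-variable dataset by power iteration on the sample covariance matrix. Power iteration is used here only to keep the example dependency-free; it is not necessarily how PCA was computed in this research, and it assumes the starting vector is not orthogonal to the leading eigenvector.

```python
import math

def center(rows):
    """Subtract the column means so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cv))  # Rayleigh quotient, |v| = 1
    return v, eigenvalue

# Hypothetical observations of two strongly correlated variables.
data = [[1.0, 1.2], [2.0, 1.8], [3.0, 3.1], [4.0, 3.9]]
centered = center(data)
cov = covariance(centered)
direction, component_variance = first_component(cov)

# Share of the total variance carried by the first principal component.
explained = component_variance / sum(cov[i][i] for i in range(len(cov)))
# Coordinates of each observation along the first principal component.
scores = [sum(a * b for a, b in zip(row, direction)) for row in centered]
print(direction, explained)
```

Because the two variables are nearly collinear, almost all of the variance falls on the first component, which is why a few components can stand in for many correlated features.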

The above mentioned five feature selection and feature extraction methods were used in this research and the results were compared by using various percentage of total features.

1.8 Classification

In this research, various classification methods were used, including methods based on distance measures and probabilistic models based on the probability density function and maximization of the likelihood function. Finding an appropriate and consistent predictive model for the early diagnosis of cancer requires choosing a proper feature selection and extraction method before the classification model is applied. By comparing the feature selection and feature extraction methods across various classifiers, taking into account both the bias and the variance of the model, the most suitable method can then be found.

In our research, five different classifiers have been used: Naïve Bayes, Random Forest, K-nearest Neighbor, Support Vector Machine (SVM) and Neural Networks.


1.9 Objectives of the study and outline of the thesis

The main objective of the study is to compare Semi-orthogonal Non-negative Matrix Factorization as a feature selection method with F-score, Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) for predicting cancer using DNA methylation data from four cancer datasets.

Feature selection was implemented by selecting different percentages of the total features in the datasets. The percentages used were 5%, 15%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% and raw data (100%).

Classification analysis was then carried out on the selected features to predict cancer in the four datasets. The classification techniques used are Naïve Bayes, Random Forest, Support Vector Machine (SVM), K-nearest Neighbor (KNN) and Neural Network. The data were split into training and testing sets using four split ratios, 60:40, 70:30, 80:20 and 90:10, and 10-fold cross validation was used in evaluating the accuracy of the model on the test data.
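One possible reading of this protocol is sketched below: the samples are first split into training and testing parts at each ratio, and 10-fold cross validation is then run inside the training part. How the split ratios and the cross validation were actually combined in this research is an assumption here, and the sample count of 60, the seed and the function names are hypothetical.

```python
import random

def train_test_split(indices, train_fraction, seed=0):
    """Shuffle the sample indices and cut them at the given training fraction."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_folds(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

samples = list(range(60))  # hypothetical dataset of 60 samples
sizes = {}
for fraction in (0.6, 0.7, 0.8, 0.9):
    train, test = train_test_split(samples, fraction)
    sizes[fraction] = (len(train), len(test))
    for fit_idx, val_idx in k_folds(train, k=10):
        # A classifier would be fitted on fit_idx and scored on val_idx here;
        # the held-out test part is used only for the final accuracy.
        pass
print(sizes)
```

Each sample of the training part appears in exactly one validation fold, so every training observation is used for validation once.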

The thesis is organized as follows. Chapter 1 presents the introduction. Chapter 2 reviews previous work in this field. Chapter 3 presents the four datasets and the data analysis. Chapter 4 presents the feature selection methods. Chapter 5 presents the classification methods used in this research. Chapter 6 presents the results and discussion, and Chapter 7 presents the conclusions and suggestions for future work.


Chapter 2 Literature Survey

Cancers of various types are among the deadliest diseases in the world. As the significant role of aberrant DNA methylation in cancer development is undeniable, many researchers have worked on different strategies to predict cancer type from gene expression data. The majority of the approaches use machine learning algorithms to solve this problem, usually combining feature engineering with a classification methodology.

Baur and Bozdag developed a feature selection algorithm based on sequential forward selection that utilized different classification methods to compute gene centric DNA methylation using probe level DNA methylation data [10]. Their proposed feature selection algorithm was compared with other feature selection algorithms such as support vector machines with recursive feature elimination (SVM-RFE), genetic algorithms and ReliefF. Different methods were evaluated based on the predictive power of the selected probes on their mRNA expression levels. Their results showed that their sequential forward selection algorithm performed best on all metrics when using K-nearest Neighbors (KNN) with k=1 (1NN). The dataset of “Agilent microarray data and Illumina 450k DNA methylation data of 25 breast cancer lines” was used in this research.
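A generic sequential forward selection loop wrapped around a 1-nearest-neighbour classifier is sketched below. It is a simplified illustration of the general technique and not the algorithm of [10]: the selection criterion (leave-one-out classification accuracy), the stopping rule, the toy data and all function names are assumptions made for this example.

```python
def loo_accuracy_1nn(rows, labels, features):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `features`."""
    correct = 0
    for i, x in enumerate(rows):
        best_j, best_d = None, None
        for j, z in enumerate(rows):
            if j == i:
                continue  # the held-out sample cannot be its own neighbour
            d = sum((x[f] - z[f]) ** 2 for f in features)
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, max_features):
    """Greedily add the feature that most improves the criterion."""
    selected, best_score = [], 0.0
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_features:
        scored = [(loo_accuracy_1nn(rows, labels, selected + [f]), f)
                  for f in remaining]
        # Highest score wins; ties go to the lower feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the criterion
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical data: only feature 1 separates the two classes.
rows = [
    [0.5, 1.0, 0.2],
    [0.1, 1.2, 0.9],
    [0.9, 0.8, 0.4],
    [0.4, 3.0, 0.3],
    [0.2, 3.2, 0.8],
    [0.8, 2.9, 0.5],
]
labels = [0, 0, 0, 1, 1, 1]

selected, score = forward_selection(rows, labels, max_features=3)
print(selected, score)
```

The loop stops as soon as adding a feature no longer raises the criterion, which keeps the selected subset small.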

In another study, biomarkers are chosen to approximate the presence of lymph node metastasis in stomach cancer. While considering DNA methylation information [11], the data related to stomach cancer is divided into three groups: normal, lymph node positive and negative which is known as N0, N1 and N2 corresponding to lymph node metastasis positive. The datasets used in this project are clinical data and TCGA (The Cancer Genome Atlas) level 3 DNA methylation data with 27 normal, 94 lymph node negative, 189 cancer lymph nodes positive and 12 unclassified samples. The proposed feature


is combined with the “minimum redundancy maximum relevance” (mRMR) feature selection method to find the preliminary features. The objective of the differential methylation analysis is to choose the probes that are highly correlated with the phenotype in order to determine the most important probes. Second, a genetic algorithm is used to select the best features for classification, wherein mRMR is first used to select the 10% highest-scoring probes. Third, the selected features are passed to the genetic algorithm, which selects the top 12 probes as classifier input. The selected probes relate to 14 genes, which include the lymph node metastasis-relevant genes HOXD1, NMT1 and SEMA3E. Finally, a random forest classifier with the three-step feature selection is compared with a random forest classifier with differentially methylated probes involving one-step feature selection. ROC is computed as the metric to compare the performance of the three-step feature selection process.

A study on the diagnosis of colorectal cancer at an early stage used a feature selection technique called “Optimal Mean based Block Robust Feature Extraction (OMBRFE)” to select features in colorectal cancer during medical treatment [12]. This method is evaluated on a colorectal cancer dataset with 197 samples and 5188 genomic features. The suggested feature selection method is based on an optimal mean method using SVD, which is then used in the “Optimal Mean based Robust Feature Extraction” (OMRFE) method. Considering the various groups of genetic aberrations in the Cancer Genome Atlas, the classification accuracy of OMRFE could be affected by the multiple feature groups. Therefore, OMRFE in which multiple regularization parameters are used for colorectal cancer has been proposed. The regularization parameters are determined on synthetic data and verified on real data. Compared with other methods such as Penalized Matrix Decomposition (PMD), Sparse Principal Component Analysis (SPCA), Robust Principal Component Analysis (RPCA) and Optimal Mean Robust Principal Component Analysis (CRPCA-OM), the method appears to extract more relevant features.

A study of different feature extraction techniques for cancer classification from gene expression datasets is presented in [13]. It states that although gene expression datasets are a great diagnostic tool, they are not directly suited to classification because they usually contain far fewer samples than dimensions. Feature selection methods are necessary to reduce the dimension of these datasets and make them suitable for the classification task. Different feature selection methods have advantages and disadvantages, and the objective is to find the best algorithm for the dataset at hand.

Nie et al. [20] studied the diagnosis and prediction of lung cancer using various classification techniques such as logistic regression analysis, decision trees and artificial neural network models, and concluded that models with decision trees and artificial neural networks were more suitable for the diagnosis of lung cancer. Shicheng et al. [19] recommended DNA methylation biomarkers as a high-accuracy model after performing logistic regression on methylation data for five genes, adjusted for age, gender and smoking status. Desheng et al. [21] compared Linear Discriminant Analysis (LDA) methods for classifying cancer based on gene expression and concluded that the classification performance of the modified LDA methods was far superior to that of the traditional LDA method.

Ren et al. [14] introduced an extraction method to overcome the curse of dimensionality in information on gene expression. In this method, a set of new features are created by iterating Pearson’s correlation coefficients. It is claimed that by using this method, hidden structures in samples will be highlighted and irrelevant genomic features in the dataset are removed. This approach is verified in prostate cancer, leukemia and psoriasis datasets. For a leukemia dataset (contains 3571 probes after preprocessing), the sample similarity matrix using all features shows a low probability for two groups, AML and ALL. But it strongly suggests two different groups by IPCC (iterative Pearson correlation coefficient) features.

Running k-means on the features extracted by IPCC reached accuracy of greater than 90% after 1000 runs. After evaluating this method on a two-class problem, it is verified on the multiclass case in the prostate cancer dataset. The dataset contains 22 normal samples, 20 metastatic prostate cancer samples and 32 localized prostate cancer samples. K-means reached accuracy of greater than 90% with 1000 runs.


Raweh et al. presented another study in this area. They introduced a novel algorithm for feature extraction in DNA methylation datasets to overcome the high dimensionality and high noise of gene expression datasets [8]. They combined feature selection and feature extraction to obtain the most informative features. The F-score method is used to rank the features. At this stage, 10% of the promoter features and 1% of the probes are kept for the next step. The feature extraction method combines the kernel density method, peaks of mean methylation and symmetry of methylation density to extract enhanced and informative features. The dataset used in this project is a large collection of cancer methylomes from the public TCGA project. It contains 14 types of cancer, such as breast, lung, kidney, thyroid and uterine cancers. For classification, several well-known classifiers are used, including Naïve Bayes, Support Vector Machine (SVM) and Random Forest. Among these classification methods, SVM outperforms the others with an accuracy of 98.16% in breast cancer, 99.66% in colon cancer, 98.82% in brain cancer, 100% in kidney cancer, 98.8% in lung cancer, 97.96% in thyroid cancer and 99.77% in uterine cancer.

Danaee et al. [15] used a deep learning method for cancer detection in gene expression data. Their proposed method first passes the gene expression data to a Stacked Denoising Autoencoder (SDAE) to extract informative features from the high-dimensional gene expression data. Next, they evaluate the extracted features using supervised classification methods. Then, by analyzing the connectivity matrices of the SDAE, the most important genes for predicting breast cancer are extracted. The dataset used in this project is from TCGA for breast cancer and contains 1097 breast cancer samples and 113 healthy samples. The method is compared with two other feature extraction methods, Principal Component Analysis (PCA) and differentially expressed genes. These algorithms are evaluated with multiple classifiers such as SVM, Artificial Neural Networks (ANN) and support vector machine with radial basis function kernel (SVM-RBF). The results show that SDAE features with the SVM-RBF classifier outperform the other algorithms in accuracy (98.26%), and SDAE features with the ANN classifier give the best sensitivity (98.73%).


Another related study has been done by Alshamlan et al. where a hybrid feature selection scheme is used to find essential features of cancer from the gene expression data [16]. The method combines the Genetic Algorithm and the Artificial Bee Colony (ABC) algorithms. Three binary datasets and three multiclass datasets are used to evaluate the proposed method. The binary datasets are for colon, leukemia and lung cancers. Multiclass datasets are for small round blue cell tumor (SRBCT), lymphoma and leukemia. The feature extraction algorithm is finally compared to other hybrid algorithms such as (mRMR, ABC), (mRMR, GA) and (mRMR, PSO) where PSO is the Particle Swarm Optimization algorithm. The results show that the Genetic Bee Colony (GBC) algorithm performs well on different datasets. By using SVM as a classifier, GBC leads to 95.64% accuracy in the colon dataset with 20 genes, 96.43% accuracy in leukemia with 5 genes, 99.5% in the lung cancer dataset with 8 genes, 96.38% in SRBCT data with 6 genes, 98.48% in the lymphoma dataset and 95.83% in the multiclass leukemia dataset.

Particle Swarm Optimization (PSO) is another feature selection method, used by Kar et al. [17]. In this project, PSO combined with an adaptive k-nearest neighbor (KNN) method is used to select the most relevant gene expression features for detecting cancer. The parameter k in KNN is chosen by 3-fold cross validation on the dataset. Three datasets are used to assess the suggested method: small round blue cell tumor (SRBCT), acute lymphoblastic leukemia and acute myeloid leukemia (ALL-AML), and mixed-lineage leukemia (MLL) data. The results show that the proposed method reaches 95.86% accuracy on test data for the ALL-AML dataset, 92.54% for the MLL dataset and 98.01% for the SRBCT dataset.

Another older study from Chen et al. used a decision tree method for feature selection in gene expression [18]. In this research, a novel method is introduced which combines Particle Swarm Optimization (PSO) with decision trees to select the best feature subset from thousands of genes. The performance of this method is finally evaluated with famous classifiers such as SVM, self-organization map (SOM), back


leukemia, lung cancer, SRBCT, prostate tumor and DLBCL. The proposed method reached 100% accuracy for leukemia and lung cancer datasets, 94.31% for prostate cancer and 92.94% for SRBCT.

In our research, Semi-orthogonal Non-Negative Factorization (SONMF) was studied and tested on four microarray cancer datasets for feature extraction and compared with FFT features, Symmetry of Methylation Density Features, Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF). Five different classifiers, namely Naïve Bayes, Support Vector Machine (SVM), K-nearest Neighbour (KNN), Random Forest and Neural Network, were used to predict the gene expression of the four cancer microarray datasets. After applying the feature extraction methods above, the classification algorithms are trained on the training data. The error rate is analysed on the test data, and the model with the lowest test error rate is considered the most appropriate model.


Chapter 3 Data Processing

3.1. Datasets

The R programming language was used for the preprocessing of the datasets, feature selection, feature extraction and classification of the cancer data. Four different DNA-methylation datasets were used in this study.

Five classification methods were implemented on the datasets to classify the DNA gene expression that was methylated relative to the normal stem cells. Before the classification algorithms were applied, each of the four datasets was standardized so that every gene expression parameter has zero mean and a standard deviation of one. This is done because high variance in one parameter can have a large and unwanted impact on the classification results. Four types of cancer data were used in this study: colon, leukemia, prostate and oral cancer. The datasets were analyzed and tested with various classification algorithms to compare the performance of each algorithm.
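The standardization step can be illustrated with a minimal sketch. The thesis work was done in R; the Python below is only an illustration, and the function name `standardize_columns`, the toy matrix and the choice of the sample (n - 1) standard deviation are assumptions rather than details given in the text.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Scale each column (gene) to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        m, s = mean(col), stdev(col)      # stdev uses the n - 1 denominator (assumption)
        if s == 0:
            s = 1.0                       # constant column: centre it, do not divide by zero
        scaled_columns.append([(v - m) / s for v in col])
    # Transpose back so that rows are patients again.
    return [list(r) for r in zip(*scaled_columns)]

# Hypothetical toy matrix: 4 patients (rows) x 2 genes (columns) on very different scales.
X = [[7476.0, 0.17],
     [6179.0, 0.21],
     [5253.0, 0.12],
     [2966.0, 0.14]]
Z = standardize_columns(X)
```

After scaling, both columns contribute on the same scale, so a distance-based classifier such as KNN is no longer dominated by the gene with the larger raw values.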

3.2. Data Processing

3.2.1. Colon Cancer Data

The colon cancer dataset, collected by Alon et al. (1999) [24], includes 2000 genes measured for 62 patients. Of these 62 patients, 40 were diagnosed with colon cancer and 22 were found to be healthy. Gene expression was monitored with oligonucleotide arrays, which give a broad view of the status of the cells.

From these 2000 genes, the five genes with the widest interquartile range were compared between the two groups of cancer and non-cancer patients. The descriptive statistics and the results of a one-way analysis of variance are shown in Table 3.1. A P-value of less than 5% indicates a significant difference in gene expression between cancer and non-cancer patients.

Table 3.1: Descriptive statistics for the five gene expressions with the widest interquartile range (colon data)

| Gene | Interquartile range | Mean ± SEM, Cancer (n = 40, 64.5%) | Mean ± SEM, Non-Cancer (n = 22, 35.5%) | F-statistic | P-value |
|---|---|---|---|---|---|
| Var-1 | 3834 | 7476 ± 507 | 6179 ± 586 | 2.56 | 0.115 |
| Var-26 | 3467 | 5253 ± 404 | 2966 ± 322 | 14.70 | 0.000* |
| Var-878 | 3344 | 3870 ± 736 | 2090 ± 498 | 2.81 | 0.098 |
| Var-6 | 3232 | 5010 ± 397 | 4152 ± 455 | 1.84 | 0.180 |
| Var-9 | 3199 | 5117 ± 399 | 3885 ± 432 | 3.86 | 0.054 |

SEM: standard error of the mean.

* The two groups are significantly different from each other at the 1% significance level.

It can be seen in Table 3.1 that for gene 26, which has the second largest interquartile range, the measured values for patients with cancer (5253 ± 404) are significantly higher than the gene expression measured for non-cancer patients (2966 ± 322) (F = 14.70, P = 0.000). For the other four genes there appear to be differences for some, but they are not statistically significant.
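As a rough check of how the F-statistic relates to the reported group sizes, means and standard errors, the sketch below recomputes a two-group one-way ANOVA from summary statistics alone. It is written in Python for illustration (the thesis used R), and the function name `anova_f_two_groups` is hypothetical. It assumes that each SEM equals the group's sample standard deviation divided by the square root of its size.

```python
def anova_f_two_groups(n1, mean1, sem1, n2, mean2, sem2):
    """One-way ANOVA F statistic for two groups, from group size, mean and SEM."""
    # Recover each group's sample variance from SEM = sd / sqrt(n).
    var1 = n1 * sem1 ** 2
    var2 = n2 * sem2 ** 2
    # Between-group sum of squares; with two groups it has 1 degree of freedom.
    ss_between = n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2
    # Within-group mean square (pooled variance, n1 + n2 - 2 degrees of freedom).
    ms_within = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return ss_between / ms_within

# Var-26 in Table 3.1: cancer (n = 40) 5253 ± 404, non-cancer (n = 22) 2966 ± 322.
f_var26 = anova_f_two_groups(40, 5253, 404, 22, 2966, 322)
print(round(f_var26, 2))
```

The result should land close to the reported F = 14.70; small differences are expected because the table values are rounded.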

3.2.2. Leukemia Cancer Data

The leukemia dataset includes 2000 gene expressions from 47 patients. Of these patients, only 9 had leukemia while the remaining 38 were normal. The dataset is unbalanced, since 81% of the patients are normal and 19% have cancer. Before classification is applied to this dataset, the data have to be balanced, as otherwise the classifier will be biased toward predicting an unknown sample to be normal. The minority class was oversampled using the SMOTE algorithm.

The SMOTE implementation from an R package on CRAN was used, and its oversampling function was applied to rectify the imbalance. After balancing, the dataset has 69 observations, of which 38 are healthy and 31 have cancer. The proportion is now 55% normal and 45% cancer patients.
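The idea behind SMOTE can be sketched as follows: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified Python illustration, not the R package function used in the thesis. The function name `smote_oversample`, the neighbour count k = 5 and the random stand-in data are all assumptions.

```python
import math
import random

def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating towards near neighbours."""
    rng = random.Random(seed)
    n = len(minority)
    k = min(k, n - 1)
    # For every minority sample, the indices of its k nearest minority neighbours.
    neighbours = []
    for i, a in enumerate(minority):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(a, minority[j]))
        neighbours.append(others[:k])
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)              # pick a minority sample
        j = rng.choice(neighbours[i])     # pick one of its nearest neighbours
        gap = rng.random()                # position on the segment between them
        synthetic.append([a + gap * (b - a)
                          for a, b in zip(minority[i], minority[j])])
    return synthetic

# Hypothetical stand-in for the 9 cancer samples (random values, 20 genes each).
data_rng = random.Random(1)
cancer = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(9)]
new_samples = smote_oversample(cancer, n_new=22)   # 9 + 22 = 31 cancer samples
balanced_cancer = cancer + new_samples
```

Generating 22 synthetic samples matches the counts reported above (9 original cancer samples growing to 31). Because every synthetic value is an interpolation, it stays within the range of the observed minority values for that gene.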


Descriptive statistics for the five gene expressions with the widest interquartile range are reported for the original dataset before balancing. The one-way ANOVA tests for the difference between the non-cancer and cancer groups are presented in Table 3.2.

Table 3.2: Descriptive statistics for the five gene expressions with the widest interquartile range (leukemia data)

| Gene | Interquartile range | Mean ± SEM, Cancer (n = 9, 19%) | Mean ± SEM, Non-Cancer (n = 38, 81%) | F-statistic | P-value |
|---|---|---|---|---|---|
| Var-1 | 11742 | 21441 ± 2938 | 24502 ± 1737 | 0.63 | 0.431 |
| Var-925 | 9562 | 24597 ± 2075 | 26515 ± 1162 | 0.54 | 0.464 |
| Var-11 | 8892 | 21130 ± 1942 | 12917 ± 800 | 18.85 | 0.000* |
| Var-767 | 6972 | 18555 ± 1595 | 17485 ± 759 | 0.37 | 0.542 |
| Var-14 | 6107 | 20263 ± 1584 | 13038 ± 653 | 21.89 | 0.000* |

SEM: standard error of the mean.

* The two groups are significantly different from each other at the 1% significance level.

It can be seen in Table 3.2 that, of the five parameters with the widest interquartile range, the 3rd and 5th (Var-11 and Var-14) have statistically higher measured values in patients with cancer than in non-cancer patients. The mean values are 21130 and 20263 in patients with cancer, compared with 12917 and 13038 in patients without cancer, respectively. The P-value for both parameters is less than 1%, which indicates a significant difference (F = 18.85, P = 0.000 and F = 21.89, P = 0.000). For the other three genes, the measured gene expression does not differ significantly between the non-cancer and cancer groups.

3.2.3. Prostate Cancer Data

The prostate dataset includes gene expression for 102 patients, with 1551 gene expressions measured for each patient. In this data, 50 of the 102 patients are normal while 52 have prostate cancer. Descriptive statistics for the five of these 1551 gene expressions with the widest interquartile range are reported in Table 3.3.


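Table 3.3: Descriptive statistics for the five gene expressions with the widest interquartile range (prostate data)

| Gene | Interquartile range | Mean ± SEM, Cancer (n = 52, 51%) | Mean ± SEM, Non-Cancer (n = 50, 49%) | F-statistic | P-value |
|---|---|---|---|---|---|
| Var-8 | 1496 | 3031 ± 164 | 3150 ± 328 | 0.11 | 0.742 |
| Var-472 | 966 | 2094 ± 109 | 1637 ± 123 | 7.72 | 0.006* |
| Var-373 | 926 | 2278 ± 108 | 1777 ± 130 | 8.86 | 0.003* |
| Var-80 | 729 | 733 ± 55 | 413 ± 47 | 19.61 | 0.000* |
| Var-27 | 626 | 992 ± 56 | 1065 ± 88 | 0.49 | 0.486 |

SEM: standard error of the mean.

* The two groups are significantly different from each other at the 1% significance level.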

It can be seen in Table 3.3 that three of the five genes with the widest interquartile range have a statistically significant difference in mean measured gene expression. Var-472, Var-373 and Var-80 have averages of 2094, 2278 and 733 in the cancer group, compared with 1637, 1777 and 413 in the non-cancer group, respectively. One-way analysis of variance shows a significant difference between the mean values of the cancer and non-cancer groups for these three gene expressions: (F = 7.72, df = 1, P-value = 0.006), (F = 8.86, df = 1, P-value = 0.003) and (F = 19.61, df = 1, P-value = 0.000), respectively. The P-values for these three genes are less than 1%, which shows that the differences are statistically significant at the 1% significance level. For all three genes the mean ± standard error of the mean is higher in the cancer group, which shows that their measured values are higher in the cancer group than in the normal group.

3.2.4. Oral Cancer Data

The oral cancer data includes 21,384 patients with only 180 gene expression variables. Of these patients, 14,256 had oral cancer and 7,128 did not. This dataset has the fewest features of the datasets used in this study. The interquartile ranges for this data are small, less than 1, whereas the other three datasets have large measured gene expression values. Of these 180 genes, the five with the widest interquartile range are reported in Table 3.4.

Table 3.4: Descriptive statistics for the five gene expressions with the widest interquartile range (oral data)

| Gene | Interquartile range | Mean ± SEM, Cancer (n = 14256, 67%) | Mean ± SEM, Non-Cancer (n = 7128, 33%) | F-statistic | P-value |
|---|---|---|---|---|---|
| Var-31 | 0.32 | 0.1668 ± 0.0011 | 0.2082 ± 0.0014 | 504.3 | 0.000* |
| Var-180 | 0.12 | 0.1150 ± 0.0007 | 0.1436 ± 0.0010 | 510.8 | 0.000* |
| Var-170 | 0.12 | 0.1031 ± 0.0007 | 0.1319 ± 0.0009 | 651.8 | 0.000* |
| Var-171 | 0.11 | 0.1349 ± 0.0009 | 0.1742 ± 0.0013 | 572.7 | 0.000* |
| Var-111 | 0.09 | 0.0918 ± 0.0007 | 0.0976 ± 0.0011 | 20.54 | 0.000* |

SEM: standard error of the mean.

* The two groups are significantly different from each other at the 1% significance level.

It can be seen in Table 3.4 that although the measured values are lower than 1 for this data, all five genes with the widest interquartile range have a statistically different mean in the cancer group compared with the non-cancer group. The mean ± standard error of the mean for all five parameters reported in Table 3.4 shows that the measured gene expressions in the cancer group are significantly lower than those in the non-cancer group.

3.3. Methodology

The flow chart in Figure 3.1 shows the steps of the analysis. Four datasets are used in this study: colon, leukemia, prostate and oral cancer data. Feature selection and extraction are performed on each of these four datasets using five different algorithms: FFT features, symmetry of methylation density features, non-negative matrix factorization (NMF), semi-orthogonal non-negative matrix factorization (SONMF) and PCA. For the “FFT features” and “symmetry of methylation density features” methods, feature selection is done by first ranking the features by F-score and then taking different percentages of the features: 5%, 15%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95% and 100%.


Next, each of these four datasets is analyzed with the five classification methods using 10-fold cross validation. The data are split into training and testing sets with ratios of 60:40, 70:30, 80:20 and 90:10. After the feature selection methods are applied, the classifiers are applied with cross validation on the datasets. The classifiers used in this study are Naïve Bayes, SVM, KNN, Random Forest and Neural Network. For each of these five classification methods, 13 feature selection percentages were taken for each of the four training to testing ratios. With the five feature extraction methods, this gives 5 × 5 × 13 × 4 = 1300 classification analyses for each of the four datasets. The steps are shown in the flow chart in Figure 3.1.
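One way to arrive at 1300 analyses per dataset is to enumerate every combination of feature method, classifier, feature percentage and split ratio. The Python sketch below does only that counting. Treating the five feature extraction methods as a factor in the count is an assumption about how the total was obtained, and the labels are illustrative.

```python
from itertools import product

feature_methods = ["FFT", "Symmetry", "NMF", "SONMF", "PCA"]
classifiers = ["Naive Bayes", "SVM", "KNN", "Random Forest", "Neural Network"]
percentages = [5, 15, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
split_ratios = [(60, 40), (70, 30), (80, 20), (90, 10)]

# Every (feature method, classifier, percentage, split) combination for one dataset.
runs = list(product(feature_methods, classifiers, percentages, split_ratios))
print(len(runs))  # 5 * 5 * 13 * 4 = 1300
```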


Figure 3.1: Flow chart for data analysis

[Flow chart. Inputs: colon, leukemia, prostate and oral cancer data. Path 1: F-score feature selection with percentage m% (m = 5, 15, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100), followed by feature extraction with FFT features or symmetry of methylation density. Path 2: feature extraction with NMF, SONMF or PCA, where k = m% of the features or principal components and k is the second dimension of F in the factorization. Both paths lead to the classification methods: Naïve Bayes, SVM, KNN, RF and Neural Network.]


Chapter 4 Feature Selection and Feature Extraction Methods

Feature selection and feature extraction methods are used to keep the most effective parameters for distinguishing between cancer and non-cancer patients in the predictive model. Removing redundant parameters is important because using them in the model may cause undesirable and misleading classification results. In methods based on the similarity or distance between observations, redundant parameters can distort the distances and lead to incorrect classification results. Moreover, in methods based on minimizing the residual error or maximizing the likelihood of the model, redundant parameters can mask the effect of the important parameters and lead to an incorrect classification model. In this study, feature selection and extraction are used to reduce the dimension of the data, since most DNA methylation data are high dimensional and include many gene expression parameters.

For this reason, the F-score method, a feature selection method, was used. The Fast Fourier Transform (FFT) feature extraction, symmetry percentage, non-negative matrix factorization, semi-orthogonal non-negative matrix factorization and principal component analysis are used to extract suitable features for classification.

4.1. Feature Selection

Feature selection in this study is done based on a fixed number or percentage of the features in the extracted or ordered features. To order the features, the F-score method is used [8].


4.1.1. F-Score

This method is based on a simple formula that calculates the F-score of each parameter. The larger the F-score of a parameter, the more likely it is to be discriminative in the classification. The F-score is calculated according to the formula below:

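$$F(i) = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$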

In the formula above, $\bar{x}_i^{(+)}$ is the average of the $i$-th feature in the positive class, $\bar{x}_i^{(-)}$ is its average in the negative class, $\bar{x}_i$ is the average of the $i$-th feature over the total data, $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the $i$-th feature of the $k$-th positive and negative instance, $n_+$ is the number of positive instances and $n_-$ is the number of negative instances. The F-score is calculated for all of the features in the four datasets and sorted in decreasing order. Table 4.1 shows the 10 features with the highest F-score for the four datasets.

Table 4.1: F-score for the 10 features with the highest F-score

| Rank | Colon feature | Colon F-score | Leukemia F-score | Prostate F-score | Oral feature | Oral F-score |
|---|---|---|---|---|---|---|
| 1 | Var249 | 1.1 | 5.5 | 1 | Var30 | 0.083 |
| 2 | Var245 | 0.87 | 3.3 | 0.94 | Var40 | 0.083 |
| 3 | Var1423 | 0.87 | 3.2 | 0.81 | Var172 | 0.082 |
| 4 | Var493 | 0.86 | 2.8 | 0.77 | Var167 | 0.081 |
| 5 | Var765 | 0.85 | 2.6 | 0.64 | Var163 | 0.070 |
| 6 | Var1772 | 0.85 | 2.5 | 0.62 | Var170 | 0.070 |
| 7 | Var267 | 0.84 | 2.5 | 0.61 | Var166 | 0.069 |
| 8 | Var377 | 0.75 | 2.3 | 0.61 | Var164 | 0.062 |
| 9 | Var1582 | 0.75 | 2 | 0.61 | Var171 | 0.060 |
| 10 | Var1771 | 0.73 | 1.8 | 0.6 | Var162 | 0.057 |
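The F-score ranking can be sketched in a few lines. The Python below is an illustration only (the thesis used R). The function name `f_scores` and the toy data are hypothetical, and the sketch assumes each class has at least two samples and non-zero variance in every feature.

```python
from statistics import mean, variance

def f_scores(rows, labels):
    """F-score of every feature; labels are 1 (positive) or 0 (negative)."""
    scores = []
    for col in zip(*rows):
        pos = [v for v, y in zip(col, labels) if y == 1]
        neg = [v for v, y in zip(col, labels) if y == 0]
        overall = mean(col)
        numerator = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
        # variance() divides by n - 1, matching the two terms of the denominator.
        denominator = variance(pos) + variance(neg)
        scores.append(numerator / denominator)
    return scores

# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
X = [[5.0, 1.0], [6.0, 2.0], [7.0, 3.0],
     [1.0, 1.5], [2.0, 2.5], [3.0, 2.0]]
y = [1, 1, 1, 0, 0, 0]
scores = f_scores(X, y)
# Feature indices ordered from highest to lowest F-score.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In this toy case the class means of feature 0 are far apart relative to the within-class spread, so it is ranked first, while feature 1 has identical class means and scores zero.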


After ordering the important features, fixed proportions of the features will be selected for the classification. In various steps 5, 15, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95 and 100 percent of total features are used in the classification analysis.
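Selecting a fixed proportion of the ranked features can be sketched as below. This is an illustrative Python fragment. The function name `select_top_percent` is hypothetical, and rounding the number of kept features up (so that at least one feature survives at 5%) is an assumption, since the text does not say how fractional counts are handled.

```python
import math

def select_top_percent(scores, percent):
    """Return the indices of the top `percent` % of features, best first."""
    # Rounding up is an assumption; it guarantees at least one feature is kept.
    n_keep = max(1, math.ceil(len(scores) * percent / 100))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n_keep]

# Toy input: the ten colon F-scores listed in Table 4.1.
colon_scores = [1.1, 0.87, 0.87, 0.86, 0.85, 0.85, 0.84, 0.75, 0.75, 0.73]
subsets = {m: select_top_percent(colon_scores, m)
           for m in [5, 15, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]}
```

Each entry of `subsets` is the list of feature indices that would be passed to the classifiers for that percentage.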

4.2. Feature Extraction

Feature extraction is another step performed before classification. To extract the most informative features for building the predictive model for each classification algorithm, the following feature extraction methods are used in this study: FFT features, symmetry of methylation density features, Non-negative Matrix Factorization (NMF), Semi-orthogonal Non-negative Matrix Factorization (SONMF) and Principal Component Analysis (PCA).

4.2.1. FFT Features

In this feature extraction method [8], the methylation density in both cancer and normal patients is studied. Kernel estimation is used to estimate the population probability density function of the selected features. The kernel density estimate at point $x$ is calculated using the formula below:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

In the above formula, $K$ is the Gaussian kernel function, which integrates to 1 and has zero mean. The argument of the kernel function is $(x - x_i)/h$, where $h$ is a positive smoothing parameter called the bandwidth and $x_i$ is the $i$-th point in the sample, for $i$ from 1 to $n$. The Gaussian kernel function is given below:

$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}u^2}$$


The optimal bandwidth is used for the kernel density estimation which is calculated by [8]:

$$h = 0.9 \cdot \min\left(\hat{\sigma}, \frac{IQR}{1.34}\right) \cdot n^{-1/5}$$

In the above formula, the optimal bandwidth is calculated from the minimum of the sample standard deviation $\hat{\sigma}$ and the interquartile range divided by 1.34. The interquartile range (IQR) is the difference between the 75th percentile and the 25th percentile of the sample, and $n$ is the sample size (512 was used).
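The kernel density estimate and bandwidth rule above can be sketched as follows. This is an illustrative Python version (the thesis used R), with hypothetical function names and a random stand-in sample. Two details are assumptions: the quartile convention of `statistics.quantiles`, and the use of 512 as the number of grid points at which the density is evaluated.

```python
import math
import random
from statistics import quantiles, stdev

def silverman_bandwidth(sample):
    """h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q1, _, q3 = quantiles(sample, n=4)    # quartiles (one of several conventions)
    return 0.9 * min(stdev(sample), (q3 - q1) / 1.34) * len(sample) ** (-1 / 5)

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(sample, x, h):
    """Kernel density estimate of `sample` at the point x with bandwidth h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

# Hypothetical sample standing in for the selected feature values of one patient.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
h = silverman_bandwidth(sample)

# Evaluate the density on 512 equally spaced points covering the sample.
lo, hi = min(sample) - 4 * h, max(sample) + 4 * h
step = (hi - lo) / 511
grid = [lo + i * step for i in range(512)]
density = [kde(sample, x, h) for x in grid]
```

The list `density` plays the role of the methylation density from which peaks and FFT-based features are later taken. Summing it and multiplying by `step` should give a value close to 1, as expected for a probability density.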

The algorithm finds the local maxima of the mean methylation density over all sample types, normal and cancerous. Calling the index of a peak $p$, the feature extraction method takes the methylation density $y$ at index $p$ for the sample $x$ and considers it as a feature:

$$f_1(x) = y(p)$$

If $p_1$ is the set of peak indices of the methylation density of sample $x$, then the number of peaks is another feature to be extracted:

$$f_2(x) = |p_1|$$

Another extracted feature is the difference between the maximum peak index and the minimum peak index in sample $x$:

$$f_3(x) = \max(p_1) - \min(p_1)$$
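The three peak-based features can be sketched as below. This is an illustrative Python fragment with hypothetical function names and toy densities. Defining a peak as a strict local maximum at an interior grid point is an assumption, since the text does not specify how ties or end points are treated.

```python
def peak_indices(y):
    """Indices of strict local maxima of a density evaluated on a grid."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > y[i - 1] and y[i] > y[i + 1]]

def peak_features(y, p):
    """f1, f2 and f3 for one sample's density y, given reference peak indices p."""
    p1 = peak_indices(y)                     # this sample's own peak indices
    f1 = [y[i] for i in p]                   # density at the reference peaks
    f2 = len(p1)                             # number of peaks
    f3 = max(p1) - min(p1) if p1 else 0      # spread between first and last peak
    return f1, f2, f3

# Hypothetical densities on a 9-point grid.
mean_density = [0.0, 0.2, 0.5, 0.3, 0.1, 0.4, 0.9, 0.4, 0.0]
sample_density = [0.0, 0.3, 0.4, 0.2, 0.3, 0.2, 0.8, 0.5, 0.1]

p = peak_indices(mean_density)               # peaks of the mean density
f1, f2, f3 = peak_features(sample_density, p)
```

Here the mean density has peaks at indices 2 and 6, so `f1` holds the sample's density at those two positions, while `f2` and `f3` describe the sample's own three peaks.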

By applying the discrete Fourier transform with N points, the estimated kernel density is converted from the space domain to the frequency domain. Because the results are complex-valued, the amplitude is determined using the formula below:


$$\text{Amplitude} = \sqrt{\frac{\left[\mathrm{real}(FFT(A))\right]^2 + \left[\mathrm{img}(FFT(A))\right]^2}{N}}$$

For enhancement, the uncertainty of the information content is calculated using Shannon entropy. Hence for $f_1$, which is the methylation density at the peak index, the Shannon entropy of the Fourier transform of the methylation density is multiplied by the methylation density at the peak index:

$$f_4(x) = EN(FFT(y)) \cdot f_1(x)$$

In the formula above, EN is the Shannon entropy function, which calculates the entropy of the FFT of the methylation density, and $f_1$ is the extracted feature for the methylation density at the peak index.

Shannon entropy is calculated according to the following formula:

$$H(x) = -\sum_{x \in X} p(x) \log p(x)$$

Here p(x) is the probability mass function, which sums to 1. A higher entropy value indicates that the variable contains more information. In the above formula, H(x) is equivalent to the EN function. Hence the FFT of the methylation density is the input to the Shannon entropy function. After $f_4$ is calculated, it is used in the classification as the extracted enhanced FFT feature for the methylation density.
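A minimal sketch of the enhanced FFT feature is given below in Python (the thesis used R), with hypothetical function names and toy values. Several details are assumptions not stated in the text: the amplitudes are normalised to sum to 1 to form the probability mass function, the natural logarithm is used, and a direct discrete Fourier transform (which returns the same values as an FFT) is used to keep the example dependency-free.

```python
import cmath
import math

def dft(y):
    """Direct N-point discrete Fourier transform (same values an FFT would return)."""
    n = len(y)
    return [sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def amplitude(y):
    """Amplitude sqrt((real^2 + img^2) / N) of each Fourier coefficient."""
    n = len(y)
    return [math.sqrt((c.real ** 2 + c.imag ** 2) / n) for c in dft(y)]

def shannon_entropy(values):
    """Entropy of non-negative values after normalising them to sum to 1 (assumption)."""
    total = sum(values)
    probs = [v / total for v in values if v > 0]   # 0 * log(0) is treated as 0
    return -sum(p * math.log(p) for p in probs)

def enhanced_fft_feature(y, f1):
    """f4 = EN(FFT(y)) * f1."""
    return shannon_entropy(amplitude(y)) * f1

# Hypothetical density values and the density at the peak index (f1).
y = [0.0, 0.3, 0.4, 0.2, 0.3, 0.2, 0.8, 0.5]
f1 = 0.8
f4 = enhanced_fft_feature(y, f1)
```

Because the entropy of N amplitudes is at most log N, `f4` is bounded above by `f1 * log(N)`, which gives a quick sanity check on computed values.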

These features are extracted for each of the four datasets and are used in creating the classification model. The descriptive statistics for each feature (methylation density at the peak indices, number of methylation density peaks, the difference between the maximum and minimum peak indices, and the enhanced Fourier transform feature) are reported for the normal and cancer samples of each dataset in Table 4.2 and the tables that follow.


For the colon dataset, the top 5% of features with the highest F-score were used for feature extraction.

Table 4.2: Extracted features and comparison between the normal and cancer groups (colon dataset)

| Extracted feature | Cancer Q1 | Cancer Median | Cancer Q3 | Normal Q1 | Normal Median | Normal Q3 | Cancer Mean | Normal Mean | F-value | P-value |
|---|---|---|---|---|---|---|---|---|---|---|
| F2 | 3 | 5 | 22 | 5 | 14 | 21.75 | 13.22 | 17 | 0.92 | 0.341 |
| F3 | 375.5 | 455.5 | 492.4 | 452.2 | 485 | 492 | 432.4 | 460.1 | 2.84 | 0.097 |
| F4-ind5 | 0.0005 | 0.0009 | 0.0012 | 0.0009 | 0.0011 | 0.0014 | 0.0009 | 0.0014 | 4.72 | 0.034* |
| F4-ind13 | 0.0004 | 0.0007 | 0.0010 | 0.0007 | 0.0009 | 0.0010 | 0.0007 | 0.0008 | 2.4 | 0.126 |

* The difference between the cancer and normal means is significant at the 5% significance level.

It can be seen in Table 4.2 that the difference in the mean of $f_4$ at a peak index of 5 is significant between cancer and normal samples (at the 5% significance level). For the other two features, $f_2$ and $f_3$, there is no significant difference between cancer and normal samples. Additionally, for $f_4$ at a peak index of 13, the difference is not statistically significant.

For the leukemia dataset, the top 5% of features with the highest F-score were used for feature extraction.

Table 4.3: Extracted features and comparison between the normal and cancer groups (leukemia dataset)

| Extracted feature | Cancer Q1 | Cancer Median | Cancer Q3 | Normal Q1 | Normal Median | Normal Q3 | Cancer Mean | Normal Mean | F-value | P-value |
|---|---|---|---|---|---|---|---|---|---|---|
| F2 | 6.5 | 8 | 9.5 | 8 | 13.5 | 22 | 8.71 | 15.68 | 18.43 | 0.000* |
| F3 | 429.5 | 442 | 453.5 | 439.8 | 470 | 481 | 441 | 460.2 | 11.1 | 0.001* |
| F4-ind28 | 0.0002 | 0.0003 | 0.0003 | 0.0003 | 0.0004 | 0.0005 | 0.0003 | 0.0004 | 19.87 | 0.000* |
| F4-ind30 | 0.0003 | 0.0003 | 0.0003 | 0.0003 | 0.0004 | 0.0004 | 0.0003 | 0.0004 | 18.2 | 0.000* |

* The difference between the cancer and normal means is significant at the 1% significance level.

It can be seen in Table 4.3 that the extracted features $f_2$, $f_3$ and $f_4$ are significantly different between the cancer and normal groups at the 1% significance level. The mean value of the normal group for the three reported features is higher than the mean value of the cancer group.


Abstract

Abnormal growth in cells with the potential to spread to other parts of the human body can occur for multiple reasons, such as changes in the activity of DNA segments. Altered DNA methylation is known to be an important factor in cancer development, as it alters DNA activity by preventing some of the normal activities of DNA. Feature selection and feature extraction are used to reduce the dimensionality of high dimensional datasets as well as to filter the most useful features for predicting gene expression for a cancer. A number of feature extraction methods have been used in the literature for selecting the most useful features. In this study Semi-orthogonal Non-Negative Factorization (SONMF) was studied and tested on four microarray cancer datasets for feature extraction and compared with FFT features, Symmetry of Methylation Density Features, Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF). Five different classifiers, namely Naïve Bayes, Support Vector Machine (SVM), K-nearest Neighbour (KNN), Random Forest and Neural Network, were used to predict the gene expression of the four cancer microarray datasets. The experiments show that for the colon cancer dataset, Semi-orthogonal NMF (SONMF) and Non-negative Matrix Factorization (NMF) with the Naïve Bayes classifier performed the best compared with other feature extraction methods. One-way analysis of variance showed that the accuracy, specificity and sensitivity of SONMF were significantly higher than those of PCA. However, in terms of the highest accuracy, the SONMF and NMF feature extraction methods give the best performance with the Naïve Bayes classifier for the colon cancer dataset. For the oral cancer dataset, the highest accuracy was observed with SONMF and the Neural Network classifier. For leukemia, the highest accuracy of 100% was observed with NMF, SONMF and PCA with the Neural Network and SVM classifiers. However, comparing the medians for the best classifier shows that the medians of SONMF and NMF were slightly higher than that of PCA. For the prostate cancer dataset, SONMF with Naïve Bayes

from PCA and NMF. Overall, the results of SONMF were more consistent compared with other features extraction methods.

Keywords: DNA Methylation, Feature Selection, Feature Extraction, Non-negative Matrix Factorization, Semi-orthogonal Non-negative Matrix Factorization, Principal Component Analysis, Enhanced Fourier Transform, Symmetry Percentage Error.

Acknowledgments

Many thanks to my supervisor, Dr. Passi, who helped and guided me step by step with his deep knowledge and was compassionate and kind to me.

Thanks to my family for helping me and providing the conditions that allowed me to carry out this study. I would not have been able to finish my studies without their help.

I really appreciate the help of my friends and family at every step of working on this thesis.

Table of Contents

ABSTRACT ... III

ACKNOWLEDGMENTS ... V

TABLE OF CONTENTS ... VI

LIST OF FIGURES: ... IX

LIST OF TABLES: ... X

CHAPTER 1 ... 1

INTRODUCTION ... 1

1 I ... 1

1.1 P C ... 2

1.2 C C ... 3

1.3 L ... 4

1.4 O ... 5

1.5 DNA G E ... 5

1.6 F S ... 7

1.7 F ... 8

1.8 C ... 9

1.9 O ... 10

CHAPTER 2 ... 11

LITERATURE SURVEY ... 11

CHAPTER 3 ... 17

DATA PROCESSING ... 17

3.1. D ... 17

3.2. D P ... 17

3.2.1. Colon Cancer Data ... 17

3.2.2. Leukemia Cancer Data ... 18

3.2.3. Prostate Cancer Data ... 19

3.2.4 Oral Cancer Data ... 20

3.3. M ... 21

CHAPTER 4 ... 24

FEATURE SELECTION AND FEATURE EXTRACTION METHODS ... 24

4.1 F S ... 24

4.1.1 F-Score ... 25