Discovering Type 1 Diabetes Patient Subgroups through Integrative Analysis of Heterogeneous Data
by
©S. Sadra Mirhendi
A thesis submitted to the School of Graduate Studies
in partial fulfilment of the requirements for the degree of Master of Computer Science
Department of Computer Science Memorial University of Newfoundland
May 2018
St. John’s Newfoundland
Abstract
Type 1 diabetes (T1D) is a disease in which the body's immune system attacks the β-cells. As a result, very little or no insulin is released to control the level of glucose in the blood. Our research investigates whether groups of patients at higher risk of developing T1D complications can be identified by integrating demographic, clinical and genetic data. To this end, we explore two methods, Generalized Low Rank Models (GLRM) and Similarity Network Fusion (SNF), to analyze our T1D dataset and to determine groups of patients at higher risk of developing complications related to T1D.
By applying these methods, we have identified groups of patients suffering from nerve damage, high blood pressure, dyslipidemia, and thyroid diseases. This result could be used as the basis for a predictive model that could allow patients and health-care providers to take preemptive steps to reduce the risk of developing T1D-related complications.
Acknowledgements
I am grateful to God. Completing the master's program has been an incredibly rewarding experience, and I am very grateful to have had the opportunity to finish it.
I would like to thank my amazing supervisor Dr. Lourdes Peña-Castillo for her tremendous support and guidance. She created an open and supportive working environment where she was always available and willing to help. Her enthusiasm and incredible work ethic were both inspiring and motivating and I consider myself very fortunate to have had the opportunity to study under her supervision. I would also like to thank my co-supervisor Dr. Ting Hu for her support and guidance.
Additionally, I would like to express my sincere gratitude to my parents for their unconditional support and my wonderful wife for her endless love. I am very fortunate to have such wonderful people in my life and I truly value the encouragement and faith that they have always shown in me.
— S. Sadra Mirhendi
Contents
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
1.1 Biological Background
1.2 Related Works
1.2.1 Type 1 Diabetes Genetics
1.2.2 Heterogeneous Data Challenges
1.2.3 Patients Subgroup Discovery
1.2.3.1 Network Approaches
1.2.3.2 Machine Learning Approaches
1.3 Research Question
1.4 Dataset Overview
1.5 Study Overview
2 Generalized Low Rank Modeling
2.1 Introduction
2.2 Methods
2.2.1 GLRM Framework
2.2.2 GLRM Parameter Setting
2.2.3 Data Clustering
2.2.4 Clusters Evaluation
2.3 Results
2.3.1 GLRM Parameter Selection
2.3.2 Clustering Concise Data
2.3.2.1 K-means Clustering
2.3.2.2 Hierarchical Clustering
2.3.2.3 Affinity Propagation Clustering
2.4 Discussion
2.5 Conclusion
3 Similarity Network Fusion
3.1 Introduction
3.2 Methods
3.2.1 Data Pre-Processing
3.2.2 Principles of SNF
3.2.3 Network Clustering
3.3 Results
3.3.1 Network Clustering
3.3.2 Clusters Evaluation
3.4 Discussion
3.5 Conclusion
4 Summary
Bibliography
A Dataset Features Information
B Supplementary Clustering Data
B.1 Clustering Results
B.1.1 K-means Clustering
B.1.2 Hierarchical Clustering
B.1.3 Affinity Propagation Clustering
B.1.4 Network Clustering
B.2 Clustering Evaluation p-value
List of Tables
1.1 T1D Dataset Features Summary
2.1 Exposed and diseased population ratio definition
2.2 Overview of GLRM three clustering methods’ results
2.3 Statistics of the patients clusters obtained using k-means clustering
2.4 Most significant results per complication obtained using k-means clustering
2.5 Statistics of the patients clusters obtained using hierarchical clustering
2.6 Most significant results per complication obtained using hierarchical clustering
2.7 Statistics of the patients clusters obtained using affinity propagation clustering
2.8 Most significant results per complication obtained using affinity propagation clustering
2.9 Concordance between affinity propagation clustering and hierarchical clustering based on NMI percentage
3.1 Summary of six derived features’ categories
3.2 Concordance among similarity networks based on NMI percentage
3.3 Overview of SNF network clustering result
3.4 Statistics of the patients clusters obtained using network clustering
3.5 Most significant results per complication obtained using network clustering
3.6 Concordance between hierarchical clustering and network clustering based on NMI percentage
3.7 Concordance between affinity propagation clustering and network clustering based on NMI percentage
A.1 Dataset Features Details
B.1 k-means clustering results for Thyroid Disease
B.2 k-means clustering results for Dyslipidemia
B.3 k-means clustering results for High Blood Pressure
B.4 k-means clustering results for Nerve Damage
B.5 k-means clustering results for Retinopathy
B.6 k-means clustering results for Diabetic Ketoacidosis
B.7 k-means clustering results for Hyperglycemia
B.8 k-means clustering results for Hypoglycemia X
B.9 k-means clustering results for Anxiety
B.10 k-means clustering results for Depression
B.11 Hierarchical clustering results for Thyroid Disease
B.12 Hierarchical clustering results for Dyslipidemia
B.13 Hierarchical clustering results for High Blood Pressure
B.14 Hierarchical clustering results for Nerve Damage
B.15 Hierarchical clustering results for Retinopathy
B.16 Hierarchical clustering results for Diabetic Ketoacidosis
B.17 Hierarchical clustering results for Hyperglycemia
B.18 Hierarchical clustering results for Hypoglycemia X
B.19 Hierarchical clustering results for Anxiety
B.20 Hierarchical clustering results for Depression
B.21 Affinity Propagation clustering results for Thyroid Disease
B.22 Affinity Propagation clustering results for Dyslipidemia
B.23 Affinity Propagation clustering results for High Blood Pressure
B.24 Affinity Propagation clustering results for Nerve Damage
B.25 Affinity Propagation clustering results for Retinopathy
B.26 Affinity Propagation clustering results for Diabetic Ketoacidosis
B.27 Affinity Propagation clustering results for Hyperglycemia
B.28 Affinity Propagation clustering results for Hypoglycemia X
B.29 Affinity Propagation clustering results for Anxiety
B.30 Affinity Propagation clustering results for Depression
B.31 Network clustering results for Thyroid Disease
B.32 Network clustering results for Dyslipidemia
B.33 Network clustering results for High Blood Pressure
B.34 Network clustering results for Nerve Damage
B.35 Network clustering results for Retinopathy
B.36 Network clustering results for Diabetic Ketoacidosis
B.37 Network clustering results for Hyperglycemia
B.38 Network clustering results for Hypoglycemia X
B.39 Network clustering results for Anxiety
B.40 Network clustering results for Depression
B.41 k-means Clustering p-value
B.42 Hierarchical Clustering p-value
B.43 Affinity Propagation Clustering p-value
B.44 Network Clustering p-value
List of Figures
1.1 Research work flow diagram
2.1 Matrix transformation with GLRM framework
2.2 GLRM model average Mean Square Errors (MSE)
2.3 GLRM model average Misclassification Ratio (MCR)
2.4 Heatmap for k-means clustering
2.5 Bubble graph for k-means clustering result
2.6 Hierarchical Clustering Dendrogram
2.7 Heatmap for hierarchical clustering
2.8 Bubble graph for hierarchical clustering result
2.9 Heatmap for affinity propagation clustering
2.10 Bubble graph for affinity propagation clustering result
3.1 Schematic representation of SNF steps
3.2 Heatmap for network clustering
3.3 Bubble graph for network clustering result
4.1 Research summary diagram
Chapter 1 Introduction
Diabetes mellitus type 1, also known as type 1 diabetes (T1D), is a disease in which the body's immune system attacks the β-cells. As a result, very little or no insulin is released to control the level of glucose in the blood. Glucose obtained from food therefore builds up in the body instead of being used for energy.
This research investigates whether groups of patients at a higher risk of developing complications or secondary diseases related to T1D can be identified by integrating demographic, clinical and genetic data. We have a T1D dataset which contains 239 features concerning demographic, clinical and genetic factors from 196 patients (details are available in Section 1.4). We explore two methods, Generalized Low Rank Modelling (GLRM) [1] and Similarity Network Fusion (SNF) [2], to analyze this dataset.
As a result of our research, we have taken first steps to identify groups of patients at higher risk of developing T1D complications. These results could be taken as the basis to develop a predictive model that could allow patients and health-care providers to take preemptive steps to reduce the risk of developing T1D-related complications based on each patient's characteristics.
To sum up, we have a heterogeneous dataset that contains demographic, clinical and genetic data from T1D patients. Our research goal is to determine groups of patients at higher risk of developing complications or secondary disease related to T1D by analyzing this dataset.
1.1 Biological Background
T1D is not a preventable disease and is not related to eating an excessive amount of sugar. Scientists have not yet determined any particular agent as the cause of T1D [3]. Many factors may contribute to T1D, including genetic susceptibility and exposure to specific antigens. Hence, T1D is considered a “complex disease”, one to which a combination of numerous risk factors may lead. T1D occurs when the body’s immune system destroys the β-cells in the pancreas [4, 5]. T1D is a polygenic disorder, with about 50 loci so far known to influence susceptibility to the disease [6]. In this research we investigate the following complications related to T1D.
Thyroid Disease: Thyroid disease affects the thyroid gland, which controls various metabolic processes in the body. Thyroid dysfunction is two- to threefold more common in patients with T1D than in the general population [7, 8].
Dyslipidemia: Dyslipidemia is an abnormal amount of lipids in the blood. People with T1D have increased rates of vascular disease, for which dyslipidemia is a major risk factor [9].
High Blood Pressure: High blood pressure is common in people with diabetes and about 25% of people with T1D develop high blood pressure at some stage.
Having diabetes and high blood pressure together increases the risk of other health problems [10].
Nerve Damage: Nerve damage that occurs in people with T1D is called diabetic neuropathy. It causes different symptoms depending on the type of nerve damage. More than 50% of all diabetic patients suffer from some type of nerve damage [11].
Retinopathy: Retinopathy is damage to the retina of the eye, which may lead to vision problems. During the first two decades of T1D, nearly all patients suffer from diabetic retinopathy [12].
Diabetic ketoacidosis: Diabetic ketoacidosis (DKA) occurs when the body produces high levels of blood acids called ketones. DKA is a common and serious complication of T1D [13].
Hyperglycemia: Hyperglycemia is a condition in which an excessive amount of glucose is in the blood plasma. Low insulin levels in T1D patients cause hyperglycemia [14].
Hypoglycemia: Hypoglycemia is a condition in which blood sugar falls below normal levels. It is a common and dangerous occurrence in T1D patients [15].
Anxiety and Depression: Mental health problems are frequent in youth with T1D, and they are at an increased risk of mental health conditions, such as anxiety, eating and behavioral disorders, as well as depressive symptoms [16].
1.2 Related Works
We review research related to this study from three perspectives: (1) Type 1 Diabetes genetics, (2) challenges in analyzing heterogeneous data, and (3) patient subgroup discovery strategies.
1.2.1 Type 1 Diabetes Genetics
T1D, one of the most common chronic diseases of childhood [17], has been studied extensively in the literature. Genetic studies of T1D have identified 50 loci (susceptibility regions) that affect the risk of T1D [6, 18, 19]. Atkinson et al. [20] published a survey that reviews current developments in the epidemiology, pathology, diagnosis, and treatment of T1D, as well as the prospects for an improved future for individuals dealing with this disorder. Davies et al. [21] searched the human genome for genes that influence T1D. Additionally, Barrot et al. [22] reported findings of a genome-wide association study of T1D, combined in a meta-analysis. Roizen et al. [23] compared variants associated with increased risk for T1D with variants identified in other autoimmune diseases and revealed genetic overlap between T1D and other autoimmune diseases.
1.2.2 Heterogeneous Data Challenges
Integrating data from different sources, such as clinical, environmental, and demographic data, with genomic data is an ongoing part of current research in genomics. In our research, we have chosen to use the two approaches mentioned above (GLRM and SNF), which are able to handle heterogeneous data. However, we acknowledge that several other efforts to deal with heterogeneous data are underway. For example, Hamid et al. [24] proposed a conceptual framework for integrating data as well as a review of current approaches for combining genomic data. As another example, Ren et al. [25] evaluated the possible challenges in the integrative analysis of heterogeneous disease data types. They proposed a computational method (named iBFE) based on a feature extraction perspective and showed that iBFE could recognize disease subtypes in genomic data.
1.2.3 Patients Subgroup Discovery
Diagnosing and defining subtypes is a difficult challenge for complex diseases. Higdon et al. [26] described how different disease subtypes could be identified through the combination of clinical and multi-omics data. They clustered various types of omics data and then integrated the results with clinical data to identify disease subtypes. They applied this method to Autism Spectrum Disorder (ASD) to facilitate subtype identification.
1.2.3.1 Network Approaches
Zhang et al. [27] presented a computational framework to stratify a biological network into function-specific network layers, which transforms the network analysis from the gene level to the functional level by integrating expression data, the gene/protein network and gene ontology information.
Many studies of complex diseases classify patients based on their genomic mutations, but for some diseases these mutations are rarely shared across patients. Zhong et al. [28] used network-based stratification approaches on thirteen major cancer types to classify tumours based on exome-level mutations.
Previously, the most common approach to integrative data analysis was separate clustering followed by manual integration. Shen et al. [29] developed a joint latent variable model for integrative clustering (called iCluster). Using the iCluster algorithm, they could identify subtypes in breast cancer and lung cancer, characterized by concordant DNA copy number changes and gene expression. Kim et al. [30] proposed a method to improve feature selection on the iCluster factor model using prior knowledge of inter-omics regulatory flows.
Hillenmeyer et al. [31] combined an algorithm for weighted-edge module searching with a probabilistic interaction network to designate genes with strong associations to the phenotype.
Cho et al. [32] showed how networks could be used to represent clinical data such as genotype and gene expression to distinguish dysregulated pathways and to understand the connections between genotype and phenotype, and to explain disease heterogeneity. Their article showed how to analyze complex disease using similarity network fusions since genetic variations in affected individuals might be different.
Wang et al. [2] used similarity network fusions for disease data obtained from a group of patients. Yang et al. [33] proposed an integrative method based on Similarity Network Fusion (SNF), named ndmaSNF (network diffusion model assisted SNF). This method can be used for cancer subtype discovery by making use of somatic mutation data and other discrete data.
Wang et al. [34] proposed a network-based approach for the integrative analysis of heterogeneous omics data. They presented a network-based solution in which each type of data is treated independently and tested the method on subtype identification for a brain tumor.
1.2.3.2 Machine Learning Approaches
Speicher et al. [35] extended multiple kernel learning for dimensionality reduction.
They could identify biologically meaningful subgroups for five different cancer types.
Schuler et al. [36] applied Generalized Low Rank Modelling, presented by Udell et al. [1], to discover phenotypes in two datasets of patient information related to two different diseases. The method is used to overcome barriers in the input data such as missing data, data sparsity, and data heterogeneity. As shown in that paper, the results of the GLRM method differ markedly from those of other machine learning methods when discovering patient phenotypes.
Young et al. [37] used unsupervised deep learning to learn the hierarchical structure of cancer gene expression data. They showed that a deep learning model can be trained to represent biologically and clinically important concepts of cancer genes. Lasko et al. [38] introduced new deep learning methods for phenotype discovery in clinical data.
Wei et al. [39] tested a Support Vector Machine (SVM) on three large-scale GWAS datasets generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and demonstrated a risk assessment for this disorder.
1.3 Research Question
This research is an interdisciplinary study across computer science, molecular biology, and medicine. The aim of this research is to improve knowledge regarding complications associated with T1D and their risk factors and, eventually, to achieve an efficient, trusted preemptive strategy for T1D.
Our research differs from previous research in three respects: (1) we have a unique dataset from T1D patients that comprises demographic, clinical and genetic data, from both the diagnosis stage and the patients' current stage; (2) we use two state-of-the-art methods to identify patient subgroups; and (3) we include several T1D complications instead of focusing on a single complication. To investigate our T1D dataset, we apply two methods, namely Generalized Low Rank Modelling (GLRM) and Similarity Network Fusion (SNF). The advantages of GLRM are that it handles missing values and compresses data. SNF benefits from capturing both shared and complementary information in the fused network. Our results can be used to identify patients at higher risk of developing T1D complications and could be taken as the basis to create a predictive model of developing T1D complications.
1.4 Dataset Overview
Our dataset is collected from a cohort study by Newhook et al. [40] regarding the incidence of childhood T1D in children aged 0-14 years who were diagnosed with T1D on the Avalon Peninsula of Newfoundland, Canada. Subjects for this study were a cohort of individuals with T1D who participated in a genetics study between 2001 and 2006. At the time of that study, demographic and clinical information from each individual had been entered in the Newfoundland and Labrador Diabetes Genetics Database and was used as a basis for patient contact. Later, given the passage of time, the most up to date demographic information about the cohort was collected.
According to Newhook et al. [40], “the Avalon Peninsula of Newfoundland has one of the highest incidences of T1D reported worldwide”. The obtained raw T1D dataset comprises 239 features concerning demographic, clinical and genotype factors from 196 patients. This dataset is heterogeneous and has missing values, especially in its genotype information. These imperfections drive us to perform the following pre-processing steps before applying the two selected methods.
Irrelevant features elimination: Not all of the 239 features in the raw dataset are related to our research. We eliminate 25 irrelevant features such as “number of patient visits to the hospital” or “patient insurance status”.
Sparse rows elimination: In the raw dataset, we have 12724 (27.2%) missing entries overall. We eliminate 43 patients (rows) which have more than 50% missing entries to reduce data sparsity.
Complications matrix extraction: 20 features represent patients' complication data. We extract these columns and call them the complications matrix. Complications (columns) with more than 75% missing entries are eliminated from the obtained matrix. We consider the minimum sample size for analyzing a complication to be ten patients; thus complications with fewer than eleven patients are excluded. Eventually, we obtain a 153×10 complications matrix. Missing values in complications are replaced with zero (healthy status).
Substituting values and merging categories: Some of the raw dataset entries are text. A number is assigned to each unique string, and strings are replaced by their corresponding number for better computational handling. Additionally, in each column, low-population categories are merged to obtain a larger category. Finally, each patient's date of birth is converted to age, calculated as of 2016.
Genotype features: We have 122 genotype features in total, of which 98 are categorical with three categories (homozygous, heterozygous type 1, heterozygous type 2) in each column. The remaining 24 columns are also categorical, but each data entry holds the position of an allele in the chromosome. We rearrange these features into binary features that indicate whether or not a patient has an allele at the given position. This results in removing 24 features and adding 330 binary genotype features.
Sparse columns elimination: For the final step, we eliminate 64 features (columns) which have more than 30% missing entries to reduce data sparsity. This threshold was found empirically as a compromise between the number of features eliminated and the amount of missing data.
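The two sparsity filters in the steps above can be sketched as follows. This is only an illustrative sketch, not the code used in this study: the function names, the use of None to mark a missing entry, and the toy values are all assumptions made for the example. Only the two thresholds (50% for rows, 30% for columns) come from the text.

```python
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing entry (assumed encoding)


def missing_fraction(values: Row) -> float:
    """Fraction of entries in a row (or column) that are missing."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_rows(table: List[Row], max_missing: float) -> List[Row]:
    """Keep rows whose missing fraction does not exceed max_missing."""
    return [row for row in table if missing_fraction(row) <= max_missing]


def drop_sparse_columns(table: List[Row], max_missing: float) -> List[Row]:
    """Keep columns whose missing fraction does not exceed max_missing."""
    columns = list(zip(*table))
    keep = [j for j, col in enumerate(columns)
            if missing_fraction(list(col)) <= max_missing]
    return [[row[j] for j in keep] for row in table]


# Toy data (hypothetical values): 4 patients x 4 features.
toy = [
    [1.0, None, 3.0, 0.0],
    [None, None, None, 1.0],  # 75% missing, removed by the row filter
    [2.0, None, 1.0, 1.0],
    [4.0, 5.0, 2.0, 0.0],
]
rows_kept = drop_sparse_rows(toy, max_missing=0.5)          # patients step
filtered = drop_sparse_columns(rows_kept, max_missing=0.3)  # features step
print(len(filtered), "rows x", len(filtered[0]), "columns")
```

In the toy table the second feature is missing for two of the three remaining patients, so it is removed by the column filter even though the rows that contain it survive the row filter.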
Following these steps we have a 153×436 pre-processed T1D dataset and a 153×10 complications data matrix. We use the pre-processed T1D dataset as our input data. The complications matrix is used for evaluating the obtained clusters to identify clusters with a higher incidence of complications. Table 1.1 presents a summary of the pre-processed T1D dataset features, and Appendix A includes the details and specification of all dataset features.
Table 1.1: T1D Dataset Features Summary
Features Category           Features Type               No. of Features
Patient Clinical Data       Ordinal, Numeric, Binary    24
Relative Clinical Data      Binary                      24
Patient Demographic Data    Numeric                     4
Patient Genotype Data       Categorical, Binary         384
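The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 20 complication columns are taken out of the input features when the complications matrix is extracted. The text implies this but does not state it as a subtraction. All numbers are copied from the pre-processing steps and from Table 1.1.

```python
# Counts taken from the pre-processing steps in Section 1.4.
raw_patients, raw_features = 196, 239

patients = raw_patients - 43   # sparse rows removed
features = raw_features - 25   # irrelevant features removed
features -= 20                 # complication columns moved to their own matrix (assumed)
features += 330 - 24           # allele-position columns replaced by binary columns
features -= 64                 # sparse columns removed

raw_entries = raw_patients * raw_features
missing_share = 12724 / raw_entries  # reported as 27.2% in the text

table_1_1_total = 24 + 24 + 4 + 384  # feature counts per category in Table 1.1

print(patients, features, round(100 * missing_share, 1), table_1_1_total)
```

Under that assumption the steps reproduce the 153×436 shape of the pre-processed dataset, and the four categories in Table 1.1 sum to the same 436 features.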
1.5 Study Overview
In this chapter, we introduced T1D and provided an overview of our input dataset. In the second chapter, we describe the basis of GLRM and the methods used to identify over-enriched clusters. Then we present the GLRM results and discuss their outcomes. In the third chapter, we present the principles of SNF, then we present the results obtained and discuss their features. Finally, in the last chapter, we give a comprehensive summary of our research. Figure 1.1 illustrates our workflow and thesis organization.
Figure 1.1: Research work flow diagram
Chapter 2
Generalized Low Rank Modeling
2.1 Introduction
Artificial Intelligence (AI) and Machine Learning (ML) methods have been used extensively for analyzing health records data and for patient stratification. Researchers applying these methods are hindered by data imperfections such as missing records, mixed feature types, heterogeneity, and sparsity. We use Generalized Low Rank Modeling (GLRM) as a framework for analyzing our pre-processed Type 1 Diabetes (T1D) dataset described in Section 1.4. Unlike typical machine learning algorithms, this framework offers flexible solutions to overcome data barriers such as missing data and heterogeneity. The GLRM framework was first introduced by Udell et al. [1]. It extends the Principal Component Analysis (PCA) technique [41] into a framework which can handle heterogeneous data with mixed feature types (numerical and categorical). This framework transforms high-dimensional data into a lower-dimensional space by solving an optimization problem.
Prior to applying GLRM to the raw T1D dataset, we perform the data pre-processing steps discussed in Section 1.4, including merging, replacing and segregating feature values as well as eliminating sparse samples. Following the data preparation step, we apply the GLRM method to the pre-processed T1D dataset and optimize its parameters using cross-validation. The low-dimensional data with minimum error obtained after cross-validation is then used for further analysis. Three clustering algorithms (K-means [42], Hierarchical [43], and Affinity Propagation [44] clustering) are applied to the optimal low-dimensional data, and the results are then evaluated with various statistical tests.
In this chapter, we describe the principles of GLRM framework and clustering algorithms in the methods section. Next, we discuss the outcome of each procedure in the results section and finally, we illustrate how GLRM helped us to achieve a patient stratification strategy for T1D patients.
2.2 Methods
In this section, we describe the procedure that we followed to group T1D patients from the pre-processed dataset. We discuss each of the following topics in a subsection:
• Summary of the GLRM framework principles, the philosophy behind it and the software package used for it.
• Description of how the optimal low-dimensional concise data is extracted and how we find the proper parameter set for building the model.
• Review of the basis of the three clustering methods (K-means, hierarchical, and affinity propagation clustering) which we use for analyzing the low-dimensional data.
• Finally, investigating clustering results to find the relation between clusters and complications.
2.2.1 GLRM Framework
Unavoidable imperfections in data, such as noise, missing entries, sparsity, and heterogeneity, have challenged common machine learning methods in previous studies aimed at finding patterns in clinical health-related data [45]. Udell et al. [1] extended the idea behind Principal Components Analysis (PCA) into a generalized method that can handle different types of data, including numerical, boolean, categorical, ordinal, and other data types. Generalized Low Rank Modeling (GLRM) handles heterogeneous datasets, compresses and denoises data, and imputes missing records. We apply this method to deal with a large number of input data features and to utilize most of the samples even if they have missing values in some features.
GLRM represents high-dimensional data in a lower-dimensional space. Suppose we have a matrix A with m rows representing samples and n columns representing features, where the features have different data types; for instance, one column may take floating-point values while others take categorical values. By solving an optimization problem, we can approximate A by the product of X, a “tall and skinny” matrix, and Y, a “short and wide” matrix (Figure 2.1). X represents k new latent features for the m samples, and Y encodes the transformation of the n original features into the k new latent features.
Figure 2.1: Matrix transformation with GLRM framework (obtained from Udell et al. [1])
To find X and Y the following optimization problem should be solved:
\[
\text{minimize} \quad \sum_{(i,j)\in\Omega} L_{ij}(x_i y_j, A_{ij}) + \sum_{i=1}^{m} r_i(x_i) + \sum_{j=1}^{n} \tilde{r}_j(y_j) \tag{2.1}
\]
where $L_{ij}(x_i y_j, A_{ij})$ is the loss function; $x_i y_j$ is the predicted entry, obtained as the product of the ith row of the estimated matrix X and the jth column of the estimated matrix Y; $A_{ij}$ is the observed entry in the input data; and $r_i$ and $\tilde{r}_j$ are regularizers used to constrain the output matrices. The loss function (the first summation) measures the accuracy of the data approximation, and the solver tries to minimize this term. Different loss functions are appropriate for different types of input data. Thus, GLRM gives the flexibility to define a different loss function for each column of data (feature) based on its type. The loss is calculated only over the set Ω, which represents the non-missing entries. The regularizers $r_i$ and $\tilde{r}_j$ limit the latent feature values. Choosing appropriate regularization can improve the model and prevent it from overfitting. Furthermore, an appropriate value of k, the number of latent features, can be estimated using cross-validation over the observed data by examining the test and train errors.
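To make the role of the observed set Ω concrete, the following minimal sketch fits a low-rank model to a small numeric matrix with one missing entry. It is a simplification under stated assumptions: it uses only the quadratic loss, no regularizers, and plain stochastic gradient descent, which is not necessarily the solver used by the H2O implementation. The function names, the None encoding of missing entries, and the toy matrix are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[float]]
Data = List[List[Optional[float]]]  # None marks a missing entry (assumed encoding)


def objective(A: Data, X: Matrix, Y: Matrix) -> float:
    """Quadratic loss summed over the observed entries only (no regularizers)."""
    total = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            if a is not None:
                pred = sum(X[i][c] * Y[c][j] for c in range(len(Y)))
                total += (pred - a) ** 2
    return total


def fit_low_rank(A: Data, k: int, steps: int = 2000,
                 lr: float = 0.01, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Approximate A by X (m x k) times Y (k x n) with gradient steps
    taken on observed entries only."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    X = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
    Y = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(k)]
    observed = [(i, j) for i in range(m) for j in range(n)
                if A[i][j] is not None]
    for _ in range(steps):
        for i, j in observed:
            err = sum(X[i][c] * Y[c][j] for c in range(k)) - A[i][j]
            for c in range(k):
                x_ic, y_cj = X[i][c], Y[c][j]  # use pre-update values for both
                X[i][c] -= lr * 2 * err * y_cj
                Y[c][j] -= lr * 2 * err * x_ic
    return X, Y


# Toy rank-1 matrix; the hidden middle entry would be 4.0 if it were observed.
A = [[1.0, 2.0, 0.5],
     [2.0, None, 1.0],
     [3.0, 6.0, 1.5]]
X, Y = fit_low_rank(A, k=1)
print("imputed entry:", X[1][0] * Y[0][1])
print("objective on observed entries:", objective(A, X, Y))
```

Because the missing entry never contributes to the loss, the model is fitted from the eight observed values alone, and the product of the fitted row and column then provides an estimate for the missing one.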
We use GLRM to estimate and fill in the missing values in our dataset and to transform our large heterogeneous dataset into a smaller homogeneous one. For this purpose, we use the H2O.ai package (version 3.14.0.2) [46] in the R programming language [47]. This package has built-in implementations of the GLRM framework as well as of popular machine learning algorithms.
Following building an appropriate model for the input dataset, we extract the tall and skinny matrix X and use it to cluster samples (patients) based on the k latent features obtained by GLRM.
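One way to set up the cross-validation over observed data mentioned above is to hide a random share of the observed entries, fit the model on the rest, and compare the errors on the two sets for several values of k. The sketch below shows only the splitting step. It is an assumption about how such a split could be done, not a description of the procedure used in this study, and the function name and test fraction are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Data = List[List[Optional[float]]]  # None marks a missing entry (assumed encoding)


def split_observed(A: Data, test_fraction: float = 0.25,
                   seed: int = 0) -> Tuple[Data, List[Tuple[int, int]]]:
    """Hide a random share of the observed entries.

    Returns a training copy of A in which the held-out entries are set to
    None, and the list of held-out (row, column) positions.
    """
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(A)
                for j, a in enumerate(row) if a is not None]
    rng.shuffle(observed)
    n_test = int(len(observed) * test_fraction)
    held_out = observed[:n_test]
    train = [list(row) for row in A]  # copy so that A itself is unchanged
    for i, j in held_out:
        train[i][j] = None
    return train, held_out


# Toy matrix (hypothetical values) with one entry that is already missing.
A = [[1.0, 2.0, 0.5],
     [2.0, None, 1.0],
     [3.0, 6.0, 1.5]]
train, held_out = split_observed(A)
print(len(held_out), "entries held out for testing")
```

The test error for a given k would then be computed from the original values in A at the held-out positions, and the train error from the entries that remain observed in the training copy.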
2.2.2 GLRM Parameter Setting
To make a low-rank model converge efficiently, we need to choose proper input parameters. Udell et al. [1] thoroughly described the impact and purpose of each parameter in the GLRM implementation. To achieve optimal performance, the parameters must be fitted to the dataset. Here we briefly describe the input parameters, including the loss functions ($L_j$), the regularizers ($r$, $\tilde{r}$), gamma ($\gamma$), and the output matrix rank (k).
Loss function ($L_j$): A loss function is defined for each column (feature) based on the nature of its data. Following Udell et al. [1] and the GLRM implementation in H2O [46], we use the “quadratic”, “logistic”, “categorical” and “ordinal” loss functions for numerical, boolean, categorical, and sequential features, respectively.
Regularizers(r,r) and Gamma (γ)˜ The regularization functionsr and ˜rare used to prevent overfitting or to enforce constraints on the values of low-rank ma- trices X and Y. These regularizations can be scaled by γ. Thus, the GLRM optimization problem would be adjusted to:
\[
\text{minimize} \quad \sum_{(i,j)\in\Omega} L_{ij}(x_i y_j, A_{ij}) + \gamma_x \sum_{i=1}^{m} r_i(x_i) + \gamma_y \sum_{j=1}^{n} \tilde{r}_j(y_j) \tag{2.2}
\]
where all the terms are as defined in Equation 2.1, and γx and γy are scaling values for the regularizers. Our input dataset has many missing values, and this can prevent the model from overfitting. Therefore, we use no regularizer when building the low-rank model. Additionally, when we added regularizers to the model, we observed that the test and train errors in cross validation increased significantly.
Rank (k): The rank of a model is the number of columns in the concise low-rank matrix (X). We use cross validation over the input data and choose a proper rank based on the train and test errors.
In addition to these parameters, we need to set the initialization method for the low-rank matrices. Udell et al. [1] showed that a suitable approach is “Singular Value Decomposition (SVD)” [48], which performs much better than random initialization methods.
Lastly, once well-fitted parameters for building the model have been determined, we can extract the low-rank matrices X and Y. X represents the sample features in a different domain; its data is homogeneous and has no missing entries. It has m rows, equal to the number of samples, and k columns, which represent the k latent features. Y, of size k×n, represents the relation between the k latent features and the n original features. We call the X matrix the “concise data” and use it to cluster patients.
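To make the objective in Equation 2.2 concrete, the following minimal Python sketch evaluates it for given matrices X and Y over the observed entries only. It is an illustration and not the H2O implementation: the function names, the encoding of missing entries as None, the −1/+1 encoding of boolean features, and the choice of a quadratic regularizer are all assumptions made for this example.

```python
import math

def quadratic_loss(u, a):
    # Loss for a numerical feature: squared difference.
    return (u - a) ** 2

def logistic_loss(u, a):
    # Loss for a boolean feature; a is assumed to be encoded as -1 or +1.
    return math.log(1.0 + math.exp(-a * u))

def glrm_objective(A, X, Y, losses, gamma_x=0.0, gamma_y=0.0):
    """Sum of per-column losses over observed entries plus scaled regularizers.

    A is m x n (None marks a missing entry), X is m x k, Y is k x n,
    and losses[j] is the loss function chosen for column j.
    """
    m, n, k = len(A), len(A[0]), len(Y)
    total = 0.0
    for i in range(m):
        for j in range(n):
            if A[i][j] is None:
                continue  # entry (i, j) is not in the observed set
            u = sum(X[i][l] * Y[l][j] for l in range(k))  # x_i y_j
            total += losses[j](u, A[i][j])
    # Assumed quadratic regularizers: squared Euclidean norm of X and of Y.
    reg_x = sum(v * v for row in X for v in row)
    reg_y = sum(v * v for row in Y for v in row)
    return total + gamma_x * reg_x + gamma_y * reg_y
```

With gamma_x = gamma_y = 0 the regularization terms vanish, which corresponds to the unregularized setting described above.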
2.2.3 Data Clustering
Grouping patients helps clinicians to investigate the cause of a disease within a group [36]. Our input dataset contains data on patients’ complications as well as clinical, demographic and genetic data. As indicated in Section 1.4, we separate the feature set into two categories: (1) complication features and (2) other features. The second category is used as the input data for building the GLRM model, extracting the concise data, and clustering samples based on this concise matrix. The first category is used to evaluate the clustering results. In this section, we first describe the algorithms used for clustering the concise data (each cluster represents a group of patients), and then we explain how we evaluate the clustering results to discover the relationship between clusters and complications.
K-means Clustering: k-means clustering aims to partition m samples into k clusters such that each sample belongs to the cluster with the nearest mean [42]. We use the kmeans function in the Matlab® Statistics and Machine Learning Toolbox for clustering the concise data [49]. This function requires the number of clusters as an input; we empirically set this number to 10. Other clustering methods, which do not require the number of clusters as an input, indicated roughly the same number of clusters. Euclidean distance is used to measure the distance between data points.
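The assignment and update steps of k-means can be sketched in a few lines of Python. This is a simplified illustration and not the Matlab kmeans function: the function names, the random choice of initial centroids from the data points, and the fixed seed are assumptions of this example.

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance; it gives the same nearest centroid
    # as the Euclidean distance itself.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization
    labels = None
    for _ in range(iterations):
        # Assignment step: each sample joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In this setting, points would hold the rows of the concise data (one row of latent features per patient) and k would be the chosen number of clusters.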
Hierarchical Clustering: Hierarchical clustering groups data by forming a cluster tree or dendrogram [43]. The tree is a multilevel hierarchy where clusters at one level are joined as clusters at the next level [50]. There are two strategies for hierarchical clustering: agglomerative and divisive. The agglomerative strategy is a “bottom up” approach in which each sample starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The divisive strategy, on the other hand, is a “top down” approach in which all samples start in one cluster, and splits are performed as one moves down the hierarchy [51]. We use the clusterdata function in the Matlab® Statistics and Machine Learning Toolbox for clustering the concise data [49]. This function supports agglomerative clustering. We use Euclidean distance as the distance metric, and inner squared distance (minimum variance algorithm) as the algorithm for computing the distance between clusters. More information about the function inputs is available in the Matlab user guide on hierarchical clustering [50]. Finally, after applying hierarchical clustering to the concise data, we choose a proper value for the maximum number of clusters based on the obtained dendrogram. We select a cutting level on the dendrogram where clusters are neither too small (at least three members in each cluster) nor too big (no more than 40 members in each cluster).
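The agglomerative strategy with a minimum variance merge criterion, together with the cluster size check used to pick the cutting level, can be sketched as follows. This is a simplified Python illustration and not the Matlab clusterdata function: the function names are hypothetical, and the merge cost used here (the increase in within-cluster sum of squares) is an assumed reading of the minimum variance criterion.

```python
def ward_clusters(points, max_clusters):
    """Bottom-up clustering: start with singletons and repeatedly merge
    the pair of clusters whose union increases the within-cluster sum of
    squares the least, until max_clusters clusters remain.
    Returns a list of clusters, each a list of sample indices."""
    dims = len(points[0])
    clusters = [[i] for i in range(len(points))]

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members)
                for d in range(dims)]

    def merge_cost(a, b):
        ca, cb = centroid(a), centroid(b)
        sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * sq

    while len(clusters) > max_clusters:
        _, i, j = min(
            (merge_cost(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def sizes_within_bounds(clusters, smallest=3, largest=40):
    # Size criterion from the text: at least three and at most 40 members.
    return all(smallest <= len(c) <= largest for c in clusters)
```

One way to use the two functions together is to try several values of max_clusters and keep those for which sizes_within_bounds returns True.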
Affinity Propagation Clustering: Affinity propagation clustering is an algorithm based on the concept of “message passing” between samples, and it does not require the number of clusters to be determined before running the algorithm [44, 52]. We use the apcluster function in Matlab® for clustering our data. More information about the algorithm and the function parameters is available in the article by Frey et al. [44]. This function requires two inputs: (1) A square matrix representing the pairwise similarities between samples. We use the negative of the Euclidean distance as the pairwise similarity measure. (2) An input preference p, which is a real-valued vector; pi indicates the preference that data point i be chosen as an exemplar. We set all preferences to the same value since we have no preference among the samples.
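The two inputs described above can be built as in the following Python sketch. It only prepares the inputs and does not run affinity propagation itself. The function names are hypothetical, and the use of the median similarity as the shared preference value is an assumption of this example, since the text only states that all preferences are equal.

```python
import math
import statistics

def negative_euclidean_similarity(points):
    # Square matrix S with S[i][j] = -(Euclidean distance between i and j).
    return [[-math.dist(p, q) for q in points] for p in points]

def shared_preference(similarity):
    # One common value for every sample; the median of the off-diagonal
    # similarities is an assumed choice, not taken from the text.
    n = len(similarity)
    off_diagonal = [similarity[i][j]
                    for i in range(n) for j in range(n) if i != j]
    return [statistics.median(off_diagonal)] * n
```

A lower shared preference value generally leads to fewer exemplars and therefore fewer clusters, so this single value indirectly controls the number of clusters.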
2.2.4 Clusters Evaluation
As indicated in Section 1.4, we extract the patients’ complication information from the raw input dataset and call this extracted data the complications matrix. The complications matrix is a binary matrix with 153 rows, one per patient, and ten columns, one per complication for which information is available.
The result of each clustering algorithm is a vector indicating the cluster assignment of each patient. For each clustering result, we apply three statistical measures per complication to investigate whether there is any relation between a group of patients and a disease. These measures are the hypergeometric test, the odds ratio, and the risk ratio.
Hypergeometric Test: The hypergeometric test is based on the hypergeometric distribution to calculate the significance of having drawn a specified number of successes from a specified population. This test can be used to distinguish which subsets of the population are over- or under-represented in a clustering scheme [53].
We use the Matlab® hygecdf function to “compute the complement hypergeometric cdf at the value of x using the corresponding size of the population (M), total number of items with the desired characteristic in the population (K), and number of samples drawn (N). The result, p, is the complement probability of drawing up to x of a possible K items in N drawings without replacement from a group of M objects” [50].
\[
p = 1 - \sum_{i=0}^{x} \frac{\binom{K}{i}\binom{M-K}{N-i}}{\binom{M}{N}} \tag{2.3}
\]
We define x as the number of patients with the specified complication in the cluster, M as the total number of patients (population size), K as the total number of patients with the specified complication, and N as the cluster size. Since Equation 2.3 is the complement of the cumulative distribution, the obtained p value is the probability, under the null distribution, of observing more than x patients with the specified complication in the given cluster. Thus, a lower probability indicates a more reliable cluster that adequately captures patients with a specified disease. However, this probability is very low in small clusters (such as clusters with fewer than four patients), which is not desired; therefore, we add other methods to evaluate the clustering results.
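Equation 2.3 can be evaluated directly with binomial coefficients, as in the following Python sketch. It is an illustration of the formula and not the Matlab hygecdf function; the function name is hypothetical and the argument order (x, M, K, N) simply follows the definitions above.

```python
from math import comb

def hypergeometric_upper_tail(x, M, K, N):
    """Complement of the hypergeometric cdf at x (Equation 2.3).

    x: patients with the complication in the cluster
    M: total number of patients
    K: total number of patients with the complication
    N: cluster size
    """
    x = min(x, N)  # at most N patients can be drawn
    cdf = sum(comb(K, i) * comb(M - K, N - i) for i in range(x + 1)) / comb(M, N)
    return 1.0 - cdf
```

For example, with a hypothetical population of M = 10 patients, K = 4 of whom have the complication, a cluster of size N = 3 and x = 1, the sum in Equation 2.3 is (20 + 60)/120, so the function returns a value of about 1/3.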
Risk Ratio: Risk ratio or relative risk (RR) is the probability of an event occurring in an exposed group divided by the probability of the event occurring in a comparison, non-exposed group [54].
Table 2.1: Exposed and diseased population ratio definition
              Diseased   Healthy   Total
Exposed       D_E        H_E       N_E = D_E + H_E
Not exposed   D_N        H_N       N_N = D_N + H_N
Considering Table 2.1 the risk ratio can be calculated using Equation 2.4.
\[
RR = \frac{D_E / N_E}{D_N / N_N} \tag{2.4}
\]
An RR = 1 represents no difference in risk between the exposed and non-exposed groups. An RR < 1 means the event is less likely to occur in the exposed group than in the non-exposed group, and an RR > 1 means the event is more likely to occur in the exposed group. We count as the “Diseased” population the patients with a specific complication and as the “Exposed” population the patients inside a cluster. Consequently, clusters with a higher risk ratio are more reliable clusters, as they may represent patients who are more likely to develop a given complication.
Odds Ratio: The odds ratio (OR) is commonly used to quantify how strongly the presence or absence of an exposure is associated with an outcome in a given population. An OR = 1 means that the exposure does not affect the odds of the outcome. An OR > 1 indicates that the exposure is associated with higher odds of the outcome, and an OR < 1 indicates the opposite [55, 56].
Considering Table 2.1 the odds ratio is calculated using Equation 2.5.
\[
OR = \frac{D_E / H_E}{D_N / H_N} \tag{2.5}
\]
When the OR is one, the RR is equal to one as well, and the OR approximates the RR if the probability of the disease occurring is low. Otherwise, the OR is always farther from one than the RR; thus, the OR can better represent slight differences. We use the OR for plotting our results, but it should be kept in mind that a small cluster (such as a cluster with fewer than four patients) may have a very high OR.
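Equations 2.4 and 2.5 translate directly into code. The following Python sketch uses the cell names of Table 2.1; the function names and the example counts are hypothetical and chosen only for illustration.

```python
def risk_ratio(d_e, h_e, d_n, h_n):
    # Equation 2.4: risk in the exposed group over risk in the non-exposed group.
    return (d_e / (d_e + h_e)) / (d_n / (d_n + h_n))

def odds_ratio(d_e, h_e, d_n, h_n):
    # Equation 2.5: odds in the exposed group over odds in the non-exposed group.
    return (d_e / h_e) / (d_n / h_n)

# Hypothetical counts: a cluster of 10 patients, 6 of them with the
# complication, and 100 patients outside the cluster, 20 of them with it.
example_rr = risk_ratio(6, 4, 20, 80)
example_or = odds_ratio(6, 4, 20, 80)
```

For these counts the risk ratio is about 3 and the odds ratio is 6, which illustrates that the OR lies farther from one than the RR. The sketch does not handle empty cells; a cell count of zero in a denominator would need a separate convention.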
To find the significance level of the odds ratio in each cluster given a complication, we bootstrap the clustering result obtained from each clustering method and determine the p-value of the odds ratio per cluster given a complication. For this purpose, 10000 random sets of cluster labels are produced in which the total number of clusters and the cluster sizes are identical to those of the original clustering result. Then, for each complication and cluster, the p-value is computed as the number of random samplings with an OR greater than or equal to the real OR, divided by the total number of random samplings.
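The resampling procedure above can be sketched in Python as follows. Shuffling the label vector is one assumed way to produce random labelings that keep the number of clusters and the cluster sizes unchanged. The function names are hypothetical, and the convention for empty cells (an infinite OR when the denominator is zero) is an assumption of this example.

```python
import math
import random

def odds_ratio_for_cluster(labels, has_complication, cluster):
    # Build the 2x2 table of Table 2.1 with "exposed" = inside the cluster.
    d_e = sum(1 for l, c in zip(labels, has_complication) if l == cluster and c)
    h_e = sum(1 for l, c in zip(labels, has_complication) if l == cluster and not c)
    d_n = sum(1 for l, c in zip(labels, has_complication) if l != cluster and c)
    h_n = sum(1 for l, c in zip(labels, has_complication) if l != cluster and not c)
    numerator, denominator = d_e * h_n, h_e * d_n
    if denominator == 0:
        # Assumed convention for empty cells.
        return math.inf if numerator > 0 else 0.0
    return numerator / denominator

def resampled_p_value(labels, has_complication, cluster,
                      n_resamples=10000, seed=0):
    rng = random.Random(seed)
    observed = odds_ratio_for_cluster(labels, has_complication, cluster)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)  # same clusters and sizes, random membership
        if odds_ratio_for_cluster(shuffled, has_complication, cluster) >= observed:
            hits += 1
    return hits / n_resamples
```

Here labels is the cluster assignment vector and has_complication is one column of the complications matrix, given as a list of booleans.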
By evaluating the clustering results from the k-means, hierarchical and affinity propagation clustering algorithms using the three outlined measurements, we can find out whether risk factors for developing a secondary disease related to T1D can be identified by integrating demographic, clinical and genetic data using the GLRM method.
2.3 Results
As described in Section 1.4, our pre-processed T1D dataset is segmented into two separate datasets. The first has 153 samples and 436 features; we call it the “T1D pre-processed input data” since we use it as the input to the machine learning algorithms. The second is a binary matrix with 153 samples and ten columns representing the presence of ten complications in each patient. This matrix is used for evaluating the clustering results. We call this matrix the “complications data”. The input data has 14.75% missing entries; however, the GLRM algorithm can tolerate this amount of missing data. Table 1.1 summarizes the characteristics of the input data, and a detailed table is available in Appendix A.
In this section, we present the cross-validation results to set the GLRM optimal parameters. Next, we illustrate the results of the clustering algorithms and, finally, we discuss whether or not discovered clusters are over-enriched with patients having a given complication.
2.3.1 GLRM Parameter Selection
In Section 2.2.2 we discussed methods for finding proper GLRM parameters, including loss functions, regularizers, and rank. Appendix A includes a list of all features and their types. We use “quadratic”, “logistic”, “categorical” and “ordinal” loss functions for numerical, boolean, categorical and sequential features, respectively. We use no regularizer when building the low-rank model since, when we added regularizers to the model, we observed that the cross validation test and train errors increased significantly.
Cross validation is used for finding a proper rank (k). We use a randomly selected portion ω of the observed samples as the validation set and the remaining observations (1−ω) as the training set. Additionally, cross validation is repeated five times with distinct validation sets for each set of parameters in the GLRM model. Since our input data is heterogeneous, we use two types of error for evaluating model performance. Mean Square Error (MSE) is used for calculating the model error on numerical features (only 11 features), and Misclassification Ratio (MCR) is used for the rest. Given the small number of numerical features, MSE is not a reliable measure for selecting the parameters, and we decide based on MCR.
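The hold-out split and the two error measures can be sketched in Python as follows. The sketch holds out individual observed entries of the data matrix, which is an assumed reading of the procedure above; the function names and the encoding of missing entries as None are also assumptions of this example, and fitting the GLRM model on the training entries is not shown.

```python
import random

def split_observed(A, omega, seed=0):
    """Randomly hold out a fraction omega of the observed entries of A.

    Returns (training, validation) as lists of (row, column) index pairs.
    """
    observed = [(i, j) for i, row in enumerate(A)
                for j, value in enumerate(row) if value is not None]
    rng = random.Random(seed)
    rng.shuffle(observed)
    n_validation = round(omega * len(observed))
    return observed[n_validation:], observed[:n_validation]

def mean_square_error(pairs):
    # pairs: (predicted, actual) values of numerical features.
    return sum((p - a) ** 2 for p, a in pairs) / len(pairs)

def misclassification_ratio(pairs):
    # pairs: (predicted, actual) values of non-numerical features.
    return sum(1 for p, a in pairs if p != a) / len(pairs)
```

Repeating the split with five different seeds for each candidate rank k and averaging the two errors would give curves of the kind described in this section.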
[Figure 2.2 plot. Title: MSE Error as a function of K. x-axis: K; y-axis: MSE Error (logarithmic scale). Legend: Train MSE and Test MSE for ω = 0.1, 0.2 and 0.5.]
Figure 2.2: GLRM model average Mean Square Errors (MSE) on five fold cross validation results. The horizontal solid lines indicate the MSE on the training data and the horizontal dotted lines indicate the MSE on the test data. The vertical lines indicate the standard error.
Figure 2.2 illustrates the average MSE, over five cross validations, on the train and test sets for three different values of ω. Figure 2.3 illustrates the MCR in the same way. As we can observe in Figure 2.3, the train and test errors are both at their minimum values after k = 70. Thus, we choose k = 70 as the proper rank and extract the X matrix, which has 153 samples and 70 latent features. As mentioned, we call this matrix the “concise data”.
[Figure 2.3 plot. Title: MCR Error as a function of K. x-axis: K; y-axis: MCR Error. Legend: Train MCR and Test MCR for ω = 0.1, 0.2 and 0.5.]
Figure 2.3: GLRM model average Misclassification Ratio (MCR) on five fold cross validation results. The horizontal solid lines indicate the MCR on the training data and the horizontal dotted lines indicate the MCR on the test data. The vertical lines indicate the standard error.
2.3.2 Clustering Concise Data
The three algorithms described in Section 2.2.3 are used to cluster the concise data. In this section, we first present and analyze each clustering result, and then we address the clusters over-enriched with patients having T1D secondary diseases. Table 2.2 gives an overview of the results of the three clustering methods.
Table 2.2: Overview of GLRM three clustering methods’ results
                               k-means        Hierarchical   Affinity Propagation
No. of clusters                10             9              11
Clusters’ Size Average ± SD    15.3 ± 18.2    17 ± 11.5      13.9 ± 10.3
Median Clusters’ Size          9.5            15             17
Maximum Clusters’ Size         63             38             36
Minimum Clusters’ Size         3              3              1
2.3.2.1 K-means Clustering
As explained in Section 2.2.3, we use the kmeans function in Matlab® for clustering the concise data. The number of desired clusters (k) is empirically set to ten; other clustering methods, which do not require the number of clusters to be specified a priori, indicated roughly the same number of clusters. Table 2.3 shows demographic statistics for the clusters produced by k-means.
We evaluate the clustering outcome using three statistical measures: the hypergeometric test, the odds ratio, and the risk ratio, which are described in Section 2.2.4.
Figure 2.4 presents the odds ratio for each complication and cluster. The number in each cell is the total number of patients with the specified complication in the specified cluster.
Table 2.3: Statistics of the patients clusters obtained using k-means clustering
Cluster        Size   Male (%)   Female (%)   Weight (kg)     Age
Cluster #1     14     7.1        92.9         81.0 ± 18.1     33.1 ± 7.2
Cluster #2     3      100.0      0.0          91.7 ± 20.6     25.7 ± 2.9
Cluster #3     11     54.5       45.5         80.1 ± 14.9     30.7 ± 6.0
Cluster #4     12     0.0        100.0        62.8 ± 7.2      27.1 ± 4.9
Cluster #5     7      14.3       85.7         68.7 ± 12.2     51.6 ± 10.7
Cluster #6     27     51.9       48.1         82.5 ± 16.0     47.9 ± 11.0
Cluster #7     4      50.0       50.0         111.1 ± 35.0    31.5 ± 7.6
Cluster #8     8      75.0       25.0         92.2 ± 15.7     37.6 ± 15.1
Cluster #9     63     46.0       54.0         77.5 ± 15.3     29.5 ± 6.4
Cluster #10    4      50.0       50.0         76.3 ± 15.1     29.8 ± 8.4
Values in the weight and age columns are the corresponding average ± standard deviation.
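As a worked check, the k-means column of Table 2.2 can be recomputed from the cluster sizes listed in Table 2.3. The Python sketch below does this with the standard library; the function name is hypothetical, and the use of the sample standard deviation is an assumption that happens to reproduce the reported value.

```python
import statistics

def summarize_sizes(sizes):
    return {
        "clusters": len(sizes),
        "mean": statistics.mean(sizes),
        "sd": statistics.stdev(sizes),  # sample standard deviation (assumed)
        "median": statistics.median(sizes),
        "max": max(sizes),
        "min": min(sizes),
    }

# Cluster sizes from the "Size" column of Table 2.3 (clusters #1 to #10).
kmeans_sizes = [14, 3, 11, 12, 7, 27, 4, 8, 63, 4]
summary = summarize_sizes(kmeans_sizes)
```

The sizes sum to the 153 patients in the dataset, and the rounded summary values agree with the k-means entries of Table 2.2 (15.3 ± 18.2, median 9.5, maximum 63, minimum 3).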
Figure 2.5 illustrates the three clusters with the highest odds ratio for each complication, with clusters of fewer than four members filtered out. Each bubble represents the odds ratio for a cluster with respect to the specified complication, and the bubble sizes are proportional to the number of patients with the given complication.
Table 2.4 shows the statistical measures for the cluster with the highest odds ratio for each complication, with clusters of fewer than four members filtered out.
The complete set of clustering evaluation measures is available in Section B.1 in Appendix B.
Table B.41 shows the odds ratio p-value for each cluster given a complication.