Unsupervised Data Analysis Methods for Qualitative and Quantitative Metabolomics and Metabonomics


Symposium on Biological Complexity: Emerging Concepts and Trends [Proceedings], 2010-11-27


NRC Publications Archive Record:

https://nrc-publications.canada.ca/eng/view/object/?id=13ffe8fd-429e-4f41-b715-b1e6bdb579b0


Result of K-means clustering of the studied dataset. (A) Silhouette and cluster memberships for spectral data with small bin size (0.0018 ppm); (B) silhouette and cluster memberships for spectral data with larger bin size (0.1838 ppm); (C) silhouette and cluster memberships for quantitative metabolite data for all 13 observed metabolites.

Data Set

For method testing we have developed a large dataset obtained from different mixtures of experimental NMR measurements of 13 metabolites at five different groups of concentrations (with 200 samples in each group). A random component is added to the concentration of each metabolite. Assuming that there is no chemical interaction between molecules in the mixture, the resulting NMR spectrum of a mixture is a direct sum of the spectra of its components. We have selected 13 water-soluble metabolites that were consistently observable in standard 1D ¹H NMR in several publications dealing with cell culture metabolomics (1-5). The FIDs for these metabolites were obtained from the Human Metabolome Database (8).
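The additive mixing model described above can be illustrated with a minimal sketch. The toy spectra, the group mean concentrations and the Gaussian form of the random component (relative standard deviation `jitter`) are assumptions made for illustration only; they are not the values or the noise model used to build the poster dataset.

```python
import random

def mixture_spectrum(component_spectra, concentrations):
    """Point-by-point sum of component spectra, each scaled by its concentration."""
    n_points = len(component_spectra[0])
    mixture = [0.0] * n_points
    for spectrum, conc in zip(component_spectra, concentrations):
        for i, intensity in enumerate(spectrum):
            mixture[i] += conc * intensity
    return mixture

def simulate_group(component_spectra, mean_concentrations, n_samples, jitter, rng):
    """Simulate one concentration group; each sample perturbs the group means.

    The Gaussian perturbation is an assumed noise model, clipped at zero so
    that concentrations stay non-negative.
    """
    samples = []
    for _ in range(n_samples):
        concs = [max(0.0, c + rng.gauss(0.0, jitter * c)) for c in mean_concentrations]
        samples.append(mixture_spectrum(component_spectra, concs))
    return samples

# Toy example: 3 hypothetical metabolites, 6 spectral points each.
spectra = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 2.0],
]
rng = random.Random(0)
group = simulate_group(spectra, [2.0, 1.0, 0.5], n_samples=200, jitter=0.1, rng=rng)
exact = mixture_spectrum(spectra, [2.0, 1.0, 0.5])
```

Here `exact` is the noise-free mixture for the group means, while `group` holds 200 perturbed samples of the same group.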

Spectra of the 13 selected metabolites (horizontal axis in ppm): Taurine, Succinic Acid, Proline, Lactic Acid, Isoleucine, Asparagine, Valine, Leucine, Glutamine, Alanine, Glutathione, Choline, Glutamic Acid.

Miroslava Čuperlović-Culf; Nabil Belacel

National Research Council of Canada, Institute for Information Technology, Moncton, NB, Canada


Metabolomics, or metabonomics, is one of the major high-throughput analysis methods and aims at holistic measurement of the metabolic profiles of biological systems. Data analysis approaches in metabolomics can broadly be divided into qualitative (analysis of spectral data) and quantitative (analysis of individual metabolite concentrations). Here we show examples of the application of the major previously utilized unsupervised analysis methods as well as a novel fuzzy clustering method, Fuzzy J-means. The testing was performed using qualitative as well as corresponding quantitative metabolite data derived to represent a large set of 2,000 objects. Spectra of mixtures were obtained from different combinations of experimental NMR measurements of 13 prevalent metabolites at five different groups of concentrations representing different phenotypes. The analysis shows the advantages and disadvantages of standard tools when applied specifically to metabolomics.

Salk, 2010


PCA

K-Means

Self Organized Maps (6)

Fuzzy K-Means (7)

Fuzzy J-Means (10)

Conclusions

Algorithms do not drive metabolomics investigations; however, the objectives of these investigations can only be achieved by utilizing an appropriate data treatment and analysis strategy at every step. Unfortunately, a perfect method for unsupervised analysis does not exist. However, many methods have been developed for various applications, with new and improved tools regularly presented in the literature. Thus, it is crucial to explore different methods for each application rather than relying completely on conclusions drawn from only one methodology. In metabolomics applications PCA is still by far the most popular unsupervised method, with only a few true clustering tools even tested. The analysis of the synthetic data presented in this work shows that clustering tools can provide additional information to PCA and should thus become a regularly exploited part of metabolomics investigations. PCA is fast and informative for the analysis of qualitative, spectral data; however, the analysis using, for example, bagged K-means, SOM or fuzzy K-means presented here shows that these methods can lead to better feature clustering even in the case of spectral data. Currently, further efforts are underway to obtain quantitative metabolic datasets that are as large as possible. For these types of data, unsupervised tools such as SOM, bagged K-means and fuzzy K-means should be preferred, as they provide much more information and much more accurate feature grouping than PCA.

References:

1. Duarte, I.F.; Marques, J.; Ladeirinha, A.F. et al. (2009) Anal Chem 81: 5023-5032.

2. Griffin, J.L.; Bollard, M.; Nicholson, J.K.; Bhakoo, K. (2002) NMR Biomed 15: 375-384.

3. Yang, C.; Richardson, A.D.; Smith, J.W.; Osterman, A. (2007) Pacif Symp Biocomp 12: 181-192.

4. Gottschalk, M.; Ivanova, G.; Collins, D.M. et al. (2008) NMR Biomed

5. Tiziani, S.; Lodi, A.; Khanim, F.L. et al. (2009) PLoS One 4: e4251.

6. Mäkinen, V.P.; Soininen, P.; Forsblom, C. et al. (2008) Mol Syst Biol 4: 167.

7. Cuperlovic-Culf, M.; Belacel, N.; Culf, A. et al. (2009) Magn Reson Chem 47: S96-S104.

8. Wishart, D.S.; Knox, C.; Guo, A.C. et al. (2009) Nucleic Acids Res 37(Database issue): D603-610.

9. Hageman, J.A.; van den Berg, R.A.; Westerhuis, J.A. et al. (2006) Crit Rev Anal Chem 36: 211-220.

10. Belacel, N.*; Cuperlovic-Culf, M.*; Laflamme, M.; Ouellette, R. (2004) Bioinformatics 20: 1690-1701.

Acknowledgement

MCC would like to thank Drs. A. Smilde, G. Zwanenburg and J. Hageman for providing the Matlab scripts for the Bagged K-means method.

In many applications of high-throughput methodology, the first level of statistical analysis places a strong emphasis on unsupervised approaches. Unsupervised data analysis is employed to obtain connections between samples and/or molecular features without biasing the results by the introduction of prior knowledge. In the current literature, metabolomics, similarly to other omics methods, has as its main objectives: a) examination of similarities and differences between samples based on metabolic profiles; b) exploration of similarities and differences between metabolites over time or between different phenotypes; c) sample classification from metabolic profiles; d) determination of major significantly different features.

The final data interpretation procedures can be divided into three groups. The first is descriptive analysis, which includes, for example, analysis of any correlation between metabolites or samples, or general statistical analysis of variances or deviations. The second group comprises unsupervised analysis approaches that are used for grouping of features (samples, metabolites or spectral points); it includes visualization (i.e. projection) and clustering methods. Finally, the third group comprises supervised analysis tools that are utilized for sample classification and/or for feature selection for biomarker discovery. The unsupervised methods are most appropriate for accomplishing objectives a) and b) (see above). In general, if sufficient information about the samples is available, objectives c) and d) are best accomplished using supervised approaches.

Fuzzy clustering provides a one-to-many mapping in which a single feature belongs to multiple clusters, each with a certain degree of membership. The memberships can further be used to discover more sophisticated relations between the data and the clusters found (Xu, 2005). The fuzzy approach is more desirable in situations where a majority of features, such as metabolites, participate in different networks and are governed by a variety of regulatory mechanisms, or, in the case of sample clustering, for samples that can be assigned to different groups depending on the observed characteristics. Fuzzy clustering is also robust to the high level of noise often present in omics data.

The result of a fuzzy clustering calculation is the matrix of membership degrees, which describes the level of similarity between each feature and each cluster centroid.
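As an illustration of what such a membership matrix looks like, the sketch below applies the standard fuzzy c-means membership formula, u_ik = 1 / Σ_j (d_ik / d_jk)^(2/(m-1)), to fixed centroids. It is assumed here that this textbook formula matches the membership definition used by the fuzzy K-means method on this poster; the function name, the toy points and the centroids are hypothetical.

```python
import math

def memberships(points, centroids, m=2.0):
    """Return one row per point with its membership in each cluster (rows sum to 1)."""
    if m <= 1.0:
        raise ValueError("fuzzification parameter m must be larger than 1")
    exponent = 2.0 / (m - 1.0)
    matrix = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        if 0.0 in dists:
            # The point coincides with a centroid: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            matrix.append([h / sum(hits) for h in hits])
            continue
        matrix.append(
            [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
        )
    return matrix

# Toy example: two centroids on a line and three points between them.
u = memberships([[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [4.0, 0.0]], m=2.0)
```

In this toy case the midpoint is shared equally between the two clusters, while points closer to the first centroid receive a larger membership in the first cluster.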

PCA is an appropriate overview tool used for initial analysis of the outliers, groups and trends in the data. However, PCA is not a classification method and thus, although it can provide some information about the sample types, it does not lead to separation of clusters of data. In addition, as it focuses only on the major differences in the data, it can lead to loss of information.
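To make the projection step concrete, here is a minimal sketch that extracts the first principal component of a small data matrix by power iteration on the covariance matrix and projects the samples onto it. Power iteration is used only to keep the example dependency-free; it is an illustrative choice, not the PCA implementation used for the results shown here, and the toy data are hypothetical.

```python
import math

def center(data):
    """Subtract the column means from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, n_iter=200):
    """Leading eigenvector of cov by power iteration (unit length)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data lying exactly on the line y = 2x.
samples = [[float(t), 2.0 * t] for t in range(10)]
centered = center(samples)
pc1 = first_component(covariance(centered))
scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]
```

The `scores` are the coordinates of the samples along PC1. Note that the projection orders the samples along the direction of largest variance but assigns no cluster labels, which is the limitation described above.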

PCA of spectral data with small (0.0018 ppm, top) and large (0.1838 ppm, above) bins. Boxes are (clockwise) the average spectrum used; plot of PC2 vs. PC1; PC1 vs. PC3 and PC2 vs. PC3.

PCA of quantitative metabolite data for all 13 observed metabolites. Boxes are (clockwise) values for the 13 metabolites included in the analysis for all subjects divided into 5 groups; plot of PC2 vs. PC1; PC1 vs. PC3 and PC2 vs. PC3.

Bagged K-Means (9)

(A) Histogram measure for spectral data with small bin size (0.0018 ppm); (B) histogram measure for spectral data with larger bin sizes (0.1838 ppm); (C) histogram measure for quantitative metabolite data for all 13 observed metabolites.

Bagged K-Means uses a resampling technique to deal with noise and in this way improve the accuracy of K-means clustering (9).
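The resampling idea can be sketched as follows: K-means is run on bootstrap resamples of the data, and the fraction of resamples in which two objects end up in the same cluster is recorded. This co-association construction is an assumption made for illustration; it is not the procedure of (9) nor the Matlab scripts used for the results above, and all function names and data are hypothetical.

```python
import math
import random

def kmeans(points, k, rng, n_iter=20):
    """Plain K-means; returns the final centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return centroids

def co_association(points, k, n_bags, rng):
    """Fraction of bootstrap runs in which each pair of objects shares a cluster."""
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_bags):
        bag = [rng.choice(points) for _ in range(n)]  # sample with replacement
        centroids = kmeans(bag, k, rng)
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / n_bags for count in row] for row in together]

# Toy example: two tight groups of five objects each.
points = [[0.0, 0.1 * i] for i in range(5)] + [[10.0, 0.1 * i] for i in range(5)]
stability = co_association(points, k=2, n_bags=30, rng=random.Random(1))
```

Pairs with values close to 1 are grouped together consistently across resamples, while values close to 0 indicate objects that are consistently separated.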

Feature separation determined by SOM analysis. Each map represents positions of samples for one group. (A) Spectral data with larger bin sizes (0.1838 ppm); (B) metabolite data for all 13 observed metabolites.

The SOM method belongs to a class of artificial neural networks capable of projecting high-dimensional input data onto a two-dimensional map utilizing a single-layered artificial neural network. The data objects are located at the input side of the network and the output neurons are organized as a two-dimensional grid. Cluster optimization is performed by having neurons compete for data objects. The neuron whose weight vector is closest to the current object becomes the winning unit, followed by reorganization of the neurons. This process of self-organization is repeated for several training cycles, resulting in a map that is able to locate any metabolic pattern of the input data with high reliability in its corresponding area on the map.
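The competition and reorganization steps described above can be sketched with a very small SOM. The linearly decaying learning rate, the Gaussian neighbourhood function and the parameter defaults are common textbook choices assumed here for illustration; they are not taken from the SOM implementation of (6).

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def train_som(data, rows, cols, n_epochs, rng, lr0=0.5, sigma0=1.0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood shrinks
            bmu = best_matching_unit(weights, x)    # winning unit
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for j in range(dim):
                    w[j] += lr * h * (x[j] - w[j])  # pull neuron towards x
            step += 1
    return weights

# Toy example: two groups of samples mapped onto a 1 x 2 grid.
group_a = [[0.05 * i, 0.0] for i in range(4)]
group_b = [[1.0 - 0.05 * i, 1.0] for i in range(4)]
som = train_som(group_a + group_b, rows=1, cols=2, n_epochs=50, rng=random.Random(0))
units_a = {best_matching_unit(som, x) for x in group_a}
units_b = {best_matching_unit(som, x) for x in group_b}
```

After training, `units_a` and `units_b` give the map positions occupied by each group, which is the information displayed in one map per group in the SOM figure.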

Heat map representation of the membership values obtained from F-KM analysis. (A) Membership values for spectral data with small bin size (0.0018 ppm); (B) membership values for spectral data with larger bin sizes (0.1838 ppm); (C) membership values for quantitative metabolite data for all 13 observed metabolites.

Heat map representation of the membership values obtained from F-JM analysis. (A) Membership values for spectral data with small bin size (0.0018 ppm); (B) membership values for quantitative metabolite data for 13 observed metabolites.

The F-JM method takes advantage of the fact that the “fuzzification” parameter has to be larger than 1 (where 1 leads to a crisp clustering result). Inclusion of this information in the calculations of membership values and centroid positions leads to a faster and more accurate method for fuzzy clustering. In addition, the F-JM method can be combined with the metaheuristic Variable Neighbourhood Search (VNS), which helps in the search for a global minimum rather than only a local one.

In both F-KM and F-JM, different levels of fuzziness can be obtained by changing the fuzzification parameter. A procedure for determining the optimal parameters is described in (10).
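The effect of the fuzzification parameter can be seen in a short numerical sketch. It assumes the standard fuzzy c-means membership formula with exponent 2/(m-1) and uses hypothetical distances from one object to three centroids; it does not reproduce the parameter selection procedure of (10).

```python
def membership_row(distances, m):
    """Memberships of one object given its distances to each centroid (m > 1)."""
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
            for d_i in distances]

# Hypothetical distances from one object to three cluster centroids.
distances = [1.0, 2.0, 3.0]
rows = {m: membership_row(distances, m) for m in (1.1, 2.0, 5.0)}
```

With m close to 1 nearly all membership goes to the nearest centroid (an almost crisp assignment), whereas larger values of m spread the membership more evenly across the clusters.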