• Aucun résultat trouvé

The impact of gene expression analysis on the classification and prediction of patients’ medical conditions

N/A
N/A
Protected

Academic year: 2021

Partager "The impact of gene expression analysis on the classification and prediction of patients’ medical conditions"

Copied!
9
0
0

Texte intégral

(1)

Publisher’s version / Version de l'éditeur:

Proceedings. Industrial Conference on Data Mining - Poster and Industry

Proceedings, 2010

READ THESE TERMS AND CONDITIONS CAREFULLY BEFORE USING THIS WEBSITE.

https://nrc-publications.canada.ca/eng/copyright

Vous avez des questions? Nous pouvons vous aider. Pour communiquer directement avec un auteur, consultez la première page de la revue dans laquelle son article a été publié afin de trouver ses coordonnées. Si vous n’arrivez pas à les repérer, communiquez avec nous à PublicationsArchive-ArchivesPublications@nrc-cnrc.gc.ca.

Questions? Contact the NRC Publications Archive team at

PublicationsArchive-ArchivesPublications@nrc-cnrc.gc.ca. If you wish to email the authors directly, please see the first page of the publication for their contact information.

NRC Publications Archive

Archives des publications du CNRC

This publication could be one of several versions: author’s original, accepted manuscript or the publisher’s version. / La version de cette publication peut être l’une des suivantes : la version prépublication de l’auteur, la version acceptée du manuscrit ou la version de l’éditeur.

Access and use of this website and the material on it are subject to the Terms and Conditions set forth at

The impact of gene expression analysis on the classification and

prediction of patients’ medical conditions

Famili, Fazel; Phan, Sieu; Liu, Ziying; Pan, Youlian

https://publications-cnrc.canada.ca/fra/droits

L’accès à ce site Web et l’utilisation de son contenu sont assujettis aux conditions présentées dans le site LISEZ CES CONDITIONS ATTENTIVEMENT AVANT D’UTILISER CE SITE WEB.

NRC Publications Record / Notice d'Archives des publications de CNRC:

https://nrc-publications.canada.ca/eng/view/object/?id=853a706e-1df7-409e-bc2f-85ca1b3f83ab https://publications-cnrc.canada.ca/fra/voir/objet/?id=853a706e-1df7-409e-bc2f-85ca1b3f83ab

(2)

The impact of gene expression analysis on the

classification and prediction of patients’ medical conditions

Fazel Famili , Sieu Phan, Ziying Liu and Youlian Pan

Knowledge Discovery Group, Institute for Information Technology, National Research Council Canada, 1200 Montreal Road, Ottawa,

Ontario, K1A 0R6, Canada

{fazel.famili, sieu.phan, ziying.liu, youlian.pan}@nrc-cnrc.gc.ca

Abstract. Proper analysis and validation of gene expression data is not a trivial task. Most of the existing approaches identify individual or group of markers/signatures where the validation only goes as far as literature and biological experimental validation. In this paper we discuss a framework that could be considered to develop and validate patterns from gene expression data and work towards future clinical test kits.

Keywords: Gene expression analysis, patient classification, medical condition prediction

1 Introduction

With rapid development of tools and support systems in life sciences, medical diagnosis and prognosis is going through an evolving paradigm. This change of paradigm is so fast that even some of the procedures that were valid as far as five years ago may no longer be fully acceptable or sufficient for today’s modern medicine. Consequently, the ability to identify the key elements related to classification of patients (where healthy and disease or its subtypes matter) has rapidly changed. Two issues related to the newly generated life science data are important to discuss here: (i) data type and its volume, (ii) data complexity and the need for integration.

High throughput data consists of several forms, most of which comes from a common starting point (e.g. samples, tissue, cell lines). Examples are genomics, proteomics, metabolomics, and of course, associated sequence data (DNA or protein sequences). The volume of these data, initially motivated by the completion of human genome, has rapidly increased over the last decade so that it is now being considered as 50-100% increase every year [1, 2, 3]. Other factors have also contributed to this, such as computational powers, data storage and understanding the real value of this data. Although some researchers have indicated that the superabundance of this data and the associated information may have created a scarcity of attention, we can

(3)

provide several examples where small or large amount of data has resulted in novel discoveries leading to applications and even tools that would be helpful for improved life sciences. Examples are: (i) biomarkers discovery and disease signatures identifications, (ii) better disease understanding, (iii) improved therapeutic (better personalized medical care) and more efficient drug discovery, and (iv) better patient monitoring and understanding the drug efficacy.

All of these have created an enormous business opportunities and life science industry applications where successful applications have been deployed or even products (such as tool kits) have been introduced. Two examples are:

- Agendia, an innovative cancer diagnostics company based on genomics, uses its knowledge and expertise in the application of clinically useful gene expression profiling to facilitate breast cancer diagnosis, prognosis and treatment. Motivated by the Human Genome Project and a genomics microarray platform, Agendia’s MammaPrint breast cancer test has been designed to determine a patient’s risk of metastases, therefore providing a more personalized approach to breast cancer therapy.

- ArticDX, a Canadian company that is developing two molecular tests, Colo Risk™, for Colorectal Cancer and Macula Risk™ for Age-related Macular Degeneration, developed genetic tests that are based on their own proprietary markers discovered by ARCTIC researchers. They are partnering with other researchers to support the translation of their gene discoveries into applications that can be of benefit to doctors and patients in the healthcare setting.

Our objective in this paper is to briefly highlight the importance and impact of high throughput data in life sciences with a focus on gene expressions data, its volume, crucial role of data analysis and potential industrial applications. We provide a brief literature review in the next section followed by a spectrum of applications. Focusing on a small part of this spectrum, we then introduce our proposed framework, explain its application, and provide some discussions that include our ongoing research in this area.

2 Related work

Gene expression analyses based on the use of microarray data allow a simultaneous investigation of the expression patterns of tens of thousands of genes, which provide a powerful approach for understanding the intricate process of tumor development and progression in a global view. Gene expression patterns are thought to be different for various tissue types, at different stages of development, disease vs. normal tissue, and among disease subtypes, which can be used to improve accuracy of diagnostics, predictive markers of treatment response and management of patients’ treatment plan.

Since the emergence of microarray techniques in 1995[4, 5], an enormous number of gene expression knowledge discovery efforts have attempted to understand the genes involved in malignant development and disease progression, to improve the accuracy of diagnosis and prognosis. Golub’s study [6] is one of the pioneers in using gene expression data to discover more efficient diagnostic approach. In their study, a

(4)

set of genetic markers of different types of leukemia were obtained for classification and diagnostics of leukemia, in which the separation of AML (acute myeloid leukemia) and ALL (acute lymphoblastic leukemia) on the basis of gene expression analyses was demonstrated. Methods called “neighborhood analysis” and “weighted vote” were applied to build class predictors. SOM clustering algorithm was applied for class discovery. Recently, Andersson et al [7] conducted gene expression analyses on 121 pediatric acute leukemia patients based on lineage and genetic subtype with high accuracies and validated the results with an independent dataset on B-lineage ALL. This study of a consecutive series of childhood leukemias confirms and extends previous reports demonstrating that global gene expression profiling provides a valuable tool for genetic and clinical classification of childhood leukemias.

Gene expression analyses have been applied in many other kinds of cancer /cancer subtype classifications. For instance, Nutt et al [8] investigated whether gene expression profiling, coupled with class prediction methodology, can classify high-grade gliomas in a way more objective, explicit, and consistent than standard pathology. Supervised learning approacheswere used to build a two-class prediction model based on a subsetof glioblastomas and anaplastic oligodendrogliomas with classic histology. The experimental results showed that the model provided a more accurate prediction of prognosis in these nonclassic lesions than did pathological classification. They suggested that class prediction models, based on defined molecular profiles, classify diagnostically challenging malignant gliomas in a manner that better correlates with clinical outcome than does standard pathology. Another application in GBM (Glioblastoma multiforme), is the study performed by Liang et al [9], which used gene expression analysis to perform on surgical samples of brain tumors to characterize inter-tumor variability at molecular level. They revealed molecular signatures that reflect underlying pathogenic mechanisms and identified molecular markers that associate with survival. Agglomerative hierarchical clustering was combined with a two-step algorithm to identify survival-associated genes, in which Cox regression coefficients were calculated for all clustered genes, and a moving average of these values was calculated for consecutive genes in the hierarchically clustered list. There are many other applications, include melanoma [5], ovarian cancer [10], lung cancers [11, 12, 13, 14, 15], and gastric cancers [16], colon cancer[17, 18, 19] etc.

The current clinical and histopathological criteria used today for diagnosis and prognosis of tumor patients are inadequate for personalized treatment and clinical outcome prediction. van’t Veer [20]’s work demonstrated that the gene expression profile will outperform all currently used clinical parameters in predicting disease outcome. In their study [20], van’t Veer and colleagues identified 70 gene prognosis-signatures as a predictor of survival in breast cancer. They used DNA microarray analysis on primary breast tumors of 117 young patients, and applied supervised classification to identify a gene expression signature strongly predicative of distant metastases in patients without tumor cells in local lymph nodes at diagnosis. The 70 gene prognosis profile has been further validated with 295 patient samples in van de Vijver et al’s [21] work. They concluded that the gene-expression profile is a more powerful predictor of the outcome of disease in young breast cancer patients than the standard systems based on clinical and histological criteria. Gene-expression profiling

(5)

can help distinguish between patientsat high risk and those at low risk for developing distant metastases,and so identify patients for adjuvant therapy.

3 Spectrum of data and its data analysis applications

Data in today’s life sciences consists of several types. Fig.1.shows a conceptual view of this data that consists of four types:

(i) Medical records which represent information such as observations and treatment procedures, mostly in the form of text and set of attributes. This information is becoming more and more digitized with today’s electronic health records,

(ii) Clinical data, such as pathologists reports, lab reports, X-rays, CT and MRI scans, etc, also mostly digitized, from which data analysis features could be derived. This information may be linked and also integrated with medical records,

(iii) High throughput data, which is perhaps the most granular pieces of information in today’s life sciences that one could expect to generate for selected patients. This information is not as easily available and it is normally obtained through an elaborate and sometimes complex process. Examples are converting a tissue obtained through a biopsy to a microarray and its associated data.

(iv) There may still be other types of data that are not related to any of the above categories and yet may be very helpful in our data analysis. These are data that are not collected under any specific circumstances. Genetic information is one example.

The focus of this short paper is only on the genomics aspect of the high throughput biological data. Furthermore, our intention is to discuss:

- gene expression analysis - classification and prediction - patient’s medical conditions,

More specifically, our objective would be to investigate topics such as a person: (i) being sick or healthy, (ii) belonging to any sub-class of a particular disease, or (iii) monitoring patients when they are treated, their disease status, progression etc.

4 A proposed framework and its application

In this section, we outline a generic framework to exploit the microarray technology for the stratification of disease types or the prediction of a patient’s medical condition such as drug resistance or predisposition to the development of certain illnesses.

(6)

The proposed framework consists of 3 basic steps. First from the gene expression data, we identify sets of highly discriminatory genes that can differentiate the different medical conditions under consideration. This process is summarized in Fig. 2. Along with many other researchers, we have developed methodologies to first identify list of differentially-expressed gene lists and then to narrow down these lists to sets of highly discriminatory gene sets.

Fig. 1. Data in Life Sciences

High throughput data generation: - Genomics,

- Proteomics, - Metabolomics, - Sequence Data,

- Other high throughput data

Validation Applications - Medical records collection (text data) - Sample collection

- Others (e.g. Pathology, CT, MRI, X-ray, ultra sound)

Knowledge Discovery

Data preprocessing: - Normalization

- Data characteristics checking - Statistics - … … Pattern Discovery: - Supervised methods - Unsupervised methods - … …

(7)

Patient Clinical Data Prior Knowledge Differentially-Expressed Gene Analysis Discriminatory Analysis Highly Discriminatory Gene Sets Prior Knowledge

Fig. 2. Identification of highly discriminatory gene sets

The second step is to feed the gene sets identified in Step 1 to a machine learning (ML) module that would discover more reliable models. This would be the core of our reasoning module. The role of the reasoning module is to predict patients outcome based on gene expression intensity of the diagnostics panel and patients’ clinical information. One can then construct clinical test kits that could be used by clinical laboratories or at doctors’ offices to perform the more accurate diagnosis.

To illustrate the applicability of the framework, we have experimented the first 2 steps of the framework with a case study of patients’ response to chemotherapy. The study is based on the dataset published in [22]. A group of AML patients were treated with intensive cytarabine-based induction and consolidation. 33 patients were selected for the study: 22 had good response and 11 had poor response. Preliminary analysis indicated promising results. The process identified a prognostics gene panel of 51 genes and constructed a companion reasoning module consisting of 34 models. The overall sensitivity, specificity, positive and negative predictive rates were all high. Detailed of the study will be reported when completed.

5 Discussion and future work

In this paper we have briefly provided an overview of data and its scope in today’s life sciences and highlighted an impact that gene expression data could make in the advancement of classification and prediction of patients’ medical conditions. We have discussed a framework that could be considered to develop and validate future clinical test kits. We have preformed a preliminary study using this framework on a group of AML patients where results were promising. We are currently investigating use of time series leukemia data to obtain models for patient response analysis. Our future work also includes an integrated approach where multiple forms of omics data will be used to generate consistent markers and sets of more biologically plausible signatures.

(8)

References

1. Edgar, R., Domrachev, M., Lash, A. E.: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 1;30(1):207-10 (2002)

2. Benson, D. A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D. L.: GenBank Nucleic Acids Res. 1; 33(Database Issue): D34–D38 (2005)

3. Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M,, Sirotkin, K.: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 1;29(1):308-11 (2001)

4. Schena, M., Shalon, D., Davis, R. W., Brown, P. O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 20;270(5235):467-70 (1995)

5. DeRisi, J., Penland, L., Brown, P. O., Bittner, M. L., Meltzer, P. S., Ray, M., Chen, Y., Su, Y. A., Trent, J. M.: Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat Genet. 14(4): 457-460 (1996)

6. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., Lander, E. S.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science Vol. 286. no. 5439, pp. 531 – 537 (2009)

7. Andersson, A., Ritz, C., Lindgren, D., Eden, P., Lassen C., Heldrup, J., Olofsson, T., Rade, J., Fontes, M., Porwit-MacDonald, A., Behrendtz, M., Hoglund, M., Johansson, B., Fioretos, T.: Microarray-based classification of a consecutive series of 121 childhood acute leukemias: prediction of leukemic and genetic subtype as well as of minimal residual disease status. Leukemia. 21, 1198–1203 (2007)

8. Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M. von Deimling, A., Pomeroy, S. L., Golub, T. R., Louis2, D. N.: Gene Expression-based Classification of Malignant Gliomas Correlates Better with Survival than Histological Classification. CANCER RESEARCH 63, 1602–1607 (2003)

9. Liang, Y., Diehn, M., Watson, N., Bollen, A. W., Aldape, K. D., Nicholas, M. K., Lamborn, K. R., Berger, M. S., Botstein, D., Brown, P. O., Israel, M. A.: Gene expression profiling reveals molecularly and clinically distinct subtypes of glioblastoma multiforme.

Proc Natl Acad Sci U S A. 102(16):5814-9 (2005)

10. Welsh, J. B., Zarrinkar, P. P., Sapinoso, L. M., Kern, S. G., Behling, C. A., Monk, B. J., Lockhart, D. J., Burger, R. A., Hampton, G. M.: Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci U S A. 30;98(3):1176-81 (2001)

11. Raponi, M., Zhang, Y., Yu, J. et al.: Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res. 66(15),7466-7472 (2006) 12. Bhattacharjee, A., Richards, W. G., Staunton, J. et al.: Classification of human lung

carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses.

Proc. Natl Acad. Sci. USA 98(24),13790-13795 (2001)

13. Garber, M. E., Troyanskaya, O. G., Schluens, K., et al.: Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl Acad. Sci. USA 98(24),13784-13789 (2001) 14. Beer, D. G., Kardia, S. L., Huang, C. C., et al.: Gene-expression profiles predict survival

of patients with lung adenocarcinoma. Nat. Med. 8(8),816-824 (2002)

15. Hayes, D. N., Monti, S., Parmigiani, G., et al.: Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts. J. Clin. Oncol. 24(31),5079-5090 (2006)

(9)

16. Hippo, Y., Taniguchi, H., Tsutsumi, S., Machida, N., Chong, J.M., Fukayama, M., Kodama, T., and Aburatani, H. Global gene expression analysis of gastric cancer by oligonucleotide microarrays. Cancer Res. 62, 233–240 (2002)

17. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., Levine, A. J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 8;96(12):6745-50 (1999)

18. Giacomini, C. P., Leung, S. Y., Chen, X., Yuen S. T., Kim Y. H., Bair E., Pollack, J. R.: A Gene Expression Signature of Genetic Instability in Colon Cancer. Cancer Res 65: (20). (2005)

19. Bittner, M. , Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., Sondak, V., Hayward, N., Trent, J.: Molecular classification of cutaneous malignant melanoma by gene expression profiling.

Nature 406, 536-540 (2000)

20. van’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., et al.: Gene expression profi ling predicts clinical outcome of breast cancer. Nature 415 : 530 – 6 (2002)

21. van de Vijver, M. J., He, Y. D., van’t Veer, L. J., Dai, H., Hart, A. A., Voskuil, D. W., et al.: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med 347 (2002)

22. Heuser, M., Wingen, L. U., Steinemann, D., Cario, G., von Neuhoff, N., Tauscher, M., Bullinger, L., Krauter, J., Heil, G., Döhner, H., Schlegelberger, B., Ganser, A.,: Gene-Expression Profiles and their Association with Drug Resistance in Adult Acute Myeloid Leukemia. Haematologica vol. 90(11), pp.1484-92. (2005)

Figure

Fig. 1. Data in Life Sciences

Références

Documents relatifs

I.43 Ouais. Et puis tu leur fais ben lire à haute voix les deux définitions. Et puis à la fin tu leur dis « à votre avis de quoi je veux parler » enfin tu utilises ce

« C’est ça le syndicalisme, parce que c’était le début… le syndicalisme c’était un moyen pour faire changer la précarité, les brimades, le truc anti social qu’il y

Retinoic acid-treated (adipogenic conditions) or not (skeletal myogenic conditions) EBs from wild type CGR8 ES cells were induced to differentiate and treated or not by 100 mM

In the following, we will also need the fact that if the set of vectors given as input to the LLL algorithm starts with a shortest non-zero lattice vector, then this vector is

Calibration results and sensitivity analysis of the TOPMODEL show that, for the Hod- der catchment, the DBM model explains the data a little better and, in the view of the

مﺎﻗ ﻲﺗﻟا ﻞﯾﻠﺣﺗﻟا قﺎﻣﻋأ ﻲﻋدﺗﺳ اذﻫو "تدﻟوﺑﻣ ﻫود موﯾﻏ" فوﺳﻠﯾﻔﻟا ﺎﻬﺑ (guilloume de hamboldt) مﻼﻛﻠﻟ ﻲﻠ وﻫ ﺎﻣ ﻓ دوﺟوﻣﻟا صﻧﻟا نأ ﻒﯾ رﺑﺗﺧﺗ ﺔ ﺧرﺎﺗﻟا ﺔ

Grande Bénichon Recrotzon: dimanche 19 octobre Danse avec excellent orchestre. Traditionnels menus de bénichon et spécialités

Ces sources (cerveau, endothélium) sont principalement stimulées lors de l’EX, les désignant ainsi comme possiblement à l’origine des effets positifs de.. Néanmoins, les