
Pattern Discovery from Biological Data

Jesmin Nahar
Central Queensland University, Australia

Kevin S. Tickle
Central Queensland University, Australia

A B M Shawkat Ali
Central Queensland University, Australia

Abstract

Extracting useful information from structured and unstructured biological data is crucial in the health industry. For example, medical practitioners need to identify breast cancer patients at an early stage, estimate the survival time of a heart disease patient, or recognize uncommon disease characteristics that suddenly appear. There is currently an explosion of biological data available in databases, but information extraction and true open access to data require time to resolve issues such as ethical clearance. The emergence of novel IT technologies allows health practitioners to carry out comprehensive analyses of medical images, genomes, transcriptomes, and proteomes in health and disease. The information extracted with such technologies may soon dramatically change the pace of medical research and considerably affect the care of patients. This research reviews the existing technologies used in heart disease and cancer research and then proposes some possible solutions to overcome their limitations. In summary, the primary objective of this research is to investigate how existing modern machine learning techniques, with their strengths and limitations, are being used in the identification of heartbeat-related disease and the early detection of cancer in patients. After an extensive literature review, the following objectives were chosen:

• to develop a new approach to find the association between diseases such as high blood pressure, stroke and heartbeat;
• to propose an improved feature selection method to analyze huge image and microarray databases for machine learning algorithms in cancer research;
• to find an automatic distance function selection method for clustering tasks;
• to discover the most significant risk factors for specific cancers;

Introduction

Clinical decisions are often made on the basis of the medical practitioner's knowledge and experience rather than on the knowledge hidden in large databases. The limitations of this practice include unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients (Palaniappan & Awang, 2008). It is therefore important to discover the hidden knowledge in medical databases in order to provide better patient care. Li et al. (2004) argue:

• ‘Data mining techniques can be successfully applied to ovarian cancer detection with a reasonably high performance’ (Li et al., 2004).

• ‘The classification using features selected by the genetic algorithm consistently outperformed those selected by statistical testing in terms of accuracy and robustness’ (Li et al., 2004).

Similarly, many researchers in the machine learning community have found that molecular-level classification of human tissues has produced remarkable results, indicating that gene expression methods could significantly aid the development of efficient cancer diagnosis and classification platforms (Brown et al., 2000; Lu & Han, 2003; Statnikov et al., 2005; Yeh et al., 2007). Machine learning techniques have also been widely explored in heart disease (Ordonez et al., 2001), and computer-based medical image interpretation is a major application area that provides significant support in medical diagnosis (Coppini et al., 1995; Zhu & Yan, 1997), among many others.

Magoulas & Prentza (2001) suggest that, since the understanding of biological systems is not complete, there are essential features and information hidden in physiological signals that are not readily apparent. Moreover, the effects between the different subsystems are not distinguishable. Biological signals are characterized by substantial variability, caused either by spontaneous internal mechanisms or by external stimuli. Associations between the different parameters may be too complex to be solved with conventional techniques. Modern Machine Learning (ML) methods rely on these sets of data, which can be produced more easily, and can help to model the nonlinear relationships that exist between these data and to extract parameters and features that can improve current medical care (Magoulas & Prentza, 2001).

A further objective is to determine the preventive factors for specific cancers that are aligned with the most significant risk factors. We therefore propose a research plan to attain these objectives within this chapter. The proposed solutions are as follows: new heartbeat identification techniques are expected to reveal associations between heartbeat patterns and diseases; sensitivity-based feature selection methods will be applied to early cancer patient classification; meta-learning approaches will be adopted in clustering algorithms to select a distance function automatically; and the Apriori algorithm will be applied to discover the significant risk and preventive factors for specific cancers. We expect this research to make a significant contribution to medical professionals by enabling more accurate diagnosis and better patient care. It will also contribute to other areas such as biomedical modeling, medical image analysis and early disease warning.

As shown in Figure 1, cancer and heart disease are the top two causes of death in the United Kingdom. Finding patterns in the data that can assist in the early detection of these diseases will have a significant impact on human health.

In summary, this chapter proposes some new solutions to overcome the existing limitations in cancer and heart disease research by employing intelligent and statistical machine learning techniques to identify valid, novel, potentially useful and ultimately understandable patterns in large biological data repositories.

This chapter provides a minimal background in human biology to help the reader understand the remainder of the research. A more elaborate introduction can be found in Marieb & Hoehn (2006) and Marieb & Mitchell (2007).

Biological Background

Human Biology

Cancer is a serious threat to human health all over the world. Cancer is a group of diseases that share the common feature of uncontrolled cell growth and the ability to infiltrate and destroy normal body tissue. Cancer cells may spread through the blood and lymph systems to other parts of the body. Cancer is caused by molecular events that change the properties of cells. In cancer cells, the normal control systems that prevent cell overgrowth and the invasion of other tissues are generally disabled. The abnormal cells, which divide in an uncontrolled way, form a tumour. A tumour can be malignant (life threatening) or benign (not dangerous). A malignant tumour might attack surrounding organs and tissues in a process called metastasis. The abnormalities in cancer cells result from mutations in genes that control cell division. Over time more genes become mutated. As a result, the number of mutations increases, resulting in further abnormalities in the cell. Over time a number of mutated cells die, but abnormal cells multiply much more rapidly than normal cells.

Figure 1. Comparative death rates for different causes in UK (King & Bobins, 2006)

There are three types of faulty genes. The first group, called proto-oncogenes, enhance cell division; the mutated forms of these genes are called oncogenes. The second group, called tumor suppressors, prevent cell division. The third group, DNA repair genes, prevent mutations that lead to cancer. Different types of chromosomal aberrations can occur in cancer cells. For example, a part of a chromosome can have more than the normal two copies. A portion of a chromosome may also have only one copy or even no copies. The final type of aberration occurs when a part of a chromosome moves to a different location, an event called relocation. It should be mentioned that not all aberrations cause cancer (Jong, 2006).

There are more than 100 different types of cancer occurring in humans.

The causes of cancer may be inherited or acquired. The main categories of cancer are:

• Carcinoma: cancer that begins in the skin or in tissues that line or cover internal organs.
• Sarcoma: cancer that starts in bone, fat, muscle, blood vessels, cartilage or other connective or supportive tissue.
• Leukemia: cancer that begins in blood-forming tissue such as the bone marrow and causes large numbers of abnormal blood cells to be produced and enter the blood.
• Lymphoma and myeloma: cancers that begin in the cells of the immune system.
• Central nervous system cancers: cancers that start in the tissues of the brain and spinal cord (Cancer.org, 2008).

Figure 2. The cancer-affected area of the bladder is indicated by a circle. (Source: http://www.ecureme.com/emyhealth/data/Bladder_Cancer.asp)

Types of Cancers

Among the vast range of cancers the following are the most common: Bladder Cancer, Breast Cancer, Cervical Cancer, Colon Cancer, Lung Cancer, Prostate Cancer, Leukemia, and Skin Cancer.

Figure 3. Normal breast with lobular carcinoma in situ (LCIS) in an enlarged cross-section of the lobule. Breast profile: A ducts, B lobules, C dilated section of duct to hold milk, D nipple, E fat, F pectoralis major muscle, G chest wall/rib cage. Enlargement: A normal duct cells, B ductal cancer cells, C basement membrane, D lumen (center of duct). (Source: http://www.cancer.org/docroot/CRI/content/CRI_2_2_1X_What_is_breast_cancer_5.asp?sitearea=)

Bladder Cancer

Bladder cancer is a cancer in which abnormal cells multiply without control in the bladder. In most cases, bladder cancer begins in the cells lining the inside of the bladder and is called urothelial cell carcinoma or transitional cell carcinoma (UCC or TCC). If the cancer is confined to the lining, it is called superficial bladder cancer; if it spreads into the muscle wall of the bladder, it is called invasive bladder cancer (Wikipedia, 2008). The American Cancer Society estimated that about 68,810 new cases of bladder cancer would be diagnosed in the United States in 2008 (about 51,230 men and 17,580 women).

The chance of a man developing bladder cancer at any time during his life is about 1 in 27; for a woman it is about 1 in 85. In 2008, there were about 14,100 deaths from bladder cancer in the United States (about 9,950 men and 4,150 women) (Cancer.org, 2008). A cancer-affected bladder is shown in Figure 2.

Figure 4. Cervical cancer (Source: http://www.righthealth.com/Health/Cervical_Cancer_Pictures/-od-definition _adam _2%25252F9163-s)

Figure 5. Lung cancer (Source: http://search.live.com/images/)

Breast Cancer

Breast cancer develops when a number of the cells in the breast start to grow out of control and may spread to other parts of the body (Cancer.org, 2008). In Australia, one in eight women will develop breast cancer before the age of 85, and about 12,000 women as well as 84 men were diagnosed with breast cancer in 2002.

Statistics show that breast cancer caused 502,000 deaths (7% of cancer deaths; almost 1% of all deaths) worldwide in 2005. In the USA, breast cancer death rates for women are higher than those for any other cancer besides lung cancer (Breast Cancer Statistics, 2008). It is predicted that by 2011 the number of new diagnoses will increase to about 14,800 women and 122 men (Wikipedia, 2008). A cancer-affected breast is shown in Figure 3.

Cervical Cancer

Cervical cancer is a type of cancer that starts in the cervix, the lower part of the uterus (womb) that opens at the top of the vagina and forms part of the female reproductive system (Figure 4). In the majority of cases, early cervical cancer has no symptoms. Cervical cancer has two stages: the early or pre-invasive stage, and the late or invasive stage. Cervical cancer is a major global health problem with high prevalence and death rates. It is the number one killer of young women in underdeveloped countries. Among Korean women, for example, cervical cancer is the third most common form of cancer, after stomach and breast cancer (Herzog, 2003). The American Cancer Society estimated that about 11,070 cases of invasive cervical cancer were diagnosed in the United States in 2008. Research has shown that non-invasive cervical cancer (carcinoma in situ) is about 4 times more common than invasive cervical cancer (Cervical cancer, 2008).

In 2008, about 3,870 women died from cervical cancer in the United States. The rate among Hispanic women is over twice that among non-Hispanic white women, and African-American women develop this cancer about 50% more often than non-Hispanic white women (Cervical cancer, 2008).

Lung Cancer

There are various types of lung cancer, depending on which cells are affected. The majority of lung cancers begin in the cells that line the bronchi (Wikipedia, 2008). Statistics show that both men and women die from lung cancer more often than from any other type of cancer. Lung cancer accounted for more deaths than breast cancer, prostate cancer, and colon cancer in 2004. In that year, 108,355 men and 87,897 women were diagnosed with lung cancer, and 9,575 men and 68,431 women died from lung cancer. In 2008, there were about 215,020 new cases of lung cancer, including 114,690 among men and 100,330 among women (Cancer.org, 2008). A cancer-affected lung is shown in Figure 5.

Figure 6. Prostate cancer (Source: http://health.allrefer.com/pictures-images/prostate-cancer.html)

Prostate Cancer

This type of cancer originates in the prostate, with an uncontrolled (malignant) growth of cells in the prostate gland. It is a common malignancy in men older than 50 years of age, and it may cause no symptoms in the early stage. Prostate cancer may remain in the prostate gland or may spread to nearby lymph nodes and to other parts of the body. In the USA, prostate cancer is one of the most frequently diagnosed cancers.

Statistics show that in 2004, 189,075 men were diagnosed with prostate cancer and 29,002 men died from it. It is the second leading cause of death among males in the USA. In Australia, one in nine males is affected by prostate cancer by the age of 75 and one in five by the age of 85. The death rate from prostate cancer is lower: one in 84 men under the age of 75 and one in 22 by the age of 85 (Pharmacy, 2008). Prostate cancer is shown in Figure 6.

Skin Cancer

Skin cancer is the most common form of cancer, with an estimated 1 million or more new cases occurring annually. Skin cancer occurs in the outer layers of the skin. The annual rates of all forms of skin cancer are increasing each year, representing a growing public concern. It has been estimated that nearly half of all Americans who live to age 65 will develop skin cancer at least once (Cpaaindia, 2008).

Skin cancer is shown in Figure 7.

Data mining research has used two kinds of methods to identify cancer in humans: image-based and microarray-based methods. The following section describes recent developments in cancer research involving data mining techniques.

Figure 7. Skin cancer (Source: http://www.taconichills.k12.ny.us/webquests/noncomdisease/skin%20 cancerpic.jpg)

Machine Learning Background

This section reviews the existing solutions for cancer risk factor extraction, feature selection for huge databases, gene clustering and, finally, cancer patient classification from biological data. A theoretical and technical discussion of the existing techniques follows.

Machine learning: An overview

Before explaining machine learning, we need to clarify what learning is. Learning is a process of gathering knowledge. Similarly, a machine gathers knowledge from data, graphs, images and other sources through computer programming. Machine learning is essentially a matter of finding statistical regularities or other patterns in the data. Mitchell (1997) defines machine learning as follows: ‘A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’

Now, given a set of attributes about potential cancer patients, and whether those patients actually had cancer, the computer can learn how to distinguish between likely cancer patients and possible false alarms. In Figure 8, the symbol © denotes people who already have cancer and ® denotes people who are free from cancer. The first task in machine learning is data processing. A non-linear mapping is used to transform the data. After transformation, the data become almost linearly separable. The machine then tries to construct a boundary, such as a line, circle or hyperplane, to perform the separation. Finally, using optimisation techniques, the machine fixes the final separation boundary. The process of learning is now complete.

This whole process is called training and is illustrated in Figure 8.

The question now is how efficient the trained machine is. To answer this, the machine learning process includes a testing phase. In Figure 9, the symbol ⓟ denotes a new, unknown patient. The machine places the patient's data in the model, and the model predicts whether or not the patient has cancer. Using a variety of mathematical measures, the machine can then test whether the prediction results are acceptable.

Figure 8. Machine learning phase
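The training and testing phases described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the chapter: the two patient attributes and their labels are synthetic, the squaring feature map is one hypothetical choice of non-linear mapping, and a simple perceptron stands in for the optimisation step. All function names are invented for this example.

```python
import random

def feature_map(x):
    """Non-linear mapping (assumed for illustration): squaring each attribute
    turns a circular class boundary into a straight line in the new space."""
    return (x[0] ** 2, x[1] ** 2)

def make_patients(n, rng):
    """Synthetic data: label +1 (cancer) outside the unit circle and
    -1 (cancer free) inside it, with a gap left around the boundary."""
    data = []
    while len(data) < n:
        x = (rng.uniform(-2, 2), rng.uniform(-2, 2))
        r2 = x[0] ** 2 + x[1] ** 2
        if 0.8 < r2 < 1.2:
            continue  # keep a margin so the mapped classes are separable
        data.append((x, 1 if r2 > 1 else -1))
    return data

def train(data, epochs=5000):
    """Perceptron training: move the line w.z + b = 0 after every mistake."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            z = feature_map(x)
            if y * (w[0] * z[0] + w[1] * z[1] + b) <= 0:
                w[0] += y * z[0]
                w[1] += y * z[1]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training patient is on the correct side
            break
    return w, b

def predict(model, x):
    """Place a patient in the trained model and return +1 or -1."""
    w, b = model
    z = feature_map(x)
    return 1 if w[0] * z[0] + w[1] * z[1] + b > 0 else -1

def accuracy(model, data):
    """Fraction of patients whose predicted label matches the true label."""
    return sum(predict(model, x) == y for x, y in data) / len(data)

rng = random.Random(0)
train_set = make_patients(200, rng)   # training phase data
test_set = make_patients(100, rng)    # unseen "new patients" for testing
model = train(train_set)
train_acc = accuracy(model, train_set)
test_acc = accuracy(model, test_set)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Accuracy on the held-out set is one of the simplest of the "mathematical measures" mentioned above; sensitivity and specificity would be computed in the same way from the predicted and true labels.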

The whole process described in Figures 8 and 9 is called classification learning. Machine learning is not limited to classification. The following learning tasks are the main branches of machine learning (Mitchell, 1997):

Association learning: Learn relationships between the attributes. An association rule has the form LHS (Left Hand Side) --> RHS (Right Hand Side), where LHS and RHS are disjoint sets of items and the RHS set is likely to occur whenever the LHS set occurs. For instance, items in gene expression data can include genes that are highly expressed or repressed, as well as relevant facts describing the cellular environment of the genes (e.g. the diagnosis of a cancer sample from which a profile was obtained) (Creighton & Hanash, 2003). Association learning is one of the promising knowledge discovery tools in the medical domain and has been widely explored to date.
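To make the rule form LHS --> RHS concrete, the following sketch computes the two standard measures of a rule, support and confidence, on a handful of made-up expression profiles. The item names (`geneA_up`, `geneB_down`, `tumour`, `normal`) and the profiles themselves are hypothetical, and the code only scores a given rule; it is not an implementation of a rule-mining algorithm such as Apriori.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t)

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return count(transactions, items) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Estimated probability of RHS given LHS, for disjoint item sets."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    if lhs & rhs:
        raise ValueError("LHS and RHS must be disjoint")
    lhs_count = count(transactions, lhs)
    if lhs_count == 0:
        return 0.0
    return count(transactions, lhs | rhs) / lhs_count

# Hypothetical profiles: each set holds the items observed in one sample.
profiles = [
    frozenset({"geneA_up", "geneB_down", "tumour"}),
    frozenset({"geneA_up", "tumour"}),
    frozenset({"geneA_up", "geneB_down", "tumour"}),
    frozenset({"geneB_down", "normal"}),
    frozenset({"geneA_up", "normal"}),
]

rule_support = support(profiles, {"geneA_up", "tumour"})
rule_confidence = confidence(profiles, {"geneA_up"}, {"tumour"})
print(f"geneA_up --> tumour: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```

In these toy profiles the rule holds in three of the five samples and in three of the four samples that contain the left-hand side, which is what the two measures report.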

Classification: Classification is learning to put instances into pre-defined classes. It is also known as supervised learning. Supervised learning is fairly common in classification problems because the goal is often to get the computer to learn a classification system that has already been created. Cancer patient classification is a common example of classification learning. Generally, classification learning is suitable for any problem where deducing a classification is helpful and the classification is easy to determine (Breiman et al., 1984).

Clustering: Clustering is discovering classes of instances that belong together. It is also known as unsupervised learning. In this form of learning, the aim is not to maximize a utility function but simply to locate similarities in the training data. The hypothesis is often that the clusters discovered will match an intuitive classification reasonably well. An example is clustering individuals on the basis of patient gene information into one group of diabetes patients and another group of diabetes-free patients (Hand et al., 2001).

Figure 9. Machine evolution phase
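As a rough illustration of how similarity in the data can be located without labels, the sketch below runs a basic k-means loop on six made-up two-dimensional measurements. The data, the choice of k-means and the deterministic initialisation from the first k points are all assumptions made for this example. The distance function is passed in as a parameter because its choice determines which points count as similar.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def k_means(points, k, distance=euclidean, iterations=20):
    """Basic k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, and repeat."""
    centroids = [list(p) for p in points[:k]]  # deterministic starting guess
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical measurements for six individuals (no class labels given).
measurements = [
    (1.0, 1.2), (5.0, 5.1), (0.8, 1.0),
    (5.2, 4.9), (1.1, 0.9), (4.8, 5.0),
]
groups, centres = k_means(measurements, k=2)
print(groups)
```

The loop never sees a diagnosis; whether the two groups it finds correspond to a meaningful clinical distinction has to be checked afterwards, as the paragraph above notes.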

Numeric prediction: Learn to predict a numeric quantity instead of a class. It is also known as estimation. The task can be continuous or discrete prediction. For example, the concentration of foreign bodies remaining in the human body can be estimated from other attributes such as the White Blood Cell (WBC) count in the blood. Any attribute of the body that is not measured but is related to other measurable attributes can be estimated (Ali & Wasimi, 2007). The survival time of a cancer patient could also be obtained by numeric prediction.
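A minimal sketch of numeric prediction is an ordinary least-squares line fitted to one measured attribute. The numbers below are synthetic and chosen to lie exactly on a line so the result is easy to check by hand; the variable names and the units are hypothetical and do not come from any real clinical data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate(line, x):
    """Predict the unmeasured quantity for a new value of the attribute."""
    slope, intercept = line
    return slope * x + intercept

# Synthetic example: a measured attribute (e.g. a WBC count in arbitrary
# units) and the quantity to be estimated from it.
wbc_counts = [4.0, 6.0, 8.0, 10.0]
target_values = [1.0, 2.0, 3.0, 4.0]

line = fit_line(wbc_counts, target_values)
estimated = estimate(line, 12.0)
print(f"slope={line[0]:.2f}, intercept={line[1]:.2f}, estimate={estimated:.2f}")
```

With real measurements the points would scatter around the line, and the size of that scatter indicates how far the estimate can be trusted.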

We carried out an extensive review of various popular sources. The key areas are as follows:

Cancer image classification

An image is an optical picture of an object. Several cancer images are shown in the preceding sections. Medical image data mining is used to extract effective models, relations, rules, changes, irregularities and common laws from large amounts of data, and it has become a separate important discipline called Medical Imaging. Image-based machine learning techniques can improve the speed and accuracy of the diagnostic decisions made by doctors. With the recent rapid development of digital medical devices, medical information databases now include not only structured patient information but also non-structured medical image information (Wang et al., 2005). Antonie et al. (2001) argue that there are no methods to prevent breast cancer, which is why early detection is a very important factor in cancer treatment and allows a high survival rate to be reached. They suggest mammography as the most
