Future work - Clinical data mining with Kernel-based algorithms

In the classification of NI prevalence data, the features were binarized prior to any classification task. The discretization of the workload variable may not be optimal and may cause a negative interaction. A thorough investigation of the discretization of this variable should be carried out in the future. The use of an information gain criterion that is sensitive to data imbalance is another avenue for future investigation.
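One possible starting point for such an investigation is to score candidate cut points of the workload variable by the information gain they yield. The sketch below is only illustrative: the function names and the toy data are assumptions, it is not the procedure used in this thesis, and it uses the plain information gain, which is not corrected for class imbalance.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting `values` at `threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return max(((t, information_gain(values, labels, t)) for t in candidates),
               key=lambda pair: pair[1])

# Toy, made-up workload scores and infection labels (1 = infected).
workload = [20.0, 35.0, 40.0, 50.0, 70.0, 95.0, 100.0, 120.0]
infected = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, gain = best_threshold(workload, infected)
```

On this toy sample the selected cut point separates the two classes perfectly; on real prevalence data the gains of neighbouring cut points would have to be compared across resampled splits before changing the discretization.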

The conversion of the binary values into real numbers by means of factor analysis is an important avenue of investigation. Such a transformation may also be necessary when combining heterogeneous variables. With the TALISMAN dataset, for example, it may be necessary to apply such a transformation to the clinical variables before combining them with the imaging variables.

We have improved the runtime speed of the SVM using the radius margin bound. Other bounds have been proposed in more recent papers. A comparative study of these new bounds with the radius margin bound is another avenue of future investigation. A bound on the MKL error was also proposed by Cortes and colleagues [33]. This is of particular interest for speeding up the MKL.

The work carried out in this thesis was not systematically validated by infection control practitioners. Indeed, evaluating the output of a classification algorithm is not straightforward for busy medical staff. The best way to bring them to the table for a systematic evaluation is to demonstrate the added value of such a system. This is the long-term objective of the analysis performed in this thesis, and one has to go one step further and implement a tool exploiting the results of the classifications.

The development of tools exploiting the classification results may be hindered by the fact that it will remain a research project. A hospital does not necessarily have the dedicated infrastructure to analyze large amounts of data. Setting up an infrastructure that allows clinical data to be transferred safely outside the hospital network, to where computational resources are available, may be necessary in the near future.

The medical informatics department of the HUG has anticipated this need, and a first prototype of such an infrastructure was proposed to support the @neurIST project [74, 75]. This infrastructure extracts information from the electronic health record of patients, transforms the information into the desired representation, pseudonymizes the data (including images) and transfers them outside the hospital network for analysis. The pseudonymization is reversible, which allows the results of the analysis to be linked back to the original health record. The components of this infrastructure and the information flow for an external data query are depicted in Figure 7.1. The main limitation of this infrastructure is scalability: it is impossible to provide access to the live electronic records of all hospitalized patients. The optimization of such an infrastructure for at least near real–time data mining purposes is of particular interest and should be investigated.
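The idea of a reversible pseudonymization can be illustrated with a lookup table that stays inside the hospital network. This is a minimal sketch under stated assumptions: the class name, the record layout and the identifier format are hypothetical, and it does not describe the actual @neurIST implementation.

```python
import secrets

class Pseudonymizer:
    """Reversible pseudonymization: the lookup table never leaves the hospital."""

    def __init__(self):
        self._to_pseudonym = {}
        self._to_identifier = {}

    def pseudonymize(self, patient_id):
        """Return a stable random pseudonym for `patient_id`."""
        if patient_id not in self._to_pseudonym:
            pseudonym = secrets.token_hex(8)
            while pseudonym in self._to_identifier:  # extremely unlikely collision
                pseudonym = secrets.token_hex(8)
            self._to_pseudonym[patient_id] = pseudonym
            self._to_identifier[pseudonym] = patient_id
        return self._to_pseudonym[patient_id]

    def reidentify(self, pseudonym):
        """Map an analysis result back to the original record identifier."""
        return self._to_identifier[pseudonym]

pseudonymizer = Pseudonymizer()
record = {"patient_id": "HUG-000123", "age": 67}  # made-up record
exported = {**record, "patient_id": pseudonymizer.pseudonymize(record["patient_id"])}
```

Only `exported` would leave the hospital network; results computed outside can be linked back with `reidentify`. A production system would additionally need persistent, access-controlled storage of the table and treatment of identifying content in free text and images.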

Figure 7.1: The architecture components and the information flow for an external query against the clinical information system

Appendix A

Nosocomial infection variables: definition and ranking

The characteristics of the nosocomial infection dataset are described in this appendix. The variables are listed in Table A.1, and the MeSH (Medical Subject Headings) definition of each variable is provided when it was found within the Unified Medical Language System (UMLS) database. The UMLS database is maintained by the US National Library of Medicine. The nosocomial infection dataset contains 60 variables after binarization of the numeric and nominal variables. It contains 1384 examples, and 100 training/testing (60%/40%) splits were produced. The mean correlations between the variables over the 100 training and testing splits are depicted in Figures A.1 and A.2. The variable rankings after the application of one RBF kernel per variable are summarized in Tables A.2 and A.3.
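The way the mean correlations over repeated 60%/40% splits can be obtained is sketched below in plain Python. The function names, the random seed, the handling of constant columns and the toy data are assumptions made for illustration; the splits used in the thesis are not reproduced here.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def mean_correlations(rows, n_splits=100, train_fraction=0.6, seed=0):
    """Average the variable-by-variable correlation matrix over random splits."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    n_train = int(round(train_fraction * len(rows)))
    train_sum = [[0.0] * n_vars for _ in range(n_vars)]
    test_sum = [[0.0] * n_vars for _ in range(n_vars)]
    for _ in range(n_splits):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        for subset, total in ((shuffled[:n_train], train_sum),
                              (shuffled[n_train:], test_sum)):
            columns = list(zip(*subset))
            for i in range(n_vars):
                for j in range(n_vars):
                    total[i][j] += pearson(columns[i], columns[j])
    scale = 1.0 / n_splits
    return ([[v * scale for v in row] for row in train_sum],
            [[v * scale for v in row] for row in test_sum])

# Made-up binary data: 50 examples, 3 variables (variable 1 copies variable 0).
rows = [[i % 2, i % 2, (i // 2) % 2] for i in range(50)]
train_corr, test_corr = mean_correlations(rows, n_splits=20)
```

Each returned matrix plays the role of one of the correlation figures: entry (i, j) is the correlation between variables i and j averaged over the splits.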

Table A.1

| Variable category | Variables | Signification | MeSH definition |
|---|---|---|---|
| Administrative | Var 1 | Age>75 | - |
| | Var 2 | Age in [60,75] | - |
| | Var 3 | Age<60 | - |
| | Var 4 | Length of Stay | The period of confinement of a patient to a hospital or other health facility. |
| | Var 5 | Sex | - |
| At admission | Var 6 | McCabe non fatal | - |
| | Var 7 | McCabe fatal <5 years | - |
| | Var 8 | McCabe fatal in 6 months | - |
| | Var 9 | Trauma | Damage inflicted on the body as the direct or indirect result of an external force, with or without disruption of structural continuity. |
| | Var 10 | Transfer | Transfer of client or patient care within or between treatment settings, therapists, or other health care providers. |
| | Var 11 | No Information related to co-morbidity at admission | - |
| | Var 12 | Myocardial Infarction | Gross necrosis of the myocardium, as a result of interruption of the blood supply to the area. |
| | Var 13 | Congestive heart failure | A complication of HEART DISEASES. Defective cardiac filling and/or impaired contraction and emptying, resulting in the heart's inability to pump a sufficient amount of blood to meet the needs of the body tissues or to be able to do so only with an elevated filling pressure. |
| | Var 14 | Peripheral Vascular Diseases | General or unspecified diseases of the blood vessels outside the heart. It is for diseases of the peripheral as opposed to the cardiac circulation. |
| | Var 15 | Cerebro-vascular disease | Patient with a history of stroke with minor or residual sequelae or transient ischemia. |
| | Var 16 | Dementia | An acquired organic mental disorder with loss of intellectual abilities of sufficient severity to interfere with social or occupational functioning. The dysfunction is multifaceted and involves memory, behavior, personality, judgment, attention, spatial relations, language, abstract thought, and other executive functions. The intellectual decline is usually progressive, and initially spares the level of consciousness. |
| | Var 17 | Chronic Obstructive Airway Disease | A disease of chronic diffuse irreversible airflow obstruction. Subcategories of COPD include CHRONIC BRONCHITIS and PULMONARY EMPHYSEMA. |
| | Var 18 | Collagen Diseases | Historically, a heterogeneous group of acute and chronic diseases, including rheumatoid arthritis, systemic lupus erythematosus, progressive systemic sclerosis, dermatomyositis, etc. This classification was based on the notion that "collagen" was equivalent to "connective tissue", but with the present recognition of the different types of collagen and the aggregates derived from them as distinct entities, the term "collagen diseases" now pertains exclusively to those inherited conditions in which the primary defect is at the gene level and affects collagen biosynthesis, post-translational modification, or extracellular processing directly. |
| | Var 19 | Ulcer disease | Patients receiving treatment for stomach ulcers, including those who are cured. |
| | Var 20 | Hepatic Insufficiency | Patients with high transaminase level less than 2 times above the highest norm. |
| | Var 21 | Diabetes Mellitus | A heterogeneous group of disorders characterized by HYPERGLYCEMIA and GLUCOSE INTOLERANCE. |
| | Var 22 | Hemiplegia | Severe or complete loss of motor function on one side of the body. This condition is usually caused by BRAIN DISEASES that are localized to the cerebral hemisphere opposite to the side of weakness. Less frequently, BRAIN STEM lesions; cervical SPINAL CORD DISEASES; PERIPHERAL NERVOUS SYSTEM DISEASES; and other conditions may manifest as hemiplegia. The term hemiparesis (see PARESIS) refers to mild to moderate weakness involving one side of the body. |
| | Var 23 | Renal Insufficiency | Patients with serum creatinine values (blood) of less than 260 mmol/l |
| | Var 24 | Complications of Diabetes Mellitus | Conditions or pathological processes associated with the disease of diabetes mellitus. Due to the impaired control of BLOOD GLUCOSE level in diabetic patients, pathological processes develop in numerous tissues and organs including the EYE, the KIDNEY, the BLOOD VESSELS, and the NERVE TISSUE. |
| | Var 25 | Malignant Neoplasms | Patients with malignant tumor without documented metastases, but treated in the past five years, including breast, colon, lung and other tumors |
| | Var 26 | Leukemia | A progressive, malignant disease of the blood-forming organs, characterized by distorted proliferation and development of leukocytes and their precursors in the blood and bone marrow. It is classified according to degree of cell differentiation as acute or chronic, and according to predominant type of cell involved as myelogenous or lymphocytic. |
| | Var 27 | Lymphoma | A general term for various neoplastic diseases of the lymphoid tissue. |
| | Var 28 | Hepatic Insufficiency moderate and severe | Patients with high transaminase level more than 2 times above the highest norm (moderate). |
| | Var 29 | Metastatic tumors | Patients with metastasis of tumors, including breast, lung, colon, and others. |
| | Var 30 | Acquired Immunodeficiency Syndrome | An acquired defect of cellular immunity associated with infection by the human immunodeficiency virus (HIV), a CD4-positive T-lymphocyte count under 200 cells/microliter or less than 14% of total lymphocytes, and increased susceptibility to opportunistic infections and malignant neoplasms. Clinical manifestations also include emaciation (wasting) and dementia. |
| | Var 31 | No comorbidity | - |
| At the survey date | Var 32 | Ward = Medicine | - |
| | Var 33 | Ward = Surgery | - |
| | Var 34 | Ward = Mixed | - |
| | Var 35 | Ward = Gynecology | - |
| | Var 36 | Ward = Obstetrics | - |
| | Var 37 | Ward = Obstetrics and Gynecology | - |
| | Var 38 | Ward = Intensive care unit | - |
| | Var 39 | No MRSA infection | - |
| | Var 40 | Methicillin resistant Staphylococcus aureus (MRSA) infection | - |
| | Var 41 | Past MRSA infection | - |
| At the survey date and the 6 days before | Var 42 | No Fever | - |
| | Var 43 | Fever | An abnormal elevation of body temperature, usually as a result of a pathologic process. |
| | Var 44 | Fever unknown | - |
| | Var 45 | Antibiotic therapy | - |
| | Var 46 | Antibiotic Prophylaxis | Use of antibiotics before, during, or after a diagnostic, therapeutic, or surgical procedure to prevent infectious complications. |
| | Var 47 | Operative Surgical Procedures | Operations carried out for the correction of deformities and defects, repair of injuries, and diagnosis and cure of certain diseases. |
| | Var 48 | No Leukopenia | - |
| | Var 49 | Leukopenia | - |
| | Var 50 | Leukopenia unknown | - |
| | Var 51 | Intensive care unit stay during the hospitalization | - |
| | Var 52 | Intubation | Introduction of a tube into a hollow organ to restore or maintain patency if obstructed. It is differentiated from CATHETERIZATION in that the insertion of a catheter is usually performed for the introducing or withdrawing of fluids from the body. |
| | Var 53 | Workload <45.5 according to the PRN system | The total amount of work to be performed by an individual, a department, or other group of workers in a period of time. |
| | Var 54 | Workload ≥45.5 and <91.5 according to the PRN system | The total amount of work to be performed by an individual, a department, or other group of workers in a period of time. |
| | Var 55 | Workload ≥91.5 according to the PRN system | The total amount of work to be performed by an individual, a department, or other group of workers in a period of time. |
| | Var 56 | Catheterization, Central Venous | Placement of an intravenous catheter in the subclavian, jugular, or other central vein for central venous pressure determination, chemotherapy, hemodialysis, or hyperalimentation. |
| | Var 57 | No Urinary Catheterization | - |
| | Var 58 | Urinary Catheterization | Employment or passage of a catheter into the bladder (urethral c.) or kidney (ureteral c.) for therapeutic or diagnostic purposes. |
| | Var 59 | Urinary Catheterization unknown | - |

Figure A.1: Mean correlation of the 100 training sets

Figure A.2: Mean correlation of the 100 testing sets

Table A.2: Rank of each variable according to the sum of the coefficients at the point maximizing the AUC and at the maximum value of the regularization parameter for the datasets A and B. Top-10 variables are highlighted.

| Variable | Dataset A, at the point maximizing AUC during 5–fold cross–validation | Dataset B, at the point maximizing AUC during 5–fold cross–validation | Dataset A, at the maximal value of the regularization parameter | Dataset B, at the maximal value of the regularization parameter |
|---|---|---|---|---|
| Var 1 | 36 | – | 45 | 37 |

Table A.3: Rank of each variable according to the sum of the coefficients at the point maximizing the AUC and at the maximum value of the regularization parameter for the datasets C and D. Top-10 variables are highlighted.

| Variable | Dataset C, at the point maximizing AUC during 5–fold cross–validation | Dataset D, at the point maximizing AUC during 5–fold cross–validation | Dataset C, at the maximal value of the regularization parameter | Dataset D, at the maximal value of the regularization parameter |
|---|---|---|---|---|
| Var 4 | 3 | 9 | 2 | 9 |
| Var 8 | 9 | 7 | 7 | 7 |
| Var 10 | 8 | 6 | 4 | 6 |
| Var 13 | 1 | 1 | 1 | 1 |
| Var 38 | 2 | 10 | 11 | 12 |
| Var 39 | 13 | 18 | 18 | 19 |
| Var 40 | 10 | 12 | 17 | 15 |
| Var 42 | 14 | 13 | 15 | 14 |
| Var 43 | 16 | 11 | 13 | 13 |
| Var 45 | 4 | 2 | 3 | 2 |
| Var 47 | 6 | 4 | 8 | 4 |
| Var 51 | 7 | 5 | 12 | 5 |
| Var 52 | 5 | 3 | 9 | 3 |
| Var 53 | 11 | 8 | 5 | 8 |
| Var 54 | 12 | 14 | 6 | 10 |
| Var 55 | 15 | 15 | 10 | 11 |
| Var 56 | 17 | 16 | 14 | 16 |
| Var 57 | 18 | 17 | 16 | 17 |
| Var 58 | 19 | 19 | 19 | 18 |

Appendix B

Elements of topology

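Definition B.0.1. A vector space $\mathcal{V}$ is a set that is closed under finite vector addition and scalar multiplication. Formally, $\forall\, \mathbf{s}, \mathbf{t} \in \mathcal{V}$ and $\forall\, \alpha \in \mathbb{R}$: $\mathbf{s} + \mathbf{t} \in \mathcal{V}$ and $\alpha \mathbf{s} \in \mathcal{V}$.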

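Definition B.0.2. The norm of a mathematical object is a quantity describing the length of the object. In an $N$-dimensional vector space $\mathcal{V}$, the norm $\|\cdot\|$ is a positive measure such that $\forall\, \mathbf{s}, \mathbf{t} \in \mathcal{V}$ and $\alpha \in \mathbb{R}$:

• $\|\mathbf{s}\| > 0$ if $\mathbf{s} \neq \mathbf{0}$, and $\|\mathbf{s}\| = 0$ if and only if $\mathbf{s} = \mathbf{0}$

• $\|\alpha \mathbf{s}\| = |\alpha| \, \|\mathbf{s}\|$

• $\|\mathbf{s} + \mathbf{t}\| \leq \|\mathbf{s}\| + \|\mathbf{t}\|$

Formally, the norm in a vector space can be computed as $\|\mathbf{s}\|_p = \left( \sum_{i=1}^{N} |s_i|^p \right)^{1/p}$, where $p = 1, 2, \ldots$. This quantity is usually called the $L_p$-norm. A special case of this measure for $p = \infty$ is defined by $\|\mathbf{s}\|_\infty = \max_i |s_i|$.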
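The $L_p$-norm and its limiting case can be checked numerically with a few lines of Python. This is a small sketch; the function name `lp_norm` and the example vectors are chosen here for illustration.

```python
def lp_norm(s, p):
    """L_p norm of a sequence of numbers; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(x) for x in s)
    return sum(abs(x) ** p for x in s) ** (1.0 / p)

s = [3.0, -4.0]
t = [1.0, 2.0]

l1 = lp_norm(s, 1)               # |3| + |-4|
l2 = lp_norm(s, 2)               # sqrt(9 + 16)
linf = lp_norm(s, float("inf"))  # max(|3|, |-4|)

# Triangle inequality for the Euclidean norm on this pair of vectors.
s_plus_t = [a + b for a, b in zip(s, t)]
triangle_holds = lp_norm(s_plus_t, 2) <= lp_norm(s, 2) + lp_norm(t, 2)
```

For this vector the three norms are 7, 5 and 4 respectively, which also illustrates that the norm decreases as $p$ grows.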

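Definition B.0.3. The inner product of two elements of a vector space $\mathcal{V}$ is a bilinear algebraic operation $\langle \cdot, \cdot \rangle$ from $\mathcal{V} \times \mathcal{V}$ to $\mathbb{R}$ having the following properties, $\forall\, \mathbf{r}, \mathbf{s}, \mathbf{t} \in \mathcal{V}$ and for a scalar $\alpha$:

• $\langle \mathbf{r} + \mathbf{s}, \mathbf{t} \rangle = \langle \mathbf{r}, \mathbf{t} \rangle + \langle \mathbf{s}, \mathbf{t} \rangle$

• $\langle \alpha \mathbf{r}, \mathbf{s} \rangle = \alpha \langle \mathbf{r}, \mathbf{s} \rangle$

• $\langle \mathbf{r}, \mathbf{s} \rangle = \langle \mathbf{s}, \mathbf{r} \rangle$

• $\langle \mathbf{r}, \mathbf{r} \rangle \geq 0$, with $\langle \mathbf{r}, \mathbf{r} \rangle = 0$ if and only if $\mathbf{r} = \mathbf{0}$

The dot product of two vectors $\mathbf{s}$ and $\mathbf{t}$ of a finite–dimensional Euclidean space $\mathbb{R}^n$ is a special case of the inner product, defined by $\langle \mathbf{s}, \mathbf{t} \rangle = \|\mathbf{s}\| \, \|\mathbf{t}\| \cos\theta$, where $\theta$ is the angle between the two vectors. Geometrically, it can be seen as the scalar projection of the vector $\mathbf{s}$ onto the vector $\mathbf{t}$, multiplied by the length of $\mathbf{t}$.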

Definition B.0.4. Let us consider a vector space $\mathcal{B}$ in which the distance between two vectors is defined by the norm $\|\cdot\|$. This metric space $\mathcal{B}$ is called a Banach space if it is complete, i.e. every Cauchy sequence in $\mathcal{B}$ has a limit in $\mathcal{B}$.

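Definition B.0.5. A Hilbert space is a Banach space such that the distance metric $\|\cdot\|$ is derived from the inner product. The distance between two vectors in a Hilbert space is defined by $d(\mathbf{r}, \mathbf{s}) = \|\mathbf{r} - \mathbf{s}\| = \langle \mathbf{r} - \mathbf{s}, \mathbf{r} - \mathbf{s} \rangle^{1/2}$. The norm of a vector is defined as $\|\mathbf{r}\| = \langle \mathbf{r}, \mathbf{r} \rangle^{1/2}$ and can be used to define e.g. the notion of convergence. The most cited examples of such spaces are:

• the Euclidean space $\mathbb{R}^n$ with the inner product $\langle \mathbf{r}, \mathbf{s} \rangle = \sum_{i=1}^{n} r_i s_i$,

• the space of square-integrable functions $L^2$, in which the inner product is defined as $\langle f, g \rangle = \int_X f(x)\, g(x)\, dx < \infty$,

• and the space of square-summable infinite sequences $\ell^2$ with $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{\infty} x_i y_i$.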

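Definition B.0.6. Let us consider a class of functions $\mathcal{F}$ in a Hilbert space $\mathcal{H}$. The function $k(\mathbf{r}, \mathbf{s})$ is called a reproducing kernel of $\mathcal{F}$ if:

1. $\forall\, \mathbf{s} \in \mathcal{H}$, $k(\mathbf{r}, \mathbf{s})$, seen as a function of $\mathbf{r}$, belongs to $\mathcal{F}$

2. $\forall\, \mathbf{s} \in \mathcal{H}$ and $\forall\, f \in \mathcal{F}$: $f(\mathbf{s}) = \langle f(\mathbf{r}), k(\mathbf{r}, \mathbf{s}) \rangle$, where the inner product is taken with respect to $\mathbf{r}$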
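The reproducing property can be illustrated numerically for functions that are finite kernel expansions, $f = \sum_i \alpha_i k(\cdot, \mathbf{x}_i)$, using the standard construction in which $\langle k(\cdot, \mathbf{x}), k(\cdot, \mathbf{z}) \rangle = k(\mathbf{x}, \mathbf{z})$. The sketch below assumes a Gaussian RBF kernel with unit width; the function names, expansion points and coefficients are made up for the example.

```python
import math

def rbf_kernel(r, s, sigma=1.0):
    """Gaussian RBF kernel k(r, s) = exp(-||r - s||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(r, s))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def evaluate(alphas, centers, s):
    """f(s) for f = sum_i alpha_i k(., x_i)."""
    return sum(a * rbf_kernel(x, s) for a, x in zip(alphas, centers))

def rkhs_inner(alphas, centers, betas, points):
    """<f, g> for f = sum_i alpha_i k(., x_i) and g = sum_j beta_j k(., z_j)."""
    return sum(a * b * rbf_kernel(x, z)
               for a, x in zip(alphas, centers)
               for b, z in zip(betas, points))

centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # made-up expansion points
alphas = [0.5, -1.0, 2.0]                      # made-up coefficients
s = (0.3, 0.7)

f_at_s = evaluate(alphas, centers, s)
inner_with_kernel = rkhs_inner(alphas, centers, [1.0], [s])  # <f, k(., s)>
squared_norm = rkhs_inner(alphas, centers, alphas, centers)  # ||f||^2
```

Here `f_at_s` and `inner_with_kernel` coincide, which is the reproducing property of condition 2, and `squared_norm` is positive, as required of an inner product.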

Appendix C

Quasi–regularization path for three benchmark datasets

Three popular benchmark datasets are analyzed with the MKL algorithm. This appendix presents the best accuracy obtained on these datasets with a single RBF kernel (Table C.2), the regularization paths for balanced (Figures C.1 and C.3) and unbalanced (Figures C.2 and C.4) training sets, and the performance at two critical points of the regularization path (Tables C.3, C.4, C.5 and C.6).

Table C.1: Benchmark datasets

| Dataset | Banana | Breast cancer | Thyroid |
|---|---|---|---|
| Number of training examples | 400 | 200 | 140 |
| Number of testing examples | 4900 | 77 | 75 |
| Number of variables | 2 | 9 | 5 |

Table C.2: Best performance obtained with the benchmarks with an SVM with RBF kernel

| Dataset | Banana | Breast cancer | Thyroid |
|---|---|---|---|
| Cost | 316.2 | 15.19 | 10.0 |
| σ | 1.0 | 50 | 3.0 |
| Accuracy | 88.47±0.66 | 73.96±4.74 | 95.2±2.19 |


Figure C.1: Performance measures (AUC, recall, precision) and the ratio of selected kernels (SKER) during the 5–fold cross–validation procedure on 5 random training sets and during the evaluation phase for three benchmark datasets, according to the 100 values of C. The 13 RBF kernels were applied to the whole dataset. The images on the left are obtained during the training phase (5–fold cross–validation) and the images on the right are obtained during the evaluation phase (evaluation on 100 testing sets). The training sets are balanced with respect to the class distribution.


Figure C.2: Performance measures (AUC, recall, precision) and the ratio of selected kernels (SKER) during the 5–fold cross–validation procedure on 5 random training sets and during the evaluation phase for three benchmark datasets, according to the 100 values of C. The 13 RBF kernels were applied to the whole dataset. The images on the left are obtained during the training phase (5–fold cross–validation) and the images on the right are obtained during the evaluation phase (evaluation on 100 testing sets). The training sets keep the original class imbalance.

Table C.3: Performance on the banana, breast-cancer and thyroid datasets at the point maximizing the AUC during the cross–validation and at the one maximizing the testing set AUC along the regularization path. The training sets are balanced with respect to the class distribution.

| Banana | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 2/13 | 1/13 |
| Cost | | 247.71 | 17.07 |
| Accuracy | 89.45±3.50 | 88.90±0.50 | 89.44±0.47 |
| Recall | 85.72±5.04 | 86.44±1.83 | 85.24±1.58 |
| Specificity | 92.74±4.43 | 90.90±1.65 | 92.84±1.44 |
| Precision | 91.32±4.97 | 88.57±1.65 | 90.66±1.55 |
| F-Measure | 88.43±5.00 | 87.49±1.73 | 87.87±1.56 |
| AUC | 89.23±3.57 | 88.67±0.50 | 89.04±0.47 |

| Breast cancer | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 5/13 | 5/13 |
| Cost | | 97.70 | 497.70 |
| Accuracy | 67.18±10.43 | 64.82±5.75 | 66.72±5.03 |
| Recall | 65.05±19.72 | 60.16±12.88 | 65.52±11.56 |
| Specificity | 67.85±13.67 | 66.66±9.48 | 67.01±7.53 |
| Precision | 44.99±13.95 | 42.63±8.63 | 44.74±7.67 |
| F-Measure | 53.19±16.34 | 49.90±10.33 | 53.17±9.22 |
| AUC | 66.45±11.18 | 63.41±5.74 | 66.27±5.55 |

| Thyroid | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 2/13 | 5/13 |
| Cost | | 48.63 and 54.62 | 155.57 |
| Accuracy | 96.97±4.12 | 92.79±2.88 | 93.44±2.68 |
| Recall | 95.57±8.28 | 75.97±9.29 | 77.99±8.38 |
| Specificity | 97.88±3.86 | 99.92±0.39 | 100.00±0.00 |
| Precision | 95.79±7.75 | 99.81±0.97 | 100.00±0.00 |
| F-Measure | 95.68±8.01 | 86.27±1.76 | 87.63±0.00 |
| AUC | 96.72±4.75 | 87.94±4.62 | 88.99±4.19 |
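The performance measures listed in these tables can all be derived from the four confusion-matrix counts. The sketch below is an illustration with made-up counts and a hypothetical function name. The single-point AUC is included because the reported AUC values agree with the mean of recall and specificity, for example (85.72 + 92.74)/2 = 89.23 for the Banana training set; this suggests, although it is not stated in this appendix, that the AUC was computed from one operating point.

```python
def classification_metrics(tp, fp, tn, fn):
    """Percent-valued metrics from the four confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "recall": 100.0 * recall,
        "specificity": 100.0 * specificity,
        "precision": 100.0 * precision,
        "f_measure": 100.0 * 2.0 * precision * recall / (precision + recall),
        # Area under the ROC "curve" that passes through one operating point.
        "auc_single_point": 100.0 * (recall + specificity) / 2.0,
    }

metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)  # made-up counts
```

With these counts the recall is 80%, the specificity 90% and the single-point AUC 85%. The sketch does not guard against empty classes, which would cause a division by zero.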


Table C.4: Performance on the banana, breast-cancer and thyroid datasets at the point maximizing the AUC during the cross–validation and at the one maximizing the testing set AUC along the quasi–regularization path. The original class distribution is kept in the training sets.

| Banana | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 2/13 | 1/13 |
| Cost | | 100.609₋, 119.904+ | 8.75₋, 10.429+ |
| Accuracy | 91.55±2.68 | 89.02±0.48 | 89.53±0.39 |
| Recall | 88.47±4.50 | 86.78±2.03 | 85.47±1.58 |
| Specificity | 94.08±3.52 | 90.83±1.97 | 92.82±1.34 |
| Precision | 93.21±3.85 | 88.56±1.94 | 90.66±1.44 |
| F-Measure | 90.78±4.15 | 87.66±1.99 | 87.99±1.51 |
| AUC | 91.28±2.70 | 88.81±0.45 | 89.14±0.41 |

| Breast cancer | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 1/13 | 5/13 |
| Cost | | 15.651₋, 18.653+ | 79.731₋, 95.022+ |
| Accuracy | 71.30±6.62 | 66.72±4.85 | 69.86±4.60 |
| Recall | 62.07±17.50 | 58.07±11.33 | 53.72±10.81 |
| Specificity | 74.95±11.62 | 70.18±7.46 | 76.47±7.27 |
| Precision | 51.36±8.69 | 44.33±8.65 | 48.47±8.63 |
| F-Measure | 56.21±11.62 | 50.28±9.81 | 50.96±9.59 |
| AUC | 68.51±6.59 | 64.12±5.37 | 65.09±4.81 |

| Thyroid | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 5/13 | 3/13 |
| Cost | | 113.017₋, 134.691+ | 63.185₋, 75.303+ |
| Accuracy | 96.57±2.82 | 94.38±2.58 | 94.64±2.48 |
| Recall | 94.33±7.55 | 81.10±8.45 | 82.05±8.17 |
| Specificity | 97.55±2.98 | 99.98±0.20 | 99.94±0.33 |
| Precision | 94.78±6.28 | 99.95±0.52 | 99.85±0.87 |
| F-Measure | 94.56±6.86 | 89.54±0.99 | 90.08±1.58 |
| AUC | 95.94±3.78 | 90.54±4.24 | 91.00±4.09 |

Figure C.3: Performance measures (AUC, recall, precision) and the ratio of selected kernels (SKER) during the 5–fold cross–validation procedure on 5 random training sets and during the evaluation phase for three benchmark datasets, according to the 100 values of C. The 13 RBF kernels were applied to each variable. The images on the left are obtained during the training phase (5–fold cross–validation) and the images on the right are obtained during the evaluation phase (evaluation on 100 testing sets). The training sets are balanced with respect to the class distribution.


Figure C.4: Performance measures (AUC, recall, precision) and the ratio of selected kernels (SKER) during the 5–fold cross–validation procedure on 5 random training sets and during the evaluation phase for three benchmark datasets, according to the 100 values of C. The 13 RBF kernels were applied to each variable. The images on the left are obtained during the training phase (5–fold cross–validation) and the images on the right are obtained during the evaluation phase (evaluation on 100 testing sets). The training sets keep the original class skewness.

Table C.5: Performance on the banana, breast-cancer and thyroid datasets at the point maximizing the AUC during the cross–validation and at the one maximizing the testing set AUC along the regularization path. The training sets are balanced with respect to the class distribution.

| Banana | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 6/26 | 2/26 |
| Cost | | 77.43 | 30.54 |
| Accuracy | 72.17±5.26 | 68.99±1.04 | 69.43±0.98 |
| Recall | 64.53±7.91 | 64.32±3.97 | 62.88±3.03 |
| Specificity | 79.15±6.20 | 72.78±4.35 | 74.74±3.84 |
| Precision | 73.63±6.89 | 65.91±2.37 | 67.05±2.34 |
| F-Measure | 68.78±7.37 | 65.10±2.97 | 64.90±2.64 |
| AUC | 71.84±5.16 | 68.55±0.86 | 68.81±0.74 |

| Breast cancer | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 34/117 | 37/117 |
| Cost | | 312.57 | 3.35 |
| Accuracy | 68.63±8.75 | 68.45±5.35 | 71.81±3.84 |
| Recall | 64.20±23.36 | 56.43±11.18 | 55.30±9.67 |
| Specificity | 70.07±13.57 | 73.23±8.84 | 78.46±5.06 |
| Precision | 48.00±13.51 | 46.72±8.69 | 51.04±9.01 |
| F-Measure | 54.93±17.12 | 51.12±9.78 | 53.09±9.33 |
| AUC | 67.14±10.37 | 64.83±4.96 | 66.88±4.72 |

| Thyroid | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 16/65 | 16/65 |
| Cost | | 351.12 | 628.03 |
| Accuracy | 92.60±5.47 | 93.23±2.67 | 93.23±2.69 |
| Recall | 85.30±13.85 | 80.32±8.13 | 80.51±8.15 |
| Specificity | 96.45±4.44 | 98.71±1.51 | 98.65±1.56 |
| Precision | 92.00±10.17 | 96.38±4.28 | 96.26±4.36 |
| F-Measure | 88.52±11.73 | 87.62±5.61 | 87.69±5.68 |
| AUC | 90.88±7.32 | 89.51±4.08 | 89.58±4.07 |


Table C.6: Performance on the banana, breast-cancer and thyroid datasets at the point maximizing the AUC during the cross–validation and at the one maximizing the testing set AUC along the regularization path. The class skewness is the same as in the original dataset.

| Banana | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 7/26 | 2/26 |
| Cost | | 173.829₋, 454.199+ | 8.453₋, 22.086+ |
| Accuracy | 71.00±5.09 | 61.31±1.63 | 69.43±0.95 |
| Recall | 67.79±8.72 | 57.82±2.50 | 62.98±3.19 |
| Specificity | 73.94±8.18 | 64.15±3.05 | 74.66±3.93 |
| Precision | 70.01±8.12 | 56.72±1.94 | 67.02±2.37 |
| F-Measure | 68.88±8.41 | 57.26±2.18 | 64.94±2.72 |
| AUC | 70.86±5.13 | 60.98±1.56 | 68.82±0.70 |

| Breast cancer | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 37/117 | 37/117 |
| Cost | | 38.332₋, 100.157+ | 219.348₋, 573.135+ |
| Accuracy | 70.40±5.71 | 72.35±3.94 | 71.25±4.08 |
| Recall | 56.39±15.96 | 52.48±8.62 | 55.42±9.94 |
| Specificity | 76.07±7.69 | 80.38±4.32 | 77.67±5.11 |
| Precision | 48.43±8.32 | 51.92±8.81 | 50.09±9.03 |
| F-Measure | 52.11±10.94 | 52.20±8.72 | 52.62±9.46 |
| AUC | 66.23±7.30 | 66.43±4.69 | 66.54±5.07 |

| Thyroid | training set | testing set @ TRN(max AUC) | testing set @ TST(max AUC) |
|---|---|---|---|
| Selected kernels | | 15/65 | 15/65 |
| Cost | | 173.829₋, 454.199+ | 137.757₋, 359.945+ |
| Accuracy | 95.71±4.49 | 93.65±2.99 | 93.73±2.89 |
| Recall | 88.50±13.22 | 81.89±8.52 | 81.81±8.47 |
| Specificity | 98.76±2.72 | 98.67±1.49 | 98.80±1.46 |
| Precision | 97.26±6.18 | 96.30±4.25 | 96.68±4.05 |
| F-Measure | 92.67±8.42 | 88.51±5.68 | 88.63±5.48 |
| AUC | 93.63±6.84 | 90.28±4.42 | 90.31±4.34 |

Notation

The mathematical notation and symbols used throughout the thesis are listed hereafter.

Data

N set of natural numbers

R set of real numbers

X sample of input data
