
(1)

Towards accurate cancer diagnosis modeling:

Comparison of Random Forest strategies

Ahmed DEBIT

(2)

Laboratory Context

Signatures of biomarkers:
• Diagnosis of breast cancer
• Breast cancer treatment response
• Design of signatures

(3)

From Omics to Clinics

[Figure: Omics (~20k genes, RNA-seq) reduced through signatures of biomarkers to Clinics (a dozen genes used as biomarkers)]

(4)


(5)

From Omics to Clinics

[Same figure as slide (3); the signature of biomarkers must be robust and precise]

(6)

From Omics to Clinics

[Figure: candidate methods for building the signature: Boosting, Unsupervised Clustering, Deep Learning, Neural Networks, Genetic Algorithm, KNN, SVM, LDA, Random Forest]

(7)

Random Forest

[Same figure: Random Forest is retained among the candidate methods as robust and precise]

(8)

Random Forest

Training = modeling

(9)


(10)

Random Forest

A Random Forest is a set of decision trees.
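As a minimal R sketch of this training step (the expression matrix and gene names are invented toy data, not the study's; the classic randomForest package is assumed):

# Toy data: 6 samples x 4 hypothetical genes (not the TCGA data)
library(randomForest)
set.seed(42)
expr <- matrix(rnorm(6 * 4), nrow = 6,
               dimnames = list(paste0("sample", 1:6),
                               paste0("gene", 1:4)))
labels <- factor(c("Normal", "Normal", "Normal",
                   "Tumor", "Tumor", "Tumor"))
# A Random Forest is a set of decision trees, each grown on a
# bootstrap sample of the training data with random feature choices
rf <- randomForest(x = expr, y = labels, ntree = 500)
print(rf)  # out-of-bag confusion matrix and error estimate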

(11)

A decision helps to stratify the data
Normal (0) / Tumor (1)

Learning (training) process

(12)


(13)

A decision helps to stratify the data

Learning (training) process

[Figure: toy training matrix: samples A-F, a set of genes, labels Tumor/Normal]

(14)


(15)

A decision helps to stratify the data
Normal (0) / Tumor (1)

Learning (training) process: a gene's ability to discriminate groups of samples defines its importance (discriminative power → importance)

[Figure: toy training matrix: samples A-F, a set of genes, labels Tumor/Normal]
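Continuing the toy sketch, the forest's importance measure captures exactly this (MeanDecreaseGini is randomForest's default; the deck does not state which importance measure the study used):

# Discriminative power -> importance: each gene is scored by how
# much its splits help separate Tumor from Normal across the forest
imp <- importance(rf)  # MeanDecreaseGini by default
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]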

(16)


(17)


(18)


(19)

Decision tree (real example)

[Figure: a decision tree learned from the data; classes Healthy: 0, Tumor: 1]
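For illustration only, a single tree like the one shown can be fitted with rpart on the toy data (an assumption for the sketch; the deck's tree comes from the real TCGA data):

# One decision tree: each internal node thresholds a gene's expression
library(rpart)
df <- data.frame(expr, label = labels)
tree <- rpart(label ~ ., data = df, method = "class",
              control = rpart.control(minsplit = 2, cp = 0))
print(tree)  # splits of the form "geneX < t" leading to 0/1 leaves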

(20)


(21)


(22)

Prediction procedure

(23)


(24)


(25)

R implementations of the Random Forest algorithm
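Most of the R implementations compared later expose a similar fitting interface; as one hedged example, the same toy task with ranger (one of the packages that appears in the results):

# Same task, different implementation: ranger (fast C++ backend)
library(ranger)
rf_ranger <- ranger(label ~ ., data = df,
                    num.trees = 500, probability = TRUE)
rf_ranger$prediction.error  # out-of-bag Brier score for probability forests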

(26)


(27)

Objectives

• Empirical comparison of Random Forest-based methods
• Differences/similarities of RF methods → groups of methods
• Designing a high-stability score to rank RF methods

(28)


(29)

Materials and Methods

• Datasets (perfectly balanced):
• TCGA-BRCA (RPKM): 182 samples x 9560 genes
• TCGA-LUSC (RPKM): 96 samples x 9262 genes
• Main classification question: the difference between paired Tumor / Normal samples will be used as a strong

(30)

Overview of the method

(31)

• Extreme dimensionality reduction
• Generation of random combinations
• Generation of random training partitions
• Assessment of RF parameters (ntrees)
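The random training partitions can be sketched as follows (the 2/3 training fraction and the class-stratified draw are assumptions; the deck does not state the exact scheme):

# One random, class-stratified training partition (assumed 2/3 train)
train_idx <- unlist(lapply(split(seq_along(labels), labels),
                           function(i) sample(i, ceiling(2/3 * length(i)))))
train <- df[train_idx, ]   # used to grow the forest
valid <- df[-train_idx, ]  # held out for validation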

(32)


(33)
(34)


(35)


(36)

First pass Feature Selection results

ntrees* = 2000, nVar* = 200: from ~9000 down to 200 variables (genes)

[Figure: value of the stability index]
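The deck does not spell out the selection rule behind the stability index; one plausible sketch of the first pass is an importance ranking that keeps the top nVar* genes (expr_all and labels_all stand for the full ~9000-gene matrix and its labels, hypothetical here):

# First pass (sketch): rank all genes by forest importance with
# ntrees* = 2000, keep the top nVar* = 200
nVar_star <- 200
rf_all <- randomForest(x = expr_all, y = labels_all, ntree = 2000)
ranking <- sort(importance(rf_all)[, "MeanDecreaseGini"],
                decreasing = TRUE)
selected <- names(ranking)[seq_len(nVar_star)]  # ~9000 -> 200 genes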

(37)


(38)


(39)


(40)

Second pass Feature Selection results

nVar** = 30: from ~200 down to 30 variables (genes)

[Figure: value of the stability index; importance ranking of the variables (genes)]
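Read as a sketch under the same caveat (the stability index itself is not defined in the deck), the second pass can be seen as re-ranking the 200 retained genes across repeated random draws and keeping the nVar** most stably top-ranked ones:

# Second pass (sketch): keep the genes that most often reach the
# top of the importance ranking across 50 random bootstraps
nVar_2star <- 30
expr_sel <- expr_all[, selected]  # the 200 genes kept by the first pass
top_genes <- unlist(lapply(1:50, function(b) {
  idx <- sample(nrow(expr_sel), replace = TRUE)
  rf_b <- randomForest(x = expr_sel[idx, ], y = labels_all[idx],
                       ntree = 500)
  names(sort(importance(rf_b)[, "MeanDecreaseGini"],
             decreasing = TRUE))[seq_len(nVar_2star)]
}))
sig_genes <- names(sort(table(top_genes),
                        decreasing = TRUE))[seq_len(nVar_2star)]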

(41)
(42)

I. Generation of random combinations (cancer signatures)

• Multiple predictive models using combinations of different lengths
• In principle, 2^nVar** − 1 non-empty combinations of the nVar** genes
• Random selection of 3 signatures per length

Pre-comparison
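A sketch of the signature sampling above (the range of lengths, 2 to nVar**, is an assumption; the deck only reports the resulting totals, e.g. 78 for BRCA):

# 2^nVar** - 1 non-empty gene combinations exist in principle;
# instead of enumerating them, draw 3 random signatures per length
set.seed(1)
signatures <- unlist(lapply(2:length(sig_genes), function(len) {
  replicate(3, sample(sig_genes, len), simplify = FALSE)
}), recursive = FALSE)
length(signatures)  # 3 signatures per tested combination length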

(43)


(44)

Pre-comparison: 500 trees

(45)

Summary of step I + step II

Dataset      nVar   nVar*  nVar**  ntrees*  ntrees**  Number of combinations
TCGA-BRCA    9560   200    30      2000     500       78
TCGA-LUSC    9262   200    10      2000     500       21
TCGA-THCA    9353   200    40      2000     500       108

(46)
(47)

• Comparison of RF methods under identical, controlled conditions:
• Using the same random training partitions
• Assessing the same signatures
• On computational cores with the same characteristics

(48)

Random Forest Method Comparison

• For each signature:
• 50 resamplings to build the training and the validation sets
• 25 modelings and validations per resampling
• Analysis of: the coefficient of variation over the resulting 1,250 models and AUCs
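A sketch of the dispersion measure behind this analysis (the pROC-based AUC helper below is an assumption; only the 50 x 25 = 1,250 count comes from the deck):

# Validation AUC of one model on one resampling (assumed helper)
library(pROC)
auc_once <- function(train, valid) {
  m <- randomForest(label ~ ., data = train, ntree = 500)
  p <- predict(m, valid, type = "prob")[, "Tumor"]
  as.numeric(auc(roc(valid$label, p, quiet = TRUE)))
}
# Stability of a method = coefficient of variation of its 1,250 AUCs
# (50 resamplings x 25 models); lower CV = more stable method
cv <- function(x) sd(x) / mean(x)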

(49)


(50)


(51)

Hyper Stability discriminates RF methods

[Figure: heatmaps of the coefficient of variation of the tumor-prediction AUC over signatures (Signature1-Signature21) and resamplings (Resample0-Resample4), separating the worst methods from the best methods]

(52)

Hyper Stability Score helps find the best method(s)

[Figure, TCGA-BRCA: HRS and HSS per method, alongside average AUC and average modeling time (ms)]

(53)

Hyper Stability Score is dataset dependent

[Figures, TCGA-BRCA and TCGA-LUSC: HRS, HSS, average AUC, and average modeling time (ms) per method; panel labels include ccf, cForest, extraTrees, iForest, obliqueRF, PPforest, randomForest, randomForestSRC, randomUniformForest, ranger, Rborist, rerf, RRF, and wsrf]

(54)

Conclusions

Classification of classification methods:

• The AUC precision is dataset dependent
• The best-performing methods are dataset dependent
• Trade-off between stability, AUC, and modeling time

(55)

Acknowledgments

Human Genetics (GIGA)
• Prof. Vincent Bours, MD, PhD
• Christophe Poulet, PhD
• Corinne Fasquelle, Ir

BIO3 Unit (GIGA)
• Prof. Kristel Van Steen

CBIO, Mines ParisTech, Institut Curie
• Chloe-Agathe Azencott, PhD

Oncology (CHU-Liege)
• Prof. Guy Jerusalem, MD
• Dr. Claire Josse
• Aurelie Poncin, MD
• Jerome Thiry

(56)
(57)
(58)
(59)

