Towards accurate cancer diagnosis modeling:
Comparison of Random Forest strategies
Ahmed DEBIT
Laboratory Context
• Signatures of biomarkers
• Diagnosis of Breast Cancer
• Breast Cancer treatment response
• Design of signatures
From Omics to Clinics
• Omics: ~20k genes (RNA-seq)
• Clinics: a dozen genes (biomarkers)
• Signatures of biomarkers: robust and precise
From Omics to Clinics
Candidate methods: Boosting, Unsupervised Clustering, Deep Learning, Neural Networks, Genetic Algorithm, KNN, SVM, LDA, Random Forest
Random Forest
Training = modeling
A Random Forest is a set of decision trees.
A decision helps to stratify the data
• Classes: Normal (0) vs Tumor (1)

Learning (training) process
• Input: samples A–F described by a set of genes, with labels Tumor / Normal
• Each gene is scored by its ability to discriminate the groups of samples: its discriminative power, or importance

Decision tree (real example)
• Healthy: 0, Tumor: 1
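To make the "decision" idea concrete, here is a minimal sketch (not the talk's code) of a single decision: "is this gene's expression above a threshold?", scored by Gini impurity. The expression values, samples, and labels are toy data invented for illustration.

```python
# Toy decision stump: one gene, one threshold, scored by Gini impurity.
def gini(labels):
    """Gini impurity of a set of 0/1 labels (0 = pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(expr, labels):
    """Find the expression threshold minimizing weighted Gini impurity."""
    n = len(labels)
    best = (float("inf"), None)
    for t in sorted(set(expr)):
        left = [y for x, y in zip(expr, labels) if x <= t]
        right = [y for x, y in zip(expr, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, t)
    return best  # (impurity, threshold)

# Toy expression of one gene for samples A-F, labels Normal(0)/Tumor(1)
expr = [1.2, 0.8, 5.1, 4.7, 0.9, 6.3]
labels = [0, 0, 1, 1, 0, 1]
impurity, threshold = best_split(expr, labels)
print(impurity, threshold)  # a perfect split gives impurity 0.0
```

A real decision tree repeats this search over all candidate genes at every node, which is exactly what makes a gene's "importance" measurable.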
Prediction procedure
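The prediction procedure of a Random Forest can be sketched as a majority vote over the per-tree predictions. The "trees" below are toy threshold rules on hypothetical genes g1, g2, g3; this illustrates the voting step only, not the talk's implementation.

```python
# Majority-vote prediction over an ensemble of toy "trees".
from collections import Counter

def predict_forest(trees, sample):
    """Return the class receiving the most votes across trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy ensemble: three stumps on hypothetical genes g1, g2, g3
trees = [
    lambda s: 1 if s["g1"] > 2.0 else 0,
    lambda s: 1 if s["g2"] > 1.5 else 0,
    lambda s: 1 if s["g3"] > 3.0 else 0,
]
sample = {"g1": 4.2, "g2": 0.7, "g3": 5.0}
print(predict_forest(trees, sample))  # 2 of 3 trees vote Tumor (1) -> 1
```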
R implementations of
Random Forest algorithm
Objectives
• Empirical comparison of random-forest-based methods
• Differences/similarities of RF methods → groups of methods
• Designing a Hyper Stability Score to rank RF methods
Materials and Methods
• Datasets (perfectly balanced):
  • TCGA-BRCA (RPKM): 182 samples × 9560 genes
  • TCGA-LUSC (RPKM): 96 samples × 9262 genes
• Main classification question: the difference between paired Tumor / Normal samples will be used as a strong …
Overview of the method
1. Extreme dimensionality reduction
2. Generation of random combinations
3. Generation of random training partitions
4. Assessment of RF parameters (ntrees)
First pass Feature Selection results
[Figure: value of stability index vs. parameters]
• ntrees* = 2000, nVar* = 200
• Reduction: ~9000 → 200 variables (genes)
Second pass Feature Selection results
[Figure: value of stability index and importance ranking vs. variables (genes)]
• nVar** = 30
• Reduction: ~200 → 30 variables (genes)
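Both feature-selection passes are tuned on a stability index. The slide does not define the index; one common choice, shown here as a sketch on toy gene selections, is the mean pairwise Jaccard overlap between the gene sets selected across resamplings (this is an assumption, not necessarily the index used in the talk).

```python
# Mean pairwise Jaccard overlap of selected feature sets (toy data).
from itertools import combinations

def jaccard_stability(selected_sets):
    """1.0 = identical selections on every resampling; lower = less stable."""
    pairs = list(combinations(selected_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical gene sets selected on three resamplings
runs = [
    {"BRCA1", "TP53", "ERBB2", "MKI67"},
    {"BRCA1", "TP53", "ERBB2", "ESR1"},
    {"BRCA1", "TP53", "MKI67", "ESR1"},
]
print(round(jaccard_stability(runs), 3))
```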
I- Generation of random combinations (cancer signatures)
• Multiple predictive models using combinations of different lengths
• 2^(nVar**) − 1 possible combinations
• Random selection of 3 signatures per length
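The combination count and the per-length sampling above can be sketched as follows; nVar** = 10 (the TCGA-LUSC value) and the gene names are placeholders.

```python
# With nVar** selected genes there are 2**nVar** - 1 non-empty gene
# combinations; the protocol samples 3 random signatures per length.
import random
from math import comb

n_var = 10  # e.g. nVar** for TCGA-LUSC
total = 2 ** n_var - 1
# sanity check: sum over all non-empty subset sizes
assert total == sum(comb(n_var, k) for k in range(1, n_var + 1))

genes = [f"gene{i}" for i in range(n_var)]
rng = random.Random(0)
signatures = {
    length: [rng.sample(genes, length) for _ in range(3)]
    for length in range(1, n_var + 1)
}
print(total)               # 1023 combinations for nVar** = 10
print(len(signatures[5]))  # 3 signatures of length 5
```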
Pre-comparison
• ntrees** = 500 trees
Summary of step I + step II

Dataset     nVar   nVar*  nVar**  ntrees*  ntrees**  Nb combinations
TCGA-BRCA   9560   200    30      2000     500       78
TCGA-LUSC   9262   200    10      2000     500       21
TCGA-THCA   9353   200    40      2000     500       108

Comparison of RF methods under perfect conditions:
• Using the same random training partitions
• Assessing the same signatures
• On computational cores with the same characteristics
Random Forest Method Comparison
• For each signature: 50 resamplings to build the Training and Validation sets, with 25 modelings and validations each.
• Analysis of: the Coefficient of Variation of the resulting 1,250 models & AUCs.
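The stability analysis above reduces to one number per method and signature: the coefficient of variation (CV) of the AUCs over the 50 × 25 = 1,250 repetitions. A sketch with simulated AUC values (a real run would substitute measured AUCs):

```python
# Coefficient of variation of AUCs over 50 resamplings x 25 models.
import random
from statistics import mean, stdev

rng = random.Random(42)
# Simulated per-model AUCs, capped at 1.0 (placeholder for measured values)
aucs = [min(1.0, rng.gauss(0.95, 0.01)) for _ in range(50 * 25)]

cv = stdev(aucs) / mean(aucs)  # lower CV = more stable method
print(len(aucs), round(cv, 4))
```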
Hyper Stability discriminates RF methods
[Heatmaps: coefficient of variation of the tumor AUC over signatures (Signature1–Signature21) and resamplings (Resample0–Resample4), per RF method; worst methods vs. best methods]
Hyper Stability Score helps finding the best method(s)
Hyper Stability Score is dataset dependent
[Plots for TCGA-BRCA and TCGA-LUSC: Hyper Stability Score / average AUC (0–1) and average modeling time (ms) per RF method; HRS, HSS, AUC, Time; change between HRS and HSS]
RF methods compared: ccf, cForest, extraTrees, iForest, obliqueRF, PPforest, randomForest, randomForestSRC, randomUniformForest, ranger, Rborist, rerf, RRF, wsrf
Conclusions
Classification of classification methods:
• The AUC precision is dataset dependent.
• The methods are dataset dependent.
• Trade-off: …
Acknowledgments
Human Genetics (GIGA)
• Prof. Vincent Bours, MD, PhD
• Christophe Poulet, PhD
• Corinne Fasquelle, Ir
BIO3 Unit (GIGA)
• Prof. Kristel Van Steen
CBIO, Mines ParisTech, Institut Curie
• Chloe-Agathe Azencott, PhD
Oncology (CHU-Liege)
• Prof. Guy Jerusalem, MD
• Dr. Claire Josse
• Aurelie Poncin, MD
• Jerome Thiry