Towards an accurate cancer diagnosis modelization:
Comparison of Random Forest strategies
Ahmed DEBIT
10/05/2018
From Omics to clinics
Omics: ~20k genes (RNA-seq)
Clinics: Dozens of genes (biomarkers)
Signatures of biomarkers
Robust and accurate Models (signatures)
From Omics to clinics
Omics: ~20k genes (RNA-seq)
Clinics: Dozens of genes (biomarkers), Signatures of biomarkers
Modeling approaches: Boosting, Unsupervised Clustering, Deep Learning, Neural Networks, Genetic Algorithm, KNN, SVM, LDA, Random Forest
Random Forest: randomForest, randomForestSRC, Ranger, cForest, Rborist, extraTrees, randomUniformForest, RRF, WSRF, iterativeForest, canonicalForest, PPforest, obliqueRF, rotationForest, randomerForest
Robust and precise Models and signatures
Comparison of RF methods
• For each signature, we’ll focus on:
• 50 resamplings to build the Training and Validation sets.
• 25 modelings and validations per resampling.
• Analysis of:
• Coefficient of Variation of the 1,250 models & AUCs (50 x 25)
• TCGA-BRCA (RPKM): 182 samples x 9560 genes
• TCGA-LUSC (RPKM): 96 samples x 9262 genes
• Comparison of RF methods under the same conditions:
• Same training and validation sets.
• Same signatures.
• Same computational nodes.
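The 50 x 25 design above can be sketched as follows. This is only an illustration, not the pipeline used in the talk: the AUC values are simulated placeholders, the function names are hypothetical, and computing the coefficient of variation as sample standard deviation divided by mean is an assumption.

```python
import random
import statistics

N_RESAMPLES = 50  # training/validation resamplings per signature (from the slide)
N_MODELS = 25     # modelings and validations per resampling (from the slide)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values)


def simulate_aucs(rng, n_resamples=N_RESAMPLES, n_models=N_MODELS):
    """Placeholder AUCs: one row per resampling, one value per fitted model.

    In a real run each value would be the validation AUC of one
    Random Forest model; here they are random numbers clipped to [0, 1].
    """
    return [
        [min(1.0, max(0.0, rng.gauss(0.9, 0.02))) for _ in range(n_models)]
        for _ in range(n_resamples)
    ]


rng = random.Random(0)
aucs = simulate_aucs(rng)

# One CV per resampling: spread of the 25 AUCs obtained on the same split.
cv_per_resample = [coefficient_of_variation(row) for row in aucs]

# One CV over all 1,250 AUCs of the signature.
all_aucs = [auc for row in aucs for auc in row]
cv_overall = coefficient_of_variation(all_aucs)
```

A low value means that repeated modeling on the same data yields nearly the same AUC, which is the kind of stability the comparison looks for.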
LUSC dataset
Hyper Stability discriminates RF methods
[Figure: heatmaps of the coefficient of variation of pd_tidy_tumor_AUC over signature and resampling. x-axis: resampling; y-axis: signature (Signature1 to Signature21). Panels: Best Methods, Worst Methods.]
[Figure: Change between HRS and HSS for BRCA. x-axis: method (ccf, cForest, extraTrees, iForest, obliqueRF, PPforest, randomForest, randomForestSRC, randomUniformForest, ranger, Rborist, rerf, RRF, wsrf). Left y-axis: Hyper Stability Score / Average AUC. Right y-axis: Average modeling Time (ms). Series: HRS, HSS, AUC, Time.]
[Figure: Change between HRS and HSS for LUSC. Same methods, axes and series as the BRCA figure.]
Conclusions
Classification of classification methods
Towards robust signatures and predictions
• AUC precision is dataset-dependent.
• The ranking of the methods is dataset-dependent.
• Trade-off between:
• AUC precision (hyper-stability)
• Average AUC value
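One way to read this trade-off is to keep only the methods that no other method beats on both criteria at once. The sketch below is an assumption, not the scoring used in the talk: it uses the coefficient of variation of the AUC as a stand-in for precision, and the method names and numbers are hypothetical placeholders, not results.

```python
def pareto_front(methods):
    """Return the names of methods not dominated by any other method.

    `methods` maps a name to (average AUC, coefficient of variation).
    A method is dominated if another one has an average AUC at least
    as high and a CV at least as low, and is strictly better on one.
    """
    front = []
    for name, (auc, cv) in methods.items():
        dominated = any(
            other_auc >= auc
            and other_cv <= cv
            and (other_auc > auc or other_cv < cv)
            for other, (other_auc, other_cv) in methods.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical placeholder values: (average AUC, CV of the AUC).
methods = {
    "method_A": (0.95, 0.030),  # best average AUC, least stable
    "method_B": (0.93, 0.010),  # slightly lower AUC, most stable
    "method_C": (0.92, 0.025),  # beaten by method_B on both criteria
    "method_D": (0.90, 0.012),  # beaten by method_B on both criteria
}

candidates = pareto_front(methods)
```

Choosing among the remaining candidates is then a matter of how much average AUC one is willing to give up for stability on the dataset at hand.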