Towards accurate cancer diagnosis modeling:
Comparison of Random Forest strategies
Ahmed DEBIT
Laboratory Context
• Signatures of biomarkers
• Diagnosis of Breast Cancer
• Breast Cancer treatment response
• Design of signatures
From Omics to Clinics
• Omics: ~20k genes (RNA-seq)
• Clinics: a dozen genes (biomarkers)
• Signatures of biomarkers: robust and precise
From Omics to Clinics
Candidate methods: Boosting, Unsupervised Clustering, Deep Learning, Neural Networks, Genetic Algorithm, KNN, SVM, LDA, Random Forest
Random Forest
Training = modeling
A Random Forest is a set of decision trees.
A decision helps to stratify the data
• Classes: Normal (0) vs Tumor (1)

Learning (training) process
• Input: samples A–F described by a set of genes, with labels Tumor / Normal
• Each gene is scored by its ability to discriminate the groups of samples: its discriminative power, or importance

Decision tree (real example)
• Healthy: 0, Tumor: 1
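To make the "decision" idea concrete, here is a minimal sketch (not the talk's code) of a single decision: "is this gene's expression above a threshold?", scored by Gini impurity. The expression values, samples, and labels are toy data invented for illustration.

```python
# Toy decision stump: one gene, one threshold, scored by Gini impurity.
def gini(labels):
    """Gini impurity of a set of 0/1 labels (0 = pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(expr, labels):
    """Find the expression threshold minimizing weighted Gini impurity."""
    n = len(labels)
    best = (float("inf"), None)
    for t in sorted(set(expr)):
        left = [y for x, y in zip(expr, labels) if x <= t]
        right = [y for x, y in zip(expr, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, t)
    return best  # (impurity, threshold)

# Toy expression of one gene for samples A-F, labels Normal(0)/Tumor(1)
expr = [1.2, 0.8, 5.1, 4.7, 0.9, 6.3]
labels = [0, 0, 1, 1, 0, 1]
impurity, threshold = best_split(expr, labels)
print(impurity, threshold)  # a perfect split gives impurity 0.0
```

A real decision tree repeats this search over all candidate genes at every node, which is exactly what makes a gene's "importance" measurable.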
Prediction procedure
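The prediction procedure of a Random Forest can be sketched as a majority vote over the per-tree predictions. The "trees" below are toy threshold rules on hypothetical genes g1, g2, g3; this illustrates the voting step only, not the talk's implementation.

```python
# Majority-vote prediction over an ensemble of toy "trees".
from collections import Counter

def predict_forest(trees, sample):
    """Return the class receiving the most votes across trees."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy ensemble: three stumps on hypothetical genes g1, g2, g3
trees = [
    lambda s: 1 if s["g1"] > 2.0 else 0,
    lambda s: 1 if s["g2"] > 1.5 else 0,
    lambda s: 1 if s["g3"] > 3.0 else 0,
]
sample = {"g1": 4.2, "g2": 0.7, "g3": 5.0}
print(predict_forest(trees, sample))  # 2 of 3 trees vote Tumor (1) -> 1
```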
R implementations of
Random Forest algorithm
Objectives
• Empirical comparison of random-forest-based methods
• Differences/similarities of RF methods → groups of methods
• Designing a Hyper Stability Score to rank RF methods
Materials and Methods
• Datasets (perfectly balanced):
  • TCGA-BRCA (RPKM): 182 samples × 9560 genes
  • TCGA-LUSC (RPKM): 96 samples × 9262 genes
• Main classification question: the difference between paired Tumor / Normal samples will be used as a strong …
Overview of the method
1. Extreme dimensionality reduction
2. Generation of random combinations
3. Generation of random training partitions
4. Assessment of RF parameters (ntrees)
First pass Feature Selection results
[Figure: value of stability index vs. parameters]
• ntrees* = 2000, nVar* = 200
• Reduction: ~9000 → 200 variables (genes)
Second pass Feature Selection results
[Figure: value of stability index and importance ranking vs. variables (genes)]
• nVar** = 30
• Reduction: ~200 → 30 variables (genes)
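Both feature-selection passes are tuned on a stability index. The slide does not define the index; one common choice, shown here as a sketch on toy gene selections, is the mean pairwise Jaccard overlap between the gene sets selected across resamplings (this is an assumption, not necessarily the index used in the talk).

```python
# Mean pairwise Jaccard overlap of selected feature sets (toy data).
from itertools import combinations

def jaccard_stability(selected_sets):
    """1.0 = identical selections on every resampling; lower = less stable."""
    pairs = list(combinations(selected_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Hypothetical gene sets selected on three resamplings
runs = [
    {"BRCA1", "TP53", "ERBB2", "MKI67"},
    {"BRCA1", "TP53", "ERBB2", "ESR1"},
    {"BRCA1", "TP53", "MKI67", "ESR1"},
]
print(round(jaccard_stability(runs), 3))
```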
I- Generation of random combinations (cancer signatures)
• Multiple predictive models using combinations of different lengths
• 2^(nVar**) − 1 possible combinations
• Random selection of 3 signatures per length
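The combination count and the per-length sampling above can be sketched as follows; nVar** = 10 (the TCGA-LUSC value) and the gene names are placeholders.

```python
# With nVar** selected genes there are 2**nVar** - 1 non-empty gene
# combinations; the protocol samples 3 random signatures per length.
import random
from math import comb

n_var = 10  # e.g. nVar** for TCGA-LUSC
total = 2 ** n_var - 1
# sanity check: sum over all non-empty subset sizes
assert total == sum(comb(n_var, k) for k in range(1, n_var + 1))

genes = [f"gene{i}" for i in range(n_var)]
rng = random.Random(0)
signatures = {
    length: [rng.sample(genes, length) for _ in range(3)]
    for length in range(1, n_var + 1)
}
print(total)               # 1023 combinations for nVar** = 10
print(len(signatures[5]))  # 3 signatures of length 5
```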
Pre-comparison
• ntrees** = 500 trees
Summary of step I + step II

Dataset     nVar   nVar*  nVar**  ntrees*  ntrees**  Nb combinations
TCGA-BRCA   9560   200    30      2000     500       78
TCGA-LUSC   9262   200    10      2000     500       21
TCGA-THCA   9353   200    40      2000     500       108

Comparison of RF methods under perfect conditions:
• Using the same random training partitions
• Assessing the same signatures
• On computational cores with the same characteristics
Random Forest Method Comparison
• For each signature: 50 resamplings to build the Training and Validation sets, with 25 modelings and validations each.
• Analysis of: the Coefficient of Variation of the resulting 1,250 models & AUCs.
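The stability analysis above reduces to one number per method and signature: the coefficient of variation (CV) of the AUCs over the 50 × 25 = 1,250 repetitions. A sketch with simulated AUC values (a real run would substitute measured AUCs):

```python
# Coefficient of variation of AUCs over 50 resamplings x 25 models.
import random
from statistics import mean, stdev

rng = random.Random(42)
# Simulated per-model AUCs, capped at 1.0 (placeholder for measured values)
aucs = [min(1.0, rng.gauss(0.95, 0.01)) for _ in range(50 * 25)]

cv = stdev(aucs) / mean(aucs)  # lower CV = more stable method
print(len(aucs), round(cv, 4))
```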
Hyper Stability discriminates RF methods
[Heatmaps: coefficient of variation of the tumor AUC over signatures (Signature1–Signature21) and resamplings (Resample0–Resample4), per RF method; worst methods vs. best methods]
Hyper Stability Score helps finding the best method(s)
Hyper Stability Score is dataset dependent
[Plots for TCGA-BRCA and TCGA-LUSC: Hyper Stability Score / average AUC (0–1) and average modeling time (ms) per RF method; HRS, HSS, AUC, Time; change between HRS and HSS]
RF methods compared: ccf, cForest, extraTrees, iForest, obliqueRF, PPforest, randomForest, randomForestSRC, randomUniformForest, ranger, Rborist, rerf, RRF, wsrf
Conclusions
Classification of classification methods:
• The AUC precision is dataset dependent.
• The methods are dataset dependent.
• Trade-off: …
Acknowledgments
Human Genetics (GIGA)
• Prof. Vincent Bours, MD, PhD
• Christophe Poulet, PhD
• Corinne Fasquelle, Ir
BIO3 Unit (GIGA)
• Prof. Kristel Van Steen
CBIO, Mines ParisTech, Institut Curie
• Chloe-Agathe Azencott, PhD
Oncology (CHU-Liege)
• Prof. Guy Jerusalem, MD
• Dr. Claire Josse
• Aurelie Poncin, MD
• Jerome Thiry