Algorithm optimization for signature discovery: case of Breast Cancers

(1)

(2)

•

(3)

•

(4)

Breast Cancer

Luminal A (~40%)

Luminal B (~20%)

Triple-negative

basal-like (15-20%)

HER2-enriched

(10-15%)

Normal-like

Source: ER+ and PR+ HER2-Low-levels Ki-67 Grow slowly Best prognosis ER+, PR+ HER2+, HER2-High-levels Ki-67 Grow slightly faster Worse prognosis ER-,

PR-

HER2-BRCA1 gene mutations

Younger and African-American women ER-,

PR-HER2+

Grow faster than luminal cancers Worse prognosis ER+, PR+ HER2-Low-levels Ki-67 Good prognosis 4

(5)

•

(6)

•

6

(7)

Random Forest

SVM

Deep Learning

Unsupervised clustering

Genetic Algorithm

(8)

(9)

(10)

Observation

Decision Threshold

≥ 0.5

< 0.5

1 tree = Multiple decisions

(11)

Training set / OOB set

miRNA1 miRNA2 … miRNAn

sample1 sample2 samplem Group Healthy Tumor . . . Healthy

and

miRNA2 miRNA5 miRNA3

(12)

Training set / OOB set Training set / OOB set Training set / OOB set

and

1 Forest = Multiple trees

(13)

(14)

Unknown Sample

Multiple Trees prediction procedure

Healthy

Tumor

Healthy

(15)

Unknown Sample

Healthy

Tumor

Healthy

(16)

Towards the best model

• Multiple models are built for the same classification

or prediction task.

• Prediction obtained on a model can assess it’s

classification power: the “AUC”.

–

High AUC value = Better prediction.

(17)

Profiling cohort (n=86)

41 Primary Breast Cancer PBC

45 Controls

Validation cohort (n=196)

108 PBC

88 Controls

# miRNAs: 188

• 25 important miRNAs

• #combinations = 2^25 -1

~ 33M

• Best Signature:

miR-16, let-7d, miR-103,

107, 148a, let-7i,

miR-19b, and miR-22*

Best signature performances

AUC = 0.81

Sensitivity = 91%

Specificity = 49%

The screening tool can be used as a complementary tool to the mammography test

to get a best diagnosis test of Breast Cancer

(18)

[P. Freres et al. 2015]

~ 20 hours

~ 2-3 months

(19)

Runtime improvement

Best AUC value

Best specificity

(20)

Runtime improvement

Best AUC value

Best specificity

Best sensitivity

(21)



•

(22)

(23)

(24)

•

(25)

j

•

(26)

26

j

•

(27)

List of 25 most important variables

coming from the FS process

(28)

Variable (set

of variables)

Most

important

PCs

Variability

of the data

value

AUC

Loadings (contribScore)

capture

Correlation

?

(29)

• –

–

(30)

#3

00 #5

00

0

30

(31)

#3

00 #5

00

(32)

21 0.84 R = 0.78 21 0.84 R = 0.73

#300

#5000

32

(33)

•

(34)

(35)

(36)

•

(37)

•

(38)

•

• prpe



Importance

38

all

v

(39)

•

(40)

List of 25 most important variables

coming from the FS process

(41)

Variable (set

of variables)

of the data

Variability

AUC

value

PRPE (prpeScore)

capture

Correlation

(42)

• –

–

•

(43)

(44)

44

5400

(45)

300 500 1K 2K

3K 4K 5K 6K

(46)

(47)

(48)

4477

(74%)

(26%)

923 PRPE_Score >= 0.0475

(49)

Linear model Pearson correlation 300K Linear model Pearson correlation 300K

(50)

• Without applying the PRPE-based filter

• Using the PRPE-based filter

Using of the same 300k combinations and the same random partitions

PRPE-based filtering method finds the same most performant combinations with the same

AUC values

(51)

Method

Total

#combinations

#processed

combinations

#eliminated

combinations

Processing timeϯ

(min)

Exhaustive*

300k

0 4578

PRPE-based

300k

42074

257926

675

* The method used in [P. Freres et al. 2015]

Ϯ Using 1000 jobs launched as parallel tasks on Lemaitre2 (CECI’s cluster), 50 jobs are allowed to be run in parallel. Each single job deals with 300 combinations.

(52)

•

(53)

•

(54)

•

54

(55)

•

(56)

Thanks for your attention

(57)

((machine learning AND breast cancer) AND

(diagnosis OR prognosis OR prediction)) ((machine learning AND breast cancer AND circulating miRNA) AND