Statistical learning: some principles and applications
Adeline LECLERCQ SAMSON Massih-Reza AMINI
A. Samson Grenoble, 7/06/2018 1 / 24
Challenges in Statistical Learning
Some vocabulary in statistical learning
1. Learning a decision rule
I Prediction of a (labeled) outcome based on observed variables
I Prediction of the outcome from new observations
2. Clustering
I Creation of groups of similar individuals / objects / variables
I Search for patterns
3. Learning a model from the data
I Physical / biological models
I Estimation of the parameters, shape of the model
4. Learning associations, statistical tests
I Correlation between variables
I Interactions in a network
5. Data visualization
I Exploration of the data, outlier detection, errors
I Reduction of the dimension
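Dimension reduction, as listed above, can be sketched in a few lines; here with PCA via scikit-learn (an assumed tooling choice, the slides name no library) on toy data whose 10 variables really vary along only 2 directions:

```python
# Minimal dimension-reduction sketch: PCA on effectively rank-2 data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 individuals, 10 variables, but only 2 latent directions of variation
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)       # low-dimensional coordinates, e.g. for plotting
print(Z.shape)                 # (100, 2)
print(pca.explained_variance_ratio_.sum())  # near 1: two components suffice
```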
Challenges in Statistical Learning
Challenges in statistical learning
High dimension
I A lot of variables per individual
I Aim: learn the effect of all these variables even when the number of individuals / units remains small
Repeated measures
I Repeated trials
I Longitudinal measurements: several measures in time
I Aim: learn the variability in the process and the evolution in time
Structured data
I Connected objects: a function measured at high frequency
I Functional data
I Aim: learn a function (infinite dimension) and not only some parameters
Challenges in Statistical Learning
Main approaches in statistical learning
Parametric versus Non Parametric
Non parametric
I No prior knowledge of a model, of a relationship between variables
I Objectives: learn the model, the distribution of the variables, the network
Parametric
I Models with meaningful parameters: chemical interactions, physical laws, biological transformation
I Prior models: linear regression, reliability
I Objectives: numerical optimization of a criterion
Challenges
Parsimony: high dimension but possibly few significant signals
Optimization: new techniques to optimize complex criteria / spaces
Distributed computation: scalability of the solution
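The parsimony challenge can be illustrated with a sparse penalized regression; this sketch assumes the Lasso in scikit-learn (the slides name only the principle, not a method) on data with many variables but few true signals:

```python
# Parsimony sketch: p >> n, only 3 truly significant variables,
# recovered by the L1-penalized Lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 50, 200                      # far more variables than individuals
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [4.0, -3.0, 2.0]         # only 3 significant signals
y = X @ beta + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(selected)                     # the sparse penalty keeps few coefficients
```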
1. Decision rule
1. Decision rule, classification
Supervised learning: class labels are provided
Aim: learn a classifier to predict class labels of novel data
Statistical tools
I Logistic regression (parametric)
I K-nearest neighbors (non-parametric)
I Decision tree (non-parametric)
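The three classifiers listed above can be sketched side by side; this assumes scikit-learn as the implementation (the slides name no library) and a toy two-class dataset:

```python
# Supervised classification sketch: fit on labeled data, score on new data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# two well-separated Gaussian classes in the plane
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.repeat([0, 1], 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for clf in (LogisticRegression(),                  # parametric
            KNeighborsClassifier(n_neighbors=5),   # non-parametric
            DecisionTreeClassifier(max_depth=3)):  # non-parametric
    clf.fit(X_tr, y_tr)
    results[type(clf).__name__] = clf.score(X_te, y_te)  # accuracy on new data
print(results)
```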
1. Decision rule
Decision tree
[Figure: two classification trees fitted to the same data, splitting on variables V1 and V2 with class counts at each leaf: a pruned shallow tree and a fully grown, much deeper tree]
1. Decision rule
K-nearest neighbors
1. Decision rule
Examples
Advanced personalized medicine
I Predict the best treatment from the knowledge of biomarkers measured at an initial clinical visit
I Logistic regression and decision tree
1. Decision rule
Examples
Manufacturing industry
I High production costs
I Some non-conformities at the end of the production process
I Large number of sensors
I Decision tree to predict early in the production process which pieces are likely to be non-conforming
I Reduce the proportion of non-conformities
2. Clustering
2. Clustering
Unsupervised learning: no class label is given
Aim
I Creation of groups of similar individuals / objects / variables;
I Understanding the structure underlying the data
Statistical tools
I K-means (non-parametric)
I Mixture model (parametric)
I Bi-clustering, Stochastic Block Model (parametric)
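The first two clustering tools above can be sketched on toy data; scikit-learn is an assumed implementation choice (the slides name only the methods):

```python
# Unsupervised clustering sketch: no labels are given to either method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# two well-separated groups of individuals in the plane
X = np.vstack([rng.normal(0, 0.5, (80, 2)), rng.normal(4, 0.5, (80, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
gm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)
# both methods should recover the two original groups
print(len(set(km_labels)), len(set(gm_labels)))
```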
2. Clustering
Examples
Consumption curves
I A curve per consumer
I Prediction of the future consumption
I Clustering and identification of profiles by mixture model of functional data
I Prediction based on these clusters
2. Clustering
Bi-clustering
2. Clustering
Stochastic block model
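A minimal Stochastic Block Model sketch (numpy plus scikit-learn, both assumed choices): simulate a two-community graph, then recover the communities by spectral clustering of the adjacency matrix:

```python
# SBM sketch: dense connections within blocks, sparse between blocks.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(4)
n, p_in, p_out = 60, 0.8, 0.05          # 2 blocks of 30 nodes each
blocks = np.repeat([0, 1], n // 2)
P = np.where(blocks[:, None] == blocks[None, :], p_in, p_out)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                              # symmetric adjacency, no self-loops

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)                            # should match the two blocks
```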
2. Clustering
Examples
Autonomous Vehicle
I Precision of the position
I Sequence of images
I Segmentation of the images by Stochastic Block Model
I Prediction of the position
3. Learning a model
3. Learning a model
Aim
I Fit the data with a model
I Regression model
Statistical tools
I Differential equations
I Point process
I Estimation of the parameters (parametric)
I Maximum likelihood
I Bayesian
I Penalization with high dimension
I Estimation of the model (non-parametric)
I Splines, Wavelets, Fourier
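The two estimation routes above can be sketched on the same noisy curve; scipy is an assumed tooling choice, and the exponential-decay model is an illustrative example:

```python
# Parametric vs non-parametric model learning on one noisy curve.
import numpy as np
from scipy.optimize import curve_fit
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)
t = np.linspace(0, 5, 50)
y = 2.0 * np.exp(-0.7 * t) + 0.05 * rng.normal(size=t.size)

# parametric: the model shape is known, only (a, k) are estimated
model = lambda t, a, k: a * np.exp(-k * t)
(a_hat, k_hat), _ = curve_fit(model, t, y, p0=(1.0, 1.0))
print(a_hat, k_hat)                 # close to the true (2.0, 0.7)

# non-parametric: the whole function is estimated (smoothing spline)
spline = UnivariateSpline(t, y, s=len(t) * 0.05**2)
print(float(spline(0.0)))           # also close to 2.0
```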
3. Learning a model
Examples
Immunotherapy
I Efficacy of the treatment
I Optimization of the treatment, in terms of dose and time to treatment
3. Learning a model
Examples
Maintenance, reliability
I Maintenance optimization
I Imperfect maintenance
I Virtual age modeling, point processes
I Optimization of the next preventive maintenance
4. Learning associations
4. Learning associations, statistical tests
Aim
I Correlation between variables,
I Learning communities and networks
Statistical tools
I Correlation tests, multiple tests
I Graphical models
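Correlation tests with a multiple-testing correction can be sketched as follows; this assumes scipy for the Pearson tests and hand-rolls a Benjamini-Hochberg step (the slides name the tools, not an implementation):

```python
# Association testing sketch: one true signal among many variables,
# with a Benjamini-Hochberg correction to control the false discovery rate.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only variable 0 is truly associated

pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(p)])

# Benjamini-Hochberg at level q: keep the largest k with p_(k) <= k*q/p
q = 0.05
order = np.argsort(pvals)
passed = pvals[order] <= (np.arange(1, p + 1) * q / p)
k = passed.nonzero()[0].max() + 1 if passed.any() else 0
discoveries = np.sort(order[:k])
print(discoveries)                        # variable 0 should be discovered
```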
4. Learning associations
Examples
Genomics
I Effect of the pollution on epigenetics and baby growth
I Association tests and multiple testing
I Mediation to infer causality
4. Learning associations
Examples
Neurosciences
I Understanding the connections in the brain
I Longitudinal, functional data from EEG and MEG recordings
4. Learning associations
Statistical learning frameworks
Open-source development frameworks are available
They can be used in industry
4. Learning associations
Take-Home message
Core-idea of statistical learning
I Variety of (industrial) problems
I Variety of statistical questions
I Variety of learning approaches
I Machine learning: decision rule, clustering
I Statistical learning: model learning, association/correlation
A strategy that is effective across different disciplines
I Health
I Environment
I Energy
I Marketing
I Manufacturing industry
4. Learning associations
Some directions of ongoing research
Structured data, connected objects
Social networks, interaction graphs
High dimension
Optimization with constraints
Distributed computation