Statistical and computational approaches for network inference and comparison.
Application to regulation perturbation in cancer.
Etienne Birmelé
Analyse du génome tumoral, 2014
Motivations
Discrete approach
A continuous approach: Gaussian Graphical Models
Perturbation analysis
Take home messages
Cancer is a wound that does not heal
- Stat3 activity is associated with both wound healing and cancer in lungs (Dauer et al., Oncogene, 2005).
- 77% of genes differentially expressed in both renal regeneration and repair and renal cell carcinoma are concordantly regulated (Riss et al., Cancer Research, 2006).
- The expression pattern of genes involved in wound healing influences survival in breast cancer (Chang et al., PNAS, 2005).
Bladder cancer
Similar observations can be made on bladder tumor data (F. Radvanyi, personal communication).
- 179 samples of tumoral cells.
- 30 samples of normal human urothelium cells (J. Southgate, York).
- 7962 genes, among which 432 putative regulators.
Main question
Is it possible to determine the elements of the cell’s regulation processes that differ in tumorous cells?
Which computational and statistical methods can be used to deal with genome-wide data?
Why consider gene regulation networks
Only a very small component of the heritability of common complex diseases is explained by GWAS-identified allelic variants.
Similarly, somatic alterations account for only a small fraction of all cancer subtype cases. The majority of these phenotypes are either associated with extremely rare variants/alterations that are difficult to identify and validate, or show no statistical association with any genetic event.
Recent approaches suggest that the unexplained variance may be accounted for by the ability of master regulator genes, within cell regulatory networks, to integrate an entire spectrum of genetic and epigenetic variants where, in isolation, any one variant may not be statistically significant in a GWAS analysis.
in Lefebvre, Rieckhof and Califano, 2012.
Regulatory network zoology
Type of regulation
- Transcriptional regulation
- Alternative splicing regulation
- Post-transcriptional and post-translational regulation

Type of network
- Influence (or causal) maps
- Physical regulatory maps
- Kinetic models
Preliminary remarks and choices
- Regulation networks are context-specific.
- What does a node represent in the network? (easy)
- What does an edge represent in the network? (more tricky)
- Should we work in a discrete or continuous framework?
Motivations
Discrete approach
A continuous approach: Gaussian Graphical Models
Perturbation analysis
Take home messages
Discretization
The data is discretized, for example by using a Z-score:

    Z_ik = (X_ik − X̄_i•) / σ(X_i•)

and

    X_ik^discrete = +1 if Z_ik > c,  −1 if Z_ik < −c,  0 otherwise.
- We suppose that the putative regulators (TFs) are known, as opposed to their targets.
- Each gene is regulated by a GRN composed of a set of regulators and a truth table.

[Figure: example GRN in which regulators A and B jointly determine the discrete state of target C through a truth table over {−1, 0, 1}.]
Computational problems
1. Combinatorial explosion of the number of possible GRNs:
   - m regulators form 2^m sets in general, and Ω(m^k) sets of size at most k;
   - n regulated genes yield respectively 2^mn and Ω(m^kn) regulation graphs.

(m TFs, n targets)
Computational problems
1. Combinatorial explosion
2. How to compare two candidate regulation graphs?

A discrepancy is a GRN and a set of two experiments such that the inputs of the GRN are the same but the output is different.

[Figure: two samples with identical regulator inputs but different target outputs.]

Computing the minimum number of 'mistakes' to correct in order to solve all discrepancies for a given graph is NP-complete (Karlebach and Shamir, 2012).
Alternative strategy 1: LICORN (Elati et al., 2007)
Assumption
The space of possible GRNs is restricted: every GRN is a pair (A, I) of respectively co-activators and co-inhibitors.
Main steps of the Licorn algorithm

- Candidate co-regulator sets are sets of regulators which simultaneously have value +1 or −1 in a given fraction T_s (≈ 20%) of the samples.
- Co-activator and co-inhibitor sets for a gene g are those candidate sets which are over(under)-expressed in a sample with probability at least T_o (≥ 20%) when g is over(under)-expressed.

  Remark: the problem of discrepancies is solved at this step.

- These sets are ranked by their prediction score: for A ∈ A(g), I ∈ I(g) with A ∩ I = ∅,

      h_g(A, I) = Σ_{s∈S} |g_s − ĝ_s(A, I)|

- Generation of p-values for the GRNs by randomizing the samples.
Subgraph for the normal bladder data
[Network figure: inferred regulatory subgraph including GATA3, FOXO3, E2F5, BCL6, IRX2, SIX1, MEIS2, HOXA13, NR3C1 and other regulators and targets.]
Alternative strategy 2: Hidden variable model
Idea
The observed data follows a random variable depending on the real (but non-observed) discrete status.
If Z ∈ {−1, 0, 1} denotes the real status, X | Z = i ~ N(µ_i, σ_i).

[Figure: three Gaussian densities over the observed expression values.]

Hidden (true state) sample: 0, −1, 0, 0, 1, 1, 0
Observed sample: x1, x2, x3, x4, x5, x6, x7, x8

Problem
Genome-wide studies are not tractable for the moment (Karlebach and Shamir, 2012).
Motivations
Discrete approach
A continuous approach: Gaussian Graphical Models
- Conditional independence and partial correlation
- Graphical models
- The model
- Inference
Perturbation analysis
Take home messages
Canonical model settings
Notations
1. a set P = {1, …, p} of p variables: these are typically the genes (could be proteins, exons, …);
2. a sample N = {1, …, n} of individuals associated to the variables: these are typically the microarrays (could be sequence counts).

Basic statistical model
This can be viewed as
- a random vector X in R^p, whose j-th entry is the j-th variable,
- an n-size sample (X^1, …, X^n), such that X^i is the i-th microarray,
- assuming a Gaussian probability distribution for X.
Independence
Definition (Independence of events)
Two events A and B are independent if and only if P(A, B) = P(A)P(B), which is usually denoted by A ⊥ B. Equivalently,
- A ⊥ B ⇔ P(A|B) = P(A),
- A ⊥ B ⇔ P(A|B) = P(A|B^c).

Example (class vs party)

    Joint probability           |  Conditional probability P(party | class)
    class \ party  Labour Tory  |  class \ party  Labour Tory
    working        0.42   0.28  |  working        0.60   0.40
    bourgeoisie    0.06   0.24  |  bourgeoisie    0.20   0.80

Table: joint probability (left) vs. conditional probability (right)

Conditional independence
Definition (Conditional independence of events)
Two events A and B are conditionally independent given C if and only if P(A, B | C) = P(A|C)P(B|C), which is usually denoted by A ⊥ B | C.

Example (Do reading skills depend on weight?)
Consider the events A = "reading slowly" and B = "having low weight". Estimating P(A, B), P(A) and P(B) in a sample would lead to

    P(A, B) ≠ P(A)P(B).

But in fact, introducing C = "having a given age",

    P(A, B | C) = P(A|C)P(B|C).
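The reading/weight example can be simulated: below, age is the (hypothetical) common driver of both variables, so the marginal correlation is strong while the partial correlation given age vanishes. The partial-correlation formula used here is the standard one, which reappears later in these slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age = rng.uniform(5, 12, n)               # C: age, the common cause
weight = age + rng.normal(0, 1.0, n)      # B depends on age only
reading = age + rng.normal(0, 1.0, n)     # A depends on age only

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

r = corr(reading, weight)                 # marginal: clearly non-zero
r_ra, r_wa = corr(reading, age), corr(weight, age)
# partial correlation of reading and weight given age
partial = (r - r_ra * r_wa) / np.sqrt((1 - r_ra**2) * (1 - r_wa**2))
print(round(r, 2), round(partial, 3))
```

The marginal correlation comes out around 0.8, while the partial correlation is indistinguishable from 0.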
The univariate Gaussian distribution
The Gaussian distribution is the natural model for the expression level of a gene (noisy data).
We note X ~ N(µ, σ²), so that E X = µ, Var X = σ²,

    f_X(x) = 1/(σ√(2π)) · exp(−(x − µ)² / (2σ²)),

and

    log f_X(x) = −log(σ√(2π)) − (x − µ)² / (2σ²).

Studying genes one by one does not allow studying their interactions.
One step forward: bivariate Gaussian distribution
Let X, Y be two real random variables.

Definitions

    Cov(X, Y) = E[(X − E X)(Y − E Y)] = E(XY) − E(X)E(Y).

    ρ_XY = cor(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y)).

Proposition
- Cov(X, X) = Var(X),
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y),
- X ⊥ Y ⇒ Cov(X, Y) = 0,
- X ⊥ Y ⇔ Cov(X, Y) = 0 when (X, Y) is jointly Gaussian.
The bivariate Gaussian distribution
    f_XY(x, y) = 1/(2π √(det Σ)) · exp{ −½ (x − µ₁, y − µ₂) Σ⁻¹ (x − µ₁, y − µ₂)ᵀ }

where Σ is the variance/covariance matrix, which is symmetric and positive definite:

    Σ = [ Var(X)     Cov(X, Y) ]
        [ Cov(Y, X)  Var(Y)    ]

If standardized,

    Σ = [ 1     ρ_XY ]
        [ ρ_XY  1    ]

and

    f_XY(x, y) = 1/(2π √(1 − ρ²_XY)) · exp{ −(x² + y² − 2ρ_XY xy) / (2(1 − ρ²_XY)) },

where ρ_XY is the correlation between X and Y and describes the interaction between them.
The bivariate Gaussian distribution

The Covariance Matrix
Let X ~ N(0, Σ) with unit variances. For ρ_XY = 0,

    Σ = [ 1  0 ]
        [ 0  1 ]

while for ρ_XY = 0.9,

    Σ = [ 1    0.9 ]
        [ 0.9  1   ]

The shape of the 2-D distribution evolves accordingly.
Full generalization: multivariate Gaussian vector
Let X, Y, Z be real random variables.

Definitions

    Cov(X, Y | Z) = Cov(X, Y) − Cov(X, Z) Cov(Y, Z) / Var(Z).

    ρ_XY|Z = (ρ_XY − ρ_XZ ρ_YZ) / ( √(1 − ρ²_XZ) · √(1 − ρ²_YZ) ).

These give the interaction between X and Y once the effect of Z is removed.

Proposition
When X, Y, Z are jointly Gaussian,

    Cov(X, Y | Z) = 0 ⇔ cor(X, Y | Z) = 0 ⇔ X ⊥ Y | Z.
Conditional Independence Graphs

Definition
The conditional independence graph of a set of random variables X_1, …, X_p is the undirected graph G = {P, E} with node set P = {1, …, p} and where

    (i, j) ∉ E ⇔ X_i ⊥ X_j | X_{P\{i,j}}.

Property
It satisfies the Markov property: any two subsets of variables separated by a third one are independent conditionally on the variables in the third set.
Conditional Independence Graphs: an example
Graphical representation

[Figure: an example conditional independence graph on four nodes 1, 2, 3, 4.]

- X_1 and X_4 are conditionally independent given X_2.
- X_1 and X_4 are not conditionally independent given X_3.
Scheme for steady-state data
≈ 10s of microarrays, ≈ 1000s of probes ("genes")
Inference
Which interactions?
Modeling the underlying distribution
Model for data generation
- A microarray can be represented as a multivariate vector X = (X_1, …, X_p) ∈ R^p.
- Consider n biological replicates in the same condition, which form a usual n-size sample (X^1, …, X^n).

Gaussian Graphical Model
- X ~ N(µ, Σ), with X^1, …, X^n i.i.d. copies of X.
- Θ = (θ_ij)_{i,j∈P} = Σ⁻¹ is called the concentration matrix.
- −θ_ij / √(θ_ii θ_jj) = cor(X_i, X_j | X_{P\{i,j}}) = ρ_ij|P\{i,j}.
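The relation between the concentration matrix and the partial correlations can be illustrated on a hypothetical 3-gene covariance in which genes 1 and 3 interact only through gene 2 (σ₁₃ = σ₁₂ σ₂₃, so ρ₁₃|₂ vanishes):

```python
import numpy as np

# Hypothetical 3-gene covariance: sigma_13 = sigma_12 * sigma_23
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
Theta = np.linalg.inv(Sigma)            # concentration matrix

# rho_ij|rest = -theta_ij / sqrt(theta_ii * theta_jj)
d = np.sqrt(np.diag(Theta))
partial = -Theta / np.outer(d, d)
print(abs(partial[0, 2]) < 1e-12)       # no edge 1-3 in the graph
```

The zero in Θ (and hence the missing edge) is invisible in Σ itself, where σ₁₃ = 0.25 ≠ 0: this is exactly why the graph is read off the concentration matrix and not the covariance.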
Modeling the underlying distribution
Graphical Interpretation
The matrix Θ = (θ_ij)_{i,j∈P} encodes the network G we are looking for:

    conditional dependency between X_i and X_j
    ⇔ non-null partial correlation between X_i and X_j
    ⇔ θ_ij ≠ 0
    ⇔ edge i — j in G.

Remark
G is the conditional independence graph:
- It is undirected for steady-state data (only time-course data or biological knowledge allow retrieving the directions).
- If X_i and X_j are conditionally independent given a variable Z which is not present in the data, they will appear dependent: the links in G may not correspond to biochemical interactions.
The Maximum likelihood estimator
Let X be a random vector with distribution defined by f_X(x; Θ), where Θ are the model parameters.

Maximum likelihood estimator

    Θ̂ = argmax_Θ L(Θ; X),

where L is the log-likelihood, a function of the parameters:

    L(Θ; X) = log Π_{k=1}^n f_X(x_k; Θ),

where x_k is the k-th row of X.

Remarks
- This is a convex optimization problem.
- We just need to detect the non-zero coefficients in Θ.

The likelihood for steady-state model
ΘThe likelihood for steady-state model
Let
S= n
−1X|Xbe the empirical variance-covariance matrix:
Sis a sufficient statistic of
Θ.The log-likelihood
Liid(Θ;
S) =n
2 log det(Θ)
−n
2
Trace(SΘ) +n
2 log(2π).
The MLE =
S−1of
Θis not defined for n < p and never sparse.
The need for regularization is huge.
The penalized likelihood approach
Let Θ be the parameters to infer (the edges).

A penalized likelihood approach

    Θ̂_λ = argmax_Θ L(Θ; X) − λ · pen_ℓ1(Θ),

- L is the model log-likelihood,
- pen_ℓ1 is a penalty function tuned by λ > 0.

It performs
1. regularization (needed when n ≪ p),
2. selection (sparsity induced by the ℓ1-norm).
A geometric intuition of penalisation
The ℓ1-norm of a vector is the sum of the absolute values of its coordinates.
Among vectors with a given ℓ2-norm (Euclidean distance to 0), those with the smallest ℓ1-norm are those with all coordinates null but one.
Intuitively, the penalization therefore favors sparse values of Θ: θ_ij is chosen non-null only if the gain in likelihood is greater than the cost of the corresponding penalty.
Solving the penalized problem
- Tibshirani (1996) showed that solving the penalized likelihood problem is equivalent to the Ordinary Least Squares problem on an ℓ1-bounded area.

[Figure (caption translated from the French original): comparison of ℓ1- and ℓ2-regularized solutions. With an ℓ1 constraint, the elliptical contours of the quadratic loss typically hit the admissible region at a corner lying on an axis, so some components of β̂_ℓ1 are exactly zero; the circular ℓ2 region does not push coefficients to zero. For an orthogonal design X, the ridge estimator shrinks each least-squares coefficient proportionally, β̂_ℓ2,m = β_ls,m / (1 + λ), and can be zero only if β_ls,m is itself zero; the lasso applies "soft" thresholding, β̂_ℓ1,m = sign(β_ls,m)(|β_ls,m| − λ)₊ with [u]₊ = max(0, u), shrinking coefficients by λ and setting them to zero when |β_ls,m| ≤ λ. Breiman (1996) calls a procedure unstable if small perturbations of the training set yield very different estimators; Bousquet and Elisseeff (2002) formalized several such notions of stability.]
    minimize_{β∈R²}  ‖y − Xβ‖²₂   subject to   ‖β‖₁ = |β₁| + |β₂| ≤ c

is equivalent to

    minimize_{β∈R²}  ‖y − Xβ‖²₂ + λ ‖β‖₁.
- The LARS algorithm (Efron et al., 2004) solves the problem efficiently.
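The two orthogonal-design formulas quoted in the figure caption can be checked directly (the vector of least-squares coefficients below is arbitrary):

```python
import numpy as np

def ridge_coef(b_ls, lam):
    """Orthogonal design: ridge shrinks each OLS coefficient proportionally."""
    return b_ls / (1 + lam)

def lasso_coef(b_ls, lam):
    """Orthogonal design: lasso applies soft thresholding."""
    return np.sign(b_ls) * np.maximum(np.abs(b_ls) - lam, 0.0)

b_ls = np.array([3.0, -0.4, 0.8])
print(ridge_coef(b_ls, 1.0))   # every coefficient halved, none exactly zero
print(lasso_coef(b_ls, 1.0))   # small coefficients set exactly to zero
```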
Example: prostate cancer
[Figure: LASSO regularization path on the prostate cancer data — standardized coefficients plotted against |beta|/max|beta|; variables enter the model one at a time as the constraint is relaxed.]
Choice of the tuning parameter λ

Model selection criteria

    BIC(λ) = ‖y − Xβ̂_λ‖²₂ + df(β̂_λ) · (log n)/2
    AIC(λ) = ‖y − Xβ̂_λ‖²₂ + df(β̂_λ)

where df(β̂_λ) is the number of non-zero entries in β̂_λ; λ is chosen to minimize the criterion.

Cross-validation
1. split the data into K folds,
2. use successively each fold as the testing set,
3. compute the test error on that fold,
4. average to obtain the CV estimation of the test error.

λ is chosen to minimize the CV test error.
Many variations
Group-Lasso
Activate the variables by groups (given by the user).

Adaptive/Weighted-Lasso
Adjust the penalty level for each variable, according to prior knowledge or with data-driven weights.

BoLasso
Bootstrapped version that removes false positives and stabilizes the estimate.

etc.

+ many theoretical results.
Motivations
Discrete approach
A continuous approach: Gaussian Graphical Models
Perturbation analysis
Take home messages
Back to our initial problem
Let X and Y be two gene expression matrices corresponding to the same genes in different conditions.
What can we say about the differences in the regulation structure?
The naïve approach
Natural idea
It seems natural to infer a regulation network on normal data, a regulation network on tumoral data and then to compare.
Condition 1: a sample (X^(1)_1, …, X^(1)_{n₁}) with X^(1)_i ∈ R^p, and its inferred network.
Condition 2: a sample (X^(2)_1, …, X^(2)_{n₂}) with X^(2)_i ∈ R^p, and its inferred network.
Variability of the inference procedure
Problem
The variability of the inference procedure is greater than the biological perturbation we want to detect.
[Figure: histograms of edge loss probabilities across inference runs — left: 4 targets and 14 regulators; right: 20 targets and 59 regulators.]
The joint inference procedure
Another solution is to learn jointly the two regulation networks, by penalizing the choice of non-common edges.
Condition 1: (X^(1)_1, …, X^(1)_{n₁}), X^(1)_i ∈ R^p
Condition 2: (X^(2)_1, …, X^(2)_{n₂}), X^(2)_i ∈ R^p
→ one joint inference.
Advantages
- the sample size is larger;
- even if the network is noisy, the differences may be relevant.

Methods
- Chiquet et al., 2011
- Mohan et al., 2012
- Vialaneix and SanCristobal, 2013
The joint inference procedure: applications
- CXCL13 is a regulator which is not frequently mutated in brain cancer; however, its role is perturbed in tumorous cells (Mohan et al., 2012).
- Comparison of ER+ and ER− breast cancer (Jeanmougin et al., 2011).

[Figure: ER+ vs ER− specific regulations around ESR1, including ERBB3, ERBB4, IGF1R, EGFR, BCL2, MAPT and CDK6, annotated by molecule type (kinase, ligand-dependent nuclear receptor, transmembrane receptor, other), cellular compartment (extracellular space, plasma membrane, cytoplasm, nucleus) and interaction type (activation, repression, binding).]
The perturbation model
- Learn a reference network on the union of all samples, integrating information of different kinds;
- In a given tumoral condition, list the genes not behaving as they should according to the reference;
- Introduce a perturbation model and learn its parameters to explain the previous differences.

(Regulators → Targets)
Motivations
Discrete approach
A continuous approach: Gaussian Graphical Models
Perturbation analysis
Take home messages
Take home messages
1. Large data, including NGS, allow studying (epi)genomic regulation at the genome scale
   → theoretical and computational developments in Systems Biology.
2. A statistical link may not represent a biological link.
3. The dimension problem (n ≪ p) is crucial.
   - It implies modeling choices.
   - It implies instability: statistical/computational discoveries therefore have to be validated from a biological point of view.
   - Enlarge your samples as much as possible!
Thanks for your attention!
Thanks also to Julien Chiquet for his slides on GGMs and to François Radvanyi's group for their data.
Bibliography
D. Dauer, Oncogene, 2005.
Stat3 regulates genes common to both wound healing and cancer.
J. Riss et al., Cancer Research, 2006.
Cancers as Wounds that Do Not Heal: Differences and Similarities between Renal Regeneration/Repair and Renal Cell Carcinoma
HY. Chang et al. , PNAS, 2005.
Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival.
C. Lefebvre et al., WIRE Syst Biol Med, 2012.
Reverse-engineering human regulatory networks.
M. Elati et al., Bioinformatics, 2007.
LICORN: learning cooperative regulation networks from gene expression data.
G. Karlebach and R. Shamir, J. Comp. Bio., 2012.
Constructing logical models of gene regulatory networks by integrating transcription factor-DNA interactions with expression data: an entropy based approach.
R. Tibshirani, JRSS B, 1996.
Regression Shrinkage and Selection via the Lasso.
B. Efron, Ann. Stat., 2004.
Least Angle Regression.
J. Chiquet et al., Statistics and Computing, 2011.
Inferring multiple graphical structures.
K. Mohan and al., NIPS, 2012.
Structured sparse learning of multiple Gaussian graphical models.
N. Villa-Vialaneix and M. SanCristobal, SFdS, 2013.
Consensus LASSO: inférence conjointe de réseaux de gènes dans des conditions expérimentales multiples.
M. Jeanmougin et al., personal communication.
Network Inference in Breast Cancer with Gaussian Graphical Models and Extensions