Data analysis and stochastic modeling
Lecture 4 – Machine learning and estimation theory
Guillaume Gravier
guillaume.gravier@irisa.fr
What are we gonna talk about?
1. A gentle introduction to probability
2. Data analysis
3. Cluster analysis
4. Estimation theory and practice
   ◦ about statistical inference
   ◦ machine learning basics
   ◦ parameter estimation theory
5. Mixture models
6. Random processes
7. Hypothesis testing
8. Queueing systems
What are we gonna talk about today?
1. machine learning, usual laws, parameter estimation
◦ motivation for statistical modeling
◦ elementary parametric models (reminder)
◦ empirical estimators
◦ theory of Bayesian estimation
◦ practical estimation criteria
2. mixture models, Expectation-Maximization (EM)
Why statistical modeling?
exploratory statistics −→ inferential statistics
(describe −→ generalize, analyze −→ decide)

→ summarize data using a (parametric) model
→ estimate the parameters of a model
→ decision, discrimination, classification
→ prediction
Why statistical modeling?
Summarize all the data available in a model for which the number of parameters is small with respect to the amount of data:
◦ compare values, models
◦ take decisions
◦ classify
◦ predict
parametric vs. non-parametric
How to analyze data?
Use all the information available!
1. Prior knowledge of what we expect (and do not expect!) to see
2. Data, data, data, data... (the best data is more data)
⇓
STATISTICAL MACHINE LEARNING
NOTIONS OF
MACHINE LEARNING
Slides courtesy of Samy Bengio.
Various problems to solve
◦ Let $Z_1, Z_2, \dots, Z_n$ be an $n$-tuple random sample of an unknown distribution of density $p(z)$.
◦ All $Z_i$ are independently and identically distributed (iid).

1. Classification: $Z = (X, Y) \in \mathbb{R}^d \times \{-1, 1\}$
   ⇒ given a new $x$, estimate $P(Y \mid X = x)$
2. Regression: $Z = (X, Y) \in \mathbb{R}^d \times \mathbb{R}$
   ⇒ given a new $x$, estimate $E[Y \mid X = x]$
3. Density estimation: $Z \in \mathbb{R}^d$
   ⇒ given a new $z$, estimate $p(z)$
The function space
Learning = search for a good function in a function space $\mathcal{F}$.

Examples of parametric functions:
◦ Regression: $\hat{y} = f(x; a, b) = a \cdot x + b$
◦ Classification: $\hat{y} = f(x; a, b) = \mathrm{sign}(a \cdot x + b)$
◦ Density estimation:
$$\hat{p}(z) = f(z; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right)$$
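To make the three families concrete, here is a minimal Python/NumPy sketch (not part of the original slides) of one parametric function per problem; the parameter values a, b, mu, Sigma are arbitrary illustrations, not estimates.

```python
import numpy as np

# Regression: y_hat = f(x; a, b) = a * x + b
def f_regression(x, a=2.0, b=1.0):
    return a * x + b

# Classification: y_hat = sign(a * x + b), labels in {-1, +1}
def f_classification(x, a=2.0, b=1.0):
    return np.sign(a * x + b)

# Density estimation: multivariate Gaussian density f(z; mu, Sigma)
def f_density(z, mu, Sigma):
    d = len(mu)
    diff = z - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

x = 0.7
print(f_regression(x), f_classification(x))
print(f_density(np.zeros(2), np.zeros(2), np.eye(2)))  # = 1/(2*pi)
```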
The loss function
Learning = search for a good function in a function space $\mathcal{F}$.

Examples of loss functions $L : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$:
◦ Regression: $L(z, f) = L((x, y), f) = (f(x) - y)^2$
◦ Classification: $L(z, f) = L((x, y), f) = 0$ if $f(x) = y$, $1$ otherwise
◦ Density estimation: $L(z, f) = -\log p(z)$
Risk and empirical risk
◦ Minimize the expected risk on $\mathcal{F}$, defined for a given function $f$ as
$$R(f) = E_Z[L(z, f)] = \int_{\mathcal{Z}} L(z, f)\, p(z)\, dz$$
◦ Induction principle:
  ⊲ find $f \in \mathcal{F}$ which minimizes $R(f)$
  ⊲ problems: $p(z)$ is unknown, and we don't have access to all $L(z, f)$!
◦ Empirical risk:
$$\hat{R}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, f)$$
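As an illustration (not from the slides), the empirical risk is just the average loss over the $n$ samples; a minimal NumPy sketch for the three losses above, on a toy regression sample:

```python
import numpy as np

def squared_loss(y_pred, y):      # regression
    return (y_pred - y) ** 2

def zero_one_loss(y_pred, y):     # classification
    return np.where(y_pred == y, 0.0, 1.0)

def nll_loss(p_z):                # density estimation
    return -np.log(p_z)

def empirical_risk(losses):
    # R_hat(f, D_n) = (1/n) * sum_i L(z_i, f)
    return np.mean(losses)

# toy sample D_n = {(x_i, y_i)} and a fixed function f(x) = 2x + 1
x = np.array([0.0, 1.0, 2.0]); y = np.array([1.2, 2.9, 5.1])
print(empirical_risk(squared_loss(2 * x + 1, y)))
```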
Risk and empirical risk (cont'd)
◦ The empirical risk:
$$\hat{R}(f, D_n) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, f)$$
◦ The (expected) risk:
$$R(f) = E_Z[L(z, f)] = \int_{\mathcal{Z}} L(z, f)\, p(z)\, dz$$
◦ The empirical risk is an unbiased estimate of the risk.

The principle of empirical risk minimization (ERM):
$$f^\star(D_n) = \arg\min_{f \in \mathcal{F}} \hat{R}(f, D_n)$$
Risk and empirical risk (cont'd)
◦ Training error:
$$\hat{R}(f^\star(D_n), D_n) = \min_{f \in \mathcal{F}} \hat{R}(f, D_n)$$
◦ Is the training error a biased estimate of the risk? YES:
$$E[R(f^\star(D_n)) - \hat{R}(f^\star(D_n), D_n)] \geq 0$$
◦ The solution $f^\star(D_n)$ found by minimizing the training error is better on $D_n$ than on any other set $D'_n$ drawn from $p(z)$.
Risk and empirical risk
[Figure: (classification) error as a function of model capacity, showing the true risk and the empirical risk curves, from UNDERFITTING to OVERFITTING.]

⇒ Don't fit too much on a (limited) training set!
Is it magic?
[Cartoon: excellent results come from training data, a learning algorithm, tuning, and lots of $$$; a random algorithm plus lots of time and effort yields garbage results.]
What’s the goal?
The goal can be:
1. give the best model you can on a training set
2. give the expected performance of a model obtained by empirical risk minimization given a training set
3. give the best model and its expected performance

◦ if the goal is (1) → model selection
◦ if the goal is (2) → risk estimation
◦ if the goal is (3) → both!

Two popular protocols are used for these purposes, namely validation and cross-validation.
Validation methodology
Principle: divide the data into two separate sets.
◦ For risk estimation:
  ⊲ training set = used for model selection (possibly itself divided into training/validation sets)
  ⊲ test set = compute the empirical risk as an estimation of the risk
◦ For model selection:
  ⊲ training set = estimate the parameters with some given hyper-parameters
  ⊲ validation set = estimate the empirical risk for the model obtained on the training set
  ⊲ select the hyper-parameters with the best empirical risk on the validation set
  ⊲ estimate the model parameters on the complete data set for the optimal hyper-parameters (optional)
⇒ the empirical risk on the validation set is a very bad estimator of the risk!
Cross-validation methodology
Principle: divide the data into N separate sets $D_1, \dots, D_N$.
◦ For risk estimation:
  ⊲ for each set $D_i$:
    ⋄ model selection using the training set $\{D_j\}_{j \neq i}$
    ⋄ estimate the risk with the empirical risk on the test set $D_i$
  ⊲ average the risk estimators
◦ For model selection:
  ⊲ for each set $D_i$:
    ⋄ select the best model on the training set $\{D_j\}_{j \neq i}$ for some given hyper-parameters
    ⋄ compute the empirical risk on the validation set $D_i$
  ⊲ select the hyper-parameters with the best average empirical risk over all the validation sets
  ⊲ estimate the model parameters on the complete data set given the optimal hyper-parameters
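A minimal sketch of the risk-estimation variant in NumPy; the generic `train` and `loss` arguments are placeholders for whatever estimator and loss the problem at hand uses (here a toy "model" that is just the empirical mean, with a squared-error loss).

```python
import numpy as np

def cross_validated_risk(Z, train, loss, N=5, seed=0):
    """Split Z into N folds D_1..D_N; train on {D_j, j != i},
    evaluate the empirical risk on D_i, and average over i."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))
    folds = np.array_split(idx, N)
    risks = []
    for i in range(N):
        train_idx = np.concatenate([folds[j] for j in range(N) if j != i])
        model = train(Z[train_idx])                   # fit on the training folds
        risks.append(np.mean(loss(model, Z[folds[i]])))  # empirical risk on D_i
    return np.mean(risks)

Z = np.random.default_rng(1).normal(10, 2, size=100)
print(cross_validated_risk(Z, train=np.mean,
                           loss=lambda m, z: (z - m) ** 2))
```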
Statistical models and machine learning
Statistical modeling and decision theory can be seen as a particular "subset" of the general machine learning theory:
- function space limited to probability mass/density functions for density estimation
- only classical decision rules for classification problems
with nice properties concerning the quality of the estimated functions!
Statistics and decision theory
◦ Maximum likelihood model estimation
$$\hat{\theta} = \arg\max_{\theta} p(x; \theta)$$
◦ Classification
  Maximum a posteriori:
  $$\hat{c} = \arg\max_{c} p(c \mid x) = \arg\max_{c} p(x \mid c)\, p(c)$$
  Maximum likelihood:
  $$\hat{c} = \arg\max_{c} p_c(x)$$
  Maximum likelihood is a particular case of maximum a posteriori with $p(c)$ uniform.
◦ Hypothesis testing
$$\frac{p(x; H_0)}{p(x; H_1)} \underset{H_1}{\overset{H_0}{\gtrless}} \beta$$

The maximum a posteriori decision rule minimizes the risk associated with a decision under a 0-1 loss, i.e. the probability of error.
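A small NumPy sketch of the two classification rules on a toy two-class problem; the class-conditional Gaussians and the priors are made-up illustrations. MAP weights the class-conditional likelihood by the prior, while ML drops the prior (uniform $p(c)$).

```python
import numpy as np

def gauss_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])
priors = np.array([0.9, 0.1])            # p(c), assumed known

def classify_map(x):                     # argmax_c p(x|c) p(c)
    return np.argmax(gauss_pdf(x, means, stds) * priors)

def classify_ml(x):                      # MAP with uniform p(c)
    return np.argmax(gauss_pdf(x, means, stds))

x = 1.2
print(classify_ml(x), classify_map(x))   # ML picks class 1, MAP picks class 0
```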
Statistics and decision theory: examples
[Scatter plot: two overlapping classes of 2-D samples illustrating a decision boundary.]

classification problems: speech recognition, numerical communications
verification problems: speaker verification, target detection
Basic discrete laws (reminder)
◦ Uniform: a priori distribution when nothing is known
◦ Bernoulli: $X = 1$ with probability $p$ and $0$ with probability $1 - p$
◦ Binomial: probability that an event with occurrence probability $p$ appears $k$ times over $n$ independent trials
◦ Poisson: probability of observing $n$ events occurring at a rate $c$ during a lapse of time $T$
◦ Geometric: probability of the number of Bernoulli trials before the first "success"
◦ Negative binomial: probability of the number of Bernoulli trials before the $r$-th "success"
◦ Hypergeometric: probability of choosing $k$ defective components among $m$ samples chosen without replacement from a total of $n$ of which $d$ are defective
◦ etc.
Basic continuous laws (reminder)
◦ Uniform: a priori distribution when nothing is known
◦ Exponential: lifetime of a component or a service
◦ Gamma: somehow equivalent to Poisson in the continuous case
◦ Normal or Gaussian: almost everything
◦ Weibull: reliability
◦ etc.
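As a quick illustration (not part of the original slides), NumPy's random generator can sample from most of the laws listed in these two reminders; the parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

print(rng.uniform(0, 1, n))        # uniform
print(rng.binomial(10, 0.3, n))    # binomial: k successes over 10 trials
print(rng.poisson(2.0, n))         # Poisson: events at rate c*T = 2
print(rng.geometric(0.3, n))       # geometric: trials to first success
print(rng.exponential(1.5, n))     # exponential lifetimes
print(rng.gamma(3.0, 1.0, n))      # gamma
print(rng.normal(0.0, 1.0, n))     # Gaussian
print(rng.weibull(1.5, n))         # Weibull (reliability)
```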
Parameter estimation
Statistical inference
Given a limited amount of samples from a population, infer the properties of the entire population.

Requires independent samples representative of the population:
◦ sampling strategies exist (when sampling can be controlled)
◦ assume this to be true in pattern recognition problems

Two inference strategies:
1. estimate basic characteristics, e.g. mean, variance or median, of the population from the samples
2. estimate the parameters of a model which has been selected from expert knowledge

Note: for simple models, estimating the mean and/or the variance is equivalent to estimating the parameters of the model.
Definition of a statistic
Each sample can be seen as a random variable $X_i$ whose observed value is $x_i$, where all the variables $X_i$ have the same distribution.

Example: suppose we extract $n$ bulbs from a production line and measure their lifetime $x_i$. Assuming there has been no change in the fabrication process, the values $x_i$ can be considered as observations of a single random variable $X$. The model considers $X_i$ the random variable corresponding to the lifetime of the $i$-th bulb, whose value is $x_i$. All the variables $X_i$ follow the same distribution, that of $X$.

Definition
A statistic is a random variable which is a measurable function of $X_1, X_2, \dots, X_n$, denoted $T = f(X_1, X_2, \dots, X_n)$.
EMPIRICAL ESTIMATORS
Some well known statistics
Some well known statistics over a set of observations $X_1, \dots, X_n$:

Empirical frequency: $F_k = \frac{1}{n} \sum_{i=1}^{n} \delta(X_i = k)$

Empirical mean estimator: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

Empirical variance estimator: $S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$

A statistic is a random variable since any function of random variables is a random variable.
Empirical mean estimation
Hence, the empirical mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is a statistic estimating the mean value of a population from a finite set of samples.

[Figure: empirical mean across trials; zero-mean unit-variance Gaussian, 1000 samples per trial; empirical mean = -0.004, empirical standard deviation = 0.0306.]
Distribution of the empirical mean
It can easily be shown that
$$E[\bar{X}] = m \quad \text{and} \quad V[\bar{X}] = \frac{\sigma^2}{n}$$
The estimator $\bar{X}$ converges in quadratic mean toward $m$ when $n \to \infty$, since $E[(\bar{X} - m)^2] \to 0$.

Moreover, the central limit theorem states that
$$\frac{\bar{X} - m}{\sigma / \sqrt{n}} \xrightarrow{\mathcal{L}} \mathcal{N}(0, 1)$$

Note on Gaussian variables: for Gaussian variables, the convergence is in fact an equality, i.e. if $X_i \sim \mathcal{N}(m, \sigma^2)$, then $\bar{X} \sim \mathcal{N}(m, \sigma^2/n)$ for every $n$.
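A quick simulation (a sketch using NumPy, not from the slides) checking that $V[\bar{X}] \approx \sigma^2/n$ and that the standardized mean is close to $\mathcal{N}(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n, trials = 10.0, 2.0, 25, 100_000

# empirical mean over n samples, repeated many times
X_bar = rng.normal(m, sigma, size=(trials, n)).mean(axis=1)

print(X_bar.var())                                  # ~ sigma^2 / n = 0.16
print(((X_bar - m) / (sigma / np.sqrt(n))).std())   # ~ 1 (CLT)
```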
Empirical variance estimation
The empirical variance estimator is given by
$$S^2 = \frac{1}{n} \sum (X_i - \bar{X})^2 = \frac{1}{n} \sum X_i^2 - \bar{X}^2 = \frac{1}{n} \sum (X_i - m)^2 - (\bar{X} - m)^2$$
which leads to
$$E[S^2] = \frac{n-1}{n} \sigma^2$$
The estimator is biased, since $E[S^2] \neq \sigma^2$, but does converge when $n \to \infty$!
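A short simulation (sketch) of this bias: averaging the $1/n$ variance estimator over many trials lands near $(n-1)/n \cdot \sigma^2$, not $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 10, 100_000

X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
S2 = ((X - X.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)  # 1/n estimator

print(S2.mean())                 # ~ (n-1)/n * sigma^2 = 3.6
print((n - 1) / n * sigma2)      # 3.6
```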
Empirical proportion estimation
Given the discrete random variables $X_i$ whose values are in $[1, K]$, the proportion of the event $k$ (or equivalently the probability $p_k = P[X = k]$) is estimated by the statistic
$$F_k = \frac{1}{n} \sum_i \delta(X_i = k)$$
where $\delta(X_i = k) = 1$ if $X_i = k$ and $0$ otherwise.

It can easily be shown that
$$E[F_k] = p_k \quad \text{and} \quad V[F_k] = \frac{p_k(1 - p_k)}{n}$$
According to the central limit theorem, $F_k \to \mathcal{N}\left(p_k, \sqrt{\frac{p_k(1 - p_k)}{n}}\right)$.
.A practical example
Problem: Imagine a production line where objects are manufactured with a given length, where we know from a previous study that the (real) distribution of the length is a Gaussian of mean 10 and standard deviation 2. For quality control purposes, a set of 25 samples is taken from the production line. What is the range of values for which we have 9 chances out of 10 of observing the empirical mean $\bar{X}$?

Solution: We have seen so far that $\bar{X} \sim \mathcal{N}(10, \frac{2}{\sqrt{25}})$. Moreover, for a zero-mean unit-variance Gaussian variable $U$, $P(-1.64 < U < 1.64) = 0.9$. Hence, with a probability of 0.9, we have
$$10 - 1.64 \frac{2}{\sqrt{25}} < \bar{X} < 10 + 1.64 \frac{2}{\sqrt{25}}$$
and the range of values for $\bar{X}$ is $[9.34, 10.66]$.
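The same computation as a few lines of Python, as a check rather than new material:

```python
import math

m, sigma, n, u = 10.0, 2.0, 25, 1.64    # P(-1.64 < U < 1.64) = 0.9
half_width = u * sigma / math.sqrt(n)   # 1.64 * 2 / 5 = 0.656
print(m - half_width, m + half_width)   # [9.344, 10.656], i.e. [9.34, 10.66]
```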
INTRODUCTION TO
THE THEORY OF ESTIMATION
Quality of an estimator
We have seen that the empirical statistics $\bar{X}$, $S^2$ and $F_k$ are estimators of the mean, the variance and the (discrete) probability respectively, since they almost surely converge toward the true quantity (resp. $m$, $\sigma^2$ and $p_k$).

But other estimators can be used, e.g. for the mean:
◦ $\alpha$-truncated mean, where the $\alpha n$ biggest and smallest values are discarded
◦ median value ($\alpha = 50\%$)
◦ mean of the extreme values ($(\max(X_i) + \min(X_i))/2$)
◦ a randomly chosen sample
◦ a constant value, e.g. $0$

⇒ need for a quality measure!
Expected qualities
Let us consider an estimator $T$ of a parameter $\theta$, obtained from a set of samples $X_i$. The expected qualities of an estimator are:
◦ convergence: $T \to \theta$ when $n \to \infty$
◦ convergence speed: some estimators converge more rapidly than others
◦ risk: the risk is defined as the mean quadratic error
$$E_\theta[(T - \theta)^2] = \underbrace{(E[T] - \theta)^2}_{\text{squared bias}} + \underbrace{V[T]}_{\text{variance}}$$
⇒ Given two unbiased estimators, the best one is the one with the smallest variance.
Given two non biased estimators, the best one is the one with the smallest variance.Data analysis and stochastic modeling – Machine learning and estimation theory. – p. 35
Generalized risk
The risk can be defined in a more general way using a loss/error function $l(t, \theta)$, where $l(t, \theta) = 0$ iff $t = \theta$:
$$R(T, \theta) = E_\theta[l(T(X), \theta)]$$
Examples of error functions are:
◦ quadratic error: $l(a, b) = (a - b)^2$
◦ absolute error: $l(a, b) = |a - b|$
◦ $\epsilon$-loss: $l(a, b) = 0$ if $|a - b| < \epsilon$

Warning: unfortunately, directly minimizing the risk is possible only in some very particular cases...
Comparing the risk of estimators
◦ It is possible to compare estimators based on the risk, even though the risk function is rarely easily defined
◦ An estimator $T$ is better than an estimator $T'$ if $R(T, \theta) < R(T', \theta) \;\; \forall \theta \in \Theta$
◦ It is usually impossible to find an estimator $T$ which is better than any other for all the values of $\theta$ (think of a constant estimator)
◦ Except in some very special cases, there seldom is an estimator which is uniformly better than all the others
About biased estimators
A biased estimator is not necessarily a bad estimator!
◦ Let's consider the following biased and unbiased variance estimators, assuming the mean $m$ is known:
$$T = \frac{1}{n} \sum (X_i - m)^2 \qquad S_0^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$$
It can be shown that $T$ is better than $S_0^2$ since $V[T] < V[S_0^2]$.
◦ Let's consider the following biased and unbiased estimators of the autocorrelation $r(p) = E[X_i X_{i-p}]$ (assuming $m$ is known and null):
$$R_0 = \frac{1}{n-p} \sum_{i=1}^{n-p} X_i X_{i+p} \qquad R_1 = \frac{1}{n} \sum_{i=1}^{n-p} X_i X_{i+p}$$
$R_0$ is unbiased but has a huge variance when $p \to n$, in which case $R_1$ is often preferred.
Unbiased minimum variance estimators
The problem of finding the best estimator cannot be solved in all generality and we therefore put limits on the problem at hand:
◦ restrict the class of estimators considered
◦ still cannot solve the risk minimization problem in most of the cases
⇒ search, for a given law family $f(x, \theta)$, for the unbiased estimator of $\theta$ with the minimal variance; this search is related to the notion of sufficient statistic.
Sufficient statistics
A sufficient statistic is a statistic which contains all the information carried by the samples $X_i$ on $\theta$.

Let us denote:
◦ $L(x_1, x_2, \dots, x_n; \theta)$ the density or mass function of $(X_1, \dots, X_n)$,
◦ $T$ a statistic whose density or mass function is given by $g(t; \theta)$.

Fisher's factorization theorem
$T$ is a sufficient statistic iff $L(x; \theta) = g(t; \theta)\, h(x)$, or, in other words, if the density of $x$ conditionally to $T$ is independent of $\theta$.

The idea of the definition is the following: if, when $T$ is known, the conditional density of $(X_1, \dots, X_n)$ no longer depends on $\theta$, then $T$ carries all the information concerning $\theta$.
Examples of sufficient statistics
◦ Gaussian law, $m$ known, $\sigma$ unknown:
$$L(x; \theta) = \frac{1}{\sigma^n (\sqrt{2\pi})^n} \exp\left( -\frac{1}{2} \sum \left( \frac{x_i - m}{\sigma} \right)^2 \right)$$
For the statistic $T = \sum (X_i - m)^2$, it can be shown that $T/\sigma^2 \sim \chi^2_n$ and hence that $L(x; \theta) = g(t; \theta)\, h(x)$.
◦ Poisson with $\lambda$ unknown:
$$L(x; \theta) = \exp(-n\lambda) \frac{\lambda^{\sum x_i}}{\prod x_i!}$$
The statistic $S = \sum X_i$ is sufficient and $S \sim \mathcal{P}(n\lambda)$.

⇒ The factorization theorem can tell if a statistic is sufficient but does not tell how to find one if ever there exists one!
Theorem of Darmois
A necessary and sufficient condition for a sample $(X_1, \dots, X_n)$ to admit a sufficient statistic is that the density be from the exponential family, i.e.
$$f(x, \theta) = \exp\left( a(x)\alpha(\theta) + b(x) + \beta(\theta) \right)$$
Under certain conditions on the function $a$, the statistic $T = \sum a(X_i)$ is sufficient.

◦ Note that the theorem applies only if the definition domain of $X$ does not depend on $\theta$
◦ In fact, there exist efficient (i.e. unbiased minimum variance) estimators only for the exponential family
◦ Most common laws are from the exponential family (except those with a term of the form $x^\theta$)
More sufficient statistics
The density of the law $\gamma_\theta$ is given by
$$\ln f(x, \theta) = -x + (\theta - 1) \ln(x) - \ln(\Gamma(\theta))$$
and the statistic $\sum \ln(X_i)$ is sufficient according to the previous theorem.

More examples of sufficient statistics:
◦ Bernoulli with parameter $p$: $\sum X_i$
◦ Gaussian, $m$ unknown, $\sigma$ known: $\sum X_i$
◦ Gaussian, $m$ known, $\sigma$ unknown: $\sum (X_i - m)^2$
◦ Gaussian, $m$ and $\sigma$ unknown: $(\bar{X}, S^2)$
◦ Exponential law: $\sum X_i$
iData analysis and stochastic modeling – Machine learning and estimation theory. – p. 43
The role of sufficient statistics
Theorem of Rao-Blackwell
If $T$ is an unbiased estimator of $\theta$ and $U$ a sufficient statistic for $\theta$, then $T^* = E[T \mid U]$ is an unbiased estimator of $\theta$ at least as good as $T$.

Theorem
If there exists a sufficient statistic $U$ for $\theta$, then the unique unbiased minimum variance estimator $T$ of $\theta$ only depends on $U$.

Theorem of Lehmann-Scheffé
If $T^*$ is an unbiased estimator of $\theta$ depending on a complete sufficient statistic $U$, then $T^*$ is the unique unbiased minimum variance estimator of $\theta$. In particular, if $T$ is an unbiased estimator of $\theta$, then $T^* = E[T \mid U]$.

In other words, an unbiased estimator function of a complete sufficient statistic is the best possible estimator.
ESTIMATION TECHNIQUES
Moment based methods
Principle: express analytically the moments as a function of the parameters and estimate the parameter values based on the empirical estimates.

Let $g_i$ be functions such that $\forall \theta \in \Theta,\; E_\theta[g_i(X)] < \infty$. Typical such functions are $g_i(x) = x^i$ or $g_i(x) = I(x \in \Delta_i)$.

Moment estimates are solutions to the equation system given by
$$E_\theta[g_i(X)] = \mu_i$$
iData analysis and stochastic modeling – Machine learning and estimation theory. – p. 46
Moment based methods: an example
◦ Lifetime of a component represented by the distribution $[\lambda^\alpha / \Gamma(\alpha)]\, x^{\alpha - 1} \exp(-\lambda x)$
◦ Let's consider the two moments $\mu_1(\theta) = E_\theta[X] = \alpha/\lambda$ and $\mu_2(\theta) = E_\theta[X^2] = \alpha(1 + \alpha)/\lambda^2$
◦ This equation system has a single solution
$$\alpha = (\mu_1(\theta)/\sigma(\theta))^2 \quad \text{and} \quad \lambda = \mu_1(\theta)/\sigma^2(\theta)$$
where $\sigma^2(\theta) = \mu_2(\theta) - \mu_1^2(\theta)$.
◦ By replacing the moments by their empirical estimates, the moment estimates are obtained.
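A minimal NumPy sketch (not from the slides) of this moment estimator for the Gamma lifetime model: plug the empirical mean and variance into $\alpha = \mu_1^2/\sigma^2$ and $\lambda = \mu_1/\sigma^2$.

```python
import numpy as np

def gamma_moment_estimates(x):
    """Method-of-moments estimates for the Gamma density
    [lambda^alpha / Gamma(alpha)] x^(alpha-1) exp(-lambda x)."""
    mu1 = x.mean()                 # empirical first moment
    var = x.var()                  # empirical sigma^2 = mu2 - mu1^2
    alpha_hat = mu1 ** 2 / var     # alpha = (mu1 / sigma)^2
    lambda_hat = mu1 / var         # lambda = mu1 / sigma^2
    return alpha_hat, lambda_hat

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=1 / 2.0, size=10_000)  # alpha=3, lambda=2
print(gamma_moment_estimates(x))   # ~ (3.0, 2.0)
```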
Maximum likelihood estimators
Let $X = (X_1, \dots, X_n)$ be random variables in $\mathbb{R}^d$, and $p(x; \theta)$ the density. The likelihood is the joint density of the observations, seen as a function of $\theta$: $\theta \to p(x; \theta)$.

The maximum likelihood estimator $\hat{\theta}(X)$ is such that
$$p(x; \hat{\theta}(X)) \geq \max_{\theta \in \Theta} p(x; \theta)$$
and, if $p(x; \theta)$ is differentiable, it is given by
$$\frac{\partial p(x; \theta)}{\partial \theta} = 0.$$

Note: the maximum likelihood estimator looks for the best fit of the (training) samples, assuming that the observations were the most probable.
Maximum likelihood estimators (cont’d)
In practice, we often use
$$\frac{\partial \ln p(x; \theta)}{\partial \theta} = 0.$$
Moreover, if the $X_i$'s are iid, then
$$\ln p(x; \theta) = \sum_i \ln p(x_i; \theta)$$
Maximum likelihood estimators: an example
Assume $X_i \sim \mathcal{N}(\mu, \sigma^2)$. The log-likelihood is given by
$$\ln p(x; \mu, \sigma^2) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$
The maximum likelihood equations are given by
$$\frac{\partial \ln p(x; \mu, \sigma^2)}{\partial \mu} = 0 \quad \text{and} \quad \frac{\partial \ln p(x; \mu, \sigma^2)}{\partial \sigma^2} = 0$$
for which the solutions are given by
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu})^2$$
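The closed-form solutions above are one line each in NumPy (a sketch on simulated data); note the $1/n$ (biased) variance, which is exactly what ML gives:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, size=1_000)

mu_hat = x.mean()                        # (1/n) sum x_i
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/n) sum (x_i - mu_hat)^2, biased
print(mu_hat, sigma2_hat)                # ~ 10, ~ 4
```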
Theory of maximum likelihood estimation
Maximum likelihood estimation is related to unbiased minimum variance estimation and moment estimators.
◦ If there exists a sufficient statistic $U$, the maximum likelihood estimator depends on it:
$$p(x; \theta) = g(u; \theta)\, h(x) \quad \text{and} \quad \frac{\partial \ln p(x; \theta)}{\partial \theta} = \frac{\partial \ln g(u; \theta)}{\partial \theta} \quad \text{hence} \quad \hat{\theta} = f(u)$$
◦ If $\hat{\theta}$ is the ML estimator of $\theta$, then $f(\hat{\theta})$ is the ML estimator of $f(\theta)$.
◦ The ML estimate is asymptotically efficient, i.e.
$$V[\hat{\theta}_n] \to \frac{1}{I_n(\theta)}$$
◦ For the exponential family, the ML estimates are equal to the moment estimates.
Dangers of data fitting
[Figure: (classification) error as a function of model capacity, showing the true risk and the empirical risk curves, from UNDERFITTING to OVERFITTING.]
Maximum a posteriori
◦ The ML estimator might lead to bad solutions: e.g., very small variances for Gaussians when the amount of training data is small
◦ The maximum a posteriori (MAP) estimator is given by
$$\hat{\theta} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)$$
◦ The MAP estimator acts as a regularized ML estimator: the prior $p(\theta)$ regularizes the ML estimation term $p(x \mid \theta)$.
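As a concrete illustration (not from the slides): with a Gaussian prior $\theta \sim \mathcal{N}(\mu_0, \tau^2)$ on the mean of a Gaussian with known variance $\sigma^2$, the MAP estimate has a closed form that shrinks the ML estimate toward the prior mean; a sketch under those assumptions:

```python
import numpy as np

def map_gaussian_mean(x, sigma2, mu0, tau2):
    """MAP estimate of the mean of N(theta, sigma2) under a
    conjugate prior theta ~ N(mu0, tau2): a precision-weighted
    average of the prior mean and the ML estimate (shrinkage)."""
    n = len(x)
    w_lik, w_prior = n / sigma2, 1.0 / tau2
    return (w_lik * x.mean() + w_prior * mu0) / (w_lik + w_prior)

x = np.array([11.0, 9.5, 10.5])           # few samples: the prior matters
print(x.mean())                            # ML estimate ~ 10.33
print(map_gaussian_mean(x, sigma2=4.0, mu0=8.0, tau2=1.0))  # pulled toward 8
```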
Other approaches
Some other fancy criteria for parameter estimation:
◦ (Bayesian) information criterion [BIC]
◦ minimum classification errors
⊲ explicit minimization [MCE]
⊲ neural networks based estimation
◦ maximum entropy (models) [Maxent]
⊲ choose the model with the largest entropy possible!
◦ maximum mutual information [MMI]
◦ etc.
Other approaches: examples
◦ Bayesian information criterion
$$\mathrm{BIC}(x_1^n, \theta) = \ln p(x_1^n; \theta_{\mathrm{ML}}) - \frac{1}{2}\, \#\theta\, \ln(n)$$
◦ minimum classification error
$$d(i) = -\ln p(x_i, c_i; \theta) + \ln \left[ \frac{1}{N} \sum_{j \neq i} \exp\left( \eta \ln p(x_i, c_j; \theta) \right) \right]^{1/\eta}$$
$$e(i) = \frac{1}{1 + \exp(-\alpha d(i) + \beta)}$$
⇒ minimize $\sum_i e(i)$ using a GPD algorithm.

Other approaches: examples
MCE algorithm:
1. initialize parameters $\theta_0$
2. while not converged:
   (a) compute the log-likelihoods $\ln p(x_i, c_i; \theta_k)$
   (b) update $\theta_{k+1}$ so as to maximize $\sum_i \dots$