Machine Learning Fundamentals Final Exam
2016-2017
Massih-Reza Amini
Duration: 2 hours. Authorised documents: course slides only.
Part I - 7 points
1.1 Explain in words what the consistency of the Empirical Risk Minimization principle is.
1.2 What is the distance from $x = (1, 1, 1)$ to the hyperplane of equation $\frac{1}{2}(1 + x_1 + x_2 + x_3) = 0$ in dimension 3?
1.3 Suppose that the empirical error of a prediction function $f$ on a test set of size 5000 is $\hat{L}(f, T) = 0.17$. What is the upper bound on the generalization error of $f$ that holds with probability 0.99?
1.4 We apply the AdaBoost algorithm to a training set of size 10: $S = \{(x_i, y_i);\ i \in \{1, \dots, 10\}\} \in (\mathcal{X} \times \{-1, +1\})^{10}$.
At the first iteration, the distribution over the examples in $S$ is uniform: $\forall i,\ D_1(i) = \frac{1}{10}$. We suppose that the first classifier $h_1 : \mathcal{X} \to \{-1, +1\}$ misclassifies 3 training examples. Estimate the error $\epsilon_1 = \sum_{i : h_1(x_i) \neq y_i} D_1(i)$ and deduce the weight $\alpha_1$ associated with $h_1$ by the algorithm. What are the new weights $D_2$ found by the algorithm over the training examples?
Part II - 13 points
Probabilistic Latent Semantic Indexing (PLSI), proposed by (Hofmann, 1999)¹, is a statistical model for co-occurrence data which associates an unobserved group variable $z \in \mathcal{Z} = \{z_1, \dots, z_K\}$ with each occurrence of a term $t \in \mathcal{V}$ in a document $d \in \mathcal{D}$.
Algorithm 1 provides the generation process of the terms of a document.
Algorithm 1 PLSA Model
Input:
  - Collection of documents $\mathcal{D} = \{d_1, \dots, d_N\}$
  - A set of terms $\mathcal{V} = \{t_1, \dots, t_V\}$
  - Number of latent groups $K$

Choose a document $d \in \mathcal{D}$ with probability $P(d)$
for $t \in d$ do
    Choose a theme $z \in \mathcal{Z}$ with respect to $P(z \mid d)$
    Generate $t$ with respect to $P(t \mid z)$
end for
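The generation process of Algorithm 1 can be simulated with a short Python sketch. The function name `generate_pairs`, the nested-list layout of the probability tables, and the `n_terms` argument (the number of term occurrences to draw, which Algorithm 1 leaves implicit in "for t in d") are assumptions made for illustration only.

```python
import random

def generate_pairs(p_d, p_z_given_d, p_t_given_z, n_terms, rng=None):
    """Sample (document, term) index pairs following Algorithm 1.

    p_d          -- list of N probabilities P(d_j)
    p_z_given_d  -- N x K nested list, row j holds P(z_k | d_j)
    p_t_given_z  -- K x V nested list, row k holds P(t_i | z_k)
    n_terms      -- number of term occurrences to draw (assumed given)
    """
    rng = rng or random.Random()
    n_docs = len(p_d)
    n_groups = len(p_t_given_z)
    vocab_size = len(p_t_given_z[0])

    # Choose a document d with probability P(d).
    d = rng.choices(range(n_docs), weights=p_d)[0]

    pairs = []
    for _ in range(n_terms):
        # Choose a theme z with respect to P(z | d).
        z = rng.choices(range(n_groups), weights=p_z_given_d[d])[0]
        # Generate a term t with respect to P(t | z).
        t = rng.choices(range(vocab_size), weights=p_t_given_z[z])[0]
        pairs.append((d, t))
    return pairs
```

The theme `z` is drawn but not returned, which mirrors the fact that the group variable is unobserved: only the pairs $(d, t)$ are kept.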
2.1 Suppose that each term $t$ belongs to one and only one group $z \in \mathcal{Z}$. Show that the joint probability model over $\mathcal{D} \times \mathcal{V}$ is defined by the mixture
$$P(d, t) = P(d) \sum_{z \in \mathcal{Z}} P(z \mid d)\, P(t \mid z).$$
The set of observed examples, $C = \{(d, t) \mid d \in \mathcal{D},\ t \in \mathcal{V}\}$, hence consists of all co-occurrence pairs $(d, t)$.
2.2 Show that the likelihood of the data is given by
$$L = \prod_{i=1}^{V} \prod_{j=1}^{N} \left( \sum_{k=1}^{K} P(d_j)\, P(z_k \mid d_j)\, P(t_i \mid z_k) \right)^{tf_{t_i, d_j}}, \tag{1}$$
where $tf_{t_i, d_j}$ is the frequency of term $t_i$ in document $d_j$.
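To make the structure of the likelihood concrete, the following minimal Python sketch evaluates the logarithm of the expression in (Eq. 1) for given probability tables. The function name `log_likelihood` and the nested-list layout (`tf[i][j]` for term $t_i$ and document $d_j$) are illustrative assumptions, not part of the exam.

```python
import math

def log_likelihood(tf, p_d, p_z_given_d, p_t_given_z):
    """Logarithm of Eq. (1).

    tf           -- V x N nested list, tf[i][j] = frequency of t_i in d_j
    p_d          -- list of N probabilities P(d_j)
    p_z_given_d  -- N x K nested list, row j holds P(z_k | d_j)
    p_t_given_z  -- K x V nested list, row k holds P(t_i | z_k)
    """
    n_groups = len(p_t_given_z)
    total = 0.0
    for i, row in enumerate(tf):
        for j, count in enumerate(row):
            if count == 0:
                continue  # a zero exponent contributes a factor of 1
            # Mixture over the latent groups for the pair (d_j, t_i).
            mixture = sum(
                p_d[j] * p_z_given_d[j][k] * p_t_given_z[k][i]
                for k in range(n_groups)
            )
            total += count * math.log(mixture)
    return total
```

Working with the logarithm turns the double product into a double sum, but the inner sum over groups stays inside the logarithm.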
¹ T. Hofmann (1999). Probabilistic Latent Semantic Indexing. International Conference on Research and Development in Information Retrieval (SIGIR), pages 50-57.
2.3 Specify the set of parameters $\Theta$ of the model.
2.4 Direct maximization of the log-likelihood corresponding to (Eq. 1) with respect to the model parameters is intractable. In this case, the EM algorithm is generally used to reach a local maximum of the log-likelihood. Explain how the EM algorithm works.
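2.5 Let $X$ be the set of all pairs $(d_j, t_i) \in \mathcal{D} \times \mathcal{V}$. Show that the conditional expectation of the complete log-likelihood is
$$E_{Z \mid X}\left[\ln P(X, Z \mid \Theta)\right] = E_{Z \mid X}\left[\ln \prod_{d_j \in \mathcal{D}} \prod_{t_i \in \mathcal{V}} P(d_j, t_i, Z)\right] = \sum_{t_i \in \mathcal{V}} \sum_{d_j \in \mathcal{D}} tf_{t_i, d_j} \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j, t_i) \ln\left(P(d_j)\, P(t_i \mid z_k)\, P(z_k \mid d_j)\right).$$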
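2.6 At iteration $\ell$, the current parameters of the model are used to estimate the probabilities
$$\forall z,\ \forall (d, t),\quad P^{(\ell)}(z \mid d, t) = \frac{P^{(\ell)}(z \mid d)\, P^{(\ell)}(t \mid z)}{\sum_{z'} P^{(\ell)}(z' \mid d)\, P^{(\ell)}(t \mid z')}.$$
With these estimates and the equality constraints
$$\forall d_j \in \mathcal{D},\quad \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j) = 1,$$
$$\forall z_k \in \mathcal{Z},\quad \sum_{t_i \in \mathcal{V}} P(t_i \mid z_k) = 1,$$
give the analytical forms of the parameters maximizing the conditional expectation of the complete log-likelihood, that is:
$$\Theta^{(\ell+1)} = \operatorname*{argmax}_{\Theta}\ E_{Z \mid X}\left[\ln P(X, Z \mid \Theta) \mid \Theta^{(\ell)}\right].$$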
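The posterior estimate $P^{(\ell)}(z \mid d, t)$ stated at the start of question 2.6 can be computed as in the minimal Python sketch below. It covers only that estimation step. The function name `e_step`, the nested-list layout, and the uniform fallback when the normalizer is zero are assumptions made for illustration.

```python
def e_step(p_z_given_d, p_t_given_z):
    """Return post[j][i][k] = P(z_k | d_j, t_i) from the current parameters.

    p_z_given_d  -- N x K nested list, row j holds P(z_k | d_j)
    p_t_given_z  -- K x V nested list, row k holds P(t_i | z_k)
    """
    n_docs = len(p_z_given_d)
    n_groups = len(p_t_given_z)
    vocab_size = len(p_t_given_z[0])

    post = []
    for j in range(n_docs):
        doc_rows = []
        for i in range(vocab_size):
            # Numerators P(z_k | d_j) * P(t_i | z_k) for every group k.
            joint = [p_z_given_d[j][k] * p_t_given_z[k][i] for k in range(n_groups)]
            norm = sum(joint)
            if norm > 0:
                doc_rows.append([v / norm for v in joint])
            else:
                # Assumption: fall back to a uniform posterior when no group
                # gives the pair positive probability.
                doc_rows.append([1.0 / n_groups] * n_groups)
        post.append(doc_rows)
    return post
```

For example, with `p_z_given_d = [[0.5, 0.5]]` and `p_t_given_z = [[0.8, 0.2], [0.2, 0.8]]`, the posterior over the two groups for the first term of the single document is proportional to `0.4` and `0.1`.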