2016-2017


Machine Learning Fundamentals Final Exam


Massih-Reza Amini

Duration: 2 hours. Authorised documents: course slides only.

Part I - 7 points

1.1 Explain in words what the consistency of the Empirical Risk Minimization principle is.

1.2 What is the distance from $x = (1, 1, 1)$ to the hyperplane of equation $\frac{1}{2}(1 + x_1 + x_2 + x_3) = 0$ in dimension 3?

1.3 Suppose that the empirical error of a prediction function $f$ on a test set of size 5000 is $L(f, \hat{T}) = 0.17$. What is the upper bound of the generalization error of $f$ that holds with probability 0.99?

1.4 We apply the AdaBoost algorithm to a training set of size 10: $S = \{(x_i, y_i);\ i \in \{1, \ldots, 10\}\} \in (\mathcal{X} \times \{-1, +1\})^{10}$.

At the first iteration, the distribution over the examples in $S$ is uniform: $\forall i,\ D_1(i) = \frac{1}{10}$. We suppose that the first classifier $h_1 : \mathcal{X} \to \{-1, +1\}$ misclassifies 3 training examples. Estimate the error $\epsilon_1 = \sum_{i : h_1(x_i) \neq y_i} D_1(i)$ and deduce the weight $\alpha_1$ associated with $h_1$ by the algorithm. What are the new weights $D_2$ found by the algorithm over the training examples?


Part II - 13 points

Probabilistic Latent Semantic Indexing (PLSI), proposed by Hofmann (1999)¹, is a statistical model for co-occurrence data which associates an unobserved group variable $z \in \mathcal{Z} = \{z_1, \ldots, z_K\}$ with each occurrence of a term $t \in \mathcal{V}$ in a document $d \in \mathcal{D}$.

Algorithm 1 provides the generation process of the terms of a document.

Algorithm 1: PLSA Model

Input:
  - Collection of documents $\mathcal{D} = \{d_1, \ldots, d_N\}$
  - A set of terms $\mathcal{V} = \{t_1, \ldots, t_V\}$
  - Number of latent groups $K$

Choose a document $d \in \mathcal{D}$ with probability $P(d)$
for $t \in d$ do
    Choose a theme $z \in \mathcal{Z}$ with respect to $P(z \mid d)$
    Generate $t$ with respect to $P(t \mid z)$
end for
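The generation process of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the toy probability tables, the fixed document length `n_terms`, and the helper name `generate_document_terms` are assumptions introduced here, since the algorithm itself does not say how the number of terms of a document is chosen.

```python
import random

def generate_document_terms(p_z_given_d, p_t_given_z, n_terms, rng):
    """Sample n_terms term indices for one document, following Algorithm 1.

    p_z_given_d: list of K probabilities P(z | d) for the chosen document.
    p_t_given_z: K lists of V probabilities, row k holding P(t | z_k).
    n_terms: document length (an assumption; not specified by Algorithm 1).
    """
    groups = range(len(p_z_given_d))
    vocab = range(len(p_t_given_z[0]))
    terms = []
    for _ in range(n_terms):
        # Choose a theme z with respect to P(z | d).
        z = rng.choices(groups, weights=p_z_given_d, k=1)[0]
        # Generate a term t with respect to P(t | z).
        t = rng.choices(vocab, weights=p_t_given_z[z], k=1)[0]
        terms.append(t)
    return terms

# Toy, made-up parameters: N = 2 documents, K = 2 groups, V = 3 terms.
P_D = [0.5, 0.5]
P_Z_GIVEN_D = [[0.9, 0.1], [0.2, 0.8]]
P_T_GIVEN_Z = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

rng = random.Random(0)
# Choose a document d with probability P(d).
d = rng.choices(range(len(P_D)), weights=P_D, k=1)[0]
sampled_terms = generate_document_terms(P_Z_GIVEN_D[d], P_T_GIVEN_Z, n_terms=5, rng=rng)
print(d, sampled_terms)
```

Terms and groups are represented by their integer indices, so `sampled_terms` is a list of indices into $\mathcal{V}$.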

2.1 Suppose that each term $t$ belongs to one and only one group $z \in \mathcal{Z}$. Show that the joint probability model over $\mathcal{D} \times \mathcal{V}$ is defined by the mixture

$$P(d, t) = P(d) \sum_{z \in \mathcal{Z}} P(z \mid d)\, P(t \mid z).$$

The set of observed examples, $C = \{(d, t) \mid d \in \mathcal{D}, t \in \mathcal{V}\}$, hence consists of all co-occurrence pairs $(d, t)$.

2.2 Show that the likelihood of the data is given by

$$L = \prod_{i=1}^{V} \prod_{j=1}^{N} \left( \sum_{k=1}^{K} P(d_j)\, P(z_k \mid d_j)\, P(t_i \mid z_k) \right)^{\mathrm{tf}_{t_i, d_j}}, \tag{1}$$

where $\mathrm{tf}_{t_i, d_j}$ is the frequency of term $t_i$ in document $d_j$.
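To make the indices of Eq. (1) concrete, the following sketch evaluates $\ln L$ numerically for given probability tables. It does not replace the derivation asked for above; the function name `plsa_log_likelihood`, the table layout, and the toy values are all assumptions made for illustration.

```python
import math

def plsa_log_likelihood(tf, p_d, p_z_given_d, p_t_given_z):
    """Return ln L for Eq. (1).

    tf[i][j]: frequency of term t_i in document d_j.
    p_d[j]: P(d_j); p_z_given_d[j][k]: P(z_k | d_j); p_t_given_z[k][i]: P(t_i | z_k).
    """
    n_groups = len(p_t_given_z)
    log_l = 0.0
    for i, row in enumerate(tf):
        for j, count in enumerate(row):
            if count == 0:
                continue  # a factor raised to the power 0 contributes ln 1 = 0
            # Inner sum over the K latent groups.
            mixture = sum(
                p_d[j] * p_z_given_d[j][k] * p_t_given_z[k][i]
                for k in range(n_groups)
            )
            log_l += count * math.log(mixture)
    return log_l

# Toy, made-up data: V = 3 terms (rows), N = 2 documents (columns), K = 2 groups.
TF = [[2, 0], [1, 1], [0, 3]]
P_D = [0.5, 0.5]
P_Z_GIVEN_D = [[0.9, 0.1], [0.2, 0.8]]
P_T_GIVEN_Z = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

log_likelihood = plsa_log_likelihood(TF, P_D, P_Z_GIVEN_D, P_T_GIVEN_Z)
print(log_likelihood)
```

Working with $\ln L$ rather than $L$ turns the products into sums and the exponents $\mathrm{tf}_{t_i, d_j}$ into multiplicative weights, which also avoids numerical underflow.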

¹ T. Hofmann. Probabilistic Latent Semantic Indexing. International Conference on Research and Development in Information Retrieval (SIGIR), 1999, pages 50-57.


2.3 Specify the set of parameters $\Theta$ of the model.

2.4 A direct maximization of the log-likelihood corresponding to Eq. (1) with respect to the model parameters is intractable. In this case, the EM algorithm is commonly used to reach a local maximum of the log-likelihood. Explain how the EM algorithm works.

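2.5 For $X$, the set of all pairs $(d_j, t_i) \in \mathcal{D} \times \mathcal{V}$, show that the conditional expectation of the complete log-likelihood is

$$\mathbb{E}_{\mathcal{Z} \mid X}\left[\ln P(X, \mathcal{Z} \mid \Theta)\right] = \mathbb{E}_{\mathcal{Z} \mid X}\left[\ln \prod_{d_j \in \mathcal{D}} \prod_{t_i \in \mathcal{V}} P(d_j, t_i, \mathcal{Z})\right] = \sum_{t_i \in \mathcal{V}} \sum_{d_j \in \mathcal{D}} \mathrm{tf}_{t_i, d_j} \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j, t_i) \ln\big(P(d_j)\, P(t_i \mid z_k)\, P(z_k \mid d_j)\big).$$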

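2.6 At iteration $\ell$, the current parameters of the model are used to estimate the probabilities

$$\forall z,\ \forall (d, t), \quad P^{(\ell)}(z \mid d, t) = \frac{P^{(\ell)}(z \mid d)\, P^{(\ell)}(t \mid z)}{\sum_{z'} P^{(\ell)}(z' \mid d)\, P^{(\ell)}(t \mid z')}.$$

With these estimates and the equality constraints

$$\forall d_j \in \mathcal{D}, \ \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j) = 1, \qquad \forall z_k \in \mathcal{Z}, \ \sum_{t_i \in \mathcal{V}} P(t_i \mid z_k) = 1,$$

give the analytical forms of the parameters maximizing the conditional expectation of the complete log-likelihood, that is:

$$\Theta^{(\ell+1)} = \operatorname*{argmax}_{\Theta}\ \mathbb{E}_{\mathcal{Z} \mid X}\left[\ln P(X, \mathcal{Z} \mid \Theta) \mid \Theta^{(\ell)}\right].$$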
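The estimation of the probabilities $P^{(\ell)}(z \mid d, t)$ from the current parameters can be sketched as follows. This covers only that estimation step, not the maximization asked for in the question; the function name `e_step_posteriors`, the table layout, and the toy values are assumptions, and the sketch assumes every denominator is strictly positive.

```python
def e_step_posteriors(p_z_given_d, p_t_given_z):
    """Return post[j][i][k] = P(z_k | d_j, t_i) from the current parameters.

    p_z_given_d[j][k]: P(z_k | d_j); p_t_given_z[k][i]: P(t_i | z_k).
    Assumes the normalizing sum is non-zero for every pair (d_j, t_i).
    """
    n_docs = len(p_z_given_d)
    n_groups = len(p_t_given_z)
    n_terms = len(p_t_given_z[0])
    post = []
    for j in range(n_docs):
        per_doc = []
        for i in range(n_terms):
            # Numerator P(z_k | d_j) P(t_i | z_k) for each group k.
            joint = [p_z_given_d[j][k] * p_t_given_z[k][i] for k in range(n_groups)]
            norm = sum(joint)  # denominator: sum over all groups z'
            per_doc.append([value / norm for value in joint])
        post.append(per_doc)
    return post

# Toy, made-up current parameters: N = 2 documents, K = 2 groups, V = 3 terms.
P_Z_GIVEN_D = [[0.9, 0.1], [0.2, 0.8]]
P_T_GIVEN_Z = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

posteriors = e_step_posteriors(P_Z_GIVEN_D, P_T_GIVEN_Z)
print(posteriors[0][0])
```

For each pair $(d_j, t_i)$ the returned values sum to 1 over the groups. Note that $P(d_j)$ does not appear, since it cancels between the numerator and the denominator.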
