Machine Learning Fundamentals Final Exam

2016-2017

Massih-Reza Amini

Duration: 2 hours. Only documents authorised: slides of the course.

Part I - 7 points

1.1 Explain in words what the consistency of the Empirical Risk Minimization principle is.

1.2 What is the distance from $x = (1, 1, 1)$ to the hyperplane of equation $\frac{1}{2}(1 + x_1 + x_2 + x_3) = 0$, in dimension 3?
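As a numerical check for question 1.2, the following minimal sketch applies the usual point-to-hyperplane formula $|\langle w, x \rangle + b| / \|w\|$. The function name `distance_to_hyperplane` and the way the hyperplane is split into `w` and `b` are illustrative choices, not notation from the course.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane {u : <w, u> + b = 0}."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# (1/2)(1 + x1 + x2 + x3) = 0 rewritten as <w, x> + b = 0
# with w = (1/2, 1/2, 1/2) and b = 1/2 (illustrative names).
w = (0.5, 0.5, 0.5)
b = 0.5
d = distance_to_hyperplane(w, b, (1.0, 1.0, 1.0))
print(d)
```

Multiplying the equation of the hyperplane by a non-zero constant does not change the distance, since `w` and `b` are rescaled together.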

1.3 Suppose that the empirical error of a prediction function $f$ on a test set $T$ of size 5000 is $\hat{L}(f, T) = 0.17$. What is the upper bound on the generalization error of $f$ that holds with probability 0.99?
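One possible way to evaluate such a bound numerically is sketched here, assuming the one-sided Hoeffding test-set bound $L(f) \le \hat{L}(f, T) + \sqrt{\ln(1/\delta) / (2m)}$. This choice is an assumption: the course slides may state a different bound, and the function name `hoeffding_upper_bound` is hypothetical.

```python
import math

def hoeffding_upper_bound(empirical_error, m, delta):
    """Upper bound on the true error holding with probability >= 1 - delta,
    under the one-sided Hoeffding inequality (assumed here, not taken from
    the course slides)."""
    return empirical_error + math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Test set of size m = 5000, empirical error 0.17, confidence 0.99 (delta = 0.01).
bound = hoeffding_upper_bound(0.17, 5000, 0.01)
print(round(bound, 4))
```

The deviation term shrinks as $1/\sqrt{m}$, so quadrupling the size of the test set halves it.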

1.4 We apply the AdaBoost algorithm to a training set of size 10: $S = \{(x_i, y_i);\ i \in \{1, \dots, 10\}\} \in (\mathcal{X} \times \{-1, +1\})^{10}$.

At the first iteration, there is a uniform distribution over the examples in $S$: $\forall i,\ D_1(i) = \frac{1}{10}$. We suppose that the first classifier $h_1 : \mathcal{X} \to \{-1, +1\}$ misclassifies 3 training examples. Estimate the error $\epsilon_1 = \sum_{i : h_1(x_i) \neq y_i} D_1(i)$ and deduce the weight $\alpha_1$ associated with $h_1$ by the algorithm. What are the new weights $D_2$ found by the algorithm over the training examples?
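The first round described in question 1.4 can be sketched as follows. It assumes the common convention $\alpha = \frac{1}{2} \ln\frac{1 - \epsilon}{\epsilon}$ with the update $D_2(i) \propto D_1(i) \exp(-\alpha\, y_i h_1(x_i))$; the course may use a different but equivalent scaling of $\alpha$. The labels and the positions of the three misclassified examples are hypothetical, since the exam does not specify them.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: weighted error, classifier weight, new distribution."""
    eps = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1.0 - eps) / eps)  # assumed convention for alpha
    unnorm = [w * math.exp(-alpha * y * h)
              for w, y, h in zip(weights, y_true, y_pred)]
    z = sum(unnorm)  # normalization constant
    return eps, alpha, [u / z for u in unnorm]

m = 10
d1 = [1.0 / m] * m
y_true = [+1] * m             # hypothetical labels
y_pred = [-1] * 3 + [+1] * 7  # h1 is wrong on the first three examples
eps1, alpha1, d2 = adaboost_round(d1, y_true, y_pred)
print(eps1, alpha1, d2[0], d2[-1])
```

After normalization, the misclassified examples carry half of the total mass of $D_2$, which is a general property of this update.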


Part II - 13 points

The Probabilistic Latent Semantic Indexing (PLSI) model proposed by (Hofmann, 1999)¹ is a statistical model for co-occurrence data which associates an unobserved group variable $z \in \mathcal{Z} = \{z_1, \dots, z_K\}$ with each occurrence of a term $t \in \mathcal{V}$ in a document $d \in \mathcal{D}$.

Algorithm 1 provides the generation process of the terms of a document.

Algorithm 1 PLSA Model

Input:
  - Collection of documents $\mathcal{D} = \{d_1, \dots, d_N\}$
  - A set of terms $\mathcal{V} = \{t_1, \dots, t_V\}$
  - Number of latent groups $K$

Choose a document $d \in \mathcal{D}$ with probability $P(d)$
for $t \in d$ do
    Choose a theme $z \in \mathcal{Z}$ with respect to $P(z \mid d)$
    Generate $t$ with respect to $P(t \mid z)$
end for
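The generation process of Algorithm 1 can be simulated with a short sketch. The toy distributions, the document and theme names, and the `n_terms` argument (standing in for the length of the document) are all hypothetical values chosen for illustration.

```python
import random

def draw(distribution, rng):
    """Draw one key of a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    return rng.choices(outcomes, weights=[distribution[o] for o in outcomes])[0]

def generate_document(p_d, p_z_given_d, p_t_given_z, n_terms, rng):
    """Follow Algorithm 1: pick d, then for each term pick z | d and t | z."""
    d = draw(p_d, rng)
    pairs = []
    for _ in range(n_terms):
        z = draw(p_z_given_d[d], rng)   # theme chosen with respect to P(z | d)
        t = draw(p_t_given_z[z], rng)   # term generated with respect to P(t | z)
        pairs.append((z, t))
    return d, pairs

# Hypothetical toy parameters: 2 documents, 2 themes, 3 terms.
p_d = {"d1": 0.5, "d2": 0.5}
p_z_given_d = {"d1": {"z1": 0.9, "z2": 0.1}, "d2": {"z1": 0.2, "z2": 0.8}}
p_t_given_z = {"z1": {"ball": 0.7, "game": 0.3, "vote": 0.0},
               "z2": {"ball": 0.1, "game": 0.1, "vote": 0.8}}

rng = random.Random(0)
doc, pairs = generate_document(p_d, p_z_given_d, p_t_given_z, 5, rng)
print(doc, pairs)
```

Only the pairs $(d, t)$ are observed in practice; the theme $z$ attached to each occurrence stays hidden, which is why it is treated as a latent variable in the questions that follow.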

2.1 Suppose that each term $t$ belongs to one and only one group $z \in \mathcal{Z}$. Show that the joint probability model over $\mathcal{D} \times \mathcal{V}$ is defined by the mixture

$$P(d, t) = P(d) \sum_{z \in \mathcal{Z}} P(z \mid d)\, P(t \mid z).$$

The set of observed examples, $C = \{(d, t) \mid d \in \mathcal{D}, t \in \mathcal{V}\}$, hence consists of all co-occurrence pairs $(d, t)$.

2.2 Show that the likelihood of the data is given by

$$L = \prod_{i=1}^{V} \prod_{j=1}^{N} \left( \sum_{k=1}^{K} P(d_j)\, P(z_k \mid d_j)\, P(t_i \mid z_k) \right)^{tf_{t_i, d_j}}, \tag{1}$$

where $tf_{t_i, d_j}$ is the frequency of term $t_i$ in document $d_j$.

¹ T. Hofmann. Probabilistic Latent Semantic Indexing (1999). International Conference on Research and Development in Information Retrieval (SIGIR), pages 50-57.


2.3 Specify the set of parameters $\Theta$ of the model.

2.4 A direct maximization of the log-likelihood corresponding to Eq. (1) with respect to the model parameters is intractable. In this case, the EM algorithm is commonly used to reach a local maximum of the log-likelihood. Explain how the EM algorithm works.

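2.5 For $X$, the set of all pairs $(d_j, t_i) \in \mathcal{D} \times \mathcal{V}$, show that the conditional expectation of the complete log-likelihood is

$$\mathbb{E}_{Z \mid X}\left[\ln P(X, Z \mid \Theta)\right] = \mathbb{E}_{Z \mid X}\left[\ln \prod_{d_j \in \mathcal{D}} \prod_{t_i \in \mathcal{V}} P(d_j, t_i, Z)\right] = \sum_{t_i \in \mathcal{V}} \sum_{d_j \in \mathcal{D}} tf_{t_i, d_j} \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j, t_i) \ln\left(P(d_j)\, P(t_i \mid z_k)\, P(z_k \mid d_j)\right).$$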

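2.6 At iteration $\ell$, the current parameters of the model are used to estimate the probabilities

$$\forall z, \forall (d, t), \quad P^{(\ell)}(z \mid d, t) = \frac{P^{(\ell)}(z \mid d)\, P^{(\ell)}(t \mid z)}{\sum_{z'} P^{(\ell)}(z' \mid d)\, P^{(\ell)}(t \mid z')}.$$

Using these estimates and the equality constraints

$$\forall d_j \in \mathcal{D}, \sum_{z_k \in \mathcal{Z}} P(z_k \mid d_j) = 1, \qquad \forall z_k \in \mathcal{Z}, \sum_{t_i \in \mathcal{V}} P(t_i \mid z_k) = 1,$$

give the analytical forms of the parameters maximizing the conditional expectation of the complete log-likelihood, that is:

$$\Theta^{(\ell+1)} = \operatorname*{argmax}_{\Theta}\ \mathbb{E}_{Z \mid X}\left[\ln P(X, Z \mid \Theta) \mid \Theta^{(\ell)}\right].$$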
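The EM iteration discussed in questions 2.4 to 2.6 can be sketched as follows. The M-step uses the update formulas usually given for PLSA in the literature, $P(t_i \mid z_k) \propto \sum_j tf_{t_i,d_j} P(z_k \mid d_j, t_i)$ and $P(z_k \mid d_j) \propto \sum_i tf_{t_i,d_j} P(z_k \mid d_j, t_i)$, with $P(d_j)$ proportional to the length of the document. These are offered as an assumption rather than as the official solution of the exam, and the toy count matrix `tf` and all function names are hypothetical.

```python
import math
import random

def normalize(row):
    """Scale non-negative numbers to sum to 1 (all zeros stay zeros)."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else [0.0] * len(row)

def e_step(p_z_d, p_t_z):
    """post[j][i][k] = P(z_k | d_j, t_i) under the current parameters."""
    return [[normalize([p_z_d[j][k] * p_t_z[k][i] for k in range(len(p_t_z))])
             for i in range(len(p_t_z[0]))]
            for j in range(len(p_z_d))]

def m_step(tf, post, n_groups):
    """Re-estimate P(z | d) and P(t | z) from tf-weighted posteriors."""
    n_docs, n_terms = len(tf), len(tf[0])
    p_t_z = [normalize([sum(tf[j][i] * post[j][i][k] for j in range(n_docs))
                        for i in range(n_terms)]) for k in range(n_groups)]
    p_z_d = [normalize([sum(tf[j][i] * post[j][i][k] for i in range(n_terms))
                        for k in range(n_groups)]) for j in range(n_docs)]
    return p_z_d, p_t_z

def log_likelihood(tf, p_d, p_z_d, p_t_z):
    """Logarithm of Eq. (1); pairs with zero frequency contribute nothing."""
    ll = 0.0
    for j, row in enumerate(tf):
        for i, count in enumerate(row):
            if count > 0:
                mix = sum(p_z_d[j][k] * p_t_z[k][i] for k in range(len(p_t_z)))
                ll += count * math.log(p_d[j] * mix)
    return ll

# Hypothetical toy data: tf[j][i] is the frequency of term t_i in document d_j.
tf = [[4, 2, 0, 0], [3, 3, 1, 0], [0, 1, 4, 3], [0, 0, 2, 5]]
K = 2
rng = random.Random(0)
total = sum(map(sum, tf))
p_d = [sum(row) / total for row in tf]
p_z_d = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in tf]
p_t_z = [normalize([rng.random() + 0.1 for _ in tf[0]]) for _ in range(K)]

history = [log_likelihood(tf, p_d, p_z_d, p_t_z)]
for _ in range(20):
    post = e_step(p_z_d, p_t_z)
    p_z_d, p_t_z = m_step(tf, post, K)
    history.append(log_likelihood(tf, p_d, p_z_d, p_t_z))
print(history[0], history[-1])
```

The recorded log-likelihood values should never decrease from one iteration to the next, which is the monotonicity property of EM; the point reached depends on the random initialization, since only a local maximum is guaranteed.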
