Inference algorithms for the regression approach to sequence prediction


Master's thesis, Amélie Rolland, Master's in Computer Science, Master of Science (M.Sc.), Québec, Canada © Amélie Rolland, 2016


Supervised by:

Mario Marchand, research supervisor

François Laviolette, research co-supervisor


Summary

Sequence prediction has several applications in natural language processing, bioinformatics, and computer vision. However, the computational complexity required to find the optimal sequence among an exponential number of possibilities limits the use of such algorithms. In this thesis, we propose an approach that solves this search efficiently for two different types of problems. More precisely, we address the pre-image problem in structured prediction, which requires finding the sequence associated with an arbitrary input, and the problem of finding the sequence that maximizes the prediction function of several kernel-based classifiers and regressors. We show that these two problems reduce to the same combinatorial problem, which is valid for several sequence kernels. For this problem, we propose an upper bound on the prediction function that can be used in a branch and bound search algorithm to obtain optimal solutions.

On word recognition and phoneme prediction tasks, the proposed approach obtains results that are competitive with state-of-the-art structured prediction algorithms. Moreover, the exact solution of the pre-image problem significantly improves prediction performance compared with an approximation found by the best-known heuristic. For tasks that consist of finding the sequence maximizing the prediction function of classifiers and regressors, we show that existing methods can be biased toward predicting long sequences containing repeated symbols. We point out that this bias is removed when the kernel is normalized. Finally, we present drug design results on the discovery of lead compounds. The source code can be downloaded from https://github.com/a-ro/preimage.


Abstract

Sequence prediction algorithms have many applications in natural language processing, bioinformatics, and computer vision. However, the computational complexity required to find the optimal sequence among an exponential number of possibilities limits the use of such algorithms. In this thesis, we propose an approach to solve this search efficiently for two types of sequence prediction problems. More precisely, we address the pre-image problem encountered in structured output prediction, which consists of finding the sequence associated with an arbitrary input, and the problem of finding a sequence maximizing the prediction function of various kernel-based classifiers and regressors. We demonstrate that these problems reduce to a common combinatorial problem valid for many sequence kernels. For this problem, we propose an upper bound on the prediction function which has low computational complexity and which can be used in a branch and bound search algorithm to obtain optimal solutions. On the practical tasks of optical word recognition and grapheme-to-phoneme prediction, the proposed approach is shown to be competitive with state-of-the-art structured prediction algorithms. Moreover, the exact solution of the pre-image problem is shown to significantly improve the prediction accuracy in comparison with an approximation found by the best known heuristic. On the task of finding a sequence maximizing the prediction function of kernel-based classifiers and regressors, we highlight that existing methods can be biased toward long sequences that contain many repeated symbols. We demonstrate that this bias is removed when using normalized kernels. Finally, we present results for the discovery of lead compounds in drug discovery. The source code can be found at https://github.com/a-ro/preimage.


List of Tables

2.1 Example of kernel similarity

3.1 Special cases of the GS kernel

3.2 Summary of sequence kernels

3.3 Example of Hamming pre-image

3.4 Matrix representation of the 2-grams in the sequences

3.5 CSR matrix representation of the 2-grams in the sequences

3.6 Computational complexity of the contribution matrix computation

3.7 Pre-image algorithms used for the different sequence kernels

3.8 Computational complexity of solving the pre-image and sequence maximization problems for the different sequence kernels

4.1 Comparison with the state of the art on the word recognition and grapheme-to-phoneme prediction tasks

4.2 Letter risk for different sequence kernels and pre-image algorithms on the word recognition and grapheme-to-phoneme prediction tasks

4.3 Zero-one risk for different sequence kernels and pre-image algorithms on the word recognition and grapheme-to-phoneme prediction tasks

4.4 Levenshtein risk for different sequence kernels and pre-image algorithms on the word recognition and grapheme-to-phoneme prediction tasks

4.5 10-fold cross-validation $R^2$ of the predictors on the BPPs and CAMPs datasets

4.6 Predicted peptides with highest bioactivity on BPPs and CAMPs datasets


List of Figures

1.1 Example of classification and regression problems

1.2 Different regression models learned on the same dataset

1.3 Training set partitioning for holdout estimation and k-fold cross-validation

1.4 Linear models with no classification errors

1.5 SVM classifier and margin

1.6 Input space and feature space

2.1 Example of structured prediction problems

2.2 Viterbi graph example

2.3 Example of pre-image problem

2.4 Example of search states and actions for the SEARN algorithm

2.5 Example of vectors and normalized vectors

3.1 Example of sequence comparison

3.2 Weighted degree pre-image problem

3.3 Longest path graph example

3.4 Examples of graph partitions for the longest path algorithm

3.5 Example of the Eulerian path algorithm when the vector admits a pre-image

3.6 Example of the Eulerian path algorithm when the vector does not admit a pre-image

3.7 Example of branch and bound search tree

3.8 Tables used for the computation of the g bound

4.1 Average prediction time (ms) for the normalized and un-normalized algorithms

4.2 Percentage of risk given branch and bound time and number of iterations for the normalized and un-normalized algorithms

4.3 Cumulative moving average norm of the 1,000 peptides with highest predicted bioactivities


For my mother, who taught me to never give up. For my father, who gave me his passion for research.


Acknowledgements

I thank my research supervisor, Mario Marchand, for giving me the opportunity to pursue a master's degree at GRAAL. Our discussions, his new ideas, and his many pieces of advice contributed greatly to my training. Working with Mario shaped both the way I define a problem and the way I write up a solution. I am extremely grateful to have had the chance to work with someone so rigorous and meticulous in his work.

I also thank my research co-supervisor, François Laviolette, who was just as present during my master's. His ability to understand a problem quickly, propose effective solutions, and explain complex concepts in simple terms got me unstuck many times. I greatly appreciated having François's opinion on the various problems I faced during my master's.

I would also like to thank Mario and François for allowing me to work with several companies during my master's. Similarly, I would like to thank the people at Thales, FORAC, Desjardins, the FSAA, and the CHUL, who gave me the chance to work on practical machine learning problems. Thanks also to Michael and Frédérik for their contribution to the FORAC project.

I thank Brahim Chaib-draa and Luc Lamontagne for their comments on my thesis, Claude-Guy Quimper for our discussions on the pre-image, and Jacques Corbeil for his suggestions during our meetings.

I thank the members of GRAAL for our many discussions and for making the laboratory such a pleasant place to work. More specifically, I would first like to thank Alexandre Drouin for introducing me to the research group; it is thanks to him that I did a master's. I will never forget our week together at ICML. I would like to thank Sébastien Giguère, with whom I worked throughout my master's. I was truly fortunate to have such a competent and motivating mentor. I sincerely hope that we will have the chance to work together again one day.

I thank Pascal Germain for answering all my questions and for his excellent comments, but also for introducing me to coffee and to Logicomix. Thanks to Jean-Francis Roy, who was my source of Python news throughout my master's. I really enjoyed coding with someone so passionate who moves a project forward quickly. I hope our paths will cross again as well. Finally, thanks to Alexandre Lacoste for his help during my first term at the laboratory, and for constantly making me laugh.

I would also like to thank the more recent members of GRAAL: Francis, Hana, Prudencio, and Mazid. I really enjoyed the time we spent together, at the laboratory as well as at the GRAAL dinners and at conferences. I am confident that you will bring a great deal to GRAAL, and I wish you every success in the rest of your graduate studies.

Finally, I would like to thank my parents for their encouragement and support throughout my master's.


Introduction

Machine learning is a subfield of artificial intelligence that uses experience to develop a predictive model. Whether it is to predict the future stock price of a company, the antibiotic resistance in human bacteria, or the relevance of documents, machine learning algorithms have been applied to a wide variety of problems and are now part of our everyday lives. Although these algorithms have obtained impressive results in several areas, each problem has its own set of characteristics that can make it more difficult to solve.

The nature of the prediction is a characteristic that can increase the complexity of a machine learning problem. Some problems require predicting a real value (e.g. a price), while other problems involve the prediction of a category (e.g. a letter from an alphabet). Structured prediction problems are a type of machine learning problem that involves the prediction of a complex object. For instance, this object could be a sequence like a protein or a sentence. Hence, this kind of prediction is often more complex as the algorithm must correctly predict each subpart of the object while considering the relation between these subparts.

Machine translation is an example of a structured prediction problem involving the prediction of a sequence. The translation of a word is often ambiguous and depends on the context in which it is used. For that reason, translating each word individually often gives a poor translation. Instead of predicting a word, the structured prediction algorithm aims at learning a scoring function that gives a high score when the translated sentence is accurate, and a low score otherwise. The translation of a sentence can then be predicted by finding the sequence of words with the highest score in that function.

There exist many other problems that involve the prediction of a sequence. One could try to predict the protein that can best fight some disease, the sequence of letters in an image, or the phonetic pronunciation of a word. Even if these problems all seem different, they share a common characteristic. They all require searching a large space of possibilities in order to find and predict the sequence with the highest score. Often, this space is too large to allow an exhaustive search. For instance, the space of all proteins of length 15 amino acids contains $20^{15} \approx 3.28 \times 10^{19}$ proteins. By processing 1000 proteins per second, it would take more than one billion years to try them all. Hence, developing approaches to make this search phase tractable is an active area of research in structured prediction.
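The arithmetic behind this estimate can be checked with a few lines of Python. This is only an illustrative sketch: the constant names are ours, and a year is assumed to last 365.25 days.

```python
# Size of the search space for proteins of a fixed length.
ALPHABET_SIZE = 20        # number of amino acids
LENGTH = 15               # sequence length used in the text
RATE_PER_SECOND = 1000    # candidates processed per second (from the text)
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: 365.25 days per year

n_sequences = ALPHABET_SIZE ** LENGTH
years = n_sequences / RATE_PER_SECOND / SECONDS_PER_YEAR

print(f"{n_sequences:.3e} sequences")
print(f"{years:.3e} years of exhaustive search")
```

The number of candidates grows exponentially with the sequence length, which is why the length appears as an exponent in the code.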


This search process is often referred to as the inference problem, and the complexity of solving this problem mainly depends on how the scoring function is modeled and the output to predict. Two types of strategy are generally used for this problem. First, some approaches restrain the complexity of the model to ensure that the inference can be solved efficiently. Other approaches do not enforce this constraint, but often have to use approximate algorithms to solve the inference problem. Both types of approaches have their limitations. In the former case, the scoring function might wrongly predict that a particular output has the highest score, while in the latter case, the approximation of the inference could lead to suboptimal predictions.

The goal of this thesis is to propose a general solution to solve the inference problem efficiently for sequence prediction problems. We propose a general framework where the complexity of the model can be adapted for the problem to solve. As such, we propose two different strategies to solve the inference problem depending on the parameters of the model. We present a polynomial time algorithm that can solve the inference problem in some specific cases, and propose a different search algorithm for more complex cases.

Moreover, we want these algorithms to be applicable to a wide variety of sequence prediction problems. Hence, we consider two common types of sequence prediction problems. The first one occurs when we want to find the output sequence associated with an arbitrary input, such as predicting the translation of a sentence or the phonetic pronunciation of a word. The second one happens when we want to learn a function that predicts the real value associated with an input sequence, and then find the sequence with the highest predicted value. In drug design for instance, we can use this approach to predict the binding affinity of an input protein to some target protein, and then find the input protein with the highest score.

While there exist different algorithms to learn and model the scoring function, we address the case of the regression approach to structured prediction. In that approach, the problem of finding the sequence that maximizes the scoring function can be formulated similarly for these two types of sequence prediction problems. Hence, in this thesis, we propose algorithms to solve the inference problem of the regression approach to structured prediction, and we apply these algorithms to both types of sequence prediction problems.

This thesis is divided as follows:

Chapter 1 introduces the background notions required for the rest of this thesis. This chapter explains the concept of supervised learning in the context of classification and regression tasks, which require the prediction of an output less complex than for structured prediction problems. This chapter also presents different machine learning algorithms for these tasks.

Chapter 2 formally defines the notions of structured prediction and sequence prediction. Moreover, this chapter introduces different structured prediction algorithms from the literature, and presents the regression approach to structured prediction for both types of sequence prediction problems. Finally, this chapter proposes a unified formulation of the prediction function of these two types of sequence prediction problems for the regression approach.

Chapter 3 first introduces different similarity functions that can be used by the regression approach to compare the sequences, and then shows how the complexity of solving the inference problem is influenced by the choice of this function. This chapter then presents a polynomial time algorithm to solve the inference problem for some specific choices of similarity function. For other cases, an upper bound on the prediction function is presented. This upper bound can be used with a branch and bound search algorithm to guide the search.

Chapter 4 presents and discusses experimental results obtained on both types of sequence prediction problems. The problem of predicting the sequence associated with an arbitrary input is first considered. Then, results on optical word recognition and grapheme-to-phoneme prediction problems are presented. Finally, the second type of sequence prediction problem is considered, and results on the discovery of highly active peptides in drug design are presented.


Chapter 1

Background Notions

This chapter introduces the concept of supervised learning, in which the learning algorithm learns from previous data to develop an accurate predictive model. Given a new input, this model predicts a category in the case of a classification problem, a real-valued scalar for a regression problem, and a complex structure like a sequence, a tree, or a graph, for a structured prediction problem. In this chapter, we consider supervised learning in the context of classification and regression, and present learning algorithms for these tasks. The next chapter presents how these algorithms can be adapted for structured prediction.

1.1 Supervised Learning

Supervised learning is a type of machine learning where the goal is to learn a function $h : X \to Y$ that accurately predicts the output $y \in Y$ of an input $x \in X$. To learn this function, the supervised learning algorithm has access to a dataset $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ of $m$ training examples $(x_i, y_i) \in X \times Y$. These examples are assumed to be drawn independently and identically distributed (i.i.d.) from an unknown distribution $D$. In other words, we assume that the occurrence of one example does not affect the probability of encountering any other example, and that all examples come from the same distribution. The input space $X$ and the output space $Y$ can be arbitrary, but many machine learning algorithms require that $x$, also referred to as the features of the example, be a real-valued vector of $n$ dimensions belonging to $\mathbb{R}^n$. Moreover, the supervised learning problem is called classification when $y$ is a category (or class), and regression when $y \in \mathbb{R}$ is a real-valued scalar.

Figure 1.1a shows an example of a classification problem, where the goal is to predict the character y ∈ {a, ..., z} associated with the image x. More precisely, this example corresponds to a multiclass classification problem with 26 classes, while a binary classification problem has only two classes, Y = {+1, −1}. In practice, optical character recognition (OCR) software is widely used to automate the processing of documents like mail items, bank checks, and business forms [Cheriet et al., 2007, Jayadevan et al., 2012]. For instance, the sorting facilities of Canada Post use optical character recognition software to read the address and sort the items automatically [Allum et al., 1995]. The automation of this step saves considerable time, as Canada Post delivered 9 billion items in 2014 alone [Corporation, 2014].

Figure 1.1: Example of classification and regression problems. (a) Optical character recognition (classification): an image of a character is mapped to the letter a. (b) Peptide bioactivity prediction (regression): the peptide IEWAK is mapped to the value 1.563.

Figure 1.1b shows an example of a regression problem, which consists of predicting the bioactivity of a small protein called a peptide. The bioactivity is a desirable activity of the peptide. For instance, h could predict the binding affinity of a peptide x to a target protein involved in a certain disease. In that case, y represents the strength of the bond between the peptide and the target protein. Alternatively, h could predict the antimicrobial activity of a peptide, like its ability to kill bacteria. These predictions are often used in drug discovery to filter and reduce the number of candidate peptides that must be synthesized and tested in the laboratory [Sliwoski et al., 2014]. Even if these predictions are only a small part of the whole drug discovery process, they can lead to substantial cost benefits as the cost of discovering a new drug can surpass $400 million [Basak, 2012].

1.1.1 Expected Risk

The learning algorithm aims at finding the model (function) h ∈ H achieving the fewest prediction errors. This corresponds to the model minimizing the expected risk:

$$R(h) \overset{\text{def}}{=} \mathop{\mathbb{E}}_{(x,y) \sim D} \, l(h(x), y), \tag{1.1}$$

where the function $l : Y \times Y \to \mathbb{R}$ defines the loss incurred when the prediction $h(x)$ differs from the true output $y$. Observe that it is not possible to evaluate the expected risk of a model since the distribution $D$ generating the examples is unknown. However, the learning algorithm has access to the dataset $S$ of examples drawn i.i.d. from $D$. Therefore, the empirical risk $R_S(h)$ on $S$, also referred to as the training risk of the model, measures the average loss on $S$ and provides an estimate of the expected risk:

$$R_S(h) \overset{\text{def}}{=} \frac{1}{m} \sum_{i=1}^{m} l(h(x_i), y_i). \tag{1.2}$$
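As a minimal sketch of Equation (1.2), the code below computes the empirical risk of a model on a small dataset. The zero-one loss, the threshold model, and the four training examples are hypothetical choices made only for this illustration.

```python
def zero_one_loss(prediction, target):
    """Loss of 1 for a wrong prediction and 0 for a correct one."""
    return 0 if prediction == target else 1


def empirical_risk(h, loss, examples):
    """Average loss of the model h over a list of (x, y) examples."""
    return sum(loss(h(x), y) for x, y in examples) / len(examples)


def threshold_model(x):
    """Hypothetical model: predict +1 when x is non-negative, -1 otherwise."""
    return 1 if x >= 0 else -1


# Toy training set S of m = 4 examples (made up for this illustration).
S = [(-2.0, -1), (-0.5, -1), (0.3, 1), (-0.1, 1)]
risk = empirical_risk(threshold_model, zero_one_loss, S)
print(risk)
```

Passing the loss as an argument mirrors the definition: the same `empirical_risk` function works for any loss $l$, including the squared loss used for regression.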


Overfitting

Most learning algorithms do not aim to find the model that minimizes the empirical risk as it can be prone to overfitting. A model overfits when it approximates the training data in such a complex way that it fails to capture the underlying process of the data. In other words, the overfitted model is unable to generalize, and thereby does not perform well on unseen examples. For instance, Figure 1.2 shows three different regression models learned on the same dataset. Observe that the model in Figure 1.2c makes no errors on the training data since the learned function passes through all training points. However, that model seems to capture the noise of the data, while the model in Figure 1.2b is much simpler and would probably obtain better performance on new examples. Finally, Figure 1.2a shows an example of underfitting, where the model is too simple and thus cannot capture the relation between the inputs and the outputs.

Figure 1.2: Different regression models learned on the same dataset. (a) Underfitting. (b) Correct. (c) Overfitting.

Regularization

Regularization is a commonly used approach to avoid overfitting. In that approach, the learning algorithm finds the model that minimizes a regularized empirical risk [Alpaydin, 2014]:

$$R'_S(h, \lambda) = R_S(h) + \lambda \cdot \text{model complexity}. \tag{1.3}$$

Hence, the regularization adds a term that penalizes a model for its complexity, and $\lambda$ controls the importance of that penalty. The model that minimizes the regularized empirical risk is then the one that compromises between having a low complexity and achieving the fewest errors on the training data.
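As one concrete instance of Equation (1.3), the sketch below assumes a one-parameter linear model h(x) = w x, the squared loss, and w² as the measure of model complexity. These choices, the function names, and the toy data are ours and serve only as an illustration. Under these assumptions, setting the derivative of the regularized risk to zero gives a closed-form minimizer.

```python
def regularized_risk(w, examples, lam):
    """Mean squared loss of h(x) = w * x plus the penalty lam * w**2."""
    m = len(examples)
    empirical = sum((w * x - y) ** 2 for x, y in examples) / m
    return empirical + lam * w ** 2


def best_weight(examples, lam):
    """Minimizer of regularized_risk, obtained by zeroing its derivative in w."""
    m = len(examples)
    sum_xy = sum(x * y for x, y in examples)
    sum_xx = sum(x * x for x, y in examples)
    return sum_xy / (sum_xx + lam * m)


# Toy dataset (made up): y is roughly 2 * x.
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
for lam in [0.0, 0.1, 1.0, 10.0]:
    w = best_weight(S, lam)
    print(f"lambda={lam:<5} w={w:.3f} risk={regularized_risk(w, S, lam):.3f}")
```

In this sketch, a larger penalty weight pulls the learned parameter toward zero, trading training accuracy for a simpler model.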

After the learning process, the performance of the selected model is evaluated on an unseen dataset T of examples also drawn i.i.d. from D. This step is essential as it gives an estimate of how well the model will perform in practice.


1.1.2 Cross-Validation

Many machine learning algorithms have a set of hyper-parameters whose values are not directly learned during the training phase. For instance, machine learning algorithms minimizing a regularized empirical risk can have a hyper-parameter, similar to λ in Equation (1.3), that controls the complexity of the model. The values of these hyper-parameters need to be specified and can greatly impact the accuracy of the model.

Dataset Partitioning

Cross-validation is a process to estimate the test error of a model using the training set S only. Different hyper-parameter values can be tried during this process, and the ones achieving the lowest cross-validation risk are selected. After the cross-validation step, the machine learning algorithm learns the model on the original training set S with the selected hyper-parameter values, and evaluates the results on the unseen dataset T.

There exist different approaches for partitioning the training set during the cross-validation process. Figure 1.3 shows two of the most common approaches. The holdout estimation consists of splitting S into two new datasets. The former is used for training while the latter is used for testing. The k-fold cross-validation is often preferred when the number of training examples is small. In this case, the training dataset is split into k folds. For each set of hyper-parameter values, a model is learned on k − 1 folds and the remaining fold is used for testing. This step is repeated k times, each time keeping a different fold for testing, and the result of a set of hyper-parameters is averaged over the k folds.

Figure 1.3: Training set partitioning for holdout estimation and k-fold cross-validation. (a) Holdout estimation. (b) 5-fold cross-validation, shown as steps 1 to 5.
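The k-fold partitioning described above can be sketched as follows. The function name `k_fold_splits` is ours, and the folds are contiguous blocks of indices for readability (in practice the examples are usually shuffled first).

```python
def k_fold_splits(m, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    The m example indices are cut into k contiguous folds whose sizes
    differ by at most one.
    """
    fold_sizes = [m // k + (1 if i < m % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(m) if i < start or i >= start + size]
        yield train, test
        start += size


splits = list(k_fold_splits(m=10, k=5))
for step, (train, test) in enumerate(splits, start=1):
    print(f"step {step}: test fold = {test}")
```

Each example is used for testing exactly once and for training k − 1 times, which is what makes the averaged risk a reasonable estimate when data is scarce.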

Hyper-parameter Selection

Choosing which hyper-parameter values should be tried during the cross-validation process has been an active area of research during the past few years. A standard approach called grid search consists of specifying different values for each hyper-parameter (e.g. parameter_one = [0.01, 0.1, 1, 10], parameter_two = [2, 4, 6]), and trying every possible combination. More recent approaches [Bergstra et al., 2011, Snoek et al., 2012] based on Bayesian optimization view the hyper-parameter selection problem as the optimization of a noisy black-box function, which takes as input hyper-parameter values and returns the cross-validation risk. These approaches use machine learning algorithms to estimate this function and predict which hyper-parameter values should be tried next. It was shown empirically that these methods can find better hyper-parameters in less time than more traditional methods like grid search.
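Grid search itself is simple to sketch with the example values given above. The function `cross_validation_risk` below is a hypothetical stand-in that returns a made-up score; a real implementation would train and evaluate a model on each fold.

```python
from itertools import product

# Candidate values taken from the example in the text.
grid = {
    "parameter_one": [0.01, 0.1, 1, 10],
    "parameter_two": [2, 4, 6],
}


def cross_validation_risk(parameter_one, parameter_two):
    """Hypothetical stand-in for the cross-validation risk of a learner."""
    return (parameter_one - 1) ** 2 + (parameter_two - 4) ** 2


# Enumerate every combination of values and keep the one with the lowest risk.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]
best = min(combinations, key=lambda params: cross_validation_risk(**params))

print(len(combinations), "combinations tried")
print("selected:", best)
```

The number of combinations is the product of the number of values per hyper-parameter, so the cost of grid search grows quickly as hyper-parameters are added.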

1.1.3 Metrics

Classification

There exist various metrics to evaluate the performance of a model. The choice of the metric mainly depends on the problem. A common metric for classification problems is the zero-one risk, which is the fraction of misclassified examples:

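$$R_{01}(h) = \frac{1}{m} \sum_{i=1}^{m} I(h(x_i) \neq y_i), \tag{1.4}$$

where $I$ is the indicator function

$$I(a) = \begin{cases} 1 & \text{if } a \text{ is true} \\ 0 & \text{if } a \text{ is false.} \end{cases}$$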

The zero-one risk is the average of a per-example loss function, namely $l(h(x), y) = I(h(x) \neq y)$. This is not the case for some metrics such as the $F_1$ score used for binary classification:

$$F_1(h) = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \tag{1.5}$$

where precision is the number of correct positive predictions divided by the total number of positive predictions, and recall is the number of correct positive predictions divided by the total number of examples that should have been predicted as positive. When there are no correct positive predictions, we have that precision + recall = 0. Hence, the $F_1$ score is undefined in that case.

Regression

The zero-one risk and $F_1$ score are well-suited for classification since all classes are at the same distance from one another. For regression however, the metric must take into account the distance between the real value $y$ and the prediction $h(x)$. The mean squared error (MSE), which averages the squared distance between these values, is often used for regression:

$$\text{MSE}(h) = \frac{1}{m} \sum_{i=1}^{m} (h(x_i) - y_i)^2. \tag{1.6}$$

Another metric for regression is the coefficient of determination, also known as $R^2$, which is defined as follows:

$$R^2(h) = 1 - \frac{\sum_{i=1}^{m} (h(x_i) - y_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}, \tag{1.7}$$

where $\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$. In that case, the model is accurate when its $R^2$ value is close to 1, and inaccurate when it is close to zero or negative. However, the meaning of the $R^2$ is unclear when the function we are trying to learn is (or is close to) a constant. Indeed, the ratio in Equation (1.7) takes the indeterminate form $\frac{0}{0}$ when every output equals the same constant and $h(x) = \text{const} = \bar{y}$.

Figure 1.4: Linear models with no classification errors. Panels (a), (b), and (c).
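The metrics of Equations (1.4) to (1.7) can be sketched in a few lines of Python. The function names and the toy predictions are ours. Returning `None` when the $F_1$ score is undefined is one possible convention, not something prescribed by the text, and `r2_score` assumes that the targets are not all equal.

```python
def zero_one_risk(predictions, targets):
    """Fraction of misclassified examples (Equation 1.4)."""
    errors = sum(1 for p, y in zip(predictions, targets) if p != y)
    return errors / len(targets)


def f1_score(predictions, targets, positive=1):
    """F1 score (Equation 1.5), or None when it is undefined."""
    true_pos = sum(1 for p, y in zip(predictions, targets)
                   if p == positive and y == positive)
    if true_pos == 0:
        return None  # precision + recall = 0, so the ratio is undefined
    precision = true_pos / sum(1 for p in predictions if p == positive)
    recall = true_pos / sum(1 for y in targets if y == positive)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(predictions, targets):
    """Mean squared error (Equation 1.6)."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def r2_score(predictions, targets):
    """Coefficient of determination (Equation 1.7); assumes non-constant targets."""
    mean = sum(targets) / len(targets)
    residual = sum((p - y) ** 2 for p, y in zip(predictions, targets))
    total = sum((y - mean) ** 2 for y in targets)
    return 1 - residual / total


# Toy predictions (made up for this illustration).
labels, predicted_labels = [1, 1, -1, -1], [1, -1, -1, 1]
values, predicted_values = [1.0, 2.0, 3.0], [1.0, 2.0, 4.0]
print(zero_one_risk(predicted_labels, labels), f1_score(predicted_labels, labels))
print(mean_squared_error(predicted_values, values), r2_score(predicted_values, values))
```

Note that a model that always predicts the mean of the targets obtains an $R^2$ of zero, which is why values close to zero indicate an inaccurate model.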

1.2 Classification and Regression Algorithms

1.2.1 Support Vector Machine

Hard Margin SVM

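Support Vector Machine (SVM) [Cortes and Vapnik, 1995] is a binary classification algorithm, with $Y = \{-1, +1\}$, that learns a linear model of the form:

$$h(x) = \operatorname{sign}(w \cdot x + b), \tag{1.8}$$

where $w \in \mathbb{R}^n$, $b \in \mathbb{R}$, and

$$\operatorname{sign}(a) = \begin{cases} +1 & \text{if } a \geq 0 \\ -1 & \text{if } a < 0. \end{cases}$$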

Therefore, SVM learns a hyperplane w · x + b = 0 that separates the examples into two classes. In the case of Figure 1.4, different hyperplanes can perfectly separate the training examples. However, we would expect the model of Figure 1.4c to obtain better generalization performance as it maximizes the distance of the closest examples of each class to the hyperplane. This distance is called the geometric margin, and the SVM algorithm finds the hyperplane that maximizes it.

More formally, $w$ is a vector perpendicular to the hyperplane $w \cdot x + b = 0$. For simplicity, let us consider that the hyperplane passes through the origin (i.e. $b = 0$). In that case, the distance of an example to the hyperplane corresponds to the length of the projection $p$ of $x$ onto $w$:

$$w \cdot x = p \|w\| = \|w\| \cdot \|x\| \cos\theta, \tag{1.9}$$

where $\|w\|$ denotes the Euclidean norm $\|w\| = \sqrt{w_1^2 + \ldots + w_n^2}$, and $\theta$ is the angle between $w$ and $x$. Notice that $p$ is positive when $\theta < 90^\circ$, and negative when $\theta > 90^\circ$. Thereby, the geometric margin $\gamma$ of the closest example of each class is defined as:

$$\gamma \overset{\text{def}}{=} \max_{w,b} \min_{i \in S} \frac{y_i [w \cdot x_i + b]}{\|w\|}, \tag{1.10}$$

and $\gamma$ is positive when all the examples can be correctly classified.

Figure 1.5: SVM classifier and margin. The figure shows the hyperplanes $w \cdot x + b = 1$, $w \cdot x + b = 0$, and $w \cdot x + b = -1$.

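When the data is linearly separable, it is possible to scale $w$ and $b$ such that:

$$\min_{i \in S} \frac{y_i [w \cdot x_i + b]}{\|w\|} = \frac{1}{\|w\|}. \tag{1.11}$$

Since maximizing $\gamma$ corresponds to minimizing $\|w\|$, the optimization problem of the SVM algorithm can be defined as:

$$\begin{aligned} \text{Minimize} \quad & \frac{1}{2} \|w\|^2 \\ \text{Subject to} \quad & y_i [w \cdot x_i + b] \geq 1 \quad \forall i \in S. \end{aligned} \tag{1.12}$$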

Equation (1.12) corresponds to a convex quadratic programming problem, and can therefore be solved by a quadratic programming solver.
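To make the margin and the constraints of Equation (1.12) concrete, the sketch below evaluates them for a fixed hyperplane on a small, made-up, linearly separable dataset. It does not solve the quadratic program: the hyperplane (w, b) is chosen by hand, so the computed margin is the one of this particular hyperplane and not the maximum of Equation (1.10). The function names are ours.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def functional_margins(w, b, examples):
    """Values y_i [w . x_i + b] for every example."""
    return [y * (dot(w, x) + b) for x, y in examples]


def geometric_margin(w, b, examples):
    """Smallest signed distance of an example to the hyperplane w . x + b = 0."""
    norm = math.sqrt(dot(w, w))
    return min(functional_margins(w, b, examples)) / norm


# Made-up separable data in R^2 with labels in {-1, +1}.
S = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]
w, b = (1.0, 1.0), -2.0  # hand-picked separating hyperplane

# Rescale (w, b) so that the closest example satisfies y_i [w . x_i + b] = 1.
scale = min(functional_margins(w, b, S))
w_scaled, b_scaled = tuple(a / scale for a in w), b / scale

print(geometric_margin(w, b, S), 1 / math.sqrt(dot(w_scaled, w_scaled)))
print(all(m >= 1 for m in functional_margins(w_scaled, b_scaled, S)))
```

Rescaling does not move the hyperplane, so the geometric margin is unchanged. After rescaling it equals the inverse of the norm of the weight vector, as in Equation (1.11), and every constraint of Equation (1.12) holds.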

Soft Margin SVM

When the training data is not linearly separable, some of the constraints $y_i [w \cdot x_i + b] \geq 1$ will be violated. In that case, we can solve the relaxed version of Equation (1.12):

$$\begin{aligned} \text{Minimize} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\ \text{Subject to} \quad & y_i [w \cdot x_i + b] \geq 1 - \xi_i \\ & \xi_i \geq 0 \quad \forall i \in S, \end{aligned} \tag{1.13}$$

where $\boldsymbol{\xi}$ are the slack variables that allow the examples to be incorrectly classified, and $C$ is a hyper-parameter that controls the tolerance of errors. Thus, this version allows a training example to be misclassified, but penalizes the objective function accordingly. Finally, when $C \to \infty$, Equation (1.13) does not tolerate any error and we obtain the same optimization problem as Equation (1.12).
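For a fixed hyperplane, the constraints of Equation (1.13) require each slack variable to be at least 0 and at least $1 - y_i [w \cdot x_i + b]$, so the smallest feasible value is the maximum of the two. The sketch below uses this observation to evaluate the soft margin objective. The data, the hyperplane, and the function names are made up for illustration, and no optimization over w and b is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, examples):
    """Smallest xi_i >= 0 such that y_i [w . x_i + b] >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in examples]


def soft_margin_objective(w, b, examples, C):
    """Value of (1/2) ||w||^2 + C * sum_i xi_i for the slacks above."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, examples))


# Made-up data: the last example lies on the wrong side of the hyperplane.
S = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), 1), ((1.0, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

print(slacks(w, b, S))
for C in [0.1, 1.0, 10.0]:
    print(C, soft_margin_objective(w, b, S, C))
```

An example inside the margin but correctly classified has a slack between 0 and 1, while a misclassified example has a slack greater than 1. Increasing C makes the same violations more expensive.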

Dual Objective

An optimization problem has an associated problem called the dual problem, which is simply a different way to see the original (primal) problem. The dual problem of the SVM algorithm is obtained by first introducing a positive Lagrange multiplier for each constraint to form the Lagrangian: arg min w,ξξξ,bmaxα,β ( 1 2||w|| 2_{+ C} m X i=1 ξi− m X i=1 αi[yi(w · xi+ b) − 1 + ξi] − m X i=1 βiξi ) , (1.14)

where α and β are the Lagrange multipliers. The parameters w, ξξξ, and b are called the primal variables of the objective, while α and β are the dual variables. Notice that when the constraints are violated, the value of Equation (1.14) is ∞. Using the fact that the derivatives with respect to the primal variables must be equal to zero to minimize the Lagrangian, we obtain the following Lagrangian dual problem [Burges,1998]:

Maximize m X i=1 αi− 1 2 m X i=1 m X j=1 αiαjyiyjxi· xj Subject to 0 ≤ αi ≤ C m X i=1 αiyi = 0 ∀i ∈ S , (1.15)

and the prediction function becomes:

$$h(\mathbf{x}) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i \, \mathbf{x}_i \cdot \mathbf{x} \right). \tag{1.16}$$

The problem is now formulated in terms of the dual variables only. It is possible to prove that under the Karush–Kuhn–Tucker (KKT) conditions [Karush, 1939, Kuhn, 2014], the primal and dual solutions of the problem are equal. Moreover, we will see that the dot product of the inputs can be replaced by a kernel function to obtain a nonlinear model in the dual form of the problem.
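The step from Equation (1.14) to Equation (1.15) can be made explicit. As a sketch, let $L$ denote the expression inside the braces of Equation (1.14) (the symbol $L$ is introduced here only for illustration). Setting the derivatives of $L$ with respect to the primal variables to zero gives:

$$\begin{aligned}
\frac{\partial L}{\partial \mathbf{w}} = 0 \;&\Rightarrow\; \mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i, \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0, \\
\frac{\partial L}{\partial \xi_i} = 0 \;&\Rightarrow\; C - \alpha_i - \beta_i = 0 .
\end{aligned}$$

Since $\beta_i \geq 0$, the last condition yields $0 \leq \alpha_i \leq C$. Substituting the first condition into $L$ cancels the terms in $b$ and $\xi_i$ and leaves $\sum_i \alpha_i - \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$, which is the objective of Equation (1.15). The same expression for $\mathbf{w}$ gives the dot-product form of the prediction function in Equation (1.16).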

Feature Space

The SVM algorithm presented in the previous sections is a linear model. However, there might be cases where the data is not linearly separable, and a nonlinear model would better fit the data. In that case, we could use a feature map $\boldsymbol{\phi}_{\mathcal{X}} : \mathcal{X} \to \mathcal{H}$ to map the inputs $\mathbf{x}$ into a high-dimensional vector space $\mathcal{H}$ called the feature space. The SVM algorithm could then find a linear model in a more complex space where the data is linearly separable.

Figure 1.6a shows an example where $\mathcal{X} = \mathbb{R}^2$, and the training data is not linearly separable. However, a linear model can perfectly classify the examples by mapping the inputs into the space $\boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x}) = [x_1, x_2, x_1^2 + x_2^2]$ as shown in Figure 1.6b. Indeed, this linear model in the feature space corresponds to a nonlinear model in the original space $\mathcal{X}$. Although the space induced by $\boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x})$ has a low dimensionality for this example, we often need a more complex feature space in practice to achieve good generalization.

(a) Examples in input space X . (b) Linear model in feature space H.

Figure 1.6: Input space and feature space.
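The effect of this feature map can be checked numerically. The following is a minimal sketch on hypothetical toy data (the points, the weight vector `w` and the bias `b` are chosen by hand for illustration and are not taken from Figure 1.6): points inside the unit circle are positive, points outside are negative, and a plane in the feature space separates them.

```python
import numpy as np

def feature_map(x):
    """Map a 2-D input [x1, x2] to [x1, x2, x1^2 + x2^2]."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2 + x2 ** 2])

# Hypothetical toy data: the positive class lies inside the unit circle,
# the negative class lies outside of it.
inner = np.array([[0.1, 0.2], [-0.3, 0.1], [0.0, -0.4]])
outer = np.array([[1.5, 0.0], [0.0, -2.0], [-1.2, 1.3]])

# Hand-picked linear model in the feature space: w . phi(x) + b > 0
# exactly when x1^2 + x2^2 < 1, i.e. inside the circle.
w = np.array([0.0, 0.0, -1.0])
b = 1.0

def predict(x):
    """Linear classifier in feature space, nonlinear in input space."""
    return 1 if w @ feature_map(x) + b > 0 else -1

inner_labels = [predict(x) for x in inner]
outer_labels = [predict(x) for x in outer]
```

No line in the original two-dimensional space separates these two groups, whereas a single weight on the third feature is enough in the feature space.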

Kernels

Observe that when we replace the inputs $\mathbf{x}$ by $\boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x})$ in Equation (1.15), the dual version of the SVM algorithm only requires the dot product of the feature vectors in $\mathcal{H}$. Thus, the feature map $\boldsymbol{\phi}_{\mathcal{X}}$ does not need to be computed explicitly as long as we have a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x}) \cdot \boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x}')$. This method of using a kernel function to compute the dot product in the feature space is known as the kernel trick, and can make linear models become nonlinear. A kernel function can also be understood as a similarity measure between two inputs. That way, $K(\mathbf{x}, \mathbf{x}')$ returns a high value when $\mathbf{x}$ and $\mathbf{x}'$ are similar, and a low value otherwise.

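Since a kernel function does not need to define the feature map $\boldsymbol{\phi}_{\mathcal{X}}$ explicitly, Mercer's condition [Mercer, 1909] can be used to verify that there exists a feature map such that $K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x}) \cdot \boldsymbol{\phi}_{\mathcal{X}}(\mathbf{x}')$. Under this condition, the kernel function needs to be symmetric (i.e. $K(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}', \mathbf{x}) \; \forall \mathbf{x}, \mathbf{x}' \in \mathcal{X}$), and

$$\int_{\mathcal{X}} \int_{\mathcal{X}} K(\mathbf{x}, \mathbf{x}') f(\mathbf{x}) f(\mathbf{x}') \, d\mathbf{x} \, d\mathbf{x}' \geq 0 \tag{1.17}$$

for every square-integrable function $f : \mathcal{X} \to \mathbb{R}$. Many kernels have been proposed in the literature and were proven to respect Mercer's condition. For instance, the polynomial kernel is widely used and is defined as: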

$$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + c)^d, \tag{1.18}$$

where $c$ and $d$ are hyper-parameters that can be found by cross-validation. We will see in Chapter 3 that kernels can also be defined over sequences.
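To illustrate the kernel trick on the polynomial kernel, the sketch below compares the kernel value with an explicit dot product in feature space for the special case $d = 2$ and two-dimensional inputs. The explicit map `explicit_map_degree2` is written out here only for illustration, and it is one of several equivalent feature maps for this kernel.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel (x . z + c)^d."""
    return (np.dot(x, z) + c) ** d

def explicit_map_degree2(x, c=1.0):
    """Explicit feature map whose dot product equals (x . z + c)^2 in 2-D."""
    x1, x2 = x
    return np.array([
        x1 ** 2,
        x2 ** 2,
        np.sqrt(2.0) * x1 * x2,
        np.sqrt(2.0 * c) * x1,
        np.sqrt(2.0 * c) * x2,
        c,
    ])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel trick: one dot product in input space ...
kernel_value = polynomial_kernel(x, z)
# ... versus an explicit dot product in the 6-dimensional feature space.
explicit_value = explicit_map_degree2(x) @ explicit_map_degree2(z)
```

Both computations return the same value, but the kernel never builds the six-dimensional vectors. For larger $d$ or higher input dimension the explicit map grows quickly while the cost of the kernel stays that of one dot product.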

1.2.2 Logistic Regression

Logistic regression is a binary classification algorithm (i.e. $\mathcal{Y} = \{-1, +1\}$) that learns a conditional probability distribution:

$$p(y \mid \mathbf{x}, \mathbf{w}, b) = \frac{1}{1 + \exp(-y(\mathbf{w} \cdot \mathbf{x} + b))}, \tag{1.19}$$

where $\mathbf{w} \in \mathbb{R}^n$ is the learned weight vector of the model. The right-hand side of Equation (1.19) is known as the logistic or sigmoid function, which maps a real value (e.g. $y(\mathbf{w} \cdot \mathbf{x} + b)$) to the interval $[0, 1]$. The example is classified $y = +1$ when $p(y = +1 \mid \mathbf{x}, \mathbf{w}, b) \geq 0.5$, and $y = -1$ otherwise. To find the weight vector $\mathbf{w}$, the algorithm aims to maximize the conditional probability of each training example in Equation (1.19). Since the examples are i.i.d., this corresponds to maximizing the conditional likelihood of the training data as a product:

$$\operatorname*{arg\,max}_{\mathbf{w}, b} \prod_{i=1}^{m} p(y_i \mid \mathbf{x}_i, \mathbf{w}, b). \tag{1.20}$$

In practice, however, it is often easier to maximize the conditional log-likelihood of the data (or to minimize its negative), since the product of many small likelihoods can lead to underflow in Equation (1.20). Moreover, to prevent overfitting, an $\ell_2$ penalty on the norm of $\mathbf{w}$ is added. Therefore, the algorithm minimizes the following regularized negative conditional log-likelihood:

$$\operatorname*{arg\,min}_{\mathbf{w}, b} \; \lambda\|\mathbf{w}\|^2 + \sum_{i=1}^{m} \log\left( \exp(-y_i(\mathbf{w} \cdot \mathbf{x}_i + b)) + 1 \right), \tag{1.21}$$

where $\lambda$ is a hyper-parameter that controls the importance of the regularization.

Finally, Equation (1.21) is convex and can be optimized using algorithms like gradient descent, stochastic gradient descent, or stochastic average gradient descent [Bottou, 2010, Schmidt et al., 2013]. These algorithms iteratively update $\mathbf{w}$ and $b$ using the gradient of Equation (1.21) with respect to these parameters.
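As an illustration, the objective of Equation (1.21) can be minimized with plain (batch) gradient descent. The sketch below is only one possible implementation under stated assumptions: the toy data, the step size, the number of iterations and the function names are all hypothetical choices made for this example.

```python
import numpy as np

def objective(w, b, X, y, lam):
    """Regularized negative conditional log-likelihood of Equation (1.21)."""
    margins = y * (X @ w + b)
    # log(1 + exp(-m)) computed in a numerically stable way.
    return lam * (w @ w) + np.sum(np.logaddexp(0.0, -margins))

def gradient(w, b, X, y, lam):
    """Gradient of the objective with respect to w and b."""
    margins = y * (X @ w + b)
    s = 1.0 / (1.0 + np.exp(margins))  # derivative factor of the log term
    grad_w = 2.0 * lam * w - X.T @ (y * s)
    grad_b = -np.sum(y * s)
    return grad_w, grad_b

def fit(X, y, lam=0.1, step=0.1, n_iter=500):
    """Batch gradient descent with a fixed step size."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        grad_w, grad_b = gradient(w, b, X, y, lam)
        w = w - step * grad_w
        b = b - step * grad_b
    return w, b

# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = fit(X, y)
predictions = np.sign(X @ w + b)
```

A fixed step size is the simplest choice. In practice a line search or one of the stochastic variants cited above would usually be preferred on large datasets.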

1.2.3 Regression

Ridge Regression

Ridge regression is a regression algorithm that learns a weight vector $\mathbf{w} \in \mathbb{R}^n$ to obtain the following model:

$$h(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}. \tag{1.22}$$

The weight vector $\mathbf{w}$ is learned by minimizing the squared distance between the predictions and the expected outputs. Similar to logistic regression and SVM, this algorithm imposes an $\ell_2$ penalty on the norm of $\mathbf{w}$ to avoid overfitting. Thus, ridge regression finds $\mathbf{w}$ by minimizing the following equation:

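$$F(\mathbf{w}) = \sum_{i=1}^{m} (\mathbf{w} \cdot \mathbf{x}_i - y_i)^2 + \lambda\|\mathbf{w}\|^2, \tag{1.23}$$

where the parameter $\lambda$ controls the importance of the regularization. Since Equation (1.23) is convex and admits a global minimum [Mohri et al., 2012], we can minimize it by taking its derivative and setting it to zero:

$$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} - \mathbf{X}^T\mathbf{y} = 0, \tag{1.24}$$

where

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_m \end{bmatrix} \quad \text{and} \quad \mathbf{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}.$$

We then obtain the following equality:

$$\mathbf{w} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}, \tag{1.25}$$

where $(\cdot)^{-1}$ denotes the matrix inverse. Therefore, the learning phase of this algorithm only requires a matrix inversion.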

Kernel Ridge Regression

Saunders et al. [1998] have proposed a dual version of the ridge regression algorithm called kernel ridge regression, which allows the use of kernel functions. In that case, w is given by:

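$$\mathbf{w} = \mathbf{X}^T(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}, \tag{1.26}$$

where $\mathbf{K} = \mathbf{X}\mathbf{X}^T$ is the matrix of dot products of the training inputs, also known as the Gram matrix. Given a kernel function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the Gram matrix is defined as:

$$\mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) \quad \forall i, j \in S. \tag{1.27}$$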

Finally, Equation (1.22) can be rewritten in the following way [Schölkopf et al.,2001]:

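$$h(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i K(\mathbf{x}_i, \mathbf{x}), \tag{1.28}$$

where $\boldsymbol{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}$. Computing $\boldsymbol{\alpha}$ requires the inversion of an $m \times m$ matrix, whereas Equation (1.25) inverts an $n \times n$ matrix. Kernel ridge regression is thus preferred when $n$ is very large compared to $m$. Moreover, this second version uses a kernel function to represent the dot product in the feature space, and can therefore learn a nonlinear model.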
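The equivalence between the primal solution of Equation (1.25) and the dual solution of Equations (1.26) and (1.28) can be verified numerically. The sketch below uses randomly generated data and a linear kernel. The sizes `m` and `n`, the value of `lam` and the variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                      # m training examples of dimension n
X = rng.normal(size=(m, n))      # one training input per row
y = rng.normal(size=m)
lam = 0.5

# Primal solution, Equation (1.25): solves an n x n linear system.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Dual solution, Equation (1.26): solves an m x m linear system.
K = X @ X.T                      # Gram matrix of the linear kernel
alpha = np.linalg.solve(K + lam * np.eye(m), y)
w_dual = X.T @ alpha

# Prediction on a new input with Equation (1.22) and Equation (1.28).
x_new = rng.normal(size=n)
prediction_primal = w_primal @ x_new
prediction_dual = alpha @ (X @ x_new)   # sum_i alpha_i K(x_i, x_new)
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice. Replacing `X @ X.T` and `X @ x_new` by kernel evaluations gives the nonlinear version.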


1.3 Chapter Summary

The supervised learning algorithms presented in this chapter are specifically for classification and regression problems. Although the function optimized during the learning process is different for each of these algorithms, they all use an $\ell_2$ penalty to regularize and penalize complex models. Moreover, they use different methods to optimize these convex functions. While the learning phase of ridge regression only requires a matrix inversion, the SVM algorithm needs to solve a quadratic programming problem, and logistic regression uses gradient descent methods.

The logistic regression algorithm has the advantage of predicting a conditional probability over the possible classes for a given input. On the other hand, SVM produces a positive or negative score for a given input, which can be understood as being more confident when the input is far from the margin (i.e. either a highly positive or highly negative score). Moreover, these algorithms can use kernel functions in their dual representation. They are thus able to learn nonlinear models by replacing the dot product of two vectors by a kernel function. In the next chapter, we show how these algorithms can be adapted for structured prediction tasks. In that case, the output to predict is more complex than a category or a real-valued scalar. As such, the training and prediction phases of these algorithms are often more computationally expensive. Hence, we also present different methods to make these steps faster, as well as other structured prediction algorithms that can bypass this problem.


Chapter 2

Structured Prediction

This chapter presents the structured prediction problem, where the learned model predicts an object more complex than a category or a real-valued scalar. The structured prediction algorithms can solve different problems in fields like natural language processing, computer vision, and bioinformatics. Although this thesis focuses on the prediction of sequences, we present several structured prediction algorithms that are not restricted to this kind of prediction. These algorithms can often be formulated as optimizing a joint scoring function that predicts the compatibility of an input with a structured output. One of the difficulties in structured prediction consists in finding the output that maximizes this scoring function. For several algorithms, this inference step has to be solved during the training phase and the prediction phase. As we shall see, an advantage of the regression approach to structured prediction is that this problem only needs to be solved during the prediction phase. Finally, we present the problem of finding the structure that maximizes the prediction function associated with a classifier or a regressor, which has many applications in drug design and can also be formulated as a structured prediction problem similar to the regression approach.

2.1 The Structured Prediction Problem

2.1.1 Structured Prediction and Sequence Prediction

The structured prediction problem consists of predicting a complex object called a structure. This structure is composed of smaller parts that are not independent, but rather interrelated with one another. There exist different types of objects that can be predicted in a structured prediction problem. However, in the context of this thesis we focus on the prediction of objects that can be represented as sequences.

Given an alphabet of symbols $\mathcal{A}$, a sequence $y \in \mathcal{A}^*$ can be defined as an ordered collection of symbols, where $\mathcal{A}^*$ is the set of all possible sequences of symbols from that alphabet. The alphabet $\mathcal{A}$ is arbitrary and can for instance be a set of letters, a set of phonemes (i.e. the units of sounds of a language), or a set of words from a dictionary. Therefore, sequence prediction algorithms can be applied to various structured prediction problems for which the output can be represented as a sequence of symbols.

2.1.2 Structured Prediction Examples

Figure 2.1 presents different structured prediction problems involving the prediction of sequences. The problem of Figure 2.1a is similar to the optical character recognition problem of Chapter 1, but considers instead that the characters in the image are interrelated and form a word. In this way, the algorithm can learn that the characters dec are more likely to appear together than dcc, which can increase the consistency of the predicted characters with one another, and thereby improve the performance of the model.

Figure 2.1b shows an example of a grapheme-to-phoneme problem, which is commonly used in text-to-speech systems to predict the phonetic pronunciation $y$ of a written word $x$. These systems allow a computer to speak by transforming a written text into phonemes, which are then transformed into waveforms. Although the pronunciation of a word can be stored in a look-up dictionary to achieve perfect accuracy, machine learning algorithms can significantly reduce the storage space required for the predictions and can handle arbitrary words [Bisani and Ney, 2008]. Text-to-speech systems are now widely used in different applications and can for instance allow blind users to interact with their mobile phone [Csapó et al., 2015]. Even Watson, the supercomputer of IBM that won the Jeopardy game, used a text-to-speech system to answer questions [Lewis, 2012, Pitrelli et al., 2006].

Figure 2.1c presents a French-to-English statistical machine translation problem, where the model aims at predicting the English translation $y$ of a French sentence $x$. Statistical machine translation is a hard problem as the model must use context to choose between ambiguous translations from an extremely large output space. For that reason, many statistical machine translation systems use phrase-based translations that first chunk a sentence into meaningful segments called phrases, and then create a look-up table of possible translations for each phrase with assigned probabilities [Koehn et al., 2007, Cho et al., 2014]. An algorithm is then used to select the corresponding phrase translations and the order in which they appear in the translated sentence, while ensuring that the translation is syntactically correct. Microsoft Skype Translator and Google Translate are both examples of statistical machine translation software. The former is integrated into a voice communication system and translates speech in real-time, while the latter translates written text and has hundreds of millions of users [Lewis, 2015, Estelle and Khare, 2013].

2.1.3 Comparison with Classification and Regression

Intuitively, the structured prediction problem could be seen as a classification task with an extremely large output space. However, an important aspect of structured prediction is that the loss of predicting $y'$ when the expected output is $y$ is not necessarily the same for every output $y' \in \mathcal{Y}$ such that $y' \neq y$. For instance, consider the word recognition problem of Figure 2.1a with $y$ = declaring, and the predicted sequence $h(x)$ = declering. The prediction is incorrect but is still closer to the expected output than another sequence such as dcclcrlng. Moreover, since the predicted output is a structure composed of smaller subparts, one could try to predict each subpart individually and merge the predictions. However, structured prediction considers that exploiting the relationship between these subparts allows the algorithm to improve its performance. Indeed, it is better to consider the translated sentence as a whole than to consider each translated word individually.

Figure 2.1: Example of structured prediction problems. (a) Optical word recognition: an image of the word "declaring". (b) Grapheme-to-phoneme: coffee → kcfi. (c) Machine translation: "Le goût délicieux du café" → "The sweet taste of coffee".

On the other hand, structured prediction algorithms share some similarities with regression algorithms as they must take into account the distance between the expected output and the prediction. Whereas this distance can easily be computed between two scalar values for regression tasks, the structured prediction algorithms must incorporate the notion of distance between two arbitrary objects. Therefore, standard classification and regression algorithms are usually extended to structured prediction by incorporating the notion of distance or similarity between two arbitrary objects, and by considering the relationship between the subparts of the structure.

2.1.4 Supervised Learning for Structured Prediction

As for classification and regression, the structured output learner has access to a dataset $S = \{(x_1, y_1), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ of input-output pairs. The input space $\mathcal{X}$ is arbitrary but in the case of sequence prediction problems, we assume that the output space $\mathcal{Y}$ is the set $\mathcal{A}^*$ of sequences from an alphabet $\mathcal{A}$. The prediction $h(x)$ for a given input $x$ is obtained by maximizing a scoring function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$:

$$h(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} f(x, y). \tag{2.1}$$

The structured prediction algorithms differ by how they define and learn the function $f$. In the context of this thesis, the function $f$ is considered as a linear (or log-linear) model $f(x, y) = \mathbf{w} \cdot \boldsymbol{\phi}(x, y)$, where $\boldsymbol{\phi}$ is a joint feature map defined over an input-output pair.


2.1.5 Joint Feature Space

Many structured prediction algorithms use a joint feature map $\boldsymbol{\phi} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ to define features depending on both $x$ and $y$. The joint feature map can be represented in the following way:

$$\boldsymbol{\phi}(x, y) = \begin{bmatrix} \phi_1(x, y) \\ \phi_2(x, y) \\ \vdots \\ \phi_d(x, y) \end{bmatrix},$$

where each $\phi_i : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a feature function taking an input-output pair and returning a real value [Smith, 2011]. When $f(x, y) = \mathbf{w} \cdot \boldsymbol{\phi}(x, y)$, we can rewrite Equation (2.1) as follows:

$$h(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \sum_{i=1}^{d} w_i \phi_i(x, y). \tag{2.2}$$

Different feature functions have been proposed in the literature for diverse structured prediction problems. For instance, Altun et al. [2003] proposed to define features that are similar to a hidden Markov model. A hidden Markov model is a graphical model that represents a probability distribution over observation variables and hidden variables. In this case, a sequence $x$ corresponds to a series of observations $x = (x_1, \ldots, x_\ell)$, where the observation $x_i$ depends on the state $y_i$, and where $y_i$ is independent of the other states given its predecessor $y_{i-1}$. The joint probability distribution of this model is defined as follows [Sutton and McCallum, 2011]:

$$p(y, x) = \prod_{i=1}^{\ell} p(y_i \mid y_{i-1}) \, p(x_i \mid y_i). \tag{2.3}$$

Note that a state $y_0$ is added at the beginning of each $y$ for the computation of $p(y_1 \mid y_0)$. In a similar way, Altun et al. [2003] defined compatibility features depending on $x_i$ and $y_i$, and transition features using $y_i$ and $y_{i-1}$. However, these features are more flexible as they are not restricted to probabilities. Moreover, a context window can be used around $x$ so that $y_i$ can depend on any part(s) of $x$.

The named entity recognition task is an example of a problem requiring a large number of features. This task consists of finding the entities (e.g. person, location, organization) in a text. This can be done by assigning a class to each word in a sentence, and using the class other when the word is not an entity. For this task, Altun et al. [2003] defined binary features such as "Is the previous word 'Mr.' and the current class 'person'?" to model the dependencies between $x$ and $y$, and "Is the previous class 'other' and the current class 'location'?" for the relations between the subparts of $y$. The number of features can thus be very large (e.g. $10^6$ [Smith, 2011]) depending on the problem. Moreover, in the past few years, there has been a growing interest in using neural network algorithms to learn these features automatically with unlabeled datasets [Socher et al., 2010, Collobert et al., 2011]. Finally, we need to be careful when defining features depending on the different subparts of $y$ as it can increase the complexity of solving Equations (2.1) and (2.2).

Figure 2.2: Viterbi graph example when $\mathcal{A} = \{A, B\}$ and $\ell = 3$, with a start node $s$ followed by the nodes A and B at each of the positions 1, 2, and 3. The path in red forms the sequence ABB, and $p(ABB, x) = p(A \mid s)\,p(x_1 \mid A)\,p(B \mid A)\,p(x_2 \mid B)\,p(B \mid B)\,p(x_3 \mid B)$.

2.1.6 Inference

Depending on the context, the maximization problem of Equation (2.1) is usually referred to as the inference, argmax, or pre-image problem. A common drawback in structured prediction is that this maximization problem is often intractable (i.e. too computationally expensive to solve) in practice [Gärtner and Vembu, 2009]. Indeed, it is often too costly to use a brute force approach and compute the value of $f$ for every possible output $y$ since $\mathcal{Y}$ is exponentially large. Hence, a lot of literature in structured prediction has been devoted to developing approaches to make this argmax problem tractable. One of these approaches consists of restricting $f$ to models where the inference can be solved efficiently. On the other hand, when there is no restriction on the complexity of $f$, we can approximate the inference by finding a good solution rapidly that is not guaranteed to be the optimal one. Finally, we can design algorithms that can solve the inference exactly, but with an exponential worst-case complexity. See Nowozin and Lampert [2011] for a good review on inference methods in structured prediction.
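To make the cost of the brute force approach concrete, the following sketch enumerates every sequence of a fixed length over an alphabet and keeps the one with the highest score. The scoring function `toy_score` is a hypothetical stand-in for $f(x, y)$ with a fixed input $x$, and restricting the search to a single length is an assumption made to keep the example small.

```python
from itertools import product

def brute_force_argmax(score, alphabet, length):
    """Evaluate score(y) for all |alphabet|^length sequences of a given length."""
    best_y, best_score = None, float("-inf")
    for symbols in product(alphabet, repeat=length):
        y = "".join(symbols)
        s = score(y)
        if s > best_score:
            best_y, best_score = y, s
    return best_y, best_score

def toy_score(y, target="dec"):
    """Hypothetical scoring function: number of positions matching a target."""
    return sum(a == b for a, b in zip(y, target))

alphabet = "cde"
best_y, best_score = brute_force_argmax(toy_score, alphabet, length=3)

# The number of candidates grows exponentially with the sequence length.
n_candidates = len(alphabet) ** 3
```

With 26 letters and sequences of length 10, the same loop would already have to score more than $10^{14}$ candidates, which is why the approaches described in the following sections are needed.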

Dynamic Programming

In some cases, $f$ decomposes over the subparts of the structure, and a dynamic programming algorithm can solve Equation (2.1) in polynomial time. These kinds of algorithms solve smaller sub-problems and use these partial results to build the final solution and avoid unnecessary computation. Most of the dynamic programming algorithms used to solve Equation (2.1) can be understood as variations of the Viterbi algorithm [Forney Jr, 1973]. Given a sequence of observations $x$, this algorithm finds the most likely $y$ in a hidden Markov model, which corresponds to maximizing Equation (2.3). When $\mathcal{Y} = \mathcal{A}^*$ for instance, this algorithm creates a graph as shown in Figure 2.2, where the nodes represent the possible symbols $y \in \mathcal{A}$ at each position $1, \ldots, \ell$. The initial value of a node at position $i$ corresponds to $p(x_i \mid y_i)$, and the value of the edge between the nodes $y_{i-1}$ and $y_i$ is $p(y_i \mid y_{i-1})$. Hence, a path in this graph produces a sequence $y$ with a joint probability $p(y, x)$. This probability is obtained by multiplying the node and edge values on that path.

The Viterbi algorithm can find the most likely path in this graph in $O(\ell|\mathcal{A}|^2)$ time when $y_i$ depends only on the previous label. When $y_i$ depends on $k$ previous labels instead of one (i.e. $p(y_i \mid y_{i-1}, \ldots, y_{i-k})$), the computational complexity of this algorithm is $O(\ell|\mathcal{A}|^{k+1})$. Hence, the complexity of dynamic programming algorithms used to solve Equation (2.1) generally increases exponentially with the order $k$ of dependency between the subparts of $y$. The order $k$ must often be restricted to a low value since many structured prediction algorithms also call the inference algorithm multiple times during the training phase. In Chapter 3, we show that the inference problem of the regression approach to structured prediction can be solved with a dynamic programming algorithm in some specific cases.
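A minimal sketch of the first-order Viterbi algorithm is given below. It works with log-probabilities to avoid underflow, and all the probability values are hypothetical numbers chosen so that the best path is ABB, as in the caption of Figure 2.2. The array layout (`log_start`, `log_trans`, `log_emit`) is an assumption of this example.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """
    Most likely state sequence of a first-order hidden Markov model.

    log_start[a]    : log p(y_1 = a | y_0)
    log_trans[a, b] : log p(y_i = b | y_{i-1} = a)
    log_emit[i, a]  : log p(x_i | y_i = a), with positions indexed from 0
    """
    length, n_states = log_emit.shape
    delta = np.empty((length, n_states))          # best log-score ending in each state
    back = np.zeros((length, n_states), dtype=int)  # back-pointers
    delta[0] = log_start + log_emit[0]
    for i in range(1, length):
        # candidates[a, b]: best path ending in a at i-1, followed by a -> b.
        candidates = delta[i - 1][:, None] + log_trans
        back[i] = np.argmax(candidates, axis=0)
        delta[i] = candidates.max(axis=0) + log_emit[i]
    # Follow the back-pointers from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for i in range(length - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    path.reverse()
    return path, float(np.max(delta[-1]))

# Hypothetical model with states A = 0 and B = 1, and three observations.
log_start = np.log(np.array([0.6, 0.4]))
log_trans = np.log(np.array([[0.3, 0.7], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))

path, log_probability = viterbi(log_start, log_trans, log_emit)
decoded = "".join("AB"[state] for state in path)
```

Each of the $\ell$ positions builds an $|\mathcal{A}| \times |\mathcal{A}|$ table of candidates, which gives the $O(\ell|\mathcal{A}|^2)$ running time stated above.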

Approximation

In more complex cases, we often have to approximate the pre-image problem by either finding an optimal solution $y \in \mathcal{Y}'$ in a smaller subset of solutions $\mathcal{Y}' \subset \mathcal{Y}$, or by finding a good but often suboptimal solution in $\mathcal{Y}$ using different search strategies. In Chapter 3, we present an approach proposed by Cortes et al. [2007] to solve the inference problem of the regression approach when $f$ does not consider the position of the subsequences. Their method reduces the space $\mathcal{Y}$ by finding a set of highly scored subsequences $a \in \mathcal{A}^n$ of length $n$ that must be ordered and concatenated to form $y$.

Structured prediction cascades [Weiss et al., 2012] is another approach that successively reduces the output space by keeping a set of subsequences scoring above a threshold $t$ at a specific position, and using the set of the previous position to reduce the number of subsequences that must be scored at the next position. For instance, given that AB and BB are the only subsequences that obtained a score greater than $t$ at position $i$, the position $i + 1$ only requires evaluating the score of subsequences starting with B. The final prediction is done by finding the output that has the maximum score in that filtered space. This approach can rapidly provide good solutions to the inference problem even when high-order dependencies are used. Thus, the approximation algorithms are often used when no efficient algorithm can solve Equation (2.1).

Exact Search

When the inference cannot be solved by dynamic programming, we can use a search algorithm that aims to find the exact solution to the inference problem. While having an exponential worst-case complexity, a well-designed search strategy can still solve the inference rapidly in some cases, and guarantee that the predicted output is optimal. For instance, in Chapter 3, we develop an upper bound on the prediction function of the regression approach to structured prediction. The proposed bound guides a branch and bound search by estimating the expected score of partial solutions (i.e. $y$ where $|y| < \ell$) in the search space. In Chapter 4, we empirically show that in most cases, this approach only requires a small amount of time to find the solution and prove its optimality. Finally, the search can also be limited to a specific amount of time, and return an approximate solution when the algorithm does not terminate before that time.
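The general mechanism of a branch and bound search can be sketched on a simple scoring function. The example below is not the bound developed in Chapter 3: it assumes a hypothetical score that sums position-wise terms `node[i, y_i]` and transition terms `pair[y_{i-1}, y_i]`, and it bounds the score of a partial solution by adding the best possible value of each remaining term. Partial solutions are explored in decreasing order of their upper bound.

```python
import heapq
import numpy as np

def branch_and_bound(node, pair):
    """
    Maximize f(y) = sum_i node[i, y_i] + sum_{i >= 1} pair[y_{i-1}, y_i]
    with a best-first branch and bound search over prefixes of y.
    """
    length, n_symbols = node.shape
    best_pair = pair.max()
    # remaining[k]: optimistic score of the positions k, ..., length - 1.
    remaining = np.zeros(length + 1)
    for k in range(length - 1, -1, -1):
        transition = best_pair if k > 0 else 0.0
        remaining[k] = remaining[k + 1] + node[k].max() + transition
    # heapq is a min-heap, so the upper bounds are stored negated.
    heap = [(-remaining[0], 0.0, ())]
    n_expanded = 0
    while heap:
        _, score, prefix = heapq.heappop(heap)
        k = len(prefix)
        if k == length:
            # For a complete sequence the bound is the exact score, and no
            # other entry in the heap has a higher bound: y is optimal.
            return list(prefix), float(score), n_expanded
        n_expanded += 1
        for a in range(n_symbols):
            new_score = score + node[k, a]
            if k > 0:
                new_score += pair[prefix[-1], a]
            bound = new_score + remaining[k + 1]
            heapq.heappush(heap, (-bound, new_score, prefix + (a,)))
    raise ValueError("empty search space")

# Hypothetical scores: log-probabilities of a small hidden Markov model.
node = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))
node[0] += np.log(np.array([0.6, 0.4]))
pair = np.log(np.array([[0.3, 0.7], [0.2, 0.8]]))

best_y, best_score, n_expanded = branch_and_bound(node, pair)
```

The tighter the bound, the fewer partial solutions are expanded before the optimum is returned. With a loose bound the search can degenerate into the exhaustive enumeration of the worst case.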

2.1.7 Metrics

There exist different metrics to evaluate the predictions of structured output algorithms. The zero-one risk can also be used for these kinds of algorithms. However, it gives no information on how far the prediction is from the expected output when an error is made. Since the structure is composed of subparts, the metrics for structured prediction often use a loss function that is averaged over these parts. For instance, the Hamming risk uses the Hamming loss that averages the misclassification error of each symbol. Let $h(x)$ be the predicted output sequence on input sequence $x$, and let $h_i(x)$ be the $i$th symbol of the predicted sequence. Then the Hamming risk is:

$$R_{ham}(h) = \frac{1}{m} \sum_{(x, y) \in S} \frac{1}{|y|} \sum_{i=1}^{|y|} I(h_i(x) \neq y_i). \tag{2.4}$$

The Hamming loss is defined over sequences of the same length. However, blank characters can be added at the end of a sequence so that both the prediction and the observed sequence have the same length.
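The Hamming risk with blank padding can be sketched as follows. The padding symbol `"_"` is an assumption of this example and must not belong to the alphabet, and after padding both sequences have the same length, which is used as the normalization.

```python
def hamming_loss(prediction, target, blank="_"):
    """Fraction of positions where the padded sequences differ."""
    length = max(len(prediction), len(target))
    if length == 0:
        return 0.0
    # Pad the shorter sequence with blanks so that both have the same length.
    padded_prediction = prediction.ljust(length, blank)
    padded_target = target.ljust(length, blank)
    errors = sum(p != t for p, t in zip(padded_prediction, padded_target))
    return errors / length

def hamming_risk(predictions, targets):
    """Average Hamming loss over a set of examples, as in Equation (2.4)."""
    losses = [hamming_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

close_loss = hamming_loss("declering", "declaring")
far_loss = hamming_loss("dcclcrlng", "declaring")
risk = hamming_risk(["declering", "dcclcrlng"], ["declaring", "declaring"])
```

On the word recognition example of Section 2.1.3, the prediction declering has one wrong symbol out of nine while dcclcrlng has three, so the Hamming loss ranks the first prediction as closer to declaring.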

The Levenshtein risk is often preferred for sequences of different lengths as it is less restrictive than the Hamming risk on the position of the symbols:

$$R_{lev}(h) = \frac{1}{m} \sum_{(x, y) \in S} \frac{\mathrm{Lev}(h(x), y)}{\max(|h(x)|, |y|)}, \tag{2.5}$$

where Lev is the Levenshtein distance that returns the minimum number of symbol modifications required to change the sequence $h(x)$ into $y$. In this case, a modification (or edit) is either the substitution, insertion, or deletion of a symbol [Levenshtein, 1966].
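The Levenshtein distance is usually computed by dynamic programming over the prefixes of the two sequences. The sketch below is a standard two-row implementation, and the function names are chosen for this example only.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j]: distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            substitution_cost = 0 if symbol_a == symbol_b else 1
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + substitution_cost,   # substitution or match
            ))
        previous = current
    return previous[-1]

def levenshtein_loss(prediction, target):
    """Normalized distance used inside the sum of Equation (2.5)."""
    longest = max(len(prediction), len(target))
    return levenshtein(prediction, target) / longest if longest else 0.0

distance = levenshtein("declering", "declaring")
loss = levenshtein_loss("declering", "declaring")
```

Unlike the Hamming loss, a single missing symbol at the beginning of the prediction counts as one edit here instead of shifting and penalizing every following position.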

While these metrics can be used for different structured prediction problems, some metrics are more task specific. For instance, the BLEU score [Papineni et al., 2002] is a method for the evaluation of statistical machine translation systems that was shown to correlate highly with human judgment for this task. This metric evaluates the closeness of the predicted translation with some reference translation(s) by combining a score on the precision of the predicted words, their order in the sentence, and the length of the predicted translation.


2.2 Structured Prediction Algorithms

2.2.1 Structural SVM

Multiclass SVM

Before presenting the structured output version of the SVM algorithm, let us consider the case of multiclass classification with SVM. While the sign of the linear model is used to predict whether an example is positive or negative in the binary classification setting, we can extend the SVM algorithm to multiclass classification by creating one model per class. For the binary classification case, this would correspond to having two models $\mathbf{w}_{+1}$ and $\mathbf{w}_{-1}$, and imposing that $\mathbf{w}_{-1} = -\mathbf{w}_{+1}$. Intuitively, the higher the score of a model for a given input, the greater the chance that this example belongs to the model's class. Therefore, the class of an example in the multiclass setting can be predicted by:

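$$h(\mathbf{x}) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \mathbf{w}_y \cdot \mathbf{x}, \tag{2.6}$$

where $\mathbf{w}_y$ is the linear model of the class $y$. The optimization problem also needs to ensure that the model predicting the correct class $y_i$ for an input $\mathbf{x}_i$ gives a higher score $\mathbf{w}_{y_i} \cdot \mathbf{x}_i$ than the model of any other class. Following the large margin intuition of the SVM algorithm, we would like this score to be greater by a certain margin. This can be done by adding the following constraints:

$$\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_y \cdot \mathbf{x}_i \geq 1 \quad \forall i \in S, \; \forall y \in \mathcal{Y} : y \neq y_i \tag{2.7}$$

for the hard-margin SVM, and

$$\mathbf{w}_{y_i} \cdot \mathbf{x}_i - \mathbf{w}_y \cdot \mathbf{x}_i \geq 1 - \xi_i \quad \forall i \in S, \; \forall y \in \mathcal{Y} : y \neq y_i, \; \xi_i \geq 0 \tag{2.8}$$

for the soft-margin SVM [Crammer and Singer, 2002].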

Slack Rescaling

Although it is not possible to have one model for each of the $|\mathcal{Y}|$ possible outputs in structured prediction, the constraints of Equation (2.7) and Equation (2.8) give us a better insight on how the SVM algorithm can be adapted for structured prediction. In a similar way, given a joint feature map $\boldsymbol{\phi} : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ that maps an input-output pair to a high-dimensional feature space, we would like the score $\mathbf{w} \cdot \boldsymbol{\phi}(x_i, y_i)$ to be higher by a certain margin than the score of any other $y$ such that $y \neq y_i$. Moreover, an incorrect prediction should be penalized more severely when it is far from the expected output (i.e. when $l(y_i, y)$ is high). Therefore, Tsochantaridis et al. [2005] proposed the structural SVM algorithm with slack rescaling:

$$\begin{aligned}
\text{Minimize} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{Subject to} \quad & \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y_i) - \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y) \geq 1 - \frac{\xi_i}{l(y_i, y)} \\
& \forall i \in S, \; \forall y \in \mathcal{Y} : y \neq y_i, \; \xi_i \geq 0,
\end{aligned} \tag{2.9}$$

which scales the slack variables $\xi_i$ by the inverse loss. In this way, a misclassification occurs when $\xi_i \geq l(y_i, y)$, and costs less in the objective when $l(y_i, y)$ is small than when it is large.

Margin Rescaling

Another approach extending the SVM algorithm to structured prediction was originally proposed by Taskar et al. [2004] for the Hamming loss, and is known as the maximum margin Markov networks (M$^3$-net). Tsochantaridis et al. [2005] generalized this approach to arbitrary loss functions and called it the structural SVM with margin rescaling:

$$\begin{aligned}
\text{Minimize} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{Subject to} \quad & \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y_i) - \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y) \geq l(y_i, y) - \xi_i \\
& \forall i \in S, \; \forall y \in \mathcal{Y} : y \neq y_i, \; \xi_i \geq 0,
\end{aligned} \tag{2.10}$$

which scales the margin by the loss instead of scaling the slack variables. This formulation has the disadvantage that some constraints have more weight than they should have. Indeed, an output $y$ far from being confused with $y_i$ could be penalized for not having a sufficient margin, whereas only incorrect predictions or $y$ close to the margin (i.e. close to being confused with $y_i$) are penalized in the slack rescaling version.

Training

The optimization problems of Equation (2.9) and Equation (2.10) often have an exponential number of constraints since they must be satisfied for all $y \in \mathcal{Y}$. Therefore, many approaches have been proposed to solve these problems since they cannot be solved directly by a quadratic programming solver as was the case in multiclass classification [Taskar et al., 2004, Tsochantaridis et al., 2005, Joachims et al., 2009, Lacoste-Julien et al., 2012]. For instance, Joachims et al. [2009] showed that the structural SVM algorithm can be trained with the cutting-plane algorithm. This algorithm starts with an unconstrained problem and solves the quadratic programming problem. It then loops over the training examples to find the most violated constraints, adds them to the working set of constraints, and solves the problem again. This step is repeated until no constraint is violated by more than a tolerance $\epsilon$. This algorithm only requires a polynomial number of iterations to terminate. However, to find the most violated constraint, the following loss-augmented prediction problem needs to be solved multiple times during the training phase:

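$$\operatorname*{arg\,max}_{y \in \mathcal{Y}} \; l(y_i, y)\left(1 - \left[\mathbf{w} \cdot \boldsymbol{\phi}(x_i, y_i) - \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y)\right]\right) \tag{2.11}$$

for the slack rescaling, and

$$\operatorname*{arg\,max}_{y \in \mathcal{Y}} \; l(y_i, y) - \left[\mathbf{w} \cdot \boldsymbol{\phi}(x_i, y_i) - \mathbf{w} \cdot \boldsymbol{\phi}(x_i, y)\right] \tag{2.12}$$

for the margin rescaling. This loss-augmented prediction is similar to the inference problem and its tractability depends on the chosen feature map and the loss function.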

2.2.2 Conditional Random Field

Multiclass

The conditional random field algorithm [Lafferty et al., 2001] can be seen as an extension of logistic regression for structured prediction. Recall that the logistic regression algorithm uses the logistic function to predict the conditional probability $p(y \mid \mathbf{x}, \mathbf{w})$ in the binary classification case. A prediction is then made by selecting the output $y \in \{+1, -1\}$ with the highest conditional probability. To extend this algorithm to multiclass classification, we can consider that we have one model $\mathbf{w}_y$ per class that learns to distinguish one class from all the others, and $\mathbf{W}$ is an $n \times |\mathcal{Y}|$ matrix that contains all these models. A conditional probability distribution over all possible classes is then obtained in the following way [Murphy, 2012]:

$$p(y \mid \mathbf{x}, \mathbf{W}) = \frac{\exp(\mathbf{w}_y \cdot \mathbf{x})}{\sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'} \cdot \mathbf{x})}. \tag{2.13}$$

Equation (2.13) is known as the softmax function and corresponds to a generalization of the logistic function for multiclass classification. Notice that the division by the sum of the scores $\exp(\mathbf{w}_{y'} \cdot \mathbf{x})$ ensures that $p(y \mid \mathbf{x}, \mathbf{W})$ is a probability distribution over all classes.
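Equation (2.13) can be evaluated as in the sketch below. Subtracting the maximum score before exponentiating is a common numerical precaution that leaves the probabilities unchanged. The matrix `W` and the input `x` are hypothetical values chosen for this example.

```python
import numpy as np

def softmax_probabilities(W, x):
    """Class probabilities of Equation (2.13) for an n x |Y| weight matrix W."""
    scores = W.T @ x                 # one score w_y . x per class
    scores = scores - scores.max()   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical model with n = 2 features and |Y| = 3 classes.
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.0, 2.0]])
x = np.array([1.0, 2.0])

probabilities = softmax_probabilities(W, x)
predicted_class = int(np.argmax(probabilities))
```

The predicted class is the one with the highest score $\mathbf{w}_y \cdot \mathbf{x}$, since the normalization is shared by all classes.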

General

In a manner similar to the structural SVM algorithm, the conditional random field uses a joint feature map φ(x, y) to model the compatibility between x and y, and the relation between the different parts of y. Moreover, to define a conditional probability distribution over all outputs y ∈ Y, the normalization must sum over the score of every possible output, as was the case for Equation (2.13). The conditional random field algorithm thus learns the following conditional probability distribution [Nowozin and Lampert, 2011]:

\[
p(y \mid x, w) = \frac{\exp(w \cdot \boldsymbol{\phi}(x, y))}{Z(x, w)} , \tag{2.14}
\]

where w ∈ R^d, and Z(x, w) is the normalization constant defined as:

\[
Z(x, w) = \sum_{y' \in Y} \exp(w \cdot \boldsymbol{\phi}(x, y')) . \tag{2.15}
\]

Hence, a prediction is made by finding the output y with the highest conditional probability in Equation (2.14) for a given input x.
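As a minimal sketch of Equations (2.14) and (2.15), the conditional probability can be computed by explicit enumeration on a toy problem. Everything specific here is an assumption made for illustration: the two-letter alphabet, outputs of the same length as the input, and the hypothetical feature map `phi` built from emission and transition counts.

```python
import math
from itertools import product

ALPHABET = "AB"  # assumed toy output alphabet


def phi(x, y):
    """Hypothetical joint feature map: emission and transition counts."""
    feats = {}
    for xs, ys in zip(x, y):
        key = ("emit", xs, ys)
        feats[key] = feats.get(key, 0.0) + 1.0
    for prev, cur in zip(y, y[1:]):
        key = ("trans", prev, cur)
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats


def score(w, x, y):
    """Linear score w . phi(x, y) with w stored as a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in phi(x, y).items())


def outputs(x):
    """Enumerate the assumed output space: all sequences as long as x."""
    return ["".join(c) for c in product(ALPHABET, repeat=len(x))]


def partition(w, x):
    """Normalization constant Z(x, w) of Equation (2.15)."""
    return sum(math.exp(score(w, x, y)) for y in outputs(x))


def conditional_probability(w, x, y):
    """p(y | x, w) of Equation (2.14)."""
    return math.exp(score(w, x, y)) / partition(w, x)


def predict(w, x):
    """The most probable output is the one with the highest score."""
    return max(outputs(x), key=lambda y: score(w, x, y))
```

Since Z(x, w) does not depend on y, prediction only needs the scores; the sum over all outputs is required only when the probability values themselves are used, as in training.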

In some cases, it is possible to find the sequence maximizing Equation (2.14) efficiently. For instance, the linear-chain conditional random field is a type of conditional random field where the feature functions over y only use the neighboring symbols y_i and y_{i−1} (i.e. φ_j(x, y) = φ_j(x, y_i, y_{i−1})). In that case, inference and training can be done efficiently. However, the general case of conditional random fields often requires approximating the maximization of Equation (2.14).
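For the linear-chain case, efficient inference is commonly done with the Viterbi dynamic program, which the text does not name explicitly. The sketch below assumes that the score decomposes into a hypothetical emission term `emit(x_i, y_i)` and a hypothetical transition term `trans(y_prev, y_i)`, that the output has the same length as the input, and that the input is non-empty.

```python
def viterbi(x, alphabet, emit, trans):
    """Return the highest-scoring output sequence and its score.

    emit(x_i, y_i) and trans(y_prev, y_i) are assumed scoring callables that
    together play the role of w . phi(x, y) for a linear-chain model.
    """
    # best[s]: best score of any output prefix that ends with symbol s
    best = {s: emit(x[0], s) for s in alphabet}
    backpointers = []
    for xi in x[1:]:
        new_best, pointers = {}, {}
        for s in alphabet:
            # best previous symbol when the current symbol is s
            prev = max(alphabet, key=lambda p: best[p] + trans(p, s))
            new_best[s] = best[prev] + trans(prev, s) + emit(xi, s)
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # follow the back-pointers from the best final symbol
    last = max(alphabet, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best[last]
```

The running time is proportional to n · |A|² for an input of length n, compared with |A|^n for exhaustive enumeration.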

Training

The training phase of the conditional random field algorithm consists in finding the weight vector w ∈ R^d that maximizes the likelihood of the data, which corresponds to maximizing Equation (2.14) for the training examples. As was the case for logistic regression, we can minimize the negative regularized log-likelihood of the data instead [Nowozin and Lampert, 2011]:

\[
F(w) = \lambda \lVert w \rVert^2 - \sum_{i=1}^{m} w \cdot \boldsymbol{\phi}(x_i, y_i) + \sum_{i=1}^{m} \log Z(x_i, w) . \tag{2.16}
\]

Equation (2.16) can be minimized using gradient-based methods such as limited-memory Broyden–Fletcher–Goldfarb–Shanno (BFGS), stochastic gradient descent, or conjugate gradient methods [Bottou, 2010, Liu and Nocedal, 1989, Nocedal and Wright, 2006]. These optimization methods require computing the value of F(w) and the partial derivatives ∂F(w)/∂w_k, defined as follows:

\[
\frac{\partial}{\partial w_k} F(w) = 2 \lambda w_k - \sum_{i=1}^{m} \Bigl[ \phi_k(x_i, y_i) - \sum_{y \in Y} p(y \mid x_i, w)\, \phi_k(x_i, y) \Bigr] . \tag{2.17}
\]

Observe that the sum over the training examples accumulates the difference between the value of φ_k(x_i, y_i) and its expected value Σ_{y∈Y} p(y|x_i, w) φ_k(x_i, y) under the model, so that minimizing F(w) reduces this difference. Furthermore, this sum is equal to zero when p(y_i|x_i, w) = 1 for all training examples. Finally, the training phase requires solving a summation problem over Y each time we want to compute Z(x_i, w) in Equation (2.16) and the partial derivatives of Equation (2.17).
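The objective of Equation (2.16) and the gradient of Equation (2.17) can be sketched by brute-force enumeration, which makes the summation over Y explicit. The setting is assumed for illustration only: a two-letter alphabet, outputs of the same length as the input, and a hypothetical feature map `phi` built from emission and transition counts.

```python
import math
from itertools import product

ALPHABET = "AB"  # assumed toy output alphabet


def phi(x, y):
    """Hypothetical joint feature map: emission and transition counts."""
    feats = {}
    for xs, ys in zip(x, y):
        key = ("emit", xs, ys)
        feats[key] = feats.get(key, 0.0) + 1.0
    for prev, cur in zip(y, y[1:]):
        key = ("trans", prev, cur)
        feats[key] = feats.get(key, 0.0) + 1.0
    return feats


def score(w, x, y):
    """Linear score w . phi(x, y) with w stored as a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in phi(x, y).items())


def outputs(x):
    """Enumerate the assumed output space: all sequences as long as x."""
    return ["".join(c) for c in product(ALPHABET, repeat=len(x))]


def objective_and_gradient(w, data, lam):
    """Return F(w) of Equation (2.16) and its gradient, Equation (2.17)."""
    value = lam * sum(v * v for v in w.values())
    grad = {k: 2.0 * lam * v for k, v in w.items()}
    for x, y_true in data:
        scores = {y: score(w, x, y) for y in outputs(x)}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        value += log_z - scores[y_true]
        # subtract the observed features phi_k(x_i, y_i)
        for k, v in phi(x, y_true).items():
            grad[k] = grad.get(k, 0.0) - v
        # add the expected features under p(y | x_i, w)
        for y, s in scores.items():
            p = math.exp(s - log_z)
            for k, v in phi(x, y).items():
                grad[k] = grad.get(k, 0.0) + p * v
    return value, grad
```

A quick sanity check for such a routine is to compare each gradient entry with a central finite difference of the objective.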

2.2.3 Regression Approach to Structured Prediction

A regression approach to structured prediction called kernel dependency estimation was originally proposed by Weston et al. [2002]. Cortes et al. [2007] generalized this approach for the prediction of sequences, and proposed an inference algorithm specifically for that context. Giguere et al. [2014] provided theoretical guarantees for that approach by developing a PAC-Bayes upper bound on the prediction risk, and by showing that the minimizer of this bound is the predictor proposed by Cortes et al. [2007]. In this section, we first present the pre-image problem, which corresponds to the inference problem for the regression approach to structured prediction. Then, we present the primal and dual form of the structured ridge regression algorithm.

              AA    AB    BA    BB
Wφ_X(x)  = [ 1.1    1     0     0 ]
φ_Y(AAA) = [  2     0     0     0 ]
φ_Y(AAB) = [  1     1     0     0 ]
φ_Y(ABA) = [  0     1     1     0 ]
...

Figure 2.3: Example of pre-image problem where H_Y is the feature space of the 2-gram kernel (subsequences of length 2), the alphabet is A = {A, B}, and AAB is the predicted sequence.

Pre-Image Problem

We assume here the existence of an input feature map φ_X : X → H_X and an output feature map φ_Y : Y → H_Y, where both H_X and H_Y are high-dimensional vector spaces. We consider predictors that are linear operators W : H_X → H_Y. Given any such W, and any x ∈ X, the predicted output h(x) is given by solving the following structured output pre-image problem:

\[
h(x) = \operatorname*{argmin}_{y \in Y} \; \lVert \boldsymbol{\phi}_Y(y) - W \boldsymbol{\phi}_X(x) \rVert^2 . \tag{2.18}
\]

In other words, the goal is to learn a linear operator W that transforms the feature vectors of the input space into feature vectors of the output space. Given a new input x, a prediction is made in two steps. First, the output feature vector Wφ_X(x) is predicted. Then, we need to solve the pre-image problem to reconstruct the output from Wφ_X(x) (i.e. trying to approximate the inverse function φ_Y^{−1} : H_Y → Y). A pre-image is said to be exact when there exists a y ∈ Y whose feature vector φ_Y(y) is exactly equal to Wφ_X(x). Unfortunately, an exact pre-image rarely exists for linear operators obtained by vector-valued regression. Hence, in practice, we have to deal with a hard pre-image problem.

Figure 2.3 shows an example of the pre-image problem when the output feature space H_Y is the feature space of the 2-gram kernel, that is, with subsequences of length 2 (see Chapter 3 for details about the N-gram kernel). In that case, the values of the feature vector φ_Y(y) represent the number of times each subsequence {AA, AB, BA, BB} appears in y. Notice that the exact pre-image does not exist in this case since Wφ_X(x) contains a non-integer value. However, the closest output feature vector is φ_Y(AAB), and the predicted sequence would therefore be AAB.
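The example of Figure 2.3 can be reproduced with a brute-force search over Equation (2.18). This is only a sketch under stated assumptions: candidate outputs are restricted to a fixed length of 3 (the length of the sequences listed in the figure), and the helper names `ngram_features` and `brute_force_pre_image` are hypothetical. The exhaustive enumeration is exponential in the sequence length, so it is meant for illustration rather than as a practical inference method.

```python
from itertools import product


def ngram_features(y, alphabet, n=2):
    """Count how many times each n-gram over the alphabet occurs in y.

    The feature order follows itertools.product, e.g. AA, AB, BA, BB.
    """
    grams = ["".join(g) for g in product(alphabet, repeat=n)]
    counts = {g: 0 for g in grams}
    for i in range(len(y) - n + 1):
        counts[y[i:i + n]] += 1
    return [counts[g] for g in grams]


def brute_force_pre_image(target, alphabet, length, n=2):
    """Return the sequence whose n-gram features are closest to `target`.

    `target` plays the role of W phi_X(x); `length` is an assumed fixed
    output length.
    """
    best_y, best_dist = None, float("inf")
    for cand in product(alphabet, repeat=length):
        y = "".join(cand)
        feats = ngram_features(y, alphabet, n)
        dist = sum((f - t) ** 2 for f, t in zip(feats, target))
        if dist < best_dist:
            best_y, best_dist = y, dist
    return best_y, best_dist
```

Calling `brute_force_pre_image([1.1, 1, 0, 0], "AB", 3)` selects AAB, in agreement with the figure: its feature vector [1, 1, 0, 0] differs from the target only by 0.1 in the first coordinate.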