Dual Coordinate Descent Methods for Logistic Regression and Maximum Entropy Models
Hsiang-Fu Yu · Fang-Lan Huang · Chih-Jen Lin
Abstract Most optimization methods for logistic regression or maximum entropy solve the primal problem. They include iterative scaling, coordinate descent, quasi-Newton, and truncated Newton methods. Less effort has been made to solve the dual problem.
In contrast, for linear support vector machines (SVM), methods that solve the dual problem have been shown to be very effective. In this paper, we apply coordinate descent methods to solve the dual form of logistic regression and maximum entropy. Interestingly, many details are different from the situation in linear SVM. We carefully study the theoretical convergence as well as numerical issues. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression and maximum entropy models.
1 Introduction
Logistic regression (LR) is useful in many areas such as document classification and natural language processing (NLP). It models the conditional probability as:
P_w(y = \pm 1 \mid x) \equiv \frac{1}{1 + e^{-y w^T x}},
where x is the data, y is the class label, and w \in R^n is the weight vector. Given two-class training data \{x_i, y_i\}_{i=1}^{l}, x_i \in R^n, y_i \in \{1, -1\}, logistic regression minimizes the following regularized negative log-likelihood:
P^{LR}(w) = C \sum_{i=1}^{l} \log\left(1 + e^{-y_i w^T x_i}\right) + \frac{1}{2} w^T w,   (1)
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
{b93107, d93011, cjlin}@csie.ntu.edu.tw
where C > 0 is a penalty parameter. Problem (1) is referred to as the primal form of logistic regression; one may instead solve the following dual problem:
\min_{\alpha} \; D^{LR}(\alpha) = \frac{1}{2} \alpha^T Q \alpha + \sum_{i: \alpha_i > 0} \alpha_i \log \alpha_i + \sum_{i: \alpha_i < C} (C - \alpha_i) \log(C - \alpha_i)
subject to 0 \le \alpha_i \le C, \; i = 1, \dots, l,   (2)
where Q_{ij} = y_i y_j x_i^T x_j, \forall i, j.^1 By defining 0 \log 0 = 0, (2) becomes

\min_{\alpha} \; D^{LR}(\alpha) = \frac{1}{2} \alpha^T Q \alpha + \sum_{i=1}^{l} \left( \alpha_i \log \alpha_i + (C - \alpha_i) \log(C - \alpha_i) \right)
subject to 0 \le \alpha_i \le C, \; i = 1, \dots, l.   (3)
We omit the derivation of the dual problem as later we show details for the more general maximum entropy model.
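To make the primal/dual pair concrete, the following pure-Python sketch evaluates the primal objective (1) and the dual objective (3) on small dense data. The function names are ours, and the explicit O(l^2 n) construction of Q is for illustration only; a real solver would never form Q in full.

```python
import math

def lr_primal(w, X, y, C):
    """Primal objective (1): C * sum_i log(1 + exp(-y_i w'x_i)) + w'w / 2."""
    loss = sum(math.log(1.0 + math.exp(-yi * sum(wt * xt for wt, xt in zip(w, xi))))
               for xi, yi in zip(X, y))
    return C * loss + 0.5 * sum(wt * wt for wt in w)

def lr_dual(alpha, X, y, C):
    """Dual objective (3), using the convention 0 log 0 = 0."""
    def xlogx(v):
        return 0.0 if v == 0.0 else v * math.log(v)
    l = len(X)
    quad = 0.0
    for i in range(l):
        for j in range(l):
            Qij = y[i] * y[j] * sum(a * b for a, b in zip(X[i], X[j]))
            quad += alpha[i] * alpha[j] * Qij
    ent = sum(xlogx(a) + xlogx(C - a) for a in alpha)
    return 0.5 * quad + ent
```

For instance, with a single training point x_1 = (1), y_1 = 1, C = 1, and w = 0, the primal objective is log 2.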
Numerous optimization methods have been applied to train logistic regression, as surveyed in, for example, Minka (2003). Most of them solve the primal problem. Darroch and Ratcliff (1972), Della Pietra et al. (1997), Goodman (2002), Jin et al. (2003), and many others have proposed iterative scaling methods. Huang et al. (2010) apply a coordinate descent approach. A quasi-Newton method is included in the comparison by Minka (2003). Komarek and Moore (2005) and Lin et al. (2008) propose truncated Newton techniques. Less effort has been made to solve the dual problem. A few existing studies include Jaakkola and Haussler (1999) and Keerthi et al. (2005). These works, interested in kernel LR, consider the dual because of the easy embedding of kernels. In contrast to LR, for linear support vector machines (SVM),^2 optimization methods have been shown to be very effective for solving the dual problem (e.g., Hsieh et al., 2008). Note that LR is closely related to linear SVM, which takes the following primal and dual forms:
\min_w \; P^{SVM}(w) = C \sum_{i=1}^{l} \max\left(1 - y_i w^T x_i, 0\right) + \frac{1}{2} w^T w,   (4)

and

\min_{\alpha} \; D^{SVM}(\alpha) = \frac{1}{2} \alpha^T Q \alpha - \sum_{i=1}^{l} \alpha_i
subject to 0 \le \alpha_i \le C, \forall i,

where Q_{ij} = y_i y_j x_i^T x_j. An overview of dual forms of various regularized classifiers including LR and SVM can be found in Zhang (2002), though it does not explore detailed optimization algorithms for each classifier.
Coordinate descent methods, a classic optimization approach, have been very successfully applied to solve the dual form of large linear SVM (Hsieh et al., 2008). Motivated by their work, in this paper we study whether coordinate descent methods are useful for solving the dual problem of LR. Interestingly, we find that many details are different from the situation in support vector machines. In particular, numerical issues due to logarithmic evaluations must be properly handled. We carefully design a coordinate descent algorithm to avoid numerical difficulties and prove its convergence. The proposed method is shown to be faster than most state-of-the-art methods for training logistic regression.

^1 In this work we do not consider kernel LR.
^2 By linear SVM we mean that kernel tricks are not employed.
Maximum Entropy (ME) is a generalization of logistic regression for multi-class scenarios.^3 Thus we also study a coordinate descent method for the dual form of ME.
ME models the conditional probability as:

P_w(y \mid x) \equiv \frac{\exp(w^T f(x, y))}{\sum_{y'} \exp(w^T f(x, y'))},
where x denotes a context, y is the label of the context, and w \in R^n is the weight vector. A function vector f(x, y) \in R^n indicates features extracted from the context x and the label y. Assume N training samples \{(x, y)\} are given, and we have grouped the x's into l unique contexts \{x_i\} and calculated the empirical probability distribution \tilde{P}(x_i, y) = N_{x_i, y}/N, where N_{x_i, y} is the number of times that (x_i, y) occurs in the training data.
ME minimizes the following regularized negative log-likelihood:

\min_w \; P^{ME}(w) = -\sum_{i=1}^{l} \sum_{y} \tilde{P}(x_i, y) \log P_w(y \mid x_i) + \frac{1}{2\sigma^2} w^T w
                    = \sum_{i=1}^{l} \tilde{P}(x_i) \log\left( \sum_{y} \exp(w^T f(x_i, y)) \right) - w^T \tilde{f} + \frac{1}{2\sigma^2} w^T w,   (5)
where \sigma is the penalty parameter similar to C in (1), \tilde{P}(x_i) = \sum_y \tilde{P}(x_i, y) is the marginal probability of x_i, and

\tilde{f} = \sum_{i=1}^{l} \sum_{y} \tilde{P}(x_i, y) f(x_i, y)   (6)

is the expected vector of f(x_i, y). For convenience, we assume that y_i \in Y \equiv \{1, 2, \dots, |Y|\}.
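As a concrete illustration of the model above, the following sketch evaluates P_w(y | x) over a small label set; `feat` is a hypothetical user-supplied function returning f(x, y), and the max-subtraction is a standard stabilization trick, not part of the formulation.

```python
import math

def me_prob(w, feat, x, labels):
    """P_w(y|x) for a maximum entropy model; feat(x, y) returns f(x, y)."""
    scores = [sum(wt * ft for wt, ft in zip(w, feat(x, y))) for y in labels]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    tot = sum(exps)
    return [e / tot for e in exps]
```

With w = (1, 0) and indicator features per label, this reduces to a two-class softmax.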
Many optimization methods have been applied to train ME, as discussed in Malouf (2002), Gao et al. (2007), Huang et al. (2010), and references therein. Most existing methods solve the primal problem, though there are a few exceptions: Memisevic (2006) applies a two-level coordinate descent method. Collins et al. (2008) propose an exponentiated gradient (EG) algorithm for conditional random fields (CRF), and their methods can be modified for dual ME. In this paper, we extend the two-level coordinate descent method (Memisevic, 2006) to a numerically robust algorithm. Moreover, we carefully study the theoretical convergence.
This paper is organized as follows. In Section 2, we discuss basic concepts of coordinate descent methods and show some existing examples for SVM and primal LR. In Sections 3 and 4, we describe our proposed algorithms for LR and ME, respectively. A related optimization method for LR/ME duals is discussed in Section 5. In Section 6, we compare our method with state-of-the-art implementations. Results show that the new methods are more efficient. We conclude in Section 7.
^3 See the derivation in Section 6.1 of Huang et al. (2010). If x_i \in R^n, \forall i, are training instances, then in ME, w \in R^{n|Y|}. The LR formulation in (1) is a simplified form because its w has n instead of 2n elements.
2 Coordinate Descent Methods
This section gives an overview of coordinate descent methods by considering the following optimization problem with linear constraints:

\min_{\alpha \in R^l} \; F(\alpha)
subject to A\alpha = b, and 0 \le \alpha \le Ce,   (7)

where A \in R^{m \times l}, b \in R^m, 0 < C \le \infty, and e \in R^l is the vector of all ones. Coordinate descent methods iteratively update a block of variables because optimizing all variables together is more difficult. At each iteration, a nonempty subset B \subset \{1, \dots, l\} is chosen to construct the following sub-problem:

\min_z \; F(\alpha + z)   (8)
subject to z_i = 0, \forall i \notin B,
           Az = 0, and 0 \le \alpha_i + z_i \le C, \forall i \in B.

That is, we consider changing \alpha_B using the solution of (8), while fixing all other elements.
The two design considerations for coordinate descent methods are how to select a block B and how to solve the sub-problem (8). We take SVM and primal LR as examples and discuss different situations.
2.1 Exactly Solving One-Variable Sub-Problem
If the sub-problem has a closed-form solution, we can solve it exactly without using optimization software. We discuss Hsieh et al. (2008) for dual SVM as an example. They restrict B to contain only one element and sequentially select an element from \{1, \dots, l\}. If \alpha_i is being updated, the one-variable sub-problem is

\min_z \; D^{SVM}(\alpha_1, \dots, \alpha_i + z, \dots, \alpha_l) = \frac{1}{2} Q_{ii} z^2 + \nabla_i D^{SVM}(\alpha) z + \text{constant},
subject to 0 \le \alpha_i + z \le C,   (9)

where \nabla_i D^{SVM}(\alpha) is the ith component of the gradient. As (9) is a quadratic function of z, if Q_{ii} > 0, the solution is simply:

z = \min\left( \max\left( \alpha_i - \frac{\nabla_i D^{SVM}(\alpha)}{Q_{ii}}, 0 \right), C \right) - \alpha_i.   (10)

We need to calculate:

\nabla_i D^{SVM}(\alpha) = (Q\alpha)_i - 1 = \sum_{j=1}^{l} Q_{ij} \alpha_j - 1,   (11)

which costs O(ln) as it involves the ith row of the matrix Q. Such operations are expensive. Hsieh et al. (2008) propose an efficient O(n) way to calculate (11). This technique is applied to our method for logistic regression; we give a detailed discussion in Section 3. Algorithm 1 summarizes the procedure of Hsieh et al. (2008).
In practice, for every round of going through the l variables, Hsieh et al. (2008) randomly permute the l indices to decide the update order. They report that this setting yields better convergence than sequential updates. This technique can be applied in all the coordinate descent methods we discuss.
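A minimal pure-Python sketch of Algorithm 1 with this random-permutation heuristic is given below; dense feature lists, the fixed number of rounds, and the function name are illustrative assumptions.

```python
import random

def svm_dual_cd(X, y, C, rounds=50):
    """Dual coordinate descent for linear SVM: one-variable sub-problems
    solved in closed form by (10), with the gradient (11) computed in O(n)
    from a maintained w = sum_i y_i alpha_i x_i."""
    l, n = len(X), len(X[0])
    alpha = [0.0] * l
    w = [0.0] * n
    Qii = [sum(v * v for v in row) for row in X]   # diagonal of Q, pre-stored
    idx = list(range(l))
    for _ in range(rounds):
        random.shuffle(idx)                        # random permutation per round
        for i in idx:
            if Qii[i] == 0.0:
                continue
            G = y[i] * sum(w[t] * X[i][t] for t in range(n)) - 1.0   # (11)
            z = min(max(alpha[i] - G / Qii[i], 0.0), C) - alpha[i]   # (10)
            if z != 0.0:
                alpha[i] += z
                for t in range(n):                 # maintain w in O(n)
                    w[t] += z * y[i] * X[i][t]
    return w, alpha
```

On separable one-dimensional data the procedure quickly produces a separating w with all dual variables feasible.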
2.2 Approximately Solving One-Variable Sub-Problem
If the sub-problem does not have a closed-form solution, optimization methods must be used to solve it. We show the work by Huang et al. (2010) as an example. They apply a one-variable coordinate descent method to solve the primal form of LR. If w_j is being updated, the sub-problem minimizes

g(z) = P^{LR}(w + z e_j)
     = \frac{z^2}{2} + z w_j + C \left( \sum_{i=1}^{l} \log\left( 1 + \frac{e^{z x_{ij}} - 1}{1 + e^{-w^T x_i}} \right) - z \sum_{i: y_i = 1} x_{ij} \right) + P^{LR}(w),   (12)
where e_j is the indicator vector for the jth element. This sub-problem does not have a closed-form solution, so Huang et al. (2010) consider the Newton method with the following update rule:

z \leftarrow z - g'(z)/g''(z).

The first and second derivatives of g(z) are, respectively:
g'(z) = w_j + z + C \left( \sum_{i=1}^{l} \frac{x_{ij} e^{z x_{ij}}}{e^{z x_{ij}} + e^{-w^T x_i}} - \sum_{i: y_i = 1} x_{ij} \right),   (13)

and

g''(z) = 1 + C \sum_{i=1}^{l} \frac{x_{ij}^2 e^{-w^T x_i} e^{z x_{ij}}}{\left( e^{z x_{ij}} + e^{-w^T x_i} \right)^2}.   (14)
If the values w^T x_i are available, (13) and (14) cost O(l). In particular, they involve l exponential operations, each of which is much more expensive than a multiplication or a division on most computers. As each Newton update is not cheap, Huang et al. (2010) apply only one update to obtain an approximate solution of the sub-problem (12). They also need a line search procedure to guarantee convergence.
Compared to Section 2.1, clearly the situation is more complicated if the sub- problem does not have a closed-form solution.
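The single Newton update with line search described above can be sketched as follows. This is a minimal pure-Python illustration: the function name is ours, and e^{-w^T x_i} is recomputed on every call only for brevity (a real implementation would cache these values).

```python
import math

def newton_step_primal_lr(w, X, y, j, C, beta=0.5, gamma=0.01):
    """One approximate Newton update of w_j for primal LR: direction from
    (13)/(14) at z = 0, step size by the backtracking condition (21)."""
    l, n = len(X), len(w)
    ewx = [math.exp(-sum(w[t] * X[i][t] for t in range(n))) for i in range(l)]
    pos = sum(X[i][j] for i in range(l) if y[i] == 1)
    g1 = w[j] + C * (sum(X[i][j] / (1.0 + ewx[i]) for i in range(l)) - pos)  # g'(0)
    g2 = 1.0 + C * sum(X[i][j] ** 2 * ewx[i] / (1.0 + ewx[i]) ** 2
                       for i in range(l))                                    # g''(0)
    d = -g1 / g2
    def gdiff(z):  # g(z) - g(0), from (12)
        s = 0.5 * z * z + z * w[j] - C * z * pos
        for i in range(l):
            s += C * math.log(1.0 + (math.exp(z * X[i][j]) - 1.0) / (1.0 + ewx[i]))
        return s
    lam = 1.0
    while gdiff(lam * d) > gamma * lam * g1 * d:   # sufficient decrease, (21)
        lam *= beta
    return w[j] + lam * d
```

Each call thus costs O(l) exponentials for the derivatives plus O(l) logarithms per line search trial, which is exactly why only one such update is applied per coordinate.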
2.3 Constrained Problems and Using More Than One Variable
Examples in Sections 2.1 and 2.2 choose one variable at a time, so the sub-problem is simple. Instead, we can choose more than one variable. This is particularly needed for constrained problems, as a one-variable update may fail to change the solution (i.e., z = 0 is optimal for the sub-problem). We show several examples in this section.
Most classical SVM software solves SVM with a bias term b; that is, w^T x_i in (4) is replaced by w^T x_i + b. The dual problem then contains an equality constraint:

\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - \sum_{i=1}^{l} \alpha_i
subject to \sum_{i=1}^{l} y_i \alpha_i = 0, and 0 \le \alpha_i \le C, \forall i.   (15)
Due to the equality constraint, the sub-problem must contain at least two variables.
Another example needing more than one variable per sub-problem is multi-class SVM. Assume there are |Y| classes; then y_i \in \{1, \dots, |Y|\} instead of \{1, -1\}. We discuss the multi-class SVM approach by Crammer and Singer (2000) because its formulation is related to maximum entropy, discussed later. The dual problem is

\min_{\alpha} \; D^{CS}(\alpha) = \frac{1}{2} \sum_{y=1}^{|Y|} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_{iy} \alpha_{jy} x_i^T x_j + \sum_{i=1}^{l} \sum_{y=1, y \ne y_i}^{|Y|} \alpha_{iy}   (16)
subject to \sum_{y=1}^{|Y|} \alpha_{iy} = 0, \forall i = 1, \dots, l, and \alpha_{iy} \le C_{y_i}^{y}, \forall i = 1, \dots, l, \; y = 1, \dots, |Y|,

where

\alpha = [\alpha_{11}, \dots, \alpha_{1|Y|}, \dots, \alpha_{l1}, \dots, \alpha_{l|Y|}]^T, and C_{y_i}^{y} = \begin{cases} 0 & \text{if } y_i \ne y, \\ C & \text{if } y_i = y. \end{cases}
The optimization problem (16) has |Y|l variables. The l equalities imply that several variables must be chosen for a sub-problem. As each equality involves the variables associated with one instance, Crammer and Singer (2000) decompose \alpha into l blocks \bar{\alpha}_i = [\alpha_{i1}, \dots, \alpha_{i|Y|}]^T, i = 1, \dots, l, and update one block at a time. The sub-problem is

\min_z \; D^{CS}(\bar{\alpha}_1, \dots, \bar{\alpha}_i + z, \dots, \bar{\alpha}_l) = \frac{1}{2} \sum_{y=1}^{|Y|} x_i^T x_i z_y^2 + \sum_{y=1}^{|Y|} \nabla_{iy} D^{CS}(\alpha) z_y + \text{constant},
subject to \sum_{y} z_y = 0 and 0 \le \alpha_{iy} + z_y \le C_{y_i}^{y}, \forall y = 1, \dots, |Y|,   (17)

where \nabla_{iy} D^{CS}(\alpha) is the partial derivative with respect to \alpha_{iy}. Crammer and Singer (2000, Section 6) show that a closed-form solution of this sub-problem can be obtained in O(|Y| \log |Y|) time. Alternatively, we can apply general optimization methods.
Algorithm 1 Dual coordinate descent method for linear SVM
1. Given an initial \alpha \in [0, C]^l.
2. While \alpha is not optimal:
   - Choose an index i from \{1, \dots, l\}.
   - Solve the sub-problem (9) exactly by the analytic form (10).
   - Update \alpha_i.
Algorithm 2 Dual coordinate descent method for logistic regression
1. Given an initial \alpha \in (0, C)^l.
2. While \alpha is not optimal:
   - Choose an index i from \{1, \dots, l\}.
   - Solve the sub-problem (18) exactly or approximately.
   - Update \alpha_i.
3 A Dual Coordinate Descent Method for Logistic Regression
We begin by discussing difficulties in applying coordinate descent methods to LR. Next we devise an effective method to solve the sub-problem and present our overall procedure. Earlier studies employing coordinate descent methods for dual LR include Minka (2003, Section 9) and Keerthi et al. (2005). We also discuss the differences between our work and theirs.
3.1 Issues in Applying Coordinate Descent Methods for Logistic Regression
Since the dual form of LR is very close to the SVM dual, it is natural to try extending existing methods for SVM (e.g., Algorithm 1). In the following we check whether each step of Algorithm 1 is applicable to LR.

To give an initial \alpha, Algorithm 1 allows any point in the closed interval [0, C]^l, and one often uses \alpha = 0 due to the sparsity of the SVM dual optimal solution. However, for dual LR the objective function is not well defined at \alpha_i = 0 or \alpha_i = C. Therefore, an initial \alpha must be in the open interval (0, C)^l. Further, as \lim_{\alpha_i \to 0^+} \alpha_i \log \alpha_i = 0, it is unclear whether an optimal solution occurs at \alpha_i = 0 or C. The following theorem shows that (3) attains a unique minimum in (0, C)^l:

Theorem 1 The LR dual problem (3) attains a unique optimal solution \alpha^*, and \alpha^* \in (0, C)^l.

The proof is in Appendix A.2. In Section 3.4, we discuss how to choose an appropriate initial point in (0, C)^l.
Another important difference from SVM is that the sub-problem no longer has a closed-form solution. If the ith variable is selected, the sub-problem is

\min_z \; g(z) \equiv (c_1 + z) \log(c_1 + z) + (c_2 - z) \log(c_2 - z) + \frac{a}{2} z^2 + bz
subject to -c_1 \le z \le c_2,   (18)

where

c_1 = \alpha_i, \; c_2 = C - \alpha_i, \; a = Q_{ii}, and b = (Q\alpha)_i.
Table 1: Cost of operations at a Newton iteration.

Operation                                       Cost
Constructing the sub-problem                    O(n)
Finding the Newton direction d                  O(1)
Calculating g(z^k + \lambda d) in line search   O(1)
This sub-problem has been studied in, for example, Keerthi et al. (2005) and Memisevic (2006).^4 We will discuss the differences between our approach and theirs.
If Newton methods are used to solve (18), the update rule without considering the constraint -c_1 \le z \le c_2 is

z^{k+1} = z^k + d, \quad d = -\frac{g'(z^k)}{g''(z^k)},   (19)

where k is the iteration index and, for all z \in (-c_1, c_2),

g'(z) = az + b + \log \frac{c_1 + z}{c_2 - z}, \quad and \quad g''(z) = a + \frac{c_1 + c_2}{(c_1 + z)(c_2 - z)}.   (20)

To ensure the convergence of Newton methods, we often need a line search procedure to check the sufficient decrease of function values. For example, we may search for the first \lambda = 1, \beta, \beta^2, \dots such that

g(z^k + \lambda d) - g(z^k) \le \gamma \lambda g'(z^k) d,   (21)

where \gamma, \beta \in (0, 1). Keerthi et al. (2005) suggest a combination of Newton and bisection methods to ensure convergence, but details are not given. We give an implementation in Section 6.3 and compare it with our proposed method.
We can apply many or few Newton iterations to solve the sub-problem accurately or loosely, respectively. The decision relies on analyzing the cost per iteration; see Table 1. In the beginning, we must construct the sub-problem by calculating the coefficients in (18). Since Q_{ii} can be pre-stored, the main cost is O(ln) for calculating (Q\alpha)_i. The same operation is needed for SVM; see (11). To reduce the cost, we adopt a trick commonly used in linear SVM (e.g., Hsieh et al., 2008) by maintaining a vector:

w(\alpha) \equiv \sum_{i=1}^{l} y_i \alpha_i x_i.   (22)

Then the cost is reduced to O(n):

(Q\alpha)_i = \sum_{j=1}^{l} y_i y_j \alpha_j x_j^T x_i = y_i \left( \sum_{j=1}^{l} y_j \alpha_j x_j^T \right) x_i = y_i w(\alpha)^T x_i.   (23)

To apply (23), w(\alpha) should be maintained throughout the procedure. By

w(\alpha + z e_i) = w(\alpha) + z y_i x_i,   (24)
^4 Their sub-problem, though in the same form as (18), comes from solving maximum entropy instead of logistic regression. See more discussion in Section 4.
[Figure 1: Newton steps for finding a root of g'(z). z^k is an initial point, z^{k+1} is derived from z^k by the Newton step, z^* is the optimizer, and z_m \equiv (c_2 - c_1)/2 is the mid-point of (-c_1, c_2). Panel (a) (z^k > z^*) shows that the Newton step works fine with a good starting point. Panel (b) (z^k < z^* and z^k + d > c_2) shows the situation where the Newton step z^k + d walks outside the interior.]
where z is the solution of the sub-problem (18) and e_i is the indicator vector for the ith component, w(\alpha) can be maintained in O(n) time. Hence constructing the sub-problem costs O(n). From Table 1, the complexity of solving the sub-problem is

O(n) + \#\text{Newton steps} \times \left( O(1) + O(1) \times \#\text{line search steps} \right).   (25)

Because of the cheap O(1) cost for finding Newton directions and conducting line search, we should solve the sub-problem accurately. Interestingly, the situation is very different for solving primal LR via coordinate descent methods (Section 2.2). The sub-problem (12) does not have a closed-form solution either, but Huang et al. (2010) conduct only one Newton iteration (with line search). The reason is that there, both finding Newton directions and conducting line searches are expensive.

From (20), the time for calculating d is dominated by the log operation, which is much more expensive than addition and multiplication operations. In the line search procedure, calculating one function value g(z + \lambda d) involves two log operations; see (18). Hence line search is more expensive than finding the Newton direction. In Section 3.2, we propose a modified Newton method so that line search is not needed but the convergence still holds. Moreover, our approach takes the constraint -c_1 \le z \le c_2 into consideration.
The discussion so far indicates that while the LR dual is very close to the SVM dual, many details in applying coordinate descent methods are different.
3.2 A Modified Newton Method for Solving the Sub-problem
We propose a modified Newton method for (18) that needs no line search procedure. Besides, we properly handle the inequality constraint and establish the global convergence. To begin, Theorem 2 shows that the optimum of (18) lies in the open interval (-c_1, c_2):
Theorem 2 The sub-problem (18) has a unique minimum z^*. Moreover, z^* \in (-c_1, c_2) and g'(z^*) = 0.
The proof is in Appendix A.3. Figure 1 analyzes how the Newton updates (19) may find a root of g'(z). By considering two different situations, we can draw some crucial observations:

- From Figure 1(a), if z^k is on the "correct" side of z^*, then not only are all subsequent points generated by (19) in (-c_1, c_2), but the Newton method also converges to z^*.
- From Figure 1(b), if z^k is on the "wrong" side of z^*, then z^{k+1} given by (19) may be outside (-c_1, c_2).

We need a mechanism so that eventually all points are on the "correct" side of z^*. To this end, a good understanding of the "correct" and "wrong" sides is needed. Let z_m \equiv (c_2 - c_1)/2 be the mid-point of the interval (-c_1, c_2). From Figure 1, we can see that g'(z) is concave in (-c_1, z_m] and convex in [z_m, c_2).^5 The following theorem shows that we can check the positions of z^* and z_m to see whether z^k is on the correct side:

Theorem 3 Let z^* be the optimizer of (18) and z_m = (c_2 - c_1)/2. If z^* \ge z_m, then \{z^k\} generated by (19) converges to z^* for any starting point in [z^*, c_2). If z^* \le z_m, then \{z^k\} converges to z^* for any starting point in (-c_1, z^*]. For any z^k satisfying these conditions, we say it is on the "correct" side of z^*.

This theorem can be easily obtained from the standard convergence proof of Newton methods.^6
For any z^k on the "wrong" side, there are two cases. The first is z^k + d \in (-c_1, c_2). If z^k + d falls on the "correct" side, Theorem 3 implies that subsequent Newton updates converge. If z^k + d is still on the "wrong" side, it at least gets closer to z^*. Thus we take z^k + d as z^{k+1}. The second case is z^k + d \notin (-c_1, c_2). Because (-c_1, c_2) is an open interval, a direct projection is not possible. Assume z^k + d \ge c_2, as in Figure 1(b). We propose finding a point z^{k+1} in [z^k, c_2) closer to the "correct" side by

z^{k+1} = \xi z^k + (1 - \xi) c_2,   (26)

where \xi \in (0, 1). For any z^k on the "wrong" side, we prove that the above setting eventually reaches a point on the "correct" side. This point can then be considered a starting point in Theorem 3 for the convergence.
Theorem 4 Assume z^* \ge z_m. If we generate a sequence of Newton iterations by starting from z^k < z^* (i.e., z^k on the "wrong" side of z^*) and applying the update rule

z^{k+1} = \begin{cases} z^k + d & \text{if } z^k + d < c_2, \\ \xi z^k + (1 - \xi) c_2 & \text{if } z^k + d \ge c_2, \end{cases}

then there is k' > k such that z^{k'} \ge z^*. That is, z^{k'} is on the "correct" side. The situation for z^* \le z_m and z^k > z^* is similar.
^5 Formally, we can prove g'''(z_m) = 0, g'''(z) > 0 if z > z_m, and g'''(z) < 0 if z < z_m.
^6 For example, http://planetmath.org/encyclopedia/NewtonsMethodWorksForConvexRealFunctions.html
Algorithm 3 A modified Newton method to solve (18)
- Given coefficients: a, b, c_1, and c_2.
- Set an initial z^0 \in (-c_1, c_2).
- For k = 0, 1, \dots
  - If g'(z^k) = 0, break.
  - d \leftarrow -g'(z^k)/g''(z^k).
  - z^{k+1} = \begin{cases} z^k + d & \text{if } z^k + d \in (-c_1, c_2), \\ \xi z^k + (1 - \xi)(-c_1) & \text{if } z^k + d \le -c_1, \\ \xi z^k + (1 - \xi) c_2 & \text{if } z^k + d \ge c_2. \end{cases}   (30)
The proof is in Appendix A.4.
We describe the modified Newton method in Algorithm 3. The update rule is very simple, and no line search is needed. Combining Theorems 3 and 4 establishes the global convergence of Algorithm 3.
Theorem 5 The sequence \{z^k\} generated by Algorithm 3 converges to the optimum z^* of (18) for any z^0 \in (-c_1, c_2).
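Algorithm 3 can be transcribed almost literally. The following is a minimal pure-Python sketch; the tolerance-based stopping test replacing the exact condition g'(z^k) = 0, the default \xi = 0.1, and the iteration cap are illustrative choices of ours.

```python
import math

def modified_newton(a, b, c1, c2, z0=0.0, xi=0.1, tol=1e-12, max_iter=100):
    """Algorithm 3 sketch: minimize (c1+z)log(c1+z) + (c2-z)log(c2-z)
    + a z^2 / 2 + b z over (-c1, c2) with update rule (30), no line search."""
    z = z0
    for _ in range(max_iter):
        g1 = a * z + b + math.log((c1 + z) / (c2 - z))      # g'(z), (20)
        if abs(g1) < tol:
            break
        g2 = a + (c1 + c2) / ((c1 + z) * (c2 - z))          # g''(z), (20)
        d = -g1 / g2
        if -c1 < z + d < c2:
            z = z + d
        elif z + d <= -c1:
            z = xi * z + (1.0 - xi) * (-c1)                 # pull toward -c1, (30)
        else:
            z = xi * z + (1.0 - xi) * c2                    # pull toward c2, (30)
    return z
```

The iterates stay strictly inside (-c_1, c_2) by construction, so the logarithms are always well defined.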
The initial z^0 can be any value in (-c_1, c_2), but we hope it is close to z^* for fast convergence of the Newton method. In the final stage of the decomposition method, \alpha_i does not change much and z^* \approx 0, so z^0 = 0 is a reasonable choice. However, in the early stage of the decomposition method, this z^0 may be far from z^*. While we cannot easily find a z^0 on the "correct" side, Theorem 3 indicates that z^0 should satisfy

z^0 \in \begin{cases} (-c_1, z_m) & \text{if } z^* \le z_m, \\ [z_m, c_2) & \text{if } z^* \ge z_m. \end{cases}   (27)

Later in Section 3.3 we show an easy way to check whether z^* \le z_m or not; see (34). Thus we use z^0 = 0 in general, but also ensure that z^0 satisfies (27). This is achieved by

z^0 = \begin{cases} (1 - \xi_0)(-c_1) & \text{if } z^* \le z_m \le 0, \\ (1 - \xi_0) c_2 & \text{if } z^* \ge z_m \ge 0, \\ 0 & \text{otherwise}. \end{cases}   (28)

We explain that 0 < \xi_0 \le 0.5 makes z^0 satisfy (27). If

-c_1 < z^* \le z_m \le 0 < c_2,   (29)

then -(1 - \xi_0) c_1 \in (-c_1, 0) and is closer to -c_1. Since z_m is the mid-point of (-c_1, c_2), we have -(1 - \xi_0) c_1 \le z_m. The situation for z^* \ge z_m \ge 0 is similar.
3.3 Numerical Difficulties
Unfortunately, a direct implementation of Algorithm 3 may face numerical difficulties. Keerthi et al. (2005) point out that when \alpha_i is close to 0 or C, it may be difficult to reach a solution z^* satisfying

g'(z^*) = Q_{ii} z^* + (Q\alpha)_i + \log(\alpha_i + z^*) - \log(C - \alpha_i - z^*) \approx 0.

They explain that if C is large (say 10^5), (Q\alpha)_i is large as well. Then \alpha_i + z^* may be too small (e.g., e^{-10^5}) to be represented as a floating-point number. They propose some ways to handle such a situation. However, through experiments we find that even if C is as large as 10^5, (Q\alpha)_i is generally much smaller (e.g., a few hundreds or thousands). The reason seems to be that, from (23), (Q\alpha)_i is the sum of positive and negative terms, so the value is not as large as C. Instead, we find that numerical difficulties occur because of catastrophic cancellations (i.e., subtraction between two nearly-equal floating-point numbers) when \alpha_i + z is close to zero. That is, if z \approx -\alpha_i, the relative numerical error of calculating \alpha_i + z can be large (Goldberg, 1991). Then \log(\alpha_i + z) is erroneous. A common way to avoid catastrophic cancellation is reformulation.
Let Z_1 = c_1 + z and s = c_1 + c_2. An equivalent form of (18) is

\min_{Z_1} \; g_1(Z_1) = Z_1 \log Z_1 + (s - Z_1) \log(s - Z_1) + \frac{a}{2}(Z_1 - c_1)^2 + b_1(Z_1 - c_1)
subject to 0 \le Z_1 \le s, \; b_1 = b.

Clearly, when z \approx -c_1,

s - Z_1 = c_2 - z \approx c_2 + c_1 = s   (31)

is far away from zero. Thus we avoid a catastrophic cancellation. However, a new subtraction Z_1 - c_1 occurs. In calculating the Newton direction, Z_1 - c_1 appears only in g_1'(Z_1); see (32). If Z_1 - c_1 \approx 0, then a(Z_1 - c_1) + b_1 \approx b_1, and the large relative error in calculating Z_1 - c_1 does not cause serious problems.
Similarly, if z \approx c_2, we let Z_2 = c_2 - z and adopt the following reformulation:

\min_{Z_2} \; g_2(Z_2) = Z_2 \log Z_2 + (s - Z_2) \log(s - Z_2) + \frac{a}{2}(Z_2 - c_2)^2 + b_2(Z_2 - c_2)
subject to 0 \le Z_2 \le s, \; b_2 = -b.
Therefore, instead of minimizing over z, we now work on the distance from z to the lower (or upper) bound. To minimize g_t(Z_t), t = 1, 2, by the Newton method, we need the first and second derivatives:

g_t'(Z_t) = \log \frac{Z_t}{s - Z_t} + a(Z_t - c_t) + b_t \quad and \quad g_t''(Z_t) = a + \frac{s}{Z_t(s - Z_t)}.   (32)

Next we check whether g_1(Z_1) or g_2(Z_2) should be used. From the above discussion, g_1(Z_1) aims to handle the situation of z \approx -c_1, while g_2(Z_2) is for z \approx c_2. As the sequence \{z^k\} generated by Algorithm 3 converges to z^*, most of the points in \{z^k\} are close to z^*. Hence we can choose g_1(Z_1) or g_2(Z_2) based on which bound z^* is closer to:

z^* \text{ closer to } \begin{cases} -c_1 \\ c_2 \end{cases} \Rightarrow \text{ choose } \begin{cases} g_1(Z_1), \\ g_2(Z_2). \end{cases}   (33)
To use (33), as z^* is unknown before applying the Newton method, we consider the following property:

z^* \text{ closer to } \begin{cases} -c_1 \\ c_2 \end{cases} \Leftrightarrow z^* \begin{cases} \le z_m \\ \ge z_m \end{cases} \Leftrightarrow g'(z_m) \begin{cases} \ge 0 \\ \le 0 \end{cases} \Leftrightarrow z_m \begin{cases} \ge -b/a, \\ \le -b/a. \end{cases}   (34)
Algorithm 4 A new modified Newton method for (18)
- Given coefficients: a, b, c_1, and c_2. Let s = c_1 + c_2.
- t \leftarrow \begin{cases} 1 & \text{if } z_m \ge -b/a, \\ 2 & \text{if } z_m < -b/a. \end{cases}
- Set an initial Z_t^0 \in (0, s).
- For k = 0, 1, \dots
  - If g_t'(Z_t^k) = 0, break.
  - d \leftarrow -g_t'(Z_t^k)/g_t''(Z_t^k).
  - Z_t^{k+1} = \begin{cases} \xi Z_t^k & \text{if } Z_t^k + d \le 0, \\ Z_t^k + d & \text{otherwise}. \end{cases}
- \begin{cases} Z_2^k = s - Z_1^k & \text{if } t = 1, \\ Z_1^k = s - Z_2^k & \text{if } t = 2. \end{cases}
- Return (Z_1^k, Z_2^k).
The proof is in Appendix A.5. Thus the decision is made by simply comparing z_m and -b/a.

Using Z_1 = c_1 + z, Z_2 = c_2 - z, and (34), a direct calculation shows that the initial z^0 considered in (28) becomes

Z_1^0 = \begin{cases} \xi_0 c_1 & \text{if } c_1 \ge s/2, \\ c_1 & \text{otherwise}, \end{cases} \quad and \quad Z_2^0 = \begin{cases} \xi_0 c_2 & \text{if } c_2 \ge s/2, \\ c_2 & \text{otherwise}. \end{cases}   (35)

We also need to adjust (26), which handles the situation z^k + d \ge c_2. Assume g_2(Z_2) is used. Eq. (26) becomes

c_2 - Z_2^{k+1} = \xi(c_2 - Z_2^k) + (1 - \xi) c_2,

which simplifies to

Z_2^{k+1} = \xi Z_2^k.

The situation for g_1(Z_1) is similar. By minimizing g_1(Z_1) or g_2(Z_2), Algorithm 3 becomes Algorithm 4. The returned values can be either (t, Z_t^k) or (Z_1^k, Z_2^k). We adopt the latter to avoid possible catastrophic cancellations in calculating c_1 and c_2 for the next sub-problem. See details in Section 3.4.
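Algorithm 4 can be sketched as follows. This is a hedged illustration rather than a production implementation: the stopping tolerance and the defaults \xi = 0.1, \xi_0 = 0.5 are ours, and a symmetric safeguard for Z + d \ge s is added for robustness even though the analysis above only requires guarding Z + d \le 0.

```python
import math

def solve_subproblem(a, b, c1, c2, xi=0.1, xi0=0.5, tol=1e-12, max_iter=100):
    """Algorithm 4 sketch: solve (18) via Z1 = c1 + z or Z2 = c2 - z,
    the variable chosen by rule (34) so that the log argument stays
    away from a catastrophic cancellation."""
    s = c1 + c2
    zm = (c2 - c1) / 2.0
    if zm >= -b / a:
        t, c, bt = 1, c1, b             # minimize g1, (32)
    else:
        t, c, bt = 2, c2, -b            # minimize g2, (32)
    Z = xi0 * c if c >= s / 2.0 else c  # initial point, (35)
    for _ in range(max_iter):
        g1 = math.log(Z / (s - Z)) + a * (Z - c) + bt    # g_t'(Z), (32)
        if abs(g1) < tol:
            break
        d = -g1 / (a + s / (Z * (s - Z)))                # Newton direction
        if Z + d <= 0.0:
            Z = xi * Z                                   # shrink toward 0
        elif Z + d >= s:
            Z = xi * Z + (1.0 - xi) * s                  # safeguard (sketch only)
        else:
            Z += d
    return (Z, s - Z) if t == 1 else (s - Z, Z)          # (Z1, Z2)
```

Both Z_1 and Z_2 are returned, matching the discussion in Section 3.4 on reusing them for the next sub-problem.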
3.4 The Overall Procedure
Different from the situation in SVM, \alpha = 0 is now not a valid starting point. A naive choice is to set \alpha_i = C/2 \in (0, C), \forall i. However, experiments show that this initialization is far from the optimal solution. Note that for SVM, \alpha = 0 is a reasonable choice because at the final solution many elements remain at zero (i.e., the solution is sparse). Though LR does not produce a sparse solution, many \alpha_i values are small. From the optimality condition,^7 the optimal (w, \alpha) satisfies

\alpha_i = \frac{C \exp(-y_i w^T x_i)}{1 + \exp(-y_i w^T x_i)}, \; \forall i.

^7 We do not show the optimality condition for LR, but a similar form can be found in (75) for ME.
Algorithm 5 A dual coordinate descent method for logistic regression
- Set initial \alpha_i = \min(\epsilon_1 C, \epsilon_2), \forall i, and the corresponding w \leftarrow \sum_i \alpha_i y_i x_i.
- \alpha_i' \leftarrow C - \alpha_i and Q_{ii} \leftarrow x_i^T x_i, \forall i.
- While \alpha is not optimal
  - For i = 1, \dots, l
    1. Construct the sub-problem (18) for instance x_i by setting
       c_1 = \alpha_i, \; c_2 = \alpha_i', \; a = Q_{ii}, and b = y_i w^T x_i.
    2. Solve (18) by Algorithm 4 and get Z_1 and Z_2. Note that in Algorithm 4, s \equiv c_1 + c_2 = C.
    3. w \leftarrow w + (Z_1 - \alpha_i) y_i x_i.
    4. \alpha_i \leftarrow Z_1, \; \alpha_i' \leftarrow Z_2.
As \exp(-y_i w^T x_i) quickly decays to zero as -y_i w^T x_i becomes more negative, many correctly classified instances have their corresponding \alpha_i/C close to zero. Therefore, similar to SVM, we should use an initial point close to the zero vector. We consider

\alpha_i = \min(\epsilon_1 C, \epsilon_2), \; \forall i,   (36)

where \epsilon_1 and \epsilon_2 are small positive values less than one. Keerthi et al. (2005) consider \alpha_i = C/l_+ if y_i = 1 and C/l_- if y_i = -1, where l_+ and l_- are the numbers of positive and negative data, respectively. Ours differs from theirs in \epsilon_2, which ensures that the initial \alpha_i is sufficiently small regardless of the value of C.
In constructing the sub-problem (18), another catastrophic cancellation may occur. If \alpha_i \approx C, then calculating c_2 = C - \alpha_i is a catastrophic cancellation. An erroneous c_2 then causes more numerical errors in subsequent calculations. To remedy this problem, a reformulation can be performed in the previous update of \alpha_i: from the definition of Z_2 in Section 3.3,

Z_2 = c_2 - z = C - \alpha_i^{old} - z = C - \alpha_i^{new}.

Therefore, if g_2(Z_2) was considered earlier, the returned Z_2 can be directly used as c_2 for the current sub-problem. Alternatively, if g_1(Z_1) was used, we calculate Z_2 = s - Z_1 at the end of Algorithm 4. According to (31), this is not a catastrophic cancellation. The discussion here explains why we choose to output both (Z_1, Z_2) in Algorithm 4.
Algorithm 5 gives the details of the proposed coordinate descent method for the LR dual. To update w(\alpha) via (24), we need z, but Algorithm 4 gives only Z_1 and Z_2. We can consider either Z_1 - \alpha_i or C - Z_2, though a catastrophic cancellation may occur. However, the situation seems less serious than that in Section 3.3, which involves a log operation after a catastrophic cancellation. Finally, the following theorem shows the linear convergence of Algorithm 5.
Theorem 6 Let \alpha^s denote the vector at the beginning of each iteration of the while loop in Algorithm 5. The sequence \{\alpha^s\} globally converges to the unique optimum \alpha^*. The convergence rate is at least linear: there are 0 < \mu < 1 and an iteration s_0 such that

D^{LR}(\alpha^{s+1}) - D^{LR}(\alpha^*) \le \mu \left( D^{LR}(\alpha^s) - D^{LR}(\alpha^*) \right), \; \forall s \ge s_0.   (37)

The proof is in Appendix A.6.
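Putting the pieces together, a self-contained sketch of Algorithm 5, with the Algorithm 4 inner solver inlined, might look as follows. The fixed outer iteration count, the tolerances, and the choices \epsilon_1 = 10^{-3}, \epsilon_2 = 10^{-8}, \xi = 0.1, \xi_0 = 0.5 are illustrative assumptions; a real implementation would stop on an optimality condition instead, and the safeguard for Z + d \ge s is ours.

```python
import math

def lr_dual_cd(X, y, C, outer_iters=30, eps1=1e-3, eps2=1e-8, xi=0.1):
    """Algorithm 5 sketch: dual coordinate descent for LR, inner sub-problems
    solved on the variables Z1 = c1 + z, Z2 = c2 - z as in Algorithm 4."""
    l, n = len(X), len(X[0])
    a0 = min(eps1 * C, eps2)                     # initial alpha_i, (36)
    alpha = [a0] * l
    alphap = [C - a0] * l                        # alpha'_i = C - alpha_i
    Qii = [sum(v * v for v in row) for row in X]
    w = [a0 * sum(y[i] * X[i][t] for i in range(l)) for t in range(n)]
    for _ in range(outer_iters):
        for i in range(l):
            if Qii[i] == 0.0:
                continue                         # skip zero instances in this sketch
            c1, c2, a = alpha[i], alphap[i], Qii[i]
            b = y[i] * sum(w[t] * X[i][t] for t in range(n))   # (Q alpha)_i, (23)
            s = C
            t1 = ((c2 - c1) / 2.0 >= -b / a)     # decision rule (34)
            cz, bt = (c1, b) if t1 else (c2, -b)
            Z = 0.5 * cz if cz >= s / 2.0 else cz   # initial point, (35)
            for _ in range(100):                 # inlined Algorithm 4
                g1 = math.log(Z / (s - Z)) + a * (Z - cz) + bt
                if abs(g1) < 1e-12:
                    break
                d = -g1 / (a + s / (Z * (s - Z)))
                if Z + d <= 0.0:
                    Z = xi * Z
                elif Z + d >= s:
                    Z = xi * Z + (1.0 - xi) * s  # symmetric safeguard (sketch only)
                else:
                    Z += d
            Z1, Z2 = (Z, s - Z) if t1 else (s - Z, Z)
            for t in range(n):                   # maintain w via (24)
                w[t] += (Z1 - alpha[i]) * y[i] * X[i][t]
            alpha[i], alphap[i] = Z1, Z2
    return w, alpha
```

On a small separable problem, the returned \alpha stays strictly inside (0, C), consistent with Theorem 1.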
4 A Two-Level Dual Coordinate Descent Method for Maximum Entropy

Based on the experience with LR in Section 3, this section investigates a two-level dual coordinate descent method for ME. The outer level considers a block of variables at a time. The resulting sub-problem is then solved by an inner loop of coordinate descent updates. Our method extends that of Memisevic (2006), but we give a more complete analysis.
4.1 Dual of ME and Coordinate Descent Methods
We derive in Appendix A.7 the following dual form of (5):

\min_{\alpha} \; D^{ME}(\alpha) = \frac{1}{2\sigma^2} w(\alpha)^T w(\alpha) + \sum_i \sum_{y: \alpha_{iy} > 0} \alpha_{iy} \log \alpha_{iy}
subject to \sum_y \alpha_{iy} = \tilde{P}(x_i) and \alpha_{iy} \ge 0, \forall i, y,   (38)

where

w(\alpha) \equiv \sigma^2 \left( \tilde{f} - \sum_{i,y} \alpha_{iy} f(x_i, y) \right)   (39)

and \tilde{f} is defined in (6). The vector \alpha \in R^{l|Y|} can be decomposed into l blocks

\alpha = [\bar{\alpha}_1, \dots, \bar{\alpha}_l]^T and \bar{\alpha}_i = [\alpha_{i1}, \dots, \alpha_{i|Y|}]^T,   (40)

where \bar{\alpha}_i corresponds to the unique context x_i in the data set. If w^* and \alpha^* are respectively the optimal solutions of the primal and dual problems, then w(\alpha^*) = w^*.
Eq. (39) is slightly different from the formulation considered in Lebanon and Lafferty (2002), Memisevic (2006), and Collins et al. (2008), where

w(\alpha) \equiv \sigma^2 \sum_{i,y} \alpha_{iy} \left( f(x_i, y_i) - f(x_i, y) \right).   (41)

The difference is due to the fact that these works additionally assume that there is a unique y_i for each x_i among all training data. That is, \tilde{P}(x_i, y_i) = \tilde{P}(x_i) and \tilde{P}(x_i, y) = 0, \forall y \ne y_i. Under this assumption and using the equality constraint in (38), (39) can be reduced to (41):

\sigma^2 \left( \tilde{f} - \sum_{i,y} \alpha_{iy} f(x_i, y) \right) = \sigma^2 \sum_{i,y} \left( \tilde{P}(x_i, y) - \alpha_{iy} \right) f(x_i, y)
= \sigma^2 \left( \sum_i \tilde{P}(x_i) f(x_i, y_i) - \sum_{i,y} \alpha_{iy} f(x_i, y) \right)
= \sigma^2 \sum_{i,y} \alpha_{iy} \left( f(x_i, y_i) - f(x_i, y) \right).

Like the situation in LR, the following theorem shows that the optimal \alpha^* of (38) is in general an interior point.
Theorem 7 The ME dual problem (38) attains a unique optimal solution \alpha^*, and for any i, y,

\alpha_{iy}^* \begin{cases} = 0 & \text{if } \tilde{P}(x_i) = 0, \\ \in (0, \tilde{P}(x_i)) & \text{otherwise}. \end{cases}

The proof is in Appendix A.8.
Next we design a coordinate descent method to solve (38). We observe that (38) is very similar to (16) for multi-class SVM in several aspects. First, the \alpha vector can be decomposed into several blocks, and each block is associated with an x_i and all labels; see (40).^8 Second, each equality constraint corresponds to a single x_i. Therefore, we follow Memisevic (2006) and earlier SVM works (Crammer and Singer, 2000; Hsu and Lin, 2002; Keerthi et al., 2008) in considering the variables associated with an x_i as a block.
The sub-problem is:

\min_z \; h(z) subject to \sum_y z_y = 0 and z_y \ge -\alpha_{iy}, \forall y,   (42)

where

h(z) \equiv D^{ME}(\bar{\alpha}_1, \dots, \bar{\alpha}_i + z, \dots, \bar{\alpha}_l)   (43)
     = \sum_y (\alpha_{iy} + z_y) \log(\alpha_{iy} + z_y) + \frac{1}{2\sigma^2} \left\| w(\alpha) - \sigma^2 \sum_y z_y f(x_i, y) \right\|^2 + \text{constant}
     = \sum_y (\alpha_{iy} + z_y) \log(\alpha_{iy} + z_y) - \sum_y z_y w(\alpha)^T f(x_i, y) + \frac{\sigma^2}{2} z^T K^i z + \text{constant},

where K^i \in R^{|Y| \times |Y|} is a matrix with K_{yy'}^i = f(x_i, y)^T f(x_i, y'), \forall y, y' \in Y.
4.2 Solving the Sub-problem
Clearly, (42) is very similar to the sub-problem (17) because of the same equality constraint. Eq. (17) has a closed-form solution, but (42) has not, due to the log terms in the objective function. Many optimization methods can be applied to solve (42). Collins et al. (2008) propose an exponentiated gradient (EG) method to get an approximate solution; we leave the details of EG to Section 5. We follow Memisevic (2006) and use a coordinate descent method, so the procedure for solving (38) becomes a two-level coordinate descent method. Each step of the outer level considers the variables associated with an x_i as a block and obtains the sub-problem (42). The inner level then solves (42) via coordinate descent. Such two-level approaches have been considered in training SVM (e.g., Rüping, 2000; Pérez-Cruz et al., 2004).
To solve the sub-problem (42), each time we select two variables \alpha_{iy_1} and \alpha_{iy_2}. Using the equality constraint, we obtain a one-variable sub-problem:

\min_d \; h(z + d(e_{y_1} - e_{y_2})) \equiv (\alpha_{iy_1} + z_{y_1} + d) \log(\alpha_{iy_1} + z_{y_1} + d) + (\alpha_{iy_2} + z_{y_2} - d) \log(\alpha_{iy_2} + z_{y_2} - d)
+ \left( \sigma^2 \left( (K^i z)_{y_1} - (K^i z)_{y_2} \right) - w(\alpha)^T \left( f(x_i, y_1) - f(x_i, y_2) \right) \right) d
+ \frac{\sigma^2}{2} \left( K_{y_1 y_1}^i + K_{y_2 y_2}^i - 2 K_{y_1 y_2}^i \right) d^2 + \text{constant}
subject to -(\alpha_{iy_1} + z_{y_1}) \le d \le \alpha_{iy_2} + z_{y_2}.   (44)
^8 In fact, by defining w(\alpha) = \left[ \sum_i \alpha_{i1} x_i^T, \dots, \sum_i \alpha_{i|Y|} x_i^T \right]^T, (16) also has a w(\alpha)^T w(\alpha) term like (38).
Algorithm 6 Solving the sub-problem (42) by a coordinate descent method with maximal violating pairs. We assume the property (48).
- Given \bar{\alpha}_i, w(\alpha), K_{yy}^i, \forall y.
- \hat{z}^0 \leftarrow \bar{\alpha}_i.
- v_y \leftarrow w(\alpha)^T f(x_i, y), \forall y.
- Find the initial gradient:
  G_y \leftarrow \log(\hat{z}_y^0) + 1 - v_y, \forall y.
- For k = 0, 1, 2, \dots
  - If \max_y G_y = \min_y G_y, break.
  - y_1 \leftarrow \arg\max_y G_y, \; y_2 \leftarrow \arg\min_y G_y.
  - Calculate the coefficients of (44) using the variable \hat{z}:
    a \leftarrow \sigma^2 \left( K_{y_1 y_1}^i + K_{y_2 y_2}^i \right),
    b \leftarrow \sigma^2 \left( (\hat{z}_{y_1}^k - \alpha_{iy_1}) K_{y_1 y_1}^i - (\hat{z}_{y_2}^k - \alpha_{iy_2}) K_{y_2 y_2}^i \right) - v_{y_1} + v_{y_2},
    c_1 \leftarrow \hat{z}_{y_1}^k, \; c_2 \leftarrow \hat{z}_{y_2}^k.   (47)
  - Solve (44) by Algorithm 4 and get the optimal Z_1^*, Z_2^*.
  - \hat{z}_{y_1}^{k+1} \leftarrow Z_1^*, \; \hat{z}_{y_2}^{k+1} \leftarrow Z_2^*.
  - Update the gradient:
    G_{y_1} \leftarrow \log(\hat{z}_{y_1}^{k+1}) + 1 + \sigma^2 K_{y_1 y_1}^i (\hat{z}_{y_1}^{k+1} - \alpha_{iy_1}) - v_{y_1},
    G_{y_2} \leftarrow \log(\hat{z}_{y_2}^{k+1}) + 1 + \sigma^2 K_{y_2 y_2}^i (\hat{z}_{y_2}^{k+1} - \alpha_{iy_2}) - v_{y_2}.
By assigning

a \leftarrow \sigma^2 \left( K_{y_1 y_1}^i + K_{y_2 y_2}^i - 2 K_{y_1 y_2}^i \right),
b \leftarrow \sigma^2 \left( (K^i z)_{y_1} - (K^i z)_{y_2} \right) - w(\alpha)^T \left( f(x_i, y_1) - f(x_i, y_2) \right),
c_1 \leftarrow \alpha_{iy_1} + z_{y_1} and c_2 \leftarrow \alpha_{iy_2} + z_{y_2},   (45)

(44) is in the same form as (18), so Algorithm 4 can be applied.
There are many ways to select the two indices y_1 and y_2. In SVM, this issue, called working set selection, has been thoroughly studied. For example, we can sequentially go through all pairs of indices. Alternatively, using gradient information (e.g., Joachims, 1998; Keerthi et al., 2001; Fan et al., 2005) may lead to faster convergence. Memisevic (2006) adopts the "maximal violating pair" strategy (Keerthi et al., 2001) by selecting the two indices violating the optimality condition the most. From a proof similar to that of Theorem 1, the optimal z^* of (42) satisfies z_y^* > -\alpha_{iy}, \forall y. Thus, without considering the inequality constraints, the optimality condition implies

\nabla_{z_y} h(z^*) = \nabla_{z_{y'}} h(z^*), \forall y, y',

where

\nabla_{z_y} h(z) \equiv \log(\alpha_{iy} + z_y) + 1 + \sigma^2 (K^i z)_y - w(\alpha)^T f(x_i, y).   (46)

We can select the maximal violating pair by

y_1 = \arg\max_y \nabla_{z_y} h(z) \quad and \quad y_2 = \arg\min_y \nabla_{z_y} h(z).
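The selection rule can be sketched directly from (46); the per-block argument layout below (a precomputed block K^i and the values v_y = w(\alpha)^T f(x_i, y)) is our assumption for the illustration.

```python
import math

def max_violating_pair(alpha_i, z, K, v, sigma2):
    """Return (y1, y2) = (argmax, argmin) of the gradient (46) over labels.
    alpha_i[y]: block values; z[y]: current shift; K: block kernel K^i;
    v[y]: w(alpha)^T f(x_i, y)."""
    Y = range(len(z))
    G = [math.log(alpha_i[y] + z[y]) + 1.0
         + sigma2 * sum(K[y][yp] * z[yp] for yp in Y) - v[y] for y in Y]
    y1 = max(Y, key=lambda y: G[y])
    y2 = min(Y, key=lambda y: G[y])
    return y1, y2, G
```

When all gradient components coincide (the stopping test of Algorithm 6), y_1 and y_2 simply tie and no further progress is possible on this block.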