Fast Projection onto the Simplex and the l1 Ball



HAL Id: hal-01056171

https://hal.archives-ouvertes.fr/hal-01056171v2

Submitted on 1 Sep 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Fast Projection onto the Simplex and the l1 Ball

Laurent Condat

To cite this version:

Laurent Condat. Fast Projection onto the Simplex and the ℓ1 Ball. Mathematical Programming, Series A, Springer, 2016, 158 (1), pp. 575–585. doi:10.1007/s10107-015-0946-6. hal-01056171v2


Fast Projection onto the Simplex and the ℓ1 Ball

Laurent Condat

Final author’s version. The final publication is available at Springer via http://dx.doi.org/10.1007/s10107-015-0946-6

Abstract A new algorithm is proposed to project, exactly and in finite time, a vector of arbitrary size onto a simplex or an ℓ1-norm ball. It can be viewed as a Gauss–Seidel-like variant of Michelot's variable fixing algorithm; that is, the threshold used to fix the variables is updated after each element is read, instead of waiting for a full reading pass over the list of non-fixed elements. This algorithm is empirically demonstrated to be faster than existing methods.

Keywords Simplex · ℓ1-norm ball · Large-scale optimization

Mathematics Subject Classification (2010) 49M30, 65C60, 65K05, 90C25.

1 Introduction

The projection of a vector onto the simplex or the ℓ1 ball appears in imaging problems, such as segmentation [15] and multispectral unmixing [2], in portfolio optimization [5], and in many applications of statistics, operations research and machine learning [7, 20, 3, 18, 19]. Given an integer N ≥ 1, a sequence (vector) y = (y_n)_{n=1}^N ∈ R^N and a real a > 0, we aim at computing

P_∆(y) := arg min_{x∈∆} ‖x − y‖   (1)

or

P_B(y) := arg min_{x∈B} ‖x − y‖,   (2)

where the norm is the Euclidean norm, and the simplex ∆ ⊂ R^N is defined as the set of sequences whose elements are nonnegative and sum up to a (for a = 1, ∆ is called the unit, or canonical, or standard, or probability simplex):

∆ := { (x_1, . . . , x_N) ∈ R^N : ∑_{n=1}^N x_n = a and x_n ≥ 0, ∀n = 1, . . . , N }.   (3)

This work has been partially supported by the LabEx PERSYVAL-Lab (ANR-11-LABX-0025). L. Condat is with Univ. Grenoble Alpes, GIPSA-Lab, F-38000 Grenoble, France.

(3)

and the ℓ1 ball, a.k.a. cross-polytope, B ⊂ R^N is defined as

B := { (x_1, . . . , x_N) ∈ R^N : ∑_{n=1}^N |x_n| ≤ a }.

These two projections are well defined and unique, since ∆ and B are closed and convex sets. In this paper, we focus on algorithms to perform these projections exactly and in finite time. In Sect. 2, we review the methods of the literature. In Sect. 3, we propose a new algorithm and we show in Sect. 4 that it is faster than the existing methods.

2 Review of prior work

We first recall a well-known property, which allows one to project onto the ℓ1 ball as soon as one can project onto the simplex:

Proposition 2.1 (see, e.g., [7, Lemma 3])

P_B(y) = y if ∑_{n=1}^N |y_n| ≤ a; otherwise P_B(y) = (sgn(y_1) x_1, . . . , sgn(y_N) x_N),   (4)

where x = P_∆(|y|), |y| = (|y_1|, . . . , |y_N|), and sgn is the signum function: sgn(t) = 1 if t > 0, sgn(t) = −1 if t < 0, and sgn(0) = 0.

Remark 2.1 Conversely to Proposition 2.1, one can project onto the simplex using projection onto the ℓ1 ball. Indeed, P_∆(y) = P_∆(y + c), for every c ∈ R, and P_∆(y) = P_B(y) if the elements of y are nonnegative and ‖y‖_1 ≥ a. Thus, whatever y, we have P_∆(y) = P_B(y − y_min + a/N), where y_min is the smallest element of y.
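As an illustration, Proposition 2.1 transcribes directly into Python. This is a sketch with names of our choosing; it uses a compact sort-based simplex projection (Algorithm 1 of Sect. 2) as the inner routine, but any exact simplex projection would do.

```python
def project_simplex(y, a=1.0):
    """Projection onto the simplex {x : x_n >= 0, sum x_n = a} (sort-based)."""
    u = sorted(y, reverse=True)            # u_1 >= ... >= u_N
    css, tau = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        css += uk
        if (css - a) / k < uk:             # test of Algorithm 1, step 2
            tau = (css - a) / k            # tau for the largest such k
    return [max(z - tau, 0.0) for z in y]

def project_l1_ball(y, a=1.0):
    """Projection onto the l1 ball {x : sum |x_n| <= a} via Proposition 2.1."""
    if sum(abs(z) for z in y) <= a:
        return list(y)                     # y is already inside the ball
    x = project_simplex([abs(z) for z in y], a)
    sgn = lambda t: (t > 0) - (t < 0)      # signum, with sgn(0) = 0
    return [s * xn for s, xn in zip(map(sgn, y), x)]
```

For example, project_l1_ball([0.8, -0.8], 1.0) returns approximately [0.5, -0.5].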

So, by virtue of Proposition 2.1, we focus in the following on projecting onto the simplex only, and we denote by

x := P_∆(y)   (5)

the projected sequence. An important property of the projection P_∆, which can be derived from the corresponding Karush–Kuhn–Tucker optimality conditions, is the following:

Proposition 2.2 (see, e.g., [10]) There exists a unique τ ∈ R such that

x_n = max{y_n − τ, 0}, ∀n = 1, . . . , N.   (6)

Therefore, the whole difficulty of the operation P_∆ is to find the value of τ. Then, the projection itself simply amounts to applying the thresholding operation (6). So, we must find τ such that ∑_{n=1}^N max{y_n − τ, 0} = a. Only the largest elements of y, which are larger than τ and are not set to zero during the projection, contribute to this sum. So, if we knew the index set I = {n | x_n > 0}, since ∑_{n=1}^N x_n = ∑_{n∈I} x_n = ∑_{n∈I} (y_n − τ) = a, we would have τ = (∑_{n∈I} y_n − a)/|I|. Algorithm 1, given in


Algorithm 1 (using sorting) [10]
1. Sort y into u: u_1 ≥ · · · ≥ u_N.
2. Set K := max_{1≤k≤N} {k | (∑_{r=1}^k u_r − a)/k < u_k}.
3. Set τ := (∑_{k=1}^K u_k − a)/K.
4. For n = 1, . . . , N, set x_n := max{y_n − τ, 0}.

Algorithm 2 (using a heap) [20]
1. Build a heap v from y.
2. For k = 1, . . . , N, do
   2.1. Set u_k := v_1 (largest element of the heap).
   2.2. If (∑_{r=1}^k u_r − a)/k ≥ u_k, exit the loop.
   2.3. Set K := k.
   2.4. Remove v_1 from v and re-heapify v.
3. Set τ := (∑_{k=1}^K u_k − a)/K.
4. For n = 1, . . . , N, set x_n := max{y_n − τ, 0}.

Algorithm 3 (using partitioning) [12]
1. Set v := y, K := 0, S := −a.
2. While v is not empty, do
   2.1. Choose a pivot ρ in the convex hull of v.
   2.2. Construct the subsequences y_high := (y ∈ v | y > ρ) and y_low := (y ∈ v | y < ρ), and set M as the number of elements equal to ρ in v.
   2.3. If (S + Mρ + ∑_{y∈y_high} y)/(K + M + |y_high|) < ρ, set S := S + Mρ + ∑_{y∈y_high} y, K := K + M + |y_high|, v := y_low.
   2.4. Else, set v := y_high.
3. Set τ := S/K.
4. For n = 1, . . . , N, set x_n := max{y_n − τ, 0}.

Algorithm 4 (active set method of Michelot) [17]
1. Set v := y, ρ := (∑_{n=1}^N y_n − a)/N.
2. Do, while |v| changes,
   2.1. Replace v by its subsequence (y ∈ v | y > ρ).
   2.2. Set ρ := (∑_{y∈v} y − a)/|v|.
3. Set τ := ρ, K := |v|.
4. For n = 1, . . . , N, set x_n := max{y_n − τ, 0}.

Fig. 1 Several algorithms to project onto the simplex ∆. The input data consists of N ≥ 1, y ∈ R^N, a > 0, and the output is the sequence x = (x_n)_{n=1}^N = P_∆(y). Incidentally, the number K at the end of the algorithms is the number of nonzero elements in x.

Fig. 1, which is based on sorting the elements of y in decreasing order, naturally follows from these considerations. This algorithm is explicitly given in [10] and was rediscovered many times later. Depending on the choice of the sorting algorithm, the worst case complexity of Algorithm 1 is O(N²) or O(N log N) [1].
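Algorithm 1 transcribes almost line by line into Python. The following unoptimized sketch (the function name is ours) also returns τ and K; the early exit is valid because, once the test of step 2 fails for some k, it fails for all larger k.

```python
def project_simplex_sort(y, a=1.0):
    """Algorithm 1: sort, find K and tau, then threshold. O(N log N)."""
    N = len(y)
    u = sorted(y, reverse=True)          # step 1: u_1 >= ... >= u_N
    css = u[0]                           # running sum of the k largest elements
    K, tau = 1, u[0] - a                 # k = 1 always passes the test (a > 0)
    for k in range(2, N + 1):
        css += u[k - 1]
        if (css - a) / k < u[k - 1]:     # step 2: keep the largest passing k
            K, tau = k, (css - a) / k
        else:
            break                        # the test stays false for larger k
    return [max(z - tau, 0.0) for z in y], tau, K
```

For y = [0.1, 0.9, 0.5] and a = 1, this gives K = 2, τ ≈ 0.2 and x ≈ [0, 0.7, 0.3].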

An improvement of Algorithm 1 was proposed in [20], by noticing that it is not useful to sort y completely, since only its largest elements are involved in the determination of τ. Thus, a heap structure can be used: a heap v = (v_1, . . . , v_N) is a partially sorted sequence, such that its first element v_1 is the largest and it is fast to re-arrange (v_2, . . . , v_N) into a heap, with complexity O(log N). The complexity of arranging the elements of y into a heap is O(N). This yields Algorithm 2, given in Fig. 1, whose complexity is O(N + K log N), where K = |I| is the number of nonzero elements in the solution x.
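A sketch of Algorithm 2 in Python, using the standard library heapq module; heapq is a min-heap, so values are negated to simulate a max-heap. The exit test is written in an equivalent prefix form: testing (∑_{r≤K} u_r − a)/K ≥ u_{K+1} before popping u_{K+1} is equivalent to the test of step 2.2 at iteration K + 1.

```python
import heapq

def project_simplex_heap(y, a=1.0):
    """Algorithm 2: pop only the K largest elements, O(N + K log N)."""
    h = [-z for z in y]        # heapq is a min-heap, so store negated values
    heapq.heapify(h)           # step 1, O(N)
    s, K = 0.0, 0              # running sum and count of popped elements
    while h:
        u = -h[0]              # largest remaining element
        if K > 0 and (s - a) / K >= u:
            break              # exit test of step 2.2 (equivalent prefix form)
        heapq.heappop(h)       # step 2.4: remove it and re-heapify
        s += u
        K += 1
    tau = (s - a) / K          # step 3
    return [max(z - tau, 0.0) for z in y]
```

Only K pops, each O(log N), are performed after the O(N) heapify, matching the stated complexity.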


Another way of improving Algorithm 1 is based on the following observation: it is generally considered that the fastest sorting algorithm is quicksort, which uses partitioning with respect to an element called the pivot [1]; the pivot is chosen, and the sequence to sort is split into the two subsequences of elements smaller and larger than the pivot, which are then sorted recursively. But we can notice that τ does not depend on the ordering of the elements y_n; it depends only on the sum of the largest elements. Thus, many operations in quicksort are not useful for our aim and can be omitted. Indeed, let us consider, at the first iteration of the algorithm, partitioning y with respect to some pivot value ρ ∈ [y_min, y_max], where y_min and y_max are the minimum and maximum elements of y: we define the subsequences y_low and y_high of elements of y smaller and larger than ρ, respectively, and we set S := ∑_{y∈y_high} y − a. Then, if S/|y_high| ≥ ρ, we have τ ≥ ρ, so that we can discard the elements of y_low and continue with y_high to determine τ. On the other hand, if S/|y_high| ≤ ρ, we have τ ≤ ρ. Thus, we can discard the elements of y_high, keeping only the values S and |y_high| in memory, and continue with y_low to determine τ such that ∑_{y∈y_low} max{y − τ, 0} + S − |y_high| τ = 0.

This yields Algorithm 3, given in Fig. 1. We refer to the review paper [12] for references and discussions about this class of algorithms, which was popularized recently by the highly cited paper of Duchi et al. [7]. Before discussing the choice of the pivot, we highlight a major drawback of the algorithm given in [7], whose only difference with Algorithm 3 is that, at step 2.4., the elements of v equal to the pivot, except one, are left in v instead of being discarded. The pivot is chosen at random in v. The worst case expected complexity (averaged over all choices of the pivot) of this algorithm is not O(N) as claimed in [7], but O(N²). Indeed, expected linear time is guaranteed only if the elements of y are distinct. Since projection onto a simplex is often one operation among others in an iterative algorithm converging to a fixed point, and since sparsity of the solution is often a desirable property, it is likely that, in practice, the projection algorithm is fed with sequences in the simplex or close to it, thus containing many elements at zero. For instance, when applying the algorithm of [7] to y = (0, . . . , 0, 1), the complexity is O(N²): the algorithm iterates over sequences of size N, N − 1, and so on until 1 is picked as pivot. This issue is corrected with our Algorithm 3, which has O(N) expected complexity when the pivot is chosen at random in v, thanks to our careful handling of the elements equal to the pivot.
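A possible Python transcription of Algorithm 3 with a random pivot (an illustrative sketch; the function name is ours). The multiplicity M of the pivot value is accounted for once and then discarded, following steps 2.2 to 2.4, so repeated values such as zeros do not degrade the expected cost.

```python
import random

def project_simplex_pivot(y, a=1.0):
    """Algorithm 3 with a random pivot; expected O(N), worst case O(N^2)."""
    v = list(y)
    K, S = 0, -a                              # step 1
    while v:                                  # step 2
        rho = random.choice(v)                # step 2.1: pivot from v
        high = [z for z in v if z > rho]
        low = [z for z in v if z < rho]
        M = len(v) - len(high) - len(low)     # multiplicity of the pivot value
        S_high = sum(high) + M * rho
        if (S + S_high) / (K + M + len(high)) < rho:   # step 2.3
            S += S_high
            K += M + len(high)
            v = low
        else:                                 # step 2.4
            v = high
    tau = S / K                               # step 3
    return [max(z - tau, 0.0) for z in y]
```

Feeding it the pathological input y = (0, . . . , 0, 1) discussed in the text terminates in a constant number of partitioning rounds, since all the tied zeros are handled together.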

Now, concerning the choice of the pivot, this is the same much-discussed problem as for the sorting algorithm quicksort and the selection algorithm quickselect, which are based on partitioning as well [1]. The choice depends on whether one is ready to make use of a random number generator and accept a fluctuating running time. It also depends on whether one is ready to accept the worst case O(N²), which may come randomly with low probability or may be deliberately triggered by someone having knowledge of the implementation and feeding the algorithm with a contrived sequence y, creating a security risk. Choosing the pivot at random in the list v gives expected complexity O(N), with some variance, but worst case complexity O(N²). If the pivot is the median of v, the complexity becomes O(N), which is optimal, but a linear time median finding routine, typically based on the median of medians [4], is cumbersome to implement and slow. According to [12], a good compromise, which we adopt in Sect. 4, is to take the median of v as pivot, but to find it using the efficient algorithm of [11], whose expected complexity (not worst case) in terms of comparisons is 3N/2 + o(N).

A different algorithm, of the variable-fixing type [13, 18, 19], has been proposed by Michelot [17]. It is reproduced¹ as Algorithm 4 in Fig. 1. It can be viewed as a version of Algorithm 3, where the pivot ρ would always be known to be a lower bound of τ, so that the step 2.4. is always executed. Indeed, for every subsequence v of y, by setting ρ = (∑_{y∈v} y − a)/|v|, we have a = ∑_{y∈v} (y − ρ) ≤ ∑_{y∈v} max{y − ρ, 0} ≤ ∑_{n=1}^N max{y_n − ρ, 0}. Therefore, ρ ≤ τ. Consequently, if y ≤ ρ, we know that max{y − τ, 0} = 0 and we can discard the element y, which does not contribute to the determination of τ. By alternating between the calculation of ρ = (∑_{y∈v} y − a)/|v| and the update of the sequence v by discarding its elements smaller than or equal to ρ, the algorithm enters a steady state after a finite number of iterations (consisting of steps 2.1. and 2.2.), with ρ = τ. Algorithm 4 has several advantages: it is deterministic, very simple to implement, and independent of the ordering of the elements in y. Its complexity is observed to be linear in practice [13]. Yet, its worst case complexity is O(N²); this corresponds to the case where only one element is discarded from v at step 2.1. of every iteration. Examples of such pathological sequences can be easily constructed; one example is given in [6, end of Section 3].
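Michelot's Algorithm 4 is only a few lines in Python. This sketch (function name ours) uses the strict test y > ρ, as in the variant reproduced in Fig. 1:

```python
def project_simplex_michelot(y, a=1.0):
    """Algorithm 4: repeatedly discard elements <= rho and recompute rho."""
    v = list(y)
    rho = (sum(v) - a) / len(v)           # step 1
    while True:                           # step 2: loop while |v| changes
        w = [z for z in v if z > rho]     # step 2.1 (strict test '>')
        if len(w) == len(v):
            break                         # steady state reached: rho = tau
        v = w
        rho = (sum(v) - a) / len(v)       # step 2.2
    tau = rho                             # step 3
    return [max(z - tau, 0.0) for z in y]
```

Each pass recomputes ρ only after a full scan of v, which is exactly the behavior the proposed algorithm of the next section improves upon.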

Yet another way to look at the problem is to view the search for τ as a root finding problem [16, 9]. Let us define the function f : ρ ↦ ∑_{n=1}^N max{y_n − ρ, 0} − a. We look for τ such that f(τ) = 0, so τ is a root of f. f has the following properties: it is convex; it is piecewise linear with breakpoints at the y_n; it is strictly decreasing on (−∞, y_max] and f(ρ) = −a for every ρ ∈ [y_max, +∞). So, for any ρ ∈ R, if f(ρ) < 0, then ρ > τ, and if f(ρ) > 0, then ρ < τ. Moreover, f(y_min − a/N) ≥ 0, f(y_max − a/N) ≤ 0 and f(y_max − a) ≥ 0, so that τ ∈ [max{y_max − a, y_min − a/N}, y_max − a/N]. Thus, Algorithm 3 may be interpreted as a bracketing method and Algorithm 4 as a Newton method [6, Proposition 1] to find the root τ. The method of [16] combines features from the bisection method and the secant method. However, the proof of [16] that the bisection method has complexity O(N) is not valid: for a fixed, arbitrarily small, value δ > 0, the number of elements y_n in an interval of size δ bracketing τ may be as large as N, so that the number of bisection steps, each of complexity O(N), may be arbitrarily large. Finally, we note that projection onto the simplex is a particular case of the more general continuous quadratic knapsack problem; most methods proposed to solve this problem are based on the principles discussed in this section [12, 13] and we refer to the survey papers [18, 19] for a complete annotated list of references.
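The bracketing bounds and the sign properties of f can be checked numerically. The following sketch runs plain bisection on a small example; consistent with the caveat in this paragraph, bisection only locates τ approximately, unlike the exact finite-time algorithms of Fig. 1.

```python
def f(rho, y, a=1.0):
    """f(rho) = sum_n max(y_n - rho, 0) - a; tau is the unique root of f."""
    return sum(max(z - rho, 0.0) for z in y) - a

# Small example; bracket [lo, hi] from the bounds stated in the text.
y, a = [0.1, 0.9, 0.5], 1.0
N = len(y)
lo = max(max(y) - a, min(y) - a / N)    # f(lo) >= 0
hi = max(y) - a / N                     # f(hi) <= 0
for _ in range(80):                     # plain bisection on the bracket
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid, y, a) > 0 else (lo, mid)
tau = 0.5 * (lo + hi)                   # approximately 0.2 for this example
```

Since f is decreasing, f(mid) > 0 means mid < τ, so the lower end of the bracket moves up, and conversely.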

3 Proposed algorithm

Using the principles seen in the previous section, we are now in a position to explain the proposed algorithm, given in Fig. 2, which can be viewed as a Gauss–Seidel-like

¹ Actually, Algorithm 4 is an improvement of Michelot's algorithm, with the test “>” instead of “≥” at step 2.1. This modification has been proposed in [13, Sect. 5.7]. Algorithm 4 is also the same algorithm as in [9].


Proposed Algorithm
1. Set v := (y_1), ṽ as an empty list, ρ := y_1 − a.
2. For n = 2, . . . , N, do: if y_n > ρ,
   2.1. Set ρ := ρ + (y_n − ρ)/(|v| + 1).
   2.2. If ρ > y_n − a, add y_n to v.
   2.3. Else, add v to ṽ, set v := (y_n), ρ := y_n − a.
3. If ṽ is not empty, for every element y of ṽ, do
   3.1. If y > ρ, add y to v and set ρ := ρ + (y − ρ)/|v|.
4. Do, while |v| changes: for every element y of v, do
   4.1. If y ≤ ρ, remove y from v and set ρ := ρ + (ρ − y)/|v|.
5. Set τ := ρ, K := |v|.
6. For n = 1, . . . , N, set x_n := max{y_n − τ, 0}.

Fig. 2 Proposed algorithm to project onto the simplex ∆. The input data consists of N ≥ 1, y ∈ R^N, a > 0, and the output is the sequence x = (x_n)_{n=1}^N = P_∆(y). Incidentally, the number K at the end of the algorithm is the number of nonzero elements in x.

variation of Algorithm 4. Indeed, the lower bound ρ of τ can be updated not only after a complete scan of the sequence v, but after every element of v is read. Let us first describe a simplified version of the algorithm, without the step 3. and with the steps 2.1., 2.2. and 2.3. replaced by “2.1. Add y_n to v and set ρ := ρ + (y_n − ρ)/|v|.” The algorithm starts with the first pass (steps 1. and 2.), which does not assume any knowledge about y. Let us consider that we are currently reading the element y_n, for some 2 ≤ n ≤ N, and that we have already read the previous elements y_r, r = 1, . . . , n − 1. We have a subsequence v of (y_r)_{r=1}^{n−1} of all the elements potentially larger than τ and we maintain the variable ρ = (∑_{y∈v} y − a)/|v|. We know that ρ ≤ τ. Hence, if y_n ≤ ρ, then y_n ≤ τ. So, we can ignore this element and we do nothing. In the other case y_n > ρ, we add y_n to v, since y_n is potentially larger than τ, and ρ is assigned the new value of (∑_{y∈v} y − a)/|v|, which is strictly larger than previously. Then we continue the pass with the next element y_{n+1}. The pass is initialized with v = (y_1) and ρ = y_1 − a.

At the beginning of all the subsequent passes (step 4.), we have the subsequence v of all the elements of y potentially larger than τ. The difference with the beginning of the first pass is that we have calculated the value ρ = (∑_{y∈v} y − a)/|v|; we will use it to remove elements from v sequentially. Let us consider that we are reading the element y ∈ v. If y > ρ, we do nothing. Else, y ≤ τ, so we remove this element from v. Consequently, ρ is assigned the new value of (∑_{y∈v} y − a)/|v|, which is strictly larger than previously. Then we continue the pass with the next element of v.

The proof of correctness of this algorithm is straightforward. At the end of every pass, either at least one element has been removed from v, or v and ρ remain the same as after the previous pass. In the latter case, the elements of y which are not present in v are smaller than ρ and the elements in v (which are the |v| largest elements of y) are larger than ρ, so that a = ∑_{y∈v} (y − ρ) = ∑_{n=1}^N max{y_n − ρ, 0}. Thus, from Proposition 2.2, ρ = τ.
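A direct Python transcription of the full algorithm of Fig. 2 may look as follows (an illustrative sketch, not the author's optimized C code; the in-place removal in step 4 is done by swapping with the last element, so the swapped-in element is re-examined before the index advances):

```python
def project_simplex_condat(y, a=1.0):
    """Proposed algorithm: Gauss-Seidel-like variant of Michelot's method."""
    v = [y[0]]
    v_tilde = []
    rho = y[0] - a                              # step 1
    for yn in y[1:]:                            # step 2: first pass
        if yn > rho:
            rho += (yn - rho) / (len(v) + 1)    # step 2.1: running mean update
            if rho > yn - a:
                v.append(yn)                    # step 2.2
            else:
                v_tilde.extend(v)               # step 2.3: restart from yn alone
                v = [yn]
                rho = yn - a
    for z in v_tilde:                           # step 3: cleanup pass
        if z > rho:
            v.append(z)
            rho += (z - rho) / len(v)
    changed = True                              # step 4: Michelot-like passes,
    while changed:                              # with rho updated per element
        changed = False
        i = 0
        while i < len(v):
            z = v[i]
            if z <= rho:
                v[i] = v[-1]                    # swap with last and pop
                v.pop()
                rho += (rho - z) / len(v)       # step 4.1
                changed = True
            else:
                i += 1
    tau = rho                                   # step 5
    return [max(yn - tau, 0.0) for yn in y]    # step 6
```

As noted below in the text, a decreasing-sorted input finishes after the first pass, since ρ already equals τ then.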

The first pass of the proposed algorithm, as given in Fig. 2, contains a refinement with respect to the algorithm just described. Every pass aims at calculating the best


Table 1 Complexity of the algorithms to project onto the simplex, with respect to the length N of the data and the number K of nonzero elements in the solution. For Algorithm 1, quicksort and a random pivot are assumed. For Algorithm 3 with the median pivot, a median finding routine with worst case complexity O(N) (and not O(N²) as in the implementation evaluated in Sect. 4) is assumed.

                        worst case       expected        observed
                        complexity       complexity      in practice
Algorithm 1             O(N²)            O(N log N)      O(N log N)
Algorithm 2             O(N + K log N)   —               O(N + K log N)
Alg. 3, random pivot    O(N²)            O(N)            O(N)
Alg. 3, median pivot    O(N)             —               O(N)
Algorithm 4             O(N²)            —               O(N)
Proposed Algorithm      O(N²)            —               O(N)

possible lower bound ρ of τ. And for every n, y_n − a ≤ τ. So, when reading y_n, if y_n − a is larger than the current value of ρ, we set ρ := y_n − a. But then a cleanup pass (step 3.) is necessary after the first pass, to restore the invariant properties that ρ = (∑_{y∈v} y − a)/|v| and that v contains all the elements of y larger than ρ.

We end this section with a few comments on the complexity of the proposed algorithm. It is always faster than Algorithm 4, since after every pass, more elements of v are removed. Contrary to Algorithm 4, its complexity depends on the ordering of the elements in y. In the most favorable case where y is sorted in decreasing order, the complexity is O(N), since at the end of the first pass, which behaves like step 2. of Algorithm 1, we have ρ = τ. The complexity is also O(N) if y is sorted in increasing order, since ρ = τ after the second pass. However, the worst case complexity is O(N²); as for Algorithm 4, an adversary sequence can be easily constructed.

4 Comparison of the algorithms

The complexity of the algorithms is summarized in Tab. 1. In Tab. 2, for one example, the number of elements not yet characterized as active or inactive is shown, after every pass of the iterative algorithms. This demonstrates the efficiency of the selection process of the proposed algorithm.

All the algorithms were implemented in C and carefully optimized. Attention was paid to numerical robustness as well, with all the averaged quantities like (∑_{y∈v} y − a)/|v| computed using the Welford–Knuth running mean algorithm [14]. The code is freely available on the website of the author. Note that we implemented the algorithms to maximize the execution speed, without taking care of the memory usage. For instance, we implemented Algorithm 3 with two auxiliary buffers of size N, to store the elements lower and greater than the pivot. Using only one buffer of size N would be possible, performing partitioning using swaps to put the elements lower and greater than the pivot in the first and second parts of the buffer, respectively (see the description of Program 6 in [1] for details). This would slow down the algorithm, however. The proposed algorithm only requires one memory buffer of size N to store the sequence v, which is updated in place. As a parenthesis, we remark that one could easily modify Algorithm 4 and the proposed algorithm so that they do not require any auxiliary memory buffer, except the one to store y, replaced by x in place at the end


Table 2 Number |v| of elements in the sequence v after every pass of the algorithms, for one example with N = 10⁶. y has i.i.d. Gaussian random elements of mean 1/N and std. dev. 0.1.

Pass    Alg. 3,        Alg. 3,        Algorithm 4   Proposed
number  random pivot   median pivot                 Algorithm
1       575147         499999         499787        8145
2       85926          249999         212149        1622
3       15811          124999         85994         359
4       2049           62499          33840         107
5       2013           31249          13189         56
6       997            15624          5123          53
7       709            7811           1993          53
8       435            3905           785
9       26             1952           306
10      14             975            133
11      12             487            71
12      7              243            55
13      5              121            53
14      2              60             53
15      0              30
16                     15
17                     7
18                     3
19                     1
20                     0

of the algorithm; this would require reading the entire sequence y at every pass of the algorithm.
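The running-mean update mentioned here is the incremental form m_k = m_{k−1} + (v_k − m_{k−1})/k, which is also exactly how ρ is maintained in steps 2.1 and 4.1 of the proposed algorithm. A minimal sketch (function name ours):

```python
def running_mean(values):
    """Incremental (Welford-style) mean: m_k = m_{k-1} + (v_k - m_{k-1}) / k.
    This avoids forming a large intermediate sum whose magnitude can dwarf
    the increments and lose precision; rho = (sum(v) - a)/|v| can be
    maintained the same way when an element enters or leaves v."""
    m = 0.0
    for k, v in enumerate(values, start=1):
        m += (v - m) / k
    return m
```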

When implementing projection onto the ℓ1 ball, according to Proposition 2.1, the naive approach consists of doing a first pass to compute |y| and to decide whether y is inside the ℓ1 ball. With the proposed algorithm, this pass can be fused with its first pass (steps 1. and 2.), replacing y_n by |y_n| everywhere. Then, after step 2., a simple test is performed: if ρ ≤ 0, y is inside the ℓ1 ball, so the algorithm sets x = y and terminates. Else, y is outside the ℓ1 ball and the algorithm continues with step 3.; the signs of the y_n are applied to the x_n at step 6.

The code was run on an Apple MacBook Pro laptop under OS 10.9.4 with a 2.3 GHz Intel Core i7 CPU and 8 GB RAM. The computation times for projecting onto the simplex, reported in Tab. 3, show that the proposed algorithm performs globally the best, except in the particular case, of limited practical interest, where y is maximally sparse and exactly on the simplex. In Tab. 4, for projection onto the ℓ1 ball, the proposed algorithm is compared to the improved bisection algorithm (IBIS) of Liu and Ye [16]; we used the C code of the authors (function eplb of their package SLEP), available online at http://www.public.asu.edu/∼jye02/Software/SLEP/. As a result, the proposed algorithm is more than twice as fast as IBIS for large N.

5 Concluding remarks

We have provided a synthetic overview of the available algorithms to project onto the simplex or theℓ1ball and we have proposed a new and faster algorithm. In some practical applications, the vector to project is of small length N and the projection


Table 3 Computation times in seconds for projecting y onto the unit simplex (a = 1), averaged over 10² (for N = 10⁶) and 10⁴ (for N = 10³ and N = 20) realizations (the number in parentheses is the std. dev.). Experiments 1 and 2 correspond to the y_n being i.i.d. random Gaussian numbers of mean a/N and std. dev. 1 and 10⁻³, respectively. Experiment 3 corresponds to the y_n being i.i.d. random Gaussian numbers of mean 0 and std. dev. 10⁻³, except one element at a random position, which is a random Gaussian number of mean a and std. dev. 10⁻³. Experiment 4 corresponds to the y_n being 0, except one element equal to a at a random position. So, y is not on the simplex for Experiments 1–3 and is on the simplex for Experiment 4. In all cases, the best time is in bold.

Experiment 1
                        N = 10⁶ (K ≈ 6)     N = 10³ (K ≈ 4)     N = 20 (K ≈ 3)
Algorithm 1             1.1e-1 (8e-4)       7.4e-5 (5e-6)       1.4e-6 (6e-7)
Algorithm 2             1.1e-2 (6e-4)       9.1e-6 (1e-6)       6.6e-7 (7e-7)
Alg. of Duchi et al.    1.0e-2 (5e-3)       1.1e-5 (6e-6)       8.8e-7 (8e-7)
Alg. 3, random pivot    1.0e-2 (5e-3)       1.1e-5 (6e-6)       9.5e-7 (7e-7)
Alg. 3, median pivot    2.1e-2 (4e-4)       2.7e-5 (3e-6)       9.9e-7 (7e-7)
Algorithm 4             1.8e-2 (7e-4)       1.8e-5 (6e-6)       8.2e-7 (7e-7)
Proposed Algorithm      1.8e-3 (2e-5)       1.8e-6 (7e-7)       5.5e-7 (7e-7)

Experiment 2
                        N = 10⁶ (K ≈ 3282)  N = 10³ (K ≈ 816)   N = 20 (K ≈ 20)
Algorithm 1             1.1e-1 (1e0)        8.0e-5 (5e-6)       1.6e-6 (7e-7)
Algorithm 2             1.2e-2 (9e-5)       5.6e-5 (4e0)        9.7e-7 (7e-7)
Alg. of Duchi et al.    1.3e-2 (7e-3)       1.6e-5 (5e-6)       7.7e-7 (7e-7)
Alg. 3, random pivot    1.3e-2 (6e-3)       1.6e-5 (4e-6)       8.2e-7 (7e-7)
Alg. 3, median pivot    2.1e-2 (8e-4)       2.5e-5 (2e-6)       1.0e-6 (7e-7)
Algorithm 4             1.8e-2 (2e-4)       3.2e-5 (3e-6)       6.3e-7 (7e-7)
Proposed Algorithm      3.7e-3 (5e-5)       1.5e-5 (1e-6)       6.1e-7 (7e-7)

Experiment 3
                        N = 10⁶ (K ≈ 21)    N = 10³ (K ≈ 9)     N = 20 (K ≈ 4)
Algorithm 1             1.1e-1 (2e-3)       7.4e-5 (5e-6)       1.4e-6 (6e-7)
Algorithm 2             1.1e-2 (3e-4)       9.6e-6 (1e-6)       7.0e-7 (7e-7)
Alg. of Duchi et al.    1.0e-2 (6e-3)       1.2e-5 (6e-6)       8.9e-7 (7e-7)
Alg. 3, random pivot    1.0e-2 (5e-3)       1.1e-5 (6e-6)       9.6e-7 (7e-7)
Alg. 3, median pivot    2.1e-2 (4e-4)       2.7e-5 (2e-6)       9.9e-7 (7e-7)
Algorithm 4             1.8e-2 (2e-4)       1.8e-5 (2e-6)       7.9e-7 (7e-7)
Proposed Algorithm      3.5e-3 (2e-4)       1.1e-5 (6e-6)       6.9e-7 (7e-7)

Experiment 4
                        N = 10⁶ (K = 1)     N = 10³ (K = 1)     N = 20 (K = 1)
Algorithm 1             2.9e-2 (5e-4)       1.8e-5 (2e-6)       8.8e-7 (7e-7)
Algorithm 2             3.1e-3 (2e-4)       2.1e-6 (8e-7)       5.5e-7 (6e-7)
Alg. of Duchi et al.    1.1e+2 (4e+2) (!)   2.1e-5 (2e-4)       1.0e-6 (9e-7)
Alg. 3, random pivot    2.8e-3 (1e-4)       2.4e-6 (7e-7)       5.2e-7 (6e-7)
Alg. 3, median pivot    1.4e-2 (5e-4)       1.3e-5 (2e-6)       8.0e-7 (7e-7)
Algorithm 4             1.6e-2 (2e-3)       1.8e-5 (1e-6)       7.6e-7 (7e-7)
Proposed Algorithm      7.4e-3 (3e-3)       6.9e-6 (3e-6)       6.3e-7 (7e-7)

is one operation among others, some of which have complexity O(N²), like dense matrix-vector products; in such a case, the cost of the projection is negligible and all the projection algorithms are equally valid choices. By contrast, for large-scale problems like learning or classification over large dictionaries and in imaging, N is of order 10⁶ or more; in such a case, all the operations have complexity O(N) or O(N log N) and a


Table 4 Computation times in seconds for projecting y onto the unit ℓ1 ball (a = 1), averaged over 10² (for N = 10⁶) and 10⁴ (for N = 10³ and N = 20) realizations (the number in parentheses is the std. dev.). The y_n are i.i.d. random Gaussian numbers of zero mean and std. dev. 0.1. y was always outside the ℓ1 ball. In all cases, the best time is in bold.

                     N = 10⁶ (K ≈ 22)    N = 10³ (K ≈ 10)    N = 20 (K ≈ 5)
Algorithm IBIS       1.3e-2 (5e-4)       1.6e-5 (1e-6)       9.8e-7 (7e-7)
Proposed Algorithm   6.7e-3 (1e-4)       1.0e-5 (1e-6)       7.7e-7 (7e-7)

50x speedup of the proposed algorithm with respect to the naive sort-based algorithm does make a big difference.

We can note that the proposed algorithm can be used for matrix approximation problems, by reasoning on their eigenvalues or singular values, e.g. to project a symmetric matrix onto the spectahedron, which is the set of positive semidefinite matrices of unit trace and can be seen as a natural generalization of the unit simplex to symmetric matrices.
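As an illustrative sketch of this remark (assuming NumPy; function names ours), projecting onto the spectahedron reduces to projecting the vector of eigenvalues onto the unit simplex and reassembling the matrix with the same eigenvectors. Any of the exact simplex projections would do as the inner routine; a sort-based one is used here for brevity.

```python
import numpy as np

def project_simplex(w, a=1.0):
    """Sort-based projection of a vector onto {x >= 0, sum(x) = a}."""
    u = np.sort(w)[::-1]                         # decreasing order
    css = np.cumsum(u) - a
    k = np.arange(1, w.size + 1)
    K = np.nonzero(u - css / k > 0)[0][-1] + 1   # largest k passing the test
    tau = css[K - 1] / K
    return np.maximum(w - tau, 0.0)

def project_spectahedron(S):
    """Project a symmetric matrix S onto {X : X PSD, trace(X) = 1}:
    keep the eigenvectors, project the eigenvalues onto the unit simplex."""
    w, Q = np.linalg.eigh(S)                     # S = Q diag(w) Q^T
    w_proj = project_simplex(w, 1.0)
    return (Q * w_proj) @ Q.T                    # reassemble with new eigenvalues
```

For instance, the diagonal matrix diag(2, −1) projects to diag(1, 0): a positive semidefinite matrix of unit trace.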

Future work will be focused on fast algorithms for optimization problems involving the simplex or the ℓ1 ball. This includes inverse imaging problems regularized by a constraint on the total variation seminorm [8] and the computation of abundance maps in multispectral unmixing [2].

References

1. Bentley, J.L., McIlroy, M.D.: Engineering a sort function. Software—Practice & Experience 23(11), 1249–1265 (1993)

2. Bioucas-Dias, J.M., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., Chanussot, J.: Hyperspectral unmixing overview: Geometrical, statistical, and sparse regression-based approaches. IEEE J. Sel. Topics Appl. Earth Observations Remote Sens. 5(2), 354–379 (2012)

3. Blondel, M., Fujino, A., Ueda, N.: Large-scale multiclass support vector machine training via Euclidean projection onto the simplex. In: Proc. of the 22nd Int. Conf. on Pattern Recognition (ICPR), pp. 1289–1294 (2014)

4. Blum, M., Floyd, R.W., Pratt, V.R., Rivest, R.L., Tarjan, R.E.: Time bounds for selection. J. Computer and System Sciences 7(4), 448–461 (1973)

5. Brodie, J., Daubechies, I., De Mol, C., Giannone, D., Loris, I.: Sparse and stable Markowitz portfolios. Proc. Nat. Acad. Sci. 106(30), 12,267–12,272 (2009)

6. Cominetti, R., Mascarenhas, W.F., Silva, P.J.S.: A Newton’s method for the continuous quadratic knapsack problem. Math. Prog. Comp. 6, 151–169 (2014)

7. Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the ℓ1-ball for learning in high dimensions. In: Proc. of the 25th Int. Conf. on Machine Learning (ICML) (2008)

8. Fadili, J., Peyré, G.: Total variation projection with first order schemes. IEEE Trans. Image Processing 20(3), 657–669 (2011)

9. Gong, P., Gai, K., Zhang, C.: Efficient Euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing 74, 2754–2766 (2011)

10. Held, M., Wolfe, P., Crowder, H.: Validation of subgradient optimization. Mathematical Programming 6, 62–88 (1974)

11. Kiwiel, K.C.: On Floyd and Rivest's SELECT algorithm. Theoretical Computer Science 347, 214–238 (2005)

12. Kiwiel, K.C.: Breakpoint searching algorithms for the continuous quadratic knapsack problem. Math. Program., Ser. A 112, 473–491 (2008)

13. Kiwiel, K.C.: Variable fixing algorithms for the continuous quadratic knapsack problem. J. Optim. Theory Appl. 136, 445–458 (2008)

14. Knuth, D.E.: The Art of Computer Programming, vol. 2, p. 232, 3rd edn. Addison-Wesley, Boston (1998)


15. Lellmann, J., Kappes, J.H., Yuan, J., Becker, F., Schnörr, C.: Convex multi-class image labeling by simplex-constrained total variation. In: Proc. of Scale Space and Variational Methods in Computer Vision (SSVM), vol. 5567, pp. 150–162 (2009)

16. Liu, J., Ye, J.: Efficient Euclidean projections in linear time. In: Proc. of the 26th Int. Conf. Machine Learning (ICML) (2009)

17. Michelot, C.: A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J. Optim. Theory Appl. 50(1), 195–200 (1986)

18. Patriksson, M.: A survey on the continuous nonlinear resource allocation problem. European Journal of Operational Research 185, 1–46 (2008)

19. Patriksson, M., Strömberg, C.: Algorithms for the continuous nonlinear resource allocation problem—new implementations and numerical studies. European Journal of Operational Research 243(3), 703–722 (2015)

20. van den Berg, E., Friedlander, M.P.: Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput. 31, 890–912 (2008)
