Self concordant Perceptron for exact computation in linear feasibility

HAL Id: hal-02399129

https://hal.archives-ouvertes.fr/hal-02399129v12

Preprint submitted on 23 May 2021 (v12), last revised 14 Dec 2021 (v15)

Adrien Chan-Hon-Tong

To cite this version:

Adrien Chan-Hon-Tong. Self concordant Perceptron for exact computation in linear feasibility. 2021.

Adrien CHAN-HON-TONG May 23, 2021

Abstract

This paper offers a new polynomial algorithm for linear feasibility.

Although the arithmetic complexity of this algorithm is slightly higher than that of the best state-of-the-art path-following algorithms, the offered algorithm is well suited to exact computation, as a simple transformation allows it to handle integer matrices.

All bounds required for establishing the complexity have simple explicit values (except for the number of steps, which is classically linked with a sub determinant of the input matrix). In particular, the rounding process is explicit.

1 Purpose

Linear programming is a central optimization problem. Today, the state-of-the-art algorithms are central-path log-barrier [10] and/or path-following [12] algorithms, which solve a linear program with $M$ variables and constraints and total binary size $L$ in less than $\tilde{O}(\sqrt{M} L)$ Newton steps. As each Newton step is mainly the resolution of an $M \times M$ linear system, the arithmetic time complexity of those algorithms is $\tilde{O}(M^{\omega} \sqrt{M} L)$, where $\omega$ is the exponent of matrix multiplication (3 with the simple algorithm but 2.38 with [1]). There are even faster randomized algorithms like [5], but they are outside the scope of this paper, which is about exact computation.

There are also algorithms with higher complexity but with interesting features. [11] requires $\tilde{O}(M^2 \sqrt{M} L)$ Perceptron steps, and thus $\tilde{O}(M^4 \sqrt{M} L)$ arithmetic operations. But it is related to the Perceptron [13] and, in particular, does not rely on matrix inversion (contrary to Newton based algorithms). Another one is [4]: it requires $\tilde{O}(M^4 L)$ arithmetic operations but does not require matrix inversion, and it is linked with other algorithms which are strongly polynomial on special linear program families [3].

This paper aims to offer a new algorithm which sits between those two groups of algorithms. This algorithm links the Perceptron and self concordance theory, and it requires $M L$ Newton steps (i.e. $M^{\omega} M L$ arithmetic operations). Thus, at first glance, it is slower than [10, 12] and requires matrix inversion contrary to [11, 4]. Yet, it deals straightforwardly with integer matrices.

This last point is important for exact computation, because matrix inversion can be done in $\tilde{O}(M^{\omega})$ arithmetic operations, i.e. by counting each operation on $\mathbb{Z}$ or even on $\mathbb{Q}$ as 1 operation. Now, from an exact computation perspective, one has to count binary operations. In this case, the binary complexity depends on how large the matrices and vectors handled during the computation are.

Typically, most state-of-the-art methods rely on scaling, i.e. at some point in the algorithm there is an operation like $A = A(I + xx^T)$ in [11], or $\mathrm{column}(A, k) = \frac{1}{2} \times \mathrm{column}(A, k)$ in [4], or $\mu = \frac{1}{2} \times \mu$ in [2] (a classical implementation of central path). This means that the binary size of the matrix manipulated during the algorithm increases quickly. Potentially, the total binary size becomes much larger than the initial one, in particular as the number of steps is larger than $L$. Indeed, if one variable is scaled by a factor 2 at each of the $\sqrt{M} L$ steps, then the final value is scaled by $2^{\sqrt{M} L}$, so its binary size grows by $\sqrt{M} L$, which is larger than the initial total binary size $L$ of the matrix.

Currently, only a careful path-following implementation seems to maintain a frozen binary size $L$, because the scaling is $(1 + \frac{1}{\sqrt{M}})$ [12] (as $(1 + \frac{1}{\sqrt{M}})^{\sqrt{M} L} \approx 2^{\log(e) L}$, this leads to a binary size of $O(L)$). However, [12] admits that this algorithm requires a precise rounding process depending on several hidden constants.

So, although it is possible to get an exact algorithm for linear programming with arithmetic complexity $\tilde{O}(M^{\omega} \sqrt{M} L)$ and controlled binary size, there is an interest in presenting an algorithm with a simpler rounding scheme like [8] (or even without matrix inversion like [11]).

Thus, the offered algorithm is between [11] and [12]: faster than [11] (but requiring heavy linear algebra), and simpler than [12] (but with an extra $\sqrt{M}$ factor in the number of iterations). Currently, it seems even simpler than [8], which has the same $\tilde{O}(M^{\omega} M L)$ complexity. This situation is summarized by Table 1.

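| Algorithm | $\mathbb{Q}$ time complexity | inversion | $L$ binary size | easy rounding |
|---|---|---|---|---|
| [13] | exponential | **no** | no | no |
| [7] | exponential | yes | **yes** | **yes** |
| [11] | $\tilde{O}(M^2 N^2 \sqrt{M} L)$ | **no** | no | probably |
| [4] | $\tilde{O}(N^4 L)$ | **no** | **yes** | **yes** |
| this | $\tilde{O}(M^{\omega} M L)$ | yes | **yes** | **yes** |
| [2] | $\tilde{O}(N^{\omega} \sqrt{N} L)$ | yes | no | no |
| [12] | $\tilde{O}(N^{\omega} \sqrt{N} L)$ | yes | **yes** | no |

Bold highlights good features, such as not depending on matrix inversion (hard to implement). $\omega$ is the exponent of matrix multiplication.

Table 1: Comparison of self concordant Perceptron with state of the art.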

2 Sketch of the algorithm

This section proves arithmetic convergence of the algorithm. Section 3 will focus on the rounding process.

2.1 Underlying theories

First, the offered algorithm works with linear programs in the form of linear feasibility:

Finding $x$ such that $Ax > 0$ for a given matrix $A \in \mathbb{Z}^{M \times N}$, with the prior that one solution to this set of strict inequalities exists.

This is not a limitation for exact computation, as any linear program can be encoded in a linear feasibility instance (with the assumption of the existence of a solution) with the same binary size and number of variables (see appendix).

Then, the algorithm relies on self concordance theory (see [9] for a complete presentation): if $G$ is a self concordant function (mainly a sum of quadratic, linear, constant and $-\log$ terms) with a minimum $G^*$, then Newton descent starting from $x_{start}$ finds $x$ such that $G(x) - G^* \le \varepsilon$ in $\tilde{O}(G(x_{start}) - G^* + \log\log(\frac{1}{\varepsilon}))$ damped Newton steps. Each step consists in $x = x - \frac{1}{1 + \lambda_G(x)} (\nabla_x^2 G)^{-1} (\nabla_x G)$ with $\lambda_G(x) = \sqrt{(\nabla_x G)^T (\nabla_x^2 G)^{-1} (\nabla_x G)}$. Precisely,

• while $\lambda(x) \ge \frac{1}{4}$, each damped Newton step decreases $G$ by at least $\frac{1}{4} - \log(\frac{5}{4}) \ge \frac{1}{50}$; thus this so called phase 1 cannot last more than $50 \times (G(x_{start}) - G^*)$ damped Newton steps.

• as soon as one has $x_{phase}$ such that $\lambda(x_{phase}) \le \frac{1}{4}$, then $\tilde{O}(\log\log(\frac{1}{\varepsilon}))$ additional steps are required to get $x$ such that $G(x) - G^* \le \varepsilon$ (this is the so called phase 2 with quadratic convergence).
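The damped Newton iteration described above can be illustrated on a one dimensional self concordant function. The sketch below is only an illustration under stated assumptions: the test function $g(t) = \frac{t^2}{2c} - \log(t)$, the constant `c`, the tolerance and the function name `damped_newton_1d` are choices made here, and floating point is used instead of exact arithmetic.

```python
import math

def damped_newton_1d(c, t, tol=1e-9, max_steps=10_000):
    """Minimise g(t) = t^2 / (2c) - log(t) over t > 0 with damped Newton steps.

    Hypothetical helper for illustration; the minimiser is sqrt(c).
    """
    steps = 0
    while steps < max_steps:
        grad = t / c - 1.0 / t              # g'(t)
        hess = 1.0 / c + 1.0 / (t * t)      # g''(t) > 0
        lam = abs(grad) / math.sqrt(hess)   # Newton decrement lambda(t)
        if lam <= tol:
            break
        # Damped step: the factor 1 / (1 + lambda) keeps t strictly positive.
        t -= grad / (hess * (1.0 + lam))
        steps += 1
    return t, steps

t_min, n_steps = damped_newton_1d(c=9.0, t=1e-3)
```

Starting far from the minimiser, the first iterations behave like phase 1 (steady decrease with a large decrement), and the last few show the fast phase 2 convergence.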

2.2 Self concordant Perceptron

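Main definition: $\forall A \in \mathbb{Q}^{M \times N}$, let us introduce the self concordant function:

$$F_A(v) = \frac{v^T A A^T v}{2} - \mathbf{1}^T \log(v) = \frac{1}{2} \sum_{i,j=1}^{M} v_i v_j \times A_i A_j^T - \sum_{m=1}^{M} \log(v_m)$$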

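Main theorem: For all linear feasibility instances $A \in \mathbb{Q}^{M \times N}$ with $A_m A_m^T = 1$:

• $\exists x_A^* \; / \; A x_A^* \ge \mathbf{1}$ (by definition)

• $F_A(\frac{1}{M} \mathbf{1}) \le 1 + M \log(M)$

• $F_A$ has a minimum (written $F_A^*$) with $-F_A^* \le M \log(x_A^{*T} x_A^*)$

• for all $v$, $F_A(v) - F_A^* \le \frac{1}{2 M x_A^{*T} x_A^* + 2} \Rightarrow A A^T v > 0$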
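As a small numerical illustration of the definition and of the second assertion, the sketch below evaluates $F_A$ on a tiny instance with normalized rows and compares $F_A(\frac{1}{M}\mathbf{1})$ with $1 + M\log(M)$. The matrix, the helper names `F` and `normalise_rows`, and the use of floating point are assumptions made for this example only.

```python
import math

def F(A, v):
    """F_A(v) = 0.5 * ||A^T v||^2 - sum_m log(v_m), for a list of rows A and v > 0."""
    n_cols = len(A[0])
    atv = [sum(v[m] * A[m][j] for m in range(len(A))) for j in range(n_cols)]
    return 0.5 * sum(c * c for c in atv) - sum(math.log(vm) for vm in v)

def normalise_rows(A):
    """Scale every row to unit Euclidean norm, so that A_m A_m^T = 1."""
    return [[a / math.sqrt(sum(x * x for x in row)) for a in row] for row in A]

# Hypothetical feasible instance: x = (0, 1) gives A x > 0.
A = normalise_rows([[5.0, 1.0], [-1.0, 1.0], [0.0, 2.0]])
M = len(A)
start_value = F(A, [1.0 / M] * M)     # F_A evaluated at (1/M) * 1
bound = 1.0 + M * math.log(M)         # bound from the main theorem
```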

Important remark: In linear feasibility, not all $A$ can be encountered; yet any linear program can be encoded into a linear feasibility instance with an existing $x_A^*$. Indeed, if there is no $x_A^*$ such that $A x_A^* \ge \mathbf{1}$, then there exists $y \ge 0$, $y \ne 0$ such that $A^T y = 0$, and $F_A$ is not bounded (it can go to $-\infty$). But if the instance is a linear feasibility one, then $x_A^*$ exists, and this implies the existence of a minimum, as the log term maintains $v > 0$ and Cauchy bounds $v$ for a constant $v^T A A^T v$ ($\frac{v^T v}{x_A^{*T} x_A^*} \le \frac{(\mathbf{1}^T v)^2}{x_A^{*T} x_A^*} \le v^T A A^T v$, from $x_A^{*T} (A^T v) \ge \mathbf{1}^T v$).

Notation: $F_A$ and $x_A^*$ will be written $F$ and $x$ when it is not ambiguous.

Trivial corollary: self concordance theory applied to the main theorem directly states that $\tilde{O}(M \log(x^T x) + M \log(M) + \log\log(2 M x^T x)) = \tilde{O}(M \log(x^T x))$ damped Newton steps (starting from $\frac{1}{M} \mathbf{1}$) are enough to compute a solution of the linear feasibility instance.

Then, classical results link $x$ with a vertex of the polytope $\{z \; / \; Az \ge \mathbf{1}\}$, which implies that $x$ is linked with a sub determinant of $A$ and thus $\log(x^T x) = \tilde{O}(L)$, where $L$ is the total binary size of the matrix $A$ before normalization (the next section details this point precisely).

Let us stress that in this algorithm the so called second phase is negligible because $\log(L)$ is not an issue; thus all the optimization is done during the so called first phase. This is the main difference with [8], and it explains why the self concordant Perceptron has better binary properties.

Indeed, rounding during the second phase seems harder than during the first phase, yet rounding is not even required when there are only $\log(L)$ second phase steps. Thus this paper focuses on the rounding process of the first phase, which is simple.

Finally, let us stress that $x_A^*$ is the solution of the support vector machine problem on $A$, which exists as $A$ is linearly separable by assumption (recall that this does not restrict generality: any linear program can be encoded into such a linear feasibility instance with an existing solution, see appendix). Thus, $x_A^{*T} x_A^*$ is the norm of the support vector solution, which is the inverse of the margin $\rho$ in [11]. Thus, this algorithm is easily comparable with [11]: [11] requires $M^2 \sqrt{M} \log(\frac{1}{\rho})$ Perceptron steps (each of which requires computing $A \times$ current point), i.e. $M^4 \sqrt{M} L$ arithmetic operations, while the self concordant Perceptron requires $M \log(\frac{1}{\rho})$ Newton steps, i.e. $M^{\omega} M \log(\frac{1}{\rho})$ arithmetic operations. So the self concordant Perceptron is much faster, but with the drawback of requiring heavy linear algebra.

2.3 Proof of the claim

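2.3.1 Existence of a minimum

First, $F(\frac{1}{M} \mathbf{1}) = \frac{1}{2M^2} \sum_{i,j} A_i A_j^T + M \log(M) \le 1 + M \log(M)$ (because $A$ is normalized and, by Cauchy, $A_i A_j^T \le \sqrt{A_i A_i^T \times A_j A_j^T} = 1$).

Then, from Cauchy, $(A^T v)^T x \le \sqrt{v^T A A^T v \times x^T x}$. But $(A^T v)^T x = v^T (Ax)$.

And, by definition, $Ax \ge \mathbf{1}$ (it is unknown but the assumption is that it exists). Injecting this last inequality is interesting as $v \ge 0$: as each $A_m x \ge 1$, a fortiori $A_m x > 0$, and $v_m > 0$, so $v_m A_m x > 0$. Even more, $\forall v \ge 0$, $0 \le v^T \mathbf{1} \le v^T (Ax) = (A^T v)^T x \le \sqrt{v^T A A^T v \times x^T x}$. One can even take the square (because both sides are positive): $\forall v \ge 0$, $\frac{(v^T \mathbf{1})^2}{x^T x} \le v^T A A^T v$. And, independently, $(v^T \mathbf{1})^2 \ge v^T v$ because $v \ge 0$. So $\forall v \ge 0$, $\frac{v^T v}{x^T x} \le v^T A A^T v$.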

Let us introduce the 1D function $f(t) = \frac{t^2}{2 x^T x} - \log(t)$; from the previous inequality it holds that $F(v) \ge \sum_m f(v_m)$.

Now, $f$ is a single variable function which goes to infinity when $t$ goes to 0 ($t^2 \to 0$ but $-\log(t) \to \infty$) or to infinity ($t^2$ grows faster than $\log(t)$). So $f$ has a minimum, and so does $F$. Let us call them $f^*$ and $F^*$.

As $f$ and $F$ are smooth, the minimums are characterized by a null derivative or gradient. $f'(t) = \frac{t}{x^T x} - \frac{1}{t}$, so $f'(\sqrt{x^T x}) = 0$, so $f^* = f(\sqrt{x^T x}) = \frac{1}{2} - \frac{1}{2} \log(x^T x) \ge -\log(x^T x)$. Thus, the minimum of $F$ verifies $F^* \ge M f^* \ge -M \log(x^T x)$.

So the second and third assertions of the main theorem are proven (the first one holds by definition).

2.3.2 Normalization, linearization and lemmas

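Independently, let us remark that $\theta(t) = F(tv) = \frac{v^T A A^T v}{2} t^2 - \mathbf{1}^T \log(v) - M \log(t)$ is minimal when $t^2 \, v^T A A^T v = M$, i.e. when the rescaled point $tv$ satisfies $(tv)^T A A^T (tv) = M$. So for any $w$, one can build a $v = \mu w$ such that $v^T A A^T v = M$ and $F(v) \le F(w)$. In other words, it holds that $F\left(\sqrt{\frac{M}{v^T A A^T v}} \, v\right) \le F(v)$.

So, let us consider $v \ge 0$ such that $v^T A A^T v = M$. As $v^T A A^T v \ge \frac{(\mathbf{1}^T v)^2}{x^T x}$, no $v_m$ can be higher than $\sqrt{M x^T x}$, i.e. $0 \le v \le \sqrt{M x^T x} \, \mathbf{1}$.

Let us also remark that $F(v + w) = \frac{v^T A A^T v}{2} + \frac{w^T A A^T w}{2} + w^T A A^T v - \mathbf{1}^T \log(v) - \mathbf{1}^T \log(1 + \frac{w}{v}) = F(v) + \frac{w^T A A^T w}{2} + w^T A A^T v - \mathbf{1}^T \log(1 + \frac{w}{v})$, where $\frac{w}{v}$ is taken coordinate by coordinate.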

Finally, let consider the following lemmas from basic analysis:

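1. $\varphi(t) = \frac{1}{2} \alpha t^2 - \log(1 + t) \le \frac{1}{2} (\alpha + 1) t^2 - t = \psi(t)$ for $t \ge 0$

2. $\psi(\frac{1}{\alpha + 1}) \le -\frac{1}{2\alpha + 2}$

3. $\varphi(\frac{1}{\alpha + 1}) \le -\frac{1}{2\alpha + 2}$, i.e. $\forall \alpha \ge 0$, $\frac{1}{2} \frac{\alpha}{(\alpha + 1)^2} - \log(1 + \frac{1}{\alpha + 1}) \le -\frac{1}{2\alpha + 2}$

Lemma 1: $\psi'(t) - \varphi'(t) = (\alpha + 1) t - 1 - \alpha t + \frac{1}{1 + t} = t - 1 + \frac{1}{1 + t} = \frac{t^2}{1 + t} > 0$ for $t > 0$, so $\psi(t) - \varphi(t)$ always increases. But $\psi(0) = \varphi(0) = 0$, so $\psi(t) \ge \varphi(t)$ for $t \ge 0$. Lemma 2: $\psi(\frac{1}{\alpha + 1}) = \frac{1}{2} (\alpha + 1) \frac{1}{(\alpha + 1)^2} - \frac{1}{\alpha + 1} = -\frac{1}{2\alpha + 2}$. Lemma 3 is just lemma 1 combined with lemma 2.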

2.3.3 Convergence

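Now, either $A A^T v > 0$ (problem solved) or there exists $k$ such that $A_k A^T v \le 0$.

Let us consider this case, $A_k A^T v \le 0$ and $v^T A A^T v = M$, and let us introduce $w = v + \frac{v_k}{v_k^2 + 1} \mathbf{1}_k$.

Then $F(w) = F(v + \frac{v_k}{v_k^2 + 1} \mathbf{1}_k) = F(v) + \frac{A_k A_k^T}{2} \left(\frac{v_k}{v_k^2 + 1}\right)^2 + A_k A^T v \times \frac{v_k}{v_k^2 + 1} - \log(1 + \frac{1}{v_k^2 + 1})$. But $A_k A^T v \le 0$ (by assumption) and $A_k A_k^T = 1$, so $F(w) \le F(v) + \frac{1}{2} \left(\frac{v_k}{v_k^2 + 1}\right)^2 - \log(1 + \frac{1}{v_k^2 + 1})$.

And, from the lemmas just above (consider $\alpha = v_k^2$), $F(w) \le F(v) - \frac{1}{2 v_k^2 + 2}$.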

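But $v_k \le \sqrt{M x^T x}$, so $F(w) \le F(v) - \frac{1}{2 M x^T x + 2}$, which is impossible if $F(v) - F^* < \frac{1}{2 M x^T x + 2}$. So, $\forall v > 0$ such that $v^T A A^T v = M$, $F(v) - F^* \le \frac{1}{2 M x^T x + 2} \Rightarrow A A^T v > 0$.

Finally, the requirement that $v^T A A^T v = M$ can be removed because normalizing decreases $F$: $\forall v > 0$, $F(v) - F^* \le \frac{1}{2 M x^T x + 2} \Rightarrow F\left(\sqrt{\frac{M}{v^T A A^T v}} \, v\right) - F^* \le \frac{1}{2 M x^T x + 2} \Rightarrow \sqrt{\frac{M}{v^T A A^T v}} \times A A^T v > 0 \Rightarrow A A^T v > 0$.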

This proves the main theorem.

Remark: this idea is Perceptron based: one can increase $v$ and decrease $||A^T v||$ (precisely $F$ here) at the same time if there exists $k \; / \; A_k A^T v < 0$ (i.e. while convergence is not reached).
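The coordinate update used in the convergence proof can be sketched numerically. The code below is only an illustration: the instance, the starting point `v` and the helper names (`gram_times`, `F`, `perceptron_like_update`) are assumptions made here, rows are normalized in floating point, and the update is the one from the proof, $w = v + \frac{v_k}{v_k^2 + 1} \mathbf{1}_k$ for a coordinate $k$ with $A_k A^T v \le 0$.

```python
import math

def gram_times(A, v):
    """Return the vector A A^T v for a matrix given as a list of rows."""
    atv = [sum(v[m] * A[m][j] for m in range(len(A))) for j in range(len(A[0]))]
    return [sum(a * c for a, c in zip(row, atv)) for row in A]

def F(A, v):
    """F(v) = 0.5 * ||A^T v||^2 - sum_m log(v_m)."""
    atv = [sum(v[m] * A[m][j] for m in range(len(A))) for j in range(len(A[0]))]
    return 0.5 * sum(c * c for c in atv) - sum(math.log(vm) for vm in v)

def perceptron_like_update(A, v):
    """Increase the first coordinate k with A_k A^T v <= 0 by v_k / (v_k^2 + 1)."""
    for k, s in enumerate(gram_times(A, v)):
        if s <= 0:
            w = list(v)
            w[k] += v[k] / (v[k] ** 2 + 1)
            return k, w
    return None, list(v)          # A A^T v > 0: nothing to update

rows = [[5.0, 1.0], [-1.0, 1.0]]                            # hypothetical instance
A = [[a / math.hypot(*row) for a in row] for row in rows]   # unit-norm rows
v = [3.0, 1.0]
k, w = perceptron_like_update(A, v)
decrease = F(A, v) - F(A, w)      # the proof guarantees at least 1 / (2 v_k^2 + 2)
```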

3 Binary property of self concordant Perceptron

The previous section proves that the offered algorithm requires $\tilde{O}(M L)$ damped Newton steps. This section focuses on how each step can be done with integer matrices, and how one can round $v$ to keep a low binary complexity.

3.1 Removing row normalization

Section 2 presents the algorithm after normalization of the rows of $A$. This is classical for Perceptron based algorithms, and it gives the same importance to all constraints. However, this normalization is not required, and it is not relevant for exact computation. Thus, this section considers the result of section 2 for a raw matrix $A \in \mathbb{Z}^{M \times N}$ with total binary size $L$ (one may scale rows with low norm so that all rows have similar norm, but it is not mandatory).

First, $x_A^*$ still exists (currently, $x_A^*$ can be divided by $\alpha$ if all rows are scaled by $\alpha$). So, the existence of $F^*$ does not change. $v_{start}$ should be updated, because a too large $v_{start}$ leads to an exponential complexity. Yet, $F$ still decreases when one scales $v$ such that $v^T A A^T v = M$ (this does not depend on $A$ being normalized). So using $v_{start} = \sqrt{\frac{M}{\mathbf{1}^T A A^T \mathbf{1}}} \, \mathbf{1}$ is still an optimal starting point: this leads to $F(v_{start}) \le M + \frac{M}{2} \log(\mathbf{1}^T A A^T \mathbf{1}) - \frac{M}{2} \log(M)$. This value may not be exactly computable, but just considering $v = t \times \mathbf{1}$ such that $v^T A A^T v \in [\frac{M}{4}, 4M]$ is acceptable. Finally, $\log(\mathbf{1}^T A A^T \mathbf{1}) \le L$, so it does not change the number of steps of the first phase.

So, the first phase does not change at all depending on whether $A$ is normalized or not.

For the second phase, it changes the bound for convergence because $F(v + t \mathbf{1}_k) \le F(v) + \frac{1}{2} A_k A_k^T t^2 - \log(1 + \frac{t}{v_k})$ but $A_k A_k^T$ can be as high as $L$. Thus, $t$ should be lower than $\frac{v_k}{v_k^2 + 1}$ (close to $\frac{1}{A_k A_k^T} \times \frac{v_k}{v_k^2 + 1}$).

So, the resolution will happen only when $F(v) - F^* \le \tilde{O}(\frac{1}{M \Upsilon^2 x^T x})$ where $\Upsilon^2 = \max_m A_m A_m^T \le 2^{2L}$. Yet, this is not an issue for the algorithm because convergence is quadratic in phase 2. So, the number of additional steps is only $\tilde{O}(\log\log(M \Upsilon^2 x^T x))$, which is still $\tilde{O}(\log(L))$.

So the offered algorithm works with the raw matrix $A \in \mathbb{Z}^{M \times N}$ with total binary size $L$ and finds $v$ such that $A A^T v > 0$ in less than $\tilde{O}(M L)$ Newton steps, almost all being in the so called first phase where $\lambda(v) > \frac{1}{4}$ and each damped Newton step decreases $F$ by at least $\frac{1}{50}$.

3.2 Approximating root computation

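Computing $\lambda$ is impossible on $\mathbb{Q}$, but only a coarse approximation is required: $F$ is convex, so $F(v - \theta (\nabla_v^2 F)^{-1} (\nabla_v F)) \le \frac{1}{2} \left(F(v) + F(v - \frac{1}{1 + \lambda(v)} (\nabla_v^2 F)^{-1} (\nabla_v F))\right)$ if $\frac{1}{2} \frac{1}{1 + \lambda(v)} \le \theta \le \frac{1}{1 + \lambda(v)}$. It is therefore sufficient to approximate $\lambda$ within a factor 2 to get (at least) a decrease of $F$ by $\frac{1}{100}$ (against $\frac{1}{50}$ with perfect root computation).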

Importantly, finding $\theta$ such that $\theta \le \sqrt{\rho} \le \theta + 1$ takes $\log(\rho)$ operations using bisection, which is too much here. But finding $\theta$ such that $\theta \le \sqrt{\rho} \le 2\theta$ takes $\log(\log(\rho))$ operations because bisection can be done on the power. So, computing an approximation of $\lambda$ can be done in $\tilde{O}(\log(L))$ even without a specific algorithm for approximating roots.

Then, with the same idea, normalizing $v^T A A^T v = M$ is not possible exactly, but $v^T A A^T v \in [\frac{M}{4}, 4M]$ is possible and still guarantees that $v \in \, ]0, \sqrt{4 M x^T x}]^M$, and this decreases $F$.
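The factor 2 root approximation by bisection on the power can be sketched as follows. This is a minimal illustration under assumptions made here: $\rho$ is taken as a positive integer (the paper works with rationals), the result is restricted to powers of two, and the function name `sqrt_factor2` is hypothetical. The loop bisects on the exponent, so it runs in a number of iterations logarithmic in the bit length of $\rho$.

```python
def sqrt_factor2(rho):
    """Return theta = 2**e such that theta <= sqrt(rho) <= 2 * theta.

    Only integer comparisons are used; rho must be an integer >= 1.
    """
    if rho < 1:
        raise ValueError("rho must be a positive integer")
    # Invariant: 4**lo <= rho < 4**hi.
    lo, hi = 0, rho.bit_length()
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if (1 << (2 * mid)) <= rho:   # is 4**mid <= rho ?
            lo = mid
        else:
            hi = mid
    return 1 << lo

theta = sqrt_factor2(10)   # 2, since 2 <= sqrt(10) <= 4
```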

3.3 Rounding

This subsection contains one of the most important claims of the paper: self concordant Perceptron allows a very simple rounding process.

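Rounding strategy: Let $A \in \mathbb{Z}^{M \times N}$ with $\Upsilon^2 = \max_m A_m A_m^T$, and let $v$ be such that $v^T A A^T v \le 4M$, then:

$$\forall w \in \left[0, \frac{1}{1000 M \sqrt{M} \Upsilon}\right]^M, \quad F(v + w) \le F(v) + \frac{1}{200}$$

In particular, $\forall v \; / \; v^T A A^T v \le 4M$,

$$F\left( \begin{pmatrix} \frac{\mathrm{floor}(1000 M \sqrt{M} \Upsilon \times v_1)}{1000 M \sqrt{M} \Upsilon} \\ \vdots \\ \frac{\mathrm{floor}(1000 M \sqrt{M} \Upsilon \times v_M)}{1000 M \sqrt{M} \Upsilon} \end{pmatrix} \right) \le F(v) + \frac{1}{200}$$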

Proof:

First, the log part only decreases when adding $w$; thus, only the quadratic part should be considered. So $F(v + w) \le F(v) + \frac{1}{2} w^T A A^T w + w^T A A^T v$.

But $A^T w = \sum_m w_m A_m^T$, so $||A^T w|| \le \sum_m w_m ||A_m^T|| \le ||w||_\infty M \Upsilon \le \frac{1}{1000 \sqrt{M}} \le \frac{1}{500 \sqrt{M}}$ and $||A^T w||^2 = w^T A A^T w \le \frac{1}{(1000)^2 M}$.

So, $w^T A A^T v \le \sqrt{w^T A A^T w \times v^T A A^T v} \le \sqrt{\frac{1}{(500)^2 M} \times 4M} \le \frac{1}{250}$ (from Cauchy). And $\frac{1}{2} w^T A A^T w \le \frac{1}{2 \times (1000)^2 M} \le \frac{1}{1000}$. Thus, it holds that $F(v + w) \le F(v) + \frac{1}{200}$.

Then, flooring $t$ is a special case of adding $\tau \in [0, 1]$, so the offered rounding scheme corresponds to adding $w \in \left[0, \frac{1}{1000 M \sqrt{M} \Upsilon}\right]^M$.

Corollary: During all the first phase, performing a damped Newton step, a normalization and a flooring with precision $1000 M \sqrt{M} \sqrt{\max_m A_m A_m^T}$ still decreases $F$ by at least $\frac{1}{200}$ ($-\frac{1}{100}$ for the approximate damped Newton step, $+\frac{1}{200}$ for the rounding).
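The inequality of the rounding strategy can be checked numerically on a small instance. The sketch below only tests the bound exactly as stated, i.e. for a nonnegative perturbation $w$ taken at the corner of the box; it is not an implementation of the flooring itself. The matrix, the point `v`, the helper `F`, and the choice of an integer `K` at least as large as $1000 M \sqrt{M} \Upsilon$ (so that the grid step is rational) are all assumptions made for this example.

```python
import math

def F(A, v):
    """F(v) = 0.5 * ||A^T v||^2 - sum_m log(v_m) for an unnormalised matrix A."""
    atv = [sum(v[m] * A[m][j] for m in range(len(A))) for j in range(len(A[0]))]
    return 0.5 * sum(c * c for c in atv) - sum(math.log(vm) for vm in v)

A = [[5, 1], [-1, 1]]                                   # hypothetical integer instance
M = len(A)
upsilon_sq = max(sum(a * a for a in row) for row in A)  # Upsilon^2 = max_m A_m A_m^T

# Smallest integer K with K >= 1000 * M * sqrt(M) * Upsilon (ceiling of a square root).
K = math.isqrt((1000 * M) ** 2 * M * upsilon_sq - 1) + 1

v = [0.3, 1.0]
atv = [sum(v[m] * A[m][j] for m in range(M)) for j in range(len(A[0]))]
quad = sum(c * c for c in atv)                          # v^T A A^T v, must be <= 4M

w = [1.0 / K] * M                                       # corner of the box [0, 1/K]^M
increase = F(A, [a + b for a, b in zip(v, w)]) - F(A, v)
```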

3.4 Performing the Newton step with integers

The 3 key points for a Newton based algorithm to behave well when using exact computation are:

• having an explicit rounding strategy: this is done in section 3.3 ([12] admits that the rounding strategy is not trivial with path following)

• being sure that variables will not become too large: this is done in the main theorem, as $v^T A A^T v \le 4M$ implies that $v \in \, ]0, \sqrt{4 M x^T x}]^M$ with $\log(\sqrt{4 M x^T x}) = \tilde{O}(L)$ (true for path following but not for central path [2], which has variables that become as extreme as $\left(\frac{1}{2}\right)^{\sqrt{M} L}$)

• having linear algebra computations on integers ([12] offers a sketch of such a strategy, but it is based on many non explicit constants and seems to largely increase the size of the matrix to invert)

This last point is described here for the self concordant Perceptron. The Newton direction is given by $(\nabla_v^2 F)^{-1} (\nabla_v F)$, but here

$$\nabla_v^2 F = A A^T + \mathrm{diag}\left(\frac{1}{v_1^2}, \dots, \frac{1}{v_M^2}\right)$$

But for any non singular matrix $D$, $(\nabla_v^2 F)^{-1} = (\nabla_v^2 F)^{-1} D^{-1} D = (D \nabla_v^2 F)^{-1} D$.

Yet,

$$\beta \times \mathrm{diag}(v_1^2, \dots, v_M^2) \times \nabla_v^2 F = \beta \times \mathrm{diag}(v_1^2, \dots, v_M^2) \, A A^T + \beta \times I$$

Using $\beta = (400)^2 M^3 \Upsilon^2$, the right part is entirely integer.

So, computing the inverse, written $H$, of the integer matrix $(400)^2 M^3 \Upsilon^2 \times \mathrm{diag}(v_1^2, \dots, v_M^2) \, A A^T + (400)^2 M^3 \Upsilon^2 \times I$ allows to extract the inverse of $\nabla_v^2 F$, which is $H \times (400)^2 M^3 \Upsilon^2 \times \mathrm{diag}(v_1^2, \dots, v_M^2)$.

Finally, the Newton direction (written here as the descent direction $-(\nabla_v^2 F)^{-1} (\nabla_v F)$) can be computed as

$$\mathrm{Newton} = H \times (400)^2 M^3 \Upsilon^2 \times \mathrm{diag}(v_1^2, \dots, v_M^2) \times \left( \begin{pmatrix} \frac{1}{v_1} \\ \vdots \\ \frac{1}{v_M} \end{pmatrix} - A A^T v \right)$$

Yet, by grouping all terms excluding $H$, one recovers an integer vector (up to a factor $400 M \sqrt{M} \Upsilon$). So, all computations are done on integers.

Currently, the total binary size of $\nabla_v^2 F$ seems to be as high as $\tilde{O}(M^2 L)$ and not just $L$ (it is not clear whether it is possible to take advantage of the shape of the matrix). Yet, as soon as $P, Q \in \mathbb{Z}^{M \times M}$ have total binary size $L$, $PQ$ can have a binary size of $\tilde{O}(M^2 L)$ and not just $2L$ as for the multiplication of two scalars of binary size $L$.

However, this drawback is completely shared with [12], where each coefficient of the matrix is rounded to $2^{K_2 L}$ but with a non trivial $K_2$.
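The integer linear system behind the Newton step can be sketched with exact rational arithmetic. The code below is an illustration under assumptions made here: $v$ is supposed to lie on a grid $v_i = n_i / g$ with integers $n_i$ and $g$ (the generic denominator `g` stands in for the rounding precision, and $\beta = g^2$), the helper names are hypothetical, and the solver is a plain Gauss-Jordan elimination over `Fraction`. It builds the integer matrix $B = \beta \, \mathrm{diag}(v^2) A A^T + \beta I$ and an integer right-hand side $r$ such that $B (g \, d) = r$, where $d = (\nabla_v^2 F)^{-1} (\nabla_v F)$.

```python
from fractions import Fraction

def solve_exact(B, r):
    """Solve B y = r exactly by Gauss-Jordan elimination over the rationals."""
    n = len(B)
    aug = [[Fraction(x) for x in row] + [Fraction(ri)] for row, ri in zip(B, r)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if aug[i][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for i in range(n):
            if i != col and aug[i][col] != 0:
                factor = aug[i][col]
                aug[i] = [a - factor * b for a, b in zip(aug[i], aug[col])]
    return [row[n] for row in aug]

def integer_newton_system(A, n, g):
    """Return (G, B, r) with integer entries, for v = n / g and beta = g**2."""
    M = len(A)
    G = [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]   # A A^T
    B = [[n[i] ** 2 * G[i][j] + (g * g if i == j else 0) for j in range(M)]
         for i in range(M)]
    r = [n[i] ** 2 * sum(G[i][j] * n[j] for j in range(M)) - g * g * n[i]
         for i in range(M)]
    return G, B, r

A = [[5, 1], [-1, 1]]          # hypothetical integer instance
n, g = [3, 10], 10             # v = (3/10, 10/10)
G, B, r = integer_newton_system(A, n, g)
d = [y / g for y in solve_exact(B, r)]          # exact Newton direction

# Exact check against the definition: (A A^T + diag(1/v^2)) d = A A^T v - 1/v.
v = [Fraction(ni, g) for ni in n]
lhs = [sum(G[i][j] * d[j] for j in range(2)) + d[i] / v[i] ** 2 for i in range(2)]
rhs = [sum(G[i][j] * v[j] for j in range(2)) - 1 / v[i] for i in range(2)]
```

Only `B` and `r` need to be handled by the integer linear algebra routine; the division by `g` at the end is the single place where a rational appears.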

4 Conclusion

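Finally, the global self concordant Perceptron is given by the following algorithm.

Algorithm: self concordant Perceptron

1. initialize $v = \mathbf{1}$

2. while $\neg(A A^T v > 0)$ and $\lambda(v)^2 > \frac{1}{16}$

   (a) compute $\theta \; / \; \theta \le \lambda_F(v) \le 2\theta$

   (b) $v = v - \frac{1}{1 + \theta} (\nabla_v^2 F)^{-1} (\nabla_v F)$ using integer matrices (see 3.4)

   (c) while $v^T A A^T v \ge 4M$, $v = \frac{1}{2} \times v$

   (d) $v = \frac{\mathrm{floor}(v \times 400 M \sqrt{M} \Upsilon)}{400 M \sqrt{M} \Upsilon}$

3. while $\neg(A A^T v > 0)$

   (a) compute $\theta$, a factor 2 approximation of $\lambda(v)$

   (b) $v = v - \frac{1}{1 + 2\theta} (\nabla_v^2 F)^{-1} (\nabla_v F)$

This algorithm has an arithmetic complexity of $\tilde{O}(M L)$ Newton steps, which is higher by a factor $\sqrt{M}$ than path following. Yet, compared to path following, this algorithm offers a very simple rounding strategy that allows all computations to be performed on integers, and it guarantees that all variables are properly bounded.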
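The overall loop can be sketched in floating point. This is a simplified illustration and not the exact algorithm above: it assumes the Newton decrement is computed directly with a floating point square root, and it omits the normalization, flooring and integer linear algebra steps. The function names, the step budget and the test matrix are assumptions made here.

```python
import math

def solve(Hm, g):
    """Solve Hm y = g by Gaussian elimination with partial pivoting (floats)."""
    n = len(g)
    aug = [row[:] + [gi] for row, gi in zip(Hm, g)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            aug[i] = [a - f * b for a, b in zip(aug[i], aug[c])]
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (aug[i][n] - sum(aug[i][j] * y[j] for j in range(i + 1, n))) / aug[i][i]
    return y

def self_concordant_perceptron(A, max_steps=10_000):
    """Damped Newton on F(v) = 0.5 v^T A A^T v - sum log(v) until A A^T v > 0."""
    M = len(A)
    G = [[float(sum(a * b for a, b in zip(ri, rj))) for rj in A] for ri in A]
    v = [1.0] * M
    for step in range(max_steps):
        Gv = [sum(G[i][j] * v[j] for j in range(M)) for i in range(M)]
        if all(c > 0 for c in Gv):
            return v, step                       # A A^T v > 0: x = A^T v is a solution
        grad = [Gv[i] - 1.0 / v[i] for i in range(M)]
        hess = [[G[i][j] + (1.0 / v[i] ** 2 if i == j else 0.0) for j in range(M)]
                for i in range(M)]
        d = solve(hess, grad)
        lam = math.sqrt(max(0.0, sum(gi * di for gi, di in zip(grad, d))))
        v = [vi - di / (1.0 + lam) for vi, di in zip(v, d)]   # damped Newton step
    raise RuntimeError("step budget exhausted")

A = [[5, 1], [-1, 1]]                            # feasible: x = (0, 1) gives A x > 0
v, steps = self_concordant_perceptron(A)
x = [sum(v[m] * A[m][j] for m in range(len(A))) for j in range(len(A[0]))]   # x = A^T v
```

On this instance the starting point $v = \mathbf{1}$ does not satisfy $A A^T v > 0$, so at least one Newton step is taken before the loop stops.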

References

[1] Andris Ambainis, Yuval Filmus, and François Le Gall. Fast matrix multiplication: limitations of the Coppersmith-Winograd method. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 585–593, 2015.

[2] Erling D Andersen, Jacek Gondzio, Csaba Mészáros, and Xiaojie Xu. Implementation of interior-point methods for large scale linear programs. In Interior Point Methods of Mathematical Programming, pages 189–252. Springer, 1996.

[3] Sergei Chubanov. A polynomial algorithm for linear optimization which is strongly polynomial under certain conditions on optimal solutions, 2015.

[4] Sergei Chubanov. A polynomial projection algorithm for linear feasibility problems. Mathematical Programming, 2015.

[5] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, 2019.

[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 1995.

[7] George B Dantzig et al. The generalized simplex method for minimizing a linear form under linear inequality restraints. Pacific Journal of Mathematics, 1955.

[8] H Mansouri and C Roos. Simplified O(nL) infeasible interior-point algorithm for linear optimization using full-Newton steps. Optimisation Methods and Software, 22(3):519–530, 2007.

[9] Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 2004.

[10] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.

[11] Javier Peña and Negar Soheili. A deterministic rescaled perceptron algorithm. Mathematical Programming, 155(1-2):497–510, 2016.

[12] James Renegar. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40(1):59–93, 1988.

[13] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 1958.

Appendix: Basic theoretical foundations of linear programming

Maximal sub determinant of a matrix

An important result for linear programming and linear feasibility is the Hadamard bound on the maximal sub determinant of a matrix: $\forall A \in \mathbb{Z}^{N \times N}$ with all entries bounded by $2^B$ (i.e. $\forall i, j \in \{1, ..., N\}$, $|A_{i,j}| \le 2^B$), $\mathrm{Det}(A) \le N^N 2^{NB}$. By extension, for $A \in \mathbb{Z}^{M \times N}$ with all entries bounded by $2^B$, any square submatrix of $A$ has a determinant bounded by $N^N 2^{NB}$ (precisely, one could consider $\min(N, M)$ instead of $N$). This maximal sub determinant of $A$ will be written $\Omega(A)$ in the paper, i.e.

$$\Omega(A) = \max_{i_1, ..., i_r, j_1, ..., j_r} \mathrm{Det}(A_{\{i_1, ..., i_r\} \times \{j_1, ..., j_r\}})$$

Combining with Cramer's rule, it follows that if $Ax = b$, then $|x_n| \le \Omega(A) \times (\sum_m |b_m|)$, and if $x_n \ne 0$, then $|x_n| \ge \frac{1}{\Omega(A)}$. This is also true for a vertex of a polyhedron defined by $Ax \ge b$.

Finally, this Hadamard bound can even be refined: if $L$ is the total binary size of $A$, then $\log(\Omega(A)) \le \tilde{O}(L)$ (precisely $\log(\Omega(A)) \le \tilde{O}(\min(NB, L))$, but, for classical complexity measurement, $\log(\Omega(A)) \le \tilde{O}(L)$).
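For very small matrices, the maximal sub determinant can be computed by brute force and compared with the bound $N^N 2^{NB}$. The sketch below is only meant as an illustration: the enumeration is exponential, the function names and the example matrix are assumptions made here, and absolute values of determinants are used.

```python
from itertools import combinations

def det_int(mat):
    """Exact integer determinant by Laplace expansion (only for tiny matrices)."""
    n = len(mat)
    if n == 1:
        return mat[0][0]
    return sum((-1) ** j * mat[0][j]
               * det_int([row[:j] + row[j + 1:] for row in mat[1:]])
               for j in range(n))

def max_subdeterminant(A):
    """Omega(A): largest absolute determinant over all square submatrices of A."""
    M, N = len(A), len(A[0])
    best = 0
    for r in range(1, min(M, N) + 1):
        for rows in combinations(range(M), r):
            for cols in combinations(range(N), r):
                sub = [[A[i][j] for j in cols] for i in rows]
                best = max(best, abs(det_int(sub)))
    return best

A = [[5, 1], [-1, 1], [0, 2]]        # entries bounded by 2**B with B = 3
N, B = 2, 3
omega = max_subdeterminant(A)        # 10, reached by rows (0, 2)
hadamard_bound = N ** N * 2 ** (N * B)
```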

Equivalence between linear programming and linear feasibility

This paper provides an algorithm $algo_0$ which returns $v$ such that $A A^T v > 0$ on an input $A$, assuming $\exists x \; / \; Ax \ge \mathbf{1}$ (undefined behaviour otherwise; $v$ is positive but this does not matter). Trivially, it is thus possible to form $algo_1$, which returns $x$ such that $Ax > 0$ on input $A$ assuming such an $x$ exists, by returning $A^T algo_0(A)$ (undefined behaviour otherwise).

• Thanks to $algo_1$, one can form $algo_2(A, b)$ which returns $x$ such that $Ax > b$ assuming such an $x$ exists (undefined behaviour otherwise). Indeed, consider any $A, b$ such that $\exists x \; / \; Ax > b$; finding such an $x$ is equivalent to finding a pair $x, t$ such that $Ax - t \times b > 0$ and $t > 0$, because $\frac{x}{t}$ is then a solution of the original problem. Let $A_1$ be the matrix $A$ plus $-b$ as additional column and $(0 \; 1)$ as additional row. Thus, one can get $(x_1 \; t_1)$ by computing $algo_1(A_1)$ and returning $\frac{x_1}{t_1}$ as output of $algo_2(A, b)$.

Importantly, only a constant number of variables and constraints are added, and the binary size is not increased. So the complexity of $algo_2(A, b)$ is the same as that of $algo_1(A_1)$.

• Thanks to $algo_2$, one can form $algo_3(A, b)$ which returns $x$ such that $Ax \ge b$ assuming such an $x$ exists. Indeed, if $\exists x \; / \; Ax \ge b$, then a fortiori $\exists x, t$ such that $Ax + t\mathbf{1} > b$, $0 < t < \frac{1}{\Omega(A)}$ ($\Omega(A)$ is the maximal sub determinant of $A$, see beginning of appendix). So, one can call $algo_2$ on $A_2, b_2$, with $A_2$ being $A$ plus $\mathbf{1}$ as additional column plus a row with 0 and $\Omega(A)$, and $b_2$ being $b$ plus two 1. Thus, $algo_2(A_2, b_2) = (x_2, t_2)$.

Now, one can consider a greedy improvement of $\min_{x, t \; / \; Ax + t\mathbf{1} \ge b, \, t \ge 0} t$ initialized from $(x_2, t_2)$. Such a greedy improvement can be performed by projecting $(x, t)$ on $\{(x, t) \; / \; Ax + t\mathbf{1} \ge b\}$ while minimizing $t$. One such greedy step can simply be done by looking for $\chi, \tau$ such that $A_S \chi + \tau \mathbf{1}_S = 0$ and $\tau = -1$, with $S$ the saturated rows in $Ax + t\mathbf{1} \ge b$. If no such $\chi, \tau$ exists, the greedy improvement has terminated; otherwise one can do $(x, t) \leftarrow (x + \mu\chi, t + \mu\tau)$ with $\mu$ such that $Ax + t\mathbf{1} \ge b$, $t \ge 0$. There will be no more than $M$ such greedy steps because one row enters the saturated ones at each step.

When this greedy process terminates, it leads to $\hat{x}, \hat{t}$ with $A\hat{x} + \hat{t}\mathbf{1} \ge b$, $0 \le \hat{t} \le t_2 < \frac{1}{\Omega(A)}$, but $\hat{x}, \hat{t}$ is a vertex of $A$. So Cramer's rule applies, and so $\hat{t} = \frac{\mathrm{Det}(S_t)}{\mathrm{Det}(S)}$ with $S$ a sub matrix of $A$ and $S_t$ the Cramer partial submatrix related to $t$. But $\hat{t} \le t_2 < \frac{1}{\Omega(A)}$, so $\hat{t} = 0$, and thus $A\hat{x} \ge b$. So, this projection of $x_2, t_2$ gives $x_3 = algo_3(A, b)$.

Importantly, the binary size of $A_2, b_2$ is just twice the binary size of $A, b$ because $\log(\Omega(A)) \le L(A)$, so $L(A_2) = L(A) + \log(\Omega(A)) \le 2L(A)$, and only a constant number of variables and constraints are added. So the complexity of $algo_3(A, b)$ is the same as that of $algo_2(A, b)$. (Currently, the maximal binary size in $algo_3$ is increased by a factor $N$ compared to $algo_2$; this can have an impact on binary complexity, but for arithmetic complexity $algo_3(A, b)$ is just twice $algo_2(A, b)$.)

• For any $A, b$, without assumption, solving $Ax \ge b$ (or producing a certificate that no solution exists) is equivalent to solving $\min_{z, t \; / \; Az + t\mathbf{1} \ge b, \, t \ge 0} t$ (there is a solution if the minimum is 0). Yet, this last linear program is structurally feasible ($z = 0$ and a sufficiently large $t$ provide a feasible point) and bounded because $t \ge 0$.

Thus, primal dual theory gives a system $A_{primal-dual} (x \; y) \ge b_{primal-dual}$ whose solution contains the solution of the linear program $\min_{z, t \; / \; Az + t\mathbf{1} \ge b, \, t \ge 0} t$.

Applying $algo_3(A_{primal-dual}, b_{primal-dual})$ thus provides such $x_{primal-dual}, y_{primal-dual}$, from which one can restore $x_3, t_3$ with either $t_3 = 0$ and so $Ax_3 \ge b$, or $t_3 \ne 0$.

This leads to an algorithm $algo_4$ which is able to find $x$ such that $Ax \ge b$ (or to produce a certificate that no solution exists) without assumption on $A, b$ about the existence or not of such an $x$.

Importantly, the number of variables and constraints is only doubled when computing the primal dual, so from a theoretical point of view, it does not change the complexity between $algo_4$ and $algo_3$.

• Finally, for any $A, b, c$ without any assumption, solving $\min_{x \; / \; Ax \ge b} c^T x$ can be done with 2 $algo_4$ calls and one $algo_3$ call:

– one to know if the problem is feasible, i.e. $algo_4(A, b)$

– one on the dual to know if it is bounded, $algo_4(A_{dual}, b_{dual})$

– and one call to $algo_3$ on the primal dual to get the optimal solution (if the previous two computations certify that the problem is feasible and bounded).

Again, from a theoretical point of view, the complexity does not change: it only does 3 calls on instances scaled at most 2 times. At the end, it returns the optimal solution or a certificate that the problem is not feasible or not bounded.

Thus, $algo_0$, which only returns $v$ such that $A A^T v > 0$ on input $A$ if there exists such a $v$ (undefined behaviour otherwise), allows to build with the same complexity $algo_5(A, b, c)$, which solves $\min_{x \; / \; Ax \ge b} c^T x$ or returns a certificate that the problem is either infeasible or unbounded.

Let us stress that the opposite way is trivial: $algo_5(A A^T, \mathbf{1}, 0)$ is a correct implementation of $algo_1(A)$ for any $A$.

SVM

Assuming $\exists x \in \mathbb{Q}^N \; / \; Ax > 0$, then $\exists x \in \mathbb{Q}^N \; / \; Ax \ge \mathbf{1}$, because $\frac{1}{\min_m A_m x} \times x$ verifies $Ax \ge \mathbf{1}$ (if $Ax > 0$). With $A \in \mathbb{Z}^{M \times N}$, the Perceptron will then converge in $\tilde{O}(x^T x)$ steps, while the self concordant Perceptron converges in $\tilde{O}(M \log(x^T x))$.

Now, this claim holds for any $x$. Typically, consider $x$ with $Ax \ge \mathbf{1}$ such that there is a maximal number of saturated rows and a maximal number of null coordinates.

The impossibility to increase the number of constraints implies that $x$ is the solution of a system of equations containing rows from $A$ and rows from $I$. So each coordinate of $x$ is bounded by a maximal determinant from $A$. So, $\log(x^T x) \le \log(\Omega(A)) \le \tilde{O}(L)$. Precisely, $\log(x^T x) \le \tilde{O}(\min(L, NB))$, but the $L$ part is the classical way to measure complexity.

Let us stress that the bound could also be expressed by introducing the solution of the support vector machine problem [6], which consists in solving $\min_{x \; / \; Ax \ge \mathbf{1}} x^T x$.