Self concordant Perceptron

(1)

HAL Id: hal-02491694

https://hal.archives-ouvertes.fr/hal-02491694v8

Preprint submitted on 2 May 2021 (v8), last revised 3 Oct 2021 (v10)

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

Self concordant Perceptron

Adrien Chan-Hon-Tong

To cite this version:

(2)

Self concordant Perceptron

Adrien CHAN-HON-TONG

May 2, 2021

Abstract

This paper offers a new algorithm for linear feasibility. This algorithm is poly-nomial time (despite an arithmetic time complexity slightly higher than state of art, the algorithm has excellent binary property). This algorithm encodes the original Perceptron idea into a self concordant function, and, has the best complexity in this family of algorithms.

1 Introduction

1.1 Purpose

This paper focuses on the problem of linear separability (or linear feasibility, depending on the community):

Finding_{x such that Ax > 0 for a given matrix A ∈ Z}M ×N with the prior that one solution to this set of strict inequality exists.

There is a very large literature linked with this problem both in machine learning or optimisation, and, both for practical solvers or for theoretic complexity.

In machine learning, finding Ax > 0 provides a linear model with 100% accuracy on training samples represented by rows of A. Yet, machine learning is about achieving high test accuracy, so training accuracy is only one side of the story. This explains why support vector machine (SVM) [6] (which consists in finding the linear model which maximizes the margin) is usually considered instead of simple linear feasibility (which only provides a positive margin). SVM is even one of the most classical problem of machine learning with a large number of solvers efficient in practice (e.g. [9, 3, 11, 21]). Yet, those solvers are often inexact (may return a vector that does not separate the data when there is one), and with bad behaviour in worse case (despite having efficient step and good behaviour on average). Indeed, most of them forget to take into account the dependency on the binary size of the input (for input given as large arbitrary integers), or, on specificity of the solution. For example, first order methods [15] usually are exponential in worse case, and, decomposition strategy [19] heavily depends on the size of the support vector set in the solution [22]).

On the other side, linear feasibility is exactly equivalent to linear programming which consists in solving min cTx

x / Ax≥b for given matrix A ∈ Z

(3)

ZM, c ∈ ZN (see appendix: any exact linear feasibility solver could be converted into an exact linear programming one with the same arithmetic time complexity).

So, exact linear feasibility could be seen as the simplest form of exact linear pro-gramming. Yet, from a theoretical point of view, the best current exact and determin-istic method to solve linear feasibility is to process it as a regular linear program with state of the art linear programming algorithms [17, 16].

Let introduce the usual Landau’s notation big O and tilde big O: if f, g are two func-tions on N, f = eO(g) implies that ∃Υ, κ, n0/ ∀n ≥ n0, f (n) ≤ Υg(n) logκ(g(n)).

Let L be the total binary size of matrix A ∈ ZM ×N_{, then, solving linear}

feasibil-ity on A with [17] only requires at most eO(√N L) Newton steps (and even less for [5, 14]). When relying on an subroutine requiring eO(Nω_{) arithmetic operations to}

per-form N × N matrices multiplication or linear system resolution, this algorithm has an arithmetic time complexity of eO(√N Nω_{L) (e.g. ω = 2.38 < 3 is possible with [2]).}

1.2 Related work and contributions

This paper introduces a new algorithm for theoretic exact linear feasibility with an arithmetic time complexity of at most eO(MωM L) (theorem 2).

So, this algorithm is not better than state of the art, but, it has excellent binary property (theorem 3). In binary time complexity, only binary operation are counted as 1 as opposed to arithmetic complexity where arithmetic operation on Q counting as 1. In addition, this algorithm seems to be the first relying self concordance optimiza-tion with large Newton increment. Thus, the complexity bound is maybe very loose. Independently, the algorithm may bypass theoretical limit pointed by [1] at it seems not relying on central path framework. Finally, this algorithm is related to the Perceptron algorithm [20], and, it has the current best arithmetic time complexity for Perceptron related algorithms. Table 1 summarizes contributions:

Algorithm time complexity Central path [7] exponential no [20] exponential no [13] O(Ne 6L) no [12] O(Ne 3 √ N L) no [18] O(Me 2N2 √ M L) no [4] O(Ne 4L) no Self concordant Perceptron O(Me ωM L) not trivially

[16] O(Ne ω √ N L) yes [5] O((Ne ω+ N 13 6_)L) _yes

Consistently with [5], ω is the exponent of matrix multiplication: multiplying two M × M matrix and/or solving a M × M linear system can be done with eO(Mω) arithmetic operations.

Independently, central path framework is limited by [1].

Table 1: Comparison of self concordant Perceptron with state of the art. Perceptron can be considered as the first linear feasibility algorithm, but, it has the

(4)

drawback of having exponential time complexity. This algorithm has been extended us-ing rescalus-ing framework leadus-ing to an efficient randomized algorithm [8] (not compara-ble with exact and deterministic algorithms). Then a deterministic version of [8] has be introduced in [18], but, this last algorithm requires at most eO(√M N M L) Perceptron steps i.e. it has a eO(√M M2_N2_{L) arithmetic time complexity against e}_O(Mω_{M L) for}

the one of this paper. Precisely, the offered algorithm is not an extension of the Percep-tron like [18] (it does not even work on the same space) but rather an other way to see the Perceptron idea to manage simultaneously the norm of the current point, and, the scalar product it has with the (unknown but existing) SVM.

The paper is structured as follow: next section introduces the main theorem which proves some bounds. Then section 3 just apply classical result on Newton descent on self concordance function to conclude about the complexity of the algorithm.

2 Self concordant Perceptron

2.1 Claims

Main definition: ∀A ∈ QM ×N, let introduce the function:

FA(v) = vTAATv 2 − 1 T_{log(v) =} M X i,j=1 vivj× AiATj − M X m=1 log(vm) (1)

Main theorem: For all normalized linear feasibility instances A ∈ QM ×N _i.e.

with the prior that ∃x / Ax ≥ 1 and with AmATm= 1

• FA(_M11) ≤ 1 + M log(M )

• FAhas a minimum (let write it FA∗) with −FA∗ ≤ M log(x T_x)

• for all v, FA(v) − FA∗ ≤ 1

2M xT_x+2 ⇒ AA

T_{v > 0}

The index A will be omitted when it is not ambiguous.

Let stress that those hypothesis on A may seem strong, but they are not: any linear program instance can be encoded into an instance verifying those property which has (with a constant factor) the same total binary size and number of variables/constraints. Appendix presents precisely all the steps of the conversion of a linear program instance toward a normalized linear feasibility one.

2.2 Proof of the claim

2.2.1 Existence of a minimum First, F (_M11) = _M12

P

i,j

AiATj+M log(M ) ≤ 1+M log(M ) (because A is normalized

(5)

Then, from Cauchy (ATv)Tx ≤ √vT_AAT_{v × x}T_{x. But (A}T_v)T_{x = v}T

(Ax). And, by definition Ax ≥ 1 (it is unknown but the assumption is that it exists). Injecting this last inequality is interesting as v ≥ 0: as each Amx ≥ 1 a fortiori Amx > 0 and

vm > 0 so vTAm > 0. Even more, ∀v ≥ 0, 0 ≤ vT1 ≤ vT(Ax) = (ATv)Tx ≤

√

vT_AAT_{v × x}T_{x. One can even take the square (because both side are positive):}

∀v ≥ 0, (v_xTT1)x2 ≤ v

T_AAT_{v. And, independently (v}T₁₎2_{> v}T_{v because v ≥ 0. So}

∀v ≥ 0, vTv xT_x≤ v

T_AAT_v.

Let introduce the 1D function f (t) = _2xt2T_x − log(t), from previous inequality it

stands that F (v) ≥P

m

f (vm).

Now, f is a single variable function which goes to infinity when t goes to 0 (t2_{→ 0}

but − log(t) → ∞) or to infinity (t2_{growths faster than log(t)). So, f has a minimum}

and so F too. Let call them f∗and F∗.

As f and F are smooth, the minimums are characterized by a null derivative or gra-dient. f0(t) = _xTt_x− 1 t, so, f 0₍√_xT_{x) = 0, so f}∗_{= f (}√_xT_{x) =} 1 2− 1 2log(x T_{x) ≥} − log(xT

x). Thus, the minimum of F verifies F∗≥ M f∗_{≥ −M log(x}T

x). So the two first assertions of the main theorem are proven.

Important remark: If there is no x such that Ax ≥ 1 then, there exists y such that Ay = 0 and y > 0, and, F is not bounded (can go to −∞). But, the existence of x and v > 0, forces the existence of F∗(due to v_xTTv_x ≤

(1T_v)2

xT_x ≤ v

T_AAT_{v thank to Cauchy}

from xT_(AT_v)).

2.2.2 Normalization, linearization and lemmas

Independently, let remark that θ(t) = F (tv) = vTAA₂ Tvt2− 1T

log(v) − M log(t) is minimal when vTAATv = M . So for any w, one could build a v = µw such that vT_AAT_{v = M and F (v) ≤ F (w). In other words, it stands that F}q M

vT_AAT_vv

≤ F (v).

So, let consider v ≥ 0 such that vT_AAT_{v = M . As, v}T_AAT_{v ≥} (1v)2

xTx, no vm

could be higher than √

M xT_{x i.e. 0 ≤ v ≤}√_{M x}T_x1.

Let also remark that F (v + w) = vTAA₂ Tv+wTAA₂ Tw+ wTAATv − 1Tlog(v) − 1Tlog(1 +w_v) = F (v) +wTAA₂ Tw+ wTAATv − 1Tlog(1 +w_v)

Finally, let consider the following lemmas from basic analysis: 1. φ(t) = 1₂αt2− log(1 + t) ≤ 1 2(α + 1)t 2_{− t = ψ(t) for t ≥ 0} 2. ψ(_α+11 ) ≤ − 1 2α+2 3. φ(_α+11 ) ≤ −_2α+21 i.e. ∀α ≥ 0,1₂_(α+1)α 2 − log(1 + 1 α+1) ≤ − 1 2α+2 Lemma1: ψ0(t) − φ0(t) = (α + 1)t − 1 − αt + 1 1+t = t − 1 + 1 1+t = t2 1+t >

0, so ψ(t) − φ(t) always increases. But, ψ(0) = φ(0) = 0 so ψ(t) ≥ φ(t) for t ≥ 0. Lemma2: ψ(_α+11 ) = 1₂(α + 1)_(α+1)1 2 − 1 α+1 = − 1 2α+2. lemma3 is just lemma1+lemma2.

(6)

2.2.3 Convergence

Now, either AATv > 0 (problem solved) or there exists k such that AkATv ≤ 0.

Let consider this case AkATv ≤ 0 and vTAATv = M , and, let introduce w =

v +_v2vk k+1 1k. Then F (w) = F (v + _v2vk k+1 1k) = F (v) + AkATk 2 ( vk v2 k+1 )2+ AkATv × _v2vk k+1 − log(1 + _v21 k+1). But, Ak

AT_{v ≤ 0 (by assumption) and A}

kATk = 1, so F (w) ≤ F (v) +1₂(_vvk2 k+1 )2− log(1 + 1 v2 k+1).

And, from lemmas just above (consider α = v2

k), F (w) ≤ F (v) − 1 2v2 k+2 . But, vk ≤ √ M xT_{x, so, F (w) ≤ F (v) −} 1 2M xT_x+2which is impossible if F (v) −

F∗ < _{2M x}1T_x+2. So, ∀v > 0 such that vTAATv = M , F (v) − F∗ ≤

1 2M xT_x+2 ⇒

AAT_{v > 0.}

Finally, the requirement that vTAAT_{v = M could be remove because normalizing}

decreases F : ∀v > 0, F (v) − F∗≤ 1 2M xT_x+2 ⇒ F ( q M vT_AAT_vv) − F∗≤ 1 2M xT_x+2 ⇒q M vT_AAT_v× AA T_{v > 0 ⇒ AA}T v > 0. This proves the main theorem.

Remark: this idea is Perceptron based, one could increase v and decrease ||AT_v||

(precisely F here) in the same time if the system Ax > 0 is currently not solved.

3 Time complexity

3.1 Self concordance

Despite the fact that this whole paper relies on self concordance theory, there is nothing to prove to apply results from [16].

A smooth single variable function φ is self concordant if and only if |φ000(t)| ≤ 2(φ00(t))32. And, a smooth multi variable function Φ is self concordant if each ray

t → Φ(νt + β) is self concordant (i.e. for all vectors ν, β). Mainly, self concordant functions are quadratic, linear, constant and − log ones. For example, the function x → xT_{Qx + q}T_{x − log(νx − β) is a self concordant function for any vector q, ν,}

bias β and any matrix Q, as soon as Q is positive (not necessarily strictly positive, 0 is possible).

FAdefined at eq.(1) is self concordant ∀A ∈ QM ×N.

Yet, the major property of Newton descent on self concordant functions [16] is that: if G is a self concordant function, and, if G has a minimum G∗, then, damped Newton descent starting from any xstartallows to build x such that G(x) − G∗≤ ε in less than

e

(7)

3.2 Arithmetic time complexity

Now, the main theorem of section 2 gives all the bounds required to apply the main result of Newton descent on self concordant functions recalled just above: F has a minimum and F∗≥ −M log(xT

x) and one can use vstart = F (_M11) as F (vstart) ≤

M log(M ) + 1. Importantly, F is always self concordant (for all matrix A) but F is lower bounded if and only if A is a linear feasibility instance (see appendix). But, in this case using Newton descent is straightforward.

So, damped Newton descent, starting from_M11, builds a solution of the linear feasi-bility in less than eO(M log(xTx)+M log(M )+log log(2M xTx)) = eO(M log(xTx)) steps.

Yet, log(xTx) ≤ eO(min(N B, L)) where L is the total binary size of A (see ap-pendix: x is the solution a system of linear equation extracted from A, so Cramer rules allows to characterize x by sub determinant of A which can be bounded by Hadamard result). For complexity analysis log(xTx) ≤ eO(L). Thus, this finally gives a eO(M L) number of steps, with each step being the resolution of a M × M linear system.

Theorem 2: Newton descent on F solves linear feasibility instance A in at most e

O(M L) damped Newton steps where L is the total binary size of A. The overall arithmetic time complexity of the algorithm is eO(Mω_{M L).}

For practical floating point implementation, this theorem is sufficient, the pseudo code of the offered algorithm is simply:

Algorithm: floating point implementation of self concordant Perceptron: 1. initialize v = _M1 2. while ¬(AAT_{v > 0)} (a) v = v − _1+λ1 F(v)(∇ 2 vF )−1(∇vF )

This algorithm is quite simple (no scaling) and still solves linear feasibility (but so lin-ear programming) in polynomial time - despite arithmetic time complexity is eO(M L) Newton steps which is higher than state of the art by a factor√N .

Now, this bound is only based on the assumption of a constant decreases at each Newton steps. Yet, in the optimisation of F , larger Newton steps ([10]) seems possible because the smallest variables tends to increased by construction. Thus, exploring large steps should considered in future works as it may improve this bound.

3.2.1 Binary time complexity

From a theoretical point of view, it is now required to have a bound on the binary time complexity of each step (let recall that naive Gaussian elimination is exponential when one does not take care about intermediary values). To establish the binary time complexity of each step, one should use an approximation instead of the square root and round the number after each damped Newton step, without breaking the convergence. This is possible as presented bellow.

(8)

Computing λ is impossible on Z, but F is convex so F (v − θ(∇2vF )−1(∇vF )) ≤ 1 2(F (v) + F (v − 1 1+λ(v)(∇ 2 vF )−1(∇vF ))) if 1₂_1+λ(v)1 ≤ θ ≤ _1+λ(v)1 . It is therefore

sufficient to approximate λ by a factor 2 approximation (only twice the number of steps are required compared to infeasible exact computation version). Then, with the same idea, normalizing vT_AAT_{v = M is not possible exactly, but}M

2 ≤ v

T_AAT_{v ≤ 2M is}

possible and still guarantees that v ≤ eO(2M xT_x).

Importantly, finding µ such that µ ≤ √ρ ≤ µ + 1 is in log(ρ) (with bisection), which is too much here. But, finding µ such that µ ≤ √ρ ≤ 2µ is in log(log(ρ)) because bisection can be done on power. So, computing an approximation of λ should be done using this fast bisection: this way, this routine will requires eO(B) binary operations and not eO(B2_{) binary operations (of course, there are dedicated algorithms}

to compute square root, but there are not required).

Finally, flooring (at a given numerical resolution) all the internal variables v could be dramatic due to the log. But ceiling v is acceptable: let consists the rounding op-eration vround = int(Q ∗ v + 1)/Q (which should be done after Newton and

nor-malization). Log are not affected, thus, only the improvement of vTAATv should be bounded. But, this is just F (vround) ≤ F (v) + 2|v − vround|1TAATv so F (vround) ≤

F (v) + 2 Q1

T_AAT_{v. But thank to approximate normalization v}T_AAT_{v ∈ [}1

2M, 2M ],

but it implies that AmATv ≤ 2

√

M and 1T_AAT_{v ≤ 2M}√_{M .}

Thus, for Q ≥ 2Υ × 2M√M , then, F (vround) ≤ F (v) +_2Υ1 .

Now, each Newton step guarantee a decrease of at least 1₄ − log(5

4) in phase 1.

By setting Q such that _Υ1 ≤ 1 4− log(

5

4), one could guarantee that Newton step +

rounding still lead to a constant decrease of at least1₂ 1₄− log(5

4) (because the effect

of Newton is always at most twice the degradation of the rounding). Thus, with this trick, only twice the number of damped Newton step are required to reach a solution, while keeping a mastered infinite fraction representation (with a common denominator Q having very small binary size as Q = 400M√M is enough, and, numerators with binary size also bounded by eO(M log(xT_{x)) (as v ≤ e}_{O(M x}T_{x) because}vT_v

xT_x ≤ 2M ).

Thus, the binary size of all numbers encounter during computation is kept to eO(L) by applying the offered rounding strategy at the end of each damped Newton step. During a step, binary size will not exceed eO(M L) (because there is eO(M ) multiplica-tions).

This leads to an algorithm without any numerical issue using exact infinite integer representation (during a step) + rounding (at the end of each step). And each steps is thus just the resolution of a M × M linear system, with entries of size eO(L): it can be done in eO(Mω_{L) binary operation using fast multiplication of numbers. So binary}

complexity just add a L factor as all numbers encountered during computation are not to high.

Theorem 3: Self concordant Perceptron with rounding solves linear feasibility instance A using less than eO(Mω_{M L}2_{) binary operations (even for input given}

as fraction of arbitrary large integer inputs, typically encoded with arbitrary large vectors).

(9)

Steps 3.a and 3.b correspond to the second phase of optimization, which is negligible (log log(L)), so binary consideration is not mandatory during this step. During first phase, rounding should be applying (2.e - 2.g) to take care of the binary size of the numbers. The proposed theorems guarantee that inner loop 2 offers a strict decrease of F , of at least 14−log(

5 4)

8 (despite 2.a - 2.c - 2.e). Indeed, it seems that an updated version

of this algorithm with an additional little scaling of rows of A (by a factor M√M ) could even allow to round value of v into integer !

Algorithm: self concordant Perceptron: 1. initialize v = _M1, Q = 400M√M 2. while ¬(AAT_{v > 0) and λ}

F(v) ≥ 1₄

(a) compute θ a 2 approximation of λ(v) i.e. θ ≤ λF(v) ≤ 2θ

(b) v = v −_1+2θ1 (∇2vF )−1(∇vF ) (c) compute µ a 2 approximation of q M vTAATv (d) v = µv (e) w = int(Q × v + 1)//Q (f) assert wT_AAT_{w ≤ v}T_AAT_{v +} 14−log(54) 4 (g) v = w 3. while ¬(AATv > 0)

(a) compute θ a 2 approximation of λ(v) (b) v = v −_1+2θ1 (∇2

vF )−1(∇vF )

References

[1] Xavier Allamigeon, Pascal Benchimol, St´ephane Gaubert, and Michael Joswig. Log-barrier interior point methods are not strongly polynomial. Journal on Ap-plied Algebra and Geometry, 2018.

[2] Andris Ambainis, Yuval Filmus, and Franc¸ois Le Gall. Fast matrix multiplication: limitations of the coppersmith-winograd method. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 585–593, 2015. [3] Chih-Chung Chang and Chih-Jen Lin. Libsvm: A library for support vector

ma-chines. ACM transactions on intelligent systems and technology (TIST), 2011. [4] Sergei Chubanov. A polynomial projection algorithm for linear feasibility

(10)

[5] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In Proceedings of the 51st annual ACM SIGACT symposium on theory of computing, 2019.

[6] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learn-ing, 1995.

[7] George B et. al. Dantzig. The generalized simplex method for minimizing a linear form under linear inequality restraints. In Pacific Journal of MathematicsAmeri-can Journal of Operations Research, 1955.

[8] John Dunagan and Santosh Vempala. A simple polynomial-time rescaling algo-rithm for solving linear programs. Mathematical Programming, 114(1):101–114, 2008.

[9] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learn-ing research, 2008.

[10] Jacek Gondzio. Convergence analysis of an inexact feasible interior point method for convex quadratic programming. SIAM Journal on Optimization, 23(3):1510– 1527, 2013.

[11] Thorsten Joachims. Svmlight: Support vector machine. SVM-Light Support Vec-tor Machine http://svmlight. joachims. org/, University of Dortmund, 1999. [12] Narendra Karmarkar. A new polynomial-time algorithm for linear programming.

In Proceedings of the sixteenth annual ACM symposium on Theory of computing, 1984.

[13] Leonid Khachiyan. A polynomial algorithm for linear programming. Doklady Akademii Nauk SSSR, 1979.

[14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in o (vrank) iterations and faster algorithms for maxi-mum flow. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE, 2014.

[15] Md Sarowar Morshed, Md Saiful Islam, and Md Noor-E-Alam. Accelerated sam-pling kaczmarz motzkin algorithm for the linear feasibility problem. Journal of Global Optimization, 77(2):361–382, 2020.

[16] Arkadi Nemirovski. Interior point polynomial time methods in convex program-ming. Lecture notes, 2004.

[17] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. Siam, 1994.

[18] Javier Pe˜na and Negar Soheili. A deterministic rescaled perceptron algorithm. Mathematical Programming, 155(1-2):497–510, 2016.

(11)

[19] John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines, 1998.

[20] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 1958.

[21] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 2011. [22] Hans Ulrich Simon. On the complexity of working set selection. Theoretical

Computer Science, 382(3):262–279, 2007.

Appendix: Basic theoretical foundations of linear

pro-gramming

Maximal sub determinant of a matrix

An important result for linear programming and linear feasibility is Hadamard bound on the maximal sub determinant of a matrix: ∀A ∈ ZN ×N _{with all entries bounded}

by 2B (i.e. ∀i, j ∈ {1, ..., N }, |Ai,j| ≤ 2B), then, Det(A) ≤ NN2NB. By

exten-sion, A ∈ ZM ×N with all entries bounded by 2B, any submatrix of A has maximal determinant bounded by NN2NB (precisely one could consider min(N, M ) instead of N ). This maximal sub determinant of A will be written Ω(A) in the paper i.e. Ω(A) = max

i1,...,ir,j1,...,js

Det(A{i1,...,ir}×{j1,...js}).

Combining with Cramer rule, it leads that if Ax = b, then, |xn| ≤ Ω(A) × (P m

bm)

and if xn 6= 0, then, |xn| ≥ _Ω(A)1 . It is also true for vertex of a polygon defined by

Ax ≥ b.

Finally, this Hadamard bound can even be refined: if L is the total binary size of A then log(Ω(A)) ≤ eO(L) (precisely log(Ω(A)) ≤ eO(min(N B, L)), but, for classical complexity measurement, log(Ω(A)) ≤ eO(L)).

Equivalence between linear programming and linear feasibility

This paper provides an algorithm algo0which returns v such that AATv > 0 on an

input A assuming ∃x / Ax ≥ 1 (undefined behaviour otherwise - v is positive but this does not matter). Trivially,it is thus possible to form algo1which returns x such

that Ax > 0 on input A assuming such x exists by returning ATalgo₀(A) (undefined behaviour otherwise).

• Thank to algo₁, one could form algo₂(A, b) which returns x such that Ax > b assuming such x exists (undefined behaviour otherwise). Indeed, let consider any A, b such that ∃x / Ax > b, finding such x is equivalent to find a pair x, t such that Ax − t × b > 0 and t > 0, because x_t is then a solution of the original problem. Let A1the matrix A plus 1 as additional column and (0 1) as

(12)

additional row. Thus, one can get (x1 t1) by computing algo1(A) and returning x1

t1 as output of algo2(A, b).

Importantly, only constant number of variables/constraints are added, and, binary size is not increased. So complexity of algo₂(A, b) is the same than algo1(A, b).

• Thank to algo₂, one could form algo₃(A, b) which returns x such that Ax ≥ b assuming such x exists. Indeed, if ∃x/Ax ≥ b, then a fortiori ∃x, t such that Ax + t1× > b, 0 < t < _Ω(A)1 (Ω(A) is the maximal subdeterminant of A -see beginning of appendix). So, one could call algo1on A2, b2with A2being

A plus 1 column plus a row with 0 and Ω(A) and b2being b plus two 1. Thus,

algo₂(A2, b2) = x2, t2.

Now, one could consider greedy improvement of min t

x,t / Ax+t1≥b,t≥0initialized

from (x2, t2). Such greedy improvement can be performed by projecting (x, t)

on {(x, t) / Ax + t1 ≥ b} while minimizing t. One such greedy step can simply be done by looking for χ, τ such that ASχ + t1S = 0 and τ = −1 with S the

saturated rows in Ax + t1 ≥ b. If no such χ, τ exists, the greedy improvement has terminated otherwise one could do (x, t) ← (x + µχ, t + µτ ) with µ such that Ax + t1 ≥ b, t ≥ 0. There will be no more than M such greedy purification because one row enter the saturated ones at each step.

When this greedy process terminates, this leads to ˆx, ˆt with Aˆx+ˆt1 ≥ b, 0 ≤ ˆt ≤ t2 < _Ω(A)1 but ˆx, ˆt is a vertex of A. So Cramer rule applies, and so ˆt = Det(S_Det(S)t)

with S a sub matrix of A and Stthe Cramer partial submatrix related to t. But

ˆ

t ≤ t2 ≤ _Ω(A)1 , so ˆt = 0, and thus, Aˆx ≥ b. So, this projection of x2, t2gives

x3= algo3(A, b).

Importantly, the binary size of A2, b2is just twice the binary size of A, b because

log(Ω(A)) ≤ L(A), so L(A2) = L(A) + log(Ω(A)) ≤ 2L(A) and only a

con-stant number of variables constraints are added. so complexity of algo₃(A, b) is the same than algo₂(A, b). (Currently, maximal binary size in algo3is increased

by a factor N compared to algo₂- this can have impact on binary complexity but for arithmetic complexity, algo₃(A, b) is just twice algo2(A, b).)

• Let any A, b - without assumption - solving Ax ≥ b (or producing a certificate that no solution exists) is equivalent to solve min

z /Az+t1≥b,t≥0t (there is a solution

if the minimum is 0). Yet, this last linear program is structurally feasible (x = 0 and a sufficiently large t provided a feasible point) and bounded because t ≥ 0. Thus, primal dual theory gives a system Aprimal−dual(x y) ≥ bprimal−dual

whose solution contains solution of the linear program min

z /Az+t1≥b,t≥0t.

Applying algo3(Aprimal−dual, bprimal−dual) provides thus such xprimal−dual,

yprimal−dual from which one could restore x3, t3 with either t3 = 0 and so

Ax3≥ b or t36= 0.

This leads to an algorithm algo₄ which is able to find x such that Ax ≥ b (or to produce a certificate that no solution exists) without assumption on A, b about the existence or not of such x.

(13)

Importantly, the number of variables-constraints is only scaled two folds when computing the primal dual, so from theoretical point of view, it does not change the complexity between algo4and algo3.

• Finally, for any A, b, c without any assumption solving min

x /Ax≥bc T

x can be done with 2 algo₄calls and one algo₃call:

– one to know if the problem is feasible i.e. algo4(A, b)

– one on the dual to known if it is bounded algo4(Adual, bdual)

– and one call to algo3on the primal dual to get the optimal solution (if

pre-vious two computations certify that the problem is feasible and bounded). Again, from theoretical point of view, the complexity does not change: it only does 3 calls on instances only scaled 2 times. At the end, it returns the optimal solution or a certificate that the problem is not feasible or not bounded.

Thus, algo₀ which only returns v such that AAT_{v > 0 on input A if there}

ex-ists such v (undefined behaviour otherwise) allows to build with same complexity algo₅(A, b, c) which solves min

x /Ax≥bc

T_{x or returns certificate that problem is either}

infeasible or unbounded.

Let stress that the opposite way is trivial algo₅(AAT_{, 1, 0) is a correct}

implemen-tation of algo₁(A) for any A.

Normalization

Let A such that ∃x / Ax > 0, then let consider A which has 4 times more constraint and 2 additional variables (so complexity does not change) built as follow: for all constraint An, A gets the 4 constraints 2An:: ±1 :: ±2AnATn.

The point is that each such new constraints has a norm 4AnATn+ 1 + 4(AnATn)2=

(2AnATn + 1)2. So, each new constraint can be normalized on Q as the root of the

norm is an integer (while binary size is only scaled twice). Importantly, (x 0 0) is a solution for A, and inversely, any solution χ for A is a solution for A when removing the last two coordinates (if not zeros, it always make one from the 4 constraints worse). Let stress that normalizing is mainly for understanding - the offered algorithm works with the raw matrix but proof is a bit harder.

Different linear feasibility

Linear feasibility sometimes refers to different problem. Yet, all those forms are some-how equivalent (with Ax > 0 being the most general). Those forms are:

1. finding Ax > 0

2. finding Ax ≥ 0, Ax 6= 0 3. finding Ax ≥ 0, x 6= 0

(14)

4. finding x > 0, Ax = 0

First let consider y ∈ Ker(A), then, A(x+y) = Ax so if Ax > 0, then, A(x+y) > 0. In particular A(x −√xTy

xT_xyT_y.y) > 0. So, let consider a orthonormal basis y1, ..., yK

of Ker(A), then there is a solution to the problem Ax > 0, xT_y

1 = ... = xTyK = 0

(let take any solution and remove the orthogonal projection). But, Ax > 0, xT_y 1 =

... = xTyK = 0 is just Aχ > 0 when using constraint xTy1 = ... = xTyK = 0 to

remove K variables.

So when considering any of the first 3 linear feasibility versions, one could consider that Ker(A) = {0} (importantly, this is usually not the case for AT otherwise just computing AATx = 1 solves the problem).

Thus version 3 and version 2 are strictly equivalent, and, in both case as Ax ≥ 0, Ax 6= 0 there exists a partition I, J and x such that AIx > 0, AJx = 0. Now,

considering AJ as an instance, one could find a partition K, J − K of J and y such

that AKy > 0 and AJ −K = 0. But, the point is that AK(x + y) = AKy because

AKx = 0. Of course, AIy may be negative, but, finite. So, there exists λ such that

AI+K(λx + y) > 0, AJ −K(λx + y) = 0.

So version 2 or 3 allows to solve version 1 (yet M instances of version 2 or 3 are required to solve one instance of version 1).

Finally, version 4 applied to A :: −A :: I provides solution to version 1, and, inversely, removing variables in version 4 leads to a version 1 instance.

SVM

Assuming ∃x ∈ QN /Ax > 0, then, ∃x ∈ QN ≥ 1 because 1

minmAmx × x verifies

Ax ≥ 1 (if Ax > 0) with A ∈ ZM ×N_{, then, Perceptron will converge in e}_O(xT_x)

steps while the self concordant Perceptron in eO(M log(xT_x)).

Now, this claim holds for any x. Typically, let consider x with Ax ≥ 1 such that there is a maximal number of rows saturated and a maximal number of null coordinates. The impossibility to increase the number of constraints implies that x is the solution of a system of equation containing rows from A and rows from I. Such each coordinate of x is bounded by a maximal determinant from A. So, log(xT_{x) ≤ log(Ω(A)) ≤ e}_O(L).

Precisely, log(xTx) ≤ eO(min(L, N B)) but the L part is the classical way to measure complexity.

Let stress that the bound could also be expressed by introducing the solution of the support vector machine problem [6] which consists in solving min xTx