Online Non-Monotone DR-Submodular Maximization


Nguyễn Kim Thắng, Abhinav Srivastav
IBISC, Univ. Evry, University Paris-Saclay, France

Abstract

In this paper, we study the fundamental problem of maximizing continuous DR-submodular functions, which have real-world applications in machine learning, economics, operations research and communication systems. This class captures a subclass of non-convex optimization that admits both theoretical and practical guarantees. Here, we focus on minimizing regret for online arriving non-monotone DR-submodular functions over down-closed and general convex sets.

First, we present an online algorithm that achieves a $1/e$-approximation ratio with regret $O(T^{3/4})$ for maximizing DR-submodular functions over any down-closed convex set. Note that the approximation ratio of $1/e$ matches the best-known offline guarantee. Next, we give an online algorithm that achieves an approximation guarantee (depending on the search space) for the problem of maximizing non-monotone continuous DR-submodular functions over a general convex set (not necessarily down-closed). Finally, we run experiments to verify the performance of our algorithms on problems arising in the machine learning domain with real-world datasets.

1 Introduction

Continuous DR-submodular optimization is a subclass of non-convex optimization that is an upcoming frontier in machine learning. Roughly speaking, a differentiable non-negative bounded function $F : [0,1]^n \to [0,1]$ is DR-submodular if $\nabla F(\mathbf{x}) \ge \nabla F(\mathbf{y})$ for all $\mathbf{x}, \mathbf{y} \in [0,1]^n$ where $x_i \le y_i$ for every $1 \le i \le n$. (Note that, w.l.o.g. after normalization, we assume that $F$ takes values in $[0,1]$.) Intuitively, continuous DR-submodular functions represent the diminishing returns property or the economy of scale in continuous domains. DR-submodularity has been of great interest [3, 1, 10, 22, 31, 35]. Many problems arising in machine learning and statistics, such as Non-definite Quadratic Programming [24], Determinantal Point Processes [26] and log-submodular models [12], to name a few, have been modelled using the notion of continuous DR-submodular functions.

In the past decade, the online computational framework has been quite successful in tackling a wide variety of challenging problems and in capturing many real-world problems with uncertainty. Within this framework, we focus on the model of online learning. In online learning, at any time step, given a history of actions and a set of associated reward functions, the online algorithm first chooses an action from a set of feasible actions; then, an adversary selects a reward function. The objective is to perform as well as the best fixed action in hindsight. This setting has been extensively explored in the literature, especially in the context of convex functions [23].

In fact, several algorithms with theoretical approximation guarantees are known for maximizing (offline and online) DR-submodular functions. These guarantees, however, hold under assumptions based on the monotonicity of the functions coupled with the structure of the convex sets, such as the unconstrained hypercube and down-closed sets. Yet a majority of real-world problems can be formulated as non-monotone DR-submodular functions over convex sets that are not necessarily down-closed. For example, the problems of Determinantal Point Processes, log-submodular models, etc. can be viewed as non-monotone DR-submodular maximization problems. Although several algorithms for non-monotone (set) submodular maximization are known, and they are built on maximizing the corresponding multilinear extensions (a particular class of DR-submodular functions), they do not extend to general DR-submodular functions, since the optimal solutions of the former are integer points while those of the latter are not. Besides, general convex sets include conic convex sets, up-closed convex sets, mixtures of covering and packing linear constraints, etc., which appear in many applications. Among others, conic convex sets play a crucial role in convex optimization. Conic programming [4], an important subfield of convex optimization, consists of optimizing an objective function over a conic convex set. Conic programming reduces to linear programming and semi-definite programming when the objective function is linear and the convex cones are the positive orthant $\mathbb{R}^n_+$ and the positive semidefinite matrices $\mathbb{S}^n_+$, respectively. Optimizing non-monotone DR-submodular functions over a (bounded) conic convex set (not necessarily down-closed) is an interesting and important problem both in theory and in practice. To the best of our knowledge, no prior work has been done on maximizing DR-submodular functions over conic sets in the online setting. The limits of the current theory [3, 1, 10, 22, 31, 38] motivate us to develop online algorithms for non-monotone functions.

In this work, we explore the online problem of maximizing non-monotone DR-submodular functions over the hypercube, over down-closed¹ convex sets and over general convex sets. Formally, we consider the following setting for DR-submodular maximization: given a convex domain $\mathcal{K} \subseteq [0,1]^n$ in advance, at each time step $t = 1, 2, \ldots, T$, the online algorithm first selects a vector $\mathbf{x}^t \in \mathcal{K}$. Subsequently, the adversary reveals a non-monotone continuous DR-submodular function $F^t$ and the algorithm receives a reward of $F^t(\mathbf{x}^t)$. The objective is to maximize the total reward. We say that an algorithm achieves a $(r, R(T))$-regret if

$$\sum_{t=1}^{T} F^t(\mathbf{x}^t) \ge r \cdot \max_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} F^t(\mathbf{x}) - R(T).$$

In other words, $r$ is the approximation ratio that measures the quality of the online algorithm compared to the best fixed solution in hindsight, and $R(T)$ represents the regret in the classic terms. Equivalently, we say that the algorithm has $r$-regret at most $R(T)$. Our goal is to design online algorithms with $(r, R(T))$-regret where $0 < r \le 1$ is as large as possible and $R(T)$ is sub-linear in $T$, i.e., $R(T) = o(T)$.
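The definition of $(r, R(T))$-regret can be made concrete with a few lines of Python. This is only a minimal sketch: the function names `r_regret` and `satisfies_regret_bound` are hypothetical, and the per-round rewards of the algorithm and of the best fixed point in hindsight are assumed to be given as plain lists of numbers.

```python
from typing import Sequence


def r_regret(alg_rewards: Sequence[float],
             best_fixed_rewards: Sequence[float],
             r: float) -> float:
    """Return r * sum_t F^t(x*) - sum_t F^t(x^t).

    alg_rewards[t]        : reward F^t(x^t) collected by the online algorithm
    best_fixed_rewards[t] : reward F^t(x*) of the best fixed point in hindsight
    r                     : approximation ratio, 0 < r <= 1
    """
    if len(alg_rewards) != len(best_fixed_rewards):
        raise ValueError("both reward sequences must cover the same rounds")
    return r * sum(best_fixed_rewards) - sum(alg_rewards)


def satisfies_regret_bound(alg_rewards: Sequence[float],
                           best_fixed_rewards: Sequence[float],
                           r: float,
                           bound: float) -> bool:
    """Check the (r, R(T))-regret inequality for a given value R(T) = bound."""
    return r_regret(alg_rewards, best_fixed_rewards, r) <= bound
```

For instance, with three rounds in which the algorithm earns 0.2, 0.3 and 0.1 while the best fixed point earns 0.9 each round, the $1/2$-regret is $0.5 \cdot 2.7 - 0.6 = 0.75$.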

1.1 Our contributions and techniques

In this paper, we provide algorithms with performance guarantees for the problem over down-closed and general convex sets. Our contributions and techniques are summarized as follows (see also Table 1, where the entries in red correspond to our contributions).

1.1.1 Maximizing online non-monotone DR-submodular functions over down-closed convex sets

The natures of online DR-submodular maximization for monotone and non-monotone functions are quite different. In the setting of monotone functions, the best approximation guarantee is $(1 - 1/e)$ [10], and one can prove that the natural gradient descent algorithm attains a $1/2$-approximation [10]. However, for online non-monotone DR-submodular maximization, no constant guarantee is known, and it is likely that natural algorithms such as gradient descent, mirror descent, etc. do not achieve any constant approximation, even a small one. An illustrative example of poor local optima (for the coordinate ascent algorithm) can be found in [3, Appendix A].

¹ $\mathcal{K}$ is down-closed if for every $\mathbf{z} \in \mathcal{K}$ and $\mathbf{y} \le \mathbf{z}$, we have $\mathbf{y} \in \mathcal{K}$.

| Setting | Constraint set | Monotone | Non-monotone |
|---|---|---|---|
| Offline | hypercube | | $\frac{1}{2}$-approx [3], [31] |
| Offline | down-closed | $(1 - \frac{1}{e})$-approx [2] | $\frac{1}{e}$-approx [1] |
| Offline | general | $(1 - \frac{1}{e})$-approx [29] | $\frac{1 - \min_{\mathbf{x} \in \mathcal{K}} \lVert \mathbf{x} \rVert_\infty}{3\sqrt{3}}$-approx [14] |
| Online | hypercube | | $\left(\frac{1}{2}, O(\sqrt{T})\right)$ (deducible from [17]) |
| Online | down-closed | | $\left(\frac{1}{e}, O(T^{3/4})\right)$ (this paper) |
| Online | general | $\left(1 - \frac{1}{e}, O(\sqrt{T})\right)$ [10] | $\left(\frac{1 - \min_{\mathbf{x} \in \mathcal{K}} \lVert \mathbf{x} \rVert_\infty}{3\sqrt{3}}, O\!\left(\frac{T}{\ln T}\right)\right)$ (this paper) |

Table 1: Summary of results on DR-submodular maximization. The entries represent the best-known $(r, R(T))$-regret; our results (shown in red in the original) are marked "this paper". The entry for the online hypercube (shown in blue in the original) is not explicitly stated in the literature but can be deduced from [17] or from our technique. The guarantees in blank cells can be deduced from a more general one.

The main result of our paper is the first online algorithm that achieves a constant approximation ratio with regret $O(T^{3/4})$ for maximizing non-monotone DR-submodular functions over any down-closed convex set, where $T$ is the number of time steps. Moreover, the approximation ratio is $1/e$, which matches the best guarantee in the offline setting.

Our algorithm builds on the Meta-Frank-Wolfe algorithm introduced by Chen et al. [10] for monotone DR-submodular maximization. Their Meta-Frank-Wolfe algorithm combines the framework of meta-actions proposed in [36] with a variant of the Frank-Wolfe algorithm proposed in [2] for maximizing monotone DR-submodular functions. Informally, at every time $t$, our algorithm starts from the origin $\mathbf{0}$ ($\mathbf{0} \in \mathcal{K}$ since $\mathcal{K}$ is down-closed) and executes $L$ steps of the Frank-Wolfe algorithm, where the update vector at iteration $1 \le \ell \le L$ is constructed by combining the output of an optimization oracle $\mathcal{E}_\ell$ with the current vector, so that we can exploit the concavity property in positive directions of DR-submodular functions. The solution $\mathbf{x}^t$ is produced at the end of the $L$-th step. Subsequently, after observing the function $F^t$, the algorithm carefully defines a vector $\mathbf{d}^t_\ell$ and feeds back $\langle \cdot, \mathbf{d}^t_\ell \rangle$ as the reward function at time $t$ to the oracle $\mathcal{E}_\ell$ for $1 \le \ell \le L$. The most important point distinguishing our algorithm from that in [10] lies in the oracles. Their algorithm uses linear oracles, which are appropriate for maximizing monotone DR-submodular functions. However, it is not clear whether linear or even convex oracles are strong enough to achieve a constant approximation for non-monotone functions.

While aiming for a $1/e$-approximation (the best known approximation in the offline setting), we consider the following online non-convex problem, referred to throughout this paper as the online vee learning problem. At each time $t$, the online algorithm knows $\mathbf{c}^t \in \mathcal{K}$ in advance and selects a vector $\mathbf{x}^t \in \mathcal{K}$. Subsequently, the adversary reveals a vector $\mathbf{a}^t \in \mathbb{R}^n$ and the algorithm receives a reward $\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle$. Given two vectors $\mathbf{x}$ and $\mathbf{y}$, the vector $\mathbf{x} \vee \mathbf{y}$ has $i$-th coordinate $(\mathbf{x} \vee \mathbf{y})_i = \max\{x_i, y_i\}$. The goal is to maximize the total reward over a time horizon $T$.

In order to bypass the non-convexity of the online vee learning problem, we propose the following methodology. Unlike dimension reduction techniques that aim at reducing the dimension of the search space, we lift the search space and the functions to a higher dimension so as to benefit from their properties in this new space. Concretely, at a high level, given a convex set $\mathcal{K}$, we define a “sufficiently dense” lattice $\mathcal{L}$ in $[0,1]^n$ such that every point in $\mathcal{K}$ can be approximated by a point in the lattice. The term “sufficiently dense” corresponds to the fact that the values of any Lipschitz function can be approximated by the function values at lattice points. As the next step, we lift all lattice points in $\mathcal{K}$ to a high-dimensional space so that they form a subset $\widetilde{\mathcal{K}}$ of vertices (corner points) of the unit hypercube in the newly defined space. Interestingly, the reward function $\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle$ can be transformed into a linear function in the new space. Hence, our algorithm for the online vee problem consists of updating at every time step a solution in the high-dimensional space using a linear oracle and projecting the solution back to the original space.

Once the solution is projected back to the original space, we construct appropriate update vectors for the original DR-submodular maximization and feedback rewards for the online vee oracle. Exploiting the underlying properties of DR-submodularity, we show that our algorithm achieves the regret bound of $(1/e, O(T^{3/4}))$. A useful feature of our algorithm is that it is projection-free when appropriate projection-free oracles are used.

Over the hypercube. Restricting the DR-submodular maximization problem to the hypercube $[0,1]^n$, a particular down-closed convex set that has been widely studied, our approach leads to a $1/2$-approximation algorithm with regret bound $O(\sqrt{T \log T})$. Note that this result can be deduced from [17], but as it is not explicitly stated in the literature, we present the proof in the appendix for completeness.

1.1.2 Maximizing online non-monotone DR-submodular functions over general convex sets

We go beyond the down-closed structure by considering general convex sets. Building upon the salient ideas of the previous algorithm, we prove that for general convex sets, the Meta-Frank-Wolfe algorithm with adapted step-sizes guarantees $\left(\frac{1 - \min_{\mathbf{x} \in \mathcal{K}} \|\mathbf{x}\|_\infty}{3\sqrt{3}}, O\!\left(\frac{T}{\ln T}\right)\right)$-regret in expectation. Notably, if the set $\mathcal{K}$ contains $\mathbf{0}$ (for example, $\mathcal{K}$ is the intersection of the positive semidefinite matrix cone and the hypercube $[0,1]^n$), then the algorithm guarantees in expectation a $\left(\frac{1}{3\sqrt{3}}, O\!\left(\frac{T}{\log T}\right)\right)$-regret. To the best of our knowledge, this is the first constant-approximation algorithm for the problem of online non-monotone DR-submodular maximization over a non-trivial convex set that goes beyond down-closed sets. Note that any algorithm for the problem over a non-down-closed convex set (in particular, an arbitrary convex set containing the origin) that guarantees a constant approximation must in general require exponentially many value queries to the function [37].

Note that the quality of the solution, specifically the approximation ratio, depends on the initial solution $\mathbf{x}_0$. This confirms an observation made in various contexts that initialization plays an important role in non-convex optimization.

1.2 Related Work

In this section, we give a summary of the best-known results on DR-submodular maximization. The domain has been investigated more extensively in recent years due to its numerous applications in statistics and machine learning, for example active learning [19], viral marketing [25], network monitoring [20], document summarization [27], crowd teaching [33], feature selection [16], deep neural networks [15], diversity models [13] and recommender systems [21].


Offline setting. Bian et al. [2] considered the problem of maximizing monotone DR-submodular functions subject to a down-closed convex set and showed that the greedy method proposed by [7], a variant of the well-known Frank-Wolfe algorithm in convex optimization, guarantees a $(1 - 1/e)$-approximation. However, it has been observed by [22] that the greedy method is not robust in stochastic settings (where only unbiased estimates of gradients are available). Subsequently, Hassani et al. [29] proposed a $(1 - 1/e)$-approximation algorithm for maximizing monotone DR-submodular functions over general convex sets in stochastic settings by means of a new variance reduction technique.

The problem of maximizing non-monotone DR-submodular functions is much harder. Bian et al. [3] and Niazadeh et al. [31] independently presented algorithms with the same approximation guarantee of $1/2$ for the problem of non-monotone DR-submodular maximization over the hypercube ($\mathcal{K} = [0,1]^n$). These algorithms are inspired by the bi-greedy algorithms in [6, 5]. Bian et al. [1] made a further step by providing a $(1/e)$-approximation algorithm where the convex sets are down-closed. Mokhtari et al. [29] also presented an algorithm that achieves a $(1/e)$-approximation for non-monotone continuous DR-submodular functions over a down-closed convex domain and that uses only stochastic gradient estimates. Note that when aiming for approximation algorithms (in polynomial time), the restriction to down-closed polytopes is unavoidable. Specifically, Vondrak [37] proved that any algorithm for the problem over a non-down-closed set that guarantees a constant approximation must in general require exponentially many value queries to the function. Recently, Durr et al. [14] presented an algorithm that achieves a $\left(\frac{1 - \min_{\mathbf{x} \in \mathcal{K}} \|\mathbf{x}\|_\infty}{3\sqrt{3}}\right)$-approximation for non-monotone DR-submodular functions over general convex sets that are not necessarily down-closed.

Online setting. The results for online DR-submodular maximization are known only for monotone functions. Chen et al. [10] considered the online problem of maximizing monotone DR-submodular functions over general convex sets and provided an algorithm that achieves a $(1 - 1/e, O(\sqrt{T}))$-regret. Subsequently, Zhang et al. [38] presented an algorithm that reduces the number of per-function gradient evaluations from $T^{3/2}$ in [10] to 1 and achieves the same approximation ratio of $(1 - 1/e)$. Leveraging the idea of one gradient per iteration, they presented a bandit algorithm for maximizing monotone DR-submodular functions over a general convex set that achieves in expectation a $(1 - 1/e)$ approximation ratio with regret $O(T^{8/9})$. Note that in the discrete setting, Roughgarden and Wang [32] studied non-monotone (discrete) submodular maximization over the hypercube and gave an algorithm which guarantees the tight $(1/2)$-approximation and $O(\sqrt{T})$ regret. Recently and independently, Niazadeh et al. [30] provided a $(1/2, O(\sqrt{T}))$-regret algorithm for DR-submodular maximization over the hypercube. Their algorithm is built on an interesting framework that transforms offline greedy algorithms into online ones using Blackwell approachability. Our approach, based on the lifting procedure, is completely different from theirs.

2 Preliminaries and Notations

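Throughout the paper, we use bold face letters, e.g., $\mathbf{x}, \mathbf{y}$, to represent vectors. For every $S \subseteq \{1, 2, \ldots, n\}$, the vector $\mathbf{1}_S$ is the $n$-dimensional vector whose $i$-th coordinate equals 1 if $i \in S$ and 0 otherwise. Given two $n$-dimensional vectors $\mathbf{x}, \mathbf{y}$, we say that $\mathbf{x} \le \mathbf{y}$ iff $x_i \le y_i$ for all $1 \le i \le n$. Additionally, we denote by $\mathbf{x} \vee \mathbf{y}$ their coordinate-wise maximum vector, i.e., $(\mathbf{x} \vee \mathbf{y})_i = \max\{x_i, y_i\}$. Moreover, the symbol $\circ$ represents element-wise multiplication, that is, given two vectors $\mathbf{x}, \mathbf{y}$, the vector $\mathbf{x} \circ \mathbf{y}$ is such that its $i$-th coordinate is $(\mathbf{x} \circ \mathbf{y})_i = x_i y_i$. The scalar product is $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_i x_i y_i$ and the norm is $\|\mathbf{x}\| = \langle \mathbf{x}, \mathbf{x} \rangle^{1/2}$. In the paper, we assume that $\mathcal{K} \subseteq [0,1]^n$. We say that $\mathcal{K}$ is the hypercube if $\mathcal{K} = [0,1]^n$; $\mathcal{K}$ is down-closed if for every $\mathbf{z} \in \mathcal{K}$ and $\mathbf{y} \le \mathbf{z}$, we have $\mathbf{y} \in \mathcal{K}$; and $\mathcal{K}$ is general if $\mathcal{K}$ is a convex subset of $[0,1]^n$ without any special property.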

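Definition 1. A function $F : [0,1]^n \to \mathbb{R}^+ \cup \{0\}$ is diminishing-return (DR) submodular if for all vectors $\mathbf{x} \ge \mathbf{y} \in [0,1]^n$, any basis vector $\mathbf{e}_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ and any constant $\alpha > 0$ such that $\mathbf{x} + \alpha \mathbf{e}_i \in [0,1]^n$ and $\mathbf{y} + \alpha \mathbf{e}_i \in [0,1]^n$, it holds that

$$F(\mathbf{x} + \alpha \mathbf{e}_i) - F(\mathbf{x}) \le F(\mathbf{y} + \alpha \mathbf{e}_i) - F(\mathbf{y}). \tag{1}$$

Note that if the function $F$ is differentiable then the diminishing-return (DR) property (1) is equivalent to

$$\nabla F(\mathbf{x}) \le \nabla F(\mathbf{y}) \quad \forall\, \mathbf{x} \ge \mathbf{y} \in [0,1]^n. \tag{2}$$

Moreover, if $F$ is twice-differentiable then the DR property is equivalent to all of the entries of its Hessian being non-positive, i.e., $\frac{\partial^2 F}{\partial x_i \partial x_j} \le 0$ for all $1 \le i, j \le n$.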

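Besides, a differentiable function $F : \mathcal{K} \subseteq [0,1]^n \to \mathbb{R}^+ \cup \{0\}$ is said to be $\beta$-smooth if for any $\mathbf{x}, \mathbf{y} \in \mathcal{K}$, the following holds:

$$F(\mathbf{y}) \le F(\mathbf{x}) + \langle \nabla F(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\beta}{2} \|\mathbf{y} - \mathbf{x}\|^2 \tag{3}$$

or equivalently, the gradient is $\beta$-Lipschitz, i.e.,

$$\|\nabla F(\mathbf{x}) - \nabla F(\mathbf{y})\| \le \beta \|\mathbf{x} - \mathbf{y}\|. \tag{4}$$

Properties of DR-submodularity. In the following, we present properties of DR-submodular functions that are crucial in our analyses. The most important property is the concavity in positive directions, i.e., for every $\mathbf{x}, \mathbf{y} \in \mathcal{K}$ with $\mathbf{x} \ge \mathbf{y}$, it holds that

$$F(\mathbf{x}) \le F(\mathbf{y}) + \langle \nabla F(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle.$$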

Additionally, the following lemmas are useful in our analysis.

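Lemma 1 ([22]). For every $\mathbf{x}, \mathbf{y} \in \mathcal{K}$ and any DR-submodular function $F$, it holds that

$$\langle \nabla F(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \ge F(\mathbf{x} \vee \mathbf{y}) + F(\mathbf{x} \wedge \mathbf{y}) - 2F(\mathbf{x}),$$

where $\mathbf{x} \wedge \mathbf{y}$ denotes the coordinate-wise minimum of $\mathbf{x}$ and $\mathbf{y}$.

Lemma 2 ([18, 8, 1]). For every $\mathbf{x}, \mathbf{y} \in \mathcal{K}$ and for any DR-submodular function $F : [0,1]^n \to \mathbb{R}^+ \cup \{0\}$, it holds that $F(\mathbf{x} \vee \mathbf{y}) \ge (1 - \|\mathbf{x}\|_\infty) F(\mathbf{y})$.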
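The properties above can be checked numerically on a small example. The following sketch assumes a hypothetical non-monotone quadratic $F(\mathbf{x}) = \langle \mathbf{h}, \mathbf{x} \rangle + \frac{1}{2} \mathbf{x}^\top H \mathbf{x}$ in dimension 2, where all entries of $H$ are non-positive (so $F$ is DR-submodular) and $\mathbf{h}$ is chosen so that $F \ge 0$ on $[0,1]^2$; neither the function nor the helper names come from the paper.

```python
import numpy as np

# Hypothetical example: all Hessian entries are non-positive, hence DR-submodular.
H = np.array([[-1.0, -0.5],
              [-0.5, -1.0]])
h = np.array([1.0, 1.0])


def F(x):
    """Quadratic objective; non-negative on [0,1]^2 for this choice of H and h."""
    return float(h @ x + 0.5 * x @ H @ x)


def grad_F(x):
    """Gradient of the quadratic: h + H x."""
    return h + H @ x


def check_properties(num_samples=1000, seed=0, tol=1e-9):
    """Sample random pairs in [0,1]^2 and test the DR properties and Lemmas 1-2."""
    rng = np.random.default_rng(seed)
    for _ in range(num_samples):
        x, y = rng.random(2), rng.random(2)
        lo, hi = np.minimum(x, y), np.maximum(x, y)  # x ^ y and x v y
        # Antitone gradient: lo <= hi implies grad F(lo) >= grad F(hi).
        if np.any(grad_F(lo) < grad_F(hi) - tol):
            return False
        # Concavity in positive directions: F(hi) <= F(lo) + <grad F(lo), hi - lo>.
        if F(hi) > F(lo) + grad_F(lo) @ (hi - lo) + tol:
            return False
        # Lemma 1: <grad F(x), y - x> >= F(x v y) + F(x ^ y) - 2 F(x).
        if grad_F(x) @ (y - x) < F(hi) + F(lo) - 2.0 * F(x) - tol:
            return False
        # Lemma 2: F(x v y) >= (1 - ||x||_inf) F(y).
        if F(hi) < (1.0 - np.max(x)) * F(y) - tol:
            return False
    return True
```

In this example the gradient at $(1,1)$ is negative in both coordinates, so the function is indeed non-monotone, while `check_properties()` exercises the inequalities on random pairs of points.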

Variance Reduction Technique. Our algorithm and its analysis rely on a recent variance reduction technique proposed by [29]. This result has also been used in the context of online monotone submodular maximization [9, 38].

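Lemma 3 ([29], Lemma 2). Let $\{\mathbf{a}^t\}_{t=0}^{T}$ be a sequence of points in $\mathbb{R}^n$ such that $\|\mathbf{a}^t - \mathbf{a}^{t-1}\| \le C/(t+s)$ for all $1 \le t \le T$ with fixed constants $C \ge 0$ and $s \ge 3$. Let $\{\tilde{\mathbf{a}}^t\}_{t=1}^{T}$ be a sequence of random variables such that $\mathbb{E}[\tilde{\mathbf{a}}^t \mid \mathcal{H}^{t-1}] = \mathbf{a}^t$ and $\mathbb{E}\left[\|\tilde{\mathbf{a}}^t - \mathbf{a}^t\|^2 \mid \mathcal{H}^{t-1}\right] \le \sigma^2$ for every $t \ge 0$, where $\mathcal{H}^{t-1}$ is the history up to $t-1$. Let $\{\mathbf{d}^t\}_{t=0}^{T}$ be a sequence of random variables where $\mathbf{d}^0$ is fixed and subsequent $\mathbf{d}^t$ are obtained by the recurrence

$$\mathbf{d}^t = (1 - \rho^t)\, \mathbf{d}^{t-1} + \rho^t\, \tilde{\mathbf{a}}^t$$

with $\rho^t = \frac{2}{(t+s)^{2/3}}$. Then we have

$$\mathbb{E}\left[\|\mathbf{a}^t - \mathbf{d}^t\|^2\right] \le \frac{Q}{(t+s+1)^{2/3}},$$

where $Q = \max\left\{\|\mathbf{a}^0 - \mathbf{d}^0\|^2 (s+1)^{2/3},\; 4\sigma^2 + 3C^2/2\right\}$.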
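The recurrence of Lemma 3 is a weighted running average of noisy gradient estimates. A minimal sketch of it is given below; the function name `averaged_gradients` and the array-based interface are assumptions made for illustration only.

```python
import numpy as np


def averaged_gradients(noisy_grads, s=3, d0=None):
    """Apply d^t = (1 - rho^t) d^{t-1} + rho^t * a~^t with rho^t = 2 / (t + s)^(2/3).

    noisy_grads : array of shape (T, n); row t-1 holds the estimate a~^t
    s           : the constant s >= 3 of Lemma 3
    d0          : the fixed starting vector d^0 (defaults to the zero vector)
    Returns an array of shape (T, n) whose row t-1 is d^t.
    """
    noisy_grads = np.asarray(noisy_grads, dtype=float)
    d = np.zeros(noisy_grads.shape[1]) if d0 is None else np.asarray(d0, dtype=float)
    out = []
    for t, g in enumerate(noisy_grads, start=1):
        rho = 2.0 / (t + s) ** (2.0 / 3.0)  # step size decays like t^(-2/3)
        d = (1.0 - rho) * d + rho * g
        out.append(d.copy())
    return np.array(out)
```

With $s = 3$ the first weight is $\rho^1 = 2/4^{2/3} < 1$, so every $\mathbf{d}^t$ is a convex combination of $\mathbf{d}^{t-1}$ and the new estimate; later estimates receive smaller weights, which is what damps the noise.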


3 Online DR-Submodular Maximization over Down-Closed Convex Sets

3.1 Online Vee Learning

In this section, we give an algorithm for an online problem that will be the main building block in the design of algorithms for online DR-submodular maximization over down-closed convex sets. In the online vee learning problem, given a down-closed convex set $\mathcal{K}$, at every time $t$, the online algorithm receives $\mathbf{c}^t \in \mathcal{K}$ at the beginning of the step and needs to choose a vector $\mathbf{x}^t \in \mathcal{K}$. Subsequently, the adversary reveals a vector $\mathbf{a}^t \in \mathbb{R}^n$ and the algorithm receives a reward $\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle$. The goal is to maximize the total reward $\sum_{t=1}^{T} \langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle$ over a time horizon $T$.

The main issue in the online vee problem is that the reward functions are non-concave. In order to overcome this obstacle, we consider a novel approach that consists of discretizing and lifting the corresponding functions to a higher dimensional space.
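A one-dimensional instance already shows the non-concavity. The sketch below uses hypothetical values $a = 1$ and $c = 0.5$, for which the reward is $x \mapsto \max\{0.5, x\}$; the helper name `vee_reward` is an assumption for illustration.

```python
import numpy as np


def vee_reward(a, c, x):
    """Reward <a, c v x>, where v is the coordinate-wise maximum."""
    return float(np.dot(a, np.maximum(c, x)))


a = np.array([1.0])
c = np.array([0.5])
x0, x1 = np.array([0.0]), np.array([1.0])
mid = 0.5 * (x0 + x1)

# A concave function would satisfy value_mid >= chord; here it does not.
chord = 0.5 * vee_reward(a, c, x0) + 0.5 * vee_reward(a, c, x1)  # 0.5*0.5 + 0.5*1.0
value_mid = vee_reward(a, c, mid)                                # max(0.5, 0.5)
```

Here the midpoint value is 0.5 while the average of the endpoint values is 0.75, so the reward lies below its chord and cannot be concave in $\mathbf{x}$.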

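Discretization and lifting. Let $\mathcal{L}$ be a lattice such that $\mathcal{L} = \left\{0, \frac{1}{M}, \frac{2}{M}, \ldots, \frac{\ell}{M}, \ldots, 1\right\}^n$ where $0 \le \ell \le M$ for some parameter $M$ (to be defined later). Note that each $x_i = \frac{\ell}{M}$ for $0 \le \ell \le M$ can be uniquely represented by a vector $\mathbf{y}_i \in \{0,1\}^{M+1}$ such that $y_{i0} = \ldots = y_{i\ell} = 1$ and $y_{i,\ell+1} = \ldots = y_{i,M} = 0$. Based on this observation, we lift the discrete set $\mathcal{K} \cap \mathcal{L}$ to the $(n \times (M+1))$-dimensional space. Specifically, define a lifting map $m : \mathcal{K} \cap \mathcal{L} \to \{0,1\}^{n \times (M+1)}$ such that each point $(x_1, \ldots, x_n) \in \mathcal{K} \cap \mathcal{L}$ is mapped to the unique point $(y_{10}, \ldots, y_{1M}, \ldots, y_{n0}, \ldots, y_{nM}) \in \{0,1\}^{n \times (M+1)}$ where $y_{i0} = \ldots = y_{i\ell} = 1$ and $y_{i,\ell+1} = \ldots = y_{i,M} = 0$ iff $x_i = \frac{\ell}{M}$ for $0 \le \ell \le M$. Define $\widetilde{\mathcal{K}}$ to be the set $\{\mathbf{1}_X \in \{0,1\}^{n \times (M+1)} : \mathbf{1}_X = m(\mathbf{x}) \text{ for some } \mathbf{x} \in \mathcal{K} \cap \mathcal{L}\}$. Observe that $\widetilde{\mathcal{K}}$ is a subset of discrete points in $\{0,1\}^{n \times (M+1)}$. Let $\mathcal{C} := \mathrm{conv}(\widetilde{\mathcal{K}})$ be the convex hull of $\widetilde{\mathcal{K}}$.

Algorithm description. In our algorithm, at every time $t$, we output $\mathbf{x}^t \in \mathcal{L} \cap \mathcal{K}$. In fact, we reduce the online vee problem to an online linear optimization problem in the $(n \times (M+1))$-dimensional space. Given $\mathbf{c}^t \in \mathcal{K}$, we round every coordinate $c^t_i$ for $1 \le i \le n$ down to the largest multiple of $\frac{1}{M}$ that does not exceed $c^t_i$. In other words, the rounded vector $\bar{\mathbf{c}}^t$ has $i$-th coordinate $\bar{c}^t_i = \frac{\ell}{M}$ where $\frac{\ell}{M} \le c^t_i < \frac{\ell+1}{M}$. The vector $\bar{\mathbf{c}}^t \in \mathcal{L}$ and also $\bar{\mathbf{c}}^t \in \mathcal{K}$ (since $\bar{\mathbf{c}}^t \le \mathbf{c}^t$ and $\mathcal{K}$ is down-closed). Denote $\mathbf{1}_{C^t} = m(\bar{\mathbf{c}}^t)$. Besides, for each vector $\mathbf{a}^t \in \mathbb{R}^n$, define its correspondence $\tilde{\mathbf{a}}^t \in \mathbb{R}^{n \times (M+1)}$ such that $\tilde{a}^t_{i,j} = \frac{1}{M} a^t_i$ for all $1 \le i \le n$ and $0 \le j \le M$. Observe that $\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \rangle = \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle$ (where the second scalar product is taken in the space of dimension $n \times (M+1)$).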
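The lifting map can be sketched in a few lines of NumPy. The names `lift`, `unlift` and `lift_reward` are hypothetical, and lifted points are stored as $n \times (M+1)$ matrices rather than flattened vectors. Because coordinate $j = 0$ is always set to 1 in this encoding, the lifted inner product computed by this sketch equals $\langle \mathbf{a}, \mathbf{x} \rangle$ plus the decision-independent constant $\frac{1}{M} \sum_i a_i$.

```python
import numpy as np


def lift(x, M):
    """Map a lattice point x (coordinates ell/M) to its 0/1 threshold matrix.

    Row i has ones in columns 0..ell_i and zeros afterwards, shape (n, M + 1).
    """
    ell = np.rint(np.asarray(x, dtype=float) * M).astype(int)
    cols = np.arange(M + 1)
    return (cols[None, :] <= ell[:, None]).astype(float)


def unlift(Y, M):
    """Inverse of lift: (number of ones in row i, minus one) divided by M."""
    return (Y.sum(axis=1) - 1.0) / M


def lift_reward(a, M):
    """Lifted reward vector with entries a_i / M in every column of row i."""
    a = np.asarray(a, dtype=float)
    return np.repeat(a[:, None] / M, M + 1, axis=1)


M = 4
x = np.array([0.25, 1.0])
c = np.array([0.5, 0.0])
z, one_C = lift(x, M), lift(c, M)

# The lift commutes with the coordinate-wise maximum ...
commutes = np.array_equal(lift(np.maximum(c, x), M), np.maximum(one_C, z))
# ... and 1_C v z = z + 1_C o (1 - z), which makes the reward linear in z.
linearized = np.array_equal(np.maximum(one_C, z), z + one_C * (1.0 - z))
```

The second identity is the one that turns the vee reward into a linear function of the lifted decision, which is why a linear online oracle suffices in the lifted space.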

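Algorithm 1 Online algorithm for vee learning
Input: A convex set $\mathcal{K}$, a time horizon $T$.
1: Let $\mathcal{L}$ be a lattice in $[0,1]^n$ and denote the polytope $\mathcal{C} := \mathrm{conv}(\widetilde{\mathcal{K}})$.
2: Initialize arbitrarily $\mathbf{x}^1 \in \mathcal{K} \cap \mathcal{L}$ and $\mathbf{y}^1 = \mathbf{1}_{X^1} = m(\mathbf{x}^1) \in \widetilde{\mathcal{K}}$.
3: for $t = 1$ to $T$ do
4:   Observe $\mathbf{c}^t$, compute $\mathbf{1}_{C^t} = m(\bar{\mathbf{c}}^t)$.
5:   Play $\mathbf{x}^t$.
6:   Observe $\mathbf{a}^t$ (so compute $\tilde{\mathbf{a}}^t$) and receive the reward $\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle$.
7:   Set $\mathbf{y}^{t+1} \leftarrow \mathrm{update}_{\mathcal{C}}\left(\mathbf{y}^t;\ \tilde{\mathbf{a}}^{t'} \circ (\mathbf{1} - \mathbf{1}_{C^{t'}}) : t' \le t\right)$, where $\circ$ denotes the element-wise multiplication.
8:   Round: $\mathbf{1}_{X^{t+1}} \leftarrow \mathrm{round}(\mathbf{y}^{t+1})$.
9:   Compute $\mathbf{x}^{t+1} = m^{-1}(\mathbf{1}_{X^{t+1}})$.
10: end for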

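The formal description is given in Algorithm 1. In the algorithm, we use a procedure update. This procedure takes as arguments the polytope $\mathcal{C}$, the current vector $\mathbf{y}^t$ and the gradients of previous time steps $\tilde{\mathbf{a}}^{t'} \circ (\mathbf{1} - \mathbf{1}_{C^{t'}})$, and outputs the next vector $\mathbf{y}^{t+1}$. One can use different update strategies, in particular gradient descent or the follow-the-perturbed-leader algorithm (if aiming for a projection-free algorithm):

$$\mathbf{y}^{t+1} \leftarrow \mathrm{Proj}_{\mathcal{C}}\left(\mathbf{y}^t - \eta \cdot \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t})\right) \qquad \text{(Gradient Descent)}$$

$$\mathbf{y}^{t+1} \leftarrow \arg\min_{\mathbf{y} \in \mathcal{C}} \left\{ \eta \sum_{t'=1}^{t} \langle \tilde{\mathbf{a}}^{t'} \circ (\mathbf{1} - \mathbf{1}_{C^{t'}}), \mathbf{y} \rangle + \mathbf{n}^\top \mathbf{y} \right\} \qquad \text{(Follow-the-perturbed-leader)}$$

Besides, in the algorithm, we additionally use a procedure round in order to transform a solution $\mathbf{y}^t$ in the polytope $\mathcal{C}$ of dimension $n \times (M+1)$ into an integer point in $\widetilde{\mathcal{K}}$. Specifically, given $\mathbf{y}^t \in \mathrm{conv}(\widetilde{\mathcal{K}})$, we round $\mathbf{y}^t$ to $\mathbf{1}_{X^t} \in \widetilde{\mathcal{K}}$ using an efficient polynomial-time algorithm given in [28] for the approximate Carathéodory's theorem w.r.t. the $\ell_2$-norm. More precisely, given $\mathbf{y}^t \in \mathcal{C}$ and an arbitrary $\epsilon > 0$, the algorithm in [28] returns a set of $k = O(1/\epsilon^2)$ integer points $\mathbf{1}_{X^t_1}, \ldots, \mathbf{1}_{X^t_k} \in \widetilde{\mathcal{K}}$ in $O(1/\epsilon^2)$ time such that $\left\|\mathbf{y}^t - \frac{1}{k} \sum_{j=1}^{k} \mathbf{1}_{X^t_j}\right\| \le \epsilon$. Hence, given $\mathbf{1}_{X^t_1}, \ldots, \mathbf{1}_{X^t_k}$, which have been computed, the procedure round simply consists of rounding $\mathbf{y}^t$ to $\mathbf{1}_{X^t_j}$ with probability $1/k$. Finally, the solution $\mathbf{x}^{t+1}$ in the original space is computed as $m^{-1}(\mathbf{1}_{X^{t+1}})$.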
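To illustrate how the pieces of Algorithm 1 fit together, here is a minimal sketch specialized to the hypercube $\mathcal{K} = [0,1]^n$ with a follow-the-perturbed-leader update. This is an assumption-laden illustration, not the authors' implementation: the class name `HypercubeVeeLearner`, the uniform perturbation and the `noise_scale` parameter are all hypothetical. For the hypercube, maximizing a linear function over $\widetilde{\mathcal{K}}$ decomposes per coordinate into choosing the best threshold $\ell_i$, so the leader is already an integer point and the round procedure is not needed in this special case.

```python
import numpy as np


class HypercubeVeeLearner:
    """Follow-the-perturbed-leader sketch for online vee learning on [0,1]^n."""

    def __init__(self, n, M, noise_scale=1.0, seed=0):
        self.n, self.M = n, M
        self.noise_scale = noise_scale
        self.rng = np.random.default_rng(seed)
        # Cumulative lifted rewards: sum over past rounds of a~^t o (1 - 1_{C^t}).
        self.cum = np.zeros((n, M + 1))

    def play(self):
        """Return x^t on the lattice, maximizing the perturbed cumulative reward."""
        noise = self.noise_scale * self.rng.uniform(-1.0, 1.0, (self.n, self.M + 1))
        scores = self.cum + noise
        # Column 0 equals 1 for every threshold vector, so only columns 1..M
        # matter: threshold ell collects the prefix sum of columns 1..ell.
        prefix = np.concatenate(
            [np.zeros((self.n, 1)), np.cumsum(scores[:, 1:], axis=1)], axis=1)
        ell = np.argmax(prefix, axis=1)
        return ell / self.M

    def update(self, a, c):
        """Feed back the reward vector a^t and the vector c^t of round t."""
        a = np.asarray(a, dtype=float)
        c = np.asarray(c, dtype=float)
        # Round c down onto the lattice, then lift it to its threshold matrix.
        ell_c = np.floor(c * self.M + 1e-12).astype(int)
        one_C = (np.arange(self.M + 1)[None, :] <= ell_c[:, None]).astype(float)
        self.cum += (a[:, None] / self.M) * (1.0 - one_C)


def vee_reward(a, c, x):
    """Reward <a, c v x> of playing x against (a, c)."""
    return float(np.dot(a, np.maximum(c, x)))
```

A round then consists of calling `play()` to obtain $\mathbf{x}^t$, collecting `vee_reward(a, c, x)`, and calling `update(a, c)`. For a general down-closed $\mathcal{K}$ the per-coordinate decomposition no longer holds, and one would need the update over $\mathcal{C}$ and the rounding step described above.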

Lemma 4. Assume that $\|\mathbf{a}^t\| \le G$ for all $1 \le t \le T$ and that the update scheme is chosen either as the gradient ascent update or as another procedure with regret $O(\sqrt{T})$. Then, using the lattice $\mathcal{L}$ with parameter $M = (T/n)^{1/4}$ and $\epsilon = 1/\sqrt{T}$ (in the round procedure), Algorithm 1 achieves a regret bound of $O(G(nT)^{3/4})$. The total running time of the algorithm is $O(T^2)$.

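Proof. Observe that $\mathbf{y}^t$ is the solution of the online linear optimization problem in dimension $n \times (M+1)$ at time $t$, where the reward function at time $t$ is $\langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \cdot \rangle$. Moreover, recall that $\mathbf{x}^t \in \mathcal{L} \cap \mathcal{K}$ for every time $t$. First, by construction, we have:

$$m(\bar{\mathbf{c}}^t \vee \mathbf{x}^t) = m(\bar{\mathbf{c}}^t) \vee m(\mathbf{x}^t) = \mathbf{1}_{C^t} \vee \mathbf{1}_{X^t}, \qquad \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle = \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \vee \mathbf{1}_{X^t} \rangle.$$

Therefore,

$$\sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle\right] = \sum_{t=1}^{T} \mathbb{E}\left[\langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \vee \mathbf{1}_{X^t} \rangle\right].$$

Besides, for every vector $\mathbf{1}_C$ and every vector $\mathbf{z} \in [0,1]^{n \times (M+1)}$, the following identity always holds: $\mathbf{1}_C \vee \mathbf{z} = \mathbf{z} + \mathbf{1}_C \circ (\mathbf{1} - \mathbf{z})$. Therefore, for every vector $\mathbf{z} \in [0,1]^{n \times (M+1)}$,

$$\begin{aligned}
\langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \vee \mathbf{z} \rangle
&= \langle \tilde{\mathbf{a}}^t, \mathbf{z} + \mathbf{1}_{C^t} \circ (\mathbf{1} - \mathbf{z}) \rangle
= \langle \tilde{\mathbf{a}}^t, \mathbf{z} \circ (\mathbf{1} - \mathbf{1}_{C^t}) \rangle + \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle \\
&= \langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{z} \rangle + \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle,
\end{aligned} \tag{5}$$

where the last term $\langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle$ is independent of the decision at time $t$.

By linearity of expectation, we have

$$\begin{aligned}
\sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle\right]
&= \sum_{t=1}^{T} \mathbb{E}\left[\langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{1}_{X^t} \rangle + \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle\right] \\
&= \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{y}^t \rangle + \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbb{E}[\mathbf{1}_{X^t}] - \mathbf{y}^t \rangle + \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle \\
&\ge \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{y}^t \rangle + \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle - \sum_{t=1}^{T} \|\tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t})\| \cdot \|\mathbb{E}[\mathbf{1}_{X^t}] - \mathbf{y}^t\| \\
&\ge \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{y}^t \rangle + \sum_{t=1}^{T} \langle \tilde{\mathbf{a}}^t, \mathbf{1}_{C^t} \rangle - T G \epsilon.
\end{aligned} \tag{6}$$

The first inequality follows from the Cauchy-Schwarz inequality. The last inequality is due to the property of round and to $\|\tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t})\| \le G$, which holds since $\|\mathbf{a}^t\| \le G$ and by the construction of $\tilde{\mathbf{a}}^t$.

Let $\mathbf{x}^*$ be the optimal solution in hindsight for the online vee learning problem. Let $\bar{\mathbf{x}}^*$ be the rounded solution of $\mathbf{x}^*$ onto the lattice $\mathcal{L}$. Denote $\mathbf{1}_{X^*} = m(\bar{\mathbf{x}}^*)$. By (5) and (6) and the choice of $\epsilon = O(1/\sqrt{T})$, we have

$$\sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle\right] - \sum_{t=1}^{T} \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \bar{\mathbf{x}}^* \rangle \ge \sum_{t=1}^{T} \mathbb{E}\left[\langle \tilde{\mathbf{a}}^t \circ (\mathbf{1} - \mathbf{1}_{C^t}), \mathbf{y}^t - \mathbf{1}_{X^*} \rangle\right] - T G \epsilon \ge -O(n M G \sqrt{T}),$$

where the last inequality is due to the regret bound of the gradient descent or the follow-the-perturbed-leader algorithms (in dimension $nM$). As $\|\mathbf{a}^t\| \le G$, the function $\langle \mathbf{a}^t, \cdot \rangle$ is $G$-Lipschitz. Hence,

$$\begin{aligned}
&\sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle\right] - \sum_{t=1}^{T} \langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^* \rangle \\
&= \left( \sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle\right] - \sum_{t=1}^{T} \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \bar{\mathbf{x}}^* \rangle \right)
+ \left( \sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^t \rangle\right] - \sum_{t=1}^{T} \mathbb{E}\left[\langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^t \rangle\right] \right) \\
&\quad + \left( \sum_{t=1}^{T} \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \bar{\mathbf{x}}^* \rangle - \sum_{t=1}^{T} \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^* \rangle \right)
+ \left( \sum_{t=1}^{T} \langle \mathbf{a}^t, \bar{\mathbf{c}}^t \vee \mathbf{x}^* \rangle - \sum_{t=1}^{T} \langle \mathbf{a}^t, \mathbf{c}^t \vee \mathbf{x}^* \rangle \right) \\
&\ge -O(n M G \sqrt{T}) - O\!\left(T G \frac{\sqrt{n}}{M}\right),
\end{aligned}$$

where $\|\mathbf{c}^t \vee \mathbf{x}^t - \bar{\mathbf{c}}^t \vee \mathbf{x}^t\| \le \|\mathbf{c}^t - \bar{\mathbf{c}}^t\| \le \sqrt{n}/M$ and similarly $\|\bar{\mathbf{c}}^t \vee \bar{\mathbf{x}}^* - \bar{\mathbf{c}}^t \vee \mathbf{x}^*\|$ and $\|\bar{\mathbf{c}}^t \vee \mathbf{x}^* - \mathbf{c}^t \vee \mathbf{x}^*\|$ are bounded by $\sqrt{n}/M$. Choosing $M = (T/n)^{1/4}$, one gets the regret bound of $O(G(nT)^{3/4})$.

Finally, the procedure round has average running time $O(1/\epsilon^2) = O(T)$ per time step; this leads to the complexity of $O(T^2)$.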

3.2 Algorithm for online non-monotone DR-submodular maximization

In this section, we provide an algorithm for the problem of online non-monotone DR-submodular maximization. The algorithm maintains $L$ online vee learning oracles $\mathcal{E}_1, \ldots, \mathcal{E}_L$. At each time step $t$, the algorithm starts from $\mathbf{0}$ (the origin) and executes $L$ steps of the Frank-Wolfe algorithm, where the update vector $\mathbf{v}^t_\ell$ is constructed by combining the output $\mathbf{u}^t_\ell$ of the online vee optimization oracle $\mathcal{E}_\ell$ with the current vector $\mathbf{x}^t_\ell$. In particular, $\mathbf{v}^t_\ell$ is set to $\mathbf{x}^t_\ell \vee \mathbf{u}^t_\ell - \mathbf{x}^t_\ell$. Then, the solution $\mathbf{x}^t$ is produced at the end of the $L$-th step. Subsequently, after observing the DR-submodular function $F^t$, the algorithm defines a vector $\mathbf{d}^t_\ell$ and feeds back $\langle \mathbf{d}^t_\ell, \mathbf{x}^t_\ell \vee \mathbf{u}^t_\ell \rangle$ as the reward function at time $t$ to the oracle $\mathcal{E}_\ell$ for $1 \le \ell \le L$. The pseudocode is presented in Algorithm 2.

We first prove a technical lemma.

Lemma 5. Let $\mathbf{x}^t_\ell$ be the vectors constructed in Algorithm 2. Then, it holds that $1 - \|\mathbf{x}^t_{\ell+1}\|_\infty \ge \prod_{\ell'=1}^{\ell} (1 - \eta_{\ell'})$ for every $t \in \{1, 2, \ldots, T\}$ and for every $\ell \in \{1, 2, \ldots, L\}$.

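Algorithm 2 Online algorithm for down-closed convex sets
Input: A convex set $\mathcal{K}$, a time horizon $T$, online vee optimization oracles $\mathcal{E}_1, \ldots, \mathcal{E}_L$, step sizes $\rho_\ell \in (0,1)$, and $\eta_\ell = 1/L$ for all $\ell \in [L]$.
1: Initialize the vee optimizing oracle $\mathcal{E}_\ell$ for all $\ell \in \{1, \ldots, L\}$.
2: Initialize $\mathbf{x}^t_1 \leftarrow \mathbf{0}$ and $\mathbf{d}^t_0 \leftarrow \mathbf{0}$ for every $1 \le t \le T$.
3: for $t = 1$ to $T$ do
4:   Compute $\mathbf{u}^t_\ell \leftarrow$ output of oracle $\mathcal{E}_\ell$ in round $t$ for all $\ell \in \{1, \ldots, L\}$.
5:   for $1 \le \ell \le L$ do
6:     Set $\mathbf{v}^t_\ell \leftarrow \mathbf{x}^t_\ell \vee \mathbf{u}^t_\ell - \mathbf{x}^t_\ell$.
7:     Set $\mathbf{x}^t_{\ell+1} \leftarrow \mathbf{x}^t_\ell + \eta_\ell \mathbf{v}^t_\ell$.
8:   end for
9:   Set $\mathbf{x}^t \leftarrow \mathbf{x}^t_{L+1}$.
10:  Play $\mathbf{x}^t$, observe the function $F^t$ and get the reward $F^t(\mathbf{x}^t)$.
11:  Compute a gradient estimate $\mathbf{g}^t_\ell$ such that $\mathbb{E}[\mathbf{g}^t_\ell \mid \mathbf{x}^t_\ell] = \nabla F^t(\mathbf{x}^t_\ell)$ for all $\ell \in \{1, 2, \ldots, L\}$.
12:  Compute $\mathbf{d}^t_\ell = (1 - \rho_\ell)\, \mathbf{d}^t_{\ell-1} + \rho_\ell\, \mathbf{g}^t_\ell$ for every $\ell \in \{1, \ldots, L\}$.
13:  Feed back $\langle \mathbf{d}^t_\ell, \mathbf{x}^t_\ell \vee \mathbf{u}^t_\ell \rangle$ to oracle $\mathcal{E}_\ell$ for all $\ell \in \{1, 2, \ldots, L\}$.
14: end for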

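Proof. We show that the inequality holds for every $t$. Fix an arbitrary $t \in \{1, 2, \ldots, T\}$. For ease of exposition, we drop the index $t$ in the following proof. Let $(x_\ell)_i$, $(u_\ell)_i$ and $(v_\ell)_i$ be the $i$-th coordinates of the vectors $\mathbf{x}_\ell$, $\mathbf{u}_\ell$ and $\mathbf{v}_\ell$ for $1 \le i \le n$ and $1 \le \ell \le L$, respectively. We first obtain the following recursion for a fixed coordinate $(x_\ell)_i$:

$$\begin{aligned}
1 - (x_{\ell+1})_i &= 1 - \left( (x_\ell)_i + \eta_\ell (v_\ell)_i \right) \\
&\ge 1 - \left( (x_\ell)_i + \eta_\ell \left(1 - (x_\ell)_i\right) \right) \\
&= (1 - \eta_\ell)\left(1 - (x_\ell)_i\right),
\end{aligned}$$

where the inequality holds since $0 \le (\mathbf{x}_\ell \vee \mathbf{u}_\ell)_i \le 1$. Since $(x_1)_i = 0$, we have $1 - (x^t_{\ell+1})_i \ge \prod_{\ell'=1}^{\ell} (1 - \eta_{\ell'})$. The lemma holds since the inequality holds for every $1 \le i \le n$.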
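The inner loop of Algorithm 2 (lines 5 to 8) and the bound of Lemma 5 can be checked numerically. The sketch below assumes the oracle outputs $\mathbf{u}_1, \ldots, \mathbf{u}_L$ are simply given as rows of an array; the function name `frank_wolfe_iterates` is hypothetical.

```python
import numpy as np


def frank_wolfe_iterates(us, eta=None):
    """Run x_{l+1} = x_l + eta * (x_l v u_l - x_l) starting from x_1 = 0.

    us  : array of shape (L, n), row l-1 is the oracle output u_l in [0,1]^n
    eta : step size (defaults to 1/L as in Algorithm 2)
    Returns an array of shape (L + 1, n); row l holds x_{l+1}, row 0 is x_1 = 0.
    """
    us = np.asarray(us, dtype=float)
    L, n = us.shape
    eta = 1.0 / L if eta is None else eta
    x = np.zeros(n)
    xs = [x.copy()]
    for u in us:
        v = np.maximum(x, u) - x  # update direction, non-negative and <= 1 - x
        x = x + eta * v
        xs.append(x.copy())
    return np.array(xs)
```

When every $\mathbf{u}_\ell$ is the all-ones vector, the recursion is tight: each coordinate of $\mathbf{x}_{\ell+1}$ equals $1 - (1 - 1/L)^{\ell}$, so the final iterate stays at distance about $1/e$ from the upper boundary of the hypercube.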

We are now ready to prove the main theorem of the section.

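Theorem 1. Let $\mathcal{K} \subseteq [0,1]^n$ be a down-closed convex set with diameter $D$. Assume that the functions $F^t$ are $G$-Lipschitz and $\beta$-smooth, and that $\mathbb{E}\left[\|\tilde{\mathbf{g}}^t_\ell - \nabla F^t(\mathbf{x}^t_\ell)\|^2\right] \le \sigma^2$ for every $t$ and $\ell$. Then, with the step-sizes $\eta_\ell = 1/L$ and $\rho_\ell = 2/(\ell+3)^{2/3}$ for all $1 \le \ell \le L$, Algorithm 2 achieves the following guarantee:

$$\sum_{t=1}^{T} \mathbb{E}\left[F^t(\mathbf{x}^t)\right] \ge \left(\frac{1}{e} - O\!\left(\frac{1}{L}\right)\right) \sum_{t=1}^{T} F^t(\mathbf{x}^*) - O\!\left(n^{3/4} G T^{3/4}\right) - O\!\left(\frac{(\beta D + G + \sigma) D T}{L^{1/3}}\right).$$

Choosing $L = T^{3/4}$ and noting that $D \le \sqrt{n}$, we get

$$\sum_{t=1}^{T} \mathbb{E}\left[F^t(\mathbf{x}^t)\right] \ge \left(\frac{1}{e} - O\!\left(\frac{1}{T^{3/4}}\right)\right) \max_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} F^t(\mathbf{x}) - O\!\left((G + \sigma)\, n^{5/4}\, T^{3/4}\right). \tag{7}$$

Proof. Let $\mathbf{x}^*$ be an optimal solution in hindsight, i.e., $\mathbf{x}^* \in \arg\max_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} F^t(\mathbf{x})$. Fix a time step $t \in \{1, 2, \ldots, T\}$. For every $2 \le \ell \le L+1$, we have the following: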

F^t(x^t_`)−F^t(x^t_`−1)≥ h∇F^t(x^t_`),(x^t_`−x^t_`−1)i −(β/2)kx^t_`−x^t_`−1k² (usingβ-smoothness)

(11)

=η`−1h∇F^t(x^t_`),v^t_`−1i −βη_`−1²

2 kv^t_`−1k² (using the update step in the algorithm)

=η`−1h∇F^t(x^t_`−1),v_`−1^t i+η`−1h∇F^t(x^t_`)− ∇F^t(x^t_`−1),v_`−1^t i −βη_`−1²

2 kv_`−1^t k²

≥η`−1h∇F^t(x^t_`−1),v_`−1^t i −η`−1k∇F^t(x^t_`)− ∇F(x^t_`−1)k · kv^t_`−1k −βη_`−1²

2 kv^t_`−1k²

(Cauchy-Schwarz)

≥η_`−1h∇F^t(x^t_`−1),v_`−1^t i −η_`−1βkx^t_`−x^t_`−1k · kv_`−1^t k −βη²_`−1

2 kv_`−1^t k²

(usingβ-smoothness)

≥η`−1h∇F^t(x^t_`−1),v_`−1^t i −2βη_`−1² kv^t_`−1k²

≥η`−1h∇F^t(x^t_`−1),v_`−1^t i −2βη_`−1² D² (diametre ofKis bounded)

=η`−1h∇F^t(x^t_`−1),x^t_`−1∨x^∗−x^t_`−1i+η`−1h∇F^t(x^t_`−1),v^t_`−1−(x^t_`−1∨x^∗−x^t_`−1)i

−2βη²_`−1D²

=η`−1h∇F^t(x^t_`−1),x^t_`−1∨x^∗−x^t_`−1i+η`−1hd^t_`−1,v_`−1^t −(x^t_`−1∨x^∗−x^t_`−1)i +η`−1h∇F^t(x^t_`−1)−d^t_`−1,v_`−1^t −(x^t_`−1∨x^∗−x^t_`−1)i −2βη²_`−1D²

≥η`−1(F^t(x^t_`−1∨x^∗)−F^t(x^t_`−1)) +η`−1hd^t_`−1,v_`−1^t −(x^t_`−1∨x^∗−x^t_`−1)i +η`−1h∇F^t(x^t_`−1)−d^t_`−1,v_`−1^t −(x^t_`−1∨x^∗−x^t_`−1)i −2βη²_`−1D²

(using concavity in positive direction)

≥η_`−1(F^t(x^t_`−1∨x^∗)−F^t(x^t_`−1)) +η_`−1hd^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i +η`−1h∇F^t(x^t_`−1)−d^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i −2βη_`−1² D²

(using the definition ofv_`−1^t )

≥η`−1

(1− kx^t_`−1k_∞)F^t(x^∗)−F^t(x^t_`−1)

+η`−1hd^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i +η`−1h∇F^t(x^t_`−1)−d^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i −2βη_`−1² D² (using Lemma 2)

(12)

≥η`−1

`−2

Y

`⁰=1

(1−η_`⁰)·F^t(x^∗)−η`−1F^t(x^t_`−1) +η`−1hd^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i +η`−1h∇F^t(x^t_`−1)−d^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i −2βη_`−1² D² (using Lemma 5)

≥η_`−1

`−2

Y

`⁰=1

(1−η_`⁰)·F^t(x^∗)−η_`−1F^t(x^t_`−1) +η_`−1hd^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i

−η`−1

2 1

γ`−1

k∇F^t(x^t_`−1)−d^t_`−1k²+γ`−1kx^t_`−1∨u^t_`−1−x^t_`−1∨x^∗k²

−2βη²_`−1D² (Young’s Inequality, parameterγ`−1will be defined later)

≥η`−1

`−2

Y

`⁰=1

(1−η_`⁰)·F^t(x^∗)−η`−1F^t(x^t_`−1) +η`−1hd^t_`−1,x^t_`−1∨u^t_`−1−x^t_`−1∨x^∗i

−η`−1

2 1

γ`−1

k∇F^t(x^t_`−1)−d^t_`−1k²+γ`−1D²

−2βη_`−1² D².

(sincekx^t_`−1∨u^t_`−1−x^t_`−1∨x^∗k≤ ku^t_`−1−x^∗k≤D.) From the above we have

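$$\begin{aligned}
F^t(\mathbf{x}^t_\ell) &\ge (1 - \eta_{\ell-1}) F^t(\mathbf{x}^t_{\ell-1}) + \eta_{\ell-1} \prod_{\ell'=1}^{\ell-2} (1 - \eta_{\ell'}) \cdot F^t(\mathbf{x}^*) + \eta_{\ell-1} \langle \mathbf{d}^t_{\ell-1}, \mathbf{x}^t_{\ell-1} \vee \mathbf{u}^t_{\ell-1} - \mathbf{x}^t_{\ell-1} \vee \mathbf{x}^* \rangle \\
&\quad - \frac{\eta_{\ell-1}}{2} \left( \frac{1}{\gamma_{\ell-1}} \|\nabla F^t(\mathbf{x}^t_{\ell-1}) - \mathbf{d}^t_{\ell-1}\|^2 + \gamma_{\ell-1} D^2 \right) - 2 \beta \eta_{\ell-1}^2 D^2 \\
&= \left(1 - \frac{1}{L}\right) F^t(\mathbf{x}^t_{\ell-1}) + \frac{1}{L} \left(1 - \frac{1}{L}\right)^{\ell-2} F^t(\mathbf{x}^*) + \frac{1}{L} \langle \mathbf{d}^t_{\ell-1}, \mathbf{x}^t_{\ell-1} \vee \mathbf{u}^t_{\ell-1} - \mathbf{x}^t_{\ell-1} \vee \mathbf{x}^* \rangle \\
&\quad - \frac{1}{2L} \left( \frac{1}{\gamma_{\ell-1}} \|\nabla F^t(\mathbf{x}^t_{\ell-1}) - \mathbf{d}^t_{\ell-1}\|^2 + \gamma_{\ell-1} D^2 \right) - \frac{2 \beta D^2}{L^2} \qquad \text{(replacing $\eta_\ell = 1/L$)}
\end{aligned}$$

Applying this recursively on $\ell$, we get:

$$\begin{aligned}
F^t(\mathbf{x}^t_\ell) &\ge \frac{\ell-1}{L} \left(1 - \frac{1}{L}\right)^{\ell-2} F^t(\mathbf{x}^*) + \frac{1}{L} \sum_{\ell'=1}^{\ell-1} \left(1 - \frac{1}{L}\right)^{\ell-1-\ell'} \langle \mathbf{d}^t_{\ell'}, \mathbf{x}^t_{\ell'} \vee \mathbf{u}^t_{\ell'} - \mathbf{x}^t_{\ell'} \vee \mathbf{x}^* \rangle \\
&\quad - \frac{1}{2L} \sum_{\ell'=1}^{\ell-1} \left(1 - \frac{1}{L}\right)^{\ell-1-\ell'} \left( \frac{1}{\gamma_{\ell'}} \|\nabla F^t(\mathbf{x}^t_{\ell'}) - \mathbf{d}^t_{\ell'}\|^2 + \gamma_{\ell'} D^2 \right) - \frac{2 \beta D^2}{L}
\end{aligned} \tag{8}$$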

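Summing the above inequality (with $\ell = L+1$) over all $t \in [T]$ and using the online vee maximization oracle with regret $R^T$, we obtain

$$\begin{aligned}
\sum_{t=1}^{T} F^t(\mathbf{x}^t_{L+1})
&\ge \left(1 - \frac{1}{L}\right)^{L-1} \sum_{t=1}^{T} F^t(\mathbf{x}^*) + \frac{1}{L} \sum_{t=1}^{T} \sum_{\ell'=1}^{L} \left(1 - \frac{1}{L}\right)^{L-\ell'} \langle \mathbf{d}^t_{\ell'}, \mathbf{x}^t_{\ell'} \vee \mathbf{u}^t_{\ell'} - \mathbf{x}^t_{\ell'} \vee \mathbf{x}^* \rangle \\
&\quad - \frac{1}{2L} \sum_{t=1}^{T} \sum_{\ell'=1}^{L} \left(1 - \frac{1}{L}\right)^{L-\ell'} \left( \frac{1}{\gamma_{\ell'}} \|\nabla F^t(\mathbf{x}^t_{\ell'}) - \mathbf{d}^t_{\ell'}\|^2 + \gamma_{\ell'} D^2 \right) - \frac{2 \beta T D^2}{L} \\
&\ge \left(1 - \frac{1}{L}\right)^{L-1} \sum_{t=1}^{T} F^t(\mathbf{x}^*) + \frac{1}{L} \sum_{\ell'=1}^{L} \left(1 - \frac{1}{L}\right)^{L-\ell'} \cdot \left( \sum_{t=1}^{T} \langle \mathbf{d}^t_{\ell'}, \mathbf{x}^t_{\ell'} \vee \mathbf{u}^t_{\ell'} - \mathbf{x}^t_{\ell'} \vee \mathbf{x}^* \rangle \right) \\
&\quad - \frac{1}{2L} \sum_{t=1}^{T} \sum_{\ell'=1}^{L} \left( \frac{1}{\gamma_{\ell'}} \|\nabla F^t(\mathbf{x}^t_{\ell'}) - \mathbf{d}^t_{\ell'}\|^2 + \gamma_{\ell'} D^2 \right) - \frac{2 \beta T D^2}{L} && \text{(since $1 - 1/L < 1$)} \\
&\ge \left(1 - \frac{1}{L}\right)^{L-1} \sum_{t=1}^{T} F^t(\mathbf{x}^*) - \frac{1}{L} \sum_{\ell'=1}^{L} \left(1 - \frac{1}{L}\right)^{L-\ell'} R^T - \frac{2 \beta T D^2}{L} \\
&\quad - \frac{1}{2L} \sum_{t=1}^{T} \sum_{\ell'=1}^{L} \left( \frac{1}{\gamma_{\ell'}} \|\nabla F^t(\mathbf{x}^t_{\ell'}) - \mathbf{d}^t_{\ell'}\|^2 + \gamma_{\ell'} D^2 \right) && \text{($R^T$ is the regret of a single oracle)} \\
&\ge \left(\frac{1}{e} - O\!\left(\frac{1}{L}\right)\right) \sum_{t=1}^{T} F^t(\mathbf{x}^*) - R^T - \frac{2 \beta T D^2}{L} - \frac{1}{2L} \sum_{t=1}^{T} \sum_{\ell'=1}^{L} \left( \frac{1}{\gamma_{\ell'}} \|\nabla F^t(\mathbf{x}^t_{\ell'}) - \mathbf{d}^t_{\ell'}\|^2 + \gamma_{\ell'} D^2 \right)
\end{aligned} \tag{9}$$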
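The factor in front of $\sum_t F^t(\mathbf{x}^*)$ in the last step is $(1 - 1/L)^{L-1}$, which approaches $1/e$ from above as $L$ grows. A small numerical check (with the hypothetical helper name `fw_factor`) illustrates this:

```python
import math


def fw_factor(L: int) -> float:
    """Coefficient (1 - 1/L)^(L-1) multiplying the optimum in the bound."""
    return (1.0 - 1.0 / L) ** (L - 1)


# Gap between the coefficient and 1/e for a few values of L.
gaps = {L: fw_factor(L) - 1.0 / math.e for L in (2, 10, 100, 1000)}
```

The gap is positive and shrinks roughly like $1/L$, which is consistent with the $1/e - O(1/L)$ term appearing in (9).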

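Claim 1. Choosing $\gamma_\ell = \frac{Q^{1/2}}{D(\ell+3)^{1/3}}$ for $1 \le \ell \le L$, where $Q = \max\left\{\max_{1 \le t \le T} \|\nabla F^t(\mathbf{0})\|^2\, 4^{2/3},\; 4\sigma^2 + 3(\beta D)^2/2\right\}$, it holds that

$$\sum_{t=1}^{T} \sum_{\ell=1}^{L} \left( \frac{1}{\gamma_\ell} \mathbb{E}\left[\|\nabla F^t(\mathbf{x}^t_\ell) - \mathbf{d}^t_\ell\|^2\right] + \gamma_\ell D^2 \right) = O\!\left((\beta D + G + \sigma) D T L^{2/3}\right).$$

Proof of claim. Fix a time step $t$. We apply Lemma 3 to the left-hand side of the claim inequality where, for $1 \le \ell \le L$, $\mathbf{a}_\ell$ denotes the gradient $\nabla F^t(\mathbf{x}^t_\ell)$ and $\tilde{\mathbf{a}}_\ell$ denotes the stochastic gradient estimate $\mathbf{g}^t_\ell$. Additionally, $\rho_\ell = \frac{2}{(\ell+3)^{2/3}}$ for $1 \le \ell \le L$. Hence we get:

$$\mathbb{E}\left[\|\nabla F^t(\mathbf{x}^t_\ell) - \mathbf{d}^t_\ell\|^2\right] \le \frac{Q}{(\ell+4)^{2/3}}. \tag{10}$$

With the choice $\gamma_\ell = \frac{Q^{1/2}}{D(\ell+4)^{1/3}}$, we obtain:

$$\sum_{\ell=1}^{L} \left( \frac{1}{\gamma_\ell} \mathbb{E}\left[\|\nabla F^t(\mathbf{x}^t_\ell) - \mathbf{d}^t_\ell\|^2\right] + \gamma_\ell D^2 \right) \le 2 D Q^{1/2} \sum_{\ell=1}^{L} \frac{1}{(\ell+4)^{1/3}} \le 2 D Q^{1/2} \int_{\ell=1}^{L} \frac{1}{\ell^{1/3}}\, d\ell \le 3 D Q^{1/2} L^{2/3} = O\!\left((\beta D + G + \sigma) D L^{2/3}\right)$$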