Online Non-Monotone DR-Submodular Maximization
Nguyễn Kim Thắng Abhinav Srivastav IBISC, Univ. Evry, University Paris-Saclay, France
Abstract
In this paper, we study fundamental problems of maximizing DR-submodular continuous functions that have real-world applications in the domain of machine learning, economics, operations research and communication systems. It captures a subclass of non-convex optimization that provides both theoretical and practical guarantees. Here, we focus on minimizing regret for online arriving non-monotone DR- submodular functions over down-closed and general convex sets.
First, we present an online algorithm that achieves a1/e-approximation ratio with the regret of O(T3/4)for maximizing DR-submodular functions over any down-closed convex set. Note that, the approximation ratio of1/ematches the best-known offline guarantee. Next, we give an online algorithm that achieves an approximation guarantee (depending on the search space) for the problem of maximizing non-monotone continuous DR-submodular functions over ageneralconvex set (not necessarily down- closed). Finally we run experiments to verify the performance of our algorithms on problems arising in machine learning domain with the real-world datasets.
1 Introduction
Continuous DR-submodular optimization is a subclass of non-convex optimization that is an upcoming frontier in machine learning. Roughly speaking, a differentiable non-negative bounded functionF : [0,1]n→[0,1]
isDR-submodularif∇F(x) ≥ ∇F(y)for allx,y ∈ [0,1]n wherexi ≤ yi for every1 ≤i ≤n. (Note that, w.l.o.g. after normalization, assume thatF has values in[0,1].) Intuitively, continuous DR-submodular functions represent the diminishing returns property or the economy of scale in continuous domains. DR- submodularity has been of great interest [3, 1, 10, 22, 31, 35]. Many problems arising in machine learning and statistics, such as Non-definite Quadratic Programming [24], Determinantal Point Processes [26], log- submodular models [12], to name a few, have been modelled using the notion of continuous DR-submodular functions.
In the past decade, online computational framework has been quite successful for tackling a wide variety of challenging problems and capturing many real-world problems with uncertainty. In this computational framework, we focus on the model of online learning. In online learning, at any time step, given a history of actions and a set of associated reward functions, online algorithm first chooses an action from a set of feasible actions; then, an adversary subsequently selects a reward function. The objective is to perform as good as the best fixed action in hindsight. This setting have been extensively explored in the literature, especially in context of convex functions [23].
In fact, several algorithms with theoretical approximation guarantees are known for maximizing (offline and online) DR-submodular functions. However, these guarantees hold under the assumptions that are based on the monotonicity of functions coupled with the structure of convex sets such as unconstrained hypercube and down-closed. Though, a majority of real-world problems can be formulated asnon-monotone
DR-submodular functions over convex sets that might not be necessarily down-closed. For example, the problems of Determinantal Point Processes, log-submodular models, etc can be viewed as non-monotone DR-submodular maximization problems. Although several algorithms for non-monotone (set) submodular maximization are known and they are built on maximizing the corresponding multilinear extensions (a particular class of DR-submodular functions), they do not extend to general DR-submodular functions since the optimal solutions of the former are integer points while the ones of the latter are not. Besides, general convex sets include conic convex sets, up-closed convex sets, mixture of covering and packing linear constraints, etc. which appear in many applications. Among others, conic convex sets play a crucial role in convex optimization. Conic programming [4] — an important subfield of convex optimization — consists of optimizing an objective function over a conic convex set. Conic programming reduces to linear programming and semi-definite programming when the objective function is linear and the convex cones are the positive orthantRn+and positive semidefinite matricesSn+, respectively. Optimizing non-monotone DR-submodular functions over a (bounded) conic convex set (not necessarily downward-closed) is an interesting and important problem both in theory and in practice. To best of our knowledge, no prior work has been done on maximizing DR-submodular functions over conic sets in online setting. The limit of current theory [3, 1, 10, 22, 31, 38]
motivates us to develop online algorithms for non-monotone functions.
In this work, we explore theonlineproblem of maximizingnon-monotoneDR-submodular functions over a hypercube and down-closed1 and over ageneralconvex sets. Formally, we consider the following setting for DR-submodular maximization: given a convex domainK ⊆[0,1]nin advance, at each time step t= 1,2, . . . , T, the online algorithm first selects a vectorxt ∈ K. Subsequently, the adversary reveals a non-monotone continuous DR-submodular functionFtand the algorithm receives a reward ofFt(xt). The objective is also to maximize the total reward. We say that an algorithm achieves a(r, R(T))-regretif
T
X
t=1
Ft(xt)≥r·max
x∈K T
X
t=1
Ft(x)−R(T)
In other words,ris the approximation ratio that measures the quality of the online algorithm compared to the best fixed solution in hindsight andR(T)represents the regret in the classic terms. Equivalently, we say that the algorithm hasr-regretat mostR(T). Our goal is to design online algorithms with(r, R(T))-regret where0< r≤1is as large as possible, andR(T)is sub-linear inT, i.e.,R(T) =o(T).
1.1 Our contributions and techniques
In this paper, we provide algorithms with performance guarantees for the problem over down-closed and general convex sets. Our contributions and techniques are summarized as follows (also see Table 1, the entries in the red correspond to our contribution).
1.1.1 Maximizing online non-monotone DR-submodular functions over down-closed convex sets The natures of online DR-submodular maximization for monotone and non-monotone functions are quite different. In the setting of monotone functions, the best approximation guarantee is (1−1/e) [10] but one can prove that the natural gradient descent algorithm attains 1/2-approximation [10]. However, for online non-monotone DR-submodular maximization, no constant guarantee is known and it is likely that natural algorithms such as gradient descent, mirror descent, etc do not guarantee to achieve any constant
1Kisdown-closedif for everyz∈ Kandy≤ztheny∈ K
Monotone Non-monotone
Offline
hypercube
1 2-approx
[3]
[31]
down-closed
1−1e -approx
[2] 1e
-approx [1]
general
1−1e -approx [29]
1−min
x∈Kkxk∞
3√ 3
-approx [14]
Online
hypercube
1/2, O(√ T)
down-closed
1
e, O(T3/4)
general
1−1e, O(√ T)
[10] 1−min
x∈Kkxk∞
3√
3 , O(lnTT)
Table 1: Summary of results on DR-submodular maximization. The entries represent the best-known (r, R(T))-regret; our results are shown in red. The entry in blue is not explicitly stated in literature but can be deduced from [17] or from our technique. The guarantees in blank cells can be deduced from a more general one.
approximation, even small one. An illustrative example of poor local optima (for coordinate ascent algorithm) can be found in [3, Appendix A].
The main result of our paper is thefirstonline algorithm that achieves a constant approximation ratio with the regret ofO(T3/4)for maximizing non-monotone DR-submodular functions over any down-closed convex set whereT is number of time steps. Moreover, the approximation is1/ewhich matches to the best guarantee in the offline setting.
Our algorithm is built on the Meta-Frank-Wolfe algorithm introduced by Chen et al. [10] formonotone DR-submodular maximization. Their meta Frank-Wolfe algorithm combines the framework of meta-actions proposed in [36] with a variant of the Frank-Wolfe proposed in [2] for maximizing monotone DR-submodular functions. Informally, at every timet, our algorithm starts from the origin0(0∈ KsinceKis down-closed) and executesLsteps of the Frank-Wolfe algorithm where every update vector at iteration1 ≤ ` ≤ Lis constructed by combining the output of an optimization oracleE`and the current vector so that we can exploit theconcavity property in positive directionsof DR-submodular functions. The solutionxtis produced at the end of theL-th step. Subsequently, after observing the functionFt, the algorithm subtly defines a vectordt` and feedbacksh·,dt`ias the reward function at timetto the oracleE`for1 ≤`≤L. The most important distinguishing point of our algorithm compared to that in [10] relies on the oracles. In their algorithm, they used linear oracles which are appropriate for maximizing monotone DR-submodular functions. However, it is not clear whetherlinear or even convex oraclesare strong enough to achieve a constant approximation for non-monotone functions.
While aiming for a1/e-approximation — the best known approximation in the offline setting — we consider the following online non-convex problem, referred throughout this paper asonline vee learning problem. At each time t, the online algorithm knows ct ∈ K in advance and selects a vectorxt ∈ K.
Subsequently, the adversary reveals a vectorat ∈ Rnand the algorithm receives a reward hat,ct∨xti.
Given two vectorsxandy, the vectorx∨yhas theithcoordinate(x∨y)i = max{x(i), y(i)}. The goal is
to maximize the total reward over a time horizonT.
In order to bypass the non-convexity of the online vee learning problem, we propose the following methodology. Unlike dimension reduction techniques that aim at reducing the dimension of the search space, weliftthe search space and the functions to a higher dimension so as to benefit from their properties in this new space. Concretely, at a high level, given a convex setK, we define a “sufficiently dense” latticeLin [0,1]nsuch that every point inKcan be approximated by a point in the lattice. The term “sufficiently dense”
corresponds to the fact that the values of any Lipschitz function can be approximated by the function values at lattice points. As the next step, we lift all lattice points inKto a high dimension space so that they form a subset of vertices (corners points)Keof the unit hypercube in the newly defined space. Interestingly, the reward functionhat,ct∨xtican be transformed to a linear function in the new space. Hence, our algorithm for the online vee problem consists of updating at every time step a solution in the high-dimension space using a linear oracle and projecting the solution back to the original space.
Once the solution is projected back to original space, we construct appropriate update vectors for the original DR-submodular maximization and feedback rewards for the online vee oracle. Exploiting the underlying properties of DR-submodularity, we show that our algorithm achieves the regret bound of (1/e, O(T3/4)). A useful feature of our algorithm is that it is projection-free if using appropriate projection-
free oracles.
Over the hypercube. Restricting on the DR-submodular maximization problem over the hypercube[0,1]n, a particular down-closed convex set that has been widely studied, our approach leads to an1/2-approximation algorithm with regret bound ofO(√
TlogT). Note that this result can be deduced from [17] but as it is not explicitly stated in literature, for completeness, we present the proof in the appendix.
1.1.2 Maximizing online non-monotone DR-submodular functions over general convex sets
We go beyond the down-closed structure by considering general convex sets. Building upon our salient ideas from previous algorithm, we prove that for general convex sets, the Meta-Frank-Wolfe algorithm with adapted step-sizes guarantees1−min
x∈Kkxk∞
3√
3 , O(lnTT)
-regret in expectation. Notably, if the setKcontains0(for example,Kis the intersection of the semi-definite positive matrix cone and the hypercube[0,1]n) then the algorithm guarantees in expectation a( 1
3√
3, O(logTT))-regret. To the best of our knowledge, this is thefirst constant-approximation algorithm for the problem of online non-monotone DR-submodular maximization over a non-trivial convex set that goes beyond down-closed sets. Note that, any algorithm for the problem over a non-down-closed convex set (in particular arbitrary convex set containing the origin) that guarantees a constant approximation must require in general exponentially many value queries to the function [37].
Remark that the quality of the solution, specifically the approximation ratio, depends on the initial solution x0. This confirms an observation in various contexts that initialization plays an important role in non-convex optimization.
1.2 Related Work
In this section, we give a summary on best-known results on DR-submodular maximization. The domain has been investigated more extensively in recent years due to its numerous applications in the field of statistics and machine learning, for example active learning [19], viral makerting [25], network monitoring [20], document summarization [27], crowd teaching [33], feature selection [16], deep neural networks [15], diversity models [13] and recommender systems [21].
Offline setting. Bian et al. [2] considered the problem of maximizing monotone DR-functions subject to a down-closed convex set and showed that the greedy method proposed by [7], a variant of well-known Frank-Wolfe algorithm in convex optimization, guarantees a(1−1/e)-approximation. However, it has been observed by [22] that the greedy method is not robust in stochastic settings (where only unbiased estimates of gradients are available). Subsequently, Hassani et al. [29] proposed a(1−1/e)-approximation algorithm for maximizing monotone DR-submodular functions over general convex sets in stochastic settings by a new variance reduction technique.
The problem of maximizing non-monotone DR-submodular functions is much harder. Bian et al. [3] and Niazadeh et al. [31] have independently presented algorithms with the same approximation guarantee of(1/2) for the problem of non-monotone DR-submodular maximization over the hypercube (K= [0,1]n). These algorithms are inspired by the bi-greedy algorithms in [6, 5]. Bian et al. [1] made a further step by providing a(1/e)-approximation algorithm where the convex sets are down-closed. Mokhtari et al. [29] also presented an algorithm that achieve(1/e)for non-monotone continuous DR-submodular function over a down-closed convex domain that uses only stochastic gradient estimates. Remark that when aiming for approximation algorithms (in polynomial time), the restriction to down-closed polytopes is unavoidable. Specifically, Vondrak [37] proved that any algorithm for the problem over a non-down-closed set that guarantees a constant approximation must require in general exponentially many value queries to the function. Recently, Durr et al. [14] presented an algorithm that achieves((1−minx∈Kkxk∞)/3√
3)-approximation for non-monotone DR-submodular function over general convex sets that are not necessarily down-closed.
Online setting. The results for online DR-submodular maximization are knownonlyfor monotone func- tions. Chen et al. [10] considered the online problem of maximizing monotone DR-submodular functions over general convex sets and provided an algorithm that achieves a(1−1/e, O(√
T))-regret. Subsequently, Zhang et al. [38] presented an algorithm that reduces the number of per-function gradient evaluations from T3/2in [10] to1and achieves the same approximation ratio of(1−1/e). Leveraging the idea of one gradient per iteration, they presented a bandit algorithm for maximizing monotone DR-submodular function over a general convex set that achieves in expectation(1−1/e)approximation ratio with regretO(T8/9). Note that in the discrete setting, Roughgarden and Wang [32] studied the non-monotone (discrete) submodular maximization over the hypercube and gave an algorithm which guarantees the tight(1/2)-approximation andO(√
T)regret. Recently and independently, Niazadeh et al. [30] have provided an(1/2, O(√
T))-regret algorithm for DR-submodular maximization over the hypercube. Their algorithm is built on an interesting framework that transform offline greedy algorithms to online ones using Blackwell approachability. Our approach, based on the lifting procedure, is completely different to theirs.
2 Preliminaries and Notations
Throughout the paper, we use bold face letters, e.g.,x,yto represent vectors. For everyS⊆ {1,2, . . . , n}, vector1S is then-dim vector whoseith coordinate equals to 1 ifi ∈ S and0otherwise. Given two n- dimensional vectorsx,y, we say thatx ≤ yiffxi ≤ yi for all1 ≤ i ≤ n. Additionally, we denote by x∨y, their coordinate-wise maximum vector such that(x∨y)i = max{xi, yi}. Moreover, the symbol◦ represents the element-wise multiplication that is, given two vectorsx,y, vectorx◦yis the such that the i-th coordinate(x◦y)i =xiyi. The scalar producthx,yi=P
ixiyiand the normkxk=hx,xi1/2. In the paper, we assume thatK ⊆[0,1]n. We say thatKis thehypercubeifK= [0,1]n;Kisdown-closedif for everyz∈ Kandy≤ztheny∈ K; andKisgeneralifKis a convex subset of[0,1]nwithout any special property.
Definition 1. A functionF : [0,1]n →R+∪ {0}isdiminishing-return (DR) submodularif for all vector x≥y∈[0,1]n, any basis vectorei = (0, . . . ,0,1,0, . . . ,0)and any constantα >0such thatx+αei ∈ [0,1]n,y+αei ∈[0,1]n, it holds that
F(x+αei)−F(x)≤F(y+αei)−F(y). (1) Note that if functionF is differentiable then the diminishing-return (DR) property (1) is equivalent to
∇F(x)≤ ∇F(y) ∀x≥y∈[0,1]n. (2) Moreover, ifF is twice-differentiable then the DR property is equivalent to all of the entries of its Hessian being non-positive, i.e., ∂x∂2F
i∂xj ≤0for all1≤i, j≤n.
Besides, a differentiable functionF :K ⊆[0,1]n→R+∪ {0}is said to beβ-smoothif for anyx,y∈ K, the following holds,
F(y)≤F(x) +h∇F(x),y−xi+β
2ky−xk2 (3)
or equivalently, the gradient isβ-Lipschitz, i.e.,
∇F(x)− ∇F(y)
≤βkx−yk. (4) Properties of DR-submodularity In the following, we present a property of DR-submodular functions that are are crucial in our analyses. The most important property is the concavity in positive directions, i.e., for everyx,y∈ Kandx≥y, it holds that
F(x)≤F(y) +h∇F(y),x−yi.
Addionally, the following lemmas are useful in our analysis.
Lemma 1([22]). For everyx,y∈ Kand any DR-submodular functionF, it holds that h∇F(x),y−xi ≥F(x∨y) +F(x∧y)−2F(x)
Lemma 2([18, 8, 1]). For everyx,y∈ Kand for any DR-submodular functionF : [0,1]n→R+∪ {0}, it holds thatF(x∨y)≥(1− kxk∞)F(y).
Variance Reduction Technique. Our algorithm and its analysis relies on a recent variance reduction technique proposed by [29]. This theorem has also been used in the context of online monotone submodular maximization [9, 38].
Lemma 3([29], Lemma 2). Let{at}Tt=0be a sequence of points inRn. such that
at−at−1
≤C/(t+s) for all1≤t ≤T with fixed constantsC ≥0ands≥3. Let{˜a}Tt=1be a sequence of random variables such thatE[˜a|Ht−1] =atandE
h
˜at−at
2|Ht−1i
≤σ2for everyt≥0, whereHt−1is the history up to t−1. Let{dt}Tt=0be a sequence of random variables whered0is fixed and subsequentdtare obtained by the recurrence
dt= (1−ρt)dt−1+ρta˜t withρt= (t+s)22/3. Then we have
E
at−dt
2
≤ Q
(t+s+ 1)2/3,
whereQ= max{
a0−d0
2(s+ 1)2/3,4σ2+ 3C2/2}.
3 Online DR-Submodular Maximization over Down-Closed Convex Sets
3.1 Online Vee Learning
In this section, we give an algorithm for an online problem that will be the main building block in the design of algorithms for online DR-submodular maximization over down-closed convex sets. In the online vee learning problem, given a down-closed convex setK, at every timet, the online algorithm receivesct∈ K at the beginning of the step and needs to choose a vectorxt ∈ K. Subsequently, the adversary reveals a vectorat∈Rnand the algorithm receives a rewardhat,ct∨xti. The goal is to maximize the total reward PT
t=1hat,ct∨xtiover a time horizonT.
The main issue in the online vee problem is that the reward functions are non-concave. In order to overcome this obstacle, we consider a novel approach that consists of discretizing and lifting the corresponding functons to a higher dimensional space.
Discretization and lifting. LetLbe a lattice such thatL={0,M1 ,M2 , . . . ,M` , . . . ,1}nwhere0≤`≤M for some parameterM (to be defined later). Note that each xi = M` for0 ≤ ` ≤ M can be uniquely represented by a vectoryi ∈ {0,1}M+1such thatyi0 =. . .=yi`= 1andyi,`+1=. . .=yi,M = 0. Based on this observation, we lift the discrete setK ∩ Lto the(n×(M+ 1))-dim space. Specifically, define alifting mapm:K ∩ L → {0,1}n×(M+1)such that each point(x1, . . . , xn)∈ K ∩ Lis mapped to an unique point (y10, . . . , y1M, . . . , yn0, . . . , ynM) ∈ {0,1}n×M whereyi0 = . . .=yi` = 1andyi,`+1 = . . .=yi,M = 0 iffxi = M` for0≤`≤M. DefineKebe the set{1X ∈ {0,1}n×(M+1):1X =m(x)for somex∈ K ∩ L}.
Observe thatKeis a subset of discrete points in{0,1}n×(M+1). LetC:=conv(K)e be the convex hull ofK.e Algorithm description. In our algorithm, at every timet, we will outputxt ∈ L ∩ K. In fact, we will reduce the online vee problem to a online linear optimization problem in the(n×(M+ 1))-dim space. Given ct∈ K, we round every coordinatectifor1≤i≤nto the largest multiple of M1 which is smaller thancti. In other words, the rounded vectorcthas theith-coordinate¯cti = M` where M` ≤cti < `+1M . Vectorct∈ Land alsoct∈ K(sincect≤ctandKis down-closed). Denote1Ct =m(ct). Besides, for each vectorat∈Rn, define its correspondenceeat∈Rn×(M+1)such thateati,j = M1 atifor all1≤i≤nand0≤j≤M. Observe thathat,cti=haet,1Cti(where the second scalar product is taken in the space of dimensionn×(M + 1)).
Algorithm 1Online algorithm for vee learning Input: A convex setK, a time horizonT.
1: LetLbe a lattice in[0,1]nand denote the polytopeC:=conv(K).e
2: Initialize arbitrarilyx1 ∈ K ∩ Landy1=1X1 =m(x1)∈K.e
3: fort= 1toT do
4: Observect, compute1Ct =m(ct).
5: Playxt.
6: Observeat(so computeeat) and receive the rewardhat,ct∨xti.
7: Setyt+1 ←updateC(yt;eat0◦(1−1Ct0) :t0 ≤t)where◦denotes the element-wise multiplication.
8: Round:1Xt+1 ←round(yt+1).
9: Computext+1=m−1(1Xt+1).
10: end for
The formal description is given in Algorithm 1. In the algorithm, we use a procedureupdate. This procedure takes arguments as the polytopeC, the current vectoryt, the gradients of previous time steps aet◦(1−1Ct)and outputs the next vectoryt+1. One can use different update strategies, in particular the gradient descent or the follow-the-perturbed-leader algorithm (if aiming for a projection-free algorithm).
yt+1 ←ProjC
yt−η·aet◦(1−1Ct)
(Gradient Descent) yt+1 ←arg min
C
η
t
X
t0=1
haet0 ◦(1−1Ct0),yi+nT ·y
(Follow-the-perturbed-leader) Besides, in the algorithm, we use additionally a procedureroundin order to transform a solutionytin the polytopeCof dimensionn×(M+ 1)to an integer point inK. Specifically, givene yt∈conv(K), we rounde ytto1Xt ∈Keusing an efficient polynomial-time algorithm given in [28] for the approximate Carathéodory’s theorem w.r.t the`2-norm. Specifically, givenyt∈ Cand an arbitrary >0, the algorithm in [28] returns a set ofk=O(1/2)integer points1Xt
1, . . . ,1Xt
k ∈KeinO(1/2)-time such thatkyt−1kPk j=11Xt
jk≤. Hence, given1Xt
1, . . . ,1Xt
kwhich have been computed, the procedureroundsimply consists of rounding ytto1Xt
j with probability1/k. Finally, the solutionxtin the original space is computed asm−1(1Xt+1).
Lemma 4. Assume thatkatk≤Gfor all1≤t≤T and the update scheme is chosen either as the gradient ascent update or other procedure with regret ofO(√
T). Then, using the latticeLwith parameterM = (T /n)1/4and= 1/√
T (in theroundprocedure), Algorithm 1 achieves a regret bound ofO(G(nT)3/4).
The total running time of the algorithm isO(T2).
Proof. Observe thatytbe the solution of the online linear optimization problem in dimensionn×(M+ 1) at timetwhere the reward function at timetishaet◦(1−1Ct),·i. Moreover, recall thatxt∈ L ∩ Kfor every timet. First, by the construction, we have:
m(ct∨xt) =m(ct)∨m(xt) =1Ct∨1Xt, hat,ct∨xti=haet,1Ct ∨1Xti.
Therefore,
T
X
t=1
E[hat,ct∨xti] =
T
X
t=1
E[haet,1Ct ∨1Xti]
Besides, for every vector1C and vectorz ∈ [0,1]n×M, the following identity always holds: 1C ∨z = z+1C◦(1−z). Therefore, for every vectorz∈[0,1]n×M,
heat,1Ct∨yi=heat,z+1Ct◦(1−z)i=haet,z◦(1−1Ct)i+haet,1Cti
=heat◦(1−1Ct),zi+heat,1Cti (5) where note that the last termhaet,1Ctiis independent of the decision at timet.
By linearity of expectation, we have
T
X
t=1
E[hat,ct∨xti] =
T
X
t=1
E[heat◦(1−1Ct),1Xti+heat,1Cti]
=
T
X
t=1
heat◦(1−1Ct),yti+
T
X
t=1
haet◦(1−1Ct),E[1Xt]−yti+
T
X
t=1
heat,1Cti
≥
T
X
t=1
heat◦(1−1Ct),yti+
T
X
t=1
haet,1Cti −
T
X
t=1
keat◦(1−1Ct)k·kE[1Xt]−ytk
≥
T
X
t=1
heat◦(1−1Ct),yti+
T
X
t=1
haet,1Cti −T G. (6) The first inequality follows Cauchy-Schwarz inequality. The last inequality is due to the property ofround andkaet◦(1−1Ct)k≤Gsincekatk≤Gand the construction ofaet.
Letx∗ be the optimal solution in hindsight for the online vee learning problem. Letx∗ be the rounded solution ofx∗onto the latticeL. Denote1X∗=m(x∗). By (5) and (6) and the choice of=O(1/√
T), we have
T
X
t=1
E[hat,ct∨xti]−
T
X
t=1
hat,ct∨x∗i ≥
T
X
t=1
E[heat◦(1−1Ct),yt−1X∗i]≥ −O(nM G√ T)
where the last inequality is due to the regret bound of the gradient descent or the follow-the-perturbed-leader algorithms (in dimesionnM). Askatk≤G, functionhat,·iisG-Lipschitz. Hence,
T
X
t=1
E[hat,ct∨xti]−
T
X
t=1
hat,ct∨x∗i
=
T
X
t=1
E[hat,ct∨xti]−
T
X
t=1
hat,ct∨xti+
T
X
t=1
E[hat,ct∨xti]−
T
X
t=1
hat,ct∨x∗i
+
T
X
t=1
E[hat,ct∨x∗i]−
T
X
t=1
hat,ct∨x∗i+
T
X
t=1
E[hat,ct∨x∗i]−
T
X
t=1
hat,ct∨x∗i
≥ −O(nM G√
T)−O(T G
√n M ).
wherekct∨xt−ct∨xtk≤ kct−ctk≤√
n/M and similarlykct∨x∗−ct∨x∗kandkct∨x∗−ct∨x∗k are bounded by√
n/M. ChoosingM = (T /n)1/4, one gets the regret bound ofO(G(nT)3/4).
Finally, the procedureroundhas average running timeO(1/2) =O(T)per time step; that leads to the complexity ofO(T2).
3.2 Algorithm for online non-monotone DR-submodular maximization
In this section, we will provide an algorithm for the problem of online non-monotone DR-submodular maximization. The algorithm maintainsLonline vee learning oraclesE1, . . . ,EL. At each time stept, the algorithm starts from0(origin) and executesLsteps of the Frank-Wolfe algorithm where the update vector v`tis constructed by combining the outputut`of the online vee optimization oracleE`with the current vector xt`. In particular,v`tis set to bext`∨ut` −xt`. Then, the solution xt is produced at the end of theL-th step. Subsequently, after observing the DR-submodular functionFt, the algorithm defines a vectordt`and feedbackshdt`,xt`∨ut`ias the reward function at timetto the oracleE` for1≤`≤L. The pseudocode in presented in Algorithm 2.
We first prove a technical lemma.
Lemma 5. Letxt`be the vectors constructed in Algorithm 2. Then, it holds that1−kxt`+1k∞≥Q`
`0=1(1−η`0) for everyt∈ {1,2, . . . , T}and for every`∈ {1,2, . . . , L}.
Algorithm 2Online algorithm for down-closed convex sets
Input: A convex setK, a time horizonT, online vee optimization oraclesE1, . . . ,EL, step sizesρ` ∈(0,1), andη`= 1/Lfor all`∈[L]
1: Initialize vee optimizing oracleE`for all`∈ {1, . . . L}.
2: Initializext1 ←0anddt0←0for every1≤t≤T.
3: fort= 1toT do
4: Computeut` ←output of oracleE`in roundtfor all`∈ {1, . . . , L}.
5: for1≤`≤Ldo
6: Setvt`←xt`∨ut`−xt`
7: Setxt`+1←xt`+η`v`t.
8: end for
9: Setxt←xtL+1.
10: Playxt, observe the functionFtand get the reward ofFt(xt).
11: Compute a gradient estimategt`such thatE[gt`|xt`] =∇Ft(xt`)for all`∈ {1,2, . . . , L}.
12: Computedt` = (1−ρ`)dt`−1+ρ`gt`, for every`∈ {1, . . . , L}
13: Feedbackhdt`,xt`∨ut`ito oracleE`for all`∈ {1,2, . . . , L},
14: end for
Proof. We show that inequality holds for every t. Fix an arbitraryt ∈ {1,2, . . . , T}. For the ease of exposition, we drop the indextin the following proof. Let(x`)i,(u`)iand(v`)ibe theith coordinates of vectorsx`,u`andv`for1≤i≤nand for1≤`≤L, respectively. We first obtain the following recursion on fixed(x`)i.
1−(x`+1)i = 1− (x`)i+η`(v`)i
≥1− (x`)i+η`(1−(x`)i)
= (1−η`)(1−(x`)i) where the inequality holds since0≤(x`∨u`)i ≤1. Since(x1)i = 0, we have1−(xt`+1)i ≥Q`
`0=1(1−η`0).
The lemma holds since the inequality holds for every1≤i≤n.
We are now ready to prove the main theorem of the section.
Theorem 1. LetK ⊆[0,1]nbe a down-closed convex set with diameterD. Assume that functionsFt’s are G-Lipschitz,β-smooth andE
h
˜gt`− ∇Ft(xt`)
2i
≤σ2for everytand`. Then, the step-sizesη`= 1/Land ρ` = 2/(`+ 3)2/3for all1≤`≤L, Algorithm 2 achieves the following guarantee:
T
X
t=1
E[Ft(xt)]≥ 1 e−O
1 L
! T X
t=1
Ft(x∗)−O(n3/4GT3/4)−O
(βD+G+σ)DT L1/3
.
ChoosingL=T3/4 and note thatD≤√
n, we get
T
X
t=1
E[Ft(xt)]≥ 1 e−O
1 T3/4
! maxx∈K
T
X
t=1
Ft(x)−O((G+σ)n5/4T3/4). (7) Proof. Let x∗ be an optimal solution in hindsight, i.e.,x∗ ∈ arg maxx∈KPT
t=1Ft(x). Fix a time step t∈ {1,2, . . . , T}. For every2≤`≤L+ 1, we have following:
Ft(xt`)−Ft(xt`−1)≥ h∇Ft(xt`),(xt`−xt`−1)i −(β/2)kxt`−xt`−1k2 (usingβ-smoothness)
=η`−1h∇Ft(xt`),vt`−1i −βη`−12
2 kvt`−1k2 (using the update step in the algorithm)
=η`−1h∇Ft(xt`−1),v`−1t i+η`−1h∇Ft(xt`)− ∇Ft(xt`−1),v`−1t i −βη`−12
2 kv`−1t k2
≥η`−1h∇Ft(xt`−1),v`−1t i −η`−1k∇Ft(xt`)− ∇F(xt`−1)k · kvt`−1k −βη`−12
2 kvt`−1k2
(Cauchy-Schwarz)
≥η`−1h∇Ft(xt`−1),v`−1t i −η`−1βkxt`−xt`−1k · kv`−1t k −βη2`−1
2 kv`−1t k2
(usingβ-smoothness)
≥η`−1h∇Ft(xt`−1),v`−1t i −2βη`−12 kvt`−1k2
≥η`−1h∇Ft(xt`−1),v`−1t i −2βη`−12 D2 (diametre ofKis bounded)
=η`−1h∇Ft(xt`−1),xt`−1∨x∗−xt`−1i+η`−1h∇Ft(xt`−1),vt`−1−(xt`−1∨x∗−xt`−1)i
−2βη2`−1D2
=η`−1h∇Ft(xt`−1),xt`−1∨x∗−xt`−1i+η`−1hdt`−1,v`−1t −(xt`−1∨x∗−xt`−1)i +η`−1h∇Ft(xt`−1)−dt`−1,v`−1t −(xt`−1∨x∗−xt`−1)i −2βη2`−1D2
≥η`−1(Ft(xt`−1∨x∗)−Ft(xt`−1)) +η`−1hdt`−1,v`−1t −(xt`−1∨x∗−xt`−1)i +η`−1h∇Ft(xt`−1)−dt`−1,v`−1t −(xt`−1∨x∗−xt`−1)i −2βη2`−1D2
(using concavity in positive direction)
≥η`−1(Ft(xt`−1∨x∗)−Ft(xt`−1)) +η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i +η`−1h∇Ft(xt`−1)−dt`−1,xt`−1∨ut`−1−xt`−1∨x∗i −2βη`−12 D2
(using the definition ofv`−1t )
≥η`−1
(1− kxt`−1k∞)Ft(x∗)−Ft(xt`−1)
+η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i +η`−1h∇Ft(xt`−1)−dt`−1,xt`−1∨ut`−1−xt`−1∨x∗i −2βη`−12 D2 (using Lemma 2)
≥η`−1
`−2
Y
`0=1
(1−η`0)·Ft(x∗)−η`−1Ft(xt`−1) +η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i +η`−1h∇Ft(xt`−1)−dt`−1,xt`−1∨ut`−1−xt`−1∨x∗i −2βη`−12 D2 (using Lemma 5)
≥η`−1
`−2
Y
`0=1
(1−η`0)·Ft(x∗)−η`−1Ft(xt`−1) +η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i
−η`−1
2 1
γ`−1
k∇Ft(xt`−1)−dt`−1k2+γ`−1kxt`−1∨ut`−1−xt`−1∨x∗k2
−2βη2`−1D2 (Young’s Inequality, parameterγ`−1will be defined later)
≥η`−1
`−2
Y
`0=1
(1−η`0)·Ft(x∗)−η`−1Ft(xt`−1) +η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i
−η`−1
2 1
γ`−1
k∇Ft(xt`−1)−dt`−1k2+γ`−1D2
−2βη`−12 D2.
(sincekxt`−1∨ut`−1−xt`−1∨x∗k≤ kut`−1−x∗k≤D.) From the above we have
Ft(xt`)≥(1−η`−1)Ft(xt`−1) +η`−1
`−2
Y
`0=1
(1−η`0)·Ft(x∗) +η`−1hdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i
−η`−1
2 1
γ`−1
k∇Ft(xt`−1)−dt`−1k2+γ`−1D2
−2βη`−12 D2
=
1− 1 L
Ft(xt`−1) + 1 L
1− 1
L `−2
Ft(x∗) + 1
Lhdt`−1,xt`−1∨ut`−1−xt`−1∨x∗i
− 1 2L
1 γ`−1
k∇Ft(xt`−1)−dt`−1k2+γ`−1D2
−2βD2
L2 (replacingη` = 1/L) Applying recursively on`, we get:
Ft(xt`)≥ `−1 L
1− 1
L `−2
Ft(x∗) + 1 L
`−1
X
`0=1
1− 1
L
`−1−`0
hdt`0,xt`0 ∨ut`0 −xt`0 ∨x∗i
!
− 1 2L
`−1
X
`0=1
1− 1
L
`−1−`0 1
γ`0k∇Ft(xt`0)−dt`0k2+γ`0D2 !
−2βD2
L (8)
Summing above inequality (with`=L+ 1) over allt∈[T]and using online vee maximization oracle with regretRT, we obtain
T
X
t=1
Ft(xtL+1)
≥
1− 1 L
L−1 T
X
t=1
Ft(x∗) + 1 L
T
X
t=1 L
X
`0=1
1− 1
L L−`0
hdt`0,xt`0∨ut`0−xt`0∨x∗i
!
− 1 2L
T
X
t=1 L
X
`0=1
1− 1
L
L−`0 1
γ`0k∇Ft(xt`0)−dt`0k2+γ`0D2 !
−2βT D2 L
≥
1− 1 L
L−1 T
X
t=1
Ft(x∗) + 1 L
L
X
`0=1
1− 1
L L−`0
·
T
X
t=1
hdt`0,xt`0 ∨ut`0−xt`0 ∨x∗i
− 1 2L
T
X
t=1 L
X
`0=1
1
γ`0k∇Ft(xt`0)−dt`0k2+γ`0D2
−2βT D2
L (since1−1/L <1)
≥
1− 1 L
L−1 T
X
t=1
Ft(x∗)− 1 L
L
X
`0=1
1− 1
L L−`0
RT −2βT D2 L
− 1 2L
T
X
t=1 L
X
`0=1
1 γ`0
k∇Ft(xt`0)−dt`0k2+γ`0D2
(RT is the regret of a single oracle)
≥ 1 e−O
1 L
! T X
t=1
Ft(x∗)− RT −2βT D2
L − 1
2L
T
X
t=1 L
X
`0=1
1
γ`0k∇Ft(xt`0)−dt`0k2+γ`0D2
(9)
Claim 1. Choosing γ` = D(`+3)Q1/21/3 for 1 ≤ ` ≤ L where Q = {max1≤t≤Tk∇Ft(0)k242/3,4σ2 + 3(βD)2/2}, it holds that
T
X
t=1 L
X
`=1
1
γ`E[k∇Ft(xt`)−dt`k2] +γ`D2
=O
(βD+G+σ)DT L2/3
Proof of claim.Fix a time stept. We apply Lemma 3 to the left hand side of the claim inequality where, for 1≤`≤L,a`denotes the gradients∇Ft(xt`),a˜`denote the stochastic gradient estimategt`. Additionally, ρ` = (`+3)22/3 for1≤`≤L. Hence we get:
E[k∇Ft(xt`)−dt`k2]≤ Q
(`+ 4)2/3. (10)
With the choiceγ`= D(`+4)Q1/21/3, we obtain:
L
X
`=1
1
γ`E[k∇Ft(xt`)−dt`k2] +γ`D2
≤2DQ1/2
L
X
`=1
1 (`+ 4)1/3
≤2DQ1/2
L
Z
`=1
1
`1/3d` ≤3DQ1/2L2/3 =O
(βD+G+σ)DL2/3