

Control of Wireless Networks

Under Uncertain State Information

by

Thomas Benjamin Stahlbuhk

B.S., Electrical Engineering, University of California San Diego (2008)

M.S., Electrical Engineering, University of California San Diego (2009)

Submitted to the Department of Aeronautics and Astronautics

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2018

© Massachusetts Institute of Technology 2018. All rights reserved.

Author . . . .

Department of Aeronautics and Astronautics

May 1, 2018

Certified by . . . .

Eytan Modiano

Professor of Aeronautics and Astronautics

Thesis Supervisor

Certified by . . . .

Brooke Shrader

Senior Staff, MIT Lincoln Laboratory

Thesis Supervisor

Certified by . . . .

Moe Win

Professor of Aeronautics and Astronautics

Thesis Committee Member

Accepted by . . . .

Hamsa Balakrishnan

Associate Professor of Aeronautics and Astronautics

Chair, Graduate Program Committee


Control of Wireless Networks

Under Uncertain State Information

by

Thomas Benjamin Stahlbuhk

Submitted to the Department of Aeronautics and Astronautics on May 1, 2018, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

In shared spectrum, wireless communication systems experience interference that can cause packet transmission failures. The channel conditions that determine these losses are driven by an underlying time-evolving state, which is usually hidden from the wireless network and can only be partially observed through interaction with the channel. This introduces a trade-off between exploration and exploitation: the nodes of the network must schedule their transmissions to both observe the channels and achieve high throughput. The optimal balance between these objectives is determined by the network’s stochastic traffic demand. Solving this joint learning and scheduling problem is complex. In this thesis, we devise queue-length-based scheduling policies that can adapt to the network’s traffic, while simultaneously exploring the channel conditions.

We begin by considering controller policies for a transmitter that has multiple available channels. Packets stochastically arrive to the transmitter’s queue, and at each time slot, the transmitter can attempt transmission on one of the channels. For each channel, transmission attempts fail according to a random process with unknown mean. The objective of the transmitter is to learn the channels’ rates while simultaneously minimizing its queue backlog. We proceed to formulate transmission policies that are asymptotically order optimal.

Next, we consider transmission scheduling when the network under our control is sharing its channels with an uncooperative network. Transmission collisions cause the uncooperative network to reattempt transmission. Therefore, the experienced interference is correlated over time through the uncooperative network’s queueing dynamics, which are hidden from our network and must be estimated through observation. We derive upper and lower bounds on the maximum attainable rate of successful transmissions in a two user network and use these bounds to characterize the performance of larger networks. These results lead to a queue-length-based method for stabilizing the networks.

Finally, we extend our results to networks that have complex constraints on simultaneous transmissions. The network must learn its channel rates while also supporting its stochastic traffic demand. We devise a frame-based max-weight algorithm that learns the channel rates over the duration of a frame to stabilize the network.

Thesis Supervisor: Eytan Modiano
Title: Professor of Aeronautics and Astronautics

Thesis Supervisor: Brooke Shrader
Title: Senior Staff, MIT Lincoln Laboratory

Thesis Committee Member: Moe Win


Acknowledgments

There are many people who greatly supported me throughout my thesis work and to whom I owe a large amount of gratitude. I would like to begin by thanking my thesis advisers Professor Eytan Modiano and Dr. Brooke Shrader, who both contributed immensely to this work. Their encouragement, knowledge, and insight have led me throughout the past five years and continue to inspire me. I am also indebted to Professor Moe Win for all of his support as a thesis committee member and to thesis readers Dr. Jelena Diakonikolas and Professor Hyang-Won Lee for their time and excellent feedback. I would also like to thank the Communications and Networking Research Group (CNRG): Xinzhe Fu, Dr. Matthew Johnston, Dr. Nathaniel Jones, Igor Kadota, Dr. Greg Kuperman, Dr. Chih Ping Li, Qingkai Liang, Bai Liu, Dr. Marzieh Parandehgheibi, Dr. Georgios Paschos, Anurag Rai, Dr. Abhishek Sinha, Dr. Rahul Sinha, Rajat Talak, Vishrant Tripathi, Jianan Zhang, and Ruihao Zhu. All have provided a wonderful environment to conduct research in and have contributed in some way to this thesis.

Additionally, I would like to thank MIT Lincoln Laboratory, whose generous support made this work possible. B. Kenneth Estabrook, Dr. Carl Fossa, Matthew Kercher, and Dr. James Ward have especially provided tremendous support to me throughout my thesis work, for which I am very grateful.

Finally, I wish to thank my wonderful family and friends. Elizabeth Christian’s love and support have provided a light to guide me throughout this thesis. I treasure the time we have spent together over the past years. My parents Birger and Suzanne Stahlbuhk and sister and brother-in-law Toni and Joseph Wiggins have provided me the courage to pursue this thesis. I also give thanks to the Christian, Hastie, Hoover, and Pavlakis families for all of their encouragement.


Support

This material is based upon work supported by the United States Air Force under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the United States Air Force. This work was also sponsored by NSF grants AST-1547331, CNS-1701964, CNS-1524317, and CNS-1217048 and by Army Research Office (ARO) grant number W911NF-17-1-0508.


Contents

1 Introduction
1.1 Related Work
1.1.1 Network Control
1.1.2 Exploration and Exploitation Tradeoffs in Wireless Networks
1.2 Our Contributions
1.2.1 Minimizing Queue Length Regret
1.2.2 Spectrum Sharing with an Uncooperative Network
1.2.3 Scheduling and Learning in Multi-User Networks

2 Learning Algorithms for Minimizing Queue Length Regret
2.1 Problem Setup
2.2 Regret of Traditional Bandit Algorithms
2.3 Regret of Queue-Length-Based Policies
2.3.1 All Servers Are Stabilizing
2.3.2 Non-Stabilizing Servers
2.3.3 Learning Stabilizing Policies
2.4 Systems without Stabilizing Servers
2.5 Final Remarks
2.A Appendix
2.A.1 Proof of Theorem 2.1
2.A.2 Proof of Lemma 2.1
2.A.3 Proof of Lemma 2.2
2.A.4 Proof of Lemma 2.3
2.A.5 Proof of Lemma 2.4
2.A.6 Proof of Lemma 2.5
2.A.7 Proof of Lemma 2.6
2.A.8 Proof of Lemma 2.7
2.A.9 Proof of Lemma 2.8
2.A.10 Proof of Theorem 2.5

3 Throughput Maximization in Uncooperative Spectrum Sharing Networks
3.1 System Description and Preliminaries
3.1.1 Network Composition
3.1.2 Channel Behavior and Receiver Feedback
3.1.3 Additional Notation
3.1.4 Queue Stability
3.1.5 Objective
3.2 Two User Network
3.2.1 POMDP Model
3.2.2 Lower Bound
3.2.3 Upper Bound
3.2.4 Stability of the Two Node Network
3.3 Networks with More Than Two Users
3.3.1 Necessary and Sufficient Conditions for Rate Stability
3.3.2 Longest Queue First Scheduling
3.4 Final Remarks
3.4.1 Using Transmission Policies other than πlb
3.4.2 Distributed Scheduling Policies
3.4.3 Systems with Hidden and Observable Primary Users
3.4.4 Modeling Assumptions
3.A Appendix
3.A.1 Proof of Theorem 3.1
3.A.2 Proof of Theorem 3.2
3.A.3 Proof of Theorem 3.3
3.A.4 Proof of Theorem 3.4

4 Learning Algorithms for Scheduling in Wireless Networks with Unknown Channel Statistics
4.1 Network Model and Problem Formulation
4.1.1 Notation and Additional Assumptions
4.2 Scheduling with Uncertain Channel Statistics
4.3 Learning Greedy Maximal Matchings
4.3.1 Greedy Maximal Matching
4.3.2 Online Learning Algorithm
4.4 Performance Analysis under Primary Interference
4.4.1 Regret Bounds
4.4.2 Simulation of the Learning Algorithm’s Convergence
4.5 Stochastic Network Control
4.6 Extensions to Other Scheduling Constraints
4.7 Simulation Results
4.8 Final Remarks
4.A Appendix
4.A.1 Proof of Theorem 4.1
4.A.2 Proof of Theorem 4.2
4.A.3 Proof of Theorem 4.3
4.A.4 Proof of Lemma 4.3
4.A.5 Proof of Corollary 4.1

List of Figures

1-1 Illustration of different interference conditions. (a) Interference emanating from sources in the environment’s background. (b) Our network in direct competition for channel resources with an adjacent network.
1-2 System consisting of a single queue and transmitter with N channels. Each channel has a transmission success rate µi.
1-3 Primary and secondary users communicating over a collision channel.
2-1 Policy π1 for achieving Theorem 2.2. We assume µ̂i is set to zero if server i has not yet been observed.
2-2 The policy π2 for achieving Theorem 2.3. We assume µ̂i is set to zero if server i has not yet been observed.
2-3 Learning algorithm for bringing the queue back to the empty state. Time n is normalized to when the algorithm is called. We assume µ̂i^p is set to zero if server i has not yet been observed during this algorithm’s call. Note that µ̂i^p is a separate variable from µ̂i and is only used during busy period p.
3-1 Network of N secondary users (left) and M primary users (right).
3-2 Suppose at time t − 1 we observed a primary user packet with timestamp t − 3. Then, all primary user packets that arrived up to time t − 3 must have already been serviced. At time t, we are only uncertain whether primary user packet arrivals occurred during times t − 2, t − 1, and t.
3-3 Example of our decision process over four time slots. The stochastic shortest path problem is analyzed starting at t = 0 until the first successful secondary user transmission. The subsequence tk are the time slots, not following a collision, where decisions are made to transmit or not transmit. In this example, the first successful secondary user transmission occurs at time slot 3. Our analysis of the stochastic shortest path problem provides a policy for deciding when to transmit or not transmit given W(tk). We can then iteratively apply that policy to obtain subsequent successful transmissions.
3-4 The upper and lower bounds on µ∗. The line 1 − λ is shown for reference. Note that if the secondary user could observe the primary user’s queue backlog, it could achieve 1 − λ throughput. There is a clear loss in performance from our restriction that the secondary user rely on receiver feedback to make transmission decisions.
4-1 Example graph G and its conflict graph Gc. Edges that share a common node conflict and cannot be in the same activation set.
4-2 Greedy Maximal Matching (GMM) Algorithm [54]
4-3 Online Learning Algorithm
4-4 Example graph with edge set shown in light dots. Offline solution Mo is shown in solid lines, and online solution Mt is shown in dashed lines.
4-5 A single connected component in the graph used for Theorem 4.3.
4-6 (a) Graph with edge set in light dots; edges labeled with their mean rewards w(e). Matching Mo is shown in solid lines. (b) Average regret accrued over the time horizon n. (c) Fraction of time slots that labeled edges are chosen to be in matching Mt. Plots are averaged over 100 pseudo-random runs.
4-7 Average queue backlog for the online learning algorithm with varying frame size n and the offline GMM scheduler operating on a 4 × 4 grid with uniform arrival rate λ to each edge.
4-8 Markov chain for time-varying channel state. When in a given state, the edge has Bernoulli rates with the labeled mean.
4-9 (Top) The average queue backlog across the network for the frame-based learning algorithm with frame size n = 100 and the offline GMM scheduler. The values of c(e) are determined by the Markov chain of Fig. 4-8. (Bottom) Instantaneous upper bound on the network’s capacity.


Chapter 1

Introduction

Over the past few decades, high-speed wireless communications has become a ubiquitous technology that has enabled mobile telecommunications and Internet access and has become a major component of the global economy. The success of this technology has relied upon the availability of licensable spectrum that can be dedicated to the large-bandwidth requirements of high-speed data. In this setting, systems engineers can use well-known channel models, which describe the phenomena of wireless propagation, to design reliable communication links [71]. However, future applications will challenge this dependency. The ever-growing demand for greater network capacity will encourage a portion of future high-speed communications to occur in unlicensed and shared spectrum, where access to the channel can be contested, and over new spectrum regions, where the medium’s behavior is more difficult to evaluate [4].

In this thesis, we consider control policies for wireless networks that must operate in these challenging environments, where the channel conditions cannot be characterized prior to system deployment and must instead be learned and adapted to in the field. We conceptually view the channel conditions as being driven by an underlying, time-evolving state that determines when data transmission will and will not be successful. The wireless network does not a priori know the value and statistics of the state and can only partially observe the state by attempting transmission on the channel and using receiver feedback to measure the outcome. This introduces a fundamental trade-off between exploration and exploitation. The wireless network must simultaneously schedule its transmissions to observe its available channels’ conditions while also using its knowledge to support high data throughput.

Figure 1-1: Illustration of different interference conditions. (a) Interference emanating from sources in the environment’s background. (b) Our network in direct competition for channel resources with an adjacent network.

The above problem has a large scope. In this thesis, we will explore two different specifications for how the channel conditions, and the corresponding transmission failures, behave over time slots. In the first, transmission failures occur according to a Bernoulli random process over sequential transmission attempts. This behavior models situations where the network is subject to distant interferers in the environment’s background (see Fig. 1-1a). In aggregate, these interferers cause irregular moments of intense interference, which are difficult to guard against at the physical layer. Additionally, this behavior is applicable to networks that operate over media that do not have good propagation models or over channels that are changing too quickly for the systems to access and identify their states prior to transmission.

Given this model, the challenge facing the network is to learn the channels’ failure statistics, so that good channels can be identified and exploited. To do this, the network’s transmitters attempt transmission on the available channels and use feedback from the receivers to determine whether the attempt was successful. Because of limitations in the wireless nodes’ electronics and interference between nodes in the network, not every channel can be observed at every time slot, and the transmission scheduler will need to judiciously choose which channels should be transmitted on. In Chapters 2 and 4, we consider transmission scheduling policies for trading off the competing objectives of learning the channel statistics and exploiting previous observations to obtain good performance.

In the second model we consider, the network under our control is assumed to be in direct competition with an adjacent, uncooperative network for channel access (see Fig. 1-1b). The adjacent network receives packets according to a Bernoulli process and transmits on its assigned channel whenever its queue is non-empty. We assume a collision channel model. Therefore, if our network and the adjacent network transmit at the same time, all transmissions fail, and the involved packets must remain in their queues awaiting future delivery. Note that this framework is similar to the Bernoulli failure model above, except that transmission failures are now correlated over time through the adjacent network’s queueing dynamics. The difficulty facing our network is that it cannot directly observe the adjacent network’s queue backlogs and must instead use its past observations to estimate their sizes. In Chapter 3, we consider transmission scheduling policies for maximizing throughput in this model.
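To make this correlation concrete, below is a minimal simulation sketch of the two-user collision dynamics described above (all function and parameter names are illustrative, not from the thesis): because the adjacent network’s backlog persists across slots, a collision today raises the likelihood of interference tomorrow.

import random

def simulate_collision_channel(T=10_000, lam=0.2, p_transmit=0.5, seed=0):
    """Toy model: an adjacent (primary) user with Bernoulli(lam) arrivals
    transmits whenever its queue is non-empty; we transmit with a fixed
    probability. A slot succeeds for us only when the primary is silent."""
    rng = random.Random(seed)
    primary_queue = 0
    successes = 0
    for _ in range(T):
        we_transmit = rng.random() < p_transmit
        primary_transmits = primary_queue > 0
        if we_transmit and not primary_transmits:
            successes += 1                    # clear slot: our packet gets through
        elif primary_transmits and not we_transmit:
            primary_queue -= 1                # primary delivers one packet
        # if both transmit, a collision occurs and the primary packet stays queued
        primary_queue += rng.random() < lam   # Bernoulli arrival to the primary
    return successes / T

print(simulate_collision_channel())           # fraction of successful slots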

In this thesis, we take a network control perspective, where the objective is to find transmission control policies that stabilize the network’s stochastic traffic demand by adapting to the unknown channel conditions. When the channels’ instantaneous data rates are known, the max-weight policy that schedules transmissions based on the product of the network’s queue backlogs and channel rates is well known to be throughput optimal [25, 52]. When the channels’ instantaneous rates are not known but the statistics of the channels are, one should instead make scheduling decisions using the mean rates. In this thesis, we consider queue-length-based policies for scheduling transmissions in networks that do not know the channels’ states or statistics. We will find that these policies are critical for both minimizing packet delay in the network and correctly time-sharing the channel resources. Our analysis begins by considering networks with a single user in Chapter 2 and expands to consider multi-user networks in Chapters 3 and 4.
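As a point of reference for the max-weight rule just described, the following is a minimal sketch (our own illustration; the brute-force enumeration of activation sets is for clarity, not efficiency):

def max_weight_schedule(queues, rates, feasible_sets):
    """Return the activation set maximizing the sum of backlog-times-rate
    weights, i.e., the max-weight rule with known channel rates."""
    return max(feasible_sets,
               key=lambda s: sum(queues[l] * rates[l] for l in s))

# Example: links 0 and 1 conflict; link 2 can be activated with either.
queues = [5, 2, 7]
rates = [0.9, 0.5, 0.3]
feasible = [{0, 2}, {1, 2}, {0}, {1}, set()]
print(max_weight_schedule(queues, rates, feasible))   # -> {0, 2}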

The rest of this chapter is organized as follows. In Section 1.1, we summarize and review the previous research literature that is most related to our work. Our review is broken down into two topics: network control and exploitation/exploration trade-offs in wireless systems. Following this, in Section 1.2, we highlight and summarize the contributions of this thesis.

1.1 Related Work

In this section, we review the research literature that is most related to our work.

1.1.1 Network Control

In this thesis, we explore transmission scheduling policies for networks operating under uncertain channel conditions. The fundamental objective of our policies is to learn the channel conditions so that each node’s data requirements can be serviced. This objective places our work in the field of network control, which analyzes policies for controllers that are tasked with scheduling transmissions and routing packets through a network to satisfy a set of end-to-end traffic demands [25]. In this framework, each traffic demand injects packets, at its source node, into the network according to a stochastic arrival process and requires that the packets be delivered to a destination node. Then, at each time slot, the controller decides which transmissions should take place to service the network. An important parameter of the network is its stability region, which is the set of all arrival rates to the system that permit a scheduling and routing policy that can stabilize the network’s queue backlogs.

In the seminal works of Tassiulas and Ephremides [67, 68], the backpressure routing with max-weight scheduling algorithm was introduced and shown to stabilize any arrival rate vector that is interior to the network’s stability region. At each decision point, the method chooses to route packets according to the queue-backlog differences across the links and schedules transmissions so as to maximize a weighted sum of non-interfering links, where each link’s weight is given by its instantaneous capacity multiplied by the packet difference across the link [25]. Although the routing decisions can be easily made in a distributed manner across the nodes of the network, the max-weight scheduling decision requires that a centralized controller solve a (potentially) computationally-complex combinatorial optimization problem at each decision point. To overcome the computational complexity requirement, in [66], a linear-time algorithm was proposed that is able to stabilize the network at the cost of increased packet delay. Backpressure routing and max-weight scheduling have been extended to multicast and broadcast routing [59–61], network coding [33, 34], and overlay networks [32].

In order to provide distributed solutions for max-weight scheduling, the use of greedy methods, which are easily distributable, has been previously considered in the literature. In these methods, the set of activated links is constructed, at each decision point, by a greedy combinatorial algorithm. Under primary interference, where valid activation sets correspond to graph matchings,1 the use of greedy maximal matchings (GMM) was proposed in [46] and [17]. In these works, it was shown that a controller that uses GMMs can achieve one-half of the stability region, and thus the use of greedy methods in max-weight scheduling can come at the cost of throughput optimality. The use of GMMs was further studied in [11, 35, 58]. In contrast to greedy scheduling, the work of [49] proved the feasibility of a distributed throughput-optimal algorithm, although the devised method can be slow to converge and difficult to implement in practice.

In certain problem domains, it can be shown that greedy scheduling is throughput-optimal. In [21], Longest Queue First (LQF) greedy scheduling was proved to be throughput-optimal for networks that satisfy a “local pooling condition.” Identifying network properties that allow the local pooling condition to hold has been previously explored in the literature (e.g., [12, 77]). However, for general network topologies the condition will not hold, and LQF scheduling will not be throughput-optimal.

In this thesis, we will make extensive use of the above results. Specifically, in Chapter 3, we will use LQF scheduling to time-share channel resources among the nodes of a network. The additional challenge, explored in that chapter, is the presence of an adjacent, uncooperative network that will also gain access to the channel, creating interference for our network. In Chapter 4, we will use greedy maximal scheduling to devise distributed algorithms that can learn the channel conditions and schedule the nodes to meet complex constraints on simultaneous link activations. Specifically, we will consider a variant of the above methods called frame-based max-weight scheduling, which has been previously used in [14, 15, 29]. In frame-based max-weight, time is broken up into frames, and the max-weight algorithm is used at the frame boundaries to set service demands for each node. Then, over the course of each frame, we will deploy our learning algorithms to estimate the channel conditions and optimize for the service demands.

1 A matching is defined to be a subset of the graph’s edges such that no two edges share a common node.

1.1.2 Exploration and Exploitation Tradeoffs in Wireless Networks

Another important line of research in wireless communications, that is closely related to this thesis, deals with exploration/exploitation tradeoffs in channel sensing. In this field, there are two general methods for modeling the channel. The first assumes that the channel states evolve according to a Markov process, while the second assumes that the channel is i.i.d. with unknown statistics. We proceed to highlight some of these results.

One well-explored model in the literature considers a channel that evolves as a two-state Markov chain between “on” and “off” states. In this problem, a transmitter has a finite set of available channels, and, at each time step, it can schedule a transmission on one of them, observing the state of the chosen channel after the decision is made. The objective of the transmitter is to maximize its long-term rate of transmitting on channels that are in the “on” state. This introduces an exploration/exploitation tradeoff, as the transmitter both wants to observe channels to learn the probabilities of them being “on” while also scheduling transmissions to obtain high reward. In [2, 75], it was shown that when the transition probabilities are the same for all channels, the myopic policy, which chooses the channel that is most likely to be “on” given previous observations, is optimal. The general problem falls under the class of restless multi-armed bandits; therefore, the Whittle index policy can be applied [73]. In [48], the index policy for the problem was derived and analyzed.

In contrast to these works, it was shown in [31] that when the controller can probe a channel other than the one it decides to attempt transmission on, myopic policies are not necessarily optimal. In [29], policies for scheduling a set of queues attached to each channel were analyzed. Likewise, [53] considered a related problem where the system is constrained by the average number of transmissions made over time, which leads to an index policy that is optimal.

Models other than the two-state Markov chain have also been considered in the literature. In Chapter 3, we consider a problem where the controller must share its channels with uncontrollable nodes that interfere with our network’s transmissions. This work shares some similarities with the work of [72] and [38], which considered spectrum sharing with an adjacent network. However, in those works, the interfering network’s transmissions were visible to the controller and could therefore be avoided. In Chapter 3, we consider problems where the controller must rely upon feedback to make its transmission decisions.

The above-described works formulate the exploration/exploitation problem as a Markov process. Parallel to this line of work, there has been a body of research that has modeled the exploration/exploitation problem as a stochastic multi-armed bandit instead. The stochastic multi-armed bandit does not assume that each arm (or channel in our context) can be modeled using a known probability distribution. Instead, each arm gives a reward according to an i.i.d. stochastic process with unknown distribution. At each time, a controller can select one arm to observe and provide reward to the system. The objective is to maximize the total expected reward over time by learning the arm with the highest mean rate of reward. Algorithms for solving the multi-armed bandit problem have long been considered [69]. In the seminal work of [42], policies for achieving an asymptotically efficient rate of reward were devised. The resulting policies are complex, and subsequent work in [1] and [6] has provided simpler policies that are often used in practice.
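For concreteness, here is a minimal sketch of an upper-confidence-bound index policy in the spirit of those simpler algorithms (a generic UCB-style rule of our own; not the specific policies of [1] or [6]):

import math
import random

def ucb_play(pull, n_arms, horizon):
    """Generic UCB-style bandit loop: play each arm once, then pick the arm
    maximizing sample mean plus an exploration bonus. `pull(i)` returns an
    i.i.d. reward in [0, 1] for arm i."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running sample mean
        total += r
    return total

rng = random.Random(0)
success_rates = [0.3, 0.5, 0.8]
print(ucb_play(lambda i: float(rng.random() < success_rates[i]), 3, 10_000))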

A variant of this problem, known as the combinatorial bandit, was considered in [23]. In the combinatorial bandit, a subset of arms that meet a combinatorial constraint can be chosen simultaneously to be played. Each arm then reveals its individual reward, and the total reward is collected by the controller. Subsequent work in [18, 20, 41] has provided improved bounds on performance. Likewise, [40] and [65] considered policies for problems where the combinatorial structure is a matroid and thus greedy combinatorial algorithms can be used, and in [45], greedy policies on general combinatorial problems were analyzed.

Wireless channel access has been previously modeled using stochastic multi-armed and combinatorial bandits. The works of [3, 7, 22, 36, 47, 50] used these frameworks to model problems where multiple nodes are accessing a shared set of channels. The network receives a reward for successful transmissions on the channels, and therefore the objective is to quickly learn which channels give high probabilities of success so as to maximize the transmission rate. To this end, a typical objective is to avoid collision between nodes that can access the same channel at the same time. Problems where nodes that do not interfere can share a channel through spectrum reuse were also explored in [43,74,76]. In Chapter 4, we show how stochastic bandit algorithms can be integrated into a max-weight framework to build controllers that can simultaneously learn the channels’ rates while also responding to stochastic traffic demand.

1.2 Our Contributions

Next, we summarize the main contributions of this thesis.

1.2.1 Minimizing Queue Length Regret

Figure 1-2: System consisting of a single queue and transmitter with N channels. Each channel has a transmission success rate µi.

In Chapter 2, we consider a model consisting of a single transmitter and N channels (see Fig. 1-2). Packets arrive to the transmitter’s queue according to a Bernoulli process with rate λ and incur a unit cost for each time slot they remain in the queue. At each time slot, the transmitter can attempt a packet transmission on one of the channels, sending a probe/dummy packet when the queue is empty, and can use receiver feedback to determine if the transmission was successful. Each channel i experiences transmission failures according to a Bernoulli process with a rate, 1 − µi, that is initially unknown. The objective of the transmitter is to minimize the delay cost by identifying the channel with the smallest failure rate. To analyze the system, we consider queue length regret, which is defined to be the difference between the expected cost incurred over T time slots by a learning algorithm that does not know the best channel a priori and a controller that knows the best channel and schedules it for every time slot. Note that this metric captures the cost of having to learn the best channel (i.e., the sub-optimality incurred by learning).
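Anticipating the notation of Chapter 2 (cf. Eq. (2.1) in Section 2.1), the queue length regret of a learning policy π over a horizon of T time slots can be written as

\[
R^{\pi}(T) \triangleq \mathbb{E}\left[\sum_{t=0}^{T-1} Q^{\pi}(t) - \sum_{t=0}^{T-1} Q^{*}(t)\right],
\]

where Q^π(t) is the queue backlog under the learning policy and Q^∗(t) is the backlog under the controller that always schedules the best channel.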

The problem defined above is similar to the stochastic multi-armed bandit problem [13], and, therefore, one could borrow algorithms from the multi-armed bandit to solve our problem. This would provide policies that focus on maximizing the rate of offered service to the queue, and, if the queue was infinitely backlogged, this approach would maximize the number of successful transmissions. For this reason, the use of bandit algorithms for scheduling wireless transmissions has been considered in the previous literature (see Section 1.1.2 and the references therein). However, in our framework, this approach has the drawback that it does not take into account the system’s queueing dynamics. When the backlog is empty, the service that is offered to the queue is wasted. Therefore, the system should exploit times when the queue is empty to freely improve its estimates of the channels’ failure rates.

In this chapter, we therefore develop a set of queue-length-based policies for minimizing the system’s regret. The main result of this chapter is to show that for any problem where the transmission success rate of the best channel is greater than the arrival rate to the queue (i.e., problems where the queue can be stabilized), there exists a policy such that the queue length regret is asymptotically bounded. Likewise, we show that any algorithm designed to solve the traditional multi-armed bandit problem, when applied to our problem, may have a queue length regret that asymptotically grows on the order of log(T). This result is derived from known lower bounds on the performance of the stochastic multi-armed bandit problem [13, Chapter 2].

Figure 1-3: Primary and secondary users communicating over a collision channel.

1.2.2 Spectrum Sharing with an Uncooperative Network

In Chapter 3, we consider the throughput attainable by a wireless network that must share its channels with an adjacent, uncooperative network. For simplicity, we refer to the adjacent network as the primary network and the network under our control as the secondary network. We begin by considering a model where there exists one primary transmitter and one secondary transmitter that are communicating over the same shared channel (see Fig. 1-3). Time is slotted, and packets arrive as a Bernoulli process with rate λ to the primary user and wait in a queue to be successfully transmitted. The primary user transmits whenever its queue is nonempty. The channel is a collision channel, and, thus, if both primary and secondary users transmit at the same time, a collision occurs, and no packets are successfully received.

The objective of the secondary user is to achieve maximum throughput, denoted µ∗, by exploiting time slots when the primary user is not transmitting. However, this task is made difficult by the fact that the secondary user does not know the queue backlog size of the primary user and must therefore use past observations of its receiver feedback (which indicate the primary user’s channel use during past time slots) to determine when to transmit. Note that compared to the model of Chapter 2, the interference experienced by our network is now correlated through the queueing dynamics of the primary user and is reactive to the transmission decisions made by our network. The resulting problem is a partially-observable Markov decision process. We proceed to derive an upper bound on the maximum attainable throughput of the secondary user as a function of the primary user’s packet arrival rate and devise a lower-bound policy for scheduling secondary user transmissions. These bounds show that our devised policy is able to nearly attain the maximum throughput when the arrival rate to the primary user is less than 1/4. Interestingly, the policy does not need to know the exact arrival rate to the primary user to obtain this result. Thus, as long as the system is guaranteed that the arrival rate to the primary user is sufficiently low, the policy provides a complete solution.

Using our results for the two user case above, we proceed to characterize the network stability region for a multi-user secondary network. In this setting, there are multiple, orthogonal channels available to the secondary network, with each channel occupied by one primary user that receives packet arrivals according to a Bernoulli process with a fixed rate. Each secondary user generates packets according to an i.i.d. random process and has a subset of channels that it can transmit on. The objective is to design a network control algorithm that is able to simultaneously determine when, and on which subset of channels, each secondary user should transmit. We show that using our policy devised for the two user case and a queue-length-based time-sharing mechanism, our secondary network can achieve stability for almost any set of arrival rates to the secondary users that is inside the stability region, provided the arrival rates to all primary users are less than 1/4. Note that under this condition, the controller does not need to know the arrival rates to either the primary or secondary users to achieve the result.


1.2.3 Scheduling and Learning in Multi-User Networks

In Chapter 4, we extend our work to learning channel conditions in networks with complex constraints on simultaneous transmissions. We consider a graph G = (N, E), where N is a set of nodes and E is a set of edges. Time is slotted, and for each edge e ∈ E, packets are generated by an i.i.d. arrival process to be transmitted over e. The capacity of each edge fluctuates according to an i.i.d. random process (e.g., Bernoulli), with a mean rate that is initially unknown and can only be learned by attempting transmissions on the edge. Again, the network is faced with an exploration/exploitation tradeoff. It must take actions to both learn the channel rates and use past observations to support the network’s traffic demand. Furthermore, each edge conflicts with a subset of the other edges in the network, and therefore, all edges may not be activated simultaneously.2 The network must learn which activation sets to choose in order to stabilize the network.

To solve this tradeoff, we propose a frame-based max-weight scheduling algorithm that learns the channel rates over the duration of a frame to stabilize the network. At the frame boundaries the algorithm is able to adjust its weights to adapt to the network’s traffic demands. We show that as the frame duration becomes large enough, the algorithm is able to induce negative drift for sufficiently large queue backlogs and thus guarantee strong stability for all queues in the network. Additionally, we illustrate that the frame-based max-weight algorithm is able to adapt to slow changes in the average edge capacities over time.
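The following sketch illustrates the frame-based structure in code (a schematic of our own: arrivals and departures are frozen within the frame, and exploration is simplified relative to the algorithm developed in Chapter 4):

import random

def frame_based_max_weight(n_frames, frame_len, queues, true_rates,
                           feasible_sets, seed=0):
    """At each frame boundary, fix a schedule by solving max-weight with the
    current empirical rate estimates; over the frame, use the observed
    transmission outcomes to refine those estimates for the next boundary."""
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    est = [0.0] * len(true_rates)           # empirical mean rate per edge
    for _ in range(n_frames):
        schedule = max(feasible_sets,       # max-weight decision at the boundary
                       key=lambda s: sum(queues[e] * est[e] for e in s))
        for _ in range(frame_len):
            for e in schedule:
                success = rng.random() < true_rates[e]   # attempt + feedback
                counts[e] += 1
                est[e] += (success - est[e]) / counts[e]
    return est

In the actual algorithm, the within-frame transmissions must also explore edges outside the current schedule; otherwise, an edge with a pessimistic initial estimate might never be sampled.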

In practical networks, nodes can be many hops away from one another, and therefore choosing an activation set through a centralized controller, which aggregates all information in the network to make activation decisions, may be impractical. Therefore, in this chapter, we further explore how to make our scheduling algorithms distributable. Our methods use distributed greedy algorithms that only require that nodes learn the states of their local neighbors. The penalty incurred for using these distributed methods, however, is that the algorithm will only be able to guarantee stability for a fraction of the network’s stability region. For example, under the primary interference constraint that requires that every activation set be a graph matching, the method will only be able to guarantee stability for those arrival rate vectors that are in the 1/2 stability region.

2 We assume that the network knows which edges conflict with one another and must only learn the mean rates of the edges.

The success of frame-based methods depends on how efficiently we can learn the channels’ states over the duration of a frame. Therefore, as part of this chapter, we derive novel bounds on the rate at which our distributed methods can learn the channels’ capacities. Importantly, we will show that the metric of interest corresponds to the traditional definition of regret in the multi-armed bandit literature. We give results for networks under both the primary interference model and general conflict graph constraints.


Chapter 2

Learning Algorithms for Minimizing Queue Length Regret

In this chapter,1 we consider a statistical learning problem that is motivated by the following application. Consider a wireless communication system consisting of a single transmitter and N channels over which the system can communicate to a receiver. Data packets randomly arrive to the transmitter’s queue according to a Bernoulli process over time slots. The RF processing chain of a wireless transmitter is limited in power and bandwidth. Therefore, at each time step the transmitter can select only one of the N channels to transmit on. For each channel, transmissions fail according to a Bernoulli random process with an initially unknown mean. As discussed in the previous chapter, this behavior models channels that are subject to background interference from the environment or that experience rapid changes in signal strength. When a channel is selected by the controller, the system can attempt to transmit one of its packets on the channel and use receiver feedback to determine whether the transmission was successful. If at a given time slot the queue is empty, a probe (i.e., dummy) packet can be transmitted on one of the channels, instead of a data packet, and the receiver will provide feedback on whether the probe packet was successfully received. Note that the probe packets do not carry informational value and are only used to provide a means of sensing the channel during time slots when no data packets are available to send.2 The objective of the controller is to minimize the data packets’ delays by quickly identifying the best channel to transmit on.

The channels in the above application behave like servers in a queueing system that, when selected, offer a random amount of service to the queue. Given the above motivation, we consider the problem of identifying the best of N available servers to minimize a queue’s backlog. To this end, we associate a unit cost with each time slot that each packet has to wait in the queue. To obtain good performance, the controller must simultaneously schedule the servers (channels) to explore their service rates and also exploit previous observations to minimize queue backlog. To this end, in this chapter, we define the queue length regret to be

Rπ(T) ≜ E[ ∑_{t=0}^{T−1} Qπ(t) − ∑_{t=0}^{T−1} Q∗(t) ],

where Qπ(t) is the backlog under a learning policy and Q∗(t) is the backlog under a controller that knows the best server. Our objective is to find a policy that minimizes this regret.

Our problem is closely related to the stochastic multi-armed bandit problem. As was described in Chapter 1, in the multi-armed bandit, a player is confronted by a set of N possible actions of which it can select only one at a given time. Each action provides an i.i.d. stochastic reward, but the statistics of the rewards are initially unknown. Over a set of T successive rounds, the player must select from the actions to both explore how much reward each gives as well as exploit its knowledge to focus on the action that appears to give the most reward. Learning policies for solving the multi-armed bandit have long been considered [69]. Historically, the performance of a policy is evaluated using regret, which is defined to be the expected difference between the reward accrued by the learning policy and a player that knows, a priori, the action with highest mean reward.3 This quantifies the cost of having to learn the best action. It is well known that there exist policies such that the regret scales on the order of log(T) and that the order of this bound is tight [13].

One approach to solving our problem would be to simply use traditional bandit algorithms. This approach would focus on maximizing the rate of offered service to the queue and, if the queue were infinitely backlogged, would maximize the number of successful transmissions.4 The use of bandit algorithms for scheduling wireless transmissions has been previously explored in the literature (see Section 1.1.2 and the references therein). However, for our problem, this direct application has the drawback that it does not exploit the queueing dynamics of the system. During time periods when the queue is empty, the offered service is unused, and the controller can, therefore, freely poll the servers without hurting its objective. This contrasts with non-empty periods, when exploring can be costly since it potentially uses a suboptimal server. However, a controller cannot restrict its explorations only to time slots when the queue is empty. This is because some servers may have a service rate that is below the packet arrival rate, and if the controller refuses to explore during non-empty periods, it can settle on destabilizing actions that cause the backlog to grow to infinity. What is needed are policies that favor exploration during periods when the queue is empty but perform enough exploration during non-empty periods to maintain stability. In this chapter, we show that for systems that have at least one server whose service rate is greater than the arrival rate, there exist queue-length-based policies such that Rπ(T) = O(1). Likewise, we show that any traditional bandit learning algorithm, when applied to our setting, may have queue length regret Rπ(T) = Ω(log(T)).5

2 Thus, probe and data packets provide the same information to the transmitter about the channel it transmitted on: whether the chosen channel was “blocked” or “open” at that time.

The problem considered herein is related to the recent work of [39]. The main difference is that in [39], the controller could only obtain observations of the servers during times when the queue was backlogged. As a result, the policies considered in that work were closely related to those in the bandit literature that focus on maximizing offered service. Under these policies, [39] showed that the tail of E[Qπ(t) − Q∗(t)] diminishes as 1/t, implying that Rπ(T) is logarithmic. In the following, we show that by exploiting empty periods, the regret is bounded.

The rest of this chapter is organized as follows. In Section 2.1, we give the problem setup. We analyze the queue length regret of traditional bandit algorithms in Section 2.2. In Sections 2.3 and 2.4, we characterize the regret of queue-length-based policies when the best server has a service rate above and below the arrival rate, respectively. We make some summarizing remarks in Section 2.5.

4 One can view the rewards in the bandit problem as service in our problem.
5 Recall, for functions f and g, f(x) = O(g(x)) iff ∃M > 0 and ∃x0 such that ∀x > x0, |f(x)| ≤ M g(x).

2.1 Problem Setup

We consider a system consisting of a single queue and N servers operating over discrete time slots t = 0, 1, 2, . . . .6 Packets arrive to the queue as a Bernoulli process, A(t), with rate λ ∈ (0, 1] and can be serviced in the next time slot after their arrival. The amount of service that server i ∈ [N] can provide follows a Bernoulli process Di(t) with rate µi. The arrival process and server processes are assumed to be independent. We refer to server i as stabilizing if µi > λ; otherwise, we refer to it as non-stabilizing.

At each time slot, a controller can select one of the N servers to provide service to the queue. We denote the controller’s choice at time t as u(t) ∈ [N] and the service offered to the queue as D(t), which is equal to Du(t)(t). Given the above, the queue backlog, Q(t), evolves as

Q(t + 1) = (Q(t) − D(t))+ + A(t), for t = 0, 1, 2, . . . ,

where (x)+ denotes the maximum of x and 0. We assume Q(0) = 0 and the controller knows this. An instantiation of the above problem is characterized by the tuple (λ, ~µ), where ~µ is the vector of service rates. Let P be the set of all problems (i.e., tuples).
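As a quick illustration of this recursion, the backlog dynamics can be simulated in a few lines (the `policy` interface below is ours, for illustration):

import random

def simulate_queue(policy, mu, lam=0.3, T=10_000, seed=0):
    """Simulate Q(t+1) = (Q(t) - D(t))^+ + A(t), where policy(t, Q) picks the
    server u(t) and server i offers Bernoulli(mu[i]) service."""
    rng = random.Random(seed)
    Q, trajectory = 0, []
    for t in range(T):
        trajectory.append(Q)
        u = policy(t, Q)                  # controller's choice u(t)
        D = rng.random() < mu[u]          # offered service D(t)
        A = rng.random() < lam            # Bernoulli arrival A(t)
        Q = max(Q - D, 0) + A
    return trajectory

# Always scheduling the best server (i* = 1 here) keeps the queue stable.
print(max(simulate_queue(lambda t, Q: 1, mu=[0.2, 0.7])))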

The controller cannot observe the values of Di(t) prior to making its decision u(t).

It is then clear that the optimal action, which maximizes expected service, is to select the server

i∗ ≜ arg max_{i∈[N]} µi

to provide service. For simplicity, we assume i∗ is always unique.7 In our framework, the controller does not a priori know the values of µi and must therefore use observations of D(t) in order to identify i∗.8 This implies that the service available from server u(t) is revealed to the controller after time t, but the service that would have been available from all other servers remains hidden (i.e., bandit feedback). Note that the controller can observe D(t) at all times t, even when Q(t) = 0.

6 We assume N ≥ 2. Otherwise, the problem is trivial.

Our objective is to design a controller that makes decisions as to which server to schedule at each time slot. Denoting the history of all previous arrivals, service opportunities, and decisions as

H(t) = (A(0), D(0), u(0), . . . , A(t − 1), D(t − 1), u(t − 1)),

our objective is to find a controller policy π, which at each time t uses any portion of H(t) to (possibly randomly) make a decision on u(t).9 We denote the set of all policies as Π.

Define Q∗(t) to be the queue backlog under the controller that always schedules i∗ and Qπ(t) the backlog under a policy that must learn the service rates. We will analyze the performance of π using the following definition of queue length regret,

Rπ(T) ≜ E[ ∑_{t=0}^{T−1} Qπ(t) − ∑_{t=0}^{T−1} Q∗(t) ].    (2.1)

We will often simply refer to this metric as regret in the rest of the chapter. Note that

E[Qπ(t)] ≥ E[Q∗(t)],

and a policy that minimizes (2.1) must also minimize E[ ∑_{t=0}^{T−1} Qπ(t) ]. This implies that Rπ(T) is monotonically increasing for all π ∈ Π. The focus in this chapter will be on characterizing how regret can scale with time horizon T. To this end, we will consider policies that are independent of T (i.e., do not optimize for a specific time horizon).

8 We assume the exact value of λ is also unknown.
9 Since H(t) contains all previous arrivals and offered service to the queue, the controller knows the queue backlog Q(t) at each time t.

2.2 Regret of Traditional Bandit Algorithms

In this section, we establish that any policy π ∈ Π that does not use previous observations of the arrival process must have a queue length regret that grows logarithmically with T for some ~µ. Under this restriction, the policies in this section (in effect) do not observe previous arrivals to the queue, and their decision making process is solely focused on the observed offered services. Since the policies do not monitor the process A(t), they cannot directly use the queue backlog Q(t) to make their decisions, either. Under this restriction, we may still borrow any strategy from the traditional stochastic multi-armed bandit literature to solve the problem. These policies focus only on maximizing the offered service to the queue, using previous observations of the offered service to guide their decisions. Without loss of generality, the theorem is established for a system with two servers. The complete proof of the theorem can be found in Appendix 2.A.1.

Let µ(j) denote the rate of the jth server when the servers are sorted by increasing rate.

Theorem 2.1. For any π ∈ Π that does not use previous observations of the arrival process, ∃~µ with server rates µ(1) and µ(2) (µ(2) > µ(1)) such that for any fixed λ ∈ (0, µ(2)), Rπ(T) = Ω(log(T)).

2.3 Regret of Queue-Length-Based Policies

In this section, we examine the asymptotic scaling of the regret Rπ(T) for policies that make decisions using Q(t) (or, equivalently, the previous observations of the arrival process). We do so by considering the problem for different subsets of P. This section will build to the main result of this chapter, Theorem 2.4, which states that there exists a π ∈ Π such that for any problem (λ, ~µ) with µi∗ > λ, Rπ(T) = O(1).

We begin our analysis in Subsection 2.3.1 under the assumption that every server is stabilizing (i.e., µi > λ, ∀i ∈ [N]). Under this assumption, the controller does not need to account for the possibility of system instability. As a result, the controller can limit itself to performing exploration only on time slots in which the queue is empty. During time slots in which the queue is backlogged, the controller will exploit its previously obtained knowledge to schedule the server it believes to be best.

In Subsection 2.3.2, we allow for both stabilizing and non-stabilizing servers in the system (i.e., µi is allowed to be less than λ for a non-empty subset of the servers). However, we will assume that the controller is given a randomized policy that achieves an offered service rate that is greater than the arrival rate to the queue. Note that the controller does not need to learn this policy and can use it to return the queue to the empty state. The given randomized policy is not required to have any relationship to i∗ and will not, in general, minimize queue length regret. As a result, to minimize queue length regret, the controller will not want to excessively rely upon it.

In Subsection 2.3.3, we further relax our assumptions on the problem and require that the controller learn which servers are and are not stabilizing while simultaneously trying to minimize Rπ(T). This will require a policy that does not destabilize the system. To this end, the controller will have to explore the servers’ offered service rates during both time slots when the queue is empty and backlogged. Explorations during time slots when the queue is backlogged, in general, waste work and should therefore be performed sparingly. Intuitively, as the controller identifies which subset of servers have service rates µi > λ, it can focus its explorations on time slots when the queue is empty.

The above three cases build upon one another. The insight from one will point to a policy for the next, and we will therefore analyze the above cases in sequence. Under each of the above assumptions, we will find that there exists a policy such that the regret converges for all (λ, ~µ) meeting the assumption. In contrast, in Section 2.4, we show that there does not exist a policy that can achieve convergent regret over the class of all problems for which µi ≤ λ, ∀i ∈ [N].

2.3.1 All Servers Are Stabilizing

In this subsection, we assume all servers are stabilizing.

Assumption 2.1. (λ, ~µ) ∈ P1 ≜ {P : µi > λ, ∀i ∈ [N]}.


for each empty period p do
    At the first time slot of the period, set u(t) = i uniformly at random over [N] and update µ̂i with the observed state of D(t)
end for
for each busy period p do
    Schedule arg max_{i∈[N]} µ̂i until the queue empties
end for

Figure 2-1: Policy π1 for achieving Theorem 2.2. We assume µ̂i is set to zero if server i has not yet been observed.
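A short Python rendering of π1, embedded in a toy simulator, may help fix ideas (a sketch under our own naming; the thesis specifies the policy only through Fig. 2-1):

import random

def simulate_pi1(mu, lam, T, seed=0):
    """Simulate policy pi_1: probe a uniformly random server on the first
    slot of each empty period; during all other slots, schedule the server
    with the highest sample mean. Only the probes update the estimates."""
    rng = random.Random(seed)
    N = len(mu)
    counts = [0] * N
    mu_hat = [0.0] * N                 # zero until a server is observed
    Q, first_empty_slot = 0, True
    total_backlog = 0
    for _ in range(T):
        total_backlog += Q
        probing = (Q == 0 and first_empty_slot)
        if probing:
            u = rng.randrange(N)                        # explore
        else:
            u = max(range(N), key=lambda i: mu_hat[i])  # exploit
        D = rng.random() < mu[u]       # receiver feedback for server u
        if probing:
            counts[u] += 1
            mu_hat[u] += (D - mu_hat[u]) / counts[u]
            first_empty_slot = False
        A = rng.random() < lam
        Q_next = max(Q - D, 0) + A
        if Q > 0 and Q_next == 0:
            first_empty_slot = True    # a new empty period begins
        Q = Q_next
    return total_backlog / T           # time-averaged backlog

print(simulate_pi1(mu=[0.4, 0.6, 0.8], lam=0.3, T=50_000))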

Under this assumption we will prove the following theorem, which states that there exists a policy such that, for any problem from the set P1, the regret converges.

Theorem 2.2. Under Assumption 2.1, ∃π ∈ Π such that, for each (λ, ~µ) ∈ P1, Rπ(T) = O(1).

To prove Theorem 2.2, we analyze the policy shown in Fig. 2-1 on an arbitrary (λ, ~µ) ∈ P1. This policy maintains a sample mean µ̂i that estimates µi using observations of server i’s offered service. Note that under this policy π1, the queue backlog will transition through alternating time intervals wherein the queue is empty (Q(t) = 0) and busy (Q(t) > 0). We enumerate these periods using positive integers p = 1, 2, . . . . Under the policy, the first time slot of each empty period is used to update an estimate of one of the servers chosen uniformly at random, and during each busy period the queue is serviced by only one server (namely, the one with the highest sample mean). Therefore, the policy performs no exploration during busy periods and instead focuses on exploiting the observations it has already made.

Given that the policy schedules server i during busy period p, the duration of the busy period is given by random variable Xi with mean X̄i. By Assumption 2.1, for all i, X̄i is finite. Furthermore, the integral of the queue backlog over a busy period scheduled to server i is given by random variable Zi with mean Z̄i. Thus, for a busy period starting at time τ under service from i,

Zi ≜ ∑_{t=τ}^{τ+Xi−1} Q(t).

Finally, we use S_i^{π1}(P) to denote the number of busy periods in which i has been selected during the first P busy periods.

The proof of Theorem 2.2 now proceeds through four lemmas. We begin with Lemma 2.1, which shows that the busy periods over which we schedule i∗ do not contribute to the regret. The proof follows from a sample path argument showing that, for any outcome ω, over any busy period in which π1 schedules i∗, Qπ1(t, ω) ≤ Q∗(t, ω). This gives the following bound on regret. We let 1{·} be the indicator random variable. The proof of the lemma can be found in Appendix 2.A.2.

Lemma 2.1. Rπ1(T) ≤ E[ ∑_{t=0}^{T−1} ∑_{i∈[N]−i∗} Qπ1(t) 1{u(t) = i} ].

Next, in Lemma 2.2, the expected value of Zi is shown to be finite. This is because Zi can be bounded by Xi², which has a finite expectation. The proof of the lemma can be found in Appendix 2.A.3. Note that Z̄i is not a function of the busy period number p or of T.

Lemma 2.2. Z̄i < ∞.

Now, no more than T busy periods can occur in T time slots. Therefore, we can bound the total queue backlog, summed over times when we were scheduling server i, up until T as follows. The proof of the lemma can be found in Appendix 2.A.4.

Lemma 2.3. E[ ∑_{t=0}^{T−1} Qπ1(t) 1{u(t) = i} ] ≤ Z̄i E[S_i^{π1}(T)].

Finally, in Lemma 2.4, we show that the expected number of busy periods in which a sub-optimal server is chosen over the first T busy periods is bounded by a finite constant that is independent of T. Since we obtain a new observation during each empty period p, using Hoeffding's inequality we can show that the probability that a sample mean $\hat{\mu}_i \ge \hat{\mu}_{i^*}$, for $i \neq i^*$, decays exponentially in p. The result then follows from the convergence of geometric series. The proof of the lemma can be found in Appendix 2.A.5.
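To sketch the key step (our own illustration; $\Delta_i$ and $k_i$ are not notation from this chapter): let $\Delta_i = \mu_{i^*} - \mu_i > 0$ and suppose server i has been probed $k_i$ times. Since the event $\{\hat{\mu}_i \ge \hat{\mu}_{i^*}\}$ requires at least one sample mean to deviate from its true rate by $\Delta_i/2$, Hoeffding's inequality (Definition 2.1, with support [0, 1]) gives
\[
P(\hat{\mu}_i \ge \hat{\mu}_{i^*}) \le P\!\left(\hat{\mu}_i - \mu_i \ge \tfrac{\Delta_i}{2}\right) + P\!\left(\hat{\mu}_{i^*} - \mu_{i^*} \le -\tfrac{\Delta_i}{2}\right) \le e^{-k_i \Delta_i^2/2} + e^{-k_{i^*} \Delta_i^2/2},
\]
and since each empty period adds one probe to a uniformly chosen server, each $k_i$ grows linearly in p with high probability, yielding the exponential decay in p.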


Lemma 2.4. $\mathbb{E}[S_i^{\pi_1}(T)] = O(1)$, $\forall i \in [N] - i^*$.

Given the above four lemmas, we are now ready to establish the theorem.

Proof of Theorem 2.2. Combining Lemmas 2.1 and 2.3,
\[
R^{\pi_1}(T) \le \sum_{i \in [N] - i^*} \bar{Z}_i\, \mathbb{E}[S_i^{\pi_1}(T)].
\]
Then by Lemmas 2.2 and 2.4, $R^{\pi_1}(T) = O(1)$, giving the result.

Note that Theorem 2.2 does not imply that there exists a single constant that bounds $R^{\pi_1}(T)$ for all $(\lambda, \vec{\mu}) \in \mathcal{P}_1$. It states that for any chosen $(\lambda, \vec{\mu}) \in \mathcal{P}_1$, under policy $\pi_1$, the queue length regret converges to a finite number (dependent on λ, $\vec{\mu}$, N, and $\pi_1$) and does not grow to infinity as T goes to infinity. For an increasing number of servers or an increasing load factor $\lambda/\mu_{i^*}$, we expect that the constant to which $R^{\pi_1}(T)$ converges will increase.

2.3.2 Non-Stabilizing Servers

In this subsection, we relax Assumption 2.1 to allow for non-stabilizing servers; i.e., we now allow µi ≤ λ for some strict subset of the servers. Importantly, we will assume that the controller is given a known convex combination of the servers' rates that strictly dominates the arrival rate to the queue. Then, by randomizing over the servers according to this convex combination, the controller can always stabilize the system. Concretely, we make the following assumption.

Assumption 2.2. For given $\alpha_i \ge 0$ with $\sum_{i \in [N]} \alpha_i = 1$,
\[
(\lambda, \vec{\mu}) \in \mathcal{P}_2 \triangleq \Big\{P : \sum_{i \in [N]} \alpha_i \mu_i > \lambda\Big\}.
\]

Note that this assumption implies that the controller knows one stationary, randomized policy whose offered service rate is greater than the arrival rate, and that this randomized policy does not need to be learned.


for Empty period p do
    At the first time slot of the period, set u(t) = i uniformly at random over [N] and update $\hat{\mu}_i$ with the observed state of D(t)
end for
for Busy period p do
    Schedule $\arg\max_{i \in [N]} \hat{\mu}_i$ for the first p time slots or until the queue empties
    if the queue does not empty during the first p time slots then
        At each time slot, schedule server i with probability $\alpha_i$ until the queue empties
    end if
end for

Figure 2-2: The policy $\pi_2$ for achieving Theorem 2.3. We assume $\hat{\mu}_i$ is set to zero if server i has not yet been observed.

Further note that $\alpha_i$ is not required to have any special relationship to i∗ and that $\alpha_{i^*}$ can be zero. Therefore, the randomized policy will not generally minimize regret, and its use should not be relied upon too heavily. We then have the following theorem.

Theorem 2.3. Under Assumption 2.2, $\exists \pi \in \Pi$ such that, for each $(\lambda, \vec{\mu}) \in \mathcal{P}_2$, $R^{\pi}(T) = O(1)$.

A policy that achieves Theorem 2.3 is given in Fig. 2-2 and is referred to as $\pi_2$ throughout this subsection. The policy is similar to the policy $\pi_1$ of Subsection 2.3.1 in that it iterates over empty and busy periods. The main difference, however, is that each busy period now has a time-out threshold. If the queue has not emptied by the time-out threshold, the policy uses the randomized policy defined by Assumption 2.2 to bring the queue backlog back to the empty state. The time-out threshold grows linearly in p, and therefore with each busy period the controller becomes more reluctant to call upon the randomized policy.
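The following is a minimal sketch of one busy period under $\pi_2$, assuming the same Bernoulli model as the earlier $\pi_1$ sketch; the function name serve_busy_period and its arguments are illustrative, not from the thesis.

import random

def serve_busy_period(p, Q, mu, mu_hat, alpha, lam):
    """Sketch of busy period p under pi_2: exploit the best sample mean
    for up to p slots; past the time-out, fall back to the known
    randomized policy alpha until the queue empties."""
    N = len(mu)
    best = max(range(N), key=lambda j: mu_hat[j])  # frozen at period start
    n = 0
    while Q > 0:
        if n < p:
            i = best                   # exploit phase (before time-out)
        else:                          # time-out crossed: randomize
            i = random.choices(range(N), weights=alpha)[0]
        Q += random.random() < lam     # arrival this slot
        if random.random() < mu[i]:    # offered service succeeds
            Q = max(Q - 1, 0)
        n += 1
    return n                           # busy-period duration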

Theorem 2.3 is established by analyzing $\pi_2$ on an arbitrary $(\lambda, \vec{\mu}) \in \mathcal{P}_2$. The proof is derived through three lemmas. We introduce some additional notation to facilitate understanding. For each time t, define p(t) to be the period number for the empty or busy period in which t resides. Furthermore, let C(p) ∈ {0, 1} be an indicator that takes the value 1 if either a server not equal to i∗ is first scheduled at the start of busy period p or the time-out threshold of busy period p is crossed; otherwise, C(p) = 0. Note that by the definition of $\pi_2$, if any server $i \neq i^*$ is scheduled during busy period p, then C(p) must equal 1. Then, we have the following lemma, which states that the regret is upper bounded by the sum of queue backlogs over those busy periods in which the indicator C(p) = 1. The proof is similar to that of Lemma 2.1 and can be found in Appendix 2.A.6.

Lemma 2.5. $R^{\pi_2}(T) \le \mathbb{E}\left[\sum_{t=0}^{T-1} Q^{\pi_2}(t)\, C(p(t))\right]$.

Now, to upper bound the right-hand side of Lemma 2.5, we will want to bound the probability that the indicator C(p) = 1. Note that C(p) can equal 1 if, at the start of the busy period, there exists an $i \neq i^*$ such that $\hat{\mu}_i \ge \hat{\mu}_{i^*}$, or if we schedule i∗ at the start of the busy period but still cross the time-out threshold. As with the policy of the previous subsection, the probability of the former case diminishes exponentially in p; likewise for the latter case, since Assumption 2.2 implies $\mu_{i^*} > \lambda$, the probability of i∗ not emptying the queue within p time slots also diminishes exponentially in p. This gives the following lemma, whose proof can be found in Appendix 2.A.7.

Lemma 2.6. There exist positive constants $M_0$, χ, and $p_0$ such that for all $p \ge p_0$, $P(C(p) = 1) \le M_0 e^{-\chi p}$.

Now, let the variable $Z^{\pi_2}(p) \in \{1, 2, \dots\}$ denote the integral of the queue backlog over busy period p. In Lemma 2.7, we show that $\mathbb{E}[Z^{\pi_2}(p) \mid C(p) = 1]$ grows at most quadratically with p. This is because the queue backlog can grow at most linearly over the first p time slots, and, given that the time-out threshold is hit, the amount of time the randomized policy needs to empty the queue also grows in expectation at most linearly with p. Thus, the integral is at most on the order of $p^2$.

Lemma 2.7. There exist positive constants $M_1$ and $\beta_1$ such that for all p, $\mathbb{E}[Z^{\pi_2}(p) \mid C(p) = 1] \le M_1 p^2 + \beta_1$.
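Heuristically (our own back-of-the-envelope accounting; δ is not notation used in this chapter): with at most one arrival per slot, the backlog at the time-out is at most p + 1, so the backlog integral up to the time-out is at most p(p + 1); past the time-out, the randomized policy drains the queue with expected drift $\delta \triangleq \sum_{i \in [N]} \alpha_i \mu_i - \lambda > 0$, contributing an expected integral on the order of $(p+1)^2/\delta$. Hence
\[
\mathbb{E}\left[Z^{\pi_2}(p) \mid C(p) = 1\right] \;\lesssim\; p(p+1) + \frac{(p+1)^2}{\delta} \;=\; O(p^2).
\]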


We are now ready to establish the theorem.

Proof of Theorem 2.3. Using Lemma 2.5 and the fact that no more than T busy periods can occur in T time slots,
\[
R^{\pi_2}(T) \le \mathbb{E}\left[\sum_{t=0}^{T-1} Q^{\pi_2}(t)\, C(p(t))\right] \le \mathbb{E}\left[\sum_{p=1}^{T} Z^{\pi_2}(p)\, C(p)\right]. \tag{2.2}
\]
Recall that C(p) is an indicator. When C(p) = 0, $Z^{\pi_2}(p)\, C(p) = 0$. Thus,
\[
(2.2) = \sum_{p=1}^{T} P(C(p) = 1)\, \mathbb{E}\left[Z^{\pi_2}(p) \mid C(p) = 1\right] \le \sum_{p=1}^{p_0 - 1} \left(M_1 p^2 + \beta_1\right) + \sum_{p=p_0}^{T} \left(M_1 p^2 + \beta_1\right) M_0 e^{-\chi p},
\]
where we have applied Lemmas 2.6 and 2.7 to obtain the last inequality. Since
\[
\sum_{p=0}^{\infty} \left(M_1 p^2 + \beta_1\right) M_0 e^{-\chi p} < \infty,
\]
we see that $R^{\pi_2}(T) = O(1)$, which gives the result.

2.3.3 Learning Stabilizing Policies

We now relax Assumption 2.2 and no longer assume that a randomized policy is given to the controller a priori. Instead, we only assume that there exists at least one stabilizing server.

Assumption 2.3. $(\lambda, \vec{\mu}) \in \mathcal{P}_3 \triangleq \{P : \mu_{i^*} > \lambda\}$.

We then have the following theorem, which is analogous to those of the previous subsections.

Theorem 2.4. Under Assumption 2.3, $\exists \pi \in \Pi$ such that, for each $(\lambda, \vec{\mu}) \in \mathcal{P}_3$, $R^{\pi}(T) = O(1)$.

for Time slot n = 0, 1, . . . until the queue empties do
    if n is a dedicated exploration time slot then
        Choose u(n) = i uniformly at random over [N]
        Update $\hat{\mu}_i^p$ with the observed state of D(n)
    else
        Schedule $u(n) = \arg\max_{i \in [N]} \hat{\mu}_i^p$
    end if
end for

Figure 2-3: Learning algorithm for bringing the queue back to the empty state. Time n is normalized to when the algorithm is called. We assume $\hat{\mu}_i^p$ is set to zero if server i has not yet been observed during this algorithm's call. Note that $\hat{\mu}_i^p$ is a separate variable from $\hat{\mu}_i$ and is only used during busy period p.

To prove the theorem, we analyze the following policy $\pi_3$ on an arbitrary $(\lambda, \vec{\mu}) \in \mathcal{P}_3$. Under $\pi_3$, we follow the policy described in Fig. 2-2 with the following minor change. When the time-out threshold of a busy period is reached, rather than relying on the known randomized policy given by Assumption 2.2, $\pi_3$ uses the method of Fig. 2-3 to bring the queue back to the empty state. The method in Fig. 2-3 has dedicated exploration time slots. During these time slots, the policy chooses from the servers uniformly at random and updates new variables $\hat{\mu}_i^p$ based on its observations of the resulting offered service. The locations of the dedicated exploration time slots are predetermined by the controller. Let V(n) be the number of dedicated explorations made by time slot n. Then, for our proof of Theorem 2.4, we require that the dedicated exploration times are chosen such that
\[
r n^{\epsilon} - b_1 \le V(n) \le r n^{\epsilon} + b_2, \quad \text{for } n = 0, 1, 2, \dots \tag{2.3}
\]
for positive constants $b_1$, $b_2$, $r$, and $\epsilon \in (0, 1)$. For example, one choice could be to have dedicated explorations occur at time slots $n = k^2$ for $k = 0, 1, 2, \dots$. Note that (2.3) requires that the frequency of explorations diminishes with n and eventually falls below $1 - \lambda/\mu_{i^*}$. Subject to (2.3), the exact choice of the dedicated exploration times is left to the designer.
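As a quick illustration (our own check, assuming the $n = k^2$ schedule mentioned above), the count of explorations by slot n is $V(n) = \lfloor\sqrt{n}\rfloor + 1$, which satisfies (2.3) with $r = 1$, $\epsilon = 1/2$, and $b_1 = b_2 = 1$:

import math

def is_exploration_slot(n):
    """One possible schedule: dedicated exploration at n = k^2."""
    return math.isqrt(n) ** 2 == n

def V(n):
    """Number of dedicated exploration slots among 0, 1, ..., n."""
    return math.isqrt(n) + 1   # k^2 <= n for k = 0, 1, ..., floor(sqrt(n))

# Check the bounds of (2.3): sqrt(n) - 1 <= V(n) <= sqrt(n) + 1.
for n in range(100_000):
    assert math.sqrt(n) - 1 <= V(n) <= math.sqrt(n) + 1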


Note that, unlike $\pi_1$ and $\pi_2$, policy $\pi_3$ performs exploration even while the queue is backlogged. This allows the controller to continuously empty the queue even if it has not yet had enough empty periods to learn the best server. However, similar to policy $\pi_2$, the controller becomes increasingly reluctant to explore during busy periods as time progresses.

As policy $\pi_3$ is similar to $\pi_2$, the proof of Theorem 2.4 closely follows the proof of Theorem 2.3, with Lemma 2.8 below replacing Lemma 2.7. For policy $\pi_3$, let $Z^{\pi_3}(p) \in \{1, 2, \dots\}$ denote the integrated queue backlog of busy period p. Note that $Z^{\pi_3}(p)$ is analogous to $Z^{\pi_2}(p)$ of the previous subsection. Then, we have the following lemma, whose proof can be found in Appendix 2.A.9.

Lemma 2.8. There exist positive constants $M_2$ and $\beta_2$ such that for all p, $\mathbb{E}[Z^{\pi_3}(p) \mid C(p) = 1] \le M_2 p^2 + \beta_2$.

The proof of Theorem 2.4 then follows the proof of Theorem 2.3, with $Z^{\pi_3}(p)$ replacing $Z^{\pi_2}(p)$.

2.4 Systems without Stabilizing Servers

In the previous section, we saw that the existence of a stabilizing server (i.e., $\mu_{i^*} > \lambda$) allowed for policies that achieved $R^{\pi}(T) = O(1)$. A natural question is whether we can obtain similar results for the subset of problems $(\lambda, \vec{\mu})$ that do not have stabilizing servers. We proceed to show that there cannot exist a policy that achieves O(1) regret over this entire subset. The proof can be found in Appendix 2.A.10.

Assumption 2.4. $(\lambda, \vec{\mu}) \in \mathcal{P}_4 \triangleq \{P : \mu_i \le \lambda, \forall i \in [N]\}$.

Given this, we have the following theorem.

Theorem 2.5. For any policy π ∈ Π, there exists a $(\lambda, \vec{\mu}) \in \mathcal{P}_4$ such that $R^{\pi}(T)$ is not O(1).


2.5 Final Remarks

In this chapter, we considered the problem of learning service rates to minimize queue length regret. We showed that queue-length-based policies can have a queue length regret that is order optimal, O(1), for all systems such that $\mu_{i^*} > \lambda$, while traditional bandit algorithms may have a queue length regret that is Ω(log T). This illustrates that in order to optimize queue length regret, learning algorithms must take into account the system's queueing dynamics. By doing so, they can exploit time slots in which the queue is empty to improve their understanding of the servers' rates. Importantly, a policy should prefer to explore the servers when its queue is empty, but must also perform enough exploration during nonempty periods to keep the system stable. In this chapter, this trade-off was accomplished through policies that became increasingly reluctant to explore at each busy period. These policies were chosen for their analytical tractability, which allowed us to show the achievability of O(1) regret. However, we believe there are opportunities to improve upon our policies' rates of convergence.

2.A Appendix

In the following appendices, we will make extensive use of Hoeffding's inequality, stated below.

Definition 2.1 (Hoeffding's Inequality). Consider $S_n = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are i.i.d. random variables with mean $\bar{X}$ and support contained in [a, b] for a < b. Then, for all x > 0,
\[
P\left(\frac{1}{n} S_n - \bar{X} \ge x\right) \le e^{-\frac{2nx^2}{(b-a)^2}}, \qquad P\left(\frac{1}{n} S_n - \bar{X} \le -x\right) \le e^{-\frac{2nx^2}{(b-a)^2}}.
\]

Note that in the following, Hoeffding’s inequality will be used even when a tighter Chernoff bound may be found.
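As a quick numerical sanity check of Definition 2.1 (our own illustration, with arbitrary parameters), the following compares the empirical tail probability of a Bernoulli sample mean against the Hoeffding bound:

import math
import random

# Empirical check of Hoeffding's inequality for Bernoulli(0.5) samples
# (support [0, 1], so b - a = 1); the parameters here are illustrative.
n, x, trials, p = 50, 0.2, 100_000, 0.5
exceed = sum(
    sum(random.random() < p for _ in range(n)) / n - p >= x
    for _ in range(trials)
)
print(f"empirical tail:  {exceed / trials:.4f}")
print(f"Hoeffding bound: {math.exp(-2 * n * x ** 2):.4f}")  # e^{-2 n x^2}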
