Taming traffic dynamics : analysis and improvements

(1)

HAL Id: hal-00565750

https://hal.archives-ouvertes.fr/hal-00565750

Submitted on 14 Feb 2011

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Taming traﬀic dynamics : analysis and improvements

Pedro Casas Hernandez, Federico Larroca, Jean Louis Rougier, Sandrine

Vaton

To cite this version:

Pedro Casas Hernandez, Federico Larroca, Jean Louis Rougier, Sandrine Vaton. Taming traﬀic

dy-namics : analysis and improvements. Computer Communications, Elsevier, 2010. �hal-00565750�

(2)

Taming Traffic Dynamics: Analysis and Improvements

Pedro Casasa,b, Federico Larrocac,∗, Jean-Louis Rougierc, Sandrine Vatona

a_T´_el´_{ecom Bretagne. Technopˆ}_{ole Brest-Iroise, CS 83818, 29238 Brest Cedex 3, France}

b_{Facultad de Ingenier´ıa, Universidad de la Rep´}_{ublica. Julio Herrera y Reissig 565, C.P. 11.300, Montevideo, Uruguay} c_T´_el´_{ecom ParisTech. 46 rue Barrault, F-75634 Paris Cedex 13, Paris, France}

Abstract

Internet traffic is highly dynamic and difficult to predict in current network scenarios, which enormously complicates network management and resources optimization. To address this uncertainty in a robust and efficient way, two almost antagonist Traffic Engineering (TE) techniques have been proposed in the last years: Robust Routing and Dynamic Load-Balancing. Robust Routing (RR) copes with traffic uncertainty in an off-line preemptive fashion, computing a single static routing configuration that is optimized for traffic variations within some predefined uncertainty set. On the other hand, Dynamic Load-Balancing (DLB) balances traffic among multiple paths in an on-line reactive fashion, adapting to traffic variations in order to optimize a certain congestion function. In this article we present the first comparative study between these two alternative methods. We are particularly interested in the performance loss of RR with respect to DLB, and on the response of DLB when faced with abrupt changes. This study brings insight into several RR and DLB algorithms, evaluating their virtues and shortcomings, which allows us to introduce new mechanisms that improve previous proposals.

Keywords: Traffic Uncertainty, Traffic Management, Robust Optimization, Robust Routing, Dynamic Load-Balancing

1. Introduction

As network services and Internet applications evolve, network traffic is becoming increasingly complex and dy-namic. The convergence of data, telephony and televi-sion services on an all-IP network directly translates into a much higher variability and complexity of the traffic injected into the network. To make matters worse, the presence of unexpected events such as network equipment failures, large-volume network attacks, flash crowd occur-rences and even external routing modifications induces large uncertainty in traffic patterns. Moreover, current evolution and deployment-rate of broadband access tech-nologies (e.g. Fiber To The Home) only aggravates this uncertainty.

But these are not the only problems network opera-tors are confronted with. The ever-increasing access rates available for end-users we just mentioned is such that the assumption of infinitely provisioned core links could soon become obsolete. In fact, recent Internet traffic studies from major network technology vendors like Cisco Systems forecast the advent of the Exabyte era [1, 2], a massive in-crease in network traffic driven by high-definition video.

∗

Corresponding author. Telephone: +33 (0)1 45 81 75 52 - Fax: +33 (0)1 45 89 79 06

Email addresses: [email protected](Pedro Casas), [email protected] (Federico Larroca), [email protected] (Jean-Louis Rougier), [email protected] (Sandrine Vaton)

In this context, simply upgrading link capacities may no longer be an economically viable solution. Moreover, even if overdimensioning would be possible, its environmental impact is not negligible. For instance, the Information and Communication Technology sector alone is responsi-ble for around 2% of the man-made CO2, a similar figure

to that of the airline industry, but with higher increasing perspectives [3]. An efficient and responsible usage of the resources is then essential1_.

In the light of this traffic scenario, we study the prob-lem of intradomain Traffic Engineering (TE) under traf-fic uncertainty. This uncertainty is assumed to be an ex-ogenous traffic modification, meaning that traffic varia-tions are not produced within the domain for which rout-ing is optimized but are due to external and difficult to predict events. More in particular, we are interested in two almost antagonist approaches that have emerged in the recent years to cope with both the increasing traffic dynamism and the need for cost-effective solutions: Ro-bust Routing (RR) [6, 7, 8] and Dynamic Load-Balancing (DLB) [9, 10, 11].

In RR, traffic uncertainty is taken into account directly within the routing optimization, computing a single rout-ing configuration for all traffic demands within some un-certainty set where traffic is assumed to vary. This uncer-tainty set can be defined in different ways, depending on

1_{To learn more about this emerging discipline, the interested}

reader should consult works related to so-called “green network-ing” [4].

(3)

the available information: largest values of links load pre-viously seen, a set of prepre-viously observed traffic demands (previous day, same day of the previous week), etc. The criterion to search for this unique routing configuration is generally to minimize the maximum link utilization (i.e. the utilization of the most loaded link in the network) for all traffic demands of the corresponding uncertainty set. While this routing configuration is not optimal for any sin-gle traffic demand within the set, it minimizes the worst case performance over the whole set.

DLB copes with traffic uncertainty and variability by splitting traffic among multiple paths in real-time. In this dynamic scheme, each origin-destination (OD) pair of nodes within the network is connected by several a pri-ori configured paths, and the problem is simply how to distribute traffic among these paths in order to optimize a certain function. DLB is generally defined in terms of a link-congestion function, where the portions of traffic are adjusted in order to minimize the total network conges-tion. Ideally, the traffic distribution is set so that at every instant the objective function is optimized.

Those who promote DLB highlight among others the fact that it is the most resource-efficient possible scheme, and that given the configured paths it supports every pos-sible traffic demand, all of this in an automated and de-centralized fashion. In practice, the “always-optimized” characteristic we mentioned above is achieved by means of a distributed algorithm periodically executed by ev-ery ingress router based on feedback from the network. It is precisely this last characteristic that constitutes the most challenging aspect of DLB. In fact, the deployment of DLB has been, to say the least, limited. Two partic-ular problems arise in DLB: convergence to the optimum is not always guaranteed, and convergence speed might be over-killing under large and abrupt changes in traffic de-mands. Network operators are reluctant to use dynamic mechanisms mainly because they are afraid of a possible oscillatory behavior of the algorithm used by each OD pair to adjust load-balancing. As the early experiences in ArpaNet has proved [12], these concerns are not without reason. (In particular, before July 1987, the links’ met-ric was defined as the packet delay averaged over a 10 s period. Although this adaptive routing scheme worked correctly under light or moderate loads, it generated os-cillations under relatively heavy loads. This resulted in substituting the links’ metric by a fixed value as we use it today, sacrificing optimality for stability.) Indeed, for these adaptive and distributed algorithms, a trade-off be-tween adaptability (convergence speed) and stability must be found, which may be particularly difficult in situations where abrupt traffic changes occur.

Those who advocate the use of RR claim that there is actually no need to implement supposedly complicated and possibly oscillatory dynamic routing mechanisms, and that the incurred performance loss for using a single rout-ing configuration is negligible when compared with the in-crease in complexity. RR provides a stable routing

con-figuration for all the traffic demands within the uncer-tainty set, avoiding possible oscillations and convergence issues. However, RR presents some conception problems and serious shortcomings in its current state which we highlight and try to ease in this work. The first draw-back of current RR is related to the objective function it intends to minimize. Optimization under uncertainty is generally more complex than classical optimization, which forces the use of simpler optimization criteria such as max-imum link utilization (MLU). The MLU is not the most suitable network-wide optimization criterion; setting the focus too strictly on MLU often leads to worse distribu-tion of traffic, adversely affecting the mean network load and thus the total network end-to-end delay, an important QoS indicator. It is easy to see that the minimization of the MLU in a network topology with heterogeneous link capacities may lead to poor results as regards global net-work performance. The second drawback of RR we iden-tify is its inherent dependence on the definition of the un-certainty set of traffic demands: the unun-certainty set has to be sufficiently “large” to allow traffic flexibility and to pro-vide performance guarantees, but should not be excessively “large” to avoid wasting network resources. Thus, consid-ering a unique RR configuration to address both traffic in normal operation and unexpected traffic variations is an inefficient strategy, as a single routing configuration can-not be suitable for both situations.

1.1. Contributions of this article

This article presents a fair and comprehensive compar-ative analysis between RR and DLB mechanisms. The analysis is comprehensive as it evaluates the performance of both mechanisms based on different performance indi-cators and considering normal operation as well as unpre-dicted traffic events. We believe our comparison is fair because it considers the particular characteristics of each mechanism under the same network and traffic conditions. To date and to the best of our knowledge this is the first work that conducts such a comparative evaluation, neces-sary indeed not only from a research point of view but also for network operators who seek cost-effective and robust solutions to face future network scenarios. Based on this comparative analysis we develop and evaluate new variants of RR and DLB mechanisms, improving some of the short-comings found in both static and dynamic approaches.

Regarding the RR approach, we will introduce some modifications that strive to alleviate the two problems identified in current proposals. We will first study which is the best objective function to minimize, and propose the mean link utilization instead of the MLU. The mean link utilization provides a better image of network-wide per-formance, as it does not depend on the particular load or capacity of each single link in the network but on the aver-age value. However, a direct minimization of the mean link utilization does not assure a bounded MLU, which is not practical from an operational point of view. Thus, we min-imize the mean link utilization while bounding the MLU

(4)

by a certain utilization threshold a priori defined. This adds a new, and maybe difficult to set, constraint to the problem, namely how to define this utilization threshold. We further improve our proposal by providing a multiple objective optimization criterion, where both the MLU and the mean link utilization are minimized simultaneously. We evaluate the improvements of our proposals from a QoS perspective, using the mean path end-to-end queuing delay as a measure of global performance.

The second problem we address in RR is the trade-off between routing performance and routing reliability. In [13] we have recently proposed a solution to manage this trade-off, known as Reactive Robust Routing (RRR). Basically, RRR consists of constructing a RR configura-tion for expected traffic in nominal operaconfigura-tion, adapting this nominal routing configuration after the detection and localization of a large and long-lived traffic modification. RRR provides good performance for both nominal oper-ation and unexpected traffic, but it is difficult to deploy in a real implementation, because of the routing reconfig-uration step. Reconfiguring the routing of an entire Au-tonomous System is a nontrivial task. In this article we modify the RRR approach, using a preemptive Load Bal-ancing algorithm to balance traffic among pre-established paths after the localization of a large volume traffic modi-fication (preemptive in the sense of preventing a situation from occurring).

In what respects DLB, we evaluate the use of so-called no-regret algorithms as the distributed optimization algo-rithm used by ingress routers to adapt load-balancing. The authors of a recent paper [14] proved that if all OD pairs use algorithms of this kind, convergence to the optimum is guaranteed. Special attention will be paid on the behavior of the algorithm when faced with abrupt and unexpected changes in the traffic demands. We shall introduce simple, and yet effective, modifications to the algorithm to assure a fast convergence to the new optimum in this case.

As we shall see in the following subsection, several pre-viously proposed DLB algorithms strive to minimize the MLU by means of a greedy algorithm in the paths uti-lization (i.e. each ingress router increases the amount of traffic sent along the path with the smallest utilization). As proved in a recent paper [15], convergence to the op-timum for such algorithms may not be guaranteed, in the sense that they may converge to a situation in which the MLU is not minimized. In fact, we shall present an exam-ple in which the difference with the optimum of the MLU is non-negligible. In this work we will present an alterna-tive path cost function, so that greedy algorithms that use it do converge to the optimum equilibrium.

1.2. Related Work

There is a large literature on routing optimization with uncertain traffic demands. Thus, here we shall only men-tion a few papers, and do not expect our list to be exhaus-tive.

Traditional algorithms rely on a single or a small group of expected traffic demands to compute optimal and reli-able routing configurations. An extreme case is presented in [16], where routing is optimized for a single estimated traffic demand and is then applied for daily routing. Traf-fic uncertainty is characterized by multiple trafTraf-fic demands in [17] (set of traffic demands from previous day, same day of previous week, etc.), where different mechanisms to find optimal routes for the set are presented. As discussed for instance in [18], this perspective is no longer suitable for current and future dynamic scenarios. These approaches require a “leap of faith” to perform well, mainly because they assume that traffic patterns do not change that much over time. However, even a relatively small difference between the “real” traffic demand and the one used for the routing optimization may lead to an important per-formance degradation. Such a difference may arise in the event of unexpected traffic variations (which are more com-mon nowadays), or even be due to an error in the traffic estimation.

A different approach has emerged in the recent years to cope with the increasing traffic dynamism and the need for cost-effective solutions, Dynamic Load-Balancing (DLB) [9, 10, 11, 19]. In DLB, traffic is split among a priori es-tablished paths in order to avoid network congestion. The two most well-known proposals in this area are MATE and TeXCP. In MATE [9], a convex link congestion function is defined, which depends on the link capacity and the link load. The objective is to minimize the total network congestion, for which a simple gradient descent method is proposed. In [19], we propose to use a link congestion func-tion based on measurements of the queueing size, which results in better global performance from a QoS perspec-tive. TeXCP [10] proposes a somewhat simpler objective: in order to minimize the MLU, they minimize the biggest utilization each traffic demand obtains in its paths. An-other DLB scheme which has the same objective but a relatively different mechanism is REPLEX [11].

The last category of algorithms consists of Robust Rout-ing techniques [6, 7, 8, 20, 21]. The objective in RR is to find a unique static routing configuration that fulfills a certain criterion for a broad set of traffic demands, gener-ally the one that minimizes the maximum link utilization over the whole set of demands. In [6], authors capture traffic variations by introducing a polyhedral set of de-mands, which allows for easier and faster linear optimiza-tion. This robust technique is applied in [20] to compute a robust MPLS routing configuration without depending on traffic demand estimation, and corresponding meth-ods for robust OSPF optimization are discussed. Oblivi-ous Routing [7] also defines linear algorithms to optimize worst-case MLU for different sizes of traffic uncertainty sets. The author of [21] analyzes the use of robust rout-ing through a combination of traffic estimation techniques and its corresponding estimation error bounds, in order to shrink the set of traffic demands. In [8] authors intro-duce COPE, a RR mechanism that optimizes routing for

(5)

predicted demands and bounds worst-case MLU to ensure acceptable efficiency under unexpected traffic events. The idea behind COPE is similar to ours, in the sense that it strives to alleviate performance degradation due to un-foreseen traffic modifications. Nevertheless, it proposes a single routing configuration to handle expected as well as large and abrupt traffic variations, which is clearly not the best solution.

The same paper [8] presents, to the best of our knowl-edge, the only previous comparative study between RR and DLB. The authors of the paper compare the per-formance of COPE with a dynamic approach which they claim models the behavior of mechanisms such as MATE and TeXCP. Given a time series traffic demands, this dy-namic approach consists of computing an optimal routing for each traffic demand i and evaluate its performance with the following traffic demand i+1. There are two important shortcomings of this DLB simulation. Firstly, adaptation in DLB is iterative and never instantaneous. Secondly, in all DLB mechanisms paths are set a priori and remain un-changed during operation. This is not the case in their dynamic approach, where each new routing optimization may change not only traffic portions but paths themselves. For these reasons, we believe that the comparison provided in [8] is biased against dynamic schemes.

The remainder of this article is organized as follows. In Sec. 2 we introduce the network model and notation, while Sec. 3 and 4 introduce a preliminary version of the RR and DLB mechanisms. Some first results are discussed in Sec. 5. Section 6 presents new variants to the former mech-anisms which alleviate the shortcomings detected. The evaluation of the complete set of algorithms under differ-ent traffic scenarios is conducted in Sec. 7. We finally draw conclusions of this comparative analysis in Sec. 8.

2. Network Model and Performance Indicators Let us begin by introducing the notation used in this article. The network topology is defined by n nodes and a set L = {l1, . . . , lq} of q links, each with a

correspond-ing capacity ci, i = 1, . . . , q. The Traffic Matrix (TM)

X = {xi,j} denotes the traffic demand (expressed in, for

example, Mbps) between every origin node i and every destination node j (i 6= j) of the network; we shall note each of these origin-destination pairs as OD pairs, and each origin-destination traffic demand xi,jas OD flows. In

practice, each traffic demand is measured every T minutes (usually 5’ or 10’ [5]), and its value simply corresponds to the cumulative number of bytes observed between two consecutive measurements, divided by the polling time T . Let X = {xk} be the vector representation of the TM,

where we have reordered OD flows by index k = 1, . . . , m (m = n.(n − 1)). Let N = {OD1, . . . , ODm} be the set of

m OD pairs. We consider a multi-path network topology, where each OD flow xk can be arbitrarily split among a

set of pk origin-destinations paths Pk. In this sense, we

shall call rk

p the portion of traffic flow xk sent along path

p ∈ Pk, where 0 6 rkp 61 and

P

p∈Pkr k p = 1.

Let λp_l be an indicator variable that takes value 1 if path p traverses link l and 0 otherwise, and Y = {ρ1, . . . , ρq}

a vector representation of links traffic load. Then X and Y are related through the routing matrix R, a q × m ma-trix R = {rk l} where rkl = P p∈Pkλ p l. rkp. The variable rkl

indicates the fraction of OD flow xk routed along link l;

this results in the following relation:

Y = R . X (1)

Given X, the multi-path routing optimization problem consists in choosing the set of paths Pk for each OD pair k

and computing the routing matrix R, in order to optimize a certain objective function g(X, R). A simplified version of this problem is the load-balancing optimization problem which, given a set of paths, calculates R. In this work we shall consider different performance indicators, which result in different objective functions.

A very important link-level performance indicator is the link utilization ul = ρl/cl; a value of ul close to one

indicates that the link is operating near its capacity. Net-work operators usually prefer to keep links utilization rela-tively low in order to support sudden traffic increases and link/node failures. A network-wide performance indicator is the maximum link utilization umax:

umax(X, R) = max

l∈L {ul} (2)

The maximum link utilization constitutes by far the most popular TE objective function. However, its mini-mization presents a clear drawback: setting the focus too strictly on the most utilized link often leads to a worse distributions of traffic, adversely affecting the overall per-formance in the network. In this sense we will consider the mean link utilization umean as another possible objective

function: umean(X, R) = 1 q X l∈L ul (3)

Minimizing umeanmay provide better network-wide

per-formance, as long as the maximum link utilization remains bounded; we will further discuss this issue in Sec. 6.

The last performance indicator we shall consider in this work is the queuing delay (i.e. the time that spans between the moment a byte enters the router and leaves it). This choice is justified by two aspects. Firstly, its algebra is relatively simple, in the sense that the total delay of a path is the addition of the delay at each link. Secondly, it is a very versatile indicator. A big queueing delay means more delay and jitter for streaming traffic. Moreover, a link with an important queueing delay is traversed by several bottlenecked flows, meaning that elastic traffic may obtain better throughput in other, less loaded, links. The mean queueing delay may be regarded then as a numerical value of the congestion on the link.

(6)

Assume then that the queuing delay on link l is given by the function dl(ρl). Given this function we can compute

the queuing delay of path p as dp = Pl∈pdl(ρl). As a

measure of the network-wide performance, we consider the expected end-to-end (e2e) queuing delay dmeandefined as:

dmean(X, R) = X k∈N X p∈Pk rk p. xk dp= X l∈L ρl. dl(ρl) (4)

That is to say, dmean is a weighted mean e2e queuing

delay, where the weight for each path is how much traffic is sent along it (rk

p. xk), or in terms of links, the weight

for each link is how much traffic is traversing it (ρl). We

prefer a weighted mean queuing delay to a simple total delay because it reflects more precisely performance as perceived by traffic. Two situations where the total de-lay is the same, but in one of them most of the traffic is traversing heavily delayed links should not be consid-ered as equivalent. Note that, by Little’s law, the value fl(ρl) = ρl. dl(ρl) is proportional to the volume of data in

the queue of link l. We will then use this last value as the addend in the last sum in (4), since it is easier to measure than the queuing delay. Finally, note that, differently to umax, a large mean e2e queuing delay translates into bad

performance for the majority of the traffic and not only for the traffic that traverses a particularly loaded link.

Based on these definitions we will introduce the differ-ent optimization algorithms that strive to minimize some of these performance indicators, considering either the RR or DLB approach.

3. Stable Robust Routing

Finding a multi-path routing configuration minimizing umax is an instance of the classical multi-commodity flow

problem which can be formulated as a linear program [22]. For a single known traffic matrix X, the problem can be easily solved by linear programming techniques [23]. How-ever, as we have previously discussed, traffic demands are uncertain and difficult to predict, and all we can expect is to find them within some bounded uncertainty set.

In a robust perspective of the multi-path routing op-timization problem, demand uncertainty is taken into ac-count within the routing optimization, computing a single routing configuration for all demands within some tainty set. In this work we consider a polyhedral uncer-tainty set X, more precisely a polytope as in [6], based on the intersection of several half-spaces that result from linear constraints imposed to traffic demand.

As an example, let us define an uncertainty set X based on a given routing matrix Roand the peak-hour links

traf-fic load Ypeak_{obtained with this routing matrix:}

X = X ∈ Rm, Ro.X 6 Ypeak, X > 0

Observe that this definition of the uncertainty set has a major advantage: routing optimization can be performed

r1 o r2 o r3 o rio rq−1o rqo X X

Figure 1: The uncertainty set X as a polytope. minimize umax subject to: P k∈N P p∈Pk λp_l. rkp. x(k) 6 umax.cl ∀l ∈ L, ∀ X ∈ X P p∈P(k) rpk = 1 ∀k ∈ N rk p > 0 ∀p ∈ Pk, ∀ k ∈ N umax 6 1

Table 1: The Robust Routing Optimization Problem (RROP)

from easily available links traffic load Y without even know-ing the actual value of the traffic demand X. Figure 1 de-picts the obtained uncertainty set, based on the convex in-tersection of q half-spaces of the form ri

o· X 6 ρ peak

i , ∀i ∈ L,

where ri

ostands for the i-th row of the routing matrix Ro.

The traditional Robust Routing Optimization Problem (RROP) defined in Table 1 consists of minimizing the max-imum link utilization umax, considering all demands within

X. The solution to the problem is twofold: on the one hand, a routing configuration Rrobust, and on the other

hand, a worst-case performance threshold urobust max :

Rrobust = argmin R

max

X∈Xumax(X, R)

urobustmax = max

X∈Xumax(X, Rrobust)

Given a suitable definition of the uncertainty set, the obtained robust routing configuration Rrobust is applied

during long periods of time; in this sense, we refer to Ro-bust Routing as Stable RoRo-bust Routing (SRR). The au-thors of [6] have shown that the RROP can be efficiently solved by linear programming techniques, applying a com-bined columns and constraints generation method. This method iteratively solves the problem, progressively adding new constraints and new columns to the problem.

The new constraints are the extreme points of the un-certainty set X, and the new columns represent new paths added to reduce the objective function value. Only ex-treme points of X are added as new constraints, as it is easy to see that every traffic demand X ∈ X can be ex-pressed as a linear combination of these extreme demands. Regarding new added paths, the algorithm in [6] may not be the best choice from a practical point of view since the number of paths for each OD pair is not a priori restricted

(7)

and the characteristics of added paths are not controlled. For example, it would be interesting to have disjoint paths to route traffic from each single OD pair, improving re-silience. For this reason we modify the algorithm to select new paths, both limiting the maximum number of paths in Pk and taking as new candidates the shortest paths with

respect to link weights wi l: wil= 1 +1 − rk l i (5) where rk l i

corresponds to the fraction of traffic flow xk that

traverses link l after iteration i and is a small constant that avoids numerical problems. If OD pair k uses a single path p at iteration i, rk

l i

= 1 for every link l ∈ p, and so this path is removed from the graph where new shortest paths are computed (wl→ ∞, ∀l ∈ p). While this may result in

a sub-optimal performance, it allows a real and practical implementation. In case there are no disjoint paths for OD pair k, we use the column constraint generation method used in [6] to add new paths for OD pair k.

4. Dynamic Load Balancing

4.1. Routing Games and Wardrop Equilibrium

As mentioned before, the objective in DLB is to mini-mize a certain objective function g(X, R) in a distributed fashion (i.e. without relying on any centralized entity). Algorithms that achieve this are typically greedy, which present the desirable property of requiring minimum co-ordination among border routers. In this kind of mech-anisms, a path cost function φp is defined, and each OD

pair greedily minimizes the cost it obtains from each of its paths. This context constitutes an ideal case study for game theory, and is known as Routing Game in its termi-nology [24, 25].

Since each OD pair may arbitrarily balance traffic among its paths, we will assume that OD pairs are constituted of infinitely many agents. These agents control an infinitesi-mal amount of traffic, and decide along which path to send their traffic. In this context rk

p represents then the fraction

of agents of OD pair k that have p as their choice. If each of these agents acts selfishly, then the system will be at equilibrium when no agent may decrease its cost by uni-laterally changing its path decision. This situation consti-tutes what is known as a Wardrop Equilibrium (WE) [26], which is formally defined as follows:

Definition 1. The paths vector {rk

p}k∈N,p∈Pkis a Wardrop

Equilibrium if for each OD pair k ∈ N and for each couple of paths p, q ∈ Pk with rkp> 0 it holds that φp≤ φq.

Intuitively speaking, a WE is a situation where each OD pair uses only those paths with minimum cost (for the given OD pair). Anyway, the path cost φp is in turn

de-fined in terms of a certain nonnegative, nondecreasing and

continuous link cost function φl(ρl). There are roughly

two kinds of games depending on the definition of φp. A

Congestion Routing Game defines the path cost as φp =

P

l∈pφl(ρl). On the other hand, a Bottleneck Routing

Game defines φp= max l∈p φl(ρl).

Much effort has been put into characterizing the re-sulting equilibrium of these games. In this sense, a certain social cost function is defined, which measures the dis-satisfaction of the OD pairs as a whole (i.e. an optimum paths vector is one that minimizes this function), and the objective is to quantify the difference between the opti-mum and the resulting WE. In the case of a congestion game, the typical social cost function is the same as in (4) (i.e. P

l∈Lρldl(ρl) :=Pl∈Lfl(ρl)), whereas for a

bottle-neck game the social cost is usually the maximum φl(ρl)

over all links (i.e. max

l∈L φl(ρl)).

It may be proved that the WE of a congestion game co-incides with the unique minimum of the so-called potential function Φ(R) =P

l∈L

Rρl

0 φl(x)dx [24]. This means that if

fl(ρl) is continuous differentiable, non-decreasing and

con-vex, the WE of a congestion game with φl(ρl) = fl0(ρl) is

socially optimum. In this sense, to minimize dmeanthrough

DLB, we will play a Congestion Routing Game with a link cost equal to the derivative of the link mean queue size. In the sequel we shall note this game as MinDG (Minimum Delay Game).

On the other hand, characterization of the WE of a bottleneck game is somewhat more complicated. In fact, it is relatively easy to see that in this case the WE is not even unique. Moreover, and rather unfortunately, it has been proved in [15] that even if there always exist at least one WE that is socially optimum, nothing may be guar-anteed about the rest (if any). However, the same paper proved that every WE that fulfills the so-called efficiency condition is optimum, where this condition is defined as follows:

Definition 2. Let B(p) denote the number of network bottlenecks over p; that is to say B(p) =

l ∈ p : φl(ρl) =

max

m∈L,ρm>0{φm(ρm)}

. Then, a WE is said to satisfy the efficiency condition if all OD pairs route their traffic along paths with a minimum number of network bottlenecks; i.e. for all k ∈ N and p, q ∈ Pk with rpk > 0 it holds that

B(p) ≤ B(q).

This result, which is relatively new, was not applied in the design of neither TeXCP or REPLEX, both of which strive to minimize the maximum link utilization by means of a greedy algorithm in the path utilization (i.e. a bottle-neck game with φp = max

l∈p ul). It could then be the case

that these algorithms converge to a sub-optimal WE. Pos-sible consequences on the obtained performance of ignoring this result will be further discussed later in the article. In any case, we shall note this game as MinUG (Minimum Utilization Game).

(8)

4.2. No-Regret Algorithms

We will now briefly discuss how, given the path cost function φp, the WE may be achieved for both routing

games. In a recent article [14], the authors proved that if all OD pairs use no-regret algorithms, the global behavior will approach the WE. To be more precise, for a given TM X, and for most time steps, the instantaneous paths vector {rk

p}k∈N,p∈Pk is very close to the WE, and this difference

vanishes with time.

This result is very general, in the sense that it does not specify any algorithm in particular. Its only require-ment is the use of no-regret algorithms by all OD pairs (for an overview of some of them see [27]). In partic-ular, we will consider the Weighted Majority Algorithm (WMA)[28], which originated in the context of online learn-ing (more precisely from the online prediction uslearn-ing expert advice problem), and whose pseudo-code for OD pair k is described in Algorithm 1.

Algorithm 1Weighted Majority Algorithm (WMA)

1: for_t_{= 1, . . . , ∞ do}

2: Obtain path costs φp∀p ∈ Pk

3: for_{every path p ∈ P}_k do 4: if _φ_p_>_min q∈Pk φq then 5: rk p←β × rpk 6: end if 7: end for 8: Normalize the rk p 9: end for

At each iteration t, those paths whose cost is bigger than the minimum are punished by multiplying their re-spective rk

p by a certain constant β < 1 (throughout our

simulations we have used β = 0.95, a value that we empir-ically verified obtains very good results). Actually, and in order to avoid unnecessary changes in the traffic distribu-tion, we shall only update rk

p when the corresponding path

cost is bigger than the minimum cost plus a certain margin (in the case of MinUG we fixed the margin at 0.005, and for MinDG we used 5% of the minimum).

5. A Preliminary Comparison

In this section we shall present some first simulations that will help us to gain insight into the mechanisms and highlight some of their respective shortcomings. Before, we will discuss how we performed these and the rest of the simulations.

As the reference network we used Abilene, a high-speed Internet2 backbone network. Abilene consists of 12 router-level nodes connected by 30 optical links (we only consider intra-domain links). The used router-level network topol-ogy and traffic demands are available at [29]. Traffic data consists of 6 months of traffic matrices collected every 5 minutes via Netflow from the Abilene Observatory [30]. As measured traffic demands do not significantly load the

0.5 1 1.5 x 104 100 102 104 ρ_l (kB/s)

Mean Queue Size (kB)

Measurements Regression M/M/1

Figure 2: Mean queue size: measurements and approximations

network, we re-scaled them by multiplying all their entries by a constant. The dataset in [29] also provides the static routing configuration Ro deployed in Abilene during the

6-month long TMs measurement campaign.

In the case of MinDG, fl(ρl) is typically chosen based

on a simplistic model (e.g. the M/M/1 model which yields fl(ρl) = ρl/(cl− ρl)) [9]. In order to avoid such arbitrary

and unprecise choice, in [19] we proposed instead to learn this function (and its derivative) from measurements. Fig-ure 2 depicts the real mean queue size of an operational network link at Tokyo obtained from [31], together with the M/M/1 estimation fM/M/1

l (ρl) and the non-parametric

regression ˆfl(ρl). It is clear that f_lM/M/1(ρl) consistently

underestimates the real queue size value, while ˆfl(ρl)

pro-vides quite accurate results.

To be as fair as possible, all mechanisms use the same set of paths, namely those calculated by SRR as discussed in Sec. 3. The TMs are fed to the mechanisms in consecu-tive temporal order. Both DLB mechanisms (MinUG and MinDG) are initiated at arbitrary values of rk

p, which will

be updated as new link load measurements arrive. We have assumed that each OD pair receives these measurements every minute, meaning that for each new TM five updates of their corresponding rk

p values will be performed (recall

that TMs are collected every 5 min). Results are shown then for every minute. As a reference, we also computed the optimum values uopt

max and doptmean for every TM X of

the dataset.

In this example we consider a traffic scenario that pres-ents an abrupt and large volume increase due to an exter-nal routing modification. This corresponds to the TMs with indexes between 1050 and 1200 from dataset X23 in [29]. The evaluation starts with a normal low traffic load situation, but after the 100th minute one of the OD flows abruptly increases its traffic volume, loading the links it traverses until the end of the evaluation.

Regarding SRR, in what follows we shall use the term RROP as a reference to SRR, recalling that the robust routing optimization problem is the one described in Table 1. Based on the static routing matrix of Abilene Ro we

(9)

200 400 600 0 0.2 0.4 0.6 0.8 Time (min) umax Minimum MinUG MinDG RROP−HTL RROP−LTL 200 400 600 20 40 60 80 100 Time (min) dmean (kB) Minimum MinUG MinDG RROP−HTL RROP−LTL

(a) Maximum link utilization (b) Mean e2e queuing delay Figure 3: Maximum link utilization and mean end-to-end queuing delay. Traffic demand volume abruptly increases after the 100th minute.

Low Traffic Load period (LTL period, before the 100th minute) and the latter adapted to the High Traffic Load period (HTL period, after the 100th minute):

XLTL

= {X ∈ Rm, Ro.X 6 YLTL, X > 0}

XHTL

= {X ∈ Rm_{, R}

o.X 6 YHTL, X > 0}

We assume that traffic is known in advance in both defi-nitions, and take YLTL

and YHTL

as the maximum link load values observed during the LTL and HTL periods respec-tively. We compute two different robust routing configu-rations for both polytopes; RROP-LTL corresponds to the SRR configuration for polytope XLTL

, and RROP-HTL for polytope XHTL

. As we mentioned before, both RROP-LTL and RROP-HTL use the same set of paths, namely the paths obtained from Table 1 for polytope XLTL

. Solving RROP for a given set of paths consists of only adding new extreme points of polytope X (i.e., only new constraints are added).

Results are presented in Fig. 3, which depicts (a) the maximum link utilization umax and (b) the mean

end-to-end queuing delay dmeanduring the evaluation period. Let

us first focus the attention on the performance of RROP-HTL after the 100th minute. Despite achieving an almost optimal performance as regards umax(a relative difference

with respect to the optimum smaller than 4%), RROP-HTL obtains a queuing delay that constantly exceeds the optimum by almost 40% under a moderate network load. Such a difference may not be even acceptable from a QoS perspective, where end-to-end delays are even more im-portant than network congestion. As we will show later, this loss in performance is a direct consequence of the local criterion used in RROP.

A second interesting observation comes from the differ-ence between RROP-HTL and RROP-LTL performances before and after the abrupt traffic volume increase; Fig. 3(a) shows that, despite an almost negligible network load, RROP-LTL outperforms RROP-HTL by almost 50% of relative utilization during the LTL period, while the op-posite happens during the HTL period. The difference is not that big as regards delay before the 100th minute, but it becomes significant after the volume increase, where

RROP-LTL obtains a very bad performance. These results are somehow expected given the polytopes definition, and brings to light both the dependence of RROP on the un-certainty set definition and the inherent consequence of us-ing a sus-ingle static routus-ing configuration under large traffic variations. Let us highlight the fact that in this example we have considered that traffic was known in advance for the definition of both polytopes XLTL

and XHTL

. While traffic during the LTL period is easy to predict, the defini-tion of XHTL

in a real traffic scenario is a challenging task. We will come back to this issue in the following section.

Let us now discuss the results obtained by the dy-namic schemes. A first important observation is that they present an important overshoot, with an absolute differ-ence with the optimum uopt

max of approximately 40%.

Re-garding MinDG in particular, convergence after the anom-aly is very slow, taking more than 600 sec. However, it should be noted that when it eventually converges, it ob-tains a dmeanthat is very similar to the optimum. In terms

of umax, the difference with respect to the optimum is

ap-proximately 10%.

Special attention deserves the case of MinUG. After a shorter convergence time (approximately 100 sec.), the resulting value of umax is not the optimum. Let us

re-call that this kind of game (which models schemes such as REPLEX of TeXCP) is used to converge to a routing configuration that minimizes the maximum link utiliza-tion [10, 11]. However, in this case, the difference is more than 15%. Both of these problems will be further discussed in the following section.

6. Improving the Algorithms Performance

The simple evaluation conducted in the previous sec-tion shows some concepsec-tion drawbacks of the SRR and DLB algorithms presented in Sec. 3 and 4 respectively. In this section we shall explain the origin of these problems and present enhanced mechanisms to overcome them. 6.1. Improving Stable Robust Routing

6.1.1. Network-Wide Performance

As we showed in Fig. 3(b), the minimization of umax

leads to a distribution of traffic that results in an exces-sive end-to-end delay. Using the mean delay dmeanas the

objective function in RROP (cf. Table 1) would be an in-teresting approach to ease the problem; however, fl(ρl) is a

non-linear function and the optimization problem becomes too difficult to solve. As we previously said, optimization under uncertainty is more complex than classical optimiza-tion and simple optimizaoptimiza-tion criteria should be used. In this sense, we could use instead the mean link utilization umeanas the objective function.

The mean link utilization considers at the same time the load of every link in the network and not only the utilization of the most loaded link; as we will show in the results, such an objective function provides a better global

(10)

minimize umean subject to: P l∈L P k∈N P p∈Pk 1 clλ p l. r k p. x(k) 6 umean.q ∀X ∈ X P k∈N P p∈Pk λp_l. rk p. x(k) 6 uthresmax.cl ∀l ∈ L, ∀ X ∈ X P p∈P(k) rkp = 1 ∀k ∈ N rk p > 0 ∀p ∈ Pk, ∀ k ∈ N

Table 2: Robust Routing Mean Utilization Optimization Problem (RRMP)

performance as regards end-to-end delay. However, a di-rect minimization of umeandoes not assure a bounded

max-imum link utilization, which is not practical from an oper-ational point of view. In this sense, we propose to change the objective function in RROP by umean, while bounding

the maximum link utilization by a certain threshold uthresmax

defined a priori. The resulting problem, which we shall call the Robust Routing Mean Utilization Optimization Prob-lem (RRMP), is defined in Table 2.

RRMP is solved in the same way as RROP, using the same recursive algorithm proposed in [6]. Note that the difference between the two problems is only a new con-straint per each new traffic demand in X (in fact, for each extreme point of X). The drawback of RRMP is its depen-dence on the value of uthres

max, which directly influences the

routing performance as we will shortly see. An interest-ing choice for uthres

max would be to use the output of RROP,

namely urobust

max . To some extent this would result in a

sim-ilar routing solution but with better traffic balancing. An alternative approach is to minimize both the value of umax and umean at the same time, which constitutes

a problem of multi-objective optimization (MOO). MOO problems are generally more difficult to solve because tra-ditional single-objective optimization techniques cannot be directly applied. Nevertheless, the problem of finding all the Pareto-efficient solutions to a linear MOO problem is well known and different approaches can be used to treat the problem [32, 33]. In this work we consider an intuitive and easy approach to solve a MOO problem with standard single-objective optimization techniques. The approach consists in defining a single aggregated objective function (AOF) that combines both objective functions. We define a weighted linear combination of umax and umean as the

new objective function uaof = α . umax+ (1 − α) . umean,

where 0 6 α 6 1 is the combination fraction. Despite its simple form, this new objective is very effective and pro-vides accurate results for both performance indicators. We shall call this new optimization problem as Robust Routing AOF Optimization Problem (RRAP), defined in Table 3. As before, RRAP is solved with the same algorithms used in RROP.

minimize uaof= α . umax+ (1 − α) . umean

subject to: P l∈L P k∈N P p∈Pk 1 clλ p l. r k p. x(k) 6 umean.q ∀X ∈ X P k∈N P p∈Pk λp_l. rk p. x(k) 6 umax.cl ∀l ∈ L, ∀ X ∈ X P p∈P(k) rk p = 1 ∀k ∈ N rk p > 0 ∀p ∈ Pk, ∀ k ∈ N

Table 3: Robust Routing AOF Optimization Problem (RRAP)

6.1.2. Comparison between RRMP and RRAP

We will now evaluate both the RRMP and RRAP ver-sions of SRR in the traffic scenario previously considered in Sec. 5. In order to appreciate the dependence of RRMP on the maximum link utilization threshold uthres

max, two

dif-ferent thresholds are used in the evaluation: uthres max1 = 1

(which corresponds to the constraint umax 6 1 in Table

1), and uthres

max2 = urobustmax , where urobustmax is the output of

RROP-HTL in Sec. 5. In the case of RRAP, the weight α is set to 0.5, namely an even balance between umaxand

umean. This may impress as a somewhat naive approach

to the reader, but practice shows that this choice provides in fact very good results.

Figures 4 depicts the results in this case. Let us fo-cus our attention on the operation after the 100th minute, as all robust routing configurations use XHTL

as the un-certainty set. To be as fair as possible, both RRMP and RRAP use the same set of paths as those used by RROP in Fig. 3. The figure clearly shows that the performance of RRMP strongly depends on the threshold uthresmax. In the

case of uthres

max1, the attained maximum link utilization is

well beyond the optimal values, reaching almost a 70% of relative performance degradation. This overload directly translates into huge mean end-to-end queuing delays. Re-sults are quite impressive when considering the second threshold, both as regards umax and dmean. RRMP using

uthres

max2 provides a highly efficient robust routing

configura-tion, showing that it is possible to improve current imple-mentations of SRR with a slight modification of the objec-tive function. However, this dependence on the threshold uthres

max introduces a new tunable parameter, something

un-desirable when looking for solutions that simplify network management.

As regards RRAP, obtained results are slightly worse than those obtained by RRMP uthres

max2, but still very close

to the optimal performance, with a relative performance degradation of about 10% as regards umaxand dmeanwith

respect to an optimal routing configuration. Nevertheless, RRAP has no tunable parameter apart from the combina-tion factor α, which in fact is set to a half independently of the traffic situation.

(11)

100 250 400 550 700 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Time (min) um a x Minimum RRMP u_{max 1}thres RRMP u max 2 thres RRAP 100 250 400 550 700 40 60 80 100 Time (min) dm e a n (kB) Minimum RRMP u_{max 1}thresh RRMP u_{max 2}thresh RRAP

(a) Maximum link utilization (b) Mean e2e queuing delay Figure 4: Maximum link utilization and mean end-to-end queuing delay for RRMP and RRAP.

6.1.3. The Reactive Robust Routing

As we showed in Sec. 5, the definition of the uncer-tainty set has a major impact on the performance of SRR. In particular, we saw that using a single definition of un-certainty set under highly variable traffic cannot provide routing efficiency for both normal operation traffic and unexpected traffic events. Despite being one of its most important features, using a single SRR configuration is not the best strategy.

In [13] we proposed an adaptive version of SRR, known as the Reactive Robust Routing (RRR). The basic idea in RRR consists in computing a primal robust routing con-figuration Rrobusto for expected traffic variations in normal

operation within a primal polytope Xo. This polytope is

defined as in Sec. 3, based on a certain fixed routing config-uration Roand the expected links traffic load we shall call

Yo= {ρoi}. Additionally, a set of m anomaly polytopes Xj

are defined, and a preemptive robust routing configuration Rj_robust is computed for each of these anomaly polytopes.

Let us explain the concept of an anomaly polytope. In Fig. 3, the abrupt increase in traffic volume is caused by a single anomalous OD flow xk that unexpectedly carries a

many times bigger traffic load θ due to an external routing modification. After this exogenous unexpected event, the traffic demand X takes the value X’_{= X + θ.δ}

k, where

δ_k _{= (δ}_1,k_{, . . . , δ}_k,k_{, . . . , δ}_m,k₎T_{, δ}

i,k = 0 if i 6= k and

δk,k = 1. We shall designate this unexpected traffic

in-crease in OD flow xk as anomalous traffic event Ak. The

anomaly polytope Xk results from expanding the primal

polytope Xoin the directions of the links that traverses the

anomalous OD flow xk, with respect to Ro. The reader

should bear in mind that the kind of unexpected traffic events we deal with are independent of the intradomain routing; these events originate outside the network and propagate between origin-destination nodes. This justifies the relevance of the polytope expansion with respect to Ro.

The obtained polytope Xk is the smallest polytope that

contains the unexpected traffic demand X’ _{and thus, the}

corresponding robust routing configuration Rk

robust

pro-vides a relatively good performance under its occurrence. Figure 5 explains the idea of the multiple anomaly poly-tope expansion. As before, ri

o stands for the i-th row of

Xo X1 X2 Xj Xm rio rio rio r2o r2o rq−1o r1o rqo rqo r3o ci ci ci c2 c2 cq−1 c1 cq cq c3

Figure 5: Different anomaly polytopes for preemptive robust routing computation.

the routing matrix Ro.

Note that in a real scenario it is not possible to predict the size of the anomalous traffic θ. As a consequence, the primal polytope Xo is expanded to the limits of link

capacities, obtaining the following anomaly polytope for each anomalous traffic event Ak:

Xk=X ∈ Rm, Ro.X 6 YAk, X > 0 , ∀k ∈ N (6)

In (6), the i-th component of YAk _{takes the value ρ} oi if

ri,k

o = 0, or the value ci if ri,ko > 0, being ri,ko the element

(i, k) of Ro.

Given the primal and the m preemptive robust routing configurations Ro

robust and R j

robust, RRR uses an on-line

anomaly detection/localization sequential algorithm to de-tect the occurrence of an anomalous event Ak, switching

routing from Ro

robust to Rrobustk (and from Rkrobust back

to Ro

robust when normal operation is regained). We

re-fer the reader to [13] for additional details on the de-tection/localization algorithms and the implementation of RRR.

RRR can handle large and unexpected traffic variations in single OD flows quite effectively (the case of multiple si-multaneous anomalies is beyond the scope of RRR). How-ever, given the difficulty involved in modifying the routing configuration of a large scale network in an on-line fash-ion, the contributions of RRR are mainly theoretical. This problem can be solved by using a load balancing technique instead of a complete routing reconfiguration. In load bal-ancing, we keep the same set of paths Pk for each OD

pair k, and only modify the fractions of traffic sent along each path. Load balancing can be easily performed on-line and does not require any additional modifications in current path-based networks such as MPLS. We shall refer to the load balancing variant of RRR as Reactive Robust Load Balancing (RRLB), stressing the difference between routing reconfiguration and load balancing.

RRLB uses the same set of anomaly polytopes Xj

de-fined in RRR, but the computation of the m preemptive robust routing configurations R_robustj is slightly modified. The same set of paths Pk obtained during the computation

of Ro

robust is used in every R j

(12)

0 200 400 600 10−10 10−5 100 Time (min) rp k Figure 6: Evolution of rk

pfor the anomalous OD pair (MinDG)

routing configurations Rjrobust are obtained with a

simpli-fied version of the former optimization algorithm, where only new traffic demands are progressively added and no extra paths are created.

6.2. Improving Dynamic Load Balancing 6.2.1. Convergence Time

The DLB algorithms evaluated in Sec. 5 present an important overshoot and a significant settling time in the presence of sudden and large traffic variations. If the traf-fic anomaly is a perfect step, then the overshoot is un-avoidable. We will try to address the long settling time instead. The reason behind this problem is relatively sim-ple, as shown in Fig. 6. The graph depicts the evolution over time of the corresponding rk

p values of the anomalous

OD pair for MinDG in the example of Sec. 5. We may see that, although the rk

p change exponentially fast, at the

moment of the anomaly the values that should increase are so small that it takes them a very long time to converge. A possible solution is to impose a minimum value to all rk

p. However, this will affect the precision of the algorithm

and will still result in significant settling times. Actually, rk

p may be regarded as an indicator of the

performance of path p in the previous iterations. A very small rk

p means that p performed very badly with respect

to the rest of the paths in the past. However, when the anomaly occurs, conditions severely change and history is no longer as relevant. If we consider that we are in such situation, we could for instance completely ignore history and restart the game by setting rk

p = 1/|Pk| ∀k ∈ Pk.

Before deciding how to reassign rk

p, we will discuss how

an OD pair may decide if it should restart its game or not. Consider a situation where most of the traffic for OD pair k is routed along a path that is not the cheapest, and that the rk

p corresponding to the minimum-cost path is

very small. This could mean that although the former per-formed better in the past, this is no longer true and some traffic should be re-routed to the latter. This is more so as the difference in cost increases. However, this “suspicious” situation could be due to noisy measurements. To make sure that the game has actually changed and that it should be restarted, we will require such a situation to persists during a certain number of consecutive iterations. Once we detected that the game should be restarted, we will re-route some of the traffic that was being routed along the path with the biggest rk

p to the cheapest one. The

amount will be proportional to the relative difference in cost to avoid overreacting. Finally, remember that with

WMA fast adaptation is achieved when the rk

p are not too

small. The objective with this “game restart” is simply to move rk

p from critically small values. The algorithm will

then rapidly converge to the optimum. We now present the pseudo-code of the complete algorithm for OD pair k: Algorithm 2WMA with Restart (WMA-R)

1: for_t_{= 1, . . . , ∞ do}

2: Obtain path costs φp∀p ∈ Pk

3: Determine pmin= argmin p∈Pk

φpand pmax= argmax p∈Pk

rk p

4: if _(rk

pmin<0.1) and (φpmin+ φth< φpmax) then

5: nk e ←nke+ 1 6: else 7: nk e ←0 8: end if 9: if _nk e ≤nkththen

10: Perform a normal iteration of WMA (cf. Algorithm 1) 11: else 12: nke ←0 13: ∆r←min n_φ pmax φpmin −1, 1 o ×r k pmax−rpmink 2 14: rk pmax←r k pmax−∆r 15: rpkmin←r k pmin+ ∆r 16: end if 17: end for

The new variable nk

e counts the number of consecutive

occurrences of a “suspicious” situation (we used nk th= 3).

The threshold φthis to make sure that the difference in cost

between paths is significant. In particular, for MinUG we used φth = 0.005 and for MinDG φth = 0.2φpmin. Finally,

note that when the game is restarted, we re-route a cer-tain amount of traffic from pmax to pmin, but at most the

amount of traffic routed along each path is equalized. 6.2.2. Converging to the Social Optimum in Bottleneck

Games

In Sec. 5 we showed an example in which MinUG does not converge to the optimum, and obtains a difference of 15% with respect to the optimum MLU. The reason be-hind this poor performance is simply that MinUG does not take into account the result regarding the optimality of the WE and the efficiency condition discussed in Sec. 4.1 and originally presented in [15]. This result states that if at a WE all OD pairs send their traffic along paths with a minimum number of network bottleneck links (those with the maximum utilization in the whole network), the WE is optimal. The problem we analyze now is how to design a path cost function φpthat takes into account this

condi-tion, so that when using it, the load-balancing algorithm converges, when possible, to the correct WE. Note that the condition is only sufficient, meaning that a WE that fulfills the efficiency condition may not exist. A simple ex-ample of such case is a single OD pair with two paths with different lengths, where all links have the same capacity. Anyhow, the two main difficulties in the design of such path cost are the following. Firstly, the number of bottle-neck links in a path is an integer (thus not continuous on

(13)

rk

p). Secondly, the probability of two links having exactly

the same utilization is zero, and as such we should con-sider the number of links that have an utilization similar to the network bottleneck.

The objective is then to find a cost function that penal-izes paths in which several links have similar utilizations (and that this utilization is the maximum in all the net-work), and that it does not switch between values to avoid oscillations. A candidate φp that fulfills these two

condi-tions is the so-called log-sum-exp function. Consider a set of arbitrary numbers A = {ai}, the log-sum-exp function

g(A) is defined as follows:

g(A) = 1 γA log   X i=1,...,|A| eγAai  = ai∗ + 1 γA log  1 + X i=1,..,|A| ∧ i6=i∗ eγA(ai−ai∗)   (7)

Consider the special case in which ai∗ = max A. It

should be clear that if ai∗ is significantly bigger than the

rest of the elements in A, the above convex and non-decreasing function constitutes an excellent approxima-tion of ai∗. In fact, it easy to prove that a_i∗ ≤ g(A) ≤

ai∗+ log(|A|)/γ_A, meaning that we may control the

preci-sion of the approximation through the parameter γA(the

bigger this parameter, the more precise the resulting ap-proximation). Moreover, as more elements in A are sim-ilar to the maximum, g(A) approaches the upper bound, reaching it when all elements are the same.

We will then use the second term of (7) as a penalty to those paths with several links whose utilization is similar to umax (the maximum utilization in the network). More

precisely, given a path p, let Up = {ul}l∈p be the

utiliza-tions in the path, and l∗_{∈ p be the link with the biggest}

utilization in p. We will then use the penalty function with the alternative set U∗

p, which has the same elements as Up,

but substitutes ul∗ by u_max. This results in the following

cost function: φp= ul∗+ 1 γp log  1 + X l∈p ∧ l6=l∗ eγp(ul−umax)   (8)

Even if this new cost function penalizes paths with several network bottleneck links, it also penalizes longer paths, which was not our original objective. A good choice of γp will alleviate this side-effect. For instance, we used

γp= log(|p|)/ max{0.01, ul∗/10}. This way, we try to

min-imize the effect of log(|p|) and relativize the penalization to ul∗. The following section presents the results obtained

by this new path cost function, which we will still call MinUG.

7. Evaluation and Discussion

In this section we evaluate the performance of the dif-ferent RR and DLB algorithms presented in this work,

200 400 600 800 1000 1200 0.3 0.35 0.4 0.45 0.5 Time (min) umax Minimum MinUG MinDG RROP RRAP 200 400 600 800 1000 1200 40 50 60 70 80 90 Time (min) dmean (kB) Minimum MinUG MinDG RROP RRAP

(a) Maximum link utilization (b) Mean e2e queuing delay Figure 7: Maximum link utilization and mean end-to-end queuing delay under normal operation.

considering both normal operation and anomalous traffic situations. We present and discuss three simulation case-scenarios: starting from a normal traffic variation scenario, we increase the number of OD pairs that present anoma-lous traffic variations. This allows for performance com-parison at different levels of traffic variability. As both RRAP and RRMP provide similar results (when uthres

max is

correctly defined for RRMP), we will only consider the RRAP mechanism in the evaluation. Finally, we shall use RRLB-OP and RRLB-AP to designate the Reactive Rout-ing Load BalancRout-ing variants of RROP and RRAP respec-tively.

7.1. Normal Operation

The first case-scenario corresponds to traffic in nor-mal operation. The only variability is due to typical daily fluctuations. Figure 7 presents the evolution of umax and

dmeanfor RROP and RRAP, using a set of 260 TMs from

dataset X01 in [29] (specifically, those with indexes be-tween 420 and 680). All algorithms perform similarly as regards maximum link utilization, depicted in Fig. 7(a). This may be further appreciated in the boxplot summary presented in Fig. 8(a), where values are relative to those obtained with an optimal routing configuration. Note that the relative performance degradation is around 10% in most cases.

Figures 7(b) and 8(b) show that results are quite dif-ferent as regards mean queuing delay. We may verify that the best results are obtained by MinDG, followed closely by RRAP. However, both RROP and MinUG systemat-ically obtain a significant difference with respect to the optimum, generally between 30% and 40%. These results further highlight the limitations of RROP and MinUG as previously discussed: using umax as a performance

objec-tive results in a relaobjec-tively low maximum utilization, but neglects the rest of the links, impacting the network-wide performance.

7.2. One Anomalous OD Pair

The second case-scenario is the one considered in Sec. 5, where there is a sudden and abrupt increase of the traffic

(14)

RROP RRAP MinUG MinDG 1 1.05 1.1 1.15 1.2 1.25 1.3 umax (relative)

RROP RRAP MinUG MinDG

1 1.1 1.2 1.3 1.4 1.5 1.6 dmean (relative)

(a) Relative maximum link utilization (b) Relative e2e queuing delay Figure 8: Maximum link utilization and mean end-to-end queuing delay under normal operation, boxplot performance summary. De-picted results are relative to the optimal values.

volume carried by one OD flow. As a difference with re-spect to the evaluation in Fig. 3, where traffic was assumed known in advance, this case-scenario corresponds to a real situation where traffic anomalies can not be forecast.

Firstly, notice in Fig. 9 how the improvements dis-cussed in Sec. 6.2.1 for MinUG and MinDG result in a relatively smaller overshoot than before, but most impor-tantly the settling time has been significantly decreased (in the case of MinDG, from 600 min. to less than 50 min). Moreover, note how the modified cost function proposed in Sec. 6.2.2 results in MinUG converging to the socially optimum WE.

Regarding umax, both RRLB-OP and RRLB-AP

ob-tain similar results, with a relative performance degra-dation generally smaller than 15%. Note that while rel-atively important, this performance degradation is sur-prisingly small if we consider that traffic increases more than 500% in less than 10 minutes. The same may be said about MinDG, which obtains a degradation between 20% and 25%. In terms of dmean, MinUG and RRLB-AP

perform similarly. They both clearly outperform RRLB-OP, achieving a relative mean queuing delay almost 30% smaller. These results reinforces once again our observa-tions about the difficulty in RROP to attain global perfor-mance, and the advantages of using a simple network-wide objective function in a robust routing algorithm. More-over, they also illustrate the difference between MinUG and RROP. Even when MinUG was designed with the same objective than RROP (namely to minimize umax),

the fact that in MinUG each OD pair greedily minimizes the path utilization results in a different overall behavior. 7.3. Two Anomalous OD Pairs

In this case-scenario two OD pairs largely increase their traffic demand, one at approximately the 150th minute and the other at the 320th (they correspond to TMs with indexes between 160 and 280 from dataset X06 in [29]). They both present this anomalous traffic until the end of the simulation. We shall then separate the simulation in three parts: the first third where traffic is normal, the second third were only one OD pair is anomalous, and the last third were both OD pairs are anomalous. The

200 400 600 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Time (min) umax Minimum MinUG MinDG RRLB−OP RRLB−AP 200 400 600 30 40 50 60 70 80 90 100 Time (min) dmean (kB) Minimum MinUG MinDG RRLB−OP RRLB−AP

(a) Maximum link utilization (b) Mean e2e queuing delay Figure 9: Maximum link utilization and mean end-to-end queuing delay under one anomalous OD pair.

RRLB−OP RRLB−AP MinUG MinDG

1 1.05 1.1 1.15 1.2 1.25 1.3 umax (relative)

RRLB−OP RRLB−AP MinUG MinDG

1 1.1 1.2 1.3 1.4 1.5 dmean (relative)

(a) Relative maximum link utilization (b) Relative e2e queuing delay Figure 10: Maximum link utilization and mean end-to-end queuing delay under one anomalous OD pair, boxplot performance summary. Depicted results are relative to the optimal values.

anomaly localization algorithm of RRR was designed for the case of one single anomalous OD pair. Because of this, we will further illustrate the tradeoff between size of the considered uncertainty set and efficiency of the obtained routing, and chose the uncertainty polytope by the traffic loads seen after the second anomaly.

In Fig. 11(a) we may see that, as expected, the umax

obtained by both RROP and RRAP in the last third of the simulation are very close to the optimum. However, in the rest of the simulation the difference may be important, specially in the second part where the absolute difference for RRAP is almost 20%. It is important to highlight the results obtained by MinDG and MinUG. Notice that the overshoot this time is much smaller than before (a maximum of 0.1 in umaxfor MinDG) and the settling time

is negligible. In this case, the increase in traffic of the anomalous OD pairs is more gradual than before, which clearly favors dynamic schemes in their performance.

8. Conclusions and Future Work

From the study we presented in this article we may reach several conclusions. The most important is proba-bly that we have shown that using a single routing config-uration is not a cost-effective solution when traffic is rel-atively dynamic. Stable Robust Routing obtains a rather poor performance either when faced with non considered traffic demands (tight uncertainty sets) or when designed