An Algorithm for Rate Allocation in a Packet-Switching Network with Feedback

by

Anna Charny

B.S., Mathematics, Moscow Institute of Physics and Technology, Russia, 1979

M.S., Mathematics, Kalinin State University, Russia, 1985

Submitted to the Department of Electrical Engineering and Computer Science

in Partial Fulfillment of the Requirements for the Degrees of

Master of Science in Electrical Engineering and Computer Science

and

Electrical Engineer

at the

Massachusetts Institute of Technology

May 1994

The author hereby grants MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part.

Signature of Author: Anna Charny

Certified by: Dr. David D. Clark, MIT Thesis Supervisor

Certified by: Dr. Raj Jain, Digital Thesis Supervisor

Accepted by: Frederic R. Morgenthaler, Chair, Committee on Graduate Students

ABSTRACT

As the speed and complexity of computer networks evolve, sharing network resources becomes increasingly important. Thus, the issue of how to allocate the available bandwidth among the multitude of users needs to be addressed.

Such allocation needs to be in some sense efficient and fair to different users. In this work the so-called maxmin fairness is chosen as the optimality criterion.

A new distributed and asynchronous algorithm is suggested. The algorithm is shown to converge to the optimal rate allocation in a network with general topology under dynamic changes in the set of network users, individual user load and occasional route changes. An upper bound on convergence time is given. The algorithm is shown to be well-behaved in transience. Unlike previous work, the algorithm takes bandwidth consumed by feedback traffic into account. Further, an extension of the algorithm is suggested to address the problem of policing misbehaved users.

Acknowledgements

I would like to thank my thesis advisors, Dr. David Clark and Dr. Raj Jain for all the fruitful and stimulating discussions and for the guidance they provided throughout this work.

I would also like to thank Digital Equipment Corporation for providing my financial support during graduate school through the GEEP program.

I am grateful to Lisa Felice, my supervisor at Digital Equipment Corporation, for being so helpful and supportive.

I am also grateful to William Duane, my Technical Sponsor at Digital Equipment Corporation, for his help and his interest in my work.

I would like to thank the staff of the GEEP program at Digital Equipment Corporation for their help and efficiency.

I thank my office mates Oumar Ndiaye and Tim Shepard for being so nice and friendly.

I also thank Tim Shepard and Chris Lefelhocz for their help and patience in answering my many questions.

Contents

1 Introduction
    1.1 Background
    1.2 Route Selection
    1.3 Network Model
    1.4 Optimality Criterion
    1.5 Service Discipline
    1.6 Previous Work and Summary of Results
    1.7 Outline

2 Optimality Criterion
    2.1 Definition of the MAXMIN Optimum
    2.2 Finding Globally Optimal Rates

3 Distributed Algorithm
    3.1 Assumptions and Goals
    3.2 High-Level Description
    3.3 Algorithm Description
        3.3.1 Data Structures
        3.3.2 Source Operation
        3.3.3 Destination Operation
        3.3.4 Switch Operation
        3.3.5 Output Link Operation

4 Convergence Theorem

5 Transient Behavior

6 Simulation Results
    6.1 Experiment 1
    6.2 Experiment 2
    6.3 Experiment 3
    6.4 Experiment 4
    6.5 Experiment 5

7 Discussion
    7.1 Remarks on M-consistency
    7.2 Comments on the usage of the "Stamped Rate" Field in the Packet Header
    7.3 Policing Misbehaved Sessions
    7.4 Network with Full-Duplex Links

8 Summary and Areas for Future Research

9 Appendix 1

10 Appendix 2

11 Appendix 3

1 Introduction

This section discusses design decisions adopted in this work, describes the existing results for the chosen model, summarizes the main results of this work and finally gives a brief layout of the remaining sections.

1.1 Background

There has been extensive debate in the literature about the relative merits and drawbacks of open-loop control schemes versus closed-loop control schemes.

The large ratio of propagation delay to packet transmission time in modern high-speed networks poses a significant challenge for any end-to-end feedback scheme [21], [24], [25]. As a result, a number of open-loop alternatives like prior reservation and switch-based controls have been suggested.

Prior reservation schemes are generally considered to be suitable for steady stream-like traffic with a priori known resource requirements. Reservation also provides quality of service guarantees that are difficult to achieve with walk-in service. The price for this, however, is the lack of flexibility in the presence of dynamic changes in the network load, leading to potential waste of precious network resources.

Switch-based controls have been shown to be necessary for achieving fairness [27]. However, if no source-based control is exercised, the sources may continue to inject excessive traffic into the network, causing more overload and wasting network resources.

The approach adopted in this work is based on the cooperation between the sources and the network in sustaining an acceptable network load from the standpoint of fairness and efficiency. This approach is similar to that of [23] and [26].

The problem of load allocation is twofold: the sources must be able to determine their optimal load, and the network must ensure that even if all sources operate at their optimal rates, these rates are enforced across the network. The mechanism of such enforcement strongly depends on the shape of source traffic, the particular flow control mechanism, and the service discipline of all switches in the network. There is a vast amount of work addressing the issue of preserving feasible user rates under various assumptions on the underlying service discipline and the shape of source traffic (see for example [3], [11], [12], [14], [25]).

This thesis addresses the first problem, i.e. how to determine the set of optimal rates in a distributed network under dynamic changes in the absence of centralized knowledge about the network and without synchronization of different network components.

We consider a system in which switches maintain their own controls.

Switches communicate these controls to the source by feedback. We consider an end-to-end feedback scheme, in which the destination generates feedback packets which deliver the aggregate feedback signal from all switches on the packet's route back to the source. Upon receipt of the feedback signal, the source adjusts its load accordingly. The details of the algorithm are discussed later in this work.

We show that the algorithm is very general in nature and is applicable to a broad range of service disciplines and underlying traffic shapes. This flexibility is largely due to our choice to decouple the problems of determining the optimal rates and enforcing them.

1.2 Route Selection

It is assumed that at any time of the algorithm operation the route of each session is unique. We allow the route to change from time to time (for example in response to equipment failures or due to some routing decisions), but we disallow existence of more than one route at any given time. We assume that the changes in the route do not occur very often, and that in the absence of network failures the routes will eventually stabilize for a given set of network users.

The obvious argument against this approach is that the best load allocation for a particular choice of session routes may not be the best over all possible route choices. However, even if session routes are unique, the problem of optimal rate allocation is non-trivial and deserves proper attention. In addition, note that the algorithm is shown to be robust in the presence of dynamic route changes. Thus, it can be run in conjunction with any independent routing algorithm which will eventually stabilize to some route. As soon as the route is found, our algorithm will recover from any past changes and will converge to the optimal rates for this route.

1.3 Network Model

In the real world endnodes are interconnected through a complex network of switches. There can be many users physically located at one network node, each of them conducting perhaps several communication sessions with other users in the network. Some of those sessions can be bi-directional, like a conversation; others can be uni-directional, like file transfer.

For our purposes, we assume that all sessions are independent. Moreover, we simply treat one user conducting several sessions as several independent users. Similarly, we treat any bi-directional data exchange as two independent uni-directional ones.

We assume that any two connected nodes in the network are connected by a pair of half-duplex links of identical capacity pointing in the opposite directions. In general different link pairs have different capacities.

It is assumed that each enduser is connected to exactly one switch.

Endusers are not connected directly to each other. A switch can be connected to zero or more endusers of any type and to zero or more other switches. It is assumed that there is a path going through one or more switches from any source to its respective destination.

It is convenient to make a distinction between the `entry' links into the network and all other links. The `entry' links are in essence artificially created in the model to separate different users located at one network node.

While capacities of all other links are real physical restrictions, capacities of the `entry' links can be chosen as we please as long as they do not impose additional restrictions on session flows. It will be seen later that it is convenient to consider these capacities to be equal to the session demand. Thus we allow these capacities to be infinite if session demand is infinite. Capacities of all other links are assumed finite.

Similarly, the model creates an artificial switch per endnode located at the entry into the network. This switch has two `real' half-duplex links connecting it to the network and 2m artificial half-duplex links connecting it to m endusers located at the real-life endnode.

1.4 Optimality Criterion

The goal of this work is to determine a fair and efficient rate allocation. The precise meaning of the terms `efficient' and `fair' has been a target of extensive debate in the last two decades. References [1], [21], [24], [8], [7] contain a variety of approaches and definitions of fairness and efficiency.

The approach adopted in this work chooses the so-called maxmin or bottleneck optimality criterion discussed in various modifications in [1], [12], [17], [23], [26].

This approach is based on the following intuition.

Consider a network with given link capacities, the set of sessions and fixed session routes. We are interested in such rate allocations that are feasible in the sense that the total throughput of all sessions crossing any link does not exceed the link's capacity. We would like the feasible rate allocation to be fair to all sessions. On the other hand we want the network to be utilized as much as possible.

We now define a fair allocation in the following way. We consider all "bottleneck" links, i.e. the links with the smallest capacity available per session. We give a strict definition in section 2. We share the capacity of these links equally between all sessions crossing them. Then we remove these sessions from the network and reduce all link capacities by the bandwidth consumed by the removed sessions. We now identify the "next level" bottleneck links of the reduced network and repeat the procedure. We thus continue until all sessions are assigned their rates.

Such a rate vector is known as the maxmin fair allocation. The above global synchronized procedure for achieving maxmin optimal rates is well known and is described for instance in [1], [23].

It can be easily seen that the rate allocation obtained in such a way is fair in the sense that all sessions constrained by a particular bottleneck get an equal share of this bottleneck capacity. It is also efficient in the sense that given the fair allocation, no more data can be pushed through the network, since each session crosses at least one fully saturated link.
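The global procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: feedback traffic is ignored, every session crosses at least one link, and the names `maxmin_rates`, `capacity` and `routes` are invented for this sketch rather than taken from the thesis.

```python
def maxmin_rates(capacity, routes):
    """Progressive filling without feedback traffic (illustrative sketch).

    capacity: {link: capacity}
    routes:   {session: list of links the session crosses}
    Returns {session: assigned rate}.
    """
    remaining = dict(capacity)      # capacity still unallocated on each link
    active = set(routes)            # sessions that have no rate yet
    rates = {}
    while active:
        # Equal share of each link among the unassigned sessions crossing it.
        share = {}
        for link, cap in remaining.items():
            n = sum(1 for s in active if link in routes[s])
            if n:
                share[link] = cap / n
        level = min(share.values())
        bottlenecks = {l for l, v in share.items() if v <= level + 1e-12}
        # Every session crossing a bottleneck link receives the bottleneck share.
        fixed = {s for s in active if bottlenecks & set(routes[s])}
        for s in fixed:
            rates[s] = level
            for link in routes[s]:
                remaining[link] -= level
        active -= fixed
    return rates
```

For instance, if link A of capacity 10 is shared by sessions s1 and s2, and link B of capacity 4 is shared by s2 and s3, then B is the first bottleneck (share 2), after which s1 receives the 8 units left on A.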

Assuming that packets are infinitely small and the flows are deterministic, it can be seen that maxmin fairness implies maximum efficiency in the sense that the bottleneck resource is utilized up to its capacity and no queues build up.

It is well known, however, that for packets of finite size and the general distribution of packet arrival and service times, utilizing the link to its full capacity leads to infinite queue growth and causes severe performance degradation. Thus, for general distribution of arrival and service times, utilizing the bottleneck to its full capacity is not good for efficiency. Thus, in this case a different efficiency criterion is called for.

Reference [26] introduces the optimal efficiency criterion for a general network configuration as the maximum power of the bottleneck resource, where

$$\text{Resource Power} = \frac{\text{Bottleneck Resource Throughput}}{\text{Bottleneck Resource Response Time}}$$

The resource capacity at which the power is maximized is called the knee capacity. In general, the knee capacity depends on the particular distribution of the packet arrival times and service discipline.
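As a worked illustration of the knee, assume (purely for this example; the thesis makes no such assumption) that the bottleneck behaves like an M/M/1 queue with service rate $c$ and throughput $x < c$. The mean response time is then $1/(c - x)$, and

$$P(x) = \frac{x}{1/(c - x)} = x\,(c - x), \qquad \frac{dP}{dx} = c - 2x = 0 \;\Rightarrow\; x = \frac{c}{2}.$$

Under this assumption the power is maximized at half the raw capacity; other arrival processes and service disciplines generally give a different knee.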

However, if the knee capacity is known, then applying the global procedure for determining maxmin fair rates described above to the network with the knee capacities replacing the original capacities, we can obtain the rate allocation which is fair in the sense that the bottleneck resources are still shared equally among their users and efficient in the sense that the bottleneck resource power is maximized.

In summary, provided the knee capacities are known, we can use maxmin optimality on the network with knee capacities for both efficiency and fairness.

In practice, the knee capacities are not known a priori. As a result, either an a priori estimate is required, or an independent algorithm for congestion detection must operate in parallel to provide this estimate "on the fly" [26]. Another approach might be to combine the two by choosing some conservative estimate of the knee capacity and then attempting to adjust it if the link detects that it is constantly underutilized.

For the purposes of this work we assume that the knee capacities are known. Moreover, we will use the word "capacity" to mean the "knee capacity" unless otherwise indicated.

1.5 Service Discipline

The only assumption we make about the service discipline employed by the switch is that the packets of each session are served in FIFO order. Thus, the switches could be strict FIFO, FIFO+, Priority, Stop-and-Go, Fair-Queuing, etc. We emphasize that the reason for such flexibility is that the algorithm presented in this work is a calculation algorithm and is not concerned with enforcement of the rates. Such enforcement is strongly dependent on the service discipline.

In addition, we allow the switches to drop packets as they please, as long as at least some packets of each session continue to get through. While dropping packets can cause a lot of wasteful retransmissions, it is essential to note that our algorithm will still calculate correct optimal rates even in the presence of heavy packet loss. This property seems very important, since it means that the algorithm is robust in the presence of data loss due to heavy temporary congestion.

1.6 Previous Work and Summary of Results

The procedure for achieving maxmin optimal rates described earlier used global information, which is expensive and difficult to maintain in real-world networks.

Several feedback schemes have been proposed to achieve the same goal in a distributed network. In essence, all these schemes maintain some link controls at the switch level and convey some information about these controls to the source by means of feedback. Upon receipt of the feedback signal the source adjusts its estimate of the allowed transmission rate according to some rule.

These algorithms essentially differ in the particular choices of link controls and the type of feedback provided to the source by the network.

References [6], [15], [17] describe distributed algorithms of this type. However, these algorithms required synchronization, which is difficult to achieve.

Mosley in [23] suggested an asynchronous algorithm for distributed calculation of maxmin fair rates. The algorithm was shown to converge to maxmin optimal rates. However, its convergence was rather slow and simulations showed poor adaptation to dynamic changes in the network.

Later Ramakrishnan, Jain and Chiu in [26] suggested a distributed asynchronous algorithm for achieving maxmin optimal rates which uses a different type of feedback. The switches still calculate a fair rate allocation for all sessions crossing their outgoing links, but this allocation is not explicitly communicated to the source. Instead, a bit is set in the packet's header if its current flow across the link exceeds the current value of the link's fair allocation. When the source receives packets with the bit set, it decreases its rate, otherwise it increases it. The algorithm has an attractive property of using just one bit in the packet header for feedback. It has been extensively tested in a variety of real-life network configurations and has been demonstrated to be fair and efficient even under dynamic network changes. However, while the simulation results are extremely favorable, no theoretical guarantees on the algorithm convergence to an optimal operating point in a general network topology are available. Moreover, since the optimal rates are not provided to the sources, the algorithm produces oscillations around the optimal rate and it may take a long time to get close to the optimal solution.

The approach adopted in this work requires explicit calculation of the optimal rates. It defines a family of link control calculation policies and a feedback mechanism which ensure convergence to maxmin optimal rates from any initial conditions. An algorithm employing any of these policies is shown to be self-stabilizing in the sense that it recovers from any past errors, changes in the set of network users, individual session demands, and session routes.

It is demonstrated that convergence of the algorithm is generally faster than that of the algorithms described earlier in this section. An upper bound on convergence time is provided.

In addition, it is shown that the algorithm is 'well-behaved' in transience. In particular, it is shown that given an upper bound on round-trip delay, the actual transmission rates can be kept feasible throughout the transient stages of algorithm operation while still providing reasonable throughput to all users.

These qualities are extremely important in a dynamic network where changes in user load caused by newly arrived sessions can cause infeasibility which must be quickly taken care of to avoid large queue buildup and performance degradation.

We also suggest a mechanism for policing misbehaved users.

In addition, unlike previous work, we take into account the bandwidth consumed by feedback traffic.

Simulation results demonstrate that the algorithm works well under dynamic changes in the network load.

1.7 Outline

Section 2 contains the formal definition of the optimality criterion and provides a global procedure for determining optimal rates in the presence of real feedback traffic.

Section 3 contains the description of the distributed algorithm.

Section 4 gives the convergence theorem.

Section 5 discusses the transient behavior of the algorithm and provides an upper bound on convergence time.

Section 6 gives the results of several simulation experiments.

Section 7 contains a discussion of some related issues and suggests an extension of the algorithm to policing misbehaved users.

Section 8 summarizes the results and gives some suggestions for future research.

2 Optimality Criterion

2.1 Definition of the MAXMIN Optimum

It seems natural to consider only static rate allocations as possible candidates for an optimal allocation. Once such an optimal allocation is defined, our goal can be formulated as finding an algorithm to dynamically control an arbitrary rate allocation to bring it as close to the static optimum as possible.

We start with defining a feasible set of rate allocations as follows:

$$\lambda_i \ge 0 \tag{1}$$

$$\sum_{i \in \mathcal{G}_j} (u_{i,j} + k\,w_{i,j})\,\lambda_i \le C_j \tag{2}$$

where $\lambda_i$ is the transmission rate of session $i$, $u_{i,j} = 1$ if session $i$ crosses link $j$ on its forward route and 0 otherwise, $w_{i,j} = 1$ if session $i$ crosses link $j$ on its feedback route and 0 otherwise, $C_j$ is the capacity of link $j$, and $\mathcal{G}_j$ is the set of all sessions crossing link $j$.

(1) simply states that we are not interested in negative transmission rates, while (2) ensures that a rate allocation is such that no link capacity is exceeded.
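Restrictions (1) and (2) translate directly into a feasibility test. The following Python sketch is illustrative only: the function and parameter names are assumptions of this example, routes are given as sets of links, and `k` is simply the coefficient that multiplies the feedback indicator in (2), read here as the bandwidth of a session's feedback flow relative to its forward flow.

```python
def is_feasible(rates, capacity, forward, feedback, k):
    """Check restrictions (1) and (2) for a candidate rate allocation.

    rates:    {session: transmission rate}
    capacity: {link: C_j}
    forward:  {session: set of links on its forward route}
    feedback: {session: set of links on its feedback route}
    k:        coefficient of the feedback term in (2)
    """
    if any(r < 0 for r in rates.values()):
        return False                      # violates (1)
    for link, cap in capacity.items():
        # Left-hand side of (2): forward load plus k times feedback load.
        load = sum(
            ((link in forward[s]) + k * (link in feedback[s])) * r
            for s, r in rates.items()
        )
        if load > cap + 1e-9:             # violates (2); small float tolerance
            return False
    return True
```

With two links of capacity 10, two sessions running in opposite directions and k = 0.1, rates of 9 each are feasible (load 9.9 per link), while rates of 9.5 each are not (load 10.45).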

Now we can define the optimality criterion on this feasible set as follows. We need the following definition first.

Definition 2.1

Consider a vector $a = (a_1, \ldots, a_n)$. Let $\hat{a} = (\hat{a}_1, \ldots, \hat{a}_n)$ be a permutation of $a$ such that $\hat{a}_i \le \hat{a}_j$ if $i < j$. Vector $b$ is said to be lexicographically greater than $a$ if either $\hat{a}_1 < \hat{b}_1$, or there exists $j$, $1 < j \le n$, such that $\hat{a}_i = \hat{b}_i$ for all $1 \le i < j$ and $\hat{a}_j < \hat{b}_j$.

Now we define the maxmin optimal vector of transmission rates by

Definition 2.2

Vector $\lambda = (\lambda_1, \ldots, \lambda_S)$ is called maxmin optimal for network $\mathcal{N}$ if

1. it satisfies restrictions (1) and (2);

2. it is lexicographically greater than any other feasible solution of (1) and (2).

It can easily be seen that this definition in fact means that the optimal vector is such that its smallest component is maximized over all feasible vectors, then, given the value of the smallest component, the next smallest component is maximized, etc.
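The ordering of Definition 2.1 amounts to comparing the two vectors after sorting each in increasing order. A minimal Python sketch (the function name `lex_greater` is an assumption of this example) relies on the fact that Python compares lists of equal length element by element from the left:

```python
def lex_greater(b, a):
    """True if vector b is lexicographically greater than vector a
    in the sense of Definition 2.1 (both vectors have the same length)."""
    # Sort each vector in increasing order, then compare from the smallest
    # component upward; the first differing component decides.
    return sorted(b) > sorted(a)
```

For example, (2, 2, 2) is greater than (1, 5, 1) in this sense, although the latter has the larger sum: the maxmin criterion favors raising the smallest rate over raising the total.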

The next section describes a global procedure to obtain maxmin optimal rates for a network with feedback traffic.

2.2 Finding Globally Optimal Rates

In this section we give a way to find the stationary optimal vector given global information about the network. The results here are quite similar to those given in [1], [23], [15], [26]. However, this work considers a somewhat different model than the cited authors, since our model accounts for the bandwidth consumed by feedback flows. Note that it is not clear a priori whether it is legitimate to treat feedback sessions in the same way as independent forward sessions, since their rates cannot be chosen independently from their corresponding forward sessions.

For the sake of simplicity we consider the case of "greedy" sessions, i.e. sessions with infinitely large demands. Note, however, that the case of finite demands can be reduced to the "greedy" case by simply adding artificial links of capacity equal to the session demand at the entry of each session to the network.

We start with the following definition.

Definition 2.3

Link $l$ is called a bottleneck with respect to network $\mathcal{N}(\mathcal{L}, \mathcal{S})$ if

$$\frac{C_l}{f_l + k b_l} = \min_{j \in \mathcal{L}} \frac{C_j}{f_j + k b_j}$$

where $f_j$ and $b_j$ are the numbers of sessions crossing link $j$ on their forward and feedback paths respectively.

Note that this is slightly different from the traditional definition of a bottleneck link. In our definition, we allocate a link's capacity between the sessions (forward and feedback) sharing this link in such a way that each session is allocated $C_l / (f_l + k b_l)$ on its forward way.

Optimal stationary rates can now be found by the following procedure.

We find all bottleneck links of the network, set the transmission rates of all the sessions crossing these links in either direction to $C_l / (f_l + k b_l)$, and mark those sessions. Then we decrease the capacities of all links by the total capacity consumed by the marked sessions crossing these links on their forward or feedback paths. We consider a reduced network with all link capacities adjusted as above and with marked sessions removed. We repeat the procedure until all sessions are marked.

This procedure can be formalized as follows.

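PROCEDURE GLOBAL OPTIMUM

Given network $\mathcal{N}(\mathcal{L}, \mathcal{S})$

START:

Denote:

$\hat{\mathcal{L}}^1$ - set of all links $l \in \mathcal{L}$ s.t. at least one session of $\mathcal{S}$ crosses $l$ on its forward or feedback path

$\mathcal{L}^1$ - set of all links $l \in \hat{\mathcal{L}}^1$ s.t. $\frac{C_l}{f_l + k b_l} = \min_{j \in \hat{\mathcal{L}}^1} \frac{C_j}{f_j + k b_j}$

$\lambda^1 = \frac{C_j}{f_j + k b_j}$ for any $j \in \mathcal{L}^1$

$\mathcal{S}^1$ - set of sessions crossing at least one link of $\mathcal{L}^1$

$f_l^1$ - number of sessions of $\mathcal{S}^1$ crossing link $l$ on forward path

$b_l^1$ - number of sessions of $\mathcal{S}^1$ crossing link $l$ on feedback path

ITERATION $i$:

Given:

$\mathcal{S}^1, \ldots, \mathcal{S}^{i-1}$; $\mathcal{L}^1, \ldots, \mathcal{L}^{i-1}$; $b_l^1, \ldots, b_l^{i-1}$; $f_l^1, \ldots, f_l^{i-1}$; $\lambda^1, \ldots, \lambda^{i-1}$

Define:

$\tilde{\mathcal{S}}^{i-1} = \mathcal{S}^1 \cup \ldots \cup \mathcal{S}^{i-1}$

$\tilde{\mathcal{L}}^{i-1} = \mathcal{L}^1 \cup \ldots \cup \mathcal{L}^{i-1}$

$\hat{\mathcal{L}}^i$ - set of all links $l \in \mathcal{L} \setminus \tilde{\mathcal{L}}^{i-1}$ s.t. at least one session of $\mathcal{S} \setminus \tilde{\mathcal{S}}^{i-1}$ crosses this link on its forward or feedback path

$\mathcal{L}^i$ - set of all links $l \in \hat{\mathcal{L}}^i$ s.t.

$$\frac{C_l - \sum_{j=1}^{i-1} \lambda^j (f_l^j + k b_l^j)}{f_l + k b_l - \sum_{j=1}^{i-1} (f_l^j + k b_l^j)} = \min_{q \in \hat{\mathcal{L}}^i} \frac{C_q - \sum_{j=1}^{i-1} \lambda^j (f_q^j + k b_q^j)}{f_q + k b_q - \sum_{j=1}^{i-1} (f_q^j + k b_q^j)}$$

$\mathcal{S}^i$ - set of sessions of $\mathcal{S} \setminus \tilde{\mathcal{S}}^{i-1}$ crossing at least one link of $\mathcal{L}^i$

$$\lambda^i = \frac{C_l - \sum_{j=1}^{i-1} \lambda^j (f_l^j + k b_l^j)}{f_l + k b_l - \sum_{j=1}^{i-1} (f_l^j + k b_l^j)} \quad \forall\, l \in \mathcal{L}^i$$

$\tilde{\mathcal{S}}^i = \mathcal{S}^i \cup \tilde{\mathcal{S}}^{i-1}$

$\tilde{\mathcal{L}}^i = \mathcal{L}^i \cup \tilde{\mathcal{L}}^{i-1}$

$f_l^i$ - number of sessions of $\mathcal{S}^i$ crossing link $l$ on forward path

$b_l^i$ - number of sessions of $\mathcal{S}^i$ crossing link $l$ on feedback path

If $\mathcal{S} = \tilde{\mathcal{S}}^i$, then STOP. Else perform iteration $i + 1$.

END of GLOBAL OPTIMUM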
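Procedure Global Optimum can be emulated compactly by tracking, for each link, the capacity not yet consumed by marked sessions instead of carrying the sums over earlier iterations explicitly. The Python sketch below is an illustration under stated assumptions, not the thesis's own code: the names are invented, routes are sets of links, every session has at least one forward link, and a feedback flow is taken to consume `k` times its session's forward rate.

```python
def global_optimum(capacity, forward, feedback, k):
    """Emulate Procedure Global Optimum (illustrative sketch).

    capacity: {link: C_l}
    forward:  {session: set of links on its forward path}
    feedback: {session: set of links on its feedback path}
    k:        weight of feedback traffic relative to forward traffic
    Returns {session: rate}.
    """
    remaining = dict(capacity)   # C_l minus capacity used by marked sessions
    active = set(forward)        # sessions not yet marked
    rates = {}
    while active:
        # Share of each link per unmarked session: remaining / (f + k*b),
        # counting only the unmarked forward (f) and feedback (b) sessions.
        share = {}
        for link, cap in remaining.items():
            f = sum(1 for s in active if link in forward[s])
            b = sum(1 for s in active if link in feedback[s])
            if f + k * b > 0:
                share[link] = cap / (f + k * b)
        level = min(share.values())
        bottlenecks = {l for l, v in share.items() if v <= level * (1 + 1e-12)}
        # Mark every session crossing a bottleneck in either direction.
        marked = {s for s in active
                  if bottlenecks & (forward[s] | feedback[s])}
        for s in marked:
            rates[s] = level
            for link in forward[s]:
                remaining[link] -= level
            for link in feedback[s]:
                remaining[link] -= k * level
        active -= marked
    return rates
```

Each pass of the loop corresponds to one iteration of the procedure: the ratio computed for a link equals the quotient in the procedure, because the numerator is the link capacity less what marked sessions consume and the denominator counts only the sessions that are still unmarked.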

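Theorem 2.1

1. Procedure Global Optimum terminates in a finite number of iterations.

2. When the procedure terminates, all sessions are assigned their globally optimal rates.

3. Let $\lambda^i$ be the optimal rates assigned at iteration $i$. Then $\lambda^1 < \ldots < \lambda^m$.

4. Let $\mathcal{L}^i$, $\mathcal{S}^i$ and $\lambda^i$ be the set of bottleneck links of the reduced network of iteration $i$, the set of sessions crossing these links, and their assigned rate respectively. Then any session in $\mathcal{S}^i$ crosses at least one link in $\mathcal{L}^i$.

5. Only sessions from $\mathcal{S}^1 \cup \ldots \cup \mathcal{S}^i$ go through any link in $\mathcal{L}^i$, for all $1 \le i \le m$.

6. For all $1 \le i \le m$,

$$\lambda^i \begin{cases} = \dfrac{C_l - \sum_{j=1}^{i-1} \lambda^j (f_l^j + k b_l^j)}{f_l + k b_l - \sum_{j=1}^{i-1} (f_l^j + k b_l^j)} & \text{if } l \in \mathcal{L}^i \\[2ex] < \dfrac{C_l - \sum_{j=1}^{i-1} \lambda^j (f_l^j + k b_l^j)}{f_l + k b_l - \sum_{j=1}^{i-1} (f_l^j + k b_l^j)} & \text{if } l \in \hat{\hat{\mathcal{L}}}^i \end{cases}$$

where $\hat{\hat{\mathcal{L}}}^i$ is the set of links $l$ in $\mathcal{L} \setminus (\mathcal{L}^1 \cup \ldots \cup \mathcal{L}^i)$ such that at least one session of $\mathcal{S} \setminus (\mathcal{S}^1 \cup \ldots \cup \mathcal{S}^i)$ crosses $l$.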

The proof of this theorem is given in Appendix 2.

3 Distributed Algorithm

3.1 Assumptions and Goals

Section 2.2 provided a way to determine the optimal rates of a fixed set of sessions using global knowledge of the network. In addition, the global algorithm described there required synchronization of the stages in which the optimal rates were assigned.

This section presents an algorithm to achieve the same goal in a distributed asynchronous way.

We start with a few words about the assumptions of the model and the goals of the distributed algorithm.

We now allow sessions to exit or enter as they please. However, to make the notion of optimal rates meaningful, we must assume that the sessions enter or exit not too often, in the sense that there is an extended period of time in which the set of sessions in the network is fixed. Then for this period we can define optimal rates as in section 2.

Thus, we allow the sessions to go through a period of instability in which some sessions can enter and exit, and then to stabilize to some fixed set for an extended period of time. We want the algorithm to stabilize to the optimal transmission rates for this set. Once the network has reached its current optimal state, we want it to remain there until new sessions enter or old sessions exit. If some sessions exit or enter, the optimal rates over the new set change as well. We want the algorithm to stabilize to the new optimal rates. If the set of network users changes much more slowly than the time required for the algorithm to converge, then the network will spend most of the time in a currently optimal state.

3.2 High-Level Description

The essential idea of the algorithm is to emulate the iterations of Procedure Global Optimum in a distributed asynchronous way.

To achieve this we let all packets carry an estimate of the bandwidth available for the session. This estimate will be referred to as the packet's `stamped rate'. We stress here that in fact the algorithm does not require that every packet carries the stamped rate. We use this assumption only for simplicity. It will become clear that special control packets can be used to carry the stamped rate, or only a fraction of data packets can be used for this purpose.

We let each link maintain its current estimate of the fair share of its own capacity, referred to as the link's `advertized rate'.

Originally the source sets the packet's stamped rate to some arbitrary initial value. As the packet travels through the network, its stamped rate is reset to the smallest of the packet's initial rate and the smallest of advertized rates of all links on the packet's round-trip route.
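The effect of a round trip on the stamped rate can be sketched as a running minimum. The Python fragment below is a minimal illustration; the function name and the representation of the route as a list of advertized rates are assumptions of this example.

```python
def stamp_round_trip(initial_rate, advertised_rates):
    """Stamped rate of the feedback packet when it arrives back at the source.

    initial_rate:     stamped rate set by the source
    advertised_rates: advertized rates of the links visited, in order,
                      on the forward route and then on the feedback route
    """
    stamped = initial_rate
    for adv in advertised_rates:
        stamped = min(stamped, adv)   # a link can only lower the stamp
    return stamped
```

A packet stamped 10 that crosses links advertizing 7, 4 and 9 returns stamped 4, whereas a packet stamped 3 returns unchanged.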

When a feedback packet returns to the source, the source adjusts its transmission rate according to the stamped rate of the feedback packet.

Each link maintains a list of its users. It adds a session to this list when the first packet of a new session is received. It deletes a session from the list when it determines that a session has exited. We do not address the issue of exactly how this determination is done. One could define a timeout value, or let the sessions send a special "last" packet. For the purposes of this work, however, we ignore any details or issues associated with any such choice, and simply assume that there is some way a switch can recognize the fact that the session is no longer active.

For each user the link stores its last seen stamped rate. It will be referred to as the `recorded rate' of the session at the link.

The link sets a bit in the session's entry if a packet of that session is received with stamped rate below or equal to the current advertized rate of the link. We say that a session with this bit set at some link is marked at the link.

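The link then calculates its advertized rate as

$$\frac{C - \tilde{C}}{f + k b - \tilde{f} - k \tilde{b}} \tag{3}$$

where $C$ is the capacity of the link, $\tilde{C}$ is the capacity used by the last seen stamped rates of the sessions marked at this link, $f$ and $b$ are the total numbers of forward and feedback sessions at the link, and $\tilde{f}$ and $\tilde{b}$ are the numbers of marked forward and feedback sessions at the link.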

It is essential that the set of marked sessions at any time must satisfy the following conditions:

1. If any session is marked, its recorded rate is less than or equal to the advertized rate of the link.

2. Advertized rate is calculated according to (3).

The above conditions will be referred to as M-consistency, for "marking" consistency.
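One way to read the two M-consistency conditions is as a small fixed-point computation: evaluate formula (3), unmark any marked session whose recorded rate exceeds the result, and repeat. The Python sketch below illustrates that reading only. The session representation is invented for this example, a marked feedback session is assumed to contribute `k` times its recorded rate to the marked capacity, and the case in which every session is marked (where the denominator of (3) vanishes) is deliberately left unhandled.

```python
def advertised_rate(capacity, sessions, k):
    """Advertized rate by formula (3), unmarking sessions until M-consistent.

    sessions: list of dicts with keys
        'rate'     - recorded (last seen stamped) rate,
        'marked'   - True if the session is marked at this link,
        'feedback' - True if the session crosses the link on its feedback path.
    Marks are updated in place.
    """
    def weight(s):
        # A feedback session counts as k, a forward session as 1.
        return k if s["feedback"] else 1.0

    while True:
        marked = [s for s in sessions if s["marked"]]
        # Denominator of (3): f + k*b minus the marked part.
        unmarked_weight = sum(weight(s) for s in sessions if not s["marked"])
        if unmarked_weight <= 0:
            raise ValueError("formula (3) is undefined when every session is marked")
        # Capacity attributed to marked sessions (assumed weighting, see lead-in).
        marked_capacity = sum(weight(s) * s["rate"] for s in marked)
        rate = (capacity - marked_capacity) / unmarked_weight
        violators = [s for s in marked if s["rate"] > rate]
        if not violators:
            return rate               # conditions 1 and 2 both hold
        for s in violators:
            s["marked"] = False       # unmark and recalculate
```

For a link of capacity 10 with marked forward sessions recorded at 2 and 6 and one unmarked forward session, the first evaluation gives 2, so the session recorded at 6 is unmarked; the second evaluation gives (10 - 2) / 2 = 4, which is consistent.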

If at any time a session violates M-consistency, it must be immediately unmarked and the advertized rate must be recalculated. It turns out that M-consistency is central to ensuring convergence of the algorithm to optimal values from any initial conditions. We will discuss M-consistency in more detail later in this work.

Finally, note that if the source's idea of a session's rate is below the advertized rates of all links in the session's route, and the session's demand is not satisfied, "obeying" the stamped rate received in the feedback packet would cause the session to operate below its optimal rate. To avoid this condition, an extra bit in the packet header is used. The bit will be referred to as the "u-bit". A "greedy" session (i.e. a session whose demand is infinite or unknown) sets the u-bit to 0 on all of its packets. A "conservative" session, whose demand is known and finite, sets the u-bit of all its outgoing packets to 1. If the packet's stamped rate is above or equal to the advertized rate of a link in the packet's route, the link sets the u-bit to 1. Hence, if a feedback packet returns with the u-bit set to 0, the advertized rates of all links were higher than the stamped rate of the packet. In this case the source ignores the received stamped rate and resets its idea of the allowed rate to its demand.

There is no synchronization between the operation of different network components. The next section contains a formal description of the algorithm.

3.3 Algorithm Description

This section describes the data structures and operation of network components. Where appropriate, 'pidgin C' code is used to describe component operation. The code is not intended to be efficient and is sometimes redundant for the sake of clarity of the underlying ideas.

3.3.1 Data Structures

Packet p:

$u_p$ - 'u-bit' used to indicate that the session's rate can be increased

$\rho_p$ - packet's stamped rate

$t_p$ - packet's type (forward or feedback)

Source s:

$\rho_s$ - stamped rate of the last feedback packet received

$u_s$ - bit indicating whether or not to set the u-bit of outgoing packets

$d_s$ - demand of the session

Destination d:

$\rho_d$ - stamped rate of the last forward packet received

$u_d$ - 'u-bit' of the last forward packet received

$count_d$ - used for counting the number of unacknowledged forward packets

Link l:

$C_l$ - capacity of the link

$f_l$ - number of forward sessions known at the link

$b_l$ - number of feedback sessions known at the link

$\mathcal{G}_l$ - set of sessions known at the link

For any session $i \in \mathcal{G}_l$:

$a_l^i$ - bit used to mark the session at the link

$\beta_l^i$ - equal to 1 if the session is forward and to $k$ if it is feedback (note that $k$ is a universal constant across the network)

$\rho_l^i$ - recorded rate of the session

$\mu_l$ - advertized rate of the link, calculated as


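$$
\mu_l =
\begin{cases}
C_l & \text{if } f_l + k b_l = 0 \\[4pt]
C_l - \sum_{j \in \mathcal{G}_l} \beta_l^j \rho_l^j a_l^j + \max_{i \in \mathcal{G}_l} \rho_l^i & \text{if } f_l + k b_l = \sum_{j \in \mathcal{G}_l} \beta_l^j a_l^j \\[4pt]
\dfrac{C_l - \sum_{j \in \mathcal{G}_l} \beta_l^j \rho_l^j a_l^j}{f_l + k b_l - \sum_{j \in \mathcal{G}_l} \beta_l^j a_l^j} & \text{otherwise}
\end{cases}
\qquad (4)
$$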

3.3.2 Source Operation

    source_initialize(source s) {  /* called at initialization time */
        if (demand not known) set d_s = infinity, u_s = 0
        rho_s = d_s          /* rho_s: the source's current allowed rate */
        if (d_s < infinity)
            u_s = 1;
        else
            u_s = 0;
    }

    source_receive_packet(source s, packet p) {  /* called upon receipt of a feedback packet */
        if (rho_p > d_s) {  /* must be that demand has decreased */
            rho_s = d_s; u_s = 1;
        }
        else if (u_p == 0) {
            rho_s = d_s;
            if (d_s < infinity) u_s = 1;
            else u_s = 0;
        }
        else {
            /* Packet passed at least one link whose advertized rate
               was equal to the packet's current stamped rate,
               so obey this rate */
            rho_s = rho_p; u_s = 0;
        }
    }

    source_generate_packet(source s, packet p) {
        create new packet p
        rho_p = rho_s
        u_p = u_s
        add p to the outgoing link's output queue
    }
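The source's reaction to a feedback packet can be sketched in compilable C as follows. This is only an illustration under stated assumptions: the struct and field names (`struct source`, `struct feedback`, `rho`, `u`, `demand`) are hypothetical, and an unknown or unbounded demand is represented by the C99 `INFINITY` macro.

```c
#include <assert.h>
#include <math.h>   /* INFINITY */

/* Hypothetical source state; field names are illustrative. */
struct source {
    double rho;     /* current allowed rate, stamped on outgoing packets */
    int u;          /* u-bit value written on outgoing packets */
    double demand;  /* INFINITY when the demand is unknown or unbounded */
};

/* Hypothetical view of the fields of a returning feedback packet. */
struct feedback {
    double rho;     /* stamped rate */
    int u;          /* u-bit */
};

void source_receive(struct source *s, const struct feedback *p)
{
    assert(s->demand >= 0.0);
    if (p->rho > s->demand) {
        /* demand has dropped below the stamped rate: fall back to demand */
        s->rho = s->demand;
        s->u = 1;
    } else if (p->u == 0) {
        /* no link limited the packet: the session may rise to its demand */
        s->rho = s->demand;
        s->u = (s->demand < INFINITY) ? 1 : 0;
    } else {
        /* some link limited the packet: obey the stamped rate */
        s->rho = p->rho;
        s->u = 0;
    }
}
```

A greedy source that receives stamped rate 30 with the u-bit set adopts rate 30; if the same source later sees the u-bit cleared, it returns to its (infinite) demand.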

3.3.3 Destination Operation

    destination_initialize(destination d) {
        count_d = 0
        rho_d = 0; u_d = 0;
    }

    destination_receive_packet(destination d, packet p) {  /* called upon receipt of a forward packet */
        count_d = count_d + 1
        /* set parameters for the feedback packet as seen in the
           last of the k packets to be acknowledged */
        if (count_d == k) {
            rho_d = rho_p; u_d = u_p; count_d = 0;
            create new feedback packet p
            rho_p = rho_d; u_p = u_d
            send p
        }
    }
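The destination logic above, one feedback packet per k forward packets, can be sketched in compilable C. The struct, the function name and the output parameters `fb_rho` and `fb_u` are hypothetical; they stand in for "create and send a feedback packet".

```c
#include <assert.h>

/* Hypothetical destination state; field names are illustrative. */
struct destination {
    int count;    /* forward packets seen since the last feedback packet */
    double rho;   /* stamped rate to copy into the next feedback packet */
    int u;        /* u-bit to copy into the next feedback packet */
};

/* Processes one forward packet with stamped rate rho_p and u-bit u_p.
 * Returns 1 and fills *fb_rho and *fb_u when a feedback packet is due,
 * otherwise returns 0 and leaves the outputs untouched. */
int destination_receive(struct destination *d, int k,
                        double rho_p, int u_p,
                        double *fb_rho, int *fb_u)
{
    assert(k >= 1);
    d->count += 1;
    if (d->count == k) {
        /* the k-th packet determines the contents of the feedback */
        d->rho = rho_p;
        d->u = u_p;
        d->count = 0;
        *fb_rho = d->rho;
        *fb_u = d->u;
        return 1;
    }
    return 0;
}
```

With k = 3, the first two forward packets produce no feedback, and the third produces a feedback packet carrying the third packet's stamped rate and u-bit.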

3.3.4 Switch Operation

As soon as a packet arrives at an input link of a switch, it is added to the end of the output queue of the appropriate outgoing link. If several packets are received simultaneously from different input links, they are processed in some random order.

3.3.5 Output Link Operation

    link_initialize(link l) {
        f_l = 0; b_l = 0; G_l = {};  /* no sessions known yet */
        mu_l = C_l
    }

    link_action(link l, packet p) {
        /* called when packet p is at the head of the link's output queue */
        if any session i exited
            call link_update_session_exit(link l, session i)
        if (i_p not in G_l)  /* packet belongs to a new session */
            call link_update_new_session(link l, packet p)
        else  /* packet belongs to a session already seen at the link */
            call link_update_known_session(link l, packet p)
        transmit packet p
    }

    link_update_session_exit(link l, session i) {
        /* update the list of known sessions */
        if (session i is a forward session)
            f_l = f_l - 1
        else  /* feedback session */
            b_l = b_l - 1
        G_l = G_l \ {i}
        /* recalculate advertized rate mu_l with updated information */
        mu_l = calculate_adv_rate(l);
    }

    link_update_new_session(link l, packet p) {
        /* update the list of known sessions */
        G_l = G_l U {i_p}
        beta_l[i_p] = t_p + k (1 - t_p)
        f_l = f_l + t_p
        b_l = b_l + (1 - t_p)
        /* do not mark the new session */
        a_l[i_p] = 0;
        /* note that we do not need to set the recorded rate of the
           new session at this time since an unmarked session's recorded rate
           is not used in the calculation of the advertized rate */
        mu_l = calculate_adv_rate(l)
        if (rho_p >= mu_l) {
            rho_p = mu_l; u_p = 1;
        }
        /* record the new rate now */
        rho_l[i_p] = rho_p;
    }

    link_update_known_session(link l, packet p) {
        if (rho_p >= mu_l) {
            rho_p = mu_l;
            u_p = 1;
        }
        if (rho_p <= mu_l) a_l[i_p] = 1;
        rho_l[i_p] = rho_p;
        mu_l = calculate_adv_rate(l);
    }

    calculate_adv_rate(link l) {
        /* first calculate advertized rate with given set of marked sessions */
        RATE CALCULATION: {
            if (f_l + k b_l == 0)
                mu_l = C_l
            else if (f_l + k b_l == sum over j in G_l of beta_l[j] a_l[j])
                mu_l = C_l - (sum over j in G_l of beta_l[j] rho_l[j] a_l[j])
                           + (max over i in G_l of rho_l[i])
            else
                mu_l = (C_l - sum over j in G_l of beta_l[j] rho_l[j] a_l[j])
                       / (f_l + k b_l - sum over j in G_l of beta_l[j] a_l[j])
        }
        unmark any session whose recorded rate is above the calculated advertized rate
        repeat RATE CALCULATION once more and return
    }
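The two-pass calculation above can be sketched in compilable C. This is one possible rendering under stated assumptions, not the thesis's implementation: the session table is a plain array, and the names `struct session`, `rate_once` and the field names are hypothetical. The weight `beta` is 1 for a forward session and k for a feedback session, as in the data structures of section 3.3.1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-session state kept at a link. */
struct session {
    double beta;  /* 1 for a forward session, k for a feedback session */
    double rho;   /* recorded rate */
    int marked;   /* marking bit */
};

/* One evaluation of formula (4) for the current marking. */
static double rate_once(double capacity, const struct session *s, size_t n)
{
    double total_w = 0.0, marked_w = 0.0, marked_cap = 0.0, max_rho = 0.0;
    size_t unmarked = 0;

    if (n == 0)                      /* no sessions: f + kb = 0 */
        return capacity;
    for (size_t i = 0; i < n; i++) {
        total_w += s[i].beta;
        if (s[i].marked) {
            marked_w += s[i].beta;
            marked_cap += s[i].beta * s[i].rho;
        } else {
            unmarked++;
        }
        if (s[i].rho > max_rho)
            max_rho = s[i].rho;
    }
    if (unmarked == 0)               /* every session is marked */
        return capacity - marked_cap + max_rho;
    return (capacity - marked_cap) / (total_w - marked_w);
}

/* Calculate, unmark sessions recorded above the result, recalculate. */
double calculate_adv_rate(double capacity, struct session *s, size_t n)
{
    double mu;

    assert(capacity > 0.0);
    mu = rate_once(capacity, s, n);
    for (size_t i = 0; i < n; i++) {
        if (s[i].marked && s[i].rho > mu)
            s[i].marked = 0;
    }
    return rate_once(capacity, s, n);
}
```

For instance, with capacity 100 and four forward sessions of which two are marked with recorded rates 10 and 40, the first pass gives (100 - 50) / 2 = 25, the session recorded at 40 is unmarked, and the second pass returns (100 - 10) / 3 = 30.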


4 Convergence Theorem

In section 3.2 we introduced the notion of M-consistent calculation of the advertized rate of the link. Essentially, M-consistency means that once the advertized rate is calculated with some set of marked sessions, no session remains marked with a recorded rate exceeding the advertized rate. Function calculate_adv_rate() in the algorithm description provides a possible way to perform M-consistent calculation. Note that the result of the M-consistent calculation is not only the advertized rate but also the set of "marked" sessions.

Lemma 4.1 below proves that the result of this function is in fact M-consistent.

The convergence theorem given later in this section proves that, given any M-consistent advertized rate calculation, the algorithm described in the previous section will converge to the optimal rate vector if started from arbitrary initial conditions. Thus, function calculate_adv_rate() can be treated as a "black box" and can be replaced by any other function providing an M-consistent result.

Thus in essence, Convergence Theorem 4.1 proves convergence of a family of algorithms with M-consistent link control calculation. We will return to the issue of M-consistency in section 7, where we will also give another example of M-consistent calculation.

Lemma 4.1

After any link state update the advertized rate of the link and the marking of the sessions known at that link are M-consistent.

Proof of Lemma 4.1

Consider any link update. Let $\mathcal{Y}$ be the set of sessions marked at the beginning of this update. Let $\mu^1$ be the result of the first advertized rate calculation in function calculate_adv_rate(). Let $\mathcal{Z}$ denote the set of sessions which happen to be marked with recorded rates greater than $\mu^1$. By operation of function calculate_adv_rate() all sessions in $\mathcal{Z}$ will be unmarked. Then, if not all sessions are marked, the final advertized rate returned by function calculate_adv_rate() is calculated as

$$
\mu = \frac{C - \sum_{i \in \mathcal{Y} \setminus \mathcal{Z}} \beta_i \rho_i}{f + kb - \sum_{i \in \mathcal{Y} \setminus \mathcal{Z}} \beta_i}
= \frac{C - \sum_{i \in \mathcal{Y}} \beta_i \rho_i + \sum_{i \in \mathcal{Z}} \beta_i \rho_i}{f + kb - \sum_{i \in \mathcal{Y}} \beta_i + \sum_{i \in \mathcal{Z}} \beta_i}
\ge \frac{C - \sum_{i \in \mathcal{Y}} \beta_i \rho_i + \mu^1 \sum_{i \in \mathcal{Z}} \beta_i}{f + kb - \sum_{i \in \mathcal{Y}} \beta_i + \sum_{i \in \mathcal{Z}} \beta_i}
= \mu^1
$$

The last equality can be easily checked. Since all sessions which remain marked after the second advertized rate calculation in calculate_adv_rate() have recorded rates below or equal to $\mu^1$, the statement of the lemma follows.
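The check of the last equality can be written out as follows. The shorthands $N$, $D$ and $W$ are introduced here only for this note: $N$ and $D$ are the numerator and denominator of the first advertized rate calculation, so that $\mu^1 = N/D$, and $W$ is the total weight of the sessions that get unmarked.

```latex
\begin{aligned}
N &= C - \sum_{i \in \mathcal{Y}} \beta_i \rho_i, \qquad
D = f + kb - \sum_{i \in \mathcal{Y}} \beta_i, \qquad
W = \sum_{i \in \mathcal{Z}} \beta_i, \qquad
\mu^1 = \frac{N}{D} \\
\frac{N + \mu^1 W}{D + W}
  &= \frac{\mu^1 D + \mu^1 W}{D + W}
   = \mu^1
\end{aligned}
```

The inequality preceding it holds because every session in $\mathcal{Z}$ has recorded rate $\rho_i > \mu^1$, so replacing each such $\rho_i$ by $\mu^1$ can only decrease the numerator.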

If all sessions are marked, the statement of the lemma trivially holds, since by (4) the advertized rate is greater than or equal to the maximum recorded rate of its sessions.

Theorem 4.1

Given arbitrary initial conditions on the states of all links in the network, states of all sources, destinations and arbitrary number of packets in transit with arbitrary control information written on them, the algorithm given in section 3 converges to the optimal rates as long as the set of sessions, their demands and routes eventually stabilize.

Note that essentially any change in the route or demand of a session is equivalent to an old session exiting and a new session entering. Thus, without loss of generality, the proof will be given under the assumption that demands and routes are fixed, but sessions are allowed to enter or exit, as long as the set of sessions eventually stabilizes. We give the proof for the case of infinite user demand. This does not cause any loss of generality since, as has already been mentioned, the case of finite demands is reduced to this case by adding artificial links with capacities equal to session demands at the entry to the network.

The proof of this theorem is based on the following four lemmas.


Lemma 4.2

After the set of sessions stabilizes at some time $t'$ and all these sessions have become known at all links in the network,

$$\mu_l(t) \ge \frac{C_l}{f_l(t) + k b_l(t)}$$

for all links $l$ and all times $t \ge t'$. Here $f_l$ and $b_l$ are the numbers of forward and feedback sessions crossing link $l$, respectively.

Proof of Lemma 4.2

In what follows the link index $l$ and the time argument $t$ are omitted.

Consider the time of any state update of link $l$ after $t'$. By Lemma 4.1 the result of any link update is M-consistent, so any marked session $i$ has recorded rate $\rho_i \le \mu$. Let $\mathcal{Y}$ denote the set of indices $j$ such that $a_j = 1$ (i.e. the set of marked sessions). Then, for the case when not all sessions are marked, by M-consistency

$$
\mu = \frac{C - \sum_{j \in \mathcal{Y}} \beta_j \rho_j}{f + kb - \sum_{j \in \mathcal{Y}} \beta_j}
\ge \frac{C - \mu \sum_{j \in \mathcal{Y}} \beta_j}{f + kb - \sum_{j \in \mathcal{Y}} \beta_j}
$$

Hence, $\mu \ge \frac{C}{f + kb}$.

If all sessions are marked, two cases are possible. If $\max_{i \in \mathcal{G}} \rho_i \ge \frac{C}{f + kb}$, then by M-consistency $\mu \ge \max_{i \in \mathcal{G}} \rho_i \ge \frac{C}{f + kb}$, where $\mathcal{G}$ is the set of all sessions crossing the link, and the statement of the lemma holds.

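If all $\rho_i < \frac{C}{f + kb}$, then by (4), using $\sum_{i \in \mathcal{G}} \beta_i = f + kb$ and assuming $f + kb \ge 1$ (which holds when the link carries at least one session and $k \ge 1$),

$$
\mu = \max_{i \in \mathcal{G}} \rho_i + C - \sum_{i \in \mathcal{G}} \beta_i \rho_i
\ge C - (f + kb - 1) \max_{i \in \mathcal{G}} \rho_i
\ge C - (f + kb - 1) \frac{C}{f + kb}
= \frac{C}{f + kb}
$$

and the statement of the lemma follows.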
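For the case where not all sessions are marked, the step from the displayed inequality to the bound on the advertized rate can be spelled out as follows. Here $\mu$ is the advertized rate, $\beta_j$ is the weight of session $j$ (1 for forward, $k$ for feedback), and $B$ is a shorthand introduced only for this note for the total weight of the marked sessions. The denominator is positive because at least one session is unmarked.

```latex
\begin{aligned}
B &= \sum_{j \in \mathcal{Y}} \beta_j, \qquad f + kb - B > 0 \\
\mu \ge \frac{C - \mu B}{f + kb - B}
  &\;\Longrightarrow\; \mu\,(f + kb - B) \ge C - \mu B \\
  &\;\Longrightarrow\; \mu\,(f + kb) \ge C \\
  &\;\Longrightarrow\; \mu \ge \frac{C}{f + kb}
\end{aligned}
```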

Lemma 4.3

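Let $\lambda_i$ denote the optimal rate of sessions in $\mathcal{S}_i$, where $\mathcal{S}_i$ is the set of sessions whose optimal rates were assigned at iteration $i$ of Procedure Global Optimum, and $\mathcal{L}_i$ the set of bottleneck links of this iteration. Let $t'$, $f_l$, $b_l$ be as in Lemma 4.2. Then for any $t > t'$ it must be that

$$\mu_l(t) > \lambda_1 \quad \forall l \in \mathcal{L} \setminus \mathcal{L}_1$$

$$\mu_l(t) \ge \lambda_1 \quad \forall l \in \mathcal{L}_1$$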


Proof of Lemma 4.3

By Lemma 4.2

$$\mu_l \ge \frac{C_l}{f_l(t) + k b_l(t)} \ge \frac{C_l}{f_l + k b_l}$$

By Theorem 2.1, $\lambda_1 = \frac{C_l}{f_l + k b_l}$ if $l \in \mathcal{L}_1$ and $\lambda_1 < \frac{C_l}{f_l + k b_l}$ if $l \in \mathcal{L} \setminus \mathcal{L}_1$.

This should be obvious since $\lambda_1$ is the capacity per session of a first-level bottleneck link, which by definition must be smaller than $\frac{C_l}{f_l + k b_l}$ of any other link.

The statement of this lemma immediately follows.
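The characterization used in this proof, that the first-level optimal rate is the smallest capacity per weighted session over all links, can be illustrated with a short C sketch. It assumes infinite demands, as in the proof, and the names `struct link`, `share` and `first_level_rate` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a link and the sessions crossing it. */
struct link {
    double capacity; /* C_l */
    int forward;     /* f_l, number of forward sessions */
    int feedback;    /* b_l, number of feedback sessions */
};

/* Capacity per weighted session of one link: C_l / (f_l + k b_l). */
static double share(const struct link *l, double k)
{
    return l->capacity / (l->forward + k * l->feedback);
}

/* Returns the smallest share over all links that carry a session and
 * sets is_bottleneck[i] to 1 for the links attaining it, 0 otherwise.
 * Returns -1.0 if no link carries any session. */
double first_level_rate(const struct link *links, size_t n, double k,
                        int *is_bottleneck)
{
    double best = -1.0;

    assert(k > 0.0);
    for (size_t i = 0; i < n; i++) {
        if (links[i].forward + links[i].feedback == 0)
            continue;
        double s = share(&links[i], k);
        if (best < 0.0 || s < best)
            best = s;
    }
    for (size_t i = 0; i < n; i++) {
        int used = links[i].forward + links[i].feedback > 0;
        is_bottleneck[i] = used && share(&links[i], k) == best;
    }
    return best;
}
```

With three links of capacities 100, 90 and 50 carrying (4, 0), (2, 0) and (1, 1) forward and feedback sessions and k = 1, the shares are 25, 45 and 25, so the first and third links are the first-level bottlenecks.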

The next lemma states that there exists some time after which all sessions in $\mathcal{S}_1$ will have reached their optimal rate $\lambda_1$ and will be marked with this optimal rate at all links on their routes.

Lemma 4.4

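Let $\lambda_i$, $\mathcal{S}_i$, $\mathcal{L}_i$ be as in Lemma 4.3. Then there exists $T_1 \ge 0$ such that for all $t \ge T_1$:

1. $\rho_p^i > \lambda_1$ for any packet $p$ of session $i \in \mathcal{S} \setminus \mathcal{S}_1$.

2. $\rho_l^i > \lambda_1$ for any session $i \in \mathcal{S} \setminus \mathcal{S}_1$ and link $l$ in the route of $i$ or its feedback.

3. $\mu_l = \lambda_1$ for any link $l \in \mathcal{L}_1$.

4. $\rho_s^j = \lambda_1$ for the source $s$ of any session $j \in \mathcal{S}_1$.

5. $\rho_s^j > \lambda_1$ for the source $s$ of any session $j \in \mathcal{S} \setminus \mathcal{S}_1$.

6. $\rho_p^i = \lambda_1$ for any packet $p$ of session $i \in \mathcal{S}_1$.

7. $a_l^i = 1$, $\rho_l^i = \lambda_1$ for all sessions $i \in \mathcal{S}_1$ and all links $l$ in the route of $i$ or its feedback.

The argument $t$ is omitted here.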

The proof of this lemma is given in Appendix 3.


The result of this lemma will now be used as the base case for induction on the index $i$ of $\mathcal{S}_i$. Note that this lemma states not only that the sessions in $\mathcal{S}_1$ have reached their optimal rates, but also that these rates will never change and that the sessions will remain marked at all links in their routes from then on (as long as the set of sessions remains the same).

The inductive step is given by the following lemma:

Lemma 4.5

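(Inductive Step). Suppose that for some $1 \le i < m$ there exists $t_i \ge 0$ such that for all $t \ge t_i$:

1. $\mu_l = \lambda_j$ for any link $l \in \mathcal{L}_j$, $1 \le j \le i$.

2. $\rho_s^k = \lambda_j$ for the source $s$ of any session $k \in \mathcal{S}_j$, $1 \le j \le i$.

3. $\rho_s^j > \lambda_i$ for the source $s$ of any session $j \in \mathcal{S} \setminus (\mathcal{S}_1 \cup \dots \cup \mathcal{S}_i)$.

4. $\rho_p^k = \lambda_j$ for any packet $p$ of session $k \in \mathcal{S}_j$, $1 \le j \le i$.

5. $a_l^k = 1$, $\rho_l^k = \lambda_j$ for all sessions $k \in \mathcal{S}_j$, $1 \le j \le i$, and all links $l$ in the route of $k$ or its feedback.

6. $\rho_p^j > \lambda_i$ for any packet $p$ of session $j \in \mathcal{S} \setminus (\mathcal{S}_1 \cup \dots \cup \mathcal{S}_i)$.

7. $\rho_l^j > \lambda_i$ for any session $j \in \mathcal{S} \setminus (\mathcal{S}_1 \cup \dots \cup \mathcal{S}_i)$ and link $l$ in the route of $j$ or its feedback.

Then there exists $t_{i+1} \ge 0$ such that for all $t \ge t_{i+1}$ conditions 1-7 hold for $i + 1$.

It is assumed that the set of sessions has stabilized by time $t_i$.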

Proof of Lemma 4.5.

By the inductive hypothesis all sessions in $\mathcal{S}_j$, $1 \le j \le i$, have reached their optimal rates $\lambda_j$, and these rates do not change as long as the set of sessions remains unchanged. Moreover, by the inductive hypothesis any session in $\mathcal{S}_j$, $1 \le j \le i$, is marked with its optimal rate $\lambda_j$ at any link on its way for all
