Federico Montesino Pouzols, Diego R. Lopez, and Angel Barriga Barros Mining and Control of Network Trafﬁc by Computational Intelligence


Editor-in-Chief Prof. Janusz Kacprzyk Systems Research Institute Polish Academy of Sciences ul. Newelska 6

01-447 Warsaw Poland

E-mail:[email protected] Further volumes of this series can be found on our homepage: springer.com

Vol. 319. Takayuki Ito, Minjie Zhang, Valentin Robu, Shaheen Fatima, Tokuro Matsuo,

and Hirofumi Yamaki (Eds.) Innovations in Agent-Based Complex Automated Negotiations, 2010 ISBN 978-3-642-15611-3

Vol. 321. Dimitri Plemenos and Georgios Miaoulis (Eds.) Intelligent Computer Graphics 2010

ISBN 978-3-642-15689-2

Vol. 322. Bruno Baruque and Emilio Corchado (Eds.) Fusion Methods for Unsupervised Learning Ensembles,2010 ISBN 978-3-642-16204-6

Vol. 323. Yingxu Wang, Du Zhang, and Witold Kinsner (Eds.) Advances in Cognitive Informatics,2010

ISBN 978-3-642-16082-0

Vol. 324. Alessandro Soro, Vargiu Eloisa, Giuliano Armano, and Gavino Paddeu (Eds.)

Information Retrieval and Mining in Distributed Environments,2010

ISBN 978-3-642-16088-2

Vol. 325. Quan Bai and Naoki Fukuta (Eds.) Advances in Practical Multi-Agent Systems,2010 ISBN 978-3-642-16097-4

Vol. 326. Sheryl Brahnam and Lakhmi C. Jain (Eds.) Advanced Computational Intelligence Paradigms in Healthcare 5,2010

ISBN 978-3-642-16094-3 Vol. 327. Slawomir Wiak and Ewa Napieralska-Juszczak (Eds.)

Computational Methods for the Innovative Design of Electrical Devices,2010

ISBN 978-3-642-16224-4

Vol. 328. Raoul Huys and Viktor K. Jirsa (Eds.) Nonlinear Dynamics in Human Behavior,2010 ISBN 978-3-642-16261-9

Vol. 329. Santi Caball´e, Fatos Xhafa, and Ajith Abraham (Eds.) Intelligent Networking, Collaborative Systems and Applications,2010

ISBN 978-3-642-16792-8 Vol. 330. Steffen Rendle

Context-Aware Ranking with Factorization Models,2010 ISBN 978-3-642-16897-0

Vol. 331. Athena Vakali and Lakhmi C. Jain (Eds.) New Directions in Web Data Management 1,2011 ISBN 978-3-642-17550-3

Vol. 332. Jianguo Zhang, Ling Shao, Lei Zhang, and Graeme A. Jones (Eds.)

Intelligent Video Event Analysis and Understanding,2011 ISBN 978-3-642-17553-4

Vol. 333. Fedja Hadzic, Henry Tan, and Tharam S. Dillon Mining of Data with Complex Structures,2011 ISBN 978-3-642-17556-5

Vol. 334. Álvaro Herrero and Emilio Corchado (Eds.) Mobile Hybrid Intrusion Detection,2011 ISBN 978-3-642-18298-3

Vol. 335. Radomir S. Stankovic and Jaakko Astola From Boolean Logic to Switching Circuits and Automata, 2011 ISBN 978-3-642-11681-0

Vol. 336. Paolo Remagnino, Dorothy N. Monekosso, and Lakhmi C. Jain (Eds.)

Innovations in Defence Support Systems – 3,2011 ISBN 978-3-642-18277-8

Vol. 337. Sheryl Brahnam and Lakhmi C. Jain (Eds.) Advanced Computational Intelligence Paradigms in Healthcare 6,2011

ISBN 978-3-642-17823-8

Vol. 338. Lakhmi C. Jain, Eugene V. Aidman, and Canicious Abeynayake (Eds.)

Innovations in Defence Support Systems – 2,2011 ISBN 978-3-642-17763-7

Vol. 339. Halina Kwasnicka, Lakhmi C. Jain (Eds.) Innovations in Intelligent Image Analysis,2010 ISBN 978-3-642-17933-4

Vol. 340. Heinrich Hussmann, Gerrit Meixner, and Detlef Zuehlke (Eds.)

Model-Driven Development of Advanced User Interfaces,2011 ISBN 978-3-642-14561-2

Vol. 341. Stéphane Doncieux, Nicolas Bred`eche, and Jean-Baptiste Mouret (Eds.)

New Horizons in Evolutionary Robotics,2011 ISBN 978-3-642-18271-6

Vol. 342. Federico Montesino Pouzols, Diego R. Lopez, and Angel Barriga Barros

Mining and Control of Network Trafﬁc by Computational Intelligence,2011

ISBN 978-3-642-18083-5


Federico Montesino Pouzols

Dept. of Information and Computer Science, Aalto University

P.O. Box 15400, FI-00076 Aalto, Finland

E-mail: [email protected].fi http://www.cis.hut.fi/ fedemp/

Dr. Diego R. Lopez

RedIRIS, Red.es, Edif. Bronce

Pza. Manuel Gomez Moreno s/n, Planta 2.

E-28020 Madrid, Spain

E-mail: [email protected] http://www.rediris.es

Angel Barriga Barros

Instituto de Microelectrónica de Sevilla, c. Americo Vespucio s/n

41092 Sevilla, Spain

E-mail: [email protected]

http://www2.imse-cnm.csic.es/ barriga/

ISBN 978-3-642-18083-5 e-ISBN 978-3-642-18084-2

DOI 10.1007/978-3-642-18084-2

Studies in Computational Intelligence ISSN 1860-949X

Library of Congress Control Number: 2011921008

© 2011 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, speciﬁcally the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microﬁlm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a speciﬁc statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

springer.com


Like other complex systems in the social and natural sciences as well as in engineering, the Internet is difficult to understand from a technical point of view. The structure and behavior of packet switched networks are hard to model in a way comparable to many natural and artificial systems. Nonetheless, the Internet is an outstanding and challenging case due to its incredibly fast development and the inherent lack of measurement and monitoring mechanisms in its core conception. In short, packet switched networks defy analytical modeling.

It is generally accepted that Internet research needs better models. A great deal of development in network measurement systems and infrastructures has enabled many advances throughout the last decade in understanding how the basic mechanisms of the Internet work and interact. In particular, a number of works in Internet measurement have led to the first results in what some authors call Internet Science, i.e., an experimental science that studies laws and patterns in Internet structure.

However, many mechanisms are still not well understood. As a consequence, users experience performance degradations and networks cannot be used to their full potential. For instance, it is a common experience to see real-time applications perform poorly unless (or even if) the network is largely overprovisioned.

This monograph deals with applications of computational intelligence methods, with an emphasis on fuzzy techniques, to a number of current issues in measurement, analysis and control of traffic in packet switched networks. The general approach followed here is to address concrete problems in the areas of data mining and control of network traffic by means of specific fuzzy logic based techniques. The set of problems has been chosen on the basis of their practical interest in current networking systems as well as our aim of providing a unified approach to network traffic analysis and control. Of course, not all open issues are addressed here, but the set of methods we propose and apply provides a fairly comprehensive approach to current open problems. In addition, this set of methods is open to countless extensions to address current and future related problems.

Both data mining and control problems are addressed. In the first class we include two issues: predictive modeling of traffic load, and summarization and inductive analysis of traffic flow measurements. In the second class we include two other issues: active queue management schemes for Internet routers, and window based end-to-end rate and congestion control. While some theoretical developments are described, we favor extensive evaluation of models using real-world data by simulation and experiments.

The field of computational intelligence embraces a varied number of computational techniques such as neural networks, fuzzy systems, evolutionary systems, probabilistic reasoning and also computational swarm intelligence, artificial immune systems, fractals and chaos theory and wavelet analysis. Some if not all of the areas covered by the term computational intelligence are also often referred to as soft computing. As opposed to operations research, also known as hard computing, soft computing techniques require no strict conditions on the problems and do not provide guarantees for success. This is a shortcoming that is compensated in practice by the robustness of soft computing methods, a widely accepted fact.

Fuzzy inference systems (FIS for short, also commonly referred to as fuzzy rule-based systems or FRBS) play a central role in this monograph. FIS are used for tasks such as performance evaluation, prediction and control. However, in addition to fuzzy inference based techniques we apply other computational intelligence methods and complementary techniques including nonparametric statistical methods, OWA operators, association rules mining algorithms, fuzzy calculus, nearest neighbor methods, support vector machines and neural networks.

Fuzzy logic is a precise logic of imprecision, based on the concept of fuzzy set. Fuzzy logic integrates numerical and symbolic processing into a common scheme. This way, it allows for the inclusion of human expert knowledge into mathematical models, i.e., it provides a mathematical framework into which we can translate the solutions that a human expert expresses linguistically.

FIS are rule-based modeling systems. Fuzzy inference mechanisms have been shown to be an effective way to address problems that are subject to uncertainty and inaccuracy. For modeling and control, one major reason to use fuzzy systems is that fuzzy rules can be expressed in a linguistic manner and are thus comprehensible for humans. This is what makes it possible to use a priori knowledge. In addition, fuzzy inference based models can be interpreted and thus evaluated by experts. Many methods to generate different kinds of fuzzy inference models with an interpretability-accuracy trade-off have been proposed.
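As a minimal illustration of how linguistic rules translate into a numerical model, the following Python sketch evaluates a tiny zero-order Takagi-Sugeno style fuzzy inference system. The input variable (a normalized queue occupancy), the three labels, the triangular membership functions and the rule consequents are all hypothetical choices made for this example; they are not taken from the systems developed in this monograph.

```python
def triangle(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic labels for a normalized queue occupancy in [0, 1].
LABELS = {
    "low": lambda x: triangle(x, -0.5, 0.0, 0.5),
    "medium": lambda x: triangle(x, 0.0, 0.5, 1.0),
    "high": lambda x: triangle(x, 0.5, 1.0, 1.5),
}

# Hypothetical rule base:
#   IF occupancy is <label> THEN drop probability is <constant>.
RULES = {"low": 0.0, "medium": 0.1, "high": 0.8}


def infer(occupancy):
    """Return the activation-weighted average of the rule consequents."""
    x = min(max(occupancy, 0.0), 1.0)  # clamp to the universe of discourse
    activations = {label: mf(x) for label, mf in LABELS.items()}
    total = sum(activations.values())
    return sum(activations[label] * RULES[label] for label in RULES) / total
```

Each rule can be read aloud ("if occupancy is high then drop probability is 0.8"), which is the interpretability property discussed above; replacing a consequent or a label changes the model without touching the inference code.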

An additional key feature of fuzzy inference systems is that they are universal approximators. Also, so-called neuro-fuzzy systems combine FIS with the learning capabilities of artificial neural networks (ANNs), often using the same learning algorithms that were initially developed for ANNs. Neuro-fuzzy systems offer the computational power of nonlinear computational intelligence techniques and can also provide a natural language approach to solving a number of current issues around the analysis and control of network traffic. On the one hand, the rule based structure of FIS allows for the incorporation of domain expert knowledge. On the other hand, the ability to learn allows neuro-fuzzy systems to be used on problems where no rule-based solution built from a priori or expert knowledge seems feasible, or where one is primarily interested in inducing an interpretable model from data. In addition, efficient hardware implementations can be developed in a structured and systematic manner.

This monograph is organized as follows. In chapter 1 we introduce and provide concise descriptions of the core building blocks of Internet Science and other related networking aspects that will be used throughout the next chapters. Chapter 2 describes a methodology for building predictive time series models combining statistical techniques and neuro-fuzzy techniques.

Data mining of network traffic is the topic of chapters 3 and 4, where we focus on two related issues: traffic load prediction and analysis of traffic flow measurements.

In chapter 3 we first investigate the predictability of network traffic at different time scales, following a quantitative approach based on statistical techniques for nonparametric residual variance estimation. Using an extensive experimental background of a wide set of diverse and publicly available network traffic traces, it is shown that, in some cases, it is possible to predict network traffic with a satisfactory accuracy for a wide range of time scales. Then, the methodology described in chapter 2 is applied to diverse network traffic traces. The methodology is compared against least squares support vector machines (LS-SVM), Ordered Weighted Averaging Aggregation Operators (OWA)-induced nearest neighbors and optimally pruned extreme learning machines (OP-ELM). These methods are applied to an extensive set of time series derived from publicly available traffic traces. The methodology proposed is shown to provide advantages in terms of accuracy and interpretability. Further, it has been implemented in a tool integrated into the Xfuzzy development environment.
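To make the idea of nonparametric residual variance estimation concrete, the sketch below implements the first-nearest-neighbor form of the Delta test, the estimator named in the title of section 2.2. It is a brute-force illustration under stated assumptions: the helper names, the synthetic sine series and the lags are hypothetical, and the exact variant and variable selection procedure used in the monograph may differ.

```python
import math
import random


def delta_test(inputs, outputs):
    """Estimate the noise variance as half the mean squared output
    difference between each sample and its nearest neighbor in input space."""
    n = len(inputs)
    total = 0.0
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j]))
            if d < best_d:
                best_j, best_d = j, d
        total += (outputs[best_j] - outputs[i]) ** 2
    return total / (2 * n)


def embed(series, lags):
    """Build (regressor, target) pairs from a series using the given lags."""
    start = max(lags)
    xs = [[series[t - k] for k in lags] for t in range(start, len(series))]
    ys = [series[t] for t in range(start, len(series))]
    return xs, ys


def noisy_sine(n, sigma, seed=0):
    """Synthetic stand-in for a traffic series: a sine wave plus Gaussian noise."""
    rng = random.Random(seed)
    return [math.sin(0.1 * t) + rng.gauss(0.0, sigma) for t in range(n)]


if __name__ == "__main__":
    for sigma in (0.0, 0.5):
        xs, ys = embed(noisy_sine(300, sigma), lags=[1, 16])
        print(f"sigma={sigma}: estimated residual variance {delta_test(xs, ys):.4f}")
```

A small estimate suggests that the chosen regressors determine the next value almost completely, whereas a large estimate bounds the accuracy any smooth model can reach with those regressors. The quadratic-time neighbor search is only adequate for short series.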

In chapter 4 a method and a tool for extracting concise linguistic summaries about network statistics at the flow level are described. In addition, a procedure for mining extended linguistic summaries from network flow collections is developed and the results for a number of publicly available traces are discussed. The theory of linguistic summaries has been extended for traffic statistics summarization and new tools for linguistic analysis of traffic traces at the flow level have been developed.

Chapter 5 deals with control of network traffic in routers, by means of active queue management schemes, as well as on an end-to-end basis, by means of window based techniques. First, a scheme for implementing end-to-end traffic control mechanisms through fuzzy inference systems is proposed. Simulation and implementation results for the fuzzy rate controller are compared against those of traditional controllers for a wide set of realistic scenarios. Then, fuzzy inference systems for traffic control in routers are designed. A particular proposal has been evaluated in realistic scenarios and is shown to be robust. The proposal is compared against the random early detection (RED) scheme. It is experimentally shown that fuzzy systems can provide better performance and better adaptation to different requirements with mechanisms that are easy to modify using linguistic knowledge.
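For readers unfamiliar with the baseline, the following sketch shows the core of RED in its commonly described textbook form: an exponentially weighted moving average of the queue length and a drop probability that grows linearly between two thresholds. It is a simplification under stated assumptions: the class name and default parameter values are illustrative, and the count-based spacing of drops and the handling of idle periods found in full RED implementations are omitted.

```python
import random


class RedQueue:
    """Simplified RED: drop decision driven by an averaged queue length."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, weight=0.002):
        # Illustrative defaults, in packets; real deployments tune these.
        self.min_th = min_th
        self.max_th = max_th
        self.max_p = max_p
        self.weight = weight
        self.avg = 0.0

    def update_average(self, queue_len):
        """Exponentially weighted moving average of the instantaneous queue."""
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        """Zero below min_th, linear up to max_p at max_th, one above max_th."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def should_drop(self, queue_len, rng=random):
        """Update the average for an arriving packet and draw a drop decision."""
        self.update_average(queue_len)
        return rng.random() < self.drop_probability()
```

The piecewise-linear mapping in `drop_probability` is the part that a fuzzy controller replaces with a rule base, which is what makes the behavior adjustable through linguistic knowledge.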

Finally, chapter 6 addresses the practical implementation of some of the fuzzy inference systems proposed in previous chapters. Both architectural and operational constraints are considered. The chapter focuses on an open FPGA-based hardware platform for the implementation of efficient fuzzy inference systems for solving networking analysis and control problems. A feasibility study is conducted in order to show that the techniques developed can be deployed in current and future network scenarios with satisfactory performance. The major contribution is the development of a platform and a companion development methodology that not only fulfill operational requirements but also address the scalability and flexibility challenges posed by current routing architectures. In addition, evidence for the feasibility of real implementations is provided.

In conclusion, this monograph describes computational intelligence based methods and tools for addressing a number of current issues around network traffic measurement, modeling and control. Besides developing methods, special attention is paid to a number of practical aspects that have a determining impact on the adoption of novel methods and mechanisms for traffic analysis and control.

Espoo, Finland and Sevilla, Spain, September 2010

Federico Montesino Pouzols

Diego R. Lopez

Angel Barriga Barros

The first author is supported by a Marie Curie Intra-European Fellowship for Career Development (grant agreement PIEF-GA-2009-237450) within the European Community's Seventh Framework Programme (FP7/2007-2013). Most of this work was done while the first author was with the Microelectronics Institute of Seville, IMSE-CNM, CSIC. This work was supported in part by the European Community under the MOBY-DIC Project FP7-IST-248858 (www.mobydic-project.eu). The research presented here has been supported in part by a PhD studentship from the Andalusian regional Government, project TEC2008-04920 from the Spanish Ministry of Education and Science, as well as project P08-TIC-03674 from the Andalusian regional Government.

This monograph is based in part upon the Ph.D. dissertation of the first author, directed by the second and third authors, and completed in 2009 at the Department of Electronics and Electromagnetism of the University of Seville and the Microelectronics Institute of Seville, CSIC. We would like to thank all the colleagues that made this work possible. In particular, we would like to acknowledge the members of the thesis jury, Professors Jose Luis Huertas, Iluminada Baturone and Plamen Angelov, and Drs. Amaury Lendasse and Santiago Sanchez-Solano. Their comments and encouraging suggestions helped improve this monograph and motivated new research directions.

The extensive and computationally expensive analysis of network measurements performed in this monograph would not have been possible without the facilities and support from the e-Science infrastructure managed by the Centro Informático Científico de Andalucía (https://eciencia.cica.es/). Special thanks go to Ana Silva for her support.

We would like to acknowledge a number of institutions and individuals that have made this research possible by providing measurement infrastructures and repositories of network traces. In particular, our work has benefited from the use of measurement data collected on the Abilene network as part of the Abilene Observatory Project (http://abilene.internet2.edu/observatory/). We acknowledge the MAWI Working Group from the Wide Integrated Distributed Environment (WIDE) project (http://tracer.csl.sony.co.jp/mawi/) for kindly providing their traffic traces. We also used data sets from the Internet Traffic Archive (http://ita.ee.lbl.gov/), an initiative by the Lawrence Berkeley National Laboratory and the ACM Special Interest Group on Data Communications (SIGCOMM), as well as the Community Resource for Archiving Wireless Data (CRAWDAD) at Dartmouth (http://crawdad.cs.dartmouth.edu). We are also indebted to the Cooperative Association for Internet Data Analysis (CAIDA, http://www.caida.org), for providing a number of data collections. This work uses the following traces from CAIDA:

• The CAIDA OC48 Traces Dataset - August 2002, January 2003 and April 2003, Colleen Shannon, Emile Aben, kc claffy, Dan Andersen, Nevil Brownlee, http://www.caida.org/data/passive/.

• The CAIDA Anonymized 2007 and 2008 Internet Traces - January 2007 and April 2008, Colleen Shannon, Emile Aben, kc claffy, Dan Andersen, http://www.caida.org/data/passive/passive_2007_dataset.xml.

Support for CAIDA’s OC48 and Internet Traces is provided by the National Science Foundation, the US Department of Homeland Security, DARPA, Digital Envoy, and CAIDA Members.

1 Internet Science

1.1 Modeling the Internet

1.2 Measurement Systems and Infrastructures

1.2.1 Active Systems

1.2.2 Passive Systems

1.2.3 Publicly Available Measurements

1.3 Network Traffic

1.3.1 Traffic Models

1.3.2 Transport Layer Models. TCP

1.3.3 Models of Applications and Services

1.3.4 Network Simulation

1.3.5 Performance Metrics

1.3.6 Congestion

1.4 Traffic Control

1.4.1 End-To-End Traffic Control

1.4.2 Traffic Control in Routers

1.5 Time Series Models for Network Traffic

1.5.1 Short-Memory Stochastic Models

1.5.2 Long-Memory Stochastic Models

1.5.3 Mean Square Error Predictors

1.5.4 OWA-Induced Nearest Neighbor Models

1.5.5 Least Squares Support Vector Machines

1.5.6 Extreme Learning Machine

1.5.7 Prediction Performance Metrics

1.6 Conclusions

References

2 Modeling Time Series by Means of Fuzzy Inference Systems

2.1 Predictive Models for Time Series

2.2 Nonparametric Residual Variance Estimation: Delta Test

2.3 Methodology Framework for Time Series Prediction with Fuzzy Inference Systems

2.3.1 Variable Selection

2.3.2 System Identification and Tuning

2.3.3 Complexity Selection

2.4 Case Study and Validation: ESTSP’07 Competition Dataset

2.5 Experimental Results

2.5.1 Poland Electricity Benchmark

2.5.2 Sunspot Numbers

2.5.3 Aggregated Incoming Traffic in the Internet2 Backbone Network

2.5.4 Santa Fe Time Series Competition: Laser Dataset

2.5.5 Mackey-Glass Series

2.5.6 NN3 Competition

2.5.7 Discussion

References

3 Predictive Models of Network Traffic Load

3.1 Models for Network Traffic Load

3.2 Analysis of Traffic Traces

3.3 Series of the Internet Traffic Archive

3.3.1 LBL Traces

3.3.2 Bellcore Traces

3.3.3 DEC Traces

3.4 Application to Recent Traffic Time Series

3.4.1 Backbone Traffic

3.4.2 Exchange and Peering Traffic

3.4.3 Intercontinental Traffic

3.4.4 Access Point Traffic

3.4.5 Wireless Traffic

3.5 Discussion

References

4 Summarization and Analysis of Network Traffic Flow Records

4.1 Network Traffic Measurement Systems

4.2 Flow Measurement and Statistics: NetFlow and IPFIX

4.3 Linguistic Summaries

4.4 Definition of Linguistic Summaries of Network Flow Collections

4.4.1 Defining Linguistic Labels from a Priori Knowledge

4.4.2 Automatic Definition of Linguistic Labels by Unsupervised Learning

4.4.3 Quantifiers

4.5 Summarization of NetFlow Collections

4.5.1 On-Line Summarization of NetFlow Collections

4.5.2 Data Mining Summaries of NetFlow Collections

4.5.3 Experimental Results

4.5.4 Predefined Set of Summaries

4.5.5 Identifying Attribute Labels by Clustering

4.5.6 Mining Association Rules for Extracting Linguistic Summaries

4.5.7 Discussion

5 Inference Systems for Network Traffic Control

5.1 Network Traffic Control

5.2 Simulation Scenarios

5.3 Fuzzy End-To-End Rate Control for Internet Transport Protocols

5.3.1 Related Work

5.3.2 End-To-End Window Based Rate Control and a Fuzzy Generalization

5.3.3 Design of a Fuzzy End-To-End Window Based Rate Controler

5.3.4 Development Methodology and Tool Chain

5.3.5 Simulation Results

5.3.6 Implementation Results

5.3.7 Discussion

5.4 Active Queue Management by Means of Fuzzy Inference Systems

5.4.1 Approach and Related Work

5.4.2 Development Methodology and Tool Chain

5.4.3 Fuzzy Internet Traffic Control of Aggregate Traffic

5.4.4 Fuzzy Controler of Best-Effort Aggregate Traffic

5.4.5 Simulation Results

5.4.6 Implementation Results

5.4.7 Discussion

6 Open FPGA-Based Development Platform for Fuzzy Inference Systems

6.1 Fuzzy Inference Systems for High-Performance Networks

6.2 Routing Architectures

6.2.1 High-End Routing Hardware

6.2.2 Expected Evolution

6.2.3 Architectures and Platforms for Research

6.3 Inference Rate of Software Implementations

6.4 Hardware Implementation of Fuzzy Inference Systems

6.5 Development Platform for Fuzzy Inference Systems with Applications to Networking

6.5.1 Development Methodology and Design Flow

6.5.2 Application to Internet Traffic Analysis and Control

6.6 Computational Intelligence Based Processing Subsystems in Routing Architectures

Index

ACK Acknowledgment

AF Assured Forwarding

AQM Active Queue Management

LS-SVM Least Squares Support Vector Machines

AR Autoregression, Autoregressive model

ARX Autoregressive model with eXogenous inputs

ARMA Autoregression with Moving Average

ARIMA Autoregression with Integrated Moving Average

ASIC Application Specific Integrated Circuit

ATM Asynchronous Transfer Mode

BGP Border Gateway Protocol

BTC Bulk Transfer Capacity

CAIDA Cooperative Association for Internet Data Analysis

CBQ Class Based Queuing

CBR Constant Bit Rate

CoS Class of Service

DCCP Datagram Congestion Control Protocol

DNS Domain Name System

DS Differentiated Services

EF Expedited Forwarding

ELM Extreme Learning Machine

ECN Explicit Congestion Notification

FCFS First-Come First-Served

FIFO First-In First-Out

FIM Fuzzy Inference Module

FPGA Field Programmable Gate Array

FPI Fuzzy Proportional Integral

FTP File Transfer Protocol

HTTP HyperText Transfer Protocol

IETF Internet Engineering Task Force

IOB Input/Output Block

IP Internet Protocol

IPPM IP Performance Metrics

IRTF Internet Research Task Force

ISP Internet Service Provider

ITU-T International Telecommunication Union, Telecommunication Standardization Sector

IXP Internet eXchange Processor

LRD Long-Range Dependence

LUT Look-Up Table

MAC Medium Access Control

MF Membership Function

MPLS Multi Protocol Label Switching

NARX Nonlinear autoregressive model with eXogenous inputs

NCL Network Classification Language

NP Network Processor

NPU Network Processing Unit

NTP Network Time Protocol

OPB On-Chip Peripheral Bus

OSPF Open Shortest-Path First

OWA Ordered Weighted Average

PI Proportional Integral

QoS Quality of Service

RED Random Early Detection

RFC Request For Comments

RIO RED In/Out

RSVP Resource ReSerVation Protocol

RTP Real-Time Transport Protocol

RTT Round-Trip Time

SACK Selective Acknowledgment

SAPE Symmetric Absolute Percentage Error

SMAPE Symmetric Mean Absolute Percentage Error

SCTP Stream Control Transmission Protocol

SLA Service Level Agreement

SoC System-on-a-Chip

SoPC System-on-Programmable-Chip

SVM Support Vector Machines

TCAM Ternary Content-Addressable Memory

TM Traffic Management

ToS Type of Service

TCP Transmission Control Protocol

UDP User Datagram Protocol

VBR Variable Bit Rate

VoIP Voice Over IP

VOQ Virtual Output Queuing

Internet Science

Abstract. The structure and behavior of packet switched networks are difficult to model in a way comparable to many natural and artificial systems. Nonetheless, the Internet is an outstanding and challenging case because of its incredibly fast development, unparalleled heterogeneity and the inherent lack of measurement and monitoring mechanisms in its core conception. In short, packet switched networks defy analytical modeling. This chapter is intended to introduce and provide concise descriptions of some of the building blocks of what some authors call Internet Science [21, 104], i.e., the study of laws and patterns in Internet structure. Additional related aspects are discussed as well. We will briefly define and describe the most relevant concepts about Internet performance and measurement that will be used throughout the next chapters.

However, we will not get into details about all the networking concepts this monograph deals with. We refer to [37] for a good overall and in-depth analysis of traffic measurement and performance analysis. There are also a number of research papers that provide good insight into more specific topics. Among these, we highlight [21], where some key mathematical concepts in Internet traffic analysis are discussed.

It is also out of the scope of this monograph to analyze in detail the mathematical aspects of most of the concepts it deals with, and in particular those related to traffic control. For this, we refer the interested reader to [153] and [15]. Some of the most relevant and seminal research papers in this area can also be consulted [134, 132, 129, 171, 71].

1.1 Modeling the Internet

Analyzing and modeling traffic in packet switched computer networks can turn into a daunting task due to the virtually unlimited amount of data. There are both spatial and temporal issues. Considering the spatial dimension, the number of end nodes, routers and switches can be of the order of several thousands even in local area networks [22]. Regarding the temporal dimension, the volume of data is huge even in subnetworks that are medium-sized and low-speed by today's standards: a traffic trace taken during a week on a university gateway in 1995 added up to 89 GB of data corresponding to 439 million packets [24].

F.M. Pouzols et al.: Mining & Control of Network Traffic by Computational Intelligence, pp. 1–51.

springerlink.com © Springer-Verlag Berlin Heidelberg 2011

The complexity of modeling the Internet of today and the foreseeable future can be understood by considering the sustained exponential increase of traffic and nodes observed throughout the years [65] as well as the fast evolution of network protocols and applications. Currently, capturing packet header traces on fast links for a few minutes or hours may produce on the order of hundreds of GB or even several TB of data [38].
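The orders of magnitude quoted above can be checked with simple arithmetic. The sketch below derives the mean packet size and packet rate implied by the 1995 trace figures (assuming decimal gigabytes) and estimates the size of a header-only trace on a fast link. The link rate, utilization, mean packet size and number of bytes kept per packet in the second calculation are hypothetical values chosen for illustration, not measurements from the cited works.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def mean_packet_size(total_bytes, packets):
    """Average number of bytes per packet in a trace."""
    return total_bytes / packets


def packet_rate(packets, seconds):
    """Average number of packets per second over the capture period."""
    return packets / seconds


def header_trace_bytes(link_bps, utilization, mean_packet_bytes, snap_bytes, seconds):
    """Size of a trace that keeps only snap_bytes of every packet on a link."""
    packets_per_second = link_bps * utilization / 8 / mean_packet_bytes
    return packets_per_second * seconds * snap_bytes


if __name__ == "__main__":
    # 1995 gateway trace: 89 GB (taken as 89e9 bytes) and 439 million packets.
    print(f"mean packet size: {mean_packet_size(89e9, 439e6):.0f} bytes")
    print(f"packet rate: {packet_rate(439e6, SECONDS_PER_WEEK):.0f} packets/s")

    # Hypothetical fast link: 10 Gb/s at 50% load, 500-byte packets,
    # 64 bytes of headers kept per packet, one hour of capture.
    size = header_trace_bytes(10e9, 0.5, 500, 64, 3600)
    print(f"one-hour header trace: {size / 1e9:.0f} GB")
```

Under these assumptions a single hour of header capture already reaches hundreds of GB, which is consistent with the range stated above.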

The recent development of high performance hardware for IP packet capture up to 10 Gb/s [47] has made it possible to record traffic traces in backbone nodes of current high-speed networks. However, it is not feasible to use such a huge volume of information for research and operation tasks. Filtering and preprocessing methods are required. Often, data volumes have to be reduced by 12 orders of magnitude, from 10¹² bytes down to a report of 10 lines of text [48]. It is also common to reduce huge volumes of traffic measurement data down to a set of a few graphs and tables [145].

The difficulties in this field are clear if we consider the analysis and modeling of wide area networks and the Internet in particular. In addition, there is a lack of measurement and monitoring mechanisms in the Internet architecture [164], which has been defined in a rather unstructured manner through an aggregation of protocols, technologies and applications developed independently. This architecture, which has been called a cooperative anarchy [123], defies measurement and characterization.

As Willinger and Paxson point out, “it is difficult to think of any other area in the sciences where the available data provide such detailed information about so many different facets of behavior” [170].

In this sense, technologies based on the Simple Network Management Protocol (SNMP) and the concept of network flow have seen a great deal of development and deployment during the last years [37]. Still, many efforts are required to enable macroscopic analysis of the Internet.

During the last decade, some areas, such as switching techniques and topology design, have seen fast development. However, systems and infrastructures for traffic measurement are still in early stages of development and scarcely deployed. The fast evolution and great diversity of the Internet together with the long periods of time required to analyze measurement data have a drastic consequence: experiments and studies based on traffic measurements are already obsolete when finished, and especially when published [32]. Thus, it is hardly feasible to implement measurement and analysis systems that can be used to support other infrastructures.

A number of works in Internet measurement [124, 32] have led to the first results in what some authors call Internet Science [21]: an experimental science that studies laws and patterns in Internet structure [104]. Traditional statistical inference techniques often used to analyze networks are limited. Instead, Internet research requires inference methods for finding law-like relationships that hold across large collections of high-volume data sets and generalize to a wide range of conditions [170]. That is, scientific inference is required in order to unveil traffic invariants. This requires building intuition and physical understanding rather than using conventional black-box descriptions and data fitting techniques.

At first sight, Internet Engineering might seem a more precise term for this area of research, since the current Internet is the result of applying diverse engineering disciplines. However, the issues and questions currently posed require an approach closer to that of the experimental sciences. This area involves theories as well as techniques and infrastructures for measurement, analysis and modeling.

Broadly speaking, three main aspects in Internet measurement, analysis and modeling have to be addressed in order to construct models of the Internet as a whole:

1. Traffic.

2. Topology.

3. Effect of protocols on traffic and topology.

In particular, Internet traffic modeling comprises macroscopic characterization as well as multi-scale modeling. In recent years, many developments have shed some light on traffic dynamics. As a result, long-range dependencies, self-similarity, power-laws and wavelets have been established as common modeling tools. These aspects will be overviewed in the next sections. Often, traffic and topology are analyzed as orthogonal aspects. For instance, the obvious effect of routing protocols on traffic dynamics and congestion episodes is not well understood. In fact, the last research efforts towards an in-depth analysis of these interactions, the so-called traffic-sensitive routing, were abandoned several years ago. The adaptive routing protocols designed were found to be highly unstable [167].

Analysis and data mining of topology related measurements are commonly performed off-line and require cooperation from operators. The objective of these studies is to identify invariants that help understand how topologies evolve. For instance, at the application level, it has been found that two randomly chosen documents on the web are on average 19 clicks away from each other [4].

Research on the overall topology of the Internet has been successful in revealing and validating the so-called jellyfish model: the network is compact, i.e., 99% of pairs of nodes are within 6 hops, there exists a highly connected center, there exists a loose hierarchy, and one-degree nodes are scattered everywhere. In summary, the network has the tendency to be one large connected component. Power laws appear in other settings, such as WWW pages and peer-to-peer networks. In short, the topology of the Internet is described by power-laws, its growth is slowing down (following a sigmoid curve), it is compact, becomes denser with time, and looks like a jellyfish [49, 101].

Major advances in Internet modeling include the identification of self-similarity and long-range dependencies in traffic as well as the use of power-laws to describe the global topology of the Internet. But many issues are still open: spatio-temporal correlations, interest and group behavior, anomaly detection, etc. From the data mining viewpoint, there are many modeling challenges, including massive multidimensional data, time-space correlations, and case dependent phenomena.


1.2 Measurement Systems and Infrastructures

Network performance depends on and can be measured in terms of a number of parameters such as capacity, available bandwidth, delay, jitter, packet loss and packet disorder. These and other network parameters are related in a complex manner and to a varying extent. Measuring the network is crucial to understanding the Internet behavior and designing control mechanisms for improving performance.

Unfortunately, the original Internet architecture has little or no support for measurement. End hosts and their applications have only a limited capability to access and acquire information about the network behavior. To them, end-to-end measurement of the network behavior is usually the only available information.

A number of factors have led to a surge in research on Internet measurement systems and infrastructures in recent years. The outcomes of these research activities have a positive impact in two areas. First, experimental support is provided for a better understanding of network traffic dynamics. Second, the availability of measurement infrastructures enables the development of measurement based traffic control and quality of service mechanisms.

In particular, nodes and protocols in the current Internet provide very little support for performance measurement. In addition, a number of new applications would greatly benefit from dynamic adaptation mechanisms based on network measurement. Also, improved methods and tools for network performance monitoring and troubleshooting are sought.

In fact, besides the development of novel techniques and tools within current architectures, firm proposals have been made [164] towards introducing modifications in network layer protocols as well as switching and routing equipment so that better support for measurement tasks is available in basic infrastructures.

In order to study the dynamics of Internet traffic both on-line and off-line techniques are required. These techniques and the infrastructures that support them are usually based on counting interesting events such as sessions, connections, arrivals of packets or cells to a node for a given period of time.

Current measurement systems [37, 124, 131] can be classified into two main types: active and passive. The former are of a distributed nature and are usually accessible to end users and applications. The latter are centralized and often restricted to network operators and engineers. The current challenges in this area are to increase the maturity of these systems, to deploy measurement infrastructures and to enable generalized macroscopic analysis of the Internet.

1.2.1 Active Systems

Active measurement systems work by sending probe traffic from an end node in order to measure parameters such as round-trip time and packet loss percentage [118, 124, 136]. Active measurement tools inject probe packets into the network and analyze the response. Following a particular network model, some characteristics are estimated, such as propagation delay and a number of metrics related to bandwidth.

Active measurement tools can not only provide network operators with useful information on network characteristics and performance, but can also enable end users (and user applications) to perform independent network auditing, load balancing, and server selection tasks, among many others, without requiring access to network elements or administrative resources.

The research community is developing a set of metrics and techniques for active bandwidth measurement, including concise reporting to users [146]. Many of them [136] are well understood and can provide accurate estimates under certain conditions.

Some institutions are currently undertaking initiatives to deploy test platforms for active and passive bandwidth estimation as well as other related techniques. Also, some partial measurement and evaluation studies of bandwidth estimation tools have been published [147, 116, 86, 158].

The models underlying active systems often rely on a large number of parameters difficult to model in an independent manner. As a consequence, these systems suffer from errors and accuracy limitations in measurements and estimations, especially regarding timing accuracy in general purpose platforms [95, 2].

The network model chosen for designing an active measurement tool has a determining impact on the applicability and performance of the tool. Thus, research on active measurement tools [95, 160, 5], and especially on those that estimate bandwidth related metrics by probing the network [86, 46], has been very active in recent years. This area has made important contributions to the understanding of network traffic dynamics, particularly in the case of the behavior of aggregated flows in router queues.

The first attempt at using bandwidth estimates for application adaptation purposes reported in the literature can be traced back to 1996, when BPROBE/CPROBE were introduced as tools for server selection tasks. Soon afterwards, in 1997, pathchar was introduced as a per-hop network capacity estimation tool.

For about a decade, a number of bandwidth estimation methods and tools have been developed. These tools show a wide spectrum of requirements and characteristics, such as accuracy and intrusiveness. Underlying models, metrics definitions, terminologies as well as measurement and processing methodologies also differ.

A number of techniques for estimating bandwidth capacity and available capacity have been developed: variable packet size (VPS), packet pairs, packet trains, packet tailgating, ALBP (Asymmetric Link Bandwidth Probing), self-loading streams, to name a few. Implementations of these techniques can be found in a number of tools [86, 46, 116]. The performance of each technique usually provides insights on how the network reacts to a certain traffic pattern. Note that some tools also estimate parameters related to bandwidth, such as the ADR (asymptotic dispersion rate). The tool thrulay [146] further elaborates on the same idea and combines application level measurement of available bandwidth capacity and round-trip time.
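As an illustration of the packet pair technique mentioned above, the following minimal sketch assumes the usual idealization: two probe packets of equal size sent back to back leave the bottleneck link separated by the time needed to transmit one packet, so capacity is approximately packet size divided by that dispersion. The function name, the sample dispersions and the use of the median to discount samples distorted by cross traffic are assumptions made for this example, not the procedure of any particular tool cited here.

```python
import statistics


def packet_pair_capacity(packet_size_bytes, dispersions_s):
    """Estimate bottleneck capacity in bit/s from packet-pair dispersions.

    Each dispersion is the spacing (in seconds) observed at the receiver
    between the two packets of a pair.
    """
    samples = [8 * packet_size_bytes / d for d in dispersions_s if d > 0]
    if not samples:
        raise ValueError("no usable dispersion measurements")
    # Assumption: the median is used as a simple robust summary; real tools
    # apply more elaborate filtering of the per-pair estimates.
    return statistics.median(samples)


# Hypothetical dispersions (seconds) measured for 1500-byte probe pairs.
dispersions = [0.00120, 0.00121, 0.00119, 0.00240, 0.00120]
capacity = packet_pair_capacity(1500, dispersions)
print(f"estimated capacity: {capacity / 1e6:.1f} Mbit/s")
```

In this made-up data set the fourth pair was spread out, as cross traffic queued between the two probes would do, and the median keeps that sample from dominating the estimate.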


1.2.2 Passive Systems

Passive measurement systems are based on recording data at a network node, i.e., no probe packets are sent. While passive systems do not require cooperation or coordination among end nodes, the quality and relevance of data depend decisively on the location of the measurement point. Thus, cooperation between network operators [118, 32] is a prerequisite of passive measurement infrastructures.

Passive systems are a field for the application of analysis and interpretation techniques for large volumes of data where measurements are often missing and inaccurate. These systems run in network nodes, particularly in routers, gathering data usually through sampling procedures applied to traffic as it traverses the network in real-time. These measurements are usually transferred to collection points following standards such as SNMP and NetFlow. The NetFlow technology is further discussed in chapter 4, where a novel method for summarizing network flow collections is described.

Passive systems enable global analysis of subnetworks at the infrastructure level. They make it possible to detect the emergence and growth of new applications, protocols and related traffic patterns. Some of the main current areas of research in traffic analysis based on passive measurement systems can be listed as follows:

• Analysis of the interactions between macroscopic traffic dynamics and routing algorithms. In particular, the analysis of routing tables in the BGP protocol [138, 139, 161] is key for understanding traffic flows between service providers and autonomous systems.

• Analysis of the distribution of traffic over the address space (both IPv4 and IPv6). This is a requirement for building maps of the address space assigned to institutions and service providers as well as the set of addresses that can be globally accessed.

• Analysis of the dynamic characteristics linked to protocols, applications and technologies. This area becomes more and more important as different novel services are deployed on the Internet.

• Development of tools and hardware support for traffic measurement and analysis [47, 43, 81].

• Privacy and security related procedures and techniques, including anonymization of network traces.

1.2.3 Publicly Available Measurements

Traces are one of the main outcomes of measurement infrastructures. The use of common traces recorded by both active and passive measurement infrastructures is key to reproducible research and to the comparison of results in general. Traces may comprise data about topology, traffic, specific applications and a variety of heterogeneous measurements.


In this sense, the recent availability of traffic traces from high-speed networks, especially at OC48 and OC192 speeds, requires a great deal of effort and cooperation among different agents. Cooperative measurement projects and infrastructures also allow for wide scale analysis of networks.

A remarkable initiative in this context is the Day in the Life of the Internet series of events held in 2007 and 2008, which brought together institutions from several continents in order to record continuous traffic traces in a coordinated manner for a considerably long period of time, spanning more than 50 hours in some cases.

In this monograph we will use a wide set of publicly available network traffic traces obtained through passive monitoring. These traces are usually made of a sequence of packet headers (possibly including part or all the payload as well).

Some other traces only provide a restricted set of data about each received packet, in particular the arrival time and size, as well as some other especially relevant data such as TCP flags. In chapters 3 and 4 we will analyze traffic traces from two perspectives. First, time series models for traffic load as derived from these traces are designed. Then, a method for summarizing flow collections derived from these traces is described.
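To make the step from a packet trace to a traffic load time series concrete, the following minimal sketch assumes a trace reduced to (arrival time, size) records and sums the bytes that arrive in consecutive bins of fixed width. The function name, the record layout and the sample values are hypothetical and do not correspond to the format of any specific public trace.

```python
def load_time_series(packets, bin_width):
    """Aggregate (arrival_time_s, size_bytes) records into bytes per time bin.

    Bins are consecutive intervals of length bin_width starting at the
    arrival time of the earliest packet.
    """
    if not packets:
        return []
    start = min(t for t, _ in packets)
    end = max(t for t, _ in packets)
    n_bins = int((end - start) / bin_width) + 1
    load = [0] * n_bins
    for t, size in packets:
        # Clamp the index to guard against floating-point rounding at the edge.
        index = min(int((t - start) / bin_width), n_bins - 1)
        load[index] += size
    return load


# Hypothetical trace: arrival time in seconds and packet size in bytes.
trace = [(0.00, 1500), (0.02, 40), (0.11, 1500), (0.35, 576), (0.36, 1500)]
load = load_time_series(trace, bin_width=0.1)
print(load)
```

Choosing the bin width fixes the time scale at which the load is observed, which matters for the multi-scale behavior discussed in the following sections.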

Some traces have historical relevance, such as the Bellcore traces and the traces taken at the Lawrence Berkeley National Laboratory. The former were the empirical basis for finding self-similarity and long-range dependence in Ethernet traffic [69, 106], whereas the latter were instrumental in showing that the Poisson model fails to capture the general behavior of traffic in wide area networks [134]. It is interesting to note that the limitations of the Poisson model in the communications field, though often overlooked and usually not dealt with in the literature, had been well known to practitioners for more than two decades.

1.3 Network Traffic

The problem of modeling Internet traffic is both interesting in its own right and useful for a variety of applications, including congestion control and protocol design. It is out of the scope of this monograph to review all the proposed descriptive and predictive approaches to modeling Internet traffic. For an in-depth and exhaustive overview we refer the interested reader to a general book on traffic measurement [37] as well as a number of research papers on the topic [71, 36, 140, 141, 128, 41, 109]. In this section, we overview some of the most relevant, often antagonistic, models for network traffic with the focus on those models that can shed some light on the modeling of network traffic from a time series modeling point of view.

Network traffic can be analyzed either from the perspective of the network and transport layers and the impact of generic metrics on the performance perceived by users [118], or from application specific viewpoints, such as Web traffic [120], peer-to-peer traffic [119] and multimedia traffic [121]. Here we will discuss the most important issues in modeling network traffic, network performance metrics and the concept of congestion in a general manner.


1.3.1 Traffic Models

Data obtained by measurement systems are usually processed using statistical tools in order to obtain as much information as possible [162]. For example, in the case of a video or audio application network flow, packets can be distributed over time following an exponential, subexponential or light-tailed distribution [132, 134]. This process leads to the extraction of empirically derived analytic models of traffic [129] and helps identify invariants.

The natural step after network measurements are gathered is to analyze them and run simulations [65]. Network measurement enables analysis of data as well as realistic simulation of networks. By identifying and reproducing invariants in network traffic in simulation scenarios, a better understanding of how these invariants impact traffic dynamics can be obtained.

Describing traffic properties for supporting analysis and simulation tasks requires simple models that capture different levels of abstraction and time scales. That is, simulation systems involve different levels of detail, represented by application sessions, connections, transfers, packets, etc. In an analogous manner, simulations can be run with different levels of detail, ranging from analytical models to more detailed behavioral simulation at the session and packet levels.

Let us now overview some of the traffic models that have been applied to and developed for packet switched networks. Teletraffic theory originally embraced all the mathematics applied to the design, control and management of the public switched telephone network (PSTN). Techniques belonging to the fields of queuing theory, statistical inference, performance analysis, mathematical modeling and optimization were used to lay out teletraffic theory. The natural step with the advent of the Internet was to extend this theory in order to include data networks. This way, Internet engineering (encompassing the design, control, operation and management of the global Internet) would become part of teletraffic theory. However, Internet practitioners have emphasized engineering and experimental deployment rather than rigorous mathematical modeling and application of theories. In fact, some in the Internet community would say that the Internet works because “it ignored mathematics -in particular, teletraffic theory-” [170].

Teletraffic theory has been remarkably successful in the case of the PSTN. The conventional PSTN is, however, a highly static environment where the notion of limited variability is well-defined and ever-present. Typical users, generic behavior and averages are proper descriptions of the overall system performance. In addition, the most widely used models are especially practical from an engineering viewpoint. These models are parsimonious and, additionally, the few required parameters can be easily estimated in practice.

These factors led to the belief that a universal law in voice networks established the Poisson nature of call arrivals for aggregated traffic. According to this assumption, call arrivals are mutually independent and the interarrival times are exponentially distributed. Poisson models were the first models widely applied to communications traffic.


The application of Poisson models dates back to the early telephone networks and the pioneering works by Erlang and others. In general, a Poisson process is characterized as a renewal process with interarrival times $A_n$ exponentially distributed with rate parameter $\lambda$. If $X = (X_t : t \geq 1)$ is the number of arrivals in successive, non-overlapping time intervals of length $\Delta t > 0$, then $X$ is the increment process of a Poisson process with parameter $\lambda$ if and only if the random variables $X_t$ are i.i.d. with:

$$P[X_t = n] = e^{-\lambda \Delta t} \frac{(\lambda \Delta t)^n}{n!}$$

In this formulation, a Poisson process is described as a counting process where the numbers of arrivals in different intervals are statistically independent.
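The two descriptions of a Poisson process given above, exponential interarrival times and independent Poisson counts, can be checked against each other numerically. The following minimal sketch generates arrivals from exponential interarrival times and counts them in successive intervals; for a Poisson process both the mean and the variance of the counts should be close to the rate times the interval length. The function names, the parameter values and the seed are assumptions chosen for illustration.

```python
import random


def poisson_counts(rate, dt, n_intervals, seed=0):
    """Count arrivals of a Poisson process in successive intervals of length dt.

    Arrivals are generated from i.i.d. exponential interarrival times with
    the given rate (arrivals per unit time).
    """
    rng = random.Random(seed)
    counts = [0] * n_intervals
    horizon = dt * n_intervals
    t = rng.expovariate(rate)  # time of the first arrival
    while t < horizon:
        # Clamp the index to guard against floating-point rounding at the edge.
        counts[min(int(t / dt), n_intervals - 1)] += 1
        t += rng.expovariate(rate)
    return counts


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var


counts = poisson_counts(rate=5.0, dt=2.0, n_intervals=20000)
mean, var = mean_and_variance(counts)
print(f"rate * dt = {5.0 * 2.0:.2f}, sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

A variance clearly larger than the mean at the same interval length would be one simple indication that measured traffic departs from the Poisson assumption.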

The so-called Poisson law has been widely accepted for several decades. The same applies to the following laws: call durations follow an approximately exponential distribution, there is a high predictability in growth rates, network control and operation are fully centralized (so information about the global state of the network is available), and services are strictly monitored and regulated. However, the high stability of telephone networks was compromised by the advent of fax in the 1980s. This was due to the fundamentally different statistical properties of fax transmissions. With the popularization of TCP/IP networks and the WWW, teletraffic theory was no longer able to cope with data transmissions in a satisfactory manner.

Still, the first formal models proposed for Internet traffic were based on traditional teletraffic theory [134]. However, in the Internet, the engineering reality overcomes traditional teletraffic analytical modeling. Since self-similarity and long-range dependencies were first formally identified in data traffic [106], a number of studies have shown extensive evidence of the failure of Poisson models in the Internet. Poisson models have thus been rejected for characterizing packet arrival processes in the Internet [128, 134] at different levels of aggregation (ranging from local area networks to backbones).

The relevant mathematics for the PSTN deals with limited variability in both time and space, i.e., traffic processes are either independent or have exponentially decaying temporal correlations, and the distributions of traffic related properties have exponentially decaying tails.

In contrast, the mathematics relevant to packet switched networks has to deal with extreme variability. In many cases, very bursty (or fractal-like) behavior can be identified in network traffic load over a wide range of time scales, from milliseconds to tens of seconds and beyond, i.e., traffic is self-similar [128].

More formally, a discrete-time, covariance-stationary, zero-mean stochastic process $X = (X_t : t \geq 1)$ is exactly self-similar or fractal with scaling (Hurst) parameter $H \in [0.5, 1)$ if, for all levels of aggregation $m \geq 1$,

$$X^{(m)} = m^{H-1} X,$$

where the equality should be understood in the sense of finite-dimensional distributions. The aggregated processes $X^{(m)} = (X^{(m)}_k : k \geq 1)$ are defined as follows:

$$X^{(m)}_k = m^{-1} \left( X_{(k-1)m+1} + \dots + X_{km} \right), \quad k \geq 1.$$

For this kind of process, it is easy to show that the following relationship holds, where $k$ denotes a positive constant:

$$\mathrm{var}(X^{(m)}) = k\, m^{2H-2}.$$

That is, there is a relationship between a quantity $Q$ of the underlying process, traffic load, and the resolution $\tau$ (here, the aggregation level $m$) that follows:

$$Q(\tau) \approx k_f\, \tau^{f(D)},$$

where $f(\cdot)$ is a simple function of $D$, and $D$ is a fractal dimension. Thus, such processes are fractal.

In addition, the resulting linear log-log plot of $\mathrm{var}(X^{(m)})$ versus $m$ is the so-called variance-time plot, which is one of the methods commonly applied to identify the Hurst parameter of traffic time series.
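The variance-time method can be sketched in a few lines. The example below aggregates a series over non-overlapping blocks of size m, fits a least-squares line to the logarithm of the variance of the aggregated series against the logarithm of m, and uses the fact that the slope equals 2H - 2 for a self-similar process. The function names and the choice of aggregation levels are assumptions for illustration, and the input is synthetic uncorrelated noise, for which an estimate near H = 0.5 is expected.

```python
import math
import random


def aggregate(series, m):
    """Average the series over non-overlapping blocks of size m."""
    n_blocks = len(series) // m
    return [sum(series[k * m:(k + 1) * m]) / m for k in range(n_blocks)]


def variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def hurst_variance_time(series, levels):
    """Estimate H from the slope of log var(X^(m)) versus log m.

    For a self-similar process the slope is 2H - 2, so H = 1 + slope / 2.
    """
    xs = [math.log(m) for m in levels]
    ys = [math.log(variance(aggregate(series, m))) for m in levels]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return 1.0 + slope / 2.0


# Synthetic i.i.d. Gaussian noise: no long-range dependence is present.
rng = random.Random(1)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
h = hurst_variance_time(white_noise, levels=[1, 2, 4, 8, 16, 32, 64, 128])
print(f"estimated H for i.i.d. noise: {h:.2f}")
```

Applied to a measured traffic load series, an estimate well above 0.5 would be consistent with the self-similar behavior described here, although the method is known to be only a rough, graphical estimator.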

Much evidence suggests that traffic in packet switched networks is self-similar and fractal in nature. A plausible explanation is that self-similarity is a consequence of the power-law distribution of different types of traffic workload, such as flow durations, web transfers, file sizes and even the way users interact with networked applications [128, 36, 37].

The heavy-tailed property exhibited by the distribution of flow sizes and durations is an invariant for an aggregate property of flows. It does not provide any information on the packet-level behavior of traffic sources. However, direct links between connection sizes and durations with infinite variance and fractal scaling in aggregate network traffic have been mathematically proven. Thus, this invariant has been key in finding a physical explanation of the observed fractal nature of aggregate traffic. A heavy-tailed distribution is defined as follows:

$$P[X > x] \propto x^{-\alpha},$$

as $x \to \infty$, with $0 < \alpha < 2$. The fact that this kind of distribution governs different traffic workloads can be explained in a generic manner by Zipf’s law [128, 36].
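As a small illustration of the heavy-tailed definition above, the following sketch draws values from a Pareto distribution, whose tail is exactly of the form $x^{-\alpha}$, and then recovers the tail index from the largest observations. The Hill estimator is used here only as one well-known way of estimating $\alpha$; it is not taken from the text, and the function names, parameter values and seed are assumptions for this example.

```python
import math
import random


def pareto_sample(alpha, x_min, n, seed=0):
    """Draw n values with P[X > x] = (x_min / x)**alpha for x >= x_min."""
    rng = random.Random(seed)
    # Inverse transform sampling: 1 - U lies in (0, 1], so the power is defined.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def hill_estimate(values, k):
    """Hill estimator of the tail index based on the k largest values."""
    ordered = sorted(values, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value
    return k / sum(math.log(v / threshold) for v in ordered[:k])


# Hypothetical flow sizes with tail index 1.2 (infinite variance, finite mean).
sizes = pareto_sample(alpha=1.2, x_min=1.0, n=50_000)
alpha_hat = hill_estimate(sizes, k=5_000)
print(f"tail index estimate: {alpha_hat:.2f}")
```

With real flow sizes or durations the estimate depends on how many of the largest values are used, so in practice it is usually examined for a range of k.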

Poisson models cannot cope with high variability at the packet level. However, there is evidence that these models are satisfactory for human interactions with networked applications [36]. That is, the times at which users start interactions with applications conform to a memoryless process with an arrival rate that can be satisfactorily approximated as constant over time intervals of many minutes or perhaps an hour [170]. In addition, some works have shown the usefulness of time-varying Poisson models for small time scales in networks with a high level of traffic aggregation [97, 180, 23]. The argument that network traffic tends to Poisson as the level of aggregation increases is disputed though, as only a few limited studies support it.

In this context, it is currently widely recognized that better theoretical models with more extensive experimental basis are required [10, 64] in order to enable full understanding of the dynamics of Internet traffic.


1.3.2 Transport Layer Models. TCP

Modeling the dynamics of transport layer flows, and TCP flows in particular, is a central problem in Internet traffic research. Applications of predictive performance models range from peer-to-peer and content distribution networks (CDN) to grid computing. Most traffic in the current Internet, in terms of flows, packets and octets, is due to TCP connections [114]. Models for TCP dynamics have been developed following either of two approaches known as model based and equation based [162, 169].

Modeling TCP performance also has deep implications for transport protocol design. Preventing congestion collapse in the Internet and guaranteeing fairness at least in a TCP-compatible manner are two key aspects that should be addressed when developing new standard transport protocols [58]. In a similar way, TCP models have a significant impact on the design of active queue management and mechanisms for differentiated quality of service provisioning. Additional implications of TCP models include the definition of a meaningful set of evaluation scenarios and conditions for transport protocols [8, 7].

Some simple equation based models [169] point out the dramatic effect of packet loss on the performance of TCP. These models establish the relationship between the transfer rate of a TCP flow, T , and the packet loss rate, p, as follows:

$$T \propto \frac{1}{\sqrt{p}}.$$

Further elaborating on the same simple model, a basic formulation of the expected average TCP transfer rate can be established as follows [78, 112]:

$$E[T(s, t_{RTT}, p)] = \frac{s}{t_{RTT} \sqrt{\frac{2Dp}{3}}},$$

where $s$ is the maximum segment size, $t_{RTT}$ is the round trip time, $p$ is the packet loss rate, and $D$ denotes the number of data units (TCP segments) acknowledged by each ACK packet. The $t_{RTT}$ of a TCP connection between a sender and a receiver is defined as the time elapsed from the instant a packet is sent by the source to the instant the corresponding ACK from the receiver is received by the source.
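A quick numerical evaluation helps to appreciate how strongly packet loss limits the rate predicted by this simple model. The sketch below implements the expression directly; the function name and the parameter values (a 1460-byte segment, a 100 ms round trip time, 1% loss and one segment acknowledged per ACK) are assumptions chosen for illustration.

```python
import math


def tcp_rate_simple(s, t_rtt, p, d=1):
    """Expected TCP transfer rate s / (t_rtt * sqrt(2 * d * p / 3)).

    s is the segment size in bytes, t_rtt the round trip time in seconds,
    p the packet loss rate and d the number of segments acknowledged per ACK.
    The result is in bytes per second.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("loss rate p must be in (0, 1]")
    return s / (t_rtt * math.sqrt(2.0 * d * p / 3.0))


# Hypothetical path: 1460-byte segments, 100 ms round trip time, 1% loss.
rate = tcp_rate_simple(s=1460, t_rtt=0.1, p=0.01)
print(f"{rate * 8 / 1e6:.2f} Mbit/s")
```

Under these assumed values the model yields roughly 1.4 Mbit/s, and because of the inverse square root dependence a fourfold increase in the loss rate halves the predicted rate.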

However, obtaining equations for modeling and predicting the stationary behavior of TCP in a general manner is a complex problem. A number of solutions have been proposed. To date, the most complete model that has been extensively evaluated through experimentation [75, 168] defines the following equation for the TCP-compatible transfer rate:

$$E[T(s, t_{RTT}, p, t_{RTO})] = \min\left( \frac{s W_m}{t_{RTT}},\; \frac{s}{t_{RTT} \sqrt{\frac{2Dp}{3}} + t_{RTO} \min\left(1, 3\sqrt{\frac{3Dp}{8}}\right) p \left(1 + 32p^2\right)} \right),$$


where $W_m$ is the maximum size of the TCP congestion window and $t_{RTO}$ is the packet retransmission timeout of the TCP protocol under the particular conditions given by the parameters of the equation. Note that the application of this model requires the sender to know the parameters of the equation. Thus, it is necessary that TCP receivers provide the required information. As a particular case, when there is no packet loss, the expected rate is given by the ratio $W_m s / t_{RTT}$.
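The more complete model can be evaluated in the same way as the simple one. The following minimal sketch computes the minimum of the window-limited rate and the loss-limited rate, including the retransmission timeout term, and treats the loss-free case separately. The function name and the parameter values are assumptions for illustration; in particular the 400 ms timeout and the 64-segment maximum window are arbitrary choices, not values taken from the cited works.

```python
import math


def tcp_rate_full(s, t_rtt, p, t_rto, w_max, d=1):
    """Expected TCP-compatible rate in bytes per second.

    s: segment size (bytes), t_rtt: round trip time (s), p: loss rate,
    t_rto: retransmission timeout (s), w_max: maximum congestion window
    (segments), d: segments acknowledged per ACK.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("loss rate p must be in [0, 1]")
    window_limited = w_max * s / t_rtt
    if p == 0.0:
        # Without losses only the maximum window limits the rate.
        return window_limited
    timeout_term = (t_rto * min(1.0, 3.0 * math.sqrt(3.0 * d * p / 8.0))
                    * p * (1.0 + 32.0 * p * p))
    loss_limited = s / (t_rtt * math.sqrt(2.0 * d * p / 3.0) + timeout_term)
    return min(window_limited, loss_limited)


# Hypothetical path: 1460-byte segments, 100 ms RTT, 1% loss, 400 ms timeout.
rate = tcp_rate_full(s=1460, t_rtt=0.1, p=0.01, t_rto=0.4, w_max=64)
print(f"{rate * 8 / 1e6:.2f} Mbit/s")
```

Because the timeout term only adds to the denominator, the rate obtained this way never exceeds the one given by the simpler square root expression for the same parameters.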

The model above can be extended to multicast networks [168], which requires the definition of a variety of feedback mechanisms so that senders are informed about network conditions at the reception points.

Nonetheless, accurate modeling of TCP is an increasingly complex problem due to the many variants proposed throughout the years [156, 133, 92] and the intricate evolution of the standard variants [98, 93, 20, 155]. Therefore, there are many open issues in the design of TCP variants that can cope with technological and architectural changes in the Internet [10]. Some TCP variants recently proposed will be overviewed in chapter 5, where a new approach to end-to-end congestion control is described based on fuzzy logic.

1.3.3 Models of Applications and Services

In recent years, the diversity of network conditions and traffic patterns that can be found in the Internet has been progressively increasing [10, 64]. Thus, the development of schemes for generating flexible aggregate flows and topologies is key for modeling applications and services.

Characterizing the dynamics of specific types of traffic linked to particular services and applications is key for providing proper definitions of the quality of service requirements of current and foreseeable network applications. In addition, the definition of traffic models for different applications is crucial not only for characterization purposes but also to enable the development of realistic simulation and emulation environments.

In particular, extensive studies have addressed traffic patterns for widespread applications, such as web [120], bulk transfers by FTP and similar protocols [26], peer-to-peer applications [119], and voice and video applications [121, 143].

1.3.4 Network Simulation

Simulation of network scenarios can help overcome the limitations of measurement and experimentation. In particular, simulation models make it possible to explore new protocols, environments and architectures. Simulation also makes it possible to explore complex scenarios that would otherwise be difficult or impossible to analyze.

Nonetheless, there does not exist a complete suite of simulation scenarios that can be deemed as sufficient to demonstrate that a new protocol or mechanism will perform properly in the future evolving Internet. Instead, simulations are limited to exploring specific aspects of new proposals or the behavior of the Internet, as well as advancing the understanding of traffic dynamics. The role of network simulation is thus to explore scenarios in order to build understanding of dynamics, to illustrate a point, or to explore for unexpected behavior [65]. Simulations however can be misleading when used for producing quantitative performance comparisons.

In particular, network simulation is the most appropriate method for addressing many of the open issues in traffic dynamics, especially the complex interactions between topologies and traffic, as well as the central role of adaptive congestion control.

Simulating the behavior of the global Internet or a significant part of it is an immense challenge. This is due to its heterogeneity and fast evolution. Experience shows that techniques that were studied using partial models were not implemented eventually because of doubts about their limitations [64]. Thus, the variety of scenarios and conditions taken into consideration for the simulation and evaluation of new systems is a key factor for their eventual acceptance.

A sound network model for simulation comprises all the aspects that can have an impact on a simulation or experiment. These include the topology, traffic generation patterns, behavior of protocols at every layer of the protocol stack and queue control mechanisms, among many other possible factors. In general, it is useful to lay out simulations in such a way that invariants can be identified by exploring the simulation parameter space [65].

However, many research works rely on simulations with assumptions that are not experimentally proven. These include long-lived and large flows, simple topologies with often only one congested link, small range of round-trip times for the simulated flows, most traffic flowing in a single direction through the congested link and negligible amount of reverse traffic.

Instead, the use of a number of well known invariants can help design realistic simulation scenarios. These invariants include diurnal patterns of activity, self-similarity in packet arrival processes, Poisson session arrivals, log-normal connection sizes, heavy-tail distributions and topological invariants of the global Internet derived from the Earth’s geography and the distribution of human population [65].
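Two of the invariants listed above, Poisson session arrivals and log-normal connection sizes, are straightforward to reproduce in a workload generator. The following minimal sketch produces a list of (start time, size) pairs under those two assumptions only; the function name and all numeric parameters (arrival rate, duration and the log-normal parameters) are hypothetical and are not fitted to any measurement.

```python
import random


def synthetic_sessions(rate, duration, mu, sigma, seed=0):
    """Generate (start_time_s, size_bytes) pairs for a synthetic workload.

    Session start times follow a Poisson process with the given rate
    (sessions per second); sizes are drawn from a log-normal distribution
    whose underlying normal has mean mu and standard deviation sigma.
    """
    rng = random.Random(seed)
    sessions = []
    t = rng.expovariate(rate)  # start time of the first session
    while t < duration:
        sessions.append((t, rng.lognormvariate(mu, sigma)))
        t += rng.expovariate(rate)
    return sessions


# Hypothetical scenario: 2 sessions per second on average during one hour.
sessions = synthetic_sessions(rate=2.0, duration=3600.0, mu=9.0, sigma=2.0)
total_bytes = sum(size for _, size in sessions)
print(f"{len(sessions)} sessions, {total_bytes / 1e6:.1f} MB in total")
```

A realistic scenario would combine such a generator with the other invariants, for instance by letting the arrival rate follow a diurnal pattern.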

A large number of techniques and methodologies for network simulation have been proposed and applied throughout the years, and further research is being carried out. These techniques and methodologies include discrete event, web-based and agent-based simulation schemes, Petri nets, fluid-flow based simulation, specific languages for simulation and overlay networks, among many others. In particular, the use of advanced simulation tools, such as ns-2 [88], SSF Net [154] and OMNeT++ [163], and emulators, such as Netbed/Emulab [68], Planetlab [135], NIST Net [25], iproute2 [80] and dummynet [142], to name only a few, is key for addressing the aforementioned problems [65]. In chapters 5 and 6 we will describe how we have used some simulation and emulation environments in order to test new traffic control mechanisms.

1.3.5 Performance Metrics

In order to assess the performance and reliability of networks, a set of parameters is usually measured or indirectly estimated from measurements. When these parameters are unambiguously specified, whether qualitatively or quantitatively, they are identified as performance metrics.

Even though the definition of these parameters can be unambiguous, there may not be clear procedures for their effective measurement. Thus, measuring some of the most common network performance metrics, such as connectivity, delay, loss pattern and reordering pattern, poses different practical issues. Moreover, a certain parameter that describes network performance, such as packet reordering, may have several associated metrics. Also, the definition of metrics usually depends on the network model under which they are interpreted.

Defining metrics that provide quantitative and unbiased information about network parameters is required in order to develop tools for network quality, performance and reliability evaluation. Currently, the IP Performance Metrics (IPPM) group of the IETF is working together with the T1A1.3, SG 12 and SG 13 groups of the ITU-T towards laying out and standardizing quantitative metrics on data delivery by transport protocols. The objective is to obtain metrics that provide quantitative information about performance without ambiguity. The metrics, considered for both end-to-end paths and subnetworks, can be listed as follows:

• Connectivity.

• One-way delay and loss rate.

• Round-trip delay and loss rate.

• Delay variation.

• Loss pattern.

• Packet reordering.

• Bulk transfer rate.

• Capacity and bandwidth of links.
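To make some of the listed metrics concrete, the following Python sketch computes simplified versions of one-way delay, loss rate, delay variation and packet reordering from send and receive timestamps. The function name path_metrics, the record layout and the specific definitions used here (for instance, counting a packet as reordered when its sequence number is lower than the next expected one) are assumptions made for illustration; they are deliberately simpler than the richly parameterized IPPM definitions and assume synchronized clocks at both ends.

```python
def path_metrics(sent, received):
    """Compute simplified path metrics.

    sent:     dict mapping sequence number -> send time (seconds)
    received: list of (sequence number, receive time) in arrival order
    """
    # One-way delay of every packet that arrived.
    delays = [rx - sent[seq] for seq, rx in received]

    # Loss rate: fraction of sent packets that never arrived.
    arrived = {seq for seq, _ in received}
    loss_rate = 1.0 - len(arrived) / len(sent)

    # Delay variation: delay difference between consecutive arrivals.
    variation = [b - a for a, b in zip(delays, delays[1:])]

    # Reordering: a packet is counted as reordered when its sequence
    # number is lower than the next one expected at the receiver.
    next_expected = 0
    reordered = 0
    for seq, _ in received:
        if seq < next_expected:
            reordered += 1
        else:
            next_expected = seq + 1

    return {
        "mean_delay": sum(delays) / len(delays) if delays else None,
        "loss_rate": loss_rate,
        "delay_variation": variation,
        "reordered": reordered,
    }


# Example: packet 3 is lost and packet 1 arrives after packet 2.
sent = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}
received = [(0, 0.05), (2, 0.26), (1, 0.27), (4, 0.45)]
metrics = path_metrics(sent, received)
```

Even in this small example the choice of definition matters: a different reordering metric, such as one based on how far a packet is displaced, would describe the same trace with a different value.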

The IPPM working group has defined through a series of RFC documents a large number of richly parameterized metrics in order to address the many possible objectives of network measurement procedures. Often, the ultimate purpose is to report a concise set of metrics describing a network’s state to an end user. Elaborating on this idea, the Internet Draft on reporting metrics to users [146] defines a small set of metrics that are robust, easy to understand, orthogonal, relevant, and easy to compute.

The standardization process for these metrics considers not only their formal definition but also documentation and measurement procedures. There is, however, still a need to establish procedures for measuring individual metrics and interpreting their values as relevant properties for different classes of service, such as bulk transfer, periodic and multimedia flows.

Nonetheless, this standardization effort embraces only low level metrics, i.e., those that characterize the network regardless of transport protocols and applications. That is, the definition of metrics for characterizing different traffic patterns