Distributed Computing
Principles, Algorithms, and Systems
Distributed computing deals with all forms of computing, information access, and information exchange across multiple processing platforms connected by computer networks. Design of distributed computing systems is a complex task. It requires a solid understanding of the design issues and an in-depth understanding of the theoretical and practical aspects of their solutions. This comprehensive textbook covers the fundamental principles and models underlying the theory, algorithms, and systems aspects of distributed computing.
Broad and detailed coverage of the theory is balanced with practical systems-related problems such as mutual exclusion, deadlock detection, authentication, and failure recovery. Algorithms are carefully selected, lucidly presented, and described without complex proofs. Simple explanations and illustrations are used to elucidate the algorithms. Emerging topics of significant impact, such as peer-to-peer networks and network security, are also covered.
With state-of-the-art algorithms, numerous illustrations, examples, and homework problems, this textbook is invaluable for advanced undergraduate and graduate students of electrical and computer engineering and computer science. Practitioners in data networking and sensor networks will also find this a valuable resource.
Ajay D. Kshemkalyani is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago. He was awarded his Ph.D. in Computer and Information Science in 1991 from The Ohio State University. Before moving to academia, he spent several years working on computer networks at IBM Research Triangle Park. In 1999, he received the National Science Foundation’s CAREER Award. He is a Senior Member of the IEEE, and his principal areas of research include distributed computing, algorithms, computer networks, and concurrent systems. He currently serves on the editorial board of Computer Networks.
Mukesh Singhal is Full Professor and Gartner Group Endowed Chair in Network Engineering in the Department of Computer Science at the University of Kentucky. He was awarded his Ph.D. in Computer Science in 1986 from the University of Maryland, College Park. In 2003, he received the IEEE Technical Achievement Award, and he currently serves on the editorial board of the IEEE Transactions on Computers. He is a Fellow of the IEEE, and his principal areas of research include distributed systems, computer networks, wireless and mobile computing systems, performance evaluation, and computer security.
Distributed Computing
Principles, Algorithms, and Systems
Ajay D. Kshemkalyani
University of Illinois at Chicago, Chicago
and
Mukesh Singhal
University of Kentucky, Lexington
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521876346
© Cambridge University Press 2008
This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published in print format 2008
ISBN-13 978-0-521-87634-6 hardback
ISBN-13 978-0-511-39341-9 eBook (EBL)
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
To my father Shri Digambar and my mother Shrimati Vimala.
Ajay D. Kshemkalyani
To my mother Chandra Prabha Singhal, my father Brij Mohan Singhal, and my daughters Meenakshi, Malvika,
and Priyanka.
Mukesh Singhal
Contents
Preface page xv
1 Introduction 1
1.1 Definition 1
1.2 Relation to computer system components 2
1.3 Motivation 3
1.4 Relation to parallel multiprocessor/multicomputer systems 5
1.5 Message-passing systems versus shared memory systems 13
1.6 Primitives for distributed communication 14
1.7 Synchronous versus asynchronous executions 19
1.8 Design issues and challenges 22
1.9 Selection and coverage of topics 33
1.10 Chapter summary 34
1.11 Exercises 35
1.12 Notes on references 36
References 37
2 A model of distributed computations 39
2.1 A distributed program 39
2.2 A model of distributed executions 40
2.3 Models of communication networks 42
2.4 Global state of a distributed system 43
2.5 Cuts of a distributed computation 45
2.6 Past and future cones of an event 46
2.7 Models of process communications 47
2.8 Chapter summary 48
2.9 Exercises 48
2.10 Notes on references 48
References 49
3 Logical time 50
3.1 Introduction 50
3.2 A framework for a system of logical clocks 52
3.3 Scalar time 53
3.4 Vector time 55
3.5 Efficient implementations of vector clocks 59
3.6 Jard–Jourdan’s adaptive technique 65
3.7 Matrix time 68
3.8 Virtual time 69
3.9 Physical clock synchronization: NTP 78
3.10 Chapter summary 81
3.11 Exercises 84
3.12 Notes on references 84
References 84
4 Global state and snapshot recording algorithms 87
4.1 Introduction 87
4.2 System model and definitions 90
4.3 Snapshot algorithms for FIFO channels 93
4.4 Variations of the Chandy–Lamport algorithm 97
4.5 Snapshot algorithms for non-FIFO channels 101
4.6 Snapshots in a causal delivery system 106
4.7 Monitoring global state 109
4.8 Necessary and sufficient conditions for consistent global snapshots 110
4.9 Finding consistent global snapshots in a distributed computation 114
4.10 Chapter summary 121
4.11 Exercises 122
4.12 Notes on references 122
References 123
5 Terminology and basic algorithms 126
5.1 Topology abstraction and overlays 126
5.2 Classifications and basic concepts 128
5.3 Complexity measures and metrics 135
5.4 Program structure 137
5.5 Elementary graph algorithms 138
5.6 Synchronizers 163
5.7 Maximal independent set (MIS) 169
5.8 Connected dominating set 171
5.9 Compact routing tables 172
5.10 Leader election 174
5.11 Challenges in designing distributed graph algorithms 175
5.12 Object replication problems 176
5.13 Chapter summary 182
5.14 Exercises 183
5.15 Notes on references 185
References 186
6 Message ordering and group communication 189
6.1 Message ordering paradigms 190
6.2 Asynchronous execution with synchronous communication 195
6.3 Synchronous program order on an asynchronous system 200
6.4 Group communication 205
6.5 Causal order (CO) 206
6.6 Total order 215
6.7 A nomenclature for multicast 220
6.8 Propagation trees for multicast 221
6.9 Classification of application-level multicast algorithms 225
6.10 Semantics of fault-tolerant group communication 228
6.11 Distributed multicast algorithms at the network layer 230
6.12 Chapter summary 236
6.13 Exercises 236
6.14 Notes on references 238
References 239
7 Termination detection 241
7.1 Introduction 241
7.2 System model of a distributed computation 242
7.3 Termination detection using distributed snapshots 243
7.4 Termination detection by weight throwing 245
7.5 A spanning-tree-based termination detection algorithm 247
7.6 Message-optimal termination detection 253
7.7 Termination detection in a very general distributed computing model 257
7.8 Termination detection in the atomic computation model 263
7.9 Termination detection in a faulty distributed system 272
7.10 Chapter summary 279
7.11 Exercises 279
7.12 Notes on references 280
References 280
8 Reasoning with knowledge 282
8.1 The muddy children puzzle 282
8.2 Logic of knowledge 283
8.3 Knowledge in synchronous systems 289
8.4 Knowledge in asynchronous systems 290
8.5 Knowledge transfer 298
8.6 Knowledge and clocks 300
8.7 Chapter summary 301
8.8 Exercises 302
8.9 Notes on references 303
References 303
9 Distributed mutual exclusion algorithms 305
9.1 Introduction 305
9.2 Preliminaries 306
9.3 Lamport’s algorithm 309
9.4 Ricart–Agrawala algorithm 312
9.5 Singhal’s dynamic information-structure algorithm 315
9.6 Lodha and Kshemkalyani’s fair mutual exclusion algorithm 321
9.7 Quorum-based mutual exclusion algorithms 327
9.8 Maekawa’s algorithm 328
9.9 Agarwal–El Abbadi quorum-based algorithm 331
9.10 Token-based algorithms 336
9.11 Suzuki–Kasami’s broadcast algorithm 336
9.12 Raymond’s tree-based algorithm 339
9.13 Chapter summary 348
9.14 Exercises 348
9.15 Notes on references 349
References 350
10 Deadlock detection in distributed systems 352
10.1 Introduction 352
10.2 System model 352
10.3 Preliminaries 353
10.4 Models of deadlocks 355
10.5 Knapp’s classification of distributed deadlock detection algorithms 358
10.6 Mitchell and Merritt’s algorithm for the single-resource model 360
10.7 Chandy–Misra–Haas algorithm for the AND model 362
10.8 Chandy–Misra–Haas algorithm for the OR model 364
10.9 Kshemkalyani–Singhal algorithm for the P-out-of-Q model 365
10.10 Chapter summary 374
10.11 Exercises 375
10.12 Notes on references 375
References 376
11 Global predicate detection 379
11.1 Stable and unstable predicates 379
11.2 Modalities on predicates 382
11.3 Centralized algorithm for relational predicates 384
11.4 Conjunctive predicates 388
11.5 Distributed algorithms for conjunctive predicates 395
11.6 Further classification of predicates 404
11.7 Chapter summary 405
11.8 Exercises 406
11.9 Notes on references 407
References 408
12 Distributed shared memory 410
12.1 Abstraction and advantages 410
12.2 Memory consistency models 413
12.3 Shared memory mutual exclusion 427
12.4 Wait-freedom 434
12.5 Register hierarchy and wait-free simulations 434
12.6 Wait-free atomic snapshots of shared objects 447
12.7 Chapter summary 451
12.8 Exercises 452
12.9 Notes on references 453
References 454
13 Checkpointing and rollback recovery 456
13.1 Introduction 456
13.2 Background and definitions 457
13.3 Issues in failure recovery 462
13.4 Checkpoint-based recovery 464
13.5 Log-based rollback recovery 470
13.6 Koo–Toueg coordinated checkpointing algorithm 476
13.7 Juang–Venkatesan algorithm for asynchronous checkpointing and recovery 478
13.8 Manivannan–Singhal quasi-synchronous checkpointing algorithm 483
13.9 Peterson–Kearns algorithm based on vector time 492
13.10 Helary–Mostefaoui–Netzer–Raynal communication-induced protocol 499
13.11 Chapter summary 505
13.12 Exercises 506
13.13 Notes on references 506
References 507
14 Consensus and agreement algorithms 510
14.1 Problem definition 510
14.2 Overview of results 514
14.3 Agreement in a failure-free system (synchronous or asynchronous) 515
14.4 Agreement in (message-passing) synchronous systems with failures 516
14.5 Agreement in asynchronous message-passing systems with failures 529
14.6 Wait-free shared memory consensus in asynchronous systems 544
14.7 Chapter summary 562
14.8 Exercises 563
14.9 Notes on references 564
References 565
15 Failure detectors 567
15.1 Introduction 567
15.2 Unreliable failure detectors 568
15.3 The consensus problem 577
15.4 Atomic broadcast 583
15.5 A solution to atomic broadcast 584
15.6 The weakest failure detectors to solve fundamental agreement problems 585
15.7 An implementation of a failure detector 589
15.8 An adaptive failure detection protocol 591
15.9 Exercises 596
15.10 Notes on references 596
References 596
16 Authentication in distributed systems 598
16.1 Introduction 598
16.2 Background and definitions 599
16.3 Protocols based on symmetric cryptosystems 602
16.4 Protocols based on asymmetric cryptosystems 615
16.5 Password-based authentication 622
16.6 Authentication protocol failures 625
16.7 Chapter summary 626
16.8 Exercises 627
16.9 Notes on references 627
References 628
17 Self-stabilization 631
17.1 Introduction 631
17.2 System model 632
17.3 Definition of self-stabilization 634
17.4 Issues in the design of self-stabilization algorithms 636
17.5 Methodologies for designing self-stabilizing systems 647
17.6 Communication protocols 649
17.7 Self-stabilizing distributed spanning trees 650
17.8 Self-stabilizing algorithms for spanning-tree construction 652
17.9 An anonymous self-stabilizing algorithm for 1-maximal independent set in trees 657
17.10 A probabilistic self-stabilizing leader election algorithm 660
17.11 The role of compilers in self-stabilization 662
17.12 Self-stabilization as a solution to fault tolerance 665
17.13 Factors preventing self-stabilization 667
17.14 Limitations of self-stabilization 668
17.15 Chapter summary 670
17.16 Exercises 670
17.17 Notes on references 671
References 671
18 Peer-to-peer computing and overlay graphs 677
18.1 Introduction 677
18.2 Data indexing and overlays 679
18.3 Unstructured overlays 681
18.4 Chord distributed hash table 688
18.5 Content addressable networks (CAN) 695
18.6 Tapestry 701
18.7 Some other challenges in P2P system design 708
18.8 Tradeoffs between table storage and route lengths 710
18.9 Graph structures of complex networks 712
18.10 Internet graphs 714
18.11 Generalized random graph networks 720
18.12 Small-world networks 720
18.13 Scale-free networks 721
18.14 Evolving networks 723
18.15 Chapter summary 727
18.16 Exercises 727
18.17 Notes on references 728
References 729
Index 731
Preface
Background
The field of distributed computing covers all aspects of computing and information access across multiple processing elements connected by any form of communication network, whether local or wide-area in coverage. Since the advent of the Internet in the 1970s, there has been a steady growth of new applications requiring distributed processing. This has been enabled by advances in networking and hardware technology, the falling cost of hardware, and greater end-user awareness. These factors have contributed to making distributed computing a cost-effective, high-performance, and fault-tolerant reality. Around the turn of the millennium, there was an explosive growth in the expansion and efficiency of the Internet, which was matched by increased access to networked resources through the World Wide Web, all across the world. Coupled with an equally dramatic growth in the wireless and mobile networking areas, and the plummeting prices of bandwidth and storage devices, we are witnessing a rapid spurt in distributed applications and an accompanying interest in the field of distributed computing in universities, government organizations, and private institutions.
Advances in hardware technology have suddenly made sensor networking a reality, and embedded and sensor networks are rapidly becoming an integral part of everyone’s life – from the home network with the interconnected gadgets to the automobile communicating by GPS (global positioning system), to the fully networked office with RFID monitoring. In the emerging global village, distributed computing will be the centerpiece of all computing and information access sub-disciplines within computer science. Clearly, this is a very important field. Moreover, this evolving field is characterized by a diverse range of challenges for which the solutions need to have foundations on solid principles.
The field of distributed computing is very important, and there is a huge demand for a good comprehensive book. This book comprehensively covers all important topics in great depth, combining this with a clarity of explanation
and ease of understanding. The book will be particularly valuable to the academic community and the computer industry at large. Writing such a comprehensive book has been a Herculean task and there is a deep sense of satisfaction in knowing that we were able to complete it and perform this service to the community.
Description, approach, and features
The book will focus on the fundamental principles and models underlying all aspects of distributed computing. It will address the principles underlying the theory, algorithms, and systems aspects of distributed computing. The manner of presentation of the algorithms is very clear, explaining the main ideas and the intuition with figures and simple explanations rather than getting entangled in intimidating notations and lengthy and hard-to-follow rigorous proofs of the algorithms. The selection of chapter themes is broad and comprehensive, and the book covers all important topics in depth. The selection of algorithms within each chapter has been done carefully to elucidate new and important techniques of algorithm design. Although the book focuses on foundational aspects and algorithms for distributed computing, it thoroughly addresses all practical systems-like problems (e.g., mutual exclusion, deadlock detection, termination detection, failure recovery, authentication, global state and time, etc.) by presenting the theory behind and algorithms for such problems. The book is written keeping in mind the impact of emerging topics such as peer-to-peer computing and network security on the foundational aspects of distributed computing.
Each chapter contains figures, examples, exercises, a summary, and references.
Readership
This book is aimed as a textbook for the following:
• Graduate students and senior-level undergraduate students in computer science and computer engineering.
• Graduate students in electrical engineering and mathematics. As wireless networks, peer-to-peer networks, and mobile computing continue to grow in importance, an increasing number of students from electrical engineering departments will also find this book necessary.
• Practitioners, systems designers/programmers, and consultants in industry and research laboratories will find the book a very useful reference because it contains state-of-the-art algorithms and principles to address various design issues in distributed systems, as well as the latest references.
Hard and soft prerequisites for the use of this book include the following:
• An undergraduate course in algorithms is required.
• Undergraduate courses in operating systems and computer networks would be useful.
• A reasonable familiarity with programming.
We have aimed for a very comprehensive book that will act as a single source for distributed computing models and algorithms. The book has both depth and breadth of coverage of topics, and is characterized by clear and easy explanations. None of the existing textbooks on distributed computing provides all of these features.
Acknowledgements
This book grew from the notes used in the graduate courses on distributed computing at the Ohio State University, the University of Illinois at Chicago, and at the University of Kentucky. We would like to thank the graduate students at these schools for their contributions to the book in many ways.
The book is based on the published research results of numerous researchers in the field. We have made all efforts to present the material in our own words and have given credit to the original sources of information. We would like to thank all the researchers whose work has been reported in this book.
Finally, we would like to thank the staff of Cambridge University Press for providing us with excellent support in the publication of this book.
Access to resources
The following websites will be maintained for the book. Any errors and comments should be sent to ajayk@cs.uic.edu or singhal@cs.uky.edu. Further information about the book can be obtained from the authors’ web pages:
• www.cs.uic.edu/∼ajayk/DCS-Book
• www.cs.uky.edu/∼singhal/DCS-Book.
C H A P T E R
1 Introduction
1.1 Definition
A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be individually solved. Distributed systems have been in existence since the start of the universe. From a school of fish to a flock of birds and entire ecosystems of microorganisms, there is communication among mobile intelligent agents in nature. With the widespread proliferation of the Internet and the emerging global village, the notion of distributed computing systems as a useful and widely deployed tool is becoming a reality.
For computing systems, a distributed system has been characterized in one of several ways:
• You know you are using one when the crash of a computer you have never heard of prevents you from doing work [23].
• A collection of computers that do not share common memory or a common physical clock, that communicate by message passing over a communication network, and where each computer has its own memory and runs its own operating system. Typically the computers are semi-autonomous and are loosely coupled while they cooperate to address a problem collectively [29].
• A collection of independent computers that appears to the users of the system as a single coherent computer [33].
• A term that describes a wide range of computers, from weakly coupled systems such as wide-area networks, to strongly coupled systems such as local area networks, to very strongly coupled systems such as multiprocessor systems [19].
A distributed system can be characterized as a collection of mostly autonomous processors communicating over a communication network and having the following features:
• No common physical clock This is an important assumption because it introduces the element of “distribution” in the system and gives rise to the inherent asynchrony amongst the processors.
• No shared memory This is a key feature that requires message-passing for communication. This feature implies the absence of the common physical clock.
It may be noted that a distributed system may still provide the abstraction of a common address space via the distributed shared memory abstraction.
Several aspects of shared memory multiprocessor systems have also been studied in the distributed computing literature.
• Geographical separation The geographically wider apart that the processors are, the more representative is the system of a distributed system. However, it is not necessary for the processors to be on a wide-area network (WAN). Recently, the network/cluster of workstations (NOW/COW) configuration connecting processors on a LAN is also being increasingly regarded as a small distributed system. This NOW configuration is becoming popular because of the low-cost high-speed off-the-shelf processors now available. The Google search engine is based on the NOW architecture.
• Autonomy and heterogeneity The processors are “loosely coupled” in that they have different speeds and each can be running a different operating system. They are usually not part of a dedicated system, but cooperate with one another by offering services or solving a problem jointly.
1.2 Relation to computer system components
A typical distributed system is shown in Figure 1.1. Each computer has a memory-processing unit and the computers are connected by a communication network. Figure 1.2 shows the relationships of the software components that run on each of the computers and use the local operating system and network protocol stack for functioning. The distributed software is also termed as middleware. A distributed execution is the execution of processes across the distributed system to collaboratively achieve a common goal. An execution is also sometimes termed a computation or a run.
The distributed system uses a layered architecture to break down the complexity of system design. The middleware is the distributed software that
Figure 1.1 A distributed system connects processors by a communication network.
[Figure 1.1: multiple P–M nodes (P = processor(s), M = memory bank(s)) connected by a communication network (WAN/LAN).]
Figure 1.2 Interaction of the software components at each processor.
[Figure 1.2: at each processor, the distributed application sits above the distributed software (middleware libraries), which in turn sits above the operating system and the network protocol stack (application, transport, network, and data link layers); the extent of the distributed protocols spans the application layer and the middleware.]
drives the distributed system, while providing transparency of heterogeneity at the platform level [24]. Figure 1.2 schematically shows the interaction of this software with these system components at each processor. Here we assume that the middleware layer does not contain the traditional application layer functions of the network protocol stack, such as http, mail, ftp, and telnet.
Various primitives and calls to functions defined in various libraries of the middleware layer are embedded in the user program code. There exist several libraries to choose from to invoke primitives for the more common functions – such as reliable and ordered multicasting – of the middleware layer.
There are several standards such as Object Management Group’s (OMG) common object request broker architecture (CORBA) [36], and the remote procedure call (RPC) mechanism [1,11]. The RPC mechanism conceptually works like a local procedure call, with the difference that the procedure code may reside on a remote machine, and the RPC software sends a message across the network to invoke the remote procedure. It then awaits a reply, after which the procedure call completes from the perspective of the program that invoked it. Currently deployed commercial versions of middleware often use CORBA, DCOM (distributed component object model), Java, and RMI (remote method invocation) [7] technologies. The message-passing interface (MPI) [20,30] developed in the research community is an example of an interface for various communication functions.
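As a minimal sketch of the RPC idea described above (this example uses Python’s standard xmlrpc library purely for illustration; the server address, port, and the `add` procedure are invented for the example and are not from the text), a remote procedure is invoked with the syntax of a local call while the RPC layer sends a message over the network and blocks until the reply arrives:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register a procedure that remote callers may invoke.
# Binding to port 0 lets the OS pick a free ephemeral port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
port = server.server_address[1]
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy makes the remote "add" look like a local call.
# The RPC layer marshals the arguments, awaits the reply, and returns it.
proxy = ServerProxy(f"http://localhost:{port}")
result = proxy.add(2, 3)
server.shutdown()
print(result)  # 5
```

From the invoking program’s perspective, `proxy.add(2, 3)` completes exactly like a local procedure call, even though the procedure body executed on the (here, simulated) remote machine.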
1.3 Motivation
The motivation for using a distributed system is some or all of the following requirements:
1. Inherently distributed computations In many applications such as money transfer in banking, or reaching consensus among parties that are geographically distant, the computation is inherently distributed.
2. Resource sharing Resources such as peripherals, complete data sets in databases, special libraries, as well as data (variables/files) cannot be fully replicated at all the sites because it is often neither practical nor cost-effective. Further, they cannot be placed at a single site because access to that site might prove to be a bottleneck. Therefore, such resources are typically distributed across the system. For example, distributed databases such as DB2 partition the data sets across several servers, in addition to replicating them at a few sites for rapid access as well as reliability.
3. Access to geographically remote data and resources In many scenarios, the data cannot be replicated at every site participating in the distributed execution because it may be too large or too sensitive to be replicated. For example, payroll data within a multinational corporation is both too large and too sensitive to be replicated at every branch office/site. It is therefore stored at a central server which can be queried by branch offices. Similarly, special resources such as supercomputers exist only in certain locations, and to access such supercomputers, users need to log in remotely.
Advances in the design of resource-constrained mobile devices as well as in the wireless technology with which these devices communicate have given further impetus to the importance of distributed protocols and middleware.
4. Enhanced reliability A distributed system has the inherent potential to provide increased reliability because of the possibility of replicating resources and executions, as well as the reality that geographically distributed resources are not likely to crash/malfunction at the same time under normal circumstances. Reliability entails several aspects:
• availability, i.e., the resource should be accessible at all times;
• integrity, i.e., the value/state of the resource should be correct, in the face of concurrent access from multiple processors, as per the semantics expected by the application;
• fault-tolerance, i.e., the ability to recover from system failures, where such failures may be defined to occur in one of many failure models, which we will study in Chapters 5 and 14.
5. Increased performance/cost ratio By resource sharing and accessing geographically remote data and resources, the performance/cost ratio is increased. Although higher throughput has not necessarily been the main objective behind using a distributed system, nevertheless, any task can be partitioned across the various computers in the distributed system. Such a configuration provides a better performance/cost ratio than using special parallel machines. This is particularly true of the NOW configuration.
In addition to meeting the above requirements, a distributed system also offers the following advantages:
6. Scalability As the processors are usually connected by a wide-area network, adding more processors does not pose a direct bottleneck for the communication network.
7. Modularity and incremental expandability Heterogeneous processors may be easily added into the system without affecting the performance, as long as those processors are running the same middleware algorithms. Similarly, existing processors may be easily replaced by other processors.
1.4 Relation to parallel multiprocessor/multicomputer systems
The characteristics of a distributed system were identified above. A typical distributed system would look as shown in Figure 1.1. However, how does one classify a system that meets some but not all of the characteristics? Is the system still a distributed system, or does it become a parallel multiprocessor system? To better answer these questions, we first examine the architecture of parallel systems, and then examine some well-known taxonomies for multiprocessor/multicomputer systems.
1.4.1 Characteristics of parallel systems
A parallel system may be broadly classified as belonging to one of three types:
1. A multiprocessor system is a parallel system in which the multiple processors have direct access to shared memory which forms a common address space. The architecture is shown in Figure 1.3(a). Such processors usually do not have a common clock.
A multiprocessor system usually corresponds to a uniform memory access (UMA) architecture in which the access latency, i.e., waiting time, to complete an access to any memory location from any processor is the same.
The processors are in very close physical proximity and are connected by an interconnection network. Interprocess communication across processors is traditionally through read and write operations on the shared memory, although the use of message-passing primitives such as those provided by
Figure 1.3 Two standard architectures for parallel systems. (a) Uniform memory access (UMA) multiprocessor system. (b) Non-uniform memory access (NUMA) multiprocessor. In both architectures, the processors may locally cache data from memory. M = memory, P = processor.
Figure 1.4 Interconnection networks for shared memory multiprocessor systems. (a) Omega network [4] for n = 8 processors P0–P7 and memory banks M0–M7. (b) Butterfly network [10] for n = 8 processors P0–P7 and memory banks M0–M7.
[Figure 1.4: (a) a 3-stage Omega network and (b) a 3-stage Butterfly network, each connecting processors P0–P7 to memory banks M0–M7 through three stages of 2×2 switches.]
the MPI, is also possible (using emulation on the shared memory). All the processors usually run the same operating system, and both the hardware and software are very tightly coupled.
The processors are usually of the same type, and are housed within the same box/container with a shared memory. The interconnection network to access the memory may be a bus, although for greater efficiency, it is usually a multistage switch with a symmetric and regular design.
Figure 1.4 shows two popular interconnection networks – the Omega network [4] and the Butterfly network [10], each of which is a multi-stage network formed of 2×2 switching elements. Each 2×2 switch allows data on either of the two input wires to be switched to the upper or the lower output wire. In a single step, however, only one data unit can be sent on an output wire. So if the data from both the input wires is to be routed to the same output wire in a single step, there is a collision. Various techniques such as buffering or more elaborate interconnection designs can address collisions.
Each 2×2 switch is represented as a rectangle in the figure. Furthermore, an n-input and n-output network uses log2 n stages and log2 n bits for addressing. Routing in the 2×2 switch at stage k uses only the kth bit, and hence can be done at clock speed in hardware. The multi-stage networks can be constructed recursively, and the interconnection pattern between any two stages can be expressed using an iterative or a recursive generating function. Besides the Omega and Butterfly (banyan) networks, other examples of multistage interconnection networks are the Clos [9]
and the shuffle-exchange networks [37]. Each of these has very interesting mathematical properties that allow rich connectivity between the processor bank and memory bank.
Omega interconnection function The Omega network which connects n processors to n memory units has (n/2) log2 n switching elements of size 2×2 arranged in log2 n stages. Between each pair of adjacent stages of the Omega network, a link exists between output i of a stage and the input j to the next stage according to the following perfect shuffle pattern which
7 1.4 Relation to parallel multiprocessor/multicomputer systems
is a left-rotation operation on the binary representation of i to get j. The iterative generation function is as follows:

    j = 2i             for 0 ≤ i ≤ n/2 − 1
    j = 2i + 1 − n     for n/2 ≤ i ≤ n − 1        (1.1)

Consider any stage of switches. Informally, the upper (lower) input lines for each switch come in sequential order from the upper (lower) half of the switches in the earlier stage.
With respect to the Omega network in Figure 1.4(a), n = 8. Hence, for any stage, for the outputs i, where 0 ≤ i ≤ 3, the output i is connected to input 2i of the next stage. For 4 ≤ i ≤ 7, the output i of any stage is connected to input 2i + 1 − n of the next stage.
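As a minimal sketch (the function names are illustrative, not from the text), the piecewise function of equation (1.1) and the left-rotation view of the perfect shuffle produce the same mapping:

```python
def omega_shuffle(i, n):
    """Output i of a stage connects to input j of the next stage, per equation (1.1)."""
    if 0 <= i <= n // 2 - 1:
        return 2 * i
    return 2 * i + 1 - n

def left_rotate(i, n):
    """Equivalent view: left-rotate the log2(n)-bit binary representation of i."""
    bits = n.bit_length() - 1          # log2(n) address bits
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)

# For n = 8, both formulations agree; e.g. output 5 (101) connects to input 3 (011).
assert all(omega_shuffle(i, 8) == left_rotate(i, 8) for i in range(8))
```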
Omega routing function The routing function from input line i to output line j considers only j and the stage number s, where s ∈ [0, log2 n − 1]. In a stage s switch, if the (s + 1)th MSB (most significant bit) of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire.
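This routing rule can be sketched in a few lines (hypothetical function name): the stage-by-stage up/down decisions are read directly off the bits of the destination j.

```python
def omega_route(j, n):
    """Up/down port choices at stages 0..log2(n)-1 for data destined to output j."""
    stages = n.bit_length() - 1
    # At stage s, inspect the (s+1)th most significant bit of j.
    return ['up' if (j >> (stages - 1 - s)) & 1 == 0 else 'down'
            for s in range(stages)]

# Destination 5 = 101: route down, up, down through the three stages of Figure 1.4(a).
assert omega_route(5, 8) == ['down', 'up', 'down']
```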
Butterfly interconnection function Unlike the Omega network, the generation of the interconnection pattern between a pair of adjacent stages depends not only on n but also on the stage number s. The recursive expression is as follows. Let there be M = n/2 switches per stage, and let a switch be denoted by the tuple ⟨x, s⟩, where x ∈ [0, M − 1] and stage s ∈ [0, log2 n − 1].
The two outgoing edges from any switch ⟨x, s⟩ are as follows. There is an edge from switch ⟨x, s⟩ to switch ⟨y, s + 1⟩ if (i) x = y or (ii) x XOR y has exactly one 1 bit, which is in the (s + 1)th MSB. For stage s, apply the rule above for M/2^s switches.
Whether the two incoming connections go to the upper or the lower input port is not important because of the routing function, given below.
Example Consider the Butterfly network in Figure 1.4(b), where n = 8 and M = 4. There are three stages, s = 0, 1, 2, and the interconnection pattern is defined between s = 0 and s = 1 and between s = 1 and s = 2. The switch number x varies from 0 to 3 in each stage, i.e., x is a 2-bit string.
(Note that unlike the Omega network formulation using input and output lines given above, this formulation uses switch numbers. Exercise 1.5 asks you to prove a formulation of the Omega interconnection pattern using switch numbers instead of input and output port numbers.)
Consider the first stage interconnection (s = 0) of a Butterfly of size M, and hence having log2 2M stages. For stage s = 0, as per rule (i), the first output line from switch 00 goes to the input line of switch 00 of stage s = 1. As per rule (ii), the second output line of switch 00 goes to the input line of switch 10 of stage s = 1. Similarly, x = 01 has one output line go to an input line of switch 11 in stage s = 1. The other connections in this stage can be determined similarly. For stage s = 1 connecting to stage s = 2, we apply the rules considering only M/2^1 = M/2 switches, i.e., we build two Butterflies of size M/2 – the “upper half” and the “lower half” switches.
The recursion terminates for M/2^s = 1, when there is a single switch.
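The two-edge rule above can be sketched as follows (illustrative names; the cross edge flips the (s + 1)th MSB of the log2(M)-bit switch number):

```python
def butterfly_neighbors(x, s, M):
    """Stage-(s+1) switches reachable from switch <x, s>; M switches per stage."""
    straight = x                   # rule (i): x = y
    cross = x ^ (M >> (s + 1))     # rule (ii): flip the (s+1)th MSB of x
    return straight, cross

# Figure 1.4(b), n = 8, M = 4: switch 00 of stage 0 connects to switches 00 and 10
# of stage 1, and switch 01 connects to switches 01 and 11, matching the example.
assert butterfly_neighbors(0b00, 0, 4) == (0b00, 0b10)
assert butterfly_neighbors(0b01, 0, 4) == (0b01, 0b11)
```

Because flipping a progressively lower bit keeps the edge inside its half, the rule automatically builds the two sub-Butterflies of size M/2 described above.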
Butterfly routing function In a stage s switch, if the (s + 1)th MSB of j is 0, the data is routed to the upper output wire, otherwise it is routed to the lower output wire.
Observe that for the Butterfly and the Omega networks, the paths from the different inputs to any one output form a spanning tree. This implies that collisions will occur when data is destined to the same output line.
However, the advantage is that data can be combined at the switches if the application semantics (e.g., summation of numbers) are known.
2. A multicomputer parallel system is a parallel system in which the multiple processors do not have direct access to shared memory. The memory of the multiple processors may or may not form a common address space.
Such computers usually do not have a common clock. The architecture is shown in Figure 1.3(b).
The processors are in close physical proximity and are usually very tightly coupled (homogeneous hardware and software), and connected by an interconnection network. The processors communicate either via a common address space or via message-passing. A multicomputer system that has a common address space usually corresponds to a non-uniform memory access (NUMA) architecture in which the latency to access various shared memory locations from the different processors varies.
Examples of parallel multicomputers are: the NYU Ultracomputer and the Sequent shared memory machines, the CM* Connection machine, and processors configured in regular and symmetrical topologies such as an array or mesh, ring, torus, cube, and hypercube (message-passing machines). The regular and symmetrical topologies have interesting mathematical properties that enable very easy routing and provide many rich features such as alternate routing.
Figure 1.5(a) shows a wrap-around 4×4 mesh. A k×k wrap-around mesh contains k^2 processors, and the maximum path length between any two processors is 2⌊k/2⌋. Routing can be done along the Manhattan grid.
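In code, the shortest-path length on the wrap-around mesh is the Manhattan distance computed per dimension, taking the wrap-around link when it is shorter (a sketch; names are illustrative):

```python
def torus_distance(a, b, k):
    """Shortest-path length between nodes a and b ((row, column) pairs) on a k x k torus."""
    def axis_distance(d):
        d = abs(d) % k
        return min(d, k - d)       # the wrap-around link may be shorter
    return axis_distance(a[0] - b[0]) + axis_distance(a[1] - b[1])

# On the 4 x 4 torus of Figure 1.5(a), opposite corners are only 2 hops apart:
assert torus_distance((0, 0), (3, 3), 4) == 2
# The maximum distance 2*floor(k/2) is attained, e.g., between (0, 0) and (2, 2):
assert torus_distance((0, 0), (2, 2), 4) == 4
```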
Figure 1.5(b) shows a four-dimensional hypercube. A k-dimensional hypercube has 2^k processor-and-memory units [13, 21]. Each such unit is a node in the hypercube, and has a unique k-bit label. Each of the k dimensions is associated with a bit position in the label. The labels of any two adjacent nodes are identical except for the bit position corresponding to the dimension in which the two nodes differ. Thus, the processors are labelled such that the shortest path between any two processors is the Hamming distance (defined as the number of bit positions in which the two equal-sized bit strings differ) between the processor labels. This is clearly bounded by k.
Figure 1.5 Some popular topologies for multicomputer shared-memory machines. (a) Wrap-around 2D-mesh, also known as torus. (b) Hypercube of dimension 4.
Example Nodes 0101 and 1100 have a Hamming distance of 2. The shortest path between them has length 2.
Routing in the hypercube is done hop-by-hop. At any hop, the message can be sent along any dimension corresponding to the bit position in which the current node’s address and the destination address differ. The 4D hypercube shown in the figure is formed by connecting the corresponding edges of two 3D hypercubes (corresponding to the left and right “cubes”
in the figure) along the fourth dimension; the labels of the 4D hypercube are formed by prepending a “0” to the labels of the left 3D hypercube and prepending a “1” to the labels of the right 3D hypercube. This can be extended to construct hypercubes of higher dimensions. Observe that there are multiple routes between any pair of nodes, which provides fault-tolerance as well as a congestion control mechanism. The hypercube and its variant topologies have very interesting mathematical properties with implications for routing and fault-tolerance.
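Hop-by-hop hypercube routing can be sketched by repeatedly flipping one differing bit. Any choice of differing dimension is valid; this sketch (illustrative name) picks the lowest one:

```python
def hypercube_route(src, dst):
    """Greedy route: each hop flips one bit position in which the labels differ."""
    path = [src]
    cur = src
    while cur != dst:
        diff = cur ^ dst
        cur ^= diff & -diff        # flip the lowest differing dimension
        path.append(cur)
    return path

# Nodes 0101 and 1100 have Hamming distance 2, so the route has exactly 2 hops:
route = hypercube_route(0b0101, 0b1100)
assert route == [0b0101, 0b0100, 0b1100]
```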
3. Array processors belong to a class of parallel computers that are physically co-located, are very tightly coupled, and have a common system clock (but may not share memory and communicate by passing data using messages).
Array processors and systolic arrays that perform tightly synchronized processing and data exchange in lock-step for applications such as DSP and image processing belong to this category. These applications usually involve a large number of iterations on the data. This class of parallel systems has a very niche market.
The distinction between UMA multiprocessors on the one hand, and NUMA and message-passing multicomputers on the other, is important because the algorithm design and data and task partitioning among the processors must account for the variable and unpredictable latencies in accessing memory/communication [22]. As compared to UMA systems and array processors, NUMA and message-passing multicomputer systems are less suitable when the degree of granularity of accessing shared data and communication is very fine.
The primary and most efficacious use of parallel systems is for obtaining a higher throughput by dividing the computational workload among the
processors. The tasks that are most amenable to higher speedups on parallel systems are those that can be cleanly partitioned into subtasks involving much number-crunching and relatively little communication for synchronization. Once the task has been decomposed, the processors perform large vector, array, and matrix computations that are common in scientific applications. Searching through large state spaces can be performed with significant speedup on parallel machines. While such parallel machines were an object of much theoretical and systems research in the 1980s and early 1990s, they have not proved to be economically viable for two related reasons.
First, the overall market for the applications that can potentially attain high speedups is relatively small. Second, due to economy of scale and the high processing power offered by relatively inexpensive off-the-shelf networked PCs, specialized parallel machines are not cost-effective to manufacture. They additionally require special compiler and other system support for maximum throughput.
1.4.2 Flynn’s taxonomy
Flynn [14] identified four processing modes, based on whether the processors execute the same or different instruction streams at the same time, and whether or not the processors process the same (identical) data at the same time. It is instructive to examine this classification to understand the range of options used for configuring systems:
• Single instruction stream, single data stream (SISD)
This mode corresponds to the conventional processing in the von Neumann paradigm with a single CPU, and a single memory unit connected by a system bus.
• Single instruction stream, multiple data stream (SIMD)
This mode corresponds to the processing by multiple homogeneous processors which execute in lock-step on different data items. Applications that involve operations on large arrays and matrices, such as scientific applications, can best exploit systems that provide the SIMD mode of operation because the data sets can be partitioned easily.
Several of the earliest parallel computers, such as Illiac-IV, MPP, CM2, and MasPar MP-1, were SIMD machines. Vector processors, array processors, and systolic arrays also belong to the SIMD class of processing.
Recent SIMD architectures include co-processing units such as the MMX units in Intel processors (e.g., Pentium with the streaming SIMD extensions (SSE) options) and DSP chips such as the Sharc [22].
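The SIMD mode can be loosely mimicked in ordinary Python: one “instruction” is applied element-wise across data partitioned among notional processing units. This is only an analogy for the data-parallel style, not real lock-step hardware:

```python
# Partition a data set across 4 notional processing units, then apply the same
# instruction (here, squaring) to every element of every partition.
data = list(range(16))
partitions = [data[u::4] for u in range(4)]            # unit u gets every 4th element
results = [[x * x for x in part] for part in partitions]

assert partitions[0] == [0, 4, 8, 12]
assert results[0] == [0, 16, 64, 144]
```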
• Multiple instruction stream, single data stream (MISD)
This mode corresponds to the execution of different operations in parallel on the same data. This is a specialized mode of operation with limited but niche applications, e.g., visualization.
Figure 1.6 Flynn’s taxonomy of SIMD, MIMD, and MISD architectures for multiprocessor/multicomputer systems.
[(a) SIMD; (b) MIMD; (c) MISD. Legend: C – control unit; P – processing unit; I – instruction stream; D – data stream.]
• Multiple instruction stream, multiple data stream (MIMD)
In this mode, the various processors execute different code on different data. This is the mode of operation in distributed systems as well as in the vast majority of parallel systems. There is no common clock among the system processors. Sun Ultra servers, multicomputer PCs, and IBM SP machines are examples of machines that execute in MIMD mode.
SIMD, MISD, and MIMD architectures are illustrated in Figure 1.6. MIMD architectures are most general and allow much flexibility in partitioning code and data to be processed among the processors. MIMD architectures also include the classically understood mode of execution in distributed systems.
1.4.3 Coupling, parallelism, concurrency, and granularity
Coupling The degree of coupling among a set of modules, whether hardware or software, is measured in terms of the interdependency and binding and/or homogeneity among the modules. When the degree of coupling is high (low), the modules are said to be tightly (loosely) coupled. SIMD and MISD architectures generally tend to be tightly coupled because of the common clocking of the shared instruction stream or the shared data stream. Here we briefly examine various MIMD architectures in terms of coupling:
• Tightly coupled multiprocessors (with UMA shared memory). These may be either switch-based (e.g., NYU Ultracomputer, RP3) or bus-based (e.g., Sequent, Encore).
• Tightly coupled multiprocessors (with NUMA shared memory or that communicate by message passing). Examples are the SGI Origin 2000 and the Sun Ultra HPC servers (that communicate via NUMA shared memory), and the hypercube and the torus (that communicate by message passing).
• Loosely coupled multicomputers (without shared memory) physically co-located. These may be bus-based (e.g., NOW connected by a LAN or Myrinet card) or use a more general communication network, and the processors may be heterogeneous. In such systems, processors neither share memory nor have a common clock, and hence may be classified as distributed systems – however, the processors are very close to one another, which is characteristic of a parallel system. As the communication latency may be significantly lower than in wide-area distributed systems, the solution approaches to various problems may be different for such systems than for wide-area distributed systems.
• Loosely coupled multicomputers (without shared memory and without common clock) that are physically remote. These correspond to the conventional notion of distributed systems.
Parallelism or speedup of a program on a specific system
This is a measure of the relative speedup of a specific program on a given machine. The speedup depends on the number of processors and the mapping of the code to the processors. It is expressed as the ratio of the time T1 with a single processor to the time Tn with n processors.
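As a trivial sketch, with T1 and Tn measured for the same program (the example timings are hypothetical):

```python
def speedup(t1, tn):
    """Speedup = T1 / Tn: single-processor time over n-processor time."""
    return t1 / tn

# A program taking 120 s on one processor and 20 s on eight processors:
assert speedup(120.0, 20.0) == 6.0     # below the ideal speedup of 8
```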
Parallelism within a parallel/distributed program
This is an aggregate measure of the percentage of time that all the processors are executing CPU instructions productively, as opposed to waiting for communication (either via shared memory or message-passing) operations to complete. The term is traditionally used to characterize parallel programs. If the aggregate measure is a function of only the code, then the parallelism is independent of the architecture. Otherwise, this definition degenerates to the definition of parallelism in the previous section.
Concurrency of a program
This is a broader term that means roughly the same as parallelism of a program, but is used in the context of distributed programs. The parallelism/concurrency in a parallel/distributed program can be measured by the ratio of the number of local (non-communication and non-shared memory access) operations to the total number of operations, including the communication or shared memory access operations.
Granularity of a program
The ratio of the amount of computation to the amount of communication within the parallel/distributed program is termed its granularity. If the degree of parallelism is coarse-grained (fine-grained), there are relatively many more (fewer) productive CPU instruction executions, compared to the number of times the processors communicate either via shared memory or message-passing and wait to get synchronized with the other processors. Programs with fine-grained parallelism are best suited for tightly coupled systems. These typically include SIMD and MISD architectures, tightly coupled MIMD multiprocessors (that have shared memory), and loosely coupled multicomputers (without shared memory) that are physically colocated. If programs with fine-grained parallelism were run over loosely coupled multicomputers
that are physically remote, the latency delays for the frequent communication over the WAN would significantly degrade the overall throughput. As a corollary, it follows that on such loosely coupled multicomputers, programs with a coarse-grained communication/message-passing granularity will incur substantially less overhead.
Figure 1.2 showed the relationships between the local operating system, the middleware implementing the distributed software, and the network protocol stack. Before moving on, we identify various classes of multiprocessor/multicomputer operating systems:
• The operating system running on loosely coupled processors (i.e., heterogeneous and/or geographically distant processors), which are themselves running loosely coupled software (i.e., software that is heterogeneous), is classified as a network operating system. In this case, the application cannot run any significant distributed function that is not provided by the application layer of the network protocol stacks on the various processors.
• The operating system running on loosely coupled processors, which are running tightly coupled software (i.e., the middleware software on the processors is homogeneous), is classified as a distributed operating system.
• The operating system running on tightly coupled processors, which are themselves running tightly coupled software, is classified as a multiprocessor operating system. Such a parallel system can run sophisticated algorithms contained in the tightly coupled software.
1.5 Message-passing systems versus shared memory systems
Shared memory systems are those in which there is a (common) shared address space throughout the system. Communication among processors takes place via shared data variables, and control variables for synchronization among the processors. Semaphores and monitors that were originally designed for shared memory uniprocessors and multiprocessors are examples of how synchronization can be achieved in shared memory systems. All multicomputer (NUMA as well as message-passing) systems that do not have a shared address space provided by the underlying architecture and hardware necessarily communicate by message passing. Conceptually, programmers find it easier to program using shared memory than by message passing. For this and several other reasons that we examine later, the abstraction called shared memory is sometimes provided to simulate a shared address space. For a distributed system, this abstraction is called distributed shared memory. Implementing this abstraction has a certain cost but it simplifies the task of the application programmer. There also exists a well-known folklore result that communication via message-passing can be simulated by communication via shared memory and vice versa. Therefore, the two paradigms are equivalent.
1.5.1 Emulating message-passing on a shared memory system (MP → SM)
The shared address space can be partitioned into disjoint parts, one part being assigned to each processor. “Send” and “receive” operations can be implemented by writing to and reading from the destination/sender processor’s address space, respectively. Specifically, a separate location can be reserved as the mailbox for each ordered pair of processes. A Pi–Pj message-passing can be emulated by a write by Pi to the mailbox and then a read by Pj from the mailbox. In the simplest case, these mailboxes can be assumed to have unbounded size. The write and read operations need to be controlled using synchronization primitives to inform the receiver/sender after the data has been sent/received.
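A minimal sketch of such a mailbox in Python, using a condition variable as the synchronization primitive (class and method names are illustrative; one such channel would be reserved per ordered pair of processes):

```python
import threading
from collections import deque

class Mailbox:
    """Unbounded mailbox reserved in the shared address space for one ordered pair."""
    def __init__(self):
        self._queue = deque()
        self._cv = threading.Condition()

    def send(self, data):              # Pi emulates Send by a write to the mailbox
        with self._cv:
            self._queue.append(data)
            self._cv.notify()          # inform the receiver that data is available

    def receive(self):                 # Pj emulates Receive by a read from the mailbox
        with self._cv:
            while not self._queue:
                self._cv.wait()        # block until the sender has written
            return self._queue.popleft()
```

In a real shared memory system the deque would live in the shared segment; here it merely stands in for the reserved mailbox locations.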
1.5.2 Emulating shared memory on a message-passing system (SM → MP)
This involves the use of “send” and “receive” operations for “write” and
“read” operations. Each shared location can be modeled as a separate process;
“write” to a shared location is emulated by sending an update message to the corresponding owner process; a “read” to a shared location is emulated by sending a query message to the owner process. As accessing another processor’s memory requires send and receive operations, this emulation is expensive. Although emulating shared memory might seem to be more attractive from a programmer’s perspective, it must be remembered that in a distributed system, it is only an abstraction. Thus, the latencies involved in read and write operations may be high even when using shared memory emulation because the read and write operations are implemented by using network-wide communication under the covers.
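A sketch of the owner-process idea, emulating one shared location with a thread that serializes update and query messages (all names are hypothetical):

```python
import threading, queue

def location_owner(requests):
    """Owner process for one shared location: applies writes, answers reads."""
    value = 0
    while True:
        op, arg, reply = requests.get()
        if op == 'write':
            value = arg
            reply.put('ack')           # update message acknowledged
        elif op == 'read':
            reply.put(value)           # query message answered
        else:
            return                     # shutdown, for this sketch only

requests = queue.Queue()
threading.Thread(target=location_owner, args=(requests,), daemon=True).start()

reply = queue.Queue()
requests.put(('write', 42, reply)); reply.get()   # emulated "write 42"
requests.put(('read', None, reply))
result = reply.get()                              # emulated "read" yields 42
```

Note that every read and write costs a full message round trip to the owner, which is precisely why this emulation is expensive.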
An application can of course use a combination of shared memory and message-passing. In a MIMD message-passing multicomputer system, each
“processor” may be a tightly coupled multiprocessor system with shared memory. Within the multiprocessor system, the processors communicate via shared memory. Between two computers, the communication is by message passing. As message-passing systems are more common and more suited for wide-area distributed systems, we will consider message-passing systems more extensively than we consider shared memory systems.
1.6 Primitives for distributed communication
1.6.1 Blocking/non-blocking, synchronous/asynchronous primitives
Message send and message receive communication primitives are denoted Send() and Receive(), respectively. A Send primitive has at least two parameters – the destination, and the buffer in the user space containing the data to be sent. Similarly, a Receive primitive has at least two parameters – the
source from which the data is to be received (this could be a wildcard), and the user buffer into which the data is to be received.
There are two ways of sending data when the Send primitive is invoked – the buffered option and the unbuffered option. The buffered option, which is the standard option, copies the data from the user buffer to the kernel buffer.
The data later gets copied from the kernel buffer onto the network. In the unbuffered option, the data gets copied directly from the user buffer onto the network. For the Receive primitive, the buffered option is usually required because the data may already have arrived when the primitive is invoked, and needs a storage place in the kernel.
The following are some definitions of blocking/non-blocking and syn- chronous/asynchronous primitives [12]:
• Synchronous primitives A Send or a Receive primitive is synchronous if both the Send() and Receive() handshake with each other. The processing for the Send primitive completes only after the invoking processor learns that the other corresponding Receive primitive has also been invoked and that the receive operation has been completed. The processing for the Receive primitive completes when the data to be received is copied into the receiver’s user buffer.
• Asynchronous primitives A Send primitive is said to be asynchronous if control returns back to the invoking process after the data item to be sent has been copied out of the user-specified buffer.
It does not make sense to define asynchronous Receive primitives.
• Blocking primitives A primitive is blocking if control returns to the invoking process after the processing for the primitive (whether in synchronous or asynchronous mode) completes.
• Non-blocking primitives A primitive is non-blocking if control returns back to the invoking process immediately after invocation, even though the operation has not completed. For a non-blocking Send, control returns to the process even before the data is copied out of the user buffer. For a non-blocking Receive, control returns to the process even before the data may have arrived from the sender.
For non-blocking primitives, a return parameter on the primitive call returns a system-generated handle which can later be used to check the status of completion of the call. The process can check for the completion of the call in two ways. First, it can keep checking (in a loop or periodically) if the handle has been flagged or posted. Second, it can issue a Wait with a list of handles as parameters. The Wait call usually blocks until one of the parameter handles is posted. Presumably, after issuing the primitive in non-blocking mode the process has done whatever actions it could and now needs to know the status of completion of the call; therefore using a blocking Wait() call is usual programming practice. The code for a non-blocking Send would look as shown in Figure 1.7.
Figure 1.7 A non-blocking Send primitive. When the Wait call returns, at least one of its parameters is posted.

    Send(X, destination, handle_k)                           // handle_k is a return parameter
    Wait(handle_1, handle_2, ..., handle_k, ..., handle_m)   // Wait always blocks
If at the time that Wait() is issued, the processing for the primitive (whether synchronous or asynchronous) has completed, the Wait returns immediately. The completion of the processing of the primitive is detectable by checking the value of handle_k. If the processing of the primitive has not completed, the Wait blocks and waits for a signal to wake it up. When the processing for the primitive completes, the communication subsystem software sets the value of handle_k and wakes up (signals) any process with a Wait call blocked on this handle_k. This is called posting the completion of the operation.
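The handle-and-Wait pattern resembles futures in modern libraries. The following sketch uses Python’s concurrent.futures as a stand-in for the communication subsystem; the transmit function and its return string are purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def transmit(data, destination):
    # stand-in for the subsystem copying data out of the user buffer and sending it
    return f"sent {data!r} to {destination}"

executor = ThreadPoolExecutor(max_workers=1)

# Non-blocking Send: control returns immediately with a handle (here, a future).
handle = executor.submit(transmit, "X", "Pj")

# ... the process performs whatever other work it can here ...

# Wait blocks until at least one of the listed handles is posted.
done, _ = wait([handle], return_when=FIRST_COMPLETED)
result = handle.result()               # status/result of the completed operation
executor.shutdown()
```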
There are therefore four versions of the Send primitive – synchronous blocking, synchronous non-blocking, asynchronous blocking, and asynchronous non-blocking. For the Receive primitive, there are the blocking synchronous and non-blocking synchronous versions. These versions of the primitives are illustrated in Figure 1.8 using a timing diagram. Here, three time lines are
Figure 1.8 Blocking/non-blocking and synchronous/asynchronous primitives [12]. Process Pi is sending and process Pj is receiving. (a) Blocking synchronous Send and blocking (synchronous) Receive. (b) Non-blocking synchronous Send and non-blocking (synchronous) Receive. (c) Blocking asynchronous Send. (d) Non-blocking asynchronous Send.
[Timing-diagram legend: S – Send primitive issued; S_C – processing for Send completes; R – Receive primitive issued; R_C – processing for Receive completes; W – process may issue Wait to check completion of the non-blocking operation; P – completion of the previously initiated non-blocking operation. Each process is shown with three time lines: process, buffer, and kernel. Shading distinguishes the duration to copy data from or to the user buffer from the duration in which the process issuing the send or receive primitive is blocked.]