An Efficient Virtual Network Interface in the FUGU Scalable Workstation


by

Kenneth Martin Mackenzie

S.B., Massachusetts Institute of Technology (1990)

S.M., Massachusetts Institute of Technology (1990)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 1998


Author
Department of Electrical Engineering and Computer Science
22 December 1997

Certified by
Anant Agarwal
Associate Professor of Computer Science and Engineering
Thesis Supervisor

Certified by
M. Frans Kaashoek
Associate Professor of Computer Science and Engineering
Thesis Supervisor

Accepted by
Arthur C. Smith
Chairman, Departmental Committee on Graduate Students


An Efficient Virtual Network Interface in the FUGU Scalable Workstation

by

Kenneth Martin Mackenzie

Submitted to the Department of Electrical Engineering and Computer Science on 22 December 1997 in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Abstract

A scalable workstation is one vision of a mainstream parallel computer: a machine that combines scalable, fine-grain communication facilities for parallel applications with virtual memory and preemptive multiprogramming to support general-purpose workloads. A key challenge in a scalable workstation is the Virtual Network Interface (VNI) problem: high-performance communication for parallel programming depends on a tight coupling between the application and the network, while multiprogramming and virtual memory effects disrupt that coupling.

This thesis introduces and evaluates the “direct” virtual network interface: a solution to the VNI problem for fine-grain messages in a scalable workstation. The direct VNI employs two complementary architectural techniques to reconcile speed and protection. First, two-case delivery optimistically provides direct, user-level access to network interface hardware but also transparently backs the direct system with a robust, software-buffered system. Two-case delivery allows the scalable workstation to support both good parallel application performance through the fast hardware interface and good global system performance by permitting buffering when required for multiprogramming. Second, the software-buffered mode uses virtual buffering to provide effectively unlimited buffer capacity by storing messages in dynamically managed virtual memory. Virtual buffering gives the user the convenient illusion of a very large buffer while giving the operating system the means to minimize actual, physical memory consumption.
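The two-case idea described above can be illustrated with a minimal sketch. This is not the FUGU implementation: the `Endpoint` class, its method names, and the counters are hypothetical, and an in-memory queue that grows on demand stands in for buffering in dynamically managed virtual memory. The sketch only shows the control decision: deliver directly when the receiving process is running and nothing is queued, otherwise fall back to the software buffer and drain it in arrival order once the process runs again.

```python
from collections import deque


class Endpoint:
    """Hypothetical per-process message endpoint (illustrative names only)."""

    def __init__(self, handler):
        self.handler = handler      # user-level message handler
        self.scheduled = True       # is the owning process currently running?
        self.buffer = deque()       # software buffer; grows on demand
        self.direct_count = 0       # messages delivered on the direct path
        self.buffered_count = 0     # messages diverted to the software buffer

    def arrive(self, msg):
        """Called when a message for this endpoint arrives from the network."""
        if self.scheduled and not self.buffer:
            # Case 1: direct path, hand the message straight to the handler.
            self.direct_count += 1
            self.handler(msg)
        else:
            # Case 2: owner not running, or older messages are still queued,
            # so buffer the message to preserve arrival order.
            self.buffered_count += 1
            self.buffer.append(msg)

    def deschedule(self):
        """The owning process loses the CPU; later arrivals are buffered."""
        self.scheduled = False

    def schedule(self):
        """The owning process resumes; drain the buffer in arrival order."""
        self.scheduled = True
        while self.buffer:
            self.handler(self.buffer.popleft())


# Usage: one message arrives directly, two arrive while the process is
# descheduled, and one more arrives after the buffer has been drained.
received = []
ep = Endpoint(received.append)
ep.arrive("a")
ep.deschedule()
ep.arrive("b")
ep.arrive("c")
ep.schedule()
ep.arrive("d")
```

In this sketch the receiver sees the same message stream on either path, which mirrors the transparency the abstract attributes to two-case delivery; the real system makes this decision in hardware and operating system code rather than in a user-level class.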

The direct VNI ideas are implemented in an experimental scalable workstation, FUGU, consisting of emulated hardware, a matching simulator and a custom operating system. Results from workloads of real and synthetic applications show that the direct VNI provides high performance because the direct case is both fast and common. Microbenchmarks show that the protected direct delivery case costs only 60% (tens of cycles per message) more than unprotected messages on the same hardware. Further, in a mixed workload experiment, we observe that our parallel applications see only 14–33% of messages buffered when 10% of the CPU time is devoted to unrelated, high-priority, interactive tasks.

Finally, results show that physical buffering requirements remain naturally low in real applications despite the combination of unacknowledged messages and unlimited buffering.

Thesis Supervisor: Anant Agarwal

Title: Associate Professor of Computer Science and Engineering

Thesis Supervisor: M. Frans Kaashoek

Title: Associate Professor of Computer Science and Engineering


Acknowledgments

This research was supported in part by NSF grant # MIP-9012773, in part by ARPA contract # N00014-94-1-0985, in part by an NSF Presidential Young Investigator Award to Anant Agarwal, and in part by an NSF National Young Investigator Award to M. Frans Kaashoek.

Contents

1.1 Challenges in a Scalable Workstation
1.2 An Efficient Virtual Network Interface
1.3 Contributions
1.4 Roadmap

2 The Virtual Network Interface Problem
2.1 Programmability
2.2 Protection
2.3 Performance
2.4 Problem Statement

3 A Direct Virtual Network Interface
3.1 Programmability
3.2 Protection
3.3 Performance
3.3.1 Two-Case Delivery
3.3.2 Virtual Buffering
3.3.3 Programmer-Visible Performance Tradeoffs
3.4 Discussion

4 Two-Case Delivery Technique
4.1 Direct Access Path
4.2 Buffered Path
4.3 Transparent Access
4.4 Discussion

5 Virtual Buffering Technique
5.1 Unlimited Buffering
5.2 User Flow Control
5.3 Resource Management
5.4 Discussion

6 Experimental System
6.1 Hardware
6.1.1 Emulated Hardware
6.1.2 Fast Simulator
6.2 System Software
6.2.1 Operating System
6.2.2 Scheduler
6.3 Libraries

7 Results
7.1 Applications and Standalone Performance
7.2 Mixed Workload Performance
7.2.1 A Mixed Workload Experiment
7.2.2 Mixed Workload with Real Applications
7.3 Mixed Workload Analysis
7.4 Buffer Consumption
7.4.1 Artificially Induced Buffering
7.4.2 Limits to Buffering Behavior
7.5 Overflow Control

8 Related Work
8.1 Messaging Models
8.2 Network Interfaces
8.3 Miscellaneous

9 Conclusion

A Bulk Transfer


List of Figures

2-1 Communication model abstraction levels.
2-2 Undeliverable messages.
2-3 Approaches to buffering.
3-1 Names of FUGU components.
3-2 Message timeline for the fast path.
3-3 Protocol deadlock situation.
3-4 Simple protection based on Group Identifiers (GIDs).
3-5 Direct virtual network interface operating modes.
4-1 Message timeline for the fast path.
4-2 Message timeline for the buffered path.
4-3 Direct virtual network interface registers.
4-4 Three revocable interrupt disable examples.
4-5 Software buffering components.
5-1 Virtual buffering components.
5-2 Message reception cases.
5-3 Levels of flow control in a message system.
5-4 Virtual buffering system operating modes.
5-5 Overflow control mechanics.
5-6 Conventional paging system mechanics.
5-7 Virtual buffer queue length threshold.
6-1 Approaches to mixed hardware-software system evaluation.
6-2 Block diagram of a single FUGU node.
6-3 Photograph of a single FUGU node.
6-4 Software structure of a single FUGU node.
6-5 Typical FUGU workload.
7-1 Application speedup standalone.
7-2 Impact of increased message overhead.
7-3 Ideal mixed workload with independent scheduling.
7-4 Actual mixed workload with independent scheduling.
7-5 Mixed workload with co-scheduling.
7-6 Mixed workload timeline.
7-7 Expected slowdowns under interactive scheduling.
7-8 Expected slowdown under gang scheduling.
