HAL Id: tel-02315651
https://hal.univ-lorraine.fr/tel-02315651
Submitted on 14 Oct 2019
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Software Datapaths for Multi-Tenant Packet Processing
Paul Chaignon
To cite this version:
Paul Chaignon. Software Datapaths for Multi-Tenant Packet Processing. Networking and Internet Architecture [cs.NI]. Université de Lorraine, 2019. English. ⟨NNT: 2019LORR0062⟩. ⟨tel-02315651⟩
NOTICE
This document is the result of extensive work approved by the defense committee and made available to the wider university community.
It is subject to the author's intellectual property rights. This implies an obligation to cite and reference the document when using it.
Furthermore, any counterfeiting, plagiarism, or unlawful reproduction is liable to criminal prosecution.
Contact : [email protected]
LINKS
Code de la Propriété Intellectuelle. articles L 122. 4
Code de la Propriété Intellectuelle. articles L 335.2- L 335.10 http://www.cfcopies.com/V2/leg/leg_droi.php
http://www.culture.gouv.fr/culture/infos-pratiques/droits/protection.htm
École doctorale IAEM Lorraine
Software Datapaths for Multi-Tenant Packet Processing
Ph.D. Thesis
to obtain the degree of Doctor of Philosophy issued by
University of Lorraine
(Specialization in Computer Science)
publicly presented and discussed May 7th, 2019 by
Paul Chaignon
Committee in charge
President: Marine Minier, Professor, University of Lorraine
Reporters: Pierre Sens, Professor, Pierre and Marie Curie University
Laurent Mathy, Professor, University of Liège
Examiners: Fulvio Risso, Associate Professor, Polytechnic University of Turin
Filip De Turck, Professor, Ghent University
Advisors: Kahina Lazri, Research Engineer, Orange Labs
Jérôme François, Researcher, Inria
Olivier Festor, Professor, University of Lorraine
Laboratoire Lorrain de Recherche en Informatique et ses Applications — UMR 7503
Acknowledgments
First of all, I would like to thank my academic advisors, Jérôme François and Olivier Festor, for their support and availability throughout my Ph.D. studies despite my limited stays at Inria Nancy.
I would also like to acknowledge Fulvio Risso, Pierre Sens, Laurent Mathy, Filip De Turck, and Marine Minier for agreeing to join my dissertation committee. I'm particularly thankful to Fulvio Risso who drove a great distance to attend my (inconsequential) defense.
I remember the first time I met most people who ended up having a significant contribution to my thesis. I met Kahina Lazri in her office, on the first day of my Master internship, and remembered attending her presentation at SSTIC'14. Little did I know that I would pursue my studies, just six months later, as a Ph.D. student under her guidance. Thank you Kahina for your trust and unwavering support. I know I often was able to focus on technical aspects of my work thanks to you.
On the first day of my internship, I also met Xiao. He was sharing a small and dark office with Aurélien, surrounded by computer parts, development boards, and books. I cannot thank Xiao enough for the many times he helped me understand weird code behaviors, debug my programs, or rewrite an introduction on a Friday evening. Passing from Xiao's office to Maxime's, next door, was quite a change; Maxime had a large, bright, and tidy office. Maxime's self-discipline has been an inspiration to me. Three years later, I find myself having (somewhat) successfully copied several of Maxime's good habits¹. Weirdly enough, I don't remember when I first met Alex. Yet, Alex probably had the most influence on my work. Having read enough papers to write several theses, Alex often gave me short, inspirational summaries of research trends in systems, which guided me in my research. I'm also grateful for the many talks we had on the way to the subway (after he went back for the forgotten keys).
I also want to thank my interns, Thibault and Diane, whom I met respectively for his interview and on the first day of her internship. Their contagious motivation arrived just when I needed it. I unfortunately cannot thank every single past and present colleague at Orange Labs, but I am really grateful to all of them for the shared times working (a little) and laughing (a lot). If there is one last colleague I ought to thank, it is Pascal. Pascal equally shares his time between work, laughter, and miscellaneous concerns², but at Orange Labs, he accounts for a large part of the team's good atmosphere.
Of course, I remember seeing Céline for the first time³. Thank you Céline for your love and our discussions, which gave me another point of view on many aspects of my thesis and life.
My thanks also go to my family, especially Lou, who proof-read my first papers before morning deadlines, while the rest of us, mere diurnal creatures, slept.
Before concluding these acknowledgments, I wanted to thank the many writers who posted advice and accounts of Ph.D. studies online, including Philip Guo's The PhD Grind. These texts often helped me look past my own experiences.
¹ Yes Maxime, I even ditched the 5pm biscuits for fruits.
² Based on real-life measurements.
³ Students, do go to summer school, you never know...
To my grandparents,
Contents
Chapter 1 Introduction
1.1 Motivation
1.1.1 Network Functions in Cloud Platforms
1.1.2 Performance Challenges
1.1.3 Opportunities at the Infrastructure Layer
1.2 Contributions and results
1.3 Overview of the thesis
Chapter 2 Multi-Tenant Networking Architectures
2.1 Packet Demultiplexing and Delivery
2.1.1 Virtualization
2.1.2 Operating System
2.1.3 Software Memory Isolation
2.2 Software Switches
2.2.1 Packet Switching
2.2.2 Forwarding Pipelines
2.2.3 Extensibility
2.3 Packet Processing Offloads
2.3.1 Hardware Offloads
2.3.2 Offloading Tenant Workloads
2.3.3 Conclusion
Chapter 3 Software Switch Extensions
3.1 Introduction
3.2 Design Constraints
3.3 Oko: Design
3.3.1 Oko Workflow
3.3.2 Safe Execution of Filter Programs
3.3.3 Flow Caching
3.3.4 Control Plane
3.4 Filter Program Examples
3.4.1 Stateless Signature Filtering
3.4.2 Stateful Firewall
3.4.3 Dapper: TCP Performance Analysis
3.5 Evaluations
3.5.1 Evaluation Environment
3.5.2 Microbenchmarks
3.5.3 End-to-End Comparisons
3.6 Related Work
3.7 Conclusion
Chapter 4 Offloads to the Host
4.1 Introduction
4.2 Background on High-Performance Datapaths
4.3 Design
4.3.1 Offload Workflow
4.3.2 Program Safety
4.3.3 Run-to-Completion CPU Fairness
4.3.4 Per-Packet Tracing of CPU Shares
4.4 Implementation
4.4.1 In-Driver Datapath
4.4.2 Userspace Datapath
4.5 Offload Examples
4.5.1 TCP Proxy
4.5.2 DNS rate-limiter
4.6 Evaluations
4.6.1 Evaluation Setup
4.6.2 Fairness Mechanism
4.6.3 Performance Gain
4.7 Related Work
4.8 Conclusion
Chapter 5 Conclusion
5.1 Beyond Datacenter Networks
5.2 Runtime Software Switch Specialization
5.3 Low-Overhead Software Isolation for Datapath Extensions
5.4 The Heterogeneity of Packet Processing Devices
Appendix A Example Filter Program Trees
Appendix B Example BPF Bytecode
Appendix C French Summary
Bibliography
List of Figures
2.1 Illustration of ClickOS, NetVM, and ptnetmap's core assignments
2.2 Paths of packets through memory boundaries and demultiplexers in multi-tenant platforms
2.3 Open vSwitch architecture
2.4 Open vSwitch caching architecture
3.1 Oko workflow from the compilation of the extension to its execution in the switch
3.2 Two examples of filter program trees
3.3 Cumulative distribution of packets over OpenFlow rules
3.4 Comparison of packet classification performance between Open vSwitch and Oko
3.5 Packet classification performance for different filter program chain lengths
3.6 Performance evaluation for the three Oko use cases
3.7 Illustration of the forwarding pipelines for the three example filter programs
3.8 The three evaluation setups for the end-to-end performance comparison
3.9 Comparison of performance for the three use cases, with Oko, a vhost-user KVM virtual machine, and a DPDK Ring Port process
4.1 Offload workflow from the request to the API to the execution on the host or NIC
4.2 Comparison of the CPU consumption of each tenant, with and without fairness mechanism
4.3 Packet processing performance with and without the tracing probes
4.4 Packet processing performance for different fairness mechanisms
4.5 Packet processing performance gain from offload
A.1 Example filter program tree for an Oko pipeline with a linear growth of the cache
A.2 Example filter program tree for an Oko pipeline with an exponential growth of the cache
Chapter 1
Introduction
1.1 Motivation
Cloud computing realizes the dream of utility computing by virtualizing and sharing hardware resources across tenants, thereby allowing crucial economies of scale. The illusion of infinite resources and the elimination of up-front infrastructure investments convinced customers to outsource their processing workloads to the cloud [7]. While system virtualization enables sharing the CPU, the memory, and I/O devices across tenants, properly sharing the network itself requires additional care: tenants should be properly isolated, both in terms of performance and physical access, while each having their own view of the network. Cloud computing therefore calls for multi-tenant networks: networks capable of delivering packets to multiple, isolated tenants, from the core network to tenant domains on end hosts.
Whereas public and private clouds initially hosted few network IO-bound workloads, there is now considerable pressure on multi-tenant networks, and in particular on end hosts, to deliver packets with high throughput and low latency. The cause of this change is twofold. On the one hand, virtual machine density on hosts is increasing, pushing cloud providers to adopt 10Gbps and 40Gbps network interface cards (NIC). On the other hand, due to the rise of Network Functions Virtualization (NFV) and recent advances in software packet processing, network IO-bound workloads are becoming more common in virtual machines. Initially proposed by ISPs, the NFV paradigm advocates a move away from specialized hardware appliances, towards software implementations running on commodity hardware.
Offering high performance network interfaces to tenants over commodity hardware is however challenging for cloud providers. Cloud providers must process packets in and out of virtual machines with as few resources as possible, always in an effort to increase virtual machine density on hosts. In addition, commodity hardware achieves high performance through the use of numerous caches (e.g., the instruction and data caches and the TLB) and pipelining optimizations (e.g., out-of-order execution and branch target prediction). These components and optimizations, as well as the contention for shared hardware resources, introduce variability in packet processing performance. Finally, the need to share the network across tenants on end hosts adds to the high performance challenge: taking packets across memory isolation boundaries has a cost, which we investigate in Section 2.1.
As discussed in Chapter 2, cloud providers increasingly rely on specialized, high performance hardware to provide network accesses to tenants. While some implement the entire datapath on dedicated, special hardware [66], most achieve high performance through partial offloads of network functions [89, 33, 50]. As the contrast in cloud providers' reliance on hardware highlights, hardware implementations are not without their drawbacks. First, hardware devices are rather inflexible: they have limited resources, are difficult to upgrade, and require expertise in dedicated, low-level languages to program. Emerging programmable devices, although easier to upgrade, retain the resource limitations of fixed-function hardware. Worse, these devices may simply not be available to smaller cloud providers for several years.
In this thesis, we argue that significant performance improvements are attainable in end-host multi-tenant networks, without depending on specialized hardware. While we do not expect software implementations to meet the performance of hardware devices, we will show that software datapaths can be designed to process packets several times more efficiently than current solutions, thereby reducing the performance gap. In particular, we advocate for a consolidation of cloud provider services on the host to reduce the cost of packet processing between the NIC and the virtual machines. Such consolidated datapaths also enable tenants to offload network services from their virtual machines to the host.
1.1.1 Network Functions in Cloud Platforms
Cloud platforms run a large and diverse set of network functions, to support a variety of needs.
• Connectivity and Isolation. At the very least, the host terminates tunnels, forwards packets according to L2 or L3 lookups, and communicates with (or emulates) the network drivers of virtual machines.
• Traffic Engineering. Cloud networks rely on a number of traffic engineering functions at the end hosts, both for their own needs (e.g., accounting) and for tenants (e.g., load balancing services).
• Security. In addition to the common firewalling services exposed to tenants, cloud platforms may run Intrusion Detection/Prevention Systems (IDPSes), rate limiters, and VPNs. These functions protect both the cloud against malicious users and tenants against external attackers.
• Tenant Services. The most diverse set of functions likely runs in virtual machines. Tenants may create virtual machines to run their own network functions, from enterprise middleboxes [130] to Evolved Packet Core (EPC) components [76].
These functions are diverse in goals, but also in terms of processing overhead (rate limiters have a small cost compared to IPSec VPNs) and responsibility: some are the responsibility of cloud providers, others of tenants. However, these functions have in common that they all run inline, on the path of packets to virtual machines, making their efficiency critical to the end-to-end performance.
1.1.2 Performance Challenges
The datapaths of cloud platforms must provide the aforementioned network functions to tenants with high throughput and low latency. They however face a number of challenges, which hinder their efficiency.
Long Datapaths. In addition to the already costly reception and transmission of packets, cloud datapaths must execute a large number of operations on packets before sending them to the virtual machines or to the NIC. Nevertheless, datapaths have a budget of only a few CPU cycles per packet: on a 2.5GHz CPU core, minimum sized, 64B packets must be processed in an average of less than 168 cycles to support line rate 10Gbps processing.
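The 168-cycle figure can be reproduced with a short calculation. The sketch below assumes that each 64B frame occupies an additional 20B on the wire (preamble, start-of-frame delimiter, and inter-frame gap), which is the usual convention for Ethernet line-rate figures; the function names are only illustrative.

```python
# Per-packet CPU cycle budget at line rate.
# Assumption: 20 B of per-frame Ethernet overhead on the wire
# (7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap).

ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum packet rate on a link for a given Ethernet frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(cpu_hz: float, link_bps: float, frame_bytes: int) -> float:
    """Average number of cycles a single core can spend on each packet."""
    return cpu_hz / packets_per_second(link_bps, frame_bytes)


rate = packets_per_second(10e9, 64)
budget = cycles_per_packet(2.5e9, 10e9, 64)
print(f"{rate / 1e6:.2f} Mpps, {budget:.0f} cycles per packet")
```

Under these assumptions, a 10Gbps link carries about 14.88 million minimum-sized packets per second, which leaves roughly 168 cycles per packet on a single 2.5GHz core.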
Inefficient Detours. Many services are implemented as inefficient detours from the existing datapath. For example, a load balancer may run as a separate process to which the software switch redirects packets before sending them to virtual machines. These detours encode organizational boundaries rather than technical constraints: in the previous example, different entities or companies may develop the load balancer and the software switch.
Performance Gap. As we have seen with the examples of network functions, datapaths execute a number of operations with very different costs. One operation in particular, common to all multi-tenant datapaths, has a much higher cost than others: the exchange of packets between the host and the virtual machines. This higher cost induces a performance gap in the datapath operations: on the receive path, performance gains on operations executed after the gap may have a lesser effect on the overall datapath performance than if those operations had been executed before the gap. For example, aggregating packets or dropping illegitimate packets likely reduces the processing cost for subsequent network functions and will have a much higher impact if executed before the copy to virtual machines. More generally, such optimizations are best executed as early as possible during datapath processing.
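A toy cost model illustrates why the position of a filter relative to the gap matters. The cycle counts and drop ratio below are made-up values chosen for illustration, not measurements from this thesis, and the function name is hypothetical.

```python
# Toy cost model for the performance gap.
# All numbers are illustrative assumptions, not measured values.

def cycles_per_received_packet(filter_cost, copy_cost, drop_ratio,
                               filter_before_copy):
    """Average cycles spent per packet arriving at the NIC.

    filter_cost: cycles to run the filter on one packet
    copy_cost: cycles to hand one packet over to the virtual machine
    drop_ratio: fraction of packets dropped by the filter (0.0 to 1.0)
    filter_before_copy: True if the filter runs on the host, before the copy
    """
    if filter_before_copy:
        # Every packet is filtered; only surviving packets are copied.
        return filter_cost + (1.0 - drop_ratio) * copy_cost
    # Every packet is copied first, then filtered in the virtual machine.
    return copy_cost + filter_cost


before = cycles_per_received_packet(100, 1000, 0.5, filter_before_copy=True)
after = cycles_per_received_packet(100, 1000, 0.5, filter_before_copy=False)
print(before, after)
```

With these assumed costs, filtering before the copy costs 600 cycles per received packet on average against 1100 when filtering after it; the saving grows with both the drop ratio and the cost of the copy.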
1.1.3 Opportunities at the Infrastructure Layer
We argue that these challenges can be overcome by designing key components of the datapath to follow two high level principles.
Consolidate network functions in the datapath. Several network functions can be consolidated into a single component to reduce the cost of packet transmissions between functions.
Such consolidated functions eliminate the inefficient datapath detours, but must preserve the logical separations between network functions. Indeed, maintaining a separation of concerns is important as different network functions are likely to be operated and maintained by different groups of people.
Execute network functions before the performance gap. The datapath should offer solutions for tenants to offload some of their network functions to the host. In particular, both the tenant and the cloud provider benefit from running security services before the copy of packets to the virtual machines, since these services are more likely to drop packets and connections.
In applying these principles to existing multi-tenant datapaths, we face a number of obstacles which we address in this thesis' contributions.
Prevent failure propagation in consolidated components. Complex network functions are unlikely to be free of bugs. To prevent a failure inside a network function from propagating to other, consolidated functions, we need to maintain a form of isolation between functions. To isolate functions without hindering their exchanges of packets, we can rely on a large number of software memory isolation techniques, which we survey in Section 2.1.3.
Extend caching optimizations. Despite recurrent concerns about the performance variations caused by packet processing caches, as we highlight in Section 2.2, several large-scale production systems rely on caches, sometimes with several layers, to improve performance. Because these caching mechanisms make assumptions on the processing steps, they are however difficult to extend when consolidating network functions.
Isolate offloaded functions. Lightweight isolation techniques are also necessary to isolate the host from the offloaded network functions and to prevent one network function from accessing other tenants' data. In particular, although all offloaded functions run inside the same consolidated component, they should not be able to receive other tenants' packets or to access data structures owned by other offloaded functions.
Maintain resource fairness despite consolidation. In addition, hardware resources must be fairly shared among offloaded programs, as if the latter were running inside virtual machines or containers. However, because all offloaded functions are consolidated into a single program to improve performance, we cannot rely on the usual schedulers to enforce fairness.
1.2 Contributions and results
This thesis presents the state of the art of end-host multi-tenant networking as well as two novel systems to apply the aforementioned principles to concrete state-of-the-art systems and improve their end-to-end performance.
State of the art. Chapter 2 discusses the state of the art of end-host multi-tenant networking with a focus on packet processing performance bottlenecks. We review the literature on three aspects of end-host networking architectures that are key to the overall performance:
• The demultiplexing and delivery of packets to tenant domains accounts for a large portion of the processing cost in multi-tenant networking architectures. We take an in-depth look at processing costs and improvements to multi-tenant datapaths over the last 15 years, since the first paravirtualized drivers in Xen. We highlight the cost of hardware memory isolation and survey alternative isolation techniques.
• The demultiplexing logic, implemented at the software switch before the actual delivery of packets to tenant domains, can also have a severe impact on performance. We survey the literature on packet classification algorithms and their necessary optimizations. We contrast contributions from the literature with reports on experiences with large-scale production systems.
• Processing offloads delegate tasks to faster components, often implemented on specialized hardware, in order to improve the overall performance or to reduce the burden on the multi-tenant system itself. We identify and review three generations of hardware offloads and draw a parallel with offloads of tenants' workloads to the host.
Extensible software switch. We design the Oko software switch to execute stateful filtering and monitoring programs as part of its packet classification pipeline. Oko enables the consolidation at the software switch of network services running on the host. Oko achieves the following improvements over state-of-the-art systems:
• The extension of highly optimized packet classification algorithms with a limited impact on performance. Compared to the software switch on which Oko is based, our modifications add no measurable overhead (in number of packets processed per second) with all caches enabled and a 2% overhead with the first-level cache disabled.
• The isolation of faults from extensions with negligible runtime overhead. Oko relies on recent advances in software memory isolation to enforce isolation with a static analysis and few runtime bounds checks.
• Significant performance improvements compared to the virtual machines and separate processes that usually host extensions to the datapath. When evaluated with a set of network functions, Oko outperforms the same functions running inside virtual machines by 23x, and by 1.7–1.9x when running as processes separate from the switch.
We present Oko in Chapter 3.
Tenant offloads to the host. We design a framework to allow tenants to offload network services from their domains to the host's datapath, thereby benefiting from increased packet processing performance. Our framework demonstrates the following:
• The feasibility of performance isolation despite all tenant services running inside the same datapath. When evaluating our fairness mechanism with a web server benchmark, our system adds a 2.6% overhead on the web server's performance, but performs significantly better than the Linux scheduler under the same constraints (2–14x depending on the number of offloaded functions).
• The practicability of software memory isolation techniques to isolate tenants with low cost. In this framework, we extend the software memory isolation technique used in Oko to sandbox the services of each tenant.
• The benefit of host offloads in increasing the performance attainable by network services.
By executing network functions before the performance gap, i.e., the copy to the virtual machines or userspace processes, our system enables a 46x performance improvement.
We present our host offload work in Chapter 4.
1.3 Overview of the thesis
This thesis is structured as follows. In Chapter 2, we discuss the state of the art of end-host multi-tenant networking architectures through a review of the literature. In Chapter 3, we present Oko, an extensible software switch that serves as a basis to consolidate network services on the host.
In Chapter 4, we design our framework to offload tenant workloads from virtual machines and containers to the host datapath. Finally, in Chapter 5, we conclude this thesis with discussions on the increasing heterogeneity of packet processing hardware and the applicability of our work beyond cloud networks.
Chapter 2
Multi-Tenant Networking Architectures
In cloud computing infrastructures, the end host and the upstream network devices must execute a number of tasks characteristic of multi-tenant networking. These tasks include demultiplexing traffic to tenants, enforcing security policies, and delivering packets to tenants' domains. In this chapter, we discuss related work on performance challenges inherent to multi-tenant networking. In each section, after providing brief background information, we review efforts to improve the state of the art.
First, we survey work on packet demultiplexing and delivery and discuss its intrinsic performance limitations. We then review advances in software switching and the algorithmic optimizations involved in fast packet classification on commodity hardware. Finally, we investigate recent trends in packet processing offloads as a means to further improve performance.
2.1 Packet Demultiplexing and Delivery
When delivering packets to tenants' domains in multi-tenant networks, the last few processing steps are often performed at end hosts, on hardware shared with the tenants' workloads. Hence, their efficiency is critical. In this section, we describe these processing steps and survey recent works to improve their efficiency in multi-tenant setups. We focus on the receive path, but the transmit path has similar processing steps, in reverse order.
2.1.1 Virtualization
To enforce isolation between virtual machines, virtualization platforms must intercept all I/O operations. Networking is no exception: on the receive path, packets are demultiplexed to virtual machines based on their headers; on the transmit path, the hypervisor validates packets to prevent malicious behaviors (e.g., spoofing attacks or floods).
In the Xen virtualization platform [12], a privileged virtual machine, the host domain, is responsible for virtualizing I/O accesses. When packets arrive at the Network Interface Card (NIC), an interrupt is first routed to the hypervisor, which notifies the host domain. The NIC then DMAs the packet to the host domain's memory. At that point, the host domain can inspect headers to determine the destination guest domain (virtual machine) for that packet.
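The header inspection step can be sketched as a table lookup. The example below is a simplified illustration, not Xen's actual code: it assumes guests are identified by destination MAC address alone, and the table contents and guest names are hypothetical.

```python
# Sketch of receive-path demultiplexing in the host domain.
# Assumption: guests are identified by destination MAC address only;
# the table contents and guest names below are hypothetical.

ETH_HEADER_LEN = 14  # destination MAC (6 B) + source MAC (6 B) + EtherType (2 B)


def destination_mac(frame: bytes) -> str:
    """Extract the destination MAC address from a raw Ethernet frame."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame shorter than an Ethernet header")
    return ":".join(f"{byte:02x}" for byte in frame[:6])


def demultiplex(frames, mac_to_guest):
    """Group frames per destination guest; frames for unknown MACs are dropped."""
    queues = {guest: [] for guest in mac_to_guest.values()}
    dropped = 0
    for frame in frames:
        guest = mac_to_guest.get(destination_mac(frame))
        if guest is None:
            dropped += 1
        else:
            queues[guest].append(frame)
    return queues, dropped


mac_to_guest = {"52:54:00:00:00:01": "guest-a", "52:54:00:00:00:02": "guest-b"}
frame_a = bytes.fromhex("525400000001" "525400aaaaaa" "0800") + b"payload"
frame_unknown = bytes.fromhex("525400000009" "525400aaaaaa" "0800") + b"payload"
queues, dropped = demultiplex([frame_a, frame_unknown], mac_to_guest)
print({guest: len(queue) for guest, queue in queues.items()}, dropped)
```

In practice this lookup is performed by a bridge or software switch and may match on more header fields; the sketch only shows that the destination must be resolved on the host before the packet can be delivered to a guest.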
In the original version of Xen, packets were delivered to virtual machines by exchanging memory pages between the host domain and the guest domain, a technique often referred to as page flipping or page remapping [12]. A. Menon et al. [103] showed that the cost of mapping and unmapping memory pages for the exchange equals the cost of a large packet (1500B) copy. Page flipping was therefore abandoned in subsequent versions of Xen in favor of packet copies.
In [127], J. Santos et al. proposed a number of implementation and architectural improvements to Xen's networking path and performed an in-depth analysis of its CPU cost. Two architectural changes had a decisive role in the performance improvements:
• The guest's CPU becomes responsible for copying packets from the host memory to the guest's memory. Before this change, the host's CPU was copying packets to the guest domain. Since the guest's CPU is likely to read packets again afterward (if only to copy them to the guest's userspace), this change improves cache hit rates.
• In Xen, the guest grants the host domain the right to write to a few of its memory pages, in order to receive packets. J. Santos et al. removed the need to perform a new grant request per packet; the guest can now recycle pages previously granted to the host domain.
These two design changes stood the test of time and were retained in the more recent paravirtualized virtio driver [125]. In virtio, the host requests a few memory pages to write incoming packets. The guest's driver then copies packets to its own memory when it allocates its internal packet data structure (e.g., sk_buff in the Linux kernel).
Network Function Virtualization. With the advent of network function virtualization, researchers focused on improving virtualization performance for network I/O-bound workloads.
ClickOS [97] and NetVM [73] are two virtual platforms for network intensive workloads that come with revamped vNICs, software switches, and guest operating system designs.
ClickOS is based on Xen with many of J. Santos et al.'s improvements. On the host domain side, ClickOS relies on the VALE software switch [124] with two threads to poll packets from the NIC to the virtual machines and vice versa. On the guest side, they use a version of the MiniOS unikernel tailored for packet processing. In particular, they run a single thread in MiniOS to poll packets from the VALE vNIC. Since unikernel OSes have a single address space, using the MiniOS unikernel removes the need for an additional kernel to userspace copy and significantly boosts performance. To implement network functions, they rely on the Click packet processing framework [107].
They report a 14.2 Mpps forwarding speed through a virtual machine using 3 dedicated threads, one polling from the NIC on the host, the second polling from the vNIC in the virtual machine, and the last polling from the vNIC on the host and sending packets back through the NIC. The setup used in ClickOS's evaluation is illustrated in Figure 2.1, along with NetVM's.
Published the same year, NetVM took a fairly different approach. NetVM is based on the KVM hypervisor and uses a userspace packet processing library, DPDK [37], to poll packets from the physical and virtual NICs. In addition, where ClickOS uses VALE, NetVM comes with its own demultiplexing logic. More importantly, NetVM has a zero-copy design in which packets are DMAed to hugepages on the host, and virtual machines can read packets from these hugepages without copying them. This zero-copy design comes at the cost of isolation as any virtual machine can access all packets received from the NIC. In the design of NetVM, the authors mention, but neither implement nor evaluate, the possibility of isolating several trust groups.
They report a throughput of 14.88 Mpps with four dedicated threads, two polling from the NIC on the host on the receive path and from the vNICs on the transmit path, two in the virtual machine to poll packets from the vNIC, process them, and send them back to the host.
The authors doubled the number of polling threads to dedicate one to each of the two NUMA nodes of their system. Their evaluation is however limited by the 10Gbps NIC. In addition, their evaluation doesn't lend itself easily to comparison with ClickOS: they use one more polling thread and a significantly different CPU⁴. Taking the hardware differences into account, and given that NetVM requires one less copy per packet than ClickOS, it would likely still be able to saturate the 10Gbps NIC with a single NUMA node and two threads.
S. Garzarella et al. published ptnetmap [53], a virtualization platform for network intensive workloads similar to NetVM, but using a different packet processing library and software switch.
Contrary to NetVM, they evaluated both a fast, zero-copy design without memory isolation between virtual machines and a slower design with a single packet copy to enforce isolation. They performed thorough evaluations with varying packet batch sizes and throughput measurements both from the NIC to virtual machines and between virtual machines. With two polling threads (one in the host, one in the virtual machine) and the same batch size as ClickOS, they achieve a throughput of 14.88 Mpps, limited by the NIC capacity, to receive packets in an isolated virtual machine (one packet copy).
⁴ In particular, the CPU used in ClickOS's evaluation doesn't support DCA [72], meaning that packets are received in the main memory instead of the last-level CPU cache as in NetVM's evaluation.
Figure 2.1: Adaptation of an illustration of NetVM's core assignment [73, Figure 2(a)] to ClickOS and ptnetmap, with one panel per platform showing the cores used on the host and in the VM: (a) ClickOS, (b) NetVM, (c) ptnetmap. For NetVM, the cores on the second NUMA node are drawn in white with dashed arrows. Whereas ClickOS and NetVM evaluate a round-trip to the VM over the physical network, ptnetmap evaluates only the receive path, using a userspace process to free itself from the limitations of the physical network.
Perhaps of more interest is their evaluation of throughput between a sending userspace process and a receiving virtual machine, as illustrated in Figure 2.1. Because this evaluation is not limited by the NIC's capacity, it better highlights the cost of isolation. When copying packets once to enforce isolation, ptnetmap achieves around 17 Mpps, whereas without packet copies it achieves around 60 Mpps.
2.1.2 Operating System
The challenge of efficiently delivering packets in multi-tenant architectures is not limited to virtual machines. In a similar fashion, operating systems need to demultiplex packets to applications. Although packet filters were considered in the 1980s [104], operating systems nowadays generally demultiplex packets to userspace processes based on a few static transport layer fields.
In the most common operating systems, incoming packets are processed by a monolithic kernel up to the transport layer, at which point they are copied to their destination userspace process. Kernel processing involves a large number of indirections to handle the many possible protocols, security checks to reject invalid and spoofed packets, and statistic updates [115].
In recent years, several frameworks [37, 123, 22] have been proposed to remove these overheads by receiving packets directly in a userspace process. These frameworks all actively poll packets from the NIC, preallocate memory to avoid dynamic per-packet allocations, and process packets in batches to reduce cache misses. netmap [123], however, contrasts with DPDK [37] and PF_RING [22] by two of its design choices. First, whereas DPDK and PF_RING depart from the POSIX API, netmap retains the Linux kernel for the initialization and uses standard system calls to initiate packet transfers. Second, netmap does not monopolize the CPU core on which it is running, but instead dynamically adjusts the wait time between two polling operations depending on the load.
These frameworks, however, are unfit for multi-tenant architectures on their own. Because they receive packets directly into the process' memory space, they can only support a single memory space.
To overcome this limitation and allow multiple applications to receive packets directly in userspace, the Arrakis OS [115] leverages new capabilities from SmartNICs. In particular, Arrakis delegates security checks and demultiplexing to the NIC. The NIC should support complex predicates to allow for demultiplexing of packets to virtual NICs, which userspace applications control. Intel Flow Director [77], for example, enables demultiplexing of packets to queues based on transport layer fields. These features are nevertheless difficult to support in hardware, especially in the face of packet fragmentation and new, unsupported tunneling protocols.
Instead of SmartNICs, IX [13] relies on hardware virtualization to securely receive packets in userspace processes. Its security relies on an ingenious design in which a non-root ring 0, run-to-completion thread switches to ring 3 before executing the application's logic.
Packet delivery to processes has also been investigated in the context of containers, processes sandboxed by the kernel. OpenNetVM [152] adapts NetVM to containers. Packets are received into a shared memory area that all containers can read, with the same drawback as NetVM: one application can read and modify packets destined to other applications. In contrast, VIRTIO-USER [138] is a new interface, based on virtio [125], to securely deliver packets from a host userspace application (e.g., a userspace software switch) to containers; it enforces isolation through one packet copy.
2.1.3 Software Memory Isolation
Figure 2.2 illustrates the different trade-offs between performance and isolation taken by vanilla Xen [12], NetVM [73], ClickOS [97], and ptnetmap [53]. Whereas ClickOS requires two copies to bring packets to tenants, NetVM, with a single tenant per NIC, requires only one copy per packet. ClickOS's second copy is necessary to bring the packet across the hardware memory boundary, from the demultiplexer's memory space into the destination application's. Although this copy can be avoided by demultiplexing packets at the NIC, most NICs still have very limited demultiplexing logic [67, 55]. They are often unable to parse encapsulated packets and, even without encapsulation, can only demultiplex packets based on a few static fields at the data link and network layers.
Another approach is possible, however: the hardware MMU can be replaced by software memory isolation techniques. These techniques rely on static and dynamic verifications to ensure a program only accesses its own memory. In part because they are implemented in software, they can more easily be adapted to allow specific data (e.g., packets) to cross memory boundaries.
These techniques have been extensively studied in the context of kernel extensions [109, 99, 14, 44] and browser plugins [148, 4, 65, 93, 44].
In the following, we review the different approaches for software memory isolation. We briefly describe seminal works and their recent applications. These techniques often enable more than just memory isolation (e.g., type safety or checking arbitrary safety policies), although we focus on memory safety herein.
Binary Rewriting. Several techniques rely on binary rewriting to insert bounds checks on memory accesses into binaries before their execution.
Figure 2.2: Paths of packets through memory boundaries and demultiplexers in multi-tenant platforms (columns: Xen, ClickOS, NetVM, NetBricks; layers from top to bottom: guest userspace in ring 3, guest kernel space in non-root ring 0, host/Dom0 userspace in ring 3, host/Dom0 kernel space in vmx-root ring 0, and the NIC; each column shows where its demultiplexer sits). Dashes delineate software-enforced memory boundaries and arrows indicate packet copies. ClickOS' guest OS, MiniOS, has a single address space. ptnetmap implements both ClickOS and NetVM's approaches.
Software-based Fault Isolation (SFI), proposed in 1993 [145], gives each module (e.g., kernel extensions) a portion of the main address space (e.g., the kernel address space). Because portions of the address space are aligned such that they each have a unique pattern of upper bits, SFI can enforce isolation simply by masking addresses of memory accesses in the object code. Enforcing memory isolation with SFI remains expensive since bit masking instructions are required for every memory access. Using CPU benchmarks, the authors measure an overhead of 18–22% to isolate all memory accesses and a much lower overhead (around 4%) to check only jumps and memory writes.
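To make the address-masking idea concrete, the following minimal Python sketch forces every address into a module's segment by overwriting its upper bits. The 16-bit segment size and the helper names (`sandbox_address`, `sandboxed_store`) are assumptions chosen for illustration; the actual SFI work rewrites machine instructions rather than wrapping memory accesses in functions.

```python
SEGMENT_BITS = 16                      # hypothetical: low 16 bits are the in-segment offset
OFFSET_MASK = (1 << SEGMENT_BITS) - 1  # keeps only the offset bits of an address


def sandbox_address(addr: int, segment_id: int) -> int:
    """Force addr into the segment whose upper bits equal segment_id."""
    return (segment_id << SEGMENT_BITS) | (addr & OFFSET_MASK)


def sandboxed_store(memory: bytearray, segment_id: int, addr: int, value: int) -> None:
    """Store one byte; the masked address cannot leave the module's segment."""
    memory[sandbox_address(addr, segment_id)] = value
```

Note that an out-of-segment address is not rejected but silently redirected into the module's own segment: the faulty module can only corrupt itself, which is what keeps the per-access cost down to a couple of bitwise operations.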
The initial implementation of SFI targeted RISC architectures. S. McCamant et al. [98] adapted it to the smaller number of registers and the variable-length instructions of CISC architectures. With Native Client [148], used in the Chrome web browser to sandbox native code, B. Yee et al. rely on x86 hardware support for memory segmentation to replace the bounds checks for memory accesses, thus reducing the overhead. Although Native Client leverages the same underlying idea as SFI, it relies on a modified compiler rather than binary rewriting.
XFI [44], proposed by Ú. Erlingsson et al., allows native code to run in the contexts of the Windows kernel and the Internet Explorer web browser. XFI supports multiple memory regions with different access permissions. Its binary rewriter adds bounds checks for memory accesses that cannot be statically verified. Contrary to S. McCamant et al.'s work on SFI for CISC architectures, XFI requires registers for runtime checks and must therefore run a register liveness analysis to find available registers. The resulting untrusted binary rewriter is complex, but the authors strove to keep the trusted static verifier simple, including through the use of verification hints inserted by the binary rewriter.
Type-Safe Languages. With type-safe languages, accesses to objects in memory are verified through their type. There exist many type-safe languages; we discuss only a few well-known examples that pertain to the extension of kernels and web browsers.
In [14], B. N. Bershad et al. propose SPIN, an extensible operating system built with the Modula-3 type-safe language. Kernel extensions, written in Modula-3 as well, are compiled by a trusted compiler, representing a large trusted code base compared to the static verifiers of previous approaches.
Unlike Modula-3 in SPIN, Java does not rely on a trusted compiler. Instead, type safety is verified at the level of the Java bytecode [93], before its execution by the JVM.
Proposed by a consortium of engineers from the four major browser vendors, WebAssembly [65] aims to provide a new low-level bytecode with explicitly typed instructions and operators that can be used as a compilation target for C/C++ programs. WebAssembly programs can use a single dynamic memory area and all memory accesses are checked to be within that area at runtime. Since WebAssembly bytecode is Just-In-Time compiled to binary, if the size of the dynamic memory area changes, the binary must be patched to change the bounds checks.
Closer to the subject of this thesis, NetBricks [114] proposed to replace virtualization with a safe runtime and language to implement and isolate network functions. NetBricks leverages the Rust language and the LLVM runtime to isolate network functions. As illustrated in Figure 2.2, NetBricks isolates several tenants (network functions) using the same NIC without requiring a second packet copy. Other software memory isolation techniques may achieve the same end result. While NetBricks prevents most faults and malicious behaviors, its run-to-completion execution model could allow one network function to monopolize a CPU core.
Proof-Carrying Code. In 1996, G. Necula et al. designed a new approach, proof-carrying code (PCC) [109], to enforce memory isolation without requiring runtime checks as in the aforementioned approaches. To implement PCC, the kernel defines a formal safety policy, such as the authorized memory locations for reads and writes. The userspace application then loads a kernel extension in native code with its associated proof of correctness, a proof that the extension abides by the safety policy. Before executing the extension, the kernel derives its safety predicate (a predicate that returns true if the program abides by the policy) and verifies that the proof proves the predicate.
Although G. Necula et al.'s approach outperforms previous software memory isolation techniques, it has one main drawback that impedes adoption: the generation of extensions' proofs often requires manual intervention. In addition, the authors only test their proof generator with very simple programs, of a few instructions each, used to filter packets.
Static Analysis of Untyped Bytecodes. The last memory safety approach we discuss relies on interpreters, and in that sense, is close to some of the aforementioned runtimes, such as WebAssembly or the JVM. These approaches, however, rely neither on a type-safe bytecode nor on higher-level, type-safe languages.
DTrace [24], for example, provides a safe runtime in the Solaris operating system to trace both user-level and kernel-level software. DTrace probes are written in D, a C-like language, and compiled to DIF, a small RISC instruction set. Memory safety is enforced through both load-time and runtime verifications. Because backward edges are disallowed in the DIF control flow graph, D programs cannot express loops, thereby preventing infinite execution in the context of a probe and simplifying load-time verification.
Like DTrace, the BPF runtime [99] was first designed for a specific application in the context of the BSD kernel: it allowed userspace programs to safely extend the kernel with packet filters to prevent unnecessary packet copies to userspace. BPF first relied on a basic register-based machine abstraction with 2 registers and 22 instructions. Memory accesses were bounded at runtime and the instruction set simply didn't allow jumps to negative offsets, thereby preventing infinite execution.
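The termination argument shared by DIF and the original BPF can be illustrated with a small load-time check. The sketch below uses a hypothetical toy encoding (an opcode string and an integer operand), not the real BPF instruction format: it accepts a program only if it ends with a return and every jump lands strictly further down in the program.

```python
from typing import List, Tuple

# (opcode, operand): a hypothetical toy encoding, not the real BPF format.
Insn = Tuple[str, int]


def check_forward_only(prog: List[Insn]) -> bool:
    """Accept a program only if it ends with a return and never jumps backward."""
    if not prog or prog[-1][0] != "ret":
        return False
    for pc, (opcode, operand) in enumerate(prog):
        if opcode == "jmp":
            # Offsets are relative to the next instruction, so 0 falls through.
            target = pc + 1 + operand
            if operand < 0 or target >= len(prog):
                return False
    return True
```

Since the program counter strictly increases at every step of an accepted program, execution visits each instruction at most once and therefore ends after at most `len(prog)` steps.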
In the Linux kernel, the BPF runtime was later rewritten to use a more complete instruction set [95], closer to current hardware instruction sets and more amenable to Just-In-Time compilation. In addition, runtime verifications were replaced by a static analysis, during which the BPF bytecode program is symbolically executed to ensure, among other things, that all memory accesses are correctly bounded. Bounds checks on variable-sized input data (e.g., packets) are the responsibility of the developer and are checked during the static analysis. Accesses to dynamic data structures, implemented outside the BPF virtual machine, are however verified at runtime.
Nowadays, the BPF runtime is used in the Linux kernel to safely probe the kernel as with DTrace in Solaris, but also to safely execute packet processing programs in the context of kernel drivers [70].
In [56], E. Gershuni et al. propose a different approach to the verification of BPF bytecode, using Abstract Interpretation [32]. They evaluate the efficiency of several abstract domains (over-approximations of possible values for registers and stack slots) in terms of speed and false positives for a large corpus of networking BPF programs. Contrary to Linux's static analyzer⁵, their static analyzer is able to verify that BPF programs are safe to execute in the presence of loops, but does not verify termination.
2.2 Software Switches
Besides the bare task of delivering packets to tenants' domains, end hosts must also decide whether and where to forward packets, on the receive path as well as the transmit path. In practice, because of the large number of virtual machines or containers per host and the complexity of network policies, executing the logic to decide whether and where to forward packets can be expensive.
In this section, we survey works on the software switch, the end-host component in charge of executing that logic. We begin with a brief discussion on the evolution over the last decade of the role of the software switch in multi-tenant networks. We then review the literature on packet classification algorithms, algorithms to execute the aforementioned logic. Finally, we discuss the challenge of extending software switches, a problem which we address in Chapter 3.
2.2.1 Packet Switching
In the first virtualization platforms [12], networking between virtual machines and the physical network was managed at the link layer (and, less frequently, at the network layer [3]). A software component of the hypervisor would therefore demultiplex packets to virtual machines based on the Ethernet addresses and VLAN tags. This component, generally referred to as the virtual switch, could also enforce policies, such as rate-limiting traffic or filtering outbound packets to prevent spoofing.
Other approaches were proposed to process packets in hardware with higher performance [62]. The NIC [78] or the upstream top-of-rack (ToR) switch can for example enforce policies at much higher speeds than the host's CPU. These approaches, however, are limited by the PCIe bandwidth, and in the case of the ToR switch, require additional tagging of packets and demultiplexing at the hypervisor.
⁵ At least until Linux v5.2.
Figure 2.3: Open vSwitch architecture, with the OpenFlow and OVSDB interfaces. The OpenFlow and OVSDB interfaces sit on top of the platform-agnostic OpenFlow pipeline, which handles the first packet of each flow, itself on top of a platform-specific cache (e.g., a Linux module).
Although demultiplexing was predominantly performed on the host's CPU, virtualization platforms still relied on the mature, reliable implementations of ToR switches for many tasks, including management (e.g., SNMP), accounting, monitoring and mirroring (e.g., NetFlow and SPAN), and quality-of-service [117].
In 2009, B. Pfaff et al. presented Open vSwitch [119], a software switch for virtualization platforms with full-fledged management interfaces and fine-grained control over forwarding rules. Its OVSDB interface [35], illustrated in Figure 2.3, enables the configuration of port mirroring, QoS policies, and NetFlow logging. The second interface implements the OpenFlow protocol [51] to give network operators (or an OpenFlow controller) control over the forwarding, filtering, and load-balancing of network flows. The OpenFlow forwarding table is generic in that it can direct packets based on their L2, L3, and L4 headers. Thus, the combination of OpenFlow and OVSDB allows for the implementation of a large number of network tasks at the software switch, in replacement of the ToR switch.
Enforcing network virtualization policies at the hypervisor level benefits virtualization platforms: because it is closer to the virtual machines, it is easier for the software switch to uniquely identify a virtual machine's packets (each virtual NIC identifies a virtual machine) and for the virtualization platform to migrate network configurations with the virtual machine when necessary.
Internally, Open vSwitch has a split architecture that both eases porting to new platforms and improves performance. The implementation of the full forwarding pipeline resides in the userspace, platform-agnostic component, whereas the in-kernel component implements a cache for the most frequently matched forwarding rules. The flow caching mechanisms of Open vSwitch, which we discuss in the next section, were detailed in [120].
2.2.2 Forwarding Pipelines
At a high level, forwarding rules in both hardware and software switches are organized in a pipeline of match-action tables, each table requiring a lookup over a few header fields to find the appropriate action. Rules may have wildcarded fields (i.e., fields that match all values) or may even match a range of values. This high-level model can have either very limited implementations, with few tables (and rules per table) matching on a restricted set of fields, or very general implementations, as in Open vSwitch, in which the execution on commodity resources (CPU and RAM) allows for a large number of tables and rules matching on a diverse set of fields.
Hardware forwarding pipelines are generally of the first kind: they support a limited number of tables and fields, often with each table matching on a specific protocol. In an effort to open networks, the OpenFlow protocol [100] attempted to standardize a general forwarding model. OpenFlow defines a pipeline of cross-layer forwarding tables with priorities. All tables support the same set of matching fields and rules may have wildcarded fields to match all values. OpenFlow however received limited support in hardware switches; most products have low limits on the number of rules per table and, often, each table can match on a very narrow set of fields among the 40 that OpenFlow 1.3 defines [51].
Software forwarding pipelines, on the other hand, don't have these limitations. For example, several software switches have near-complete support for OpenFlow, sometimes even for the latest, extensive versions [120, 46, 75, 74]. Software switches, however, have other shortcomings: whereas hardware switches can rely on TCAMs and BCAMs to implement fast lookups over flow tables, software switches must find efficient ways to perform the same lookups using the CPU.
Packet Classification. The problem of efficiently matching packets with rules in a forwarding pipeline is known as packet classification. Three aspects of forwarding pipelines complicate packet classification in software. First, virtualization platforms have long pipelines [89] for which a naive implementation would require multiple, expensive table lookups. Second, each table may contain a large number of rules and receive frequent updates. Thus, the tables' data structures must enable fast lookups and updates, regardless of the number of entries. Third, whereas hardware switches have TCAMs, CPUs do not have an efficient way (i.e., O(1)) to perform lookups over tables with wildcards; tables with only exact-match rules, in contrast, can be implemented with hash tables.
In 1999, V. Srinivasan et al. [136] proposed one solution to this problem, namely Tuple Space Search. In a forwarding table with wildcarded rules, each tuple models a set of fields being matched on and is associated with a hash table implementing a lookup over those fields. The lookup over the forwarding table, i.e., the Tuple Space Search, is then implemented as a linear search over the hash tables. In their 1999 paper, V. Srinivasan et al. discussed an optimization to reduce the number of hash table lookups by first searching for the longest prefix match over all IP addresses.
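The structure of a Tuple Space Search classifier can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: fields are either matched exactly or fully wildcarded (no prefix masks), rules carry an integer priority, and the class and field names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Packet = Dict[str, int]


class TupleSpaceClassifier:
    """One hash table per tuple, i.e., per set of fields that rules match on."""

    def __init__(self) -> None:
        self.tables: Dict[Tuple[str, ...], Dict[Tuple, Tuple[int, str]]] = {}

    def insert(self, match: Packet, priority: int, action: str) -> None:
        fields = tuple(sorted(match))          # the tuple this rule belongs to
        key = tuple(match[f] for f in fields)
        self.tables.setdefault(fields, {})[key] = (priority, action)

    def lookup(self, pkt: Packet) -> Optional[str]:
        best = None
        # Linear search over the tuples, one hash lookup per tuple.
        for fields, table in self.tables.items():
            hit = table.get(tuple(pkt.get(f) for f in fields))
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return best[1] if best is not None else None
```

Inserting a rule costs a single hash table operation, whereas a lookup costs one hash computation per distinct tuple, which is why the number of wildcard patterns, rather than the number of rules, drives lookup performance.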
Another approach to this problem relies on binary logic [91, 9]. For each packet field in the forwarding table, a set of bit vectors is defined. Each bit vector encodes the forwarding rules matching a given range of values. For example, for a table of four rules matching on the source and destination TCP ports, a bit vector of 0011 associated with source ports 1–1024 signifies that only the last two rules match on source ports 1–1024. Thus, the length of bit vectors is equal to the number of rules in the table. During lookups, for each packet field in the forwarding table, a corresponding bit vector is selected. Bit vectors are ANDed and the remaining 1-bits indicate rules that match the packet for all packet fields in the forwarding table.
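The bit vector scheme can be sketched as follows, assuming rules that match each field on an inclusive range of 16-bit values. The function names and the representation (Python integers as bit vectors, with bit i standing for rule i, counted from the least significant bit) are our own choices for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Rule = Dict[str, Tuple[int, int]]  # field -> inclusive (low, high) range


def build_index(rules: List[Rule], fields: List[str], max_value: int = 65535):
    """For each field, split the value space into intervals with one bit vector each."""
    index = {}
    for f in fields:
        starts = {0}
        for rule in rules:
            low, high = rule[f]
            starts.add(low)
            if high + 1 <= max_value:
                starts.add(high + 1)
        starts = sorted(starts)
        vectors = []
        for s in starts:
            bits = 0
            for i, rule in enumerate(rules):
                if rule[f][0] <= s <= rule[f][1]:
                    bits |= 1 << i            # rule i matches this whole interval
            vectors.append(bits)
        index[f] = (starts, vectors)
    return index


def classify(index, pkt: Dict[str, int], n_rules: int) -> List[int]:
    """AND the per-field bit vectors; the remaining 1-bits are the matching rules."""
    result = (1 << n_rules) - 1
    for f, (starts, vectors) in index.items():
        result &= vectors[bisect_right(starts, pkt[f]) - 1]
    return [i for i in range(n_rules) if (result >> i) & 1]
```

Each field stores one n-bit vector per interval and up to about 2n intervals, which is where the quadratic memory consumption of this approach comes from.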
Many decision tree-based packet classifiers have been proposed in the literature [137, 131, 143]. These algorithms have varying memory versus throughput trade-offs, but compared to Tuple Space Search, they are generally faster at the cost of a memory consumption polynomial in the number of rules, whereas Tuple Space Search uses memory linear in the number of rules. In addition, even though decision trees can be incrementally updated, they require more complex and slower update logic than Tuple Space Search, for which a single hash table operation is needed.
Another approach, referred to as cross-producting [64], consists of doing separate lookups for each field and, using the cross-product of the values found, doing a lookup into the cross-product table of all possible field value combinations. This last cross-product table however grows exponentially with the number of fields.
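As an illustration of cross-producting, the sketch below precomputes the best rule for every combination of per-field values. It is deliberately simplified and rests on our own assumptions: fields are matched exactly or wildcarded, rules are listed by decreasing priority, and the per-field lookups are plain membership tests where a real classifier would use hash tables or tries.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

ANY = "*"  # stands for "a value that no rule names explicitly"
Rule = Tuple[Dict[str, int], str]  # (match, action), listed by decreasing priority


def build_crossproduct(rules: List[Rule], fields: List[str]):
    """Precompute the best rule for every combination of per-field values."""
    values = {f: sorted({m[f] for m, _ in rules if f in m}) for f in fields}
    table = {}
    for combo in product(*[values[f] + [ANY] for f in fields]):
        for match, action in rules:
            if all(f not in match or match[f] == v for f, v in zip(fields, combo)):
                table[combo] = action  # first match has the highest priority
                break
    return values, table


def classify(values, table, fields: List[str], pkt: Dict[str, int]) -> Optional[str]:
    """One lookup per field, then a single lookup in the cross-product table."""
    key = tuple(pkt[f] if pkt[f] in values[f] else ANY for f in fields)
    return table.get(key)
```

The table holds one entry per element of the cross-product, so its size is the product of the per-field value counts (plus one for the wildcard class in this sketch), regardless of how few rules actually exist.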
Table 2.1: Worst-case complexity and memory space consumed by a few packet classifiers. n is the number of rules, f the number of different match fields, t the number of different tuples, w the maximum number of bits matched by rules in the table, and v_i the number of values for match field i. We consider the worst case only in terms of table content: hash tables without collisions are assumed for Tuple Space Search and cross-producting.
| Classifier | Lookup complexity | Update complexity | Memory space |
| Tuple Space Search [136] | O(t) | O(1) | O(n) |
| Decision trees [137] | O(w) | - | O(n·w) |
| Binary logic [91] | O(f·log(2n)) | - | O(f·n²) |
| Cross-producting [64] | O(f) | ∏_i v_i ⁶ | ∏_i v_i |
⁶ Complexity when a new match field is added. The complexity to add a new value to an existing match field is slightly lower.
Table 2.1 summarizes the different lookup and update complexities for the packet classification algorithms presented in this section, as well as their memory space consumption. The binary logic and cross-producting methods have prohibitive memory costs. Two viable methods remain, namely Tuple Space Search classifiers and decision trees. The choice between these two methods mostly depends on the expected forwarding rule patterns, as Tuple Space Search achieves higher performance with a smaller number of different wildcard patterns. As an example, there is an ongoing discussion in the Linux kernel community to choose a general packet classification algorithm for a new Open vSwitch datapath [141], with one side arguing for decision trees to avoid making strong assumptions on rule patterns, the other arguing for a Tuple Space Search classifier with constant-time updates. Similar design choices arise with iptables implementations [16].
Flow Caching. None of these algorithms for packet classification address the first challenge, that of supporting long forwarding pipelines with multiple tables. They all would require one lookup per table in the pipeline. Flow caching has long been studied as a solution to this problem [110, 26, 88], to reduce the average cost of lookups, especially with the use of hardware caches.
Exact-match caches are one simple design for caches. After a packet has been classified using the full forwarding pipeline, a rule is installed in the cache with the action found and all field values equal to those of the packet; thus, subsequent packets with the same field values hit the cache. Nevertheless, if even a single field used in the classifier changes in a subsequent packet, a new full classification is required. This case can happen fairly frequently with short HTTP requests, but also with port scans or even route changes (due to the TTL value changes).
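The behavior of an exact-match cache, including its sensitivity to a single changing field, can be illustrated with the following minimal sketch. The class name, the field names, and the `slowpath` callable standing in for the full pipeline are hypothetical.

```python
from typing import Callable, Dict, Tuple


class ExactMatchCache:
    """Hash table keyed on the values of all fields the pipeline may read."""

    def __init__(self, slowpath: Callable[[Dict[str, int]], str],
                 fields: Tuple[str, ...]) -> None:
        self.slowpath = slowpath   # full forwarding pipeline, assumed expensive
        self.fields = fields
        self.cache: Dict[Tuple[int, ...], str] = {}
        self.misses = 0

    def classify(self, pkt: Dict[str, int]) -> str:
        key = tuple(pkt[f] for f in self.fields)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.slowpath(pkt)  # install an exact-match rule
        return self.cache[key]
```

With `fields = ("ip_dst", "tcp_dst", "ttl")`, two packets that differ only in their TTL produce two misses and two cache entries, even if the pipeline never looks at the TTL.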
To overcome this issue, second-level caches are used, as an intermediate step between the exact-match cache and the full pipeline. In [129], N. Shelly et al. study several strategies to compute rules for a second-level cache with support for wildcards but not priorities. Removing support for priorities from the second-level cache simplifies lookups as iterating over all matches to find the highest priority is not required, but the cache can then only contain disjoint rules.
Using Header Space Analysis [86], they first consider the population of a perfect cache, one that completely avoids cache misses. They show that such a cache grows polynomially with the number of rules in the full pipeline and consider an incremental population strategy as an alternative.
Thus, for a given incoming packet, they need to compute the cached rule that matches the packet while having as many wildcards as possible, to increase cache hits. They find that the problem is NP-hard and turn to heuristics. They design several heuristic algorithms and evaluate their respective trade-offs in complexity versus optimality (highest number of wildcards to increase cache hits).
Figure 2.4: Open vSwitch caching architecture. TSS stands for Tuple Space Search. The slowpath consists of tables 0 to n, each with a TSS classifier, and handles the first packet of each flow. The fastpath consists of the megaflow cache (a TSS classifier without priorities) and the microflow cache (a hash table).
Open vSwitch. In [120], B. Pfaff et al. present the design of Open vSwitch and its evolution to support diverse virtualization platform workloads. Open vSwitch draws heavily from previous research on packet classification to implement the highly general OpenFlow forwarding pipeline without sacrificing performance. As illustrated in Figure 2.4, Open vSwitch consists of a slowpath with the full OpenFlow implementation and a fastpath with two cache levels. The full pipeline implementation uses Tuple Space Search classifiers for each table, with added support for priorities by ordering rules in hash tables according to their priority. As an optimization, Open vSwitch records the highest priority in each hash table and skips lookups if a rule of higher priority was already found.
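The priority-based skipping can be sketched as follows. This is a simplified illustration and not Open vSwitch's code: we assume exact-or-wildcard fields, a single rule per hash key, and hypothetical names, and we sort the hash tables at lookup time where a real implementation would keep them ordered on insertion.

```python
from typing import Dict, Optional, Tuple


class PrioritySubtables:
    """Tuple Space Search where each hash table records its highest priority."""

    def __init__(self) -> None:
        self.subtables: Dict[Tuple[str, ...], dict] = {}

    def insert(self, match: Dict[str, int], priority: int, action: str) -> None:
        fields = tuple(sorted(match))
        sub = self.subtables.setdefault(fields, {"max_priority": priority, "rules": {}})
        sub["rules"][tuple(match[f] for f in fields)] = (priority, action)
        sub["max_priority"] = max(sub["max_priority"], priority)

    def lookup(self, pkt: Dict[str, int]) -> Tuple[Optional[str], int]:
        """Return the best action and the number of hash tables actually probed."""
        best, probes = None, 0
        ordered = sorted(self.subtables.items(),
                         key=lambda kv: kv[1]["max_priority"], reverse=True)
        for fields, sub in ordered:
            if best is not None and best[0] >= sub["max_priority"]:
                break  # no remaining table can hold a higher-priority rule
            probes += 1
            hit = sub["rules"].get(tuple(pkt.get(f) for f in fields))
            if hit is not None and (best is None or hit[0] > best[0]):
                best = hit
        return (best[1] if best is not None else None), probes
```

The second element of the returned pair makes the effect visible: a packet that hits the highest-priority table is classified after a single probe, whatever the number of tuples.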
The first-level cache is an exact-match cache, while the second-level cache has a single forwarding pipeline with support for wildcards but not priorities, as in N. Shelly et al.'s work. The single table of the second-level cache uses a Tuple Space Search classifier for which rules are constructed reactively, upon packet arrival. Rule construction does not rely on any of N. Shelly et al.'s heuristics, but instead tracks the fields used during the full pipeline lookup and un-wildcards them with the packet's values to construct the cache rule. Open vSwitch includes a number of other optimizations [120] to compute keys for hash tables incrementally and improve the optimality of cached rules for IP addresses, for example.
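The reactive construction of wildcarded cache rules can be illustrated with a packet wrapper that records which fields the slowpath reads. The sketch below is a toy model under our own assumptions: the two-stage `slowpath` function, the field names, and the class names are hypothetical, and Open vSwitch's actual tracking works on bit masks rather than whole fields.

```python
from typing import Callable, Dict, Tuple


class TrackedPacket:
    """Wraps a packet and records every field the slowpath reads."""

    def __init__(self, fields: Dict[str, int]) -> None:
        self._fields = fields
        self.used = set()

    def __getitem__(self, name: str) -> int:
        self.used.add(name)
        return self._fields[name]


def slowpath(pkt) -> str:
    # Hypothetical two-table pipeline: filter on ip_dst, then on tcp_dst.
    if pkt["ip_dst"] != 10:
        return "drop"
    return "to_vm1" if pkt["tcp_dst"] == 80 else "to_vm2"


class WildcardCache:
    """Tuple Space Search cache without priorities, populated reactively."""

    def __init__(self, slowpath: Callable) -> None:
        self.slowpath = slowpath
        self.tables: Dict[Tuple[str, ...], Dict[Tuple[int, ...], str]] = {}
        self.misses = 0

    def classify(self, fields: Dict[str, int]) -> str:
        for mask, table in self.tables.items():
            action = table.get(tuple(fields[f] for f in mask))
            if action is not None:
                return action            # rules are disjoint: first hit wins
        self.misses += 1
        tracked = TrackedPacket(fields)
        action = self.slowpath(tracked)
        mask = tuple(sorted(tracked.used))   # un-wildcard only the fields read
        self.tables.setdefault(mask, {})[tuple(fields[f] for f in mask)] = action
        return action
```

Here, a packet dropped on its destination address alone installs a rule that wildcards the TCP ports, so any later packet to the same address hits the cache regardless of its ports.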
Dataplane Specialization. In [106], L. Molnár et al. make a case against flow caching in software switches. They show that, under certain traffic patterns, flow caching can introduce high performance variations. In addition, flow caching requires complex invalidation algorithms to keep caches consistent with rules in the full pipeline⁷.
The authors propose ESWITCH, a software switch that specializes the fastpath to attain high performance. They decompose the full pipeline into four prepared code templates: a hash table, a prefix tree, a direct template with hardcoded rules, and a default linked list. Their rationale is that forwarding pipelines can often be decomposed into a set of common forwarding tables (e.g., hash table for ACLs or prefix tree for L3 routing).
For simple forwarding pipelines (no more than three consecutive tables) with large numbers of network flows (10K and more), ESWITCH achieves one order of magnitude better performance than Open vSwitch's general packet classification algorithm. In these conditions, the efficiency of the cache degrades with the number of network flows, and Open vSwitch is unable to keep up with ESWITCH's specialized classifiers. However, they do not compare their from-scratch prototype to Open vSwitch for the long pipelines characteristic of virtualization platforms [89].
⁷ Open vSwitch revalidates the whole cache for each update in the slowpath.
Table 2.2: Design choices for packet classification in large-scale virtualization platform deployments.
| | VMware's NVP | GCE's Andromeda | Azure's AccelNet |
| Publication | 2014 [89] | 2018 [33] | 2018 [50] |
| Packet classifier: slowpath | TSS classifiers | Not reported⁸ | Specialized |
| Packet classifier: 2nd cache | TSS classifier | TSS classifiers | - |
| Packet classifier: 1st cache | Hash table | Hash table | FPGA |
| Hardware offloads | Protocol offloads | Protocol offloads, packet copies, and encryption | Cache in FPGA |
VFP. In [48], D. Firestone presents his experience of the iterative design and deployment of VFP, Azure's software switch. VFP takes a somewhat intermediate position between Open vSwitch's aggressive flow caching and ESWITCH's specialization: VFP adopts a specialized slowpath with a single, exact-match cache. D. Firestone justifies the lack of a second-level cache by the lower cost of VFP's slowpath lookups compared to Open vSwitch's; whereas Open vSwitch's slowpath runs in userspace and requires a context switch, VFP's runs in kernel space, alongside the exact-match cache.
VFP's slowpath consists of four specialized classifiers: an interval tree, a prefix tree, a hash table, and a default linked list. Instead of decomposing the forwarding pipeline into a set of tables that can be efficiently implemented with the specialized classifiers as in ESWITCH, VFP relies on a heuristic algorithm [49] to map each rule to a classifier. When updating the forwarding pipeline, VFP first adds all new rules to all classifiers in order to compute a score for each pair of rule and classifier. The score takes into account the lookup cost and how the rule impacts the lookup cost of other rules in the same classifier (e.g., due to collisions in a hash table). VFP then assigns each rule to the classifier that gave the best score.
Table 2.2 summarizes the dierent design choices for packet classication in three large-scale virtualization platforms: VMware's NVP [89], GCE's Andromeda [33], and Azure's AccelNet [50].
Although NVP and Andromeda both rely on Open vSwitch for packet classication, whereas NVP uses an unmodied Open vSwitch, Andromeda uses only its slowpath component as a second-level cache and replaces its fastpath by a custom implementation of a hash table-based, rst-level cache. AccelNet, on the other hand, relies heavily on an FPGA-based cache and does not implement a second-level cache. Interestingly, all three platforms leverage ow caching and diverse hardware ooads to achieve high performance. We discuss hardware ooads further in Section 2.3.1.
8 In Andromeda, the Open vSwitch slowpath acts as a second-level cache for a physically separate software switch whose classifier is not described.
2.2.3 Extensibility
Despite its generality, OpenFlow has several drawbacks that limit its applicability. First, it is protocol-dependent: it matches packets on pre-defined fields that correspond to supported protocols. To support new protocols, the specification and its implementations must be updated and new matching fields added. As a result, the latest version of OpenFlow supports over 40 fields. Second, OpenFlow remains limited to its original intent: describing forwarding pipelines in hardware devices. Many researchers have noted the benefits of applying the core principles of OpenFlow (central control and open interfaces) to a larger set of devices (e.g., middleboxes [5, 54, 23]) and use cases (e.g., fine-grained monitoring [149, 96]). The main barrier to this evolution is the stateless nature of OpenFlow: each packet is processed independently, with no means to use persistent information to make forwarding decisions.
Protocol Independence. The first limitation is a consequence of OpenFlow initially targeting hardware devices with fixed ASIC-based pipelines whose parser cannot be reconfigured. Recent advances in the design of hardware switches [20, 58] enable operators to reconfigure switches at runtime. With these advances came high-level languages and compilers [21, 83] to ease switch programming. P4 [21], for example, is a popular language in the research community to describe forwarding pipelines. Contrary to OpenFlow's fixed pipeline description, P4 is protocol-independent and can describe pipelines that match packets at arbitrary offsets.
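To make the notion of matching at arbitrary offsets concrete, the following Python sketch expresses a match as a list of (offset, value, mask) triples over the raw packet bytes, instead of a fixed set of named protocol fields. It is not P4 and does not reflect how any P4 compiler works; the FieldMatch type and the matches function are hypothetical names introduced for this illustration.

```python
from typing import NamedTuple, Sequence


class FieldMatch(NamedTuple):
    offset: int   # byte offset from the start of the packet
    value: bytes  # expected bytes at that offset
    mask: bytes   # bitmask, same length as value; 0 bits are "don't care"


def matches(packet: bytes, fields: Sequence[FieldMatch]) -> bool:
    """Return True if every (offset, value, mask) triple matches the packet."""
    for f in fields:
        window = packet[f.offset:f.offset + len(f.value)]
        if len(window) < len(f.value):
            return False  # packet too short to contain this field
        if any((p & m) != (v & m) for p, v, m in zip(window, f.value, f.mask)):
            return False
    return True


# Example: IPv4 over Ethernet carrying TCP, described purely by offsets.
# EtherType sits at bytes 12-13 of the Ethernet header; the IPv4 protocol
# field is byte 9 of the IPv4 header, i.e. byte 23 of the frame.
IPV4_TCP = [
    FieldMatch(offset=12, value=b"\x08\x00", mask=b"\xff\xff"),
    FieldMatch(offset=23, value=b"\x06", mask=b"\xff"),
]
```

Supporting a new protocol in this model means adding new triples, not a new field to a specification; this is the flexibility that fixed, named matching fields lack.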
Although all widespread software switches are protocol-dependent, there was no significant barrier to a protocol-independent implementation. Since software switches run on CPUs, they benefit from higher flexibility to match packets. Nevertheless, the recent advances in hardware switch programmability motivated and helped the design of protocol-independent software switches.
PISCES is one such design of a P4-compatible software switch based on Open vSwitch. The authors focus their efforts on compiler optimizations to retain the high performance of Open vSwitch despite the use of a higher-level programming language.
In [48], D. Firestone reports that VFP is protocol-dependent, in that matching fields are compiled in, but supports programmable actions that abstract away some of the specifics of protocols (e.g., tunneling protocols).
Switching Frameworks. Several software switches were designed to be easily extended. mSwitch [71] is such a software switch. Based on VALE [124], it runs entirely in the kernel, supports a large number of forwarding ports, and replaces the fixed learning switch logic of VALE with a pluggable switching logic component. This new component must be written and loaded as a Linux kernel module. The authors develop four different switching logic modules, but none of them requires more than a single lookup.
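The notion of a pluggable switching logic can be illustrated with a small Python analogy. Actual mSwitch modules are Linux kernel modules, so nothing below reflects its real interface: the Switch class, the BROADCAST sentinel, and both logic functions are hypothetical. The sketch only shows a switch whose forwarding decision is delegated to a replaceable component that performs a single lookup per frame.

```python
from typing import Callable, Dict

BROADCAST = -1  # hypothetical sentinel meaning "flood to all other ports"

# A switching logic maps (Ethernet frame, input port) to an output port.
SwitchingLogic = Callable[[bytes, int], int]


class LearningBridge:
    """Learning-switch logic: one MAC table lookup per frame."""

    def __init__(self) -> None:
        self.table: Dict[bytes, int] = {}

    def __call__(self, frame: bytes, in_port: int) -> int:
        dst, src = frame[0:6], frame[6:12]
        self.table[src] = in_port  # learn where the sender lives
        return self.table.get(dst, BROADCAST)


def cross_connect(frame: bytes, in_port: int) -> int:
    """Alternative logic: statically pair port 0 with port 1."""
    return 1 - in_port


class Switch:
    """Switch whose forwarding decision is delegated to a replaceable logic."""

    def __init__(self, logic: SwitchingLogic) -> None:
        self.logic = logic

    def forward(self, frame: bytes, in_port: int) -> int:
        return self.logic(frame, in_port)
```

Swapping the logic attribute changes the forwarding behavior without touching the rest of the switch, which is the property a pluggable design aims for.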
Several fast packet processing frameworks built atop the Click modular router have been proposed in the literature. With SMP Click [29], B. Chen et al. made one of the earliest attempts to improve the performance of Click, by leveraging multiprocessor servers. RouteBricks [36] explores the replacement of hardware routers with software routers based on Click and distributed over a cluster of commodity servers. In [11], T. Barbette et al. evaluate several packet I/O frameworks before designing a fast packet processing framework based on Click and running on both netmap and DPDK. ClickNF [52] extends Click with a high-performance TCP stack to enable application-layer processing.
BPFabric [85] is a switch framework built on DPDK with a switching logic expressed in BPF.
It includes only two data structures which can be used as packet classifiers: a hash table and an