
Optimized Sampling Strategies to Model the Performance of Virtualized Network Functions

Steven Van Rossem · Wouter Tavernier · Didier Colle · Mario Pickavet · Piet Demeester


Abstract Modern network services make increasing use of virtualized compute and network resources. This is enabled by the growing availability of softwarized network functions, which take on major roles in the total traffic flow (such as caching, routing or firewalling). To ensure reliable operation of its services, the service provider needs a good understanding of the performance of the deployed softwarized network functions. Ideally, the service performance should be predictable, given a certain input workload and a set of allocated (virtualized) resources (such as vCPUs and bandwidth). This helps to estimate more accurately how many resources are needed to operate the service within its performance specifications. To predict its performance, the network function should be profiled in the whole range of possible input workloads and resource configurations. However, this input can span a large space of multiple parameters and many combinations to test, resulting in an expensive and overextended measurement period. To mitigate this, we present a profiling framework and a sampling heuristic to help select both workload and resource configurations to test. Additionally, we compare several machine-learning based methods for the best prediction accuracy, in combination with the sampling heuristic. As a result, we obtain a reduced dataset which can still model the performance of the network functions with adequate accuracy, while requiring less profiling time. Compared to uniform sampling, our tests show that the heuristic achieves the same modeling accuracy with up to five times fewer samples.

Keywords Sampling Heuristic · Network Function Virtualization · Performance Profiling · Machine Learning · Regression

This work has been performed in the framework of the NGPaaS and 5GTANGO projects, funded by the European Commission under the Horizon 2020 and 5G-PPP Phase 2 programmes, respectively under Grant Agreement No. 761 557 and 761 493 (http://ngpaas.eu) (https://www.5gtango.eu). This work is partly funded by the UGent BOF/GOA project "Autonomic Networked Multimedia Systems".

All authors are at Ghent University - imec, IDLab. E-mail: {firstname}.{surname}@ugent.be
The corresponding author is Steven Van Rossem. E-mail: stevenvanrossem@gmail.com

1 Introduction

In the telecom industry, there is an increasing adoption of cloud-native services and network functions based on Software Defined Networking (SDN) and Network Function Virtualization (NFV) techniques. By virtualizing compute and network resources, a very flexible environment can be created to deploy Virtual Network Functions (VNFs) with an optimal amount of allocated resources, adapted to the real-time incoming workload.

The recent rise of 5G-enabled services further advocates the use of cloud-native functions, which are deployed over a virtualized infrastructure [1] [2]. This illustrates the growing need to map the amount of allocated resources and incoming workload to the Key Performance Indicators (KPIs) of the deployed network service, specified in the Service Level Agreement (SLA). To characterize this relation as well as possible, we propose that the virtualized function is profiled on the targeted infrastructure, before being deployed in production. This is illustrated in Fig. 1 as a DevOps-inspired workflow.

Fig. 1 The profiling framework can use a similar infrastructure compared to operations, as part of a DevOps workflow.

Cloud-native techniques enable a very flexible use of the Infrastructure as a Service (IaaS) for a wide variety of use cases, as envisioned in [3]. This enables the profiling framework (Dev) to do its testing on a representative (but isolated) part of the IaaS, compared to the operational environment (Ops).

In this paper, we propose an optimized sampling procedure to shorten the profiling time as much as possible, without losing modeling accuracy. This optimized profiling procedure greatly benefits a DevOps-inspired deployment cycle: frequent VNF updates can be quickly and representatively validated, before the updated VNF is handed over to the service provider for deployment in production. We hereby consider the VNF as a black box, with no formal way to deterministically calculate the performance metrics. VNF profiling is then basically a form of load testing, where representative workloads are emulated and performance is recorded. The data analysis in [4] further shows that the output data of the profiling tests can be fed into a model; such a model can predict, from the resource allocation and expected workload, the performance level of the profiled VNFs. The validation of this model in the profiling environment brings additional trust when deploying the profiled VNF in the production environment. The research goal of this article is to optimize the data gathering part of the profiling procedure as much as possible. This is done by investigating whether the total test space of possible workloads and resource allocations can be limited, without losing too much prediction accuracy.

We compare several machine-learning based methods for the best performance modeling accuracy, in combination with several sampling heuristics.

But first, in Section 2, we compare our work to state-of-the-art approaches in VNF performance modeling. Next, in Section 3, we present the architecture and implementation of our automated profiling tool, which we used to gather the profiling data. In Section 4 we present the VNFs used in our experiments. Then in Section 5, the general performance trends we witnessed on the tested VNFs are analysed. Finally, Section 6 uses the profiled data to evaluate several machine-learning-based methods for the best performance modeling accuracy, in combination with multiple sampling heuristics.

2 Related Work

In this paper we unite two different domains: on the one hand, platforms for automated VNF testing and, on the other hand, multi-variate sampling to model a certain response function as efficiently as possible. Efficiency in the context of this paper means as few samples as needed to obtain an accurate model of the VNF performance, in the shortest time frame possible. In Table 1, we give an overview of the related research in these two domains. Our presented method builds further upon this previous work and achieves a higher sampling efficiency. The problem of efficient performance sampling is not unique to the NFV domain; related work can be found in other application fields, as indicated in the table. For each referenced related work, we briefly describe whether a sampling heuristic was used or not, and any drawbacks we see compared to our method.

Various base principles for profiling virtual environments were first discussed in [5]. Some of the main conclusions in this reference also apply to our use case: profiling tests must be able to apply a range of representative workload intensities and should run in nearly-identical environments compared to production. Additionally, linear regression is exemplified to be a good modelling method for CPU usage. Further work is however needed to expand the methodology in [5] to non-linear regions, where resources such as CPU get saturated. It is also possible to optimize the sampling procedure, by pro-actively defining significant workloads to test.

Every reference in the NFV domain in Table 1 makes use of a platform to automate VNF performance measurements. Generic architectures for profiling frameworks have been described earlier in [6, 7, 8, 9, 11, 12], where the relation to DevOps-related workflows is also highlighted. We extend this previous work with more insights for using a Service Oriented Architecture (SOA) and integration of both sampling and modeling methods. We also exemplify VNF profiling in a more elaborate parameter space of both workload and resource metrics.

The VNF profiling platforms in the first two rows of Table 1 exhaustively execute a list of input benchmark tests, without trying to optimize the test time. Several sampling strategies for VNF performance profiling have been studied before and combined with automated profiling, in the last seven rows of Table 1:

In [13], it is experimentally observed that VNF performance shows a trend break when resources are getting saturated. During profiling, this approach increases the workload in fixed steps until resource saturation occurs. We argue that our approach, based on bisection, provides a faster way to model the VNF performance trend, needing fewer samples (as we will see in Section 5.1).

Table 1 Related VNF performance sampling approaches and drawbacks.

| Reference | Domain | Sampling efficiency | Sampling heuristic | Drawbacks |
|---|---|---|---|---|
| Wood [5] | Cloud computing | - | Exhaustive sampling using CPU, network and disk intensive microbenchmarks. Post-process the training set to filter anomalous measurements. | Not time-efficient, covers only part of the operational space. |
| z-Torch [6], NFV-vital [7], NFV-inspector [8] | NFV | + | Manually selected parameters and values. | Covers only part of the operational space. |
| SONATA SDK [9, 10], Gym [11], Probius [12] | NFV | - | Exhaustive sampling. | Not time-efficient, not scalable. |
| ORCA [13] | NFV | + | Stepwise increase of the workload until resource saturation + exhaustive resource combinations. | More time-efficient but not scalable. |
| Duplyakin [14] | Cloud computing | ++ | Multi-variate Gaussian Process to select the next samples. | Needs many initial samples, less scalable. |
| GEIST [15] | Cloud computing | + | Random uniform in selected areas via CAMLP (label propagation). | Not fit for trend modelling since only global optimum finding. |
| Sumo [16] | Generic | ++ | Voronoi partitioning of the parameter space + KPI gradient-based selection of interesting partitions. | Very generic, needs large pool of initial uniform samples. |
| PANIC [17] | Cloud computing | ++ | Greedy, KPI gradient-based bisection on all parameters. | Less scalable. |
| Peuster [18] | NFV | +++ | Set parameter weights via KPI gradient + random sampling on weighted parameters. Fixed workload over multiple resource combinations. | Less scalable to many workload parameters, can outperform PANIC but marginally outperforms random uniform sampling. |
| Giannakopoulos [19] | Cloud computing | +++ | Use piecewise linear functions to model the KPI trends, bisect areas with highest linearity deviation. | Can outperform PANIC but marginally outperforms random uniform sampling. |
| This paper | NFV | ++++ | Scalable PLS-based parameter selection (phase 1) + KPI gradient-based bisection (phase 2) on the selected workload metric + curve fitting accuracy as sampling stop criterion. Outperforms uniform sampling. | |

The authors in [14] recommend the use of Gaussian Processes (GP) as a flexible method to both model performance and choose the next configurations to sample. Our tests show, however, that the trends witnessed in VNF profiling are not efficiently captured by GPs. Also, when we let the GP model determine the next samples, the method remains sub-optimal compared to generic uniform sampling (see Section 6.2). Likewise, in [15], it is experimentally tested that uniform sampling and Gaussian Processes are not the best methods to sample the large parameter space of compute-intensive algorithms. The used method is however aimed at finding a global optimum, which is not applicable to our use case, which tries to model the complete response surface.

Another commonly used adaptive sampling method is based on surrogate modeling using a gradient-based approach [16]. This is a very generic method which can be applied in many domains, but the risk is that sampling efforts focus too heavily on local extrema, and therefore take an unbalanced number of samples in only limited regions. Also, a relatively large number of initial random samples is needed to feed the sampling algorithm.

The use of gradient-based sampling selection is applied to VNFs in [17]. By using a greedy algorithm, this method becomes less scalable when multiple workload and resource parameters must be profiled, since many possible combinations remain to be sampled.

Several sampling strategies for VNF chain profiling have been investigated in [18], but no method is found which significantly beats a generic uniform sampling strategy. Decision-tree-based models are put forward as a promising solution; however, our tests show that random forest is one of the worse-performing methods in our use case of profiling a single VNF. The profiling tests in [18] are also limited to varying only one resource metric (allocated vCPU), under only one fixed workload.

The profiling method in [19] focuses on the deployment space of big data applications, with up to seven configuration dimensions on a fixed server, thus with a fixed set of resource parameters. In this deployment space, areas are clustered in which the performance metric can be approximated using a linear model. The total model is then a piecewise combination of different linear models, and the space partitioning into different linear regions is done by a decision tree. More samples are adaptively chosen in the areas where a linear model has bad accuracy. However, when trying to model non-linear functions, a deteriorating performance is reported.

Moreover, new sample points are drawn randomly in a uniform manner, without exploiting any expert knowledge. Our sampling heuristic tries to mitigate this by using a curve fit model with piecewise (non-linear) functions and online sample selection using bisection.

The scalability mentioned in Table 1 relates to how well the sampling heuristic can cope with additional parameters to test. Exhaustive or greedy methods need a lot of time to test the complete range of specified parameters, similar to methods which require a large number of initial samples. This is mitigated by including some form of feature selection. Our presented sampling procedure further augments the mentioned related work with the PLS method [20], to select any significant workload or resource parameters to prioritize sampling on.

A reference architecture for VNF Management and Orchestration is "ETSI MANO" [21], a standard maintained by the ETSI NFV working group. The ETSI MANO architecture is followed in our framework, in the sense that a strict separation between infrastructure and VNF management functionality is enabled. Also, management functions per VNF and per executed profiling test are integrated. According to ETSI MANO specifications, a VNF Manager (VNFM) is responsible for VNF lifecycle management (e.g. instantiation, update, query, scaling, termination), with a clear link to the infrastructure for updating resource allocations. Other VNF management related actions (such as VNF login or functional configuration) are done via the ETSI-defined Element Manager (EM). The VNF Manager entity in our setup (as will be explained in Section 3) incorporates characteristics of both the ETSI EM and VNFM. The main interface protocol used by the VNF Manager in our setup is either SSH or the Docker API. The EM in our setup can then be considered as the SSH or Docker agent instantiated in the infrastructure node where the VNF under test is running, addressed by the VNF Manager.

Our platform implementation for automated profiling is based on the development in [10] (which is also based on ETSI MANO). We re-factored the tool following a Service Oriented Architecture (SOA) approach, where multiple profiling tests can run independently and in parallel. As such, the tool can shorten the total profiling time by parallelizing multiple measurement campaigns over multiple hardware nodes. In [9], a descriptor format is presented as input for an automated VNF profiling procedure. We have loosely adopted the syntax of this YAML-based test definition and included some adaptations to map it better to our envisioned sampling heuristic. The needed adaptations were mainly specialized configuration directives and settings for test execution and monitoring. In the next section we will explain our VNF profiling platform in more detail.

3 VNF Profiling Framework

The automation of VNF performance profiling is the main objective of our VNF profiling platform. In this section we first explain how such test automation was implemented. To further optimize the profiling procedure and save time, a sampling heuristic is further investigated in the next sections of this paper. We can put forward three main characteristics for a VNF profiling framework:

– A light-weight, modular architecture which is easily expandable. Since every VNF can have unique control interfaces or other configuration mechanisms, customization is necessary. The framework should therefore allow easy integration of custom VNF configuration functionality. Parallel execution of profiling tests is also an important feature to optimize profiling time. Details are given in Subsection 3.1.

– Easy generation and reproducibility of VNF profiling tests can be achieved by using descriptor files which contain a programmable testing and monitoring workflow. Such descriptor files allow an easily customizable profiling workflow per VNF. Details are given in Subsection 3.2.

– The integration of existing cloud-native functional blocks allows a fast development and flexible deployment on available IaaS. This cloud-native nature allows the profiling framework to be quickly set up. As a result, we can test and measure VNFs on available IaaS nodes as a realistic staging environment (as seen in Fig. 1 and described in [3]). Details are given in Subsection 3.3.

Fig. 2 Architecture and implementation of the profiling framework. (a) Architecture and used tools. (b) Structural implementation of a profiling service.

3.1 Modular Architecture

The purpose of VNF profiling is to generate a dataset which can be used to create a model for the VNF performance. The framework therefore automates a set of measurements where varying resources are allocated, workload is generated and KPIs are measured. Each test setup consists of a VNF under test, where the test traffic is routed from a traffic source to a traffic sink.

The traffic source and sink can also be considered VNFs, custom-built and deployed for our test purposes.

We hereby give a concise overview of this framework.

The architectural overview is given in Fig. 2a. On the upper level, the Manager Node is where the profiling workflow is executed in a loop: (i) A workload and/or resource allocation is selected to test, (ii) instructions are given to the VNFs and traffic generators and (iii) monitored metrics are analyzed. On the lower level, the Test Infrastructure Nodes are located in the IaaS. This is the actual execution environment of the VNFs and traffic generators.

The Manager Node is where each profiling test is running as a separate service instance. By executing a profiling test as a separate service, multiple tests can be running simultaneously. This allows parallelization of multiple measurements. Each profiling test can be considered as a micro-service, from which the test status can be queried through an HTTP API. At the end of the profiling loop, the measurements are stored in .csv format for further analysis (see Section 5). A more detailed implementation diagram is depicted in Fig. 2b.

We can distinguish the following important class objects, addressed by the profiling service:

– Every VNF in the test setup (source, sink and VNF under test) has its own VNF Manager instance, which has methods available to control the VNF state. The most important function is to execute commands which start/stop traffic workloads. The actual commands or scripts are specified in the descriptor files; the VNF Manager only executes the defined command using the correct Infrastructure Agent. We have implemented a separate VNF Manager class for Docker containers and Virtual Machines (VMs), because they need different functions regarding (re)starting or configuration.

– Every VNF Manager should have an Infrastructure Agent attached, to address a specific API in the remote Infrastructure Test Node. This agent is used to execute commands inside a container, to set container resources via the Docker API of the remote node, or to execute commands through SSH. Also specific VM hypervisors such as KVM can be addressed via an Infrastructure Agent.

Whenever a new type of VNF or infrastructure interface needs to be addressed, a new class type can be added in this framework. In this way, the profiling framework can be easily expanded or customized.
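To make the class structure more tangible, the following sketch outlines how such VNF Manager and Infrastructure Agent classes could look in Python. This is a minimal illustration under our own assumptions: the class and method names mirror Fig. 2b, but the bodies (e.g. how environment variables are persisted inside the VNF) are simplified placeholders rather than the actual framework code.

```python
import paramiko          # SSH-based Infrastructure Agent
import docker            # Docker API-based Infrastructure Agent


class SshAgent:
    """Infrastructure Agent executing commands on a remote node via SSH."""
    def __init__(self, host, user, key_file):
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(host, username=user, key_filename=key_file)

    def execute_command(self, cmd):
        _, stdout, stderr = self.client.exec_command(cmd)
        return stdout.read().decode(), stderr.read().decode()


class DockerAgent:
    """Infrastructure Agent using the Docker API of a remote node."""
    def __init__(self, url):
        self.client = docker.DockerClient(base_url=url)

    def execute_command(self, container_name, cmd):
        return self.client.containers.get(container_name).exec_run(cmd)

    def set_resource_allocation(self, container_name, vcpu, cpuset=None):
        # 1 vCPU corresponds to a cpu_quota equal to cpu_period (100 ms).
        self.client.containers.get(container_name).update(
            cpu_period=100_000, cpu_quota=int(vcpu * 100_000),
            cpuset_cpus=cpuset)


class DockerVnfManager:
    """VNF Manager controlling one containerized VNF through its agent."""
    def __init__(self, name, agent):
        self.name, self.agent = name, agent

    def set_environment_var(self, key, value):
        # Placeholder: store workload settings so a later workload script can read them.
        self.agent.execute_command(self.name,
                                   f"sh -c 'echo {key}={value} >> /tmp/test.env'")

    def set_resource_allocation(self, vcpu):
        self.agent.set_resource_allocation(self.name, vcpu)

    def execute_command(self, cmd):
        return self.agent.execute_command(self.name, cmd)
```

A VNF Manager class for VMs would follow the same interface, but delegate to an SSH or libvirt-based agent instead of the Docker API.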

(7)

3.2 Profiling Descriptors

Profiling descriptors are YAML based documents, which describe the execution of a profiling test. They enable a programmable workflow which can be easily customized.

We distinguish two main descriptor categories:

– Every profiling test is defined by a Test descriptor. Here we describe which scripts or commands to execute in each VNF, in order to generate the requested workload. See the example Listings 1 and 2.

– A Metrics descriptor defines for each VNF in the Test descriptor which metrics should be monitored and recorded. Care should be taken that all required metrics are exported by the pre-deployed probes. Additionally, it can be specified when to send alerts back to the profiling service when measurement stability or overload of the traffic src/sink is detected. See the example in Listing 3.

To have a more practical idea of the profiling execution, we give a succinct overview of the format used to describe the profiling tests. A first part of the Test descriptor is given in Listing 1. This part defines a configuration agent for each VNF in the profiling setup. As previously explained, a class instance is made for each specified VNF manager and each manager uses an infrastructure agent as interface to the underlying VNF. The names of the VNF managers defined here are referred to in the remainder of the descriptor. The declarations in the descriptor hold test-specific settings. For the agents this includes: API endpoints, credentials, authentication methods, ... For the VNF managers we need specific settings such as: the infrastructure agent to communicate with the VNF, container or VM uid, resource or operational initialization to be configured via the infrastructure agent, ...

 1  agents:
 2    docker1:
 3      class: DockerApiClient
 4      url: 'tcp://docker.api.url:port'
 5      # ...
 6    ssh1:
 7      class: SshClient
 8      host: 'infrastructure.node.url'
 9
10  managers:
11    src:
12      class: Docker
13      agent: docker1
14      cpuset_cpus: '8-15'
15      # ...
16    pfsense:
17      class: Vm
18      agent: ssh1

Listing 1 YAML based test descriptor - interface configuration

A second part of the Test descriptor is then given in Listing 2. This is the most important part from the profiling perspective, as it defines the actual values for each relevant parameter in the test. For each parameter we define:

– A list or range of values to test (Lines 4, 10, 14, 25).

– The method of the manager instance to call, in order to practically configure this value into the VNF test (Lines 5, 11, 15, 26). In our tests, we use a common technique based on environment variables. The workload settings are stored in environment variables in the VNFs; later, when the workload script is called, these variables are read and the configured workload is started.

– A list of fixed initialization commands, which are executed every time a new setting is configured, e.g. to stop/start a workload generating script in the traffic generator (Lines 18, 29).

 1  resource_parameters:
 2    - name: pfsense_cpu_limit
 3      # allocated cpu in %
 4      values: [25,50,75,100,200,300,400,500]
 5      function: set_cpu
 6      manager: pfsense
 7
 8  workload_parameters:
 9    - name: packetsize
10      values: [64,128,256,512,1024,1500] # Bytes
11      function: set_environment_var
12      manager: src
13    - name: flows
14      values: [1, 2, 10, 100, 1000, 10000]
15      function: set_environment_var
16      manager: src
17
18  initialization:
19    - manager: src
20      cmd: 'pkill -9 -f start_traffic_stream.sh'
21      # ...
22
23  primary_workload_parameter:
24    name: packetrate
25    # Values will be chosen in the defined interval for this parameter
26    range: [0.1,500] # kpps
27    function: set_environment_var
28    manager: src
29
30    initialization:
31      - manager: src
32        cmd: 'bash start_traffic_stream.sh'
33        # ...

Listing 2 YAML based test descriptor - test configuration space

(8)

The sampling heuristic takes the defined parameter ranges into account and will iterate through all combinations in an optimized order. For this reason, the total configuration space is categorized into three sections:

– resource parameters

– workload parameters

– primary workload parameter

The main reason for this categorization is to guide the sampling heuristic to the metric value to sample next, as will be explained in the coming sections.

Also the commands to start or stop the workload are specified in the Test descriptor. The commands specified in Lines 20 and 32 in Listing 2 are literally executed in a (bash) shell inside the specified VNF. Instead of combining commands in a fixed script, multiple commands could also be specified here as a list. The VNF profiler which parses the descriptor would then execute these commands in sequential order, as they appear in the descriptor.

In Listing 3, we illustrate the structure of the Metrics descriptor. This file is translated to the needed configuration directives for the monitoring framework to gather the required metrics. The list of all the required metrics is given (Line 1). For each given metric, a template should be defined, which maps to the correct metric query (Line 9). Deployment or test specific parameters such as IDs should be dynamically filled in the template. The metrics descriptor also defines the probes where the monitoring framework can get the metric values from (Line 21). The Profile service will query all defined metrics from the monitoring database, once a configured workload shows stabilized measurements. (Measurement stability is assessed by the implementation described in [4].) The queried metric values are exported to a file and kept for online analysis by our sampling heuristic. In Subsection 3.3 we will explain the integrated monitoring framework.

It is beneficial if each setting and each exported metric is explicitly mentioned in either the Test descriptor or the Metrics descriptor. When exporting the test results after the profiling phase, there should be a clear link between the setting/metric name in the exported results and where/how this value was exactly set or measured. Our approach is that the exact name of the setting or metric can be traced back in one of the descriptor files. The descriptor files become the reference for the used configuration settings, and similarly for what each metric stands for and where/how it is exactly gathered: (i) The test descriptor explains the activated settings regarding the workload generation and resource allocation. (ii) The metrics descriptor explains the metrics gathered from the probes, representing for example the VNF's performance and current resource usage.

 1  metrics:
 2    - sink:cpu
 3    - src:cpu
 4    - pfsense:cpu
 5    - sink:packetrate_receive:eth1
 6    - pfsense_packetrate_loss
 7    # ...
 8
 9  definitions:
10    docker:
11      cpu:
12        template: 'sum(rate(container_cpu_usage_seconds_total{id="/docker/{{ docker_id }}"}[10s]))*100'
13        unit: '%'
14
15      packetrate_receive:
16        template: 'sum(rate(container_network_receive_packets_total{id="/docker/{{ docker_id }}",interface="{{ interface_id }}"}[10s]))'
17        unit: 'pps'
18
19    # ...
20
21  probes:
22    node_exporter:
23      job_name: node_exp_pfsense1
24      scrape_interval: 1s
25      static_configs:
26        - targets:
27          - 'infrastructure.node.url:9100'
28    cadvisor:
29      job_name: cAdvisor_pfsense1
30      scrape_interval: 1s
31      static_configs:
32        - targets:
33          - 'infrastructure.node.url:8080'
34  # ...

Listing 3 YAML based metrics descriptor
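To illustrate how such a metric definition is consumed, the sketch below renders the Jinja2-style template from line 12 of Listing 3 and sends the resulting PromQL query to the Prometheus HTTP API. This is our own minimal example; the Prometheus URL, the container id and the result handling are assumptions, not part of the framework code.

```python
import requests
from jinja2 import Template

# Template string as it appears in the Metrics descriptor (Listing 3, line 12).
cpu_template = ('sum(rate(container_cpu_usage_seconds_total'
                '{id="/docker/{{ docker_id }}"}[10s]))*100')

# Fill in the deployment-specific container id (hypothetical value).
promql = Template(cpu_template).render(docker_id="a1b2c3d4e5f6")

# Query the Prometheus instance of the Manager Node (assumed URL).
resp = requests.get("http://prometheus.manager.node:9090/api/v1/query",
                    params={"query": promql})
result = resp.json()["data"]["result"]
print(result)  # list of samples (metric labels + value) returned by Prometheus
```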

The use of the above explained descriptor files makes it easy to customize and repeat profiling tests. The configuration of monitored metrics, workload and resource parameters is kept very generic to allow a wide applicability in VNF testing.

3.3 Cloud-Native Functional Blocks

If we look again at Fig. 2a, we see that several functionalities are implemented by readily available components.

The Infrastructure Node should be pre-provisioned. This means that before the profiling tool can operate, the VNFs under test should be pre-deployed on one or more infrastructure nodes. This can be done by a common orchestration framework (e.g. OpenStack, Kubernetes).

Prometheus is used as monitoring framework and metrics database. Additionally, for every started profiling test, a Grafana dashboard is generated, to visually check the status of the defined metrics being monitored. For each requested metric, the correct Prometheus Query (PromQL) should be given in the Metrics Descriptor; this is a Prometheus-specific syntax to retrieve the metric from the database. Prometheus is also configured to send alerts back to the profiling service when measurement stability or overload of the traffic src/sink is detected.

To let Prometheus gather the metrics defined in the descriptor, the required probes must be running on each Infrastructure Node:

– cAdvisor is a tool to export performance and resource metrics of Docker containers.

– The Prometheus Node Exporter does the same for bare-metal or host-specific metrics.

– We also use a custom-built probe to export VM metrics gathered from KVM and libvirt.

In our setup, the main “ancillary” services are deployed as Docker containers (Prometheus, Grafana, traffic source/sink, probes). The actual VNF under test can be deployed as a Docker container or as a VM under KVM. Care should be taken that resources (e.g. assigned vCPUs) are well isolated between the VNFs under test and the other components. Depending on the virtualization method of the VNF (container or Virtual Machine (VM)), we use the configuration options of Docker or KVM, respectively, to isolate the CPU cores between the Device Under Test (DUT) and the traffic sink/source (based on the Linux kernel feature cgroups). In the Test Descriptor, separate vCPU cores are assigned to each VNF. A vCPU value smaller than 1 means that a vCPU share smaller than 100% has been allocated. E.g. 0.5 vCPU means that 50% of the vCPU time of one core is allocated to a specific container or VM.
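As an illustration of how such an allocation maps onto Docker's cgroups-based options, the sketch below (our own example with hypothetical image names and core ranges) pins the traffic source/sink and the DUT to disjoint core sets and gives the DUT 0.5 vCPU.

```python
import docker

client = docker.from_env()

# Traffic source and sink pinned to cores 8-15 (as in Listing 1).
src = client.containers.run("traffic-src:latest", detach=True,
                            cpuset_cpus="8-15")
sink = client.containers.run("traffic-sink:latest", detach=True,
                             cpuset_cpus="8-15")

# Device Under Test on a disjoint core set, with 0.5 vCPU allocated:
# cpu_quota / cpu_period = 50000 / 100000 = 50% of one core.
dut = client.containers.run("vnf-under-test:latest", detach=True,
                            cpuset_cpus="0-7",
                            cpu_period=100_000, cpu_quota=50_000)
```

For a KVM-based VNF, the equivalent isolation would be done via vCPU pinning and CPU shares in the libvirt domain definition.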

To make sure that the performance of the VNF under test is not bounded by an external factor, we monitor whether the traffic source or sink VNF is overloaded. When this happens, the monitored performance is bounded by the traffic VNFs rather than by the VNF under test, is not representative for its performance profile, and the performance measurements are therefore invalid. So by detecting overload in the traffic VNFs, we ensure that the tested VNF's performance measurements are not affected.

In the remainder of this article we will present measurement results gathered by the above explained framework and descriptor formats.

4 VNFs Under Test

To choose exemplary VNFs for our profiling tests, we looked at some typical use cases defined in [1]. The adoption of 5G technologies enables new possibilities for the telco industry to diversify their network services to new markets. To enhance the security of these services we look at the deployment of a virtual firewall (pfSense). As a large portion of the traffic over 5G will be media-based, we also look at the deployment of a virtual streaming server (Nginx). The choice for these two specific VNFs is also to exemplify the generic nature of our presented approach. Different VNFs, which are stressed by different workloads and characterized by different KPIs, are easily testable by our platform and sampling heuristic.

For the tested VNFs in this paper, we consider CPU and network bandwidth as the most important resource metrics, as we experimentally validated that these are more likely to become a bottleneck resource than memory. This is also confirmed in [22]. Our measurements also show little to no variation in the memory usage of the VNFs while they are under test. In the next subsections we will discuss each VNF in more detail.

4.1 Firewall - pfSense

We use pfSense1 as a free and open source firewall solution example, deployed as a VM. We stress the firewall by generating multiple unique parallel flows. Also the packetsize is varied. Using the tool Scapy we assemble a .pcap file with a stream of packets with varying MAC addresses and a unique destination IP/port in the packet header. Tcpreplay is then used to stream the .pcap file at a given packetrate from the traffic source. There is also an iperf stream running, with an iperf server in the traffic sink. This is used to monitor packet loss. For the firewall to function properly, we need to make sure the ARP table of the VNF contains the MAC addresses of the generated packets, so the firewall forwards the packets properly to the traffic sink. This is done by ARP spoofing the firewall from the traffic sink. To have an idea of the baseline performance of the firewall, we install no specific firewall rules and let the traffic pass.
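The flow generation described above could be sketched as follows with Scapy and Tcpreplay. This is a simplified illustration; the address ranges, padding and file names are our own assumptions, not the exact test scripts used in the experiments.

```python
from scapy.all import Ether, IP, UDP, Raw, wrpcap

N_FLOWS = 1000      # number of unique parallel flows
PACKETSIZE = 512    # bytes

packets = []
for i in range(N_FLOWS):
    # Unique MAC address, destination IP and destination port per flow.
    pkt = (Ether(src="02:00:00:%02x:%02x:%02x" % ((i >> 16) & 0xff, (i >> 8) & 0xff, i & 0xff),
                 dst="02:00:00:00:00:01")
           / IP(src="10.0.0.1", dst="10.1.%d.%d" % (i // 250, i % 250 + 1))
           / UDP(sport=1024, dport=5000 + i))
    pad = PACKETSIZE - len(pkt)
    packets.append(pkt / Raw(b"\x00" * max(pad, 0)))

wrpcap("flows_512B_1000flows.pcap", packets)
# The traffic source then replays this file at a configured packetrate, e.g.:
#   tcpreplay --pps=100000 --loop=0 -i eth1 flows_512B_1000flows.pcap
```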

Generated workload metrics:

– packetrate: [0.1-500] kpps. 50 different packetrate values are selectively chosen, spaced evenly along the log scale.

– packetsize: [64,128,256,512,1024,1500] bytes

– flows: [1,2,10,100,1000,10000] unique parallel flows (with a unique IP/port combination in the header).

Resource metrics:

– CPU allocation: [0.25, 0.5, 0.75, 1, 2, 3, 4, 5] vCPUs

1 https://www.pfsense.org/


The bandwidth allocation is not a dedicated setting in this test. It is determined by the workload, since we specify the generated packetrate and packetsize up front.

Performance metric:

We choose packet loss (%) as the main KPI to reflect the performance of the firewall.

All combinations of the above metrics result in 12000 measurement points. If we need about 30 s per measurement to get a stable reading, the total profiling time reaches up to 100 h to measure each combination once.

4.2 Streaming Server - Nginx

We set up a live streaming service using Nginx2, a well-known open source, all-in-one load balancer, web server, content cache and API gateway solution. Nginx is deployed in a Docker container. We configure Nginx to accept incoming live movie streams via the Real-Time Messaging Protocol (RTMP [23]). The incoming RTMP live stream is then transcoded to a specific video bitrate and resolution (Nginx uses ffmpeg for this purpose). Next, Nginx serves the newly encoded movie chunks live, through the HTTP Live Streaming (HLS [24]) protocol. In our test setup, the traffic source sends 1 - 5 movies in realtime to Nginx over RTMP. On the client side, the traffic sink opens many concurrent sessions to Nginx, to download the playing live movies over HLS (we use Locust.io to emulate the HLS clients and download the stream requests). This use case exemplifies the situation where a small number of incoming live movies is temporarily cached in an edge server and then streamed with a certain quality to a large number of clients.
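A minimal sketch of how such HLS clients could be emulated with Locust is shown below; the host, playlist path and timing values are hypothetical, and the actual test scripts may differ.

```python
from locust import HttpUser, task, between


class HlsClient(HttpUser):
    """Emulated HLS viewer: repeatedly fetch the playlist and the newest chunk."""
    wait_time = between(1, 2)                  # pause between playlist polls (s)
    host = "http://nginx.under.test:8080"      # assumed Nginx HLS endpoint

    @task
    def play_live_stream(self):
        # Fetch the live playlist (assumed path) and download the last chunk.
        playlist = self.client.get("/hls/movie1.m3u8").text
        chunks = [line for line in playlist.splitlines() if line.endswith(".ts")]
        if chunks:
            self.client.get(f"/hls/{chunks[-1]}")
```

The number of concurrent client sessions is then controlled via Locust's --users option when launching the load test.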

Generated workload metrics:

– streams: [10-5000] parallel client HLS streams. 80 different stream values are selectively chosen, spaced evenly along the log scale.

– movies: [1,2,3,4,5] number of different source movie streams, input via RTMP.

– quality: [1,2,3,4,5] indicator for the quality of the streams (resolution and video bitrate ranging from 1280x720/2500kbps to 426x240/200kbps).

Resource metrics:

The streaming performance is determined by both the available bandwidth and vCPU. It is unpredictable how the balance between CPU time for ffmpeg transcoding and CPU time for serving the movie chunks will be scheduled (as we consider this a black-box VNF). Therefore we have no way to deterministically predict the influence of both the allocated bandwidth and CPU on the KPI. We need to profile the performance with several combinations of allocated vCPU and bandwidth. This also reflects the availability of different flavours to deploy the VNF.

2 https://www.nginx.com/products/nginx/modules/rtmp-media-streaming/

– flavours: [(0.5, 0.5), (0.5, 1), (1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (3, 3), (3, 4), (4, 5), (6, 5)] (vCPUs, Gbps). Eleven different flavours to deploy the VNF, defined by their given vCPU and bandwidth allocation and encoded from [0-10].

Performance metric:

We choose lag ratio (%) as the main KPI to reflect the performance of the streaming server. This indicator is a measure for the risk of “hiccups” or lagging during video playback. It is the ratio of downloaded video playback time over the waiting time during the last period. If the video time is less than the waiting time, the playback buffer will empty and the risk of lagging will increase:

\[
\text{lag ratio (\%)} = \max\!\left(1 - \frac{T_{video}}{T_{wait}},\ 0\right)
\]

We measure the lag ratio as a moving average over 20 s (we assume 20 s buffer time). If the KPI gets above zero, it means that during the last 20 s the playback buffer was drawn upon, because less than 20 s of video stream was downloaded. Increasing KPI values mean more buffer time is continuously needed, resulting in video rebuffering and thus “lagging”. The HLS protocol will try to keep the lag ratio at zero by varying the size of the served movie chunks and maximizing the bandwidth over all clients.
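For clarity, the KPI computation can be sketched as follows; this is our own minimal implementation of the formula above, assuming the client reports the seconds of video downloaded in each 20 s window, and expressing the result as a percentage.

```python
def lag_ratio(video_seconds_downloaded, window_seconds=20.0):
    """Lag ratio (%) over one measurement window.

    video_seconds_downloaded: playback time of the video data fetched during
    the window (T_video); window_seconds: the wall-clock waiting time (T_wait).
    """
    ratio = max(1.0 - video_seconds_downloaded / window_seconds, 0.0)
    return 100.0 * ratio

# Example: only 15 s of video downloaded in a 20 s window -> 25% lag ratio.
print(lag_ratio(15.0))   # 25.0
```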

In order to get a stable measurement, a certain ramp-up time is needed to generate the required number of clients and to let the HLS-based streaming stabilize. In our setup this takes up to 100 s per measurement point. Testing all of the above combinations once then takes over 500 h to complete.

We only evaluate the KPIs below 30% packet loss or lag ratio, as we assume that above this threshold the VNF is practically unusable. Therefore there is no need to accurately model the KPI above 30%.

4.3 Test Traffic Generation

For our tests, every type of test traffic is started and received via a dedicated script stored inside the traffic source/sink VNF itself. For the Nginx test, all related commands to start ffmpeg to stream the video files are stored in the script start_traffic_stream.sh located in the traffic source VNF. A similar script is stored to start the Locust tool to receive the video stream in the traffic sink VNF. Similarly, for the pfSense tests, the iperf and Tcpreplay commands are also stored in a dedicated script. This script is copied into the traffic source/sink VNF at build time. In Listing 2 (Lines 20 and 32) we show how this script can be called and stopped from the descriptor.

4.4 Hardware Infrastructure

Our Test Infrastructure Nodes are identical compute nodes with 2x 8-core Intel E5-2650v2 (2.6 GHz) CPUs, running Ubuntu 18.04. Linux Bridge is used as the hypervisor switch. We do not change default OS options (e.g. we leave hyperthreading enabled). The Manager Node is a lighter machine: a 4-core Intel E3-1220 CPU with Ubuntu 18.04. The main bottleneck resource of the Manager Node is the disk space used by Prometheus, to store all the metrics gathered from the various running profiling tests. The long profiling times of the above introduced VNFs show the need to optimize both: (i) the parallel execution of measurement runs by the profiling framework (as explained in Section 3) and (ii) the sampling strategy to limit the number of needed sampling points. The latter will be explained next.

5 VNF Data Analysis

As proposed in [4], we have classified the tested VNF metrics under three groups in the previous section:

– Workload metrics reflect the configuration of the incoming traffic to be processed by the VNF.

– Resource metrics quantify the allocated resources which determine the cost and processing capabilities of the VNF. For our analysis we express this as resource usage, which is the averaged used portion (%) of the allocated vCPU and bandwidth.

– Performance metrics monitor the Key Performance Indicators (KPIs), to assure that the performance of the VNF remains within the SLA.

From the obtained VNF measurements, we want to derive a model which predicts the performance KPI as a function of the given workload and resource allocation. From an abstract and generalized viewpoint, the VNF performance model f can be described as:

\[
f(wl, res) = perf \qquad (1)
\]

where:

wl = input workload (e.g. packetrate, filesize, streams)

res = resource allocation (e.g. number of allocated vCPUs, bandwidth or flavour)

perf = KPI metrics (e.g. packet loss, lag ratio)

Figure 3 shows a subset of our measurements: for each VNF a certain workload configuration is executed on varying resource allocations. The measurements in Fig. 3 confirm these trends (which were also described in [4], but on other VNF examples):

– The resource usage is correlated with the rising workload (on the x-axis) until saturation (Fig. 3a and 3b). Either CPU or bandwidth gets saturated first, which explains why the averaged resource usage can saturate below 100%.

– Before resource saturation, the KPI levels remain stable and flat. When resource contention starts, the KPI levels start to vary more rapidly (Fig. 3c and 3d).

We can also distinguish two other interesting facts from the plots:

– For pfSense (Fig. 3c) we can see that the performance does not increase with more than 3 allocated vCPUs. This points to a deployment limitation where it makes no sense to allocate more vCPUs, because they are not exploited by the VNF implementation.

– When it comes to Nginx (Fig. 3d), we see that some resource flavors have overlapping performance curves. This indicates that the workload is bounded by a common resource limit of those flavors, namely bandwidth in this case.

It is challenging to discover the above mentioned phenomena automatically, without visual inspection of the data plots. A KPI prediction can be made by training the model with the obtained profiled datasets. But a common adage from the Machine Learning domain is that the model will only be as smart as its training data, meaning that we must provide representative training data for all foreseeable situations. This implies that we must also profile in the regions where resource saturation occurs, where resource flavors overlap, or where the performance is limited by the internal VNF implementation.

In the next subsection we will outline methods to model the performance trends shown above. Regarding the accuracy, it is important to note that the breakpoint of the KPI curve is the area of most interest. This is the maximum workload the VNF can handle, just before the performance declines more severely.

Fig. 3 Subset of measured VNF metrics under different resource allocations. (a) pfSense resource usage. (b) Nginx resource usage. (c) pfSense KPI. (d) Nginx KPI.

5.1 Modeling Methods

We look for an appropriate modeling method to predict the KPI values from a given workload and resource configuration, as explained earlier by Eq. 1. A first idea of the trends to be modelled can be seen in Fig. 3c and 3d. From a pure mathematical perspective, we can consider the KPI values to be a response surface, defined by a multi-variate function where the workload and resource allocation metrics are the input parameters.

The total input space is multivariate, since all workload and resource metrics can influence the resulting KPI value. We compare several generic methods from the machine learning domain, which are capable of modeling generic, non-linear and multi-variate response surfaces.

The used methods have also shown promising applications in regression modeling where the number of training samples is limited. We have used the implementations available in the library Scikit-learn [25]. We also include the Interpolation and Curve Fit methods, which have shown promising results in [4]. The investigated modeling methods are listed below; a brief sketch of how these methods can be instantiated is given after the list:

– Support Vector Regression (SVR): This method has shown promising results in estimating non-linear relationships using limited, sparse datasets. SVR selects samples to form a flexible tube of minimal radius, symmetrically around the estimated function, such that the absolute values of errors less than a certain threshold (ε) are ignored both above and below the estimate. Points outside the tube are penalized and not taken into account for the regression. The hyperparameters of this method are C, the penalty parameter, and the standard RBF kernel. More details can be found in [26]. The use of SVR for modeling VNF performance has also been applied in [18] with limited success. We also include it here for verification on our data sets.

– Random Forest (RF): The basic idea behind this method is to combine multiple decision trees in determining the final output, rather than relying on an individually built decision tree. The maximum tree depth is set to 10, and the number of trees in the forest is 100. The use of decision trees for modeling VNF performance has been investigated in [18], [8] and [19] with promising results. We include the RF method here for verification on our data sets.

– Gaussian Process (GP): This method implements a Bayesian approach to (non-)linear regression. A GP defines a prior over functions, which can be converted into a posterior over functions once it has seen some data. The covariance between training samples is given by a kernel function. The kernel function we use is Constant * RBF + WhiteNoise, a generic kernel function used in many GP examples. One main advantage of using GPs is that the kernel hyperparameters (RBF lengthscale, noise level, constant) can be learnt automatically via evidence maximisation from the training points themselves; no exhaustive search is needed as with other methods. The key idea is that if the training samples are deemed by the kernel to be similar, then we expect the output of the function around those points to be similar, too. More information on this method is available in [27]. The use of GP for modeling software performance has been proposed in [14]. We also use GP here for verification on our datasets.

– k-Nearest Neighbors (kNN): This is a commonly used technique due to its simplicity and often accurate results. In kNN regression, the output value is the average of the values of the k nearest neighbors in the input space (weighted by distance). We use the Euclidean distance metric and standardize the input values. For our tests we use k = 2.

– Interpolation method: Regression is done by interpolating linearly between surrounding samples. The interpolant is constructed by triangulating the input data using Delaunay triangulation, and on each triangle performing linear barycentric interpolation. This method also works in multiple dimensions, so the total workload and resource input space is taken into account to interpolate an intermediate KPI value. This method has been used for VNF modeling in [4] and we also include the method here for verification on our new datasets.

– Curve Fit method: This method has been successfully used for VNF modeling in [4] and we reuse the method here for verification on our new datasets.

Based on the analysis in [4], we fit the KPI trends to the analytical functions of this form:

\[
\begin{cases}
f(x)_{non\text{-}saturated} = a + \exp[\,b\,(x - c)\,] \\
f(x)_{saturated} = 100\left(1 - \dfrac{e}{x - d}\right), \quad \text{where } x > d + e
\end{cases}
\qquad (2)
\]

where the output is clipped to the range [0, 100]:

\[
f(x) = \begin{cases} 0 & \text{where } f(x) < 0 \\ 100 & \text{where } f(x) > 100 \end{cases}
\]

The full justification for the functions in the Curve Fit method is explained in [4]; here we shortly summarize the fitting procedure. Figure 4 shows how the trends in Eq. 2 are fitted to data subsets of both investigated VNFs. The saturated region starts when the workload shows a higher covariation with the KPI than with the resource usage metric. The parameters a, b, c, d, e in Eq. 2 are fitted to the profiled data points in the respective (non-)saturated regions. Note that two possible intersections can exist between f(x)_non-saturated and f(x)_saturated. When the fitted curve is known, all possible intersections are estimated. For each intersection point, the accuracy (RMSE) of the resulting piecewise model is calculated. Finally, the intersection point which yields the best accuracy is chosen and stored into the Curve Fit model. The clipped output represents that the KPI values (representing packet loss and lag ratio) are limited between [0, 100]%. We further use a weighted curve fitting for the saturated part, to prioritize more accuracy at lower KPI values just after resource saturation, since this is the region where most accuracy is wanted.
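The sketch below illustrates how the compared regressors could be instantiated with Scikit-learn and SciPy, using the hyperparameters mentioned above. It is a minimal illustration under our own assumptions (e.g. the SVR values for C and ε, and the piecewise functions of Eq. 2 written out as Python functions), not the exact experiment code.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: (n_samples, n_features) workload + resource metrics, y: KPI values.
models = {
    "SVR": make_pipeline(StandardScaler(),
                         SVR(kernel="rbf", C=100.0, epsilon=0.1)),
    "Random forest": RandomForestRegressor(n_estimators=100, max_depth=10),
    "Gaussian process": GaussianProcessRegressor(
        kernel=ConstantKernel() * RBF() + WhiteKernel(), normalize_y=True),
    "kNN": make_pipeline(StandardScaler(),
                         KNeighborsRegressor(n_neighbors=2, weights="distance")),
}

def fit_all(X_train, y_train):
    for model in models.values():
        model.fit(X_train, y_train)
    # Interpolation: Delaunay triangulation + linear barycentric interpolation.
    return LinearNDInterpolator(X_train, y_train)

# Curve Fit: the piecewise functions of Eq. 2, fitted per workload sweep.
def f_non_saturated(x, a, b, c):
    return a + np.exp(b * (x - c))

def f_saturated(x, d, e):
    # Valid for x > d + e; output clipped to the [0, 100] KPI range.
    return np.clip(100.0 * (1.0 - e / (x - d)), 0.0, 100.0)

# e.g. popt, _ = scipy.optimize.curve_fit(f_saturated, x_sat, y_sat)
```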

6 VNF Sampling Strategies

In this section, we present our results after investigating different sampling methods. As reference and baseline sampling method, we generate workload samples uniformly (using the values specified in Section 4). The dots in Fig. 4 represent all the samples taken for one specific subset of all generated workloads, described in the figure title. Each of our tested sampling strategies would yield a different set of sampling points, but following the same trend curves. Instead of uniformly spaced samples, our heuristic tries to focus on more interesting parts of the trend curves, like the breakpoint between the (non-)saturated regions.

We also check the effect of the sampling heuristic on the accuracy of the modeling methods proposed in the previous section. We have executed all workload and resource configurations defined in Section 4 twice.

Fig. 4 The Curve Fit model exemplified in two example VNF subsets: (a) pfSense example subset; (b) Nginx example subset. For each plot, the y-axis on the left is for the saturating resource usage, the one on the right depicts the increasing KPI value. The boundary between (non-)saturated regions is where the workload covariance with the KPI becomes larger than with the resource usage.

We thus obtain two datasets per VNF, of which one is used as training set and the other as test set. The full factorial test set, containing all possible combinations, counts 14400 samples for pfSense and 22000 samples for Nginx. From the training set, we select data points using the validated sampling heuristics and use these selected samples to train a prediction model for the KPI values. The increasing number of samples taken is given on the x-axis in the plots throughout this section.

We then check the Root Mean Squared Error (RMSE) of the trained model on the test dataset. The reported RMSE in the next sections is then an indication for the accuracy of the model, trained from a limited sample set. We only evaluate the RMSE below 30% packet loss or lag ratio, as we assume that above this threshold the VNF is practically unusable. Therefore there is no need to assess the model accuracy above a KPI threshold of 30%.
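The accuracy evaluation can be summarized in the following sketch (our own formulation): the RMSE is computed on the test set, restricted to the samples where the measured KPI stays below the 30% usability threshold.

```python
import numpy as np

def evaluate_rmse(model, X_test, y_test, kpi_threshold=30.0):
    """RMSE of the trained model, evaluated only where the measured KPI < threshold."""
    mask = y_test < kpi_threshold
    y_pred = model.predict(X_test[mask])
    return float(np.sqrt(np.mean((y_test[mask] - y_pred) ** 2)))
```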

6.1 Uniform Sampling

As baseline measurement and benchmark for our sampling heuristic, we first test the accuracy of a generic uniform sampling strategy. For each VNF, we pick resource and workload metric values uniformly in the ranges described above in Section 4. The result is shown in Fig. 5. On the x-axis, the number of measurements indicates an increasing number of uniformly chosen values per metric: first we measure one value, then two values for each metric, and so forth.

As can be seen in Fig. 5, different modeling methods yield varying accuracy. Support Vector Regression (SVR) proves the least accurate. The tuning of the hyperparameters (C, ε) in the SVR model is a tedious task which we do using an exhaustive search, and it also seems very sensitive to the number of training samples. We experience long training times (tens of minutes with more than 1000 samples). We did not succeed in reaching the same accuracy as with the other methods. The best performing methods, which we select for further use, are interpolation and curve fit. This is also in line with previous research done in [4].

Uniform sampling is however not an online sampling method, as the number of samples must be chosen in advance. In the next sections, we will try to find heuristics to select samples in an online way, and try to reach the same accuracy with fewer training samples.

6.2 Unsuccessful Strategies

It is worth noting that uniform sampling, exemplified above, is not the worst performing method we encountered during our tests. When following the approach outlined in [14], a Gaussian Process (GP) is used to select online which samples to measure next. In fact, GP is a modeling method, but it also produces an estimate of uncertainty in the prediction, a confidence interval for the predicted values as indicated in Fig. 6a. The points where the predicted values have the largest confidence interval are considered as the most interesting region to sample next.

Fig. 5 The accuracy of different modeling methods applied to a uniformly sampled dataset: (a) pfSense; (b) Nginx.

(a) Gaussian Process confidence example: pfSense subset with 2 vCPU, packetsize 512B and 100 flows, showing the 95% confidence interval of the GP prediction, the baseline measurement and the bisected samples (packet loss vs. packet rate).
(b) pfSense model accuracy using the Gaussian Process to select samples.
(c) Nginx model accuracy using the Gaussian Process to select samples. Panels (b) and (c) plot the RMSE against the number of samples for different sampling + model combinations: uniform + interpolation or curve fit, GP + curve fit, GP + GP and GP + interpolation.

Fig. 6 The Gaussian Process sampling method does not improve the accuracy.

the GP is an indication of the region in which the uncertainty of the predicted value is largest. For our datasets, this selection method is not optimal, since it favors regions where a flat response is measured, and not the transition zone towards the steeper part of the trend.
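A minimal sketch of this selection loop, using the scikit-learn GP implementation, is given below; the kernel choice and the variable names (X_measured, y_measured, X_candidates) are our own assumptions, not the exact setup of [14].

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def next_sample_by_gp_uncertainty(X_measured, y_measured, X_candidates):
    """Return the candidate configuration where the GP prediction is the
    least certain, i.e. has the largest predictive standard deviation."""
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_measured, y_measured)
    _, std = gp.predict(X_candidates, return_std=True)
    # The widest confidence interval corresponds to the largest std.
    return X_candidates[np.argmax(std)]
```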

Fig. 6b and 6c show the effect of a GP-chosen sample set on different modeling methods, compared to the uniformly sampled benchmark. We indeed see that GP does not select samples optimally in our datasets, as the accuracy is worse compared to the benchmark. We can also visualize why some strategies are under-performing. For example, a sampled KPI response surface is illustrated in Fig. 7.

– Sample selection using GP will likely select points in the flat (blue) area of Fig. 7, as explained in the previous paragraph.

– On the other hand, gradient-based sampling strategies will likely choose sample points in the steep (green/red) regions of Fig. 7.

For our datasets, the most interesting region to focus on is where the trend break happens, i.e. where the flat region transitions into the steep region. In the next sections we try to find heuristics which focus on this area.

6.3 Feature Selection

In our black-box approach, we try to cover the operational boundaries at the start of the profiling procedure. We measure each combination of the minimal and maximal values of the VNF workload and resource metrics mentioned in Section 4. This comes down to a full factorial sampling space with two levels per factor. This gives us a minimal sample set where an initial analysis can indicate which parameters are important or not. We then use this information to determine which metric configurations to profile next.
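As an illustration, the sketch below enumerates such a two-level design. The metric names and boundary values are placeholders chosen by us; the actual ranges are the ones listed in Section 4.

```python
from itertools import product

# Minimal and maximal value per workload/resource metric
# (hypothetical boundaries, for illustration only).
metric_bounds = {
    "packetrate_kpps": (1, 500),
    "packetsize_B": (64, 1500),
    "flows": (1, 10000),
    "cpu_limit_pct": (20, 500),
}

# Two levels per factor -> 2**4 = 16 initial configurations to profile.
initial_design = [dict(zip(metric_bounds, levels))
                  for levels in product(*metric_bounds.values())]
print(len(initial_design))  # 16
```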


Fig. 7 A subset of the pfSense dataset to illustrate the response surface of the KPI (packet loss, %) under different vCPU allocations (axes: packet rate in kpps and cpu limit in %; subset: 512B packetsize, 100 flows).

In this first phase of metrics exploration, the intention is primarily to select the workload and resource metrics (X) which have the most influence on the performance metrics (Y). Several mathematical methods are generally used to capture the relation between two sets of metrics (Y = A·X + B):

– Multiple Linear Regression (MLR): achieves maximum correlation between X and Y, using the well-known Ordinary Least Squares (OLS) method.

– Principal Component Regression (PCR): captures the maximum variance in X only, using the well-known Principal Component Analysis (PCA) method. It then uses OLS to predict Y from the main components of X, calculated by PCA.

– Partial Least Squares (PLS): tries to do both, by maximizing the covariance between X and Y. The power of this method lies in the fact that influential factors in X can be determined using relatively few samples compared to the other methods. PLS remains robust when the influence on multiple response metrics needs to be assessed (there are multiple columns in Y), or when there is multicollinearity between the input variables (collinear columns in X) [20].

For our purpose, we need a method which can estimate the relation between X and Y from a small initial dataset. It is also not our main goal in this initial phase to derive a (linear) model; we only need an idea of the main factors in X that seem to have the most effect on Y. We therefore select PLS (also sometimes called Projection to Latent Structures) [20] as the most generic method for this use case. The results are shown in Fig. 8, where the height of the bars indicates how much each workload or resource metric influences the KPI variation. The bar heights can be thought of as coefficients of a regression model, but from this small dataset the only reasonable conclusion for now is that a higher bar indicates a more influential factor for the KPI.
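A ranking of this kind can be obtained with a standard PLS implementation, as sketched below; the function and variable names are ours, and the absolute regression coefficients on standardized inputs play the role of the bar heights in Fig. 8.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

def pls_feature_importance(X_init, y_init, feature_names, n_components=2):
    """Rank workload/resource metrics by their influence on the KPI,
    estimated from the small initial (two-level full factorial) sample set."""
    Xs = StandardScaler().fit_transform(X_init)
    ys = StandardScaler().fit_transform(np.asarray(y_init).reshape(-1, 1))
    pls = PLSRegression(n_components=n_components)
    pls.fit(Xs, ys)
    # Absolute coefficients of the PLS regression act as relative importances.
    importance = np.abs(pls.coef_).ravel()
    return dict(zip(feature_names, importance))
```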

(a) pfSense feature selection: relative feature importance of pcktrate, pcktsize, flows and cpu_limit. (b) Nginx feature selection: relative feature importance of streams, movies, quality and flavor.

Fig. 8 Feature selection based on the initial sample set of the profiled VNFs.

– For pfSense in Fig. 8a, we clearly see that the workload metrics packetsize and flows have little to no influence on the KPI (packet loss).

– For Nginx in Fig. 8b, we observe that all parameters have a reasonable contribution to the variation of the KPI (lag ratio). This means that we should not be selective in the input metric space here.

We must note that the analysis above takes only the variation between the edge values into account. The sample size is too limited to assess if there are any local extrema in between the edge values. In the next sections we will further investigate where to pick the next profiling points.

6.4 Primary Workload Metric Selection (wlp)

We further limit the sampling space by focusing on selected metrics. For each VNF we can prioritize one specific workload metric, which is more likely to vary in real-world traffic. This is the x-axis metric used on the plots in Fig. 3: packet rate (pfSense) and streams (Nginx). With the VNF deployed in production, we assume that in the most realistic scenarios both resource allocation and workload configuration are relatively stable, while the respective x-axis metrics in Fig. 3 are most likely to vary. We call this the primary workload metric wlp. We focus our profiling efforts in such a way that the VNF model can predict more fine-grained at which level of wlp the performance falls outside the SLA bounds (using Eq. 1). The other workload and resource allocation metrics are sampled more coarse-grained. We favor the respective wlp metrics during our profiling measurements by picking more samples in their specified range, in order to estimate the performance breakpoint more accurately. For pfSense the wlp = packetrate,
