
5.2 FTP traffic

5.2.3 Prediction

Having inferred the causal graph of our system, we now demonstrate its usefulness by predicting interventions on the three most important variables (see the PFTK model, Equation (3.1), Section 3.2.1), namely RTT, p and rwin. It should be noticed that, in this scenario, we cannot directly intervene on the network to validate the results as we could for the emulated network (cf. Section 5.1.4).
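For reference, the role of these three variables is made explicit by the standard form of the PFTK formula from Padhye et al. (recalled here for convenience; the exact Equation (3.1) of Section 3.2.1 may differ in notation). Here b is the number of packets acknowledged per ACK, T_0 the retransmission timeout, and W_max the maximum window, which is bounded by the receiver window rwin:

\[
B \approx \min\!\left(\frac{W_{\max}}{RTT},\;
\frac{1}{RTT\sqrt{\tfrac{2bp}{3}} \;+\; T_0\,\min\!\left(1,\, 3\sqrt{\tfrac{3bp}{8}}\right) p\,(1+32p^2)}\right)
\]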

Intervening on the Round Trip Time

From the causal graph presented in Figure 5.10, we want to predict the effect of manually intervening on the RTT. To do so, we use the back-door criterion (cf. Section 1.3.4), which ensures the identifiability of this intervention. There is one trek between rtt and tput, blocked by tod.4 We can apply the back-door adjustment, Theorem 3, with the blocking set Z = {tod}. We use normal kernels to estimate the marginals of the throughput (tput), the RTT (rtt) and the Time Of the Day (tod), and a T-copula to model the conditional probability density function f(tput | rtt, tod).
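Written explicitly for this blocking set, the back-door adjustment averages the conditional expectation of the throughput over the marginal distribution of the time of day (the notation below is ours, but the formula is the direct instance of Theorem 3 with Z = {tod}):

\[
E[tput \mid do(rtt = r)] \;=\; \int E[tput \mid rtt = r,\, tod = z]\; f_{tod}(z)\, dz .
\]

In practice, the conditional expectation is obtained from the copula-based estimate of f(tput | rtt, tod), and the marginal f_tod from its kernel estimate.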

We know from Table 5.3 that the throughput before intervention has an average value of 332 kBps, for an average observed RTT of 270 ms. The intervention of setting the RTT to 100 ms gives an expected throughput of 480 kBps, with a standard deviation of 176 kBps (compared to 256 kBps prior to intervention). This result shows that, by decreasing the value of the RTT and its variance, we can improve the throughput via intervention, and that we can precisely quantify the expected improvement in the throughput if we modify the RTT.

4 “A trek between distinct vertices A and B is an unordered pair of directed paths between A and B that have the same source, and intersect only at the source” [SGS01]

The improvement of the throughput (tput) when decreasing the RTT is an expected behavior of TCP performance. However, the naive approach, which computes the effect of setting the RTT to 100 ms by selecting the observations where this value of RTT is observed, would produce an incorrect estimate of the post-intervention throughput. Figure 5.16 compares the results of the two methods. The dashed line with cross markers represents the expected value of the throughput, for a given value of RTT, that would be obtained using the naive approach. The solid line with diamond markers represents the expected value post intervention, computed with the back-door adjustment. We also added the number of samples that were used to compute the different values of the throughput.

We see in Figure 5.16 that the expected value of the throughput after intervention is smaller than the value we would obtain with the naive approach. This result is supported by our causal model, Figure 5.10, where conditioning on RTT to predict the throughput also takes into account the effect of the receiver window on the RTT. As mentioned in Section 5.1.4, when studying how an intervention on a parameter X would impact Y, all external variables (not in {X, Y}) maintain their dependence with Y, but not via their (possible) dependence with X, whose behavior is dictated only by the intervention. Therefore, the impact of the receiver window on the RTT is spurious and must be neglected, which is done in our approach by the use of the back-door formula with Z = {tod}.

However, this is not the case in the naive approach, which includes the spurious impact of the receiver window on the RTT and overestimates the effect of an intervention on the RTT. The fact that both curves are quite similar can be explained by the fact that only one variable, the tod, is used to block the only back-door path passing through rwin, which does not exhibit important variations in our dataset. It should be noticed that, in the field of telecommunications, one should not expect improvements orders of magnitude higher than the ones we observe here, due to the complexity of the systems, which does not mean that such improvements are not useful.5
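To make the difference between the two estimators concrete, here is a minimal Python sketch contrasting a naive conditional estimate with a back-door-adjusted one. The data-generating process, the variable names and the window-based conditioning are hypothetical stand-ins for the kernel-and-copula estimation actually used in this study; only the structure (tod confounding rtt and tput) mirrors Figure 5.10.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data: tod confounds rtt and tput,
# as in the causal graph of Figure 5.10.
n = 20000
tod = rng.integers(0, 24, n)                      # hour of the day
load = 1.0 + 0.5 * np.sin(2 * np.pi * tod / 24)   # tod-driven network load
rtt = rng.gamma(4.0, 50.0 * load)                 # ms; higher load -> higher RTT
tput = 40000.0 / (rtt + 50.0) / load + rng.normal(0.0, 20.0, n)  # kBps (toy model)

def naive(r, width=20.0):
    """E[tput | rtt ~ r]: plain conditioning, ignores the confounder tod."""
    mask = np.abs(rtt - r) < width
    return tput[mask].mean()

def backdoor(r, width=20.0):
    """E[tput | do(rtt = r)] = sum_z P(tod = z) E[tput | rtt ~ r, tod = z]."""
    est = 0.0
    for z in np.unique(tod):
        in_z = tod == z
        near = in_z & (np.abs(rtt - r) < width)
        if near.any():                 # tod bins with no samples near r are skipped
            est += in_z.mean() * tput[near].mean()
    return est

for r in (100.0, 200.0, 300.0):
    print(f"rtt={r:5.0f} ms  naive={naive(r):6.1f} kBps  back-door={backdoor(r):6.1f} kBps")
```

On such data, the naive estimate mixes the effect of rtt with that of tod, while the adjusted estimate reweights the tod strata by their observational frequencies; this is exactly the discrepancy that Figure 5.16 displays on the real dataset.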

Intervening on loss event frequency

We now predict the effect of an intervention on the loss parameter p. In this case, the blocking set is Z = {rtt}.

Setting the p parameter to 0.01% allows us to achieve an expected throughput of 525.5 kBps, as compared to a throughput of 332 kBps for a value of p of 0.07% prior to intervention.

5 Amazon claimed that every 100 ms of latency costs it 1% of profit.



Figure 5.16: Effect of an intervention on the RTT, on the throughput distribution (x-axis: RTT in ms; y-axis: throughput in kBps and number of samples; the plot shows the expected value with the naive approach, the expected value with the causal approach, and the number of samples per RTT value).


Figure 5.17 compares the naive approach and the causal approach for predicting the impact of the intervention. Again, we see that the naive approach overestimates the effect of an intervention on p. The difference between the two methods is slightly bigger than the one observed in Figure 5.16 when intervening on the RTT.

Intervening on the receiver window

One parameter absent from the PFTK model, Equation (3.1), but present in the model of Figure 5.10, is the receiver window. Upon investigation, we observed that the distribution of the receiver window in our dataset essentially has two peaks, around 72 kB and 235 kB, which reduces the range of interventions we can predict, see Figure 5.18. The average value of the receiver window prior to intervention is 137 kB, for an average throughput of 332 kBps, see Table 5.3.

Using the back-door criterion with Z = {tod}, we can predict the effect of intervening on the receiver window to fix its value to 235 kB. This intervention would result in a 9% improvement of the throughput (365 kBps) for an increase of 70% in the value of the receiver window. In comparison, with a decrease of 70% of RTT or p, we predicted an improvement of the throughput of 44% and 58%, respectively. This observation is consistent with our domain knowledge and with the Intrabase [SBUK+05] result showing that the impact of the receiver window on the throughput is less important than that of loss and delay.


Figure 5.17: Effect of an intervention on p, on the throughput distribution (x-axis: p in %; y-axis: throughput in kBps and number of samples; the plot shows the expected throughput with the naive approach, the expected throughput with the causal approach, and the number of samples per value of p).

This does not imply that an intervention on the receiver window should never be made. To decide on an intervention, an operator must balance and compare benefits and costs. Intervening on the loss or delay means upgrading network equipment, which may be more costly than intervening on the receiver window. Here, we do not have the necessary information to assess the cost (and monetary benefits) of interventions, hence our goal is not to give generic recommendations on which intervention should be performed. Rather, we provide a formal method to predict the expected gain in terms of performance for each intervention. These predictions can then be used to decide on the best intervention in terms of cost versus performance gain (and the corresponding economic gain).

Figure 5.18: Receiver window variations in the FTP dataset (x-axis: rwin in kB; y-axis: number of samples).



5.2.4 Concluding remarks

In this section, we applied the causal approach, using the methodology described in the earlier sections, to the study of TCP performance over real-world FTP connections. Although we cannot fully validate the predictions made in this real-case scenario (since we do not have direct access to the parameters of intervention), the results we obtain are in complete agreement with our understanding of TCP and with the literature. Our study shows that, provided the right tools are used appropriately to cope with the characteristics of the data (in particular the use of the KCI test and of copulae to cope with the non-normality of the distributions and the non-linearity of the dependences), the causal framework can be very useful to predict the effect of interventions in the network.
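The need for a non-linear test such as KCI, rather than a Z-Fisher-type criterion based on (partial) correlations, can be illustrated with a toy example (synthetic data, unrelated to the FTP traces): a purely linear statistic can entirely miss a strong but non-monotonic dependence.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# x and y are strongly dependent, but the dependence is not linear.
x = rng.normal(0.0, 1.0, 2000)
y = x**2 + rng.normal(0.0, 0.1, 2000)

r, p_lin = stats.pearsonr(x, y)              # linear test: sees (almost) nothing
rho, p_rank = stats.spearmanr(np.abs(x), y)  # crude non-linear check on |x|

print(f"Pearson  r   = {r:+.3f} (p = {p_lin:.2f})")
print(f"Spearman rho = {rho:+.3f} (p = {p_rank:.1e})")
```

A criterion built on such linear correlations would wrongly declare x and y independent; this is the kind of error the KCI test avoids on network data, where dependences such as tput versus p are strongly non-linear.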

In telecommunication networks, changes in the architecture, routing policies or content placement represent important costs in terms of money and time, so that testing interventions is often impossible. Due to the complexity of the system, intervention decisions usually rely on domain knowledge and back-of-the-envelope computations. Our study offers an appealing, formal and automatic methodology to estimate much more precisely the expected effect of an intervention on the system's performance without having to test the intervention in practice. These predictions can then be translated into economic gains (see [TJA12] and references therein) and used by an operator to decide on a potential intervention, by comparing this gain to the intervention cost that only the operator can quantify.

A limitation of the approach proposed in this study is that the range of interventions that can be predicted depends on the dataset, i.e., on the range of values that were observed prior to these interventions (as, e.g., in the case of the intervention on the receiver window, Section 5.2.3). This constitutes a limitation, as we are often interested in predicting the effect of setting a parameter to a value that was never observed. This limitation is inherent to the fact of inferring the model solely from the dataset. For this reason, when conducting a causal study, “outliers” are often of great importance, as they represent points of interest for predictions. A possible solution to perform intervention predictions in regions not observed in the data is to extend the model to these regions using additional assumptions not drawn from the dataset, but the accuracy of the results will likely be very sensitive to the assumptions made. Our attempts to model the parameters with mixtures of known distributions, such as Normal, Beta and Gamma distributions, showed that such models can approximate the distributions for the values of the parameters we observe, but that they become totally inaccurate if used to extrapolate the distributions of the parameters of our model to domains where they were never observed.
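The following sketch illustrates this extrapolation failure on synthetic data (scikit-learn's GaussianMixture stands in here for the Rebmix fits used in the study; the distribution and the truncation threshold are hypothetical).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# True parameter distribution: heavy right tail (e.g. an RTT-like variable).
full = rng.lognormal(mean=5.0, sigma=0.6, size=50000)

# We only ever observe values below some threshold, as in a real trace.
observed = full[full < 400].reshape(-1, 1)

gm = GaussianMixture(n_components=3, random_state=0).fit(observed)

# Density assigned by the fitted mixture inside and outside the observed range.
for x in (150.0, 350.0, 600.0, 900.0):
    log_density = gm.score_samples(np.array([[x]]))[0]
    print(f"x = {x:5.0f}  fitted log-density = {log_density:8.2f}")

# Inside [0, 400] the fit tracks the data; beyond 400 the Gaussian tails
# decay far too fast, so the model badly underestimates the true density.
```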


5.3 Lessons learned from the FTP study

As in the study of the Dropbox application, Section 4.5, this section presents some of the lessons learned from the two studies presented in this chapter.

5.3.1 The importance of artificial dataset designs and their limitations

The first part of this chapter was dedicated to the causal study of the emulated network. The possibility to study a system where interventions can actually be performed has been critical to our work, as our interest in the causal study of telecommunication networks comes precisely from the ability to predict interventions.

The accuracy of our methods should only be assessed in terms of intervention prediction accuracy. For this reason, the study presented in Section 5.1 is a key step in our work.

We started our work, in the study presented in Chapter 4, using solutions that were already implemented, publicly available and used in other domains. We found that these methods did not, and could not, work. We then adapted these methods to take into account the specificities and constraints of telecommunication networks. The changes that we had to make required using more complex concepts for testing independences during the inference of our graphical causal models and for modeling the parameter distributions. Such changes meant allocating more resources (time and computation).

The only way to justify these changes and to validate our approach and its parameterization was to compare our results to a ground truth. This could be done with the use of an emulated network.

As in many other research domains, it is always important to dedicate time to validating, step by step, the different choices made in the design of any solution. As the Dropbox study showed, Chapter 4, it is important to take into consideration the domain being studied and the constraints inherent to the system. Dropbox was a complicated application to study compared to FTP, and the choice of the FTP application allowed us to create a simple scenario and validate our methods.

Finally, we used artificial datasets several times in our work. Such artificial datasets range from the most advanced situation, presented in this chapter, where we emulate a network, to a very simple case where we generate four variables according to predefined mechanisms, as was done to study the parameterization of our method in Section 3.3.2, to compare the Z-Fisher criterion and the KCI test in Appendix A.1, to study the KCI performance in the presence of categorical data in Appendix A.2, and to study the impact of data shortage on the quality of our predictions in Appendix B. However, artificial datasets are limited.


We can see the limitation of the emulated network in the dependence found between the bandwidth and the delay in the causal model presented in Figure 5.2 and in the causal model of the randomized experiment presented in Figure 5.8.

The two models show i) the risk of bias when recreating scenarios artificially, and ii) the limitations, in terms of complexity and variations, that exist when trying to use simple models to recreate complex mechanisms.

5.3.2 Reality is always more complex

Interestingly, complementary conclusions can be drawn concerning the study presented in Section 5.2.

In the study of a real-world system, we are subject to more possible errors. The misbehavior of a network component (electronic failures) or errors in our measurements are among the most probable errors that can happen in the study of real systems. The reason why we want to model a system is to gain a control on its behavior that, by default, we do not have. Fortunately, the different values of a parameter do not only capture the behavior of this parameter, but also that of the external parameters (exogenous variables) influencing it (we ignore latent variables here). Taking the receiver window as an example, its values do not only capture the values present in the TCP headers of the packets sent by one entity to the other, but also the parameters influencing the auto-tuning mechanism, as highlighted in the causal model presented in Figure 5.10.

The fact that a parameter represents a mechanism ruling the system behavior sheds light on two important points related to domain knowledge:

• First, while we do not need a very deep domain knowledge of the system (otherwise modeling such a system with a graphical causal model would be less appealing), it is necessary to have, in our model, a parameter for each mechanism influencing the performance of the system.

• Second, while Bayesian networks offer a very intuitive way to see causal dependencies between the different parameters of the system we are studying, and can exhibit dependencies we did not know about, it is our domain knowledge that finally gives us the full characterization of such dependencies. The best examples are the dependency between delay and receiver window, which required some research to understand, and the dependency between the RTT and the loss, which invalidates the PFTK model that had suggested the choice of several parameters of our model.

5.3.3 The importance of using the correct methods

The comparison between the predictions of interventions obtained with the naive approach and with the causal approach, presented in Figure 5.16 and Figure 5.17, shows that the results of the two methods present a relatively small difference. While we explain in Section 5.2.4 why such a difference is still very important in the domain of telecommunication networks, this remark should not obscure the important finding that accurate predictions can be obtained if the correct methods are used to infer the causal model and to model the distributions of its different parameters. The accuracy of our predictions could be measured in the case of the emulated network, where the difference with the results obtained with the naive approach is more important. The predictions we made proved to be accurate while requiring a limited amount of data.

This remark shows the importance of dedicating time and resources to correctly designing our method and taking into account the constraints of the system, in our case the non-linearity and the non-normality of the data. Parametric modeling of order-three distributions, in zones where we only have a few tens of samples, using for example the Rebmix package [NF11], did not succeed. Other methods, such as histograms or any quantization method, cannot work in cases where the number of dimensions of the distribution we need to estimate is large (like the distributions we obtain from the equations derived with the back-door criterion) and the number of samples is small.
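An illustrative order-of-magnitude count makes the problem explicit (the bin count below is an assumption chosen for illustration, not a figure from the dataset): a histogram with b bins per dimension must populate on the order of b^d cells to estimate a d-dimensional density. Even the modest three-dimensional conditional f(tput | rtt, tod), with b = 10 bins per dimension, already requires 10^3 = 1000 cells, while Figure 5.16 shows that some regions of the RTT range contain only a few tens of samples.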


Chapter 6

Study of the DNS service

In the previous chapter, we validated our methods in an emulated environment and then presented the causal study of FTP traffic in a real-case scenario. In this chapter, we extend our use of causal models and of causal theory to the study of counterfactuals, which allow us to better understand the impact of the different mechanisms behind the DNS service on the performance of Content Distribution Network (CDN) users.

6.1 Problem statement

Each time an Internet user wants to access a resource, they use a human-readable name called a Uniform Resource Locator (URL), containing the domain name of the administrative entity hosting this resource. However, a domain name is not routable and needs to be translated into the (IP) address of a server hosting the resource the client wants to access. It is the role of the DNS service to translate the domain name into the corresponding IP address. Many popular services, such as YouTube, iTunes, Facebook and Twitter, rely on CDNs, where objects are replicated on different servers and in different geographical locations to optimize the performance for their users. When a client wants to access one of these services, its default DNS server contacts the authoritative DNS server for the domain hosting the resource the client is requesting. Based on the location the request comes from, the authoritative DNS server redirects the client to the optimal server.
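As a minimal illustration of this mechanism, the following Python snippet resolves a name and prints the addresses returned; for a CDN-hosted domain, the answer typically depends on which resolver is queried and on where that resolver is located (www.example.com is used here purely as a placeholder domain):

```python
import socket

# Ask the default resolver for the addresses behind a (placeholder) CDN name.
# getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples.
for family, _, _, _, sockaddr in socket.getaddrinfo("www.example.com", 80,
                                                    proto=socket.IPPROTO_TCP):
    print(sockaddr[0])
```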

Most ISPs provide a DNS service, but it is now common to see customers using a public DNS service instead [OSRB12], such as Google DNS or OpenDNS. Clients using the DNS service of their ISP are served by a local DNS server. Local DNS servers are closer to the clients, and their usage provides more accurate location information to the CDN, compared to the locations represented by the few DNS servers offered by a public DNS service, Figure 6.1. There have been several studies suggesting that public DNS services do not perform as well as the local DNS services provided by ISPs,