
[Figure B.3 here: histogram; x-axis: minimum congestion window, cwinmin (kB); y-axis: f(cwinmin); series: LDNS and GDNS]

Figure B.3: Histogram of the minimum congestion window for the LDNS and GDNS

It should be noticed that, as X is a continuous variable, the probability of observing a given value is 0. Therefore, instead of selecting samples for which $X = x$ is observed, we define an interval $I_x$ corresponding to $[x - \delta_X; x + \delta_X]$ and assume that, for the samples whose X parameter value falls into this interval, the value is approximated by $x$.

From these samples, we estimate the cumulative distribution function (CDF) of the throughput and of Z (for the inetrttavg parameter, $Z = \{DNS, ts\}$, and for the cwinmin parameter, $Z = \{DNS, inetrttstd\}$; as we first condition on a given DNS, we have $Z_{inetrttavg} = \{ts\}$ and $Z_{cwinmin} = \{inetrttstd\}$). After selecting the connections (= samples) where the DNS service we want to condition on is present, we use normal kernels to estimate the CDFs of tput and Z.
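As an illustration, such a kernel-based CDF can be computed by averaging normal CDFs centered on the samples, since integrating a Gaussian kernel density estimate yields exactly this form. A minimal sketch in Python (the bandwidth rule is our assumption; the text does not state which selector was used):

import numpy as np
from scipy.stats import norm

def kernel_cdf(samples, grid, bandwidth=None):
    # Estimate the CDF of `samples` at the points of `grid` with normal
    # kernels: the CDF of a Gaussian KDE is the average of the normal
    # CDFs centered on each sample.
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if bandwidth is None:
        # Silverman's rule of thumb (an assumption, see lead-in).
        bandwidth = 1.06 * samples.std() * len(samples) ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return norm.cdf(z).mean(axis=1)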

Using a Gaussian copula or a t-copula, we estimate $f(Tput \mid X = x, Z = z, DNS = dns)$ using the equation:

$$f(Tput \mid X = x, Z = z, DNS = dns) = c\bigl(F(Tput \mid X = x, DNS = dns),\, F(Z \mid X = x, DNS = dns)\bigr)\, f(Tput \mid X = x, DNS = dns) \tag{B.3}$$
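For instance, the density of a bivariate Gaussian copula, one of the two options mentioned above, can be evaluated at the two conditional CDF values as follows (a sketch; the correlation parameter rho is assumed to have been fitted beforehand from the data):

import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_density(u, v, rho):
    # u, v are CDF values in (0, 1), e.g. F(Tput | X = x, DNS = dns) and
    # F(Z | X = x, DNS = dns) as in Equation (B.3).
    a, b = norm.ppf(u), norm.ppf(v)
    joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
    # Gaussian copula density: the joint normal density divided by the
    # product of its standard normal marginals, at the normal scores.
    return joint.pdf(np.column_stack([a, b])) / (norm.pdf(a) * norm.pdf(b))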

Integrating Equation (B.3) over Z, we obtain the post-intervention PDF

$$f(Tput \mid do(X = x), DNS = dns) \tag{B.4}$$

We then select the samples for which $DNS \neq dns$ and use normal kernels to estimate $f(X \mid DNS \neq dns)$ and frequencies to estimate $P(DNS \neq dns)$.

After these steps, we have all the factors present in Equation (B.1) or Equation (B.2) and we can integrate over X to obtain the probability distribution function of the throughput post intervention.

B.3 Practical issues

However, some practical issues remain:

1. How to define the intervals $I_X$?

2. As we are working with continuous variables, the two PDFs conditional on different DNS do not share the same domain. How do we define the conditional probabilities so that we have common values to integrate over?

The last point is solved by always defining the PDF domains as equally spaced points between the minimum and maximum observed values of the corresponding parameter in the whole dataset. We use normal kernels to estimate the different PDFs.
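For instance, the shared domain can be built once from the whole dataset and reused for every conditional PDF; a minimal sketch (function and variable names are ours):

import numpy as np
from scipy.stats import gaussian_kde

def pdfs_on_common_grid(samples_by_dns, n_points=200):
    # Build one equally spaced grid spanning the min/max of the parameter
    # over the whole dataset, then evaluate each conditional PDF on it so
    # that PDFs conditional on different DNS share the same domain.
    all_samples = np.concatenate(list(samples_by_dns.values()))
    grid = np.linspace(all_samples.min(), all_samples.max(), n_points)
    pdfs = {dns: gaussian_kde(s)(grid) for dns, s in samples_by_dns.items()}
    return grid, pdfs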

The first point, however, is more complicated, as many possibilities exist and there is the constraint of finding enough samples in each interval to estimate the CDFs from which the copula parameters will be estimated and the conditional PDFs derived.

Several solutions have been tested:

• Variable bin width histograms

• Fixed bin width and interpolation

• Fixed bin width, filtering, and rescaling

B.3.1 Variable bin width histograms

The first method consists in fixing a target number of samples and, starting from a fixed bin width histogram, merging adjacent bins until obtaining bins of different sizes, each holding at least this minimum number of samples (see the sketch after the pros and cons below):

Pros This method ensures that the maximum number of atomic predictions succeed.

Cons As many of the parameters we observe have long-tailed distributions, and it is often in the tail of a parameter's distribution that we find the values for interesting predictions, we obtain very large bins for the extreme values of the intervention parameter. The approximation of such a bin representing a single value is then too coarse. Additionally, when multiplying the value of the PDF after an atomic intervention fixing X to the value x by the value of the PDF conditional on the other DNS for $X = x$ (Equation (B.1) and Equation (B.2)), we need to take into account the actual range of X that the atomic intervention represents to have a consistent approximation. Said differently, if the bin corresponding to $X = x$ under $DNS = d_1$ covers a large interval because of data shortage, multiplying the probability $f(Y \mid do(X = x), DNS = d_1)$ by $f(X \mid DNS = d_2)$ requires that $f(X \mid DNS = d_2)$ be computed from the same bin interval, when possible, or weighted according to the bin width.
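A minimal sketch of the bin-merging step mentioned above (the left-to-right greedy merge order is our assumption; the text does not specify it):

import numpy as np

def merge_bins(counts, edges, min_samples):
    # Merge adjacent histogram bins, left to right, until every resulting
    # bin holds at least `min_samples` samples.
    new_edges = [edges[0]]
    new_counts = []
    acc = 0
    for count, right_edge in zip(counts, edges[1:]):
        acc += count
        if acc >= min_samples:
            new_counts.append(acc)
            new_edges.append(right_edge)
            acc = 0
    if acc > 0 and new_counts:
        # Fold an under-populated trailing remainder into the last bin.
        new_counts[-1] += acc
        new_edges[-1] = edges[-1]
    return np.array(new_counts), np.array(new_edges)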


B.3.2 Fixed bin width

The second method goes in the opposite direction. We use histograms with fixed width bins and try to predict $f(Y \mid do(x))$ with the number of samples found in the bin corresponding to a given X value. If the estimation fails, we interpolate $f(Y \mid do(x))$ from the X values where the post-intervention PDF of Y could be estimated (see the sketch after the pros and cons below).

Pros This method solves the approximation inconsistency issue of the variable bin width method. We fix the bin width and decide on the approximation of assimilating an interval to a given value.

Cons It can often happen that we manage to predict $f(Y \mid do(x))$ even when there are few samples in the interval $I_x$ corresponding to the value x. However, with very few samples, this estimation is likely to be inaccurate. This lack of accuracy impacts the accuracy of the overall post-intervention PDF of the throughput in a very negative way, as a post-intervention PDF of Y computed from few values will be used in the interpolation to recover the PDF of Y for the x values where the estimation of the post-intervention PDF failed.
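The interpolation step could look as follows, assuming each successful atomic intervention yields a PDF vector over a common throughput grid and that failed estimations are filled in column by column (this reading of the method is ours):

import numpy as np

def interpolate_failed_pdfs(x_values, pdf_matrix):
    # One row per intervention value x (rows of NaN mark failed
    # estimations), one column per point of the throughput grid; each
    # column is interpolated linearly over the sorted x values.
    x_values = np.asarray(x_values, dtype=float)
    pdf_matrix = np.array(pdf_matrix, dtype=float)
    ok = ~np.isnan(pdf_matrix).any(axis=1)
    for j in range(pdf_matrix.shape[1]):
        pdf_matrix[~ok, j] = np.interp(x_values[~ok], x_values[ok],
                                       pdf_matrix[ok, j])
    return pdf_matrix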

B.3.3 Fixed bin width and high pass filter

The last method, the one eventually adopted, is based on the fact that, if there are very few samples in a given area of the distribution of $X \mid DNS = dns_1$, then the corresponding values are very unlikely to be observed. As the previous method impacts the overall results negatively by including these samples, we consider the PDF of $X \mid DNS = dns_1$ to be null in the domains with few samples.

This method uses a fixed bin width histogram and selects only the bins where a minimum number of samples is observed to predict atomic interventions. After predicting the distribution of the throughput after intervention, we rescale it based on the following observation:

$$\int_{tput} f(tput \mid do(X = x), DNS = dns_1)\, dtput = 1 \tag{B.5}$$

This method solves the previous issues. There is a risk of not having any prediction for values in the tail of the distribution of $X \mid DNS = dns_1$, which can represent the zone of overlap of the two distributions $X \mid DNS = dns_1$ and $X \mid DNS = dns_2$. However, limitations due to the lack of samples cannot be overcome by the simple approximations seen in the second method. The only way to solve this issue is to use parametric PDFs. The studies that we have made so far show that approximating the distributions of the parameters of our systems for values that are actually observed offers an acceptable accuracy, but that these models become inefficient as soon as we try to use them to estimate the PDF of a parameter for values that have not been observed.
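Concretely, once the under-populated bins have been given a null probability, the predicted density can be renormalized on its grid so that Equation (B.5) holds; a minimal sketch using the trapezoidal rule:

import numpy as np

def rescale_pdf(pdf_values, tput_grid):
    # Rescale a post-intervention PDF so that it integrates to 1 over the
    # throughput grid, as required by Equation (B.5).
    return pdf_values / np.trapz(pdf_values, tput_grid)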

B.3.4 Open questions

Eventually, one can ask the following questions:

1. To what extent does the absence of values observed for both conditional distributions impact the prediction accuracy, and how can the accuracy be improved in this case?

2. How to choose the bin width?

3. Should we favor the number of atomic predictions from which the final PDF is estimated, or the quality of the atomic intervention predictions, by increasing the number of samples from which these atomic predictions are computed?

To study these aspects, we need a ground truth to estimate the accuracy of our predictions for different strategies and parameterizations. Therefore, we generate a set of artificial datasets where the intersection between the domains of the two conditional PDFs ($DNS = d_1$ and $DNS = d_2$) is modified, and we study its impact on the accuracy of our method.

In practice, instead of varying the bin width and the threshold defining when a bin is given a probability of 0, we do the following: we select a target number of samples, $\Delta S$, under which the atomic intervention $f(Y \mid X_2 = 0, do(X = x))$ is considered null, and search for the optimal quantization leading to the maximum number of bins with a number of samples $\geq \Delta S$. To do so, we use the very simple Algorithm 1.

B.4 Artificial dataset

B.4.1 Simulated dependencies

To simulate the same situation as the one met in the study of the DNS service impact on the throughput, we randomly generate 4 parameters, X1, X2, X, and Y, with the dependencies illustrated by the graph presented in Figure B.4. To be closer to the situation observed in our DNS study, we generate X1 by randomly re-sampling the throughput observed in the causal study of the DNS impact, and we generate X2 by randomly re-sampling the observed DNS from the same study.

Data: Vector X
Result: Set of bins Xfinal, with fixed width δ, defining the intervals around the values on which atomic interventions will be predicted
initialization;
nB = 10; nVold = 0; nVcur = 1;
Xold = []; Xcur = []; Hold = []; Hcur = [];
while nVcur > nVold do
    Ht, Et = hist(X, nB);
    nVold = nVcur;
    nVcur = nvalues(Ht ≥ ∆S);
    Xold = Xcur; Xcur = Et;
    Hold = Hcur; Hcur = Ht;
    nB = 2 · nB;
end
nVfinal = nVold; Xfinal = Xold; Hfinal = Hold;
δ = Xfinal(2) − Xfinal(1);

Algorithm 1: Dynamic quantization
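For reference, a direct Python transcription of Algorithm 1 might look as follows (a sketch; like the pseudocode, it assumes that the first doubling of the number of bins already improves the count of valid bins):

import numpy as np

def dynamic_quantization(x, delta_s, n_bins=10):
    # Double the number of histogram bins while the count of bins holding
    # at least delta_s samples keeps increasing, then return the last
    # binning before the count stopped improving (cf. Algorithm 1).
    n_valid_old, n_valid_cur = 0, 1
    counts_old = edges_old = counts_cur = edges_cur = None
    while n_valid_cur > n_valid_old:
        counts, edges = np.histogram(x, bins=n_bins)
        n_valid_old = n_valid_cur
        n_valid_cur = int(np.sum(counts >= delta_s))
        counts_old, edges_old = counts_cur, edges_cur
        counts_cur, edges_cur = counts, edges
        n_bins *= 2
    delta = edges_old[1] - edges_old[0]  # fixed bin width of the result
    return counts_old, edges_old, delta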

We eventually convert X2 to a binary value (0 or 1). The presence of categorical data (the DNS in the corresponding study) is an important aspect that we want to keep in this study.

[Figure B.4 here: directed graph over the nodes X1, X2, X, and Y]

Figure B.4: Artificial dataset dependencies