• Aucun résultat trouvé

B.4.2 Intervention prediction

First, from the causal model, G, represented by the Bayesian network of Fig-ure B.4, we can use the d-separation criterion to deduce the following indepen-dences:

• (XyY|X1,X2)GX

• (X2yX)GX2

From the do-calculus rules from [Pea09] we can deduce that the distribution of Y under the conditionX2 = 0intervening on X to fix its distribution to X ∼ X|do(X2 =1)is given by

f(Y|X2=0,do(X ∼X|do(X2=1)))= Z

X

Z

X1

fY|X,X1,X2(x,x1,0)fX1(x1)fX|X2(x,1)Pr(X2 =1)dx1dx (B.6)

B.5 Ideal scenario

In this case we generate the dataset as follows:

X1= random_re-sampling(DNS) X2= random_re-sampling(Throughput) X∼

( Γ(k1, θ1)+ √

X1µ21X :X2=0 Γ(k2, θ2)+ √X1µ22X :X2,0 Y∼

( 10.√

5.X+10.X1Y :X2=0 25.√

3.X+6.X1Y :X2,0 withε·representing an error terms.

In this first case we chose the following values{k1 = 5, θ1 = 1.0,k2 = 2.0, θ2 = 2.0, µ1 =5, µ2 =8}. The resulting distributions are presented in the Figure B.5.

To make sure that the parameters are correctly generated, we infer the corre-sponding causal model using the PC algorithm [SG91] with the independence test from [ZPJS12].

Notice that, for testing our method under the same main constraint we limit our sample size to 10000 (against 7500 in the real case scenario) and we keep the same proportion between the number of samples where X2 = 0 is observed and the number of samples where X2 = 1 is observed, as it was the case in the real case scenario for the number of connections where the LDNS service was observed (80%) compared to the number of connections where the GDNS service was observed (20%).

A causal approach to the study of Telecommunication networks

182 B. TECHNICAL DETAILS RELATED TO THE DNS STUDY

Figure B.5:Distribution of the different parameters for both values ofX2

X1 X Y

X2

Figure B.6:Causal Model of the first dataset

B.5.1 Prediction of Y given X2 = 0 after intervention on X giving it the distribution of X given X2 = 1

Using the Equation B.6 we are able to compute the PDF f(Y|X2 = 0,do(X ∼ X|X2 =1))and obtain the expected valueE[Y|X2 =0,do(X∼ X|X2 =1)]. As mentioned in Section B.3.4, regarding the method exposed in Section B.3.3, the choice of the number of bins and the threshold to decide or not whether to estimate the post atomic intervention PDF, should have an impact on the pre-diction accuracy. Consequently, in this ideal scenario where the two distribu-tions of X|X2 = 0and X|X2 = 1have very similar domains, we vary these two parameters and study their impact on the prediction accuracy.

To obtain the distribution of fY|X2=0,do(X∼X|X2=1)we generate X and Y as following:

• X∼Γ(k2, θ2)+ √

X1µ22 +ε:X2 =0

• Y∼10.√5.X+10.X1

B.5. IDEAL SCENARIO 183

We obtain an expected value ofE[Y|X2 =0,do(X ∼ X|X2 =1)]of 101.5.

We summarize the different results we obtained in Table B.1.

We can see that the use of a T-copula (T-cop) in the modeling of the multi dimensional PDF gives slightly better results, in terms of accuracy, than the modeling of multidimensional PDF with a Gaussian copula (G-cop). However, very likely due to the functions used for generating the artificial dataset and the use of a Gamma distribution, the T-copula is not always able to model the PDFs fY|X,X1,X2(y,x,x1,0)that are necessary to compute the post intervention PDF of Y in Equation (B.6).

From this table, we can notice two additional things:

1. The accuracy of the prediction depends on the distribution of X (already noticed when comparing the two settings in the previous study, one where X is drawn from normal distribution and one from Gamma distribution).

We manage to obtain better accuracy (3%) when using normal distribu-tion instead of Gamma distribudistribu-tions (not presented).

2. In the results presented in Table B.1, there is no clear preference for smaller value of∆S

It is important to notice that the usage of a Gaussian copula was motivated by the prediction of an intervention in the opposite case (fY|do(X X|X2=0),X2=1(y)), see next section. In addition, we generated the artificial dataset in order to have the same domains for X|X2 = 1 and X|X2 = 1 and do not observe the same tail dependence as we would have for the throughput in the real case. From this perspective, the usage of T-copula is not fully justified and the usage of a Gaussian copula, for this particular dataset, could be more appropriate.

Table B.1:Effect of varying∆S on the prediction accuracy for the first artificial dataset for two different multi dimensional PDFs modeling, using a T-copula (T-cop) and Gaus-sian copula (G-cop).#AIstands for the number of Atomic Interventions

S ˆE[Y|X2 =0,do(X ∼ X|X2 =1)] Error #A.I. #Failures

G-cop T-cop G-cop T-cop G-cop T-cop

20 93.80 96.22 7.8% 5.5% 195 0 (0%) 5 (2.6%)

30 93.28 95.90 8.6% 6.0% 118 0 (0%) 3 (2.5%)

50 95.40 96.79 6.5% 5.2% 76 0 (0%) 2 (2.6%)

70 94.41 96.94 7.5% 5.0% 54 0 (0%) 1 (1.9%))

100 94.24 96.74 7.6% 5.2% 39 0 (0%) 0 (0 %)

A causal approach to the study of Telecommunication networks

184 B. TECHNICAL DETAILS RELATED TO THE DNS STUDY

B.5.2 Prediction of Y given X2 = 1 after intervention on X giving it the distribution of X given X2 = 0

Without repeating the explanations given in the previous section, we use Equa-tion (B.6) to predict the expected value ofE[Y|X2=1,do(X ∼X|X2=0)]. The results are presented in Table B.2.

Several important remarks can be made from the results presented:

• The usage of a T-copula for modeling the PDF of fY|X,X1,X2(y,x,x1,0)for different values ofXfails more than 50% of the times.

• The utilization of wider bins, gathering more data to approximateX = x, does not improve the the success rate of the prediction of the PDFs post atomic intervention.

• The usage of G-copula gives better results (in terms of prediction accu-racy) than the previous prediction of intervention.

Table B.2:Effect of varying∆S on the prediction accuracy for the first artificial dataset for two different multi dimensional PDF modeling, using a T-copula (T-cop) and Gaus-sian copula (G-cop).#AIstands for the number of Atomic Interventions

S ˆE[Y|X2 =1,do(X∼ X|X2=0)] Error #A.I. #Failures

G-cop T-cop G-cop T-cop G-cop T-cop

20 173.9 N.A. 2.5% N.A. 43 0 (0%) 23 (53%)

30 173.9 N.A. 2.5% N.A. 32 0 (0%) 20 (63%)

50 175.7 N.A. 1.4% N.A. 17 0 (0%) 13 (76%)

70 175.7 N.A. 1.6% N.A. 13 0 (0%) 10 (77%)

100 174.0 N.A. 2.3% N.A. 9 0 (0%) 8 (89%)

Remarks It should be noticed that, despite the apparent symmetry between the prediction of Y conditionally to X2 = 1 if we perform an intervention on X where we fix its distribution to the one of X|X2 = 0 and the one the prediction of the value of Y conditionally to X2 = 0 if we perform an intervention on X where we fix its distribution to the one ofX|X2 =1, the problematic is different.

From an external point of view, looking at the completion time and success rate, Gaussian copulae are less data demanding. We generatedX2 by randomly re sampling the DNS parameters from our real dataset. Doing so, we have 80% of the samples whereX2=0is observed against 20% whereX2 =1is observed.

This second scenario shows the sensitivity of our approach to resource limita-tion. Even in this “optimal scenario” where both conditional PDF of X|X2 = 0 andX|X2 =1have the same domains, the shortage of data for the second con-ditional PDF prevents us from using a model which should be more accurate