KCI and categorical data - A causal approach to the study of telecommunication networks

Test Z Functions σ_Z Error Terms

1 X∼ N(0,1) ZY==−53··XY proportional None

5·X^◦ proportional None

Z=−3 cot√

500·X^◦ disproportional None Z=−2· √

disproportional εX/Y ∼ N(0,0.1) Z=−300·Y

500·X^◦ disproportional None Z=−2· √

Table A.2: 16 scenarios for testing conditional independences with the Z-Fisher criterion and KCI test, for the case illustrated in Figure A.2b, X^◦ = X−min(X)+1, Y^◦ = Y−min(Y)+1.

A.2 Testing the KCI algorithm in the presence of categorical data

In Chapter 6, one of the parameters of our system, the DNS, is categorical. The KCI paper [ZPJS12] makes few mentions of categorical variables in the independence tests.

We designed a simple scenario to test the performance of the KCI (+bootstrap) in the presence of categorical variables. As in the study of the parameterization of the KCI+bootstrap procedure, in Section 3.3.2, we generate an artificial dataset. We generate 4 parameters whose dependencies follow the causal model presented in Figure A.3, and we test different independences that should be entailed by such a system.

168A. ADDITIONAL RESULTS AND STUDIES RELATED TO THE INDEPENDENCE TESTS

Test  X           Y           Z function                     σ_Z              Error terms
1     N(0,1)      N(0,1)      Z = 5∗X − 3∗Y                  proportional     None
2     N(0,1)      N(0,1)      Z = 5∗X − 3∗Y                  proportional     εZ ∼ N(0,0.1)
3     N(5,10)     N(20,100)   Z = 5∗X − 3∗Y                  disproportional  None
4     N(5,10)     N(20,100)   Z = 5∗X − 3∗Y                  disproportional  εZ ∼ N(0,0.1)
5     N(0,1)      N(0,1)      Z = −3∗√((X+Y)^◦)              proportional     None
6     N(0,1)      N(0,1)      Z = −3∗√((X+Y)^◦)              proportional     εZ ∼ N(0,0.1)
7     N(5,10)     N(20,100)   Z = −3∗√((X+Y)^◦)              disproportional  None
8     N(5,10)     N(20,100)   Z = −3∗√((X+Y)^◦)              disproportional  εZ ∼ N(0,0.1)
9     not normal  not normal  Z = 5∗X − 3∗Y                  proportional     None
10    not normal  not normal  Z = 5∗X − 3∗Y                  proportional     εZ ∼ N(0,0.1)
11    not normal  not normal  Z = −5∗X − 300∗Y               disproportional  None
12    not normal  not normal  Z = 5∗X − 300∗Y                disproportional  εZ ∼ N(0,0.1)
13    not normal  not normal  Z = −3∗√((X+Y)^◦)              proportional     None
14    not normal  not normal  Z = −3∗√((X+Y)^◦)              proportional     εZ ∼ N(0,0.1)
15    not normal  not normal  Z = −2∗√(3∗(5∗X+300∗Y)^◦)      disproportional  None
16    not normal  not normal  Z = −2∗√(3∗(5∗X+300∗Y)^◦)      disproportional  εZ ∼ N(0,0.1)

Table A.3: 16 scenarios for testing conditional independences with the Z-Fisher criterion and KCI test, for the case illustrated in Fig. A.2d, X^◦ = X−min(X)+1.
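As an illustration of the Z-Fisher criterion applied to the scenarios of Table A.3, the following sketch runs a Fisher Z test on (partial) correlations for scenario 2 (Z = 5∗X − 3∗Y plus noise). It is a minimal sketch and not the implementation used in this work: the function name fisher_z_pvalue, the sample size, the seed, and the reading of N(0,0.1) as mean 0 and standard deviation 0.1 are all assumptions.

```python
import math
import numpy as np

def fisher_z_pvalue(data, i, j, cond=()):
    """Two-sided p-value of the Fisher Z test of column i independent of column j given cond."""
    cols = [i, j, *cond]
    corr = np.corrcoef(data[:, cols], rowvar=False)
    prec = np.linalg.pinv(corr)  # precision matrix of the selected columns
    # Partial correlation of i and j given cond, read from the precision matrix.
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)  # keep the log below finite
    n = data.shape[0]
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))  # Fisher z-transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2.0))  # equals 2 * (1 - Phi(stat))

# Scenario 2 of Table A.3 (assumed: 0.1 is the standard deviation of the noise).
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)
z = 5 * x - 3 * y + rng.normal(0.0, 0.1, n)
data = np.column_stack([x, y, z])

p_i1 = fisher_z_pvalue(data, 0, 1)             # I1: X independent of Y
p_i2 = fisher_z_pvalue(data, 0, 1, cond=(2,))  # I2: X independent of Y given Z
```

Since Z is generated from X and Y in this scenario, one would expect the marginal test (I1) not to reject independence and the conditional test (I2) to return a p-value close to 0.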

We generate two different datasets: one in which only one parameter, X1 in the model of Figure A.3, is categorical, and another in which two parameters, X1 and X3 in the model of Figure A.3, are categorical.

The two datasets, of 1000 samples each, are generated as follows:

A.2. KCI AND CATEGORICAL DATA 169

Test 500 1000 1500 2000 2500 3000

I1 I2 I1 I2 I1 I2 I1 I2 I1 I2 I1 I2

Table A.4: Results of the 16 conditional independence scenarios for the model in Figure A.2a, tested using the Z-Fisher criterion and KCI test for 6 different dataset sizes, averaged over 10 trials. I1 = X ⊥⊥ Y, I2 = X ⊥⊥ Y | Z.

Test 500 1000 1500 2000 2500 3000

I1 I2 I1 I2 I1 I2 I1 I2 I1 I2 I1 I2

Table A.5: Results of the 16 conditional independence scenarios for the model in Figure A.2b, tested using the Z-Fisher criterion and KCI test for 6 different dataset sizes, averaged over 10 trials. I1 = X ⊥⊥ Y, I2 = X ⊥⊥ Y | Z.

Test 500 1000 1500 2000 2500 3000

I1 I2 I1 I2 I1 I2 I1 I2 I1 I2 I1 I2

Table A.6: Results of the 16 tests for the model in Fig. A.2d, for 6 different dataset sizes, using the Fisher criterion and KCI test. I1 = X ⊥⊥ Y, I2 = X ⊥⊥ Y | Z.

Figure A.3: Causal model of 4 variables. Causal model corresponding to the causal dependencies of the 4 parameters of our artificial dataset, used to test the KCI performance in the presence of categorical data.

First dataset: ds1

• X1, categorical: randomly drawn, with replacement, from a pool of 4 values
• X2, continuous: discrete function of X1 (mapping)
• X3, continuous: discrete function of X1 (mapping)
• X4, continuous: square root function of X2 and X3

Second dataset: ds2

• X1, categorical: randomly drawn, with replacement, from a pool of 4 values
• X2, continuous: mapping according to the values of X1
• X3, categorical: drawn from a pool of 4 values, the probability of each category depending on the value of X1
• X4, continuous: square root function of X2 plus a coefficient depending on X3

We also add a noise term, ε_{X_·}, to every variable X_·, with a standard deviation equal to 20% of the standard deviation of the variable (σ_{ε_{X_·}} = 0.2·σ_{X_·}).
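The generation of the two datasets described above can be sketched as follows. This is a hedged illustration and not the original generation script: the mapping values, the category probabilities, the coefficients, and the names add_noise, make_ds1 and make_ds2 are hypothetical. The sketch also assumes that "square root function of X2 and X3" means √(X2 + X3) and that the noise is added to the continuous variables only.

```python
import numpy as np

def add_noise(values, rng, ratio=0.2):
    """Add Gaussian noise whose std is `ratio` times the std of `values`."""
    return values + rng.normal(0.0, ratio * values.std(), size=values.shape)

def make_ds1(n, rng):
    """ds1: only X1 is categorical."""
    x1 = rng.integers(0, 4, size=n)  # 4 categories, drawn with replacement
    # Hypothetical mappings, kept far from 0 so the square root stays defined.
    x2 = add_noise(np.array([50.0, 60.0, 70.0, 80.0])[x1], rng)
    x3 = add_noise(np.array([90.0, 55.0, 75.0, 65.0])[x1], rng)
    x4 = add_noise(np.sqrt(x2 + x3), rng)  # assumed form of the square root function
    return x1, x2, x3, x4

def make_ds2(n, rng):
    """ds2: X1 and X3 are categorical."""
    x1 = rng.integers(0, 4, size=n)
    x2 = add_noise(np.array([50.0, 60.0, 70.0, 80.0])[x1], rng)
    # Hypothetical P(X3 | X1): each row is the distribution of X3 for one X1 value.
    probs = np.full((4, 4), 0.1) + 0.6 * np.eye(4)
    x3 = np.array([rng.choice(4, p=probs[c]) for c in x1])
    # Hypothetical coefficient attached to each X3 category.
    x4 = add_noise(np.sqrt(x2) + np.array([0.0, 2.0, 4.0, 6.0])[x3], rng)
    return x1, x2, x3, x4

rng = np.random.default_rng(0)
ds1 = make_ds1(1000, rng)
ds2 = make_ds2(1000, rng)
```

In this sketch X4 is computed from the noisy values of X2 and X3, so that X1 carries no additional information about X4 once X2 and X3 are known, which is the independence the ground truth relies on.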

The results of the different independence tests are presented in Table A.7. We denote by p-value ds1 the median of the p-values obtained in the sub-tests of the KCI+bootstrap procedure, Section 3.2.2, for the tests made on the samples of the first dataset. Accordingly, we denote by p-value ds2 the median of the p-values obtained in the sub-tests of the KCI+bootstrap procedure for the tests made on the samples of the second dataset.
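The aggregation of sub-test p-values into a median can be sketched as below. This is only an illustration under stated assumptions: the number of sub-tests, the sub-sample size, and resampling with replacement are hypothetical choices, and corr_pvalue is a simple correlation-based stand-in used to keep the example self-contained. It is not the KCI test.

```python
import math
import numpy as np

def corr_pvalue(x, y):
    """Stand-in test (NOT the KCI): two-sided Fisher z test on the Pearson correlation."""
    r = float(np.clip(np.corrcoef(x, y)[0, 1], -0.999999, 0.999999))
    stat = math.sqrt(len(x) - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2.0))

def median_bootstrap_pvalue(x, y, test, n_subtests=100, subsample_size=400, seed=0):
    """Run `test` on resampled sub-datasets and return the median p-value."""
    rng = np.random.default_rng(seed)
    pvalues = []
    for _ in range(n_subtests):
        # Assumed resampling scheme: indices drawn with replacement.
        idx = rng.integers(0, len(x), size=subsample_size)
        pvalues.append(test(x[idx], y[idx]))
    return float(np.median(pvalues))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y_dep = x + rng.normal(0.0, 0.5, size=1000)  # depends on x
y_ind = rng.normal(size=1000)                # independent of x

p_dep = median_bootstrap_pvalue(x, y_dep, corr_pvalue)
p_ind = median_bootstrap_pvalue(x, y_ind, corr_pvalue)
```

Taking the median rather than a single p-value makes the reported value less sensitive to one unlucky sub-sample; any conditional independence test with the same call shape could be passed as `test`.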

We can observe that the results we obtain are consistent with the ground truth, which suggests that the KCI+bootstrap procedure performs correctly in the presence of categorical variables. Additionally, the results show that the KCI+bootstrap performs well when one of the two parameters between which the independence is tested is categorical (X1 ⊥⊥ X4 | {X2, X3} for the first dataset).

The results also show that this is still the case when one member of the conditioning set is categorical (X2 ⊥⊥ X3 | {X1} for the first dataset). Finally, with the second dataset, we can observe that the KCI+bootstrap procedure still performs correctly in a mixed case (X1 ⊥⊥ X4 | {X2, X3} for the second dataset).

Table A.7: Results of the different independence tests made with the KCI+bootstrap for the two artificial datasets

X   Y   Z       p-value ds1   p-value ds2
1   2   ∅       0             0
1   3   ∅       0             0
1   4   ∅       0             0
2   3   ∅       0             0
2   4   ∅       0             0
3   4   ∅       0             0
1   4   2       0             0
1   4   3       0             0
2   3   1       0.5           0.9
1   4   {2,3}   0.3           0.9
2   3   {1,4}   0             0


Appendix B

Different methods to estimate the distribution of the throughput post intervention in the DNS study

In the study of the impact of the DNS on the throughput (Chapter 6), we observed that the distribution of the intervention variable, X, conditional on GDNS (Google DNS), allowed us to predict the distribution of the throughput for LDNS (Local DNS) users. We also observed that the opposite, predicting the throughput of GDNS users for an intervention where X follows the distribution seen by the LDNS users, was not possible. In this appendix, we artificially generate datasets with parameters and dependencies simulating the situation we had to deal with in the DNS performance study. We vary the conditional distributions of the intervention parameter to minimize the number of values observed for both distributions (f(X|LDNS) and f(X|GDNS)), and we study the impact on the prediction accuracy. We also describe, in more detail, the method developed to predict counterfactuals.
