Completion time - KCI+bootstrap parameterization

3.3 KCI+bootstrap parameterization

3.3.1 Completion time

We first discuss computation time constraints.

KCI test completion time

In a first experiment, we record the time it takes to complete an independence test, for different dataset sizes and different conditioning set sizes, with the implementation of the KCI test from [ZPJS12].²

Figure 3.1: Evolution of the completion time of the KCI test as a function of the number of samples for different conditioning set sizes. Panels include (b) conditioning set of size 1 and (d) conditioning set of size 3.

The results are shown in Figure 3.1. All graphs show that the completion time seems to increase faster than linearly with the number of samples.

In practice, for sample sizes larger than 400, the completion time starts to increase very rapidly with the number of samples. Together with the observation that no important improvement is obtained for sample sizes larger than 400 (see Section 3.3.2), this motivated our choice of the value 400 for N. Additionally, in Figure 3.2 and Figure 3.3, we present the result of approximating the completion time with a polynomial of degree 3. Figure 3.2 shows that a polynomial of degree 3 gives a rather good approximation of the completion times of the different tests for different sample sizes and different conditioning set sizes. In Figure 3.3, the same plots are repeated on a semi-logarithmic scale.
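The degree-3 approximation can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit, which the text does not specify, and it uses made-up timings generated from an arbitrary cubic law, not the measurements behind Figure 3.2. The function names `fit_cubic` and `predict` are hypothetical. In practice a library routine such as `numpy.polyfit` would normally be used instead of the hand-written solver.

```python
def fit_cubic(sample_sizes, times):
    """Least-squares fit of t ~ c0 + c1*x + c2*x^2 + c3*x^3, with x = n / max(n).

    Rescaling n to [0, 1] keeps the normal equations well conditioned.
    """
    scale = float(max(sample_sizes))
    xs = [n / scale for n in sample_sizes]
    # Normal equations A c = b for the monomial basis 1, x, x^2, x^3.
    A = [[sum(x ** (i + j) for x in xs) for j in range(4)] for i in range(4)]
    b = [sum(t * x ** i for x, t in zip(xs, times)) for i in range(4)]
    # Gaussian elimination with partial pivoting.
    for col in range(4):
        pivot = max(range(col, 4), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 4):
            factor = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * 4
    for i in range(3, -1, -1):
        rest = sum(A[i][j] * coeffs[j] for j in range(i + 1, 4))
        coeffs[i] = (b[i] - rest) / A[i][i]
    return coeffs, scale


def predict(coeffs, scale, n):
    """Evaluate the fitted polynomial at sample size n."""
    x = n / scale
    return sum(c * x ** i for i, c in enumerate(coeffs))


# Hypothetical timings in seconds, generated from an arbitrary cubic law.
sample_sizes = list(range(100, 1001, 100))
times = [0.5 + 0.002 * n + 2e-8 * n ** 3 for n in sample_sizes]

coeffs, scale = fit_cubic(sample_sizes, times)
for n in (200, 400, 800):
    print(n, round(predict(coeffs, scale, n), 3))
```

With real measurements, comparing `predict` against the recorded times at each sample size gives a quick check of how good the cubic approximation is.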

While the results presented in these figures are limited, the observations are consistent with the hypothesis of a superlinear evolution of the completion time of the KCI test as a function of the number of samples.

Influence of bootstrap parameterization on completion time

Taking as an example the system studied in Section 5.2, where we model the causal influence of 11 parameters on the throughput of a client downloading objects from an FTP server, we estimate the time it would take for different parameterizations of our bootstrap method to infer the corresponding causal model.

² Note that this completion time was obtained with MATLAB 2009b; the same trends still hold for the latest release of MATLAB.



Figure 3.2: Approximation of the KCI completion time with a polynomial of degree 3

Figure 3.3: Approximation of the KCI completion time with a polynomial of degree 3, Y log scale

To do so, we first use the PC algorithm with the Z-Fisher criterion and count the number of independences that were tested for each conditioning set size.

We then measure, for different values of l and N, the time that it takes to test one independence for a conditioning set size of 0, 1, 2, ... up to the maximum conditioning set size observed in the tests performed by the PC algorithm when the Z-Fisher criterion is used. We can then obtain a rough estimate of the time it would have taken the PC algorithm to infer a causal model for the same scenario had it used the KCI+bootstrap method, for different values of l and N. The results are presented in Table 3.1.
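The estimate described above amounts to a weighted sum: for each conditioning set size, the number of independences tested by the PC algorithm is multiplied by the measured time of one KCI+bootstrap test of that size, for a given pair (l, N). The sketch below illustrates this arithmetic only. The function name `estimate_total_time` and all counts and timings are hypothetical and are not the values behind Table 3.1.

```python
def estimate_total_time(tests_per_size, seconds_per_test):
    """Rough total time for one (l, N) setting.

    tests_per_size:   {conditioning set size: number of independences tested}
    seconds_per_test: {conditioning set size: measured time of one test, in seconds}
    """
    missing = set(tests_per_size) - set(seconds_per_test)
    if missing:
        raise ValueError(f"no timing for conditioning set sizes {sorted(missing)}")
    return sum(count * seconds_per_test[size]
               for size, count in tests_per_size.items())


# Hypothetical counts from a PC run and hypothetical per-test timings.
tests_per_size = {0: 55, 1: 120, 2: 60}
seconds_per_test = {0: 10.0, 1: 12.0, 2: 15.0}

total_seconds = estimate_total_time(tests_per_size, seconds_per_test)
print(f"estimated total: {total_seconds / 3600:.2f} hours")
```

Repeating the computation with the timings measured for each pair (l, N) would give one estimate per parameterization.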

Note that the different estimations had to be run in parallel on different machines with slightly different performance: 12 Central Processing Units (CPUs), 2260 MHz, 30 GB or 8 CPUs, 1862 MHz, 15 GB (only one or two CPUs were used during the tests, so the difference in the number of CPUs is not important). The use of different machines explains small variations in the estimates (for example, that the completion time estimate for {N = 200, l = 1000} is found to be smaller than for {N = 100, l = 1000}), but this does not affect the main trends.

Table 3.1 highlights important time constraints that limit the number of sub-tests in the bootstrap. Indeed, for N = 400, computation times are of the order of days and reach almost a year for l = 1,000. It should be noted, however, that one very important advantage of the bootstrap method (in addition to detecting independences in cases where the basic KCI test would fail, see Section 3.2.2) is that it makes the computation very easy to parallelize. Indeed, when using the bootstrap method, each test on a re-sampled dataset can simply be performed on a different machine. By using M machines, we can then shrink the completion time by a factor proportional to the number of machines used.
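The effect of distributing the sub-tests can be sketched with an idealized model. It assumes that all sub-tests take the same time, that the machines are identical, and that there is no scheduling or communication overhead; none of these assumptions comes from the measurements above. The function name `parallel_wall_clock` and the numbers are hypothetical.

```python
import math


def parallel_wall_clock(n_subtests, seconds_per_subtest, n_machines):
    """Idealized wall-clock time for n_subtests spread over n_machines.

    Sub-tests run in rounds of at most n_machines at a time, so the
    speed-up equals n_machines only when n_subtests is a multiple of it.
    """
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    rounds = math.ceil(n_subtests / n_machines)
    return rounds * seconds_per_subtest


# Hypothetical setting: l = 1000 sub-tests of 30 s each.
sequential = parallel_wall_clock(1000, 30.0, 1)
parallel = parallel_wall_clock(1000, 30.0, 50)
print(f"sequential: {sequential:.0f} s, with 50 machines: {parallel:.0f} s")
```

Under this model, 120 sub-tests on 50 machines need three rounds, so the speed-up is 40 rather than 50. Real gains are typically somewhat lower still once machine heterogeneity and job dispatch are taken into account.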

As it is now very common for a university or a company to work with a cluster of machines (virtual or not), this is a very important advantage. For instance, at our university, we have around M = 50 machines. For N = 400, this reduces the computation time from 33 days to 18 hours. However, this still limits the number of re-samplings. For instance, for N = 400 and l = 1,000, the algorithm would still take 5 days. While 5 days could still seem an acceptable time (although it can become unmanageable with fewer resources or if tests need to be run several times), the important point is that it corresponds to a very large increase compared to 18 hours. In the model used for this estimation we consider only 12 parameters, but this difference will rapidly become difficult to handle as soon as more complex systems, with more parameters and independences to test, are studied. For example, in the study of the impact of choosing one DNS service instead of another on a CDN user experience (Chapter 6), our model consists of 19 parameters.

From the document A causal approach to the study of telecommunication networks (pages 76-79)