

4.4 The study of Dropbox at the packet level

4.4.1 Presentation

We present a new characterization of Dropbox upload traffic, focusing our study on the packet level. Working at a lower level gives us a better understanding and modeling of the Dropbox protocol. We define metrics showing that Dropbox implements its own congestion control algorithm, which adapts the data exchanges to the traffic conditions (congestion, loss).

The upload of a file is done in one connection, which is divided into periods of transfer and periods of silence. We call a transfer period a Dropbox Chunk (DC, or chunk). The size of a Dropbox chunk is fixed (around 4 MB). We call the silence periods between two Dropbox chunk transmissions Inter Chunk Periods (ICP). If the file is too big there may be several connections, as was the case in the previous study where a 900 MB file was uploaded. In this study, we upload files small enough to be uploaded in one connection, which does not prevent generalizing the conclusions drawn from this study to bigger file uploads.

When studying the behavior of TCP packet exchanges during the transmission of the Dropbox chunks, we can see another subdivision which is not present in [DMMM+12]. We can observe, inside a chunk, a succession of transmission periods and silent periods during which no data is transmitted. In the same way we defined Dropbox chunks, we define flights as the periods, inside a chunk, during which data is being transmitted. We will simply name inter-flights the periods between the transmissions of two consecutive flights, where no data is exchanged between the client and the server.

4.4.2 Observations of the Dropbox traffic

Our study represents a 24-hour experiment. Each hour we upload a 200 MB file to Dropbox over a 100 Mb/s Ethernet connection from EURECOM. For this study, we use some of the metrics defined in Section 4.3.

Studying the performance of the Dropbox application at the chunk level (or connection level) requires understanding the mechanisms used by the Dropbox protocol to control the sending rate. We therefore focus on the per-flight performance.

Therefore, we first divide the connections into chunks and then each chunk into flights. The Dropbox chunk detection is done by detecting the Dropbox application acknowledgement, which signals the end and the good reception of the Dropbox chunk by the server. The detection of packet flights is done by observing the periods of silence between the transmissions of groups of packets and fixing a threshold based on the measured RTT. After a given time during which no packet is sent, we consider that the next packet transmission we observe belongs to a new flight.
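The flight segmentation described above can be sketched as follows. The input format and the threshold factor are illustrative assumptions, not the exact procedure used in our measurement tools.

```python
def split_into_flights(packet_times, rtt, factor=1.5):
    """Group sorted packet timestamps (seconds) into flights.

    A new flight starts whenever the silence between two consecutive
    packets exceeds a threshold derived from the measured RTT
    (here, an assumed `factor * rtt`).
    """
    threshold = factor * rtt
    flights = []
    current = [packet_times[0]]
    for t in packet_times[1:]:
        if t - current[-1] > threshold:
            flights.append(current)  # silence detected: close the flight
            current = [t]
        else:
            current.append(t)
    flights.append(current)
    return flights

# Example: with an RTT of 100 ms, a 330 ms gap starts a new flight.
flights = split_into_flights([0.00, 0.01, 0.02, 0.35, 0.36], rtt=0.100)
# flights == [[0.00, 0.01, 0.02], [0.35, 0.36]]
```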

For each flight we record the following metrics:

Beginning: Time of the first packet of the flight

Ending: Time of the last packet of the flight

Duration: Difference between the ending and the beginning

RTT: Time elapsed between the sending of a data packet by the client and the reception of the corresponding acknowledgement from the server, averaged over the packets of the flight

Loss: Number of retransmitted segments divided by the total number of segments sent during the flight

Size: Sum of the bytes of all data segments belonging to the flight

Inter-flight: Difference between the beginning of this flight and the ending of the previous one (zero for the first flight)

Throughput: Average throughput obtained along the transmission of the flight

Chunk id: The chunk this flight belongs to
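The per-flight record described by this list can be organized as a small data structure; the field names and units are illustrative, and the derived fields follow the definitions above.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    beginning: float      # time of the first packet (s)
    ending: float         # time of the last packet (s)
    rtt: float            # mean per-packet RTT over the flight (s)
    loss: float           # retransmitted segments / total segments
    size: int             # bytes of data segments in the flight
    inter_flight: float   # gap to the previous flight (0 for the first)
    chunk_id: int         # Dropbox chunk this flight belongs to

    @property
    def duration(self):
        # difference between ending and beginning
        return self.ending - self.beginning

    @property
    def throughput(self):
        # average throughput over the flight, in bits per second
        return 8 * self.size / self.duration

f = Flight(beginning=0.0, ending=0.05, rtt=0.1, loss=0.0,
           size=500_000, inter_flight=0.08, chunk_id=0)
# f.duration == 0.05 s and f.throughput == 80 Mbit/s
```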

Table 4.2 summarizes the different observations of the metrics we selected to model the per-flight throughput of the Dropbox traffic. Our dataset consists of 627 samples. Each sample represents a flight period during which the different parameters are measured. In Table 4.2 we present the average (Avg), minimum (Min), maximum (Max), standard deviation (Std) and coefficient of variation (CoV = Std/Avg) computed over the 627 flight periods.
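The summary statistics of Table 4.2 can be reproduced from the per-flight samples as follows; the sample values below are made up for illustration.

```python
import numpy as np

# hypothetical per-flight throughput samples (Mbps)
samples = np.array([0.9, 1.0, 1.1, 1.2, 0.8])

avg = samples.mean()
std = samples.std(ddof=1)   # Std = sqrt(Var), sample standard deviation
cov = std / avg             # coefficient of variation, CoV = Std/Avg

print(avg, samples.min(), samples.max(), std, cov)
```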

4.4.3 Results

PC algorithm

Figure 4.4 presents the causal model inferred by the PC algorithm for two different levels of significance used in the different tests performed during the inference of the Bayesian graph.

Figure 4.4a presents the causal model inferred by the PC algorithm for a significance level of 0.05. As mentioned in the previous section, lower values of the significance level in the independence tests result in sparser graphs. The Bayesian graph representing the causal model of our system when testing the parameter independences with α = 0.05 presents few dependencies between the parameters of our model. We can again hypothesize that the values of the parameters we observe do not present enough variability to detect the parameter dependences at such a significance level.

Table 4.2: Summary of the different metrics at the flight level for the hourly Dropbox uploads, with their average value (Avg), minimum value (Min), maximum value (Max), standard deviation (Std = √Var, where Var is the variance) and coefficient of variation (CoV = Std/Avg)

Parameter            Avg    Min    Max    Std    CoV
Duration (s)         0.05   0.02   0.91   0.06   1.1
Size (MB)            4400   1100   4500   410    0.09
RTT (ms)             100    98     110    2.7    0.03
Rwin (kB)            550    280    1400   150    0.3
Loss (%)             0.20   0      72     3.0    18
Inter-flight (s)     0.08   0.06   0.14   0.01   0.1
Throughput (Mbps)    1.0    0.12   2.12   0.3    0.3

Conversely, Figure 4.4b presents the causal model inferred by the PC algorithm for a significance level of 0.1. Higher values of the significance level used in the independence tests result in more connected graphs. However, the Bayesian graph in Figure 4.4b exhibits counter-intuitive dependencies, such as the throughput being modeled as a parent of the RTT.

In our study, the system we observe does not present very important variations (see Table 4.2). The fact that the Z-Fisher criterion relies on wrong linearity and normality assumptions might be less visible in the case where the dependencies are difficult to detect and a low value of α is used. However, the inadequacy of the Z-Fisher criterion appears more clearly when we increase the value of α and more dependencies are detected.
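A minimal sketch of the Z-Fisher conditional-independence test may clarify why it relies on these assumptions: it tests whether a partial correlation is zero, which characterizes conditional independence only for jointly Gaussian, linearly related variables. The helper below is an illustration, not the implementation used in our experiments.

```python
import numpy as np
from math import erf, sqrt

def _norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def fisher_z_test(data, i, j, cond=()):
    """p-value for the hypothesis that columns i and j of `data`
    are independent given the columns listed in `cond`."""
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)   # partial correlations live in the precision matrix
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    n = data.shape[0]
    z = 0.5 * np.log((1 + r) / (1 - r))          # Fisher's z-transform of r
    stat = np.sqrt(n - len(cond) - 3) * abs(z)
    return 2 * (1 - _norm_cdf(stat))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 0.1 * rng.normal(size=1000)   # strongly dependent on x
z_var = rng.normal(size=1000)         # independent of x and y
data = np.column_stack([x, y, z_var])
# fisher_z_test(data, 0, 1) is essentially zero (dependence detected),
# while fisher_z_test(data, 0, 2) is typically well above any usual alpha.
```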

The models we obtain with the PC algorithm are not exploitable for two reasons: i) their simplicity suggests that some important dependencies are missing; ii) when we increase the confidence level of the independence tests to detect weaker dependencies, we obtain inconsistent results, both when comparing the model obtained with α = 0.1 to the one obtained with α = 0.05 and when interpreting the dependencies obtained with a higher level of significance in light of our understanding of telecommunication networks.
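For reference, the skeleton phase of the PC algorithm, where edges are pruned through conditional-independence tests at level α, can be sketched as follows. Here `ci_test` is an assumed callable returning a p-value (e.g. a Z-Fisher test), and edge orientation is omitted.

```python
from itertools import combinations

def pc_skeleton(n_vars, ci_test, alpha=0.05):
    """Skeleton phase of PC: start from the complete graph and remove the
    edge i--j when some conditioning set S drawn from the current
    neighbours makes i and j conditionally independent at level alpha."""
    adj = {i: set(range(n_vars)) - {i} for i in range(n_vars)}
    depth = 0
    while any(len(adj[i]) - 1 >= depth for i in adj):
        for i in range(n_vars):
            for j in sorted(adj[i]):
                others = adj[i] - {j}
                if len(others) < depth:
                    continue
                for S in combinations(sorted(others), depth):
                    if ci_test(i, j, S) > alpha:  # independence accepted
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        depth += 1
    return adj

def toy_ci(i, j, S):
    # hand-crafted p-values for illustration: variables 0 and 2 are
    # independent, every other pair is strongly dependent
    return 0.5 if {i, j} == {0, 2} else 1e-6

skeleton = pc_skeleton(3, toy_ci)
# skeleton == {0: {1}, 1: {0, 2}, 2: {1}}: the edge 0--2 is removed
```

Note that a larger α makes the removal condition `p > alpha` harder to satisfy, which is why higher significance levels yield more connected graphs, as observed above.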



Figure 4.4: Causal models of the per-flight Dropbox performance. Models inferred with the PC algorithm for different independence test significance levels, α: (a) α = 0.05; (b) α = 0.1. (Graph nodes: duration, tput, size, rtt, rwin, loss, interflight.)

kPC algorithm

Figure 4.5 presents the causal model inferred by the kPC algorithm for an independence test significance level of 0.05 (Figure 4.5a) and 0.1 (Figure 4.5b). We can observe that, for both levels of significance, the Bayesian graphs we obtain are very sparse; we cannot draw any interesting knowledge from these models.

These results support the hypothesis made previously concerning the lack of variation of the observed parameters, which makes their dependencies difficult to detect.

In the case of the kPC algorithm, the HSIC is used to test the parameter independences and should perform better than the Z-Fisher criterion. However, we could observe that the implementation of the HSIC present in kPC performs very poorly. On the other hand, we are trying to model the performance of TCP in flights of packets, which are part of Dropbox chunks, themselves part of a user connection. The detection of dependencies at this level is expected to be more complex than when we study application performance at the connection level directly, as is the case for the studies presented in Chapter 5 and Chapter 6.
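A (biased) HSIC statistic with Gaussian kernels can be sketched as follows, to illustrate the kernel-based test that kPC substitutes for the Z-Fisher criterion; this is an illustrative implementation, not the one bundled with kPC.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # Gram matrix of the Gaussian kernel k(a, b) = exp(-(a-b)^2 / (2 sigma^2))
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # biased HSIC estimator: trace(K H L H) / n^2, H the centering matrix
    n = len(x)
    K = gaussian_gram(np.asarray(x, dtype=float), sigma)
    L = gaussian_gram(np.asarray(y, dtype=float), sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=200)
dep = hsic(x, x ** 2)                    # dependent (nonlinear) pair
indep = hsic(x, rng.normal(size=200))    # independent pair
# dep is noticeably larger than indep; a permutation test would turn the
# statistic into a p-value, which we omit here.
```

Unlike the partial-correlation test, HSIC can detect the nonlinear dependence between x and x², which a linear correlation would miss.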

4.4.4 Concluding remarks

We would like to mention, first, that the first attempt to model the TCP performance at the connection level did not give good results. As the Dropbox protocol divides connections into silent and transmission periods, if we observe the performance by averaging parameter values over the connection (like the throughput), too many parameters and mechanisms are absent from the model (see the remarks made on the relationship between a parameter and the mechanisms it represents in Section 1.3.4).

Figure 4.5: Causal models of the per-flight Dropbox performance. Models inferred with the kPC algorithm for different independence test significance levels, α: (a) α = 0.05; (b) α = 0.1. (Graph nodes: duration, size, loss, rtt, rwin, tput, interflight.)

In this second study, we tried to go deeper into the understanding of the Dropbox mechanisms by defining, first, Dropbox chunks, based on the Dropbox protocol, and, then, packet flights. Such a subdivision tried to remove the impact of the Dropbox protocol on the TCP performance, so as to study the impact of the network only. Unfortunately, this subdivision does not solve the issues met in the previous study and, worse, highlights additional problems.

First, the inter-flight time, which is controlled by Dropbox, is a parameter that has an important influence on the performance when studied at the flight level. Some tests where we observed the Dropbox traffic when uploading files from a public area, using a hotspot access, showed that Dropbox, in the case of a loss event, seems to implement its own recovery mechanisms. Additional studies where we observed the Dropbox traffic when uploading files from Boston, US, showed that the inter-flight time is 10 times smaller for an RTT whose value is half the one we observed in the previous two studies. Such observations suggest that our model misses important parameters to capture the mechanisms implemented by the Dropbox application to control the performance of its users.

Second, by studying performance at a lower level (the per-flight performance), the study of the dependencies becomes more complex while the variation of the observed parameters becomes smaller.


Finally, the global approach of “reverse-engineering” the Dropbox protocol to define the parameters that will capture its mechanisms is somewhat necessary to define the parameter set describing our system, but this is not the final goal of a causal study. By increasing the granularity of our study, we obtain models that are not exploitable for predicting interventions or for explaining the causal dependencies underlying the system performance. One important notion to keep in mind in a causal study is that of determinism and its violation of the faithfulness assumption. Therefore, in addition to studying mechanisms that might not be in line with our performance study, we might be trying to model deterministic mechanisms.