Step 2 : Identification of Correlated Time Intervals

5.4 Macro Attack Event Approach

5.4.2 Correlation Sliding Window

5.4.2.2 Step 2 : Identification of Correlated Time Intervals

Given a threshold δ and a correlation vector C of two time series Φ and Ψ, our goal in this step is to identify time intervals where these two time series are considered as being correlated. In other words, we try to identify the correlated pair Pi ={start, stop,Φ,Ψ}. To this end, we have tested the following three techniques, namely strict policy, tolerant policy, and tolerant policy with adjustment. They are described as follows :

– Tolerant policy : We call C^′ the vector that holds the correlation status of Φ and Ψ. At first, we assign C^′(i) = 0, ∀i ∈[1, T]. Then, for each C[k]≥δ, we update the status of C^′ in the time interval [k, k+L−1] to 1, (C^′(i) = 1, ∀i ∈ [k, k+L−1]). Φ and Ψ are considered as being correlated in the time interval[a, b]when C^′(i) = 1, ∀i∈[a, b].

An important parameter in our procedure is the choice of the threshold δ to declare that two time series are correlated. Again, we rely on experience, i.e., visual inspection of a lot of cases for different values of δ, to choose our threshold. We end up having a threshold of 0.7.

Figure 5.7 represents the evolution of two observed cluster time series from day 1 to day 100. The dashed curve represents the correlation value (vector C), the threshold curve (dotted curve) intersects it at two points t1, and t2 (corresponding to X=9 and X=40, respectively). In this specific example, we use the sliding window of size 30. In thetolerant policy, we define these two time series as correlated from day 9 to day 69. From a statistical viewpoint, this is totally correct since the two time series are indeed similar, w.r.t the correlation value calculated, during this period. However, when looking at the network activity we see that this is due mostly (if not all) to the time interval from day 38 to day 40 during which most of the important synchronized activities happen.

– Strict policy : In contrast to the previous case, we first assign C^′(i) = 1, ∀i∈[1, T]. Then for each correlated valueC[k]< δ, we update the status ofC^′ in the time interval[k, k+L−1]to zero(C^′(i) = 0, ∀i∈[k, k+L−1]).

0 20 40 60 80 100 120

Figure 5.7 – Thetolerant policy says that the two time series are correlated from day 9 to day 69 whereas thestrict policy says from day 38 to day 40

ΦandΨare then considered as being correlated in the time interval[a, b]when C^′(i) = 1 ∀i∈[a, b].

Applying the strict policy to the same example in Figure 5.7, we obtain the correlated time interval from day 38 to day 40, which is much more precise than what is provided by the tolerant policy. However, the strict policy faces another kind of error. For instance, Figure 5.8 represents the same example as on Figure 5.7, except that there are peaks of activities from day 60 to day 63 on the sole time series A. Due to the interference of these activities, the correlation values degrade significantly. This time, the threshold curve (dashed curve) intersects the correlation value curve (dashed dotted) at two points t1(X=9), and t2(X=33). As a consequence, strict policy can not discover the correlated time interval from day 38 to day 40 (note that our sliding window’s length is always equal to 30). We use the term “dismissal problem” to indicate this kind of error.

– Tolerant policy with adjustment :as discussed earlier, the tolerant policy suffers from some inaccuracy when detecting the correlated time interval and the strict policy faces the dismissal problem when identifying the correlated periods of two time series. We want a more robust technique to achieve this goal. Our solution is, first, to apply the tolerant policy to get the temporal correlated time intervals, then we apply the post-processing step to eliminate the potential inaccuracies caused by the tolerant policy technique. We present hereafter three steps to do the post-processing techniques.

1. Recognition of type of correlation :Suppose that after applying the tolerant policy, we identify that Φand Ψare correlated during[a, b]. Our

50 5. ON THE IDENTIFICATION OF ATTACK EVENTS

Figure 5.8 – The strict policy can not discover the correlation period in this case

0 10 20 30 40 50 60 70 80 90

Figure 5.9 – Example of long correlation

observations show that the correlation ofΦ and Ψ can be classified into one of the following two categories :

– Short correlation : The correlation of two time series within the time interval[a, b]is mostly caused by some synchronized peaks of activities.

Besides that, the two time series are not correlated at all. Figure 5.7 offers an example illustrating this case.

– Long correlation : Two time series are synchronized at almost every point in the time interval [a, b]. Figure 5.9 is an example illustrating this case.

10 20 30 40 50 60 70 80 90 100 0

50 100 150 200

Time(day)

Number of sources

cluster time series A cluster time series B

Figure 5.10 – Correlation detection of two time series 1 and 2. The strict policy says that the two time series are correlated from day 59 to day 63 whereas the tolerant policy says from day 29 to day 93

Since the inaccuracy occurs mostly in the case of short correlation, we apply the post-processing technique only to this family. We have tested the two following techniques to recognize the type of correlation of two time series in the period [a, b].

– As mentioned earlier, in the case of short correlation, the correlation is caused mostly by the few synchronized peaks of activities. To recognize whether such peaks exist in a time series, we apply the technique as pre-sented in Section 5.3. If there is a peak in the time series, we conclude that it belongs to the short correlation. Otherwise, we classify it into the long correlation family.

0 10 20 30 40 50

0 50 100 150 200

Time(Year)

Number of sources

cluster time series A

threshold

Figure 5.11 – Peak Peaking

– With the same arguments as before, we remove the most important

52 5. ON THE IDENTIFICATION OF ATTACK EVENTS data points from each time series. To recognize their type of correla-tion, we compute the correlation coefficient between the two residual time series. If the correlation computation yields a high value, we flag the correlation as long correlation. Otherwise, we class it into short correlation family.

cluster t.s 15610 on platform 4 cluster t.s 15611 on platform 9

Peaks of activities spanning

cluster t.s 15610 on platform 4 threshold

a) (b)

Figure5.12 – (a) temporal correlated time interval b) peak picking technique misses the second peak

2. Determination of potential correlated time interval :Suppose that now we have two time seriesΦandΨidentified as being shortly correlated during[a, b]. As a remainder, the short correlation is due to the outliers existing in the time series. Our task now is to separate these peaks of activities on each time series during[a, b]. To identify the peaks, we can apply the peak picking technique presented in Section 5.3. We compare the data points of the attack trace with a threshold computed as a certain factor of the average of the population. Any data point greater than threshold is considered as a peak. Experiment shows that it works well especially when the peaks expand in a short time interval. For instance, Figure 5.11 shows an example of this technique. The threshold curve intersects with the time series curve at three places, the time series has three outliers. However, this approach does not give the expected results, neither when the important activities span on several days (not long enough though to make it belong to thelong correlation class), nor when there are peaks of distinct heights, as in Figure 5.12a. In that specific example, there are two correlated peaks of activities of cluster 15611 and 15610 on platforms 4 and 9. There is almost no activity anywhere else during the time interval from day 60 to day 150 for these two clusters.

Figure 5.12b represents the peak picking technique based on the average of the population. As we can observe, the method misses the second peak of activities since its values are less than the threshold. The reason for this is that the high peaks make the average value important.

We have observed that, in the case of short correlation, the total spanning lengths, in term of number of days, of the important peaks of activities

is much smaller than the temporal correlated time interval identified by the tolerant policy. To be sure that we can detect all the correlated time intervals, we choose a very conservative threshold of 40^th percentile³ By doing so, the threshold does not depend on the outlier data points as far as their total spanning life is less than 60%. Since the40^this a conservative threshold, the amount of detected peaks of important activities will be greater than its real value. We show how to mitigate them in the next step. We use the peak picking algorithm represented in Algorithm1 and we give hereafter an example of the output of this function.

I_X = 000011100001100001110000 I_X(i) =

½ 0 if the i^th element of X is below the threshold 1 if the i^th element of X is above the threshold

In this particular example, we have three peaks of activities detected.

The first one starts from position 5 to position 7, the second one starts from position 12 to position 13, the third one starts from position 18 to position 20.

3. Identification of correlated time intervals :We identify all the per-iods where the elements from two time series are greater than the thre-shold. By doing so, we will reduce an important amount of wrongly detec-ted peaks. For instance, let’s consider the following outputs of the peak picking technique :

As said earlier, there are three peaks (from position 5 to position 7, from 12 to 13, and from 18 to 20) in the time series X. And we have two peaks in the time series Y (from position 5 to position 7, and from position 18 to position 20). As a result, we have two correlated periodsP1 ={5,7, X, Y} and P₂ = {18,20, X, Y}. Note that, the second peak (from position 12 to position 13) from the time series X is not considered as there is not corresponding period on time seriesY.

Application of the above procedure to all the pairs of time series leads to the iden-tification of a set of correlated pairs over different periods of timeP_i = (start, stop,Φ,Ψ).

Figure 5.13 illustrates the situation at the end of the first phase. In this illustrated case, the length T of the time series is equal to 9. A curve on the plot represents a correlated time interval of two time series, for instance, the plot shows that time series 1 and 2 (curve 1&2) are correlated from day 3 to day 6 or P₁ = (3,6,1,2), time series 7 and 8 (curve 7&8) are correlated from day 1 to day 8,P2 = (1,8,7,8) etc.

3. the40^th percentile of a time series X is the value at the position of 40% of the length of X on the sorted form of X

54 5. ON THE IDENTIFICATION OF ATTACK EVENTS

0 1 2 3 4 5 6 7 8 9

0 1 2 3 4

Dans le document Honeypot traces forensics by means of attack event identification (Page 86-92)