String distances and intrusion detection: bridging the gap between formal languages and computer security


DOI: 10.1051/ita:2006010


Danilo Bruschi¹ and Giovanni Pighizzini¹

Abstract. In this paper we analyze some intrusion detection strategies proposed in the literature and we show that they represent various facets of a well-known formal languages problem: computing the distance between a string x and a language L. In particular, the main differences among the various approaches adopted for building intrusion detection systems can be reduced to the characteristics of the language L and to the notion of distance adopted. As a further contribution we also show that, from the computational point of view, all these strategies are equivalent and are amenable to efficient parallelization.

Mathematics Subject Classification. 68M99, 68Q17, 68Q45.

Introduction

Intrusion detection systems (IDSs), initially introduced by Anderson [2] and Denning [7], are security devices that attempt to identify (in quasi real time) and isolate intrusions into computer systems. A very broad and generally adopted classification distinguishes between HIDS (Host Intrusion Detection Systems) and NIDS (Network Intrusion Detection Systems). Host-based IDSs mainly monitor operating system activities on specific hosts in order to detect intrusion attempts, while network-based IDSs examine network traffic.

Each category of IDS can be further divided into two subcategories on the basis of the mechanism adopted for detecting malicious activities. More precisely, we distinguish between signature-based IDSs (also referred to as misuse detection) and anomaly detection IDSs. A misuse detection IDS detects attacks as instances of attack signatures, i.e., sets of rules or filters which characterize a malicious event.

Anomaly detection instead focuses on normal system behavior rather than attack behavior, i.e., a normal behavior profile is created for each activity on the system that has to be monitored, and any deviation from such a profile is flagged as a potential attack. The misuse detection approach, adopted by almost all commercial products, enables a very accurate detection of known attacks, but it is ineffective against previously unseen attacks, and it can be bypassed by a slight variation of a known attack. On the other hand, anomaly detection can solve such a problem, but it is still not clear how to precisely define and capture the “normal” behavior profile of the activities to be monitored.

¹ Dipartimento di Informatica e Comunicazione, Università degli Studi di Milano, via Comelico 39, 20135 Milano, Italy; {bruschi,pighizzi}@dico.unimi.it

© EDP Sciences 2006

Article published by EDP Sciences and available at http://www.edpsciences.org/ita or http://dx.doi.org/10.1051/ita:2006010

In [9, 10] Forrest et al. introduced a methodology for building anomaly detection HIDS based on the observation that the behavior of any program can be characterized by the sequence of system calls it makes. The HIDS they proposed works as follows. Let P be the program to be monitored. During a training period, sequences of system calls are collected by executing P many times in a sterile environment. Meanwhile, characteristic patterns of the collected system calls are placed in a database D, which is subsequently used for intrusion detection.

To detect misbehavior of P, the following strategy is adopted. Any execution of P in a production environment is monitored, and its system calls are again collected in a string x. At runtime they are compared against the contents of the database D. The Hamming distance between x and the database content is computed; when this value is greater than a given threshold, an alarm is raised. It turns out that in such a context the problem of detecting computer intrusions is equivalent to computing the Hamming distance between a string and a regular language.
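The training and detection phases just described can be illustrated with a minimal sketch. It assumes that the "characteristic patterns" are all sliding windows of a fixed length n and that a trace is flagged when some window is farther than a threshold from every database entry; the function names, the window length and the threshold policy are illustrative assumptions, not details taken from [9, 10].

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]

def ngrams(trace: List[str], n: int) -> List[Window]:
    """Return all length-n sliding windows of a system-call trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def build_database(training_traces: Iterable[List[str]], n: int) -> Set[Window]:
    """Training phase: collect every window seen in the sterile runs."""
    db: Set[Window] = set()
    for trace in training_traces:
        db.update(ngrams(trace, n))
    return db

def hamming(a: Window, b: Window) -> int:
    """Number of positions where two equal-length windows differ."""
    return sum(1 for s, t in zip(a, b) if s != t)

def min_hamming_to_db(window: Window, db: Set[Window]) -> int:
    """Distance from one window to the closest database entry (db must be nonempty)."""
    return min(hamming(window, entry) for entry in db)

def is_anomalous(trace: List[str], db: Set[Window], n: int, threshold: int) -> bool:
    """Detection phase: alarm if some window is farther than the threshold."""
    return any(min_hamming_to_db(w, db) > threshold for w in ngrams(trace, n))
```

For instance, after training on two short traces that open, read and close a file, a trace containing an unexpected `exec` followed by `socket` yields windows at distance 2 from the database and is flagged for any threshold below 2.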

The idea of Forrest et al. has inspired many authors, who proposed several variants of the original model in order to obtain more efficient and more precise (i.e., able to recognize broader classes of intrusions) anomaly detection HIDS (see Sect. 2 for further details). In the following we will show that all these approaches are computationally equivalent. More precisely, they can be reduced to the problem of computing distances between strings and languages, using various notions of distance among those provided in the literature. As a further contribution we will show that, from the computational point of view, all the distance notions considered are equivalent and can be computed by efficient (in terms of classical complexity theory) parallel algorithms.

The paper is organized as follows. In Section 1 some preliminary notions about distances between languages and computational models are recalled. In Section 2 we recall the most prominent results about intrusion detection systems related to our work and we show the connections between the various notions of intrusion detection system and distances. In Section 3 we show that the problem of computing all the notions of distance considered can be efficiently parallelized. In Section 4 some final remarks are provided.

1. Preliminaries

Given a finite alphabet Σ, we denote by Σ^∗ the free monoid of strings over Σ, and by ε the empty string. Given a string x ∈ Σ^∗, we denote by |x| its length and by x_i the ith symbol of x, i = 1, . . . , |x|. We recall that a string z ∈ Σ^∗ is a subword or factor of x if and only if there exist two strings u, v ∈ Σ^∗ such that x = uzv. If u = ε (v = ε, respectively) then z is a prefix (suffix, resp.) of x.

1.1. Distances

A distance over Σ^∗ is a function d : Σ^∗ × Σ^∗ → N ∪ {∞} such that, for all x, y, z ∈ Σ^∗, the following hold:

i) d(x, y) = 0 if and only if x = y;

ii) d(x, y) = d(y, x); and

iii) d(x, y) ≤ d(x, z) + d(z, y).

Strings in Σ^∗ can be compared symbol by symbol. In particular, one can define the Hamming distance between x and y as the number of positions where the symbol of x is different from the symbol of y. More formally:

Definition 1.1. Given x, y ∈ Σ^∗, the Hamming distance between x and y, denoted by d_H(x, y), is defined as:

\[
d_H(x, y) =
\begin{cases}
\#\{\, i \mid x_i \neq y_i \,\} & \text{if } |x| = |y| \\
\infty & \text{otherwise.}
\end{cases}
\]
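As a small illustration of Definition 1.1, the following sketch computes the Hamming distance of two Python strings, returning `math.inf` for strings of different lengths. The function name is an illustrative choice.

```python
import math

def hamming_distance(x: str, y: str) -> float:
    """Hamming distance of Definition 1.1; infinite when the lengths differ."""
    if len(x) != len(y):
        return math.inf
    # Count the positions i where x_i differs from y_i.
    return sum(1 for a, b in zip(x, y) if a != b)
```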

The above definition of the Hamming distance is given by considering substitution operations transforming a string into another one. If we consider, besides substitutions, the other two edit operations of insertion and deletion, we get another notion of distance, called edit distance. More precisely, we consider the binary relation ⊢ over Σ^∗ defined by setting, for all x, y ∈ Σ^∗, x ⊢ y if and only if there exist u, v ∈ Σ^∗, a, b ∈ Σ, with a ≠ b, such that one of the following conditions is satisfied:

i) x = uav, y = ubv (substitution);

ii) x = uav, y = uv (deletion);

iii) x = uv, y = ubv (insertion).

Note that ⊢ is symmetric. As usual, we denote by ⊢^k the composition of ⊢ with itself, k times.

Definition 1.2. Given two strings x, y ∈ Σ^∗, the edit distance d_e(x, y) between x and y is the minimum number of edit operations which transform x into y, i.e.,

\[
d_e(x, y) = \min\{\, k \mid x \vdash^{k} y \,\}.
\]
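Definition 1.2 can be made concrete with the standard dynamic programming scheme for the unweighted edit distance between two strings. This textbook algorithm is given only as an illustration and is not taken from the paper; the function name is an assumption.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning x into y."""
    # prev[j] holds the edit distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty prefix of y
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # substitute a by b, free if equal
            ))
        prev = curr
    return prev[len(y)]
```

The table is filled row by row, so the sketch uses time proportional to |x|·|y| and space proportional to |y|.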


Two strings can also be compared by considering their longest common prefix, suffix and subword, respectively. This leads to the following notions:

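Definition 1.3. Given x, y ∈ Σ^∗, the prefix, suffix and subword distance between x and y, denoted respectively by d_p(x, y), d_s(x, y) and d_f(x, y), are defined as:

\[
\begin{aligned}
d_p(x, y) &= |x| + |y| - 2 \max\{\, |z| \mid x, y \in z\Sigma^{*} \,\} \\
d_s(x, y) &= |x| + |y| - 2 \max\{\, |z| \mid x, y \in \Sigma^{*}z \,\} \\
d_f(x, y) &= |x| + |y| - 2 \max\{\, |z| \mid x, y \in \Sigma^{*}z\Sigma^{*} \,\}.
\end{aligned}
\]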
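The three distances of Definition 1.3 can be sketched directly from the formulas: each one subtracts twice the length of the longest common prefix, suffix or subword from |x| + |y|. The function names are illustrative, and the quadratic-time computation of the longest common subword is one simple choice among several.

```python
def common_prefix_len(x: str, y: str) -> int:
    """Length of the longest common prefix of x and y."""
    n = 0
    for a, b in zip(x, y):
        if a != b:
            break
        n += 1
    return n

def common_subword_len(x: str, y: str) -> int:
    """Length of the longest common subword (factor) of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common subword ending just before both symbols.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def prefix_distance(x: str, y: str) -> int:
    return len(x) + len(y) - 2 * common_prefix_len(x, y)

def suffix_distance(x: str, y: str) -> int:
    # A common suffix of x and y is a common prefix of the reversed strings.
    return prefix_distance(x[::-1], y[::-1])

def subword_distance(x: str, y: str) -> int:
    return len(x) + len(y) - 2 * common_subword_len(x, y)
```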

Each distance over Σ^∗ can be extended to a distance between strings and languages. In particular, the distance between a string x and a language L is defined as the minimum of the distances between x and the strings belonging to L. Thus:

Definition 1.4. Let Σ be a finite alphabet, d : Σ^∗ × Σ^∗ → N ∪ {∞} a distance, x ∈ Σ^∗ a string, and L ⊆ Σ^∗ a language. The distance between x and L is defined as:

\[
d(x, L) = d(L, x) =
\begin{cases}
\min\{\, d(x, y) \mid y \in L \,\} & \text{if } L \neq \emptyset \\
\infty & \text{otherwise.}
\end{cases}
\]
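For a finite language given as an explicit collection of strings, Definition 1.4 can be evaluated by brute force, as in the sketch below. The restriction to finite languages and the function names are assumptions made for illustration; infinite languages need the machinery of Section 3.

```python
import math
from typing import Callable, Iterable

def hamming(x: str, y: str) -> float:
    """Hamming distance, infinite for strings of different lengths."""
    if len(x) != len(y):
        return math.inf
    return sum(1 for a, b in zip(x, y) if a != b)

def distance_to_language(x: str, language: Iterable[str],
                         d: Callable[[str, str], float]) -> float:
    """d(x, L) for a finite language L; infinite when L is empty."""
    return min((d(x, y) for y in language), default=math.inf)
```

Any of the distances defined above can be passed as `d`; the Hamming distance is included only to keep the example self-contained.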

We recall that it is also possible to define distances between languages in a standard way, by resorting to the Hausdorff distance. However, this notion is beyond the scope of this paper. For a discussion and results on this topic we refer the interested reader to [4].

Sometimes the edit distance is defined by considering a weight function associating a cost with each edit operation. Weighted versions can also be considered for the other distance notions used in the paper. Our results can be extended to such versions; however, for the sake of simplicity, here we consider only non-weighted versions.

1.2. Computational models and complexity classes

The model of parallel computation we will consider is represented by uniform families of circuits. More precisely, a family of circuits is a set {C_n | n ∈ N} where C_n is a circuit for inputs of size n. The family {C_n} is logspace uniform if the function n ↦ C_n is computable by a deterministic Turing machine in logarithmic space.

NC^k (AC^k, SAC^k, respectively) denotes the class of problems solvable by logspace uniform families of bounded (unbounded, semiunbounded¹) fan-in Boolean circuits of polynomial size and O(log^k n) depth.

Unbounded fan-in circuits are equivalent to parallel random access machines with both concurrent read and concurrent write (CRCW PRAM). In particular the class AC¹ coincides with the class of functions computed by CRCW PRAMs in O(log n) time, using a polynomial number of processors. For further notions of parallel computation and arithmetic circuits the reader is referred to Cook [6] and to Karp and Ramachandran [13].

¹ In a semiunbounded circuit the fan-in of the AND gates is bounded, while the fan-in of the OR gates is unbounded.

Figure 1. Complexity classes (diagram relating AC⁰, 1–L, LOGSPACE, 1–NL, CFL, NL, 1–NAuxPDA^p, LOGCFL = NAuxPDA^p = SAC¹, AC¹, NC and P = AuxPDA).

A one-way nondeterministic auxiliary pushdown automaton (1–NAuxPDA) (see Brandenburg [3]) is a nondeterministic Turing machine having a one-way, end-marked, read-only input tape, a pushdown tape, and a two-way, read/write work tape with a logarithmic space bound. (For more formal definitions see, e.g., Hopcroft and Ullman [12].)

We recall that, without any bound on the running time, two-way nondeterministic and deterministic auxiliary pushdown automata are equivalent, and they characterize the class P (Cook [5]). On the other hand, two-way nondeterministic auxiliary pushdown automata working in polynomial time characterize the class LOGCFL of languages reducible in logarithmic space to context-free languages. As proved in [19], the class LOGCFL coincides with the class of languages recognized by semiunbounded circuits of polynomial size and logarithmic depth, i.e., SAC¹ circuits. By 1–NAuxPDA^p we denote the family of languages accepted in polynomial time by 1–NAuxPDA. Finally, we recall that NL is the class of languages accepted by logspace bounded nondeterministic Turing machines, while 1–NL is the class of languages accepted by logspace bounded one-way nondeterministic Turing machines.

In Figure 1, the relationships among these classes and other complexity classes considered in the literature are summarized.


2. Intrusion detection

In this paper we are interested in the host-based intrusion detection model initially introduced by Forrest et al. As mentioned earlier, such a model, also known as the N-gram model, is based on the intuition that the “normal” behavior of a program can be characterized by the sequences of system calls it uses during its executions in a sterile environment. The characteristic patterns of such sequences are placed in a database, and they are subsequently used for detecting intrusions in production environments. An essential insight of this approach was that the database could consist of short substrings, called N-grams, that may occur during a program execution. Thus, to detect intrusions, sequences of system calls of a given length are collected and compared against the contents of the database, using the Hamming distance. More precisely, given a language L ⊆ Σ^∗ describing the normal behavior of a process p and a string x describing the current behavior of p, the HIDS computes d(L, x); if the result is greater than an experimentally defined threshold, an alarm is raised. In the original model, L is a regular language and the adopted distance function is the Hamming distance.

One of the main drawbacks of the N-gram model is its limited accuracy, i.e., it is not able to recognize particular forms of attack and, as shown in [11], it is characterized by a relatively high rate of false alarms. Many contributions suggesting improvements to the N-gram model have appeared in the literature. Here we are particularly interested in [16, 18].

In [16] the behavior of a program is described by a deterministic FSA (Finite State Automaton) which, in contrast to N-grams, is better suited to capturing common program structures such as branches, joins and loops, thus helping to reduce the rate of false alarms. The FSA is built by monitoring the normal program execution at runtime. The states of the automaton are represented by the distinct program counter values at which a system call is made, while transitions are labeled with the syscall name. Such an automaton is then used to detect intrusions in the following way. Any time the monitored program executes a syscall s, a check is performed to verify whether the location from which s was made corresponds to an FSA state and whether there exists a transition from the current state labeled with s; if not, a transition to a “sink” state is performed. When the number of such transitions, or equivalently the anomaly counter, exceeds a threshold, an alarm is raised². Thus in this case the HIDS operates on a regular language L, while the notion of distance adopted for detecting intrusion attempts is the prefix distance.
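The automaton-based check described above can be sketched loosely as follows. The representation of a trace as (program counter, syscall) pairs, the artificial `START` state, the exact labeling convention for transitions and the choice to resynchronize on the observed location after an anomaly are all assumptions made for illustration; they are not details taken from [16].

```python
from typing import Iterable, List, Set, Tuple

Event = Tuple[int, str]            # (program counter, syscall name)
Transition = Tuple[int, str, int]  # (source state, syscall label, target state)

START = -1  # hypothetical initial state, before any system call is made

def learn_fsa(traces: Iterable[List[Event]]) -> Set[Transition]:
    """Training: record every transition observed in normal executions."""
    transitions: Set[Transition] = set()
    for trace in traces:
        state = START
        for pc, syscall in trace:
            transitions.add((state, syscall, pc))
            state = pc
    return transitions

def count_anomalies(trace: List[Event], transitions: Set[Transition]) -> int:
    """Detection: count the events that match no learned transition."""
    state = START
    anomalies = 0
    for pc, syscall in trace:
        if (state, syscall, pc) not in transitions:
            anomalies += 1  # corresponds to a move to the "sink" state
        state = pc          # assumption: resynchronize on the observed location
    return anomalies

def raise_alarm(trace: List[Event], transitions: Set[Transition], threshold: int) -> bool:
    """Alarm when the anomaly counter exceeds the threshold."""
    return count_anomalies(trace, transitions) > threshold
```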

The model previously described is not able to deal with a specific form of computer attack first described in [18] and known as the impossible path problem (for further details see also [8]). In order to improve the resistance against this type of attack, Wagner and Dean introduced in [18], among others, a new HIDS model called the abstract stack model. Such a model is still based on the assumption that the “normal” behavior of a program can be characterized by the sequence of system calls used, but it derives such sequences via static analysis of the source code (instead of at runtime, as in the models previously described), along with a fixed model of the operating system. A program is modeled as a transition system which can be simulated by a nondeterministic pushdown automaton. In the detection phase, intrusions are detected by recognizing strings of system calls which could not have been generated by the underlying transition system. Any time the monitored program makes a system call, one or more steps of the nondeterministic transition system are simulated; if none of them reaches an admissible transition then an alarm is raised. In this case the HIDS operates on a context-free language L, while the notion of distance adopted is the prefix distance.

² To be more precise, the authors of [16] associate different weights with different kinds of anomalies, and use such weights to compute the anomaly counter. It is not difficult to see that this more general case can be handled by adopting a proper FSA.

In [14] the N-gram model has been extended to work with distributed applications. In such a context the authors argued that, instead of tracing syscalls, it is natural to trace calls from client to server, and that variable length N-grams are more suitable for capturing the anomalous behavior of applications. It turns out that the Hamming distance is no longer suitable for comparing the current behavior of a process against normal behaviors, and in such a case the authors proposed to adopt the suffix distance.

So far, the approach of Forrest et al. has been used for implementing anomaly-based HIDS; however, it can also be used to implement misuse detection HIDS. In such a case, the language L contains all the traces of the malicious agents. The string x describing the current behavior of a process p is compared against the strings contained in L in order to verify whether it contains some of them. The detection of intrusions is performed by computing the subword distance between the regular language L and x.

In conclusion, the various approaches adopted for the construction of anomaly-based HIDS can be unified by the problem of computing the distance between a string x ∈ Σ^∗ and a language L ⊆ Σ^∗ for a suitable alphabet Σ. The differences among the various approaches are determined by the characteristics of L and the notion of distance chosen by the various authors.

In the following, we provide a computational classification of such a problem and identify a class of languages for which it can be solved efficiently (in terms of classical complexity theory).

3. Efficient computation of distances for languages in 1–NAuxPDA^p

In [15], it was shown that the edit distance of any language accepted in polynomial time by one-way auxiliary pushdown automata can be efficiently computed. More precisely, it belongs to the parallel complexity class AC¹. In [1] a similar result was proved for the maximal word function of such a language. In this section, we show a better result for the prefix, suffix, and subword distances: those distances are in the parallel complexity class SAC¹.


Let us start by presenting the following preliminary lemma, which states a linear upper bound on the length of a string that belongs to a language and has minimal distance from a given string.

Lemma 3.1. Given a nonempty language L ⊆ Σ^∗, there exists a constant β such that for any σ ∈ {e, p, s, f}, for each x ∈ Σ^∗ and for each y ∈ L with d_σ(L, x) = d_σ(y, x), it holds that d_σ(L, x) ≤ |x| + β and |y| ≤ 2|x| + β.

Proof. Let x₀ be a string of minimal length belonging to L and let β = |x₀|. It is not difficult to observe that d_σ(L, x) ≤ d_σ(x₀, x) ≤ |x| + |x₀| = |x| + β.

Now, observe that d_σ(y, x) ≥ |y| − |x| for each string y. If d_σ(y, x) = d_σ(L, x), then |y| ≤ d_σ(L, x) + |x| ≤ 2|x| + β.

Given a language L and the constant β of Lemma 3.1, in order to compute the prefix distance for L we introduce the following relation:

\[
R_p = \{\, (i, j, x) \mid \exists z, w, v \in \Sigma^{*} \text{ s.t. } x = zw,\ zv \in L,\ |z| = i \text{ and } 0 \le |v| = j \le 2|x| + \beta \,\}.
\]

We now prove the following result concerning the set R_p:

Lemma 3.2. d_p(L, x) = min{|w| + j | ∃z ∈ Σ^∗ s.t. x = zw and (|z|, j, x) ∈ R_p}.

Proof. Let y ∈ L be such that d_p(L, x) = d_p(y, x); by Lemma 3.1, |y| ≤ 2|x| + β. Taking z as the longest common prefix of x and y, there exist strings w, v such that x = zw, y = zv and d_p(y, x) = |v| + |w| = d_p(L, x).

It is easy to verify that the triple (|z|, |v|, x) belongs to R_p. Hence d_p(L, x) ≥ min{|w| + j | ∃z ∈ Σ^∗ s.t. x = zw and (|z|, j, x) ∈ R_p}.

Conversely, let (|z|, j, x) ∈ R_p, with x = zw for some string w, be such that |w| + j gives the minimum on the right-hand side of the equality. Then there is a string v of length j such that zv ∈ L. Hence, d_p(L, x) ≤ d_p(zv, x) ≤ |w| + |v| = |w| + j.
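When L is regular and given by a deterministic finite automaton, the minimum of Lemma 3.2 can be computed sequentially in a simple way: for each prefix z of x that the automaton can read, add |w| = |x| − |z| to the length j of a shortest string v with zv ∈ L. The sketch below assumes a partial DFA encoded as a dictionary; the encoding and the function names are illustrative. It ignores the bound j ≤ 2|x| + β, which does not affect the minimum because the choice z = ε already achieves |x| + β.

```python
import math
from collections import deque
from typing import Dict, Hashable, Set, Tuple

State = Hashable
Delta = Dict[Tuple[State, str], State]

def shortest_completions(transitions: Delta, accepting: Set[State]) -> Dict[State, int]:
    """For each state, the length of a shortest word leading to an accepting state."""
    reverse: Dict[State, Set[State]] = {}
    for (src, _symbol), dst in transitions.items():
        reverse.setdefault(dst, set()).add(src)
    dist = {q: 0 for q in accepting}
    queue = deque(accepting)
    while queue:  # breadth-first search on the reversed transition graph
        q = queue.popleft()
        for p in reverse.get(q, ()):
            if p not in dist:
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist

def prefix_distance_to_dfa(x: str, start: State, transitions: Delta,
                           accepting: Set[State]) -> float:
    """d_p(L, x) as the minimum of |w| + j over the splits x = zw of Lemma 3.2."""
    completion = shortest_completions(transitions, accepting)
    best = math.inf
    state = start
    for i in range(len(x) + 1):
        # Here state is the one reached after reading z = x[:i].
        j = completion.get(state, math.inf)
        best = min(best, (len(x) - i) + j)
        if i == len(x) or (state, x[i]) not in transitions:
            break  # no longer prefix of x is a prefix of a string in L
        state = transitions[(state, x[i])]
    return best
```

For the language a b^∗ c and x = abbd, the best split is z = abb, w = d, v = c, which gives distance 2.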

We now consider the following language encoding the relation R_p:

\[
L_p = \{\, z\#w\#^{j} \mid \exists v \in \Sigma^{*} \text{ s.t. } zv \in L \text{ and } 0 \le |v| = j \le 2|zw| + \beta \,\}.
\]

It is immediate to observe that a string z#w#^j belongs to the language L_p if and only if the triple (|z|, j, zw) belongs to the relation R_p.

Now, we study the relationships between the complexities of the languages L and L_p. In particular, we prove the following result:

Theorem 3.3. Given L ⊆ Σ^∗, let L_p be the language defined above.

• L ∈ 1–NAuxPDA implies that L_p ∈ 1–NAuxPDA.

• L ∈ 1–NAuxPDA^p implies that L_p ∈ 1–NAuxPDA^p.

• L ∈ 1–NL implies that L_p ∈ 1–NL.


Proof. First suppose that L is accepted by the 1–NAuxPDA M. We define a 1–NAuxPDA M_p accepting L_p that, on an input string of the form z#w#^j, works in three phases, as described in the following.

In the first phase, M_p directly simulates a computation of M on the input prefix z. During this phase M_p also keeps track, using its work tape, of the length of z. In the second phase M_p scans the factor w to count its length. With this information M_p can compute, using its logarithmic work tape, the number 2|zw| + β.

In the third phase, M_p scans the input suffix #^j. In this phase M_p continues the simulation of M started in the first phase, by guessing, for each scanned symbol, an input symbol for M. If during this phase M_p discovers that j > 2|zw| + β, then it stops and rejects. Otherwise, let v be the string of j symbols guessed by M_p in the third phase. M_p accepts at the end of the computation if and only if the simulated computation of M on input zv is accepting.

It is immediate to verify that if M works in polynomial time then M_p also works in polynomial time. Moreover, if M does not use the pushdown store, namely, it is a one-way machine working in logarithmic space, then M_p is a machine of the same kind.

This completes the proof of the theorem.

We informally recall that a function f : {0,1}^∗ → {0,1}^∗ is NC¹ reducible to a function g : {0,1}^∗ → {0,1}^∗ if and only if f can be computed by a family of NC¹ circuits, extended with oracle gates that compute values of g, subject to the restriction that no two oracle gates lie on the same input-output path (for more details see, e.g., [13]). Under a suitable encoding, the notion of NC¹-reducibility can be extended to inputs and outputs that are not binary.

Lemma 3.4. The prefix distance for L is NC¹-reducible to the language L_p.

Proof. (outline) Given an input string x, the circuit realizing the reduction uses the oracle gate G_{i,j}, for 0 ≤ i ≤ |x| and 0 ≤ j ≤ 2|x| + β, in order to check whether or not (i, j, x) ∈ R_p, i.e., z#w#^j ∈ L_p, where zw = x and |z| = i.

The output of the circuit, i.e., d_p(L, x), is the minimum of all integers n − i + j, where n = |x|, such that the oracle gate G_{i,j} gives answer “yes”.

Because the minimum of a polynomial set of numbers can be computed in AC⁰ and, hence, in NC¹ [17], it is not difficult to verify that this is an NC¹-reduction.
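The structure of this reduction can be mimicked sequentially: query an oracle for L_p on every pair (i, j) in the stated ranges and take the minimum of n − i + j over the positive answers. In the sketch below the oracle is implemented by brute force for a finite, nonempty language given as a list of strings; this restriction, the sequential evaluation and the function names are assumptions made for illustration, whereas in the circuit all queries are independent and are evaluated in parallel.

```python
from typing import List

def in_Lp(z: str, w: str, j: int, language: List[str], beta: int) -> bool:
    """Oracle for z#w#^j in L_p: some v with |v| = j <= 2|zw| + beta and zv in L."""
    if j > 2 * (len(z) + len(w)) + beta:
        return False
    return any(y.startswith(z) and len(y) - len(z) == j for y in language)

def prefix_distance_via_oracle(x: str, language: List[str]) -> int:
    """d_p(L, x) as the minimum of n - i + j over the accepted oracle queries."""
    beta = min(len(y) for y in language)  # length of a shortest string of L
    n = len(x)
    candidates = [
        n - i + j
        for i in range(n + 1)             # split x = zw with |z| = i
        for j in range(2 * n + beta + 1)  # candidate length of v
        if in_Lp(x[:i], x[i:], j, language, beta)
    ]
    # The query i = 0, j = beta always succeeds, so candidates is nonempty.
    return min(candidates)
```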

Using Lemma 3.4 we are now able to prove the main result of this paper:

Theorem 3.5. For each language L ∈ 1–NAuxPDA^p, the prefix, suffix and subword distances of L are in SAC¹.

Proof. It is immediate to see that SAC¹ is closed under NC¹-reductions. By Theorem 3.3, if L ∈ 1–NAuxPDA^p then L_p ∈ 1–NAuxPDA^p ⊆ SAC¹; hence, by Lemma 3.4, the prefix distance of L is in SAC¹.

For the suffix and subword distances the proof can be given using similar steps. In particular, in the case of the suffix distance it is useful to consider the following relation:

\[
R_s = \{\, (i, j, x) \mid \exists z, w, v \in \Sigma^{*} \text{ s.t. } x = wz,\ vz \in L,\ |z| = i, \text{ and } 0 \le |v| = j \le 2|x| + \beta \,\},
\]

while in the case of the subword distance the relation:

\[
R_f = \{\, (i, j, j', x) \mid \exists w, w', v, v', z \in \Sigma^{*} \text{ s.t. } x = wzw',\ vzv' \in L,\ |z| = i,\ |v| = j,\ |v'| = j', \text{ and } 0 \le j + j' \le 2|x| + \beta \,\}.
\]

Using a similar technique, it is easy to prove the following:

Theorem 3.6. For each language L ∈ 1–NL, the prefix, suffix and subword distances of L are in NL.

4. Conclusions

We have revisited the notion of host intrusion detection system from a formal languages point of view. We have shown that the various models of HIDS based on the approach proposed in [9] represent instances of the problem of computing the distance between a string x and a language L, using various notions of distance as well as referring to different types of formal languages. We have also shown that such a problem is highly parallelizable. Ongoing research efforts are trying to evaluate the practical impact of these results.

Acknowledgements. The authors thank the anonymous referees, whose hints contributed greatly to improving the quality of the final version of the manuscript. In particular, a referee's suggestion helped us to improve our original result.

References

[1] E. Allender, D. Bruschi and G. Pighizzini, The complexity of computing maximal word functions. Comput. Compl. 3 (1993) 368–391.

[2] J.P. Anderson, Computer security threat monitoring and surveillance. Tech. Rep., James P. Anderson Company, Fort Washington (1980).

[3] F. Brandenburg, On one-way auxiliary pushdown automata, in Proc. 3rd GI Conference. Lect. Notes Comput. Sci. 48 (1977) 133–144.

[4] C. Choffrut and G. Pighizzini, Distances between languages and reflexivity of relations. Theoret. Comput. Sci. 286 (2002) 117–138.

[5] S. Cook, Characterization of pushdown machines in terms of time-bounded computers. J. ACM 18 (1971) 4–18.

[6] S. Cook, A taxonomy of problems with fast parallel algorithms. Inform. Control 64 (1985) 2–22.

[7] D.E. Denning, An intrusion detection model. IEEE Trans. Software Engineering 13 (1987).

[8] H. Feng, O. Kolesnikov, P. Fogla, W. Lee and W. Gong, Anomaly detection using call stack information, in Proc. IEEE Symposium on Security and Privacy. IEEE Press (2003).

[9] S. Forrest, S. Hofmeyr, A. Somayaji and T. Longstaff, A sense of self for Unix processes, in Proc. IEEE Symposium on Security and Privacy. IEEE Press (1996).

[10] S. Forrest, S. Hofmeyr, A. Somayaji and T. Longstaff, Intrusion detection using sequences of system calls. J. Comput. Security 6 (1998) 151–180.

[11] A.K. Ghosh and A. Schwartzbard, A study in using neural networks for anomaly and misuse detection, in Proc. USENIX Security Symposium. USENIX Association (1999).

[12] J. Hopcroft and J. Ullman, Introduction to automata theory, languages, and computation. Addison-Wesley, Reading, MA (1979).

[13] R. Karp and V. Ramachandran, A survey of parallel algorithms for shared-memory machines, in Handbook of Theoretical Computer Science, Vol. A. North Holland (1990).

[14] C. Marceau, Characterizing the behavior of a program using multiple-length N-grams, in Proc. New Security Paradigms Workshop. ACM Press (2000) 101–110.

[15] G. Pighizzini, How hard is computing the edit distance? Inform. Comput. 165 (2001) 1–13.

[16] R. Sekar, M. Bendre, D. Dhurjati and P. Bollineni, A fast automaton-based method for detecting anomalous program behaviors, in Proc. IEEE Symposium on Security and Privacy. IEEE Press (2001).

[17] Y. Shiloach and U. Vishkin, Finding the maximum, merging and sorting in a parallel computation model. J. Algorithms 2 (1981) 88–102.

[18] D. Wagner and D. Dean, Intrusion detection via static analysis, in Proc. IEEE Symposium on Security and Privacy (2001).

[19] H. Venkateswaran, Properties that characterize LOGCFL. J. Comput. Syst. Sci. 43 (1991) 380–404.
