
A Re-Ranking Method of Search Results Based on

Keyword and User Interest

Ming Xu
Hangzhou Dianzi University, P. R. China

Hong-Rong Yang
Hangzhou Dianzi University, P. R. China

Ning Zheng
Hangzhou Dianzi University, P. R. China

ABSTRACT

It is a pivotal task for a forensic investigator to search a hard disk for interesting evidence. Currently, most search tools in the digital forensic field, which utilize text string matching and index technology, produce high recall (100%) and low precision. As a result, investigators often waste vast amounts of time on huge numbers of irrelevant search hits. In this chapter, an improved method for ranking search results is proposed to reduce the human effort spent locating interesting hits. The K-UIH (keyword and user interest hierarchy) is constructed adaptively from both investigator-defined keywords and user interest learned from the electronic evidence, and the K-UIH is then used to re-rank the search results. The experimental results indicate that the proposed method is feasible and valuable in the digital forensic search process.

INTRODUCTION

The most common task for a forensic investigator is to search a hard disk for interesting evidence. The investigator needs to focus on specific evidence and important indicators of suspicious activity (e.g., specific keyword searches). Unfortunately, the large size of modern hard disks makes this extremely difficult and wastes vast amounts of the investigator's time on huge numbers of irrelevant search hits. Many commercial and open-source tools have been developed to assist investigators in finding relevant hits among large amounts of data, e.g., Forensic Tool Kit (AccessData, 2009) and Encase (Guidance Software, 2009). Nevertheless, a huge number of search hits will be returned by search operations with high recall and low precision.

Moreover, these digital forensic text string search tools fail to group and/or order search hits in a manner that appreciably improves the investigator's ability to get to the relevant hits first.


RELATED WORK

Dario Forte illustrated the importance of text searches in digital forensics (Forte, 2004). Taking the GREP tool as an example, he observed that its power depends on the technical expertise of the investigator. Beebe and Dietrich (2007) disclosed a general consensus that industry-standard digital forensic tools are not scalable to large data sets. In their following work (Beebe, & Clark, 2007), a new, high-level text string search process model was presented. They proposed and empirically tested the feasibility and utility of post-retrieval thematic clustering of digital forensic search results. Our method also attempts to re-sort search results so that important hits can be found quickly. The difference is that we try to learn user interest from the evidence and combine it with investigator-defined keywords to build an adaptive user interest hierarchy, which is used to rank the search results.

In the work of Petrovic and Franke (2007), a new search procedure was presented that makes use of the constrained edit distance in the pre-selection of the areas of the digital forensic search space that are interesting for the investigation. They divided the whole search space into several fragments and then computed the constrained edit distance between each fragment and the query. Our approach, however, focuses on the entire hard disk instead of dividing it into small search spaces. Jee, Lee, and Hong (2007) also tried to improve the search efficiency of digital forensics. A pattern matching board was used to build a high-speed bitwise search model for large-scale digital forensic investigations. This approach differs from ours because we attempt to re-rank search results to reduce human effort, and no additional hardware is used in the search process. Personalizing search results is not a new issue; it has been applied successfully in the web information retrieval field. Teevan, Dumais, and Horvitz (2005) learned implicit interest from users to reorder search results. Various files on the user's computer were used as the training set for user interest. Unfortunately, their user profile did not represent topics from general to specific. The work of Kim and Chan (2008) addressed this need. Their approach is to learn a user interest hierarchy (UIH) from the web pages visited by the user. A divisive hierarchical clustering (DHC) algorithm was designed to group words into a hierarchy in which higher-level nodes are more general and lower-level ones are more specific. In their study (Kim, & Chan, 2006), a ranking algorithm was proposed to reorder the results with a learned user profile. In our search results re-ranking algorithm, large amounts of data from the digital evidence can be used to learn user interest, but the primary goal of digital forensic search is to satisfy the investigator, which differs from web personalization.

However, during a digital investigation, developing a profile of the offender can help focus the search. Armed with a better understanding of the possible motivation, modus operandi (MO), and signatures, the investigator can derive specific search criteria for forensic analysis (Rogers, 2003). Moreover, our approach attempts to extract user interest from digital artifacts automatically, with no human effort involved in the process. We therefore believe that identifying user interest is important in the digital forensic search process, and that the UIH method can be extended to the digital forensic field when combined with the investigator's focus.

Yang, Sun, and Sun (2006) also proposed an algorithm for learning hierarchical user interest models from the Web pages users had browsed. However, they attempted to update user interest according to a dynamic document set, whereas the dataset of the proposed method is based on static electronic evidence.

THE K-UIH

In Kim's user interest hierarchy (Kim, & Chan, 2008), more general interest is represented by a larger set of words, which are extracted from web pages. Each web page can be assigned to a set of nodes for further processing. In the DHC algorithm, the similarity function and the threshold-finding method greatly influence the clustering result. The former measures how closely two words are related, and the latter determines what value of similarity is considered "strong" or "weak". Edges with weak weights are removed from the similarity matrix (denoted by SM). In this work, we fix the similarity

A UIH organizes a user's interests from general to specific. Towards the root of a UIH, more general or longer-term interests are represented by larger clusters of words, while towards the leaves, more specific or shorter-term interests are represented by smaller clusters of words. Before discussing our approach for building the K-UIH, an example UIH is drawn. To generate the UIH, seven web pages from the web browser's bookmarks were collected as input. We first parsed the HTML documents and extracted text information from them without considering link or multimedia information. Then the words were segmented (Chinese pages) or stemmed (English pages) so that all words in each web page were obtained. Finally, we filtered the words through a stop list (Frakes, & Baeza-Yates, 1992), which contains the most common words. The sample data set is shown in Table 1. It should be noted that the samples throughout our study are in Chinese; to illustrate our idea more intelligibly, we have translated them into English.

In Figure 1, the words in the nodes come from the sample data set (Table 1). Each node represents a conceptual relationship when its terms occur together in the same web page. For example, at the bottom left of the figure, 'journal' and 'Chinese' can be grouped under journal submission, while in its sibling node, 'conference' and 'deadline' fall under conference program, but not the other way around. Additionally, these words are all related to some other words, such as 'research' and 'paper', which are contained in the parent node. Examining the whole tree, it can easily be seen that the left side represents user interest in research and paper submission, while the right side is related to computer forensics.
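The general-to-specific structure described above can be modeled as a small tree of word clusters. The following Python sketch is illustrative only: the node class and the sample clusters (loosely mirroring the journal-submission branch of Figure 1) are our own assumptions, not part of the original algorithm.

```python
# A UIH node holds a cluster of words; children hold smaller,
# more specific clusters refining the parent's topic.
class UIHNode:
    def __init__(self, words, children=None):
        self.words = set(words)
        self.children = children or []

    def depth_of(self, term, level=1):
        """Deepest level at which `term` appears (0 if absent)."""
        best = level if term in self.words else 0
        for child in self.children:
            best = max(best, child.depth_of(term, level + 1))
        return best

# Illustrative fragment: the branch about research and paper submission.
uih = UIHNode(
    ["computer", "paper", "research", "submit", "journal",
     "conference", "Chinese", "deadline"],
    children=[
        UIHNode(["journal", "Chinese"]),      # journal submission
        UIHNode(["conference", "deadline"]),  # conference program
    ],
)

print(uih.depth_of("journal"))   # in the root and a leaf -> 2
print(uih.depth_of("research"))  # root only -> 1
```

The `depth_of` helper anticipates the scoring discussion later in the chapter, where the deepest matching level of a term indicates its significance.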

In this study, we mainly focus on improving the search efficiency of the investigator. It seems natural to incorporate the investigator's interest into the UIH as a new tree, which we name the K-UIH. In the digital forensic field, digital evidence commonly contains hundreds or thousands of files, so a huge UIH would be built by the original clustering algorithm. We therefore utilize the keywords input by the investigator to localize the original SM. Here we rely on a hypothesis: there exists an intersection between the keyword set and the SM word set, i.e., at least one word in the keyword set also occurs in the SM word set. It is believed that the investigator can easily find important evidence if the input keyword is contained in the SM word set. The SM word set limits and explains the actual meanings of the keywords in the context of the digital evidence.

Note that we would like a small K-UIH that contains one or more keywords, so a new threshold-finding method should be designed to satisfy this. We consider a component of the SM, called the keyword similarity set (denoted by C), which contains the similarities between keywords and other words. For example, if there are 10 edges in the SM connecting keywords to other words, the number of members of C (denoted by n) is 10. We determine the threshold using the formula below:

Table 1. The sample data set

Page  Words
1     computer academy journal Chinese science
2     computer academy conference deadline security submit
3     computer journal submit engineering
4     computer crime investigate network security forensic technology case
5     computer confident abuse identify material enterprise
6     paper submit review revise research study
7     paper submit review revise research

threshold = max(s, c_i), where i = min(n, t)  (1)

Before the threshold is computed, the similarity values in C are arranged in descending order (c_1 ≥ c_2 ≥ ... ≥ c_n). In Equation 1, i is defined as the smaller of n and t, which prevents too many similarity edges from being considered. The threshold is then selected as the larger of c_i and s, which is also defined as a constant (i.e., 10). The role of s is similar to that of t. The value of t can be determined by the MaxChildren method, which selects a threshold such that the maximum number of child clusters is generated, guiding the algorithm toward a shorter tree.
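Under this reading of Equation 1, the threshold computation can be sketched as below. The function name and the sample similarity values are hypothetical, and the default values for s and t are placeholders rather than the chapter's tuned constants.

```python
def keyword_threshold(keyword_sims, s=0.1, t=10):
    """Cut-off for localizing the similarity matrix (Equation 1).

    keyword_sims: the similarity values of edges incident to investigator
    keywords (the set C). i = min(n, t) caps how many edges are considered,
    and the threshold is the larger of c_i and the constant s.
    """
    c = sorted(keyword_sims, reverse=True)  # c_1 >= c_2 >= ... >= c_n
    n = len(c)
    i = min(n, t)
    return max(s, c[i - 1]) if i > 0 else s

sims = [0.42, 0.35, 0.30, 0.22, 0.18, 0.12, 0.08]
print(keyword_threshold(sims, s=0.1, t=5))  # c_5 = 0.18 > s -> 0.18
```

Edges in the SM with weights below this threshold would then be dropped before building the K-UIH.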

After localizing the SM with the threshold discussed above, we build the K-UIH using the MaxChildren method and the AEMI-SP similarity function, in the same way as the original UIH algorithm. A simple example of a K-UIH is shown in Figure 2.

In the mock case in Figure 2, someone was suspected of selling pirated video, audio, or games. The input keywords for searching were 'crack', 'manufacturer', and 'free'. As drawn in Figure 2, his interest is demonstrated well in the K-UIH we built. We are confident that the search efficiency of a digital investigation would be greatly improved by using the K-UIH.

Figure 1. The sample user interest hierarchy

Figure 2. Sample adaptive user interest hierarchy

To illustrate the proposed approach clearly, the improved search process is summarized in Figure 3. The first step takes the electronic evidence as input, parses the documents, and extracts text information from them without considering link or multimedia information; the words are then segmented (Chinese pages) or stemmed (English pages) so that all words in each document are obtained, and finally the words are filtered through a stop list.
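This first preprocessing step might be sketched as follows for English text. The regex-based tag stripping, the tiny stop list, and the omission of stemming are simplifying assumptions for illustration.

```python
import re

# Illustrative subset of a stop list of the most common words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def extract_words(text):
    """Strip markup, tokenize, and filter through a stop list."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags (links/media ignored)
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # simple word tokenization
    return [w for w in tokens if w not in STOP_WORDS]

html = "<p>The <b>computer</b> journal accepts a paper to review.</p>"
print(extract_words(html))  # ['computer', 'journal', 'accepts', 'paper', 'review']
```

A real pipeline would add a stemmer for English pages and a word segmenter for Chinese pages, as the chapter describes.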

The second step uses AEMI-SP as the similarity function to calculate the "closeness" of two words.

The third step builds a weighted undirected graph, called the similarity matrix, with each vertex representing a word and each weight denoting the similarity between two words. Since related words are more likely to appear in the same document than unrelated terms, we measure the co-occurrence of words in a document. At the same time, the investigator needs to input the keywords.

The fourth step uses the SM, the keywords input by the investigator, and Equation (1) to locate the threshold.

The fifth step builds the K-UIH using MaxChildren and AEMI-SP. While the K-UIH is being built in the fourth and fifth steps, the conventional evidence search based on the input keywords should also be performed to obtain the search results.

The final step uses the ranking algorithm and the K-UIH to rank the search results. The ranking algorithm is presented in Section 4.

Similarity Functions

The similarity function is used to calculate how strongly two words are related. We assume two words co-occurring within a window size are related, because related words are more likely to be close to each other than unrelated words. In this work, the window size is simply assumed to be the entire length of a document. That is, two words co-occur if they are in the same document.
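With the window fixed to the entire document, co-occurrence reduces to counting, for each word pair, how many documents contain both words. A minimal sketch (the sample documents are illustrative):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """For each word pair, count the documents containing both words."""
    pair_df = Counter()
    for doc in documents:
        # Distinct words only: multiple occurrences in one document count once.
        for a, b in combinations(sorted(set(doc)), 2):
            pair_df[(a, b)] += 1
    return pair_df

docs = [
    ["computer", "journal", "submit"],
    ["computer", "journal", "chinese"],
    ["paper", "submit", "review"],
]
counts = cooccurrence_counts(docs)
print(counts[("computer", "journal")])  # co-occur in 2 documents
```

These pairwise document counts are the raw statistics from which the probabilities used by AEMI below can be estimated.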

Figure 3. The proposed search procedure

The document frequency of a word is the number of documents that contain the word. Words that are commonly used in many documents are usually not informative in characterizing the content of the documents. Hence, the inverse document frequency (the reciprocal of document frequency) measures how informative a word is in characterizing the content. AEMI is an enhanced version of MI (mutual information) and EMI (expected mutual information). Unlike MI, which considers only one corner of the contingency matrix, and EMI, which sums the MI of all four corners of the contingency matrix, AEMI sums supporting evidence and subtracts counter-evidence. AEMI can find more meaningful multiword phrases than MI or EMI. Consider the variables A and B in AEMI(A,B) as the events for the two terms (a and b), where the capital A and B are variables and the lowercase a and b are the instances. P(A=a) is the probability of a document containing term a and P(A=ā) is the probability of a document not containing term a. P(A=a, B=b) is the probability of a document containing both terms a and b. These probabilities are estimated from documents that are interesting to the user. AEMI(A,B) is defined as:

AEMI(A,B) = P(a,b) log [P(a,b) / (P(a)P(b))] − P(ā,b) log [P(ā,b) / (P(ā)P(b))] − P(a,b̄) log [P(a,b̄) / (P(a)P(b̄))]  (2)
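Estimating AEMI from document counts, following the description above (the supporting corner P(a,b) minus the two mismatched counter-evidence corners), might look like the sketch below. The function and the sample documents are illustrative assumptions.

```python
import math

def aemi(docs, a, b):
    """AEMI: support P(a,b) minus counter-evidence from P(~a,b) and P(a,~b).

    Probabilities are estimated as document fractions; a zero-probability
    corner contributes nothing (the 0 * log 0 convention).
    """
    n = len(docs)
    pa = sum(a in d for d in docs) / n
    pb = sum(b in d for d in docs) / n
    pab = sum(a in d and b in d for d in docs) / n
    pnab = sum(a not in d and b in d for d in docs) / n   # ~a, b
    panb = sum(a in d and b not in d for d in docs) / n   # a, ~b

    def mi(pxy, px, py):
        return pxy * math.log(pxy / (px * py)) if pxy > 0 else 0.0

    return mi(pab, pa, pb) - mi(pnab, 1 - pa, pb) - mi(panb, pa, 1 - pb)

docs = [{"computer", "journal"}, {"computer", "journal"},
        {"computer", "crime"}, {"paper", "review"}]
print(aemi(docs, "computer", "journal"))  # positive: the terms co-occur strongly
```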

In this work, we enhance AEMI by incorporating a component for inverse document frequency (IDF) into the correlation function. The IDF measures how informative a term is in characterizing the content. When involving the IDF, we adopt a sigmoid function in order to emphasize more specific (informative) terms. The adjusted sigmoid function is called SP (specificity):

SP(m) = 1 / (1 + e^(0.6(10.5m − 5)))  (3)

where m is the larger of P(A=a) and P(B=b). We choose the larger probability so that SP is more conservative. The factor 0.6 smooths the curve, and the constants 10.5 and −5 shift the range of m from between 0 and 1 to between −5 and 5.5. The new range of −5 to 5.5 is slightly asymmetrical because we would like to give a small bias to more specific terms. For example, for a = 'ann' and b = 'perceptron', m is 0.2 and SP(m) is 0.85, but for a = 'machin' and b = 'ann', m is 0.6 and SP(m) is 0.31.
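A sigmoid built from the stated constants (smoothing factor 0.6, shift 10.5m − 5) reproduces both worked values quoted above, which suggests the following form; the function name is ours.

```python
import math

def sp(m):
    """Specificity: a decreasing sigmoid of m, the larger of P(a) and P(b).

    0.6 smooths the curve; 10.5*m - 5 maps m from [0, 1] to [-5, 5.5],
    so common terms (high document frequency) get low specificity.
    """
    return 1.0 / (1.0 + math.exp(0.6 * (10.5 * m - 5)))

print(round(sp(0.2), 2))  # 0.85, matching the 'ann'/'perceptron' example
print(round(sp(0.6), 2))  # 0.31, matching the 'machin'/'ann' example
```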

The correlation function is:

AEMI-SP = AEMI × SP / 2  (4)

The usual range of AEMI is 0.1–0.45 and that of SP is 0–1. To scale SP to a similar range as AEMI, we divide SP by 2.

The MaxChildren

The MaxChildren method is used to dynamically determine a reasonable threshold value to differentiate strong from weak correlation values between a pair of terms. The MaxChildren method selects a threshold such that the maximum number of child clusters is generated. This ensures that the resulting hierarchy tree does not degenerate into one that is too tall and thin. This preference stems from the fact that topics are in general more diverse than detailed, and a library catalog taxonomy is typically short and wide. MaxChildren calculates the number of child clusters for each boundary value between two quantized regions. The method ignores the first half of the boundary values to guarantee that the selected threshold is not too low. The MaxChildren method recursively divides the selected best region until there are no changes
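The core of MaxChildren — counting the clusters produced at each candidate boundary and skipping the lower half of the candidates — can be sketched with a union-find over the word graph. The quantized-region recursion is omitted here, and the names and sample weights are illustrative assumptions.

```python
def count_clusters(words, edges, threshold):
    """Connected components of the word graph after dropping weak edges."""
    parent = {w: w for w in words}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), weight in edges.items():
        if weight >= threshold:            # keep only "strong" edges
            parent[find(a)] = find(b)
    return len({find(w) for w in words})

def max_children_threshold(words, edges, candidates):
    """Pick the boundary producing the most clusters, ignoring the low half."""
    candidates = sorted(candidates)
    upper = candidates[len(candidates) // 2:]  # skip too-low thresholds
    return max(upper, key=lambda c: count_clusters(words, edges, c))

words = ["computer", "journal", "submit", "crime", "forensic"]
edges = {("computer", "journal"): 0.4, ("journal", "submit"): 0.35,
         ("crime", "forensic"): 0.3, ("computer", "crime"): 0.1}
print(max_children_threshold(words, edges, [0.1, 0.2, 0.3, 0.4]))
```

Note that this simplified count treats singleton words as clusters; the original method's handling of quantized regions and cluster sizes may differ.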

SCORING OF THE FILES RETURNED BY A TRADITIONAL SEARCH ENGINE

In this section, the ranking algorithm for reordering the search results returned by a traditional search engine is discussed. The most important step is scoring each file in the search results. Therefore, a reasonable scoring method should be designed so that more interesting files are assigned higher scores. We are inspired by H. R. Kim's work, which provides a good example of how to score search results based on a UIH (Kim, & Chan, 2006).

Given a file in the search results, we first identify the terms that occur in both the file and the K-UIH. The number of distinct terms in the K-UIH is denoted by m, and the number of distinct terms in the file is denoted by n. For each matching term t_i, we compute its score from three aspects: the deepest level of a node to which the term belongs, D_{t_i}; the length of the term (i.e., how many words are in the term), L_{t_i}; and the frequency of the term, F_{t_i}. The first aspect is related to the K-UIH structure. Terms in more specific interests are harder to match, and the deepest level (depth) at which the term matches indicates significance. If a term in a node also appears in several of its ancestors, we use the level (depth)