2.5 Data Structuration

This step groups the unstructured requests of a log file by user, user session, page view, visit, and episode. At the end of this step, the log file will be a set of transactions, where by transaction we refer to a user session, a visit or an episode.

2.5.1 User Identification

In most cases, the log file provides only the computer address (name or IP) and the user agent (for ECLF log files). For Web sites requiring user registration, the log file also contains the user login (as the third field of a log entry). In this case, we use this information to identify the user. When the user login is not available, we consider (if necessary) each IP address as a user, although we know that an IP address can be shared by several users.

2.5.2 User Session Identification

Identifying the user sessions from the log file is not a simple task due to proxy servers, dynamic addresses, and cases where multiple users access the same computer (at a library, Internet café, etc.) or one user uses multiple browsers or computers. However, a number of techniques can provide additional information. To identify a user that already visited the Web site, the most common techniques are cookies, user registration or modified browsers. All these techniques have drawbacks, especially concerning the user privacy.

If the user login is available, we combine the user login field with the pair (Host, User Agent) to separate the user sessions. We choose this solution because a registered user might use different computers or browsers when exploring the Web site and the inclusion of the user agent allows us to better distinguish between users within a common host.

Moreover, in an experiment conducted by [BMNS02], the authors report that the combination (Host, User Agent) correctly identifies the user in 92.02% of the cases and only a small number of these combinations (1.32%) are used by more than three users (because of proxies). Therefore, we can affirm that this combination provides a good identification criterion for the user session.
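The grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the `Request` record and its field names are hypothetical, the log is assumed to be already parsed, and a missing login is represented by `None`.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Request(NamedTuple):
    """Hypothetical parsed log entry; field names are illustrative."""
    host: str              # computer name or IP address
    login: Optional[str]   # None when the login field is empty
    user_agent: str
    time: int              # request time in seconds
    url: str


def session_key(req):
    """Login (when available) combined with the (Host, User Agent) pair."""
    return (req.login, req.host, req.user_agent)


def group_by_user_session(requests):
    """Group requests by session key and order each group by time."""
    sessions = defaultdict(list)
    for req in requests:
        sessions[session_key(req)].append(req)
    for reqs in sessions.values():
        reqs.sort(key=lambda r: r.time)
    return dict(sessions)
```

With this key, two requests from the same host but with different user agents fall into different user sessions, and a registered user appears once per (Host, User Agent) pair used.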

2.5.3 Page View Identification

Using the site map M (represented with XGMML [PK01]), the requests are grouped by page views with the following algorithm:

• When the request for the page view p_i is in the log file, we remove the log entries corresponding to the embedded resources from p_i, and we keep only the request for p_i.

• When the request for p_i is absent (due to the browser or proxy cache), but some entries for its corresponding resources are present and these entries have p_i in the referrer field, we replace the entries corresponding to the resources with a request for p_i and we set the time of this request to t_i = min{time(l_i)}, where l_i is the corresponding log entry for the resource r_i.
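The two cases above can be sketched as follows for the entries of a single user. The sketch makes several assumptions that are not part of the original text: the site map is reduced to a dictionary from page URL to the set of its embedded resource URLs (a stand-in for the XGMML representation), the `Entry` record is hypothetical, a reconstructed request gets an empty referrer, and each cached page is reconstructed at most once per user.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical parsed log entry; field names are illustrative."""
    time: int       # request time in seconds
    url: str
    referrer: str


def identify_page_views(entries, site_map):
    """site_map: dict mapping a page URL to the set of its resource URLs."""
    resource_urls = set().union(*site_map.values())
    requested_pages = {e.url for e in entries if e.url in site_map}
    page_views = []
    cached = {}  # page URL -> earliest time among its resource entries
    for e in entries:
        if e.url in site_map:
            # First case: the page request itself is kept.
            page_views.append(e)
        elif e.url in resource_urls:
            page = e.referrer
            if (page in site_map and e.url in site_map[page]
                    and page not in requested_pages):
                # Second case: page request missing, resources present.
                cached[page] = min(e.time, cached.get(page, e.time))
            # Otherwise the resource entry is simply dropped.
        else:
            # URL unknown to the site map: kept unchanged.
            page_views.append(e)
    for page, t in cached.items():
        page_views.append(Entry(t, page, ""))
    return sorted(page_views, key=lambda e: e.time)
```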

If the site map is not available, we identify the page views by using the time of the request. For requests made at the same time (i.e. the same second), we keep only the first request (as ordered in the log file) and discard the following ones.
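A minimal sketch of this time-based fallback is given below. It assumes that the entries belong to a single user, that they are given in log order as (time in seconds, URL) pairs, and that the function name is purely illustrative.

```python
def dedupe_same_second(entries):
    """Keep only the first entry of each run of entries sharing a second.

    entries: list of (time_in_seconds, url) pairs in log order.
    """
    page_views = []
    last_time = None
    for time, url in entries:
        if time != last_time:
            page_views.append((time, url))
            last_time = time
    return page_views
```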

Finally, a third solution consists of using a statistical or data mining (DM) approach to identify Web pages that are usually requested together within a short period of time. One solution would be to use sequential patterns with high confidence obtained from the user sessions or visits.

After the page view identification, the log file will contain, normally, only one request for each user action.

2.5.4 Visit Identification

So far, we have obtained a sequence of page views for each user. This represents the user’s clickstream sequence on the Web site (i.e. the user session) during a certain period. Several heuristics can be used to split the user session into visits:

• H_{Page} uses a threshold ∆t for measuring the time gaps between consecutive requests. A new visit begins each time a gap exceeding ∆t occurs between two page views. This is the most common method in WUM and it is also the one we adopted in our preprocessing methodology.

• H_{Visit} uses a time threshold for the entire visit [CMS99, FSS00]. Once the visit duration exceeds ∆t, a new visit begins. However, depending on the Web site pages or the users, it may take more time than ∆t (usually 30 minutes) to visit the Web site. In this case, using H_{Visit}, the users’ visits will be cut and the page views will be separated in different visits.

• H_{Ref} uses the history of the visit and the referrer of the current page view. If there is no link from a previously requested page to the current one, a new visit begins. This heuristic needs the Web site map at the time of the visit, because Web sites are dynamic and their pages and links are added and removed all the time.

• MF (which stands for the “Maximal Forward” algorithm [CPY98]) ends a visit when a backward reference occurs. With the MF algorithm, backward references are dropped. For example, a user who viewed the pages A, B, C, B, D made two visits: A, B, C and A, B, D. This method has a drawback because, for some classes of applications, predicting even this kind of “back” reference (e.g. C to B) is important. Because the session is split after the user makes a backward reference from C to B, the information about the traversal of the link C → B is lost at the visit level.
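To make the last heuristic concrete, the following sketch splits a sequence of page views into maximal forward references. It is an illustrative reading of the description above, not the implementation of [CPY98]; the function name is an assumption.

```python
def maximal_forward_visits(pages):
    """Split a page sequence into visits at each backward reference."""
    visits = []
    path = []               # current forward path
    moving_forward = False
    for page in pages:
        if page in path:
            # Backward reference: emit the path if we were moving forward,
            # then go back to the revisited page.
            if moving_forward:
                visits.append(list(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        visits.append(list(path))
    return visits


print(maximal_forward_visits(["A", "B", "C", "B", "D"]))
# prints [['A', 'B', 'C'], ['A', 'B', 'D']]
```

The backward step from C to B never appears in the output, which is the loss of information mentioned above.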

In our methodology, we use the H_{Page} approach with a parameterizable threshold, ∆t. In [BMSW01], the authors evaluated the performance of the first three heuristics presented. According to their experiment, the best results for visit identification are obtained either with the H_{Visit} or with the H_{Page} approach. The differences between the two approaches were minor when evaluated in their framework, while the quality of the H_{Ref} approach was low.
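The difference between the two time-based heuristics can be illustrated with a short sketch. It assumes that the page views of one user session are reduced to a sorted list of request times in seconds; the function names and the sample times are invented for the illustration.

```python
def split_by_page_gap(times, delta_t):
    """H_Page: a new visit starts when the gap between two consecutive
    page views exceeds delta_t."""
    visits = []
    for t in times:
        if visits and t - visits[-1][-1] <= delta_t:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits


def split_by_visit_duration(times, delta_t):
    """H_Visit: a new visit starts when the time elapsed since the first
    page view of the current visit exceeds delta_t."""
    visits = []
    for t in times:
        if visits and t - visits[-1][0] <= delta_t:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits


times = [0, 600, 1500, 2400, 6000]
print(split_by_page_gap(times, 1800))        # [[0, 600, 1500, 2400], [6000]]
print(split_by_visit_duration(times, 1800))  # [[0, 600, 1500], [2400], [6000]]
```

With ∆t set to 30 minutes (1800 seconds), the duration-based variant cuts the first visit after 1500 seconds although the user is still browsing with gaps of at most 15 minutes, while the gap-based variant keeps these page views together.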

2.5.5 Episode Identification

Identifying episodes is a complex problem as it requires the semantic definition of the entire Web site (i.e. the Web site ontology) and a distance measure on the semantic definitions of the Web pages. The semantic definition for a Web page represents all the semantic topics associated with the Web page. The semantic topic is a label characterizing the content of a Web page. We can have several semantic topics for one page, as in Figure 2.3, where we represented some of the pages of www-sop.inria.fr.

Figure 2.3: www-sop.inria.fr’s Semantic Topics Hierarchy

For identifying episodes, we propose the use of a hierarchy of semantic topics (see Figure 2.4) and of a semantic distance between semantic topics (distance defined on this hierarchy). We propose to identify episodes by calculating the semantic distance between the semantic topics of any two consecutive pages or between the current page and the group of previously visited pages. The distance may be a simple link distance counting the number of edges to be traversed from one semantic topic to another. When this distance exceeds a predefined threshold, a new episode begins.
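The proposal can be sketched as follows for the simplest case, the link distance between two consecutive pages. Several points are assumptions made only for the illustration: the hierarchy is a tree stored as a child-to-parent dictionary, a page with several topics is compared through its closest pair of topics, and the function names are hypothetical.

```python
def path_to_root(topic, parent):
    """Topics from `topic` up to the root of the hierarchy."""
    path = [topic]
    while topic in parent:
        topic = parent[topic]
        path.append(topic)
    return path


def link_distance(a, b, parent):
    """Number of edges between topics a and b in the hierarchy."""
    depth_from_a = {t: d for d, t in enumerate(path_to_root(a, parent))}
    for d, t in enumerate(path_to_root(b, parent)):
        if t in depth_from_a:
            return d + depth_from_a[t]
    raise ValueError("topics are not in the same hierarchy")


def split_into_episodes(pages, topics_of, parent, threshold):
    """Start a new episode when the distance between two consecutive
    pages exceeds the threshold."""
    episodes = []
    for page in pages:
        if episodes:
            prev = episodes[-1][-1]
            dist = min(link_distance(a, b, parent)
                       for a in topics_of[prev] for b in topics_of[page])
            if dist <= threshold:
                episodes[-1].append(page)
                continue
        episodes.append([page])
    return episodes
```

Comparing the current page with the group of previously visited pages, as also suggested above, would only change how `dist` is computed.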

Cooley used different heuristics [Coo00] to determine the type of a page (i.e. syntactic/semantic topic in our case). These heuristics are generally based on the average time the user spends on the page, on the page size, or both; thus, they are highly dependent on the usage of the Web site and not on its content. Moreover, the number of types proposed was only five, which is very limited compared with the number of topics.

Automatically finding semantic topics for a Web page is still an open problem in Web Mining. One possible solution would be to use XML to annotate the Web pages with semantic topics when they are created or modified. Then, these topics must be placed in a hierarchy on which a distance measure can be applied.

Figure 2.4: Web Pages of www-sop.inria.fr

Recent works in the Semantic Web Mining domain [BHS02, OBHG03] discuss these issues in more detail.

Finally, techniques from Web Content and Text Mining can be used to automatically discover semantic hierarchies for the pages belonging to a Web site [MS01, MBGMF04, CHS04].
