USER IDENTIFICATION - DATA MINING THE WEB

Here, the goal is to identify each distinct user. Ideally, this would be accomplished easily if the user provided his or her registration information, such as user name and password, each time the Web site was accessed. Unfortunately, the free-form structure of the Internet means that most user accesses to most Web sites are done anonymously, so that registration information is not available. Another way of describing this situation is to say that the Internet is essentially stateless, meaning that each request for a web page is treated as an isolated event, unrelated to all other requests for the site’s web pages. User identification is one way of introducing state into this stateless system.

Another means of identifying users is the use of cookies. A cookie is an arbitrary text string, usually set by a web server, containing whatever information the server wishes to place. In this way, cookies can be used to connect current web page accesses to previous accesses. In addition to tracking user access, the most common uses for cookies are:

• To spare returning registered users from having to sign in again each time they access the site

• To personalize the user’s experience: for example, with individualized recommendations

• To maintain the user’s shopping cart for e-commerce sites

However, many users are concerned that the abuse of cookie information can lead to violations of privacy. Further, cookies can be blocked or cleared by the user. Therefore, the web usage miner needs recourse to other strategies for identifying users.
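As a rough illustration of how a cookie can connect page accesses, the sketch below parses raw Cookie headers with Python's standard `http.cookies` module and groups requests by an identifier. The cookie name `visitor_id` and the sample requests are assumptions made for this example; they do not come from any real site.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

# Hypothetical requests: (requested page, raw Cookie header sent by the browser).
# The cookie name "visitor_id" is an assumption made for this illustration.
requests = [
    ("A.html", "visitor_id=abc123; theme=dark"),
    ("B.html", "visitor_id=abc123; theme=dark"),
    ("A.html", "visitor_id=xyz789"),
    ("C.html", ""),  # cookies blocked or cleared: no identifier available
]

def visitor_id(cookie_header):
    """Return the visitor_id cookie value, or None if it is absent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("visitor_id")
    return morsel.value if morsel is not None else None

# Group requested pages by the identifier carried in the cookie.
pages_by_visitor = defaultdict(list)
for page, header in requests:
    pages_by_visitor[visitor_id(header)].append(page)

for visitor, pages in pages_by_visitor.items():
    print(visitor, pages)
```

Requests without the cookie all fall under `None`, which is exactly the case where the miner needs another strategy.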

The remote host field, or IP address field, may in principle be used to identify users. However, the widespread use of proxy servers, corporate firewalls, and local caches renders problematic the use of the IP address as a substitute for user identification. For example, several users may be accessing the same site, using a proxy server, which will provide the web server with the same IP address for each user. To provide an example of how sparse the user name field is for a typical Web site, we show a table of the most common values for this field in the CCSU web log data, given in Table 7.3. The server name has been changed, as have the user names provided here.

Note that over 99.5% of the user names are blank, taking the “-” value in the web log entry. Since users by and large do not provide their own identification, we should seek alternative methods to identify them.

TABLE 7.3 Most Common Values for the User Name Field, CCSU Data

Value                     Proportion   Count
-                         0.9955034    192,833
CCSU Server\smith         0.001115     216
CCSU Server\jones         0.000780     151
CCSU Server\akhbar        0.000614     119
CCSU Server\ivanov        0.000361     70
CCSU Server\chang         0.000217     42
CCSU Server\feliciano     0.000186     36
CCSU Server\chagnon       0.000181     35
CCSU Server\johnson       0.000176     34
CCSU Server\washington    0.000134     26
CCSU Server\rivera        0.000129     25
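A tally like the one in Table 7.3 can be sketched with `collections.Counter`. The list of field values below is a small hypothetical sample, not the CCSU data, so the proportions it prints are illustrative only.

```python
from collections import Counter

# Hypothetical user name field values, one per web log entry.
# "-" is the value recorded when no user name was supplied.
user_names = [
    "-", "-", "-", "CCSU Server\\smith", "-",
    "-", "CCSU Server\\jones", "-", "-", "-",
]

counts = Counter(user_names)
total = len(user_names)

# List values from most to least common with proportion and count.
for value, count in counts.most_common():
    print(f"{value:<22} {count / total:.4f} {count}")

# Share of entries with no user name at all.
blank_share = counts["-"] / total
```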

Next, consider Table 7.4, containing an excerpt from the fictional web log for an imaginary Web site. Note that all the IP addresses are the same, which would at first glance seem to indicate that all the entries are from the same user. However, such a conclusion would be mistaken. We shall use the following heuristic, which seems to be a reasonable assumption: If the agent field differs for two web log entries, the requests are from two different users. Although this assumption ignores users who access the same Web site with two different browsers on the same machine, this sort of behavior is relatively rare.

Consider the sample web log file for an imaginary Web site in Table 7.4. Applying this heuristic to the entries in the table, we can discern that there are at least two users represented here, one using Windows NT and MS Internet Explorer, the other using Linux and Firefox. Based on this, we can postulate the following paths through the Web site taken by each user:

• User 1: A → B → E → K → I → O → E → L

• User 2: A → C → G → M → H → N
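The agent heuristic can be sketched in a few lines. The entries below are transcribed from Table 7.4, with page names shortened to single letters and "-" marking an empty referrer; the tuple layout and variable names are choices made for this example.

```python
from collections import defaultdict

MSIE = "Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)"
FIREFOX = "Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)"
IP = "987.654.32.1"

# Entries from Table 7.4 as (ip, time, page, referrer, agent).
log = [
    (IP, "00:00:02", "A", "-", MSIE),
    (IP, "00:00:05", "B", "A", MSIE),
    (IP, "00:00:06", "A", "-", FIREFOX),
    (IP, "00:00:10", "E", "B", MSIE),
    (IP, "00:00:17", "K", "E", MSIE),
    (IP, "00:00:20", "C", "A", FIREFOX),
    (IP, "00:00:27", "I", "-", MSIE),
    (IP, "00:00:36", "G", "C", FIREFOX),
    (IP, "00:00:49", "O", "I", MSIE),
    (IP, "00:00:57", "M", "G", FIREFOX),
    (IP, "00:03:15", "H", "-", FIREFOX),
    (IP, "00:03:20", "N", "H", FIREFOX),
    (IP, "00:31:27", "E", "K", MSIE),
    (IP, "00:31:34", "L", "E", MSIE),
]

# Treat each distinct (IP address, agent) pair as one user and
# collect that user's pages in time order.
paths = defaultdict(list)
for ip, time, page, referrer, agent in sorted(log, key=lambda e: (e[0], e[1])):
    paths[(ip, agent)].append(page)

for (ip, agent), pages in paths.items():
    print(agent, " -> ".join(pages))
```

The zero-padded time strings sort correctly as text, which is why no time parsing is needed in this small sketch.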

However, do you see a problem with these reconstructions? If we apply the information available from the referrer field, along with the Web site topology, we can uncover the highly likely result that “user 1” here is actually two different users. Why is this?

Follow the referrer field along user 1’s path through the Web site. We see that access to B.html has been referred from A.html, access to E.html referred from B.html, and access to K.html referred from E.html.

However, unexpectedly, there is no referrer shown for the page I.html request.

TABLE 7.4 Sample Web Log File for an Imaginary Web Site

IP Address     Time      Method                 Referrer  Agent
987.654.32.1   00:00:02  “GET A.html HTTP/1.1”  -         Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:05  “GET B.html HTTP/1.1”  A.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:06  “GET A.html HTTP/1.1”  -         Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:00:10  “GET E.html HTTP/1.1”  B.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:17  “GET K.html HTTP/1.1”  E.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:20  “GET C.html HTTP/1.1”  A.html    Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:00:27  “GET I.html HTTP/1.1”  -         Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:36  “GET G.html HTTP/1.1”  C.html    Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:00:49  “GET O.html HTTP/1.1”  I.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:00:57  “GET M.html HTTP/1.1”  G.html    Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:03:15  “GET H.html HTTP/1.1”  -         Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:03:20  “GET N.html HTTP/1.1”  H.html    Mozilla/5.0 (Linux 1.0, Firefox/0.9.3)
987.654.32.1   00:31:27  “GET E.html HTTP/1.1”  K.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)
987.654.32.1   00:31:34  “GET L.html HTTP/1.1”  E.html    Mozilla/4.0 (Windows NT 5.1, MSIE 6.0)

Source: Adapted from ref. 1.

[Figure 7.7 Topology of the imaginary Web site, showing links. Legend: navigation page, content page. (Adapted from ref. 1.)]

Also, consider the Web site topology shown in Figure 7.7. The arrows indicate link directionality. There is no direct link between K.html and I.html. Thus, it appears highly unlikely that the user who was traversing A → B → E → K then proceeded to I. It is more likely that this request for page I.html came from a third user, who accessed the page directly, probably by entering the URL directly into the browser using the same browser version and operating system. Further, note that the only way to access page O.html is from I.html. The referrer information supports the inference that this third user clicked from I.html to O.html. Thus, it appears that we have evidence in this web log file for the presence of three distinct users:

• User 1: A → B → E → K → E → L

• User 2: A → C → G → M → H → N

• User 3: I → O
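The reasoning that separates user 3 from user 1 can be sketched as follows. This is one possible rule, not a method prescribed by the text: a request with a referrer joins the most recent path that already contains the referring page, and a request without a referrer joins a path only if a known link leads from that path's last page. The link set is an assumption: it holds only the links confirmed by the referrer field in Table 7.4, not the full topology of Figure 7.7.

```python
# Requests attributed to "user 1" by the agent heuristic: (page, referrer).
# None means the referrer field was empty.
user1_requests = [
    ("A", None), ("B", "A"), ("E", "B"), ("K", "E"),
    ("I", None), ("O", "I"), ("E", "K"), ("L", "E"),
]

# ASSUMPTION: a partial link set, limited to links confirmed by the referrer
# field in Table 7.4. The full topology of Figure 7.7 is not reproduced here.
links = {("A", "B"), ("B", "E"), ("E", "K"), ("I", "O"), ("K", "E"), ("E", "L")}

def split_by_referrer(requests, links):
    """Split one agent's requests into paths that referrers and links can explain."""
    paths = []
    for page, referrer in requests:
        target = None
        # Search the most recently started paths first.
        for path in reversed(paths):
            if referrer is not None:
                reachable = referrer in path  # referring page was visited on this path
            else:
                reachable = (path[-1], page) in links  # a known link leads here
            if reachable:
                target = path
                break
        if target is None:
            target = []  # nothing explains this request: assume a new user
            paths.append(target)
        target.append(page)
    return paths

for number, path in enumerate(split_by_referrer(user1_requests, links), start=1):
    print(f"path {number}:", " -> ".join(path))
```

The same rule applied to user 2 would need the real topology, because the request for H.html also has an empty referrer and only the site's link structure can say whether M.html leads there.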

User Identification Procedure

In general, the following procedure could be used to identify users:

1. Sort the web log file by IP address and then by time stamp.

2. For each distinct IP address, identify each agent as belonging to a different user.

3. For each user identified in step 2, apply path information garnered from the referrer field and the site topology to determine whether this behavior is more likely the result of two or more users.

4. To identify each user, combine the user identification information from steps 1 to 3 with available cookie and registration information.
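Steps 1, 2, and part of step 4 might be sketched as below; step 3 is omitted because it depends on the site topology. The log entries are hypothetical, and letting a registered user name take precedence over the IP address and agent pair is an assumption about how the information could be combined, not a rule stated in the procedure.

```python
# Hypothetical log entries: (ip, time, agent, user_name); "-" means no user name.
entries = [
    ("10.0.0.2", "00:00:09", "AgentA", "-"),
    ("10.0.0.1", "00:00:05", "AgentB", "-"),
    ("10.0.0.1", "00:00:02", "AgentA", "-"),
    ("10.0.0.1", "00:00:07", "AgentA", "smith"),
    ("10.0.0.2", "00:00:01", "AgentA", "smith"),
]

# Step 1: sort by IP address, then by time stamp.
entries.sort(key=lambda e: (e[0], e[1]))

def user_key(entry):
    """Step 2, plus an assumed form of step 4: prefer a registered user name."""
    ip, _time, agent, user_name = entry
    if user_name != "-":
        return ("registered", user_name)
    return ("anonymous", ip, agent)

# Assign a numeric user ID to each distinct key, in order of first appearance.
user_ids = {}
labeled = []
for entry in entries:
    user_id = user_ids.setdefault(user_key(entry), len(user_ids) + 1)
    labeled.append((user_id, entry))

for user_id, entry in labeled:
    print(user_id, entry)
```

In this sketch the two "smith" entries share one user ID even though they come from different IP addresses, while anonymous entries are separated by IP address and agent.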
