Selective Dissemination


The expected entropy loss is given by

$$e - \left(e_p \, P(y) + e_{\bar{p}} \, P(\bar{y})\right)$$

and the entropy loss ratio for a tag is given by

$$E = \frac{e - \left(e_p \, P(y) + e_{\bar{p}} \, P(\bar{y})\right)}{e}$$

E takes values between 0 and 1. Higher values of E indicate more discriminatory tags. All tags whose entropy loss ratio exceeds a predefined threshold are chosen as features for the SVM.
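The tag selection step can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in this section: the classification is binary (a profile either belongs to a category or does not), entropies are measured in bits, and the names `binary_entropy`, `entropy_loss_ratio` and `select_features` are hypothetical helpers introduced only for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def entropy_loss_ratio(samples, tag):
    """samples: list of (set_of_tags, in_category) pairs (assumed layout).

    Returns E = (prior entropy - expected posterior entropy) / prior entropy.
    """
    n = len(samples)
    prior = binary_entropy(sum(1 for _, c in samples if c) / n)
    if prior == 0.0:
        return 0.0  # only one class present: no tag can discriminate
    with_tag = [c for tags, c in samples if tag in tags]
    without_tag = [c for tags, c in samples if tag not in tags]
    expected_posterior = 0.0
    for group in (with_tag, without_tag):
        if group:
            # Weight each posterior entropy by the probability of its event.
            weight = len(group) / n
            expected_posterior += weight * binary_entropy(sum(group) / len(group))
    return (prior - expected_posterior) / prior

def select_features(samples, tags, threshold):
    """Keep the tags whose entropy loss ratio exceeds the threshold."""
    return [t for t in tags if entropy_loss_ratio(samples, t) > threshold]

# Toy data: two profiles in the category, two outside it.
samples = [
    ({"goal", "team"}, True),
    ({"goal", "score"}, True),
    ({"author", "team"}, False),
    ({"author", "isbn"}, False),
]
selected = select_features(samples, ["goal", "team", "author"], threshold=0.5)
```

In this toy data the tag `team` appears equally often inside and outside the category, so it carries no information and is dropped, while `goal` and `author` separate the two groups perfectly and are kept.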

5.9.1 SVM for User Model Construction

SVM is a machine learning approach to classification that is based on structural risk minimization [20]. The decision surface is chosen so as to minimize the test error on unseen samples. The binary SVM can be extended to support multiclass classification using the one-against-one approach, in which k(k−1)/2 SVMs are used, where k is the number of classes. Each SVM is trained on data from two different classes. A voting vector with one dimension per class is used for classification: each SVM casts one vote, so there are as many votes as SVMs, and the class with the maximum number of votes is the result. The result of applying the SVM is the user model. The user model has two functions. First, it classifies the profiles into user interest categories. Second, the same model can assign an incoming XML document to one of the prespecified user interest categories.
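The one-against-one voting scheme described above can be sketched as follows. To keep the example dependency-free, `train_pairwise` is a hypothetical stand-in that uses a nearest-class-mean rule where a real system would train a binary SVM; only the pairing and voting logic is the point of the sketch.

```python
from itertools import combinations

def train_pairwise(data, a, b):
    """Stand-in for training a binary SVM on classes a and b.

    Returns a function that maps a feature vector to a or b
    (nearest class mean, NOT an actual SVM).
    """
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    centroid = {c: mean([x for x, y in data if y == c]) for c in (a, b)}

    def decide(x):
        dist = {c: sum((xi - mi) ** 2 for xi, mi in zip(x, centroid[c]))
                for c in (a, b)}
        return a if dist[a] <= dist[b] else b

    return decide

def train_one_against_one(data):
    """Build one binary decider per unordered pair of classes: k(k-1)/2 in total."""
    classes = sorted({y for _, y in data})
    deciders = {(a, b): train_pairwise(data, a, b)
                for a, b in combinations(classes, 2)}
    return classes, deciders

def predict(model, x):
    """Each pairwise decider casts one vote; the class with most votes wins."""
    classes, deciders = model
    votes = {c: 0 for c in classes}  # the voting vector, one entry per class
    for decide in deciders.values():
        votes[decide(x)] += 1
    return max(classes, key=lambda c: votes[c])  # ties go to the first class

# Toy data with k = 3 user interest categories.
data = [
    ((0, 0), "sports"), ((0, 1), "sports"),
    ((5, 5), "books"), ((5, 6), "books"),
    ((10, 0), "politics"), ((10, 1), "politics"),
]
model = train_one_against_one(data)
```

With three classes the model holds 3(3−1)/2 = 3 pairwise deciders, and every prediction tallies exactly three votes.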

5.10 Selective Dissemination

Selective dissemination is the task of delivering each incoming document to the users to whom it is most relevant, as judged by their profiles.

The first step in this task is determining the user interest category of an incoming XML document. Next, the similarity between the incoming XML document and each user profile belonging to the same user interest category is determined. A high similarity value indicates that the document is relevant to the corresponding user.

The similarity between the XML document and the user profile is determined by modeling the XML document as a directed tree $G = (V_g, E_g)$. Each node in $V_g$ corresponds to an XML element, and $E_g$ is a set of edges which defines the relationships between the nodes in $V_g$. A vertex in $V_g$ is said to be at level $lev_i$ if it is at a distance of $lev_i$ from the root. Let $\mathit{level}_i(D_x)$ represent the set of all tags of an XML document $D_x$ at level $lev_i$. Let $\mathit{user\_p}_j$ represent the $j$th user profile, with $\mathit{user\_p}_j = \{tag_1, tag_2, \ldots, tag_l\}$, where $l$ is the total number of tags in the $j$th user profile. The similarity between a user profile $\mathit{user\_p}_j$ and the incoming XML document $D_x$ is given by

$$S(D_x, \mathit{user\_p}_j) = \frac{\displaystyle\sum_{i=1}^{d} \frac{|\mathit{user\_p}_j \cap \mathit{level}_i(D_x)|}{i \cdot |\mathit{level}_i(D_x)|}}{|\mathit{user\_p}_j \cup D_x|} \qquad (5.2)$$

where $d$ is the depth of the XML document tree. The following observations can be made about the similarity metric.

achieved.

• $S(D_x, \mathit{user\_p}_j) = 0$ iff there are no common tags between the XML document and the user profile.

• $S(D_x, \mathit{user\_p}_j) = S(\mathit{user\_p}_j, D_x)$.

• Let $D_{x_1}$ and $D_{x_2}$ be two XML documents such that $|\mathit{user\_p}_j \cap D_{x_1}| > |\mathit{user\_p}_j \cap D_{x_2}|$, i.e., the number of tags shared between $\mathit{user\_p}_j$ and $D_{x_1}$ is greater than the number of tags shared between $\mathit{user\_p}_j$ and $D_{x_2}$. However, this does not imply that $S(D_{x_1}, \mathit{user\_p}_j) > S(D_{x_2}, \mathit{user\_p}_j)$; i.e., the number of tags shared between the incoming XML document and the user profile is not the only factor which decides their similarity.
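The similarity metric and the last observation can be checked with a short Python sketch. The representation is an assumption made for illustration: a document is a list of tag sets, with the first entry taken as level 1, and the function name `similarity` and the sample tags are hypothetical.

```python
def similarity(levels, profile):
    """Similarity between a document and a profile, following Equation 5.2.

    levels:  list of sets of tags; levels[0] is treated as level 1 (assumption).
    profile: set of tags in the user profile.
    """
    profile = set(profile)
    doc_tags = set().union(*levels)
    union = profile | doc_tags
    if not union:
        return 0.0
    total = 0.0
    for i, level_tags in enumerate(levels, start=1):
        if level_tags:
            # Shared tags at level i, normalised by level size and weighted by 1/i.
            total += len(profile & level_tags) / (i * len(level_tags))
    return total / len(union)

profile = {"book", "title", "author", "price"}

# doc1 shares three tags with the profile, but its level-1 tag does not match.
doc1 = [{"catalog"}, {"title", "author", "price"}]
# doc2 shares only two tags, but one of them is at level 1.
doc2 = [{"book"}, {"title", "year"}]

s1 = similarity(doc1, profile)  # (0/1 + 3/(2*3)) / 5 = 0.1
s2 = similarity(doc2, profile)  # (1/1 + 1/(2*2)) / 5 = 0.25
```

Here `doc1` shares more tags with the profile than `doc2`, yet it scores lower, because its mismatch occurs at the level closest to the root, where the weight 1/i is largest.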

Definition: The similarity between the user profile and the XML document depends upon two factors:

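• The level in the document tree where a dissimilarity occurs. A dissimilarity is said to occur at $\mathit{level}_i$ iff $|\mathit{level}_i(D_x) \setminus \mathit{user\_p}_j| \geq 1$, i.e., at least one tag at that level is absent from the user profile.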

• The Degree of Congruity (dc) in dissimilar levels also affects the similarity. The degree of congruity between the user profile and a level in an XML tree $D_{x_1}$ is given by

$$dc(\mathit{user\_p}_j, \mathit{level}_m(D_{x_1})) = \frac{|\mathit{user\_p}_j \cap \mathit{level}_m(D_{x_1})|}{|\mathit{level}_m(D_{x_1})|}$$

Proof: Let $\mathit{level}_m(D_{x_1})$ and $\mathit{level}_n(D_{x_2})$ represent the tags in the $m$th and $n$th levels of two XML documents $D_{x_1}$ and $D_{x_2}$ respectively. Assume that the $m$th level in $D_{x_1}$ and the $n$th level in $D_{x_2}$ are the only levels which have dissimilarity with the user profile $\mathit{user\_p}_j$.

Case 1: Let $dc(\mathit{user\_p}_j, \mathit{level}_m(D_{x_1})) = dc(\mathit{user\_p}_j, \mathit{level}_n(D_{x_2}))$. From Equation 5.2, it is clear that the similarity then depends upon the values of $m$ and $n$. Because the contribution of level $i$ is weighted by $1/i$, $S(D_{x_1}, \mathit{user\_p}_j) > S(D_{x_2}, \mathit{user\_p}_j)$ iff $m > n$. The similarity between the user profile and the XML document thus depends upon the depth at which the dissimilarity occurs. A dissimilarity near the root results in very low similarity values, whereas a dissimilarity near the leaf nodes can still result in high similarity values.

Case 2: Assume $m = n$, i.e., the dissimilarity in the two documents occurs at the same level. From Equation 5.2, it can be inferred that the similarity $S$ now depends upon the degree of congruity, $dc$. That is, $S(D_{x_1}, \mathit{user\_p}_j) > S(D_{x_2}, \mathit{user\_p}_j)$ iff $dc(\mathit{user\_p}_j, \mathit{level}_m(D_{x_1})) > dc(\mathit{user\_p}_j, \mathit{level}_n(D_{x_2}))$. Thus, the higher the value of $dc$, the higher the similarity, and vice versa.
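Both cases can be read off a short calculation. It is a sketch under simplifying assumptions that the text does not state: the two documents have the same depth $d$ and the same union size $U = |\mathit{user\_p}_j \cup D_x|$, and every level other than the dissimilar one is fully covered by the profile, so that its degree of congruity is 1. The shorthands $dc_1$ and $dc_2$ denote the degrees of congruity at level $m$ of $D_{x_1}$ and level $n$ of $D_{x_2}$.

```latex
\begin{align*}
S(D_{x_1}, \mathit{user\_p}_j) &= \frac{1}{U}\left(\sum_{i=1}^{d}\frac{1}{i} - \frac{1 - dc_1}{m}\right), \\
S(D_{x_2}, \mathit{user\_p}_j) &= \frac{1}{U}\left(\sum_{i=1}^{d}\frac{1}{i} - \frac{1 - dc_2}{n}\right), \\
S(D_{x_1}, \mathit{user\_p}_j) - S(D_{x_2}, \mathit{user\_p}_j)
  &= \frac{1}{U}\left(\frac{1 - dc_2}{n} - \frac{1 - dc_1}{m}\right).
\end{align*}
% Case 1: dc_1 = dc_2 = c < 1
%   difference = \frac{1 - c}{U}\left(\frac{1}{n} - \frac{1}{m}\right) > 0 \iff m > n
% Case 2: m = n
%   difference = \frac{dc_1 - dc_2}{m\,U} > 0 \iff dc_1 > dc_2
```

Under these assumptions the difference is positive in Case 1 exactly when the dissimilarity in $D_{x_1}$ lies farther from the root, and in Case 2 exactly when $D_{x_1}$ has the larger degree of congruity.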

The architecture for selective dissemination of XML documents based on the user model learnt using SAMGA and SVM is given in Figure 5.12. The user model is used for two purposes. First, it classifies the user profiles among the various user interest categories. Each profile is then stored under the corresponding category. Second, for streaming XML documents, it determines the user interest category. The similarity metric of Equation 5.2 is used to find the similarity between the user profiles and the XML document. A high similarity value indicates that the corresponding user is interested in the document. The document is disseminated to the top k users whose profiles have the greatest similarity with the input XML document.

Fig. 5.12 Architecture of the Selective Dissemination System
[Figure blocks: User Profile; Streaming XML Documents; User Model; Determine User Interest Category; Store Profile Under the Category; Determine Similarity with Profiles Stored Under the Category; Document Dissemination]
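The dissemination step can be sketched in Python as follows. The sketch assumes the user model has already assigned the incoming document to a category (passed in as `doc_category`), that profiles are stored in a dictionary keyed by category, and that a document is a list of tag sets with the first entry as level 1. All names and sample data are hypothetical.

```python
import heapq

def similarity(levels, profile):
    """Similarity of Equation 5.2; levels[0] is treated as level 1 (assumption)."""
    profile = set(profile)
    union = profile | set().union(*levels)
    if not union:
        return 0.0
    total = sum(len(profile & tags) / (i * len(tags))
                for i, tags in enumerate(levels, start=1) if tags)
    return total / len(union)

def disseminate(doc_levels, doc_category, profiles_by_category, k):
    """Return the ids of the top-k users within the document's category."""
    # Only profiles stored under the document's category are compared.
    candidates = profiles_by_category.get(doc_category, {})
    scored = [(similarity(doc_levels, tags), user)
              for user, tags in candidates.items()]
    return [user for _, user in heapq.nlargest(k, scored)]

profiles_by_category = {
    "books": {
        "alice": {"book", "title", "author"},
        "bob": {"book", "price"},
        "carol": {"journal", "issue"},
    },
    "sports": {
        "dave": {"match", "score"},
    },
}

doc_levels = [{"book"}, {"title", "author"}]
recipients = disseminate(doc_levels, "books", profiles_by_category, k=2)
```

Restricting the comparison to one category is what keeps the per-document cost proportional to the number of profiles in that category and not to the total number of stored profiles.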

5.11 Performance Analysis

The SAMGA used in our approach takes as input a small number of user queries and the documents adjudged as relevant by the user. The GA explores all possible tag combinations from the tag pool and tries to find the tag combination which satisfies the maximum number of queries. This tag combination forms the profile of a user. Even for small XML document collections the tag pool is usually large. The time taken for the dissemination of the documents depends upon two factors: the number of stored user profiles and the number of user interest categories. The user interest categories are divisions such as sports, books, politics, religion, etc., to which the user profiles are assigned. It is important to have a sufficient number of such categories. If the number of user interest categories is small, a large number of profiles fall under a single category and the time to find a matching profile increases. Thus, maintaining an optimal number of user interest categories results in good performance. The time taken for selective dissemination of XML documents is shown in Figure 5.13.

Fig. 5.13 Time for Selective Dissemination
[Plot: x-axis Number of Profiles × 1000; y-axis Filter Time (msecs); curves for UIC = 15 and UIC = 10]

The number of user interest categories utilized also determines the accuracy of the selective dissemination task. The accuracy of the selective dissemination system is the proportion of disseminated documents that are relevant to the user. The variation of the accuracy with the number of User Interest Categories (UIC) is shown in Figure 5.14.
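The accuracy measure defined above amounts to a one-line computation. The following sketch uses a hypothetical function name and document ids, and returns 0.0 when nothing was disseminated, a convention chosen here because the text does not define that case.

```python
def dissemination_accuracy(disseminated, relevant):
    """Proportion of disseminated documents that are relevant to the user."""
    disseminated = set(disseminated)
    if not disseminated:
        return 0.0  # convention chosen for this sketch, not taken from the text
    return len(disseminated & set(relevant)) / len(disseminated)

# Four documents were sent to the user; two of them were relevant.
acc = dissemination_accuracy(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
```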

In order to validate the accuracy and efficiency of the proposed technique, we compare it with the Multi-level Indexing technique proposed in [19]. From Figure 5.15 it can be observed that the accuracy of selective dissemination increases with the number of profiles accessed. If an exhaustive search over the user profiles is performed, both the accuracy and the time for dissemination increase. Since selective dissemination systems should be able to handle a large number of user profiles, the number of profiles accessed for an incoming document must be limited. From Figure 5.15, the accuracy of the two techniques is the same when the percentage of profiles accessed is high. When the percentage of profiles accessed is in the range of 30-40%, the proposed technique outperforms the Multi-level Indexing strategy in [19]. Thus the application of SAMGA and SVM helps in accurate and fast selective dissemination of XML documents.

Too many categories segment the user interests and result in low accuracy. A small number of categories leads to superior accuracy, but the time for selective dissemination increases. The relationship among the number of profiles, the number of user interest categories, accuracy and time for selective dissemination is given in Table 5.10, which includes the following metrics: (i) Number of User Profiles (NUP), (ii) Number of User Interest Categories (NUIC), (iii) Accuracy (Acc), and (iv) Time for Selective Dissemination (TSD). From Table 5.10, it can be observed that the accuracy depends more on the number of user interest categories than on the number of profiles. Thus the system is capable of handling a large number of users. An optimal number of user interest categories should serve as a tradeoff between accuracy and efficiency and can result in good performance of the selective dissemination task.

Fig. 5.14 Accuracy of Selective Dissemination

Fig. 5.15 Accuracy versus Number of Profile Access
[Plot: x-axis Profile Access (%); y-axis Accuracy; curves for Using SAMGA and SVM and for Multi Level Indexing]

Table 5.10

NUP (×1000)   NUIC   Acc    TSD (msec)
5             10     0.92   162
5             15     0.87   143
10            20     0.89   201
10            25     0.81   189
20            30     0.74   331
25            15     0.87   547

From the document: K.R. Venugopal, K.G. Srinivasa and L.M. Patnaik, Soft Computing for Data Mining Applications (pages 122-127)