Genetic Learning of Tags

[Figure: tree diagram with author, title and year nodes (numbered (6), (7), (8)) and leaf values "Land Use...", 1979, "Synchronization ...", "Vipin", "Ravi", 1980]

Fig. 5.1 Example XML Document Tree

For example, in the XML tree shown in Figure 5.1, inproceedings(1) satisfies the search term Vipin and the search term Vipin 1979 but not the term Vipin 1980.
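The notion of a node satisfying a search term can be illustrated with a minimal Python sketch. The XML snippet, the `proceedings` root element and the helper `satisfies` are assumptions made for illustration; the snippet is only loosely modelled on Figure 5.1 and is not the exact tree from the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, loosely modelled on Figure 5.1 (an assumption).
XML = """
<proceedings>
  <inproceedings>
    <author>Vipin</author>
    <title>Land Use</title>
    <year>1979</year>
  </inproceedings>
  <inproceedings>
    <author>Ravi</author>
    <title>Synchronization</title>
    <year>1980</year>
  </inproceedings>
</proceedings>
"""

def satisfies(node, search_term):
    """True if every keyword of the search term occurs in the subtree of node."""
    words = set()
    for text in node.itertext():
        words.update(text.lower().split())
    return all(keyword.lower() in words for keyword in search_term.split())

root = ET.fromstring(XML)
first = root.findall("inproceedings")[0]

print(satisfies(first, "Vipin"))       # True
print(satisfies(first, "Vipin 1979"))  # True
print(satisfies(first, "Vipin 1980"))  # False
```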

The nodes returned as results should also be semantically related, that is, they should appear in the same context. SAGAXSearch works in the following steps.

1. A representative training set is chosen to assist the genetic learning of tags.

2. The keyword queries and the relevant search results are collected from the user.

3. The GA retrieves the tag combinations that can answer the maximum number of training queries.

4. Separate indices are built for the frequently used and the occasionally used tag combinations.

5. The XML documents are searched in decreasing order of tag importance.

6. The search produces only semantically related results.
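Steps 4 and 5 can be pictured with a small Python sketch. The index layout (word to document ids), the function names `build_indices` and `search`, and the rule of consulting the occasional index only when the frequent one has no match are all assumptions; the text does not specify how the indices are organised.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def build_indices(documents, frequent_tags):
    """Build two word -> document-id indices, split by tag importance."""
    frequent, occasional = defaultdict(set), defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for element in ET.fromstring(doc).iter():
            target = frequent if element.tag in frequent_tags else occasional
            for word in (element.text or "").lower().split():
                target[word].add(doc_id)
    return frequent, occasional

def search(keyword, frequent, occasional):
    """Consult the frequent index first, then fall back to the occasional one."""
    key = keyword.lower()
    return sorted(frequent.get(key, set())) or sorted(occasional.get(key, set()))

# Hypothetical two-document collection.
docs = [
    "<a><author>Vipin</author><year>1979</year></a>",
    "<a><author>Ravi</author><year>1980</year></a>",
]
frequent_index, occasional_index = build_indices(docs, frequent_tags={"author"})
```

Here `search("Vipin", frequent_index, occasional_index)` is answered from the frequent index, while a query for `1980` falls through to the occasional index.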

5.4 Genetic Learning of Tags

XML documents include extensible tags for formatting the data. These tags are self-describing and thus represent the semantics of the contents associated with them. For example, consider the keyword query "Windows Location". Because the keywords are ambiguous, neither their context nor their exact meaning can be determined from the keywords alone.

In XML, however, self-describing tags such as <Operating Systems> or <Building Plan> make the context of the keywords precise. The combination of tags in an XML document also reveals more information about its contents. For example, the tag combination <author>, <age>, <date of birth> describes the personal details of an author, whereas the tags <author>, <books>, <publication year> concern the author's work rather than personal details. An XML document usually includes a large number of tags, and only a small number of these may be of interest to a user.

Hence, a user profile that stores only the tag combinations interesting to a user is more accurate. Using genetic algorithms to learn user profiles has two advantages. First, the tag combinations that interest a user can be extracted automatically from the search terms and the relevance feedback given by the user; tags that do not interest the user are omitted from the profile. Second, the context of the user's search terms can be determined and the profile constructed accordingly.

Consider an XML document collection with n documents. Each distinct tag in this collection is stored in a tag pool. Let T = {t_1, t_2, ..., t_m} be the tag pool with m tags, where t_i represents the i-th tag. For a typical document collection the tag pool is very large. The purpose of SAMGA is to select from the tag pool the tag combinations that are interesting to a user. For the system to learn the user's interests, the user first issues a set of search queries q = {k_1, k_2, ..., k_m}. The documents satisfying the search terms are retrieved as results, and the user marks the results that are relevant. This feedback is what the system uses to learn the user's interests. The fitness function used in the GA is given by

fitness = α ∗ ( ...    (5.1)

Fig. 5.2 Genetic Learning of Tags
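The tag pool T, the set of distinct tags in the collection, can be collected as in the following minimal Python sketch. The function name `build_tag_pool` and the two sample documents are hypothetical, and keeping the tags in first-seen order is a choice made here so that each tag has a stable position for its weight.

```python
import xml.etree.ElementTree as ET

def build_tag_pool(xml_documents):
    """Collect the distinct element tags of a collection, in first-seen order."""
    pool, seen = [], set()
    for doc in xml_documents:
        for element in ET.fromstring(doc).iter():
            if element.tag not in seen:
                seen.add(element.tag)
                pool.append(element.tag)
    return pool

# Hypothetical two-document collection.
docs = [
    "<article><author>A</author><title>T</title></article>",
    "<book><author>B</author><year>1980</year></book>",
]
tag_pool = build_tag_pool(docs)
print(tag_pool)  # ['article', 'author', 'title', 'book', 'year']
```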

Begin
    Initialize the population P_i by assigning random tag weights to j = {j_1, j_2, ..., j_l}.
    for gen = 1 : maximum generation limit, do
        (a) Order the tags by decreasing weight and select the top k tags.
            Let S_tag = {t_1, t_2, ..., t_k} represent the selected tags.
        (b) Evaluate the fitness using Equation 5.1.
        (c) For the population P_i, perform selection with stochastic universal sampling as the selection operator.
        (d) Perform discrete recombination on the selected individuals of the population P_i.
        (e) Perform mutation on the individuals of the population P_i.
    Next.
End
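The loop above can be sketched in Python as a real-coded GA over tag weights. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a hypothetical stand-in for Equation 5.1 (it simply counts how many selected tags belong to an assumed set of useful tags), and the tag list, population size, mutation rate and the retention of the best individual seen so far are illustrative choices.

```python
import random

def top_k_tags(weights, tags, k):
    """Step (a): order tags by decreasing weight and keep the top k."""
    order = sorted(range(len(tags)), key=lambda i: weights[i], reverse=True)
    return {tags[i] for i in order[:k]}

def sus_select(population, fitnesses, n, rng):
    """Step (c): stochastic universal sampling with n equally spaced pointers."""
    total = sum(fitnesses)
    if total <= 0:  # degenerate wheel: fall back to uniform choice
        return [rng.choice(population) for _ in range(n)]
    step = total / n
    start = rng.random() * step
    selected, cumulative, idx = [], fitnesses[0], 0
    for j in range(n):
        pointer = start + j * step
        while cumulative <= pointer and idx < len(population) - 1:
            idx += 1
            cumulative += fitnesses[idx]
        selected.append(population[idx])
    return selected

def discrete_recombination(a, b, rng):
    """Step (d): each gene is copied from one of the two parents at random."""
    return [rng.choice(pair) for pair in zip(a, b)]

def mutate(individual, rate, rng):
    """Step (e): replace a weight by a fresh random value with probability rate."""
    return [rng.random() if rng.random() < rate else w for w in individual]

def learn_tag_weights(tags, fitness, k=2, pop_size=20, generations=30, rate=0.1, seed=0):
    rng = random.Random(seed)
    score = lambda ind: fitness(top_k_tags(ind, tags, k))  # step (b)
    population = [[rng.random() for _ in tags] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        fitnesses = [score(ind) for ind in population]
        parents = sus_select(population, fitnesses, pop_size, rng)
        rng.shuffle(parents)
        population = [
            mutate(discrete_recombination(parents[i], parents[(i + 1) % pop_size], rng), rate, rng)
            for i in range(pop_size)
        ]
        best = max(population + [best], key=score)  # remember the best seen so far
    return top_k_tags(best, tags, k)

# Hypothetical tag pool and stand-in fitness (NOT Equation 5.1).
TAGS = ["author", "title", "year", "pages", "publisher"]
USEFUL = {"author", "title"}
toy_fitness = lambda selected: len(selected & USEFUL)

print(learn_tag_weights(TAGS, toy_fitness))
```

Stochastic universal sampling requires non-negative fitness values, which the stand-in satisfies; a real fitness based on user feedback would need the same property or a rescaling step.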

In Equation 5.1, N is the number of documents retrieved with a specific tag configuration and S_tag is the set of the top k tags with the highest tag weights. freq(i, S_tag) is the frequency of occurrence of the terms of the query q = {k_1, k_2, ..., k_m} within the tags in S_tag in the i-th retrieved document. The retrieved documents are ranked according to the frequency of occurrence of the terms. rank(i) denotes the rank of the i-th retrieved document, provided the document is also classified as relevant by the user. α is a parameter that expresses the degree of user preference for the accuracy of the search results or for the total number of documents retrieved. The architecture of the genetic learning system is illustrated in Figure 5.2. A real-coded GA is used for learning the tag information in GaXsearch and is explained in Table 5.3; the application of SAMGA to XML search (SAGAXSearch) is given in Table 5.4.
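The quantities freq(i, S_tag) and the frequency-based ranking can be sketched in Python as follows. The function names and the sample documents are hypothetical, matching is done on whitespace-separated, lower-cased words (an assumption), and the sketch deliberately stops short of combining the quantities into a fitness value.

```python
import xml.etree.ElementTree as ET

def freq(document, selected_tags, query_terms):
    """Count occurrences of the query terms inside elements whose tag is selected."""
    terms = {t.lower() for t in query_terms}
    count = 0
    for element in ET.fromstring(document).iter():
        if element.tag in selected_tags and element.text:
            count += sum(1 for word in element.text.lower().split() if word in terms)
    return count

def rank_documents(documents, selected_tags, query_terms):
    """Return (document index, frequency) pairs, highest frequency first.

    The rank of a document is its position in this list plus one.
    """
    scored = [(i, freq(doc, selected_tags, query_terms)) for i, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents.
d0 = "<article><author>Ravi</author><title>Notes on Vipin</title></article>"
d1 = "<article><author>Vipin</author><title>Land Use</title></article>"

print(freq(d0, {"author"}, ["Vipin"]))             # 0
print(freq(d0, {"author", "title"}, ["Vipin"]))    # 1
print(rank_documents([d0, d1], {"author"}, ["Vipin"]))  # [(1, 1), (0, 0)]
```

The example shows why the choice of S_tag matters: the same query matches `d0` only when `title` is among the selected tags.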

Consider a training set with n documents. Let q = {q_1, q_2, ..., q_m} be a collection of typical user queries, where q_i represents the i-th query and m is the total number of queries. The chromosome is represented as j = {j_1, j_2, ..., j_l}, where j_i denotes the

Table 5.4 Self Adaptive Migration Model Genetic Algorithms (SAMGA) for Learning Tag Information

Initialize the population size and the mutation rate for each population.
Associate random tag weights with the tags in the tag pool T. These represent the individuals of the initial population.
for generation = 1 : maximum generation limit
    for each population
        for each individual, select the top k tags with the highest tag weights.
            Let S_tag = {t_1, t_2, ..., t_k} represent the selected tags.
            Evaluate the fitness function using Equation 5.1.
        Modify the mutation rate using Equation 2.3.
        Modify the population size according to Equation 2.1.
        Select individuals and perform the recombination operation.
    If the average fitness of the ecosystem fails to change over two successive generations, migrate the best individuals between populations.
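The migration step of Table 5.4 can be sketched in Python as follows. This is a simplified illustration with several assumptions: the self-adaptive updates of Equations 2.1 and 2.3 are not reproduced (population size and mutation rate stay fixed), `evolve` is a hypothetical stand-in for selection and recombination, migration follows an assumed ring topology, and "fails to change over two successive generations" is interpreted as the ecosystem average staying within a tolerance across the last two generation steps.

```python
import random

def evolve(population, fitness, rate, rng):
    """One generation: keep the better half, refill with mutated copies (stand-in operator)."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, len(ranked) // 2)]
    children = []
    while len(survivors) + len(children) < len(population):
        parent = rng.choice(survivors)
        children.append([rng.random() if rng.random() < rate else w for w in parent])
    return survivors + children

def migrate(populations, fitness):
    """Copy the best individual of each population over the worst of the next one (ring)."""
    bests = [max(pop, key=fitness) for pop in populations]
    for i, pop in enumerate(populations):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = list(bests[i - 1])

def stagnated(history, tol=1e-9):
    """True if the average fitness did not change over the last two generation steps."""
    return (len(history) >= 3
            and abs(history[-1] - history[-2]) < tol
            and abs(history[-2] - history[-3]) < tol)

def samga_sketch(fitness, n_weights, n_pops=3, pop_size=8, generations=20, rate=0.2, seed=0):
    rng = random.Random(seed)
    populations = [[[rng.random() for _ in range(n_weights)] for _ in range(pop_size)]
                   for _ in range(n_pops)]
    history, migrations = [], 0
    for _ in range(generations):
        populations = [evolve(pop, fitness, rate, rng) for pop in populations]
        everyone = [ind for pop in populations for ind in pop]
        history.append(sum(fitness(ind) for ind in everyone) / len(everyone))
        if stagnated(history):
            migrate(populations, fitness)
            migrations += 1
    best = max((ind for pop in populations for ind in pop), key=fitness)
    return best, migrations

# Hypothetical stand-in fitness: the sum of the weights (NOT Equation 5.1).
best, migrations = samga_sketch(lambda ind: sum(ind), n_weights=4)
```

In practice the fitness passed in would first map each weight vector to its top k tags and then score those tags against the user feedback.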

From: K.R. Venugopal, K.G. Srinivasa and L.M. Patnaik, Soft Computing for Data Mining Applications (pages 105-108)