Random Walk and Feedback on Scholarly Network

(1)

Random Walk and Feedback on Scholarly Network

Yingying Yu

College of Transportation Management Dalian Maritime University

Dalian, China, 116026

Zhuoren Jiang

College of Transportation Management Dalian Maritime University

Dalian, China, 116026

Xiaozhong Liu

School of Informatics and Computing Indiana University

Bloomington

Bloomington, IN, USA, 47405

ABSTRACT

The approach of random walk on heterogeneous bibliographic graph has been proven effective in the previous studies. In this study, by using various kinds of positive and negative feedbacks, we propose the novel method to enhance the performance of meta-path-based random walk for scholarly recommendation. We hypothesize that the nodes on the heterogeneous graph should play different roles in terms of different queries or various kinds implicit/explicit feedbacks. Meanwhile, we prove that the node usefulness probability has significant impact for the path importance. When positive and negative feedback information is available, we can calculate each node’s proximity to the feedback nodes, and use the proximity to infer the usefulness probability of each node via the sigmoid function. By combining the transition probability and the usefulness probability of nodes on the path instance, we propose the new random walk function to compute the importance of each path instance. Experimental results with ACM full-text corpus show that the proposed method (considering the node usefulness) significantly outperforms the previous approaches.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords

Meta-path-based Random Walk, Feedback, Heterogeneous Graph

1. INTRODUCTION

The volume of scientific publications has increased dramatically in the past couple of decades, which challenges existing systems and methods to retrieve and access scientific resources. Classi- cal text-based information retrieval algorithms can recommend the candidate publications for scholars. However, most of them ig- nored the complex and heterogeneous relations among the scholarly objects. Not until recently, some studies proved that adopt- ing the mining approaches on heterogeneous information networks

Copyrightc 2015 for the individual papers by the papers’ authors. Copy- ing permitted for private and academic purposes. This volume is published and copyrighted by its editors.

Published on CEUR-WS:http://ceur-ws.org/Vol-1393/.

could significantly improve the scholarly recommendation performance [3,7,9,12]. For instance, Liu et al., [2,3] constructed the heterogeneous scholarly graph and proposed a novel ranking method based on pseudo relevance feedback (PRF), which can effectively recommend candidate citation papers via different kinds of meta- paths on the graph.

In this paper, we intend to further investigate feedback information and enhance the meta-path-based random walk performance.

Intuitively, for different information needs, when user feedbacks are available, the nodes on the graph should play different roles in the final measure. For example, given two different queries

"Content-based Citation Recommendation" and "Heterogeneous In- formation Network", the same paper "ClusCite: effective citation recommendation by information network-based clustering" may be retrieved by scholarly search engines, e.g., Google Scholar. But the target paper can be more useful (positive) for the second query than the first one. As another example, for user X, if she prefers to cite influential scholars’ work, the highly cited authors will be useful for her. While for user Y, if she tends to cite the frontiers, she will mark the newest publications and the newly topics as the useful feedback information. Therefore, the same node may perform significantly different based on different information needs and feedback information. Furthermore, by using (implicit/explicit positive/negative) feedbacks, it is possible to infer the usefulness probability of other nodes on the graph. So that, the importance of path instance will vary in terms of the probability of node usefulness.

The main contribution of this paper is threefold. First, in this paper, the feedback is not limited to documents. In scholarly network, user could provide feedback judgments for authors, keywords and venues, either useful or not useful. If the explicit user feedback is unavailable, we propose an approach to automatically generate the feedback nodes based on user queries and the relation- ships among the entities on the heterogeneous graph. Second, we infer the usefulness of the nodes in terms of feedback information.

For instance, a node is less useful when it is close to the negative node(s). We make a conjecture that the usefulness probability of each node depends on its average proximity to the feedback set and can be estimated via sigmoid function. Third, we emphasize the node usefulness has a great impact on the path importance. Our approach about computing the random walk probability differs from the previous study in that, not only the transition probability, but also the usefulness probability of the node should be taken into ac- count for random walk. To verify these hypotheses, we adopt a number of meta-paths on the graph (Figure 1) and make a comparison between the classical random walk function and the novel method. Experimental results on ACM corpus show that the proposed method significantly outperforms the original one.

The remainder of this paper is structured as follows. We 1) re-

(2)

view relevant methodologies for pseudo relevance feedback, 2) in- troduce the preliminaries, 3) propose the improved methods, 4) de- scribe the experiment setting and evaluation results, and 5) con- clude with a discussion and outlook.

2. RELATED WORK

Pseudo relevance feedback, also known as blind relevance feedback, provides a way for automatic local analysis. When the user judgments or interactions are not available, it turns out to be an effective method to improve the retrieval performance. Traditional pseudo relevance feedback tends to treat the top ranked documents as relevant feedback, and then expand the initial queries. How- ever, some of the top retrieved documents may be irrelevant, which could result in noisy feedback into the process. So that, there are various efforts to improve the traditional pseudo feedback. [11] ex- ploited the possible utility of Wikipedia for query dependent expansion. From the perspective of each query and each set of feedback documents, [4] proposed how to dynamically predict an op- timal balance coefficient query expansion rather than using a fixed value. [1] suggested to use evolutionary techniques along with semantic similarity notion for query expansion. [6] introduced an approach to expand the queries for passage retrieval, not based on the top ranked documents, but via a new term weighting function, which gives a score to terms of corpus according to their related- ness to the query, and identify the most relevant ones. Instead of using term expansion, graph-based feedback provides a new ranking assumption based on topology expansion. [2] used the pseudo relevant papers as the seed nodes, and then explored the potential relevant nodes via specific restricted/combined meta-paths on the heterogeneous graph. Our study is motivated by this approach and mainly focused on updating the random walk algorithm by inves- tigating both the positive and negative feedbacks. In fact, positive and negative feedback approach has been studied in image retrieval [5]. With several steps of positive and negative feedback, the retrieval performance could be increasingly enhanced. From the view of negative feedback, [10] studied and compared different kinds of methods, it addressed that negative feedback is important especially when the target topic is difficult and initial results are poor. Besides, using multiple negative feedback methods could be more effective.

3. PRELIMINARIES

Following the work [2,8], an information network can be defined as follows.

DEFINITION 1. (Information network)Aninformation network is defined as a directed graphG = (V,E) with an object type mapping functionτ : V → Aand a link type mapping function φ: E → R, where each objectv ∈ Vbelongs to one particular object typeτ(v) ∈ A, each linke ∈ E belongs to a particular relationφ(e) ∈ R, and if two links belong to the same relation type, the two links share the same starting object type as well as the ending object type.

When there are more than one type of node or link in the information network, it is calledheterogeneous information network.

In [8], Sun further defined meta-path as follows.

DEFINITION 2. (Meta-path)Ameta-pathP is a path defined on the graph of network schemaTG = (A,R), and is denoted in the form ofA˙1

R₁

−→ A˙2 R₂

−→ . . . −→^R^l A˙l+1, which defines a composite relationR=R1◦R2◦. . .◦Rlbetween typesA˙1and A˙l+1, where◦denotes the composition operator on relations.

Given a specific scholarly network, there can be many kinds of meta-paths. For example,P^∗ →^w A ←^w P^?is a simple meta-path on the scholarly network, denoting all the papers published by the seed paper’ author.P^∗is the starting paper node (seed node) in this path. P^?denotes the candidate publication node. More examples can be found in Table 1.

4. RESEARCH METHODS 4.1 Generate the Feedback Nodes

Generally, given user initial queries, a list of ranking publications would be found via text retrieval. Based on the top ranked documents, user would probably give explicit judgments on whether the related keywords, authors or venues are useful or not. However, explicit feedback is not easy to get. In this study, we propose methods to infer the implicit feedback nodes on the heterogeneous graph according to the given information.

The feedback is a collection of multiple nodes marked with useful (positive) or unuseful (negative) on the heterogeneous graph.

We represent this collection asN F. N FP andN FN denote the positive and negative nodes set respectively. The kinds of feedback nodes in discussion include keyword (K), author (A) and venue (V).

4.1.1 Generate the Positive Feedback Nodes

Since we know the initial queries (i.e., author provided paper keywords) that the users should be most concerned with, it is rea- sonable to take the explicit keywordsKPas the positive feedback nodes. Next, we will infer the positive authors and venues based on KP. We deem that the authors or venues that are highly likely related toKPare positive as well. So we rank authors via meta-paths KP

−−→con A^?andKP

←r P −^w→A^?, and take the top rankedKpos

authors as the pseudo positive authorsAP. Similarly, we locate the positive venues viaKP

−−→con V^?andKP

←r P −→^p V^?, and select the top rankedKposvenues as the positive nodesVP.

4.1.2 Generate the Negative Feedback Nodes

Intuitively, to generate the negative feedbacks, our basic assumption is that the negative nodes should be directly related to the searched results, but least relevant to the explicit positive keywords.

First, based on text retrieval results, we define the top rankedtopK papers asPr, and then we locate the keywords, authors and venues that are directly connected toPr via different meta-paths,Pr

→r

Kr,Pr

→w ArandPr

→p Vr.

Next, we filter collections ofKr,ArandVr. 1. Rank the key- wordsKrvia the transition probability of meta-pathKP

con→P →^r Kr. Use the last rankedKnegkeywords as the pseudo negative nodesKN. 2. Similar to keywords, rank the authorsArvia the transition probability of meta-pathKP

con→ P →^w Ar, and use the last ranked Kneg authors as the pseudo negative nodesAN. 3.

Rank the venuesVrviaKP

con→ P →^p Vr, and use the last ranked Knegvenues as the negative nodesVN. Here we useKP

con→Pin- stead ofKP

←r Pbecause the "contribution" characterizes the importance of each paper, given a topic. It does not necessarily means paper is relevant to topic [2]. Even if one paper is not explicit relevant to some topic, it might also be important. The "contribute"

conveys more information.

Thus, we obtain all the positive and negative feedback nodes.

N FPincludesKP,APandVP.N FNcontainsKN,ANandVN.

4.2 Infer the Usefulness Probability of Node

Unlike previous studies, in this paper, the importance of nodes on scholarly network is not even. The usefulness probability of

(3)

nodeNiis determined by the feedback nodes. Intuitively, if node Niis more closely related to the positive nodes, it could be more useful. Conversely, ifNi is much closer to the negative nodes, and further away from the positive nodes, it indicates thatNimay be not very useful. Therefore, the proximity between given node and feedback node set is very crucial. We should note that the usefulness probability of each node varies from different feedback node sets.

To infer the usefulness probability of nodeNi, we adopt the sigmoid functionPu(Ni) = ¹

1+e^−αD(Ni) to convert the proximity into probability, whereαcontrols the convergent rate (default is 1). In our assumption, ifNjis positive node,Pu(Nj) = 1, other- wiseP(Nj) = 0. D(Ni)denotes the proximity betweenNiand the feedback node setN F. It can be derived from the following formula.

D(Ni) =

P

Nk∈N FNd(N_i,N_k)

|N F_N| −

P

Nj∈N FPd(N_i,N_j))

|N F_P| , where

|N FN|and|N FP|represents the size of collectionN FNandN FP

respectively. d(Ni, Nj)indicates the proximity between nodeNi

and nodeNj. In this paper, we will estimate the proximityd(Ni, Nj) based on the pathsNi Njon the graph. There could be lots of path instances connected nodeNiandNj. If the length of path is too long, the influence would be too small to be considered. We as- sume the maximum of path length is 10. Then we select the shortest path and define its length as the proximityd(Ni, Nj).

IfD(Ni) is negative, it reflects nodeNi is closer to negative nodes than positive ones, which means nodeNicould be less important, and vice versa. Particularly, ifD(Nj) → +∞, it indicates thatNjis far away from negative feedback nodes, so the importance of this node approach to 1; IfD(Nj) = 0, it indicates thatNjhas the same distance to negative and positive nodes, then Pu(Nj) = 0.5; IfD(Nj)→ −∞, it indicates thatNjis closest to negative feedback node, thenPu(Nj)→0.

4.3 Compute the Random Walk Probability Based on Meta-path

Meta-path illustrates how the nodes are connected in the heterogeneous graph. Once a meta-path is specified, a meta-path-based ranking function is defined, so that relevant papers determined by the ranking function can be recommended [3]. It turns out that meta-path based feedback on heterogeneous graph performs better than other methods (PageRank) based PRF [2]. Random walk on heterogenous network can explore more global information, combining multiple feedback nodes, which might be very important for the recommendation tasks.

In order to quantify the ranking score of candidates relevant to the seeds following one given meta-path, a random walk based approach was proposed in [2]. The relevance betweenP^∗and P^? can be estimated vias(a⁽¹⁾_i , a^(l+1)_j ) = P

t=a⁽¹⁾_i a^(l+1)_j RW(t), wheretis a path instance from nodea⁽¹⁾_i toa^(l+1)_j following the specified meta-path, andRW(t)is the random walk probability of the instancet.

Supposet = (a⁽¹⁾_i1 , a⁽²⁾_i2 , . . . , a^(l+1)_il+1), the random walk probability can be computed viaRW(t) =Q

jw(a^(j)_ij, a^(j+1)_i,j+1). While this formula only considers the weight of link on the path instance.

Based on our hypothesis, the node usefulness probability has a great effect on the path importance. So in this study, we propose a novel random walk function as follows.

RW(t) =Q

j(β·w(a^(j)_ij, a^(j+1)_i,j+1)+(1−β)·Pu(a^(j+1)_i,j+1)), where Pu(a^(j+1)_i,j+1)is the usefulness probability of the nodea^(j+1)_i,j+1 on the path (derived from section 4.2), andβdetermines which factor is more important. Theoretically, we need to tuneβfor each meta-

path to optimize the weight of each sub-meta-path. For this study, we setβ= 0.6.

Then, the random walk probability will be decided by the transition probability and the usefulness probability of the node on the path instance. In this paper, we use eight meta-paths to investigate the novel random walk method with node feedback information for citation recommendation. All the meta-paths are listed in Table 1.

5. EXPERIMENT 5.1 Data Preprocessing

We used 41,370 publications (as candidate citation collection), published between 1951 and 2011, on computer science for the experiment (mainly from the ACM digital library). As [2] introduced, we constructed the heterogeneous graph shown in Figure 1 and Ta- ble 2.

For the evaluation part, we used a test collection with 274 papers.

The selected papers have more than 15 citations from the candidate citation collection.

5.2 Generate Feedback Nodes

Attaining different types of feedback information is the most important part in this research. Since it is not available to get the user judgments right away. We used the method introduced in section 4.1 to create positive and negative feedback nodes. As aforemen- tioned, the collectionKPis the set of user given keywords. It is explicit positive feedbacks. WhileAPandVPcan be derived by their connectivity to setKPbased on the heterogeneous graph. Here we setKpos = 10, and take the top 10 ranked authors/ venues as the implicit positive feedbacks.

Next, we produced the implicit negative feedback nodes. Through the text retrieved results, we grabbed the top ranked papers asPr

(topK = 20). Then we located the list of keywords/ authors/

venues which have direct correlations toPr, but the least relevance toKP. Find the last rankedKneg= 10and used them asKN,AN

andVNrespectively.

5.3 Experiment Result

In the evaluation part, we experimented with 8 different meta- paths. For each meta-path, two sets of results were shown on row

‘N’ and ‘Y’ in Table 3. The ‘N/Y’ column in Table 3 indicates whether we use the positive and negative feedback nodes or not for computing the path importance. ‘N’ indicates that the result was from the baseline in [2], while ‘Y’ means multiple feedback nodes were employed and the node influence was appended into the final random walk function. MAP and NDCG are used as the ranking function training and evaluation metrics. For MAP, binary judg- ment is provided for each candidate cited paper (cited or not cited).

NDCG estimates the cumulative relevance gain a user receives by examining recommendation results up to a given rank on the list.

We used an importance score, 0-4, as the candidate cited paper importance to calculate NDCG scores. Apparently, in most cases, row ‘Y’ significantly outperforms row ‘N’ , which shows that the positive/negative feedbacks enhance the random walk performance quite well. We also used t-test to verify this improvement and most meta-paths are significantly refined.

6. CONCLUSION AND LIMITIONS

In this study we use multiple kinds of feedback nodes and propose a new method to enhance the meta-path-based random walk performance. The new random walk function considers both transition probability and node usefulness probability on the path instance. We find that the node influence varies from the set of

(4)

feedback nodes, which could be inferred based on the explicit user queries via a series of steps. Experimental results with ACM data illustrate that the new approach with positive/negative feedback information helps to improve the performance of meta-path-based recommendation.

For further study, we will continue this approach based on real user explicit feedbacks and design the personalized recommendation model to improve user experience. Not only the node usefulness is related to the feedback nodes, but also the weight of each relation type may be affected by the feedback nodes or retrieval task.

If the retrieval task is to search the relevant papers based on given authors, the author feedback nodes will be more useful for "writtenby" relation, "writtenby" and "co-author" relation might be more important. This hypothesis will be discussed in the next step. Be- sides, more sophisticated inference models will be adopted which may enhance the ranking performance.

7. FIGURES AND TABLES

Figure 1: Heterogeneous Bibliographic Graph

Table 1: All the meta-paths used in this study

NO. Meta-path Feedback ranking hypothesis

1 P^∗−^w→A←^w−P^? Relevant paper’s author’s other papers can be relevant

2 P^∗−→^c P^? Relevant paper’s cited papers can be relevant

3 P^∗−→^c P−→^c P^? Relevant paper’s cited paper’s cited paper can be relevant

4 P^∗−→^c P−^w→A←^w−P^? Relevant paper’s cited papers’ authors’

papers can be relevant

5 P^∗→^wA→^coA←^wP^? Relevant paper’s author’s co-author’s papers can be relevant

6 P^∗−^w→A←^w−P−→^c P^? Relevant paper’s author’s cited papers can be relevant

7 P^∗→^p V ←^p P→^c P^? Paper can be relevant if it is cited by the ones published at the same venue as the relevant paper

8 P^∗→^p V ←^p P−^w→A←^w−P^? Paper can be relevant if its authors’ papers are published at the same venue as the relevant paper

8. REFERENCES

[1] P. Bhatnagar and N. Pareek. Improving pseudo relevance feedback based query expansion using genetic fuzzy approach and semantic similarity notion.Journal of Information Science, page

0165551514533771, 2014.

[2] X. Liu, Y. Yu, C. Guo, and Y. Sun. Meta-path-based ranking with pseudo relevance feedback on heterogeneous graph for citation

Table 2: Graph statistics Node/Edge Number Description

P 41,370 Paper

A 63,323 Author

V 369 Venue

K 3,911 Keyword

P→^c P 168,554 Paper cites another paper P→^w A 105,992 Paper is written by an author P→^p V 41,013 Paper is published at venue A→^coA 239,744 Co-author relationship

P→^r K 587,252 Paper is relevant to keyword(topic) K^con→ P 3,577,111 Keyword (topic) is contributed by paper K^con→ A 2,397,205 Keyword (topic) is contributed by author K^con→ V 18,450 Keyword (topic) is contributed by venu

Table 3: Meta-path Based Random Walk Performance Comparison(|P^∗|= 10)

NO. N/Y MAP MAP@5 MAP@10 NDCG NDCG@5 NDCG@10

1 N 0.0277 0.0085 0.0129 0.1035 0.0306 0.0394

Y 0.0365

***

0.015

***

0.0211

***

0.1149

***

0.0459

**

0.0565

***

2 N 0.1315 0.0552 0.0773 0.2193 0.1427 0.1548

Y 0.1459

***

0.0678

***

0.0904

***

0.2307

**

0.1656

***

0.1705 **

3 N 0.0744 0.0306 0.0404 0.1539 0.0689 0.0766

Y 0.0948

***

0.0441

***

0.0582 * 0.1707

***

0.0945 * 0.1002 **

4 N 0.027 0.0042 0.0076 0.1378 0.0146 0.025

Y 0.038

***

0.0109

***

0.0153

***

0.1521

***

0.0318

***

0.0387

***

5 N 0.0436 0.0121 0.0187 0.1672 0.0476 0.0585

Y 0.0561

***

0.0257

***

0.0328

***

0.1854

***

0.0867

***

0.0885

***

6 N 0.0327 0.0234 0.03 0.0734 0.0693 0.0748

Y 0.0872

***

0.0359

***

0.0471

***

0.1962

***

0.0805 * 0.09 *

7 N 0.0238 0.0083 0.0097 0.1529 0.0216 0.0224

Y 0.0373

***

0.0133

***

0.0163

***

0.1718

***

0.0317

**

0.0344 **

8 N 0.0092 0.0005 0.0007 0.1397 0.0011 0.0013

Y 0.012

***

0.0011

***

0.0017

***

0.1476

***

0.0027

***

0.0045

***

p <0.05: *,p <0.01: **,p <0.001: ***

recommendation. InProceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 121–130. ACM, 2014.

[3] X. Liu, Y. Yu, C. Guo, Y. Sun, and L. Gao. Full-text based context-rich heterogeneous network mining approach for citation recommendation. InACM/IEEE Joint Conference on Digital Libraries, 2014.

[4] Y. Lv and C. Zhai. Adaptive relevance feedback in information retrieval. InProceedings of the 18th ACM conference on Information and knowledge management, pages 255–264. ACM, 2009.

[5] H. Muller, W. Muller, S. Marchand-Maillet, T. Pun, and D. M.

Squire. Strategies for positive and negative relevance feedback in image retrieval. InPattern Recognition, 2000. Proceedings. 15th International Conference on, volume 1, pages 1043–1046. IEEE, 2000.

[6] H. Saneifar, S. Bonniol, P. Poncelet, and M. Roche. Enhancing passage retrieval in log files by query expansion based on explicit and pseudo relevance feedback.Computers in Industry, 65(6):937–951, 2014.

[7] Y. Sun and J. Han. Meta-path-based search and mining in heterogeneous information networks.Tsinghua Science and Technology, 18(4), 2013.

[8] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. InProc. 2011 Int. Conf. Very Large Data Bases

(5)

(VLDB’11), Seattle, WA, 2011.

[9] Y. Sun, B. Norick, J. Han, X. Yan, P. S. Yu, and X. Yu. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. InProc. of 2012 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’12), Beijing, China, 2012.

[10] X. Wang, H. Fang, and C. Zhai. A study of methods for negative relevance feedback. InProceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 219–226. ACM, 2008.

[11] Y. Xu, G. J. Jones, and B. Wang. Query dependent pseudo-relevance feedback based on wikipedia. InProceedings of the 32nd

international ACM SIGIR conference on Research and development in information retrieval, pages 59–66. ACM, 2009.

[12] X. Yu, X. Ren, Y. Sun, B. Sturt, U. Khandelwal, Q. Gu, B. Norick, and J. Han. Recommendation in heterogeneous information networks with implicit user feedback. InProc. of 2013 ACM Int. Conf. Series on Recommendation Systems (RecSys’13), pages 347–350, Hong Kong, 2013.

Random Walk and Feedback on Scholarly Network