
Spreading Activation Methods

2. Input/Output of Node Activation

Before the pulse, the node v has the activation level F(v).

Through incoming links, v gets more activation:

Input(v) = Σ O(e)

for all links e such that init(e) ∈Vn, term(e) = v.

By dissipating the activation through outgoing links, the node v might lose activation:

Output(v) = Σ I(e)

for all links e such that init(e) = v, term(e) ∈ Vn.

3. Computation of the New Level of Activation

A new value Fnew(v) is computed based on the values of F(v), Input(v), and Output(v); for example, Fnew(v) = F(v) + Input(v).
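The bookkeeping of a single pulse can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular library. The link functions are assumptions: I(e) is taken to be the activation of init(e), and O(e) is taken to be I(e) multiplied by a link weight. The names pulse, links and active are hypothetical. The new level uses the example rule Fnew(v) = F(v) + Input(v).

```python
from collections import defaultdict


def pulse(F, links, active):
    """Run one pulse.

    F      -- dict: node -> activation level F(v) before the pulse
    links  -- list of (init, term, weight) tuples, one per directed link e
    active -- set of currently active nodes (V_n in the text)
    """
    input_of = defaultdict(float)
    output_of = defaultdict(float)
    for init, term, weight in links:
        link_in = F.get(init, 0.0)      # I(e): assumed to equal F(init(e))
        link_out = weight * link_in     # O(e): assumed to be weight * I(e)
        if init in active:              # Input(v): links arriving from V_n
            input_of[term] += link_out
        if term in active:              # Output(v): links leading into V_n
            output_of[init] += link_in
    # Example recomputation rule from the text: Fnew(v) = F(v) + Input(v)
    F_new = {v: F[v] + input_of[v] for v in F}
    return F_new, dict(input_of), dict(output_of)


F = {"a": 1.0, "b": 0.0, "c": 0.0}
links = [("a", "b", 0.5), ("a", "c", 0.25), ("b", "c", 1.0)]
F_new, input_of, output_of = pulse(F, links, active={"a"})
print(F_new)  # {'a': 1.0, 'b': 0.5, 'c': 0.25}
```

Only node a is active, so b and c each receive activation from a alone; the link from b to c contributes nothing in this pulse.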

SAM and Methods of Numerical Simulation in Physics

Spreading activation algorithms were introduced in the 1990s; however, the same iterative methods were used long before in numerical simulation in physics, mechanics, chemistry and engineering sciences (Morton, K.W. & Mayers, D.F., 2005; Rübenkönig, O., 2006). The major distinctions between these algorithms and what is now called spreading activation are:

a) in physics, such algorithms usually work on a regular mesh (so that the local topology of the graph is encoded into the formulas of the recomputation stage);

b) in physics, initial conditions (the initial activation) are usually assigned to all nodes of the mesh, so algorithms for efficient graph traversal are not needed. For instance, steps 2a (List Expansion) and 2b (List Purging) in the generic description of the SAM framework might be skipped.

For instance, the one-dimensional heat transfer equation might be numerically simulated on a one-dimensional mesh by iterative methods. On each iteration, the recomputation stage is based on the formula below:

Fnew (v) = (F(RightNeighbor(v)) + F(LeftNeighbor(v))) / 2
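A minimal Python sketch of this iteration follows. It assumes that the two end nodes of the mesh are held at a fixed level, a boundary treatment the text does not specify; the function names and the rod example are hypothetical.

```python
def heat_step(F):
    """Apply Fnew(v) = (F(left) + F(right)) / 2 to every interior node.

    The two end nodes are kept unchanged (assumed fixed boundary values).
    All new values are computed from the old ones, as in one pulse.
    """
    new = F[:]
    for i in range(1, len(F) - 1):
        new[i] = (F[i - 1] + F[i + 1]) / 2
    return new


def simulate(F, iterations):
    """Repeat the recomputation stage a fixed number of times."""
    for _ in range(iterations):
        F = heat_step(F)
    return F


# A cold rod whose two ends are kept hot.
rod = [1.0] + [0.0] * 7 + [1.0]
final = simulate(rod, 500)
print([round(x, 6) for x in final])
```

With these boundary values the interior nodes approach the level of the ends, so the whole rod ends up at practically the same level of activation.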

Using a different formula, one can simulate the behavior of an oscillating string (although this requires storing three values at each node: the position, mass and velocity of the material point corresponding to the node).

Using the same iterative algorithm, with one set of parameters one can emulate heat transfer; with another set of parameters the same algorithm will show us the behavior of oscillating strings. But the phenomena of heat propagation and string oscillation are quite different (for instance, heat propagation might lead to “thermal death”, the state of equilibrium where the level of activation is the same for all nodes, while oscillation might continue forever). Our illustration concerns only the basics, while real modeling might be much more complicated; for instance, heat transfer might lead to combustion, where, after reaching some level of activation, a node generates more “heat” than it gets from neighboring nodes.
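The oscillating case can be sketched in the same style. The text does not give the formula, so the update below is an assumption: a standard mass-spring discretization in which each node stores position, velocity and mass, neighbors pull on each other in proportion to their displacement, and the ends of the string are fixed. The names string_step, stiffness and dt are hypothetical.

```python
import math


def string_step(pos, vel, mass, stiffness=1.0, dt=0.1):
    """Advance the string by one time step; the two end nodes stay fixed.

    Velocities are updated first and the new velocities are then used to
    move the nodes (semi-implicit Euler integration).
    """
    new_vel = vel[:]
    for i in range(1, len(pos) - 1):
        # Restoring force exerted by the two neighbouring nodes
        force = stiffness * (pos[i - 1] + pos[i + 1] - 2 * pos[i])
        new_vel[i] = vel[i] + dt * force / mass[i]
    new_pos = [p + dt * v for p, v in zip(pos, new_vel)]
    return new_pos, new_vel


n = 11
pos = [math.sin(math.pi * i / (n - 1)) for i in range(n)]  # half sine wave
vel = [0.0] * n
mass = [1.0] * n

peak_late = 0.0  # largest displacement of the middle node in the last 1000 steps
for step in range(5000):
    pos, vel = string_step(pos, vel, mass)
    if step >= 4000:
        peak_late = max(peak_late, abs(pos[n // 2]))
print(peak_late)
```

Unlike the heat iteration, this one does not settle into an equilibrium: after thousands of steps the middle node still swings with nearly its initial amplitude.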

This recourse to physics gives us useful insight into the high potential of the SAM framework and directions for future work:

• There are numerous possible modifications of SAM algorithms.

• Changing the parameters of spreading activation significantly affects the results.

• Selecting “correct” parameters for new applications of SAM might be a hard task, and must be based on the creation of a “correct” model for the phenomena in question.

• Understanding the nature (“the physics”) of what is propagated on the network, and how, requires domain-specific knowledge. Discovering how to do this efficiently is computer science.

Applications described in this chapter use formulae similar to “heat transfer”, which ensures fast convergence after a limited number of iterations.

Theorizing about potential application areas for SAM algorithms which are more similar to “oscillation”, we can suggest that such algorithms might be used to rank web sites based not only on their current status, but also on the trend (for example, a site that is becoming popular).

Classification of SAM Algorithms Based on the Distribution of Initial Activation

An important dimension of the classification of SAM algorithms is the intended mode of use with respect to the distribution of nodes with a non-zero level of initial activation:

• egocentric applications – where SAM algorithms are used mainly to process egocentric queries (i.e. only one node on a network has a non-zero level of activation at the initialization stage);

• polycentric applications – where several nodes on a network have a non-zero level of activation at the initialization stage;

• omnicentric applications – where most of the nodes on a network have a non-zero level of activation at the initialization stage.
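The three modes can be told apart by counting the initially activated nodes. The sketch below is only illustrative: the text does not quantify “several” or “most”, so the omni_share threshold (more than half of the nodes) is an assumption, as is the function name.

```python
def classify_initial_activation(F0, omni_share=0.5):
    """Classify an initial activation (dict: node -> level) by its spread.

    omni_share is an assumed cut-off: if more than this share of all nodes
    is initially activated, the application is treated as omnicentric.
    """
    active = sum(1 for level in F0.values() if level != 0)
    if active == 0:
        raise ValueError("no node has a non-zero initial activation")
    if active == 1:
        return "egocentric"
    if active > omni_share * len(F0):
        return "omnicentric"
    return "polycentric"


nodes = ["n%d" % i for i in range(10)]
ego = {v: (1.0 if v == "n0" else 0.0) for v in nodes}
poly = {v: (1.0 if v in ("n0", "n1", "n2") else 0.0) for v in nodes}
omni = {v: 1.0 for v in nodes}
print(classify_initial_activation(ego),
      classify_initial_activation(poly),
      classify_initial_activation(omni))
```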

Examples of egocentric applications are described in (Kinsella et al., 2008) and (Nepomuk PSEW Recommendation). Polycentric applications are described in (Troussov, Sogrin, Judge, & Botvich, 2008a; Troussov et al., 2008b). In omnicentric spreading activation, we should probably talk about redistribution of activation rather than spreading of activation. Indeed, the authors of such algorithms (Levner, Pinto, Rosso, Alcaide, & Sharma, 2007b) do not call them spreading activation algorithms. Nevertheless, we believe that these algorithms must be presented together with “classical” spreading activation algorithms as described in section 2.2. Having the single umbrella of the SAM framework allows us to focus on the core part of these algorithms, i.e. the recomputation step in the iterations, and simplifies knowledge transfer across the application domains of SAM algorithms.

Figure 1. This is a two-dimensional numerical simulation done by the Galaxy library (Troussov, A., Judge, J., & Sogrin, M., 2007). The parameters of the algorithm were tuned to work with networks like WordNet to detect the focus concepts of documents. For example, if four concepts (depicted at the corners of the mesh) are mentioned in the text, SAM computes that the center of the mesh gets the highest value. Note that if the parameters of the algorithm were chosen to emulate heat transfer, the highest level of activation would stay with the initial four nodes.

Spreading Activation as a Graph Mining Technique

As we have already seen, the technique of SAM is quite polymorphic. In this section we interpret the results of spreading activation in terms of graph mining.

First of all, one can think that after running SAM the most activated nodes will be those which get activation from multiple sources or, in other words, those which minimize the “distance” to the nodes which were initially activated. Therefore these nodes might be considered as potential centroids of strong clusters induced by the initial activation. Since a partitioning of the nodes according to these clusters is not immediately available (and is not needed in many applications), SAM algorithms might be considered as methods of soft clustering.
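A toy example makes the “multiple sources” effect concrete. The graph, the link weight of 0.5 and the single-pulse rule Fnew(v) = F(v) + Input(v) are all illustrative assumptions; node names such as hub and leaf are hypothetical.

```python
def spread_once(neighbors, F, weight=0.5):
    """One pulse: every activated node sends weight * F(v) to each neighbour."""
    new = dict(F)
    for node, level in F.items():
        for other in neighbors.get(node, ()):
            new[other] = new.get(other, 0.0) + weight * level
    return new


# Three initially activated nodes; "hub" is linked to all of them,
# "leaf" only to one.
neighbors = {
    "s1": ["hub", "leaf"],
    "s2": ["hub"],
    "s3": ["hub"],
    "hub": ["s1", "s2", "s3"],
    "leaf": ["s1"],
}
F0 = {"s1": 1.0, "s2": 1.0, "s3": 1.0}
F1 = spread_once(neighbors, F0)
ranking = sorted(F1, key=F1.get, reverse=True)
print(ranking[0], F1["hub"], F1["leaf"])  # hub 1.5 0.5
```

The node reached from all three sources ends up more activated than the sources themselves, which is why it can be read as a candidate centroid, while no explicit partition of the nodes is ever produced.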

On the other hand, the most activated nodes are those which are connected to the initial conditions by particular types of directed links (arcs with large weights). Therefore we might consider SAM as an efficient scheme for computing fuzzy inferences. For such applications, replacing the single-valued function F by a vector function might be useful.

We conclude by noting that SAM algorithms might be used for soft clustering and fuzzy inferencing on networks.

COMPOSITION OF MULTIDIMENSIONAL NETWORKS AND PERTAINING NAVIGATION METHODS

Successful application of graph-based mining methods strongly depends on the understanding of the phenomena encountered in the modelled networks. In this section we outline socio-semantic aspects of modern networks and discuss the problem of related item recommendation.

Composition of Multidimensional Networks

The proliferation of Web 2.0 technologies has led to the emergence of massive networks connecting people and various digital artifacts. Collaborative tagging systems like Del.icio.us give us examples of such networks. Most of the data in such systems might be represented as a network with four types of nodes: people, resources, tags and instances of tagging (Mika, 2005). In Del.icio.us there are no direct links between people or between resources; instances of tagging usually have three links: a link to the user, a link to the resource, and a link to the tag used.
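The four node types and the three links of each tagging instance can be captured by a small data structure. This is a hypothetical sketch, not the storage model of Del.icio.us or of any library; the class name, the tuple-based node identifiers and the sample data are assumptions.

```python
from collections import defaultdict


class TaggingNetwork:
    """Network with four node types: person, resource, tag, tagging instance.

    Nodes are (type, name) tuples. People, resources and tags are never
    linked directly; they are connected only through tagging instances.
    """

    def __init__(self):
        self.neighbors = defaultdict(set)   # node -> set of linked nodes
        self.tagging_count = 0

    def add_tagging(self, person, resource, tag):
        """Record that `person` applied `tag` to `resource`."""
        instance = ("tagging", self.tagging_count)
        self.tagging_count += 1
        for node in (("person", person), ("resource", resource), ("tag", tag)):
            self.neighbors[instance].add(node)
            self.neighbors[node].add(instance)
        return instance


net = TaggingNetwork()
t0 = net.add_tagging("alice", "doc1", "java")
t1 = net.add_tagging("bob", "doc1", "jsp")
print(sorted(net.neighbors[t0]))
```

Each tagging instance has exactly three links, and two people are related only indirectly, for example through the instances attached to a resource they both tagged.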

Social networks are traditionally modeled by graphs. “Advances in digital technologies invite consideration of organizing within communities as a process that is accomplished by global, flexible, adaptive, and ad hoc networks that can be created, maintained, dissolved, and reconstituted with remarkable alacrity. Increasingly these networks are multidimensional including individuals as well as digital artifacts and concepts.” (Contractor, 2007). Since most such networks are now based on computer mediation (Facebook, LinkedIn, IBM's internal social network Beehive), more types of links between people are known, and more digital artifacts might be accounted for, thus providing “the opportunity to capture, tag, and manifest high-resolution high-fidelity relational “metadata” (which node is connected to which other node) from these multidimensional networks” (Contractor, 2008).

Enterprise 2.0 usually adds new dimensions and new connections (for example, since identity management on the intranet is simple, it is easy to add links from, say, a corporate remake of Facebook to a corporate remake of Delicious).

Communication networks are of particular interest to business and security applications.

Lexico-semantic resources (such as WordNet or medical ontologies) are important resources for knowledge-based methods in language engineering; the Semantic Web and the Nepomuk Social Semantic Desktop (Sauermann, 2005; Groza et al., 2007; Sauermann, Kiesel, Schumacher, & Bernardi, 2009) rely on the use of ontologies. The data for ontologies and their relatives (catalogs, thesauri, taxonomies, topic maps, semantic networks, etc.) are graphs with vertices corresponding to concepts (and their instances) and labeled (weighted) arcs denoting relationships.

Navigating Networked Data Using Polycentric Fuzzy Queries

The content of the networks brought to life by Web 2.0 is influenced by premises which encourage utilising data before providing structure; as a result, the content of these networks is often of mixed quality. The composition of networks based on Semantic Web technologies frequently includes nodes and links which are more related to the technologies underpinning the functioning of these networks than to the potential interpretation of these networks by humans.

The efficiency of human navigation in modern networks depends on the availability of suitable user interfaces powered by an “intelligent” back end which provides guidance and recommendations based on soft computing methods. Later in this chapter we describe how the “pile” based GUI (Graphical User Interface) called Nepomuk-Simple and the IBM library Galaxy (Troussov, A., Judge, J. & Sogrin, M., 2007) can be used for such guided navigation through the network of Personal Information Management Ontology concepts in the scenario of the social semantic desktop as pertaining to the EU 6th framework project Nepomuk.

In navigation on networks, one of the most important guiding tools is related item recommendation: given a set of nodes on a network, recommend potentially relevant nodes. The role of related item recommendation is to reduce cognitive load, provide guidance in navigation and browsing, and contextualize, simplify, and make sense of otherwise complex interlinked data.

Related item recommendation is different from search, since the goal of recommendation is not to find nodes with particular properties (the user herself frequently would not be able to specify what exactly she would like to have as a recommendation), but to find nodes with strong cumulative direct and indirect connections to the initial set of nodes. Therefore we consider the problem of related item recommendation on networked data as a problem of “how to find something without having searched for it” or, in technical terms, as a problem of processing fuzzy (underspecified) polycentric queries on multidimensional networks. As argued in (Troussov et al., 2008b), processing such queries might, for instance, require the use of fuzzy logic, soft clustering and fuzzy inferencing, and spreading activation is one of the techniques particularly suitable for the task.

Spreading Activation for Processing Polycentric Queries

The application of a SAM algorithm to processing polycentric queries might be straightforward: take the nodes from the query and propagate activation to other nodes; however, better results might be achieved for processing fuzzy polycentric queries. Troussov et al. (2008b) describe the components of a software architecture for processing fuzzy polycentric queries. This includes

Query generator

The use case of tag recommendation for enterprise collaborative tagging systems illustrates all aspects of such an architecture. As we mentioned above, people, resources and tags are “wired” together by instances of tagging; to achieve tag recommendation one can put activation in the nodes representing the user and the resource. After propagation, the list of the most activated people, resources, tags and instances of tagging might be post-processed to show only tags.
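The flow just described (activate the user and the resource, propagate, keep only tags) can be sketched as follows. The propagation rule, the decay of 0.5, the two pulses and all function names are assumptions made for the sake of a small runnable example; they are not the parameters of Galaxy or of the architecture in Troussov et al. (2008b).

```python
from collections import defaultdict


def build_network(taggings):
    """Link every tagging instance to its person, resource and tag."""
    neighbors = defaultdict(set)
    for i, (person, resource, tag) in enumerate(taggings):
        instance = ("tagging", i)
        for node in (("person", person), ("resource", resource), ("tag", tag)):
            neighbors[instance].add(node)
            neighbors[node].add(instance)
    return neighbors


def spread(neighbors, initial, pulses=2, decay=0.5):
    """Each pulse adds decay * F(v) to every neighbour of every active node v."""
    F = dict(initial)
    for _ in range(pulses):
        F_new = dict(F)
        for node, level in F.items():
            for other in neighbors[node]:
                F_new[other] = F_new.get(other, 0.0) + decay * level
        F = F_new
    return F


def recommend_tags(taggings, person, resource):
    """Activate the user and the resource, propagate, then keep only tags."""
    neighbors = build_network(taggings)
    F = spread(neighbors, {("person", person): 1.0, ("resource", resource): 1.0})
    tags = [(level, name) for (kind, name), level in F.items() if kind == "tag"]
    return [name for level, name in sorted(tags, reverse=True)]


taggings = [
    ("alice", "doc0", "java"),
    ("bob", "doc1", "java"),
    ("bob", "doc1", "web"),
    ("carol", "doc2", "cooking"),
]
print(recommend_tags(taggings, "alice", "doc1"))  # ['java', 'web']
```

Here java ranks first because it is reached both from the user's own tagging and from the resource, while cooking is never reached and is not recommended.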

The explanatory module might take the list of the most activated instances and convert it into explanations like “the list of all people who have the same geographical location or are connected through the reporting chain to you, who use this tag for this resource”.

Results of recommendation will depend on which tags are most frequently used by the user, on the tags used by the people who have significant overlap in tagged resources with the user, etc. In general, this will be a community-based tag recommendation (Sigurbjörnsson, B., & van Zwol, R., 2008).

However, spreading activation is a method of soft computing; this means that if the external community grows, the results of recommendation will tend to be skewed towards the most popular tags in the whole community. To make the results of tag recommendation more “community based”, the query processor might have two parts: first, spreading activation from the user and the resource is used to detect the subcommunity most connected to the user and the resource; second, activation starts from the members of this subcommunity.

The same architecture, based on the SAM framework, might be used to provide other services for collaborative tagging systems. For example, expertise location in a scenario like “who can explain these documents to me from the point of view of semantic web technologies?” might be construed as a polycentric query which includes several resources, tags and people. Processing of polycentric queries based on a generic graph mining technique like SAM might take into account multiple relations, such as relations between people, hyperlinks between resources, and the relation that the tag JSP might be semantically close to the tag Java. It can also take into account timestamps (when a particular instance of tagging occurred) and use this information about the temporal aspects of collaborative tagging systems (thus addressing the problem of tag expiration).
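One simple way to let timestamps influence propagation is to give older tagging instances a smaller link weight. The text does not say how timestamps are used, so the exponential decay below, the half-life of 180 days and the function name are purely illustrative assumptions.

```python
from datetime import datetime, timedelta


def tagging_weight(tagged_at, now, half_life_days=180.0):
    """Weight of a tagging instance that halves every half_life_days.

    This is an assumed ageing scheme: a fresh tagging has weight 1.0 and
    the weight shrinks towards 0 as the tagging gets older.
    """
    age_days = (now - tagged_at).total_seconds() / 86400.0
    return 0.5 ** (age_days / half_life_days)


now = datetime(2009, 1, 1)
fresh = tagging_weight(now, now)
half_year_old = tagging_weight(now - timedelta(days=180), now)
two_years_old = tagging_weight(now - timedelta(days=720), now)
print(fresh, half_year_old, two_years_old)
```

Such a weight could multiply the activation passed through a tagging instance, so that tags which have fallen out of use contribute less to recommendations.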

Collaborative tagging systems are socio-technical systems, and therefore we cannot assume that everyone will use the system in the same way and with the same purposes in mind as others. For instance, in addition to tagging the topicality of the content of the resources, people might use evocative tags or tags needed to manage their workflow instead of (or in addition to) building folksonomies (user-generated taxonomies).

Composition and Navigation Summary for Multidimensional Networks

Many modern multidimensional networks are created by the proliferation of socio-technical systems, which requires careful consideration of what humans bring into such networks, such as semantics, social aspects and task management. Related item recommendations for networked data facilitate guided navigation; such recommendations (done in a predictive search mode) introduce fuzzification and serendipity aspects into browsing. The use case of collaborative tagging systems demonstrates the advantages of navigating networked data using polycentric fuzzy queries, and the advantages of using SAM algorithms for processing such queries. Spreading activation methods might be used as a primary method for related item recommendation.

Ontology-Based Text Processing

SAM algorithms might be used for ontology based text processing to allow us to detect the relevancy of ontological concepts to a text by propagating the relevancy measure from concepts mentioned in the text to other concepts not mentioned in the text. Iterative redistribution of relevance might also improve the ranking of concepts according to their relevancy to the text in a similar way as PageRank provides ranking of web sites. The rationale of applying SAM algorithms might be explained as follows:

1. Text understanding is inferencing, although a computational approach by clustering ontological concepts mentioned in the text might be somewhat useful.

2. Soft clustering, fuzzy inferencing and other methods of soft computing are suitable for knowledge-based analytics on term mentions when our knowledge is incomplete and inconsistent, and when the parsing methods used to process text are “shallow”.

3. Spreading activation is a method which combines elements of soft clustering and fuzzy inferencing.

4. Therefore spreading activation on ontological networks, taking concepts mentioned in a text as the initial input and propagating this “input” to other concepts, might work (although the exact parameters of such propagation are not known in advance).

This section is based mainly on the results of the EU 6th Framework project Nepomuk (2006-2008).

This project created a social semantic desktop (SSD), based on Semantic Web technologies (Decker & Frank, 2004; Sauermann, Bernardi, & Dengel, 2005), which is available for download from (Nepomuk Installation). Semantic Web technologies are used to annotate resources and relate them to the Personal Information Management Ontology (PIMO) (Sauermann & Dengel, 2007). The IBM/Nepomuk components are available as one Java library, “Galaxy” (Troussov, Judge, & Sogrin, 2007), and address the problems of the consumability of the SSD, especially in the corporate environment, by providing automatic metadata generation for free texts and a scalable back end for social software. Galaxy is a library of components centered around a core spreading activation component, the primary graph-mining technique used in all stages of processing.

The nature of the PIMO ontology excludes the use of methods tailored to a particular domain or to particular lexico-semantic resources, and therefore spreading activation methods, which work based on the local topology, are especially suitable. In this section we describe the ontology-based methods used in Galaxy in text processing applications, while section 4.3 describes applications of Galaxy to related item recommendation (based on both text processing and link analysis).

The major steps in Nepomuk's use of the PIMO ontology for text processing are:

1. Converting the PIMO ontology to a lexico-semantic resource

2. Mapping from free texts into the PIMO ontology

3. Analytics on term mentions, which allow us to reason about which concepts sit well together, resulting in term disambiguation and the creation of metadata

The task of converting a Nepomuk PIMO ontology into a lexico-semantic resource is addressed in (Davis, B., Handschuh, S., Troussov, A., Judge, J., & Sogrin, M., 2008; Troussov et al., 2008c).

Mapping from free texts to a PIMO ontology is done in Galaxy by exploiting the IBM LanguageWare lexical analyzer, which was influenced by the approach developed in the above-mentioned papers. This mapping allows us to build semantic models of documents. We define the semantic model of a free text as a function on the nodes of a semantic network which shows the relevance of the corresponding ontological concepts to the text. This semantic model might be built by an ontology-aware lexical analyzer or a parser. We call this model the Semantic Function Space Model (SFSM). This model covers the traditional Vector Space Model (VSM), and is somewhat similar to it. However, VSM is an algebraic model, while the Function Space Model can be studied by the methods of function analysis (finding local extremes, making the function “more smooth”, etc.), which involve graph mining.

The Galaxy library uses SAM to “improve” the SFSM, assuming that the model represents a cohesive, coherent text (not a random list of words). This empirical approach to language understanding is based on the use of fuzzy inferencing methods (for example, the mention of a car in a sentence increases our awareness that the term Jaguar mentioned in the same text refers to a car, not an animal) and soft clustering (Dublin in Ireland might be the geographical focus of a text which mentions Clonsilla, Drumcondra, and Malahide).

To this end, Galaxy uses spreading activation methods which essentially provide soft clustering and fuzzy inference: activation from the concepts mentioned in the text is propagated to other concepts in PIMO; new concepts, even those not mentioned in the text, might be discovered as relevant to the text; and the concepts mentioned in the text mutually corroborate each other, in a similar way to how Google’s PageRank algorithm discovers the relative importance of web pages (Langville & Meyer, 2006).
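The Jaguar example can be replayed on a toy network. Everything in the sketch below is hypothetical: the five-concept ontology, the initial relevance values that a lexical analyzer might assign (the ambiguous word “Jaguar” is split evenly between its two senses), the link weight and the function name. It does not use the Galaxy API or the PIMO ontology; it only illustrates the idea of improving a semantic model, seen as a function on concepts, by one pulse of spreading activation.

```python
def improve_model(ontology, model, weight=0.5):
    """One pulse: every concept passes weight * relevance to its neighbours.

    ontology -- dict: concept -> list of linked concepts
    model    -- dict: concept -> relevance of the concept to the text
    """
    improved = dict(model)
    for concept, relevance in model.items():
        for other in ontology.get(concept, ()):
            improved[other] = improved.get(other, 0.0) + weight * relevance
    return improved


ontology = {
    "car": ["vehicle", "jaguar (car)"],
    "vehicle": ["car"],
    "jaguar (car)": ["car"],
    "jaguar (animal)": ["big cat"],
    "big cat": ["jaguar (animal)"],
}
# Assumed initial model for a text that mentions "car" and "Jaguar"
model = {"car": 1.0, "jaguar (car)": 0.5, "jaguar (animal)": 0.5}
improved = improve_model(ontology, model)
print(improved["jaguar (car)"], improved["jaguar (animal)"], improved["vehicle"])
# 1.0 0.5 0.5
```

After the pulse the car sense of Jaguar is corroborated by the mention of car and outranks the animal sense, and the concept vehicle, which the text never mentions, receives a non-zero relevance.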

One can say that SAM adds dimension of soft computing methods to the methods traditionally used in ontology-based text processing; and this makes Galaxy tolerant to incompleteness and
