Recipe for completing a thesis in good spirits.

Preparation: 3 years and 3 months. Cooking: 4 months.

Let cool for 2 months before serving.

Ingredients

Take a good supervisor. Choose carefully: he will accompany you for 4 years, guide you, encourage you, advise you, and apply pressure when necessary. Pick one who leaves you a great deal of freedom but is always available to answer your questions; someone like Hugues.

It is always better with a co-supervisor, so do not hesitate to add one. A Marco is particularly appreciated: he will enrich your thesis with his encyclopaedic knowledge, make sure your reading list never runs dry, and constantly supply you with new ideas. Be careful not to leave him alone in a library, though, or you will probably lose him for the day.

Add a good handful of collaborators; each one will bring a new flavour to your thesis. A few safe bets:

• Amin Mantrach overflows with motivation and will present you with 10 new research ideas a minute, some of them overlapping within the same sentence. Once a project is finished, he will always be motivated to collaborate on the next one. He will push you to think bigger and aim higher, which will be particularly useful if you have pessimistic tendencies.

• Nicolas Kourtellis combines perfectly with Amin: he brings an outside point of view and forces you to question certain assumptions. He also has a contagious laugh that will turn every work meeting into a good time.

• The combo of Ilkka Kivimaki + Bram Van Moorter + reindeer will reveal unsuspected applications. Through the combined powers of mathematics, biology and computer science, you will get to explore unexpected research areas. Be careful though: prolonged exposure will make you want to go explore the Nordic lands.

• A dash of Peter Staar will push you to your limits, force you to think fast and to learn even faster, and will surely lead to interesting discoveries.

• To spice it all up, we strongly recommend adding Jeremy Grosman. He will bring you a great deal by pushing you to think seriously about what you are doing, why you are doing it, and how to express it. He will always have something interesting to teach you, or a story to tell that adds perspective to your reasoning. He will also gladly go for pasta at lunch.

Make sure you have a quality lab. IRIDIA will do the job perfectly. Tarik, Aurélien, Guillaume and Fabrice are quality ingredients that we can only recommend. Above all, do not forget Michael Waumans: he will save the day on several occasions thanks to his enormous technical knowledge and his ability to fix anything. Be careful not to ask too much of him, however, because he will not manage to say no, even though you are far from the only one depending on him. Also keep one foot in your co-supervisor's lab; discussions with his team will spice up your work.

Put everything in building C, add some CCs, chemists, physicists and all the Hot PhDs. Mix it all well, and provide direct access to building P to let the pressure out. Do not forget to regularly let your thesis rest so that it can mature. Should you forget, your friends on and off campus will call you to order. In particular, remember to always keep a good dose of Karl and Gégé at hand, and to use them without moderation.

When the texture and appearance of your thesis satisfy you, start cooking. More than ever, you will need the support of your friends and family. Find some who understand why you disappear for a month and only reappear to shout that your thesis is burning. We warmly recommend calling on Antoine, Jean, Laurent, and above all on incredible parents, as they will help you adjust the seasoning.

Towards the end of the cooking, turn to the pros for precious advice: a good jury is essential to ensure the coherence and quality of your thesis.

Bonus tip: find someone close to you who will accompany you throughout your thesis and support you no matter what. It will add a unique flavour to your work. Marie is a perfect example, but unfortunately we cannot recommend her to you: she is already taken.

Abstract

This work is a collection of five projects which tackle different aspects of how to predict users' interests and behaviour with social networks and recommender systems. On social networks, we worked on the definition of similarity measures and on how to use them to predict characteristics of the nodes of a network.

We made two contributions:

• A novel similarity measure that is inspired by community detection (specifically by the modularity) and by a probabilistic view of the distribution of paths in a network (Chapter 3).

• A general method to learn similarity measures on large networks and to use them for node classification (Chapter 4).

On recommender systems, we focused on improving collaborative filtering methods, which are a family of recommender systems based on the modelling of user-item interactions. Collaborative filtering is a large field, and we made contributions in three distinct areas:

• We developed a method to update collaborative filtering models in real time, taking into account the fact that observed ratings form a biased sample of the complete set of ratings (Chapter 6).

• We studied how to exploit the order in which users interacted with items to provide more accurate recommendations, and we showed that gated recurrent neural networks are a powerful model for collaborative filtering (Chapter 7).

• We proposed a method to cluster items while training a collaborative filtering model, which can reveal information about the characteristics of the items, but above all can speed up the recommendation process by reducing the number of items that must be considered for each recommendation (Chapter 8).

Recently, social networks and recommender systems have been under fire for trapping us in filter bubbles, a sort of positive feedback loop that shows us more and more content that agrees with our opinions and filters out the rest.

Unfortunately, we were not able to tackle this problem directly, but we hope that the tools we made to understand social networks and recommender systems will be used not to create filter bubbles, but to fight them.


Contents

Contents
List of Symbols

1 Introduction
1.1 Semi-supervised learning on networks
1.2 Collaborative filtering
1.3 Connecting the dots
1.4 Structure of the thesis
1.5 List of publications

I Semi-supervised learning on networks

2 Introduction
2.1 Machine learning problems on networks
2.2 Similarity measures
2.3 Methods for semi-supervised learning
2.4 Challenges and contributions

3 Semi-supervised learning with path-based modularity maximisation
3.1 Introduction
3.2 Background on the modularity
3.3 Random Walk based Modularity
3.4 Derivation of the semi-supervised learning algorithm
3.5 Experiments
3.6 Related Work
3.7 Conclusions

4 Learning the similarity on very large graphs
4.1 Introduction
4.2 Efficient computation of sum of similarities
4.3 Related work
4.4 Problem definition
4.5 Learning the similarity
4.6 Avoiding the assumption of homophily
4.7 Experiments
4.8 Conclusion

II Collaborative Filtering

5 Introduction
5.1 Collaborative filtering
5.2 Families of collaborative filtering algorithms
5.3 Training matrix factorisation methods with stochastic gradient descent
5.4 Evaluating recommender systems
5.5 Contributions

6 Dynamic Matrix Factorisation with Priors on Unknown Values
6.1 Introduction
6.2 Standard matrix factorisation
6.3 Interpreting missing data
6.4 Objective functions
6.5 Experiments
6.6 Related Work
6.7 Conclusions

7 Sequence-based Collaborative Filtering with Recurrent Neural Networks
7.1 Introduction
7.2 Sequence-based collaborative filtering
7.3 Collaborative filtering with recurrent neural networks
7.4 Methods comparison
7.5 Short-term / long-term profile
7.6 Other variations of the RNN
7.7 Conclusion

8 Accelerating model-based collaborative filtering with item clustering
8.1 Introduction
8.2 Related Works
8.3 Method
8.4 Experiments
8.5 Discussion
8.6 Conclusion

9 General conclusion
9.1 Findings and contributions
9.2 Improving recommender systems
9.3 Concerns and perspectives

A Collaborative filtering based on sequences
A.1 Installation
A.2 Usage
A.3 Methods

Bibliography

List of Symbols

General

a, b, c, . . .   Scalar variables.
a, b, c, . . .   Vectors (upright lowercase bold).
A, B, C, . . .   Matrices (upright uppercase bold).
A, B, C, . . .   Sets.
$v_i$   ith element of the vector v. Equivalent to $[\mathbf{v}]_i$.
$m_{ij}$   Element (i, j) of the matrix M. Equivalent to $[\mathbf{M}]_{ij}$.
e   Vector of ones.
$\mathbf{e}_i$   Vector of zeros, with the exception of its ith element, which is equal to one.
I   Identity matrix (its size is generally inferred from the context).

Part I: Networks

n   Number of nodes.
N   Number of edges.
A   Adjacency matrix.
$d_{ii}$   Degree of node i: $d_{ii} = \sum_j a_{ij}$.
D   Diagonal matrix whose diagonal elements are the degrees of the nodes of the corresponding network.
$\tilde{\mathbf{A}}$   Symmetrically normalised adjacency matrix: $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$.
L   Laplacian matrix: $\mathbf{L} = \mathbf{D} - \mathbf{A}$.
$\mathbf{\Delta}$   Normalised laplacian matrix: $\mathbf{\Delta} = \mathbf{I} - \tilde{\mathbf{A}}$.
P   Transition probability matrix of a random walker on A: $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$.

Part II: Collaborative filtering

n   Number of users.
m   Number of items.
N   Number of observed interactions.
$r_{ui}$   Interaction between user u and item i. It can be a rating (1, 2, 3, . . .) or an implicit feedback, in which case it is equal to 1.
$\hat{r}_{ui}$   Estimated rating, or predicted interest of user u for item i.
$\mathcal{R}$   Set of observed interactions.
$\mathcal{R}_{u\bullet}$   Set of observed interactions of user u.
$\mathcal{R}_{\bullet i}$   Set of observed interactions of item i.
R   Matrix of observed interactions: $[\mathbf{R}]_{ij} = r_{ij}$ if $r_{ij} \in \mathcal{R}$, otherwise $[\mathbf{R}]_{ij} = 0$.

1 Introduction

The Internet changed the way we access information and art. It made almost everything available: any piece of music, any movie or book, is somewhere on the Internet. There is an even greater abundance of opinions, news, facts and falsehoods. But availability does not mean accessibility. The information that you want or need may be somewhere on the Internet, but you might not know where, and you may not even know it exists.

To access this wealth of information the machine learning community developed algorithms to sort information (search engines and recommender systems) and efficient ways to share it (online social networks). Those algorithms, and the mechanisms behind social networks, are the gatekeepers of the Internet, and in many ways they are the gatekeepers of our access to information. This is why we have to study them.

This work is not a reflection on how information should be made accessible, or on the societal concerns about the current approaches. It is only an attempt to solve some practical problems faced by the current algorithms. In particular, we are interested in algorithms whose aim is to predict the interests of users in order to retrieve relevant information.

This "information" that we suggest to the users can take several forms: it can be a movie for a Netflix user, a scientific paper for a researcher, a group in some social network, or a simple web page. In the following we will refer to all those objects of interest simply as "objects". Generally speaking, the algorithms that attempt to predict the interests of users base their predictions on the traces that the users leave, voluntarily or not, when they navigate on the Internet. Any interactions with some object (visiting, liking, sharing, buying) or with other users (befriending, chatting) constitute such traces. The intuitive basis of those algorithms is that if we can identify two users who have similar interests, the observed behaviour of one user can help to predict the behaviour of the other. The first goal is therefore to identify users with similar interests, and there are two main ways of doing so: using explicit links between users on social networks (friends have similar interests), or using behavioural similarities (if two users interact with the same objects, they must share the same interests).

The first approach is essentially covered by a research field called semi-supervised learning on networks [233, 41, 208], while the second is referred to as collaborative filtering [7, 176]. This work makes contributions in both semi-supervised learning and collaborative filtering. In the following we will introduce both fields of research, before highlighting some of the parallels that can be made between them.

1.1 Semi-supervised learning on networks

A common feature of Facebook is to make recommendations in the form “this page might interest you because three of your friends liked it”. This is the simplest use of a social network to predict the interests of a user: it is reasonable to assume that friends share common interests, and therefore that if somebody liked a given page, or belongs to a certain group, his friends might be interested as well [146].

In practice however, making the best use of the information given by a social network is a complex challenge. Not all friends are equally close for example.

Moreover, some friends of friends might actually have more in common with a user than their direct friends, and it is therefore interesting to consider the full network of interactions surrounding a user rather than only their direct friends.

In order to efficiently work on that problem, it is useful to conceive an abstract view of networks [19, 149, 62]. A network is a set of nodes connected by edges. The topology of the network is the way in which the nodes are connected by the edges (in stars, in circles, forming groups or chains, etc.). The nodes and edges can have certain attributes: for example, a node can belong to a certain class, or have a certain label, and the edges can have a direction or a weight. Generally, we assume that the attributes of the nodes are correlated with the topology of the network [41]. If this assumption is correct, the topology can help us predict some attributes of the nodes.

This abstract view of networks can easily be applied to social networks: nodes are users, edges are friendships, the attributes are the interests of the users, and we assume that there is a correlation between those interests and the topology of the social network. But this view can also be applied to many other domains: road networks, trophic networks, protein networks, etc. So even though we mostly focus on social networks, many of the tools developed in this work can be transposed to other domains [19].

In abstract terms, predicting the interests of users in a social network based on the observed interests of users surrounding them can be seen as the problem of predicting the attributes of the nodes of a network based on the known attributes of a few nodes and on the topology of the network. This is the problem called semi-supervised learning on networks (see Chapter 2).

An important step in semi-supervised learning is to define a measure of similarity between the nodes of the network, based on its topology [61]. One example of measure would be based on counting the smallest number of edges that must be visited to go from one node to the other: the shorter the path, the higher the similarity between the nodes. Of course, many other definitions are possible. In the following work, we present new similarity measures that improve on previously known ones, and are computable on very large networks (see Chapters 3 and 4). This will be helpful in predicting the attributes defined on nodes.


1.2 Collaborative filtering

Social networks are not the only way to identify users with similar interests.

If we have access to observations of the past behaviour of many users, it is possible to identify users who behaved similarly in the past, and guess that they must have similar interests, and thus might behave similarly in the future. In other words, any application that centralises the history of many users can start to make predictions about the future actions of these users. This technique is called collaborative filtering [74].

The basic approach is to measure a similarity between pairs of users based on the number of interactions that they have in common [52]. This kind of approach, however, has many problems related to its time complexity and to the difficulty of working when observations are scarce. Over the years, many variations have been proposed, and a popular approach that emerged is to learn for each user and each object a representation in some latent space, and to make recommendations based on the similarities between users and objects in that latent space [7, Chapter 3].

Many challenges remain, not the least of which is to reduce the time complexity of those algorithms in order to deal with larger and larger datasets. In production, a recommender system must handle new users and objects entering the system constantly, old items becoming obsolete, changing interests of users, and must be able to provide recommendations in real time based on the latest available information. We tackle all of those problems, and provide methods that improve the state of the art of collaborative filtering.

In particular, we present a way to handle new users and items in real time in order to always make recommendations based on up-to-date information (see Chapter 6). We also identify an efficient way to take into account the order in which users interact with the objects in order to provide more accurate recommendations (see Chapter 7). Finally, we explore a new approach to accelerate the recommendation process (see Chapter 8).

1.3 Connecting the dots

As we have seen, semi-supervised learning on networks and collaborative filtering can share the same goal, even though they handle different types of data. On the technical side, interesting parallels emerge as well. Both revolve strongly around the idea of similarity: similarity between users based on a network topology or based on common behaviour, similarity between users and objects with which they interact.

Moreover, the set of interactions of users with objects is often represented as a bipartite network, with edges connecting the users to the objects. This representation opens the way to use tools from network science in collaborative filtering, bringing the two fields closer yet. One might for example apply to that bipartite network the same kind of similarity measure used on social networks and interpret it as a similarity between the users and objects of the graph. Similarly, techniques of dimensionality reduction commonly used in collaborative filtering (matrix factorisation in particular) can be useful in some semi-supervised learning problems as well (like in Chapter 3). Another common feature of these two problems is the sparsity of the observed interactions, leading to very sparse networks.

Admittedly though, this kind of weak similarity can be found between many fields related to machine learning, and we might actually learn more by trying to identify why those fields differ rather than how they resemble each other. Firstly, the nature of the data of semi-supervised learning on networks and collaborative filtering is different. We said that the interactions used in collaborative filtering can be represented by a network, but that is only true up to a point: the interactions can carry a lot of information that might be poorly adapted for a network representation. For example, when we consider that the order of interactions is important information (as in Chapter 7), using a network representation becomes impractical.

Yet, the representation of interactions as a bipartite network is a common one, but even in that case, the typical dimensions of semi-supervised learning problems and collaborative filtering problems differ, and call for different solutions. Most notably, semi-supervised learning on networks focuses on situations where the number of potential interests for a node is small [233]. Each interest can generally be considered as an independent problem, and the complexity is mostly driven by the number of users and of links between those users. In collaborative filtering on the other hand, the number of objects of interest is an important driver of complexity, with catalogues of hundreds of thousands of items that must be treated in less than a second, and the methods must therefore comply with much stronger constraints of time complexity [47, 15].

Another distinction comes from the fact that networks in semi-supervised learning are often considered as static entities, while collaborative filtering is very much concerned with the ability to handle constantly changing users and items [192, 5, 20]. Admittedly, real social networks are also evolving constantly, but this problem seems to be less directly confronted in the semi-supervised learning literature.

In the following chapters we tackle several challenges that reveal the need for specialised solutions in both semi-supervised learning on networks and collaborative filtering.

1.4 Structure of the thesis

This work is divided into two parts. The first part concerns networks, and in particular the task of semi-supervised learning on networks. As said earlier, semi-supervised learning can, among other things, be used to identify the interests of people in social networks, but the range of applications extends far beyond social networks. In the following we favour an abstract treatment of the subject, which encourages a broader use of the findings. Chapter 2 is an introduction to network science and to semi-supervised learning in particular. It lays down the concepts necessary to the subsequent work, and identifies the challenges that have been tackled. In Chapter 3 we introduce an extension of a commonly used similarity measure that leads to better prediction than the original one in a task of semi-supervised learning. In Chapter 4 we propose a new semi-supervised learning algorithm which learns similarity measures based on available information, and is able to work on very large networks. The main interest of this new approach is that it avoids the time-consuming task of choosing an adapted similarity measure for a given problem.

The second part of this work concerns recommender systems, and in particular collaborative filtering algorithms. Chapter 5 gives an overview of the field and identifies the problems that we tackle in the subsequent chapters. Chapter 6 deals with the ability of collaborative filtering methods to adapt to a changing environment (where new users and new items regularly appear). Chapter 7 proposes a new family of methods based on recurrent neural networks exploiting the sequential structure of the data, and highlights some side effects of that choice. Finally, Chapter 8 presents a way to use clustering of the items in order to make faster recommendations.

1.5 List of publications

Here is a list of the publications submitted in the context of this work:

• The work presented in Chapter 3 has been published in:

Devooght, R., Mantrach, A., Kivimäki, I., Bersini, H., Jaimes, A., & Saerens, M. (2014, April). Random walks based modularity: Application to semi-supervised learning. In Proceedings of the 23rd International Conference on World Wide Web (pp. 213-224). ACM.

• The work presented in Chapter 6 is the result of a collaboration with the Yahoo! Research lab of Barcelona, and has been published in:

Devooght, R., Kourtellis, N., & Mantrach, A. (2015, August). Dynamic matrix factorization with priors on unknown values. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 189-198). ACM.

• The work presented in Chapter 7 will be partly published in:

Devooght, R., & Bersini, H. (2017, July). Long and Short-Term Recommendations with Recurrent Neural Networks. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM.

Moreover, we want to mention that the work presented in Chapter 4 is the result of a collaboration with Peter Staar and Costas Bekas from the IBM research lab of Zurich.


I Semi-supervised learning on networks


2 Introduction

Before Deep Learning became all the rage, the A.I. community had a crush on networks. A network is a simple abstraction: a collection of objects called nodes, interlinked by edges that model two-way relationships between the nodes. Using this abstraction, networks can be noticed everywhere.

The nodes can be cities and the edges roads, and you have a road network. If the nodes are proteins and the edges are chemical interactions, you can describe the protein interactions that make our cells work. If the nodes are living beings, the edges can indicate predation, parasitism and co-dependencies, and you can describe a complex ecosystem. Brains, financial networks and the power grid are all systems that can be described with networks. Then of course, there are the networks most emblematic of the past decades: the Internet, the web, and online social networks.

Using this abstraction allows us to move away from the particularities of each domain, in order to identify common traits. It offers simple mathematical tools, from which a new understanding of the underlying subjects can arise, and it leads to the development of algorithms solving problems across numerous domains (see sidenote 1).

Sidenote 1: This work only explores a very narrow slice of the diverse science of networks. For a broader view of the subject we recommend turning to a few books. Network Science, by Albert-Laszlo Barabasi [19], is a beautifully illustrated introduction to the questions related to topology, filled with concrete applications. Networks: An Introduction, by Mark Newman [149], gives a great overview of the fundamental tools of network analysis. Finally, Network Data and Link Analysis, by François Fouss, Marco Saerens and Masashi Shimbo [62], is a vast collection of advanced techniques and algorithms related to networks.

Thinking in terms of networks has generated multiple research questions, such as, to name a few, how to identify important nodes, how to predict the evolution of a network or how to find the shortest path between two nodes. In this work, we will mostly focus on one fundamental question, central to the whole study of networks: what does the structure of a network reveal about the similarity, or closeness, of two of its nodes? In the following chapters we establish a new measure of similarity between pairs of nodes in a network, as well as a method to identify an appropriate measure of similarity for a given network. Scalability is an important aspect of modern network algorithms, and we demonstrate the ability of the methods that we present to work with large networks.

In this introductory chapter, we present some of the applications of machine learning on networks (Section 2.1); we then discuss how the notion of similarity measure is central to those problems (Section 2.2), before diving deeper into how to use those similarity measures in a specific task called node classification (Section 2.3); we finally describe some of the limitations of current techniques, and which of those limitations we attempt to alleviate with our work (Section 2.4).


2.1 Machine learning problems on networks

In this section we describe some of the most prominent problems for machine learning on networks (see sidenote 2). In this work we only consider situations where the only available information is the structure of the network (i.e. the adjacency matrix).

Sidenote 2: For an in-depth look at machine learning on networks, one can look at the recent work of T. C. Silva and L. Zhao [201].

There are many situations where multiple sources of information exist, for example when some features of the nodes are available, but combining those sources of information is a research area of its own, which is outside the scope of this work.

2.1.1 Clustering

[Figure 2.1: This network is the result of the observation by Wayne W. Zachary of the interactions between the members of a karate club [226]. During this period of observation, a conflict arose between the president (node 34) and the instructor of the club (first node), which resulted in the creation of a new club. This network has been the canonical test for graph clustering ever since, the goal being to predict who joined the new club based only on the structure of the network. Here, the members of the new club are represented by white nodes. Source of the image: commons.wikimedia.org/wiki/File:Karate_Cuneyt_Akcora.png]

Clustering is the task of dividing a data set into meaningful groups, or clusters. By meaningful, we generally mean that elements that are in the same cluster should have some common properties, and that they should be significantly different from elements in other clusters. The blurry language here is inevitable, because finding a mathematically rigorous definition of what is a good clustering is the first challenge of the clustering problem, and the many propositions are debated at length [120, 186]. Once a definition of clustering quality is chosen, the second challenge is to find an efficient algorithm to find an optimal clustering according to that definition.

Clustering is of course a central problem in machine learning and is not at all limited to networks, but its application to networks requires a special treatment. On social networks, clustering has an especially intuitive meaning: it aims to detect communities (see sidenote 3). The assumption is that the structure of interactions captured by a social network is enough to identify distinct groups of people. For example, Blondel et al. [29] have shown that the network of mobile communications of two million customers in Belgium can be used to identify the French-speaking and Dutch-speaking communities. The canonical example of social network clustering is the Zachary karate club network described in Figure 2.1.

Sidenote 3: We are actually not acknowledging the diversity of network science by reducing clustering to the detection of communities. There is a collection of fields related to clustering, each with their own specificities: community detection, graph partitioning, hierarchical clustering, etc. See for example [19, Chapter 9] for a brief overview.

On networks, the definition of the quality of a clustering generally revolves around the idea that nodes within a cluster should be densely connected, and that few links should exist between clusters. In practice however, neatly separated clusters are very rarely observed. Some regions of the network might be more densely connected than others, but the boundary is often hard to draw, and often communities can overlap [119]. Many competing approaches exist, and the work of Mark Newman [150], S.E. Schaeffer [186] and Leskovec et al. [120] give an overview of the field.

A popular definition of clustering quality is the modularity, introduced by Mark Newman [151]. In Newman’s words [151]:

The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.

An equivalent network being here a network with the same number of nodes and edges, and the same degree distribution (i.e. a node with x edges in one network must also have x edges in an equivalent network). The motivation is the assumption that given a number of nodes and a degree distribution, the clustering that best explains where the edges are is the best clustering. The modularity formula is briefly defined in the sidenote below, then explained in greater detail in Section 3.2.

Sidenote 4: Let $\mathbf{u}_i \in \mathbb{R}^k$ be an indicator vector for the node i, defined such that $[\mathbf{u}_i]_j = 1$ if the ith node belongs to the jth cluster, and $[\mathbf{u}_i]_j = 0$ otherwise. The modularity Q with k clusters is then defined as
$$Q = \frac{1}{2N} \sum_{i,j}^{n} \left( a_{ij} - \frac{d_{ii}\, d_{jj}}{2N} \right) \mathbf{u}_i \mathbf{u}_j^T$$
where $d_{ii}$ is the degree of node i and N is the number of edges in the network [45].
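To make the definition concrete, here is a minimal NumPy sketch of the modularity Q of a hard clustering, following the formula in the sidenote above; the function name and the toy graph are only illustrative, not taken from the thesis.

```python
import numpy as np

def modularity(A, labels):
    # Q = (1 / 2N) * sum over node pairs in the same cluster of
    # (a_ij - d_ii * d_jj / 2N), where 2N is the sum of all degrees.
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                                  # degrees d_ii
    two_N = d.sum()                                    # 2N (each edge counted twice)
    same_cluster = labels[:, None] == labels[None, :]  # 1 when i and j share a cluster
    B = A - np.outer(d, d) / two_N                     # modularity matrix entries
    return (B * same_cluster).sum() / two_N

# Toy example: two triangles joined by a single edge, split into their triangles.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1
print(modularity(A, np.array([0, 0, 0, 1, 1, 1])))     # ~0.36, a good split
```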

2.1.2 Link prediction

Another common application of machine learning on networks is to predict unobserved links in the network [122, 129]. Often, the available network is either incomplete (think for example of the protein-protein interaction network where most of the possible interactions are probably yet to be observed), or evolving (social networks for example), and in either case it could be useful to predict the missing, or soon-to-arrive, links.

A common approach to the link prediction problem is to make predictions based on the number of neighbours that two nodes have in common. Variations of this approach (such as the Adamic/Adar score [2]) work well on social networks [122], even though they are very simple.

However, those methods can only predict links between nodes that have at least one common neighbour. To avoid this limitation, one can turn to methods based on random walks [16] or matrix factorization [141].
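As an illustration of these neighbour-counting scores (this sketch is ours, not the thesis'), the Adamic/Adar score of a pair of nodes sums, over their common neighbours z, the weight 1/log(degree(z)); in matrix form it can be computed for all pairs at once:

```python
import numpy as np

def adamic_adar_scores(A):
    # score(i, j) = sum over common neighbours z of i and j of 1 / log(degree(z))
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    # weight of an intermediate node z; a node of degree <= 1 cannot be a common neighbour
    w = np.where(d > 1, 1.0 / np.log(np.maximum(d, 2.0)), 0.0)
    S = A @ np.diag(w) @ A        # (A diag(w) A)_{ij} sums w_z over common neighbours z
    np.fill_diagonal(S, 0.0)      # a node is not a link-prediction candidate for itself
    return S
```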

We do not directly tackle the link prediction problem in this work. However, the second part of the work is concerned with the problem of collaborative filtering, which is closely related to link prediction. As we explain in Chapter 5, collaborative filtering is the problem of producing recommendations for a user based on observed interactions between users and items. One way of approaching collaborative filtering is to see it as the problem of predicting links in the bipartite graph formed by the users and items. As we show in Chapter 5, methods based on neighbour counts, random walks and matrix factorisation have all been widely applied to the collaborative filtering problem.

We recommend the work of Liben-Nowell et al. [122] and Lü and Zhou [129] for an overview of the field.

2.1.3 Node classification

Notations. In the following, n represents the number of nodes in a graph, and $\mathcal{N} = \{1, 2, \ldots, n\}$ is the set of nodes of the graph. The structure of the graph is described by the adjacency matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ whose element $a_{ij}$ is 1 if there is a link between nodes i and j, and zero otherwise. In Chapter 3 we will also use the adjacency matrix to encode weights on the edges, but everywhere else we work with unweighted graphs. We also commonly use the laplacian matrix defined by $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where D is the diagonal matrix whose element $d_{ii} = \sum_{j \in \mathcal{N}} a_{ij}$ is the degree of node i. Finally, we define the symmetrically normalised adjacency matrix as $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ and the normalised laplacian as $\mathbf{\Delta} = \mathbf{I} - \tilde{\mathbf{A}}$, where I is the n × n identity matrix.

Node classification is the task of determining the class (or label) of each node of a network, when the class of some nodes is known. For example, the goal might be to determine the age of all the members of a social network when the age of only 10% of them is known, or to categorise scientific papers based on the citation network and a few hand-made categorisations.


Node classification is the main problem that we treat in this work. Hence, we provide a more rigorous definition.

We consider m classes (or labels), and each node in $\mathcal{N}$ belongs to only one class. The matrix $\mathbf{Y} \in \mathbb{R}^{n \times m}$ encodes the class of each node: its element $y_{ij}$ is 1 if and only if the node i belongs to the class j. $\mathcal{K}$ is the set of nodes whose class is known, and the goal of the classification is to predict the classes of unlabelled nodes $u \in \mathcal{N} \setminus \mathcal{K}$ based on the known classes $y_{ij}\ \forall i \in \mathcal{K},\ j \in \{1, \ldots, m\}$. The matrix $\hat{\mathbf{Y}}$ contains the class predictions.

In the case where m = 2, we can simplify the notations and use the vectors $\mathbf{y}$ and $\hat{\mathbf{y}}$ instead of $\mathbf{Y}$ and $\hat{\mathbf{Y}}$, with $y_i = 1$ if the node i belongs to the first class, and $y_i = 0$ if it belongs to the second class or if its class is unknown. It is common to treat multi-class classification problems as a combination of simpler binary classification problems. In the following we will present methods only for the binary classification problem, and in Section 4.5.1 we briefly discuss how to extend those to treat multiple classes.

2.1.4 Seed set expansion

Seed set expansion is a variation of the node classification problem that probably better reflects actual applications of machine learning on the web and on social networks. We start with a set of nodes (called the seed set) that are known to belong to the same community or to the same class, and the goal is to find which other nodes belong to that community. In other words: we aim to expand the seed set.

The main difference with the classification problem is that we are not given nodes that explicitly do not belong to the community: every node outside of the seed set may or may not belong to it. This prevents us from using traditional classification methods that need to learn from both positive and negative examples. The seed set expansion problem can be separated into two sub-problems: (1) ranking nodes according to their likelihood of belonging to the same class as the seed, and (2) finding a cut-off point in that ranking. The ranking is often performed by starting truncated random walks or random walks with restart from the seed set [9, 104]. Choosing a cut-off point generally consists in minimising some heuristic about the quality of the community [9, 220, 178], and we find here the same kind of quality measures as those used to evaluate a clustering. However, the cut-off point is sometimes imposed, for example by the number of items presented in an interface, or for speed and memory reasons, and the seed set expansion problem is then reduced to finding a good ranking of the nodes.
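As a rough sketch of the ranking step, one can run a random walk with restart from the seed set by power iteration; this assumes a connected, unweighted graph, and the function and parameter names are ours, not from the thesis.

```python
import numpy as np

def rank_from_seed(A, seed, alpha=0.15, n_iter=100):
    # At each step the walker follows a random edge with probability 1 - alpha,
    # or jumps back to the seed set with probability alpha.
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    restart = np.zeros(len(A))
    restart[list(seed)] = 1.0 / len(seed)     # uniform restart distribution over the seed set
    x = restart.copy()
    for _ in range(n_iter):
        x = (1 - alpha) * (P.T @ x) + alpha * restart
    return np.argsort(-x)                     # candidate nodes, most promising first
```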

One application is to find sub-graphs of manageable size in very large graphs. In this case the crucial part is to find the right cut-off point [178]. Others have combined seed set expansion with the unsupervised selection of seed sets to perform graph clustering [220].

However, in this work we use it as a special case of node classification on social networks, where the goal is to predict which user of an online social network might join a given group. In most online social networks, a user can join a group (by "liking" a page for example), but he cannot explicitly refuse to join a group. Therefore, it is often impossible to know if a user is not part of a group because of a lack of interest, or because he does not know about the group.


2.2 Similarity measures

When working with networks, we very often need some definition of distance or similarity between two nodes of a network. With an appropriate measure of distance or similarity, the problems presented in the previous section become trivial or can at least be solved by traditional machine learning methods. For example, once a distance is defined between any pair of nodes, one can perform graph clustering with a distance-based technique such as k-means [225]. In the node classification task, a simple approach is to assign to each node the class of the nearest labelled node, or use any other variation of the nearest neighbours techniques. In some problems of link prediction, the best approach is simply to find the most similar nodes that are not yet connected [129].

Having a good similarity measure seems to solve a large range of machine learning problems on networks, and is therefore central in link prediction [129], community detection [60], node classification [61] and seed set expansion [9].

The difficulty, of course, lies in how to define such a similarity measure (see sidenote 5). The adjacency matrix itself is of course the simplest measure of similarity: two nodes are more similar if they are connected than if they are not. It is however a very coarse-grained measure, giving the same similarity to each pair of disconnected nodes, even if some may be only two edges apart and some much further away.

Sidenote 5: Distances, kernels and similarity measures. A similarity measure can be any function that takes as input two nodes and outputs a scalar value. The absolute value of the similarity between two nodes generally does not have a meaning; it is only meaningful when compared to other similarities. Distances, on the other hand, must comply with stricter rules: a distance must be nonnegative and symmetrical, the distance of a node to itself must be zero, and it must satisfy the triangle inequality [101]. Kernel functions compute the dot product of two nodes in some high-dimensional space [90, 193]. Distances and kernels can both be used as similarity measures, and many works restrict themselves to using those because they offer many theoretical guarantees [63, 61, 105, 203], but in this work we use the broadest definition of similarity measures.

A better definition is to consider the minimum number of edges that needs to be crossed in order to go from one node to the other. This is called the shortest-path distance. However, one drawback of this definition is that it is still a very coarse-grained measure: it puts many nodes at exactly the same distance. One way to improve this is to also consider the number of possible paths of each length between two nodes (i.e. the amount of connectivity). This way, two nodes connected by three paths of length two are considered to be closer than two nodes connected by only one path of length two.
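For illustration only (this is a Katz-style path count, not one of the measures introduced later in this chapter), a short sketch of the path-counting idea: combine the number of paths of every length up to a maximum, giving more weight to shorter paths.

```python
import numpy as np

def path_count_similarity(A, max_len=3, decay=0.5):
    # S = sum_{k=1..max_len} decay^k * A^k, where (A^k)_{ij} counts the
    # paths of length k between nodes i and j; shorter paths contribute more.
    A = np.asarray(A, dtype=float)
    S = np.zeros_like(A)
    A_power = np.eye(len(A))
    for k in range(1, max_len + 1):
        A_power = A_power @ A
        S += (decay ** k) * A_power
    return S
```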

Similarity measures are functions of the form $s : \{1, \ldots, n\}^2 \mapsto \mathbb{R}$, mapping a pair of nodes to a scalar value, but we often represent them by the matrix S containing the similarity between any pair of nodes. This matrix is called the similarity matrix, and its element $s_{ij}$ is the similarity between nodes i and j.

Plenty of definitions of similarity have been proposed over the years, using different intuitions, and it is impossible to affirm that one is more reasonable than the others. In practice, the best similarity measure is the one that leads to the best results or the one that is the most practical to compute, and this depends on the problem at hand. Fouss et al. [61] review many of the existing measures, and in the following section we review some representative methods.

2.2.1 Diffusion and random walks

Imagine that one of the nodes of the network is maintained at a certain temperature, while the other nodes are initially uniformly colder. As time passes, the heat flows through the edges of the network, warming up the closest nodes first, then the nodes further away. After some time, the closest node will be the warmest, and the temperature could be used as a measure of similarity to the node maintained at a high temperature [103]. This is the kind of intuition used to define diffusion-based similarity measures. One such measure is the laplacian exponential diffusion kernel, defined by Kondor and Lafferty [105]. The similarity between any pair of nodes i and j is given by the element (i, j) of the matrix $\mathbf{S}_{\exp} = \exp(-t\mathbf{L})$, where exp is the matrix exponential, L is the laplacian matrix (defined earlier), and t is a scalar parameter that can be interpreted as the duration of the diffusion.

Another common metaphor used to define similarity measures is to imagine random walkers moving on the network. A popular measure is the random walk with restart [162, 215, 104], also known as the personalized PageRank because of its relation to Google's PageRank algorithm [161]. Imagine someone starting in one node i, and walking to one randomly chosen neighbour. He then walks to a randomly chosen neighbour of that new node, and so on. The probability that he arrives in a given node j after t steps can be used as a similarity between the nodes i and j [210]. However, after a long enough time, the probability that the walker is in one node only depends on the edges in the direct neighbourhood of that node, making the measure useless (see sidenote 6). The random walk with restart slightly alters the definition of the process in order to reach a useful steady state: at each step, the walker has a probability α of jumping back to the initial node instead of walking to a direct neighbour. The probability distribution of finding the walker in any node after a time t is given by:

$$\mathbf{x}(0) = \mathbf{e}_i, \qquad \mathbf{x}(t) = (1 - \alpha)\,\mathbf{P}^T\mathbf{x}(t-1) + \alpha\,\mathbf{e}_i \qquad (2.1)$$

where $\mathbf{e}_i$ is the column vector of size n whose elements are all zeros, except for the ith element, which is equal to one. It can be shown that the steady-state distribution is:

$$\mathbf{x}(\infty) = \alpha\left(\mathbf{I} - (1 - \alpha)\,\mathbf{P}^T\right)^{-1}\mathbf{e}_i \qquad (2.2)$$

$(\mathbf{I} - (1 - \alpha)\mathbf{P}^T)^{-1}$ is an asymmetric matrix, and by convention we define the random walk with restart similarity $\mathbf{S}_{\mathrm{PPR}}$ as the transpose of that matrix [61]:

$$\mathbf{S}_{\mathrm{PPR}} = \left(\mathbf{I} - (1 - \alpha)\,\mathbf{D}^{-1}\mathbf{A}\right)^{-1} \qquad (2.3)$$

Sidenote 6: The matrix of transition probabilities of a random walker is $\mathbf{P} = \mathbf{D}^{-1}\mathbf{A}$, which means that if $x_i(t)$ is the probability of a random walker being in node i at time t, its probability distribution over the nodes at time t+1 is given by $\mathbf{x}(t+1) = \mathbf{P}^T\mathbf{x}(t)$. It is easy to see that the distribution $\mathbf{x}(\infty)$ with $x_i(\infty) = d_{ii}/(2N)$ is stable for a symmetric network: $\mathbf{P}^T\mathbf{x}(\infty) = \mathbf{A}\mathbf{D}^{-1}\mathbf{x}(\infty) = \mathbf{A}\mathbf{e}/(2N) = \mathbf{x}(\infty)$. For a discussion of why this is the only stable distribution and the conditions under which an initial distribution converges to it, as well as for a general introduction to discrete Markov chains, we recommend the work of J.R. Norris [157].
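A minimal SciPy sketch of the two measures above, for a small dense adjacency matrix with no isolated nodes; the parameter values are arbitrary and the function name is our own.

```python
import numpy as np
from scipy.linalg import expm, inv

def diffusion_and_rwr_similarities(A, t=1.0, alpha=0.15):
    # S_exp = exp(-t L): laplacian exponential diffusion kernel.
    # S_PPR = (I - (1 - alpha) D^{-1} A)^{-1}: random walk with restart (Equation 2.3).
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    L = np.diag(d) - A
    S_exp = expm(-t * L)
    S_ppr = inv(np.eye(len(A)) - (1 - alpha) * np.diag(1.0 / d) @ A)
    return S_exp, S_ppr
```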

2.2.2 Eigenvalues transformation

Smola and Kondor [203] offer another justification for the similarity measures. They study regularisations of the normalised graph laplacian ($\mathbf{\Delta} = \mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$), obtained by applying a scalar function to its eigenvalues. We note $\mathbf{\Delta} = \sum_{i=1}^{n} \lambda_i \mathbf{v}_i \mathbf{v}_i^T$ the decomposition of the normalised laplacian in eigenvalues $\{\lambda_i\}$ and eigenvectors $\{\mathbf{v}_i\}$. Those eigenvalues are contained in [0, 2] (see [203]).

In their work, Smola and Kondor show that the smaller eigenvalues of $\mathbf{\Delta}$ are associated to smoother eigenvectors. It is easy to show the same property for the laplacian L. First, notice that for any vector y, we have:

$$\mathbf{y}^T\mathbf{L}\mathbf{y} = \mathbf{y}^T(\mathbf{D} - \mathbf{A})\mathbf{y} \qquad (2.4)$$
$$= \sum_i y_i^2 \sum_j a_{ij} - \sum_{i,j} y_i y_j a_{ij} \qquad (2.5)$$
$$= \frac{1}{2}\sum_{i,j \in \mathcal{N}} a_{ij}(y_i - y_j)^2 \qquad (2.6)$$

Equation 2.6 is known as Geary's contiguity ratio [68] and is often used as a measure of how smooth a vector y is with regards to the network structure. In other words, y is smooth if the product $\mathbf{y}^T\mathbf{L}\mathbf{y}$ is small. Let us now decompose the laplacian in eigenvectors and eigenvalues: $\mathbf{L} = \sum_{i=1}^{n} \lambda'_i \mathbf{u}_i \mathbf{u}_i^T$. The smoothness of an eigenvector $\mathbf{u}_j$ is given by:

$$\mathbf{u}_j^T\mathbf{L}\mathbf{u}_j = \mathbf{u}_j^T\left(\sum_{i=1}^{n} \lambda'_i \mathbf{u}_i \mathbf{u}_i^T\right)\mathbf{u}_j \qquad (2.7)$$
$$= \sum_{i=1}^{n} \lambda'_i (\mathbf{u}_i^T\mathbf{u}_j)^2 \qquad (2.8)$$

Using the fact that the eigenvectors form an orthonormal basis, we find:

$$\mathbf{u}_j^T\mathbf{L}\mathbf{u}_j = \lambda'_j, \qquad (2.9)$$

which makes it clear that smoother eigenvectors of the laplacian are associated to smaller eigenvalues. The same can be shown for the normalised laplacian.

Based on this observation, Smola and Kondor suggest to use the eigenvectors of the normalised laplacian to build a similarity measure by giving more weight to the smoother eigenvectors. If $r : [0, 2] \mapsto \mathbb{R}$ is a monotonically increasing function defined on [0, 2], we can define a similarity measure S as:

$$\mathbf{S} = \sum_{i=1}^{n} r^{-1}(\lambda_i)\,\mathbf{v}_i\mathbf{v}_i^T \qquad (2.10)$$

where $r^{-1}$ is the pseudo-inverse of r ($0^{-1}$ maps to 0).

[Figure 2.2: Two scatter plots, (a) without unlabelled data and (b) with unlabelled data. In the first image, only labelled samples are visible. Based only on labelled samples, the red line is a reasonable classification boundary. In the second image however, the unlabelled samples reveal structure in the data, and the red line now seems to be a very bad choice. Semi-supervised methods attempt to use both labelled and unlabelled data to deal with similar cases.]

Many well-known similarity measures can be expressed this way given the right r function:

• $r(\lambda) = 1 + \sigma\lambda$: the regularised laplacian ($\mathbf{S} = (\mathbf{I} + \sigma\mathbf{\Delta})^{-1}$),
• $r(\lambda) = e^{\sigma\lambda/2}$: the exponential diffusion ($\mathbf{S} = \exp(-\frac{\sigma}{2}\mathbf{\Delta})$),
• $r(\lambda) = (\alpha - \lambda)^{-p}$ with $\alpha > 2$: the p-step random walk ($\mathbf{S} = (\alpha\mathbf{I} - \mathbf{\Delta})^{p}$).
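A small sketch of Equation 2.10 on a toy graph, writing each of the three r functions above directly as $r^{-1}$; the function names, the toy graph and the parameter values are illustrative choices, not taken from the thesis.

```python
import numpy as np

def eigen_similarity(A, r_inv):
    # S = sum_i r^{-1}(lambda_i) v_i v_i^T, built from the eigendecomposition
    # of the normalised laplacian Delta = I - D^{-1/2} A D^{-1/2}.
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    Delta = np.eye(len(A)) - A * np.outer(d_inv_sqrt, d_inv_sqrt)
    lam, V = np.linalg.eigh(Delta)            # eigenvalues lie in [0, 2]
    return V @ np.diag(r_inv(lam)) @ V.T

A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)  # path graph
sigma, p, alpha = 1.0, 2, 3.0
S_reg  = eigen_similarity(A, lambda lam: 1.0 / (1.0 + sigma * lam))  # regularised laplacian
S_diff = eigen_similarity(A, lambda lam: np.exp(-sigma * lam / 2))   # exponential diffusion
S_walk = eigen_similarity(A, lambda lam: (alpha - lam) ** p)         # p-step random walk
```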

Many other works define similarity measures on graphs as a transformation of the eigenvalues of the (normalised) laplacian matrix, with applications in node classification [42, 98] and link prediction [109].

2.3 Methods for semi-supervised learning

Node classification and seed set expansion fall in the category of semi-supervised learning. Semi-supervised learning is used to describe machine learning methods that learn not only from labelled samples, but also from unlabelled samples.

Often, unlabelled samples are much more numerous and easier to gather than labelled ones, and those unlabelled samples can be a useful source of information as they can reveal the structure of the data. The classical example is the two-moons dataset (see for example [23]), where points are spread around two curves in a 2D plane. As shown in Figure 2.2, seeing the unlabelled samples reveals the whole density function and helps the classification.


In the context of networks, semi-supervised learning refers to methods that use the whole structure of a partially labelled network to predict the missing labels. In the following, we mostly work with datasets that are naturally expressed as networks. However, it is worth knowing that many semi-supervised methods are designed for non-network data such as the two-moons problem, but create a network in a pre-processing step before solving the problem in its network form, for which many semi-supervised algorithms exist (see e.g. [208, Chapter 2]).

In the following we describe some foundational and state-of-the-art methods for semi-supervised learning on networks (see sidenote 7).

Sidenote 7: For a more thorough review of existing network-based semi-supervised learning methods, one can turn to the works of Zhu [233], Subramanya and Talukdar [208], and Silva and Zhao [201, Chapter 7]. Although we only discuss semi-supervised learning on networks in this work, semi-supervised learning is a vast field of research that extends well beyond network-based methods (see e.g. [41, 1, 233]). In other domains it might not always be referred to as semi-supervised learning. In neural networks for example, unsupervised layer-by-layer pre-training before supervised fine-tuning can be seen as a semi-supervised approach [57].

2.3.1 Smoothness optimization

A common approach is to express the labelling as the solution of an optimisation problem that takes into account two aspects: (1) consistency with the known labels and (2) smoothness of the labelling on the graph. Those two aspects can take multiple definitions, but we present here one version described in [62, Chapter 6.2], whose definition of smoothness is based on Geary's contiguity ratio [68]. The optimisation problem is stated as follows:

$$\min_{\hat{\mathbf{y}}}\ \sum_{i \in \mathcal{K}} (\hat{y}_i - y_i)^2 + \sigma\,\frac{1}{2}\sum_{i,j \in \mathcal{N}} a_{ij}(\hat{y}_i - \hat{y}_j)^2 \qquad (2.11)$$

The second term ensures the smoothness of the labelling by rewarding direct neighbours that have the same class (that kind of regularisation is also used in nonparametric statistics, under the name of roughness penalty [80]). The parameter σ is used to balance the two terms (consistency and smoothness). This optimisation problem can be expressed in a matrix form once we introduce the diagonal matrix $\mathbf{\Gamma}$ whose elements $\gamma_{ii}$ are 1 if node $i \in \mathcal{K}$, and 0 otherwise. Problem 2.11 becomes:

$$\min_{\hat{\mathbf{y}}}\ (\hat{\mathbf{y}} - \mathbf{y})^T\mathbf{\Gamma}(\hat{\mathbf{y}} - \mathbf{y}) + \sigma\,\hat{\mathbf{y}}^T\mathbf{L}\hat{\mathbf{y}} \qquad (2.12)$$

The optimal labelling can be found by computing the gradient with regards to $\hat{\mathbf{y}}$ and setting it to zero (see more details in [62, Chapter 6.2]):

$$\hat{\mathbf{y}} = (\mathbf{\Gamma} + \sigma\mathbf{L})^{-1}\mathbf{y} \qquad (2.13)$$

Notice that although the approach was completely different, the matrix $(\mathbf{\Gamma} + \sigma\mathbf{L})^{-1}$ is very similar to the regularised laplacian defined in Section 2.2.2, and can also be interpreted as a similarity measure.
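A compact NumPy sketch of Equation 2.13, assuming every connected component contains at least one labelled node so that the matrix to invert is non-singular; the function and argument names are ours.

```python
import numpy as np

def smoothness_labelling(A, y, known, sigma=1.0):
    # Solves (Gamma + sigma * L) y_hat = y, the closed form of Equation 2.13.
    # y: label vector (0 for unlabelled nodes); known: boolean mask of labelled nodes.
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A                    # laplacian L = D - A
    Gamma = np.diag(np.asarray(known, dtype=float))   # gamma_ii = 1 if the label of i is known
    return np.linalg.solve(Gamma + sigma * L, np.asarray(y, dtype=float))
```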

Some methods took a similar approach, but expressed the smoothness in different ways. Fang et al. [58] propose another definition that leads them to an iterative procedure similar to the one described in Section 2.3.2, but using a different diffusion matrix (closer to the random walk with restart [162]).

Orbach and Crammer [158] proposed an extension of the optimisation problem described in Equation 2.12 that associates to each node a measure of confidence about its label. Nodes with lower confidence have a lower impact on the labelling of their neighbours (the confidence levels are learned simultaneously with the labelling).

Smoothness can also be defined in terms of graph wavelets instead of the laplacian matrix [199, 56]. Without going into the details, graph wavelets are a family of labellings that are localised over the edges of the graph and in the spectral domain of the laplacian [82]. Using them allows one to define labellings that are locally smooth (as opposed to the solutions of Problem 2.11, which are "globally" smooth). Shuman et al. [199] report however that the benefit of that method is small with regards to its complexity.

2.3.2 Iterative processes

Another interpretation of semi-supervised methods is to think of them as iterative processes. Labelled nodes are first used to label their direct neighbours, and then this information is transferred - diffused - to their neighbours and so on until the whole network is labelled.

In their work Learning with local and global consistency, Zhou et al. [231] propose the following approach:

1. Choose a diffusion matrix M that defines how the labels flow through the network at each step. They suggest to use $\mathbf{M} = \tilde{\mathbf{A}}$ (the symmetrically normalised adjacency matrix).

2. Start with the initial labelling $\hat{\mathbf{y}}_0 = \mathbf{y}$, and iterate using the following formula: $\hat{\mathbf{y}}_t = \alpha\mathbf{M}\hat{\mathbf{y}}_{t-1} + (1 - \alpha)\mathbf{y}$, with 0 < α < 1. At each step, the predicted labels are diffused to their direct neighbours and the known labels are reinforced.

If the spectral radius of αM is smaller than 1, this iterative process converges, and the equilibrium state can be expressed in a closed form:

$$\hat{\mathbf{y}} = (\mathbf{I} - \alpha\mathbf{M})^{-1}\mathbf{y} \qquad (2.14)$$

As with the previous method, $(\mathbf{I} - \alpha\mathbf{M})^{-1}$ can be interpreted as a similarity matrix. In this work we will refer to this similarity matrix by $\mathbf{S}_{\mathrm{LLGC}}$ (in reference to "Learning with Local and Global Consistency"):

$$\mathbf{S}_{\mathrm{LLGC}} = (\mathbf{I} - \alpha\tilde{\mathbf{A}})^{-1} \qquad (2.15)$$
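The iteration is short enough to write out; a sketch assuming an unweighted graph with no isolated nodes, with a fixed number of iterations as an arbitrary stopping rule (names are ours).

```python
import numpy as np

def llgc_propagation(A, y, alpha=0.5, n_iter=100):
    # y_hat_t = alpha * M y_hat_{t-1} + (1 - alpha) * y, with M = D^{-1/2} A D^{-1/2}.
    # Converges to (1 - alpha) * (I - alpha M)^{-1} y, proportional to Equation 2.14.
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    M = A * np.outer(d_inv_sqrt, d_inv_sqrt)
    y = np.asarray(y, dtype=float)
    y_hat = y.copy()
    for _ in range(n_iter):
        y_hat = alpha * (M @ y_hat) + (1 - alpha) * y
    return y_hat
```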

We easily notice the relationships between this iterative method and the similarity measures based on diffusion and random walks. Many other works have proposed iterative methods following different diffusion rules [235, 58, 18, 221, 210, 39]. Notice however that they do not always have a closed form solution like Equation 2.15.

2.3.3 Sum of similarities

Smoothness optimisation and iterative diffusion can seem very different, but they are in fact often closely related. Methods based on smoothness optimisation are often solved by iterative approaches [58, 158], and methods motivated by an iterative process can sometimes be expressed as the solution of some smoothness optimisation problem [231].

Moreover, we observe in the case of Equations 2.13 and 2.14 that their closed form solutions have the same form:

$$\hat{\mathbf{y}} = \mathbf{S}\mathbf{y} \qquad (2.16)$$

where S is some similarity matrix. Many other approaches lead to the same formulation [62, Chapter 6], and they are all gathered under the term sum of similarities. Indeed, the prediction for a node i is the sum of all known labels $y_j$, weighted by the similarities $s_{ij}$.

Once expressed this way, any similarity measure can be used to define a new semi-supervised learning method. Fouss et al. [61] compare many different similarity measures with the sum of similarities approach on tasks of node classification.

2.3.4 Embedding and classification

Although most semi-supervised learning techniques can be interpreted as sums of similarities, a completely different approach to semi-supervised learning uses unsupervised learning in two well-separated steps: first, an unsupervised method is used to extract, for each node, features that attempt to capture what the network structure tells about that node; then, a classic supervised method is used to learn a classifier based on those features. The classifier is trained only with the labelled samples, but the features are based on both labelled and unlabelled samples, making the overall approach semi-supervised. One advantage of this approach is that it can easily be extended to use non-network features of the nodes, by simply feeding those features to the classifier alongside the network features.

The classifier can be any standard method, such as an SVM or a neural network, and is not specialised to work on graphs. The most interesting part is therefore the unsupervised method that extracts from the network the features used by the classifier, and in the following we will only discuss that part.

Usually, those features can be interpreted as an embedding of the nodes in a high-dimensional space that preserves some important properties of the structure of the network. The most active advocate of the approach has been Lei Tang [212, 214, 213], and in the following we describe two of his methods.

The general idea behind Tang's work is to use the soft membership of the nodes to topological clusters as their features. In other words, the network is summarised by a series of clusters, and the classifier makes predictions based on the level of membership of each node to each of those clusters.

In [212], Tang and Liu propose a method called ModMax using a soft clustering based on the modularity matrix. Newman showed in [152] that the top k eigenvectors of the modularity matrix maximise the modularity for k clusters when soft membership is allowed. In other words, the ith element of the jth eigenvector of the modularity matrix is the optimal membership of the ith node to the jth cluster according to the modularity measure. Tang and Liu propose to extract the k dominant eigenvectors of the modularity matrix using an iterative approach (e.g. Lanczos [114]) and to use the ith element of those k eigenvectors as the features of node i, before training a linear SVM with those features.
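A possible sketch of this pipeline (our own illustration with NumPy and scikit-learn; for graphs of realistic size, the dense eigendecomposition below would be replaced by a sparse Lanczos solver such as scipy.sparse.linalg.eigsh applied to an implicit representation of the modularity matrix):

import numpy as np
from sklearn.svm import LinearSVC

def modmax_node_features(A, k):
    # Modularity matrix B = A - d d^T / (2m), where d is the degree vector
    d = A.sum(axis=1)
    m = d.sum() / 2.0
    B = A - np.outer(d, d) / (2.0 * m)
    # k dominant eigenvectors of B, one row of features per node
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top]

def modmax_predictions(A, k, labelled_idx, labels):
    # Train a linear SVM on the labelled nodes only, then classify every node
    X = modmax_node_features(A, k)
    svm = LinearSVC().fit(X[labelled_idx], labels)
    return svm.predict(X)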

A drawback of ModMax is that it requires the extraction of hundreds - if not thousands - of eigenvectors to work well, which is not only slow to compute, but also difficult to store for large graphs. In [213], Tang and Liu propose EdgeCluster, which uses another clustering approach in order to obtain sparse features, which are easier to store for large graphs. Instead of clustering the nodes of the graph, EdgeCluster performs a clustering of the edges of the graph. An edge connecting the nodes i and j is described by a vector of size n whose elements are all zero, except for the ith and jth elements, which are set to one. A k-means algorithm is then used to perform a hard clustering of the edges based on those description vectors. Tang and Liu then suggest using the clusters to which the edges of a node belong as the features of that node. Those features are sparse because the number of clusters associated with each node is limited by the number of edges it has. As in ModMax, those features are then used to train a linear SVM.
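A compact sketch of the EdgeCluster idea (our own illustration with SciPy and scikit-learn; edges is a list of (i, j) pairs and n the number of nodes; the original method relies on a k-means variant specialised for such very sparse vectors, whereas the standard scikit-learn KMeans is used here for simplicity):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

def edgecluster_features(edges, n, k):
    # Each edge (i, j) is described by an n-dimensional vector that is zero
    # everywhere except at positions i and j, where it is one.
    n_edges = len(edges)
    rows = np.repeat(np.arange(n_edges), 2)
    cols = np.asarray(edges).ravel()
    E = csr_matrix((np.ones(2 * n_edges), (rows, cols)), shape=(n_edges, n))
    # Hard clustering of the edges with k-means
    edge_cluster = KMeans(n_clusters=k, n_init=10).fit_predict(E)
    # A node is described by the clusters of its incident edges (sparse features)
    X = np.zeros((n, k))
    for (i, j), c in zip(edges, edge_cluster):
        X[i, c] = 1
        X[j, c] = 1
    return X

The resulting feature matrix is then fed, for the labelled nodes only, to a linear SVM, exactly as in ModMax.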

The work of Rizos et al. [179] offers a recent comparison of many similar approaches. It is worth noting that clustering is not the only way to extract features from the network. Another approach is, for example, to find an embedding of the nodes that preserves some similarity measure between the nodes. Zhao et al. [230] discuss how to create an embedding based on measures such as the personalised PageRank.

2.4 Challenges and contributions

As we have seen, similarity measures play a central role in most network-related problems. Many algorithms for graph clustering, link prediction, node classification and seed set expansion depend on similarity measures, which is why it is crucial to understand and to improve them. As is often the case in machine learning, improving the scalability of similarity measures (by which we mean the ability to use them on larger networks) is always a welcome achievement. But speed is not the only goal; over the years, many similarity measures have been proposed and it has become a challenge to manage this abundance of choice. Sometimes completely different approaches have led to mathematically identical measures, and sometimes small variations can lead to large differences in the quality of the measure. Moreover, if so many measures co-exist today, it is probably because the quality of a similarity measure is judged by its ability to solve the problem at hand, and the large variety of problems calls for different similarity measures. Making sense of the different similarity measures, and of how to choose one for a given problem, is therefore much needed today.

Roughly speaking, one might say there are two types of approaches to machine learning. One is to carefully build a coherent framework with mathematically sound properties that facilitate its transposition to many problems, and the ability to generalize many existing algorithms, unifying them in a way that offers new insights. The other is simply to look for the best-performing method for a set of problems, at the cost of using hardly interpretable models.

In the following work, we successively followed both approaches.

In Chapter 3 we use the bag-of-paths framework to establish a new similarity measure. This framework was gradually developed by Kevin Françoisse, Amin Mantrach, Ilkka Kivimäki and Marco Saerens [135, 100, 62] and is described in detail in the work of Françoisse et al. [63]. The bag-of-paths framework assigns a probability distribution to the infinite but countable set of possible paths on a network, and gives a probabilistic interpretation, as well as closed-form expressions, to a large family of similarity measures. Working within this framework makes it possible to tackle machine learning problems on networks and to generalize several existing approaches with a sound and interpretable mathematical basis. In Chapter 3 we show how to use this framework to develop an alternative to the modularity based on paths rather than edges. Given the prevalence of the modularity in the clustering problem, being able to extend it through the bag-of-paths framework offers interesting perspectives. In particular, we apply it to the node classification problem using a variant of the ModMax approach, and show that the path-based modularity performs better than the classical edge-based modularity.

The bag-of-paths framework, although interesting, is unfortunately difficult to apply to very large graphs. Moreover, its ability to generalize multiple approaches does not really help to reduce the number of possible variations of similarity measures from which one must choose. In Chapter 4 we take a radically different approach in an attempt to automatically find the best similarity measure for any node classification and seed set expansion problem on very large networks. We developed a method to learn a similarity measure from labelled nodes, and to use it in an efficient semi-supervised learning algorithm.

The method is powerful in its ability to deal with much larger networks than the bag-of-paths framework, and to cover a wide range of possible similarity measures. The similarity measures learned by this method are, however, hard to interpret and can be counter-intuitive, which might be the price to pay for its efficiency.
