A recipe for completing a thesis in good spirits.
Preparation: 3 years and 3 months. Cooking: 4 months.
Let cool for 2 months before serving.
Ingredients
Take a good supervisor. Choose carefully: he will accompany you for 4 years, guide you, encourage you, advise you, and apply pressure when necessary. Pick one who leaves you great freedom but is always available to answer your questions; someone like Hugues.
It is always better with a co-supervisor, so do not hesitate to add one. A Marco is particularly appreciated: he will enrich your thesis with his encyclopedic knowledge, make sure your reading list never empties, and constantly supply you with new ideas. Be careful, however, not to leave him alone in a library; you will likely lose him for the day.
Add a good handful of collaborators; each will bring a new flavour to your thesis. A few safe bets:
• Amin Mantrach overflows with motivation and will present you with 10 new research ideas a minute, some overlapping within the same sentence. Once a project is finished, he will always be motivated to collaborate on the next one. He will push you to think bigger and aim higher, which will prove particularly useful if you have pessimistic tendencies.
• Nicolas Kourtellis combines perfectly with Amin: he brings an outside perspective and forces you to question certain assumptions.
He also has a contagious laugh that will turn every working meeting into a good time.
• The combo of Ilkka Kivimäki + Bram Van Moorter + reindeer will reveal unsuspected applications. Through the combined powers of mathematics, biology and computer science, you will be able to explore unexpected research domains. Be careful, though: prolonged exposure will make you want to go explore the Nordic lands.
• A hint of Peter Staar will push you to your limits, force you to think fast, learn even faster, and will surely lead to interesting discoveries.
• To spice it all up, we highly recommend adding Jeremy Grosman. He will bring you a great deal by pushing you to think seriously about what you are doing, why you are doing it, and how to express it. He will always have something interesting to teach you, a story to tell that adds perspective to your reasoning. He will also gladly go for pasta at noon.
Make sure you have a quality lab. IRIDIA will do perfectly.
Tarik, Aurélien, Guillaume and Fabrice are quality ingredients that we can only recommend. Above all, do not forget Michael Waumans: he will save the day on several occasions thanks to his vast technical knowledge and his ability to fix anything. Be careful, however, not to ask too much of him, as he will never manage to refuse, even though you are far from the only one depending on him. Also keep a foot in your co-supervisor's lab; discussions with his team will spice up your work.
Put everything in building C, add some CCs, chemists, physicists and all the Hot PhDs. Mix well, and provide direct access to building P to let off steam. Do not forget to regularly let your thesis rest so it can mature. Should you forget, your friends on and off campus will call you to order. In particular, remember to always keep a good dose of Karl and Gégé at hand, and to use them without moderation.
When the texture and appearance of your thesis satisfy you, start the cooking. More than ever, you will need the support of your friends and family. Find some who understand why you disappear for a month and only reappear to shout that your thesis is burning. We warmly recommend resorting to Antoine, Jean, Laurent, and above all to incredible parents, as they will help you adjust the seasoning.
Towards the end of the cooking, turn to professionals for precious advice; a good jury is indispensable to ensure the coherence and quality of your thesis.
Bonus tip: find someone close to you who will accompany you throughout your thesis and support you whatever happens. This will add a unique flavour to your work. Marie is a perfect example, but unfortunately we cannot recommend her to you: she is already taken.
This work is a collection of five projects that tackle different aspects of predicting users' interests and behaviour with social networks and recommender systems. On social networks, we worked on the definition of similarity measures and on how to use them to predict characteristics of the nodes of a network.
We made two contributions:
• A novel similarity measure inspired by community detection (specifically by the modularity) and by a probabilistic view of the distribution of paths in a network (Chapter 3).
• A general method to learn similarity measures on large networks and to use it for node classification (Chapter 4).
On recommender systems, we focused on improving collaborative filtering methods, which are a family of recommender systems based on the modelling of user-item interactions. Collaborative filtering is a large field, and we made contributions in three distinct areas:
• We developed a method to update collaborative filtering models in real time while taking into account the fact that observed ratings form a biased sample of the complete set of ratings (Chapter 6).
• We studied how to exploit the order in which users interacted with items to provide more accurate recommendations, and we showed that gated recurrent neural networks are a powerful model for collaborative filtering (Chapter 7).
• We proposed a method to cluster items while training a collaborative filtering model, which can reveal information about the characteristics of the items, but above all can speed up the recommendation process by reducing the number of items that must be considered for each recommendation (Chapter 8).
Recently, social networks and recommender systems have been under fire for trapping us in filter bubbles, a sort of positive feedback loop that shows us more and more content that agrees with our opinions and filters out the rest.
Unfortunately, we were not able to tackle this problem directly, but we hope that the tools we made to understand social networks and recommender systems will be used not to create filter bubbles, but to fight them.
Contents
Contents 4
List of Symbols 7
1 Introduction 9
1.1 Semi-supervised learning on networks . . . . 10
1.2 Collaborative filtering . . . . 11
1.3 Connecting the dots . . . . 11
1.4 Structure of the thesis . . . . 12
1.5 List of publications . . . . 13
I Semi-supervised learning on networks 15
2 Introduction 17
2.1 Machine learning problems on networks . . . . 18
2.2 Similarity measures . . . . 21
2.3 Methods for semi-supervised learning . . . . 23
2.4 Challenges and contributions . . . . 27
3 Semi-supervised learning with path-based modularity maximisation 29
3.1 Introduction . . . . 29
3.2 Background on the modularity . . . . 30
3.3 Random Walk based Modularity . . . . 32
3.4 Derivation of the semi-supervised learning algorithm . . . . 35
3.5 Experiments . . . . 38
3.6 Related Work . . . . 45
3.7 Conclusions . . . . 45
4 Learning the similarity on very large graphs 47
4.1 Introduction . . . . 47
4.2 Efficient computation of sum of similarities . . . . 48
4.3 Related work . . . . 49
4.4 Problem definition . . . . 51
4.5 Learning the similarity . . . . 52
4.6 Avoiding the assumption of homophily . . . . 55
4.7 Experiments . . . . 56
4.8 Conclusion . . . . 62
II Collaborative Filtering 63
5 Introduction 65
5.1 Collaborative filtering . . . . 66
5.2 Families of collaborative filtering algorithms . . . . 68
5.3 Training matrix factorisation methods with stochastic gradient descent . . . . 73
5.4 Evaluating recommender systems . . . . 75
5.5 Contributions . . . . 76
6 Dynamic Matrix Factorisation with Priors on Unknown Values 79
6.1 Introduction . . . . 79
6.2 Standard matrix factorisation . . . . 81
6.3 Interpreting missing data . . . . 81
6.4 Objective functions . . . . 82
6.5 Experiments . . . . 88
6.6 Related Work . . . . 96
6.7 Conclusions . . . . 98
7 Sequence-based Collaborative Filtering with Recurrent Neural Networks 99
7.1 Introduction . . . . 99
7.2 Sequence-based collaborative filtering . . . . 100
7.3 Collaborative filtering with recurrent neural networks . . . . . 104
7.4 Methods comparison . . . . 111
7.5 Short-term / long-term profile . . . . 114
7.6 Other variations of the RNN . . . . 118
7.7 Conclusion . . . . 120
8 Accelerating model-based collaborative filtering with item clustering 121
8.1 Introduction . . . . 121
8.2 Related Works . . . . 122
8.3 Method . . . . 123
8.4 Experiments . . . . 128
8.5 Discussion . . . . 134
8.6 Conclusion . . . . 134
9 General conclusion 137
9.1 Findings and contributions . . . . 137
9.2 Improving recommender systems . . . . 140
9.3 Concerns and perspectives . . . . 141
A Collaborative filtering based on sequences 143
A.1 Installation . . . . 143
A.2 Usage . . . . 143
A.3 Methods . . . . 147
Bibliography 153
General

a, b, c, . . .   Scalar variables.
a, b, c, . . .   Vectors (upright lowercase bold).
A, B, C, . . .   Matrices (upright uppercase bold).
A, B, C, . . .   Sets.
v_i              ith element of the vector v. Equivalent to [v]_i.
m_ij             Element (i, j) of the matrix M. Equivalent to [M]_ij.
e                Vector of ones.
e_i              Vector of zeros, with the exception of its ith element, which is equal to one.
I                Identity matrix (its size is generally inferred from the context).
Part I: Networks

n       Number of nodes.
N       Number of edges.
A       Adjacency matrix.
d_ii    Degree of node i: d_ii = Σ_j a_ij.
D       Diagonal matrix whose diagonal elements are the degrees of the nodes of the corresponding network.
Ã       Symmetrically normalised adjacency matrix: Ã = D^(−1/2) A D^(−1/2).
L       Laplacian matrix: L = D − A.
Δ       Normalised Laplacian matrix: Δ = I − Ã.
P       Transition probability matrix of a random walker on A: P = D^(−1) A.
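All of the matrices above derive directly from the adjacency matrix. A minimal NumPy sketch, using a made-up 4-node graph purely for illustration:

```python
import numpy as np

# Toy undirected network: 4 nodes, edges (0,1), (0,2), (1,2), (2,3).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)                # degrees: d_ii = sum_j a_ij
D = np.diag(d)                   # diagonal degree matrix
A_tilde = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)  # symmetrically normalised adjacency
L = D - A                        # Laplacian
Delta = np.eye(len(d)) - A_tilde # normalised Laplacian
P = A / d[:, None]               # random-walk transition matrix, D^(-1) A

assert np.allclose(P.sum(axis=1), 1)  # each row of P is a probability distribution
```

Note that the rows of P sum to one (a random walker must go somewhere), and that the rows of L sum to zero, which are quick sanity checks for these constructions.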
Part II: Collaborative filtering

n       Number of users.
m       Number of items.
N       Number of observed interactions.
r_ui    Interaction between user u and item i. It can be a rating (1, 2, 3, . . .) or an implicit feedback, in which case it is equal to 1.
r̂_ui    Estimated rating, or predicted interest of user u for item i.
R       Set of observed interactions.
R_u•    Set of observed interactions of user u.
R_•i    Set of observed interactions of item i.
R       Matrix of observed interactions: [R]_ij = r_ij if r_ij ∈ R, otherwise [R]_ij = 0.
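In code, the set R of observed interactions is typically stored sparsely as (user, item, rating) triples, and the matrix R with zeros for unobserved pairs can be materialised from it. A small sketch; the triples are made up for illustration:

```python
import numpy as np

n_users, n_items = 3, 4
# Hypothetical observed interactions (u, i, r_ui).
interactions = [(0, 1, 5), (0, 3, 3), (1, 0, 4), (2, 2, 1)]

R = np.zeros((n_users, n_items))
for u, i, r in interactions:
    R[u, i] = r  # [R]_ui = r_ui if observed, 0 otherwise

# R_{u•} for user 0: the set of items user 0 interacted with.
R_user0 = {i for u, i, _ in interactions if u == 0}
print(sorted(R_user0))  # → [1, 3]
```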
Introduction
The Internet changed the way we access information and art. It made almost everything available: any piece of music, any movie or book, is somewhere on the Internet. There is an even greater abundance of opinions, news, facts and falsehoods. But availability does not mean accessibility. The information that you want or need may be somewhere on the Internet, but you might not know where, and you may not even know it exists.
To access this wealth of information, the machine learning community developed algorithms to sort information (search engines and recommender systems) and efficient ways to share it (online social networks). Those algorithms, and the mechanisms behind social networks, are the gatekeepers of the Internet, and in many ways they are the gatekeepers of our access to information. This is why we have to study them.
This work is not a reflection on how information should be made accessible, or on the societal concerns about the current approaches. It is only an attempt to solve some practical problems faced by the current algorithms. In particular, we are interested in algorithms whose aim is to predict the interests of users in order to retrieve relevant information.
This “information” that we suggest to the users can take several forms: it can be a movie for a Netflix user, a scientific paper for a researcher, a group in some social network, or a simple web page. In the following we will refer to all those objects of interest simply as “objects”. Generally speaking, the algorithms that attempt to predict the interests of users base their predictions on the traces that the users leave, voluntarily or not, when they navigate the Internet. Any interaction with some object (visiting, liking, sharing, buying) or with other users (befriending, chatting) constitutes such a trace. The intuitive basis of those algorithms is that if we can identify two users who have similar interests, the observed behaviour of one user can help to predict the behaviour of the other. The first goal is therefore to identify users with similar interests, and there are two main ways of doing so: using explicit links between users on social networks (friends have similar interests), or using behavioural similarities (if two users interact with the same objects, they must share the same interests).
The first approach is essentially covered by a research field called semi- supervised learning on networks [233, 41, 208], while the second is referred to as collaborative filtering [7, 176]. This work makes contributions in both semi-supervised learning and collaborative filtering. In the following we will
introduce both fields of research, before highlighting some of the parallels that can be made between them.
1.1 Semi-supervised learning on networks
A common feature of Facebook is to make recommendations in the form “this page might interest you because three of your friends liked it”. This is the simplest use of a social network to predict the interests of a user: it is reasonable to assume that friends share common interests, and therefore that if somebody liked a given page, or belongs to a certain group, his friends might be interested as well [146].
In practice however, making the best use of the information given by a social network is a complex challenge. Not all friends are equally close for example.
Moreover, some friends of friends might actually have more in common than direct friends, and it is therefore interesting to consider the full network of interactions surrounding a user rather than their direct friends only.
In order to efficiently work on that problem, it is useful to conceive an abstract view of networks [19, 149, 62]. Networks are a set of nodes connected by edges. The topology of the network is the way in which the nodes are connected by the edges (in stars, in circles, forming groups or chains, etc.). The nodes and edges can have certain attributes: for example, a node can belong to a certain class, or have a certain label, and the edges can have a direction or a weight. Generally, we assume that the attributes of the nodes are correlated with the topology of the network [41]. If this assumption is correct, the topology can help us predict some attributes of the nodes.
This abstract view of networks can easily be applied to social networks: nodes are users, edges are friendships, the attributes are the interests of the users, and we assume that there is a correlation between those interests and the topology of the social network. But this view can also be applied to many other domains: road networks, trophic networks, protein networks, etc. So even though we mostly focus on social networks, many of the tools developed in this work can be transposed to other domains [19].
In abstract terms, predicting the interests of users in a social network based on the observed interests of users surrounding them can be seen as the problem of predicting the attributes of the nodes of a network based on the known attributes of a few nodes and on the topology of the network. This is the problem called semi-supervised learning on networks (see Chapter 2).
An important step in semi-supervised learning is to define a measure of similarity between the nodes of the network, based on its topology [61]. One example of a measure would be based on counting the smallest number of edges that must be visited to go from one node to the other: the shorter the path, the higher the similarity between the nodes. Of course, many other definitions are possible. In the following work, we present new similarity measures that improve on previously known ones, and are computable on very large networks (see Chapters 3 and 4). This will be helpful in predicting the attributes defined on nodes.
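The shortest-path example above can be sketched with a breadth-first search; turning the path length into a similarity (here simply 1/(1 + length)) is one illustrative choice among the many possible definitions:

```python
from collections import deque

# Illustrative undirected graph as adjacency lists.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def shortest_path_length(graph, source, target):
    """Breadth-first search for the minimum number of edges between two nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neigh in graph[node]:
            if neigh not in dist:
                dist[neigh] = dist[node] + 1
                queue.append(neigh)
    return float("inf")  # the nodes are disconnected

def similarity(graph, a, b):
    """The shorter the path, the higher the similarity."""
    return 1.0 / (1.0 + shortest_path_length(graph, a, b))

print(similarity(graph, 0, 3))  # path 0-2-3 has length 2, so similarity 1/3
```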
1.2 Collaborative filtering
Social networks are not the only way to identify users with similar interests.
If we have access to observations of the past behaviour of many users, it is possible to identify users who behaved similarly in the past, and guess that they must have similar interests, and thus might behave similarly in the future. In other words, any application that centralises the history of many users can start to make predictions about the future actions of these users. This technique is called collaborative filtering [74].
The basic approach is to measure a similarity between pairs of users based on the number of interactions that they have in common [52]. This kind of approach however has many problems related to its time complexity and the difficulty to work when observations are scarce. Over the years, many variations have been proposed, and a popular approach that emerged is to learn for each user and each object a representation in some latent space, and to make recommendations based on the similarities between users and objects in that latent space [7, Chapter 3].
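The basic neighbourhood approach can be sketched in a few lines, here using the Jaccard index as the similarity over common interactions (the users and items are made up for illustration):

```python
# Hypothetical interaction histories: user -> set of items interacted with.
histories = {
    "alice": {"matrix", "inception", "memento"},
    "bob":   {"matrix", "inception", "up"},
    "carol": {"up", "frozen"},
}

def jaccard(a, b):
    """Similarity = |common interactions| / |union of interactions|."""
    return len(a & b) / len(a | b)

def most_similar(user):
    """The user whose history overlaps most with the given user's."""
    others = (u for u in histories if u != user)
    return max(others, key=lambda u: jaccard(histories[user], histories[u]))

print(most_similar("alice"))  # → bob (2 common items out of 4 distinct ones)
```

The observed behaviour of the most similar user can then be used to predict the behaviour of the target user, e.g. by recommending items from the neighbour's history that the target has not yet seen.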
Many challenges remain, not the least of which being to reduce the time complexity of those algorithms in order to deal with larger and larger datasets.
In production, a recommender system must handle new users and objects entering the system constantly, old items becoming obsolete, changing interests of users, and must be able to provide recommendations in real time based on the latest available information. We tackle all of those problems, and provide methods that improve the state of the art of collaborative filtering.
In particular, we present a way to handle new users and items in real time in order to always make recommendations based on up to date information (see Chapter 6). We also identify an efficient way to take into account the order in which users interact with the objects in order to provide more accurate recommendations (see Chapter 7). Finally we explore a new approach to accelerate the recommendation process (see Chapter 8).
1.3 Connecting the dots
As we have seen, semi-supervised learning on networks and collaborative filtering can share the same goal, even though they handle different types of data. On the technical side, interesting parallels emerge as well. Both revolve strongly around the idea of similarity: similarity between users based on a network topology or based on common behaviour, similarity between users and objects with which they interact.
Moreover, the set of interactions of users with objects is often represented
as a bipartite network, with edges connecting the users to the objects. This
representation opens the way to use tools from network science in collaborative
filtering, bringing the two fields closer yet. One might for example apply to
that bipartite network the same kind of similarity measure used on social
networks and interpret it as a similarity between the users and objects of the
graph. Similarly, techniques of dimensionality reduction commonly used in
collaborative filtering (matrix factorisation in particular) can be useful in some
semi-supervised learning problems as well (as in Chapter 3). Another common
feature of these two problems is the sparsity of the observed interactions, leading to very sparse networks.
Admittedly though, this kind of weak similarity can be found between many fields related to machine learning, and we might actually learn more by trying to identify why those fields differ rather than how they resemble each other. Firstly, the nature of the data of semi-supervised learning on networks and of collaborative filtering is different. We said that the interactions used in collaborative filtering can be represented by a network, but that is only true up to a point: the interactions can carry a lot of information that might be poorly adapted to a network representation. For example, when we consider that the order of interactions is important information (as in Chapter 7), using a network representation becomes impractical.
Still, the representation of interactions as a bipartite network is a common one, but even in that case, the typical dimensions of semi-supervised learning problems and of collaborative filtering problems differ, and call for different solutions. Most notably, semi-supervised learning on networks focuses on situations where the number of potential interests for a node is small [233].
Each interest can generally be considered as an independent problem, and the complexity is mostly driven by the number of users and of links between those users. In collaborative filtering on the other hand, the number of objects of interest is an important driver of complexity, with catalogues of hundreds of thousands of items that must be treated in less than a second, and the methods must therefore comply with much stronger constraints of time complexity [47, 15].
Another distinction comes from the fact that networks in semi-supervised learning are often considered as static entities, while collaborative filtering is very much concerned with the ability to handle constantly changing users and items [192, 5, 20]. Admittedly, real social networks are also evolving constantly, but this problem seems to be less directly confronted in the semi-supervised learning literature.
In the following chapters we tackle several challenges that reveal the need for specialised solutions in both semi-supervised learning on networks and collaborative filtering.
1.4 Structure of the thesis
This work is divided into two parts. The first part concerns networks, and in particular the task of semi-supervised learning on networks. As said earlier, semi-supervised learning can, among other things, be used to identify the interests of people in social networks, but the range of applications extends far beyond social networks. In the following we favour an abstract treatment of the subject, which encourages a broader use of the findings. Chapter 2 is an introduction to network science and to semi-supervised learning in particular.
It lays down the concepts necessary for the subsequent work, and identifies the
challenges that have been tackled. In Chapter 3 we introduce an extension of
a commonly used similarity measure that leads to better prediction than the
original one in a task of semi-supervised learning. In Chapter 4 we propose a
new semi-supervised learning algorithm which learns similarity measures based
on available information, and is able to work on very large networks. The
main interest of this new approach is that it avoids the time-consuming task of choosing an adapted similarity measure for a given problem.
The second part of this work concerns recommender systems, and in par- ticular collaborative filtering algorithms. Chapter 5 gives an overview of the field and identifies the problems that we tackle in the subsequent chapters.
Chapter 6 deals with the ability of collaborative filtering methods to adapt to a changing environment (where new users and new items regularly appear).
Chapter 7 proposes a new family of methods based on recurrent neural networks exploiting the sequential structure of the data, and highlights some side effects of that choice. Finally, Chapter 8 presents a way to use clustering of the items in order to make faster recommendations.
1.5 List of publications
Here is a list of the publications submitted in the context of this work:
• The work presented in Chapter 3 has been published in:
Devooght, R., Mantrach, A., Kivimäki, I., Bersini, H., Jaimes, A., &
Saerens, M. (2014, April). Random walks based modularity: application to semi-supervised learning. In Proceedings of the 23rd international conference on World wide web (pp. 213-224). ACM.
• The work presented in Chapter 6 is the result of a collaboration with the Yahoo! Research lab of Barcelona, and has been published in:
Devooght, R., Kourtellis, N., & Mantrach, A. (2015, August). Dynamic matrix factorization with priors on unknown values. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 189-198). ACM.
• The work presented in Chapter 7 will be partly published in:
Devooght, R., Bersini, H. (2017, July). Long and Short-Term Recom- mendations with Recurrent Neural Networks. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization. ACM.
Moreover we want to signal that the work presented in Chapter 4 is the
result of a collaboration with Peter Staar and Costas Bekas from the IBM
research lab of Zurich.
Semi-supervised learning on networks
¹ This work only explores a very narrow slice of the diverse science of networks. For a broader view of the subject we recommend turning to a few books. Network Science, by Albert-László Barabási [19], is a beautifully illustrated introduction to the questions related to topology, filled with concrete applications. Networks: An Introduction, by Mark Newman [149], gives a great overview of the fundamental tools of network analysis. Finally, Network Data and Link Analysis, by François Fouss, Marco Saerens and Masashi Shimbo [62], is a vast collection of advanced techniques and algorithms related to networks.
Introduction
Before Deep Learning became all the rage, the A.I. community had a crush on networks. A network is a simple abstraction: a collection of objects called nodes, interlinked by edges that model two-way relationships between the nodes. Using this abstraction, networks can be noticed everywhere.
The nodes can be cities and the edges roads, and you have a road network. If the nodes are proteins and the edges are chemical interactions, you can describe the protein interactions that make our cells work. If the nodes are living beings, the edges can indicate predation, parasitism and co-dependencies, and you can describe a complex ecosystem. Brains, financial networks and the power grid are all systems that can be described with networks. Then of course, there are the networks most emblematic of the past decades: the Internet, the web, and online social networks.
Using this abstraction allows us to move away from the particularities of each domain, in order to identify common traits. It offers simple mathematical tools, from which a new understanding of the underlying subjects can arise, and it leads to the development of algorithms solving problems across numerous domains.¹
Thinking in terms of networks has generated multiple research questions, such as, to name a few, how to identify important nodes, how to predict the evolution of a network or how to find the shortest path between two nodes. In this work, we will mostly focus on one fundamental question, central to the whole study of networks: what does the structure of a network reveal about the similarity, or closeness, of two of its nodes? In the following chapters we establish a new measure of similarity between pairs of nodes in a network, as well as a method to identify an appropriate measure of similarity for a given network. Scalability is an important aspect of modern network algorithms, and we demonstrate the ability of the methods that we present to work with large networks.
In this introductory chapter, we present some of the applications of machine learning on networks (Section 2.1); we then discuss how the notion of similarity measure is central to those problems (Section 2.2), before diving deeper into how to use those similarity measures in a specific task called node classification (Sec- tion 2.3); we finally describe some of the limitations of current techniques, and which of those limitations we attempt to alleviate with our work (Section 2.4).
² For an in-depth look at machine learning on networks, one can look at the recent work of T. C. Silva and L. Zhao [201].
³ We are actually not acknowledging the diversity of network science by reducing clustering to the detection of communities. There is a collection of fields related to clustering, each with their own specificities: community detection, graph partitioning, hierarchical clustering, etc. See for example [19, Chapter 9] for a brief overview.
2.1 Machine learning problems on networks
In this section we describe some of the most prominent problems for machine learning on networks.² In this work we only consider situations where the only available information is the structure of the network (i.e. the adjacency matrix). There are many situations where multiple sources of information exist, for example when some features of the nodes are available, but combining those sources of information is a research area of its own, which is out of the scope of this work.
2.1.1 Clustering

Figure 2.1: This network is the result of the observation by Wayne W. Zachary of the interactions between the members of a karate club [226]. During this period of observation, a conflict arose between the president (node 34) and the instructor of the club (first node), which resulted in the creation of a new club. This network has been the canonical test for graph clustering ever since, the goal being to predict who joined the new club based only on the structure of the network. Here, the members of the new club are represented by white nodes. Source of the image: commons.wikimedia.org/wiki/File:Karate_Cuneyt_Akcora.png
Clustering is the task of dividing a data set into meaningful groups, or clusters. By meaningful, we generally mean that elements in the same cluster should have some common properties, and should be significantly different from elements in other clusters. The blurry language here is inevitable, because finding a mathematically rigorous definition of what a good clustering is constitutes the first challenge of the clustering problem, and the many propositions are debated at length [120, 186]. Once a definition of clustering quality is chosen, the second challenge is to design an efficient algorithm that finds an optimal clustering according to that definition.
Clustering is of course a central problem in machine learning and is not at all limited to networks, but its application to networks requires a special treatment. On social networks, clustering has an especially intuitive signification: it aims to detect communities.³ The assumption is that the structure of interactions captured by a social network is enough to identify distinct groups of people. For example, Blondel et al. [29] have shown that the network of mobile communications of two million customers in Belgium can be used to identify the French-speaking and Dutch-speaking communities. The canonical example of social network clustering is the Zachary karate club network described in Figure 2.1.
On networks, the definition of the quality of a clustering generally revolves around the idea that nodes within a cluster should be densely connected, and that few links should exist between clusters. In practice however, neatly separated clusters are very rarely observed. Some regions of the network might be more densely connected than others, but the limit is often hard to draw, and often communities can overlap [119]. Many competing approaches exist, and the work of Mark Newman [150], S.E. Schaeffer [186] and Leskovec et al. [120]
give an overview of the field.
A popular definition of clustering quality is the modularity, introduced by Mark Newman [151]. In Newman’s words [151]:
The modularity is, up to a multiplicative constant, the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.
Here, an equivalent network is a network with the same number of nodes and edges, and the same degree distribution (i.e. a node with x edges in one network must also have x edges in an equivalent network). The motivation is the assumption that, given a number of nodes and a degree distribution, the
clustering that best explains where the edges are is the best clustering. The modularity formula is briefly defined in sidenote 4, then explained in greater detail in Chapter 3.2.

Sidenote 4: Let $u_i \in \mathbb{R}^k$ be an indicator vector for node $i$, defined such that $[u_i]_j = 1$ if the $i$th node belongs to the $j$th cluster, and $[u_i]_j = 0$ otherwise. The modularity $Q$ with $k$ clusters is then defined as

$$Q = \frac{1}{2N} \sum_{i,j}^{n} \left( a_{ij} - \frac{d_{ii}\, d_{jj}}{2N} \right) u_i u_j^T$$

where $d_{ii}$ is the degree of node $i$ and $N$ is the number of edges in the network [45].
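As an illustration, the modularity of a given partition can be computed directly from the adjacency matrix: the indicator product $u_i u_j^T$ simply checks whether nodes $i$ and $j$ share a cluster. The sketch below is a plain NumPy implementation of the modularity formula; the toy graph and the two partitions are invented for illustration.

```python
import numpy as np

def modularity(A, clusters):
    """Modularity Q = (1/2N) * sum_ij (a_ij - d_i*d_j/(2N)) * delta(c_i, c_j),
    where N is the number of edges and d_i the degree of node i."""
    N = A.sum() / 2.0                 # number of edges (undirected graph)
    d = A.sum(axis=1)                 # degrees
    n = A.shape[0]
    Q = 0.0
    for i in range(n):
        for j in range(n):
            if clusters[i] == clusters[j]:
                Q += A[i, j] - d[i] * d[j] / (2 * N)
    return Q / (2 * N)

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
good = [0, 0, 0, 1, 1, 1]   # the two triangles
bad  = [0, 1, 0, 1, 0, 1]   # an arbitrary split
assert modularity(A, good) > modularity(A, bad)
```

On this toy graph the natural partition reaches Q = 5/14 ≈ 0.36, while the arbitrary split gets a negative modularity.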
2.1.2 Link prediction
Another common application of machine learning on networks is to predict unobserved links in the network [122, 129]. Often, the available network is either incomplete (think for example of the protein-protein interaction network, where most of the possible interactions are probably yet to be observed) or evolving (social networks, for example), and in either case it can be useful to predict the missing, or soon-to-arrive, links.
A common approach to the link prediction problem is to make predictions based on the number of neighbours that two nodes have in common. Variations of this approach (such as the Adamic/Adar score [2]) work well on social networks [122], even though they are very simple.
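For concreteness, the Adamic/Adar score of a pair of nodes sums, over their common neighbours, the inverse logarithm of each neighbour's degree, so that rare shared neighbours count more than popular ones. A minimal sketch; the 5-node graph is invented for illustration.

```python
import numpy as np

def adamic_adar(A, x, y):
    """Adamic/Adar link-prediction score between nodes x and y:
    common neighbours weighted by the inverse log of their degree."""
    neighbours = lambda v: set(np.nonzero(A[v])[0])
    common = neighbours(x) & neighbours(y)
    return sum(1.0 / np.log(A[z].sum()) for z in common)

# Hypothetical 5-node friendship graph.
A = np.array([[0,1,1,0,1],
              [1,0,1,1,1],
              [1,1,0,0,0],
              [0,1,0,0,0],
              [1,1,0,0,0]], dtype=float)

# Nodes 0 and 3 are not connected; their only common neighbour is node 1,
# whose degree is 4, so the score is 1/log(4).
score = adamic_adar(A, 0, 3)
```

Ranking all disconnected pairs by this score gives a simple link predictor.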
However, those methods can only predict links between nodes that have at least one common neighbour. To avoid this limitation, one can turn to methods based on random walks [16] or matrix factorization [141].
We do not directly tackle the link prediction problem in this work. However, the second part of the work is concerned with the problem of collaborative filtering, which is closely related to link prediction. As we explain in Chapter 5, collaborative filtering is the problem of producing recommendations for a user based on observed interactions between users and items. One way of approaching collaborative filtering is to see it as the problem of predicting links in the bipartite graph formed by the users and the items. As we show in Chapter 5, methods based on neighbour counts, random walks and matrix factorisation have all been widely applied to the collaborative filtering problem.
We recommend the works of Liben-Nowell and Kleinberg [122] and Lü and Zhou [129] for an overview of the field.
2.1.3 Node classification
Notations: In the following, $n$ represents the number of nodes in a graph, and $N = \{1, 2, \ldots, n\}$ is the set of nodes of the graph. The structure of the graph is described by the adjacency matrix $A \in \mathbb{R}^{n \times n}$, whose element $a_{ij}$ is 1 if there is a link between nodes $i$ and $j$, and zero otherwise. In Chapter 3 we will also use the adjacency matrix to encode weights on the edges, but everywhere else we work with unweighted graphs. We also commonly use the laplacian matrix defined by $L = D - A$, where $D$ is the diagonal matrix whose element $d_{ii} = \sum_{j \in N} a_{ij}$ is the degree of node $i$. Finally, we define the symmetrically normalised adjacency matrix as $\tilde{A} = D^{-1/2} A D^{-1/2}$ and the normalised laplacian as $\Delta = I - \tilde{A}$, where $I$ is the $n \times n$ identity matrix.
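These notations translate directly into code; the sketch below builds each matrix for a small hypothetical graph given as an edge list.

```python
import numpy as np

# Hypothetical undirected graph on n = 4 nodes, given by its edge list.
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

A = np.zeros((n, n))                       # adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))                 # diagonal degree matrix
L = D - A                                  # laplacian matrix L = D - A
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt       # symmetrically normalised adjacency
Delta = np.eye(n) - A_norm                 # normalised laplacian
```

As a quick sanity check, the rows of $L$ sum to zero and the eigenvalues of $\Delta$ lie in $[0, 2]$, as stated in Section 2.2.2.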
Node classification is the task of determining the class (or label) of each
node of a network, when the class of some nodes is known. For example, the
goal might be to determine the age of all the members of a social network when
the age of only 10% of them is known, or to categorize scientific papers based on the citation network and a few hand-made categorizations.
Node classification is the main problem that we treat in this work. Hence, we provide a more rigorous definition.
We consider $m$ classes (or labels), and each node in $N$ belongs to exactly one class. The matrix $Y \in \mathbb{R}^{n \times m}$ encodes the class of each node: its element $y_{ij}$ is 1 if and only if node $i$ belongs to class $j$. $K$ is the set of nodes whose class is known, and the goal of the classification is to predict the classes of the unlabelled nodes $u \in N \setminus K$ based on the known classes $y_{ij}, \forall i \in K, j \in \{1, \ldots, m\}$. The matrix $\hat{Y}$ contains the class predictions.

In the case where $m = 2$, we can simplify the notation and use the vectors $y$ and $\hat{y}$ instead of $Y$ and $\hat{Y}$, with $y_i = 1$ if node $i$ belongs to the first class, and $y_i = 0$ if it belongs to the second class or if its class is unknown. It is common to treat multi-class classification problems as a combination of simpler binary classification problems. In the following we present methods only for the binary classification problem, and in Chapter 4.5.1 we briefly discuss how to extend them to multiple classes.
2.1.4 Seed set expansion
Seed set expansion is a variation of the node classification problem that probably better reflects actual applications of machine learning on the web and on social networks. We start with a set of nodes (called the seed set) that are known to belong to the same community or class, and the goal is to find which other nodes belong to that community. In other words: we aim to expand the seed set.
The main difference with the classification problem is that we are not given nodes that explicitly do not belong to the community: every node outside of the seed set may or may not belong to it. This prevents the use of traditional classification methods, which need to learn from both positive and negative examples. The seed set expansion problem can be separated into two sub-problems: (1) ranking nodes according to their likelihood of belonging to the same class as the seeds, and (2) finding a cut-off point in that ranking. The ranking is often performed by starting truncated random walks or random walks with restart from the seed set [9, 104]. Choosing a cut-off point generally consists in minimizing some heuristic of the quality of the community [9, 220, 178], and we find here the same kinds of quality measures as those used to evaluate a clustering. However, the cut-off point is sometimes imposed, for example by the number of items presented in an interface, or for speed and memory reasons, and the seed set expansion problem is then reduced to finding a good ranking of the nodes.
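The two-step scheme above can be sketched as follows, using a random walk with restart (described later in this chapter) for the ranking, and an imposed cut-off of $k$ nodes; the graph, seed set and parameter values are invented for illustration.

```python
import numpy as np

def expand_seed_set(A, seeds, k, alpha=0.15, iters=100):
    """Rank nodes by a random walk with restart started from the seed set,
    then keep the k best-ranked nodes outside the seeds (imposed cut-off)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)     # transition probabilities
    restart = np.zeros(n)
    restart[list(seeds)] = 1.0 / len(seeds)  # uniform restart on the seeds
    x = restart.copy()
    for _ in range(iters):                   # power iteration to steady state
        x = (1 - alpha) * P.T @ x + alpha * restart
    ranking = [i for i in np.argsort(-x) if i not in seeds]
    return ranking[:k]

# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge (2,3).
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)

# Expanding the seed set {0, 1} by one node recovers the rest of its triangle.
assert expand_seed_set(A, {0, 1}, k=1) == [2]
```

Replacing the imposed cut-off by a community-quality heuristic gives the more general form of the problem.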
One application is to find sub-graphs of manageable size in very large graphs.
In this case the crucial part is to find the right cut-off point [178]. Others have combined seed set expansion with the unsupervised selection of seed sets to perform graph clustering [220].
However, in this work we use it as a special case of node classification on social networks, where the goal is to predict which users of an online social network might join a given group. In most online social networks, a user can join a group (by "liking" a page, for example), but he cannot explicitly refuse to join a group. Therefore, it is often impossible to know whether a user is not part of a group because of a lack of interest, or because he does not know about the group.
Sidenote 5: Distances, kernels and similarity measures. A similarity measure can be any function that takes two nodes as input and outputs a scalar value. The absolute value of the similarity between two nodes generally does not have a meaning; it is only meaningful when compared to other similarities. Distances, on the other hand, must comply with stricter rules: a distance must be nonnegative and symmetric, the distance of a node to itself must be zero, and it must satisfy the triangle inequality [101]. Kernel functions compute the dot product of two nodes in some high-dimensional space [90, 193]. Distances and kernels can both be used as similarity measures, and many works restrict themselves to using those because they offer many theoretical guarantees [63, 61, 105, 203], but in this work we use the broadest definition of similarity measures.
2.2 Similarity measures
When working with networks, we very often need some definition of distance or similarity between two nodes of a network. With an appropriate measure of distance or similarity, the problems presented in the previous section become trivial, or can at least be solved by traditional machine learning methods. For example, once a distance is defined between any pair of nodes, one can perform graph clustering with a distance-based technique such as k-means [225]. In the node classification task, a simple approach is to assign to each node the class of the nearest labelled node, or to use any other variation of nearest-neighbour techniques. In some link prediction problems, the best approach is simply to find the most similar nodes that are not yet connected [129].
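For instance, the nearest-labelled-node rule mentioned above takes only a few lines once a similarity matrix $S$ is available. A sketch, where a label of −1 marks an unlabelled node and the small similarity matrix is invented for illustration (here, simply the adjacency matrix).

```python
import numpy as np

def nearest_labelled(S, labels):
    """Assign to each unlabelled node the class of its most similar labelled
    node. labels[i] is the class of node i, or -1 if unknown."""
    labels = np.asarray(labels)
    known = np.where(labels >= 0)[0]          # indices of labelled nodes
    pred = labels.copy()
    for i in np.where(labels < 0)[0]:
        pred[i] = labels[known[np.argmax(S[i, known])]]
    return pred

# Using the adjacency matrix itself as a (very coarse) similarity measure.
S = np.array([[0,1,1,0],
              [1,0,0,0],
              [1,0,0,1],
              [0,0,1,0]], dtype=float)
labels = [0, -1, -1, 1]
pred = nearest_labelled(S, labels)
assert list(pred) == [0, 0, 0, 1]   # node 2 ties; argmax keeps the first match
```

Any of the finer similarity measures of this section can be substituted for $S$ without changing the classification rule.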
Having a good similarity measure seems to solve a large range of machine learning problems on networks, and is therefore central in link prediction [129], community detection [60], node classification [61] and seed set expansion [9].
The difficulty, of course, lies in how to define such a similarity measure (sidenote 5). The adjacency matrix itself is of course the simplest measure of similarity: two nodes are more similar if they are connected than if they are not. It is however a very coarse-grained measure, giving the same similarity to every pair of disconnected nodes, even though some may be only two edges apart and others much further away.
A better definition is to consider the minimum number of edges that need to be crossed in order to go from one node to the other. This is called the shortest-path distance. However, this definition is still very coarse-grained: it puts many nodes at exactly the same distance. One way to improve it is to also consider the number of possible paths of each length between two nodes (i.e. the amount of connectivity). This way, two nodes connected by three paths of length two are considered closer than two nodes connected by only one path of length two.
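The count of walks of a given length is easy to obtain: the element $(i, j)$ of $A^k$ is the number of walks of length $k$ between $i$ and $j$. The sketch below illustrates this, together with one simple (Katz-like) way of weighting the counts by length; the graph and the decay factor are invented for illustration.

```python
import numpy as np

# Hypothetical graph: nodes 3 and 4 are both two steps away from node 0,
# but node 3 is reachable by two distinct paths and node 4 by only one.
A = np.array([[0,1,1,0,0],
              [1,0,0,1,1],
              [1,0,0,1,0],
              [0,1,1,0,0],
              [0,1,0,0,0]], dtype=float)

A2 = A @ A                   # [A^2]_ij = number of walks of length 2 from i to j
assert A2[0, 3] == 2         # the two paths 0-1-3 and 0-2-3
assert A2[0, 4] == 1         # only the path 0-1-4

# A Katz-like similarity: weight walk counts by a decaying factor beta^k.
beta = 0.1
S = beta * A + beta**2 * A2  # truncated weighted walk count
assert S[0, 3] > S[0, 4]     # same shortest-path distance, more connectivity
```

Nodes 0 and 3 thus come out as more similar than nodes 0 and 4, even though both pairs are at shortest-path distance two.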
Similarity measures are functions of the form $s : \{1, \ldots, n\}^2 \mapsto \mathbb{R}$, mapping a pair of nodes to a scalar value, but we often represent them by the matrix $S$ containing the similarity between every pair of nodes. This matrix is called the similarity matrix, and its element $s_{ij}$ is the similarity between nodes $i$ and $j$.
Plenty of definitions of similarity have been proposed over the years, using different intuitions, and it is impossible to affirm that one is more reasonable than the others. In practice, the best similarity measure is the one that leads to the best results, or the one that is the most practical to compute, and this depends on the problem at hand. Fouss et al. [61] review many of the existing measures, and in the following sections we review some representative methods.
2.2.1 Diffusion and random walks
Imagine that one of the nodes of the network is maintained at a certain temperature, while the other nodes are initially uniformly colder. As time passes, the heat flows through the edges of the network, warming up the closest nodes first, then the nodes further away. After some time, the closest nodes will be the warmest, and the temperature can be used as a measure of similarity to the node maintained at a high temperature [103]. This is the kind of intuition used to define diffusion-based similarity measures. One such measure is the laplacian exponential diffusion kernel, defined by Kondor and Lafferty [105].
The similarity between any pair of nodes $i$ and $j$ is given by the element $(i, j)$ of the matrix $S_{\exp} = \exp(-tL)$, where $\exp$ is the matrix exponential, $L$ is the laplacian matrix (defined earlier), and $t$ is a scalar parameter that can be interpreted as the duration of the diffusion.

Sidenote 6: The matrix of transition probabilities of a random walker is $P = D^{-1}A$, which means that if $x_i(t)$ is the probability of a random walker being in node $i$ at time $t$, its probability distribution over the nodes at time $t + 1$ is given by $x(t+1) = P^T x(t)$. It is easy to see that the distribution $x(\infty)$ with $x_i(\infty) = d_{ii}/(2N)$, where $N$ is the number of edges and $e$ is the vector of all ones, is stable for a symmetric network:

$$P^T x(\infty) = A D^{-1} x(\infty) = A e / (2N) = x(\infty)$$

For a discussion of why this is the only stable distribution and of the conditions under which an initial distribution converges to it, as well as for a general introduction to discrete Markov chains, we recommend the work of J.R. Norris [157].
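Since $L$ is symmetric, the kernel can be computed through its eigendecomposition, $\exp(-tL) = U \exp(-t\Lambda) U^T$. A sketch on a hypothetical path graph, where heat injected at one end should reach the nearest node first.

```python
import numpy as np

def diffusion_kernel(A, t):
    """Laplacian exponential diffusion kernel S_exp = exp(-t L), computed
    through the eigendecomposition of the symmetric laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(np.exp(-t * lam)) @ U.T

# Path graph 0 - 1 - 2 - 3: heat from node 0 reaches node 1 before node 3.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

S = diffusion_kernel(A, t=1.0)
assert S[0, 1] > S[0, 2] > S[0, 3] > 0   # similarity decays along the path
```

Note that each row of $S$ sums to one, since the total amount of heat is conserved by the diffusion.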
Another common metaphor used to define similarity measures is to imagine random walkers moving on the network. A popular measure is the random walk with restart [162, 215, 104], also known as the personalized PageRank because of its relation to Google's PageRank algorithm [161]. Imagine someone starting in a node $i$ and walking to one randomly chosen neighbour. He then walks to a randomly chosen neighbour of that new node, and so on. The probability that he arrives in a given node $j$ after $t$ steps can be used as a similarity between the nodes $i$ and $j$ [210]. However, after a long enough time, the probability that the walker is in a given node depends only on the edges in the direct neighbourhood of that node, making the measure useless (sidenote 6). The random walk with restart slightly alters the definition of the process in order to reach a useful steady state: at each step, the walker has a probability $\alpha$ of jumping back to the initial node instead of walking to a direct neighbour. The probability distribution of finding the walker in any node after a time $t$ is given by:

$$x(0) = e_i, \qquad x(t) = (1 - \alpha) P^T x(t-1) + \alpha e_i \qquad (2.1)$$

where $e_i$ is the column vector of size $n$ whose elements are all zeros, except for the $i$th element, which is equal to one. It can be shown that the steady-state distribution is:

$$x(\infty) = (I - (1 - \alpha) P^T)^{-1} e_i \qquad (2.2)$$

$(I - (1 - \alpha) P^T)^{-1}$ is an asymmetric matrix, and by convention we define the random walk with restart similarity $S_{\mathrm{PPR}}$ as the transpose of that matrix [61]:

$$S_{\mathrm{PPR}} = (I - (1 - \alpha) D^{-1} A)^{-1} \qquad (2.3)$$

2.2.2 Eigenvalue transformations
Smola and Kondor [203] offer another justification for these similarity measures. They study regularisations of the normalised graph laplacian ($\Delta = I - D^{-1/2} A D^{-1/2}$), obtained by applying a scalar function to its eigenvalues. We note $\Delta = \sum_{i=1}^{n} \lambda_i v_i v_i^T$ the decomposition of the normalised laplacian into eigenvalues $\{\lambda_i\}$ and eigenvectors $\{v_i\}$. Those eigenvalues are contained in $[0, 2]$ (see [203]).
In their work, Smola and Kondor show that the smaller eigenvalues of $\Delta$ are associated with smoother eigenvectors. It is easy to show the same property for the laplacian $L$. First, notice that for any vector $y$, we have:

$$y^T L y = y^T (D - A) y \qquad (2.4)$$
$$= \sum_i y_i^2 \sum_j a_{ij} - \sum_{i,j} y_i y_j a_{ij} \qquad (2.5)$$
$$= \frac{1}{2} \sum_{i,j \in N} a_{ij} (y_i - y_j)^2 \qquad (2.6)$$

Equation 2.6 is known as Geary's contiguity ratio [68] and is often used as a measure of how smooth a vector $y$ is with regard to the network structure. In other words, $y$ is smooth if the product $y^T L y$ is small. Let us now decompose the laplacian into eigenvectors and eigenvalues, $L = \sum_{i=1}^{n} \lambda'_i u_i u_i^T$. The smoothness of an eigenvector $u_j$ is given by:

$$u_j^T L u_j = u_j^T \left( \sum_{i=1}^{n} \lambda'_i u_i u_i^T \right) u_j \qquad (2.7)$$
$$= \sum_{i=1}^{n} \lambda'_i (u_i^T u_j)^2 \qquad (2.8)$$

Using the fact that the eigenvectors form an orthonormal basis, we find:

$$u_j^T L u_j = \lambda'_j, \qquad (2.9)$$

which makes it clear that smoother eigenvectors of the laplacian are associated with smaller eigenvalues. The same can be shown for the normalised laplacian.
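Both Equation 2.6 and Equation 2.9 are easy to verify numerically; a sketch on a randomly generated hypothetical graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # random symmetric adjacency matrix
L = np.diag(A.sum(axis=1)) - A               # laplacian L = D - A

# Check Geary's contiguity ratio (Equation 2.6) for an arbitrary vector y.
y = rng.standard_normal(n)
geary = 0.5 * sum(A[i, j] * (y[i] - y[j])**2
                  for i in range(n) for j in range(n))
assert np.isclose(y @ L @ y, geary)

# The smoothness of eigenvector u_j equals its eigenvalue (Equation 2.9).
lam, U = np.linalg.eigh(L)
for j in range(n):
    assert np.isclose(U[:, j] @ L @ U[:, j], lam[j])
```

The first eigenvector (eigenvalue 0) is constant, the smoothest possible vector; the last one oscillates the most across edges.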
Based on this observation, Smola and Kondor suggest using the eigenvectors of the normalised laplacian to build a similarity measure that gives more weight to the smoother eigenvectors. If $r : [0, 2] \mapsto \mathbb{R}$ is a monotonically increasing function defined on $[0, 2]$, we can define a similarity measure $S$ as:

$$S = \sum_{i=1}^{n} r^{-1}(\lambda_i)\, v_i v_i^T \qquad (2.10)$$

where $r^{-1}$ is the pseudo-inverse of $r$ ($0^{-1}$ maps to 0).
[Figure 2.2: two scatter plots of the same data in the (x, y) plane; (a) without unlabelled data, (b) with unlabelled data.]
Figure 2.2: In the first image, only labelled samples are visible. Based only on labelled samples, the red line is a reasonable classification boundary. In the second image however, the unlabelled samples reveal structure in the data, and the red line now seems to be a very bad choice. Semi-supervised methods attempt to use both labelled and unlabelled data to deal with similar cases.
Many well-known similarity measures can be expressed this way given the right r function:
• $r(\lambda) = 1 + \sigma\lambda$: the regularised laplacian ($S = (I + \sigma\Delta)^{-1}$),
• $r(\lambda) = e^{\sigma\lambda/2}$: the exponential diffusion ($S = \exp(-\sigma\Delta/2)$),
• $r(\lambda) = (\alpha - \lambda)^{-p}$ with $\alpha > 2$: the $p$-step random walk ($S = (\alpha I - \Delta)^p$).
Many other works define similarity measures on graphs as a transformation of the eigenvalues of the (normalised) laplacian matrix, with applications in node classification [42, 98] and link prediction [109].
2.3 Methods for semi-supervised learning
Node classification and seed set expansion fall into the category of semi-supervised learning. The term semi-supervised learning describes machine learning methods that learn not only from labelled samples, but also from unlabelled samples.
Often, unlabelled samples are much more numerous and easier to gather than
labelled ones, and those unlabelled samples can be a useful source of information
as they can reveal the structure of the data. The classical example is the two
moon dataset (see for example [23]), where points are spread around two curves
in a 2D plane. As shown in Figure 2.2, seeing the unlabelled samples reveals
the whole density function and helps the classification.
Sidenote 7: For a more thorough review of existing network-based semi-supervised learning methods, one can turn to the works of Zhu [233], Subramanya and Talukdar [208], and Silva and Zhao [201, Chapter 7].
Although we only discuss semi-supervised learning on networks in this work, semi-supervised learning is a vast field of research that extends well beyond network-based methods (see e.g. [41, 1, 233]). In other domains it might not always be referred to as semi-supervised learning. In neural networks for example, unsupervised layer-by-layer pre-training before supervised fine-tuning can be seen as a semi-supervised approach [57].
In the context of networks, semi-supervised learning refers to methods that use the whole structure of a partially labelled network to predict the missing labels. In the following, we mostly work with datasets that are naturally expressed as networks. However, it is worth knowing that many semi-supervised methods are designed for non-network data such as the two-moon problem, but create a network in a pre-processing step before solving the problem in its network form, for which many semi-supervised algorithms exist (see e.g. [208, Chapter 2]).
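A typical pre-processing step of this kind builds a k-nearest-neighbour graph from plain feature vectors; a minimal sketch, where the choice $k = 2$ and the Euclidean metric are arbitrary, and the two point clouds are invented for illustration.

```python
import numpy as np

def knn_graph(X, k):
    """Build the adjacency matrix of a symmetric k-nearest-neighbour graph
    from a data matrix X (one sample per row, Euclidean distances)."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k + 1]   # skip the point itself
        A[i, nearest] = 1
    return np.maximum(A, A.T)                    # symmetrise the graph

# Two hypothetical clusters of points in the plane.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
A = knn_graph(X, k=2)
assert A[0, 1] == 1 and A[3, 4] == 1   # edges stay within each cluster
assert A[0, 3] == 0                    # no edge across the two clusters
```

Once such a graph is built, any of the network-based semi-supervised methods of this chapter can be applied to the original, non-network data.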
In the following we describe some foundational and state-of-the-art methods for semi-supervised learning on networks (sidenote 7).
2.3.1 Smoothness optimization
A common approach is to express the labelling as the solution of an optimization problem that takes two aspects into account: (1) consistency with the known labels and (2) smoothness of the labelling on the graph. Those two aspects can take multiple definitions, but we present here one version described in [62, Chapter 6.2], whose definition of smoothness is based on Geary's contiguity ratio [68]. The optimisation problem is stated as follows:
$$\min_{\hat{y}} \sum_{i \in K} (\hat{y}_i - y_i)^2 + \sigma\, \frac{1}{2} \sum_{i,j \in N}$$