What AI for structured data?
Graph Neural Networks
Romain Raveaux
romain.raveaux@univ-tours.fr, Associate Professor
Université de Tours
Laboratoire d'informatique (LIFAT), RFAI team
1
Outline
1. Dense Deep Learning : MLP/FC
2. Graph Neural Networks
3. Problems of Graph Neural Networks
4. Transformers
5. Transformers' problems
6. Graph Transformers
7. Problems of Graph Transformers
8. The losses
9. Summary and perspectives
2
Dense Deep Learning : MLP/FC
3
A Linear model
• H ∈ ℝ^{m×1} : input
• W ∈ ℝ
• B ∈ ℝ
• Y ∈ ℝ^{m×1} : output (prediction)
• Y = HW + B
4
$$H = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix},\quad W = \begin{bmatrix} w_1 \end{bmatrix},\quad HW = \begin{bmatrix} h_1 w_1 \\ h_2 w_1 \\ h_3 w_1 \end{bmatrix}$$
A Linear model : for multi-dimensional input
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×1}
• B ∈ ℝ
• Y ∈ ℝ^{m×1} : output (prediction)
• Y = HW + B
5
$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\quad W = \begin{bmatrix} w_a \\ w_b \end{bmatrix},\quad HW = \begin{bmatrix} h_{1a}w_a + h_{1b}w_b \\ h_{2a}w_a + h_{2b}w_b \\ h_{3a}w_a + h_{3b}w_b \end{bmatrix}$$
A Linear model : for multi-dimensional input and output
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×n}
• B ∈ ℝ^{n}
• Y ∈ ℝ^{m×n} : output (prediction)
• Y = HW + B
6
$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\quad W = \begin{bmatrix} w_{1a} & w_{1b} \\ w_{2a} & w_{2b} \end{bmatrix},\quad HW = \begin{bmatrix} h_{1a}w_{1a} + h_{1b}w_{2a} & h_{1a}w_{1b} + h_{1b}w_{2b} \\ h_{2a}w_{1a} + h_{2b}w_{2a} & h_{2a}w_{1b} + h_{2b}w_{2b} \\ h_{3a}w_{1a} + h_{3b}w_{2a} & h_{3a}w_{1b} + h_{3b}w_{2b} \end{bmatrix}$$
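As a quick illustration of the formula above, here is a minimal numpy sketch (the shapes m, d, n and the random values are purely illustrative):

```python
# Minimal sketch of Y = HW + B with numpy (illustrative shapes and values).
import numpy as np

m, d, n = 3, 2, 2          # m samples, d input features, n outputs
H = np.random.randn(m, d)  # input
W = np.random.randn(d, n)  # weight matrix
B = np.random.randn(n)     # bias, broadcast over the m rows
Y = H @ W + B              # prediction, shape (m, n)
print(Y.shape)             # (3, 2)
```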
A non-linear model : for multi-dimensional input/output
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×n}
• B ∈ ℝ^{n}
• R(a), σ(a) : a non-linear function
• Y ∈ ℝ^{m×n} : output (prediction)
• Y = σ(HW + B)
7
Can we stack the model ? Endomorphism property
• H ∈ ℝ^{m×d} : input
• W^{(1)} ∈ ℝ^{d×n}
• B^{(1)} ∈ ℝ^{n}
• Y^{(1)} ∈ ℝ^{m×n} : intermediate result
• W^{(2)} ∈ ℝ^{n×o}
• B^{(2)} ∈ ℝ^{o}
• Y^{(2)} ∈ ℝ^{m×o} : output (prediction)
• Y^{(1)} = σ(H W^{(1)} + B^{(1)})
• Y^{(2)} = σ(Y^{(1)} W^{(2)} + B^{(2)})
• Y^{(2)} = σ(σ(H W^{(1)} + B^{(1)}) W^{(2)} + B^{(2)})
• Y^{(l)} = f(Y^{(l-1)}, W^{(l)}, B^{(l)}) = f(Y^{(l-1)}, θ^{(l)})
• Y^{(l)} = f(f(X, θ^{(l-1)}), θ^{(l)}) : composition of functions
• F(X, Θ) = f_{θ^{(l)}} ∘ f_{θ^{(l-1)}} ∘ … : composition of functions
Multi Layer Perceptron : MLP
A sequence of linear and non-linear operations.
Why these non-linear operations?
8
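As a sketch of the stacking idea, here is a minimal two-layer MLP in PyTorch (the sizes d, n, o are illustrative). Without the non-linearity σ, the two linear layers would collapse into a single linear map, which is one answer to the question above.

```python
# Minimal two-layer MLP sketch in PyTorch (illustrative sizes).
import torch
import torch.nn as nn

d, n, o = 4, 8, 3
mlp = nn.Sequential(
    nn.Linear(d, n),   # Y(1) = H W(1) + B(1)
    nn.ReLU(),         # sigma: the non-linearity between the layers
    nn.Linear(n, o),   # Y(2) = Y(1) W(2) + B(2)
)
H = torch.randn(5, d)  # m = 5 samples
Y = mlp(H)             # output, shape (5, o)
```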
The basics of artificial neural networks
9
• MLP at a glance:
• How to generalize the MLP to graphs?
Graph Neural Networks
10
Graph Neural Networks : Principles
• Input: A graph
• Output: Node embeddings
• Assumptions: unordered data, stationarity and compositionality
• The goal:
• Graph Neural Networks (GNNs) perform end-to-end learning, including feature extraction and classification.
11
Graph Neural Networks as an encoder
Output : a node embedding
12
Graph Neural Networks as an encoder
13
Decoder:
16
The basics of graph neural networks
17
Adjacency matrix
18
Degree matrix
19
Exponentiation of the adjacency matrix
20
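A small numpy illustration of these three objects, for a toy 3-node path graph (my own example, not taken from the slides):

```python
# Adjacency matrix A, degree matrix D, and A^2 (number of 2-hop walks)
# for the path graph 1 - 2 - 3.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
D = np.diag(A.sum(axis=1))   # degree matrix: degrees 1, 2, 1 on the diagonal
A2 = A @ A                   # (A^2)[i, j] = number of walks of length 2 from i to j
print(D.diagonal(), A2)
```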
Key idea and Intuition [Kipf and Welling, 2016]
• The key idea is to generate node embeddings based on local neighborhoods.
• The intuition is to aggregate node information from their neighbors using neural networks.
• Nodes have embeddings at each layer, and the network can have arbitrary depth. The "layer-0" embedding of node u is its input feature vector, i.e. F_u.
21
Topology and Features
• A ∈ ℝ^{|V|×|V|} : adjacency matrix (topology/structure)
• H ∈ ℝ^{|V|×d} : a d-dimensional feature vector for each node
• AH ∈ ℝ^{|V|×d} : feature aggregation
23
$$A = \begin{bmatrix} a_{1a} & a_{1b} \\ a_{2a} & a_{2b} \end{bmatrix},\quad H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \end{bmatrix},\quad AH = \begin{bmatrix} a_{1a}h_{1a} + a_{1b}h_{2a} & a_{1a}h_{1b} + a_{1b}h_{2b} \\ a_{2a}h_{1a} + a_{2b}h_{2a} & a_{2a}h_{1b} + a_{2b}h_{2b} \end{bmatrix}$$
Simple example of f
24
Topology and Features
• AH ∈ ℝ^{|V|×d} : feature aggregation
• W ∈ ℝ^{d×l} : parameter matrix whose size does not depend on |V|
25
$$AH = \begin{bmatrix} ah_{1a} & ah_{1b} \\ ah_{2a} & ah_{2b} \end{bmatrix},\quad W = \begin{bmatrix} W_{1a} & W_{1b} & W_{1c} \\ W_{2a} & W_{2b} & W_{2c} \end{bmatrix},\quad AHW = \begin{bmatrix} ah_{1a}W_{1a} + ah_{1b}W_{2a} & ah_{1a}W_{1b} + ah_{1b}W_{2b} & ah_{1a}W_{1c} + ah_{1b}W_{2c} \\ ah_{2a}W_{1a} + ah_{2b}W_{2a} & ah_{2a}W_{1b} + ah_{2b}W_{2b} & ah_{2a}W_{1c} + ah_{2b}W_{2c} \end{bmatrix}$$
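A minimal numpy sketch of this simple propagation rule f(H, A) = σ(AHW), assuming a toy random adjacency matrix and ReLU as the non-linearity:

```python
# Simple (un-normalized) propagation rule: aggregate neighbour features with A,
# project with the shared parameter matrix W, then apply a ReLU.
import numpy as np

V, d, l = 4, 3, 2
A = (np.random.rand(V, V) < 0.3).astype(float)  # toy adjacency matrix (not normalized)
H = np.random.randn(V, d)                       # node features
W = np.random.randn(d, l)                       # parameters, independent of |V|
H_next = np.maximum(A @ H @ W, 0)               # f(H, A) = ReLU(A H W)
```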
2 issues of this simple example
• Issue 1:
• for every node, f sums up all the feature vectors of all neighboring nodes but not the node itself.
• Fix: simply add the identity matrix to A
• Issue 2:
• A is typically not normalized and therefore the multiplication with A will completely change the scale of the feature vectors.
• Fix: Normalizing A such that all rows sum to one:
26
Altogether: [Kipf and Welling, 2016]
27
• The two patches mentioned before, plus
• a better (symmetric) normalization of the adjacency matrix.
O(|E|) time complexity overall.
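Putting the two patches and the symmetric normalization together, a dense numpy sketch of the resulting propagation rule could look as follows (the real implementation uses sparse matrices to keep the O(|E|) complexity):

```python
# Sketch of the GCN rule of [Kipf and Welling, 2016]:
# H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ), written densely for clarity.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])               # fix 1: add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # fix 2: symmetric normalization
    return np.maximum(A_norm @ H @ W, 0)         # non-linearity (ReLU)
```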
Graph Convolutional Networks
28
• Kipf et al.’s Graph Convolutional Networks (GCNs) are a slight variation on the neighborhood aggregation idea:
Graph Convolutional Networks
29
Basic neighborhood aggregation vs. GCN neighborhood aggregation:
• GCN uses the same transformation matrix for self and neighbor embeddings, instead of separate matrices.
• Instead of a simple average, GCN uses a per-neighbor normalization that varies across neighbors.
Graph Convolutional Networks
30
• Empirically, they found this configuration to give the best results.
• More parameter sharing.
• Down-weights high-degree neighbors.
The Math
• Basic approach: average neighbor messages and apply a neural network.
$$h_v^{(0)} = x_v, \qquad h_v^{(k)} = \sigma\Big( W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k\, h_v^{(k-1)} \Big),\ k \geq 1$$
where the initial "layer-0" embeddings are the node features, h_v^{(k)} is the k-th layer embedding of v, σ is a non-linearity (e.g., ReLU or tanh), the sum averages the neighbors' previous-layer embeddings, and h_v^{(k-1)} is the previous-layer embedding of v.
Training the Model
32
• After K layers of neighborhood aggregation, we get output embeddings for each node.
• We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters (the trainable matrices, i.e., what we learn).
More complex models [Nowak et al., 2017]
33
Inductive Capability
34
Inductive node embeddings generalize to entirely unseen graphs:
e.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.
Train on one graph, generalize to a new graph.
Inductive Capability
35
Train on a snapshot; when a new node arrives, generate an embedding for it.
Many application settings constantly encounter previously unseen nodes, e.g., Reddit, YouTube, Google Scholar.
We need to generate new embeddings "on the fly".
From matrix computations to message passing algorithms
36
• AGGREGATE: this function gathers information from the neighbors; it must be permutation invariant.
• COMBINE: this function fuses the aggregated neighbor information into the node representation.
• READOUT: this function aggregates the node features of the entire graph G for the final task (e.g., graph classification); a node-centric sketch follows below.
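A node-centric sketch of one such message-passing layer, assuming an adjacency-list representation; the mean AGGREGATE, linear COMBINE and sum READOUT are only illustrative choices:

```python
# One message-passing layer: AGGREGATE (permutation-invariant mean over neighbours),
# COMBINE (linear projection of self + aggregated message, then ReLU), and a
# graph-level READOUT (sum over all node embeddings).
import numpy as np

def mp_layer(neighbors, H, W_self, W_neigh):
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        if neighbors[v]:
            agg = np.mean(H[neighbors[v]], axis=0)               # AGGREGATE
        else:
            agg = np.zeros(H.shape[1])
        H_new[v] = np.maximum(H[v] @ W_self + agg @ W_neigh, 0)  # COMBINE
    return H_new

def readout(H):
    return H.sum(axis=0)   # READOUT: one vector for the whole graph G
```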
GraphSAGE Idea
46
• So far we have aggregated the neighbor messages by taking their (weighted) average. Can we do better?
GraphSAGE Idea
47
Any differentiable function that maps a set of vectors to a single vector can be used.
• Simple neighborhood aggregation:
• GraphSAGE:
GraphSAGE Differences
48
generalized aggregation
concatenate self embedding and neighbor embedding
GraphSAGE Variants
49
• Mean: element-wise mean of the neighbor vectors.
• Pool: transform the neighbor vectors and apply a symmetric vector function (element-wise mean/max).
• LSTM: apply an LSTM to a random permutation of the neighbors.
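A rough numpy sketch of a GraphSAGE layer with the mean aggregator, where the node's own embedding is concatenated with the aggregated neighbour embedding before the projection (shapes and names are illustrative):

```python
# GraphSAGE-style layer, mean aggregator: h_v' = ReLU( W [h_v || mean(h_u, u in N(v))] ).
import numpy as np

def sage_layer(neighbors, H, W):                 # W has shape (2*d, d_out)
    out = []
    for v in range(H.shape[0]):
        h_neigh = H[neighbors[v]].mean(axis=0) if neighbors[v] else np.zeros(H.shape[1])
        h_cat = np.concatenate([H[v], h_neigh])  # concatenate self and neighbour embedding
        out.append(np.maximum(h_cat @ W, 0))     # generalized aggregation + ReLU
    return np.stack(out)
```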
Graph Attention Network (GAT)
• The contributions of neighboring nodes to the central node are not identical, unlike in GraphSAGE, GCN, etc.
• Learn the relative weights between two connected nodes.
• Can take edge features into account.
50
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. of ICLR, 2018.
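A rough single-head sketch of the GAT attention coefficients for one node, following α_ij = softmax_j(LeakyReLU(aᵀ[Wh_i || Wh_j])); the helper name and shapes are illustrative, not the authors' code:

```python
# GAT-style attention for node i over its neighbourhood (single head, numpy).
import numpy as np

def gat_node(i, neighbors_i, H, W, a):                       # a has shape (2*d_out,)
    Wh = H @ W                                               # project all node features
    scores = []
    for j in neighbors_i:
        e_ij = np.concatenate([Wh[i], Wh[j]]) @ a            # attention logit e_ij
        scores.append(e_ij if e_ij > 0 else 0.2 * e_ij)      # LeakyReLU (slope 0.2)
    scores = np.array(scores)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                              # softmax over the neighbourhood
    return alpha @ Wh[neighbors_i]                           # attention-weighted sum of neighbours
```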
GNN as an encoder
51
Edge features in GNN
• Limitations: GAT can handle only binary edge indicators, and GCNs can handle only one-dimensional edge features.
• Exploiting Edge Features for Graph Neural Networks
• First, we propose to use doubly stochastic normalization of graph edge features instead of the commonly used row or symmetric normalization approaches used in current graph neural networks.
• Second, we construct new formulas for the operations in each individual layer so that they can handle multi-dimensional edge features.
• Third, in the proposed new framework, edge features are adaptive across network layers.
52
Edge features in GNN?
In existing attention mechanisms, the attention coefficient depends on the two points Xi· and Xj· only. Here we let the attention operation be guided by edge features of the edge connecting the two nodes, so α depends on edge features as well.
The attention mechanism is marginal: edge feature components are considered independently, not as vectors.
53
54
55
Problems of graph neural networks
• Long range dependencies / Over-squashing
• Weisfeiler-Lehman test for graph isomorphism
• Not that generic.
56
Problems of graph neural networks : Long-range dependencies / Over-squashing
57
Weisfeiler-Lehman test for graph isomorphism
P is a permutation matrix.
58
Weisfeiler-Lehman test for graph isomorphism
Possibly isomorphic
59
Relation to WEISFEILER-LEHMAN algorithm
60
Relation to WEISFEILER-LEHMAN algorithm
• A neural network model for graph-structured data should ideally be able to learn representations of nodes in a graph, taking both the graph structure and feature description of nodes into account.
• A well-studied framework for the unique assignment of node labels given a graph is provided by the 1-dim Weisfeiler-Lehman (WL-1) algorithm (Weisfeiler & Lehmann, 1968)
Weisfeiler-Lehman test of isomorphism
61
Relation to WEISFEILER-LEHMAN algorithm
62
Relation to WEISFEILER-LEHMAN algorithm
• The choice of the aggregation and the update operations is crucial:
• only multiset injective functions make it equivalent to the WL algorithm.
• Some popular choices for aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures
63
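A compact sketch of one refinement round of the 1-dimensional WL test: each node label is replaced by a (hashed) pair of its own label and the sorted multiset of its neighbours' labels; this injective multiset update is exactly what mean or max aggregators fail to imitate:

```python
# One iteration of 1-WL colour refinement on a graph given as adjacency lists.
def wl_iteration(neighbors, labels):
    new_labels = {}
    for v, lab in labels.items():
        multiset = tuple(sorted(labels[u] for u in neighbors[v]))  # neighbour label multiset
        new_labels[v] = hash((lab, multiset))   # injective up to hash collisions
    return new_labels
```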
Transformers
67
Transformers for WHAT ?
• Sequence to Sequence (most popular)
• Computer vision (ViT, DETR): image classification, object detection
68
Transformer : The origin
https://arxiv.org/pdf/1706.03762.pdf
69
Sequence to Sequence
• NOT TRANSFORMERS
• RNN, LSTM
• One word at a time
• Not well parallelized
• SLOW
70
Sequence to Sequence
• Transformers
• Many words at the same time (up to 1000 words so far)
• Well Parallelized
• Fast
71
Input embedding : The world of vectors
72
Transformers
• Are stackable
• Get rich representation
• By composition
73
The Transformer discovers the NLP pipeline
74
Encoder part
75
Encoder
76
Self Attention
77
Self Attention
78
Self attention : query, key and value
The key/value/query concepts come from retrieval systems. For example, when you type a query to search for a video on YouTube, the search engine maps your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then presents you the best-matched videos (values).
79
Self attention : query, key and value
80
Self attention : query by key = score
81
Self attention : softmax of the score = Attention weights
82
Attention weights by the value = output
83
Encoder in formula: Attention
84
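A minimal numpy sketch of the scaled dot-product self-attention formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, for a single head (matrix names are the usual ones, shapes illustrative):

```python
# Single-head self-attention: queries, keys and values are linear projections of X.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[1])           # query-by-key scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                # attention weights times the values
```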
Attention weights by the value = output
Focus for one word
85
Multi head attention : concat + linear
86
Then the dense layers appear : Fully connected (MLP)
87
Encoder
88
Encoder in vectors
90
Word order matters : Positional encoding
91
Positional Encoding
92
Positional Encoding
93
Positional Encoding
94
Word order matters : Positional encoding
95
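A small sketch of the sinusoidal positional encoding from the original Transformer paper, PE(pos, 2i) = sin(pos/10000^{2i/d}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d}) (assumes an even model dimension):

```python
# Sinusoidal positional encoding, added to the input word embeddings.
import numpy as np

def positional_encoding(seq_len, d_model):        # d_model assumed even
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe
```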
Transformer : Why does it work ?
• Attention mechanism
• Very generic ?
• Is there any "a priori knowledge"?
101
Transformer problem
• Self-attention complexity is quadratic in the input length.
102
Graph Transformers
103
Graphormer
104
Graphormer
• Our key insight for applying the Transformer to graphs is the necessity of effectively encoding the structural information of the graph into the model.
• We propose several simple yet effective structural encoding methods
• We mathematically characterize the expressive power of Graphormer
• The key is to properly incorporate structural information of graphs into the model
105
• Note that for each node i, self-attention only computes the semantic similarity between i and the other nodes, without considering the structural information of the graph reflected on the nodes and the relation between node pairs.
• Graphormer incorporates several effective structural encoding methods to leverage such information
106
Different nodes may have different importance
• First, we propose a Centrality Encoding in Graphormer to capture the node importance in the graph.
• we leverage the degree centrality for the centrality encoding, where a learnable vector is assigned to each node according to its degree and added to the node features in the input layer.
107
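A minimal PyTorch sketch of such a degree-based centrality encoding: a learnable embedding indexed by the node degree is added to the input node features (class and parameter names are illustrative, not Graphormer's actual code):

```python
# Degree-based centrality encoding: one learnable vector per degree value.
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    def __init__(self, max_degree, d_model):
        super().__init__()
        self.deg_emb = nn.Embedding(max_degree + 1, d_model)

    def forward(self, x, degree):                 # x: (V, d_model), degree: (V,) long tensor
        degree = degree.clamp(max=self.deg_emb.num_embeddings - 1)
        return x + self.deg_emb(degree)           # add the centrality vector to the node features
```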
Relation between nodes
• Second, we propose a novel Spatial Encoding in Graphormer to capture the structural relation between nodes.
• Shortest-path distance (e.g., via Dijkstra): as a general-purpose choice, we use the shortest-path distance between any two nodes, which is encoded as a bias term in the softmax attention and helps the model accurately capture the spatial dependency in a graph.
• Edge features: spatial information contained in the edge features is also used, as an average of dot products between the edge features and learnable embeddings along the shortest path.
108
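A minimal PyTorch sketch of the shortest-path-distance bias: the distance between nodes i and j indexes a learnable scalar that is added to the attention logits before the softmax (single head, illustrative names, not Graphormer's actual code):

```python
# Spatial encoding as a learnable bias on the attention logits.
import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    def __init__(self, max_dist):
        super().__init__()
        self.bias = nn.Embedding(max_dist + 1, 1)      # one learnable bias per distance value

    def forward(self, scores, spd):                    # scores: (V, V) logits, spd: (V, V) long
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return scores + self.bias(spd).squeeze(-1)     # added before the softmax
```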
109
110
Positional Encoding = Spatial Encoding
112
Positional encoding via edge features
113
Another paper by Xavier Bresson
114
Graph Transformer Layer with edge features
122
GNN and transformers are encoders
127
Applications and losses
• Let us recall that Z is the output of the GNN or Graph Transformer.
• Graph Transformers also produce attention maps.
128
Unsupervised learning
• The graph factorization problem is the problem of predicting if two nodes are linked or not.
• where a similarity measure between two node embeddings is used
• Variation of graph factorization
• DeepWalk [Perozzi et al., 2014]
• node2vec [Grover and Leskovec, 2016]
129
Unsupervised embedding: Graph factorization
• Use stochastic gradient descent (SGD) as a general optimization method.
The graph factorization problem is the problem of predicting if two nodes are linked or not.
130
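A minimal sketch of such an unsupervised link-prediction objective: the dot product between two node embeddings (through a sigmoid) should predict whether the edge exists, optimised by SGD; the dense formulation below is for clarity only:

```python
# Graph factorization / link prediction loss on the node embeddings Z.
import torch
import torch.nn.functional as F

def factorization_loss(Z, A):          # Z: (V, d) embeddings, A: (V, V) float 0/1 adjacency
    logits = Z @ Z.t()                 # dot-product similarity between node embeddings
    return F.binary_cross_entropy_with_logits(logits, A)
```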
Supervised learning
• Node classification : social network
131
Supervised learning
• Node classification :
132
Node-wise classification
133
Supervised learning : Node classification
• Last layer: the activation function is a softmax.
• The loss :
134
Gibbs distribution
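A minimal PyTorch sketch of this node-wise classification loss; the names Z, W_out and labelled_mask are illustrative, and cross_entropy applies the softmax internally:

```python
# Node classification: class scores from the node embeddings, cross-entropy on labelled nodes.
import torch
import torch.nn.functional as F

def node_classification_loss(Z, W_out, labels, labelled_mask):
    logits = Z @ W_out                                   # last layer: one score per class
    return F.cross_entropy(logits[labelled_mask], labels[labelled_mask])
```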
Supervised learning:
• Graph classification
• A global average pooling layer must be added to gather all the node embeddings of a given graph.
• Z' can be fed to an MLP for classification
• The loss is the cross entropy
135
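A minimal sketch of the graph-level pipeline just described (global average pooling, then an MLP and cross-entropy); the mlp argument stands for any small classifier module, names are illustrative:

```python
# Graph classification: pool node embeddings into one vector, classify with an MLP.
import torch
import torch.nn.functional as F

def graph_classification_loss(Z, mlp, label):        # Z: (V, d), label: scalar long tensor
    z_graph = Z.mean(dim=0, keepdim=True)            # global average pooling -> Z'
    logits = mlp(z_graph)                            # MLP classifier, shape (1, num_classes)
    return F.cross_entropy(logits, label.view(1))
```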
Semi-Supervised learning:
• Node classification:
• the problem of classifying nodes in a graph, where labels are only available for a small subset of nodes.
• Label information is smoothed over the graph via some form of explicit graph-based regularization.
• The assumption is that connected nodes in the graph are likely to share the same label (class).
• This is true for instance for a neighborhood graph.
136
So, $L_{reg} = \lVert Z Z^{T} - A \rVert_{F}^{2}$ (Frobenius norm).
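A minimal sketch of the full semi-supervised objective, combining the supervised cross-entropy on the labelled subset with the regulariser above (the weight lam is an illustrative hyper-parameter):

```python
# Semi-supervised objective: L = L_sup + lambda * || Z Z^T - A ||_F^2.
import torch
import torch.nn.functional as F

def semi_supervised_loss(Z, logits, labels, labelled_mask, A, lam=1e-3):
    l_sup = F.cross_entropy(logits[labelled_mask], labels[labelled_mask])
    l_reg = torch.norm(Z @ Z.t() - A, p='fro') ** 2
    return l_sup + lam * l_reg
```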
Some applications
• Node classification on citation networks.
• The input is a citation network where nodes are papers, edges are citation links, and optionally there are bag-of-words features on the nodes. The target for each node is a paper category (e.g. stat.ML, cs.LG, ...).
137
Image/molecule classification
• Graph classification: An image is represented as a graph: based on raw pixels (a regular grid and all images have the same graph) or based on superpixels (irregular graph)
139
Graph classification
Taken from M. Bronstein, CVPR Tutorial 2017.
140
Molecular graph examples:
1. Cancerous or non-cancerous molecules
2. Determination of the boiling point
Summary on GNN and Graph Transformers
141
Summary on GNN and Graph Transformers
• Node attention mechanism
• GAT fashion
• Or Transformer fashion (K, Q, V)
• Nodes are embedded in K, Q, V
142
Summary on GNN and Graph Transformers
GNN advantages : Fast.
GNN problems :
- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures
- Fails to capture long-range relations in the graph (over-squashing).
143
Summary on GNN and Graph Transformers
Attentional GNN and Graph Transformers advantages : - Better feature extractors.
Attentional and Graph Transformers problems :
- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures.
- Quadratic time to compute the attention matrix.
144
Summary on GNN and Graph Transformers
Edge features consideration
• Things have been done but the question is still open.
• Edge features in the attention matrix only
• Edge embedding
145
Summary on GNN and Graph Transformers
• Wish list :
• Linear (Fast) feature extraction
• Node and edge embeddings
• Positional encoding for graphs
• Topology/Structure encoding
• Graph Laplacian
• Dijkstra (number of hops)
• Dijkstra (Edge features)
• Distinguish between very simple graph structures
• From pairwise feature embedding to higher-degree (subgraph) embedding
• Phi is a function of x_i, x_j, x_k, …
146
Recommended reading
• Tutorials:
• Geometric Deep Learning, Tutorial, CVPR, 2017. http://geometricdeeplearning.com/
• Deep Learning on Graphs with Graph Convolutional Networks. http://deeploria.gforge.inria.fr/thomasTalk.pdf
• Graph-based Methods in Pattern Recognition & Document Image Analysis. http://gmprdia.univ-lr.fr/
• List of papers:
• Gilmer et al., Neural Message Passing for Quantum Chemistry, 2017. https://arxiv.org/abs/1704.01212
• Kipf et al., Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017. https://arxiv.org/abs/1609.02907
• Defferrard et al., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016. https://arxiv.org/abs/1606.09375
• Bruna et al., Spectral Networks and Locally Connected Networks on Graphs, ICLR 2014. https://arxiv.org/abs/1312.6203
• Duvenaud et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, NIPS 2015. https://arxiv.org/abs/1509.09292
• Li et al., Gated Graph Sequence Neural Networks, ICLR 2016. https://arxiv.org/abs/1511.05493
• Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics, NIPS 2016. https://arxiv.org/abs/1612.00222
• Kearnes et al., Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016. https://arxiv.org/abs/1603.00856
147