What AI for structured data?
Graph Neural Networks
Romain Raveaux
romain.raveaux@univ-tours.fr, Associate Professor
Université de Tours
Laboratoire d'informatique (LIFAT), RFAI team
1
Outline
1. Dense Deep Learning : MLP/FC
2. Graph Neural Networks
3. Problems of Graph Neural Networks
4. Transformers
5. Transformers' problems
6. Graph Transformers
7. Problems of Graph Transformers
8. The losses
9. Summary and perspectives
2
Dense Deep Learning : MLP/FC
3
A Linear model
• H ∈ ℝ^{m×1} : input
• W ∈ ℝ
• B ∈ ℝ
• Y ∈ ℝ^{m×1} : output (prediction)
• Y = HW + B
4
$$H = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix},\quad W = \begin{bmatrix} w_1 \end{bmatrix},\quad HW = \begin{bmatrix} h_1 w_1 \\ h_2 w_1 \\ h_3 w_1 \end{bmatrix}$$
A Linear model : for multi-dimensional input
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×1}
• B ∈ ℝ
• Y ∈ ℝ^{m×1} : output (prediction)
• Y = HW + B
5
$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\quad W = \begin{bmatrix} w_a \\ w_b \end{bmatrix},\quad HW = \begin{bmatrix} h_{1a}w_a + h_{1b}w_b \\ h_{2a}w_a + h_{2b}w_b \\ h_{3a}w_a + h_{3b}w_b \end{bmatrix}$$
A Linear model : for multi-dimensional input and output
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×n}
• B ∈ ℝ^{n}
• Y ∈ ℝ^{m×n} : output (prediction)
• Y = HW + B
6
$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\quad W = \begin{bmatrix} w_{1a} & w_{1b} \\ w_{2a} & w_{2b} \end{bmatrix},\quad HW = \begin{bmatrix} h_{1a}w_{1a} + h_{1b}w_{2a} & h_{1a}w_{1b} + h_{1b}w_{2b} \\ h_{2a}w_{1a} + h_{2b}w_{2a} & h_{2a}w_{1b} + h_{2b}w_{2b} \\ h_{3a}w_{1a} + h_{3b}w_{2a} & h_{3a}w_{1b} + h_{3b}w_{2b} \end{bmatrix}$$
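As a quick illustration of the formula above, here is a minimal numpy sketch (the shapes m, d, n and the random values are purely illustrative):

```python
# Minimal sketch of Y = HW + B with numpy (illustrative shapes and values).
import numpy as np

m, d, n = 3, 2, 2          # m samples, d input features, n outputs
H = np.random.randn(m, d)  # input
W = np.random.randn(d, n)  # weight matrix
B = np.random.randn(n)     # bias, broadcast over the m rows
Y = H @ W + B              # prediction, shape (m, n)
print(Y.shape)             # (3, 2)
```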
A non-linear model : for multi-dimensional input/output
• H ∈ ℝ^{m×d} : input
• W ∈ ℝ^{d×n}
• B ∈ ℝ^{n}
• R(a), σ(a) : a non-linear function
• Y ∈ ℝ^{m×n} : output (prediction)
• Y = σ(HW + B)
7
Can we stack the model ? Endomorphism property
• H ∈ ℝ^{m×d} : input
• W^{(1)} ∈ ℝ^{d×n}
• B^{(1)} ∈ ℝ^{n}
• Y^{(1)} ∈ ℝ^{m×n} : intermediate result
• W^{(2)} ∈ ℝ^{n×o}
• B^{(2)} ∈ ℝ^{o}
• Y^{(2)} ∈ ℝ^{m×o} : output (prediction)
• Y^{(1)} = σ(H W^{(1)} + B^{(1)})
• Y^{(2)} = σ(Y^{(1)} W^{(2)} + B^{(2)})
• Y^{(2)} = σ(σ(H W^{(1)} + B^{(1)}) W^{(2)} + B^{(2)})
• Y^{(l)} = f(Y^{(l-1)}, W^{(l)}, B^{(l)}) = f(Y^{(l-1)}, θ^{(l)})
• Y^{(l)} = f(f(X, θ^{(l-1)}), θ^{(l)}) : composition of functions
• F(X, Θ) = f_{θ^{(l)}} ∘ f_{θ^{(l-1)}} ∘ … : composition of functions
Multi Layer Perceptron : MLP
A sequence of linear and non-linear operations.
Why these non-linear operations?
8
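As a sketch of the stacking idea, here is a minimal two-layer MLP in PyTorch (the sizes d, n, o are illustrative). Without the non-linearity σ, the two linear layers would collapse into a single linear map, which is one answer to the question above.

```python
# Minimal two-layer MLP sketch in PyTorch (illustrative sizes).
import torch
import torch.nn as nn

d, n, o = 4, 8, 3
mlp = nn.Sequential(
    nn.Linear(d, n),   # Y(1) = H W(1) + B(1)
    nn.ReLU(),         # sigma: the non-linearity between the layers
    nn.Linear(n, o),   # Y(2) = Y(1) W(2) + B(2)
)
H = torch.randn(5, d)  # m = 5 samples
Y = mlp(H)             # output, shape (5, o)
```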
The basics of artificial neural networks
9
• MLP at a glance:
• How to generalize the MLP to graphs?
Graph Neural Networks
10
Graph Neural Networks : Principles
• Input: A graph
• Output: Node embeddings
• Assumptions: unordered data, stationarity and compositionality
• The goal:
• Graph Neural Networks (GNNs) perform end-to-end learning, including feature extraction and classification.
11
Graph Neural Networks as an encoder
Output : a node embedding
12
Graph Neural Networks as an encoder
13
Decoder:
16
The basics of graph neural networks
17
Adjacency matrix
18
Degree matrix
19
Exponentiation of the adjacency matrix
20
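A small numpy illustration of these three objects, for a toy 3-node path graph (my own example, not taken from the slides):

```python
# Adjacency matrix A, degree matrix D, and A^2 (number of 2-hop walks)
# for the path graph 1 - 2 - 3.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
D = np.diag(A.sum(axis=1))   # degree matrix: degrees 1, 2, 1 on the diagonal
A2 = A @ A                   # (A^2)[i, j] = number of walks of length 2 from i to j
print(D.diagonal(), A2)
```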
Key idea and Intuition [Kipf and Welling, 2016]
• The key idea is to generate node embeddings based on local neighborhoods.
• The intuition is to aggregate node information from their neighbors using neural networks.
• Nodes have embeddings at each layer, and the network can have arbitrary depth. The "layer-0" embedding of node u is its input feature vector, i.e. F_u.
21
Topology and Features
• A ∈ ℝ^{|V|×|V|} : adjacency matrix (topology/structure)
• H ∈ ℝ^{|V|×d} : a d-dimensional feature vector for each node
• AH ∈ ℝ^{|V|×d} : feature aggregation
23
$$A = \begin{bmatrix} a_{1a} & a_{1b} \\ a_{2a} & a_{2b} \end{bmatrix},\quad H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \end{bmatrix},\quad AH = \begin{bmatrix} a_{1a}h_{1a} + a_{1b}h_{2a} & a_{1a}h_{1b} + a_{1b}h_{2b} \\ a_{2a}h_{1a} + a_{2b}h_{2a} & a_{2a}h_{1b} + a_{2b}h_{2b} \end{bmatrix}$$
Simple example of f
24
Topology and Features
• AH ∈ ℝ^{|V|×d} : feature aggregation
• W ∈ ℝ^{d×l} : parameter matrix whose size does not depend on |V|
25
$$AH = \begin{bmatrix} ah_{1a} & ah_{1b} \\ ah_{2a} & ah_{2b} \end{bmatrix},\quad W = \begin{bmatrix} W_{1a} & W_{1b} & W_{1c} \\ W_{2a} & W_{2b} & W_{2c} \end{bmatrix},\quad AHW = \begin{bmatrix} ah_{1a}W_{1a} + ah_{1b}W_{2a} & ah_{1a}W_{1b} + ah_{1b}W_{2b} & ah_{1a}W_{1c} + ah_{1b}W_{2c} \\ ah_{2a}W_{1a} + ah_{2b}W_{2a} & ah_{2a}W_{1b} + ah_{2b}W_{2b} & ah_{2a}W_{1c} + ah_{2b}W_{2c} \end{bmatrix}$$
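A minimal numpy sketch of this simple propagation rule f(H, A) = σ(AHW), assuming a toy random adjacency matrix and ReLU as the non-linearity:

```python
# Simple (un-normalized) propagation rule: aggregate neighbour features with A,
# project with the shared parameter matrix W, then apply a ReLU.
import numpy as np

V, d, l = 4, 3, 2
A = (np.random.rand(V, V) < 0.3).astype(float)  # toy adjacency matrix (not normalized)
H = np.random.randn(V, d)                       # node features
W = np.random.randn(d, l)                       # parameters, independent of |V|
H_next = np.maximum(A @ H @ W, 0)               # f(H, A) = ReLU(A H W)
```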
2 issues of this simple example
• Issue 1:
• for every node, f sums up all the feature vectors of all neighboring nodes but not the node itself.
• Fix: simply add the identity matrix to A
• Issue 2:
• A is typically not normalized and therefore the multiplication with A will completely change the scale of the feature vectors.
• Fix: Normalizing A such that all rows sum to one:
26
Altogether: [Kipf and Welling, 2016]
27
• The two patches mentioned before, plus
• a better (symmetric) normalization of the adjacency matrix.
O(|E|) time complexity overall.
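Putting the two patches and the symmetric normalization together, a dense numpy sketch of the resulting propagation rule could look as follows (the real implementation uses sparse matrices to keep the O(|E|) complexity):

```python
# Sketch of the GCN rule of [Kipf and Welling, 2016]:
# H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W ), written densely for clarity.
import numpy as np

def gcn_layer(A, H, W):
    A_hat = A + np.eye(A.shape[0])               # fix 1: add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # fix 2: symmetric normalization
    return np.maximum(A_norm @ H @ W, 0)         # non-linearity (ReLU)
```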
Graph Convolutional Networks
28
• Kipf et al.’s Graph Convolutional Networks (GCNs) are a slight variation on the neighborhood aggregation idea:
Graph Convolutional Networks
29
Basic neighborhood aggregation vs. GCN neighborhood aggregation:
• GCN uses the same transformation matrix for self and neighbor embeddings, instead of separate matrices.
• Instead of a simple average, GCN uses a per-neighbor normalization that varies across neighbors.
Graph Convolutional Networks
30
• Empirically, they found this configuration to give the best results.
• More parameter sharing.
• Down-weights high-degree neighbors.
The Math
• Basic approach: average neighbor messages and apply a neural network.
$$h_v^{(0)} = x_v, \qquad h_v^{(k)} = \sigma\Big( W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k\, h_v^{(k-1)} \Big),\ k \geq 1$$
where the initial "layer-0" embeddings are the node features, h_v^{(k)} is the k-th layer embedding of v, σ is a non-linearity (e.g., ReLU or tanh), the sum averages the neighbors' previous-layer embeddings, and h_v^{(k-1)} is the previous-layer embedding of v.
Training the Model
32
• After K layers of neighborhood aggregation, we get output embeddings for each node.
• We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters (the trainable matrices, i.e., what we learn).
More complex models [Nowak et al., 2017]
33
Inductive Capability
34
Inductive node embeddings generalize to entirely unseen graphs:
e.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.
Train on one graph, generalize to a new graph.
Inductive Capability
35
Train on a snapshot; when a new node arrives, generate an embedding for it.
Many application settings constantly encounter previously unseen nodes, e.g., Reddit, YouTube, Google Scholar.
We need to generate new embeddings "on the fly".
From matrix computations to message passing algorithms
36
• AGGREGATE: this function gathers information from the neighbors; it must be permutation invariant.
• COMBINE: this function fuses the aggregated neighbor information into the node representation.
• READOUT: this function aggregates the node features of the entire graph G for the final task (e.g., graph classification); a node-centric sketch follows below.
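A node-centric sketch of one such message-passing layer, assuming an adjacency-list representation; the mean AGGREGATE, linear COMBINE and sum READOUT are only illustrative choices:

```python
# One message-passing layer: AGGREGATE (permutation-invariant mean over neighbours),
# COMBINE (linear projection of self + aggregated message, then ReLU), and a
# graph-level READOUT (sum over all node embeddings).
import numpy as np

def mp_layer(neighbors, H, W_self, W_neigh):
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        if neighbors[v]:
            agg = np.mean(H[neighbors[v]], axis=0)               # AGGREGATE
        else:
            agg = np.zeros(H.shape[1])
        H_new[v] = np.maximum(H[v] @ W_self + agg @ W_neigh, 0)  # COMBINE
    return H_new

def readout(H):
    return H.sum(axis=0)   # READOUT: one vector for the whole graph G
```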
GraphSAGE Idea
46
• So far we have aggregated the neighbor messages by taking their (weighted) average. Can we do better?
GraphSAGE Idea
47
Any differentiable function that maps a set of vectors to a single vector can be used.
• Simple neighborhood aggregation:
• GraphSAGE:
GraphSAGE Differences
48
generalized aggregation
concatenate self embedding and neighbor embedding
GraphSAGE Variants
49
• Mean: element-wise mean of the neighbor vectors.
• Pool: transform the neighbor vectors and apply a symmetric vector function (element-wise mean/max).
• LSTM: apply an LSTM to a random permutation of the neighbors.
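A rough numpy sketch of a GraphSAGE layer with the mean aggregator, where the node's own embedding is concatenated with the aggregated neighbour embedding before the projection (shapes and names are illustrative):

```python
# GraphSAGE-style layer, mean aggregator: h_v' = ReLU( W [h_v || mean(h_u, u in N(v))] ).
import numpy as np

def sage_layer(neighbors, H, W):                 # W has shape (2*d, d_out)
    out = []
    for v in range(H.shape[0]):
        h_neigh = H[neighbors[v]].mean(axis=0) if neighbors[v] else np.zeros(H.shape[1])
        h_cat = np.concatenate([H[v], h_neigh])  # concatenate self and neighbour embedding
        out.append(np.maximum(h_cat @ W, 0))     # generalized aggregation + ReLU
    return np.stack(out)
```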
Graph Attention Network (GAT)
• The contributions of neighboring nodes to the central node are not identical, unlike in GraphSAGE, GCN, etc.
• Learn the relative weights between two connected nodes.
• Can take edge features into account.
50
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. of ICLR, 2018.
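A rough single-head sketch of the GAT attention coefficients for one node, following α_ij = softmax_j(LeakyReLU(aᵀ[Wh_i || Wh_j])); the helper name and shapes are illustrative, not the authors' code:

```python
# GAT-style attention for node i over its neighbourhood (single head, numpy).
import numpy as np

def gat_node(i, neighbors_i, H, W, a):                       # a has shape (2*d_out,)
    Wh = H @ W                                               # project all node features
    scores = []
    for j in neighbors_i:
        e_ij = np.concatenate([Wh[i], Wh[j]]) @ a            # attention logit e_ij
        scores.append(e_ij if e_ij > 0 else 0.2 * e_ij)      # LeakyReLU (slope 0.2)
    scores = np.array(scores)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                              # softmax over the neighbourhood
    return alpha @ Wh[neighbors_i]                           # attention-weighted sum of neighbours
```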
GNN as an encoder
51
Edge features in GNN
• Limitations: GAT can handle only binary edge indicators, and GCNs can handle only one-dimensional edge features.
• Exploiting Edge Features for Graph Neural Networks
• First, we propose to use doubly stochastic normalization of graph edge features instead of the commonly used row or symmetric normalization approaches used in current graph neural networks.
• Second, we construct new formulas for the operations in each individual layer so that they can handle multi-dimensional edge features.
• Third, in the proposed new framework, edge features are adaptive across network layers.
52
Edge features in GNN?
In existing attention mechanisms, the attention coefficient depends on the two points Xi· and Xj· only. Here we let the attention operation be guided by edge features of the edge connecting the two nodes, so α depends on edge features as well.
The attention mechanism is marginal: edge feature components are considered independently, not as vectors.
53
54
55
Problems of graph neural networks
• Long range dependencies / Over-squashing
• Weisfeiler-Lehman test for graph isomorphism
• Not that generic.
56
Problems of graph neural networks : Long-range dependencies / Over-squashing
57
Weisfeiler-Lehman test for graph isomorphism
P is a permutation matrix.
58
Weisfeiler-Lehman test for graph isomorphism
Possibly isomorphic
59
Relation to WEISFEILER-LEHMAN algorithm
60
Relation to WEISFEILER-LEHMAN algorithm
• A neural network model for graph-structured data should ideally be able to learn representations of nodes in a graph, taking both the graph structure and feature description of nodes into account.
• A well-studied framework for the unique assignment of node labels given a graph is provided by the 1-dim Weisfeiler-Lehman (WL-1) algorithm (Weisfeiler & Lehmann, 1968)
Weisfeiler-Lehman test of isomorphism
61
Relation to WEISFEILER-LEHMAN algorithm
62
Relation to WEISFEILER-LEHMAN algorithm
• The choice of the aggregation and the update operations is crucial:
• only multiset injective functions make it equivalent to the WL algorithm.
• Some popular choices for aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures
63
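A compact sketch of one refinement round of the 1-dimensional WL test: each node label is replaced by a (hashed) pair of its own label and the sorted multiset of its neighbours' labels; this injective multiset update is exactly what mean or max aggregators fail to imitate:

```python
# One iteration of 1-WL colour refinement on a graph given as adjacency lists.
def wl_iteration(neighbors, labels):
    new_labels = {}
    for v, lab in labels.items():
        multiset = tuple(sorted(labels[u] for u in neighbors[v]))  # neighbour label multiset
        new_labels[v] = hash((lab, multiset))   # injective up to hash collisions
    return new_labels
```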
Transformers
67
Transformers for WHAT ?
• Sequence to Sequence (most popular)
• Computer vision (ViT, DETR): image classification, object detection
68
Transformer : The origin
https://arxiv.org/pdf/1706.03762.pdf
69
Sequence to Sequence
• NOT TRANSFORMERS
• RNN, LSTM
• One word at a time
• Not well parallelized
• SLOW
70
Sequence to Sequence
• Transformers
• Many words at the same time (up to 1000 words so far)
• Well Parallelized
• Fast
71
Input embedding : The world of vectors
72
Transformers
• Are stackable
• Get rich representation
• By composition
73
The Transformer discovers the NLP pipeline
74
Encoder part
75
Encoder
76
Self Attention
77
Self Attention
78
Self attention : query, key and value
The key/value/query concepts come from retrieval systems. For example, when you type a query to search for a video on YouTube, the search engine maps your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then presents you the best-matched videos (values).
79
Self attention : query, key and value
80
Self attention : query by key = score
81
Self attention : softmax of the score = Attention weights
82
Attention weights by the value = output
83
Encoder in formula: Attention
84
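A minimal numpy sketch of the scaled dot-product self-attention formula, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, for a single head (matrix names are the usual ones, shapes illustrative):

```python
# Single-head self-attention: queries, keys and values are linear projections of X.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # query, key, value
    scores = Q @ K.T / np.sqrt(K.shape[1])           # query-by-key scores, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax -> attention weights
    return weights @ V                                # attention weights times the values
```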
Attention weights by the value = output
Focus for one word
85
Multi head attention : concat + linear
86
Then the dense layers appear : Fully connected (MLP)
87
Encoder
88
Encoder in vectors
90
Word order matters : Positional encoding
91
Positional Encoding
92
Positional Encoding
93
Positional Encoding
94
Word order matters : Positional encoding
95
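A small sketch of the sinusoidal positional encoding from the original Transformer paper, PE(pos, 2i) = sin(pos/10000^{2i/d}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d}) (assumes an even model dimension):

```python
# Sinusoidal positional encoding, added to the input word embeddings.
import numpy as np

def positional_encoding(seq_len, d_model):        # d_model assumed even
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dimensions
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions
    return pe
```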
Transformer : Why does it work ?
• Attention mechanism
• Very generic ?
• Is there any "a priori knowledge"?
101
Transformer problem
• Self-attention complexity is quadratic in the input length.
102
Graph Transformers
103
Graphormer
104
Graphormer
• Our key insight for applying the Transformer to graphs is the necessity of effectively encoding the structural information of the graph into the model.
• We propose several simple yet effective structural encoding methods
• We mathematically characterize the expressive power of Graphormer
• The key is to properly incorporate structural information of graphs into the model
105
• Note that for each node i, self-attention only computes the semantic similarity between i and the other nodes, without considering the structural information of the graph reflected on the nodes and the relation between node pairs.
• Graphormer incorporates several effective structural encoding methods to leverage such information
106
Different nodes may have different importance
• First, we propose a Centrality Encoding in Graphormer to capture the node importance in the graph.
• we leverage the degree centrality for the centrality encoding, where a learnable vector is assigned to each node according to its degree and added to the node features in the input layer.
107
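A minimal PyTorch sketch of such a degree-based centrality encoding: a learnable embedding indexed by the node degree is added to the input node features (class and parameter names are illustrative, not Graphormer's actual code):

```python
# Degree-based centrality encoding: one learnable vector per degree value.
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    def __init__(self, max_degree, d_model):
        super().__init__()
        self.deg_emb = nn.Embedding(max_degree + 1, d_model)

    def forward(self, x, degree):                 # x: (V, d_model), degree: (V,) long tensor
        degree = degree.clamp(max=self.deg_emb.num_embeddings - 1)
        return x + self.deg_emb(degree)           # add the centrality vector to the node features
```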
Relation between nodes
• Second, we propose a novel Spatial Encoding in Graphormer to capture the structural relation between nodes.
• Shortest-path distance (e.g., via Dijkstra): as a general-purpose choice, we use the shortest-path distance between any two nodes, which is encoded as a bias term in the softmax attention and helps the model accurately capture the spatial dependency in a graph.
• Edge features: spatial information contained in the edge features is also used, as an average of dot products between the edge features and learnable embeddings along the shortest path.
108
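A minimal PyTorch sketch of the shortest-path-distance bias: the distance between nodes i and j indexes a learnable scalar that is added to the attention logits before the softmax (single head, illustrative names, not Graphormer's actual code):

```python
# Spatial encoding as a learnable bias on the attention logits.
import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    def __init__(self, max_dist):
        super().__init__()
        self.bias = nn.Embedding(max_dist + 1, 1)      # one learnable bias per distance value

    def forward(self, scores, spd):                    # scores: (V, V) logits, spd: (V, V) long
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return scores + self.bias(spd).squeeze(-1)     # added before the softmax
```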
109
110
Positional Encoding = Spatial Encoding
112
Positional encoding via edge features
113
Another paper by Xavier Bresson
114
Graph Transformer Layer with edge features
122
GNN and transformers are encoders
127
Applications and losses
• Let us recall that Z is the output of the GNN or Graph Transformer.
• Graph Transformers also produce attention maps.
128
Unsupervised learning
• The graph factorization problem is the problem of predicting if two nodes are linked or not.
• where a similarity measure between two node embeddings is used
• Variation of graph factorization
• DeepWalk [Perozzi et al., 2014]
• node2vec [Grover and Leskovec, 2016]
129
Unsupervised embedding: Graph factorization
• Use stochastic gradient descent (SGD) as a general optimization method.
The graph factorization problem is the problem of predicting if two nodes are linked or not.
130
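A minimal sketch of such an unsupervised link-prediction objective: the dot product between two node embeddings (through a sigmoid) should predict whether the edge exists, optimised by SGD; the dense formulation below is for clarity only:

```python
# Graph factorization / link prediction loss on the node embeddings Z.
import torch
import torch.nn.functional as F

def factorization_loss(Z, A):          # Z: (V, d) embeddings, A: (V, V) float 0/1 adjacency
    logits = Z @ Z.t()                 # dot-product similarity between node embeddings
    return F.binary_cross_entropy_with_logits(logits, A)
```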
Supervised learning
• Node classification : social network
131
Supervised learning
• Node classification :
132
Node-wise classification
133
Supervised learning : Node classification
• Last layer: the activation function is a softmax.
• The loss :
134
Gibbs distribution
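A minimal PyTorch sketch of this node-wise classification loss; the names Z, W_out and labelled_mask are illustrative, and cross_entropy applies the softmax internally:

```python
# Node classification: class scores from the node embeddings, cross-entropy on labelled nodes.
import torch
import torch.nn.functional as F

def node_classification_loss(Z, W_out, labels, labelled_mask):
    logits = Z @ W_out                                   # last layer: one score per class
    return F.cross_entropy(logits[labelled_mask], labels[labelled_mask])
```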
Supervised learning:
• Graph classification
• A global average pooling layer must be added to gather all the node embeddings of a given graph.
• Z' can be fed to an MLP for classification
• The loss is the cross entropy
135
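A minimal sketch of the graph-level pipeline just described (global average pooling, then an MLP and cross-entropy); the mlp argument stands for any small classifier module, names are illustrative:

```python
# Graph classification: pool node embeddings into one vector, classify with an MLP.
import torch
import torch.nn.functional as F

def graph_classification_loss(Z, mlp, label):        # Z: (V, d), label: scalar long tensor
    z_graph = Z.mean(dim=0, keepdim=True)            # global average pooling -> Z'
    logits = mlp(z_graph)                            # MLP classifier, shape (1, num_classes)
    return F.cross_entropy(logits, label.view(1))
```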
Semi-Supervised learning:
• Node classification:
• the problem of classifying nodes in a graph, where labels are only available for a small subset of nodes.
• Label information is smoothed over the graph via some form of explicit graph-based regularization.
• The assumption is that connected nodes in the graph are likely to share the same label (class).
• This is true for instance for a neighborhood graph.
136
So, $L_{reg} = \lVert Z Z^{T} - A \rVert_{F}^{2}$ (Frobenius norm).
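A minimal sketch of the full semi-supervised objective, combining the supervised cross-entropy on the labelled subset with the regulariser above (the weight lam is an illustrative hyper-parameter):

```python
# Semi-supervised objective: L = L_sup + lambda * || Z Z^T - A ||_F^2.
import torch
import torch.nn.functional as F

def semi_supervised_loss(Z, logits, labels, labelled_mask, A, lam=1e-3):
    l_sup = F.cross_entropy(logits[labelled_mask], labels[labelled_mask])
    l_reg = torch.norm(Z @ Z.t() - A, p='fro') ** 2
    return l_sup + lam * l_reg
```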
Some applications
• Node classification on citation networks.
• The input is a citation network where nodes are papers, edges are citation links, and optionally there are bag-of-words features on the nodes. The target for each node is a paper category (e.g. stat.ML, cs.LG, ...).
137
Image/molecule classification
• Graph classification: An image is represented as a graph: based on raw pixels (a regular grid and all images have the same graph) or based on superpixels (irregular graph)
139
Graph classification
Taken from M. Bronstein, CVPR Tutorial 2017.
140
Molecular graph examples:
1. Cancerous or non-cancerous molecules
2. Determination of the boiling point
Summary on GNN and Graph Transformers
141
Summary on GNN and Graph Transformers
• Node attention mechanism
• GAT fashion
• Or Transformer fashion (K, Q, V)
• Nodes are embedded in K, Q, V
142
Summary on GNN and Graph Transformers
GNN advantages : Fast.
GNN problems :
- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures
- Fails to capture long-range relations in the graph (over-squashing).
143
Summary on GNN and Graph Transformers
Attentional GNN and Graph Transformers advantages : - Better feature extractors.
Attentional and Graph Transformers problems :
- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures.
- Quadratic time to compute the attention matrix.
144
Summary on GNN and Graph Transformers
Edge features consideration
• Things have been done but the question is still open.
• Edge features in the attention matrix only
• Edge embedding
145
Summary on GNN and Graph Transformers
• Wish list :
• Linear (Fast) feature extraction
• Node and edge embeddings
• Positional encoding for graphs
• Topology/Structure encoding
• Graph Laplacian
• Dijkstra (number of hops)
• Dijkstra (Edge features)
• Distinguish between very simple graph structures
• From pairwise feature embedding to higher-degree (subgraph) embedding
• Phi is a function of x_i, x_j, x_k, …
146
Recommended reading
• Tutorials:
• Geometric Deep Learning, Tutorial, CVPR, 2017. http://geometricdeeplearning.com/
• Deep Learning on Graphs with Graph Convolutional Networks. http://deeploria.gforge.inria.fr/thomasTalk.pdf
• Graph-based Methods in Pattern Recognition & Document Image Analysis. http://gmprdia.univ-lr.fr/
• List of papers:
• Gilmer et al., Neural Message Passing for Quantum Chemistry, 2017. https://arxiv.org/abs/1704.01212
• Kipf et al., Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017. https://arxiv.org/abs/1609.02907
• Defferrard et al., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016. https://arxiv.org/abs/1606.09375
• Bruna et al., Spectral Networks and Locally Connected Networks on Graphs, ICLR 2014. https://arxiv.org/abs/1312.6203
• Duvenaud et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, NIPS 2015. https://arxiv.org/abs/1509.09292
• Li et al., Gated Graph Sequence Neural Networks, ICLR 2016. https://arxiv.org/abs/1511.05493
• Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics, NIPS 2016. https://arxiv.org/abs/1612.00222
• Kearnes et al., Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016. https://arxiv.org/abs/1603.00856
147