
(1)

Which AI for structured data? (Quelle IA pour les données structurées ?)

Graph Neural Networks

Romain Raveaux

romain.raveaux@univ-tours.fr, Associate Professor (Maître de conférences)

Université de Tours

Computer Science Laboratory (LIFAT), RFAI team

1

(2)

Outline

1. Dense Deep Learning: MLP/FC
2. Graph Neural Networks
3. Problems of Graph Neural Networks
4. Transformers
5. Transformers' problems
6. Graph Transformers
7. Problems of Graph Transformers
8. The losses
9. Summary and perspectives

2

(3)

Dense Deep Learning : MLP/FC

3

(4)

A Linear model

𝐻 ∈ ℝ𝑚×1 : input

W∈ ℝ

B∈ ℝ

𝑌 ∈ ℝ 𝑚×1 : output (prediction)

𝑌 = 𝐻𝑊 + 𝐵

4

$$H = \begin{bmatrix} h_1 \\ h_2 \\ h_3 \end{bmatrix},\qquad W = \begin{bmatrix} w_1 \end{bmatrix},\qquad HW = \begin{bmatrix} h_1 w_1 \\ h_2 w_1 \\ h_3 w_1 \end{bmatrix}$$

(5)

A Linear model : for multi-dimensional input

𝐻 ∈ ℝ𝑚×𝑑 : input

W∈ ℝ𝑑×1

B∈ ℝ

𝑌 ∈ ℝ 𝑚×1 : output (prediction)

𝑌 = 𝐻𝑊 + 𝐵

5

$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\qquad W = \begin{bmatrix} w_a \\ w_b \end{bmatrix},\qquad HW = \begin{bmatrix} h_{1a}w_a + h_{1b}w_b \\ h_{2a}w_a + h_{2b}w_b \\ h_{3a}w_a + h_{3b}w_b \end{bmatrix}$$

(6)

A Linear model : for multi-dimensional input and output

H ∈ ℝ𝑚×𝑑 : input

W∈ ℝ𝑑×𝑛

B∈ ℝ𝑛

𝑌 ∈ ℝ 𝑚×𝑛 : output (prediction)

𝑌 = 𝐻𝑊 + 𝐵

6

$$H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \\ h_{3a} & h_{3b} \end{bmatrix},\qquad W = \begin{bmatrix} w_{1a} & w_{1b} \\ w_{2a} & w_{2b} \end{bmatrix}$$

$$HW = \begin{bmatrix} h_{1a}w_{1a} + h_{1b}w_{2a} & h_{1a}w_{1b} + h_{1b}w_{2b} \\ h_{2a}w_{1a} + h_{2b}w_{2a} & h_{2a}w_{1b} + h_{2b}w_{2b} \\ h_{3a}w_{1a} + h_{3b}w_{2a} & h_{3a}w_{1b} + h_{3b}w_{2b} \end{bmatrix}$$
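To make the shapes concrete, here is a minimal sketch of Y = HW + B; PyTorch and the values of m, d, n are assumptions for illustration, not from the slides:

```python
import torch

m, d, n = 3, 2, 4          # 3 samples, 2 input features, 4 outputs (arbitrary)
H = torch.randn(m, d)      # input H in R^{m x d}
W = torch.randn(d, n)      # weights W in R^{d x n}
B = torch.randn(n)         # bias B in R^{n}, broadcast over the m rows

Y = H @ W + B              # prediction Y in R^{m x n}
print(Y.shape)             # torch.Size([3, 4])
```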

(7)

A non-linear model : for multi-dimensional input/output

H ∈ ℝ𝑚×𝑑 : input

W∈ ℝ𝑑×𝑛

B∈ ℝ𝑛

R(a), σ(a): a non-linear function (e.g., ReLU, sigmoid)

𝑌 ∈ ℝ 𝑚×𝑛 : output (prediction)

𝑌 = 𝜎(𝐻𝑊 + 𝐵)

7

(8)

Can we stack the model ? Endomorphism property

H ∈ ℝ^{m×d}: input

W^{(1)} ∈ ℝ^{d×n}, B^{(1)} ∈ ℝ^{n}

Y^{(1)} ∈ ℝ^{m×n}: intermediate result

W^{(2)} ∈ ℝ^{n×o}, B^{(2)} ∈ ℝ^{o}

Y^{(2)} ∈ ℝ^{m×o}: output (prediction)

$$Y^{(1)} = \sigma(H W^{(1)} + B^{(1)})$$

$$Y^{(2)} = \sigma(Y^{(1)} W^{(2)} + B^{(2)}) = \sigma(\sigma(H W^{(1)} + B^{(1)})\, W^{(2)} + B^{(2)})$$

$$Y^{(l)} = f(Y^{(l-1)}, W^{(l)}, B^{(l)}) = f(Y^{(l-1)}, \theta^{(l)})$$

$$Y^{(l)} = f(f(X, \theta^{(l-1)}), \theta^{(l)}): \text{composition of functions}$$

$$F(X, \Theta) = f_{\theta^{(l)}} \circ f_{\theta^{(l-1)}}: \text{composition of functions}$$

Multi-Layer Perceptron (MLP): a sequence of linear operations and non-linear operations.

Why these non-linear operations? Without them, a stack of linear layers collapses into a single linear map.

8
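A minimal two-layer MLP sketch mirroring Y^{(2)} = σ(σ(H W^{(1)} + B^{(1)}) W^{(2)} + B^{(2)}); PyTorch and the layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

d, n, o = 8, 16, 4                     # illustrative layer widths

mlp = nn.Sequential(
    nn.Linear(d, n),   # W(1) in R^{d x n}, B(1) in R^{n}
    nn.ReLU(),         # non-linearity sigma: without it the two layers collapse into one linear map
    nn.Linear(n, o),   # W(2) in R^{n x o}, B(2) in R^{o}
    nn.ReLU(),
)

H = torch.randn(32, d)                 # a batch of m = 32 inputs
Y2 = mlp(H)                            # Y(2) in R^{32 x o}
```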

(9)

The basics of artificial neural networks

9

• MLP at a glance:

• How to generalize MLP to Graphs ?

(10)

Graph Neural Networks

10

(11)

Graph Neural Networks : Principles

• Input: A graph

• Output: Node embeddings

• Assumptions: unordered data, stationarity and compositionality

• The goal:

Graph Neural Networks (GNN) perform an end-to-end learning including feature extraction and classification.

11

(12)

Graph Neural Networks as an encoder

Output : a node embedding

12

(13)

Graph Neural Networks as an encoder

13

(14)

Decoder:

16

(15)

The basics of graph neural networks

17

(16)

Adjacency matrix

18

(17)

Degree matrix

19

(18)

Exponentiation of the adjacency matrix

20

(19)

Key idea and Intuition [Kipf and Welling, 2016]

• The key idea is to generate node embeddings based on local neighborhoods.

• The intuition is to aggregate node information from their neighbors using neural networks.

• Nodes have embeddings at each layer, and the neural network can have arbitrary depth. The "layer-0" embedding of node u is its input feature vector Fu.

21

(20)

Topology and Features

A∈ ℝ|𝑉|×|𝑉| : Adjacency matrix (topology/structure)

H∈ ℝ|𝑉|×𝑑 : d dimensional features vector for each node

𝐴𝐻 ∈ ℝ|𝑉|×𝑑= Feature aggregation

23

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},\qquad H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \end{bmatrix}$$

$$AH = \begin{bmatrix} a_{11}h_{1a} + a_{12}h_{2a} & a_{11}h_{1b} + a_{12}h_{2b} \\ a_{21}h_{1a} + a_{22}h_{2a} & a_{21}h_{1b} + a_{22}h_{2b} \end{bmatrix}$$
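A toy sketch of the feature aggregation A·H; PyTorch, and the 2-node graph with its feature values are made up for illustration:

```python
import torch

# Toy graph with |V| = 2 nodes and d = 2 features per node (values are arbitrary)
A = torch.tensor([[0., 1.],
                  [1., 0.]])           # adjacency matrix (topology)
H = torch.tensor([[1., 2.],
                  [3., 4.]])           # node features, one row per node

AH = A @ H                             # row i = sum of the feature vectors of i's neighbours
print(AH)                              # tensor([[3., 4.], [1., 2.]])
```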

(21)

Simple example of f

24

(22)

Topology and Features

𝐴𝐻 ∈ ℝ|𝑉|×𝑑= Feature aggregation

W∈ ℝ𝑑×𝑙 : Parameters matrix that does not depend on |V|

25

$$AH = \begin{bmatrix} ah_{1a} & ah_{1b} \\ ah_{2a} & ah_{2b} \end{bmatrix},\qquad W = \begin{bmatrix} W_{1a} & W_{1b} & W_{1c} \\ W_{2a} & W_{2b} & W_{2c} \end{bmatrix}$$

$$AHW = \begin{bmatrix} ah_{1a}W_{1a} + ah_{1b}W_{2a} & ah_{1a}W_{1b} + ah_{1b}W_{2b} & ah_{1a}W_{1c} + ah_{1b}W_{2c} \\ ah_{2a}W_{1a} + ah_{2b}W_{2a} & ah_{2a}W_{1b} + ah_{2b}W_{2b} & ah_{2a}W_{1c} + ah_{2b}W_{2c} \end{bmatrix}$$

(23)

2 issues of this simple example

• Issue 1:

for every node, f sums up all the feature vectors of all neighboring nodes but not the node itself.

Fix: simply add the identity matrix to A

• Issue 2:

A is typically not normalized and therefore the multiplication with A will completely change the scale of the feature vectors.

Fix: normalize A such that all rows sum to one (i.e., use D⁻¹A, where D is the diagonal degree matrix).

26

(24)

Altogether: [Kipf and Welling, 2016]

27

• The two fixes mentioned before, plus

• A better (symmetric) normalization of the adjacency matrix: D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I.

O(|E|) time complexity overall.
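A minimal dense-matrix sketch of one propagation step with both fixes and the symmetric normalization; `gcn_layer` is a hypothetical helper, and a real O(|E|) implementation would use sparse matrices:

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step: sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + torch.eye(A.shape[0])              # fix 1: add self-loops
    deg = A_hat.sum(dim=1)                         # degrees of A + I
    D_inv_sqrt = torch.diag(deg.pow(-0.5))         # fix 2: (symmetric) normalization
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return torch.relu(A_norm @ H @ W)

A = torch.tensor([[0., 1.], [1., 0.]])
H = torch.randn(2, 4)
W = torch.randn(4, 8)
print(gcn_layer(A, H, W).shape)                    # torch.Size([2, 8])
```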

(25)

Graph Convolutional Networks

28

Kipf et al.’s Graph Convolutional Networks (GCNs) are a slight variation on the neighborhood aggregation idea:

(26)

Graph Convolutional Networks

29

Basic Neighborhood Aggregation vs. GCN Neighborhood Aggregation: the GCN uses the same matrix for self and neighbor embeddings, with per-neighbor normalization.

(27)

Graph Convolutional Networks

30

Instead of a simple average, the normalization varies across neighbors, and the same transformation matrix is used for self and neighbor embeddings.

• Empirically, they found this configuration to give the best results.

More parameter sharing.

Down-weights high degree neighbors.

(28)

The Math

Basic approach: average neighbor messages and apply a neural network.

$$h_v^{(0)} = x_v \quad \text{(initial "layer-0" embeddings are the node features)}$$

$$h_v^{(k)} = \sigma\Big(W_k \sum_{u \in N(v)} \frac{h_u^{(k-1)}}{|N(v)|} + B_k\, h_v^{(k-1)}\Big), \quad k \ge 1$$

where h_v^{(k)} is the k-th layer embedding of v, σ is a non-linearity (e.g., ReLU or tanh), the sum is the average of the neighbors' previous-layer embeddings, and h_v^{(k-1)} is the previous-layer embedding of v itself.

(29)

Training the Model

32

• After K-layers of neighborhood aggregation, we get output embeddings for each node.

We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters (the trainable matrices W_k and B_k, i.e., what we learn).

(30)

More complex models [Nowak et al., 2017]

33

(31)

Inductive Capability

34

Inductive node embeddings generalize to entirely unseen graphs,

e.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.

Train on one graph, generalize to a new graph.

(32)

Inductive Capability

35

Train with a snapshot → a new node arrives → generate an embedding for the new node.

Many application settings constantly encounter previously unseen nodes.

e.g., Reddit, YouTube, GoogleScholar, ….

Need to generate new embeddings “on the fly”

(33)

From matrix computations to message passing algorithms

36

• AGGREGATE: gathers the information from the neighbors; it must be permutation invariant.

• COMBINE: fuses the aggregated neighbor information into the node's own representation.

• READOUT: aggregates the node features of the entire graph G for the final task (e.g., graph classification).

(34)

GraphSAGE Idea

46

• So far we have aggregated the neighbor messages by taking their (weighted) average; can we do better?

(35)

GraphSAGE Idea

47

Any differentiable function that maps a set of vectors to a single vector.

(36)

GraphSAGE Differences

48

• Simple neighborhood aggregation: average of the neighbor embeddings (as before).

• GraphSAGE: generalized aggregation, and concatenation of the self embedding with the aggregated neighbor embedding.

(37)

GraphSAGE Variants

49

• Mean: take the element-wise mean of the neighbor embeddings.

• Pool: transform the neighbor vectors and apply a symmetric vector function (element-wise mean/max).

• LSTM: apply an LSTM to a random permutation of the neighbors.
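A sketch of the mean-aggregator variant with self/neighbor concatenation; PyTorch, and `SAGELayer` is an illustrative name, not the reference implementation:

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """GraphSAGE-style update (mean aggregator): h_v = sigma(W [h_v || mean_{u in N(v)} h_u])."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(2 * d_in, d_out)

    def forward(self, H, neighbors):
        agg = torch.stack([H[list(n)].mean(dim=0) for n in neighbors])  # mean of neighbour embeddings
        return torch.relu(self.lin(torch.cat([H, agg], dim=1)))         # concat self + aggregated

H = torch.randn(3, 4)
neighbors = [[1], [0, 2], [1]]
print(SAGELayer(4, 8)(H, neighbors).shape)   # torch.Size([3, 8])
```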

(38)

Graph Attention Network (GAT)

• The contributions of neighboring nodes to the central node are not identical, unlike in GraphSAGE, GCN, …

• Learn the relative weights between two connected nodes.

Taking edge features into account.

50

P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. of ICLR, 2018.
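A sketch of the GAT attention coefficients α_ij = softmax_j(LeakyReLU(aᵀ[Wh_i ‖ Wh_j])) described by Velickovic et al.; the dense pairwise computation, single head and toy sizes are simplifications for readability:

```python
import torch
import torch.nn.functional as F

def gat_scores(H, W, a, adj):
    """Attention coefficients alpha_ij = softmax_j(LeakyReLU(a^T [W h_i || W h_j]))."""
    Hp = H @ W                                            # project node features
    n = Hp.shape[0]
    pairs = torch.cat([Hp.unsqueeze(1).expand(n, n, -1),
                       Hp.unsqueeze(0).expand(n, n, -1)], dim=-1)
    e = F.leaky_relu(pairs @ a, negative_slope=0.2)       # raw scores e_ij
    e = e.masked_fill(adj == 0, float("-inf"))            # only attend over graph neighbours
    return torch.softmax(e, dim=1)                        # normalise over each neighbourhood

H, W, a = torch.randn(3, 4), torch.randn(4, 8), torch.randn(16)
adj = torch.tensor([[1, 1, 0], [1, 1, 1], [0, 1, 1]])     # adjacency with self-loops
alpha = gat_scores(H, W, a, adj)                          # each row sums to 1
```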

(39)

GNN as an encoder

51

(40)

Edge features in GNN

This addresses the limitation of GAT, which can handle only binary edge indicators, and of GCNs, which can handle only one-dimensional edge features.

Exploiting Edge Features for Graph Neural Networks

First, we propose to use doubly stochastic normalization of graph edge features instead of the commonly used row or symmetric normalization approaches used in current graph neural networks.

Second, we construct new formulas for the operations in each individual layer so that they can handle multi-dimensional edge features.

Third, for the proposed new framework, edge features are adaptive across network layers.

52

(41)

Edge features in GNN ????

In existing attention mechanisms, the attention coefficient depends on the two points Xi· and Xj· only. Here we let the attention operation be guided by the edge features of the edge connecting the two nodes, so α depends on edge features as well.

The change to the attention mechanism is marginal: edge feature components are considered independently, not as vectors.

53

(42)

54

(43)

55

(44)

Problems of graph neural networks

• Long range dependencies / Over-squashing

• Weisfeiler-Lehman test for graph isomorphism

• Not that generic.

56

(45)

Problems of graph neural networks :Long range dependencies / Over-squashing

57

(46)

Weisfeiler-Lehman test for graph isomorphism

P is a permutation matrix.

58

(47)

Weisfeiler-Lehman test for graph isomorphism

Possibly isomorphic

59

(48)

Relation to WEISFEILER-LEHMAN algorithm

60

(49)

Relation to WEISFEILER-LEHMAN algorithm

• A neural network model for graph-structured data should ideally be able to learn representations of nodes in a graph, taking both the graph structure and feature description of nodes into account.

• A well-studied framework for the unique assignment of node labels given a graph is provided by the 1-dim Weisfeiler-Lehman (WL-1) algorithm (Weisfeiler & Lehmann, 1968)

Weisfeiler-Lehman test of isomorphism

61

(50)

Relation to WEISFEILER-LEHMAN algorithm

62

(51)

Relation to WEISFEILER-LEHMAN algorithm

• The choice of the aggregation and the update operations is crucial:

• only multiset injective functions make it equivalent to the WL algorithm.

• Some popular choices of aggregators used in the literature, such as maximum or mean, are actually strictly less powerful than WL and fail to distinguish between very simple graph structures (see the sketch below).

63
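A small sketch of one round of 1-dimensional WL colour refinement, with two graphs (a 6-cycle vs. two triangles) that it, and hence GNNs whose aggregation is at most as powerful, cannot tell apart; `wl_refine` is an illustrative helper:

```python
def wl_refine(labels, adj_list):
    """One round of 1-WL colour refinement: relabel each node by (its label, multiset of neighbour labels)."""
    signatures = [(labels[v], tuple(sorted(labels[u] for u in adj_list[v])))
                  for v in range(len(labels))]
    # Give each distinct signature a fresh compact label (an injective "hash" of the multiset).
    table = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
    return [table[sig] for sig in signatures]

# C6 (one 6-cycle) vs. 2 x C3 (two triangles): every node keeps the same colour at every round,
# so the refinement never separates the two graphs.
adj_c6 = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
adj_2c3 = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
print(wl_refine([0] * 6, adj_c6) == wl_refine([0] * 6, adj_2c3))   # True
```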

(52)

Transformers

67

(53)

Transformers for WHAT ?

• Sequence to Sequence (most popular)

• Computer vision (ViT, DETR): image classification, object detection

68

(54)

Transformer : The origin

"Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/pdf/1706.03762.pdf

69

(55)

Sequence to Sequence

• NOT TRANSFORMERS

RNN, LSTM

• One word at a time

Not well parallelized

SLOW

70

(56)

Sequence to Sequence

• Transformers

Many words at the same time (up to 1000 words so far)

Well Parallelized

Fast

71

(57)

Input embedding : The world of vectors

72

(58)

Transformers

• Are stackable

Get rich representation

By composition

73

(59)

The Transformer discovers the NLP pipeline

74

(60)

Encoder part

75

(61)

Encoder

76

(62)

Self Attention

77

(63)

Self Attention

78

(64)

Self attention : query, key and value

The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (values).

79

(65)

Self attention : query, key and value

80

(66)

Self attention : query by key = score

81

(67)

Self attention : soft max of the score = Attention weights

82

(68)

Attention weights by the value = output

83

(69)

Encoder in formula: Attention

84
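A minimal sketch of the scaled dot-product attention softmax(QKᵀ/√d_k)V from Vaswani et al.; PyTorch, single head, and the toy sizes are assumptions:

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # query-by-key similarity scores
    weights = torch.softmax(scores, dim=-1)             # attention weights, one row per query
    return weights @ V                                  # weighted sum of the values

X = torch.randn(5, 64)                                  # 5 tokens, d_model = 64 (illustrative)
Wq, Wk, Wv = (torch.randn(64, 64) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)                 # self-attention: Q, K, V from the same input
```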

(70)

Attention weights by the value = output Focus for one word

85

(71)

Multi head attention : concat + linear

86

(72)

Then the dense layer appear : Fully connected (MLP)

87

(73)

Encoder

88

(74)

Encoder in vectors

90

(75)

Word order matters: positional encoding

91

(76)

Positional Encoding

92

(77)

Positional Encoding

93

(78)

Positional Encoding

94

(79)

Word order matters: positional encoding

95
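A sketch of the sinusoidal positional encoding of Vaswani et al., PE(pos, 2i) = sin(pos/10000^{2i/d}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d}), added to the token embeddings; PyTorch, and an even d_model are assumptions:

```python
import torch

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                  # added to the token embeddings before the encoder

X = torch.randn(10, 64)                        # 10 tokens of dimension 64
X = X + positional_encoding(10, 64)
```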

(80)

Transformer : Why does it work ?

• Attention mechanism

• Very generic ?

Is there any « A priori knowledge » ?

101

(81)

Transformer problem

• Self-attention complexity is quadratic in the input length

102

(82)

Graph Transformers

103

(83)

Graphormer

104

(84)

Graphormer

• Our key insight to utilizing Transformer in the graph is the necessity of effectively encoding the structural information of a graph into the model.

• We propose several simple yet effective structural encoding methods

• We mathematically characterize the expressive power of Graphormer

• The key is to properly incorporate structural information of graphs into the model

105

(85)

• Note that for each node i, the self-attention only calculates the semantic similarity between i and other nodes, without considering the structural information of a graph reflected on the nodes and the relation between node pairs.

• Graphormer incorporates several effective structural encoding methods to leverage such information

106

(86)

Different nodes may have different importance

• First, we propose a Centrality Encoding in Graphormer to capture the node importance in the graph.

• we leverage the degree centrality for the centrality encoding, where a learnable vector is assigned to each node according to its degree and added to the node features in the input layer.

107
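A sketch of a degree-based centrality encoding in the spirit of Graphormer: a learnable vector indexed by node degree is added to the input node features. PyTorch; the class name and sizes are illustrative, and in-/out-degrees are not distinguished here:

```python
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Add a learnable vector, indexed by node degree, to each node feature (input layer)."""
    def __init__(self, max_degree, d):
        super().__init__()
        self.degree_emb = nn.Embedding(max_degree + 1, d)

    def forward(self, H, A):
        deg = A.sum(dim=1).long().clamp(max=self.degree_emb.num_embeddings - 1)
        return H + self.degree_emb(deg)        # degree centrality injected into the node features

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H = torch.randn(3, 16)
H = CentralityEncoding(max_degree=8, d=16)(H, A)
```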

(87)

Relation between nodes

• Second, we propose a novel Spatial Encoding in Graphormer to capture the structural relation between nodes.

Shortest paths (Dijkstra): for a general purpose, we use the shortest-path distance between any two nodes as a demonstration; it is encoded as a bias term in the softmax attention and helps the model accurately capture the spatial dependency in a graph.

Edge features: the spatial information contained in edge features is used via an average of dot-products of the edge features and learnable embeddings along the shortest path.

108
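A sketch of a Graphormer-style spatial encoding: a learnable scalar bias indexed by the shortest-path distance is added to the attention scores before the softmax. PyTorch; `SpatialBias` and the toy distances are illustrative, and the edge-feature term is omitted:

```python
import torch
import torch.nn as nn

class SpatialBias(nn.Module):
    """Learnable scalar bias b_{phi(i,j)}, indexed by shortest-path distance, added to attention scores."""
    def __init__(self, max_dist):
        super().__init__()
        self.bias = nn.Embedding(max_dist + 1, 1)

    def forward(self, scores, spd):
        # scores: n x n attention scores; spd: n x n shortest-path distances (e.g. from BFS / Dijkstra)
        return scores + self.bias(spd.clamp(max=self.bias.num_embeddings - 1)).squeeze(-1)

scores = torch.randn(4, 4)
spd = torch.tensor([[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]])
weights = torch.softmax(SpatialBias(max_dist=5)(scores, spd), dim=-1)
```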

(88)

109

(89)

110

(90)

Different nodes may have different importance

• First, we propose a Centrality Encoding in Graphormer to capture the node importance in the graph.

• we leverage the degree centrality for the centrality encoding, where a learnable vector is assigned to each node according to its degree and added to the node features in the input layer.

111

(91)

Positional Encoding = Spatial Encoding

112

(92)

Positional Encoding = by edge features

113

(93)

Another paper by Xavier Bresson

114

(94)

Graph Transformer Layer with edge features

122

(95)

GNN and transformers are encoders

127

(96)

Applications and losses

• Let us recall that Z is the output of the GNN or the Graph Transformer.

• Graph Transformers also produce attention maps.

128

(97)

Unsupervised learning

• The graph factorization problem is the problem of predicting if two nodes are linked or not.

• where a similarity measure between the two node embeddings is used

• Variation of graph factorization

DeepWalk [Perozzi et al., 2014]

node2vec [Grover and Leskovec, 2016]

129

(98)

Unsupervised embedding: Graph factorization

• Use stochastic gradient descent (SGD) as a general optimization method.

The graph factorization problem is the problem of predicting if two nodes are linked or not.

130
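A minimal sketch of such an unsupervised link-prediction objective trained with SGD: dot-product similarity of node embeddings against the adjacency matrix. PyTorch; the tiny graph and the binary cross-entropy choice are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Toy symmetric adjacency matrix of observed links and freely optimised node embeddings.
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
Z = torch.randn(3, 16, requires_grad=True)

opt = torch.optim.SGD([Z], lr=0.1)
for _ in range(100):
    sim = Z @ Z.t()                                     # dot product as the node-pair similarity
    loss = F.binary_cross_entropy_with_logits(sim, A)   # predict "linked or not" for every pair
    opt.zero_grad()
    loss.backward()
    opt.step()
```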

(99)

Supervised learning

• Node classification : social network

131

(100)

Supervised learning

• Node classification :

132

(101)

Node-wise classification

133

(102)

Supervised learning :Node classification

• Last layer:

• For the last layer, the activation function is a softmax.

• The loss :

134

Gibbs distribution
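A minimal sketch of node classification on top of the embeddings Z: a linear last layer and the cross-entropy loss, with the softmax (Gibbs distribution) folded into PyTorch's `cross_entropy`; sizes and labels are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Z = torch.randn(6, 16)                 # node embeddings from the encoder (6 nodes, 16 dims)
y = torch.tensor([0, 1, 1, 0, 2, 2])   # one class label per node (3 classes)

classifier = nn.Linear(16, 3)          # last layer; the softmax is applied inside the loss below
logits = classifier(Z)
loss = F.cross_entropy(logits, y)      # cross-entropy over the softmax (Gibbs) distribution
```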

(103)

Supervised learning:

• Graph classification

A global average pooling layer must be added to gather all the node embeddings of a given graph.

Z’ can be fed to an MLP for classification

The loss is the cross entropy

135
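A minimal sketch of graph classification: global average pooling of the node embeddings into Z', then an MLP and the cross-entropy loss. PyTorch; sizes and the label are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

Z = torch.randn(7, 16)                    # embeddings of the 7 nodes of one graph
z_graph = Z.mean(dim=0, keepdim=True)     # global average pooling over the nodes -> Z'

mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss = F.cross_entropy(mlp(z_graph), torch.tensor([1]))   # e.g. cancerous vs. non-cancerous
```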

(104)

Semi-Supervised learning:

• Node classification:

the problem of classifying nodes in a graph, where labels are only available for a small subset of nodes.

where label information is smoothed over the graph via some form of explicit graph-based regularization

Assumption is that connected nodes in the graph are likely to share the same label (class).

This is true for instance for a neighborhood graph.

136

So, $l_{reg} = \| Z Z^T - A \|_F^2$ (squared Frobenius norm).
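A minimal sketch of this regularization term, to be added to the supervised loss on the labelled subset; PyTorch, and the toy graph is illustrative:

```python
import torch

Z = torch.randn(4, 8)                                   # node embeddings
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.]])                    # adjacency matrix of a path graph

l_reg = torch.norm(Z @ Z.t() - A, p="fro") ** 2         # pushes connected nodes towards similar embeddings
# Total loss: cross-entropy on the few labelled nodes + lambda * l_reg (lambda is a weighting hyper-parameter).
```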

(105)

Some applications

• Node classification on citation networks.

The input is a citation network where nodes are papers, edges are citation links, and optionally bag-of-words features on nodes. The target for each node is a paper category (e.g., stat.ML, cs.LG, ...).

137

(106)

Image/molecule classification

• Graph classification: An image is represented as a graph: based on raw pixels (a regular grid and all images have the same graph) or based on superpixels (irregular graph)

139

(107)

Graph classification

Taken from M. Bronstein, CVPR Tutorial 2017.

140

1. cancerous or not cancerous molecules

2. determination of the boiling point

Molecular graph

(108)

Summary on GNN and Graph Transformers

141

(109)

Summary on GNN and Graph Transformers

• Node attention mechanism

GAT fashion

Or Transformer fashion (K, Q, V)

Nodes are embedded in K, Q, V

142

(110)

Summary on GNN and Graph Transformers

GNN advantages : Fast.

GNN problems :

- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures

- Fails to capture long-range relations in the graph (over-squashing)

143

(111)

Summary on GNN and Graph Transformers

Attentional GNN and Graph Transformers advantages : - Better feature extractors.

Attentional and Graph Transformers problems :

- Weisfeiler-Lehman test for graph isomorphism. Aggregators used in the literature such as maximum or mean are actually strictly less powerful than WL and fail to distinguish between very simple graph structures.

- Quadratic time to compute the attention matrix.

144

(112)

Summary on GNN and Graph Transformers

Edge features consideration

Things have been done but the question is still open.

Edge features in the attention matrix only

Edge embedding

145

(113)

Summary on GNN and Graph Transformers

Wish list:

Linear (Fast) feature extraction

Node and edge embeddings

Positional encoding for graphs

Topology/Structure encoding

Graph Laplacian

Dijkstra (number of hops)

Dijkstra (Edge features)

Distinguish between very simple graph structures

From pairwise feature embedding to higher degree(subgraph)

Phi is a function of x_i, x_j, x_k, …

146

(114)

Recommended reading

Tutorials:

Geometric Deep Learning, Tutorial, CVPR, 2017. http://geometricdeeplearning.com/

Deep Learning on Graphs with Graph Convolutional Networks. http://deeploria.gforge.inria.fr/thomasTalk.pdf

Graph-based Methods in Pattern Recognition & Document Image Analysis. http://gmprdia.univ-lr.fr/

List of papers:

Gilmer et al., Neural Message Passing for Quantum Chemistry, 2017. https://arxiv.org/abs/1704.01212

Kipf et al., Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017. https://arxiv.org/abs/1609.02907

Defferrard et al., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016. https://arxiv.org/abs/1606.09375

Bruna et al., Spectral Networks and Locally Connected Networks on Graphs, ICLR 2014. https://arxiv.org/abs/1312.6203

Duvenaud et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, NIPS 2015. https://arxiv.org/abs/1509.09292

Li et al., Gated Graph Sequence Neural Networks, ICLR 2016. https://arxiv.org/abs/1511.05493

Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics, NIPS 2016. https://arxiv.org/abs/1612.00222

Kearnes et al., Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016. https://arxiv.org/abs/1603.00856

147
