Graph Neural Networks — Deep Dive
GNNs learn on graph-structured data by aggregating information from node neighborhoods. They generalize neural networks to irregular, non-Euclidean data structures.
Graph Data
Definition: Graph
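A graph $G = (V, E)$ consists of:
- Nodes $V$: entities (people, molecules, papers)
- Edges $E \subseteq V \times V$: relationships (friendships, bonds, citations)
- Node features $X \in \mathbb{R}^{n \times d}$: feature matrix with one row per node, where $n = |V|$
- Adjacency matrix $A \in \{0, 1\}^{n \times n}$: connectivity, with $A_{uv} = 1$ if $(u, v) \in E$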
Common graph types: citation networks, molecular graphs, social networks, knowledge graphs, point clouds.
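As a concrete illustration of the definition, the sketch below builds an adjacency matrix and a feature matrix for a small undirected graph. The edge list, the feature values, and the helper name `build_adjacency` are made up for this example.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, undirected edges given as (u, v) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

def build_adjacency(edges, num_nodes):
    """Return a symmetric 0/1 adjacency matrix for an undirected graph."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float64)
    for u, v in edges:
        A[u, v] = 1.0
        A[v, u] = 1.0  # undirected: mirror every edge
    return A

A = build_adjacency(edges, num_nodes)

# Node feature matrix X: one row per node, two made-up features each.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])

degrees = A.sum(axis=1)                          # number of neighbors per node
neighbors_of_2 = np.flatnonzero(A[2]).tolist()   # indices u with A[2, u] = 1
```

A dense matrix is convenient for small examples; large graphs are usually stored as edge lists or sparse matrices instead.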
Message Passing Framework
Message Function
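The message sent from neighbor u to node v at layer k is
$$m_{uv}^{(k)} = \mathrm{MSG}^{(k)}\left(h_u^{(k)}, h_v^{(k)}, e_{uv}\right), \quad u \in \mathcal{N}(v)$$
Here,
- $h_u^{(k)}$ = Node u's representation at layer k
- $h_v^{(k)}$ = Node v's representation at layer k
- $e_{uv}$ = Edge features (if any)
- $\mathcal{N}(v)$ = Neighbors of node v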
Aggregation Function
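$$m_v^{(k)} = \mathrm{AGG}\left(\left\{ m_{uv}^{(k)} : u \in \mathcal{N}(v) \right\}\right)$$
Here,
- $\mathrm{AGG}$ = Aggregation function (mean, max, sum, attention)
- $m_v^{(k)}$ = Aggregated neighborhood message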
Update Function
$$h_v^{(k+1)} = \mathrm{UPD}\left(h_v^{(k)}, m_v^{(k)}\right)$$
Here,
- $\mathrm{UPD}$ = Update function (e.g., GRU, MLP, residual connection)
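The three functions compose into one round of message passing. The following is a minimal sketch, assuming every node has at least one neighbor; the function name `message_passing_step`, the toy graph, and the particular message, aggregation, and update choices are illustrative and not taken from any specific library.

```python
import numpy as np

def message_passing_step(h, neighbors, message_fn, aggregate_fn, update_fn):
    """One round of message passing.

    h:         (n, d) array, row v is the current representation of node v
    neighbors: dict mapping node v to the list of its neighbor ids
    Assumes every node has at least one neighbor.
    """
    new_rows = []
    for v in range(h.shape[0]):
        # 1. Message: one message per neighbor u of v.
        msgs = [message_fn(h[u], h[v]) for u in neighbors[v]]
        # 2. Aggregate: order-independent reduction over the messages.
        m_v = aggregate_fn(np.stack(msgs))
        # 3. Update: combine the old state with the aggregated message.
        new_rows.append(update_fn(h[v], m_v))
    return np.stack(new_rows)

# Illustrative choices: identity message, mean aggregation, averaging update.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

h1 = message_passing_step(
    h0,
    neighbors,
    message_fn=lambda h_u, h_v: h_u,
    aggregate_fn=lambda msgs: msgs.mean(axis=0),
    update_fn=lambda h_v, m_v: 0.5 * (h_v + m_v),
)
```

Swapping `aggregate_fn` for a sum or max, or `update_fn` for a learned layer, changes the model without changing this loop, which is the point of the framework.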
Graph Convolutional Network (GCN)
Definition: GCN Layer
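GCN (Kipf & Welling, 2017) performs spectral-inspired convolution on graphs:
$$h_v^{(k+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_u \tilde{d}_v}} \, {W^{(k)}}^{\top} h_u^{(k)}\right)$$
where $\tilde{d}_v = d_v + 1$ is the degree of node v plus one for its self-loop, $\sigma$ is a nonlinearity such as ReLU, and $h_u^{(k)}$ is treated as a column vector.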
GCN Layer (Matrix Form)
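$$H^{(k+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(k)} W^{(k)}\right)$$
Here,
- $\tilde{A}$ = $A + I$ (adjacency with self-loops)
- $\tilde{D}$ = Degree matrix of $\tilde{A}$
- $W^{(k)}$ = Learnable weight matrix at layer k
- $H^{(k)}$ = Node representations at layer k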
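The matrix form translates almost directly into NumPy. This is a minimal dense sketch, assuming ReLU as the nonlinearity and a random, untrained weight matrix; the function name `gcn_layer` and the toy inputs are placeholders for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])         # adjacency with self-loops
    d_tilde = A_tilde.sum(axis=1)            # degree including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)      # safe: every d_tilde is >= 1
    # Symmetric normalization: entry (v, u) is scaled by 1 / sqrt(d~_v * d~_u).
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)    # ReLU nonlinearity

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

rng = np.random.default_rng(0)
W0 = rng.normal(size=(2, 3))                 # untrained weights: 2 inputs, 3 outputs
H1 = gcn_layer(A, H0, W0)                    # shape (4, 3)
```

Because the normalized adjacency depends on the whole graph, this layer is computed once for all nodes; practical implementations use sparse matrices rather than the dense product shown here.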
GraphSAGE
Definition: GraphSAGE
GraphSAGE (Hamilton et al., 2017) enables inductive learning by sampling and aggregating from neighborhoods:
- Sample a fixed-size set of neighbors from $\mathcal{N}(v)$
- Aggregate using mean, LSTM, or pooling
- Concatenate with node's own features
- Transform with MLP
Key advantage: Can generate embeddings for unseen nodes (inductive).
GraphSAGE Update
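$$h_v^{(k+1)} = \sigma\left(W^{(k)} \left[\, h_v^{(k)} \,\|\, \mathrm{AGG}\left(\left\{ h_u^{(k)} : u \in \mathcal{S}(v) \right\}\right) \right]\right)$$
Here,
- $\mathcal{S}(v)$ = Sampled subset of neighbors
- $\mathrm{AGG}$ = Aggregator (mean, LSTM, pooling)
- $[\,\cdot \,\|\, \cdot\,]$ = Concatenation of self and neighborhood features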
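The sample, aggregate, concatenate, and transform steps can be sketched as follows. This is a simplified illustration, assuming a mean aggregator, a single linear map with ReLU in place of a full MLP, and sampling with replacement when a node has too few neighbors; the function names and toy data are hypothetical.

```python
import numpy as np

def sample_neighbors(neighbors, v, num_samples, rng):
    """Step 1: draw a fixed-size sample of the neighbors of v.

    Samples with replacement when v has fewer than num_samples neighbors.
    """
    nbrs = np.asarray(neighbors[v])
    replace = len(nbrs) < num_samples
    return rng.choice(nbrs, size=num_samples, replace=replace)

def graphsage_mean_layer(h, neighbors, W, num_samples, rng):
    """Apply ReLU(W [h_v || mean of sampled neighbor rows]) to every node."""
    out = []
    for v in range(h.shape[0]):
        sampled = sample_neighbors(neighbors, v, num_samples, rng)
        h_nbr = h[sampled].mean(axis=0)            # step 2: mean aggregator
        concat = np.concatenate([h[v], h_nbr])     # step 3: self || neighborhood
        out.append(np.maximum(W @ concat, 0.0))    # step 4: linear map + ReLU
    return np.stack(out)

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                        # maps 2 + 2 inputs to 3 outputs
h1 = graphsage_mean_layer(h0, neighbors, W, num_samples=2, rng=rng)
```

The layer only needs a node's own features, its local neighborhood, and the shared weights `W`, so the same function can be applied to a node that was not present during training, which is what makes the approach inductive.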
Graph Attention Network (GAT)
Definition: GAT
GAT (Veličković et al., 2018) uses attention to weight neighbor contributions:
$$h_v^{(k+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{uv} \, W h_u^{(k)}\right)$$
GAT Attention Coefficients
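$$\alpha_{uv} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top} \left[ W h_v \,\|\, W h_u \right]\right)\right)}{\sum_{w \in \mathcal{N}(v)} \exp\left(\mathrm{LeakyReLU}\left(a^{\top} \left[ W h_v \,\|\, W h_w \right]\right)\right)}$$
Here,
- $\alpha_{uv}$ = Attention weight from u to v
- $a$ = Attention parameter vector
- $W$ = Shared linear transformation
- $\|$ = Concatenation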
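A single-head version of the attention computation can be sketched as below. It assumes the usual form, in which a score LeakyReLU(a^T [W h_v || W h_u]) is computed for each neighbor u of v and then normalized with a softmax over the neighbors of v. The negative slope of 0.2, the random parameters, and the function names are assumptions for this example.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU; the slope value is an assumption for this sketch."""
    return np.where(x > 0, x, slope * x)

def gat_attention(h, neighbors, W, a, v):
    """Attention weights over the neighbors u of node v (single head)."""
    z = h @ W.T                                   # row u holds W h_u
    nbrs = neighbors[v]
    # Unnormalized score for each neighbor: LeakyReLU(a^T [W h_v || W h_u]).
    scores = np.array(
        [leaky_relu(a @ np.concatenate([z[v], z[u]])) for u in nbrs]
    )
    scores = scores - scores.max()                # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum()                # softmax over the neighbors

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                       # shared linear transformation
a = rng.normal(size=6)                            # attention vector, length 2 * 3

alpha = gat_attention(h0, neighbors, W, a, v=2)   # one weight per neighbor of node 2
# Attention-weighted sum of transformed neighbor features (before the nonlinearity).
h2_new = alpha @ (h0 @ W.T)[neighbors[2]]
```

With `a` set to all zeros every score is equal and the weights become uniform, so the layer falls back to plain mean aggregation; learned values of `a` let it deviate from that baseline.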
Over-Smoothing
Theorem: Over-Smoothing in GNNs
As the number of GNN layers increases, node representations converge to the same vector regardless of initial features:
$$\lim_{k \to \infty} h_v^{(k)} = h^{*} \quad \text{for all } v \in V$$
for some vector $h^{*}$. This means deep GNNs lose the ability to distinguish nodes, which in practice often limits useful depth to 2-3 layers.
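The effect is easy to observe numerically. The sketch below is a deliberately simplified setting rather than a full GNN: it repeatedly applies mean aggregation over each node's neighborhood plus itself, with no weights and no nonlinearity, on a small connected toy graph. Row normalization is assumed here so that the limit is exactly a shared vector.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

# Row-normalized propagation: each node averages itself and its neighbors.
A_tilde = A + np.eye(4)
P = A_tilde / A_tilde.sum(axis=1, keepdims=True)

def spread(H):
    """Largest per-feature gap between any two nodes (0 means all rows equal)."""
    return float((H.max(axis=0) - H.min(axis=0)).max())

H = X.copy()
spreads = []
for _ in range(30):
    spreads.append(spread(H))   # record how distinguishable nodes are
    H = P @ H                   # one more propagation layer
final_spread = spread(H)
```

The recorded spread starts at 1.0 and shrinks with every layer, so after enough propagation steps all rows of `H` are numerically indistinguishable even though the initial features differed.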