Graph Data Science
Overview
Graph data science applies machine learning and analytics to graph-structured data — networks of entities (nodes) and their relationships (edges). Applications include social networks, recommendation systems, fraud detection, knowledge graphs, and biological networks. Graph representations capture relational structure that tabular methods miss, enabling insights into influence, community, and connectivity.
Graph Theory Basics
Definitions
DfGraph
A graph consists of:
- = set of nodes (vertices)
- = set of edges
Types of Graphs
Undirected Graph Directed Graph
A --- B A --> B
| / | | / |
| / | v / v
| / | | / |
C --- D C --> D
Weighted Graph Bipartite Graph
A --5-- B Users Items
| / | | | | |
3 2 4 U1--I1 U2--I2
| / | | | | |
C --1-- D U2--I1 U1--I2
Graph Representations
Adjacency Matrix:
Adjacency Matrix
Here,
- =Entry in adjacency matrix
- =Set of edges
Adjacency List:
Node A: [B, C]
Node B: [A, C, D]
Node C: [A, B, D]
Node D: [B, C]
Edge List:
[(A,B), (A,C), (B,C), (B,D), (C,D)]
Network Metrics
Degree Centrality
Number of connections a node has:
Degree Centrality
Here,
- =Degree centrality of node v
- =Degree of node v
- =Total number of nodes
Node A: degree 2, centrality = 2/4 = 0.5
Node B: degree 3, centrality = 3/4 = 0.75 <-- Most connected
Node C: degree 3, centrality = 3/4 = 0.75
Node D: degree 2, centrality = 2/4 = 0.5
Betweenness Centrality
Measures how often a node lies on shortest paths:
Betweenness Centrality
Here,
- =Betweenness centrality of node v
- =Total shortest paths from s to t
- =Shortest paths passing through v
B
/ \
A C <-- B has high betweenness
\ (bridge between A and D)
D
Closeness Centrality
Average shortest path distance to all other nodes:
Closeness Centrality
Here,
- =Closeness centrality of node v
- =Shortest path distance between v and u
Eigenvector Centrality
Node importance based on neighbor importance:
In matrix form:
Node with high eigenvector centrality is connected to other well-connected nodes.
PageRank
DfPageRank
PageRank measures node importance by counting links to a node, where links from important nodes count more. It models a random walker who follows links with probability and teleports to a random node with probability .
PageRank Algorithm:
1. Initialize all nodes to 1/n
2. Iterate:
- Each node distributes its rank equally to neighbors
- Each node collects sum of incoming ranks
- Apply damping factor
3. Converges to stable distribution
ℹ️ PageRank Convergence
PageRank is guaranteed to converge to a unique stationary distribution as long as the graph is strongly connected and aperiodic. The damping factor ensures convergence even for graphs with dead ends.
Clustering Coefficient
Clustering Coefficient
Here,
- =Clustering coefficient of node i
- =Degree of node i
Measures how connected a node's neighbors are to each other.
Clustering = 1.0: Clustering = 0.0:
A A
/ \ / \
B---C B C
Triangle Star (no B-C edge)
Implementation
import networkx as nx
import numpy as np
import pandas as pd
from collections import defaultdict
class GraphAnalyzer:
"""Comprehensive graph analysis toolkit."""
def __init__(self, graph):
self.G = graph
def degree_centrality(self):
"""Calculate degree centrality for all nodes."""
n = len(self.G.nodes()) - 1
return {v: self.G.degree(v) / n for v in self.G.nodes()}
def betweenness_centrality(self):
"""Calculate betweenness centrality."""
return nx.betweenness_centrality(self.G)
def closeness_centrality(self):
"""Calculate closeness centrality."""
return nx.closeness_centrality(self.G)
def eigenvector_centrality(self, max_iter=100, tol=1e-6):
"""Calculate eigenvector centrality."""
return nx.eigenvector_centrality(self.G, max_iter=max_iter, tol=tol)
def pagerank(self, d=0.85, max_iter=100):
"""Calculate PageRank scores."""
return nx.pagerank(self.G, alpha=d, max_iter=max_iter)
def clustering_coefficient(self):
"""Calculate local clustering coefficient."""
return nx.clustering(self.G)
def shortest_paths(self, source=None):
"""Calculate shortest paths."""
if source:
return nx.single_source_shortest_path_length(self.G, source)
return dict(nx.all_pairs_shortest_path_length(self.G))
def community_detection(self):
"""Detect communities using Louvain method."""
from community import community_louvain
return community_louvain.best_partition(self.G)
def summary(self):
"""Generate graph summary statistics."""
return {
'num_nodes': self.G.number_of_nodes(),
'num_edges': self.G.number_of_edges(),
'density': nx.density(self.G),
'avg_degree': sum(dict(self.G.degree()).values()) / self.G.number_of_nodes(),
'is_connected': nx.is_connected(self.G),
'num_components': nx.number_connected_components(self.G),
'avg_clustering': nx.average_clustering(self.G),
'transitivity': nx.transitivity(self.G)
}
# Create example social network
G = nx.Graph()
# Add edges ( friendships)
edges = [
('Alice', 'Bob'), ('Alice', 'Charlie'), ('Alice', 'Diana'),
('Bob', 'Charlie'), ('Bob', 'Eve'), ('Charlie', 'Diana'),
('Charlie', 'Eve'), ('Diana', 'Frank'), ('Eve', 'Frank'),
('Eve', 'Grace'), ('Frank', 'Grace'), ('Frank', 'Heidi'),
('Grace', 'Heidi'), ('Heidi', 'Ivan')
]
G.add_edges_from(edges)
# Analyze
analyzer = GraphAnalyzer(G)
print("=== Graph Summary ===")
summary = analyzer.summary()
for key, value in summary.items():
print(f" {key}: {value:.4f}" if isinstance(value, float) else f" {key}: {value}")
print("\n=== Centrality Measures ===")
degree = analyzer.degree_centrality()
betweenness = analyzer.betweenness_centrality()
pagerank = analyzer.pagerank()
nodes = sorted(degree.keys())
print(f"{'Node':<10} {'Degree':>8} {'Betweenness':>12} {'PageRank':>10}")
print("-" * 42)
for node in nodes:
print(f"{node:<10} {degree[node]:>8.4f} {betweenness[node]:>12.4f} {pagerank[node]:>10.4f}")
print("\n=== Community Detection ===")
communities = analyzer.community_detection()
for comm_id in set(communities.values()):
members = [n for n, c in communities.items() if c == comm_id]
print(f"Community {comm_id}: {members}")
Graph Algorithms
Breadth-First Search (BFS)
Explores nodes level by level:
Level 0: A
/ \
Level 1: B C
/| |
Level 2: D E F
BFS Order: A -> B -> C -> D -> E -> F
Dijkstra's Shortest Path
Dijkstra's Algorithm
Here,
- =Shortest distance to node v
- =Weight of edge (u, v)
Minimum Spanning Tree (Kruskal's/Prim's)
Find subset of edges connecting all nodes with minimum total weight.
Community Detection Algorithms
Louvain Modularity:
Label Propagation:
Each node adopts the most common label among its neighbors.
Graph Neural Networks (GNN)
Message Passing Framework:
GNN Message Passing
Here,
- =Node v's hidden state at layer l
- =Neighbors of node v
GCN Layer:
where and .
Knowledge Graphs
Structure
Knowledge graphs store facts as triples:
DfKnowledge Triple
A knowledge triple is a (subject, predicate, object) representation of a fact.
Example:
(Albert_Einstein, born_in, Germany)
(Albert_Einstein, developed, Theory_of_Relativity)
(Theory_of_Relativity, is_a, Physics_Theory)
(Germany, located_in, Europe)
Knowledge Graph Embeddings
TransE:
TransE Embedding
Here,
- =Head entity embedding
- =Relation embedding
- =Tail entity embedding
TransE Loss:
RotatE:
RotatE Embedding
Here,
- =Hadamard product in complex space
where is Hadamard product in complex space.
Implementation
import networkx as nx
import numpy as np
from typing import List, Tuple, Dict
from collections import defaultdict
class KnowledgeGraph:
"""Simple knowledge graph implementation."""
def __init__(self):
self.graph = nx.MultiDiGraph()
self.entity_embeddings = {}
self.relation_embeddings = {}
def add_triple(self, subject: str, predicate: str, obj: str):
"""Add a knowledge triple."""
self.graph.add_edge(subject, obj, relation=predicate)
def add_triples(self, triples: List[Tuple[str, str, str]]):
"""Add multiple triples."""
for s, p, o in triples:
self.add_triple(s, p, o)
def query(self, subject=None, predicate=None, obj=None):
"""Query the knowledge graph."""
results = []
for s, p, o in self.graph.edges(keys=True, data=True):
if subject and s != subject:
continue
if predicate and o.get('relation') != predicate:
continue
if obj and o != obj:
continue
results.append((s, o.get('relation'), o))
return results
def get_neighbors(self, entity: str, relation: str = None):
"""Get neighbors of an entity."""
neighbors = []
for _, target, data in self.graph.out_edges(entity, data=True):
if relation is None or data.get('relation') == relation:
neighbors.append((target, data.get('relation')))
return neighbors
def find_paths(self, source: str, target: str, max_hops: int = 3):
"""Find paths between two entities."""
try:
paths = list(nx.all_simple_paths(self.graph, source, target, cutoff=max_hops))
return paths
except nx.NetworkXError:
return []
def pagerank(self):
"""Calculate PageRank on the knowledge graph."""
return nx.pagerank(self.graph)
def visualize(self):
"""Print graph structure."""
print("Knowledge Graph Structure:")
print("=" * 50)
for s, t, data in self.graph.edges(data=True):
print(f" {s} --[{data.get('relation')}]--> {t}")
# Create a medical knowledge graph
kg = KnowledgeGraph()
# Add medical triples
triples = [
('Diabetes', 'has_symptom', 'High_Blood_Sugar'),
('Diabetes', 'has_symptom', 'Frequent_Urination'),
('Diabetes', 'has_symptom', 'Increased_Thirst'),
('Diabetes', 'treated_by', 'Insulin'),
('Diabetes', 'type_of', 'Metabolic_Disorder'),
('Insulin', 'produced_by', 'Pancreas'),
('Pancreas', 'located_in', 'Abdomen'),
('High_Blood_Sugar', 'can_lead_to', 'Heart_Disease'),
('Heart_Disease', 'has_symptom', 'Chest_Pain'),
('Heart_Disease', 'risk_factor', 'Diabetes'),
('Metformin', 'treats', 'Diabetes'),
('Metformin', 'class', 'Biguanide'),
]
kg.add_triples(triples)
kg.visualize()
# Query examples
print("\n=== Query Results ===")
print("\nSymptoms of Diabetes:")
for s, p, o in kg.query(subject='Diabetes', predicate='has_symptom'):
print(f" {s} --{p}--> {o}")
print("\nRisk factors for Heart Disease:")
for s, p, o in kg.query(obj='Heart_Disease'):
print(f" {s} --{p}--> {o}")
print("\n=== Neighbors of Diabetes ===")
for neighbor, relation in kg.get_neighbors('Diabetes'):
print(f" --[{relation}]--> {neighbor}")
print("\n=== PageRank Scores ===")
pagerank = kg.pagerank()
for entity, score in sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" {entity}: {score:.4f}")
Graph Neural Networks
Architecture
Input Features GCN Layer Output
h^(0) h^(1) = σ(D̃^(-½)ÃD̃^(-½)h^(0)W^(0))
[x1] [z1] [y1]
[x2] ---> [z2] ---> [y2]
[x3] [z3] [y3]
[x4] [z4] [y4]
Message Passing Neural Network
import torch
import torch.nn as nn
import torch.nn.functional as F
class GCNLayer(nn.Module):
"""Graph Convolutional Network layer."""
def __init__(self, in_features, out_features):
super().__init__()
self.weight = nn.Parameter(torch.FloatTensor(in_features, out_features))
self.bias = nn.Parameter(torch.FloatTensor(out_features))
self.reset_parameters()
def reset_parameters(self):
nn.init.xavier_uniform_(self.weight)
nn.init.zeros_(self.bias)
def forward(self, x, adj):
"""
Forward pass.
Args:
x: Node features [n_nodes, in_features]
adj: Adjacency matrix [n_nodes, n_nodes]
"""
# Normalize adjacency matrix
degree = adj.sum(dim=1, keepdim=True)
degree_inv_sqrt = torch.where(degree > 0, degree.pow(-0.5), torch.zeros_like(degree))
adj_norm = degree_inv_sqrt * adj * degree_inv_sqrt.t()
# GCN message passing
support = torch.mm(x, self.weight)
output = torch.mm(adj_norm, support) + self.bias
return output
class GraphAttentionLayer(nn.Module):
"""Graph Attention Network layer."""
def __init__(self, in_features, out_features, n_heads=8):
super().__init__()
self.n_heads = n_heads
self.head_dim = out_features // n_heads
self.W = nn.Linear(in_features, out_features)
self.a = nn.Parameter(torch.FloatTensor(n_heads, 2 * self.head_dim))
nn.init.xavier_uniform_(self.a)
def forward(self, x, adj):
"""Forward pass with attention."""
n_nodes = x.size(0)
h = self.W(x).view(n_nodes, self.n_heads, self.head_dim)
# Attention scores
attention = torch.zeros(n_nodes, n_nodes, self.n_heads)
for i in range(n_nodes):
for j in range(n_nodes):
if adj[i, j] > 0:
score = torch.cat([h[i], h[j]], dim=-1)
attention[i, j] = torch.exp(F.leaky_relu((score * self.a).sum(dim=-1)))
# Normalize attention
attention = attention / (attention.sum(dim=1, keepdim=True) + 1e-10)
# Apply attention
out = torch.zeros(n_nodes, self.n_heads, self.head_dim)
for i in range(n_nodes):
for j in range(n_nodes):
out[i] += attention[i, j] * h[j]
return out.view(n_nodes, -1)
class SimpleGNN(nn.Module):
"""Simple Graph Neural Network for node classification."""
def __init__(self, n_features, n_hidden, n_classes, n_layers=2):
super().__init__()
self.layers = nn.ModuleList()
self.layers.append(GCNLayer(n_features, n_hidden))
for _ in range(n_layers - 2):
self.layers.append(GCNLayer(n_hidden, n_hidden))
self.layers.append(GCNLayer(n_hidden, n_classes))
def forward(self, x, adj):
"""Forward pass through all layers."""
for i, layer in enumerate(self.layers[:-1]):
x = F.relu(layer(x, adj))
x = F.dropout(x, p=0.5, training=self.training)
x = self.layers[-1](x, adj)
return F.log_softmax(x, dim=1)
# Example: Cora dataset simulation
def create_cora_like_data():
"""Create a small graph for demonstration."""
n_nodes = 100
n_features = 32
n_classes = 5
# Random features
features = torch.randn(n_nodes, n_features)
# Create adjacency (random graph)
adj = torch.zeros(n_nodes, n_nodes)
for i in range(n_nodes):
n_edges = torch.randint(3, 8, (1,)).item()
neighbors = torch.randperm(n_nodes)[:n_edges]
adj[i, neighbors] = 1
adj[neighbors, i] = 1
# Random labels
labels = torch.randint(0, n_classes, (n_nodes,))
return features, adj, labels
# Initialize and use model
features, adj, labels = create_cora_like_data()
model = SimpleGNN(n_features=32, n_hidden=64, n_classes=5)
# Forward pass
output = model(features, adj)
print(f"Input shape: {features.shape}")
print(f"Output shape: {output.shape}")
print(f"Predictions (first 5): {output[:5].argmax(dim=1)}")
Real-World Applications
Social Network Analysis
import networkx as nx
import numpy as np
# Create social network
G = nx.watts_strogatz_graph(n=100, k=6, p=0.3, seed=42)
# Find influential nodes
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)
# Top 5 influencers
top_pr = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
top_bw = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 by PageRank:")
for node, score in top_pr:
print(f" Node {node}: {score:.4f}")
print("\nTop 5 by Betweenness:")
for node, score in top_bw:
print(f" Node {node}: {score:.4f}")
Fraud Detection
# Transaction graph for fraud detection
class FraudDetector:
"""Graph-based fraud detection."""
def __init__(self):
self.G = nx.DiGraph()
def add_transaction(self, sender, receiver, amount, timestamp):
"""Add transaction to graph."""
self.G.add_edge(sender, receiver,
amount=amount,
timestamp=timestamp)
def detect_suspicious_patterns(self):
"""Detect common fraud patterns."""
suspicious = []
# Pattern 1: Circular transactions (money laundering)
for cycle in nx.simple_cycles(self.G):
if len(cycle) <= 4:
suspicious.append(('circular', cycle))
# Pattern 2: Rapid sequential transactions
for node in self.G.nodes():
out_edges = sorted(self.G.out_edges(node, data=True),
key=lambda x: x[2]['timestamp'])
for i in range(len(out_edges) - 2):
t1 = out_edges[i][2]['timestamp']
t3 = out_edges[i+2][2]['timestamp']
if t3 - t1 < 3600: # Within 1 hour
suspicious.append(('rapid_sequential', node))
# Pattern 3: High clustering (collusion)
clustering = nx.clustering(self.G.to_undirected())
high_clustering = [n for n, c in clustering.items() if c > 0.8]
if high_clustering:
suspicious.append(('high_clustering', high_clustering))
return suspicious
# Example
detector = FraudDetector()
# Add transactions
transactions = [
('A', 'B', 1000, 1000),
('B', 'C', 900, 1100),
('C', 'A', 850, 1200), # Circular!
('D', 'E', 500, 1300),
('E', 'F', 450, 1350), # Rapid
('F', 'D', 400, 1400), # Circular + rapid!
]
for s, r, a, t in transactions:
detector.add_transaction(s, r, a, t)
patterns = detector.detect_suspicious_patterns()
print("Suspicious Patterns Detected:")
for pattern_type, details in patterns:
print(f" {pattern_type}: {details}")
Key Takeaways
📋Summary: Graph Data Science
- Graphs model relationships: Entities as nodes, connections as edges — capturing relational structure that tabular data cannot
- Centrality measures identify important nodes (degree, betweenness, eigenvector, PageRank) — each capturing a different notion of "importance"
- Community detection reveals group structure in networks using modularity optimization or label propagation
- Knowledge graphs store structured facts as (subject, predicate, object) triples — enabling reasoning and question answering
- GNNs enable deep learning on graph-structured data through message passing and neighborhood aggregation
- Graph algorithms solve problems like shortest path, spanning trees, and flow — foundational for network optimization
- Applications span social networks, recommendations, fraud detection, and biology — any domain where relationships matter
Practice Exercises
Exercise 1: Network Analysis
Create a social network and calculate:
- Degree, betweenness, and eigenvector centrality
- Clustering coefficient for each node
- Communities using Louvain algorithm
Exercise 2: Knowledge Graph
Build a knowledge graph for a domain of your choice (movies, books, etc.) and implement:
- Triple storage and querying
- Path finding between entities
- PageRank for entity importance
Exercise 3: GNN Implementation
Implement a GCN from scratch for node classification on a small graph.
Exercise 4: Fraud Detection
Design a graph-based fraud detection system that identifies:
- Circular transaction patterns
- Rapid sequential transfers
- Collusion clusters
Discussion Questions
- When would you use a graph database over a relational database?
- How do you handle dynamic graphs that change over time?
- What are the scalability challenges of graph algorithms?