Graph Data Science

Overview

Graph data science applies machine learning and analytics to graph-structured data — networks of entities (nodes) and their relationships (edges). Applications include social networks, recommendation systems, fraud detection, knowledge graphs, and biological networks. Graph representations capture relational structure that tabular methods miss, enabling insights into influence, community, and connectivity.

Graph Theory Basics

Definitions

DfGraph

A graph $G = (V, E)$ consists of:

$V$ = set of nodes (vertices)
$E \subseteq V \times V$ = set of edges

Types of Graphs

Architecture Diagram

Undirected Graph              Directed Graph
A --- B                       A --> B
|   / |                       |   / |
|  /  |                       v  /  v
| /   |                       | /   |
C --- D                       C --> D

Weighted Graph                Bipartite Graph
A --5-- B                    Users    Items
|     / |                    |   |   |   |
3   2   4                   U1--I1  U2--I2
|  /    |                    |   |   |   |
C --1-- D                   U2--I1  U1--I2

Graph Representations

Adjacency Matrix:

Adjacency Matrix

A_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}

Here,

$A_{ij}$ =Entry in adjacency matrix
$E$ =Set of edges

Adjacency List:

Architecture Diagram

Node A: [B, C]
Node B: [A, C, D]
Node C: [A, B, D]
Node D: [B, C]

Edge List:

Architecture Diagram

[(A,B), (A,C), (B,C), (B,D), (C,D)]

Network Metrics

Degree Centrality

Number of connections a node has:

Degree Centrality

C_D(v) = \frac{\text{deg}(v)}{n-1}

Here,

$C_D(v)$ =Degree centrality of node v
$deg(v)$ =Degree of node v
$n$ =Total number of nodes

Architecture Diagram

Node A: degree 2, centrality = 2/4 = 0.5
Node B: degree 3, centrality = 3/4 = 0.75  <-- Most connected
Node C: degree 3, centrality = 3/4 = 0.75
Node D: degree 2, centrality = 2/4 = 0.5

Betweenness Centrality

Measures how often a node lies on shortest paths:

Betweenness Centrality

C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

Here,

$C_B(v)$ =Betweenness centrality of node v
$\sigma_{st}$ =Total shortest paths from s to t
$\sigma_{st}(v)$ =Shortest paths passing through v

Architecture Diagram

    B
   / \
  A   C  <-- B has high betweenness
       \      (bridge between A and D)
        D

Closeness Centrality

Average shortest path distance to all other nodes:

Closeness Centrality

C_C(v) = \frac{n-1}{\sum_{u \neq v} d(v, u)}

Here,

$C_C(v)$ =Closeness centrality of node v
$d(v, u)$ =Shortest path distance between v and u

Eigenvector Centrality

Node importance based on neighbor importance:

In matrix form: $Ax = \lambda x$

Node with high eigenvector centrality is connected to other well-connected nodes.

PageRank

DfPageRank

PageRank measures node importance by counting links to a node, where links from important nodes count more. It models a random walker who follows links with probability $d$ and teleports to a random node with probability $1-d$ .

Architecture Diagram

PageRank Algorithm:
1. Initialize all nodes to 1/n
2. Iterate:
   - Each node distributes its rank equally to neighbors
   - Each node collects sum of incoming ranks
   - Apply damping factor
3. Converges to stable distribution

ℹ️ PageRank Convergence

PageRank is guaranteed to converge to a unique stationary distribution as long as the graph is strongly connected and aperiodic. The damping factor $d = 0.85$ ensures convergence even for graphs with dead ends.

Clustering Coefficient

C_i = \frac{2 \cdot |\{e_{jk} : v_j, v_k \in N_i, e_{jk} \in E\}|}{k_i(k_i - 1)}

Here,

$C_i$ =Clustering coefficient of node i
$k_i$ =Degree of node i

Measures how connected a node's neighbors are to each other.

Architecture Diagram

Clustering = 1.0:          Clustering = 0.0:
    A                          A
   / \                        / \
  B---C                      B   C
Triangle                    Star (no B-C edge)

Implementation

import networkx as nx
import numpy as np
import pandas as pd
from collections import defaultdict

class GraphAnalyzer:
    """Comprehensive graph analysis toolkit."""
    
    def __init__(self, graph):
        self.G = graph
    
    def degree_centrality(self):
        """Calculate degree centrality for all nodes."""
        n = len(self.G.nodes()) - 1
        return {v: self.G.degree(v) / n for v in self.G.nodes()}
    
    def betweenness_centrality(self):
        """Calculate betweenness centrality."""
        return nx.betweenness_centrality(self.G)
    
    def closeness_centrality(self):
        """Calculate closeness centrality."""
        return nx.closeness_centrality(self.G)
    
    def eigenvector_centrality(self, max_iter=100, tol=1e-6):
        """Calculate eigenvector centrality."""
        return nx.eigenvector_centrality(self.G, max_iter=max_iter, tol=tol)
    
    def pagerank(self, d=0.85, max_iter=100):
        """Calculate PageRank scores."""
        return nx.pagerank(self.G, alpha=d, max_iter=max_iter)
    
    def clustering_coefficient(self):
        """Calculate local clustering coefficient."""
        return nx.clustering(self.G)
    
    def shortest_paths(self, source=None):
        """Calculate shortest paths."""
        if source:
            return nx.single_source_shortest_path_length(self.G, source)
        return dict(nx.all_pairs_shortest_path_length(self.G))
    
    def community_detection(self):
        """Detect communities using Louvain method."""
        from community import community_louvain
        return community_louvain.best_partition(self.G)
    
    def summary(self):
        """Generate graph summary statistics."""
        return {
            'num_nodes': self.G.number_of_nodes(),
            'num_edges': self.G.number_of_edges(),
            'density': nx.density(self.G),
            'avg_degree': sum(dict(self.G.degree()).values()) / self.G.number_of_nodes(),
            'is_connected': nx.is_connected(self.G),
            'num_components': nx.number_connected_components(self.G),
            'avg_clustering': nx.average_clustering(self.G),
            'transitivity': nx.transitivity(self.G)
        }

# Create example social network
G = nx.Graph()

# Add edges ( friendships)
edges = [
    ('Alice', 'Bob'), ('Alice', 'Charlie'), ('Alice', 'Diana'),
    ('Bob', 'Charlie'), ('Bob', 'Eve'), ('Charlie', 'Diana'),
    ('Charlie', 'Eve'), ('Diana', 'Frank'), ('Eve', 'Frank'),
    ('Eve', 'Grace'), ('Frank', 'Grace'), ('Frank', 'Heidi'),
    ('Grace', 'Heidi'), ('Heidi', 'Ivan')
]

G.add_edges_from(edges)

# Analyze
analyzer = GraphAnalyzer(G)

print("=== Graph Summary ===")
summary = analyzer.summary()
for key, value in summary.items():
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

print("\n=== Centrality Measures ===")
degree = analyzer.degree_centrality()
betweenness = analyzer.betweenness_centrality()
pagerank = analyzer.pagerank()

nodes = sorted(degree.keys())
print(f"{'Node':<10} {'Degree':>8} {'Betweenness':>12} {'PageRank':>10}")
print("-" * 42)
for node in nodes:
    print(f"{node:<10} {degree[node]:>8.4f} {betweenness[node]:>12.4f} {pagerank[node]:>10.4f}")

print("\n=== Community Detection ===")
communities = analyzer.community_detection()
for comm_id in set(communities.values()):
    members = [n for n, c in communities.items() if c == comm_id]
    print(f"Community {comm_id}: {members}")

Graph Algorithms

Breadth-First Search (BFS)

Explores nodes level by level:

Architecture Diagram

Level 0:     A
            / \
Level 1:   B   C
          /|   |
Level 2: D E  F

BFS Order: A -> B -> C -> D -> E -> F

Dijkstra's Shortest Path

Dijkstra's Algorithm

d(v) = \min_{u \in N(v)} \{d(u) + w(u, v)\}

Here,

$d(v)$ =Shortest distance to node v
$w(u, v)$ =Weight of edge (u, v)

Minimum Spanning Tree (Kruskal's/Prim's)

Find subset of edges connecting all nodes with minimum total weight.

Community Detection Algorithms

Louvain Modularity:

Label Propagation:

Each node adopts the most common label among its neighbors.

Graph Neural Networks (GNN)

Message Passing Framework:

GNN Message Passing

h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)}, \text{AGGREGATE}\left(\{h_u^{(l)} : u \in N(v)\}\right)\right)

Here,

$h_v^{(l)}$ =Node v's hidden state at layer l
$N(v)$ =Neighbors of node v

GCN Layer:

where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ .

Knowledge Graphs

Structure

Knowledge graphs store facts as triples:

DfKnowledge Triple

A knowledge triple is a (subject, predicate, object) representation of a fact.

(\text{subject}, \text{predicate}, \text{object})

Example:

Architecture Diagram

(Albert_Einstein, born_in, Germany)
(Albert_Einstein, developed, Theory_of_Relativity)
(Theory_of_Relativity, is_a, Physics_Theory)
(Germany, located_in, Europe)

Knowledge Graph Embeddings

TransE:

TransE Embedding

d(h + r, t) \approx 0 \quad \text{for true triples}

Here,

$h$ =Head entity embedding
$r$ =Relation embedding
$t$ =Tail entity embedding

TransE Loss:

\mathcal{L} = \sum_{(h,r,t) \in S} \sum_{(h',r,t') \in S'} \max(0, \gamma + d(h+r, t) - d(h'+r, t'))

RotatE:

RotatE Embedding

h \circ r \approx t

Here,

$\circ$ =Hadamard product in complex space

where $\circ$ is Hadamard product in complex space.

Implementation

import networkx as nx
import numpy as np
from typing import List, Tuple, Dict
from collections import defaultdict

class KnowledgeGraph:
    """Simple knowledge graph implementation."""
    
    def __init__(self):
        self.graph = nx.MultiDiGraph()
        self.entity_embeddings = {}
        self.relation_embeddings = {}
    
    def add_triple(self, subject: str, predicate: str, obj: str):
        """Add a knowledge triple."""
        self.graph.add_edge(subject, obj, relation=predicate)
    
    def add_triples(self, triples: List[Tuple[str, str, str]]):
        """Add multiple triples."""
        for s, p, o in triples:
            self.add_triple(s, p, o)
    
    def query(self, subject=None, predicate=None, obj=None):
        """Query the knowledge graph."""
        results = []
        for s, p, o in self.graph.edges(keys=True, data=True):
            if subject and s != subject:
                continue
            if predicate and o.get('relation') != predicate:
                continue
            if obj and o != obj:
                continue
            results.append((s, o.get('relation'), o))
        return results
    
    def get_neighbors(self, entity: str, relation: str = None):
        """Get neighbors of an entity."""
        neighbors = []
        for _, target, data in self.graph.out_edges(entity, data=True):
            if relation is None or data.get('relation') == relation:
                neighbors.append((target, data.get('relation')))
        return neighbors
    
    def find_paths(self, source: str, target: str, max_hops: int = 3):
        """Find paths between two entities."""
        try:
            paths = list(nx.all_simple_paths(self.graph, source, target, cutoff=max_hops))
            return paths
        except nx.NetworkXError:
            return []
    
    def pagerank(self):
        """Calculate PageRank on the knowledge graph."""
        return nx.pagerank(self.graph)
    
    def visualize(self):
        """Print graph structure."""
        print("Knowledge Graph Structure:")
        print("=" * 50)
        for s, t, data in self.graph.edges(data=True):
            print(f"  {s} --[{data.get('relation')}]--> {t}")

# Create a medical knowledge graph
kg = KnowledgeGraph()

# Add medical triples
triples = [
    ('Diabetes', 'has_symptom', 'High_Blood_Sugar'),
    ('Diabetes', 'has_symptom', 'Frequent_Urination'),
    ('Diabetes', 'has_symptom', 'Increased_Thirst'),
    ('Diabetes', 'treated_by', 'Insulin'),
    ('Diabetes', 'type_of', 'Metabolic_Disorder'),
    ('Insulin', 'produced_by', 'Pancreas'),
    ('Pancreas', 'located_in', 'Abdomen'),
    ('High_Blood_Sugar', 'can_lead_to', 'Heart_Disease'),
    ('Heart_Disease', 'has_symptom', 'Chest_Pain'),
    ('Heart_Disease', 'risk_factor', 'Diabetes'),
    ('Metformin', 'treats', 'Diabetes'),
    ('Metformin', 'class', 'Biguanide'),
]

kg.add_triples(triples)
kg.visualize()

# Query examples
print("\n=== Query Results ===")
print("\nSymptoms of Diabetes:")
for s, p, o in kg.query(subject='Diabetes', predicate='has_symptom'):
    print(f"  {s} --{p}--> {o}")

print("\nRisk factors for Heart Disease:")
for s, p, o in kg.query(obj='Heart_Disease'):
    print(f"  {s} --{p}--> {o}")

print("\n=== Neighbors of Diabetes ===")
for neighbor, relation in kg.get_neighbors('Diabetes'):
    print(f"  --[{relation}]--> {neighbor}")

print("\n=== PageRank Scores ===")
pagerank = kg.pagerank()
for entity, score in sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {entity}: {score:.4f}")

Graph Neural Networks

Architecture

Architecture Diagram

Input Features          GCN Layer           Output
    h^(0)           h^(1) = σ(D̃^(-½)ÃD̃^(-½)h^(0)W^(0))
   
   [x1]              [z1]                  [y1]
   [x2]     --->     [z2]      --->       [y2]
   [x3]              [z3]                  [y3]
   [x4]              [z4]                  [y4]

Message Passing Neural Network

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """Graph Convolutional Network layer."""
    
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(in_features, out_features))
        self.bias = nn.Parameter(torch.FloatTensor(out_features))
        self.reset_parameters()
    
    def reset_parameters(self):
        nn.init.xavier_uniform_(self.weight)
        nn.init.zeros_(self.bias)
    
    def forward(self, x, adj):
        """
        Forward pass.
        
        Args:
            x: Node features [n_nodes, in_features]
            adj: Adjacency matrix [n_nodes, n_nodes]
        """
        # Normalize adjacency matrix
        degree = adj.sum(dim=1, keepdim=True)
        degree_inv_sqrt = torch.where(degree > 0, degree.pow(-0.5), torch.zeros_like(degree))
        adj_norm = degree_inv_sqrt * adj * degree_inv_sqrt.t()
        
        # GCN message passing
        support = torch.mm(x, self.weight)
        output = torch.mm(adj_norm, support) + self.bias
        return output

class GraphAttentionLayer(nn.Module):
    """Graph Attention Network layer."""
    
    def __init__(self, in_features, out_features, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = out_features // n_heads
        
        self.W = nn.Linear(in_features, out_features)
        self.a = nn.Parameter(torch.FloatTensor(n_heads, 2 * self.head_dim))
        nn.init.xavier_uniform_(self.a)
    
    def forward(self, x, adj):
        """Forward pass with attention."""
        n_nodes = x.size(0)
        h = self.W(x).view(n_nodes, self.n_heads, self.head_dim)
        
        # Attention scores
        attention = torch.zeros(n_nodes, n_nodes, self.n_heads)
        
        for i in range(n_nodes):
            for j in range(n_nodes):
                if adj[i, j] > 0:
                    score = torch.cat([h[i], h[j]], dim=-1)
                    attention[i, j] = torch.exp(F.leaky_relu((score * self.a).sum(dim=-1)))
        
        # Normalize attention
        attention = attention / (attention.sum(dim=1, keepdim=True) + 1e-10)
        
        # Apply attention
        out = torch.zeros(n_nodes, self.n_heads, self.head_dim)
        for i in range(n_nodes):
            for j in range(n_nodes):
                out[i] += attention[i, j] * h[j]
        
        return out.view(n_nodes, -1)

class SimpleGNN(nn.Module):
    """Simple Graph Neural Network for node classification."""
    
    def __init__(self, n_features, n_hidden, n_classes, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        
        self.layers.append(GCNLayer(n_features, n_hidden))
        for _ in range(n_layers - 2):
            self.layers.append(GCNLayer(n_hidden, n_hidden))
        self.layers.append(GCNLayer(n_hidden, n_classes))
    
    def forward(self, x, adj):
        """Forward pass through all layers."""
        for i, layer in enumerate(self.layers[:-1]):
            x = F.relu(layer(x, adj))
            x = F.dropout(x, p=0.5, training=self.training)
        x = self.layers[-1](x, adj)
        return F.log_softmax(x, dim=1)

# Example: Cora dataset simulation
def create_cora_like_data():
    """Create a small graph for demonstration."""
    n_nodes = 100
    n_features = 32
    n_classes = 5
    
    # Random features
    features = torch.randn(n_nodes, n_features)
    
    # Create adjacency (random graph)
    adj = torch.zeros(n_nodes, n_nodes)
    for i in range(n_nodes):
        n_edges = torch.randint(3, 8, (1,)).item()
        neighbors = torch.randperm(n_nodes)[:n_edges]
        adj[i, neighbors] = 1
        adj[neighbors, i] = 1
    
    # Random labels
    labels = torch.randint(0, n_classes, (n_nodes,))
    
    return features, adj, labels

# Initialize and use model
features, adj, labels = create_cora_like_data()
model = SimpleGNN(n_features=32, n_hidden=64, n_classes=5)

# Forward pass
output = model(features, adj)
print(f"Input shape: {features.shape}")
print(f"Output shape: {output.shape}")
print(f"Predictions (first 5): {output[:5].argmax(dim=1)}")

Real-World Applications

Social Network Analysis

import networkx as nx
import numpy as np

# Create social network
G = nx.watts_strogatz_graph(n=100, k=6, p=0.3, seed=42)

# Find influential nodes
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

# Top 5 influencers
top_pr = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
top_bw = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:5]

print("Top 5 by PageRank:")
for node, score in top_pr:
    print(f"  Node {node}: {score:.4f}")

print("\nTop 5 by Betweenness:")
for node, score in top_bw:
    print(f"  Node {node}: {score:.4f}")

Fraud Detection

# Transaction graph for fraud detection
class FraudDetector:
    """Graph-based fraud detection."""
    
    def __init__(self):
        self.G = nx.DiGraph()
    
    def add_transaction(self, sender, receiver, amount, timestamp):
        """Add transaction to graph."""
        self.G.add_edge(sender, receiver, 
                       amount=amount, 
                       timestamp=timestamp)
    
    def detect_suspicious_patterns(self):
        """Detect common fraud patterns."""
        suspicious = []
        
        # Pattern 1: Circular transactions (money laundering)
        for cycle in nx.simple_cycles(self.G):
            if len(cycle) <= 4:
                suspicious.append(('circular', cycle))
        
        # Pattern 2: Rapid sequential transactions
        for node in self.G.nodes():
            out_edges = sorted(self.G.out_edges(node, data=True), 
                             key=lambda x: x[2]['timestamp'])
            for i in range(len(out_edges) - 2):
                t1 = out_edges[i][2]['timestamp']
                t3 = out_edges[i+2][2]['timestamp']
                if t3 - t1 < 3600:  # Within 1 hour
                    suspicious.append(('rapid_sequential', node))
        
        # Pattern 3: High clustering (collusion)
        clustering = nx.clustering(self.G.to_undirected())
        high_clustering = [n for n, c in clustering.items() if c > 0.8]
        if high_clustering:
            suspicious.append(('high_clustering', high_clustering))
        
        return suspicious

# Example
detector = FraudDetector()

# Add transactions
transactions = [
    ('A', 'B', 1000, 1000),
    ('B', 'C', 900, 1100),
    ('C', 'A', 850, 1200),  # Circular!
    ('D', 'E', 500, 1300),
    ('E', 'F', 450, 1350),  # Rapid
    ('F', 'D', 400, 1400),  # Circular + rapid!
]

for s, r, a, t in transactions:
    detector.add_transaction(s, r, a, t)

patterns = detector.detect_suspicious_patterns()
print("Suspicious Patterns Detected:")
for pattern_type, details in patterns:
    print(f"  {pattern_type}: {details}")

Key Takeaways

📋Summary: Graph Data Science

Graphs model relationships: Entities as nodes, connections as edges — capturing relational structure that tabular data cannot
Centrality measures identify important nodes (degree, betweenness, eigenvector, PageRank) — each capturing a different notion of "importance"
Community detection reveals group structure in networks using modularity optimization or label propagation
Knowledge graphs store structured facts as (subject, predicate, object) triples — enabling reasoning and question answering
GNNs enable deep learning on graph-structured data through message passing and neighborhood aggregation
Graph algorithms solve problems like shortest path, spanning trees, and flow — foundational for network optimization
Applications span social networks, recommendations, fraud detection, and biology — any domain where relationships matter

Practice Exercises

Exercise 1: Network Analysis

Create a social network and calculate:

Degree, betweenness, and eigenvector centrality
Clustering coefficient for each node
Communities using Louvain algorithm

Exercise 2: Knowledge Graph

Build a knowledge graph for a domain of your choice (movies, books, etc.) and implement:

Triple storage and querying
Path finding between entities
PageRank for entity importance

Exercise 3: GNN Implementation

Implement a GCN from scratch for node classification on a small graph.

Exercise 4: Fraud Detection

Design a graph-based fraud detection system that identifies:

Circular transaction patterns
Rapid sequential transfers
Collusion clusters

Discussion Questions

When would you use a graph database over a relational database?
How do you handle dynamic graphs that change over time?
What are the scalability challenges of graph algorithms?

Graph Data Science

Graph Data Science

Overview

Graph Theory Basics

Definitions

DfGraph

Types of Graphs

Graph Representations

Adjacency Matrix

Network Metrics

Degree Centrality

Degree Centrality

Betweenness Centrality

Betweenness Centrality

Closeness Centrality

Closeness Centrality

Eigenvector Centrality

PageRank

DfPageRank

Clustering Coefficient

Clustering Coefficient

Implementation

Graph Algorithms

Breadth-First Search (BFS)

Dijkstra's Shortest Path

Dijkstra's Algorithm

Minimum Spanning Tree (Kruskal's/Prim's)

Community Detection Algorithms

Graph Neural Networks (GNN)

GNN Message Passing

Knowledge Graphs

Structure

DfKnowledge Triple

Knowledge Graph Embeddings

TransE Embedding

RotatE Embedding

Implementation

Graph Neural Networks

Architecture

Message Passing Neural Network

Real-World Applications

Social Network Analysis

Fraud Detection

Key Takeaways

📋Summary: Graph Data Science

Practice Exercises

Exercise 1: Network Analysis

Exercise 2: Knowledge Graph

Exercise 3: GNN Implementation

Exercise 4: Fraud Detection

Discussion Questions

Need Expert Data Science Help?