Graph Data Science

Module 4: Specialization + CareerFree Lesson

Advertisement

Graph Data Science

Overview

Graph data science applies machine learning and analytics to graph-structured data — networks of entities (nodes) and their relationships (edges). Applications include social networks, recommendation systems, fraud detection, knowledge graphs, and biological networks. Graph representations capture relational structure that tabular methods miss, enabling insights into influence, community, and connectivity.


Graph Theory Basics

Definitions

DfGraph

A graph G=(V,E)G = (V, E) consists of:

  • VV = set of nodes (vertices)
  • EV×VE \subseteq V \times V = set of edges

Types of Graphs

Architecture Diagram
Undirected Graph              Directed Graph
A --- B                       A --> B
|   / |                       |   / |
|  /  |                       v  /  v
| /   |                       | /   |
C --- D                       C --> D

Weighted Graph                Bipartite Graph
A --5-- B                    Users    Items
|     / |                    |   |   |   |
3   2   4                   U1--I1  U2--I2
|  /    |                    |   |   |   |
C --1-- D                   U2--I1  U1--I2

Graph Representations

Adjacency Matrix:

Adjacency Matrix

Aij={1if (i,j)E0otherwiseA_{ij} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}

Here,

  • AijA_{ij}=Entry in adjacency matrix
  • EE=Set of edges

Adjacency List:

Architecture Diagram
Node A: [B, C]
Node B: [A, C, D]
Node C: [A, B, D]
Node D: [B, C]

Edge List:

Architecture Diagram
[(A,B), (A,C), (B,C), (B,D), (C,D)]

Network Metrics

Degree Centrality

Number of connections a node has:

Degree Centrality

CD(v)=deg(v)n1C_D(v) = \frac{\text{deg}(v)}{n-1}

Here,

  • CD(v)C_D(v)=Degree centrality of node v
  • deg(v)deg(v)=Degree of node v
  • nn=Total number of nodes
Architecture Diagram
Node A: degree 2, centrality = 2/4 = 0.5
Node B: degree 3, centrality = 3/4 = 0.75  <-- Most connected
Node C: degree 3, centrality = 3/4 = 0.75
Node D: degree 2, centrality = 2/4 = 0.5

Betweenness Centrality

Measures how often a node lies on shortest paths:

Betweenness Centrality

CB(v)=svtσst(v)σstC_B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}

Here,

  • CB(v)C_B(v)=Betweenness centrality of node v
  • σst\sigma_{st}=Total shortest paths from s to t
  • σst(v)\sigma_{st}(v)=Shortest paths passing through v
Architecture Diagram
    B
   / \
  A   C  <-- B has high betweenness
       \      (bridge between A and D)
        D

Closeness Centrality

Average shortest path distance to all other nodes:

Closeness Centrality

CC(v)=n1uvd(v,u)C_C(v) = \frac{n-1}{\sum_{u \neq v} d(v, u)}

Here,

  • CC(v)C_C(v)=Closeness centrality of node v
  • d(v,u)d(v, u)=Shortest path distance between v and u

Eigenvector Centrality

Node importance based on neighbor importance:

In matrix form: Ax=λxAx = \lambda x

Node with high eigenvector centrality is connected to other well-connected nodes.

PageRank

DfPageRank

PageRank measures node importance by counting links to a node, where links from important nodes count more. It models a random walker who follows links with probability dd and teleports to a random node with probability 1d1-d.

Architecture Diagram
PageRank Algorithm:
1. Initialize all nodes to 1/n
2. Iterate:
   - Each node distributes its rank equally to neighbors
   - Each node collects sum of incoming ranks
   - Apply damping factor
3. Converges to stable distribution

ℹ️ PageRank Convergence

PageRank is guaranteed to converge to a unique stationary distribution as long as the graph is strongly connected and aperiodic. The damping factor d=0.85d = 0.85 ensures convergence even for graphs with dead ends.

Clustering Coefficient

Clustering Coefficient

Ci=2{ejk:vj,vkNi,ejkE}ki(ki1)C_i = \frac{2 \cdot |\{e_{jk} : v_j, v_k \in N_i, e_{jk} \in E\}|}{k_i(k_i - 1)}

Here,

  • CiC_i=Clustering coefficient of node i
  • kik_i=Degree of node i

Measures how connected a node's neighbors are to each other.

Architecture Diagram
Clustering = 1.0:          Clustering = 0.0:
    A                          A
   / \                        / \
  B---C                      B   C
Triangle                    Star (no B-C edge)

Implementation

import networkx as nx
import numpy as np
import pandas as pd
from collections import defaultdict

class GraphAnalyzer:
    """Comprehensive graph analysis toolkit."""
    
    def __init__(self, graph):
        self.G = graph
    
    def degree_centrality(self):
        """Calculate degree centrality for all nodes."""
        n = len(self.G.nodes()) - 1
        return {v: self.G.degree(v) / n for v in self.G.nodes()}
    
    def betweenness_centrality(self):
        """Calculate betweenness centrality."""
        return nx.betweenness_centrality(self.G)
    
    def closeness_centrality(self):
        """Calculate closeness centrality."""
        return nx.closeness_centrality(self.G)
    
    def eigenvector_centrality(self, max_iter=100, tol=1e-6):
        """Calculate eigenvector centrality."""
        return nx.eigenvector_centrality(self.G, max_iter=max_iter, tol=tol)
    
    def pagerank(self, d=0.85, max_iter=100):
        """Calculate PageRank scores."""
        return nx.pagerank(self.G, alpha=d, max_iter=max_iter)
    
    def clustering_coefficient(self):
        """Calculate local clustering coefficient."""
        return nx.clustering(self.G)
    
    def shortest_paths(self, source=None):
        """Calculate shortest paths."""
        if source:
            return nx.single_source_shortest_path_length(self.G, source)
        return dict(nx.all_pairs_shortest_path_length(self.G))
    
    def community_detection(self):
        """Detect communities using Louvain method."""
        from community import community_louvain
        return community_louvain.best_partition(self.G)
    
    def summary(self):
        """Generate graph summary statistics."""
        return {
            'num_nodes': self.G.number_of_nodes(),
            'num_edges': self.G.number_of_edges(),
            'density': nx.density(self.G),
            'avg_degree': sum(dict(self.G.degree()).values()) / self.G.number_of_nodes(),
            'is_connected': nx.is_connected(self.G),
            'num_components': nx.number_connected_components(self.G),
            'avg_clustering': nx.average_clustering(self.G),
            'transitivity': nx.transitivity(self.G)
        }

# Create example social network
G = nx.Graph()

# Add edges ( friendships)
edges = [
    ('Alice', 'Bob'), ('Alice', 'Charlie'), ('Alice', 'Diana'),
    ('Bob', 'Charlie'), ('Bob', 'Eve'), ('Charlie', 'Diana'),
    ('Charlie', 'Eve'), ('Diana', 'Frank'), ('Eve', 'Frank'),
    ('Eve', 'Grace'), ('Frank', 'Grace'), ('Frank', 'Heidi'),
    ('Grace', 'Heidi'), ('Heidi', 'Ivan')
]

G.add_edges_from(edges)

# Analyze
analyzer = GraphAnalyzer(G)

print("=== Graph Summary ===")
summary = analyzer.summary()
for key, value in summary.items():
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

print("\n=== Centrality Measures ===")
degree = analyzer.degree_centrality()
betweenness = analyzer.betweenness_centrality()
pagerank = analyzer.pagerank()

nodes = sorted(degree.keys())
print(f"{'Node':<10} {'Degree':>8} {'Betweenness':>12} {'PageRank':>10}")
print("-" * 42)
for node in nodes:
    print(f"{node:<10} {degree[node]:>8.4f} {betweenness[node]:>12.4f} {pagerank[node]:>10.4f}")

print("\n=== Community Detection ===")
communities = analyzer.community_detection()
for comm_id in set(communities.values()):
    members = [n for n, c in communities.items() if c == comm_id]
    print(f"Community {comm_id}: {members}")

Graph Algorithms

Breadth-First Search (BFS)

Explores nodes level by level:

Architecture Diagram
Level 0:     A
            / \
Level 1:   B   C
          /|   |
Level 2: D E  F

BFS Order: A -> B -> C -> D -> E -> F

Dijkstra's Shortest Path

Dijkstra's Algorithm

d(v)=minuN(v){d(u)+w(u,v)}d(v) = \min_{u \in N(v)} \{d(u) + w(u, v)\}

Here,

  • d(v)d(v)=Shortest distance to node v
  • w(u,v)w(u, v)=Weight of edge (u, v)

Minimum Spanning Tree (Kruskal's/Prim's)

Find subset of edges connecting all nodes with minimum total weight.

Community Detection Algorithms

Louvain Modularity:

Label Propagation:

Each node adopts the most common label among its neighbors.

Graph Neural Networks (GNN)

Message Passing Framework:

GNN Message Passing

hv(l+1)=UPDATE(hv(l),AGGREGATE({hu(l):uN(v)}))h_v^{(l+1)} = \text{UPDATE}\left(h_v^{(l)}, \text{AGGREGATE}\left(\{h_u^{(l)} : u \in N(v)\}\right)\right)

Here,

  • hv(l)h_v^{(l)}=Node v's hidden state at layer l
  • N(v)N(v)=Neighbors of node v

GCN Layer:

where A~=A+I\tilde{A} = A + I and D~ii=jA~ij\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}.


Knowledge Graphs

Structure

Knowledge graphs store facts as triples:

DfKnowledge Triple

A knowledge triple is a (subject, predicate, object) representation of a fact.

(subject,predicate,object)(\text{subject}, \text{predicate}, \text{object})

Example:

Architecture Diagram
(Albert_Einstein, born_in, Germany)
(Albert_Einstein, developed, Theory_of_Relativity)
(Theory_of_Relativity, is_a, Physics_Theory)
(Germany, located_in, Europe)

Knowledge Graph Embeddings

TransE:

TransE Embedding

d(h+r,t)0for true triplesd(h + r, t) \approx 0 \quad \text{for true triples}

Here,

  • hh=Head entity embedding
  • rr=Relation embedding
  • tt=Tail entity embedding

TransE Loss:

L=(h,r,t)S(h,r,t)Smax(0,γ+d(h+r,t)d(h+r,t))\mathcal{L} = \sum_{(h,r,t) \in S} \sum_{(h',r,t') \in S'} \max(0, \gamma + d(h+r, t) - d(h'+r, t'))

RotatE:

RotatE Embedding

hrth \circ r \approx t

Here,

  • \circ=Hadamard product in complex space

where \circ is Hadamard product in complex space.

Implementation

import networkx as nx
import numpy as np
from typing import List, Tuple, Dict
from collections import defaultdict

class KnowledgeGraph:
    """Simple knowledge graph implementation."""
    
    def __init__(self):
        self.graph = nx.MultiDiGraph()
        self.entity_embeddings = {}
        self.relation_embeddings = {}
    
    def add_triple(self, subject: str, predicate: str, obj: str):
        """Add a knowledge triple."""
        self.graph.add_edge(subject, obj, relation=predicate)
    
    def add_triples(self, triples: List[Tuple[str, str, str]]):
        """Add multiple triples."""
        for s, p, o in triples:
            self.add_triple(s, p, o)
    
    def query(self, subject=None, predicate=None, obj=None):
        """Query the knowledge graph."""
        results = []
        for s, p, o in self.graph.edges(keys=True, data=True):
            if subject and s != subject:
                continue
            if predicate and o.get('relation') != predicate:
                continue
            if obj and o != obj:
                continue
            results.append((s, o.get('relation'), o))
        return results
    
    def get_neighbors(self, entity: str, relation: str = None):
        """Get neighbors of an entity."""
        neighbors = []
        for _, target, data in self.graph.out_edges(entity, data=True):
            if relation is None or data.get('relation') == relation:
                neighbors.append((target, data.get('relation')))
        return neighbors
    
    def find_paths(self, source: str, target: str, max_hops: int = 3):
        """Find paths between two entities."""
        try:
            paths = list(nx.all_simple_paths(self.graph, source, target, cutoff=max_hops))
            return paths
        except nx.NetworkXError:
            return []
    
    def pagerank(self):
        """Calculate PageRank on the knowledge graph."""
        return nx.pagerank(self.graph)
    
    def visualize(self):
        """Print graph structure."""
        print("Knowledge Graph Structure:")
        print("=" * 50)
        for s, t, data in self.graph.edges(data=True):
            print(f"  {s} --[{data.get('relation')}]--> {t}")

# Create a medical knowledge graph
kg = KnowledgeGraph()

# Add medical triples
triples = [
    ('Diabetes', 'has_symptom', 'High_Blood_Sugar'),
    ('Diabetes', 'has_symptom', 'Frequent_Urination'),
    ('Diabetes', 'has_symptom', 'Increased_Thirst'),
    ('Diabetes', 'treated_by', 'Insulin'),
    ('Diabetes', 'type_of', 'Metabolic_Disorder'),
    ('Insulin', 'produced_by', 'Pancreas'),
    ('Pancreas', 'located_in', 'Abdomen'),
    ('High_Blood_Sugar', 'can_lead_to', 'Heart_Disease'),
    ('Heart_Disease', 'has_symptom', 'Chest_Pain'),
    ('Heart_Disease', 'risk_factor', 'Diabetes'),
    ('Metformin', 'treats', 'Diabetes'),
    ('Metformin', 'class', 'Biguanide'),
]

kg.add_triples(triples)
kg.visualize()

# Query examples
print("\n=== Query Results ===")
print("\nSymptoms of Diabetes:")
for s, p, o in kg.query(subject='Diabetes', predicate='has_symptom'):
    print(f"  {s} --{p}--> {o}")

print("\nRisk factors for Heart Disease:")
for s, p, o in kg.query(obj='Heart_Disease'):
    print(f"  {s} --{p}--> {o}")

print("\n=== Neighbors of Diabetes ===")
for neighbor, relation in kg.get_neighbors('Diabetes'):
    print(f"  --[{relation}]--> {neighbor}")

print("\n=== PageRank Scores ===")
pagerank = kg.pagerank()
for entity, score in sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(f"  {entity}: {score:.4f}")

Graph Neural Networks

Architecture

Architecture Diagram
Input Features          GCN Layer           Output
    h^(0)           h^(1) = σ(D̃^(-½)ÃD̃^(-½)h^(0)W^(0))
   
   [x1]              [z1]                  [y1]
   [x2]     --->     [z2]      --->       [y2]
   [x3]              [z3]                  [y3]
   [x4]              [z4]                  [y4]

Message Passing Neural Network

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """Graph Convolutional Network layer."""
    
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(in_features, out_features))
        self.bias = nn.Parameter(torch.FloatTensor(out_features))
        self.reset_parameters()
    
    def reset_parameters(self):
        nn.init.xavier_uniform_(self.weight)
        nn.init.zeros_(self.bias)
    
    def forward(self, x, adj):
        """
        Forward pass.
        
        Args:
            x: Node features [n_nodes, in_features]
            adj: Adjacency matrix [n_nodes, n_nodes]
        """
        # Normalize adjacency matrix
        degree = adj.sum(dim=1, keepdim=True)
        degree_inv_sqrt = torch.where(degree > 0, degree.pow(-0.5), torch.zeros_like(degree))
        adj_norm = degree_inv_sqrt * adj * degree_inv_sqrt.t()
        
        # GCN message passing
        support = torch.mm(x, self.weight)
        output = torch.mm(adj_norm, support) + self.bias
        return output

class GraphAttentionLayer(nn.Module):
    """Graph Attention Network layer."""
    
    def __init__(self, in_features, out_features, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = out_features // n_heads
        
        self.W = nn.Linear(in_features, out_features)
        self.a = nn.Parameter(torch.FloatTensor(n_heads, 2 * self.head_dim))
        nn.init.xavier_uniform_(self.a)
    
    def forward(self, x, adj):
        """Forward pass with attention."""
        n_nodes = x.size(0)
        h = self.W(x).view(n_nodes, self.n_heads, self.head_dim)
        
        # Attention scores
        attention = torch.zeros(n_nodes, n_nodes, self.n_heads)
        
        for i in range(n_nodes):
            for j in range(n_nodes):
                if adj[i, j] > 0:
                    score = torch.cat([h[i], h[j]], dim=-1)
                    attention[i, j] = torch.exp(F.leaky_relu((score * self.a).sum(dim=-1)))
        
        # Normalize attention
        attention = attention / (attention.sum(dim=1, keepdim=True) + 1e-10)
        
        # Apply attention
        out = torch.zeros(n_nodes, self.n_heads, self.head_dim)
        for i in range(n_nodes):
            for j in range(n_nodes):
                out[i] += attention[i, j] * h[j]
        
        return out.view(n_nodes, -1)

class SimpleGNN(nn.Module):
    """Simple Graph Neural Network for node classification."""
    
    def __init__(self, n_features, n_hidden, n_classes, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList()
        
        self.layers.append(GCNLayer(n_features, n_hidden))
        for _ in range(n_layers - 2):
            self.layers.append(GCNLayer(n_hidden, n_hidden))
        self.layers.append(GCNLayer(n_hidden, n_classes))
    
    def forward(self, x, adj):
        """Forward pass through all layers."""
        for i, layer in enumerate(self.layers[:-1]):
            x = F.relu(layer(x, adj))
            x = F.dropout(x, p=0.5, training=self.training)
        x = self.layers[-1](x, adj)
        return F.log_softmax(x, dim=1)

# Example: Cora dataset simulation
def create_cora_like_data():
    """Create a small graph for demonstration."""
    n_nodes = 100
    n_features = 32
    n_classes = 5
    
    # Random features
    features = torch.randn(n_nodes, n_features)
    
    # Create adjacency (random graph)
    adj = torch.zeros(n_nodes, n_nodes)
    for i in range(n_nodes):
        n_edges = torch.randint(3, 8, (1,)).item()
        neighbors = torch.randperm(n_nodes)[:n_edges]
        adj[i, neighbors] = 1
        adj[neighbors, i] = 1
    
    # Random labels
    labels = torch.randint(0, n_classes, (n_nodes,))
    
    return features, adj, labels

# Initialize and use model
features, adj, labels = create_cora_like_data()
model = SimpleGNN(n_features=32, n_hidden=64, n_classes=5)

# Forward pass
output = model(features, adj)
print(f"Input shape: {features.shape}")
print(f"Output shape: {output.shape}")
print(f"Predictions (first 5): {output[:5].argmax(dim=1)}")

Real-World Applications

Social Network Analysis

import networkx as nx
import numpy as np

# Create social network
G = nx.watts_strogatz_graph(n=100, k=6, p=0.3, seed=42)

# Find influential nodes
pagerank = nx.pagerank(G)
betweenness = nx.betweenness_centrality(G)

# Top 5 influencers
top_pr = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
top_bw = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:5]

print("Top 5 by PageRank:")
for node, score in top_pr:
    print(f"  Node {node}: {score:.4f}")

print("\nTop 5 by Betweenness:")
for node, score in top_bw:
    print(f"  Node {node}: {score:.4f}")

Fraud Detection

# Transaction graph for fraud detection
class FraudDetector:
    """Graph-based fraud detection."""
    
    def __init__(self):
        self.G = nx.DiGraph()
    
    def add_transaction(self, sender, receiver, amount, timestamp):
        """Add transaction to graph."""
        self.G.add_edge(sender, receiver, 
                       amount=amount, 
                       timestamp=timestamp)
    
    def detect_suspicious_patterns(self):
        """Detect common fraud patterns."""
        suspicious = []
        
        # Pattern 1: Circular transactions (money laundering)
        for cycle in nx.simple_cycles(self.G):
            if len(cycle) <= 4:
                suspicious.append(('circular', cycle))
        
        # Pattern 2: Rapid sequential transactions
        for node in self.G.nodes():
            out_edges = sorted(self.G.out_edges(node, data=True), 
                             key=lambda x: x[2]['timestamp'])
            for i in range(len(out_edges) - 2):
                t1 = out_edges[i][2]['timestamp']
                t3 = out_edges[i+2][2]['timestamp']
                if t3 - t1 < 3600:  # Within 1 hour
                    suspicious.append(('rapid_sequential', node))
        
        # Pattern 3: High clustering (collusion)
        clustering = nx.clustering(self.G.to_undirected())
        high_clustering = [n for n, c in clustering.items() if c > 0.8]
        if high_clustering:
            suspicious.append(('high_clustering', high_clustering))
        
        return suspicious

# Example
detector = FraudDetector()

# Add transactions
transactions = [
    ('A', 'B', 1000, 1000),
    ('B', 'C', 900, 1100),
    ('C', 'A', 850, 1200),  # Circular!
    ('D', 'E', 500, 1300),
    ('E', 'F', 450, 1350),  # Rapid
    ('F', 'D', 400, 1400),  # Circular + rapid!
]

for s, r, a, t in transactions:
    detector.add_transaction(s, r, a, t)

patterns = detector.detect_suspicious_patterns()
print("Suspicious Patterns Detected:")
for pattern_type, details in patterns:
    print(f"  {pattern_type}: {details}")

Key Takeaways

📋Summary: Graph Data Science

  1. Graphs model relationships: Entities as nodes, connections as edges — capturing relational structure that tabular data cannot
  2. Centrality measures identify important nodes (degree, betweenness, eigenvector, PageRank) — each capturing a different notion of "importance"
  3. Community detection reveals group structure in networks using modularity optimization or label propagation
  4. Knowledge graphs store structured facts as (subject, predicate, object) triples — enabling reasoning and question answering
  5. GNNs enable deep learning on graph-structured data through message passing and neighborhood aggregation
  6. Graph algorithms solve problems like shortest path, spanning trees, and flow — foundational for network optimization
  7. Applications span social networks, recommendations, fraud detection, and biology — any domain where relationships matter

Practice Exercises

Exercise 1: Network Analysis

Create a social network and calculate:

  • Degree, betweenness, and eigenvector centrality
  • Clustering coefficient for each node
  • Communities using Louvain algorithm

Exercise 2: Knowledge Graph

Build a knowledge graph for a domain of your choice (movies, books, etc.) and implement:

  • Triple storage and querying
  • Path finding between entities
  • PageRank for entity importance

Exercise 3: GNN Implementation

Implement a GCN from scratch for node classification on a small graph.

Exercise 4: Fraud Detection

Design a graph-based fraud detection system that identifies:

  • Circular transaction patterns
  • Rapid sequential transfers
  • Collusion clusters

Discussion Questions

  1. When would you use a graph database over a relational database?
  2. How do you handle dynamic graphs that change over time?
  3. What are the scalability challenges of graph algorithms?

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement