Sparse Matrices | ChatWhole Learn

Why It Matters

ℹ️ Why It Matters

Most real-world matrices are overwhelmingly sparse — containing 90% or more zeros. Consider a movie recommendation system with 1 million users and 10,000 movies: the user-item matrix has 10 billion entries, but each user rates fewer than 100 movies, making 99.99% of entries zero. Storing this densely wastes 99.99% of memory. Sparse representations store only non-zero values, reducing storage from 80 GB to 8 MB — a 10,000x compression that makes computation feasible.

In machine learning, sparsity appears everywhere: NLP term-document matrices, graph adjacency matrices, attention masks in Transformers, and feature representations in high-dimensional data. Without sparse matrix techniques, training large language models, building recommendation engines, and analyzing social networks would be computationally impossible.

What is a Sparse Matrix

DfSparse Matrix

A matrix $A \in \mathbb{R}^{m \times n}$ is sparse if most of its elements are zero. Formally, define the sparsity as:

\text{sparsity}(A) = \frac{\text{number of zero elements}}{m \times n}

A matrix is considered sparse when $\text{sparsity}(A) \geq 0.9$ (90% zeros). The density is $1 - \text{sparsity}(A)$ .

The nnz (number of non-zeros) satisfies $\text{nnz} \ll m \times n$ .

📝Example: Dense vs Sparse

Dense matrix (4×4, 50% sparse):

Architecture Diagram

┌                     ┐
│  1.0   0.0   0.0  2.0 │
│  0.0   3.0   0.0  0.0 │
│  0.0   0.0   4.0  0.0 │
│  5.0   0.0   0.0  6.0 │
└                     ┘
Dense storage: 16 values × 8 bytes = 128 bytes

Sparse matrix (4×4, 75% sparse, 4 non-zeros):

Architecture Diagram

┌                     ┐
│  1.0   0.0   0.0  0.0 │
│  0.0   3.0   0.0  0.0 │
│  0.0   0.0   4.0  0.0 │
│  0.0   0.0   0.0  6.0 │
└                     ┘
Dense storage: 16 values × 8 bytes = 128 bytes
Sparse COO: 4 triples × 16 bytes = 64 bytes

Extreme sparsity (10000×10000, 99.9% sparse):

Dense: 800 MB
Sparse (COO): ~2.4 MB (375x savings)

Sparsity Pattern

The sparsity pattern describes which entries are non-zero. Visualizing sparsity patterns reveals structure that affects format choice and algorithm performance.

ℹ️ Visualizing Sparsity

Architecture Diagram

Dense (no zeros):     Sparse (structured):   Sparse (random):
● ● ● ● ● ● ● ●      ● ○ ○ ○ ○ ○ ○ ●        ● ○ ○ ● ○ ● ○ ○
● ● ● ● ● ● ● ●      ○ ● ○ ○ ○ ○ ● ○        ○ ○ ● ○ ○ ○ ● ○
● ● ● ● ● ● ● ●      ○ ○ ● ○ ○ ● ○ ○        ○ ● ○ ○ ● ○ ○ ●
● ● ● ● ● ● ● ●      ○ ○ ○ ● ● ○ ○ ○        ○ ○ ○ ● ○ ○ ● ○
● ● ● ● ● ● ● ●      ○ ○ ○ ● ● ○ ○ ○        ● ○ ● ○ ○ ● ○ ○
● ● ● ● ● ● ● ●      ○ ○ ● ○ ○ ● ○ ○        ○ ● ○ ○ ● ○ ○ ●
● ● ● ● ● ● ● ●      ○ ● ○ ○ ○ ○ ● ○        ○ ○ ● ○ ○ ○ ● ○
● ● ● ● ● ● ● ●      ● ○ ○ ○ ○ ○ ○ ●        ● ○ ○ ● ○ ● ○ ○

● = non-zero          ● = non-zero           ● = non-zero
○ = zero              ○ = zero               ○ = zero

Dense: All entries stored — no pattern to exploit
Structured sparse: Banded, diagonal, or block patterns — can use specialized algorithms
Random sparse: Non-zeros scattered unpredictably — general sparse formats needed

ThSparsity Impact on Storage

For an $n \times n$ matrix with $\text{nnz}$ non-zeros:

Storage	Dense	CSR/CSC	COO
Formula	$8n^2$ bytes	$8 \cdot \text{nnz} + 4(n+1+\text{nnz})$ bytes	$16 \cdot \text{nnz}$ bytes
Break-even	—	$\text{nnz} < \frac{4n}{5}$	$\text{nnz} < \frac{n^2}{2}$

When $\text{nnz} \ll n^2$ , sparse formats provide exponential savings.

Sparse Matrix Formats

CSR (Compressed Sparse Row)

DfCSR Format

CSR stores a sparse matrix using three arrays:

values: Non-zero values in row-major order
col_indices: Column index for each non-zero value
row_ptr: Index into values/col_indices where each row starts (length $m+1$ )

Storage: $\text{nnz} \times 8$ bytes (values) + $\text{nnz} \times 4$ bytes (col_indices) + $(m+1) \times 4$ bytes (row_ptr)

Architecture Diagram

Matrix:                   CSR Representation:
┌                 ┐       values    = [1, 2, 3, 4, 5, 6]
│ 1  0  0  2  0  │       col_idx   = [0, 3, 1, 2, 3, 4]
│ 0  3  0  0  0  │       row_ptr   = [0, 2, 3, 5, 6]
│ 0  0  4  5  0  │
│ 0  0  0  0  6  │       Row 0: values[0:2] = [1, 2]
└                 ┘       Row 1: values[2:3] = [3]
                          Row 2: values[3:5] = [4, 5]
                          Row 3: values[5:6] = [6]

Best for: Row slicing, matrix-vector multiplication, arithmetic operations Worst for: Column slicing, inserting new non-zeros

CSC (Compressed Sparse Column)

DfCSC Format

CSC is the column-compressed analog of CSR:

values: Non-zero values in column-major order
row_indices: Row index for each non-zero value
col_ptr: Index into values/row_indices where each column starts (length $n+1$ )

Architecture Diagram

Matrix:                   CSC Representation:
┌                 ┐       values    = [1, 3, 2, 4, 5, 6]
│ 1  0  0  2  0  │       row_idx   = [0, 1, 0, 2, 2, 3]
│ 0  3  0  0  0  │       col_ptr   = [0, 2, 3, 4, 6, 6]
│ 0  0  4  5  0  │
│ 0  0  0  0  6  │       Col 0: values[0:2] = [1, 3] (rows 0,1)
└                 ┘       Col 1: values[2:3] = [3] (row 1)
                          Col 2: values[3:4] = [4] (row 2)
                          Col 3: values[4:6] = [2, 5] (rows 0,2)
                          Col 4: values[6:6] = [] (no non-zeros)

Best for: Column slicing, solving $A^T x = b$ , feature-based operations Worst for: Row slicing, inserting new non-zeros

COO (Coordinate)

DfCOO Format

COO stores each non-zero as a triple:

row: Row indices of non-zeros
col: Column indices of non-zeros
values: Non-zero values

All three arrays have length $\text{nnz}$ . Order doesn't matter but sorting by row then column is conventional.

Architecture Diagram

Matrix:                   COO Representation:
┌                 ┐       row    = [0, 0, 1, 2, 2, 3]
│ 1  0  0  2  0  │       col    = [0, 3, 1, 2, 3, 4]
│ 0  3  0  0  0  │       values = [1, 2, 3, 4, 5, 6]
│ 0  0  4  5  0  │
│ 0  0  0  0  6  │       Each triple: (row[i], col[i], values[i])
└                 ┘       = (0,0,1), (0,3,2), (1,1,3), ...

Best for: Construction, incremental building, format conversion input Worst for: Arithmetic, slicing (convert to CSR/CSC first)

LIL (List of Lists)

DfLIL Format

LIL stores two lists of lists:

data: data[i] is a list of non-zero values in row $i$
rows: rows[i] is a list of column indices for row $i$

Allows efficient row-wise insertion and deletion. Each row can have variable-length lists.

Architecture Diagram

Matrix:                   LIL Representation:
┌                 ┐       data  = [[1, 2], [3], [4, 5], [6]]
│ 1  0  0  2  0  │       rows  = [[0, 3], [1], [2, 3], [4]]
│ 0  3  0  0  0  │
│ 0  0  4  5  0  │       Row 0: data[0]=[1,2], rows[0]=[0,3]
│ 0  0  0  0  6  │       Row 1: data[1]=[3],   rows[1]=[1]
└                 ┘       Row 2: data[2]=[4,5], rows[2]=[2,3]
                          Row 3: data[3]=[6],   rows[3]=[4]

Best for: Incremental construction, row-wise modifications Worst for: Arithmetic, column slicing (convert to CSR/CSC)

DOK (Dictionary of Keys)

DfDOK Format

DOK uses a Python dictionary mapping (row, col) tuples to values. Only non-zero entries are stored. Supports O(1) lookup and insertion for individual elements.

Architecture Diagram

Matrix:                   DOK Representation:
┌                 ┐       {(0,0): 1, (0,3): 2,
│ 1  0  0  2  0  │        (1,1): 3, (2,2): 4,
│ 0  3  0  0  0  │        (2,3): 5, (3,4): 6}
│ 0  0  4  5  0  │
│ 0  0  0  0  6  │       Lookup: dok[(0,3)] → 2
└                 ┘       Insert: dok[(4,4)] = 7

Best for: Element-wise access during construction, sparse matrix assembly Worst for: Arithmetic, slicing (convert to CSR/CSC)

Format Comparison

Format	Construction	Row Slice	Col Slice	MatVec	Arithmetic	Memory
CSR	O(nnz)	O(1)	O(m)	O(nnz)	O(nnz)	Low
CSC	O(nnz)	O(n)	O(1)	O(nnz)	O(nnz)	Low
COO	O(1)	O(nnz)	O(nnz)	O(nnz)	O(nnz)	Medium
LIL	O(1)	O(nnz)	O(nnz)	O(nnz)	O(nnz)	High
DOK	O(1)	O(nnz)	O(nnz)	O(nnz)	O(nnz)	High

Format Conversions

Converting between formats involves tradeoffs in speed, memory, and operations supported.

ℹ️ Conversion Rules

Any → CSR/CSC: Fast, O(nnz). This is the standard conversion path.
CSR ↔ CSC: Fast, O(nnz). Just transpose the arrays.
Dense → CSR: Use scipy.sparse.csr_matrix(dense_array).
CSR/CSC → Dense: Use .toarray() — only for small matrices.
Any → LIL/DOK: Use for construction, then convert to CSR/CSC for computation.

from scipy import sparse
import numpy as np

# Build incrementally with LIL, convert to CSR for computation
lil = sparse.lil_matrix((10000, 10000))
lil[0, 0] = 1
lil[1, 2] = 3
lil[5, 7] = -2
csr = lil.tocsr()  # Convert for efficient operations

# COO construction from data
row = np.array([0, 1, 2, 0])
col = np.array([0, 1, 2, 2])
val = np.array([1.0, 2.0, 3.0, 4.0])
coo = sparse.coo_matrix((val, (row, col)), shape=(3, 3))
csr = coo.tocsr()  # Convert for arithmetic

# CSR ↔ CSC conversion
csc = csr.tocsc()
csr_back = csc.tocsr()

Conversion cost hierarchy:

Architecture Diagram

LIL/DOK → CSR/CSC:  O(nnz × log(nnz))  [sorting]
COO → CSR/CSC:      O(nnz)              [linear scan]
CSR ↔ CSC:          O(nnz)              [transpose]
CSR/CSC → Dense:    O(m × n)            [always expensive]

Sparse Matrix Operations

Matrix-Vector Multiplication

ThSparse MatVec Complexity

For CSR matrix $A$ ( $m \times n$ , nnz non-zeros) and dense vector $x$ :

y = Ax \quad \text{has complexity } O(\text{nnz})

Each row $i$ computes: $y_i = \sum_{j \in \text{row } i} A_{ij} x_j$

This is optimal — you cannot multiply faster than the number of non-zeros.

from scipy import sparse
import numpy as np

A = sparse.random(100000, 100000, density=0.001, format='csr')
x = np.random.randn(100000)

y = A @ x           # O(nnz) matrix-vector multiply
print(f"nnz: {A.nnz}, time proportional to {A.nnz} operations")

Matrix Addition

A = sparse.random(1000, 1000, density=0.01, format='csr')
B = sparse.random(1000, 1000, density=0.01, format='csr')

C = A + B           # Element-wise addition, sparsity decreases
D = A - B           # Element-wise subtraction
E = 2.0 * A         # Scalar multiplication

Note: Sum of two sparse matrices has sparsity $\leq$ min(sparsity(A), sparsity(B)). Non-zeros may increase significantly.

Matrix Multiplication

ℹ️ Sparse × Sparse Multiplication

Matrix multiplication $C = AB$ is more complex. The result $C_{ij} = \sum_k A_{ik} B_{kj}$ requires non-zeros to "align" on shared indices. The output nnz is unpredictable and can range from 0 to $m \times n$ .

CSR × CSR is efficient when the inner dimension $k$ has limited non-zeros per column/row.

A = sparse.random(1000, 1000, density=0.01, format='csr')
B = sparse.random(1000, 1000, density=0.01, format='csr')

C = A @ B           # Sparse × Sparse
print(f"Input nnz: {A.nnz}, {B.nnz}, Output nnz: {C.nnz}")

Transpose

A = sparse.random(1000, 500, density=0.01, format='csr')
AT = A.T            # O(nnz) operation, swaps CSR ↔ CSC internally
AAT = A @ A.T       # Symmetric result

Element Access

# DOK: O(1) lookup
dok = sparse.dok_matrix((10000, 10000))
dok[123, 456] = 7.0
val = dok[123, 456]  # O(1)

# CSR: O(log(nnz_row)) via binary search within row
csr = dok.tocsr()
val = csr[123, 456]  # Slower than DOK for single lookups

# Dense slice: convert small block
block = csr[100:200, 100:200].toarray()

Sparse vs Dense

ThWhen to Use Sparse vs Dense

Criterion	Use Sparse	Use Dense
Sparsity	≥ 90% zeros	< 90% zeros
Matrix size	> 1000 × 1000	< 1000 × 1000
Operations	MatVec, slicing	Element-wise, broadcast
Hardware	CPU (irregular access)	GPU (regular access)
Libraries	scipy.sparse, PyTorch sparse	NumPy, cuBLAS

📝Performance Comparison

import time
from scipy import sparse
import numpy as np

n = 10000
A_dense = np.random.randn(n, n)
A_dense[A_dense < 0.9] = 0  # Make ~90% zeros
A_sparse = sparse.csr_matrix(A_dense)

x = np.random.randn(n)

# Dense: O(n²) time
t0 = time.time()
y_dense = A_dense @ x
t_dense = time.time() - t0

# Sparse: O(nnz) time
t0 = time.time()
y_sparse = A_sparse @ x
t_sparse = time.time() - t0

print(f"Dense: {t_dense:.4f}s, Sparse: {t_sparse:.4f}s")
print(f"Speedup: {t_dense/t_sparse:.1f}x")
# Typical result: 5-20x speedup for 90% sparse matrices

Key insight: GPUs prefer dense matrices for regular memory access patterns. Sparse matrices on GPUs (cuSPARSE) can be faster but require specialized kernels. For very large sparse matrices, CPU sparse operations often beat GPU dense operations due to memory constraints.

Common Sparsity Patterns

DfDiagonal Matrix

A diagonal matrix $D$ has $D_{ij} = 0$ for $i \neq j$ . Stored as a single vector of length $\min(m, n)$ . Operations are $O(n)$ instead of $O(n^2)$ .

D = sparse.diags([1, 2, 3, 4])  # 4x4 diagonal
D_inv = sparse.diags([1/x for x in [1, 2, 3, 4]])  # O(n) inversion

DfBanded Matrix

A banded matrix has non-zeros only within $k$ diagonals of the main diagonal. Bandwidth $k$ determines storage: $O(kn)$ instead of $O(n^2)$ .

# Tridiagonal matrix (bandwidth = 1)
main_diag = np.ones(100) * 2
off_diag = np.ones(99) * -1
A = sparse.diags([off_diag, main_diag, off_diag], [-1, 0, 1])
# Storage: 299 values vs 10000 dense

DfBlock Diagonal Matrix

A block diagonal matrix has non-zero sub-matrices only along the diagonal. Each block is independent.

blocks = [sparse.random(100, 100, density=0.1) for _ in range(10)]
A = sparse.block_diag(blocks)  # 1000x1000 matrix, 10 independent blocks

DfRandom Sparse Matrix

Non-zeros are placed randomly. The density parameter controls the fraction of non-zeros.

A = sparse.random(1000, 1000, density=0.01, format='csr')
# Expected nnz = 1000 * 1000 * 0.01 = 10000

Python Implementation

Construction Patterns

from scipy import sparse
import numpy as np

# Pattern 1: From dense array
dense = np.array([[1, 0, 2], [0, 0, 0], [3, 0, 4]])
csr = sparse.csr_matrix(dense)

# Pattern 2: COO from data
row, col, val = [0, 2, 0], [0, 2, 2], [1.0, 4.0, 2.0]
coo = sparse.coo_matrix((val, (row, col)), shape=(3, 3))

# Pattern 3: Diagonal
D = sparse.diags([1, 2, 3])

# Pattern 4: Random
R = sparse.random(100, 100, density=0.1, format='csr',
                  random_state=42)

# Pattern 5: Identity
I = sparse.eye(1000, format='csr')

# Pattern 6: Kronecker product
A = sparse.random(10, 10, density=0.3)
B = sparse.random(10, 10, density=0.3)
C = sparse.kron(A, B)  # 100x100 sparse matrix

Sparse Linear Algebra

from scipy.sparse import linalg

A = sparse.csr_matrix([[4, 1], [1, 3]])
b = np.array([1, 2])

# Direct solve (small sparse systems)
x = linalg.spsolve(A, b)

# Iterative solve (large sparse systems)
x_iter, info = linalg.bicgstab(A, b, maxiter=1000, tol=1e-10)

# Eigenvalues
vals, vecs = linalg.eigs(A, k=2, which='LM')

# Singular values
U, s, Vt = linalg.svds(A, k=2)

# Norm
norm_A = linalg.norm(A, ord='fro')

Practical Example: Sparse Gradient Descent

from scipy import sparse
import numpy as np

# Feature matrix: 1M samples, 100K features, 0.1% density
X = sparse.random(1000000, 100000, density=0.001, format='csr')
y = np.random.randn(1000000)
w = np.random.randn(100000)

# Gradient: X.T @ (X @ w - y)
residual = X @ w - y           # O(nnz) sparse matvec
gradient = X.T @ residual      # O(nnz) sparse transpose matvec
w -= 0.01 * gradient           # O(n) dense update

# Memory: X is ~1.2GB sparse, would be 80GB dense

Applications in AI/ML

Recommendation Systems

ℹ️ Collaborative Filtering

User-item interaction matrices are extremely sparse. Netflix has ~200M users × ~20K items, but each user rates <100 items (99.95% sparse). Matrix factorization decomposes $R \approx UV^T$ where $U \in \mathbb{R}^{m \times k}$ , $V \in \mathbb{R}^{n \times k}$ with $k \ll m, n$ . Sparse formats enable training on matrices with billions of entries.

from scipy.sparse.linalg import svds

# Sparse user-item matrix (1M users, 10K items)
R = sparse.random(1000000, 10000, density=0.001, format='csr')

# Truncated SVD for latent factors
U, sigma, Vt = svds(R, k=50)  # k=50 latent dimensions
# U: user embeddings, Vt: item embeddings

NLP: Term-Document Matrices

ℹ️ Bag of Words

Document-term matrices are sparse because most documents contain a small vocabulary subset. A 100K-document corpus with 50K vocabulary has a 5B-entry matrix, but each document uses ~200 terms (99.996% sparse). TF-IDF weighting preserves sparsity while improving feature quality.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat on the mat", "the dog ran in the park"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # Sparse matrix
print(f"Shape: {X.shape}, nnz: {X.nnz}")

Graph Neural Networks

ℹ️ Adjacency Matrices

Graph adjacency matrices are sparse by definition. A social network with 1B users has a 10^18-entry adjacency matrix, but each user has ~100 friends (99.99999% sparse). GNNs perform sparse message passing: $H^{(l+1)} = \sigma(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)})$ , where $A$ is the sparse adjacency matrix.

import torch
from torch_geometric.utils import dense_to_sparse

# Dense adjacency (small graph)
A_dense = torch.tensor([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=torch.float)
edge_index, edge_weight = dense_to_sparse(A_dense)
# edge_index: [[0,1,1,2], [1,0,2,1]] — sparse COO-like format

Transformer Sparse Attention

ℹ️ Sparse Attention Patterns

Standard Transformer attention is $O(n^2)$ for sequence length $n$ . Sparse attention restricts each token to attend to $O(\sqrt{n})$ or $O(\log n)$ others, reducing complexity to $O(n\sqrt{n})$ or $O(n\log n)$ . Patterns include local (window), global (anchor tokens), and random connections.

# Local attention pattern (window size = 256)
def local_attention_mask(seq_len, window=256):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        start = max(0, i - window // 2)
        end = min(seq_len, i + window // 2)
        mask[i, start:end] = True
    return mask  # ~99% sparse for long sequences

# Sparse attention in PyTorch
from torch.nn.functional import scaled_dot_product_attention
# Use is_causal or custom attn_mask for sparsity

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using `.toarray()` for large matrices	Defeats purpose of sparsity — allocates O(m×n) memory	Keep as sparse, use sparse operations
Converting CSR ↔ CSC repeatedly	Each conversion is O(nnz) and creates new arrays	Pick the right format once, stick with it
Adding dense to sparse	Result is dense, losing all benefits	Use sparse + sparse, or convert sparse to dense only when necessary
Using `A[i,j]` in loops on CSR	Each access is O(log(nnz_row)) — slow in loops	Use DOK for random access, vectorize operations
Ignoring fill-in in sparse solvers	Matrix multiplication can create new non-zeros	Monitor nnz growth, use iterative methods for large systems
Assuming GPU acceleration helps	Sparse operations on GPUs have high overhead for small nnz	Profile: GPU sparse beats CPU sparse only for very large matrices
Using float64 for integer indices	Wastes memory, slower operations	Use int32 for indices (saves 50% over int64)
Building large sparse matrices with append	List appends cause O(n) copies	Use COO lists or LIL, convert once at the end

Interview Questions

Q1: Why are sparse matrices important in machine learning?

Answer: Most real-world datasets produce sparse matrices: NLP term-document matrices (99.9% sparse), recommendation systems (99.95% sparse), graph adjacency matrices (99.999% sparse). Storing and computing with dense representations wastes memory and compute. Sparse formats store only non-zeros, reducing memory from O(mn) to O(nnz) and computation from O(mn) to O(nnz).

Q2: When would you choose CSR over CSC?

Answer: CSR for row-oriented operations: row slicing, matrix-vector multiplication ( $y = Ax$ ), row-wise normalization. CSC for column-oriented operations: column slicing, solving $A^T x = b$ , feature normalization. CSR and CSC are dual — CSR is optimal when operations access rows, CSC when accessing columns.

Q3: What is the time complexity of sparse matrix-vector multiplication?

Answer: O(nnz), where nnz is the number of non-zero entries. For CSR, each row $i$ computes $y_i = \sum_{j \in \text{row } i} A_{ij} x_j$ by iterating over non-zeros in that row. Total work is proportional to nnz, which is optimal since every non-zero must be accessed at least once.

Q4: How does sparse matrix multiplication differ from dense?

Answer: Dense multiplication $C = AB$ is $O(mnk)$ . Sparse multiplication is more complex: the output nnz is unpredictable. For CSR × CSR, the algorithm iterates over non-zeros of A and B, finding shared indices via merge or hash. Complexity is $O(\text{nnz}(A) \times \text{avg\_row\_nnz}(B))$ worst case. The output can be much denser than either input ("fill-in").

Q5: Describe the conversion path from construction to computation.

Answer: Build with COO (append triples in any order) or LIL (row-wise insertion), then convert to CSR or CSC for computation. Conversion is O(nnz). Never build directly in CSR/CSC — they don't support efficient insertion. The pipeline: COO/LIL → CSR/CSC → compute → COO/LIL for modifications.

Q6: What is fill-in and why does it matter?

Answer: Fill-in occurs when operations on sparse matrices produce denser results. Examples: $A + A^T$ doubles non-zeros; $A \times A$ can create $O(n^2)$ non-zeros; sparse LU decomposition creates non-zeros in zero positions. Fill-in reduces the effectiveness of sparse representations and increases computation time.

Q7: How would you handle a sparse matrix on a GPU?

Answer: Use cuSPARSE (NVIDIA) or equivalent. Key considerations: (1) Only beneficial for very large matrices (nnz > 1M), (2) CSR/CSC formats are supported, (3) Sparse operations on GPUs have higher per-operation overhead than CPU, so batch operations help, (4) Mixed sparse-dense operations (sparse × dense) benefit most from GPU acceleration.

Practice Problems

Problem 1: Memory Savings

📝Problem

A 50000×50000 adjacency matrix represents a social network where each user has exactly 200 friends on average. Compare storage requirements for dense (float64), COO, and CSR formats.

💡Solution

Dense: $50000 \times 50000 \times 8 = 20$ GB
nnz: $50000 \times 200 = 10^7$ (10 million non-zeros)
COO: $10^7 \times (8 + 4 + 4) = 160$ MB (value + row idx + col idx, int32 indices)
CSR: $10^7 \times (8 + 4) + 50001 \times 4 = 120$ MB (values + col_idx + row_ptr)
Savings: CSR is 167x smaller than dense

Problem 2: Sparsity After Operations

📝Problem

Matrix $A$ has sparsity 95% (5% non-zeros). Matrix $B$ also has sparsity 95%. What is the maximum possible sparsity of $C = A + B$ ? What is the minimum?

💡Solution

Maximum sparsity: 90% — if A and B have completely disjoint non-zero positions, the sum has $5\% + 5\% = 10\%$ non-zeros (90% sparse)
Minimum sparsity: 95% — if A and B have identical non-zero positions, the sum has at most 5% non-zeros (some may cancel to zero, increasing sparsity)
General rule: $\text{nnz}(A+B) \leq \text{nnz}(A) + \text{nnz}(B)$
In practice, sparsity typically decreases by 10-30% after addition

Problem 3: Format Selection

📝Problem

You need to: (1) build a sparse matrix by reading (row, col, value) triples from a file, (2) perform 1000 matrix-vector products, (3) occasionally access individual elements. Which format(s) should you use and in what order?

💡Solution

Build: Use COO — read triples directly, no sorting needed. O(1) per insertion.
Convert: .tocsr() — O(nnz) conversion
MatVec: Use CSR — O(nnz) per multiply, optimal for row-oriented operations
Element access: Keep CSR but accept O(log(nnz_row)) per access, or maintain a small DOK copy for frequent lookups
Overall: COO → CSR, with optional DOK for random access

Problem 4: Sparse Eigenvalues

📝Problem

You have a sparse matrix $A$ ( $10000 \times 10000$ , nnz = 50000) and need the 5 largest eigenvalues. Why is scipy.sparse.linalg.eigs(A, k=5) preferred over numpy.linalg.eig(A.toarray())?

💡Solution

Dense eigendecomposition: $O(n^3) = O(10^{12})$ operations, $O(n^2) = 800$ MB storage
Sparse eigs (Lanczos/Arnoldi): $O(k \times \text{nnz}) = O(250000)$ operations per iteration, $O(kn) = 400$ KB storage
Speedup: ~1000000x faster, ~2000x less memory
Key insight: Sparse eigensolvers only compute the eigenvalues you need, using matrix-vector products (O(nnz)) instead of full decomposition

Quick Reference

Aspect	Detail
Dense storage	$8mn$ bytes (float64)
CSR storage	$8\text{nnz} + 4(\text{nnz} + m + 1)$ bytes
CSC storage	$8\text{nnz} + 4(\text{nnz} + n + 1)$ bytes
COO storage	$16\text{nnz}$ bytes
MatVec complexity	$O(\text{nnz})$
Sparse × Sparse	$O(\text{nnz}(A) \times \text{avg\_nnz\_per\_row}(B))$
Sparsity definition	$\text{zeros} / (m \times n)$
Sparse threshold	≥ 90% zeros
Break-even nnz	$\text{nnz} < mn / 10$ for CSR vs dense
Best construction format	COO (random) or LIL (row-wise)
Best computation format	CSR (rows) or CSC (columns)
Best random access format	DOK
Python library	`scipy.sparse`
GPU support	cuSPARSE, PyTorch sparse

Cross-References

Matrix Decompositions — SVD, eigenvalue decomposition for sparse matrices
Linear Algebra Matrices — Dense operations as baseline for comparison
Linear Algebra Basics — Vectors, matrices, and basic operations
Linear Algebra Determinants — Rank, determinant, conditioning
Eigenvalues and Eigenvectors — Sparse eigenvalue computation
Calculus Optimization — Gradient methods using sparse Jacobians
Probability Distributions — Sparse covariance matrices
Discrete Graphs — Adjacency matrices and sparse graph operations