← Math|17 of 100
Linear Algebra

Sparse Matrices

Master sparse matrix representations, operations, and their applications in large-scale ML.

πŸ“‚ Computational EfficiencyπŸ“– Lesson 17 of 100πŸŽ“ Free Course

Advertisement

Why It Matters

ℹ️ Why It Matters

Most real-world matrices are overwhelmingly sparse β€” containing 90% or more zeros. Consider a movie recommendation system with 1 million users and 10,000 movies: the user-item matrix has 10 billion entries, but each user rates fewer than 100 movies, making 99.99% of entries zero. Storing this densely wastes 99.99% of memory. Sparse representations store only non-zero values, reducing storage from 80 GB to 8 MB β€” a 10,000x compression that makes computation feasible.

In machine learning, sparsity appears everywhere: NLP term-document matrices, graph adjacency matrices, attention masks in Transformers, and feature representations in high-dimensional data. Without sparse matrix techniques, training large language models, building recommendation engines, and analyzing social networks would be computationally impossible.


What is a Sparse Matrix

DfSparse Matrix

A matrix A∈RmΓ—nA \in \mathbb{R}^{m \times n} is sparse if most of its elements are zero. Formally, define the sparsity as:

sparsity(A)=numberΒ ofΒ zeroΒ elementsmΓ—n\text{sparsity}(A) = \frac{\text{number of zero elements}}{m \times n}

A matrix is considered sparse when sparsity(A)β‰₯0.9\text{sparsity}(A) \geq 0.9 (90% zeros). The density is 1βˆ’sparsity(A)1 - \text{sparsity}(A).

The nnz (number of non-zeros) satisfies nnzβ‰ͺmΓ—n\text{nnz} \ll m \times n.

πŸ“Example: Dense vs Sparse

Dense matrix (4Γ—4, 50% sparse):

Architecture Diagram
β”Œ                     ┐
β”‚  1.0   0.0   0.0  2.0 β”‚
β”‚  0.0   3.0   0.0  0.0 β”‚
β”‚  0.0   0.0   4.0  0.0 β”‚
β”‚  5.0   0.0   0.0  6.0 β”‚
β””                     β”˜
Dense storage: 16 values Γ— 8 bytes = 128 bytes

Sparse matrix (4Γ—4, 75% sparse, 4 non-zeros):

Architecture Diagram
β”Œ                     ┐
β”‚  1.0   0.0   0.0  0.0 β”‚
β”‚  0.0   3.0   0.0  0.0 β”‚
β”‚  0.0   0.0   4.0  0.0 β”‚
β”‚  0.0   0.0   0.0  6.0 β”‚
β””                     β”˜
Dense storage: 16 values Γ— 8 bytes = 128 bytes
Sparse COO: 4 triples Γ— 16 bytes = 64 bytes

Extreme sparsity (10000Γ—10000, 99.9% sparse):

  • Dense: 800 MB
  • Sparse (COO): ~2.4 MB (375x savings)

Sparsity Pattern

The sparsity pattern describes which entries are non-zero. Visualizing sparsity patterns reveals structure that affects format choice and algorithm performance.

ℹ️ Visualizing Sparsity

Architecture Diagram
Dense (no zeros):     Sparse (structured):   Sparse (random):
● ● ● ● ● ● ● ●      ● β—‹ β—‹ β—‹ β—‹ β—‹ β—‹ ●        ● β—‹ β—‹ ● β—‹ ● β—‹ β—‹
● ● ● ● ● ● ● ●      β—‹ ● β—‹ β—‹ β—‹ β—‹ ● β—‹        β—‹ β—‹ ● β—‹ β—‹ β—‹ ● β—‹
● ● ● ● ● ● ● ●      β—‹ β—‹ ● β—‹ β—‹ ● β—‹ β—‹        β—‹ ● β—‹ β—‹ ● β—‹ β—‹ ●
● ● ● ● ● ● ● ●      β—‹ β—‹ β—‹ ● ● β—‹ β—‹ β—‹        β—‹ β—‹ β—‹ ● β—‹ β—‹ ● β—‹
● ● ● ● ● ● ● ●      β—‹ β—‹ β—‹ ● ● β—‹ β—‹ β—‹        ● β—‹ ● β—‹ β—‹ ● β—‹ β—‹
● ● ● ● ● ● ● ●      β—‹ β—‹ ● β—‹ β—‹ ● β—‹ β—‹        β—‹ ● β—‹ β—‹ ● β—‹ β—‹ ●
● ● ● ● ● ● ● ●      β—‹ ● β—‹ β—‹ β—‹ β—‹ ● β—‹        β—‹ β—‹ ● β—‹ β—‹ β—‹ ● β—‹
● ● ● ● ● ● ● ●      ● β—‹ β—‹ β—‹ β—‹ β—‹ β—‹ ●        ● β—‹ β—‹ ● β—‹ ● β—‹ β—‹

● = non-zero          ● = non-zero           ● = non-zero
β—‹ = zero              β—‹ = zero               β—‹ = zero
  • Dense: All entries stored β€” no pattern to exploit
  • Structured sparse: Banded, diagonal, or block patterns β€” can use specialized algorithms
  • Random sparse: Non-zeros scattered unpredictably β€” general sparse formats needed

ThSparsity Impact on Storage

For an nΓ—nn \times n matrix with nnz\text{nnz} non-zeros:

StorageDenseCSR/CSCCOO
Formula8n28n^2 bytes8β‹…nnz+4(n+1+nnz)8 \cdot \text{nnz} + 4(n+1+\text{nnz}) bytes16β‹…nnz16 \cdot \text{nnz} bytes
Break-evenβ€”nnz<4n5\text{nnz} < \frac{4n}{5}nnz<n22\text{nnz} < \frac{n^2}{2}

When nnzβ‰ͺn2\text{nnz} \ll n^2, sparse formats provide exponential savings.


Sparse Matrix Formats

CSR (Compressed Sparse Row)

DfCSR Format

CSR stores a sparse matrix using three arrays:

  • values: Non-zero values in row-major order
  • col_indices: Column index for each non-zero value
  • row_ptr: Index into values/col_indices where each row starts (length m+1m+1)

Storage: nnzΓ—8\text{nnz} \times 8 bytes (values) + nnzΓ—4\text{nnz} \times 4 bytes (col_indices) + (m+1)Γ—4(m+1) \times 4 bytes (row_ptr)

Architecture Diagram
Matrix:                   CSR Representation:
β”Œ                 ┐       values    = [1, 2, 3, 4, 5, 6]
β”‚ 1  0  0  2  0  β”‚       col_idx   = [0, 3, 1, 2, 3, 4]
β”‚ 0  3  0  0  0  β”‚       row_ptr   = [0, 2, 3, 5, 6]
β”‚ 0  0  4  5  0  β”‚
β”‚ 0  0  0  0  6  β”‚       Row 0: values[0:2] = [1, 2]
β””                 β”˜       Row 1: values[2:3] = [3]
                          Row 2: values[3:5] = [4, 5]
                          Row 3: values[5:6] = [6]

Best for: Row slicing, matrix-vector multiplication, arithmetic operations Worst for: Column slicing, inserting new non-zeros

CSC (Compressed Sparse Column)

DfCSC Format

CSC is the column-compressed analog of CSR:

  • values: Non-zero values in column-major order
  • row_indices: Row index for each non-zero value
  • col_ptr: Index into values/row_indices where each column starts (length n+1n+1)
Architecture Diagram
Matrix:                   CSC Representation:
β”Œ                 ┐       values    = [1, 3, 2, 4, 5, 6]
β”‚ 1  0  0  2  0  β”‚       row_idx   = [0, 1, 0, 2, 2, 3]
β”‚ 0  3  0  0  0  β”‚       col_ptr   = [0, 2, 3, 4, 6, 6]
β”‚ 0  0  4  5  0  β”‚
β”‚ 0  0  0  0  6  β”‚       Col 0: values[0:2] = [1, 3] (rows 0,1)
β””                 β”˜       Col 1: values[2:3] = [3] (row 1)
                          Col 2: values[3:4] = [4] (row 2)
                          Col 3: values[4:6] = [2, 5] (rows 0,2)
                          Col 4: values[6:6] = [] (no non-zeros)

Best for: Column slicing, solving ATx=bA^T x = b, feature-based operations Worst for: Row slicing, inserting new non-zeros

COO (Coordinate)

DfCOO Format

COO stores each non-zero as a triple:

  • row: Row indices of non-zeros
  • col: Column indices of non-zeros
  • values: Non-zero values

All three arrays have length nnz\text{nnz}. Order doesn't matter but sorting by row then column is conventional.

Architecture Diagram
Matrix:                   COO Representation:
β”Œ                 ┐       row    = [0, 0, 1, 2, 2, 3]
β”‚ 1  0  0  2  0  β”‚       col    = [0, 3, 1, 2, 3, 4]
β”‚ 0  3  0  0  0  β”‚       values = [1, 2, 3, 4, 5, 6]
β”‚ 0  0  4  5  0  β”‚
β”‚ 0  0  0  0  6  β”‚       Each triple: (row[i], col[i], values[i])
β””                 β”˜       = (0,0,1), (0,3,2), (1,1,3), ...

Best for: Construction, incremental building, format conversion input Worst for: Arithmetic, slicing (convert to CSR/CSC first)

LIL (List of Lists)

DfLIL Format

LIL stores two lists of lists:

  • data: data[i] is a list of non-zero values in row ii
  • rows: rows[i] is a list of column indices for row ii

Allows efficient row-wise insertion and deletion. Each row can have variable-length lists.

Architecture Diagram
Matrix:                   LIL Representation:
β”Œ                 ┐       data  = [[1, 2], [3], [4, 5], [6]]
β”‚ 1  0  0  2  0  β”‚       rows  = [[0, 3], [1], [2, 3], [4]]
β”‚ 0  3  0  0  0  β”‚
β”‚ 0  0  4  5  0  β”‚       Row 0: data[0]=[1,2], rows[0]=[0,3]
β”‚ 0  0  0  0  6  β”‚       Row 1: data[1]=[3],   rows[1]=[1]
β””                 β”˜       Row 2: data[2]=[4,5], rows[2]=[2,3]
                          Row 3: data[3]=[6],   rows[3]=[4]

Best for: Incremental construction, row-wise modifications Worst for: Arithmetic, column slicing (convert to CSR/CSC)

DOK (Dictionary of Keys)

DfDOK Format

DOK uses a Python dictionary mapping (row, col) tuples to values. Only non-zero entries are stored. Supports O(1) lookup and insertion for individual elements.

Architecture Diagram
Matrix:                   DOK Representation:
β”Œ                 ┐       {(0,0): 1, (0,3): 2,
β”‚ 1  0  0  2  0  β”‚        (1,1): 3, (2,2): 4,
β”‚ 0  3  0  0  0  β”‚        (2,3): 5, (3,4): 6}
β”‚ 0  0  4  5  0  β”‚
β”‚ 0  0  0  0  6  β”‚       Lookup: dok[(0,3)] β†’ 2
β””                 β”˜       Insert: dok[(4,4)] = 7

Best for: Element-wise access during construction, sparse matrix assembly Worst for: Arithmetic, slicing (convert to CSR/CSC)

Format Comparison

FormatConstructionRow SliceCol SliceMatVecArithmeticMemory
CSRO(nnz)O(1)O(m)O(nnz)O(nnz)Low
CSCO(nnz)O(n)O(1)O(nnz)O(nnz)Low
COOO(1)O(nnz)O(nnz)O(nnz)O(nnz)Medium
LILO(1)O(nnz)O(nnz)O(nnz)O(nnz)High
DOKO(1)O(nnz)O(nnz)O(nnz)O(nnz)High

Format Conversions

Converting between formats involves tradeoffs in speed, memory, and operations supported.

ℹ️ Conversion Rules

  • Any β†’ CSR/CSC: Fast, O(nnz). This is the standard conversion path.
  • CSR ↔ CSC: Fast, O(nnz). Just transpose the arrays.
  • Dense β†’ CSR: Use scipy.sparse.csr_matrix(dense_array).
  • CSR/CSC β†’ Dense: Use .toarray() β€” only for small matrices.
  • Any β†’ LIL/DOK: Use for construction, then convert to CSR/CSC for computation.
from scipy import sparse
import numpy as np

# Build incrementally with LIL, convert to CSR for computation
lil = sparse.lil_matrix((10000, 10000))
lil[0, 0] = 1
lil[1, 2] = 3
lil[5, 7] = -2
csr = lil.tocsr()  # Convert for efficient operations

# COO construction from data
row = np.array([0, 1, 2, 0])
col = np.array([0, 1, 2, 2])
val = np.array([1.0, 2.0, 3.0, 4.0])
coo = sparse.coo_matrix((val, (row, col)), shape=(3, 3))
csr = coo.tocsr()  # Convert for arithmetic

# CSR ↔ CSC conversion
csc = csr.tocsc()
csr_back = csc.tocsr()

Conversion cost hierarchy:

Architecture Diagram
LIL/DOK β†’ CSR/CSC:  O(nnz Γ— log(nnz))  [sorting]
COO β†’ CSR/CSC:      O(nnz)              [linear scan]
CSR ↔ CSC:          O(nnz)              [transpose]
CSR/CSC β†’ Dense:    O(m Γ— n)            [always expensive]

Sparse Matrix Operations

Matrix-Vector Multiplication

ThSparse MatVec Complexity

For CSR matrix AA (mΓ—nm \times n, nnz non-zeros) and dense vector xx:

y=AxhasΒ complexityΒ O(nnz)y = Ax \quad \text{has complexity } O(\text{nnz})

Each row ii computes: yi=βˆ‘j∈rowΒ iAijxjy_i = \sum_{j \in \text{row } i} A_{ij} x_j

This is optimal β€” you cannot multiply faster than the number of non-zeros.

from scipy import sparse
import numpy as np

A = sparse.random(100000, 100000, density=0.001, format='csr')
x = np.random.randn(100000)

y = A @ x           # O(nnz) matrix-vector multiply
print(f"nnz: {A.nnz}, time proportional to {A.nnz} operations")

Matrix Addition

A = sparse.random(1000, 1000, density=0.01, format='csr')
B = sparse.random(1000, 1000, density=0.01, format='csr')

C = A + B           # Element-wise addition, sparsity decreases
D = A - B           # Element-wise subtraction
E = 2.0 * A         # Scalar multiplication

Note: Sum of two sparse matrices has sparsity ≀\leq min(sparsity(A), sparsity(B)). Non-zeros may increase significantly.

Matrix Multiplication

ℹ️ Sparse Γ— Sparse Multiplication

Matrix multiplication C=ABC = AB is more complex. The result Cij=βˆ‘kAikBkjC_{ij} = \sum_k A_{ik} B_{kj} requires non-zeros to "align" on shared indices. The output nnz is unpredictable and can range from 0 to mΓ—nm \times n.

CSR Γ— CSR is efficient when the inner dimension kk has limited non-zeros per column/row.

A = sparse.random(1000, 1000, density=0.01, format='csr')
B = sparse.random(1000, 1000, density=0.01, format='csr')

C = A @ B           # Sparse Γ— Sparse
print(f"Input nnz: {A.nnz}, {B.nnz}, Output nnz: {C.nnz}")

Transpose

A = sparse.random(1000, 500, density=0.01, format='csr')
AT = A.T            # O(nnz) operation, swaps CSR ↔ CSC internally
AAT = A @ A.T       # Symmetric result

Element Access

# DOK: O(1) lookup
dok = sparse.dok_matrix((10000, 10000))
dok[123, 456] = 7.0
val = dok[123, 456]  # O(1)

# CSR: O(log(nnz_row)) via binary search within row
csr = dok.tocsr()
val = csr[123, 456]  # Slower than DOK for single lookups

# Dense slice: convert small block
block = csr[100:200, 100:200].toarray()

Sparse vs Dense

ThWhen to Use Sparse vs Dense

CriterionUse SparseUse Dense
Sparsityβ‰₯ 90% zeros< 90% zeros
Matrix size> 1000 Γ— 1000< 1000 Γ— 1000
OperationsMatVec, slicingElement-wise, broadcast
HardwareCPU (irregular access)GPU (regular access)
Librariesscipy.sparse, PyTorch sparseNumPy, cuBLAS

πŸ“Performance Comparison

import time
from scipy import sparse
import numpy as np

n = 10000
A_dense = np.random.randn(n, n)
A_dense[A_dense < 0.9] = 0  # Make ~90% zeros
A_sparse = sparse.csr_matrix(A_dense)

x = np.random.randn(n)

# Dense: O(nΒ²) time
t0 = time.time()
y_dense = A_dense @ x
t_dense = time.time() - t0

# Sparse: O(nnz) time
t0 = time.time()
y_sparse = A_sparse @ x
t_sparse = time.time() - t0

print(f"Dense: {t_dense:.4f}s, Sparse: {t_sparse:.4f}s")
print(f"Speedup: {t_dense/t_sparse:.1f}x")
# Typical result: 5-20x speedup for 90% sparse matrices

Key insight: GPUs prefer dense matrices for regular memory access patterns. Sparse matrices on GPUs (cuSPARSE) can be faster but require specialized kernels. For very large sparse matrices, CPU sparse operations often beat GPU dense operations due to memory constraints.


Common Sparsity Patterns

DfDiagonal Matrix

A diagonal matrix DD has Dij=0D_{ij} = 0 for iβ‰ ji \neq j. Stored as a single vector of length min⁑(m,n)\min(m, n). Operations are O(n)O(n) instead of O(n2)O(n^2).

D = sparse.diags([1, 2, 3, 4])  # 4x4 diagonal
D_inv = sparse.diags([1/x for x in [1, 2, 3, 4]])  # O(n) inversion

DfBanded Matrix

A banded matrix has non-zeros only within kk diagonals of the main diagonal. Bandwidth kk determines storage: O(kn)O(kn) instead of O(n2)O(n^2).

# Tridiagonal matrix (bandwidth = 1)
main_diag = np.ones(100) * 2
off_diag = np.ones(99) * -1
A = sparse.diags([off_diag, main_diag, off_diag], [-1, 0, 1])
# Storage: 299 values vs 10000 dense

DfBlock Diagonal Matrix

A block diagonal matrix has non-zero sub-matrices only along the diagonal. Each block is independent.

blocks = [sparse.random(100, 100, density=0.1) for _ in range(10)]
A = sparse.block_diag(blocks)  # 1000x1000 matrix, 10 independent blocks

DfRandom Sparse Matrix

Non-zeros are placed randomly. The density parameter controls the fraction of non-zeros.

A = sparse.random(1000, 1000, density=0.01, format='csr')
# Expected nnz = 1000 * 1000 * 0.01 = 10000

Python Implementation

Construction Patterns

from scipy import sparse
import numpy as np

# Pattern 1: From dense array
dense = np.array([[1, 0, 2], [0, 0, 0], [3, 0, 4]])
csr = sparse.csr_matrix(dense)

# Pattern 2: COO from data
row, col, val = [0, 2, 0], [0, 2, 2], [1.0, 4.0, 2.0]
coo = sparse.coo_matrix((val, (row, col)), shape=(3, 3))

# Pattern 3: Diagonal
D = sparse.diags([1, 2, 3])

# Pattern 4: Random
R = sparse.random(100, 100, density=0.1, format='csr',
                  random_state=42)

# Pattern 5: Identity
I = sparse.eye(1000, format='csr')

# Pattern 6: Kronecker product
A = sparse.random(10, 10, density=0.3)
B = sparse.random(10, 10, density=0.3)
C = sparse.kron(A, B)  # 100x100 sparse matrix

Sparse Linear Algebra

from scipy.sparse import linalg

A = sparse.csr_matrix([[4, 1], [1, 3]])
b = np.array([1, 2])

# Direct solve (small sparse systems)
x = linalg.spsolve(A, b)

# Iterative solve (large sparse systems)
x_iter, info = linalg.bicgstab(A, b, maxiter=1000, tol=1e-10)

# Eigenvalues
vals, vecs = linalg.eigs(A, k=2, which='LM')

# Singular values
U, s, Vt = linalg.svds(A, k=2)

# Norm
norm_A = linalg.norm(A, ord='fro')

Practical Example: Sparse Gradient Descent

from scipy import sparse
import numpy as np

# Feature matrix: 1M samples, 100K features, 0.1% density
X = sparse.random(1000000, 100000, density=0.001, format='csr')
y = np.random.randn(1000000)
w = np.random.randn(100000)

# Gradient: X.T @ (X @ w - y)
residual = X @ w - y           # O(nnz) sparse matvec
gradient = X.T @ residual      # O(nnz) sparse transpose matvec
w -= 0.01 * gradient           # O(n) dense update

# Memory: X is ~1.2GB sparse, would be 80GB dense

Applications in AI/ML

Recommendation Systems

ℹ️ Collaborative Filtering

User-item interaction matrices are extremely sparse. Netflix has ~200M users Γ— ~20K items, but each user rates <100 items (99.95% sparse). Matrix factorization decomposes Rβ‰ˆUVTR \approx UV^T where U∈RmΓ—kU \in \mathbb{R}^{m \times k}, V∈RnΓ—kV \in \mathbb{R}^{n \times k} with kβ‰ͺm,nk \ll m, n. Sparse formats enable training on matrices with billions of entries.

from scipy.sparse.linalg import svds

# Sparse user-item matrix (1M users, 10K items)
R = sparse.random(1000000, 10000, density=0.001, format='csr')

# Truncated SVD for latent factors
U, sigma, Vt = svds(R, k=50)  # k=50 latent dimensions
# U: user embeddings, Vt: item embeddings

NLP: Term-Document Matrices

ℹ️ Bag of Words

Document-term matrices are sparse because most documents contain a small vocabulary subset. A 100K-document corpus with 50K vocabulary has a 5B-entry matrix, but each document uses ~200 terms (99.996% sparse). TF-IDF weighting preserves sparsity while improving feature quality.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat sat on the mat", "the dog ran in the park"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # Sparse matrix
print(f"Shape: {X.shape}, nnz: {X.nnz}")

Graph Neural Networks

ℹ️ Adjacency Matrices

Graph adjacency matrices are sparse by definition. A social network with 1B users has a 10^18-entry adjacency matrix, but each user has ~100 friends (99.99999% sparse). GNNs perform sparse message passing: H(l+1)=Οƒ(Dβˆ’1/2ADβˆ’1/2H(l)W(l))H^{(l+1)} = \sigma(D^{-1/2} A D^{-1/2} H^{(l)} W^{(l)}), where AA is the sparse adjacency matrix.

import torch
from torch_geometric.utils import dense_to_sparse

# Dense adjacency (small graph)
A_dense = torch.tensor([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=torch.float)
edge_index, edge_weight = dense_to_sparse(A_dense)
# edge_index: [[0,1,1,2], [1,0,2,1]] β€” sparse COO-like format

Transformer Sparse Attention

ℹ️ Sparse Attention Patterns

Standard Transformer attention is O(n2)O(n^2) for sequence length nn. Sparse attention restricts each token to attend to O(n)O(\sqrt{n}) or O(log⁑n)O(\log n) others, reducing complexity to O(nn)O(n\sqrt{n}) or O(nlog⁑n)O(n\log n). Patterns include local (window), global (anchor tokens), and random connections.

# Local attention pattern (window size = 256)
def local_attention_mask(seq_len, window=256):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        start = max(0, i - window // 2)
        end = min(seq_len, i + window // 2)
        mask[i, start:end] = True
    return mask  # ~99% sparse for long sequences

# Sparse attention in PyTorch
from torch.nn.functional import scaled_dot_product_attention
# Use is_causal or custom attn_mask for sparsity

Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using .toarray() for large matricesDefeats purpose of sparsity β€” allocates O(mΓ—n) memoryKeep as sparse, use sparse operations
Converting CSR ↔ CSC repeatedlyEach conversion is O(nnz) and creates new arraysPick the right format once, stick with it
Adding dense to sparseResult is dense, losing all benefitsUse sparse + sparse, or convert sparse to dense only when necessary
Using A[i,j] in loops on CSREach access is O(log(nnz_row)) β€” slow in loopsUse DOK for random access, vectorize operations
Ignoring fill-in in sparse solversMatrix multiplication can create new non-zerosMonitor nnz growth, use iterative methods for large systems
Assuming GPU acceleration helpsSparse operations on GPUs have high overhead for small nnzProfile: GPU sparse beats CPU sparse only for very large matrices
Using float64 for integer indicesWastes memory, slower operationsUse int32 for indices (saves 50% over int64)
Building large sparse matrices with appendList appends cause O(n) copiesUse COO lists or LIL, convert once at the end

Interview Questions

Q1: Why are sparse matrices important in machine learning?

Answer: Most real-world datasets produce sparse matrices: NLP term-document matrices (99.9% sparse), recommendation systems (99.95% sparse), graph adjacency matrices (99.999% sparse). Storing and computing with dense representations wastes memory and compute. Sparse formats store only non-zeros, reducing memory from O(mn) to O(nnz) and computation from O(mn) to O(nnz).

Q2: When would you choose CSR over CSC?

Answer: CSR for row-oriented operations: row slicing, matrix-vector multiplication (y=Axy = Ax), row-wise normalization. CSC for column-oriented operations: column slicing, solving ATx=bA^T x = b, feature normalization. CSR and CSC are dual β€” CSR is optimal when operations access rows, CSC when accessing columns.

Q3: What is the time complexity of sparse matrix-vector multiplication?

Answer: O(nnz), where nnz is the number of non-zero entries. For CSR, each row ii computes yi=βˆ‘j∈rowΒ iAijxjy_i = \sum_{j \in \text{row } i} A_{ij} x_j by iterating over non-zeros in that row. Total work is proportional to nnz, which is optimal since every non-zero must be accessed at least once.

Q4: How does sparse matrix multiplication differ from dense?

Answer: Dense multiplication C=ABC = AB is O(mnk)O(mnk). Sparse multiplication is more complex: the output nnz is unpredictable. For CSR Γ— CSR, the algorithm iterates over non-zeros of A and B, finding shared indices via merge or hash. Complexity is O(nnz(A)Γ—avg_row_nnz(B))O(\text{nnz}(A) \times \text{avg\_row\_nnz}(B)) worst case. The output can be much denser than either input ("fill-in").

Q5: Describe the conversion path from construction to computation.

Answer: Build with COO (append triples in any order) or LIL (row-wise insertion), then convert to CSR or CSC for computation. Conversion is O(nnz). Never build directly in CSR/CSC β€” they don't support efficient insertion. The pipeline: COO/LIL β†’ CSR/CSC β†’ compute β†’ COO/LIL for modifications.

Q6: What is fill-in and why does it matter?

Answer: Fill-in occurs when operations on sparse matrices produce denser results. Examples: A+ATA + A^T doubles non-zeros; AΓ—AA \times A can create O(n2)O(n^2) non-zeros; sparse LU decomposition creates non-zeros in zero positions. Fill-in reduces the effectiveness of sparse representations and increases computation time.

Q7: How would you handle a sparse matrix on a GPU?

Answer: Use cuSPARSE (NVIDIA) or equivalent. Key considerations: (1) Only beneficial for very large matrices (nnz > 1M), (2) CSR/CSC formats are supported, (3) Sparse operations on GPUs have higher per-operation overhead than CPU, so batch operations help, (4) Mixed sparse-dense operations (sparse Γ— dense) benefit most from GPU acceleration.


Practice Problems

Problem 1: Memory Savings

πŸ“Problem

A 50000Γ—50000 adjacency matrix represents a social network where each user has exactly 200 friends on average. Compare storage requirements for dense (float64), COO, and CSR formats.

πŸ’‘Solution

  • Dense: 50000Γ—50000Γ—8=2050000 \times 50000 \times 8 = 20 GB
  • nnz: 50000Γ—200=10750000 \times 200 = 10^7 (10 million non-zeros)
  • COO: 107Γ—(8+4+4)=16010^7 \times (8 + 4 + 4) = 160 MB (value + row idx + col idx, int32 indices)
  • CSR: 107Γ—(8+4)+50001Γ—4=12010^7 \times (8 + 4) + 50001 \times 4 = 120 MB (values + col_idx + row_ptr)
  • Savings: CSR is 167x smaller than dense

Problem 2: Sparsity After Operations

πŸ“Problem

Matrix AA has sparsity 95% (5% non-zeros). Matrix BB also has sparsity 95%. What is the maximum possible sparsity of C=A+BC = A + B? What is the minimum?

πŸ’‘Solution

  • Maximum sparsity: 90% β€” if A and B have completely disjoint non-zero positions, the sum has 5%+5%=10%5\% + 5\% = 10\% non-zeros (90% sparse)
  • Minimum sparsity: 95% β€” if A and B have identical non-zero positions, the sum has at most 5% non-zeros (some may cancel to zero, increasing sparsity)
  • General rule: nnz(A+B)≀nnz(A)+nnz(B)\text{nnz}(A+B) \leq \text{nnz}(A) + \text{nnz}(B)
  • In practice, sparsity typically decreases by 10-30% after addition

Problem 3: Format Selection

πŸ“Problem

You need to: (1) build a sparse matrix by reading (row, col, value) triples from a file, (2) perform 1000 matrix-vector products, (3) occasionally access individual elements. Which format(s) should you use and in what order?

πŸ’‘Solution

  1. Build: Use COO β€” read triples directly, no sorting needed. O(1) per insertion.
  2. Convert: .tocsr() β€” O(nnz) conversion
  3. MatVec: Use CSR β€” O(nnz) per multiply, optimal for row-oriented operations
  4. Element access: Keep CSR but accept O(log(nnz_row)) per access, or maintain a small DOK copy for frequent lookups
  5. Overall: COO β†’ CSR, with optional DOK for random access

Problem 4: Sparse Eigenvalues

πŸ“Problem

You have a sparse matrix AA (10000Γ—1000010000 \times 10000, nnz = 50000) and need the 5 largest eigenvalues. Why is scipy.sparse.linalg.eigs(A, k=5) preferred over numpy.linalg.eig(A.toarray())?

πŸ’‘Solution

  • Dense eigendecomposition: O(n3)=O(1012)O(n^3) = O(10^{12}) operations, O(n2)=800O(n^2) = 800 MB storage
  • Sparse eigs (Lanczos/Arnoldi): O(kΓ—nnz)=O(250000)O(k \times \text{nnz}) = O(250000) operations per iteration, O(kn)=400O(kn) = 400 KB storage
  • Speedup: ~1000000x faster, ~2000x less memory
  • Key insight: Sparse eigensolvers only compute the eigenvalues you need, using matrix-vector products (O(nnz)) instead of full decomposition

Quick Reference

AspectDetail
Dense storage8mn8mn bytes (float64)
CSR storage8nnz+4(nnz+m+1)8\text{nnz} + 4(\text{nnz} + m + 1) bytes
CSC storage8nnz+4(nnz+n+1)8\text{nnz} + 4(\text{nnz} + n + 1) bytes
COO storage16nnz16\text{nnz} bytes
MatVec complexityO(nnz)O(\text{nnz})
Sparse Γ— SparseO(nnz(A)Γ—avg_nnz_per_row(B))O(\text{nnz}(A) \times \text{avg\_nnz\_per\_row}(B))
Sparsity definitionzeros/(mΓ—n)\text{zeros} / (m \times n)
Sparse thresholdβ‰₯ 90% zeros
Break-even nnznnz<mn/10\text{nnz} < mn / 10 for CSR vs dense
Best construction formatCOO (random) or LIL (row-wise)
Best computation formatCSR (rows) or CSC (columns)
Best random access formatDOK
Python libraryscipy.sparse
GPU supportcuSPARSE, PyTorch sparse

Cross-References

Lesson Progress17 / 100