Topological Data Analysis
ℹ️ Why It Matters
Topological data analysis (TDA) captures the shape of data—connected components, loops, voids, and higher-dimensional cavities—that traditional statistical methods miss. Persistent homology tracks these topological features across multiple scales, providing robust descriptors that are invariant to continuous deformations. TDA has proven valuable in shape analysis, time series classification, protein structure analysis, and as feature engineering for machine learning. The mathematical foundation draws from algebraic topology, simplicial homology, and metric geometry.
Core Definitions
DfTopological Space
A topological space is a pair where is a set and is a collection of subsets (open sets) satisfying: (1) ; (2) arbitrary unions of sets in are in ; (3) finite intersections of sets in are in . A topology captures the notion of "closeness" without requiring a metric. Examples include the standard topology on (generated by open intervals) and the discrete topology (every subset is open).
DfOpen and Closed Sets
A set is open if . A set is closed if is open (equivalently, contains all its limit points). In a metric space, open balls generate the topology. A set can be both open and closed (clopen), or neither. In connected spaces, the only clopen sets are and .
DfHomeomorphism
A function between topological spaces is a homeomorphism if it is bijective, continuous, and has a continuous inverse. Homeomorphic spaces are topologically equivalent—they have the same "shape." For example, a coffee mug and a donut are homeomorphic (both have one hole). Homeomorphism preserves connectedness, compactness, and all topological invariants.
DfConnectedness
A topological space is connected if it cannot be written as the union of two disjoint non-empty open sets. Equivalently, the only clopen (simultaneously closed and open) sets are and . A space is path-connected if any two points can be joined by a continuous path. Path-connected implies connected, but not conversely (the topologist's sine curve is connected but not path-connected). The number of connected components is the zeroth Betti number.
DfCompactness
A topological space is compact if every open cover (where ) has a finite subcover. In metric spaces, compactness is equivalent to sequential compactness (every sequence has a convergent subsequence) and to being complete and totally bounded. Compact sets are "finite" in a topological sense. The Heine-Borel theorem states that in , a set is compact iff it is closed and bounded.
DfHomotopy
Two continuous maps are homotopic (written ) if there exists a continuous map such that and . A space is contractible if it is homotopy equivalent to a point. Homotopy equivalence is weaker than homeomorphism—it preserves "shape" but not rigid structure. For example, a disk is contractible (homotopy equivalent to a point), while a circle is not.
DfSimplicial Complex
An abstract simplicial complex on a vertex set is a collection of finite subsets of (simplices) closed under taking subsets. A simplex has dimension . The geometric realization embeds in Euclidean space as a union of simplices. Examples: graph (1-complex), triangle mesh (2-complex), tetrahedralization (3-complex). The -vector counts the number of simplices in each dimension.
DfSimplicial Homology
The -th chain group is the free abelian group generated by -simplices. The boundary operator maps each -simplex to its oriented boundary. The -th homology group is , and the -th Betti number is . Homology measures "holes" in each dimension.
DfPersistent Homology
Given a filtration of simplicial complexes , persistent homology tracks the birth and death of homological features (connected components, loops, voids) as the filtration parameter increases. A feature born at scale and dying at scale has persistence ; features with high persistence are considered "real" structure, while short-lived features are attributed to noise.
DfVietoris-Rips Complex
For a point cloud and parameter , the Vietoris-Rips complex includes a simplex if for all . As increases, more simplices are added, creating a filtration. The Rips complex is commonly used in TDA because it depends only on pairwise distances.
DfČech Complex
For a point cloud and parameter , the Čech complex includes a simplex if the balls of radius centered at have non-empty intersection. The Čech complex captures the "union of balls" topology and satisfies the Nerve Theorem: if the balls are convex, the Čech complex is homotopy equivalent to the union of balls.
Key Formulas
Betti Numbers
Here,
- =k-th Betti number (number of k-dimensional holes)
- =k-th homology group
- =Boundary operator from k-chains to (k-1)-chains
Euler Characteristic
Here,
- =Euler characteristic (topological invariant)
- =k-th Betti number
- =Number of k-simplices in a triangulation
Persistence Diagram
Here,
- =Persistence diagram for k-dimensional features
- =Birth time of the i-th feature
- =Death time of the i-th feature
Bottleneck Distance
Here,
- =Two persistence diagrams
- =Bijection (matching) between the diagrams
- =Bottleneck distance measuring diagram similarity
Wasserstein Distance
Here,
- =p-Wasserstein distance between persistence diagrams
- =Optimal matching (bijection with possible diagonal matches)
Vietoris-Rips Distance Condition
Here,
- =Distance threshold parameter
- =Pairwise distance between points
Čech Complex Intersection Condition
Here,
- =Open ball of radius ε centered at x
- =Radius parameter
Important Theorems
ThNerve Theorem (Borsuk)
Let be a finite collection of convex, contractible subsets of such that all non-empty intersections are also contractible. Then the nerve of the cover is homotopy equivalent to . This theorem justifies using simplicial complexes (built from intersections of balls) to capture the topology of the underlying space.
ThStability of Persistence Diagrams
If are continuous functions on a compact space, then , where is the bottleneck distance and is the persistence diagram of . This means small perturbations of the data cause small changes in the persistence diagram—TDA is stable.
ThIsometry Invariance of Persistent Homology
Persistent homology is invariant under isometries of the metric space. If and are isometric metric spaces, their persistence diagrams are identical. This means TDA captures intrinsic geometric structure independent of embedding.
ThUniversal Coefficient Theorem
For a topological space with simplicial homology , the homology with coefficients in a field satisfies:
When has characteristic 0 (e.g., , ), the Tor term vanishes and . This justifies working with field coefficients in computational TDA.
ThMayer-Vietoris Sequence
For a topological space expressed as the union of two open sets and , there is a long exact sequence relating the homology of , , , and :
This sequence is a powerful computational tool: it reduces the computation of homology of to computing homology of simpler subspaces.
Worked Examples
📝Betti Numbers of Common Shapes
Problem: Compute the Betti numbers for (a) a circle, (b) a sphere, (c) a torus, (d) a figure-eight.
Solution:
(a) Circle :
- (one connected component)
- (one loop)
- for
Euler characteristic:
(b) Sphere :
- (connected)
- (no loops—any loop can be contracted to a point)
- (one void/enclosed cavity)
- for
Euler characteristic:
(c) Torus :
- (connected)
- (two independent loops: meridian and longitude)
- (one void)
- for
Euler characteristic:
(d) Figure-eight (wedge of two circles):
- (connected)
- (two loops)
- for
Euler characteristic:
Result: The Betti numbers distinguish these shapes: the torus has while the sphere has . The figure-eight and torus share but differ in .
📝Vietoris-Rips Complex of a Point Cloud
Problem: Given points , , (vertices of an equilateral triangle with side length 1), construct the Vietoris-Rips complex for , , and .
Solution:
Step 1: Compute pairwise distances. , , (equilateral triangle).
Step 2: : No pair of points has distance .
- 0-simplices: (3 vertices)
- 1-simplices: none
- (three components),
Step 3: : All pairwise distances .
- 0-simplices:
- 1-simplices: (all edges)
- 2-simplices: (since all edges present, triangle filled)
- (connected), (loop filled by triangle)
Step 4: : Same as (no new features).
- Same complex as
Result: The persistence diagram has one connected component born at (never dies), and no other features. The triangle forms at .
📝Persistent Homology of a Noisy Circle
Problem: 20 points sampled from a unit circle with small Gaussian noise. Describe the expected persistence diagram.
Solution:
Step 1: As increases from 0:
- At : Each point is a separate component ().
- As grows: Nearby points connect, reducing .
Step 2: At (half the typical inter-point spacing):
- Most points connect into a single ring.
- (one component).
- (one loop appears—the circle).
Step 3: The loop persists until (diameter of circle), when the interior fills in and the loop dies.
Step 4: Expected persistence diagram:
- : 19 "noise" components born at , dying at various small values (short bars). 1 "signal" component born at , never dying (infinite bar).
- : 1 "signal" loop born at , dying at (long bar). Possible small "noise" loops born and dying at similar scales (short bars).
Result: The persistent homology clearly identifies the circle: one long-lived feature with persistence , while noise features have short persistence. The bottleneck distance to the exact circle's diagram is small.
📝Euler Characteristic of a Graph
Problem: Compute the Euler characteristic of a complete graph (5 vertices, all edges) and a tree with 5 vertices.
Solution:
Complete graph :
- (vertices), (edges), for
Alternatively: , (independent cycles). ✓
Tree with 5 vertices:
- (vertices), (edges, since a tree on vertices has edges)
Alternatively: (connected), (no cycles in a tree). ✓
Result: has (many cycles), while the tree has (no cycles). The Euler characteristic distinguishes these graphs: trees always have , while graphs with cycles have .
Practice Problems
📝Problem 1: Connected Components of a Subset of R²
Problem: Determine the connected components and compactness of .
Solution:
Step 1: is the union of two closed disks: centered at with radius 1, and centered at with radius 1.
Step 2: The distance between centers is 3, and the sum of radii is 2. Since , the disks do not overlap (their intersection is empty).
Step 3: Connected components: Each disk is connected (convex sets are connected), and since they are disjoint, has exactly 2 connected components.
Step 4: Compactness: Each disk is closed and bounded in , hence compact by Heine-Borel. The finite union of compact sets is compact, so is compact.
Result: (two connected components), is compact.
📝Problem 2: Homotopy Equivalence
Problem: Show that a circle is homotopy equivalent to (the punctured plane).
Solution:
Step 1: Define as the inclusion (view ).
Step 2: Define as the radial projection .
Step 3: Check : For , (since ). ✓
Step 4: Check : Define .
For , . This is continuous and never zero (since for ).
, .
Result: and are homotopy inverses, so . Both have , .
📝Problem 3: Compactness in Function Spaces
Problem: Is the closed unit ball compact in the supremum norm?
Solution:
Step 1: Consider the sequence in . Each is continuous with .
Step 2: Pointwise limit:
This limit is NOT continuous, so .
Step 3: Any subsequence of converges pointwise to the same discontinuous function, so no subsequence converges in the supremum norm.
Step 4: Therefore has no convergent subsequence in .
Result: The closed unit ball in is NOT compact. This is a consequence of the Riesz theorem: in an infinite-dimensional Banach space, the closed unit ball is never compact. Contrast with finite-dimensional where the closed unit ball IS compact (Heine-Borel).
📝Problem 4: Persistent Homology of Two Concentric Circles
Problem: Describe the persistence diagram for 30 points sampled from two concentric circles of radii 1 and 2.
Solution:
Step 1: At small : 30 connected components ().
Step 2: As increases, points on each circle connect. The inner circle (radius 1) forms a loop first (at , the typical spacing). The outer circle (radius 2) forms a loop at a slightly larger .
Step 3: At : Both circles are fully connected.
- (two connected components: inner and outer rings)
- (two loops)
Step 4: As grows toward 1.0: The inner circle's loop fills in (dies). The outer circle's loop persists longer.
Step 5: At : The outer loop fills in.
Persistence diagram:
- : 1 infinite bar (everything connected eventually), 28 short bars (noise components)
- : 2 long bars. Inner loop: born , dies (persistence ). Outer loop: born , dies (persistence ).
Result: The two loops are clearly distinguished by their persistence. The outer loop has higher persistence because it is larger. TDA correctly identifies the "shape" of two concentric circles.
📝Problem 5: Čech Complex vs. Vietoris-Rips Complex
Problem: For a set of points forming a regular hexagon, compare the Čech and Vietoris-Rips complexes at different scales.
Solution:
Step 1: Consider a regular hexagon with side length 1. Vertices: .
Step 2: For : No edges in either complex.
Step 3: For :
- Čech complex: Two adjacent balls intersect iff the distance between centers (diameter of intersection region). Adjacent vertices are distance 1 apart, so edges appear.
- Vietoris-Rips: Same edges (distance 1 ).
- Both give the cycle graph .
Step 4: For :
- Čech complex: Three consecutive vertices have balls intersecting at the center of the triangle they form. A 2-simplex appears.
- Vietoris-Rips: All pairwise distances . Since the hexagon's diameter is 2, only adjacent and next-adjacent pairs (distance ) are connected.
Step 5: Key difference: The Čech complex requires a common intersection point for all balls in a simplex, while Vietoris-Rips only requires pairwise distances. The Čech complex is "smaller" (fewer simplices) but captures the true topology by the Nerve Theorem.
Result: For the same data, the Čech complex is a subcomplex of the Vietoris-Rips complex. Čech is more topologically accurate but computationally harder (requires checking ball intersections). Vietoris-Rips depends only on pairwise distances, making it computationally efficient.
📝Problem 6: Persistent Homology of a Torus
Problem: Describe the expected persistence diagram for points sampled from a torus embedded in .
Solution:
Step 1: A torus has , , .
Step 2: At small scales, the point cloud looks like scattered points ().
Step 3: As increases, points connect into a "skeleton" of the torus:
- drops to 1 (one connected component)
- Two loops appear (): the meridian and longitude circles
Step 4: The two loops persist over a wide range of values. Eventually:
- The meridian loop fills in (dies first, at where is the tube radius)
- The longitude loop fills in later (at where is the major radius)
- The void () appears when the surface is fully reconstructed and dies when the interior fills
Step 5: Expected persistence diagram:
- : Many short bars (noise), one infinite bar
- : Two long bars (the two independent loops of the torus)
- : One bar (the enclosed void)
Result: Persistent homology correctly identifies the torus topology: , , . The two features have different persistence values reflecting the different radii of the torus.
📝Problem 7: Vietoris-Rips for Clustering
Problem: How does persistent homology help determine the number of clusters in a point cloud?
Solution:
Step 1: Consider 3 clusters of points, each cluster forming a tight ball, with large gaps between clusters.
Step 2: At small : Each point is separate ().
Step 3: As increases: Points within each cluster connect first. At equal to the intra-cluster spacing, (three components).
Step 4: The three components persist until reaches the inter-cluster distance, at which point they merge into one component ().
Step 5: The persistence diagram for shows:
- Many short bars (noise within clusters)
- Three medium bars (the three clusters), dying at the inter-cluster distance
- One infinite bar (everything eventually connected)
Result: The number of long-lived features (after filtering short noise bars) indicates the number of clusters. This is the basis of topological clustering methods that use persistent homology to determine cluster count automatically.
📝Problem 8: Čech Complex Computation
Problem: For 4 points forming a square with side length 1, determine at what value of the Čech complex first contains a 2-simplex (triangle).
Solution:
Step 1: Points: , , , .
Step 2: For a 2-simplex , we need balls of radius around , , to have a common intersection point.
Step 3: The circumcenter of triangle is at with circumradius .
Step 4: The three balls intersect when .
Step 5: At , the point is equidistant from , , at distance , so it lies in all three balls.
Result: The Čech complex first contains a 2-simplex at . This is less than the Vietoris-Rips threshold ( for the same simplex), illustrating that Čech is "tighter" than Rips.
📝Problem 9: Persistence Landscape Computation
Problem: Given a persistence diagram with features and for , compute the first persistence landscape.
Solution:
Step 1: For each feature , define the tent function:
Step 2: For feature 1: , midpoint :
Peak value: at .
Step 3: For feature 2: , midpoint :
Peak value: at .
Step 4: The first persistence landscape is :
- For : (only feature 1 active)
- For : (feature 1 dominates)
- For : (feature 1 still dominates)
- For : (feature 1 dominates)
- For : (only feature 1 active)
Result: The first persistence landscape is the upper envelope of the tent functions. It can be used as a feature vector for statistical analysis (averaging landscapes across samples, computing confidence bands, etc.).
📝Problem 10: TDA for Time Series Classification
Problem: How would you use persistent homology to classify periodic vs. non-periodic time series?
Solution:
Step 1: Given a time series , create a delay embedding in :
where is the delay and is the embedding dimension.
Step 2: For a periodic signal, the delay embedding traces a closed curve (a loop in ).
Step 3: Compute persistent homology of the point cloud .
Step 4: Classification criteria:
- Periodic signal: with high persistence (long-lived loop)
- Non-periodic signal: or with low persistence (short-lived feature)
Step 5: Use the persistence of the dominant feature as a classification feature: threshold at some value to separate periodic from non-periodic.
Result: TDA provides a principled, noise-robust method for detecting periodicity. The delay embedding + persistent homology approach works even for noisy, irregularly sampled time series, making it valuable for applications in astronomy (variable star classification), medicine (heartbeat analysis), and finance (cycle detection).
Common Mistakes
| Mistake | Correct Approach |
|---|---|
| Assuming more data always gives better topology | Noise creates short-lived topological features; focus on persistence |
| Confusing homotopy equivalence with homeomorphism | Homotopy equivalence is weaker; a disk and a point are homotopy equivalent but not homeomorphic |
| Using Vietoris-Rips without understanding its relation to Čech | Rips is a superset of Čech; Rips may create false features that Čech avoids |
| Ignoring the choice of distance metric | TDA results depend on the metric; different metrics yield different persistence diagrams |
| Assuming Betti numbers capture all topology | Betti numbers are invariants, but they don't capture the full homotopy type (e.g., lens spaces) |
| Confusing persistent homology with clustering | Clustering gives partition; persistent homology captures higher-dimensional topology (loops, voids) |
| Forgetting that persistence is invariant under rotation/translation | TDA captures intrinsic geometry, not extrinsic position |
Connections to Machine Learning
ℹ️ Connections to Machine Learning
TDA provides topology-aware features for ML: (1) Persistence diagrams serve as features for classification—long-lived features indicate meaningful structure. (2) Persistence landscapes convert diagrams into functions suitable for statistical analysis. (3) Topological regularization encourages neural networks to learn functions with simple topology. (4) Shape analysis uses persistent homology to compare 3D meshes and point clouds. (5) Time series analysis detects periodicity and anomalies via persistent homology of delay embeddings. (6) Graph neural networks can incorporate topological features from persistent homology of graph neighborhoods. (7) Manifold learning methods (Isomap, UMAP) assume data lies on a low-dimensional manifold; TDA can validate this assumption by measuring topological complexity.
Exam/Interview Questions
Q1: What are Betti numbers, and what do they measure?
Answer: Betti numbers count the number of -dimensional "holes" in a topological space: = connected components, = independent loops, = enclosed voids. They are topological invariants—preserved under homeomorphism. For a circle: , . For a sphere: , . For a torus: , , .
Q2: Explain the difference between the Vietoris-Rips and Čech complexes.
Answer: The Vietoris-Rips complex includes a simplex if all pairwise distances are . The Čech complex includes a simplex if the balls of radius around the vertices have a common intersection point. Čech is topologically more accurate (by the Nerve Theorem) but harder to compute. Rips is always a superset of Čech and depends only on pairwise distances, making it computationally efficient.
Q3: Why is persistent homology "stable"?
Answer: The stability theorem states that , where is the bottleneck distance. This means if we perturb the data (or the function defining the filtration) by a small amount in the supremum norm, the persistence diagram changes by at most that amount. This guarantees that TDA is robust to noise—small perturbations cause small changes in the topological summary.
Q4: What is the Euler characteristic and how does it relate to Betti numbers?
Answer: The Euler characteristic is . For surfaces: sphere (), torus (), genus- surface (). It can also be computed from any triangulation: where is the number of -simplices. It is a topological invariant (independent of triangulation) and relates geometry to topology via the Gauss-Bonnet theorem.
Q5: How can persistent homology be used as features for machine learning?
Answer: Persistence diagrams can be converted to feature vectors via: (1) Persistence landscapes—transform diagrams into functions amenable to averaging and statistics. (2) Persistence images—rasterize diagrams into fixed-size images. (3) Betti curves—plot as a function of scale. (4) Persistence statistics—compute mean, variance, and other statistics of birth-death pairs. These features capture topological structure (holes, loops, voids) that traditional features miss, and have been used in shape classification, time series analysis, and molecular biology.
Q6: Why is compactness important in topology and analysis?
Answer: Compactness ensures that: (1) Every sequence has a convergent subsequence (sequential compactness), guaranteeing existence of solutions. (2) Continuous functions on compact sets are bounded and attain their extrema (extreme value theorem). (3) Open covers have finite subcovers, enabling finite-dimensional arguments. In TDA, compactness of the underlying space ensures persistence diagrams are well-defined (features have finite birth and death times). In analysis, compactness justifies convergence of iterative algorithms and existence of optimal solutions.
Quick Reference
| Concept | Definition | Key Property |
|---|---|---|
| Topological Space | with open set axioms | Generalizes metric spaces |
| Open Set | Closed under arbitrary union | |
| Closed Set | open | Contains all limit points |
| Connected | Cannot split into disjoint open sets | counts components |
| Compact | Every open cover has finite subcover | Sequentially compact in metric spaces |
| Homeomorphism | Continuous bijection with continuous inverse | Preserves topology |
| Homotopy | Continuous deformation between maps | Preserves "shape" |
| Simplicial Complex | Collection of simplices closed under subsets | Discrete model of topology |
| Betti Number | Counts -dimensional holes | |
| Euler Characteristic | Topological invariant | |
| Persistent Homology | Track feature birth/death across scales | Stable under perturbation |
| Vietoris-Rips | simplex iff | Pairwise distance condition |
| Čech Complex | simplex iff balls have common intersection | Nerve Theorem applies |
| Bottleneck Distance | Stability metric |
Cross-References
- 096-advanced-tensor-calculus — Tensor products in algebraic topology; chain complexes as tensor products
- 097-advanced-differential-geometry — Manifolds are topological spaces with smooth structure; de Rham cohomology relates to simplicial homology
- 098-advanced-functional-analysis — Function spaces are topological vector spaces; weak topologies connect analysis to topology
- 099-advanced-measure-theory — Borel σ-algebras are generated by open sets; measurability relates to topological structure
- Graph Theory: Graphs are 1-dimensional simplicial complexes; graph connectivity relates to