Master vector and matrix norms, their properties, and applications in regularization, convergence, and numerical stability.
š Vector Normsš Lesson 15 of 100š Free Course
Advertisement
Vector and Matrix Norms
Why It Matters: Norms are the foundation of measuring "size" and "distance" in vector spaces. They determine how we quantify error, define convergence, and control optimization. In machine learning, the choice of norm directly influences model behavior ā L1 produces sparse solutions, L2 promotes smoothness, and the Lā norm captures worst-case scenarios. Understanding norms is essential for regularization, numerical stability analysis, and distance-based algorithms.
What is a Norm
DfVector Norm
A norm is a function ā„ā ā„:VāRā„0ā from a vector space V to the non-negative real numbers that satisfies four fundamental axioms:
Axiom
Property
Description
1
Non-negativity
ā„xā„ā„0 for all xāV
2
Definiteness
ā„xā„=0āŗx=0
3
Homogeneity
ā„αxā„=ā„αā„ā ā„xā„ for all scalars α
4
Triangle Inequality
ā„x+yāā„ā¤ā„xā„+ā„yāā„ for all x,yāāV
A vector space equipped with a norm is called a normed vector space. The norm induces a natural distance function d(x,yā)=ā„xāyāā„, making it a metric space.
Key Insight: For any vector, ā„xā„āāā¤ā„xā„2āā¤ā„xā„1ā. The Lā norm captures only the largest component, L2 averages all components, and L1 sums all magnitudes. As p increases, the Lp norm converges to the Lā norm.
The Frobenius norm treats a matrix as a vector in RmĆn and computes its Euclidean norm. It equals the square root of the sum of squared singular values.
Spectral Norm (Operator 2-Norm)
ā£Aā£2ā=sigmamaxā(A)=sqrtlambdamaxā(ATA)
Here,
Ļmaxā(A)=Largest singular value of A
Ī»maxā(ATA)=Largest eigenvalue of A^T A
The spectral norm measures the maximum "stretch" factor of a linear transformation. It equals the largest singular value.
Nuclear Norm
ā£Aā£āā=sumi=1rāsigmaiā=texttr(sqrtATA)
Here,
Ļiā=Singular values of A
r=Rank of A
The nuclear norm (also called the trace norm) is the convex envelope of the rank function over the unit spectral norm ball. It is used in matrix completion and low-rank approximation.
An induced norm (also called an operator norm) measures the maximum output norm given an input constrained to unit norm. The most common induced norms are:
For any two norms ā„ā ā„aā and ā„ā ā„bā on a finite-dimensional vector space V (with dim(V)=n), there exist constants c1ā,c2ā>0 such that for all xāV:
Implication: In finite dimensions, all norms define the same topology ā convergence in one norm implies convergence in all others. However, the constants matter: the L1 norm can be up to n times larger than the Lā norm. In infinite dimensions (function spaces), norms need NOT be equivalent.
Unit Ball: Geometric Interpretation
The unit ball of a norm is the set B={x:ā„xā„ā¤1}. Its shape reveals the geometric character of the norm.
Norm
Unit Ball Shape
Geometry
L1
Diamond (rotated square in 2D)
Vertices at (±1,0) and (0,±1)
L2
Circle / Sphere
Smooth, rotationally symmetric
Lā
Square / Hypercube
Vertices at (±1,±1)
Lp (1<p<ā)
Rounded polygon
Interpolates between diamond and circle
The L1 unit ball's "pointy" vertices at the axes explain why L1 optimization produces sparse solutions ā the optimal point is more likely to land on a vertex where some coordinates are exactly zero.
Norms in Optimization
Regularized Loss Function
minvecwāmathcalL(vecw)+lambdaā£vecwā£ppā
Here,
L(w)=Loss function (e.g., squared error)
Ī»=Regularization strength
ā„wā„ppā=p-norm penalty (p=1 or 2 common)
Penalty
Name
Effect
Ī»ā„wā„1ā
Lasso
Sparse solutions, automatic feature selection
Ī»ā„wā„22ā
Ridge
Small weights, no feature selection
Ī»ā„wā„1ā+Ī»2āā„wā„22ā
Elastic Net
Combines sparsity and smoothness
Ī»ā„wā„āā
Minimax
Bounded maximum coefficient
Distance Metrics
A norm ā„ā ā„ induces a distance metric d(x,yā)=ā„xāyāā„ that satisfies:
L1 Regularization (Lasso): The L1 norm penalty Ī»ā„wā„1ā drives some weights to exactly zero, performing automatic feature selection. This is critical in high-dimensional settings where only a subset of features matter (genomics, NLP feature selection).
L2 Regularization (Ridge): The L2 norm penalty Ī»ā„wā„22ā shrinks all weights toward zero but never sets them exactly to zero. It prevents overfitting and improves generalization. It is the default regularization in most linear models.
Adversarial Robustness: The Lā norm measures the maximum perturbation allowed in adversarial examples. Models trained with the PGD adversarial method optimize maxā„Ī“ā„āāā¤ĻµāL(x+Ī“,y).
Gradient Clipping: In deep learning, gradients are clipped by norm: if ā„gāā„>Ļ, then gāāĻā gā/ā„gāā„. This prevents exploding gradients and stabilizes training.
Matrix Completion (Netflix Prize): The nuclear norm ā„Aā„āā is minimized to recover low-rank matrices from partial observations: minXāā„Xā„āā subject to observed entries matching.
Common Mistakes
Mistake
Why It's Wrong
Correct Approach
Using L0 norm for optimization
L0 is non-convex, NP-hard
Use L1 as convex relaxation
Confusing ā„Aā„Fā with ā„Aā„2ā
Frobenius sums all singular values, spectral takes the max
ā„Aā„2āā¤ā„Aā„Fāā¤rāā„Aā„2ā
Forgetting ā„cxā„=ā£cā£ā„xā„
Homogeneity requires absolute value on scalar
ā„ā3xā„=3ā„xā„, not ā3ā„xā„
Assuming all norms are equal in infinite dimensions
Norm equivalence requires finite dimensions
In function spaces, different norms define different topologies
Using L2 norm for sparse feature selection
L2 shrinks but doesn't zero out features
Use L1 (Lasso) or Elastic Net
Ignoring norm when computing condition number
Īŗ(A)=ā„Aā„ā ā„Aā1ā„ depends on the norm
Choose the norm appropriate for your error metric
Interview Questions
Q1: Why does L1 regularization produce sparse solutions while L2 does not?
š”Solution
Geometrically, the L1 unit ball is a diamond with vertices on the axes. The level curves of the loss function are more likely to intersect the L1 ball at a vertex, where some coordinates are exactly zero. The L2 ball is a circle ā level curves typically intersect it at points where all coordinates are non-zero.
Q2: What is the relationship between the spectral norm and the Frobenius norm?
š”Solution
For any matrix A: ā„Aā„2āā¤ā„Aā„Fāā¤rāā ā„Aā„2ā, where r=rank(A). The spectral norm equals the largest singular value, while the Frobenius norm equals the root-sum-of-squares of all singular values.
Q3: When would you use the nuclear norm instead of the Frobenius norm?
š”Solution
Use the nuclear norm when you want to encourage low-rank structure in a matrix. The nuclear norm is the tightest convex relaxation of the rank function. Applications include matrix completion (e.g., recommender systems), denoising, and dimensionality reduction.
Q4: Prove that the Lā norm is indeed a norm on Rn.
š”Solution
Non-negativity:maxā£xiāā£ā„0 since each ā£xiāā£ā„0.
Definiteness:maxā£xiāā£=0ā¹xiā=0 for all iā¹x=0.
Q5: What is the condition number of a matrix, and why does the norm matter?
š”Solution
The condition number is Īŗ(A)=ā„Aā„ā ā„Aā1ā„. It measures how sensitive the solution of Ax=b is to perturbations in b. A large condition number indicates an ill-conditioned problem. The value depends on the norm chosen ā typically the spectral norm or the Lā norm is used.
Q6: How do norms relate to convergence in optimization algorithms?
š”Solution
Convergence of an iterative algorithm x(k)āxā is defined with respect to a norm: ā„x(k)āxāā„ā0. In finite dimensions, convergence in one norm implies convergence in all norms. However, the rate of convergence (and practical numerical behavior) can differ significantly between norms.
Practice Problems
Problem 1: Compute the L1, L2, Lā, and L4 norms of x=ā3ā405āā.