← Math|15 of 100
Linear Algebra

Vector and Matrix Norms

Master vector and matrix norms, their properties, and applications in regularization, convergence, and numerical stability.

šŸ“‚ Vector NormsšŸ“– Lesson 15 of 100šŸŽ“ Free Course

Advertisement

Vector and Matrix Norms

Why It Matters: Norms are the foundation of measuring "size" and "distance" in vector spaces. They determine how we quantify error, define convergence, and control optimization. In machine learning, the choice of norm directly influences model behavior — L1 produces sparse solutions, L2 promotes smoothness, and the Lāˆž norm captures worst-case scenarios. Understanding norms is essential for regularization, numerical stability analysis, and distance-based algorithms.

What is a Norm

DfVector Norm

A norm is a function āˆ„ā‹…āˆ„:V→R≄0\|\cdot\|: V \to \mathbb{R}_{\geq 0} from a vector space VV to the non-negative real numbers that satisfies four fundamental axioms:

AxiomPropertyDescription
1Non-negativity∄xāƒ—āˆ„ā‰„0\|\vec{x}\| \geq 0 for all xāƒ—āˆˆV\vec{x} \in V
2Definiteness∄xāƒ—āˆ„=0ā€…ā€ŠāŸŗā€…ā€Šxāƒ—=0āƒ—\|\vec{x}\| = 0 \iff \vec{x} = \vec{0}
3Homogeneity∄αxāƒ—āˆ„=āˆ„Ī±āˆ„ā‹…āˆ„xāƒ—āˆ„\|\alpha \vec{x}\| = \|\alpha\| \cdot \|\vec{x}\| for all scalars α\alpha
4Triangle Inequality∄xāƒ—+yāƒ—āˆ„ā‰¤āˆ„xāƒ—āˆ„+∄yāƒ—āˆ„\|\vec{x} + \vec{y}\| \leq \|\vec{x}\| + \|\vec{y}\| for all xāƒ—,yāƒ—āˆˆV\vec{x}, \vec{y} \in V

A vector space equipped with a norm is called a normed vector space. The norm induces a natural distance function d(xāƒ—,yāƒ—)=∄xāƒ—āˆ’yāƒ—āˆ„d(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\|, making it a metric space.

Vector Norms

Lp Norm Family

∣vecx∣p=left(sumi=1n∣xi∣pright)1/p\\|\\vec{x}\\|_p = \\left(\\sum_{i=1}^{n} |x_i|^p\\right)^{1/p}

Here,

  • xāƒ—\vec{x}=Vector in \mathbb{R}^n
  • pp=Parameter satisfying p \geq 1
  • ∣xi∣|x_i|=Absolute value of the i-th component
NormFormulaWhen to Use
L1 (Manhattan)∄xāƒ—āˆ„1=āˆ‘i=1n∣xi∣\|\vec{x}\|_1 = \sum_{i=1}^{n} |x_i|Sparse solutions, feature selection (Lasso)
L2 (Euclidean)∄xāƒ—āˆ„2=āˆ‘i=1nxi2\|\vec{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}Smooth solutions, general-purpose (Ridge)
Lāˆž (Max Norm)∄xāƒ—āˆ„āˆž=max⁔i∣xi∣\|\vec{x}\|_\infty = \max_i |x_i|Worst-case analysis, adversarial robustness
Lp (General)∄xāƒ—āˆ„p=(āˆ‘āˆ£xi∣p)1/p\|\vec{x}\|_p = \left(\sum |x_i|^p\right)^{1/p}Interpolation between L1 and Lāˆž
L0 (Pseudo-norm)∄xāƒ—āˆ„0=#{i:xi≠0}\|\vec{x}\|_0 = \#\{i : x_i \neq 0\}Cardinality (non-convex, NP-hard to optimize)

Step-by-Step Example: Computing Vector Norms

šŸ“Computing Vector Norms for x = [1, -2, 3, -4]

Given xāƒ—=[1āˆ’23āˆ’4]\vec{x} = \begin{bmatrix} 1 \\ -2 \\ 3 \\ -4 \end{bmatrix}, compute all common norms.

Step 1: L1 Norm

∣vecx∣1=∣1∣+āˆ£āˆ’2∣+∣3∣+āˆ£āˆ’4∣=1+2+3+4=10\\|\\vec{x}\\|_1 = |1| + |-2| + |3| + |-4| = 1 + 2 + 3 + 4 = 10

Step 2: L2 Norm

∣vecx∣2=sqrt12+(āˆ’2)2+32+(āˆ’4)2=sqrt1+4+9+16=sqrt30approx5.477\\|\\vec{x}\\|_2 = \\sqrt{1^2 + (-2)^2 + 3^2 + (-4)^2} = \\sqrt{1 + 4 + 9 + 16} = \\sqrt{30} \\approx 5.477

Step 3: Lāˆž Norm

\\|\\vec{x}\\|_\\infty = \\max(|1|, |-2|, |3|, |-4|) = 4

Step 4: L4 Norm (example of Lp)

∣vecx∣4=(14+24+34+44)1/4=(1+16+81+256)1/4=3541/4approx4.34\\|\\vec{x}\\|_4 = (1^4 + 2^4 + 3^4 + 4^4)^{1/4} = (1 + 16 + 81 + 256)^{1/4} = 354^{1/4} \\approx 4.34

šŸ’”Solution

Key Insight: For any vector, ∄xāƒ—āˆ„āˆžā‰¤āˆ„xāƒ—āˆ„2ā‰¤āˆ„xāƒ—āˆ„1\|\vec{x}\|_\infty \leq \|\vec{x}\|_2 \leq \|\vec{x}\|_1. The Lāˆž norm captures only the largest component, L2 averages all components, and L1 sums all magnitudes. As pp increases, the Lp norm converges to the Lāˆž norm.

Matrix Norms

Frobenius Norm

∣A∣F=sqrtsumi=1msumj=1naij2=sqrttexttr(ATA)=sqrtsumi=1min(m,n)sigmai2\\|A\\|_F = \\sqrt{\\sum_{i=1}^{m}\\sum_{j=1}^{n} a_{ij}^2} = \\sqrt{\\text{tr}(A^TA)} = \\sqrt{\\sum_{i=1}^{\\min(m,n)} \\sigma_i^2}

Here,

  • AA=Matrix of size m Ɨ n
  • aija_{ij}=Element in row i, column j
  • tr\text{tr}=Trace (sum of diagonal elements)
  • σi\sigma_i=Singular values of A

The Frobenius norm treats a matrix as a vector in RmƗn\mathbb{R}^{m \times n} and computes its Euclidean norm. It equals the square root of the sum of squared singular values.

Spectral Norm (Operator 2-Norm)

∣A∣2=sigmamax(A)=sqrtlambdamax(ATA)\\|A\\|_2 = \\sigma_{\\max}(A) = \\sqrt{\\lambda_{\\max}(A^TA)}

Here,

  • σmax⁔(A)\sigma_{\max}(A)=Largest singular value of A
  • Ī»max⁔(ATA)\lambda_{\max}(A^TA)=Largest eigenvalue of A^T A

The spectral norm measures the maximum "stretch" factor of a linear transformation. It equals the largest singular value.

Nuclear Norm

∣Aāˆ£āˆ—=sumi=1rsigmai=texttr(sqrtATA)\\|A\\|_* = \\sum_{i=1}^{r} \\sigma_i = \\text{tr}(\\sqrt{A^TA})

Here,

  • σi\sigma_i=Singular values of A
  • rr=Rank of A

The nuclear norm (also called the trace norm) is the convex envelope of the rank function over the unit spectral norm ball. It is used in matrix completion and low-rank approximation.

Comparison of Matrix Norms

NormFormulaUse Case
Frobenius∄A∄F=āˆ‘aij2\|A\|_F = \sqrt{\sum a_{ij}^2}General matrix similarity, reconstruction error
Spectral∄A∄2=σmax⁔\|A\|_2 = \sigma_{\max}Stability analysis, condition number, Lipschitz constants
Nuclear∄Aāˆ„āˆ—=āˆ‘Ļƒi\|A\|_* = \sum \sigma_iMatrix completion, low-rank recovery
L1 (entry-wise)∄A∄1,1=āˆ‘āˆ£aij∣\|A\|_{1,1} = \sum |a_{ij}|Sparse matrix recovery
Lāˆž (entry-wise)∄Aāˆ„āˆž,āˆž=max⁔∣aij∣\|A\|_{\infty,\infty} = \max |a_{ij}|Bounded perturbations

Induced (Operator) Norms

DfInduced Matrix Norm

An induced norm (also called an operator norm) measures the maximum output norm given an input constrained to unit norm. The most common induced norms are:

Induced NormDefinitionFormula
Induced 2-norm∄A∄2=max⁔∄xāƒ—āˆ„=1∄Axāƒ—āˆ„2\|A\|_2 = \max_{\|\vec{x}\|=1} \|A\vec{x}\|_2σmax⁔(A)\sigma_{\max}(A)
Induced 1-norm∄A∄1=max⁔∄xāƒ—āˆ„=1∄Axāƒ—āˆ„1\|A\|_1 = \max_{\|\vec{x}\|=1} \|A\vec{x}\|_1max⁔jāˆ‘i∣aij∣\max_j \sum_i |a_{ij}|
Induced āˆž-norm∄Aāˆ„āˆž=max⁔∄xāƒ—āˆ„=1∄Axāƒ—āˆ„āˆž\|A\|_\infty = \max_{\|\vec{x}\|=1} \|A\vec{x}\|_\inftymax⁔iāˆ‘j∣aij∣\max_i \sum_j |a_{ij}|

Example: Computing Matrix Norms

šŸ“Matrix Norms for A = [[1, 2], [3, 4]]

Given A=[1234]A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}.

Frobenius Norm:

∣A∣F=sqrt12+22+32+42=sqrt30approx5.477\\|A\\|_F = \\sqrt{1^2 + 2^2 + 3^2 + 4^2} = \\sqrt{30} \\approx 5.477

Spectral Norm: Compute ATAA^TA:

A^TA = \\begin{bmatrix} 1 & 3 \\\\ 2 & 4 \\end{bmatrix} \\begin{bmatrix} 1 & 2 \\\\ 3 & 4 \\end{bmatrix} = \\begin{bmatrix} 10 & 14 \\\\ 14 & 20 \\end{bmatrix}
lambdamax=frac30+sqrt302āˆ’4(200āˆ’196)2=frac30+sqrt8802approx29.37\\lambda_{\\max} = \\frac{30 + \\sqrt{30^2 - 4(200-196)}}{2} = \\frac{30 + \\sqrt{880}}{2} \\approx 29.37
∣A∣2=sqrt29.37approx5.42\\|A\\|_2 = \\sqrt{29.37} \\approx 5.42

Induced 1-norm:

∣A∣1=max(1+3,2+4)=max(4,6)=6\\|A\\|_1 = \\max(1+3, 2+4) = \\max(4, 6) = 6

Induced āˆž-norm:

\\|A\\|_\\infty = \\max(1+2, 3+4) = \\max(3, 7) = 7

Norm Equivalence

ThNorm Equivalence in Finite Dimensions

For any two norms āˆ„ā‹…āˆ„a\|\cdot\|_a and āˆ„ā‹…āˆ„b\|\cdot\|_b on a finite-dimensional vector space VV (with dim⁔(V)=n\dim(V) = n), there exist constants c1,c2>0c_1, c_2 > 0 such that for all xāƒ—āˆˆV\vec{x} \in V:

c1∣vecx∣aleq∣vecx∣bleqc2∣vecx∣ac_1 \\|\\vec{x}\\|_a \\leq \\|\\vec{x}\\|_b \\leq c_2 \\|\\vec{x}\\|_a

Specific bounds for Rn\mathbb{R}^n:

RelationshipBound
∄xāƒ—āˆ„āˆžā‰¤āˆ„xāƒ—āˆ„2\|\vec{x}\|_\infty \leq \|\vec{x}\|_2∄xāƒ—āˆ„2≤n∄xāƒ—āˆ„āˆž\|\vec{x}\|_2 \leq \sqrt{n} \|\vec{x}\|_\infty
∄xāƒ—āˆ„2ā‰¤āˆ„xāƒ—āˆ„1\|\vec{x}\|_2 \leq \|\vec{x}\|_1∄xāƒ—āˆ„1≤n∄xāƒ—āˆ„2\|\vec{x}\|_1 \leq \sqrt{n} \|\vec{x}\|_2
∄xāƒ—āˆ„āˆžā‰¤āˆ„xāƒ—āˆ„1\|\vec{x}\|_\infty \leq \|\vec{x}\|_1∄xāƒ—āˆ„1≤n∄xāƒ—āˆ„āˆž\|\vec{x}\|_1 \leq n \|\vec{x}\|_\infty

Implication: In finite dimensions, all norms define the same topology — convergence in one norm implies convergence in all others. However, the constants matter: the L1 norm can be up to nn times larger than the Lāˆž norm. In infinite dimensions (function spaces), norms need NOT be equivalent.

Unit Ball: Geometric Interpretation

The unit ball of a norm is the set B={xāƒ—:∄xāƒ—āˆ„ā‰¤1}B = \{\vec{x} : \|\vec{x}\| \leq 1\}. Its shape reveals the geometric character of the norm.

NormUnit Ball ShapeGeometry
L1Diamond (rotated square in 2D)Vertices at (±1,0)(\pm 1, 0) and (0,±1)(0, \pm 1)
L2Circle / SphereSmooth, rotationally symmetric
LāˆžSquare / HypercubeVertices at (±1,±1)(\pm 1, \pm 1)
Lp (1<p<āˆž)Rounded polygonInterpolates between diamond and circle

The L1 unit ball's "pointy" vertices at the axes explain why L1 optimization produces sparse solutions — the optimal point is more likely to land on a vertex where some coordinates are exactly zero.

Norms in Optimization

Regularized Loss Function

minvecwmathcalL(vecw)+lambda∣vecw∣pp\\min_{\\vec{w}} \\mathcal{L}(\\vec{w}) + \\lambda \\|\\vec{w}\\|_p^p

Here,

  • L(wāƒ—)\mathcal{L}(\vec{w})=Loss function (e.g., squared error)
  • Ī»\lambda=Regularization strength
  • ∄wāƒ—āˆ„pp\|\vec{w}\|_p^p=p-norm penalty (p=1 or 2 common)
PenaltyNameEffect
λ∄wāƒ—āˆ„1\lambda \|\vec{w}\|_1LassoSparse solutions, automatic feature selection
λ∄wāƒ—āˆ„22\lambda \|\vec{w}\|_2^2RidgeSmall weights, no feature selection
λ∄wāƒ—āˆ„1+Ī»2∄wāƒ—āˆ„22\lambda \|\vec{w}\|_1 + \lambda_2 \|\vec{w}\|_2^2Elastic NetCombines sparsity and smoothness
λ∄wāƒ—āˆ„āˆž\lambda \|\vec{w}\|_\inftyMinimaxBounded maximum coefficient

Distance Metrics

A norm āˆ„ā‹…āˆ„\|\cdot\| induces a distance metric d(xāƒ—,yāƒ—)=∄xāƒ—āˆ’yāƒ—āˆ„d(\vec{x}, \vec{y}) = \|\vec{x} - \vec{y}\| that satisfies:

  1. Non-negativity: d(xāƒ—,yāƒ—)≄0d(\vec{x}, \vec{y}) \geq 0
  2. Identity: d(xāƒ—,yāƒ—)=0ā€…ā€ŠāŸŗā€…ā€Šxāƒ—=yāƒ—d(\vec{x}, \vec{y}) = 0 \iff \vec{x} = \vec{y}
  3. Symmetry: d(xāƒ—,yāƒ—)=d(yāƒ—,xāƒ—)d(\vec{x}, \vec{y}) = d(\vec{y}, \vec{x})
  4. Triangle inequality: d(xāƒ—,zāƒ—)≤d(xāƒ—,yāƒ—)+d(yāƒ—,zāƒ—)d(\vec{x}, \vec{z}) \leq d(\vec{x}, \vec{y}) + d(\vec{y}, \vec{z})
DistanceFormulaUse Case
Manhattan (L1L_1)d1=āˆ‘āˆ£xiāˆ’yi∣d_1 = \sum |x_i - y_i|Grid-based movement, high-dimensional data
Euclidean (L2L_2)d2=āˆ‘(xiāˆ’yi)2d_2 = \sqrt{\sum (x_i - y_i)^2}Geometric distance, clustering
Chebyshev (LāˆžL_\infty)dāˆž=max⁔∣xiāˆ’yi∣d_\infty = \max |x_i - y_i|Warehouse logistics, robotics

Python Implementation

import numpy as np

# --- Vector Norms ---
x = np.array([1, -2, 3, -4])

l1 = np.linalg.norm(x, ord=1)           # L1: 10.0
l2 = np.linalg.norm(x, ord=2)           # L2: sqrt(30) ā‰ˆ 5.477
linf = np.linalg.norm(x, ord=np.inf)    # Lāˆž: 4.0
l4 = np.linalg.norm(x, ord=4)           # L4: 354^(1/4) ā‰ˆ 4.34

print(f"L1: {l1}, L2: {l2:.4f}, Lāˆž: {linf}, L4: {l4:.4f}")

# --- Matrix Norms ---
A = np.array([[1, 2], [3, 4]])

frob = np.linalg.norm(A, ord='fro')     # Frobenius: sqrt(30) ā‰ˆ 5.477
spectral = np.linalg.norm(A, ord=2)     # Spectral: largest singular value
nuclear = np.linalg.norm(A, ord='nuc')  # Nuclear: sum of singular values

print(f"Frobenius: {frob:.4f}")
print(f"Spectral: {spectral:.4f}")
print(f"Nuclear: {nuclear:.4f}")

# --- Induced Norms ---
induced1 = np.linalg.norm(A, ord=1)     # Induced 1-norm: max column sum
induced_inf = np.linalg.norm(A, ord=np.inf)  # Induced āˆž-norm: max row sum
print(f"Induced 1-norm: {induced1}")
print(f"Induced āˆž-norm: {induced_inf}")

# --- Distance Computation ---
from scipy.spatial.distance import cdist, pdist

points = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
dist_l1 = cdist(points, points, metric='cityblock')   # Manhattan
dist_l2 = cdist(points, points, metric='euclidean')   # Euclidean
dist_linf = cdist(points, points, metric='chebyshev') # Chebyshev

# --- Regularization Comparison ---
from sklearn.linear_model import Lasso, Ridge

np.random.seed(42)
X = np.random.randn(100, 10)
true_coef = np.zeros(10)
true_coef[:3] = [3, -2, 1]  # Only 3 non-zero features
y = X @ true_coef + np.random.randn(100) * 0.1

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print(f"Lasso coefficients: {np.round(lasso.coef_, 3)}")  # Sparse
print(f"Ridge coefficients: {np.round(ridge.coef_, 3)}")  # Small but non-zero

Applications in AI/ML

L1 Regularization (Lasso): The L1 norm penalty λ∄wāƒ—āˆ„1\lambda \|\vec{w}\|_1 drives some weights to exactly zero, performing automatic feature selection. This is critical in high-dimensional settings where only a subset of features matter (genomics, NLP feature selection).

L2 Regularization (Ridge): The L2 norm penalty λ∄wāƒ—āˆ„22\lambda \|\vec{w}\|_2^2 shrinks all weights toward zero but never sets them exactly to zero. It prevents overfitting and improves generalization. It is the default regularization in most linear models.

Adversarial Robustness: The Lāˆž norm measures the maximum perturbation allowed in adversarial examples. Models trained with the PGD adversarial method optimize maxā”āˆ„Ī“āˆ„āˆžā‰¤ĻµL(x+Ī“,y)\max_{\|\delta\|_\infty \leq \epsilon} \mathcal{L}(x + \delta, y).

Gradient Clipping: In deep learning, gradients are clipped by norm: if ∄gāƒ—āˆ„>Ļ„\|\vec{g}\| > \tau, then gāƒ—ā†Ļ„ā‹…gāƒ—/∄gāƒ—āˆ„\vec{g} \leftarrow \tau \cdot \vec{g} / \|\vec{g}\|. This prevents exploding gradients and stabilizes training.

Matrix Completion (Netflix Prize): The nuclear norm ∄Aāˆ„āˆ—\|A\|_* is minimized to recover low-rank matrices from partial observations: min⁔X∄Xāˆ„āˆ—\min_{X} \|X\|_* subject to observed entries matching.

Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using L0 norm for optimizationL0 is non-convex, NP-hardUse L1 as convex relaxation
Confusing ∄A∄F\|A\|_F with ∄A∄2\|A\|_2Frobenius sums all singular values, spectral takes the max∄A∄2ā‰¤āˆ„A∄F≤r∄A∄2\|A\|_2 \leq \|A\|_F \leq \sqrt{r} \|A\|_2
Forgetting ∄cxāƒ—āˆ„=∣c∣∄xāƒ—āˆ„\|c\vec{x}\| = |c| \|\vec{x}\|Homogeneity requires absolute value on scalarāˆ„āˆ’3xāƒ—āˆ„=3∄xāƒ—āˆ„\|-3\vec{x}\| = 3\|\vec{x}\|, not āˆ’3∄xāƒ—āˆ„-3\|\vec{x}\|
Assuming all norms are equal in infinite dimensionsNorm equivalence requires finite dimensionsIn function spaces, different norms define different topologies
Using L2 norm for sparse feature selectionL2 shrinks but doesn't zero out featuresUse L1 (Lasso) or Elastic Net
Ignoring norm when computing condition numberĪŗ(A)=∄Aāˆ„ā‹…āˆ„Aāˆ’1∄\kappa(A) = \|A\| \cdot \|A^{-1}\| depends on the normChoose the norm appropriate for your error metric

Interview Questions

Q1: Why does L1 regularization produce sparse solutions while L2 does not?

šŸ’”Solution

Geometrically, the L1 unit ball is a diamond with vertices on the axes. The level curves of the loss function are more likely to intersect the L1 ball at a vertex, where some coordinates are exactly zero. The L2 ball is a circle — level curves typically intersect it at points where all coordinates are non-zero.

Q2: What is the relationship between the spectral norm and the Frobenius norm?

šŸ’”Solution

For any matrix AA: ∄A∄2ā‰¤āˆ„A∄F≤rā‹…āˆ„A∄2\|A\|_2 \leq \|A\|_F \leq \sqrt{r} \cdot \|A\|_2, where r=rank(A)r = \text{rank}(A). The spectral norm equals the largest singular value, while the Frobenius norm equals the root-sum-of-squares of all singular values.

Q3: When would you use the nuclear norm instead of the Frobenius norm?

šŸ’”Solution

Use the nuclear norm when you want to encourage low-rank structure in a matrix. The nuclear norm is the tightest convex relaxation of the rank function. Applications include matrix completion (e.g., recommender systems), denoising, and dimensionality reduction.

Q4: Prove that the Lāˆž norm is indeed a norm on Rn\mathbb{R}^n.

šŸ’”Solution

  1. Non-negativity: max⁔∣xiāˆ£ā‰„0\max |x_i| \geq 0 since each ∣xiāˆ£ā‰„0|x_i| \geq 0.
  2. Definiteness: max⁔∣xi∣=0ā€…ā€ŠāŸ¹ā€…ā€Šxi=0\max |x_i| = 0 \implies x_i = 0 for all iā€…ā€ŠāŸ¹ā€…ā€Šxāƒ—=0āƒ—i \implies \vec{x} = \vec{0}.
  3. Homogeneity: ∄αxāƒ—āˆ„āˆž=max⁔∣αxi∣=∣α∣max⁔∣xi∣=∣α∣∄xāƒ—āˆ„āˆž\|\alpha \vec{x}\|_\infty = \max |\alpha x_i| = |\alpha| \max |x_i| = |\alpha| \|\vec{x}\|_\infty.
  4. Triangle inequality: ∄xāƒ—+yāƒ—āˆ„āˆž=max⁔∣xi+yiāˆ£ā‰¤max⁔(∣xi∣+∣yi∣)≤max⁔∣xi∣+max⁔∣yi∣=∄xāƒ—āˆ„āˆž+∄yāƒ—āˆ„āˆž\|\vec{x} + \vec{y}\|_\infty = \max |x_i + y_i| \leq \max (|x_i| + |y_i|) \leq \max |x_i| + \max |y_i| = \|\vec{x}\|_\infty + \|\vec{y}\|_\infty.

Q5: What is the condition number of a matrix, and why does the norm matter?

šŸ’”Solution

The condition number is Īŗ(A)=∄Aāˆ„ā‹…āˆ„Aāˆ’1∄\kappa(A) = \|A\| \cdot \|A^{-1}\|. It measures how sensitive the solution of Axāƒ—=bāƒ—A\vec{x} = \vec{b} is to perturbations in bāƒ—\vec{b}. A large condition number indicates an ill-conditioned problem. The value depends on the norm chosen — typically the spectral norm or the Lāˆž norm is used.

Q6: How do norms relate to convergence in optimization algorithms?

šŸ’”Solution

Convergence of an iterative algorithm xāƒ—(k)→xāƒ—āˆ—\vec{x}^{(k)} \to \vec{x}^* is defined with respect to a norm: ∄xāƒ—(k)āˆ’xāƒ—āˆ—āˆ„ā†’0\|\vec{x}^{(k)} - \vec{x}^*\| \to 0. In finite dimensions, convergence in one norm implies convergence in all norms. However, the rate of convergence (and practical numerical behavior) can differ significantly between norms.

Practice Problems

Problem 1: Compute the L1, L2, Lāˆž, and L4 norms of xāƒ—=[3āˆ’405]\vec{x} = \begin{bmatrix} 3 \\ -4 \\ 0 \\ 5 \end{bmatrix}.

šŸ’”Solution

∣vecx∣1=3+4+0+5=12\\|\\vec{x}\\|_1 = 3 + 4 + 0 + 5 = 12
∣vecx∣2=sqrt9+16+0+25=sqrt50=5sqrt2approx7.07\\|\\vec{x}\\|_2 = \\sqrt{9 + 16 + 0 + 25} = \\sqrt{50} = 5\\sqrt{2} \\approx 7.07
\\|\\vec{x}\\|_\\infty = \\max(3, 4, 0, 5) = 5
∣vecx∣4=(81+256+0+625)1/4=9621/4approx5.57\\|\\vec{x}\\|_4 = (81 + 256 + 0 + 625)^{1/4} = 962^{1/4} \\approx 5.57

Verify: ∄xāƒ—āˆ„āˆžā‰¤āˆ„xāƒ—āˆ„4ā‰¤āˆ„xāƒ—āˆ„2ā‰¤āˆ„xāƒ—āˆ„1\|\vec{x}\|_\infty \leq \|\vec{x}\|_4 \leq \|\vec{x}\|_2 \leq \|\vec{x}\|_1: 5≤5.57≤7.07≤125 \leq 5.57 \leq 7.07 \leq 12 āœ“

Problem 2: Verify the Cauchy-Schwarz inequality for xāƒ—=[123]\vec{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} and yāƒ—=[456]\vec{y} = \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix}.

šŸ’”Solution

vecxcdotvecy=1(4)+2(5)+3(6)=32\\vec{x} \\cdot \\vec{y} = 1(4) + 2(5) + 3(6) = 32
∣vecx∣2=sqrt1+4+9=sqrt14approx3.74\\|\\vec{x}\\|_2 = \\sqrt{1 + 4 + 9} = \\sqrt{14} \\approx 3.74
∣vecy∣2=sqrt16+25+36=sqrt77approx8.77\\|\\vec{y}\\|_2 = \\sqrt{16 + 25 + 36} = \\sqrt{77} \\approx 8.77
∣vecxcdotvecy∣=32leqsqrt14cdotsqrt77=sqrt1078approx32.83|\\vec{x} \\cdot \\vec{y}| = 32 \\leq \\sqrt{14} \\cdot \\sqrt{77} = \\sqrt{1078} \\approx 32.83

Cauchy-Schwarz holds: 32≤32.8332 \leq 32.83 āœ“

Problem 3: Compute the Frobenius and spectral norms of A=[2003]A = \begin{bmatrix} 2 & 0 \\\\ 0 & 3 \end{bmatrix}.

šŸ’”Solution

∣A∣F=sqrt4+0+0+9=sqrt13approx3.61\\|A\\|_F = \\sqrt{4 + 0 + 0 + 9} = \\sqrt{13} \\approx 3.61

Since AA is diagonal, its singular values are ∣2∣=2|2| = 2 and ∣3∣=3|3| = 3.

∣A∣2=sigmamax=3\\|A\\|_2 = \\sigma_{\\max} = 3
∣Aāˆ£āˆ—=sigma1+sigma2=2+3=5\\|A\\|_* = \\sigma_1 + \\sigma_2 = 2 + 3 = 5

Problem 4: Show that for any vector xāƒ—āˆˆRn\vec{x} \in \mathbb{R}^n: ∄xāƒ—āˆ„āˆžā‰¤āˆ„xāƒ—āˆ„2≤n∄xāƒ—āˆ„āˆž\|\vec{x}\|_\infty \leq \|\vec{x}\|_2 \leq \sqrt{n} \|\vec{x}\|_\infty.

šŸ’”Solution

Lower bound: Let j=arg⁔max⁔i∣xi∣j = \arg\max_i |x_i|. Then:

\\|\\vec{x}\\|_2 = \\sqrt{\\sum_i x_i^2} \\geq \\sqrt{x_j^2} = |x_j| = \\|\\vec{x}\\|_\\infty

Upper bound: Since ∣xiāˆ£ā‰¤āˆ„xāƒ—āˆ„āˆž|x_i| \leq \|\vec{x}\|_\infty for all ii:

\\|\\vec{x}\\|_2 = \\sqrt{\\sum_i x_i^2} \\leq \\sqrt{\\sum_i \\|\\vec{x}\\|_\\infty^2} = \\sqrt{n \\|\\vec{x}\\|_\\infty^2} = \\sqrt{n} \\|\\vec{x}\\|_\\infty

Quick Reference

ConceptFormulaKey Property
L1 Norm∄xāƒ—āˆ„1=āˆ‘āˆ„xi∄\|\vec{x}\|_1 = \sum \|x_i\|Promotes sparsity
L2 Norm∄xāƒ—āˆ„2=āˆ‘xi2\|\vec{x}\|_2 = \sqrt{\sum x_i^2}Promotes smoothness
Lāˆž Norm∄xāƒ—āˆ„āˆž=max⁔∄xi∄\|\vec{x}\|_\infty = \max \|x_i\|Worst-case measure
Lp Norm∄xāƒ—āˆ„p=(āˆ‘āˆ„xi∄p)1/p\|\vec{x}\|_p = (\sum \|x_i\|^p)^{1/p}Interpolates L1–Lāˆž
Frobenius∄A∄F=tr(ATA)\|A\|_F = \sqrt{\text{tr}(A^TA)}Matrix Euclidean norm
Spectral∄A∄2=σmax⁔(A)\|A\|_2 = \sigma_{\max}(A)Maximum stretch factor
Nuclear∄Aāˆ„āˆ—=āˆ‘Ļƒi\|A\|_* = \sum \sigma_iLow-rank relaxation
Induced 1-norm∄A∄1=max⁔jāˆ‘i∄aij∄\|A\|_1 = \max_j \sum_i \|a_{ij}\|Max column sum
Induced āˆž-norm∄Aāˆ„āˆž=max⁔iāˆ‘j∄aij∄\|A\|_\infty = \max_i \sum_j \|a_{ij}\|Max row sum
Condition NumberĪŗ(A)=∄Aāˆ„ā‹…āˆ„Aāˆ’1∄\kappa(A) = \|A\| \cdot \|A^{-1}\|Numerical stability

Cross-References

  • Vector Spaces: Foundation for defining norms
  • Eigenvalues and Singular Values: Used to compute spectral and nuclear norms
  • Inner Products: Cauchy-Schwarz inequality connects norms to inner products
  • Optimization: Regularization, gradient descent, constrained optimization
  • Machine Learning: Lasso, Ridge, Elastic Net, adversarial robustness
  • Numerical Linear Algebra: Condition numbers, stability analysis
  • Clustering: K-means uses Euclidean norm, K-medians uses L1
  • Dimensionality Reduction: PCA minimizes Frobenius norm reconstruction error
Lesson Progress15 / 100