← Math|96 of 100
Advanced Topics

Tensor Calculus

Understand tensor operations, einsum notation, and their applications in deep learning.

📂 Tensors📖 Lesson 96 of 100🎓 Free Course

Advertisement

Tensor Calculus

ℹ️ Why It Matters

Tensor calculus is the mathematical language of multi-dimensional data manipulation. Every operation in deep learning—matrix multiplication, convolutions, attention mechanisms—is fundamentally a tensor operation. Mastering tensor calculus unlocks understanding of how frameworks like PyTorch and TensorFlow compute gradients through complex architectures, and enables implementation of custom operations that go beyond standard library functions. The Einstein summation convention provides a compact, powerful notation for expressing complex multi-index operations that would otherwise require cumbersome summation symbols.


Core Definitions

DfTensor

A tensor is a multidimensional array of numbers that transforms according to specific rules under a change of coordinates. A tensor of rank kk (also called order-kk) has kk indices. A scalar is rank-0, a vector is rank-1, a matrix is rank-2, and higher-order tensors have 3 or more indices. Formally, a type (p,q)(p,q) tensor is a multilinear map from qq copies of the dual space and pp copies of the vector space to the scalar field.

DfEinstein Summation Convention

When an index appears exactly twice in a single term, summation over that index is implied. For example, aikbkja_{ik}b_{kj} means kaikbkj\sum_{k} a_{ik}b_{kj}. Free indices (appearing once) label the components of the result. Dummy indices (appearing twice) are summed over and do not appear in the result. This convention eliminates the need for explicit sigma notation and makes tensor manipulations far more concise and readable.

DfCovariant Tensor

A covariant tensor (or lower-index tensor) transforms with the Jacobian of the coordinate change. Components are written with subscripts: TijT_{ij}. Under a coordinate transformation xxx \to x', a rank-2 covariant tensor transforms as Tij=xkxixlxjTklT'_{ij} = \frac{\partial x^k}{\partial x'^i}\frac{\partial x^l}{\partial x'^j}T_{kl}. Covariant tensors are "measured" quantities—they measure distances, angles, and other geometric properties.

DfContravariant Tensor

A contravariant tensor (or upper-index tensor) transforms with the inverse Jacobian. Components are written with superscripts: TijT^{ij}. Under coordinate transformation, Tij=xixkxjxlTklT'^{ij} = \frac{\partial x'^i}{\partial x^k}\frac{\partial x'^j}{\partial x^l}T^{kl}. Contravariant tensors represent "physical" quantities like velocity and position that transform inversely to the coordinate system.

DfMetric Tensor

The metric tensor gijg_{ij} is a symmetric positive-definite covariant tensor that defines distances and angles on a manifold. The infinitesimal distance is ds2=gijdxidxjds^2 = g_{ij}dx^i dx^j. The metric also defines raising and lowering of indices: vi=gijvjv^i = g^{ij}v_j and wi=gijwjw_i = g_{ij}w^j. In Euclidean space, the metric tensor is simply the identity matrix, but in curved spaces it encodes the geometry.

DfTensor Contraction

Tensor contraction is the operation of summing over a pair of indices—one upper and one lower. For a mixed tensor T jiT^i_{\ j}, contraction yields a scalar: T ii=iT iiT^i_{\ i} = \sum_i T^i_{\ i}. Contraction generalizes the trace operation and is the fundamental operation underlying matrix multiplication, dot products, and all einsum computations.


Key Formulas

Einstein Summation Notation

cij=kaikbkjaikbkjc_{ij} = \sum_k a_{ik}b_{kj} \equiv a_{ik}b_{kj}

Here,

  • aika_{ik}=First tensor component with row index i and summation index k
  • bkjb_{kj}=Second tensor component with summation index k and column index j
  • cijc_{ij}=Result tensor with free indices i and j

Tensor Contraction

T ji=gikTkjT^{i}_{\ j} = g^{ik}T_{kj}

Here,

  • gikg^{ik}=Inverse metric tensor (raises the index)
  • TkjT_{kj}=Covariant tensor of rank 2
  • TjiT^{i}_{ j}=Mixed tensor with one upper and one lower index

Outer Product (Tensor Product)

(ab)ij=aibj(a \otimes b)_{ij} = a_i b_j

Here,

  • aia_i=Component of the first vector (rank-1 tensor)
  • bjb_j=Component of the second vector (rank-1 tensor)
  • (aotimesb)ij(a otimes b)_{ij}=Resulting rank-2 tensor

Trace

tr(A)=A ii=iAii\text{tr}(A) = A^{i}_{\ i} = \sum_i A_{ii}

Here,

  • AiiA_{ii}=Diagonal elements of a rank-2 tensor
  • AiiA^{i}_{ i}=Contracted index notation for trace

Frobenius Norm

AF=AijAij=tr(ATA)\|A\|_F = \sqrt{A_{ij}A_{ij}} = \sqrt{\text{tr}(A^T A)}

Here,

  • AijA_{ij}=Elements of the matrix
  • ATA^T=Transpose of A

Kronecker Product

(AB)ik,jl=AijBkl(A \otimes B)_{ik,jl} = A_{ij}B_{kl}

Here,

  • AijA_{ij}=Element of matrix A
  • BklB_{kl}=Element of matrix B
  • (AotimesB)ik,jl(A otimes B)_{ik,jl}=Block structure of Kronecker product

Index Lowering and Raising

vi=gijvj,vi=gijvjv_i = g_{ij}v^j, \quad v^i = g^{ij}v_j

Here,

  • gijg_{ij}=Metric tensor (lowers indices)
  • gijg^{ij}=Inverse metric tensor (raises indices)
  • vi,viv_i, v^i=Covariant and contravariant components

Important Theorems

ThUniversal Approximation Theorem (Tensor Version)

A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of Rn\mathbb{R}^n to arbitrary accuracy, provided the activation function is non-constant, bounded, and monotonically-increasing. The weights and biases are tensor parameters optimized via backpropagation. This theorem guarantees that neural networks are universal function approximators.

ThTensor Rank Decomposition

Any rank-rr tensor TRn1×n2××nkT \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_k} can be decomposed as a sum of rr rank-1 tensors: T=s=1ras(1)as(2)as(k)T = \sum_{s=1}^{r} a_s^{(1)} \otimes a_s^{(2)} \otimes \cdots \otimes a_s^{(k)}. The minimum rr needed is called the tensor rank. For matrices (rank-2 tensors), this is the standard matrix rank. For higher-order tensors, tensor rank can differ from the notion of multilinear rank.

ThEinstein Summation Equivalence

Every Einstein summation expression can be written as a sequence of tensor contractions, which are themselves compositions of matrix multiplications, transpositions, and reshaping operations. This guarantees that any einsum expression can be implemented as an efficient sequence of BLAS/LAPACK operations, ensuring computational tractability for arbitrary tensor expressions.

ThPolyadic Decomposition (CP Decomposition)

A tensor TRI×J×KT \in \mathbb{R}^{I \times J \times K} can be decomposed as Tr=1RarbrcrT \approx \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, where arRI\mathbf{a}_r \in \mathbb{R}^I, brRJ\mathbf{b}_r \in \mathbb{R}^J, crRK\mathbf{c}_r \in \mathbb{R}^K, and \circ denotes the outer product. The minimum RR for exact decomposition is the tensor CP rank. This decomposition generalizes low-rank matrix factorization to higher dimensions.


Worked Examples

📝Matrix Multiplication via Einstein Summation

Problem: Compute C=ABC = AB where AR3×4A \in \mathbb{R}^{3 \times 4} and BR4×5B \in \mathbb{R}^{4 \times 5} using Einstein summation notation.

Solution:

The matrix multiplication is expressed as Cij=AikBkjC_{ij} = A_{ik}B_{kj} with summation over kk.

In einsum notation: 'ik,kj->ij'

Step 1: Identify free and dummy indices.

  • Free indices: ii (row of CC), jj (column of CC)
  • Dummy index: kk (summed over dimension 4)

Step 2: Compute each element:

Cij=k=14AikBkj=Ai1B1j+Ai2B2j+Ai3B3j+Ai4B4jC_{ij} = \sum_{k=1}^{4} A_{ik}B_{kj} = A_{i1}B_{1j} + A_{i2}B_{2j} + A_{i3}B_{3j} + A_{i4}B_{4j}

Step 3: The result CR3×5C \in \mathbb{R}^{3 \times 5}.

Concrete example: If A=(123456789101112)A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \end{pmatrix} and B=(10011100)B = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 0 & 0 \end{pmatrix}, then:

C11=11+20+31+40=4C_{11} = 1 \cdot 1 + 2 \cdot 0 + 3 \cdot 1 + 4 \cdot 0 = 4C12=10+21+31+40=5C_{12} = 1 \cdot 0 + 2 \cdot 1 + 3 \cdot 1 + 4 \cdot 0 = 5

And so on for all 15 elements.

📝Batch Matrix Multiplication

Problem: Given batches of matrices ARB×M×KA \in \mathbb{R}^{B \times M \times K} and BRB×K×NB \in \mathbb{R}^{B \times K \times N}, express batch matrix multiplication using einsum.

Solution:

Einsum notation: 'bmk,bkn->bmn'

Step 1: For each batch index bb, compute the standard matrix product:

Cbmn=k=1KAbmkBbknC_{bmn} = \sum_{k=1}^{K} A_{bmk}B_{bkn}

Step 2: The batch index bb appears only once in the output, so it is NOT summed over.

Step 3: Result shape: CRB×M×NC \in \mathbb{R}^{B \times M \times N}

Example values: If B=32B=32, M=128M=128, K=64K=64, N=256N=256, this computes 32 independent matrix multiplications of size 128×64128 \times 64 times 64×25664 \times 256.

Computational advantage: Batch matrix multiplication can be parallelized on GPUs, processing all BB matrix multiplications simultaneously. The einsum notation makes this parallelism explicit—the batch dimension bb is free (not summed), so operations along different batches are independent.

📝Dot Product and Outer Product

Problem: For vectors aRn\mathbf{a} \in \mathbb{R}^n and bRn\mathbf{b} \in \mathbb{R}^n, express both the dot product and outer product using Einstein summation.

Solution:

Dot product: c=aibic = a_i b_i (in einsum: 'i,i->')

This sums over the single index: c=i=1naibic = \sum_{i=1}^{n} a_i b_i

The result is a scalar (rank-0 tensor).

Outer product: Cij=aibjC_{ij} = a_i b_j (in einsum: 'i,j->ij')

No indices are summed; both ii and jj are free.

The result is a rank-2 tensor: CRn×nC \in \mathbb{R}^{n \times n}.

Key distinction: The dot product contracts both indices into a scalar. The outer product keeps both indices free, producing a matrix.

Example: For a=(1,2,3)\mathbf{a} = (1, 2, 3) and b=(4,5,6)\mathbf{b} = (4, 5, 6):

Dot product: 14+25+36=321 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32

Outer product: (45681012121518)\begin{pmatrix} 4 & 5 & 6 \\ 8 & 10 & 12 \\ 12 & 15 & 18 \end{pmatrix}

📝Trace Using Index Contraction

Problem: Express the trace of a matrix ARn×nA \in \mathbb{R}^{n \times n} using Einstein summation, and compute it for a specific matrix.

Solution:

Einsum notation: 'ii->' or equivalently A iiA^{i}_{\ i}

Step 1: The index ii appears twice (as row and column), so it is summed over:

tr(A)=i=1nAii\text{tr}(A) = \sum_{i=1}^{n} A_{ii}

Step 2: For A=(3124)A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}:

tr(A)=A11+A22=3+4=7\text{tr}(A) = A_{11} + A_{22} = 3 + 4 = 7

Step 3: The trace is invariant under cyclic permutation: tr(AB)=tr(BA)\text{tr}(AB) = \text{tr}(BA) even when ABBAAB \neq BA.

Step 4: Generalization: The trace of a product of kk matrices is invariant under cyclic permutation:

tr(A1A2Ak)=tr(A2A3AkA1)=\text{tr}(A_1 A_2 \cdots A_k) = \text{tr}(A_2 A_3 \cdots A_k A_1) = \cdots

This property is used extensively in matrix calculus and backpropagation.

📝Computing Tensor Gradients

Problem: In a neural network layer y=Wx+b\mathbf{y} = W\mathbf{x} + \mathbf{b}, compute the gradient LW\frac{\partial L}{\partial W} given the loss gradient Ly\frac{\partial L}{\partial \mathbf{y}}.

Solution:

Step 1: Write in index notation: yi=Wijxj+biy_i = W_{ij}x_j + b_i (Einstein summation over jj).

Step 2: Apply the chain rule:

LWij=LykykWij\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial y_k} \cdot \frac{\partial y_k}{\partial W_{ij}}

Step 3: Compute ykWij\frac{\partial y_k}{\partial W_{ij}}:

ykWij=(Wklxl+bk)Wij=δkiδjlxl=δkixj\frac{\partial y_k}{\partial W_{ij}} = \frac{\partial (W_{kl}x_l + b_k)}{\partial W_{ij}} = \delta_{ki}\delta_{jl}x_l = \delta_{ki}x_j

where δki\delta_{ki} is the Kronecker delta (1 if k=ik=i, 0 otherwise).

Step 4: Substitute:

LWij=Lykδkixj=Lyixj\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial y_k} \delta_{ki} x_j = \frac{\partial L}{\partial y_i} x_j

Step 5: In matrix form: LW=LyxT\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \mathbf{y}} \mathbf{x}^T (outer product)

Einsum notation: 'k,i->ki' where kk indexes L/y\partial L/\partial \mathbf{y} and ii indexes x\mathbf{x}.


Practice Problems

📝Problem 1: Batch Attention Computation

Problem: In transformer attention, given QRB×T×dQ \in \mathbb{R}^{B \times T \times d}, KRB×S×dK \in \mathbb{R}^{B \times S \times d}, and VRB×S×dV \in \mathbb{R}^{B \times S \times d}, write the full attention computation softmax(QKT/d)V\text{softmax}(QK^T/\sqrt{d})V using einsum notation.

Solution:

Step 1: Compute attention scores: Abts=dQbtdKbsdA_{bts} = \sum_d Q_{btd}K_{bsd} Einsum: 'btd,bsd->bts'

Step 2: Scale by d\sqrt{d}: AbtsAbts/dA_{bts} \leftarrow A_{bts}/\sqrt{d}

Step 3: Apply softmax over the ss dimension: A^bts=softmaxs(Abts)\hat{A}_{bts} = \text{softmax}_s(A_{bts})

Step 4: Compute weighted sum: Obtd=sA^btsVbsdO_{btd} = \sum_s \hat{A}_{bts}V_{bsd} Einsum: 'bts,bsd->btd'

Combined: scores = 'btd,bsd->bts'; out = 'bts,bsd->btd'

Result shape: ORB×T×dO \in \mathbb{R}^{B \times T \times d}

The attention mechanism is a batched weighted average over source positions, where weights are determined by query-key similarity.

📝Problem 2: Frobenius Norm via Einstein Summation

Problem: Express the Frobenius norm of a matrix ARm×nA \in \mathbb{R}^{m \times n} using Einstein summation, and verify that it equals tr(ATA)\sqrt{\text{tr}(A^T A)}.

Solution:

The Frobenius norm squared is:

AF2=AijAij=i=1mj=1nAij2\|A\|_F^2 = A_{ij}A_{ij} = \sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2

Einsum notation: 'ij,ij->'

Verification: (ATA)jj=iAij2(A^T A)_{jj} = \sum_i A_{ij}^2, so:

tr(ATA)=jiAij2=i,jAij2=AF2\text{tr}(A^T A) = \sum_j \sum_i A_{ij}^2 = \sum_{i,j} A_{ij}^2 = \|A\|_F^2

Therefore AF=tr(ATA)\|A\|_F = \sqrt{\text{tr}(A^T A)}

Alternative expression: AF=iσi2\|A\|_F = \sqrt{\sum_i \sigma_i^2} where σi\sigma_i are the singular values of AA.

📝Problem 3: Tensor Product of Two Vectors

Problem: Compute the tensor product uv\mathbf{u} \otimes \mathbf{v} where u=(1,2,3)T\mathbf{u} = (1, 2, 3)^T and v=(4,5)T\mathbf{v} = (4, 5)^T.

Solution:

The tensor product is a rank-2 tensor: (uv)ij=uivj(u \otimes v)_{ij} = u_i v_j

uv=(141524253435)=(458101215)u \otimes v = \begin{pmatrix} 1 \cdot 4 & 1 \cdot 5 \\ 2 \cdot 4 & 2 \cdot 5 \\ 3 \cdot 4 & 3 \cdot 5 \end{pmatrix} = \begin{pmatrix} 4 & 5 \\ 8 & 10 \\ 12 & 15 \end{pmatrix}

Result shape: 3×23 \times 2 (outer product of a 3-vector with a 2-vector).

Einsum: 'i,j->ij' with u=(1,2,3)u = (1,2,3) and v=(4,5)v = (4,5).

Key properties:

  • rank(uv)=1\text{rank}(u \otimes v) = 1 (rank-1 matrix)
  • tr(uv)=iuivi=uv\text{tr}(u \otimes v) = \sum_i u_i v_i = \mathbf{u} \cdot \mathbf{v} (dot product)
  • uvF=uv\|u \otimes v\|_F = \|\mathbf{u}\| \cdot \|\mathbf{v}\|

📝Problem 4: Index Raising and Lowering

Problem: Given a metric gij=(2113)g_{ij} = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} and covariant vector vj=(3,7)v_j = (3, 7), find the contravariant components viv^i.

Solution:

Step 1: Compute the inverse metric gijg^{ij}.

gij=1det(g)(3112)g^{ij} = \frac{1}{\det(g)} \begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix}
det(g)=2311=5\det(g) = 2 \cdot 3 - 1 \cdot 1 = 5
gij=15(3112)=(0.60.20.20.4)g^{ij} = \frac{1}{5}\begin{pmatrix} 3 & -1 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 0.6 & -0.2 \\ -0.2 & 0.4 \end{pmatrix}

Step 2: Raise the index: vi=gijvjv^i = g^{ij}v_j

v1=g11v1+g12v2=0.6×3+(0.2)×7=1.81.4=0.4v^1 = g^{11}v_1 + g^{12}v_2 = 0.6 \times 3 + (-0.2) \times 7 = 1.8 - 1.4 = 0.4
v2=g21v1+g22v2=(0.2)×3+0.4×7=0.6+2.8=2.2v^2 = g^{21}v_1 + g^{22}v_2 = (-0.2) \times 3 + 0.4 \times 7 = -0.6 + 2.8 = 2.2

Result: vi=(0.4,2.2)v^i = (0.4, 2.2)

Verification: Lower back: vj=gijvi=g11(0.4)+g12(2.2)=2(0.4)+1(2.2)=0.8+2.2=3v_j = g_{ij}v^i = g_{11}(0.4) + g_{12}(2.2) = 2(0.4) + 1(2.2) = 0.8 + 2.2 = 3

And v2=g21(0.4)+g22(2.2)=1(0.4)+3(2.2)=0.4+6.6=7v_2 = g_{21}(0.4) + g_{22}(2.2) = 1(0.4) + 3(2.2) = 0.4 + 6.6 = 7

📝Problem 5: Kronecker Product Properties

Problem: Given A=(1234)A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} and B=(0567)B = \begin{pmatrix} 0 & 5 \\ 6 & 7 \end{pmatrix}, compute ABA \otimes B and verify that tr(AB)=tr(A)tr(B)\text{tr}(A \otimes B) = \text{tr}(A) \cdot \text{tr}(B).

Solution:

Step 1: Compute the Kronecker product:

AB=(1B2B3B4B)=(0501067121401502018212428)A \otimes B = \begin{pmatrix} 1 \cdot B & 2 \cdot B \\ 3 \cdot B & 4 \cdot B \end{pmatrix} = \begin{pmatrix} 0 & 5 & 0 & 10 \\ 6 & 7 & 12 & 14 \\ 0 & 15 & 0 & 20 \\ 18 & 21 & 24 & 28 \end{pmatrix}

Step 2: Compute trace of ABA \otimes B:

tr(AB)=0+7+0+28=35\text{tr}(A \otimes B) = 0 + 7 + 0 + 28 = 35

Step 3: Compute tr(A)tr(B)\text{tr}(A) \cdot \text{tr}(B):

tr(A)=1+4=5,tr(B)=0+7=7\text{tr}(A) = 1 + 4 = 5, \quad \text{tr}(B) = 0 + 7 = 7
tr(A)tr(B)=5×7=35\text{tr}(A) \cdot \text{tr}(B) = 5 \times 7 = 35

Verification: tr(AB)=35=tr(A)tr(B)\text{tr}(A \otimes B) = 35 = \text{tr}(A) \cdot \text{tr}(B)

Additional properties of Kronecker product:

  • (AB)(CD)=(AC)(BD)(A \otimes B)(C \otimes D) = (AC) \otimes (BD) (mixed-product property)
  • rank(AB)=rank(A)rank(B)\text{rank}(A \otimes B) = \text{rank}(A) \cdot \text{rank}(B)
  • (AB)1=A1B1(A \otimes B)^{-1} = A^{-1} \otimes B^{-1} (when inverses exist)

Common Mistakes

MistakeCorrect Approach
Confusing aijbija_{ij}b_{ij} (dot product) with aijbjka_{ij}b_{jk} (matrix multiply)Check which indices are repeated vs. free
Forgetting that einsum 'ij,kl->ijkl' is outer product, not contractionOnly repeated indices in the same term are summed
Assuming AikBkj=BkjAikA_{ik}B_{kj} = B_{kj}A_{ik} (matrix multiply is not commutative)Order matters: ABBAAB \neq BA in general
Using same index name for different dimensionsEach dummy index must appear exactly twice; use distinct names
Ignoring index alignment in batch operationsBatch dimensions must match or be broadcastable
Confusing covariant (lower) and contravariant (upper) indicesUpper indices transform with Jacobian, lower with inverse Jacobian
Forgetting that einsum does not broadcastAll dimensions must match or appear in the output

Connections to Machine Learning

ℹ️ Connections to Machine Learning

Tensor calculus is the foundation of deep learning computation. Backpropagation computes gradients that are themselves tensors, and the chain rule becomes a tensor contraction. Attention mechanisms in transformers are tensor operations: QKTQK^T is a batched outer product over sequence positions. Convolutional layers apply weight tensors to input tensors via einsum-like operations. Understanding tensor calculus enables efficient implementation of custom gradient functions, understanding of framework internals, and development of novel architectures. The tensor rank decomposition connects to model compression: low-rank factorization of weight matrices reduces parameters while preserving function. Tensor networks (MPS, Tensor Train) provide structured parameterizations that reduce the exponential cost of representing high-order correlations.


Exam/Interview Questions

Q1: Write the einsum notation for computing the matrix C=ABTC = AB^T where ARm×kA \in \mathbb{R}^{m \times k} and BRn×kB \in \mathbb{R}^{n \times k}.

Answer: 'ik,jk->ij'. The summation is over kk (the shared dimension), and the free indices ii and jj give the row and column of CRm×nC \in \mathbb{R}^{m \times n}.


Q2: Explain the difference between a covariant and contravariant tensor.

Answer: A covariant tensor (lower indices TijT_{ij}) transforms with the Jacobian of the coordinate change: Tij=xkxixlxjTklT'_{ij} = \frac{\partial x^k}{\partial x'^i}\frac{\partial x^l}{\partial x'^j}T_{kl}. A contravariant tensor (upper indices TijT^{ij}) transforms with the inverse Jacobian: Tij=xixkxjxlTklT'^{ij} = \frac{\partial x'^i}{\partial x^k}\frac{\partial x'^j}{\partial x^l}T^{kl}. Intuitively, covariant components shrink when the coordinate system expands, while contravariant components expand.


Q3: What is the tensor product of two vectors, and how does it relate to the outer product?

Answer: The tensor product uv\mathbf{u} \otimes \mathbf{v} produces a rank-2 tensor with components (uv)ij=uivj(u \otimes v)_{ij} = u_i v_j. This is identical to the outer product in linear algebra. The tensor product generalizes to higher-order tensors: the product of a rank-2 and rank-1 tensor gives a rank-3 tensor.


Q4: Why is the Frobenius norm of a matrix equal to tr(ATA)\sqrt{\text{tr}(A^T A)}?

Answer: (ATA)jj=iAij2(A^T A)_{jj} = \sum_i A_{ij}^2, so tr(ATA)=jiAij2=i,jAij2=AF2\text{tr}(A^T A) = \sum_j \sum_i A_{ij}^2 = \sum_{i,j} A_{ij}^2 = \|A\|_F^2. Taking the square root gives the Frobenius norm. This identity is useful for computing norms without explicitly forming the full matrix.


Q5: In transformer attention, the operation is softmax(QKT/d)V\text{softmax}(QK^T/\sqrt{d})V. Explain what tensor operations are involved.

Answer: QKTQK^T is a batched matrix multiplication with einsum 'btd,bsd->bts', producing attention scores. Division by d\sqrt{d} is elementwise scaling. Softmax is applied over the ss (source) dimension. The final multiplication with VV is 'bts,bsd->btd', a weighted sum over source positions. All operations are tensor contractions or elementwise.


Q6: What is the rank of a tensor, and why does it matter for model compression?

Answer: The tensor rank is the minimum number of rank-1 tensors needed to express a tensor as a sum: T=s=1ras(1)as(k)T = \sum_{s=1}^{r} a_s^{(1)} \otimes \cdots \otimes a_s^{(k)}. A low-rank weight tensor can be decomposed into smaller factors, reducing the number of parameters from O(n1n2nk)O(n_1 n_2 \cdots n_k) to O(rni)O(r \sum n_i). This is the basis of tensor decomposition methods for neural network compression.


Q7: How does backpropagation use tensor contractions?

Answer: Backpropagation applies the chain rule through each layer. For y=Wxy = Wx, the gradient LW=LyxT\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} x^T is an outer product (tensor contraction with the input). For multi-dimensional tensors (e.g., convolutions), the gradient is computed via transposed convolution operations, which are also tensor contractions. Einsum notation makes these operations explicit and ensures correct index alignment.


Quick Reference

OperationEinstein NotationEinsum StringResult Shape
Matrix MultiplyCij=AikBkjC_{ij} = A_{ik}B_{kj}'ik,kj->ij'(m,n)(m, n)
Batch MatMulCbij=AbikBbkjC_{bij} = A_{bik}B_{bkj}'bik,bkj->bij'(b,m,n)(b, m, n)
Dot Productc=aibic = a_i b_i'i,i->'scalar
Outer ProductCij=aibjC_{ij} = a_i b_j'i,j->ij'(m,n)(m, n)
Tracec=Aiic = A_{ii}'ii->'scalar
Frobenius NormAF=AijAij\|A\|_F = \sqrt{A_{ij}A_{ij}}'ij,ij->'scalar
TransposeBij=AjiB_{ij} = A_{ji}'ij->ji'(n,m)(n, m)
Diagonaldi=Aiid_i = A_{ii}'ii->i'(n,)(n,)
Sums=Aijs = A_{ij}'ij->'scalar
Elementwise MultiplyCij=AijBijC_{ij} = A_{ij}B_{ij}'ij,ij->ij'(m,n)(m, n)
Matrix NormA=AijAkl\|A\| = \sqrt{A_{ij}A_{kl}}'ij,kl->'scalar
Batch Tracetb=Abiit_b = A_{bii}'bii->b'(b,)(b,)

Cross-References

  • 097-advanced-differential-geometry — Metric tensors on manifolds extend the concept of inner products to curved spaces; Christoffel symbols are tensor expressions
  • 098-advanced-functional-analysis — Tensor products of vector spaces generalize to infinite-dimensional Hilbert spaces; operator tensors in quantum mechanics
  • 099-advanced-measure-theory — Integration of tensor fields requires measure-theoretic foundations; tensor-valued measures
  • Linear Algebra (earlier modules) — Matrix operations, eigenvalues, and SVD are foundational to tensor decompositions
  • Neural Networks: Backpropagation computes gradients as tensor contractions through the computational graph
Lesson Progress96 / 100