Information Theory

Cross-Entropy Loss

Master cross-entropy loss, its derivation, connection to KL divergence, and why it's used in classification.


ℹ️ Why It Matters

Cross-entropy is the standard loss function for classification in neural networks. Every time you train a classifier with nn.CrossEntropyLoss(), you're minimizing the cross-entropy between the true label distribution and your model's predicted distribution. Understanding why it works—its connection to maximum likelihood estimation, KL divergence, and information theory—gives you the intuition to debug, modify, and improve your models.


Historical Context

ℹ️ From Compression to Classification

Cross-entropy originated in coding theory: it measures the average number of bits needed to encode data from distribution $P$ using a code optimized for distribution $Q$. When $P = Q$, you achieve the entropy $H(P)$. When $P \neq Q$, you need an extra $D_{KL}(P \| Q)$ bits. In classification with one-hot labels, $H(P) = 0$, so the cross-entropy loss is exactly this extra cost.


Core Definitions

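Definition: Cross-Entropy

The cross-entropy between distributions $P$ (true) and $Q$ (predicted) is:

$$H(P, Q) = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)$$

It measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$. With the natural logarithm in place of $\log_2$ the unit is nats (1 nat $\approx$ 1.443 bits). The loss formulas below write $\log$ without a base; deep-learning libraries compute them with the natural logarithm.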
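The definition can be checked numerically. The following is a minimal NumPy sketch, not a library routine: the function name `cross_entropy`, its `base` argument, and the two example distributions are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(p, q, base=2.0):
    """H(P, Q) = -sum_x p(x) log q(x); base 2 gives bits, base e gives nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing to the sum
    return float(-np.sum(p[mask] * np.log(q[mask])) / np.log(base))

# Hypothetical distributions chosen so the logs are whole numbers in base 2.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.5, 0.25]   # model distribution

print(cross_entropy(p, p))             # entropy H(P), about 1.5 bits
print(cross_entropy(p, q))             # cross-entropy H(P, Q), about 1.75 bits
print(cross_entropy(p, q, base=np.e))  # the same quantity expressed in nats
```

The gap between the first two numbers (about 0.25 bits) is the price of coding data from `p` with a code built for `q`.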

Definition: Binary Cross-Entropy

For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{y}$:

$$\mathcal{L}_{\text{BCE}}(y, \hat{y}) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right]$$

Definition: Categorical Cross-Entropy

For multi-class classification with true one-hot vector $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$:

$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

For a single sample with true class $k$: $\mathcal{L} = -\log(\hat{y}_k)$.

Definition: Mean Cross-Entropy Loss

Over a batch of $N$ samples:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})$$

Key Formulas

Cross-Entropy

$$H(P, Q) = -\sum_{x} p(x) \log q(x)$$

Here,

  • $H(P, Q)$ = Cross-entropy between distributions P and Q
  • $p(x)$ = True distribution
  • $q(x)$ = Predicted distribution

Binary Cross-Entropy

$$L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$$

Here,

  • $y$ = True label (0 or 1)
  • $\hat{y}$ = Predicted probability of class 1

Categorical Cross-Entropy

$$L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$

Here,

  • $y_c$ = True probability for class c (0 or 1 for one-hot)
  • $\hat{y}_c$ = Predicted probability for class c
  • $C$ = Number of classes

Relation to KL Divergence

$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$

Here,

  • $H(P)$ = Entropy of true distribution (constant during training)
  • $D_{KL}(P \| Q)$ = KL divergence between P and Q (the extra cost of coding with Q when the data follow P)

Relation to Log-Likelihood

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log q(y_i | x_i) = -\mathbb{E}_{P}[\log Q]$$

Here,

  • $q(y_i | x_i)$ = Model's predicted probability of true label
  • $\mathbb{E}_{P}$ = Expectation over the empirical distribution of the training pairs

Properties and Theorems

Theorem: Cross-Entropy ≥ Entropy

$H(P, Q) \geq H(P)$ for all distributions $P, Q$. Equality holds iff $P = Q$. This follows from $D_{KL}(P \| Q) \geq 0$.

Theorem: Cross-Entropy Decomposition

$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$

Cross-entropy = entropy (irreducible noise) + KL divergence (model mismatch). Because $H(P)$ does not depend on the model, minimizing cross-entropy over $Q$ is equivalent to minimizing KL divergence.
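The decomposition is easy to confirm numerically. This sketch uses base-2 logs and two made-up distributions; the helper names `entropy`, `cross_entropy`, and `kl_divergence` are illustrative, not from any library.

```python
import numpy as np

def entropy(p):
    """H(P) in bits."""
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """H(P, Q) in bits."""
    return float(-np.sum(p * np.log2(q)))

def kl_divergence(p, q):
    """D_KL(P || Q) in bits."""
    return float(np.sum(p * np.log2(p / q)))

# Hypothetical distributions with no zero entries, so every log is finite.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Since `entropy(p)` is fixed by the data, any change to `q` that lowers `cross_entropy(p, q)` lowers `kl_divergence(p, q)` by the same amount.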

Theorem: Maximum Likelihood Connection

Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the data under the model. For a categorical model:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i} \log p_\theta(y_i | x_i) = -\frac{1}{N} \log \prod_{i} p_\theta(y_i | x_i)$$

Theorem: Gradient Properties

The gradient of cross-entropy loss w.r.t. logits $z_c$ is:

$$\frac{\partial \mathcal{L}}{\partial z_c} = \hat{y}_c - y_c$$

This elegant form comes from combining softmax + cross-entropy. The gradient is the prediction error, which vanishes as the model becomes confident and correct.
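One way to build trust in this formula is to compare it with a finite-difference estimate. The sketch below assumes a single sample with illustrative logits and a one-hot target; `softmax` and `ce_from_logits` are small helpers written for this check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def ce_from_logits(z, y):
    """Cross-entropy (in nats) of softmax(z) against target y."""
    return float(-np.sum(y * np.log(softmax(z))))

z = np.array([2.0, 1.0, 0.1])   # illustrative logits
y = np.array([1.0, 0.0, 0.0])   # one-hot target, true class 0

analytic = softmax(z) - y       # the closed-form gradient

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for c in range(len(z)):
    step = np.zeros_like(z)
    step[c] = eps
    numeric[c] = (ce_from_logits(z + step, y) - ce_from_logits(z - step, y)) / (2 * eps)

print(np.round(analytic, 3))
print(np.round(numeric, 3))
```

Both lines should print the same vector: negative for the true class and positive for the others, so a descent step raises the true-class logit.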

Theorem: Label Smoothing Effect

With label smoothing ($y_c = (1-\epsilon) \cdot \text{onehot}_c + \epsilon/C$), the targets never reach exactly 0 or 1, which keeps the model from becoming overconfident. For true class $k$, the loss becomes:

$$\mathcal{L} = -(1-\epsilon) \log(\hat{y}_k) - \frac{\epsilon}{C}\sum_c \log(\hat{y}_c)$$
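The closed form follows from substituting the smoothed targets into the categorical cross-entropy. The sketch below checks that substitution numerically; `smooth_labels` is a hypothetical helper, and the prediction vector and $\epsilon = 0.1$ are arbitrary choices for illustration.

```python
import numpy as np

def smooth_labels(k, num_classes, eps):
    """Smoothed target: (1 - eps) * onehot(k) + eps / C."""
    y = np.full(num_classes, eps / num_classes)
    y[k] += 1.0 - eps
    return y

y_hat = np.array([0.7, 0.2, 0.1])   # illustrative predicted probabilities
k, C, eps = 0, 3, 0.1               # true class, number of classes, smoothing

y_smooth = smooth_labels(k, C, eps)

# Cross-entropy against the smoothed target, computed two ways.
direct = -np.sum(y_smooth * np.log(y_hat))
closed = -(1 - eps) * np.log(y_hat[k]) - (eps / C) * np.sum(np.log(y_hat))

print(y_smooth, direct, closed)
```

The second term of `closed` grows when any class probability approaches zero, which is what penalizes extreme confidence.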

Worked Examples

📝 Example 1: Binary Cross-Entropy

Problem: True label $y = 1$, predicted $\hat{y} = 0.9$. Compute BCE.

💡 Solution: Binary CE

$$\mathcal{L} = -[1 \cdot \log_2(0.9) + 0 \cdot \log_2(0.1)] = -\log_2(0.9) \approx 0.152 \text{ bits}$$

If $\hat{y} = 0.5$: $\mathcal{L} = -\log_2(0.5) = 1.0$ bit (maximum uncertainty). If $\hat{y} = 0.99$: $\mathcal{L} = -\log_2(0.99) \approx 0.0145$ bits (high confidence, correct). With the natural logarithm, the first value is $-\ln(0.9) \approx 0.105$ nats.

📝 Example 2: Categorical Cross-Entropy

Problem: True label is class 0 (one-hot: $[1, 0, 0]$). Predictions: $[0.7, 0.2, 0.1]$. Compute CE.

💡 Solution: Categorical CE

$$\mathcal{L} = -[1 \cdot \log_2(0.7) + 0 \cdot \log_2(0.2) + 0 \cdot \log_2(0.1)] = -\log_2(0.7) \approx 0.515 \text{ bits}$$

Only the predicted probability of the true class matters.

📝 Example 3: Comparing Two Models

Problem: True label: $[1, 0, 0]$. Model A predicts $[0.9, 0.05, 0.05]$, Model B predicts $[0.6, 0.2, 0.2]$. Which is better?

💡 Solution: Model Comparison

$\mathcal{L}_A = -\log_2(0.9) \approx 0.152$ bits (0.105 nats).

$\mathcal{L}_B = -\log_2(0.6) \approx 0.737$ bits (0.511 nats).

Model A has the lower CE, so it is better: it is both more confident and correct.

📝 Example 4: Cross-Entropy vs KL

Problem: True distribution $P = [0.5, 0.5]$, model A predicts $Q_A = [0.6, 0.4]$, model B predicts $Q_B = [0.9, 0.1]$. Compute CE and KL for both.

💡 Solution: CE vs KL

$H(P) = 1.0$ bit.

$D_{KL}(P \| Q_A) = 0.5 \log_2(0.5/0.6) + 0.5 \log_2(0.5/0.4) = 0.5(-0.263) + 0.5(0.322) \approx 0.029$ bits.

$D_{KL}(P \| Q_B) = 0.5 \log_2(0.5/0.9) + 0.5 \log_2(0.5/0.1) = 0.5(-0.848) + 0.5(2.322) \approx 0.737$ bits.

$H(P, Q_A) = 1.0 + 0.029 = 1.029$ bits.

$H(P, Q_B) = 1.0 + 0.737 = 1.737$ bits.

$Q_A$ is closer to $P$ in both CE and KL.

📝 Example 5: Gradient Calculation

Problem: For a 3-class problem, logits are $[2.0, 1.0, 0.1]$ and true label is class 0. Compute the softmax output and the gradient.

💡 Solution: Gradient

Softmax: $\hat{y} = [0.659, 0.242, 0.099]$.

Gradient w.r.t. logits: $\hat{y} - y = [0.659 - 1, 0.242 - 0, 0.099 - 0] = [-0.341, 0.242, 0.099]$.

A gradient-descent step moves against this gradient, raising the logit of the correct class and lowering the other two, which makes the correct class more probable.


Python Implementation

import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Compute binary cross-entropy loss."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Compute categorical cross-entropy loss."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # keep log() finite at 0 and 1
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def sparse_categorical_cross_entropy(y_true, y_pred):
    """Compute cross-entropy for integer labels."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    N = y_true.shape[0]
    return -np.mean(np.log(y_pred[np.arange(N), y_true]))

def softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

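def cross_entropy_with_logits(y_true, logits):
    """Compute CE directly from logits via log-sum-exp (numerically stable)."""
    # Shift by the row maximum so exp() cannot overflow.
    z_shifted = logits - np.max(logits, axis=-1, keepdims=True)
    # log(softmax(z)) = z_shifted - log(sum(exp(z_shifted))); no log of a tiny probability.
    log_probs = z_shifted - np.log(np.sum(np.exp(z_shifted), axis=-1, keepdims=True))
    return -np.mean(np.sum(y_true * log_probs, axis=1))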

# --- Examples ---
# Binary CE
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.7])
print(f"BCE: {binary_cross_entropy(y_true, y_pred):.4f}")

# Categorical CE
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"CE: {categorical_cross_entropy(y_true, y_pred):.4f}")

# Sparse CE
y_true = np.array([0, 1])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"Sparse CE: {sparse_categorical_cross_entropy(y_true, y_pred):.4f}")

# Logits
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y_true = np.array([[1, 0, 0], [0, 1, 0]])
print(f"CE from logits: {cross_entropy_with_logits(y_true, logits):.4f}")

Why Cross-Entropy for Classification?

ℹ️ Reason 1: Maximum Likelihood

For a categorical model $p(y=k|x) = \text{softmax}(z)_k$, the log-likelihood of a dataset is:

$$\log \prod_{i} p(y_i | x_i) = \sum_{i} \log \text{softmax}(z_i)_{y_i}$$

Maximizing this is equivalent to minimizing cross-entropy loss, which is the same sum negated and divided by $N$.

ℹ️ Reason 2: Gradient Properties

The gradient of CE + softmax with respect to the logits is simply $\hat{y} - y$, the prediction error. This is well-behaved: large when the model is wrong, small when correct. In contrast, MSE + sigmoid has vanishing gradients when the model is confidently wrong, because the sigmoid derivative $\hat{y}(1-\hat{y})$ is close to zero there.
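The contrast can be made concrete for a single sigmoid output. This sketch compares the gradient with respect to the logit $z$ under squared error and under binary cross-entropy; the function names and the sample logits are chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse_sigmoid(z, y):
    """d/dz of (sigmoid(z) - y)^2, by the chain rule."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_bce_sigmoid(z, y):
    """d/dz of binary cross-entropy with a sigmoid output."""
    return sigmoid(z) - y

y = 1.0  # true label
for z in (-10.0, -2.0, 0.0, 2.0):
    # z = -10 is a confidently wrong prediction for y = 1.
    print(z, grad_mse_sigmoid(z, y), grad_bce_sigmoid(z, y))
```

At `z = -10` the squared-error gradient is on the order of 1e-4 while the cross-entropy gradient is close to -1, so only the latter gives a strong correction signal.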

ℹ️ Reason 3: Probabilistic Interpretation

CE measures the "surprise" of the true label under the model. A good model assigns high probability to true labels, minimizing surprise. This is information-theoretically principled.

ℹ️ Reason 4: Convexity

For linear models (logistic and softmax regression), CE loss is convex in the parameters, so any local minimum is a global minimum. MSE with sigmoid is non-convex and can have poor local minima.


Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using MSE for classification | Poor gradients, non-convex | Use cross-entropy loss |
| Forgetting to clip predictions | log(0) = -∞ causes inf or NaN | Clip to [ε, 1-ε] |
| Applying CE to regression | CE is for discrete distributions | Use MSE or MAE for regression |
| Not using softmax before CE | Raw logits aren't probabilities | Apply softmax first, or use a loss that takes logits directly (such as nn.CrossEntropyLoss) |
| Ignoring class imbalance | CE weights every sample equally, so majority classes dominate | Use weighted CE or focal loss |
| Confusing BCE with Categorical CE | BCE is for binary, CE for multi-class | Match loss to problem type |

Interview Questions

Q1: Why not use MSE for classification? A: MSE + sigmoid has vanishing gradients when the model is confidently wrong (saturation region). CE + softmax has gradient $\hat{y} - y$, which doesn't saturate. Also, CE is convex for linear models while MSE with a sigmoid output is not.

Q2: What's the connection between CE and MLE? A: Minimizing CE is equivalent to maximizing log-likelihood: $\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_i \log p_\theta(y_i|x_i)$. This is exactly the average negative log-likelihood.

Q3: How does label smoothing help? A: Label smoothing replaces hard one-hot labels with soft targets: $y_c = (1-\epsilon) \cdot \text{onehot}_c + \epsilon/C$. This prevents overconfident predictions and improves generalization. It's equivalent to adding a regularization term that pulls the predictions toward the uniform distribution.

Q4: Why is CE called "cross-entropy"? A: It has the form of an entropy but crosses two distributions: the expectation is taken under $P$ while the log-probabilities come from $Q$. When $P = Q$, it equals the entropy $H(P)$, the theoretical minimum. When $P \neq Q$, it's larger by $D_{KL}(P \| Q)$.

Q5: What happens numerically when $\hat{y}$ is very close to 0 or 1? A: $\log(\hat{y}) \to -\infty$ as $\hat{y} \to 0$, so the loss overflows. In practice, clip predictions to $[\epsilon, 1-\epsilon]$ or compute CE directly from logits using a log-sum-exp form of $\log(\text{softmax}(z))$, which is numerically stable.


Practice Problems

📝 Problem 1: BCE vs CE

Problem: A binary problem can be treated as 2-class categorical. Show that binary CE equals categorical CE for 2 classes.

💡 Solution: BCE = CE

For 2 classes with one-hot $(y, 1-y)$ and predictions $(\hat{y}, 1-\hat{y})$:

Categorical CE: $-[y \log \hat{y} + (1-y) \log(1-\hat{y})]$

This is exactly BCE.
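The same identity can be spot-checked numerically. This is a small sketch with hand-picked labels and probabilities; `bce` and `categorical_ce` are illustrative helpers for a single sample.

```python
import numpy as np

def bce(y, y_hat):
    """Binary cross-entropy for one sample (nats)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_ce(y_vec, y_hat_vec):
    """Categorical cross-entropy for one sample (nats)."""
    return -np.sum(y_vec * np.log(y_hat_vec))

# (label, predicted probability of class 1) pairs chosen for illustration.
cases = [(1, 0.9), (0, 0.9), (1, 0.3), (0, 0.3)]
for y, y_hat in cases:
    two_class = categorical_ce(np.array([y, 1 - y]), np.array([y_hat, 1 - y_hat]))
    print(y, y_hat, bce(y, y_hat), two_class)  # last two columns match
```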

📝 Problem 2: Uniform Predictions

Problem: For a 10-class problem with uniform predictions $\hat{y}_c = 0.1$, what is the CE if the true class is 0?

💡 Solution: Uniform Predictions

$\mathcal{L} = -\log_2(0.1) = \log_2(10) \approx 3.322$ bits.

This is $\log_2 C$, the loss of a model that guesses uniformly, and it is the same whichever class is true. It is a useful chance-level baseline rather than an upper bound: CE grows without limit as the probability assigned to the true class approaches 0.

📝 Problem 3: Perfect Predictions

Problem: If the model predicts $\hat{y}_k = 1$ for the true class $k$ and 0 for all others, what is the CE?

💡 Solution: Perfect Predictions

$\mathcal{L} = -\log(1) = 0$.

Zero loss means perfect predictions. In practice the loss only approaches zero: softmax assigns every class a nonzero probability for finite logits, and regularization keeps the logits from growing without bound.


Advanced Topics

Definition: Focal Loss

Focal loss addresses class imbalance by down-weighting easy examples:

$$\mathcal{L}_{\text{focal}} = -(1-\hat{y}_k)^\gamma \log(\hat{y}_k)$$

where $\gamma \geq 0$ is the focusing parameter. When $\gamma = 0$, it reduces to standard CE. Larger $\gamma$ focuses more on hard (misclassified) examples.
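A direct transcription of this formula shows how the modulating factor behaves. The sketch below takes only the probability assigned to the true class; the function name `focal_loss`, the clipping constant, and the sample probabilities are assumptions, and the optional class-balancing weight used in some implementations is left out.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0, eps=1e-7):
    """Per-sample focal loss given the predicted probability of the true class."""
    p = np.clip(np.asarray(p_true, dtype=float), eps, 1.0)  # avoid log(0)
    return -((1.0 - p) ** gamma) * np.log(p)

# Illustrative true-class probabilities: easy, borderline, and hard examples.
p_true = np.array([0.9, 0.5, 0.1])

ce = focal_loss(p_true, gamma=0.0)   # gamma = 0 recovers plain cross-entropy
fl = focal_loss(p_true, gamma=2.0)

print(ce)
print(fl)
print(fl / ce)  # the modulating factor (1 - p)^gamma for each example
```

The ratio in the last line is close to 0.01 for the easy example and 0.81 for the hard one, so hard examples dominate the total loss.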

Focal Loss

$$L = -(1-\hat{y}_k)^\gamma \log(\hat{y}_k)$$

Here,

  • $\hat{y}_k$ = Predicted probability of true class
  • $\gamma$ = Focusing parameter (typically 2)
  • $(1-\hat{y}_k)^\gamma$ = Modulating factor for hard examples

ℹ️ Focal Loss in Object Detection

Focal loss was introduced in RetinaNet (Lin et al., 2017) to address the extreme foreground-background class imbalance in object detection. With $\gamma = 2$, an easy example classified with $\hat{y}_k = 0.9$ has its loss scaled by $(1-0.9)^2 = 0.01$, while a hard example with small $\hat{y}_k$ keeps a factor close to 1. Easy background examples, which dominate the data, therefore contribute little to the total loss.

ℹ️ Label Smoothing Details

Label smoothing replaces the one-hot target (where $y_k = 1$ only for the true class) with:

$$y_k^{\text{smooth}} = (1 - \epsilon) \cdot \mathbf{1}[k = \text{true}] + \frac{\epsilon}{C}$$

The CE loss becomes:

$$\mathcal{L} = -(1-\epsilon) \log(\hat{y}_{\text{true}}) - \frac{\epsilon}{C} \sum_{c=1}^{C} \log(\hat{y}_c)$$

This prevents the model from becoming overconfident and improves calibration.

ℹ️ Numerical Stability: Log-Sum-Exp

Computing $\log(\text{softmax}(z))$ naively can overflow in $\exp(z_j)$ when a logit is large. The numerically stable version:

$$\log(\text{softmax}(z)_c) = z_c - \log \sum_j \exp(z_j) = z_c - \max(z) - \log \sum_j \exp(z_j - \max(z))$$

This subtracts the maximum so that every exponent is at most 0, which prevents overflow. PyTorch's nn.CrossEntropyLoss implements this automatically.
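The failure mode and the fix can both be seen with deliberately extreme logits. This is a sketch for a single logit vector; the two function names are illustrative, and NumPy warnings are silenced only so the naive version can run to completion.

```python
import numpy as np

def log_softmax_naive(z):
    e = np.exp(z)                    # overflows to inf for large logits
    return np.log(e / np.sum(e))

def log_softmax_stable(z):
    shifted = z - np.max(z)          # largest entry becomes 0
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([1000.0, 999.0, 0.0])   # illustrative, deliberately extreme logits

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = log_softmax_naive(z)     # contains nan and -inf
stable = log_softmax_stable(z)       # finite everywhere

print(naive)
print(stable)
```

In the stable version the smallest probability underflows inside `exp`, but its log-probability is still recovered exactly as `shifted` minus a finite constant.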


Quick Reference

| Loss Function | Formula | Use Case |
|---|---|---|
| Binary CE | $-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$ | Binary classification |
| Categorical CE | $-\sum_c y_c \log \hat{y}_c$ | Multi-class (one-hot) |
| Sparse CE | $-\log \hat{y}_k$ where $k$ is true class | Multi-class (integer labels) |
| Weighted CE | $-\sum_c w_c y_c \log \hat{y}_c$ | Imbalanced classes |
| Focal CE | $-(1-\hat{y}_k)^\gamma \log \hat{y}_k$ | Hard examples |
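The weighted variant in the table is the only one without an implementation elsewhere in this lesson, so here is a minimal NumPy sketch of it. The function name, the clipping constant, and the example weights are assumptions; this version takes a plain mean over samples, whereas some libraries normalize by the sum of the applied weights instead.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights, eps=1e-7):
    """Mean over samples of -sum_c w_c * y_c * log(y_hat_c), in nats."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(class_weights * y_true * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

uniform = np.array([1.0, 1.0, 1.0])   # reduces to ordinary categorical CE
boosted = np.array([1.0, 3.0, 1.0])   # hypothetical: class 1 is rare, so weight it 3x

print(weighted_cross_entropy(y_true, y_pred, uniform))
print(weighted_cross_entropy(y_true, y_pred, boosted))
```

With the boosted weights, the second sample's loss counts three times as much, so errors on the rare class move the parameters more.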

Cross-References

  • 081 - Entropy: $H(P, Q) = H(P) + D_{KL}(P \| Q)$; cross-entropy decomposes into entropy plus KL.
  • 082 - Mutual Information: MI is used for feature selection; CE is used for training.
  • 083 - KL Divergence: $D_{KL}(P \| Q) = H(P, Q) - H(P)$; CE minus entropy equals KL.
  • 085 - Applications: Cross-entropy is the loss in classification, knowledge distillation, and language modeling.

Summary

📋Key Takeaways

  • Cross-Entropy: $H(P, Q) = -\sum_x p(x) \log q(x)$ measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$. Lower CE means $Q$ is closer to $P$.

  • Decomposition: $H(P, Q) = H(P) + D_{KL}(P \| Q)$. The irreducible entropy $H(P)$ is constant; minimizing CE is equivalent to minimizing KL divergence.

  • MLE Connection: Minimizing CE loss is equivalent to maximizing the log-likelihood of the data under the model. This is why CE is the standard loss for classification.

  • Gradient: The gradient of CE + softmax with respect to the logits is $\hat{y} - y$, the prediction error. This is well-behaved and avoids the saturation issues of MSE + sigmoid.

  • Binary vs Categorical: Binary CE is for 2 classes; categorical CE is for $C$ classes. They are equivalent when $C = 2$.

  • Label Smoothing: Replacing hard labels with soft targets $(1-\epsilon) \cdot \text{onehot} + \epsilon/C$ prevents overconfident predictions and improves generalization.

  • Numerical Stability: Compute CE from logits with a fused log-softmax (the log-sum-exp form) rather than applying softmax and then log separately. PyTorch's nn.CrossEntropyLoss does this automatically.
