Cross-Entropy Loss

ℹ️ Why It Matters

Cross-entropy is the standard loss function for classification in neural networks. Every time you train a classifier with nn.CrossEntropyLoss(), you're minimizing the cross-entropy between the true label distribution and your model's predicted distribution. Understanding why it works—its connection to maximum likelihood estimation, KL divergence, and information theory—gives you the intuition to debug, modify, and improve your models.

Historical Context

ℹ️ From Compression to Classification

Cross-entropy originated in coding theory: it measures the average number of bits needed to encode data from distribution $P$ using a code optimized for distribution $Q$. When $P = Q$, you achieve the entropy $H(P)$. When $P \neq Q$, you pay an extra $D_{KL}(P \| Q)$ bits on top of $H(P)$. In classification with one-hot labels, $H(P) = 0$, so the cross-entropy loss equals exactly this extra cost.

Core Definitions

Definition: Cross-Entropy

The cross-entropy between distributions $P$ (true) and $Q$ (predicted) is:

H(P, Q) = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)

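It measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$. With the natural logarithm in place of $\log_2$, the same quantity is expressed in nats (1 nat ≈ 1.443 bits); deep learning libraries typically use the natural log.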
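As a rough illustration of the definition, the sketch below computes $H(P, Q)$ in bits for a made-up three-symbol source. The distributions and the function name `cross_entropy_bits` are illustrative assumptions, and the code assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def cross_entropy_bits(p, q):
    """H(P, Q) in bits; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # skip zero-probability events
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = [0.5, 0.25, 0.25]            # hypothetical true distribution
q_matched = [0.5, 0.25, 0.25]    # code optimized for P itself
q_uniform = [1 / 3, 1 / 3, 1 / 3]  # code optimized for a uniform source

h_matched = cross_entropy_bits(p, q_matched)  # equals the entropy H(P)
h_uniform = cross_entropy_bits(p, q_uniform)  # pays extra bits for the mismatch
print(f"matched code: {h_matched:.3f} bits, uniform code: {h_uniform:.3f} bits")
```

The matched code attains the entropy of $P$, while the mismatched code costs more on average.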

Definition: Binary Cross-Entropy

For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{y}$ :

\mathcal{L}_{\text{BCE}}(y, \hat{y}) = -\left[y \log(\hat{y}) + (1-y) \log(1-\hat{y})\right]

Definition: Categorical Cross-Entropy

For multi-class classification with true one-hot vector $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$ :

\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

For a single sample with true class $k$ : $\mathcal{L} = -\log(\hat{y}_k)$ .

Definition: Mean Cross-Entropy Loss

Over a batch of $N$ samples:

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})

Key Formulas

Cross-Entropy

H(P, Q) = -\sum_{x} p(x) \log q(x)

Here,

$H(P, Q)$ = Cross-entropy between distributions $P$ and $Q$
$p(x)$ = True distribution
$q(x)$ = Predicted distribution

Binary Cross-Entropy

L = -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]

Here,

$y$ = True label (0 or 1)
$\hat{y}$ = Predicted probability of class 1

Categorical Cross-Entropy

L = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)

Here,

$y_c$ = True probability for class $c$ (0 or 1 for one-hot)
$\hat{y}_c$ = Predicted probability for class $c$
$C$ = Number of classes

Relation to KL Divergence

H(P, Q) = H(P) + D_{KL}(P \| Q)

Here,

$H(P)$ = Entropy of the true distribution (constant during training)
$D_{KL}(P \| Q)$ = KL divergence of $P$ from $Q$ (the extra cost of coding with $Q$ instead of $P$)

Relation to Log-Likelihood

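\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log q(y_i | x_i) = -\mathbb{E}_{(x, y) \sim \hat{P}}[\log q(y | x)]

Here,

$q(y_i | x_i)$ = Model's predicted probability of the true label
$\hat{P}$ = Empirical distribution of the $N$ training pairs $(x_i, y_i)$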

Properties and Theorems

Theorem: Cross-Entropy ≥ Entropy

$H(P, Q) \geq H(P)$ for all distributions $P, Q$ . Equality holds iff $P = Q$ . This follows from $D_{KL}(P \| Q) \geq 0$ .

Theorem: Cross-Entropy Decomposition

H(P, Q) = H(P) + D_{KL}(P \| Q)

Cross-entropy = entropy (irreducible noise) + KL divergence (model mismatch). Because $H(P)$ does not depend on $Q$, minimizing cross-entropy over $Q$ is equivalent to minimizing the KL divergence.
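The decomposition can be checked numerically. The sketch below uses two arbitrary, strictly positive distributions (illustrative values, not taken from the text) and compares both sides of the identity in bits.

```python
import numpy as np

def entropy(p):
    """H(P) in bits; assumes all entries of p are strictly positive."""
    return float(-np.sum(p * np.log2(p)))

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes strictly positive p and q."""
    return float(np.sum(p * np.log2(p / q)))

def cross_entropy(p, q):
    """H(P, Q) in bits."""
    return float(-np.sum(p * np.log2(q)))

p = np.array([0.7, 0.2, 0.1])  # illustrative "true" distribution
q = np.array([0.5, 0.3, 0.2])  # illustrative model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"H(P,Q) = {lhs:.6f}, H(P) + KL = {rhs:.6f}")
```

Both sides agree up to floating-point error, and the KL term is the only part that changes when $Q$ changes.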

Theorem: Maximum Likelihood Connection

Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the data under the model. For a categorical model:

\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i} \log p_\theta(y_i | x_i) = -\frac{1}{N} \log \prod_{i} p_\theta(y_i | x_i)
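A small numerical check of this identity, using made-up predicted probabilities for four samples (the arrays below are hypothetical): the mean cross-entropy and the per-sample negative log of the joint likelihood should coincide.

```python
import numpy as np

# Hypothetical predicted probabilities for N = 4 samples and C = 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
labels = np.array([0, 1, 2, 0])  # integer class labels

# Probability the model assigns to each observed label.
p_true = probs[np.arange(len(labels)), labels]

mean_ce = -np.mean(np.log(p_true))            # cross-entropy loss (nats)
likelihood = np.prod(p_true)                  # joint likelihood of the labels
nll_per_sample = -np.log(likelihood) / len(labels)

print(f"mean CE = {mean_ce:.6f}, -log(likelihood)/N = {nll_per_sample:.6f}")
```

For large $N$ the raw product underflows to zero in floating point, which is one practical reason to work with the sum of logs instead.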

Theorem: Gradient Properties

The gradient of cross-entropy loss w.r.t. logits $z_c$ is:

\frac{\partial \mathcal{L}}{\partial z_c} = \hat{y}_c - y_c

This elegant form comes from combining softmax + cross-entropy. The gradient is the prediction error, which vanishes as the model becomes confident and correct.
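One way to convince yourself of this formula is to compare it against central finite differences. The sketch below does that for a single sample with illustrative logits; the helper names are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def ce_from_logits(z, k):
    """Cross-entropy (nats) for one sample with true class k."""
    return -np.log(softmax(z)[k])

z = np.array([2.0, 1.0, 0.1])  # illustrative logits
k = 0                          # true class
y = np.eye(len(z))[k]          # one-hot target

analytic = softmax(z) - y      # claimed gradient: y_hat - y

# Central finite differences, perturbing one logit at a time.
h = 1e-6
numeric = np.zeros_like(z)
for c in range(len(z)):
    step = np.zeros_like(z)
    step[c] = h
    numeric[c] = (ce_from_logits(z + step, k) - ce_from_logits(z - step, k)) / (2 * h)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The components of the gradient sum to zero, since both $\hat{y}$ and $y$ sum to one.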

Theorem: Label Smoothing Effect

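With label smoothing ( $y_c = (1-\epsilon) \cdot \text{onehot} + \epsilon/C$ ), every class receives a nonzero target probability, so the loss penalizes predictions that push $\hat{y}_k$ all the way toward 1 and thereby discourages overconfidence. For true class $k$, the loss becomes: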

\mathcal{L} = -(1-\epsilon) \log(\hat{y}_k) - \frac{\epsilon}{C}\sum_c \log(\hat{y}_c)
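The closed form can be compared against cross-entropy computed directly on the smoothed target vector. This is a minimal sketch with made-up predictions; `smooth_labels` is a hypothetical helper, not a library function.

```python
import numpy as np

def smooth_labels(k, num_classes, eps):
    """Smoothed target: (1 - eps) * onehot(k) + eps / C."""
    y = np.full(num_classes, eps / num_classes)
    y[k] += 1.0 - eps
    return y

y_hat = np.array([0.7, 0.2, 0.1])  # illustrative predicted probabilities
k, eps = 0, 0.1                    # true class and smoothing strength
C = len(y_hat)

y_smooth = smooth_labels(k, C, eps)

# Cross-entropy against the smoothed targets.
loss_direct = -np.sum(y_smooth * np.log(y_hat))

# Closed form: -(1 - eps) log y_hat_k - (eps / C) * sum_c log y_hat_c
loss_closed = -(1 - eps) * np.log(y_hat[k]) - (eps / C) * np.sum(np.log(y_hat))

print(f"direct = {loss_direct:.6f}, closed form = {loss_closed:.6f}")
```

The second term grows when any class probability approaches zero, which is what keeps the model from driving $\hat{y}_k$ to 1.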

Worked Examples

📝Example 1: Binary Cross-Entropy

Problem: True label $y = 1$ , predicted $\hat{y} = 0.9$ . Compute BCE.

💡Solution: Binary CE

\mathcal{L} = -[1 \cdot \log_2(0.9) + 0 \cdot \log_2(0.1)] = -\log_2(0.9) \approx 0.152 \text{ bits}

If $\hat{y} = 0.5$: $\mathcal{L} = -\log_2(0.5) = 1.0$ bit (maximum uncertainty). If $\hat{y} = 0.99$: $\mathcal{L} = -\log_2(0.99) \approx 0.0145$ bits (high confidence, correct). With the natural log, as used by most libraries, the first loss is $-\ln(0.9) \approx 0.105$ nats.

📝Example 2: Categorical Cross-Entropy

Problem: True label is class 0 (one-hot: $[1, 0, 0]$ ). Predictions: $[0.7, 0.2, 0.1]$ . Compute CE.

💡Solution: Categorical CE

\mathcal{L} = -[1 \cdot \log_2(0.7) + 0 \cdot \log_2(0.2) + 0 \cdot \log_2(0.1)] = -\log_2(0.7) \approx 0.515 \text{ bits}

Only the predicted probability of the true class matters.

📝Example 3: Comparing Two Models

Problem: True label: $[1, 0, 0]$ . Model A predicts $[0.9, 0.05, 0.05]$ , Model B predicts $[0.6, 0.2, 0.2]$ . Which is better?

💡Solution: Model Comparison

$\mathcal{L}_A = -\log_2(0.9) \approx 0.152$ bits.

$\mathcal{L}_B = -\log_2(0.6) \approx 0.737$ bits.

Model A has lower CE, so it's better. Model A is more confident AND correct.

📝Example 4: Cross-Entropy vs KL

Problem: True distribution $P = [0.5, 0.5]$ , model A predicts $Q_A = [0.6, 0.4]$ , model B predicts $Q_B = [0.9, 0.1]$ . Compute CE and KL for both.

💡Solution: CE vs KL

$H(P) = 1.0$ bit.

$D_{KL}(P \| Q_A) = 0.5 \log_2(0.5/0.6) + 0.5 \log_2(0.5/0.4) = 0.5(-0.263) + 0.5(0.322) \approx 0.029$ bits.

$D_{KL}(P \| Q_B) = 0.5 \log_2(0.5/0.9) + 0.5 \log_2(0.5/0.1) = 0.5(-0.848) + 0.5(2.322) \approx 0.737$ bits.

$H(P, Q_A) = 1.0 + 0.029 = 1.029$ bits.

$H(P, Q_B) = 1.0 + 0.737 = 1.737$ bits.

$Q_A$ is closer to $P$ in both CE and KL. The two rankings always agree for a fixed $P$, because CE and KL differ only by the constant $H(P)$.

📝Example 5: Gradient Calculation

Problem: For a 3-class problem, logits are $[2.0, 1.0, 0.1]$ and true label is class 0. Compute the softmax output and the gradient.

💡Solution: Gradient

Softmax: $\hat{y} = [0.659, 0.242, 0.099]$ .

Gradient w.r.t. logits: $\hat{y} - y = [0.659 - 1, 0.242 - 0, 0.099 - 0] = [-0.341, 0.242, 0.099]$ .

A gradient-descent step subtracts this vector, so it raises the logit of the correct class and lowers the other two, making the correct class more probable.

Python Implementation

import numpy as np

def binary_cross_entropy(y_true, y_pred):
    """Compute binary cross-entropy loss."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Compute categorical cross-entropy loss."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def sparse_categorical_cross_entropy(y_true, y_pred):
    """Compute cross-entropy for integer labels."""
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    N = y_true.shape[0]
    return -np.mean(np.log(y_pred[np.arange(N), y_true]))

def softmax(z):
    """Numerically stable softmax."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

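def cross_entropy_with_logits(y_true, logits):
    """Compute CE directly from logits via log-sum-exp (numerically stable)."""
    # Shift by the row-wise max so exp() cannot overflow.
    z_shifted = logits - np.max(logits, axis=-1, keepdims=True)
    # log softmax(z) = z_shifted - log(sum(exp(z_shifted)))
    log_probs = z_shifted - np.log(np.sum(np.exp(z_shifted), axis=-1, keepdims=True))
    return -np.mean(np.sum(y_true * log_probs, axis=1))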

# --- Examples (np.log is the natural log, so these losses are in nats) ---
# Binary CE
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.7])
print(f"BCE: {binary_cross_entropy(y_true, y_pred):.4f}")

# Categorical CE
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"CE: {categorical_cross_entropy(y_true, y_pred):.4f}")

# Sparse CE
y_true = np.array([0, 1])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"Sparse CE: {sparse_categorical_cross_entropy(y_true, y_pred):.4f}")

# Logits
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y_true = np.array([[1, 0, 0], [0, 1, 0]])
print(f"CE from logits: {cross_entropy_with_logits(y_true, logits):.4f}")

Why Cross-Entropy for Classification?

ℹ️ Reason 1: Maximum Likelihood

For a categorical model $p(y=k|x) = \text{softmax}(z)_k$ , the log-likelihood of a dataset is:

\log \prod_{i} p(y_i | x_i) = \sum_{i} \log \text{softmax}(z_i)_{y_i}

Maximizing this is equivalent to minimizing cross-entropy loss.

ℹ️ Reason 2: Gradient Properties

The gradient of CE + softmax is simply $\hat{y} - y$ —the prediction error. This is well-behaved: large when the model is wrong, small when correct. In contrast, MSE + sigmoid has vanishing gradients when the model is confidently wrong.
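To make the saturation argument concrete, here is a minimal sketch for the binary case, where sigmoid + BCE plays the role of softmax + CE. It assumes the MSE loss is defined as $(\hat{y} - y)^2$ and compares the gradients with respect to the logit $z$; `logit_gradients` is a name made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_gradients(z, y):
    """Return (dBCE/dz, dMSE/dz) for a sigmoid output y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    grad_bce = y_hat - y                              # CE: plain prediction error
    grad_mse = 2 * (y_hat - y) * y_hat * (1 - y_hat)  # MSE: extra sigmoid'(z) factor
    return grad_bce, grad_mse

y = 1.0  # true label
for z in [-10.0, -2.0, 0.0, 2.0]:
    g_bce, g_mse = logit_gradients(z, y)
    print(f"z={z:6.1f}  dBCE/dz={g_bce: .6f}  dMSE/dz={g_mse: .6f}")
```

At $z = -10$ the model is confidently wrong: the CE gradient is close to $-1$, while the MSE gradient is nearly zero because of the $\hat{y}(1-\hat{y})$ factor.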

ℹ️ Reason 3: Probabilistic Interpretation

CE measures the "surprise" of the true label under the model. A good model assigns high probability to true labels, minimizing surprise. This is information-theoretically principled.

ℹ️ Reason 4: Convexity

For linear models (logistic or softmax regression), CE loss is convex in the parameters, so every local minimum is a global minimum. MSE with sigmoid is non-convex and can have poor local minima.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using MSE for classification	Poor gradients, non-convex	Use cross-entropy loss
Forgetting to clip predictions	log(0) = -∞ causes NaN	Clip to [ε, 1-ε]
Applying CE to regression	CE is for discrete distributions	Use MSE or MAE for regression
Not using softmax before CE	Raw logits aren't probabilities	Apply softmax first, or use a loss that takes logits directly (e.g. nn.CrossEntropyLoss)
Ignoring class imbalance	Unweighted CE averages over samples, so majority classes dominate	Use weighted CE or focal loss
Confusing BCE with Categorical CE	BCE is for binary, CE for multi-class	Match loss to problem type
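The weighted CE suggested in the table for imbalanced classes could look roughly like the sketch below. The data and class weights are made up, and the loss is a plain mean over samples; libraries may normalize weighted losses differently.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, class_weights, eps=1e-7):
    """Mean over samples of -sum_c w_c * y_c * log(y_hat_c), in nats."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -np.sum(class_weights * y_true * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

# Hypothetical imbalanced batch: three majority samples, one minority sample.
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3]])

plain = weighted_cross_entropy(y_true, y_pred, np.array([1.0, 1.0]))
weighted = weighted_cross_entropy(y_true, y_pred, np.array([0.5, 2.0]))
print(f"unweighted: {plain:.4f}, minority up-weighted: {weighted:.4f}")
```

With the minority class up-weighted, the single misclassified minority sample contributes far more to the loss than the three easy majority samples.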

Interview Questions

Q1: Why not use MSE for classification? A: MSE + sigmoid has vanishing gradients when the model is confidently wrong (the saturation region), because the gradient carries a factor $\hat{y}(1-\hat{y})$. CE + softmax has gradient $\hat{y} - y$ with respect to the logits, which does not saturate. Also, CE is convex for linear models, while MSE with a sigmoid output is not.

Q2: What's the connection between CE and MLE? A: Minimizing CE is equivalent to maximizing log-likelihood: $\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_i \log p_\theta(y_i|x_i)$ . This is exactly the negative log-likelihood.

Q3: How does label smoothing help? A: Label smoothing replaces hard one-hot labels with soft targets: $y_c = (1-\epsilon) \cdot \text{onehot} + \epsilon/C$. This discourages overconfident predictions and tends to improve generalization. It is equivalent to adding a regularization term that penalizes the divergence of the predictions from the uniform distribution.

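Q4: Why is CE called "cross-entropy"? A: It has the form of an entropy, but the expectation is taken under one distribution ($P$) while the code lengths $-\log q(x)$ come from another ($Q$), so it is computed across the two. When $P = Q$, it equals the entropy $H(P)$, the theoretical minimum. When $P \neq Q$, it's larger by $D_{KL}(P \| Q)$.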

Q5: What happens numerically when $\hat{y}$ is very close to 0 or 1? A: As the argument of the log approaches 0, $\log(\cdot) \to -\infty$, so the loss becomes inf or NaN. In practice, clip predictions to $[\epsilon, 1-\epsilon]$ or compute CE directly from logits using a log-sum-exp form of $\log(\text{softmax}(z))$, which is numerically stable.

Practice Problems

📝Problem 1: BCE vs CE

Problem: A binary problem can be treated as 2-class categorical. Show that binary CE equals categorical CE for 2 classes.

💡Solution: BCE = CE

For 2 classes with one-hot $(y, 1-y)$ and predictions $(\hat{y}, 1-\hat{y})$ :

Categorical CE: $-[y \log \hat{y} + (1-y) \log(1-\hat{y})]$

This is exactly BCE.
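The same equivalence can be confirmed numerically. The sketch below builds the two-class one-hot targets and predictions from illustrative binary labels and probabilities, then compares per-sample losses.

```python
import numpy as np

y = np.array([1, 0, 1, 1])              # illustrative binary labels
y_hat = np.array([0.9, 0.1, 0.8, 0.7])  # predicted probability of class 1

# Binary cross-entropy, per sample.
bce = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The same problem written as 2-class categorical cross-entropy.
y_onehot = np.stack([y, 1 - y], axis=1)      # columns: (y, 1 - y)
p = np.stack([y_hat, 1 - y_hat], axis=1)     # columns: (y_hat, 1 - y_hat)
cce = -np.sum(y_onehot * np.log(p), axis=1)

print("BCE:", bce)
print("CCE:", cce)
```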

📝Problem 2: Uniform Predictions

Problem: For a 10-class problem with uniform predictions $\hat{y}_c = 0.1$ , what is the CE if the true class is 0?

💡Solution: Uniform Predictions

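$\mathcal{L} = -\log_2(0.1) = \log_2(10) \approx 3.322$ bits.

Because the predictions are uniform, the loss is $\log_2 C$ regardless of which class is true. This is not an upper bound on CE: a confidently wrong model can incur an arbitrarily large loss.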

📝Problem 3: Perfect Predictions

Problem: If the model predicts $\hat{y}_k = 1$ for the true class $k$ and 0 for all others, what is the CE?

💡Solution: Perfect Predictions

$\mathcal{L} = -\log(1) = 0$ .

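Zero loss means perfect predictions. In practice it is approached but not reached: softmax over finite logits assigns every class a probability strictly below 1, and regularization keeps the logits from growing without bound.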

Advanced Topics

Definition: Focal Loss

Focal loss addresses class imbalance by down-weighting easy examples:

\mathcal{L}_{\text{focal}} = -(1-\hat{y}_k)^\gamma \log(\hat{y}_k)

where $\gamma \geq 0$ is the focusing parameter. When $\gamma = 0$, it reduces to standard CE. Larger $\gamma$ focuses more on hard (misclassified) examples.
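A minimal sketch of this formula, evaluated on a few illustrative true-class probabilities, shows how the modulating factor shrinks the loss of easy examples. The function name and the clipping constant are assumptions for this example.

```python
import numpy as np

def focal_loss(p_true, gamma=2.0, eps=1e-7):
    """Per-sample focal loss given the predicted probability of the true class."""
    p_true = np.clip(p_true, eps, 1 - eps)  # avoid log(0)
    return -((1 - p_true) ** gamma) * np.log(p_true)

p_true = np.array([0.9, 0.5, 0.1])  # easy, medium, hard examples

ce = focal_loss(p_true, gamma=0.0)  # gamma = 0 recovers standard CE
fl = focal_loss(p_true, gamma=2.0)
ratio = fl / ce                     # equals the modulating factor (1 - p)^gamma

for p_k, c, f, r in zip(p_true, ce, fl, ratio):
    print(f"p_true={p_k:.1f}  CE={c:.4f}  focal={f:.4f}  factor={r:.2f}")
```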

Focal Loss

L = -(1-\hat{y}_k)^\gamma \log(\hat{y}_k)

Here,

$\hat{y}_k$ = Predicted probability of true class
$\gamma$ = Focusing parameter (typically 2)
$(1-\hat{y}_k)^\gamma$ = Modulating factor for hard examples

ℹ️ Focal Loss in Object Detection

Focal loss was introduced in RetinaNet (Lin et al., 2017) to address the extreme foreground-background class imbalance in object detection. With $\gamma = 2$, an easy background example (the dominant kind) predicted with $\hat{y}_k = 0.9$ has its loss scaled by $(1-0.9)^2 = 0.01$, while a hard example with $\hat{y}_k = 0.1$ keeps a factor of $(1-0.1)^2 = 0.81$.

ℹ️ Label Smoothing Details

Label smoothing replaces the one-hot target $y_k = 1$ with:

y_k^{\text{smooth}} = (1 - \epsilon) \cdot \mathbf{1}[k = \text{true}] + \frac{\epsilon}{C}

The CE loss becomes:

\mathcal{L} = -(1-\epsilon) \log(\hat{y}_{\text{true}}) - \frac{\epsilon}{C} \sum_{c=1}^{C} \log(\hat{y}_c)

This discourages the model from becoming overconfident and often improves calibration.

ℹ️ Numerical Stability: Log-Sum-Exp

Computing $\log(\text{softmax}(z))$ naively can overflow in $\exp(z_j)$ for large logits, or underflow to $\log(0)$ for very negative ones. The numerically stable version:

\log(\text{softmax}(z_c)) = z_c - \log \sum_j \exp(z_j) = z_c - \max(z) - \log \sum_j \exp(z_j - \max(z))

This subtracts the maximum to prevent overflow. PyTorch's nn.CrossEntropyLoss implements this automatically.
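The difference between the naive and shifted computations shows up as soon as the logits are large. The sketch below is illustrative only (the logits are chosen to force overflow) and suppresses NumPy's floating-point warnings for the naive path.

```python
import numpy as np

def log_softmax_naive(z):
    """log(softmax(z)) computed literally; breaks for large logits."""
    e = np.exp(z)
    return np.log(e / np.sum(e))

def log_softmax_stable(z):
    """log(softmax(z)) via the max-shifted log-sum-exp identity."""
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z_small = np.array([2.0, 1.0, 0.1])
z_large = z_small + 998.0  # same softmax, but exp() overflows in float64

with np.errstate(over="ignore", invalid="ignore"):
    naive_large = log_softmax_naive(z_large)  # inf / inf gives nan

stable_large = log_softmax_stable(z_large)

print("naive :", naive_large)
print("stable:", stable_large)
```

Since softmax is invariant to adding a constant to all logits, the stable result for the shifted logits matches the result for the small ones.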

Quick Reference

Loss Function	Formula	Use Case
Binary CE	$-[y \log \hat{y} + (1-y)\log(1-\hat{y})]$	Binary classification
Categorical CE	$-\sum_c y_c \log \hat{y}_c$	Multi-class (one-hot)
Sparse CE	$-\log \hat{y}_k$ where $k$ is true class	Multi-class (integer labels)
Weighted CE	$-\sum_c w_c y_c \log \hat{y}_c$	Imbalanced classes
Focal CE	$-(1-\hat{y}_k)^\gamma \log \hat{y}_k$	Class imbalance, hard examples

Cross-References

081 - Entropy: $H(P, Q) = H(P) + D_{KL}(P \| Q)$, so cross-entropy decomposes into entropy plus KL.
082 - Mutual Information: MI is used for feature selection; CE is used for training.
083 - KL Divergence: $D_{KL}(P \| Q) = H(P, Q) - H(P)$, so CE minus entropy equals KL.
085 - Applications: Cross-entropy is the loss in classification, knowledge distillation, and language modeling.

Summary

📋Key Takeaways

Cross-Entropy: $H(P, Q) = -\sum_x p(x) \log q(x)$ measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$ . Lower CE means $Q$ is closer to $P$ .
Decomposition: $H(P, Q) = H(P) + D_{KL}(P \| Q)$ . The irreducible entropy $H(P)$ is constant; minimizing CE is equivalent to minimizing KL divergence.
MLE Connection: Minimizing CE loss is equivalent to maximizing the log-likelihood of the data under the model. This is why CE is the standard loss for classification.
Gradient: The gradient of CE + softmax is $\hat{y} - y$ —the prediction error. This is well-behaved and avoids the saturation issues of MSE + sigmoid.
Binary vs Categorical: Binary CE is for 2 classes; categorical CE is for $C$ classes. They are equivalent when $C = 2$ .
Label Smoothing: Replacing hard labels with soft targets $(1-\epsilon) \cdot \text{onehot} + \epsilon/C$ prevents overconfident predictions and improves generalization.
Numerical Stability: Compute CE directly from logits with a fused log-softmax (log-sum-exp) rather than applying softmax and then log as separate steps. PyTorch's nn.CrossEntropyLoss does this automatically.