Cross-Entropy Loss
ℹ️ Why It Matters
Cross-entropy is the standard loss function for classification in neural networks. Every time you train a classifier with nn.CrossEntropyLoss(), you're minimizing the cross-entropy between the true label distribution and your model's predicted distribution. Understanding why it works—its connection to maximum likelihood estimation, KL divergence, and information theory—gives you the intuition to debug, modify, and improve your models.
Historical Context
ℹ️ From Compression to Classification
Cross-entropy originated in coding theory: it measures the average number of bits needed to encode data from distribution $P$ using a code optimized for distribution $Q$. When $Q = P$, you achieve the entropy $H(P)$. When $Q \neq P$, you need extra bits. This extra cost is the KL divergence $D_{KL}(P \| Q)$, and it is the only part of the cross-entropy loss that training can reduce in classification.
Core Definitions
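Definition: Cross-Entropy
The cross-entropy between distributions $P$ (true) and $Q$ (predicted) is:
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
With $\log_2$ it measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$; with the natural logarithm the unit is nats.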
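Definition: Binary Cross-Entropy
For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{y}$ of class 1:
$$\text{BCE}(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
Definition: Categorical Cross-Entropy
For multi-class classification with true one-hot vector $y$ and predicted probabilities $\hat{y}$ over $C$ classes:
$$\text{CE}(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
For a single sample with true class $k$: $\text{CE} = -\log \hat{y}_k$.
Definition: Mean Cross-Entropy Loss
Over a batch of $N$ samples:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$$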
Key Formulas
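Cross-Entropy
$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
Here,
- $H(P, Q)$ = Cross-entropy between distributions $P$ and $Q$
- $P$ = True distribution
- $Q$ = Predicted distribution
Binary Cross-Entropy
$$\text{BCE}(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
Here,
- $y$ = True label (0 or 1)
- $\hat{y}$ = Predicted probability of class 1
Categorical Cross-Entropy
$$\text{CE}(y, \hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$
Here,
- $y_c$ = True probability for class $c$ (0 or 1 for one-hot)
- $\hat{y}_c$ = Predicted probability for class $c$
- $C$ = Number of classes
Relation to KL Divergence
$$H(P, Q) = H(P) + D_{KL}(P \| Q)$$
Here,
- $H(P)$ = Entropy of true distribution (constant during training)
- $D_{KL}(P \| Q)$ = KL divergence from $Q$ to $P$
Relation to Log-Likelihood
$$\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$$
Here,
- $p_\theta(y_i \mid x_i)$ = Model's predicted probability of true label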
Properties and Theorems
Theorem: Cross-Entropy ≥ Entropy
$H(P, Q) \geq H(P)$ for all distributions $P$ and $Q$. Equality holds iff $Q = P$. This follows from $D_{KL}(P \| Q) \geq 0$ (Gibbs' inequality).
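Theorem: Cross-Entropy Decomposition
$H(P, Q) = H(P) + D_{KL}(P \| Q)$: cross-entropy = entropy (irreducible noise) + KL divergence (model mismatch). Because $H(P)$ does not depend on the model, minimizing cross-entropy over $Q$ is equivalent to minimizing the KL divergence.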
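The decomposition follows in a few lines from the definitions. The derivation below is a sketch using the conventional symbols $H(P, Q)$, $H(P)$, and $D_{KL}(P \| Q)$; the notation is an assumption, and it takes $Q(x) > 0$ wherever $P(x) > 0$.

```latex
\begin{aligned}
H(P, Q) &= -\sum_{x} P(x) \log Q(x) \\
        &= -\sum_{x} P(x) \log P(x) + \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \\
        &= H(P) + D_{KL}(P \,\|\, Q)
\end{aligned}
```

The second line simply adds and subtracts $\sum_x P(x) \log P(x)$, which is why the entropy term carries no dependence on the model.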
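Theorem: Maximum Likelihood Connection
Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood of the data under the model. For a categorical model $p_\theta(y \mid x)$:
$$\arg\min_\theta \left[ -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) \right] = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$$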
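Theorem: Gradient Properties
The gradient of cross-entropy loss w.r.t. the logits $z$ is:
$$\frac{\partial \mathcal{L}}{\partial z_c} = \hat{y}_c - y_c, \qquad \hat{y} = \text{softmax}(z)$$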
This elegant form comes from combining softmax + cross-entropy. The gradient is the prediction error, which vanishes as the model becomes confident and correct.
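A short derivation shows where the prediction-error form comes from. It is a sketch that assumes targets summing to one ($\sum_c y_c = 1$, as with one-hot labels) and writes $z$ for the logits and $\hat{y} = \text{softmax}(z)$; these symbols are conventional choices, not taken from the text.

```latex
\begin{aligned}
\mathcal{L} &= -\sum_{c} y_c \log \hat{y}_c,
\qquad \hat{y}_c = \frac{e^{z_c}}{\sum_{j} e^{z_j}} \\
&= -\sum_{c} y_c z_c + \log \sum_{j} e^{z_j}
\qquad \text{(using } \textstyle\sum_{c} y_c = 1 \text{)} \\
\frac{\partial \mathcal{L}}{\partial z_k}
&= -y_k + \frac{e^{z_k}}{\sum_{j} e^{z_j}} = \hat{y}_k - y_k
\end{aligned}
```

The softmax Jacobian never has to be written out: taking the log first turns the loss into a linear term plus a log-sum-exp, whose derivative is the softmax itself.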
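Theorem: Label Smoothing Effect
Label smoothing with $\epsilon > 0$ keeps cross-entropy from rewarding ever more confident predictions. For true class $k$ and $C$ classes, the loss becomes:
$$\mathcal{L}_{\text{LS}} = -(1 - \epsilon) \log \hat{y}_k - \frac{\epsilon}{C} \sum_{c=1}^{C} \log \hat{y}_c$$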
Worked Examples
📝Example 1: Binary Cross-Entropy
Problem: True label , predicted . Compute BCE.
💡Solution: Binary CE
If : bit (maximum uncertainty). If : bits (high confidence, correct).
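The pattern in this solution depends only on the binary cross-entropy formula with a base-2 logarithm. The sketch below uses assumed illustrative values (true label 1 and predictions 0.5, 0.9, and 0.99, chosen here rather than taken from the example) and a hypothetical helper `bce_bits`.

```python
import math


def bce_bits(y: int, p: float) -> float:
    """Binary cross-entropy of a single prediction, in bits (log base 2)."""
    return -(y * math.log2(p) + (1 - y) * math.log2(1 - p))


# Assumed illustrative values: the true label is 1 in every case.
for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f} -> BCE = {bce_bits(1, p):.4f} bits")
```

A prediction of 0.5 costs exactly 1 bit, and the cost shrinks toward 0 as the predicted probability of the true label approaches 1.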
📝Example 2: Categorical Cross-Entropy
Problem: True label is class 0 (one-hot: ). Predictions: . Compute CE.
💡Solution: Categorical CE
Only the predicted probability of the true class matters.
📝Example 3: Comparing Two Models
Problem: True label: . Model A predicts , Model B predicts . Which is better?
💡Solution: Model Comparison
bits.
bits.
Model A has lower CE, so it's better. Model A is more confident AND correct.
📝Example 4: Cross-Entropy vs KL
Problem: True distribution , model A predicts , model B predicts . Compute CE and KL for both.
💡Solution: CE vs KL
bit.
bits.
bits.
bits.
bits.
is closer to in both CE and KL.
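A comparison of this kind can be checked numerically. The sketch below assumes a uniform two-outcome true distribution (entropy of 1 bit) and two hypothetical model distributions `q_a` and `q_b`; all three are illustrative choices, and every probability must be strictly positive for the logarithms to be finite.

```python
import numpy as np


def entropy_bits(p):
    """Entropy H(P) in bits."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))


def cross_entropy_bits(p, q):
    """Cross-entropy H(P, Q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))


def kl_bits(p, q):
    """KL divergence D_KL(P || Q) in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))


# Assumed illustrative distributions, not taken from the example.
p = [0.5, 0.5]
q_a = [0.6, 0.4]
q_b = [0.9, 0.1]

for name, q in (("A", q_a), ("B", q_b)):
    ce, kl = cross_entropy_bits(p, q), kl_bits(p, q)
    # The decomposition CE = entropy + KL holds for every model.
    assert np.isclose(ce, entropy_bits(p) + kl)
    print(f"model {name}: CE = {ce:.4f} bits, KL = {kl:.4f} bits")
```

Because the entropy term is the same for both models, ranking them by cross-entropy and ranking them by KL divergence always gives the same order.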
📝Example 5: Gradient Calculation
Problem: For a 3-class problem, logits are and true label is class 0. Compute the softmax output and the gradient.
💡Solution: Gradient
Softmax: .
Gradient w.r.t. logits: .
The gradient pushes logits toward making the correct class more probable.
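The softmax-minus-target gradient can be verified against finite differences. This sketch assumes logits of `[2.0, 1.0, 0.1]` (an illustrative choice) with true class 0, and the helper `ce_from_logits` is hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def ce_from_logits(z, k):
    """Cross-entropy in nats for logits z when the true class is k."""
    z = z - np.max(z)
    return float(np.log(np.sum(np.exp(z))) - z[k])


logits = np.array([2.0, 1.0, 0.1])  # assumed values for illustration
true_class = 0
y = np.eye(len(logits))[true_class]  # one-hot target

probs = softmax(logits)
analytic = probs - y  # gradient of CE w.r.t. the logits

# Central finite differences as an independent check.
h = 1e-6
numeric = np.zeros_like(logits)
for c in range(len(logits)):
    step = np.zeros_like(logits)
    step[c] = h
    numeric[c] = (ce_from_logits(logits + step, true_class)
                  - ce_from_logits(logits - step, true_class)) / (2 * h)

print("softmax :", np.round(probs, 4))
print("gradient:", np.round(analytic, 4))
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

For these logits the softmax is roughly `[0.659, 0.242, 0.099]`, so the gradient is roughly `[-0.341, 0.242, 0.099]`: negative on the true class (gradient descent raises that logit) and positive on the others.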
Python Implementation
import numpy as np

EPS = 1e-7  # keeps predictions away from 0 and 1 so log() stays finite


def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy in nats (np.log is the natural log)."""
    y_pred = np.clip(y_pred, EPS, 1 - EPS)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


def categorical_cross_entropy(y_true, y_pred):
    """Mean categorical cross-entropy for one-hot labels of shape (N, C)."""
    y_pred = np.clip(y_pred, EPS, 1 - EPS)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))


def sparse_categorical_cross_entropy(y_true, y_pred):
    """Mean cross-entropy for integer class labels of shape (N,)."""
    y_pred = np.clip(y_pred, EPS, 1 - EPS)
    N = y_true.shape[0]
    # Pick the predicted probability of each sample's true class.
    return -np.mean(np.log(y_pred[np.arange(N), y_true]))


def softmax(z):
    """Numerically stable softmax along the last axis."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def cross_entropy_with_logits(y_true, logits):
    """Mean CE from logits via log-softmax, so no probability is passed to log()."""
    z_shifted = logits - np.max(logits, axis=-1, keepdims=True)
    log_probs = z_shifted - np.log(np.sum(np.exp(z_shifted), axis=-1, keepdims=True))
    return -np.mean(np.sum(y_true * log_probs, axis=1))


# --- Examples (results are in nats; divide by np.log(2) to get bits) ---

# Binary CE
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.7])
print(f"BCE: {binary_cross_entropy(y_true, y_pred):.4f}")

# Categorical CE (one-hot labels)
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"CE: {categorical_cross_entropy(y_true, y_pred):.4f}")

# Sparse CE (integer labels, same predictions)
y_true = np.array([0, 1])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(f"Sparse CE: {sparse_categorical_cross_entropy(y_true, y_pred):.4f}")

# CE computed directly from logits
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
y_true = np.array([[1, 0, 0], [0, 1, 0]])
print(f"CE from logits: {cross_entropy_with_logits(y_true, logits):.4f}")
Why Cross-Entropy for Classification?
ℹ️ Reason 1: Maximum Likelihood
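For a categorical model $p_\theta(y \mid x)$, the log-likelihood of a dataset $\{(x_i, y_i)\}_{i=1}^{N}$ is:
$$\log L(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$$
Maximizing this is equivalent to minimizing cross-entropy loss, which is the same sum negated and divided by $N$.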
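The equivalence is easy to confirm on a toy batch. The probabilities and labels below are hypothetical values chosen for illustration; the sketch only checks that the mean cross-entropy is the negated log-likelihood divided by the number of samples.

```python
import numpy as np

# Hypothetical predicted class probabilities: N = 3 samples, C = 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])  # integer true classes

# p(y_i | x_i): probability the model assigns to each true label.
p_true = probs[np.arange(len(labels)), labels]

log_likelihood = np.sum(np.log(p_true))  # log of the product of p_true
mean_ce = -np.mean(np.log(p_true))       # cross-entropy loss in nats

assert np.isclose(mean_ce, -log_likelihood / len(labels))
print(f"log-likelihood = {log_likelihood:.4f}, mean CE = {mean_ce:.4f} nats")
```

Since the two quantities differ only by the constant factor $-1/N$, any parameter setting that raises the likelihood lowers the loss by a proportional amount.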
ℹ️ Reason 2: Gradient Properties
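The gradient of CE + softmax with respect to the logits is simply $\hat{y} - y$, the prediction error. This is well-behaved: large when the model is wrong, small when correct. In contrast, MSE + sigmoid has vanishing gradients when the model is confidently wrong, because the sigmoid derivative $\sigma(z)(1 - \sigma(z))$ multiplies the error and is near zero in the saturated regions.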
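The difference shows up clearly in the binary case. This sketch compares the gradient with respect to a single logit `z` for sigmoid + BCE and sigmoid + squared error, with true label 1; the logit values and function names are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grad_bce_wrt_logit(z, y):
    """d/dz of -[y log s + (1 - y) log(1 - s)] with s = sigmoid(z)."""
    return sigmoid(z) - y


def grad_mse_wrt_logit(z, y):
    """d/dz of (s - y)^2 with s = sigmoid(z)."""
    s = sigmoid(z)
    return 2.0 * (s - y) * s * (1.0 - s)


y = 1.0  # true label
for z in (-8.0, -4.0, 0.0, 4.0):  # very negative z means confidently wrong
    print(f"z = {z:+.1f}: BCE grad = {grad_bce_wrt_logit(z, y):+.5f}, "
          f"MSE grad = {grad_mse_wrt_logit(z, y):+.5f}")
```

At `z = -8` the model is confidently wrong: the BCE gradient has magnitude close to 1, while the squared-error gradient is below 0.001 because of the extra `s * (1 - s)` factor.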
ℹ️ Reason 3: Probabilistic Interpretation
CE measures the "surprise" of the true label under the model. A good model assigns high probability to true labels, minimizing surprise. This is information-theoretically principled.
ℹ️ Reason 4: Convexity
For linear models (logistic and softmax regression), CE loss is convex in the parameters, so any local minimum is a global minimum. MSE with sigmoid is non-convex and can have poor local minima.
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using MSE for classification | Poor gradients, non-convex | Use cross-entropy loss |
| Forgetting to clip predictions | log(0) = -∞ causes NaN | Clip to [ε, 1-ε] |
| Applying CE to regression | CE is for discrete distributions | Use MSE or MAE for regression |
| Not using softmax before CE | Raw logits aren't probabilities | Apply softmax first, or use a loss that takes logits directly (e.g. nn.CrossEntropyLoss) |
| Ignoring class imbalance | CE treats all classes equally | Use weighted CE or focal loss |
| Confusing BCE with Categorical CE | BCE is for binary, CE for multi-class | Match loss to problem type |
Interview Questions
Q1: Why not use MSE for classification? A: MSE + sigmoid has vanishing gradients when the model is confidently wrong (saturation region). CE + softmax has gradient $\hat{y} - y$, which doesn't saturate. Also, CE is convex for linear models while MSE with a sigmoid is not.
Q2: What's the connection between CE and MLE? A: Minimizing CE is equivalent to maximizing log-likelihood: $\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i)$. This is exactly the average negative log-likelihood.
Q3: How does label smoothing help? A: Label smoothing replaces hard one-hot labels with soft targets: $y_c^{\text{LS}} = (1 - \epsilon)\, y_c + \epsilon / C$. This prevents overconfident predictions and improves generalization. It's equivalent to adding a regularization term that pulls predictions toward the uniform distribution.
Q4: Why is CE called "cross-entropy"? A: It measures the "cross" between two distributions. When $Q = P$, it equals the entropy $H(P)$, the theoretical minimum. When $Q \neq P$, it's larger by $D_{KL}(P \| Q)$.
Q5: What happens numerically when $\hat{y}$ is very close to 0 or 1? A: $\log(0) = -\infty$, so the loss becomes infinite or NaN. In practice, clip predictions to $[\epsilon, 1 - \epsilon]$ or compute CE directly from logits using the log-sum-exp trick, which is numerically stable.
Practice Problems
📝Problem 1: BCE vs CE
Problem: A binary problem can be treated as 2-class categorical. Show that binary CE equals categorical CE for 2 classes.
💡Solution: BCE = CE
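For 2 classes with one-hot label $[y, 1 - y]$ and predictions $[\hat{y}, 1 - \hat{y}]$:
Categorical CE: $-\sum_{c=1}^{2} y_c \log \hat{y}_c = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$
This is exactly BCE.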
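The identity can also be checked numerically. The sketch below draws random binary labels and probabilities (arbitrary illustrative data), pairs the label vector `[y, 1 - y]` with the prediction vector `[p, 1 - p]`, and compares the two losses sample by sample.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5)        # binary labels
p = rng.uniform(0.05, 0.95, size=5)   # predicted probability of class 1

# Binary cross-entropy per sample.
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# The same problem written as 2-class categorical cross-entropy.
one_hot = np.stack([y, 1 - y], axis=1)   # shape (5, 2)
preds = np.stack([p, 1 - p], axis=1)     # shape (5, 2), rows sum to 1
cce = -np.sum(one_hot * np.log(preds), axis=1)

assert np.allclose(bce, cce)
print(np.round(bce, 4))
```

The probabilities are kept inside `[0.05, 0.95]` so that no clipping is needed for the logarithms.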
📝Problem 2: Uniform Predictions
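Problem: For a 10-class problem with uniform predictions $\hat{y}_c = 0.1$ for every class $c$, what is the CE if the true class is 0?
💡Solution: Uniform Predictions
$\text{CE} = -\log_2(0.1) = \log_2 10 \approx 3.32$ bits.
This is the loss of a model that has learned nothing about the classes. It is not an upper bound: a confidently wrong prediction gives a larger CE, and CE grows without limit as the predicted probability of the true class approaches 0.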
📝Problem 3: Perfect Predictions
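Problem: If the model predicts $\hat{y}_k = 1$ for the true class $k$ and 0 for all others, what is the CE?
💡Solution: Perfect Predictions
$\text{CE} = -\log(1) = 0$.
Zero loss means perfect predictions. In practice, this never happens exactly: a softmax over finite logits is always strictly below 1, and regularization keeps the logits from growing without bound.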
Advanced Topics
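Definition: Focal Loss
Focal loss addresses class imbalance by down-weighting easy examples:
$$\text{FL}(p_t) = -(1 - p_t)^{\gamma} \log p_t$$
where $\gamma \geq 0$ is the focusing parameter. When $\gamma = 0$, it reduces to standard CE. Larger $\gamma$ focuses more on hard (misclassified) examples.
Focal Loss
$$\text{FL}(p_t) = -(1 - p_t)^{\gamma} \log p_t$$
Here,
- $p_t$ = Predicted probability of true class
- $\gamma$ = Focusing parameter (typically 2)
- $(1 - p_t)^{\gamma}$ = Modulating factor that shrinks the loss of easy examples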
ℹ️ Focal Loss in Object Detection
Focal loss was introduced in RetinaNet (Lin et al., 2017) to address the extreme foreground-background class imbalance in object detection. With $\gamma = 2$, easy background examples (which dominate) are down-weighted by $(1 - p_t)^2$, while hard examples with small $p_t$ keep almost their full loss.
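The down-weighting is easy to see numerically. This is a minimal sketch of the unweighted focal loss for a single predicted probability `p_t` of the true class; it omits the class-balancing weight often used alongside it, and the probability values are illustrative assumptions.

```python
import numpy as np


def focal_loss(p_t, gamma=2.0):
    """Unweighted focal loss -(1 - p_t)^gamma * log(p_t), in nats."""
    p_t = np.clip(np.asarray(p_t, dtype=float), 1e-7, 1.0)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)


for p_t in (0.9, 0.5, 0.1):  # easy, medium, hard example
    ce = focal_loss(p_t, gamma=0.0)  # gamma = 0 recovers plain cross-entropy
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t = {p_t:.1f}: CE = {ce:.4f}, focal = {fl:.4f}, ratio = {fl / ce:.2f}")
```

With a focusing parameter of 2, an easy example at `p_t = 0.9` keeps only 1% of its cross-entropy, while a hard example at `p_t = 0.1` keeps 81%.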
ℹ️ Label Smoothing Details
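Label smoothing replaces the one-hot target with:
$$y_c^{\text{LS}} = (1 - \epsilon)\, y_c + \frac{\epsilon}{C}$$
The CE loss becomes:
$$\mathcal{L}_{\text{LS}} = -\sum_{c=1}^{C} y_c^{\text{LS}} \log \hat{y}_c = -(1 - \epsilon) \log \hat{y}_k - \frac{\epsilon}{C} \sum_{c=1}^{C} \log \hat{y}_c$$
where $k$ is the true class. This prevents the model from becoming overconfident and improves calibration.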
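A small numerical sketch makes the effect concrete. It assumes one common convention, in which a fraction `eps` of the target mass is spread uniformly over all `C` classes (including the true one); the function names, the value `eps = 0.1`, and the example probabilities are illustrative assumptions.

```python
import numpy as np


def smooth_labels(one_hot, eps=0.1):
    """Mix a one-hot target with the uniform distribution over C classes."""
    num_classes = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / num_classes


def cross_entropy(targets, probs):
    """Cross-entropy in nats between a target vector and predicted probabilities."""
    return float(-np.sum(targets * np.log(probs)))


one_hot = np.array([1.0, 0.0, 0.0])
probs = np.array([0.7, 0.2, 0.1])  # hypothetical model output
eps = 0.1

soft = smooth_labels(one_hot, eps)
loss_direct = cross_entropy(soft, probs)

# Same loss written as a mixture of hard-label CE and CE against uniform targets.
uniform = np.full(3, 1.0 / 3.0)
loss_split = ((1 - eps) * cross_entropy(one_hot, probs)
              + eps * cross_entropy(uniform, probs))

assert np.isclose(loss_direct, loss_split)
print("smoothed target:", np.round(soft, 4))
print(f"hard-label CE = {cross_entropy(one_hot, probs):.4f}, smoothed CE = {loss_direct:.4f}")
```

The uniform term penalizes any class probability that approaches 0, which is what stops the model from pushing the true-class probability all the way to 1.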
ℹ️ Numerical Stability: Log-Sum-Exp
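Computing $\log \sum_{c} e^{z_c}$ naively can overflow. The numerically stable version:
$$\log \sum_{c} e^{z_c} = m + \log \sum_{c} e^{z_c - m}, \qquad m = \max_c z_c$$
This subtracts the maximum to prevent overflow. PyTorch's nn.CrossEntropyLoss implements this automatically.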
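The overflow is easy to reproduce in NumPy. The sketch below compares a naive log-sum-exp with the max-shifted version on deliberately large logits; the function names and logit values are illustrative assumptions.

```python
import numpy as np


def logsumexp_naive(z):
    """Direct evaluation: overflows once exp(z) exceeds the float64 range."""
    return np.log(np.sum(np.exp(z)))


def logsumexp_stable(z):
    """Shift by the maximum so the largest exponent is exp(0) = 1."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))


z = np.array([1000.0, 999.0, 998.0])  # large logits chosen to force overflow

with np.errstate(over="ignore"):  # silence the expected overflow warning
    naive = logsumexp_naive(z)
stable = logsumexp_stable(z)

print("naive :", naive)
print("stable:", stable)
```

The naive version returns `inf` because `exp(1000)` overflows, while the shifted version returns a finite value slightly above 1000. For small logits the two agree.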
Quick Reference
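| Loss Function | Formula | Use Case |
|---|---|---|
| Binary CE | $-[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$ | Binary classification |
| Categorical CE | $-\sum_{c} y_c \log \hat{y}_c$ | Multi-class (one-hot) |
| Sparse CE | $-\log \hat{y}_k$, where $k$ is true class | Multi-class (integer labels) |
| Weighted CE | $-\sum_{c} w_c\, y_c \log \hat{y}_c$ | Imbalanced classes |
| Focal CE | $-(1 - p_t)^{\gamma} \log p_t$ | Hard examples |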
Cross-References
- 081 - Entropy: $H(P, Q) = H(P) + D_{KL}(P \| Q)$; cross-entropy decomposes into entropy plus KL.
- 082 - Mutual Information: MI is used for feature selection; CE is used for training.
- 083 - KL Divergence: $D_{KL}(P \| Q) = H(P, Q) - H(P)$; CE minus entropy equals KL.
- 085 - Applications: Cross-entropy is the loss in classification, knowledge distillation, and language modeling.
Summary
📋Key Takeaways
- Cross-Entropy: $H(P, Q) = -\sum_{x} P(x) \log Q(x)$ measures the average number of bits needed to encode events from $P$ using a code optimized for $Q$. Lower CE means $Q$ is closer to $P$.
- Decomposition: $H(P, Q) = H(P) + D_{KL}(P \| Q)$. The irreducible entropy is constant; minimizing CE is equivalent to minimizing KL divergence.
- MLE Connection: Minimizing CE loss is equivalent to maximizing the log-likelihood of the data under the model. This is why CE is the standard loss for classification.
- Gradient: The gradient of CE + softmax is $\hat{y} - y$, the prediction error. This is well-behaved and avoids the saturation issues of MSE + sigmoid.
- Binary vs Categorical: Binary CE is for 2 classes; categorical CE is for $C$ classes. They are equivalent when $C = 2$.
- Label Smoothing: Replacing hard labels with soft targets prevents overconfident predictions and improves generalization.
- Numerical Stability: Always compute CE from logits using the log-sum-exp trick rather than applying softmax then log separately. PyTorch's nn.CrossEntropyLoss does this automatically.