Training Deep Networks — Complete Guide
Training deep networks requires careful choice of optimizer, learning rate, regularization, and debugging techniques.
Optimizers
SGD with Momentum:
v = momentum × v + gradient
w = w - α × v
Often generalizes well, but typically converges more slowly than adaptive methods
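The momentum update can be sketched in plain Python. This is a minimal illustration, assuming the common two-step form (update the velocity v, then step the weights along it); the function name sgd_momentum_step and the toy objective f(w) = w^2 are hypothetical choices, not part of the notes above.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update on lists of floats; returns (new_w, new_v)."""
    # Velocity accumulates a decaying sum of past gradients.
    new_v = [momentum * vi + gi for vi, gi in zip(v, grad)]
    # Weights move against the velocity, scaled by the learning rate.
    new_w = [wi - lr * vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


# Toy problem (an assumption for illustration): minimize f(w) = w^2, gradient 2w.
w, v = [5.0], [0.0]
for _ in range(300):
    grad = [2.0 * wi for wi in w]
    w, v = sgd_momentum_step(w, v, grad)
```

On this toy problem w oscillates around 0 while shrinking, which is the typical overshoot-and-settle behavior of momentum.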
Adam: Adaptive learning rates per parameter
├─ Combines momentum + RMSprop
├─ Fast convergence
├─ Common default choice
└─ May generalize worse than SGD
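How Adam combines momentum with RMSprop-style scaling can be sketched for a single scalar parameter. This is a minimal illustration using the commonly cited default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), which are assumptions here; adam_step is a hypothetical helper name.

```python
import math


def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # RMSprop part: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two parameters whose gradients differ by four orders of magnitude.
w_big, _, _ = adam_step(1.0, 0.0, 0.0, grad=100.0, t=1)
w_small, _, _ = adam_step(1.0, 0.0, 0.0, grad=0.01, t=1)
```

Both parameters move by roughly lr on the first step, because each gradient is divided by its own magnitude estimate. That per-parameter rescaling is what "adaptive learning rates" refers to.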
AdamW: Adam with decoupled weight decay
├─ Better regularization
└─ Current default for most tasks
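The difference between decoupled weight decay and an L2 penalty added to the gradient can be sketched as follows. This is a simplified scalar illustration; adam_like_step and its decoupled flag are hypothetical names, and the hyperparameter values are assumptions.

```python
import math


def adam_like_step(w, m, v, grad, t, lr=1e-3, wd=0.01, decoupled=True,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar update. decoupled=True is AdamW-style; False is Adam with an L2 penalty."""
    if decoupled:
        w = w * (1.0 - lr * wd)   # decay the weight directly, outside the adaptive scaling
    else:
        grad = grad + wd * w      # L2 penalty enters the gradient and is rescaled by Adam
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# With a zero task gradient, only the weight decay term moves the weight.
w_adamw, _, _ = adam_like_step(1.0, 0.0, 0.0, grad=0.0, t=1, decoupled=True)
w_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, grad=0.0, t=1, decoupled=False)
```

In this first step the decoupled variant shrinks the weight by lr × wd × w, while the L2 variant moves it by roughly lr regardless of how small wd is, because the adaptive scaling normalizes the penalty away.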
Learning Rate Schedule:
├─ Cosine annealing
├─ Linear warmup + decay
├─ Step decay
└─ Reduce on plateau
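A schedule that combines linear warmup with cosine annealing might look like the sketch below. The function name lr_at_step and all default values (base_lr, warmup_steps, min_lr) are assumptions chosen for illustration.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
```

The rate peaks at base_lr when warmup ends and falls smoothly toward min_lr by the final step.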
Regularization
Dropout:
├─ Randomly zero neurons during training
├─ Forces redundancy
├─ Typical drop probability: 0.1 to 0.5
└─ Disable during inference
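The train/inference difference can be sketched as below. This sketch assumes the "inverted dropout" convention, where kept activations are scaled by 1 / (1 - p) during training so that inference needs no rescaling; the notes above do not specify a convention, and the dropout helper here is hypothetical.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats; identity when training is False."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    # Each unit is kept with probability `keep` and rescaled, otherwise zeroed.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the illustration is repeatable
x = [1.0] * 10000
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False)
```

With the rescaling, the average activation during training stays close to its inference-time value.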
Batch Normalization:
├─ Normalize each feature using mini-batch statistics
├─ Faster training
├─ Allows higher learning rates
└─ Slight regularization effect
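The per-batch normalization step can be sketched in plain Python. This simplified version covers training-mode statistics only and omits the learnable scale and shift and the running averages that real implementations use at inference; batch_norm is a hypothetical helper.

```python
def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) across the batch (rows)."""
    n = len(batch)
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(n_features)]
    return [[(row[j] - means[j]) / (variances[j] + eps) ** 0.5
             for j in range(n_features)]
            for row in batch]


# Two samples, two features on very different scales.
out = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

After normalization both features have roughly zero mean and unit variance across the batch, whatever their original scale. Each sample's output depends on the other samples in the batch.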
Layer Normalization:
├─ Normalize per sample (not per batch)
├─ Used in Transformers
└─ Stable for variable batch sizes
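Per-sample normalization can be sketched as follows. As a simplification, the learnable scale and shift are omitted; layer_norm is a hypothetical helper name.

```python
def layer_norm(sample, eps=1e-5):
    """Normalize one sample across its own features."""
    mean = sum(sample) / len(sample)
    var = sum((x - mean) ** 2 for x in sample) / len(sample)
    return [(x - mean) / (var + eps) ** 0.5 for x in sample]


row = [1.0, 2.0, 3.0]
alone = layer_norm(row)
# The same row processed as part of a larger batch gives the same result.
in_batch = [layer_norm(r) for r in [row, [10.0, 0.0, -10.0], [5.0, 5.0, 6.0]]]
```

Because the statistics come from the sample itself, the output for a given sample does not change with batch size or batch composition.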
Weight Decay:
├─ L2-style penalty that shrinks weights toward zero
├─ Discourages large weights
└─ AdamW applies the decay directly to the weights, outside the adaptive gradient scaling
Debugging Training
Symptoms and Solutions:
Loss not decreasing:
├─ Learning rate too high/low
├─ Bug in data pipeline
├─ Wrong loss function
└─ Check gradients (vanishing/exploding)
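A gradient check can be as simple as logging the global gradient norm each step. The sketch below assumes gradients are available as a dict of layer name to list of floats; the helper names and the thresholds low and high are arbitrary illustrative choices, not recommended values.

```python
import math


def global_grad_norm(grads):
    """L2 norm over all gradient values; grads maps a layer name to a list of floats."""
    return math.sqrt(sum(g * g for layer in grads.values() for g in layer))


def diagnose(grads, low=1e-7, high=1e3):
    """Rough label for the gradient scale; thresholds are illustrative assumptions."""
    norm = global_grad_norm(grads)
    if math.isnan(norm) or math.isinf(norm):
        return "non-finite"
    if norm < low:
        return "vanishing"
    if norm > high:
        return "exploding"
    return "ok"


status = diagnose({"layer1": [3.0], "layer2": [4.0]})
```

Tracking the same norm per layer instead of globally helps locate where in the network gradients shrink or blow up.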
Overfitting:
├─ More data
├─ Dropout, weight decay
├─ Data augmentation
├─ Early stopping
└─ Simplify model
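Of the remedies above, early stopping is easy to sketch: stop once the validation loss has failed to improve for a set number of checks. The EarlyStopping class, its patience and min_delta parameters, and the loss values below are all hypothetical and used only for illustration.

```python
class EarlyStopping:
    """Signals a stop once validation loss has not improved for `patience` checks in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # new best: reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement this check
        return self.bad_checks >= self.patience


val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]  # made-up values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
```

In practice the weights from the best check are usually saved and restored, since the model at the stopping point has already started to degrade.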
Underfitting:
├─ Larger model
├─ More epochs
├─ Lower regularization
└─ Better features
Unstable training:
├─ Lower learning rate
├─ Gradient clipping
├─ Better initialization
└─ Batch normalization
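Gradient clipping by global norm can be sketched as below. The helper clip_by_global_norm and the max_norm default are assumptions for illustration; the gradients are treated as one flat list of floats.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)            # already small enough: leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]  # same direction, shorter length


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Clipping by norm preserves the gradient direction and only limits the step size, which is why it helps with occasional spikes without changing ordinary updates.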
Key Takeaways
- Adam/AdamW is the default optimizer
- Learning rate is usually the most important hyperparameter to tune
- Cosine annealing with warmup is a common default schedule
- Dropout and weight decay reduce overfitting
- Batch/Layer normalization stabilize training
- Gradient clipping prevents exploding gradients
- Mixed precision training saves memory and speeds up training
- Gradient accumulation simulates larger batch sizes
- Early stopping is one of the simplest forms of regularization
- Debug training by monitoring loss curves
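The gradient accumulation takeaway can be checked on a toy model. The sketch below assumes a 1-D linear model y = w * x with mean squared error and equally sized micro-batches; mse_grad and accumulated_grad are hypothetical helpers.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average of micro-batch gradients; assumes len(xs) divides evenly."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Dividing each micro-batch gradient by n_micro averages them.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / n_micro
    return total


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
```

The two gradients match, so one optimizer step after accumulating over micro-batches behaves like a step on the full batch. The match is not exact for layers that use batch statistics, such as batch normalization.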