GANs Deep Dive — Generative Adversarial Networks
GANs learn to generate realistic data by pitting two neural networks against each other in a minimax game: a generator creates fake samples, and a discriminator tries to distinguish real from fake.
The GAN Framework
Definition: GAN Framework
A GAN consists of:
- Generator $G$: Maps random noise $z$ to fake data $G(z)$
- Discriminator $D$: Outputs the probability $D(x)$ that input $x$ is real data
The two networks compete: $G$ tries to fool $D$, while $D$ tries to correctly classify real vs. fake. In the ideal case, training converges when $G$ produces data indistinguishable from real data.
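Discriminator Objective
$$\max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Here,
- $D(x)$ = Discriminator's estimate that $x$ is real
- $G(z)$ = Generator's fake sample from noise $z$
- $p_{\text{data}}$ = Real data distribution
- $p_z$ = Prior noise distribution (e.g., Gaussian)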
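As a minimal sketch, the discriminator objective (the average of log D(x) over real samples plus the average of log(1 - D(G(z))) over fakes) can be estimated directly from a batch of discriminator outputs. The function name `value_function` and the sample numbers below are hypothetical and only serve as illustration.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for real samples, each strictly between 0 and 1
    d_fake: D(G(z)) for generated samples, each strictly between 0 and 1
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well scores higher ...
confident = value_function([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])
# ... than one that outputs 0.5 for everything.
undecided = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
print(confident, undecided)
```

The discriminator tries to push this value up, which is why a well-separating discriminator scores higher than an undecided one.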
Generator Objective (Non-saturating)
$$\max_G \; \mathbb{E}_{z \sim p_z}[\log D(G(z))]$$
Here,
- $\log D(G(z))$ = Generator wants discriminator to output 1 for fakes
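To see why the non-saturating form helps, it can be useful to compare gradients with respect to the discriminator's logit. The sketch below assumes the discriminator output is sigmoid(a) for a logit a on a fake sample; the helper names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)), the generator term in the original minimax game."""
    return -sigmoid(a)

def non_saturating_grad(a):
    """d/da of log(sigmoid(a)), the non-saturating generator term."""
    return 1.0 - sigmoid(a)

# a = -10: the discriminator is confident the sample is fake (early training)
for logit in (-10.0, 0.0, 10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

When the discriminator confidently rejects a fake (very negative logit), the saturating gradient is close to zero, while the non-saturating gradient stays close to 1, so the generator still receives a learning signal.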
Nash Equilibrium
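Theorem: Global Optimum of GAN
The global optimum of the minimax game is achieved when the generator's distribution $p_g$ equals the data distribution:
$$p_g = p_{\text{data}}$$
and the optimal discriminator is:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} = \frac{1}{2}$$
At this point, $V(D^*, G) = -\log 4$ and the generator perfectly matches the data distribution.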
ℹ️ Interpretation
When $p_g = p_{\text{data}}$, the discriminator cannot distinguish real from fake and outputs $D(x) = 0.5$ everywhere. The game reaches a Nash equilibrium where neither player can improve by changing strategy unilaterally.
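The equilibrium can be checked numerically on a small discrete example. The sketch below assumes the standard closed form for the optimal discriminator, the data density divided by the sum of the data and generator densities; the two toy distributions are made up for illustration.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_at_optimum(p_data, p_g):
    """V(D*, G) = E_{p_data}[log D*] + E_{p_g}[log(1 - D*)]."""
    d_star = optimal_discriminator(p_data, p_g)
    return sum(
        pd * math.log(d) + pg * math.log(1.0 - d)
        for pd, pg, d in zip(p_data, p_g, d_star)
    )

p_data = [0.1, 0.2, 0.3, 0.4]
mismatched = [0.4, 0.3, 0.2, 0.1]   # a generator that has not matched the data

d_matched = optimal_discriminator(p_data, p_data)
v_matched = value_at_optimum(p_data, p_data)
v_mismatched = value_at_optimum(p_data, mismatched)
print(d_matched, v_matched, v_mismatched)
```

With matching distributions the optimal discriminator is 0.5 on every outcome and the value equals -log 4; any mismatch gives the discriminator an edge and raises the value above that floor.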
Training Challenges
Definition: Mode Collapse
The generator learns to produce only a few types of outputs that fool the discriminator, ignoring the diversity of the real data distribution. This is one of the most common failure modes of GANs.
Symptoms: Generator produces very similar outputs regardless of input noise.
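One rough way to watch for this symptom is to measure how different the generated samples in a batch are from each other. The sketch below is only a heuristic, not a standard metric: `mean_pairwise_distance` is a hypothetical helper, and random tensors stand in for generator outputs.

```python
import torch

def mean_pairwise_distance(samples):
    """Average Euclidean distance between all pairs of samples in a batch."""
    flat = samples.flatten(start_dim=1)   # (N, features)
    return torch.pdist(flat).mean().item()

torch.manual_seed(0)
# Stand-in for a healthy generator: every noise vector gives a different image
diverse = torch.randn(16, 3, 8, 8)
# Stand-in for a collapsed generator: the same image for every noise vector
collapsed = torch.randn(1, 3, 8, 8).repeat(16, 1, 1, 1)

print(mean_pairwise_distance(diverse), mean_pairwise_distance(collapsed))
```

A score that drifts toward zero during training suggests the generator is ignoring its noise input. Partial collapse onto a handful of modes is harder to spot this way and usually calls for inspecting samples directly.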
Definition: Training Instability
GAN training is inherently unstable because:
- Non-convergence: Alternating optimization may not converge
- Vanishing gradients: When $D$ is too good, $\log(1 - D(G(z)))$ saturates
- Oscillation: $G$ and $D$ may cycle without converging
- Mode collapse: $G$ maps all inputs to the same output
DCGAN (Deep Convolutional GAN)
Definition: DCGAN Architecture
DCGAN (Radford et al., 2015) proposed architectural guidelines that make GAN training more stable:
- Replace pooling with strided convolutions (discriminator) and transposed convolutions (generator)
- Use batch normalization in both networks (except the generator's output layer and the discriminator's input layer)
- Remove fully connected hidden layers
- Use ReLU activation in the generator (Tanh for output)
- Use LeakyReLU in the discriminator
Transposed Convolution Output Size
$$H_{\text{out}} = (H_{\text{in}} - 1) \cdot s - 2p + k$$
Here,
- $H_{\text{in}}$ = Input spatial dimension
- $s$ = Stride of transposed convolution
- $p$ = Padding
- $k$ = Kernel size
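The output-size rule for a transposed convolution (input size minus one, times stride, minus twice the padding, plus kernel size) can be applied layer by layer to check an architecture before building it. The sketch below assumes no output padding and no dilation, and uses kernel, stride, and padding values typical of a DCGAN generator that starts from a 1x1 latent map.

```python
def conv_transpose_out(h_in, kernel, stride, padding):
    """Output size of a transposed convolution (no output_padding, dilation 1)."""
    return (h_in - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for each of five transposed-convolution layers
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1          # the latent vector is treated as a 1x1 feature map
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```

With kernel 4, stride 2, and padding 1, each layer exactly doubles the spatial size, which is why this combination is so common in generators.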
WGAN (Wasserstein GAN)
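Definition: WGAN
WGAN (Arjovsky et al., 2017) replaces the JS divergence with the Wasserstein distance (Earth-Mover distance) for more stable training:
- Uses Wasserstein distance: $W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$
- Discriminator becomes a "critic": it outputs an unbounded scalar score, not a probability
- Weight clipping (WGAN) or a gradient penalty (WGAN-GP) enforces the Lipschitz constraint on the critic; with a gradient penalty, avoid batch norm in the critic
- Meaningful loss: the critic loss correlates with sample quality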
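For intuition about the Earth-Mover distance, the one-dimensional case has a simple closed form for two equal-sized samples: sort both and average the absolute differences. The sketch below covers only that special case and is not how a WGAN critic estimates the distance in practice; the sample values are made up.

```python
def wasserstein_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

real = [0.0, 1.0, 2.0]
near = [1.0, 2.0, 3.0]      # real shifted by 1
far = [10.0, 11.0, 12.0]    # real shifted by 10, no overlap with real

print(wasserstein_1d(real, near), wasserstein_1d(real, far))
```

The distance keeps growing with the shift even when the two samples no longer overlap, so a generator that is far from the data still gets a useful signal about which direction to move.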
💡 WGAN-GP Gradient Penalty
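Instead of weight clipping (WGAN), use a gradient penalty (WGAN-GP):
$$\mathcal{L}_{GP} = \lambda \, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]$$
where $\hat{x}$ is interpolated between real and fake samples. This enforces the Lipschitz constraint softly, as a penalty added to the critic loss rather than a hard limit on the weights.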
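A gradient penalty of this kind might be computed in PyTorch roughly as follows. This is a sketch under assumptions: the helper name `gradient_penalty`, the coefficient of 10, and the tiny linear critic are illustrative choices rather than part of the original text.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """lambda * E[(||grad critic(x_hat)||_2 - 1)^2] on random interpolates x_hat."""
    batch_size = real.size(0)
    # One mixing coefficient per sample, broadcast over the remaining dimensions
    eps_shape = [batch_size] + [1] * (real.dim() - 1)
    eps = torch.rand(eps_shape, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)

    scores = critic(x_hat)
    grads, = torch.autograd.grad(
        outputs=scores,
        inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # keep the graph so the penalty itself can be backpropagated
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

torch.manual_seed(0)
critic = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1))  # toy critic for illustration
real = torch.randn(4, 3, 8, 8)
fake = torch.randn(4, 3, 8, 8)

gp = gradient_penalty(critic, real, fake)
gp.backward()   # gradients of the penalty reach the critic's weights
```

In a training loop the penalty would be added to the critic loss before the optimizer step. `create_graph=True` is what allows the penalty, which is itself built from a gradient, to be differentiated again.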
StyleGAN
Definition: StyleGAN Architecture
StyleGAN (Karras et al., 2019) introduces a style-based generator architecture:
- Mapping network: $f: \mathcal{Z} \to \mathcal{W}$ (8 FC layers) maps latent $z$ to a style vector $w$ in style space $\mathcal{W}$
- Adaptive instance normalization (AdaIN): Injects style at each layer
- Noise injection: Per-pixel noise for stochastic variation
- Progressive growing: Train with increasing resolution
This enables disentangled control over high-level attributes (pose, identity) and stochastic variation (hair, freckles).
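AdaIN (Adaptive Instance Normalization)
$$\text{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
Here,
- $x_i$ = Feature map at layer $i$
- $y_{s,i}$ = Style scale (from $w$)
- $y_{b,i}$ = Style bias (from $w$)
- $\mu(x_i), \sigma(x_i)$ = Mean and std of feature map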
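The AdaIN operation itself is short: normalize each channel of each sample to zero mean and unit standard deviation, then apply a style-dependent scale and bias. In the sketch below the scale and bias are passed in directly; in StyleGAN they would come from a learned affine transform of the style vector, which is omitted here. The function name and the small `eps` stabilizer are assumptions of this example.

```python
import torch

def adain(x, y_scale, y_bias, eps=1e-5):
    """Normalize each channel of each sample, then apply a style scale and bias.

    x:        (N, C, H, W) feature maps
    y_scale:  (N, C) style scale
    y_bias:   (N, C) style bias
    """
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True, unbiased=False)
    normalized = (x - mu) / (sigma + eps)
    return y_scale[:, :, None, None] * normalized + y_bias[:, :, None, None]

torch.manual_seed(0)
x = torch.randn(2, 4, 8, 8)
y_scale = torch.full((2, 4), 3.0)
y_bias = torch.full((2, 4), -1.0)

out = adain(x, y_scale, y_bias)
print(out.mean(dim=(2, 3)))                  # close to the style bias
print(out.std(dim=(2, 3), unbiased=False))   # close to the style scale
```

After AdaIN, the per-channel statistics of the output are set by the style rather than by the incoming features, which is how the style controls each layer.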
PyTorch Implementation
📝 Example: DCGAN
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, latent_dim=100, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # Latent -> 512 x 4 x 4
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(True),
            # 512 x 4 x 4 -> 256 x 8 x 8
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(True),
            # 256 x 8 x 8 -> 128 x 16 x 16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(True),
            # 128 x 16 x 16 -> 64 x 32 x 32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(True),
            # 64 x 32 x 32 -> channels x 64 x 64, squashed to [-1, 1] by Tanh
            nn.ConvTranspose2d(64, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        # Reshape (N, latent_dim) noise into (N, latent_dim, 1, 1) feature maps
        return self.main(z.view(z.size(0), -1, 1, 1))


class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # channels x 64 x 64 -> 64 x 32 x 32 (no batch norm on the input layer)
            nn.Conv2d(channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 x 32 x 32 -> 128 x 16 x 16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 x 16 x 16 -> 256 x 8 x 8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 256 x 8 x 8 -> 512 x 4 x 4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # 512 x 4 x 4 -> 1 (raw logit; the sigmoid is applied inside the loss)
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
        )

    def forward(self, x):
        # Flatten (N, 1, 1, 1) logits to shape (N,)
        return self.main(x).view(-1)


# Training loop
# NOTE: `dataloader` is assumed to be defined elsewhere and to yield
# (image, label) batches of 3 x 64 x 64 images scaled to [-1, 1].
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
G = Generator(100, 3).to(device)
D = Discriminator(3).to(device)
opt_G = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
criterion = nn.BCEWithLogitsLoss()  # expects raw logits from D

for epoch in range(100):
    for real, _ in dataloader:
        real = real.to(device)
        batch_size = real.size(0)

        # Train Discriminator: real -> 1, fake -> 0
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z).detach()  # detach so this step does not update G
        loss_D = (
            criterion(D(real), torch.ones(batch_size, device=device))
            + criterion(D(fake), torch.zeros(batch_size, device=device))
        )
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Train Generator (non-saturating loss): make D output 1 for fakes
        z = torch.randn(batch_size, 100, device=device)
        fake = G(z)
        loss_G = criterion(D(fake), torch.ones(batch_size, device=device))
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
Training Tips
💡 GAN Training Best Practices
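- Use label smoothing: Real labels = 0.9 (reduces overconfidence); one-sided smoothing keeps fake labels at 0, while some practitioners also soften fake labels to about 0.1
- Balance D and G updates: Train D more often than G (e.g., a 5:1 ratio, as in WGAN), or use the two-time-scale update rule (TTUR), which gives D a larger learning rate than G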
- Spectral normalization: Apply to D weights for stable training
- Progressive growing: Start with low resolution, increase gradually
- Track FID/IS: Fréchet Inception Distance is the standard evaluation metric
- Avoid batch norm in D when using a gradient penalty (WGAN-GP): Use layer norm or instance norm instead
- Adam optimizer: $\beta_1 = 0.5$, $\beta_2 = 0.999$, learning rate $2 \times 10^{-4}$
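Two of these tips are easy to try in PyTorch. The sketch below shows smoothed targets for the discriminator loss and a spectrally normalized convolution. The helper `discriminator_loss` and its defaults are hypothetical; the default here keeps fake targets at 0 (one-sided smoothing), and passing `fake_label=0.1` gives the two-sided variant.

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

def discriminator_loss(real_logits, fake_logits, real_label=0.9, fake_label=0.0):
    """BCE discriminator loss with smoothed targets for real samples."""
    real_targets = torch.full_like(real_logits, real_label)
    fake_targets = torch.full_like(fake_logits, fake_label)
    return criterion(real_logits, real_targets) + criterion(fake_logits, fake_targets)

# Spectral normalization wraps a layer and rescales its weight by an
# estimate of the weight's largest singular value on every forward pass.
sn_conv = nn.utils.spectral_norm(nn.Conv2d(3, 64, 4, 2, 1, bias=False))

torch.manual_seed(0)
images = torch.randn(8, 3, 64, 64)   # stand-in batch of images
features = sn_conv(images)           # shape (8, 64, 32, 32)
loss = discriminator_loss(torch.randn(8), torch.randn(8))
print(features.shape, loss.item())
```

Spectral normalization can be applied to every convolution in the discriminator in the same way. It constrains how sharply the discriminator can change its output, which tends to keep generator gradients better behaved.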
Practice Exercises
- Train DCGAN on CIFAR-10: Generate realistic images. Monitor FID over training.
- WGAN-GP implementation: Replace BCE loss with Wasserstein loss + gradient penalty. Compare training stability.
- Mode collapse experiment: Train a GAN on MNIST and observe mode collapse. Fix it with minibatch discrimination.
- Style mixing: Implement StyleGAN and experiment with style mixing at different layers.
Key Takeaways
📋Summary: GANs
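- GANs consist of a generator and a discriminator in a minimax game
- Nash equilibrium: $p_g = p_{\text{data}}$, $D^*(x) = \frac{1}{2}$
- Non-saturating loss: maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$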
- DCGAN: Architectural guidelines for stable training
- WGAN: Wasserstein distance for better training dynamics
- StyleGAN: Style-based generation with disentangled controls
- Mode collapse and training instability are the main challenges
- FID score is the standard evaluation metric
- GANs excel at image synthesis, style transfer, super-resolution
- See also: GANs in ML for fundamentals