Stable Diffusion Complete Guide — Diffusion Architecture, SDXL, SD 3.5, LoRA & ComfyUI
Stable Diffusion is the most popular open-source AI image generator. Unlike proprietary alternatives, it runs locally on consumer hardware and has spawned a massive ecosystem of custom models, LoRA adapters, and community tools.
The Theory of Diffusion Models
What is Diffusion?
Diffusion models generate images by learning to reverse a noise process. The core insight: if you gradually add noise to an image until it becomes pure static, you can train a neural network to reverse that process.
Forward Process (Training Data Creation):
------------------------------------------
Clean Image -> Add Noise -> Add Noise -> ... -> Pure Noise
🐱 😺 😶 📺
t=0 t=1 t=2 t=T
This is a MARKOV CHAIN:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I)
Where:
x_t = noisy image at timestep t
β_t = noise schedule (controls how much noise added)
N = Gaussian distribution
Reverse Process (Generation):
------------------------------------------
Pure Noise -> Remove Noise -> Remove Noise -> ... -> Clean Image
📺 🫥 😶 🐱
t=T t=T-1 t=T-2 t=0
This is what the MODEL LEARNS:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
The model predicts the mean (μ) and variance (Σ) of the
slightly less noisy image x_{t-1} given the current noisy image x_t.
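One reverse step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Stable Diffusion's actual sampler: it uses the standard DDPM mean with a fixed variance σ_t² = β_t, a linear β schedule, and a hypothetical `oracle` function that stands in for the trained network by computing the true noise from a known x₀.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for an assumed linear noise schedule."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta          # alpha_bar_t = product of (1 - beta_s)
        alpha_bars.append(running)
    return betas, alpha_bars

def reverse_step(x_t, t, predict_noise, betas, alpha_bars, rng):
    """One step x_t -> x_{t-1}, with fixed variance sigma_t^2 = beta_t."""
    beta_t, alpha_bar_t = betas[t], alpha_bars[t]
    eps = predict_noise(x_t, t)
    coef = beta_t / math.sqrt(1.0 - alpha_bar_t)
    mean = [(x - coef * e) / math.sqrt(1.0 - beta_t) for x, e in zip(x_t, eps)]
    if t == 0:                         # last step: return the mean, add no noise
        return mean
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

# Toy setup: a 4-value "image" and an oracle that knows it (assumption for the demo).
x0 = [0.5, -1.0, 0.25, 2.0]
betas, alpha_bars = make_schedule(200)

def oracle(x_t, t):
    """Stand-in for eps_theta: the exact noise implied by x_t and the known x0."""
    a = alpha_bars[t]
    return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a) for x, c in zip(x_t, x0)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in x0]  # start from pure noise
for t in reversed(range(len(betas))):
    x = reverse_step(x, t, oracle, betas, alpha_bars, rng)
print([round(v, 4) for v in x])
```

With a perfect noise predictor the chain lands back on x₀. A real model only approximates the noise, which is why sample quality depends on how well ε_θ was trained.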
The Math Behind Diffusion
Training Objective (simplified):
L = E[||ε - ε_θ(x_t, t)||²]
Where:
ε = actual noise added
ε_θ = model's prediction of the noise
x_t = noisy image at timestep t
t = timestep
The model learns to predict the NOISE that was added;
the sampler then removes part of it at each step to recover the clean image.
In practice:
1. Take clean image x₀
2. Sample random noise ε ~ N(0, I)
3. Sample random timestep t
4. Create noisy image: x_t = √(ᾱ_t) x₀ + √(1-ᾱ_t) ε, where ᾱ_t = ∏_{s=1..t} (1-β_s)
5. Train model to predict ε from x_t and t
6. Loss = ||ε - ε_θ(x_t, t)||²
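The six steps above can be sketched without any ML framework. This is only an illustration: the linear β schedule is an assumption, `zero_model` is a hypothetical placeholder for the real network, and the gradient update that would follow the loss is omitted.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod(1 - beta_s) for a linear schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, eps, alpha_bar_t):
    """Step 4: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal, noise = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * e for x, e in zip(x0, eps)]

def mse(pred, target):
    """Step 6: mean squared error between predicted and actual noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def training_loss(model, x0, alpha_bars, rng):
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # step 2: sample noise
    t = rng.randrange(len(alpha_bars))            # step 3: sample a timestep
    x_t = add_noise(x0, eps, alpha_bars[t])       # step 4: noisy image
    return mse(model(x_t, t), eps)                # steps 5-6: predict and score

def zero_model(x_t, t):
    """Placeholder for the trained network: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
x0 = [0.1 * (i % 10) for i in range(2000)]        # step 1: a stand-in "clean image"
alpha_bars = alpha_bar_schedule(1000)
loss = training_loss(zero_model, x0, alpha_bars, rng)
print(f"loss with an untrained stand-in: {loss:.3f}")
```

A model that always predicts zero scores roughly the variance of the noise (about 1), so training amounts to driving the loss below that baseline.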
Stable Diffusion Architecture
Latent Diffusion Model (LDM)
Stable Diffusion operates in latent space, not pixel space. This is its key innovation:
Pixel Space Diffusion (DALL-E 2):
Image (512×512×3 = 786,432 dimensions)
-> Diffusion in pixel space
-> Very slow, very expensive
Latent Space Diffusion (Stable Diffusion):
Image (512×512×3) -> VAE Encoder -> Latent (64×64×4 = 16,384 dims)
-> Diffusion in latent space (48x smaller!)
-> Much faster, much cheaper
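The 48x figure follows directly from the tensor shapes. A small sketch of the arithmetic, assuming the 8x spatial downsampling and 4 latent channels shown above (the helper name `latent_shape` is made up for this example):

```python
import math

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the VAE latent for an image of the given pixel size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsample factor")
    return (height // downsample, width // downsample, latent_channels)

pixel_values = math.prod((512, 512, 3))
latent_values = math.prod(latent_shape(512, 512))
print(pixel_values, latent_values, pixel_values // latent_values)
```

The same helper gives a 128×128×4 latent for a 1024×1024 image, so the compression ratio stays at 48x as resolution grows.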
+---------------------------------------------------------+
| Stable Diffusion Architecture |
| |
| Text Prompt: "A cat in space" |
| | |
| v |
| +--------------+ |
| | CLIP Text | <- Text understanding |
| | Encoder | (frozen, pretrained) |
| +------+-------+ |
| | text embeddings |
| v |
| +------------------------------------------+ |
| | U-Net Denoiser | |
| | (Processes in LATENT space, not pixels) | |
| | | |
| | Input: Noisy latent (64×64×4) | |
| | + Timestep embedding | |
| | + Text conditioning (cross-attention) | |
| | | |
| | Output: Predicted noise (64×64×4) | |
| +--------------+---------------------------+ |
| | |
| v |
| +------------------+ |
| | Scheduler | <- DDPM, DDIM, Euler, etc. |
| | (Removes noise) | |
| +------+-----------+ |
| | clean latent |
| v |
| +------------------+ |
| | VAE Decoder | <- Latent -> Pixel |
| | (64×64 -> 512×512)| |
| +------+-----------+ |
| | |
| v |
| Generated Image (512×512×3) |
+---------------------------------------------------------+
U-Net Architecture
The U-Net is the core denoising network:
U-Net Structure:
Input (64×64×4 latent + timestep + text embeddings)
|
v
+-------------------------------------------------+
| Encoder Path (Downsampling) |
| |
| Conv (64×64) -> ResBlock -> Attention -> Down |
| Conv (32×32) -> ResBlock -> Attention -> Down |
| Conv (16×16) -> ResBlock -> Attention -> Down |
| Conv (8×8) -> ResBlock |
| |
| Middle: Conv (8×8) -> ResBlock -> Attention |
| |
| Decoder Path (Upsampling) |
| |
| ResBlock + Skip -> Conv (8×8) |
| Up -> ResBlock + Skip -> Attention -> Conv (16×16)|
| Up -> ResBlock + Skip -> Attention -> Conv (32×32)|
| Up -> ResBlock + Skip -> Attention -> Conv (64×64)|
+-------------------------------------------------+
|
v
Output (64×64×4 predicted noise)
Key: SKIP CONNECTIONS preserve fine details
Text conditioning via CROSS-ATTENTION in the attention blocks
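To make the cross-attention idea concrete, here is a single-head sketch in plain Python. It is deliberately simplified: the learned query, key, and value projections of a real U-Net are left out (treated as identity), and the tiny 2-dimensional token vectors are invented for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latent_tokens, text_tokens):
    """Each latent token (query) attends over all text tokens (keys and values)."""
    dim = len(text_tokens[0])
    outputs, weights = [], []
    for query in latent_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in text_tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is a weighted average of the text token vectors.
        outputs.append([sum(wj * value[c] for wj, value in zip(w, text_tokens))
                        for c in range(dim)])
    return outputs, weights

latent_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three spatial positions
text_tokens = [[2.0, 0.0], [0.0, 2.0]]                 # two prompt tokens
outputs, weights = cross_attention(latent_tokens, text_tokens)
for w in weights:
    print([round(v, 3) for v in w])
```

The queries come from the image side and the keys and values from the prompt, which is how each spatial position decides which words to draw information from.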
Model Versions
Stable Diffusion 1.5 — The Classic
| Specification | Details |
|---|---|
| Release | October 2022 |
| Parameters | ~860 million |
| Architecture | Latent Diffusion (U-Net) |
| Resolution | 512×512 |
| Text Encoder | CLIP ViT-L/14 |
| VAE | KL-f8 (48x compression) |
| License | CreativeML Open RAIL-M |
| Community models | 10,000+ fine-tuned variants |
Why it's still popular: Massive community, most LoRA models available, fastest generation.
SDXL — Quality Leap
| Specification | Details |
|---|---|
| Release | July 2023 |
| Parameters | ~2.6 billion (U-Net); ~3.5 billion for the full base model |
| Architecture | Latent Diffusion (U-Net), optional two-stage Base + Refiner |
| Resolution | 1024×1024 |
| Text Encoder | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14 |
| License | CreativeML Open RAIL++-M |
Two-Stage Architecture:
SDXL Pipeline:
Stage 1: Base Model (3.5B params)
+----------------------------+
| Generate initial image |
| (128×128×4 latent) |
| Coarse structure + colors |
+------------+---------------+
|
v
Stage 2: Refiner Model (optional; base + refiner ≈ 6.6B params combined)
+----------------------------+
| Refine details |
| Fine textures |
| Sharp edges |
| Final quality |
+------------+---------------+
|
v
Final Image (1024×1024)
The refiner is a separate model specialized
for the final, low-noise denoising steps.
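One way to picture the handoff is as a split of the sampling steps between the two models. The sketch below is only a model of that idea: `split_steps` and the 0.8 handoff fraction are assumptions chosen for illustration. In the diffusers library the same split is expressed through the `denoising_end` argument on the base pipeline and `denoising_start` on the refiner pipeline.

```python
def split_steps(num_steps, base_fraction=0.8):
    """Assign the high-noise steps to the base model and the rest to the refiner."""
    if not 0.0 < base_fraction < 1.0:
        raise ValueError("base_fraction must be strictly between 0 and 1")
    # Step indices run from the noisiest (num_steps - 1) down to the cleanest (0).
    step_indices = list(range(num_steps - 1, -1, -1))
    cut = round(num_steps * base_fraction)
    return step_indices[:cut], step_indices[cut:]

base_steps, refiner_steps = split_steps(40, base_fraction=0.8)
print(len(base_steps), len(refiner_steps))
print(base_steps[0], base_steps[-1], refiner_steps[0], refiner_steps[-1])
```

The base model handles composition while the image is still mostly noise, and the refiner only sees the last few low-noise steps, where fine texture is decided.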
SD 3.5 Large — Latest Generation
| Specification | Details |
|---|---|
| Release | October 2024 |
| Parameters | ~8 billion |
| Architecture | Multimodal Diffusion Transformer (MMDiT) |
| Resolution | ~1 megapixel (e.g. 1024×1024) |
| Text Encoder | CLIP ViT-L + OpenCLIP ViT-bigG + T5-XXL |
| License | Stability AI Community |
MMDiT Architecture — A paradigm shift from U-Net:
MMDiT (Multimodal Diffusion Transformer):
Instead of U-Net, SD 3.5 uses TRANSFORMER blocks:
+---------------------------------------------+
| Input: Noisy latent patches (16×16 = 256) |
| + Text token embeddings |
| |
| +-------------------------------------+ |
| | MMDiT Block (×38 layers in SD 3.5 Large) | |
| | | |
| | +----------+ +----------+ | |
| | |Image | |Text | | |
| | |Patches |◄--►|Tokens | | |
| | |(256) | |(77 max) | | |
| | +----------+ +----------+ | |
| | | | | |
| | v v | |
| | Joint Self-Attention + FFN | |
| | | | | |
| | v v | |
| | Updated Image Updated Text | |
| | Patches Embeddings | |
| +-------------------------------------+ |
| |
| Output: Denoised latent patches |
+---------------------------------------------+
Why Transformer instead of U-Net?
- Better at long-range dependencies
- Scales better with resolution
- More efficient computation
- Better text-image alignment
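Because image patches and text tokens share one attention sequence, its length is worth estimating. A rough sketch, assuming an 8x VAE downsample and 2×2 latent patches, with 77 text tokens taken from the diagram as a placeholder (real pipelines may append more text tokens):

```python
def image_token_count(height, width, vae_downsample=8, patch_size=2):
    """Number of latent patches (image tokens) for a given pixel resolution."""
    factor = vae_downsample * patch_size
    if height % factor or width % factor:
        raise ValueError(f"image sides must be multiples of {factor}")
    return (height // factor) * (width // factor)

def joint_sequence_length(height, width, text_tokens=77):
    """Image tokens and text tokens are concatenated for joint self-attention."""
    return image_token_count(height, width) + text_tokens

for side in (256, 512, 1024):
    n_image = image_token_count(side, side)
    n_joint = joint_sequence_length(side, side)
    # Self-attention compares every token with every other token.
    print(side, n_image, n_joint, n_joint ** 2)
```

A 256×256 image gives the 16×16 = 256 patches used in the diagram, while 1024×1024 gives 4,096, so the number of pairwise attention comparisons grows quickly with resolution.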
LoRA Fine-Tuning Deep Dive
What is LoRA?
LoRA (Low-Rank Adaptation) enables custom model fine-tuning with minimal storage:
Full Fine-Tuning:
W (d×d matrix) -> W' (d×d matrix)
Every weight is updated and stored, e.g. ~2.6B U-Net params × 2 bytes (fp16) ≈ 5.2GB for SDXL
Full backward pass through all params
LoRA:
W (d×d) -> FROZEN
A (d×r) -> TRAINED
B (r×d) -> TRAINED
W' = W + A × B
Where r << d (e.g., r=16, d=4096)
Trainable params per adapted matrix: 2 × d × r = 131,072 (vs d² ≈ 16.8M for full)
Storage: ~100MB or less for a typical adapter (vs several GB for a full checkpoint)
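The parameter counts and the merged weight W' = W + A × B can be checked with a few lines of Python. This is a toy sketch on 3×3 matrices: the helper names are invented here, and the `alpha / r` scaling is an assumption borrowed from common implementations such as PEFT (the formula above corresponds to a scale of 1).

```python
def lora_param_count(d, r):
    """Trainable values in A (d x r) plus B (r x d)."""
    return 2 * d * r

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * (A x B); W itself is left untouched."""
    scale = alpha / r
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

print(lora_param_count(4096, 16), 4096 * 4096)

# Toy example with d=3, r=1.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1], [2], [3]]          # d x r
B = [[1, 0, -1]]             # r x d
merged = merge_lora(W, A, B, alpha=1, r=1)
for row in merged:
    print(row)
```

For d=4096 and r=16 the adapter trains 128 times fewer values than the full matrix, and merging it back is a single addition once training is done.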
LoRA Training Example
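from diffusers import StableDiffusionXLPipeline
from diffusers.utils import convert_state_dict_to_diffusers
from peft import LoraConfig
from peft.utils import get_peft_model_state_dict

# Load base model (an SDXL checkpoint needs the XL pipeline class)
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLPipeline.from_pretrained(model_id)

# Freeze the base U-Net; only the LoRA matrices will receive gradients
pipe.unet.requires_grad_(False)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                 # Rank of the low-rank update
    lora_alpha=16,        # Scaling (effective scale = lora_alpha / r)
    target_modules=[
        "to_q", "to_k", "to_v",   # Attention projections
        "to_out.0",
    ],
    lora_dropout=0.05,
)

# Inject LoRA layers into the U-Net
pipe.unet.add_adapter(lora_config)

# Train on your dataset
# ... training loop ...

# Save only the LoRA weights, not the full U-Net
lora_state_dict = convert_state_dict_to_diffusers(get_peft_model_state_dict(pipe.unet))
StableDiffusionXLPipeline.save_lora_weights(
    save_directory="./my-lora-adapter",
    unet_lora_layers=lora_state_dict,
)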
ComfyUI Node-Based Workflows
A basic text-to-image workflow chains these nodes:
Load Checkpoint (SD model weights)
-> CLIP Text Encode (Prompt + Negative)
-> KSampler (Denoise steps)
-> VAE Decode
-> Save Image
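The same chain can be written down as a graph in the style of ComfyUI's API-format export, where each node has a `class_type` and its inputs either hold a value or point at another node's output as `[node_id, output_index]`. Treat this as a sketch: the node class names are those of a stock install as far as I know, the checkpoint filename is a placeholder, and the extra Empty Latent Image node (not in the list above) is added because the sampler needs a starting latent.

```python
import json

def build_txt2img_workflow(ckpt_name, positive, negative,
                           seed=0, steps=20, cfg=7.0, width=512, height=512):
    """Return a text-to-image node graph as a plain dict keyed by node id."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},      # outputs: model, clip, vae
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "euler", "scheduler": "normal",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "sd_guide"}},
    }

def dangling_links(workflow):
    """List (node_id, input_name) pairs that point at a node that does not exist."""
    missing = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and value[0] not in workflow:
                missing.append((node_id, name))
    return missing

workflow = build_txt2img_workflow("example-checkpoint.safetensors",
                                  "a cat in space", "blurry, low quality")
print(dangling_links(workflow))
print(json.dumps(workflow["5"], indent=2))
```

Keeping the graph as data is what makes ComfyUI workflows easy to share, diff, and generate from scripts.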
System Requirements
| Component | SD 1.5 | SDXL | SD 3.5 |
|---|---|---|---|
| GPU VRAM | 4GB | 8GB | 16GB+ |
| RAM | 8GB | 16GB | 32GB+ |
| Storage | 5GB | 15GB | 30GB+ |
| GPU (min) | GTX 1660 | RTX 3060 | RTX 4090 |
| GPU (rec) | RTX 3060 | RTX 4090 | A100 80GB |
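A quick sanity check on the VRAM row is to estimate how much memory the denoiser weights alone occupy. The sketch below assumes fp16 storage (2 bytes per parameter) and uses the parameter counts quoted earlier in this guide. Text encoders, the VAE, and activations come on top of this, while quantization or CPU offloading can bring the total down.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

denoisers = [
    ("SD 1.5 U-Net", 860e6),
    ("SDXL U-Net", 2.6e9),
    ("SD 3.5 Large MMDiT", 8e9),
]
for name, params in denoisers:
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB in fp16")
```

The jump from SDXL to SD 3.5 Large is what pushes the practical minimum toward 16GB-class cards.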
Key Takeaways
- Stable Diffusion is free to download, with open weights, and runs locally on consumer hardware
- Latent diffusion operates in compressed space (48x smaller)
- SD 3.5 Large uses MMDiT (Transformer) instead of U-Net
- LoRA enables custom fine-tuning with ~100MB adapter files
- ComfyUI is the most powerful node-based interface
- SD 1.5 has the largest community and model ecosystem
- SDXL offers the best quality-to-speed ratio
- Use negative prompts to improve quality
- CFG Scale controls prompt adherence (7-12 recommended)
- VAE handles the latent-to-pixel conversion
Further Reading
- Rombach et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models"
- Esser et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (SD 3)
- Stability AI (2024). "Introducing Stable Diffusion 3.5" (release announcement)