Stable Diffusion Complete Guide — Diffusion Architecture, SDXL, SD 3.5, LoRA & ComfyUI

By ChatWhole Team | 2025-01-20

Stable Diffusion is one of the most widely used open-source AI image generators. Unlike proprietary alternatives, it can run locally on consumer hardware, and it has spawned a large ecosystem of custom models, LoRA adapters, and community tools.


The Theory of Diffusion Models

What is Diffusion?

Diffusion models generate images by learning to reverse a noise process. The core insight: if you gradually add noise to an image until it becomes pure static, you can train a neural network to reverse that process.

Architecture Diagram
Forward Process (Training Data Creation):
------------------------------------------
Clean Image -> Add Noise -> Add Noise -> ... -> Pure Noise
   🐱           😺          😶            📺
   t=0          t=1         t=2           t=T

This is a MARKOV CHAIN:
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I)

Where:
x_t = noisy image at timestep t
β_t = noise schedule (controls how much noise added)
N = Gaussian distribution

Reverse Process (Generation):
------------------------------------------
Pure Noise -> Remove Noise -> Remove Noise -> ... -> Clean Image
   📺           🫥             😶             🐱
   t=T          t=T-1          t=T-2           t=0

This is what the MODEL LEARNS:
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

The model predicts the mean (μ) and variance (Σ) of the
slightly less noisy image x_{t-1} given the noisy image x_t.
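
The forward Markov chain q(x_t | x_{t-1}) above can be simulated directly. The following is a minimal NumPy sketch, not code from Stable Diffusion itself: the linear β schedule, the number of steps T, and the constant-valued "image" are all assumptions chosen for illustration.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule

x0 = np.full((64, 64), 5.0)          # hypothetical stand-in for a clean image
x = x0.copy()
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# After T steps the original signal is almost entirely replaced by unit Gaussian noise.
print(x.mean(), x.std())
```

With this schedule the mean should end up close to 0 and the standard deviation close to 1, which is the "pure noise" end of the chain.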

The Math Behind Diffusion

Training Objective (simplified):

L = E[||ε - ε_θ(x_t, t)||²]

Where:
ε = actual noise added
ε_θ = model's prediction of the noise
x_t = noisy image at timestep t
t = timestep

The model learns to predict the NOISE that was added,
then subtracts it to recover the clean image.

In practice:
1. Take clean image x₀
2. Sample random noise ε ~ N(0, I)
3. Sample random timestep t
4. Create noisy image: x_t = √(ᾱ_t) x₀ + √(1-ᾱ_t) ε
5. Train model to predict ε from x_t and t
6. Loss = ||ε - ε_θ(x_t, t)||²
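
The six steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the linear β schedule is assumed, ᾱ_t is taken to be the cumulative product of (1 - β_s), `toy_model` is a hypothetical placeholder for the trained network ε_θ, and the loss is averaged over elements rather than summed.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)     # alpha-bar_t = product of (1 - beta_s) for s <= t

def add_noise(x0, t, eps):
    """Step 4: jump straight to timestep t in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def toy_model(x_t, t):
    """Hypothetical placeholder for eps_theta; a real model is a trained network."""
    return np.zeros_like(x_t)

x0 = rng.uniform(-1.0, 1.0, size=(4, 64, 64))    # step 1: stand-in for a clean latent
eps = rng.standard_normal(x0.shape)              # step 2: sample noise
t = int(rng.integers(0, T))                      # step 3: sample a timestep
x_t = add_noise(x0, t, eps)                      # step 4: create the noisy input
loss = np.mean((eps - toy_model(x_t, t)) ** 2)   # steps 5-6: mean squared error on the noise
print(t, loss)
```

Because the placeholder always predicts zero, the loss is simply the mean of ε², roughly 1. Training a real model drives this value down.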

Stable Diffusion Architecture

Latent Diffusion Model (LDM)

Stable Diffusion operates in latent space, not pixel space. This is its key innovation:

Architecture Diagram
Pixel Space Diffusion (DALL-E 2):
Image (512×512×3 = 786,432 dimensions)
-> Diffusion in pixel space
-> Very slow, very expensive

Latent Space Diffusion (Stable Diffusion):
Image (512×512×3) -> VAE Encoder -> Latent (64×64×4 = 16,384 dims)
-> Diffusion in latent space (48x smaller!)
-> Much faster, much cheaper
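
The 48x figure follows from the tensor shapes alone. A short check, assuming the 8x-per-side downsampling and 4 latent channels shown above:

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

pixel_dims = num_values(512, 512, 3)
latent_dims = num_values(512 // 8, 512 // 8, 4)   # VAE shrinks each side by 8, keeps 4 channels
ratio = pixel_dims / latent_dims

print(pixel_dims, latent_dims, ratio)   # 786432 16384 48.0
```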

+---------------------------------------------------------+
|           Stable Diffusion Architecture                  |
|                                                          |
|  Text Prompt: "A cat in space"                          |
|       |                                                  |
|       v                                                  |
|  +--------------+                                       |
|  |  CLIP Text   | <- Text understanding                  |
|  |  Encoder     |   (frozen, pretrained)                |
|  +------+-------+                                       |
|         | text embeddings                               |
|         v                                               |
|  +------------------------------------------+          |
|  |  U-Net Denoiser                          |          |
|  |  (Processes in LATENT space, not pixels) |          |
|  |                                          |          |
|  |  Input: Noisy latent (64×64×4)          |          |
|  |  + Timestep embedding                   |          |
|  |  + Text conditioning (cross-attention)  |          |
|  |                                          |          |
|  |  Output: Predicted noise (64×64×4)      |          |
|  +--------------+---------------------------+          |
|                 |                                       |
|                 v                                       |
|  +------------------+                                  |
|  |  Scheduler       | <- DDPM, DDIM, Euler, etc.       |
|  |  (Removes noise) |                                  |
|  +------+-----------+                                  |
|         | clean latent                                  |
|         v                                               |
|  +------------------+                                  |
|  |  VAE Decoder     | <- Latent -> Pixel                 |
|  |  (64×64 -> 512×512)|                                 |
|  +------+-----------+                                  |
|         |                                               |
|         v                                               |
|  Generated Image (512×512×3)                           |
+---------------------------------------------------------+
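
To make the Scheduler box concrete, here is a minimal sketch of a deterministic DDIM-style update loop in NumPy. It is not the diffusers implementation. The linear β schedule and the 20-step spacing are assumptions, and `oracle_eps` is a hypothetical stand-in for the U-Net that "cheats" by knowing the clean latent, so the loop can be checked without a trained model. The text encoder and VAE are left out.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, t, t_prev):
    """Deterministic DDIM update from timestep t to t_prev (t_prev < 0 means fully clean)."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps_pred

rng = np.random.default_rng(0)
x0_true = rng.uniform(-1.0, 1.0, size=(4, 64, 64))   # hypothetical clean latent

def oracle_eps(x_t, t):
    """Stand-in for the U-Net: returns the exact noise, possible only because x0_true is known."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1.0 - alpha_bars[t])

timesteps = list(range(T - 1, -1, -50))              # 999, 949, ..., 49 (20 steps)
latent = rng.standard_normal(x0_true.shape)          # start from pure noise
for i, t in enumerate(timesteps):
    t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else -1
    latent = ddim_step(latent, oracle_eps(latent, t), t, t_prev)

print(np.abs(latent - x0_true).max())
```

With a perfect noise predictor the loop lands on the clean latent. In the real pipeline the U-Net's prediction is only approximate, and the result is then passed to the VAE decoder.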

U-Net Architecture

The U-Net is the core denoising network:

Architecture Diagram
U-Net Structure:

Input (64×64×4 latent + timestep + text embeddings)
    |
    v
+-------------------------------------------------+
|  Encoder Path (Downsampling)                     |
|                                                  |
|  (64×64) ResBlock -> Attention -> Down           |
|  (32×32) ResBlock -> Attention -> Down           |
|  (16×16) ResBlock -> Attention -> Down           |
|  (8×8)   ResBlock (no attention)                 |
|                                                  |
|  Middle: (8×8) ResBlock -> Attention -> ResBlock |
|                                                  |
|  Decoder Path (Upsampling)                       |
|                                                  |
|  (8×8)   ResBlock + Skip -> Up                   |
|  (16×16) ResBlock + Skip -> Attention -> Up      |
|  (32×32) ResBlock + Skip -> Attention -> Up      |
|  (64×64) ResBlock + Skip -> Attention            |
+-------------------------------------------------+
    |
    v
Output (64×64×4 predicted noise)

Key: SKIP CONNECTIONS preserve fine details
Text conditioning via CROSS-ATTENTION at most resolutions
(in SD 1.5, every level except the 8×8 encoder/decoder blocks)
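
Cross-attention is what lets each latent position "look at" the prompt: queries come from the image features, while keys and values come from the text embeddings. The sketch below is a single-head NumPy illustration, not the actual U-Net code. The feature-map size, channel counts, head width, and random projection matrices are assumptions; only the 77×768 text shape reflects the CLIP ViT-L/14 encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_emb, w_q, w_k, w_v):
    """Single-head cross-attention: image positions attend over text tokens."""
    q = image_feats @ w_q                     # (num_pixels, d_head)
    k = text_emb @ w_k                        # (num_tokens, d_head)
    v = text_emb @ w_v                        # (num_tokens, d_head)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (num_pixels, num_tokens)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_pixels, c_img = 16 * 16, 320    # flattened 16×16 feature map; channel count is illustrative
num_tokens, c_txt = 77, 768         # CLIP ViT-L/14 text embeddings
d_head = 64                         # assumed head width

image_feats = rng.standard_normal((num_pixels, c_img))
text_emb = rng.standard_normal((num_tokens, c_txt))
w_q = rng.standard_normal((c_img, d_head)) / np.sqrt(c_img)
w_k = rng.standard_normal((c_txt, d_head)) / np.sqrt(c_txt)
w_v = rng.standard_normal((c_txt, d_head)) / np.sqrt(c_txt)

out, weights = cross_attention(image_feats, text_emb, w_q, w_k, w_v)
print(out.shape, weights.shape)     # (256, 64) (256, 77)
```

Each row of `weights` is a probability distribution over the 77 prompt tokens for one spatial position.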

Model Versions

Stable Diffusion 1.5 — The Classic

Specification       Details
Release             October 2022
Parameters          ~860 million
Architecture        Latent Diffusion (U-Net)
Resolution          512×512
Text Encoder        CLIP ViT-L/14
VAE                 KL-f8 (48x compression)
License             CreativeML Open RAIL-M
Community models    10,000+ fine-tuned variants

Why it's still popular: Massive community, most LoRA models available, fastest generation.


SDXL — Quality Leap

Specification       Details
Release             July 2023
Parameters          ~2.6 billion (U-Net); ~3.5 billion for the full base model
Architecture        Two-stage (Base + optional Refiner)
Resolution          1024×1024
Text Encoder        CLIP ViT-L/14 + OpenCLIP ViT-bigG/14
License             CreativeML Open RAIL++-M

Two-Stage Architecture:

Architecture Diagram
SDXL Pipeline:

Stage 1: Base Model (3.5B params)
+----------------------------+
| Generate initial image     |
| (128×128 latent)           |
| Coarse structure + colors  |
+------------+---------------+
             |
             v
Stage 2: Refiner Model (base + refiner together: 6.6B params)
+----------------------------+
| Refine details             |
| Fine textures              |
| Sharp edges                |
| Final quality              |
+------------+---------------+
             |
             v
Final Image (1024×1024)

The refiner is a separate, optional model specialized for
the final low-noise denoising steps.
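
One common way to run the two stages is to give the base model the first, noisiest part of the schedule and hand the remaining steps to the refiner. The helper below is a plain-Python sketch of that split. The function name and the 0.8 handoff fraction are assumptions for illustration; diffusers exposes a similar idea through the `denoising_end` and `denoising_start` pipeline arguments.

```python
def split_steps(num_steps, handoff=0.8):
    """Return (base_steps, refiner_steps) as lists of step indices, where 0 is the noisiest step."""
    cutoff = round(num_steps * handoff)
    steps = list(range(num_steps))
    return steps[:cutoff], steps[cutoff:]

base_steps, refiner_steps = split_steps(40, handoff=0.8)
print(len(base_steps), len(refiner_steps))   # 32 8
```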

SD 3.5 Large — Latest Generation

Specification       Details
Release             October 2024
Parameters          ~8 billion
Architecture        Multimodal Diffusion Transformer (MMDiT)
Resolution          ~1 megapixel (e.g., 1024×1024)
Text Encoder        CLIP ViT-L + OpenCLIP ViT-bigG + T5-XXL
License             Stability AI Community

MMDiT Architecture — A paradigm shift from U-Net:

Architecture Diagram
MMDiT (Multimodal Diffusion Transformer):

Instead of U-Net, SD 3.5 uses TRANSFORMER blocks:

+---------------------------------------------+
|  Input: Noisy latent patches (16×16 = 256)  |
|  + Text token embeddings                    |
|                                              |
|  +-------------------------------------+    |
|  |  MMDiT Block (×38 layers)           |    |
|  |                                      |    |
|  |  +----------+    +----------+       |    |
|  |  |Image     |    |Text      |       |    |
|  |  |Patches   |◄--►|Tokens    |       |    |
|  |  |(256)     |    |(77 max)  |       |    |
|  |  +----------+    +----------+       |    |
|  |       |              |              |    |
|  |       v              v              |    |
|  |  Joint Self-Attention + FFN        |    |
|  |       |              |              |    |
|  |       v              v              |    |
|  |  Updated Image   Updated Text      |    |
|  |  Patches         Embeddings        |    |
|  +-------------------------------------+    |
|                                              |
|  Output: Denoised latent patches             |
+---------------------------------------------+

Why Transformer instead of U-Net?
- Better at long-range dependencies
- Scales better with resolution
- Quality improves predictably as model size and compute grow
- Better text-image alignment
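
The core idea of the MMDiT block, image patches and text tokens attending to each other in one joint sequence, can be sketched in NumPy. This is a simplified illustration, not the SD 3.5 implementation: the toy latent size, the patch size of 2, and the shared token width are assumptions, and the per-modality projection weights, multiple heads, and FFN of the real block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patchify(latent, patch=2):
    """Split a (C, H, W) latent into (H/patch * W/patch) flattened patch tokens."""
    c, h, w = latent.shape
    x = latent.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)                    # (h/p, w/p, c, p, p)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

def joint_attention(img_tokens, txt_tokens):
    """Self-attention over the concatenated image + text sequence, then split back."""
    seq = np.concatenate([img_tokens, txt_tokens], axis=0)
    weights = softmax(seq @ seq.T / np.sqrt(seq.shape[-1]))
    out = weights @ seq
    n_img = img_tokens.shape[0]
    return out[:n_img], out[n_img:]

rng = np.random.default_rng(0)
latent = rng.standard_normal((16, 32, 32))            # toy latent: 16 channels, 32×32
img_tokens = patchify(latent)                          # 16×16 = 256 patches, 64 values each
txt_tokens = rng.standard_normal((77, 64))             # hypothetical text tokens at the same width

img_out, txt_out = joint_attention(img_tokens, txt_tokens)
print(img_tokens.shape, img_out.shape, txt_out.shape)  # (256, 64) (256, 64) (77, 64)
```

Unlike U-Net cross-attention, the information flows both ways here: the text tokens are updated by the image patches as well.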

LoRA Fine-Tuning Deep Dive

What is LoRA?

LoRA (Low-Rank Adaptation) enables custom model fine-tuning with minimal storage:

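Full Fine-Tuning:
W (d×d matrix) -> W' (d×d matrix)
Every weight is updated and stored, so each fine-tune is a full-size
checkpoint (e.g., a 70B-parameter LLM × 2 bytes = 140GB; several GB
for a Stable Diffusion model)
Full backward pass through all params

LoRA:
W (d×d) -> FROZEN
A (d×r) -> TRAINED
B (r×d) -> TRAINED
W' = W + A × B

Where r << d (e.g., r=16, d=4096)
Trainable params per matrix: 2 × d × r = 131,072 (vs ~16.8M for full)
Storage: an adapter file of roughly 100MB or less (vs a full-size checkpoint)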
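
The parameter arithmetic above, and the way the low-rank update is applied, can be checked with a small NumPy sketch. It follows the notation used here (A is d×r, B is r×d, W' = W + A × B). The random initialization, the zero-initialized B, and the `scale` argument are assumptions for illustration.

```python
import numpy as np

d, r = 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d), dtype=np.float32)            # frozen base weight
A = rng.standard_normal((d, r), dtype=np.float32) * 0.01     # trained
B = np.zeros((r, d), dtype=np.float32)                       # trained; zero init so W' == W at the start

def lora_forward(x, W, A, B, scale=1.0):
    """Compute x @ (W + scale * A @ B) without building the d×d update matrix."""
    return x @ W + scale * ((x @ A) @ B)

trainable = A.size + B.size
full = W.size
print(trainable, full, full // trainable)   # 131072 16777216 128
```

Computing `(x @ A) @ B` keeps the extra cost proportional to r rather than d, which is why small ranks train quickly.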

LoRA Training Example

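from diffusers import StableDiffusionXLPipeline
from diffusers.utils import convert_state_dict_to_diffusers
from peft import LoraConfig
from peft.utils import get_peft_model_state_dict

# Load the SDXL base model (an SDXL checkpoint needs the XL pipeline class)
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLPipeline.from_pretrained(model_id)

# Freeze the base U-Net; only the LoRA matrices added below are trained
pipe.unet.requires_grad_(False)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # Rank
    lora_alpha=16,                 # Scaling (effective scale = lora_alpha / r)
    target_modules=[
        "to_q", "to_k", "to_v",    # Attention projections
        "to_out.0",
    ],
    lora_dropout=0.05,
)

# Apply LoRA to the U-Net
pipe.unet.add_adapter(lora_config)

# Train on your dataset
# ... training loop ...

# Save only the LoRA weights (unet.save_pretrained would write the full U-Net)
lora_state_dict = convert_state_dict_to_diffusers(get_peft_model_state_dict(pipe.unet))
StableDiffusionXLPipeline.save_lora_weights(
    save_directory="./my-lora-adapter",
    unet_lora_layers=lora_state_dict,
)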

ComfyUI Node-Based Workflows

📦 Load Checkpoint (SD model weights)
   -> 📝 CLIP Text Encode (Prompt + Negative)
   -> 🎨 KSampler (Denoise steps)
   -> 🖼️ VAE Decode
   -> 💾 Save Image
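
The same node chain can be written down as data. The sketch below builds a dictionary in the style of ComfyUI's API-format workflow export, where each input link is `[source_node_id, output_index]`, and then derives a valid execution order. Treat the details as assumptions: the checkpoint filename, prompts, and sampler settings are hypothetical, and an Empty Latent Image node is added because the sampler needs a starting latent.

```python
import json

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},   # hypothetical filename
    "2": {"class_type": "CLIPTextEncode",                            # positive prompt
          "inputs": {"text": "A cat in space", "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",                            # negative prompt
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                     "latent_image": ["4", 0], "seed": 0, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "sd_guide"}},
}

def links(node):
    """IDs of the nodes this node reads from."""
    return [v[0] for v in node["inputs"].values() if isinstance(v, list)]

def execution_order(graph):
    """Depth-first topological sort: every node appears after the nodes it reads from."""
    order, seen = [], set()
    def visit(node_id):
        if node_id in seen:
            return
        seen.add(node_id)
        for src in links(graph[node_id]):
            visit(src)
        order.append(node_id)
    for node_id in graph:
        visit(node_id)
    return order

order = [workflow[n]["class_type"] for n in execution_order(workflow)]
print(" -> ".join(order))
payload = json.dumps({"prompt": workflow})   # assumed request shape for queueing via the HTTP API
```

Describing the graph as plain data is what makes ComfyUI workflows easy to share, version, and generate from scripts.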


System Requirements

Component    SD 1.5      SDXL        SD 3.5
GPU VRAM     4GB         8GB         16GB+
RAM          8GB         16GB        32GB+
Storage      5GB         15GB        30GB+
GPU (min)    GTX 1660    RTX 3060    RTX 4090
GPU (rec)    RTX 3060    RTX 4090    A100 80GB
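
A rough way to sanity-check the VRAM column is to multiply the parameter count by the bytes per parameter. The sketch below uses the approximate parameter counts quoted in this guide. It is a lower bound only: it ignores text encoders, the VAE, activations, and framework overhead, and it does not account for offloading or quantization, which can reduce the requirement.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

def weight_size_gb(num_params, dtype="fp16"):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Approximate denoiser parameter counts as quoted in this guide
param_counts = {"SD 1.5": 860e6, "SDXL": 2.6e9, "SD 3.5 Large": 8e9}

for name, n in param_counts.items():
    print(f"{name}: ~{weight_size_gb(n):.1f} GB in fp16")
```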

Key Takeaways

  1. Stable Diffusion is free and open-source — run anywhere
  2. Latent diffusion operates in compressed space (48x smaller)
  3. SD 3.5 Large uses MMDiT (Transformer) instead of U-Net
  4. LoRA enables custom fine-tuning with ~100MB adapter files
  5. ComfyUI is the most powerful node-based interface
  6. SD 1.5 has the largest community and model ecosystem
  7. SDXL offers the best quality-to-speed ratio
  8. Use negative prompts to improve quality
  9. CFG Scale controls prompt adherence (7-12 recommended)
  10. VAE handles the latent-to-pixel conversion
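
Takeaways 8 and 9 refer to classifier-free guidance (CFG), which is not derived in this guide. As a brief sketch of the standard formula: at each step the model makes two noise predictions, one with the prompt and one with the empty or negative prompt, and the CFG scale controls how far the result is pushed toward the prompted one. The random arrays below are stand-ins for real model outputs.

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, cfg_scale):
    """Classifier-free guidance: move away from the unconditional prediction toward the conditional one."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 64, 64))   # stand-in: prediction for the empty/negative prompt
eps_cond = rng.standard_normal((4, 64, 64))     # stand-in: prediction for the text prompt

guided = apply_cfg(eps_uncond, eps_cond, cfg_scale=7.5)
print(guided.shape)
```

A scale of 1 reproduces the conditional prediction unchanged. Larger values follow the prompt more strongly, at the risk of oversaturated or distorted images.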

Further Reading

  • Rombach et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models"
  • Esser et al. (2024). "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (SD 3)
  • Stability AI (2024). "SD 3.5 Technical Report"
