Meta Llama Complete Guide — Open Source AI, Architecture, Fine-Tuning & Deployment
Llama (Large Language Model Meta AI) is Meta's family of open-weight transformer models. Unlike proprietary models from OpenAI or Anthropic, Llama weights can be downloaded and used for research and most commercial purposes under Meta's community license, which has enabled a global ecosystem of fine-tuning, deployment, and innovation.
The Open Source AI Revolution
Proprietary AI (Closed):
OpenAI, Anthropic, Google
-> API only, pay per token
-> Limited customization (provider-hosted fine-tuning at best)
-> Vendor lock-in
-> Weights and training details not published
Open Source AI (Open):
Meta Llama, Mistral, Alibaba Qwen
-> Download weights, run anywhere
-> Fine-tune for your use case
-> No vendor lock-in
-> Inspectable weights and architecture
-> Community-driven innovation
Llama Architecture Deep Dive
Transformer Architecture
Llama uses a decoder-only Transformer with several modern improvements:
Llama Architecture Components:
1. RoPE (Rotary Position Embedding)
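- Encodes position by rotating pairs of query/key dimensions (a rotation in the complex plane)
- Attention scores depend on relative position, which works better than learned absolute embeddings
- Context can be extended beyond the training length with frequency scaling, usually plus some long-context fine-tuning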
2. SwiGLU Activation
- Swish-Gated Linear Unit
- Tends to outperform ReLU/GELU feed-forward layers in LLMs at similar compute
- Computation: SwiGLU(x) = Swish(xW₁) ⊙ (xW₂)
3. Grouped Query Attention (GQA)
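- Groups of query heads share a single key-value head
- Shrinks the KV cache by the ratio of query heads to KV heads (8× with 64 query heads and 8 KV heads)
- Faster inference for long sequences, where KV-cache reads dominate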
4. RMSNorm
- Root Mean Square Layer Normalization
- Simpler than LayerNorm: no mean subtraction and no bias term
- Comparable quality in practice, with less computation
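The components listed above can be sketched in a few lines of dependency-free Python. This is a minimal illustration on plain lists, not Meta's implementation: the function names are hypothetical, `base=10000.0` is the value from the original RoPE formulation (an assumption here, since Llama models configure their own base), and `swiglu` takes the two projections `xW₁` and `xW₂` as already-computed inputs.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def swish(z):
    """Swish (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu(gate, up):
    """Elementwise Swish(gate) * up, where gate = xW1 and up = xW2."""
    return [swish(g) * u for g, u in zip(gate, up)]

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles that grow with position."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same offset (2 positions apart) at different absolute positions
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 7), rope(k, 5))
print(score_near, score_far)
```

The two printed scores agree up to floating-point error, which is the property that makes RoPE a relative position encoding: the query-key dot product depends only on the distance between the two positions.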
Llama 3 Architecture
Llama 3 70B Specifications:
-----------------------------
Parameters: 70 billion
Layers: 80
Attention heads: 64
KV heads: 8 (GQA)
Hidden dim: 8,192
FFN dim: 28,672
Vocabulary: 128,256 tokens
Context window: 128,000 tokens (Llama 3.1 and later; the original Llama 3 release supported 8,192)
Max output: 4,096 tokens
Training tokens: 15 trillion
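To make the effect of GQA concrete, the specifications above are enough to estimate the KV-cache size by hand. The sketch below is back-of-the-envelope arithmetic under stated assumptions: 16-bit cache values (2 bytes each), a head dimension of hidden dim divided by attention heads, and a hypothetical helper name `kv_cache_bytes`.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both K and V in every layer
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

head_dim = 8192 // 64  # hidden dim / attention heads = 128
seq_len = 8192

gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=head_dim, seq_len=seq_len)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=head_dim, seq_len=seq_len)

print(f"GQA (8 KV heads):  {gqa / 2**30:.1f} GiB")
print(f"MHA (64 KV heads): {mha / 2**30:.1f} GiB")
print(f"Reduction: {mha // gqa}x")
```

At an 8,192-token sequence this comes to 2.5 GiB with 8 KV heads versus 20 GiB if every query head had its own KV head, an 8× reduction per sequence.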
Architecture diagram:
+-----------------------------------------+
| Token Embedding (128K vocab) |
| (RoPE is applied to Q and K in attention) |
| | |
| v |
| +---------------------------------+ |
| | Transformer Layer ×80 | |
| | +-------------------------+ | |
| | | RMSNorm | | |
| | | + | | |
| | | GQA (64 Q, 8 KV heads) | | |
| | | + Residual | | |
| | +-------------------------+ | |
| | +-------------------------+ | |
| | | RMSNorm | | |
| | | + | | |
| | | SwiGLU FFN | | |
| | | + Residual | | |
| | +-------------------------+ | |
| +---------------------------------+ |
| | |
| v |
| Output Projection (128K) |
| -> Logits for next token |
+-----------------------------------------+
Model Lineup
Llama 4 Maverick — Latest Generation
| Specification | Details |
|---|---|
| Release | April 2025 |
| Total parameters | 400 billion |
| Active parameters | 17 billion (per token) |
| Experts | 128 routed + 1 shared (2 active per token: the shared expert plus 1 routed expert) |
| Context window | 1,000,000 tokens |
| Architecture | Mixture of Experts (MoE) |
| License | Llama 4 Community License |
| Training tokens | 30+ trillion |
Mixture of Experts Architecture:
Llama 4 Maverick MoE Design:
Input Token
|
v
+-----------------------------+
| Token Embedding |
+--------------+--------------+
|
v
+-----------------------------+
| Expert Router (Learned) |
| |
| Input -> Router -> Weights |
| [0.4, 0.3, 0.2, 0.1, ...] |
| |
| Select top-2 experts |
| (128 experts total) |
+--------------+--------------+
|
+--------+--------+
v v
+----------+ +----------+
| Expert 1 | | Expert 2 |
| (FFN) | | (FFN) |
+----+-----+ +----+-----+
+--------+--------+
v
+----------------+
| Weighted Sum |
| (0.4 × E1 + |
| 0.3 × E2) |
+-------+--------+
v
+----------------+
| Output |
+----------------+
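Active per token: ~17B (shared attention and embedding weights plus the selected expert FFNs)
Total model: ~400B (all expert FFNs plus the shared weights)
Routing: repeated independently in every MoE layer, not once per token
Efficiency: Per-token compute close to a 17B dense model, but all ~400B parameters must still fit in memory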
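The routing step in the diagram can be sketched as follows. This is a generic top-k mixture-of-experts illustration, not Meta's code: the function names and the toy "experts" (simple scaling functions standing in for FFN blocks) are hypothetical, and renormalizing the selected weights so they sum to 1 is an assumption that varies between MoE implementations.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Four toy experts that scale their input by 1, 2, 3 and 4
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.0, 2.0, 1.0, -1.0]  # in a real model these come from a learned router

picks = route_top_k(logits, k=2)
output = moe_layer([1.0], logits, experts, k=2)
print(picks, output)
```

Only the two selected experts are evaluated, which is where the compute saving comes from; the memory cost is unchanged because any expert may be selected for the next token.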
Llama 3.3 70B — Balanced Champion
| Specification | Details |
|---|---|
| Release | December 2024 |
| Parameters | 70 billion |
| Layers | 80 |
| Context window | 128,000 tokens |
| Training tokens | 15 trillion |
| License | Llama 3.3 Community License |
| Architecture | Dense Transformer |
Best for: most use cases that need a capable open-source model. It offers a strong balance of quality, speed, and cost.
Llama 3.1 405B — Maximum Open Source
| Specification | Details |
|---|---|
| Release | July 2024 |
| Parameters | 405 billion |
| Layers | 126 |
| Context window | 128,000 tokens |
| Training tokens | 15 trillion |
| License | Llama 3.1 Community License |
Best for: research, fine-tuning from a strong base model, and workloads that need the highest quality among dense open-source models.
Llama 3.1 8B — Lightweight Champion
| Specification | Details |
|---|---|
| Release | July 2024 |
| Parameters | 8 billion |
| Layers | 32 |
| Context window | 128,000 tokens |
| License | Llama 3.1 Community License |
Best for: Edge deployment, mobile, rapid inference, resource-constrained environments.
Licensing Deep Dive
Llama 3.3 Community License:
-----------------------------
✅ Commercial use allowed
✅ Fine-tuning allowed
✅ Deployment allowed
✅ Research use
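⚠️ Companies with more than 700 million monthly active users must request a separate license from Meta
⚠️ Models trained or improved with Llama outputs must include "Llama" at the start of their name
⚠️ Redistribution of weights allowed with conditions (include the license and "Built with Llama" attribution)
❌ No uses prohibited by the Acceptable Use Policy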
Comparison with other licenses:
-----------------------------
MIT License: Highly permissive, minimal conditions
Apache 2.0: Permissive + patent grant (used by many Mistral and Qwen releases)
Llama Community: Commercial use with a 700M MAU threshold plus naming and attribution terms
GPT-4: API only, no weights
Claude: API only, no weights
Self-Hosting Guide
Option 1: Ollama (Easiest)
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3.3 70B
ollama pull llama3.3:70b
ollama run llama3.3:70b
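# Hardware requirements for 70B (Ollama's default build is 4-bit quantized, roughly 40GB+):
# - GPU: 2× RTX 4090 (48GB VRAM total)
# - RAM: 64GB minimum
# - Storage: ~50GB SSD for the quantized model (~140GB for fp16 weights)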
# Run via API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.3:70b",
"prompt": "Explain quantum computing",
"stream": false
}'
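The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_generate_request` and `generate` are hypothetical, and reading the text from the `response` field assumes Ollama's non-streaming reply format.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send the request and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

request = build_generate_request("llama3.3:70b", "Explain quantum computing")
print(request.get_method(), request.full_url)
```

Calling `generate("llama3.3:70b", "Explain quantum computing")` performs the actual network call; building the request separately keeps the payload easy to inspect and test without a server.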
Option 2: vLLM (Production)
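from vllm import LLM, SamplingParams
# High-throughput serving
# Note: 70B weights in bf16 take ~140GB, so this assumes two 80GB-class GPUs
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,      # Split across 2 GPUs
    max_model_len=8192,          # Cap context length to limit KV-cache memory
    gpu_memory_utilization=0.9,  # Fraction of each GPU's memory vLLM may use
)
# Generate (generate() sends the raw prompt; llm.chat() applies the chat template)
params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)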
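Whether `tensor_parallel_size=2` is enough depends mostly on weight precision. The arithmetic below is a rough sketch with a hypothetical helper name; it counts weights only and ignores KV cache, activations, and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8

for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    total = weight_memory_gb(70, bits)
    per_gpu = total / 2  # weights split evenly across 2 GPUs
    print(f"{label:>9}: {total:6.1f} GB total, {per_gpu:5.1f} GB per GPU")
```

At bf16 each of the two GPUs has to hold about 70 GB of weights, which points to 80GB-class accelerators; a pair of 24GB consumer cards only fits the model at roughly 4-bit precision.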
Option 3: Hugging Face Transformers
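from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bf16 weights take ~140GB; device_map="auto" shards them across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
# add_generation_prompt=True appends the assistant header so the model starts its reply
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))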
Fine-Tuning Guide
LoRA (Low-Rank Adaptation)
LoRA is one of the most widely used parameter-efficient ways to fine-tune large models:
LoRA Concept:
Full Fine-Tuning:
Original weights W (70B params) -> Updated weights W' (70B params)
Storage: 140GB (fp16)
Computation: Full backward pass through all params
LoRA Fine-Tuning:
Original weights W (70B params) -> FROZEN (not updated)
Low-rank matrices A, B -> TRAINED (updated)
W' = W + A × B
Where:
A: (d, r) matrix <- r << d (e.g., r=16)
B: (r, d) matrix
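Trainable params: 2 × d × r per adapted matrix (e.g., 2 × 4096 × 16 = 131K)
Storage: ~100MB (LoRA adapters)
Computation: Gradients still flow through the frozen layers, but weight gradients and optimizer state exist only for A and B
Speed: Far lower memory use; the wall-clock speedup is usually more modest and depends on the setup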
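The update rule W' = W + A × B can be checked on a toy example. The sketch below uses plain nested lists and hypothetical helper names; the `scale` argument stands in for the scaling factor that LoRA implementations apply to the update (an assumption about how it is wired in), and real adapters operate on framework tensors rather than lists.

```python
def matmul(a, b):
    """Multiply an (n x r) matrix by an (r x m) matrix given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_weight(W, A, B, scale=1.0):
    """Effective weight W' = W + scale * (A x B); W itself is never modified."""
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# d = 2, r = 1: the frozen weight is 2x2, the adapters are 2x1 and 1x2
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]
B = [[3.0, 4.0]]

merged = lora_weight(W, A, B)
trainable = len(A) * len(A[0]) + len(B) * len(B[0])  # 2 * d * r values
print(merged, trainable)
```

If one of the two factors starts at zero, as is typical at initialization, then W' equals W and training begins from the unmodified base model.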
Training Example
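import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
# Load base model in bf16 (~140GB) and shard it across available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # Rank
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config: roughly 33M trainable params out of ~70B (about 0.05%)
# Train; train_dataset is a placeholder for your own tokenized dataset
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./lora-out", per_device_train_batch_size=1),
    train_dataset=train_dataset,
)
trainer.train()
# Save only LoRA adapters (~100MB)
model.save_pretrained("./lora-adapter")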
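For the configuration above (r=16 on `q_proj` and `v_proj`), the trainable-parameter count can be estimated by hand from the 70B specifications listed earlier. This is a sketch under the assumption that each adapted projection gets one pair of low-rank matrices and that `v_proj` maps the hidden dimension to the 8 KV heads; the helper name is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one adapter pair: (d_in x r) plus (r x d_out)."""
    return r * (d_in + d_out)

hidden, layers, heads, kv_heads, rank = 8192, 80, 64, 8, 16
head_dim = hidden // heads  # 128

q_proj = lora_params(hidden, heads * head_dim, rank)     # 8192 -> 8192
v_proj = lora_params(hidden, kv_heads * head_dim, rank)  # 8192 -> 1024 under GQA
total = layers * (q_proj + v_proj)

print(f"per layer: {q_proj + v_proj:,}")
print(f"total:     {total:,}")
print(f"fp16 size: {total * 2 / 1e6:.0f} MB")
```

That works out to about 32.8 million trainable parameters, or roughly 0.05% of a 70B model, which is consistent with an adapter file on the order of 100MB depending on the precision it is saved in.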
Deployment Comparison
| Method | Ease | Performance | Cost | Best For |
|---|---|---|---|---|
| Ollama | Easy | Good | Free software, own hardware | Local dev |
| vLLM | Medium | Excellent | Free software, own hardware | Production |
| HuggingFace | Medium | Good | Free software, own hardware | Research |
| Together.ai | Easy | Excellent | Pay | No GPU |
| Fireworks.ai | Easy | Excellent | Pay | No GPU |
Use Cases
| Use Case | Recommended Model | Why |
|---|---|---|
| Research | Llama 3.1 405B | Maximum capability |
| Production API | Llama 3.3 70B | Best balance |
| Edge/Mobile | Llama 3.1 8B | Small, fast |
| Fine-tuning | Llama 3.1 405B base | Best base model |
| Cost-sensitive | Llama 3.3 70B | No per-token fees when self-hosted |
| Privacy-critical | Any Llama | Run locally |
| Custom domain | Llama 3.3 70B + LoRA | Domain adaptation |
Key Takeaways
- Llama weights are free to download under the community license and can run on your own hardware
- Llama 4 Maverick uses MoE (400B total, 17B active) for efficiency
- Llama 3.3 70B is a strong all-around choice among open-source models
- Use Ollama for easy local deployment
- Use vLLM for high-throughput production serving
- LoRA enables fine-tuning with ~100MB adapter files
- Run Llama locally for complete privacy and control
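- The community license allows commercial use; companies above 700M monthly active users need a separate license from Meta
- Llama 4 Maverick's 1M-token context window is comparable to Gemini's long-context models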
- Open-source models enable community innovation and transparency
Further Reading
- Touvron et al. (2023). "LLaMA: Open and Efficient Foundation Language Models"
- Llama Team, AI @ Meta (2024). "The Llama 3 Herd of Models"
- Meta AI (2025). "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation" (blog post)
- Dettmers et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs"