ChatGPT Complete Guide — Architecture, Models, Training & How to Use It
ChatGPT is one of the most widely used AI chatbots, built by OpenAI on the GPT (Generative Pre-trained Transformer) family of large language models. This guide covers the main model variants, the Transformer architecture behind them, the training methodology, and when to use each version. OpenAI has not published architecture details for GPT-4 or later models, so figures marked as estimated come from unofficial reports.
Table of Contents
- What is ChatGPT?
- The Transformer Architecture
- GPT Model Evolution
- GPT-4o — Current Flagship
- GPT-4 Turbo — Maximum Intelligence
- GPT-4o Mini — Efficiency Champion
- o1 / o3 — Reasoning Models
- Training Methodology
- Use Cases and Selection Guide
- API Deep Dive
- Pricing Economics
- Ethical Considerations
What is ChatGPT?
ChatGPT (Chat Generative Pre-trained Transformer) is a conversational AI assistant powered by Large Language Models (LLMs) developed by OpenAI. It uses the Transformer architecture to understand and generate human-like text through autoregressive prediction.
Historical Context
The GPT lineage represents one of the most significant advances in artificial intelligence:
Timeline of GPT Models:
-------------------------------------------------------------
2018 GPT-1 117M params First generative pre-training
2019 GPT-2 1.5B params "Too dangerous to release"
2020 GPT-3 175B params In-context learning revolution
2022 ChatGPT GPT-3.5 RLHF alignment breakthrough
2023 GPT-4 ~1.8T MoE (est.) Multimodal, reasoning leap
2024 GPT-4o ~1.8T MoE (est.) Omni-modal, 2x speed
2024 o1 Unknown Chain-of-thought reasoning
2025 o3 Unknown Successor to o1
2025 GPT-4.1 Unknown 1M context, latest knowledge
-------------------------------------------------------------
Fundamental Capabilities
ChatGPT operates as a next-token predictor. Given a sequence of tokens, it computes a probability distribution over the next token, then picks or samples one token from that distribution:
P(token_t | token_1, token_2, ..., token_{t-1})
Example:
Input: "The capital of France is"
Output: "Paris" (with highest probability)
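Internally (one forward pass over the whole prompt):
position 1 "The" -> logits over the vocabulary for position 2
position 2 "capital" -> logits for position 3
position 3 "of" -> logits for position 4
position 4 "France" -> logits for position 5
position 5 "is" -> logits for position 6: "Paris" scores highest
Each set of logits covers the full vocabulary (roughly 100K to 200K tokens, depending on the tokenizer). Only the logits at the last position are used to pick the next token.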
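The step from logits to a chosen token can be sketched in a few lines. This is a minimal illustration with a four-word toy vocabulary and made-up logit values (both are assumptions, not real model output); it applies softmax and then greedy selection.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits for the position after "The capital of France is".
toy_logits = {"Paris": 9.0, "Lyon": 5.5, "London": 4.0, "the": 2.0}

probs = softmax(toy_logits)
next_token = max(probs, key=probs.get)  # greedy decoding: take the most likely token
print(next_token, round(probs[next_token], 3))
```

Sampling from `probs` instead of taking the maximum is what makes outputs vary between runs; the temperature setting rescales the logits before the softmax.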
The Transformer Architecture
Every GPT model is built on the Transformer architecture, introduced by Vaswani et al. in 2017 ("Attention Is All You Need"). Understanding this architecture is essential for understanding ChatGPT.
Self-Attention Mechanism
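The core innovation is self-attention, which lets each token attend to other tokens in the sequence. In GPT models, a causal mask restricts each token to itself and the tokens before it, so the model cannot look ahead at the text it is about to predict: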
Self-Attention Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where:
Q = Query matrix (what am I looking for?)
K = Key matrix (what do I contain?)
V = Value matrix (what information do I provide?)
d_k = Key dimension (for scaling)
Intuition:
Each token creates a Query ("What context do I need?")
Each token creates a Key ("What context can I provide?")
Each token creates a Value ("What information to pass on")
Query × Key = Attention scores (how much to attend to each token)
Attention scores × Value = Output (weighted combination of information)
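The formula above can be traced by hand on a tiny example. The following is a minimal pure-Python sketch of scaled dot-product attention with a causal mask; the three toy tokens and their 2-dimensional vectors are made up for illustration, and real models use learned projections and much larger dimensions.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a causal mask, on lists of lists."""
    d_k = len(K[0])
    output, all_weights = [], []
    for i, q in enumerate(Q):
        # Token i may only look at positions 0..i (itself and earlier tokens).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        mixed = [sum(w * V[j][c] for j, w in enumerate(weights))
                 for c in range(len(V[0]))]
        output.append(mixed)
        all_weights.append(weights)
    return output, all_weights

# Three toy tokens with 2-dimensional queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = causal_attention(Q, K, V)
for i, row in enumerate(weights):
    print(f"token {i} attends over {len(row)} position(s):", [round(w, 2) for w in row])
```

The first token can only attend to itself, so its output is exactly its own value vector; later tokens blend the values of everything before them.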
Multi-Head Attention
Instead of computing attention once, Transformers compute it multiple times in parallel (multi-head attention):
Multi-Head Attention:
Head 1: Attention(Q₁, K₁, V₁) -> learns syntax
Head 2: Attention(Q₂, K₂, V₂) -> learns semantics
Head 3: Attention(Q₃, K₃, V₃) -> learns long-range deps
...
Head h: Attention(Qₕ, Kₕ, Vₕ) -> learns other patterns
Output = Concat(head₁, ..., headₕ) × W^O
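GPT-3 uses 96 attention heads per layer. The figure for GPT-4 is undisclosed (one unofficial estimate is 128). The head roles shown above are illustrative: what each head learns emerges during training.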
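To make the split-and-concatenate step concrete, here is a simplified sketch. It assumes Q = K = V = the input slice for each head and omits the learned projections (W^Q, W^K, W^V per head and W^O after concatenation), so it shows the data flow rather than a trained layer.

```python
import math

def attention(Q, K, V):
    """Plain scaled dot-product attention (no mask) on lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each d_model-wide row into num_heads chunks of width d_model / num_heads."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_attention(X, num_heads):
    # Simplification: each head attends over its own slice of X directly.
    heads = [attention(Xh, Xh, Xh) for Xh in split_heads(X, num_heads)]
    # Concat(head_1, ..., head_h): join the per-head outputs row by row.
    return [sum((head[i] for head in heads), []) for i in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5]]   # 2 tokens, d_model = 4
Y = multi_head_attention(X, num_heads=2)
print(len(Y), len(Y[0]))      # same shape as the input: 2 tokens x d_model
```

Because each head works on a narrower slice, running h heads costs about the same as one full-width attention while letting the heads specialize independently.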
Transformer Block Structure
+---------------------------------------------+
| Transformer Block |
| |
| Input Embeddings |
| | |
| v |
| +---------------------------------+ |
| | Multi-Head Self-Attention | |
| | + Residual Connection | |
| | + Layer Normalization | |
| +---------------+-----------------+ |
| | |
| v |
| +---------------------------------+ |
| | Feed-Forward Network | |
| | (4 × d_model internal dim) | |
| | + Residual Connection | |
| | + Layer Normalization | |
| +---------------+-----------------+ |
| | |
| v |
| Output Embeddings |
+---------------------------------------------+
Stacked N times:
GPT-3: 96 layers
GPT-4: ~120 layers (estimated)
GPT-4o: ~120 layers (estimated)
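A quick sanity check ties the block structure to the headline parameter count. The sketch below assumes the standard layout from Vaswani et al. (four d_model × d_model attention matrices plus a feed-forward network with a 4 × d_model hidden width) and uses GPT-3's published d_model of 12,288; biases, layer norms, and embeddings are ignored.

```python
def approx_block_params(d_model, ffn_multiplier=4):
    """Weight count of one Transformer block, ignoring biases and LayerNorm."""
    attention = 4 * d_model * d_model                      # W^Q, W^K, W^V, W^O
    feed_forward = 2 * ffn_multiplier * d_model * d_model  # up- and down-projection
    return attention + feed_forward

# GPT-3 figures from Brown et al. (2020): 96 layers, d_model = 12,288.
n_layers, d_model = 96, 12288
total = n_layers * approx_block_params(d_model)
print(f"{total / 1e9:.1f}B parameters in the blocks")
```

The result lands within about 1% of the quoted 175 billion; the remainder is mostly the token and position embedding tables.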
Positional Encoding
Transformers don't inherently understand order. Positional encodings add position information:
GPT-1, GPT-2, and GPT-3 use learned positional embeddings:
Token Embedding + Position Embedding = Input
"The" at position 0 -> embedding + pos_0
"cat" at position 1 -> embedding + pos_1
"sat" at position 2 -> embedding + pos_2
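OpenAI has not disclosed the positional scheme used in GPT-4. Later methods such as
ALiBi (Attention with Linear Biases) and RoPE (rotary position embeddings) extrapolate
better to sequences longer than those seen in training.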
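The "token embedding + position embedding" sum can be sketched as follows. The tables here are filled with random numbers standing in for learned values, and the dimension and table size are arbitrary choices for illustration.

```python
import random

random.seed(0)
d_model = 4
vocab = ["The", "cat", "sat"]
max_positions = 8

# Both tables are learned during training; random values stand in for them here.
token_table = {tok: [random.gauss(0, 1) for _ in range(d_model)] for tok in vocab}
position_table = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(max_positions)]

def embed(tokens):
    """Input vector = token embedding + embedding of the token's position."""
    return [[t + p for t, p in zip(token_table[tok], position_table[pos])]
            for pos, tok in enumerate(tokens)]

a = embed(["The", "cat", "sat"])
b = embed(["cat", "The", "sat"])
print(a[0] != b[1])   # "The" gets a different vector at position 0 than at position 1
print(a[2] == b[2])   # "sat" at position 2 is identical in both orderings
```

Note that `position_table` has a fixed number of rows: a learned table has no entry for positions beyond the training length, which is the limitation that relative schemes aim to remove.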
GPT Model Evolution
GPT-3 (2020) — The Foundation
| Specification | Value |
|---|---|
| Parameters | 175 billion |
| Layers | 96 |
| Attention heads | 96 |
| Context window | 2,048 tokens |
| Training tokens | 300 billion |
| Vocabulary | 50,257 tokens |
| Architecture | Dense Transformer |
GPT-3 demonstrated in-context learning — the ability to learn from examples provided in the prompt without updating weights:
In-Context Learning:
Few-shot prompt:
"Convert English to French:
hello -> bonjour
goodbye -> au revoir
thank you ->"
GPT-3 output: "merci"
No fine-tuning needed! Just provide examples in the prompt.
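Few-shot prompts like the one above are usually assembled programmatically. A minimal sketch follows; the helper name `build_few_shot_prompt` and the arrow format are illustrative choices, not part of any OpenAI library.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Format solved examples followed by an unsolved query."""
    lines = [f"{instruction}:"]
    lines += [f"{source} -> {target}" for source, target in examples]
    lines.append(f"{query} ->")   # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Convert English to French",
    [("hello", "bonjour"), ("goodbye", "au revoir")],
    "thank you",
)
print(prompt)
```

Keeping the example format identical to the final, unfinished line matters: the model continues the pattern it sees.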
GPT-4o — Current Flagship
GPT-4o ("o" for "omni") is OpenAI's current flagship model, designed from the ground up for multimodal input and output.
Architecture
| Specification | Details |
|---|---|
| Release | May 2024 |
| Parameters | ~1.8 trillion (unconfirmed estimate) |
| Architecture | Mixture of Experts (MoE), 8 experts (unconfirmed) |
| Active parameters per token | ~440 billion with top-2 routing over ~220B experts (unconfirmed) |
| Context window | 128,000 tokens |
| Max output tokens | 16,384 |
| Training data cutoff | October 2023 |
| Modalities | Text, Vision, Audio (native) |
| API cost (input) | $2.50 / 1M tokens |
| API cost (output) | $10.00 / 1M tokens |
| Speed | ~2x faster than GPT-4 Turbo |
Mixture of Experts (MoE) Explained
GPT-4 and GPT-4o are widely reported to use a Mixture of Experts architecture (OpenAI has not confirmed this), which differs fundamentally from dense models:
Dense Model (GPT-3):
Every parameter is used for every token
175B params -> 175B active per token
Compute per token grows with total model size
Mixture of Experts (GPT-4o, per unconfirmed reports):
Multiple "expert" networks, only some activated per token
~1.8T total params -> ~440B active per token (2 of 8 experts at ~220B each)
Large capacity at a fraction of the per-token compute
+---------------------------------------------+
| MoE Architecture |
| |
| Input Token |
| | |
| v |
| +---------------------+ |
| | Gating Network | <- Learns which |
| | (Router) | experts to use |
| +----------+----------+ |
| | |
| +--------+--------+--------+ |
| v v v v |
| +-----+ +-----+ +-----+ +-----+ |
| |Exp 1| |Exp 2| |Exp 3| |... | (8 total) |
| |220B | |220B | |220B | |Exp 8| |
| +--+--+ +--+--+ +--+--+ +--+--+ |
| +--------+--------+ | |
| v | |
| +---------------------+ | |
| | Top-2 Selection |◄--+ |
| | (Only 2 experts | |
| | active per token) | |
| +----------+----------+ |
| v |
| Combined Output |
+---------------------------------------------+
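Benefits:
- Capacity of a much larger model (quality approaches, but does not necessarily match, a dense model of the same total size)
- With top-2 routing over 8 experts, only about a quarter of the expert parameters run per token (~4x less compute than running all of them)
- Lower compute cost per token
- Better scaling properties, at the price of keeping every expert in memory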
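The routing step in the diagram can be sketched in a few lines. This toy version uses scalar functions in place of expert networks and hard-coded router scores (both are assumptions for illustration); it shows top-2 selection, weight renormalization, and the weighted combination.

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, gate_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest are never run."""
    routing = top_k_route(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in routing.items()), routing

# Eight toy "experts": each is a scalar function standing in for a network.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]   # hypothetical router scores

y, routing = moe_layer(10.0, experts, gate_logits)
print(sorted(routing), round(y, 2))
```

Only two of the eight expert functions are called for this token, which is where the compute saving comes from. In a real model the router scores come from a small learned layer applied to the token's hidden state.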
Native Multimodal Architecture
Unlike previous models that chained separate systems, GPT-4o processes all modalities in a single neural network:
Previous Approach (Whisper + GPT-4 + TTS):
Audio -> Whisper (STT) -> Text -> GPT-4 -> Text -> TTS -> Audio
(Latency: ~5.4 seconds on average in ChatGPT Voice Mode)
GPT-4o Native Approach:
Audio -> [GPT-4o Unified Model] -> Audio
(Latency: ~320 ms on average, about 17x faster)
The model processes:
- Text tokens (via text tokenizer)
- Image patches (via vision encoder)
- Audio spectrograms (via audio encoder)
All in the SAME transformer layers!
GPT-4 Turbo — Maximum Intelligence
| Specification | Details |
|---|---|
| Release | November 2023 |
| Parameters | ~1.8 trillion (MoE) |
| Context window | 128,000 tokens |
| Max output tokens | 4,096 |
| Training data cutoff | December 2023 |
| API cost (input) | $10.00 / 1M tokens |
| API cost (output) | $30.00 / 1M tokens |
| Speed | ~1x (baseline) |
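GPT-4 Turbo was OpenAI's most capable general-purpose model until GPT-4o superseded it in May 2024. Compared with GPT-4o it has a smaller output window (4,096 versus 16,384 tokens) and costs four times as much per input token, and the o-series models outperform it on reasoning benchmarks. Its main remaining use is in workloads that were already validated against it.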
GPT-4o Mini — Efficiency Champion
| Specification | Details |
|---|---|
| Release | July 2024 |
| Parameters | ~8 billion (estimated) |
| Context window | 128,000 tokens |
| Max output tokens | 16,384 |
| API cost (input) | $0.15 / 1M tokens |
| API cost (output) | $0.60 / 1M tokens |
| Speed | ~5x faster than GPT-4o |
GPT-4o mini is roughly 50 to 67 times cheaper than GPT-4 Turbo (50x on output tokens, about 67x on input tokens) while maintaining impressive capability. It replaced GPT-3.5 Turbo as the default lightweight model.
o1 / o3 — Reasoning Models
The o-series models represent a paradigm shift from "fast thinking" to "slow thinking" AI.
Chain-of-Thought Reasoning
Standard LLM (System 1 Thinking):
Input -> Immediate Output
"7 × 8 = ?" -> "54" (often wrong)
Reasoning Model (System 2 Thinking):
Input -> Think -> Think -> Think -> Verify -> Output
"7 × 8 = ?"
Step 1: "7 × 8... let me break this down"
Step 2: "7 × 8 = 7 × (10 - 2) = 70 - 14"
Step 3: "70 - 14 = 56"
Step 4: "Let me verify: 8 × 7 = 56 ✓"
Output: "56"
o1 vs o3
| Feature | o1 | o3 |
|---|---|---|
| Release | September 2024 | April 2025 |
| Context | 200K tokens | 200K tokens |
| Max output | 100K tokens | 100K tokens |
| Math (AIME) | 83.3% | 96.7% |
| Coding (Codeforces) | 89th percentile | 97th percentile |
| Science (GPQA) | 77.3% | 87.7% |
| API cost (input) | $15 / 1M | $10 / 1M |
| API cost (output) | $60 / 1M | $40 / 1M |
How Reasoning Models Work
Reasoning Model Internal Process:
1. Problem Decomposition
"Solve x² + 5x + 6 = 0"
-> "This is a quadratic equation"
-> "I can factor this"
2. Step-by-Step Reasoning
"x² + 5x + 6 = (x + 2)(x + 3) = 0"
"Therefore x = -2 or x = -3"
3. Self-Verification
"Check: (-2)² + 5(-2) + 6 = 4 - 10 + 6 = 0 ✓"
"Check: (-3)² + 5(-3) + 6 = 9 - 15 + 6 = 0 ✓"
4. Final Answer
"x = -2 or x = -3"
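This "thinking" happens in hidden reasoning tokens: the user sees only the final answer (ChatGPT may show a summary of the reasoning), and the API bills the hidden tokens as output tokens.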
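The propose-then-verify pattern in steps 1 to 4 is easy to express in ordinary code. The sketch below solves the same quadratic and substitutes each root back into the equation; it illustrates the pattern only and makes no claim about how o1 or o3 work internally.

```python
import math

def solve_quadratic(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 via the quadratic formula."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})

def verify(a, b, c, x, tol=1e-9):
    """Self-check: substitute the candidate back into the equation."""
    return abs(a * x * x + b * x + c) < tol

roots = solve_quadratic(1, 5, 6)          # x^2 + 5x + 6 = 0
checks = [verify(1, 5, 6, x) for x in roots]
print(roots, checks)                      # [-3.0, -2.0] [True, True]
```

The verification step is cheap compared with finding the answer, which is why spending extra tokens on it tends to pay off for math and logic tasks.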
Training Methodology
Phase 1: Pre-training
Objective: Predict next token on massive text corpus
Data: ~13 trillion tokens (unconfirmed estimate for GPT-4; OpenAI does not publish the exact mix), believed to include:
- Web pages (Common Crawl)
- Books (Books3, Gutenberg)
- Code (GitHub, Stack Overflow)
- Academic papers (ArXiv, Semantic Scholar)
- Wikipedia
- News articles
Loss function:
L = -Σ log P(token_t | token_1, ..., token_{t-1})
Training compute:
GPT-3: ~3.14 × 10²³ FLOPs
GPT-4: ~2.1 × 10²⁵ FLOPs (estimated)
GPT-4o: ~2.6 × 10²⁵ FLOPs (estimated)
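Both the loss and the compute figure can be reproduced with a little arithmetic. The sketch below evaluates the loss formula on made-up token probabilities, then applies the common rule of thumb that training costs about 6 FLOPs per parameter per token (an approximation from the scaling-law literature, not a figure from OpenAI) to the GPT-3 numbers.

```python
import math

def next_token_loss(probs_of_correct_tokens):
    """L = -sum(log P(token_t | earlier tokens)) over the sequence."""
    return -sum(math.log(p) for p in probs_of_correct_tokens)

# Hypothetical probabilities a model assigned to each correct next token.
confident = next_token_loss([0.9, 0.8, 0.95])
unsure = next_token_loss([0.2, 0.1, 0.3])
print(f"confident: {confident:.2f}  unsure: {unsure:.2f}")

def training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

gpt3 = training_flops(175e9, 300e9)   # parameters and training tokens from the GPT-3 table
print(f"GPT-3 estimate: {gpt3:.2e} FLOPs")
```

The estimate comes out at about 3.15 × 10²³ FLOPs, in line with the GPT-3 figure listed above. For an MoE model the same rule would use the active parameters per token rather than the total.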
Phase 2: Supervised Fine-Tuning (SFT)
Human-written examples teach the model to follow instructions:
Prompt: "Explain quantum computing in simple terms"
Ideal Response: [Human-written explanation]
Thousands of these examples fine-tune the pre-trained model
to behave as a helpful assistant.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
RLHF Process:
Step 1: Collect Comparisons
Human sees: "What is 2+2?"
Model generates 4 different responses
Human ranks: Response 3 > Response 1 > Response 4 > Response 2
Step 2: Train Reward Model
Learn to predict human preference scores
R(response | prompt) -> score
Step 3: Optimize with PPO
Maximize: E[R(response | prompt)] - β × KL(π_policy || π_reference)
- Maximize reward (human preference)
- Stay close to the reference model, usually the SFT model (KL penalty between the two output distributions)
Result: Model learns to generate responses humans prefer
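Steps 2 and 3 each reduce to a short formula. The sketch below shows the pairwise preference loss used for reward models in Ouyang et al. (2022) and a per-sample version of the KL-penalized reward; the numeric rewards, log-probabilities, and β value are made up for illustration.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

def penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus beta times a per-sample KL estimate (log-probability ratio)."""
    return reward - beta * (logp_policy - logp_reference)

# Step 2: the loss is small when the preferred response scores higher, large otherwise.
print(round(preference_loss(2.0, 0.5), 3), round(preference_loss(0.5, 2.0), 3))

# Step 3: the same reward is worth less once the policy drifts from the reference.
print(penalized_reward(1.0, logp_policy=-2.0, logp_reference=-2.0))
print(penalized_reward(1.0, logp_policy=-1.0, logp_reference=-4.0))
```

A larger β keeps the tuned model closer to the reference and guards against the policy exploiting flaws in the reward model; a smaller β lets it chase reward more aggressively.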
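Phase 4: Principle-Based Alignment
Constitutional AI is a technique introduced by Anthropic, not OpenAI, and OpenAI has not said that ChatGPT is trained with it. OpenAI describes related approaches, such as rule-based rewards and its published Model Spec. The general idea is to steer the model with written principles such as:
1. Be helpful and harmless
2. Be honest and accurate
3. Respect human autonomy
4. Avoid bias and discrimination
5. Protect privacy and security
In Constitutional AI, the model critiques its own responses against
such principles and revises them accordingly.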
Use Cases and Selection Guide
Decision Matrix
| Use Case | Model | Reasoning |
|---|---|---|
| Quick Q&A | GPT-4o mini | Fast, cheap, good enough |
| Writing & editing | GPT-4o | Best balance |
| Complex analysis | GPT-4o or o3 | Broad capability; o3 for multi-step reasoning |
| Math & logic | o1 / o3 | Chain-of-thought |
| Image analysis | GPT-4o | Native vision |
| Code generation | GPT-4o | Strong coding |
| High-volume API | GPT-4o mini | About 17x cheaper than GPT-4o |
| Real-time voice | GPT-4o | Native audio |
| Research synthesis | o3 | Deep reasoning |
| Production apps | GPT-4o mini | Cost-effective |
Prompt Engineering Best Practices
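# Requires: pip install openai, with OPENAI_API_KEY set in the environment
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# System prompt for consistent behavior
# JSON mode requires the word "JSON" to appear somewhere in the messages
system_prompt = """
You are a helpful technical assistant.
- Be concise and accurate
- Use code examples when relevant
- Cite sources when possible
- Admit uncertainty when unsure
- Respond with a single JSON object
"""

# Temperature guide (the API accepts values from 0 to 2; the default is 1)
#   0.0  Factual tasks, code, math
#   0.3  Most general tasks
#   0.7  Creative writing, brainstorming
#   1.0  More varied output
temperature = 0.7

user_query = "Suggest three names for a note-taking app."  # example input

# Structured output
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ],
    temperature=temperature,
    max_tokens=2000,
    response_format={"type": "json_object"},  # JSON output
)
print(response.choices[0].message.content)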
Pricing Economics
Cost Comparison Table
| Model | Input (1M tokens) | Output (1M tokens) | Speed | Quality |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Fast | High |
| GPT-4o mini | $0.15 | $0.60 | Very Fast | Medium-High |
| GPT-4 Turbo | $10.00 | $30.00 | Medium | High |
| o1 | $15.00 | $60.00 | Slow | Very High |
| o3 | $10.00 | $40.00 | Slow | Highest |
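Per-million-token prices are easier to compare once they are applied to a concrete request. The sketch below copies the prices from the table above into a small calculator; the 1,000-input / 500-output workload is an arbitrary example, and list prices change over time, so treat the numbers as a snapshot.

```python
# USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "gpt-4o":      (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4-turbo": (10.00, 30.00),
    "o1":          (15.00, 60.00),
    "o3":          (10.00, 40.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Example workload: 1,000 input tokens and 500 output tokens per request.
for model in PRICES:
    print(f"{model:12s} ${request_cost(model, 1_000, 500):.6f}")

ratio = request_cost("gpt-4-turbo", 1_000, 500) / request_cost("gpt-4o-mini", 1_000, 500)
print(f"GPT-4 Turbo costs about {ratio:.0f}x GPT-4o mini on this workload")
```

For the reasoning models, remember that hidden reasoning tokens count toward `output_tokens`, so the real output count is often much higher than the visible answer suggests.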
Cost Optimization Strategies
Strategy 1: Model Routing
Simple query -> GPT-4o mini ($0.15/M)
Complex query -> GPT-4o ($2.50/M)
Very complex -> o3 ($10/M)
Strategy 2: Prompt Caching
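Cache common prefixes -> 50% discount on cached input tokens
OpenAI caches prompt prefixes of 1,024 tokens or more automatically; cached entries typically expire after 5-10 minutes of inactivity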
Strategy 3: Batch API
Process non-urgent requests in batch
50% cost reduction, results returned within 24 hours
Strategy 4: Output Length Control
Set max_tokens appropriately
Shorter outputs = lower cost
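Strategy 1 can be sketched as a small router. The classification heuristic below (keywords and word count) is a deliberately crude assumption for illustration; production routers typically use a trained classifier or a cheap model call. The traffic mix is also hypothetical.

```python
# Model and input price (USD per 1M tokens) for each routing tier, from Strategy 1.
TIERS = {
    "simple": ("gpt-4o-mini", 0.15),
    "complex": ("gpt-4o", 2.50),
    "very_complex": ("o3", 10.00),
}

def classify(query):
    """Toy heuristic (an assumption): proof-like or long queries are harder."""
    text = query.lower()
    if any(word in text for word in ("prove", "derive", "step by step")):
        return "very_complex"
    return "complex" if len(text.split()) > 30 else "simple"

def route(query):
    """Return the model name for a query."""
    return TIERS[classify(query)][0]

def blended_input_price(traffic_share):
    """Average input price per 1M tokens for a given mix of tiers."""
    return sum(share * TIERS[tier][1] for tier, share in traffic_share.items())

print(route("What is the capital of France?"))
print(route("Prove that the square root of 2 is irrational."))
mix = {"simple": 0.8, "complex": 0.15, "very_complex": 0.05}   # hypothetical traffic mix
print(f"${blended_input_price(mix):.3f} per 1M input tokens vs $2.50 for GPT-4o only")
```

With this mix the blended input price is well under half of sending everything to GPT-4o, even though 5% of traffic goes to the most expensive tier.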
Ethical Considerations
Known Limitations
1. Hallucination
- Model can generate plausible but false information
- Always verify critical facts
- Use retrieval-augmented generation (RAG) for accuracy
2. Bias
- Training data contains societal biases
- Model may reflect stereotypes
- Use diverse prompts and evaluate outputs
3. Knowledge Cutoff
- Training data has a cutoff date
- Cannot access real-time information (unless using tools)
- Use web search for current events
4. Reasoning Errors
- Complex multi-step reasoning can fail
- Math calculations may be incorrect
- Use o1/o3 for critical reasoning tasks
5. Privacy
- ChatGPT conversations may be used for training (opt out in settings)
- Don't share sensitive personal information
- API data is not used for training by default, but it may still be retained for a limited period (up to 30 days) for abuse monitoring
Responsible Use Guidelines
DO:
- Verify important information from multiple sources
- Disclose AI-generated content
- Respect intellectual property
- Use for augmentation, not replacement of human judgment
- Consider downstream impacts
DON'T:
- Use for deception or misinformation
- Share private/sensitive data
- Rely solely on AI for critical decisions
- Use to generate harmful content
- Ignore output quality (always review)
Key Takeaways
- GPT-4o is the best all-around model — use it by default
- GPT-4o mini is about 17x cheaper than GPT-4o (and 50x or more cheaper than GPT-4 Turbo) for simple tasks
- GPT-4 Turbo is an older model with a smaller output window and higher prices than GPT-4o
- o1/o3 are best for math, reasoning, and multi-step problems
- The Transformer architecture uses self-attention to process tokens
- MoE architecture enables massive models with efficient inference
- RLHF training aligns models with human preferences
- Always use system prompts for consistent behavior
- Temperature controls creativity (0 for facts, 0.7 for creative)
- Monitor token usage to control costs effectively
- Verify critical information — models can hallucinate
- Batch API and prompt caching reduce costs significantly
Further Reading
- Vaswani et al. (2017). "Attention Is All You Need"
- Brown et al. (2020). "Language Models are Few-Shot Learners" (GPT-3)
- Ouyang et al. (2022). "Training Language Models to Follow Instructions with Human Feedback"
- OpenAI (2023). "GPT-4 Technical Report"
- OpenAI (2024). "GPT-4o System Card"