In-Context Learning
In-context learning (ICL) is one of the most remarkable emergent capabilities of LLMs. This tutorial explores what ICL is, why it works, and how to use it effectively.
Definition: In-Context Learning (ICL)
In-context learning is the ability of a language model to learn new tasks from examples provided in the prompt, without any gradient updates to the model parameters. The model adapts its behavior based solely on the input context.
How ICL Works
The Bayesian Inference Hypothesis
ICL as Implicit Bayesian Inference
$$P(y \mid x, D) = \sum_{h} P(y \mid x, h)\, P(h \mid D)$$
Here,
- $D$ = In-context examples $(x_1, y_1), \ldots, (x_n, y_n)$
- $h$ = Implicit task hypothesis
- $P(h \mid D)$ = Posterior over hypotheses given examples
The model effectively performs Bayesian inference over task hypotheses, using the in-context examples to update its beliefs about which task is being performed.
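As a toy illustration of the Bayesian view, the sketch below keeps an explicit posterior over two hypothetical tasks and updates it from the in-context examples. The hypothesis set, the `noise` likelihood, and every function name are assumptions made for illustration; an LLM does not expose such a posterior, and this is not a description of its internals.

```python
# Toy Bayesian inference over task hypotheses (illustrative assumption only).

POSITIVE_WORDS = {"fantastic", "loved", "great"}

def sentiment_task(text):
    """Hypothesis 1: the label is the sentiment of the text."""
    return "Positive" if POSITIVE_WORDS & set(text.lower().split()) else "Negative"

def length_task(text):
    """Hypothesis 2: the label says whether the text has more than 4 words."""
    return "Positive" if len(text.split()) > 4 else "Negative"

HYPOTHESES = {"sentiment": sentiment_task, "length": length_task}

def posterior(examples, noise=0.1):
    """Return P(h | D) under a uniform prior and a noisy-label likelihood."""
    weights = {}
    for name, task in HYPOTHESES.items():
        w = 1.0  # uniform prior (unnormalized)
        for x, y in examples:
            w *= (1.0 - noise) if task(x) == y else noise
        weights[name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def predict(x, examples):
    """Posterior predictive: sum over h of P(y | x, h) * P(h | D).

    For simplicity each hypothesis predicts its label deterministically.
    """
    post = posterior(examples)
    probs = {}
    for name, task in HYPOTHESES.items():
        label = task(x)
        probs[label] = probs.get(label, 0.0) + post[name]
    return probs

demos = [
    ("great film", "Positive"),
    ("this one was a long and boring slog", "Negative"),
]
post = posterior(demos)
pred = predict("loved it", demos)
```

Both demonstrations agree with the sentiment hypothesis and contradict the length hypothesis, so nearly all posterior mass moves to sentiment and the prediction for the new input follows it.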
The Task Vector Hypothesis
Definition: Task Vectors
Task vectors are directions in activation space that encode the task defined by in-context examples. These vectors emerge from the attention mechanism and guide the model's predictions for new inputs.
The Grokking Hypothesis
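Definition: Grokking
Grokking is the phenomenon where a model suddenly generalizes long after it has overfit its training data. Applied to ICL, the hypothesis is that large models go through a similar transition during pre-training: they come to implement a learning procedure, possibly one resembling gradient descent, inside the forward pass, and in that sense "grok" how to learn from examples.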
Impact of Example Ordering
The order of in-context examples significantly affects performance:
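ICL Performance by Ordering
$$\mathrm{Acc}(D_\pi) \neq \mathrm{Acc}(D_\sigma) \quad \text{in general, for } \pi \neq \sigma$$
Here,
- $D_\pi$ = Examples ordered by permutation $\pi$
- $D_\sigma$ = Examples ordered by permutation $\sigma$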
Ordering Strategies
- Random ordering: Baseline, moderate performance
- Similarity-based: Most similar examples last (best average)
- Label-balanced: Equal representation of each class
- Difficulty-based: Easy examples first, hard examples last
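For classification tasks, placing the most similar example last often improves performance, plausibly because the example closest to the test input then sits nearest the position where the model makes its prediction. For generation tasks, order tends to matter less.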
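A minimal sketch of the similarity-based strategy: sort the examples so that the one most similar to the test input comes last. The word-overlap (Jaccard) similarity and the function names are stand-in assumptions; an embedding-based similarity would be the more typical choice.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def order_most_similar_last(examples, test_input, sim=jaccard):
    """Sort (text, label) pairs by ascending similarity to test_input."""
    return sorted(examples, key=lambda ex: sim(ex[0], test_input))

examples = [
    ("the plot was dull", "Negative"),
    ("I loved the acting", "Positive"),
    ("the food was okay", "Neutral"),
]
ordered = order_most_similar_last(examples, "I loved the soundtrack")
```

Because `sorted` is stable, examples with equal similarity keep their original relative order, which keeps the prompt reproducible.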
Impact of Example Selection
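Example Selection Score
$$\mathrm{score}(x_i, x_{\text{test}}) = \mathrm{sim}(x_i, x_{\text{test}})$$
Here,
- $x_i$ = Candidate in-context example
- $x_{\text{test}}$ = Test input
- $\mathrm{sim}$ = Similarity function (cosine or dot product)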
Selection Strategies
- Random: No selection bias, but may include irrelevant examples
- Top-k retrieval: Select k most similar examples from a pool
- Diverse selection: Balance similarity with diversity
- Label-aware: Ensure balanced label distribution in selected examples
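The top-k and diverse strategies above can be sketched as follows. The bag-of-words `embed` function is a hypothetical stand-in for a sentence encoder, and the greedy relevance-minus-redundancy rule with its `trade_off` parameter is one possible way to balance similarity with diversity, not a prescribed method.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(pool, test_input, k):
    """Top-k retrieval: the k pool items most similar to the test input."""
    q = embed(test_input)
    return sorted(pool, key=lambda ex: cosine(embed(ex[0]), q), reverse=True)[:k]

def diverse_top_k(pool, test_input, k, trade_off=0.5):
    """Greedy selection: reward similarity to the query, penalize redundancy."""
    q = embed(test_input)
    selected, candidates = [], list(pool)
    while candidates and len(selected) < k:
        def score(ex):
            relevance = cosine(embed(ex[0]), q)
            redundancy = max(
                (cosine(embed(ex[0]), embed(s[0])) for s in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

pool = [
    ("great movie great cast", "Positive"),
    ("great movie great story", "Positive"),
    ("terrible movie", "Negative"),
    ("the soup was cold", "Negative"),
]
nearest = top_k(pool, "great movie", k=2)
varied = diverse_top_k(pool, "great movie", k=2)
```

On this toy pool, plain top-k returns two near-duplicate positive reviews, while the diverse variant keeps the first and swaps the second for a less redundant negative review.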
ICL vs Fine-tuning
| Aspect | ICL | Fine-tuning |
|---|---|---|
| Data required | 2-32 examples | 100-10,000+ examples |
| Compute | Forward pass only | Gradient updates |
| Task switching | Change prompt | Retrain model |
| Performance | 80-95% of fine-tuning | 100% (baseline) |
| Latency | Higher (longer prompts) | Lower (shorter prompts) |
| Knowledge access | Full pre-trained knowledge | May forget pre-trained knowledge |
ICL-Fine-tuning Tradeoff
The tradeoff depends on three quantities:
- Number of labeled examples
- Number of model parameters
- Threshold (typically 1e-4)
Practical ICL Implementation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot examples
examples = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]

def build_icl_prompt(test_input, examples):
    """Format the few-shot examples, followed by the unlabeled test input."""
    prompt = "Classify the sentiment of each review.\n\n"
    for text, label in examples:
        prompt += f'Review: "{text}" -> {label}\n'
    prompt += f'Review: "{test_input}" -> '
    return prompt

test = "Absolutely loved every minute of it!"
prompt = build_icl_prompt(test, examples)

inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding for deterministic output.
output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(f"Prediction: {response}")  # Expected: Positive
```
For more on prompting techniques, see our module on Prompt Engineering.
Practice Exercises
- Empirical: Test ICL on a 3-class classification task with 1, 2, 4, 8, and 16 examples. Plot accuracy vs number of examples.
- Ordering: Compare random, similarity-based, and reverse ordering of examples. Which is most robust?
- Selection: Implement a retrieval-based example selector using sentence embeddings. Compare with random selection.
- Theory: Explain why ICL works for decoder-only models but not for encoder-only models like BERT.
Key Takeaways:
- ICL enables learning from examples without gradient updates
- One leading hypothesis is that the model performs implicit Bayesian inference over task hypotheses
- Example ordering and selection significantly affect performance
- ICL reaches roughly 80-95% of fine-tuning performance with about 100x fewer examples
- Similarity-based selection, ordered with the most similar example last, is generally the strongest default
- ICL and fine-tuning are complementary; use ICL when data is scarce
Theoretical Foundations
ICL as Gradient Descent
Recent research suggests that transformer attention can implicitly perform gradient descent during the forward pass. In simplified settings, such as linear attention applied to linear regression, the attention output matches the prediction obtained after a gradient step on the in-context examples, so the model's internal representations are updated without any explicit parameter update.
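The correspondence can be checked numerically in the simplest case. The sketch below assumes a linear model initialized at zero, a squared-error loss, and unnormalized linear attention (no softmax) with the in-context inputs as keys and their targets as values. These are simplifying assumptions, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with the weights obtained after one gradient step from w = 0."""
    dim = len(x_query)
    w = [0.0] * dim
    # Gradient of 0.5 * sum_i (w . x_i - y_i)^2 with respect to w.
    grad = [
        sum((dot(w, x) - y) * x[j] for x, y in zip(xs, ys))
        for j in range(dim)
    ]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]

gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

Starting from zero weights, the gradient step gives weights equal to `lr` times the sum of `y_i * x_i`, so the prediction for the query is a weighted sum of the targets with weights `x_i . x_query`, which is exactly what the attention function computes.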
Mechanistic Interpretability of ICL
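Mechanistic interpretability studies have identified specific circuits that contribute to ICL. The induction head circuit is composed of two attention heads in different layers, a previous-token head and an induction head. Together they perform pattern completion: when the current token has appeared earlier in the context, the circuit attends to the token that followed it and copies that token forward. This circuit emerges during training and appears to be important for few-shot learning.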
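The behavior attributed to induction heads can be mimicked with a few lines of ordinary code. This is a sketch of the input-output pattern only, under the assumption of exact token matching; it says nothing about how attention heads implement it, and `induction_predict` is a hypothetical name.

```python
def induction_predict(tokens):
    """Toy version of the induction pattern [A][B] ... [A] -> [B].

    Finds the most recent earlier occurrence of the final token and
    returns the token that followed it, or None if there is no match.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from right to left.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prompt_tokens = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
next_token = induction_predict(prompt_tokens)
```

Here the final token "Mr" occurred earlier followed by "Dursley", so the function copies "Dursley" forward as the predicted next token.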
Limitations of ICL
- Context window limits: ICL is constrained by the maximum sequence length. Long contexts increase latency and cost.
- Distribution shift: ICL performance degrades when test examples differ significantly from the pre-training distribution.
- Instability: Small changes in example ordering or selection can cause large performance swings.
- Limited complexity: ICL struggles with tasks requiring deep reasoning or memorization of complex patterns.
Advanced ICL Techniques
Retrieval-Augmented ICL
Instead of selecting examples randomly, use a retrieval system to find the most relevant in-context examples for each query. This combines the benefits of RAG with ICL, improving accuracy on diverse inputs.
Learned Prompting
Rather than selecting natural examples, learn continuous prompt embeddings that maximize task performance. This is the basis of prefix tuning and prompt tuning methods, which bridge the gap between ICL and fine-tuning.
Task-Aware ICL
Analyze the task structure and design ICL prompts that explicitly communicate the task type, input format, and output format. This reduces ambiguity and improves consistency across diverse inputs.
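One way to make the task explicit is a prompt template that states the task, the input format, and the allowed outputs before the examples. The sketch below is an assumption about what such a template could look like; the field names and wording are illustrative, not a required format.

```python
def build_task_aware_prompt(task, input_name, output_name, labels, examples, test_input):
    """Build a few-shot prompt that states the task, formats, and label set."""
    lines = [
        f"Task: {task}",
        f"Input format: one {input_name} per example.",
        f"Output format: exactly one of {', '.join(labels)}.",
        "",
    ]
    for text, label in examples:
        lines.append(f"{input_name}: {text}")
        lines.append(f"{output_name}: {label}")
        lines.append("")
    # The test input uses the same layout, with the output left open.
    lines.append(f"{input_name}: {test_input}")
    lines.append(f"{output_name}:")
    return "\n".join(lines)

prompt = build_task_aware_prompt(
    task="Classify the sentiment of each review.",
    input_name="Review",
    output_name="Sentiment",
    labels=["Positive", "Negative", "Neutral"],
    examples=[
        ("This movie was fantastic!", "Positive"),
        ("Terrible waste of time.", "Negative"),
    ],
    test_input="The food was okay.",
)
```

Listing the label set up front lets the prompt name a class such as Neutral even when no example of it fits in the context.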
The most effective ICL systems combine retrieval (finding relevant examples), ordering (presenting them optimally), and calibration (correcting the model's bias toward particular labels). Invest in all three components for best results.
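Calibration is not spelled out in this section. One common approach, sketched here as an assumption, estimates the model's label bias from a content-free input such as "N/A" and divides it out of the predicted probabilities. The probabilities below are made-up numbers for illustration, not measurements.

```python
def calibrate(label_probs, content_free_probs):
    """Divide each label probability by its content-free estimate, then renormalize."""
    scaled = {
        label: p / content_free_probs[label]
        for label, p in label_probs.items()
    }
    total = sum(scaled.values())
    return {label: s / total for label, s in scaled.items()}

# Hypothetical numbers: the probabilities assigned to a content-free
# input such as "N/A" reveal a bias toward "Positive".
content_free = {"Positive": 0.7, "Negative": 0.3}
raw = {"Positive": 0.6, "Negative": 0.4}

calibrated = calibrate(raw, content_free)
prediction = max(calibrated, key=calibrated.get)
```

In this example the raw prediction is "Positive", but once the bias toward that label is divided out, "Negative" becomes the more likely label.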
ICL in Production
When deploying ICL in production, consider:
- Latency: Longer prompts mean slower inference. Balance example count with speed requirements.
- Cost: API calls are priced by token count. Fewer, more relevant examples reduce cost.
- Consistency: Use deterministic example selection for reproducible outputs.
- Fallback: Have a fine-tuned model as fallback when ICL performance is insufficient.
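The considerations above can be combined in a small routing layer. The sketch below assumes examples arrive already ranked by relevance, uses a crude whitespace token count in place of a real tokenizer, and treats a confidence score as given. The function names, the budget, and the 0.7 threshold are all hypothetical.

```python
def count_tokens(text):
    """Crude whitespace token count; a real tokenizer would give other numbers."""
    return len(text.split())

def fit_examples(ranked_examples, test_input, max_prompt_tokens):
    """Keep the highest-ranked examples that fit within the token budget."""
    used = count_tokens(test_input)
    kept = []
    for text, label in ranked_examples:
        cost = count_tokens(text) + count_tokens(label)
        if used + cost > max_prompt_tokens:
            break  # deterministic cutoff: same inputs give the same prompt
        kept.append((text, label))
        used += cost
    return kept

def choose_route(confidence, threshold=0.7):
    """Use ICL when confident enough, otherwise fall back to a fine-tuned model."""
    return "icl" if confidence >= threshold else "fine_tuned_fallback"

ranked = [
    ("great movie", "Positive"),
    ("terrible waste of time", "Negative"),
    ("the food was okay", "Neutral"),
]
kept = fit_examples(ranked, "loved every minute", max_prompt_tokens=12)
route = choose_route(confidence=0.55)
```

Cutting the ranked list at a fixed budget bounds latency and cost per request, and the threshold check gives a simple hook for the fallback path.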