In-Context Learning


In-context learning (ICL) is one of the most remarkable emergent capabilities of LLMs. This tutorial explores what ICL is, why it works, and how to use it effectively.

Definition: In-Context Learning (ICL)

In-context learning is the ability of a language model to learn new tasks from examples provided in the prompt, without any gradient updates to the model parameters. The model adapts its behavior based solely on the input context.

How ICL Works

The Bayesian Inference Hypothesis

ICL as Implicit Bayesian Inference

$$P(y_{\text{test}} \mid x_{\text{test}}, D) = \sum_{\theta} P(y_{\text{test}} \mid x_{\text{test}}, \theta)\, P(\theta \mid D)$$

Here,

  • $D$ = in-context examples $(x_1, y_1), \ldots, (x_n, y_n)$
  • $\theta$ = implicit task hypothesis
  • $P(\theta \mid D)$ = posterior over hypotheses given the examples

The model effectively performs Bayesian inference over task hypotheses, using the in-context examples to update its beliefs about which task is being performed.
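To make the formula concrete, here is a minimal toy sketch of the posterior update and the posterior-weighted prediction. The two hypotheses (`is_even`, `is_positive`), the label-noise rate `noise`, and the demonstration set are all illustrative assumptions, not something a real LLM exposes.

```python
# Hypothetical toy setup: two candidate tasks (hypotheses theta) that map an
# integer input to a "yes"/"no" label.
hypotheses = {
    "is_even": lambda x: "yes" if x % 2 == 0 else "no",
    "is_positive": lambda x: "yes" if x > 0 else "no",
}

def posterior(examples, prior=None, noise=0.1):
    """Return P(theta | D), assuming each label is flipped with probability `noise`."""
    names = list(hypotheses)
    prior = prior or {n: 1.0 / len(names) for n in names}
    unnorm = {}
    for n in names:
        likelihood = 1.0
        for x, y in examples:
            likelihood *= (1 - noise) if hypotheses[n](x) == y else noise
        unnorm[n] = prior[n] * likelihood
    z = sum(unnorm.values())
    return {n: v / z for n, v in unnorm.items()}

def predict(x_test, examples, noise=0.1):
    """P(y_test | x_test, D) = sum_theta P(y_test | x_test, theta) * P(theta | D)."""
    post = posterior(examples, noise=noise)
    probs = {"yes": 0.0, "no": 0.0}
    for n, p in post.items():
        y_hat = hypotheses[n](x_test)
        other = "no" if y_hat == "yes" else "yes"
        probs[y_hat] += p * (1 - noise)
        probs[other] += p * noise
    return probs

# In-context examples D: consistent with "is_even", mostly inconsistent with "is_positive".
demos = [(2, "yes"), (-4, "yes"), (3, "no")]
post = posterior(demos)
pred = predict(-6, demos)
print(post)
print(pred)
```

With these three demonstrations, nearly all posterior mass moves to `is_even`, so the prediction for `-6` leans heavily toward "yes" even though the `is_positive` hypothesis would say "no".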

The Task Vector Hypothesis

Definition: Task Vectors

Task vectors are directions in activation space that encode the task defined by in-context examples. These vectors emerge from the attention mechanism and guide the model's predictions for new inputs.

The Grokking Hypothesis

Definition: Grokking

Grokking is the phenomenon where a model suddenly generalizes long after it has overfit its training data. Applied to ICL, the hypothesis is that large models have learned during pre-training to run something like gradient descent inside the forward pass, in effect "grokking" how to learn from examples.


Impact of Example Ordering

The order of in-context examples significantly affects performance:

ICL Performance by Ordering

$$\text{Accuracy}(D_{\pi}) \neq \text{Accuracy}(D_{\sigma}) \quad \text{in general, for } \pi \neq \sigma$$

Here,

  • $D_{\pi}$ = examples ordered by permutation $\pi$
  • $D_{\sigma}$ = examples ordered by permutation $\sigma$

Ordering Strategies

  1. Random ordering: Baseline, moderate performance
  2. Similarity-based: Most similar examples last (best average)
  3. Label-balanced: Equal representation of each class
  4. Difficulty-based: Easy examples first, hard examples last

For classification tasks, placing the most similar example last consistently improves performance. For generation tasks, order matters less.
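The similarity-based strategy can be sketched as follows. The function name `order_most_similar_last` and the 2-dimensional vectors are hypothetical stand-ins; in practice the vectors would come from whatever sentence encoder you use.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_most_similar_last(examples, example_vecs, test_vec):
    """Sort examples by ascending similarity to the test input,
    so the most similar example sits right before the query."""
    sims = [cosine(v, test_vec) for v in example_vecs]
    order = np.argsort(sims, kind="stable")  # ascending similarity
    return [examples[i] for i in order]

# Hypothetical examples with made-up 2-d embeddings.
examples = [
    ("Great film", "Positive"),
    ("Awful plot", "Negative"),
    ("Loved it", "Positive"),
]
vecs = np.array([[0.7, 0.7], [0.1, 0.9], [1.0, 0.0]])
test_vec = np.array([1.0, 0.05])

ordered = order_most_similar_last(examples, vecs, test_vec)
print(ordered)
```

Here "Loved it" has the embedding closest to the test vector, so it ends up last in the prompt, and "Awful plot" ends up first.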

Impact of Example Selection

Example Selection Score

$$\text{score}(x_i) = \text{sim}(\text{embed}(x_i), \text{embed}(x_{\text{test}}))$$

Here,

  • $x_i$ = candidate in-context example
  • $x_{\text{test}}$ = test input
  • $\text{sim}$ = similarity function (cosine or dot product)

Selection Strategies

  • Random: No selection bias, but may include irrelevant examples
  • Top-k retrieval: Select k most similar examples from a pool
  • Diverse selection: Balance similarity with diversity
  • Label-aware: Ensure balanced label distribution in selected examples
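As a rough illustration of top-k retrieval with the score defined above, the sketch below uses a toy bag-of-words `embed` as a stand-in for a real sentence encoder. The `embed`, `sim`, and `top_k` helpers and the example pool are assumptions for illustration only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def sim(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(pool, x_test, k=2):
    """Return the k pool examples whose inputs score highest against x_test."""
    q = embed(x_test)
    ranked = sorted(pool, key=lambda ex: sim(embed(ex[0]), q), reverse=True)
    return ranked[:k]

# Hypothetical candidate pool of (input, label) pairs.
pool = [
    ("The movie was fantastic", "Positive"),
    ("The food was cold", "Negative"),
    ("Service was slow", "Negative"),
    ("A fantastic movie overall", "Positive"),
]
selected = top_k(pool, "This movie was really fantastic", k=2)
print(selected)
```

Pure top-k can return near-duplicates with the same label; the diverse and label-aware strategies in the list address that by adding constraints on top of this ranking.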

ICL vs Fine-tuning

| Aspect | ICL | Fine-tuning |
|---|---|---|
| Data required | 2-32 examples | 100-10,000+ examples |
| Compute | Forward pass only | Gradient updates |
| Task switching | Change prompt | Retrain model |
| Performance | 80-95% of fine-tuning | 100% (baseline) |
| Latency | Higher (longer prompts) | Lower (shorter prompts) |
| Knowledge access | Full pre-trained knowledge | May forget pre-trained knowledge |

ICL-Fine-tuning Tradeoff

$$\text{Use ICL if: } \frac{N_{\text{examples}}}{N_{\text{params}}} < \tau_{\text{icl}}$$

Here,

  • $N_{\text{examples}}$ = number of labeled examples
  • $N_{\text{params}}$ = number of model parameters
  • $\tau_{\text{icl}}$ = threshold (typically 1e-4)
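A short worked example of the rule, assuming a 7-billion-parameter model. The helper name `prefer_icl` is hypothetical.

```python
def prefer_icl(n_examples, n_params, tau_icl=1e-4):
    """Apply the rule above: use ICL when N_examples / N_params < tau_icl."""
    return n_examples / n_params < tau_icl

n_params = 7_000_000_000  # assumed model size

few = prefer_icl(32, n_params)          # 32 / 7e9 is about 4.6e-9
many = prefer_icl(1_000_000, n_params)  # 1e6 / 7e9 is about 1.4e-4
print(few, many)
```

With these numbers the rule prefers ICL for 32 examples and fine-tuning for one million. Note that at a threshold of 1e-4 the crossover for a 7B model sits at 700,000 examples, so the ratio is best read as a rough heuristic rather than a precise cutoff.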

Practical ICL Implementation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot examples
examples = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]

def build_icl_prompt(test_input, examples):
    """Format the demonstrations, then leave the final label blank for the model."""
    prompt = "Classify the sentiment of each review.\n\n"
    for text, label in examples:
        prompt += f'Review: "{text}" -> {label}\n'
    prompt += f'Review: "{test_input}" -> '
    return prompt

test = "Absolutely loved every minute of it!"
prompt = build_icl_prompt(test, examples)

inputs = tokenizer(prompt, return_tensors="pt")
# Greedy decoding: do_sample=False gives deterministic output
# (temperature=0.0 is not a valid sampling temperature).
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(f"Prediction: {response.strip()}")  # Expected: Positive
```

For more on prompting techniques, see our module on Prompt Engineering.

Practice Exercises

  1. Empirical: Test ICL on a 3-class classification task with 1, 2, 4, 8, and 16 examples. Plot accuracy vs number of examples.
  2. Ordering: Compare random, similarity-based, and reverse ordering of examples. Which is most robust?
  3. Selection: Implement a retrieval-based example selector using sentence embeddings. Compare with random selection.
  4. Theory: Explain why ICL works for decoder-only models but not for encoder-only models like BERT.

Key Takeaways:

  • ICL enables learning from examples without gradient updates
  • One leading hypothesis is that the model performs implicit Bayesian inference over task hypotheses
  • Example ordering and selection significantly affect performance
  • ICL reaches roughly 80-95% of fine-tuning performance with about 100x fewer examples
  • Similarity-based example selection (most similar last) is generally best
  • ICL and fine-tuning are complementary; use ICL when data is scarce

Theoretical Foundations

ICL as Gradient Descent

Recent research suggests that transformer attention mechanisms can implicitly perform gradient descent during the forward pass. Under this view, the attention computation behaves like a linear regression fit to the in-context examples, updating the model's internal representations without any explicit parameter updates.
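The correspondence can be checked numerically in the simplest setting: one gradient descent step on a linear regression loss, starting from zero weights, gives the same prediction as unnormalized linear attention with the test input as query, the example inputs as keys, and the example targets as values. This is a simplified sketch under those assumptions (no softmax, a single step, random synthetic data), not a claim about what a trained transformer actually computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))   # in-context inputs x_i (keys)
w_true = rng.normal(size=d)
y = X @ w_true                # in-context targets y_i (values)
x_test = rng.normal(size=d)   # test input (query)
eta = 0.1                     # learning rate

# One gradient descent step on L(w) = 1/2 * sum_i (w . x_i - y_i)^2 from w = 0.
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)
w1 = w0 - eta * grad
pred_gd = w1 @ x_test

# Unnormalized linear attention: sum_i y_i * (x_i . x_test), scaled by eta.
scores = X @ x_test
pred_attn = eta * (scores @ y)

print(pred_gd, pred_attn)
```

Both predictions reduce to `eta * sum_i y_i * (x_i . x_test)`, which is why the two printed values agree up to floating-point error.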

Mechanistic Interpretability of ICL

Mechanistic interpretability studies have identified specific circuits responsible for ICL. The induction head circuit, composed of two attention heads, learns to perform pattern completion by copying from previous contexts. This circuit emerges during training and is essential for few-shot learning.
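The behavior attributed to an induction head can be written as a plain lookup rule: find the most recent earlier occurrence of the current token and copy whatever followed it. The function below is a hypothetical illustration of that rule on token lists, not a model of the actual attention weights.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the last token. Returns None if there is none."""
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final position itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

seq = ["A", "B", "C", "A"]
guess = induction_predict(seq)
print(guess)  # "B": the token that followed the earlier "A"
```

In the two-head circuit described above, the first head makes each position aware of its preceding token, and the second head uses that information to attend to the position right after the earlier match.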

Limitations of ICL

  1. Context window limits: ICL is constrained by the maximum sequence length. Long contexts increase latency and cost.
  2. Distribution shift: ICL performance degrades when test examples differ significantly from the pre-training distribution.
  3. Instability: Small changes in example ordering or selection can cause large performance swings.
  4. Limited complexity: ICL struggles with tasks requiring deep reasoning or memorization of complex patterns.

Advanced ICL Techniques

Retrieval-Augmented ICL

Instead of selecting examples randomly, use a retrieval system to find the most relevant in-context examples for each query. This combines the benefits of RAG with ICL, improving accuracy on diverse inputs.

Learned Prompting

Rather than selecting natural examples, learn continuous prompt embeddings that maximize task performance. This is the basis of prefix tuning and prompt tuning methods, which bridge the gap between ICL and fine-tuning.

Task-Aware ICL

Analyze the task structure and design ICL prompts that explicitly communicate the task type, input format, and output format. This reduces ambiguity and improves consistency across diverse inputs.

The most effective ICL systems combine retrieval (finding relevant examples), ordering (presenting them optimally), and calibration (adjusting for bias). Invest in all three components for best results.
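The calibration step is not specified further here. One common approach, often called contextual calibration, estimates the prompt's label bias by scoring a content-free input (for example "N/A") and dividing it out. The sketch below assumes you already have label probabilities from the model; the numbers are made up for illustration.

```python
import numpy as np

def calibrate(label_probs, content_free_probs):
    """Divide out the bias measured on a content-free input, then renormalize."""
    p = np.asarray(label_probs, dtype=float)
    p_cf = np.asarray(content_free_probs, dtype=float)
    q = p / p_cf
    return q / q.sum()

labels = ["Positive", "Negative", "Neutral"]
p_cf = [0.6, 0.3, 0.1]  # hypothetical: the prompt alone favors "Positive"
p = [0.5, 0.4, 0.1]     # hypothetical raw probabilities for a real input

cal = calibrate(p, p_cf)
raw_label = labels[int(np.argmax(p))]
cal_label = labels[int(np.argmax(cal))]
print(raw_label, cal_label, cal.round(3))
```

In this made-up case the raw prediction is "Positive", but once the prompt's built-in preference for "Positive" is divided out, "Negative" scores highest.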

ICL in Production

When deploying ICL in production, consider:

  • Latency: Longer prompts mean slower inference. Balance example count with speed requirements.
  • Cost: API calls are priced by token count. Fewer, more relevant examples reduce cost.
  • Consistency: Use deterministic example selection for reproducible outputs.
  • Fallback: Have a fine-tuned model as fallback when ICL performance is insufficient.
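The latency and cost points above can be handled with a simple token budget: walk through the examples in ranked order and stop before the budget is exceeded. In this sketch the helper `fit_to_budget` is hypothetical, and whitespace splitting stands in for a real tokenizer, so actual token counts will differ.

```python
def fit_to_budget(ranked_examples, max_tokens, count_tokens=lambda s: len(s.split())):
    """Keep examples in ranked order until adding one would exceed max_tokens.
    Deterministic for a fixed ranking, which helps reproducibility."""
    chosen, used = [], 0
    for text, label in ranked_examples:
        cost = count_tokens(f"Review: {text} -> {label}")
        if used + cost > max_tokens:
            break
        chosen.append((text, label))
        used += cost
    return chosen, used

# Hypothetical examples, already ranked from most to least relevant.
ranked = [
    ("This movie was fantastic!", "Positive"),
    ("Terrible waste of time.", "Negative"),
    ("The food was okay.", "Neutral"),
]
chosen, used = fit_to_budget(ranked, max_tokens=15)
print(len(chosen), used)
```

Each formatted example costs 7 whitespace tokens here, so a budget of 15 admits the first two and drops the third.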
