Advanced Topics
LLM Interpretability
Understanding what happens inside the black box—mechanistic interpretability, probing, and activation patching to reverse-engineer language model cognition.
- Mechanistic — Reverse-engineering circuits and algorithms
- Probing — Training classifiers on internal representations
- Patching — Causal interventions on model activations
If you can't explain it simply, you don't understand it well enough.
Definition: Interpretability
Interpretability in the context of LLMs is the degree to which a human can understand the cause of a model's decision. A model is interpretable if its internal representations, attention patterns, and computational circuits can be mapped to human-understandable concepts and algorithms.
Levels of Interpretability
Interpretability operates at multiple levels of abstraction:
| Level | Question | Methods |
|---|---|---|
| Neuron | What does this neuron detect? | Activation visualization, maximally activating examples |
| Feature | What features does the model use? | Sparse autoencoders, dictionary learning |
| Circuit | What algorithm does this subnetwork implement? | Activation patching, causal intervention |
| Representation | What concepts are encoded? | Probing, geometric analysis |
| Behavior | Why does the model produce this output? | Attention visualization, Chain-of-Thought analysis |
Mechanistic interpretability is fundamentally a reverse-engineering problem: given a trained neural network that solves a task, figure out what algorithm it learned.
Probing Methods
Linear Probing
The simplest probe is a linear classifier trained on a model's internal representations:
Linear Probe
$$\hat{y}_i = \sigma\left(W h_i^{(l)} + b\right)$$
Here,
- $h_i^{(l)}$ = Hidden representation at layer $l$ for input $i$
- $W$ = Learnable weight matrix
- $b$ = Bias vector
- $\sigma$ = Activation function
A probe that achieves high accuracy suggests the information is linearly decodable from the representation—meaning the model has learned to encode it.
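A minimal sketch of a linear probe, assuming synthetic vectors stand in for real layer-$l$ hidden states. The helper names (`train_linear_probe`, `probe_accuracy`) and the toy data are illustrative and not part of any library; with a real model the rows of `H` would be cached activations.

```python
import numpy as np

def train_linear_probe(H, y, lr=0.5, steps=500):
    """Fit sigmoid(H @ w + b) to binary labels y with plain gradient descent."""
    n, d = H.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))  # predicted probabilities
        grad = p - y                             # cross-entropy gradient w.r.t. the logits
        w -= lr * (H.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(H, y, w, b):
    """Fraction of examples whose thresholded logit matches the label."""
    return float((((H @ w + b) > 0).astype(int) == y).mean())

# Assumption: toy "hidden states" in which the label is encoded along one direction.
rng = np.random.default_rng(0)
d = 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
y = rng.integers(0, 2, size=400)
H = rng.normal(size=(400, d)) + 3.0 * np.outer(2 * y - 1, direction)

# Train on the first 300 examples, evaluate on the held-out 100.
w, b = train_linear_probe(H[:300], y[:300])
acc = probe_accuracy(H[300:], y[300:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

Evaluating on held-out examples matters: a probe that only fits its training set says little about what the representation encodes.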
Non-Linear Probing
When linear probes fail, more expressive probes (MLPs) can detect non-linear representations:
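Non-Linear Probe
$$\hat{y}_i = \sigma\left(W_2\,\sigma\left(W_1 h_i^{(l)} + b_1\right) + b_2\right)$$
Here,
- $W_1, W_2$ = Weight matrices for hidden layers
- $b_1, b_2$ = Bias vectors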
A high-accuracy non-linear probe does not necessarily mean the model uses that information for its task. The probe may be extracting information the model encodes but does not use. This is the "probe-reads-more-than-model-uses" problem.
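The gap between linear and non-linear decodability can be illustrated with a toy case. In this sketch, a fixed quadratic feature expansion with a least-squares readout stands in for a trained MLP probe. The data are synthetic, the label is an XOR-like function of two dimensions, and the helper names are hypothetical.

```python
import numpy as np

def fit_readout(features, targets):
    """Least-squares readout over the features plus a bias column."""
    X = np.column_stack([features, np.ones(len(features))])
    weights, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return weights

def readout_accuracy(features, targets, weights):
    """Fraction of examples where the sign of the readout matches the label."""
    X = np.column_stack([features, np.ones(len(features))])
    return float((np.sign(X @ weights) == targets).mean())

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 2))       # toy 2-d "hidden states"
y = np.sign(H[:, 0] * H[:, 1])       # XOR-like label in {-1, +1}
train, test = slice(0, 1500), slice(1500, 2000)

# Linear probe: reads the raw representation only.
w_lin = fit_readout(H[train], y[train])
acc_linear = readout_accuracy(H[test], y[test], w_lin)

# Non-linear probe (stand-in): adds the product feature h_0 * h_1 before the readout.
H_quad = np.column_stack([H, H[:, 0] * H[:, 1]])
w_quad = fit_readout(H_quad[train], y[train])
acc_nonlinear = readout_accuracy(H_quad[test], y[test], w_quad)

print(f"linear: {acc_linear:.2f}, non-linear: {acc_nonlinear:.2f}")
```

The linear probe should sit near chance while the expanded probe does much better. The information is present in both cases, but only the more expressive probe can read it, which is why high probe accuracy alone does not show that the model uses it.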
What Probing Reveals
Research has revealed that LLMs encode rich internal representations:
| Layer Level | What Is Encoded | Example |
|---|---|---|
| Early layers | Token syntax, part-of-speech | Nouns vs verbs |
| Middle layers | Syntactic structure, semantics | Dependency relations |
| Late layers | Task-specific features, facts | Entity properties, relations |
Anthropic's work on Claude has shown that facts about the world are stored in a remarkably linear fashion: "The Eiffel Tower is in Paris" is encoded as a linear combination of concepts that can be manipulated algebraically.
Causal Probing
To establish that the model actually uses a representation, we must intervene causally:
Causal Effect of Representation
$$\Delta = f\left(h^{+}\right) - f\left(h^{-}\right)$$
Here,
- $h^{+}$ = Representation with target feature present
- $h^{-}$ = Representation with target feature ablated
- $f(\cdot)$ = Model output
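A minimal sketch of this kind of intervention, assuming the feature is a single direction that can be ablated by projecting it out, and that a toy sigmoid readout stands in for the rest of the network. All names here are hypothetical. The sketch reports the mean absolute change in output rather than a signed difference.

```python
import numpy as np

def ablate_direction(h, v):
    """Remove the component of each row of h along the direction v."""
    v = v / np.linalg.norm(v)
    return h - np.outer(h @ v, v)

def model_output(h, w_out):
    """Toy readout standing in for the downstream network: sigmoid(h . w_out)."""
    return 1.0 / (1.0 + np.exp(-(h @ w_out)))

def causal_effect(h, v, w_out):
    """Mean absolute change in the output when direction v is ablated."""
    before = model_output(h, w_out)
    after = model_output(ablate_direction(h, v), w_out)
    return float(np.abs(before - after).mean())

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(500, d))

# Assumption: the toy model reads only dimension 0 of the representation.
w_out = np.zeros(d)
w_out[0] = 2.0
used_direction = np.eye(d)[0]
unused_direction = np.eye(d)[1]

effect_used = causal_effect(h, used_direction, w_out)
effect_unused = causal_effect(h, unused_direction, w_out)
print(f"ablate used: {effect_used:.3f}, ablate unused: {effect_unused:.3f}")
```

Both directions carry information a probe could decode, but only ablating the one the readout depends on changes the output. That asymmetry is what a causal test measures and a probe cannot.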
Mechanistic Interpretability
Circuit Discovery
The core idea: identify a minimal subgraph of the model's computational graph that is sufficient to perform a specific task.
Definition: Circuit
A circuit is a minimal set of attention heads, MLP neurons, and their connections that is causally necessary and sufficient for a specific model behavior. Formally, a circuit C is a subgraph of the full model G such that G \ C produces incorrect outputs on the target task, while C alone produces correct outputs.
Activation Patching
Activation patching (also called causal tracing) is the primary tool for identifying circuits:
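Activation Patching
$$\text{Effect}^{(l)} = P\left(y \mid x_{\text{clean}},\ \tilde{h}^{(l)} \leftarrow h^{(l)}\right) - P\left(y \mid x_{\text{clean}},\ \tilde{h}^{(l)}\right)$$
Here,
- $y$ = Correct output
- $x_{\text{clean}}$ = Clean input (with correct answer)
- $\tilde{h}^{(l)}$ = Corrupted hidden state at layer $l$
- $h^{(l)}$ = Clean hidden state (restored)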
The procedure:
- Run the model on a clean input (produces correct output)
- Run on a corrupted input (produces incorrect output)
- For each component, replace its corrupted activation with the clean one
- Measure how much the output recovers toward correct
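The four steps can be sketched on a toy model. This is an illustration under stated assumptions, not a transformer implementation: the "components" are parallel random maps whose outputs are summed and read out linearly, and every name (`component_outputs`, `forward`, `patched`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_components = 6, 4
A = rng.normal(size=(n_components, d, d)) * 0.5   # per-component input weights
B = rng.normal(size=(n_components, d, d)) * 0.5   # per-component output weights
w_out = rng.normal(size=d)                        # linear readout

def component_outputs(x):
    """Output of each parallel component for input x."""
    return [np.tanh(x @ A[k]) @ B[k] for k in range(n_components)]

def forward(x, patched=None):
    """Sum the component outputs and read out a scalar.

    `patched` maps a component index to a replacement output.
    """
    outs = component_outputs(x)
    for k, value in (patched or {}).items():
        outs[k] = value
    return float(sum(outs) @ w_out)

x_clean = rng.normal(size=d)
x_corrupt = rng.normal(size=d)

# Steps 1 and 2: clean and corrupted runs, caching the clean activations.
clean_cache = component_outputs(x_clean)
y_clean = forward(x_clean)
y_corrupt = forward(x_corrupt)

# Steps 3 and 4: patch one component at a time and measure the recovered fraction.
recovery = []
for k in range(n_components):
    y_patched = forward(x_corrupt, patched={k: clean_cache[k]})
    recovery.append((y_patched - y_corrupt) / (y_clean - y_corrupt))

for k, r in enumerate(recovery):
    print(f"component {k}: recovery {r:+.2f}")
```

Because this toy readout is linear in the component outputs, the recovery fractions sum to 1. In a real transformer, downstream layers and nonlinearities break that additivity, so per-component scores should be read as evidence rather than an exact decomposition.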
Types of Patching
There are several variants of activation patching, each targeting different computational structures:
Definition: Patching Variants
- Node patching: Replace the output of a single attention head or MLP layer
- Edge patching: Replace the connection between two specific components
- Subgraph patching: Replace all activations within a defined subgraph
- Causal trace: Systematically patch each component to build a causal graph
The choice of patching granularity depends on the hypothesis being tested. Node patching identifies important components; edge patching identifies important information pathways.
Circuit Discovery Process
For indirect object identification ("When Mary and John went to the store, John gave a drink to __", where the correct completion is "Mary"), we:
- Corrupt the subject tokens ("John" → random tokens)
- Patch each attention head's output
- Find that the OV circuits of heads L9H6 and L10H9 are critical
- These heads copy the indirect-object name ("Mary") to the final position, where the prediction is made
Sparse Autoencoders
Large language models may encode concepts in superposition, so that individual neurons are polysemantic: a single neuron responds to several unrelated concepts. Sparse autoencoders decompose activations into more interpretable features:
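Sparse Autoencoder Decomposition
$$\hat{h} = W_d\,\mathrm{ReLU}\left(W_e h + b_e\right) + b_d$$
Here,
- $W_e$ = Encoder weights ($d \to k$, $k \gg d$)
- $W_d$ = Decoder weights ($k \to d$)
- $b_e, b_d$ = Bias vectors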
The sparsity constraint forces the autoencoder to learn a decomposed representation where each feature corresponds to a meaningful concept.
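A minimal sketch of the forward pass and training objective of a sparse autoencoder. It assumes a ReLU encoder and an L1 penalty as the sparsity constraint, which is a common choice but not one the text specifies. The weights are random and untrained, and the function names are hypothetical.

```python
import numpy as np

def sae_forward(h, W_e, b_e, W_d, b_d):
    """Encode activations into k non-negative features, then reconstruct them."""
    f = np.maximum(0.0, h @ W_e + b_e)   # feature activations, shape (n, k)
    h_hat = f @ W_d + b_d                # reconstruction, shape (n, d)
    return f, h_hat

def sae_loss(h, f, h_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features toward zero."""
    reconstruction = np.mean(np.sum((h - h_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return float(reconstruction + l1_coeff * sparsity)

rng = np.random.default_rng(0)
d, k = 8, 32                             # overcomplete dictionary: k > d
h = rng.normal(size=(64, d))             # toy stand-in for cached model activations
W_e = rng.normal(size=(d, k)) * 0.1
W_d = rng.normal(size=(k, d)) * 0.1
b_e = np.zeros(k)
b_d = np.zeros(d)

f, h_hat = sae_forward(h, W_e, b_e, W_d, b_d)
loss = sae_loss(h, f, h_hat)
print(f"features: {f.shape}, reconstruction: {h_hat.shape}, loss: {loss:.3f}")
```

Training would minimize this loss over the encoder and decoder parameters. The coefficient `l1_coeff` trades reconstruction quality against how few features are active per input.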
Anthropic's work on Claude has identified thousands of interpretable features using sparse autoencoders, including features for code bugs, legal concepts, and safety-relevant topics like deception.
Practice Exercises
- Conceptual: Explain the difference between interpretability by design (e.g., attention heads) and interpretability by analysis (e.g., probing). Why is causal intervention necessary?
- Mathematical: If a linear probe achieves 95% accuracy on a binary classification task from layer 12 representations, and the random baseline is 50%, compute the probe's effective signal. Does this prove the model uses this information?
- Practical: Using a small GPT-2 model, implement activation patching to identify which attention heads are responsible for in-context copying of a specific token.
- Research: Compare the information-theoretic capacity of a superposed representation vs. an orthogonal one. How does the number of features that can be stored scale with dimensionality?
Key Takeaways:
- Interpretability operates at the neuron, feature, circuit, representation, and behavior levels
- Linear probes detect linearly decodable information; non-linear probes detect more complex encodings
- Causal probing (activation patching) establishes that the model actually uses a representation
- Circuit discovery identifies minimal subgraphs sufficient for specific behaviors
- Sparse autoencoders decompose superposed representations into interpretable features