Advanced Topics

LLM Interpretability

Understanding what happens inside the black box—mechanistic interpretability, probing, and activation patching to reverse-engineer language model cognition.

Mechanistic — Reverse-engineering circuits and algorithms
Probing — Training classifiers on internal representations
Patching — Causal interventions on model activations

If you can't explain it simply, you don't understand it well enough.

Definition: Interpretability

Interpretability in the context of LLMs is the degree to which a human can understand the cause of a model's decision. A model is interpretable if its internal representations, attention patterns, and computational circuits can be mapped to human-understandable concepts and algorithms.

Levels of Interpretability

Interpretability operates at multiple levels of abstraction:

Level	Question	Methods
Neuron	What does this neuron detect?	Activation visualization, maximally activating examples
Feature	What features does the model use?	Sparse autoencoders, dictionary learning
Circuit	What algorithm does this subnetwork implement?	Activation patching, causal intervention
Representation	What concepts are encoded?	Probing, geometric analysis
Behavior	Why does the model produce this output?	Attention visualization, Chain-of-Thought analysis

Mechanistic interpretability is fundamentally a reverse-engineering problem: given a trained neural network that solves a task, figure out what algorithm it learned.

Probing Methods

Linear Probing

The simplest probe: a linear classifier trained on internal representations:

Linear Probe

$$\hat{y} = \sigma(W \cdot h_l^{(i)} + b)$$

Here,

$h_l^{(i)}$ = Hidden representation at layer $l$ for input $i$
$W$ = Learnable weight matrix
$b$ = Bias vector
$\sigma$ = Output nonlinearity (sigmoid for binary labels, softmax for multi-class)

A probe that achieves high accuracy on held-out data suggests the information is linearly decodable from the representation, which is evidence that the model encodes it. It is not, by itself, evidence that the model uses it.
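As a minimal sketch of the linear probe above, the following trains a logistic-regression probe by stochastic gradient descent using only the Python standard library. The "hidden states" here are synthetic vectors in which the label shifts one coordinate; `make_synthetic_states`, `train_linear_probe`, and the hyperparameters are illustrative assumptions, not part of any library. In practice the synthetic data would be replaced by activations extracted from a real model layer.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim, seed=0):
    """Hypothetical stand-in for layer activations: the label shifts coordinate 0."""
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 3.0 if y == 1 else -3.0
        states.append(h)
        labels.append(y)
    return states, labels


def train_linear_probe(states, labels, lr=0.1, epochs=50):
    """Fit y_hat = sigmoid(W . h + b) by SGD on the log loss."""
    w, b = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for h, y in zip(states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of examples where the thresholded probe output matches the label."""
    correct = 0
    for h, y in zip(states, labels):
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


train_h, train_y = make_synthetic_states(200, dim=8, seed=0)
test_h, test_y = make_synthetic_states(100, dim=8, seed=1)
w, b = train_linear_probe(train_h, train_y)
accuracy = probe_accuracy(w, b, test_h, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Evaluating on a separate held-out set matters here: a probe that only fits its training data says little about what the representation encodes.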

Non-Linear Probing

When linear probes fail, more expressive probes (MLPs) can detect non-linear representations:

Non-Linear Probe

$$\hat{y} = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot h_l^{(i)} + b_1) + b_2)$$

Here,

$W_1, W_2$ = Weight matrices of the probe's hidden and output layers
$b_1, b_2$ = Bias vectors
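To illustrate why a non-linear probe can succeed where a linear one cannot, the sketch below evaluates the MLP probe formula on a toy case in which the property of interest is the XOR of two coordinates. XOR is not linearly separable, so no linear probe classifies all four points correctly. The weights are hand-set for illustration (an assumption, not learned values), and the two-dimensional "representations" are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hand-set (not learned) probe weights, chosen for illustration only.
W1 = [[1.0, 1.0], [1.0, 1.0]]  # both hidden units read h0 + h1
B1 = [0.0, -1.0]               # the second unit only fires when both inputs are on
W2 = [10.0, -20.0]             # output layer: reward "at least one", penalize "both"
B2 = -5.0


def mlp_probe(h):
    """y_hat = sigmoid(W2 . ReLU(W1 . h + b1) + b2)."""
    hidden = [relu(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(W1, B1)]
    return sigmoid(sum(w * a for w, a in zip(W2, hidden)) + B2)


# Toy "representations": the label is the XOR of the two coordinates.
examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
predictions = [round(mlp_probe(h)) for h, _ in examples]
print(predictions)  # [0, 1, 1, 0]
```

The extra expressiveness cuts both ways: a probe this flexible can compute properties that the model itself never computes.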

A high-accuracy non-linear probe does not necessarily mean the model uses that information for its task. The probe may be extracting information the model encodes but does not use. This is the "probe-reads-more-than-model-uses" problem.

What Probing Reveals

Research has revealed that LLMs encode rich internal representations:

Layer Level	What Is Encoded	Example
Early layers	Token syntax, part-of-speech	Nouns vs verbs
Middle layers	Syntactic structure, semantics	Dependency relations
Late layers	Task-specific features, facts	Entity properties, relations

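Interpretability research, including Anthropic's work on Claude, suggests that many concepts are represented approximately linearly, as directions in activation space. Under this view, a fact such as "The Eiffel Tower is in Paris" corresponds to a combination of concept directions that can, in some cases, be read out or edited with vector arithmetic.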

Causal Probing

To establish that the model actually uses a representation, we must intervene causally:

Causal Effect of Representation

$$\text{CE}(h_l) = P(y \mid \text{do}(h_l = h_l^{\text{target}})) - P(y \mid \text{do}(h_l = h_l^{\text{baseline}}))$$

Here,

$h_l^{\text{target}}$ = Representation with target feature present
$h_l^{\text{baseline}}$ = Representation with target feature ablated
$y$ = Model output
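The causal effect formula can be sketched on a toy example. Here the "rest of the model" is a hypothetical logistic readout of the layer state, and the baseline state is built by projecting out an assumed unit-length feature direction. `downstream_prob`, `ablate_direction`, and all vectors are illustrative assumptions. The two readouts contrast a model that reads the feature with one that ignores it.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def downstream_prob(h, readout):
    """Hypothetical rest-of-model: maps a layer-l state to P(y | do(h_l = h))."""
    return sigmoid(dot(readout, h))


def ablate_direction(h, direction):
    """Remove the component of h along a unit-length feature direction."""
    coeff = dot(h, direction)
    return [x - coeff * d for x, d in zip(h, direction)]


feature_dir = [1.0, 0.0, 0.0]   # assumed unit direction carrying the feature
h_target = [2.0, 0.5, -1.0]     # state with the feature present
h_baseline = ablate_direction(h_target, feature_dir)

used_readout = [1.5, 0.2, 0.1]    # downstream weights that read the feature
unused_readout = [0.0, 0.2, 0.1]  # downstream weights that ignore it

ce_used = downstream_prob(h_target, used_readout) - downstream_prob(h_baseline, used_readout)
ce_unused = downstream_prob(h_target, unused_readout) - downstream_prob(h_baseline, unused_readout)
print(f"CE when the feature is read:    {ce_used:.3f}")
print(f"CE when the feature is ignored: {ce_unused:.3f}")
```

In both cases a probe could decode the feature from `h_target`, but only the first readout shows a non-zero causal effect, which is the distinction between encoding and use.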

Mechanistic Interpretability

Circuit Discovery

The core idea: identify a minimal subgraph of the model's computational graph that is sufficient to perform a specific task.

Definition: Circuit

A circuit is a minimal set of attention heads, MLP neurons, and their connections that is causally necessary and sufficient for a specific model behavior. Formally, a circuit C is a subgraph of the full model G such that G \ C produces incorrect outputs on the target task, while C alone produces correct outputs.

Activation Patching

Activation patching (closely related to causal tracing) is the primary tool for identifying circuits:

Activation Patching

$$\text{Effect}(l) = P(y^* \mid x_{\text{corrupted}}, \text{do}(h_l = h_l^{\text{clean}}))$$

Here,

$y^*$ = Correct output
$x_{\text{corrupted}}$ = Corrupted input (on which the unpatched model answers incorrectly)
$h_l^{\text{clean}}$ = Hidden state at layer $l$ recorded on the clean input $x_{\text{clean}}$ and patched into the corrupted run
$\text{do}(\cdot)$ = Intervention that overwrites the activation during the forward pass

The procedure:

1. Run the model on a clean input (produces correct output)
2. Run on a corrupted input (produces incorrect output)
3. For each component, replace its corrupted activation with the clean one
4. Measure how much the output recovers toward correct
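The procedure above can be sketched on a toy model. This is a deliberately simplified stand-in, not a transformer: three hypothetical "heads" are each a linear function of the input, and their outputs are summed with fixed weights into the logit of the correct answer. All names and weights are assumptions for illustration. The recovery score is normalized so that 0 means no improvement over the corrupted run and 1 means full recovery of the clean probability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy model: each "head" is a linear function of the input.
HEAD_WEIGHTS = {
    "head_a": [2.0, 0.0],
    "head_b": [0.0, 1.0],
    "head_c": [0.1, 0.0],
}
# Fixed weights that combine head outputs into the logit for the correct answer.
OUTPUT_WEIGHTS = {"head_a": 1.5, "head_b": 1.0, "head_c": 1.0}


def head_activations(x):
    return {name: sum(w * xi for w, xi in zip(ws, x)) for name, ws in HEAD_WEIGHTS.items()}


def prob_correct(activations):
    return sigmoid(sum(OUTPUT_WEIGHTS[name] * act for name, act in activations.items()))


def patch_each_head(x_clean, x_corrupted):
    """Patch one clean head activation at a time into the corrupted run."""
    clean_acts = head_activations(x_clean)        # step 1: clean run
    corrupt_acts = head_activations(x_corrupted)  # step 2: corrupted run
    p_clean = prob_correct(clean_acts)
    p_corrupt = prob_correct(corrupt_acts)
    recovery = {}
    for name in HEAD_WEIGHTS:
        patched = dict(corrupt_acts)
        patched[name] = clean_acts[name]          # step 3: restore one component
        # step 4: fraction of the clean-vs-corrupted gap that is recovered
        recovery[name] = (prob_correct(patched) - p_corrupt) / (p_clean - p_corrupt)
    return recovery


x_clean = [1.0, 0.5]
x_corrupted = [-1.0, 0.5]  # only the first coordinate is corrupted
recovery = patch_each_head(x_clean, x_corrupted)
for name, score in recovery.items():
    print(f"{name}: {score:.3f}")
```

In a real transformer, components feed into one another, so patching is usually done with forward hooks on a cached clean run rather than on independent functions as in this sketch.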

Types of Patching

There are several variants of activation patching, each targeting different computational structures:

Definition: Patching Variants

Node patching: Replace the output of a single attention head or MLP layer
Edge patching: Replace the activation sent along one connection between two specific components (often called path patching)
Subgraph patching: Replace all activations within a defined subgraph
Causal trace: Systematically patch each component to build a causal graph

The choice of patching granularity depends on the hypothesis being tested. Node patching identifies important components; edge patching identifies important information pathways.

Circuit Discovery Process

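For indirect object identification in GPT-2 small (for example, "When Mary and John went to the store, John gave a drink to __", where the correct completion is "Mary"), the process looks like this:

1. Corrupt the name tokens (for example, "John" → a random name)
2. Patch each attention head's output with its value from the clean run
3. Identify the heads whose restoration recovers the correct answer; the published IOI analysis of GPT-2 small reports "name mover" heads such as L9H6, L9H9, and L10H0
4. Inspect the OV circuits of those heads, which copy the indirect object's name to the final position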

Sparse Autoencoders

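Large language models may encode concepts in superposition: they represent more features than they have neurons, so individual neurons become polysemantic (they respond to several unrelated concepts). Sparse autoencoders decompose activations into more interpretable features: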

Sparse Autoencoder Decomposition

$$h_l \approx W_d \cdot \text{ReLU}(W_e \cdot h_l + b_e) + b_d$$

Here,

$W_e$ = Encoder weights ($d \to k$, $k \gg d$)
$W_d$ = Decoder weights ($k \to d$)
$b_e, b_d$ = Bias vectors

The sparsity constraint (typically an L1 penalty on the feature activations) encourages the autoencoder to learn a decomposed representation in which each feature ideally corresponds to a single meaningful concept.
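A minimal sketch of the sparse autoencoder forward pass, assuming a training objective of squared reconstruction error plus an L1 penalty on the features (a common choice, not one stated in this article). The weights are hand-set for a tiny case with d = 2 and k = 4 rather than learned, and `sae_forward`, `sae_loss`, and `l1_coeff` are illustrative names.

```python
def relu(z):
    return max(0.0, z)


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def sae_forward(h, W_e, b_e, W_d, b_d):
    """Return (features, reconstruction) for W_d . ReLU(W_e . h + b_e) + b_d."""
    features = [relu(z + b) for z, b in zip(matvec(W_e, h), b_e)]
    recon = [z + b for z, b in zip(matvec(W_d, features), b_d)]
    return features, recon


def sae_loss(h, features, recon, l1_coeff=0.01):
    """Squared reconstruction error plus an L1 sparsity penalty (assumed objective)."""
    mse = sum((a - b) ** 2 for a, b in zip(h, recon))
    return mse + l1_coeff * sum(abs(f) for f in features)


# Hand-set weights for d = 2, k = 4: one feature each for +x, -x, +y, -y.
W_e = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # k x d
b_e = [0.0, 0.0, 0.0, 0.0]
W_d = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]      # d x k
b_d = [0.0, 0.0]

h = [0.5, -2.0]
features, recon = sae_forward(h, W_e, b_e, W_d, b_d)
loss = sae_loss(h, features, recon)
print(features)  # [0.5, 0.0, 0.0, 2.0]
print(recon)     # [0.5, -2.0]
```

Only two of the four features are active for this input, and the reconstruction is exact, so the loss consists entirely of the sparsity term. During training, `l1_coeff` trades reconstruction quality against how few features fire.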

Anthropic's work on Claude has identified thousands of interpretable features using sparse autoencoders, including features for code bugs, legal concepts, and safety-relevant topics like deception.

Practice Exercises

Conceptual: Explain the difference between interpretability by design (e.g., attention heads) and interpretability by analysis (e.g., probing). Why is causal intervention necessary?
Mathematical: If a linear probe achieves 95% accuracy on a binary classification task from layer 12 representations, and the random baseline is 50%, compute the probe's effective signal. Does this prove the model uses this information?
Practical: Using a small GPT-2 model, implement activation patching to identify which attention heads are responsible for in-context copying of a specific token.
Research: Compare the information-theoretic capacity of a superposed representation vs. an orthogonal one. How does the number of features that can be stored scale with dimensionality?

Key Takeaways:

Interpretability operates at neuron, feature, circuit, representation, and behavior levels
Linear probes detect linearly decodable information; non-linear probes detect more complex encodings
Causal probing (activation patching) establishes that the model actually uses a representation
Circuit discovery identifies minimal subgraphs sufficient for specific behaviors
Sparse autoencoders decompose superposed representations into interpretable features

What to Learn Next

-> LLM Architecture Deep Dive: Understanding transformers, attention, and internal computations.

-> Mixture of Experts: Sparse architectures and conditional computation in LLMs.

-> LLM Watermarking: Statistical watermarks and detection of AI-generated text.

-> Hallucination Detection: Detecting factual errors through internal model analysis.

-> Environmental Impact of LLMs: Energy costs and sustainable AI practices.

-> Future of LLMs: Trends, predictions, and emerging capabilities.
