LLM Interpretability


Understanding what happens inside the black box—mechanistic interpretability, probing, and activation patching to reverse-engineer language model cognition.

  • Mechanistic — Reverse-engineering circuits and algorithms
  • Probing — Training classifiers on internal representations
  • Patching — Causal interventions on model activations

If you can't explain it simply, you don't understand it well enough.

Definition: Interpretability

Interpretability in the context of LLMs is the degree to which a human can understand the cause of a model's decision. A model is interpretable if its internal representations, attention patterns, and computational circuits can be mapped to human-understandable concepts and algorithms.

Levels of Interpretability

Interpretability operates at multiple levels of abstraction:

| Level | Question | Methods |
|---|---|---|
| Neuron | What does this neuron detect? | Activation visualization, maximally activating examples |
| Feature | What features does the model use? | Sparse autoencoders, dictionary learning |
| Circuit | What algorithm does this subnetwork implement? | Activation patching, causal intervention |
| Representation | What concepts are encoded? | Probing, geometric analysis |
| Behavior | Why does the model produce this output? | Attention visualization, Chain-of-Thought analysis |

Mechanistic interpretability is fundamentally a reverse-engineering problem: given a trained neural network that solves a task, figure out what algorithm it learned.

Probing Methods

Linear Probing

The simplest probe: a linear classifier trained on internal representations:

Linear Probe

$$\hat{y} = \sigma(W \cdot h_l^{(i)} + b)$$

Here,

  • $h_l^{(i)}$ = Hidden representation at layer l for input i
  • $W$ = Learnable weight matrix
  • $b$ = Bias vector
  • $\sigma$ = Activation function

A probe that achieves high accuracy suggests the information is linearly decodable from the representation—meaning the model has learned to encode it.
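As a rough illustration of the linear probe above, the following sketch trains a logistic-regression probe on synthetic vectors that stand in for hidden states. Everything here is an assumption made for the example: the data generator `make_hidden_states`, the choice to write the label along dimension 0, and the hyperparameters. Real probes are trained on activations extracted from a model.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_hidden_states(n, dim, rng):
    # Synthetic stand-in for h_l^{(i)}: the label is written along
    # dimension 0 (a hypothetical feature direction); the rest is noise.
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 2.0 if y == 1 else -2.0
        data.append((h, y))
    return data

def train_linear_probe(data, dim, lr=0.1, epochs=50):
    # Logistic regression trained with plain SGD: y_hat = sigmoid(w . h + b).
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, y in data:
            y_hat = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            err = y_hat - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b

def accuracy(w, b, data):
    hits = sum(
        (sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) > 0.5) == (y == 1)
        for h, y in data
    )
    return hits / len(data)

rng = random.Random(0)
train = make_hidden_states(400, 8, rng)
held_out = make_hidden_states(200, 8, rng)
w, b = train_linear_probe(train, dim=8)
probe_acc = accuracy(w, b, held_out)
print(f"held-out probe accuracy: {probe_acc:.2f}")
print("probe weights:", [round(wi, 2) for wi in w])
```

Because the label was planted along a single direction, the probe should recover it with high held-out accuracy and put most of its weight on dimension 0. With real activations, the same accuracy would show only that the information is decodable, not that the model relies on it.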

Non-Linear Probing

When linear probes fail, more expressive probes (MLPs) can detect non-linear representations:

Non-Linear Probe

$$\hat{y} = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot h_l^{(i)} + b_1) + b_2)$$

Here,

  • $W_1, W_2$ = Weight matrices for hidden layers
  • $b_1, b_2$ = Bias vectors

A high-accuracy non-linear probe does not necessarily mean the model uses that information for its task. The probe may be extracting information the model encodes but does not use. This is the "probe-reads-more-than-model-uses" problem.
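To see why an MLP probe can read information that a linear probe cannot, consider a toy case in which the label is the XOR of two binary features. The sketch below uses hand-picked (not trained) weights `W1`, `B1`, `W2`, `B2` that are assumptions chosen for illustration, and compares the resulting probe with a brute-force search over a coarse grid of linear probes.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

# Toy "hidden states": two binary features; the label is their XOR,
# which no single linear boundary can separate.
DATA = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 0)]

# Hand-picked (not trained) probe weights, chosen so the MLP computes XOR.
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [10.0, -20.0]
B2 = -5.0

def mlp_probe(h):
    # y_hat = sigmoid(W2 . ReLU(W1 . h + b1) + b2)
    hidden = [relu(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(W1, B1)]
    return sigmoid(sum(w * z for w, z in zip(W2, hidden)) + B2)

def best_linear_probe_hits():
    # Brute-force search over a coarse grid of linear probes (w1, w2, b).
    grid = [x / 2 for x in range(-6, 7)]
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        hits = sum((w1 * h[0] + w2 * h[1] + b > 0) == (y == 1) for h, y in DATA)
        best = max(best, hits)
    return best

mlp_hits = sum((mlp_probe(h) > 0.5) == (y == 1) for h, y in DATA)
linear_hits = best_linear_probe_hits()
print(f"MLP probe: {mlp_hits}/4 correct, best linear probe on grid: {linear_hits}/4")
```

The MLP probe classifies all four points, while no linear probe gets more than three. Note that the probe itself computes the XOR here, which is exactly why a successful non-linear probe says little about what the model computes.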

What Probing Reveals

Research has revealed that LLMs encode rich internal representations:

| Layers | What Is Encoded | Example |
|---|---|---|
| Early layers | Token syntax, part-of-speech | Nouns vs verbs |
| Middle layers | Syntactic structure, semantics | Dependency relations |
| Late layers | Task-specific features, facts | Entity properties, relations |

Anthropic's work on Claude suggests that many concepts are represented in a largely linear fashion: on this view, a fact such as "The Eiffel Tower is in Paris" is encoded as a linear combination of concept directions that can be manipulated algebraically.
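What "manipulated algebraically" could mean is easiest to see in a toy vector space. The sketch below is not taken from any real model: the concept directions in `CONCEPTS` are hand-made orthogonal vectors, and `read_location` is a hypothetical readout defined only for this example.

```python
# Hand-made orthogonal "concept directions" in a 4-d toy space (assumed, not learned).
CONCEPTS = {
    "eiffel_tower": [1.0, 0.0, 0.0, 0.0],
    "in_paris":     [0.0, 1.0, 0.0, 0.0],
    "in_rome":      [0.0, 0.0, 1.0, 0.0],
}

def add(u, v, scale=1.0):
    return [a + scale * b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def read_location(h):
    # Read out whichever location direction has the largest projection.
    return max(("in_paris", "in_rome"), key=lambda name: dot(h, CONCEPTS[name]))

# The fact as a linear combination of concept directions.
h = add(CONCEPTS["eiffel_tower"], CONCEPTS["in_paris"])
# Algebraic edit: subtract one direction, add another.
h_edited = add(add(h, CONCEPTS["in_paris"], scale=-1.0), CONCEPTS["in_rome"])

print(read_location(h), "->", read_location(h_edited))
```

In a real model the directions are neither known in advance nor exactly orthogonal, so finding them is the hard part.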

Causal Probing

To establish that the model actually uses a representation, we must intervene causally:

Causal Effect of Representation

$$\text{CE}(h_l) = P(y \mid \text{do}(h_l = h_l^{\text{target}})) - P(y \mid \text{do}(h_l = h_l^{\text{baseline}}))$$

Here,

  • $h_l^{\text{target}}$ = Representation with target feature present
  • $h_l^{\text{baseline}}$ = Representation with target feature ablated
  • $y$ = Model output
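The causal effect defined above can be computed directly in a toy setting. In this sketch the "rest of the model" is a hypothetical readout (`READOUT`) that uses only the first dimension of the hidden state; the second dimension stands for a feature that is present in the representation but ignored downstream. Both are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy model: the output depends only on h[0].
# h[1] carries a feature that a probe could decode, but the readout ignores it.
READOUT = [3.0, 0.0]

def p_y_given_h(h):
    # P(y | do(h_l = h)): run the rest of the model from a fixed hidden state.
    return sigmoid(sum(w * x for w, x in zip(READOUT, h)))

def causal_effect(h_target, h_baseline):
    # CE(h_l) = P(y | do(h_l = target)) - P(y | do(h_l = baseline))
    return p_y_given_h(h_target) - p_y_given_h(h_baseline)

# Intervene on the used feature (dimension 0), holding dimension 1 fixed.
ce_used = causal_effect(h_target=[1.0, 0.5], h_baseline=[0.0, 0.5])
# Intervene on the decodable-but-unused feature (dimension 1).
ce_unused = causal_effect(h_target=[0.5, 1.0], h_baseline=[0.5, 0.0])

print(f"CE for used feature:   {ce_used:.3f}")
print(f"CE for unused feature: {ce_unused:.3f}")
```

The second intervention has zero effect even though a probe could read that feature perfectly. This is the gap between decodability and use that causal probing is meant to close.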

Mechanistic Interpretability

Circuit Discovery

The core idea: identify a minimal subgraph of the model's computational graph that is sufficient to perform a specific task.

Definition: Circuit

A circuit is a minimal set of attention heads, MLP neurons, and their connections that is causally necessary and sufficient for a specific model behavior. Formally, a circuit C is a subgraph of the full model G such that G \ C produces incorrect outputs on the target task, while C alone produces correct outputs.

Activation Patching

Activation patching (also called causal tracing) is the primary tool for identifying circuits:

Activation Patching

$$\text{Effect}(l, h) = P(y^* \mid x_{\text{corrupted}}, \text{patch } h_l^{\text{corrupted}} \rightarrow h_l^{\text{clean}})$$

Here,

  • $y^*$ = Correct output
  • $x_{\text{corrupted}}$ = Corrupted input (produces an incorrect output when run unpatched)
  • $h_l^{\text{corrupted}}$ = Corrupted hidden state at layer l
  • $h_l^{\text{clean}}$ = Clean hidden state (restored), cached from the run on the clean input

The procedure:

  1. Run the model on a clean input (produces correct output)
  2. Run on a corrupted input (produces incorrect output)
  3. For each component, replace its corrupted activation with the clean one
  4. Measure how much the output recovers toward correct
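The four steps can be sketched on a toy model whose components each add a scalar contribution to the logit of the correct answer. The component names (`head_A`, `head_B`, `mlp_C`), their weights, and the input fields are all hypothetical. With a real transformer, the same loop would run over cached attention-head and MLP outputs.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy model: three components each read the input and write
# a scalar contribution; the logit of the correct answer is their sum.
COMPONENTS = {
    "head_A": lambda x: 4.0 * x["subject"],   # reads the token we corrupt
    "head_B": lambda x: 1.0 * x["subject"],   # reads it too, but weakly
    "mlp_C": lambda x: 0.5 * x["context"],    # ignores the corrupted token
}
BIAS = -3.0

def run(x, patch=None):
    # Forward pass. `patch` maps component name -> activation to force.
    patch = patch or {}
    acts = {name: patch.get(name, fn(x)) for name, fn in COMPONENTS.items()}
    prob = sigmoid(sum(acts.values()) + BIAS)
    return prob, acts

clean = {"subject": 1.0, "context": 1.0}
corrupted = {"subject": 0.0, "context": 1.0}

p_clean, clean_acts = run(clean)   # step 1: clean run, cache activations
p_corrupt, _ = run(corrupted)      # step 2: corrupted run

recovery = {}
for name in COMPONENTS:
    # step 3: corrupted run with one component restored to its clean value
    p_patched, _ = run(corrupted, patch={name: clean_acts[name]})
    # step 4: fraction of the clean-vs-corrupted gap that is recovered
    recovery[name] = (p_patched - p_corrupt) / (p_clean - p_corrupt)

for name, r in sorted(recovery.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {r:.2f}")
```

Normalizing by the clean-versus-corrupted gap puts every component on a scale from 0 (no recovery) to 1 (full recovery). Because the sigmoid is non-linear, the per-component recoveries need not sum to exactly 1.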

Types of Patching

There are several variants of activation patching, each targeting different computational structures:

Definition: Patching Variants

  • Node patching: Replace the output of a single attention head or MLP layer
  • Edge patching: Replace the connection between two specific components
  • Subgraph patching: Replace all activations within a defined subgraph
  • Causal trace: Systematically patch each component to build a causal graph

The choice of patching granularity depends on the hypothesis being tested. Node patching identifies important components; edge patching identifies important information pathways.
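The difference between node and edge patching shows up even in a three-component toy graph. In the sketch below, a hypothetical upstream component A feeds two downstream components B and C, and the `forward` function lets each consumer see a different version of A's output. The graph and its weights are assumptions made for illustration.

```python
# Hypothetical toy graph: A feeds two downstream components, B and C,
# and the output logit is B + C.
def forward(x, a_for_b=None, a_for_c=None):
    a = 2.0 * x                                # upstream component A
    a_b = a if a_for_b is None else a_for_b    # A's output as seen by B
    a_c = a if a_for_c is None else a_for_c    # A's output as seen by C
    b = 3.0 * a_b                              # B relies heavily on A
    c = 0.5 * a_c                              # C relies weakly on A
    return b + c

x_clean, x_corrupted = 1.0, 0.0
a_clean = 2.0 * x_clean                        # cached clean activation of A

logit_clean = forward(x_clean)
logit_corrupt = forward(x_corrupted)
gap = logit_clean - logit_corrupt

def recovered(logit):
    # Fraction of the clean-vs-corrupted logit gap that a patch restores.
    return (logit - logit_corrupt) / gap

# Node patching: every consumer of A sees the clean activation.
node_A = recovered(forward(x_corrupted, a_for_b=a_clean, a_for_c=a_clean))
# Edge patching: only one downstream path sees the clean activation.
edge_A_to_B = recovered(forward(x_corrupted, a_for_b=a_clean))
edge_A_to_C = recovered(forward(x_corrupted, a_for_c=a_clean))

print(f"node A:    {node_A:.2f}")
print(f"edge A->B: {edge_A_to_B:.2f}")
print(f"edge A->C: {edge_A_to_C:.2f}")
```

Node patching shows only that A matters. Edge patching shows that nearly all of its influence flows through B.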

Circuit Discovery Process

For indirect object identification ("John went to the store. He gave the book to __"), we:

  1. Corrupt the subject tokens ("John" → random tokens)
  2. Patch each attention head's output
  3. Find that the OV circuits of heads L9H6 and L10H9 are critical
  4. These heads copy the subject name to the correct position

Sparse Autoencoders

Large language models may encode concepts in superposition: individual neurons are polysemantic, responding to several unrelated concepts. Sparse autoencoders decompose activations into interpretable features:

Sparse Autoencoder Decomposition

$$h_l \approx W_d \cdot \text{ReLU}(W_e \cdot h_l + b_e) + b_d$$

Here,

  • $W_e$ = Encoder weights ($d \to k$, $k \gg d$)
  • $W_d$ = Decoder weights ($k \to d$)
  • $b_e, b_d$ = Bias vectors

The sparsity constraint, typically an L1 penalty on the feature activations, encourages the autoencoder to learn a decomposed representation in which each feature ideally corresponds to a single meaningful concept.
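A minimal sketch of the encode/decode pass, assuming a toy case with d = 2 and k = 3: three feature directions are packed into two dimensions, 120 degrees apart. The weights are set by hand rather than learned, and the L1 penalty in `sae_loss` is one common way to express the sparsity constraint, used here as an assumption. A real sparse autoencoder would learn its weights by minimizing such a loss over many activations.

```python
import math

D, K = 2, 3  # activation dimension d and dictionary size k (k > d)

# Three hypothetical feature directions packed into 2 dimensions, 120 degrees apart.
FEATURES = [
    (math.cos(2 * math.pi * i / K), math.sin(2 * math.pi * i / K)) for i in range(K)
]
W_E = FEATURES   # encoder weights, one row per feature (k x d)
B_E = [0.0] * K
W_D = FEATURES   # decoder directions, stored one per feature (transpose of W_d)
B_D = [0.0] * D

def encode(h):
    # f = ReLU(W_e . h + b_e)
    return [max(0.0, sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(W_E, B_E)]

def decode(f):
    # h_hat = W_d . f + b_d
    return [sum(f[i] * W_D[i][j] for i in range(K)) + B_D[j] for j in range(D)]

def sae_loss(h, l1_coeff=0.01):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    f = encode(h)
    h_hat = decode(f)
    mse = sum((a - b) ** 2 for a, b in zip(h, h_hat))
    return mse + l1_coeff * sum(abs(v) for v in f)

# An activation in which only feature 1 is present, with strength 0.8.
h = [0.8 * x for x in FEATURES[1]]
f = encode(h)
h_hat = decode(f)
active = [i for i, v in enumerate(f) if v > 1e-9]
recon_error = max(abs(a - b) for a, b in zip(h, h_hat))

print("active features:", active)
print("reconstruction error:", recon_error)
print("loss:", sae_loss(h))
```

Although the three directions overlap in the 2-d space, the ReLU zeroes out the negative projections onto the other two features. The code is therefore sparse (one active feature) and the reconstruction is essentially exact.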

Anthropic's work on Claude has identified thousands of interpretable features using sparse autoencoders, including features for code bugs, legal concepts, and safety-relevant topics like deception.

Practice Exercises

  1. Conceptual: Explain the difference between interpretability by design (e.g., attention heads) and interpretability by analysis (e.g., probing). Why is causal intervention necessary?

  2. Mathematical: If a linear probe achieves 95% accuracy on a binary classification task from layer 12 representations, and the random baseline is 50%, compute the probe's effective signal. Does this prove the model uses this information?

  3. Practical: Using a small GPT-2 model, implement activation patching to identify which attention heads are responsible for in-context copying of a specific token.

  4. Research: Compare the information-theoretic capacity of a superposed representation vs. an orthogonal one. How does the number of features that can be stored scale with dimensionality?

Key Takeaways:

  • Interpretability operates at neuron, feature, circuit, representation, and behavior levels
  • Linear probes detect linearly decodable information; non-linear probes detect more complex encodings
  • Causal probing (activation patching) establishes that the model actually uses a representation
  • Circuit discovery identifies minimal subgraphs sufficient for specific behaviors
  • Sparse autoencoders decompose superposed representations into interpretable features

What to Learn Next

-> LLM Architecture Deep Dive Understanding transformers, attention, and internal computations.

-> Mixture of Experts Sparse architectures and conditional computation in LLMs.

-> LLM Watermarking Statistical watermarks and detection of AI-generated text.

-> Hallucination Detection Detecting factual errors through internal model analysis.

-> Environmental Impact of LLMs Energy costs and sustainable AI practices.

-> Future of LLMs Trends, predictions, and emerging capabilities.
