LLM Evaluation Benchmarks
Evaluating large language models is one of the most challenging problems in AI. Unlike traditional ML tasks with clear metrics, LLMs are general-purpose systems whose capabilities span reasoning, creativity, knowledge, and more. This tutorial covers the major benchmarks and evaluation methodologies used to assess LLM performance.
Why Evaluation is Hard
LLMs exhibit emergent capabilities that are difficult to measure with simple metrics:
- Open-ended generation has no single correct answer
- Reasoning chains require evaluating intermediate steps
- Safety requires testing for harms that may not appear in standard benchmarks
- Alignment involves subjective qualities like helpfulness and honesty, which resist objective scoring
Perplexity
Perplexity is the most fundamental metric for language models, measuring how well the model predicts the next token.
Perplexity (PPL) is the exponentiated average negative log-likelihood of a sequence, measuring how "surprised" the model is by the test data. Lower perplexity indicates better predictive performance.
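Perplexity
$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$
Here,
- $X$ = the test sequence of tokens $x_1, \dots, x_N$
- $N$ = number of tokens in the sequence
- $x_i$ = the $i$-th token, with $x_{<i}$ denoting the tokens before it
- $p_\theta(x_i \mid x_{<i})$ = probability the model assigns to token $x_i$ given the preceding tokens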
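Cross-Entropy Loss
$$H = -\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})$$
Here,
- $N$ = number of tokens in the sequence
- $p_\theta(x_i \mid x_{<i})$ = probability the model assigns to token $x_i$ given the preceding tokens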
The relationship between perplexity and cross-entropy:
Perplexity is simply the exponentiated cross-entropy: PPL = exp(H). A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 possibilities.
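As a quick numerical check of PPL = exp(H), the sketch below computes both quantities from a list of per-token probabilities. The function names and the probability values are illustrative assumptions, not output from a real model.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponentiated cross-entropy."""
    return math.exp(cross_entropy(token_probs))

# A model that assigns probability 0.1 to every observed token is as
# uncertain as a uniform choice among 10 options, so PPL is about 10.
uniform_probs = [0.1] * 5
print(perplexity(uniform_probs))

# Made-up probabilities: a single poorly predicted token (0.05) pulls the
# average log-likelihood down and raises perplexity noticeably.
mixed_probs = [0.9, 0.8, 0.05, 0.7]
print(perplexity(mixed_probs))
```

Because the average is taken in log space, perplexity is the inverse geometric mean of the token probabilities, which is why one very unlikely token has an outsized effect.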
Computing Perplexity
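import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model_name: str, text: str, stride: int = 512, max_length: int = 2048) -> float:
    # max_length is the window size; it is assumed not to exceed the model's context length
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    # Tokenize the full text without truncation; the sliding window handles long inputs
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    nll_sum = 0.0
    n_tokens = 0
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin_loc:end_loc]
        target_ids = input_ids.clone()
        # Mask tokens already scored; they serve only as context here
        target_ids[:, :-trg_len] = -100
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
        # outputs.loss is the mean NLL over scored targets (labels are shifted by one internally)
        num_scored = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.item() * num_scored
        n_tokens += num_scored
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    # Weight each window by its number of scored tokens before exponentiating
    return math.exp(nll_sum / n_tokens)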
Perplexity is useful for comparing models of similar size on the same test set, provided they share a tokenizer, since per-token values are not comparable across vocabularies. It is not a reliable indicator of downstream task performance: a model with lower perplexity may still perform worse on reasoning tasks.
MMLU (Massive Multitask Language Understanding)
MMLU measures knowledge across 57 subjects spanning STEM, humanities, social sciences, and more.
Benchmark Structure
| Category | Subjects | Examples |
|---|---|---|
| STEM | Physics, Math, CS | 14,042 questions |
| Humanities | History, Philosophy, Law | 11,039 questions |
| Social Sciences | Economics, Psychology | 8,302 questions |
| Other | Misc, Professional | 7,530 questions |
Evaluation Protocol
MMLU uses 5-shot evaluation with multiple-choice questions:
import torch

def format_mmlu_prompt(question, options, examples=None):
    prompt = "Answer the following multiple-choice question.\n\n"
    # Prepend the solved few-shot examples, if any
    if examples:
        for ex in examples:
            prompt += f"Question: {ex['question']}\n"
            for i, opt in enumerate(ex['options']):
                prompt += f"({chr(65+i)}) {opt}\n"
            prompt += f"Answer: {ex['answer']}\n\n"
    prompt += f"Question: {question}\n"
    for i, opt in enumerate(options):
        prompt += f"({chr(65+i)}) {opt}\n"
    prompt += "Answer:"
    return prompt

def evaluate_mmlu(model, tokenizer, dataset, k=5):
    correct = 0
    total = 0
    for question in dataset:
        # Each item is assumed to carry its own pool of few-shot examples
        examples = question['few_shot_examples'][:k]
        prompt = format_mmlu_prompt(
            question['question'],
            question['options'],
            examples
        )
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Greedy-decode one token, expected to be the answer letter
            outputs = model.generate(**inputs, max_new_tokens=1, do_sample=False)
        # Strip whitespace: many tokenizers emit " A" rather than "A"
        predicted = tokenizer.decode(outputs[0][-1:]).strip()
        if predicted == question['answer']:
            correct += 1
        total += 1
    return correct / total if total else 0.0
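Because MMLU spans many subjects of different sizes, a single accuracy number can hide weak areas. The sketch below aggregates per-subject accuracy and reports both a micro average (over questions) and a macro average (over subjects). The record fields `subject`, `predicted`, and `answer` are assumptions made for illustration, not part of any official data format.

```python
from collections import defaultdict

def accuracy_by_subject(records):
    """Return per-subject accuracy plus micro and macro averages."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        correct[r["subject"]] += int(r["predicted"] == r["answer"])
    per_subject = {s: correct[s] / total[s] for s in total}
    # Micro: every question counts equally
    micro = sum(correct.values()) / sum(total.values())
    # Macro: every subject counts equally, regardless of its size
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro

# Toy records with hypothetical predictions
records = [
    {"subject": "physics", "predicted": "A", "answer": "A"},
    {"subject": "physics", "predicted": "B", "answer": "B"},
    {"subject": "physics", "predicted": "C", "answer": "D"},
    {"subject": "law", "predicted": "A", "answer": "C"},
]
per_subject, micro, macro = accuracy_by_subject(records)
print(per_subject, micro, macro)
```

The two averages diverge when subjects differ in size, so it helps to state which one a reported score uses.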
HumanEval (Code Generation)
HumanEval evaluates a model's ability to generate correct Python functions from docstrings.
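Pass@k (Code Generation)
$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - (1 - p)^k\right]$$
Here,
- $k$ = number of samples drawn per problem
- $p$ = probability that a single sample passes all unit tests for the problem
- $\mathbb{E}_{\text{problems}}$ = average over the benchmark problems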
The unbiased estimator for Pass@k:
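Unbiased Pass@k Estimator
$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
Here,
- $n$ = total number of samples generated per problem ($n \ge k$)
- $c$ = number of those samples that pass all unit tests
- $k$ = number of samples considered in the metric
- $\binom{a}{b}$ = binomial coefficient, the number of ways to choose $b$ items from $a$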
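import math
import torch

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, of which c pass."""
    # Fewer than k failing samples: every size-k subset contains a pass
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

def evaluate_humaneval(model, tokenizer, problems, n_samples=200, k_values=(1, 10, 100)):
    results = {}
    for problem in problems:
        prompt = f"def {problem['function_name']}({problem['signature']}):\n    \"\"\"{problem['docstring']}\"\"\"\n"
        # Tokenize once per problem rather than once per sample
        inputs = tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        samples = []
        for _ in range(n_samples):
            with torch.no_grad():
                output = model.generate(
                    **inputs,
                    max_new_tokens=512,
                    temperature=0.8,
                    do_sample=True
                )
            completion = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
            # Keep the signature and docstring so each sample is a complete function
            samples.append(prompt + completion)
        # run_test_cases is assumed to execute a sample against the tests in a sandbox
        c = sum(1 for s in samples if run_test_cases(s, problem['test_cases']))
        results[problem['task_id']] = {
            k: pass_at_k(n_samples, c, k) for k in k_values
        }
    return results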
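The product form used in `pass_at_k` above is a numerically stable rewrite of the combinatorial expression 1 - C(n-c, k) / C(n, k). The sketch below checks that the two agree on a small, made-up case; the second function name and the values of `n` and `c` are chosen only for illustration.

```python
import math

def pass_at_k_product(n, c, k):
    """Product form: avoids computing large binomial coefficients."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / (n - i) for i in range(c))

def pass_at_k_binomial(n, c, k):
    """Direct form: 1 minus the chance that all k drawn samples fail."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical problem: 20 samples generated, 3 of them pass the tests
n, c = 20, 3
for k in (1, 5, 10):
    print(k, pass_at_k_product(n, c, k), pass_at_k_binomial(n, c, k))
```

For k = 1 both forms reduce to c / n, the plain fraction of passing samples, which is a convenient sanity check.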
GSM8K (Math Reasoning)
GSM8K tests grade-school math reasoning with multi-step word problems.
Evaluation Metrics
| Metric | Description | Calculation |
|---|---|---|
| Exact Match | Final answer matches exactly | Binary per problem |
| Execution Accuracy | Code execution produces correct answer | Binary per problem |
| Reasoning Score | Intermediate steps are valid | 0-1 per problem |
import re
import torch

def extract_answer(response: str) -> str:
    """Extract final numerical answer from response."""
    # Look for #### pattern used in GSM8K
    match = re.search(r'####\s*(.+)', response)
    if match:
        return match.group(1).strip()
    # Fallback: last number in response
    numbers = re.findall(r'-?\d+\.?\d*', response)
    return numbers[-1] if numbers else ""

def evaluate_gsm8k(model, tokenizer, dataset):
    correct = 0
    total = 0
    for problem in dataset:
        prompt = f"Question: {problem['question']}\n\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            # Greedy decoding; do_sample=False replaces temperature=0.0,
            # which is not a valid sampling temperature
            output = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False
            )
        response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        predicted = extract_answer(response)
        ground_truth = extract_answer(problem['answer'])
        # Exact string match on the extracted answers
        if predicted == ground_truth:
            correct += 1
        total += 1
    return correct / total if total else 0.0
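Exact string comparison in `evaluate_gsm8k` treats "18" and "18.0", or "1,000" and "1000", as different answers. One possible remedy is to normalize both sides before comparing. The helper below is a minimal sketch of that idea; the name `normalize_number` and its specific cleaning rules (dropping commas, dollar signs, and a trailing period) are assumptions, not part of the GSM8K specification.

```python
from decimal import Decimal, InvalidOperation

def normalize_number(text):
    """Return a canonical string for a numeric answer, or None if not numeric."""
    # Remove thousands separators, currency signs, and a sentence-ending period
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    # Whole numbers: drop any fractional zeros so "18.00" matches "18"
    if value == value.to_integral_value():
        return str(int(value))
    # Otherwise strip trailing zeros, e.g. "3.50" becomes "3.5"
    return str(value.normalize())

for raw in ["1,000", "18.00", "$3.50", "42.", "no answer"]:
    print(repr(raw), "->", normalize_number(raw))
```

Comparing `normalize_number(predicted) == normalize_number(ground_truth)` would then accept formatting variants while still rejecting different values; answers that fail to parse return `None` and need separate handling.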
MT-Bench and Chatbot Arena
MT-Bench
MT-Bench evaluates multi-turn conversation quality using GPT-4 as a judge:
MT_BENCH_CATEGORIES = [
    "writing", "roleplay", "reasoning", "math",
    "coding", "extraction", "stem", "humanities"
]
MT_BENCH_JUDGE_PROMPT = """[System]
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Rate on a scale of 1 to 10.
[Question]
{question}
[Assistant Response]
{response}
[Rating]
Provide your rating on a scale of 1 to 10. Explain your rating briefly."""
Chatbot Arena (Elo Rating)
Chatbot Arena uses blind pairwise comparison with human voting to compute Elo ratings:
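Elo Rating Update
$$R_A' = R_A + K\,(S_A - E_A), \qquad E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$
Here,
- $R_A'$ = updated rating of model A
- $R_A$, $R_B$ = ratings of models A and B before the comparison
- $K$ = step size controlling how far a single comparison moves the rating
- $S_A$ = actual outcome for A (1 for a win, 0.5 for a tie, 0 for a loss)
- $E_A$ = expected score of A given the rating gap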
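To make the update concrete, here is a minimal sketch of the standard Elo rule, R_A' = R_A + K (S_A - E_A) with E_A = 1 / (1 + 10^((R_B - R_A) / 400)). The function names, the starting ratings, and the choice K = 32 are assumptions for illustration and are not taken from Chatbot Arena's actual configuration.

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    # B's outcome and expectation are the complements of A's
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models; A wins the vote
print(elo_update(1000, 1000, 1.0))  # (1016.0, 984.0)

# An upset: the lower-rated model wins and gains more than 16 points
print(elo_update(1000, 1200, 1.0))
```

The total rating is conserved by each update, and an unexpected win moves ratings further than an expected one.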
LLM-as-Judge Evaluation
Using GPT-4 or other strong models as evaluation judges:
def llm_judge(
    question: str,
    response: str,
    criteria: dict,
    judge_model,
    judge_tokenizer
) -> dict:
    # criteria maps each criterion name to a short description
    judge_prompt = f"""You are an expert evaluator. Rate the following response on these criteria:
{chr(10).join(f'- {k}: {v}' for k, v in criteria.items())}
Question: {question}
Response: {response}
Provide ratings (1-10) for each criterion and an overall score.
Format your response as JSON."""
    inputs = judge_tokenizer(judge_prompt, return_tensors="pt")
    with torch.no_grad():
        output = judge_model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt
    judgment = judge_tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # parse_judgment is a helper you must supply to turn the judge's text into a dict
    return parse_judgment(judgment)
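Strong LLM judges such as GPT-4 have been shown to achieve >80% agreement with human evaluators on tasks like response quality assessment, comparable to the agreement between human raters. However, LLM judges have known biases: they can favor a response based on its position in a pairwise prompt, prefer longer answers, and rate their own outputs or those of similar models more highly.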
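The `parse_judgment` helper called in `llm_judge` is not defined in this tutorial. Judges often wrap their JSON in extra prose, so a tolerant parser is useful. The version below is one hypothetical way to write it, assuming the judge emits a single JSON object somewhere in its reply and that an empty dict is an acceptable failure signal.

```python
import json
import re

def parse_judgment(judgment: str) -> dict:
    """Extract a JSON object from judge output; return {} if none parses."""
    # Greedy match from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", judgment, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # Only accept a JSON object, not a list or scalar
    return parsed if isinstance(parsed, dict) else {}

reply = 'Here is my evaluation: {"helpfulness": 8, "accuracy": 7, "overall": 7} Hope this helps.'
print(parse_judgment(reply))
print(parse_judgment("I cannot rate this response."))
```

Returning an empty dict rather than raising lets a batch evaluation continue and count unparseable judgments separately, which is itself a useful signal about judge reliability.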
Evaluation Pitfalls and Limitations
Common Pitfalls
- Data Contamination: Test data may appear in training data
- Benchmark Gaming: Models can overfit to specific benchmark formats
- Metric Sensitivity: Small changes in evaluation protocol can change rankings
- Missing Capabilities: Benchmarks may not capture important real-world skills
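Data contamination can be screened for, at least roughly, by looking for long word n-grams shared between test items and training text. The sketch below illustrates the idea on toy strings; the function names, the whitespace tokenization, and the n-gram length are assumptions, and real checks run against far larger corpora with more careful text normalization.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_examples, training_corpus, n=8):
    """Fraction of test examples sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in training_corpus:
        corpus_ngrams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & corpus_ngrams)
    return flagged / len(test_examples) if test_examples else 0.0

# Toy data; n=3 is used only because the strings are short
corpus = ["The quick brown fox jumps over the lazy dog"]
tests = [
    "A quick brown fox appeared in the garden",
    "Completely unrelated sentence about evaluation",
]
print(contamination_rate(tests, corpus, n=3))  # 0.5
```

A nonzero rate does not prove the model memorized the answer, but flagged items are worth reporting separately from the clean subset.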
Evaluation Anti-Patterns
# BAD: Only evaluating on one benchmark
score = evaluate_gsm8k(model, tokenizer, test_set)

# GOOD: Comprehensive evaluation across capability dimensions
# evaluate_safety and evaluate_helpfulness are placeholders for your own harnesses;
# the latter would wrap llm_judge over a set of prompts and responses
evaluation_suite = {
    "perplexity": compute_perplexity(model_name, held_out_text),
    "mmlu": evaluate_mmlu(model, tokenizer, mmlu_test),
    "humaneval": evaluate_humaneval(model, tokenizer, humaneval),
    "gsm8k": evaluate_gsm8k(model, tokenizer, gsm8k_test),
    "safety": evaluate_safety(model, tokenizer, safety_test),
    "helpfulness": evaluate_helpfulness(model, tokenizer, helpfulness_test),
}
Limitations of Automatic Metrics
| Metric | Strengths | Weaknesses |
|---|---|---|
| Perplexity | Fast, consistent | Doesn't measure reasoning |
| Exact Match | Clear, objective | Misses partial credit |
| Pass@k | Checks functional correctness by running tests | Expensive: needs many samples and sandboxed execution |
| LLM Judge | Flexible, nuanced | Expensive, potential bias |
Always evaluate LLMs on multiple benchmarks across different capability dimensions. No single benchmark captures all aspects of model quality. Combine automatic metrics with human evaluation for critical applications.
Building an Evaluation Pipeline
class LLMEvaluator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}

    def evaluate(self, benchmarks: dict):
        # benchmarks maps a name to an (eval_fn, dataset) pair;
        # each eval_fn is called as eval_fn(model, tokenizer, dataset)
        for name, (eval_fn, dataset) in benchmarks.items():
            print(f"Evaluating {name}...")
            self.results[name] = eval_fn(self.model, self.tokenizer, dataset)
        return self.results

    def compare_with_baseline(self, baseline_results: dict) -> dict:
        # Assumes scalar scores; only benchmarks present in both runs are compared
        comparison = {}
        for benchmark in self.results:
            if benchmark in baseline_results:
                comparison[benchmark] = {
                    "current": self.results[benchmark],
                    "baseline": baseline_results[benchmark],
                    "delta": self.results[benchmark] - baseline_results[benchmark]
                }
        return comparison

    def generate_report(self) -> str:
        report = "# LLM Evaluation Report\n\n"
        for benchmark, score in self.results.items():
            report += f"## {benchmark}\n"
            # Some benchmarks return a dict (e.g. per-task pass@k) rather than one number
            if isinstance(score, (int, float)):
                report += f"- Score: {score:.4f}\n\n"
            else:
                report += f"- Score: {score}\n\n"
        return report
Summary
- Perplexity measures next-token prediction quality: PPL = exp(H)
- MMLU evaluates knowledge across 57 academic subjects
- HumanEval measures code generation capability via Pass@k
- GSM8K tests multi-step mathematical reasoning
- MT-Bench scores multi-turn conversations with an LLM judge; Chatbot Arena ranks models by blind human pairwise comparison with Elo ratings
- LLM-as-judge provides scalable evaluation with >80% human agreement
- Always use multiple benchmarks and combine automatic with human evaluation
- Watch for data contamination and benchmark gaming
Practice Exercises
- Perplexity Comparison: Compute perplexity for 3 different models on the WikiText-2 dataset. How does perplexity correlate with model size?
- Benchmark Implementation: Implement a 5-shot MMLU evaluation for a small language model. Report accuracy by subject category.
- Code Generation: Evaluate a model on HumanEval with k=1, k=10, and k=100. How does Pass@k improve with more samples?
- LLM-as-Judge: Use GPT-4 to evaluate 50 responses from different models on the MT-Bench dataset. Compare the rankings with your own judgments.
- Evaluation Pipeline: Build a complete evaluation pipeline that runs 4 different benchmarks and generates a comparison report.