LLM Benchmarking Suites
Comprehensive benchmarks are the standard for comparing LLM capabilities, from knowledge (MMLU) to reasoning (GSM8K) to coding (HumanEval).
- Knowledge: MMLU, ARC, HellaSwag, TriviaQA
- Reasoning: GSM8K, MATH, BIG-Bench Hard (BBH)
- Coding: HumanEval, MBPP, SWE-Bench, LiveCodeBench
Measurement is the first step that leads to control and eventually to improvement.
Understanding what each benchmark measures and its limitations is essential for meaningful model comparison.
Definition: LLM Benchmark
An LLM benchmark is a standardized evaluation protocol consisting of a dataset of test examples, a task definition, and a metric. A benchmark is valid if it (1) correlates with human judgments of model quality, (2) differentiates between models of varying capability, and (3) is resistant to contamination and gaming.
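Criterion (2) can be checked roughly before reading much into a leaderboard gap. The sketch below is one simple way to do that, assuming each test example is an independent pass/fail trial (a binomial approximation). The function names and the 1.96 threshold are illustrative choices, not part of any benchmark's official protocol.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def gap_is_distinguishable(acc_a: float, acc_b: float, n_examples: int,
                           z: float = 1.96) -> bool:
    """Rough check: is the score gap larger than z standard errors?

    Assumption: both models are scored on n_examples items and their scores
    are treated as independent. A paired test on per-item results would be
    more appropriate when those are available.
    """
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n_examples) ** 2
        + accuracy_standard_error(acc_b, n_examples) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


# A 2-point gap on a 200-item test set is within sampling noise.
small_gap = gap_is_distinguishable(0.86, 0.88, n_examples=200)
# A 30-point gap on the same test set is not.
large_gap = gap_is_distinguishable(0.60, 0.90, n_examples=200)
```

Small test sets make small gaps hard to interpret, which is one reason a benchmark can fail to differentiate models that are close in capability.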
Knowledge and Understanding Benchmarks
MMLU (Massive Multitask Language Understanding)
The most widely cited benchmark for broad knowledge:
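MMLU Accuracy
$$\text{Acc}_{\text{MMLU}} = \frac{1}{|T|} \sum_{t \in T} \frac{1}{|D_t|} \sum_{(x, y) \in D_t} \mathbb{1}\left[\hat{y}(x) = y\right]$$
Here,
- $T$ = Set of 57 subject areas
- $D_t$ = Test examples for subject $t$
- $\hat{y}$ = Model prediction
- $y$ = Ground truth label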
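The accuracy computation can be sketched in a few lines. This version assumes a macro-average: accuracy is computed within each subject first, then averaged across subjects so that large subjects do not dominate. Some reports instead pool all questions into one average, so it is worth checking which convention a published score uses. The record layout and the function name `macro_accuracy` are illustrative.

```python
from collections import defaultdict


def macro_accuracy(records):
    """Average per-subject accuracy over subjects.

    records: iterable of (subject, predicted_label, true_label) tuples.
    Returns (overall_score, per_subject_scores).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, truth in records:
        total[subject] += 1
        correct[subject] += int(predicted == truth)
    if not total:
        raise ValueError("no records to score")
    per_subject = {s: correct[s] / total[s] for s in total}
    return sum(per_subject.values()) / len(per_subject), per_subject


# Toy data: two physics questions (one right) and one history question (right).
records = [
    ("physics", "A", "A"),
    ("physics", "B", "C"),
    ("history", "D", "D"),
]
overall, by_subject = macro_accuracy(records)
```

On the toy data, pooling all three questions would give 2/3, while the macro-average gives 0.75, which shows how the two conventions can diverge.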
| Subject Category | Examples | Difficulty |
|---|---|---|
| STEM | Physics, Math, CS, Engineering | High |
| Humanities | History, Philosophy, Law | Medium |
| Social Sciences | Economics, Psychology, Politics | Medium |
| Other | Business, Health, Miscellaneous | Variable |
Reported scores (approximate; they depend on model version and prompting setup, so they are not strictly comparable): GPT-4 ~86%, Claude 3.5 Sonnet ~88%, Gemini Ultra ~90%.
MMLU has known issues: (1) many questions have ambiguous or incorrect answers, (2) multiple-choice format limits reasoning assessment, (3) contamination in training data is widespread. MMLU-Pro (2024) addresses some of these issues with harder questions and a 10-choice format.
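One common heuristic for spotting contamination is to look for long word n-grams shared between a test example and training text. The sketch below illustrates that idea only; the window size of 8 words, the whitespace tokenization, and the function names are assumptions, and real pipelines typically normalize text more carefully and index the corpus rather than scanning it.

```python
def word_ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_example: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the test example's word n-grams that also occur in corpus_text."""
    test_grams = word_ngrams(test_example, n)
    if not test_grams:
        return 0.0
    corpus_grams = word_ngrams(corpus_text, n)
    return len(test_grams & corpus_grams) / len(test_grams)


# Made-up strings for illustration.
corpus = "trivia: what is the capital of france ? paris"
leaked = overlap_fraction("what is the capital of france", corpus, n=3)
clean = overlap_fraction("completely unrelated text here now", corpus, n=3)
```

A high overlap fraction is a signal to investigate, not proof of contamination, and paraphrased leaks will not be caught by exact n-gram matching.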
HellaSwag
Tests commonsense reasoning through sentence completion:
Definition: HellaSwag Task
Given a context sentence, select the most plausible continuation from four options. The adversarial filtering procedure ensures that distractors are plausible to models but not to humans, creating a challenging benchmark for commonsense understanding.
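A typical way to score this kind of multiple-choice task is to ask the model how likely each continuation is and pick the highest-scoring one. The sketch below assumes a hypothetical `score(context, option)` callable that returns the model's total log-likelihood of the option; here it is replaced by a lookup table of made-up numbers so the example runs without a model.

```python
from typing import Callable, Sequence


def pick_continuation(context: str, options: Sequence[str],
                      score: Callable[[str, str], float],
                      length_normalize: bool = True) -> int:
    """Return the index of the highest-scoring continuation."""
    def adjusted(option: str) -> float:
        raw = score(context, option)
        if length_normalize:
            # Divide by word count so longer options are not penalized
            # simply for containing more tokens.
            return raw / max(len(option.split()), 1)
        return raw

    return max(range(len(options)), key=lambda i: adjusted(options[i]))


# Made-up log-likelihoods standing in for a real model (hypothetical values).
fake_log_likelihood = {
    "waits for the water to boil": -9.0,
    "throws it": -5.0,
    "sings to the stove loudly": -15.0,
    "the kettle flies away": -14.0,
}


def lookup_score(context: str, option: str) -> float:
    return fake_log_likelihood[option]


context = "She puts the kettle on the stove and"
options = list(fake_log_likelihood)
normalized_choice = pick_continuation(context, options, lookup_score)
raw_choice = pick_continuation(context, options, lookup_score,
                               length_normalize=False)
```

With these numbers, the raw score favors the short option "throws it", while the length-normalized score favors "waits for the water to boil". Which normalization an evaluation harness uses can change the reported accuracy.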
ARC (AI2 Reasoning Challenge)
Science exam questions requiring multi-step reasoning:
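| Subset | Source | Difficulty | Human Accuracy |
|---|---|---|---|
| ARC-Easy | Grade-school science exam questions that at least one simple baseline (retrieval or word co-occurrence) answers correctly | Easy | ~95% |
| ARC-Challenge | Grade-school science exam questions that both simple baselines answer incorrectly | Hard | ~80% |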
Reasoning Benchmarks
GSM8K (Grade School Math 8K)
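A widely used benchmark for multi-step mathematical reasoning:
GSM8K Multi-Step Reasoning
$$q \rightarrow s_1 \rightarrow s_2 \rightarrow \cdots \rightarrow s_n \rightarrow a$$
Here,
- $q$ = Math word problem
- $a$ = Final numerical answer
- $s_i$ = i-th reasoning step
- $n$ = Number of steps required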
GSM8K requires 2-8 step reasoning with grade school math concepts (arithmetic, fractions, basic algebra).
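Scoring usually compares only the final number, not the intermediate steps. GSM8K reference solutions end with a line of the form `#### <answer>`. The sketch below relies on that marker and, as a heuristic assumption, treats the last number in the model's output as its final answer. The example problem is made up, and the helper names are illustrative.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(text: str) -> str:
    """Strip whitespace, thousands separators, and dollar signs."""
    return text.strip().replace(",", "").replace("$", "")


def reference_answer(solution: str) -> str:
    """Take the text after the final '####' marker in a GSM8K-style solution."""
    return normalize(solution.split("####")[-1])


def predicted_answer(model_output: str):
    """Heuristic: treat the last number in the output as the final answer."""
    matches = NUMBER.findall(model_output)
    return normalize(matches[-1]) if matches else None


def is_correct(model_output: str, solution: str) -> bool:
    predicted = predicted_answer(model_output)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))


# Made-up problem in the GSM8K solution format.
solution = "She keeps 16 - 7 = 9 eggs.\nShe earns 9 * 2 = 18 dollars.\n#### 18"
output = "She has 9 eggs left, so she earns 9 * 2 = 18 dollars. The answer is 18."
matched = is_correct(output, solution)
```

Answer extraction is a frequent source of score differences between evaluation setups, since a correct answer in an unexpected format can be marked wrong.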
Reported scores (approximate; they vary with model version and prompting setup): GPT-4 ~95%, Claude 3.5 Sonnet ~97%, Gemini Ultra ~96%.
MATH Benchmark
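Competition-level mathematics problems, each tagged with a subject and a difficulty level from 1 (easiest) to 5 (hardest). The table pairs each level with one representative topic; in the dataset itself, subjects and levels are independent labels:
| Difficulty | Representative Topic | Examples | Current Best |
|---|---|---|---|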
| Level 1 | Prealgebra | 750 | ~100% |
| Level 2 | Algebra | 750 | ~95% |
| Level 3 | Number Theory | 750 | ~85% |
| Level 4 | Counting & Probability | 750 | ~75% |
| Level 5 | Geometry & Pre-calculus | 750 | ~65% |
BIG-Bench Hard (BBH)
A curated subset of 23 BIG-Bench tasks on which earlier language models failed to outperform the average human rater:
- Logical reasoning: Boolean expressions, causal judgment
- Symbolic reasoning: Navigate, object counting
- Commonsense: Sports understanding, ruin names
- Multilingual: Salient translation error detection
BBH is particularly useful because it focuses on tasks where chain-of-thought prompting significantly improves performance, making it a good test of reasoning capabilities rather than knowledge recall.
Coding Benchmarks
HumanEval (Pass@k)
The standard for code generation:
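Pass@k Metric
$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
Here,
- $n$ = Total samples generated per problem
- $c$ = Correct samples (pass all test cases)
- $k$ = Number of samples considered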
Pass@1 estimates the probability that a single generated sample is correct. Pass@10 estimates the probability that at least one of 10 samples is correct.
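The metric can be computed per problem with the combinatorial estimator commonly used with HumanEval: given n generated samples of which c are correct, it is one minus the probability that a random draw of k samples contains no correct one. The sketch below is a minimal version; the function names and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that pass all tests,
    k: number of samples considered (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # which makes the estimate exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# One problem with 10 samples, 2 of them correct.
p1 = pass_at_k(n=10, c=2, k=1)   # 1 - 8/10
p5 = pass_at_k(n=10, c=2, k=5)   # 1 - C(8,5)/C(10,5) = 1 - 56/252
```

Generating more samples than k and applying this estimator gives a lower-variance result than literally drawing k samples once per problem.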
Reported Pass@1 scores (approximate; they vary with model version and prompting setup): GPT-4 ~86%, Claude 3.5 Sonnet ~88%, Gemini Ultra ~84%.
MBPP (Mostly Basic Python Programming)
A set of 974 crowdsourced Python programming problems:
- Simpler than HumanEval but broader coverage
- Tests basic programming concepts
- Less sensitive to prompt engineering
SWE-Bench
Real-world software engineering tasks from GitHub issues:
| Subset | Problems | Current Best | Difficulty |
|---|---|---|---|
| SWE-Bench Full | 2,294 | ~25% | Very Hard |
| SWE-Bench Verified | 500 | ~35% | Hard |
| SWE-Bench Lite | 300 | ~40% | Medium |
SWE-Bench is the most realistic coding benchmark listed here, and also the hardest. Models must understand complex codebases, navigate dependencies, and generate patches that pass existing test suites. As the table shows, the best reported systems resolve only about 25-40% of issues, depending on the subset.
Comprehensive Evaluation Frameworks
| Framework | Benchmarks Included | Key Feature |
|---|---|---|
| Open LLM Leaderboard | MMLU, ARC, HellaSwag, etc. | Community-driven, transparent |
| LMSYS Chatbot Arena | Human preference votes | Real-world usage comparison |
| HELM | 42+ scenarios | Holistic, multi-metric |
| BigCode Evaluation | HumanEval, MBPP | Code-specific |
| OpenCompass | 100+ datasets | Comprehensive coverage |
No single benchmark captures model quality. Use a benchmark portfolio: MMLU for knowledge, GSM8K for reasoning, HumanEval for coding, and human evaluation for open-ended tasks. Track trends across multiple benchmarks rather than optimizing for any single one.
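A portfolio view can be as simple as keeping per-benchmark scores side by side instead of collapsing them into one number. The sketch below reports which model leads on each benchmark; the model names and scores are made up for illustration, and `per_benchmark_leaders` is a hypothetical helper.

```python
def per_benchmark_leaders(scores):
    """Map each benchmark to the model with the highest score on it.

    scores: {model_name: {benchmark_name: score}}
    """
    benchmarks = set()
    for results in scores.values():
        benchmarks.update(results)
    leaders = {}
    for name in sorted(benchmarks):
        # Only compare models that actually report this benchmark.
        candidates = {m: r[name] for m, r in scores.items() if name in r}
        leaders[name] = max(candidates, key=candidates.get)
    return leaders


# Made-up scores, not real results.
scores = {
    "model_a": {"MMLU": 0.70, "GSM8K": 0.55, "HumanEval": 0.40},
    "model_b": {"MMLU": 0.66, "GSM8K": 0.72, "HumanEval": 0.38},
}
leaders = per_benchmark_leaders(scores)
```

Here neither model dominates: one leads on knowledge and coding, the other on reasoning, which a single averaged score would hide.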
Practice Exercises
- Conceptual: Explain why MMLU alone is insufficient for evaluating an LLM's capabilities. What important aspects of model quality does it miss?
- Mathematical: Compute Pass@1 and Pass@10 for a model that generates 20 samples for a problem, with 3 passing all test cases. Show your work.
- Practical: Using the Open LLM Leaderboard, compare the performance of three different model sizes (7B, 13B, 70B) across MMLU, GSM8K, and HumanEval. What patterns emerge?
- Research: Design a benchmark that tests an LLM's ability to learn new concepts from minimal examples. What would the evaluation protocol look like?
Key Takeaways:
- MMLU tests broad knowledge but has issues with ambiguity and contamination
- GSM8K and MATH test mathematical reasoning; BBH tests general reasoning
- HumanEval and SWE-Bench test coding; Pass@k is the standard metric
- No single benchmark is sufficient; use a portfolio of benchmarks
- Human evaluation (Chatbot Arena) remains the gold standard for open-ended tasks
What to Learn Next
- Automated LLM Evaluation: Using models to evaluate models at scale.
- LLM Evaluation Frameworks: Comprehensive evaluation methodologies.
- Human Evaluation of LLMs: The gold standard for assessing model quality.
- Hallucination Detection: Evaluating factual accuracy.
- Bias and Fairness: Evaluating fairness across demographic groups.
- Scaling Laws and Chinchilla: Understanding how performance scales with resources.