LLM Evaluation

Knowledge: MMLU, ARC, HellaSwag, TriviaQA
Reasoning: GSM8K, MATH, BIG-Bench Hard (BBH)
Coding: HumanEval, MBPP, SWE-Bench, LiveCodeBench

Measurement is the first step that leads to control and eventually to improvement.

LLM Benchmarking Suites

Comprehensive benchmarks are the standard way to compare LLM capabilities, from knowledge (MMLU) to reasoning (GSM8K) to coding (HumanEval). Understanding what each benchmark measures, and where it falls short, is essential for meaningful model comparison.

Definition: LLM Benchmark

An LLM benchmark is a standardized evaluation protocol consisting of a dataset of test examples, a task definition, and a metric. A benchmark is valid if it (1) correlates with human judgments of model quality, (2) differentiates between models of varying capability, and (3) is resistant to contamination and gaming.
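
The three components in this definition map naturally onto a small data structure. The following is a minimal sketch rather than a reference implementation: `Example`, `Benchmark`, `exact_match`, `evaluate`, and the toy model are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    question: str
    answer: str


@dataclass(frozen=True)
class Benchmark:
    name: str
    examples: Sequence[Example]             # dataset of test examples
    build_prompt: Callable[[Example], str]  # task definition
    metric: Callable[[str, str], float]     # metric: (prediction, answer) -> score


def exact_match(prediction: str, answer: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == answer.strip().lower())


def evaluate(benchmark: Benchmark, model: Callable[[str], str]) -> float:
    """Average the metric over every example in the benchmark."""
    if not benchmark.examples:
        raise ValueError("benchmark has no examples")
    scores = [
        benchmark.metric(model(benchmark.build_prompt(ex)), ex.answer)
        for ex in benchmark.examples
    ]
    return sum(scores) / len(scores)


toy = Benchmark(
    name="toy-capitals",
    examples=[
        Example("Capital of France?", "Paris"),
        Example("Capital of Japan?", "Tokyo"),
    ],
    build_prompt=lambda ex: f"Answer in one word. {ex.question}",
    metric=exact_match,
)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model: always gives the same answer."""
    return "Paris"


print(evaluate(toy, toy_model))  # 0.5
```

Keeping the prompt builder and the metric as explicit fields makes it visible that a score depends on all three components, not on the dataset alone.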

Knowledge and Understanding Benchmarks

MMLU (Massive Multitask Language Understanding)

The most widely cited benchmark for broad knowledge:

MMLU Accuracy

$$\text{Acc}_{\text{MMLU}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{|\{(x, y) \in D_t : \hat{y} = y\}|}{|D_t|}$$

Here,

$\mathcal{T}$ = set of 57 subject areas
$D_t$ = test examples for subject $t$
$\hat{y}$ = model prediction for input $x$
$y$ = ground truth label
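
The formula averages per-subject accuracies, so every subject counts equally regardless of how many questions it has. A minimal sketch of that computation, assuming predictions are available as `(subject, prediction, label)` tuples (a hypothetical record format, as is the function name):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mmlu_accuracy(records: Iterable[Tuple[str, str, str]]) -> Tuple[float, Dict[str, float]]:
    """Return (mean of per-subject accuracies, per-subject accuracy)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subject, prediction, label in records:
        total[subject] += 1
        correct[subject] += int(prediction == label)
    if not total:
        raise ValueError("no records given")
    per_subject = {s: correct[s] / total[s] for s in total}
    macro = sum(per_subject.values()) / len(per_subject)
    return macro, per_subject


records = [
    ("physics", "A", "A"),
    ("physics", "B", "C"),  # wrong
    ("law", "D", "D"),
    ("law", "A", "A"),
    ("law", "B", "B"),
    ("law", "C", "C"),
]
macro, per_subject = mmlu_accuracy(records)
print(per_subject)  # {'physics': 0.5, 'law': 1.0}
print(macro)        # 0.75
```

Pooling all six questions instead would give 5/6, about 0.83, which is why it matters whether a reported score is averaged per subject or per question.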

Subject Category	Examples	Difficulty
STEM	Physics, Math, CS, Engineering	High
Humanities	History, Philosophy, Law	Medium
Social Sciences	Economics, Psychology, Politics	Medium
Other	Business, Health, Miscellaneous	Variable

Reported scores (approximate; prompting setups differ between vendors, so the numbers are not strictly comparable): GPT-4 ~86%, Claude 3.5 Sonnet ~88%, Gemini Ultra ~90%.

MMLU has known issues: (1) many questions have ambiguous or incorrect answers, (2) multiple-choice format limits reasoning assessment, (3) contamination in training data is widespread. MMLU-Pro (2024) addresses some of these issues with harder questions and a 10-choice format.

HellaSwag

Tests commonsense reasoning through sentence completion:

Definition: HellaSwag Task

Given a context sentence, select the most plausible continuation from four options. The adversarial filtering procedure ensures that distractors are plausible to models but not to humans, creating a challenging benchmark for commonsense understanding.
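
The definition does not say how a model "selects" an ending. One common approach is to score each candidate by the log-probability the model assigns to it given the context, optionally normalized by length so that longer endings are not penalized. The sketch below assumes a hypothetical `logprob(context, ending)` callable; a real setup would compute it from the model's token log-probabilities.

```python
from typing import Callable, Sequence


def pick_ending(
    context: str,
    endings: Sequence[str],
    logprob: Callable[[str, str], float],
    length_normalize: bool = True,
) -> int:
    """Return the index of the highest-scoring ending."""
    if not endings:
        raise ValueError("no endings given")

    def score(ending: str) -> float:
        lp = logprob(context, ending)
        # Dividing by character length gives an average per-character score.
        return lp / max(len(ending), 1) if length_normalize else lp

    return max(range(len(endings)), key=lambda i: score(endings[i]))


# Toy scorer with made-up log-probabilities, for illustration only.
_FAKE_SCORES = {
    "sits.": -2.0,
    "sits down and waits for a treat.": -6.0,
}


def fake_logprob(context: str, ending: str) -> float:
    return _FAKE_SCORES[ending]


endings = list(_FAKE_SCORES)
print(pick_ending("The dog", endings, fake_logprob, length_normalize=False))  # 0
print(pick_ending("The dog", endings, fake_logprob))                          # 1
```

The two calls disagree, which illustrates why the scoring rule should be reported alongside a multiple-choice accuracy.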

ARC (AI2 Reasoning Challenge)

Science exam questions requiring multi-step reasoning:

Subset	Source	Difficulty	Human Accuracy
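ARC-Easy	Grade 3-9 science exam questions not in the Challenge set	Easy	~95%
ARC-Challenge	Grade 3-9 science exam questions that simple retrieval and word co-occurrence baselines answered incorrectly	Hard	~80%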

Reasoning Benchmarks

GSM8K (Grade School Math 8K)

A widely used benchmark for multi-step mathematical reasoning:

GSM8K Multi-Step Reasoning

$$P(y \mid x) = \prod_{i=1}^{n} P(s_i \mid x, s_1, \ldots, s_{i-1})$$

Here,

$x$ = math word problem
$y$ = final numerical answer, read off the last step $s_n$
$s_i$ = $i$-th reasoning step
$n$ = number of steps required

GSM8K requires 2-8 step reasoning with grade school math concepts (arithmetic, fractions, basic algebra).
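
Scoring GSM8K usually reduces to comparing the final number in the model's output with the reference answer, which in the public dataset follows a `####` marker at the end of each solution. A minimal sketch under that assumption; the helper names are hypothetical and the extraction rule is deliberately simple (it takes the last number in the output).

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Optional minus sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def to_decimal(text: str) -> Optional[Decimal]:
    """Parse a number, ignoring commas and dollar signs; None if unparseable."""
    try:
        return Decimal(text.replace(",", "").replace("$", "").strip())
    except InvalidOperation:
        return None


def reference_answer(solution: str) -> Optional[Decimal]:
    """Take whatever follows the last '####' marker in the reference solution."""
    return to_decimal(solution.split("####")[-1])


def predicted_answer(model_output: str) -> Optional[Decimal]:
    """Take the last number that appears in the model's output."""
    numbers = _NUMBER.findall(model_output)
    return to_decimal(numbers[-1]) if numbers else None


def is_correct(model_output: str, solution: str) -> bool:
    gold = reference_answer(solution)
    pred = predicted_answer(model_output)
    return gold is not None and pred is not None and pred == gold


solution = "She spends 3 dollars, so 18 - 3 = 15.\n#### 15"
print(is_correct("18 minus 3 leaves 15 dollars.", solution))  # True
print(is_correct("I cannot tell.", solution))                 # False
```

Only the final answer is checked here, so a correct number reached through faulty steps still counts as correct.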

Reported scores (approximate, vendor-published, with differing prompting setups): GPT-4 ~92%, Gemini Ultra ~94%, Claude 3.5 Sonnet ~96%.

MATH Benchmark

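Competition-level mathematics. Each problem is labeled with a difficulty level from 1 to 5 and, independently, with one of seven subjects (Prealgebra, Algebra, Number Theory, Counting & Probability, Geometry, Intermediate Algebra, Precalculus):

Difficulty	Test Problems	Current Best
Level 1	437	~100%
Level 2	894	~95%
Level 3	1,131	~85%
Level 4	1,214	~75%
Level 5	1,324	~65%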

BIG-Bench Hard (BBH)

A curated subset of 23 tasks from BIG-Bench that were challenging for language models:

Logical reasoning: Boolean expressions, causal judgment
Symbolic reasoning: Navigate, object counting
Commonsense: Sports understanding, ruin names
Language understanding: Salient translation error detection, disambiguation QA

BBH is particularly useful because it focuses on tasks where chain-of-thought prompting significantly improves performance, making it a good test of reasoning capabilities rather than knowledge recall.

Coding Benchmarks

HumanEval (Pass@k)

The standard for code generation:

Pass@k Metric

$$\text{Pass@}k = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]$$

Here,

$n$ = total samples generated per problem
$c$ = correct samples (pass all test cases)
$k$ = number of samples considered

Pass@1 is the probability that a single sampled solution is correct. Pass@10 is the probability that at least one of 10 sampled solutions is correct.
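
The binomial ratio in the formula can be computed directly with exact integer arithmetic. The sketch below assumes per-problem results are given as `(n, c)` pairs; the function names are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of which are correct."""
    if not (0 <= c <= n):
        raise ValueError("need 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over all problems."""
    values = [pass_at_k(n, c, k) for n, c in results]
    if not values:
        raise ValueError("no problems given")
    return sum(values) / len(values)


# One problem with 10 samples, 2 of them correct:
print(pass_at_k(10, 2, 1))            # 0.2
print(round(pass_at_k(10, 2, 5), 4))  # 0.7778
```

For the second call, $\binom{8}{5} / \binom{10}{5} = 56 / 252 = 2/9$, so the estimate is $7/9$. With $k = 1$ the formula reduces to $c/n$.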

Reported Pass@1 (approximate, vendor-published): GPT-4 ~67% in its original technical report, Gemini Ultra ~74%, Claude 3.5 Sonnet ~92%.

MBPP (Mostly Basic Python Programming)

974 crowdsourced Python programming problems:

Simpler than HumanEval but broader coverage
Tests basic programming concepts
Less sensitive to prompt engineering

SWE-Bench

Real-world software engineering tasks from GitHub issues:

Subset	Problems	Current Best	Difficulty
SWE-Bench Full	2,294	~25%	Very Hard
SWE-Bench Verified	500	~35%	Hard
SWE-Bench Lite	300	~40%	Medium

SWE-Bench is among the most realistic coding benchmarks and also among the hardest. Models must understand complex codebases, navigate dependencies, and generate patches that pass existing test suites. At the figures listed above, the best systems resolve roughly 25-40% of issues, depending on the subset.
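
In the public SWE-bench dataset, each task lists tests that the fix should turn from failing to passing (`FAIL_TO_PASS`) and tests that must keep passing (`PASS_TO_PASS`). Assuming test results are available as a mapping from test name to pass/fail, the resolution check can be sketched as follows. The function name and result format are hypothetical, and a real harness also has to build the environment and apply the patch.

```python
from typing import Iterable, Mapping


def is_resolved(
    results: Mapping[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True if every required test passed after the patch was applied.

    A test missing from `results` is treated as failed.
    """
    required = [*fail_to_pass, *pass_to_pass]
    return all(results.get(name, False) for name in required)


# Made-up test names and outcomes, for illustration only.
results = {"test_fix": True, "test_old_a": True, "test_old_b": False}

print(is_resolved(results, ["test_fix"], ["test_old_a"]))                # True
print(is_resolved(results, ["test_fix"], ["test_old_a", "test_old_b"]))  # False
```

The second call fails because the patch fixes the issue but breaks a previously passing test, which is the kind of regression this style of check is meant to catch.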

Comprehensive Evaluation Frameworks

Framework	Benchmarks Included	Key Feature
Open LLM Leaderboard	MMLU, ARC, HellaSwag, etc.	Community-driven, transparent
LMSYS Chatbot Arena	Human preference votes	Real-world usage comparison
HELM	42+ scenarios	Holistic, multi-metric
BigCode Evaluation	HumanEval, MBPP	Code-specific
OpenCompass	100+ datasets	Comprehensive coverage

No single benchmark captures model quality. Use a benchmark portfolio: MMLU for knowledge, GSM8K for reasoning, HumanEval for coding, and human evaluation for open-ended tasks. Track trends across multiple benchmarks rather than optimizing for any single one.

Practice Exercises

Conceptual: Explain why MMLU alone is insufficient for evaluating an LLM's capabilities. What important aspects of model quality does it miss?
Mathematical: Compute Pass@1 and Pass@10 for a model that generates 20 samples for a problem, with 3 passing all test cases. Show your work.
Practical: Using the Open LLM Leaderboard, compare the performance of three different model sizes (7B, 13B, 70B) across MMLU, GSM8K, and HumanEval. What patterns emerge?
Research: Design a benchmark that tests an LLM's ability to learn new concepts from minimal examples. What would the evaluation protocol look like?

Key Takeaways:

MMLU tests broad knowledge but has issues with ambiguity and contamination
GSM8K and MATH test mathematical reasoning; BBH tests general reasoning
HumanEval and SWE-Bench test coding; Pass@k is the standard metric
No single benchmark is sufficient; use a portfolio of benchmarks
Human evaluation (Chatbot Arena) remains the gold standard for open-ended tasks

What to Learn Next

-> Automated LLM Evaluation: Using models to evaluate models at scale.
-> LLM Evaluation Frameworks: Comprehensive evaluation methodologies.
-> Human Evaluation of LLMs: The gold standard for assessing model quality.
-> Hallucination Detection: Evaluating factual accuracy.
-> Bias and Fairness: Evaluating fairness across demographic groups.
-> Scaling Laws and Chinchilla: Understanding how performance scales with resources.
