Automated LLM Evaluation: Scaling Quality Assessment
Human evaluation is the gold standard but doesn't scale. Automated LLM evaluation uses metrics, LLM-as-judge, and learned scorers to approximate human judgment at scale. This guide covers the major automated evaluation approaches.
- LLM-as-Judge: Using strong models to evaluate weaker ones
- G-Eval: Generalized evaluation with chain-of-thought
- Reference-Based Metrics: BLEU, ROUGE, and learned metrics
- Reference-Free Metrics: Perplexity, Self-BLEU, and embedding-distance measures
The best automated metric is the one that correlates most with human judgment.
Human evaluation is expensive and doesn't scale. Automated evaluation methods provide fast, consistent, and cost-effective alternatives. The challenge is ensuring these automated methods actually correlate with human preferences.
Definition: Automated Evaluation
Automated evaluation uses computational metrics to assess LLM quality without human annotators. The goal is to approximate human judgment while being faster, cheaper, and more reproducible.
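Whether a metric "approximates human judgment" is usually checked with a rank correlation between metric scores and human ratings on the same outputs. The following is a minimal sketch of Spearman correlation in plain Python; the function names `average_ranks`, `pearson`, and `spearman` are illustrative and not taken from any library.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("Both score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 means the metric orders outputs the way human raters do; a value near 0 means the metric carries little information about human preference.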
LLM-as-Judge
Overview
Definition: LLM-as-Judge
LLM-as-Judge uses a strong language model (typically GPT-4) to evaluate the outputs of other models. It provides scalable evaluation that correlates well with human judgments for many tasks.
Implementation
```python
import re


class LLMJudge:
    """LLM-as-Judge evaluation system."""

    def __init__(self, judge_model, eval_model=None):
        # judge_model is assumed to expose generate(prompt) -> str.
        # eval_model (the model under evaluation) is optional and only stored.
        self.judge_model = judge_model
        self.eval_model = eval_model

    def evaluate_single(self, prompt, response):
        """Evaluate a single response."""
        judge_prompt = f"""Please evaluate the following response on a scale of 1-5.
Prompt: {prompt}
Response: {response}
Evaluation criteria:
1. Relevance: Does the response address the prompt?
2. Accuracy: Is the information correct?
3. Completeness: Does it cover all important aspects?
4. Clarity: Is it well-written and easy to understand?
Provide your evaluation as:
Score: [1-5]
Reasoning: [brief explanation]
"""
        evaluation = self.judge_model.generate(judge_prompt)
        return self._parse_evaluation(evaluation)

    def evaluate_pairwise(self, prompt, response_a, response_b):
        """Compare two responses side by side."""
        judge_prompt = f"""Please compare these two responses to the prompt.
Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Which response is better? Consider:
- Relevance to the prompt
- Accuracy of information
- Completeness
- Clarity of explanation
Answer with:
Winner: [A/B/Tie]
Reasoning: [brief explanation]
"""
        comparison = self.judge_model.generate(judge_prompt)
        return self._parse_comparison(comparison)

    def _parse_evaluation(self, evaluation):
        """Parse judge evaluation into structured output."""
        score = None
        reasoning = ""
        for line in evaluation.strip().split("\n"):
            line = line.strip()
            if line.startswith("Score:"):
                # Tolerate outputs such as "Score: 4/5" or "Score: [4]".
                match = re.search(r"\b[1-5]\b", line.split(":", 1)[1])
                score = int(match.group()) if match else None
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()
        return {"score": score, "reasoning": reasoning}

    def _parse_comparison(self, comparison):
        """Parse pairwise comparison."""
        winner = None
        reasoning = ""
        for line in comparison.strip().split("\n"):
            line = line.strip()
            if line.startswith("Winner:"):
                # Strip brackets so "[A]" and "A" parse the same way.
                winner = line.split(":", 1)[1].strip().strip("[]")
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()
        return {"winner": winner, "reasoning": reasoning}
```
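Pairwise judging can be sensitive to the order in which the two responses are shown. One common mitigation is to run the comparison twice with the positions swapped and accept a winner only when both runs agree. The sketch below assumes a hypothetical `compare(prompt, first, second)` callable that returns `"A"` (first wins), `"B"` (second wins), or `"Tie"`; it is an illustration, not part of the class above.

```python
from typing import Callable

# Hypothetical judge interface: returns "A" if the first response wins,
# "B" if the second wins, or "Tie".
Compare = Callable[[str, str, str], str]


def order_swapped_verdict(compare: Compare, prompt: str,
                          response_a: str, response_b: str) -> str:
    """Judge both orderings and return a winner only if they agree."""
    first_run = compare(prompt, response_a, response_b)
    second_run = compare(prompt, response_b, response_a)
    # In the second run the positions are swapped, so map labels back.
    mapped = {"A": "B", "B": "A", "Tie": "Tie"}.get(second_run, "Tie")
    return first_run if first_run == mapped else "Tie"
```

A judge that always favors whichever response appears first produces contradictory verdicts across the two runs, so the result collapses to `"Tie"` instead of rewarding position.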
Multi-Dimensional Evaluation
Composite Score
$$S = \sum_{i=1}^{n} w_i \, s_i$$
Here,
- $S$ = Total evaluation score
- $w_i$ = Weight for dimension i
- $s_i$ = Score for dimension i
- $n$ = Number of evaluation dimensions
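The composite score is a weighted sum of per-dimension scores. As a small worked sketch, the helper below (the name `composite_score` and the example weights are illustrative assumptions) normalizes the weights so they sum to 1 before combining.

```python
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of dimension scores, with weights normalized to sum to 1."""
    total_weight = sum(weights[dim] for dim in scores)
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive value")
    return sum(weights[dim] * scores[dim] for dim in scores) / total_weight


if __name__ == "__main__":
    example_scores = {"relevance": 5, "accuracy": 4, "clarity": 3}
    example_weights = {"relevance": 0.5, "accuracy": 0.3, "clarity": 0.2}
    # 0.5 * 5 + 0.3 * 4 + 0.2 * 3 = 4.3 (up to floating-point rounding)
    print(composite_score(example_scores, example_weights))
```

With uniform weights this reduces to a plain average of the dimension scores.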
```python
import re


class MultiDimensionalJudge:
    """Evaluate across multiple quality dimensions."""

    DIMENSIONS = {
        "relevance": "How well does the response address the prompt?",
        "accuracy": "How factually correct is the information?",
        "completeness": "How thoroughly does it cover the topic?",
        "clarity": "How well-written and clear is the response?",
        "creativity": "How original and creative is the response?",
        "safety": "How appropriate and safe is the content?"
    }

    def __init__(self, judge_model):
        # judge_model is assumed to expose generate(prompt) -> str.
        self.judge_model = judge_model

    def evaluate(self, prompt, response, dimensions=None):
        """Evaluate response on multiple dimensions."""
        if dimensions is None:
            dimensions = list(self.DIMENSIONS.keys())
        evaluations = {}
        for dim in dimensions:
            criteria = self.DIMENSIONS[dim]
            evaluations[dim] = self._evaluate_dimension(
                prompt, response, dim, criteria
            )
        # Uniformly weighted average over the dimensions that produced a score
        scored = [dim for dim in dimensions if evaluations[dim]["score"] is not None]
        weights = {dim: 1.0 / len(scored) for dim in scored}
        total_score = (
            sum(evaluations[dim]["score"] * weights[dim] for dim in scored)
            if scored else None
        )
        return {
            "total_score": total_score,
            "dimension_scores": evaluations
        }

    def _evaluate_dimension(self, prompt, response, dimension, criteria):
        """Evaluate a single dimension."""
        judge_prompt = f"""Rate the following response on "{dimension}".
Prompt: {prompt}
Response: {response}
Criterion: {criteria}
Score (1-5):"""
        result = self.judge_model.generate(judge_prompt)
        score = self._extract_score(result)
        return {"score": score, "dimension": dimension}

    def _extract_score(self, text):
        """Return the first standalone 1-5 rating in the judge output, or None."""
        match = re.search(r"\b[1-5]\b", text)
        return int(match.group()) if match else None
```
LLM-as-Judge works best when the judge model is significantly stronger than the model being evaluated. GPT-4 is commonly used as the judge, but this creates a dependency on a proprietary model. Open-source alternatives like JudgeLM are emerging.
G-Eval
Overview
Definition: G-Eval
G-Eval (Liu et al., 2023) uses chain-of-thought prompting to generate evaluation steps, then weights each candidate score by its token probability to produce a continuous final score. It generalizes across tasks without task-specific training.
Implementation
```python
class GEval:
    """G-Eval: Generalized evaluation with chain-of-thought."""

    def __init__(self, judge_model):
        # judge_model is assumed to expose generate(prompt) -> str and
        # get_token_probs(prompt) -> dict mapping next-token strings to probabilities.
        self.judge_model = judge_model

    def evaluate(self, prompt, response, criteria):
        """Evaluate using G-Eval methodology."""
        # Step 1: Generate evaluation steps
        steps = self._generate_steps(criteria)
        if not steps:
            raise ValueError("Judge model returned no evaluation steps")
        # Step 2: Evaluate each step
        step_scores = []
        for step in steps:
            score = self._evaluate_step(prompt, response, step)
            step_scores.append(score)
        # Step 3: Aggregate scores
        final_score = sum(step_scores) / len(step_scores)
        return {
            "score": final_score,
            "steps": steps,
            "step_scores": step_scores
        }

    def _generate_steps(self, criteria):
        """Generate evaluation steps using chain-of-thought."""
        prompt = f"""Generate 5 evaluation steps for assessing: {criteria}
Steps:"""
        steps_text = self.judge_model.generate(prompt)
        steps = [s.strip() for s in steps_text.split("\n") if s.strip()]
        return steps[:5]

    def _evaluate_step(self, prompt, response, step):
        """Evaluate a single step and compute probability-based score."""
        eval_prompt = f"""Prompt: {prompt}
Response: {response}
Evaluation step: {step}
Does the response satisfy this criterion?
Answer:"""
        # Simplification: score each step from the "Yes" vs "No" probability mass
        probs = self.judge_model.get_token_probs(eval_prompt)
        yes_prob = probs.get("Yes", 0.0)
        no_prob = probs.get("No", 0.0)
        if yes_prob + no_prob == 0:
            # Neither token was returned; fall back to the scale midpoint
            return 3.0
        # Convert to 1-5 score
        score = 1 + 4 * (yes_prob / (yes_prob + no_prob))
        return score
```
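The class above scores each step from a Yes/No probability. G-Eval as described by Liu et al. instead takes the probabilities the judge assigns to each candidate rating and returns their expectation. A minimal sketch of that aggregation follows; the function name `probability_weighted_score` and the input format (a dict from rating to probability) are assumptions for illustration.

```python
def probability_weighted_score(score_probs: dict[int, float]) -> float:
    """Expected rating: sum of rating * probability, renormalized over the ratings given."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("At least one rating must have positive probability")
    return sum(rating * prob for rating, prob in score_probs.items()) / total


if __name__ == "__main__":
    # Hypothetical judge probabilities for ratings 1-5
    probs = {1: 0.0, 2: 0.05, 3: 0.15, 4: 0.5, 5: 0.3}
    # 2*0.05 + 3*0.15 + 4*0.5 + 5*0.3 = 4.05 (up to floating-point rounding)
    print(probability_weighted_score(probs))
```

Because the result is continuous, two responses that would both receive an integer rating of 4 can still be told apart.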
G-Eval Variants
| Variant | Description | Advantage |
|---|---|---|
| G-Eval (CoT) | Chain-of-thought steps | Interpretable |
| G-Eval (BoN) | Best-of-N sampling | More robust |
| G-Eval (ensemble) | Multiple judges | Higher reliability |
Reference-Based Metrics
BLEU (Bilingual Evaluation Understudy)
BLEU Score
$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Here,
- $BP$ = Brevity penalty
- $w_n$ = Weight for n-gram (typically 1/N)
- $p_n$ = Modified n-gram precision
- $N$ = Maximum n-gram order (typically 4)
```python
import math
from collections import Counter


def compute_bleu(reference, candidate, max_n=4):
    """Compute BLEU score between reference and candidate."""
    # Tokenize (lowercased whitespace split)
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # An empty candidate has no n-grams and would divide by zero below
    if not cand_tokens:
        return 0.0
    # Compute modified precision for each n-gram order
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
        cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it appears in the reference
        clipped = sum(
            min(count, ref_ngrams.get(ngram, 0))
            for ngram, count in cand_ngrams.items()
        )
        total = sum(cand_ngrams.values())
        precision = clipped / total if total > 0 else 0
        precisions.append(precision)
    # Brevity penalty: penalizes candidates shorter than the reference
    bp = min(1, math.exp(1 - len(ref_tokens) / len(cand_tokens)))
    # Geometric mean of the precisions (zero if any order has no match)
    if all(p > 0 for p in precisions):
        log_avg = sum(math.log(p) for p in precisions) / len(precisions)
        bleu = bp * math.exp(log_avg)
    else:
        bleu = 0.0
    return bleu
```
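To see how the brevity penalty and the geometric mean interact, it can help to isolate the final aggregation step. The sketch below takes already-computed modified precisions and the two lengths; the helper name `bleu_from_precisions` is hypothetical and uniform weights are assumed.

```python
import math
from typing import Sequence


def bleu_from_precisions(precisions: Sequence[float], ref_len: int, cand_len: int) -> float:
    """Combine n-gram precisions into BLEU using uniform weights."""
    if cand_len <= 0 or not precisions or any(p <= 0 for p in precisions):
        return 0.0
    # Brevity penalty is 1 when the candidate is at least as long as the reference
    bp = min(1.0, math.exp(1 - ref_len / cand_len))
    # Uniform weights 1/N turn the weighted log-sum into a mean of logs
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


if __name__ == "__main__":
    # Same precisions, but the second candidate is shorter than its reference
    print(bleu_from_precisions([0.5, 0.5, 0.5, 0.5], ref_len=10, cand_len=10))
    print(bleu_from_precisions([0.5, 0.5, 0.5, 0.5], ref_len=12, cand_len=10))
```

The second call scores lower only because of the brevity penalty, which is `exp(1 - 12/10)` for that length pair.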
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
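ROUGE-L
$$F_\beta = \frac{(1 + \beta^2)\, P \, R}{R + \beta^2 P}$$
Here,
- $P$ = Precision (LCS length divided by candidate length)
- $R$ = Recall (LCS length divided by reference length)
- $\beta$ = Balance parameter (typically 1.2)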
```python
def lcs_length(x, y):
    """Compute length of longest common subsequence."""
    m, n = len(x), len(y)
    # dp[i][j] holds the LCS length of x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]


def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / len(cand_tokens) if cand_tokens else 0
    recall = lcs / len(ref_tokens) if ref_tokens else 0
    beta = 1.2
    if precision + recall > 0:
        f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    else:
        f_score = 0
    # Note: the "f1" key holds the F-beta score with beta = 1.2, not a balanced F1
    return {"precision": precision, "recall": recall, "f1": f_score}
```
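As a worked example of the ROUGE-L computation (the sentences are made up for illustration), take the reference "the cat sat on the mat" (6 tokens) and the candidate "the cat on the mat" (5 tokens). The longest common subsequence is "the cat on the mat", of length 5.

```latex
\[
P = \frac{5}{5} = 1, \qquad R = \frac{5}{6} \approx 0.833, \qquad \beta = 1.2
\]
\[
F_\beta = \frac{(1 + 1.44) \cdot 1 \cdot \tfrac{5}{6}}{\tfrac{5}{6} + 1.44 \cdot 1}
        = \frac{2.0333}{2.2733} \approx 0.894
\]
```

The candidate contains nothing outside the reference, so precision is perfect; the missing word "sat" lowers recall and therefore the combined score.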
BERTScore
Definition: BERTScore
BERTScore computes semantic similarity using contextual embeddings from a pretrained encoder such as BERT. It matches tokens between reference and candidate based on embedding similarity rather than exact match.
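The greedy matching can be written compactly. With unit-normalized token embeddings $x_i$ for the reference $x$ and $\hat{y}_j$ for the candidate $\hat{y}$, the basic form of the metric (without the optional importance weighting) is:

```latex
\[
R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{y}_j \in \hat{y}} x_i^{\top} \hat{y}_j,
\qquad
P = \frac{1}{|\hat{y}|} \sum_{\hat{y}_j \in \hat{y}} \max_{x_i \in x} x_i^{\top} \hat{y}_j,
\qquad
F = \frac{2 P R}{P + R}
\]
```

Recall asks how well each reference token is covered by the candidate; precision asks how well each candidate token is supported by the reference.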
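```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


class BERTScore:
    """BERTScore evaluation metric (simplified)."""

    def __init__(self, model_name="microsoft/deberta-xlarge-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()

    def compute(self, references, candidates):
        """Compute BERTScore between paired lists of references and candidates."""
        # Tokenize
        ref_inputs = self.tokenizer(
            references, padding=True, truncation=True, return_tensors="pt"
        )
        cand_inputs = self.tokenizer(
            candidates, padding=True, truncation=True, return_tensors="pt"
        )
        # Get embeddings
        with torch.no_grad():
            ref_embeddings = self.model(**ref_inputs).last_hidden_state
            cand_embeddings = self.model(**cand_inputs).last_hidden_state
        # Unit-normalize so dot products are cosine similarities
        ref_embeddings = F.normalize(ref_embeddings, dim=-1)
        cand_embeddings = F.normalize(cand_embeddings, dim=-1)
        # Token-level similarity, shape (batch, cand_len, ref_len)
        similarity = torch.bmm(cand_embeddings, ref_embeddings.transpose(1, 2))
        # Padding tokens must not take part in the matching or the averages
        ref_mask = ref_inputs["attention_mask"].bool()
        cand_mask = cand_inputs["attention_mask"].bool()
        lowest = torch.finfo(similarity.dtype).min
        # Precision: best reference match for each real candidate token
        best_for_cand = similarity.masked_fill(~ref_mask.unsqueeze(1), lowest).max(dim=2).values
        precision = (best_for_cand * cand_mask).sum(dim=1) / cand_mask.sum(dim=1)
        # Recall: best candidate match for each real reference token
        best_for_ref = similarity.masked_fill(~cand_mask.unsqueeze(2), lowest).max(dim=1).values
        recall = (best_for_ref * ref_mask).sum(dim=1) / ref_mask.sum(dim=1)
        # F1 score (clamped denominator avoids division by zero)
        f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)
        return {
            "precision": precision.mean().item(),
            "recall": recall.mean().item(),
            "f1": f1.mean().item()
        }
```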
Reference-Free Metrics
Perplexity
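Perplexity
$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right)$$
Here,
- $T$ = Sequence length
- $p(x_t \mid x_{<t})$ = Model probability for token t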
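Perplexity is the exponential of the average negative log-probability per token. A minimal sketch, assuming the per-token natural-log probabilities have already been obtained from a model (the function name `perplexity` is illustrative):

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("Need at least one token log-probability")
    avg_neg_log_likelihood = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_likelihood)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token is as uncertain
    # as a uniform choice among 4 options, so its perplexity is about 4.
    print(perplexity([math.log(0.25)] * 4))
```

Lower values mean the model found the text more predictable, which says nothing by itself about whether the text answers the task.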
Self-BLEU
Definition: Self-BLEU
Self-BLEU measures diversity by computing BLEU between generated samples. Lower Self-BLEU indicates higher diversity.
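```python
from itertools import combinations


def compute_self_bleu(samples, max_n=4):
    """Compute Self-BLEU for diversity measurement."""
    if len(samples) < 2:
        raise ValueError("Self-BLEU needs at least two samples")
    # Simplification: score each unordered pair once, treating the first
    # sample as the reference and the second as the candidate
    scores = []
    for i, j in combinations(range(len(samples)), 2):
        score = compute_bleu(samples[i], samples[j], max_n=max_n)
        scores.append(score)
    return sum(scores) / len(scores)
```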
Fréchet Inception Distance (FID) Adaptation
For text generation, adapted versions of FID use text embeddings instead of image features. The Fréchet distance measures the difference between the distribution of generated texts and reference texts in embedding space.
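When each set of embeddings is summarized by a Gaussian, the Fréchet distance has a closed form. Writing $\mu_g, \Sigma_g$ for the mean and covariance of the generated-text embeddings and $\mu_r, \Sigma_r$ for those of the reference texts (notation introduced here for illustration):

```latex
\[
d^2 = \lVert \mu_g - \mu_r \rVert_2^2
    + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2 \left( \Sigma_g \Sigma_r \right)^{1/2} \right)
\]
```

The first term compares where the two embedding clouds are centered; the second compares their spread. A distance of zero means the two Gaussians coincide.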
Choosing Evaluation Methods
| Method | Correlation with human judgment | Cost | Speed | Use case |
|---|---|---|---|---|
| LLM-as-Judge (GPT-4) | High | High | Moderate | Highest-fidelity automated evaluation |
| G-Eval | High | Moderate | Moderate | General evaluation |
| BERTScore | Moderate | Low | Fast | Quick comparison |
| BLEU/ROUGE | Low-Moderate | Very Low | Very Fast | Translation/summarization |
| Perplexity | Low | Very Low | Very Fast | Language modeling |
For production evaluation, combine multiple methods: use LLM-as-Judge for a sample of outputs, BERTScore for all outputs, and perplexity for ongoing monitoring. This balances cost, speed, and accuracy.
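Judging only a sample of outputs requires a reproducible way to pick that sample. One simple option is a seeded random draw of indices; the helper name `sample_for_judging` and the default fraction are assumptions for illustration.

```python
import random


def sample_for_judging(n_outputs: int, fraction: float = 0.1, seed: int = 0) -> list[int]:
    """Pick a reproducible random subset of output indices to send to the LLM judge."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if n_outputs <= 0:
        return []
    # Always judge at least one output, and never more than exist
    k = min(n_outputs, max(1, round(n_outputs * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_outputs), k))
```

The cheaper metrics can then run on every output, while the judge is called only for the returned indices. Fixing the seed keeps the judged subset stable across runs, so score changes reflect the model rather than the sample.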
Evaluation Pipeline
```python
class AutomatedEvaluationPipeline:
    """Complete automated evaluation pipeline."""

    def __init__(self, judge_model=None):
        # The pipeline only needs the judge, so no eval_model is passed
        self.judge = LLMJudge(judge_model, eval_model=None) if judge_model else None
        # Note: constructing BERTScore loads the embedding model
        self.bert_scorer = BERTScore()

    def evaluate(self, model_outputs, references=None, prompts=None):
        """Run comprehensive automated evaluation."""
        results = {}
        # 1. Reference-based metrics (if references available)
        if references:
            results["bertscore"] = self.bert_scorer.compute(
                references, model_outputs
            )
            results["rouge"] = [
                compute_rouge_l(ref, cand)
                for ref, cand in zip(references, model_outputs)
            ]
        # 2. Reference-free metrics (Self-BLEU needs at least two outputs)
        if len(model_outputs) >= 2:
            results["self_bleu"] = compute_self_bleu(model_outputs)
        # 3. LLM-as-Judge (if prompts available and judge configured)
        if self.judge and prompts:
            judge_scores = []
            for prompt, output in zip(prompts, model_outputs):
                score = self.judge.evaluate_single(prompt, output)
                judge_scores.append(score["score"])
            # Ignore responses whose judge output could not be parsed
            valid_scores = [s for s in judge_scores if s is not None]
            if valid_scores:
                results["llm_judge"] = {
                    "mean": sum(valid_scores) / len(valid_scores),
                    "scores": judge_scores
                }
        # 4. Aggregate
        results["overall"] = self._aggregate_results(results)
        return results

    def _aggregate_results(self, results):
        """Aggregate all metrics into overall score."""
        # The two components are on different scales in practice, so treat
        # the unweighted mean as a rough summary rather than a calibrated score
        scores = []
        if "llm_judge" in results:
            scores.append(results["llm_judge"]["mean"] / 5.0)
        if "bertscore" in results:
            scores.append(results["bertscore"]["f1"])
        if scores:
            return sum(scores) / len(scores)
        return 0
```
Practice Exercises
- Conceptual: Explain the trade-offs between reference-based metrics (BLEU, ROUGE) and LLM-as-Judge. When would you prefer one over the other?
- Mathematical: Compute the BLEU score for a candidate sentence with 10 tokens against a reference with 12 tokens, where the modified precisions are 0.8, 0.6, 0.4, and 0.3 for n-grams 1-4.
- Practical: Implement a G-Eval evaluator for summarization quality and compare its correlation with human judgments against ROUGE scores.
- Research: Investigate the bias in LLM-as-Judge evaluations. Do judges prefer their own outputs or outputs from similar models? How can this bias be mitigated?
Key Takeaways:
- LLM-as-Judge (GPT-4) provides scalable evaluation correlating well with human judgment
- G-Eval uses chain-of-thought to generate evaluation steps automatically
- BLEU and ROUGE are fast but correlate weakly with human judgment for open-ended tasks
- BERTScore captures semantic similarity better than exact match metrics
- Combine multiple evaluation methods for robust assessment
- Self-BLEU measures output diversity (lower = more diverse)
- Perplexity measures language modeling quality but not task performance
- Automated evaluation should be validated against human judgments periodically
What to Learn Next
-> Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.
-> LLM Evaluation Frameworks: lm-eval-harness, OpenCompass, and HELM.
-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.
-> RLHF and Alignment: Training models to align with human preferences.
-> Constitutional AI: Training safe and aligned language models.
-> DPO and Preference Optimization: Direct preference optimization for alignment.