
Automated LLM Evaluation — Scaling Quality Assessment

Human evaluation is the gold standard but doesn't scale. Automated LLM evaluation uses metrics, LLM-as-judge, and learned scorers to approximate human judgment at scale. This guide covers the major automated evaluation approaches.

LLM-as-Judge: Using strong models to evaluate weaker ones
G-Eval: Chain-of-thought evaluation steps with probability-based scoring
Reference-Based Metrics: BLEU, ROUGE, and BERTScore
Reference-Free Metrics: Perplexity, Self-BLEU, and embedding-based Fréchet distance

The best automated metric is the one that correlates most with human judgment.

Automated LLM Evaluation

Human evaluation is expensive and doesn't scale. Automated evaluation methods provide fast, consistent, and cost-effective alternatives. The challenge is ensuring these automated methods actually correlate with human preferences.

Definition: Automated Evaluation

Automated evaluation uses computational metrics to assess LLM quality without human annotators. The goal is to approximate human judgment while being faster, cheaper, and more reproducible.

LLM-as-Judge

Overview

Definition: LLM-as-Judge

LLM-as-Judge uses a strong language model (typically GPT-4) to evaluate the outputs of other models. It provides scalable evaluation that correlates well with human judgments for many tasks.

Implementation

class LLMJudge:
    """LLM-as-Judge evaluation system."""
    
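    def __init__(self, judge_model, eval_model=None):
        # judge_model must expose generate(prompt) -> str.
        # eval_model (the model under test) is optional: this class only
        # scores responses that are passed in, it never generates them.
        self.judge_model = judge_model
        self.eval_model = eval_model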
    
    def evaluate_single(self, prompt, response):
        """Evaluate a single response."""
        judge_prompt = f"""Please evaluate the following response on a scale of 1-5.

Prompt: {prompt}

Response: {response}

Evaluation criteria:
1. Relevance: Does the response address the prompt?
2. Accuracy: Is the information correct?
3. Completeness: Does it cover all important aspects?
4. Clarity: Is it well-written and easy to understand?

Provide your evaluation as:
Score: [1-5]
Reasoning: [brief explanation]
"""
        
        evaluation = self.judge_model.generate(judge_prompt)
        return self._parse_evaluation(evaluation)
    
    def evaluate_pairwise(self, prompt, response_a, response_b):
        """Compare two responses side by side."""
        judge_prompt = f"""Please compare these two responses to the prompt.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider:
- Relevance to the prompt
- Accuracy of information
- Completeness
- Clarity of explanation

Answer with:
Winner: [A/B/Tie]
Reasoning: [brief explanation]
"""
        
        comparison = self.judge_model.generate(judge_prompt)
        return self._parse_comparison(comparison)
    
    def _parse_evaluation(self, evaluation):
        """Parse judge evaluation into structured output."""
        lines = evaluation.strip().split("\n")
        score = None
        reasoning = ""
        
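        for line in lines:
            line = line.strip()
            if line.startswith("Score:"):
                # Accept "4", "4/5", or "[4]"; keep None if no digit is present
                digits = [ch for ch in line.split(":", 1)[1] if ch.isdigit()]
                score = int(digits[0]) if digits else None
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()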
        
        return {"score": score, "reasoning": reasoning}
    
    def _parse_comparison(self, comparison):
        """Parse pairwise comparison."""
        lines = comparison.strip().split("\n")
        winner = None
        reasoning = ""
        
        for line in lines:
            if line.startswith("Winner:"):
                winner = line.split(":")[1].strip()
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()
        
        return {"winner": winner, "reasoning": reasoning}
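
Judge models do not always follow the requested output format exactly: a reply may read `Score: 4/5` or `**Score:** 3`, or contain no score at all. The sketch below shows one more tolerant way to pull the score out of free-form text. The function name `parse_judge_score` and the regular expression are illustrative assumptions, not part of the class above.

```python
import re

def parse_judge_score(text, lo=1, hi=5):
    """Extract an integer score from free-form judge output, or return None."""
    # Find the word "score", skip any punctuation or markup, then read digits.
    match = re.search(r"score\W*(\d+)", text, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    # Reject values that fall outside the rubric's range.
    return score if lo <= score <= hi else None

print(parse_judge_score("Score: 4/5\nReasoning: clear and accurate"))
print(parse_judge_score("**Score:** 3"))
print(parse_judge_score("I cannot rate this."))
```

Returning `None` rather than raising lets the caller decide whether to retry the judge call or drop the sample.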

Multi-Dimensional Evaluation

Composite Score

$$S_{\text{total}} = \sum_{i=1}^{n} w_i \cdot S_i$$

Here,

$S_{\text{total}}$ = Total evaluation score
$w_i$ = Weight for dimension $i$
$S_i$ = Score for dimension $i$
$n$ = Number of evaluation dimensions
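
As a worked instance of the composite score, the sketch below computes the weighted sum for three dimensions. The helper name `composite_score`, the dimension names, and the numbers are made up for illustration; the only assumption carried over from the formula is that the weights sum to 1 so the total stays on the same scale as the per-dimension scores.

```python
def composite_score(scores, weights):
    """Weighted sum S_total = sum_i w_i * S_i over named dimensions."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(weights[dim] * scores[dim] for dim in scores)

# Illustrative values only: relevance counts for half of the total.
scores = {"relevance": 5, "accuracy": 4, "clarity": 3}
weights = {"relevance": 0.5, "accuracy": 0.3, "clarity": 0.2}

# 0.5*5 + 0.3*4 + 0.2*3, which is about 4.3
print(composite_score(scores, weights))
```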

class MultiDimensionalJudge:
    """Evaluate across multiple quality dimensions."""
    
    DIMENSIONS = {
        "relevance": "How well does the response address the prompt?",
        "accuracy": "How factually correct is the information?",
        "completeness": "How thoroughly does it cover the topic?",
        "clarity": "How well-written and clear is the response?",
        "creativity": "How original and creative is the response?",
        "safety": "How appropriate and safe is the content?"
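    }
    
    def __init__(self, judge_model):
        # judge_model must expose generate(prompt) -> str
        self.judge_model = judge_model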
    
    def evaluate(self, prompt, response, dimensions=None):
        """Evaluate response on multiple dimensions."""
        if dimensions is None:
            dimensions = list(self.DIMENSIONS.keys())
        
        evaluations = {}
        
        for dim in dimensions:
            criteria = self.DIMENSIONS[dim]
            eval_result = self._evaluate_dimension(
                prompt, response, dim, criteria
            )
            evaluations[dim] = eval_result
        
        # Compute weighted average
        weights = {dim: 1.0 / len(dimensions) for dim in dimensions}
        total_score = sum(
            evaluations[dim]["score"] * weights[dim] 
            for dim in dimensions
        )
        
        return {
            "total_score": total_score,
            "dimension_scores": evaluations
        }
    
    def _evaluate_dimension(self, prompt, response, dimension, criteria):
        """Evaluate a single dimension."""
        judge_prompt = f"""Rate the following response on "{dimension}".

Prompt: {prompt}
Response: {response}

Criterion: {criteria}

Score (1-5):"""
        
        result = self.judge_model.generate(judge_prompt)
        score = self._extract_score(result)
        
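        return {"score": score, "dimension": dimension}
    
    def _extract_score(self, text):
        """Return the first digit 1-5 found in the judge output."""
        for ch in text:
            if ch in "12345":
                return int(ch)
        raise ValueError(f"No score found in judge output: {text!r}")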

LLM-as-Judge works best when the judge model is significantly stronger than the model being evaluated. GPT-4 is commonly used as the judge, but this creates a dependency on a proprietary model. Open-source alternatives like JudgeLM are emerging.

G-Eval

Overview

Definition: G-Eval

G-Eval (Liu et al., 2023) uses chain-of-thought prompting to generate evaluation steps, then applies token probabilities to compute scores. It generalizes across tasks without task-specific training.

Implementation

class GEval:
    """G-Eval: Generalized evaluation with chain-of-thought."""
    
    def __init__(self, judge_model):
        self.judge_model = judge_model
    
    def evaluate(self, prompt, response, criteria):
        """Evaluate using G-Eval methodology."""
        
        # Step 1: Generate evaluation steps
        steps = self._generate_steps(criteria)
        
        # Step 2: Evaluate each step
        step_scores = []
        for step in steps:
            score = self._evaluate_step(prompt, response, step)
            step_scores.append(score)
        
        # Step 3: Aggregate scores
        final_score = sum(step_scores) / len(step_scores)
        
        return {
            "score": final_score,
            "steps": steps,
            "step_scores": step_scores
        }
    
    def _generate_steps(self, criteria):
        """Generate evaluation steps using chain-of-thought."""
        prompt = f"""Generate 5 evaluation steps for assessing: {criteria}

Steps:"""
        
        steps_text = self.judge_model.generate(prompt)
        steps = [s.strip() for s in steps_text.split("\n") if s.strip()]
        
        return steps[:5]
    
    def _evaluate_step(self, prompt, response, step):
        """Evaluate a single step and compute probability-based score."""
        eval_prompt = f"""Prompt: {prompt}
Response: {response}

Evaluation step: {step}

Does the response satisfy this criterion?
Answer:"""
        
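        # Get next-token probabilities. get_token_probs is assumed to return
        # a dict mapping candidate tokens to probabilities.
        probs = self.judge_model.get_token_probs(eval_prompt)
        
        # A missing token contributes no probability mass
        yes_prob = probs.get("Yes", 0.0)
        no_prob = probs.get("No", 0.0)
        
        # Map P(Yes), renormalized over {Yes, No}, onto the 1-5 scale.
        # Fall back to the midpoint when neither token received any mass.
        total = yes_prob + no_prob
        score = 1 + 4 * (yes_prob / total) if total > 0 else 3.0
        
        return score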
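
The class above converts Yes/No probabilities into a score, which is a simplification. In the G-Eval paper the judge is asked for a rating directly, and the final score is the expectation over the rating tokens, score = Σ p(s_i) · s_i. A minimal sketch of that weighting follows; it assumes a hypothetical dict `token_probs` that maps rating tokens such as "4" to the probability the judge assigned them. How those probabilities are obtained depends on the model API and is not shown.

```python
def expected_rating(token_probs, scale=(1, 2, 3, 4, 5)):
    """Probability-weighted rating: sum_i p(s_i) * s_i, renormalized over the scale."""
    # Keep only the probability mass that landed on valid rating tokens.
    mass = {s: token_probs.get(str(s), 0.0) for s in scale}
    total = sum(mass.values())
    if total == 0:
        return None  # the judge put no probability on any valid rating
    return sum(s * p for s, p in mass.items()) / total

# Illustrative distribution: the judge mostly favors a rating of 4.
probs = {"4": 0.6, "5": 0.3, "3": 0.1}

# 4*0.6 + 5*0.3 + 3*0.1, which is about 4.2
print(expected_rating(probs))
```

Compared with taking the single most likely rating, the expectation yields fine-grained scores and fewer ties between responses.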

G-Eval Variants

Variant	Description	Advantage
G-Eval (CoT)	Chain-of-thought steps	Interpretable
G-Eval (BoN)	Best-of-N sampling	More robust
G-Eval (ensemble)	Multiple judges	Higher reliability

Reference-Based Metrics

BLEU (Bilingual Evaluation Understudy)

BLEU Score

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Here,

$\text{BP}$ = Brevity penalty
$w_n$ = Weight for n-gram order $n$ (typically $1/N$)
$p_n$ = Modified n-gram precision
$N$ = Maximum n-gram order (typically 4)
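
The brevity penalty is not spelled out above. In the standard BLEU definition, with candidate length $c$ and reference length $r$ (symbols introduced here for illustration), it takes the form:

```latex
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
\exp\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r
\end{cases}
```

For example, a candidate of 8 tokens against a reference of 10 tokens gives $\text{BP} = \exp(1 - 10/8) = \exp(-0.25) \approx 0.779$, so a short candidate is penalized even when its n-gram precisions are high.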

import math
from collections import Counter

def compute_bleu(reference, candidate, max_n=4):
    """Compute BLEU score between reference and candidate."""
    
    # Tokenize
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    # Compute modified precision for each n-gram
    precisions = []
    
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
        cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
        
        # Clipped counts
        clipped = sum(
            min(count, ref_ngrams.get(ngram, 0))
            for ngram, count in cand_ngrams.items()
        )
        total = sum(cand_ngrams.values())
        
        precision = clipped / total if total > 0 else 0
        precisions.append(precision)
    
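    # Brevity penalty. An empty candidate scores 0 and would otherwise
    # cause a division by zero below.
    if not cand_tokens:
        return 0.0
    bp = min(1, math.exp(1 - len(ref_tokens) / len(cand_tokens)))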
    
    # Geometric mean
    if all(p > 0 for p in precisions):
        log_avg = sum(math.log(p) for p in precisions) / len(precisions)
        bleu = bp * math.exp(log_avg)
    else:
        bleu = 0
    
    return bleu

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-L

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

Here,

$P$ = Precision
$R$ = Recall
$\beta$ = Balance parameter (typically 1.2)
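
For ROUGE-L specifically, precision and recall are defined through the longest common subsequence (LCS). With a reference $X$ of $m$ tokens and a candidate $Y$ of $n$ tokens (notation assumed here for illustration):

```latex
R = \frac{\text{LCS}(X, Y)}{m}, \qquad P = \frac{\text{LCS}(X, Y)}{n}
```

With $\beta = 1$ the formula reduces to the ordinary F1 score; values of $\beta$ above 1 weight recall more heavily than precision. As a small example, the reference "the cat sat on the mat" and the candidate "the cat lay on the mat" share an LCS of 5 tokens, so $P = R = 5/6$.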

def lcs_length(x, y):
    """Compute length of longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]

def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    lcs = lcs_length(ref_tokens, cand_tokens)
    
    precision = lcs / len(cand_tokens) if cand_tokens else 0
    recall = lcs / len(ref_tokens) if ref_tokens else 0
    
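    beta = 1.2  # beta > 1 weights recall more heavily than precision
    if precision + recall > 0:
        f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    else:
        f_score = 0
    
    # Note: the "f1" key holds F-beta with beta = 1.2, not a strict F1
    return {"precision": precision, "recall": recall, "f1": f_score}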

BERTScore

Definition: BERTScore

BERTScore computes semantic similarity using contextual embeddings from BERT. It matches tokens between reference and candidate based on embedding similarity rather than exact match.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

class BERTScore:
    """BERTScore evaluation metric."""
    
    def __init__(self, model_name="microsoft/deberta-xlarge-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def compute(self, references, candidates):
        """Compute BERTScore between references and candidates."""
        # Tokenize
        ref_inputs = self.tokenizer(
            references, padding=True, truncation=True, return_tensors="pt"
        )
        cand_inputs = self.tokenizer(
            candidates, padding=True, truncation=True, return_tensors="pt"
        )
        
        # Get embeddings
        with torch.no_grad():
            ref_embeddings = self.model(**ref_inputs).last_hidden_state
            cand_embeddings = self.model(**cand_inputs).last_hidden_state
        
        # Compute cosine similarity
        ref_embeddings = F.normalize(ref_embeddings, dim=-1)
        cand_embeddings = F.normalize(cand_embeddings, dim=-1)
        
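        # Token-level similarity, shape (batch, cand_len, ref_len)
        similarity = torch.bmm(cand_embeddings, ref_embeddings.transpose(1, 2))
        
        # Padding tokens must not take part in matching or averaging
        cand_mask = cand_inputs["attention_mask"].bool()
        ref_mask = ref_inputs["attention_mask"].bool()
        
        # Precision: best reference match for each real candidate token
        cand_best = similarity.masked_fill(~ref_mask.unsqueeze(1), -1.0).max(dim=2).values
        precision = (cand_best * cand_mask).sum(dim=1) / cand_mask.sum(dim=1)
        
        # Recall: best candidate match for each real reference token
        ref_best = similarity.masked_fill(~cand_mask.unsqueeze(2), -1.0).max(dim=1).values
        recall = (ref_best * ref_mask).sum(dim=1) / ref_mask.sum(dim=1)
        
        # F1 score
        f1 = 2 * precision * recall / (precision + recall)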
        
        return {
            "precision": precision.mean().item(),
            "recall": recall.mean().item(),
            "f1": f1.mean().item()
        }

Reference-Free Metrics

Perplexity

$$\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)$$

Here,

$T$ = Sequence length
$P(x_t \mid x_{<t})$ = Model probability for token $t$ given the preceding tokens
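
The formula can be checked numerically from a list of per-token log-probabilities. The sketch below assumes natural-log probabilities are already available (for example from a model's scoring API, which is not shown); the function name `perplexity` is illustrative.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities log P(x_t | x_<t)."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    # Average negative log-likelihood per token, then exponentiate.
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_neg_logprob)

# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower perplexity means the model found the text more predictable; it says nothing about whether the text is correct or useful for a task.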

Self-BLEU

Definition: Self-BLEU

Self-BLEU measures diversity by computing BLEU between generated samples. Lower Self-BLEU indicates higher diversity.

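def compute_self_bleu(samples):
    """Compute a pairwise Self-BLEU approximation for diversity measurement."""
    from itertools import combinations
    
    # Diversity is undefined for fewer than two samples
    if len(samples) < 2:
        return None
    
    scores = []
    for i, j in combinations(range(len(samples)), 2):
        # Treat sample i as the reference and sample j as the candidate
        score = compute_bleu(samples[i], samples[j])
        scores.append(score)
    
    return sum(scores) / len(scores)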

Fréchet Inception Distance (FID) Adaptation

For text generation, adapted versions of FID use text embeddings instead of image features. The Fréchet distance measures the difference between the distribution of generated texts and the distribution of reference texts in embedding space.
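
Under the usual assumption that each set of embeddings is approximately Gaussian, the distance has a closed form. With means $\mu_g, \mu_r$ and covariances $\Sigma_g, \Sigma_r$ of the generated and reference embeddings (notation assumed here for illustration):

```latex
d^2 = \lVert \mu_g - \mu_r \rVert_2^2
    + \operatorname{Tr}\left(\Sigma_g + \Sigma_r - 2\left(\Sigma_g \Sigma_r\right)^{1/2}\right)
```

The first term compares where the two distributions are centered and the trace term compares their spread, so the distance is zero only when both the means and the covariances match.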

Choosing Evaluation Methods

Method	Correlation with Human	Cost	Speed	Use Case
LLM-as-Judge (GPT-4)	High	High	Moderate	Gold standard automated
G-Eval	High	Moderate	Moderate	General evaluation
BERTScore	Moderate	Low	Fast	Quick comparison
BLEU/ROUGE	Low-Moderate	Very Low	Very Fast	Translation/summarization
Perplexity	Low	Very Low	Very Fast	Language modeling

For production evaluation, combine multiple methods: use LLM-as-Judge for a sample of outputs, BERTScore for all outputs, and perplexity for ongoing monitoring. This balances cost, speed, and accuracy.
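
Whichever combination is chosen, the "Correlation with Human" column has to be measured on your own data. One common choice is Spearman rank correlation between a metric's scores and human ratings of the same outputs. The sketch below is a dependency-free version; the function names and the sample numbers are made up for illustration, and a statistics library would normally be used instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: human ratings and metric scores for five outputs.
human = [5, 4, 2, 1, 3]
metric = [0.91, 0.80, 0.35, 0.40, 0.62]

# The two rankings differ only in one swapped pair, giving about 0.9
print(spearman(human, metric))
```

Rank correlation is used here because metric scores and human ratings live on different scales; only the ordering of outputs needs to agree.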

Evaluation Pipeline

class AutomatedEvaluationPipeline:
    """Complete automated evaluation pipeline."""
    
    def __init__(self, judge_model=None):
        self.judge = LLMJudge(judge_model) if judge_model else None
        self.bert_scorer = BERTScore()
    
    def evaluate(self, model_outputs, references=None, prompts=None):
        """Run comprehensive automated evaluation."""
        results = {}
        
        # 1. Reference-based metrics (if references available)
        if references:
            results["bertscore"] = self.bert_scorer.compute(
                references, model_outputs
            )
            results["rouge"] = [
                compute_rouge_l(ref, cand)
                for ref, cand in zip(references, model_outputs)
            ]
        
        # 2. Reference-free metrics
        results["self_bleu"] = compute_self_bleu(model_outputs)
        
        # 3. LLM-as-Judge (if prompts available and judge configured)
        if self.judge and prompts:
            judge_scores = []
            for prompt, output in zip(prompts, model_outputs):
                score = self.judge.evaluate_single(prompt, output)
                # Skip outputs whose judge reply could not be parsed
                if score["score"] is not None:
                    judge_scores.append(score["score"])
            
            if judge_scores:
                results["llm_judge"] = {
                    "mean": sum(judge_scores) / len(judge_scores),
                    "scores": judge_scores
                }
        
        # 4. Aggregate
        results["overall"] = self._aggregate_results(results)
        
        return results
    
    def _aggregate_results(self, results):
        """Aggregate all metrics into overall score."""
        scores = []
        
        if "llm_judge" in results:
            scores.append(results["llm_judge"]["mean"] / 5.0)
        
        if "bertscore" in results:
            scores.append(results["bertscore"]["f1"])
        
        if scores:
            return sum(scores) / len(scores)
        return 0

Practice Exercises

Conceptual: Explain the trade-offs between reference-based metrics (BLEU, ROUGE) and LLM-as-Judge. When would you prefer one over the other?
Mathematical: Compute the BLEU score for a candidate sentence with 10 tokens against a reference with 12 tokens, where the modified precisions are 0.8, 0.6, 0.4, and 0.3 for n-grams 1-4.
Practical: Implement a G-Eval evaluator for summarization quality and compare its correlation with human judgments against ROUGE scores.
Research: Investigate the bias in LLM-as-Judge evaluations. Do judges prefer their own outputs or outputs from similar models? How can this bias be mitigated?

Key Takeaways:

LLM-as-Judge (GPT-4) provides scalable evaluation correlating well with human judgment
G-Eval uses chain-of-thought to generate evaluation steps automatically
BLEU and ROUGE are fast but correlate weakly with human judgment for open-ended tasks
BERTScore captures semantic similarity better than exact match metrics
Combine multiple evaluation methods for robust assessment
Self-BLEU measures output diversity (lower = more diverse)
Perplexity measures language modeling quality but not task performance
Automated evaluation should be validated against human judgments periodically

What to Learn Next

-> Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.

-> LLM Evaluation Frameworks: lm-eval-harness, OpenCompass, and HELM.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> RLHF and Alignment: Training models to align with human preferences.

-> Constitutional AI: Training safe and aligned language models.

-> DPO and Preference Optimization: Direct preference optimization for alignment.
