Automated LLM Evaluation

Automated LLM Evaluation: Scaling Quality Assessment

Human evaluation is the gold standard but doesn't scale. Automated LLM evaluation uses metrics, LLM-as-judge, and learned scorers to approximate human judgment at scale. This guide covers the major automated evaluation approaches.

  • LLM-as-Judge: Using strong models to evaluate weaker ones
  • G-Eval: Generalized evaluation with chain-of-thought
  • Reference-Based Metrics: BLEU, ROUGE, and learned metrics such as BERTScore
  • Reference-Free Metrics: Perplexity, Self-BLEU, and embedding-based distances

The best automated metric is the one that correlates most with human judgment.

Automated LLM Evaluation

Human evaluation is expensive and doesn't scale. Automated evaluation methods provide fast, consistent, and cost-effective alternatives. The challenge is ensuring these automated methods actually correlate with human preferences.

Definition: Automated Evaluation

Automated evaluation uses computational metrics to assess LLM quality without human annotators. The goal is to approximate human judgment while being faster, cheaper, and more reproducible.
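One way to check whether an automated metric tracks human preferences is to compute a rank correlation between the metric's scores and human ratings of the same outputs. The sketch below implements Spearman correlation in plain Python. The function names and the five example ratings are illustrative assumptions, not data from any study.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))

# Hypothetical ratings for five outputs (illustrative only)
human_ratings = [5, 3, 4, 1, 2]
metric_scores = [0.91, 0.48, 0.70, 0.12, 0.35]
print(spearman(metric_scores, human_ratings))
```

A value near 1 means the metric orders outputs the way human raters do. A value near 0 suggests the metric carries little information about human preference on this task.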

LLM-as-Judge

Overview

Definition: LLM-as-Judge

LLM-as-Judge uses a strong language model (typically GPT-4) to evaluate the outputs of other models. It provides scalable evaluation that correlates well with human judgments for many tasks.

Implementation

class LLMJudge:
    """LLM-as-Judge evaluation system."""
    
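    def __init__(self, judge_model, eval_model=None):
        # judge_model must expose generate(prompt) -> str.
        # eval_model (the model under evaluation) is optional and kept for reference only.
        self.judge_model = judge_model
        self.eval_model = eval_model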
    
    def evaluate_single(self, prompt, response):
        """Evaluate a single response."""
        judge_prompt = f"""Please evaluate the following response on a scale of 1-5.

Prompt: {prompt}

Response: {response}

Evaluation criteria:
1. Relevance: Does the response address the prompt?
2. Accuracy: Is the information correct?
3. Completeness: Does it cover all important aspects?
4. Clarity: Is it well-written and easy to understand?

Provide your evaluation as:
Score: [1-5]
Reasoning: [brief explanation]
"""
        
        evaluation = self.judge_model.generate(judge_prompt)
        return self._parse_evaluation(evaluation)
    
    def evaluate_pairwise(self, prompt, response_a, response_b):
        """Compare two responses side by side."""
        judge_prompt = f"""Please compare these two responses to the prompt.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider:
- Relevance to the prompt
- Accuracy of information
- Completeness
- Clarity of explanation

Answer with:
Winner: [A/B/Tie]
Reasoning: [brief explanation]
"""
        
        comparison = self.judge_model.generate(judge_prompt)
        return self._parse_comparison(comparison)
    
    def _parse_evaluation(self, evaluation):
        """Parse judge evaluation into structured output."""
        lines = evaluation.strip().split("\n")
        score = None
        reasoning = ""
        
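        for line in lines:
            line = line.strip()
            if line.startswith("Score:"):
                # Tolerate replies such as "4", "[4]", or "4/5"
                digits = [ch for ch in line.split(":", 1)[1] if ch.isdigit()]
                score = int(digits[0]) if digits else None
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()
        
        return {"score": score, "reasoning": reasoning}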
    
    def _parse_comparison(self, comparison):
        """Parse pairwise comparison."""
        lines = comparison.strip().split("\n")
        winner = None
        reasoning = ""
        
        for line in lines:
            line = line.strip()
            if line.startswith("Winner:"):
                # Tolerate bracketed replies such as "[A]"
                winner = line.split(":", 1)[1].strip().strip("[]")
            elif line.startswith("Reasoning:"):
                reasoning = line.split(":", 1)[1].strip()
        
        return {"winner": winner, "reasoning": reasoning}

Multi-Dimensional Evaluation

Composite Score

S_{\text{total}} = \sum_{i=1}^{n} w_i \cdot S_i

Here,

  • S_{\text{total}} = Total evaluation score
  • w_i = Weight for dimension i
  • S_i = Score for dimension i
  • n = Number of evaluation dimensions

class MultiDimensionalJudge:
    """Evaluate across multiple quality dimensions."""
    
    DIMENSIONS = {
        "relevance": "How well does the response address the prompt?",
        "accuracy": "How factually correct is the information?",
        "completeness": "How thoroughly does it cover the topic?",
        "clarity": "How well-written and clear is the response?",
        "creativity": "How original and creative is the response?",
        "safety": "How appropriate and safe is the content?"
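    }
    
    def __init__(self, judge_model):
        # judge_model must expose generate(prompt) -> str
        self.judge_model = judge_model
    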
    def evaluate(self, prompt, response, dimensions=None):
        """Evaluate response on multiple dimensions."""
        if dimensions is None:
            dimensions = list(self.DIMENSIONS.keys())
        
        evaluations = {}
        
        for dim in dimensions:
            criteria = self.DIMENSIONS[dim]
            eval_result = self._evaluate_dimension(
                prompt, response, dim, criteria
            )
            evaluations[dim] = eval_result
        
        # Compute weighted average
        weights = {dim: 1.0 / len(dimensions) for dim in dimensions}
        total_score = sum(
            evaluations[dim]["score"] * weights[dim] 
            for dim in dimensions
        )
        
        return {
            "total_score": total_score,
            "dimension_scores": evaluations
        }
    
    def _evaluate_dimension(self, prompt, response, dimension, criteria):
        """Evaluate a single dimension."""
        judge_prompt = f"""Rate the following response on "{dimension}".

Prompt: {prompt}
Response: {response}

Criterion: {criteria}

Score (1-5):"""
        
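        result = self.judge_model.generate(judge_prompt)
        score = self._extract_score(result)
        
        return {"score": score, "dimension": dimension}
    
    def _extract_score(self, text):
        """Return the first digit from 1 to 5 found in the judge's reply."""
        for ch in text:
            if ch in "12345":
                return int(ch)
        raise ValueError(f"No 1-5 score found in judge reply: {text!r}")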
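The class above weights every dimension equally. The composite score formula also allows non-uniform weights, as in the minimal sketch below. The function name `weighted_total` and the example scores and weights are assumptions for illustration. The sketch also divides by the weight sum, so it matches the formula exactly only when the weights sum to 1.

```python
def weighted_total(scores, weights):
    """Compute sum_i w_i * S_i, normalized by the sum of the weights used."""
    missing = set(scores) - set(weights)
    if missing:
        raise ValueError(f"No weight given for: {sorted(missing)}")
    weight_sum = sum(weights[dim] for dim in scores)
    if weight_sum <= 0:
        raise ValueError("Weights must sum to a positive value")
    return sum(scores[dim] * weights[dim] for dim in scores) / weight_sum

# Hypothetical per-dimension scores on a 1-5 scale
scores = {"relevance": 4, "accuracy": 5, "clarity": 3}
# Hypothetical weights that emphasize relevance
weights = {"relevance": 0.5, "accuracy": 0.3, "clarity": 0.2}

print(weighted_total(scores, weights))  # 0.5*4 + 0.3*5 + 0.2*3, about 4.1
```

Normalizing by the weight sum keeps the result on the same 1-5 scale as the inputs even when a caller passes unnormalized weights.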

LLM-as-Judge works best when the judge model is significantly stronger than the model being evaluated. GPT-4 is commonly used as the judge, but this creates a dependency on a proprietary model. Open-source alternatives like JudgeLM are emerging.
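Pairwise judging can be sensitive to the order in which the two responses are shown. A common mitigation is to query the judge twice with the order swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `judge_fn(prompt, first, second)` that returns "A", "B", or "Tie". The two stub judges exist only to make the example runnable.

```python
def consistent_pairwise(judge_fn, prompt, response_a, response_b):
    """Judge twice with the order swapped; keep the verdict only if both agree."""
    first = judge_fn(prompt, response_a, response_b)
    second = judge_fn(prompt, response_b, response_a)
    # In the second call the labels are swapped, so map the verdict back
    unswapped = {"A": "B", "B": "A", "Tie": "Tie"}.get(second, "Tie")
    return first if first == unswapped else "Tie"

def always_first(prompt, first, second):
    """Stub judge with total position bias: always picks the first response."""
    return "A"

def prefers_longer(prompt, first, second):
    """Stub judge with no position bias: picks the longer response."""
    if len(first) == len(second):
        return "Tie"
    return "A" if len(first) > len(second) else "B"

print(consistent_pairwise(always_first, "q", "x", "y"))
print(consistent_pairwise(prefers_longer, "q", "short", "a longer answer"))
```

The position-biased stub yields "Tie" because its two verdicts contradict each other. The order-independent stub yields the same winner in both passes. The cost is two judge calls per comparison.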

G-Eval

Overview

Definition: G-Eval

G-Eval (Liu et al., 2023) prompts a judge model to generate evaluation steps via chain-of-thought, then computes the final score by weighting each candidate score by the probability the judge assigns to its token. It generalizes across tasks without task-specific training.

Implementation

class GEval:
    """G-Eval: Generalized evaluation with chain-of-thought."""
    
    def __init__(self, judge_model):
        self.judge_model = judge_model
    
    def evaluate(self, prompt, response, criteria):
        """Evaluate using G-Eval methodology."""
        
        # Step 1: Generate evaluation steps
        steps = self._generate_steps(criteria)
        
        # Step 2: Evaluate each step
        step_scores = []
        for step in steps:
            score = self._evaluate_step(prompt, response, step)
            step_scores.append(score)
        
        # Step 3: Aggregate scores (None if the judge produced no usable steps)
        final_score = sum(step_scores) / len(step_scores) if step_scores else None
        
        return {
            "score": final_score,
            "steps": steps,
            "step_scores": step_scores
        }
    
    def _generate_steps(self, criteria):
        """Generate evaluation steps using chain-of-thought."""
        prompt = f"""Generate 5 evaluation steps for assessing: {criteria}

Steps:"""
        
        steps_text = self.judge_model.generate(prompt)
        steps = [s.strip() for s in steps_text.split("\n") if s.strip()]
        
        return steps[:5]
    
    def _evaluate_step(self, prompt, response, step):
        """Evaluate a single step and compute probability-based score."""
        eval_prompt = f"""Prompt: {prompt}
Response: {response}

Evaluation step: {step}

Does the response satisfy this criterion?
Answer:"""
        
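        # Get next-token probabilities for "Yes" vs "No".
        # get_token_probs is assumed to return a dict mapping token -> probability.
        probs = self.judge_model.get_token_probs(eval_prompt)
        
        yes_prob = probs.get("Yes", 0.0)
        no_prob = probs.get("No", 0.0)
        
        # Convert P(Yes) to a 1-5 score; fall back to the midpoint
        # when neither token received any probability mass
        if yes_prob + no_prob == 0:
            return 3.0
        score = 1 + 4 * (yes_prob / (yes_prob + no_prob))
        
        return score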
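The class above scores each step with a Yes/No probability, which is a simplification. The probability-weighted scoring described for G-Eval can instead be sketched as an expected value over the score tokens "1" through "5". The function name `expected_score` and the example distribution are assumptions for illustration. Renormalizing over the score tokens is also a choice made here, not something the source specifies.

```python
def expected_score(token_probs, scale=(1, 2, 3, 4, 5)):
    """Probability-weighted score: sum over s of p(s) * s, renormalized
    over the tokens that are valid scores."""
    mass = {s: token_probs.get(str(s), 0.0) for s in scale}
    total = sum(mass.values())
    if total == 0:
        return None  # the judge put no probability on any score token
    return sum(s * p for s, p in mass.items()) / total

# Hypothetical next-token distribution from a judge model
probs = {"1": 0.0, "2": 0.1, "3": 0.2, "4": 0.5, "5": 0.2}
print(expected_score(probs))  # 0.1*2 + 0.2*3 + 0.5*4 + 0.2*5, about 3.8
```

Compared with taking the single most likely score, the expected value produces continuous scores, which reduces ties when many outputs would otherwise all receive the same integer.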

G-Eval Variants

| Variant | Description | Advantage |
| G-Eval (CoT) | Chain-of-thought steps | Interpretable |
| G-Eval (BoN) | Best-of-N sampling | More robust |
| G-Eval (ensemble) | Multiple judges | Higher reliability |

Reference-Based Metrics

BLEU (Bilingual Evaluation Understudy)

BLEU Score

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

Here,

  • \text{BP} = Brevity penalty
  • w_n = Weight for n-gram (typically 1/N)
  • p_n = Modified n-gram precision
  • N = Maximum n-gram order (typically 4)

import math
from collections import Counter

def compute_bleu(reference, candidate, max_n=4):
    """Compute BLEU score between reference and candidate."""
    
    # Tokenize
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    # Compute modified precision for each n-gram
    precisions = []
    
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(zip(*[ref_tokens[i:] for i in range(n)]))
        cand_ngrams = Counter(zip(*[cand_tokens[i:] for i in range(n)]))
        
        # Clipped counts
        clipped = sum(
            min(count, ref_ngrams.get(ngram, 0))
            for ngram, count in cand_ngrams.items()
        )
        total = sum(cand_ngrams.values())
        
        precision = clipped / total if total > 0 else 0
        precisions.append(precision)
    
    # Brevity penalty (an empty candidate scores 0 and would divide by zero)
    if not cand_tokens:
        return 0
    bp = min(1, math.exp(1 - len(ref_tokens) / len(cand_tokens)))
    
    # Geometric mean
    if all(p > 0 for p in precisions):
        log_avg = sum(math.log(p) for p in precisions) / len(precisions)
        bleu = bp * math.exp(log_avg)
    else:
        bleu = 0
    
    return bleu
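To see how the brevity penalty and the geometric mean combine, the BLEU formula can be applied directly to precomputed precisions. The sketch below assumes uniform weights w_n = 1/N. The function name `bleu_from_precisions` and the input numbers are illustrative assumptions.

```python
import math

def bleu_from_precisions(precisions, ref_len, cand_len):
    """Apply BLEU = BP * exp(sum_n w_n * log p_n) with uniform weights."""
    if not precisions or cand_len == 0 or any(p <= 0 for p in precisions):
        return 0.0  # log(0) is undefined, so any zero precision gives BLEU = 0
    weight = 1.0 / len(precisions)
    log_mean = sum(weight * math.log(p) for p in precisions)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c)
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_mean)

# Hypothetical precisions for 1-grams through 4-grams
precisions = [0.75, 0.5, 0.5, 0.25]
print(bleu_from_precisions(precisions, ref_len=10, cand_len=9))
```

Because the mean is geometric, one low precision (here the 4-gram precision) pulls the score down much more than it would under an arithmetic mean.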

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE-L

F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}

Here,

  • P = Precision
  • R = Recall
  • \beta = Balance parameter (typically 1.2)

def lcs_length(x, y):
    """Compute length of longest common subsequence."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i-1] == y[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]

def compute_rouge_l(reference, candidate):
    """Compute ROUGE-L score."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    lcs = lcs_length(ref_tokens, cand_tokens)
    
    precision = lcs / len(cand_tokens) if cand_tokens else 0
    recall = lcs / len(ref_tokens) if ref_tokens else 0
    
    beta = 1.2
    if precision + recall > 0:
        f_score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    else:
        f_score = 0
    
    # Note: with beta = 1.2 the "f1" entry is an F-beta score that weights recall above precision
    return {"precision": precision, "recall": recall, "f1": f_score}
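The LCS table above uses memory proportional to the product of the two lengths. For long texts, one possible variant keeps only two rows of the table. This is a sketch of that optimization (the function name `lcs_length_two_rows` is an assumption), followed by a small worked example of the ROUGE-L precision and recall.

```python
def lcs_length_two_rows(x, y):
    """LCS length using two rows of the DP table instead of the full table."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter sequence along the row
    prev = [0] * (len(y) + 1)
    for token_x in x:
        curr = [0]
        for j, token_y in enumerate(y, start=1):
            if token_x == token_y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(y)]

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

lcs = lcs_length_two_rows(reference, candidate)
print(lcs)                   # LCS is "the cat on the mat", length 5
print(lcs / len(candidate))  # precision = 5/6
print(lcs / len(reference))  # recall = 5/6
```

Only the length is needed for ROUGE-L, so discarding earlier rows loses nothing. Recovering the subsequence itself would require the full table.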

BERTScore

Definition: BERTScore

BERTScore computes semantic similarity using contextual embeddings from BERT. It matches tokens between reference and candidate based on embedding similarity rather than exact match.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

class BERTScore:
    """BERTScore evaluation metric."""
    
    def __init__(self, model_name="microsoft/deberta-xlarge-mnli"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def compute(self, references, candidates):
        """Compute BERTScore between references and candidates."""
        # Tokenize
        ref_inputs = self.tokenizer(
            references, padding=True, truncation=True, return_tensors="pt"
        )
        cand_inputs = self.tokenizer(
            candidates, padding=True, truncation=True, return_tensors="pt"
        )
        
        # Get embeddings
        with torch.no_grad():
            ref_embeddings = self.model(**ref_inputs).last_hidden_state
            cand_embeddings = self.model(**cand_inputs).last_hidden_state
        
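        # Compute cosine similarity
        ref_embeddings = F.normalize(ref_embeddings, dim=-1)
        cand_embeddings = F.normalize(cand_embeddings, dim=-1)
        
        # Token-level similarity, shape (batch, cand_len, ref_len)
        similarity = torch.bmm(cand_embeddings, ref_embeddings.transpose(1, 2))
        
        # Mask padding positions so they are never chosen as a best match
        ref_mask = ref_inputs["attention_mask"].bool()
        cand_mask = cand_inputs["attention_mask"].bool()
        similarity = similarity.masked_fill(~ref_mask.unsqueeze(1), -1.0)
        similarity = similarity.masked_fill(~cand_mask.unsqueeze(2), -1.0)
        
        # Max similarity for each token, averaged over non-padding tokens only
        cand_best = similarity.max(dim=2)[0]
        ref_best = similarity.max(dim=1)[0]
        precision = (cand_best * cand_mask).sum(dim=1) / cand_mask.sum(dim=1)
        recall = (ref_best * ref_mask).sum(dim=1) / ref_mask.sum(dim=1)
        
        # F1 score (clamped to avoid division by zero)
        f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)
        
        return {
            "precision": precision.mean().item(),
            "recall": recall.mean().item(),
            "f1": f1.mean().item()
        }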
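The greedy matching step is easier to follow on toy vectors than on real model embeddings. The sketch below uses hand-written 2-dimensional vectors as stand-ins for token embeddings. The vectors and function names are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision: each candidate token takes its best reference match.
    Recall: each reference token takes its best candidate match."""
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

# Toy "embeddings": the candidate covers only one of two reference tokens
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidate_vecs = [[1.0, 0.0]]

print(greedy_match_scores(candidate_vecs, reference_vecs))
```

Here precision is 1.0 because the single candidate token has a perfect match, while recall is 0.5 because the second reference token has no similar candidate token.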

Reference-Free Metrics

Perplexity

Perplexity

\text{PPL} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})\right)

Here,

  • T = Sequence length
  • P(x_t \mid x_{<t}) = Model probability for token t
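Given the per-token log-probabilities of a sequence, the perplexity formula reduces to a few lines. This is a minimal sketch that assumes natural-log probabilities have already been obtained from a language model. The function name `perplexity` is illustrative.

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/T) * sum_t log P(x_t | x_<t)), using natural logs."""
    if not token_log_probs:
        raise ValueError("Need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 options, so its perplexity is about 4
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

This gives the usual intuition for the metric: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens.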

Self-BLEU

Definition: Self-BLEU

Self-BLEU measures diversity by computing BLEU between generated samples. Lower Self-BLEU indicates higher diversity.

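def compute_self_bleu(samples, max_n=4):
    """Compute Self-BLEU for diversity measurement (pairwise approximation)."""
    from itertools import combinations
    
    scores = []
    for i, j in combinations(range(len(samples)), 2):
        # BLEU is asymmetric, so score each pair in both directions
        scores.append(compute_bleu(samples[i], samples[j], max_n))
        scores.append(compute_bleu(samples[j], samples[i], max_n))
    
    # Fewer than two samples means there is nothing to compare
    return sum(scores) / len(scores) if scores else 0.0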

Fréchet Inception Distance (FID) Adaptation

For text generation, adapted versions of FID use text embeddings instead of image features. The Fréchet distance measures the difference between the distribution of generated texts and the distribution of reference texts in embedding space.
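FID fits a Gaussian to each set of embeddings and computes ||mu_g - mu_r||^2 + Tr(S_g + S_r - 2 (S_g S_r)^(1/2)). The sketch below makes a strong simplifying assumption, diagonal covariance matrices, under which the trace term reduces to the sum of squared differences of per-dimension standard deviations. The function names and the toy 2-dimensional "embeddings" are illustrative. A full implementation would use complete covariance matrices and a matrix square root.

```python
import math

def mean_and_std(vectors):
    """Per-dimension mean and (population) standard deviation."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
        for d in range(dim)
    ]
    return means, stds

def frechet_distance_diagonal(generated, reference):
    """Fréchet distance between two Gaussians, ASSUMING diagonal covariances."""
    mu_g, sd_g = mean_and_std(generated)
    mu_r, sd_r = mean_and_std(reference)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_g, mu_r))
    # For diagonal covariances, Tr(S_g + S_r - 2 (S_g S_r)^(1/2))
    # equals sum_d (sd_g[d] - sd_r[d])^2
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_g, sd_r))
    return mean_term + cov_term

# Toy embeddings: the generated set is the reference set shifted by 3 in dimension 0
reference_emb = [[0.0, 0.0], [2.0, 2.0]]
generated_emb = [[3.0, 0.0], [5.0, 2.0]]
print(frechet_distance_diagonal(generated_emb, reference_emb))
```

In the toy example the spreads are identical, so the distance comes entirely from the shift in the mean. Identical sets give a distance of 0.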

Choosing Evaluation Methods

| Method | Correlation with Human | Cost | Speed | Use Case |
| LLM-as-Judge (GPT-4) | High | High | Moderate | Gold standard automated |
| G-Eval | High | Moderate | Moderate | General evaluation |
| BERTScore | Moderate | Low | Fast | Quick comparison |
| BLEU/ROUGE | Low-Moderate | Very Low | Very Fast | Translation/summarization |
| Perplexity | Low | Very Low | Very Fast | Language modeling |

For production evaluation, combine multiple methods: use LLM-as-Judge for a sample of outputs, BERTScore for all outputs, and perplexity for ongoing monitoring. This balances cost, speed, and accuracy.
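Running the judge on only a sample of outputs requires deciding which outputs to send. One simple option is a seeded random sample, sketched below. The function name `pick_judge_sample`, the 10% default fraction, and the seed are all assumptions for illustration.

```python
import random

def pick_judge_sample(num_outputs, fraction=0.1, seed=0):
    """Return sorted indices of the outputs to send to the LLM judge."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if num_outputs == 0:
        return []
    # Always judge at least one output
    k = max(1, round(num_outputs * fraction))
    # A dedicated Random instance keeps the sample reproducible across runs
    return sorted(random.Random(seed).sample(range(num_outputs), k))

indices = pick_judge_sample(200, fraction=0.1, seed=42)
print(len(indices))  # 20 outputs go to the judge; cheap metrics can cover all 200
```

Fixing the seed means the same outputs are judged on every run, which makes judge scores comparable over time when the evaluation set itself does not change.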

Evaluation Pipeline

class AutomatedEvaluationPipeline:
    """Complete automated evaluation pipeline."""
    
    def __init__(self, judge_model=None):
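        # LLMJudge also takes the model under evaluation, which this pipeline does not need
        self.judge = LLMJudge(judge_model, eval_model=None) if judge_model else None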
        self.bert_scorer = BERTScore()
    
    def evaluate(self, model_outputs, references=None, prompts=None):
        """Run comprehensive automated evaluation."""
        results = {}
        
        # 1. Reference-based metrics (if references available)
        if references:
            results["bertscore"] = self.bert_scorer.compute(
                references, model_outputs
            )
            results["rouge"] = [
                compute_rouge_l(ref, cand)
                for ref, cand in zip(references, model_outputs)
            ]
        
        # 2. Reference-free metrics (Self-BLEU needs at least two outputs)
        if len(model_outputs) > 1:
            results["self_bleu"] = compute_self_bleu(model_outputs)
        
        # 3. LLM-as-Judge (if prompts available and judge configured)
        if self.judge and prompts:
            judge_scores = []
            for prompt, output in zip(prompts, model_outputs):
                score = self.judge.evaluate_single(prompt, output)
                # Skip outputs whose judge reply could not be parsed
                if score["score"] is not None:
                    judge_scores.append(score["score"])
            
            if judge_scores:
                results["llm_judge"] = {
                    "mean": sum(judge_scores) / len(judge_scores),
                    "scores": judge_scores
                }
        
        # 4. Aggregate
        results["overall"] = self._aggregate_results(results)
        
        return results
    
    def _aggregate_results(self, results):
        """Aggregate all metrics into overall score."""
        scores = []
        
        if "llm_judge" in results:
            scores.append(results["llm_judge"]["mean"] / 5.0)
        
        if "bertscore" in results:
            scores.append(results["bertscore"]["f1"])
        
        if scores:
            return sum(scores) / len(scores)
        return 0

Practice Exercises

  1. Conceptual: Explain the trade-offs between reference-based metrics (BLEU, ROUGE) and LLM-as-Judge. When would you prefer one over the other?

  2. Mathematical: Compute the BLEU score for a candidate sentence with 10 tokens against a reference with 12 tokens, where the modified precisions are 0.8, 0.6, 0.4, and 0.3 for n-grams 1-4.

  3. Practical: Implement a G-Eval evaluator for summarization quality and compare its correlation with human judgments against ROUGE scores.

  4. Research: Investigate the bias in LLM-as-Judge evaluations. Do judges prefer their own outputs or outputs from similar models? How can this bias be mitigated?

Key Takeaways:

  • LLM-as-Judge (GPT-4) provides scalable evaluation correlating well with human judgment
  • G-Eval uses chain-of-thought to generate evaluation steps automatically
  • BLEU and ROUGE are fast but correlate weakly with human judgment for open-ended tasks
  • BERTScore captures semantic similarity better than exact match metrics
  • Combine multiple evaluation methods for robust assessment
  • Self-BLEU measures output diversity (lower = more diverse)
  • Perplexity measures language modeling quality but not task performance
  • Automated evaluation should be validated against human judgments periodically

What to Learn Next

-> Human Evaluation of LLMs: Chatbot Arena, preference studies, and annotation.

-> LLM Evaluation Frameworks: lm-eval-harness, OpenCompass, and HELM.

-> LLM Evaluation Benchmarks: Understanding standard benchmarks for LLMs.

-> RLHF and Alignment: Training models to align with human preferences.

-> Constitutional AI: Training safe and aligned language models.

-> DPO and Preference Optimization: Direct preference optimization for alignment.
