Model Evaluation

Evaluation Metrics

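# BLEU Score
from nltk.translate.bleu_score import corpus_bleu

def calculate_bleu(references, hypotheses):
    # references: for each hypothesis, a list of tokenized reference sentences
    # hypotheses: one tokenized sentence (a list of tokens) per example
    return corpus_bleu(references, hypotheses)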
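
To make the BLEU number less of a black box, here is a minimal sketch of its core ingredient, the clipped (modified) n-gram precision. It is not NLTK's implementation: the helper names `ngrams` and `clipped_precision` are made up for illustration, and the brevity penalty and the geometric mean over n-gram orders are left out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(references, hypothesis, n=1):
    """Clipped n-gram precision of one hypothesis against its references."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    # An n-gram is credited at most as often as it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

refs = [["the", "cat", "is", "on", "the", "mat"]]
hyp = ["the"] * 7
print(clipped_precision(refs, hyp))  # 2/7: "the" is credited only twice
```

Without clipping, the degenerate hypothesis above would score a perfect unigram precision, which is why BLEU caps each n-gram at its reference count.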

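# ROUGE Score
from rouge_score import rouge_scorer

def calculate_rouge(reference, hypothesis):
    # ROUGE-1 and ROUGE-2 count unigram and bigram overlap; ROUGE-L uses the longest common subsequence
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # score() takes the reference (target) first, then the prediction,
    # and returns a dict of Score(precision, recall, fmeasure) keyed by metric name
    return scorer.score(reference, hypothesis)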
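
As a rough illustration of what the `rougeL` entry measures, the sketch below computes ROUGE-L from the longest common subsequence (LCS) of the two token sequences. It is a simplification under stated assumptions: it splits on whitespace and does no stemming, so its numbers will not always match `rouge_score`, and the names `lcs_length` and `rouge_l` are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, hypothesis):
    """ROUGE-L precision, recall, and F-measure (whitespace tokens, no stemming)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return {"precision": 0.0, "recall": 0.0, "fmeasure": 0.0}
    precision = lcs / len(hyp)  # share of the hypothesis covered by the LCS
    recall = lcs / len(ref)     # share of the reference covered by the LCS
    fmeasure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))
```

Here the LCS is "the cat on the mat" (5 tokens) and both sentences have 6 tokens, so precision, recall, and F-measure all come out to 5/6.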

# BERTScore
from bert_score import score

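def calculate_bertscore(references, hypotheses):
    # score() takes the candidates first, then the references,
    # and returns per-pair precision, recall, and F1 tensors
    P, R, F1 = score(hypotheses, references, lang="en", verbose=True)
    # Report the mean F1 over all pairs as a single float
    return F1.mean().item()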

LLM-as-Judge

def llm_judge(question, response_a, response_b, judge_model):
    prompt = f"""You are an expert judge evaluating AI responses.

Question: {question}

Response A: {response_a}

Response B: {response_b}

Which response is better? Consider:
1. Accuracy and factual correctness
2. Completeness and thoroughness
3. Clarity and readability
4. Helpfulness

Output your judgment as "A" or "B" followed by a brief explanation."""

    # judge_model is assumed to expose generate(prompt) and return the judgment as text
    return judge_model.generate(prompt)
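
Since the prompt asks for "A" or "B" followed by an explanation, the caller still has to pull the verdict out of free text. Pairwise judges can also favor a response because of where it appears in the prompt, so one common precaution is to ask twice with the order swapped and accept only a consistent verdict. The sketch below shows one way to do both. `judge_fn` is a hypothetical callable that takes `(question, response_a, response_b)` and returns the judge's raw text, and `parse_verdict` and `judge_both_orders` are illustrative names rather than part of any library.

```python
import re

def parse_verdict(text):
    """Return 'A' or 'B' if the judgment starts with that label, else None."""
    match = re.match(r'\s*["\']?([AB])\b', text)
    return match.group(1) if match else None

def judge_both_orders(question, response_a, response_b, judge_fn):
    """Judge twice with the responses swapped; keep only a consistent verdict."""
    first = parse_verdict(judge_fn(question, response_a, response_b))
    second = parse_verdict(judge_fn(question, response_b, response_a))
    # The second call saw the responses under swapped labels, so map it back.
    second = {"A": "B", "B": "A"}.get(second)
    if first is not None and first == second:
        return first
    return "inconsistent"

# Stub judge for demonstration only: always prefers the first position.
def biased_judge(question, a, b):
    return "A. The first response is clearer."

print(judge_both_orders("What is 2 + 2?", "4", "5", biased_judge))  # inconsistent
```

With the function above, `judge_fn` could be as simple as `lambda q, a, b: llm_judge(q, a, b, judge_model)`.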

Benchmarks

Benchmark	Task	Metric
MMLU	Multitask knowledge	Accuracy
HumanEval	Code generation	pass@k
GSM8K	Grade-school math	Accuracy
TruthfulQA	Truthfulness	% Truthful
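
Of the metrics in the table, pass@k is the least self-explanatory: it is the probability that at least one of k sampled solutions to a problem passes its unit tests. When n samples are drawn per problem and c of them pass, the unbiased estimator introduced with HumanEval is 1 - C(n-c, k) / C(n, k). Below is a minimal sketch of that formula. The function names are illustrative, and the benchmark score is assumed to be the mean over problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sampling budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of failing samples.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

print(pass_at_k(10, 3, 1))                     # about 0.3
print(mean_pass_at_k([(10, 3), (10, 0)], 1))   # about 0.15
```

For k = 1 the estimate reduces to the plain pass rate c / n. Larger k rewards models that solve a problem at least occasionally.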

Summary

Effective evaluation combines automatic metrics (BLEU, ROUGE, BERTScore), LLM-as-judge comparisons, human judgment, and standard benchmarks, because no single measure captures model quality on its own.

Next: We'll explore hallucination detection techniques.
