Data Quality and Curation for LLMs

Data Quality and Curation — The Hidden Driver of LLM Performance

Data quality matters more than data quantity. A smaller, high-quality dataset consistently outperforms a larger, noisy one. This guide covers the science of data curation for LLMs.

  • Deduplication — Removing duplicates can improve performance by 10%+ on downstream tasks
  • Data Mixing — The ratio of code, math, books, and web text dramatically affects capabilities
  • Quality Filtering — Perplexity-based and classifier-based filtering separate signal from noise

The quality of what you feed the model determines the quality of what it produces.

Data quality is arguably the most important factor in LLM performance, yet it receives far less attention than model architecture or training algorithms. The "Scaling Data-Constrained Language Models" paper found that repeating data for up to roughly 4 epochs costs little compared with training on fresh data, and that the value of further repetition then diminishes quickly. Careful curation is often reported to reach comparable performance with substantially fewer tokens, although the size of that saving depends on the corpus and the benchmark.

Definition: Data Quality

Data quality in the context of LLM training encompasses multiple dimensions: factual accuracy, linguistic quality, deduplication, diversity, recency, and domain coverage. High-quality training data directly translates to better downstream performance, fewer hallucinations, and more reliable reasoning.

The Data Quality Hierarchy

Level 1: Basic Filtering

Remove clearly low-quality data:

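def basic_quality_filter(document):
    """Return True if the document passes cheap heuristic checks."""
    # Length bounds in characters: drop fragments and very long dumps.
    if len(document) < 100 or len(document) > 100000:
        return False
    words = document.split()
    if len(words) < 20:
        return False
    # Share of letters among non-whitespace characters. Counting spaces in
    # the denominator would reject ordinary prose at a 0.8 threshold.
    non_space = [c for c in document if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < 0.8:
        return False
    # Average number of occurrences per distinct word; high values mean
    # heavy repetition.
    word_repetition = len(words) / len(set(words))
    if word_repetition > 3.0:
        return False
    return True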

Level 2: Perplexity-Based Filtering

Definition: Perplexity Filtering

Use a pre-trained language model to score document quality by perplexity. Documents with very high perplexity (noisy, incoherent) or very low perplexity (repetitive, templated) are filtered out.

Document Perplexity

$$\text{PPL}(d) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{<i})\right)$$

Here,

  • $d$ = Document
  • $N$ = Number of tokens in the document
  • $P(x_i \mid x_{<i})$ = Language model probability of token $i$ given context

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

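def compute_perplexity(doc, model, tokenizer, max_length=512):
    # Only the first max_length tokens are scored; longer documents are truncated.
    inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=max_length)
    # The loss needs at least two tokens (one to predict from, one to predict).
    if inputs["input_ids"].shape[1] < 2:
        return float("inf")
    with torch.no_grad():
        # Passing labels makes the model return the mean next-token cross-entropy.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(outputs.loss).item()

def perplexity_filter(documents, model_name="gpt2", low_thresh=10, high_thresh=1000):
    # Thresholds depend on the scoring model and corpus; tune them on a sample.
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    filtered = []
    for doc in documents:
        ppl = compute_perplexity(doc, model, tokenizer)
        # Keep documents inside the band; an infinite score is always dropped.
        if low_thresh <= ppl <= high_thresh:
            filtered.append(doc)
    return filtered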
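
To make the formula concrete, here is a minimal sketch that computes perplexity directly from per-token probabilities, with no model involved. The probabilities are made-up values for illustration, and `perplexity_from_logprobs` is a hypothetical helper name, not part of any library.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to each token of a 4-token document.
probs = [0.5, 0.25, 0.5, 0.25]
ppl = perplexity_from_logprobs([math.log(p) for p in probs])
print(round(ppl, 4))  # 2.8284, the inverse geometric mean of the probabilities
```

A model that assigns every token probability 1/10 has perplexity 10, which is why perplexity is often read as the effective number of equally likely choices per token.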

Level 3: Classifier-Based Filtering

Definition: Classifier-Based Quality Filtering

Train a binary classifier on a small labeled dataset of high-quality (Wikipedia, books) vs. low-quality (web crawl) documents. Use the classifier to score and filter the full training corpus.

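def classifier_quality_score(document, classifier):
    # predict_proba returns one row per input; column 1 is taken to be the
    # probability of the high-quality class.
    prob = classifier.predict_proba([document])[0][1]
    return prob

# `corpus` (an iterable of strings) and `quality_classifier` (a fitted
# scikit-learn-style text pipeline) are assumed to be defined elsewhere.
high_quality_docs = [
    doc for doc in corpus
    if classifier_quality_score(doc, quality_classifier) > 0.7
]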

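The GPT-3 paper trained a classifier with curated sources (WebText, Wikipedia, and a books corpus) as positive examples and unfiltered Common Crawl as negative examples, then used its scores to filter Common Crawl toward "WebText-like" documents. The authors describe this filtering as one of the steps they took to raise the average quality of the dataset.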
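
The GPT-3 paper also describes keeping documents stochastically rather than with a hard cutoff: a document is kept when a Pareto-distributed sample exceeds `1 - document_score`, with shape parameter 9. The sketch below reproduces that rule with the standard library, assuming `score` is a classifier probability in [0, 1]; `keep_document` and `keep_rate` are hypothetical names used only for this illustration.

```python
import random

def keep_document(score, alpha=9.0, rng=random):
    """Keep the document if a Lomax(alpha) sample exceeds 1 - score."""
    # random.paretovariate returns values >= 1; subtracting 1 gives the
    # Lomax (Pareto II) distribution that numpy.random.pareto samples from.
    sample = rng.paretovariate(alpha) - 1.0
    return sample > 1.0 - score

def keep_rate(score, trials=20000, seed=0):
    """Empirical fraction of documents kept at a given classifier score."""
    rng = random.Random(seed)
    kept = sum(keep_document(score, rng=rng) for _ in range(trials))
    return kept / trials

for s in (0.0, 0.5, 0.9):
    # The exact keep probability for this rule is (2 - s) ** -alpha.
    print(s, round(keep_rate(s), 3), round((2.0 - s) ** -9.0, 3))
```

The design choice here is that high-scoring documents are kept most of the time while a small share of low-scoring documents still gets through, which preserves some out-of-distribution text that a hard threshold would discard.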

Deduplication

Why Deduplication Matters

Definition: Training Data Deduplication

Deduplication removes exact and near-duplicate documents from the training corpus. Duplicated data causes: (1) memorization instead of generalization, (2) biased gradients toward repeated content, (3) wasted compute on redundant examples.

| Deduplication Method | Type | Precision | Speed |
|---|---|---|---|
| Exact hash | Exact | 100% | Very fast |
| MinHash LSH | Near-exact | ~95% | Fast |
| SimHash | Approximate | ~90% | Very fast |
| SemDeDup | Semantic | ~85% | Slow |

Exact Deduplication

import hashlib

def exact_dedup(documents):
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    return unique_docs
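
Hashing raw strings treats documents that differ only in case or whitespace as distinct. A common refinement, sketched here as an assumption rather than as part of the method above, is to normalize text before hashing. The normalization steps chosen (NFKC, lowercasing, whitespace collapsing) and the function names are illustrative.

```python
import hashlib
import re
import unicodedata

def normalize_text(text):
    """Canonicalize Unicode forms, lowercase, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def exact_dedup_normalized(documents):
    """Keep the first document seen for each normalized-content hash."""
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        doc_hash = hashlib.sha256(normalize_text(doc).encode("utf-8")).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)  # the original, unnormalized text is kept
    return unique_docs

docs = ["Hello  World", "hello world\n", "Something else"]
print(exact_dedup_normalized(docs))  # ['Hello  World', 'Something else']
```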

Near-Duplicate Detection with MinHash

Definition: MinHash LSH

MinHash creates compact fingerprints of sets (e.g., shingle sets of documents). Locality-Sensitive Hashing (LSH) groups similar MinHash signatures into buckets, enabling efficient near-duplicate detection.

Jaccard Similarity

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Here,

  • $A$ = Shingle set of document 1
  • $B$ = Shingle set of document 2

from datasketch import MinHash, MinHashLSH

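def get_shingles(doc, k=5):
    # Word-level k-grams (an assumed definition; character shingles also work).
    words = doc.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_dedup(documents, threshold=0.8, num_perm=128):
    # The LSH index returns candidates whose estimated Jaccard similarity
    # is likely to be at or above the threshold.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique_docs = []
    for i, doc in enumerate(documents):
        m = MinHash(num_perm=num_perm)
        for shingle in get_shingles(doc, k=5):
            m.update(shingle.encode())
        result = lsh.query(m)
        # Keep and index the document only if no similar one was kept before.
        if not result:
            lsh.insert(str(i), m)
            unique_docs.append(doc)
    return unique_docs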
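
To show why MinHash signatures can stand in for Jaccard similarity, the following standard-library sketch computes the exact Jaccard similarity of two shingle sets and compares it with the fraction of matching positions in their MinHash signatures. Seeded SHA-1 hashes stand in for random permutations, which is a simplification; the helper names, the toy sentences, and the choice of 3-word shingles are assumptions for illustration.

```python
import hashlib

def word_shingles(text, k=3):
    """Set of word-level k-grams (k=3 keeps the toy example readable)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def seeded_hash(seed, item):
    """One hash function per seed, used in place of a random permutation."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingles, num_perm=256):
    """For each hash function, record the minimum hash over the set."""
    return [min(seeded_hash(seed, s) for s in shingles) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
a, b = word_shingles(doc_a), word_shingles(doc_b)
exact = jaccard(a, b)  # 8 shared shingles out of 14 in the union
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} estimate={approx:.3f}")
```

The estimate becomes tighter as `num_perm` grows, at the cost of longer signatures; LSH then buckets these signatures so that only likely matches are compared.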

Semantic Deduplication

Definition: Semantic Deduplication

Semantic deduplication removes documents that are semantically similar (same meaning) even if they use different words. This goes beyond surface-level duplication to remove truly redundant training examples.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

def semantic_dedup(documents, model_name="all-MiniLM-L6-v2", eps=0.3):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(documents, show_progress_bar=True)
    clustering = DBSCAN(eps=eps, min_samples=2, metric="cosine")
    clusters = clustering.fit_predict(embeddings)
    unique_docs = []
    seen_clusters = set()
    for doc, cluster_id in zip(documents, clusters):
        if cluster_id == -1 or cluster_id not in seen_clusters:
            unique_docs.append(doc)
            if cluster_id != -1:
                seen_clusters.add(cluster_id)
    return unique_docs
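
Density-based clustering can chain loosely related documents into one cluster. A simpler alternative, sketched here under the assumption that embeddings are already computed, is a greedy pass that keeps a document only if its cosine similarity to every previously kept document is below a threshold. The toy 2-D vectors and the 0.9 threshold are illustrative, and the pairwise loop is quadratic, so this form only suits small corpora.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def greedy_semantic_dedup(documents, embeddings, threshold=0.9):
    """Keep a document only if it is not too similar to any kept document."""
    kept_docs, kept_vecs = [], []
    for doc, vec in zip(documents, embeddings):
        if all(cosine_similarity(vec, kept) < threshold for kept in kept_vecs):
            kept_docs.append(doc)
            kept_vecs.append(vec)
    return kept_docs

# Hypothetical embeddings: the first two point in nearly the same direction.
docs = ["How do I reset my password?", "Password reset steps", "Best pizza in town"]
vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(greedy_semantic_dedup(docs, vecs))
# ['How do I reset my password?', 'Best pizza in town']
```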

Data Mixing

Domain Mixing Ratios

Definition: Data Mixing

Data mixing determines the proportion of training data drawn from each domain (web, code, books, academic papers, etc.). The mixing ratio significantly affects model capabilities: more code data tends to improve coding ability, and more book data tends to improve long-form reasoning.

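| Mixing Ratio (Web:Code:Books:Academic) | Coding | Reasoning | Knowledge | Instruction Following |
|---|---|---|---|---|
| 100:0:0:0 | Baseline | Baseline | Baseline | Baseline |
| 80:10:5:5 | +5% | +3% | +2% | +4% |
| 60:20:10:10 | +15% | +8% | +5% | +10% |
| 40:30:20:10 | +25% | +12% | +8% | +15% |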

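The "Scaling Data-Constrained Language Models" paper found that code is a useful complement to natural-language text: when text data is limited, mixing in code can extend the effective dataset without hurting natural-language tasks. Code data is also widely reported to help reasoning and logical thinking, even in fairly small proportions (5-10%).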
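
In practice a mixing ratio is applied by sampling the source domain of each training example in proportion to its weight. The sketch below assumes the 60:20:10:10 ratio from the table and uses hypothetical helper names; a real pipeline would then draw a document from the chosen domain's shard.

```python
import random
from collections import Counter

def normalize_mix(ratios):
    """Turn ratio parts such as 60:20:10:10 into probabilities that sum to 1."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("mixing ratios must sum to a positive number")
    return {domain: part / total for domain, part in ratios.items()}

def sample_domains(ratios, n, seed=0):
    """Draw the source domain for each of n training examples."""
    weights = normalize_mix(ratios)
    rng = random.Random(seed)
    domains = list(weights)
    return rng.choices(domains, weights=[weights[d] for d in domains], k=n)

mix = {"web": 60, "code": 20, "books": 10, "academic": 10}
counts = Counter(sample_domains(mix, 10000))
for domain in mix:
    print(domain, counts[domain] / 10000)  # close to 0.6, 0.2, 0.1, 0.1
```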

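Optimal Mixing with DoReMi

Definition: DoReMi

DoReMi (Domain Reweighting with Minimax Optimization) learns domain weights automatically. It first trains a small reference model, then trains a small proxy model with a distributionally robust objective that upweights domains where the proxy's loss most exceeds the reference model's loss. The resulting domain weights are then used to resample the data for training the full-size model.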

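def doremi_weights(domain_losses, target_loss):
    # Simplified heuristic, not the full DoReMi procedure: weight each domain
    # by its excess loss over a target (for example, a reference model's loss).
    weights = {}
    for domain, loss in domain_losses.items():
        # Domains whose loss is still above the target receive more weight.
        weights[domain] = max(0, loss - target_loss)
    total = sum(weights.values())
    if total > 0:
        weights = {k: v / total for k, v in weights.items()}
    else:
        # No domain exceeds the target: fall back to uniform weights.
        weights = {k: 1.0 / len(weights) for k in weights}
    return weights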
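
For comparison, the DoReMi paper updates domain weights multiplicatively from per-domain excess loss (proxy loss minus reference loss, clipped at zero), renormalizes, and mixes in a little of the uniform distribution. The following is a minimal sketch of one such update step under those assumptions; the loss values are invented, and the `step_size` and `smoothing` defaults are illustrative rather than prescribed.

```python
import math

def doremi_update(weights, proxy_losses, reference_losses,
                  step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient style update of the domain weights."""
    # Excess loss: how much worse the proxy is than the reference per domain.
    excess = {d: max(proxy_losses[d] - reference_losses[d], 0.0) for d in weights}
    # Multiplicative update: domains with larger excess loss grow faster.
    unnormalized = {d: weights[d] * math.exp(step_size * excess[d]) for d in weights}
    total = sum(unnormalized.values())
    uniform = 1.0 / len(weights)
    # Renormalize, then smooth toward uniform so no domain collapses to zero.
    return {
        d: (1.0 - smoothing) * unnormalized[d] / total + smoothing * uniform
        for d in weights
    }

# Hypothetical per-domain losses for a single training step.
weights = {"web": 1 / 3, "code": 1 / 3, "books": 1 / 3}
proxy = {"web": 3.0, "code": 2.0, "books": 2.5}
reference = {"web": 2.5, "code": 2.2, "books": 2.5}
updated = doremi_update(weights, proxy, reference)
print({d: round(w, 3) for d, w in updated.items()})
```

Only "web" has positive excess loss here, so its weight rises while the other two shrink equally. In the paper's procedure this step is repeated throughout proxy training and the weights are averaged over steps.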

Data Attribution

Measuring Data Influence

Definition: Data Attribution

Data attribution methods measure the influence of individual training examples on model behavior. This enables understanding which data points are most valuable, identifying harmful data, and debugging model failures.

Influence Function

$$\mathcal{I}(z_{\text{test}}, z_i) = -\nabla_{\theta} L(\theta, z_{\text{test}})^{\top} H^{-1} \nabla_{\theta} L(\theta, z_i)$$

Here,

  • $\mathcal{I}$ = Influence of upweighting training example $z_i$ on the test loss
  • $z_{\text{test}}$ = Test example
  • $z_i$ = Training example
  • $H$ = Hessian of the training loss with respect to $\theta$

For LLMs, exact influence functions are intractable. Practical approximations include: gradient similarity, TracIn, and data Shapley values computed on smaller proxy models.
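
As a small illustration of the TracIn idea, the sketch below scores a training example by summing, over saved checkpoints, the learning rate times the dot product of its loss gradient with the test example's loss gradient. It uses a toy linear model with squared loss so the gradients can be written by hand; the checkpoints, learning rate, and examples are invented for illustration.

```python
def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tracin_score(checkpoints, learning_rate, train_example, test_example):
    """Checkpoint-summed gradient similarity between a train and a test example."""
    score = 0.0
    for w in checkpoints:
        g_train = grad_squared_loss(w, *train_example)
        g_test = grad_squared_loss(w, *test_example)
        score += learning_rate * dot(g_train, g_test)
    return score

checkpoints = [[0.0, 0.0], [0.5, 0.5]]  # hypothetical saved weights
test_example = ([1.0, 0.0], 1.0)
helpful = ([1.0, 0.0], 1.0)     # same input and label as the test example
unrelated = ([0.0, 1.0], 1.0)   # orthogonal input
harmful = ([1.0, 0.0], -1.0)    # same input, conflicting label

for name, example in [("helpful", helpful), ("unrelated", unrelated), ("harmful", harmful)]:
    print(name, tracin_score(checkpoints, 0.1, example, test_example))
```

A positive score marks an example whose gradient steps tend to reduce the test loss, a negative score marks one that tends to increase it, and a score near zero marks an example with little effect on that test point.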

Data Quality Metrics

Comprehensive Quality Score

Composite Quality Score

$$Q(d) = w_1 \cdot S_{\text{ppl}}(d) + w_2 \cdot S_{\text{cls}}(d) + w_3 \cdot S_{\text{div}}(d) + w_4 \cdot S_{\text{len}}(d)$$

Here,

  • $Q(d)$ = Composite quality score for document $d$
  • $S_{\text{ppl}}$ = Perplexity-based quality score
  • $S_{\text{cls}}$ = Classifier-based quality score
  • $S_{\text{div}}$ = Diversity score (novelty vs corpus)
  • $S_{\text{len}}$ = Length/complexity score
  • $w_i$ = Weights for each component
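
The composite score is a plain weighted sum, which can be sketched as follows. This assumes each component score has already been scaled to [0, 1] and that the weights sum to 1 so that the result stays in the same range; the specific weights and scores are made-up values.

```python
def composite_quality(scores, weights):
    """Weighted sum of component scores, each expected to lie in [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must have the same components")
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {name!r} is outside [0, 1]")
    return sum(weights[name] * scores[name] for name in weights)

# Hypothetical weights and component scores for one document.
weights = {"ppl": 0.3, "cls": 0.4, "div": 0.2, "len": 0.1}
scores = {"ppl": 0.8, "cls": 0.9, "div": 0.5, "len": 1.0}
q = composite_quality(scores, weights)
print(round(q, 2))  # 0.8
```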

Practice Exercises

  1. Deduplication Analysis: Given a corpus of 1M documents, estimate the size reduction after exact deduplication, MinHash LSH (threshold=0.8), and semantic deduplication. What is the typical deduplication rate for web crawl data?

  2. Data Mixing Experiment: Design a data mixing strategy for an LLM that must excel at (a) conversational AI, (b) code generation, and (c) mathematical reasoning. Justify your domain proportions.

  3. Quality Filter Design: Design a multi-stage quality filtering pipeline for Common Crawl data. What filters would you apply, in what order, and what retention rates would you target?

  4. Data Attribution: How would you identify training examples that cause a specific hallucination in an LLM? Describe a practical approach using influence functions or similar methods.

Key Takeaways

Summary: Data Quality and Curation

  • Data quality > data quantity — curated datasets outperform larger, noisy ones
  • Deduplication prevents memorization and improves generalization (10%+ improvement)
  • Perplexity filtering removes noisy and repetitive documents effectively
  • Classifier-based filtering learns quality standards from labeled examples
  • Semantic deduplication removes meaning duplicates, not just exact copies
  • Data mixing ratios directly determine model capabilities (code, reasoning, knowledge)
  • DoReMi learns domain weights automatically using a small proxy model
  • Data attribution identifies which training examples influence specific behaviors
  • Multi-stage filtering pipelines are standard practice for web-scale data

What to Learn Next

-> Synthetic Data Generation: Using LLMs to create high-quality training data for themselves.

-> Pretraining Language Models: The fundamentals of training language models on large corpora.

-> Distributed Training for LLMs: Scaling training across hundreds of GPUs with parallelism strategies.

-> Scaling Laws and Chinchilla: How data quantity and quality interact with model scale.

-> Curriculum Learning for LLMs: Strategic ordering of training data for improved learning.

-> Knowledge Distillation for LLMs: Using teacher models to generate quality training signals.
