Synthetic Data Generation for LLMs

Synthetic Data Generation — Teaching LLMs to Teach Themselves

Synthetic data enables LLMs to generate their own training data, breaking the ceiling of human-created datasets. This guide covers self-instruct, evol-instruct, and quality-controlled synthesis.

  • Self-Instruct — Seed tasks generate new instructions automatically
  • Evol-Instruct — Iteratively complexify instructions for capability growth
  • Quality Control — Filtering, verification, and diversity enforcement

The best data for training an AI is the data that AI itself creates — if properly curated.

Synthetic data generation has become one of the most widely used techniques for improving LLM capabilities. Alpaca, Vicuna, and WizardLM were all fine-tuned on model-generated instruction or conversation data, and each followed instructions markedly better than its base model. LLMs can also generate reasoning traces, code examples, and mathematical proofs that, once verified, can serve as training data.

Definition: Synthetic Data Generation

Synthetic data generation is the process of using a language model (often a larger, more capable one) to generate training examples — instructions, responses, reasoning traces, or code — that are then used to train or fine-tune a target model.

The Self-Instruct Pipeline

Definition: Self-Instruct

Self-Instruct (Wang et al., 2023) generates new instruction-following data by using a small set of seed tasks to prompt an LLM to generate new instructions, then generates input-output pairs for each instruction.

Architecture Diagram
Seed Tasks (175 examples)
    |
    v
[Instruction Generation] -> New Instructions
    |
    v
[Classification] -> Classify as classification/generation
    |
    v
[Input Generation] -> Generate inputs for each instruction
    |
    v
[Output Generation] -> Generate outputs using LLM
    |
    v
[Filtering] -> Remove low-quality, duplicates, near-matches
    |
    v
Final Synthetic Dataset

Implementation

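# Note: openai.ChatCompletion is the legacy (openai<1.0) SDK interface;
# newer SDK versions expose the same call through a client object.
import json
import random

import openai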

class SelfInstructPipeline:
    def __init__(self, model="gpt-4", seed_tasks_path="seed_tasks.json"):
        self.model = model
        # Use a context manager so the seed file is closed after loading
        with open(seed_tasks_path) as f:
            self.seed_tasks = json.load(f)
    
    def generate_instruction(self, seed_examples):
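        # Build the examples outside the prompt f-string: a backslash inside an
        # f-string expression is a syntax error before Python 3.12.
        examples = "\n".join(
            f"Task: {t['task']}\nInstruction: {t['instruction']}" for t in seed_examples
        )
        prompt = f"""Below are some example tasks and their instructions:

{examples}

Generate a new, creative task and instruction that is different from the above:

Task: """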
        
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content.strip()
    
    def generate_input_output(self, instruction, task_type):
        prompt = f"""Generate a detailed input and the correct output for this instruction:

Instruction: {instruction}
Task Type: {task_type}

Input:"""
        
        response = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        content = response.choices[0].message.content
        return self._parse_input_output(content)
    
    def filter_duplicates(self, new_tasks, existing_tasks, threshold=0.8):
        filtered = []
        for task in new_tasks:
            is_duplicate = False
            for existing in existing_tasks:
                similarity = self._compute_similarity(task["instruction"], existing["instruction"])
                if similarity > threshold:
                    is_duplicate = True
                    break
            if not is_duplicate:
                filtered.append(task)
        return filtered
    
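    def generate_dataset(self, num_samples=10000, max_attempts=None):
        # classify_task, _parse_input_output, and _compute_similarity are not
        # shown in this lesson and must be implemented before this will run.
        dataset = list(self.seed_tasks)
        # Cap the number of attempts so heavy duplication cannot loop forever
        if max_attempts is None:
            max_attempts = 3 * num_samples
        attempts = 0
        while len(dataset) < num_samples and attempts < max_attempts:
            attempts += 1
            seed_examples = random.sample(self.seed_tasks, min(5, len(self.seed_tasks)))
            new_instruction = self.generate_instruction(seed_examples)
            task_type = self.classify_task(new_instruction)
            input_output = self.generate_input_output(new_instruction, task_type)
            new_task = {
                "instruction": new_instruction,
                "input": input_output["input"],
                "output": input_output["output"],
                "task_type": task_type,
                "source": "self_instruct"
            }
            # Keep the new task only if it is not a near-duplicate of the pool
            filtered = self.filter_duplicates([new_task], dataset)
            dataset.extend(filtered)
        return dataset[:num_samples]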
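
The pipeline above calls `_compute_similarity` and `_parse_input_output` without showing them. The sketch below is one possible way to fill those gaps, not the lesson's own implementation. It assumes a ROUGE-L-style overlap score (the Self-Instruct paper describes a ROUGE-L filter, but this simplified version uses whitespace tokens and is only an approximation), and it assumes the model's reply contains a literal "Output:" marker. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def compute_similarity(text_a, text_b):
    """ROUGE-L-style F1 over lowercase whitespace tokens, in [0, 1]."""
    a, b = text_a.lower().split(), text_b.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(b), lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def parse_input_output(content):
    """Split a reply of the form '<input> Output: <output>' into two fields."""
    head, marker, tail = content.partition("Output:")
    if not marker:
        # Assumption: with no marker, treat the whole reply as the output
        return {"input": "", "output": content.strip()}
    return {"input": head.strip(), "output": tail.strip()}
```

Inside the class these would become methods (`self._compute_similarity`, `self._parse_input_output`). A token-level score like this catches reworded duplicates that exact string matching misses, though it costs O(n·m) per pair and may need an index or embedding search for large pools.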

Evol-Instruct: Evolutionary Complexity

Definition: Evol-Instruct

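Evol-Instruct (Xu et al., 2023) iteratively rewrites instructions to increase their complexity. Starting from simple instructions, it applies "evolution" operators to create progressively more challenging tasks. This lesson uses four operators: deepening (adding constraints or requirements), widening (broadening scope, which the paper calls in-breadth evolving), concretizing, and reasoning (requiring more reasoning steps).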

Evolution Operators

class EvolInstruct:
    DEEPEN_PROMPT = """Take the following instruction and make it more complex by adding additional constraints or requirements:

Original: {instruction}

Make it more complex:"""
    
    WIDEN_PROMPT = """Take the following instruction and broaden its scope to require more comprehensive knowledge:

Original: {instruction}

Broaden the scope:"""
    
    CONCRETIZE_PROMPT = """Take the following abstract instruction and make it more specific and concrete:

Original: {instruction}

Make it more specific:"""
    
    REASONING_PROMPT = """Take the following instruction and modify it to require multi-step reasoning:

Original: {instruction}

Require multi-step reasoning:"""
    
    def evolve(self, instruction, operator="deepen"):
        prompts = {
            "deepen": self.DEEPEN_PROMPT,
            "widen": self.WIDEN_PROMPT,
            "concretize": self.CONCRETIZE_PROMPT,
            "reasoning": self.REASONING_PROMPT
        }
        prompt = prompts[operator].format(instruction=instruction)
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7
        )
        return response.choices[0].message.content.strip()
    
    def evolve_dataset(self, initial_instructions, num_rounds=5):
        current = initial_instructions
        all_evolved = list(current)
        for round_num in range(num_rounds):
            evolved = []
            for instruction in current:
                operator = random.choice(["deepen", "widen", "concretize", "reasoning"])
                new_instruction = self.evolve(instruction, operator)
                evolved.append(new_instruction)
                all_evolved.append(new_instruction)
            current = evolved
        return all_evolved

The WizardLM authors used Evol-Instruct to evolve the Alpaca instruction set into roughly 250K instructions of varying complexity. They reported that models fine-tuned on evolved data outperformed models trained on simpler Alpaca-style data, particularly on more difficult instructions.
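
To see how the round structure of `evolve_dataset` grows a dataset without spending API calls, here is a small offline sketch. `fake_evolve` is a hypothetical stand-in for the LLM call that simply tags the instruction with the chosen operator, and the `round` field is an addition for illustration, not part of the original class.

```python
import random

OPERATORS = ["deepen", "widen", "concretize", "reasoning"]


def fake_evolve(instruction, operator):
    """Hypothetical stand-in for the LLM call: append the operator as a tag."""
    return f"{instruction} [{operator}]"


def evolve_with_lineage(initial_instructions, num_rounds, rng):
    """Evolve every instruction once per round and record which round produced it."""
    current = list(initial_instructions)
    records = [{"instruction": text, "round": 0} for text in current]
    for round_num in range(1, num_rounds + 1):
        # Each round rewrites the previous round's outputs, not the seeds
        evolved = [fake_evolve(text, rng.choice(OPERATORS)) for text in current]
        records.extend({"instruction": text, "round": round_num} for text in evolved)
        current = evolved
    return records


seeds = ["Sort a list.", "Explain recursion.", "Sum two numbers."]
records = evolve_with_lineage(seeds, num_rounds=2, rng=random.Random(0))
print(len(records))  # 3 seeds * (1 + 2 rounds) = 9
```

Because each round evolves the previous round's output, an instruction from round k has passed through k operators. Keeping the round number alongside each instruction makes it possible to balance a training set across complexity levels later.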

Synthetic Reasoning Data

Chain-of-Thought Synthesis

Definition: Synthetic CoT Data

Generate chain-of-thought reasoning traces for training data. Use a capable model to produce step-by-step reasoning for problems, then train smaller models on these traces to distill reasoning capabilities.

def generate_cot_data(problems, model="gpt-4"):
    cot_data = []
    for problem in problems:
        prompt = f"""Solve this problem step by step, showing your reasoning:

Problem: {problem['question']}

Solution:"""
        
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
        )
        reasoning = response.choices[0].message.content
        cot_data.append({
            "instruction": problem["question"],
            "output": reasoning,
            "task_type": "reasoning",
            "source": "synthetic_cot"
        })
    return cot_data
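
Generated reasoning traces can reach a wrong answer, so when reference answers are available it is common to keep only the traces whose final answer matches. The sketch below illustrates that filter under two assumptions that are not in the lesson: each trace states its result on a line like "Answer: 12", and reference answers are supplied in a dict keyed by the question text.

```python
import re

# Assumption: the trace states its result as "Answer: <value>" or "Answer = <value>"
ANSWER_PATTERN = re.compile(r"answer\s*[:=]\s*(.+)", re.IGNORECASE)


def extract_final_answer(trace):
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = ANSWER_PATTERN.findall(trace)
    if not matches:
        return None
    return matches[-1].strip().rstrip(".")


def normalize(answer):
    """Drop whitespace, commas, and dollar signs so '1,000' and 1000 compare equal."""
    return re.sub(r"[\s,$]", "", str(answer)).lower()


def keep_verified(cot_samples, reference_answers):
    """Keep only samples whose extracted answer matches the reference answer."""
    kept = []
    for sample in cot_samples:
        reference = reference_answers.get(sample["instruction"])
        predicted = extract_final_answer(sample["output"])
        if reference is None or predicted is None:
            continue  # cannot verify, so drop rather than trust
        if normalize(predicted) == normalize(reference):
            kept.append(sample)
    return kept
```

Dropping unverifiable traces is a conservative choice. It loses some correct data, but it avoids training the smaller model on reasoning that ends in a wrong result.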

Math Problem Synthesis

def generate_math_problems(num_problems, difficulty_level="medium"):
    prompt = f"""Generate {num_problems} math problems at {difficulty_level} difficulty.

For each problem, provide:
1. The problem statement
2. Step-by-step solution
3. Final answer
4. Difficulty rating (1-5)

Format as JSON array.

Example:
[
  {{
    "problem": "If f(x) = 2x^2 + 3x - 5, find f'(2).",
    "solution": "f'(x) = 4x + 3\\nf'(2) = 4(2) + 3 = 11",
    "answer": 11,
    "difficulty": 3
  }}
]

Generate problems:"""
    
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
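    # The reply is not guaranteed to be valid JSON (for example, it may be
    # wrapped in a Markdown fence), so callers should handle json.JSONDecodeError.
    return json.loads(response.choices[0].message.content)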
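
Model replies often wrap JSON in a Markdown fence or omit fields, so it helps to parse defensively before trusting the result. The sketch below is one way to do that; `parse_problem_array` and `REQUIRED_KEYS` are illustrative names, and the 1-5 difficulty check mirrors the prompt above.

```python
import json
import re

FENCE = chr(96) * 3  # three backticks, built this way to keep the snippet fence-safe
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
REQUIRED_KEYS = {"problem", "solution", "answer", "difficulty"}


def parse_problem_array(raw_text):
    """Parse a model reply into a list of well-formed problem dicts."""
    text = raw_text.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        # Keep only the content inside the first fenced block
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    valid = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue  # skip entries with missing fields
        difficulty = item["difficulty"]
        is_int = isinstance(difficulty, int) and not isinstance(difficulty, bool)
        if is_int and 1 <= difficulty <= 5:
            valid.append(item)
    return valid
```

This checks structure only. The stated answer still needs to be verified against the solution, for example by recomputing it or by comparing several independently generated solutions.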

Quality Control for Synthetic Data

Multi-Stage Verification

Definition: Synthetic Data Quality Control

A pipeline of verification steps to ensure synthetic data is accurate, diverse, and useful. Includes: (1) format validation, (2) factual verification, (3) diversity checks, (4) human spot-checks, (5) model-based quality scoring.

class SyntheticDataQualityControl:
    def validate_format(self, sample):
        required_fields = ["instruction", "output", "task_type"]
        for field in required_fields:
            if field not in sample or not sample[field]:
                return False, f"Missing field: {field}"
        if len(sample["instruction"]) < 10:
            return False, "Instruction too short"
        if len(sample["output"]) < 20:
            return False, "Output too short"
        return True, "Valid"
    
    def check_factual_accuracy(self, sample, verifier_model="gpt-4"):
        prompt = f"""Verify the factual accuracy of this response:

Question: {sample['instruction']}
Response: {sample['output']}

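Is the response factually accurate?
Reply with exactly one label on the first line: ACCURATE, MINOR_ERRORS, or MAJOR_ERRORS"""
        
        response = openai.ChatCompletion.create(
            model=verifier_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1
        )
        # Compare only the first line so a trailing explanation or full stop
        # does not turn an ACCURATE verdict into a rejection.
        verdict = response.choices[0].message.content.strip().upper()
        first_line = verdict.splitlines()[0].strip() if verdict else ""
        return first_line.rstrip(".") == "ACCURATE"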
    
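    def check_diversity(self, new_sample, existing_samples, threshold=0.7):
        # compute_similarity is assumed to be a module-level helper that returns
        # a score in [0, 1]; it is not defined in this lesson.
        for existing in existing_samples:
            similarity = compute_similarity(new_sample["instruction"], existing["instruction"])
            if similarity > threshold:
                return False
        return True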
    
    def quality_score(self, sample, scoring_model="gpt-4"):
        prompt = f"""Rate the quality of this training example on a scale of 1-10:

Instruction: {sample['instruction']}
Output: {sample['output']}

Consider:
- Clarity of instruction
- Completeness of response
- Educational value
- Accuracy

Score (1-10):"""
        
        response = openai.ChatCompletion.create(
            model=scoring_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1
        )
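        try:
            # Accept replies such as "8" or "8/10"
            score = int(response.choices[0].message.content.strip().split("/")[0])
            return min(max(score, 1), 10)
        except ValueError:
            # Unparseable reply: fall back to a neutral score, which the
            # default min_score of 7 will reject
            return 5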
    
    def filter_dataset(self, dataset, min_score=7):
        filtered = []
        for sample in dataset:
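            valid, _reason = self.validate_format(sample)
            if not valid:
                continue
            # Run the cheap local diversity check before the paid LLM calls;
            # the accepted set is unchanged because every check must pass.
            if not self.check_diversity(sample, filtered):
                continue
            if not self.check_factual_accuracy(sample):
                continue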
            score = self.quality_score(sample)
            if score >= min_score:
                sample["quality_score"] = score
                filtered.append(sample)
        return filtered
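
The class above automates format, accuracy, diversity, and scoring checks, but the human spot-check stage from the definition has no code. One simple way to approach it is to draw a small random sample per task type for manual review and estimate an error rate from the reviewers' labels. The sketch below is an illustration; `sample_for_review`, `estimated_error_rate`, and the `per_type` parameter are assumptions, not part of the lesson.

```python
import random
from collections import defaultdict


def sample_for_review(dataset, per_type=2, seed=0):
    """Pick up to per_type random samples from each task_type for human review."""
    rng = random.Random(seed)  # fixed seed so the review batch is reproducible
    by_type = defaultdict(list)
    for sample in dataset:
        by_type[sample.get("task_type", "unknown")].append(sample)
    review_batch = []
    for task_type in sorted(by_type):
        group = by_type[task_type]
        review_batch.extend(rng.sample(group, min(per_type, len(group))))
    return review_batch


def estimated_error_rate(review_labels):
    """Share of reviewed samples that a human marked as wrong (False)."""
    if not review_labels:
        return 0.0
    return sum(1 for ok in review_labels if not ok) / len(review_labels)
```

Sampling per task type keeps rare categories from being skipped, which a plain random sample over the whole dataset would tend to do. If the estimated error rate is high for one type, that is a signal to tighten the automated filters for it.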

Model Collapse and Diversity

Definition: Model Collapse

Model collapse occurs when a model is trained on its own outputs iteratively. Over generations, the model's outputs become less diverse, lose tail knowledge, and converge to a narrow distribution. This is a critical risk in synthetic data generation.

Training on purely synthetic data, with no fresh human data, can lead to model collapse. As a rule of thumb, keep human-generated data in the training mix (this guide suggests at least 10-20%); the right share depends on the task and should be validated empirically.
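
One inexpensive signal of narrowing output is the ratio of unique n-grams to total n-grams, often called distinct-n. The sketch below computes it over whitespace tokens; the two example "generations" are made-up toy data, and a falling score is only a hint of collapse, not proof.

```python
def distinct_n(texts, n=2):
    """Ratio of unique word n-grams to total word n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy data: later generations repeat the same phrasing
generation_0 = ["the cat sat on the mat", "a dog ran across the field"]
generation_3 = ["the cat sat on the mat", "the cat sat on the rug"]

print(distinct_n(generation_0))  # 1.0 (all 10 bigrams are unique)
print(distinct_n(generation_3))  # 0.6 (6 unique bigrams out of 10)
```

Tracking this number per generation, on a fixed set of prompts, gives a cheap trend line. It measures surface variety only, so it is best paired with task accuracy on held-out human data.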

Preventing Model Collapse

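def maintain_diversity(synthetic_data, human_data, synthetic_ratio=0.8):
    """Sample a shuffled mix in which synthetic data makes up synthetic_ratio of the result."""
    if not 0 < synthetic_ratio < 1:
        raise ValueError("synthetic_ratio must be strictly between 0 and 1")
    # Largest total for which both pools can supply their share, so a small
    # human pool shrinks the mix instead of silently skewing the ratio.
    total = min(len(synthetic_data) / synthetic_ratio,
                len(human_data) / (1 - synthetic_ratio))
    n_synthetic = min(int(total * synthetic_ratio), len(synthetic_data))
    n_human = min(int(total * (1 - synthetic_ratio)), len(human_data))
    mixed_data = random.sample(synthetic_data, n_synthetic) + random.sample(human_data, n_human)
    random.shuffle(mixed_data)
    return mixed_data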

Cost-Effective Synthetic Data

Using Smaller Models for Generation

Definition: Student-Teacher Synthesis

Use a large, capable model (teacher) to generate high-quality examples, then use those examples to train a smaller model (student). The student can then generate its own synthetic data for self-improvement, at a fraction of the cost.

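def student_teacher_synthesis(teacher_model, student_model, seed_data, num_rounds=3):
    # Pseudocode: generate_with_model, train_model, verify_with_model, and
    # mix_data are placeholders for your own generation, training, and filtering code.
    current_data = seed_data
    for round_num in range(num_rounds):
        # 1. The teacher writes a small batch of high-quality examples
        teacher_examples = generate_with_model(teacher_model, current_data, n=1000)
        # 2. The student is fine-tuned on the teacher's examples
        train_model(student_model, teacher_examples)
        # 3. The cheaper student generates a larger batch
        student_examples = generate_with_model(student_model, current_data, n=5000)
        # 4. The teacher filters the student's output
        verified = verify_with_model(teacher_model, student_examples)
        # 5. Verified examples are blended into the pool for the next round
        current_data = mix_data(current_data, verified, ratio=0.7)
    return current_data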
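
The control flow of the loop above can be tried without any models by passing in plain functions. In the toy sketch below an "example" is a tuple `(a, b, claimed_sum)`, the teacher is always right, the student is wrong half the time, and verification is exact arithmetic. All names are hypothetical, training is skipped entirely, and unlike the pseudocode this variant adds the teacher's examples to the pool directly.

```python
def run_rounds(seed_data, teacher_generate, student_generate, verify, num_rounds=3):
    """Run generate -> verify -> mix rounds and log what each round kept."""
    pool = list(seed_data)
    history = []
    for round_num in range(num_rounds):
        teacher_examples = teacher_generate(pool)
        # Training is skipped in this toy: the student just sees the teacher's examples
        student_examples = student_generate(pool + teacher_examples)
        verified = [ex for ex in student_examples if verify(ex)]
        pool.extend(teacher_examples + verified)
        history.append({
            "round": round_num,
            "teacher": len(teacher_examples),
            "student_kept": len(verified),
            "student_rejected": len(student_examples) - len(verified),
        })
    return pool, history


def toy_teacher(pool):
    n = len(pool)
    return [(n, n, 2 * n)]  # always correct


def toy_student(pool):
    n = len(pool)
    return [(n, 1, n + 1), (n, 2, n + 3)]  # second example is wrong on purpose


def toy_verify(example):
    a, b, claimed = example
    return a + b == claimed


pool, history = run_rounds([(0, 1, 1), (1, 1, 2)], toy_teacher, toy_student, toy_verify)
print(len(pool))  # 2 seeds + 3 rounds * (1 teacher + 1 verified student) = 8
```

Logging the rejected count per round is the useful part of the design: if the share of rejected student examples rises over rounds, the student's output quality is degrading and more teacher or human data is probably needed.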

Practice Exercises

  1. Self-Instruct Design: Design a Self-Instruct pipeline for generating medical QA data. What seed tasks would you use? How would you ensure medical accuracy?

  2. Evol-Instruct Strategy: Create an Evol-Instruct pipeline that evolves simple math problems into multi-step reasoning challenges. What evolution operators would you design?

  3. Quality Control: Design a 5-stage quality control pipeline for synthetic code generation data. What verification steps would you include?

  4. Model Collapse Analysis: If you had a model trained on 100% synthetic data, what metrics would you use to detect model collapse? Design an experiment to measure collapse severity.

Key Takeaways

Summary: Synthetic Data Generation

  • Self-Instruct generates new instructions from seed tasks automatically
  • Evol-Instruct iteratively complexifies instructions for capability growth
  • Synthetic CoT data distills reasoning from large models to smaller ones
  • Quality control requires multi-stage verification (format, facts, diversity, quality)
  • Model collapse is a real risk — always mix in human-generated data
  • Student-teacher synthesis enables cost-effective iterative improvement
  • Diversity enforcement prevents the synthetic data distribution from narrowing
  • Verification is essential — synthetic does not mean automatically correct

What to Learn Next

-> Data Quality and Curation for LLMs: The foundations of data quality, deduplication, and filtering.

-> Instruction Tuning: Training models to follow instructions effectively.

-> Curriculum Learning for LLMs: Strategic ordering of training data for improved learning.

-> Fine-Tuning LLMs: Customizing models for specific tasks and domains.

-> Chain-of-Thought Reasoning: Teaching models to reason step by step.

-> Constitutional AI: Using AI feedback for self-improvement and alignment.
