Instruction Tuning

Instruction tuning is the process of fine-tuning a language model on (instruction, response) pairs to improve its ability to follow human instructions. It bridges the gap between pre-training on raw text and aligning with user intent.

What is Instruction Tuning?

Instruction tuning fine-tunes a pre-trained language model on datasets of structured instructions paired with desired responses, so that the model learns to follow instructions it has not seen during training.

The key insight: pre-training teaches the model to continue arbitrary text, whereas instruction tuning applies the same next-token objective to curated (instruction, response) data, so that the most likely continuation of an instruction becomes a response that carries it out.

Mathematical Formulation

Instruction Tuning Loss

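\mathcal{L}_{\text{IT}} = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P_\theta(y_{i,t} \mid y_{i,<t}, x_i)

Here,

  • N = number of (instruction, response) pairs in the dataset
  • T_i = number of tokens in the i-th response
  • x_i = the i-th instruction
  • y_{i,t} = the t-th token of the i-th response (y_{i,<t} denotes the tokens before it)
  • \theta = the model parameters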

Unlike pre-training, instruction tuning computes loss only on the response tokens, not the instruction tokens. The instruction provides context but is not trained to be predicted.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

def compute_instruction_loss(model, batch, tokenizer):
    """Compute loss only on response tokens."""
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]
    
    # Find where response starts (after instruction)
    # Labels are -100 for instruction tokens
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels
    )
    
    return outputs.loss

def prepare_instruction_data(
    instruction: str,
    response: str,
    tokenizer,
    max_length: int = 512
) -> dict:
    """Prepare instruction-response pair with proper masking."""
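    # Format: Instruction: {instruction}\nResponse: {response}<eos>
    # Appending EOS teaches the model where a response ends; many causal-LM
    # tokenizers (e.g. GPT-2) do not add it automatically.
    eos = tokenizer.eos_token or ""
    full_text = f"Instruction: {instruction}\nResponse: {response}{eos}"
    instruction_text = f"Instruction: {instruction}\nResponse:"
    
    full_encoded = tokenizer(full_text, truncation=True, max_length=max_length)
    instruction_encoded = tokenizer(instruction_text, truncation=True, max_length=max_length)
    
    # Create labels: mask instruction tokens with -100.
    # Assumes the prompt tokenizes to a prefix of the full sequence, which holds
    # for typical BPE tokenizers that add no trailing special token.
    labels = list(full_encoded["input_ids"])
    instruction_length = min(len(instruction_encoded["input_ids"]), len(labels))
    labels[:instruction_length] = [-100] * instruction_length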
    
    full_encoded["labels"] = labels
    return full_encoded
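
To see what the -100 mask does to the loss, the sketch below recomputes the objective by hand in plain Python. The per-token log-probabilities are made-up numbers and `masked_nll` is a hypothetical helper, not a library function. Note that the formula above sums over response tokens, whereas common cross-entropy implementations average over the unmasked positions, so both values are returned.

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss

def masked_nll(token_log_probs, labels):
    """Return (sum, mean) of negative log-likelihood over unmasked positions."""
    kept = [-lp for lp, label in zip(token_log_probs, labels) if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("every position is masked; no response tokens to score")
    total = sum(kept)
    return total, total / len(kept)

# Illustrative values: three instruction tokens followed by two response tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.5), math.log(0.5), math.log(0.25)]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 7, 9]

total, mean = masked_nll(log_probs, labels)
print(f"sum={total:.4f} mean={mean:.4f}")
```

Only the last two positions contribute, so the poorly predicted instruction tokens (probabilities 0.1 and 0.2) have no effect on the result.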

FLAN (Fine-tuned LAnguage Net)

FLAN demonstrated the power of multi-task instruction tuning:

FLAN (Wei et al., 2022) fine-tuned a 137B-parameter LaMDA-PT model on 62 NLP datasets converted to instruction format. The key finding: instruction tuning on diverse tasks improves zero-shot performance on unseen tasks. Follow-up work (Chung et al., 2022) scaled the same recipe to more than 1,800 tasks and applied it to PaLM and T5.

FLAN Task Formats

| Task Type | Format Example |
|---|---|
| Classification | "Is the following sentence positive or negative? {text}" |
| Summarization | "Summarize the following article: {article}" |
| QA | "Based on the context, answer: {question}\nContext: {context}" |
| Translation | "Translate the following from English to French: {text}" |

flan_task_templates = {
    "sentiment": "Is the following review positive or negative? Review: {input}\nAnswer:",
    "summarization": "Summarize the following text in one sentence: {input}\nSummary:",
    "question_answering": "Based on the following passage, answer the question.\nPassage: {passage}\nQuestion: {input}\nAnswer:",
    "translation": "Translate the following sentence to French: {input}\nTranslation:",
    "classification": "Classify the following text into one of these categories: {labels}\nText: {input}\nCategory:",
    "reasoning": "Answer the following question step by step.\nQuestion: {input}\nSolution:"
}
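
A template becomes a training example once its placeholders are filled and a target is attached. The following is a minimal sketch of that step; `render_example` is a hypothetical helper, and the two templates are copied from the dictionary above so the snippet runs on its own.

```python
# A two-entry subset of the templates, repeated so the snippet is self-contained.
templates = {
    "sentiment": "Is the following review positive or negative? Review: {input}\nAnswer:",
    "translation": "Translate the following sentence to French: {input}\nTranslation:",
}

def render_example(task: str, target: str, **fields: str) -> dict:
    """Fill the task template and pair it with the target response."""
    prompt = templates[task].format(**fields)
    return {"instruction": prompt, "response": target, "task": task}

example = render_example("sentiment", "positive", input="A warm, funny film.")
print(example["instruction"])
print(example["response"])
```

Writing several differently worded templates per task and sampling among them is one way to keep the model from overfitting to a single phrasing.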

InstructGPT

InstructGPT (Ouyang et al., 2022) established the three-stage training paradigm:

Stage 1: Supervised Fine-Tuning (SFT)

Train on human-written demonstrations:

def train_sft_stage(model, dataset, tokenizer, epochs=3):
    """Stage 1: Supervised Fine-Tuning on human demonstrations."""
    training_args = TrainingArguments(
        output_dir="./sft_model",
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_ratio=0.1,
        fp16=True
    )
    
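    # No compute_metrics is passed: none is defined here, and the training
    # loss on response tokens is the signal of interest for SFT.
    # Recent transformers releases rename the tokenizer argument to
    # processing_class; adjust to the installed version.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer
    )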
    
    trainer.train()
    return trainer.model
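
Batches passed to the trainer need sequences of equal length, and for instruction tuning the padded label positions must also be excluded from the loss. Below is a minimal sketch of that padding step in plain Python, assuming examples shaped like the output of `prepare_instruction_data`. The name `pad_batch` and the pad id of 0 are illustrative choices; in practice a library collator would do the same thing on tensors.

```python
IGNORE_INDEX = -100  # padded label positions are excluded from the loss

def pad_batch(examples, pad_token_id=0):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        length = len(ex["input_ids"])
        pad = max_len - length
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * length + [0] * pad)
        # Padding positions get -100 so they never contribute to the loss.
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * pad)
    return batch

examples = [
    {"input_ids": [5, 6, 7, 8], "labels": [-100, -100, 7, 8]},
    {"input_ids": [5, 9], "labels": [-100, 9]},
]
batch = pad_batch(examples)
print(batch["labels"])
```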

Stage 2: Reward Model Training

Train a reward model on human preference comparisons between pairs of responses:

class RewardModelTrainer:
    def __init__(self, reward_model, tokenizer):
        self.model = reward_model
        self.tokenizer = tokenizer
    
    def compute_reward_loss(self, chosen, rejected):
        """Train reward model to rank chosen > rejected."""
        chosen_rewards = self.model(chosen).logits
        rejected_rewards = self.model(rejected).logits
        
        # Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).
        # logsigmoid avoids the underflow of log(sigmoid(x)) for large negative gaps.
        loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards)
        return loss.mean()
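
A few hand-computed values make the behavior of the pairwise ranking loss easier to see. This is a scalar sketch in plain Python rather than the batched tensor version above; `pairwise_ranking_loss` is a hypothetical helper and the reward values are arbitrary.

```python
import math

def pairwise_ranking_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    gap = chosen_reward - rejected_reward
    # -log(sigmoid(g)) == log(1 + exp(-g)); branch on the sign to avoid overflow.
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))

for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(chosen, rejected, round(pairwise_ranking_loss(chosen, rejected), 4))
```

The loss is small when the chosen response scores well above the rejected one, equals log 2 (about 0.693) when the two are tied, and grows roughly linearly when the ranking is wrong. Only the difference between the two rewards matters, so the reward scale is not pinned down by this loss alone.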

Stage 3: PPO Optimization

Optimize the policy against the reward model, with a KL penalty that keeps it close to the SFT model:

InstructGPT PPO Objective

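\max_{\pi_\theta} \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)} \left[ R_\phi(x, y) - \beta \, \text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{SFT}}(\cdot|x)) \right]

Here,

  • \pi_\theta = the policy (language model) being optimized
  • \pi_{\text{SFT}} = the frozen Stage 1 model, used as the reference policy
  • R_\phi(x, y) = the Stage 2 reward model's score for response y to prompt x, with x drawn from the prompt dataset D
  • \beta = coefficient controlling the strength of the KL penalty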
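
In practice the KL term is usually estimated from the sampled response itself, using the per-token log-probabilities under the two models. The sketch below shows that arithmetic for a single response. `penalized_reward` is a hypothetical helper, the numbers are invented, and the single-sample estimate is one common choice rather than the only one.

```python
def penalized_reward(reward, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward minus beta times a single-sample KL estimate for one response.

    The estimate is sum_t [log pi_theta(y_t | ...) - log pi_SFT(y_t | ...)],
    taken over the tokens of a response sampled from the policy.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return reward - beta * kl_estimate

# Illustrative numbers: the policy gives its own sample more probability
# than the SFT model does, so the penalty lowers the reward.
value = penalized_reward(
    reward=1.0,
    policy_logprobs=[-1.0, -2.0],
    sft_logprobs=[-1.5, -2.5],
    beta=0.1,
)
print(value)
```

A larger beta keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller beta allows more drift and more room for the policy to exploit flaws in the reward model.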

Dataset Construction

Self-Instruct

Self-Instruct generates instruction-following data using the model itself:

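import random
from typing import Dict, List

class SelfInstructGenerator:
    def __init__(self, model, tokenizer, seed_tasks: List[str]):
        self.model = model
        self.tokenizer = tokenizer
        self.seed_tasks = seed_tasks
    
    def _generate(self, prompt: str, max_tokens: int = 64) -> str:
        """Minimal sampling helper, assuming a Hugging Face causal LM."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(
            **inputs, max_new_tokens=max_tokens, do_sample=True, top_p=0.9
        )
        # Decode only the newly generated tokens, not the prompt.
        new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()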
    
    def generate_instructions(self, num_instructions: int = 1000) -> List[Dict]:
        """Generate instruction-following data."""
        generated = []
        
        for _ in range(num_instructions):
            # Sample seed tasks for context
            context = random.sample(self.seed_tasks, k=3)
            
            prompt = f"""Here are some example instructions:
{chr(10).join(f'- {t}' for t in context)}

Generate a new, different instruction that is similar in complexity:"""
            
            instruction = self._generate(prompt)
            
            # Generate response
            response_prompt = f"Instruction: {instruction}\nResponse:"
            response = self._generate(response_prompt, max_tokens=256)
            
            if self._is_valid(instruction, response):
                generated.append({
                    "instruction": instruction,
                    "response": response
                })
        
        return generated
    
    def _is_valid(self, instruction: str, response: str) -> bool:
        """Validate generated data."""
        # Check length
        if len(instruction) < 10 or len(response) < 20:
            return False
        
        # Check for repetition
        if instruction == response:
            return False
        
        # Crude completeness check: require at least one sentence-ending period
        if response.count(".") < 1:
            return False
        
        return True
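
The validity checks above look at each pair in isolation, so nothing stops the generator from producing near-duplicates of instructions it already has. The Self-Instruct paper addresses this by keeping a new instruction only if its ROUGE-L similarity to every existing one is below 0.7. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as a rough stand-in for ROUGE-L, so the scores are not comparable to the paper's; `is_novel` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import List

def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """True if the candidate's word-level similarity to every pooled instruction is below threshold."""
    cand_words = candidate.lower().split()
    for existing in pool:
        ratio = SequenceMatcher(None, cand_words, existing.lower().split()).ratio()
        if ratio >= threshold:
            return False
    return True

pool = ["Write a poem about the sea"]
print(is_novel("Write a poem about the ocean", pool))         # near-duplicate of the pooled instruction
print(is_novel("Translate this sentence into German", pool))  # shares no words with the pool
```

A check like this would run before a generated instruction is appended to the pool, with accepted instructions added to `pool` so later candidates are compared against them too.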

Evol-Instruct

Evol-Instruct progressively makes instructions more complex:

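Evol-Instruct (Xu et al., 2023) creates diverse, high-quality instruction data by iteratively "evolving" seed instructions. In-depth evolving makes an instruction harder by adding constraints, deepening the question, concretizing general concepts, increasing the number of reasoning steps, or complicating the input. In-breadth evolving writes a new instruction in the same domain to widen topic coverage. The code below uses four simplified evolution prompts in this spirit.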

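from typing import Dict, List

class EvolInstructGenerator:
    def __init__(self, model, tokenizer):
        # self._generate(prompt) is assumed to be a helper that wraps
        # self.model.generate and returns the decoded completion as a string.
        self.model = model
        self.tokenizer = tokenizer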
        
        self.evolution_prompts = {
            "deepen": "Add more constraints or requirements to this instruction: {instruction}",
            "widen": "Add more context or background information to this instruction: {instruction}",
            "concretize": "Make this instruction more specific and concrete: {instruction}",
            "reasoning": "Modify this instruction to require multi-step reasoning: {instruction}"
        }
    
    def evolve(self, instruction: str, evolution_type: str) -> str:
        """Evolve instruction using specified type."""
        prompt = self.evolution_prompts[evolution_type].format(instruction=instruction)
        
        evolved = self._generate(prompt)
        return evolved
    
    def generate_dataset(
        self,
        seed_instructions: List[str],
        num_evolutions: int = 3,
        evolution_factor: int = 2
    ) -> List[Dict]:
        """Generate evolved instruction dataset."""
        dataset = []
        
        for seed in seed_instructions:
            current = [seed]
            
            for level in range(num_evolutions):
                next_level = []
                for instr in current:
                    for evo_type in self.evolution_prompts:
                        evolved = self.evolve(instr, evo_type)
                        next_level.append(evolved)
                        dataset.append({
                            "instruction": evolved,
                            "evolution_type": evo_type,
                            "level": level
                        })
                
                # Carry forward only the first evolution_factor instructions
                # so the number of model calls does not grow exponentially.
                current = next_level[:evolution_factor]
        
        return dataset
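
Not every evolution succeeds: the model may return the instruction unchanged, echo part of the evolving prompt, or produce something it cannot answer. The Evol-Instruct paper also describes discarding such failures. The filter below is a simplified illustration of that idea, not a reproduction of the paper's procedure; `keep_evolution`, the marker strings, and the 80-word cutoff are assumptions made for this sketch.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits compare equal."""
    return " ".join(text.lower().split())

def keep_evolution(original: str, evolved: str, response: str) -> bool:
    """Heuristic filter for failed evolutions (simplified, illustrative rules)."""
    # 1. The evolved instruction is unchanged, so nothing was gained.
    if _normalize(original) == _normalize(evolved):
        return False
    # 2. Part of the evolving prompt leaked into the output.
    leaked_markers = ("given prompt", "rewritten prompt")
    if any(marker in evolved.lower() for marker in leaked_markers):
        return False
    # 3. The model apologized in a short answer, which suggests a refusal.
    if "sorry" in response.lower() and len(response.split()) < 80:
        return False
    # 4. The response has no alphanumeric content at all.
    if not any(ch.isalnum() for ch in response):
        return False
    return True

seed = "Sort a list."
evolved = "Sort a list of tuples by the second item without using sorted()."
print(keep_evolution(seed, evolved, "Use an insertion sort keyed on item[1]."))
print(keep_evolution(seed, evolved, "Sorry, I cannot help with that."))
```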

Multi-Task vs Single-Task Tuning

Multi-Task Instruction Tuning

from datasets import load_dataset

multi_task_dataset = {
    "classification": load_dataset("glue", "sst2"),
    "summarization": load_dataset("cnn_dailymail", "3.0.0"),
    "qa": load_dataset("squad"),
    "translation": load_dataset("wmt14", "fr-en"),
    "code": load_dataset("code_x_glue_tc_nl_code_search_adv")
}

# Combine all tasks with instruction format
def format_multi_task_dataset(datasets, task_formats):
    combined = []
    
    for task_name, dataset in datasets.items():
        format_fn = task_formats[task_name]
        for example in dataset["train"]:
            instruction, response = format_fn(example)
            combined.append({
                "instruction": instruction,
                "response": response,
                "task": task_name
            })
    
    return combined
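
Concatenating every example, as above, lets the largest dataset dominate training. One common remedy, used in the T5 and FLAN lines of work, is examples-proportional mixing with a cap on each dataset's effective size. The sketch below computes the resulting sampling weights; `mixing_weights` is a hypothetical helper, and the dataset sizes and cap value are illustrative assumptions.

```python
def mixing_weights(dataset_sizes: dict, cap: int) -> dict:
    """Examples-proportional sampling weights with each dataset's size capped."""
    capped = {name: min(size, cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}

# Hypothetical sizes; the cap is a tunable assumption.
sizes = {"classification": 1_000, "summarization": 100_000}
weights = mixing_weights(sizes, cap=3_000)
print(weights)
```

Without the cap the summarization set would supply about 99% of the batches; with it the share drops to 75%, and the small classification task is seen far more often.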

Single-Task Instruction Tuning

Focus on one domain for specialized performance:

single_task_configs = {
    "medical": {
        "dataset": "medical_dialog",
        "instruction_format": "Based on the patient's symptoms, provide a diagnosis: {symptoms}",
        "response_format": "Diagnosis: {diagnosis}\nReasoning: {reasoning}"
    },
    "legal": {
        "dataset": "legal_contract_qa",
        "instruction_format": "Analyze the following legal clause: {clause}\nQuestion: {question}",
        "response_format": "Analysis: {analysis}\nAnswer: {answer}"
    },
    "code": {
        "dataset": "code_instruct",
        "instruction_format": "Write code to: {task_description}\nLanguage: {language}",
        "response_format": "```{language}\n{code}\n```\nExplanation: {explanation}"
    }
}

Multi-task instruction tuning improves generalization across tasks, while single-task tuning excels in domain-specific applications. Choose based on your use case: multi-task for general assistants, single-task for specialized domains.

Training Best Practices

instruction_training_config = {
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 3,
    "warmup_ratio": 0.03,
    "weight_decay": 0.1,
    "max_seq_length": 512,
    "gradient_accumulation_steps": 4,
    "fp16": True,
    "logging_steps": 100,
    "save_strategy": "epoch",
    "evaluation_strategy": "epoch"
}
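
Several of these settings interact: the batch size and gradient accumulation together determine how many examples feed each optimizer step, which in turn fixes the number of steps and the length of the warmup. Here is a sketch of that arithmetic, assuming `batch_size` is a per-device value, a single device, and a hypothetical dataset of 52,000 examples. Rounding warmup up with `math.ceil` is a choice made for this example, and `schedule_summary` is a hypothetical helper.

```python
import math

def schedule_summary(num_examples, per_device_batch, grad_accum, num_devices,
                     epochs, warmup_ratio):
    """Derive optimizer-step counts from the batch and schedule settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

summary = schedule_summary(
    num_examples=52_000,  # hypothetical dataset size
    per_device_batch=32,
    grad_accum=4,
    num_devices=1,
    epochs=3,
    warmup_ratio=0.03,
)
print(summary)
```

With these numbers each optimizer step sees 128 examples, training runs for 1,221 steps, and the learning rate warms up over the first 37 of them.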

Common Pitfalls

| Pitfall | Symptom | Solution |
|---|---|---|
| Overfitting | Train loss ↓, eval loss ↑ | Early stopping, dropout |
| Underfitting | Both losses plateau high | Increase epochs, reduce regularization |
| Catastrophic forgetting | Good on target, bad on general | Lower learning rate, mix general data |
| Data contamination | High eval accuracy, poor real perf | Use held-out test sets |
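
Early stopping, the first remedy in the table, comes down to tracking the best evaluation loss and halting once it has failed to improve for a set number of epochs. The following is a minimal sketch of that logic on a list of per-epoch losses; `early_stopping` and the loss values are illustrative, and training libraries typically provide an equivalent callback.

```python
def early_stopping(eval_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch eval losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the last recorded epoch.
    return best_epoch, len(eval_losses) - 1

# Eval loss improves for three epochs, then rises: the overfitting pattern.
best, stopped = early_stopping([1.00, 0.80, 0.70, 0.75, 0.90, 0.95], patience=2)
print(best, stopped)
```

Training halts at epoch 4 (zero-indexed) and the checkpoint from epoch 2 is the one to keep, which is why saving a checkpoint every epoch pairs naturally with this rule.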

Summary

  • Instruction tuning trains LLMs on (instruction, response) pairs to follow human intent
  • Loss is computed only on response tokens, not instruction tokens
  • FLAN showed multi-task instruction tuning improves zero-shot generalization
  • InstructGPT established the SFT → Reward Model → PPO training paradigm
  • Self-Instruct generates data using the model itself
  • Evol-Instruct progressively increases instruction complexity
  • Multi-task tuning improves generalization; single-task tuning excels in domains

Practice Exercises

  1. Instruction Dataset: Create an instruction dataset with 100 examples covering 5 different task types.

  2. SFT Training: Fine-tune a small language model (e.g., GPT-2) on instruction data. Compare before and after on held-out instructions.

  3. Self-Instruct: Implement Self-Instruct to generate 500 instruction-response pairs. Evaluate quality.

  4. Evol-Instruct: Use Evol-Instruct to create a dataset with increasing complexity levels. Analyze difficulty distribution.

  5. Multi-Task Evaluation: Train on 3 tasks and evaluate zero-shot on 5 unseen tasks. Compare with single-task training.

