Practical Fine-tuning
Data Preparation
from datasets import Dataset
from transformers import AutoTokenizer


def prepare_training_data(instructions, model_name):
    """Format data for instruction tuning."""
    # Load the tokenizer once here rather than on every batch inside tokenize().
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    formatted_data = []
    for inst in instructions:
        # Explicit newlines keep the prompt free of stray indentation.
        text = (
            f"### Instruction:\n{inst['instruction']}\n"
            f"### Response:\n{inst['response']}"
        )
        formatted_data.append({"text": text})
    dataset = Dataset.from_list(formatted_data)

    def tokenize(examples):
        # Truncate long examples so every sequence fits in 512 tokens.
        return tokenizer(examples["text"], truncation=True, max_length=512)

    return dataset.map(tokenize, batched=True)
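To make the prompt template concrete, the formatting step can be tried on its own without any tokenizer or dataset library. The following is a minimal sketch: `format_example` and `sample` are hypothetical names introduced for illustration, and the missing-field check is an added assumption, not part of the function above.

```python
def format_example(inst):
    """Render one record with the instruction/response template."""
    # Assumption: records with an empty or missing field should be rejected
    # early rather than silently producing a malformed prompt.
    missing = [key for key in ("instruction", "response") if not inst.get(key)]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return (
        f"### Instruction:\n{inst['instruction']}\n"
        f"### Response:\n{inst['response']}"
    )


# Hypothetical sample record, used only to show the rendered text.
sample = {"instruction": "Translate 'bonjour' to English.", "response": "Hello."}
formatted = format_example(sample)
print(formatted)
```

Printing a few formatted records like this before tokenizing is a cheap way to catch template mistakes.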
Training Configuration
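from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 * 4 = 16 per device
    learning_rate=2e-4,  # LoRA-range value; use 1e-5 to 5e-5 for full fine-tuning
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    fp16=True,  # mixed-precision training; needs a GPU that supports it
    logging_steps=10,
    save_steps=500,  # must be a multiple of eval_steps when load_best_model_at_end=True
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=500,
    load_best_model_at_end=True,
)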
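It can help to work out what these settings imply before launching a run. The sketch below does the arithmetic in plain Python. The function name `training_schedule`, the single-device default, and the 10,000-example dataset are assumptions for illustration. The step counts are approximations, since the trainer's own rounding may differ slightly.

```python
import math


def training_schedule(num_examples, per_device_batch_size=4, grad_accum_steps=4,
                      num_devices=1, num_epochs=3, warmup_ratio=0.03):
    """Approximate the optimizer-step counts implied by the configuration."""
    # Examples consumed per optimizer update.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # Optimizer updates per pass over the data (last partial batch counted).
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    # Warmup covers the first warmup_ratio fraction of all updates.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumed dataset size of 10,000 examples on one device.
schedule = training_schedule(num_examples=10_000)
print(schedule)
```

Under these assumptions the run takes about 1,875 optimizer steps, so `save_steps=500` and `eval_steps=500` would trigger at steps 500, 1000, and 1500. On a much smaller dataset the run could finish before the first evaluation, in which case both values should be lowered.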
Best Practices
| Aspect | Recommendation |
|---|---|
| Data Size | 1K-100K examples |
| Learning Rate | 1e-5 to 5e-5 (full), 1e-4 to 3e-4 (LoRA) |
| Batch Size | 4-16 with gradient accumulation |
| Epochs | 1-3 typically sufficient |
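The table can also be turned into a quick sanity check that runs before training. This is a sketch only: `check_config` and `RECOMMENDED_LR` are hypothetical names, and the ranges are copied from the table as rules of thumb, not hard limits.

```python
# Learning-rate ranges taken from the Best Practices table.
RECOMMENDED_LR = {"full": (1e-5, 5e-5), "lora": (1e-4, 3e-4)}


def check_config(method, learning_rate, num_examples, per_device_batch_size, num_epochs):
    """Return a list of warnings for settings outside the recommended ranges."""
    warnings = []
    low, high = RECOMMENDED_LR[method]
    if not low <= learning_rate <= high:
        warnings.append(
            f"learning_rate {learning_rate} is outside {low} to {high} for {method} tuning"
        )
    if not 1_000 <= num_examples <= 100_000:
        warnings.append(f"{num_examples} examples is outside the 1K-100K range")
    if not 4 <= per_device_batch_size <= 16:
        warnings.append(f"batch size {per_device_batch_size} is outside 4-16")
    if not 1 <= num_epochs <= 3:
        warnings.append(f"{num_epochs} epochs is outside the typical 1-3")
    return warnings


# The same settings pass for LoRA but are flagged for full fine-tuning.
lora_warnings = check_config("lora", 2e-4, 10_000, 4, 3)
full_warnings = check_config("full", 2e-4, 10_000, 4, 3)
print(lora_warnings)
print(full_warnings)
```

A learning rate of 2e-4 sits inside the LoRA range but well above the range for full fine-tuning, so the same value is accepted in one case and flagged in the other.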
Summary
Successful fine-tuning requires careful data preparation, appropriate hyperparameters, and systematic evaluation. Start simple and iterate.
Next: We'll explore AI ethics and bias.