Code Generation with LLMs — From Autocomplete to Autonomous Programming
LLMs have revolutionized software development, moving from simple autocomplete to full program synthesis. This guide covers code-specific architectures, training methodologies, evaluation benchmarks, and production deployment.
- Code LLMs — Specialized architectures for program understanding and generation
- Training on Code — Fine-tuning strategies for programming tasks
- Evaluation — HumanEval, MBPP, SWE-bench, and beyond
- Production — Deployment patterns for code assistants and agents
The best code is the code that writes itself.
Large Language Models have transformed software development, enabling everything from intelligent autocomplete to autonomous program synthesis. This tutorial covers the architectures, training methods, and evaluation frameworks that make code generation possible.
Definition: Code LLM
A Code LLM is a language model specifically trained or fine-tuned on programming-related data (source code, documentation, tests, issues) to understand and generate code across multiple programming languages, paradigms, and complexity levels.
Code-Specific Architectures
How Code LLMs Differ from General LLMs
Code LLMs require specialized handling due to the unique characteristics of programming languages:
| Characteristic | Natural Language | Programming Language |
|---|---|---|
| Structure | Flexible, ambiguous | Strict syntax, semantics |
| Compositionality | Moderate | High (functions, classes) |
| Long-range Dependencies | Variable | Very high (call graphs) |
| Verification | Subjective | Objective (tests, compilation) |
| Multimodal | Text only | Text + structure + execution |
Definition: Code-Aware Architecture
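Code LLMs and the systems built around them draw on several code-specific techniques, although most widely used models remain standard decoder-only transformers: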
- Tree-structured Attention: Attention patterns that respect AST hierarchy
- Byte-Pair Encoding for Code: Tokenizers optimized for code tokens
- Multi-file Context: Context windows spanning multiple files
- Execution Feedback: Integration with compilers and test runners
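To make the "AST hierarchy" mentioned in the list above concrete, the sketch below uses Python's standard `ast` module to list every node of a small function with its nesting depth. The `ast_depths` helper and the sample source are illustrative assumptions, not part of any particular model's pipeline.

```python
import ast


def ast_depths(source: str) -> list[tuple[int, str]]:
    """Return (depth, node type name) pairs in depth-first order."""
    out: list[tuple[int, str]] = []

    def visit(node: ast.AST, depth: int) -> None:
        out.append((depth, type(node).__name__))
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return out


SAMPLE = "def add(a, b):\n    return a + b\n"

# Print an indented outline of the tree, one node per line.
for depth, name in ast_depths(SAMPLE):
    print("  " * depth + name)
```

A structure-aware model or retrieval component could use depths like these to group tokens that belong to the same syntactic unit; how that is wired into attention varies by design.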
State-of-the-Art Code LLMs
| Model | Params | Context | Languages | Key Innovation |
|---|---|---|---|---|
| Codex | 12B | 4K | 12+ | HumanEval benchmark |
| StarCoder | 15B | 8K | 86 | Fill-in-the-middle |
| Code Llama | 7-70B | 100K | 20+ | Long-context code |
| DeepSeek-Coder | 1.3-33B | 16K | 87 | Repository-level |
| GPT-4 | Undisclosed | 128K (Turbo) | 100+ | Multi-lingual, reasoning |
| Claude 3.5 | Undisclosed | 200K | 20+ | Agentic coding |
Training Code LLMs
Pretraining on Code
Code Pretraining Objective
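$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P_\theta\left(x_t^{(i)} \mid x_{<t}^{(i)}\right)$$
Here,
- $\mathcal{L}(\theta)$ = Total pretraining loss across all files
- $N$ = Number of files in training corpus
- $T_i$ = Number of tokens in file $i$
- $x_t^{(i)}$ = Token at position $t$ in file $i$
- $\theta$ = Model parameters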
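As a rough illustration of the objective defined above (the summed negative log-probability of each token given the tokens before it, over every file), the following sketch evaluates it on a toy corpus. The `uniform_model` stand-in and the tiny tokenized corpus are assumptions made for the example; a real model would supply learned probabilities.

```python
import math


def pretraining_loss(files, next_token_prob):
    """Sum of -log P(x_t | x_<t) over every token of every file."""
    total = 0.0
    for tokens in files:                     # one pass per file i
        for t, token in enumerate(tokens):   # one term per position t
            p = next_token_prob(tokens[:t], token)
            total -= math.log(p)
    return total


def uniform_model(vocab_size):
    """Toy stand-in for P_theta: every token is equally likely."""
    def prob(prefix, token):
        return 1.0 / vocab_size
    return prob


corpus = [["def", "f", "(", ")", ":"], ["return", "1"]]
loss = pretraining_loss(corpus, uniform_model(vocab_size=8))
print(f"total loss over {sum(map(len, corpus))} tokens: {loss:.3f}")
```

With a uniform model each of the 7 tokens contributes log(8), so training can only lower the loss from this baseline.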
Code-Specific Training Strategies
training_strategies = {
    "fill_in_middle": {
        "description": "Train model to fill in middle of code given prefix and suffix",
        "objective": "P(middle | prefix, suffix)",
        "benefit": "Enables intelligent code completion",
    },
    "commit_message": {
        "description": "Train to generate commit messages from diffs",
        "objective": "P(message | diff)",
        "benefit": "Automated documentation",
    },
    "code_review": {
        "description": "Train to review code and suggest improvements",
        "objective": "P(review | code, context)",
        "benefit": "Automated code review",
    },
    "test_generation": {
        "description": "Train to generate test cases from code",
        "objective": "P(tests | function)",
        "benefit": "Automated testing",
    },
    "bug_detection": {
        "description": "Train to identify and fix bugs",
        "objective": "P(fix | buggy_code, error)",
        "benefit": "Automated debugging",
    },
}
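The fill-in-the-middle objective, P(middle | prefix, suffix), is usually implemented as a data transformation rather than an architectural change. A minimal sketch, assuming character-level split points and placeholder sentinel strings (real models define their own special tokens; StarCoder, for example, uses `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>`):

```python
import random

# Placeholder sentinels for illustration only.
PREFIX, SUFFIX, MIDDLE = "<prefix>", "<suffix>", "<middle>"


def make_fim_example(code: str, rng: random.Random) -> tuple[str, str]:
    """Split code at two random offsets and return (prompt, target).

    The prompt uses prefix-suffix-middle order, so the model sees both
    sides of the hole before it starts generating the middle.
    """
    a, b = sorted(rng.randint(0, len(code)) for _ in range(2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    prompt = f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}"
    return prompt, middle


def reassemble(prompt: str, middle: str) -> str:
    """Invert make_fim_example to recover the original code."""
    body = prompt[len(PREFIX):-len(MIDDLE)]
    prefix, suffix = body.split(SUFFIX, 1)
    return prefix + middle + suffix


code = "def add(a, b):\n    return a + b\n"
prompt, target = make_fim_example(code, random.Random(0))
print(repr(prompt), "->", repr(target))
```

Because the target always follows the middle sentinel, the model can still be trained with the ordinary left-to-right next-token loss.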
Fine-Tuning for Code Tasks
Fine-Tuning Code Llama for Test Generation
Given a Code Llama 7B base model, fine-tune for test generation:
- Dataset: 100K Python functions with corresponding pytest tests
- Prompt Format: "def function_name(params): ... \n # Write tests for this function"
- Training: LoRA fine-tuning with rank 16, alpha 32
- Evaluation: Test pass rate on held-out functions
Expected results: 75-85% pass@1 on HumanEval-style test generation
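To see why LoRA with rank 16 and alpha 32 is cheap, it helps to work through the arithmetic for a single weight matrix. The sketch below assumes a square 4096 x 4096 projection purely for illustration; which matrices receive adapters is a configuration choice the example above does not specify.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update: W + (alpha / rank) * B @ A."""
    return alpha / rank


HIDDEN = 4096  # assumed projection size, for illustration
full = HIDDEN * HIDDEN
adapter = lora_trainable_params(HIDDEN, HIDDEN, rank=16)

print(f"adapter params per matrix: {adapter:,} ({adapter / full:.2%} of {full:,})")
print(f"scaling factor: {lora_scaling(32, 16)}")
```

Under these assumptions each adapted matrix trains well under 1% of the parameters a full update would touch, and the update is scaled by alpha / rank = 2.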
Evaluation Benchmarks
Core Benchmarks
Definition: Pass@k Metric
Pass@k measures the probability that at least one of k generated samples passes all test cases. It accounts for the stochastic nature of code generation.
Pass@k Calculation
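$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
Here,
- $\text{pass@}k$ = Probability that at least 1 of k samples passes
- $n$ = Total samples generated
- $c$ = Number of samples that pass all tests
- $k$ = Number of samples to consider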
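The standard unbiased estimator for a single problem is 1 - C(n - c, k) / C(n, k), averaged over problems. A minimal sketch follows, using a product form of the ratio so that large binomial coefficients are never built explicitly; the function names are chosen for this example.

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # Product form of C(n - c, k) / C(n, k).
    fail_all = 1.0
    for i in range(n - c + 1, n + 1):
        fail_all *= 1.0 - k / i
    return 1.0 - fail_all


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average over problems, where results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two problems with 10 samples each: one half-solved, one never solved.
print(mean_pass_at_k([(10, 5, ), (10, 0)], k=1))
```

For k = 1 the estimator reduces to c / n, the plain fraction of passing samples.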
Benchmark Comparison
| Benchmark | Tasks | Metric | Difficulty | Best Model (2024) |
|---|---|---|---|---|
| HumanEval | 164 | Pass@1 | Medium | GPT-4: 88.4% |
| MBPP | 974 | Pass@1 | Easy | GPT-4: 82.1% |
| SWE-bench | 2,294 | Resolve Rate | Hard | GPT-4: 12.5% |
| DS-1000 | 1,000 | Pass@1 | Medium | GPT-4: 47.2% |
| APPS | 10,000 | Pass@1 | Variable | GPT-4: 29.4% |
SWE-bench is the most realistic benchmark in the table above: it requires resolving real GitHub issues from popular Python repositories. The low resolve rates (even for GPT-4) highlight the gap between generating standalone functions and real-world software engineering.
Evaluation Framework
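from math import comb


class CodeLLMEvaluator:
    """Comprehensive evaluation framework for code LLMs."""

    def __init__(self, model, load_dataset, benchmarks=None):
        self.model = model
        # load_dataset: caller-supplied function that maps a benchmark name to
        # an iterable of problems with "id", "prompt", and "test_cases" keys.
        self.load_dataset = load_dataset
        # Avoid a mutable default argument.
        self.benchmarks = benchmarks or ["humaneval", "mbpp"]

    def evaluate(self, benchmark_name, n_samples=100):
        """Run evaluation on a benchmark."""
        dataset = self.load_dataset(benchmark_name)
        results = []
        for problem in dataset:
            # Generate solutions
            solutions = self.model.generate(
                prompt=problem["prompt"],
                n=n_samples,
                temperature=0.8,
            )
            # Execute solutions
            pass_count = 0
            for solution in solutions:
                if self.execute_and_test(solution, problem["test_cases"]):
                    pass_count += 1
            # Calculate pass@k for every k that n_samples can support
            for k in [1, 5, 10, 100]:
                if k > n_samples:
                    continue
                pass_at_k = self.calculate_pass_at_k(
                    n=n_samples, c=pass_count, k=k
                )
                results.append({
                    "problem": problem["id"],
                    "pass_at_k": pass_at_k,
                    "k": k,
                })
        return self.aggregate_results(results)

    @staticmethod
    def calculate_pass_at_k(n, c, k):
        """Unbiased estimator: 1 - C(n - c, k) / C(n, k)."""
        return 1.0 - comb(n - c, k) / comb(n, k)

    @staticmethod
    def aggregate_results(results):
        """Average pass@k over problems, keyed by k."""
        by_k = {}
        for row in results:
            by_k.setdefault(row["k"], []).append(row["pass_at_k"])
        return {k: sum(v) / len(v) for k, v in by_k.items()}

    def execute_and_test(self, code, test_cases):
        """Execute code and run tests in-process (NOT a sandbox).

        exec/eval give no isolation; run untrusted model output in a separate
        process or container with time and resource limits.
        """
        try:
            # Fresh namespace for the candidate solution
            exec_globals = {}
            exec(code, exec_globals)
            # Each test case is an expression that must evaluate truthy
            for test in test_cases:
                result = eval(test, exec_globals)
                if not result:
                    return False
            return True
        except Exception:
            return False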
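Running generated code with in-process `exec` gives no protection against infinite loops or crashes. One common mitigation, sketched here under the assumption that test cases are written as `assert` statements, is to run each candidate in a fresh interpreter with a timeout. The `run_candidate` name is hypothetical.

```python
import subprocess
import sys


def run_candidate(code: str, tests: list[str], timeout_s: float = 5.0) -> bool:
    """Run candidate code plus assert-style tests in a fresh Python process.

    A child process with a timeout contains crashes and infinite loops, but
    it is not a security sandbox: untrusted code still needs a container or
    VM with no network access and restricted filesystem permissions.
    """
    program = code + "\n\n" + "\n".join(tests) + "\n"
    try:
        completed = subprocess.run(
            [sys.executable, "-I", "-c", program],  # -I: isolated mode
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return completed.returncode == 0


ok = run_candidate("def add(a, b):\n    return a + b", ["assert add(1, 2) == 3"])
print("candidate passed:", ok)
```

Starting one interpreter per sample is slow, so large evaluations typically batch this work across worker processes.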
Production Deployment
Code Assistant Architecture
class CodeAssistant:
    """Production code assistant with context management."""

    def __init__(self, model, context_window=8192):
        self.model = model
        self.context_window = context_window
        self.file_cache = {}

    def autocomplete(self, file_path, cursor_position, prefix, suffix):
        """Provide intelligent code completion."""
        # Gather context; how it is folded into the prompt depends on the
        # model's prompt format and is not shown here.
        context = self.gather_context(file_path, cursor_position)
        # Format prompt with fill-in-middle: prefix, then suffix, then the
        # middle sentinel, so the model generates the missing span last.
        # The sentinel names are placeholders; each model defines its own.
        prompt = f"<prefix>{prefix}<suffix>{suffix}<middle>"
        # Generate completion
        completion = self.model.generate(
            prompt=prompt,
            max_tokens=256,
            stop_tokens=["\ndef ", "\nclass ", "\n# "],
        )
        return completion

    def gather_context(self, file_path, cursor_position):
        """Gather relevant context from the codebase."""
        # The get_* helpers and truncate_context are left to the implementer.
        context = {
            "current_file": self.get_file_content(file_path),
            "imports": self.get_imports(file_path),
            "related_files": self.get_related_files(file_path),
            "definitions": self.get_definitions(file_path),
        }
        # Truncate to fit context window
        return self.truncate_context(context)

    def generate_function(self, description, context):
        """Generate a complete function from description."""
        prompt = f"""Write a Python function that: {description}
Context from the codebase:
{context}
Requirements:
- Follow existing code style
- Include type hints
- Add docstring
- Handle edge cases
Function:"""
        return self.model.generate(prompt, max_tokens=512)
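The "truncate to fit context window" step is where most of the engineering in a code assistant ends up. One possible policy, sketched below, fills a token budget section by section in priority order and keeps the last lines of any section that does not fit. Token counts are approximated by whitespace splitting, which is an assumption; a real assistant would count with the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated pieces."""
    return len(text.split())


def truncate_context(sections: dict[str, str], budget: int,
                     priority: list[str]) -> dict[str, str]:
    """Keep whole lines from each section, in priority order, within budget."""
    kept: dict[str, str] = {}
    remaining = budget
    for name in priority:
        chosen: list[str] = []
        # Walk backwards so the lines nearest the end of a section survive.
        for line in reversed(sections.get(name, "").splitlines()):
            cost = count_tokens(line)
            if cost > remaining:
                break
            chosen.append(line)
            remaining -= cost
        if chosen:
            kept[name] = "\n".join(reversed(chosen))
    return kept


sections = {
    "current_file": "a b\nc d e\nf",
    "imports": "import os\nimport sys",
    "related_files": "x y z w",
}
order = ["current_file", "imports", "related_files"]
print(truncate_context(sections, budget=5, priority=order))
```

Keeping whole lines preserves indentation, which matters for Python in particular; splitting mid-line would save a few tokens at the cost of malformed context.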
Multi-File Code Generation
Definition: Repository-Level Code Generation
Repository-level code generation produces code that is consistent with the entire codebase—respecting naming conventions, import patterns, architecture decisions, and existing implementations across multiple files.
class RepoLevelGenerator:
    """Generate code at repository level."""

    def __init__(self, repo_path, model):
        # Repository is assumed to be a project-specific wrapper that offers
        # analyze() and get_file(); plan_feature and create_file are likewise
        # left to the implementer.
        self.repo = Repository(repo_path)
        self.model = model

    def generate_feature(self, feature_description):
        """Generate a complete feature across multiple files."""
        # Analyze repository structure
        structure = self.repo.analyze()
        # Identify files to modify/create
        plan = self.plan_feature(feature_description, structure)
        # Generate changes for each file
        changes = []
        for file_change in plan:
            if file_change["action"] == "modify":
                new_content = self.modify_file(
                    file_change["path"],
                    file_change["instructions"],
                )
            else:
                new_content = self.create_file(
                    file_change["path"],
                    file_change["instructions"],
                )
            changes.append({
                "path": file_change["path"],
                "content": new_content,
                "action": file_change["action"],
            })
        return changes

    def modify_file(self, file_path, instructions):
        """Modify an existing file with new code."""
        current_content = self.repo.get_file(file_path)
        prompt = f"""Modify the following code according to these instructions:
Current code:
{current_content}
Instructions: {instructions}
Modified code:"""
        return self.model.generate(prompt, max_tokens=2048)
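The list of changes returned by `generate_feature` still has to be written to disk, and model-produced paths should not be trusted blindly. The sketch below shows one way to apply such a list, assuming each change is a dict with the `path`, `content`, and `action` keys used above; the `apply_changes` function itself is hypothetical.

```python
from pathlib import Path


def apply_changes(repo_root: str, changes: list[dict]) -> list[Path]:
    """Write each {"path", "content", "action"} change under repo_root."""
    root = Path(repo_root).resolve()
    written = []
    for change in changes:
        target = (root / change["path"]).resolve()
        # Refuse paths that escape the repository, e.g. "../outside.py".
        if root not in target.parents:
            raise ValueError(f"path escapes repository: {change['path']}")
        # A "modify" action only makes sense for a file that already exists.
        if change["action"] == "modify" and not target.is_file():
            raise FileNotFoundError(f"cannot modify missing file: {target}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(change["content"], encoding="utf-8")
        written.append(target)
    return written
```

In practice the changes would more likely be written to a scratch branch or worktree and reviewed as a diff before being merged.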
Advanced Techniques
Execution-Based Generation
Definition: Execution-Based Generation
Execution-based generation uses feedback from code execution (compilation errors, test failures, runtime exceptions) to iteratively improve generated code.
class ExecutionBasedGenerator:
    """Generate code with execution feedback."""

    def __init__(self, model, executor):
        self.model = model
        # executor.execute(code) is expected to return a dict with a boolean
        # "success" key and an "error" message.
        self.executor = executor

    def generate_with_feedback(self, problem, max_attempts=5):
        """Generate code with iterative improvement."""
        for attempt in range(max_attempts):
            # Generate code
            code = self.model.generate(problem)
            # Execute and get feedback
            result = self.executor.execute(code)
            if result["success"]:
                return code, attempt + 1
            # Add the failed code and its error to the prompt, so the model
            # can see what it is being asked to fix
            problem = f"""{problem}
Previous attempt:
{code}
It failed with error:
{result['error']}
Fix the error and try again:"""
        return None, max_attempts
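The executor passed to the generator only needs an `execute` method that returns a dict with `success` and `error` keys. As a minimal sketch of the cheapest feedback signal (whether the candidate even parses), the hypothetical class below uses the built-in `compile` and formats the error the way Python itself would print it. It never runs the code, so it cannot report test failures or runtime exceptions.

```python
import traceback


class SyntaxCheckExecutor:
    """Report whether candidate code compiles, without running it."""

    def execute(self, code: str) -> dict:
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError as exc:
            # format_exception_only yields the text Python would print,
            # including the offending line and a caret marker.
            message = "".join(traceback.format_exception_only(type(exc), exc))
            return {"success": False, "error": message.strip()}
        return {"success": True, "error": None}


executor = SyntaxCheckExecutor()
print(executor.execute("def f(:\n    pass")["error"])
```

A fuller executor would follow a successful compile with a sandboxed test run and return the first failing assertion or traceback in the same `error` field.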
Code Explanation and Documentation
Code LLMs excel not just at generation but also at understanding. Use them for:
- Code explanation: Translating complex code to natural language
- Documentation generation: Creating docstrings, READMEs, and API docs
- Code review: Identifying bugs, security issues, and style violations
- Refactoring suggestions: Improving code structure and readability
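For documentation generation, a useful first step is deciding which code to send to the model at all. The sketch below uses Python's standard `ast` module to find functions without docstrings and builds a simple prompt for each; the helper names and the prompt wording are assumptions for illustration.

```python
import ast


def undocumented_functions(source: str) -> list[str]:
    """Names of functions and methods in source that have no docstring."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


def docstring_prompt(source: str, name: str) -> str:
    """Build a prompt asking a model to document one function."""
    return (
        f"Write a concise docstring for the function `{name}` below.\n\n"
        f"{source}\n"
    )


src = 'def a():\n    """Doc."""\n    return 1\n\ndef b(x):\n    return x\n'
for name in undocumented_functions(src):
    print(docstring_prompt(src, name))
```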
Practice Exercises
- Conceptual: Explain the difference between Pass@1 and Pass@k metrics. Why is Pass@k more appropriate for evaluating code generation?
- Mathematical: If a code LLM generates 100 samples for a problem and 40 pass all tests, calculate Pass@1, Pass@5, and Pass@10.
- Practical: Implement a simple code completion system using a pre-trained code LLM. Test it on 10 Python functions and measure completion accuracy.
- Research: Compare the performance of Code Llama 7B and StarCoder 15B on the HumanEval benchmark. What are the trade-offs between model size and performance?
Key Takeaways:
- Code LLMs require specialized architectures for program structure and long-range dependencies
- Fill-in-the-middle training enables intelligent code completion
- Pass@k is the standard metric for evaluating code generation
- SWE-bench represents the most realistic (and challenging) evaluation scenario
- Production deployment requires context management, execution feedback, and multi-file awareness
What to Learn Next
- LLMs for Scientific Research: Using LLMs for literature review, hypothesis generation, and paper writing.
- LLMs in Healthcare: Clinical NLP, medical QA, and drug discovery applications.
- LLMs for Finance: Sentiment analysis, risk assessment, and trading applications.
- LLMs for Education: Tutoring systems, content generation, and assessment.
- State Space Models: Mamba, S4, and linear attention alternatives to transformers.
- Agent Frameworks: Building autonomous agents with LLMs for complex tasks.