Python File I/O — Advanced Patterns
Advanced file I/O patterns handle large files, concurrent access, and safe writes. These patterns are essential for production applications.
Learning Objectives
- Process large files efficiently with generators
- Use memory-mapped files for fast random access
- Write atomic files safely (prevent corruption)
- Work with temporary files and directories
Large File Processing
Loading an entire file into memory is risky when the file is large. Calling f.read() on a 10GB log file tries to hold all 10GB in RAM at once, which can exhaust memory and crash the program. Instead, process the file line by line.
def process_large_file(filename):
    """Process file line by line without loading into memory."""
    with open(filename, 'r', buffering=8192) as f:
        for line in f:  # Python reads in chunks automatically
            yield line.strip()


# Memory-efficient line counting
def count_lines(filename):
    return sum(1 for _ in process_large_file(filename))


# Memory-efficient word counting
def count_words(filename):
    word_count = 0
    for line in process_large_file(filename):
        word_count += len(line.split())
    return word_count


# Process only specific lines
def process_every_nth(filename, n=10):
    for i, line in enumerate(process_large_file(filename)):
        if i % n == 0:
            yield line
Why This Works
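When you iterate over a file object (for line in f), Python does NOT read the entire file into memory. Instead:
- It reads one buffer-sized block from disk (typically 4KB or 8KB by default; the code above requests 8192 bytes explicitly)
- Yields one line at a time
- Reads more from disk when the buffer is exhausted
This means you can process multi-gigabyte files with only a few megabytes of RAM, provided individual lines are reasonably short.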
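Line iteration relies on the file containing newlines. For binary data, or text with very long lines, the same read-a-block-at-a-time idea can be applied by hand. The following is a minimal sketch, not part of the original examples: the names iter_chunks and sha256_of_file and the 64KB chunk size are illustrative assumptions.

```python
import hashlib


def iter_chunks(filename, chunk_size=64 * 1024):
    """Yield successive blocks of at most chunk_size bytes."""
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            yield chunk


def sha256_of_file(filename):
    """Hash a file of any size while holding only one chunk in memory."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(filename):
        digest.update(chunk)
    return digest.hexdigest()
```

Peak memory here is bounded by chunk_size rather than by line length, which is why this shape is often preferred for checksums and copies.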
Memory-Mapped Files
Memory mapping lets you access file contents as if they were in memory, while the OS loads pages on demand. This is well suited to random access in large files.
import mmap

# Map an existing, non-empty file for reading and writing
with open('large_file.bin', 'r+b') as f:
    # Length 0 maps the entire file
    mm = mmap.mmap(f.fileno(), 0)

    # Access like a byte array
    print(mm[0:100])      # First 100 bytes
    print(mm[1000:2000])  # Bytes 1000-1999

    # Seek and read
    mm.seek(500)
    data = mm.read(100)  # Read 100 bytes from position 500

    # Write in place (overwrites the first 6 bytes of the file)
    mm.seek(0)
    mm.write(b'HEADER')

    mm.close()
When to Use Memory Mapping
| Scenario | Regular File | Memory Map |
|---|---|---|
| Sequential read | Good | Good |
| Random access | Slow (seek) | Fast |
| Small file (less than 1MB) | Good | Overkill |
| Large file (greater than 100MB) | OK | Better |
| Multiple processes reading | Complex | Easy |
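To make the random-access row concrete, the sketch below reads a single fixed-size record from the middle of a file without reading anything before it. The record layout (a 4-byte unsigned id followed by an 8-byte float) and the helper names write_records and read_record are hypothetical, chosen only for illustration.

```python
import mmap
import struct

# Hypothetical record layout: little-endian uint32 id + float64 value (12 bytes).
RECORD = struct.Struct('<Id')


def write_records(filename, records):
    """Write (id, value) pairs as fixed-size binary records."""
    with open(filename, 'wb') as f:
        for rec_id, value in records:
            f.write(RECORD.pack(rec_id, value))


def read_record(filename, index):
    """Return record number `index` using a read-only memory map."""
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * RECORD.size
            if index < 0 or offset + RECORD.size > len(mm):
                raise IndexError(f"record {index} out of range")
            # unpack_from reads directly from the mapping; only the pages
            # backing this record need to be loaded by the OS.
            return RECORD.unpack_from(mm, offset)
```

Opening the map with ACCESS_READ keeps the example from modifying the file by accident. Note that mapping an empty file raises ValueError, so callers would need to handle that case separately.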
Atomic Writes
Atomic writes prevent file corruption. If your program crashes mid-write, the file is either completely written or not written at all — never partially written.
import tempfile
import os


def atomic_write(filename, content):
    """Write atomically using temp file + rename."""
    dir_name = os.path.dirname(filename) or '.'

    # Write to temp file in same directory (same filesystem)
    with tempfile.NamedTemporaryFile(
        mode='w',
        dir=dir_name,
        delete=False,
        suffix='.tmp'
    ) as tmp:
        tmp.write(content)
        # Flush Python and OS buffers so the data is on disk before the rename
        tmp.flush()
        os.fsync(tmp.fileno())
        tmp_path = tmp.name

    # Atomic when source and destination are on the same filesystem
    os.replace(tmp_path, filename)


# Usage
atomic_write('config.json', '{"key": "value"}')
Why This Matters
Without atomic writes:
- Program starts writing to config.json
- Program crashes at 50% — file is now corrupt
- On restart, program reads corrupt config → crash loop
With atomic writes:
- Program writes to a temporary file in the same directory as config.json (a generated name ending in .tmp)
- Program crashes — config.json is untouched
- On restart, program reads valid config.json
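The two scenarios above can be reproduced with a writer that fails partway through. The sketch below is a hypothetical variant named atomic_write_lines (not from the original code) that accepts an iterable of strings and also removes the partial temp file when anything goes wrong.

```python
import os
import tempfile


def atomic_write_lines(filename, lines):
    """Write an iterable of strings to filename, all or nothing."""
    dir_name = os.path.dirname(filename) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as tmp:
            for line in lines:
                tmp.write(line)
            tmp.flush()
            os.fsync(tmp.fileno())  # push data to disk before the rename
        os.replace(tmp_path, filename)
    except BaseException:
        # Any failure leaves the old file in place; drop the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the iterable raises after yielding only part of the content, the exception propagates, the temp file is deleted, and the previous contents of filename remain readable. A hard crash such as a power loss would skip the cleanup and could leave a stray .tmp file behind, but the target file would still be intact.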
Temporary Files and Directories
import tempfile
import os

# Temporary file (auto-deleted when closed)
with tempfile.NamedTemporaryFile(
    mode='w',
    suffix='.txt',
    delete=True  # Auto-delete on close
) as f:
    f.write("temporary data")
    f.flush()  # Make the data visible to anything that opens temp_path
    temp_path = f.name
    # Process data using temp_path (the path is only valid inside this block)

# Temporary directory (all contents deleted)
with tempfile.TemporaryDirectory() as tmpdir:
    # Create files in tmpdir
    path = os.path.join(tmpdir, 'data.txt')
    with open(path, 'w') as f:
        f.write("temp")
# All files in tmpdir are deleted when exiting

# Persistent temporary file (you must delete manually)
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
# ... use tmp.name ...
os.unlink(tmp.name)  # Delete when done
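The cleanup behaviour described in the comments above can be checked directly. This is a small sketch; the function name demo_temp_lifetimes and the keys of the returned dictionary are made up for the demonstration.

```python
import os
import tempfile


def demo_temp_lifetimes():
    """Report whether each temporary path still exists after its block ends."""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt') as f:
        f.write("temporary data")
        file_path = f.name
        existed_while_open = os.path.exists(file_path)

    with tempfile.TemporaryDirectory() as tmpdir:
        inner = os.path.join(tmpdir, 'data.txt')
        with open(inner, 'w') as out:
            out.write("temp")

    return {
        'file_existed_while_open': existed_while_open,
        'file_exists_after': os.path.exists(file_path),
        'dir_exists_after': os.path.exists(tmpdir),
    }
```

The file exists only while its with block is active, and the directory disappears together with everything created inside it.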
File Watching
import time
from pathlib import Path


def watch_file(filename, callback, interval=1.0):
    """Poll a file's modification time and call callback when it changes."""
    path = Path(filename)
    try:
        last_mtime = path.stat().st_mtime
    except FileNotFoundError:
        last_mtime = None  # File may be created later
    while True:
        try:
            current_mtime = path.stat().st_mtime
            if current_mtime != last_mtime:
                callback(filename)
                last_mtime = current_mtime
        except FileNotFoundError:
            pass  # File is missing right now; keep polling
        time.sleep(interval)


# Usage (runs until interrupted, e.g. with Ctrl+C)
def on_change(filename):
    print(f"{filename} was modified!")


watch_file('config.json', on_change)
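Because watch_file loops forever, it is awkward to test or to stop cleanly. One option is to separate the "has it changed?" check from the loop so the caller decides when to poll. The class below is a sketch under that assumption; ChangePoller and its method names are hypothetical.

```python
import os


class ChangePoller:
    """Reports whether a file changed since the previous poll."""

    def __init__(self, filename):
        self.filename = filename
        self._last = self._signature()

    def _signature(self):
        try:
            st = os.stat(self.filename)
        except FileNotFoundError:
            return None  # a missing file is a valid, comparable state
        # Compare nanosecond mtime and size; a coarse timestamp alone can
        # miss two writes that land within the same clock tick.
        return (st.st_mtime_ns, st.st_size)

    def poll(self):
        """Return True once for each observed change, otherwise False."""
        current = self._signature()
        changed = current != self._last
        self._last = current
        return changed
```

A caller can then write its own loop, for example calling poll() every second and breaking out on a shutdown flag. Deletion and re-creation of the file both count as changes in this sketch.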
Key Takeaways
- Use generators for memory-efficient line-by-line processing
- Memory-mapped files are fast for random access in large files
- Atomic writes prevent corruption on crashes
- Use tempfile for temporary storage (auto-cleanup)
- Buffer large reads for better performance
- Use os.replace() for atomic file replacement
- Always close files (use context managers)
- Use the buffering parameter for custom buffer sizes