Python File I/O — Advanced Patterns

Python Standard LibraryFile I/OFree Lesson

Advertisement

Python File I/O — Advanced Patterns

Advanced file I/O patterns handle large files, concurrent access, and safe writes. These patterns are essential for production applications.

Learning Objectives

  • Process large files efficiently with generators
  • Use memory-mapped files for fast random access
  • Write atomic files safely (prevent corruption)
  • Work with temporary files and directories

Large File Processing

Loading entire files into memory is dangerous for large files. A 10GB log file will crash your program if you try f.read(). Instead, process line by line.

def process_large_file(filename):
    """Process file line by line without loading into memory."""
    with open(filename, 'r', buffering=8192) as f:
        for line in f:  # Python reads in chunks automatically
            yield line.strip()

# Memory-efficient line counting
def count_lines(filename):
    return sum(1 for _ in process_large_file(filename))

# Memory-efficient word counting
def count_words(filename):
    word_count = 0
    for line in process_large_file(filename):
        word_count += len(line.split())
    return word_count

# Process only specific lines
def process_every_nth(filename, n=10):
    for i, line in enumerate(process_large_file(filename)):
        if i % n == 0:
            yield line

Why This Works

When you iterate over a file object (for line in f), Python does NOT read the entire file into memory. Instead:

  1. It reads a buffer (default 8KB)
  2. Yields one line at a time
  3. Reads more from disk when buffer is exhausted

This means you can process gigabyte files with only megabytes of RAM.


Memory-Mapped Files

Memory mapping lets you access file contents as if they were in memory, but the OS handles loading pages on demand. This is perfect for random access in large files.

import mmap

# Create a memory-mapped file
with open('large_file.bin', 'r+b') as f:
    # Map the entire file (0 = entire file)
    mm = mmap.mmap(f.fileno(), 0)

    # Access like a byte array
    print(mm[0:100])       # First 100 bytes
    print(mm[1000:2000])   # Bytes 1000-2000

    # Seek and read
    mm.seek(500)
    data = mm.read(100)    # Read 100 bytes from position 500

    # Write
    mm.seek(0)
    mm.write(b'HEADER')

    mm.close()

When to Use Memory Mapping

ScenarioRegular FileMemory Map
Sequential readGoodGood
Random accessSlow (seek)Fast
Small file (less than 1MB)GoodOverkill
Large file (greater than 100MB)OKBetter
Multiple processes readingComplexEasy

Atomic Writes

Atomic writes prevent file corruption. If your program crashes mid-write, the file is either completely written or not written at all — never partially written.

import tempfile
import os

def atomic_write(filename, content):
    """Write atomically using temp file + rename."""
    dir_name = os.path.dirname(filename) or '.'

    # Write to temp file in same directory (same filesystem)
    with tempfile.NamedTemporaryFile(
        mode='w',
        dir=dir_name,
        delete=False,
        suffix='.tmp'
    ) as tmp:
        tmp.write(content)
        tmp_path = tmp.name

    # Atomic on most filesystems (POSIX)
    os.replace(tmp_path, filename)

# Usage
atomic_write('config.json', '{"key": "value"}')

Why This Matters

Without atomic writes:

  1. Program starts writing to config.json
  2. Program crashes at 50% — file is now corrupt
  3. On restart, program reads corrupt config → crash loop

With atomic writes:

  1. Program writes to config.json.tmp
  2. Program crashes — config.json is untouched
  3. On restart, program reads valid config.json

Temporary Files and Directories

import tempfile
import os

# Temporary file (auto-deleted when closed)
with tempfile.NamedTemporaryFile(
    mode='w',
    suffix='.txt',
    delete=True  # Auto-delete on close
) as f:
    f.write("temporary data")
    temp_path = f.name
    # Process data using temp_path

# Temporary directory (all contents deleted)
with tempfile.TemporaryDirectory() as tmpdir:
    # Create files in tmpdir
    path = os.path.join(tmpdir, 'data.txt')
    with open(path, 'w') as f:
        f.write("temp")
    # All files in tmpdir are deleted when exiting

# Persistent temporary file (you must delete manually)
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
# ... use tmp.name ...
os.unlink(tmp.name)  # Delete when done

File Watching

import time
from pathlib import Path

def watch_file(filename, callback, interval=1.0):
    """Watch for file changes and call callback."""
    last_mtime = Path(filename).stat().st_mtime

    while True:
        try:
            current_mtime = Path(filename).stat().st_mtime
            if current_mtime != last_mtime:
                callback(filename)
                last_mtime = current_mtime
        except FileNotFoundError:
            pass
        time.sleep(interval)

# Usage
def on_change(filename):
    print(f"{filename} was modified!")

watch_file('config.json', on_change)

Key Takeaways

  1. Use generators for memory-efficient line-by-line processing
  2. Memory-mapped files are fast for random access in large files
  3. Atomic writes prevent corruption on crashes
  4. Use tempfile for temporary storage (auto-cleanup)
  5. Buffer large reads for better performance
  6. Use os.replace() for atomic file replacement
  7. Always close files (use context managers)
  8. Use buffering parameter for custom buffer sizes

Advertisement

Need Expert Python Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement