Introduction
A Python generator function uses the yield keyword to produce a sequence of values lazily, one at a time. Calling it returns a generator object, an iterator that runs the function body only as values are requested. Because a generator never builds the full sequence in memory, it is well suited to iterating over large datasets and to handling streaming data in data science applications. A generator also keeps its local state between iterations, which makes it a natural building block for multi-stage data processing pipelines.
Key Concepts
- Lazy evaluation: Values produced on-demand, not stored in memory
- Yield keyword: Pauses function and returns value without terminating
- Generator object: Iterator that produces values when iterated
- Memory efficiency: Holds only the current item and the generator's own state, not the whole sequence
- Stateful iteration: Maintains position between iterations
- Generator expressions: Similar to list comprehensions but lazy
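The concepts above can be observed directly. The following is a minimal sketch, not a definitive demonstration: traced_numbers and its log argument are hypothetical names introduced only for illustration, and the size comparison uses sys.getsizeof, which measures the container object itself rather than the items it refers to.

```python
import sys


def traced_numbers(log):
    """Hypothetical generator that records when each part of its body runs."""
    log.append("started")
    for n in (1, 2):
        log.append(f"yielding {n}")
        yield n
    log.append("finished")


log = []
gen = traced_numbers(log)  # calling the function runs none of its body yet
print(log)  # []

first = next(gen)  # runs up to the first yield, then pauses
print(first, log)  # 1 ['started', 'yielding 1']

rest = list(gen)  # resumes after the pause and runs to completion
print(rest, log[-1])  # [2] finished

# A list comprehension stores every result; a generator expression stores
# only the state needed to compute the next one.
squares_list = [x * x for x in range(100_000)]
squares_gen = (x * x for x in range(100_000))
gen_is_smaller = sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)
print(gen_is_smaller)  # True
```

The log stays empty until the first next() call, which is lazy evaluation and stateful pausing in practice.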
Python Implementation
# Basic generator function
def count_up_to(n):
    count = 1
    while count <= n:
        yield count
        count += 1

for i in count_up_to(5):
    print(i)  # Prints 1, 2, 3, 4, 5 (one per line)
# Generator for large data processing
def process_large_file(filepath):
    with open(filepath, 'r') as file:
        for line in file:  # the file object is itself read lazily, line by line
            yield line.strip()
# Fibonacci generator
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib = fibonacci()
for i in range(10):
    print(next(fib))  # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
# Generator expression (like list comprehension but lazy)
squares_gen = (x**2 for x in range(1000000))
# Does not create list in memory - each value computed on demand
# Chaining generators
def pipeline(data):
    for item in data:
        yield item * 2

def filter_pipeline(data):
    for item in data:
        if item > 0:
            yield item
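# Using send() for two-way communication
def counter():
    total = 0
    while True:
        # yield hands total to the caller; send() supplies the value it evaluates to
        value = yield total
        if value:
            total += value

c = counter()
print(c.send(None))  # 0 (send(None) primes the generator, same as next(c))
print(c.send(5))  # 5
print(c.send(3))  # 8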
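The listing defines pipeline and filter_pipeline but never connects them. As a small sketch of what chaining looks like, the block below repeats both definitions so it runs on its own and feeds them a made-up readings list (an assumption for illustration, not data from this document).

```python
def pipeline(data):
    for item in data:
        yield item * 2


def filter_pipeline(data):
    for item in data:
        if item > 0:
            yield item


readings = [3, -1, 0, 4, -7]  # hypothetical sample input

# Each stage wraps the previous one; nothing is computed at this point.
stages = pipeline(filter_pipeline(readings))

# Consuming the outer generator pulls items through both stages one at a time.
result = list(stages)
print(result)  # [6, 8]
```

Because each stage asks the previous one for a single item at a time, no intermediate list is built between the filter and the doubling step.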
When to Use
- Processing large datasets that don't fit in memory
- Streaming data from files or APIs
- Creating infinite sequences
- Building data pipelines with transformation steps
- Avoiding memory overhead of list creation
- Implementing custom iteration patterns
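Two of these cases, infinite sequences and streaming from a file, can be sketched together. This is an illustration under stated assumptions: read_lines is a hypothetical helper that mirrors the file-reading generator in the implementation section, and the temporary file with its contents exists only to make the example runnable.

```python
import itertools
import os
import tempfile


def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


def read_lines(filepath):
    """Hypothetical helper: yield stripped lines without loading the whole file."""
    with open(filepath, "r") as file:
        for line in file:
            yield line.strip()


# islice takes a bounded slice from an otherwise endless generator.
first_eight = list(itertools.islice(fibonacci(), 8))
print(first_eight)  # [0, 1, 1, 2, 3, 5, 8, 13]

# Write a small throwaway file so the streaming example has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\n\nbeta\ngamma\n")
    path = tmp.name

try:
    # Count non-empty lines while holding one line in memory at a time.
    non_empty = sum(1 for line in read_lines(path) if line)
    print(non_empty)  # 3
finally:
    os.remove(path)
```

itertools.islice is the usual way to bound an infinite generator, since a generator cannot be sliced with square brackets.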
Key Takeaways
- Generators are memory-efficient for large datasets as they don't store all values
- The yield keyword pauses execution, maintaining state between iterations
- Generator expressions create generators with comprehension-like syntax, so they are evaluated just as lazily
- Generators can be chained for efficient data pipelines
- Once exhausted, generators cannot be reused without recreation
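The last point is easy to trip over in practice. The short sketch below reuses the count_up_to generator from the implementation section, repeated here so the block is self-contained, to show what exhaustion looks like.

```python
def count_up_to(n):
    count = 1
    while count <= n:
        yield count
        count += 1


numbers = count_up_to(3)
first_pass = list(numbers)   # consumes every value
second_pass = list(numbers)  # the same object has nothing left to give
print(first_pass, second_pass)  # [1, 2, 3] []

# Calling next() on an exhausted generator raises StopIteration.
try:
    next(numbers)
    raised = False
except StopIteration:
    raised = True
print(raised)  # True

# Calling the generator function again creates a fresh, independent generator.
fresh = list(count_up_to(3))
print(fresh)  # [1, 2, 3]
```

If the values are needed more than once, either call the generator function again or store the results in a list, accepting the memory cost.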