Why Functions Matter
Functions are the atomic units of reusable logic in Python. They transform raw, repetitive code into modular, testable, and composable building blocks. Without functions, data science code becomes an unmaintainable tangle of copy-pasted operations.
The Cost of Not Using Functions
Without Functions:          With Functions:
┌───────────────────────┐   ┌───────────────────────┐
│ data = load_csv()     │   │ data = load_csv()     │
│ # clean it            │   │ data = clean(data)    │
│ data['a'] = ...       │   │ data = transform(a)   │
│ data['b'] = ...       │   │ data = transform(b)   │
│ # copy-paste again    │   │ report = report(data) │
│ data2 = load_csv()    │   └───────────────────────┘
│ data2['a'] = ...      │   ✓ Reusable
│ data2['b'] = ...      │   ✓ Testable
│ # ... repeat 50x      │   ✓ Readable
│ # BUG: forgot one     │   ✗ Bug-prone
└───────────────────────┘
Function Fundamentals
Basic Syntax and Return Types
Definition: Function
A mathematical mapping f: X → Y that assigns each element x ∈ X to exactly one element y ∈ Y. In programming, a function is a reusable block of code that takes inputs (parameters), executes a sequence of operations, and returns an output.
# Every function is a mapping: f: X → Y
# Formally: f: A → B where f(a) = b for each a ∈ A
def square(x):
    """Return the square of x.

    Mathematical definition: f(x) = x²
    Domain: ℝ (all real numbers)
    Codomain: [0, ∞)
    """
    return x ** 2
# Functions are first-class objects in Python
print(type(square)) # <class 'function'>
print(square.__name__) # 'square'
print(square.__doc__) # 'Return the square of x...'
# Functions can be assigned to variables
f = square
print(f(5)) # 25
# Functions can be stored in data structures
operations = {
    'square': lambda x: x ** 2,
    'cube': lambda x: x ** 3,
    'sqrt': lambda x: x ** 0.5
}
print(operations['cube'](3)) # 27
Python's first-class functions enable functional programming patterns like higher-order functions (functions that take/return other functions), closures, and decorators. This is the foundation of functional composition, where complex operations are built by combining simpler functions.
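To make functional composition concrete, here is a minimal sketch. The helper names `compose`, `increment`, and `double` are illustrative and not part of any library; the sketch assumes each function takes and returns a single value.

```python
from functools import reduce


def compose(*funcs):
    """Return a function that applies funcs right to left.

    compose(f, g)(x) == f(g(x)); with no functions it returns x unchanged.
    """
    def composed(x):
        # Feed x through the functions, starting with the rightmost one
        return reduce(lambda acc, fn: fn(acc), reversed(funcs), x)
    return composed


def increment(x):
    return x + 1


def double(x):
    return x * 2


# compose is a higher-order function: it takes functions and returns a function
increment_then_double = compose(double, increment)
print(increment_then_double(5))  # 12
```

Because `compose` returns an ordinary function, the result can itself be passed to `map`, stored in a dictionary, or composed again.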
Arguments: Positional, Keyword, Default
def describe_dataset(data, name="dataset", verbose=True):
    """Describe a pandas DataFrame with optional verbosity.

    Parameters
    ----------
    data : pd.DataFrame
        The dataset to describe.
    name : str, default="dataset"
        Name for labeling output.
    verbose : bool, default=True
        If True, print detailed stats.
    """
    # Compute memory usage once and reuse it for printing and returning
    memory_kb = data.memory_usage(deep=True).sum() / 1024
    print(f"Dataset: {name}")
    print(f"Shape: {data.shape[0]} rows × {data.shape[1]} columns")
    print(f"Memory: {memory_kb:.2f} KB")
    if verbose:
        print("\nColumn types:")
        print(data.dtypes.value_counts())
        print(f"\nMissing values: {data.isnull().sum().sum()}")
    return {'shape': data.shape, 'memory_kb': memory_kb}
# Calling with different argument styles
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
# Positional
describe_dataset(df, "my_data")
# Keyword
describe_dataset(data=df, name="my_data", verbose=False)
# Mixed (positional must come first)
describe_dataset(df, verbose=False, name="my_data")
*args and **kwargs
Definition: Variadic Function
A function that accepts a variable number of arguments. In Python, *args collects positional arguments into a tuple, and **kwargs collects keyword arguments into a dictionary, enabling flexible function signatures.
def flexible_function(*args, **kwargs):
    """Accept any number of positional and keyword arguments.

    *args: collects extra positional arguments into a tuple
        Mathematical: f: ℝⁿ → ℝ where n is variable
    **kwargs: collects extra keyword arguments into a dict
        Mathematical: f: ℝⁿ × {k₁: ℝ, ..., kₘ: ℝ} → ℝ
    """
    print(f"Positional args (tuple): {args}")
    print(f"Keyword args (dict): {kwargs}")
    print(f"Sum of positional: {sum(args)}")
    return sum(args)
flexible_function(1, 2, 3, x=10, y=20)
# Positional args (tuple): (1, 2, 3)
# Keyword args (dict): {'x': 10, 'y': 20}
# Sum of positional: 6
# Real-world example: flexible aggregation
def aggregate(df, group_cols, agg_dict=None, **extra_aggs):
    """Flexible DataFrame aggregation.

    Parameters
    ----------
    df : pd.DataFrame
        The data to aggregate.
    group_cols : str or list
        Column(s) to group by.
    agg_dict : dict, optional
        {column: [aggregations]} mapping.
    **extra_aggs : additional aggregations
        Column name as key, aggregation function as value.
    """
    # Merge into a new dict so the caller's agg_dict is never mutated
    combined = {**(agg_dict or {}), **extra_aggs}
    return df.groupby(group_cols).agg(combined)
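The same `*` and `**` symbols also work in the other direction, at the call site, where they unpack a sequence or dictionary into separate arguments. The sketch below uses a hypothetical helper named `describe`; its parameters are chosen only for illustration.

```python
def describe(name, *scores, unit="pts", **meta):
    """Hypothetical helper: format a name, its total score, and extra metadata."""
    total = sum(scores)
    extras = ", ".join(f"{key}={value}" for key, value in sorted(meta.items()))
    suffix = f" ({extras})" if extras else ""
    return f"{name}: {total} {unit}{suffix}"


scores = [85, 92, 78]
options = {"unit": "points", "term": "fall"}

# *scores unpacks the list into positional arguments;
# **options unpacks the dict into keyword arguments.
# 'unit' matches the named parameter, 'term' falls through into **meta.
print(describe("Alice", *scores, **options))
# Alice: 255 points (term=fall)
```

Parameters declared after `*scores` (here `unit`) can only be passed by keyword, which keeps call sites readable when a function accepts many options.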
Return Types: Multiple Returns and Named Tuples
def compute_statistics(values):
    """Compute multiple statistics at once.

    Returns a named tuple, so callers can use field names or unpack it.
    Mathematical: f: ℝⁿ → ℝ⁵ where outputs are
    (mean, median, std, min, max)
    """
    from collections import namedtuple
    import statistics
    Stats = namedtuple('Stats', ['mean', 'median', 'std', 'min', 'max'])
    return Stats(
        mean=statistics.mean(values),
        median=statistics.median(values),
        std=statistics.stdev(values),  # sample standard deviation
        min=min(values),
        max=max(values)
    )
data = [23, 45, 12, 67, 89, 34, 56]
stats = compute_statistics(data)
print(f"Mean: {stats.mean:.2f}")
print(f"Median: {stats.median}")
print(f"Std Dev: {stats.std:.2f}")
# Mean: 46.57
# Median: 45
# Std Dev: 26.51
# Unpacking multiple returns
mean, median, std, mn, mx = compute_statistics(data)
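As one possible variation, the standard library's `typing.NamedTuple` lets the result type be declared once at module level with type hints, rather than being rebuilt on every call. The names `Summary` and `summarize` below are illustrative only.

```python
import statistics
from typing import NamedTuple


class Summary(NamedTuple):
    """Illustrative result type with typed, named fields."""
    mean: float
    median: float
    std: float


def summarize(values):
    """Return mean, median, and sample standard deviation as a Summary."""
    return Summary(
        mean=statistics.mean(values),
        median=statistics.median(values),
        std=statistics.stdev(values),
    )


result = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(result.mean, result.median)  # access by field name
mean, median, std = result         # or unpack like a plain tuple
print(result._asdict())            # convert to a dict, e.g. for logging
```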
Scope: The LEGB Rule
Python resolves variable names using the LEGB rule, a hierarchical lookup chain:
Definition: LEGB Scope
Python's variable resolution order: Local (L) → Enclosing (E) → Global (G) → Built-in (B). When a variable is referenced, Python searches these scopes in order, using the first one found. This determines which binding is visible at any point in the code.
Built-in (B)            print()                        ← B scope (built-in)
└─ Global (G)           x = "global"                   ← G scope
   └─ Enclosing (E)     def outer(): x = "enclosing"   ← E scope
      └─ Local (L)      def inner(): x = "local"       ← L scope
(Each scope is nested inside the one above it; lookup starts at the innermost scope and moves outward.)
# LEGB Rule Demonstration
# Built-in scope (B): names such as print and len are available everywhere
x = "global"  # Global scope (G)
def outer():
    x = "enclosing"  # Enclosing scope (E)
    def inner():
        x = "local"  # Local scope (L)
        print(f"Local x: {x}")  # Prints: local
    inner()
    print(f"Enclosing x: {x}")  # Prints: enclosing
outer()
print(f"Global x: {x}")  # Prints: global
# Common mistake: UnboundLocalError
counter = 0
def bad_function():
    # The assignment below makes counter local to the whole function,
    # so the read in print() happens before any local value exists.
    print(counter)  # UnboundLocalError when the function is called
    counter = counter + 1
# Solution: use global keyword (but avoid it!)
def better_function():
    global counter
    counter = counter + 1
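A common alternative to `global` is to pass the value in and return the new value, so the function's inputs and outputs are explicit. A minimal sketch, with an illustrative function named `increment`:

```python
def increment(counter, step=1):
    """Return the updated counter instead of mutating module-level state."""
    return counter + step


counter = 0
counter = increment(counter)          # 1
counter = increment(counter, step=5)  # 6
print(counter)  # 6
```

Written this way, the function can be tested in isolation, since its result depends only on its arguments.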
Closures: Capturing Enclosing Scope
Definition: Closure
A function object that remembers values from its enclosing lexical scope even when the function is executed outside that scope. Conceptually, a closure pairs a function with an environment that holds the bindings of its free variables.
def make_multiplier(factor):
    """Create a multiplier function using closure.

    Mathematical: make_multiplier(a) = λx. a·x
    Where 'factor' is captured from enclosing scope.
    """
    def multiplier(x):
        return x * factor
    return multiplier
double = make_multiplier(2)
triple = make_multiplier(3)
print(double(5)) # 10
print(triple(5)) # 15
# The closure captures 'factor' in __closure__
print(double.__closure__[0].cell_contents) # 2
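A closure can also update the variable it captures. The sketch below assumes a hypothetical `make_counter` factory and uses `nonlocal`, which tells Python that the assignment targets the enclosing (E) scope rather than creating a new local name.

```python
def make_counter(start=0):
    """Return a function that counts upward from start, one step per call."""
    count = start

    def counter():
        nonlocal count  # rebind the enclosing variable, not a new local one
        count += 1
        return count

    return counter


clicks = make_counter()
print(clicks())  # 1
print(clicks())  # 2

# Each call to make_counter creates an independent captured variable
downloads = make_counter(10)
print(downloads())  # 11
```

Without the `nonlocal` line, `count += 1` would raise `UnboundLocalError`, for the same reason that assigning to a global name inside a function does.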
Closure Complexity
$\text{Space} = O(k), \quad \text{Lookup} = O(1)$
Here,
- $k$ = Number of variables from the enclosing scope stored in the closure
- $O(1)$ = Accessing a captured variable is a constant-time cell lookup
Decorators: Function Wrapping
Decorators are higher-order functions that modify the behavior of other functions without changing their source code. They follow the principle: wrap a function → add behavior → return the wrapped function.
Definition: Decorator
A higher-order function that takes a function f and returns a modified function g = decorator(f). Decorators implement the decorator pattern, allowing cross-cutting concerns (logging, timing, caching) to be applied to functions without modifying their implementation.
Without Decorator:      With Decorator:
┌─────────────────┐     ┌─────────────────────┐
│ def func():     │     │ @timer              │
│     ...         │     │ def func():         │
│     return      │     │     ...             │
└─────────────────┘     │     return          │
                        └─────────────────────┘
Which is equivalent to:
┌─────────────────────┐
│ func = timer(func)  │
│ func()  # wrapped   │
└─────────────────────┘
import time
import functools
# Decorator 1: Timer
def timer(func):
    """Measure function execution time."""
    @functools.wraps(func)  # Preserves original function metadata
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"[{func.__name__}] {elapsed:.4f}s")
        return result
    return wrapper
# Decorator 2: Cache (memoization)
def cache(func):
    """Cache function results for previously seen inputs.

    Mathematical: memoize f by storing {x: f(x)}
    The cache is unbounded: entries are never evicted.
    Arguments must be hashable; keyword arguments are not supported.
    """
    memo = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in memo:
            memo[args] = func(*args)
        return memo[args]
    wrapper.cache = memo
    return wrapper
# Decorator 3: Retry
def retry(max_attempts=3, delay=1):
    """Retry function on failure with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    print(f"Attempt {attempt+1} failed: {e}")
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator
# Using decorators
@timer
@cache
def fibonacci(n):
    """Compute nth Fibonacci number."""
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
# This runs in O(n) instead of O(2^n) thanks to caching.
# Recursive calls go through the decorated name, so @timer prints
# one timing line per call, not just one for the outermost call.
print(fibonacci(30))  # 832040 (instant with cache)
Caching decorator reduces Fibonacci from exponential O(2^n) to linear O(n) by storing previously computed values.
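For comparison, the standard library ships a memoizing decorator, `functools.lru_cache`, which does apply a least-recently-used eviction policy once `maxsize` entries are stored. A short sketch (the function name `fib` and the `maxsize` value are arbitrary choices for illustration):

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # keep at most 128 results, evicting the least recently used
def fib(n):
    """Compute the nth Fibonacci number with memoized recursion."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(30))           # 832040
print(fib.cache_info())  # hit/miss counters and current cache size
fib.cache_clear()        # empty the cache, e.g. between tests
```

The hand-written cache keeps every result forever, which is fine for small input ranges; a bounded cache is the safer default when the set of possible arguments is large.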
Lambda Functions
Lambdas are anonymous, single-expression functions defined inline. They are restricted to a single expression and are ideal for short, throwaway operations.
Syntax and Use Cases
Definition: Lambda Abstraction
An anonymous function expression λx₁, x₂, ..., xₙ. body. In lambda calculus, this represents function abstraction. Python's lambda is a restricted version: its body must be a single expression, so it cannot contain statements such as assignments, loops, or return.
# Lambda syntax: lambda arguments: expression
# Mathematical: λx. f(x) (lambda calculus notation)
# Basic examples
square = lambda x: x ** 2
add = lambda a, b: a + b
absolute = lambda x: x if x >= 0 else -x
print(square(5)) # 25
print(add(3, 4)) # 7
print(absolute(-7)) # 7
# Lambda with map: apply function to each element
# map: (A → B) × List[A] → List[B]
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
print(squared) # [1, 4, 9, 16, 25]
# Lambda with filter: select elements satisfying predicate
# filter: (A → Bool) × List[A] → List[A]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens) # [2, 4]
# Lambda with reduce: accumulate values
# reduce: (B × A → B) × List[A] → B
from functools import reduce
product = reduce(lambda a, b: a * b, numbers)
print(product) # 120 (1*2*3*4*5)
# Lambda with sorted: custom sort keys
students = [
    {'name': 'Alice', 'gpa': 3.8},
    {'name': 'Bob', 'gpa': 3.5},
    {'name': 'Charlie', 'gpa': 3.9}
]
sorted_students = sorted(students, key=lambda s: s['gpa'], reverse=True)
for s in sorted_students:
    print(f"{s['name']}: {s['gpa']}")
# Charlie: 3.9
# Alice: 3.8
# Bob: 3.5
Lambda with pandas apply
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'income': [50000, 60000, 75000, 55000]
})
# Apply lambda to create new column
# Thresholds are strict: exactly 55000 is 'Low', exactly 65000 is 'Medium'
df['income_category'] = df['income'].apply(
    lambda x: 'High' if x > 65000 else ('Medium' if x > 55000 else 'Low')
)
# Apply lambda to each row
df['description'] = df.apply(
    lambda row: f"{row['name']}, age {row['age']}, earns ${row['income']:,}",
    axis=1
)
print(df)
#       name  age  income income_category                     description
# 0    Alice   25   50000             Low    Alice, age 25, earns $50,000
# 1      Bob   30   60000          Medium      Bob, age 30, earns $60,000
# 2  Charlie   35   75000            High  Charlie, age 35, earns $75,000
# 3    Diana   28   55000             Low    Diana, age 28, earns $55,000
When to Use Lambda vs. Def
| Criterion | lambda | def |
|---|---|---|
| Named function needed | ✗ | ✓ |
| Multiple expressions | ✗ | ✓ |
| Docstring needed | ✗ | ✓ |
| Type hints needed | ✗ | ✓ |
| Short throwaway function | ✓ | ✗ |
| Inline with map/filter/sort | ✓ | ✗ |
| Recursion needed | ✗ | ✓ |
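Several rows of the table can be checked directly by comparing the metadata of a lambda with that of a `def`. The names `square_lambda` and `square_def` are illustrative.

```python
square_lambda = lambda x: x ** 2


def square_def(x: float) -> float:
    """Return x squared."""
    return x ** 2


# A lambda has no real name, which makes tracebacks harder to read
print(square_lambda.__name__)  # <lambda>
print(square_def.__name__)     # square_def

# A lambda cannot carry a docstring or annotations
print(square_lambda.__doc__)          # None
print(square_def.__doc__)             # Return x squared.
print(square_lambda.__annotations__)  # {}
print(square_def.__annotations__)     # {'x': <class 'float'>, 'return': <class 'float'>}
```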
List Comprehensions
List comprehensions provide a concise, declarative syntax for transforming and filtering sequences. They are often faster than equivalent for loops because Python optimizes them internally.
Mathematical Foundation
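A list comprehension is equivalent to set-builder notation:
List Comprehension
$[\, f(x) \mid x \in S,\ P(x) \,]$, written in Python as [f(x) for x in S if P(x)]
Here,
- $f(x)$ = The expression (transformation)
- $S$ = The iterable (source)
- $P(x)$ = The predicate (filter condition)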
# Basic syntax: [expression for item in iterable]
# Equivalent to: list(map(expression, iterable))
squares = [x**2 for x in range(10)]
print(squares) # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
# With condition (filter): [expr for item in iterable if condition]
# Equivalent to: list(map(expression, filter(predicate, iterable)))
# (the condition is tested on each item before the expression is applied)
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares) # [0, 4, 16, 36, 64]
# With if-else (must be in expression, not after 'for')
labels = ['even' if x % 2 == 0 else 'odd' for x in range(5)]
print(labels) # ['even', 'odd', 'even', 'odd', 'even']
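The relationship between a filtered comprehension, `map`/`filter`, and an explicit loop can be verified with a short check. This is a sketch; the variable names are arbitrary.

```python
numbers = range(10)

# 1. List comprehension: filter first, then transform
comprehension = [x ** 2 for x in numbers if x % 2 == 0]

# 2. Functional form: filter is the inner call, map is the outer call
functional = list(map(lambda x: x ** 2, filter(lambda x: x % 2 == 0, numbers)))

# 3. Explicit for loop
loop = []
for x in numbers:
    if x % 2 == 0:
        loop.append(x ** 2)

print(comprehension)                        # [0, 4, 16, 36, 64]
print(comprehension == functional == loop)  # True
```

Note the order in the functional form: the predicate sees the original items, not the squared values, which matches how the `if` clause of a comprehension behaves.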
Nested Comprehensions
# Flattening a matrix
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [num for row in matrix for num in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Generating all pairs
names = ['Alice', 'Bob']
colors = ['red', 'blue']
pairs = [(name, color) for name in names for color in colors]
print(pairs)
# [('Alice', 'red'), ('Alice', 'blue'), ('Bob', 'red'), ('Bob', 'blue')]
# Matrix transpose using comprehension
original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transposed = [[row[i] for row in original] for i in range(3)]
print(transposed) # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
# Or using zip (more Pythonic)
transposed_zipped = [list(row) for row in zip(*original)]
print(transposed_zipped) # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
Practical Data Processing Example
import pandas as pd
import numpy as np
# Generate sample data
np.random.seed(42)
raw_data = [
    {'name': 'Alice', 'scores': '85,92,78', 'grade': 'A'},
    {'name': 'Bob', 'scores': '90,88,95', 'grade': 'A+'},
    {'name': 'Charlie', 'scores': '72,68,75', 'grade': 'B'},
    {'name': 'Diana', 'scores': '95,98,100', 'grade': 'A+'},
    {'name': 'Eve', 'scores': '60,72,65', 'grade': 'C'}
]
# Comprehension to parse scores
parsed_data = [
    {
        'name': r['name'],
        'scores': [int(s) for s in r['scores'].split(',')],
        'average': sum(int(s) for s in r['scores'].split(',')) / 3,
        'grade': r['grade']
    }
    for r in raw_data
]
# Filter high performers
high_performers = [
    d for d in parsed_data
    if d['average'] >= 85
]
print("High performers:")
for hp in high_performers:
    print(f"  {hp['name']}: avg={hp['average']:.1f}")
# Output:
# High performers:
# Alice: avg=85.0
# Bob: avg=91.0
# Diana: avg=97.7
Dictionary Comprehensions
# Basic dict comprehension: {key: value for item in iterable}
names = ['Alice', 'Bob', 'Charlie']
name_lengths = {name: len(name) for name in names}
print(name_lengths) # {'Alice': 5, 'Bob': 3, 'Charlie': 7}
# Invert a dictionary
original = {'a': 1, 'b': 2, 'c': 3}
inverted = {v: k for k, v in original.items()}
print(inverted) # {1: 'a', 2: 'b', 3: 'c'}
# Filter dictionary
scores = {'Alice': 92, 'Bob': 78, 'Charlie': 85, 'Diana': 95}
excellent = {k: v for k, v in scores.items() if v >= 85}
print(excellent) # {'Alice': 92, 'Charlie': 85, 'Diana': 95}
# Transform values
adjusted = {k: min(100, v * 1.1) for k, v in scores.items()}
print(adjusted)
# Real-world: feature engineering in ML
import pandas as pd
df = pd.DataFrame({
    'feature_1': [1, 2, 3],
    'feature_2': [4, 5, 6],
    'target': [0, 1, 0]
})
# Create interaction features
interaction_features = {
    f'{c1}_x_{c2}': df[c1] * df[c2]
    for c1 in ['feature_1', 'feature_2']
    for c2 in ['feature_1', 'feature_2']
    if c1 != c2
}
print(interaction_features.keys())
# dict_keys(['feature_1_x_feature_2', 'feature_2_x_feature_1'])
Set Comprehensions
# Set comprehension: {expression for item in iterable}
text = "hello world"
unique_chars = {c.upper() for c in text if c.isalpha()}
print(unique_chars) # {'H', 'E', 'L', 'O', 'W', 'R', 'D'} (sets are unordered, so the order may differ)
# Find common elements
list_a = [1, 2, 3, 4, 5, 6]
list_b = [4, 5, 6, 7, 8, 9]
common = {x for x in list_a if x in list_b}
print(common) # {4, 5, 6}
# Extract unique data types from mixed list
mixed = [1, 'hello', 3.14, True, [1, 2], {'a': 1}]
types = {type(x).__name__ for x in mixed}
print(types) # {'int', 'str', 'float', 'bool', 'list', 'dict'} (order may differ)
Generator Expressions
Generators produce values lazily: they compute one value at a time on demand, rather than building the entire list in memory. This is critical when working with large datasets.
Memory Comparison
Definition: Lazy Evaluation
A computation strategy where expressions are not evaluated until their values are needed. Generators implement lazy evaluation, computing values on-the-fly rather than precomputing all results. This enables processing of infinite sequences and large datasets with O(1) memory.
List Comprehension:         Generator Expression:
┌───────────────────────┐   ┌───────────────────────┐
│ [1, 2, 3, ..., N]     │   │ generator object      │
│                       │   │                       │
│ ALL in memory         │   │ ONE at a time         │
│ O(N) memory           │   │ O(1) memory           │
│ Indexable, reusable   │   │ Single pass only      │
└───────────────────────┘   └───────────────────────┘
import sys
# List comprehension: allocates all elements immediately
list_comp = [x**2 for x in range(1000000)]
print(f"List size: {sys.getsizeof(list_comp) / 1024 / 1024:.2f} MB")
# List size: about 8 MB (the list's pointer array only; exact value varies by Python version)
# Generator expression: lazy evaluation
gen_exp = (x**2 for x in range(1000000))
print(f"Generator size: {sys.getsizeof(gen_exp)} bytes")
# Generator size: about 200 bytes (varies by Python version)
# Generator function with yield
def fibonacci_generator(n):
    """Generate first n Fibonacci numbers lazily."""
    a, b = 0, 1
    count = 0
    while count < n:
        yield a
        a, b = b, a + b
        count += 1
# Uses only O(1) memory regardless of n
fib = fibonacci_generator(10)
print(list(fib))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
# Infinite generator
def count_up(start=0):
    """Generate infinite sequence from start."""
    n = start
    while True:
        yield n
        n += 1
# Take first 5 from infinite generator
counter = count_up(100)
first_five = [next(counter) for _ in range(5)]
print(first_five)  # [100, 101, 102, 103, 104]
# Generator pipeline for large data processing
def read_large_file(file_path):
    """Lazy read of large file, line by line."""
    with open(file_path, 'r') as f:
        for line in f:
            yield line.strip()
def filter_lines(lines, keyword):
    """Filter generator by keyword."""
    for line in lines:
        if keyword in line:
            yield line
# Pipeline: read → filter → count (O(1) memory!)
# lines = read_large_file('huge_log.txt')
# errors = filter_lines(lines, 'ERROR')
# count = sum(1 for _ in errors)
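The read → filter → count pipeline can be tried without a real file. The sketch below substitutes an in-memory `io.StringIO` stream for the log file; the log lines are made up for illustration, and `read_lines` is a stand-in for a file reader that accepts any iterable of lines.

```python
import io


def read_lines(stream):
    """Yield stripped lines from any line-iterable stream, one at a time."""
    for line in stream:
        yield line.strip()


def filter_lines(lines, keyword):
    """Yield only the lines that contain keyword."""
    for line in lines:
        if keyword in line:
            yield line


# Made-up log content standing in for a large file on disk
log = io.StringIO(
    "INFO service started\n"
    "ERROR disk full\n"
    "INFO retrying\n"
    "ERROR timeout\n"
)

# Nothing is read until sum() starts pulling values through the pipeline
errors = filter_lines(read_lines(log), "ERROR")
error_count = sum(1 for _ in errors)
print(error_count)  # 2
```

Each stage holds only the current line, so the same code would behave identically on a stream far larger than available memory.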
Generator Memory
$M_{\text{gen}} = O(1), \quad M_{\text{list}} = O(n)$
Here,
- $M_{\text{gen}}$ = Memory for generator (constant)
- $M_{\text{list}}$ = Memory for list (proportional to n)
- $n$ = Number of elements
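A quick way to observe this relationship is to measure both containers for several values of n. This is a rough sketch: `sys.getsizeof` reports only the container object itself, and the exact byte counts depend on the Python version. The second half shows the price of constant memory, which is that a generator can be consumed only once.

```python
import sys

sizes = {}
for n in (10, 1_000, 100_000):
    gen = (x ** 2 for x in range(n))
    lst = [x ** 2 for x in range(n)]
    sizes[n] = (sys.getsizeof(gen), sys.getsizeof(lst))
    print(f"n={n:>7}: generator {sizes[n][0]} bytes, list {sizes[n][1]} bytes")

# Trade-off: a generator is exhausted after one pass
squares = (x ** 2 for x in range(5))
first_pass = sum(squares)   # 0 + 1 + 4 + 9 + 16 = 30
second_pass = sum(squares)  # 0, nothing is left to iterate
print(first_pass, second_pass)  # 30 0
```

If the values are needed more than once, either rebuild the generator or materialize it into a list and accept the O(n) memory cost.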
When to Use Each: Decision Tree
Need a function?
├─ Yes, reusable → Use `def`
├─ Yes, one-liner → Lambda
├─ Transform sequence → List comprehension
│  ├─ Large dataset → Generator expression
│  ├─ Need dict → Dict comprehension
│  └─ Need unique items → Set comprehension
└─ No, inline logic → Lambda with map/filter/sort
Performance Priority?
├─ Memory critical → Generator expression
├─ Speed critical → List comprehension (optimized in CPython)
└─ Readability critical → List comprehension with clear logic
Performance Comparison
import time
import sys
data = list(range(1_000_000))
def benchmark(label, func, data):
    """Benchmark a function and return timing."""
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    return label, elapsed, sys.getsizeof(result)
# Method 1: for loop
def with_loop(data):
    result = []
    for x in data:
        if x % 2 == 0:
            result.append(x ** 2)
    return result
# Method 2: map + filter + lambda
def with_map(data):
    return list(map(lambda x: x**2, filter(lambda x: x % 2 == 0, data)))
# Method 3: list comprehension
def with_comprehension(data):
    return [x**2 for x in data if x % 2 == 0]
# Method 4: generator expression
def with_generator(data):
    return (x**2 for x in data if x % 2 == 0)
results = [
    benchmark("for loop", with_loop, data),
    benchmark("map+filter", with_map, data),
    benchmark("comprehension", with_comprehension, data),
    benchmark("generator", with_generator, data),
]
print(f"{'Method':<15} {'Time (s)':<12} {'Memory':<12}")
print("-" * 39)
for label, elapsed, mem in results:
    print(f"{label:<15} {elapsed:<12.4f} {mem:<12}")
Typical Results:
Method          Time (s)     Memory (bytes)
---------------------------------------
for loop        0.0892       4304888
map+filter      0.1203       4304888
comprehension   0.0634       4304888
generator       0.0001       208
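In these results the list comprehension is roughly 30% faster than the equivalent for loop (0.0634 s versus 0.0892 s), largely because it avoids the repeated attribute lookup and method call for .append(). The generator row is not comparable to the others: its time covers only creating the generator object, and no element is computed until the generator is consumed. Generators do win decisively on memory for large datasets.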
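For a like-for-like timing, each method should produce a final value so that the generator is actually consumed. The sketch below uses the standard library's `timeit` for this; the data size and repeat counts are arbitrary, and absolute timings will differ between machines and Python versions.

```python
import timeit

data = list(range(50_000))


def sum_with_comprehension(values):
    # Builds the full list first, then sums it
    return sum([x ** 2 for x in values if x % 2 == 0])


def sum_with_generator(values):
    # Feeds values to sum() one at a time, never building a list
    return sum(x ** 2 for x in values if x % 2 == 0)


# Both approaches must agree before their timings are worth comparing
assert sum_with_comprehension(data) == sum_with_generator(data)

for fn in (sum_with_comprehension, sum_with_generator):
    # Take the best of 3 runs, each executing the function 3 times
    best = min(timeit.repeat(lambda: fn(data), number=3, repeat=3))
    print(f"{fn.__name__:<24} {best / 3:.4f} s per call")
```

Measured this way, the two approaches usually land close together in time, and the generator's advantage shows up in peak memory rather than speed.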
Complete Data Processing Pipeline
import pandas as pd
import numpy as np
from functools import reduce
# === STEP 1: Generate Raw Data ===
np.random.seed(42)
raw_records = [
    {'id': i,
     'name': f'User_{i}',
     'age': np.random.randint(18, 70),
     'income': np.random.normal(50000, 15000),
     'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR']),
     'score': np.random.uniform(0, 100)}
    for i in range(100)
]
# === STEP 2: Create DataFrame ===
df = pd.DataFrame(raw_records)
# === STEP 3: Clean with reduce ===
# Remove outliers using IQR method
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[column] >= Q1 - 1.5*IQR) & (df[column] <= Q3 + 1.5*IQR)]
# Apply the filter once per column; .copy() gives an independent DataFrame
# so that adding columns below does not trigger SettingWithCopyWarning
df_clean = reduce(remove_outliers_iqr, ['age', 'income'], df).copy()
# === STEP 4: Feature Engineering with Comprehensions ===
# Create bins for age
age_bins = pd.cut(df_clean['age'], bins=[0, 25, 35, 50, 100],
                  labels=['Young', 'Mid', 'Senior', 'Expert'])
df_clean['age_group'] = age_bins
# Create interaction features
feature_cols = ['age', 'income', 'score']
# c1 < c2 compares names alphabetically, so each unordered pair appears once
interaction_names = [f'{c1}_x_{c2}'
                     for c1 in feature_cols
                     for c2 in feature_cols
                     if c1 < c2]
for name in interaction_names:
    c1, c2 = name.split('_x_')
    df_clean[name] = df_clean[c1] * df_clean[c2]
# === STEP 5: Aggregation with Dict Comprehension ===
dept_stats = {
    dept: {
        'mean_income': group['income'].mean(),
        'count': len(group),
        'top_score': group['score'].max()
    }
    for dept, group in df_clean.groupby('department')
}
print("Department Statistics:")
for dept, stats in dept_stats.items():
    print(f"  {dept}: avg_income=${stats['mean_income']:,.0f}, "
          f"count={stats['count']}, top_score={stats['top_score']:.1f}")
# === STEP 6: Filter and Select with Comprehension ===
high_performers = [
    {'name': row['name'], 'score': row['score']}
    for _, row in df_clean.iterrows()
    if row['score'] > 80 and row['income'] > 60000
]
print(f"\nHigh performers (score>80, income>60k): {len(high_performers)}")
# === STEP 7: Summary Statistics ===
summary = {
    'total_records': len(df_clean),
    'features_created': len(interaction_names),
    'departments': list(dept_stats.keys()),
    'age_groups': list(df_clean['age_group'].unique())
}
print("\nPipeline Summary:")
for key, value in summary.items():
    print(f"  {key}: {value}")
Key Takeaways
Summary: Functions, Lambda & Comprehensions
- Functions are the foundation of modular code. Use `*args` and `**kwargs` for flexible interfaces. Return named tuples for multiple outputs.
- Scope follows the LEGB rule. Avoid `global`; prefer function parameters and return values for data flow.
- Decorators wrap functions to add cross-cutting concerns (timing, caching, retries) without modifying the original code.
- Lambda functions are for short, throwaway operations, especially with `map`, `filter`, and `sorted`. Use `def` for anything reusable.
- List comprehensions are Pythonic, readable, and fast. Use them as the default for sequence transformations.
- Dict/Set comprehensions extend the pattern to dictionaries and sets, ideal for building lookup tables and collecting unique values.
- Generator expressions are essential for memory-efficient processing of large datasets. Use them when you don't need all results in memory at once.
- Performance hierarchy: for memory, generator expressions beat every approach that builds a full list (comprehensions, map/filter wrapped in list(), and for loops all produce a list of the same size); for speed in the benchmark above, comprehensions > for loops > map/filter with lambdas.
Practice Exercises
Exercise 1: Function Factory
# Create a function that returns functions
# Example: create_operation('add', 5) returns a function that adds 5 to any input
# create_operation('multiply', 3) returns a function that multiplies by 3
def create_operation(operation, value):
    # Your code here
    pass
add_five = create_operation('add', 5)
multiply_by_three = create_operation('multiply', 3)
print(add_five(10)) # Should print 15
print(multiply_by_three(10)) # Should print 30
Exercise 2: Comprehension Challenge
# Given a list of dictionaries representing products,
# create:
# 1. A dict comprehension mapping product names to prices
# 2. A list comprehension of products cheaper than $10
# 3. A set comprehension of all categories
products = [
    {'name': 'Laptop', 'price': 999, 'category': 'Electronics'},
    {'name': 'Mouse', 'price': 25, 'category': 'Electronics'},
    {'name': 'Book', 'price': 15, 'category': 'Education'},
    {'name': 'Pen', 'price': 2, 'category': 'Education'},
    {'name': 'Chair', 'price': 150, 'category': 'Furniture'},
]
# Your comprehensions here
price_map = {}
cheap_products = []
categories = set()
Exercise 3: Decorator Challenge
# Create a decorator that logs function calls to a list
# and provides statistics about the function
call_log = []
def log_calls(func):
    # Your decorator here
    pass
@log_calls
def add(a, b):
    return a + b
add(1, 2)
add(3, 4)
add(5, 6)
# Should be able to call:
# log_calls.stats() → {'total_calls': 3, 'args': [(1,2), (3,4), (5,6)]}
Exercise 4: Generator Pipeline
# Create a generator pipeline that:
# 1. Reads numbers from 1 to 1000
# 2. Filters out numbers not divisible by 3
# 3. Transforms by squaring
# 4. Takes only the first 10 results
def number_source():
    pass  # Your code
def divisible_by_3(numbers):
    pass  # Your code
def square(numbers):
    pass  # Your code
# Pipeline
pipeline = square(divisible_by_3(number_source()))
results = [next(pipeline) for _ in range(10)]
print(results) # [9, 36, 81, 144, 225, 324, 441, 576, 729, 900]