Functions, Lambda & Comprehensions


Why Functions Matter

Functions are the atomic units of reusable logic in Python. They transform raw, repetitive code into modular, testable, and composable building blocks. Without functions, data science code becomes an unmaintainable tangle of copy-pasted operations.

The Cost of Not Using Functions

Without Functions:               With Functions:
┌─────────────────────┐          ┌───────────────────────┐
│ data = load_csv()   │          │ data = load_csv()     │
│ # clean it          │          │ data = clean(data)    │
│ data['a'] = ...     │          │ data = transform(a)   │
│ data['b'] = ...     │          │ data = transform(b)   │
│ # copy-paste again  │          │ report = report(data) │
│ data2 = load_csv()  │          └───────────────────────┘
│ data2['a'] = ...    │          ✓ Reusable
│ data2['b'] = ...    │          ✓ Testable
│ # ... repeat 50x    │          ✓ Readable
│ # BUG: forgot one   │          ✓ Less bug-prone
└─────────────────────┘

Function Fundamentals

Basic Syntax and Return Types

Definition: Function

A mathematical mapping f: X → Y that assigns each element x ∈ X to exactly one element y ∈ Y. In programming, a function is a reusable block of code that takes inputs (parameters), executes a sequence of operations, and returns an output.

# Every function is a mapping: f: X → Y
# Formally: f: A → B where f(a) = b for each a ∈ A

def square(x):
    """Return the square of x.

    Mathematical definition: f(x) = x²
    Domain: ℝ (all real numbers)
    Codomain: [0, ∞)
    """
    return x ** 2

# Functions are first-class objects in Python
print(type(square))           # <class 'function'>
print(square.__name__)        # 'square'
print(square.__doc__)         # 'Return the square of x...'

# Functions can be assigned to variables
f = square
print(f(5))                   # 25

# Functions can be stored in data structures
operations = {
    'square': lambda x: x ** 2,
    'cube': lambda x: x ** 3,
    'sqrt': lambda x: x ** 0.5
}
print(operations['cube'](3))  # 27

Python's first-class functions enable functional programming patterns like higher-order functions (functions that take/return other functions), closures, and decorators. This is the foundation of functional composition, where complex operations are built by combining simpler functions.
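
As a minimal sketch of the composition idea, the helper below chains single-argument functions from left to right. The names `compose`, `increment`, and `square_then_increment` are illustrative and do not appear elsewhere in this lesson.

```python
from functools import reduce

def compose(*funcs):
    """Return a function that applies funcs left to right."""
    def composed(value):
        # Feed the running result through each function in turn
        return reduce(lambda acc, f: f(acc), funcs, value)
    return composed

def square(x):
    return x ** 2

def increment(x):
    return x + 1

square_then_increment = compose(square, increment)
result = square_then_increment(3)   # square(3) = 9, then 9 + 1
print(result)  # 10
```

Because `compose` both takes functions and returns one, it is a higher-order function, and `composed` is a closure over `funcs`.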

Arguments: Positional, Keyword, Default

def describe_dataset(data, name="dataset", verbose=True):
    """Describe a pandas DataFrame with optional verbosity.

    Parameters
    ----------
    data : pd.DataFrame
        The dataset to describe.
    name : str, default="dataset"
        Name for labeling output.
    verbose : bool, default=True
        If True, print detailed stats.
    """
    print(f"Dataset: {name}")
    print(f"Shape: {data.shape[0]} rows × {data.shape[1]} columns")
    print(f"Memory: {data.memory_usage(deep=True).sum() / 1024:.2f} KB")

    if verbose:
        print(f"\nColumn types:")
        print(data.dtypes.value_counts())
        print(f"\nMissing values: {data.isnull().sum().sum()}")
    return {'shape': data.shape, 'memory_kb': data.memory_usage(deep=True).sum() / 1024}

# Calling with different argument styles
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

# Positional
describe_dataset(df, "my_data")

# Keyword
describe_dataset(data=df, name="my_data", verbose=False)

# Mixed (positional must come first)
describe_dataset(df, verbose=False, name="my_data")

*args and **kwargs

Definition: Variadic Function

A function that accepts a variable number of arguments. In Python, *args collects positional arguments into a tuple, and **kwargs collects keyword arguments into a dictionary, enabling flexible function signatures.

def flexible_function(*args, **kwargs):
    """Accept any number of positional and keyword arguments.

    *args: collects extra positional arguments into a tuple
           Mathematical: f: ℝⁿ → ℝ where n is variable
    **kwargs: collects extra keyword arguments into a dict
              Mathematical: f: ℝⁿ × {k₁:ℝ,...,kₘ:ℝ} → ℝ
    """
    print(f"Positional args (tuple): {args}")
    print(f"Keyword args (dict): {kwargs}")
    print(f"Sum of positional: {sum(args)}")
    return sum(args)

flexible_function(1, 2, 3, x=10, y=20)
# Positional args (tuple): (1, 2, 3)
# Keyword args (dict): {'x': 10, 'y': 20}
# Sum of positional: 6

# Real-world example: flexible aggregation
def aggregate(df, group_cols, agg_dict=None, **extra_aggs):
    """Flexible DataFrame aggregation.

    Parameters
    ----------
    df : pd.DataFrame
        The data to aggregate.
    group_cols : str or list
        Column(s) to group by.
    agg_dict : dict, optional
        {column: [aggregations]} mapping.
    **extra_aggs : additional aggregations
        Column name as key, aggregation function as value.
    """
    # Build a new dict so the caller's agg_dict is not mutated
    combined = {**(agg_dict or {}), **extra_aggs}
    return df.groupby(group_cols).agg(combined)
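
The star syntax also works in the other direction at the call site: `*` spreads a sequence into positional arguments and `**` spreads a dict into keyword arguments. A small sketch, using a hypothetical `summarize` function that is not part of the lesson's code:

```python
def summarize(*args, **kwargs):
    """Return the positional total and the sorted keyword names."""
    return sum(args), sorted(kwargs)

values = [1, 2, 3]
options = {'x': 10, 'y': 20}

# Equivalent to summarize(1, 2, 3, x=10, y=20)
total, names = summarize(*values, **options)
print(total)  # 6
print(names)  # ['x', 'y']
```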

Return Types: Multiple Returns and Named Tuples

def compute_statistics(values):
    """Compute multiple statistics at once.

    Returns a named tuple for clarity.
    Mathematical: f: ℝⁿ → ℝ⁵ where outputs are
    (mean, median, stdev, min, max)
    """
    from collections import namedtuple
    import statistics

    # Field names must match the keyword arguments used below
    Stats = namedtuple('Stats', ['mean', 'median', 'stdev', 'min', 'max'])

    return Stats(
        mean=statistics.mean(values),
        median=statistics.median(values),
        stdev=statistics.stdev(values),  # sample standard deviation
        min=min(values),
        max=max(values)
    )

data = [23, 45, 12, 67, 89, 34, 56]
stats = compute_statistics(data)
print(f"Mean: {stats.mean:.2f}")
print(f"Median: {stats.median}")
print(f"Std Dev: {stats.stdev:.2f}")
# Mean: 46.57
# Median: 45
# Std Dev: 26.51

# Unpacking multiple returns
mean, median, std, mn, mx = compute_statistics(data)

Scope: The LEGB Rule

Python resolves variable names using the LEGB rule, a hierarchical lookup chain:

Definition: LEGB Scope

Python's variable resolution order: Local (L) → Enclosing (E) → Global (G) → Built-in (B). When a variable is referenced, Python searches these scopes in order, using the first one found. This determines which binding is visible at any point in the code.

┌──────────────────────────────────────────────────┐
│ Built-in (B): print, len, range, ...             │
│ ┌──────────────────────────────────────────────┐ │
│ │ Global (G): x = "global"                     │ │
│ │ ┌──────────────────────────────────────────┐ │ │
│ │ │ Enclosing (E): def outer():              │ │ │
│ │ │                    x = "enclosing"       │ │ │
│ │ │ ┌──────────────────────────────────────┐ │ │ │
│ │ │ │ Local (L): def inner():              │ │ │ │
│ │ │ │                x = "local"           │ │ │ │
│ │ │ │                print(x)              │ │ │ │
│ │ │ └──────────────────────────────────────┘ │ │ │
│ │ │                    inner()               │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘

# LEGB Rule Demonstration
# Names such as print and len live in the built-in scope (B), available everywhere

x = "global"                    # Global scope (G)

def outer():
    x = "enclosing"             # Enclosing scope (E)

    def inner():
        x = "local"             # Local scope (L)
        print(f"Local x: {x}")  # Prints: local

    inner()
    print(f"Enclosing x: {x}")  # Prints: enclosing

outer()
print(f"Global x: {x}")        # Prints: global

# Common mistake: UnboundLocalError
total = 0

def bad_function():
    # The assignment below makes total local to the whole function body,
    # so the read on the next line fails before total has a value.
    print(total)  # UnboundLocalError
    total = total + 1

# One fix: the global keyword (works, but avoid it where possible)
def better_function():
    global total
    total = total + 1
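
Because `global` couples a function to module-level state, a common alternative is to pass the value in and return the result, or to keep the state in an enclosing function and rebind it with `nonlocal`. A minimal sketch of both; the names `increment` and `make_counter` are illustrative and not part of the lesson's code:

```python
def increment(value, step=1):
    """Return a new value instead of mutating a global."""
    return value + step

x = 10
x = increment(x)       # the caller decides what to rebind
print(x)  # 11

def make_counter():
    """Keep state in an enclosing scope rather than in a global."""
    count = 0
    def counter():
        nonlocal count  # rebind the enclosing variable, not a new local
        count += 1
        return count
    return counter

next_id = make_counter()
first, second = next_id(), next_id()
print(first, second)  # 1 2
```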

Closures: Capturing Enclosing Scope

Definition: Closure

A function object that remembers values from its enclosing lexical scope even when the function is executed outside that scope. Conceptually, a closure pairs a function with an environment that holds the bindings of its free variables from the outer scope.

def make_multiplier(factor):
    """Create a multiplier function using closure.

    Mathematical: make_multiplier(factor) = λx. factor·x
    Where 'factor' is captured from enclosing scope.
    """
    def multiplier(x):
        return x * factor
    return multiplier

double = make_multiplier(2)
triple = make_multiplier(3)

print(double(5))   # 10
print(triple(5))   # 15

# The closure captures 'factor' in __closure__
print(double.__closure__[0].cell_contents)  # 2
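
When a closure captures several variables, each one gets its own cell. The sketch below pairs the free-variable names from `__code__.co_freevars` with the cells in `__closure__`; `make_linear` is a hypothetical example function, not part of the lesson's code.

```python
def make_linear(slope, intercept):
    """Return f(x) = slope * x + intercept."""
    def linear(x):
        return slope * x + intercept
    return linear

line = make_linear(2, 1)

# Names of the free variables the inner function refers to
free_names = line.__code__.co_freevars
# One cell per captured variable, in the same order as co_freevars
captured = {name: cell.cell_contents
            for name, cell in zip(free_names, line.__closure__)}

print(line(5))                   # 11
print(sorted(captured.items()))  # [('intercept', 1), ('slope', 2)]
```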

Closure Complexity

$O(1) \text{ space per captured variable}, \quad O(1) \text{ time for variable access}$

Here,

  • Space: each variable captured from the enclosing scope is stored in one cell object attached to the closure
  • Time: reading a captured variable is a constant-time cell lookup

Decorators: Function Wrapping

Decorators are higher-order functions that modify the behavior of other functions without changing their source code. They follow the principle: wrap a function → add behavior → return the wrapped function.

Definition: Decorator

A higher-order function that takes a function f and returns a modified function g = decorator(f). Decorators implement the decorator pattern, allowing cross-cutting concerns (logging, timing, caching) to be applied to functions without modifying their implementation.
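
To make the relationship g = decorator(f) concrete, the sketch below uses a hypothetical `shout` decorator (not part of the lesson's code) and shows that the `@` syntax and an explicit call to the decorator produce the same behavior.

```python
import functools

def shout(func):
    """Hypothetical decorator: upper-case the wrapped function's result."""
    @functools.wraps(func)  # copy __name__ and __doc__ onto the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    """Return a greeting."""
    return f"hello, {name}"

def greet_plain(name):
    """Return a greeting."""
    return f"hello, {name}"

greet_manual = shout(greet_plain)   # same effect as the @shout line

print(greet("ada"))          # HELLO, ADA
print(greet_manual("ada"))   # HELLO, ADA
print(greet.__name__)        # greet
```

Without `functools.wraps`, `greet.__name__` would report `wrapper` and the docstring would be lost.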

Without Decorator:              With Decorator:
┌─────────────────┐             ┌──────────────────────┐
│  def func():    │             │  @timer              │
│      ...        │             │  def func():         │
│      return     │             │      ...             │
└─────────────────┘             │      return          │
                                └──────────────────────┘
                                Which is equivalent to:
                                ┌──────────────────────┐
                                │  func = timer(func)  │
                                │  func()  # wrapped   │
                                └──────────────────────┘

import time
import functools

# Decorator 1: Timer
def timer(func):
    """Measure function execution time."""
    @functools.wraps(func)  # Preserves original function metadata
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"[{func.__name__}] {elapsed:.4f}s")
        return result
    return wrapper

# Decorator 2: Cache (memoization)
def cache(func):
    """Cache function results for previously seen inputs.

    Mathematical: memoize f by storing {x: f(x)}
    The cache is unbounded: entries are never evicted. Arguments must
    be hashable, and keyword arguments are not supported.
    """
    memo = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in memo:
            memo[args] = func(*args)
        return memo[args]
    wrapper.cache = memo
    return wrapper

# Decorator 3: Retry
def retry(max_attempts=3, delay=1):
    """Retry function on failure with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_attempts - 1:
                        raise
                    print(f"Attempt {attempt+1} failed: {e}")
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator

# Using decorators
@timer
@cache
def fibonacci(n):
    """Compute nth Fibonacci number."""
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# This runs in O(n) instead of O(2^n) thanks to caching.
# Recursive calls go through the decorated name, so @timer prints
# one timing line per call, not just one for the outer call.
print(fibonacci(30))  # 832040 (instant with cache)

$T_{\text{memoized}} = O(n), \quad T_{\text{recursive}} = O(2^n)$

The caching decorator reduces Fibonacci from exponential O(2^n) to linear O(n) time by storing previously computed values.
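
One way to see the difference is to count how often the function body actually runs. The sketch below does this for n = 20. It uses the standard library's `functools.lru_cache` in place of the hand-written `cache` decorator above, and the names `fib_plain`, `fib_memoized`, and `call_counts` are illustrative.

```python
import functools

call_counts = {'plain': 0, 'memoized': 0}

def fib_plain(n):
    call_counts['plain'] += 1
    if n < 2:
        return n
    return fib_plain(n - 1) + fib_plain(n - 2)

@functools.lru_cache(maxsize=None)
def fib_memoized(n):
    # The body only runs on a cache miss, so this counts distinct inputs
    call_counts['memoized'] += 1
    if n < 2:
        return n
    return fib_memoized(n - 1) + fib_memoized(n - 2)

print(fib_plain(20), fib_memoized(20))  # 6765 6765
print(call_counts)  # {'plain': 21891, 'memoized': 21}
```

The memoized version runs its body once for each of the 21 inputs from 0 to 20, while the plain version repeats the same subproblems thousands of times.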


Lambda Functions

Lambdas are anonymous functions defined inline. Their body is restricted to a single expression, which makes them a good fit for short, throwaway operations.

Syntax and Use Cases

Definition: Lambda Abstraction

An anonymous function expression λx₁, x₂, ..., xₙ. body. In lambda calculus, this represents function abstraction. Python's lambda is a restricted version: its body must be a single expression and cannot contain statements.

# Lambda syntax: lambda arguments: expression
# Mathematical: λx. f(x)  (lambda calculus notation)

# Basic examples
square = lambda x: x ** 2
add = lambda a, b: a + b
absolute = lambda x: x if x >= 0 else -x

print(square(5))      # 25
print(add(3, 4))      # 7
print(absolute(-7))   # 7

# Lambda with map: apply function to each element
# map: (A → B) × List[A] → List[B]
numbers = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, numbers))
print(squared)  # [1, 4, 9, 16, 25]

# Lambda with filter: select elements satisfying predicate
# filter: (A → Bool) × List[A] → List[A]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens)  # [2, 4]

# Lambda with reduce: accumulate values
# reduce: (B × A → B) × List[A] → B
from functools import reduce
product = reduce(lambda a, b: a * b, numbers)
print(product)  # 120 (1*2*3*4*5)

# Lambda with sorted: custom sort keys
students = [
    {'name': 'Alice', 'gpa': 3.8},
    {'name': 'Bob', 'gpa': 3.5},
    {'name': 'Charlie', 'gpa': 3.9}
]
sorted_students = sorted(students, key=lambda s: s['gpa'], reverse=True)
for s in sorted_students:
    print(f"{s['name']}: {s['gpa']}")
# Charlie: 3.9
# Alice: 3.8
# Bob: 3.5

Lambda with pandas apply

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'income': [50000, 60000, 75000, 55000]
})

# Apply lambda to create new column
df['income_category'] = df['income'].apply(
    lambda x: 'High' if x > 65000 else ('Medium' if x > 55000 else 'Low')
)

# Apply lambda to each row
df['description'] = df.apply(
    lambda row: f"{row['name']}, age {row['age']}, earns ${row['income']:,}",
    axis=1
)

print(df)
#       name  age  income income_category                        description
# 0    Alice   25   50000             Low       Alice, age 25, earns $50,000
# 1      Bob   30   60000          Medium          Bob, age 30, earns $60,000
# 2  Charlie   35   75000            High    Charlie, age 35, earns $75,000
# 3    Diana   28   55000             Low        Diana, age 28, earns $55,000

When to Use Lambda vs. Def

Criterion                      lambda   def
Named function needed          ✗        ✓
Multiple expressions           ✗        ✓
Docstring needed               ✗        ✓
Type hints needed              ✗        ✓
Short throwaway function       ✓        ✗
Inline with map/filter/sort    ✓        ✗
Recursion needed               ✗        ✓
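
Several rows of this comparison can be checked by inspecting the function objects. In the sketch below, `half_lambda` and `half` are illustrative names for the same operation written both ways.

```python
half_lambda = lambda x: x / 2

def half(x: float) -> float:
    """Return half of x."""
    return x / 2

print(half_lambda.__name__)   # <lambda>
print(half.__name__)          # half
print(half_lambda.__doc__)    # None
print(half.__doc__)           # Return half of x.
print(half.__annotations__)   # {'x': <class 'float'>, 'return': <class 'float'>}
```

The generic `<lambda>` name also shows up in tracebacks, which is one practical reason to prefer `def` for anything that may need debugging.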

List Comprehensions

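List comprehensions provide a concise, declarative syntax for transforming and filtering sequences. In CPython they are often faster than the equivalent for loop, because the loop has to look up and call list.append on every iteration while the comprehension appends with a dedicated bytecode instruction.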

Mathematical Foundation

A list comprehension is equivalent to set-builder notation:

List Comprehension

$L = [f(x) \mid x \in S \wedge p(x)]$

Here,

  • f(x) = The expression (transformation)
  • S = The iterable (source)
  • p(x) = The predicate (filter condition)

# Basic syntax: [expression for item in iterable]
# Equivalent to: list(map(expression, iterable))

squares = [x**2 for x in range(10)]
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# With condition (filter): [expr for item in iterable if condition]
# Equivalent to: list(map(expression, filter(predicate, iterable)))
even_squares = [x**2 for x in range(10) if x % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]

# With if-else (must be in expression, not after 'for')
labels = ['even' if x % 2 == 0 else 'odd' for x in range(5)]
print(labels)  # ['even', 'odd', 'even', 'odd', 'even']
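
The correspondence with map and filter can be checked directly. In a comprehension the condition is tested on the source items before the expression is applied, so the matching functional form filters first and maps second. A small sketch with the illustrative helpers `is_even` and `add_one`:

```python
def is_even(x):
    return x % 2 == 0

def add_one(x):
    return x + 1

source = range(10)

by_comprehension = [add_one(x) for x in source if is_even(x)]
# Filter first, then map: the predicate sees the original items
by_map_filter = list(map(add_one, filter(is_even, source)))
# Swapping the order tests the transformed values instead
filter_after_map = list(filter(is_even, map(add_one, source)))

print(by_comprehension)   # [1, 3, 5, 7, 9]
print(by_map_filter)      # [1, 3, 5, 7, 9]
print(filter_after_map)   # [2, 4, 6, 8, 10]
```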

Nested Comprehensions

# Flattening a matrix
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [num for row in matrix for num in row]
print(flat)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Generating all pairs
names = ['Alice', 'Bob']
colors = ['red', 'blue']
pairs = [(name, color) for name in names for color in colors]
print(pairs)
# [('Alice', 'red'), ('Alice', 'blue'), ('Bob', 'red'), ('Bob', 'blue')]

# Matrix transpose using comprehension
original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
transposed = [[row[i] for row in original] for i in range(3)]
print(transposed)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# Or using zip (more Pythonic)
transposed_zipped = [list(row) for row in zip(*original)]
print(transposed_zipped)  # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

Practical Data Processing Example

# Sample data
raw_data = [
    {'name': 'Alice', 'scores': '85,92,78', 'grade': 'A'},
    {'name': 'Bob', 'scores': '90,88,95', 'grade': 'A+'},
    {'name': 'Charlie', 'scores': '72,68,75', 'grade': 'B'},
    {'name': 'Diana', 'scores': '95,98,100', 'grade': 'A+'},
    {'name': 'Eve', 'scores': '60,72,65', 'grade': 'C'}
]

# Comprehension to parse scores
parsed_data = [
    {
        'name': r['name'],
        'scores': [int(s) for s in r['scores'].split(',')],
        'average': sum(int(s) for s in r['scores'].split(',')) / 3,
        'grade': r['grade']
    }
    for r in raw_data
]

# Filter high performers
high_performers = [
    d for d in parsed_data
    if d['average'] >= 85
]

print("High performers:")
for hp in high_performers:
    print(f"  {hp['name']}: avg={hp['average']:.1f}")

# Output:
# High performers:
#   Alice: avg=85.0
#   Bob: avg=91.0
#   Diana: avg=97.7

Dictionary Comprehensions

# Basic dict comprehension: {key: value for item in iterable}
names = ['Alice', 'Bob', 'Charlie']
name_lengths = {name: len(name) for name in names}
print(name_lengths)  # {'Alice': 5, 'Bob': 3, 'Charlie': 7}

# Invert a dictionary
original = {'a': 1, 'b': 2, 'c': 3}
inverted = {v: k for k, v in original.items()}
print(inverted)  # {1: 'a', 2: 'b', 3: 'c'}

# Filter dictionary
scores = {'Alice': 92, 'Bob': 78, 'Charlie': 85, 'Diana': 95}
excellent = {k: v for k, v in scores.items() if v >= 85}
print(excellent)  # {'Alice': 92, 'Charlie': 85, 'Diana': 95}

# Transform values
adjusted = {k: min(100, v * 1.1) for k, v in scores.items()}
print(adjusted)

# Real-world: feature engineering in ML
import pandas as pd
df = pd.DataFrame({
    'feature_1': [1, 2, 3],
    'feature_2': [4, 5, 6],
    'target': [0, 1, 0]
})

# Create interaction features
interaction_features = {
    f'{c1}_x_{c2}': df[c1] * df[c2]
    for c1 in ['feature_1', 'feature_2']
    for c2 in ['feature_1', 'feature_2']
    if c1 != c2
}
print(interaction_features.keys())
# dict_keys(['feature_1_x_feature_2', 'feature_2_x_feature_1'])

Set Comprehensions

# Set comprehension: {expression for item in iterable}
text = "hello world"
unique_chars = {c.upper() for c in text if c.isalpha()}
print(unique_chars)  # {'H', 'E', 'L', 'O', 'W', 'R', 'D'} (set order may vary)

# Find common elements
list_a = [1, 2, 3, 4, 5, 6]
list_b = [4, 5, 6, 7, 8, 9]
common = {x for x in list_a if x in list_b}
print(common)  # {4, 5, 6}

# Extract unique data types from mixed list
mixed = [1, 'hello', 3.14, True, [1, 2], {'a': 1}]
types = {type(x).__name__ for x in mixed}
print(types)  # {'int', 'str', 'float', 'bool', 'list', 'dict'} (set order may vary)

Generator Expressions

Generators produce values lazily: they compute one value at a time on demand, rather than building the entire list in memory. This is critical when working with large datasets.

Memory Comparison

Definition: Lazy Evaluation

A computation strategy where expressions are not evaluated until their values are needed. Generators implement lazy evaluation, computing values on-the-fly rather than precomputing all results. This enables processing of infinite sequences and large datasets with O(1) memory.

List Comprehension:            Generator Expression:
┌─────────────────────┐        ┌──────────────────────┐
│ [1, 2, 3, ..., N]   │        │ generator object     │
│                     │        │                      │
│  ALL in memory      │        │  ONE at a time       │
│  O(N) memory        │        │  O(1) memory         │
│  Random access      │        │  Single pass only    │
└─────────────────────┘        └──────────────────────┘

import sys

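# List comprehension: allocates all elements immediately
list_comp = [x**2 for x in range(1000000)]
print(f"List size: {sys.getsizeof(list_comp) / 1024 / 1024:.2f} MB")
# Roughly 8 MB on 64-bit CPython. getsizeof counts only the list's
# pointer array, not the int objects it refers to.

# Generator expression: lazy evaluation
gen_exp = (x**2 for x in range(1000000))
print(f"Generator size: {sys.getsizeof(gen_exp)} bytes")
# A few hundred bytes at most, independent of the range length
# (the exact figure varies by Python version).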

# Generator function with yield
def fibonacci_generator(n):
    """Generate first n Fibonacci numbers lazily."""
    a, b = 0, 1
    count = 0
    while count < n:
        yield a
        a, b = b, a + b
        count += 1

# The generator holds O(1) state regardless of n; list() materializes it here only for display
fib = fibonacci_generator(10)
print(list(fib))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# Infinite generator
def count_up(start=0):
    """Generate infinite sequence from start."""
    n = start
    while True:
        yield n
        n += 1

# Take first 5 from infinite generator
counter = count_up(100)
first_five = [next(counter) for _ in range(5)]
print(first_five)  # [100, 101, 102, 103, 104]

# Generator pipeline for large data processing
def read_large_file(file_path):
    """Lazy read of large file, line by line."""
    with open(file_path, 'r') as f:
        for line in f:
            yield line.strip()

def filter_lines(lines, keyword):
    """Filter generator by keyword."""
    for line in lines:
        if keyword in line:
            yield line

# Pipeline: read → filter → count (O(1) memory!)
# lines = read_large_file('huge_log.txt')
# errors = filter_lines(lines, 'ERROR')
# count = sum(1 for _ in errors)
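
Laziness can be observed directly by recording when each value is computed. The sketch below uses `itertools.islice` to pull a few items from a generator; `traced_squares` and `log` are illustrative names, not part of the lesson's code.

```python
from itertools import islice

log = []

def traced_squares(limit):
    """Yield squares, recording each time a value is actually computed."""
    for n in range(limit):
        log.append(n)
        yield n ** 2

squares = traced_squares(1_000_000)
print(log)                       # [] (nothing computed yet)

first_three = list(islice(squares, 3))
print(first_three)               # [0, 1, 4]
print(log)                       # [0, 1, 2] (only what was requested)

# A generator is single-pass: continuing picks up where it stopped
next_two = list(islice(squares, 2))
print(next_two)                  # [9, 16]
```

Creating the generator does no work, and only the five requested values are ever computed out of the million available.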

Generator Memory

$M_{\text{generator}} = O(1), \quad M_{\text{list}} = O(n)$

Here,

  • M_generator = Memory for generator (constant)
  • M_list = Memory for list (proportional to n)
  • n = Number of elements

When to Use Each: Decision Tree

Need a function?
├─ Yes, reusable → Use `def`
├─ Yes, one-liner → Lambda
├─ Transform sequence → List comprehension
│   ├─ Large dataset → Generator expression
│   ├─ Need dict → Dict comprehension
│   └─ Need unique items → Set comprehension
└─ No, inline logic → Lambda with map/filter/sort

Performance Priority?
├─ Memory critical → Generator expression
├─ Speed critical → List comprehension (optimized in CPython)
└─ Readability critical → List comprehension with clear logic

Performance Comparison

import time
import sys

data = list(range(1_000_000))

def benchmark(label, func, data):
    """Benchmark a function and return timing."""
    start = time.perf_counter()
    result = func(data)
    elapsed = time.perf_counter() - start
    return label, elapsed, sys.getsizeof(result)

# Method 1: for loop
def with_loop(data):
    result = []
    for x in data:
        if x % 2 == 0:
            result.append(x ** 2)
    return result

# Method 2: map + filter + lambda
def with_map(data):
    return list(map(lambda x: x**2, filter(lambda x: x % 2 == 0, data)))

# Method 3: list comprehension
def with_comprehension(data):
    return [x**2 for x in data if x % 2 == 0]

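# Method 4: generator expression
# Note: this only creates the generator. No element is computed until it is
# consumed, so the benchmark measures creation, not the full computation.
def with_generator(data):
    return (x**2 for x in data if x % 2 == 0)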

results = [
    benchmark("for loop", with_loop, data),
    benchmark("map+filter", with_map, data),
    benchmark("comprehension", with_comprehension, data),
    benchmark("generator", with_generator, data),
]

print(f"{'Method':<15} {'Time (s)':<12} {'Memory':<12}")
print("-" * 39)
for label, elapsed, mem in results:
    print(f"{label:<15} {elapsed:<12.4f} {mem:<12}")

Typical Results:

Method          Time (s)     Memory (bytes)
-------------------------------------------
for loop        0.0892       4304888
map+filter      0.1203       4304888
comprehension   0.0634       4304888
generator       0.0001       208

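In this run the list comprehension is about 30% faster than the for loop (0.0634 s vs. 0.0892 s), mainly because it avoids looking up and calling .append() on every iteration. The generator row is not comparable on time: nothing consumes the generator, so the figure covers only the creation of the generator object. Generators do win decisively on memory for large datasets.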
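
For a comparison in which the generator does real work, the sketch below drains it with `sum()` and uses the standard `timeit` module. This is a different measurement approach from the `benchmark` helper above. The function names and the input size are illustrative, no timings are quoted because they depend on the machine and Python version, and the generator variant returns a total rather than a list, so its timing is only roughly comparable.

```python
import timeit

data = list(range(100_000))

def with_loop():
    result = []
    for x in data:
        if x % 2 == 0:
            result.append(x ** 2)
    return result

def with_comprehension():
    return [x ** 2 for x in data if x % 2 == 0]

def with_generator_consumed():
    # sum() drains the generator, so the work is actually performed
    return sum(x ** 2 for x in data if x % 2 == 0)

# All three approaches agree on the values produced
assert with_loop() == with_comprehension()
assert sum(with_comprehension()) == with_generator_consumed()

for func in (with_loop, with_comprehension, with_generator_consumed):
    # Best of 3 repeats of 5 runs each reduces noise from other processes
    best = min(timeit.repeat(func, number=5, repeat=3))
    print(f"{func.__name__:<26} {best / 5:.4f} s per run")
```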

Complete Data Processing Pipeline

import pandas as pd
import numpy as np
from functools import reduce

# === STEP 1: Generate Raw Data ===
np.random.seed(42)
raw_records = [
    {'id': i,
     'name': f'User_{i}',
     'age': np.random.randint(18, 70),
     'income': np.random.normal(50000, 15000),
     'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR']),
     'score': np.random.uniform(0, 100)}
    for i in range(100)
]

# === STEP 2: Create DataFrame ===
df = pd.DataFrame(raw_records)

# === STEP 3: Clean with reduce ===
# Remove outliers using IQR method
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    return df[(df[column] >= Q1 - 1.5*IQR) & (df[column] <= Q3 + 1.5*IQR)]

# Apply the filter once per column; .copy() gives an independent frame so the
# column assignments below do not trigger SettingWithCopyWarning
df_clean = reduce(remove_outliers_iqr, ['age', 'income'], df).copy()

# === STEP 4: Feature Engineering with Comprehensions ===
# Create bins for age
age_bins = pd.cut(df_clean['age'], bins=[0, 25, 35, 50, 100],
                  labels=['Young', 'Mid', 'Senior', 'Expert'])
df_clean['age_group'] = age_bins

# Create interaction features
feature_cols = ['age', 'income', 'score']
interaction_names = [f'{c1}_x_{c2}'
                     for c1 in feature_cols
                     for c2 in feature_cols
                     if c1 < c2]

for name in interaction_names:
    c1, c2 = name.split('_x_')
    df_clean[name] = df_clean[c1] * df_clean[c2]

# === STEP 5: Aggregation with Dict Comprehension ===
dept_stats = {
    dept: {
        'mean_income': group['income'].mean(),
        'count': len(group),
        'top_score': group['score'].max()
    }
    for dept, group in df_clean.groupby('department')
}

print("Department Statistics:")
for dept, stats in dept_stats.items():
    print(f"  {dept}: avg_income=${stats['mean_income']:,.0f}, "
          f"count={stats['count']}, top_score={stats['top_score']:.1f}")

# === STEP 6: Filter and Select with Comprehension ===
high_performers = [
    {'name': row['name'], 'score': row['score']}
    for _, row in df_clean.iterrows()
    if row['score'] > 80 and row['income'] > 60000
]

print(f"\nHigh performers (score>80, income>60k): {len(high_performers)}")

# === STEP 7: Summary Statistics ===
summary = {
    'total_records': len(df_clean),
    'features_created': len(interaction_names),
    'departments': list(dept_stats.keys()),
    'age_groups': list(df_clean['age_group'].unique())
}

print(f"\nPipeline Summary:")
for key, value in summary.items():
    print(f"  {key}: {value}")

Key Takeaways

Summary: Functions, Lambda & Comprehensions

  1. Functions are the foundation of modular code. Use *args and **kwargs for flexible interfaces. Return named tuples for multiple outputs.
  2. Scope follows the LEGB rule. Avoid global; prefer function parameters and return values for data flow.
  3. Decorators wrap functions to add cross-cutting concerns (timing, caching, retries) without modifying the original code.
  4. Lambda functions are for short, throwaway operations, especially with map, filter, and sorted. Use def for anything reusable.
  5. List comprehensions are Pythonic, readable, and fast. Use them as the default for sequence transformations.
  6. Dict/Set comprehensions extend the pattern to dictionaries and sets, which makes them ideal for building lookup tables and collecting unique values.
  7. Generator expressions are essential for memory-efficient processing of large datasets. Use them when you don't need all results in memory at once.
  8. Performance summary: for memory, generators use the least, while list comprehensions, map/filter wrapped in list(), and for loops all build the full list; for speed in the benchmark above, comprehensions > for loops > map/filter with lambdas.

Practice Exercises

Exercise 1: Function Factory

# Create a function that returns functions
# Example: create_operation('add', 5) returns a function that adds 5 to any input
# create_operation('multiply', 3) returns a function that multiplies by 3

def create_operation(operation, value):
    # Your code here
    pass

add_five = create_operation('add', 5)
multiply_by_three = create_operation('multiply', 3)

print(add_five(10))        # Should print 15
print(multiply_by_three(10)) # Should print 30

Exercise 2: Comprehension Challenge

# Given a list of dictionaries representing products,
# create:
# 1. A dict comprehension mapping product names to prices
# 2. A list comprehension of products cheaper than $10
# 3. A set comprehension of all categories

products = [
    {'name': 'Laptop', 'price': 999, 'category': 'Electronics'},
    {'name': 'Mouse', 'price': 25, 'category': 'Electronics'},
    {'name': 'Book', 'price': 15, 'category': 'Education'},
    {'name': 'Pen', 'price': 2, 'category': 'Education'},
    {'name': 'Chair', 'price': 150, 'category': 'Furniture'},
]

# Your comprehensions here
price_map = {}
cheap_products = []
categories = set()

Exercise 3: Decorator Challenge

# Create a decorator that logs function calls to a list
# and provides statistics about the function

call_log = []

def log_calls(func):
    # Your decorator here
    pass

@log_calls
def add(a, b):
    return a + b

add(1, 2)
add(3, 4)
add(5, 6)

# Should be able to call:
# log_calls.stats() → {'total_calls': 3, 'args': [(1,2), (3,4), (5,6)]}

Exercise 4: Generator Pipeline

# Create a generator pipeline that:
# 1. Reads numbers from 1 to 1000
# 2. Filters out numbers not divisible by 3
# 3. Transforms by squaring
# 4. Takes only the first 10 results

def number_source():
    pass  # Your code

def divisible_by_3(numbers):
    pass  # Your code

def square(numbers):
    pass  # Your code

# Pipeline
pipeline = square(divisible_by_3(number_source()))
results = [next(pipeline) for _ in range(10)]
print(results)  # [9, 36, 81, 144, 225, 324, 441, 576, 729, 900]
