Windowing Strategies
Watermarks and Triggers
from apache_beam.transforms.window import FixedWindows, Sessions
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterProcessingTime, AfterCount,
    AccumulationMode, Repeatedly, AfterAny
)

# Fixed Windows with early and late triggers
FixedWindows(60)  # 1-minute windows

# Triggers
AfterWatermark(
    early=AfterProcessingTime(10),  # Early results every 10s
    late=AfterCount(1)              # Late data trigger
)

# Composite trigger for complex scenarios
Repeatedly(
    AfterAny(
        AfterWatermark(),         # Fire at watermark
        AfterProcessingTime(30),  # Fire every 30s
        AfterCount(100)           # Fire after 100 elements
    )
)

# Accumulation modes
AccumulationMode.DISCARDING    # Reset state after each trigger
AccumulationMode.ACCUMULATING  # Keep state across triggers
State Management
import apache_beam as beam
from apache_beam.transforms.userstate import (
    BagStateSpec, CombiningValueStateSpec, TimerSpec, on_timer
)
from apache_beam.coders import VarIntCoder


class CountWithState(beam.DoFn):
    """Count elements with state management."""

    # State is scoped per key and window, so the input must be (key, value) pairs.
    COUNT_STATE = CombiningValueStateSpec('count', VarIntCoder(), sum)

    def process(self, element, count_state=beam.DoFn.StateParam(COUNT_STATE)):
        current_count = count_state.read() or 0  # Count seen so far for this key
        count_state.add(1)
        yield (element, current_count + 1)
Best Practice: Use early triggers when you need low-latency, speculative results. Use AccumulationMode.DISCARDING for sliding windows. Combine watermarks with allowed lateness and late triggers to handle late data. On Dataflow, use Streaming Engine for stateful processing. Monitor watermark lag as an indicator of pipeline health.
Common Interview Questions
Q1: What is a watermark in streaming?
Answer: A watermark tracks the progress of event time through the pipeline. It is the system's estimate of the point in event time up to which all data has arrived, so it tells the system when a given window can be treated as complete. Watermarks make windowed processing possible and define which elements count as late.
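To make the idea concrete, here is a toy model in plain Python, not Beam's API. It assumes a heuristic watermark equal to the largest event time seen so far minus a fixed out-of-orderness bound. The class name `HeuristicWatermark` and the 5-second bound are hypothetical; real sources compute watermarks in their own ways.

```python
class HeuristicWatermark:
    """Toy watermark: max event time seen minus a fixed out-of-orderness bound."""

    def __init__(self, max_out_of_orderness):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = None

    def observe(self, event_time):
        # The watermark only moves forward, so track the maximum event time.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    def current(self):
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_out_of_orderness

    def window_is_complete(self, window_end):
        # A window is treated as complete once the watermark reaches its end.
        return self.current() >= window_end


wm = HeuristicWatermark(max_out_of_orderness=5)
for t in [10, 58, 62]:
    wm.observe(t)
done_early = wm.window_is_complete(60)  # watermark is 57, window [0, 60) still open
wm.observe(66)
done_later = wm.window_is_complete(60)  # watermark is 61, window [0, 60) complete
```

Any element with an event time below 60 that arrives after the second check would be late relative to this watermark.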
Q2: What is the difference between fixed and sliding windows?
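Answer: Fixed windows are non-overlapping, so each element belongs to exactly one window and aggregations are produced at regular intervals. Sliding windows overlap, so an element can belong to several windows, which makes them suitable for moving averages. Fixed windows are simpler; sliding windows provide smoother trend analysis.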
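The difference shows up in how a timestamp is assigned to windows. The sketch below uses plain Python with integer timestamps in seconds. The function names are hypothetical and it assumes windows aligned to time zero with no offset.

```python
def fixed_window(timestamp, size):
    """Return the single [start, end) fixed window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, period):
    """Return every [start, end) sliding window containing timestamp."""
    last_start = timestamp - (timestamp % period)
    # Walk backwards one period at a time while the window still covers timestamp.
    starts = range(last_start, timestamp - size, -period)
    return [(s, s + size) for s in starts]


one_window = fixed_window(65, size=60)                  # (60, 120)
two_windows = sliding_windows(65, size=60, period=30)   # [(60, 120), (30, 90)]
```

With a 60-second size and a 30-second period, every element lands in two sliding windows, which is why sliding-window aggregates cost roughly twice as much as the fixed-window equivalent here.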
Q3: How do you handle late data in streaming?
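Answer: 1) Use late triggers (for example AfterCount) to emit updated results, 2) Configure allowed lateness so windows stay open for stragglers, 3) Use session windows for gap-based grouping, 4) Implement dead-letter queues for data that arrives too late, 5) Monitor watermark lag.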
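The interaction between the watermark, allowed lateness, and a dead-letter path can be sketched as a simple decision rule. This is a plain-Python model, not Beam code. The function names, the 60-second window, and the 120-second allowed lateness are assumptions chosen for illustration.

```python
def classify(event_time, watermark, window_size, allowed_lateness):
    """Label an arriving element by comparing the watermark with its window."""
    window_end = event_time - (event_time % window_size) + window_size
    if watermark < window_end:
        return "on_time"   # window has not closed yet
    if watermark < window_end + allowed_lateness:
        return "late"      # window closed, but still within allowed lateness
    return "dropped"       # too late: candidate for a dead-letter queue


def route(arrivals, window_size=60, allowed_lateness=120):
    """Group (event_time, watermark_at_arrival) pairs by their label."""
    routed = {"on_time": [], "late": [], "dropped": []}
    for event_time, watermark in arrivals:
        label = classify(event_time, watermark, window_size, allowed_lateness)
        routed[label].append(event_time)
    return routed


# Each pair is (event time, watermark when the element arrived).
routed = route([(30, 50), (45, 100), (20, 200)])
# {'on_time': [30], 'late': [45], 'dropped': [20]}
```

In a real pipeline, the "late" elements would cause a late trigger firing, and the "dropped" elements would need to be captured before windowing if they must not be lost.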
Q4: What is the purpose of triggers?
Answer: Triggers determine when to emit results from a window. Early triggers provide low-latency results. Late triggers handle straggler data. Watermark triggers fire when the watermark passes the window end.
Q5: What is the difference between DISCARDING and ACCUMULATING modes?
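Answer: DISCARDING resets state after each trigger firing, so each pane contains only the data that arrived since the previous firing. ACCUMULATING keeps state across firings, so each pane contains the cumulative result for the window so far. Use DISCARDING when downstream steps combine the panes themselves, and ACCUMULATING when each new pane should replace the previous one.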
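The two modes can be compared with a small plain-Python simulation of one window whose trigger fires after every N elements. The function `emit_panes` and its parameters are hypothetical and only model the behavior; they are not part of Beam.

```python
def emit_panes(values, fire_every, accumulating):
    """Sum the buffered values each time the count-based trigger fires."""
    panes = []
    buffer = []
    for i, value in enumerate(values, start=1):
        buffer.append(value)
        if i % fire_every == 0:
            panes.append(sum(buffer))
            if not accumulating:
                buffer = []  # DISCARDING: forget what was already emitted
    return panes


values = [1, 2, 3, 4, 5, 6]
discarding = emit_panes(values, fire_every=2, accumulating=False)    # [3, 7, 11]
accumulating = emit_panes(values, fire_every=2, accumulating=True)   # [3, 10, 21]
```

Summing the discarding panes gives 21, the same as the last accumulating pane. Summing the accumulating panes downstream would double-count, which is the main reason to match the mode to what the consumer does with each pane.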