
Streaming pipelines never stop. Data flows in constantly, which raises an immediate question for anyone building on Dataflow: how do you group elements together for aggregation when the stream has no natural end? The answer the Professional Data Engineer exam expects you to know is windowing. Dataflow gives you three window types, and each one solves a different problem. If you can pick the right window for a given scenario, you will get a real chunk of the streaming questions on the exam without thinking twice.
I want to walk through tumbling, hopping, and session windows the way I think about them when I'm building real pipelines, because the exam scenarios almost always map to one of those mental models.
A bounded batch job is easy. You read the file, you compute the average, you write the output. A streaming job has no end, so you cannot compute an average over the whole stream because the stream never finishes. Windows let you slice an unbounded stream into bounded chunks that you can run aggregations against. Once you start thinking about windows as the only way to apply a GROUP BY to an infinite stream, the three types start to make a lot more sense.
Dataflow gives you three window strategies: tumbling (also called fixed), hopping (also called sliding), and session-based.
Tumbling windows are the simplest. You pick a duration, say 30 minutes, and Dataflow divides the stream into back-to-back chunks of that length. The first window covers 12:00 to 12:30, the next covers 12:30 to 1:00, the next covers 1:00 to 1:30, and so on. Every event belongs to exactly one window. There is no overlap.
Three properties define tumbling windows:

- The window size is fixed. You pick one duration and every window has that length.
- Consecutive windows do not overlap. Each window starts exactly where the previous one ends.
- Every event lands in exactly one window.
Use tumbling windows when you want clean, periodic reports. Average order value per hour. Total error count per 5 minutes. Page views per day. Anything where the report represents a discrete time bucket and you do not want the buckets to share data.
On the exam, look for words like every 5 minutes, hourly summary, or non-overlapping reporting period. Those almost always point to fixed windows.
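To make the assignment rule concrete, here is a minimal plain-Python sketch of the arithmetic behind fixed windows. This is not the Beam API, just the math: each timestamp maps to exactly one window aligned to multiples of the window size.

```python
def tumbling_window(ts_seconds: int, size_seconds: int) -> tuple:
    """Return the (start, end) of the single fixed window containing ts_seconds."""
    # Snap the timestamp down to the nearest multiple of the window size.
    start = ts_seconds - (ts_seconds % size_seconds)
    return (start, start + size_seconds)

# An event at 12:10 with 30-minute windows lands in [12:00, 12:30).
assert tumbling_window(12 * 3600 + 10 * 60, 30 * 60) == (12 * 3600, 12 * 3600 + 30 * 60)
```

Because the modulo snaps every timestamp to one and only one window start, no event can straddle two windows, which is exactly the non-overlap property.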
Hopping windows are where things get more interesting. The window itself is still a fixed size, but a new window starts on a fixed interval that is shorter than the window length. That second number is called the hop.
Take a 30-minute window with a 5-minute hop. At 12:30 you get a window covering 12:00 to 12:30. At 12:35 you get a window covering 12:05 to 12:35. At 12:40 you get a window covering 12:10 to 12:40. Each event in the stream belongs to multiple windows because the windows overlap.
The three defining properties:

- The window size is fixed, just like a tumbling window.
- A new window starts on every hop interval, and the hop is shorter than the window size.
- Windows overlap, so a single event belongs to multiple windows.
The use case is a running metric that you want updated frequently but computed over a longer horizon. A 20-minute moving average of stock prices recomputed every minute is the canonical example. You want fresh output every minute, but you want each output to reflect 20 minutes of context. Tumbling windows cannot do this. Hopping windows are built for it.
On the exam, the giveaway phrase is something like compute the last X minutes every Y minutes where X is bigger than Y. That is a hopping window with a window size of X and a hop of Y.
Session windows throw out the idea of a fixed duration entirely. Instead of saying every 30 minutes, you say keep grouping events together until I see a gap of N minutes with no activity. When the gap occurs, the window closes. The next event starts a fresh window.
So if I set a 5-minute gap duration, and events arrive every 4 minutes and 59 seconds for an hour straight, Dataflow treats all of it as a single session. The moment 5 minutes go by with nothing, the session ends. The next event opens a new one.
The properties:

- The window length is not fixed. A session grows for as long as activity keeps arriving.
- A window closes only after a configured gap of inactivity.
- Sessions are computed per key, so each user or device gets its own session boundaries.
The classic example is the way Google Analytics defines a website session, which uses a 30-minute gap. Your visit stays one session until you go 30 minutes without doing anything. Anything that looks like that pattern (a user's gameplay session, a device's burst of telemetry, a customer's support chat) is a session-window problem.
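The gap-merging behavior can be illustrated with a small plain-Python sketch. This is not how Beam implements session windows internally; it just shows the grouping rule: an event joins the current session if it arrives within the gap of the previous event, otherwise it opens a new one.

```python
def sessionize(timestamps: list, gap_seconds: int) -> list:
    """Group sorted event timestamps into (first_event, last_event) sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap_seconds:
            # Within the gap of the last event: extend the current session.
            sessions[-1] = (sessions[-1][0], ts)
        else:
            # Gap exceeded (or first event): start a new session.
            sessions.append((ts, ts))
    return sessions

# Events 299 seconds apart stay in one session under a 5-minute gap;
# the event at 1200 arrives 602 seconds after the last one, so it starts fresh.
assert sessionize([0, 299, 598, 1200], 300) == [(0, 598), (1200, 1200)]
```

This mirrors the 4-minutes-59-seconds example above: as long as each event arrives inside the gap, the session keeps extending, no matter how long it has already run.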
When you read a streaming scenario on the exam, run through these three questions in order:

- Does the scenario describe bursts of activity separated by idle gaps, like user sessions or device telemetry? If so, it is a session window.
- Does it ask for a metric over the last X minutes, refreshed every Y minutes, with X larger than Y? If so, it is a hopping window.
- Does it describe clean, periodic, non-overlapping reports? If so, it is a tumbling window.
The Professional Data Engineer exam loves to test whether you can distinguish hopping from tumbling, because both have a fixed window size. The discriminator is overlap and the hop interval. And session-window questions almost always include words like activity, inactivity, or user behavior in the scenario.
One more thing. Windows work on event time by default in Dataflow, which means late-arriving data still gets routed to the window it belongs to (subject to your allowed lateness settings). That detail matters for exam questions about correctness in streaming, but the windowing strategy itself is independent of how you handle late data.
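A quick sketch makes the event-time point concrete. Assignment depends only on the element's event timestamp, not on when it reaches the pipeline, so a late element still maps to its original window (whether that window will still accept it is governed by allowed lateness, which is separate).

```python
def tumbling_window(ts_seconds: int, size_seconds: int) -> tuple:
    """Assign a timestamp to its fixed window by event time."""
    start = ts_seconds - (ts_seconds % size_seconds)
    return (start, start + size_seconds)

event_time = 12 * 3600 + 4 * 60      # the event happened at 12:04
arrival_time = 12 * 3600 + 35 * 60   # but reached the pipeline at 12:35
# Assignment looks only at event_time, so the element still belongs to
# [12:00, 12:30) even though it arrived after that window's end.
assert tumbling_window(event_time, 30 * 60) == (12 * 3600, 12 * 3600 + 30 * 60)
```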
My Professional Data Engineer course covers Dataflow windowing along with watermarks, triggers, and the full streaming model you need for the exam.