Watermarks and Triggers in Cloud Dataflow for the PCA Exam

GCP Study Hub
Ben Makansi
December 10, 2025

Stream processing is one of the harder topics on the Professional Cloud Architect exam because the questions assume you understand how Dataflow keeps results accurate when data shows up out of order. The two concepts that make that possible are watermarks and triggers. I want to walk through what each one does, how they interact, and the kinds of exam scenarios where you need to reach for them.

Why Event Time Matters More Than Processing Time

Before watermarks make sense, you need to internalize the difference between event time and processing time. Event time is when the event actually happened in the real world. Processing time is when the data physically arrives at your pipeline. In a perfect network these would be identical, but they almost never are. Mobile clients lose connectivity, IoT devices buffer locally, upstream systems retry. Data routinely arrives minutes or hours after the event it describes.

Dataflow processes streaming data based on event time, not processing time. That is the foundation. If your pipeline aggregates by hour, an event that occurred at 10:00 AM belongs in the 10:00 AM window even if it does not arrive at your worker until 10:45 AM. Without this guarantee, your hourly counts would shift around based on network latency, and any downstream analytics built on those counts would be wrong.
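
To make that concrete, here is a minimal sketch using the Apache Beam Python SDK, which is what Dataflow pipelines are written in. The element shape and the timestamps are invented for illustration; the key move is attaching each record's own event timestamp before windowing.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Hypothetical events; timestamps are Unix seconds.
events = [
    {"user": "a", "event_ts": 1700000100},
    {"user": "b", "event_ts": 1700000200},  # same hour as the first event
    {"user": "a", "event_ts": 1700004000},  # the following hour
]

with beam.Pipeline() as p:
    (
        p
        | beam.Create(events)
        # Attach each event's own timestamp so windowing happens in
        # event time, not the moment the record reached the pipeline.
        | beam.Map(lambda e: TimestampedValue(e, e["event_ts"]))
        # Fixed one-hour windows in event time.
        | beam.WindowInto(FixedWindows(60 * 60))
        | beam.Map(lambda e: (e["user"], 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```

Run locally, this counts each user's events per event-time hour: the third record lands in a different window than the first two, no matter when any of them arrived.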

What Watermarks Actually Do

A watermark is a timestamp that tracks how far the pipeline has progressed through event time. You can think of it as a moving completeness marker. When the watermark sits at 7:45 PM, Dataflow is asserting that it has processed everything it expects to see for events up through 7:45 PM. Windows that end before the watermark are considered complete, and any record that shows up afterward with an older event timestamp is treated as late data.

The watermark does not advance just because real time passes. It advances based on the event timestamps Dataflow observes flowing through the pipeline. If a step fails or stalls, the watermark stays put. If a worker is processing slowly, the watermark stays put. This is intentional. The watermark is a safeguard against emitting incomplete results.

Here is a concrete example. At 10:10 AM your pipeline receives a record with an event timestamp of 7:45 PM yesterday. That is the most recent event seen so far, so the watermark sits at 7:45 PM yesterday. At 10:20 AM another record arrives with an event timestamp of 6:30 PM yesterday. This is older than the watermark, so the watermark does not move. At 10:30 AM a record with timestamp 9:30 AM yesterday arrives. Still older than the watermark, still no movement. Then at 10:40 AM a record arrives with timestamp 5:10 AM today. Now the watermark advances to 5:10 AM today, because Dataflow has finally seen evidence that the stream has progressed past the previous high-water mark.
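
If it helps to see the mechanics, here is a toy model of that walkthrough in Python. This is not how Dataflow actually computes watermarks internally (the runner derives them from its sources); it just replays the simplified rule the example uses: track the newest event timestamp observed and never move backward.

```python
from datetime import datetime

# A toy model of the watermark behavior described above: it only moves
# forward, and only when a newer event timestamp is observed.
watermark = None

def observe(event_ts: datetime) -> datetime:
    """Advance the watermark if this event timestamp is the newest seen."""
    global watermark
    if watermark is None or event_ts > watermark:
        watermark = event_ts
    return watermark

observe(datetime(2025, 12, 9, 19, 45))  # 7:45 PM yesterday -> watermark at 7:45 PM
observe(datetime(2025, 12, 9, 18, 30))  # 6:30 PM yesterday -> no movement
observe(datetime(2025, 12, 9, 9, 30))   # 9:30 AM yesterday -> no movement
observe(datetime(2025, 12, 10, 5, 10))  # 5:10 AM today     -> watermark at 5:10 AM
```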

Notice what this buys you. While the watermark sat at 7:45 PM yesterday, late-arriving records from earlier in the day could still be slotted into their correct event-time windows. The pipeline did not skip ahead and emit incomplete hourly counts. It waited until the data itself signaled progress.

Triggers Decide When Results Get Emitted

Watermarks track progress, but they do not actually decide when your aggregated results leave the pipeline. That is the job of triggers. A trigger is a rule that says when a window should fire and emit its current aggregation downstream.

In a bounded batch job this is trivial because there is a clear end of input. In an unbounded streaming pipeline there is no end. Data keeps flowing, and you need explicit rules about when each window should produce output. Triggers give you that control.

Three trigger types cover most of what you need to know for the Professional Cloud Architect exam. Event-time triggers fire when the watermark reaches a specified point. The most common pattern is firing when the watermark crosses the end of a window, which means the system believes all expected data for that window has arrived. Processing-time triggers fire based on real-world clock time, independent of event timestamps. You might use one to emit partial results every thirty seconds for a dashboard. Data-driven triggers fire based on properties of the data itself, like emitting after a certain number of records accumulate or after a specific value is observed.
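
Sketched in the Beam Python SDK, the three flavors look roughly like this. The window size, delay, and count are illustrative, and note that Beam requires you to pick an accumulation mode whenever you set a non-default trigger.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterCount,
    AfterProcessingTime,
    AfterWatermark,
    Repeatedly,
)

# Event-time trigger: fire once when the watermark crosses the end of
# each one-hour window, i.e. when the system believes the data is complete.
event_time = beam.WindowInto(
    FixedWindows(60 * 60),
    trigger=AfterWatermark(),
    accumulation_mode=AccumulationMode.DISCARDING,
)

# Processing-time trigger: emit a partial result roughly every 30 seconds
# of wall-clock time, useful when a dashboard values freshness over precision.
processing_time = beam.WindowInto(
    FixedWindows(60 * 60),
    trigger=Repeatedly(AfterProcessingTime(30)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
)

# Data-driven trigger: fire each time another 100 records accumulate.
data_driven = beam.WindowInto(
    FixedWindows(60 * 60),
    trigger=Repeatedly(AfterCount(100)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
)
```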

Handling Late-Arriving Data

This is where watermarks and triggers work together. The watermark eventually advances past the end of a window, which would normally close that window for good. But streaming data is messy, and records can still trickle in after the watermark has moved on. Triggers let you re-fire a window when late data arrives, updating the previously emitted result with the new information.

You configure this through allowed lateness. If you set allowed lateness to ten minutes, Dataflow keeps the window state around for ten minutes of event time after the watermark passes the window's end. Any late records that arrive during that grace period trigger an update. Past that, the window is garbage collected and any subsequent late records are dropped.
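
In the Beam Python SDK, that configuration looks roughly like the sketch below. AfterWatermark(late=...) controls what happens during the grace period, and accumulating mode means each late firing re-emits the full updated aggregate so downstream systems can overwrite the earlier result.

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

# Fire once when the watermark passes the end of the hour, then re-fire
# for every late record that lands inside the ten-minute grace period.
# Once allowed_lateness expires, the window state is garbage collected
# and any further late records are dropped.
windowing = beam.WindowInto(
    FixedWindows(60 * 60),
    trigger=AfterWatermark(late=AfterCount(1)),
    allowed_lateness=10 * 60,  # seconds past the window's end, in event time
    accumulation_mode=AccumulationMode.ACCUMULATING,
)
```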

The exam likes to test whether you can match the right trigger configuration to a business requirement. If the question says results need to be exact and the workload is hourly billing, you want event-time triggers with enough allowed lateness to absorb realistic delays. If the question describes a real-time dashboard where freshness matters more than precision, processing-time triggers that fire periodically are the right choice. If the question describes a scenario where you need partial updates as data accumulates, data-driven triggers fit.

What to Remember Going Into the Exam

Watermarks are timestamps that track how far event time has progressed in your pipeline. They advance based on observed event timestamps, not wall-clock time, and they pause when steps fail or stall. Triggers are rules that decide when windows emit results, with event-time, processing-time, and data-driven flavors covering the main use cases. Late data can still update windows if you configure allowed lateness on top of your trigger.

If a Professional Cloud Architect question describes a streaming pipeline that needs to handle out-of-order data without losing accuracy, the answer almost always involves event-time processing with watermarks and an event-time trigger plus allowed lateness. Recognize that pattern and the rest of the question usually falls into place.

My Professional Cloud Architect course covers watermarks and triggers in Cloud Dataflow alongside the rest of the messaging and pipelines material.
