Watermarks and Triggers in Dataflow for the PDE Exam

GCP Study Hub
September 25, 2025

When candidates email me about the Google Cloud Professional Data Engineer exam, Dataflow streaming questions show up more often than almost any other topic. And inside Dataflow, the two concepts that cause the most confusion are watermarks and triggers. They sound abstract, the docs explain them with heavy diagrams, and the exam likes to combine them with late-arriving data scenarios that punish surface-level understanding.

In this post I want to walk through how I think about both concepts in a way that maps cleanly onto what the Professional Data Engineer exam actually tests.

Event Time vs Processing Time

Before watermarks make any sense, you have to be comfortable with the distinction between event time and processing time. Event time is when something actually happened in the real world. A user clicked a button, a sensor recorded a temperature, a payment was authorized. Processing time is when your Dataflow pipeline finally sees that event.

These two clocks almost never line up. A mobile app might buffer events offline and ship them an hour later. A network partition might delay a batch of IoT readings. The whole point of Dataflow's streaming model is that it processes data based on event time, not on when the bytes physically arrived. That is what lets you compute things like requests per minute correctly even when the data shows up out of order.
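
To make the distinction tangible, here is a minimal Apache Beam (Python SDK) sketch of stamping elements with event time so that windowing operates on when things happened rather than when they arrived. The element shape and the `event_ts` field are made up for illustration, and the values are arbitrary epoch seconds:

```python
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue

def attach_event_time(element):
    # Re-key the element to its event-time timestamp (seconds since epoch).
    # From here on, windowing and watermarks reason in event time,
    # not in the wall-clock time at which the element arrived.
    return TimestampedValue(element, element["event_ts"])

with beam.Pipeline() as pipeline:
    events = (
        pipeline
        | "Create" >> beam.Create([
            {"user": "a", "event_ts": 1_700_000_000},  # happened first...
            {"user": "b", "event_ts": 1_700_000_060},
            {"user": "c", "event_ts": 1_699_999_940},  # ...arrives last, out of order
        ])
        | "AttachEventTime" >> beam.Map(attach_event_time)
    )
```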

What a Watermark Actually Is

A watermark is a timestamp the system uses to track how far along it is in event time. Think of it as a checkpoint that says, we believe we have seen all the data with event timestamps earlier than this moment. As more events flow in, the watermark moves forward.

If a step fails or stalls, the watermark does not advance. That is the safeguard. The watermark only moves when the system is confident it has handled everything up to that point in event time. It is not a wall clock and it does not care about real-world time directly.

The most important consequence of this design is how late-arriving data is handled. If a record shows up with an event timestamp earlier than the current watermark, Dataflow knows that record is late, and within the allowed lateness you configure it can still place the record into the correct event-time window. The watermark is what lets Dataflow reason about whether a piece of data is on time or late even when arrival order is messy.

A Concrete Watermark Example

Here is the example I walk through in my Professional Data Engineer course because it makes the abstract feel concrete.

At 10:10 AM today, the system receives a record with an event timestamp of 7:45 PM yesterday. That is the most recent event the system has seen so far, so the watermark sits at 7:45 PM yesterday.

At 10:20 AM, another record arrives, but its event timestamp is 6:30 PM yesterday, which is earlier than the current watermark. The watermark never moves backward, and an older event gives it no reason to move forward. It stays put at 7:45 PM yesterday.

At 10:30 AM, a record arrives with an event timestamp of 9:30 AM yesterday. Same situation: the watermark holds.

At 10:40 AM, a record arrives with an event timestamp of 5:10 AM today. That is more recent than the current watermark, so Dataflow advances the watermark to 5:10 AM today.

The pattern to remember is this. The watermark advances only when newer event-time data shows up, and it pauses to let earlier late-arriving data catch up. When you see exam questions about why a window has not closed yet or why late data was still admitted, this is the mechanism behind the answer.
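
If you want to see that bookkeeping run, here is a small Python simulation of the simplified model this example uses, where the watermark just tracks the newest event timestamp seen so far. Real Dataflow watermarks are heuristic estimates propagated from sources, so treat this as a teaching model rather than the production algorithm; the dates assume "today" is the post date:

```python
from datetime import datetime

def advance_watermark(watermark, event_time):
    # Simplified rule from the walkthrough: the watermark tracks the newest
    # event timestamp seen so far and never moves backward.
    if watermark is None or event_time > watermark:
        return event_time
    return watermark

# The four arrivals from the example: (arrival wall-clock label, event time).
arrivals = [
    ("10:10 AM today", datetime(2025, 9, 24, 19, 45)),  # 7:45 PM yesterday
    ("10:20 AM today", datetime(2025, 9, 24, 18, 30)),  # 6:30 PM yesterday
    ("10:30 AM today", datetime(2025, 9, 24, 9, 30)),   # 9:30 AM yesterday
    ("10:40 AM today", datetime(2025, 9, 25, 5, 10)),   # 5:10 AM today
]

watermark = None
for arrival, event_time in arrivals:
    is_late = watermark is not None and event_time < watermark
    watermark = advance_watermark(watermark, event_time)
    print(f"{arrival}: late={is_late}, watermark now {watermark}")
```

Running it prints exactly the progression above: the watermark lands at 7:45 PM yesterday, holds through the two late records, then jumps to 5:10 AM today.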

Triggers Control When Results Emit

Watermarks track progress. Triggers decide when to emit the results of a window. In a batch pipeline you do not really need triggers because all the data is bounded and you compute the result once at the end. In a streaming pipeline the data never ends, so you need a rule for when to fire output downstream.

There are three trigger types that the Professional Data Engineer exam expects you to recognize; all three appear in the code sketch after this list.

  • Event-time triggers fire when the watermark reaches a specific point. This is the default and the most common choice for accuracy. When the watermark passes the end of a window, the window closes and the result fires.
  • Processing-time triggers fire based on wall-clock time. You might want a partial result every minute even if late data is still trickling in. The exam tends to associate this with use cases where freshness matters more than completeness.
  • Data-driven triggers fire based on a property of the data itself. After 100 records, after a certain byte size, after a specific value is detected. These show up in scenarios where you want to react to volume or content rather than time.
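
In the Beam SDK that Dataflow executes, those three categories map onto concrete trigger classes. A minimal sketch, with placeholder window sizes, delays, and counts:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Event-time trigger (the default): fire when the watermark passes the
# end of each 60-second fixed window.
event_time = beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.AfterWatermark(),
    accumulation_mode=trigger.AccumulationMode.DISCARDING,
)

# Processing-time trigger: emit a partial result every 60 wall-clock
# seconds, regardless of how complete the window is.
processing_time = beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.Repeatedly(trigger.AfterProcessingTime(60)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
)

# Data-driven trigger: fire every time 100 elements have accumulated.
data_driven = beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.Repeatedly(trigger.AfterCount(100)),
    accumulation_mode=trigger.AccumulationMode.DISCARDING,
)
```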

Triggers and Late Data

The piece that connects watermarks and triggers is what happens when a record arrives after the watermark has already passed its window. Triggers can be configured to re-fire a window, updating the previous result with the new late data. Combined with allowed lateness settings, this is how you keep a streaming aggregation accurate even when records show up minutes or hours after the window logically closed.
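
Here is a sketch of how those pieces fit together in Beam Python, with illustrative numbers: a watermark trigger that re-fires on each late record, accumulating mode so each re-fired result includes everything seen so far, and ten minutes of allowed lateness:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Close each 60-second window when the watermark passes its end, but keep
# the window's state around for 10 more minutes of event time. Each late
# record arriving in that grace period re-fires the window, and ACCUMULATING
# mode means the updated result covers all records seen so far.
late_tolerant = beam.WindowInto(
    window.FixedWindows(60),
    trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
    accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    allowed_lateness=600,  # seconds; data later than this is dropped
)
```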

For the exam, the rule of thumb I give my students is straightforward. If the question asks how Dataflow knows a window is complete, the answer involves a watermark. If it asks when Dataflow emits the result of a window, the answer involves a trigger. If it asks about handling data that arrived late, both concepts work together along with allowed lateness.

Once that mental separation clicks, the streaming questions on the Professional Data Engineer exam stop feeling like trick questions and start feeling like vocabulary checks.

My Professional Data Engineer course covers watermarks, triggers, windowing strategies, and the rest of the Dataflow streaming model in the same focused way, with practice questions modeled on the actual exam.
