Common Dataflow Challenges for the PDE Exam: Latency, Missing Messages, Out-of-Order Data

GCP Study Hub
September 28, 2025

Dataflow shows up all over the Professional Data Engineer exam, and many of the trickier questions are not about building a pipeline from scratch. They are about diagnosing one that is already running and behaving badly. The three failure modes Google keeps coming back to are increased latency, missing messages in streaming pipelines, and out-of-order data. If you can talk through each of these with confidence, you will pick up a meaningful number of points on exam day.

Here is how I think about each one.

Increased latency

Latency questions usually drop a scenario in your lap where a Dataflow job that used to run cleanly is suddenly slow, and the exam wants to know what you check first. The framing I keep in my head is that there are two flavors of latency worth watching: end-to-end latency, which is the time it takes for a record to traverse the entire pipeline, and per-stage latency, which is how long an individual stage holds onto data before passing it on.

The signals that point to a bottleneck are pretty consistent:

  • A specific stage is taking noticeably longer than the rest of the graph.
  • Data is backing up in one stage, visible as a growing backlog.
  • The monitoring dashboard shows latency values well above the baseline you are used to.

When you see those signals, the troubleshooting path is fairly mechanical. First, pull the pipeline job logs and the worker logs. Those give you detailed information about what is happening inside each stage and on each VM. Second, identify the specific step that is dragging. The exam will often hand you a graph or a description and expect you to point at the slow stage rather than rewriting the whole job. Third, look for the underlying cause, which usually falls into one of three buckets: resource limitations on the workers, data skew where one worker is processing far more records than its peers, or an external dependency such as a slow downstream sink.
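If you want a feel for that first step, here is a minimal sketch using the Cloud Logging client library for Python. The job ID is a placeholder and the exact filter depends on your project; Dataflow job and worker logs are indexed under the dataflow_step resource type.

    # A sketch of pulling warning-and-above entries for one Dataflow job.
    # The job ID below is a placeholder; adjust the filter to your project.
    from google.cloud import logging

    client = logging.Client()

    log_filter = (
        'resource.type="dataflow_step" '
        'resource.labels.job_id="2025-09-28_00_00_00-1234567890" '
        'severity>=WARNING'
    )

    for entry in client.list_entries(filter_=log_filter):
        # Each entry carries stage and worker context in its labels,
        # which is what you scan to find the step that is dragging.
        print(entry.timestamp, entry.severity, entry.payload)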

Data skew is the one I would internalize most. If the question describes one worker pegged at full CPU while the others are idle, that is the answer they want you to spot.
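For completeness, here is a minimal sketch of one skew mitigation in the Beam Python SDK: with_hot_key_fanout, which spreads a hot key's pre-aggregation across multiple workers before the final merge. The data here is a toy stand-in for a skewed key distribution.

    # Toy example of a skewed key: "hot_key" has 1,000x the records.
    import apache_beam as beam

    with beam.Pipeline() as p:
        _ = (
            p
            | "Create" >> beam.Create([("hot_key", 1)] * 1000 + [("cold_key", 1)])
            # Without fanout, every "hot_key" record is combined on a single
            # worker. with_hot_key_fanout(16) computes partial sums on up to
            # 16 workers first, so no one worker becomes the bottleneck.
            | "Sum" >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
            | "Print" >> beam.Map(print)
        )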

Missing messages in streaming pipelines

The second failure mode is messages disappearing from a streaming pipeline. The Professional Data Engineer exam likes this one because the right answer is not obvious. You have to know a specific diagnostic technique.

The symptoms you watch for are gaps in the data, aggregations that look incomplete relative to what you expected, and a sudden drop in throughput. Any of those can mean records are being dropped somewhere between your source and your sink.

The technique Google teaches for this scenario is to convert the streaming pipeline into a batch run and compare the outputs. The exam answer almost always walks through these four steps:

  • Capture the streaming data. Land the incoming events in Cloud Storage or BigQuery as they arrive, so you have a complete, durable snapshot of what came in.
  • Create a batch job. Modify the Dataflow job configuration to read that captured data and process it in batch mode instead of streaming.
  • Compare results. Run the batch job and see whether the messages that went missing in streaming actually do appear when the same data is reprocessed offline.
  • Diagnose. If the batch job processes everything cleanly, the streaming job is the problem. Review the windowing, the triggering, and the resource allocation on the streaming pipeline.

The reason this works is that batch processing strips away all the timing concerns. There are no late records, no watermark issues, no window expirations. If the batch run returns the records you thought were lost, you have proven the data was always there and the streaming configuration is what dropped them. That isolation is the whole point of the exercise.
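Here is a rough sketch of what the switch looks like in the Beam Python SDK, assuming the captured events were landed in Cloud Storage. The topic, the bucket path, and the apply_transforms placeholder are all illustrative, not part of any real job.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
    from apache_beam.transforms.window import FixedWindows


    def apply_transforms(pcoll):
        # Stand-in for the job's real logic. In a real comparison you would
        # also assign event timestamps from a field in each record so the
        # windows line up across the two runs.
        return (
            pcoll
            | "Window" >> beam.WindowInto(FixedWindows(300))
            | "KeyAll" >> beam.Map(lambda record: ("all", 1))
            | "Count" >> beam.CombinePerKey(sum)
        )


    # Streaming run: reads live events and runs until cancelled.
    streaming_opts = PipelineOptions()
    streaming_opts.view_as(StandardOptions).streaming = True
    with beam.Pipeline(options=streaming_opts) as p:
        _ = apply_transforms(
            p | "ReadLive" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/events"))

    # Batch rerun: reads the captured snapshot of the same events and
    # applies the identical transforms, so the outputs can be diffed.
    with beam.Pipeline(options=PipelineOptions()) as p:
        _ = apply_transforms(
            p | "ReadCaptured" >> beam.io.ReadFromText(
                "gs://my-bucket/captured/events-*.json"))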

Out-of-order data

The third common challenge is out-of-order data, which is just the reality of streaming. Records do not arrive at Dataflow in the same order they were generated. Network paths differ, devices buffer, and clocks drift. The exam expects you to know the three primitives Dataflow gives you to handle this and what each one does.

  • Windows organize incoming records into logical time frames based on event time. They group data into containers like fixed five-minute windows or sliding windows, regardless of when the record actually showed up at the worker.
  • Watermarks track the progress of event time through the pipeline. A watermark is Dataflow's estimate of how complete the data for a given event time is. Once the watermark passes the end of a window, the pipeline treats that window's input as complete, and anything that arrives afterward is handled as late data.
  • Triggers decide when results for a window get emitted. They can fire when the watermark reaches the end of a window, after a processing-time delay, or on late data so stragglers can update an already-emitted result.
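
A minimal sketch in the Beam Python SDK ties the three together: a fixed window, a watermark trigger with a late firing, and an allowed-lateness horizon. The key, the values, and the hard-coded timestamp are placeholders.

    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterCount, AfterWatermark)
    from apache_beam.transforms.window import FixedWindows, TimestampedValue

    with beam.Pipeline() as p:
        _ = (
            p
            | "Create" >> beam.Create([("sensor-1", 5), ("sensor-1", 7)])
            # Stamp each record with an event time; in a real job this comes
            # from the message itself, not a hard-coded placeholder.
            | "Stamp" >> beam.Map(lambda kv: TimestampedValue(kv, 1700000000))
            | "Window" >> beam.WindowInto(
                FixedWindows(300),        # five-minute event-time windows
                trigger=AfterWatermark(   # fire when the watermark passes the window end...
                    late=AfterCount(1)),  # ...then re-fire for every late record
                allowed_lateness=600,     # keep window state 10 more minutes for stragglers
                accumulation_mode=AccumulationMode.ACCUMULATING)  # late panes revise the result
            | "Sum" >> beam.CombinePerKey(sum)
            | "Print" >> beam.Map(print)
        )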

The exam loves combination questions on this. You will see a scenario where late records need to update an existing aggregate, and the correct answer involves a window plus a trigger that fires on late data. You will see another scenario where the team wants results emitted on a regular schedule even if data is still arriving, and the right answer is a processing-time trigger. Knowing which primitive solves which problem is the whole game.
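For the second scenario, a processing-time trigger in the same SDK might look like the following sketch; the one-minute period and hour-long window are illustrative.

    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterProcessingTime, Repeatedly)
    from apache_beam.transforms.window import FixedWindows

    # A windowing configuration rather than a full pipeline: apply it to a
    # keyed PCollection exactly as in the previous sketch.
    emit_on_schedule = beam.WindowInto(
        FixedWindows(3600),                           # hour-long event-time windows
        trigger=Repeatedly(AfterProcessingTime(60)),  # emit an updated pane roughly every
                                                      # 60 seconds of wall-clock time
        accumulation_mode=AccumulationMode.ACCUMULATING)  # each firing includes all data so far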

When you see a Dataflow troubleshooting question on the Professional Data Engineer exam, slot it into one of these three buckets first. Latency means logs and bottleneck analysis. Missing messages means the batch comparison technique. Out-of-order data means windows, watermarks, and triggers. That framing turns a vague troubleshooting prompt into a known recipe.

My Professional Data Engineer course covers Dataflow troubleshooting patterns, streaming semantics, and the windowing model in depth.
