Error Handling and Monitoring in Dataflow for the PDE Exam

GCP Study Hub
October 9, 2025

Dataflow pipelines fail in ways that batch ETL jobs never do. A streaming job might run for weeks, then start dropping messages at 3am because one upstream service changed its schema. A batch job might process 99 percent of records cleanly and choke on the last few thousand. Knowing how to catch those failures, route them somewhere useful, and monitor the right metrics is a core Professional Data Engineer skill, and the exam tests it in a few predictable ways.

This article walks through the error handling patterns and Dataflow metrics I think every Professional Data Engineer candidate needs in their head before exam day.

Start with logs and PCollection contents

When a pipeline starts generating errors, the first move is not to rewrite code. It is to read what the pipeline is actually doing at each step.

Every Dataflow pipeline is a chain of processing steps, and each step produces a PCollection. If you treat each step as a checkpoint and review the logs plus the PCollection contents after that step, you can narrow down where the bad data appears. Maybe step one looks fine. Step two has 100 fewer elements than expected. That tells you the problem is in the transform between them, not somewhere further downstream.
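
If eyeballing log lines is too slow, you can make those per-step element counts explicit with custom counters, which surface in the Dataflow monitoring UI. Here is a minimal sketch in the Beam Python SDK; the step names and transforms are placeholders, not a prescription:

    import apache_beam as beam
    from apache_beam.metrics import Metrics


    class CountElements(beam.DoFn):
        """Pass-through step that counts elements, so per-step volumes
        show up as custom counters in the Dataflow monitoring UI."""

        def __init__(self, step_name):
            super().__init__()
            self.counter = Metrics.counter(
                'pipeline_debug', 'elements_after_' + step_name)

        def process(self, element):
            self.counter.inc()
            yield element  # pass the element through unchanged


    with beam.Pipeline() as p:
        raw = p | 'Read' >> beam.Create(['a,1', 'b,2', 'not-a-row'])
        raw = raw | 'CountRead' >> beam.ParDo(CountElements('read'))
        parsed = raw | 'Parse' >> beam.Map(lambda line: line.split(','))
        _ = parsed | 'CountParse' >> beam.ParDo(CountElements('parse'))

Comparing the counter after one step against the counter after the next tells you immediately which transform is dropping elements.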

This sounds obvious, but the exam likes scenarios where a team is panicking about a failing pipeline and the right answer is some variant of inspect logs and intermediate PCollections to localize the issue. If you see that option, it is almost always the right answer over choices that suggest immediately rewriting transforms or scaling up workers.

Try-catch and routing errors to a separate PCollection

Once you know where errors happen, you need a strategy for handling them without crashing the main pipeline. The standard pattern is a try-catch inside a transform, with caught errors written to a separate output PCollection.

The flow looks like this. Data enters a processing step. Most elements process normally and continue down the main pipeline. When an element throws an exception, the catch block grabs it and emits it to a dedicated error PCollection. That error PCollection then gets routed somewhere durable for later analysis. Pub/Sub is the most common destination because you can attach subscribers that alert, store to BigQuery, or trigger replay logic.

Why route to Pub/Sub instead of just logging? Two reasons. First, logs are searchable but they are not a queue. If a downstream team wants to replay failed records, parsing Cloud Logging is painful. Second, Pub/Sub decouples error handling from the pipeline itself. You can change how errors are processed without redeploying the Dataflow job.
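
Here is a minimal sketch of the whole flow in the Beam Python SDK, where a caught error becomes a tagged output. The subscription, topic path, and parse logic are placeholders:

    import json

    import apache_beam as beam
    from apache_beam import pvalue
    from apache_beam.options.pipeline_options import PipelineOptions


    class ParseEvent(beam.DoFn):
        ERRORS = 'errors'  # tag for the error output

        def process(self, element):
            try:
                # Happy path: parsed records continue down the main pipeline.
                yield json.loads(element)
            except Exception as exc:
                # The catch block emits the bad element to a separate,
                # tagged PCollection instead of crashing the job.
                yield pvalue.TaggedOutput(
                    self.ERRORS,
                    json.dumps({'raw': element, 'error': str(exc)}))


    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        results = (
            p
            | 'Read' >> beam.io.ReadFromPubSub(
                subscription='projects/my-project/subscriptions/events-sub')
            | 'Decode' >> beam.Map(lambda b: b.decode('utf-8'))
            | 'Parse' >> beam.ParDo(ParseEvent()).with_outputs(
                ParseEvent.ERRORS, main='parsed'))

        # Route caught errors to a durable topic for alerting,
        # storage in BigQuery, or replay.
        _ = (
            results[ParseEvent.ERRORS]
            | 'Encode' >> beam.Map(lambda s: s.encode('utf-8'))
            | 'WriteErrors' >> beam.io.WriteToPubSub(
                topic='projects/my-project/topics/pipeline-errors'))

        # results.parsed flows on as the main PCollection.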

SideOutputs inside a DoFn

The try-catch pattern works at the transform boundary. SideOutputs let you do the same thing inside a single DoFn. You can tag elements with different output tags during processing, and the main output continues down the pipeline while a side output goes somewhere else.

The most common use case is exactly what we covered above. Elements that fail validation or processing get sent to a SideOutput, which is routed to a Pub/Sub topic. You then monitor the volume of messages on that topic in Cloud Monitoring. If error volume spikes, you get alerted. If it stays at zero for a week, you know the pipeline is healthy.
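
In the Beam Python SDK, this pattern is spelled TaggedOutput plus with_outputs; SideOutput is the older Dataflow SDK vocabulary for the same idea. A minimal sketch with a placeholder validation rule:

    import apache_beam as beam
    from apache_beam import pvalue


    class ValidateRecord(beam.DoFn):
        INVALID = 'invalid'  # tag for the side output

        def process(self, record):
            # One DoFn, two destinations: tag elements during processing.
            if record.get('user_id'):  # placeholder validation rule
                yield record  # main output continues down the pipeline
            else:
                yield pvalue.TaggedOutput(self.INVALID, record)


    with beam.Pipeline() as p:
        records = p | beam.Create([{'user_id': 'u1'}, {'user_id': None}])
        tagged = records | beam.ParDo(ValidateRecord()).with_outputs(
            ValidateRecord.INVALID, main='valid')

        # tagged.valid flows on; tagged.invalid is the side output you
        # would publish to the error topic and watch in Cloud Monitoring.
        _ = tagged.invalid | beam.Map(print)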

For the Professional Data Engineer exam, the distinction between a try-catch with a separate output PCollection and a SideOutput inside a DoFn is mostly academic. Both achieve the same outcome, which is isolating bad data without crashing the job. If the question asks about handling errors inside a single transform, SideOutput is the right vocabulary. If it asks about catching exceptions at the transform level, the try-catch pattern is what you want.

Dataflow metrics worth monitoring

Once your pipeline is running with error handling in place, you need to know it is actually healthy. There are six metrics that come up repeatedly in Dataflow troubleshooting and on the exam, and a sketch of querying one of them programmatically follows the list.

  • Watermark Age: how fresh the data is in a streaming pipeline. A high watermark age means your pipeline is falling behind real-time.
  • System Lag: how long data is waiting in the system before being processed. This is the streaming equivalent of a queue depth. Rising system lag means a bottleneck somewhere.
  • Backlog (Bytes): how much unprocessed data is sitting in the pipeline. Applies to both streaming and batch jobs. Useful for seeing whether you are keeping up with input.
  • Backlog Processing Time: estimates how long it will take to chew through the current backlog at current throughput. If this number is growing, you have a capacity problem.
  • Memory Capacity: how much memory workers are actually consuming. Spikes here often precede crashes or autoscaling events.
  • Worker Memory Limit: the ceiling for worker memory. If usage approaches the limit, you need to either tune the pipeline or pick a larger machine type.
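
All six surface as Cloud Monitoring metrics, so you can alert on them or pull them programmatically as well as watch them in the Dataflow UI. Here is a minimal sketch with the google-cloud-monitoring client, assuming the system lag metric type dataflow.googleapis.com/job/system_lag, a placeholder project, and a job named my-job; verify the exact names against the current Dataflow metrics list:

    import time

    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {'start_time': {'seconds': now - 3600},
         'end_time': {'seconds': now}})

    # Pull the last hour of system lag for one streaming job.
    series = client.list_time_series(
        request={
            'name': 'projects/my-project',  # placeholder project
            'filter': (
                'metric.type = "dataflow.googleapis.com/job/system_lag" '
                'AND resource.labels.job_name = "my-job"'),
            'interval': interval,
            'view': monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        })

    for ts in series:
        for point in ts.points:
            print(point.interval.end_time, point.value)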

The exam pattern here is usually scenario-based. A streaming pipeline is reporting freshness problems, and you need to pick the right metric. Watermark Age. A pipeline is taking longer and longer to complete batches. Backlog Processing Time. Workers are crashing intermittently. Memory Capacity against Worker Memory Limit.

Putting it together for exam day

If you remember three things from this article, make them these. Read logs and PCollection contents step by step before doing anything else. Catch errors inside the pipeline with try-catch or SideOutputs and route them to Pub/Sub so they are inspectable and replayable. Monitor Watermark Age and System Lag for streaming jobs, Backlog and Backlog Processing Time for throughput problems, and the memory metrics when workers are unstable.

Dataflow questions on the Professional Data Engineer exam reward candidates who can match a symptom to the right metric or the right pattern. The patterns above cover the majority of what you will see.

My Professional Data Engineer course covers Dataflow error handling, SideOutputs, and the monitoring metrics in more depth, including walkthroughs of the diagrams you will recognize on exam questions.
