
Side inputs and side outputs are two small but heavily tested Dataflow concepts on the Google Cloud Professional Data Engineer exam. They show up in scenario questions where a pipeline needs to enrich each record with a lookup table, or where bad records need to flow somewhere other than the main output. If you understand what each one does and when to reach for it, you can usually eliminate two answer choices on sight.
In this post I will walk through what a side input is, what a side output is, and the kinds of PDE exam questions that hinge on knowing the difference.
A side input is supplemental data that you provide to a PTransform alongside the main PCollection it is processing. The main PCollection still drives the pipeline element by element. The side input is an extra reference that the transform can read while it works on each element.
The two examples that come up most often are lookup tables and configuration data. If you are processing a stream of order events and you need to join each order against a small product catalog, that catalog is a natural side input. If you are running a transform whose behavior depends on a feature flag or a threshold value, that value is a natural side input. The side input is not the thing the pipeline is iterating over. It is the thing the iteration consults.
The mental model I use is that the main PCollection is the conveyor belt and the side input is the reference card taped to the wall. Every element on the belt can glance at the card, but the card is not on the belt.
A side output is an additional output PCollection produced by a PTransform. Most transforms produce one output PCollection. A transform with a side output produces a main output plus one or more extra outputs that route specific results to alternate destinations.
The canonical use case is error handling. Imagine a parsing step that reads JSON records. Most records parse cleanly and continue down the pipeline. A small percentage fail validation. Instead of crashing the pipeline or silently dropping the bad records, you emit the failures to a side output and write them to a separate sink such as a Cloud Storage bucket or a dead-letter table in BigQuery. The good records keep flowing through the main output as if nothing happened.
You can also use side outputs to split a stream by category. If you want premium customers to go to one BigQuery table and standard customers to go to another, a single transform with two outputs is cleaner than two separate filter transforms.
The Professional Data Engineer exam tests Dataflow heavily, and the questions are almost always scenario based. You will not be asked to define a side input. You will be given a pipeline description and asked which Dataflow feature solves the problem. Recognizing the shape of the problem is the whole skill.
Two patterns to watch for:

- Enrichment: each element in a stream needs to be combined with a small reference dataset, such as a product catalog, a currency table, or configuration values. The answer is a side input.
- Routing and dead-lettering: some records need to go somewhere other than the main output, such as failed records going to a separate sink for later inspection. The answer is a side output.
The wrong answers in these questions tend to suggest things like running two separate pipelines, using Pub/Sub to fan out, or rewriting the transform in a more complicated way. If you can pattern match the scenario to side input or side output, you can rule those out.
Side inputs work best when the supplemental data is small enough to fit comfortably in memory on each worker. If you are trying to join a streaming PCollection against a multi-terabyte table, a side input is the wrong tool, and the exam will sometimes test that boundary. For very large reference data, a proper join or a BigQuery lookup is usually the better answer.
Side outputs do not slow down the main path. Records routed to a side output are emitted independently of the main output, so error handling does not bottleneck the rest of the pipeline. This is one reason the dead-letter pattern is the standard recommendation rather than logging failures and stopping the pipeline.
Both features are part of the Apache Beam programming model that Dataflow runs, so anything you read in the Beam documentation about side inputs and additional outputs applies directly to Dataflow on Google Cloud.
When you see a Dataflow question on the Professional Data Engineer exam that involves a lookup or a dead-letter pattern, side inputs and side outputs are the first two things to consider.
My Professional Data Engineer course covers Dataflow side inputs, side outputs, and the rest of the Beam programming model in the depth the PDE exam expects.