
Side inputs and side outputs are two small but heavily tested Dataflow concepts on the Google Cloud Professional Data Engineer exam. They show up in scenario questions where a pipeline needs to enrich each record with a lookup table, or where bad records need to flow somewhere other than the main output. If you understand what each one does and when to reach for it, you can usually eliminate two answer choices on sight.
In this post I will walk through what a side input is, what a side output is, and the kinds of PDE exam questions that hinge on knowing the difference.
A side input is supplemental data that you provide to a PTransform alongside the main PCollection it is processing. The main PCollection still drives the pipeline element by element. The side input is an extra reference that the transform can read while it works on each element.
The two examples that come up most often are lookup tables and configuration data. If you are processing a stream of order events and you need to join each order against a small product catalog, that catalog is a natural side input. If you are running a transform whose behavior depends on a feature flag or a threshold value, that value is a natural side input. The side input is not the thing the pipeline is iterating over. It is the thing the iteration consults.
The mental model I use is that the main PCollection is the conveyor belt and the side input is the reference card taped to the wall. Every element on the belt can glance at the card, but the card is not on the belt.
A side output is an additional output PCollection produced by a PTransform. Most transforms produce one output PCollection. A transform with a side output produces a main output plus one or more extra outputs that route specific results to alternate destinations.
The canonical use case is error handling. Imagine a parsing step that reads JSON records. Most records parse cleanly and continue down the pipeline. A small percentage fail validation. Instead of crashing the pipeline or silently dropping the bad records, you emit the failures to a side output and write them to a separate sink such as a Cloud Storage bucket or a dead-letter table in BigQuery. The good records keep flowing through the main output as if nothing happened.
You can also use side outputs to split a stream by category. If you want premium customers to go to one BigQuery table and standard customers to go to another, a single transform with two outputs is cleaner than two separate filter transforms.
The Professional Data Engineer exam tests Dataflow heavily, and the questions are almost always scenario based. You will not be asked to define a side input. You will be given a pipeline description and asked which Dataflow feature solves the problem. Recognizing the shape of the problem is the whole skill.
Two patterns to watch for:

- Enrichment: each element in a stream needs to be combined with a small reference dataset, such as a product catalog, a currency table, or configuration values. The answer is a side input.
- Routing and dead-lettering: some records need to go somewhere other than the main output, such as failed records going to a separate sink for later inspection. The answer is a side output.
The wrong answers in these questions tend to suggest things like running two separate pipelines, using Pub/Sub to fan out, or rewriting the transform in a more complicated way. If you can pattern match the scenario to side input or side output, you can rule those out.
Side inputs work best when the supplemental data is small enough to fit comfortably in memory on each worker. If you are trying to join a streaming PCollection against a multi-terabyte table, a side input is the wrong tool, and the exam will sometimes test that boundary. For very large reference data, a proper join or a BigQuery lookup is usually the better answer.
Side outputs do not slow down the main path. Records routed to a side output are emitted independently of the main output, so error handling does not bottleneck the rest of the pipeline. This is one reason the dead-letter pattern is the standard recommendation rather than logging failures and stopping the pipeline.
Both features are part of the Apache Beam programming model that Dataflow runs, so anything you read in the Beam documentation about side inputs and additional outputs applies directly to Dataflow on Google Cloud.
When you see a Dataflow question on the Professional Data Engineer exam that involves a lookup or a dead-letter pattern, side inputs and side outputs are the first two things to consider.
My Professional Data Engineer course covers Dataflow side inputs, side outputs, and the rest of the Beam programming model in the depth the PDE exam expects.