Dataflow Pipeline Setup for the PDE Exam: Region, Disk, Worker Type, Autoscaling

GCP Study Hub
October 5, 2025

When I work through Dataflow questions with people studying for the Google Cloud Professional Data Engineer exam, the same handful of configuration knobs come up over and over. Region, disk size, worker type, autoscaling, and the worker cap. None of these are conceptually hard, but the exam expects you to know which lever to pull when a scenario gives you a specific symptom. Lag in the pipeline, runaway cost, slow throughput, cross-region egress charges. Each one maps to a setting, and the goal of this article is to make those mappings stick.

Region selection

A Dataflow pipeline lives in a single region. You pick the region when you launch the job, and every worker for that job runs there. That is the first thing to internalize, because it forces a follow-on decision about where your input and output data live.

If your pipeline reads from a Cloud Storage bucket in us-central1 and you launch the job in europe-west1, you are paying cross-region egress on every read. You are also adding network latency that compounds across millions of records. The recommended pattern is to keep the data and the pipeline in the same region. Lower latency, higher throughput, and lower cost, all from one alignment decision.
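To make that concrete, here is a minimal sketch of what that alignment looks like when launching a job with the Java Beam SDK (the SDK whose camelCase option names this article uses). The project ID and region are placeholder values, and the pipeline itself is omitted.

  import org.apache.beam.runners.dataflow.DataflowRunner;
  import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class RegionAlignedLaunch {
    public static void main(String[] args) {
      DataflowPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
      options.setRunner(DataflowRunner.class);
      options.setProject("my-project");   // placeholder project ID
      // Run the workers in the same region as the input bucket so reads stay
      // regional and you avoid cross-region egress.
      options.setRegion("us-central1");
      // ... construct the Pipeline with these options and call run()
    }
  }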

On the exam, if you see a scenario where someone is complaining about unexplained network charges on a Dataflow job, region mismatch between the pipeline and its data sources is the first thing to check.

Disk size

Each Dataflow worker gets a persistent disk, and the size of that disk matters more than people expect. The default is fine for small jobs, but as soon as you have large datasets or a pipeline with heavy intermediate stages, you can run out of temporary storage and stall the job.

Think about a pipeline that does a big GroupByKey followed by a join. Those shuffle stages spill intermediate data to disk on each worker. If the disk is too small, you get bottlenecks at exactly the moment your pipeline is doing its hardest work. Bumping the disk size is a cheap fix relative to the cost of a stuck job.
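If you launch from the Java SDK, the knob is the diskSizeGb worker option. A minimal sketch, continuing with the same options object as the snippet above; 200 GB is only an illustrative bump, not a recommendation.

  // Larger worker disks give shuffle-heavy stages more room to spill
  // intermediate data. Example value only.
  options.setDiskSizeGb(200);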

For the Professional Data Engineer exam, the signal to remember is this: if a question describes a pipeline failing or slowing down during aggregation, shuffle, or any stage that produces large intermediate output, disk size is one of the suspects.

Worker type

The worker type, set via the workerMachineType parameter, controls what kind of VM your workers run on. There are four flavors worth knowing for the exam.

  • General Purpose: balanced CPU and memory. The default choice for typical workloads.
  • High Memory: optimized for memory-heavy work. Use this when your pipeline loads large datasets into memory or holds big state in transforms.
  • High CPU: optimized for compute-heavy work. Use this when your pipeline is dominated by transformations, encoding, encryption, or anything that pegs the CPU.
  • Preemptible: cheaper workers that can be shut down at any time. Good for batch jobs that tolerate interruption.

The exam likes to give you a workload description and ask which worker type fits. A pipeline that does heavy in-memory joins on wide records is high memory. A pipeline that runs CPU-bound transformations is high CPU. A cost-sensitive batch job that can survive restarts is a candidate for preemptible workers. If you map the workload to the bottleneck, the right answer falls out.
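In the Java SDK this choice is a single machine type string on the same options object. The machine names below are illustrative examples of each flavor, not values the exam prescribes.

  // Match the machine type to the dominant resource. Example machine types.
  options.setWorkerMachineType("n1-standard-4");    // general purpose: balanced CPU and memory
  // options.setWorkerMachineType("n1-highmem-8");  // high memory: large in-memory joins or state
  // options.setWorkerMachineType("n1-highcpu-16"); // high CPU: compute-bound transforms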

Autoscaling

Autoscaling lets Dataflow adjust the number of workers based on incoming data volume. When the volume spikes, Dataflow adds workers. When it drops, Dataflow removes them. The job tracks the load instead of running at a fixed size.

The key detail for the exam is that Dataflow autoscaling is horizontal, not vertical. It adds or removes worker nodes. It does not resize the machines you already have. If a question mentions vertical autoscaling for Dataflow, that is your distractor.

A simple picture helps. A pipeline starts with two workers while data volume is steady. Volume spikes, so Dataflow scales out to three workers. Volume drops below the original level, and Dataflow scales back to one worker. Same job, same code, different worker count over time.
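In code, that picture corresponds to throughput-based autoscaling, which Dataflow enables by default for most jobs but which you can set explicitly. A sketch on the same options object; the starting count of two workers is just the number from the picture above.

  import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;

  // Horizontal autoscaling: Dataflow changes the worker count, never the
  // machine size. Start with two workers and let the service adjust.
  options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
  options.setNumWorkers(2);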

Max workers

Once you have autoscaling on, you need to cap it. That cap is maxNumWorkers. It is the upper bound Dataflow will respect when scaling out.

Setting this too low means your pipeline lags under load because Dataflow cannot add enough workers to keep up. Setting it too high means you can rack up cost during a traffic spike, because Dataflow will happily spin up workers all the way to the ceiling you gave it.

The exam framing is usually around tradeoffs. If a scenario says a pipeline is lagging and autoscaling is already enabled, raising maxNumWorkers is one valid move. If a scenario says costs are out of control during peaks, lowering maxNumWorkers caps the blast radius.
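The cap itself is one line on the same options object. The value of 20 below is just an example ceiling; the right number is whatever balances your lag tolerance against your budget.

  // Upper bound on scale-out. Too low and the pipeline lags under load;
  // too high and a traffic spike can run the bill up to this ceiling.
  options.setMaxNumWorkers(20);   // example value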

Improving throughput

When a Professional Data Engineer exam question asks how to improve the throughput of a Dataflow pipeline that already has autoscaling enabled, there are two levers.

  • Raise maxNumWorkers so Dataflow can scale out further.
  • Switch workerMachineType to a more powerful machine, with more CPU or more memory depending on where the bottleneck is.

More workers give you more parallelism. Bigger workers give each unit of parallelism more horsepower. The exam wants you to recognize both options, and to pick between them based on whether the workload is bound by parallelism or by per-worker capacity.
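Because both levers are ordinary pipeline options, relaunching the job with different flags is often all it takes. A sketch with placeholder values, using the same Java SDK flag spellings as the rest of this article and the same imports as the earlier snippets.

  // Hypothetical relaunch that pulls both throughput levers at once.
  String[] relaunchArgs = new String[] {
      "--runner=DataflowRunner",
      "--region=us-central1",
      "--maxNumWorkers=60",                  // lever 1: more room to scale out
      "--workerMachineType=n1-highcpu-16"    // lever 2: more horsepower per worker
  };
  DataflowPipelineOptions options =
      PipelineOptionsFactory.fromArgs(relaunchArgs).as(DataflowPipelineOptions.class);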

Putting it together

The mental model I keep in my head for Dataflow setup questions is a short checklist. Region matches the data. Disk size matches the intermediate footprint. Worker type matches the dominant resource. Autoscaling is on, with a sensible maxNumWorkers that balances lag and cost. If something is wrong, the symptom points to which setting needs to change.

Most of the Dataflow scenarios on the Professional Data Engineer exam are not asking you to design a pipeline from scratch. They are asking you to read a symptom and pick the configuration knob that addresses it. Get fluent with the five knobs in this article and that class of question becomes a freebie.

My Professional Data Engineer course covers Dataflow pipeline configuration, autoscaling behavior, and the full set of streaming and batch processing topics on the exam.
