Dataflow Cost Optimization for the PDE Exam

GCP Study Hub
October 14, 2025

Dataflow cost questions show up on the Professional Data Engineer exam in a predictable shape. You get a scenario where a team is overspending, and you have to pick the lever that actually moves the bill. The trap is that every answer looks plausible because every answer is technically a cost optimization. The skill is knowing which lever is the biggest.

I rank the levers from most impactful to least impactful when I teach this. That ranking is what the exam tests, so I want to walk through it the way I think about it during a question.

Lever 1: Batch versus streaming

This is the single biggest cost decision you make on a Dataflow pipeline. Batch jobs only consume resources while they run. They spin up, process a bounded dataset, and shut down. Streaming jobs run continuously, which means workers stay provisioned around the clock even when traffic is light.

On the exam, if a scenario describes a job that runs nightly, hourly, or on any clear cadence with a defined end, the cost-optimal answer is almost always batch. If you see a team running a Dataflow streaming pipeline to process a file that lands in Cloud Storage once a day, the fix is to convert it to batch. I tell candidates to read the cadence carefully before anything else.
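To make the contrast concrete, here is a minimal sketch of the batch version in the Apache Beam Python SDK: a bounded job that reads the daily file and shuts down when it finishes. The project, bucket, and file paths are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    # No --streaming flag: this is a bounded batch job that provisions
    # workers only while it runs and shuts down when the file is processed.
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadDailyFile" >> beam.io.ReadFromText("gs://my-bucket/daily/events.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/events")
    )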

Lever 2: Worker machine type

Once you have batch versus stream right, the next biggest knob is the worker machine type. Dataflow defaults are not always the cheapest fit. If your job is memory-bound, a high-memory machine type runs faster and ends sooner, which costs less even though the per-hour rate is higher. If your job is CPU-bound and embarrassingly parallel, smaller machines with more workers can be cheaper.

Preemptible workers (now called Spot VMs on Compute Engine) are the other half of this lever. For fault-tolerant batch jobs, Spot VMs can cut the bill substantially, and Dataflow handles the interruptions gracefully because the runner can reassign work. On exam questions about cutting cost for a fault-tolerant batch workload, Spot or preemptible workers are a strong signal.
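A hedged sketch of how those knobs look as pipeline options in the Beam Python SDK. The machine type and worker cap are illustrative, not recommendations; for preemptible capacity on batch jobs, Dataflow exposes Flexible Resource Scheduling (FlexRS), which mixes preemptible and regular VMs.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    worker_machine_type="n1-highmem-4",   # memory-bound job: finishes sooner on high-memory workers
    max_num_workers=20,                   # cap autoscaling so a long tail cannot run up the bill
    flexrs_goal="COST_OPTIMIZED",         # FlexRS: preemptible capacity for fault-tolerant batch work
)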

Lever 3: Pipeline optimization

The longer a job runs, the more you pay. That sounds obvious, but it shapes how you read pipeline-optimization questions. If the scenario says a pipeline takes six hours and the team wants to cut cost, the answer is usually about removing inefficiency in the pipeline graph rather than swapping infrastructure.

Things that lengthen a Dataflow job include unnecessary shuffles, wide windows on streaming jobs, side inputs that get reloaded too often, and transforms that fuse poorly. Reading and writing in efficient formats (Avro or Parquet over CSV) cuts both runtime and shuffle volume. I keep these in my head as a checklist when an exam question implies the pipeline itself is the problem.
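Two items from that checklist, sketched in Beam Python with placeholder field names: read an efficient format instead of CSV, and combine before the shuffle so less data crosses the network.

import apache_beam as beam

with beam.Pipeline() as p:
    records = p | "ReadAvro" >> beam.io.ReadFromAvro("gs://my-bucket/events/*.avro")

    # CombinePerKey sums values on each worker before the shuffle, so far less
    # data moves between workers than GroupByKey followed by a per-key sum.
    totals = (
        records
        | "KeyByUser" >> beam.Map(lambda r: (r["user_id"], r["amount"]))  # placeholder fields
        | "SumPerUser" >> beam.CombinePerKey(sum)
    )

    _ = totals | "WriteTotals" >> beam.io.WriteToText("gs://my-bucket/output/totals")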

Lever 4: Worker storage

Worker storage matters less than compute, but it is not free. Each Dataflow worker provisions persistent disk, and the size and type of that disk show up on the bill. Choosing SSD when standard would do, or oversizing disks for jobs that do not shuffle much, wastes money.

The other side of worker storage is shuffle. Excessive shuffling forces workers to spill to disk and read it back, which slows the job and inflates storage and compute costs together. The Dataflow Shuffle service (for batch) and Streaming Engine (for streaming) move that work off the worker disk and onto a managed backend, which can reduce overall cost on shuffle-heavy jobs even though it adds a per-GB charge.
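A sketch of the storage-side options in the Beam Python SDK. The disk size is illustrative, and the shuffle_mode=service experiment only matters in regions where the Dataflow Shuffle service is not already the default for batch jobs.

from apache_beam.options.pipeline_options import PipelineOptions

# Batch: smaller worker disks plus the managed shuffle backend.
batch_options = PipelineOptions(
    runner="DataflowRunner",
    disk_size_gb=30,                        # do not oversize worker disks for light-shuffle jobs
    experiments=["shuffle_mode=service"],   # offload batch shuffle from worker disk
)

# Streaming: Streaming Engine moves state and shuffle to a managed backend.
streaming_options = PipelineOptions(
    runner="DataflowRunner",
    streaming=True,
    enable_streaming_engine=True,
)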

Lever 5: Region selection

This is the smallest lever and the most common distractor. Region selection matters because Compute Engine prices differ by region and because data transfer between regions is not free. If your source data sits in one region and you run Dataflow in another, you pay egress. If you pick a region with higher Compute Engine prices than necessary, you pay a premium with no benefit.

On the exam, if a scenario mentions cross-region egress explicitly or names two different regions for storage and compute, region alignment is the answer. If the scenario is silent on region, region is almost certainly a distractor and a higher lever applies.
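A minimal sketch of region alignment, assuming the source bucket lives in us-central1: run the workers and keep staging and temp storage in the same region so the job pays no cross-region egress.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    region="us-central1",                           # same region as the source bucket
    temp_location="gs://my-bucket-us-central1/tmp",
    staging_location="gs://my-bucket-us-central1/staging",
)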

How I read these questions

When a Dataflow cost question shows up, I work top-down through the ranking. Is this a streaming job that could be batch? If yes, that is the answer. If no, is the machine type wrong for the workload? If no, is the pipeline itself wasting time? If no, are storage or shuffle inflated? Only after all of that do I consider region.

The Professional Data Engineer exam rewards this ordering because it is how Google actually advises customers to think about Dataflow spend. Memorizing the order is more valuable than memorizing any single optimization.

My Professional Data Engineer course covers Dataflow cost optimization alongside the rest of the pipeline-design topics on the exam.
