Dataproc vs Dataflow for the PDE Exam: When to Use Each

GCP Study Hub
November 5, 2025

One of the most common decision points on the Google Cloud Professional Data Engineer exam is choosing between Dataproc and Dataflow. Both are managed data processing services. Both can handle batch and streaming workloads. And on a surface read of the docs, they sound almost interchangeable. They are not, and the exam loves to test whether you understand the actual decision criteria.

I want to walk through the framework I use when I see one of these questions, because once you internalize it, you can answer most Dataproc-versus-Dataflow questions in under thirty seconds.

The core distinction

Dataproc is Google Cloud's managed Hadoop and Spark service. You spin up a cluster and run Spark jobs, Hive queries, Pig scripts, Presto, or any of the other tools in the Hadoop ecosystem on it. You manage the cluster lifecycle, even if Google handles the provisioning and patching underneath.

Dataflow is Google Cloud's managed Apache Beam service. You write a Beam pipeline, submit it, and Dataflow handles everything else. There is no cluster you log into. There is no node count you tune by hand at runtime. The service is fully serverless and autoscales based on the work in front of it.

Both services run on Google Cloud infrastructure. Both can read and write to BigQuery, Cloud Storage, Pub/Sub, and the rest of the data stack. The difference that matters for the exam is the programming model and the operational model.

When to pick Dataproc

Reach for Dataproc when any of these are true:

  • You have existing Hadoop or Spark code. If the question describes a team migrating an on-premises Hadoop cluster, or lifting an existing Spark job into Google Cloud with minimal rewrites, Dataproc is the answer. You can take your JAR, point it at a Dataproc cluster, and run it.
  • You depend on a specific Hadoop ecosystem tool. Hive, Pig, HBase, Presto, Oozie, Zeppelin notebooks, custom Spark libraries. If the workload requires a tool that lives in that ecosystem, Dataflow cannot run it. Dataproc can.
  • Your team wants hands-on control. Dataproc gives you a real cluster with real VMs. You can SSH in, install packages with initialization actions, and tune the cluster configuration. That control is useful when you are porting a workload that depends on a particular OS-level setup.
  • You want to keep using Spark or Hadoop going forward. If the team has invested in Spark skills and does not want to retrain on a new framework, Dataproc lets them keep working in the environment they know.

When to pick Dataflow

Reach for Dataflow when any of these are true:

  • You have no Hadoop or Spark dependency. Greenfield pipelines, especially for streaming, almost always point to Dataflow on the exam. There is no legacy code to drag along.
  • You are migrating an Apache Beam job. Beam is the SDK Dataflow runs natively. If the question mentions Beam in any form, the answer is Dataflow.
  • You want a serverless, hands-off operational model. No clusters to size. No nodes to provision. The service autoscales workers up and down based on the pipeline's needs. If a question stresses minimizing operational overhead, this is the signal.
  • You need unified batch and streaming code. Beam's whole pitch is that the same pipeline code runs in both modes. If the scenario describes a team that wants one codebase covering both, Dataflow is the fit.

How the exam phrases this

The Professional Data Engineer exam rarely asks the question directly. It hides the answer in the scenario. Watch for these signals.

Pointers to Dataproc: the words "Hadoop", "Spark", "Hive", "migration from on-premises", "existing PySpark jobs", "Cloudera", "Hortonworks", or any mention of a team that wants to keep their current code with minimal changes. Also watch for cost-sensitive scenarios where ephemeral clusters and preemptible workers come up. Dataproc's ability to spin clusters up for a single job and tear them down afterward is a recurring exam theme.

Pointers to Dataflow: the words "Apache Beam", "serverless", "autoscaling", "unified batch and stream", "windowing", "watermarks", "exactly-once processing", or a team that does not have any prior big-data tooling and wants Google to handle the infrastructure. Streaming-only scenarios with Pub/Sub upstream and BigQuery downstream almost always resolve to Dataflow.

A quick decision flow

When I hit one of these questions on the exam, I run through three checks in order:

  • Does the scenario mention Hadoop, Spark, or a Hadoop-ecosystem tool? If yes, Dataproc. Stop here.
  • Does the scenario mention Apache Beam or a unified batch-and-streaming requirement? If yes, Dataflow. Stop here.
  • Is the priority minimizing operations and going serverless? If yes, Dataflow. Otherwise, default to whichever ecosystem matches the team's existing skills.
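The three checks above can be sketched as a tiny rule function. The keyword lists are illustrative, not exhaustive, and the Spark-plus-serverless branch anticipates the Dataproc Serverless nuance covered at the end of this post.

```python
# A toy sketch of the decision flow: map scenario keywords to a service.
# Keyword lists are illustrative, not exhaustive.
def pick_service(scenario: str) -> str:
    s = scenario.lower()
    hadoop_signals = ("hadoop", "spark", "hive", "pig", "hbase",
                      "presto", "oozie", "cloudera", "hortonworks")
    # Check 1: Hadoop/Spark ecosystem dependency points to Dataproc.
    if any(word in s for word in hadoop_signals):
        # The one twist: Spark plus an explicit serverless requirement
        # points to Dataproc Serverless, not classic Dataproc.
        if "serverless" in s:
            return "Dataproc Serverless"
        return "Dataproc"
    # Check 2: Apache Beam or unified batch-and-streaming points to Dataflow.
    if "beam" in s or "unified" in s:
        return "Dataflow"
    # Check 3: serverless / minimal-ops priority also points to Dataflow.
    if "serverless" in s or "operational overhead" in s:
        return "Dataflow"
    # No strong signal: default to the team's existing skills.
    return "match the team's existing skills"

print(pick_service("Migrate existing PySpark jobs with minimal changes"))
# -> Dataproc
print(pick_service("Greenfield streaming, minimize operational overhead"))
# -> Dataflow
```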

That short flow handles the large majority of Dataproc-versus-Dataflow questions you will see. The wrong answer is almost always the service that does not fit the team's current code or skills.

One more nuance

Dataproc is not the only place Spark lives on Google Cloud anymore. Dataproc Serverless lets you run Spark batch workloads without managing a cluster, which narrows the operational gap between the two services for pure Spark batch jobs. The exam still leans on the classic framing above, but if you see a scenario that specifies Spark and serverless together, Dataproc Serverless is the answer rather than Dataflow.

My Professional Data Engineer course covers the Dataproc and Dataflow decision in depth, including the streaming, windowing, and cost-optimization angles the exam loves to test.
