
One of the most common decision points on the Google Cloud Professional Data Engineer exam is choosing between Dataproc and Dataflow. Both are managed data processing services. Both can handle batch and streaming workloads. And on a surface read of the docs, they sound almost interchangeable. They are not, and the exam loves to test whether you understand the actual decision criteria.
I want to walk through the framework I use when I see one of these questions, because once you internalize it, you can answer most Dataproc-versus-Dataflow questions in under thirty seconds.
Dataproc is Google Cloud's managed Hadoop and Spark service. You spin up a cluster, then run Spark jobs, Hive queries, Pig scripts, Presto queries, or anything else from the Hadoop ecosystem. You still manage the cluster lifecycle, even though Google handles the provisioning and patching underneath.
Dataflow is Google Cloud's managed Apache Beam service. You write a Beam pipeline, submit it, and Dataflow handles everything else. There is no cluster you log into. There is no node count you tune by hand at runtime. The service is fully serverless and autoscales based on the work in front of it.
Both services run on Google Cloud infrastructure. Both can read and write to BigQuery, Cloud Storage, Pub/Sub, and the rest of the data stack. The difference that matters for the exam is the programming model and the operational model.
Reach for Dataproc when any of these are true:

- The team has existing Hadoop or Spark code it wants to keep running with minimal changes.
- The scenario is a migration from an on-premises Hadoop distribution such as Cloudera or Hortonworks.
- The workload needs ecosystem tools beyond Spark, such as Hive, Pig, or Presto.
- Cost control through ephemeral clusters and preemptible workers is a stated priority.
Reach for Dataflow when any of these are true:

- The team is starting fresh, with no existing Hadoop or Spark investment to protect.
- You want a fully serverless service with no cluster to manage or tune.
- You need a unified programming model for batch and streaming in one codebase.
- The workload depends on streaming semantics such as windowing, watermarks, or exactly-once processing.
The Professional Data Engineer exam rarely asks the question directly. It hides the answer in the scenario. Watch for these signals.
Pointers to Dataproc: the words "Hadoop", "Spark", "Hive", "migration from on-premises", "existing PySpark jobs", "Cloudera", "Hortonworks", or any mention of a team that wants to keep their current code with minimal changes. Also watch for cost-sensitive scenarios where ephemeral clusters and preemptible workers come up. Dataproc's ability to spin clusters up for a single job and tear them down afterward is a recurring exam theme.
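The ephemeral-cluster theme comes down to simple arithmetic: a cluster billed only while a job runs costs a fraction of one left up all day. A quick sketch, using made-up numbers purely for illustration (real Dataproc pricing depends on machine types, region, and the per-vCPU Dataproc fee):

```python
def cluster_cost(hourly_rate, hours_running):
    """Cost of a cluster that is billed for every hour it exists."""
    return hourly_rate * hours_running

# Hypothetical rate, assumed only for this illustration.
RATE = 4.0          # $/hour for the whole cluster
JOB_HOURS = 2       # the nightly Spark job actually runs for 2 hours

persistent = cluster_cost(RATE, 24)        # cluster left up around the clock
ephemeral = cluster_cost(RATE, JOB_HOURS)  # spun up for the job, then deleted

print(f"persistent: ${persistent:.2f}/day, ephemeral: ${ephemeral:.2f}/day")
# persistent: $96.00/day, ephemeral: $8.00/day
```

The exam scenarios that mention "nightly batch job" plus "cost optimization" are pointing at exactly this gap.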
Pointers to Dataflow: the words "Apache Beam", "serverless", "autoscaling", "unified batch and stream", "windowing", "watermarks", "exactly-once processing", or a team that does not have any prior big-data tooling and wants Google to handle the infrastructure. Streaming-only scenarios with Pub/Sub upstream and BigQuery downstream almost always resolve to Dataflow.
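Windowing is worth understanding beyond the keyword level. The core idea, assigning each event to a time bucket by its event timestamp and aggregating per bucket, can be shown in plain Python. This is a sketch of the semantics only, not the Beam API; Beam and Dataflow add watermarks and triggers on top of this:

```python
from collections import defaultdict

def fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed, non-overlapping windows.

    A plain-Python illustration of fixed windowing: each event lands in
    the window whose time range contains its event timestamp.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# Events arriving out of order still land in the right window,
# because assignment is by event timestamp, not arrival order.
events = [(3, "a"), (12, "b"), (7, "c"), (14, "d")]
print(fixed_windows(events, window_size=10))
# {0: ['a', 'c'], 10: ['b', 'd']}
```

When a question mentions late data, out-of-order events, or per-window aggregates over a Pub/Sub stream, it is describing exactly this model, and that points to Dataflow.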
When I hit one of these questions on the exam, I run through three checks in order:

1. Does the scenario mention existing Hadoop or Spark code, or a migration from on-premises? If yes, Dataproc.
2. Does it lean on streaming semantics like windowing, watermarks, or exactly-once processing, or name Apache Beam? If yes, Dataflow.
3. If neither, what operational model does the team want? Hands-on cluster control points to Dataproc; serverless, Google-managed infrastructure points to Dataflow.
That short flow handles the large majority of Dataproc-versus-Dataflow questions you will see. The wrong answer is almost always the service that does not fit the team's current code or skills.
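One way to internalize the flow is to encode it as a small keyword-matching function. This is a study aid, not anything official; the keyword lists are assumptions drawn from the scenario signals discussed above:

```python
def choose_service(scenario: str) -> str:
    """Hypothetical helper: the exam decision flow as ordered keyword checks."""
    text = scenario.lower()

    # Signals pointing at existing Hadoop/Spark code or an on-prem migration.
    dataproc_signals = ["hadoop", "spark", "hive", "pig", "cloudera",
                        "hortonworks", "on-premises", "pyspark"]
    # Signals pointing at Beam, streaming semantics, or serverless operation.
    dataflow_signals = ["beam", "windowing", "watermark", "exactly-once",
                        "serverless", "autoscal"]

    # Spark plus serverless together is the one combination that overrides
    # the classic split: it points to Dataproc Serverless.
    if "spark" in text and "serverless" in text:
        return "Dataproc Serverless"
    if any(s in text for s in dataproc_signals):
        return "Dataproc"
    if any(s in text for s in dataflow_signals):
        return "Dataflow"
    # No strong signal: a team with no prior big-data tooling wants
    # Google to run the infrastructure, which means Dataflow.
    return "Dataflow"


print(choose_service("Migrate existing PySpark jobs from Cloudera"))  # Dataproc
print(choose_service("Pub/Sub stream with windowing into BigQuery"))  # Dataflow
print(choose_service("Run Spark batch jobs on serverless infra"))     # Dataproc Serverless
```

Real exam questions mix signals, so treat the order of the checks, not the keywords themselves, as the thing worth memorizing.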
Dataproc is not the only place Spark lives on Google Cloud anymore. Dataproc Serverless lets you run Spark batch workloads without managing a cluster, which narrows the operational gap between the two services for pure Spark batch jobs. The exam still leans on the classic framing above, but if you see a scenario that specifies Spark and serverless together, Dataproc Serverless is the answer rather than Dataflow.
My Professional Data Engineer course covers the Dataproc and Dataflow decision in depth, including the streaming, windowing, and cost-optimization angles the exam loves to test.