
Dataproc shows up on the Professional Cloud Architect exam more often than people expect, usually buried inside a scenario about migrating a Hadoop or Spark workload onto Google Cloud. The questions tend to focus on a small set of configuration choices, and once you understand what each one controls, the answers fall out pretty quickly. I want to walk through how I think about Dataproc cluster setup so the exam scenarios feel familiar.
When you stand up a Dataproc cluster, there are a handful of options that the exam likes to ask about:

- the cluster mode (single node, standard, or high availability)
- the number of worker nodes
- preemptible workers, as a cheap way to add compute
- the local SSD size attached to the nodes
- graceful decommissioning behavior when scaling down
None of these are exotic. They are the same kinds of choices you make for any compute workload on Google Cloud. The difference is that Dataproc bundles them into a cluster definition, and the exam expects you to know which knob to turn for which scenario.
Dataproc gives you three cluster modes, and they map cleanly to three different use cases.
Single node. One master, zero workers. Lightweight, cheap, fine for development or quick experiments. Not appropriate for production because there is no horizontal scale and no resilience.
Standard mode. One master with a custom number of worker nodes. This is the default for production workloads. You scale the worker pool to match the size of the job, and you only run one master because the master is not the bottleneck for most Spark or Hadoop work.
High availability. Three masters with a custom number of worker nodes. Use this when downtime is unacceptable. The exam likes to test on this one because the trade-off is obvious: three masters cost more than one, but you survive a master failure without losing the cluster.
If a question says lightweight or development, the answer is single node. If a question says production, the answer is standard. If a question says critical, mission-critical, or no downtime, the answer is high availability.
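For concreteness, here is roughly how the three modes look with the gcloud CLI. Cluster names and the region are placeholders, not recommendations:

```shell
# Single node: one master, zero workers -- development and experiments.
gcloud dataproc clusters create dev-cluster \
    --region=us-central1 \
    --single-node

# Standard: one master plus a custom worker pool -- typical production.
gcloud dataproc clusters create prod-cluster \
    --region=us-central1 \
    --num-workers=4

# High availability: three masters -- survives a master failure.
gcloud dataproc clusters create critical-cluster \
    --region=us-central1 \
    --num-masters=3 \
    --num-workers=4
```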
One thing the Professional Cloud Architect exam likes to confirm is that you understand which configuration choices are mutable after cluster creation. The mutable ones are:

- the number of worker nodes
- the number of preemptible workers
When you change the worker count, Dataproc redistributes the data across the new set of workers automatically. You do not have to rebalance anything by hand. That is one of the reasons Dataproc is positioned as a managed service rather than just rented Hadoop.
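As a sketch, resizing the worker pool on a live cluster is a single update call (the cluster name and region are placeholders):

```shell
# Grow the worker pool to 10 nodes on a running cluster.
# Dataproc redistributes data to the new workers automatically.
gcloud dataproc clusters update prod-cluster \
    --region=us-central1 \
    --num-workers=10
```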
The local SSD size is not in this list. Once you set it at cluster creation, you are stuck with it for the life of the cluster.
Preemptible nodes are a cost lever. They are much cheaper than standard worker nodes, but Google can reclaim them at any time with very little warning. The exam scenarios that point toward preemptible nodes usually involve compute-intensive work that can tolerate interruptions, like batch transformations, ML feature generation, or analytical queries that can be retried.
A few rules to keep in mind:

- Preemptible workers are processing-only. They do not store HDFS data, so losing one never loses data.
- They use the same machine type as the standard workers in the cluster.
- A cluster cannot be made of preemptible workers alone; you always need standard workers underneath.
What preemptibles are not good for is anything that needs persistent state on the node, anything that cannot be retried, or anything where a sudden node loss would corrupt the job. The exam will almost always frame preemptibles as a cost optimization for batch or interruption-tolerant workloads.
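In gcloud, preemptible capacity is added as secondary workers (older releases used the flag `--num-preemptible-workers` instead); the cluster name and counts here are illustrative:

```shell
# Standard workers hold HDFS data; preemptible secondary workers
# layer cheap, interruption-tolerant compute on top.
gcloud dataproc clusters create batch-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=8 \
    --secondary-worker-type=preemptible
```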
Graceful decommissioning is the feature that lets you remove worker nodes without breaking active jobs. When a node is decommissioned gracefully, Dataproc redistributes the data and tasks on that node to the remaining workers before the node is actually removed. This prevents data loss and prevents job failures during scale-down operations.
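Scaling down with a graceful decommission timeout looks roughly like this (the one-hour timeout is an arbitrary example, not a recommendation):

```shell
# Shrink to 2 workers, but give in-progress work on the departing
# nodes up to one hour to finish before they are removed.
gcloud dataproc clusters update batch-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --graceful-decommission-timeout=1h
```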
A few details worth knowing for the exam:

- You specify a graceful decommission timeout when scaling down, and Dataproc waits up to that long for in-progress work on a departing node to finish.
- A sensible timeout is longer than your longest-running job; otherwise long jobs can still be killed mid-flight.
- Without graceful decommissioning, scale-down is forceful: nodes are removed immediately, and any tasks running on them fail and must be retried.
The Dataproc questions on the Professional Cloud Architect exam tend to combine a few of these knobs at once. A typical scenario might describe a Spark workload that runs nightly, where cost matters more than uptime, and ask you to size the cluster appropriately. The right answer is usually a standard cluster with a sensible worker count and a healthy number of preemptible nodes layered on top, with graceful decommissioning enabled if the question mentions scaling during a job.
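Putting those pieces together, the nightly-batch scenario might be sketched like this (cluster name, region, and sizes are placeholders):

```shell
# Standard cluster: one master, a modest pool of standard workers,
# plus preemptible secondary workers for cheap batch throughput.
gcloud dataproc clusters create nightly-spark \
    --region=us-central1 \
    --num-workers=4 \
    --num-secondary-workers=12 \
    --secondary-worker-type=preemptible

# If the cluster must shrink while jobs are running, scale down
# gracefully so in-flight tasks are not killed.
gcloud dataproc clusters update nightly-spark \
    --region=us-central1 \
    --num-workers=2 \
    --graceful-decommission-timeout=30m
```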
If the scenario emphasizes development or experimentation, lean toward single node. If the scenario emphasizes critical workloads or compliance with availability requirements, lean toward high availability. If the scenario emphasizes cost reduction with interruption-tolerant work, layer in preemptible nodes. If the scenario emphasizes scaling without disruption, enable graceful decommissioning.
The shape of the answer is almost always one of those four patterns, and once you internalize the patterns, the questions move quickly.
My Professional Cloud Architect course covers Dataproc configuration and cluster modes alongside the rest of the messaging and pipelines material.