Dataproc Configuration and Cluster Modes for the PCA Exam

GCP Study Hub
Ben Makansi
February 15, 2026

Dataproc shows up on the Professional Cloud Architect exam more often than people expect, usually buried inside a scenario about migrating a Hadoop or Spark workload onto Google Cloud. The questions tend to focus on a small set of configuration choices, and once you understand what each one controls, the answers fall out pretty quickly. I want to walk through how I think about Dataproc cluster setup so the exam scenarios feel familiar.

The Configuration Choices That Matter

When you stand up a Dataproc cluster, there are a handful of options that the exam likes to ask about:

  • Region and zone. Pick the region closest to your data. This keeps network egress costs down and keeps read latency low. If your data lives in a Cloud Storage bucket in us-central1, you do not want a Dataproc cluster in europe-west1 reading from it.
  • Cluster mode. The number of master and worker nodes. More nodes mean more processing power but also more cost. I will get into the specific modes in a moment.
  • Disk type and size. Standard persistent disks are cheaper. SSDs are faster. The right choice depends on whether the workload is throughput-bound or latency-sensitive.
  • Local SSD size. Local SSDs give you very fast scratch storage, but the size cannot be changed after the cluster is created. If you think you might need more, provision it up front.
  • Staging bucket. A Cloud Storage bucket where Dataproc writes job logs, temporary data, and any output that needs to persist beyond the cluster lifetime.
  • Preemptible nodes. Cheaper VMs that can be reclaimed at any time. Useful for non-critical, interruption-tolerant work.

None of these are exotic. They are the same kinds of choices you make for any compute workload on Google Cloud. The difference is that Dataproc bundles them into a cluster definition, and the exam expects you to know which knob to turn for which scenario.
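
To make these knobs concrete, here is a minimal sketch of a cluster definition using the google-cloud-dataproc Python client. The project ID, bucket name, machine types, and sizes are placeholders I made up for illustration, not recommendations.

    from google.cloud import dataproc_v1

    PROJECT_ID = "my-project"   # placeholder
    REGION = "us-central1"      # keep the cluster in the same region as the data

    # The client must point at the regional endpoint for the cluster's region.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": "etl-cluster",
        "config": {
            # Staging bucket where Dataproc keeps job logs and other data
            # that needs to outlive the cluster.
            "config_bucket": "my-dataproc-staging-bucket",  # placeholder
            "gce_cluster_config": {"zone_uri": f"{REGION}-a"},
            # Standard mode: one master plus a custom worker pool.
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "n1-standard-4",
                "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 500},
            },
            "worker_config": {
                "num_instances": 4,
                "machine_type_uri": "n1-standard-4",
                # Local SSDs are fast scratch space; the count is fixed at creation.
                "disk_config": {
                    "boot_disk_type": "pd-ssd",
                    "boot_disk_size_gb": 500,
                    "num_local_ssds": 1,
                },
            },
        },
    }

    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    print(operation.result().cluster_name)  # blocks until the cluster is running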

The Three Cluster Modes

Dataproc gives you three cluster modes, and they map cleanly to three different use cases.

Single node. One master, zero workers. Lightweight, cheap, fine for development or quick experiments. Not appropriate for production because there is no horizontal scale and no resilience.

Standard mode. One master with a custom number of worker nodes. This is the default for production workloads. You scale the worker pool to match the size of the job, and you only run one master because the master is not the bottleneck for most Spark or Hadoop work.

High availability. Three masters with a custom number of worker nodes. Use this when downtime is unacceptable. The exam likes to test on this one because the trade-off is obvious: three masters cost more than one, but you survive a master failure without losing the cluster.

If a question says lightweight or development, the answer is single node. If a question says production, the answer is standard. If a question says critical, mission-critical, or no downtime, the answer is high availability.
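
If it helps to see the modes side by side, here are rough config fragments in the same style. Standard and high availability differ only in the master count; the single-node property is how I recall the API expressing a zero-worker cluster, so treat the exact name as an assumption to verify against the Dataproc docs.

    # Standard mode: one master plus a custom worker pool (the usual production shape).
    standard = {
        "master_config": {"num_instances": 1},
        "worker_config": {"num_instances": 4},
    }

    # High availability: three masters, so losing one master does not lose the cluster.
    high_availability = {
        "master_config": {"num_instances": 3},
        "worker_config": {"num_instances": 4},
    }

    # Single node: one master, zero workers. In the API this is expressed with a
    # cluster property (name assumed here) rather than a worker count of zero.
    single_node = {
        "master_config": {"num_instances": 1},
        "software_config": {"properties": {"dataproc:dataproc.allow.zero.workers": "true"}},
    }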

What You Can Change After the Cluster Exists

One thing the Professional Cloud Architect exam likes to confirm is that you understand which configuration choices are mutable after cluster creation. The mutable ones are:

  • The number of worker nodes.
  • The number of preemptible VMs.
  • Cluster labels.
  • Whether graceful decommissioning is on.

When you change the worker count, Dataproc redistributes data across the workers automatically. You do not have to rebalance anything by hand. That is one of the reasons Dataproc is positioned as a managed service rather than just rented Hadoop.
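
Here is a sketch of what that resize looks like with the Python client, reusing the client and placeholder names from the earlier example. The update mask names the one field being changed.

    # Grow the worker pool on a live cluster; only the masked field is modified.
    operation = client.update_cluster(
        request={
            "project_id": PROJECT_ID,
            "region": REGION,
            "cluster_name": "etl-cluster",
            "cluster": {"config": {"worker_config": {"num_instances": 8}}},
            "update_mask": {"paths": ["config.worker_config.num_instances"]},
        }
    )
    operation.result()  # blocks until the resize finishes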

The local SSD size, which I mentioned earlier, is not in this list. Once you set it, you are stuck with it for the life of the cluster.

Preemptible Nodes

Preemptible nodes are a cost lever. They are much cheaper than standard worker nodes, but Google can reclaim them at any time with very little warning. The exam scenarios that point toward preemptible nodes usually involve compute-intensive work that can tolerate interruptions, like batch transformations, ML feature generation, or analytical queries that can be retried.

A few rules to keep in mind:

  • You need at least one standard worker node before you can add preemptible nodes. Preemptibles are an addition to the cluster, not a replacement for the cluster.
  • Dataproc handles the leave-and-rejoin process automatically. When a preemptible node is reclaimed, Dataproc removes it from the cluster cleanly. When capacity returns, the cluster can pick up new preemptibles.
  • Because you are adding more nodes overall, jobs typically finish faster, which is part of why the cost trade-off is attractive.

What preemptibles are not good for is anything that needs persistent state on the node, anything that cannot be retried, or anything where a sudden node loss would corrupt the job. The exam will almost always frame preemptibles as a cost optimization for batch or interruption-tolerant workloads.
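
In the API, preemptible capacity shows up as the cluster's secondary worker pool. Here is a rough fragment that layers it onto the cluster definition from the first sketch; the count is arbitrary.

    # Preemptible nodes are the secondary worker pool; secondary workers
    # default to preemptible VMs, so a count is enough for this sketch.
    # The cluster still needs its standard workers (worker_config above).
    cluster["config"]["secondary_worker_config"] = {"num_instances": 10}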

Graceful Decommissioning

Graceful decommissioning is the feature that lets you remove worker nodes without breaking active jobs. When a node is decommissioned gracefully, Dataproc redistributes the data and tasks on that node to the remaining workers before the node is actually removed. This prevents data loss and prevents job failures during scale-down operations.
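
In the API, you ask for graceful decommissioning as part of the scale-down itself, by attaching a timeout to the update call; the timeout is the window Dataproc has to drain work off the departing nodes. A sketch, reusing the names from the earlier examples, with an arbitrary one-hour timeout:

    import datetime

    # Scale down from 8 to 4 workers, giving the departing nodes up to an hour
    # to finish or hand off their work before they are removed.
    operation = client.update_cluster(
        request={
            "project_id": PROJECT_ID,
            "region": REGION,
            "cluster_name": "etl-cluster",
            "cluster": {"config": {"worker_config": {"num_instances": 4}}},
            "update_mask": {"paths": ["config.worker_config.num_instances"]},
            "graceful_decommission_timeout": datetime.timedelta(hours=1),
        }
    )
    operation.result()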

A few details worth knowing for the exam:

  • It applies only to standard worker nodes. Preemptible nodes do not get the courtesy of graceful decommissioning because they can be reclaimed by Google with no warning anyway.
  • It is most valuable in mixed clusters that combine standard and preemptible nodes. The standard nodes are running the parts of the workload that absolutely cannot fail, so when you scale down, you want to be sure those nodes hand off cleanly.
  • It must be enabled. It is not on by default. If a question describes a scaling operation that needs to preserve in-flight jobs, the answer involves enabling graceful decommissioning before the scale-down.

How These Pieces Fit on the Exam

The Dataproc questions on the Professional Cloud Architect exam tend to combine a few of these knobs at once. A typical scenario might describe a Spark workload that runs nightly, where cost matters more than uptime, and ask you to size the cluster appropriately. The right answer is usually a standard cluster with a sensible worker count and a healthy number of preemptible nodes layered on top, with graceful decommissioning enabled if the question mentions scaling during a job.

If the scenario emphasizes development or experimentation, lean toward single node. If the scenario emphasizes critical workloads or compliance with availability requirements, lean toward high availability. If the scenario emphasizes cost reduction with interruption-tolerant work, layer in preemptible nodes. If the scenario emphasizes scaling without disruption, enable graceful decommissioning.

The shape of the answer is almost always one of those four patterns, and once you internalize the patterns, the questions move quickly.

My Professional Cloud Architect course covers Dataproc configuration and cluster modes alongside the rest of the messaging and pipelines material.
