
Dataproc is the managed Hadoop and Spark service on Google Cloud, and the Professional Data Engineer exam expects you to know how to size a cluster correctly the first time. The questions tend to look the same: someone needs a cheap dev environment, someone else needs a production cluster that survives a master node failure, and a third team wants to cut costs without losing data when nodes go away. Each of those maps to a specific Dataproc configuration choice, and if you can recognize the pattern in the question, the answer is usually one click.
I want to walk through the cluster setup options I drill Professional Data Engineer candidates on: the main configuration knobs, the three cluster modes, how preemptible workers actually behave, and what graceful decommissioning does for you.
When you create a Dataproc cluster, you pick a handful of settings up front. Some of them you can change later. Some of them you cannot. Knowing which is which saves you from a question that asks what you can modify on a running cluster.
After creation, you can change the number of standard workers, the number of preemptible workers, and the labels on the cluster, and you can apply graceful decommissioning when you scale down. Dataproc redistributes HDFS data automatically when worker counts change, so you do not have to manage that yourself.
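To make that concrete, here is roughly what a resize looks like with the google-cloud-dataproc Python client. The project, region, and cluster name are placeholders; treat this as a sketch of the mutable fields rather than a complete tool.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
CLUSTER = "my-cluster"   # placeholder

# Dataproc clients talk to a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Resize the standard worker group on a running cluster. The update mask
# names the one field being changed; preemptible workers are resized the
# same way via config.secondary_worker_config.num_instances.
operation = client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": CLUSTER,
        "cluster": {
            "cluster_name": CLUSTER,
            "config": {"worker_config": {"num_instances": 5}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # block until the resize finishes
```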
Dataproc has three cluster modes, and choosing between them is the exam question I see most often in this section. A scenario describes a workload, and you pick the mode.

- Single node: 1 master, 0 workers. Everything runs on one VM. This is the cheap option for development, experimentation, and lightweight proof-of-concept work.
- Standard: 1 master, 2 or more workers. The default. The master is a single point of failure: if it goes down, the cluster goes down.
- High availability: 3 masters, 2 or more workers. YARN and HDFS are configured to survive a master failure, which is what long-running and mission-critical workloads need.
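To pin down the master and worker counts, here is a sketch of the three modes as ClusterConfig fragments for the same Python client. The single-node property mirrors what gcloud's --single-node flag sets; the rest of each config (machine types, disks) is omitted, so treat these as illustrative rather than complete.

```python
# 1 master, 0 workers: everything on a single VM.
SINGLE_NODE = {
    "master_config": {"num_instances": 1},
    "worker_config": {"num_instances": 0},
    # The cluster property that gcloud's --single-node flag sets
    # (assumption: needed when building the config directly via the API).
    "software_config": {
        "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
    },
}

# 1 master, 2+ workers: the default; the master is a single point of failure.
STANDARD = {
    "master_config": {"num_instances": 1},
    "worker_config": {"num_instances": 2},
}

# 3 masters, 2+ workers: survives the loss of a master.
HIGH_AVAILABILITY = {
    "master_config": {"num_instances": 3},
    "worker_config": {"num_instances": 2},
}
```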
The trap on these questions is overspending. If the scenario describes a batch job that runs nightly and is fine to retry, you do not need HA. The Professional Data Engineer exam rewards matching the mode to the actual reliability requirement, not always choosing the most resilient option.
Preemptible workers are the cost lever for Dataproc. They are much cheaper than standard workers, but Google can reclaim them at any time with no guarantee on when they come back. Dataproc manages the join and leave process for you, so you do not have to write any code to handle a preempted node disappearing.
A few rules to keep in mind:

- Preemptible workers are processing-only. They do not store HDFS data, so a reclaimed node does not take data with it.
- You cannot build an all-preemptible cluster. Every cluster needs at least one standard worker alongside the preemptible ones.
- Preemptible workers use the same machine type as the cluster's standard workers; you do not choose it separately.
On the exam, the cue for preemptible nodes is almost always cost reduction combined with a tolerance for interruption. If the scenario says "compute-heavy, fault-tolerant, lowest cost," you are looking at preemptible workers.
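Here is a sketch of that setup at creation time, again with the Python client and placeholder names. The secondary_worker_config group is where the preemptible capacity lives.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "nightly-batch",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # The standard workers that must exist alongside the preemptibles.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Cheap, reclaimable compute; no machine type here because it
        # follows the standard workers' machine type.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()
```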
Graceful decommissioning is what lets you shrink a cluster without breaking running jobs. When you decommission a worker gracefully, Dataproc waits for in-flight tasks to finish on that node and redistributes its data to the remaining workers before removing it. No data loss, no job failures, no surprise restarts.
A few specifics that show up on the Professional Data Engineer exam:

- It is off by default. You opt in by setting a graceful decommissioning timeout on the scale-down operation, and once that timeout expires, any remaining nodes are removed forcibly. The timeout can be anywhere from zero seconds to one day.
- It applies only to standard workers. Preemptible workers can vanish at any moment by design, so there is nothing graceful to wait for.
- It requires a cluster created with Dataproc image version 1.2 or later.
If you see a scenario where someone scaled down a cluster and lost data or had jobs fail mid-flight, the fix is to enable graceful decommissioning and try again.
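In API terms, graceful decommissioning is a timeout attached to the scale-down request. A sketch with the Python client, placeholder names, and a one-hour timeout:

```python
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
CLUSTER = "my-cluster"   # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Scale the standard workers down to 2, giving YARN up to an hour to
# drain in-flight tasks from each node before it is removed.
operation = client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": CLUSTER,
        "cluster": {
            "cluster_name": CLUSTER,
            "config": {"worker_config": {"num_instances": 2}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
        "graceful_decommission_timeout": duration_pb2.Duration(seconds=3600),
    }
)
operation.result()
```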
Memorize the three cluster modes with their master and worker counts. Memorize that the local SSD configuration is fixed at creation. Memorize that preemptible workers require at least one standard worker and that graceful decommissioning is opt-in and standard-only. Those four facts cover the bulk of what the Professional Data Engineer exam tests on Dataproc cluster setup.
The rest is reading the scenario carefully and matching the description to one of these levers. Dev workload means single node. Critical means HA. Cost cutting on a fault-tolerant job means preemptibles. Safe scale-down means graceful decommissioning.
My Professional Data Engineer course covers Dataproc cluster setup, autoscaling, job submission, and the rest of the data processing domain on the exam.