
Dataproc is the managed Hadoop and Spark service on Google Cloud, and the Professional Data Engineer exam expects you to know how to size a cluster correctly the first time. The questions tend to look the same: someone needs a cheap dev environment, someone else needs a production cluster that survives a master node failure, and a third team wants to cut costs without losing data when nodes go away. Each of those maps to a specific Dataproc configuration choice, and if you can recognize the pattern in the question, the answer is usually one click.
I want to walk through the cluster setup options I drill Professional Data Engineer candidates on: the main configuration knobs, the three cluster modes, how preemptible workers actually behave, and what graceful decommissioning does for you.
When you create a Dataproc cluster, you pick a handful of settings up front. Some of them you can change later. Some of them you cannot. Knowing which is which saves you from a question that asks what you can modify on a running cluster.
After creation, you can change the number of standard workers, the number of preemptible workers, and the labels on the cluster, and you can apply graceful decommissioning when you scale down. Dataproc redistributes HDFS data automatically when worker counts change, so you do not have to manage that yourself.
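To make that concrete, here is roughly what a resize looks like with the google-cloud-dataproc Python client. The project, region, and cluster name are placeholders; treat this as a sketch of the mutable fields rather than a complete tool.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
CLUSTER = "my-cluster"   # placeholder

# Dataproc clients talk to a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Resize the standard worker group on a running cluster. The update mask
# names the one field being changed; preemptible workers are resized the
# same way via config.secondary_worker_config.num_instances.
operation = client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": CLUSTER,
        "cluster": {
            "cluster_name": CLUSTER,
            "config": {"worker_config": {"num_instances": 5}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # block until the resize finishes
```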
Dataproc has three cluster modes, and choosing between them is the exam question I see most often in this section. A scenario describes a workload, and you pick the mode.

- Single node: 1 master, 0 workers. Everything runs on one VM. This is the cheap option for development, experimentation, and lightweight proof-of-concept work.
- Standard: 1 master, 2 or more workers. The default. The master is a single point of failure: if it goes down, the cluster goes down.
- High availability: 3 masters, 2 or more workers. YARN and HDFS are configured to survive a master failure, which is what long-running and mission-critical workloads need.
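To pin down the master and worker counts, here is a sketch of the three modes as ClusterConfig fragments for the same Python client. The single-node property mirrors what gcloud's --single-node flag sets; the rest of each config (machine types, disks) is omitted, so treat these as illustrative rather than complete.

```python
# 1 master, 0 workers: everything on a single VM.
SINGLE_NODE = {
    "master_config": {"num_instances": 1},
    "worker_config": {"num_instances": 0},
    # The cluster property that gcloud's --single-node flag sets
    # (assumption: needed when building the config directly via the API).
    "software_config": {
        "properties": {"dataproc:dataproc.allow.zero.workers": "true"}
    },
}

# 1 master, 2+ workers: the default; the master is a single point of failure.
STANDARD = {
    "master_config": {"num_instances": 1},
    "worker_config": {"num_instances": 2},
}

# 3 masters, 2+ workers: survives the loss of a master.
HIGH_AVAILABILITY = {
    "master_config": {"num_instances": 3},
    "worker_config": {"num_instances": 2},
}
```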
The trap on these questions is overspending. If the scenario describes a batch job that runs nightly and is fine to retry, you do not need HA. The Professional Data Engineer exam rewards matching the mode to the actual reliability requirement, not always choosing the most resilient option.
Preemptible workers are the cost lever for Dataproc. They are much cheaper than standard workers, but Google can reclaim them at any time with no guarantee on when they come back. Dataproc manages the join and leave process for you, so you do not have to write any code to handle a preempted node disappearing.
A few rules to keep in mind:

- Preemptible workers are processing-only. They do not store HDFS data, so a reclaimed node does not take data with it.
- You cannot build an all-preemptible cluster. Every cluster needs at least one standard worker alongside the preemptible ones.
- Preemptible workers use the same machine type as the cluster's standard workers; you do not choose it separately.
On the exam, the cue for preemptible nodes is almost always cost reduction combined with a tolerance for interruption. If the scenario says "compute-heavy, fault-tolerant, lowest cost," you are looking at preemptible workers.
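Here is a sketch of that setup at creation time, again with the Python client and placeholder names. The secondary_worker_config group is where the preemptible capacity lives.

```python
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "nightly-batch",  # placeholder
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # The standard workers that must exist alongside the preemptibles.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Cheap, reclaimable compute; no machine type here because it
        # follows the standard workers' machine type.
        "secondary_worker_config": {
            "num_instances": 4,
            "preemptibility": "PREEMPTIBLE",
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()
```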
Graceful decommissioning is what lets you shrink a cluster without breaking running jobs. When you decommission a worker gracefully, Dataproc waits for in-flight tasks to finish on that node and redistributes its data to the remaining workers before removing it. No data loss, no job failures, no surprise restarts.
A few specifics that show up on the Professional Data Engineer exam:

- It is off by default. You opt in by setting a graceful decommissioning timeout on the scale-down operation, and once that timeout expires, any remaining nodes are removed forcibly. The timeout can be anywhere from zero seconds to one day.
- It applies only to standard workers. Preemptible workers can vanish at any moment by design, so there is nothing graceful to wait for.
- It requires a cluster created with Dataproc image version 1.2 or later.
If you see a scenario where someone scaled down a cluster and lost data or had jobs fail mid-flight, the fix is to enable graceful decommissioning and try again.
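In API terms, graceful decommissioning is a timeout attached to the scale-down request. A sketch with the Python client, placeholder names, and a one-hour timeout:

```python
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
CLUSTER = "my-cluster"   # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

# Scale the standard workers down to 2, giving YARN up to an hour to
# drain in-flight tasks from each node before it is removed.
operation = client.update_cluster(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "cluster_name": CLUSTER,
        "cluster": {
            "cluster_name": CLUSTER,
            "config": {"worker_config": {"num_instances": 2}},
        },
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
        "graceful_decommission_timeout": duration_pb2.Duration(seconds=3600),
    }
)
operation.result()
```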
Memorize the three cluster modes with their master and worker counts. Memorize that the local SSD configuration is fixed at creation. Memorize that preemptible workers require at least one standard worker and that graceful decommissioning is opt-in and standard-only. Those four facts cover the bulk of what the Professional Data Engineer exam tests on Dataproc cluster setup.
The rest is reading the scenario carefully and matching the description to one of these levers. Dev workload means single node. Critical means HA. Cost cutting on a fault-tolerant job means preemptibles. Safe scale-down means graceful decommissioning.
My Professional Data Engineer course covers Dataproc cluster setup, autoscaling, job submission, and the rest of the data processing domain on the exam.