Dataproc Migration Best Practices for the PDE Exam

GCP Study Hub
October 22, 2025

If you have an on-premises Hadoop or Spark cluster and you are wondering where it should go on Google Cloud, the short answer for the Professional Data Engineer exam is Dataproc. That single rule of thumb covers most of the migration questions you will see, but the exam goes further. It tests whether you understand how to actually run that migration, what to move first, what kind of clusters to spin up, and where each piece of the old Apache stack should land once you are done. In this post I will walk through the best practices the exam expects you to know.

The rule of thumb: on-prem Spark and Hadoop go to Dataproc

When in doubt, on-prem Apache Spark and Hadoop jobs should be migrated to Dataproc. That is the default answer. Dataproc is Google Cloud's managed Hadoop and Spark service, and it is built specifically to take existing Apache workloads and run them in the cloud without rewriting them.

The reason the exam emphasizes this is that it is the path of least resistance. You keep your Spark code, your Hive queries, your Pig scripts, and your Hadoop jobs. You move them to a managed cluster instead of running them on hardware you have to maintain yourself. There are situations where you would skip Dataproc and go straight to a serverless service like Dataflow or BigQuery, but if the question is framed as a migration of existing Apache workloads, Dataproc is the answer.

Move the data first

The first concrete step in any cluster migration is to move the data before you move the compute. Usually that means landing the data in Google Cloud Storage. GCS gives you a secure, durable place to park your datasets, and because Dataproc ships with the Cloud Storage connector, your jobs can typically read from gs:// paths the same way they read from hdfs:// paths.

This ordering matters for a reason that is easy to miss. If you stand up a cluster before the data is available, you are paying for compute that has nothing to do. Worse, you tie your cluster lifecycle to your data lifecycle, which is the opposite of how cloud-native architectures work. Data goes to durable storage first. Compute reads from it on demand.

Test small before you scale

The next best practice is to perform small-scale testing on a subset of your data before you cut over the full workload. This is straightforward but easy to skip when you are under pressure to migrate. Running a job on a sample lets you catch configuration issues, version mismatches between your on-prem Spark and the Dataproc image, and cost surprises before they show up at production scale.

Once the small-scale run succeeds, you scale up to the full dataset. The exam will sometimes phrase this as choosing between a big-bang cutover and a phased approach. The phased approach is the right answer.
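The phased approach can be sketched in plain Python. The job and sampling logic below are stand-ins (a real migration would submit the same Spark job to a small Dataproc cluster against a subset of the GCS data), but the shape is the point: run on a sample, validate, then run at full scale.

```python
import random

def run_job(records):
    """Stand-in for the real Spark/Hadoop job: a simple aggregation."""
    return sum(len(r) for r in records)

def phased_run(records, sample_fraction=0.01, seed=42):
    """Run the job on a small sample first, then on the full dataset."""
    random.seed(seed)
    sample_size = max(1, int(len(records) * sample_fraction))
    sample = random.sample(records, sample_size)

    # Phase 1: small-scale test catches config issues and cost surprises cheaply.
    sample_result = run_job(sample)
    assert sample_result >= 0, "sample run failed; fix before scaling up"

    # Phase 2: full-scale run, only after the sample succeeds.
    return run_job(records)

records = [f"record-{i}" for i in range(10_000)]
full = phased_run(records)
```

On Dataproc itself, phase one is the same job submitted against a smaller input path on a smaller cluster; nothing about the job code changes.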

Use ephemeral clusters

This is one of the most exam-relevant ideas in the whole Dataproc migration section. On-premises clusters are long-running by necessity. You bought the hardware, so you leave it on. On Dataproc, the opposite is true. You should think in terms of ephemeral clusters, which means you create a cluster when you need it, run your job, and delete the cluster when the job is done.

The reason this works is that your data lives in GCS, not on the cluster. Deleting the cluster does not delete your data. You only pay for compute while the job is actually running. If you see an exam question about a team running a long-lived Dataproc cluster around the clock, the better-practice answer almost always involves making that cluster ephemeral.
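One way to make the pattern concrete is a context manager that guarantees the delete step runs even when the job fails. `DataprocStub` here is a hypothetical stand-in for the real google-cloud-dataproc client (the method names are illustrative); the lifecycle is what matters: create, run, always delete.

```python
from contextlib import contextmanager

class DataprocStub:
    """Hypothetical stand-in for a Dataproc cluster client."""
    def __init__(self):
        self.clusters = set()

    def create_cluster(self, name):
        self.clusters.add(name)

    def delete_cluster(self, name):
        self.clusters.discard(name)

@contextmanager
def ephemeral_cluster(client, name):
    client.create_cluster(name)
    try:
        yield name  # submit jobs while the cluster exists
    finally:
        client.delete_cluster(name)  # compute goes away; data in GCS stays

client = DataprocStub()
with ephemeral_cluster(client, "etl-2025-10-22"):
    pass  # submit the Spark job here
assert "etl-2025-10-22" not in client.clusters  # no idle cluster left running
```

Because the delete runs in a `finally` block, a failed job still tears the cluster down, so there is never a forgotten cluster billing around the clock.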

Use the native cost-saving tools

Google Cloud gives you a few specific levers to control Dataproc costs, and the exam expects you to know them:

  • Autoscaling lets your cluster add and remove worker nodes based on YARN memory demand, so you are not paying for idle workers.
  • Preemptible nodes, which Dataproc attaches as secondary workers, are short-lived VMs that cost a fraction of standard nodes. They can be reclaimed by Google Cloud at any time and do not store HDFS data, but for fault-tolerant batch workloads like Spark and Hadoop they are a strong fit.
  • The Cloud Billing console lets you monitor and forecast spend, and budget alerts keep cost overruns from sneaking up on you.

If a question asks how to reduce Dataproc costs, expect the right answer to combine some of these with the ephemeral-cluster pattern above.
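On Dataproc, preemptible VMs run as secondary workers alongside a standard primary group, and a back-of-envelope comparison shows why the exam likes them. The hourly rates below are assumptions for illustration only, not real GCP pricing.

```python
# Assumed hourly rates for illustration only -- not real GCP pricing.
STANDARD_RATE = 0.20  # $/hour per standard worker (assumed)
SPOT_RATE = 0.05      # $/hour per preemptible secondary worker (assumed)

def hourly_cost(standard_workers, spot_workers):
    """Hourly cluster cost for a mix of standard and preemptible workers."""
    return standard_workers * STANDARD_RATE + spot_workers * SPOT_RATE

all_standard = hourly_cost(10, 0)  # 10 standard workers
mixed = hourly_cost(2, 8)          # small standard core, preemptible majority
savings = 1 - mixed / all_standard # fraction saved by mixing in preemptibles
```

Under these assumed rates, swapping eight of ten workers to preemptibles cuts the hourly bill by more than half; combine that with ephemeral clusters and autoscaling and the idle-time cost drops to zero as well.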

Aim for cloud-native and serverless

Dataproc is managed, but it is not serverless. You still pick machine types and worker counts. The longer-term goal of any migration should be to move toward a cloud-native and, where possible, serverless architecture. You may not get there in one step. Many teams use Dataproc as the intermediate landing zone for their Apache workloads and then gradually move the parts that fit better onto Dataflow, BigQuery, or other serverless services. The exam likes this framing, so when you see options that include a long-term plan to evolve past Dataproc, those are usually the better answers.

Where each piece of the Apache stack should land

The Professional Data Engineer exam also tests whether you know the natural Google Cloud destination for each piece of the Apache ecosystem after the migration is done:

  • HDFS data goes to Cloud Storage. GCS replaces HDFS as the durable storage layer.
  • Apache Hive and Apache Impala workloads, which are SQL engines on distributed data, typically move to BigQuery. BigQuery is the fully managed, serverless data warehouse and it is the natural target for those SQL analytics workloads.
  • Apache HBase, which is a NoSQL wide-column store, moves to Bigtable. Bigtable is the managed, highly scalable NoSQL equivalent on Google Cloud.

If you can recall those three mappings cleanly, you will pick up the easy points on this part of the exam without having to reason from scratch.

What to remember on exam day

When you see a Dataproc migration question on the Professional Data Engineer exam, walk through the same short checklist. Default to Dataproc for on-prem Spark and Hadoop. Move the data to GCS first. Test on a subset before scaling. Use ephemeral clusters, autoscaling, and preemptible nodes to control cost. Plan to evolve toward serverless. And remember the destination mappings for HDFS, Hive and Impala, and HBase. Those few rules will carry you through most of the migration scenarios the exam throws at you.

My Professional Data Engineer course covers Dataproc migration patterns, ephemeral clusters, and the rest of the data-processing domain in detail.
