
If you have an on-premises Hadoop or Spark cluster and you are wondering where it should go on Google Cloud, the short answer for the Professional Data Engineer exam is Dataproc. That single rule of thumb covers most of the migration questions you will see, but the exam goes further. It tests whether you understand how to actually run that migration, what to move first, what kind of clusters to spin up, and where each piece of the old Apache stack should land once you are done. In this post I will walk through the best practices the exam expects you to know.
When in doubt, on-prem Apache Spark and Hadoop jobs should be migrated to Dataproc. That is the default answer. Dataproc is Google Cloud's managed Hadoop and Spark service, and it is built specifically to take existing Apache workloads and run them in the cloud without rewriting them.
The reason the exam emphasizes this is that it is the path of least resistance. You keep your Spark code, your Hive queries, your Pig scripts, and your Hadoop jobs. You move them to a managed cluster instead of running them on hardware you have to maintain yourself. There are situations where you would skip Dataproc and go straight to a serverless service like Dataflow or BigQuery, but if the question is framed as a migration of existing Apache workloads, Dataproc is the answer.
The first concrete step in any cluster migration is to move the data before you move the compute. Usually that means landing the data in Google Cloud Storage. GCS gives you a secure, durable place to park your datasets so that your Dataproc clusters can read from them on demand.
This ordering matters for a reason that is easy to miss. If you stand up a cluster before the data is available, you are paying for compute that has nothing to do. Worse, you tie your cluster lifecycle to your data lifecycle, which is the opposite of how cloud-native architectures work. Data goes to durable storage first. Compute reads from it on demand.
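As a concrete illustration, here is a minimal sketch of landing one exported file in Cloud Storage with the google-cloud-storage Python client. The project ID, bucket, and file paths are hypothetical placeholders; for bulk transfers you would more likely use the Storage Transfer Service or hadoop distcp with the GCS connector.

```python
# Minimal sketch: park a dataset in Cloud Storage before any cluster exists.
# All names below are hypothetical placeholders.
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-migration-landing-bucket")

# Upload one exported file. Bulk migrations would normally use the
# Storage Transfer Service or `hadoop distcp` instead of per-file uploads.
blob = bucket.blob("warehouse/events/part-00000.parquet")
blob.upload_from_filename("/data/exports/part-00000.parquet")

print(f"gs://{bucket.name}/{blob.name} is ready for Dataproc to read.")
```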
The next best practice is to perform small-scale testing on a subset of your data before you cut over the full workload. This is straightforward but easy to skip when you are under pressure to migrate. Running a job on a sample lets you catch configuration issues, version mismatches between your on-prem Spark and the Dataproc image, and cost surprises before they show up at production scale.
Once the small-scale run succeeds, you scale up to the full dataset. The exam will sometimes phrase this as choosing between a big-bang cutover and a phased approach. The phased approach is the right answer.
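One simple way to implement the subset-first approach is to parameterize the job's input path, so the test run and the production run are the same code. Below is a hedged sketch using the google-cloud-dataproc Python client; the project, region, cluster name, bucket, and script are all assumptions for illustration.

```python
# Sketch: run the identical PySpark job against a sample prefix first,
# then against the full dataset once it passes. Names are hypothetical.
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

PROJECT, REGION = "my-project", "us-central1"
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

def run_job(input_prefix: str) -> None:
    job = {
        "placement": {"cluster_name": "migration-test-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://my-bucket/jobs/transform.py",
            "args": [input_prefix],  # the script reads whatever prefix it is given
        },
    }
    # Submit and block until the job reaches a terminal state.
    operation = job_client.submit_job_as_operation(
        request={"project_id": PROJECT, "region": REGION, "job": job}
    )
    operation.result()

run_job("gs://my-bucket/warehouse/events/sample/")  # small-scale test
# run_job("gs://my-bucket/warehouse/events/")       # full cutover after it passes
```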
This is one of the most exam-relevant ideas in the whole Dataproc migration section. On-premises clusters are long-running by necessity. You bought the hardware, so you leave it on. On Dataproc, the opposite is true. You should think in terms of ephemeral clusters, which means you create a cluster when you need it, run your job, and delete the cluster when the job is done.
The reason this works is that your data lives in GCS, not on the cluster. Deleting the cluster does not delete your data. You only pay for compute while the job is actually running. If you see an exam question about a team running a long-lived Dataproc cluster around the clock, the better-practice answer almost always involves making that cluster ephemeral.
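Here is what that lifecycle can look like end to end, sketched with the dataproc_v1 Python client under hypothetical names: create the cluster, run the job, delete the cluster. In practice many teams express this same sequence as a Dataproc workflow template, which manages the ephemeral cluster for you.

```python
# Sketch of the ephemeral pattern: create, run, delete. The data stays in
# Cloud Storage, so deleting the cluster loses nothing. Names are hypothetical.
from google.cloud import dataproc_v1

PROJECT, REGION, NAME = "my-project", "us-central1", "ephemeral-etl"
options = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
clusters = dataproc_v1.ClusterControllerClient(client_options=options)
jobs = dataproc_v1.JobControllerClient(client_options=options)

# 1. Create the cluster only when there is work to do.
cluster = {
    "project_id": PROJECT,
    "cluster_name": NAME,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-standard-4"},
    },
}
clusters.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
).result()

# 2. Run the job to completion; it reads from and writes to gs:// paths.
jobs.submit_job_as_operation(
    request={
        "project_id": PROJECT,
        "region": REGION,
        "job": {
            "placement": {"cluster_name": NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
        },
    }
).result()

# 3. Delete the cluster; billing for compute stops, the data survives.
clusters.delete_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster_name": NAME}
).result()
```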
Google Cloud gives you a few specific levers to control Dataproc costs, and the exam expects you to know them:

- Autoscaling, so the cluster adds and removes workers to match the workload instead of sitting at peak size around the clock.
- Preemptible (Spot) secondary workers, which cost a fraction of standard VMs and suit fault-tolerant Spark and Hadoop jobs.
If a question asks how to reduce Dataproc costs, expect the right answer to combine some of these with the ephemeral-cluster pattern above.
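To see how these levers fit together, here is a sketch of a cluster config that combines preemptible secondary workers, an autoscaling policy, and an idle-deletion TTL as a safety net. The policy name and sizes are hypothetical.

```python
# Sketch: cost levers expressed as ordinary fields of a Dataproc cluster
# config. The autoscaling policy and instance counts are hypothetical.
import datetime
from google.cloud import dataproc_v1

config = dataproc_v1.ClusterConfig(
    worker_config=dataproc_v1.InstanceGroupConfig(num_instances=2),
    # Preemptible secondary workers: far cheaper, and fine for fault-tolerant
    # Spark stages because the data lives in GCS, not on the nodes.
    secondary_worker_config=dataproc_v1.InstanceGroupConfig(
        num_instances=4,
        preemptibility=dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
    ),
    # Autoscaling: attach a policy instead of hand-sizing for peak load.
    autoscaling_config=dataproc_v1.AutoscalingConfig(
        policy_uri=(
            "projects/my-project/regions/us-central1/"
            "autoscalingPolicies/etl-policy"
        )
    ),
    # Scheduled deletion: an idle cluster deletes itself, a useful backstop
    # for teams that have not fully adopted ephemeral clusters yet.
    lifecycle_config=dataproc_v1.LifecycleConfig(
        idle_delete_ttl=datetime.timedelta(hours=2)
    ),
)
```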
Dataproc is managed, but it is not serverless. You still pick machine types and worker counts. The longer-term goal of any migration should be to move toward a cloud-native and, where possible, serverless architecture. You may not get there in one step. Many teams use Dataproc as the intermediate landing zone for their Apache workloads and then gradually move the parts that fit better onto Dataflow, BigQuery, or other serverless services. The exam likes this framing, so when you see options that include a long-term plan to evolve past Dataproc, those are usually the better answers.
The Professional Data Engineer exam also tests whether you know the natural Google Cloud destination for each piece of the Apache ecosystem after the migration is done:

- HDFS maps to Cloud Storage, which is where your data already lives after the migration.
- Hive and Impala map to BigQuery for SQL analytics.
- HBase maps to Bigtable, which exposes an HBase-compatible API.
If you can recall those three mappings cleanly, you will pick up the easy points on this part of the exam without having to reason from scratch.
When you see a Dataproc migration question on the Professional Data Engineer exam, walk through the same short checklist. Default to Dataproc for on-prem Spark and Hadoop. Move the data to GCS first. Test on a subset before scaling. Use ephemeral clusters, autoscaling, and preemptible nodes to control cost. Plan to evolve toward serverless. And remember the destination mappings for HDFS, Hive and Impala, and HBase. Those few rules will carry you through most of the migration scenarios the exam throws at you.
My Professional Data Engineer course covers Dataproc migration patterns, ephemeral clusters, and the rest of the data-processing domain in detail.