
One of the most testable patterns on the Google Cloud Professional Data Engineer exam is the storage decision for Dataproc workloads. You lift a Hadoop or Spark job off premises, you land it on a Dataproc cluster, and then you have to answer a deceptively simple question: where does the data live? The default answer is Cloud Storage through the Cloud Storage Connector. The exception is HDFS on the cluster itself. Knowing which one to pick, and why, is the difference between confidently eliminating two distractors and second-guessing yourself on exam day.
I want to walk through the tradeoff the way I think about it when I am working through PDE practice questions, and then show how the exam tends to frame it.
When you migrate a Hadoop or Spark job to Dataproc, the compute moves into the cluster, but the data does not have to. The Cloud Storage Connector is a shim that lets jobs running on Dataproc read and write data directly from a Cloud Storage bucket as if it were HDFS. In your job code or your Hive table definitions, you swap the hdfs:// prefix for gs:// and the job keeps working. That is the whole interface change.
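The interface change is small enough to show in two commands. A minimal sketch, assuming an illustrative bucket name (my-bucket) and dataset path; on a Dataproc cluster the standard Hadoop file system tools accept both schemes:

```shell
# On-premises (or cluster-local HDFS): list a dataset
hadoop fs -ls hdfs:///data/events/

# Same tool, same flags, after the data moves to Cloud Storage;
# the Cloud Storage Connector resolves the gs:// scheme
hadoop fs -ls gs://my-bucket/data/events/
```

The same swap applies inside Spark code and Hive LOCATION clauses: anywhere a job takes an hdfs:// URI, a gs:// URI works, and the connector ships preinstalled on Dataproc clusters.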
This is not a workaround. It is the recommended pattern for almost every Dataproc workload on Google Cloud. The reasons stack up quickly:

- The code change is minimal: swap hdfs:// for gs:// and the job keeps running.
- The data outlives the cluster, so you can delete a cluster the moment its job finishes and run cheap, ephemeral, job-scoped clusters.
- Cloud Storage costs less than the persistent disks you would have to provision, and keep attached, to hold the same data in HDFS.
- Storage is decoupled from compute, so several clusters, and services like BigQuery and Dataflow, can read the same data without copying it.
The combination of those last three is what makes the Cloud Storage Connector the default for the Professional Data Engineer exam. Whenever a question describes a Dataproc migration without any unusual latency or feature constraints, the right move is to put the data in Cloud Storage and use the connector to access it.
HDFS is not deprecated on Dataproc. It comes with the cluster, it works, and there are three specific cases where the exam expects you to reach for it:

- Workloads that need very low latency: many small reads and writes or frequent metadata operations, where cluster-local disks beat a network round trip to Cloud Storage.
- Workloads that rename directories or append to files heavily. A Cloud Storage "rename" is an object-by-object copy and delete, and objects cannot be appended to, so these patterns run far better on HDFS.
- Workloads that depend on HDFS-specific features, such as fine-grained control over block replication.
If a PDE exam question does not mention any of those constraints, you should default to the Cloud Storage Connector. If it does mention them, HDFS becomes a viable answer and you should start ruling out the Cloud Storage-only options.
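One HDFS-specific feature the exam likes to cite is fine-grained replication control. A hedged sketch with an illustrative path; the command only makes sense for data that actually lives in HDFS, because Cloud Storage manages redundancy for you and exposes no per-path replication knob:

```shell
# Set (and wait for) a replication factor of 2 on a hot dataset in HDFS.
# There is no Cloud Storage equivalent of this per-path control.
hdfs dfs -setrep -w 2 /data/hot/
```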
The pattern that shows up most often in PDE practice questions is a hybrid: data lands in Cloud Storage, but part of it needs to be in HDFS for performance or compatibility. The exam wants you to recognize that this is not an either-or choice. Two methods are valid inside Dataproc:

- Copy the data from Cloud Storage straight into HDFS with hadoop distcp, reading the gs:// source through the connector.
- Pull the data down to the cluster's local file system with gsutil, then load it into HDFS with hdfs dfs -copyFromLocal (or -put).
Both work. The first option leans on the connector and is the more cloud-native flow. The second is the older Hadoop way of doing it and is fine when you need very direct control over what ends up on the local file system. When you see a question that asks for two valid approaches, those are usually the two it is looking for.
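A sketch of those two loading paths, with an illustrative bucket name (my-bucket) and destination path; both run from inside the Dataproc cluster:

```shell
# Option 1: copy straight from Cloud Storage into HDFS through the connector
hadoop distcp gs://my-bucket/data/events hdfs:///data/events

# Option 2: pull onto the local file system first, then load into HDFS
gsutil cp gs://my-bucket/data/events/part-00000.avro /tmp/
hdfs dfs -copyFromLocal /tmp/part-00000.avro /data/events/
```

Option 1 runs as a distributed copy job and scales with the cluster; option 2 funnels everything through one machine's disk, which is exactly why it is only worth choosing when you need that local-file-system control.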
My rule of thumb for the Professional Data Engineer exam: read the scenario, then look for the words that would push you off the default. Phrases like "low latency," "local processing," "fine-grained replication," or "HDFS-specific features" are the signals that HDFS is in play. Without any of those signals, the Cloud Storage Connector with a gs:// path is the answer, and any option that talks about staging the data in HDFS first is usually a distractor.
One more thing worth internalizing: the Cloud Storage Connector is not just for migrations. It is the steady-state storage interface for Dataproc on Google Cloud. Treat HDFS as the exception you have to justify, not the default you have to argue against.
My Professional Data Engineer course covers Dataproc storage patterns, the Cloud Storage Connector, and the BigQuery and Bigtable connectors in the depth the exam requires.