Cloud Storage Connector vs HDFS on Dataproc for the PDE Exam

GCP Study Hub
October 24, 2025

One of the most testable patterns on the Google Cloud Professional Data Engineer exam is the storage decision for Dataproc workloads. You lift a Hadoop or Spark job off premises, you land it on a Dataproc cluster, and then you have to answer a deceptively simple question: where does the data live? The default answer is Cloud Storage through the Cloud Storage Connector. The exception is HDFS on the cluster itself. Knowing which one to pick, and why, is the difference between confidently eliminating two distractors and second-guessing yourself on exam day.

I want to walk through the tradeoff the way I think about it when I am working through PDE practice questions, and then show how the exam tends to frame it.

What the Cloud Storage Connector actually does

When you migrate a Hadoop or Spark job to Dataproc, the compute moves into the cluster, but the data does not have to. The Cloud Storage Connector is a shim that lets jobs running on Dataproc read data from and write data to a Cloud Storage bucket as if it were HDFS. In your job code or your Hive table definitions, you swap the hdfs:// prefix for gs:// and the job keeps working. That is the whole interface change.
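
To make the interface change concrete, here is a minimal PySpark sketch. The bucket name, paths, and column name are placeholders rather than anything from a real migration; the point is only that the paths move from hdfs:// to gs:// while the rest of the job stays the same.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("connector-sketch").getOrCreate()

    # Before the migration, the job read from cluster-local HDFS:
    # df = spark.read.parquet("hdfs:///data/transactions/")

    # With the Cloud Storage Connector, only the path prefix changes:
    df = spark.read.parquet("gs://example-bucket/data/transactions/")

    (df.groupBy("account_id").count()
       .write.mode("overwrite")
       .parquet("gs://example-bucket/output/transaction_counts/"))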

This is not a workaround. It is the recommended pattern for almost every Dataproc workload on Google Cloud. The reasons stack up quickly:

  • HDFS compatibility. Jobs that were written against HDFS paths run against Cloud Storage with the prefix change. You do not rewrite Spark code or Hive DDL.
  • Interoperability. Anything else in Google Cloud that reads from Cloud Storage can read the same data. BigQuery loads, Dataflow jobs, and downstream Spark jobs all share one source of truth.
  • Data outlives the cluster. If you shut the Dataproc cluster down, your data is still in the bucket. With HDFS, killing the cluster kills the data with it. This is what makes ephemeral Dataproc clusters viable in the first place.
  • High availability. Multi-region buckets replicate your data across geographically separated regions without you running any of the replication machinery.
  • No file system management. No fsck, no NameNode tuning, no version upgrades or rollbacks. Google runs the storage layer.

The combination of those last three is what makes the Cloud Storage Connector the default for the Professional Data Engineer exam. Whenever a question describes a Dataproc migration without any unusual latency or feature constraints, the right move is to put the data in Cloud Storage and use the connector to access it.

When HDFS on the cluster is still the right answer

HDFS is not deprecated on Dataproc. It comes with the cluster, it works, and there are three specific cases where the exam expects you to reach for it.

  • Extremely low-latency data access. Cloud Storage is a remote object store. Every read crosses a network. For jobs where even small network round trips hurt, HDFS keeps the data on the same nodes that are running the compute, so the access path is local disk instead of a network call.
  • Advanced HDFS-specific features. Things like fine-grained replication policies, custom block placement, or other low-level HDFS configuration knobs do not have direct equivalents in Cloud Storage. If your existing workload depends on them, HDFS is the only option that preserves the behavior.
  • Local storage for faster processing. This is closely related to the latency case but worth separating. When you have a large dataset that you are going to scan many times during a job, copying it once to HDFS so subsequent reads are local can outperform repeatedly streaming from Cloud Storage.
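
The third case is the easiest one to see in code. The sketch below, with placeholder paths and a hypothetical event column, pays the Cloud Storage read once, lands the data in cluster HDFS, and runs the repeated passes against the local copy.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("local-hdfs-sketch").getOrCreate()

    # Pay the network-bound read from Cloud Storage once and land the
    # data in cluster HDFS.
    spark.read.parquet("gs://example-bucket/data/clickstream/") \
        .write.mode("overwrite").parquet("hdfs:///data/clickstream/")

    # Subsequent scans hit local disks on the cluster instead of crossing
    # the network to Cloud Storage every time.
    local_df = spark.read.parquet("hdfs:///data/clickstream/")
    session_count = local_df.filter("event = 'session_start'").count()
    click_count = local_df.filter("event = 'click'").count()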

If a PDE exam question does not mention any of those constraints, you should default to the Cloud Storage Connector. If it does mention them, HDFS becomes a viable answer and you should start ruling out the Cloud Storage-only options.

The hybrid case the exam loves

The pattern that shows up most often in PDE practice questions is a hybrid: data lands in Cloud Storage, but part of it needs to be in HDFS for performance or compatibility. The exam wants you to recognize that this is not an either-or choice. Two methods are valid inside Dataproc:

  • Use the Cloud Storage Connector to read the files where they already are, define them as external Hive tables, and replicate the subset you need into HDFS.
  • Move the data to the cluster's master node, copy it into HDFS with the hadoop command-line utility, and point the jobs that need local access at the HDFS paths.

Both work. The first option leans on the connector and is the more cloud-native flow. The second is the older Hadoop way of doing it and is fine when you need very direct control over what ends up on the local file system. When you see a question that asks for two valid approaches, those are usually the two it is looking for.
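
For the first option, here is a sketch of what the flow can look like from Spark SQL, assuming hypothetical table, column, and bucket names. The external table is defined over the Parquet files where they already sit in Cloud Storage, and only the subset that needs local access is replicated into HDFS.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hybrid-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # External Hive table over the files where they already live in Cloud Storage.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events_gcs (user_id STRING, event STRING, ts TIMESTAMP)
        STORED AS PARQUET
        LOCATION 'gs://example-bucket/events/'
    """)

    # Replicate only the subset that needs local, repeated access into HDFS.
    (spark.sql("SELECT * FROM events_gcs WHERE ts >= '2025-01-01'")
        .write.mode("overwrite")
        .parquet("hdfs:///data/events_recent/"))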

How to answer the storage question on exam day

My rule of thumb for the Professional Data Engineer exam: read the scenario, then look for the words that would push you off the default. Phrases like "low latency," "local processing," "fine-grained replication," or "HDFS-specific features" are the signals that HDFS is in play. Without any of those signals, the Cloud Storage Connector with a gs:// path is the answer, and any option that talks about staging the data in HDFS first is usually a distractor.

One more thing worth internalizing: the Cloud Storage Connector is not just for migrations. It is the steady-state storage interface for Dataproc on Google Cloud. Treat HDFS as the exception you have to justify, not the default you have to argue against.

My Professional Data Engineer course covers Dataproc storage patterns, the Cloud Storage Connector, and the BigQuery and Bigtable connectors in the depth the exam requires.
