
Cloud Storage upload mechanics show up on the Professional Data Engineer exam in a quiet but consistent way. You will not see a question titled "how do you upload a file to GCS", but you will see scenarios where a candidate has to pick between gcloud and gsutil, decide whether to enable parallel composite uploads, reason about why a long-running transfer started returning 403 errors, or explain what happens when a gzipped object is served to a client. None of these are hard topics on their own. They become hard when they are mixed into a larger question about a data pipeline and you have to spot the right knob to turn.
This article walks through the four upload-side concepts I drill into my Professional Data Engineer students: command line uploads with gcloud and gsutil, partitioning large files during upload, compression with decompressive transcoding, and expired credentials during long transfers.
There are two command line tools that can push data into Cloud Storage from a local machine or a Compute Engine VM. The newer one is gcloud storage; the older one is gsutil. They do the same job with slightly different syntax, and you should be comfortable reading both because exam questions and Google's own documentation still mix them.
For a single file upload with gcloud:
gcloud storage cp [LOCAL_FILE] gs://[BUCKET_NAME]/

For an entire directory, add the recursive flag:
gcloud storage cp -r [DIRECTORY] gs://[BUCKET_NAME]/

gsutil is almost identical:
gsutil cp [LOCAL_FILE] gs://[BUCKET_NAME]/
gsutil cp -r [DIRECTORY] gs://[BUCKET_NAME]/

The -r flag is what makes the upload recursive so that subdirectories are included. If a question gives you a directory tree and asks why only the top-level files made it into the bucket, the missing flag is almost always the answer.
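As a concrete sketch, suppose a local directory named exports holds a handful of CSVs at the top level plus a subdirectory called archive (both names are illustrative):

# Without -r, only the top-level files are copied and the archive/ subdirectory is skipped
gsutil cp exports/* gs://your-bucket/exports/

# With -r, the subdirectory and everything inside it comes along too
gsutil cp -r exports gs://your-bucket/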
A single large file uploaded over a single HTTP stream is bottlenecked by that one stream. For data engineering workloads where you are pushing CSV exports, Parquet files, or backups that can easily run into the tens of gigabytes, partitioning the file and uploading the pieces in parallel makes a real difference in wall-clock time.
You have three tools that can do this: gsutil, gcloud storage, and the Storage Transfer Service. The one most exam questions point at is gsutil's parallel composite uploads feature. It automatically splits a large file into smaller components, uploads them in parallel, and then composes them into a single object on the server side.
You enable it by setting a threshold:
gsutil -o "GSUtil:parallel_composite_upload_threshold=100M" cp large_file.csv gs://your-bucket

Any file larger than the threshold gets the parallel treatment. Anything smaller goes up as a normal single-stream upload. The trade-off to be aware of is that composite objects carry a CRC32C checksum but no MD5 hash, so the download side needs a tool that can validate CRC32C (for gsutil, a compiled crcmod), and for some workflows you will want to either disable the feature or recompose the file after upload. For the Professional Data Engineer exam, the recognition piece matters most: when a question describes slow uploads of multi-gigabyte files and asks for a fix, parallel composite uploads is the answer.
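The -o flag applies the setting to a single invocation. If you want it on by default, the same option can live in gsutil's boto configuration file; a minimal sketch, assuming the default ~/.boto location and a 150 MB threshold:

[GSUtil]
# Files above this size are split, uploaded in parallel, and composed server-side
parallel_composite_upload_threshold = 150M

After that, a plain gsutil cp picks up the behavior without the -o override.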
Compressing files before upload, usually with gzip, reduces both transfer time and storage cost. The objection most people have to gzipping objects in Cloud Storage is that it complicates the read path. Clients have to decompress on receive, browsers might not handle it cleanly, and downstream tools might need extra config.
Cloud Storage solves that with decompressive transcoding. If an object is stored with Content-Encoding: gzip and you request it without an Accept-Encoding: gzip header, Cloud Storage decompresses the object on the fly before sending it. You store the compressed version, you pay the smaller storage bill, and the end user receives a normal uncompressed file. If the client does accept gzip, the object is served compressed so they get the network savings too.
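On the upload side, a short sketch of two common ways to end up with a gzipped object that carries the right Content-Encoding metadata (the file and bucket names are illustrative):

# Let gsutil compress during upload and set Content-Encoding: gzip automatically
gsutil cp -Z export.csv gs://your-bucket/

# Or gzip the file yourself and declare the encoding explicitly
gzip -k export.csv
gsutil -h "Content-Encoding:gzip" cp export.csv.gz gs://your-bucket/export.csv

Either way, a client that requests the object without an Accept-Encoding: gzip header receives the decompressed bytes.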
The exam framing here is usually "how do I get the cost savings of compressed storage without breaking compatibility with clients that do not handle gzip?" Decompressive transcoding is the one-line answer.
The last failure mode worth knowing is 403 errors that show up partway through a long-running data transfer. You saw the file start uploading, hundreds of gigabytes went through, and then suddenly the transfer started rejecting requests as forbidden.
The cause is almost always credential expiry. Temporary access tokens, signed URLs, and short-lived service account tokens all have lifetimes measured in hours. A multi-terabyte transfer can outlive them. The fixes are to re-authenticate and resume the transfer, to run it under credentials the tooling refreshes automatically (for example a service account attached to the VM), to issue signed URLs with an expiration long enough to cover the full transfer window, or to hand very long transfers to Storage Transfer Service, which manages its own credentials.
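A minimal sketch of the recovery path for an upload running under user credentials, assuming the interrupted gsutil cp left its resumable-upload tracker file behind (the file name is illustrative):

# Refresh the expired credentials
gcloud auth login

# Re-run the same copy; gsutil resumes from its tracker file rather than
# restarting the multi-gigabyte transfer from byte zero
gsutil cp large_backup.tar gs://your-bucket/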
When you see a 403 in a transfer scenario on the Professional Data Engineer exam, the question is almost always testing whether you can connect the error code to credential expiry rather than to a real permissions problem.
My Professional Data Engineer course covers Cloud Storage upload mechanics, transfer services, and the rest of the storage-and-ingestion section in the exam blueprint.