Cloud Storage Notifications and Pipeline Triggers for the PDE Exam

GCP Study Hub
December 1, 2025

Cloud Storage is rarely the end of the story in a data pipeline. An object lands in a bucket, and almost always something downstream needs to react to it. A Cloud Function should fire, a Pub/Sub topic should get a message, or a Composer DAG should pick up the new data and run a job. The Professional Data Engineer exam tests whether you understand how that reaction is wired together, and the answer comes down to Cloud Storage notifications.

I want to walk through how these notifications work, what they can trigger, and the specific patterns I see show up in exam scenarios. If you understand the moving parts here, a whole category of pipeline-design questions becomes straightforward.

What a Cloud Storage notification actually is

When an object is uploaded to a bucket or changed in some way, Cloud Storage can be configured to generate a notification. That notification is the signal that something happened inside the bucket. On its own it does nothing useful. The value comes from wiring that signal into another GCP service that runs your logic in response.

The four trigger events you need to know are:

  • Object finalized, which fires when a new object is created or an existing object is overwritten
  • Object deleted
  • Object metadata updated
  • Object archived

Object creation is the most common case in real workloads and on the exam, but if you see a question about reacting to a deletion or a metadata change, the same mechanism applies. It is one feature with multiple event types, not several different features.
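
Configuring one of these is a small amount of setup. Here is a minimal sketch using the google-cloud-storage Python client; the underlying mechanism publishes to a Pub/Sub topic (covered as a destination in the next section), and the bucket and topic names are placeholders for illustration.

    from google.cloud import storage
    from google.cloud.storage.notification import (
        JSON_API_V1_PAYLOAD_FORMAT,
        OBJECT_FINALIZE_EVENT_TYPE,
    )

    client = storage.Client()
    bucket = client.bucket("example-landing-bucket")  # placeholder bucket name

    # Publish a JSON payload to the topic whenever a new object is created
    # or an existing object is overwritten in this bucket.
    notification = bucket.notification(
        topic_name="example-object-events",  # placeholder topic name
        event_types=[OBJECT_FINALIZE_EVENT_TYPE],
        payload_format=JSON_API_V1_PAYLOAD_FORMAT,
    )
    notification.create()

Swapping in the other event-type constants gives you the deletion, metadata-update, and archive variants of the same setup.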

What the notification can trigger

From a single object upload, Cloud Storage can directly trigger:

  • A Cloud Function, which runs a piece of serverless code in response to the event
  • A Pub/Sub message sent to a topic, which any number of subscribers can then consume
  • A Cloud Run endpoint (reached through an Eventarc trigger), which is the right pick when the work needs more memory, a longer runtime, or a containerized service rather than a quick function

These three destinations cover almost every exam scenario. If the question describes lightweight processing on each file, like compressing an image, converting a format, or running a validation script on a small dataset, Cloud Functions is the natural answer. If the workflow needs to fan out to multiple consumers, or you want to decouple producers and consumers, route the notification through Pub/Sub. If the work is heavier or already lives in a container, Cloud Run is the better fit.
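
To make the lightweight case concrete, here is roughly what a Python Cloud Function triggered by an object-finalize event looks like, using the CloudEvents-style runtime. The function name and the work done inside it are illustrative placeholders.

    import functions_framework

    @functions_framework.cloud_event
    def handle_new_object(cloud_event):
        """Runs once per object-finalize event delivered to this function."""
        data = cloud_event.data
        bucket = data["bucket"]
        name = data["name"]
        # Placeholder for the real per-file work: validate, convert, or
        # compress the object, then hand it to the next stage.
        print(f"New object landed: gs://{bucket}/{name}")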

The Composer DAG trigger pattern

One specific pattern shows up often enough that it is worth memorizing. You have a data pipeline orchestrated by Cloud Composer, and you want a DAG to run as soon as a new file lands in a bucket. The chain looks like this:

  • A user or upstream system uploads an object to a Cloud Storage bucket
  • The object creation generates a notification that triggers a Cloud Function
  • The Cloud Function calls the Airflow API to kick off a specific DAG in Composer
  • The DAG runs and processes the new data

The reason a Cloud Function sits in the middle is that Cloud Storage cannot call the Airflow REST API directly. The function is the glue: it receives the event, extracts the object metadata it needs, and makes the authenticated API call to Composer. From there the DAG owns the rest of the workflow.
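
As a sketch of what that glue function might look like, assuming a Composer 2 environment running Airflow 2 (whose stable REST API exposes a dagRuns endpoint), the web server URL and DAG ID below are placeholders for your environment:

    import functions_framework
    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholders: your Composer environment's Airflow web server URL and the DAG to run.
    AIRFLOW_WEB_URL = "https://example-dot-us-central1.composer.googleusercontent.com"
    DAG_ID = "process_new_file"

    @functions_framework.cloud_event
    def trigger_dag(cloud_event):
        """Fired by an object-finalize notification; starts one DAG run in Composer."""
        data = cloud_event.data
        credentials, _ = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )
        session = AuthorizedSession(credentials)
        # Hand the bucket and object name to the DAG as run configuration so the
        # DAG knows which file just arrived.
        payload = {"conf": {"bucket": data["bucket"], "name": data["name"]}}
        response = session.post(
            f"{AIRFLOW_WEB_URL}/api/v1/dags/{DAG_ID}/dagRuns", json=payload
        )
        response.raise_for_status()

The exact authentication details depend on how your Composer environment's web server access is configured, so treat this as the shape of the call rather than a drop-in implementation.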

If the exam describes a scenario where files arrive irregularly and you need a Composer DAG to run each time, this is the architecture to reach for. Polling on a schedule is the wrong answer because it either runs too often and wastes resources or runs too rarely and adds latency.

Cloud Storage as a component of other services

Notifications are the event-driven story, but Cloud Storage also shows up as a static input or output in pipelines that other services orchestrate. Three integrations come up:

  • Dataflow reads from and writes to Cloud Storage as both source and sink. A common pattern is to stage raw files in a bucket, run a Dataflow job that transforms them, and write the output back to another bucket for downstream consumption
  • Dataproc uses Cloud Storage as the storage layer for Spark and Hadoop jobs, replacing HDFS. The Cloud Storage connector lets your jobs read from and write to gs:// paths transparently. This is cheaper than running HDFS on persistent disks and survives cluster shutdowns, which is why ephemeral Dataproc clusters with Cloud Storage as the data layer are the standard architecture
  • BigQuery can query data in Cloud Storage directly through external tables. You define the schema, point the table at the bucket, and run SQL without loading anything. This is useful for large or rarely-queried datasets where the cost of loading would not pay off (a minimal sketch follows below)

The distinction worth holding in your head is that these three are batch integrations. Cloud Storage is providing data to a job that runs on a schedule or on demand. Notifications, by contrast, are the event-driven layer that says something just happened and lets a downstream service react in near real time.
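
To make the external-table integration concrete, here is a minimal sketch using the BigQuery Python client. The project, dataset, table, and bucket names, along with the two schema fields, are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table("example-project.analytics.raw_events_ext")  # placeholder

    # Point the table at CSV files in the bucket; nothing is loaded into BigQuery.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://example-raw-bucket/events/*.csv"]  # placeholder
    external_config.schema = [
        bigquery.SchemaField("event_id", "STRING"),
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
    ]

    table.external_data_configuration = external_config
    client.create_table(table)

Queries against this table scan the files in place, which is why it suits data you touch rarely enough that loading it would not pay off.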

What this looks like on the exam

When you see a Professional Data Engineer question that mentions a file landing in Cloud Storage and a downstream system needing to run, the candidate answers usually narrow to three categories: the notification + Cloud Function path, the notification + Pub/Sub path, or the notification + Cloud Function + Composer path for orchestrated pipelines. Reading the question carefully for clues about scale, fan-out, and existing infrastructure usually tells you which one applies.

If you can sketch the diagram from upload to downstream action without checking, you are in good shape for this section.

My Professional Data Engineer course covers Cloud Storage notifications, pipeline trigger patterns, and the rest of the GCS integrations you need for the exam.
