Orchestrating GCP Services with Managed Airflow for the PDE Exam

February 26, 2026

Managed Service for Apache Airflow (formerly Cloud Composer) questions on the Professional Data Engineer exam almost always come down to one capability: managing dependencies between steps that run on different GCP services. The exam will give you a scenario where a Dataflow job needs to wait for a BigQuery load, or where Agent Platform (formerly Vertex AI) training has to kick off only after fresh data lands in a table, and you need to know which Airflow operators wire those pieces together. This article walks through the operator vocabulary I expect every PDE candidate to have memorized before exam day.

Why Managed Airflow wins these scenarios on the exam

Managed Airflow is Apache Airflow run as a managed service. That distinction matters because Airflow is what gives you the dependency graph. When a question describes a multi-step pipeline where step B must not start until step A succeeds, and step C runs only if both A and B finished cleanly, you are looking at a DAG (directed acyclic graph). Cloud Scheduler can fire a single job on a cron. Workflows can chain a small number of HTTP calls. Neither one was designed to express dependencies between services with retries, branching, and backfills, which is exactly the territory Managed Airflow owns.
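To make the A, B, C example concrete, here is a minimal plain-Python sketch of what a dependency graph buys you. It uses the standard library's graphlib rather than Airflow itself, and the step names are hypothetical; the point is only that once dependencies are declared, a valid run order can be derived instead of hand-coded.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each key maps to the set of steps it must wait for.
dependencies = {
    "step_a": set(),
    "step_b": {"step_a"},             # B must not start until A succeeds
    "step_c": {"step_a", "step_b"},   # C needs both A and B
}


def run_order(deps):
    """Return an execution order that respects every declared dependency."""
    return list(TopologicalSorter(deps).static_order())


print(run_order(dependencies))
```

An orchestrator such as Airflow layers scheduling, retries, and state tracking on top of this idea, but the declared graph is the core of it.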

If a PDE question mentions phrases like orchestrate, complex dependencies, retry policy across services, or end-to-end pipeline with multiple stages, the answer is almost always Managed Airflow. Lock that pattern in.

BigQuery operators

BigQuery is the most common destination in PDE pipelines, so the BigQuery operators show up everywhere. The two you should recognize on sight are BigQueryInsertJobOperator (the modern way to run SQL) and BigQueryCheckOperator (a guard that fails the task if the query returns no rows or if any value in the first returned row is zero, empty, or otherwise falsy, which makes it useful for data quality gates).

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

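# Runs a standard SQL statement as a BigQuery query job; the task succeeds
# only if the job completes without error.
run_aggregation = BigQueryInsertJobOperator(
    task_id="daily_aggregation",
    # The configuration dict follows the shape of a BigQuery job configuration.
    configuration={
        "query": {
            "query": "INSERT INTO analytics.daily_metrics SELECT ... FROM raw.events",
            "useLegacySql": False,  # use standard SQL, not legacy SQL
        }
    },
)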

If you see older study material referencing BigQueryOperator, that is a legacy operator for the same job: running a query as a task. The exam tends to phrase questions in terms of the capability rather than the exact class, so the point to retain is that Managed Airflow runs SQL on BigQuery as a step in a DAG.
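Returning to the data quality gate: the pass/fail rule behind a check-style operator can be modeled in a few lines of plain Python. This is a simplified sketch of the rule as I understand it (look only at the first row, cast each value to a boolean), not the operator's actual source; check_passes is a hypothetical helper name.

```python
def check_passes(rows):
    """Simplified model of a SQL check gate.

    Passes only if the query returned at least one row and every value
    in the first row is truthy (non-zero, non-empty, not None).
    """
    if not rows:
        return False
    return all(bool(value) for value in rows[0])


# A row-count query such as SELECT COUNT(*) returns a single row.
print(check_passes([(1250,)]))  # rows were loaded
print(check_passes([(0,)]))     # nothing was loaded
print(check_passes([]))         # the query returned no rows at all
```

In a DAG, a failed check stops everything downstream of it under the default trigger rule, which is what makes it work as a gate.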

Loading from Cloud Storage

A very common pipeline shape: files land in GCS, get loaded into BigQuery on a schedule, and trigger downstream processing. The operator that does the middle step is GCSToBigQueryOperator. The exam loves this pattern because it tests whether you reach for Managed Airflow instead of writing a Cloud Run Function (formerly Cloud Function) that polls a bucket.

from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

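load_events = GCSToBigQueryOperator(
    task_id="load_events",
    bucket="raw-events-prod",
    # Templated path: only the files for this run's logical date are loaded.
    source_objects=["events/{{ ds }}/*.json"],
    # The $YYYYMMDD suffix is a partition decorator targeting a single day.
    destination_project_dataset_table="raw.events${{ ds_nodash }}",
    source_format="NEWLINE_DELIMITED_JSON",
    # Overwrite the target so reruns and backfills do not duplicate rows.
    write_disposition="WRITE_TRUNCATE",
)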

Notice the Jinja templating with {{ ds }}. Managed Airflow fills in the logical execution date at runtime, which is how Airflow handles partitioned ingestion and backfills cleanly. That templating is a recurring PDE tell that something is an Airflow-shaped problem.
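To see why templating makes backfills clean, here is a small plain-Python sketch of what the two template values look like for a given logical date. The render function is a deliberately tiny stand-in for Jinja that handles only these two exact placeholders; it is a hypothetical helper for illustration, not how Airflow renders templates internally.

```python
from datetime import date, timedelta


def template_values(logical_date):
    """Build the two date strings used in the operator example above."""
    return {
        "ds": logical_date.isoformat(),                  # YYYY-MM-DD
        "ds_nodash": logical_date.strftime("%Y%m%d"),    # YYYYMMDD
    }


def render(template, logical_date):
    """Tiny stand-in for Jinja: replaces only '{{ ds }}' and '{{ ds_nodash }}'."""
    rendered = template
    for name, value in template_values(logical_date).items():
        rendered = rendered.replace("{{ " + name + " }}", value)
    return rendered


# A three-day backfill: each run gets its own source path and partition.
start = date(2026, 2, 24)
for offset in range(3):
    day = start + timedelta(days=offset)
    print(render("events/{{ ds }}/*.json", day),
          "->",
          render("raw.events${{ ds_nodash }}", day))
```

Because every run derives its inputs and outputs from its own logical date, rerunning a past date touches only that date's files and partition.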

Dataflow operators

Dataflow is where most heavy transformation happens in a Google-native data stack, and Managed Airflow is how you schedule it. There are two flavors of operator worth knowing.

DataflowTemplatedJobStartOperator launches a Dataflow job from a Google-provided or custom template. This is the right answer when the exam describes a parameterized batch job that runs on a schedule.
BeamRunPythonPipelineOperator (and its Java equivalent) runs a pipeline directly from source. Use this when the pipeline code lives in your repo and you want Managed Airflow to submit it as a new job on each run.

The exam scenario to anchor on is the classic load then transform chain: load_events >> run_dataflow >> run_aggregation. Under Airflow's default trigger rule, the Dataflow job starts only after the BigQuery load reports success, and the aggregation query runs only after the Dataflow job finishes successfully.
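If the >> syntax looks like magic, this toy sketch shows the mechanics. The Task class here is a hypothetical stand-in that only records dependencies; it is not Airflow's operator class. It mirrors one real behavior: a >> b registers b as downstream of a and evaluates to b, which is why chains read left to right.

```python
class Task:
    """Hypothetical stand-in for an Airflow task; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []  # task_ids this task must wait for

    def __rshift__(self, other):
        # `a >> b` records that b waits for a, then returns b
        # so that `a >> b >> c` keeps chaining.
        other.upstream.append(self.task_id)
        return other


load_events = Task("load_events")
run_dataflow = Task("run_dataflow")
run_aggregation = Task("run_aggregation")

load_events >> run_dataflow >> run_aggregation

for task in (load_events, run_dataflow, run_aggregation):
    print(task.task_id, "waits for", task.upstream)
```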

Managed Spark operators

For Spark and Hadoop workloads on Managed Service for Apache Spark (formerly Dataproc), DataprocSubmitJobOperator is the standard, paired with DataprocCreateClusterOperator and DataprocDeleteClusterOperator when you want ephemeral clusters. The PDE exam favors ephemeral clusters for cost reasons, so a typical answer pattern is create cluster, submit job, delete cluster, all as tasks in the same DAG, with the delete task set to run regardless of upstream success so you never leave a cluster orphaned.
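In Airflow, "run regardless of upstream success" is expressed with a trigger rule: tasks default to all_success, and the delete task would be given all_done instead. The sketch below is a simplified plain-Python model of those two rules, not Airflow's scheduler code; should_run and the state set are hypothetical names chosen for illustration.

```python
# States in which an upstream task is no longer going to change.
FINISHED = {"success", "failed", "skipped", "upstream_failed"}


def should_run(trigger_rule, upstream_states):
    """Simplified model of two Airflow trigger rules."""
    if trigger_rule == "all_success":
        # Default behavior: every upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Run once every upstream task has finished, whatever the outcome.
        return all(state in FINISHED for state in upstream_states)
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# The Spark job failed: a default task would not run, but cleanup still does.
print(should_run("all_success", ["failed"]))
print(should_run("all_done", ["failed"]))
```

With the delete task on all_done, a failed Spark job still leads to cluster teardown, which is the cost-control behavior the exam is looking for.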

Pub/Sub operators

Pub/Sub operators handle the publish side of event-driven workflows. PubSubPublishMessageOperator lets a DAG announce that a stage completed, which downstream consumers (including other DAGs or Cloud Run Functions) can react to. For the consume side, exam answers usually point to Dataflow with Pub/Sub as a source rather than an Airflow task polling a subscription, because Managed Airflow is for orchestration, not for being the streaming runtime itself.
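As a rough sketch of the publish side, the snippet below builds a completion message in the shape Pub/Sub expects: a bytes payload plus string attributes. My understanding is that the publish operator accepts a list of such dicts through its messages argument, but treat that as an assumption to verify against the provider docs. The completion_message helper and the field names inside the payload are hypothetical.

```python
import json


def completion_message(dag_id, stage, logical_date):
    """Build one Pub/Sub-style message announcing that a stage finished."""
    payload = {"dag_id": dag_id, "stage": stage, "logical_date": logical_date}
    return {
        # Pub/Sub message data is raw bytes, so the JSON is UTF-8 encoded.
        "data": json.dumps(payload, sort_keys=True).encode("utf-8"),
        # Attributes are string key/value pairs that subscribers can filter on.
        "attributes": {"stage": stage},
    }


messages = [completion_message("daily_pipeline", "transform", "2026-02-26")]
print(messages[0]["data"])
```

Keeping the routing key in attributes and the details in the payload lets downstream consumers filter without parsing the body.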

Agent Platform operators

ML pipelines on the PDE exam frequently look like ingest, transform, retrain, deploy. The Agent Platform operator family covers training jobs, batch prediction, and model deployment, so Managed Airflow can drive the retrain step as the last node in a data pipeline. The classic exam phrasing is something like kick off model retraining after the feature table refresh completes, and the answer is a Managed Airflow DAG with the Agent Platform training operator downstream of the BigQuery refresh.

End-to-end pipeline pattern to memorize

The single most useful mental model for the exam is this five-stage chain:

Ingest: GCSToBigQueryOperator pulls files landing in a bucket into a raw table.
Validate: BigQueryCheckOperator confirms row counts and freshness.
Transform: DataflowTemplatedJobStartOperator or BigQueryInsertJobOperator builds the clean dataset.
Train or aggregate: Agent Platform operator retrains the model, or another BigQuery job builds reporting tables.
Notify: PubSubPublishMessageOperator emits a completion event for downstream consumers.

If you can sketch that DAG on a whiteboard with the right operators, you will handle most Managed Airflow questions on the Professional Data Engineer exam without breaking stride.

Multi-cloud is fair game

One detail that occasionally trips people up: Managed Airflow is not limited to orchestrating GCP services. Airflow has provider packages for AWS and Azure, so a single DAG can orchestrate an S3 transfer, an Azure VM job, and a BigQuery load in the same workflow. The exam rarely asks deep multi-cloud questions, but if a scenario mentions cross-cloud dependencies, Managed Airflow remains a valid answer.

What to lock in before exam day

Recognize Managed Airflow as the right tool whenever a question describes dependencies, retries, or backfills across multiple services. Be able to name the operator family that runs SQL on BigQuery, loads from GCS, launches Dataflow, submits Managed Service for Apache Spark (formerly Dataproc) jobs, and triggers Agent Platform. Understand that Managed Airflow orchestrates other services rather than processing data itself, and that the value comes from the DAG, not from any single task.

My Professional Data Engineer course covers Managed Airflow in depth alongside Dataflow, BigQuery, and Agent Platform so you can recognize orchestration scenarios on sight and pick the right operator family every time.