Managed Airflow for the PDE Exam: Managed Airflow and DAGs

February 17, 2026

Managed Service for Apache Airflow (formerly Cloud Composer) shows up on the Professional Data Engineer exam in a very specific way. The exam wants to know whether you can recognize the kind of pipeline that needs an orchestrator, and whether you can pick Managed Airflow over Dataflow, Workflows, or a Cloud Scheduler plus Cloud Run functions (formerly Cloud Functions) combo when the scenario calls for it. The way I think about it: Managed Airflow is the answer when the question is about coordinating many steps across many services, not about transforming a single stream of data.

What Managed Airflow actually is

Managed Airflow is a managed implementation of Apache Airflow. Airflow is an open source framework, originally built at Airbnb and later donated to the Apache Software Foundation, that lets you create, schedule, manage, and monitor data workflows as code. When Google runs Airflow as a managed service on GCP, it handles most of the infrastructure heavy lifting for you: provisioning, patching, the underlying GKE cluster Airflow runs on, the metadata database, and the web server are all set up when you create a Managed Airflow environment.

That said, Managed Airflow is not no-ops. It is low-ops. You still configure things like environment size, worker autoscaling parameters, the Airflow version, the Python packages installed in the environment, and the network configuration. On the exam, if you see a scenario describing a team that wants a fully managed workflow tool but is willing to tune a few knobs, Managed Airflow fits. If the scenario insists on zero configuration, the answer is probably Workflows or Cloud Scheduler instead.

Why Airflow exists in the first place

Big data pipelines are usually complex, multi-step processes. A realistic workflow might pull from an API, land raw files in Cloud Storage, run a Managed Service for Apache Spark (formerly Dataproc) job, load the result into BigQuery, run a few SQL transformations, and then push aggregates to Bigtable or trigger a downstream ML training job. That kind of pipeline has a few characteristics that Airflow was designed to handle:

Resources span multiple services. No single GCP product owns the whole workflow.
There are complex dependencies between steps. Step five cannot start until steps three and four both finish, and step six should only run if step five succeeds.
Resources need to be cleaned up. Ephemeral Managed Spark clusters or temporary GCS buckets should be torn down once their step finishes, both for cost and tidiness.
The team needs a central view. Engineers, analysts, and on-call rotations need one place to see what ran, what failed, and what is scheduled next.
Scheduling is complicated. Some jobs run on cron-like time schedules, others are event-driven, and some have to wait on external signals.

If you read an exam scenario and most of those bullets apply, Managed Airflow is almost certainly the intended answer.
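
The dependency rules in that list can be sketched in plain Python, without Airflow installed. This is only an illustration: the step names are hypothetical, and the standard-library graphlib module stands in for the ordering work that Airflow's scheduler does on its own.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key maps to the set of steps it waits on.
pipeline = {
    "land_raw_files": {"pull_from_api"},
    "spark_job": {"land_raw_files"},
    "load_bigquery": {"land_raw_files"},
    "sql_transforms": {"spark_job", "load_bigquery"},  # waits on both
    "push_aggregates": {"sql_transforms"},
}


def execution_waves(deps):
    """Group steps into waves; steps in the same wave could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock the next one
    return waves


waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The third wave contains both spark_job and load_bigquery, which is the "steps three and four both finish before step five" shape described above.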

DAGs, the central concept

A DAG is a Directed Acyclic Graph. In Managed Airflow, a DAG is a collection of tasks that you want to run, organized in a structure that reflects the dependencies between them. Directed means each connection between two tasks has a direction: this task runs before that task. Acyclic means there are no loops: you cannot have task A depend on task B, which in turn depends on task A.
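
To make the acyclic rule concrete, here is a minimal sketch in plain Python. It assumes a dependency mapping of task name to the set of tasks it waits on, and the task names are hypothetical. Airflow performs its own cycle check when it parses a DAG file; this only illustrates the property.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(deps):
    """Return True if the dependency mapping contains no cycles."""
    try:
        # static_order() is lazy, so consume it to force the cycle check.
        list(TopologicalSorter(deps).static_order())
        return True
    except CycleError:
        return False


valid = {"load": {"extract"}}  # extract runs before load
invalid = {"task_a": {"task_b"}, "task_b": {"task_a"}}  # A waits on B, B waits on A

print(is_acyclic(valid))
print(is_acyclic(invalid))
```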

DAGs are written in Python. A DAG file is just a Python script that imports Airflow operators, instantiates them as tasks, and wires them together with dependency syntax. The Python file lives in a Cloud Storage bucket that Managed Airflow reads on a schedule, and any new DAG file that lands in that bucket gets picked up automatically.

A skeleton DAG looks like this:

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # Airflow 2.4+; replaces the deprecated schedule_interval
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="gsutil cp gs://raw/sales.csv /tmp/",
    )

    load = BigQueryInsertJobOperator(
        task_id="load",
        configuration={"query": {"query": "CALL project.dataset.load_sales()", "useLegacySql": False}},
    )

    extract >> load

The >> operator at the bottom is how you express that extract has to finish before load can start. With a few more tasks, you get the parallel-then-converge diagrams that show up in the Airflow UI, where two tasks fan out from a start node, then meet at a downstream task, then fan out again before converging at the end.
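
For intuition about how >> can wire up fan-out and fan-in, here is a toy model in plain Python. ToyTask is a hypothetical stand-in written for this post, not an Airflow class, and Airflow's real implementation differs in its details. The sketch only shows that Python's __rshift__ and __rrshift__ hooks are enough to record dependencies, including when one side of >> is a list.

```python
class ToyTask:
    """Hypothetical stand-in for an Airflow task; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()  # task_ids that must finish before this one

    def __rshift__(self, other):
        # Handles: task >> other, where other is a task or a list of tasks.
        targets = other if isinstance(other, list) else [other]
        for target in targets:
            target.upstream.add(self.task_id)
        return other  # returning the right-hand side lets chains continue

    def __rrshift__(self, others):
        # Handles: [task_a, task_b] >> task. A plain list has no >> of its
        # own, so Python falls back to the right operand's reflected method.
        for task in others:
            self.upstream.add(task.task_id)
        return self


start, clean, enrich, merge = (
    ToyTask(name) for name in ("start", "clean", "enrich", "merge")
)

# Fan out from start to two parallel tasks, then converge on merge.
start >> [clean, enrich] >> merge

for task in (start, clean, enrich, merge):
    print(task.task_id, sorted(task.upstream))
```

Reading the upstream sets back gives the same picture the Airflow UI draws: clean and enrich both wait on start, and merge waits on both of them.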

Tasks and operators

Each node in a DAG is a task, and each task is an instance of an operator. Airflow ships with operators for almost every GCP service you would touch as a Professional Data Engineer. There are operators for BigQuery jobs, Dataflow jobs, Managed Spark cluster creation and deletion, Cloud Storage transfers, Pub/Sub publishing, Agent Platform (formerly Vertex AI) training, and plenty more. There is also a generic BashOperator and PythonOperator for the cases where you just need to run arbitrary code as a step.

The mental model that helps on the exam: operators are the verbs of your pipeline, tasks are the specific instances, and the DAG is the sentence those verbs combine to form.

The Airflow monitoring dashboard

Managed Airflow environments come with the Airflow web UI built in. From there you can view and manage your DAGs, monitor the status of each task in each run, and rerun failed tasks directly without re-running the entire DAG. That last point matters on the exam. If a question describes a team needing to retry a single failed step in a long workflow without restarting the whole thing, that points to Airflow and therefore Managed Airflow.
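
The idea behind rerunning one failed task can be modeled in a few lines of plain Python. This is a simplified sketch under stated assumptions, not how Airflow stores task state: run_pipeline, the task names, and the simulated failure are all hypothetical. In Airflow you would do this from the UI rather than in code.

```python
from graphlib import TopologicalSorter


def run_pipeline(deps, actions, state):
    """Run tasks in dependency order, skipping any already marked 'success'.

    state maps task name -> 'success' or 'failed' and is updated in place,
    so a second call retries only the failed or not-yet-run tasks.
    """
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if state.get(task) == "success":
            continue  # finished in an earlier attempt, do not repeat it
        if any(state.get(parent) != "success" for parent in deps.get(task, ())):
            continue  # an upstream task has not succeeded yet
        executed.append(task)
        try:
            actions[task]()
            state[task] = "success"
        except Exception:
            state[task] = "failed"
    return executed


# Hypothetical three-step pipeline whose load step fails once, then recovers.
attempts = {"load": 0}


def flaky_load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("simulated transient failure")


deps = {"extract": set(), "load": {"extract"}, "report": {"load"}}
actions = {"extract": lambda: None, "load": flaky_load, "report": lambda: None}
state = {}

first_run = run_pipeline(deps, actions, state)   # load fails, report is held back
second_run = run_pipeline(deps, actions, state)  # extract is not repeated
print(first_run)
print(second_run)
```

The second call picks up at load and continues to report, while extract keeps its earlier success. That is the behavior the exam scenario is describing.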

When Managed Airflow is the right exam pick

The pattern I look for on the Professional Data Engineer exam is multi-service, multi-step orchestration with dependencies. If the scenario mentions chaining BigQuery, Dataflow, Managed Spark, and storage operations together on a schedule, Managed Airflow is the pick. If it mentions Apache Airflow by name and asks for a managed version, the answer is Managed Airflow. If it says the team already has Airflow DAGs and wants to lift and shift them to GCP, the answer is Managed Airflow.

Where Managed Airflow is not the pick: simple cron jobs (Cloud Scheduler), single-step serverless triggers (Cloud Run Functions or Cloud Run), or pure data transformation pipelines without orchestration concerns (Dataflow). The Professional Data Engineer exam likes to test the boundary between these, so it helps to be able to articulate why Managed Airflow is overkill for a one-step job and just right for a tangled multi-service one.

My Professional Data Engineer course covers Managed Airflow in depth, including the Managed Airflow API versus Airflow API distinction, environment sizing, and the specific exam patterns that point to Managed Airflow as the right answer.
