
Cloud Composer is one of those services that shows up on the Professional Data Engineer exam in a very specific way. The exam wants to know whether you can recognize the kind of pipeline that needs an orchestrator, and whether you can pick Composer over Dataflow, Workflows, or a Cloud Scheduler plus Cloud Function combo when the scenario calls for it. The way I think about it: Composer is the answer when the question is about coordinating many steps across many services, not about transforming a single stream of data.
Cloud Composer is a managed implementation of Apache Airflow. Airflow is an open-source workflow framework, originally built at Airbnb and later donated to the Apache Software Foundation, that lets you programmatically create, schedule, manage, and monitor data workflows. When Google runs Airflow as a managed service in GCP, it takes on most of the infrastructure heavy lifting. Provisioning, patching, the underlying GKE cluster Airflow runs on, the metadata database, the webserver: all of it is set up for you when you create a Composer environment.
That said, Composer is not no-ops. It is low-ops. You still configure things like environment size, worker autoscaling parameters, the Airflow version, the Python packages installed in the environment, and the network configuration. On the exam, if you see a scenario describing a team that wants a fully managed workflow tool but is willing to tune a few knobs, Composer fits. If the scenario insists on zero configuration, the answer is probably Workflows or Cloud Scheduler instead.
Big data pipelines are usually complex, multi-step processes. A realistic workflow might pull from an API, land raw files in Cloud Storage, run a Dataproc job, load the result into BigQuery, run a few SQL transformations, and then push aggregates to Bigtable or trigger a downstream ML training job. That kind of pipeline has a few characteristics that Airflow was designed to handle:

- It has many steps that must run in a specific order.
- The steps span multiple services, so no single service can coordinate them on its own.
- Some steps depend on the output of earlier steps.
- The whole thing runs on a schedule, and a failed step needs to be retried or rerun without starting everything over.

If you read an exam scenario and most of those bullets apply, Composer is almost certainly the intended answer.
A DAG is a Directed Acyclic Graph. In Composer, a DAG is a collection of tasks that you want to run, organized in a structure that reflects the dependencies between them. Directed means each connection between two tasks has a direction: this task runs before that task. Acyclic means there are no loops: you cannot have task A depend on task B, which in turn depends on task A.
DAGs are written in Python. A DAG file is just a Python script that imports Airflow operators, instantiates them as tasks, and wires them together with dependency syntax. The Python file lives in a Cloud Storage bucket that Composer polls regularly, and any new DAG file that lands in that bucket gets picked up automatically.
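For illustration, deploying a DAG is nothing more than copying the file into that bucket's dags/ folder. Here is a minimal sketch using the google-cloud-storage client; the bucket name is made up, since Composer generates one per environment and links to it as the DAGs folder on the environment details page:

from google.cloud import storage

# Hypothetical bucket name; use the DAGs folder bucket shown on your
# Composer environment's details page.
DAGS_BUCKET = "us-central1-example-env-bucket"

client = storage.Client()
blob = client.bucket(DAGS_BUCKET).blob("dags/daily_sales_pipeline.py")
blob.upload_from_filename("daily_sales_pipeline.py")
print(f"Uploaded to gs://{DAGS_BUCKET}/dags/daily_sales_pipeline.py")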
A skeleton DAG looks like this:
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="gsutil cp gs://raw/sales.csv /tmp/",
    )
    load = BigQueryInsertJobOperator(
        task_id="load",
        configuration={
            "query": {
                "query": "CALL project.dataset.load_sales()",
                "useLegacySql": False,
            }
        },
    )

    extract >> load
The >> operator at the bottom is how you express that extract has to finish before load can start. With a few more tasks, you get the parallel-then-converge diagrams that show up in the Airflow UI, where two tasks fan out from a start node, then meet at a downstream task, then fan out again before converging at the end.
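If you want to see that fan-out shape in code, here is a minimal sketch using EmptyOperator placeholders. The DAG and task names are invented for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="fan_out_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    clean_orders = EmptyOperator(task_id="clean_orders")
    clean_customers = EmptyOperator(task_id="clean_customers")
    join = EmptyOperator(task_id="join")

    # start fans out to two parallel tasks, which then converge at join
    start >> [clean_orders, clean_customers] >> join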
Each node in a DAG is a task, and each task is an instance of an operator. Airflow ships with operators for almost every GCP service you would touch as a Professional Data Engineer. There are operators for BigQuery jobs, Dataflow jobs, Dataproc cluster creation and deletion, Cloud Storage transfers, Pub/Sub publishing, Vertex AI training, and plenty more. There is also a generic BashOperator and PythonOperator for the cases where you just need to run arbitrary code as a step.
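As a quick illustration, here is a hedged sketch that mixes a service-specific operator with the generic PythonOperator. The bucket, project, dataset, and table names are made up:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator


def notify(**context):
    # Arbitrary Python step: log which logical date just loaded.
    print(f"Loaded sales data for {context['ds']}")


with DAG(
    dag_id="operators_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Service-specific operator: load CSV files from Cloud Storage into BigQuery.
    load_sales = GCSToBigQueryOperator(
        task_id="load_sales",
        bucket="example-raw-bucket",  # hypothetical bucket
        source_objects=["sales/*.csv"],
        destination_project_dataset_table="example-project.sales.daily",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    # Generic operator: run arbitrary Python as a step.
    notify_done = PythonOperator(task_id="notify_done", python_callable=notify)

    load_sales >> notify_done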
The mental model that helps on the exam: operators are the verbs of your pipeline, tasks are the specific instances, and the DAG is the sentence those verbs combine to form.
Composer environments come with the Airflow web UI built in. From there you can view and manage your DAGs, monitor the status of each task in each run, and rerun failed tasks directly without re-running the entire DAG. That last point matters on the exam. If a question describes a team needing to retry a single failed step in a long workflow without restarting the whole thing, that points to Airflow and therefore Composer.
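Manual reruns happen in the UI, but automatic retries are declared on the tasks themselves. A minimal sketch of retry settings applied to every task in a DAG through default_args, with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Applied to every task in the DAG; individual tasks can override these.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="retry_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    flaky_step = BashOperator(task_id="flaky_step", bash_command="exit 0")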
The pattern I look for on the Professional Data Engineer exam is multi-service, multi-step orchestration with dependencies. If the scenario mentions chaining BigQuery, Dataflow, Dataproc, and storage operations together on a schedule, Composer is the pick. If it mentions Apache Airflow by name and asks for a managed version, the answer is Composer. If it says the team already has Airflow DAGs and wants to lift and shift them to GCP, the answer is Composer.
Where Composer is not the pick: simple cron jobs (Cloud Scheduler), single-step serverless triggers (Cloud Functions or Cloud Run), or pure data transformation pipelines without orchestration concerns (Dataflow). The Professional Data Engineer exam likes to test the boundary between these, so it helps to be able to articulate why Composer is overkill for a one-step job and just right for a tangled multi-service one.
My Professional Data Engineer course covers Cloud Composer in depth, including the Composer API versus Airflow API distinction, environment sizing, and the specific exam patterns that point to Composer as the right answer.