
Cloud Composer is one of those services that shows up on the Professional Data Engineer exam in a very specific way. The exam wants to know whether you can recognize the kind of pipeline that needs an orchestrator, and whether you can pick Composer over Dataflow, Workflows, or a Cloud Scheduler plus Cloud Function combo when the scenario calls for it. The way I think about it: Composer is the answer when the question is about coordinating many steps across many services, not about transforming a single stream of data.
Cloud Composer is a managed implementation of Apache Airflow. Airflow is an open-source workflow framework, originally built at Airbnb and later donated to the Apache Software Foundation, that lets you programmatically create, schedule, manage, and monitor data workflows. When Google runs Airflow as a managed service in GCP, it takes on most of the infrastructure heavy lifting. Provisioning, patching, the underlying GKE cluster Airflow runs on, the metadata database, the webserver: all of it is set up for you when you create a Composer environment.
That said, Composer is not no-ops. It is low-ops. You still configure things like environment size, worker autoscaling parameters, the Airflow version, the Python packages installed in the environment, and the network configuration. On the exam, if you see a scenario describing a team that wants a fully managed workflow tool but is willing to tune a few knobs, Composer fits. If the scenario insists on zero configuration, the answer is probably Workflows or Cloud Scheduler instead.
Big data pipelines are usually complex, multi-step processes. A realistic workflow might pull from an API, land raw files in Cloud Storage, run a Dataproc job, load the result into BigQuery, run a few SQL transformations, and then push aggregates to Bigtable or trigger a downstream ML training job. That kind of pipeline has a few characteristics that Airflow was designed to handle:

- It has many steps that must run in a specific order.
- The steps span multiple services, so no single service can coordinate them on its own.
- Some steps depend on the output of earlier steps.
- The whole thing runs on a schedule, and a failed step needs to be retried or rerun without starting everything over.

If you read an exam scenario and most of those bullets apply, Composer is almost certainly the intended answer.
A DAG is a Directed Acyclic Graph. In Composer, a DAG is a collection of tasks that you want to run, organized in a structure that reflects the dependencies between them. Directed means each connection between two tasks has a direction: this task runs before that task. Acyclic means there are no loops: you cannot have task A depend on task B, which in turn depends on task A.
DAGs are written in Python. A DAG file is just a Python script that imports Airflow operators, instantiates them as tasks, and wires them together with dependency syntax. The Python file lives in a Cloud Storage bucket that Composer polls regularly, and any new DAG file that lands in that bucket gets picked up automatically.
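For illustration, deploying a DAG is nothing more than copying the file into that bucket's dags/ folder. Here is a minimal sketch using the google-cloud-storage client; the bucket name is made up, since Composer generates one per environment and links to it as the DAGs folder on the environment details page:

from google.cloud import storage

# Hypothetical bucket name; use the DAGs folder bucket shown on your
# Composer environment's details page.
DAGS_BUCKET = "us-central1-example-env-bucket"

client = storage.Client()
blob = client.bucket(DAGS_BUCKET).blob("dags/daily_sales_pipeline.py")
blob.upload_from_filename("daily_sales_pipeline.py")
print(f"Uploaded to gs://{DAGS_BUCKET}/dags/daily_sales_pipeline.py")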
A skeleton DAG looks like this:
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.operators.bash import BashOperator
from datetime import datetime

with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="gsutil cp gs://raw/sales.csv /tmp/",
    )
    load = BigQueryInsertJobOperator(
        task_id="load",
        configuration={
            "query": {
                "query": "CALL project.dataset.load_sales()",
                "useLegacySql": False,
            }
        },
    )

    extract >> load
The >> operator at the bottom is how you express that extract has to finish before load can start. With a few more tasks, you get the parallel-then-converge diagrams that show up in the Airflow UI, where two tasks fan out from a start node, then meet at a downstream task, then fan out again before converging at the end.
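If you want to see that fan-out shape in code, here is a minimal sketch using EmptyOperator placeholders. The DAG and task names are invented for illustration:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="fan_out_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
    clean_orders = EmptyOperator(task_id="clean_orders")
    clean_customers = EmptyOperator(task_id="clean_customers")
    join = EmptyOperator(task_id="join")

    # start fans out to two parallel tasks, which then converge at join
    start >> [clean_orders, clean_customers] >> join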
Each node in a DAG is a task, and each task is an instance of an operator. Airflow ships with operators for almost every GCP service you would touch as a Professional Data Engineer. There are operators for BigQuery jobs, Dataflow jobs, Dataproc cluster creation and deletion, Cloud Storage transfers, Pub/Sub publishing, Vertex AI training, and plenty more. There is also a generic BashOperator and PythonOperator for the cases where you just need to run arbitrary code as a step.
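As a quick illustration, here is a hedged sketch that mixes a service-specific operator with the generic PythonOperator. The bucket, project, dataset, and table names are made up:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator


def notify(**context):
    # Arbitrary Python step: log which logical date just loaded.
    print(f"Loaded sales data for {context['ds']}")


with DAG(
    dag_id="operators_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Service-specific operator: load CSV files from Cloud Storage into BigQuery.
    load_sales = GCSToBigQueryOperator(
        task_id="load_sales",
        bucket="example-raw-bucket",  # hypothetical bucket
        source_objects=["sales/*.csv"],
        destination_project_dataset_table="example-project.sales.daily",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    # Generic operator: run arbitrary Python as a step.
    notify_done = PythonOperator(task_id="notify_done", python_callable=notify)

    load_sales >> notify_done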
The mental model that helps on the exam: operators are the verbs of your pipeline, tasks are the specific instances, and the DAG is the sentence those verbs combine to form.
Composer environments come with the Airflow web UI built in. From there you can view and manage your DAGs, monitor the status of each task in each run, and rerun failed tasks directly without re-running the entire DAG. That last point matters on the exam. If a question describes a team needing to retry a single failed step in a long workflow without restarting the whole thing, that points to Airflow and therefore Composer.
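Manual reruns happen in the UI, but automatic retries are declared on the tasks themselves. A minimal sketch of retry settings applied to every task in a DAG through default_args, with illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Applied to every task in the DAG; individual tasks can override these.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="retry_example",
    start_date=datetime(2026, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    flaky_step = BashOperator(task_id="flaky_step", bash_command="exit 0")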
The pattern I look for on the Professional Data Engineer exam is multi-service, multi-step orchestration with dependencies. If the scenario mentions chaining BigQuery, Dataflow, Dataproc, and storage operations together on a schedule, Composer is the pick. If it mentions Apache Airflow by name and asks for a managed version, the answer is Composer. If it says the team already has Airflow DAGs and wants to lift and shift them to GCP, the answer is Composer.
Where Composer is not the pick: simple cron jobs (Cloud Scheduler), single-step serverless triggers (Cloud Functions or Cloud Run), or pure data transformation pipelines without orchestration concerns (Dataflow). The Professional Data Engineer exam likes to test the boundary between these, so it helps to be able to articulate why Composer is overkill for a one-step job and just right for a tangled multi-service one.
My Professional Data Engineer course covers Cloud Composer in depth, including the Composer API versus Airflow API distinction, environment sizing, and the specific exam patterns that point to Composer as the right answer.