
When I prep candidates for the Professional Data Engineer exam, Cloud Composer questions almost always reduce to one thing: whether you understand what is actually running under the hood. Composer is not a single product. It is three Google Cloud technologies stitched together, and the exam loves to probe whether you can name the pieces, explain what each one does, and predict how the system behaves when you change something.
In this article I want to walk through the architecture the way I think about it when I am ruling out wrong answers on a PDE question. We will cover the three layers Composer sits on, the components Airflow runs, and where your DAG files actually live.
The first thing to lock into memory is the layered makeup of Cloud Composer. It is a managed wrapper around Apache Airflow, and Google handles the infrastructure for you, but the underlying pieces are still very much there and the exam expects you to recognize them.
If an exam question asks which component is responsible for resource scaling in a Composer environment, the answer is GKE. If it asks where DAGs are stored, the answer is Cloud Storage. If it asks what schedules and monitors the workflow, the answer is Airflow. These three mappings are worth memorizing cold.
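A quick way to internalize those mappings is to describe an environment and look at its config block, which points at all three layers. The environment name and region below are placeholders, and the exact fields returned can vary by Composer version, but dagGcsPrefix, gkeCluster, and airflowUri are the ones to look for.

# Show where each layer lives for an environment (placeholder name and region).
gcloud composer environments describe prod-composer-myproject \
    --location us-central1 \
    --format="yaml(config.gkeCluster,config.dagGcsPrefix,config.airflowUri)"
# config.gkeCluster   -> the GKE cluster doing the compute
# config.dagGcsPrefix -> the Cloud Storage path holding your DAGs
# config.airflowUri   -> the Airflow web UI doing the scheduling and monitoring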
Composer does not just launch Airflow on a single VM. It provisions a GKE cluster and runs the Airflow components as workloads on that cluster. The components you should know by name are the scheduler, the workers, the web server, and the supporting services that coordinate them: a metadata database that tracks DAG and task state, and a Redis instance used as the message broker between the scheduler and the workers.
The scheduler decides which tasks should run and when. The workers pick up the tasks and execute them. The web server hosts the Airflow UI you log into. Redis sits in the middle and queues tasks between the scheduler and the worker pool. All of this is running as containers on the GKE cluster that Composer created for your environment.
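If you want to see those components for yourself, you can point kubectl at the cluster Composer created, assuming your setup can reach it (private-IP environments need extra network plumbing). The cluster name and zone below are placeholders; the real values come out of the environment's gkeCluster field.

# Fetch credentials for the environment's GKE cluster (placeholder cluster name and zone).
gcloud container clusters get-credentials us-central1-prod-composer-abc123-gke \
    --zone us-central1-a

# The Airflow scheduler, workers, web server, and Redis show up as pods on this cluster.
kubectl get pods --all-namespaces | grep -E "scheduler|worker|web|redis"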
The reason this matters for the Professional Data Engineer exam is that scaling questions tend to hinge on GKE. When a question describes a Composer environment that needs more capacity, the path is to add nodes or scale up the worker pool on the underlying cluster. You do not spin up a parallel Airflow install; you let Composer push more work onto the GKE cluster backing the environment.
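As a sketch of what that looks like in practice, extra capacity is an update to the environment, which Composer translates into changes on the backing cluster. The flag below is the Composer 1-style node count; newer environments expose worker autoscaling through different flags, so treat this as illustrative rather than the one true command.

# Add nodes to the GKE cluster backing the environment (Composer 1-style flag, placeholder names).
gcloud composer environments update prod-composer-myproject \
    --location us-central1 \
    --node-count 6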
Now for the single most testable detail in Composer architecture. Every Composer environment, the moment you create it, gets its own dedicated Cloud Storage bucket. You do not create the bucket yourself. Composer provisions it automatically, and the name follows a fixed pattern built from the environment's region, the environment name, and a random suffix, along the lines of us-central1-prod-composer-myproject-abc123.

Inside that bucket lives a dags folder, and any Python file you drop into that folder is automatically detected by Composer. There is no separate deploy step. Composer is constantly polling the bucket for updates, so the moment your file lands there, Airflow knows about it and is ready to schedule it.
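If you list the top level of that bucket, you can see the dags folder sitting next to the other folders Composer manages. The bucket name here is the same example name used above; the exact folder set can differ slightly between Composer versions.

# Peek inside the environment's bucket (placeholder bucket name).
gsutil ls gs://us-central1-prod-composer-myproject-abc123/
# Typical contents:
#   dags/     -> Python files here become DAGs automatically
#   data/     -> shared scratch space available to the workers
#   logs/     -> task logs
#   plugins/  -> custom Airflow plugins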
The practical workflow for shipping new pipeline code looks like this. If you have a PROD environment and you want to add a new DAG, you copy the Python file into the PROD environment's bucket. Composer picks it up. That is the entire deployment.
gsutil cp my_pipeline.py gs://us-central1-prod-composer-myproject-abc123/dags/

Exam questions in this area usually take one of two shapes. The first is a scenario where someone has uploaded a DAG but it is not appearing in Airflow, and you need to identify that they put it in the wrong bucket or wrong subfolder. The second is a scenario asking how to promote a DAG from a DEV environment to a PROD environment, and the correct answer is to copy the file into the PROD environment's bucket rather than to manipulate Airflow directly.
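For the promotion scenario, the fix is literally a copy between the two environments' buckets. The bucket and environment names below are hypothetical; the second variant uses gcloud so you do not have to look the bucket name up at all.

# Promote a DAG from DEV to PROD by copying between environment buckets (placeholder names).
gsutil cp gs://us-central1-dev-composer-myproject-def456/dags/my_pipeline.py \
    gs://us-central1-prod-composer-myproject-abc123/dags/

# Same outcome without knowing the bucket name: gcloud resolves it from the environment.
gcloud composer environments storage dags import \
    --environment prod-composer-myproject \
    --location us-central1 \
    --source my_pipeline.py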
One distinction I make a point of when prepping candidates for the Professional Data Engineer exam is the separation between the Composer environment itself and the DAG files it runs. The environment is the GKE cluster, the bucket, the Airflow install, and the supporting plumbing; it exists until you delete it. The DAGs are just files in the environment's bucket. You can add, update, or remove them without touching the environment at all.
This matters because the exam will sometimes describe a change to a single workflow and ask what needs to be redeployed. If only the DAG is changing, the answer is to upload the new Python file to the bucket. You do not rebuild the environment, you do not restart workers, and you certainly do not recreate the GKE cluster. The environment keeps running while the DAGs flow through it.
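The same logic covers updates and removals; both are plain object operations on the bucket, sketched here with the same hypothetical names.

# Update a DAG in place: overwrite the file and Composer picks up the new version.
gsutil cp my_pipeline.py gs://us-central1-prod-composer-myproject-abc123/dags/

# Retire a DAG: remove its file from the dags folder; the environment itself is untouched.
gsutil rm gs://us-central1-prod-composer-myproject-abc123/dags/my_pipeline.py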
If I had to summarize the Composer architecture into a single mental model for the test, it would be: Airflow defines the work, GKE runs the work, Cloud Storage holds the work, and the environment-specific bucket is the deployment surface. Anything you do in Composer ultimately lands in one of those three places, and almost every PDE question about Composer is asking you to map a scenario back to one of them.
My Professional Data Engineer course covers Cloud Composer architecture, DAG authoring patterns, environment management, and the surrounding orchestration topics in the depth you need to walk into the exam confident on this section.