Data Ingestion for AI Workloads for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
April 25, 2026

This is the last article in my Generative AI Leader series, and it closes the loop on something every prior topic depends on. A model cannot train or run inference until data shows up in the right place, in the right shape. The Generative AI Leader exam keeps this section narrow, so I want to walk through exactly which Google Cloud services it expects you to recognize and how they fit together.

The exam organizes the data path into three stages: ingest and process, store, then feed into AI/ML. The note Google attaches to this section is worth taking seriously. There are many related GCP services that touch data movement, but only a small set show up on this exam, and that is the set I am going to cover.

Ingest and process

Three services anchor the ingestion and processing layer.

Pub/Sub is GCP's messaging service for streaming data. It decouples producers and consumers, so data can flow in continuously without overwhelming downstream systems. When the exam describes a scenario where events are arriving in real time and need to be collected and buffered before anything else happens, Pub/Sub is the answer.
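To make the decoupling concrete, here is a minimal sketch of publishing an event with the Python client library (google-cloud-pubsub). The project and topic names are hypothetical placeholders.

    # pip install google-cloud-pubsub
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Hypothetical project and topic names.
    topic_path = publisher.topic_path("my-project", "clickstream-events")

    # Publish a message; Pub/Sub buffers it until subscribers acknowledge it,
    # so downstream systems are never hit directly by traffic spikes.
    future = publisher.publish(topic_path, b'{"user_id": 42, "action": "click"}')
    print(f"Published message {future.result()}")  # result() returns the message ID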

Dataflow is the processing layer. It handles batch and streaming transformations, cleaning and shaping data as it moves. If a question describes data being filtered, enriched, or reshaped on the way to its destination, Dataflow is doing that work.
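Dataflow runs Apache Beam pipelines, so a sketch of that cleaning step looks like the Beam code below. The record shapes and field names are hypothetical, and as written the pipeline runs locally on the DirectRunner; pointing it at Dataflow is a matter of runner options.

    # pip install "apache-beam[gcp]"
    import json
    import apache_beam as beam

    def parse(line):
        # Turn a raw line into a dict, or None if it is malformed.
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            return None

    def enrich(event):
        # Hypothetical enrichment step: tag each event with its source.
        event["source"] = "web"
        return event

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.Create(['{"user_id": 42}', "not json"])
            | "Parse" >> beam.Map(parse)
            | "DropMalformed" >> beam.Filter(lambda e: e is not None)
            | "Enrich" >> beam.Map(enrich)
            | "Print" >> beam.Map(print)
        )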

Cloud Composer is the orchestration layer. It is a managed Apache Airflow service used for scheduling and coordinating complex multi-step pipelines. Composer is the right choice when the scenario involves multiple jobs that have to run in a specific order or on a schedule.
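For a sense of what Composer actually schedules, here is a bare-bones Airflow DAG with three ordered tasks. The DAG ID, schedule, and task commands are hypothetical stand-ins for real pipeline steps.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A minimal DAG of the kind Cloud Composer runs: three tasks that must
    # execute in order, once per day.
    with DAG(
        dag_id="nightly_ingest",
        schedule="@daily",
        start_date=datetime(2026, 1, 1),
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        extract >> transform >> load  # enforce the run order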

Store

Once data has been ingested and processed, it usually lands in a database or object store. The exam focuses on three.

BigQuery is the data warehouse. It is built for large-scale structured data and SQL-based analytics, and it is the natural destination for relational data that downstream consumers will query with SQL.
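As a sketch of what "query with SQL" means in practice, here is the Python client running a standard SQL aggregation. The project, dataset, table, and column names are hypothetical.

    # pip install google-cloud-bigquery
    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default project and credentials

    query = """
        SELECT user_id, COUNT(*) AS clicks
        FROM `my-project.analytics.events`
        GROUP BY user_id
        ORDER BY clicks DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.user_id, row.clicks)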

Cloud SQL is the managed relational database for transactional, structured data. It fits operational workloads that need a traditional relational database rather than warehouse-scale analytics.
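A hedged sketch of what an operational workload's access pattern looks like, assuming a MySQL instance and the Cloud SQL Python Connector library; the instance connection string, credentials, and table are all hypothetical.

    # pip install "cloud-sql-python-connector[pymysql]" pymysql
    from google.cloud.sql.connector import Connector

    connector = Connector()

    # Hypothetical instance connection string: project:region:instance-name.
    conn = connector.connect(
        "my-project:us-central1:orders-db",
        "pymysql",
        user="app_user",
        password="...",  # placeholder; use Secret Manager or env vars in practice
        db="orders",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT id, status FROM orders WHERE status = 'pending'")
    for row in cursor.fetchall():
        print(row)
    conn.close()
    connector.close()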

Cloud Storage is the object store. It handles unstructured data: images, audio, documents, raw files. If the data is a blob rather than a row, Cloud Storage is where it goes.
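Here is what storing a blob looks like with the Python client. The bucket and object names are hypothetical.

    # pip install google-cloud-storage
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-raw-media")       # hypothetical bucket name
    blob = bucket.blob("audio/call-0001.wav")    # object path inside the bucket

    # Upload a local file; Cloud Storage treats it as an opaque blob.
    blob.upload_from_filename("call-0001.wav")
    print(f"Uploaded to gs://{bucket.name}/{blob.name}")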

Feed into AI/ML

From storage, the data feeds the AI/ML layer. Two services matter here for the Generative AI Leader exam.

Vertex AI is where models get trained and deployed. BigQuery ML lets you run models directly in the warehouse using SQL, which is useful when the data already lives in BigQuery and you want to avoid moving it.
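To show what "models directly in the warehouse" means, here is a minimal BigQuery ML sketch: training and prediction are both just SQL statements submitted through the BigQuery client. The dataset, table, column names, and model type are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Train a model with a SQL statement; the data never leaves BigQuery.
    client.query("""
        CREATE OR REPLACE MODEL `analytics.churn_model`
        OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
        SELECT plan, tenure_months, support_tickets, churned
        FROM `analytics.customers`
    """).result()

    # Prediction is SQL as well.
    rows = client.query("""
        SELECT * FROM ML.PREDICT(MODEL `analytics.churn_model`,
            (SELECT plan, tenure_months, support_tickets
             FROM `analytics.customers`))
    """).result()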

Note (2026-05-06): Vertex AI was rebranded as Gemini Enterprise Agent Platform. Google's exam guides still use the Vertex AI naming, so this article does too. The official guides may switch to the new name at some point as you prep, but for now we're matching the language currently in the exam materials.

The common ingestion scenario

The exam also gives you a reference pipeline that shows up in many GCP architectures, and its shape is worth memorizing.

It starts with Pub/Sub on the left, where data is collected, stored, and buffered. This is the entry point for high-volume, real-time data, and the buffering behavior is what keeps messages from being lost when traffic spikes.

From there, data flows into Dataflow, where transformations and cleaning are applied. Dataflow handles both batch and streaming, so the same processing layer covers continuous and periodic workloads.

After processing, the pipeline routes data to different destinations based on its type:

  • Unstructured data goes to Cloud Storage. Files, logs, images, video.
  • Relational data for SQL querying goes to BigQuery.
  • NoSQL, time series, or IoT data goes to Bigtable, Google's low-latency, highly scalable NoSQL database.

That branch into Cloud Storage, BigQuery, and Bigtable based on data shape is the part the exam tests most directly. If a question describes IoT sensor readings or time-series telemetry, Bigtable is the destination. If it describes structured data that analysts will query with SQL, BigQuery. If it describes raw files or media, Cloud Storage.
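To make the branch concrete, here is a sketch of the routing decision written as a Beam Partition step, using hypothetical record shapes. A real pipeline would end each branch in the matching sink (Cloud Storage, BigQuery, Bigtable) rather than printing.

    import apache_beam as beam

    # Route each record to one of three branches:
    # index 0 -> Cloud Storage, 1 -> BigQuery, 2 -> Bigtable.
    def route(record, num_partitions):
        if record.get("kind") == "file":
            return 0  # raw blob -> Cloud Storage
        if record.get("kind") == "row":
            return 1  # relational -> BigQuery
        return 2      # time-series / IoT -> Bigtable

    with beam.Pipeline() as pipeline:
        events = pipeline | beam.Create([
            {"kind": "file", "uri": "img.png"},
            {"kind": "row", "user_id": 42},
            {"kind": "reading", "sensor": "t-7", "value": 21.5},
        ])
        to_gcs, to_bq, to_bt = events | beam.Partition(route, 3)

        # In a full pipeline, each branch ends in its sink, e.g.
        # beam.io.WriteToText for gs:// paths and beam.io.WriteToBigQuery
        # for the warehouse branch.
        to_bt | beam.Map(print)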

What to take into the exam

For the Generative AI Leader exam, you do not need to architect ingestion pipelines. You need to recognize which service does which job and route a described workload to the right destination. Pub/Sub for streaming intake. Dataflow for transformation. Composer for orchestration. BigQuery, Cloud SQL, and Cloud Storage for the three storage shapes. Bigtable for NoSQL and time-series. Vertex AI and BigQuery ML at the AI/ML end.

This is the last article in the series, so if you have been reading along, you now have the full surface of what the exam covers. My Generative AI Leader course walks through this same ingestion pipeline alongside the rest of the foundational material.
