The Data Lifecycle for the PDE Exam: Ingest, Store, Process, Analyze, Govern

GCP Study Hub
June 2, 2025

When I prep candidates for the Professional Data Engineer exam, the first mental model I want them to lock in is the data lifecycle. It is the framework Google uses to organize the entire universe of data services on GCP, and once you can place a service in the right stage, exam questions get a lot less intimidating. You stop memorizing tool names in isolation and start thinking about what a tool actually does for the data flowing through your system.

The lifecycle has five active stages and one cross-cutting concern. The stages are Ingest, Store, Process, Analyze, and Extract Value. The cross-cutting concern is Governance, which wraps every other stage. I want to walk through each one the way I think about it on exam day, including the services Google associates with each step and how they connect.

What the data lifecycle actually is

Data is a piece of information that describes a real-world event. That could be a sensor reading, a click on a website, a row written by an application, or a file dropped into a bucket. The data lifecycle is the path that information takes from the moment it is captured until someone uses it to make a decision. The exam expects you to know each stage, which GCP services live there, and how those services chain together into pipelines.

One nuance worth holding onto: the lifecycle is not strictly linear. Insights you generate during Analyze often feed back into how you ingest or process data later. The diagram is a useful mental scaffold, not a one-way street.

Ingest

Ingest is where raw data enters your system. It can arrive as a continuous stream or as discrete batches, and you typically do some initial validation or filtering on the way in.

  • Pub/Sub handles real-time streaming events like clickstreams, application events, or telemetry.
  • Dataflow ingests in either streaming or batch mode and is the workhorse for ETL pipelines.
  • Cloud Storage is your batch entry point for files, historical loads, and bulk uploads.
  • Cloud Logging collects application and system logs from services like App Engine and Cloud Run.
  • Transfer Appliance, Storage Transfer Service, and BigQuery Data Transfer Service move large volumes of data from on-prem or external sources into GCP.

On the exam, when a question says "data is being generated continuously from millions of devices," your reflex should land on Pub/Sub. When the question says "a nightly batch of CSV files," you are looking at Cloud Storage and a scheduled load.
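To make the streaming reflex concrete, here is a minimal sketch of publishing one telemetry event to a Pub/Sub topic with the Python client library. The project, topic, and field names are placeholders I made up for illustration, and it assumes the topic already exists.

    # Minimal sketch: publish a JSON event to an existing Pub/Sub topic.
    # "my-project" and "device-telemetry" are placeholder names.
    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "device-telemetry")

    event = {"device_id": "sensor-42", "temperature": 21.7}
    future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    print(future.result())  # message ID once the publish is acknowledged

A subscriber, often a Dataflow streaming job, pulls those messages on the other side, which is exactly the Pub/Sub-into-Dataflow chaining the exam loves.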

Store

Once data is in, it needs a home. The right home depends on whether the data is structured, semi-structured, or unstructured, and on the access pattern you expect.

  • Cloud Storage for unstructured object data: files, images, backups, archives.
  • BigQuery for structured analytical data at warehouse scale.
  • Bigtable for very large time-series or IoT workloads needing low-latency wide-column access.
  • Cloud SQL for managed relational transactional databases.
  • Spanner for globally consistent, horizontally scalable relational workloads.
  • Firestore for semi-structured document data with real-time sync.
  • Memorystore for in-memory caching and session data.

Storage decisions are a heavy theme on the Professional Data Engineer exam, and the trick is matching the access pattern to the service. Analytical SQL over terabytes points to BigQuery. Single-row low-latency reads at massive scale point to Bigtable. Strong global consistency points to Spanner.
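To picture the "nightly batch of CSV files" pattern end to end, here is a minimal sketch that loads files from Cloud Storage into a BigQuery table using the Python client. The bucket, dataset, and table names are placeholders, and the schema is autodetected purely to keep the example short.

    # Minimal sketch: load a nightly batch of CSVs from Cloud Storage into BigQuery.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # infer the schema from the files
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/nightly/*.csv",
        "my-project.analytics.events",
        job_config=job_config,
    )
    load_job.result()  # block until the load job finishes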

Process

Processing is where data gets cleaned, normalized, enriched, and shaped into something analyzable. The services here overlap heavily with Ingest, because in practice ingestion pipelines do real work as they move data.

  • Dataflow for batch or streaming transformation pipelines built on Apache Beam.
  • Dataproc for managed Hadoop, Spark, and Hive workloads, especially when migrating existing on-prem pipelines.
  • BigQuery for in-warehouse SQL transformations, aggregations, and joins.
  • Cloud Data Fusion for visual, low-code ETL pipeline construction.
  • Dataprep for interactive, visual data cleaning powered by Trifacta.
  • Cloud Composer for orchestrating multi-step pipelines on top of Apache Airflow.
  • Cloud Functions for lightweight event-driven transformations.
  • Data Loss Prevention API for redacting or de-identifying sensitive fields during processing.

If you see existing Spark or Hadoop jobs in the question, the answer is almost always Dataproc. If you see a brand-new streaming pipeline being built from scratch, lean toward Dataflow.
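For a sense of what that brand-new pipeline looks like, here is a minimal batch Apache Beam sketch of the kind Dataflow runs: read raw CSV lines from Cloud Storage, parse and validate them, and write cleaned output back. The paths, fields, and validation rule are illustrative; running it on Dataflow rather than locally is a matter of passing DataflowRunner pipeline options.

    # Minimal sketch of a Beam pipeline: read, parse, filter, write.
    import apache_beam as beam

    def parse_row(line):
        device_id, temperature = line.split(",")
        return {"device_id": device_id, "temperature": float(temperature)}

    with beam.Pipeline() as pipeline:  # add DataflowRunner options to run on GCP
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_row)
            | "KeepPlausible" >> beam.Filter(lambda row: -50 < row["temperature"] < 60)
            | "Format" >> beam.Map(lambda row: f"{row['device_id']},{row['temperature']}")
            | "Write" >> beam.io.WriteToText("gs://my-bucket/clean/output")
        )

The same Beam code runs in batch or streaming mode, which is why Dataflow shows up in both the Ingest and Process stages.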

Analyze and Extract Value

Analysis is where you query, visualize, and apply machine learning to the data you have stored and processed. Extract Value is what happens next inside the business: recommending content, informing strategy, automating decisions, communicating with stakeholders.

  • BigQuery for SQL analytics over massive datasets, including BigQuery ML for in-warehouse model training.
  • Looker for dashboards, reports, and self-service exploration.
  • Vertex AI for building, training, and deploying machine learning models, including notebook-driven analysis via Vertex AI Workbench.
  • Dataflow for real-time analytical pipelines on streaming data.
  • Dataproc for Spark MLlib or other open-source analytics frameworks.
  • Bigtable for low-latency analytics on time-series data.

Notice how BigQuery, Dataflow, and Vertex AI keep showing up across stages. That cross-stage reuse is exactly what makes the data lifecycle a flexible mental model rather than a rigid pipeline.
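BigQuery ML is the cleanest example of analysis without moving the data. Here is a minimal sketch that trains a model from the Python client; the dataset, table, and column names are placeholders, and the point is that the CREATE MODEL statement executes entirely inside the warehouse.

    # Minimal sketch: train a BigQuery ML model without exporting any data.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    query = """
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT churned, tenure_months, monthly_spend
    FROM `analytics.customers`
    """
    client.query(query).result()  # training runs as a query job inside BigQuery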

Governance

Governance sits at the center because it applies to every other stage. It covers access, audit, classification, privacy, encryption, and network isolation.

  • Cloud IAM for identity and least-privilege access control.
  • Cloud Logging for audit trails of activity across your environment.
  • Data Catalog for metadata discovery, tagging, and classification.
  • Dataplex for unified governance across lakes and warehouses.
  • Data Loss Prevention API for finding and protecting sensitive data like PII.
  • VPC for network isolation around data services.
  • Organization Policy Service for org-wide constraints and guardrails.
  • Cloud KMS for managing encryption keys.

A common Professional Data Engineer exam trap is treating governance as an afterthought. The right answer to a compliance or sensitive-data question almost always combines an Ingest or Store service with a governance service like DLP, IAM, or KMS.
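To see what that combination looks like in practice, here is a minimal sketch that calls the DLP API to mask email addresses in a record before it lands in storage. The project ID and sample text are placeholders, and only one infoType is inspected to keep it short.

    # Minimal sketch: mask email addresses with the Cloud DLP API before storing.
    from google.cloud import dlp_v2

    dlp = dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": "projects/my-project",  # placeholder project
            "item": {"value": "Contact jane.doe@example.com for details."},
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
        }
    )
    print(response.item.value)  # "Contact [EMAIL_ADDRESS] for details."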

My Professional Data Engineer course covers each stage of the data lifecycle in depth, including which services Google associates with each stage and the design patterns that connect them into real pipelines.
