
If you sit for the Google Cloud Professional Data Engineer exam without a clear mental map of the machine learning development lifecycle, you will lose points on questions that look deceptively simple. The exam likes to give you a scenario that drops you somewhere in the middle of an ML workflow and ask what should happen next, or what should have happened earlier. If the stages are fuzzy in your head, the trap answers all look plausible.
I want to walk through the six stages I drill into every Professional Data Engineer candidate, in the exact order they appear on the exam, and explain what Google is actually testing at each one.
The lifecycle Google uses for the Professional Data Engineer exam runs in this sequence:

1. Data Collection
2. Data Processing
3. Train/Test Split
4. Model Training and Validation
5. Model Evaluation
6. Deployment and Monitoring
Memorize the order. The first half is where the exam concentrates its questions, because that is where data engineers actually live. The back half belongs more to data scientists and ML engineers, but you still need to recognize the stages so you can spot when a scenario is jumping ahead or doubling back.
Data Collection is the act of pulling raw inputs from wherever they live. That might be Cloud Storage buckets, Cloud SQL or Spanner databases, BigQuery tables already in your warehouse, third-party APIs, or streaming sources flowing through Pub/Sub. The exam loves to test whether you can pick the right ingestion path for the source you are given.
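To make one of those paths concrete, here is a minimal sketch of batch ingestion with the BigQuery Python client, loading CSV files from a Cloud Storage bucket into a table. The bucket, project, dataset, and table names are hypothetical placeholders.

```python
# Minimal batch ingestion sketch: load CSV files from Cloud Storage
# into a BigQuery table. All names here are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://example-raw-data/events/*.csv",  # hypothetical bucket
    "example-project.raw.events",          # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```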
What Google is really checking at this stage is whether you understand that the data you start with is the ceiling on model quality. If your collection step pulls in biased, incomplete, or stale data, nothing downstream rescues it. Watch for scenario questions that hint at sampling problems or missing sources. The fix is almost always earlier in the lifecycle, not later.
Once you have the raw data, you process it. This is cleaning, deduplication, normalization, handling nulls, encoding categorical variables, feature engineering, and any transformations that put the data into a shape a model can learn from. On Google Cloud, the tools you will see in exam scenarios are Dataflow for streaming and batch transforms, Dataproc when a team is already on Spark, Dataprep for self-service cleaning, and BigQuery SQL for transformations that fit inside the warehouse.
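Here is a minimal pandas sketch of what those cleaning steps look like in practice. The file and column names are hypothetical, and a real pipeline would run the same logic at scale in Dataflow or BigQuery.

```python
# Typical cleaning steps sketched with pandas; file and column
# names are hypothetical.
import pandas as pd

df = pd.read_csv("events.csv")

df = df.drop_duplicates()                          # deduplication
df["age"] = df["age"].fillna(df["age"].median())   # handle nulls
df["amount"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()  # normalize
df = pd.get_dummies(df, columns=["country"])       # encode a categorical variable
```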
Two things matter for the exam. First, data processing comes after data collection, not before. Second, processing comes before the train/test split. Scenarios that try to trick you into splitting first and cleaning later are almost always wrong: cleaning after the split invites data leakage, because fitting transformations across both sets lets test-set statistics bleed into training.
With the data cleaned, you divide it into a training set and a testing set. The typical ratio is 70 to 80 percent for training and 20 to 30 percent for testing. Some teams carve out a third validation set as well, which is fine, but the exam usually sticks with the two-way split.
The reason the split happens here and not later is straightforward. Once you start training, the model sees only the training data. The test set is held back so you can evaluate honestly at the end. If you split too early, before cleaning, you risk transforming the two sets inconsistently. If you split too late, after training, the test set is contaminated and the evaluation lies to you.
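A minimal scikit-learn sketch of the split, using synthetic data, with the leakage rule made explicit: any transformation that learns statistics is fit on the training portion only.

```python
# 80/20 split; any learned transformation is fit on the training
# portion only, so no test-set statistics leak into training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=42)  # stand-in data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # the typical 80/20 split
)

scaler = StandardScaler().fit(X_train)  # learn statistics from training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # apply, never re-fit, on the test set
```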
This is where the algorithm runs against the training data and learns patterns. On Google Cloud, this stage maps to Vertex AI training jobs, AutoML, BigQuery ML for SQL-native models, or custom containers if you need full control over the framework.
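As one concrete example of this stage, here is what the BigQuery ML path looks like, issued through the Python client. The project, dataset, table, and column names are hypothetical.

```python
# Train a logistic regression model in BigQuery ML via the Python client.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
CREATE OR REPLACE MODEL `example-project.ml.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_spend, churned
FROM `example-project.ml.customer_features`
"""
client.query(query).result()  # runs the training job inside BigQuery
```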
Validation happens inside this stage. You use a slice of the training data, or k-fold cross-validation, to tune hyperparameters without touching the test set. The exam will not push deep on hyperparameter math, but it will expect you to know that tuning happens here, not during evaluation.
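Sketched with scikit-learn and synthetic data, the tuning loop looks like this: k-fold cross-validation inside the training set, with the test set never touched.

```python
# Hyperparameter tuning with 5-fold cross-validation on training data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # candidate hyperparameters
    cv=5,  # 5-fold cross-validation inside the training set
)
search.fit(X_train, y_train)  # the test set is never seen here
print(search.best_params_)
```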
Now you bring back the test set you set aside. You run the trained model against it and measure performance using metrics that match the problem: accuracy and F1 for classification, RMSE and MAE for regression, and AUC-ROC when the classes are imbalanced. For the Professional Data Engineer exam, you do not need to derive these metrics. You need to recognize when a scenario describes the wrong metric for the problem.
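A quick scikit-learn sketch of the metric-to-problem mapping, using toy labels and predictions for illustration:

```python
# Matching metric to problem type; toy labels and predictions for illustration.
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Classification: accuracy and F1 compare predicted labels to true labels;
# AUC-ROC needs predicted scores rather than hard labels.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
y_scores = [0.2, 0.9, 0.4, 0.1]
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_scores))

# Regression: RMSE and MAE compare predicted values to true values.
y_true_reg = [3.0, 5.0, 2.5]
y_pred_reg = [2.8, 5.3, 2.1]
print(mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)  # RMSE
print(mean_absolute_error(y_true_reg, y_pred_reg))        # MAE
```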
If the model fails evaluation, you loop back. Sometimes back to feature engineering, sometimes all the way back to data collection if the underlying data is the problem. The lifecycle is linear on paper but iterative in practice.
The final stage pushes the model into a production environment, usually a Vertex AI endpoint for online predictions or a batch prediction job for periodic scoring. Then you monitor. Monitoring catches drift, where the live data distribution shifts away from what the model was trained on, and training-serving skew, where the features the model sees at serving time differ from the features it was trained on.
Vertex AI Model Monitoring is the managed service the exam expects you to reach for. If a scenario describes a model whose accuracy is silently degrading in production, the answer almost always involves setting up drift or skew detection, not retraining blindly.
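For orientation, here is a rough sketch of the online-serving path with the Vertex AI Python SDK. The project, artifact location, and serving container are hypothetical placeholders, and a real deployment involves more configuration than this.

```python
# Rough sketch: upload a model artifact and serve it from a Vertex AI
# endpoint for online predictions. All names and URIs are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="us-central1")

model = aiplatform.Model.upload(
    display_name="churn-model",
    artifact_uri="gs://example-models/churn/",  # hypothetical artifact location
    serving_container_image_uri=(
        # assumed pre-built sklearn serving container; pick one that
        # matches your framework and version
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)

endpoint = model.deploy(machine_type="n1-standard-2")  # creates an endpoint
prediction = endpoint.predict(instances=[[12, 89.5]])  # online prediction
print(prediction.predictions)
```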
If you do nothing else, lock in the order. Data Collection, Data Processing, Train/Test Split, Model Training and Validation, Model Evaluation, Deployment and Monitoring. Know that processing always follows collection, and that the split always follows processing. Those two ordering rules alone will earn you points on the lifecycle questions.
My Professional Data Engineer course covers each stage of the ML lifecycle with the specific Google Cloud services and exam scenarios you need to recognize on test day.