BQML Customer Segmentation for the PCA Exam

GCP Study Hub
January 26, 2026

Professional Cloud Architect exam scenarios that involve machine learning rarely test whether you can build a model from scratch. They test whether you can read a business problem, identify the right Google Cloud service to solve it, and justify the choice on the basis of where the data already lives and what the team is set up to operate. Customer segmentation in BigQuery is one of the cleanest examples of this pattern, because the data is already in a warehouse, the algorithm is well-known, and the architectural decision comes down to whether you move the data out or keep it in place.

I'm Ben Makansi, founder of GCP Study Hub, and in this article I want to walk through a customer segmentation scenario the way it tends to appear on the Professional Cloud Architect exam, explain why BigQuery ML with K-means clustering is the right call, and cover the operational details that turn a textbook answer into something a real team would deploy.

The Scenario

Imagine you are an ML engineer at an ecommerce company. The marketing team comes to you with a request. They want to group customers by their buying patterns so they can run targeted campaigns. They do not know how many natural groupings exist in the customer base. They want the data to tell them. The customer data, including transaction history, basket composition, frequency, and lifetime value, already lives in BigQuery.

This is a clustering problem. Specifically, it is an unsupervised learning problem, because there are no labeled customer segments to train against. The algorithm has to find structure in the data on its own. K-means clustering is the canonical approach for this kind of problem, and BigQuery ML supports it natively.

Why BQML Is the Right Architectural Call

The first question a Professional Cloud Architect should ask in this scenario is where the data lives and what it would cost to move it. The answer is that the data is already in BigQuery, and moving it out adds latency, cost, ETL pipelines, and another system to operate. None of that is incidental. Data movement is one of the largest hidden costs in production ML, and the exam expects you to weigh it.

BigQuery ML eliminates the data movement step entirely. You write SQL, you run it against the existing tables, and the model trains inside BigQuery. There is no separate compute environment to provision, no Dataflow job to write, and no Vertex AI training cluster to configure. The model is created as a BigQuery object, lives in a dataset alongside the source data, and can be queried for predictions with the same SQL syntax used for training.

The architectural takeaway is that when the dataset is already in BigQuery and the algorithm is one BQML supports, BQML is almost always the right answer for the exam. The trade-off is that you give up the flexibility of a fully custom model, but for a standard algorithm like K-means, that flexibility is not what the team needs.

The K-Means Model in SQL

Creating the model is a single CREATE MODEL statement.

-- Train a K-means model in place; the model is stored as an object in the dataset.
CREATE MODEL `project.dataset.customer_segmentation`
OPTIONS(model_type='KMEANS', num_clusters=2)
AS
SELECT
  total_spend,
  order_frequency,
  avg_basket_size,
  days_since_last_purchase
FROM
  `project.dataset.customer_features`

The OPTIONS clause is where the model type and configuration live. Setting model_type to KMEANS tells BigQuery to use K-means clustering. The num_clusters parameter controls how many groups the algorithm tries to find. The SELECT statement defines the input features. K-means works on numeric features and uses Euclidean distance, so the columns you include should be the customer attributes that actually differentiate buying behavior. BigQuery ML standardizes numeric features by default (the standardize_features option defaults to TRUE), which keeps a large-magnitude column like total_spend from dominating the distance calculation.
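For context, the customer_features table is just an aggregation over raw transactions. A minimal sketch of how it might be built, assuming a hypothetical orders table with one row per order:

-- Build one feature row per customer from raw order history (orders table is illustrative).
CREATE OR REPLACE TABLE `project.dataset.customer_features` AS
SELECT
  customer_id,
  SUM(order_total) AS total_spend,
  COUNT(*) AS order_frequency,
  AVG(item_count) AS avg_basket_size,
  DATE_DIFF(CURRENT_DATE(), MAX(order_date), DAY) AS days_since_last_purchase
FROM
  `project.dataset.orders`
GROUP BY
  customer_id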

Once the model is trained, you run predictions with ML.PREDICT.

-- Assign each customer to a cluster; CENTROID_ID is the cluster label.
SELECT
  customer_id,
  CENTROID_ID AS cluster_assignment
FROM
  ML.PREDICT(MODEL `project.dataset.customer_segmentation`,
    (SELECT * FROM `project.dataset.customer_features`))

Each row in the output gets a CENTROID_ID, which is the cluster the model assigned that customer to. From there, the marketing team can join cluster assignments back to customer profiles and run targeted campaigns against each segment.
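A sketch of that join, assuming a hypothetical customer_profiles table holding contact details (the segment number in the WHERE clause is illustrative):

-- Join cluster assignments to profiles and pull one segment for a campaign.
SELECT
  p.customer_id,
  p.email,
  s.CENTROID_ID AS cluster_assignment
FROM
  ML.PREDICT(MODEL `project.dataset.customer_segmentation`,
    (SELECT * FROM `project.dataset.customer_features`)) AS s
JOIN
  `project.dataset.customer_profiles` AS p
ON
  s.customer_id = p.customer_id
WHERE
  s.CENTROID_ID = 2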

Picking the Right Number of Clusters

The hard part of K-means is not running the algorithm. It is choosing num_clusters. In the example above, num_clusters is set to 2, but two clusters is rarely the right answer for a real customer base. Two segments is a coarse split, and most ecommerce datasets contain more meaningful structure than that.

The standard approach is the elbow method. You train multiple models with different values of k, typically ranging from 2 through 10 or higher, and you look at how a quality metric like the within-cluster sum of squares changes as k grows. As k increases, the metric always decreases, because more clusters always fit the data more tightly. The elbow is the point where the rate of improvement flattens out, which is the signal that adding more clusters is no longer buying meaningful structure.

In BigQuery ML, the workflow is to run CREATE MODEL multiple times with different num_clusters values, query each model's evaluation metrics with ML.EVALUATE, and plot the results. The k value at the elbow is the one to keep.

-- Evaluate one candidate model (here, a model trained with num_clusters=4).
SELECT
  davies_bouldin_index,
  mean_squared_distance
FROM
  ML.EVALUATE(MODEL `project.dataset.customer_segmentation_k4`)

The mean_squared_distance is the average squared distance from each point to its assigned centroid, which is the within-cluster sum of squares normalized by row count, and the davies_bouldin_index is a separate quality measure where lower values indicate better-defined clusters.
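BigQuery scripting can automate the sweep. This is a minimal sketch, reusing the _kN naming from the evaluation example above; each iteration trains one candidate model:

-- Train one model per candidate k, from 2 through 10.
DECLARE k INT64 DEFAULT 2;
WHILE k <= 10 DO
  EXECUTE IMMEDIATE FORMAT("""
    CREATE OR REPLACE MODEL `project.dataset.customer_segmentation_k%d`
    OPTIONS(model_type='KMEANS', num_clusters=%d)
    AS
    SELECT
      total_spend,
      order_frequency,
      avg_basket_size,
      days_since_last_purchase
    FROM
      `project.dataset.customer_features`""", k, k);
  SET k = k + 1;
END WHILE;

Once the loop finishes, run ML.EVALUATE against each customer_segmentation_kN model and plot mean_squared_distance against k to find the elbow.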

What the Exam Actually Tests

The Professional Cloud Architect exam will not ask you to write the SQL. It will give you a scenario where a marketing or product team needs to segment users, the data is in BigQuery, and the question presents four options. The wrong options usually involve unnecessary data movement.

One distractor will propose exporting the BigQuery data to Cloud Storage, training a model in a Vertex AI custom training job, and writing predictions back. That works, but it adds three systems and a pipeline to maintain when BQML solves the same problem with one SQL statement.

Another distractor will suggest using AutoML Tables. AutoML is a real Vertex AI service, but it is for supervised learning where you have a target column to predict. Customer segmentation is unsupervised. There is no target column. AutoML is not the right tool for this shape of problem.

A third distractor will suggest calling a pretrained API like the Natural Language API or the Vision API. Pretrained APIs are for tasks Google has already trained models for, like sentiment analysis or image labeling. They do not segment customer transaction data.

The correct answer is BQML with K-means, because the data is in BigQuery, the algorithm fits the unsupervised problem shape, and the architectural footprint is minimal.

Operational Considerations

A few details matter once the model exists. BQML models are stored as objects inside a BigQuery dataset, which means they inherit the dataset's IAM permissions. If the dataset is restricted to the analytics team, the model is too. Granting prediction access without granting training access is done through standard BigQuery roles, as sketched below.
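One way to express that split, with a hypothetical principal, is BigQuery's SQL GRANT statement. Read access to the dataset is enough to query the model with ML.PREDICT (combined with permission to run jobs in the project), while CREATE MODEL requires an editor-level role on the dataset:

-- Read-only dataset access: enough for ML.PREDICT, not for CREATE MODEL.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `project.dataset`
TO 'user:marketing-analyst@example.com';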

Retraining is a scheduled query. Customer behavior shifts over time, and a segmentation model trained on last quarter's data will drift. The standard pattern is to schedule a CREATE OR REPLACE MODEL statement to run weekly or monthly against the latest customer features, which keeps the segments current without manual intervention.
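The scheduled statement is just the original training statement with CREATE OR REPLACE; the num_clusters of 4 here is illustrative, standing in for whatever value the elbow method selected:

-- Scheduled weekly or monthly: rebuild the model on the latest features.
CREATE OR REPLACE MODEL `project.dataset.customer_segmentation`
OPTIONS(model_type='KMEANS', num_clusters=4)
AS
SELECT
  total_spend,
  order_frequency,
  avg_basket_size,
  days_since_last_purchase
FROM
  `project.dataset.customer_features`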

Cost is billed through BigQuery itself, as slot consumption under capacity pricing or bytes processed on demand, not as a separate ML service charge. For a customer table with millions of rows and a handful of features, K-means training in BQML is inexpensive relative to standing up a Vertex AI training job, and prediction is just another SELECT.

Closing

BQML customer segmentation is the kind of question the Professional Cloud Architect exam likes because it tests architectural judgment rather than ML depth. The right answer follows from where the data lives and what the team needs to operate. The data is in BigQuery, the algorithm is K-means, and the model lives next to the source tables. Everything else is a distractor.

If you want a structured walkthrough of BigQuery ML alongside the rest of the ML and AI material on the Professional Cloud Architect exam, the GCP Study Hub Professional Cloud Architect course covers the full ML domain with hands-on SQL examples and exam-style scenarios.
