
When I work with candidates preparing for the Google Cloud Professional Data Engineer exam, a small cluster of Dataflow PTransforms tends to cause confusion: GroupByKey, CoGroupByKey, and Flatten. The names sound related, the inputs look similar, and the exam likes to drop a one-line scenario and ask which one fits. In this post I want to walk through what each transform actually does, how they differ, and the kind of scenario wording that should map to each.
All three are PTransforms in Apache Beam, the programming model that Dataflow executes. Each takes one or more PCollections as input and produces a PCollection as output. The differences come down to the shape of data each transform expects and the shape it hands back.
GroupByKey operates on a single PCollection of key-value pairs. It takes every element with the same key and bundles the values into an iterable. Inputs like (apple, 3), (banana, 4), (apple, 5), (banana, 6) become (apple, [3, 5]) and (banana, [4, 6]).
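To make the shape change concrete, here is a pure-Python sketch of the GroupByKey semantics. This is not Beam code (a real pipeline would apply `beam.GroupByKey()` to a PCollection); it just shows what happens to the data.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value that shares a key into one list, mimicking
    the (key, iterable-of-values) shape of GroupByKey output."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())

pairs = [("apple", 3), ("banana", 4), ("apple", 5), ("banana", 6)]
print(group_by_key(pairs))
# [('apple', [3, 5]), ('banana', [4, 6])]
```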
The cleanest mental model is the SQL GROUP BY clause. In SQL, GROUP BY groups rows by a column and you typically pair it with an aggregate like SUM or COUNT. GroupByKey does the grouping piece, but it stops short of aggregating. You get the full iterable of values, and then you decide what to do with them in a downstream ParDo or Combine step.
This is part of why GroupByKey shows up so often. It is the building block when you want to do something more interesting than a single aggregate, like running a custom function over every value associated with a key, or sorting the values before reducing them.
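A quick sketch of that building-block role, again in plain Python rather than Beam: after grouping, a downstream step can run any custom logic over the full iterable of values per key. Here the custom statistic is the spread (max minus min), which a single built-in SUM or COUNT aggregate cannot express. The data values are made up for illustration.

```python
from collections import defaultdict

def group_by_key(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return list(grouped.items())

# Downstream step: custom processing over each key's full iterable,
# the kind of thing you would put in a ParDo after GroupByKey.
scores = [("apple", 3), ("banana", 4), ("apple", 9), ("banana", 6)]
spreads = [(key, max(values) - min(values)) for key, values in group_by_key(scores)]
print(spreads)
# [('apple', 6), ('banana', 2)]
```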
CoGroupByKey is what you reach for when you have multiple PCollections that share a key type and you want to bring related records together. The classic example is one PCollection with counts and another with attributes.
PCollection 1 has fruits and counts: (apple, 1), (banana, 2), (apple, 3). PCollection 2 has fruits and characteristics: (apple, red), (banana, yellow), (apple, sweet). Apply CoGroupByKey and you get a single PCollection where each key points to a tuple of iterables, one iterable per input PCollection: (apple, ([1, 3], [red, sweet])) and (banana, ([2], [yellow])).
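The same result shape, sketched in plain Python. Note that in the actual Beam Python SDK you would typically pass a dict of tagged PCollections to `beam.CoGroupByKey()` and get back dict-like values keyed by tag; this sketch uses a plain tuple of lists to match the example above.

```python
from collections import defaultdict

def co_group_by_key(*pcollections):
    """Sketch of CoGroupByKey semantics: each key maps to a tuple of
    lists, one list per input collection, in input order."""
    seen_keys = []
    per_input = []
    for pcoll in pcollections:
        grouped = defaultdict(list)
        for key, value in pcoll:
            grouped[key].append(value)
            if key not in seen_keys:
                seen_keys.append(key)
        per_input.append(grouped)
    # A key missing from one input still appears, with an empty list
    # in that input's slot.
    return [(key, tuple(g[key] for g in per_input)) for key in seen_keys]

counts = [("apple", 1), ("banana", 2), ("apple", 3)]
traits = [("apple", "red"), ("banana", "yellow"), ("apple", "sweet")]
print(co_group_by_key(counts, traits))
# [('apple', ([1, 3], ['red', 'sweet'])), ('banana', ([2], ['yellow']))]
```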
A few details worth pinning down for the exam: the key type must match across all of the input PCollections, though the value types can differ from one input to the next; the output pairs each key with a tuple of iterables, one per input, in the order the inputs were passed; and a key that appears in only some of the inputs still shows up in the result, with an empty iterable for each input that lacks it.
If a question describes pulling together related records from different sources by a common identifier, like joining clickstream events with user profile data on a user ID, CoGroupByKey is almost always the answer.
Flatten is the simplest of the three and the easiest to confuse with CoGroupByKey if you read the question too quickly. Flatten merges multiple PCollections of the same type into a single PCollection. It does not group, it does not join, it does not aggregate. It just concatenates.
Three input PCollections containing (apple, 3) and (banana, 4) in the first, (orange, 2) and (pear, 6) in the second, and (grape, 5) in the third, become a single PCollection containing all five pairs after Flatten. The content of each input is untouched. Two requirements to remember: every input PCollection must contain elements of the same type, and in a windowed pipeline all inputs must share the same windowing strategy.
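The semantics are simple enough to sketch in one line of logic: Flatten is list concatenation, nothing more. Keep in mind that a real Beam runner makes no ordering guarantee across the merged inputs; the deterministic order below is an artifact of this plain-Python sketch.

```python
def flatten(*pcollections):
    """Sketch of Flatten semantics: concatenate the inputs.
    No grouping, no joining, no aggregation; elements pass through untouched."""
    merged = []
    for pcoll in pcollections:
        merged.extend(pcoll)
    return merged

first = [("apple", 3), ("banana", 4)]
second = [("orange", 2), ("pear", 6)]
third = [("grape", 5)]
print(flatten(first, second, third))
# [('apple', 3), ('banana', 4), ('orange', 2), ('pear', 6), ('grape', 5)]
```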
Flatten is common after operations that split or filter a stream into multiple branches and you later want to recombine them. A pipeline might split by record type, run different ParDo logic on each branch, and then Flatten the results back into one PCollection for the final sink.
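That split-then-recombine pattern can be sketched with plain lists standing in for PCollections. The record types, field names, and per-branch logic here are invented for illustration; in a real pipeline each branch would be a ParDo and the final merge a `beam.Flatten()`.

```python
def run_branching_pipeline(records):
    """Split by record type, apply different logic per branch,
    then recombine the branches (the Flatten step)."""
    clicks = [r for r in records if r["type"] == "click"]
    views = [r for r in records if r["type"] == "view"]

    # Different per-branch logic, standing in for separate ParDos.
    processed_clicks = [{**r, "weight": 2} for r in clicks]
    processed_views = [{**r, "weight": 1} for r in views]

    # Flatten: merge the branches back into one collection for the sink.
    return processed_clicks + processed_views

events = [{"type": "click", "id": 1}, {"type": "view", "id": 2}]
print(run_branching_pipeline(events))
# [{'type': 'click', 'id': 1, 'weight': 2}, {'type': 'view', 'id': 2, 'weight': 1}]
```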
Here is the quick heuristic I give Professional Data Engineer candidates when a scenario lands in front of them:
If the question mentions a join, a lookup, or enriching one dataset with another by a shared identifier, that is a CoGroupByKey signal. If the question mentions splitting a stream and recombining branches, or unioning two sources with identical schemas, that is a Flatten signal. And if the question is about reducing values per key or doing custom processing on grouped values from a single dataset, that is GroupByKey.
None of these transforms do aggregation on their own. They organize data so a downstream step can. Keeping that boundary clear in your head is enough to handle most exam scenarios involving Dataflow transforms.
My Professional Data Engineer course covers Dataflow PTransforms, windowing, and the rest of the streaming and batch pipeline content you need for the exam.