GroupByKey, CoGroupByKey, and Flatten in Dataflow for the PDE Exam

GCP Study Hub
September 21, 2025

When I work with candidates preparing for the Google Cloud Professional Data Engineer exam, a small cluster of Dataflow PTransforms tends to cause confusion: GroupByKey, CoGroupByKey, and Flatten. The names sound related, the inputs look similar, and the exam likes to drop a one-line scenario and ask which one fits. In this post I want to walk through what each transform actually does, how they differ, and the kind of scenario wording that should map to each.

All three are PTransforms in Apache Beam, the programming model that Dataflow pipelines are built on. Each takes one or more PCollections as input and produces a PCollection as output. The differences come down to what shape of data they expect and what shape they hand back.

GroupByKey: collect all values for a single key

GroupByKey operates on a single PCollection of key-value pairs. It takes every element with the same key and bundles the values into an iterable. Inputs like (apple, 3), (banana, 4), (apple, 5), (banana, 6) become (apple, [3, 5]) and (banana, [4, 6]).

The cleanest mental model is the SQL GROUP BY clause. In SQL, GROUP BY groups rows by a column and you typically pair it with an aggregate like SUM or COUNT. GroupByKey does the grouping piece, but it stops short of aggregating. You get the full iterable of values, and then you decide what to do with them in a downstream ParDo or Combine step.

This is part of why GroupByKey shows up so often. It is the building block when you want to do something more interesting than a single aggregate, like running a custom function over every value associated with a key, or sorting the values before reducing them.
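To make the shapes concrete, here is a plain-Python sketch of what GroupByKey does to a keyed collection. This is not Beam code (in a real pipeline you would apply `beam.GroupByKey()` to a PCollection); it only mimics the semantics so the input and output shapes are easy to see:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Mimic GroupByKey: bundle all values that share a key into one iterable."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

fruits = [("apple", 3), ("banana", 4), ("apple", 5), ("banana", 6)]
print(group_by_key(fruits))
# {'apple': [3, 5], 'banana': [4, 6]}
```

Notice that nothing is summed or counted. The values are only collected per key; any aggregation happens in a later step.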

CoGroupByKey: join two or more PCollections by key

CoGroupByKey is what you reach for when you have multiple PCollections that share a key type and you want to bring related records together. The classic example is one PCollection with counts and another with attributes.

PCollection 1 has fruits and counts: (apple, 1), (banana, 2), (apple, 3). PCollection 2 has fruits and characteristics: (apple, red), (banana, yellow), (apple, sweet). Apply CoGroupByKey and you get a single PCollection where each key points to a tuple of iterables, one iterable per input PCollection: (apple, ([1, 3], [red, sweet])) and (banana, ([2], [yellow])).

A few details worth pinning down for the exam:

  • The input is two or more PCollections with the same key type. The value types can differ.
  • The output is a single PCollection with a key and a tuple of value iterables, one per input.
  • It is the natural fit when you need to merge datasets that share keys but carry different attributes about each key.

If a question describes pulling together related records from different sources by a common identifier, like joining clickstream events with user profile data on a user ID, CoGroupByKey is almost always the answer.
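The fruit example above can be sketched the same way in plain Python. Note this is only a semantic mimic, not Beam code: it uses a positional tuple of lists per key, whereas Beam's Python SDK exposes the result of `beam.CoGroupByKey()` with one iterable per tagged input.

```python
def co_group_by_key(*pcollections):
    """Mimic CoGroupByKey: for each key, a tuple of value lists,
    one list per input collection."""
    grouped = {}
    for idx, pcoll in enumerate(pcollections):
        for key, value in pcoll:
            if key not in grouped:
                # one empty list per input, so positions line up
                grouped[key] = tuple([] for _ in pcollections)
            grouped[key][idx].append(value)
    return grouped

counts = [("apple", 1), ("banana", 2), ("apple", 3)]
traits = [("apple", "red"), ("banana", "yellow"), ("apple", "sweet")]
print(co_group_by_key(counts, traits))
# {'apple': ([1, 3], ['red', 'sweet']), 'banana': ([2], ['yellow'])}
```

The key point the sketch makes visible: the value types of the two inputs differ (ints versus strings), but the key type is shared, and each key ends up with one iterable per input.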

Flatten: concatenate PCollections of the same type

Flatten is the simplest of the three and the easiest to confuse with CoGroupByKey if you read the question too quickly. Flatten merges multiple PCollections of the same type into a single PCollection. It does not group, it does not join, it does not aggregate. It just concatenates.

Three input PCollections containing (apple, 3) and (banana, 4) in the first, (orange, 2) and (pear, 6) in the second, and (grape, 5) in the third, become a single PCollection containing all five pairs after Flatten. The content of each input is untouched. Two requirements to remember:

  • Inputs must share the same schema or element type.
  • Output is a single PCollection that contains every element from every input.

Flatten is common after operations that split or filter a stream into multiple branches and you later want to recombine them. A pipeline might split by record type, run different ParDo logic on each branch, and then Flatten the results back into one PCollection for the final sink.
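A plain-Python sketch of the Flatten example makes the contrast with CoGroupByKey obvious: it is pure concatenation, with no keying involved. (Again, not Beam code; in a pipeline you would apply `beam.Flatten()` to a tuple of PCollections, and Beam makes no ordering guarantee across inputs.)

```python
from itertools import chain

def flatten(*pcollections):
    """Mimic Flatten: concatenate inputs into one collection, elements untouched."""
    return list(chain.from_iterable(pcollections))

a = [("apple", 3), ("banana", 4)]
b = [("orange", 2), ("pear", 6)]
c = [("grape", 5)]
print(flatten(a, b, c))
# [('apple', 3), ('banana', 4), ('orange', 2), ('pear', 6), ('grape', 5)]
```

All five pairs come through unchanged; nothing is grouped, joined, or dropped.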

Telling them apart on the exam

Here is the quick heuristic I give Professional Data Engineer candidates when a scenario lands in front of them:

  • One PCollection, group values by key, no join across sources: GroupByKey.
  • Two or more PCollections, same key type, different value types, merge them by key: CoGroupByKey.
  • Two or more PCollections, same element type, just want them combined into one stream: Flatten.

If the question mentions a join, a lookup, or enriching one dataset with another by a shared identifier, that is a CoGroupByKey signal. If the question mentions splitting a stream and recombining branches, or unioning two sources with identical schemas, that is a Flatten signal. And if the question is about reducing values per key or doing custom processing on grouped values from a single dataset, that is GroupByKey.

None of these transforms do aggregation on their own. They organize data so a downstream step can. Keeping that boundary clear in your head is enough to handle most exam scenarios involving Dataflow transforms.

My Professional Data Engineer course covers Dataflow PTransforms, windowing, and the rest of the streaming and batch pipeline content you need for the exam.
