
When I prep candidates for the Professional Data Engineer exam, the section on referential integrity always feels like a throwback. Most people sitting for the exam have spent years in BigQuery or another warehouse and have not thought about primary keys and foreign keys since their undergrad database class. The exam still expects you to know the relational vocabulary cold, because Google sometimes asks scenario questions that hinge on whether you remember what a foreign key actually does.
This article walks through the concepts the way I teach them in my course: relational model, primary keys, foreign keys, referential integrity, and then a worked example that mirrors the style of question you may see on test day.
Relational databases organize data into tables, where each table holds rows about a single entity. A canonical setup is a customers table, an orders table, and a products table. Each table represents one real-world thing, and the relationships between those things are modeled by connecting rows across tables using keys. That last part is the whole point. The relational model is not just about storing data in rows and columns. It is about expressing how entities relate to each other in a way that stays coherent as the data grows.
A primary key is a column, or sometimes a set of columns, that uniquely identifies each row in a table. In the customers table the primary key is customer_id. In the orders table it is order_id. In the products table it is product_id. Every row in each of those tables has a value for that primary key, and no two rows share the same value.
The role of the primary key is identity. If I tell you customer_id 42, you can resolve that to exactly one row in the customers table. There is no ambiguity. This is the anchor that everything else in the relational model attaches to.
A foreign key is a column in one table that refers to the primary key in another table. The foreign key is what actually establishes the relationship between two tables.
Consider the customers table and the orders table. The orders table has its own primary key, order_id, but it also has a customer_id column. That customer_id is not the primary key of the orders table. It is a foreign key that points back at the customers table. For any given row in orders, the customer_id value tells you which customer placed that order.
The vocabulary that goes with this matters for the exam. The orders row is the child. The customers row it points to is the parent. A foreign key links a child to its parent.
Referential integrity is the principle that foreign key values must always correspond to valid entries in the parent table. In plain English, every child must have a parent.
When a child row has a foreign key value that does not match any primary key in the parent table, that child is called an orphaned record. If an orders row has customer_id = 101 but no row in the customers table has customer_id = 101, that orders row is orphaned. It refers to a customer that does not exist.
Orphaned records are bad because they break the meaning of the data. Joins start producing weird results. Aggregations over customers undercount their orders. Reports lie. Preventing orphaned records is what referential integrity is all about, and the way you prevent them in a traditional relational database is by declaring a foreign key constraint that the database engine enforces.
Here is where the exam gets a little tricky. BigQuery does support primary key and foreign key constraints as part of its schema definition, but those constraints are not enforced. They are informational. You can declare them and the query optimizer can use them as hints, but BigQuery will happily let you insert an orders row whose customer_id does not exist in the customers table.
For the Professional Data Engineer exam, the concept being tested is the relational principle itself. You are expected to recognize that the way to express the rule "every transaction must belong to a real store" is a foreign key from the child table to the parent table's primary key. The enforcement question is secondary. The pattern you select on a multiple choice question is the foreign key reference, regardless of how strictly your specific warehouse enforces it.
A typical question looks like this. You have a Stores table with a primary key, store_id, and a Transaction Logs table that records each transaction happening in those stores. How do you prevent orphaned records in the Transaction Logs table?
The answer is to define a foreign key in the Transaction Logs table that references the primary key store_id in the Stores table. That foreign key constraint expresses the rule that every store_id appearing in Transaction Logs must exist in Stores. If a store is later deleted, you can decide how to handle the related transactions, through cascading deletes or some other policy, so that the data stays consistent.
When you see this question on the exam, the trap answers usually involve adding a check elsewhere in the pipeline, denormalizing the store data into the log, or running a periodic cleanup job. Those are workarounds. The correct answer is the foreign key, because that is the relational construct designed for exactly this problem.
My Professional Data Engineer course covers referential integrity, primary and foreign keys, and the broader BigQuery data modeling concepts you need for exam day.