Structured, Unstructured, and Semi-Structured Data for the PDE Exam

GCP Study Hub
June 5, 2025

One of the first conceptual splits the Professional Data Engineer exam expects you to internalize is how data gets categorized by shape. Structured, unstructured, and semi-structured are not just academic buckets. They drive which GCP storage service you pick, which processing tool fits, and how you talk about a system in an architecture question. I want to walk through how I think about each category and how they map onto the services Google expects you to know cold.

Why the categorization matters on the exam

The Professional Data Engineer exam loves scenario questions that hide the answer inside the data description. If a prompt says "the team needs to store millions of MRI scans for a research pipeline," the test writer is signaling unstructured data, and Cloud Storage is almost always going to be the right answer. If the prompt says "the team needs to query daily sales transactions with strong consistency across regions," that is structured data with a global footprint, and Spanner jumps to the top of the list.

Getting fluent with the three categories means you skip the guessing step. You read the data description, classify it, and only consider services that fit. That alone cuts the answer choices in half on a lot of questions.

Structured data

Structured data is highly organized and adheres to a schema. Rows and columns. Predictable types. Predefined fields. If you have ever opened a spreadsheet or written a CREATE TABLE statement, you have worked with structured data.

Common examples include financial transactions, inventory records, and employee or student information. Each row is a record. Each column has a known type. The format is rigid, which is exactly what makes it efficient to query and analyze.

The GCP services that handle structured data on the exam are:

  • BigQuery for large-scale analytical workloads, warehouses, and SQL over petabytes.
  • Cloud SQL for managed relational databases like MySQL, PostgreSQL, or SQL Server, typically for transactional workloads.
  • Spanner for relational data that needs global scale and strong consistency across regions.

If a question mentions joins, schemas, primary keys, or SQL analytics, you are in structured-data territory.
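To make the "rigid schema" point concrete, here is a minimal sketch in Python. The Transaction record type and its fields are hypothetical, not from any GCP API; the point is that every row shares the same predefined, typed columns, which is exactly what makes structured data easy to aggregate with SQL-style queries.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sales-transaction record: every row has the same
# predefined fields with known types, mirroring a CREATE TABLE schema.
@dataclass
class Transaction:
    txn_id: int
    sku: str
    quantity: int
    sale_date: date

rows = [
    Transaction(1, "SKU-001", 3, date(2025, 6, 1)),
    Transaction(2, "SKU-042", 1, date(2025, 6, 2)),
]

# Because the schema is fixed, an aggregation is trivial,
# the same way SUM(quantity) would be in SQL.
total_units = sum(r.quantity for r in rows)
print(total_units)  # 4
```

Notice that nothing about the data itself needs to be inspected before querying: the schema tells you up front what every record contains.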

Unstructured data

Unstructured data is free-form. No predefined schema. No tidy rows and columns. It comes in wildly varied formats and usually requires specialized tools to extract meaning from it.

The standard examples fall into three rough groups:

  • Text-based: emails, social media posts, chat logs.
  • Image-based: smartphone photos, MRI scans, satellite imagery.
  • Video and audio: security footage, recorded lectures, game streams.

Because there is no schema, you cannot just write SQL against unstructured data. To analyze it you reach for natural language processing on text, image recognition models on visual content, or speech-to-text on audio. That is why unstructured data is so tightly associated with ML pipelines.

On GCP, unstructured data lives in Cloud Storage. That is the answer the exam expects almost every time. Cloud Storage handles any blob you throw at it, scales effectively without limit, and integrates with downstream services like Vertex AI, Dataflow, and BigQuery for further processing. If you see images, video, audio, PDFs, or any free-form blob in a question, default to Cloud Storage.

Semi-structured data

Semi-structured data is the middle ground. It does not enforce a fixed schema, but it carries enough metadata, tags, or attributes to give the data some shape. Think of it as self-describing.

The classic examples are:

  • JSON, with key-value pairs and nested objects.
  • XML and YAML, which use tags or indentation to organize fields.
  • Emails, where the headers (to, from, subject) are structured but the body is free text, which is why email can show up in either category depending on what the question emphasizes.
  • NoSQL document stores, like MongoDB, that hold flexible documents rather than rigid rows.
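The "self-describing" quality is easiest to see in JSON. This short sketch uses a hypothetical IoT payload (the field names are invented for illustration): there is no enforced schema, but the keys and nesting carry the structure with the data.

```python
import json

# A hypothetical IoT reading: no schema is enforced anywhere,
# but the keys describe the data, so it carries its own shape.
payload = """
{
  "device_id": "sensor-7",
  "readings": {"temp_c": 21.4, "humidity": 0.55},
  "tags": ["warehouse", "aisle-3"]
}
"""

doc = json.loads(payload)

# Nested fields are addressed by key, not by a predefined column,
# and a second document could add or omit fields without breaking anything.
print(doc["readings"]["temp_c"])  # 21.4
```

Contrast this with the structured case: you can still navigate the data precisely, but nothing stops the next document from having a different set of keys.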

The GCP services that show up on the exam for semi-structured data are:

  • Bigtable for high-throughput, wide-column workloads like time-series data, IoT telemetry, or financial tick data.
  • Firestore for document-oriented data with flexible nested structures, often backing mobile and web apps.
  • Memorystore for key-value caching scenarios where you need millisecond access without enforcing a schema.

One thing to flag for the Professional Data Engineer exam: BigQuery also handles semi-structured data well through native JSON support, but when the question is asking "where do I store this," the canonical mappings above are the ones to lean on.

How to use this on test day

When I work through scenario questions, I run a quick mental checklist. First, classify the data: structured, unstructured, or semi-structured. Second, identify the access pattern: analytical, transactional, key-value lookup, archive, or streaming. Third, narrow the service list to ones that match both. By that point, two of the four answer choices are usually already eliminated.
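That checklist can be sketched as a toy lookup table. This is illustrative only, not an official Google decision tree: the access-pattern labels are my own shorthand, and the service names are just the canonical exam mappings discussed above.

```python
# Toy version of the test-day checklist: classify the data shape,
# identify the access pattern, and map the pair to a service shortlist.
# Labels and mappings are illustrative, not an official decision tree.
SERVICE_MAP = {
    ("structured", "analytical"): "BigQuery",
    ("structured", "transactional"): "Cloud SQL or Spanner",
    ("semi-structured", "wide-column"): "Bigtable",
    ("semi-structured", "document"): "Firestore",
    ("semi-structured", "key-value"): "Memorystore",
    ("unstructured", "blob"): "Cloud Storage",
}

def shortlist(shape: str, access: str) -> str:
    # Fall through to re-reading the prompt when the pair is unfamiliar.
    return SERVICE_MAP.get((shape, access), "re-read the prompt")

print(shortlist("structured", "analytical"))  # BigQuery
print(shortlist("unstructured", "blob"))      # Cloud Storage
```

The real exam adds wrinkles (consistency requirements, latency targets, regional footprint), but shape plus access pattern is the first cut that eliminates most wrong answers.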

Another small habit that helps: when a question mentions a specific format like JSON or Avro, do not assume that automatically means semi-structured for storage purposes. JSON in Cloud Storage being read by BigQuery is a perfectly normal pipeline. The category tells you what the data looks like. The access pattern tells you which service to use.

Get comfortable with these three categories and the service mappings above, and a meaningful slice of the storage questions on the Professional Data Engineer exam stops feeling like trivia and starts feeling like pattern recognition.

My Professional Data Engineer course covers the full GCP storage decision tree, including when to pick Bigtable over Spanner, when Firestore beats Cloud SQL, and how the exam frames each scenario.
