Types of Data for AI: Structured, Unstructured, Semi-Structured for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
September 27, 2025

One of the foundational topics on the Generative AI Leader exam is data. Specifically, the exam expects you to know how data is organized before it ever reaches a model. The exam will not test you on machine learning math, but it will test you on whether you can look at a dataset and place it in the right bucket. In this article I cover the categories that show up directly in the Generative AI Leader curriculum: raw data, structured data, unstructured data, semi-structured data, and the GCP storage service that lines up with each.

Raw data

Raw data is data in its unprocessed state. It has not been cleaned, organized, labeled, or analyzed. It is the data exactly as it was captured or generated by a system or device. Common examples include raw chat logs that arrive as unformatted text streams with timestamps and user IDs all mixed together, raw audio waveforms straight from a recording device, initial sensor readings flowing in from IoT devices or temperature gauges, and clickstream data showing URL requests and user actions without any session grouping.

Raw data contains valuable information, but it is not immediately usable for machine learning models or business analysis. Data usually gets collected in raw format and then organized into a form that algorithms can actually work with. That might mean parsing text files into databases, converting audio signals into feature vectors, normalizing sensor readings into time series, or grouping clickstream events into user sessions. The point of this processing is to bridge the gap between collecting data and extracting value from it.

Three ways to describe how data is organized

Once data has been organized, we describe it according to that organization. There are three broad categories you need to know for the Generative AI Leader exam: structured, unstructured, and semi-structured. Understanding these categories matters because they determine how you store data, how you process it, how you analyze it, and which tools fit best.

Structured data

Structured data is highly organized and follows a predefined format, almost always in columns and rows. Because of that consistent structure, it is easy to access, process, and analyze. Structured data is typically stored in relational databases, and another way of saying it is highly organized is that it adheres to a schema.

Common examples include financial data, inventory data, and user or employee profile records. Financial data is usually captured in tables where each row is a transaction and each column is an attribute like date, amount, or merchant. That predictable shape is what makes structured data so efficient for storage and analysis.
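The row-and-column idea can be sketched with Python's built-in sqlite3 module. The table and column names here are illustrative, not from any exam scenario; the point is that every row must conform to the predefined schema, which is what makes aggregation trivial.

```python
import sqlite3

# A minimal relational table: every row must fit the predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE transactions (
        txn_date TEXT NOT NULL,
        amount   REAL NOT NULL,
        merchant TEXT NOT NULL
    )"""
)
conn.execute(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    ("2025-09-27", 42.50, "Coffee Shop"),
)

# The fixed columns make analysis straightforward.
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]
print(total)  # 42.5
```

Trying to insert a row with a missing column would raise an error, which is the schema doing its job.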

Unstructured data

Unstructured data is the opposite. It is free-form. It lacks a predefined schema or structure, and it spans a wide range of types and formats. The cleanest way to recognize unstructured data is by example.

  • Text-based unstructured data: emails, social media posts, chat logs.
  • Image-based unstructured data: smartphone pictures, MRI scans, satellite imagery.
  • Video-based unstructured data: security footage, recorded lectures, game streaming.

Unstructured data is often in raw format or close to raw format, but not always. For example, you might have raw chat logs sitting in binary database entries that you then convert into text files. Once converted, that data is still unstructured, but it is no longer raw. Raw and unstructured are related ideas, but they are not the same thing, and the exam expects you to be able to keep them separate.
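The raw-versus-unstructured distinction can be made concrete with a tiny sketch. The log entry below is hypothetical; the point is that decoding the captured bytes produces data that is processed (no longer raw) yet still unstructured, because the message itself has no schema.

```python
# Raw: bytes exactly as captured from a system (hypothetical chat log entry).
raw_entry = b"1695801600|user42|hey, is the deploy done yet?"

# Processed: decoded into readable text. No longer raw, but still
# unstructured -- there is no schema describing the free-form message.
text_entry = raw_entry.decode("utf-8")
print(text_entry)
```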

Semi-structured data

Semi-structured data is the middle ground. It does not have a fixed schema, but it does have some metadata, tags, or attributes that organize the elements within it. That partial structure makes it easier to process than fully unstructured data, while still keeping the flexibility you do not get from a relational table.

Examples include:

  • Key-value formats like JSON, where keys give you organization and values can hold diverse types.
  • YAML files and XML files, which use tags to add structure.
  • Emails, which mix structured headers (To, From, Subject) with free-text content.
  • NoSQL databases such as MongoDB, which store data in flexible nested formats.

The pattern across these examples is the same: enough organization to guide processing, but not so much that you lose the flexibility to handle varied information.
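That pattern is easiest to see in a JSON record. The support-ticket fields below are my own illustration, not exam content: the keys supply enough organization to navigate the record programmatically, while the values stay free to vary in type and nesting.

```python
import json

# A hypothetical support-ticket record: keys supply structure, while
# values vary in type and nesting -- no rigid relational schema required.
ticket_json = """{
    "id": 1017,
    "subject": "Login failure",
    "tags": ["auth", "urgent"],
    "customer": {"name": "Dana", "tier": "pro"},
    "body": "I cannot sign in since this morning."
}"""

ticket = json.loads(ticket_json)
# The metadata (keys) lets us navigate the record without parsing free text.
print(ticket["customer"]["tier"])  # pro
```

Note that the body field is still unstructured free text; the semi-structure lives in the keys and nesting around it.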

GCP services for each data type

The exam also expects you to map each category to the corresponding GCP storage services. Here is the breakdown the curriculum uses.

Structured data:

  • BigQuery for storing and analyzing large datasets at scale.
  • Cloud SQL for managed relational databases that need high compatibility with traditional database engines.
  • Cloud Spanner for relational data that needs global availability and strong consistency.

Unstructured data:

  • Cloud Storage handles all forms of unstructured content, including images, videos, and large documents.

Semi-structured data:

  • Bigtable for high-throughput workloads such as time-series data.
  • Firestore, a NoSQL document database that manages data in flexible nested formats.
  • Memorystore for key-value access, especially caching and scenarios that need fast structured-like access without a rigid schema.

How this shows up on the exam

The Generative AI Leader exam tends to test this material in two ways. The first is straightforward category questions: an example dataset is described and you are asked whether it is structured, unstructured, or semi-structured. JSON payloads, MongoDB collections, and emails with headers are semi-structured. CSV exports, relational tables, and inventory records are structured. Images, video, and free-form text are unstructured.

The second is service mapping. If a scenario describes large unstructured documents, the answer is Cloud Storage. If it describes a relational schema with global consistency, the answer is Cloud Spanner. If it describes a JSON document model with flexible nesting, Firestore is the answer. If it describes time-series data at high throughput, Bigtable is the answer. The exam rarely needs you to dig deeper than that on this topic, but it does expect you to be quick about it.
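The service-mapping pattern above boils down to a lookup table. The scenario labels in this sketch are my own shorthand, not exam wording; the service names are the ones from the curriculum breakdown earlier in this article.

```python
# Scenario shorthand -> GCP service, per the curriculum mapping above.
# The scenario keys are my own labels, not exam phrasing.
SERVICE_MAP = {
    "large unstructured documents": "Cloud Storage",
    "relational schema, global consistency": "Cloud Spanner",
    "flexible JSON document model": "Firestore",
    "high-throughput time series": "Bigtable",
}

print(SERVICE_MAP["high-throughput time series"])  # Bigtable
```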

If you want a structured walkthrough of these data categories alongside the rest of the foundational material, my Generative AI Leader course covers all of it in the order the exam tests it.
