Labeled vs Unlabeled Data for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
March 23, 2026

One of the most fundamental data classifications I cover for the Generative AI Leader certification has nothing to do with structure or schema. It comes down to a single question: has someone gone through the data and tagged it with the answer the model is supposed to learn? That is the labeled versus unlabeled distinction, and it shows up reliably on the exam.

I want to walk through the concept the way it actually appears in the exam material, with a worked example and a sample question at the end so you can see exactly what to look for.

How data classification by annotation differs from how data is organized

Earlier in the course material, data gets classified by how it is organized: structured, unstructured, and semi-structured. That classification is about format. Rows and columns versus free-form text versus JSON-style flexibility.

Classification by annotation is a different axis entirely. It does not care whether the data lives in a relational table or a folder of JPEGs. It only cares whether each piece of data has an attached label that tells you what it represents.

You can have structured data that is unlabeled, and you can have unstructured data that is labeled. The two classification axes are independent of each other.

What unlabeled data actually looks like

In its original form, data usually arrives without labels. The classic example is a folder of image files. Imagine two files named IMG001.jpg and IMG002.jpg. One contains a photo of a cat and the other contains a photo of a dog. You and I can look at them and immediately identify what is in each image, but the files themselves carry no information that tells a machine which is which. They are just pixels with filenames.

That is unlabeled data. It is the default state of most data when it gets collected.

What turns unlabeled data into labeled data

Labeled data is the same data after a manual annotation step has been applied. Someone goes through each example and attaches the correct answer. The cat photo gets tagged with the text "cat" and the dog photo gets tagged with the text "dog."

That tag is the ground truth. It is what a supervised learning algorithm needs in order to figure out the relationship between the input features and the correct category. Labeled data enables models to learn patterns and predict those labels on new examples they have not seen before.

The annotation step requires human judgment, and often domain expertise. For a cat-versus-dog example, almost anyone can do the labeling. For something like medical imaging, you need a radiologist. That is why creating labeled datasets is often the most expensive and time-consuming part of building a supervised model.
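One way to picture the two states in code, using the cat-and-dog files from above (the filenames are the same hypothetical ones, and the tuples are just my illustration of "data plus attached label"):

```python
# Unlabeled: raw files as collected -- just pixels with filenames,
# no attached information about what each image contains
unlabeled = ["IMG001.jpg", "IMG002.jpg"]

# Labeled: the same files after a human annotator attaches the
# ground truth for each one
labeled = [
    ("IMG001.jpg", "cat"),
    ("IMG002.jpg", "dog"),
]
```

The data itself has not changed; the annotation step only pairs each example with the answer a supervised model is expected to learn.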

Worked example: sentiment analysis on customer feedback

Here is the workflow as it appears in the Generative AI Leader exam material, using a sentiment analysis problem.

Suppose a restaurant collects customer feedback through a survey. The raw survey data is just a list of free-form comments:

  • "Food was pretty good."
  • "Great atmosphere, and the lamb shank was delicious."
  • "Service was lacking, won't be coming back."

That is unlabeled data. The comments exist as text, but nothing tells a model what sentiment any of them represent.

The next step is to organize this raw data into a labeled and structured dataset. A human reads each comment and assigns a sentiment score. The neutral comment "Food was pretty good" gets a 0. The positive comment about atmosphere and lamb shank gets a 1. The negative comment about service gets a -1. The labeling requires a person to interpret tone and content, which is exactly the human judgment piece.

Now you have a labeled dataset, and you can train a machine learning model on it. Once the model has seen enough labeled examples, you can feed it a brand-new comment like "Drinks were super yummy!" and the model will output a prediction. In this case, it would predict 1 for positive sentiment, because it learned to associate words like "super" and "yummy" with positive comments during training.

That is the core loop of supervised learning. Labeled data goes in during training, and the model learns to predict the labels for new, unlabeled inputs.
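The loop above can be sketched with a deliberately tiny toy model. This is my own illustration, not something from the exam material: "training" just sums each comment's label over the words it contains, and prediction sums those learned word scores for a new comment. A real sentiment model is far more sophisticated, but the shape of the loop is the same: labeled data in, predictions on new text out.

```python
from collections import defaultdict

def tokenize(text):
    # lowercase and strip trailing punctuation from each word
    return [w.strip(".,!?") for w in text.lower().split()]

def train(labeled_data):
    # "learn" by summing the ground-truth label over every
    # comment each word appears in
    scores = defaultdict(int)
    for text, label in labeled_data:
        for word in tokenize(text):
            scores[word] += label
    return scores

def predict(scores, text):
    # score a new comment by summing its known word scores,
    # then map the total back to -1 / 0 / 1
    total = sum(scores.get(w, 0) for w in tokenize(text))
    return 1 if total > 0 else -1 if total < 0 else 0

# The labeled dataset from the worked example
labeled = [
    ("Food was pretty good.", 0),
    ("Great atmosphere, and the lamb shank was delicious.", 1),
    ("Service was lacking, won't be coming back.", -1),
]

model = train(labeled)
print(predict(model, "The lamb shank was delicious!"))  # 1 (positive)
```

Note that this toy scorer can only recognize words it saw during training, which is exactly why real models need large labeled datasets: they have to see enough examples to generalize to words like "super" and "yummy" that never appeared in the training comments.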

An exam-style question

Here is the kind of scenario that could show up on the real exam:

A hospital is building an ML model to categorize documents produced during a patient's stay. Human reviewers have already tagged a collection of documents with categories such as the relevant department or document type. What would you call these tagged documents?

A) Raw data
B) Structured data
C) Labeled data
D) Unlabeled data

The keyword here is that human reviewers have already tagged the documents. That is the annotation step. Once the documents have been tagged with categories like "Cardiology" or "Discharge summary," they are no longer in their raw form, so A is out. The question is not asking about how the documents are organized, so B is a distractor. And D is out because the documents do have labels attached. The answer is C, labeled data.

The pattern to watch for on the Generative AI Leader exam is the phrase "human reviewers have tagged" or any equivalent description of someone manually attaching categories. That is your signal that you are looking at labeled data.

Why this matters for the rest of the exam material

The labeled-versus-unlabeled distinction is the foundation for the next section on learning paradigms. Supervised learning needs labeled data because the whole point is mapping inputs to known outputs. Unsupervised learning works on unlabeled data because the goal is to discover structure, not predict a known label. Reinforcement learning sidesteps the question entirely by learning from a reward signal rather than from labeled examples.
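The data-shape difference between the first two paradigms can be shown in one line each (these example comments are my own, reusing the restaurant scenario):

```python
# Supervised learning consumes (input, ground-truth label) pairs
supervised = [("Great atmosphere!", 1), ("Service was lacking.", -1)]

# Unsupervised learning consumes inputs alone -- the goal is to
# discover structure, such as clusters of similar comments,
# without being told any categories in advance
unsupervised = ["Great atmosphere!", "Service was lacking.", "Food was okay."]
```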

If you have a clean mental model of what makes data labeled, the learning paradigms section will land much more easily.

My Generative AI Leader course covers labeled and unlabeled data alongside the rest of the foundational material you need for the exam.
