Types of Generative AI Models for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
September 3, 2025

Once you have the relationship between AI, machine learning, and deep learning sorted out, the next thing the Generative AI Leader exam expects you to be comfortable with is how generative AI itself splits into different model types. The categorization is straightforward, but the way the categories overlap is where the interesting questions tend to live.

The four primary modalities

Generative AI is generally categorized by the kind of output a model produces. There are four primary modalities you should have memorized for the Generative AI Leader exam:

  • Text generation, handled by Large Language Models (LLMs).
  • Image generation, which produces visual content from prompts or other inputs.
  • Video generation, which produces moving visual content.
  • Audio generation, which produces speech, music, or other sound.

These four modalities are how we describe what comes out of a generative model. A model that outputs paragraphs of text falls into the text generation category. A model that outputs a 10-second clip of footage falls into the video generation category. The grouping is based on the output, not on the input.

Why LLMs sit in the middle

Of those four categories, Large Language Models are currently the most widely used type of generative AI model. When most people first encounter generative AI through a chatbot or a writing assistant, they are interacting with an LLM. LLMs primarily deal with text input and text output, and they are the foundation of many of the tools you likely already use day to day.

For the Generative AI Leader exam, it is worth knowing that LLM is the term reserved specifically for the text modality. Image generation models, video generation models, and audio generation models are not called LLMs, even though they are all generative AI. Mixing those terms up is a common source of wrong answers on questions that test category vocabulary.

Multimodality blurs the boundaries

The clean four-way split is a useful starting point, but it no longer matches where the field is heading. The lines between modalities are increasingly blurred as models and applications shift toward multimedia and multimodality.

The clearest way to picture this is as a Venn diagram. Imagine a large outer circle representing generative AI as a whole, with four smaller circles inside for LLMs, image generation, video generation, and audio generation. Those smaller circles overlap with each other rather than sitting in their own corners.

Each overlap corresponds to a kind of system. The overlap between the LLM circle and the image generation circle holds systems that take a complex written description and turn it into a visual rendering. The overlap between the LLM circle and the video generation circle holds systems that take a written prompt and produce moving footage. The overlap between the LLM circle and the audio generation circle holds systems that take text and produce speech or music. At the far end, a fully multimodal system is one where text, image, audio, and video can all be generated by the same architecture.

What this means for choosing an architecture

This categorization matters in a leadership context because it shapes how you talk about model selection. When you are scoping a use case, the first step is to identify which output modality you actually need. A document summarization workflow lives in the LLM circle. A marketing creative workflow may sit in the overlap between LLMs and image generation. A product walkthrough video generated from a script lives in the overlap between LLMs and video generation.
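For readers who think better in code, the scoping step described here can be sketched as a small lookup. This is only an illustration, not an official taxonomy or anything the exam requires: the `Modality` enum, the `USE_CASES` table, and the `describe_region` helper are all hypothetical names, and the modality sets assigned to each use case simply restate the three examples from the paragraph.

```python
from enum import Enum


class Modality(Enum):
    """The four output modalities named in this article."""
    TEXT = "text"      # the LLM circle
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"


# Hypothetical mapping of use cases to the modalities they involve.
# These assignments mirror the examples in the text and are assumptions,
# not a definitive classification.
USE_CASES = {
    "document summarization": frozenset({Modality.TEXT}),
    "marketing creative": frozenset({Modality.TEXT, Modality.IMAGE}),
    "product walkthrough video from a script": frozenset(
        {Modality.TEXT, Modality.VIDEO}
    ),
}


def describe_region(modalities):
    """Say where a set of modalities lands in the Venn diagram."""
    names = sorted(m.value for m in modalities)
    if len(names) == 1:
        return f"inside the {names[0]} circle"
    return "in the overlap between " + " and ".join(names)


if __name__ == "__main__":
    for use_case, modalities in USE_CASES.items():
        print(f"{use_case}: {describe_region(modalities)}")
```

The point of the sketch is the order of operations: name the modalities the use case needs first, and only then look at which models cover that region of the diagram.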

Understanding these overlaps gives you a steady framework for matching the right kind of generative model to the business use case in front of you, instead of defaulting to whichever model happens to be in the news that week.

What to remember for the exam

For the Generative AI Leader certification, the points to lock in from this topic are:

  • Generative AI splits into four main modalities: text, image, video, and audio.
  • LLMs are the text modality and are the most widely used type of generative AI model right now.
  • Modern systems are increasingly multimodal, meaning the boundaries between those four categories are no longer hard walls.
  • The Venn diagram framing (overlapping circles inside a larger Gen AI circle) is the mental model worth holding onto.

My Generative AI Leader course covers this categorization in more depth alongside the rest of the foundational material, including the AI versus ML versus deep learning hierarchy that sits one level above it.
