Chirp Speech Models for the Generative AI Leader Exam

GCP Study Hub
Ben Makansi
October 24, 2025

When most people think about foundation models, they default to text and image. The Generative AI Leader exam pushes you past that mental shortcut, and Chirp is one of the reasons why. Chirp is a speech foundation model, and it shows up on the exam because Google wants leaders to recognize that speech is its own modality with its own dedicated model on Google Cloud.

What Chirp actually is

Chirp is a speech-to-text model built on the Universal Speech Model (USM), which has 2 billion parameters. Chirp inherits that scale and is optimized for accuracy across more than 100 languages. It is also designed to handle accents and noisy environments, which is the kind of detail that separates a model that works in a demo from one that works in production, where audio quality is rarely clean.

The flow is straightforward. Audio input goes in on one side, Chirp processes the speech and applies its language models in the middle, and transcribed text comes out on the other side. That is the entire pipeline conceptually, and that simplicity is part of why Chirp is grouped with the other foundation models rather than treated as a niche audio tool.
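
To make that pipeline concrete, here is a minimal sketch of what calling Chirp looks like through the Speech-to-Text v2 API, which is where Google Cloud exposes the model. The project ID, region, and audio file name are placeholder assumptions, and the exact field names are worth verifying against the current client library documentation.

```python
# Minimal sketch: transcribing a short audio clip with Chirp through the
# Speech-to-Text v2 API. Project ID, region, and file path are placeholders.
from google.cloud import speech_v2

PROJECT_ID = "my-project"  # placeholder: your GCP project
REGION = "us-central1"     # Chirp is served from specific regions

client = speech_v2.SpeechClient(
    client_options={"api_endpoint": f"{REGION}-speech.googleapis.com"}
)

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),  # infer encoding
    language_codes=["en-US"],
    model="chirp",  # select the Chirp foundation model
)

# Synchronous recognition is for short clips; longer audio goes through
# batch recognition instead.
with open("meeting_audio.wav", "rb") as f:
    audio_bytes = f.read()

response = client.recognize(
    request=speech_v2.RecognizeRequest(
        # "_" uses an ad-hoc recognizer with the inline config above
        recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
        config=config,
        content=audio_bytes,
    )
)

for result in response.results:
    print(result.alternatives[0].transcript)
```

Audio bytes go in, a configured Chirp recognizer does the work, and transcript text comes back, which is exactly the conceptual pipeline described above.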

Why it counts as a foundation model

Chirp is sometimes described as lesser-known than Gemini or Imagen, but it is a foundation model in the same sense they are. It is pretrained at scale on a massive corpus, it generalizes across languages and acoustic conditions, and it serves as the base layer that downstream applications build on. For the Generative AI Leader exam, the framing matters. If a question lists Gemini, Imagen, Veo, and Chirp and asks which are foundation models, the answer is all of them.

Use cases worth remembering

Two use cases come up often when Chirp is discussed. The first is automated video captions and subtitles. Chirp can take a video's audio track and transcribe it accurately, even when speakers have strong accents, which is the practical bar for caption work that scales across global content libraries. The second is clinical documentation, where doctors dictate notes and Chirp transcribes them. The clinical case highlights why noise robustness and accent handling matter, since these environments are not studio-clean and the cost of transcription errors is real.
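
For the captioning case, longer audio such as a video's extracted track typically goes through batch recognition against Cloud Storage rather than an inline request. The sketch below is illustrative: the bucket path and file are hypothetical, and the "auto" language setting, which asks Chirp to detect the spoken language, is worth confirming against the docs for your region.

```python
# Hedged sketch: batch-transcribing a video's audio track stored in Cloud
# Storage, the kind of call that backs automated captions at scale.
from google.cloud import speech_v2

PROJECT_ID = "my-project"  # placeholder
REGION = "us-central1"     # a region where Chirp is available

client = speech_v2.SpeechClient(
    client_options={"api_endpoint": f"{REGION}-speech.googleapis.com"}
)

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    # "auto" asks Chirp to detect the language; useful for global content
    # libraries, but verify support for your setup before relying on it.
    language_codes=["auto"],
    model="chirp",
)

request = speech_v2.BatchRecognizeRequest(
    recognizer=f"projects/{PROJECT_ID}/locations/{REGION}/recognizers/_",
    config=config,
    files=[
        # hypothetical bucket and file name
        speech_v2.BatchRecognizeFileMetadata(uri="gs://my-bucket/video_audio.flac")
    ],
    # return transcripts inline rather than writing them back to GCS
    recognition_output_config=speech_v2.RecognitionOutputConfig(
        inline_response_config=speech_v2.InlineOutputConfig()
    ),
)

operation = client.batch_recognize(request=request)
response = operation.result()  # long-running operation; blocks until done

for uri, file_result in response.results.items():
    for result in file_result.transcript.results:
        print(uri, result.alternatives[0].transcript)
```

The same call shape covers the clinical dictation case; only the audio source and downstream handling of the transcript change.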

For the Generative AI Leader exam, the takeaway is to treat Chirp as the default Google Cloud answer when a scenario involves turning spoken language into written text. If you see a use case that mentions transcription, captioning, voice notes, or multilingual audio, Chirp is the model to reach for.

How this fits into the exam

The Generative AI Leader exam expects you to map use cases to the right Google foundation model. Text scenarios point to Gemini. Image generation points to Imagen. Video points to Veo. Speech-to-text points to Chirp. Holding that mapping in your head is most of what you need on Chirp questions, alongside the fact that it is built on USM with 2 billion parameters and covers 100-plus languages with accent and noise tolerance.

My Generative AI Leader course covers Chirp alongside the rest of the foundational material, so the speech modality slots into the same model-selection framework you use for text, image, and video questions on the Generative AI Leader exam.
