
Note (2026-05-06): Vertex AI was rebranded as Gemini Enterprise Agent Platform. Google's exam guides still use the Vertex AI naming, so this article does too. The official guides may switch to the new name at some point as you prep, but for now we're matching the language currently in the exam materials.
Not every AI use case requires building something from scratch. Google's pre-trained AI APIs are ready-to-use services that you call with your data and get an intelligent result back. No model training, no infrastructure setup. You hit an endpoint and you get structured output.
For the Generative AI Leader exam, you need to recognize when a scenario is calling for one of these pre-trained APIs rather than a custom model or a generative model. The hint is almost always in the words "common task" or "general data." Pre-trained APIs are designed for general-purpose use cases like extracting text, analyzing images, transcribing audio, and pulling entities out of text.
Here are the ones to have on the tip of your tongue going into the Generative AI Leader certification.
Document AI extracts, classifies, and processes information from documents automatically using machine learning. It has three core capabilities worth knowing.
Document classification automatically categorizes documents by type, whether it is an invoice, contract, claim form, or tax document. Instead of someone manually sorting and organizing, the system does it.
Entity extraction pulls key information from unstructured documents. From an invoice, it extracts the invoice number, date, vendor name, line items, and total amount. From a contract, it can pull out parties, dates, key clauses, and obligations.
Document processing is Document AI applied at scale, handling invoices, contracts, forms, and claims in high volume. What would take weeks manually can happen in minutes.
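To make that concrete, here is a minimal sketch of entity extraction using the Python client library. The processor path and the invoice.pdf file are placeholders; in practice you create a processor (for example, an invoice parser) in the console first and use its real ID.

```python
from google.cloud import documentai_v1 as documentai

# Placeholder processor path. Create a processor (e.g. an invoice parser)
# in the console, then substitute your project, location, and processor ID.
PROCESSOR_NAME = "projects/my-project/locations/us/processors/my-processor-id"

client = documentai.DocumentProcessorServiceClient()

# Read a local PDF and send it for synchronous processing.
with open("invoice.pdf", "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=PROCESSOR_NAME, raw_document=raw_document)
)

# Each extracted entity carries a type, the matched text, and a confidence score.
for entity in result.document.entities:
    print(f"{entity.type_}: {entity.mention_text} (confidence {entity.confidence:.2f})")
```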
On the exam, think Document AI whenever you see scenarios involving unstructured document processing, high-volume form handling, or workflow automation where classification and extraction are bottlenecks.
Cloud Vision API analyzes images using pre-trained machine learning models to extract meaningful information and insights. The key word is pre-trained. Google has already trained these models on billions of images, so you do not need to build from scratch.
Object detection identifies and locates objects within images. Send a photo of a warehouse shelf and it can detect products, boxes, and inventory items. It returns the type of object, its location, and a confidence score. That makes it valuable for quality control, asset tracking, and automated inventory management.
Text recognition, or OCR, extracts text from images. Photographs of documents, signs, and license plates can all be processed. Cloud Vision supports multiple languages, so a single API call can read text in English, Portuguese, Chinese, or dozens of other languages.
Face and landmark detection identifies human faces in images and can pick up facial attributes like emotional expression (joy, sorrow, anger, surprise) and head pose. It also recognizes geographic landmarks like Big Ben or the Golden Gate Bridge. Useful for photo organization, content moderation, and security applications.
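Here is a rough sketch of what calling Cloud Vision looks like in Python, combining object detection and OCR on a single image. The warehouse-shelf.jpg path is a placeholder.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Load a local image (the path is a placeholder).
with open("warehouse-shelf.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# Object detection: each annotation has a name, a confidence score,
# and a bounding polygon locating the object in the frame.
for obj in client.object_localization(image=image).localized_object_annotations:
    print(f"{obj.name} (score {obj.score:.2f})")

# OCR: the first text annotation is the full extracted text block.
texts = client.text_detection(image=image).text_annotations
if texts:
    print(texts[0].description)
```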
On the exam, think Cloud Vision API whenever you see scenarios involving image analysis, automated quality control, or extracting information from visual content.
Cloud Video Intelligence API analyzes video content using machine learning to extract insights, detect scenes, objects, and patterns in motion. Like the Cloud Vision API, it runs on pre-trained models.
Scene detection identifies scene boundaries and transitions automatically. It recognizes where one scene ends and another begins, which is useful for video editing, content organization, and splitting raw footage into meaningful segments.
Object tracking is where video becomes more powerful than static images. The API detects objects and tracks them across frames over time. If a person walks through a room, the API follows them. If a car moves through traffic, it tracks the car's movement and returns coordinates for each frame. Essential for surveillance and activity monitoring.
Label detection categorizes video content automatically by identifying activities, objects, scenes, and concepts. A video might be labeled with "person walking," "outdoor scene," "daytime," "park." That enables intelligent video indexing and search at scale.
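A sketch of scene and label detection through the Python client, assuming a placeholder Cloud Storage URI. The call is asynchronous, so you wait on a long-running operation rather than getting results back immediately.

```python
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

# annotate_video is asynchronous: it returns a long-running operation.
# The Cloud Storage URI is a placeholder.
operation = client.annotate_video(
    request={
        "input_uri": "gs://my-bucket/lobby-footage.mp4",
        "features": [
            videointelligence.Feature.SHOT_CHANGE_DETECTION,
            videointelligence.Feature.LABEL_DETECTION,
        ],
    }
)
result = operation.result(timeout=300).annotation_results[0]

# Shot (scene) boundaries come back as time segments.
for shot in result.shot_annotations:
    print(f"Shot: {shot.start_time_offset} -> {shot.end_time_offset}")

# Segment-level labels describe what appears across the whole video.
for label in result.segment_label_annotations:
    print(f"Label: {label.entity.description}")
```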
On the exam, think Cloud Video Intelligence whenever you see scenarios involving video analysis, security monitoring, content management, or automated video processing. It is built for the temporal, frame-by-frame nature of video.
Natural Language API analyzes text to extract meaning, sentiment, entities, and relationships using machine learning. As with the other pre-trained APIs, you send text and get structured insights back.
Sentiment analysis measures the emotional tone of text and returns a sentiment score and magnitude. The score tells you whether the sentiment is positive, negative, or neutral. The magnitude tells you how strong it is. A review that says "This product is amazing" gets a high positive score. "Worst purchase ever" gets a high negative score. Critical for customer feedback prioritization, brand monitoring, and support ticket routing.
Entity recognition identifies people, places, organizations, and concepts mentioned in the text. If a support ticket says "John from the Chicago office mentioned an issue with Salesforce integration," the API extracts John as a person, Chicago as a location, and Salesforce as an organization. Unstructured text becomes structured data.
Content classification categorizes text into predefined categories like news, sports, technology, or entertainment. Useful for content moderation and organizing large volumes of text by subject matter.
To make this concrete, take the sentence "John loves his new iPhone that he bought in New York last week." The Natural Language API returns a sentiment score around +0.85 (strongly positive, the word "loves" is doing the heavy lifting), and entity extraction picks up John as a PERSON, iPhone as a CONSUMER_GOOD, New York as a LOCATION, and "last week" as a DATE. It also returns a syntax parse with parts of speech and grammatical roles, where John is tagged as a proper noun and the subject, and "loves" is tagged as the verb and root of the sentence.
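Here is roughly what that call looks like with the Python client. The exact numbers can vary by model version, but the shape of the response is what matters.

```python
from google.cloud import language_v1 as language

client = language.LanguageServiceClient()

document = language.Document(
    content="John loves his new iPhone that he bought in New York last week.",
    type_=language.Document.Type.PLAIN_TEXT,
)

# Sentiment: score runs from -1.0 to +1.0, magnitude measures overall strength.
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(f"score={sentiment.score:.2f} magnitude={sentiment.magnitude:.2f}")

# Entities: each has a name, a type, and a salience (importance) value.
for entity in client.analyze_entities(request={"document": document}).entities:
    print(f"{entity.name}: {entity.type_.name}")
```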
Speech-to-Text takes an audio file and returns a text transcription. Useful for captions, voice commands, or call center logs. It supports both real-time streaming and batch transcription, so it works for live use cases as well as pre-recorded files. It also does speaker diarization, labeling which parts of a conversation came from which speaker, like "Speaker 1" and "Speaker 2."
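A minimal batch transcription sketch with diarization turned on. The file name, encoding, and sample rate are placeholders for whatever your actual recording uses.

```python
from google.cloud import speech

client = speech.SpeechClient()

# Read a short local recording (path is a placeholder).
with open("support-call.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Diarization tags each recognized word with a speaker number.
    diarization_config=speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=2,
        max_speaker_count=2,
    ),
)

response = client.recognize(config=config, audio=audio)

# With diarization, the last result aggregates every word with its speaker tag.
for word in response.results[-1].alternatives[0].words:
    print(f"Speaker {word.speaker_tag}: {word.word}")
```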
Text-to-Speech does the reverse. It takes written text and synthesizes human-like audio. You can customize voice, pitch, and speed, which makes it suitable for things like virtual assistants or reading apps.
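And the reverse direction, sketched with the Python client. The text, voice selection, and output file name are arbitrary.

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="Your order has shipped."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
    ),
    # Speaking rate and pitch are tunable per request.
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.0,
        pitch=0.0,
    ),
)

# The response carries raw MP3 bytes, ready to write to a file.
with open("greeting.mp3", "wb") as out:
    out.write(response.audio_content)
```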
The Generative AI Leader exam will hand you a scenario and ask which Google service to use. When the scenario describes a common task on general data, with no requirement for custom training, domain-specific fine-tuning, or generating new content, the answer is one of the pre-trained APIs.
Map the modality to the API. Documents to Document AI. Images to Cloud Vision. Video to Cloud Video Intelligence. Text to Natural Language. Audio either direction to Speech-to-Text or Text-to-Speech. If the question instead emphasizes content creation, custom training data, or fine-tuning, you are out of pre-trained API territory and into Vertex AI or generative model territory.
My Generative AI Leader course covers these pre-trained APIs alongside the rest of the foundational material you need for the exam.