Gemini Multimodal Media (Image/Video/Speech) Skill
1. Goals and scope
This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
- Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
- Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
- Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
- Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
- Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
- Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)
Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as the main line; currently only Node.js/REST examples are provided. If your project already wraps other languages or frameworks, map this Skill's request structure, model selection, and I/O spec to your wrapper layer.
2. Quick routing (decide which capability to use)
1) Do you need to produce images?
- Need to generate images from scratch or edit an existing image -> use Nano Banana image generation (see Section 5)
2) Do you need to understand images?
- Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)
3) Do you need to produce video?
- Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)
4) Do you need to understand video?
- Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)
5) Do you need to read text aloud?
- Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)
6) Do you need to understand audio?
- Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)
3. Unified engineering constraints and I/O spec (must read)
3.0 Prerequisites (dependencies and tools)
- Node.js 18+ (match your project version)
- Install SDK (example):
```bash
npm install @google/genai
```
- REST examples only need curl; to extract Base64 image data from JSON responses, install jq (optional).
3.1 Authentication and environment variables
- Put your API key in the GEMINI_API_KEY environment variable
- REST requests send it in the header x-goog-api-key: $GEMINI_API_KEY
3.2 Two file input modes: Inline vs Files API
Inline (embedded bytes/Base64)
- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.
Files API (upload then reference)
- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:
  1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
  2. Use file_data / file_uri in generateContent
Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example 10-15MB warning) or is reused, you automatically route through the Files API.
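The routing decision behind ensure_file_uri() could be sketched as below. This is a minimal sketch, not part of the SDK: chooseInputMode, its option names, and the 15 MB default threshold are hypothetical, and the actual upload call is left to the caller. It compares the Base64-encoded size, since inline bytes are sent encoded.

```javascript
// Hypothetical helper: decide between inline bytes and the Files API.
const MB = 1024 * 1024;

function chooseInputMode(sizeBytes, { reused = false, thresholdBytes = 15 * MB } = {}) {
  if (!Number.isFinite(sizeBytes) || sizeBytes < 0) {
    throw new RangeError("sizeBytes must be a non-negative number");
  }
  // Base64 encodes every 3 raw bytes as 4 characters, so estimate the sent size.
  const encodedBytes = Math.ceil(sizeBytes / 3) * 4;
  // Reused files and files near the request ceiling go through the Files API.
  if (reused || encodedBytes > thresholdBytes) return "files_api";
  return "inline";
}
```

A caller would upload the file and pass its URI when the result is "files_api", and embed the Base64 bytes otherwise.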
3.3 Unified handling of binary media outputs
- Images: usually returned as inline_data (Base64) in response parts; the Python SDK offers part.as_image(), while in Node.js you decode part.inlineData.data from Base64 and save it as PNG/JPG.
- Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono).
- Video (Veo): long-running async task; poll the operation, then download the file (or use the returned URI).
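Wrapping the raw PCM from TTS into a .wav file only requires a 44-byte RIFF header in front of the samples. The sketch below assumes the commonly reported format (24 kHz, 16-bit, mono); pcmToWav is a hypothetical helper name, and the defaults should be checked against the actual audio returned.

```javascript
// Hypothetical helper: prepend a canonical 44-byte RIFF/WAVE header to raw PCM.
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitsPerSample = 16 } = {}) {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4);  // file size minus the first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16);              // size of the fmt chunk
  header.writeUInt16LE(1, 20);               // audio format 1 = linear PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40);      // size of the sample data
  return Buffer.concat([header, pcm]);
}
```

Typical use would be to decode the Base64 audio part with Buffer.from(data, "base64"), pass it to pcmToWav, and write the result to disk.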
4. Model selection matrix (choose by scenario)
Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
4.1 Image generation (Nano Banana)
- gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
- gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.
4.2 General image/video/audio understanding
- The docs use gemini-3-flash-preview for image, video, and audio understanding (choose a stronger model when quality matters more than cost).
4.3 Video generation (Veo)
- Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).
4.4 Speech generation (TTS)
- Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).
5. Image generation (Nano Banana)
5.1 Text-to-Image
SDK (Node.js) minimal template
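```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

// The response can mix text parts with Base64-encoded image parts.
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```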
REST (with imageConfig) minimal template
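```bash
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio": "16:9"}}
  }'
```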
REST image parsing (Base64 decode)
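```bash
# The jq filter accepts either inlineData or inline_data as the key name.
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | (.inlineData // .inline_data // empty) | .data' \
  | base64 --decode > out.png
```
On macOS you can use: base64 -D > out.png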
5.2 Text-and-Image-to-Image
Use case: given an image, add/remove/modify elements, change style, color grading, etc.
SDK (Node.js) minimal template
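```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
// Send the source image inline as Base64 next to the edit instruction.
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```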
5.3 Multi-turn image iteration (Multi-turn editing)
Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set response_modalities to ["TEXT", "IMAGE"].
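When a response mixes text and images, the parts need to be separated before saving. The sketch below is one way to do that on a plain response object; splitParts is a hypothetical helper, and it accepts both the camelCase and snake_case key spellings as a precaution, since this document uses both.

```javascript
// Hypothetical helper: split a generateContent-style response into
// text strings and decoded image buffers.
function splitParts(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  const texts = [];
  const images = [];
  for (const part of parts) {
    if (typeof part.text === "string") texts.push(part.text);
    // Accept either key spelling for the inline payload.
    const inline = part.inlineData ?? part.inline_data;
    if (inline?.data) {
      images.push({
        mimeType: inline.mimeType ?? inline.mime_type ?? "image/png",
        bytes: Buffer.from(inline.data, "base64"),
      });
    }
  }
  return { texts, images };
}
```

Each entry in images can then be written with fs.writeFileSync, choosing the file extension from mimeType.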
5.4 ImageConfig
You can set in generationConfig.imageConfig or the SDK config:
- aspectRatio: e.g. 16:9, 1:1.
- imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive and model support can vary).
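A small builder keeps the REST request body consistent across calls. This is a sketch under assumptions: buildImageRequest is a hypothetical helper, the body shape follows the REST template in 5.1, and the imageSize field name should be verified against the current docs before use.

```javascript
// Hypothetical helper: build a generateContent request body with imageConfig.
function buildImageRequest(prompt, { aspectRatio, imageSize } = {}) {
  // Only a loose format check; the set of supported ratios is model-dependent.
  if (aspectRatio !== undefined && !/^\d+:\d+$/.test(aspectRatio)) {
    throw new TypeError(`aspectRatio must look like "16:9", got ${aspectRatio}`);
  }
  const imageConfig = {};
  if (aspectRatio !== undefined) imageConfig.aspectRatio = aspectRatio;
  if (imageSize !== undefined) imageConfig.imageSize = imageSize; // assumed field name
  const body = { contents: [{ parts: [{ text: prompt }] }] };
  // Omit generationConfig entirely when no image option was requested.
  if (Object.keys(imageConfig).length > 0) {
    body.generationConfig = { imageConfig };
  }
  return body;
}
```

The result can be passed through JSON.stringify and sent as the -d payload of the curl template.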
6. Image understanding
6.1 Two ways to provide input images
- Inline image data: suitable for small files (total request size < 20MB).
- Files API upload: better for large files or reuse across multiple requests.
6.2 Inline images (Node.js) minimal template
CODEBLOCK5
6.3 Upload and reference with Files API (Node.js) minimal template
CODEBLOCK6
6.4 Multi-image prompts
Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
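Mixing uploaded references and inline bytes in one request could look like the sketch below. buildMultiImageContents and the shape of its images argument are hypothetical; the fileData and inlineData part shapes follow the SDK's camelCase convention used elsewhere in this Skill.

```javascript
// Hypothetical helper: build one user turn holding several images and a prompt.
// Each image is either { fileUri, mimeType } (already uploaded)
// or { bytes, mimeType } (sent inline as Base64).
function buildMultiImageContents(prompt, images) {
  const parts = images.map((img) => {
    if (img.fileUri) {
      return { fileData: { fileUri: img.fileUri, mimeType: img.mimeType } };
    }
    if (img.bytes) {
      return {
        inlineData: {
          mimeType: img.mimeType,
          data: Buffer.from(img.bytes).toString("base64"),
        },
      };
    }
    throw new TypeError("each image needs either fileUri or bytes");
  });
  // Place the text instruction after the images.
  parts.push({ text: prompt });
  return [{ role: "user", parts }];
}
```

The returned array is meant to be passed as contents to generateContent.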
7. Video generation (Veo 3.1)
7.1 Core features (must know)
- Generates 8-second high-fidelity video at 720p, 1080p, or 4k, with optional native audio generation (dialogue, ambience, SFX).
- Supports:
  - Aspect ratio (16:9 / 9:16)
  - Video extension (extend a generated video; typically limited to 720p)
  - First/last frame control (frame-specific)
  - Up to 3 reference images (image-based direction)
7.2 SDK (Node.js) minimal template: async polling + download
CODEBLOCK7
7.3 REST minimal template: predictLongRunning + poll + download
Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.
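The two REST endpoints described above can be derived from the model name and the returned operation name. The sketch below only builds the requests and sends nothing; veoStartRequest and veoPollUrl are hypothetical helpers, and the instances/prompt body shape is an assumption to check against the Veo REST reference.

```javascript
// Base URL taken from the REST templates in Section 5.
const BASE_URL = "https://generativelanguage.googleapis.com/v1beta";

// Hypothetical helper: describe the request that starts a Veo generation.
function veoStartRequest(model, prompt) {
  return {
    url: `${BASE_URL}/models/${model}:predictLongRunning`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Assumed body shape; verify against the official docs.
    body: JSON.stringify({ instances: [{ prompt }] }),
  };
}

// Hypothetical helper: URL to poll with GET until the operation reports done.
function veoPollUrl(operationName) {
  return `${BASE_URL}/${operationName}`;
}
```

A caller would add the x-goog-api-key header, send the start request, read the operation name from the response, and poll veoPollUrl(name) with a timeout.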
7.4 Common controls (recommend a unified wrapper)
- aspectRatio: "16:9" or "9:16"
- resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
- When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
- Negative constraints: if the API supports a negative prompt field, use it; otherwise list the elements you do not want to see in the prompt itself.
7.5 Important limits (engineering fallback needed)
- Latency can vary from seconds to minutes; implement timeouts and retries.
- Generated videos are only retained on the server for a limited time (download promptly).
- Outputs include a SynthID watermark.
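Polling fallback (with timeout/backoff); assumes ai is an initialized client and operation was declared with let from the generation call in 7.2
```js
// Give up after 5 minutes in total.
const deadline = Date.now() + 300_000;
let sleepMs = 2000;
while (!operation.done && Date.now() < deadline) {
  await new Promise((resolve) => setTimeout(resolve, sleepMs));
  // Back off by 1.5x per attempt, capped at 15 seconds between polls.
  sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000);
  // Refresh the operation status from the server.
  operation = await ai.operations.getVideosOperation({ operation });
}
if (!operation.done) throw new Error("video generation timed out");
```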
8. Video understanding
8.1 Video input options
- Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
- Inline video data: for smaller files.
- Direct YouTube URL: can analyze public videos.
8.2 Files API (Node.js) minimal template
CODEBLOCK9
8.3 Timestamp prompting strategy
- Ask for segmented bullets with "(mm:ss)" timestamps.
- Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
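If the model is asked for "(mm:ss)" timestamps, downstream code can pull them out of the text before further processing. The sketch below assumes the model followed that exact format; extractTimestamps is a hypothetical helper, not an API feature.

```javascript
// Hypothetical helper: find "(mm:ss)" markers and convert them to seconds.
function extractTimestamps(text) {
  const found = [];
  // Minutes may exceed 59 for long videos; seconds must be 00-59.
  const pattern = /\((\d{1,3}):([0-5]\d)\)/g;
  for (const match of text.matchAll(pattern)) {
    found.push({
      label: `${match[1]}:${match[2]}`,
      seconds: Number(match[1]) * 60 + Number(match[2]),
      index: match.index, // position of the marker in the text
    });
  }
  return found;
}
```

Markers that do not match the format are skipped rather than guessed, so a missing timestamp shows up as a shorter result list.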
9. Speech generation (Text-to-Speech, TTS)
9.1 Positioning
- Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
- Distinguish from the Live API: the Live API targets interactive, unstructured audio/multimodal conversation; TTS focuses on controlled narration.
9.2 Single-speaker TTS (Node.js) minimal template
CODEBLOCK10
9.3 Multi-speaker TTS (max 2 speakers)
Requirements:
- Use multiSpeakerVoiceConfig
- Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
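Both requirements can be checked before a request is sent. The sketch below is an illustration only: buildMultiSpeakerConfig is a hypothetical helper, the "Name:" label format is an assumption about how the prompt is written, and the config field names should be verified against the current TTS docs.

```javascript
// Hypothetical helper: validate speakers and build a multi-speaker TTS config.
function buildMultiSpeakerConfig(prompt, speakers) {
  if (speakers.length < 1 || speakers.length > 2) {
    throw new RangeError("multi-speaker TTS supports at most 2 speakers");
  }
  for (const { name } of speakers) {
    // Assumes dialogue lines are labeled like "Joe: ..." in the prompt.
    if (!prompt.includes(`${name}:`)) {
      throw new Error(`speaker "${name}" has no matching label in the prompt`);
    }
  }
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: speakers.map(({ name, voiceName }) => ({
          speaker: name,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })),
      },
    },
  };
}
```

The returned object is intended to be passed as config to generateContent together with the TTS model name.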
9.4 Voice options and language
- voice_name supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore, etc.).
- The model can auto-detect input language and supports 24 languages (see docs for the list).
9.5 "Director notes" (strongly recommended for high-quality voice)
Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.
10. Audio understanding
10.1 Typical tasks
- Describe audio content (including non-speech like birds, alarms, etc.)
- Generate transcripts
- Transcribe specific time ranges
- Count tokens (for cost estimates/segmentation)
10.2 Files API (Node.js) minimal template
CODEBLOCK11
10.3 Key limits and engineering tips
- Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
- Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
- Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
- If total request size exceeds 20MB, you must use the Files API.
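The token rate and length cap above allow a quick local estimate before calling countTokens. This is a rough sketch using the figures quoted in this section (32 tokens/second, 9.5 hours); estimateAudioTokens is a hypothetical helper, and the constants may change.

```javascript
// Figures quoted in this section; verify against current docs.
const TOKENS_PER_SECOND = 32;
const MAX_AUDIO_SECONDS = 9.5 * 3600;

// Hypothetical helper: estimate token usage for a clip of known duration.
function estimateAudioTokens(durationSeconds) {
  if (!Number.isFinite(durationSeconds) || durationSeconds < 0) {
    throw new RangeError("durationSeconds must be a non-negative number");
  }
  return {
    tokens: Math.ceil(durationSeconds * TOKENS_PER_SECOND),
    withinLengthCap: durationSeconds <= MAX_AUDIO_SECONDS,
  };
}
```

For example, a one-minute clip comes to 60 × 32 = 1920 tokens, which matches the per-minute figure above.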
11. End-to-end examples (composition)
Example A: Image generation -> validation via understanding
1) Generate product images with Nano Banana (require negative space, consistent lighting).
2) Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
3) If not satisfied, feed the generated image into text+image editing and iterate.
Example B: Video generation -> video understanding -> narration script
1) Generate an 8-second shot with Veo (include dialogue or SFX).
2) Download and save (respect retention window).
3) Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).
Example C: Audio understanding -> time-range transcription -> TTS redub
1) Upload meeting audio and transcribe full content.
2) Transcribe or summarize specific time ranges.
3) Use TTS to generate a "broadcast" version of the summary.
12. Compliance and risk (must follow)
- Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
- Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
- Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.
13. Quick reference (Checklist)
- [ ] Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
- [ ] Pick the right input mode: inline for small files; Files API for large/reuse.
- [ ] Parse binary outputs correctly: image/audio via inline_data decode; video via operation polling + download.
- [ ] For video generation: set aspectRatio / resolution, and download promptly (avoid expiration).
- [ ] For TTS: set response_modalities=["AUDIO"]; max 2 speakers; speaker names must match prompt.
- [ ] For audio understanding: countTokens when needed; segment long audio or use Files API.
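Gemini Multimodal Media (Image/Video/Speech) Skill
1. Goals and scope
This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
- Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
- Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
- Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
- Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
- Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
- Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)
Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as the main line; currently only Node.js/REST examples are provided. If your project already wraps other languages or frameworks, map this Skill's request structure, model selection, and I/O spec to your wrapper layer.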
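2. Quick routing (decide which capability to use)
1) Do you need to produce images?
- Need to generate images from scratch or edit an existing image -> use Nano Banana image generation (see Section 5)
2) Do you need to understand images?
- Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)
3) Do you need to produce video?
- Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)
4) Do you need to understand video?
- Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)
5) Do you need to read text aloud?
- Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)
6) Do you need to understand audio?
- Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)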
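3. Unified engineering constraints and I/O spec (must read)
3.0 Prerequisites (dependencies and tools)
- Node.js 18+ (match your project version)
- Install SDK (example):
```bash
npm install @google/genai
```
- REST examples only need curl; to extract Base64 image data from JSON responses, install jq (optional).
3.1 Authentication and environment variables
- Put your API key in the GEMINI_API_KEY environment variable
- REST requests send it in the header x-goog-api-key: $GEMINI_API_KEY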
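3.2 Two file input modes: Inline vs Files API
Inline (embedded bytes/Base64)
- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.
Files API (upload then reference)
- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:
  1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
  2. Use file_data / file_uri in generateContent
Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example a 10-15MB warning level) or is reused, you automatically route through the Files API.
3.3 Unified handling of binary media outputs
- Images: usually returned as inline_data (Base64) in response parts; the Python SDK offers part.as_image(), while in Node.js you decode part.inlineData.data from Base64 and save it as PNG/JPG.
- Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono).
- Video (Veo): long-running async task; poll the operation, then download the file (or use the returned URI).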
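4. Model selection matrix (choose by scenario)
Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
4.1 Image generation (Nano Banana)
- gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
- gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.
4.2 General image/video/audio understanding
- The docs use gemini-3-flash-preview for image, video, and audio understanding (choose a stronger model when quality matters more than cost).
4.3 Video generation (Veo)
- Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).
4.4 Speech generation (TTS)
- Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).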
5. Image generation (Nano Banana)
5.1 Text-to-Image
SDK (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

// The response can mix text parts with Base64-encoded image parts.
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```
REST (with imageConfig) minimal template
```bash
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio": "16:9"}}
  }'
```
REST image parsing (Base64 decode)
```bash
# The jq filter accepts either inlineData or inline_data as the key name.
curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents": [{"parts": [{"text": "A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | (.inlineData // .inline_data // empty) | .data' \
  | base64 --decode > out.png
```
On macOS you can use: base64 -D > out.png
5.2 Text-and-Image-to-Image
Use case: given an image, add/remove/modify elements, change style, color grading, etc.
SDK (Node.js) minimal template
```js
import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
// Send the source image inline as Base64 next to the edit instruction.
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}
```
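5.3 Multi-turn image iteration (Multi-turn editing)
Best practice: use chat for continuous iteration (for example: generate first, then edit only a specific region/element, then make variants in the same style).
To output mixed text + image results, set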