Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

- Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
- Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
- Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
- Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
- Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
- Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

Convention: This Skill follows the official Google Gen AI SDK and REST API; only Node.js and REST examples are currently provided. If your project wraps the API in another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.

2. Quick routing (decide which capability to use)

1) Do you need to produce images?

- Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5)

2) Do you need to understand images?

- Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)

3) Do you need to produce video?

- Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)

4) Do you need to understand video?

- Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)

5) Do you need to read text aloud?

- Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)

6) Do you need to understand audio?

- Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)

3. Unified engineering constraints and I/O spec (must read)

3.0 Prerequisites (dependencies and tools)

- Node.js 18+ (match your project version)
- Install the SDK (example):

npm install @google/genai

- REST examples only need curl; if you need to parse image Base64, install jq (optional).

3.1 Authentication and environment variables

- Put your API key in the GEMINI_API_KEY environment variable.
- REST requests send it as the header x-goog-api-key: $GEMINI_API_KEY

3.2 Two file input modes: Inline vs Files API

Inline (embedded bytes/Base64)

- Pros: shorter call chain, good for small files.
- Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.

Files API (upload then reference)

- Pros: good for large files, reusing the same file, or multi-turn conversations.
- Typical flow:

1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
2. Use file_data / file_uri in generateContent

Engineering suggestion: implement ensure_file_uri() so that when a file exceeds a threshold (for example 10-15MB warning) or is reused, you automatically route through the Files API.
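The routing decision behind ensure_file_uri() can be sketched as a pure function. This is a minimal sketch under stated assumptions: the name chooseInputMode, the 10MB default threshold (the low end of the range above), and the choice to measure the Base64-encoded size against the ~20MB ceiling are all illustrative, not part of the SDK.

```javascript
// Hypothetical helper: decides between inline bytes and the Files API.
const INLINE_REQUEST_LIMIT = 20 * 1024 * 1024; // ~20MB total request ceiling
const FILES_API_THRESHOLD = 10 * 1024 * 1024; // assumed cutoff, tune per project

function chooseInputMode(fileSizeBytes, options = {}) {
  const { reused = false, threshold = FILES_API_THRESHOLD } = options;
  if (!Number.isFinite(fileSizeBytes) || fileSizeBytes < 0) {
    throw new RangeError("fileSizeBytes must be a non-negative number");
  }
  // Base64 encoding inflates the payload to 4 bytes per 3 input bytes.
  const encodedSize = Math.ceil(fileSizeBytes / 3) * 4;
  if (reused || fileSizeBytes > threshold || encodedSize >= INLINE_REQUEST_LIMIT) {
    return "files_api";
  }
  return "inline";
}

console.log(chooseInputMode(512 * 1024)); // small one-off file
console.log(chooseInputMode(12 * 1024 * 1024)); // above the assumed threshold
```

A wrapper would call this first and only upload when it returns "files_api", so small one-off files keep the shorter inline call chain.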

3.3 Unified handling of binary media outputs

- Images: usually returned as inline_data (Base64) in response parts; decode the Base64 payload and save it as PNG/JPG (the Python SDK offers part.as_image(); in Node.js read part.inlineData.data).
- Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono).
- Video (Veo): long-running async task; poll the operation, then download the file (or use the returned URI).
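Wrapping raw PCM into a .wav file only requires prepending a 44-byte RIFF header. The sketch below assumes the common 24kHz, 16-bit, mono layout mentioned above as defaults; the function name pcmToWav is hypothetical, and the real sample format should be confirmed against the response's MIME type.

```javascript
// Prepends a canonical 44-byte RIFF/WAVE header to raw little-endian PCM bytes.
function pcmToWav(pcm, { sampleRate = 24000, channels = 1, bitsPerSample = 16 } = {}) {
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16); // size of the fmt chunk
  header.writeUInt16LE(1, 20); // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40); // size of the sample data
  return Buffer.concat([header, pcm]);
}

// Usage: one second of silence at the default format.
const wav = pcmToWav(Buffer.alloc(48000));
console.log(wav.length);
```

With a TTS response, the input would be Buffer.from(base64Audio, "base64") and the result can be written with fs.writeFileSync("out.wav", wav).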

4. Model selection matrix (choose by scenario)

Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

4.1 Image generation (Nano Banana)

- gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
- gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

4.2 General image/video/audio understanding

- Docs use gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).

4.3 Video generation (Veo)

- Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).

4.4 Speech generation (TTS)

- Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).

5. Image generation (Nano Banana)

5.1 Text-to-Image

SDK (Node.js) minimal template
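import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

// Text parts carry commentary; image parts carry Base64 bytes in inlineData.
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}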

REST (with imageConfig) minimal template
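curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts": [{"text": "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio": "16:9"}}
  }'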

REST image parsing (Base64 decode)
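curl -s -X POST \
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | select(.inlineData) | .inlineData.data' \
  | base64 --decode > out.png

On macOS you can use: base64 -D > out.png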

5.2 Text-and-Image-to-Image

Use case: given an image, add/remove/modify elements, change style, color grading, etc.

SDK (Node.js) minimal template
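import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

// Save the image part returned by the model.
const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}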

5.3 Multi-turn image iteration (Multi-turn editing)

Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To output mixed "text + image" results, set response_modalities to ["TEXT", "IMAGE"].

5.4 ImageConfig

You can set in generationConfig.imageConfig or the SDK config:

- aspectRatio: e.g. 16:9, 1:1.
- imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive and model support can vary).

6. Image understanding (Image Understanding)

6.1 Two ways to provide input images

- Inline image data: suitable for small files (total request size < 20MB).
- Files API upload: better for large files or reuse across multiple requests.

6.2 Inline images (Node.js) minimal template

CODEBLOCK5
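A minimal sketch of the inline path from 6.1, assuming the request takes an array of parts in which the image travels as inlineData. The helper name buildInlineImageContents and the default MIME type are hypothetical; the size check mirrors the 20MB limit above but counts only the prompt and the encoded image.

```javascript
const MAX_INLINE_REQUEST_BYTES = 20 * 1024 * 1024; // limit stated in 6.1

// Builds the parts array for an inline-image request (hypothetical helper).
function buildInlineImageContents(prompt, imageBytes, mimeType = "image/jpeg") {
  const data = Buffer.from(imageBytes).toString("base64");
  const approxSize = Buffer.byteLength(prompt, "utf8") + data.length;
  if (approxSize >= MAX_INLINE_REQUEST_BYTES) {
    throw new RangeError("request too large for inline data; upload via the Files API");
  }
  return [{ inlineData: { mimeType, data } }, { text: prompt }];
}

// Usage with the SDK (not executed here; needs @google/genai and an API key):
//   const contents = buildInlineImageContents("Caption this image.", fs.readFileSync("photo.jpg"));
//   const response = await ai.models.generateContent({ model: "gemini-3-flash-preview", contents });
const demo = buildInlineImageContents("Caption this image.", Buffer.from([1, 2, 3]), "image/png");
console.log(demo[0].inlineData.data);
```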

6.3 Upload and reference with Files API (Node.js) minimal template

CODEBLOCK6
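A minimal sketch of the upload-then-reference flow, written against an injected client so it can be exercised without network access. It assumes the client exposes files.upload and models.generateContent as in the @google/genai SDK and that the uploaded file is referenced through a fileData part; the function name describeUploadedImage is hypothetical.

```javascript
// Uploads an image once, then references it by URI in a generateContent call.
// `ai` is expected to be a GoogleGenAI client (assumption: created elsewhere with
// new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY })).
async function describeUploadedImage(ai, filePath, prompt, mimeType = "image/jpeg") {
  const file = await ai.files.upload({ file: filePath, config: { mimeType } });
  const response = await ai.models.generateContent({
    model: "gemini-3-flash-preview",
    contents: [
      { fileData: { fileUri: file.uri, mimeType: file.mimeType } },
      { text: prompt },
    ],
  });
  return response.text;
}
```

Because the returned file URI is reusable, later requests about the same image can skip the upload step and pass the stored URI directly.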

6.4 Multi-image prompts

Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
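Mixing uploaded references and inline bytes in one request might look like the sketch below. The helper name buildMultiImageParts and the input shape ({ fileUri } or { bytes } plus mimeType) are assumptions made for illustration.

```javascript
// Builds one parts array from a mix of uploaded files and raw image bytes.
function buildMultiImageParts(prompt, images) {
  const parts = images.map((img) => {
    if (img.fileUri) {
      return { fileData: { fileUri: img.fileUri, mimeType: img.mimeType } };
    }
    if (img.bytes) {
      const data = Buffer.from(img.bytes).toString("base64");
      return { inlineData: { mimeType: img.mimeType, data } };
    }
    throw new TypeError("each image needs either fileUri or bytes");
  });
  parts.push({ text: prompt }); // the question goes after the images it refers to
  return parts;
}

const parts = buildMultiImageParts("What differs between these two images?", [
  { fileUri: "files/example-id", mimeType: "image/png" }, // placeholder URI
  { bytes: Buffer.from([1, 2, 3]), mimeType: "image/jpeg" },
]);
console.log(parts.length);
```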

7. Video generation (Veo 3.1)

7.1 Core features (must know)

- Generates 8-second high-fidelity video, optionally 720p / 1080p / 4k, and supports native audio generation (dialogue, ambience, SFX).
- Supports:
  - Aspect ratio (16:9 / 9:16)
  - Video extension (extend a generated video; typically limited to 720p)
  - First/last frame control (frame-specific)
  - Up to 3 reference images (image-based direction)

7.2 SDK (Node.js) minimal template: async polling + download

CODEBLOCK7

7.3 REST minimal template: predictLongRunning + poll + download

Key point: Veo REST uses :predictLongRunning to return an operation name, then poll GET /v1beta/{operation_name}; once done, download from the video URI in the response.
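The three REST steps can be sketched as small pure helpers. The request body shape (instances plus parameters) and the response path generateVideoResponse.generatedSamples[0].video.uri are assumptions based on the documented REST examples and should be verified against the current API reference; the helper names are hypothetical.

```javascript
const BASE_URL = "https://generativelanguage.googleapis.com/v1beta";

// Step 1: URL and JSON body for the :predictLongRunning call.
function buildVeoRequest(prompt, { model = "veo-3.1-generate-preview", aspectRatio, resolution } = {}) {
  const parameters = {};
  if (aspectRatio) parameters.aspectRatio = aspectRatio;
  if (resolution) parameters.resolution = resolution;
  return {
    url: `${BASE_URL}/models/${model}:predictLongRunning`,
    body: { instances: [{ prompt }], parameters },
  };
}

// Step 2: URL to poll, built from the operation name returned by step 1.
function operationUrl(operationName) {
  return `${BASE_URL}/${operationName}`;
}

// Step 3: pull the download URI out of a finished operation (null while pending).
function extractVideoUri(operation) {
  if (!operation.done) return null;
  if (operation.error) {
    throw new Error(`video generation failed: ${operation.error.message}`);
  }
  return operation.response?.generateVideoResponse?.generatedSamples?.[0]?.video?.uri ?? null;
}
```

Each request, including the final download, would carry the x-goog-api-key header described in 3.1.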

7.4 Common controls (recommend a unified wrapper)

- aspectRatio: "16:9" or "9:16"
- resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
- When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
- Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.
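A unified wrapper for these controls could validate values before any request is sent. This is a sketch under stated assumptions: the function name buildVeoConfig, the defaults, the negativePrompt field name, and the extendingVideo flag are illustrative; the allowed values and the 720p restriction on extension come from 7.1 and the list above.

```javascript
const ASPECT_RATIOS = new Set(["16:9", "9:16"]);
const RESOLUTIONS = new Set(["720p", "1080p", "4k"]);

// Validates common Veo controls and returns a config object (hypothetical wrapper).
function buildVeoConfig(options = {}) {
  const { aspectRatio = "16:9", resolution = "720p", negativePrompt, extendingVideo = false } = options;
  if (!ASPECT_RATIOS.has(aspectRatio)) {
    throw new RangeError(`unsupported aspectRatio: ${aspectRatio}`);
  }
  if (!RESOLUTIONS.has(resolution)) {
    throw new RangeError(`unsupported resolution: ${resolution}`);
  }
  if (extendingVideo && resolution !== "720p") {
    throw new RangeError("video extension is typically limited to 720p");
  }
  const config = { aspectRatio, resolution };
  // Assumed field name; only set it if the target API version accepts it.
  if (negativePrompt) config.negativePrompt = negativePrompt;
  return config;
}

console.log(buildVeoConfig({ aspectRatio: "9:16", resolution: "1080p" }));
```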

7.5 Important limits (engineering fallback needed)

- Latency can vary from seconds to minutes; implement timeouts and retries.
- Generated videos are only retained on the server for a limited time (download promptly).
- Outputs include a SynthID watermark.

Polling fallback (with timeout/backoff) pseudocode

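// Polls a Veo operation with exponential backoff until it completes or times out.
async function waitForVideo(ai, operation, timeoutMs = 300_000) {
  const deadline = Date.now() + timeoutMs; // default 5 min
  let sleepMs = 2000;
  while (!operation.done && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, sleepMs));
    sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000); // back off, capped at 15 s
    operation = await ai.operations.getVideosOperation({ operation });
  }
  if (!operation.done) throw new Error("video generation timed out");
  return operation;
}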

8. Video understanding (Video Understanding)

8.1 Video input options

- Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
- Inline video data: for smaller files.
- Direct YouTube URL: can analyze public videos.

8.2 Files API (Node.js) minimal template

CODEBLOCK9
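Uploaded videos may need server-side processing before they can be referenced. The sketch below is an assumption-laden illustration of waiting for that step: it presumes the client exposes files.get({ name }) and that the file object reports a state of "PROCESSING", "ACTIVE", or "FAILED"; the function name waitForFileActive and the timing defaults are hypothetical.

```javascript
// Polls an uploaded file until it is ready to be used in a prompt.
async function waitForFileActive(ai, fileName, { intervalMs = 2000, timeoutMs = 120000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const file = await ai.files.get({ name: fileName });
    if (file.state === "ACTIVE") return file;
    if (file.state === "FAILED") {
      throw new Error(`processing failed for ${fileName}`);
    }
    if (Date.now() >= deadline) {
      throw new Error(`timed out waiting for ${fileName}`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Once the file is active, its uri and mimeType can go into a fileData part next to a text prompt such as "Summarize this video with (mm:ss) timestamps."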

8.3 Timestamp prompting strategy

- Ask for segmented bullets with "(mm:ss)" timestamps.
- Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
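When the model is asked for "(mm:ss)" timestamps, downstream code can recover them with a small parser. A minimal sketch, assuming the model followed the requested format; the function name parseTimestamps is hypothetical, and malformed entries (for example seconds above 59) are skipped rather than repaired.

```javascript
// Extracts "(mm:ss)" markers from model output and converts them to seconds.
function parseTimestamps(text) {
  const pattern = /\((\d{1,3}):([0-5]\d)\)/g;
  return [...text.matchAll(pattern)].map((match) => ({
    label: `${match[1]}:${match[2]}`,
    seconds: Number(match[1]) * 60 + Number(match[2]),
  }));
}

const sample = "- (00:05) Speaker introduces the topic\n- (01:30) Live demo begins";
console.log(parseTimestamps(sample));
```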

9. Speech generation (Text-to-Speech, TTS)

9.1 Positioning

- Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
- Distinguish from the Live API: the Live API targets interactive, unstructured audio/multimodal conversation; TTS targets controlled narration.

9.2 Single-speaker TTS (Node.js) minimal template

CODEBLOCK10

9.3 Multi-speaker TTS (max 2 speakers)

Requirements:

- Use multiSpeakerVoiceConfig
- Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
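Both requirements can be checked before the request is sent. The sketch below assumes the config nests voices under speechConfig.multiSpeakerVoiceConfig.speakerVoiceConfigs with a prebuiltVoiceConfig.voiceName per speaker; verify that shape against the current SDK. The function name buildMultiSpeakerConfig and the "Name:" label convention are illustrative assumptions.

```javascript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a TTS config for up to two speakers and checks that each one
// appears as a "Name:" dialogue label in the prompt.
function buildMultiSpeakerConfig(prompt, speakers) {
  if (speakers.length < 1 || speakers.length > 2) {
    throw new RangeError("multi-speaker TTS supports at most 2 speakers");
  }
  for (const { name } of speakers) {
    const label = new RegExp(`^\\s*${escapeRegExp(name)}:`, "m");
    if (!label.test(prompt)) {
      throw new Error(`speaker "${name}" has no "${name}:" line in the prompt`);
    }
  }
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: speakers.map(({ name, voiceName }) => ({
          speaker: name,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })),
      },
    },
  };
}
```

The returned object would be passed as config alongside the TTS model name and the prompt in a generateContent call.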

9.4 Voice options and language

- voice_name supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore, etc.).
- The model can auto-detect input language and supports 24 languages (see docs for the list).

9.5 "Director notes" (strongly recommended for high-quality voice)

Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.

10. Audio understanding (Audio Understanding)

10.1 Typical tasks

- Describe audio content (including non-speech like birds, alarms, etc.)
- Generate transcripts
- Transcribe specific time ranges
- Count tokens (for cost estimates/segmentation)

10.2 Files API (Node.js) minimal template

CODEBLOCK11

10.3 Key limits and engineering tips

- Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
- Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
- Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
- If total request size exceeds 20MB, you must use the Files API.
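The tokenization rate and length cap above translate into simple arithmetic for cost estimates and segmentation. A minimal sketch using the stated figures (32 tokens/second, 9.5 hours per prompt), which may change; the function names are hypothetical, and countTokens remains the authoritative source for real requests.

```javascript
const TOKENS_PER_SECOND = 32; // approximate rate stated above
const MAX_AUDIO_SECONDS = 9.5 * 3600; // 9.5-hour cap per prompt

// Rough token estimate for an audio clip of the given duration.
function estimateAudioTokens(durationSeconds) {
  return Math.ceil(durationSeconds * TOKENS_PER_SECOND);
}

// Splits a long recording into consecutive ranges that each fit one prompt.
function planSegments(durationSeconds, maxSegmentSeconds = MAX_AUDIO_SECONDS) {
  const segments = [];
  for (let start = 0; start < durationSeconds; start += maxSegmentSeconds) {
    segments.push({ start, end: Math.min(start + maxSegmentSeconds, durationSeconds) });
  }
  return segments;
}

console.log(estimateAudioTokens(60)); // one minute of audio
console.log(planSegments(10 * 3600)); // a 10-hour recording needs two prompts
```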

11. End-to-end examples (composition)

Example A: Image generation -> validation via understanding

1) Generate product images with Nano Banana (require negative space, consistent lighting).
2) Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
3) If not satisfied, feed the generated image into text+image editing and iterate.

Example B: Video generation -> video understanding -> narration script

1) Generate an 8-second shot with Veo (include dialogue or SFX).
2) Download and save (respect retention window).
3) Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

Example C: Audio understanding -> time-range transcription -> TTS redub

1) Upload meeting audio and transcribe full content.
2) Transcribe or summarize specific time ranges.
3) Use TTS to generate a "broadcast" version of the summary.

12. Compliance and risk (must follow)

- Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
- Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
- Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

13. Quick reference (Checklist)

- [ ] Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
- [ ] Pick the right input mode: inline for small files; Files API for large/reuse.
- [ ] Parse binary outputs correctly: image/audio via inline_data decode; video via operation polling + download.
- [ ] For video generation: set aspectRatio / resolution, and download promptly (avoid expiration).
- [ ] For TTS: set response_modalities=["AUDIO"]; max 2 speakers; speaker names must match prompt.
- [ ] For audio understanding: countTokens when needed; segment long audio or use Files API.
