midasheng-audio-generate

Audio scene generation from text descriptions. Generates WAV audio with speech, sound effects, music, and environmental sounds.

1. Trigger

Use this skill when the user requests audio, sound effects, or music generation based on a text description.

2. Execution Steps

Step 1: Design the Audio Scene (Prompt Refinement)

Before calling the API, you must act as an expert Audio Scene Architect and Foley Designer. Deeply understand the user's natural language input (which may be in any language) and translate it into a highly structured tagged string based on real-world acoustic logic and scene realism.

Prompt Tag Definition:

* <|caption|>: The overall, comprehensive description of the audio scene.
* <|speech|>: Speaker identity (e.g., middle-aged man, energetic girl) and speaking style.
* <|asr|>: The actual transcript (the spoken dialogue).
* <|sfx|>: Specific sound effects present in the audio (e.g., footsteps, doorbell, dog barking).
* <|music|>: Description of background music (e.g., soft jazz, tense orchestral).
* <|env|>: Environmental or ambient background noise (e.g., city bustle, forest wind and crickets).

Crucial Generation Rules:

1. Scene Enrichment: Do not merely copy the user's input! Act as a sound designer and logically enrich the scene.
2. Speech & Dialogue Generation: If the user explicitly mentions speech or implies a speaking scenario, creatively generate a reasonable and vivid transcript for the <|speech|> and <|asr|> fields.
3. Strict ASR Formatting: For the <|asr|> tag, output only the raw spoken text. Do not include any speaker labels or narration, such as “man:”, “speaker1:”, or “a man says”.
4. Omit Missing Elements: If an element is not relevant, omit its tag entirely.
5. Language & Case Constraint: The entire generated prompt string MUST be in lowercase English, including the <|asr|> content.
6. Strict Output: Internally, output ONLY the formatted tagged string, which is passed to the next step.
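The tag and formatting rules above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name `build_prompt`, its argument order, and the idea that each tag is followed directly by its content with segments concatenated in the order listed are not specified by this skill.

```bash
# Hypothetical helper: assemble a tagged prompt from up to six fields.
# Argument order (an assumption): caption, speech, asr, sfx, music, env.
build_prompt() {
  prompt=""
  for tag in caption speech asr sfx music env; do
    # Take the next positional argument; missing arguments count as empty.
    value="${1-}"
    [ "$#" -gt 0 ] && shift
    # Omit the tag entirely when the element is not relevant.
    if [ -n "$value" ]; then
      prompt="${prompt}<|${tag}|>${value}"
    fi
  done
  # The whole prompt must be lowercase English; tr covers ASCII letters.
  printf '%s' "$prompt" | tr '[:upper:]' '[:lower:]'
}
```

For example, `build_prompt "Rainy street at night" "" "" "Footsteps on wet pavement" "" "Steady rain"` produces a string with only the caption, sfx, and env tags, since the speech, asr, and music arguments are empty.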

Step 2: Execute Command

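```bash
curl -X POST https://llmplus.ai.xiaomi.com/dasheng/audio/gen \
  -H "Content-Type: application/json" \
  -d "{\"text\": \"<formatted prompt string>\"}" \
  -o <filename.wav>
```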
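The request in this step sends the prompt as the `text` field of a JSON body, so a transcript containing double quotes or backslashes would need escaping first. A minimal sketch follows; the helper names `json_escape` and `generate_audio` are hypothetical, and control characters such as newlines are not handled.

```bash
# Escape backslashes and double quotes so the prompt can sit inside a JSON string.
# (Control characters such as newlines are not handled in this sketch.)
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical wrapper: $1 is the tagged prompt, $2 is the output WAV path.
generate_audio() {
  payload="{\"text\": \"$(json_escape "$1")\"}"
  curl -X POST https://llmplus.ai.xiaomi.com/dasheng/audio/gen \
    -H "Content-Type: application/json" \
    -d "$payload" \
    -o "$2"
}
```

Building the payload in a variable keeps the quoting in one place and avoids hand-escaping the prompt on the command line.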

3. Queue Status

Query Command

```bash
curl -X POST "https://llmplus.ai.xiaomi.com/metrics?path=/dasheng/audio/gen"
```

Returned Fields

- active: Number of currently active requests
- avglatencyms: Average processing latency (milliseconds)
- Estimated wait time = active × avglatencyms (result in milliseconds)
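The wait-time formula above can be worked through in a few lines of shell. The helper name `estimate_wait_s` is hypothetical, and the sketch assumes both fields have already been extracted from the response as plain numbers; the result is converted from milliseconds to seconds for readability.

```bash
# Hypothetical helper: estimated wait in seconds.
# $1 = active, $2 = avglatencyms (assumed to be plain numbers).
estimate_wait_s() {
  awk -v a="$1" -v l="$2" 'BEGIN { printf "%.1f\n", a * l / 1000 }'
}
```

With 3 active requests and an average latency of 4000 ms, the estimate is 3 × 4000 ms = 12000 ms, that is, 12 seconds.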

When to Call

1. When the IM (instant messaging) session is about to time out but the audio generation service has not returned a result: check the queue status and inform the user, asking them to check back later.
2. When the user later asks about task progress and the service still has not returned a result: check the latest queue status and report it to the user.

Status Levels

- 🟢 active=0 or estimated wait <5s → Service idle
- 🟡 Estimated wait 5-30s → Slight queue
- 🔴 Estimated wait >30s → Queue is long, recommend trying again later
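The three status levels can be expressed as a simple classifier. This is a sketch, not part of the service: the function name `queue_status` and the returned labels are made up for illustration, and the boundary values of exactly 5 s and 30 s are assumed to fall in the middle level.

```bash
# Hypothetical helper: map the queue fields to a status label.
# $1 = active, $2 = avglatencyms (assumed to be plain numbers).
queue_status() {
  awk -v a="$1" -v l="$2" 'BEGIN {
    wait_s = a * l / 1000
    if (a == 0 || wait_s < 5) print "idle"
    else if (wait_s <= 30)    print "slight queue"
    else                      print "long queue"
  }'
}
```

For instance, `queue_status 4 5000` gives an estimated wait of 20 seconds and therefore reports a slight queue.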

