Prompt Design & Tuning Best Practices

The goal of this Skill is not to casually “chat about prompts,” but to turn prompt tuning into an executable, reviewable, and cost-controlled engineering workflow.

The Agent handles most of the execution work.
Humans are responsible only for validating direction, approving high-cost loops, and signing off on the final launch candidate.

When to Use

Use this Skill when the user needs to:

- design or optimize a target prompt from scratch
- design a separate evaluation / judge prompt
- compare the performance of multiple models on an evaluation set
- work with an existing API curl, SDK integration, or request protocol
- run controlled prompt iterations under a limited budget
- turn prompt tuning into a reusable workflow instead of a one-off chat exercise

Working Modes

1. Design-Only Mode

Use this mode when:

- there is no runnable environment yet
- no evaluation resources are available yet
- real model calls cannot be executed for now

In this mode, the Agent should produce:

- task definition
- target prompt draft
- judge prompt draft
- evaluation plan
- script skeletons
- manual execution guidance

2. Execution Mode

Use this mode when:

- a runnable environment already exists
- the model invocation method has been provided
- the evaluation set, resource limits, and candidate models have been provided

In this mode, the Agent should continue with:

- batch generation
- automatic evaluation
- result analysis
- prompt iteration
- final candidate recommendation

Core Principles

The following rules are non-negotiable by default:

1. The target prompt and the judge prompt must be separated.

Do not silently modify both in the same comparison round and then mix their gains together.

2. Before large-scale evaluation, the task definition (task spec) must be frozen first.

3. Every round of prompt optimization must have a clear optimization hypothesis.

No random “this sentence feels off, let’s tweak it” behavior.

4. An experiment log must be maintained, including at least:

   - version number
   - summary of changes in the current round
   - optimization hypothesis
   - evaluation results
   - cost information
   - conclusion

5. Any high-cost evaluation loop must be approved by a human beforehand.

6. The final launch candidate must be reviewed by a human.

A high machine-evaluation score does not automatically mean it is ready for launch.

7. If the input information is incomplete, low-risk assumptions may be made, but they must be stated explicitly.

Recommended Inputs to Collect

The Agent should gather or infer the following whenever possible:

- business goal
- user scenario
- input format
- output format
- hard constraints
- unacceptable errors
- success criteria
- online acceptance threshold
- evaluation set
- candidate models
- invocation method (curl / SDK / API)
- resource limits (TPM, i.e. tokens per minute; RPM, i.e. requests per minute; timeout; budget; retry cap)

Target Deliverables

By default, the workflow should aim to produce the following:

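- `docs/taskspec.md`
- `prompts/productionpromptv{n}.md`
- `prompts/judgepromptv{n}.md`
- `docs/evalplan.md`
- `scripts/rungeneration.py`
- `scripts/runjudge.py`
- `reports/iteration{n}summary.md`
- `reports/finalrecommendation.md`
- `reports/experimentlog.md`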

Human Gates

By default, human confirmation is required only at the following key checkpoints:

Gate A — Freeze the Task Definition

Confirm:

- whether the task is understood correctly
- whether the success criteria are reasonable
- whether the constraints are complete

Gate B — Confirm the Direction of the Target Prompt

Confirm:

- whether the target prompt is directionally correct
- whether it is ready to enter evaluation

Gate C — Confirm the Direction of the Judge Prompt

Confirm:

- whether the evaluation standard is fair
- whether the judge is evaluating what actually matters

Gate D — Approve a High-Cost Iteration Loop

Confirm:

- model list
- TPM budget
- number of iteration rounds
- whether it is worth spending more resources

Gate E — Final Review

Confirm:

- whether the current best version can serve as a launch candidate
- whether to continue optimizing
- whether to stop

Unless the user explicitly asks for finer-grained control, do not interrupt the workflow between these gates.

Execution Flow

Phase 0 — Task Definition (Task Spec)

Before writing any prompt, first establish a clear task definition.

The task definition should include at least:

- problem description
- input format
- output format
- business goal
- user goal
- constraints
- explicitly forbidden outputs
- positive and negative examples
- success metrics
- unresolved issues
- current assumptions

If the user’s description is incomplete, do not stall.
Fill in reasonable assumptions first, then present them for confirmation.
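
As an illustration, the checklist above could be tracked with a small data structure so that gaps are visible before Gate A. This is a minimal sketch: the `TaskSpec` class, its field names, and the choice of which fields may stay empty are all hypothetical and simply mirror the list.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TaskSpec:
    """Hypothetical container mirroring the task-definition checklist."""
    problem_description: str = ""
    input_format: str = ""
    output_format: str = ""
    business_goal: str = ""
    user_goal: str = ""
    constraints: list[str] = field(default_factory=list)
    forbidden_outputs: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    success_metrics: list[str] = field(default_factory=list)
    unresolved_issues: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


# Assumption: these two may legitimately be empty in a finished spec.
OPTIONAL = {"unresolved_issues", "assumptions"}


def missing_fields(spec: TaskSpec) -> list[str]:
    """Return the names of required fields that are still empty."""
    return [
        f.name
        for f in fields(spec)
        if f.name not in OPTIONAL and not getattr(spec, f.name)
    ]
```

Anything returned by `missing_fields` is a candidate for an explicit assumption that the human confirms at Gate A.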

After this, proceed to Gate A.

Phase 1 — Generate the First Draft of the Target Prompt

Based on the task definition, produce the first draft of the target prompt.

Requirements:

- instructions must be clear
- constraints must be explicit
- output structure must be stable
- ambiguity should be minimized
- controllability should be prioritized over fluffy “stylistic” wording
- examples should be included only when they are truly helpful

Also output:

- key design rationale
- predicted risk points
- likely failure scenarios
- what to pay close attention to in the first evaluation round

After this, proceed to Gate B.

Phase 2 — Generate the First Draft of the Judge Prompt

Design an independent Judge / Eval Prompt.

Requirements:

- evaluate the task outcome, not whether the prompt itself reads nicely
- score across separate dimensions, then aggregate
- include hard-fail categories
- output must be structured JSON
- minimize bias caused by stylistic model preferences
- explicitly handle the following cases:

  - partially correct outputs
  - format errors
  - misunderstanding of the task
  - unsafe or policy-violating content
  - reasonable uncertainty caused by incomplete task information

Also output:

- scoring dimensions
- weight design
- hard-fail conditions
- Judge output JSON schema
- Judge blind spots
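
To make "score across separate dimensions, then aggregate" concrete, here is a minimal sketch of how one judge verdict might be parsed and combined. The dimension names, the weights, the 0 to 1 score scale, and the JSON keys (`scores`, `hard_fails`, `rationale`) are assumptions for illustration, not a prescribed schema.

```python
import json

# Hypothetical weights; real dimensions come from the frozen task definition.
WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "format": 0.2}


def aggregate(judge_json: str) -> dict:
    """Parse one judge verdict and compute a weighted score in [0, 1].

    Any hard-fail flag forces the final score to 0, regardless of the
    per-dimension scores.
    """
    verdict = json.loads(judge_json)
    scores = verdict["scores"]
    weighted = round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 4)
    hard_fails = verdict.get("hard_fails", [])
    return {
        "weighted_score": weighted,
        "hard_fails": hard_fails,
        "final_score": 0.0 if hard_fails else weighted,
    }


example = json.dumps({
    "scores": {"correctness": 1.0, "completeness": 0.5, "format": 1.0},
    "hard_fails": [],
    "rationale": "Covers the main point but omits one required field.",
})
print(aggregate(example)["final_score"])  # 0.85
```

Keeping the weighted score and the hard-fail list side by side makes it possible to see when a high-scoring output was rejected outright.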

After this, proceed to Gate C.

Phase 3 — Design the Evaluation Plan

Before running large-scale evaluations, define the evaluation plan clearly.

The plan should include at least:

- source and size of the evaluation set
- sample slicing strategy (easy / medium / hard / edge cases)
- online acceptance threshold
- primary metrics
- secondary diagnostic metrics
- tie-breaking rules
- maximum number of iteration rounds
- total budget limit
- early stopping conditions

Default loop policy:

- by default, run at most 2 high-cost optimization rounds
- stop if the gains are marginal and the failure types have not improved in substance
- if the Judge itself looks unreliable, fix the Judge first instead of continuing to modify the target prompt
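
The default loop policy can be written down as a small decision function. This is a sketch under stated assumptions: the `min_gain` threshold of 0.01 and the boolean inputs are hypothetical, and only the limit of 2 rounds comes from the policy above.

```python
def should_continue(
    rounds_done: int,
    score_gain: float,
    failure_mix_changed: bool,
    judge_reliable: bool,
    max_rounds: int = 2,
    min_gain: float = 0.01,  # assumed threshold for a "marginal" gain
) -> tuple[bool, str]:
    """Decide whether another high-cost round is justified, with a reason."""
    if not judge_reliable:
        return False, "fix the Judge before changing the target prompt"
    if rounds_done >= max_rounds:
        return False, "round limit reached"
    if score_gain < min_gain and not failure_mix_changed:
        return False, "marginal gain and no substantive change in failure types"
    return True, "another round is justified; request Gate D approval"


print(should_continue(rounds_done=1, score_gain=0.002,
                      failure_mix_changed=False, judge_reliable=True))
```

Returning the reason alongside the decision keeps the stop condition reviewable in the experiment log.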

Phase 4 — Write the Generation Script

If executable conditions are available, the Agent should write a batch generation script.

The script should support, as much as possible:

- jsonl / csv / excel input
- multiple models
- resume from checkpoint
- retries and backoff
- logging
- strict input-output order preservation
- TPM / RPM rate limiting
- structured outputs for downstream evaluation
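
Of the features above, retries with backoff are easy to get subtly wrong, so a minimal sketch follows. The `call` argument stands in for a real model request and is hypothetical, as are the default retry count and delay; the `sleep` parameter is injectable only so the function can be exercised without waiting.

```python
import random
import time
from typing import Callable


def call_with_retry(
    call: Callable[[], str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, int]:
    """Run `call`, retrying with exponential backoff plus jitter.

    Returns the result and the number of retries that were needed, so the
    retry count can be recorded with the output.
    """
    for attempt in range(max_retries + 1):
        try:
            return call(), attempt
        except Exception:  # in a real script, catch only transient errors
            if attempt == max_retries:
                raise
            # Delay doubles on each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("unreachable")
```

The final exception is re-raised rather than swallowed, so a permanently failing request shows up as request failure information instead of an empty output.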

TPM Handling Principles

Do not crudely translate TPM directly into high concurrency.

Preferred approach:

- estimate token consumption per request
- use token-bucket or time-window rate limiting
- use conservative concurrency when RPM and latency are unknown
- prioritize stability before speed
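
A token bucket is one way to apply the approach above. The sketch below is a minimal single-threaded version; the class name, the injectable `clock`, and the decision to start with a full bucket are assumptions, and a production script would also need locking and an RPM limit.

```python
import time


class TokenBucket:
    """Token-bucket limiter for a tokens-per-minute (TPM) budget."""

    def __init__(self, tpm: int, clock=time.monotonic):
        self.capacity = float(tpm)
        self.tokens = float(tpm)   # assumption: start with a full bucket
        self.rate = tpm / 60.0     # tokens refilled per second
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def wait_time(self, estimated_tokens: int) -> float:
        """Seconds to wait before a request of this size may be sent."""
        if estimated_tokens > self.capacity:
            raise ValueError("request exceeds the per-minute token budget")
        self._refill()
        if estimated_tokens <= self.tokens:
            return 0.0
        return (estimated_tokens - self.tokens) / self.rate

    def consume(self, estimated_tokens: int) -> None:
        """Deduct tokens once the request is actually sent."""
        self._refill()
        self.tokens -= estimated_tokens
```

A caller would sleep for `wait_time(n)` seconds and then call `consume(n)`; the per-request estimate `n` drives throughput, which is why TPM does not translate directly into a concurrency number.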

Phase 5 — Batch Generate Model Outputs

Run the full evaluation set across all specified models and prompt versions.

At minimum, record:

- model name
- prompt version
- input sample ID
- raw output
- token usage (if available)
- latency
- retry count
- request failure information
- truncation / parsing failures
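
One possible shape for such a record is sketched below, serialized as one JSON Lines row per request. The class name, field names, and types are hypothetical; they only mirror the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationRecord:
    model: str
    prompt_version: str
    sample_id: str
    raw_output: str
    latency_s: float
    retry_count: int = 0
    prompt_tokens: Optional[int] = None      # None when usage is not reported
    completion_tokens: Optional[int] = None
    error: Optional[str] = None              # request failure information
    truncated: bool = False
    parse_failed: bool = False


def to_jsonl_line(record: GenerationRecord) -> str:
    """Serialize one record as a single JSON Lines row."""
    return json.dumps(asdict(record), ensure_ascii=False)


rec = GenerationRecord(
    model="model-a", prompt_version="v1", sample_id="s-001",
    raw_output='{"label": "billing"}', latency_s=0.8,
)
print(to_jsonl_line(rec))
```

Keeping failures in the same record as successes makes it straightforward to separate infrastructure problems from prompt problems later.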

If generation failures occur frequently:

- first separate infrastructure issues from prompt issues
- do not conclude that the prompt is bad before ruling out quota, network, rate-limiting, or protocol problems

Phase 6 — Run Automatic Evaluation

Use the Judge Prompt to evaluate generated outputs in batch.

Requirements:

- Judge output must be structured JSON
- raw judge outputs must be traceable
- compute overall scores and slice-level metrics
- automatically identify major failure clusters
- distinguish format errors from content errors
- if the Judge is noisy, state that explicitly instead of pretending the results are reliable
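
A minimal sketch of the overall and slice-level computation follows. The row keys (`slice`, `judge_raw`) and the single `score` field in the judge JSON are assumptions. Rows whose judge output cannot be parsed are counted separately rather than scored, so a noisy Judge is visible in the summary.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(rows: list[dict]) -> dict:
    """Compute overall and per-slice mean scores from raw judge outputs."""
    by_slice = defaultdict(list)
    judge_format_errors = 0
    for row in rows:
        try:
            score = float(json.loads(row["judge_raw"])["score"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # The judge reply was not usable JSON; do not guess a score.
            judge_format_errors += 1
            continue
        by_slice[row["slice"]].append(score)
    all_scores = [s for scores in by_slice.values() for s in scores]
    return {
        "overall": mean(all_scores) if all_scores else None,
        "by_slice": {name: mean(scores) for name, scores in by_slice.items()},
        "judge_format_errors": judge_format_errors,
    }


rows = [
    {"slice": "easy", "judge_raw": '{"score": 1.0}'},
    {"slice": "easy", "judge_raw": '{"score": 0.5}'},
    {"slice": "hard", "judge_raw": '{"score": 0.0}'},
    {"slice": "hard", "judge_raw": "not json"},
]
print(summarize(rows))
```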

Phase 7 — Analyze and Optimize

A new prompt iteration is allowed only when there is a clear optimization hypothesis.

Each round must include:

1. summarize the previous round’s results
2. identify the major failure clusters
3. propose the optimization hypothesis for this round
4. modify only the most necessary prompt sections
5. provide a version-diff summary
6. predict what should improve and what may regress
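
For the version-diff summary, a unified diff between two prompt versions is often enough. The sketch below uses Python's standard `difflib`; the function name and the sample prompts are illustrative only.

```python
import difflib


def diff_summary(old_prompt: str, new_prompt: str,
                 old_name: str, new_name: str) -> str:
    """Return a unified diff between two prompt versions."""
    lines = difflib.unified_diff(
        old_prompt.splitlines(),
        new_prompt.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)


v1 = "Summarize the ticket.\nReturn JSON."
v2 = "Summarize the ticket.\nReturn JSON with keys: summary, priority."
print(diff_summary(v1, v2, "v1", "v2"))
```

A small diff is also a quick check that only the most necessary sections were modified.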

Do not run another round for no reason.

If the next round will consume meaningful resources, go to Gate D first.

Phase 8 — Final Recommendation

Once a version reaches a sufficiently strong level, the Agent should produce a final review package.

It should include at least:

- final target prompt
- final judge prompt
- recommended model
- overall metrics
- slice-level metrics
- major remaining failure types
- cost / latency notes
- whether launch is recommended
- what should be monitored after launch

After this, proceed to Gate E.

Default Outputs at Each Gate

Gate A Output

- task definition document
- current assumptions
- missing information
- recommended acceptance criteria

Gate B Output

- target prompt v1
- design rationale
- expected risks

Gate C Output

- judge prompt v1
- scoring rubric
- JSON schema
- Judge blind spots

Gate D Output

- current result comparison
- failure analysis
- prompt change summary
- next-round optimization hypothesis
- estimated resource consumption
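
The resource estimate can be a back-of-the-envelope calculation. The sketch below assumes one generation call and one judge call per sample and model, a single shared TPM limit, and per-request token estimates supplied by the caller; all parameter names are hypothetical.

```python
import math


def estimate_round(
    n_samples: int,
    n_models: int,
    tokens_per_request: int,
    judge_tokens_per_request: int,
    tpm_limit: int,
) -> dict:
    """Rough token and wall-clock estimate for one evaluation round."""
    requests = n_samples * n_models
    generation_tokens = requests * tokens_per_request
    judge_tokens = requests * judge_tokens_per_request
    total_tokens = generation_tokens + judge_tokens
    return {
        "generation_requests": requests,
        "judge_requests": requests,
        "total_tokens": total_tokens,
        # Lower bound: assumes the TPM limit is the only bottleneck.
        "min_minutes": math.ceil(total_tokens / tpm_limit),
    }


print(estimate_round(n_samples=200, n_models=3, tokens_per_request=1500,
                     judge_tokens_per_request=1000, tpm_limit=60000))
```

Retries and latency are not modeled, so the real round will take at least this long and likely longer.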

Gate E Output

- final candidate
- why it is the current best version
- where it may still fail
- recommendation to launch / continue optimizing / stop

Default Analysis Templates

Experiment Log Fields

Each experiment round should record at least:

- iteration
- production_prompt_version
- judge_prompt_version
- model
- dataset_version
- hypothesis
- change_summary
- aggregate_score
- slice_scores
- dominant_failures
- cost
- verdict
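
As a sketch of how one round could be recorded, the dataclass below follows the field list above (written in snake_case) and renders an entry as a Markdown table row. The types, the rendering format, and the sample values are assumptions, not results from a real run.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentLogEntry:
    iteration: int
    production_prompt_version: str
    judge_prompt_version: str
    model: str
    dataset_version: str
    hypothesis: str
    change_summary: str
    aggregate_score: float
    slice_scores: dict = field(default_factory=dict)
    dominant_failures: list = field(default_factory=list)
    cost: dict = field(default_factory=dict)
    verdict: str = ""


def to_markdown_row(entry: ExperimentLogEntry) -> str:
    """Render one round as a Markdown table row, one cell per field."""
    cells = []
    for value in asdict(entry).values():
        if isinstance(value, dict):
            value = ", ".join(f"{k}={v}" for k, v in value.items())
        elif isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        cells.append(str(value).replace("|", "\\|"))  # keep the table intact
    return "| " + " | ".join(cells) + " |"


entry = ExperimentLogEntry(
    iteration=2, production_prompt_version="v2", judge_prompt_version="v1",
    model="model-a", dataset_version="d1",
    hypothesis="An explicit schema reduces format errors",
    change_summary="Added output schema section", aggregate_score=0.81,
    slice_scores={"easy": 0.9, "hard": 0.6},
    dominant_failures=["format / schema error"],
    cost={"tokens": 120000}, verdict="keep",
)
print(to_markdown_row(entry))
```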

Suggested Failure Taxonomy

The Agent should try to classify failures into one of the following:

- task misunderstanding
- missing constraints
- extraction error
- reasoning error
- incomplete coverage
- unsafe / policy-violating output
- format / schema error
- verbose / redundant output
- hallucinated details
- mismatch between Judge and actual task goal
- infrastructure failure
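
Encoding the taxonomy as an enumeration keeps labels consistent across rounds and makes the failure distribution easy to compute. This is a minimal sketch: the enum member names and the helper function are hypothetical, while the label strings are taken from the list above.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    TASK_MISUNDERSTANDING = "task misunderstanding"
    MISSING_CONSTRAINTS = "missing constraints"
    EXTRACTION_ERROR = "extraction error"
    REASONING_ERROR = "reasoning error"
    INCOMPLETE_COVERAGE = "incomplete coverage"
    UNSAFE_OUTPUT = "unsafe / policy-violating output"
    FORMAT_ERROR = "format / schema error"
    VERBOSE_OUTPUT = "verbose / redundant output"
    HALLUCINATION = "hallucinated details"
    JUDGE_MISMATCH = "mismatch between Judge and actual task goal"
    INFRASTRUCTURE = "infrastructure failure"


def failure_distribution(labels: list[Failure]) -> list[tuple[str, float]]:
    """Return each failure type with its share of all failures, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(failure.value, n / total) for failure, n in counts.most_common()]


labels = [Failure.FORMAT_ERROR] * 3 + [Failure.HALLUCINATION]
print(failure_distribution(labels))
```

Looking at this distribution next to the aggregate score is what guards against the "aggregate score only" anti-pattern.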

Explicitly Forbidden Anti-Patterns

Do not do the following:

- modify both the target prompt and the judge prompt in the same round without saying so
- look only at aggregate score and ignore the failure distribution
- overfit to a tiny evaluation set without warning about the risk
- use machine evaluation as a substitute for final human review
- loop endlessly because the score moved slightly
- rewrite the whole prompt when only one part is broken
- hide critical assumptions
- declare success without showing hard examples

Default Behavior When the Skill Is Triggered

When this Skill is triggered, the Agent should follow this order:

1. build or refresh the task definition
2. determine which phase the workflow is currently in
3. prioritize filling missing artifacts before rewriting existing ones
4. prefer incremental optimization over full rewrites
5. request confirmation only at the defined human gates
6. after each major step, output a concise decision memo including:

   - what changed
   - why it changed
   - which metrics improved
   - what major issues remain
   - whether another round is worth it

Example Trigger Phrases

The following requests are suitable triggers for this Skill:

- “Help me automate this prompt tuning workflow”
- “Write the target prompt first, then the judge prompt, then design the evaluation”
- “Use the evaluation set and several models to find the current best prompt”
- “Run 1 to 2 prompt iteration rounds under a controlled budget”
- “Turn this prompt tuning process into a reusable agent skill”
- “Let the agent drive the process, and keep humans only at key checkpoints”
