Prompt Engineering (Deep Workflow)
Prompts behave like natural-language programs: they need specs, tests, and version control—especially in production.
When to Offer This Workflow
Trigger conditions:
- - Prompt or system message change; quality regressions
- Structured outputs (JSON), tool use, or RAG grounding requirements
- Safety or policy alignment needs
Initial offer:
Use six stages: (1) define task & success, (2) constraints & format, (3) few-shot & style, (4) build eval set, (5) iterate with discipline, (6) ship, monitor, regress). Confirm model family and latency budget.
Stage 1: Define Task & Success
Goal: Clear user-visible outcome and failure modes (hallucination, omission, tone).
Exit condition: Success rubric in plain language; out-of-scope cases listed.
Stage 2: Constraints & Format
Goal: Must/must-not rules; output schema (JSON Schema, bullet structure); length limits.
Practices
- - Separate system (policy, role) from user (task instance)
- Ask model to cite sources when grounding matters
Stage 3: Few-Shot & Style
Goal: Use examples only when they reduce ambiguity—avoid huge prompt bloat.
Practices
- - Diverse examples; avoid overlong negative examples that confuse
Stage 4: Build Eval Set
Goal: Frozen inputs with expected properties (not always exact text match).
Practices
- - Adversarial and multilingual slices if relevant
- Regression suite in CI for critical prompts
Stage 5: Iterate With Discipline
Goal: Change one major variable at a time when debugging quality.
Practices
- - Compare with same temperature settings when A/B testing wording
- Log prompt version id with outputs in production
Stage 6: Ship, Monitor, Regress
Goal: Canary prompt changes; watch implicit signals (thumbs, edits, task completion).
Final Review Checklist
- - [ ] Task and rubric defined
- [ ] Constraints and output format explicit
- [ ] Eval set versioned; regression path exists
- [ ] Iteration log disciplined; prompt versions tracked
- [ ] Production monitoring and rollback plan
Tips for Effective Guidance
- - Clarity beats cleverness—short explicit instructions often win.
- Chain-of-thought: use when reasoning helps; hide chain from end users if needed.
- Align with llm-evaluation skill for larger harness design.
Handling Deviations
- - Chat vs batch: batch can use stricter structure and lower temperature.
- Multimodal: specify how image details may be used or ignored.
提示工程(深度工作流)
提示词的行为类似于自然语言程序:它们需要规格说明、测试和版本控制——尤其是在生产环境中。
何时提供此工作流
触发条件:
- - 提示词或系统消息变更;质量回退
- 结构化输出(JSON)、工具调用或RAG锚定需求
- 安全性或策略对齐需求
初始提供:
使用六个阶段:(1)定义任务与成功标准,(2)约束条件与格式,(3)少样本与风格,(4)构建评估集,(5)规范迭代,(6)发布、监控与回退。确认模型系列和延迟预算。
阶段1:定义任务与成功标准
目标: 明确的用户可见结果和失败模式(幻觉、遗漏、语气)。
退出条件: 用通俗语言描述的成功评估标准;列出超出范围的情况。
阶段2:约束条件与格式
目标: 必须/禁止规则;输出模式(JSON Schema、列表结构);长度限制。
实践方法
- - 将系统(策略、角色)与用户(任务实例)分开
- 在需要锚定事实时要求模型引用来源
阶段3:少样本与风格
目标: 仅在能减少歧义时使用示例——避免提示词过度膨胀。
实践方法
阶段4:构建评估集
目标: 具有预期属性的固定输入(不总是精确文本匹配)。
实践方法
- - 相关时包含对抗性和多语言切片
- 对关键提示词在CI中设置回归测试套件
阶段5:规范迭代
目标: 调试质量问题时每次只改变一个主要变量。
实践方法
- - A/B测试措辞时使用相同的温度设置进行比较
- 在生产环境中记录输出对应的提示词版本ID
阶段6:发布、监控与回退
目标: 金丝雀式提示词变更;监控隐式信号(点赞、编辑、任务完成)。
最终审查清单
- - [ ] 任务和评估标准已定义
- [ ] 约束条件和输出格式已明确
- [ ] 评估集已版本化;存在回归路径
- [ ] 迭代日志规范;提示词版本已追踪
- [ ] 生产监控和回滚计划已就绪
有效指导的技巧
- - 清晰胜过巧妙——简短明确的指令往往更有效。
- 思维链:在推理有帮助时使用;必要时对最终用户隐藏思维链。
- 与大语言模型评估技能对齐,用于更大规模的测试框架设计。
处理偏差
- - 对话 vs 批处理:批处理可以使用更严格的结构和更低的温度。
- 多模态:明确说明图像细节可能如何使用或被忽略。