Flow Test
Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.
This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.
Invoke this skill when:
- the task depends on live or changing web content
- the output can vary but still be correct
- the workflow spans multiple model or tool steps
- intermediate evidence matters more than one exact final string
- you need to verify user intent was satisfied, not exact wording
Do not use this skill when:
- the result is deterministic and easy to assert directly
- a schema check, exact match, snapshot, or pure function test is enough
- the requirement can be covered fully by normal unit or integration tests
Objective
Turn a fuzzy requirement into a test design that combines:
- deterministic checks for stable invariants
- evidence collection for dynamic execution
- semantic evaluation for variable outcomes
- a bounded verdict of pass, fail, or needs_review
Design Principles
1. Keep asserts where they still work
Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:
- tool call success
- required fields
- minimum counts
- status codes
- domain restrictions
- date or freshness constraints when machine-checkable
2. Judge task completion, not exact phrasing
Prefer questions like:
- did the agent reach the right source
- did it gather relevant information
- does the final answer satisfy the user request
Avoid requiring one exact string unless the wording itself is the requirement.
3. Require inspectable evidence
Ask the execution flow to print or capture concise evidence such as:
- visited URL
- page title
- visible headings
- extracted entities
- timestamps or date clues
- key tool outputs
- final answer
The evaluator should be able to inspect why a verdict was reached.
4. Use explicit semantic rubrics
Never rely on vague instructions such as "judge whether it looks good."
Always define:
- what evidence is required
- what counts as a pass
- what clearly fails
- when uncertainty should become needs_review
5. Prefer bounded confidence
If evidence is incomplete, contradictory, or too weak, do not force a pass.
Return needs_review.
Workflow
When invoked, design the test in the following order.
1. Identify why exact assertions are brittle
Classify the task:
- dynamic web browsing
- search or retrieval
- LLM generation
- multi-tool orchestration
- end-to-end user flow
Then explain why literal equality or fixed snapshots are not sufficient.
2. Split deterministic checks from semantic checks
Write two groups:
Deterministic Checks
Use exact validation for stable parts, such as:
- tool returned successfully
- required fields are present
- minimum number of results exists
- source domain matches expectation
- response includes a valid date range
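The deterministic group maps directly onto ordinary assertions. The following is a minimal Python sketch, assuming the run produces a dictionary with the hypothetical keys `status`, `source_url`, `items`, `date_from`, and `date_to`. None of these names come from the skill itself, so adapt them to whatever your tool actually returns.

```python
from datetime import date
from urllib.parse import urlparse


def run_deterministic_checks(result, allowed_domain, min_results=1):
    """Return a list of failure messages; an empty list means every check passed.

    The keys read from `result` are illustrative assumptions, not a fixed schema.
    """
    failures = []

    # Tool returned successfully.
    if result.get("status") != "ok":
        failures.append("tool did not return successfully")

    # Required fields are present.
    for name in ("source_url", "items", "date_from", "date_to"):
        if name not in result:
            failures.append(f"missing required field: {name}")

    # Minimum number of results exists.
    if len(result.get("items", [])) < min_results:
        failures.append(f"expected at least {min_results} result(s)")

    # Source domain matches expectation (the exact host or one of its subdomains).
    host = urlparse(result.get("source_url", "")).hostname or ""
    if host != allowed_domain and not host.endswith("." + allowed_domain):
        failures.append(f"unexpected source domain: {host!r}")

    # Response includes a valid date range (ISO dates, start not after end).
    try:
        start = date.fromisoformat(result["date_from"])
        end = date.fromisoformat(result["date_to"])
        if start > end:
            failures.append("date range starts after it ends")
    except (KeyError, TypeError, ValueError):
        failures.append("date range is missing or not ISO formatted")

    return failures
```

Returning a list of messages instead of raising on the first problem keeps every failed invariant visible in the evidence log, which helps when a reviewer later inspects the verdict.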
Semantic Checks
Use agent evaluation for variable parts, such as:
- relevance to the requested topic
- freshness of the retrieved content
- whether the answer reflects the gathered evidence
- whether the workflow actually satisfies the intended task
3. Define the evidence schema
Specify exactly what the run should log or output.
Recommended evidence fields:
- task
- source_url
- source_title
- extracted_items
- freshness_signals
- intermediate_results
- final_answer
- evaluator_notes
Keep evidence minimal but sufficient for review.
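One way to pin the schema down is a small record type that every run fills in and serializes. The sketch below is illustrative only: the field names use snake_case spellings of the recommended fields, and the field types, defaults, and the `to_json` helper are assumptions rather than part of the skill.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    """Evidence captured during one flow-test run (types are assumed)."""

    task: str
    source_url: str
    source_title: str
    extracted_items: list = field(default_factory=list)
    freshness_signals: list = field(default_factory=list)
    intermediate_results: list = field(default_factory=list)
    final_answer: str = ""
    evaluator_notes: str = ""

    def to_json(self):
        # Serialize in declaration order so logs are easy to diff and review.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

For example, `Evidence(task="check news freshness", source_url="https://example.com", source_title="Example").to_json()` yields a JSON object that an evaluator, human or model, can read without rerunning the flow.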
4. Define the verdict rubric
Use this baseline:
Pass
- the agent reached a relevant source or completed the intended flow
- collected evidence supports the conclusion
- the final output is relevant and sufficiently current for the task
- there is no major contradiction between evidence and answer
Fail
- the agent failed to reach a relevant source or complete the flow
- the result is clearly irrelevant, stale, or fabricated
- the output contradicts the evidence
- the workflow misses a required user objective
Needs Review
- evidence is partial or ambiguous
- freshness cannot be determined confidently
- multiple interpretations remain plausible
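The baseline rubric can be reduced to a three-valued decision. In the hedged sketch below, the evaluator records each pass criterion as `True`, `False`, or `None` (could not be determined); the criterion names and the `decide` function are hypothetical, chosen to mirror the four pass conditions.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    NEEDS_REVIEW = "needs_review"


# Hypothetical criterion names mirroring the four pass conditions.
PASS_CRITERIA = (
    "reached_source",
    "evidence_supports_answer",
    "relevant_and_current",
    "no_contradiction",
)


def decide(findings):
    """Map evaluator findings to a bounded verdict.

    `findings` maps each criterion to True, False, or None (undetermined).
    A missing criterion is treated as undetermined.
    """
    values = [findings.get(name) for name in PASS_CRITERIA]
    if any(value is False for value in values):
        return Verdict.FAIL  # A clear violation outweighs any uncertainty.
    if any(value is None for value in values):
        return Verdict.NEEDS_REVIEW  # Never force a pass on weak evidence.
    return Verdict.PASS
```

Checking for a clear failure before checking for uncertainty is a design choice: a run that contradicts its own evidence fails even if some other criterion could not be evaluated.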
5. Produce a structured test spec
Return the design in this format:
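- Test Intent
- Why Exact Assertions Fail
- Deterministic Checks
- Evidence to Collect
- Semantic Rubric
- Execution Notes
- Final Verdict Format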
Output Template
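- Test Intent
- Why Exact Assertions Fail
- Deterministic Checks
- Evidence to Collect
- Semantic Rubric
- Execution Notes
- Final Verdict Format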
Example
Task: verify that visiting a news site returns today's news rather than stale content.
Good test design:
- deterministic checks confirm the page loads and at least one article item is collected
- evidence includes the visited site, page title, visible headlines, date clues, and final summary
- semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
- semantic rubric fails when headlines are outdated, unrelated, or invented
- semantic rubric returns needs_review when freshness cannot be established from the evidence
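Part of the freshness judgment in the good design can be made machine-checkable when the page exposes parseable dates. The sketch below assumes the date clues are ISO-formatted strings and that "today's news" means an age of zero days; the function name and the `max_age_days` parameter are hypothetical. Relevance and fabrication still need the semantic rubric.

```python
from datetime import date


def freshness_verdict(date_clues, today, max_age_days=0):
    """Classify freshness from ISO date strings collected as evidence."""
    parsed = []
    for clue in date_clues:
        try:
            parsed.append(date.fromisoformat(clue))
        except (TypeError, ValueError):
            continue  # Skip clues that are not machine-readable dates.

    if not parsed:
        return "needs_review"  # Freshness cannot be established from evidence.

    age_days = (today - max(parsed)).days
    if age_days < 0:
        return "needs_review"  # Future-dated clue: clock or timezone skew is plausible.
    return "pass" if age_days <= max_age_days else "fail"
```

Passing `today` in as an argument, instead of calling `date.today()` inside the function, keeps the check itself deterministic even though the content under test changes daily.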
Bad test design:
- assert returned_text == "Today's news is..."
Guidance
When using this skill:
- keep traditional asserts for stable invariants
- use semantic evaluation only where exact matching becomes brittle
- prefer narrow rubrics over subjective judgment
- require visible evidence before passing the test
- state uncertainty explicitly instead of masking it
Deliverables
When asked to design a flow test, provide:
- a structured test spec
- deterministic checks
- an evidence schema
- a semantic rubric
- a final verdict format
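Flow Test
Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.
This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.
Invoke this skill when:
- the task depends on live or changing web content
- the output can vary but still be correct
- the workflow spans multiple model or tool steps
- intermediate evidence matters more than one exact final string
- you need to verify user intent was satisfied, not exact wording
Do not use this skill when:
- the result is deterministic and easy to assert directly
- a schema check, exact match, snapshot, or pure function test is enough
- the requirement can be covered fully by normal unit or integration tests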
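Objective
Turn a fuzzy requirement into a test design that combines:
- deterministic checks for stable invariants
- evidence collection for dynamic execution
- semantic evaluation for variable outcomes
- a bounded verdict of pass, fail, or needs_review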
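Design Principles
1. Keep asserts where they still work
Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:
- tool call success
- required fields
- minimum counts
- status codes
- domain restrictions
- date or freshness constraints when machine-checkable
2. Judge task completion, not exact phrasing
Prefer questions like:
- did the agent reach the right source
- did it gather relevant information
- does the final answer satisfy the user request
Avoid requiring one exact string unless the wording itself is the requirement.
3. Require inspectable evidence
Ask the execution flow to print or capture concise evidence such as:
- visited URL
- page title
- visible headings
- extracted entities
- timestamps or date clues
- key tool outputs
- final answer
The evaluator should be able to inspect why a verdict was reached.
4. Use explicit semantic rubrics
Never rely on vague instructions such as "judge whether it looks good."
Always define:
- what evidence is required
- what counts as a pass
- what clearly fails
- when uncertainty should become needs_review
5. Prefer bounded confidence
If evidence is incomplete, contradictory, or too weak, do not force a pass.
Return needs_review.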
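Workflow
When invoked, design the test in the following order.
1. Identify why exact assertions are brittle
Classify the task:
- dynamic web browsing
- search or retrieval
- LLM generation
- multi-tool orchestration
- end-to-end user flow
Then explain why literal equality or fixed snapshots are not sufficient.
2. Split deterministic checks from semantic checks
Write two groups:
Deterministic Checks
Use exact validation for stable parts, such as:
- tool returned successfully
- required fields are present
- minimum number of results exists
- source domain matches expectation
- response includes a valid date range
Semantic Checks
Use agent evaluation for variable parts, such as:
- relevance to the requested topic
- freshness of the retrieved content
- whether the answer reflects the gathered evidence
- whether the workflow actually satisfies the intended task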
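3. Define the evidence schema
Specify exactly what the run should log or output.
Recommended evidence fields:
- task
- source_url
- source_title
- extracted_items
- freshness_signals
- intermediate_results
- final_answer
- evaluator_notes
Keep evidence minimal but sufficient for review.
4. Define the verdict rubric
Use this baseline:
Pass
- the agent reached a relevant source or completed the intended flow
- collected evidence supports the conclusion
- the final output is relevant and sufficiently current for the task
- there is no major contradiction between evidence and answer
Fail
- the agent failed to reach a relevant source or complete the flow
- the result is clearly irrelevant, stale, or fabricated
- the output contradicts the evidence
- the workflow misses a required user objective
Needs Review
- evidence is partial or ambiguous
- freshness cannot be determined confidently
- multiple interpretations remain plausible
5. Produce a structured test spec
Return the design in this format: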
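- Test Intent
- Why Exact Assertions Fail
- Deterministic Checks
- Evidence to Collect
- Semantic Rubric
- Execution Notes
- Final Verdict Format
Output Template
- Test Intent
- Why Exact Assertions Fail
- Deterministic Checks
- Evidence to Collect
- Semantic Rubric
- Execution Notes
- Final Verdict Format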
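Example
Task: verify that visiting a news site returns today's news rather than stale content.
Good test design:
- deterministic checks confirm the page loads and at least one article item is collected
- evidence includes the visited site, page title, visible headlines, date clues, and final summary
- semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
- semantic rubric fails when headlines are outdated, unrelated, or invented
- semantic rubric returns needs_review when freshness cannot be established from the evidence
Bad test design:
- assert returned_text == "Today's news is..."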
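Guidance
When using this skill:
- keep traditional asserts for stable invariants
- use semantic evaluation only where exact matching becomes brittle
- prefer narrow rubrics over subjective judgment
- require visible evidence before passing the test
- state uncertainty explicitly instead of masking it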
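Deliverables
When asked to design a flow test, provide:
- a structured test spec
- deterministic checks
- an evidence schema
- a semantic rubric
- a final verdict format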