Flow Test

Use this skill to design tests for tasks that cannot be validated reliably with traditional unit-test assertions alone.

This skill is for flow testing: the agent performs a realistic task, records key evidence from the process, and then judges success with an explicit semantic rubric.

Invoke this skill when:

- the task depends on live or changing web content
- the output can vary but still be correct
- the workflow spans multiple model or tool steps
- intermediate evidence matters more than one exact final string
- you need to verify that user intent was satisfied, not exact wording

Do not use this skill when:

- the result is deterministic and easy to assert directly
- a schema check, exact match, snapshot, or pure function test is enough
- the requirement can be covered fully by normal unit or integration tests

Objective

Turn a fuzzy requirement into a test design that combines:

- deterministic checks for stable invariants
- evidence collection for dynamic execution
- semantic evaluation for variable outcomes
- a bounded verdict of pass, fail, or needs_review

Design Principles

1. Keep asserts where they still work

Do not replace traditional tests blindly. Preserve exact checks for stable facts such as:

- tool call success
- required fields
- minimum counts
- status codes
- domain restrictions
- date or freshness constraints when machine-checkable
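As a minimal sketch of what such exact checks might look like, the function below validates a tool result before any semantic judgment is involved. The `result` dictionary shape, its keys (`status_code`, `url`, `title`, `items`), and the function name are assumptions made for illustration, not part of this skill.

```python
from urllib.parse import urlparse


def deterministic_checks(result: dict, allowed_domains: set[str], min_items: int = 1) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed.

    The keys read from `result` are hypothetical and stand in for whatever
    the real tool returns.
    """
    failures = []

    # Status code: a stable fact that can be asserted exactly.
    if result.get("status_code") != 200:
        failures.append(f"unexpected status code: {result.get('status_code')}")

    # Required fields must be present regardless of the page content.
    for field_name in ("url", "title", "items"):
        if field_name not in result:
            failures.append(f"missing required field: {field_name}")

    # Minimum count of collected items.
    if len(result.get("items", [])) < min_items:
        failures.append(f"expected at least {min_items} item(s)")

    # Domain restriction on the visited source.
    host = urlparse(result.get("url", "")).hostname or ""
    if host not in allowed_domains:
        failures.append(f"domain not allowed: {host!r}")

    return failures
```

Returning messages instead of raising on the first problem keeps every violated invariant visible to a reviewer.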

2. Judge task completion, not exact phrasing

Prefer questions like:

- did the agent reach the right source
- did it gather relevant information
- does the final answer satisfy the user request

Avoid requiring one exact string unless the wording itself is the requirement.

3. Require inspectable evidence

Ask the execution flow to print or capture concise evidence such as:

- visited URL
- page title
- visible headings
- extracted entities
- timestamps or date clues
- key tool outputs
- final answer

The evaluator should be able to inspect why a verdict was reached.

4. Use explicit semantic rubrics

Never rely on vague instructions such as "judge whether it looks good."

Always define:

- what evidence is required
- what counts as a pass
- what clearly fails
- when uncertainty should become needs_review

5. Prefer bounded confidence

If evidence is incomplete, contradictory, or too weak, do not force a pass.

Return needs_review.

Workflow

When invoked, design the test in the following order.

1. Identify why exact assertions are brittle

Classify the task:

- dynamic web browsing
- search or retrieval
- LLM generation
- multi-tool orchestration
- end-to-end user flow

Then explain why literal equality or fixed snapshots are not sufficient.

2. Split deterministic checks from semantic checks

Write two groups:

Deterministic Checks

Use exact validation for stable parts, such as:

- tool returned successfully
- required fields are present
- minimum number of results exists
- source domain matches expectation
- response includes a valid date range

Semantic Checks

Use agent evaluation for variable parts, such as:

- relevance to the requested topic
- freshness of the retrieved content
- whether the answer reflects the gathered evidence
- whether the workflow actually satisfies the intended task
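One way to keep the two groups separate in practice is to run the exact checks first and consult the semantic evaluator only when they all hold. The sketch below assumes the evidence is a plain dictionary and that the semantic evaluator is some callable returning a verdict string; `run_flow_test` and its parameters are hypothetical names. Treating a failed deterministic check as an immediate fail is a design assumption of this sketch.

```python
from typing import Callable

ALLOWED_VERDICTS = {"pass", "fail", "needs_review"}


def run_flow_test(
    evidence: dict,
    deterministic: list[Callable[[dict], bool]],
    semantic: Callable[[dict], str],
) -> str:
    """Run exact checks first, then the semantic evaluator, and return a bounded verdict."""
    # Group 1: stable invariants. Any violation fails the test outright (assumption).
    for check in deterministic:
        if not check(evidence):
            return "fail"

    # Group 2: variable parts, judged by the (hypothetical) semantic evaluator.
    verdict = semantic(evidence)

    # Anything outside the bounded verdict set is downgraded to needs_review.
    return verdict if verdict in ALLOWED_VERDICTS else "needs_review"
```

Because the evaluator is passed in as a callable, it can be an agent call in real runs and a stub in tests of the harness itself.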

3. Define the evidence schema

Specify exactly what the run should log or output.

Recommended evidence fields:

- task
- source_url
- source_title
- extracted_items
- freshness_signals
- intermediate_results
- final_answer
- evaluator_notes

Keep evidence minimal but sufficient for review.
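The recommended fields could be captured in a small record type so that every run logs the same shape. This is only a sketch: the snake_case spellings, the field types, and the defaults are assumptions, and the class name `Evidence` is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Evidence:
    """Evidence record for one flow-test run (field types are assumptions)."""

    task: str
    source_url: str
    source_title: str
    extracted_items: list[str] = field(default_factory=list)
    freshness_signals: list[str] = field(default_factory=list)
    intermediate_results: list[str] = field(default_factory=list)
    final_answer: str = ""
    evaluator_notes: str = ""

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII headlines readable in the log.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

Serializing to JSON keeps the record easy to print during the run and easy for an evaluator to inspect afterwards.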

4. Define the verdict rubric

Use this baseline:

Pass

- the agent reached a relevant source or completed the intended flow
- collected evidence supports the conclusion
- the final output is relevant and sufficiently current for the task
- there is no major contradiction between evidence and answer

Fail

- the agent failed to reach a relevant source or complete the flow
- the result is clearly irrelevant, stale, or fabricated
- the output contradicts the evidence
- the workflow misses a required user objective

Needs Review

- evidence is partial or ambiguous
- freshness cannot be determined confidently
- multiple interpretations remain plausible
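The baseline rubric can be reduced to a small decision function once each criterion has been judged. In the sketch below each judgment is `True`, `False`, or `None` (the evaluator could not decide); the parameter names and the rule that a clear failure outranks uncertainty are assumptions for illustration.

```python
from typing import Optional


def decide_verdict(
    reached_source: Optional[bool],
    evidence_supports_answer: Optional[bool],
    output_relevant_and_current: Optional[bool],
) -> str:
    """Map three rubric judgments to pass, fail, or needs_review."""
    judgments = [reached_source, evidence_supports_answer, output_relevant_and_current]

    # A clear failure on any criterion fails the test (assumed precedence).
    if any(j is False for j in judgments):
        return "fail"

    # None means the evidence was partial or ambiguous for that criterion,
    # so the verdict is bounded rather than forced to a pass.
    if any(j is None for j in judgments):
        return "needs_review"

    return "pass"
```

Using `None` for undecided criteria makes uncertainty an explicit input instead of something the evaluator has to hide behind a guess.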

5. Produce a structured test spec

Return the design in this format:

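- Test Intent
- Why Exact Assertions Fail
- Deterministic Checks
- Evidence To Collect
- Semantic Rubric
- Execution Notes
- Final Verdict Format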

Output Template

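Test Intent

- Validate:

Why Exact Assertions Fail

- Dynamic factor:
- Why literal equality is brittle:

Deterministic Checks

- Check 1:
- Check 2:

Evidence To Collect

- Evidence 1:
- Evidence 2:

Semantic Rubric

- Pass when:
- Fail when:
- Needs review when:

Execution Notes

- Constraints:
- Allowed variance:
- Safety concerns:

Final Verdict Format

- Verdict: pass | fail | needs_review
- Reason:
- Evidence: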

Example

Task: verify that visiting a news site returns today's news rather than stale content.

Good test design:

- deterministic checks confirm the page loads and at least one article item is collected
- evidence includes the visited site, page title, visible headlines, date clues, and final summary
- semantic rubric passes when the result clearly reflects same-day or current reporting from the visited source
- semantic rubric fails when headlines are outdated, unrelated, or invented
- semantic rubric returns needs_review when freshness cannot be established from the evidence

Bad test design:

- assert returned_text == "Today's news is..."
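Rather than comparing the summary against a fixed string, the machine-checkable part of the freshness judgment in the good design could be sketched as follows. The function name, the `max_age_days` threshold, and the choice to judge by the newest date clue are assumptions; relevance and fabrication would still be left to the semantic rubric.

```python
from datetime import date
from typing import Optional


def freshness_verdict(date_clues: list[Optional[date]], today: date, max_age_days: int = 0) -> str:
    """Judge freshness from collected date clues; None marks a headline with no usable date."""
    known = [d for d in date_clues if d is not None]

    # No date evidence at all: freshness cannot be established.
    if not known:
        return "needs_review"

    age_days = (today - max(known)).days

    # A clue dated in the future contradicts the run date, so do not force a verdict.
    if age_days < 0:
        return "needs_review"

    return "pass" if age_days <= max_age_days else "fail"
```

Passing `today` in as an argument keeps the check reproducible, since the result does not depend on when the test harness itself is run.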

Guidance

When using this skill:

- keep traditional asserts for stable invariants
- use semantic evaluation only where exact matching becomes brittle
- prefer narrow rubrics over subjective judgment
- require visible evidence before passing the test
- state uncertainty explicitly instead of masking it

Deliverables

When asked to design a flow test, provide:

- a structured test spec
- deterministic checks
- an evidence schema
- a semantic rubric
- a final verdict format
