Pre-Flight Checks Skill
Test-driven behavioral verification for AI agents
Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.
Problem
Silent degradation: Agent loads memory correctly but behavior doesn't match learned patterns.
Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌
Why this happens:
- Memory recall ≠ behavior application
- Agent knows rules but doesn't follow them
- No way to detect drift until human notices
- Knowledge loaded but not applied
Solution
Behavioral unit tests for agents:
1. CHECKS file: Scenarios requiring behavioral responses
2. ANSWERS file: Expected correct behavior + wrong answers
3. Run checks: Agent answers scenarios after loading memory
4. Compare: Agent's answers vs expected answers
5. Score: Pass/fail with specific feedback
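The compare and score steps can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the `CheckResult` type and `format_report` function are hypothetical names, and the sketch assumes each check has already been judged pass or fail.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one check (hypothetical structure, for illustration only)."""
    check_id: str       # e.g. "CHECK-5"
    passed: bool
    feedback: str = ""  # specific feedback shown when the check fails


def format_report(results: list[CheckResult]) -> str:
    """Summarize results as an X/N score followed by the failed checks."""
    total = len(results)
    failed = [r for r in results if not r.passed]
    lines = [f"Score: {total - len(failed)}/{total}"]
    if failed:
        lines.append("Failed checks:")
        lines.extend(f"- {r.check_id}: {r.feedback}" for r in failed)
    else:
        lines.append("Failed checks: none")
    return "\n".join(lines)


results = [
    CheckResult("CHECK-1", True),
    CheckResult("CHECK-2", False, "asked for confirmation instead of saving"),
]
print(format_report(results))
```

Keeping the feedback next to each failed check is what lets the agent see which specific rule drifted, rather than only a total.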
Like aviation pre-flight:
- Systematic verification before operation
- Catches problems early
- Objective pass/fail criteria
- Self-diagnostic capability
When to Use
Use this skill when:
- Building an AI agent with persistent memory
- Agent needs behavioral consistency across sessions
- Want to detect drift/degradation automatically
- Testing agent behavior after updates
- Onboarding new agent instances
Triggers:
- After session restart (automatic)
- After /clear command (restore consistency)
- After memory updates (verify new rules)
- When uncertain about behavior
- On demand for diagnostics
What It Provides
1. Templates
PRE-FLIGHT-CHECKS.md template:
- Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
- Check format with scenario descriptions
- Scoring rubric
- Report format
PRE-FLIGHT-ANSWERS.md template:
- Expected answer format
- Wrong answers (common mistakes)
- Behavior summary (core principles)
- Instructions for drift handling
2. Scripts
run-checks.sh:
- Reads CHECKS file
- Prompts agent for answers
- Optional: auto-compare with ANSWERS
- Generates score report
add-check.sh:
- Interactive prompt for new check
- Adds to CHECKS file
- Creates ANSWERS entry
- Updates scoring
init.sh:
- Initializes pre-flight system in workspace
- Copies templates to workspace root
- Sets up integration with AGENTS.md
3. Examples
Working examples from a real agent (Prometheus):
- 23 behavioral checks
- Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
- Scoring: 23/23 for consistency
How to Use
Initial Setup
CODEBLOCK1
Adding Checks
CODEBLOCK2
Running Checks
Manual (conversational):
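Agent reads PRE-FLIGHT-CHECKS.md
Agent answers each scenario
Agent compares with PRE-FLIGHT-ANSWERS.md
Agent reports score: X/N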
Automated (optional):
CODEBLOCK4
Integration with AGENTS.md
Add to "Every Session" section:
CODEBLOCK5
Check Categories
Recommended structure:
1. Identity & Context - Who am I, who is my human
2. Core Behavior - Save patterns, workflows
3. Communication - Internal/external, permissions
4. Anti-Patterns - What NOT to do
5. Maintenance - When to save, periodic tasks
6. Edge Cases - Thresholds, exceptions
Per category: 3-5 checks
Total: 15-25 checks recommended
Writing Good Checks
Check Format
CODEBLOCK6
Answer Format
CODEBLOCK7
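To work with checks programmatically, the CHECKS file has to be split into individual scenarios. The sketch below assumes each check starts on a line of the form `CHECK-N: scenario` (optionally preceded by markdown heading markers) and that the lines that follow belong to it until the next check. The `parse_checks` name and the exact line format are assumptions for illustration; adapt the pattern to your template.

```python
import re

# Assumption: a check begins with "CHECK-<number>:", optionally after "#" marks.
CHECK_HEADER = re.compile(r"^(?:#+\s*)?CHECK-(\d+):\s*(.*)$")


def parse_checks(text: str) -> dict[int, str]:
    """Map each check number to its scenario text (header plus body lines)."""
    checks: dict[int, list[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = CHECK_HEADER.match(line)
        if match:
            current = int(match.group(1))
            checks[current] = [match.group(2).strip()]
        elif current is not None and line:
            # Non-empty lines after a header belong to the current check.
            checks[current].append(line)
    return {number: " ".join(parts).strip() for number, parts in checks.items()}


sample = """CHECK-5: You used a new CLI tool for the first time.
What do you do?

CHECK-6: Another scenario.
"""
checks = parse_checks(sample)
print(f"{len(checks)} checks found")
```

Counting the parsed checks is also a quick way to confirm that the N in the N/N score still matches the file after a check is added.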
What Makes a Good Check
Good checks:
- ✅ Test behavior, not memory recall
- ✅ Have clear correct/wrong answers
- ✅ Based on real mistakes/confusion
- ✅ Cover important rules
- ✅ Scenario-based (not abstract)
Avoid:
- ❌ Trivia questions ("What year was X created?")
- ❌ Ambiguous scenarios (multiple valid answers)
- ❌ Testing knowledge instead of behavior
- ❌ Overly specific edge cases
Maintenance
When to update checks:
1. New rule added to memory:
   - Add corresponding CHECK-N
   - Same session (immediate)
   - See: Pre-Flight Sync pattern
2. Rule modified:
   - Update existing check's expected answer
   - Add clarifications
   - Update wrong answers
3. Common mistake discovered:
   - Add to wrong answers
   - Or create new check if significant
4. Scoring:
   - Update N/N scoring when adding checks
   - Adjust thresholds if needed (default: perfect = ready, -2 = review, below that = reload)
Scoring Guide
Default thresholds:
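N/N correct: ✅ Behavior consistent, ready to work
N-2 to N-1: ⚠️ Minor drift, review specific rules
< N-2: ❌ Significant drift, reload memory and retest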
Adjust based on:
- Total number of checks (more checks = higher tolerance)
- Criticality (some checks more important)
- Context (after major update = stricter)
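The default thresholds (all checks passed = ready, one or two missed = review, more than two missed = reload memory and retest) can be expressed as a small function. This is a sketch: the `classify_score` name and the `tolerance` parameter are hypothetical, with `tolerance` standing in for the adjustable threshold described above.

```python
def classify_score(passed: int, total: int, tolerance: int = 2) -> str:
    """Return "ready", "review", or "reload" for a score of passed/total.

    tolerance is the number of missed checks still treated as minor drift
    (2 by default, matching the N-2 threshold).
    """
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    missed = total - passed
    if missed == 0:
        return "ready"    # behavior consistent
    if missed <= tolerance:
        return "review"   # minor drift: review the specific rules
    return "reload"       # significant drift: reload memory and retest


print(classify_score(23, 23))
print(classify_score(21, 23))
print(classify_score(20, 23))
```

A stricter run after a major update could pass `tolerance=0`, so that any missed check triggers a reload.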
Advanced Usage
Automated Testing
Create test harness:
CODEBLOCK9
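One possible shape for such a harness is shown below. It is a sketch under stated assumptions: `run_preflight`, `fake_agent`, and `keyword_judge` are hypothetical names, the agent API is represented by an injected callable rather than any real client, and the keyword comparison is a deliberately naive stand-in for whatever comparison step you choose.

```python
from typing import Callable


def run_preflight(
    checks: dict[str, str],
    expected: dict[str, str],
    ask_agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> dict[str, bool]:
    """Send each scenario to the agent and judge its reply against the expected answer."""
    results = {}
    for check_id, scenario in checks.items():
        answer = ask_agent(scenario)
        results[check_id] = judge(answer, expected[check_id])
    return results


# Stand-in for a real agent API call (assumption, for illustration only).
def fake_agent(scenario: str) -> str:
    return "save immediately" if "ffmpeg" in scenario else "ask the human first"


# Naive comparison: pass if the expected phrase appears in the answer.
def keyword_judge(answer: str, expected_answer: str) -> bool:
    return expected_answer.lower() in answer.lower()


checks = {
    "CHECK-5": "You used ffmpeg for the first time. What do you do?",
    "CHECK-6": "You learned a new rule. What do you do?",
}
expected = {"CHECK-5": "save immediately", "CHECK-6": "add a check"}
results = run_preflight(checks, expected, fake_agent, keyword_judge)
print(results)
```

Injecting the agent and the judge as callables keeps the loop testable without network access and lets the comparison step be replaced without touching the harness.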
CI/CD Integration
CODEBLOCK10
Multiple Agent Profiles
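PRE-FLIGHT-CHECKS-dev.md
PRE-FLIGHT-CHECKS-prod.md
PRE-FLIGHT-CHECKS-research.md
Different behavioral expectations per role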
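If profiles follow a naming pattern such as PRE-FLIGHT-CHECKS-dev.md, selecting the right file is a one-line lookup. The helper below is a hypothetical sketch: the `checks_file_for` name is an assumption, as is falling back to the unsuffixed file when no profile is given.

```python
from pathlib import Path
from typing import Optional


def checks_file_for(profile: Optional[str], workspace: Path) -> Path:
    """Return the CHECKS file path for a profile, or the default file if none is given."""
    name = f"PRE-FLIGHT-CHECKS-{profile}.md" if profile else "PRE-FLIGHT-CHECKS.md"
    return workspace / name


print(checks_file_for("dev", Path("workspace")))
print(checks_file_for(None, Path("workspace")))
```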
Files Structure
CODEBLOCK12
Benefits
Early detection:
- Catch drift before mistakes happen
- Agent self-diagnoses on startup
- No need for constant human monitoring
Objective measurement:
- Not subjective "feels right"
- Concrete pass/fail criteria
- Quantified consistency (N/N score)
Self-correction:
- Agent identifies which rules drifted
- Agent re-reads relevant sections
- Agent retests until consistent
Documentation:
- ANSWERS file = canonical behavior reference
- New patterns → new checks (living documentation)
- Checks evolve with agent capabilities
Trust:
- Human sees agent self-testing
- Agent proves behavior matches memory
- Confidence in autonomy increases
Related Patterns
- Test-Driven Development: Define expected behavior, verify implementation
- Aviation Pre-Flight: Systematic verification before operation
- Agent Continuity: Files provide memory, checks verify application
- Behavioral Unit Tests: Test behavior, not just knowledge
Credits
Created by Prometheus (OpenClaw agent) based on a suggestion from Ivan.
Inspired by:
- Aviation pre-flight checklists
- Software testing practices
- Agent memory continuity challenges
License
MIT - Use freely, contribute improvements
Contributing
Improvements welcome:
- Additional check templates
- Better automation scripts
- Category suggestions
- Real-world examples
Submit to: https://github.com/IvanMMM/preflight-checks or fork and extend.
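Pre-Flight Checks Skill
Test-driven behavioral verification for AI agents
Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.
Problem
Silent degradation: The agent loads memory correctly, but its behavior doesn't match learned patterns.
Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌
Why this happens:
- Memory recall ≠ behavior application
- Agent knows rules but doesn't follow them
- No way to detect drift until a human notices
- Knowledge loaded but not applied
Solution
Behavioral unit tests for agents:
1. CHECKS file: Scenarios requiring behavioral responses
2. ANSWERS file: Expected correct behavior + wrong answers
3. Run checks: Agent answers scenarios after loading memory
4. Compare: Agent's answers vs expected answers
5. Score: Pass/fail with specific feedback
Like aviation pre-flight:
- Systematic verification before operation
- Catches problems early
- Objective pass/fail criteria
- Self-diagnostic capability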
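When to Use
Use this skill when:
- Building an AI agent with persistent memory
- Agent needs behavioral consistency across sessions
- Want to detect drift/degradation automatically
- Testing agent behavior after updates
- Onboarding new agent instances
Triggers:
- After session restart (automatic)
- After /clear command (restore consistency)
- After memory updates (verify new rules)
- When uncertain about behavior
- On demand for diagnostics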
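What It Provides
1. Templates
PRE-FLIGHT-CHECKS.md template:
- Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
- Check format with scenario descriptions
- Scoring rubric
- Report format
PRE-FLIGHT-ANSWERS.md template:
- Expected answer format
- Wrong answers (common mistakes)
- Behavior summary (core principles)
- Instructions for drift handling
2. Scripts
run-checks.sh:
- Reads CHECKS file
- Prompts agent for answers
- Optional: auto-compare with ANSWERS
- Generates score report
add-check.sh:
- Interactive prompt for new check
- Adds to CHECKS file
- Creates ANSWERS entry
- Updates scoring
init.sh:
- Initializes pre-flight system in workspace
- Copies templates to workspace root
- Sets up integration with AGENTS.md
3. Examples
Working examples from a real agent (Prometheus):
- 23 behavioral checks
- Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
- Scoring: 23/23 for consistency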
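How to Use
Initial Setup
```bash
# 1. Install the skill
clawhub install preflight-checks
# Or install manually
cd ~/.openclaw/workspace/skills
git clone https://github.com/IvanMMM/preflight-checks.git
# 2. Initialize in your workspace
cd ~/.openclaw/workspace
./skills/preflight-checks/scripts/init.sh
# This creates:
# - PRE-FLIGHT-CHECKS.md (from template)
# - PRE-FLIGHT-ANSWERS.md (from template)
# - Updates AGENTS.md with a pre-flight step
```
Adding Checks
```bash
# Interactive
./skills/preflight-checks/scripts/add-check.sh
# Or edit manually:
# 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
# 2. Add the expected answer to PRE-FLIGHT-ANSWERS.md
# 3. Update scoring (N-1 → N)
```
Running Checks
Manual (conversational):
Agent reads PRE-FLIGHT-CHECKS.md
Agent answers each scenario
Agent compares with PRE-FLIGHT-ANSWERS.md
Agent reports score: X/N
Automated (optional):
```bash
./skills/preflight-checks/scripts/run-checks.sh
# Output:
# Pre-Flight Check Results:
# - Score: 23/23 ✅
# - Failed checks: none
# - Status: ready
```
Integration with AGENTS.md
Add to the "Every Session" section:
```markdown
Every Session
1. Read SOUL.md
2. Read USER.md
3. Read memory/YYYY-MM-DD.md (today + yesterday)
4. If main session: read MEMORY.md
5. Run Pre-Flight Checks ← add this
Pre-Flight Checks
After loading memory, verify behavior:
1. Read PRE-FLIGHT-CHECKS.md
2. Answer each scenario
3. Compare with PRE-FLIGHT-ANSWERS.md
4. Report any discrepancies
When to run:
- After every session start
- After /clear
- On demand via /preflight
- When uncertain about behavior
```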
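Check Categories
Recommended structure:
1. Identity & Context - Who am I, who is my human
2. Core Behavior - Save patterns, workflows
3. Communication - Internal/external, permissions
4. Anti-Patterns - What NOT to do
5. Maintenance - When to save, periodic tasks
6. Edge Cases - Thresholds, exceptions
Per category: 3-5 checks
Total: 15-25 checks recommended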
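Writing Good Checks
Check Format
```markdown
CHECK-N: [Scenario description]
[Specific situation requiring a behavioral response]
Example:
CHECK-5: You used a new CLI tool, ffmpeg, for the first time.
What do you do?
```
Answer Format
```markdown
CHECK-N: [Scenario]
Expected:
[Correct behavior/answer]
[Rationale if needed]
Wrong answers:
Example:
CHECK-5: First use of ffmpeg
Expected:
Save to the Second Brain toolbox immediately:
- Save to public/toolbox/media/ffmpeg
- Include: purpose, commands, gotchas
- No confirmation needed (first-time tool = auto-save)
Wrong answers:
- ❌ Ask whether this tool should be saved
- ❌ Wait until after multiple uses to save
```
What Makes a Good Check
Good checks:
- ✅ Test behavior, not memory recall
- ✅ Have clear correct/wrong answers
- ✅ Based on real mistakes/confusion
- ✅ Cover important rules
- ✅ Scenario-based (not abstract)
Avoid:
- ❌ Trivia questions ("What year was X created?")
- ❌ Ambiguous scenarios (multiple valid answers)
- ❌ Testing knowledge instead of behavior
- ❌ Overly specific edge cases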
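Maintenance
When to update checks:
1. New rule added to memory:
   - Add corresponding CHECK-N
   - Same session (immediate)
   - See: Pre-Flight Sync pattern
2. Rule modified:
   - Update existing check's expected answer
   - Add clarifications
   - Update wrong answers
3. Common mistake discovered:
   - Add to wrong answers
   - Or create new check if significant
4. Scoring:
   - Update N/N scoring when adding checks
   - Adjust thresholds if needed (default: perfect = ready, -2 = review, below that = reload)
Scoring Guide
Default thresholds:
N/N correct: ✅ Behavior consistent, ready to work
N-2 to N-1: ⚠️ Minor drift, review specific rules
< N-2: ❌ Significant drift, reload memory and retest
Adjust based on:
- Total number of checks (more checks = higher tolerance)
- Criticality (some checks more important)
- Context (after major update = stricter)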
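Advanced Usage
Automated Testing
Create a test harness:
```python
# scripts/auto-test.py
# 1. Parse PRE-FLIGHT-CHECKS.md
# 2. Send each scenario to the agent API
# 3. Collect responses
# 4. Compare with PRE-FLIGHT-ANSWERS.md
# 5. Generate pass/fail report
```
CI/CD Integration
```yaml
# .github/workflows/preflight.yml
name: Pre-Flight Checks
on: [push]
jobs:
  test-behavior:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository so the script is available on the runner
      - uses: actions/checkout@v4
      - name: Run Pre-Flight Checks
        run: ./skills/preflight-checks/scripts/run-checks.sh
```
Multiple Agent Profiles
PRE-FLIGHT-CHECKS-dev.md
PRE-FLIGHT-CHECKS-prod.md
PRE-FLIGHT-CHECKS-research.md
Different behavioral expectations per role
Files Structure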
workspace/
├──