Pre-Flight Checks Skill

Test-driven behavioral verification for AI agents

Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.

Problem

Silent degradation: Agent loads memory correctly but behavior doesn't match learned patterns.

    Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌

Why this happens:

- Memory recall ≠ behavior application
- Agent knows rules but doesn't follow them
- No way to detect drift until human notices
- Knowledge loaded but not applied

Solution

Behavioral unit tests for agents:

1. CHECKS file: Scenarios requiring behavioral responses
2. ANSWERS file: Expected correct behavior + wrong answers
3. Run checks: Agent answers scenarios after loading memory
4. Compare: Agent's answers vs expected answers
5. Score: Pass/fail with specific feedback

Like aviation pre-flight:

- Systematic verification before operation
- Catches problems early
- Objective pass/fail criteria
- Self-diagnostic capability
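The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the `Check` dataclass, `run_checks`, `toy_agent`, and `phrase_match` are hypothetical names, and a real comparison would likely need judgment rather than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    check_id: str   # e.g. "CHECK-5"
    scenario: str   # situation presented to the agent (CHECKS file)
    expected: str   # key phrase of the expected behavior (ANSWERS file)


def run_checks(
    checks: List[Check],
    agent: Callable[[str], str],
    matches: Callable[[str, str], bool],
) -> Tuple[int, List[str]]:
    """Ask the agent every scenario; return (passed_count, failed_check_ids)."""
    failed = [c.check_id for c in checks if not matches(agent(c.scenario), c.expected)]
    return len(checks) - len(failed), failed


# Hypothetical stand-ins so the sketch runs without a real agent.
def toy_agent(scenario: str) -> str:
    if "ffmpeg" in scenario:
        return "Save the tool to the toolbox immediately."
    return "Ask the human what to do."


def phrase_match(answer: str, expected: str) -> bool:
    # Crude comparison: the expected phrase must appear in the answer.
    return expected.lower() in answer.lower()


checks = [
    Check("CHECK-1", "You used ffmpeg for the first time. What do you do?", "save the tool"),
    Check("CHECK-2", "A new rule was added to memory. What do you do?", "add a check"),
]
passed, failed = run_checks(checks, toy_agent, phrase_match)
print(f"Score: {passed}/{len(checks)}, failed: {failed}")
```

Passing the agent and the matcher in as callables keeps the loop independent of how the agent is reached and how answers are judged.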

When to Use

Use this skill when:

- Building AI agent with persistent memory
- Agent needs behavioral consistency across sessions
- Want to detect drift/degradation automatically
- Testing agent behavior after updates
- Onboarding new agent instances

Triggers:

- After session restart (automatic)
- After /clear command (restore consistency)
- After memory updates (verify new rules)
- When uncertain about behavior
- On demand for diagnostics

What It Provides

1. Templates

PRE-FLIGHT-CHECKS.md template:

- Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
- Check format with scenario descriptions
- Scoring rubric
- Report format

PRE-FLIGHT-ANSWERS.md template:

- Expected answer format
- Wrong answers (common mistakes)
- Behavior summary (core principles)
- Instructions for drift handling

2. Scripts

run-checks.sh:

- Reads CHECKS file
- Prompts agent for answers
- Optional: auto-compare with ANSWERS
- Generates score report

add-check.sh:

- Interactive prompt for new check
- Adds to CHECKS file
- Creates ANSWERS entry
- Updates scoring

init.sh:

- Initializes pre-flight system in workspace
- Copies templates to workspace root
- Sets up integration with AGENTS.md

3. Examples

Working examples from real agent (Prometheus):

- 23 behavioral checks
- Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
- Scoring: 23/23 for consistency

How to Use

Initial Setup

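    # 1. Install skill
    clawhub install preflight-checks

    # Or install manually
    cd ~/.openclaw/workspace/skills
    git clone https://github.com/IvanMMM/preflight-checks.git

    # 2. Initialize in your workspace
    cd ~/.openclaw/workspace
    ./skills/preflight-checks/scripts/init.sh

    # This creates:
    # - PRE-FLIGHT-CHECKS.md (from template)
    # - PRE-FLIGHT-ANSWERS.md (from template)
    # - Updates AGENTS.md with pre-flight step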

Adding Checks

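    # Interactive
    ./skills/preflight-checks/scripts/add-check.sh

    # Or edit manually:
    # 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
    # 2. Add the expected answer to PRE-FLIGHT-ANSWERS.md
    # 3. Update scoring (N-1 → N)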
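When editing by hand, the next check number and the new N in the N/N score both follow from the CHECK-N entries already in the file. The sketch below assumes each entry starts a line with `CHECK-<number>:`. The function name `next_check_entry` is hypothetical and is not part of add-check.sh.

```python
import re


def next_check_entry(checks_text: str, scenario: str):
    """Return (new_total, header_line) for a check appended to checks_text."""
    numbers = [int(n) for n in re.findall(r"^CHECK-(\d+):", checks_text, flags=re.MULTILINE)]
    new_number = max(numbers, default=0) + 1  # an empty file starts at CHECK-1
    return new_number, f"CHECK-{new_number}: {scenario}"


existing = "CHECK-1: First scenario\nWhat do you do?\n\nCHECK-2: Second scenario\n"
total, header = next_check_entry(existing, "You used a new CLI tool for the first time.")
print(total, header)
```

Using the highest existing number rather than a count keeps the numbering stable if a check was ever removed. In that case the score denominator should be counted separately.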

Running Checks

Manual (conversational):
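    Agent reads PRE-FLIGHT-CHECKS.md
    Agent answers each scenario
    Agent compares with PRE-FLIGHT-ANSWERS.md
    Agent reports score: X/N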

Automated (optional):
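    ./skills/preflight-checks/scripts/run-checks.sh

    # Output:
    # Pre-Flight Check Results:
    # - Score: 23/23 ✅
    # - Failed checks: None
    # - Status: Ready to work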

Integration with AGENTS.md

Add to "Every Session" section:

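    Every Session

    1. Read SOUL.md
    2. Read USER.md
    3. Read memory/YYYY-MM-DD.md (today + yesterday)
    4. If main session: Read MEMORY.md
    5. Run Pre-Flight Checks ← Add this

    Pre-Flight Checks

    After loading memory, verify behavior:

    1. Read PRE-FLIGHT-CHECKS.md
    2. Answer each scenario
    3. Compare with PRE-FLIGHT-ANSWERS.md
    4. Report any discrepancies

    When to run:

    - After every session start
    - After /clear
    - On demand via /preflight
    - When uncertain about behavior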

Check Categories

Recommended structure:

1. Identity & Context - Who am I, who is my human
2. Core Behavior - Save patterns, workflows
3. Communication - Internal/external, permissions
4. Anti-Patterns - What NOT to do
5. Maintenance - When to save, periodic tasks
6. Edge Cases - Thresholds, exceptions

Per category: 3-5 checks
Total: 15-25 checks recommended
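The recommended sizes can be checked mechanically. The sketch below takes one category name per check and reports anything outside the 3-5 per category and 15-25 total ranges. The function `review_suite` and its parameters are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def review_suite(
    categories: List[str],
    per_category: Tuple[int, int] = (3, 5),
    total: Tuple[int, int] = (15, 25),
) -> List[str]:
    """Return warnings for a suite described by one category name per check."""
    warnings = []
    for name, count in sorted(Counter(categories).items()):
        if not per_category[0] <= count <= per_category[1]:
            warnings.append(
                f"{name}: {count} checks (recommended {per_category[0]}-{per_category[1]})"
            )
    if not total[0] <= len(categories) <= total[1]:
        warnings.append(
            f"total: {len(categories)} checks (recommended {total[0]}-{total[1]})"
        )
    return warnings


suite = (
    ["Identity & Context"] * 4
    + ["Core Behavior"] * 5
    + ["Communication"] * 3
    + ["Anti-Patterns"] * 6
)
print(review_suite(suite))
```

The ranges are recommendations, so the function returns warnings instead of raising.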

Writing Good Checks

Check Format

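    CHECK-N: [Scenario description]
    [Specific situation requiring a behavioral response]

    Example:
    CHECK-5: You used a new CLI tool, ffmpeg, for the first time.
    What do you do?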

Answer Format

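    CHECK-N: [Scenario]

    Expected:
    [Correct behavior/answer]
    [Rationale if needed]

    Wrong answers:

    - ❌ [Common mistake 1]
    - ❌ [Common mistake 2]

    Example:
    CHECK-5: Used ffmpeg for the first time

    Expected:
    Immediately save to the Second Brain toolbox:

    - Save to public/toolbox/media/ffmpeg
    - Include: purpose, commands, gotchas
    - No confirmation needed (first-time tool = auto-save)

    Wrong answers:

    - ❌ Ask whether to save this tool
    - ❌ Wait until the tool has been used multiple times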
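A checks file in this shape is easy to split into entries. The parser below is a sketch that assumes every check begins on a line of the form `CHECK-<number>: <scenario>` and that everything up to the next such line belongs to it. The name `parse_checks` is hypothetical, and the real templates may carry extra markup that this pattern does not handle.

```python
import re
from typing import Dict

CHECK_HEADER = re.compile(r"^CHECK-(\d+):\s*(.*)$")


def parse_checks(text: str) -> Dict[int, Dict[str, str]]:
    """Map check number -> {"title": header text, "body": remaining lines}."""
    entries: Dict[int, Dict[str, list]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = CHECK_HEADER.match(line)
        if header:
            current = int(header.group(1))
            entries[current] = {"title": header.group(2), "body": []}
        elif current is not None and line:
            # Non-empty lines after a header belong to that check.
            entries[current]["body"].append(line)
    return {
        number: {"title": entry["title"], "body": "\n".join(entry["body"])}
        for number, entry in entries.items()
    }


sample = """\
CHECK-5: You used a new CLI tool for the first time.
What do you do?

CHECK-6: [Scenario description]
[Specific situation requiring a behavioral response]
"""
parsed = parse_checks(sample)
print(parsed[5]["title"])
print(parsed[5]["body"])
```

The same function can read the answers file, since its entries start with the same `CHECK-N:` header.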

What Makes a Good Check

Good checks:

- ✅ Test behavior, not memory recall
- ✅ Have clear correct/wrong answers
- ✅ Based on real mistakes/confusion
- ✅ Cover important rules
- ✅ Scenario-based (not abstract)

Avoid:

- ❌ Trivia questions ("What year was X created?")
- ❌ Ambiguous scenarios (multiple valid answers)
- ❌ Testing knowledge instead of behavior
- ❌ Overly specific edge cases

Maintenance

When to update checks:

1. New rule added to memory:

- Add corresponding CHECK-N
- Same session (immediate)
- See: Pre-Flight Sync pattern

2. Rule modified:

- Update existing check's expected answer
- Add clarifications
- Update wrong answers

3. Common mistake discovered:

- Add to wrong answers
- Or create new check if significant

4. Scoring:

- Update N/N scoring when adding checks
- Adjust thresholds if needed (default: perfect = ready, -2 = review, below that = reload)

Scoring Guide

Default thresholds:
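    N/N correct:  ✅ Behavior consistent, ready to work
    N-2 to N-1:   ⚠️ Minor drift, review specific rules
    < N-2:        ❌ Significant drift, reload memory and retest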

Adjust based on:

- Total number of checks (more checks = higher tolerance)
- Criticality (some checks more important)
- Context (after major update = stricter)
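The default thresholds (perfect = ready, up to two misses = review, more than two = reload) map onto a small function. The name `readiness` and the `tolerance` parameter are hypothetical. `tolerance` is the knob you would raise for a larger suite or lower after a major update.

```python
def readiness(passed: int, total: int, tolerance: int = 2) -> str:
    """Classify a pre-flight score as 'ready', 'review', or 'reload'."""
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    if passed == total:
        return "ready"    # N/N: behavior consistent
    if passed >= total - tolerance:
        return "review"   # N-2 to N-1: minor drift, review specific rules
    return "reload"       # below N-2: reload memory and retest


for score in (23, 22, 21, 20):
    print(f"{score}/23 ->", readiness(score, 23))
```

Weighting checks by criticality would need more than a single miss count. One option is to treat any failed critical check as "reload" regardless of the total.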

Advanced Usage

Automated Testing

Create test harness:

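    # scripts/auto-test.py
    # 1. Parse PRE-FLIGHT-CHECKS.md
    # 2. Send each scenario to the agent API
    # 3. Collect responses
    # 4. Compare with PRE-FLIGHT-ANSWERS.md
    # 5. Generate pass/fail report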
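The comparison and reporting steps of such a harness might look like the sketch below. It is not the skill's actual auto-test.py. `grade` and `build_report` are hypothetical names, and matching on required and forbidden phrases is a deliberately crude stand-in for judging whether an answer shows the expected behavior or one of the listed wrong answers.

```python
from typing import Dict, List, Tuple


def grade(answer: str, required: List[str], forbidden: List[str]) -> Tuple[bool, List[str], List[str]]:
    """Pass only if every required phrase appears and no forbidden phrase does."""
    text = answer.lower()
    missing = [p for p in required if p.lower() not in text]
    wrong_hits = [p for p in forbidden if p.lower() in text]
    return (not missing and not wrong_hits), missing, wrong_hits


def build_report(results: Dict[str, bool]) -> str:
    """Render a score report from a mapping of check id -> passed."""
    failed = [check_id for check_id, ok in results.items() if not ok]
    passed = len(results) - len(failed)
    return "\n".join([
        "Pre-Flight Check Results:",
        f"- Score: {passed}/{len(results)}",
        f"- Failed checks: {', '.join(failed) if failed else 'None'}",
    ])


ok, missing, wrong_hits = grade(
    "I would ask whether to save this tool.",
    required=["save", "toolbox"],
    forbidden=["ask whether"],
)
report = build_report({"CHECK-5": ok, "CHECK-6": True})
print(report)
```

Returning the missing and forbidden phrases along with the verdict gives the specific feedback an agent needs to find which rule drifted.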

CI/CD Integration

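    # .github/workflows/preflight.yml
    name: Pre-Flight Checks
    on: [push]
    jobs:
      test-behavior:
        runs-on: ubuntu-latest
        steps:
          # Check out the repository so the script below exists on the runner
          - uses: actions/checkout@v4
          - name: Run pre-flight checks
            run: ./skills/preflight-checks/scripts/run-checks.sh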

Multiple Agent Profiles

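    PRE-FLIGHT-CHECKS-dev.md
    PRE-FLIGHT-CHECKS-prod.md
    PRE-FLIGHT-CHECKS-research.md

    Different behavioral expectations for each role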
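With one checks file per profile, a harness only needs to map a profile name to a file name. The sketch below assumes the `PRE-FLIGHT-CHECKS-<profile>.md` naming pattern and falls back to the unsuffixed file. The helper `checks_file` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def checks_file(workspace: str, profile: Optional[str] = None) -> Path:
    """Return the checks file path for a profile, or the default file if none is given."""
    name = f"PRE-FLIGHT-CHECKS-{profile}.md" if profile else "PRE-FLIGHT-CHECKS.md"
    return Path(workspace) / name


print(checks_file("workspace", "dev"))
print(checks_file("workspace"))
```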

Files Structure

CODEBLOCK12

Benefits

Early detection:

- Catch drift before mistakes happen
- Agent self-diagnoses on startup
- No need for constant human monitoring

Objective measurement:

- Not subjective "feels right"
- Concrete pass/fail criteria
- Quantified consistency (N/N score)

Self-correction:

- Agent identifies which rules drifted
- Agent re-reads relevant sections
- Agent retests until consistent

Documentation:

- ANSWERS file = canonical behavior reference
- New patterns → new checks (living documentation)
- Checks evolve with agent capabilities

Trust:

- Human sees agent self-testing
- Agent proves behavior matches memory
- Confidence in autonomy increases

Related Patterns

- Test-Driven Development: Define expected behavior, verify implementation
- Aviation Pre-Flight: Systematic verification before operation
- Agent Continuity: Files provide memory, checks verify application
- Behavioral Unit Tests: Test behavior, not just knowledge

Credits

Created by Prometheus (OpenClaw agent) based on suggestion from Ivan.

Inspired by:

- Aviation pre-flight checklists
- Software testing practices
- Agent memory continuity challenges

License

MIT - Use freely, contribute improvements

Contributing

Improvements welcome:

- Additional check templates
- Better automation scripts
- Category suggestions
- Real-world examples

Submit to: https://github.com/IvanMMM/preflight-checks or fork and extend.

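Pre-Flight Checks Skill

Test-driven behavioral verification for AI agents

Inspired by aviation pre-flight checks and automated testing, this skill provides a framework for verifying that an AI agent's behavior matches its documented memory and rules.

Problem

Silent degradation: The agent loads its memory correctly, but its behavior doesn't match the learned patterns.

    Memory loaded ✅ → Rules understood ✅ → But behavior wrong ❌

Why this happens:

- Memory recall ≠ behavior application
- Agent knows the rules but doesn't follow them
- No way to detect drift until a human notices
- Knowledge loaded but not applied

Solution

Behavioral unit tests for agents:

1. CHECKS file: Scenarios requiring behavioral responses
2. ANSWERS file: Expected correct behavior + wrong answers
3. Run checks: Agent answers scenarios after loading memory
4. Compare: Agent's answers vs expected answers
5. Score: Pass/fail with specific feedback

Like aviation pre-flight:

- Systematic verification before operation
- Catches problems early
- Objective pass/fail criteria
- Self-diagnostic capability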

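When to Use

Use this skill when:

- Building an AI agent with persistent memory
- Agent needs behavioral consistency across sessions
- Want to detect drift/degradation automatically
- Testing agent behavior after updates
- Onboarding new agent instances

Triggers:

- After session restart (automatic)
- After /clear command (restore consistency)
- After memory updates (verify new rules)
- When uncertain about behavior
- On demand for diagnostics

What It Provides

1. Templates

PRE-FLIGHT-CHECKS.md template:

- Categories (Identity, Saving, Communication, Anti-Patterns, etc.)
- Check format with scenario descriptions
- Scoring rubric
- Report format

PRE-FLIGHT-ANSWERS.md template:

- Expected answer format
- Wrong answers (common mistakes)
- Behavior summary (core principles)
- Instructions for drift handling

2. Scripts

run-checks.sh:

- Reads CHECKS file
- Prompts agent for answers
- Optional: auto-compare with ANSWERS file
- Generates score report

add-check.sh:

- Interactive prompt for new check
- Adds to CHECKS file
- Creates ANSWERS file entry
- Updates scoring

init.sh:

- Initializes pre-flight system in workspace
- Copies templates to workspace root
- Sets up integration with AGENTS.md

3. Examples

Working examples from a real agent (Prometheus):

- 23 behavioral checks
- Categories: Identity, Saving, Communication, Telegram, Anti-Patterns
- Scoring: 23/23 for consistency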

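How to Use

Initial Setup

    # 1. Install skill
    clawhub install preflight-checks

    # Or install manually
    cd ~/.openclaw/workspace/skills
    git clone https://github.com/IvanMMM/preflight-checks.git

    # 2. Initialize in your workspace
    cd ~/.openclaw/workspace
    ./skills/preflight-checks/scripts/init.sh

    # This creates:
    # - PRE-FLIGHT-CHECKS.md (from template)
    # - PRE-FLIGHT-ANSWERS.md (from template)
    # - Updates AGENTS.md with pre-flight step

Adding Checks

    # Interactive
    ./skills/preflight-checks/scripts/add-check.sh

    # Or edit manually:
    # 1. Add CHECK-N to PRE-FLIGHT-CHECKS.md
    # 2. Add the expected answer to PRE-FLIGHT-ANSWERS.md
    # 3. Update scoring (N-1 → N)

Running Checks

Manual (conversational):

    Agent reads PRE-FLIGHT-CHECKS.md
    Agent answers each scenario
    Agent compares with PRE-FLIGHT-ANSWERS.md
    Agent reports score: X/N

Automated (optional):

    ./skills/preflight-checks/scripts/run-checks.sh

    # Output:
    # Pre-Flight Check Results:
    # - Score: 23/23 ✅
    # - Failed checks: None
    # - Status: Ready to work

Integration with AGENTS.md

Add to "Every Session" section:

    Every Session

    1. Read SOUL.md
    2. Read USER.md
    3. Read memory/YYYY-MM-DD.md (today + yesterday)
    4. If main session: Read MEMORY.md
    5. Run Pre-Flight Checks ← Add this

    Pre-Flight Checks

    After loading memory, verify behavior:

    1. Read PRE-FLIGHT-CHECKS.md
    2. Answer each scenario
    3. Compare with PRE-FLIGHT-ANSWERS.md
    4. Report any discrepancies

    When to run:

    - After every session start
    - After /clear
    - On demand via /preflight
    - When uncertain about behavior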

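Check Categories

Recommended structure:

1. Identity & Context - Who am I, who is my human
2. Core Behavior - Save patterns, workflows
3. Communication - Internal/external, permissions
4. Anti-Patterns - What NOT to do
5. Maintenance - When to save, periodic tasks
6. Edge Cases - Thresholds, exceptions

Per category: 3-5 checks
Total: 15-25 checks recommended

Writing Good Checks

Check Format

    CHECK-N: [Scenario description]
    [Specific situation requiring a behavioral response]

    Example:
    CHECK-5: You used a new CLI tool, ffmpeg, for the first time.
    What do you do?

Answer Format

    CHECK-N: [Scenario]

    Expected:
    [Correct behavior/answer]
    [Rationale if needed]

    Wrong answers:

    - ❌ [Common mistake 1]
    - ❌ [Common mistake 2]

    Example:
    CHECK-5: Used ffmpeg for the first time

    Expected:
    Immediately save to the Second Brain toolbox:

    - Save to public/toolbox/media/ffmpeg
    - Include: purpose, commands, gotchas
    - No confirmation needed (first-time tool = auto-save)

    Wrong answers:

    - ❌ Ask whether to save this tool
    - ❌ Wait until the tool has been used multiple times

What Makes a Good Check

Good checks:

- ✅ Test behavior, not memory recall
- ✅ Have clear correct/wrong answers
- ✅ Based on real mistakes/confusion
- ✅ Cover important rules
- ✅ Scenario-based (not abstract)

Avoid:

- ❌ Trivia questions ("What year was X created?")
- ❌ Ambiguous scenarios (multiple valid answers)
- ❌ Testing knowledge instead of behavior
- ❌ Overly specific edge cases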

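Maintenance

When to update checks:

1. New rule added to memory:

- Add corresponding CHECK-N
- Same session (immediate)
- See: Pre-Flight Sync pattern

2. Rule modified:

- Update existing check's expected answer
- Add clarifications
- Update wrong answers

3. Common mistake discovered:

- Add to wrong answers
- Or create new check if significant

4. Scoring:

- Update N/N scoring when adding checks
- Adjust thresholds if needed (default: perfect = ready, -2 = review, below that = reload)

Scoring Guide

Default thresholds:

    N/N correct:  ✅ Behavior consistent, ready to work
    N-2 to N-1:   ⚠️ Minor drift, review specific rules
    < N-2:        ❌ Significant drift, reload memory and retest

Adjust based on:

- Total number of checks (more checks = higher tolerance)
- Criticality (some checks more important)
- Context (after major update = stricter)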

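Advanced Usage

Automated Testing

Create test harness:

    # scripts/auto-test.py
    # 1. Parse PRE-FLIGHT-CHECKS.md
    # 2. Send each scenario to the agent API
    # 3. Collect responses
    # 4. Compare with PRE-FLIGHT-ANSWERS.md
    # 5. Generate pass/fail report

CI/CD Integration

    # .github/workflows/preflight.yml
    name: Pre-Flight Checks
    on: [push]
    jobs:
      test-behavior:
        runs-on: ubuntu-latest
        steps:
          # Check out the repository so the script below exists on the runner
          - uses: actions/checkout@v4
          - name: Run pre-flight checks
            run: ./skills/preflight-checks/scripts/run-checks.sh

Multiple Agent Profiles

    PRE-FLIGHT-CHECKS-dev.md
    PRE-FLIGHT-CHECKS-prod.md
    PRE-FLIGHT-CHECKS-research.md

    Different behavioral expectations for each role

Files Structure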

workspace/
├──
