Load Local Context
CODEBLOCK0
Eval Skill
Structured evaluation of everything the agent manages.
When to Use
Trigger phrases:
- "run eval"
- "what's working and what isn't"
- "rate yourself"
- "check everything"
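A minimal sketch of how these trigger phrases could be detected in an incoming message. The helper name `is_eval_trigger` and the matching strategy (case-insensitive substring match) are assumptions for illustration, not something the skill prescribes.

```python
# Hypothetical helper: the skill only lists the phrases, not how to match them.
TRIGGER_PHRASES = (
    "run eval",
    "what's working and what isn't",
    "rate yourself",
    "check everything",
)


def is_eval_trigger(message: str) -> bool:
    """Return True if the message contains any trigger phrase."""
    # Normalize curly apostrophes so "what’s" also matches "what's".
    text = message.lower().replace("\u2019", "'")
    return any(phrase in text for phrase in TRIGGER_PHRASES)
```

Substring matching keeps the check cheap enough for any small model or plain script; stricter matching (whole message only) would be an equally valid choice.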
Pre-Eval Behavioral Checks (Always)
1. React 👍 when the owner triggers the eval
2. React ✅ when the report is complete
3. PA directory source: /opt/ocana/openclaw/workspace/PA_LIST.md
4. Calendar check: use the direct API (NOT the gog CLI)
Eval Report Format
CODEBLOCK1
Running the Eval
Step 1 — Self Performance Score
Score each dimension 1–5 based on today's activity:
CODEBLOCK2
Step 2 — Task Audit
CODEBLOCK3
Step 3 — PA Network Health
CODEBLOCK4
Step 4 — Skills Audit
CODEBLOCK5
Step 5 — Integration Health
CODEBLOCK6
Step 6 — Memory Health
CODEBLOCK7
Recommendations Logic
After running all steps, generate recommendations:
CODEBLOCK8
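The recommendation rules can be sketched as a plain if/then chain. The thresholds (200 lines, 6 hours, HTTP 402) are the ones this skill uses; the `EvalFindings` container, its field names, and the shortened recommendation strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvalFindings:
    """Hypothetical container for the results of Steps 1-6 (field names are assumptions)."""
    unresolved_billing_pas: list = field(default_factory=list)
    stalled_tasks: list = field(default_factory=list)
    memory_lines: int = 0
    daily_note_exists: bool = True
    hours_since_backup: float = 0.0
    api_status: int = 200


def recommendations(f: EvalFindings) -> list:
    """Apply the if/then rules in order and return the matching recommendations."""
    recs = []
    if f.unresolved_billing_pas:
        recs.append("Fix billing for " + ", ".join(f.unresolved_billing_pas))
    if f.stalled_tasks:
        recs.append("Follow up on stalled tasks: " + ", ".join(f.stalled_tasks))
    if f.memory_lines > 200:
        recs.append("Trim MEMORY.md")
    if not f.daily_note_exists:
        recs.append("Create today's memory file")
    if f.hours_since_backup > 6:
        recs.append("Run git backup")
    if f.api_status == 402:
        recs.append("API key is out of credits")
    return recs
```

Because every rule is a simple comparison, this step needs no advanced reasoning; the list order doubles as a rough priority order.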
Scheduling
Run eval:
- On demand: when the owner asks
- Weekly: every Sunday at 09:00
- After major incidents: billing crisis, WhatsApp (WA) disconnect, etc.
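One way to check the weekly slot, as a sketch. It assumes a scheduler calls the function at least once per hour and passes `now` in the owner's local time; both are assumptions, since the skill does not say how the schedule is enforced.

```python
from datetime import datetime


def weekly_eval_due(now: datetime) -> bool:
    """True during the Sunday 09:00 hour, the weekly slot listed above."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return now.weekday() == 6 and now.hour == 9
```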
Cost Tips
- Cheap: reading files, scoring, formatting. Any small model works.
- Expensive: summarizing large memory files. Skip unless asked.
- Avoid: running all API health checks every hour. Cache results for 30 minutes.
- Batch: run all health checks in one pass, not one at a time.
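The caching and batching tips can be combined in a small time-to-live cache. This is a sketch under assumptions: `HealthCache` and `run_all_checks` are hypothetical names, and the cache lives in memory only, so it resets when the process restarts.

```python
import time


class HealthCache:
    """Cache one batch of health-check results for a fixed time-to-live."""

    def __init__(self, ttl_seconds=30 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without waiting
        self._stamp = None
        self._results = None

    def get(self, run_all_checks):
        """Return cached results, calling run_all_checks() only when the cache is stale."""
        now = self.clock()
        if self._stamp is None or now - self._stamp >= self.ttl:
            self._results = run_all_checks()  # one batched pass over all checks
            self._stamp = now
        return self._results
```

Passing a single `run_all_checks` callable keeps the checks batched: either every check is refreshed in one pass or none is.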
Minimum Model
Any model that can:
1. Read files
2. Apply if/then scoring rules
3. Format a structured report
No advanced reasoning needed.
PA Performance Scoring (Merged from pa-eval skill)
Use this section when evaluating individual PA agents (weekly self-eval or on-demand when owner gives feedback).
Scoring Dimensions (1–5 each, max 40 points)
| Dimension | What to Measure |
|---|---|
| Execution | Tasks completed without reminders |
| Accuracy | Results are correct and complete |
| Speed | Response time is fast |
| Proactivity | Acts without being asked |
| Communication | Concise and context-appropriate |
| Memory | Remembers context across sessions |
| Tool Use | Tools used correctly and efficiently |
| Judgment | Knows when to act vs. when to ask |
Grade: A (36–40), B (28–35), C (20–27), D (<20)
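The grade bands translate directly into code. The sketch below assumes scores arrive as a dict keyed by dimension name; the function name `grade` is illustrative.

```python
DIMENSIONS = ("Execution", "Accuracy", "Speed", "Proactivity",
              "Communication", "Memory", "Tool Use", "Judgment")


def grade(scores: dict) -> tuple:
    """Sum the eight 1-5 scores and map the total to a letter grade."""
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(scores[name] for name in DIMENSIONS)
    if total >= 36:
        letter = "A"
    elif total >= 28:
        letter = "B"
    elif total >= 20:
        letter = "C"
    else:
        letter = "D"
    return total, letter
```

For example, a PA scoring 4 on every dimension totals 32 out of 40, which falls in the B band.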
Owner Feedback Signals
Log these automatically when detected:
| Signal | Action |
|---|---|
| 👍 reaction / "thanks" / "great" | Log +1 positive |
| 👎 reaction / "wrong" / "not good" | Log -1, record the correction |
| Owner re-asks the same question | Log -1 memory gap |
| Owner does the task themselves | Log -1 initiative gap |
| Owner surprised by proactive action | Log +2 proactivity |
Rule: Log feedback signals immediately — don't batch them.
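A sketch of immediate, unbatched logging. The signal keys (`positive`, `re_ask`, and so on) and the entry layout are assumptions; only the point values and labels come from the table above. Detecting the signal in a conversation is out of scope here.

```python
from datetime import datetime, timezone

# Hypothetical signal keys -> (score delta, label); values follow the table above.
SIGNALS = {
    "positive": (+1, "positive"),
    "negative": (-1, "correction"),
    "re_ask": (-1, "memory gap"),
    "owner_did_task": (-1, "initiative gap"),
    "proactive_surprise": (+2, "proactivity"),
}


def log_signal(log: list, signal: str, note: str = "") -> dict:
    """Append one feedback entry right away (no batching) and return it."""
    delta, label = SIGNALS[signal]
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "delta": delta,
        "label": label,
        "note": note,
    }
    log.append(entry)
    return entry
```

The `note` field is where a correction would be recorded for a negative signal.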
Weekly Eval File
Save to .learnings/eval/YYYY-MM-DD.md with: scores table, owner feedback, tasks completed/failed, what went well, what to improve, actions for next week.
Benchmark Tests (Run Monthly)
- Task Completion Rate: completed / assigned × 100%. Target: >90%
- Accuracy Rate: (tasks - corrections) / tasks × 100%. Target: >95%
- Memory Retention: ask about something discussed 7+ days ago. Target: >80% recall
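The two rate formulas, written out as a sketch. Returning 0.0 when there are no tasks is an assumption, since the skill does not define that case.

```python
def completion_rate(completed: int, assigned: int) -> float:
    """completed / assigned x 100; 0.0 if nothing was assigned (assumption)."""
    return 100.0 * completed / assigned if assigned else 0.0


def accuracy_rate(tasks: int, corrections: int) -> float:
    """(tasks - corrections) / tasks x 100; 0.0 if there were no tasks (assumption)."""
    return 100.0 * (tasks - corrections) / tasks if tasks else 0.0


def meets_targets(completed: int, assigned: int, corrections: int) -> dict:
    """Compare both rates against the strict targets (>90% and >95%)."""
    return {
        "completion": completion_rate(completed, assigned) > 90,
        "accuracy": accuracy_rate(completed, corrections) > 95,
    }
```

Note that the targets are strict: 19 of 20 tasks completed (95%) passes the completion target, but one correction in 20 tasks (exactly 95%) does not pass the accuracy target.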
Load Local Context
bash
CONTEXT_FILE="/opt/ocana/openclaw/workspace/skills/eval/.context"
[ -f "$CONTEXT_FILE" ] && source "$CONTEXT_FILE"
# Then use: $OWNER_PHONE, $WORKSPACE, $TASKS_FILE, $MONDAY_TOKEN_FILE, $GOG_CREDS, etc.
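Eval Skill
Structured evaluation of everything the agent manages.
When to Use
Trigger phrases:
- "run eval"
- "what's working and what isn't"
- "rate yourself"
- "check everything"
Pre-Eval Behavioral Checks (Always)
1. React 👍 when the owner triggers the eval
2. React ✅ when the report is complete
3. PA directory source: /opt/ocana/openclaw/workspace/PA_LIST.md
4. Calendar check: use the direct API (NOT the gog CLI)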
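Eval Report Format
📋 Full Eval - [date]
━━━ Self Performance ━━━
Execution: [1-5] [comment]
Accuracy: [1-5] [comment]
Memory: [1-5] [comment]
Proactivity: [1-5] [comment]
Communication: [1-5] [comment]
Total: [X]/25
━━━ Active Tasks ━━━
✅ Completed today: [count]
🟡 In progress: [count]
❌ Stalled: [count] - [list stalled tasks]
━━━ PA Network ━━━
✅ Healthy: [list]
⚠️ Issues: [list with issue descriptions]
❌ Offline: [list]
━━━ Skills ━━━
Installed: [count]
Used today: [list]
Unused (7+ days): [list]
━━━ Integrations ━━━
Calendar (owner): [connected ✅ / broken ❌ / unknown ?]
monday.com: [connected ✅ / broken ❌]
Email (gog): [connected ✅ / broken ❌]
GitHub backup: [last push: X ago]
WhatsApp: [connected ✅ / disconnected ❌]
━━━ Memory Health ━━━
Daily notes: [today's file exists? ✅/❌]
Long-term memory: [MEMORY.md size: OK / bloated]
Learnings: [count this week]
Last backup: [X ago]
━━━ Recommendations ━━━
1. [most important issue to fix]
2. [secondary priority]
3. [optional improvement]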
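Running the Eval
Step 1: Self Performance Score
Score each dimension 1–5 based on today's activity:
Execution (1–5):
- 5: All tasks completed without reminders
- 3: Most tasks completed, some needed follow-up
- 1: Multiple tasks missed or forgotten
Accuracy (1–5):
- 5: No corrections needed from the owner
- 3: 1–2 corrections
- 1: Multiple mistakes or wrong outputs
Memory (1–5):
- 5: Recalled context correctly every time
- 3: Missed some context but corrected promptly
- 1: Repeated the same mistakes
Proactivity (1–5):
- 5: Acted before being asked, multiple times
- 3: Responded to requests with minimal initiative
- 1: Purely reactive, no proactive actions
Communication (1–5):
- 5: Clear, concise, no unnecessary narration
- 3: Occasionally verbose or unclear
- 1: Shared reasoning, listed options, narrated steps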
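Once the five dimensions are scored, the self-performance block of the report can be assembled mechanically. This sketch assumes scores and comments are held in dicts keyed by dimension name; `self_performance_lines` is a hypothetical helper.

```python
SELF_DIMENSIONS = ("Execution", "Accuracy", "Memory", "Proactivity", "Communication")


def self_performance_lines(scores: dict, comments: dict) -> list:
    """Format one line per dimension plus the total out of 25."""
    lines = [
        f"{name}: {scores[name]} {comments.get(name, '')}".rstrip()
        for name in SELF_DIMENSIONS
    ]
    total = sum(scores[name] for name in SELF_DIMENSIONS)
    lines.append(f"Total: {total}/25")
    return lines
```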
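Step 2: Task Audit
bash
TASKS_FILE="$HOME/.openclaw/workspace/memory/tasks.md"
# grep -c prints 0 itself when nothing matches; default to 0 only when it prints nothing (missing file)
echo "Completed tasks:"
DONE=$(grep -c '\[x\]' "$TASKS_FILE" 2>/dev/null); echo "${DONE:-0}"
echo "In-progress tasks:"
OPEN=$(grep -c '\[ \]' "$TASKS_FILE" 2>/dev/null); echo "${OPEN:-0}"
# Stalled = in progress for more than 2 days (open lines dated neither today nor yesterday)
TODAY=$(date -u +%Y-%m-%d)
# GNU date first, BSD/macOS date as a fallback
YESTERDAY=$(date -u -d '1 day ago' +%Y-%m-%d 2>/dev/null || date -u -v-1d +%Y-%m-%d)
echo "Stalled tasks (more than 2 days):"
grep '\[ \]' "$TASKS_FILE" 2>/dev/null | grep -v -e "$TODAY" -e "$YESTERDAY" || echo "None"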
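The stalled-task rule can also be written as a date comparison rather than a text filter. This sketch assumes every task line carries a `YYYY-MM-DD` date, which the script above implies but the skill never states; open lines without a date are skipped here.

```python
import re
from datetime import date, timedelta

DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def stalled_tasks(text: str, today: date) -> list:
    """Return open ("[ ]") task lines dated more than 2 days before today."""
    cutoff = today - timedelta(days=2)
    stalled = []
    for line in text.splitlines():
        if "[ ]" not in line:
            continue  # completed or non-task line
        match = DATE.search(line)
        if match and date.fromisoformat(match.group()) < cutoff:
            stalled.append(line.strip())
    return stalled
```

Comparing parsed dates applies "more than 2 days" literally, so a task dated exactly two days ago is not yet reported as stalled.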
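Step 3: PA Network Health
bash
BILLING_FILE="$HOME/.openclaw/workspace/memory/billing-status.json"
# If the workspace lives under /opt/ocana/openclaw/workspace, point BILLING_FILE there instead
echo "PA network status:"
# Pass the path through the environment so the Python snippet reads the same file
BILLING_FILE="$BILLING_FILE" python3 << 'PYEOF'
import json
import os

with open(os.environ["BILLING_FILE"]) as fh:
    data = json.load(fh)
for pa in data["issues"]:
    status = "✅" if pa["status"] == "resolved" else "⚠️"
    print(f"  {status} {pa['pa']} ({pa['owner']}): {pa['status']}")
PYEOF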
Step 4: Skills Audit
bash
SKILLS_DIR="$HOME/.openclaw/workspace/skills"
echo "Installed skills count:"
ls "$SKILLS_DIR" | grep -v README | wc -l
echo "Skills list:"
ls "$SKILLS_DIR" | grep -v README
Step 5: Integration Health
bash
# Test Anthropic billing
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "x-api-key: ${ANTHROPIC_API_KEY:-none}" \
  -H "anthropic-version: 2023-06-01" \
  https://api.anthropic.com/v1/models 2>/dev/null)
# Interpret the result
if [ "$API_STATUS" = "200" ]; then echo "Billing: ✅ OK"
elif [ "$API_STATUS" = "402" ]; then echo "Billing: ❌ out of credits"
elif [ "$API_STATUS" = "401" ]; then echo "Billing: ❌ invalid key"
else echo "Billing: ? HTTP $API_STATUS"
fi
# Test GitHub backup
LAST_PUSH=$(git -C "$HOME/.openclaw/workspace" log -1 --format="%ar" 2>/dev/null)
echo "Last backup: $LAST_PUSH"
# Test monday.com
if [ -f "$HOME/.credentials/monday-api-token.txt" ]; then
  MONDAY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    -X POST https://api.monday.com/v2 \
    -H "Authorization: $(cat "$HOME/.credentials/monday-api-token.txt")" \
    -H "Content-Type: application/json" \
    -d '{"query": "{ me { id } }"}' 2>/dev/null)
  [ "$MONDAY_STATUS" = "200" ] && echo "monday.com: ✅" || echo "monday.com: ❌ ($MONDAY_STATUS)"
else
  echo "monday.com: ? (token not found)"
fi
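The status-code branch in the script above (200, 402, 401, anything else) is a lookup table in disguise. A sketch, with `billing_label` as a hypothetical helper name:

```python
def billing_label(http_status: int) -> str:
    """Map the HTTP status of the billing probe to a report line."""
    labels = {
        200: "✅ OK",
        402: "❌ out of credits",
        401: "❌ invalid key",
    }
    # Any other status is reported as unknown, together with the raw code.
    return "Billing: " + labels.get(http_status, f"? HTTP {http_status}")
```

Keeping the mapping in one dict makes it easy to add further codes later without touching the control flow.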
Step 6: Memory Health
bash
TODAY=$(date -u +%Y-%m-%d)
WORKSPACE="$HOME/.openclaw/workspace"
# Check that the daily note exists
[ -f "$WORKSPACE/memory/$TODAY.md" ] \
  && echo "Daily note: ✅" \
  || echo "Daily note: ❌ not created yet"
# Check MEMORY.md size (warn above 200 lines)
MEMORY_LINES=$(wc -l < "$WORKSPACE/MEMORY.md" 2>/dev/null || echo 0)
if [ "$MEMORY_LINES" -gt 200 ]; then
  echo "MEMORY.md: ⚠️ too large ($MEMORY_LINES lines), consider trimming"
else
  echo "MEMORY.md: ✅ ($MEMORY_LINES lines)"
fi
# Count logged learnings (every "## " heading in the file, not only this week's)
LEARNINGS=$(grep -c "^## " "$WORKSPACE/.learnings/LEARNINGS.md" 2>/dev/null)
echo "Total learnings logged: ${LEARNINGS:-0}"
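Recommendations Logic
After running all steps, generate recommendations:
If any PA has a billing_error and status != resolved:
→ Fix billing for [PA list]: they can't work
If any task has been in progress for more than 2 days:
→ Follow up on stalled tasks: [task names]
If MEMORY.md is over 200 lines:
→ Trim MEMORY.md: the file is getting bloated
If the daily note doesn't exist:
→ Create today's memory file
If the last backup was more than 6 hours ago:
→ Run git backup
If the API billing status is 402:
→ My API key is out of credits -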