Auto-Optimizer
Zero-friction optimization engine. Tell your OpenClaw agent "optimize my [thing]" and it runs autonomous improve-eval-keep/revert cycles until your metric goes up. Works on anything: copy, prompts, code, prediction strategies, configs.
Results in under 5 minutes. Zero configuration required for demos.
Quick Setup (AI Agent Instructions)
Follow these steps exactly. No judgment calls needed.
Step 1 — Install
clawhub install auto-optimizer
Step 2 — Initialize a git repo (required)
The optimizer uses git to revert bad changes. Must be in a git repo.
cd your-project
git init && git add . && git commit -m "baseline"
Step 3 — Pick your mode
Scalar mode — you have a command that outputs a single number (test score, Brier score, conversion rate, word count, etc.)
Binary mode — you want to evaluate quality with yes/no criteria (copy quality, prompt effectiveness, UX clarity)
Step 4 — Run it
Scalar mode:
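./skills/auto-optimizer/auto-optimizer.sh \
  --file ./your-file.md \
  --metric "bash ./your-metric.sh" \
  --budget 10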
Binary mode:
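./skills/auto-optimizer/auto-optimizer.sh \
  --eval-mode binary \
  --file ./your-file.md \
  --evals ./your-evals.md \
  --batch-size 10 \
  --budget 10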
Not sure? Use the wizard:
./skills/auto-optimizer/auto-optimizer.sh --wizard
Pre-Built Starter Packs
Three self-contained demos that run immediately. No files to create, no config needed.
Demo 1: Cold Outreach Optimizer
Optimizes a cold email template using a mock scoring metric (hook strength + clarity + CTA quality + length).
./skills/auto-optimizer/auto-optimizer.sh --demo outreach --budget 5
What it does:
- Creates a sample outreach template in /tmp/demo-outreach/outreach.md
- Runs a mock metric that scores: hook ≤15 words, single CTA, body ≤120 words, value prop clarity
- Iterates 5 times, keeps improvements, reverts regressions
- Prints a final report showing baseline → best score
Sample outreach template used:
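Subject: Quick question about [Company]
Hi [Name],
I'm reaching out because I've been following [Company]'s work and think there could be a great opportunity for us to work together.
We help companies like yours improve their sales process with AI-powered outreach tools. Our clients typically see a 3x increase in reply rates within the first month.
Would you be open to a 15-minute call next week to explore whether this could be valuable for [Company]?
Looking forward to hearing from you,
[Your Name]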
Mock metric logic (inline in demo):
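# Scores 0-100 based on:
# - Hook length <= 15 words: +25
# - Single CTA (not multiple asks): +25
# - Body <= 120 words: +25
# - Contains specific value/number: +25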
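The checks listed for this demo (four criteria worth 25 points each) can be sketched as a standalone scalar metric. This is a minimal illustration, not the demo's actual scorer: the function name `score_outreach`, counting question marks as CTAs, and treating any digit as a "specific value" are assumptions made for the example.

```python
import re


def score_outreach(hook: str, body: str) -> int:
    """Score a template 0-100: four checks worth 25 points each."""
    score = 0
    if len(hook.split()) <= 15:      # hook length check
        score += 25
    if body.count("?") == 1:         # assumption: one question mark == one CTA
        score += 25
    if len(body.split()) <= 120:     # body length check
        score += 25
    if re.search(r"\d", body):       # assumption: any digit == specific value
        score += 25
    return score


hook = "Quick question about your sales process"
body = (
    "Our clients typically see a 3x reply rate in month one. "
    "Open to a 15-minute call next week?"
)
# A scalar metric prints exactly one number to stdout.
print(float(score_outreach(hook, body)))
```

A heuristic like this is cheap enough to run on every iteration, which is what makes it usable as a `--metric` command.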
Demo 2: Prediction Market Strategy
Runs the optimization loop on a prediction strategy file, scoring by mock accuracy.
./skills/auto-optimizer/auto-optimizer.sh --demo prediction --budget 5
What it does:
- Creates a sample prediction strategy in /tmp/demo-prediction/strategy.md
- Runs a mock metric that scores: specificity of criteria, use of base rates, calibration language
- Shows how the loop works on structured reasoning files
Demo 3: Prompt Quality Optimizer (Binary Mode)
Optimizes a system prompt using 5 yes/no quality criteria.
./skills/auto-optimizer/auto-optimizer.sh --demo prompt --budget 5 --eval-mode binary
What it does:
- Creates a sample system prompt in /tmp/demo-prompt/system-prompt.md
- Evaluates each iteration against 5 criteria (inline):
1. Does the prompt specify a clear role/persona?
2. Does it include explicit output format instructions?
3. Does it define what NOT to do?
4. Is it under 500 words?
5. Does it include at least one concrete example?
- Batch size 10: generates 10 outputs, scores each, calculates pass rate %
- Keeps versions that increase the pass rate
Sample system prompt used:
You are a helpful assistant. Answer questions clearly and accurately.
Be concise but thorough. Help the user accomplish their goals.
Full Capability Guide
--wizard — Interactive Setup
Walks you through setup interactively. Best when you're not sure which mode to use.
./skills/auto-optimizer/auto-optimizer.sh --wizard
Prompts you to choose:
1. Cold outreach / email copy → sets up binary mode with outreach evals
2. LLM prompt / system prompt → sets up binary mode with prompt quality evals
3. Prediction market strategy → sets up scalar mode with accuracy metric
4. Code / config file → sets up scalar mode, prompts for your test command
5. Custom → asks for your file and metric
--eval-mode scalar (default)
When to use: Anything with a measurable number. Test pass rate, Brier score, word count, latency, revenue, API response score.
Requirements: Your --metric command must print a single float to stdout.
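# Example metric commands:
--metric "python test_score.py"        # outputs: 0.847
--metric "bash run_eval.sh | tail -1"  # outputs: 73.2
--metric "node score.js"               # outputs: 0.91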
How it works: Runs metric → agent proposes change → run metric again → if improved, commit; else git checkout to revert.
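The keep-or-revert cycle described above can be sketched as a small loop. This is an illustration of the control flow only: `optimize` and the four callbacks are hypothetical names, and the toy callbacks stand in for the agent edit, the metric command, `git commit`, and `git checkout`.

```python
from typing import Callable, List, Tuple


def optimize(
    measure: Callable[[], float],
    propose: Callable[[], None],
    keep: Callable[[], None],
    revert: Callable[[], None],
    budget: int,
    goal: str = "maximize",
) -> Tuple[float, List[float]]:
    """Run `budget` propose/measure cycles, keeping only improvements."""
    best = measure()  # baseline score before any change
    history = [best]
    for _ in range(budget):
        propose()          # the agent edits the file
        score = measure()  # re-run the metric on the edited file
        improved = score > best if goal == "maximize" else score < best
        if improved:
            keep()         # stands in for git commit
            best = score
        else:
            revert()       # stands in for git checkout
        history.append(best)
    return best, history


# Toy stand-ins so the sketch runs without git or an agent.
state = {"value": 50.0, "saved": 50.0}
changes = iter([+5.0, -3.0, +2.0])


def measure() -> float:
    return state["value"]


def propose() -> None:
    state["value"] += next(changes)


def keep() -> None:
    state["saved"] = state["value"]


def revert() -> None:
    state["value"] = state["saved"]


best, history = optimize(measure, propose, keep, revert, budget=3)
print(best, history)  # 57.0 [50.0, 55.0, 55.0, 57.0]
```

The second proposal lowers the score, so it is reverted and the best score stays at 55.0 until the third proposal improves on it.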
--eval-mode binary
When to use: Copy, prompts, UX, anything where quality is multi-dimensional and hard to reduce to one number.
Requirements: An evals file (markdown list of yes/no criteria) and a --batch-size (default 10).
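# Example evals.md:
1. Is the hook under 15 words?
2. Is there exactly one CTA?
3. Does it mention a specific result or number?
4. Is the total length under 150 words?
5. Does it address a specific pain point?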
How it works: For each iteration, generates batch-size outputs from the current file, scores each against all criteria, calculates overall pass % → agent proposes change → compare pass % → keep or revert.
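One plausible reading of "overall pass %" is the share of passed checks across every output and every criterion. The sketch below assumes that definition; the tool may aggregate differently, and `pass_rate` is a hypothetical helper name.

```python
from typing import List


def pass_rate(results: List[List[bool]]) -> float:
    """Percentage of passed checks over all outputs and all criteria."""
    total = sum(len(row) for row in results)
    passed = sum(sum(row) for row in results)  # True counts as 1
    return 100.0 * passed / total if total else 0.0


# Two generated outputs, each scored against five yes/no criteria.
batch = [
    [True, True, False, True, True],
    [True, False, False, True, True],
]
print(pass_rate(batch))  # 70.0
```

Under this definition a batch of 10 outputs and 5 criteria yields 50 individual checks per iteration.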
--budget N
Number of optimization iterations to run. Each iteration = one agent call + one eval.
| Budget | Time (approx) | Best for |
|---|---|---|
| 5 | ~2 min | Quick demo, sanity check |
| 10 | ~5 min | Initial optimization pass |
| 20 | ~10 min | Production runs |
| 50+ | ~30 min | Overnight deep optimization |
Minimum effective budget: 5 iterations. Below 5, not enough signal.
--goal minimize / --goal maximize (default: maximize)
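# Minimize (e.g. Brier score, error rate, latency):
--goal minimize --metric "python score_brier.py"
# Maximize (default; e.g. accuracy, pass rate, revenue):
--metric "python score_accuracy.py"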
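Brier score is a typical metric for `--goal minimize`: it is the mean squared difference between forecast probabilities and 0/1 outcomes, so lower is better. A minimal sketch follows; the function name and the sample data are illustrative, not taken from the tool.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Three forecasts and what actually happened (1 = event occurred).
score = brier_score([0.9, 0.2, 0.6], [1, 0, 1])
print(f"{score:.5f}")  # 0.07000
```

For reference, always forecasting 50% scores exactly 0.25, so values below 0.25 indicate some predictive skill.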
--session NAME
Name your session for organized results. Results saved to ./skills/auto-optimizer/results/NAME/.
--session outreach-v2-$(date +%Y%m%d)
--batch-size N (binary mode only)
How many outputs to generate per iteration for scoring. Higher = more reliable signal, slower.
- 5 = fast, less reliable
- 10 = balanced (default)
- 20 = slow, high confidence
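The reliability trade-off can be roughly quantified with the standard error of a proportion, sqrt(p(1-p)/n). The sketch below assumes each output is an independent pass/fail trial, which is a simplification of how the tool scores a batch; `pass_rate_std_error` is a hypothetical helper.

```python
import math


def pass_rate_std_error(p: float, n: int) -> float:
    """Standard error of an observed pass rate, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Noise around a true pass rate of 60% for the three suggested batch sizes.
for n in (5, 10, 20):
    print(f"batch size {n:>2}: +/- {pass_rate_std_error(0.6, n):.1f} points")
# batch size  5: +/- 21.9 points
# batch size 10: +/- 15.5 points
# batch size 20: +/- 11.0 points
```

Quadrupling the batch size halves the noise, so small gains in pass rate are hard to distinguish from chance at batch size 5.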
Hypothesis Memory
Every iteration is logged to results/SESSION/hypothesis_log.jsonl. The agent reads the last 5 entries before each iteration, so it can avoid retrying approaches that recently failed.
This is what makes multi-iteration runs productive rather than random. The optimizer builds on what worked, avoids what didn't.
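Reading the last few entries of a JSONL log is straightforward, since each line is one JSON object. The sketch below shows the idea; the field names (`iteration`, `hypothesis`, `kept`) are illustrative and not the tool's actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def append_entry(log_path: Path, entry: Dict) -> None:
    """Append one iteration record as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def recent_entries(log_path: Path, n: int = 5) -> List[Dict]:
    """Return the last n records, oldest first; empty if no log exists yet."""
    if not log_path.exists():
        return []
    text = log_path.read_text(encoding="utf-8")
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return [json.loads(ln) for ln in lines[-n:]]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "hypothesis_log.jsonl"
    for i in range(1, 8):
        # Hypothetical record shape, for illustration only.
        append_entry(log, {"iteration": i, "hypothesis": f"idea {i}", "kept": i % 2 == 1})
    recent = recent_entries(log)
    missing = recent_entries(Path(tmp) / "no_such_log.jsonl")

print([e["iteration"] for e in recent])  # [3, 4, 5, 6, 7]
```

Because only the tail of the log is read, entries older than the window (iterations 1 and 2 here) are no longer visible to the next iteration.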
OpenClaw Integration
The natural way (just tell your agent)
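Run auto-optimizer on ./outreach.md, optimize for reply rate, 20 iterations
Optimize my system prompt at ./prompts/classifier.md using binary eval mode
Start an overnight optimization loop on my prediction strategy, minimize Brier score, budget 50
Set up auto-optimizer for my cold outreach template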
Your OpenClaw agent reads this SKILL.md, picks the right mode, sets up the files, and runs the loop.
Direct invocation
CODEBLOCK17
Real Results (What to Expect)
From actual runs:
Prediction market strategy — 5 iterations:
- Brier score: 0.23642 → 0.23563 (↓ better)
- ROI: +2.83% → +6.25%
- What changed: added base rate anchoring, tightened confidence thresholds
Cold outreach template — 10 iterations:
- Binary pass rate: 60% → 85%
- Version: v1.0 → v1.1
- What changed: shorter hook (22 words → 11 words), single clear CTA, ICP-specific pain point added
System prompt — 20 iterations (binary):
- Pass rate: 40% → 92%
- What changed: added explicit persona, output format section, negative constraints, inline example
Typical pattern:
- Iterations 1-3: big gains (low-hanging fruit)
- Iterations 4-10: diminishing returns, more targeted changes
- Iterations 10+: fine-tuning, marginal improvements
Troubleshooting
"Not a git repo"
CODEBLOCK18
"Metric command failed" or returns 0 always
Your metric command must print a single float to stdout. Test it standalone:
bash ./your-metric.sh
# Should output: 73.5
If it outputs anything else (multiline, text, nothing), wrap it:
CODEBLOCK20
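When a metric script prints extra text, one way to reduce it to a single float is to extract the last number from its output. This is a sketch of that idea, not the tool's own wrapper; `last_float` is a hypothetical helper.

```python
import re

# Matches integers, decimals, and optional exponents such as 73.5 or 1e-3.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")


def last_float(output: str) -> float:
    """Return the last number found in noisy metric output."""
    matches = _NUMBER.findall(output)
    if not matches:
        raise ValueError("no number found in metric output")
    return float(matches[-1])


noisy = "Running 12 tests...\nscore: 73.5\n"
print(last_float(noisy))  # 73.5
```

This only works if the score is the final number printed. Output that ends with something like a timing figure would be misread, so the score should be printed last.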
"claude CLI not found"
Option A: Install claude CLI globally: npm install -g @anthropic-ai/claude-code
Option B: The script falls back to OpenClaw's claude-code skill automatically if skills/claude-code/claude-code.sh exists.
"ERROR: --evals is required for binary eval mode"
Binary mode needs an evals file with numbered criteria:
CODEBLOCK21
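A numbered-criteria file is easy to check before a run. The sketch below pulls numbered lines out of an evals file so you can confirm the criteria are picked up as expected; how the tool itself parses the file is not documented here, and `parse_evals` is a hypothetical helper.

```python
import re
from typing import List


def parse_evals(text: str) -> List[str]:
    """Collect the text of every line that starts with '1.', '2)', etc."""
    criteria = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.+)", line)
        if match:
            criteria.append(match.group(1).strip())
    return criteria


EVALS_MD = """\
# Example evals.md
1. Is the hook under 15 words?
2. Is there exactly one CTA?
3. Does it mention a specific result or number?
4. Is the total length under 150 words?
5. Does it address a specific pain point?
"""

criteria = parse_evals(EVALS_MD)
print(len(criteria), criteria[0])  # 5 Is the hook under 15 words?
```

An empty result from a check like this means the file has no numbered criteria, which is worth fixing before starting a binary-mode run.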
"Budget too low"
Minimum 5 iterations to see meaningful improvement. Use --budget 10 for first runs.
Results not improving after many iterations
- Check hypothesis log: INLINECODE20
- The agent may be stuck in a local optimum — try a new session with INLINECODE21
- Consider rewriting the evals/metric to be more discriminating
"program.md.template not found"
ls skills/auto-optimizer/
# Should show: auto-optimizer.sh, SKILL.md, program.md.template, results/
If missing, reinstall:
clawhub install auto-optimizer
MiroFish Integration
MiroFish is a swarm intelligence engine that runs thousands of AI agents to simulate outcomes and generate prediction reports. Combined with auto-optimizer, you can autonomously improve the "seed" inputs that drive MiroFish simulations.
What MiroFish does
- Takes a "seed" (market data, news, signals) as input
- Runs multi-agent social simulation in a digital world (Twitter + Reddit environments)
- Simulates distinct personas: RetailTrader, WhaleInvestor, AlgorithmicTrader, etc.
- Outputs agent actions, discussions, and quantitative probability estimates
The combination
- Mutable asset: the seed file / prediction prompt sent to MiroFish
- Scalar metric: confidence score OR prediction accuracy on known outcomes
- Loop: auto-optimizer iterates on the seed to maximize simulation confidence
Proven optimization results (2026-03-26 run)
| Iteration | Seed Content | Confidence Score | Delta |
|---|---|---|---|
| Baseline | Price + Fear/Greed only | 0.35 | — |
| Iter 1 | + Technical signals (TTM Squeeze, funding rates) | 0.58 | +0.23 ✅ |
| Iter 2 | + Whale on-chain data + price levels | 0.67 | +0.09 ✅ |
| Iter 3 | + Cross-asset correlation (BTC) | 0.61 | specific question |
Key insight: Adding technical structure data (funding rates, squeeze, key levels) produces the biggest confidence boost. Whale on-chain context is the #2 improvement driver.
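The deltas in the table follow directly from the confidence scores. The sketch below recomputes them and applies the keep-or-revert rule for a maximize goal. The scores are the ones reported above; the `deltas` helper and the keep/revert labels are illustrative.

```python
from typing import List, Tuple

runs: List[Tuple[str, float]] = [
    ("Baseline", 0.35),
    ("Iter 1", 0.58),
    ("Iter 2", 0.67),
    ("Iter 3", 0.61),
]


def deltas(rows: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Change in score relative to the previous row, rounded to 2 decimals."""
    return [
        (label, round(score - prev, 2))
        for (_, prev), (label, score) in zip(rows, rows[1:])
    ]


for label, delta in deltas(runs):
    verdict = "keep" if delta > 0 else "revert"
    print(f"{label}: {delta:+.2f} -> {verdict}")
# Iter 1: +0.23 -> keep
# Iter 2: +0.09 -> keep
# Iter 3: -0.06 -> revert
```

Since Iter 1 and Iter 2 were both improvements, the previous row is also the best score so far, so comparing against it matches what the loop would do.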
Setup
MiroFish must be running (backend on port 5001):
CODEBLOCK23
Build a seed from live market data:
CODEBLOCK24
Run a simulation manually via the API:
CODEBLOCK25
Scoring function for auto-optimizer
Extract confidence from MiroFish agent actions:
CODEBLOCK26
Use cases
- Crypto market predictions — BTC/ETH/SOL price direction (24-72h)
- Prediction market research — Polymarket/Kalshi question research
- Seed quality optimization — find which data signals drive the highest swarm confidence
- Any "what if" scenario you want to simulate at scale with diverse AI personas
Files
- Seed builder: INLINECODE23
- Results example: INLINECODE24
- MiroFish API: http://localhost:5001 (Flask backend, port 5001)
Auto-Optimizer
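Zero-friction optimization engine. Tell your OpenClaw agent "optimize my [thing]" and it runs autonomous improve-eval-keep/revert cycles until your metric goes up. Works on anything: copy, prompts, code, prediction strategies, configs.
Results in under 5 minutes. Zero configuration required for demos.
Quick Setup (AI Agent Instructions)
Follow these steps exactly. No judgment calls needed.
Step 1 — Install
clawhub install auto-optimizer
Step 2 — Initialize a git repo (required)
The optimizer uses git to revert bad changes. Must be in a git repo.
cd your-project
git init && git add . && git commit -m "baseline"
Step 3 — Pick your mode
Scalar mode — you have a command that outputs a single number (test score, Brier score, conversion rate, word count, etc.)
Binary mode — you want to evaluate quality with yes/no criteria (copy quality, prompt effectiveness, UX clarity)
Step 4 — Run it
Scalar mode:
./skills/auto-optimizer/auto-optimizer.sh \
  --file ./your-file.md \
  --metric "bash ./your-metric.sh" \
  --budget 10
Binary mode:
./skills/auto-optimizer/auto-optimizer.sh \
  --eval-mode binary \
  --file ./your-file.md \
  --evals ./your-evals.md \
  --batch-size 10 \
  --budget 10
Not sure? Use the wizard:
./skills/auto-optimizer/auto-optimizer.sh --wizard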
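Pre-Built Starter Packs
Three self-contained demos that run immediately. No files to create, no config needed.
Demo 1: Cold Outreach Optimizer
Optimizes a cold email template using a mock scoring metric (hook strength + clarity + CTA quality + length).
./skills/auto-optimizer/auto-optimizer.sh --demo outreach --budget 5
What it does:
- Creates a sample outreach template in /tmp/demo-outreach/outreach.md
- Runs a mock metric that scores: hook ≤15 words, single CTA, body ≤120 words, value prop clarity
- Iterates 5 times, keeps improvements, reverts regressions
- Prints a final report showing baseline → best score
Sample outreach template used:
Subject: Quick question about [Company]
Hi [Name],
I'm reaching out because I've been following [Company]'s work and think there could be a great opportunity for us to work together.
We help companies like yours improve their sales process with AI-powered outreach tools. Our clients typically see a 3x increase in reply rates within the first month.
Would you be open to a 15-minute call next week to explore whether this could be valuable for [Company]?
Looking forward to hearing from you,
[Your Name]
Mock metric logic (inline in demo):
# Scores 0-100 based on:
# - Hook length <= 15 words: +25
# - Single CTA (not multiple asks): +25
# - Body <= 120 words: +25
# - Contains specific value/number: +25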
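Demo 2: Prediction Market Strategy
Runs the optimization loop on a prediction strategy file, scoring by mock accuracy.
./skills/auto-optimizer/auto-optimizer.sh --demo prediction --budget 5
What it does:
- Creates a sample prediction strategy in /tmp/demo-prediction/strategy.md
- Runs a mock metric that scores: specificity of criteria, use of base rates, calibration language
- Shows how the loop works on structured reasoning files
Demo 3: Prompt Quality Optimizer (Binary Mode)
Optimizes a system prompt using 5 yes/no quality criteria.
./skills/auto-optimizer/auto-optimizer.sh --demo prompt --budget 5 --eval-mode binary
What it does:
- Creates a sample system prompt in /tmp/demo-prompt/system-prompt.md
- Evaluates each iteration against 5 criteria (inline):
1. Does the prompt specify a clear role/persona?
2. Does it include explicit output format instructions?
3. Does it define what NOT to do?
4. Is it under 500 words?
5. Does it include at least one concrete example?
- Batch size 10: generates 10 outputs, scores each, calculates pass rate %
- Keeps versions that increase the pass rate
Sample system prompt used:
You are a helpful assistant. Answer questions clearly and accurately.
Be concise but thorough. Help the user accomplish their goals.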
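Full Capability Guide
--wizard — Interactive Setup
Walks you through setup interactively. Best when you're not sure which mode to use.
./skills/auto-optimizer/auto-optimizer.sh --wizard
Prompts you to choose:
1. Cold outreach / email copy → sets up binary mode with outreach evals
2. LLM prompt / system prompt → sets up binary mode with prompt quality evals
3. Prediction market strategy → sets up scalar mode with accuracy metric
4. Code / config file → sets up scalar mode, prompts for your test command
5. Custom → asks for your file and metric
--eval-mode scalar (default)
When to use: Anything with a measurable number. Test pass rate, Brier score, word count, latency, revenue, API response score.
Requirements: Your --metric command must print a single float to stdout.
# Example metric commands:
--metric "python test_score.py"        # outputs: 0.847
--metric "bash run_eval.sh | tail -1"  # outputs: 73.2
--metric "node score.js"               # outputs: 0.91
How it works: Runs metric → agent proposes change → run metric again → if improved, commit; else git checkout to revert.
--eval-mode binary
When to use: Copy, prompts, UX, anything where quality is multi-dimensional and hard to reduce to one number.
Requirements: An evals file (markdown list of yes/no criteria) and a --batch-size (default 10).
# Example evals.md:
1. Is the hook under 15 words?
2. Is there exactly one CTA?
3. Does it mention a specific result or number?
4. Is the total length under 150 words?
5. Does it address a specific pain point?
How it works: For each iteration, generates batch-size outputs from the current file, scores each against all criteria, calculates overall pass % → agent proposes change → compare pass % → keep or revert.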
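--budget N
Number of optimization iterations to run. Each iteration = one agent call + one eval.
| Budget | Time (approx) | Best for |
|---|---|---|
| 5 | ~2 min | Quick demo, sanity check |
| 10 | ~5 min | Initial optimization pass |
| 20 | ~10 min | Production runs |
| 50+ | ~30 min | Overnight deep optimization |
Minimum effective budget: 5 iterations. Below 5, not enough signal.
--goal minimize / --goal maximize (default: maximize)
# Minimize (e.g. Brier score, error rate, latency):
--goal minimize --metric "python score_brier.py"
# Maximize (default; e.g. accuracy, pass rate, revenue):
--metric "python score_accuracy.py"
--session NAME
Name your session for organized results. Results saved to ./skills/auto-optimizer/results/NAME/.
--session outreach-v2-$(date +%Y%m%d)
--batch-size N (binary mode only)
How many outputs to generate per iteration for scoring. Higher = more reliable signal, slower.
- 5 = fast, less reliable
- 10 = balanced (default)
- 20 = slow, high confidence
Hypothesis Memory
Every iteration is logged to results/SESSION/hypothesis_log.jsonl. The agent reads the last 5 entries before each iteration, so it can avoid retrying approaches that recently failed.
This is what makes multi-iteration runs productive rather than random. The optimizer builds on what worked, avoids what didn't.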
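OpenClaw Integration
The natural way (just tell your agent)
Run auto-optimizer on ./outreach.md, optimize for reply rate, 20 iterations
Optimize my system prompt at ./prompts/classifier.md using binary eval mode
Start an overnight optimization loop on my prediction strategy, minimize Brier score, budget 50
Set up auto-optimizer for my cold outreach template
Your OpenClaw agent reads this SKILL.md, picks the right mode, sets up the files, and runs the loop.
Direct invocation
# Outreach optimization (binary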