Autoresearch Loop Skill

Karpathy's autoresearch methodology applied to improving Claude skills, n8n workflows, system prompts, and business processes.

Core idea: Define what "better" means. Lock everything except the artifact being improved. Propose a change → test → measure → keep or discard → repeat until a stopping condition is met.

When NOT to use this loop:

- You can't define a single measurable metric (e.g. "improve my writing style" is too subjective)
- The artifact is too large to evaluate cheaply in a fixed budget
- There's no fixed eval set (or you can't create one): without a stable yardstick, you're just guessing
- You need to improve two interdependent artifacts simultaneously; do them sequentially instead
- The artifact is a one-time document (a single client proposal, a one-off report). The loop is for artifacts that will be reused and improved over time. A one-time deliverable has no future eval value; just write it well directly

If you can't answer "what number tells me if this experiment worked?", stop and define that first.

The methodology is format-agnostic: The loop works for any artifact type — code, prompts, documents, design systems, API configurations, process specs — as long as you can define an artifact, a metric, and a repeatable eval. For novel artifact types not covered by the examples below: walk through the setup phase (artifact → metric → eval → budget) and creatively define each. A Figma component library's metric could be a checklist pass rate (accessibility, consistency, coverage); its eval could be test scenarios ("render a data table", "create a form with validation states") scored against that checklist. Start with a small eval (5–10 test cases) to validate the metric produces meaningful signal before committing to a full campaign.

Setup Phase

Before the loop starts, establish these five things with the user:

1. The Artifact (What You're Improving)

The single file, document, workflow, or process being iteratively modified. Think of this as train.py in Karpathy's repo — the one thing the agent edits.

Examples:

- A SKILL.md file
- An n8n workflow JSON
- A system prompt
- An SOP document
- A business process description

Fixed files: Identify what must NOT change — the evaluation criteria, input test cases, external integrations. These are your prepare.py.

Warm-starting from a related artifact: If a similar artifact already exists (e.g., a Barcelona property agent prompt when you need a Madrid one), start from it rather than from scratch — it inherits solved problems and gives a better baseline than an empty file. But: you must still run a proper baseline (iteration 0) on the new artifact with a new, context-appropriate eval set. Don't assume the old score transfers. Early experiments may show fast gains just from removing Barcelona-specific content before the real Madrid-specific improvements begin. Inherited debt: If more than ~50% of your early experiments are removing or reworking inherited content rather than adding new capability, the warm-start is creating more debt than value — consider restarting from scratch with the lessons learned (not the content) from the attempt.

Live production artifacts: If the artifact is currently serving real users (a live agent, a deployed workflow), never run the loop on the live version directly. Instead: (1) copy it to a working branch/file, (2) freeze the live version — no changes until the loop produces a winner, (3) run the loop on the copy, (4) do a deliberate controlled deploy of the winning version when ready. The metric can't catch production regressions in real-time; protect live users by keeping the loop sandboxed. Emergency exception: If production breaks critically during an active loop, fix the live version immediately — user safety trumps loop discipline. Then reconcile: apply the same fix to your sandboxed copy, re-eval to get a new current score, log the hotfix as an out-of-band experiment in results.tsv, and continue the loop from the updated state.

2. The Metric (What "Better" Means)

One clear, measurable signal that determines keep vs. discard. Lower or higher must unambiguously mean better.

Examples by artifact type:

Artifact	Good Metric
Claude skill	Pass rate on test prompts (0–100%)
System prompt

If you can't define a metric, you can't run the loop. Work with the user until there's one.

Building a composite metric — if you care about two dimensions (e.g., accuracy AND conciseness):

1. Score each dimension separately on the same eval set (e.g., accuracy: 0–1 per prompt, conciseness: 0–1 per prompt)
2. Define weights based on relative importance before the loop starts: INLINECODE3
3. The composite score is what goes in results.tsv: one number, decisive
4. Never adjust the weights mid-loop based on results; that changes the metric, which invalidates comparisons
5. Document the weights in the results.tsv header or a separate note so future sessions know what they're comparing against
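The composite calculation above can be sketched as follows. This is a minimal illustration, not the skill's own formula: the weights (0.7 and 0.3) and the names `WEIGHTS` and `composite_score` are assumptions chosen for the example.

```python
from statistics import mean

# Hypothetical weights, fixed before the loop starts and never changed mid-loop.
WEIGHTS = {"accuracy": 0.7, "conciseness": 0.3}


def composite_score(per_prompt_scores, weights=WEIGHTS):
    """Average each dimension over the eval set, then combine with fixed weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dimension_means = {
        name: mean(scores[name] for scores in per_prompt_scores)
        for name in weights
    }
    return sum(weights[name] * dimension_means[name] for name in weights)


# Two eval prompts, each scored 0-1 on both dimensions.
scores = [
    {"accuracy": 1.0, "conciseness": 0.5},
    {"accuracy": 0.0, "conciseness": 1.0},
]
print(round(composite_score(scores), 3))
```

Only the single composite number would go into results.tsv; the per-dimension means are useful for diagnosis but should not drive keep/discard.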

Multi-model artifacts — if the artifact must work across different models (e.g., Opus and Sonnet), ONE METRIC still applies. Options: (a) floor strategy — use the weaker model's score as the metric, ensuring the artifact works everywhere; (b) usage-weighted average — weight by actual usage distribution (e.g., 0.3 * opus_score + 0.7 * sonnet_score if most users are on Sonnet). Lock the model weights before the loop starts, same as composite metric rules. Do not run separate loops on the same artifact for different models — that creates conflicting optimization pressures.
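Both multi-model options reduce to a one-line calculation. The sketch below assumes per-model scores on a 0-1 scale; the function names and the sample scores are illustrative, while the 0.3/0.7 usage split mirrors the example in the text.

```python
def floor_score(scores_by_model):
    """Floor strategy: the weakest model's score is the metric."""
    return min(scores_by_model.values())


def usage_weighted_score(scores_by_model, usage_share):
    """Usage-weighted average: weight each model's score by its share of traffic."""
    if set(scores_by_model) != set(usage_share):
        raise ValueError("scores and usage shares must cover the same models")
    if abs(sum(usage_share.values()) - 1.0) > 1e-9:
        raise ValueError("usage shares must sum to 1")
    return sum(scores_by_model[m] * usage_share[m] for m in scores_by_model)


model_scores = {"opus": 0.9, "sonnet": 0.8}      # illustrative scores
usage = {"opus": 0.3, "sonnet": 0.7}             # locked before the loop starts

print(floor_score(model_scores))
print(round(usage_weighted_score(model_scores, usage), 3))
```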

3. The Budget (Experiment Scope)

What one experiment consists of. Keep it short — Karpathy uses 5 minutes per training run. Translate to your domain:

- Skill: run N test prompts through Claude (N = 5–20; use a fast subset for iteration, the full set before committing a keep on borderline results)
- Workflow: execute on M sample inputs
- Process: dry-run or peer review against checklist

What makes a good eval set:

- Diverse: covers all the main use cases of the artifact, not just the happy path
- Adversarial: includes inputs that should fail gracefully, edge cases, ambiguous inputs
- Stable: prompts that have clear, unambiguous pass/fail criteria; avoid prompts where "it depends"

If a prompt's criteria turn out ambiguous mid-loop: You cannot change the prompt (EVAL IS IMMUTABLE), but you CAN clarify the scoring rubric — the prompt text is fixed, but if the criteria were genuinely underspecified (e.g., "respond appropriately"), document a concrete interpretation now and apply it consistently for the rest of the session. Flag this prompt for replacement in the next session's eval set. Never define "pass" after seeing the output for that specific run.

- Representative: if the artifact handles 5 different scenarios, have prompts for each
- Large enough: with 6 to 10 prompts, one flip moves the score by 10–17 percentage points. That's noise, not signal. Require at least 10 prompts; with fewer, require 2+ prompt improvements (not 1) before keeping an experiment.
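The granularity arithmetic behind the "large enough" rule is simply 100 divided by the number of prompts. A small sketch, with hypothetical function names:

```python
def points_per_flip(n_prompts):
    """Percentage points the pass rate moves when one prompt flips."""
    return 100.0 / n_prompts


def min_improved_prompts(n_prompts):
    """Prompts that must improve before a keep: 2 for small evals, otherwise 1."""
    return 2 if n_prompts < 10 else 1


for n in (6, 10, 20):
    print(n, round(points_per_flip(n), 1), min_improved_prompts(n))
```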

A bad eval set (10 nearly identical prompts) will give you a misleadingly high score. If you improve from 60% to 80% but all 8 passing prompts are the same scenario, you've learned nothing about the other scenarios.

Eval difficulty imbalance: If some prompts are trivially easy (baseline passes them) and others are so hard no version has ever passed them, your effective discrimination range is narrower than the eval appears — locked passes and locked fails don't differentiate artifact versions. For the current round: continue as-is (EVAL IS IMMUTABLE) but apply statistical fragility rules to the effective prompt count, not the total. For the next round: replace trivially easy prompts with harder versions, and either make impossible prompts achievable (relax criteria) or remove them if they test beyond the artifact's scope.
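One way to estimate the effective prompt count is to drop every prompt whose result has never changed across versions. The sketch assumes you keep a per-prompt pass/fail history (one boolean per evaluated version); that data structure and the function name are assumptions, not part of results.tsv as described here.

```python
def discriminating_prompts(history):
    """Return ids of prompts whose pass/fail result differs across versions.

    history maps a prompt id to a list of booleans, one per evaluated version.
    Prompts that always pass or always fail are excluded.
    """
    return sorted(pid for pid, results in history.items() if len(set(results)) > 1)


history = {
    "P1": [True, True, True],      # locked pass
    "P2": [False, False, False],   # locked fail
    "P3": [False, True, True],
    "P4": [True, False, True],
}
effective = discriminating_prompts(history)
print(len(effective), "of", len(history), "prompts discriminate:", effective)
```

With only a baseline run recorded, no prompt can show a difference yet, so this check is only informative after a few experiments.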

Eval quality and designer bias — if every new artifact hits 100% within 2–3 sessions, your evals are probably too easy. The risk is amplified when the same person designs the eval and runs the loop — you may unconsciously write prompts you know the artifact can handle. Concrete safeguards: (a) write eval prompts BEFORE looking at the current artifact version — test what it should do, not what it does; (b) run the pre-loop BASELINE artifact against the eval — if it scores 70%+, the eval isn't discriminating enough (aim for 30–60% baselines on a reasonably good artifact); (c) have a second person review or contribute prompts, and try to break the "converged" artifact with new prompts not in your eval; (d) count happy-path vs. adversarial prompts — if >60% are happy path, rebalance; (e) include "red team" prompts and real-world failure cases from actual usage — they're unbiased by definition.

Eval-audience mismatch: If the eval was written by experts but the real users are non-experts (or vice versa), a high score means nothing — you've optimized for the wrong input distribution. Redesign the eval using actual user queries collected from production or user interviews. The eval must test how real users actually communicate, not how experts think they should.

Building an eval set from scratch — if none exists:

1. List every distinct use case and scenario the artifact is supposed to handle
2. For each scenario, write 1–2 prompts: one normal case, one edge/adversarial case
3. Aim for 10–20 prompts total: enough for meaningful signal, small enough to run fast
4. Write the pass/fail criteria for each prompt BEFORE running any experiments; don't define "pass" after seeing the output
5. If you have real historical inputs (e.g., past emails, past requests), use those as the foundation; they're more realistic than synthetic ones

Do not start the loop until the eval set is complete and criteria are written.

Evolving the eval set across sessions — how to make each round harder without adding random prompts:

1. After each session, review every prompt that passed easily (especially ones that passed from round 1). Ask: "What harder version of this question would break the current artifact?"
2. Find scenarios the artifact handles correctly but only barely, and probe the edge of what it knows
3. Add failure modes one step removed from current coverage: if it handles "email in Spanish," test "email mixing Spanish and Catalan"
4. Retire prompts that have become trivial; they no longer discriminate between good and bad versions
5. Each new eval set should feel noticeably harder than the last. If you can't design harder prompts, the artifact has genuinely converged
6. Regression suite: When retiring old prompts, keep 1–2 of the most critical from each round in a small, persistent "regression suite" that runs alongside (not instead of) each new round's eval. This prevents cross-round forgetting: a capability that passed in Round 2 can silently break in Round 6 if no current eval tests it. The regression suite grows slowly (aim for 5–10 prompts max) and acts as a guardrail, not a scoring mechanism; a regression failure is a red flag to investigate, not an automatic discard. If a regression prompt conflicts with the current round's eval (e.g., the artifact's guidance has evolved and the old expectation no longer applies), update or remove the conflicting regression prompt. The regression suite is a living guardrail, not an immutable archive
7. Holdout eval for campaign-level tracking: Since each round uses a different eval set, 100% in Round 7 ≠ 100% in Round 3; the scores aren't comparable. To measure campaign-level progress, maintain a separate holdout eval: 5–10 hard, representative prompts that never change across rounds. Run the holdout at the start and end of each round (in addition to the round's own eval). The holdout score IS comparable across rounds and shows real trajectory. Critically: do NOT use the holdout for keep/discard decisions within a round, because that would cause overfitting to it. It's purely a campaign-level progress indicator

Declining baselines across rounds are expected: If each round's eval is harder than the last, baselines will drop even as the artifact improves — that's the eval doing its job, not the artifact regressing. The holdout eval proves this: if holdout scores are rising or stable while round baselines drop, the artifact IS improving. Without a holdout: run the current artifact against an early round's eval — it should score far above the original baseline, confirming progress.

4. The Results Log

Create results.tsv (tab-separated) with these columns:

CODEBLOCK0

- iteration: sequential number
- INLINECODE7: the metric value (numeric)
- INLINECODE8: keep, discard, or INLINECODE11
- INLINECODE12: what this experiment tried (keep under 100 chars)

Example:
CODEBLOCK1
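Appending to the log can be done with the standard `csv` module. The sketch below is illustrative only: the column names (`iteration`, `score`, `status`, `description`), the `error` status value, and the sample rows are assumptions, and the demo writes to a temporary directory rather than a real results.tsv.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["iteration", "score", "status", "description"]  # assumed column names
VALID_STATUSES = {"keep", "discard", "error"}               # assumed status values


def append_result(path, iteration, score, status, description):
    """Append one experiment row, writing the header if the file is new."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    if len(description) >= 100:
        raise ValueError("description must stay under 100 characters")
    path = Path(path)
    is_new = not path.exists()
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow([iteration, score, status, description])


def read_results(path):
    """Load the log back as a list of dicts keyed by column name."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demo against a throwaway file with made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "results.tsv"
    append_result(log, 0, 0.60, "keep", "baseline, no changes")
    append_result(log, 1, 0.55, "discard", "added three edge-case examples")
    demo_rows = read_results(log)

print(demo_rows)
```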

5. Confirm and Go

Show the user the setup summary:

- Artifact: [path/name]
- Metric: [what you're measuring and direction]
- Budget: [what one experiment costs]
- Baseline: will be measured on first run

Get confirmation, then kick off the loop.

The Experiment Loop

Run the loop until a stopping condition is met (max iterations, time budget, or convergence):

CODEBLOCK2
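As a rough illustration of the loop's control flow (baseline, propose, evaluate, keep or discard, log), here is a minimal sketch. `propose_change` and `evaluate` are hypothetical callables you would supply; the toy versions below exist only to make the example runnable, and higher scores are assumed to be better.

```python
def run_loop(artifact, propose_change, evaluate, max_iterations=20):
    """Run a keep/discard loop and return (best_artifact, best_score, log).

    propose_change(artifact, log) -> (candidate, description)
    evaluate(artifact) -> float, higher is better
    """
    best = artifact
    best_score = evaluate(best)  # iteration 0: zero-change baseline
    log = [(0, best_score, "keep", "baseline")]
    for iteration in range(1, max_iterations + 1):
        candidate, description = propose_change(best, log)
        try:
            score = evaluate(candidate)
        except Exception as exc:  # evaluation errored: log it and move on
            log.append((iteration, None, "error", f"{description} ({exc})"))
            continue
        if score > best_score:   # strictly improved: keep
            best, best_score = candidate, score
            log.append((iteration, score, "keep", description))
        else:                    # equal or worse: discard, best is unchanged
            log.append((iteration, score, "discard", description))
    return best, best_score, log


# Toy stand-ins: the "artifact" is a string, the eval checks for required words.
REQUIRED = ["tone", "format", "escalation"]
IDEAS = ["tone", "filler", "format", "escalation"]


def toy_evaluate(text):
    return sum(word in text for word in REQUIRED) / len(REQUIRED)


def toy_propose(text, log):
    idea = IDEAS[(len(log) - 1) % len(IDEAS)]
    return text + " " + idea, f"append '{idea}'"


best, best_score, log = run_loop("", toy_propose, toy_evaluate, max_iterations=4)
for row in log:
    print(row)
```

Because a discarded candidate is never assigned to `best`, the revert is implicit here; with files on disk the revert has to be an explicit step.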

Step 2: Proposing a Change

Each experiment tests one hypothesis. A good hypothesis has three parts:

1. What you're changing (specific, not vague)
2. What you predict will happen (which prompts will now pass)
3. Why you expect that mechanism to work

❌ Bad: "change something" / "add more examples"
✅ Good: "If I add 3 concrete examples showing edge-case handling, adversarial prompts will start passing because the model currently lacks pattern context for those cases."

Experiment prioritization — run in this order for fastest signal:

1. Fix failures identified in the last eval run (highest signal, already know what broke)
2. Simplifications: remove something and retest (free wins, low risk)
3. Targeted additions for specific failing prompts
4. Restructuring the artifact
5. Radical rewrites (last resort after incremental plateau)

Diagnostic pass at low baselines: If the baseline is below ~50% and you have no prior experiment history, do NOT jump straight into changes. First, read every failing prompt carefully and categorize why it fails — missing concept, wrong structure, too vague, wrong tone, etc. Group the failures into 2–3 root cause buckets. Your first experiment should address the largest bucket. Experimenting without this diagnosis wastes iterations on symptoms rather than causes.

Escalation rule: If 8+ consecutive experiments discard on incremental changes, escalate to a radical rewrite of the artifact. One big structural change is still one experiment. If it discards, your backup brings you right back.
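The escalation check can be read straight off the status column of the log. In this sketch the status strings and the choice to skip errored runs (so they neither extend nor reset the streak) are assumptions.

```python
def trailing_discards(statuses):
    """Count consecutive discards at the end of the log, ignoring errored runs."""
    count = 0
    for status in reversed(statuses):
        if status == "error":
            continue  # assumption: an errored run neither extends nor resets the streak
        if status != "discard":
            break
        count += 1
    return count


def should_escalate(statuses, threshold=8):
    """True once the discard streak reaches the escalation threshold."""
    return trailing_discards(statuses) >= threshold


recent = ["keep"] + ["discard"] * 8
print(trailing_discards(recent), should_escalate(recent))
```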

When you're stuck — specific techniques beyond "try harder" for persistent failing prompts:

- Read the failing output character by character — is it missing information, wrong format, wrong reasoning, or right-answer-wrong-framing? The fix differs for each
- Try removal instead of addition: content elsewhere in the artifact may be causing the failure (conflicting guidance, misleading examples that pattern-match incorrectly)
- Study other artifacts that handle similar scenarios: what structural approach do they use?
- Ask: "What would a domain expert add if they read this failing output?" Then add exactly that, nothing more
- Check whether the failing prompts require knowledge the artifact fundamentally can't provide; if so, the eval may need adjustment next round, not the artifact

Campaign fatigue: Over many sessions, experiment quality degrades — ideas get repetitive, changes become trivial rewording. This is normal, not a personal failure. Rut-breakers: (a) take a break and return with fresh eyes — the loop continues across sessions, there's no rush; (b) have a different person run the next session — fresh perspective generates different hypotheses; (c) consult the cross-artifact learnings doc for patterns that worked elsewhere; (d) question whether the eval itself is the bottleneck — maybe the artifact is good enough and the eval needs redesigning, not the artifact; (e) try a radical structural rewrite rather than another incremental tweak.

Simplicity criterion (from Karpathy): All else equal, simpler is better.

- A 2% improvement with 30 new lines: probably not worth it
- A 2% improvement by deleting 10 lines: definitely keep
- 0% change but much cleaner: keep the simpler version

Artifact size creep: Over many rounds, each harder eval demands more content — the artifact grows monotonically even with simplification passes. Countermeasures: (1) set a hard size budget (max line/token count) at campaign start; when approaching it, the next experiment must be a compression refactor, not an addition; (2) prefer structural compression over line-by-line trimming — replace 5 specific examples with 1 generalized pattern, merge overlapping sections, extract repeated guidance into a shared rule; (3) if the artifact genuinely needs 400+ lines to pass a hard eval, it's likely too broad in scope — fork it. Size discipline is a long-campaign survival skill, not just an aesthetic preference.

Forking an artifact: When distinct scenarios within one artifact have diverged enough to need separate eval criteria (e.g., a "customer communication" skill that now covers email, Slack, phone, and escalation), split it: (a) identify natural boundaries — sections that serve different use cases, (b) create 2–3 new artifacts from the relevant sections, (c) create NEW eval sets for each fork — the old eval doesn't apply as-is since it tested the combined artifact, (d) run a fresh baseline (iteration 0) on each fork, (e) continue as independent loops with separate results.tsv files. The parent artifact's results.tsv and campaign history stay as historical record — the forks start clean.

Multi-stakeholder conflicts: If different stakeholders want conflicting things from the same artifact (e.g., sales wants persuasive language, support wants empathetic de-escalation), a single metric can't optimize for both — improving one hurts the other. Preferred solution: fork into separate artifacts, each with its own eval set tuned to that stakeholder's needs. If forking isn't feasible (must remain one artifact), use the composite metric approach with stakeholder-agreed weights defined before the loop starts — but expect a compromise artifact that underperforms dedicated ones.

Step 4: Running the Evaluation

Handling noisy metrics: LLM-based evaluations are non-deterministic — the same artifact can score 74% one run and 81% the next. Strategies:

- Multi-run averaging: Run the eval 3 times, take the mean. Require improvement > the noise floor (e.g., +5%) before keeping.
- Noise floor rule: If your runs vary by ±7%, a +3% improvement is meaningless because it's within noise. Only keep changes that beat the noise floor.
- Deterministic scoring: Use pass/fail rubrics with binary criteria where possible; they're less noisy than 1–5 scales.

If your eval is highly noisy, averaging 3 runs before deciding keep/discard is the minimum. Treat single-run scores with suspicion.
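A minimal sketch of multi-run averaging with a noise floor follows. How the floor is estimated is an assumption here (the largest deviation of any repeated run from the mean of those runs); the text only requires that a kept change beat whatever floor you measure.

```python
from statistics import mean


def noise_floor(repeat_runs):
    """Estimate noise as the largest deviation from the mean over repeated runs."""
    centre = mean(repeat_runs)
    return max(abs(run - centre) for run in repeat_runs)


def beats_noise(candidate_runs, best_runs, floor):
    """Keep only if the mean improvement is strictly larger than the noise floor."""
    return mean(candidate_runs) - mean(best_runs) > floor


best_runs = [74, 78, 76]            # three eval runs of the current best (percent)
floor = noise_floor(best_runs)

print(floor)
print(beats_noise([80, 82, 81], best_runs, floor))
print(beats_noise([77, 79, 78], best_runs, floor))
```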

LLM-as-judge: When human scoring is impractical (too many prompts, too slow), using an LLM evaluator is viable but introduces specific risks. Mitigate: (a) use binary pass/fail with very explicit, observable criteria — subjective rubrics ("respond appropriately") amplify LLM inconsistency; (b) fix the evaluator model for the entire loop — switching between Opus and Sonnet as judge mid-loop is a form of metric drift; (c) periodically spot-check a sample of LLM judgments against your own human scoring — if agreement drops below ~85%, the rubric needs tightening, not the judge; (d) score each prompt independently (separate calls or cleared context) — sequential evaluation in one conversation lets earlier prompts anchor scoring of later ones. If independent scoring isn't feasible, fix the prompt order and keep it consistent across all experiments so at least the bias is constant. LLM-as-judge adds noise; treat it like noisy metrics (multi-run averaging, noise floor rule) and don't trust single-run borderline results.
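The spot-check in (c) amounts to an agreement rate between your own labels and the judge's labels on the same sample. A small sketch, with hypothetical function names and made-up labels:

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of sampled prompts where the LLM judge matches the human label."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


def rubric_needs_tightening(human_labels, judge_labels, threshold=0.85):
    """Flag the rubric when agreement falls below the ~85% guideline."""
    return agreement_rate(human_labels, judge_labels) < threshold


human = [True] * 10                  # human pass/fail labels on a 10-prompt sample
judge = [True] * 8 + [False] * 2     # judge disagrees on two of them

print(agreement_rate(human, judge), rubric_needs_tightening(human, judge))
```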

Suspiciously large gains: A single experiment producing +30–40% improvement warrants re-running the eval before logging a keep. Normal experiments produce +5–15%; a large jump suggests either a fundamental structural fix (legitimate) or a measurement error (e.g., eval ran differently, wrong file loaded). Re-run once to confirm. If it reproduces, it's real — log and celebrate. If it doesn't, treat as noise.

For Claude Skills:
Take the test prompts provided (or generate them). For each prompt:

1. Load the skill SKILL.md into context
2. Follow the skill to complete the task
3. Score against the rubric (pass/fail or 0–100)
4. Average across all prompts = the metric

For n8n Workflows, System Prompts, Business Processes:
Adapt the same pattern: run the artifact against fixed inputs, score each against the rubric, average. For system prompts, use a fixed adversarial+normal prompt set; for SOPs, check against a requirements checklist.

Workflow eval specifics: The eval set is a collection of sample inputs with expected outputs — not prompts. For an email routing workflow: each test case = one sample email + the expected routing result + a pass/fail rule (e.g., "billing inquiry → Finance department"). Build 10–20 test cases covering: normal routing, ambiguous inputs (could go to two departments), malformed inputs (missing fields, bad encoding), and boundary cases. Run each through the workflow (actually execute in a test environment, or trace node-by-node if execution isn't feasible). Score = % of test cases correctly handled.
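For the email routing example, the eval reduces to a list of input/expected pairs and a percentage. In the sketch below `toy_route` is a stand-in for actually executing the workflow in a test environment, and the test cases and department names other than Finance are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingCase:
    email: str
    expected_department: str


def score_workflow(route, cases):
    """Return the percentage of test cases routed to the expected department."""
    passed = 0
    for case in cases:
        try:
            actual = route(case.email)
        except Exception:
            actual = None  # a crash on this input counts as a failed case
        if actual == case.expected_department:
            passed += 1
    return 100.0 * passed / len(cases)


def toy_route(email):
    """Hypothetical stand-in for running the real workflow on one email."""
    text = email.lower()
    if "invoice" in text or "billing" in text:
        return "Finance"
    if "password" in text:
        return "IT"
    return "Support"


CASES = [
    RoutingCase("Question about my billing statement", "Finance"),   # normal
    RoutingCase("I forgot my password", "IT"),                       # normal
    RoutingCase("", "Support"),                                      # malformed
    RoutingCase("Invoice attached, also my login is broken", "IT"),  # ambiguous
]

print(score_workflow(toy_route, CASES))
```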

Step 6: Keep or Discard

Condition	Action
Score strictly improved	KEEP: advance to next iteration from this version
Score equal or worse	DISCARD: revert artifact to previous version
Evaluation errored	ERROR: log, investigate, fix or skip

On discard: restore the artifact to its previous state before trying the next experiment.

Internal contradictions override the metric: If a kept experiment introduces conflicting instructions within the artifact (e.g., "always use bullets" in one section, "always use prose" in another), revert it — a higher score achieved through a broken artifact is not a genuine improvement. Fix the contradiction as its own targeted experiment, then re-run. Structural integrity of the artifact is a hard constraint that the metric alone cannot enforce.

Regression risk: The metric is only trustworthy if your eval set is well-designed. If you notice the artifact seems worse at something NOT in your eval set (e.g., it now handles Spanish inputs poorly), treat that as a red flag. You have two options: discard the change on principle, or note the gap and add a Spanish-language prompt to the NEXT session's eval set. The metric is king — but only if it measures the right things.

Partial improvement with in-eval regression: An experiment that improves 3 prompts but breaks 1, with net score increasing (e.g., 70% → 80%), is a KEEP per the metric. But: flag the regressed prompt as requiring immediate follow-up — make the NEXT experiment specifically target restoring it without losing the new gains. If the regressed prompt covers a critical real-world scenario, add it to the regression suite as a guardrail. Never ignore a regression just because net score went up.
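Spotting an in-eval regression requires comparing per-prompt results, not just the aggregate score. A sketch, assuming you record a pass/fail value per prompt id for each run (the ids and results below are made up):

```python
def compare_runs(before, after):
    """Return (improved, regressed) prompt ids between two per-prompt result maps."""
    if set(before) != set(after):
        raise ValueError("both runs must cover the same prompts")
    improved = sorted(p for p in before if not before[p] and after[p])
    regressed = sorted(p for p in before if before[p] and not after[p])
    return improved, regressed


before = {"P1": True, "P2": False, "P3": False, "P4": False, "P5": True}
after = {"P1": False, "P2": True, "P3": True, "P4": True, "P5": True}

improved, regressed = compare_runs(before, after)
print("improved:", improved)
print("regressed, target next:", regressed)
```

A non-empty `regressed` list on a kept experiment is the trigger for making the next experiment a targeted repair.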

Metric gaming / Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." After 20–30 experiments, high scores (95%+) on a fixed eval set may reflect overfitting to those specific prompts rather than genuine improvement. Signs: score is high but real-world performance feels flat. Fix: redesign the eval set for the next session with harder, more diverse prompts. The loop is only as good as its yardstick.

Eval isolation: If the artifact being improved can read the eval prompts during experiments (e.g., they're in the same folder and Claude loads both), you've created a structural overfitting risk — the artifact is being tuned to specific question wording, not underlying capability. Keep eval files in a separate location not loaded during artifact editing, or run scoring as a distinct step where the artifact doesn't see the questions. Eval criteria visible during scoring = fine. Eval questions visible during modification = problem.

Step 7: Logging + Version Control

Always log before the next iteration starts.

Backup pattern — after every KEEP, save a versioned copy:
CODEBLOCK3

Revert pattern — on a DISCARD, restore before the next experiment:
CODEBLOCK4

Verify revert completeness: After reverting, always confirm the file matches the backup before proceeding. If uncertain whether a discard was fully reverted: diff the current file against the backup (diff SKILL.md SKILL_v3.md), or run the eval immediately — if the score matches the last logged keep score, the state is clean. Do NOT continue on uncertain state. Contamination is insidious: subsequent experiments build on a corrupted baseline, making all future results unreliable.
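The backup, revert, and verify steps can be scripted with the standard library. This sketch assumes the numbered-copy naming seen above (SKILL.md backed up as SKILL_v3.md); the helper names are hypothetical and the demo runs in a temporary directory.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def backup_path(artifact, version):
    """Build the versioned file name, e.g. SKILL.md -> SKILL_v3.md."""
    artifact = Path(artifact)
    return artifact.with_name(f"{artifact.stem}_v{version}{artifact.suffix}")


def save_backup(artifact, version):
    """After a KEEP: copy the artifact to its versioned backup."""
    target = backup_path(artifact, version)
    shutil.copy2(artifact, target)
    return target


def revert(artifact, version):
    """After a DISCARD: restore the backup, then verify byte-for-byte equality."""
    source = backup_path(artifact, version)
    shutil.copy2(source, artifact)
    if not filecmp.cmp(artifact, source, shallow=False):
        raise RuntimeError("revert incomplete: artifact differs from backup")


with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp) / "SKILL.md"
    skill.write_text("version three\n", encoding="utf-8")
    backup = save_backup(skill, 3)                          # kept experiment
    skill.write_text("failed experiment\n", encoding="utf-8")
    revert(skill, 3)                                        # discarded experiment
    restored_text = skill.read_text(encoding="utf-8")
    backup_name = backup.name

print(backup_name, repr(restored_text))
```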

Log the discard entry in results.tsv BEFORE reverting, so the history is complete.

If you're using git: git stash or git checkout -- <file> works just as well. The key is that you always have a clean last-known-good to return to.

Version file cleanup: After many sessions you may accumulate dozens of numbered backups. You only need two at any time: the baseline (v0) and the current best. Once results.tsv documents what each version contained, intermediate files can be archived or deleted — the log is the history, not the files. If using git, commit after every KEEP and delete numbered copies entirely; the commit log replaces them.

Eval file management: Over many rounds you'll accumulate eval_round_1.md through eval_round_N.md plus regression_suite.md and holdout_eval.md. Operationally you need: the current round's eval (active), regression_suite.md (persistent), and holdout_eval.md (persistent). Old round evals can be archived; their value was consumed when results.tsv logged experiments against them. Keep them archived rather than deleted if storage allows; they document how the eval evolved and can inform future eval design.

If you forgot to save a backup: Check git history first (git log --oneline). If not tracked, read results.tsv to identify what the last kept version looked like from its description, then reconstruct those changes manually from the log. This is painful — it's why you back up immediately after every KEEP, before running the next experiment.

If results.tsv is corrupted or lost: The current best artifact is more important than the log — don't start over. Reconstruct what you can: diff the baseline (v0) against the current best to see the net changes across all sessions. Add a note in results.tsv marking the gap (e.g., "iterations 5–12 lost — see diff of v0 vs v12 for net changes"). Continue from the next iteration number. What you lose: research memory for the missing entries, which means you might re-try already-failed experiments and can't trace why specific content was added. What you keep: all artifact improvements already achieved. The log is valuable, but the artifact is the deliverable.

Rules

EVAL IS IMMUTABLE: Once the loop starts, the test set and scoring rubric cannot change. Adding, removing, or modifying eval prompts mid-loop invalidates all previous scores — you can no longer compare iterations fairly. If you discover a missing edge case, note it for the next session's eval set. Don't touch the current one.

Contradictory eval prompts: If you discover two prompts with conflicting pass criteria mid-loop (e.g., one requires bullets, another requires prose), you cannot fix them without invalidating the current session. Finish the session, note the contradiction, and redesign the eval for next session with it resolved. Log your scores honestly — they're partially meaningless against a contradictory eval, but the experiment history still has value. For future eval design: read every prompt pair for logical conflicts before starting a loop.

Eval scoring bug: If you discover you've been scoring a prompt incorrectly (e.g., passing it when the criteria say fail), that's distinct from an ambiguous prompt — it's a measurement error. Fix: (a) correct the scoring immediately going forward, (b) do NOT retroactively re-score past experiments — the history is what happened, (c) note the bug in results.tsv ("scoring error on P5 corrected at iteration 14"), (d) re-eval the current best artifact with correct scoring to get a true current score, then continue from there. EVAL IS IMMUTABLE protects prompt text and criteria — it doesn't prevent fixing genuine mistakes in how you apply those criteria.

Metric drift from external changes: If the underlying source of your metric changes between sessions — e.g., a compliance checklist is updated, an API spec changes, a rubric is revised — your old results.tsv scores are no longer comparable. Treat this like a new loop: archive the old results.tsv as historical record, run a fresh baseline (iteration 0) against the updated metric, and continue from there. Do not add experiments to the old log.

Input distribution drift: If the real-world inputs your artifact handles have changed significantly (e.g., your eval was built on emails from 6 months ago and client communication style has evolved), the eval set no longer represents reality — even if the scoring criteria haven't changed. Pause the loop, rebuild the eval set from recent real inputs, establish a new baseline. This is distinct from metric drift: the criteria are the same, but the test cases are stale. Same cure: fresh eval set, fresh session.

Rapidly evolving domains: When the knowledge the artifact encodes changes frequently (e.g., AI tool recommendations, regulatory guidance), evals go stale faster than normal. Adapt: (a) keep rounds short — 1–2 sessions max before refreshing the eval with current information; (b) separate timeless content (methodology, decision frameworks) from time-sensitive content (specific tool names, current pricing) in the artifact — optimize the timeless parts with the loop, update the time-sensitive parts outside it on a schedule; (c) accept higher eval maintenance cost as inherent to the domain, not a failure of the methodology.

Model version change: If the underlying model changes (e.g., Sonnet 3.5 → Sonnet 4) and your artifact's score drops, the artifact didn't break — its environment changed. Treat this as a new campaign context: the dropped score IS the new baseline, archive the old results.tsv as historical, and start a fresh log. Many prior improvements may re-apply, so check cross-artifact learnings and old results.tsv for warm-start ideas. If multi-model support is needed going forward, apply the multi-model evaluation guidance (floor strategy or weighted average) to prevent this from recurring.

NO SCOPE CREEP: The artifact's purpose cannot expand mid-loop. If you started improving a consulting SOW template and realize it should also cover retainer agreements, that's a scope change — not a content improvement. The existing eval set doesn't test the new scope, so any score change is meaningless. Finish the current loop for the original scope. Start a fresh loop with a new eval set for the expanded scope.

Format migration: Converting the artifact to a different format (e.g., Markdown → YAML, plaintext → JSON) is a format change, not a scope change — treat it as a single experiment within the current loop. Convert, re-eval with the same eval set and criteria. If score holds, keep (migration succeeded). If score drops, fix the conversion or discard. The results.tsv history remains valid and continuous — a format change is just another logged experiment.

BASELINE FIRST: Iteration 0 is always a zero-change run. Do not modify the artifact. Run the evaluation as-is to establish the baseline score. Every future experiment is compared against this number. Without a clean baseline, you have nothing to beat.

If the baseline looks broken (e.g., 10% on a reasonable eval set): do NOT start experimenting from there. Investigate first — is the eval set miscalibrated? Is the artifact fundamentally broken before you started? Is the scoring rubric too strict? Fix the root cause (artifact or eval), then re-run the baseline as a clean iteration 0. Experimenting on top of a broken baseline wastes every experiment on debugging rather than improvement.

Mid-session cadence: Once the loop starts, run experiments in sequence. Check in with the user periodically (e.g., every 10 iterations or when a significant result occurs) and always respect the stopping conditions (max iterations, time budget, convergence). If you run out of ideas, think harder — look at failed test cases, try the inverse of something that worked, try combining near-misses, escalate to a radical rewrite.

Stopping for the day is fine: When the human says they're done for the day: save the best artifact version, ensure results.tsv is up to date, note the top 2–3 experiment ideas to try next session. Resume tomorrow using the Resuming a Session guide. The loop continues across sessions — it doesn't reset.

Automated loops (CI/CD): Running the loop in a pipeline is valid when properly guarded. Essential guardrails: (a) hard stop conditions — max experiments per run, automatic halt on score regression below a threshold, (b) alerting on anomalies — suspicious large gains, consecutive errors, or unexpected score patterns, (c) periodic human review of results.tsv to catch drift or quality degradation, (d) never auto-deploy to production — the pipeline produces a candidate artifact, a human approves the deploy. The eval must be fully deterministic for unattended runs; LLM-as-judge noise without human oversight creates phantom improvements.

Principled stopping criteria: A loop can legitimately run its course. Consider stopping when ALL of these are true: (1) score ≥ 95% on a well-designed, diverse eval set, AND (2) 10+ consecutive discards across incremental, structural, and radical experiments, AND (3) you've already tried combining near-misses. At that point the eval set — not the artifact — is likely the bottleneck. Redesign the eval and start a fresh session.
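The three conditions above can be written as a single check. This is a minimal sketch, not a prescribed implementation: the `LoopState` fields and the `should_stop` name are hypothetical, and the score is assumed to be a fraction between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class LoopState:
    """Hypothetical snapshot of a loop; field names are illustrative assumptions."""
    best_score: float                  # best score so far, as a fraction in [0, 1]
    consecutive_discards: int          # discards since the last keep
    tried_incremental: bool            # at least one incremental experiment in that streak
    tried_structural: bool             # at least one structural experiment in that streak
    tried_radical: bool                # at least one radical rewrite in that streak
    tried_combining_near_misses: bool  # near-miss combinations already attempted


def should_stop(state: LoopState) -> bool:
    """Return True only when ALL of the stopping conditions hold."""
    covered_all_styles = (
        state.tried_incremental and state.tried_structural and state.tried_radical
    )
    return (
        state.best_score >= 0.95
        and state.consecutive_discards >= 10
        and covered_all_styles
        and state.tried_combining_near_misses
    )
```

A `True` result points at the eval set rather than the artifact as the next thing to redesign.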

The 100% score: If you genuinely hit 100% on a diverse, adversarial eval, run one simplification pass first — try removing sections to see if you can maintain 100% with less content. If simplification discards, you've confirmed the artifact needs everything it has. Then stop: you've converged. 100% is only meaningful if the eval was hard; a shallow eval gives false 100%s. But: if you're at 100% and know of real-world gaps the eval doesn't cover, the eval has been outgrown — don't ship, start a new round immediately using those real-world failures as the new eval. A 100% score on an outgrown eval is not convergence; it's a signal to level up the eval.

The good-enough decision: At high scores (90%+), apply a cost/benefit lens. If the remaining failing prompts cover rare edge cases that almost never appear in practice, 92% may be genuinely good enough to ship — don't optimize for the last 8% if it has no real-world impact. If the failing prompts cover common scenarios, keep going. The loop is a tool, not a religion.

When the baseline starts high (95%+): The artifact may already be at or near its ceiling before the loop begins. Run 3–5 simplification experiments first — can you achieve the same score with less content? If those discard, try 2–3 targeted improvements on failing prompts. If those also discard, the loop has served its purpose in one session: the artifact was already good, simplification didn't help, and there's no obvious improvement to find. Ship it.

Experiment cost: Each experiment has a real time cost. If experiments are expensive (30+ minutes each), apply a tighter diminishing-returns threshold — don't run 10 more experiments chasing a 2% gain when that time is worth more elsewhere. After 3–4 consecutive discards at high scores, consider pausing, stepping back, and either redesigning the eval set or declaring good enough. Cheap experiments (5 minutes) warrant more persistence; expensive ones demand more selectivity.

Campaign-level cost tracking: Over a multi-month campaign, track cumulative cost (tokens, sessions, hours) alongside cumulative metric gains in a campaign log or results.tsv header. Plot cost vs. gain per artifact: early sessions show steep improvement; later sessions flatten. When the curve flattens — each session costs the same but produces smaller gains — the loop has delivered its value and continuing has negative ROI. This connects to retirement guidance: an artifact is ready to ship not just when the eval ceiling is reached, but when the next session's expected gain no longer justifies its cost.
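One way to spot the flattening curve is to compute gain per unit of cost for each session. The sketch below assumes scores in percentage points and cost in hours; the function names, the two-session window, and the threshold are illustrative assumptions, not part of the methodology.

```python
def gain_per_cost(baseline: float, sessions: list[tuple[float, float]]) -> list[float]:
    """sessions holds (cost, best_score_after_session) pairs in chronological order.

    Returns the metric gain per unit of cost for each session.
    """
    ratios = []
    previous = baseline
    for cost, score in sessions:
        ratios.append((score - previous) / cost)
        previous = score
    return ratios


def has_flattened(ratios: list[float], threshold: float) -> bool:
    """True when the last two sessions both gained less than `threshold` per unit cost."""
    return len(ratios) >= 2 and all(r < threshold for r in ratios[-2:])
```

With a hypothetical campaign of four two-hour sessions, `gain_per_cost(60.0, [(2.0, 72.0), (2.0, 78.0), (2.0, 79.0), (2.0, 79.0)])` gives a steadily shrinking ratio, and `has_flattened` reports when the last two sessions fall below whatever threshold you set in advance.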

ONE CHANGE AT A TIME: Never bundle two untested hypotheses in one experiment. You won't know which one caused the result.

Exception (combining near-misses): If change A was tested and discarded (0% improvement) AND change B was tested and discarded (0% improvement), you may combine them in one experiment. The logic: each change has already been tested in isolation and had no effect on its own, so neither is an untested hypothesis, and any gain from the combination can only come from the two working together. If the combo works, you accept not knowing which part drove it; that's fine. If it discards, revert to best-so-far as usual.

Experiment dependencies: If a new idea requires a previously-discarded experiment as a prerequisite, you cannot bundle them — the artifact has changed since the original discard, invalidating that result. Correct approach: re-run the discarded experiment as a fresh standalone experiment first. If it now keeps (conditions may have changed), then propose the dependent idea as the next experiment. If it discards again, the dependent idea isn't viable. Two experiments, not one.

ONE METRIC: Pick exactly one metric before the loop starts and never change it. Two metrics create unsolvable keep/discard decisions — if one improves and one degrades, you're stuck. If you care about multiple dimensions, define a single weighted composite score before starting, not mid-loop.

When human intuition conflicts with the metric: If the metric says discard but you prefer the result, the metric wins — overriding it mid-loop breaks the entire comparison system. The right response: discard per the metric, then note your intuition as a hypothesis ("felt clearer") and design a future experiment that specifically targets that dimension. If your intuition is consistently right and the metric keeps being wrong, the metric is poorly designed — redesign it for next session, don't override it now.

Unexplained keeps: If an experiment improves the score but you can't explain why (e.g., reordering a section produced a +13% jump), keep it — the metric decides, and positional/ordering effects are real. But: note the uncertainty in results.tsv ("mechanism unclear") so future experimenters know this was structural, not content-driven. Tread carefully in subsequent experiments — unexplained keeps can create fragile states where small changes cause unexpected regressions.

ONE ARTIFACT: The loop improves exactly one artifact per session. Two artifacts (e.g., a system prompt AND the workflow that calls it) means you can't isolate what caused a score change. Finish one loop, ship it, then start a fresh loop for the next artifact.

Artifact dependencies: If an artifact depends on another (e.g., a routing skill fed by a language detection skill), and performance is bad — isolate which one is broken before starting any loop. Test the upstream artifact in isolation first. Starting a loop on the downstream artifact while its inputs are broken means you're optimizing against corrupted data. Fix root causes before iterating.

Parallel loops are fine: One artifact per loop, but you can run multiple loops simultaneously on different artifacts with separate eval sets and separate results.tsv files. The constraint is isolation within a loop, not globally.

Same-artifact parallel branching: Running two experiments from the same baseline and picking the winner is valid — if you fully discard the loser and continue from the winner only. What's not valid: merging both into one artifact ("take the best of both"). That's bundling two untested changes and you lose isolation. Pick one winner, discard the other entirely, continue.

TRACK LINEAGE: Each experiment builds on the best-so-far, not the original baseline. You are walking uphill.
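The lineage rule amounts to simple hill climbing. The sketch below is an illustration under assumptions: `run_loop`, `proposals`, and `evaluate` are hypothetical names, the artifact is treated as a plain string, and only strict improvements are kept (keeping a simpler artifact at an equal score is not modeled).

```python
from typing import Callable


def run_loop(
    baseline: str,
    proposals: list[Callable[[str], str]],
    evaluate: Callable[[str], float],
) -> tuple[str, float, list[str]]:
    """Apply each proposal to the best-so-far artifact and keep it only if it scores higher."""
    best = baseline
    best_score = evaluate(baseline)  # iteration 0: zero-change baseline run
    decisions = ["baseline"]
    for propose in proposals:
        candidate = propose(best)    # always branch from best-so-far, never the original
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
            decisions.append("keep")
        else:
            decisions.append("discard")  # best is untouched, so the revert is implicit
    return best, best_score, decisions
```

Because every candidate is derived from `best`, a discarded experiment never leaks into later ones.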

TIMEOUT/ERROR HANDLING: If an evaluation errors or produces unusable results, log it as error, investigate once, fix if trivial, skip if not. Don't spend more than 2 attempts on a broken experiment before moving on.

LOG EVERYTHING: Even bad experiments. Especially bad experiments. The history is your research memory.
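A small helper makes it harder to skip logging a bad experiment. This is a sketch only: the column names in `FIELDS` and the `log_experiment` signature are assumptions for illustration and may not match your actual results.tsv layout.

```python
import csv
from pathlib import Path

# Assumed column layout; adjust to match your own results.tsv.
FIELDS = ["iteration", "description", "score", "status"]


def log_experiment(path, iteration, description, score, status):
    """Append one experiment row to a tab-separated log, writing the header if the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
        )
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "iteration": iteration,
                "description": description,
                "score": score,  # None is written as an empty cell, e.g. for errors
                "status": status,  # e.g. "keep", "discard", or "error"
            }
        )
```

Calling it for every iteration, including discards and errors, keeps the history complete.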

Resuming a Session

When you come back to a loop after a break:

1. Read results.tsv: understand exactly where the loop left off (last iteration number, best score achieved, what was kept vs discarded)
2. Load the current best artifact, NOT the baseline. Find the last keep entry in results.tsv and load the corresponding versioned file (e.g., SKILL_v5.md). That is your starting point.
3. Do NOT re-run the baseline: it's already logged. Wasting an experiment re-running it is a mistake.
4. Continue the loop: propose the next experiment from where you left off, incrementing the iteration counter

The results.tsv is your research memory. It tells you what was tried, what worked, and what didn't. Read it before every session, not just the first one.
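The resume steps can be sketched as a small parser over the log. The column names (`iteration`, `score`, `status`) and the `resume_point` function are assumptions for illustration; the sketch relies on the last keep being the best-so-far, which holds when every experiment builds on the previous keep.

```python
import csv
import io


def resume_point(tsv_text: str) -> dict:
    """Find where a loop left off, given the text of a tab-separated results log."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    keeps = [row for row in rows if row["status"] == "keep"]
    if not keeps:
        raise ValueError("no keep entries found; the baseline should be logged first")
    last_keep = keeps[-1]
    return {
        "next_iteration": int(rows[-1]["iteration"]) + 1,  # continue the counter
        "best_iteration": int(last_keep["iteration"]),     # artifact version to load
        "best_score": float(last_keep["score"]),           # the score to beat
    }
```

The returned `best_iteration` tells you which versioned artifact file to load before proposing the next experiment.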

Handing off to another person: If someone else is taking over the loop (e.g., a colleague picking up mid-run), give them:

1. results.tsv: the full experiment history
2. The current best artifact version (last keep entry)
3. The fixed eval set + scoring rubric (unchanged)
4. A short note: current best score, top 2–3 ideas not yet tried, any patterns observed in failures

They continue from the next iteration number — no re-running baseline, no restarting from scratch.

Async multi-operator loops: When multiple people take turns running sessions on the same artifact (e.g., Alice Monday, Bob Wednesday), the handoff protocol above applies at every transition — but async operation requires stricter discipline because you can't verify state in real-time. Extra rules: (1) verify before starting: each operator must diff the current artifact against the last logged keep's backup before running any experiment — a previous operator's incomplete discard creates silent corruption; (2) single-operator lock: only one person runs experiments at a time, never concurrently; use a shared signal (Slack message, lock file, git branch) to indicate "loop in progress"; (3) use git: with multiple operators, file-based backups are fragile — use git branches, commit after every keep, and require clean working state before starting. The core risk is shared-state corruption; the cure is verification at every transition.

Unauthorized out-of-loop changes: If someone edits the artifact directly without going through the loop (e.g., a teammate adds a section because a customer asked), the artifact and results.tsv are out of sync. Recovery: (a) diff against the last logged keep's backup to see exactly what changed, (b) run the eval on the current (modified) artifact — if score improved or held, consider keeping the change; if it dropped, revert to backup, (c) log the event in results.tsv as an out-of-band change with a note, (d) establish a governance rule: all artifact changes go through the loop, or at minimum get logged and evaluated.
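Both the async verification step and the out-of-band recovery start with a diff against the last logged keep. A minimal sketch using Python's standard `difflib`, assuming the artifact and its backup are available as text; `drift_from_last_keep` is a hypothetical helper name.

```python
import difflib


def drift_from_last_keep(current_text: str, last_keep_text: str) -> list[str]:
    """Return unified-diff lines from the last kept version to the current artifact.

    An empty list means the working artifact matches the backup and it is safe to start.
    """
    return list(
        difflib.unified_diff(
            last_keep_text.splitlines(),
            current_text.splitlines(),
            fromfile="last_keep",
            tofile="current",
            lineterm="",
        )
    )
```

A non-empty result is the cue to stop and reconcile (evaluate, log as out-of-band, or revert) before running any new experiment. With git, `git diff` against the last keep commit serves the same purpose.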

Applying to Specific Artifact Types

Quick reference for good experiment ideas by type:

Claude Skills: add concrete examples, add "when not to use" section, strengthen trigger language, add edge case handling, reorder sections for clarity, add quick-reference tables

n8n Workflows: simplify multi-step logic into fewer nodes, add error handling branches, fix expression syntax, improve routing conditions, add validation before expensive operations

System Prompts: tighten instruction specificity, add formatting constraints, add failure mode handling, add few-shot examples, remove contradictory instructions

Business Processes / SOPs / Team Workflows: eliminate redundant steps, add decision trees for edge cases, add rollback/error procedures, clarify ownership, add measurable completion criteria. When the "artifact" has no single file (team workflows where people are the execution layer), the artifact becomes the process documentation (SOP, RACI, checklist) and the eval becomes structured simulation against case studies — the loop is identical, experiments just take longer.

Output at the End of a Session

When the user interrupts or the session ends, produce a Research Summary:

CODEBLOCK5

Delivering to a client when the loop has converged: If the loop is genuinely done (high score, consecutive discards, tried everything), the deliverable package is: the best artifact version + results.tsv + eval set. Frame it professionally: "We ran N experiments over X sessions, improved from Y% to Z% on a [diverse/adversarial] eval set, and reached a performance ceiling after [M] consecutive non-improving experiments. The artifact is production-ready." The results.tsv is your audit trail — it shows rigorous, evidence-based iteration, not guesswork.

Multiple versions in production: If different clients froze at different artifact versions, the loop continues on the latest — don't maintain separate loops per client version. When an old version has a bug: fix it on the latest first, then offer migration rather than patching the old version. If migration isn't feasible, apply a targeted fix as an out-of-band change to the old version. Establish a versioning policy: either all clients track the latest (simplest) or maintain explicit version branches with their own regression suites.

Campaign documentation for teams: When multiple people run loops across many artifacts, organize for institutional knowledge: (a) per-artifact folder containing current best artifact, results.tsv, active eval set, regression suite, holdout eval; (b) campaign index (a README) listing all artifacts under active improvement — their status (active/retired/shipped), last session date, current best score; (c) cross-artifact learnings doc as a shared team resource; (d) runbook: "how to pick up any artifact's loop" referencing the handoff protocol. Results.tsv is the documentation for individual loops; the campaign index and learnings doc are the documentation for the practice itself.

Example: Improving a Claude Skill Overnight

Setup:

- Artifact: INLINECODE22
- Test prompts: 20 real-world n8n questions
- Metric: % answered correctly by Claude using only the skill
- Baseline: run all 20 prompts → score first

Typical first-session trajectory:

1. Baseline: 60% → keep
2. Add concrete workflow examples: 72% → keep
3. Add "common mistakes" section: 65% → discard
4. Restructure by workflow type: 78% → keep
5. Tighten description trigger language: 80% → keep
6. Add error handling patterns: 83% → keep
7. Remove redundant preamble: 83% → keep (simpler)

...

Wake up to a skill that went from 60% to 83%+ with a full research log.

Cross-artifact learning: After running multiple loops on similar artifact types (e.g., several SKILL.md files), common improvement patterns emerge — add examples, tighten trigger language, add edge case handling. Maintain a living cross-artifact learnings doc: a list of "experiments that improved multiple artifacts of this type." Use it as the warm-start hypothesis list for each new loop of the same type — this can cut early-session iteration counts dramatically. The artifact warm-starts from a related file; the experiment ideas warm-start from prior loop history.

Meta use — improving this skill with itself: Yes, this is valid and encouraged. The autoresearch-loop skill itself was built and improved using exactly this loop across multiple sessions. Artifact = SKILL.md, metric = pass rate on test prompts that evaluate guidance quality, eval set = prompts covering real edge cases users hit. The only twist: the eval set should test guidance quality (does the skill tell Claude what to do correctly?) not just triggering.

When to retire an artifact from active improvement: The loop is not a continuous obligation. Retire it when: (a) the artifact handles all real-world use cases reliably in practice, and (b) you genuinely struggle to design harder eval prompts because there are no obvious gap scenarios left. At that point, don't run more sessions — ship it and maintain it. Resume a loop session only when real-world usage reveals a new failure mode worth addressing. The goal is a useful artifact, not a perfect score on an ever-harder eval set.

When to abandon a loop entirely: Abandonment is distinct from retirement. Retirement means the artifact is good; abandonment means the approach isn't working. Signals: (a) multiple rounds with redesigned evals and the score plateau persists below a useful threshold, (b) root cause analysis keeps pointing to the same fundamental structural problem that incremental experiments can't fix, (c) failing prompts reveal the artifact's format or scope is wrong for the problem — e.g., a static FAQ that keeps failing because answers require human judgment or system access. When you hit these signals, stop iterating and step back to the design level: does this problem need a different artifact entirely? The loop is also a discovery tool — sometimes what you discover is that you need a fundamentally different approach, not a better version of this one.

Post-deployment feedback loop: When a shipped artifact starts failing on new real-world scenarios, that's the signal to resume — not start over. Collect the specific failure cases from production (they're the highest-quality eval inputs — real, not synthetic). Use them as the foundation for the next round's eval set. Before starting experiments, run the regression suite and holdout eval to check whether old capabilities still hold — if they do, the failures are in uncovered territory, not regressions. Then run a normal round with the failure-informed eval. This is the artifact lifecycle: build → loop → ship → monitor → resume when needed.

Methodology: github.com/karpathy/autoresearch

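Autoresearch Loop Skill

Karpathy's autoresearch methodology applied to improving Claude skills, n8n workflows, system prompts, and business processes.

Core idea: Define what "better" means. Lock everything except the artifact being improved. Propose a change → test → measure → keep or discard → repeat until a stopping condition is met.

When NOT to use this loop:

- You can't define a single measurable metric (e.g. "improve my writing style" is too subjective)
- The artifact is too large to evaluate cheaply in a fixed budget
- There's no fixed eval set (or you can't create one): without a stable yardstick, you're just guessing
- You need to improve two interdependent artifacts simultaneously: do them sequentially instead
- The artifact is a one-time document (a single client proposal, a one-off report): the loop is for artifacts that will be reused and improved over time. A one-time deliverable has no future eval value; just write it well directly

If you can't answer "what number tells me if this experiment worked?", stop and define that first.

The methodology is format-agnostic: The loop works for any artifact type (code, prompts, documents, design systems, API configurations, process specs) as long as you can define an artifact, a metric, and a repeatable eval. For novel artifact types not covered by the examples below: walk through the setup phase (artifact → metric → eval → budget) and creatively define each. A Figma component library's metric could be a checklist pass rate (accessibility, consistency, coverage); its eval could be test scenarios ("render a data table", "create a form with validation states") scored against that checklist. Start with a small eval (5–10 test cases) to validate the metric produces meaningful signal before committing to a full campaign.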

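Setup Phase

Before the loop starts, establish these five things with the user:

1. The Artifact (What You're Improving)

The single file, document, workflow, or process being iteratively modified. Think of this as train.py in Karpathy's repo: the one thing the agent edits.

Examples:

- A SKILL.md file
- An n8n workflow JSON
- A system prompt
- An SOP document
- A business process description

Fixed files: Identify what must NOT change: evaluation criteria, input test cases, external integrations. These are your prepare.py.

Warm-starting from a related artifact: If a similar artifact already exists (e.g., you have a Barcelona real estate agent prompt and need one for Madrid), start from it rather than from scratch. It inherits already-solved problems and gives a better baseline than an empty file. But: you must still run a proper baseline (iteration 0) on the new artifact with a new, context-appropriate eval set. Don't assume the old score transfers. Early experiments may show quick gains just from removing Barcelona-specific content, before real Madrid-specific improvement begins. Inherited debt: if more than ~50% of your early experiments are removing or reworking inherited content rather than adding new capability, the warm start is creating more debt than value; consider starting over from the lessons learned in the attempt (not from its content).

Live artifacts in production: If the artifact is currently serving real users (a live agent, a deployed workflow), never run the loop directly on the live version. Instead: (1) copy it to a working branch/file, (2) freeze the live version and make no changes to it until the loop produces a winner, (3) run the loop on the copy, (4) when ready, deploy the winner deliberately. The metric can't catch production regressions in real time; protect live users by keeping the loop sandboxed. Emergency exception: if production breaks badly during an active loop, fix the live version immediately; user safety outranks loop discipline. Then reconcile: apply the same fix to your sandbox copy, re-evaluate to get the new current score, log the hotfix in results.tsv as an out-of-band experiment, and continue the loop from the updated state.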

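2. The Metric (What "Better" Means)

A single clear, measurable signal used to decide keep or discard. It must be unambiguous whether a rise or a fall in the metric means better.

Examples by artifact type:

Artifact	Good metric
Claude skill	Pass rate on test prompts (0–100%)
System prompt

If you can't define a metric, you can't run the loop. Work with the user until there is one.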

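Building a composite metric, if you care about two dimensions (e.g., accuracy and conciseness):

1. Score each dimension separately on the same eval set (e.g., accuracy: 0–1 per prompt, conciseness: 0–1 per prompt)
2. Define weights by relative importance before the loop starts: score = 0.7 × accuracy + 0.3 × conciseness
3. The composite score is what goes into results.tsv: one number, decisive
4. Never adjust the weights mid-loop based on results; that changes the metric and invalidates comparisons
5. Record the weights in the results.tsv header or a separate note so future sessions know what they are comparing against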
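The composite described in the steps above can be computed as follows. This is a sketch under assumptions: `WEIGHTS` and `composite_score` are hypothetical names, each prompt is assumed to carry a 0–1 score per dimension, and the composite is taken as the mean over prompts.

```python
# Weights are fixed before the loop starts and never changed mid-loop.
WEIGHTS = {"accuracy": 0.7, "conciseness": 0.3}


def composite_score(per_prompt: list[dict[str, float]], weights: dict[str, float] = WEIGHTS) -> float:
    """Average the weighted per-prompt scores into the single number logged in results.tsv."""
    total = 0.0
    for scores in per_prompt:
        total += sum(weights[dim] * scores[dim] for dim in weights)
    return total / len(per_prompt)
```

For two prompts scored `{"accuracy": 1.0, "conciseness": 0.0}` and `{"accuracy": 1.0, "conciseness": 1.0}`, the per-prompt composites are 0.7 and 1.0, so the logged score is their mean.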

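Multi-model artifacts: If the artifact must work on different models (e.g., Opus and Sonnet), one metric still applies. Options: (a) floor strategy: use the weaker model's score as the metric, ensuring the artifact works everywhere; (b) usage-weighted average: weight by actual usage distribution (e.g., 0.3 × Opus score + 0.7 × Sonnet score if most users are on Sonnet). Lock the model weights before the loop starts, under the same rules as a composite metric. Don't run separate loops on the same artifact for different models; that creates conflicting optimization pressure.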
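Both options reduce per-model scores to one number. In this sketch the function names and dictionary keys are illustrative assumptions, and the floor strategy is read as "the lowest score across models".

```python
def floor_metric(scores: dict[str, float]) -> float:
    """Floor strategy: the metric is the lowest per-model score."""
    return min(scores.values())


def usage_weighted_metric(scores: dict[str, float], usage_share: dict[str, float]) -> float:
    """Usage-weighted average; usage_share should sum to 1 and is locked before the loop."""
    return sum(usage_share[model] * scores[model] for model in scores)
```

With hypothetical scores of 0.9 on Opus and 0.8 on Sonnet, the floor metric follows Sonnet, while a 0.3/0.7 usage split blends the two. Whichever option you pick, it is the only number that drives keep/discard.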

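3. The Budget (Experiment Scope)

What one experiment consists of. Keep it short: Karpathy uses 5 minutes per training run. Translate that to your domain:

- Skills: run N test prompts through Claude (N = 5–20; use a fast subset while iterating, and the full set before committing to keep a borderline result)
- Workflows: execute on M sample inputs
- Processes: a simulated run or peer review against a checklist

What a good eval set needs:

- Diverse: covers all of the artifact's main use cases, not just the happy path
- Adversarial: includes inputs that should fail gracefully, edge cases, and ambiguous inputs
- Stable: prompts with clear, unambiguous pass/fail criteria; avoid "it depends" prompts

If a prompt's criteria turn out to be ambiguous mid-loop: you can't change the prompt (the eval set is immutable), but you can clarify the scoring criteria. The prompt text is fixed, but if the criterion is genuinely underspecified (e.g., "responds appropriately"), document one concrete interpretation now and apply it consistently for the rest of the session. Flag the prompt for replacement in the next session's eval set. Never define "pass" after seeing the output of that particular run.

- Representative: if the artifact handles 5 different scenarios, have prompts for each
- Large enough: with fewer than 10 prompts, one flip = 10–17 percentage points. That's noise, not signal. Require at least 10 prompts; if you have fewer, require an improvement of 2+ prompts (not 1) before keeping an experiment.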
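The arithmetic behind the "large enough" rule is simply 100 divided by the number of prompts. A small sketch, with hypothetical function names, of the flip size and the stricter keep rule for small eval sets:

```python
def points_per_flip(n_prompts: int) -> float:
    """Percentage points the score moves when a single prompt flips between pass and fail."""
    return 100.0 / n_prompts


def min_prompt_gain_to_keep(n_prompts: int) -> int:
    """Prompts that must newly pass before keeping: 1 with 10+ prompts, otherwise 2."""
    return 1 if n_prompts >= 10 else 2
```

With 6 prompts a single flip is worth roughly 16.7 points, which is why small sets need the stricter rule; at 20 prompts it is 5 points.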

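A bad eval set (10 nearly identical prompts) gives you a misleadingly high score. If you improve from 60% to 80% but all 8 passing prompts are the same scenario, you know nothing about the other scenarios.

Eval difficulty imbalance: If some prompts are trivially easy (the baseline passes them) and others are so hard that no version ever passes, your effective discriminating range is narrower than the eval looks: locked passes and locked fails don't distinguish between artifact versions. For the current round: continue as-is (the eval set is immutable), but apply the statistical fragility rule to the effective prompt count, not the total. For the next round: replace the trivially easy prompts with harder versions, and either make the impossible prompts achievable (relax the criteria) or remove them if they test something outside the artifact's scope.

Eval quality and designer bias: If every new artifact reaches 100% within 2–3 sessions, your evals are probably too easy. The risk is amplified when the same person designs the eval and runs the loop; you may unconsciously write prompts you know the artifact can handle. Concrete safeguards: (a) write the eval prompts before looking at the current artifact version, testing what it should do rather than what it does; (b) run the pre-loop baseline artifact against the eval: if it scores 70%+, the eval doesn't discriminate enough (for a reasonably good artifact, aim for a 30–60% baseline); (c) have a second person review or contribute prompts, and try to break the converged artifact with new prompts not in your eval; (d) count the ratio of happy-path to adversarial prompts, and rebalance if >60% are happy-path; (e) include red-team prompts and real failure cases from actual usage, which are inherently unbiased.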
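Two of these safeguards, the baseline discrimination check (b) and the happy-path ratio (d), are mechanical enough to script. A minimal sketch with a hypothetical helper name, using the thresholds quoted above and assuming every prompt is tagged as either happy-path or adversarial:

```python
def eval_design_warnings(baseline_score: float, n_happy_path: int, n_adversarial: int) -> list[str]:
    """Flag an eval set that looks too easy; baseline_score is a fraction in [0, 1]."""
    warnings = []
    if baseline_score >= 0.70:
        warnings.append("baseline >= 70%: eval may not discriminate enough")
    total = n_happy_path + n_adversarial
    if total and n_happy_path / total > 0.60:
        warnings.append("more than 60% happy-path prompts: rebalance")
    return warnings
```

An empty list does not prove the eval is good; it only means these two checks found nothing. The human safeguards (a), (c), and (e) still apply.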

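Eval-audience mismatch: If the eval was written by experts but the real users are non-experts (or vice versa), a high score is meaningless: you optimized for the wrong input distribution. Redesign the eval using actual user queries collected from production or user interviews. The eval must test how real users actually communicate, not how experts think they should.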
