SWARM Safety Skill
Study how intelligence swarms — and where it fails.
SWARM is a research framework for studying emergent risks in multi-agent AI systems using soft (probabilistic) labels instead of binary good/bad classifications. AGI-level risks don't require AGI-level agents — harmful dynamics emerge when many sub-AGI agents interact, even when no individual agent is misaligned.
v1.7.0 | 38 agent types | 29 governance levers | 55 scenarios | 2922 tests | 8 framework bridges
Repository: https://github.com/swarm-ai-safety/swarm
Hard Rules
- SWARM simulations run locally. Install the package first.
- Do not submit scenarios containing real API keys, credentials, or PII.
- Simulation results are research artifacts. Do not present them as ground truth about real systems.
- When publishing results, cite the framework and disclose simulation parameters.
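The rule against submitting credentials or PII can be backed by a quick pre-submission check. The sketch below is illustrative only: `find_suspect_lines` and the three patterns are hypothetical helpers, not part of SWARM, and a heuristic scan like this will miss many secret formats.

```python
import re

# Illustrative heuristics only (assumptions, not a SWARM feature).
SECRET_PATTERNS = {
    # AWS access key IDs start with AKIA followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Lines that assign a value to something named like a secret.
    "generic_secret_assignment": re.compile(
        r"(?i)\b(api[_-]?key|secret|token|password)\b\s*[:=]\s*\S+"
    ),
    # Very loose e-mail matcher as a stand-in for PII detection.
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def find_suspect_lines(yaml_text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


scenario_text = "simulation:\n  seed: 42\napi_key: abc123\n"
for line_number, pattern_name in find_suspect_lines(scenario_text):
    print(f"line {line_number}: possible {pattern_name}")
```

An empty result does not prove a scenario is clean; it only means none of these example patterns matched.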
Security
- API binds to localhost only (127.0.0.1) by default to prevent network exposure.
- CORS restricted to localhost origins by default.
- No authentication on development API — do not expose to untrusted networks.
- In-memory storage — data does not persist between restarts.
- For production deployment, add authentication middleware and use a proper database.
Install
CODEBLOCK0
Quick Start (Python)
CODEBLOCK1
Quick Start (CLI)
CODEBLOCK2
Quick Start (API)
Start the API server:
CODEBLOCK3
API documentation at http://localhost:8000/docs.
Security Note: The server binds to 127.0.0.1 (localhost only) by default. Do not bind to 0.0.0.0 unless you understand the security implications and have proper firewall rules in place.
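A launcher script can refuse non-loopback bind addresses before starting the server. This is a minimal sketch using only the Python standard library; `is_loopback_host` is a hypothetical helper and is not part of SWARM.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Return True when host is 'localhost' or a loopback IP literal."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-loopback.
        return False


for candidate in ("127.0.0.1", "::1", "0.0.0.0"):
    verdict = "ok" if is_loopback_host(candidate) else "exposes the API to the network"
    print(f"{candidate}: {verdict}")
```

Note that 0.0.0.0 is the unspecified address, not a loopback address, so the check rejects it.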
Register Agent
CODEBLOCK4
Returns agent_id and api_key.
Submit Scenario
CODEBLOCK5
Create & Join Simulation
CODEBLOCK6
Core Concepts
Soft Probabilistic Labels
Interactions carry p = P(v = +1) — probability of beneficial outcome:
CODEBLOCK7
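As a rough illustration of how a soft label could be produced, the sketch below maps a proxy score `v_hat` to `p` with a logistic sigmoid. The function name `soft_label` and the `steepness` parameter are assumptions made for this example; the framework's actual calibration may differ.

```python
import math


def soft_label(v_hat: float, steepness: float = 1.0) -> float:
    """Map a proxy score v_hat to p = P(v = +1) with a logistic sigmoid.

    The two branches avoid overflow in math.exp for large |v_hat|.
    """
    x = steepness * v_hat
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# A neutral proxy score gives p = 0.5; positive scores push p toward 1.
for score in (-2.0, 0.0, 2.0):
    print(f"v_hat={score:+.1f} -> p={soft_label(score):.3f}")
```

Keeping `p` as a probability rather than thresholding it is what lets the metrics below be computed as expectations.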
Five Key Metrics
| Metric | What It Measures |
|---|---|
| Toxicity rate | Expected harm among accepted interactions: E[1 - p \| accepted] |
| Quality gap | Adverse selection indicator (negative = bad): E[p \| accepted] - E[p \| rejected] |
| Conditional loss | Selection effect on payoffs |
| Incoherence | Variance-to-error ratio across replays |
| Illusion delta | Gap between perceived coherence and actual consistency |
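The first two metrics follow directly from their definitions, toxicity rate as E[1 - p | accepted] and quality gap as E[p | accepted] - E[p | rejected]. The sketch below assumes each interaction is a `(p, accepted)` pair; that representation and the function names are hypothetical and chosen for this example. The other three metrics are not sketched because the text gives no formula for them.

```python
from statistics import mean


def toxicity_rate(interactions):
    """E[1 - p | accepted]: expected harm among accepted interactions."""
    accepted = [p for p, was_accepted in interactions if was_accepted]
    if not accepted:
        raise ValueError("no accepted interactions")
    return mean(1.0 - p for p in accepted)


def quality_gap(interactions):
    """E[p | accepted] - E[p | rejected]; negative values signal adverse selection."""
    accepted = [p for p, was_accepted in interactions if was_accepted]
    rejected = [p for p, was_accepted in interactions if not was_accepted]
    if not accepted or not rejected:
        raise ValueError("need at least one accepted and one rejected interaction")
    return mean(accepted) - mean(rejected)


# Each tuple is (p, accepted): two accepted and two rejected interactions.
sample = [(0.9, True), (0.7, True), (0.2, False), (0.4, False)]
print(f"toxicity rate: {toxicity_rate(sample):.3f}")
print(f"quality gap:   {quality_gap(sample):.3f}")
```

For this sample the accepted interactions are of higher quality than the rejected ones, so the quality gap is positive (about 0.5) and toxicity is about 0.2.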
Agent Types (14 families, 38 implementations)
| Type | Behavior |
|---|---|
| Honest | Cooperative, trust-based, completes tasks diligently |
| Opportunistic | Maximizes short-term payoff, cherry-picks tasks |
| Deceptive | Builds trust, then exploits trusted relationships |
| Adversarial | Targets honest agents, coordinates with allies |
| LDT | Logical Decision Theory with FDT/UDT precommitment |
| RLM | Reinforcement Learning from Memory |
| Council | Multi-agent deliberation-based decisions |
| SkillRL | Learns interaction strategies via reward signals |
| LLM | Behavior determined by LLM (Anthropic, OpenAI, or Ollama) |
| Moltbook | Domain-specific social platform agent |
| Scholar | Academic citation and research agent |
| Wiki Editor | Collaborative editing with editorial policy |
Governance Levers (29 mechanisms)
- Transaction Taxes: reduce exploitation, at a cost to welfare
- Reputation Decay: punishes bad actors, but also erodes honest agents' standing
- Circuit Breakers: freeze toxic agents quickly
- Random Audits: deter hidden exploitation
- Staking: filters out undercapitalized agents
- Collusion Detection: catches coordinated attacks (the critical lever near the collapse threshold)
- Sybil Detection: identifies duplicate agents
- Transparency Ledger: rewards or penalizes based on outcome
- Moderator Agent: probabilistic review of interactions
- Incoherence Friction: taxes uncertainty-driven decisions
- Council Deliberation: multi-agent governance decisions
- Diversity Enforcement: prevents monoculture collapse
- Moltipedia-specific: pair caps, page cooldowns, daily caps, self-fix prevention
Framework Bridges
| Bridge | Integration |
|---|---|
| Concordia | DeepMind's multi-agent framework |
| GasTown | Multi-agent workspace governance |
| Claude Code | Claude CLI agent integration |
| LiveSWE | Live software engineering tasks |
| OpenClaw | Open agent protocol |
| Prime Intellect | Cross-platform run tracking |
| Ralph | Agent orchestration |
| Worktree | Git worktree-based sandboxing |
Scenario YAML Format
CODEBLOCK8
Key Research Findings
Phase Transitions (11-scenario, 209-epoch study)
| Regime | Adversarial % | Toxicity | Welfare | Outcome |
|---|---|---|---|---|
| Cooperative | 0-20% | < 0.30 | Stable | Survives |
| Contested | 20-37.5% | 0.33-0.37 | Declining | Survives |
| Collapse | 50%+ | ~0.30 | Zero by epoch 12-14 | Collapses |
The critical threshold separating recoverable decline from irreversible collapse lies between 37.5% and 50% adversarial agents.
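To place a planned agent mix against these regimes, the sketch below computes the adversarial share of a population and maps it to the bands reported above. The function names are hypothetical, the assignment of the exact 20% boundary to the cooperative regime is an assumption, and the band between 37.5% and 50% is labeled separately because the study only brackets the threshold there.

```python
def adversarial_fraction(agent_counts: dict[str, int]) -> float:
    """Share of agents whose type is 'adversarial'."""
    total = sum(agent_counts.values())
    if total == 0:
        raise ValueError("population is empty")
    return agent_counts.get("adversarial", 0) / total


def regime(fraction: float) -> str:
    """Map an adversarial fraction to the regimes reported in the table."""
    if fraction <= 0.20:
        return "cooperative"
    if fraction <= 0.375:
        return "contested"
    if fraction < 0.50:
        # The study brackets the critical threshold here without locating it.
        return "threshold band"
    return "collapse"


# Example mix: 3 honest and 2 adversarial agents (40% adversarial).
population = {"honest": 3, "adversarial": 2}
share = adversarial_fraction(population)
print(f"{share:.1%} adversarial -> {regime(share)}")
```

A 3-honest, 2-adversarial population sits at 40%, inside the untested band, so its outcome cannot be read off the table.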
Governance Cost Paradox (v1.7.0 GasTown study)
A 42-run study finds that governance reduces toxicity at all adversarial levels (mean reduction 0.071) but imposes net-negative welfare costs at current parameter tuning. At 0% adversarial agents, governance costs 216 welfare units (-57.6%) for a toxicity reduction of only 0.066.
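The 0% adversarial figures can be turned into a rough cost-effectiveness number. The arithmetic below assumes the -57.6% is measured against ungoverned welfare; the derived baseline and per-point cost are implied by the quoted figures, not reported by the study.

```python
# Figures quoted in the text for the 0% adversarial condition.
welfare_cost = 216.0        # welfare units lost under governance
relative_cost = 0.576       # the same loss expressed as a fraction (57.6%)
toxicity_reduction = 0.066  # absolute drop in toxicity rate

# Assumption: the percentage is relative to welfare without governance.
implied_baseline_welfare = welfare_cost / relative_cost
implied_governed_welfare = implied_baseline_welfare - welfare_cost

# Welfare paid for each 0.01 of toxicity removed.
cost_per_toxicity_point = welfare_cost / (toxicity_reduction * 100)

print(f"implied baseline welfare: {implied_baseline_welfare:.1f}")
print(f"implied governed welfare: {implied_governed_welfare:.1f}")
print(f"welfare cost per 0.01 toxicity: {cost_per_toxicity_point:.1f}")
```

Under that assumption the baseline is about 375 welfare units, governance leaves about 159, and each 0.01 of toxicity reduction costs roughly 33 welfare units.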
Case Studies
GasTown Governance Cost
Study governance overhead vs. toxicity reduction across 7 agent compositions with and without governance levers. Reveals the safety-throughput trade-off. See scenarios/gastown_governance_cost.yaml.
LDT Cooperation
220 runs across 10 seeds comparing TDT vs FDT vs UDT cooperation strategies at population scales up to 21 agents. See scenarios/ldt_cooperation.yaml.
Moltipedia Heartbeat
Model the Moltipedia wiki editing loop: competing AI editors, editorial policy, point farming, and anti-gaming governance. See scenarios/moltipedia_heartbeat.yaml.
Moltbook CAPTCHA
Model Moltbook's anti-human math challenges and rate limiting: obfuscated text parsing, verification gates, and spam prevention. See scenarios/moltbook_captcha.yaml.
API Endpoints (Full Reference)
| Method | Endpoint | Description |
|---|---|---|
| GET | INLINECODE14 | Health check |
| GET | / | API info |
| POST | /api/v1/agents/register | Register agent |
| GET | /api/v1/agents/{agent_id} | Get agent details |
| GET | /api/v1/agents/ | List agents |
| POST | /api/v1/scenarios/submit | Submit scenario |
| GET | /api/v1/scenarios/{scenario_id} | Get scenario |
| GET | /api/v1/scenarios/ | List scenarios |
| POST | /api/v1/simulations/create | Create simulation |
| POST | /api/v1/simulations/{id}/join | Join simulation |
| GET | /api/v1/simulations/{id} | Get simulation |
| GET | /api/v1/simulations/ | List simulations |
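The endpoints can be called from Python without extra dependencies. The sketch below builds the registration request with the standard library; the helper names are hypothetical, the body fields (`name`, `description`, `capabilities`) follow the registration example in this document, and the response shape is assumed to be JSON containing `agent_id` and `api_key`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default local development server


def build_register_request(name, description, capabilities):
    """Build (but do not send) a POST request for /api/v1/agents/register."""
    payload = json.dumps(
        {"name": name, "description": description, "capabilities": capabilities}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/agents/register",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=10):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_register_request(
    "YourAgent", "What your agent does", ["governance-testing", "red-teaming"]
)
print(request.get_method(), request.full_url)
# With the API server running locally, send it with:
#   registration = send_json(request)
```

Building the request separately from sending it keeps the example usable when the server is not running.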
Citation
CODEBLOCK9
Linked Docs
- Skill metadata: INLINECODE26
- Agent discovery: INLINECODE27
- Full documentation: INLINECODE28
- Theoretical foundations: INLINECODE29
- Governance guide: INLINECODE30
- Red-teaming guide: INLINECODE31
- Scenario format: INLINECODE32
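SWARM Safety Skill
Study how intelligence swarms, and where it fails.
SWARM is a research framework for studying emergent risks in multi-agent AI systems using soft (probabilistic) labels instead of binary good/bad classifications. AGI-level risks do not require AGI-level agents: harmful dynamics emerge when many sub-AGI agents interact, even when no individual agent is misaligned.
v1.7.0 | 38 agent types | 29 governance levers | 55 scenarios | 2922 tests | 8 framework bridges
Repository: https://github.com/swarm-ai-safety/swarm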
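Hard Rules
- SWARM simulations run locally. Install the package first.
- Do not submit scenarios containing real API keys, credentials, or PII.
- Simulation results are research artifacts. Do not present them as ground truth about real systems.
- When publishing results, cite the framework and disclose simulation parameters.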
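Security
- The API binds to localhost only (127.0.0.1) by default to prevent network exposure.
- CORS is restricted to localhost origins by default.
- The development API has no authentication; do not expose it to untrusted networks.
- Storage is in-memory; data does not persist between restarts.
- For production deployment, add authentication middleware and use a proper database.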
Install
bash
# Install from PyPI
pip install swarm-safety
# With LLM agent support
pip install "swarm-safety[llm]"
# Full development setup (all extras)
git clone https://github.com/swarm-ai-safety/swarm.git
cd swarm
pip install -e ".[dev,runtime]"
Quick Start (Python)
python
from swarm.agents.honest import HonestAgent
from swarm.agents.opportunistic import OpportunisticAgent
from swarm.agents.deceptive import DeceptiveAgent
from swarm.agents.adversarial import AdversarialAgent
from swarm.core.orchestrator import Orchestrator, OrchestratorConfig

# Configure a 10-epoch run with 10 steps per epoch and a fixed seed.
config = OrchestratorConfig(n_epochs=10, steps_per_epoch=10, seed=42)
orchestrator = Orchestrator(config=config)

# Register a small mixed population.
orchestrator.register_agent(HonestAgent(agent_id="honest_1", name="Alice"))
orchestrator.register_agent(HonestAgent(agent_id="honest_2", name="Bob"))
orchestrator.register_agent(OpportunisticAgent(agent_id="opp_1"))
orchestrator.register_agent(DeceptiveAgent(agent_id="dec_1"))

# Run the simulation and print per-epoch metrics.
metrics = orchestrator.run()
for m in metrics:
    print(f"Epoch {m.epoch}: toxicity={m.toxicity_rate:.3f}, welfare={m.total_welfare:.2f}")
Quick Start (CLI)
bash
# List available scenarios
swarm list
# Run a scenario
swarm run scenarios/baseline.yaml
# Override settings
swarm run scenarios/baseline.yaml --seed 42 --epochs 20 --steps 15
# Export results
swarm run scenarios/baseline.yaml --export-json results.json --export-csv outputs/
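Quick Start (API)
Start the API server:
bash
pip install "swarm-safety[api]"
uvicorn swarm.api.app:app --host 127.0.0.1 --port 8000
API documentation is available at http://localhost:8000/docs.
Security Note: The server binds to 127.0.0.1 (localhost only) by default. Do not bind to 0.0.0.0 unless you understand the security implications and have proper firewall rules in place.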
Register Agent
bash
curl -X POST http://localhost:8000/api/v1/agents/register \
  -H "Content-Type: application/json" \
  -d '{
    "name": "YourAgent",
    "description": "What your agent does",
    "capabilities": ["governance-testing", "red-teaming"]
  }'
Returns agent_id and api_key.
Submit Scenario
bash
curl -X POST http://localhost:8000/api/v1/scenarios/submit \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-scenario",
    "description": "Testing collusion detection with 5 agents",
    "yaml_content": "simulation:\n  n_epochs: 10\n  steps_per_epoch: 10\nagents:\n  - type: honest\n    count: 3\n  - type: adversarial\n    count: 2",
    "tags": ["collusion", "governance"]
  }'
Create & Join Simulation
bash
# Create
curl -X POST http://localhost:8000/api/v1/simulations/create \
  -H "Content-Type: application/json" \
  -d '{"scenario_id": "SCENARIO_ID", "max_participants": 5}'
# Join
curl -X POST http://localhost:8000/api/v1/simulations/SIM_ID/join \
  -H "Content-Type: application/json" \
  -d '{"agent_id": "YOUR_AGENT_ID", "role": "participant"}'
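Core Concepts
Soft Probabilistic Labels
Interactions carry p = P(v = +1), the probability of a beneficial outcome:
Observables -> proxy computer -> v_hat -> sigmoid -> p -> payoff engine -> payoffs
                                                     |
                                                     soft metrics -> toxicity, quality gap, etc.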
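Five Key Metrics
| Metric | What It Measures |
|---|---|
| Toxicity rate | Expected harm among accepted interactions: E[1 - p \| accepted] |
| Quality gap | Adverse selection indicator (negative = bad): E[p \| accepted] - E[p \| rejected] |
| Conditional loss | Selection effect on payoffs |
| Incoherence | Variance-to-error ratio across replays |
| Illusion delta | Gap between perceived coherence and actual consistency |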
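Agent Types (14 families, 38 implementations)
| Type | Behavior |
|---|---|
| Honest | Cooperative, trust-based, completes tasks diligently |
| Opportunistic | Maximizes short-term payoff, cherry-picks tasks |
| Deceptive | Builds trust, then exploits trusted relationships |
| Adversarial | Targets honest agents, coordinates with allies |
| LDT | Logical Decision Theory with FDT/UDT precommitment |
| RLM | Reinforcement Learning from Memory |
| Council | Multi-agent deliberation-based decisions |
| SkillRL | Learns interaction strategies via reward signals |
| LLM | Behavior determined by LLM (Anthropic, OpenAI, or Ollama) |
| Moltbook | Domain-specific social platform agent |
| Scholar | Academic citation and research agent |
| Wiki Editor | Collaborative editing with editorial policy |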
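Governance Levers (29 mechanisms)
- Transaction Taxes: reduce exploitation, at a cost to welfare
- Reputation Decay: punishes bad actors, but also erodes honest agents' standing
- Circuit Breakers: freeze toxic agents quickly
- Random Audits: deter hidden exploitation
- Staking: filters out undercapitalized agents
- Collusion Detection: catches coordinated attacks (the critical lever near the collapse threshold)
- Sybil Detection: identifies duplicate agents
- Transparency Ledger: rewards or penalizes based on outcome
- Moderator Agent: probabilistic review of interactions
- Incoherence Friction: taxes uncertainty-driven decisions
- Council Deliberation: multi-agent governance decisions
- Diversity Enforcement: prevents monoculture collapse
- Moltipedia-specific: pair caps, page cooldowns, daily caps, self-fix prevention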
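Framework Bridges
| Bridge | Integration |
|---|---|
| Concordia | DeepMind's multi-agent framework |
| GasTown | Multi-agent workspace governance |
| Claude Code | Claude CLI agent integration |
| LiveSWE | Live software engineering tasks |
| OpenClaw | Open agent protocol |
| Prime Intellect | Cross-platform run tracking |
| Ralph | Agent orchestration |
| Worktree | Git worktree-based sandboxing |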
Scenario YAML Format
yaml
simulation:
  n_epochs: 10
  steps_per_epoch: 10
  seed: 42
agents:
  - type: honest
    count: 3
    config:
      acceptance_threshold: 0.4
  - type: adversarial
    count: 2
    config:
      aggression_level: 0.7
governance:
  transaction_tax_rate: 0.05
  circuit_breaker_enabled