Kubernetes Agent Swarm — Platform Operations
A multi-agent system for Kubernetes and OpenShift platform operations. Seven specialized agents work together as a coordinated swarm.
Runtime Requirements
| Requirement | Required | Description |
|---|---|---|
| kubectl | ✅ Yes | Kubernetes CLI; must be in PATH |
| oc | Optional | OpenShift CLI; needed for OCP/ROSA/ARO |
| helm | Optional | For GitOps agent Helm operations |
| jq | Optional | For JSON output parsing |
| KUBECONFIG | ✅ Yes | Cluster access via env var or ~/.kube/config |
Optional cloud CLIs (aws, az, gcloud, rosa) — only needed for managed cluster operations.
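A preflight check along the following lines could confirm the requirements above before a session starts. This is a minimal sketch, not part of the skill: the function name `missing_tools` is hypothetical, and it only tests that each binary resolves in PATH, not that its version is adequate.

```bash
# Hypothetical helper (not shipped with the skill): print every named tool
# that is NOT found in PATH, one per line.
# Returns 0 when all tools are present, 1 when at least one is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_tools kubectl` covers the one required CLI, while `missing_tools oc helm jq` reports which optional tools are absent so an agent can skip the workflows that depend on them.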
Installation
```bash
clawhub install kubernetes
```
Or install individual agents:
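```bash
clawhub install orchestrator
clawhub install cluster-ops
clawhub install gitops
clawhub install security
clawhub install observability
clawhub install artifacts
clawhub install developer-experience
```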
The Swarm — Agent Roster
| Agent | Code Name | Domain |
|---|---|---|
| Orchestrator | Jarvis | Task routing, coordination, standups |
| Cluster Ops | Atlas | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | RBAC, policies, secrets, scanning |
| Observability | Pulse | Metrics, logs, alerts, incidents |
| Artifacts | Cache | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | Namespaces, onboarding, support |
How It Works
This is an instruction-only skill. Agents receive markdown instructions describing what commands to run and how to interpret output. No executable scripts are included — the agent translates instructions into actions using the host's installed CLI tools.
Session Setup
Before using the swarm, establish cluster context:
```bash
# Verify access
kubectl cluster-info
kubectl get nodes

# For OpenShift
oc status
```
Agent Communication
Agents communicate via @mentions in shared task comments:
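```
@Shield Please review the RBAC configuration for payment-service v3.2 before the sync.
@Pulse Is the CPU spike related to the deployment or to external traffic?
@Atlas The staging cluster needs 2 more worker nodes.
```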
Escalation Path
1. Agent detects issue
2. Agent attempts resolution within guardrails
3. If blocked → @mention another agent or escalate to human
4. P1 incidents → all relevant agents auto-notified
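The escalation rules above can be read as a small decision function. The sketch below is illustrative only: the name `next_steps`, the argument values, and the printed labels are all hypothetical, and the function decides what should happen without notifying anyone itself.

```bash
# Hypothetical sketch of the escalation path as a pure decision function.
#   $1 = priority label (only "P1" is treated specially, as in the text)
#   $2 = outcome of the agent's own attempt: "resolved" or "blocked"
# Prints the follow-up steps on one line, or "no-escalation" if none apply.
next_steps() {
    priority=$1
    outcome=$2
    steps=""
    if [ "$priority" = "P1" ]; then
        steps="notify-all-relevant-agents"
    fi
    if [ "$outcome" = "blocked" ]; then
        steps="${steps:+$steps }mention-agent-or-escalate-to-human"
    fi
    echo "${steps:-no-escalation}"
}
```

Keeping the decision separate from the notification makes the rules easy to review: a blocked P1 yields both steps, while a resolved lower-priority issue yields none.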
Heartbeat Schedule
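```
*/5  * * * *   Atlas, Pulse, Shield   (fast response: incidents, alerts, CVEs)
*/10 * * * *   Flow, Cache            (scheduled work: deploys, promotions)
*/15 * * * *   Desk, Orchestrator     (batch work: onboarding, standups)
```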
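To see how the three heartbeat tiers overlap, the sketch below lists which agents are due at a given minute of the hour. It assumes the tiers are cron-style intervals of 5, 10, and 15 minutes; the function name `agents_due` is hypothetical, and real scheduling would be handled by whatever scheduler the host provides.

```bash
# Sketch: print the agents whose heartbeat fires at a given minute (0-59),
# assuming cron-style */5, */10 and */15 minute intervals.
agents_due() {
    # Strip one leading zero ("08" -> "8") so the shell does not parse
    # the value as an octal number.
    case "$1" in
        0?) minute=${1#0} ;;
        *)  minute=$1 ;;
    esac
    due=""
    [ $((minute % 5)) -eq 0 ]  && due="$due Atlas Pulse Shield"
    [ $((minute % 10)) -eq 0 ] && due="$due Flow Cache"
    [ $((minute % 15)) -eq 0 ] && due="$due Desk Orchestrator"
    # Trim the leading space before printing; prints an empty line if none.
    echo "${due# }"
}
```

Calling `agents_due "$(date +%M)"` shows the agents due right now. All seven agents coincide only at minutes 0 and 30, which is when the load from heartbeats is highest.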
Agent Capabilities
What Agents CAN Do
- Read cluster state (kubectl get, kubectl describe, oc get)
- Deploy via GitOps (argocd app sync, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Query metrics and logs
What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources
- Modify cluster-wide policies
- Make direct changes to secrets without rotation workflow
- Perform irreversible cluster upgrades
- Approve production deployments (can prepare, human approves)
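One way to enforce this split is to classify a command before it runs. The sketch below is deliberately conservative and is an assumption, not the skill's actual mechanism: it allows only the read-only verbs named in the CAN list and sends everything else to a human, so unknown commands fail closed. The name `classify_command` is hypothetical.

```bash
# Hypothetical guardrail sketch: classify a single CLI invocation.
# Prints "allow" for the read-only verbs listed above and "needs-approval"
# for anything else. The argument is assumed to be ONE command; prefix
# matching does not protect against chained commands such as "a; b".
classify_command() {
    case "$1" in
        "kubectl get "*|"kubectl describe "*|"oc get "*)
            echo "allow" ;;
        *)
            echo "needs-approval" ;;
    esac
}
```

For instance, `classify_command "kubectl get pods -A"` prints `allow`, while `classify_command "kubectl delete ns payments"` prints `needs-approval`. Write operations the agents are permitted to perform, such as GitOps syncs, would need their own explicit rules.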
Key Principles
- Roles over genericism: each agent has a defined domain
- Files over mental notes: only files persist between sessions
- Human-in-the-loop: critical actions require approval
- Guardrails over freedom: define what agents can and cannot do
- Audit everything: every action is logged
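The "audit everything" and "files over mental notes" principles can be combined in an append-only log. The helper below is a sketch under stated assumptions: the name `audit_log`, the `AUDIT_LOG` variable, and the line format are hypothetical; only the default path `logs/LOGS.md` comes from this skill's file structure.

```bash
# Hypothetical audit helper: append one line per action to a log file.
# AUDIT_LOG defaults to logs/LOGS.md; the line format
# (UTC timestamp | agent | action) is an assumption, not a defined spec.
audit_log() {
    agent=$1
    action=$2
    log_file=${AUDIT_LOG:-logs/LOGS.md}
    # Create the log directory on first use.
    mkdir -p "$(dirname "$log_file")"
    printf '%s | %s | %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$agent" "$action" >> "$log_file"
}
```

A call such as `audit_log Atlas "kubectl get nodes"` appends a single timestamped line, so the record survives between sessions and can be reviewed later.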
File Structure
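```
kubernetes/
├── SKILL.md                          # This file: swarm composition
├── AGENTS.md                         # Swarm configuration and protocols
├── skills/
│   ├── orchestrator/SKILL.md         # Jarvis: task routing
│   ├── cluster-ops/SKILL.md          # Atlas: cluster operations
│   ├── gitops/SKILL.md               # Flow: GitOps
│   ├── security/SKILL.md             # Shield: security
│   ├── observability/SKILL.md        # Pulse: monitoring
│   ├── artifacts/SKILL.md            # Cache: artifacts
│   └── developer-experience/SKILL.md # Desk: developer experience
├── memory/MEMORY.md                  # Long-term agent memory
├── working/WORKING.md                # Session progress
└── logs/LOGS.md                      # Action audit trail
```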
Detailed Agent Documentation
See individual SKILL.md files for each agent's full capabilities, personality, and workflow instructions.
Kubernetes Agent Swarm: Platform Operations
A multi-agent system for Kubernetes and OpenShift platform operations. Seven specialized agents work together as a coordinated swarm.
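Runtime Requirements
| Requirement | Required | Description |
|---|---|---|
| kubectl | ✅ Yes | Kubernetes CLI; must be in PATH |
| oc | Optional | OpenShift CLI; needed for OCP/ROSA/ARO environments |
| helm | Optional | For GitOps agent Helm operations |
| jq | Optional | For JSON output parsing |
| KUBECONFIG | ✅ Yes | Cluster access via env var or ~/.kube/config |
Optional cloud CLIs (aws, az, gcloud, rosa) are only needed for managed cluster operations.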
Installation
```bash
clawhub install kubernetes
```
Or install individual agents:
```bash
clawhub install orchestrator
clawhub install cluster-ops
clawhub install gitops
clawhub install security
clawhub install observability
clawhub install artifacts
clawhub install developer-experience
```
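The Swarm: Agent Roster
| Agent | Code Name | Domain |
|---|---|---|
| Orchestrator | Jarvis | Task routing, coordination, standups |
| Cluster Ops | Atlas | Cluster lifecycle, nodes, upgrades |
| GitOps | Flow | ArgoCD, Helm, Kustomize, deploys |
| Security | Shield | RBAC, policies, secrets, scanning |
| Observability | Pulse | Metrics, logs, alerts, incidents |
| Artifacts | Cache | Registries, SBOM, promotion, CVEs |
| Developer Experience | Desk | Namespaces, onboarding, support |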
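How It Works
This is an instruction-only skill. Agents receive Markdown instructions describing which commands to run and how to interpret the output. No executable scripts are included; the agent translates the instructions into actions using the CLI tools installed on the host.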
Session Setup
Before using the swarm, establish cluster context:
```bash
# Verify access
kubectl cluster-info
kubectl get nodes

# For OpenShift
oc status
```
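Agent Communication
Agents communicate via @mentions in shared task comments:
```
@Shield Please review the RBAC configuration for payment-service v3.2 before the sync.
@Pulse Is the CPU spike related to the deployment or to external traffic?
@Atlas The staging cluster needs 2 more worker nodes.
```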
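Escalation Path
1. Agent detects issue
2. Agent attempts resolution within guardrails
3. If blocked → @mention another agent or escalate to human
4. P1 incidents → all relevant agents auto-notified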
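Heartbeat Schedule
```
*/5  * * * *   Atlas, Pulse, Shield   (fast response: incidents, alerts, CVEs)
*/10 * * * *   Flow, Cache            (scheduled work: deploys, promotions)
*/15 * * * *   Desk, Orchestrator     (batch work: onboarding, standups)
```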
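Agent Capabilities
What Agents CAN Do
- Read cluster state (kubectl get, kubectl describe, oc get)
- Deploy via GitOps (argocd app sync, Flux reconciliation)
- Create documentation and reports
- Investigate and triage incidents
- Provision standard resources (namespaces, quotas, RBAC)
- Run health checks and audits
- Query metrics and logs
What Agents CANNOT Do (Human-in-the-Loop Required)
- Delete production resources
- Modify cluster-wide policies
- Make direct changes to secrets without the rotation workflow
- Perform irreversible cluster upgrades
- Approve production deployments (agents can prepare them; a human approves)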
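Key Principles
- Roles over genericism: each agent has a defined domain
- Files over mental notes: only files persist between sessions
- Human-in-the-loop: critical actions require approval
- Guardrails over freedom: define what agents can and cannot do
- Audit everything: every action is logged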
File Structure
```
kubernetes/
├── SKILL.md                          # This file: swarm composition
├── AGENTS.md                         # Swarm configuration and protocols
├── skills/
│   ├── orchestrator/SKILL.md         # Jarvis: task routing
│   ├── cluster-ops/SKILL.md          # Atlas: cluster operations
│   ├── gitops/SKILL.md               # Flow: GitOps
│   ├── security/SKILL.md             # Shield: security
│   ├── observability/SKILL.md        # Pulse: monitoring
│   ├── artifacts/SKILL.md            # Cache: artifacts
│   └── developer-experience/SKILL.md # Desk: developer experience
├── memory/MEMORY.md                  # Long-term agent memory
├── working/WORKING.md                # Session progress
└── logs/LOGS.md                      # Action audit trail
```
Detailed Agent Documentation
See the individual SKILL.md files for each agent's full capabilities, personality, and workflow instructions.