SLA Monitor Skill
Purpose
Help teams set up production-grade monitoring for AI agents and automated services. Covers uptime tracking, response time SLAs, error budgets, and incident escalation.
When to Use
- Deploying AI agents to production
- Setting up monitoring for client-facing automation
- Creating SLA documentation for service agreements
- Building incident response procedures
Monitoring Stack Options
Option 1: UptimeRobot (Free tier available)
- 50 monitors free, 5-minute check intervals
- HTTP, keyword, ping, port monitors
- Email + Slack + webhook alerts
Option 2: Better Stack (formerly Better Uptime)
- Status pages included
- Incident management built-in
- Free tier: 10 monitors
Option 3: Self-Hosted (Uptime Kuma)
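```bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
```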
SLA Tiers
Tier 1: Standard ($1,500/mo)
- 99.5% uptime guarantee (43.8h downtime/year)
- Response within 4 hours (business hours)
- Monthly performance report
Tier 2: Professional ($3,000/mo)
- 99.9% uptime guarantee (8.76h downtime/year)
- Response within 1 hour (business hours)
- Weekly performance reports
- Quarterly optimization reviews
Tier 3: Enterprise ($5,000+/mo)
- 99.95% uptime guarantee (4.38h downtime/year)
- Response within 15 minutes (24/7)
- Real-time dashboard access
- Dedicated support channel
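The downtime figures quoted for each tier follow directly from the uptime percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year (the function name and tier labels are illustrative, not part of any monitoring tool):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime allowance, in hours, for an uptime guarantee."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100 - uptime_percent) / 100


# Reproduce the per-tier figures from the list above.
for tier, target in [("Standard", 99.5), ("Professional", 99.9), ("Enterprise", 99.95)]:
    print(f"{tier}: {allowed_downtime_hours(target):.2f} h/year")
```

Passing a different `period_hours` (for example 720 for a 30-day month) gives the allowance for a shorter billing period.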
Alert Configuration Template
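```yaml
monitors:
  - name: Agent Health Check
    type: http
    url: https://your-agent-endpoint/health
    interval: 300  # 5 minutes
    alerts:
      - type: email
        threshold: 1  # alert after 1 failure
      - type: slack
        webhook: ${SLACK_WEBHOOK}
        threshold: 2  # alert after 2 consecutive failures
      - type: sms
        threshold: 3  # escalate after 3 failures
  - name: API Response Time
    type: http
    url: https://your-agent-endpoint/api
    interval: 60
    expected_response_time: 2000  # milliseconds
    alerts:
      - type: slack
        condition: response_time > 5000
error_budget:
  monthly_target: 99.9
  burn_rate_alert: 2.0  # alert when the burn rate reaches 2x normal
```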
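The health-check alerts in this template escalate by failure count: email after 1 failure, Slack after 2 consecutive failures, SMS after 3. The following Python sketch shows one way that ladder could be evaluated. It assumes every threshold counts consecutive failures and that a healthy check resets the streak; `AlertRule`, `channels_to_notify`, and `simulate` are hypothetical names, not the API of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    channel: str
    threshold: int  # consecutive failures required before this channel fires


# Thresholds mirror the template: email at 1, Slack at 2, SMS at 3.
RULES = [AlertRule("email", 1), AlertRule("slack", 2), AlertRule("sms", 3)]


def channels_to_notify(consecutive_failures: int, rules=RULES) -> list[str]:
    """Return the channels whose threshold is reached exactly at this failure count."""
    return [r.channel for r in rules if r.threshold == consecutive_failures]


def simulate(check_results: list[bool]) -> list[tuple[int, str]]:
    """Walk check results (True = healthy) and record (check_index, channel) alerts."""
    alerts = []
    streak = 0
    for i, ok in enumerate(check_results):
        streak = 0 if ok else streak + 1  # assumption: a healthy check resets the streak
        for channel in channels_to_notify(streak):
            alerts.append((i, channel))
    return alerts


print(simulate([True, False, False, True, False, False, False]))
```

Firing each channel only when its threshold is first reached avoids re-sending the same alert on every failed check during a long outage.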
Incident Response Playbook
Severity 1 — Total Outage
1. Acknowledge within 5 minutes
2. Status page update within 10 minutes
3. Root cause identification within 30 minutes
4. Resolution or workaround within 2 hours
5. Post-mortem within 24 hours
Severity 2 — Degraded Performance
1. Acknowledge within 15 minutes
2. Investigation started within 30 minutes
3. Resolution within 4 hours
4. Summary report within 48 hours
Severity 3 — Minor Issue
1. Acknowledge within 1 hour
2. Resolution within 24 hours
3. Logged for next review cycle
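The playbook's time limits can be turned into concrete due times for a given incident. The sketch below is one possible encoding in Python. It assumes every limit is measured from the moment the incident is detected, which the playbook does not state, and it omits the Severity 3 "logged for next review cycle" step because it has no time limit. `PLAYBOOK`, `deadlines`, and `past_due` are illustrative names.

```python
from datetime import datetime, timedelta

# Time limits per severity, copied from the playbook.
# Assumption: offsets are measured from detection time.
PLAYBOOK = {
    1: [
        ("acknowledge", timedelta(minutes=5)),
        ("update status page", timedelta(minutes=10)),
        ("identify root cause", timedelta(minutes=30)),
        ("resolve or provide workaround", timedelta(hours=2)),
        ("publish post-mortem", timedelta(hours=24)),
    ],
    2: [
        ("acknowledge", timedelta(minutes=15)),
        ("start investigation", timedelta(minutes=30)),
        ("resolve", timedelta(hours=4)),
        ("send summary report", timedelta(hours=48)),
    ],
    3: [
        ("acknowledge", timedelta(hours=1)),
        ("resolve", timedelta(hours=24)),
    ],
}


def deadlines(severity: int, detected_at: datetime) -> list[tuple[str, datetime]]:
    """Return (step, due time) pairs for an incident of the given severity."""
    if severity not in PLAYBOOK:
        raise ValueError(f"unknown severity: {severity}")
    return [(step, detected_at + offset) for step, offset in PLAYBOOK[severity]]


def past_due(severity: int, detected_at: datetime, now: datetime) -> list[str]:
    """Return the steps whose due time has already passed at `now`."""
    return [step for step, due in deadlines(severity, detected_at) if now > due]


start = datetime(2024, 1, 1, 9, 0)
for step, due in deadlines(1, start):
    print(f"{due:%Y-%m-%d %H:%M}  {step}")
```

`past_due` only compares clock times; tracking which steps were actually completed would need incident state that is outside this sketch.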
Error Budget Calculator
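```
Monthly minutes: 43,200 (30 days)
99.9% SLA = 43.2 minutes of downtime allowed
99.5% SLA = 216 minutes of downtime allowed
99.0% SLA = 432 minutes of downtime allowed
Burn rate = (actual downtime / budget) × 100
If 2+ weeks remain and burn rate > 50% → review needed
If burn rate > 80% → freeze deployments
```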
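The calculator's rules are simple enough to automate. A 30-day month has 43,200 minutes, so a 99.9% target allows 43.2 minutes of downtime. Burn rate is the share of that budget already used; above 50% with two or more weeks left calls for a review, and above 80% calls for a deployment freeze. A minimal Python sketch of those rules follows. The function names are illustrative, and reading "2+ weeks" as 14 or more days is an assumption.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def error_budget_minutes(sla_percent: float) -> float:
    """Downtime allowed per 30-day month for a given SLA target."""
    if not 0 <= sla_percent < 100:
        raise ValueError("sla_percent must be at least 0 and below 100")
    return MINUTES_PER_MONTH * (100 - sla_percent) / 100


def burn_rate_percent(downtime_minutes: float, sla_percent: float) -> float:
    """Share of the monthly budget already used, as the calculator defines burn rate."""
    return downtime_minutes / error_budget_minutes(sla_percent) * 100


def budget_action(downtime_minutes: float, sla_percent: float, days_remaining: int) -> str:
    """Apply the two rules: freeze above 80%, review above 50% with 2+ weeks left."""
    burn = burn_rate_percent(downtime_minutes, sla_percent)
    if burn > 80:
        return "freeze deployments"
    if burn > 50 and days_remaining >= 14:  # assumption: "2+ weeks" means 14 or more days
        return "review"
    return "ok"


for target in (99.9, 99.5, 99.0):
    print(f"{target}% SLA allows {error_budget_minutes(target):.1f} minutes")

print(budget_action(25, 99.9, days_remaining=20))
print(budget_action(40, 99.9, days_remaining=5))
```

The freeze rule is checked first so that a nearly exhausted budget always takes precedence over the softer review rule.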
Status Page Template
Provide clients with a public status page showing:
- Current system status (operational / degraded / outage)
- Component-level status (Agent A, Agent B, API, Dashboard)
- Uptime percentage (30-day rolling)
- Incident history with resolution notes
- Scheduled maintenance windows
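The 30-day rolling uptime percentage can be derived from a log of outage intervals. The Python sketch below shows one way to compute it. It assumes outages are recorded as (start, end) pairs that do not overlap each other, and it clips any outage that straddles the edge of the window. `rolling_uptime_percent` is a hypothetical helper, not part of any status page product.

```python
from datetime import datetime, timedelta


def rolling_uptime_percent(
    outages: list[tuple[datetime, datetime]],
    now: datetime,
    window: timedelta = timedelta(days=30),
) -> float:
    """Uptime percentage over the window ending at `now`.

    Assumption: outage intervals do not overlap each other.
    """
    window_start = now - window
    down = timedelta()
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        overlap_start = max(start, window_start)
        overlap_end = min(end, now)
        if overlap_end > overlap_start:
            down += overlap_end - overlap_start
    return 100 * (1 - down / window)


now = datetime(2024, 3, 1)
outages = [(datetime(2024, 2, 20, 14, 0), datetime(2024, 2, 20, 14, 30))]
print(f"{rolling_uptime_percent(outages, now):.3f}%")
```

The same function could be called once per component (Agent A, Agent B, API, Dashboard) to fill in component-level figures.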
Next Steps
Need managed AI agents with built-in SLA monitoring?
→ AfrexAI handles deployment, monitoring, and maintenance for $1,500/mo
→ Book a call: https://calendly.com/cbeckford-afrexai/30min
→ Learn more: https://afrexai-cto.github.io/aaas/landing.html
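SLA Monitor Skill
Purpose
Help teams set up production-grade monitoring for AI agents and automated services. Covers uptime tracking, response-time SLAs, error budgets, and incident escalation.
When to Use
- Deploying AI agents to production
- Setting up monitoring for client-facing automation
- Creating SLA documentation for service agreements
- Building incident response procedures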
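Monitoring Stack Options
Option 1: UptimeRobot (free tier available)
- 50 monitors free, 5-minute check intervals
- HTTP, keyword, ping, and port monitors
- Email, Slack, and webhook alerts
Option 2: Better Stack (formerly Better Uptime)
- Status pages included
- Incident management built in
- Free tier: 10 monitors
Option 3: Self-Hosted (Uptime Kuma)
```bash
docker run -d --restart=always -p 3001:3001 -v uptime-kuma:/app/data --name uptime-kuma louislam/uptime-kuma:1
```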
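SLA Tiers
Tier 1: Standard ($1,500/mo)
- 99.5% uptime guarantee (43.8h downtime/year)
- Response within 4 hours (business hours)
- Monthly performance report
Tier 2: Professional ($3,000/mo)
- 99.9% uptime guarantee (8.76h downtime/year)
- Response within 1 hour (business hours)
- Weekly performance reports
- Quarterly optimization reviews
Tier 3: Enterprise ($5,000+/mo)
- 99.95% uptime guarantee (4.38h downtime/year)
- Response within 15 minutes (24/7)
- Real-time dashboard access
- Dedicated support channel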
Alert Configuration Template
```yaml
monitors:
  - name: Agent Health Check
    type: http
    url: https://your-agent-endpoint/health
    interval: 300  # 5 minutes
    alerts:
      - type: email
        threshold: 1  # alert after 1 failure
      - type: slack
        webhook: ${SLACK_WEBHOOK}
        threshold: 2  # alert after 2 consecutive failures
      - type: sms
        threshold: 3  # escalate after 3 failures
  - name: API Response Time
    type: http
    url: https://your-agent-endpoint/api
    interval: 60
    expected_response_time: 2000  # milliseconds
    alerts:
      - type: slack
        condition: response_time > 5000
error_budget:
  monthly_target: 99.9
  burn_rate_alert: 2.0  # alert when the burn rate reaches 2x normal
```
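Incident Response Playbook
Severity 1: Total Outage
1. Acknowledge within 5 minutes
2. Status page update within 10 minutes
3. Root cause identification within 30 minutes
4. Resolution or workaround within 2 hours
5. Post-mortem within 24 hours
Severity 2: Degraded Performance
1. Acknowledge within 15 minutes
2. Investigation started within 30 minutes
3. Resolution within 4 hours
4. Summary report within 48 hours
Severity 3: Minor Issue
1. Acknowledge within 1 hour
2. Resolution within 24 hours
3. Logged for next review cycle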
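Error Budget Calculator
```
Monthly minutes: 43,200 (30 days)
99.9% SLA = 43.2 minutes of downtime allowed
99.5% SLA = 216 minutes of downtime allowed
99.0% SLA = 432 minutes of downtime allowed
Burn rate = (actual downtime / budget) × 100
If 2+ weeks remain and burn rate > 50% → review needed
If burn rate > 80% → freeze deployments
```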
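Status Page Template
Provide clients with a public status page showing:
- Current system status (operational / degraded / outage)
- Component-level status (Agent A, Agent B, API, Dashboard)
- Uptime percentage (30-day rolling)
- Incident history with resolution notes
- Scheduled maintenance windows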
Next Steps
Need managed AI agents with built-in SLA monitoring?
→ AfrexAI handles deployment, monitoring, and maintenance for $1,500/mo
→ Book a call: https://calendly.com/cbeckford-afrexai/30min
→ Learn more: https://afrexai-cto.github.io/aaas/landing.html