AWS ECS Monitor
Production health monitoring and log analysis for AWS ECS services.
What It Does
- Health Checks: HTTP probes against your domain, ECS service status (desired vs running), ALB target group health, SSL certificate expiry
- Log Analysis: Pulls CloudWatch logs, categorizes errors (panics, fatals, OOM, timeouts, 5xx), detects container restarts, filters health check noise
- Auto-Diagnosis: Reads health status and automatically investigates failing services via log analysis
Prerequisites
- aws CLI configured with appropriate IAM permissions:
  - ecs:ListServices, ecs:DescribeServices
  - elasticloadbalancing:DescribeTargetGroups, elasticloadbalancing:DescribeTargetHealth
  - logs:FilterLogEvents, logs:DescribeLogGroups
- curl for HTTP health checks
- python3 for JSON processing and log analysis
- openssl for SSL certificate checks (optional)
Configuration
All configuration is via environment variables:
| Variable | Required | Default | Description |
|---|---|---|---|
| ECS_CLUSTER | Yes | none | ECS cluster name |
| ECS_REGION | No | us-east-1 | AWS region |
| ECS_DOMAIN | No | none | Domain for HTTP/SSL checks (skipped if unset) |
| ECS_SERVICES | No | auto-detect | Comma-separated service names to monitor |
| ECS_HEALTH_STATE | No | ./data/ecs-health.json | Path to write health state JSON |
| ECS_HEALTH_OUTDIR | No | ./data/ | Output directory for logs and alerts |
| ECS_LOG_PATTERN | No | /ecs/{service} | CloudWatch log group pattern ({service} is replaced with the service name) |
| ECS_HTTP_ENDPOINTS | No | none | Comma-separated name=url pairs for HTTP probes |
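The name=url format of ECS_HTTP_ENDPOINTS can be split with plain shell parameter expansion. The following is a minimal sketch of one way to do it, not the scripts' actual parsing code; the function name parse_endpoints and the example URLs are hypothetical.

```bash
# Hypothetical helper (not part of the shipped scripts): print one
# "name url" line per entry of a comma-separated name=url list.
parse_endpoints() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue   # skip empty entries, e.g. from a trailing comma
    # Split on the first "=" only, so URLs containing "=" stay intact.
    printf '%s %s\n' "${entry%%=*}" "${entry#*=}"
  done
}

# Example input in the ECS_HTTP_ENDPOINTS format (URLs are placeholders).
parse_endpoints "api=https://example.com/health,web=https://example.com/?probe=1"
```

Splitting on the first "=" rather than every "=" matters because probe URLs may carry query strings.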
Directories Written
- ECS_HEALTH_STATE (default: ./data/ecs-health.json): health state JSON file
- ECS_HEALTH_OUTDIR (default: ./data/): output directory for logs, alerts, and analysis reports
Scripts
scripts/ecs-health.sh — Health Monitor
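```bash
# Full check
ECS_CLUSTER=my-cluster ECS_DOMAIN=example.com ./scripts/ecs-health.sh

# JSON output only
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --json

# Quiet mode (no alerts, state file only)
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --quiet
```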
Exit codes: 0 = healthy, 1 = unhealthy/degraded, 2 = script error
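Because the three exit codes are distinct, a cron job or CI step can branch on them. The sketch below is one possible wrapper, assuming only the exit codes documented above; the function name health_label is hypothetical, and a stand-in command is used in place of the real script so the example runs anywhere.

```bash
# Hypothetical helper: map a documented ecs-health.sh exit code to a label.
health_label() {
  case "$1" in
    0) echo "healthy" ;;
    1) echo "unhealthy or degraded" ;;
    2) echo "script error" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

status=0
# Stand-in for: ./scripts/ecs-health.sh --quiet
sh -c 'exit 1' || status=$?
health_label "$status"
```

Capturing the status with `|| status=$?` keeps the wrapper usable under `set -e`.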
scripts/cloudwatch-logs.sh — Log Analyzer
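```bash
# Pull raw logs from one service
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh pull my-api --minutes 30

# Show errors across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh errors all --minutes 120

# Deep analysis with error categorization
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose --minutes 60

# Detect container restarts
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh restarts my-api

# Auto-diagnose from the health state file
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh auto-diagnose

# Summary across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh summary --minutes 120
```
Options: --minutes N (default: 60), --json, --limit N (default: 200), --verbose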
Auto-Detection
When ECS_SERVICES is not set, both scripts auto-detect services from the cluster:
```bash
aws ecs list-services --cluster "$ECS_CLUSTER"
```
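aws ecs list-services returns service ARNs rather than bare names, so auto-detection needs a step that reduces each ARN to a service name. How the scripts do this is not documented here; the sketch below shows one way, using a hypothetical helper service_name_from_arn and a made-up account ID.

```bash
# Hypothetical helper: the service name is the last "/"-separated
# segment of an ECS service ARN.
service_name_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Example ARN (account ID and names are placeholders).
service_name_from_arn "arn:aws:ecs:us-east-1:123456789012:service/my-cluster/my-api"

# Against a real cluster, this could be fed from the CLI, for example:
#   aws ecs list-services --cluster "$ECS_CLUSTER" \
#     --query 'serviceArns[]' --output text
```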
Log groups are resolved by pattern (default /ecs/{service}). Override with ECS_LOG_PATTERN:
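```bash
# If your log groups are /ecs/prod/my-api, /ecs/prod/my-frontend, etc.
ECS_LOG_PATTERN=/ecs/prod/{service} ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose
```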
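The pattern substitution itself is simple to reason about. As an illustration only (the function name log_group_for is hypothetical, and the scripts may implement this differently), the {service} placeholder can be expanded like this:

```bash
# Hypothetical helper: expand {service} in ECS_LOG_PATTERN, falling back
# to the documented default. Assumes service names contain no "|" or "&",
# which holds for ECS service names.
log_group_for() {
  local default_pattern='/ecs/{service}'
  local pattern="${ECS_LOG_PATTERN:-$default_pattern}"
  printf '%s\n' "$pattern" | sed "s|{service}|$1|g"
}

log_group_for my-api
ECS_LOG_PATTERN='/ecs/prod/{service}' log_group_for my-api
```

The default is held in a separate variable because writing it inline as `${ECS_LOG_PATTERN:-/ecs/{service}}` would end the expansion at the first closing brace.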
Integration
The health monitor can trigger the log analyzer for auto-diagnosis when issues are detected. Set ECS_HEALTH_OUTDIR to a shared directory and run both scripts together:
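```bash
export ECS_CLUSTER=my-cluster
export ECS_DOMAIN=example.com
export ECS_HEALTH_OUTDIR=./data

# Run the health check (auto-triggers log analysis on failure)
./scripts/ecs-health.sh

# Or run log analysis independently
./scripts/cloudwatch-logs.sh auto-diagnose --minutes 30
```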
Error Categories
The log analyzer classifies errors into:
- panic: Go panics
- fatal: Fatal errors
- oom: Out of memory
- timeout: Connection/request timeouts
- connection_error: Connection refused/reset
- http_5xx: HTTP 500-level responses
- python_traceback: Python tracebacks
- exception: Generic exceptions
- auth_error: Permission/authorization failures
- structured_error: JSON-structured error logs
- error: Generic ERROR-level messages
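To make the idea of category matching concrete, here is a rough sketch of a first-match-wins classifier for a single log line. The match strings are assumptions chosen for illustration and cover only some of the categories; the analyzer's real patterns are not documented here, and classify_line is a hypothetical name.

```bash
# Hypothetical classifier: the first matching pattern wins, so more
# specific categories are listed before the generic "error" fallback.
# All match strings below are illustrative assumptions.
classify_line() {
  case "$1" in
    *panic:*)                                     echo panic ;;
    *FATAL*|*fatal*)                              echo fatal ;;
    *OutOfMemory*|*"out of memory"*|*OOM*)        echo oom ;;
    *timeout*|*"timed out"*)                      echo timeout ;;
    *"connection refused"*|*"connection reset"*)  echo connection_error ;;
    *Traceback*)                                  echo python_traceback ;;
    *ERROR*)                                      echo error ;;
    *)                                            echo none ;;
  esac
}

classify_line "panic: runtime error: index out of range"
classify_line "dial tcp 10.0.0.5:5432: connection refused"
```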
Health check noise (GET/HEAD /health from ALB) is automatically filtered from error counts and HTTP status distribution.
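A filter of this kind can be approximated with grep. The sketch below assumes access-log-style lines that contain the request method and path; the exact log format and the function name filter_health_noise are assumptions, not taken from the scripts.

```bash
# Hypothetical filter: drop GET/HEAD requests to exactly /health while
# keeping other paths such as /healthz. grep exits 1 when nothing is
# left; that case is treated as success, real grep errors still fail.
filter_health_noise() {
  grep -Ev '(GET|HEAD) /health([ ?"]|$)' || [ "$?" -eq 1 ]
}

printf '%s\n' \
  'GET /health HTTP/1.1 200' \
  'HEAD /health HTTP/1.1 200' \
  'GET /healthz HTTP/1.1 500' \
  'GET /api/users HTTP/1.1 502' | filter_health_noise
```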
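AWS ECS Monitor
Production health monitoring and log analysis for AWS ECS services.
What It Does
- Health Checks: HTTP probes against your domain, ECS service status (desired vs running), ALB target group health, SSL certificate expiry
- Log Analysis: Pulls CloudWatch logs, categorizes errors (panics, fatals, OOM, timeouts, 5xx), detects container restarts, filters health check noise
- Auto-Diagnosis: Reads health status and automatically investigates failing services via log analysis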
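Prerequisites
- aws CLI configured with appropriate IAM permissions:
  - ecs:ListServices, ecs:DescribeServices
  - elasticloadbalancing:DescribeTargetGroups, elasticloadbalancing:DescribeTargetHealth
  - logs:FilterLogEvents, logs:DescribeLogGroups
- curl for HTTP health checks
- python3 for JSON processing and log analysis
- openssl for SSL certificate checks (optional)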
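Configuration
All configuration is via environment variables:
| Variable | Required | Default | Description |
|---|---|---|---|
| ECS_CLUSTER | Yes | none | ECS cluster name |
| ECS_REGION | No | us-east-1 | AWS region |
| ECS_DOMAIN | No | none | Domain for HTTP/SSL checks (skipped if unset) |
| ECS_SERVICES | No | auto-detect | Comma-separated service names to monitor |
| ECS_HEALTH_STATE | No | ./data/ecs-health.json | Path to write health state JSON |
| ECS_HEALTH_OUTDIR | No | ./data/ | Output directory for logs and alerts |
| ECS_LOG_PATTERN | No | /ecs/{service} | CloudWatch log group pattern ({service} is replaced with the service name) |
| ECS_HTTP_ENDPOINTS | No | none | Comma-separated name=url pairs for HTTP probes |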
Directories Written
- ECS_HEALTH_STATE (default: ./data/ecs-health.json): health state JSON file
- ECS_HEALTH_OUTDIR (default: ./data/): output directory for logs, alerts, and analysis reports
Scripts
scripts/ecs-health.sh: Health Monitor
```bash
# Full check
ECS_CLUSTER=my-cluster ECS_DOMAIN=example.com ./scripts/ecs-health.sh

# JSON output only
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --json

# Quiet mode (no alerts, state file only)
ECS_CLUSTER=my-cluster ./scripts/ecs-health.sh --quiet
```
Exit codes: 0 = healthy, 1 = unhealthy/degraded, 2 = script error
scripts/cloudwatch-logs.sh: Log Analyzer
```bash
# Pull raw logs from one service
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh pull my-api --minutes 30

# Show errors across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh errors all --minutes 120

# Deep analysis with error categorization
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose --minutes 60

# Detect container restarts
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh restarts my-api

# Auto-diagnose from the health state file
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh auto-diagnose

# Summary across all services
ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh summary --minutes 120
```
Options: --minutes N (default: 60), --json, --limit N (default: 200), --verbose
Auto-Detection
When ECS_SERVICES is not set, both scripts auto-detect services from the cluster:
```bash
aws ecs list-services --cluster "$ECS_CLUSTER"
```
Log groups are resolved by pattern (default /ecs/{service}). Override with ECS_LOG_PATTERN:
```bash
# If your log groups are /ecs/prod/my-api, /ecs/prod/my-frontend, etc.
ECS_LOG_PATTERN=/ecs/prod/{service} ECS_CLUSTER=my-cluster ./scripts/cloudwatch-logs.sh diagnose
```
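Integration
The health monitor can trigger the log analyzer for auto-diagnosis when issues are detected. Set ECS_HEALTH_OUTDIR to a shared directory and run both scripts together:
```bash
export ECS_CLUSTER=my-cluster
export ECS_DOMAIN=example.com
export ECS_HEALTH_OUTDIR=./data

# Run the health check (auto-triggers log analysis on failure)
./scripts/ecs-health.sh

# Or run log analysis independently
./scripts/cloudwatch-logs.sh auto-diagnose --minutes 30
```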
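Error Categories
The log analyzer classifies errors into:
- panic: Go panics
- fatal: Fatal errors
- oom: Out of memory
- timeout: Connection/request timeouts
- connection_error: Connection refused/reset
- http_5xx: HTTP 500-level responses
- python_traceback: Python tracebacks
- exception: Generic exceptions
- auth_error: Permission/authorization failures
- structured_error: JSON-structured error logs
- error: Generic ERROR-level messages
Health check noise (GET/HEAD /health from ALB) is automatically filtered from error counts and HTTP status distribution.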