Monitoring Dashboard Audit
Structured assessment of monitoring infrastructure for network operations.
Evaluates Grafana dashboards, PromQL query efficiency, alert rule
configuration, SLA/SLO reporting accuracy, and Prometheus data source
health. This skill reads monitoring configuration and metrics — it does
not create, modify, or delete dashboards, alerts, or data sources.
See references/cli-reference.md for Grafana and Prometheus API
commands organized by audit step, and references/query-reference.md for
PromQL patterns covering network interface utilization, error rates, BGP
peer state, and SNMP-derived metric evaluation.
When to Use
- Monitoring gap assessment: verifying that all critical network infrastructure has dashboard coverage and active alerting
- Dashboard quality review: evaluating whether existing Grafana dashboards present accurate, actionable data to operations teams
- Alert fatigue investigation: auditing when teams report excessive or irrelevant alert notifications that mask genuine incidents
- SLA/SLO compliance review: validating that error budget calculations and availability metrics reflect actual service delivery
- Pre-migration monitoring readiness: confirming monitoring will survive infrastructure changes (new devices, topology changes, platform migrations)
- Post-incident review: assessing whether monitoring detected the incident, how quickly alerts fired, and what gaps allowed silent failures
Prerequisites
- Grafana access: API token or service account with Viewer role minimum (grafana_url and Authorization: Bearer <token> header confirmed working)
- Prometheus access: HTTP API reachable at prometheus_url/api/v1/status/config (no authentication required by default, or appropriate auth header configured)
- Network scope defined: device inventory, subnet ranges, and critical service list available for coverage gap analysis
- Baseline documentation: existing SLA/SLO targets, expected alert thresholds, and operations runbooks available for comparison
- Timestamp awareness: confirm NTP synchronization across the monitoring stack; Prometheus scrape timestamps and Grafana time range selections depend on consistent clocks
Procedure
Follow these six steps sequentially. Each step produces findings that
feed the monitoring coverage scorecard and optimization recommendations
in Step 6.
Step 1: Dashboard Inventory
Enumerate all Grafana dashboards to establish the monitoring surface
area and identify coverage gaps, stale dashboards, and organizational
issues.
Query the Grafana API to list all dashboards with metadata:
GET /api/search?type=dash-db&limit=5000
For each dashboard, record: uid, title, folder, tags, and last-updated
timestamp. Identify dashboards not updated in over 180 days as staleness
candidates — these may reference deprecated metrics or decommissioned
infrastructure.
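The 180-day staleness rule can be sketched as a small filter. This is a minimal illustration, not the audit's actual tooling: the record keys `uid` and `updated` (an ISO 8601 timestamp) are assumptions about how the inventory was exported and may need adapting.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=180)  # staleness threshold from the audit text


def find_stale_dashboards(dashboards, now=None):
    """Return uids of dashboards last updated more than STALE_AFTER ago.

    `dashboards` is a list of dicts with the hypothetical keys "uid" and
    "updated" (ISO 8601 string); these names are assumptions.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for dash in dashboards:
        # Older Python versions reject a trailing "Z" in fromisoformat().
        updated = datetime.fromisoformat(dash["updated"].replace("Z", "+00:00"))
        if updated.tzinfo is None:
            # Treat naive timestamps as UTC so the subtraction is valid.
            updated = updated.replace(tzinfo=timezone.utc)
        if now - updated > STALE_AFTER:
            stale.append(dash["uid"])
    return stale
```

Passing `now` explicitly keeps the check reproducible when the audit is re-run against a saved inventory.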
Examine folder organization. Flat folder structures with 50+ dashboards
at root level indicate organizational debt. Check for naming convention
adherence — dashboards without consistent prefixes or tags reduce
discoverability during incidents.
Retrieve the full JSON model for each dashboard:
GET /api/dashboards/uid/
Record panel count, data source references, and template variables.
Flag dashboards with hardcoded time ranges (no relative time selector)
and panels referencing nonexistent data sources — these produce empty
panels that erode operator trust.
Build a coverage matrix: map dashboard panels to infrastructure
inventory. Devices, interfaces, or services present in inventory but
absent from any dashboard represent monitoring blind spots.
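One way to compute the blind spots is a set difference between the inventory and everything referenced by any dashboard. The sketch below assumes the panel-to-device mapping has already been extracted; the function name and the input shapes are hypothetical.

```python
def find_blind_spots(inventory, dashboard_targets):
    """Return inventory items that no dashboard references, sorted.

    `inventory` is an iterable of device/interface/service names.
    `dashboard_targets` maps a dashboard uid to the set of names its
    panels reference (an assumed, pre-extracted structure).
    """
    covered = set()
    for targets in dashboard_targets.values():
        covered.update(targets)
    return sorted(set(inventory) - covered)
```

For example, an inventory of `core-rtr-1`, `wan-edge-1`, and `fw-1` with dashboards covering only the router and firewall would report `wan-edge-1` as a blind spot.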
Step 2: Panel and Query Analysis
Evaluate PromQL query efficiency, panel threshold configuration,
and visualization appropriateness across all dashboards.
Extract all PromQL expressions from dashboard panel JSON targets.
For each query, assess:
Rate function usage — rate() requires a range vector at least
two scrape intervals wide. Replace irate() in dashboard panels that
display trends over long time ranges — irate() only uses the last
two data points and is appropriate only for volatile, short-window displays.
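The rate() window check can be approximated with a regular expression over extracted expressions. This is a rough sketch under stated assumptions: it only understands single-unit durations such as `15s` or `5m`, and it ignores compound durations, subqueries, and Grafana variables such as `$__rate_interval`.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
# Matches rate(<selector>[<n><unit>]); the \b keeps irate() from matching.
_RATE_RANGE = re.compile(r"\brate\s*\([^\[\]]*\[(\d+)(ms|s|m|h|d|w)\]")


def short_rate_windows(expr, scrape_interval_s):
    """Return rate() windows (seconds) narrower than two scrape intervals."""
    windows = []
    for value, unit in _RATE_RANGE.findall(expr):
        seconds = int(value) * _UNIT_SECONDS[unit]
        if seconds < 2 * scrape_interval_s:
            windows.append(seconds)
    return windows
```

A full PromQL parser would be more reliable; the regex is only meant to surface obvious candidates for manual review.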
Recording rule candidates — Complex queries repeating across
multiple dashboards (same expression in 3+ panels) should be recording
rules. Identify these by hashing normalized PromQL expressions.
Common candidates: interface utilization calculations, error rate
ratios, and aggregated availability metrics.
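Hashing normalized expressions, as described above, might look like the following sketch. The normalization here only collapses whitespace, which is an assumption about what "normalized" means; it would also collapse whitespace inside quoted label values.

```python
import hashlib
import re
from collections import defaultdict


def normalize(expr):
    """Collapse runs of whitespace so formatting differences hash the same."""
    return re.sub(r"\s+", " ", expr).strip()


def recording_rule_candidates(panels, min_panels=3):
    """Return {normalized_expr: [panel_id, ...]} for repeated expressions.

    `panels` is an iterable of (panel_id, expr) pairs; an expression is a
    candidate when it appears in at least `min_panels` panels.
    """
    groups = defaultdict(list)
    exprs = {}
    for panel_id, expr in panels:
        norm = normalize(expr)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        groups[digest].append(panel_id)
        exprs[digest] = norm
    return {exprs[d]: ids for d, ids in groups.items() if len(ids) >= min_panels}
```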
Label cardinality — Queries aggregating across high-cardinality
labels without explicit filtering generate expensive computation.
Flag queries with no label matchers on high-cardinality metrics and
queries using {__name__=~".*"} patterns.
Panel thresholds — Verify gauge and stat panels have threshold
values configured. Panels displaying utilization or error rates
without color-coded thresholds fail to provide at-a-glance severity.
Compare configured thresholds against operational standards (e.g.,
interface utilization warning at 70%, critical at 90%).
Visualization appropriateness — Time series data on gauge panels
loses temporal context. Single-value stats for volatile metrics mislead
operators. Table panels with 100+ rows without sorting are unusable
during incidents.
Step 3: Alert Rule Validation
Assess alert rule configuration for detection coverage, threshold
accuracy, and notification reliability.
Retrieve all alert rules from the Grafana alerting API:
GET /api/v1/provisioning/alert-rules
For Prometheus-native alerting, also query:
GET /api/v1/rules?type=alert
For each alert rule, evaluate:
Threshold appropriateness — Alert thresholds should align with
operational impact, not arbitrary percentages. Interface utilization
alerts at 50% on a 10Gbps link are premature; at 95% on a 100Mbps
WAN link they are late. Cross-reference thresholds against link
capacity and historical peak usage from Prometheus.
Evaluation intervals — Alert rules evaluated every 5 minutes
cannot detect sub-minute outages. For critical infrastructure (WAN
links, core routers), evaluation intervals should match or be less
than the Prometheus scrape interval. Flag alert groups where evaluation
interval exceeds the scrape interval of the underlying metrics.
Pending and for durations — for: 0s alerts fire on transient
spikes and contribute to alert fatigue. for: 30m on critical
infrastructure means 30 minutes of unnotified outage. Recommended
ranges: Critical alerts for: 2m-5m, Warning alerts for: 5m-15m.
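The recommended ranges can be encoded as a small lookup for bulk checking. This sketch assumes the `for` durations have already been converted to seconds and that severities are labeled `critical` or `warning`; both are assumptions about the rule export.

```python
# Recommended `for` ranges from the text, in seconds.
RECOMMENDED_FOR_SECONDS = {
    "critical": (120, 300),  # 2m-5m
    "warning": (300, 900),   # 5m-15m
}


def check_for_duration(severity, for_seconds):
    """Classify a rule's `for` duration as too_short, too_long, or ok."""
    low, high = RECOMMENDED_FOR_SECONDS[severity]
    if for_seconds < low:
        return "too_short"
    if for_seconds > high:
        return "too_long"
    return "ok"
```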
Notification channels — Verify all alert rules have at least one
active notification channel. Check channel health — Slack webhooks
return 200, PagerDuty keys are valid, email SMTP is reachable.
Routing and silencing — Review Alertmanager routing tree for
catch-all routes dumping all alerts to a single channel. Verify
silences have expiration times. Active silences without expiration
mask ongoing problems indefinitely.
Escalation completeness — Critical alerts should escalate from
Slack/email to PagerDuty/phone after acknowledgment timeout. Alert
rules with only Slack notification for Critical-severity failures
lack escalation depth.
Step 4: SLA/SLO Reporting
Validate that SLA/SLO dashboards and calculations accurately reflect
actual service delivery.
Error budget calculation — Verify the formula:
error_budget_remaining = 1 - (actual_errors / allowed_errors).
Common mistakes: using the wrong time window (calendar month vs
rolling 30 days), excluding planned maintenance from downtime
calculations, or computing availability from a single data source
when the service spans multiple components.
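As a worked instance of the formula, the sketch below derives allowed_errors from an SLO target and an event count. That derivation, `allowed = (1 - slo_target) * total_events`, is an assumption for a request-based SLO; time-based SLOs would substitute minutes of downtime.

```python
def error_budget_remaining(slo_target, total_events, bad_events):
    """Return 1 - (actual_errors / allowed_errors) as a fraction.

    `slo_target` is a fraction such as 0.999. A negative result means the
    budget is overspent.
    """
    allowed = (1.0 - slo_target) * total_events
    if allowed <= 0:
        raise ValueError("SLO target and event count leave no error budget")
    return 1.0 - bad_events / allowed
```

With a 99.9% target over 1,000,000 requests, 1,000 errors are allowed, so 250 observed errors leave roughly 75% of the budget.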
Availability SLIs — Check that uptime percentage, MTTR (Mean
Time to Repair), and MTBF (Mean Time Between Failures) use correct
inputs. Uptime should reference probe-based measurement (blackbox
exporter, synthetic checks), not just device-reported status. MTTR
excluding detection time understates actual recovery duration.
Burn rate alerting — Multi-window burn rate alerting provides
early warning when error budget consumption accelerates. Verify burn
rate alerts use at least two windows (e.g., 1-hour and 6-hour). A
single-window alert either fires too late or too often. Check that
severity maps to budget consumption: a 14.4x burn rate over 1 hour
warrants page-level severity.
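The arithmetic behind the 14.4x figure can be checked directly. Assuming a 30-day (720-hour) SLO period, which the text does not state explicitly, a burn rate sustained over a window consumes `burn_rate * window / period` of the budget.

```python
def budget_consumed(burn_rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the error budget consumed by a sustained burn rate.

    A burn rate of 1.0 spends the budget exactly over the SLO period;
    the 30-day default period is an assumption.
    """
    return burn_rate * window_hours / slo_period_hours
```

Under that assumption, 14.4x for one hour consumes about 2% of the monthly budget, which is why it is commonly paired with page-level severity.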
Multi-window alert patterns — Confirm long-window alerts (6h, 3d)
for trend detection and short-window alerts (5m, 1h) for rapid
response. Verify severity increases with burn rate magnitude.
Step 5: Data Source Health
Assess Prometheus scrape target status, metric cardinality, retention
configuration, and remote write health.
Scrape target status — Query the targets API:
GET /api/v1/targets?state=active
Check the up metric across all scrape targets. Targets with
up == 0 are failing to scrape — investigate network reachability,
exporter health, or authentication issues. Targets with
scrape_duration_seconds exceeding the scrape interval are timing
out, producing gaps in metrics and potentially triggering false alerts.
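Once the targets response is saved as JSON, failing targets can be listed with a short filter. The sketch assumes the usual response shape (`data.activeTargets` entries carrying `health`, `scrapeUrl`, and `lastError`); verify the field names against the Prometheus version in use.

```python
def failing_targets(targets_response):
    """Return (scrapeUrl, lastError) for active targets not reporting "up".

    `targets_response` is the parsed JSON body of the targets API call.
    """
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]
```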
Cardinality assessment — Query TSDB status:
GET /api/v1/status/tsdb
Identify the top 10 metrics by series count. Network environments
commonly see cardinality explosion from per-interface SNMP metrics
on large chassis devices. Metrics exceeding 10,000 active series per
name warrant investigation for label optimization or aggregation.
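The 10,000-series threshold can be applied to the TSDB status response as sketched below. It assumes the response exposes `data.seriesCountByMetricName` as a list of name/value pairs; check that field against your Prometheus version before relying on it.

```python
def high_cardinality_metrics(tsdb_status, limit=10_000):
    """Return (metric_name, series_count) pairs exceeding `limit`.

    `tsdb_status` is the parsed JSON body of the TSDB status call.
    """
    rows = tsdb_status.get("data", {}).get("seriesCountByMetricName", [])
    return [
        (row["name"], int(row["value"]))
        for row in rows
        if int(row["value"]) > limit
    ]
```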
Retention and storage — Verify retention period covers the longest
SLA reporting window. A 15-day retention with 30-day SLA dashboards
produces incomplete reports. Check WAL size — a WAL larger than 20%
of total TSDB size may indicate write amplification from high churn.
Remote write health — If Prometheus uses remote write (Thanos,
Cortex, Mimir, VictoriaMetrics), check remote storage lag metrics.
Lag exceeding 5 minutes means long-term store is behind real-time.
Flag failed sample counters as write failures creating data gaps.
Metric naming conventions: verify that metrics follow Prometheus
naming conventions of snake_case names, unit suffixes (_bytes, _seconds,
_total), and base units. Inconsistent naming makes dashboard authoring
error-prone and cross-device comparison unreliable.
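A basic naming lint might look like the sketch below. The lowercase snake_case pattern is deliberately stricter than what Prometheus accepts as a valid metric name, and the `is_counter` flag is a hypothetical input that would have to come from metric metadata.

```python
import re

# Lowercase snake_case; stricter than the names Prometheus itself permits.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def naming_issues(name, is_counter=False):
    """Return a list of naming-convention issues for one metric name."""
    issues = []
    if not _SNAKE_CASE.match(name):
        issues.append("not_snake_case")
    if is_counter and not name.endswith("_total"):
        issues.append("counter_missing_total_suffix")
    return issues
```

Raw SNMP-derived names such as `ifHCInOctets` will be flagged by this check, which is a useful prompt to decide whether relabeling is worthwhile.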
Step 6: Monitoring Coverage Report
Compile findings into a structured monitoring coverage scorecard
with prioritized optimization recommendations.
Coverage scorecard — Rate each infrastructure category
(core routers, distribution switches, WAN links, firewalls,
load balancers) on a 1–5 scale: 1 = no monitoring, 2 = basic
up/down only, 3 = utilization dashboards, 4 = dashboards with
alerting, 5 = dashboards with alerting and SLO tracking.
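The 1-5 scale translates into a simple cascade. This sketch takes four boolean inputs, which is an assumed simplification; real categories may be only partially covered and need judgment.

```python
def coverage_score(has_up_down, has_dashboards, has_alerting, has_slo):
    """Map monitoring capabilities to the 1-5 scorecard scale."""
    if has_dashboards and has_alerting and has_slo:
        return 5  # dashboards with alerting and SLO tracking
    if has_dashboards and has_alerting:
        return 4  # dashboards with alerting
    if has_dashboards:
        return 3  # utilization dashboards
    if has_up_down:
        return 2  # basic up/down only
    return 1      # no monitoring
```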
Alert quality assessment — Compute alert quality metrics:
alert-to-incident ratio, mean time to acknowledge, silence
frequency per alert rule. High-noise alerts (>80% silenced or
acknowledged-without-action) are candidates for threshold
adjustment or removal.
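The high-noise test reduces to a ratio check per alert rule. The sketch assumes firing counts and silenced-or-no-action counts are already available from incident tooling; how those are gathered is outside this example.

```python
def is_high_noise(fired, silenced_or_no_action, threshold=0.8):
    """Return True when more than `threshold` of firings were noise.

    `fired` is the number of times the rule fired; `silenced_or_no_action`
    counts firings that were silenced or acknowledged without action.
    """
    if fired == 0:
        return False  # a rule that never fired cannot be judged noisy
    return silenced_or_no_action / fired > threshold
```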
PromQL optimization recommendations — For each inefficient
query from Step 2, provide current and optimized expressions.
Prioritize recording rule creation for queries appearing in 3+
panels.
Threshold Tables
| Domain | Severity | Condition | Example |
|---|---|---|---|
| Coverage | Critical | Production device with zero dashboard panels | Core router absent from all dashboards |
| Coverage | High | Service with dashboards but no alerting | WAN link monitored but no capacity alert |
| Coverage | Medium | Dashboard exists but is stale (>180 days) | Last-modified date precedes device refresh |
| Coverage | Low | Dashboard lacks threshold coloring | Utilization panel with no warning/critical bands |
| Query | Critical | PromQL uses absent metric name | Panel returns no data due to renamed metric |
| Query | High | rate() range vector shorter than 2x scrape interval | rate(metric[15s]) with 30s scrape |
| Query | Medium | Repeated query across 3+ panels without recording rule | Same utilization formula in 5 dashboards |
| Query | Low | irate() used for trend display over long time range | irate() on 24h overview panel |
| Alert | Critical | Alert rule with no notification channel | Silent alarm on Critical infrastructure |
| Alert | High | Critical alert with for duration >15m | 15+ minute detection gap |
| Alert | Medium | Warning alert with for duration of 0s | Transient spikes cause alert fatigue |
| Alert | Low | Single notification channel with no escalation | Slack-only for P1 infrastructure alert |
| SLA | Critical | Error budget calculation uses wrong time window | Calendar month vs rolling 30d mismatch |
| SLA | High | Availability SLI excludes detection time from MTTR | MTTR underreported by omitting MTTD |
| SLA | Medium | Single-window burn rate alerting | Only 1h window, no long-term trend window |
| SLA | Low | SLO dashboard missing historical trend comparison | No month-over-month burn rate trend |
| DataSource | Critical | Scrape target down (up == 0) for >5m | SNMP exporter unreachable |
| DataSource | High | Cardinality >10k series per metric name | Per-interface metrics on 500-port chassis |
| DataSource | Medium | Remote write lag >5m | Long-term store behind real-time |
| DataSource | Low | Metric naming violates Prometheus conventions | CamelCase or missing unit suffix |
Decision Trees
CODEBLOCK6
Report Template
CODEBLOCK7
Troubleshooting
Grafana API returns 401/403 — Verify the API token has Viewer
permissions. Service accounts require Viewer role on all folders
containing in-scope dashboards. Admin-level tokens are not required
for read-only audit.
Prometheus API unreachable — Check network connectivity and reverse
proxy configuration. Verify the base URL includes the correct path
prefix if Prometheus runs behind a subpath.
Empty dashboard inventory — Grafana API search returns paginated
results. Increase the limit parameter or paginate. Folder-level
permissions can hide dashboards from restricted tokens.
PromQL queries reference absent metrics — SNMP exporter metric
names change between exporter versions. Check exporter version and
generator configuration when dashboards return no data.
Cardinality data not available — The TSDB status endpoint requires
Prometheus 2.14+. If unavailable, estimate cardinality using count
queries, though these are expensive on large installations.
SLA calculations differ from business reports — Time zone handling
is the most common cause. Prometheus stores UTC timestamps. Verify
all SLA dashboards use explicit UTC time ranges or a consistent
time zone variable.
Alert rules exist in both Grafana and Prometheus — Grafana Unified
Alerting coexists with Prometheus-native alerting. Audit both sources
to avoid duplicate or conflicting coverage. Document which system is
authoritative for each alert category.
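Monitoring Dashboard Audit
Structured assessment of monitoring infrastructure for network operations.
Evaluates Grafana dashboards, PromQL query efficiency, alert rule
configuration, SLA/SLO reporting accuracy, and Prometheus data source
health. This skill reads monitoring configuration and metrics; it does
not create, modify, or delete dashboards, alerts, or data sources.
See references/cli-reference.md for Grafana and Prometheus API
commands organized by audit step, and references/query-reference.md for
PromQL patterns covering network interface utilization, error rates, BGP
peer state, and SNMP-derived metric evaluation.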
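When to Use
- Monitoring gap assessment: verifying that all critical network infrastructure has dashboard coverage and active alerting
- Dashboard quality review: evaluating whether existing Grafana dashboards present accurate, actionable data to operations teams
- Alert fatigue investigation: auditing when teams report excessive or irrelevant alert notifications that mask genuine incidents
- SLA/SLO compliance review: validating that error budget calculations and availability metrics reflect actual service delivery
- Pre-migration monitoring readiness: confirming monitoring will survive infrastructure changes (new devices, topology changes, platform migrations)
- Post-incident review: assessing whether monitoring detected the incident, how quickly alerts fired, and what gaps allowed silent failures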
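Prerequisites
- Grafana access: API token or service account with Viewer role minimum (grafana_url and Authorization: Bearer <token> header confirmed working)
- Prometheus access: HTTP API reachable at prometheus_url/api/v1/status/config (no authentication required by default, or appropriate auth header configured)
- Network scope defined: device inventory, subnet ranges, and critical service list available for coverage gap analysis
- Baseline documentation: existing SLA/SLO targets, expected alert thresholds, and operations runbooks available for comparison
- Timestamp awareness: confirm NTP synchronization across the monitoring stack; Prometheus scrape timestamps and Grafana time range selections depend on consistent clocks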
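Procedure
Follow these six steps sequentially. Each step produces findings that
feed the monitoring coverage scorecard and optimization recommendations
in Step 6.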
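Step 1: Dashboard Inventory
Enumerate all Grafana dashboards to establish the monitoring surface
area and identify coverage gaps, stale dashboards, and organizational
issues.
Query the Grafana API to list all dashboards with metadata:
GET /api/search?type=dash-db&limit=5000
For each dashboard, record: uid, title, folder, tags, and last-updated
timestamp. Identify dashboards not updated in over 180 days as staleness
candidates; these may reference deprecated metrics or decommissioned
infrastructure.
Examine folder organization. Flat folder structures with 50+ dashboards
at root level indicate organizational debt. Check for naming convention
adherence; dashboards without consistent prefixes or tags reduce
discoverability during incidents.
Retrieve the full JSON model for each dashboard:
GET /api/dashboards/uid/
Record panel count, data source references, and template variables.
Flag dashboards with hardcoded time ranges (no relative time selector)
and panels referencing nonexistent data sources; these produce empty
panels that erode operator trust.
Build a coverage matrix: map dashboard panels to infrastructure
inventory. Devices, interfaces, or services present in inventory but
absent from any dashboard represent monitoring blind spots.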
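Step 2: Panel and Query Analysis
Evaluate PromQL query efficiency, panel threshold configuration,
and visualization appropriateness across all dashboards.
Extract all PromQL expressions from dashboard panel JSON targets.
For each query, assess:
Rate function usage: rate() requires a range vector at least
two scrape intervals wide. Replace irate() in dashboard panels that
display trends over long time ranges; irate() only uses the last
two data points and is appropriate only for volatile, short-window displays.
Recording rule candidates: complex queries repeating across
multiple dashboards (same expression in 3+ panels) should be recording
rules. Identify these by hashing normalized PromQL expressions.
Common candidates: interface utilization calculations, error rate
ratios, and aggregated availability metrics.
Label cardinality: queries aggregating across high-cardinality
labels without explicit filtering generate expensive computation.
Flag queries with no label matchers on high-cardinality metrics and
queries using {__name__=~".*"} patterns.
Panel thresholds: verify gauge and stat panels have threshold
values configured. Panels displaying utilization or error rates
without color-coded thresholds fail to provide at-a-glance severity.
Compare configured thresholds against operational standards (e.g.,
interface utilization warning at 70%, critical at 90%).
Visualization appropriateness: time series data on gauge panels
loses temporal context. Single-value stats for volatile metrics mislead
operators. Table panels with 100+ rows without sorting are unusable
during incidents.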
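Step 3: Alert Rule Validation
Assess alert rule configuration for detection coverage, threshold
accuracy, and notification reliability.
Retrieve all alert rules from the Grafana alerting API:
GET /api/v1/provisioning/alert-rules
For Prometheus-native alerting, also query:
GET /api/v1/rules?type=alert
For each alert rule, evaluate:
Threshold appropriateness: alert thresholds should align with
operational impact, not arbitrary percentages. Interface utilization
alerts at 50% on a 10Gbps link are premature; at 95% on a 100Mbps
WAN link they are late. Cross-reference thresholds against link
capacity and historical peak usage from Prometheus.
Evaluation intervals: alert rules evaluated every 5 minutes
cannot detect sub-minute outages. For critical infrastructure (WAN
links, core routers), evaluation intervals should match or be less
than the Prometheus scrape interval. Flag alert groups where evaluation
interval exceeds the scrape interval of the underlying metrics.
Pending and for durations: for: 0s alerts fire on transient
spikes and contribute to alert fatigue. for: 30m on critical
infrastructure means 30 minutes of unnotified outage. Recommended
ranges: Critical alerts for: 2m-5m, Warning alerts for: 5m-15m.
Notification channels: verify all alert rules have at least one
active notification channel. Check channel health: Slack webhooks
return 200, PagerDuty keys are valid, email SMTP is reachable.
Routing and silencing: review the Alertmanager routing tree for
catch-all routes dumping all alerts to a single channel. Verify
silences have expiration times. Active silences without expiration
mask ongoing problems indefinitely.
Escalation completeness: critical alerts should escalate from
Slack/email to PagerDuty/phone after acknowledgment timeout. Alert
rules with only Slack notification for Critical-severity failures
lack escalation depth.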
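Step 4: SLA/SLO Reporting
Validate that SLA/SLO dashboards and calculations accurately reflect
actual service delivery.
Error budget calculation: verify the formula:
error_budget_remaining = 1 - (actual_errors / allowed_errors).
Common mistakes: using the wrong time window (calendar month vs
rolling 30 days), excluding planned maintenance from downtime
calculations, or computing availability from a single data source
when the service spans multiple components.
Availability SLIs: check that uptime percentage, MTTR (Mean
Time to Repair), and MTBF (Mean Time Between Failures) use correct
inputs. Uptime should reference probe-based measurement (blackbox
exporter, synthetic checks), not just device-reported status. MTTR
excluding detection time understates actual recovery duration.
Burn rate alerting: multi-window burn rate alerting provides
early warning when error budget consumption accelerates. Verify burn
rate alerts use at least two windows (e.g., 1-hour and 6-hour). A
single-window alert either fires too late or too often. Check that
severity maps to budget consumption: a 14.4x burn rate over 1 hour
warrants page-level severity.
Multi-window alert patterns: confirm long-window alerts (6h, 3d)
for trend detection and short-window alerts (5m, 1h) for rapid
response. Verify severity increases with burn rate magnitude.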
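Step 5: Data Source Health
Assess Prometheus scrape target status, metric cardinality, retention
configuration, and remote write health.
Scrape target status: query the targets API:
GET /api/v1/targets?state=active
Check the up metric across all scrape targets. Targets with
up == 0 are failing to scrape; investigate network reachability,
exporter health, or authentication issues. Targets with
scrape_duration_seconds exceeding the scrape interval are timing
out, producing gaps in metrics and potentially triggering false alerts.
Cardinality assessment: query TSDB status:
GET /api/v1/status/tsdb
Identify the top 10 metrics by series count. Network environments
commonly see cardinality explosion from per-interface SNMP metrics
on large chassis devices. Metrics exceeding 10,000 active series per
name warrant investigation for label optimization or aggregation.
Retention and storage: verify retention period covers the longest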