Cisco IOS-XE Device Health Check
Structured triage procedure for assessing Cisco IOS-XE device health. Produces a
prioritized findings report with severity classifications and recommended actions.
When to Use
- Device is reported as slow, unresponsive, or dropping traffic
- Scheduled health audit of IOS-XE routers or switches
- Post-change verification after configuration or software updates
- Capacity planning data collection for CPU, memory, and interface utilization
- Incident response when a device is suspected as the fault domain
Prerequisites
- SSH or console access to the target IOS-XE device (privilege level 1 minimum)
- Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
- Network reachability confirmed (ping/traceroute to management IP succeeds)
- Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
- Change control approval if performing checks during a maintenance window
Procedure
Follow this sequence. Each step produces data for the final report. Do not skip
steps unless the device is unresponsive (jump to Step 6 for crash recovery).
Step 1: Establish Baseline Context
Collect device identity and uptime to frame the health check.
    show version | include uptime|Version|bytes of memory
    show inventory | include PID
    show clock
Record: hostname, software version, uptime, hardware model, current time.
Flag an unexpectedly short uptime, which indicates a recent reload or crash.
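As a rough illustration of the short-uptime check, the sketch below converts the uptime line from show version into seconds and compares it with a minimum expected uptime. It assumes the usual "<hostname> uptime is N weeks, N days, N hours, N minutes" wording; the function names and the 24-hour default are illustrative assumptions, not values from this procedure.

```python
import re

# Seconds per unit as the units appear in the "uptime is ..." line.
UNIT_SECONDS = {
    "year": 365 * 86400,
    "week": 7 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
}


def uptime_seconds(show_version_line: str) -> int:
    """Convert '<host> uptime is 2 weeks, 3 days, ...' into seconds."""
    # Only look at the text after "uptime is" so digits in the hostname are ignored.
    _, _, tail = show_version_line.partition("uptime is")
    total = 0
    for value, unit in re.findall(r"(\d+)\s+(year|week|day|hour|minute)s?", tail):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def is_unexpectedly_short(show_version_line: str, min_expected_hours: float = 24.0) -> bool:
    """Return True when uptime is below the (assumed) minimum expected uptime."""
    return uptime_seconds(show_version_line) < min_expected_hours * 3600
```

A line with no recognizable uptime parses to 0 seconds and is therefore flagged, which errs on the side of a manual look.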
Step 2: CPU Utilization Assessment
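    ! If the platform rejects "| head 20", omit the filter and read the first 20 lines.
    show processes cpu sorted | head 20
    show processes cpu history
    show processes cpu platform sorted 5sec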
Compare 5-second, 1-minute, and 5-minute averages against thresholds.
If 5-second average exceeds 80%, identify the top process immediately.
Key processes to watch:
- IP Input: high values indicate traffic processing overload
- Crypto IKMP: VPN negotiation storms
- SNMP ENGINE: aggressive polling
- BGP Router: large table churn or route oscillation
- IOSD: general control plane congestion
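A minimal sketch of the comparison in this step, assuming the summary line that normally heads show processes cpu output ("CPU utilization for five seconds: X%/Y%; one minute: Z%; five minutes: W%"). The function names are hypothetical; the 80% default is the 5-second trigger stated in this step.

```python
import re

# Summary line printed at the top of "show processes cpu" output (assumed format).
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_summary(text: str) -> dict:
    """Extract the 5-second, interrupt, 1-minute, and 5-minute CPU percentages."""
    match = CPU_LINE.search(text)
    if match is None:
        raise ValueError("CPU utilization summary line not found")
    five_sec, interrupt, one_min, five_min = (int(g) for g in match.groups())
    return {
        "five_sec": five_sec,    # total CPU over the last 5 seconds
        "interrupt": interrupt,  # share of that spent at interrupt level
        "one_min": one_min,
        "five_min": five_min,
    }


def needs_top_process_check(cpu: dict, spike_threshold: int = 80) -> bool:
    """True when the 5-second average exceeds the spike threshold."""
    return cpu["five_sec"] > spike_threshold
```

When the check returns True, the process at the top of the sorted output is the one to compare against the list of key processes.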
Step 3: Memory Utilization Assessment
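    ! If the platform rejects "| head 15", omit the filter and read the first 15 lines.
    show memory statistics
    show memory platform information
    show processes memory sorted | head 15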
Calculate used percentage: (Total - Free) / Total * 100.
Check for memory fragmentation: compare Largest Free block to Total Free.
If largest free block is less than 10% of total free, fragmentation is a concern.
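The two calculations in this step can be sketched as follows. The inputs are the Total, Free, and Largest Free values read from the memory output; the function names are illustrative, and the 10% default is the fragmentation cut-off given in this step.

```python
def memory_used_percent(total_bytes: int, free_bytes: int) -> float:
    """(Total - Free) / Total * 100."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return (total_bytes - free_bytes) / total_bytes * 100


def largest_block_percent(largest_free_bytes: int, total_free_bytes: int) -> float:
    """Largest free block as a percentage of total free memory."""
    if total_free_bytes <= 0:
        raise ValueError("total_free_bytes must be positive")
    return largest_free_bytes / total_free_bytes * 100


def fragmentation_concern(largest_free_bytes: int, total_free_bytes: int,
                          threshold_percent: float = 10.0) -> bool:
    """True when the largest free block is below the threshold share of free memory."""
    return largest_block_percent(largest_free_bytes, total_free_bytes) < threshold_percent
```

For example, 4,000,000 bytes total with 1,000,000 free is 75% used; if the largest free block is only 50,000 bytes, it is 5% of free memory and fragmentation is a concern.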
Step 4: Interface Health
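    show interfaces summary
    show interfaces counters errors
    show interfaces | include line protocol|drops|error|CRC|collision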
For each interface with errors:
- Calculate error rate: errors / (input packets + output packets) * 100
- Error rate of 0.01–0.1% is a warning; above 0.1% is critical (matching the Threshold Tables section)
- CRC errors suggest Layer 1 issues (cabling, optics, SFP)
- Input errors with no CRC suggest buffer or overrun issues
- Output drops indicate congestion — check QoS policy
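The error-rate calculation above can be sketched like this. The function names are hypothetical, and the default cut-offs (0.01% and 0.1%) are taken from the Threshold Tables section; pass different values if your baseline calls for them.

```python
def error_rate_percent(errors: int, input_packets: int, output_packets: int) -> float:
    """errors / (input packets + output packets) * 100."""
    total_packets = input_packets + output_packets
    if total_packets == 0:
        # No traffic counted yet, so no meaningful rate; report zero.
        return 0.0
    return errors / total_packets * 100


def classify_error_rate(rate_percent: float,
                        warning_percent: float = 0.01,
                        critical_percent: float = 0.1) -> str:
    """Map an error rate to Normal / Warning / Critical."""
    if rate_percent > critical_percent:
        return "Critical"
    if rate_percent >= warning_percent:
        return "Warning"
    return "Normal"
```

An interface with 50 errors across 100,000 packets has a rate of about 0.05%, which falls in the warning band under these defaults.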
Step 5: Routing Table Health
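    ! Run each neighbor command only if that protocol is configured.
    show ip route summary
    show ip bgp summary
    show ip ospf neighbor
    show ip eigrp neighbors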
Verify: expected number of routes present, no unexpected route withdrawals,
all routing protocol neighbors in established/full state.
Flag: neighbor state changes in the last hour, route count significantly
different from baseline, any routes via unexpected next-hops.
Step 6: Platform and Environment
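    show environment all
    show platform software status control-processor brief
    ! Review the last 50 matching lines.
    show logging | include %|Error|Warning|traceback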
Check: power supply status, fan status, temperature readings.
Any environmental alarm is an immediate escalation trigger.
Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).
Threshold Tables
See references/threshold-tables.md for detailed per-parameter thresholds.
| Parameter | Normal | Warning | Critical |
|---|---|---|---|
| CPU 5-min avg | < 40% | 40–70% | > 70% |
| CPU 5-sec spike | < 80% | 80–90% | > 90% |
| Memory used | < 70% | 70–85% | > 85% |
| Memory fragmentation (largest free block / total free) | > 10% | 5–10% | < 5% |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% |
| Interface output drops | < 100/hr | 100–1000/hr | > 1000/hr |
| Routing neighbors | All established | Flapping | Down |
| Temperature | Within spec | Within 5°C of max | At or above max |
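For the rows where a higher value is worse (CPU and memory used), the table can be applied mechanically. The sketch below is one way to do that; the dictionary keys and function name are hypothetical, and the remaining rows (fragmentation, interfaces, neighbors, temperature) are not covered here.

```python
# (warning_floor, critical_floor) in percent, copied from the table above.
BANDS = {
    "cpu_5min_avg": (40.0, 70.0),
    "cpu_5sec_spike": (80.0, 90.0),
    "memory_used": (70.0, 85.0),
}


def classify(parameter: str, value_percent: float) -> str:
    """Return Normal, Warning, or Critical for a higher-is-worse parameter."""
    warning_floor, critical_floor = BANDS[parameter]
    if value_percent > critical_floor:
        return "Critical"
    if value_percent >= warning_floor:
        return "Warning"
    return "Normal"
```

Boundary values follow the table as written: a 5-minute CPU average of exactly 70% is still Warning, and anything above it is Critical.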
Decision Trees
Triage Priority
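    Is the device reachable?
    ├── No → Escalate immediately. Check console access, power, environment.
    └── Yes
        ├── CPU critical? → Identify the top process → Apply per-process mitigation
        │   ├── IP Input → Check for traffic storms, ACL optimization
        │   ├── BGP Router → Check route churn, neighbor flapping, table size
        │   └── Other → Collect show tech-support for TAC escalation
        ├── Memory critical? → Check for a memory leak
        │   ├── Largest free < 5% of total → Likely fragmentation, schedule a reload
        │   └── Steady growth over time → Memory leak, collect show mem alloc
        ├── Interface errors? → Classify the error type
        │   ├── CRC/input errors → Layer 1 (cabling, optics, SFP)
        │   └── Output drops → QoS policy or congestion
        └── All within thresholds → Record healthy status, schedule next check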
Escalation Criteria
Escalate to senior engineer or TAC when any of these conditions are met:
- CPU sustained above 90% for more than 15 minutes with no identifiable cause
- Memory below 15% free with no recent change to explain consumption
- Traceback or CPUHOG messages in logs within last 24 hours
- Environmental alarm (power, fan, temperature) present
- More than 3 routing neighbor state changes in last hour
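The escalation criteria can be expressed as a simple checklist function. This is a sketch under the assumption that the operator has already gathered the observations; the Observations class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    cpu_above_90_minutes: float          # how long CPU has stayed above 90%
    cpu_cause_identified: bool
    memory_free_percent: float
    recent_change_explains_memory: bool
    traceback_or_cpuhog_last_24h: bool
    environmental_alarm: bool            # power, fan, or temperature
    neighbor_state_changes_last_hour: int


def escalation_reasons(obs: Observations) -> List[str]:
    """Return every escalation criterion that is met; an empty list means no escalation."""
    reasons = []
    if obs.cpu_above_90_minutes > 15 and not obs.cpu_cause_identified:
        reasons.append("CPU above 90% for more than 15 minutes with no identifiable cause")
    if obs.memory_free_percent < 15 and not obs.recent_change_explains_memory:
        reasons.append("Memory below 15% free with no recent change to explain it")
    if obs.traceback_or_cpuhog_last_24h:
        reasons.append("Traceback or CPUHOG in logs within the last 24 hours")
    if obs.environmental_alarm:
        reasons.append("Environmental alarm present")
    if obs.neighbor_state_changes_last_hour > 3:
        reasons.append("More than 3 routing neighbor state changes in the last hour")
    return reasons
```

Returning the list of reasons rather than a single boolean keeps the justification available for the report.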
Report Template
Generate a structured report with these sections:
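    DEVICE HEALTH REPORT
    ====================
    Device: [hostname]
    Model: [PID from inventory]
    Software: [version]
    Uptime: [uptime string]
    Check Time: [timestamp]
    Performed By: [operator/agent]
    SUMMARY: [HEALTHY | WARNING | CRITICAL]
    FINDINGS:
    1. [Severity] [Component]: [Description]
       Observed: [metric value]
       Threshold: [normal/warning/critical range]
       Action: [recommended action]
    2. ...
    RECOMMENDATIONS:
    NEXT CHECK: [date based on finding severity]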
Severity levels for findings:
- INFO: within normal thresholds, noted for baseline
- WARNING: approaching threshold, monitor closely
- CRITICAL: threshold exceeded, action required
- EMERGENCY: device at risk of failure, immediate action
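Because the report is meant to be prioritized, findings can be ordered by these severity levels. The sketch below assumes each finding is a dictionary with a "severity" key holding one of the four labels; that shape and the function names are illustrative only.

```python
from typing import Dict, List

# Higher rank means more severe.
SEVERITY_RANK = {"INFO": 0, "WARNING": 1, "CRITICAL": 2, "EMERGENCY": 3}


def prioritize(findings: List[Dict]) -> List[Dict]:
    """Return findings sorted most severe first; ties keep their original order."""
    return sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]], reverse=True)


def highest_severity(findings: List[Dict]) -> str:
    """Return the most severe label present, or INFO when there are no findings."""
    if not findings:
        return "INFO"
    return max((f["severity"] for f in findings), key=SEVERITY_RANK.__getitem__)
```

The highest severity found is a reasonable starting point for the overall summary line of the report.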
Troubleshooting
Device Unresponsive to SSH
Try console access. If console is also unresponsive, check power and
environment remotely (smart PDU, out-of-band management). If the device has
crashed, collect the crash files after recovery with dir crashinfo:.
CPU Spikes During Health Check
SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds
after connecting before collecting CPU data. Use terminal length 0 to avoid
paging pauses that extend session time.
Inconsistent Memory Readings
Memory values fluctuate during normal operation. Collect three samples at
30-second intervals and average them. Check show memory dead for memory
that is allocated but unreachable (leak indicator).
Interface Counter Interpretation
Counters are cumulative since last clear. Use show interfaces [name]
to see the last clear time. For rate calculations, collect counters twice
with a known interval: (counter2 - counter1) / interval_seconds.
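The two-sample rate calculation can be sketched as below. The function names are hypothetical; the per-hour conversion is included because the output-drop thresholds in this document are stated per hour.

```python
def counter_rate(counter1: int, counter2: int, interval_seconds: float) -> float:
    """(counter2 - counter1) / interval_seconds, in events per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if counter2 < counter1:
        # A decrease means the counters were cleared or wrapped between samples.
        raise ValueError("counter decreased between samples; resample")
    return (counter2 - counter1) / interval_seconds


def per_hour(rate_per_second: float) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return rate_per_second * 3600
```

For example, output drops going from 1000 to 1600 over 60 seconds is 10 drops per second, or 36,000 per hour.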
Routing Protocol Neighbor Issues
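If OSPF neighbors are stuck in INIT, hellos are arriving in one direction
only; check ACLs and multicast delivery. If they are stuck in EXSTART or
EXCHANGE, check for an MTU mismatch. If no adjacency forms at all, check area
configuration. Note that 2WAY is normal between non-DR routers on a broadcast
segment. If BGP peers show "Active" state, verify TCP connectivity on port 179
and check for ACL blocking. EIGRP stuck-in-active indicates a convergence
problem downstream.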