Cisco IOS-XE Device Health Check

Structured triage procedure for assessing Cisco IOS-XE device health. Produces a
prioritized findings report with severity classifications and recommended actions.

When to Use

- Device is reported as slow, unresponsive, or dropping traffic
- Scheduled health audit of IOS-XE routers or switches
- Post-change verification after configuration or software updates
- Capacity planning data collection for CPU, memory, and interface utilization
- Incident response when a device is suspected as the fault domain

Prerequisites

- SSH or console access to the target IOS-XE device (privilege level 1 minimum)
- Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
- Network reachability confirmed (ping/traceroute to management IP succeeds)
- Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
- Change control approval if performing checks during a maintenance window

Procedure

Follow this sequence. Each step produces data for the final report. Do not skip
steps unless the device is unresponsive (jump to Step 6 for crash recovery).

Step 1: Establish Baseline Context

Collect device identity and uptime to frame the health check.

    show version | include uptime|Version|bytes of memory
    show inventory | include PID
    show clock

Record: hostname, software version, uptime, hardware model, current time.
Flag an unexpectedly short uptime, which indicates a recent reload or crash.
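As a rough illustration of the uptime check, the sketch below converts an "uptime is ..." line into seconds and flags short uptimes. It assumes the line uses the usual "N weeks, N days, N hours, N minutes" wording; the 24-hour cutoff and the hostname `rtr1` are hypothetical and should be replaced with values from the device's baseline.

```python
import re

# Seconds per unit as they appear in the "uptime is ..." string.
_UNIT_SECONDS = {
    "year": 365 * 86400,
    "week": 7 * 86400,
    "day": 86400,
    "hour": 3600,
    "minute": 60,
}


def uptime_seconds(line: str) -> int:
    """Sum every '<number> <unit>' pair found after 'uptime is'."""
    _, _, tail = line.partition("uptime is")
    total = 0
    for count, unit in re.findall(r"(\d+)\s+(year|week|day|hour|minute)s?", tail):
        total += int(count) * _UNIT_SECONDS[unit]
    return total


def is_unexpectedly_short(line: str, min_expected_hours: float = 24.0) -> bool:
    """True when uptime is below the (assumed) minimum expected hours."""
    return uptime_seconds(line) < min_expected_hours * 3600


if __name__ == "__main__":
    sample = "rtr1 uptime is 3 hours, 12 minutes"
    print(uptime_seconds(sample), is_unexpectedly_short(sample))
```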

Step 2: CPU Utilization Assessment

    show processes cpu sorted | head 20
    show processes cpu history
    show processes cpu platform sorted 5sec

Compare 5-second, 1-minute, and 5-minute averages against thresholds.
If 5-second average exceeds 80%, identify the top process immediately.

Key processes to watch:

- IP Input: high values indicate traffic processing overload
- Crypto IKMP: VPN negotiation storms
- SNMP ENGINE: aggressive polling
- BGP Router: large table churn or route oscillation
- IOSD: general control plane congestion
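The comparison of the three CPU averages can be sketched as follows. This assumes the summary line of the CPU process listing has the familiar "CPU utilization for five seconds: X%/Y%; one minute: Z%; five minutes: W%" form, which may differ by platform or release; the function names are illustrative only.

```python
import re

# Assumed format of the summary line; adjust if the platform prints it differently.
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)


def parse_cpu_summary(text: str) -> dict:
    """Extract the 5-second, interrupt, 1-minute, and 5-minute percentages."""
    match = CPU_LINE.search(text)
    if match is None:
        raise ValueError("CPU summary line not found")
    five_sec, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "five_sec": five_sec,
        "interrupt": interrupt,
        "one_min": one_min,
        "five_min": five_min,
    }


def needs_top_process_check(summary: dict, spike_threshold: int = 80) -> bool:
    """Apply the rule above: a 5-second value over 80% calls for a top-process look."""
    return summary["five_sec"] > spike_threshold


if __name__ == "__main__":
    line = "CPU utilization for five seconds: 85%/12%; one minute: 40%; five minutes: 35%"
    summary = parse_cpu_summary(line)
    print(summary, needs_top_process_check(summary))
```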

Step 3: Memory Utilization Assessment

    show memory statistics
    show memory platform information
    show processes memory sorted | head 15

Calculate used percentage: (Total - Free) / Total * 100.
Check for memory fragmentation: compare Largest Free block to Total Free.
If largest free block is less than 10% of total free, fragmentation is a concern.
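The two calculations above can be written out as a short sketch. The inputs are assumed to be byte counts already read from the memory statistics output; parsing that output is not shown, and the function names are illustrative.

```python
def memory_used_percent(total: int, free: int) -> float:
    """(Total - Free) / Total * 100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - free) / total * 100


def fragmentation_ratio_percent(largest_free: int, total_free: int) -> float:
    """Largest free block as a percentage of total free memory."""
    if total_free <= 0:
        raise ValueError("total_free must be positive")
    return largest_free / total_free * 100


def fragmentation_is_concern(largest_free: int, total_free: int,
                             threshold_percent: float = 10.0) -> bool:
    """Fragmentation is a concern when the ratio falls below 10%."""
    return fragmentation_ratio_percent(largest_free, total_free) < threshold_percent


if __name__ == "__main__":
    print(memory_used_percent(total=1000, free=250))
    print(fragmentation_is_concern(largest_free=40, total_free=500))
```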

Step 4: Interface Health

CODEBLOCK3

For each interface with errors:

- Calculate error rate: errors / (input packets + output packets) * 100
- Error rate above 0.1% is warning, above 1% is critical
- CRC errors suggest Layer 1 issues (cabling, optics, SFP)
- Input errors with no CRC suggest buffer or overrun issues
- Output drops indicate congestion; check QoS policy
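A minimal sketch of the error-rate calculation and the cutoffs stated in this step (0.1% warning, 1% critical). The counters are assumed to have been read from the interface output already; the function names are illustrative.

```python
def error_rate_percent(errors: int, input_packets: int, output_packets: int) -> float:
    """errors / (input packets + output packets) * 100, or 0.0 with no traffic."""
    total = input_packets + output_packets
    if total == 0:
        return 0.0
    return errors / total * 100


def classify_error_rate(rate_percent: float) -> str:
    """Map an error rate to the severity cutoffs given in this step."""
    if rate_percent > 1.0:
        return "critical"
    if rate_percent > 0.1:
        return "warning"
    return "normal"


if __name__ == "__main__":
    rate = error_rate_percent(errors=50, input_packets=6000, output_packets=4000)
    print(rate, classify_error_rate(rate))
```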

Step 5: Routing Table Health

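    show ip route summary
    show ip bgp summary (if BGP is configured)
    show ip ospf neighbor (if OSPF is configured)
    show ip eigrp neighbors (if EIGRP is configured)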

Verify: expected number of routes present, no unexpected route withdrawals,
all routing protocol neighbors in established/full state.

Flag: neighbor state changes in the last hour, route count significantly
different from baseline, any routes via unexpected next-hops.

Step 6: Platform and Environment

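    show environment all
    show platform software status control-processor brief
    show logging | include %|Error|Warning|traceback (last 50 lines)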

Check: power supply status, fan status, temperature readings.
Any environmental alarm is an immediate escalation trigger.
Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).

Threshold Tables

See references/threshold-tables.md for detailed per-parameter thresholds.

| Parameter | Normal | Warning | Critical |
| --- | --- | --- | --- |
| CPU 5-min avg | < 40% | 40–70% | > 70% |
| CPU 5-sec spike | < 80% | 80–90% | > 90% |
| Memory used | < 70% | 70–85% | > 85% |
| Memory fragmentation (largest free / total free) | > 10% | 5–10% | < 5% |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% |
| Interface output drops | < 100/hr | 100–1000/hr | > 1000/hr |
| Routing neighbors | All established | Flapping | Down |
| Temperature | Within spec | Within 5°C of max | At or above max |

Decision Trees

Triage Priority

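    Device reachable?
    ├── No → Escalate immediately. Check console access, power, environment.
    └── Yes
        ├── CPU critical? → Identify top process → Apply process-specific mitigation
        │   ├── IP Input → Check for traffic storms, ACL optimization
        │   ├── BGP Router → Check route churn, neighbor flapping, table size
        │   └── Other → Collect show tech-support for TAC escalation
        ├── Memory critical? → Check for memory leak
        │   ├── Largest free < 5% of total → Fragmentation likely, schedule reload
        │   └── Steady growth over time → Memory leak, collect show mem alloc
        ├── Interface errors? → Classify error type
        │   ├── CRC/input errors → Layer 1 (cabling, optics, SFP)
        │   └── Output drops → QoS policy or congestion
        └── All within thresholds → Document healthy state, schedule next check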
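The top level of the triage order (reachability first, then CPU, memory, and interfaces) might be expressed as the sketch below. It assumes that several branches can apply to one device and returns them in priority order, which the tree itself does not state; the function name and action strings are illustrative.

```python
from typing import List


def triage_actions(reachable: bool, cpu_critical: bool,
                   memory_critical: bool, interface_errors: bool) -> List[str]:
    """Return next actions in the order the triage tree visits them."""
    if not reachable:
        return ["Escalate immediately: check console access, power, environment"]
    actions = []
    if cpu_critical:
        actions.append("Identify top CPU process and apply process-specific mitigation")
    if memory_critical:
        actions.append("Check for memory leak or fragmentation")
    if interface_errors:
        actions.append("Classify interface error type")
    if not actions:
        actions.append("Document healthy state and schedule next check")
    return actions


if __name__ == "__main__":
    for action in triage_actions(True, False, True, True):
        print(action)
```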

Escalation Criteria

Escalate to senior engineer or TAC when any of these conditions are met:

- CPU sustained above 90% for more than 15 minutes with no identifiable cause
- Memory below 15% free with no recent change to explain consumption
- Traceback or CPUHOG messages in logs within last 24 hours
- Environmental alarm (power, fan, temperature) present
- More than 3 routing neighbor state changes in last hour
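The escalation criteria above could be evaluated mechanically along these lines. The `Observations` fields are hypothetical names for values an operator would gather during the procedure; nothing here collects them from a device.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observations:
    # Hypothetical field names; values come from the operator's notes.
    cpu_minutes_above_90: float
    cpu_cause_identified: bool
    memory_free_percent: float
    memory_change_explained: bool
    crash_messages_last_24h: int   # traceback or CPUHOG log lines
    environmental_alarm: bool
    neighbor_changes_last_hour: int


def escalation_reasons(obs: Observations) -> List[str]:
    """Return every escalation criterion that the observations satisfy."""
    reasons = []
    if obs.cpu_minutes_above_90 > 15 and not obs.cpu_cause_identified:
        reasons.append("CPU above 90% for more than 15 minutes with no identified cause")
    if obs.memory_free_percent < 15 and not obs.memory_change_explained:
        reasons.append("Memory below 15% free with no explaining change")
    if obs.crash_messages_last_24h > 0:
        reasons.append("Traceback or CPUHOG messages in the last 24 hours")
    if obs.environmental_alarm:
        reasons.append("Environmental alarm present")
    if obs.neighbor_changes_last_hour > 3:
        reasons.append("More than 3 routing neighbor state changes in the last hour")
    return reasons


if __name__ == "__main__":
    obs = Observations(20, False, 40.0, True, 0, False, 4)
    print(escalation_reasons(obs))
```

An empty list means none of the criteria is met; any non-empty result is grounds to escalate.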

Report Template

Generate a structured report with these sections:

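    DEVICE HEALTH REPORT
    ====================
    Device: [hostname]
    Model: [PID from inventory]
    Software: [version]
    Uptime: [uptime string]
    Check Time: [timestamp]
    Performed By: [operator/agent]

    SUMMARY: [HEALTHY | WARNING | CRITICAL]

    FINDINGS:
    1. [SEVERITY] [Component] - [Description]
       Observed: [metric value]
       Threshold: [normal/warning/critical range]
       Action: [recommended action]
    2. ...

    RECOMMENDATIONS:
    - [Prioritized action list]

    NEXT CHECK: [scheduled date based on severity of findings]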

Severity levels for findings:

- INFO: within normal thresholds, noted for baseline
- WARNING: approaching threshold, monitor closely
- CRITICAL: threshold exceeded, action required
- EMERGENCY: device at risk of failure, immediate action
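One way to derive the report's overall summary status (healthy, warning, or critical) from the per-finding severities is sketched below. Treating EMERGENCY as a CRITICAL summary is an assumption, since the summary field offers only three values.

```python
from typing import Iterable

# Severity levels from least to most severe, as listed above.
SEVERITY_ORDER = ["INFO", "WARNING", "CRITICAL", "EMERGENCY"]


def overall_summary(severities: Iterable[str]) -> str:
    """Reduce finding severities to HEALTHY, WARNING, or CRITICAL."""
    worst = max((SEVERITY_ORDER.index(s) for s in severities), default=0)
    if worst >= SEVERITY_ORDER.index("CRITICAL"):
        return "CRITICAL"  # assumption: EMERGENCY also reports as CRITICAL
    if worst == SEVERITY_ORDER.index("WARNING"):
        return "WARNING"
    return "HEALTHY"


if __name__ == "__main__":
    print(overall_summary(["INFO", "WARNING", "INFO"]))
```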

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and
environment remotely (smart PDU, out-of-band management). If the device has
crashed, run `dir crashinfo:` after recovery to locate the crash files.

CPU Spikes During Health Check

SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds
after connecting before collecting CPU data. Use `terminal length 0` to avoid
paging pauses that extend session time.

Inconsistent Memory Readings

Memory values fluctuate during normal operation. Collect three samples at
30-second intervals and average them. Check `show memory dead` for memory
that is allocated but unreachable (leak indicator).
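The three-sample averaging can be sketched as below. `read_used_percent` is a hypothetical callable that returns one memory-used reading per call (for example, by running the memory command and applying the used-percentage formula); the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time
from statistics import mean
from typing import Callable


def average_memory_used(read_used_percent: Callable[[], float],
                        samples: int = 3,
                        interval_seconds: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> float:
    """Take `samples` readings, `interval_seconds` apart, and return their mean."""
    readings = []
    for i in range(samples):
        if i > 0:
            sleep(interval_seconds)  # wait between readings, not before the first
        readings.append(read_used_percent())
    return mean(readings)


if __name__ == "__main__":
    fake_readings = iter([70.0, 72.0, 74.0])
    print(average_memory_used(lambda: next(fake_readings), sleep=lambda s: None))
```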

Interface Counter Interpretation

Counters are cumulative since last clear. Use `show interfaces [name]`
to see the last clear time. For rate calculations, collect counters twice
with a known interval: (counter2 - counter1) / interval_seconds.
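A small sketch of the rate formula, with a per-hour conversion for comparison against the output-drop thresholds. Rejecting a decreasing counter is an added safeguard for the case where counters were cleared between samples; the function names are illustrative.

```python
def counter_rate_per_second(counter1: int, counter2: int, interval_seconds: float) -> float:
    """(counter2 - counter1) / interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if counter2 < counter1:
        raise ValueError("counter decreased; it may have been cleared between samples")
    return (counter2 - counter1) / interval_seconds


def counter_rate_per_hour(counter1: int, counter2: int, interval_seconds: float) -> float:
    """Same rate scaled to one hour, the unit used for output drops."""
    return counter_rate_per_second(counter1, counter2, interval_seconds) * 3600


if __name__ == "__main__":
    print(counter_rate_per_second(1000, 1600, 60))
    print(counter_rate_per_hour(1000, 1600, 60))
```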

Routing Protocol Neighbor Issues

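If OSPF neighbors are stuck in INIT, check for one-way hello reception (ACLs,
multicast, authentication) and area configuration; neighbors stuck in
EXSTART/EXCHANGE usually point to an MTU mismatch. 2WAY is a normal steady
state between non-DR/BDR routers on a broadcast segment. If BGP peers show
"Active" state, verify TCP connectivity on port 179 and check for ACL
blocking. EIGRP stuck-in-active indicates a convergence problem downstream.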

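Cisco IOS-XE Device Health Check

Structured triage procedure for assessing Cisco IOS-XE device health. Produces a prioritized findings report with severity classifications and recommended actions.

When to Use

- Device is reported as slow, unresponsive, or dropping traffic
- Scheduled health audit of IOS-XE routers or switches
- Post-change verification after configuration or software updates
- Capacity planning data collection for CPU, memory, and interface utilization
- Incident response when a device is suspected as the fault domain

Prerequisites

- SSH or console access to the target IOS-XE device (privilege level 1 minimum)
- Device running IOS-XE 16.x or 17.x (commands validated against 17.3+)
- Network reachability confirmed (ping/traceroute to management IP succeeds)
- Knowledge of the device's normal baseline (typical CPU, memory, traffic levels)
- Change control approval if performing checks during a maintenance window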

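Procedure

Follow this sequence. Each step produces data for the final report. Do not skip steps unless the device is unresponsive (jump to Step 6 for crash recovery).

Step 1: Establish Baseline Context

Collect device identity and uptime to frame the health check.

    show version | include uptime|Version|bytes of memory
    show inventory | include PID
    show clock

Record: hostname, software version, uptime, hardware model, current time.
Flag an unexpectedly short uptime, which indicates a recent reload or crash.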

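Step 2: CPU Utilization Assessment

    show processes cpu sorted | head 20
    show processes cpu history
    show processes cpu platform sorted 5sec

Compare 5-second, 1-minute, and 5-minute averages against thresholds.
If 5-second average exceeds 80%, identify the top process immediately.

Key processes to watch:

- IP Input: high values indicate traffic processing overload
- Crypto IKMP: VPN negotiation storms
- SNMP ENGINE: aggressive polling
- BGP Router: large table churn or route oscillation
- IOSD: general control plane congestion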

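Step 3: Memory Utilization Assessment

    show memory statistics
    show memory platform information
    show processes memory sorted | head 15

Calculate used percentage: (Total - Free) / Total * 100.
Check for memory fragmentation: compare Largest Free block to Total Free.
If largest free block is less than 10% of total free, fragmentation is a concern.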

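Step 4: Interface Health

For each interface with errors:

- Calculate error rate: errors / (input packets + output packets) * 100
- Error rate above 0.1% is warning, above 1% is critical
- CRC errors suggest Layer 1 issues (cabling, optics, SFP)
- Input errors with no CRC suggest buffer or overrun issues
- Output drops indicate congestion; check QoS policy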

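Step 5: Routing Table Health

    show ip route summary
    show ip bgp summary (if BGP is configured)
    show ip ospf neighbor (if OSPF is configured)
    show ip eigrp neighbors (if EIGRP is configured)

Verify: expected number of routes present, no unexpected route withdrawals,
all routing protocol neighbors in established/full state.

Flag: neighbor state changes in the last hour, route count significantly
different from baseline, any routes via unexpected next-hops.

Step 6: Platform and Environment

    show environment all
    show platform software status control-processor brief
    show logging | include %|Error|Warning|traceback (last 50 lines)

Check: power supply status, fan status, temperature readings.
Any environmental alarm is an immediate escalation trigger.
Review recent syslog for crash signatures (traceback, CPUHOG, MALLOCFAIL).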

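Threshold Tables

See references/threshold-tables.md for detailed per-parameter thresholds.

| Parameter | Normal | Warning | Critical |
| --- | --- | --- | --- |
| CPU 5-min avg | < 40% | 40–70% | > 70% |
| CPU 5-sec spike | < 80% | 80–90% | > 90% |
| Memory used | < 70% | 70–85% | > 85% |
| Memory fragmentation (largest free / total free) | > 10% | 5–10% | < 5% |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% |
| Interface output drops | < 100/hr | 100–1000/hr | > 1000/hr |
| Routing neighbors | All established | Flapping | Down |
| Temperature | Within spec | Within 5°C of max | At or above max |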

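Decision Trees

Triage Priority

    Device reachable?
    ├── No → Escalate immediately. Check console access, power, environment.
    └── Yes
        ├── CPU critical? → Identify top process → Apply process-specific mitigation
        │   ├── IP Input → Check for traffic storms, ACL optimization
        │   ├── BGP Router → Check route churn, neighbor flapping, table size
        │   └── Other → Collect show tech-support for TAC escalation
        ├── Memory critical? → Check for memory leak
        │   ├── Largest free < 5% of total → Fragmentation likely, schedule reload
        │   └── Steady growth over time → Memory leak, collect show mem alloc
        ├── Interface errors? → Classify error type
        │   ├── CRC/input errors → Layer 1 (cabling, optics, SFP)
        │   └── Output drops → QoS policy or congestion
        └── All within thresholds → Document healthy state, schedule next check

Escalation Criteria

Escalate to senior engineer or TAC when any of these conditions are met:

- CPU sustained above 90% for more than 15 minutes with no identifiable cause
- Memory below 15% free with no recent change to explain consumption
- Traceback or CPUHOG messages in logs within last 24 hours
- Environmental alarm (power, fan, temperature) present
- More than 3 routing neighbor state changes in last hour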

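Report Template

Generate a structured report with these sections:

    DEVICE HEALTH REPORT
    ====================
    Device: [hostname]
    Model: [PID from inventory]
    Software: [version]
    Uptime: [uptime string]
    Check Time: [timestamp]
    Performed By: [operator/agent]

    SUMMARY: [HEALTHY | WARNING | CRITICAL]

    FINDINGS:
    1. [SEVERITY] [Component] - [Description]
       Observed: [metric value]
       Threshold: [normal/warning/critical range]
       Action: [recommended action]
    2. ...

    RECOMMENDATIONS:
    - [Prioritized action list]

    NEXT CHECK: [scheduled date based on severity of findings]

Severity levels for findings:

- INFO: within normal thresholds, noted for baseline
- WARNING: approaching threshold, monitor closely
- CRITICAL: threshold exceeded, action required
- EMERGENCY: device at risk of failure, immediate action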

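Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment remotely (smart PDU, out-of-band management). If the device has crashed, run `dir crashinfo:` after recovery to locate the crash files.

CPU Spikes During Health Check

SNMP polling or show commands themselves can briefly spike CPU. Wait 30 seconds after connecting before collecting CPU data. Use `terminal length 0` to avoid paging pauses that extend session time.

Inconsistent Memory Readings

Memory values fluctuate during normal operation. Collect three samples at 30-second intervals and average them. Check `show memory dead` for memory that is allocated but unreachable (leak indicator).

Interface Counter Interpretation

Counters are cumulative since last clear. Use `show interfaces [name]` to see the last clear time. For rate calculations, collect counters twice with a known interval: (counter2 - counter1) / interval_seconds.

Routing Protocol Neighbor Issues

If OSPF neighbors are stuck in INIT, check for one-way hello reception (ACLs, multicast, authentication) and area configuration; neighbors stuck in EXSTART/EXCHANGE usually point to an MTU mismatch. 2WAY is a normal steady state between non-DR/BDR routers on a broadcast segment. If BGP peers show "Active" state, verify TCP connectivity on port 179 and check for ACL blocking. EIGRP stuck-in-active indicates a convergence problem downstream.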
