Juniper JunOS Device Health Check

Structured triage procedure for assessing Juniper device health across MX, SRX,
EX, QFX, and PTX platforms. Produces a prioritized findings report with severity
classifications and recommended actions.

JunOS separates Routing Engine (RE) and Packet Forwarding Engine (PFE). These
are independent health domains — a healthy RE does not guarantee a healthy PFE,
and vice versa. This procedure assesses both explicitly.

When to Use

- Device reported as slow, dropping traffic, or unresponsive
- Scheduled health audit of Juniper routers, switches, or firewalls
- Post-change verification after commits, upgrades, or ISSU
- Capacity planning data collection for RE CPU, memory, and link utilization
- Incident response when a Juniper device is suspected as the fault domain
- RE failover event: verify mastership and standby RE state
- Chassis alarm triggered: severity triage and root cause identification

Prerequisites

- SSH or console access to the device (login class with view permissions minimum)
- JunOS 21.x or later (commands validated against JunOS 23.2+)
- Network reachability to management interface or fxp0 confirmed
- Awareness of the device's normal baseline (CPU, memory, traffic patterns)
- For dual-RE systems: know which RE should be master under normal operations
- Knowledge of recent commit history if correlating symptoms with changes

Procedure

Follow this sequence. Each step produces data for the final report. RE mastership
verification is mandatory first — all subsequent data is RE-scoped.

Step 1: Verify RE Mastership (Mandatory)

On dual-RE systems, health data comes from the RE you are logged into. If you
are on the backup RE, all metrics reflect the standby engine — not the active
forwarding path. This step is non-negotiable.

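    show chassis routing-engine | match "Slot|Current state|Mastership"
    show route summary | match "Router ID"
    show system uptime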

Verify: your session is on the master RE. If Current state shows Backup,
switch to master: request routing-engine login other-routing-engine.

On single-RE platforms, confirm RE is Master (not in a degraded state).
Record: hostname, RE slot, mastership state, uptime, last reboot reason.
A short uptime following an unexpected reboot warrants immediate investigation.
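If the output is collected by a script, the mastership check can be automated. The following is a minimal sketch, assuming the output keeps the `Slot N:` / `Current state` layout; the sample text is abbreviated and illustrative, and `re_states` is a hypothetical helper name. It only reports which slot is master; confirming which slot your session is on remains a separate check.

```python
import re

def re_states(output: str) -> dict[int, str]:
    """Map each RE slot number to its 'Current state' value."""
    states = {}
    slot = None
    for line in output.splitlines():
        slot_match = re.match(r"\s*Slot (\d+):", line)
        if slot_match:
            slot = int(slot_match.group(1))
            continue
        state_match = re.match(r"\s*Current state\s+(\S+)", line)
        if state_match and slot is not None:
            states[slot] = state_match.group(1)
    return states

# Abbreviated, illustrative output; real output carries many more fields.
SAMPLE = """\
Routing Engine status:
  Slot 0:
    Current state                  Master
  Slot 1:
    Current state                  Backup
"""

states = re_states(SAMPLE)
master_slots = [slot for slot, state in states.items() if state == "Master"]
print(master_slots)  # [0]
```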

Step 2: Alarm Analysis

JunOS surfaces alarms as first-class status indicators. Check chassis and
system alarms before deeper investigation — alarms may already identify the
problem.

    show chassis alarms
    show system alarms

Alarm severities:

- Major: service-affecting condition, requires immediate attention
- Minor: degraded but service continues, investigate promptly

If alarms are present, record each alarm's class, description, and time.
Major alarms take priority over all other triage — address them first.
Common alarm sources: FPC offline, power supply failure, rescue config
not set, license expiry, FRU removal.

No alarms → proceed with systematic health assessment.
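The "Major first" ordering can be expressed as a small sort. This is a sketch only: it assumes alarms have already been parsed into records with `time`, `class`, and `description` keys (hypothetical field names), and the alarm descriptions are illustrative.

```python
# Lower rank sorts first; unknown classes fall to the end.
SEVERITY_RANK = {"Major": 0, "Minor": 1}

def prioritize(alarms):
    """Sort alarm records: Major before Minor, oldest first within a class."""
    return sorted(
        alarms,
        key=lambda a: (SEVERITY_RANK.get(a["class"], 2), a["time"]),
    )

alarms = [
    {"time": "2024-05-01 10:15:00", "class": "Minor",
     "description": "Rescue configuration is not set"},
    {"time": "2024-05-01 10:20:00", "class": "Major",
     "description": "FPC 2 offline"},
]

for alarm in prioritize(alarms):
    print(alarm["class"], alarm["time"], alarm["description"])
```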

Step 3: Routing Engine Health

RE handles control plane: routing protocols, management, commit operations.

    show chassis routing-engine
    show system processes extensive | match "PID|last pid|%CPU"
    show task replication

Key fields from show chassis routing-engine:

- CPU utilization: idle percentage (idle between 20% and 40% is warning, below 20% is critical)
- Memory utilization: total and used; used above 75% is warning, above 85% is critical
- Temperature: compare to platform-specific thresholds
- Start time: a recent RE restart indicates a crash or failover
- Load averages: 1min/5min/15min; sustained above 0.7 per core is elevated

When RE CPU is high, identify the top process:

- rpd: routing protocol daemon; route churn, table size, peer instability
- chassisd: chassis management; sensor polling issues, FPC communication
- snmpd: SNMP polling storms
- mgd: management; large config, slow commit, CLI session overload
- kmd: key management; IKE/IPsec negotiation storms (SRX)

RE CPU spikes during commit operations are normal (can hit 80–90% briefly).
Compare against commit history: show system commit.

Step 4: PFE Health

PFE handles data plane forwarding independently from RE. A healthy RE with
a degraded PFE means traffic is being dropped even though the control plane
looks fine.

    show chassis fpc
    show chassis fpc detail
    show pfe statistics traffic
    show pfe statistics error

show chassis fpc:

- State must be Online. Any other state (Present, Offline, Empty) indicates a hardware issue or intentional deactivation.
- CPU Total: PFE CPU utilization; 60–80% is warning, above 80% is critical
- Memory heap utilization: above 70% indicates PFE memory pressure

show pfe statistics traffic:

- Compare input vs output packet counts; a large delta indicates drops
- Check fabric input drops and local input drops for discard sources

show pfe statistics error:

- Any non-zero error counters warrant investigation
- Sustained incrementing errors (check twice, 30 seconds apart) indicate active issues

On MX platforms with multiple FPCs, check each FPC individually. A single
degraded FPC affects only interfaces on that linecard.
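The "check twice, 30 seconds apart" rule comes down to comparing two counter snapshots. A minimal sketch, assuming the counters have already been extracted into dictionaries; the counter names below are hypothetical placeholders, not actual CLI field names.

```python
def incrementing_counters(first: dict[str, int], second: dict[str, int]) -> dict[str, int]:
    """Return counters that grew between two samples, mapped to their growth."""
    return {
        name: second[name] - value
        for name, value in first.items()
        if name in second and second[name] > value
    }

# Two snapshots taken 30 seconds apart (illustrative values).
sample_t0 = {"fabric_input_drops": 120, "local_input_drops": 0, "hw_errors": 7}
sample_t30 = {"fabric_input_drops": 185, "local_input_drops": 0, "hw_errors": 7}

active = incrementing_counters(sample_t0, sample_t30)
print(active)  # {'fabric_input_drops': 65}
```

A counter that is non-zero but static (`hw_errors` here) still deserves a note in the report, but only the incrementing ones point to an active issue.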

Step 5: System Resources

    show system storage
    show system memory
    show system core-dumps
    show system commit

Storage: JunOS partitions can fill from logs, core dumps, or failed upgrades.
Any partition above 80% used is a warning. /var filling above 90% can prevent
commits and logging.

Memory: show system memory gives kernel-level view. Compare to RE memory
from Step 3 for consistency. Sustained growth without corresponding config
changes suggests a memory leak.

Core dumps: Presence of recent core files (within last 7 days) indicates
process crashes. Record the process name and timestamp — this is JTAC-relevant
data.

Commit history: Recent commits often correlate with symptoms. A device that was
healthy before a commit and unhealthy after has an obvious investigation path.
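The 7-day window for core dumps can be applied to a parsed file list. This sketch assumes each core file has already been reduced to a process name and timestamp (hypothetical record shape); extracting those from the CLI listing is left out.

```python
from datetime import datetime, timedelta

def recent_cores(cores, now, window_days=7):
    """Keep core-dump records whose timestamp falls within the window."""
    cutoff = now - timedelta(days=window_days)
    return [core for core in cores if core["timestamp"] >= cutoff]

now = datetime(2024, 5, 10, 12, 0)
cores = [
    {"process": "rpd", "timestamp": datetime(2024, 5, 8, 3, 14)},
    {"process": "mgd", "timestamp": datetime(2024, 4, 1, 9, 0)},
]

for core in recent_cores(cores, now):
    # Process name and timestamp are the JTAC-relevant fields to record.
    print(core["process"], core["timestamp"].isoformat())
```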

Step 6: Interface and Routing Health

CODEBLOCK5

For each interface with errors:

- CRC errors → Layer 1 (cabling, optics, SFP)
- Input errors without CRC → buffer overruns, MTU mismatch
- Output drops → congestion or policer drops
- Carrier transitions → link flap, check SFP DOM: show interfaces diagnostics optics [name]

Routing: verify expected neighbor count, all adjacencies in Established/Full state.
BGP prefix counts deviating > 10% from baseline indicate route churn.
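The 10% baseline check is simple arithmetic. A sketch with illustrative prefix counts; the baseline value is an assumption that would come from your own records.

```python
def prefix_deviation(current: int, baseline: int) -> float:
    """Fractional deviation of the current prefix count from baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(current - baseline) / baseline

def exceeds_baseline(current: int, baseline: int, threshold: float = 0.10) -> bool:
    """True when the deviation is greater than the threshold (10% by default)."""
    return prefix_deviation(current, baseline) > threshold

print(exceeds_baseline(780_000, 900_000))  # True: about 13.3% below baseline
print(exceeds_baseline(880_000, 900_000))  # False: about 2.2% below baseline
```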

Step 7: Environment

    show chassis environment
    show chassis temperature-thresholds
    show chassis power
    show chassis fan

Check: all temperature sensors within thresholds, all power supplies OK, all
fans operational. Treat any environmental alarm as Major severity.

On platforms with redundant RE: check both RE temperatures. A standby RE running
hot may indicate cooling issues even if master RE temperature is normal.
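A per-sensor temperature check can follow the threshold table's bands: warning within 5°C of the maximum, critical at or above it. A minimal sketch; the sensor readings and maximum values are illustrative and would come from the environment and temperature-threshold outputs.

```python
def temperature_status(reading_c: float, max_c: float, margin_c: float = 5.0) -> str:
    """Classify one sensor reading against its platform maximum."""
    if reading_c >= max_c:
        return "Critical"
    if reading_c >= max_c - margin_c:
        return "Warning"
    return "Normal"

# (sensor label, reading in °C, maximum in °C); labels are hypothetical.
sensors = [("RE 0", 48.0, 75.0), ("RE 1", 72.0, 75.0)]
for name, reading, maximum in sensors:
    print(name, temperature_status(reading, maximum))
```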

Threshold Tables

Reference: references/threshold-tables.md for detailed per-parameter thresholds.

| Parameter | Normal | Warning | Critical | Notes |
|---|---|---|---|---|
| RE CPU idle | > 40% | 20–40% | < 20% | Spikes during commit are normal |
| RE memory used | < 75% | 75–85% | > 85% | |
| RE load avg (1min) | < 0.7/core | 0.7–1.5/core | > 1.5/core | Scale by RE core count |
| PFE CPU | < 60% | 60–80% | > 80% | Per-FPC |
| PFE heap used | < 70% | 70–85% | > 85% | Per-FPC |
| Storage partition | < 80% | 80–90% | > 90% | /var critical for commits |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% | |
| Output drops/hr | < 100 | 100–1000 | > 1000 | |
| Chassis alarm | None | Minor present | Major present | |
| Temperature | Within spec | Within 5°C of max | At/above max | Per-sensor |
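The bands above can be applied mechanically once the values are collected. The sketch below is one possible encoding, not part of the procedure itself: `classify_high` and `classify_low` are hypothetical helpers, and boundary values (for example exactly 85% memory) are assigned to the Warning band as the table's ranges suggest.

```python
def classify_high(value: float, warn: float, crit: float) -> str:
    """For metrics where higher is worse, such as memory or storage used."""
    if value > crit:
        return "Critical"
    if value >= warn:
        return "Warning"
    return "Normal"

def classify_low(value: float, warn: float, crit: float) -> str:
    """For metrics where lower is worse, such as RE CPU idle."""
    if value < crit:
        return "Critical"
    if value <= warn:
        return "Warning"
    return "Normal"

print(classify_low(35.0, warn=40.0, crit=20.0))   # RE CPU idle 35% -> Warning
print(classify_high(88.0, warn=75.0, crit=85.0))  # RE memory 88% -> Critical

# Load average is scaled by core count before classification.
load_1min, cores = 3.2, 4
print(classify_high(load_1min / cores, warn=0.7, crit=1.5))  # 0.8/core -> Warning
```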

Decision Trees

Primary Triage

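    Is the device reachable?
    ├── No → Check console, power, environment. Collect core dumps after recovery.
    └── Yes
        ├── Verify RE mastership → On master RE?
        │   ├── ...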

Alarm Severity Triage

CODEBLOCK8

Escalation Criteria

Escalate to senior engineer or JTAC when:

- RE CPU sustained above 90% for 15+ minutes with no identifiable cause
- RE memory above 90% used with no recent config change
- PFE offline or in non-Online state after power cycle attempt
- Core dumps present from critical processes (rpd, chassisd, pfed)
- Major chassis alarm with no clear remediation
- Multiple FPC failures or fabric errors
- RE failover loop (multiple failovers in short period)
- Any environmental alarm (power, fan, temperature)
- More than 3 routing neighbor state changes in the last hour
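Time-windowed criteria such as the neighbor state-change rule can be checked against a list of event timestamps. A sketch assuming the timestamps have already been extracted from logs; the function names are hypothetical.

```python
from datetime import datetime, timedelta

def changes_in_window(events, now, window=timedelta(hours=1)):
    """Return event timestamps that fall within the window ending at 'now'."""
    return [t for t in events if now - window <= t <= now]

def should_escalate(events, now, limit=3):
    """True when more than 'limit' state changes occurred in the last hour."""
    return len(changes_in_window(events, now)) > limit

now = datetime(2024, 5, 10, 12, 0)
events = [
    datetime(2024, 5, 10, 9, 0),    # outside the window
    datetime(2024, 5, 10, 11, 5),
    datetime(2024, 5, 10, 11, 20),
    datetime(2024, 5, 10, 11, 40),
    datetime(2024, 5, 10, 11, 55),
]
print(should_escalate(events, now))  # True: 4 changes in the last hour
```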

Report Template

CODEBLOCK9

Troubleshooting

Device Unresponsive to SSH

Try console access. If console is also unresponsive, check power and environment
via out-of-band management (craft interface, console server). After recovery:
show system core-dumps, show chassis routing-engine for reboot reason,
show log messages | match "kernel|panic|watchdog".

Logged Into Backup RE

If show chassis routing-engine shows your RE as Backup, you are collecting
standby metrics. Switch to master: request routing-engine login other-routing-engine.
If master RE is unreachable from backup, this indicates master RE failure — check
show chassis routing-engine from backup for master's last known state.

RE CPU Spikes During Commit

JunOS RE CPU can spike to 80–90% during commit operations. This is expected
behavior — the config daemon and rpd both consume CPU during commit processing.
Verify: show system commit to confirm a recent commit, then wait 2–3 minutes
and re-check. Sustained high CPU after commit settles indicates a real problem.
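The "commit, wait, re-check" logic can be sketched as a comparison between CPU samples and the last commit time. The 80% level and the 3-minute settle period are taken from the text above; the sample structure and verdict labels are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def cpu_spike_verdict(samples, last_commit, settle=timedelta(minutes=3), high=80.0):
    """samples: list of (timestamp, cpu_percent) pairs.

    Returns 'normal' when no sample is high, 'commit-related' when every high
    sample falls inside the settle period after the commit, else 'investigate'.
    """
    high_times = [t for t, cpu in samples if cpu >= high]
    if not high_times:
        return "normal"
    if last_commit is not None and all(
        last_commit <= t <= last_commit + settle for t in high_times
    ):
        return "commit-related"
    return "investigate"

commit_at = datetime(2024, 5, 10, 14, 0, 0)
samples = [
    (datetime(2024, 5, 10, 14, 0, 30), 88.0),
    (datetime(2024, 5, 10, 14, 2, 0), 85.0),
    (datetime(2024, 5, 10, 14, 5, 0), 22.0),
]
print(cpu_spike_verdict(samples, commit_at))  # commit-related
```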

PFE Drops With Healthy RE

The RE (control plane) and PFE (data plane) are independent. High PFE drops with
a normal RE means traffic is being discarded at the forwarding level. Check:
show pfe statistics traffic for drop categories, show chassis fpc detail
for PFE CPU and memory. Common causes: filter/policer drops (may be expected),
next-hop resolution failures, PFE memory exhaustion from large tables.

Storage Full Preventing Commits

If /var is above 95% used, commits are likely to fail. Clear space:
request system storage cleanup — removes old logs, core dumps, and temporary
files. If that is insufficient: show system storage to identify the largest
consumers, then selectively remove old software images or rotated log files.
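To spot the partitions that need attention, the storage output can be scanned for the capacity column. This is a sketch under assumptions: it expects a df-style layout with a percentage in the fifth column, and the filesystem names and sizes in the sample are illustrative, not taken from a real device.

```python
def storage_findings(output: str, warn: int = 80, crit: int = 90):
    """Return (mount point, used percent, severity) for partitions at or above warn."""
    findings = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        used_pct = int(fields[4].rstrip("%"))
        if used_pct > crit:
            findings.append((fields[5], used_pct, "Critical"))
        elif used_pct >= warn:
            findings.append((fields[5], used_pct, "Warning"))
    return findings

# Illustrative df-style text; real output varies by platform and release.
SAMPLE = """\
Filesystem              Size       Used      Avail  Capacity   Mounted on
/dev/gpt/junos           20G       1.2G        17G        7%  /.mount
/dev/gpt/var            9.0G       8.4G       0.6G       93%  /.mount/var
tmpfs                   1.0G       850M       174M       83%  /.mount/tmp
"""

for mount, pct, severity in storage_findings(SAMPLE):
    print(f"{severity}: {mount} at {pct}%")
```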

Dual-RE Failover Investigation

After an RE failover: verify new master is healthy (Steps 1–3), then investigate
the old master. From the new master: show chassis routing-engine shows both
REs' state. Check show log messages | match "mastership|failover|switchover"
for the event trigger. Common causes: RE crash (core dump present), watchdog
timeout, manual switchover, GRES/NSR failure.
