Juniper JunOS Device Health Check
Structured triage procedure for assessing Juniper device health across MX, SRX,
EX, QFX, and PTX platforms. Produces a prioritized findings report with severity
classifications and recommended actions.
JunOS separates Routing Engine (RE) and Packet Forwarding Engine (PFE). These
are independent health domains — a healthy RE does not guarantee a healthy PFE,
and vice versa. This procedure assesses both explicitly.
When to Use
- Device reported as slow, dropping traffic, or unresponsive
- Scheduled health audit of Juniper routers, switches, or firewalls
- Post-change verification after commits, upgrades, or ISSU
- Capacity planning data collection for RE CPU, memory, and link utilization
- Incident response when a Juniper device is suspected as the fault domain
- RE failover event — verify mastership and standby RE state
- Chassis alarm triggered — severity triage and root cause identification
Prerequisites
- SSH or console access to the device (login class with view permissions minimum)
- JunOS 21.x or later (commands validated against JunOS 23.2+)
- Network reachability to management interface or fxp0 confirmed
- Awareness of the device's normal baseline (CPU, memory, traffic patterns)
- For dual-RE systems: know which RE should be master under normal operations
- Knowledge of recent commit history if correlating symptoms with changes
Procedure
Follow this sequence. Each step produces data for the final report. RE mastership
verification is mandatory first — all subsequent data is RE-scoped.
Step 1: Verify RE Mastership (Mandatory)
On dual-RE systems, health data comes from the RE you are logged into. If you
are on the backup RE, all metrics reflect the standby engine — not the active
forwarding path. This step is non-negotiable.
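show chassis routing-engine | match "Slot|Current state|Mastership"
show route summary | match "Router ID"
show system uptime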
Verify: your session is on the master RE. If Current state shows Backup,
switch to master: request routing-engine login other-routing-engine.
On single-RE platforms, confirm RE is Master (not in a degraded state).
Record: hostname, RE slot, mastership state, uptime, last reboot reason.
Short uptime after an unexpected reboot — investigate immediately.
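The mastership check can be scripted once the command output has been captured. The following is a minimal Python sketch, not a definitive parser: the sample text is a hypothetical excerpt of `show chassis routing-engine` output, and the names `re_states` and `on_master` are illustrative assumptions. Real output differs between platforms and releases, so the patterns may need adjusting.

```python
import re

# Hypothetical excerpt of `show chassis routing-engine` output (assumed layout).
SAMPLE_RE_OUTPUT = """\
Routing Engine status:
  Slot 0:
    Current state                  Master
    Election priority              Master (default)
  Slot 1:
    Current state                  Backup
    Election priority              Backup (default)
"""


def re_states(text: str) -> dict[int, str]:
    """Map each RE slot number to its 'Current state' value."""
    states: dict[int, str] = {}
    slot = None
    for line in text.splitlines():
        slot_match = re.match(r"\s*Slot (\d+):", line)
        if slot_match:
            slot = int(slot_match.group(1))
            continue
        state_match = re.match(r"\s*Current state\s+(\S+)", line)
        if state_match and slot is not None:
            states[slot] = state_match.group(1)
    return states


def on_master(text: str, session_slot: int) -> bool:
    """True when the RE slot hosting this session reports Master."""
    return re_states(text).get(session_slot) == "Master"
```

If `on_master` returns False for the slot you are logged into, the remaining steps would be collecting standby metrics, which is the situation this step is meant to catch.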
Step 2: Alarm Analysis
JunOS surfaces alarms as first-class status indicators. Check chassis and
system alarms before deeper investigation — alarms may already identify the
problem.
show chassis alarms
show system alarms
Alarm severities:
- Major: service-affecting condition, requires immediate attention
- Minor: degraded but service continues, investigate promptly
If alarms are present, record each alarm's class, description, and time.
Major alarms take priority over all other triage — address them first.
Common alarm sources: FPC offline, power supply failure, rescue config
not set, license expiry, FRU removal.
No alarms → proceed with systematic health assessment.
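Recording each alarm and handling Major alarms first can be sketched in Python as follows. The line layout in the pattern (timestamp, class, description) is an assumption about the captured alarm output, and `Alarm`, `parse_alarms`, and the sample text are hypothetical names and data for illustration only.

```python
import re
from typing import NamedTuple


class Alarm(NamedTuple):
    time: str
    severity: str
    description: str


# Assumed line layout: "<date> <time> <tz>  <Major|Minor>  <description>".
ALARM_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+)\s+(Major|Minor)\s+(.+)$"
)

# Hypothetical captured output; header lines are ignored by the pattern.
SAMPLE_ALARMS = """\
2 alarms currently active
Alarm time               Class  Description
2024-01-15 09:01:12 UTC  Minor  Rescue configuration is not set
2024-01-15 10:22:31 UTC  Major  FPC 2 Offline
"""


def parse_alarms(text: str) -> list[Alarm]:
    """Return alarms with Major entries first, input order kept otherwise."""
    alarms = []
    for line in text.splitlines():
        match = ALARM_LINE.match(line.strip())
        if match:
            alarms.append(Alarm(*match.groups()))
    # sorted() is stable, so alarms of equal severity keep their original order.
    return sorted(alarms, key=lambda alarm: alarm.severity != "Major")
```

An empty result corresponds to the "no alarms" branch, where the procedure continues with the systematic assessment.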
Step 3: Routing Engine Health
RE handles control plane: routing protocols, management, commit operations.
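show chassis routing-engine
show system processes extensive | match "PID|last pid|%CPU" | head 20
show task replication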
Key fields from show chassis routing-engine:
- CPU utilization: user, kernel, interrupt, and idle percentages (idle below 30% is warning)
- Memory utilization: total and used; watch for used > 80%
- Temperature: compare to platform-specific thresholds
- Start time: a recent RE restart indicates a crash or failover
- Load averages: 1min/5min/15min; sustained > 1.0 per core is elevated
High RE CPU with top process identification:
- rpd (routing protocol daemon): route churn, table size, peer instability
- chassisd (chassis management): sensor polling issues, FPC communication
- snmpd: SNMP polling storms
- mgd (management): large config, slow commit, CLI session overload
- kmd (key management): IKE/IPsec negotiation storms (SRX)
RE CPU spikes during commit operations are normal (can hit 80–90% briefly).
Compare against commit history: show system commit.
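The per-core load rule and the daemon hints in this step can be expressed as a short Python sketch. Treating "sustained" as all three load windows exceeding the per-core limit is an assumption made here for illustration, and the function and dictionary names are hypothetical.

```python
# Daemon-to-cause hints taken from this step's process list.
LIKELY_CAUSE = {
    "rpd": "route churn, table size, peer instability",
    "chassisd": "sensor polling issues, FPC communication",
    "snmpd": "SNMP polling storms",
    "mgd": "large config, slow commit, CLI session overload",
    "kmd": "IKE/IPsec negotiation storms (SRX)",
}


def load_sustained(load_avgs: tuple[float, float, float], cores: int,
                   per_core_limit: float = 1.0) -> bool:
    """True when the 1, 5, and 15 minute averages all exceed the per-core limit.

    Assumption: a short spike raises only the 1 minute figure, so requiring
    all three windows filters out commit-time bursts.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return all(avg / cores > per_core_limit for avg in load_avgs)


def likely_cause(process: str) -> str:
    """Return the hint for a top CPU consumer, or a fallback for other names."""
    return LIKELY_CAUSE.get(process, "not in the hint list; investigate manually")
```

For example, load averages of 3.5, 1.2, and 0.8 on a two-core RE are not sustained under this definition, while 2.6, 2.4, and 2.2 are.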
Step 4: PFE Health
PFE handles data plane forwarding independently from RE. A healthy RE with
a degraded PFE means traffic is being dropped even though the control plane
looks fine.
show chassis fpc
show chassis fpc detail
show pfe statistics traffic
show pfe statistics error
show chassis fpc:
- State must be Online for every populated slot. Present or Offline indicates a hardware issue or intentional deactivation; Empty is expected only where no card is installed.
- CPU Total: PFE CPU utilization; above 80% is warning, above 90% critical
- Memory heap utilization: above 80% indicates PFE memory pressure
show pfe statistics traffic:
- Compare input vs output packet counts; a large delta indicates drops
- Check fabric input drops and local input drops for discard sources
show pfe statistics error:
- Any non-zero error counters warrant investigation
- Sustained incrementing errors (check twice 30 seconds apart) indicate active issues
On MX platforms with multiple FPCs, check each FPC individually. A single
degraded FPC affects only interfaces on that linecard.
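The "check twice 30 seconds apart" comparison amounts to diffing two counter snapshots. Below is a minimal Python sketch under the assumption that each snapshot has already been parsed into a name-to-value mapping; the counter names in the usage example and the function name `incrementing_counters` are hypothetical.

```python
def incrementing_counters(first: dict[str, int],
                          second: dict[str, int]) -> dict[str, int]:
    """Return the increase for every counter that grew between two samples.

    Counters that are non-zero but unchanged are historical and are left out;
    counters missing from either sample are skipped rather than guessed at.
    """
    return {
        name: second[name] - first[name]
        for name in first
        if name in second and second[name] > first[name]
    }
```

With a first sample of `{"fabric input drops": 10, "local input drops": 0}` and a second sample of `{"fabric input drops": 42, "local input drops": 0}`, only the fabric counter is reported, with an increase of 32.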
Step 5: System Resources
show system storage
show system memory
show system core-dumps
show system commit | head 10
Storage: JunOS partitions can fill from logs, core dumps, or failed upgrades.
Any partition above 85% used is warning. /var filling above 90% can prevent
commits and logging.
Memory: show system memory gives kernel-level view. Compare to RE memory
from Step 3 for consistency. Sustained growth without corresponding config
changes suggests a memory leak.
Core dumps: Presence of recent core files (within last 7 days) indicates
process crashes. Record the process name and timestamp — this is JTAC-relevant
data.
Commit history: Recent commits correlate with symptoms. A device that was
healthy before a commit and unhealthy after has an obvious investigation path.
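The storage rule in this step (any partition above 85% is a warning, /var above 90% puts commits at risk) can be applied to captured output with a short Python sketch. The column layout is assumed to be df-style with the capacity in the fifth column, and the sample text, labels, and function name are hypothetical.

```python
# Hypothetical captured `show system storage` output (assumed df-style layout).
SAMPLE_STORAGE = """\
Filesystem              Size       Used      Avail  Capacity   Mounted on
/dev/gpt/junos           20G       1.2G        17G        7%  /.mount
/dev/gpt/config         1.0G       880M       120M       88%  /.mount/config
/dev/gpt/var             10G       9.3G       0.7G       93%  /.mount/var
"""


def storage_findings(text: str, warn_pct: int = 85,
                     var_commit_risk_pct: int = 90) -> list[tuple[str, int, str]]:
    """Return (mount point, percent used, label) for partitions over a limit."""
    findings = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any row without a percentage in column five.
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        used = int(fields[4].rstrip("%"))
        mount = fields[5]
        if mount.endswith("/var") and used > var_commit_risk_pct:
            findings.append((mount, used, "commit-risk"))
        elif used > warn_pct:
            findings.append((mount, used, "warning"))
    return findings
```

Both limits are parameters so they can be replaced with site-specific values.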
Step 6: Interface and Routing Health
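show interfaces terse | match "down|err"
show interfaces extensive [name] | match "error|drop|CRC|carrier"
show route summary
show bgp summary
show ospf neighbor
show isis adjacency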
For each interface with errors:
- CRC errors → Layer 1 (cabling, optics, SFP)
- Input errors without CRC → buffer overruns, MTU mismatch
- Output drops → congestion or policer drops
- Carrier transitions → link flap; check SFP DOM: show interfaces diagnostics optics [name]
Routing: verify expected neighbor count, all adjacencies in Established/Full state.
BGP prefix counts deviating > 10% from baseline indicate route churn.
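The 10% baseline rule for BGP prefix counts is simple arithmetic. The following Python sketch shows one way to compute it; the function names are hypothetical and the baseline is assumed to come from your own records.

```python
def prefix_deviation(current: int, baseline: int) -> float:
    """Fractional deviation of the current prefix count from the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive prefix count")
    return abs(current - baseline) / baseline


def churn_suspected(current: int, baseline: int, limit: float = 0.10) -> bool:
    """True when the deviation exceeds the limit (10% by default)."""
    return prefix_deviation(current, baseline) > limit
```

For a baseline of 950,000 prefixes, a current count of 800,000 deviates by about 15.8% and is flagged, while 930,000 deviates by about 2.1% and is not.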
Step 7: Environment
show chassis environment
show chassis temperature-thresholds
show chassis power
show chassis fan
Check: all temperature sensors within thresholds, all power supplies OK, all
fans operational. Any environmental alarm maps directly to Major alarm severity.
On platforms with redundant RE: check both RE temperatures. A standby RE running
hot may indicate cooling issues even if master RE temperature is normal.
Threshold Tables
Reference: references/threshold-tables.md for detailed per-parameter thresholds.
| Parameter | Normal | Warning | Critical | Notes |
|---|---|---|---|---|
| RE CPU idle | > 40% | 20–40% | < 20% | Spikes during commit are normal |
| RE memory used | < 75% | 75–85% | > 85% | |
| RE load avg (1min) | < 0.7/core | 0.7–1.5/core | > 1.5/core | Scale by RE core count |
| PFE CPU | < 60% | 60–80% | > 80% | Per-FPC |
| PFE heap used | < 70% | 70–85% | > 85% | Per-FPC |
| Storage partition | < 80% | 80–90% | > 90% | /var critical for commits |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% | |
| Output drops/hr | < 100 | 100–1000 | > 1000 | |
| Chassis alarm | None | Minor present | Major present | |
| Temperature | Within spec | 5°C of max | At/above max | Per-sensor |
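The numeric rows of this table can be encoded as data so that every metric is classified the same way. This is a minimal Python sketch under stated assumptions: the key names and the `Band` class are hypothetical, boundary values are treated as Warning, and the non-numeric rows (chassis alarm, temperature) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    warning: float            # boundary between Normal and Warning
    critical: float           # boundary between Warning and Critical
    higher_is_worse: bool = True

    def classify(self, value: float) -> str:
        # Negate everything for metrics such as CPU idle, where low is bad,
        # so one comparison chain covers both directions.
        sign = 1 if self.higher_is_worse else -1
        v, w, c = sign * value, sign * self.warning, sign * self.critical
        if v > c:
            return "critical"
        if v >= w:
            return "warning"
        return "normal"


# Rows transcribed from the threshold table; the key names are hypothetical.
THRESHOLDS = {
    "re_cpu_idle_pct": Band(40, 20, higher_is_worse=False),
    "re_memory_used_pct": Band(75, 85),
    "re_load_1min_per_core": Band(0.7, 1.5),
    "pfe_cpu_pct": Band(60, 80),
    "pfe_heap_used_pct": Band(70, 85),
    "storage_used_pct": Band(80, 90),
    "interface_error_rate_pct": Band(0.01, 0.1),
    "output_drops_per_hr": Band(100, 1000),
}


def classify(metric: str, value: float) -> str:
    """Look up the band for a metric and classify a measured value."""
    return THRESHOLDS[metric].classify(value)
```

Keeping the thresholds in one mapping means a site-specific override touches a single entry rather than scattered comparisons.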
Decision Trees
Primary Triage
CODEBLOCK7
Alarm Severity Triage
CODEBLOCK8
Escalation Criteria
Escalate to senior engineer or JTAC when:
- RE CPU sustained above 90% for 15+ minutes with no identifiable cause
- RE memory above 90% used with no recent config change
- PFE offline or in non-Online state after power cycle attempt
- Core dumps present from critical processes (rpd, chassisd, pfed)
- Major chassis alarm with no clear remediation
- Multiple FPC failures or fabric errors
- RE failover loop (multiple failovers in short period)
- Any environmental alarm (power, fan, temperature)
- More than 3 routing neighbor state changes in the last hour
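The last criterion, more than 3 neighbor state changes in the last hour, is a sliding-window count. A minimal Python sketch follows, assuming the state-change timestamps have already been extracted from logs; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def changes_in_window(events: list[datetime], now: datetime,
                      window: timedelta = timedelta(hours=1)) -> int:
    """Count state-change timestamps that fall inside the trailing window."""
    return sum(1 for event in events if now - window <= event <= now)


def escalate_on_neighbor_changes(events: list[datetime], now: datetime,
                                 limit: int = 3) -> bool:
    """True when the number of changes in the last hour exceeds the limit."""
    return changes_in_window(events, now) > limit
```

The same counting function could be reused for the RE failover loop criterion by passing failover timestamps and a window that matches your definition of a short period.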
Report Template
CODEBLOCK9
Troubleshooting
Device Unresponsive to SSH
Try console access. If console is also unresponsive, check power and environment
via out-of-band management (craft interface, console server). After recovery:
show system core-dumps, show chassis routing-engine for reboot reason,
show log messages | match "kernel|panic|watchdog".
Logged Into Backup RE
If show chassis routing-engine shows your RE as Backup, you are collecting
standby metrics. Switch to master: request routing-engine login other-routing-engine.
If master RE is unreachable from backup, this indicates master RE failure — check
show chassis routing-engine from backup for master's last known state.
RE CPU Spikes During Commit
JunOS RE CPU can spike to 80–90% during commit operations. This is expected
behavior — the config daemon and rpd both consume CPU during commit processing.
Verify: show system commit to confirm a recent commit, then wait 2–3 minutes
and re-check. Sustained high CPU after commit settles indicates a real problem.
PFE Drops With Healthy RE
The RE (control plane) and PFE (data plane) are independent. High PFE drops with
a normal RE means traffic is being discarded at the forwarding level. Check:
show pfe statistics traffic for drop categories, show chassis fpc detail
for PFE CPU and memory. Common causes: filter/policer drops (may be expected),
next-hop resolution failures, PFE memory exhaustion from large tables.
Storage Full Preventing Commits
If /var is above 95%, commits will fail. Clear space:
request system storage cleanup — removes old logs, core dumps, and temporary
files. If that is insufficient: show system storage to identify the largest
consumers, then selectively remove old software images or rotated log files.
Dual-RE Failover Investigation
After an RE failover: verify new master is healthy (Steps 1–3), then investigate
the old master. From the new master: show chassis routing-engine shows both
REs' state. Check show log messages | match "mastership|failover|switchover"
for the event trigger. Common causes: RE crash (core dump present), watchdog
timeout, manual switchover, GRES/NSR failure.
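Juniper JunOS Device Health Check
Structured triage procedure for assessing Juniper device health across MX, SRX, EX, QFX, and PTX platforms. Produces a prioritized findings report with severity classifications and recommended actions.
JunOS separates the Routing Engine (RE) from the Packet Forwarding Engine (PFE). These are independent health domains: a healthy RE does not guarantee a healthy PFE, and vice versa. This procedure assesses both explicitly.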
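When to Use
- Device reported as slow, dropping traffic, or unresponsive
- Scheduled health audit of Juniper routers, switches, or firewalls
- Post-change verification after commits, upgrades, or ISSU
- Capacity planning data collection for RE CPU, memory, and link utilization
- Incident response when a Juniper device is suspected as the fault domain
- RE failover event: verify mastership and standby RE state
- Chassis alarm triggered: severity triage and root cause identification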
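Prerequisites
- SSH or console access to the device (login class with view permissions minimum)
- JunOS 21.x or later (commands validated against JunOS 23.2+)
- Network reachability to management interface or fxp0 confirmed
- Awareness of the device's normal baseline (CPU, memory, traffic patterns)
- For dual-RE systems: know which RE should be master under normal operations
- Knowledge of recent commit history if correlating symptoms with changes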
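Procedure
Follow this sequence. Each step produces data for the final report. RE mastership verification is the mandatory first step, because all subsequent data is RE-scoped.
Step 1: Verify RE Mastership (Mandatory)
On dual-RE systems, health data comes from the RE you are logged into. If you are on the backup RE, all metrics reflect the standby engine, not the active forwarding path. This step cannot be skipped.
show chassis routing-engine | match "Slot|Current state|Mastership"
show route summary | match "Router ID"
show system uptime
Verify: your session is on the master RE. If Current state shows Backup, switch to master: request routing-engine login other-routing-engine.
On single-RE platforms, confirm RE is Master (not in a degraded state).
Record: hostname, RE slot, mastership state, uptime, last reboot reason.
Short uptime after an unexpected reboot: investigate immediately.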
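Step 2: Alarm Analysis
JunOS surfaces alarms as first-class status indicators. Check chassis and system alarms before deeper investigation, because alarms may already identify the problem.
show chassis alarms
show system alarms
Alarm severities:
- Major: service-affecting condition, requires immediate attention
- Minor: degraded but service continues, investigate promptly
If alarms are present, record each alarm's class, description, and time.
Major alarms take priority over all other triage; address them first.
Common alarm sources: FPC offline, power supply failure, rescue config not set, license expiry, FRU removal.
No alarms → proceed with systematic health assessment.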
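Step 3: Routing Engine Health
RE handles control plane: routing protocols, management, commit operations.
show chassis routing-engine
show system processes extensive | match "PID|last pid|%CPU" | head 20
show task replication
Key fields from show chassis routing-engine:
- CPU utilization: user, kernel, interrupt, and idle percentages (idle below 30% is warning)
- Memory utilization: total and used; watch for used > 80%
- Temperature: compare to platform-specific thresholds
- Start time: a recent RE restart indicates a crash or failover
- Load averages: 1min/5min/15min; sustained > 1.0 per core is elevated
High RE CPU with top process identification:
- rpd (routing protocol daemon): route churn, table size, peer instability
- chassisd (chassis management): sensor polling issues, FPC communication
- snmpd: SNMP polling storms
- mgd (management): large config, slow commit, CLI session overload
- kmd (key management): IKE/IPsec negotiation storms (SRX)
RE CPU spikes during commit operations are normal (can hit 80–90% briefly). Compare against commit history: show system commit.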
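Step 4: PFE Health
PFE handles data plane forwarding independently from RE. A healthy RE with a degraded PFE means traffic is being dropped even though the control plane looks fine.
show chassis fpc
show chassis fpc detail
show pfe statistics traffic
show pfe statistics error
show chassis fpc:
- State must be Online for every populated slot. Present or Offline indicates a hardware issue or intentional deactivation; Empty is expected only where no card is installed.
- CPU Total: PFE CPU utilization; above 80% is warning, above 90% critical
- Memory heap utilization: above 80% indicates PFE memory pressure
show pfe statistics traffic:
- Compare input vs output packet counts; a large delta indicates drops
- Check fabric input drops and local input drops for discard sources
show pfe statistics error:
- Any non-zero error counters warrant investigation
- Sustained incrementing errors (check twice 30 seconds apart) indicate active issues
On MX platforms with multiple FPCs, check each FPC individually. A single degraded FPC affects only interfaces on that linecard.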
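Step 5: System Resources
show system storage
show system memory
show system core-dumps
show system commit | head 10
Storage: JunOS partitions can fill from logs, core dumps, or failed upgrades. Any partition above 85% used is warning. /var filling above 90% can prevent commits and logging.
Memory: show system memory gives kernel-level view. Compare to RE memory from Step 3 for consistency. Sustained growth without corresponding config changes suggests a memory leak.
Core dumps: Presence of recent core files (within last 7 days) indicates process crashes. Record the process name and timestamp; this is JTAC-relevant data.
Commit history: Recent commits correlate with symptoms. A device that was healthy before a commit and unhealthy after has an obvious investigation path.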
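Step 6: Interface and Routing Health
show interfaces terse | match "down|err"
show interfaces extensive [name] | match "error|drop|CRC|carrier"
show route summary
show bgp summary
show ospf neighbor
show isis adjacency
For each interface with errors:
- CRC errors → Layer 1 (cabling, optics, SFP)
- Input errors without CRC → buffer overruns, MTU mismatch
- Output drops → congestion or policer drops
- Carrier transitions → link flap; check SFP DOM: show interfaces diagnostics optics [name]
Routing: verify expected neighbor count, all adjacencies in Established/Full state.
BGP prefix counts deviating > 10% from baseline indicate route churn.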
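Step 7: Environment
show chassis environment
show chassis temperature-thresholds
show chassis power
show chassis fan
Check: all temperature sensors within thresholds, all power supplies OK, all fans operational. Any environmental alarm maps directly to Major alarm severity.
On platforms with redundant RE: check both RE temperatures. A standby RE running hot may indicate cooling issues even if master RE temperature is normal.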
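Threshold Tables
Reference: references/threshold-tables.md for detailed per-parameter thresholds.
| Parameter | Normal | Warning | Critical | Notes |
|---|---|---|---|---|
| RE CPU idle | > 40% | 20–40% | < 20% | Spikes during commit are normal |
| RE memory used | < 75% | 75–85% | > 85% | |
| RE load avg (1min) | < 0.7/core | 0.7–1.5/core | > 1.5/core | Scale by RE core count |
| PFE CPU | < 60% | 60–80% | > 80% | Per-FPC |
| PFE heap used | < 70% | 70–85% | > 85% | Per-FPC |
| Storage partition | < 80% | 80–90% | > 90% | /var critical for commits |
| Interface error rate | < 0.01% | 0.01–0.1% | > 0.1% | |
| Output drops/hr | < 100 | 100–1000 | > 1000 | |
| Chassis alarm | None | Minor present | Major present | |
| Temperature | Within spec | 5°C of max | At/above max | Per-sensor |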
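Decision Trees
Primary Triage
Is the device reachable?
├── No → Check console, power, environment. Collect core dumps after recovery.
└── Yes
    ├── Verify RE mastership → On master RE?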
│ ├──