UPLO DevOps — Operational Memory for Infrastructure

It is 3 AM. PagerDuty is screaming. The on-call engineer who has seen this exact failure pattern left the company four months ago. The runbook exists somewhere, maybe in Confluence, maybe in a GitHub repo, maybe in a Notion page that someone bookmarked. UPLO DevOps eliminates this scramble by indexing runbooks, post-incident reviews, infrastructure documentation, CI/CD configurations, and architecture decision records into a single searchable layer that works when you need it most.

Session Start

get_identity_context

This loads your team assignments (platform, SRE, application), on-call rotation status, and access tier. Some production configurations and credentials documentation are restricted by clearance.

Grab active directives — these include change freeze windows, incident commander designations, and infrastructure migration deadlines:

get_directives
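The session-start order described above can be sketched in Python. This is a minimal illustration, assuming a hypothetical `call_tool(name, **params)` function that forwards to the UPLO tools and returns a dict. The response fields used here (`directives`, `type`, `title`) are assumptions for the example, not a documented response format.

```python
from typing import Any, Callable

# Hypothetical transport: any callable that invokes a UPLO tool by name.
ToolCaller = Callable[..., dict[str, Any]]


def start_session(call_tool: ToolCaller) -> dict[str, Any]:
    """Load identity first, then directives, and surface any change freeze."""
    identity = call_tool("get_identity_context")
    directives = call_tool("get_directives").get("directives", [])
    # Assumed shape: each directive is a dict with a "type" key.
    freezes = [d for d in directives if d.get("type") == "change_freeze"]
    return {"identity": identity, "directives": directives, "change_freezes": freezes}


def fake_call_tool(name: str, **params: Any) -> dict[str, Any]:
    """Stand-in for a real client so the sketch runs without a server."""
    canned = {
        "get_identity_context": {"teams": ["sre"], "on_call": True},
        "get_directives": {"directives": [
            {"type": "change_freeze", "title": "Quarter-end freeze"},
            {"type": "migration_deadline", "title": "Kafka cutover"},
        ]},
    }
    return canned[name]


session = start_session(fake_call_tool)
```

Pulling change freezes into their own key makes it harder to miss one before a production change.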

When to Use

- You are on-call, an alert fires for a service you have never touched, and you need the runbook immediately
- Investigating a production incident and need to find whether this failure mode has occurred before, including the root cause and fix
- Planning a migration and need to understand the current architecture, dependencies, and the last three ADRs (Architecture Decision Records) related to the affected service
- Setting up a new CI/CD pipeline and want to see how similar services in the org have configured their build, test, and deploy stages
- Preparing a post-incident review and need to compile the timeline, impacted services, and blast radius from multiple data sources
- A new team member needs to understand the infrastructure topology, deployment process, and escalation paths for their service area
- Evaluating whether a proposed infrastructure change conflicts with documented SLOs or capacity constraints

Example Workflows

Incident Response — Novel Failure Mode

The payments service is returning 503 errors. The on-call engineer has not worked on payments before.

search_knowledge query="payments service 503 errors runbook troubleshooting steps"

Check for previous incidents with similar symptoms:

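search_with_context query="payments service outage 503 timeout database connection pool previous incidents root cause"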

If the runbook suggests checking the connection pool but the current configuration is unclear:

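search_knowledge query="payments service database connection pool configuration pgbouncer settings production"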

After resolving:

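log_conversation summary="Resolved payments service 503 outage; root cause was pgbouncer max_client_conn exceeding its limit after a traffic spike; matches the PIR-2024-087 pattern; increased pool size to 200" topics=[incident,payments,pgbouncer,connection-pool] tools_used=[search_knowledge,search_with_context]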
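The lookup order in this workflow (fast runbook search first, relationship search only when nothing comes back) can be sketched in Python. The `call_tool` function and the `results` key in its response are hypothetical stand-ins. Only the tool names and the PIR identifier come from this document.

```python
from typing import Any, Callable

ToolCaller = Callable[..., dict[str, Any]]  # hypothetical tool transport


def find_guidance(call_tool: ToolCaller, service: str, symptom: str) -> tuple[str, list[Any]]:
    """Try the runbook lookup first; fall back to the incident-history search."""
    runbook_query = f"{service} {symptom} runbook troubleshooting steps"
    hits = call_tool("search_knowledge", query=runbook_query).get("results", [])
    if hits:
        return "search_knowledge", hits
    history_query = f"{service} {symptom} previous incidents root cause"
    hits = call_tool("search_with_context", query=history_query).get("results", [])
    return "search_with_context", hits


calls: list[str] = []


def fake_call_tool(name: str, **params: Any) -> dict[str, Any]:
    """Stub that simulates a service with no runbook but one prior incident."""
    calls.append(name)
    if name == "search_knowledge":
        return {"results": []}
    return {"results": [{"id": "PIR-2024-087"}]}


tool_used, results = find_guidance(fake_call_tool, "payments-api", "503")
```

Returning the name of the tool that produced the hits makes it easy to fill in `tools_used` when logging the conversation afterwards.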

Infrastructure Migration Planning

The platform team is moving from self-managed Kafka to a managed streaming service. The tech lead needs to scope the blast radius.

search_with_context query="Kafka consumers producers service dependencies topic configuration"

Find the ADRs that led to the original Kafka deployment:

search_knowledge query="architecture decision record ADR Kafka event streaming selection rationale"

Check current SLOs and whether the migration might violate them:

search_knowledge query="event streaming SLO latency throughput requirements Kafka p99"

export_org_context
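Once the documented SLOs are in hand, checking them against the figures expected from the managed service is simple arithmetic. The Python sketch below shows one way to do it. The `Slo` type, the metric names, and every number are placeholders invented for illustration, not values from any real SLO document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    limit: float
    higher_is_better: bool  # True for throughput, False for latency


def violations(slos: list[Slo], expected: dict[str, float]) -> list[str]:
    """Return names of SLOs the expected post-migration figures would break."""
    broken = []
    for slo in slos:
        value = expected.get(slo.name)
        if value is None:
            broken.append(slo.name)  # no figure available, so treat it as unverified
        elif slo.higher_is_better and value < slo.limit:
            broken.append(slo.name)
        elif not slo.higher_is_better and value > slo.limit:
            broken.append(slo.name)
    return broken


# Placeholder numbers for illustration only.
documented = [
    Slo("p99_latency_ms", 50.0, higher_is_better=False),
    Slo("throughput_msgs_per_s", 10000.0, higher_is_better=True),
]
expected_after_migration = {"p99_latency_ms": 65.0, "throughput_msgs_per_s": 12000.0}
at_risk = violations(documented, expected_after_migration)
```

Treating a missing figure as a violation is a deliberate choice here: an SLO nobody has measured against the new platform should block the migration plan until someone does.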

Key Tools for DevOps

search_knowledge: Your go-to during incidents. When you need a specific runbook, a configuration reference, or a known procedure, this is the fastest path. Latency matters at 3 AM. Example: search_knowledge query="redis cluster failover runbook manual promotion steps"

search_with_context: For investigation and planning. Questions such as "What services depend on this database?" or "Has this failure happened before?" require traversing relationships between services, incidents, and infrastructure components. Example: search_with_context query="auth service dependencies upstream downstream database cache"

get_directives: Change freeze windows, incident escalation policies, and migration deadlines surface here. Checking before a production change can prevent a career-limiting mistake.

flag_outdated: Infrastructure documentation rots faster than any other type. Typical examples: the Kubernetes cluster version documented last quarter is wrong, the network diagram shows a load balancer that was decommissioned, and the runbook references a CLI tool that was replaced. Flag these aggressively, because someone will rely on them during an incident.

report_knowledge_gap: When a service has no runbook, no architecture diagram, or no documented owner, that is an operational risk. Reporting the gap creates a trackable item for the platform team.
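The gap criteria above (no runbook, no architecture diagram, no documented owner) lend themselves to a simple audit. The Python sketch below assumes a hypothetical in-memory service catalog. The catalog format and file paths are invented for illustration, and the sketch stops short of calling report_knowledge_gap because its parameters are not described here.

```python
REQUIRED = ("runbook", "architecture_diagram", "owner")


def knowledge_gaps(catalog: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each service to the required items it is missing."""
    gaps = {}
    for service, docs in catalog.items():
        missing = [item for item in REQUIRED if not docs.get(item)]
        if missing:
            gaps[service] = missing
    return gaps


# Hypothetical catalog; an empty string means the item is not documented.
catalog = {
    "payments-api": {
        "runbook": "runbooks/payments.md",
        "architecture_diagram": "",
        "owner": "payments-team",
    },
    "order-processor": {"runbook": "", "architecture_diagram": "", "owner": ""},
    "auth-service-v2": {
        "runbook": "runbooks/auth.md",
        "architecture_diagram": "diagrams/auth.png",
        "owner": "identity-team",
    },
}
gaps = knowledge_gaps(catalog)
```

Each entry in `gaps` corresponds to one gap report worth filing for the platform team.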

Tips

- Service names are the most reliable search key. Use the exact service identifier from your deployment manifests (payments-api, auth-service-v2, order-processor) rather than casual descriptions.
- Post-incident reviews are the most valuable documents in your knowledge base. When writing PIRs, include structured fields: affected services, duration, blast radius, root cause category, and action items. These fields are indexed by the extraction engine.
- When on-call, start with search_knowledge for the runbook. Only escalate to search_with_context if the runbook does not exist or the failure mode is novel. Speed matters during incidents.
- Use log_conversation after every incident investigation, even false alarms. The pattern of false alarms is itself a signal that the monitoring team should investigate.
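The structured PIR fields named in the tips can be captured as a small record type. The Python sketch below is one possible shape, not a schema the extraction engine requires. The class name, field names, and sample values are assumptions.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class PostIncidentReview:
    pir_id: str
    affected_services: list[str]
    duration_minutes: int
    blast_radius: str
    root_cause_category: str
    action_items: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of structured fields that were left empty."""
        return [name for name, value in asdict(self).items() if value in ("", [], None)]


# Placeholder values for illustration only.
pir = PostIncidentReview(
    pir_id="PIR-EXAMPLE",
    affected_services=["payments-api"],
    duration_minutes=42,
    blast_radius="checkout requests",
    root_cause_category="capacity",
)
incomplete = pir.missing_fields()
```

A check like `missing_fields` can run before a PIR is published, so that reviews without action items are caught while the incident is still fresh.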

