UPLO DevOps — Operational Memory for Infrastructure
It is 3 AM. PagerDuty is screaming. The on-call engineer who has seen this exact failure pattern left the company four months ago. The runbook exists somewhere, maybe in Confluence, maybe in a GitHub repo, maybe in a Notion page that someone bookmarked. UPLO DevOps eliminates this scramble by indexing runbooks, post-incident reviews, infrastructure documentation, CI/CD configurations, and architecture decision records into a single searchable layer that works when you need it most.
Session Start
get_identity_context
This loads your team assignments (platform, SRE, application), on-call rotation status, and access tier. Some production configurations and credentials documentation are restricted by clearance.
Grab active directives — these include change freeze windows, incident commander designations, and infrastructure migration deadlines:
get_directives
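Directives such as change freeze windows are most useful when they are checked mechanically before a deploy. The following is a minimal sketch, assuming directives arrive as records with a kind and a start and end timestamp. The `Directive` shape, its field names, and the `active_freezes` helper are hypothetical illustrations, not part of the UPLO API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Directive:
    """Hypothetical shape of one directive; real get_directives output may differ."""
    kind: str            # e.g. "change_freeze" or "migration_deadline" (assumed values)
    title: str
    starts_at: datetime
    ends_at: datetime


def active_freezes(directives, now):
    """Return change-freeze directives whose window contains `now` (end exclusive)."""
    return [
        d for d in directives
        if d.kind == "change_freeze" and d.starts_at <= now < d.ends_at
    ]


# Illustrative data only.
directives = [
    Directive("change_freeze", "Quarter-end freeze",
              datetime(2024, 3, 29, tzinfo=timezone.utc),
              datetime(2024, 4, 2, tzinfo=timezone.utc)),
    Directive("migration_deadline", "Kafka migration cutover",
              datetime(2024, 5, 1, tzinfo=timezone.utc),
              datetime(2024, 6, 1, tzinfo=timezone.utc)),
]

now = datetime(2024, 3, 30, 3, 0, tzinfo=timezone.utc)
blocking = active_freezes(directives, now)
if blocking:
    print("Hold the deploy:", ", ".join(d.title for d in blocking))
```

Using timezone-aware timestamps throughout avoids comparing a local 3 AM against a freeze window that was defined in UTC.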
When to Use
- You are on-call, an alert fires for a service you have never touched, and you need the runbook immediately
- Investigating a production incident and need to find whether this failure mode has occurred before, including the root cause and fix
- Planning a migration and need to understand the current architecture, dependencies, and the last three ADRs (Architecture Decision Records) related to the affected service
- Setting up a new CI/CD pipeline and want to see how similar services in the org have configured their build, test, and deploy stages
- Preparing a post-incident review and need to compile the timeline, impacted services, and blast radius from multiple data sources
- A new team member needs to understand the infrastructure topology, deployment process, and escalation paths for their service area
- Evaluating whether a proposed infrastructure change conflicts with documented SLOs or capacity constraints
Example Workflows
Incident Response — Novel Failure Mode
The payments service is returning 503 errors. The on-call engineer has not worked on payments before.
search_knowledge query="payments service 503 errors runbook troubleshooting steps"
Check for previous incidents with similar symptoms:
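search_with_context query="payments service outage 503 timeout database connection pool previous incidents root cause"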
If the runbook suggests checking the connection pool but the current configuration is unclear:
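search_knowledge query="payments service database connection pool configuration pgbouncer settings production"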
After resolving:
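log_conversation summary="Resolved payments service 503 outage; root cause was pgbouncer max_client_conn exceeded after a traffic spike; matches the PIR-2024-087 pattern; increased pool size to 200" topics=[incident,payments,pgbouncer,connection pool] tools_used=[search_knowledge,search_with_context]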
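The lookup order in this workflow (runbook search first, relationship search only when that comes back empty) can be sketched as a small function. This is an illustration under stated assumptions: tools are modeled as plain Python callables from a query string to a list of hits, and `find_guidance` plus the two stubs are hypothetical stand-ins for however your client actually invokes the tools.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: each tool is modeled as a function from a query to a list of hits.
Tool = Callable[[str], List[str]]


def find_guidance(service: str, symptom: str,
                  tools: Dict[str, Tool]) -> Tuple[List[str], List[str]]:
    """Try the fast runbook search first; fall back to the contextual search.

    Returns the hits and the names of the tools that were called, so the
    second value can be passed along when logging the investigation.
    """
    used = ["search_knowledge"]
    hits = tools["search_knowledge"](
        f"{service} {symptom} runbook troubleshooting steps"
    )
    if not hits:
        used.append("search_with_context")
        hits = tools["search_with_context"](
            f"{service} {symptom} previous incidents root cause"
        )
    return hits, used


# Stub tools standing in for the real calls (illustrative data only).
def stub_search_knowledge(query: str) -> List[str]:
    return ["payments-api 503 runbook"] if "payments-api" in query else []


def stub_search_with_context(query: str) -> List[str]:
    return ["earlier incident with matching symptoms"] if "503" in query else []


tools = {
    "search_knowledge": stub_search_knowledge,
    "search_with_context": stub_search_with_context,
}

hits, used = find_guidance("payments-api", "503 errors", tools)
print(hits, used)
```

Returning the list of tools that were actually called keeps the later log entry accurate without relying on memory after the incident.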
Infrastructure Migration Planning
The platform team is moving from self-managed Kafka to a managed streaming service. The tech lead needs to scope the blast radius.
search_with_context query="Kafka consumers producers services dependencies topics configuration"
Find the ADRs that led to the original Kafka deployment:
search_knowledge query="architecture decision record ADR Kafka event streaming selection rationale"
Check current SLOs and whether the migration might violate them:
search_knowledge query="event streaming SLO latency throughput requirements Kafka p99"
export_org_context
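Once the dependency search has returned which services consume Kafka directly, scoping the blast radius amounts to a graph traversal over "who depends on whom". The sketch below assumes the results have already been collected into a dictionary. The graph edges and the `notifications` and `reporting` service names are invented for illustration, and `blast_radius` is a hypothetical helper rather than an UPLO tool.

```python
from collections import deque


def blast_radius(dependents, root):
    """Return every service that directly or transitively depends on `root`.

    `dependents` maps a component to the services that consume it directly.
    Breadth-first traversal; the `seen` set makes cycles safe.
    """
    seen = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service != root and service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


# Illustrative graph; the edges are assumptions, not real topology.
dependents = {
    "kafka": ["order-processor", "payments-api"],
    "order-processor": ["notifications"],
    "payments-api": ["notifications", "reporting"],
}

print(blast_radius(dependents, "kafka"))
# ['notifications', 'order-processor', 'payments-api', 'reporting']
```

The transitive entries are the ones most likely to be missed in a migration plan, since they never appear in a list of direct Kafka consumers.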
Key Tools for DevOps
search_knowledge: Your go-to during incidents. When you need a specific runbook, a configuration reference, or a known procedure, this is the fastest path. Latency matters at 3 AM. Example: search_knowledge query="redis cluster failover runbook manual promotion steps"
search_with_context: For investigation and planning. Questions such as "What services depend on this database?" or "Has this failure happened before?" require traversing relationships between services, incidents, and infrastructure components. Example: search_with_context query="auth service dependencies upstream downstream database cache"
get_directives: Change freeze windows, incident escalation policies, and migration deadlines surface here. Checking before a production change can prevent a career-limiting mistake.
flag_outdated: Infrastructure documentation rots faster than any other type. The Kubernetes cluster version documented last quarter is wrong. The network diagram shows a load balancer that was decommissioned. The runbook references a CLI tool that was replaced. Flag these aggressively, because someone will use them during an incident.
report_knowledge_gap: When a service has no runbook, no architecture diagram, or no documented owner, that is an operational risk. Reporting the gap creates a trackable item for the platform team.
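One way to find candidates for flag_outdated is to compare what the documentation claims against what is actually running. The following is a rough sketch, assuming both sides can be reduced to simple key and value pairs. The keys, the values, and the `find_stale_facts` helper are hypothetical, and how the observed values are collected is left open.

```python
def find_stale_facts(documented, observed):
    """Compare documented facts with observed ones and list the mismatches.

    Returns (key, documented_value, observed_value) tuples. The observed
    value is None when the documented item no longer exists at all.
    """
    stale = []
    for key, doc_value in sorted(documented.items()):
        if key not in observed:
            stale.append((key, doc_value, None))
        elif observed[key] != doc_value:
            stale.append((key, doc_value, observed[key]))
    return stale


# Illustrative values only; none of these come from a real environment.
documented = {
    "k8s_version": "1.27",
    "load_balancer": "lb-edge-01",
    "deploy_cli": "deployctl",
}
observed = {
    "k8s_version": "1.29",
    "deploy_cli": "deployctl",
}

for key, doc_value, live_value in find_stale_facts(documented, observed):
    print(f"flag_outdated candidate: {key} documented={doc_value!r} live={live_value!r}")
```

Each reported tuple maps onto one of the cases described above: a changed version, or a component that has been decommissioned but is still documented.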
Tips
- Service names are the most reliable search key. Use the exact service identifier from your deployment manifests (payments-api, auth-service-v2, order-processor) rather than casual descriptions.
- Post-incident reviews are the most valuable documents in your knowledge base. When writing PIRs, include structured fields: affected services, duration, blast radius, root cause category, and action items. These fields are indexed by the extraction engine.
- When on-call, start with search_knowledge for the runbook. Only escalate to search_with_context if the runbook does not exist or the failure mode is novel. Speed matters during incidents.
- Use log_conversation after every incident investigation, even false alarms. The pattern of false alarms is itself a signal that the monitoring team should investigate.
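The structured PIR fields listed in the tips lend themselves to a simple completeness check before a review is published. This is a minimal sketch, assuming a PIR is represented as a dictionary. The key names and the `missing_pir_fields` helper are hypothetical, and the exact field names the extraction engine expects are not specified in this document.

```python
# Assumed key names for the structured fields named in the tips above.
REQUIRED_PIR_FIELDS = (
    "affected_services",
    "duration_minutes",
    "blast_radius",
    "root_cause_category",
    "action_items",
)

# Values treated as "not filled in"; a numeric 0 is deliberately allowed.
_EMPTY = (None, "", [], {})


def missing_pir_fields(pir):
    """Return the required fields that are absent or empty in a PIR record."""
    return [name for name in REQUIRED_PIR_FIELDS if pir.get(name) in _EMPTY]


# Illustrative draft; the numbers and wording are made up.
draft = {
    "affected_services": ["payments-api"],
    "duration_minutes": 42,
    "blast_radius": "checkout requests returning 503",
    "root_cause_category": "capacity",
    "action_items": [],
}

print(missing_pir_fields(draft))
# ['action_items']
```

A check like this could run in CI on the PIR repository so that incomplete reviews are caught before they are indexed.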
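UPLO DevOps: Operational Memory for Infrastructure
It is 3 AM. PagerDuty is screaming. The on-call engineer who has seen this exact failure pattern left the company four months ago. The runbook exists somewhere, maybe in Confluence, maybe in a GitHub repo, maybe in a Notion page that someone bookmarked. UPLO DevOps eliminates this scramble by indexing runbooks, post-incident reviews, infrastructure documentation, CI/CD configurations, and architecture decision records into a single searchable layer that works when you need it most.
Session Start
get_identity_context
This loads your team assignments (platform, SRE, application), on-call rotation status, and access tier. Some production configurations and credentials documentation are restricted by clearance.
Get active directives, including change freeze windows, incident commander designations, and infrastructure migration deadlines:
get_directives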
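When to Use
- You are on-call, an alert fires for a service you have never touched, and you need the runbook immediately
- Investigating a production incident and need to find whether this failure mode has occurred before, including the root cause and fix
- Planning a migration and need to understand the current architecture, dependencies, and the last three ADRs (Architecture Decision Records) related to the affected service
- Setting up a new CI/CD pipeline and want to see how similar services in the org have configured their build, test, and deploy stages
- Preparing a post-incident review and need to compile the timeline, impacted services, and blast radius from multiple data sources
- A new team member needs to understand the infrastructure topology, deployment process, and escalation paths for their service area
- Evaluating whether a proposed infrastructure change conflicts with documented SLOs or capacity constraints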
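Example Workflows
Incident Response: Novel Failure Mode
The payments service is returning 503 errors. The on-call engineer has not worked on payments before.
search_knowledge query="payments service 503 errors runbook troubleshooting steps"
Check for previous incidents with similar symptoms:
search_with_context query="payments service outage 503 timeout database connection pool previous incidents root cause"
If the runbook suggests checking the connection pool but the current configuration is unclear:
search_knowledge query="payments service database connection pool configuration pgbouncer settings production"
After resolving:
log_conversation summary="Resolved payments service 503 outage; root cause was pgbouncer max_client_conn exceeded after a traffic spike; matches the PIR-2024-087 pattern; increased pool size to 200" topics=[incident,payments,pgbouncer,connection pool] tools_used=[search_knowledge,search_with_context]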
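Infrastructure Migration Planning
The platform team is moving from self-managed Kafka to a managed streaming service. The tech lead needs to scope the blast radius.
search_with_context query="Kafka consumers producers services dependencies topics configuration"
Find the ADRs that led to the original Kafka deployment:
search_knowledge query="architecture decision record ADR Kafka event streaming selection rationale"
Check current SLOs and whether the migration might violate them:
search_knowledge query="event streaming SLO latency throughput requirements Kafka p99"
export_org_context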
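Key Tools for DevOps
search_knowledge: Your go-to during incidents. When you need a specific runbook, a configuration reference, or a known procedure, this is the fastest path. Latency matters at 3 AM. Example: search_knowledge query="redis cluster failover runbook manual promotion steps"
search_with_context: For investigation and planning. Questions such as "What services depend on this database?" or "Has this failure happened before?" require traversing relationships between services, incidents, and infrastructure components. Example: search_with_context query="auth service dependencies upstream downstream database cache"
get_directives: Change freeze windows, incident escalation policies, and migration deadlines surface here. Checking before a production change can prevent a career-limiting mistake.
flag_outdated: Infrastructure documentation rots faster than any other type. The Kubernetes cluster version documented last quarter is wrong. The network diagram shows a load balancer that was decommissioned. The runbook references a CLI tool that was replaced. Flag these aggressively, because someone will use them during an incident.
report_knowledge_gap: When a service has no runbook, no architecture diagram, or no documented owner, that is an operational risk. Reporting the gap creates a trackable item for the platform team.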
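Tips
- Service names are the most reliable search key. Use the exact service identifier from your deployment manifests (payments-api, auth-service-v2, order-processor) rather than casual descriptions.
- Post-incident reviews are the most valuable documents in your knowledge base. When writing PIRs, include structured fields: affected services, duration, blast radius, root cause category, and action items. These fields are indexed by the extraction engine.
- When on-call, start with search_knowledge for the runbook. Only escalate to search_with_context if the runbook does not exist or the failure mode is novel. Speed matters during incidents.
- Use log_conversation after every incident investigation, even false alarms. The pattern of false alarms is itself a signal that the monitoring team should investigate.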