RAG Pipelines (Deep Workflow)
RAG quality is dominated by chunking, retrieval, and evaluation—not the LLM alone. Treat the system as data engineering plus generation with explicit failure modes.
When to Offer This Workflow
Trigger conditions:
- - Building Q&A over internal docs, support assistants, or copilots
- Hallucinations, wrong citations, or stale answers
- New content types (PDF, HTML, code repositories)
Initial offer:
Use six stages: (1) task & success criteria, (2) ingestion & cleaning, (3) chunking & metadata, (4) retrieval & rerank, (5) generation & grounding, (6) evaluation & monitoring). Confirm embedding model and retrieval stack (vector DB, search engine, hybrid).
Stage 1: Task & Success Criteria
Goal: Define what a “good” answer contains: required citations, length, tone, and when to refuse.
Exit condition: Written rubric with examples of acceptable vs unacceptable answers.
Stage 2: Ingestion & Cleaning
Goal: Deterministic text extraction (strip boilerplate, handle PDF/OCR if needed); deduplicate documents; track source URL and updated_at for staleness.
Practices
- - Version pipelines when parsers change (re-embed job)
Stage 3: Chunking & Metadata
Goal: Tune chunk size and overlap to query patterns—not one global token count for all content.
Practices
- - Attach metadata for ACL filtering (tenant, product area)
- Prefer structure-aware splits for docs (headings, sections)
Stage 4: Retrieval & Rerank
Goal: Hybrid lexical + dense retrieval often beats vector-only for keyword-heavy queries.
Practices
- - Cross-encoder reranking on top-k for quality (watch latency)
- Query rewriting for multi-turn contexts
Stage 5: Generation & Grounding
Goal: System prompts that require using only provided context; explicit “not found” behavior; optional citation format (snippet, doc id, link).
Stage 6: Evaluation & Monitoring
Goal: Offline golden questions with expected supporting docs; online thumbs-down reasons; monitor retrieval hit rate, nDCG@k, and age of sources used.
Final Review Checklist
- - [ ] Rubric and refusal behavior defined
- [ ] Ingestion deterministic; dedupe and versioning
- [ ] Chunking and metadata match queries and ACLs
- [ ] Hybrid retrieval and rerank tuned with metrics
- [ ] Grounding and citation behavior enforced in prompts
- [ ] Offline eval plus production monitoring
Tips for Effective Guidance
- - Debug retrieval before blaming the LLM.
- Long chunks hurt precision; short chunks hurt context—sweep experiments.
- See also vector-databases and llm-evaluation skills for depth.
Handling Deviations
- - Code RAG: symbol- or AST-aware chunking often beats line-based splits.
- High-stakes domains: add human review gates and audit logs for sources cited.
RAG 流水线(深度工作流)
RAG 质量主要取决于分块、检索和评估——而非仅靠大语言模型。应将系统视为数据工程加生成,并明确故障模式。
何时提供此工作流
触发条件:
- - 基于内部文档构建问答系统、支持助手或副驾驶
- 出现幻觉、错误引用或过时答案
- 新增内容类型(PDF、HTML、代码仓库)
初始建议:
使用六个阶段:(1)任务与成功标准,(2)摄取与清洗,(3)分块与元数据,(4)检索与重排序,(5)生成与接地,(6)评估与监控。确认嵌入模型和检索栈(向量数据库、搜索引擎、混合模式)。
阶段 1:任务与成功标准
目标: 定义好答案包含的内容:所需引用、长度、语气,以及何时拒绝回答。
退出条件: 包含可接受与不可接受答案示例的书面评分标准。
阶段 2:摄取与清洗
目标: 确定性文本提取(去除模板内容,必要时处理PDF/OCR);去重文档;追踪来源URL和updated_at以判断过时。
实践
- - 解析器变更时对流水线进行版本控制(重新嵌入任务)
阶段 3:分块与元数据
目标: 根据查询模式调整分块大小和重叠量——而非对所有内容使用统一的全局令牌数。
实践
- - 附加元数据用于ACL过滤(租户、产品领域)
- 对文档优先采用结构感知分割(标题、章节)
阶段 4:检索与重排序
目标: 对于关键词密集型查询,混合词法+稠密检索通常优于纯向量检索。
实践
- - 对top-k结果使用交叉编码器重排序以提升质量(注意延迟)
- 对多轮对话上下文进行查询重写
阶段 5:生成与接地
目标: 系统提示要求仅使用提供的上下文;明确的未找到行为;可选的引用格式(片段、文档ID、链接)。
阶段 6:评估与监控
目标: 离线黄金问题集及预期支持文档;在线踩原因;监控检索命中率、nDCG@k和所用来源的时效性。
最终审查清单
- - [ ] 已定义评分标准和拒绝行为
- [ ] 摄取过程确定性;去重和版本控制
- [ ] 分块和元数据匹配查询和ACL
- [ ] 混合检索和重排序已根据指标调优
- [ ] 提示中已强制执行接地和引用行为
- [ ] 离线评估加生产监控
有效指导技巧
- - 在归咎大语言模型之前先调试检索。
- 长分块损害精确度,短分块损害上下文——进行实验扫描。
- 另见向量数据库和大语言模型评估技能以深入了解。
处理偏差
- - 代码RAG: 符号或AST感知的分块通常优于基于行的分割。
- 高风险领域: 添加人工审核关卡和引用来源的审计日志。