RAG Accuracy Optimizer
A skill for optimizing end-to-end accuracy in RAG systems.
Workflow Overview
CODEBLOCK0
Each step impacts accuracy. Optimize each step in order.
1. Structured Data Design
SQL vs Vector DB — When to Use What?
| Criteria | SQL (PostgreSQL, MySQL) | Vector DB (Pinecone, Qdrant, Weaviate) |
|---|---|---|
| Exact facts (price, date, product code) | ✅ Optimal | ❌ Not suitable |
| Semantic search (query meaning) | ❌ Not supported | ✅ Optimal |
| Aggregation (SUM, COUNT, AVG) | ✅ Native | ❌ Not supported |
| Fuzzy matching ("similar to...") | ⚠️ Limited | ✅ Optimal |
| Hybrid (recommended) | pgvector for both | Vector DB + SQL metadata store |
Principle: Clearly structured data → SQL. Unstructured data requiring semantic understanding → Vector DB. Most production systems need both.
Schema Design Patterns by Domain
Insurance:
CODEBLOCK1
Finance:
CODEBLOCK2
Healthcare:
CODEBLOCK3
E-commerce:
CODEBLOCK4
Metadata Tagging Strategy
Each chunk/document needs at minimum:
CODEBLOCK5
Metadata principles:
- Always include `source` for traceability and citation
- `entity_id` enables pre-filtering before search → reduces noise
- `chunk_index` + `total_chunks` enables fetching surrounding context
- Domain-specific fields (`clause_number`, `ticker`, `drug_id`) vary by use case
Normalization vs Denormalization
| | Normalized | Denormalized |
|---|---|---|
| Pros | Less duplication, easy to update | Faster queries, fewer JOINs |
| Cons | Requires JOINs, slower | Duplication, harder to sync |
| Use when | Source of truth (SQL) | Vector store chunks |
Recommendation: Keep the SQL source normalized, then denormalize when creating chunks for the Vector DB. Each chunk should carry enough context that no JOINs are needed at retrieval time.
2. Chunking Strategies
Detailed code examples: read `references/chunking-patterns.md`
Choosing the Right Strategy
CODEBLOCK6
Chunk Size Guidelines
| Size | Use case | Trade-off |
|---|---|---|
| 128-256 tokens | FAQ, short definitions | High precision, less context |
| 256-512 tokens | Recommended default | Good balance |
| 512-1024 tokens | Complex text, legal docs | More context, potential noise |
| >1024 tokens | Rarely used | Too much noise |
Semantic Chunking
Split by meaning (section, topic) instead of fixed size:
CODEBLOCK7
Overlap Strategy
- 10-20% overlap between adjacent chunks
- Ensures information at boundaries is not lost
- Chunk N ends with the first 1-2 sentences of chunk N+1
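The overlap rule above can be sketched as a sliding window. This is a minimal illustration, not the skill's own implementation: the function name `chunk_with_overlap` and its parameters are hypothetical, and it treats any list of items (for example whitespace-split words) as "tokens" instead of using a real tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap_ratio=0.15):
    """Split a token list into fixed-size chunks whose boundaries overlap.

    Hypothetical helper: chunk_size and overlap_ratio mirror the guidelines
    above (256-512 tokens, 10-20% overlap) but are assumptions, not fixed values.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Usage: whitespace splitting stands in for a real tokenizer here.
words = "the policy covers hospital stays but excludes cosmetic surgery".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap_ratio=0.25)
```

Splitting on sentence boundaries instead of raw token offsets would match the "1-2 sentences" guideline more closely; the window logic stays the same.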
Hierarchical Chunking (Parent-Child)
CODEBLOCK8
- Search at paragraph level (most detailed)
- When matched, pull parent section for additional context
- Keep `parent_id` in metadata
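As a rough sketch of the parent-child lookup described above: search returns paragraph-level chunks, and each match is expanded with its parent section through `parent_id`. The `Chunk` class, the `store` dictionary, and `expand_with_parents` are illustrative names, assuming chunks are kept in an in-memory mapping keyed by ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for top-level nodes


def expand_with_parents(matches, store):
    """Return each matched chunk followed by its parent, skipping duplicates."""
    seen, context = set(), []
    for chunk in matches:
        for candidate_id in (chunk.chunk_id, chunk.parent_id):
            if candidate_id and candidate_id in store and candidate_id not in seen:
                seen.add(candidate_id)
                context.append(store[candidate_id])
    return context


# Usage: two paragraphs share one parent section, which is included only once.
store = {
    "sec-1": Chunk("sec-1", "Section: Exclusions (key points)"),
    "par-1": Chunk("par-1", "Paragraph about pre-existing conditions", "sec-1"),
    "par-2": Chunk("par-2", "Paragraph about cosmetic surgery", "sec-1"),
}
context = expand_with_parents([store["par-1"], store["par-2"]], store)
```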
Domain-Specific Chunking
- Insurance: 1 chunk = 1 clause
- Finance: 1 chunk = 1 report section, metadata = ticker + period
- Healthcare: 1 chunk = 1 guideline/recommendation
- E-commerce: 1 chunk = 1 review or 1 product description
- Legal: 1 chunk = 1 article/clause/section
Metadata Enrichment Per Chunk
Each chunk should be enriched with:
- Summary: 1-2 sentence content summary (LLM-generated)
- Keywords: Key terms (supports BM25)
- Questions: 2-3 questions this chunk can answer (hypothetical questions)
- Entities: Named entities (product names, codes, dates)
3. Retrieval Optimization
Detailed code examples: read `references/retrieval-patterns.md`
Recommended Retrieval Pipeline
CODEBLOCK9
Hybrid Search (Vector + BM25)
- Vector search: Find by meaning (semantic similarity)
- BM25 (keyword): Find by exact keywords (product names, codes)
- Combined: Weighted fusion or Reciprocal Rank Fusion (RRF)
CODEBLOCK10
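Reciprocal Rank Fusion, mentioned above as an alternative to weighted fusion, needs only the rank of each document in each result list, so vector and BM25 scores never have to be put on a common scale. A minimal sketch follows; the function name is illustrative, and `k=60` is the constant commonly used for RRF, treated here as a tunable assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score means better.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Usage: one list from vector search, one from BM25.
vector_hits = ["a", "b", "c"]
bm25_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents found by both retrievers ("a" and "b" here) rise above documents found by only one.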
Query Rewriting
Use an LLM to reformulate the user's question for clarity:
CODEBLOCK11
Multi-Query
From 1 question, generate 3-5 variants → search each variant → merge results:
CODEBLOCK12
Reranking
After retrieval, use a reranking model to re-sort by relevance:
- Cohere Rerank: Simple API, highly effective
- Cross-encoder: More accurate than bi-encoder, but slower
- GPT Rerank: Use LLM to evaluate relevance (expensive but flexible)
Retrieve top 20 → rerank → take top 3-5 for generation.
Contextual Compression
After reranking, compress each chunk: keep only the part relevant to the question.
CODEBLOCK13
Reduces noise, saves context window, improves accuracy.
Metadata Filtering
Narrow the search space BEFORE vector search:
CODEBLOCK14
4. Accuracy Testing & Monitoring
Test Suite Design
Create ground truth Q&A pairs:
CODEBLOCK15
Recommendation: Minimum 50-100 test cases, evenly distributed across categories and difficulty levels.
Metrics
| Metric | Meaning | Target |
|---|---|---|
| Precision@K | Fraction of the top K results that are relevant | >0.8 |
| Recall@K | Fraction of ground-truth items found in the top K | >0.9 |
| F1 | Harmonic mean of Precision and Recall | >0.85 |
| MRR | Mean Reciprocal Rank: mean of 1/rank of the first correct result | >0.8 |
| NDCG | Normalized Discounted Cumulative Gain: ranking quality | >0.85 |
| Answer Accuracy | Fraction of correct answers (human eval or LLM judge) | >0.9 |
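The retrieval metrics in the table can be computed per query with a few lines of Python. This is a sketch under simple assumptions: relevance is binary, `retrieved` is an ordered list of chunk IDs, and `relevant` is the ground-truth set. The function names are illustrative. MRR is the mean of `reciprocal_rank` over all test queries.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Usage for a single test case.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p_at_2 = precision_at_k(retrieved, relevant, 2)
r_at_4 = recall_at_k(retrieved, relevant, 4)
rr = reciprocal_rank(retrieved, relevant)
```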
A/B Testing
Compare strategies by running the same test suite:
CODEBLOCK16
Error Analysis Framework
Classify errors to know where to optimize:
| Error Type | Cause | Solution |
|---|---|---|
| Retrieval Miss | Correct chunk not found | Improve chunking, add hypothetical Q |
| Ranking Error | Correct chunk found but ranked low | Add reranking |
| Generation Error | Correct chunk but LLM answers wrong | Improve prompt, add few-shot |
| No Answer | Information not in DB | Expand knowledge base |
| Hallucination | LLM fabricates information | Add citation enforcement |
Production Monitoring
Log each query:
CODEBLOCK17
Alerts:
- Sustained confidence < 0.5 → review chunking/retrieval
- Latency > 2s → optimize index or reduce top_k
- Negative feedback > 20% → audit error patterns
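The three alert rules above could be evaluated over a rolling window of query logs, roughly as follows. The log field names (`confidence`, `latency_ms`, `feedback`), the window size, and the interpretation of "sustained" as "every query in the window" are all assumptions made for this sketch, not part of the skill's log format.

```python
def check_alerts(logs, window=20):
    """Return the names of the alerts triggered by the most recent queries."""
    recent = logs[-window:]
    if not recent:
        return []

    alerts = []
    # Sustained low confidence: every query in the window is below 0.5.
    if all(entry["confidence"] < 0.5 for entry in recent):
        alerts.append("low_confidence")
    # Latency: any query in the window slower than 2 seconds.
    if any(entry["latency_ms"] > 2000 for entry in recent):
        alerts.append("high_latency")
    # Negative feedback: more than 20% of the queries that received feedback.
    rated = [entry for entry in recent if entry.get("feedback") is not None]
    if rated:
        negative = sum(1 for entry in rated if entry["feedback"] == "negative")
        if negative / len(rated) > 0.2:
            alerts.append("negative_feedback")
    return alerts


# Usage with two hypothetical log entries.
logs = [
    {"confidence": 0.4, "latency_ms": 500, "feedback": None},
    {"confidence": 0.3, "latency_ms": 2500, "feedback": "negative"},
]
triggered = check_alerts(logs)
```

A percentile-based latency check (for example p95) would be less sensitive to single outliers than the `any` rule used here.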
5. Safeguards
Hallucination Prevention
Mandatory system prompt:
CODEBLOCK18
Citation Enforcement
Require source citations:
CODEBLOCK19
Confidence Thresholds
CODEBLOCK20
Answer Verification
Cross-check the answer with the DB:
1. Extract claims from the answer (using LLM)
2. Verify each claim against retrieved chunks
3. Flag claims without supporting evidence
4. Return only verified claims
6. Embedding Model Selection
Detailed comparison: read INLINECODE7
Quick Decision
| Scenario | Model | Reason |
|---|---|---|
| Production, budget OK | Cohere embed-v4 | Highest MTEB, input_type optimization |
| Production, low cost | OpenAI text-embedding-3-small | $0.02/1M tokens, good quality |
| Self-host, multilingual | BGE-M3 ⭐ | Hybrid dense+sparse, 100+ languages, free |
| Self-host, Vietnamese | BGE-M3 or multilingual-e5-large | Best for Vietnamese RAG |
| POC / Prototype | all-MiniLM-L6-v2 | 90MB, runs on CPU |
Key Principles
- Dimension reduction: OpenAI text-embedding-3 models support Matryoshka-style truncation (e.g. 3072→512 dimensions with only ~3% quality loss)
- Normalize embeddings: Always set `normalize_embeddings=True` when encoding for cosine similarity
- Batch processing: Encode in batches (256-2000 items) instead of one at a time
- Consistency: Use the SAME model for indexing and querying
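To make the normalization and dimension-reduction principles concrete, here is a small pure-Python sketch. It assumes the embedding model was trained so that leading dimensions remain meaningful after truncation (Matryoshka-style); for other models, truncating like this will hurt quality. The helper names are illustrative.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def truncate_embedding(vector, dims):
    """Keep the first `dims` dimensions, then re-normalize to unit length."""
    return l2_normalize(vector[:dims])


def cosine(a, b):
    """For unit-length vectors, the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Usage: index and query vectors must go through the same steps.
doc_vec = truncate_embedding([3.0, 4.0, 12.0], dims=2)
query_vec = truncate_embedding([6.0, 8.0, 1.0], dims=2)
similarity = cosine(doc_vec, query_vec)
```

Re-normalizing after truncation matters: the truncated vector is no longer unit length, so skipping that step would make dot products differ from cosine similarity.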
7. Vector DB Comparison
Detailed comparison + HNSW tuning: read `references/vector-db-comparison.md`
Quick Decision
CODEBLOCK21
HNSW Tuning Quick Reference
| Param | Default | Accuracy-critical | Speed-critical |
|---|---|---|---|
| M | 16 | 48-64 | 8-16 |
| ef_construction | 200 | 400-500 | 100-200 |
| ef (search) | 100 | 200-256 | 50-100 |
Trade-off: Higher M and ef → better recall but more RAM and slower. Tune per SLA.
8. Advanced Techniques
Detailed code examples: read INLINECODE10
Late Chunking
Embed the entire document first, then pool embeddings by chunk boundaries. Each chunk retains context from surrounding text.
CODEBLOCK22
Use when: Documents have many co-references ("it", "this", "the package"). Quality gain: +5-10%.
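The pooling step of late chunking can be illustrated without any model: given per-token embeddings produced by encoding the whole document in one pass, each chunk vector is the mean of the token vectors inside its boundaries. In this sketch the token embeddings are plain lists and the `(start, end)` boundaries are token offsets with an exclusive end; both are assumptions about how a long-context encoder's output would be represented.

```python
def late_chunk_pool(token_embeddings, boundaries):
    """Mean-pool token embeddings inside each (start, end) span.

    token_embeddings: one vector per token, computed over the FULL document,
    so every token already carries context from the surrounding text.
    boundaries: list of (start, end) token offsets, end exclusive.
    """
    chunk_vectors = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        if not span:
            raise ValueError(f"empty span: ({start}, {end})")
        dims = len(span[0])
        chunk_vectors.append(
            [sum(token[d] for token in span) / len(span) for d in range(dims)]
        )
    return chunk_vectors


# Usage: three tokens with 2-dimensional embeddings, split into two chunks.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
chunk_vectors = late_chunk_pool(token_embeddings, [(0, 2), (2, 3)])
```

The difference from ordinary chunking is only the order of operations: embed first, split second.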
RAPTOR (Recursive Abstractive Processing)
Build a multi-level summary tree: Level 0 (chunks) → Level 1 (summaries) → Level 2 (summary of summaries).
Use when: Need to answer both broad queries ("Compare all insurance packages") and narrow queries ("Clause X of Package Y"). Quality gain: +10-15%.
GraphRAG (Microsoft)
Build a knowledge graph from documents → detect communities → summarize communities → query via map-reduce.
Use when: Multi-hop reasoning, synthesize across many documents. Quality gain: +15-25% for synthesis queries. High overhead (many LLM calls when building the graph).
Combining Techniques (Production Stack)
CODEBLOCK23
9. Performance Optimization
Caching Layer
CODEBLOCK24
Async Retrieval
CODEBLOCK25
HNSW Index Tuning
See details in references/vector-db-comparison.md HNSW section. Key: tune ef (search) per latency SLA, tune M per recall target.
10. Vietnamese-Specific RAG
Details: read INLINECODE14
Key Challenges
| Issue | Solution |
|---|---|
| Diacritics (with vs without) | Dual indexing: index both versions |
| Compound words ("bảo hiểm") | Word segmentation (underthesea) |
| Abbreviations (BHXH, TTCK, BLLĐ) | Abbreviation expansion dictionary |
| Vietnamese proper names | NER with underthesea/PhoBERT |
| Domain terms (finance, law, medical) | Domain-specific term enrichment |
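The dual-indexing idea for diacritics can be sketched with the standard library alone: produce an accent-free variant of each text and index both forms, so that a query typed without accents still matches. The function names are illustrative. Note that "đ" has no Unicode decomposition and must be mapped explicitly.

```python
import unicodedata


def strip_diacritics(text):
    """Remove Vietnamese tone and vowel marks, e.g. 'bảo hiểm' -> 'bao hiem'."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # 'đ'/'Đ' are separate letters, not base letter + combining mark.
    return without_marks.replace("đ", "d").replace("Đ", "D")


def dual_index_forms(text):
    """Return the forms to index: the normalized original and the accent-free one."""
    normalized = unicodedata.normalize("NFC", text.lower())
    return {normalized, strip_diacritics(normalized)}


# Usage: both forms go into the keyword (BM25) index for the same chunk.
forms = dual_index_forms("Bảo hiểm")
```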
Embedding Models for Vietnamese
- BGE-M3: Best overall (hybrid dense+sparse, 100+ languages)
- multilingual-e5-large: Good alternative — retrieval-optimized
- PhoBERT-v2: Best for NER/classification (needs fine-tuning for retrieval)
Preprocessing Pipeline
CODEBLOCK26
11. AI Orchestrator — Multi-Model Cost Optimization
Detailed prompt templates, code examples: read INLINECODE15
Query Classification Pipeline
Each user query is classified into 1 of 5 categories:
| Category | Description | Example | Model |
|---|---|---|---|
| simple | Greeting, FAQ, simple lookup | "Hello", "Opening hours?" | No LLM / Local |
| rag | Needs knowledge base search | "Does insurance cover cancer?" | Cheap (Gemini Flash) |
| complex | Multi-hop reasoning, comparison, analysis | "Compare 3 insurance packages for a family of 4" | Standard (GPT-4o-mini) / Premium (Claude Sonnet) |
| action | Needs tool/API execution (create form, calculate) | "Calculate insurance premium for me, age 30" | Standard + Tools |
| unsafe | Violation content, injection, jailbreak | "Ignore instructions..." | Block (No LLM) |
2-Stage Classification (Minimize LLM Tokens)
CODEBLOCK27
Stage 1 resolves 60-80% of queries without spending a single LLM token.
Model Routing
CODEBLOCK28
Cost Optimization Rules
1. Rule-based first: Greeting, FAQ, unsafe → DON'T call LLM
2. Cheapest sufficient model: Prefer Gemini Flash for RAG queries
3. Escalate on failure: Gemini Flash fail/low-confidence → GPT-4o-mini → Claude Sonnet
4. Cache responses: Identical queries → cached answer (TTL 5-30 min)
5. Batch classify: Multiple queries → 1 LLM call to classify all
6. Token budget: Set max_tokens per category (simple: 100, rag: 300, complex: 500)
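The "escalate on failure" rule can be sketched as a loop over models ordered from cheapest to most expensive. Everything here is illustrative: each model is represented by a plain callable returning `(answer, confidence)`, and the `min_confidence=0.7` cut-off is an assumed value, not one given by the rules above.

```python
def answer_with_escalation(query, models, min_confidence=0.7):
    """Try models in order; stop at the first sufficiently confident answer.

    models: list of (name, callable) pairs, cheapest first. Each callable
    takes the query and returns (answer, confidence).
    Returns (name, answer, confidence) of the last successful call,
    or None if every call raised an error.
    """
    last = None
    for name, call in models:
        try:
            answer, confidence = call(query)
        except Exception:
            continue  # a failed call is treated like a low-confidence answer
        last = (name, answer, confidence)
        if confidence >= min_confidence:
            break
    return last


# Usage with stand-in callables instead of real API clients.
calls = []

def flash(query):
    calls.append("flash")
    return ("unsure", 0.4)

def mini(query):
    calls.append("mini")
    return ("covered", 0.9)

def sonnet(query):
    calls.append("sonnet")
    return ("covered", 0.99)

result = answer_with_escalation(
    "Does insurance cover cancer?",
    [("flash", flash), ("mini", mini), ("sonnet", sonnet)],
)
```

The premium model is never called when a cheaper one is confident enough, which is where the cost saving comes from.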
RAG Trigger Rules
| Condition | RAG On/Off |
|---|---|
| Query contains domain keywords | ✅ ON |
| Classification = "rag" or "complex" | ✅ ON |
| Greeting, simple lookup, unsafe | ❌ OFF |
| Confidence score > 0.9 from cache/FAQ | ❌ OFF (answer from cache) |
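One possible encoding of the trigger table as a function is shown below. The table does not say which rule wins when several apply, so the precedence used here (cache hit first, then simple/unsafe, then rag/complex, then keyword match) is an assumption, as are the keyword set and the function name.

```python
# Hypothetical keyword list; a real deployment would load this per domain.
DOMAIN_KEYWORDS = {"insurance", "premium", "policy", "claim"}


def should_use_rag(query, category, cache_confidence=None):
    """Decide whether to run retrieval for a classified query."""
    # A confident cache/FAQ hit is answered directly, without retrieval.
    if cache_confidence is not None and cache_confidence > 0.9:
        return False
    # Greetings, simple lookups and unsafe queries never trigger RAG.
    if category in {"simple", "unsafe"}:
        return False
    if category in {"rag", "complex"}:
        return True
    # Other categories (e.g. "action") fall back to the keyword check.
    words = set(query.lower().split())
    return bool(words & DOMAIN_KEYWORDS)


# Usage
use_rag = should_use_rag("Does insurance cover cancer?", "rag")
```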
Tool Trigger Rules
| Condition | Tools |
|---|---|
| Query requests calculation (fees, interest) | `calculator` tool |
| Query requests form creation/submission | `form_builder` tool |
| Query requests real-time lookup (price, exchange rate) | `api_lookup` tool |
| Classification ≠ "action" | No tools |
JSON Output Format
CODEBLOCK29
Scripts
eval_ragas.py
RAGAS evaluation pipeline. Run:
CODEBLOCK30
Input: JSON file with test cases (question, answer, contexts, ground_truth). Output: metrics report + threshold checks.
Requires: INLINECODE16
embedding_benchmark.py
Benchmark embedding models on a Vietnamese dataset. Run:
CODEBLOCK31
Input: JSON file with query-positive-negative pairs. Output: accuracy + latency comparison.
Requires: INLINECODE17
chunk_optimizer.py
Evaluate chunk quality. Run:
CODEBLOCK32
Input: JSONL file, each line is {"text": "...", "metadata": {...}}. Output: quality report with scores.
accuracy_test.py
Test framework for RAG accuracy. Run:
CODEBLOCK33
Input: JSON file with test cases (question, expected_answer, expected_source). Output: metrics report.
References
- `references/chunking-patterns.md`: Python code examples for chunking strategies
- `references/retrieval-patterns.md`: Code examples for hybrid search, reranking, multi-query
- INLINECODE21: Detailed embedding model comparison (OpenAI, Cohere, BGE-M3, PhoBERT...)
- `references/vector-db-comparison.md`: Vector DB comparison + HNSW tuning guide
- INLINECODE23: Late Chunking, RAPTOR, GraphRAG with code examples
- INLINECODE24: RAGAS, LLM-as-Judge, Adversarial testing
- INLINECODE25: Vietnamese NLP: diacritics, abbreviations, NER, domain terms
- INLINECODE26: Multi-model orchestrator: prompt templates, rule-based pre-classifier, cost comparison, fallback chain, monitoring
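RAG Accuracy Optimizer
A skill for optimizing end-to-end accuracy in RAG systems.
Workflow Overview
Data Design → Chunking → Indexing → Retrieval → Generation → Testing → Monitoring
Each step impacts accuracy. Optimize each step in order.
1. Structured Data Design
SQL vs Vector DB: When to Use What?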
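| Criteria | SQL (PostgreSQL, MySQL) | Vector DB (Pinecone, Qdrant, Weaviate) |
|---|---|---|
| Exact facts (price, date, product code) | ✅ Optimal | ❌ Not suitable |
| Semantic search (query meaning) | ❌ Not supported | ✅ Optimal |
| Aggregation (SUM, COUNT, AVG) | ✅ Native | ❌ Not supported |
| Fuzzy matching ("similar to...") | ⚠️ Limited | ✅ Optimal |
| Hybrid (recommended) | pgvector for both | Vector DB + SQL metadata store |
Principle: Clearly structured data → SQL. Unstructured data requiring semantic understanding → Vector DB. Most production systems need both.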
Schema Design Patterns by Domain
Insurance:
policies(policy_id, product_type, effective_date)
clauses(clause_id, policy_id, clause_number, title, content)
exclusions(exclusion_id, clause_id, description)
-- Vector: embeddings of clause.content + exclusion.description
Finance:
securities(ticker, name, sector, exchange)
reports(report_id, ticker, period, report_type)
sections(section_id, report_id, heading, content)
-- Vector: embeddings of section.content, metadata: ticker + period
Healthcare:
drugs(drug_id, generic_name, brand_name, category)
guidelines(guideline_id, condition, recommendation, evidence_level)
interactions(drug_a_id, drug_b_id, severity, description)
-- Vector: embeddings of guidelines.recommendation
E-commerce:
products(product_id, name, category, brand, price)
reviews(review_id, product_id, rating, content)
specs(product_id, attribute, value)
-- Vector: embeddings of review.content + product description
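Metadata Tagging Strategy
Each chunk/document needs at minimum:
python
metadata = {
    "source": "policy_doc_v2.pdf",   # origin document
    "source_type": "pdf",            # file type
    "domain": "insurance",           # domain
    "category": "life_insurance",    # category
    "entity_id": "POL-2024-001",     # ID of the related entity
    "section": "exclusions",         # section within the document
    "chunk_index": 3,                # position of this chunk
    "total_chunks": 12,              # total chunks in the document
    "created_at": "2024-01-15",      # creation date
    "version": "2.0",                # version
    "language": "en",                # language
}
Metadata principles:
- Always include `source` for traceability and citation
- `entity_id` enables pre-filtering before search → reduces noise
- `chunk_index` + `total_chunks` enables fetching surrounding context
- Domain-specific fields (`clause_number`, `ticker`, `drug_id`) vary by use case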
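Normalization vs Denormalization
| | Normalized | Denormalized |
|---|---|---|
| Pros | Less duplication, easy to update | Faster queries, fewer JOINs |
| Cons | Requires JOINs, slower | Duplication, harder to sync |
| Use when | Source of truth (SQL) | Vector store chunks |
Recommendation: Keep the SQL source normalized, then denormalize when creating chunks for the Vector DB. Each chunk should carry enough context that no JOINs are needed at retrieval time.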
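2. Chunking Strategies
Detailed code examples: read references/chunking-patterns.md
Choosing the Right Strategy
Data has a clear structure (clauses, sections)?
→ Semantic chunking (by heading/section)
Long continuous data (articles, transcripts)?
→ Fixed-size + overlap (512 tokens, 10-20% overlap)
Need overview + detail?
→ Hierarchical chunking (parent-child)
Domain-specific data with its own logical units?
→ Domain-specific chunking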
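Chunk Size Guidelines
| Size | Use case | Trade-off |
|---|---|---|
| 128-256 tokens | FAQ, short definitions | High precision, less context |
| 256-512 tokens | Recommended default | Good balance |
| 512-1024 tokens | Complex text, legal docs | More context, potential noise |
| >1024 tokens | Rarely used | Too much noise |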
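Semantic Chunking
Split by meaning (section, topic) instead of fixed size:
python
# Split by Markdown headers
# Split by paragraph separators (\n\n)
# Split by topic shifts (detected with NLP or an LLM)
Overlap Strategy
- 10-20% overlap between adjacent chunks
- Ensures information at boundaries is not lost
- Chunk N ends with the first 1-2 sentences of chunk N+1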
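Hierarchical Chunking (Parent-Child)
Document (summary)
└── Section (heading + key points)
└── Paragraph (detailed information)
- Search at paragraph level (most detailed)
- When matched, pull parent section for additional context
- Keep `parent_id` in metadata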
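Domain-Specific Chunking
- Insurance: 1 chunk = 1 clause
- Finance: 1 chunk = 1 report section, metadata = ticker + period
- Healthcare: 1 chunk = 1 guideline/recommendation
- E-commerce: 1 chunk = 1 review or 1 product description
- Legal: 1 chunk = 1 article/clause/section
Metadata Enrichment Per Chunk
Each chunk should be enriched with:
- Summary: 1-2 sentence content summary (LLM-generated)
- Keywords: Key terms (supports BM25)
- Questions: 2-3 questions this chunk can answer (hypothetical questions)
- Entities: Named entities (product names, codes, dates)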
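3. Retrieval Optimization
Detailed code examples: read references/retrieval-patterns.md
Recommended Retrieval Pipeline
User Query
→ Query Rewriting (expand/rephrase)
→ Multi-Query Generation (3-5 variants)
→ Metadata Filtering (narrow scope)
→ Hybrid Search (Vector + BM25)
→ Merge & Deduplicate
→ Reranking (top 20 → top 5)
→ Contextual Compression
→ LLM Generation (with citations)
Hybrid Search (Vector + BM25)
- Vector search: Find by meaning (semantic similarity)
- BM25 (keyword): Find by exact keywords (product names, codes)
- Combined: Weighted fusion or Reciprocal Rank Fusion (RRF)
final_score = α × vector_score + (1-α) × bm25_score
α = 0.7 is a good starting point; adjust per domain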
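The weighted-fusion formula above only behaves sensibly when both score types share a scale, since raw BM25 scores are unbounded while vector similarities usually are not. The sketch below applies min-max normalization to each result set before combining them; that normalization step, the function names, and the choice to score a missing document as 0 are assumptions added for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(vector_scores, bm25_scores, alpha=0.7):
    """final_score = alpha * vector_score + (1 - alpha) * bm25_score."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Usage: cosine similarities from vector search, raw scores from BM25.
ranked = weighted_fusion({"a": 0.9, "b": 0.5}, {"b": 12.0, "c": 3.0})
```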
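Query Rewriting
Use an LLM to reformulate the user's question for clarity:
User: "Does insurance pay out?"
→ Rewritten: "Under what circumstances does life insurance pay out a benefit?"
Multi-Query
From 1 question, generate 3-5 variants → search each variant → merge results:
Original: "Which bank has the highest savings interest rate?"
Query 1: "Compare savings interest rates across banks in 2024"
Query 2: "Bank with the highest deposit rate right now"
Query 3: "Top banks for the best deposit rates"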
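Reranking
After retrieval, use a reranking model to re-sort by relevance:
- Cohere Rerank: Simple API, highly effective
- Cross-encoder: More accurate than bi-encoder, but slower
- GPT Rerank: Use LLM to evaluate relevance (expensive but flexible)
Retrieve top 20 → rerank → take top 3-5 for generation.
Contextual Compression
After reranking, compress each chunk: keep only the part relevant to the question.
Original chunk (500 tokens) → Compressed (150 tokens, relevant part only)
Reduces noise, saves context window, improves accuracy.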
Metadata Filtering
Narrow the search space BEFORE vector search:
python
# Instead of searching all 1M chunks:
filter = {"domain": "insurance", "product_type": "life"}
#