Vector Databases (Deep Workflow)

Vector search is approximate nearest neighbor (ANN) search at scale, not magic semantic understanding. Success depends on an embedding model that fits the data, tuned index parameters, well-designed metadata filters, and evaluation against real queries.

When to Offer This Workflow

Trigger conditions:

- Building RAG, similarity search, dedup, recommendations, anomaly clustering
- Comparing managed vector DB vs pgvector vs search engine kNN
- Recall issues, stale vectors, slow queries, or cost explosions

Initial offer:

Use six stages: (1) problem & metrics, (2) embeddings & schema, (3) index & parameters, (4) hybrid & filtering, (5) operations & cost, (6) evaluation & iteration. Confirm scale (vectors, QPS, dimension) and latency SLO.

Stage 1: Problem & Metrics

Goal: Define what “similar” means for the product—not only cosine similarity.

Questions

1. Query types: short keyword vs long paragraph? multilingual?
2. Precision vs recall preference: legal/medical may need high precision
3. Freshness: how often do vectors change? Real-time upserts?
4. Ground truth: any labeled relevant pairs for eval?

Metrics

- Recall@k, MRR, nDCG when judgments exist; otherwise human spot checks + proxy tasks
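
The three metrics named above can be computed from a list of retrieved IDs and a set of judged-relevant IDs. The following is a minimal sketch using only the standard library; the function names and the sample IDs are illustrative assumptions, not part of any particular evaluation toolkit.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, gains, k):
    """nDCG@k with graded gains (dict doc_id -> gain); unjudged IDs count as 0."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Illustrative data for one query.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(recall_at_k(retrieved, relevant, k=3))   # one of three relevant IDs is in the top 3
print(reciprocal_rank(retrieved, relevant))    # first relevant hit is at rank 2
```

MRR is the mean of `reciprocal_rank` over all queries in the evaluation set.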

Exit condition: Success metric and minimum acceptable recall/latency stated.

Stage 2: Embeddings & Schema

Goal: Stable embedding pipeline with versioning and metadata design.

Embeddings

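- Model choice: domain fit (code vs general text); dimension; distance metric (cosine vs dot vs L2). Use the metric the model was trained for and confirm the DB index is configured to match instead of relying on its default
- Chunking strategy upstream: bad chunks → bad retrieval regardless of DB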
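
To see why the distance metric matters, the sketch below compares dot product, cosine similarity, and L2 distance on toy two-dimensional vectors. The vectors are made up for illustration. It also shows that after unit normalization the three metrics rank results identically, since squared L2 distance equals 2 - 2 × cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query = [1.0, 0.0]
short_doc = [0.9, 0.1]   # almost the same direction as the query, small magnitude
long_doc = [3.0, 3.0]    # 45 degrees away from the query, large magnitude

# Raw dot product rewards magnitude, so long_doc scores higher.
print(dot(query, short_doc), dot(query, long_doc))
# Cosine ignores magnitude, so short_doc scores higher.
print(cosine_similarity(query, short_doc), cosine_similarity(query, long_doc))

# On unit vectors, dot product equals cosine and L2^2 == 2 - 2 * cosine.
q, d = normalize(query), normalize(long_doc)
print(l2_distance(q, d) ** 2, 2 - 2 * dot(q, d))
```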

Schema

- Payload/metadata per vector: doc_id, tenant_id, acl, source, timestamps
- Multi-vector per doc (passages) vs single centroid: trade-offs
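
As one way to picture the multi-vector option, the sketch below stores one record per passage with the payload fields listed above, then collapses passage-level hits into document-level results by keeping each document's best passage. The `PassageRecord` class, its `vector_id` and `chunk_index` fields, and `collapse_to_docs` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

@dataclass
class PassageRecord:
    """One stored vector plus its payload."""
    vector_id: str
    doc_id: str
    tenant_id: str
    acl: list[str]
    source: str
    created_at: str        # ISO 8601 timestamp
    chunk_index: int = 0   # hypothetical: position of the passage in its document

def collapse_to_docs(hits, top_k):
    """Turn (record, score) passage hits into doc-level hits, best passage per doc."""
    best = {}
    for record, score in hits:
        current = best.get(record.doc_id)
        if current is None or score > current[1]:
            best[record.doc_id] = (record, score)
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

def _rec(vid, doc, idx):
    return PassageRecord(vid, doc, "tenant-1", ["group:all"], "wiki",
                         "2024-01-01T00:00:00Z", idx)

hits = [(_rec("v1", "A", 0), 0.90), (_rec("v2", "A", 1), 0.80), (_rec("v3", "B", 0), 0.85)]
doc_hits = collapse_to_docs(hits, top_k=2)
print([(r.doc_id, s) for r, s in doc_hits])
```

Taking the maximum passage score is only one possible aggregation; a single centroid per document avoids this step but blurs long documents that cover several topics.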

Versioning

- Re-embed the whole corpus on model change, because vectors from different models are not comparable; plan downtime or a dual-write period
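
A dual-write period can be sketched as one store per embedding-model version, with writes going to every version and reads pinned to a single version until cutover. Everything here is an assumption made for illustration: the `DualWriteIndex` class, the in-memory dict stores, and the toy embedders stand in for a real DB client and real models. Backfilling records that existed before the migration started is not covered.

```python
class DualWriteIndex:
    """Keeps one store per embedding-model version during a migration."""

    def __init__(self, embedders, read_version):
        # embedders: dict of version -> callable(text) -> list[float] (hypothetical)
        self.embedders = embedders
        self.stores = {version: {} for version in embedders}
        self.read_version = read_version

    def upsert(self, record_id, text):
        # Write to every version so the new index is complete before cutover.
        for version, embed in self.embedders.items():
            self.stores[version][record_id] = embed(text)

    def embed_query(self, text):
        # Queries must be embedded with the same model as the store they search.
        return self.embedders[self.read_version](text)

    def active_store(self):
        return self.stores[self.read_version]

    def cut_over(self, new_version):
        if new_version not in self.stores:
            raise KeyError(new_version)
        self.read_version = new_version

# Toy embedders with different dimensions, standing in for two model versions.
embedders = {
    "v1": lambda text: [float(len(text))],
    "v2": lambda text: [float(len(text)), float(text.count(" "))],
}
index = DualWriteIndex(embedders, read_version="v1")
index.upsert("doc-1", "hello vector world")
print(len(index.embed_query("hi")))   # dimension of the v1 model
index.cut_over("v2")
print(len(index.embed_query("hi")))   # dimension of the v2 model
```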

Exit condition: ID strategy + metadata filter needs documented.

Stage 3: Index & Parameters

Goal: Pick index type and build params for data size and recall.

Common families (vendor-specific names)

- HNSW: strong latency/recall; memory hungry; tunable efConstruction, M
- IVF: lower memory; needs a training step to build nlist clusters; tune the number of probes at query time
- PQ/OPQ: compression at the cost of recall; good for huge scale
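
The memory trade-offs between these families can be estimated before any benchmark. The sketch below uses rough rules of thumb, stated as assumptions: HNSW stores the raw float32 vector plus about 2 × M four-byte neighbor links per vector at the base layer (upper layers ignored), and PQ stores one code per subquantizer (codebooks ignored). Real implementations differ, so treat the results as order-of-magnitude only.

```python
def flat_bytes(n, dim, bytes_per_float=4):
    """Raw float32 vectors with no index structure."""
    return n * dim * bytes_per_float

def hnsw_bytes(n, dim, M, bytes_per_float=4, bytes_per_link=4):
    """Raw vectors plus roughly 2*M base-layer links per vector (rule of thumb)."""
    return n * (dim * bytes_per_float + 2 * M * bytes_per_link)

def pq_bytes(n, num_subquantizers, bits_per_code=8):
    """PQ codes only; codebooks and any coarse quantizer are ignored."""
    return n * num_subquantizers * bits_per_code // 8

n, dim = 10_000_000, 768
for name, size in [
    ("flat", flat_bytes(n, dim)),
    ("hnsw (M=16)", hnsw_bytes(n, dim, M=16)),
    ("pq (96 subquantizers)", pq_bytes(n, num_subquantizers=96)),
]:
    print(f"{name}: {size / 1e9:.2f} GB")
```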

Tuning loop

- Start from defaults; sweep parameters with benchmark queries
- Watch insert throughput during index build on large backfills
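
A parameter sweep needs a harness that reports recall against exact results together with tail latency. The following sketch assumes a hypothetical `search_fn(query, k)` callable that wraps whatever client is under test, and a precomputed list of exact top-k IDs per query (for example from a brute-force scan). `demo_search` is a stand-in so the example runs without a database.

```python
import math
import time

def percentile(values, pct):
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def benchmark(search_fn, queries, ground_truth, k):
    """Return mean recall@k against exact neighbors and p95 latency in seconds."""
    latencies, recalls = [], []
    for query, truth in zip(queries, ground_truth):
        start = time.perf_counter()
        result_ids = search_fn(query, k)
        latencies.append(time.perf_counter() - start)
        truth_k = set(truth[:k])
        recalls.append(len(set(result_ids[:k]) & truth_k) / len(truth_k))
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "p95_latency_s": percentile(latencies, 95),
    }

# Exact neighbors per query (illustrative IDs).
exact = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}

def demo_search(query, k):
    # Hypothetical stand-in for an ANN client call; it misses the last true neighbor.
    return exact[query][: k - 1] + ["miss"]

report = benchmark(demo_search, ["q1", "q2"], [exact["q1"], exact["q2"]], k=4)
print(report["recall_at_k"])
```

Running `benchmark` once per parameter setting, with the same queries and the same k, gives the latency-versus-recall table that the exit condition asks for.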

Exit condition: Benchmark results: p95 latency vs recall at fixed k.

Stage 4: Hybrid Search & Filtering

Goal: Combine vector similarity with structured constraints—most production needs this.

Patterns

- Pre-filter metadata (tenant, date) before ANN when supported; verify filter selectivity
- Hybrid: BM25 + vector with weighted fusion or rerank stage
- Reranking: cross-encoder on top-k candidates; quality boost, latency cost
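
Weighted fusion can be sketched as min-max normalizing each score list and then taking a weighted sum. This is one simple variant among several; the `alpha` weight, the function names, and the sample scores are assumptions for illustration. BM25 and vector scores live on different scales, which is why the normalization step is there.

```python
def min_max(scores):
    """Scale a dict of doc_id -> score into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(bm25_scores, vector_scores, alpha=0.5, top_k=10):
    """alpha weights the vector score; (1 - alpha) weights BM25."""
    bm25_n, vec_n = min_max(bm25_scores), min_max(vector_scores)
    fused = {
        doc_id: alpha * vec_n.get(doc_id, 0.0) + (1 - alpha) * bm25_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

bm25 = {"a": 12.0, "b": 7.0, "c": 2.0}
vector = {"b": 0.91, "c": 0.88, "d": 0.60}
ranked = weighted_fusion(bm25, vector, alpha=0.5)
print([doc_id for doc_id, _ in ranked])
```

A document that appears in only one list gets 0 for the missing signal, so `b`, which both retrievers rank well, ends up first in this example.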

Pitfalls

- Filtering that leaves too few candidates—empty results despite “similar” existing in other tenants
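
When the engine cannot pre-filter, one common workaround for this pitfall is to post-filter and fetch more candidates until enough survive. The sketch below assumes a hypothetical `search_fn(query, n)` that returns the n nearest records in similarity order; the growth factor and `max_fetch` cap are arbitrary illustrative values.

```python
def filtered_search(search_fn, query, predicate, k, max_fetch=1000):
    """Post-filter ANN results, widening the fetch until k records survive."""
    fetch = k
    while True:
        candidates = search_fn(query, fetch)
        kept = [c for c in candidates if predicate(c)]
        exhausted = len(candidates) < fetch   # the index had nothing more to return
        if len(kept) >= k or exhausted or fetch >= max_fetch:
            return kept[:k]
        fetch = min(fetch * 4, max_fetch)

# Toy corpus: only every tenth record belongs to tenant "a".
corpus = [{"id": i, "tenant_id": "a" if i % 10 == 0 else "b"} for i in range(100)]

def toy_search(query, n):
    # Stand-in for an ANN call: pretend the corpus is already sorted by similarity.
    return corpus[:n]

results = filtered_search(toy_search, "q", lambda r: r["tenant_id"] == "a", k=3)
print([r["id"] for r in results])
```

With a very selective filter this loop makes several round trips, which is the latency cost that a native pre-filter avoids.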

Exit condition: Query plan documented: ANN → filter → rerank (as applicable).

Stage 5: Operations & Cost

Goal: Reliable ingestion, monitoring, and predictable bills.

Ops

- Idempotent upserts; tombstones for deletes so compliance removals stay removed
- Backups, multi-region if needed; eventual consistency semantics per vendor
- Capacity: memory per node vs sharding; replication factor
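
Idempotent upserts usually come down to deterministic IDs, so that replaying an ingestion job overwrites records instead of duplicating them. The sketch below derives the ID with `uuid.uuid5` and uses a small in-memory store to show a tombstone blocking a replayed upsert. The namespace URL, the key layout, and the `InMemoryStore` class are hypothetical; whether a tombstone should block later writes is a policy choice, shown here as one option.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/vectors")

def vector_id(tenant_id, doc_id, chunk_index, model_version):
    """Same inputs always produce the same ID, so retries are safe."""
    key = f"{tenant_id}/{doc_id}/{chunk_index}/{model_version}"
    return str(uuid.uuid5(NAMESPACE, key))

class InMemoryStore:
    """Stand-in for a vector DB client, for illustration only."""

    def __init__(self):
        self.rows = {}
        self.tombstones = {}

    def upsert(self, vid, vector, payload):
        if vid in self.tombstones:
            return False   # a replayed write must not resurrect deleted data
        self.rows[vid] = (vector, payload)
        return True

    def delete(self, vid, deleted_at):
        self.rows.pop(vid, None)
        self.tombstones[vid] = deleted_at

store = InMemoryStore()
vid = vector_id("tenant-1", "doc-9", 0, "v1")
store.upsert(vid, [0.1, 0.2], {"doc_id": "doc-9"})
store.upsert(vid, [0.1, 0.2], {"doc_id": "doc-9"})   # retry: still one row
print(len(store.rows))
store.delete(vid, deleted_at="2024-01-01T00:00:00Z")
print(store.upsert(vid, [0.1, 0.2], {"doc_id": "doc-9"}))   # blocked by the tombstone
```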

Cost

- Managed pricing typically scales with dimension × vector count, plus egress and query units; estimate from peak QPS

Exit condition: Runbook for reindex, scaling, and incident “search degraded.”

Stage 6: Evaluation & Iteration

Goal: Continuous improvement with labeled or proxy eval.

Loop

- Golden query set updated when product changes
- A/B embedding models or rerankers with guardrails on latency
- Monitor click-through, thumbs, or human grading in RAG
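
A latency guardrail for an A/B comparison can be as simple as a gate over two benchmark reports. The sketch below is illustrative only: the report keys, the default thresholds, and the `passes_guardrails` name are assumptions, and real thresholds should come from the SLO agreed in Stage 1.

```python
def passes_guardrails(baseline, candidate,
                      min_recall_gain=0.0, max_latency_regression=0.10):
    """Accept the candidate only if recall does not drop and p95 stays within budget."""
    recall_ok = (candidate["recall_at_k"] - baseline["recall_at_k"]) >= min_recall_gain
    latency_budget = baseline["p95_latency_ms"] * (1 + max_latency_regression)
    latency_ok = candidate["p95_latency_ms"] <= latency_budget
    return recall_ok and latency_ok

# Made-up numbers for illustration.
baseline = {"recall_at_k": 0.82, "p95_latency_ms": 100.0}
with_reranker = {"recall_at_k": 0.90, "p95_latency_ms": 130.0}
new_embedder = {"recall_at_k": 0.86, "p95_latency_ms": 105.0}

print(passes_guardrails(baseline, with_reranker))   # better recall, but over the latency budget
print(passes_guardrails(baseline, new_embedder))
```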

Debugging bad retrieval

- Chunk inspection, metadata leaks, wrong tenant filter, stale index

Final Review Checklist

- [ ] Metrics and embedding/model versioning plan
- [ ] Index family chosen with benchmark evidence
- [ ] Hybrid/filter strategy matches product needs
- [ ] Ops: upsert, delete, scaling, backup understood
- [ ] Eval set and iteration process in place

Tips for Effective Guidance

- Never promise “semantic search understands intent”; ground claims with eval.
- pgvector vs specialized: trade-offs on scale, ops, features; state them honestly.
- Warn: high-cardinality filters + ANN can be slow; design metadata carefully.

Handling Deviations

- Tiny corpus: brute force or simple index may suffice; avoid over-engineering.
- Multimodal: separate embedding spaces or unified model; fusion strategy required.
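
For the tiny-corpus case, an exact scan is a few lines of code and gives perfect recall. The sketch below does brute-force cosine top-k over an in-memory dict using only the standard library; the vectors are toy data, and the point at which a corpus stops being "tiny" depends on dimension and latency budget, so measure before deciding.

```python
import heapq
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def brute_force_top_k(query, vectors, k):
    """Exact cosine top-k over a dict of id -> vector; returns (score, id) pairs."""
    scored = ((cosine(query, vec), vid) for vid, vec in vectors.items())
    return heapq.nlargest(k, scored)

vectors = {"x": [1.0, 0.0], "y": [0.0, 1.0], "z": [1.0, 1.0]}
top = brute_force_top_k([1.0, 0.1], vectors, k=2)
print([vid for _, vid in top])
```

The same function doubles as the ground-truth generator when benchmarking an ANN index on a sample of the corpus.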
