LLM Evaluation (Deep Workflow)
Evaluation turns “it feels better” into reproducible evidence. Design around failure modes your product cares about—not only aggregate scores.
When to Offer This Workflow
Trigger conditions:
- Prompt or model change; need before/after proof
- Building CI for LLM outputs; flaky quality in production
- RAG/agents: grounding, tool use, safety regressions
Initial offer:
Use six stages: (1) define quality & constraints, (2) build datasets & rubrics, (3) automatic metrics, (4) human evaluation, (5) regression & gates, (6) online validation & iteration. Confirm latency/cost budgets and risk (PII, safety).
Stage 1: Define Quality & Constraints
Goal: Name dimensions that map to user harm if they fail.
Typical dimensions (pick what matters)
- Correctness / task success; groundedness (RAG); faithfulness to sources
- Safety: policy violations, jailbreaks, PII leakage
- Style: tone, brevity, format (when product-critical)
- Robustness: paraphrase, multilingual, edge inputs
Constraints
- Max tokens, latency p95, cost per request; locale requirements
Exit condition: Weighted priority of dimensions; non-goals stated.
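One way to record this exit condition is a small weighted config. The sketch below is illustrative only: the `Dimension` class, the weights, and the `must_pass` flag are assumptions, not values prescribed by this workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float            # relative priority; weights are expected to sum to 1.0
    must_pass: bool = False  # True for dimensions that should gate a release


# Hypothetical priorities for an illustrative RAG assistant.
DIMENSIONS = [
    Dimension("correctness", 0.4),
    Dimension("groundedness", 0.3),
    Dimension("safety", 0.2, must_pass=True),
    Dimension("style", 0.1),
]


def weighted_score(scores: dict[str, float], dims: list[Dimension]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted number."""
    total = sum(d.weight for d in dims)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(d.weight * scores[d.name] for d in dims)
```

A single weighted number is convenient for trend lines, but must-pass dimensions are better checked separately so a high average cannot mask a safety failure.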
Stage 2: Datasets & Rubrics
Goal: Fixed eval sets + clear scoring rules.
Practices
- Stratify by intent: easy/medium/hard; adversarial slice separate
- Rubrics: 1–5 scales with anchors; binary checks for safety
- Version datasets (git or table); every edit goes through a changelog, never silently
- Privacy: synthetic or redacted real examples per policy
Exit condition: Golden set size justified; inter-rater plan if human scoring.
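A minimal sketch of a versioned, stratified eval set follows. The `EvalExample` fields and the content-hash fingerprint are assumptions chosen for illustration; a git commit or a table version column would serve the same purpose.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalExample:
    example_id: str
    prompt: str
    reference: str
    difficulty: str           # "easy", "medium", or "hard"
    adversarial: bool = False


def dataset_fingerprint(examples):
    """Hash the canonical JSON of the dataset so any edit changes the id."""
    rows = sorted((asdict(e) for e in examples), key=lambda r: r["example_id"])
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def strata_counts(examples):
    """Count examples per stratum, keeping the adversarial slice separate."""
    return Counter("adversarial" if e.adversarial else e.difficulty for e in examples)
```

Storing the fingerprint in every eval report makes it visible when two runs were scored against different versions of the set.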
Stage 3: Automatic Metrics
Goal: Fast signals—know limitations.
Options
- Reference-based: BLEU/ROUGE, often weak for assistants
- Model-as-judge: fast but biased; calibrate against human labels
- Task-specific: exact match, JSON schema validity, tool-call args match
- RAG: citation overlap, nugget recall, entailment models (use carefully)
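The task-specific checks in the list above are the easiest to automate. The functions below are a sketch under stated assumptions: the required-keys check is a lightweight stand-in for full JSON Schema validation, and the tool-call dict shape (`name`, `arguments`) is hypothetical.

```python
import json


def exact_match(pred: str, ref: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return pred.strip().casefold() == ref.strip().casefold()


def valid_json_with_keys(text: str, required: set[str]) -> bool:
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required.issubset(obj.keys())


def tool_call_matches(pred: dict, expected: dict) -> bool:
    """True if the predicted tool name and arguments equal the expected ones."""
    return (
        pred.get("name") == expected.get("name")
        and pred.get("arguments") == expected.get("arguments")
    )
```

Strict argument equality is a deliberately harsh choice; for free-text arguments a normalized or semantic comparison may be more appropriate.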
Hygiene
- No training on test data; detect eval-set leakage via prompts
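One rough way to check for leakage is word n-gram overlap between the prompt text and each eval item. This is a sketch, assuming that "leakage" here means eval content copied into a prompt template or its few-shot examples; the function names and the default n-gram length are illustrative.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of lowercase word n-grams in text (empty if text is shorter than n)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaked_examples(prompt_text: str, eval_items: dict, n: int = 8) -> list:
    """Return ids of eval items that share at least one n-gram with the prompt."""
    prompt_grams = ngrams(prompt_text, n)
    return [
        item_id
        for item_id, text in eval_items.items()
        if ngrams(text, n) & prompt_grams
    ]
```

Exact n-gram overlap misses paraphrased leakage, so a clean result here is weak evidence rather than proof.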
Exit condition: Each auto metric has known blind spots documented.
Stage 4: Human Evaluation
Goal: Authoritative judgment where automatic metrics lie.
Design
- Sample size for confidence; blind A/B when possible
- Guidelines + examples; adjudication for disagreements
- Locale-native raters when language quality matters
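Two small statistics support this design: a confidence interval on a preference or pass rate (to judge whether the sample is large enough) and an agreement score between two raters (to decide when adjudication is needed). The sketch below uses the Wilson score interval and Cohen's kappa; both are common choices, not ones this workflow mandates.

```python
import math
from collections import Counter


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

If the interval is too wide to separate the two variants, the sample is too small for the decision at hand.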
Exit condition: Auto metrics correlate well enough with human scores to stand in for them in ongoing monitoring; otherwise human review remains the release gate.
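The correlation in this exit condition can be measured with a rank correlation between human and automatic scores on the same items. Below is a dependency-free sketch of Spearman's coefficient (Pearson correlation over average ranks); what counts as "enough" is a product decision that the code does not answer.

```python
import math


def average_ranks(values: list) -> list:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list, y: list) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(human: list, auto: list) -> float:
    """Spearman rank correlation between human and automatic scores."""
    if len(human) != len(auto) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(human), average_ranks(auto))
```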
Stage 5: Regression & Gates
Goal: Block bad deploys in CI or release pipeline.
Gates
- Must-pass suites: safety, critical user journeys
- Trend tracking: not only point-in-time
- Canary with online metrics (see Stage 6)
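A CI gate over these suites can be as simple as the sketch below. The `SuiteResult` shape, the 100% requirement for must-pass suites, and the `max_drop` tolerance are assumptions for illustration; real thresholds belong to the team's risk appetite.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    name: str
    pass_rate: float         # fraction of cases passed, 0.0 to 1.0
    must_pass: bool = False  # e.g. safety or critical user journeys


def gate(baseline: dict, candidate: list, max_drop: float = 0.02):
    """Return (ok, reasons). Fails on any must-pass miss or large regression."""
    reasons = []
    for suite in candidate:
        if suite.must_pass and suite.pass_rate < 1.0:
            reasons.append(f"{suite.name}: must-pass suite at {suite.pass_rate:.2%}")
        previous = baseline.get(suite.name)
        if previous is not None and previous - suite.pass_rate > max_drop:
            reasons.append(
                f"{suite.name}: dropped from {previous:.2%} to {suite.pass_rate:.2%}"
            )
    return (not reasons, reasons)
```

In a pipeline, a failing gate would typically exit non-zero and print `reasons` so the blocked deploy is explainable.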
Artifacts
- Report: model/prompt id, dataset versions, scores, diff
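The report can be a plain JSON-serializable dict. The field names below mirror the bullet above but are otherwise hypothetical, as is the choice to round the diff to four decimals.

```python
import json


def build_report(model_id, prompt_id, dataset_versions, baseline, candidate):
    """Assemble a JSON-serializable eval report with a per-metric diff."""
    diff = {
        metric: round(candidate[metric] - baseline[metric], 4)
        for metric in candidate
        if metric in baseline
    }
    return {
        "model_id": model_id,
        "prompt_id": prompt_id,
        "dataset_versions": dataset_versions,
        "scores": candidate,
        "diff": diff,
    }


def report_to_json(report: dict) -> str:
    """Stable serialization so reports diff cleanly in version control."""
    return json.dumps(report, sort_keys=True, indent=2)
```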
Exit condition: Rollback criteria defined before rollout.
Stage 6: Online Validation
Goal: Production truth—shadow, A/B, or gradual ramp.
Signals
- Implicit: thumbs, edits, task completion, support tickets
- Explicit: user ratings (sparse)
Causality
- Confounds: seasonality, cohort effects; control for them where possible
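For an A/B comparison of a binary signal such as task completion, a two-proportion z-test is one common starting point. The sketch assumes users were randomly assigned to arms over the same time window; a significance test does not by itself remove seasonality or cohort confounds.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference in rates. Returns (z, p_value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one observation")
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

Running the same test within each cohort (for example, per locale or per plan tier) is a simple way to see whether an overall lift is driven by a single segment.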
Final Review Checklist
- [ ] Quality dimensions prioritized for the product
- [ ] Versioned eval sets and rubrics
- [ ] Auto + human roles explicit; limitations documented
- [ ] Release gates and rollback tied to metrics
- [ ] Plan for online feedback loop
Tips for Effective Guidance
- Slice metrics: averages hide regressions on critical intents.
- For agents, evaluate trajectories, not only final text.
- Never claim objective truth—evaluation is operationalized judgment.
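The first tip is easy to demonstrate: an overall mean can stay flat while one slice collapses. The helper names and the `tolerance` default below are illustrative assumptions.

```python
from collections import defaultdict


def slice_means(rows):
    """rows: iterable of (slice_name, score) pairs. Returns mean score per slice."""
    buckets = defaultdict(list)
    for name, score in rows:
        buckets[name].append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}


def overall_mean(rows):
    scores = [score for _, score in rows]
    return sum(scores) / len(scores)


def regressed_slices(baseline_rows, candidate_rows, tolerance=0.05):
    """Slices whose mean dropped by more than tolerance versus the baseline."""
    base, cand = slice_means(baseline_rows), slice_means(candidate_rows)
    return sorted(
        name
        for name in base
        if name in cand and base[name] - cand[name] > tolerance
    )


# Hypothetical data: the overall mean is unchanged, but "refund" regresses badly.
baseline = [("faq", 0.8)] * 8 + [("refund", 0.8)] * 2
candidate = [("faq", 0.9)] * 8 + [("refund", 0.4)] * 2
print(overall_mean(baseline), overall_mean(candidate))
print(regressed_slices(baseline, candidate))
```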
Handling Deviations
- No labels: start with the smallest useful pairwise comparison set plus spot human review.
- High-stakes (medical/legal): human-in-the-loop gate; disclaim limits of auto eval.
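A pairwise comparison set needs no reference labels: raters see two outputs per prompt and pick one. The sketch below randomizes which side each variant appears on so raters stay blind, then computes variant B's win rate with ties counted as half. The data shapes and the choice labels ("left", "right", "tie") are assumptions.

```python
import random


def make_blind_pairs(items, seed=0):
    """items: list of (prompt, output_a, output_b). Randomizes display side."""
    rng = random.Random(seed)
    pairs = []
    for prompt, out_a, out_b in items:
        flipped = rng.random() < 0.5  # True means B is shown on the left
        left, right = (out_b, out_a) if flipped else (out_a, out_b)
        pairs.append(
            {"prompt": prompt, "left": left, "right": right, "flipped": flipped}
        )
    return pairs


def win_rate_b(pairs, choices):
    """choices: one of "left", "right", or "tie" per pair. Ties count as 0.5."""
    if len(pairs) != len(choices) or not pairs:
        raise ValueError("need one choice per pair")
    wins = 0.0
    for pair, choice in zip(pairs, choices):
        if choice == "tie":
            wins += 0.5
        elif (choice == "left") == pair["flipped"]:
            wins += 1.0  # the chosen side held variant B
    return wins / len(pairs)
```

The `flipped` flag should be hidden from raters and used only when scoring.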