Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
Core Concepts
The Evaluation Taxonomy
Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or explicitly acknowledge the limitation when that is not possible.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: A confident, authoritative tone is rated higher regardless of accuracy. Mitigation: Require evidence citation and add a fact-checking layer.
Metric Selection Framework
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
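The following is a minimal sketch of two of the checks above, assuming judge and human labels are available as parallel lists: unweighted Cohen's κ for binary or categorical labels, and a per-criterion disagreement rate for spotting systematic (rather than random) disagreement. The function names and the record layout are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(judge)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreement_by_criterion(records) -> dict[str, float]:
    """records: iterable of (criterion, judge_label, human_label) tuples (assumed layout)."""
    totals, misses = Counter(), Counter()
    for criterion, judge_label, human_label in records:
        totals[criterion] += 1
        misses[criterion] += judge_label != human_label
    return {c: misses[c] / totals[c] for c in totals}
```

A criterion whose disagreement rate sits well above the others is a candidate for rubric revision, even when the overall κ looks acceptable.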
Evaluation Approaches
Direct Scoring Implementation
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [name]
Description: [what this criterion measures]
Weight: [relative importance, 0-1]
Scale Calibration:
- 1-3 scales: Binary with neutral option, lowest cognitive load
- 1-5 scales: Standard Likert, good balance of granularity and reliability
- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics
Prompt Structure for Direct Scoring:
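You are an expert evaluator assessing response quality.
Task
Evaluate the following response against each criterion.
Original Prompt
{prompt}
Response to Evaluate
{response}
Criteria
{for each criterion: name, description, weight}
Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
Output Format
Respond with structured JSON containing scores, justifications, and summary.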
Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
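As a rough illustration of the three components (criteria, scale, structured output), the sketch below builds a scoring prompt and aggregates a judge's JSON reply into a weighted score. The JSON schema, the `Criterion` class, and both function names are assumptions made for this example; the call to the judge model itself is deliberately left out.

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    description: str
    weight: float  # relative importance, 0-1


def build_scoring_prompt(prompt, response, criteria, max_score=5):
    """Assemble a direct-scoring prompt that asks for justification before the score."""
    criteria_text = "\n".join(
        f"- {c.name} (weight {c.weight}): {c.description}" for c in criteria
    )
    return (
        "You are an expert evaluator assessing response quality.\n\n"
        f"Original prompt:\n{prompt}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Criteria:\n{criteria_text}\n\n"
        "For each criterion, cite specific evidence and write the justification "
        f"BEFORE giving a score on a 1-{max_score} scale.\n"
        # Assumed output schema; adapt to whatever your judge is asked to emit.
        'Respond as JSON: {"scores": [{"criterion": str, "justification": str, '
        '"score": int}], "summary": str}'
    )


def weighted_score(raw_json, criteria, max_score=5):
    """Parse the judge's JSON reply and return the weight-normalized score."""
    data = json.loads(raw_json)
    by_name = {item["criterion"]: item for item in data["scores"]}
    total = 0.0
    for c in criteria:
        item = by_name[c.name]  # KeyError if the judge skipped a criterion
        if not item["justification"].strip():
            raise ValueError(f"missing justification for {c.name}")
        if not 1 <= item["score"] <= max_score:
            raise ValueError(f"score out of range for {c.name}")
        total += c.weight * item["score"]
    return total / sum(c.weight for c in criteria)
```

The parser only checks that a justification is present; the justification-before-score ordering is enforced by the prompt and the field order of the requested JSON.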
Pairwise Comparison Implementation
Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
1. First pass: Response A in first position, Response B in second
2. Second pass: Response B in first position, Response A in second
3. Consistency check: If passes disagree, return TIE with reduced confidence
4. Final verdict: Consistent winner with averaged confidence
Prompt Structure for Pairwise Comparison:
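You are an expert evaluator comparing two AI responses.
Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs. second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
Original Prompt
{prompt}
Response A
{response_a}
Response B
{response_b}
Comparison Criteria
{criteria list}
Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine the overall winner with a confidence level
Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.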
Confidence Calibration: Confidence scores should reflect position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
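The swap protocol and the confidence rule can be sketched together as below. This assumes a hypothetical `judge` callable that takes the two responses in display order and returns a positional pick (`"first"`, `"second"`, or `"tie"`) with a confidence in 0-1; the names and return shape are illustrative, not a fixed interface.

```python
from typing import Callable, Tuple

# (positional pick, confidence): pick is "first", "second", or "tie".
Verdict = Tuple[str, float]


def compare_with_swap(
    judge: Callable[[str, str], Verdict], response_a: str, response_b: str
) -> dict:
    """Run both orderings and keep the verdict only if the passes agree."""
    pick1, conf1 = judge(response_a, response_b)  # pass 1: A shown first
    pick2, conf2 = judge(response_b, response_a)  # pass 2: B shown first

    # Map positional picks back to the original labels.
    winner1 = {"first": "A", "second": "B", "tie": "TIE"}[pick1]
    winner2 = {"first": "B", "second": "A", "tie": "TIE"}[pick2]

    if winner1 == winner2:
        # Consistent passes: average the individual confidences.
        return {"winner": winner1, "confidence": (conf1 + conf2) / 2, "consistent": True}
    # Inconsistent passes: fall back to TIE with confidence 0.5.
    return {"winner": "TIE", "confidence": 0.5, "consistent": False}
```

A judge that always favors the first position will name a different original response in each pass, so it falls into the TIE branch instead of silently producing a biased winner.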
Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
Rubric Components:
1. Level descriptions: Clear boundaries for each score level
2. Characteristics: Observable features that define each level
3. Examples: Representative text for each level (optional but valuable)
4. Edge cases: Guidance for ambiguous situations
5. Scoring guidelines: General principles for consistent application
Strictness Calibration:
- Lenient: Lower bar for passing scores, appropriate for encouraging iteration
- Balanced: Fair, typical expectations for production use
- Strict: High standards, appropriate for safety-critical or high-stakes evaluation
Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.
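One way to hold the rubric components listed above is a small data structure that can be validated and rendered into prompt text. This is a sketch under assumptions: the class names, field names, and rendered layout are invented for illustration, and strictness is stored as a label only, since no numeric thresholds are defined here.

```python
from dataclasses import dataclass, field


@dataclass
class RubricLevel:
    score: int
    description: str            # clear boundary for this score level
    characteristics: list[str]  # observable features that define the level
    example: str = ""           # optional representative text


@dataclass
class Rubric:
    criterion: str
    levels: list[RubricLevel]
    edge_cases: list[str] = field(default_factory=list)
    guidelines: list[str] = field(default_factory=list)
    strictness: str = "balanced"  # "lenient" | "balanced" | "strict"

    def validate(self, min_score: int = 1, max_score: int = 5) -> None:
        """Check the strictness label and that every score level is described once."""
        if self.strictness not in {"lenient", "balanced", "strict"}:
            raise ValueError(f"unknown strictness: {self.strictness}")
        covered = sorted(level.score for level in self.levels)
        if covered != list(range(min_score, max_score + 1)):
            raise ValueError("each score level needs exactly one description")

    def render(self) -> str:
        """Render the rubric as plain text for inclusion in a judge prompt."""
        lines = [f"Rubric: {self.criterion} (strictness: {self.strictness})"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"{level.score}: {level.description}")
            lines.extend(f"  - {c}" for c in level.characteristics)
            if level.example:
                lines.append(f"  Example: {level.example}")
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {e}" for e in self.edge_cases)
        if self.guidelines:
            lines.append("Scoring guidelines:")
            lines.extend(f"  - {g}" for g in self.guidelines)
        return "\n".join(lines)
```

Calling `validate` before a rubric is used catches the common failure of a scale with undescribed levels, which leaves the judge to interpret those scores on its own.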
Practical Guidance
Evaluation Pipeline Design
Production evaluation systems require multiple layers:
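┌─────────────────────────────────────────────────┐
│               Evaluation Pipeline               │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: Response + Prompt + Context             │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Criteria Loader   │ ◄── Rubrics, weights   │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Primary Scorer    │ ◄── Direct or pairwise │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Bias Mitigation   │ ◄── Position swap, etc.│
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │ Confidence Scoring  │ ◄── Calibration        │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  Output: Scores + Justifications + Confidence   │
│                                                 │
└─────────────────────────────────────────────────┘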
Common Anti-Patterns
Anti-pattern: Scoring without justification
- Problem: Scores lack grounding, difficult to debug or improve
- Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
- Problem: Position bias corrupts results
- Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
- Problem: Criteria measuring multiple things are unreliable
- Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
- Problem: Evaluators handle ambiguous cases inconsistently
- Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
- Problem: High-confidence wrong judgments are worse than low-confidence ones
- Solution: Calibrate confidence to position consistency and evidence strength
Decision Framework: Direct vs. Pairwise
Use this decision tree:
CODEBLOCK4
Scaling Evaluation
For high-volume evaluation:
1. Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes
   - Reduces individual model bias
   - More expensive but more reliable for high-stakes decisions
2. Hierarchical evaluation: Fast cheap model for screening, expensive model for edge cases
   - Cost-effective for large volumes
   - Requires calibration of screening threshold
3. Human-in-the-loop: Automated evaluation for clear cases, human review for low-confidence
   - Best reliability for critical applications
   - Design feedback loop to improve automated evaluation
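The three scaling strategies can be sketched as small routing functions. Everything named here is an assumption for illustration: the vote labels, the 0.75 review threshold, and the 2.0 to 4.0 escalation band are placeholder values that would need calibration against human judgments.

```python
from collections import Counter
from typing import Callable


def panel_verdict(votes: list[str]) -> tuple[str, float]:
    """Aggregate judge votes; return (winner, agreement ratio). No strict majority -> TIE."""
    if not votes:
        raise ValueError("at least one vote is required")
    top, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    if top_count * 2 <= len(votes):
        return "TIE", agreement
    return top, agreement


def needs_human_review(agreement: float, threshold: float = 0.75) -> bool:
    """Route low-agreement verdicts to a human reviewer (threshold is an assumed value)."""
    return agreement < threshold


def hierarchical_score(
    item: str,
    cheap_judge: Callable[[str], float],
    strong_judge: Callable[[str], float],
    low: float = 2.0,
    high: float = 4.0,
) -> tuple[float, str]:
    """Screen with the cheap judge; escalate scores inside the ambiguous (low, high) band."""
    score = cheap_judge(item)
    if low < score < high:
        return strong_judge(item), "strong"
    return score, "cheap"
```

For example, three judges voting `["A", "A", "B"]` yield winner `A` with agreement 2/3, which falls below the assumed 0.75 threshold and would be queued for human review.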
Examples
Example 1: Direct Scoring for Accuracy
Input:
CODEBLOCK5
Output:
CODEBLOCK6
Example 2: Pairwise Comparison with Position Swap
Input:
CODEBLOCK7
First Pass (A first):
CODEBLOCK8
Second Pass (B first):
{ "winner": "A", "confidence": 0.6 }
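(Note: In this pass the label "A" refers to the response shown first, which is the original Response B, so the verdict must be mapped back to the original labels.)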
Mapped Second Pass:
CODEBLOCK10
Final Result:
CODEBLOCK11
Example 3: Rubric Generation
Input:
CODEBLOCK12
Output (abbreviated):
CODEBLOCK13
Guidelines
1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
5. Include confidence scores - Calibrate to position consistency and evidence strength
6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
9. Monitor for systematic bias - Track disagreement patterns by criterion, response type, model
10. Design for iteration - Evaluation systems improve with feedback loops
Integration
This skill integrates with:
- context-fundamentals - Evaluation prompts require effective context structure
- tool-design - Evaluation tools need proper schemas and error handling
- context-optimization - Evaluation prompts can be optimized for token efficiency
- evaluation (foundational) - This skill extends the foundational evaluation concepts
References
Related skills in this collection:
- evaluation - Foundational evaluation concepts
- context-fundamentals - Context structure for evaluation prompts
- tool-design - Building evaluation tools
Skill Metadata
Created: 2024-12-24
Last Updated: 2024-12-24
Author: Muratcan Koylan
Version: 1.0.0
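Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments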
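Core Concepts
The Evaluation Taxonomy
Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity detection)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
The Bias Landscape
LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge the limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: A confident, authoritative tone is rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.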
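Metric Selection Framework
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.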
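Evaluation Approaches
Direct Scoring Implementation
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [name]
Description: [what this criterion measures]
Weight: [relative importance, 0-1]
Scale Calibration:
- 1-3 scales: Binary with neutral option, lowest cognitive load
- 1-5 scales: Standard Likert, good balance of granularity and reliability
- 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.
Task
Evaluate the following response against each criterion.
Original Prompt
{prompt}
Response to Evaluate
{response}
Criteria
{for each criterion: name, description, weight}
Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
Output Format
Respond with structured JSON containing scores, justifications, and summary.
Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.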
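Pairwise Comparison Implementation
Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
1. First pass: Response A in first position, Response B in second
2. Second pass: Response B in first position, Response A in second
3. Consistency check: If passes disagree, return TIE with reduced confidence
4. Final verdict: Consistent winner with averaged confidence
Prompt Structure for Pairwise Comparison:
You are an expert evaluator comparing two AI responses.
Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs. second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
Original Prompt
{prompt}
Response A
{response_a}
Response B
{response_b}
Comparison Criteria
{criteria list}
Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine the overall winner with a confidence level
Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
Confidence Calibration: Confidence scores should reflect position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE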
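Rubric Generation
Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.
Rubric Components:
1. Level descriptions: Clear boundaries for each score level
2. Characteristics: Observable features that define each level
3. Examples: Representative text for each level (optional but valuable)
4. Edge cases: Guidance for ambiguous situations
5. Scoring guidelines: General principles for consistent application
Strictness Calibration:
- Lenient: Lower bar for passing scores, appropriate for encouraging iteration
- Balanced: Fair, typical expectations for production use
- Strict: High standards, appropriate for safety-critical or high-stakes evaluation
Domain Adaptation: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards.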
Practical Guidance
Evaluation Pipeline Design
Production evaluation systems require multiple layers:
┌─────────────────────────────────────────────────┐
│               Evaluation Pipeline               │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: Response + Prompt + Context             │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Criteria Loader   │ ◄── Rubrics, weights   │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Primary Scorer    │ ◄── Direct or pairwise │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │   Bias Mitigation   │ ◄── Position swap, etc.│
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  ┌─────────────────────┐                        │
│  │ Confidence Scoring  │ ◄── Calibration        │
│  └──────────┬──────────┘                        │
│             │                                   │
│             ▼                                   │
│  Output: Scores + Justifications + Confidence   │
│                                                 │
└─────────────────────────────────────────────────┘
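Common Anti-Patterns
Anti-pattern: Scoring without justification
- Problem: Scores lack grounding, difficult to debug or improve
- Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
- Problem: Position bias corrupts results
- Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
- Problem: Criteria measuring multiple things are unreliable
- Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
- Problem: Evaluators handle ambiguous cases inconsistently
- Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
- Problem: High-confidence wrong judgments are worse than low-confidence ones
- Solution: Calibrate confidence to position consistency and evidence strength
Decision Framework: Direct vs. Pairwise
Use this decision tree:
Is there an objective ground truth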