Rag Evaluator
AI-powered RAG (Retrieval-Augmented Generation) evaluation toolkit. Configure, benchmark, compare, and optimize your RAG pipelines from the command line. Track prompts, evaluations, fine-tuning experiments, costs, and usage — all with persistent local logging and full export capabilities.
Commands
Run `rag-evaluator <command> [args]` to use the tool.
| Command | Description |
|---|---|
| `configure` | Configure RAG evaluation settings and parameters |
| `benchmark` | Run benchmarks against your RAG pipeline |
| `compare` | Compare results across different RAG configurations |
| `prompt` | Log and manage prompt templates and variations |
| `evaluate` | Evaluate RAG output quality and relevance |
| `fine-tune` | Track fine-tuning experiments and parameters |
| `analyze` | Analyze evaluation results and identify patterns |
| `cost` | Track and log API/inference costs |
| `usage` | Monitor token usage and API call volumes |
| `optimize` | Log optimization strategies and results |
| `test` | Run test cases against RAG configurations |
| `report` | Generate evaluation reports |
| `stats` | Show summary statistics across all categories |
| `export <fmt>` | Export data in json, csv, or txt format |
| `search <term>` | Search across all logged entries |
| `recent` | Show recent activity from the history log |
| `status` | Health check: version, data dir, disk usage |
| `help` | Show help and available commands |
| `version` | Show version (v2.0.0) |
Each domain command (`configure`, `benchmark`, `compare`, etc.) works in two modes:
- Without arguments: displays the 20 most recent entries from that category
- With arguments: logs the input with a timestamp and saves it to the category log file
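The two modes can be sketched in a few lines of Bash. This is a minimal illustration rather than the tool's actual source: `domain_cmd` is a hypothetical name, the timestamp format is an assumption, and the layout of the `history.log` line (with the category as a middle field) is also assumed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory for this sketch; the real tool defaults to
# ~/.local/share/rag-evaluator/.
DATA_DIR="$(mktemp -d)"

# domain_cmd is an illustrative name, not a function from the real script.
domain_cmd() {
  local category="$1"
  shift
  local log_file="$DATA_DIR/$category.log"

  if [ "$#" -eq 0 ]; then
    # No arguments: show the 20 most recent entries, if any exist.
    if [ -f "$log_file" ]; then
      tail -n 20 "$log_file"
    fi
  else
    # With arguments: append a timestamped entry to the category log
    # and mirror it in the unified history log (assumed layout).
    local ts
    ts="$(date '+%Y-%m-%d %H:%M:%S')"
    printf '%s|%s\n' "$ts" "$*" >> "$log_file"
    printf '%s|%s|%s\n' "$ts" "$category" "$*" >> "$DATA_DIR/history.log"
  fi
}

domain_cmd benchmark latency=230ms recall@5=0.82   # logs one entry
domain_cmd benchmark                               # prints it back
```

Joining the arguments with `"$*"` keeps the whole command line as a single value, which matches the free-form entries shown in the examples.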
Data Storage
All data is stored locally in `~/.local/share/rag-evaluator/`:
- Each command creates its own log file (e.g., `configure.log`, `benchmark.log`)
- A unified `history.log` tracks all activity across commands
- Entries are stored in pipe-delimited `timestamp|value` format
- Export supports JSON, CSV, and plain text formats
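Because each entry is a single `timestamp|value` line, the logs are easy to post-process with plain Bash. The sketch below shows one way such lines could be turned into CSV. `to_csv` is a hypothetical helper, the sample entries are made up, and the real `export csv` output may be laid out differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# to_csv is a hypothetical helper; the tool's own export logic may differ.
# Reads timestamp|value lines on stdin and prints two-column CSV.
to_csv() {
  local ts value q='"'
  printf 'timestamp,value\n'
  # Splitting on the first "|" only: any later pipes stay inside the value.
  while IFS='|' read -r ts value; do
    # Double embedded quotes, then wrap the value in quotes.
    value="${value//$q/$q$q}"
    printf '%s,"%s"\n' "$ts" "$value"
  done
}

# Made-up sample entries in the documented format.
printf '%s\n' \
  '2024-01-01 10:00:00|faithfulness=0.91 relevance=0.87' \
  '2024-01-01 10:05:00|run-042: $0.23 | 800 tokens' | to_csv
```

Quoting the value column matters here because logged values are free text and may themselves contain commas or pipes.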
Requirements
- Bash 4+ (the script runs in `set -euo pipefail` strict mode)
- Standard Unix utilities: `date`, `wc`, `du`, `tail`, `grep`, `sed`, `cat`
- No external dependencies or API keys required
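A quick way to confirm these requirements on a given machine is sketched below. `check_requirements` is a hypothetical helper and not part of rag-evaluator; it relies only on the Bash built-ins `BASH_VERSINFO` and `command -v`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# check_requirements is a hypothetical helper, not part of rag-evaluator.
# Prints one status line per requirement; returns 1 if anything is missing.
check_requirements() {
  local status=0 tool

  # Bash exposes its major version number in BASH_VERSINFO[0].
  if [ "${BASH_VERSINFO[0]}" -ge 4 ]; then
    echo "ok: bash $BASH_VERSION"
  else
    echo "missing: bash 4+ (found $BASH_VERSION)"
    status=1
  fi

  # The utilities listed in the requirements above.
  for tool in date wc du tail grep sed cat; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      status=1
    fi
  done

  return "$status"
}

check_requirements || echo "some requirements are not met" >&2
```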
When to Use
1. Evaluating RAG pipeline quality: log evaluation scores, compare retrieval strategies, and track improvements over time
2. Benchmarking different configurations: run benchmarks across embedding models, chunk sizes, or retrieval methods and compare results side by side
3. Tracking costs and usage: monitor API costs and token usage across experiments to stay within budget
4. Managing prompt engineering: log prompt variations, test them against your pipeline, and analyze which templates perform best
5. Generating reports for stakeholders: export evaluation data as JSON/CSV for dashboards, or generate text reports summarizing RAG performance
Examples
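```bash
# Configure a new evaluation run
rag-evaluator configure model=gpt-4 chunks=512 overlap=50 top_k=5

# Run a benchmark and log the results
rag-evaluator benchmark latency=230ms recall@5=0.82 precision@5=0.71

# Compare two retrieval strategies
rag-evaluator compare bm25 vs dense: bm25 recall=0.78, dense recall=0.85

# Track evaluation scores
rag-evaluator evaluate faithfulness=0.91 relevance=0.87 coherence=0.93

# Log the API cost of a run (quoted so the shell does not expand $0 or parse the parentheses)
rag-evaluator cost 'run-042: $0.23 (1.2k tokens input, 800 tokens output)'

# View summary statistics
rag-evaluator stats

# Export all data as CSV
rag-evaluator export csv

# Search for specific entries
rag-evaluator search gpt-4

# Check recent activity
rag-evaluator recent

# Health check
rag-evaluator status
```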
Output
All commands output to stdout. Redirect to a file if needed:
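```bash
rag-evaluator report weekly summary > report.txt
rag-evaluator export json   # saves to ~/.local/share/rag-evaluator/export.json
```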
Configuration
Set `DATA_DIR` by editing the script, or keep the default: `~/.local/share/rag-evaluator/`
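Whichever directory `DATA_DIR` points to, it holds plain text logs, so standard utilities are enough to inspect it. The sketch below is one way to do that. `summarize_data_dir` is a hypothetical helper, not the tool's `status` or `stats` command, and the sample entries are made up for the demo.

```bash
#!/usr/bin/env bash
set -euo pipefail

# summarize_data_dir is a hypothetical helper, not part of rag-evaluator.
# Prints the directory, its disk usage, and the entry count of each log.
summarize_data_dir() {
  local dir="$1" f
  if [ ! -d "$dir" ]; then
    echo "no data directory at $dir"
    return 0
  fi
  echo "data dir: $dir"
  echo "disk usage: $(du -sh "$dir" | cut -f1)"
  for f in "$dir"/*.log; do
    [ -e "$f" ] || continue   # the glob stays literal when no logs exist
    printf '%s: %s entries\n' "$(basename "$f" .log)" "$(wc -l < "$f" | tr -d ' ')"
  done
}

# Demo against a throwaway directory with made-up sample entries.
demo_dir="$(mktemp -d)"
printf '%s\n' '2024-01-01 10:00:00|run-001' '2024-01-01 10:05:00|run-002' > "$demo_dir/cost.log"
summarize_data_dir "$demo_dir"
```

To inspect the default location instead, pass `"$HOME/.local/share/rag-evaluator"` as the argument.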
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com
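Skill name: Ragaai Catalyst
Detailed description:
Retrieval-Augmented Generation Evaluator
AI-powered RAG (Retrieval-Augmented Generation) evaluation toolkit. Configure, benchmark, compare, and optimize your RAG pipelines from the command line. Track prompts, evaluations, fine-tuning experiments, costs, and usage, all with persistent local logging and full export capabilities.
Commands
Run `rag-evaluator <command> [args]` to use the tool.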
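| Command | Description |
|---|---|
| `configure` | Configure RAG evaluation settings and parameters |
| `benchmark` | Run benchmarks against your RAG pipeline |
| `compare` | Compare results across different RAG configurations |
| `prompt` | Log and manage prompt templates and variations |
| `evaluate` | Evaluate RAG output quality and relevance |
| `fine-tune` | Track fine-tuning experiments and parameters |
| `analyze` | Analyze evaluation results and identify patterns |
| `cost` | Track and log API/inference costs |
| `usage` | Monitor token usage and API call volumes |
| `optimize` | Log optimization strategies and results |
| `test` | Run test cases against RAG configurations |
| `report` | Generate evaluation reports |
| `stats` | Show summary statistics across all categories |
| `export <format>` | Export data in json, csv, or txt format |
| `search <term>` | Search across all logged entries |
| `recent` | Show recent activity from the history log |
| `status` | Health check: version, data dir, disk usage |
| `help` | Show help and available commands |
| `version` | Show version (v2.0.0) |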
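Each domain command (`configure`, `benchmark`, `compare`, etc.) works in two modes:
- Without arguments: displays the 20 most recent entries from that category
- With arguments: logs the input with a timestamp and saves it to the category log file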
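Data Storage
All data is stored locally in `~/.local/share/rag-evaluator/`:
- Each command creates its own log file (e.g., `configure.log`, `benchmark.log`)
- A unified `history.log` file tracks activity across all commands
- Entries are stored in pipe-delimited `timestamp|value` format
- Export supports JSON, CSV, and plain text formats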
Requirements
- Bash 4+ with `set -euo pipefail` strict mode enabled
- Standard Unix utilities: `date`, `wc`, `du`, `tail`, `grep`, `sed`, `cat`
- No external dependencies or API keys required
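When to Use
1. Evaluating RAG pipeline quality: log evaluation scores, compare retrieval strategies, and track improvements over time
2. Benchmarking different configurations: run benchmarks across embedding models, chunk sizes, or retrieval methods and compare results side by side
3. Tracking costs and usage: monitor API costs and token usage across experiments to stay within budget
4. Managing prompt engineering: log prompt variations, test them against your pipeline, and analyze which templates perform best
5. Generating reports for stakeholders: export evaluation data as JSON/CSV for dashboards, or generate text reports summarizing RAG performance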
Examples
```bash
# Configure a new evaluation run
rag-evaluator configure model=gpt-4 chunks=512 overlap=50 top_k=5

# Run a benchmark and log the results
rag-evaluator benchmark latency=230ms recall@5=0.82 precision@5=0.71

# Compare two retrieval strategies
rag-evaluator compare bm25 vs dense: bm25 recall=0.78, dense recall=0.85

# Track evaluation scores
rag-evaluator evaluate faithfulness=0.91 relevance=0.87 coherence=0.93

# Log the API cost of a run (quoted so the shell does not expand $0 or parse the parentheses)
rag-evaluator cost 'run-042: $0.23 (1.2k tokens input, 800 tokens output)'

# View summary statistics
rag-evaluator stats

# Export all data as CSV
rag-evaluator export csv

# Search for specific entries
rag-evaluator search gpt-4

# Check recent activity
rag-evaluator recent

# Health check
rag-evaluator status
```
Output
All commands output to stdout. Redirect to a file if needed:
```bash
rag-evaluator report weekly summary > report.txt
rag-evaluator export json   # saves to ~/.local/share/rag-evaluator/export.json
```
Configuration
Set `DATA_DIR` by editing the script, or keep the default: `~/.local/share/rag-evaluator/`