Smart Code Search
Search code and docs by meaning, not just strings.
Powered by ColGREP and NextPlaid from LightOn, the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus, an agentic search benchmark built on OpenAI's BrowseComp.
grep finds strings. This finds intent. Ask "payment capture logic" and get results from files that never contain those exact words — because it understands what your code does, not just what it says.
Why This Exists
Every developer has been here: you know what you're looking for but not where it lives. You chain four different grep -r attempts, guess filenames, and scroll through directory trees. Coding agents have it even worse: they grep, miss things, hallucinate file paths, and waste tokens exploring blind.
ColGREP fixes this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function, method, and class as token-level vectors, and ranks results by meaning. The model has 17M parameters, runs on CPU, and returns results in under a second.
The Numbers
| Metric | Value |
|---|---|
| MTEB Code Leaderboard | #1 (LateOn-Code) |
| BrowseComp-Plus | 87.59% accuracy, beating all models up to 8B params (blog) |
| vs grep in coding agents | 70% win rate head-to-head |
| Model size | 17M params, roughly 470× smaller than an 8B model |
| Search latency | 200–900ms on CPU |
| API cost | $0. Forever. Runs 100% local |
| Privacy | Code never leaves your machine |
Install
```bash
brew install lightonai/tap/colgrep
```
Verify: `colgrep --version`
Quick Start
1. Index Your Project
```bash
cd /path/to/project
colgrep init
```
That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. Takes 30–60 seconds for ~1000 files. After this, the index auto-updates on every search — changed files are detected and re-indexed automatically.
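If you script around ColGREP, it can help to build the index only when it is missing. The sketch below relies on the index living in .colgrep/ under the project root, as described above. `ensure_index` is a hypothetical helper name and not part of the ColGREP CLI.

```bash
# ensure_index [DIR]: run `colgrep init` in DIR only if no .colgrep/ index exists yet.
# Hypothetical helper for illustration; not part of the ColGREP CLI.
ensure_index() {
  local dir="${1:-.}"
  if [ -d "$dir/.colgrep" ]; then
    echo "index already present in $dir"
  else
    # Subshell so the caller's working directory is left unchanged.
    (cd "$dir" && colgrep init)
  fi
}
```

For example, `ensure_index ~/code/api` would index that project the first time and do nothing on later calls.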
2. Search
```bash
colgrep "natural language description of what you want"
```
Results are ranked by semantic relevance score. Higher = better match.
Examples:
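```bash
colgrep "authentication middleware token validation"
colgrep "database migration rollback strategy"
colgrep "React form validation with error display"
colgrep "webhook retry logic with exponential backoff"
```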
3. Combine Regex + Semantics
Filter files by regex pattern first, then rank semantically:
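```bash
colgrep -e "async.*await" "error handling patterns"
colgrep -e "def test_" "payment capture edge cases"
colgrep -e "\.tsx$" "patient dashboard layout"
```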
Search Options
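```bash
colgrep "query"          # default output: file:line (score: X.XX)
colgrep "query" --json   # JSON output for piping into other tools
colgrep "query" -n 5     # top 5 results only
```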
When to Use This vs grep
| You know... | Use |
|---|---|
| The exact string or function name | `grep -r "functionName"` |
| The concept but not the words | `colgrep "what it does"` |
| A pattern + a concept | `colgrep -e "pattern" "meaning"` |
| Where something is implemented | `colgrep "description of behavior"` |
| How a feature works across files | `colgrep "feature workflow"` |
Coding Agent Integration
ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:
- Claude Code: `colgrep --install-claude-code`
- OpenCode: `colgrep --install-opencode`
- Codex: `colgrep --install-codex`
These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.
Multi-Project Setup
Index each project independently. Search from the project directory:
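```bash
cd ~/code/api && colgrep init
cd ~/code/frontend && colgrep init
cd ~/code/infrastructure && colgrep init
cd ~/docs && colgrep init

# Search each project independently
cd ~/code/api && colgrep "payment processing service"
cd ~/code/frontend && colgrep "checkout form validation"
```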
Works great for monorepos, microservices, documentation vaults, and any directory with text/code files.
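With more than a few projects, the per-directory init commands can be wrapped in a loop. This is a minimal sketch: `index_all` is a hypothetical helper name, and it only assumes that `colgrep init` indexes the current directory, as shown above.

```bash
# index_all DIR...: run `colgrep init` once inside each listed project directory.
# Hypothetical helper; pass whatever project directories you use.
index_all() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "indexing $dir"
      # Subshell keeps the caller's working directory unchanged.
      (cd "$dir" && colgrep init) || echo "indexing failed in $dir" >&2
    else
      echo "skipping $dir (not a directory)" >&2
    fi
  done
}
```

Usage would look like `index_all ~/code/api ~/code/frontend ~/docs`.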
How It Works
ColGREP uses ColBERT late-interaction retrieval, a fundamentally different approach from traditional single-vector embeddings:
1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures).
2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit rather than one lossy summary vector.
3. NextPlaid stores these in a quantized, memory-mapped Rust index.
4. At search time, query tokens interact with document tokens for fine-grained relevance scoring.
This is why a 17M model beats 8B models — late interaction preserves token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over
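The late-interaction step can be illustrated with the MaxSim rule used by ColBERT: for each query token vector, take the highest dot product against any document token vector, then sum those maxima. The sketch below is a toy version in awk over plain text files with one vector per line. The function name `maxsim` and the file layout are assumptions for illustration only, and say nothing about how NextPlaid stores or scores vectors internally.

```bash
# maxsim QUERY_FILE DOC_FILE: toy ColBERT-style late-interaction score.
# Each file holds one token vector per line, as space-separated numbers.
maxsim() {
  awk '
    # First file: remember every query token vector.
    NR == FNR { for (i = 1; i <= NF; i++) q[FNR, i] = $i; nq = FNR; dim = NF; next }
    # Second file: for each document token, update the best match per query token.
    {
      for (j = 1; j <= nq; j++) {
        dot = 0
        for (i = 1; i <= dim; i++) dot += q[j, i] * $i
        if (FNR == 1 || dot > best[j]) best[j] = dot
      }
    }
    # Sum the per-query-token maxima to get the final score.
    END { total = 0; for (j = 1; j <= nq; j++) total += best[j]; printf "%.2f\n", total }
  ' "$1" "$2"
}
```

Because every query token keeps its own best match, a document can score well when different parts of it answer different parts of the query, which a single pooled vector cannot express.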
Interpreting Scores
- 6.0+: Near-exact conceptual match. The code does exactly what you described.
- 5.0–6.0: Strong semantic match. Highly relevant code.
- 4.0–5.0: Good match. Related code worth reviewing.
- 3.0–4.0: Weak match. May or may not be relevant.
- Below 3.0: Likely noise. Ignore these results.
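The bands above can be turned into a small filter for scripted use. This is a sketch under two assumptions: boundary values such as 5.0 fall into the higher band, and each result line ends with `(score: X.XX)` as in the default output. `score_band` and `keep_relevant` are hypothetical helper names, so adjust the parsing if your output differs.

```bash
# score_band SCORE: print the band from the list above for a numeric score.
# awk does the comparison because POSIX shell arithmetic is integer-only.
score_band() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 6.0) print "near-exact"
    else if (s >= 5.0) print "strong"
    else if (s >= 4.0) print "good"
    else if (s >= 3.0) print "weak"
    else               print "noise"
  }'
}

# keep_relevant MIN: read result lines on stdin and print those scoring at least MIN.
# Assumes each line ends with "(score: X.XX)"; lines without a score are dropped.
keep_relevant() {
  awk -v min="$1" -F'score: ' 'NF > 1 {
    s = $2
    sub(/\).*/, "", s)      # drop the closing parenthesis and anything after it
    if (s + 0 >= min + 0) print
  }'
}
```

Something like `colgrep "query" | keep_relevant 4.0` would then keep only the good-or-better matches.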
Troubleshooting
"Index is being updated by another process": another colgrep instance is updating the index. The current search uses the existing index, so this message is safe to ignore.
Re-index from scratch:
```bash
rm -rf .colgrep/ && colgrep init
```
Add to .gitignore:
```
.colgrep/
```
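To add the entry from a script without creating duplicates, a small guard around grep is enough. `ignore_index` is a hypothetical helper name; the sketch uses only standard grep options.

```bash
# ignore_index [DIR]: append ".colgrep/" to DIR/.gitignore unless that exact line exists.
ignore_index() {
  local file="${1:-.}/.gitignore"
  # -q quiet, -x whole-line match, -F fixed string (so the dot is literal)
  if ! grep -qxF '.colgrep/' "$file" 2>/dev/null; then
    echo '.colgrep/' >> "$file"
  fi
}
```

Running it twice leaves a single `.colgrep/` line in the file.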
Links
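Smart Code Search
Search code and docs by meaning, not just strings.
Powered by ColGREP and NextPlaid from LightOn, the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus, an agentic search benchmark built on OpenAI's BrowseComp.
grep finds strings. This tool finds intent. Search for "payment capture logic" and get results from files that never contain those exact words, because it understands what your code does, not just what it says.
Why This Exists
Every developer has been here: you know what you're looking for but not where it lives. You chain four different grep -r attempts, guess filenames, and scroll through directory trees. Coding agents have it even worse: they grep, miss things, hallucinate file paths, and waste tokens exploring blind.
ColGREP fixes this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function, method, and class as token-level vectors, and ranks results by meaning. The model has 17M parameters, runs on CPU, and returns results in under a second.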
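The Numbers
| Metric | Value |
|---|---|
| BrowseComp-Plus | 87.59% accuracy, beating all models up to 8B params (blog) |
| vs grep in coding agents | 70% win rate head-to-head |
| Model size | 17M params, roughly 470× smaller than an 8B model |
| Search latency | 200–900ms on CPU |
| API cost | $0. Forever. Runs 100% local |
| Privacy | Code never leaves your machine |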
Install
```bash
brew install lightonai/tap/colgrep
```
Verify: `colgrep --version`
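Quick Start
1. Index Your Project
```bash
cd /path/to/project
colgrep init
```
That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. This takes 30–60 seconds for about 1000 files. After that, the index auto-updates on every search: changed files are detected and re-indexed automatically.
2. Search
```bash
colgrep "natural language description of what you want"
```
Results are ranked by semantic relevance score. Higher = better match.
Examples:
```bash
colgrep "authentication middleware token validation"
colgrep "database migration rollback strategy"
colgrep "React form validation with error display"
colgrep "webhook retry logic with exponential backoff"
```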
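3. Combine Regex + Semantics
Filter files by regex pattern first, then rank semantically:
```bash
colgrep -e "async.*await" "error handling patterns"
colgrep -e "def test_" "payment capture edge cases"
colgrep -e "\.tsx$" "patient dashboard layout"
```
Search Options
```bash
colgrep "query"          # default output: file:line (score: X.XX)
colgrep "query" --json   # JSON output for piping into other tools
colgrep "query" -n 5     # top 5 results only
```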
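When to Use This vs grep
| You know... | Use |
|---|---|
| The exact string or function name | `grep -r "functionName"` |
| The concept but not the words | `colgrep "what it does"` |
| A pattern + a concept | `colgrep -e "pattern" "meaning"` |
| Where something is implemented | `colgrep "description of behavior"` |
| How a feature works across files | `colgrep "feature workflow"` |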
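Coding Agent Integration
ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:
- Claude Code: `colgrep --install-claude-code`
- OpenCode: `colgrep --install-opencode`
- Codex: `colgrep --install-codex`
These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.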
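Multi-Project Setup
Index each project independently. Search from the project directory:
```bash
cd ~/code/api && colgrep init
cd ~/code/frontend && colgrep init
cd ~/code/infrastructure && colgrep init
cd ~/docs && colgrep init

# Search each project independently
cd ~/code/api && colgrep "payment processing service"
cd ~/code/frontend && colgrep "checkout form validation"
```
Works for monorepos, microservices, documentation vaults, and any directory with text/code files.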
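How It Works
ColGREP uses ColBERT late-interaction retrieval, a fundamentally different approach from traditional single-vector embeddings:
1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures).
2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit rather than one lossy summary vector.
3. NextPlaid stores these in a quantized, memory-mapped Rust index.
4. At search time, query tokens interact with document tokens for fine-grained relevance scoring.
This is why a 17M-parameter model can beat 8B models: late interaction preserves token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over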
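Interpreting Scores
- 6.0+: Near-exact conceptual match. The code does exactly what you described.
- 5.0–6.0: Strong semantic match. Highly relevant code.
- 4.0–5.0: Good match. Related code worth reviewing.
- 3.0–4.0: Weak match. May or may not be relevant.
- Below 3.0: Likely noise. Ignore these results.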
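Troubleshooting
"Index is being updated by another process": another colgrep instance is updating the index. The current search uses the existing index, so this message is safe to ignore.
Re-index from scratch:
```bash
rm -rf .colgrep/ && colgrep init
```
Add to .gitignore:
```
.colgrep/
```
Links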