Smart Code Search

Search code and docs by meaning, not just strings.

Powered by ColGREP and NextPlaid from LightOn — the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus, OpenAI's hardest agentic search benchmark.

grep finds strings. This finds intent. Ask "payment capture logic" and get results from files that never contain those exact words — because it understands what your code does, not just what it says.

Why This Exists

Every developer has been here: you know what you're looking for but not where it lives. You chain 4 different grep -r attempts, guess filenames, scroll through directory trees. Coding agents are even worse — they grep, miss things, hallucinate file paths, waste tokens exploring blind.

ColGREP fixes this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function/method/class with token-level vectors, and ranks results by meaning. The model is 17M parameters, runs on CPU, and returns results in under a second.

The Numbers

Metric	Value
MTEB Code Leaderboard	#1 (LateOn-Code)
BrowseComp-Plus

Install

    brew install lightonai/tap/colgrep

Verify: colgrep --version

Quick Start

1. Index Your Project

    cd /path/to/project
    colgrep init

That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. Takes 30–60 seconds for ~1000 files. After this, the index auto-updates on every search — changed files are detected and re-indexed automatically.
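
The auto-update step implies some form of change detection. How ColGREP does this internally is not described here, so the following is only a minimal sketch of one common approach: compare each file's content hash against a manifest saved at the last indexing run. The `manifest` structure and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files under root that are new or differ from the manifest.

    `manifest` maps relative paths to digests recorded at the last
    indexing run. It is a hypothetical structure for illustration,
    not ColGREP's actual on-disk format.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip directories and anything inside the index directory itself.
        if not path.is_file() or ".colgrep" in rel.parts:
            continue
        if manifest.get(rel.as_posix()) != file_digest(path):
            changed.append(rel.as_posix())
    return changed
```

Only the files returned by `find_changed_files` would need to be re-parsed and re-embedded, which is what keeps an incremental update cheap compared with a full re-index.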

2. Search

    colgrep "natural language description of what you want"

Results are ranked by semantic relevance score. Higher = better match.

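Examples:

    colgrep "authentication middleware token validation"
    colgrep "database migration rollback strategy"
    colgrep "React form validation with error display"
    colgrep "webhook retry logic with exponential backoff"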

3. Combine Regex + Semantics

Filter files by regex pattern first, then rank semantically:

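    colgrep -e 'async.*await' "error handling patterns"
    colgrep -e 'def test_' "payment capture edge cases"
    colgrep -e '\.tsx$' "patient dashboard layout"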

Search Options

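    colgrep "query"          # default output: file:line (score: X.XX)
    colgrep "query" --json   # JSON output, for piping into other tools
    colgrep "query" -n 5     # top 5 results only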
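
When driving colgrep from a script, it can help to assemble the argument list and parse result lines in one place. The sketch below uses only the flags this document lists (`-e`, `--json`, `-n`) and assumes that a default-format line looks like `file:line (score: X.XX)`. The `Hit` type, the function names, and the exact regular expression are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    file: str
    line: int
    score: float


def build_command(query: str, pattern: Optional[str] = None,
                  top_n: Optional[int] = None, as_json: bool = False) -> list[str]:
    """Assemble a colgrep argument list from the options listed in this document."""
    cmd = ["colgrep"]
    if pattern is not None:
        cmd += ["-e", pattern]  # regex filter applied before ranking
    cmd.append(query)
    if as_json:
        cmd.append("--json")
    if top_n is not None:
        cmd += ["-n", str(top_n)]
    return cmd


# Assumed line shape: "path/to/file.py:42 (score: 5.31)"
HIT_RE = re.compile(
    r"^(?P<file>.+):(?P<line>\d+) \(score: (?P<score>-?\d+(?:\.\d+)?)\)$"
)


def parse_hit(text: str) -> Optional[Hit]:
    """Parse one default-format result line, or return None if it does not match."""
    match = HIT_RE.match(text.strip())
    if match is None:
        return None
    return Hit(match["file"], int(match["line"]), float(match["score"]))
```

Passing the list from `build_command` to `subprocess.run` avoids shell quoting problems with multi-word queries and regex metacharacters.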

When to Use This vs grep

You know...	Use
The exact string or function name	grep -r functionName
The concept but not the words

Coding Agent Integration

ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:

- Claude Code: colgrep --install-claude-code
- OpenCode: colgrep --install-opencode
- Codex: colgrep --install-codex

These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.

Multi-Project Setup

Index each project independently. Search from the project directory:

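    cd ~/code/api && colgrep init
    cd ~/code/frontend && colgrep init
    cd ~/code/infrastructure && colgrep init
    cd ~/docs && colgrep init

    # Search each project independently
    cd ~/code/api && colgrep "payment processing service"
    cd ~/code/frontend && colgrep "checkout form validation"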

Works great for monorepos, microservices, documentation vaults, and any directory with text/code files.
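
With several projects, it is easy to lose track of which ones have been indexed. Since the index lives in a `.colgrep/` directory inside each project, a small helper can check for it. This is only a convenience sketch, and `split_by_index` is a hypothetical name.

```python
from pathlib import Path


def split_by_index(projects: list[Path]) -> tuple[list[Path], list[Path]]:
    """Return (indexed, unindexed) based on the presence of a .colgrep/ directory."""
    indexed: list[Path] = []
    unindexed: list[Path] = []
    for project in projects:
        target = indexed if (project / ".colgrep").is_dir() else unindexed
        target.append(project)
    return indexed, unindexed
```

Projects in the second list still need a `colgrep init` before they can be searched.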

How It Works

ColGREP uses ColBERT late-interaction retrieval — a fundamentally different approach than traditional single-vector embeddings:

1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures)
2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit, rather than one lossy summary vector
3. NextPlaid stores these in a quantized, memory-mapped Rust index
4. At search time, query tokens interact with document tokens for fine-grained relevance scoring
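
The last step is the core of late interaction. In the standard ColBERT formulation (often called MaxSim), each query token is matched to its most similar document token, and those best-match similarities are summed. The sketch below shows that scoring rule in plain Python with toy 2-dimensional vectors. It illustrates the general technique only. ColGREP's actual scoring, quantization, and vector sizes are not specified here.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token, take the similarity of its
    best-matching document token, then sum those maxima over all query tokens."""
    if not doc_vecs:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy example: two query tokens, two candidate code units.
query = [[1.0, 0.0], [0.0, 1.0]]
unit_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
unit_b = [[1.0, 0.0], [-1.0, 0.0]]             # covers only the first
ranked = sorted(
    [("unit_a", maxsim(query, unit_a)), ("unit_b", maxsim(query, unit_b))],
    key=lambda item: item[1],
    reverse=True,
)
```

Because every query token gets its own best match, a code unit that covers all parts of the query outranks one that strongly matches only a single part. A single averaged vector would blur that distinction.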

This is why a 17M model beats 8B models — late interaction preserves token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over

Interpreting Scores

- 6.0+: Near-exact conceptual match. The code does exactly what you described.
- 5.0–6.0: Strong semantic match. Highly relevant code.
- 4.0–5.0: Good match. Related code worth reviewing.
- 3.0–4.0: Weak match. May or may not be relevant.
- Below 3.0: Likely noise. Ignore these results.
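
These bands are easy to apply in a script that post-processes results. The sketch below maps a score to the labels above and drops likely noise. It treats each band's lower bound as inclusive, which is an assumption, and both function names are hypothetical.

```python
def interpret_score(score: float) -> str:
    """Map a relevance score to the bands listed above (lower bounds inclusive)."""
    if score >= 6.0:
        return "near-exact conceptual match"
    if score >= 5.0:
        return "strong semantic match"
    if score >= 4.0:
        return "good match"
    if score >= 3.0:
        return "weak match"
    return "likely noise"


def keep_relevant(hits: list[tuple[str, float]],
                  threshold: float = 3.0) -> list[tuple[str, float]]:
    """Drop (location, score) pairs that fall below the noise threshold."""
    return [(location, score) for location, score in hits if score >= threshold]
```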

Troubleshooting

"Index is being updated by another process" — Another colgrep instance is updating. Current search uses existing index. Safe to ignore.

Re-index from scratch:

    rm -rf .colgrep/ && colgrep init

Add to .gitignore:

    .colgrep/

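Smart Code Search

Search code and docs by meaning, not just by string.

Powered by ColGREP and NextPlaid from LightOn, the engine behind the #1 ranked code retrieval model on MTEB and the #1 retriever on BrowseComp-Plus (OpenAI's hardest agentic search benchmark).

grep finds strings. This tool finds intent. Search for "payment capture logic" and you get results from files that never contain those exact words, because it understands what your code does, not just what it says.

Why This Exists

Every developer has been here: you know what you are looking for but not where it lives. You try 4 different grep -r commands in a row, guess filenames, and scroll through directory trees. Coding agents have it even worse: they grep, miss things, hallucinate file paths, and waste tokens exploring blind.

ColGREP solves this with multi-vector semantic search. It parses your code with Tree-sitter, embeds each function/method/class with token-level vectors, and ranks results by meaning. The model has 17 million parameters, runs on CPU, and returns results in under a second.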

The Numbers

Metric	Value
MTEB Code Leaderboard	#1 (LateOn-Code)
BrowseComp-Plus

Install

    brew install lightonai/tap/colgrep

Verify: colgrep --version

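Quick Start

1. Index Your Project

    cd /path/to/project
    colgrep init

That's it. ColGREP parses every file with Tree-sitter, builds multi-vector embeddings on CPU, and stores the index in .colgrep/. Indexing takes 30–60 seconds for about 1000 files. After that, the index updates automatically on every search: changed files are detected and re-indexed.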

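2. Search

    colgrep "natural language description of what you want"

Results are ranked by semantic relevance score. A higher score means a better match.

Examples:

    colgrep "authentication middleware token validation"
    colgrep "database migration rollback strategy"
    colgrep "React form validation with error display"
    colgrep "webhook retry logic with exponential backoff"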

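3. Combine Regex + Semantics

Filter files by regex pattern first, then rank semantically:

    colgrep -e 'async.*await' "error handling patterns"
    colgrep -e 'def test_' "payment capture edge cases"
    colgrep -e '\.tsx$' "patient dashboard layout"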

Search Options

    colgrep "query"          # default output: file:line (score: X.XX)
    colgrep "query" --json   # JSON output, for piping into other tools
    colgrep "query" -n 5     # top 5 results only

When to Use This vs grep

You know...	Use
The exact string or function name	grep -r functionName
The concept but not the words

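Coding Agent Integration

ColGREP provides built-in integration with popular coding agents. After installing, restart your agent to enable semantic search:

- Claude Code: colgrep --install-claude-code
- OpenCode: colgrep --install-opencode
- Codex: colgrep --install-codex

These commands register ColGREP as a search tool within the agent. The agent will automatically use semantic search when navigating indexed projects.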

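Multi-Project Setup

Index each project independently. Search from the project directory:

    cd ~/code/api && colgrep init
    cd ~/code/frontend && colgrep init
    cd ~/code/infrastructure && colgrep init
    cd ~/docs && colgrep init

    # Search each project independently
    cd ~/code/api && colgrep "payment processing service"
    cd ~/code/frontend && colgrep "checkout form validation"

Works for monorepos, microservices, documentation vaults, and any directory with text/code files.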

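How It Works

ColGREP uses ColBERT late-interaction retrieval, an approach fundamentally different from traditional single-vector embeddings:

1. Tree-sitter parses your code into structured units (functions, methods, classes, signatures)
2. LateOn-Code-edge (17M params) creates multiple token-level embeddings per code unit, rather than one lossy summary vector
3. NextPlaid stores these in a quantized, memory-mapped Rust index
4. At search time, query tokens interact with document tokens for fine-grained relevance scoring

This is why a 17M-parameter model can beat 8B models: late interaction preserves the token-level semantics that single-vector approaches compress away. Read the full technical story: The Bloated Retriever Era Is Over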

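Interpreting Scores

- 6.0+: Near-exact conceptual match. The code does exactly what you described.
- 5.0–6.0: Strong semantic match. Highly relevant code.
- 4.0–5.0: Good match. Related code worth reviewing.
- 3.0–4.0: Weak match. May or may not be relevant.
- Below 3.0: Likely noise. Ignore these results.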

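Troubleshooting

"Index is being updated by another process": another colgrep instance is updating the index. The current search uses the existing index. Safe to ignore.

Re-index from scratch:

    rm -rf .colgrep/ && colgrep init

Add to .gitignore:

    .colgrep/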
