Ollama Proxy — One Endpoint for All Your Ollama Instances
You have Ollama running on multiple machines. Instead of hardcoding IPs and manually picking which Ollama instance to hit, point everything at the Ollama proxy. The Ollama proxy routes to the best available device automatically.
CODEBLOCK0
Set up the Ollama proxy
CODEBLOCK1
On one machine (the Ollama proxy):
CODEBLOCK2
On every machine running Ollama:
CODEBLOCK3
Now point your apps at http://ollama-proxy:11435 instead of http://localhost:11434. Same Ollama API, same model names, same streaming — the Ollama proxy handles smarter routing.
Drop-in Ollama proxy replacement
Every Ollama API endpoint works through the Ollama proxy:
CODEBLOCK4
OpenAI-compatible Ollama proxy API
The Ollama proxy also exposes an OpenAI-compatible endpoint — same models, no code changes:
CODEBLOCK5
What the Ollama proxy does that direct Ollama doesn't
| Feature | Direct Ollama | Ollama Proxy (Herd) |
|---|
| Multiple machines | Manual IP switching | Ollama proxy routes automatically |
| Load balancing |
None | Ollama proxy scores on 7 signals |
| Failover | None | Ollama proxy auto-retries on next node |
| Model discovery | Per-machine Ollama | Ollama proxy aggregates fleet-wide |
| Queue management | None | Ollama proxy manages per-node:model queues |
| Dashboard | None | Ollama proxy provides real-time web UI |
| Health checks | None | Ollama proxy runs 15 automated checks |
| Request tracing | None | Ollama proxy logs to SQLite trace store |
| Image generation | None | Ollama proxy routes mflux + DiffusionKit |
| Speech-to-text | None | Ollama proxy routes Qwen3-ASR |
Ollama proxy works with your existing tools
Just change the Ollama URL to the Ollama proxy — no other configuration needed:
| Tool | Before (direct Ollama) | After (Ollama proxy) |
|---|
| Open WebUI | INLINECODE2 | INLINECODE3 |
| Aider |
--openai-api-base http://localhost:11434/v1 |
--openai-api-base http://ollama-proxy:11435/v1 |
|
Continue.dev | Ollama at localhost | Ollama proxy at
ollama-proxy:11435 |
|
LangChain |
Ollama(base_url="http://localhost:11434") |
Ollama(base_url="http://ollama-proxy:11435") |
|
LiteLLM |
ollama/llama3.3:70b |
ollama/llama3.3:70b (point at Ollama proxy) |
|
CrewAI |
OPENAI_API_BASE=http://localhost:11434/v1 |
OPENAI_API_BASE=http://ollama-proxy:11435/v1 |
How the Ollama proxy routes requests
When a request arrives at the Ollama proxy, it scores all Ollama nodes that have the requested model:
- 1. Thermal state — is the model already loaded in the Ollama instance (hot)?
- Memory fit — does the Ollama node have enough free RAM?
- Queue depth — is the Ollama node busy with other requests?
- Latency history — how fast has this Ollama node been recently?
- Role affinity — the Ollama proxy sends big models to big machines
- Availability trend — is this Ollama node reliably available?
- Context fit — does the loaded context window match the request?
The highest-scoring Ollama node wins. If it fails, the Ollama proxy retries on the next best node automatically.
Monitor your Ollama proxy fleet
Ollama proxy dashboard at http://ollama-proxy:11435/dashboard — see every Ollama node, every model, every queue in real time.
CODEBLOCK6
Full documentation
Contribute
Ollama Herd (the Ollama proxy) is open source (MIT). We welcome contributions:
- - Star on GitHub — help others find the Ollama proxy
- Open an issue — bug reports, feature requests
- PRs welcome —
CLAUDE.md gives AI agents full Ollama proxy context. 444 tests, async Python.
Guardrails
- - No automatic model downloads — the Ollama proxy requires explicit user confirmation for model pulls.
- Model deletion requires explicit user confirmation via the Ollama proxy.
- All Ollama proxy requests stay local — no data leaves your network.
- Never delete or modify files in
~/.fleet-manager/.
Ollama Proxy — 一个端点管理所有Ollama实例
你在多台机器上运行Ollama。无需硬编码IP地址或手动选择要访问的Ollama实例,只需将所有请求指向Ollama Proxy。Ollama Proxy会自动路由到最佳可用设备。
之前: 应用 → http://macmini:11434 (单个Ollama实例,希望它不忙)
之后: 应用 → http://ollama-proxy:11435 (Ollama Proxy自动选择最佳机器)
设置Ollama Proxy
bash
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
在一台机器上(作为Ollama Proxy):
bash
herd # 在端口11435上启动Ollama Proxy
在每台运行Ollama的机器上:
bash
herd-node # 自动发现网络上的Ollama Proxy
现在将你的应用指向 http://ollama-proxy:11435 而不是 http://localhost:11434。相同的Ollama API、相同的模型名称、相同的流式传输——Ollama Proxy负责更智能的路由。
即插即用的Ollama Proxy替代方案
所有Ollama API端点均可通过Ollama Proxy工作:
bash
通过Ollama Proxy聊天(与直接使用Ollama相同)
curl http://ollama-proxy:11435/api/chat -d {
model: llama3.3:70b,
messages: [{role: user, content: 通过Ollama Proxy问好}]
}
通过Ollama Proxy生成(与直接使用Ollama相同)
curl http://ollama-proxy:11435/api/generate -d {
model: qwen3:32b,
prompt: 通过Ollama Proxy解释量子计算
}
通过Ollama Proxy列出模型(汇总所有Ollama节点)
curl http://ollama-proxy:11435/api/tags
通过Ollama Proxy列出已加载模型(跨所有Ollama节点)
curl http://ollama-proxy:11435/api/ps
通过Ollama Proxy拉取模型(自动选择最佳节点)
curl -N http://ollama-proxy:11435/api/pull -d {name: codestral}
兼容OpenAI的Ollama Proxy API
Ollama Proxy还暴露了兼容OpenAI的端点——相同的模型,无需修改代码:
python
from openai import OpenAI
指向Ollama Proxy而非直接使用Ollama
ollama
proxyclient = OpenAI(base
url=http://ollama-proxy:11435/v1, apikey=not-needed)
ollama
proxyresponse = ollama
proxyclient.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 通过Ollama Proxy问好}],
stream=True,
)
Ollama Proxy相比直接使用Ollama的额外功能
| 功能 | 直接使用Ollama | Ollama Proxy (Herd) |
|---|
| 多机器支持 | 手动切换IP | Ollama Proxy自动路由 |
| 负载均衡 |
无 | Ollama Proxy基于7个信号评分 |
| 故障转移 | 无 | Ollama Proxy自动重试下一个节点 |
| 模型发现 | 每台机器独立 | Ollama Proxy汇总整个集群 |
| 队列管理 | 无 | Ollama Proxy管理每节点:模型队列 |
| 仪表盘 | 无 | Ollama Proxy提供实时Web界面 |
| 健康检查 | 无 | Ollama Proxy运行15项自动检查 |
| 请求追踪 | 无 | Ollama Proxy记录到SQLite追踪存储 |
| 图像生成 | 无 | Ollama Proxy路由mflux + DiffusionKit |
| 语音转文字 | 无 | Ollama Proxy路由Qwen3-ASR |
Ollama Proxy与现有工具兼容
只需将Ollama URL改为Ollama Proxy——无需其他配置:
| 工具 | 之前(直接使用Ollama) | 之后(使用Ollama Proxy) |
|---|
| Open WebUI | http://localhost:11434 | http://ollama-proxy:11435 |
| Aider |
--openai-api-base http://localhost:11434/v1 | --openai-api-base http://ollama-proxy:11435/v1 |
|
Continue.dev | 本地Ollama | 使用ollama-proxy:11435的Ollama Proxy |
|
LangChain | Ollama(base
url=http://localhost:11434) | Ollama(baseurl=http://ollama-proxy:11435) |
|
LiteLLM | ollama/llama3.3:70b | ollama/llama3.3:70b(指向Ollama Proxy) |
|
CrewAI | OPENAI
APIBASE=http://localhost:11434/v1 | OPENAI
APIBASE=http://ollama-proxy:11435/v1 |
Ollama Proxy如何路由请求
当请求到达Ollama Proxy时,它会对所有拥有请求模型的Ollama节点进行评分:
- 1. 热状态 — 模型是否已加载到Ollama实例中(热加载)?
- 内存适配 — Ollama节点是否有足够的空闲RAM?
- 队列深度 — Ollama节点是否忙于处理其他请求?
- 延迟历史 — 该Ollama节点最近响应速度如何?
- 角色亲和性 — Ollama Proxy将大模型发送到大机器
- 可用性趋势 — 该Ollama节点是否稳定可用?
- 上下文适配 — 已加载的上下文窗口是否匹配请求?
得分最高的Ollama节点胜出。如果失败,Ollama Proxy会自动重试下一个最佳节点。
监控你的Ollama Proxy集群
Ollama Proxy仪表盘位于 http://ollama-proxy:11435/dashboard —— 实时查看每个Ollama节点、每个模型、每个队列。
bash
Ollama Proxy集群概览
curl -s http://ollama-proxy:11435/fleet/status | python3 -m json.tool
Ollama Proxy健康检查
curl -s http://ollama-proxy:11435/dashboard/api/health | python3 -m json.tool
完整文档
- - 代理设置指南 — 设置Ollama Proxy
- API参考 — 所有Ollama Proxy端点
- 配置 — Ollama Proxy设置
贡献
Ollama Herd(Ollama Proxy)是开源的(MIT协议)。我们欢迎贡献:
- - 在GitHub上星标 — 帮助他人找到Ollama Proxy
- 提交问题 — 错误报告、功能请求
- 欢迎PR — CLAUDE.md为AI代理提供完整的Ollama Proxy上下文。444个测试,异步Python。
安全护栏
- - 不自动下载模型 — Ollama Proxy需要用户明确确认才能拉取模型。
- 删除模型需要通过Ollama Proxy进行用户明确确认。
- 所有Ollama Proxy请求保持本地 — 没有数据离开你的网络。
- 切勿删除或修改 ~/.fleet-manager/ 中的文件。