Ollama — Herd Your Ollama LLMs Into One Endpoint
You have Ollama running on multiple machines. This skill gives you one Ollama endpoint that routes every Ollama request to the best available device automatically. No more hardcoding Ollama IPs, no more manual Ollama load balancing, no more "which Ollama machine has that model loaded?"
Setup Ollama Herd
CODEBLOCK0
Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same Ollama models, smarter Ollama routing.
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Use your Ollama models through the fleet
OpenAI SDK (drop-in Ollama routing)
CODEBLOCK1
Ollama API (same as before, different port)
CODEBLOCK2
What the Ollama router does
When an Ollama request comes in, the Ollama router scores every online Ollama node on 7 signals:
- 1. Ollama Thermal — is the Ollama model already loaded in GPU memory? (+50 for hot)
- Ollama Memory fit — how much headroom does the Ollama node have?
- Ollama Queue depth — how many Ollama requests are waiting?
- Ollama Wait time — estimated latency based on Ollama history
- Ollama Role affinity — large Ollama models prefer big machines
- Ollama Availability — is the Ollama node reliably available?
- Ollama Context fit — does the loaded Ollama context window fit the request?
The highest-scoring Ollama node handles the request. If it fails, the Ollama router retries on the next best node automatically.
Supported Ollama models
Any model that runs on Ollama works through the Ollama fleet. Popular Ollama models:
| Ollama Model | Sizes | Best for |
|---|
| INLINECODE2 | 8B, 70B | General purpose Ollama inference |
| INLINECODE3 |
0.6B–235B | Multilingual Ollama reasoning |
|
qwen3.5 | 0.8B–397B | Latest generation Ollama model |
|
deepseek-v3 | 671B (37B active) | Ollama GPT-4o alternative |
|
deepseek-r1 | 1.5B–671B | Ollama reasoning (like o3) |
|
phi4 | 14B | Small, fast Ollama model |
|
mistral | 7B | Fast Ollama European languages |
|
gemma3 | 1B–27B | Google's open Ollama model |
|
codestral | 22B | Ollama code generation |
|
qwen3-coder | 30B (3.3B active) | Agentic Ollama coding |
|
nomic-embed-text | 137M | Ollama embeddings for RAG |
Ollama Resilience features
- - Ollama Auto-retry — re-routes to next best Ollama node on failure (before first chunk)
- Ollama VRAM-aware fallback — routes to a loaded Ollama model in the same category instead of cold-loading
- Ollama Context protection — prevents
num_ctx from triggering expensive Ollama model reloads - Ollama Zombie reaper — cleans up stuck in-flight Ollama requests
- Ollama Auto-pull — downloads missing Ollama models to the best node automatically
Also available via Ollama Herd
The same Ollama fleet router handles three more workloads:
Ollama Image generation
CODEBLOCK3
Ollama Speech-to-text
CODEBLOCK4
Ollama Embeddings
CODEBLOCK5
Ollama Dashboard
INLINECODE14 — 8 tabs: Ollama Fleet Overview, Trends, Ollama Model Insights, Apps, Benchmarks, Ollama Health, Recommendations, Settings. Real-time Ollama queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.
Ollama Request tagging
Track per-project Ollama usage:
CODEBLOCK6
Full Ollama documentation
Ollama Agent Setup Guide
Ollama Guardrails
- - Never restart the Ollama router or Ollama node agents without user confirmation.
- Never delete or modify files in
~/.fleet-manager/ (Ollama data). - Never pull or delete Ollama models without user confirmation.
Ollama — 将您的Ollama大模型集中管理至统一端点
您在多台机器上运行Ollama。此技能为您提供一个统一的Ollama端点,自动将每个Ollama请求路由到最佳可用设备。无需再硬编码Ollama IP地址,无需手动进行Ollama负载均衡,无需再纠结哪台Ollama机器加载了那个模型?
设置Ollama Herd
bash
pip install ollama-herd # 安装Ollama路由器
herd # 在端口11435上启动Ollama路由器
herd-node # 在每台安装了Ollama的机器上运行
现在将所有请求指向http://localhost:11435而非http://localhost:11434。相同的Ollama API,相同的Ollama模型,更智能的Ollama路由。
软件包:ollama-herd | 仓库:github.com/geeks-accelerator/ollama-herd
通过集群使用您的Ollama模型
OpenAI SDK(即插即用的Ollama路由)
python
ollamaopenaiclient — 通过OpenAI SDK路由Ollama请求
from openai import OpenAI
ollamaclient = OpenAI(baseurl=http://localhost:11435/v1, api_key=not-needed)
ollamaresponse = ollamaclient.chat.completions.create(
model=llama3.3:70b, # 任意Ollama模型
messages=[{role: user, content: 来自Ollama的问候}],
stream=True,
)
for chunk in ollama_response:
print(chunk.choices[0].delta.content or , end=)
Ollama API(与之前相同,端口不同)
bash
Ollama聊天 — 通过Ollama集群路由
curl http://localhost:11435/api/chat -d {
model: qwen3:235b,
messages: [{role: user, content: 通过Ollama Herd问候}],
stream: false
}
列出所有机器上的所有Ollama模型
curl http://localhost:11435/api/tags
当前在GPU内存中的Ollama模型
curl http://localhost:11435/api/ps
Ollama嵌入
curl http://localhost:11435/api/embeddings -d {
model: nomic-embed-text,
prompt: Ollama嵌入搜索查询
}
Ollama路由器的功能
当Ollama请求到达时,Ollama路由器根据7个信号对每个在线Ollama节点进行评分:
- 1. Ollama热度 — Ollama模型是否已加载到GPU内存中?(热加载+50分)
- Ollama内存适配 — Ollama节点有多少剩余空间?
- Ollama队列深度 — 有多少Ollama请求在等待?
- Ollama等待时间 — 基于Ollama历史记录的预估延迟
- Ollama角色亲和性 — 大型Ollama模型偏好大型机器
- Ollama可用性 — Ollama节点是否稳定可用?
- Ollama上下文适配 — 已加载的Ollama上下文窗口是否适合该请求?
得分最高的Ollama节点处理该请求。如果失败,Ollama路由器会自动重试下一个最佳节点。
支持的Ollama模型
任何能在Ollama上运行的模型都可以通过Ollama集群使用。热门Ollama模型:
| Ollama模型 | 参数规模 | 最佳用途 |
|---|
| llama3.3 | 8B, 70B | 通用Ollama推理 |
| qwen3 |
0.6B–235B | 多语言Ollama推理 |
| qwen3.5 | 0.8B–397B | 最新一代Ollama模型 |
| deepseek-v3 | 671B(37B活跃) | Ollama GPT-4o替代方案 |
| deepseek-r1 | 1.5B–671B | Ollama推理(类似o3) |
| phi4 | 14B | 小巧快速的Ollama模型 |
| mistral | 7B | 快速Ollama欧洲语言 |
| gemma3 | 1B–27B | Google开源Ollama模型 |
| codestral | 22B | Ollama代码生成 |
| qwen3-coder | 30B(3.3B活跃) | 智能体Ollama编码 |
| nomic-embed-text | 137M | 用于RAG的Ollama嵌入 |
Ollama弹性特性
- - Ollama自动重试 — 失败时重新路由到下一个最佳Ollama节点(在第一个数据块之前)
- Ollama VRAM感知回退 — 路由到同一类别中已加载的Ollama模型,而非冷加载
- Ollama上下文保护 — 防止num_ctx触发昂贵的Ollama模型重新加载
- Ollama僵尸清理 — 清理卡住的进行中Ollama请求
- Ollama自动拉取 — 自动将缺失的Ollama模型下载到最佳节点
通过Ollama Herd还可使用
相同的Ollama集群路由器还处理另外三种工作负载:
Ollama图像生成
bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:通过Ollama Herd的日落,width:1024,height:1024,steps:4}
Ollama语音转文字
bash
curl http://localhost:11435/api/transcribe -F audio=@recording.wav
Ollama嵌入
bash
curl http://localhost:11435/api/embeddings -d {model:nomic-embed-text,prompt:Ollama嵌入文本}
Ollama仪表盘
http://localhost:11435/dashboard — 8个标签页:Ollama集群概览、趋势、Ollama模型洞察、应用、基准测试、Ollama健康状态、推荐、设置。实时Ollama队列可见性,带有[TEXT]、[IMAGE]、[STT]、[EMBED]徽章。
Ollama请求标记
追踪每个项目的Ollama使用情况:
python
ollamaresponse = ollamaclient.chat.completions.create(
model=llama3.3:70b, # Ollama模型
messages=messages,
extra_body={metadata: {tags: [my-ollama-project, reasoning]}},
)
完整Ollama文档
Ollama代理设置指南
Ollama安全护栏
- - 未经用户确认,切勿重启Ollama路由器或Ollama节点代理。
- 切勿删除或修改~/.fleet-manager/中的文件(Ollama数据)。
- 未经用户确认,切勿拉取或删除Ollama模型。