Ollama — Herd Your Ollama LLMs Into One Endpoint

You have Ollama running on multiple machines. This skill gives you one Ollama endpoint that routes every Ollama request to the best available device automatically. No more hardcoding Ollama IPs, no more manual Ollama load balancing, no more "which Ollama machine has that model loaded?"

Setup Ollama Herd

CODEBLOCK0

Now point everything at http://localhost:11435 instead of http://localhost:11434. Same Ollama API, same Ollama models, smarter Ollama routing.

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Use your Ollama models through the fleet

OpenAI SDK (drop-in Ollama routing)

CODEBLOCK1

Ollama API (same as before, different port)

CODEBLOCK2

What the Ollama router does

When an Ollama request comes in, the Ollama router scores every online Ollama node on 7 signals:

1. Ollama Thermal — is the Ollama model already loaded in GPU memory? (+50 for hot)
Ollama Memory fit — how much headroom does the Ollama node have?
Ollama Queue depth — how many Ollama requests are waiting?
Ollama Wait time — estimated latency based on Ollama history
Ollama Role affinity — large Ollama models prefer big machines
Ollama Availability — is the Ollama node reliably available?
Ollama Context fit — does the loaded Ollama context window fit the request?

The highest-scoring Ollama node handles the request. If it fails, the Ollama router retries on the next best node automatically.

Supported Ollama models

Any model that runs on Ollama works through the Ollama fleet. Popular Ollama models:

Ollama Model	Sizes	Best for
INLINECODE2	8B, 70B	General purpose Ollama inference
INLINECODE3

Ollama Resilience features

- Ollama Auto-retry — re-routes to next best Ollama node on failure (before first chunk)
Ollama VRAM-aware fallback — routes to a loaded Ollama model in the same category instead of cold-loading
Ollama Context protection — prevents num_ctx from triggering expensive Ollama model reloads
Ollama Zombie reaper — cleans up stuck in-flight Ollama requests
Ollama Auto-pull — downloads missing Ollama models to the best node automatically

Also available via Ollama Herd

The same Ollama fleet router handles three more workloads:

Ollama Image generation

CODEBLOCK3

Ollama Speech-to-text

CODEBLOCK4

Ollama Embeddings

CODEBLOCK5

Ollama Dashboard

INLINECODE14 — 8 tabs: Ollama Fleet Overview, Trends, Ollama Model Insights, Apps, Benchmarks, Ollama Health, Recommendations, Settings. Real-time Ollama queue visibility with [TEXT], [IMAGE], [STT], [EMBED] badges.

Ollama Request tagging

Track per-project Ollama usage:

CODEBLOCK6

Full Ollama documentation

Ollama Agent Setup Guide

Ollama Guardrails

- Never restart the Ollama router or Ollama node agents without user confirmation.
Never delete or modify files in ~/.fleet-manager/ (Ollama data).
Never pull or delete Ollama models without user confirmation.

Ollama — 将您的Ollama大模型集中管理至统一端点

您在多台机器上运行Ollama。此技能为您提供一个统一的Ollama端点，自动将每个Ollama请求路由到最佳可用设备。无需再硬编码Ollama IP地址，无需手动进行Ollama负载均衡，无需再纠结哪台Ollama机器加载了那个模型？

设置Ollama Herd

bash
pip install ollama-herd # 安装Ollama路由器
herd # 在端口11435上启动Ollama路由器
herd-node # 在每台安装了Ollama的机器上运行

现在将所有请求指向http://localhost:11435而非http://localhost:11434。相同的Ollama API，相同的Ollama模型，更智能的Ollama路由。

软件包：ollama-herd | 仓库：github.com/geeks-accelerator/ollama-herd

通过集群使用您的Ollama模型

OpenAI SDK（即插即用的Ollama路由）

python

ollamaopenaiclient — 通过OpenAI SDK路由Ollama请求

from openai import OpenAI

ollamaclient = OpenAI(baseurl=http://localhost:11435/v1, api_key=not-needed)
ollamaresponse = ollamaclient.chat.completions.create(
model=llama3.3:70b, # 任意Ollama模型
messages=[{role: user, content: 来自Ollama的问候}],
stream=True,
)
for chunk in ollama_response:
print(chunk.choices[0].delta.content or , end=)

Ollama API（与之前相同，端口不同）

bash

Ollama聊天 — 通过Ollama集群路由

curl http://localhost:11435/api/chat -d {
model: qwen3:235b,
messages: [{role: user, content: 通过Ollama Herd问候}],
stream: false
}

列出所有机器上的所有Ollama模型

curl http://localhost:11435/api/tags

当前在GPU内存中的Ollama模型

curl http://localhost:11435/api/ps

Ollama嵌入

curl http://localhost:11435/api/embeddings -d { model: nomic-embed-text, prompt: Ollama嵌入搜索查询 }

Ollama路由器的功能

当Ollama请求到达时，Ollama路由器根据7个信号对每个在线Ollama节点进行评分：

1. Ollama热度 — Ollama模型是否已加载到GPU内存中？（热加载+50分）
Ollama内存适配 — Ollama节点有多少剩余空间？
Ollama队列深度 — 有多少Ollama请求在等待？
Ollama等待时间 — 基于Ollama历史记录的预估延迟
Ollama角色亲和性 — 大型Ollama模型偏好大型机器
Ollama可用性 — Ollama节点是否稳定可用？
Ollama上下文适配 — 已加载的Ollama上下文窗口是否适合该请求？

得分最高的Ollama节点处理该请求。如果失败，Ollama路由器会自动重试下一个最佳节点。

支持的Ollama模型

任何能在Ollama上运行的模型都可以通过Ollama集群使用。热门Ollama模型：

Ollama模型	参数规模	最佳用途
llama3.3	8B, 70B	通用Ollama推理
qwen3

Ollama弹性特性

- Ollama自动重试 — 失败时重新路由到下一个最佳Ollama节点（在第一个数据块之前）
Ollama VRAM感知回退 — 路由到同一类别中已加载的Ollama模型，而非冷加载
Ollama上下文保护 — 防止num_ctx触发昂贵的Ollama模型重新加载
Ollama僵尸清理 — 清理卡住的进行中Ollama请求
Ollama自动拉取 — 自动将缺失的Ollama模型下载到最佳节点

通过Ollama Herd还可使用

相同的Ollama集群路由器还处理另外三种工作负载：

Ollama图像生成

bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:通过Ollama Herd的日落,width:1024,height:1024,steps:4}

Ollama语音转文字

bash
curl http://localhost:11435/api/transcribe -F audio=@recording.wav

Ollama嵌入

bash
curl http://localhost:11435/api/embeddings -d {model:nomic-embed-text,prompt:Ollama嵌入文本}

Ollama仪表盘

http://localhost:11435/dashboard — 8个标签页：Ollama集群概览、趋势、Ollama模型洞察、应用、基准测试、Ollama健康状态、推荐、设置。实时Ollama队列可见性，带有[TEXT]、[IMAGE]、[STT]、[EMBED]徽章。

Ollama请求标记

追踪每个项目的Ollama使用情况：

python
ollamaresponse = ollamaclient.chat.completions.create(
model=llama3.3:70b, # Ollama模型
messages=messages,
extra_body={metadata: {tags: [my-ollama-project, reasoning]}},
)

完整Ollama文档

Ollama代理设置指南

Ollama安全护栏

- 未经用户确认，切勿重启Ollama路由器或Ollama节点代理。
切勿删除或修改~/.fleet-manager/中的文件（Ollama数据）。
未经用户确认，切勿拉取或删除Ollama模型。

ollama-ollama-herdOllama模型路由

ollama-ollama-herd

Ollama — Herd Your Ollama LLMs Into One Endpoint

Setup Ollama Herd

Use your Ollama models through the fleet

OpenAI SDK (drop-in Ollama routing)

Ollama API (same as before, different port)

What the Ollama router does

Supported Ollama models

Ollama Resilience features

Also available via Ollama Herd

Ollama Image generation

Ollama Speech-to-text

Ollama Embeddings

Ollama Dashboard

Ollama Request tagging

Full Ollama documentation

Ollama Guardrails

Ollama — 将您的Ollama大模型集中管理至统一端点

设置Ollama Herd

通过集群使用您的Ollama模型

OpenAI SDK（即插即用的Ollama路由）

ollamaopenaiclient — 通过OpenAI SDK路由Ollama请求

Ollama API（与之前相同，端口不同）

Ollama聊天 — 通过Ollama集群路由

列出所有机器上的所有Ollama模型

当前在GPU内存中的Ollama模型

Ollama嵌入

Ollama路由器的功能

支持的Ollama模型

Ollama弹性特性

通过Ollama Herd还可使用

Ollama图像生成

Ollama语音转文字

Ollama嵌入

Ollama仪表盘

Ollama请求标记

完整Ollama文档

Ollama安全护栏

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement