Self-Hosted AI — Own Your Entire AI Stack
Stop paying per token. Stop sending data to cloud APIs. Run self-hosted LLMs, self-hosted image generation, self-hosted speech-to-text, and self-hosted embeddings on your own hardware. One self-hosted router makes all your devices act like one system.
What self-hosted AI replaces
| Cloud service | Self-hosted replacement | How |
|---|
| OpenAI API | Self-hosted Llama 3.3, Qwen 3.5, DeepSeek-R1 via Ollama | Same OpenAI SDK, swap the base URL |
| DALL-E / Midjourney |
Self-hosted Stable Diffusion 3, Flux via mflux/DiffusionKit |
POST /api/generate-image |
|
Whisper API | Self-hosted Qwen3-ASR via MLX |
POST /api/transcribe |
|
OpenAI Embeddings | Self-hosted nomic-embed-text, mxbai-embed via Ollama |
POST /api/embed |
Same APIs. Same quality. Zero per-request costs. All data stays on your self-hosted machines.
Self-Hosted Setup
CODEBLOCK0
No Docker. No Kubernetes. No config files. Self-hosted devices find each other automatically on your local network.
Self-Hosted LLM Inference
Drop-in self-hosted replacement for the OpenAI SDK:
CODEBLOCK1
Self-hosted Ollama API
CODEBLOCK2
Self-Hosted Image Generation
Self-hosted replacement for DALL-E and Midjourney:
CODEBLOCK3
Self-Hosted Speech-to-Text
Self-hosted replacement for Whisper API:
CODEBLOCK4
All self-hosted transcription stays on your network. No audio data sent to cloud services.
Self-Hosted Embeddings
Self-hosted replacement for OpenAI's embedding API:
CODEBLOCK5
Self-Hosted Cost Comparison
| Service | Cloud cost | Self-hosted cost |
|---|
| GPT-4o (1M tokens/month) | ~$15-30/month | $0 (self-hosted hardware you own) |
| DALL-E (1000 images/month) |
~$40/month | $0 (self-hosted image gen) |
| Whisper API (10 hours audio/month) | ~$6/month | $0 (self-hosted transcription) |
| OpenAI embeddings (1M tokens/month) | ~$0.10/month | $0 (self-hosted embeddings) |
|
Total |
~$60+/month |
$0/month self-hosted |
After hardware investment, every self-hosted request is free forever. No rate limits, no usage caps, no surprise bills.
Self-Hosted Advantages
- - Self-hosted data sovereignty — prompts, images, audio, and documents never leave your network
- Self-hosted throughput — your hardware, no rate limits
- Self-hosted uptime — cloud API outages don't affect your self-hosted fleet
- Self-hosted flexibility — switch models instantly, no vendor lock-in
- Self-hosted compliance — HIPAA, GDPR, SOC2 — no third-party data processors
- Self-hosted predictability — hardware depreciates, but never surprises you with a bill
Self-Hosted Fleet Routing
The self-hosted router scores each device on 7 signals and picks the best one for every request. Multiple self-hosted machines share the load automatically.
CODEBLOCK6
Self-hosted dashboard at http://localhost:11435/dashboard for visual monitoring of your entire self-hosted fleet.
Full self-hosted documentation
Contribute
Ollama Herd is open source (MIT). Self-hosted AI for everyone:
- - Star on GitHub — help others discover self-hosted AI
- Open an issue — share your self-hosted setup
- PRs welcome from humans and AI agents.
CLAUDE.md gives full self-hosted context. 444 tests.
Self-Hosted Guardrails
- - No automatic downloads — all self-hosted model pulls require explicit user confirmation.
- Self-hosted model deletion requires explicit user confirmation.
- All self-hosted requests stay local — no data leaves your network. No telemetry, no analytics, no cloud callbacks.
- Never delete or modify self-hosted files in
~/.fleet-manager/. - Your self-hosted fleet has zero cloud dependencies — works fully offline after initial model downloads.
自托管AI — 拥有你的完整AI栈
停止按token付费。停止向云端API发送数据。在你的自有硬件上运行自托管LLM、自托管图像生成、自托管语音转文本和自托管嵌入。一个自托管路由器让你的所有设备如同一个系统般协同工作。
自托管AI替代方案
| 云服务 | 自托管替代方案 | 实现方式 |
|---|
| OpenAI API | 通过Ollama自托管Llama 3.3、Qwen 3.5、DeepSeek-R1 | 相同OpenAI SDK,更换基础URL |
| DALL-E / Midjourney |
通过mflux/DiffusionKit自托管Stable Diffusion 3、Flux | POST /api/generate-image |
|
Whisper API | 通过MLX自托管Qwen3-ASR | POST /api/transcribe |
|
OpenAI Embeddings | 通过Ollama自托管nomic-embed-text、mxbai-embed | POST /api/embed |
相同API。相同质量。零按次请求成本。所有数据保留在你的自托管机器上。
自托管设置
bash
pip install ollama-herd # 从PyPI安装自托管AI路由器
herd # 启动自托管路由器
herd-node # 在每台自托管机器上运行 — 自动发现路由器
无需Docker。无需Kubernetes。无需配置文件。自托管设备在本地网络上自动相互发现。
自托管LLM推理
OpenAI SDK的直接自托管替代方案:
python
from openai import OpenAI
自托管推理客户端 — 替代OpenAI云端
self
hostedclient = OpenAI(base
url=http://localhost:11435/v1, apikey=not-needed)
selfhostedresponse = selfhostedclient.chat.completions.create(
model=llama3.3:70b, # 自托管模型,无云端依赖
messages=[{role: user, content: 分析这份合同的风险}],
stream=True,
)
for chunk in selfhostedresponse:
print(chunk.choices[0].delta.content or , end=)
自托管Ollama API
bash
curl http://localhost:11435/api/chat -d {
model: deepseek-r1:70b,
messages: [{role: user, content: 解释自托管AI相比云端API的优势}],
stream: false
}
自托管图像生成
DALL-E和Midjourney的自托管替代方案:
bash
在任何节点上安装自托管图像后端
uv tool install mflux # 自托管Flux模型(约7秒)
uv tool install diffusionkit # 自托管Stable Diffusion 3/3.5
在你的自托管集群上生成图像
curl -o self
hostedoutput.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model: z-image-turbo, prompt: 自托管AI生成产品模型, width: 1024, height: 1024}
自托管语音转文本
Whisper API的自托管替代方案:
bash
curl http://localhost:11435/api/transcribe \
-F file=@selfhostedmeeting.wav \
-F model=qwen3-asr
所有自托管转录保留在你的网络中。无音频数据发送到云端服务。
自托管嵌入
OpenAI嵌入API的自托管替代方案:
bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: 用于私有RAG管道的自托管文档嵌入}
自托管成本对比
| 服务 | 云端成本 | 自托管成本 |
|---|
| GPT-4o(每月100万token) | 约15-30美元/月 | 0美元(你拥有的自托管硬件) |
| DALL-E(每月1000张图像) |
约40美元/月 | 0美元(自托管图像生成) |
| Whisper API(每月10小时音频) | 约6美元/月 | 0美元(自托管转录) |
| OpenAI嵌入(每月100万token) | 约0.10美元/月 | 0美元(自托管嵌入) |
|
总计 |
约60+美元/月 |
自托管每月0美元 |
硬件投资后,每个自托管请求永久免费。无速率限制,无使用上限,无意外账单。
自托管优势
- - 自托管数据主权 — 提示词、图像、音频和文档永不离开你的网络
- 自托管吞吐量 — 你的硬件,无速率限制
- 自托管正常运行时间 — 云端API中断不影响你的自托管集群
- 自托管灵活性 — 即时切换模型,无供应商锁定
- 自托管合规性 — HIPAA、GDPR、SOC2 — 无第三方数据处理者
- 自托管可预测性 — 硬件折旧,但永不给你意外账单
自托管集群路由
自托管路由器根据7个信号对每台设备评分,并为每个请求选择最佳设备。多台自托管机器自动分担负载。
bash
自托管集群概览
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
自托管健康检查
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
针对你硬件的自托管模型推荐
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
自托管仪表板位于 http://localhost:11435/dashboard,用于可视化监控你的整个自托管集群。
完整自托管文档
贡献
Ollama Herd是开源项目(MIT)。为所有人提供自托管AI:
- - 在GitHub上标星 — 帮助他人发现自托管AI
- 提交Issue — 分享你的自托管设置
- 欢迎PR,来自人类和AI Agent。CLAUDE.md提供完整的自托管上下文。444个测试。
自托管安全护栏
- - 无自动下载 — 所有自托管模型拉取都需要明确的用户确认。
- 自托管模型删除需要明确的用户确认。
- 所有自托管请求保持本地 — 无数据离开你的网络。无遥测,无分析,无云端回调。
- 切勿删除或修改 ~/.fleet-manager/ 中的自托管文件。
- 你的自托管集群零云端依赖 — 初始模型下载后可完全离线工作。