DeepSeek — Run DeepSeek Models Across Your Local Fleet
Run DeepSeek-V3, DeepSeek-R1, and DeepSeek-Coder on your own hardware. The fleet router picks the best device for every request — no cloud API needed, zero per-token costs, all data stays on your machines.
Supported DeepSeek models
| Model | Parameters | Ollama name | Best for |
|---|
| DeepSeek-V3 | 671B MoE (37B active) | INLINECODE0 | General — matches GPT-4o on most benchmarks |
| DeepSeek-V3.1 |
671B MoE |
deepseek-v3.1 | Hybrid thinking/non-thinking modes |
|
DeepSeek-V3.2 | 671B MoE |
deepseek-v3.2 | Improved reasoning + agent performance |
|
DeepSeek-R1 | 1.5B–671B |
deepseek-r1 | Reasoning — approaches O3 and Gemini 2.5 Pro |
|
DeepSeek-Coder | 1.3B–33B |
deepseek-coder | Code generation (87% code, 13% NL training) |
|
DeepSeek-Coder-V2 | 236B MoE (21B active) |
deepseek-coder-v2 | Code — matches GPT-4 Turbo on code tasks |
Setup
CODEBLOCK0
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Models are pulled on demand — the router auto-pulls when a request arrives for a model not yet on any node, or you can pull manually via the dashboard. No models are downloaded during installation.
Use DeepSeek through the fleet
OpenAI SDK
CODEBLOCK1
DeepSeek-Coder for code
CODEBLOCK2
Ollama API
CODEBLOCK3
Hardware recommendations (optional — choose models that fit your RAM)
Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.
DeepSeek offers models at every size. Pick the one that fits your available memory — smaller models work great for most tasks:
| Model | Min RAM | Recommended hardware |
|---|
| INLINECODE6 | 4GB | Any Mac |
| INLINECODE7 |
8GB | Mac Mini M4 (16GB) |
|
deepseek-r1:14b | 12GB | Mac Mini M4 (24GB) |
|
deepseek-r1:32b | 24GB | Mac Mini M4 Pro (48GB) |
|
deepseek-r1:70b | 48GB | Mac Studio M4 Max (128GB) |
|
deepseek-coder-v2:16b | 12GB | Mac Mini M4 (24GB) |
|
deepseek-v3 | 256GB+ | Mac Studio M3 Ultra (512GB) |
The fleet router automatically sends requests to the machine where the model is loaded — no manual routing needed.
Why run DeepSeek locally
- - Zero cost — DeepSeek API charges per token. Local is free after hardware.
- Privacy — code and business data never leave your network.
- No rate limits — DeepSeek API throttles during peak hours. Local has no throttle.
- Availability — DeepSeek API has had outages. Your hardware doesn't depend on their servers.
- Fleet routing — multiple machines share the load. One busy? Request goes to the next.
Fleet features
- - 7-signal scoring — picks the optimal node for every request
- Auto-retry — fails over to next best node transparently
- VRAM-aware fallback — routes to a loaded model in the same category instead of cold-loading
- Context protection — prevents expensive model reloads from
num_ctx changes - Request tagging — track per-project DeepSeek usage
Also available on this fleet
Other LLM models
Llama 3.3, Qwen 3.5, Phi 4, Mistral, Gemma 3 — any Ollama model routes through the same endpoint.
Image generation
CODEBLOCK4
Speech-to-text
CODEBLOCK5
Embeddings
CODEBLOCK6
Dashboard
INLINECODE14 — monitor DeepSeek requests alongside all other models. Per-model latency, token throughput, health checks.
Full documentation
Agent Setup Guide
Guardrails
- - Model downloads require explicit user confirmation — DeepSeek models range from 1GB (1.5B) to 400GB+ (671B). Always confirm before pulling.
- Model deletion requires explicit user confirmation — never remove models without asking.
- Never delete or modify files in
~/.fleet-manager/. - If a DeepSeek model is too large for available memory, suggest a smaller variant (e.g.,
deepseek-r1:7b instead of :70b). - No models are downloaded automatically — all pulls are user-initiated or require opt-in via the
auto_pull setting.
DeepSeek — 在本地设备群中运行DeepSeek模型
在您自己的硬件上运行DeepSeek-V3、DeepSeek-R1和DeepSeek-Coder。设备群路由器为每个请求选择最佳设备——无需云API,零按token计费,所有数据保留在您的机器上。
支持的DeepSeek模型
| 模型 | 参数 | Ollama名称 | 最佳用途 |
|---|
| DeepSeek-V3 | 671B MoE(37B活跃) | deepseek-v3 | 通用——在大多数基准测试中与GPT-4o相当 |
| DeepSeek-V3.1 |
671B MoE | deepseek-v3.1 | 混合思考/非思考模式 |
|
DeepSeek-V3.2 | 671B MoE | deepseek-v3.2 | 改进的推理 + 智能体性能 |
|
DeepSeek-R1 | 1.5B–671B | deepseek-r1 | 推理——接近O3和Gemini 2.5 Pro |
|
DeepSeek-Coder | 1.3B–33B | deepseek-coder | 代码生成(87%代码,13%自然语言训练) |
|
DeepSeek-Coder-V2 | 236B MoE(21B活跃) | deepseek-coder-v2 | 代码——在代码任务上与GPT-4 Turbo相当 |
安装设置
bash
pip install ollama-herd
herd # 启动路由器(端口11435)
herd-node # 在每台机器上运行
软件包:ollama-herd | 仓库:github.com/geeks-accelerator/ollama-herd
模型按需拉取——当请求到达时,如果模型尚未在任何节点上,路由器会自动拉取;或者您可以通过仪表盘手动拉取。安装过程中不会下载任何模型。
通过设备群使用DeepSeek
OpenAI SDK
python
from openai import OpenAI
client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
DeepSeek-R1用于推理
response = client.chat.completions.create(
model=deepseek-r1:70b,
messages=[{role: user, content: 证明存在无穷多个素数}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)
DeepSeek-Coder用于代码
python
response = client.chat.completions.create(
model=deepseek-coder-v2:16b,
messages=[{role: user, content: 用Python编写一个Redis缓存装饰器}],
)
print(response.choices[0].message.content)
Ollama API
bash
DeepSeek-V3通用对话
curl http://localhost:11435/api/chat -d {
model: deepseek-v3,
messages: [{role: user, content: 解释量子计算}],
stream: false
}
DeepSeek-R1推理
curl http://localhost:11435/api/chat -d {
model: deepseek-r1:70b,
messages: [{role: user, content: 逐步解决这个问题:...}],
stream: false
}
硬件建议(可选——选择适合您内存的模型)
跨平台: 这些是示例配置。任何具有等效内存的设备(Mac、Linux、Windows)均可使用。设备群路由器支持所有平台。
DeepSeek提供各种尺寸的模型。选择适合您可用内存的模型——较小的模型在大多数任务中表现良好:
| 模型 | 最小内存 | 推荐硬件 |
|---|
| deepseek-r1:1.5b | 4GB | 任意Mac |
| deepseek-r1:7b |
8GB | Mac Mini M4(16GB) |
| deepseek-r1:14b | 12GB | Mac Mini M4(24GB) |
| deepseek-r1:32b | 24GB | Mac Mini M4 Pro(48GB) |
| deepseek-r1:70b | 48GB | Mac Studio M4 Max(128GB) |
| deepseek-coder-v2:16b | 12GB | Mac Mini M4(24GB) |
| deepseek-v3 | 256GB+ | Mac Studio M3 Ultra(512GB) |
设备群路由器自动将请求发送到加载了模型的机器——无需手动路由。
为什么在本地运行DeepSeek
- - 零成本——DeepSeek API按token收费。本地运行在硬件投入后完全免费。
- 隐私——代码和业务数据永远不会离开您的网络。
- 无速率限制——DeepSeek API在高峰时段会限流。本地运行无限制。
- 可用性——DeepSeek API曾出现过宕机。您的硬件不依赖于他们的服务器。
- 设备群路由——多台机器分担负载。一台繁忙?请求自动转到下一台。
设备群功能
- - 7信号评分——为每个请求选择最优节点
- 自动重试——透明地故障转移到下一个最佳节点
- VRAM感知回退——路由到同一类别中已加载的模型,而不是冷加载
- 上下文保护——防止因num_ctx变化导致昂贵的模型重新加载
- 请求标记——跟踪每个项目的DeepSeek使用情况
该设备群还支持
其他LLM模型
Llama 3.3、Qwen 3.5、Phi 4、Mistral、Gemma 3——任何Ollama模型都通过同一端点路由。
图像生成
bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:日落,width:1024,height:1024,steps:4}
语音转文字
bash
curl http://localhost:11435/api/transcribe -F audio=@recording.wav
嵌入向量
bash
curl http://localhost:11435/api/embeddings -d {model:nomic-embed-text,prompt:查询}
仪表盘
http://localhost:11435/dashboard——监控DeepSeek请求以及所有其他模型。每个模型的延迟、token吞吐量、健康检查。
完整文档
智能体设置指南
安全护栏
- - 模型下载需要明确的用户确认——DeepSeek模型范围从1GB(1.5B)到400GB+(671B)。拉取前务必确认。
- 模型删除需要明确的用户确认——未经询问绝不删除模型。
- 绝不删除或修改~/.fleet-manager/中的文件。
- 如果DeepSeek模型对于可用内存来说过大,建议使用较小的变体(例如,使用deepseek-r1:7b代替:70b)。
- 不会自动下载任何模型——所有拉取均由用户发起,或需要通过auto_pull设置选择加入。