Gemma 3 — Run Google's Open Models Across Your Fleet
Gemma 3 is Google's most capable open-source LLM family. 128K context window, strong coding performance, multilingual support across 140+ languages. The fleet router picks the best device for every request — no manual load balancing.
Supported Gemma models
| Model | Parameters | Ollama name | Best for |
|---|
| Gemma 3 27B | 27B | INLINECODE0 | Highest quality — rivals much larger models |
| Gemma 3 12B |
12B |
gemma3:12b | Balanced quality and speed |
|
Gemma 3 4B | 4B |
gemma3:4b | Fast, runs on low-RAM devices |
|
Gemma 3 1B | 1B |
gemma3:1b | Ultra-light, instant responses |
|
CodeGemma 7B | 7B |
codegemma | Code-focused variant |
Quick start
CODEBLOCK0
No models are downloaded during installation. Models are pulled on demand when a request arrives, or manually via the dashboard. All pulls require user confirmation.
Use Gemma through the fleet
OpenAI SDK (drop-in replacement)
CODEBLOCK1
Code generation with CodeGemma
CODEBLOCK2
curl (Ollama format)
CODEBLOCK3
curl (OpenAI format)
CODEBLOCK4
Which Gemma for your hardware
Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.
| Device | RAM | Best Gemma model |
|---|
| MacBook Air (8GB) | 8GB | INLINECODE5 — instant responses |
| Mac Mini (16GB) |
16GB |
gemma3:4b — strong for its size |
| Mac Mini (24GB) | 24GB |
gemma3:12b — great balance |
| MacBook Pro (36GB) | 36GB |
gemma3:27b — full power |
| Mac Studio (64GB+) | 64GB+ |
gemma3:27b +
codegemma simultaneously |
Why Gemma locally
- - 128K context — process entire codebases and long documents
- 140+ languages — multilingual without switching models
- Google quality, zero cost — no per-token charges after hardware
- Privacy — all data stays on your network
- Fleet routing — multiple machines share the load
Check what's running
CODEBLOCK5
Web dashboard at http://localhost:11435/dashboard — live monitoring.
Also available on this fleet
Other LLMs
Llama 3.3, Qwen 3.5, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Codestral — same endpoint.
Image generation
CODEBLOCK6
Speech-to-text
CODEBLOCK7
Embeddings
CODEBLOCK8
Full documentation
Contribute
Ollama Herd is open source (MIT). Stars, issues, and PRs welcome — from humans and AI agents alike:
- - GitHub — 444 tests, fully async,
CLAUDE.md makes AI agents productive instantly - Found a bug? Open an issue
- Want to add a feature? Fork, branch, PR — the test suite runs in under 40 seconds
Guardrails
- - Model downloads require explicit user confirmation — Gemma models range from 1GB (1B) to 16GB (27B).
- Model deletion requires explicit user confirmation.
- Never delete or modify files in
~/.fleet-manager/. - No models are downloaded automatically — all pulls are user-initiated or require opt-in via
auto_pull.
Gemma 3 — 在你的设备集群中运行谷歌开源模型
Gemma 3 是谷歌最强大的开源大语言模型系列。支持128K上下文窗口,强大的编码性能,覆盖140多种语言的多语言支持。集群路由器会为每个请求自动选择最佳设备——无需手动负载均衡。
支持的Gemma模型
| 模型 | 参数量 | Ollama名称 | 最佳用途 |
|---|
| Gemma 3 27B | 270亿 | gemma3:27b | 最高质量——可与更大模型媲美 |
| Gemma 3 12B |
120亿 | gemma3:12b | 质量与速度均衡 |
|
Gemma 3 4B | 40亿 | gemma3:4b | 快速,可在低内存设备上运行 |
|
Gemma 3 1B | 10亿 | gemma3:1b | 超轻量,即时响应 |
|
CodeGemma 7B | 70亿 | codegemma | 专注代码的变体 |
快速开始
bash
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # 启动路由器(端口11435)
herd-node # 在每个设备上运行——自动发现路由器
安装过程中不会下载任何模型。模型会在请求到达时按需拉取,或通过仪表盘手动拉取。所有拉取操作都需要用户确认。
通过集群使用Gemma
OpenAI SDK(即插即用替代方案)
python
from openai import OpenAI
client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
Gemma 3 27B 用于复杂推理
response = client.chat.completions.create(
model=gemma3:27b,
messages=[{role: user, content: 向10岁孩子解释量子纠缠}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)
使用CodeGemma生成代码
python
response = client.chat.completions.create(
model=codegemma,
messages=[{role: user, content: 用Rust编写一个包含插入、删除和搜索功能的二叉搜索树}],
)
print(response.choices[0].message.content)
curl(Ollama格式)
bash
Gemma 3 27B
curl http://localhost:11435/api/chat -d {
model: gemma3:27b,
messages: [{role: user, content: 翻译成日语:今天天气真好}],
stream: false
}
curl(OpenAI格式)
bash
curl http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model: gemma3:4b, messages: [{role: user, content: 你好}]}
为你的硬件选择合适的Gemma
跨平台: 以下为示例配置。任何具有同等内存的设备(Mac、Linux、Windows)均可运行。集群路由器支持所有平台。
| 设备 | 内存 | 最佳Gemma模型 |
|---|
| MacBook Air(8GB) | 8GB | gemma3:1b — 即时响应 |
| Mac Mini(16GB) |
16GB | gemma3:4b — 同尺寸中表现强劲 |
| Mac Mini(24GB) | 24GB | gemma3:12b — 极佳平衡 |
| MacBook Pro(36GB) | 36GB | gemma3:27b — 全功率 |
| Mac Studio(64GB+) | 64GB+ | gemma3:27b + codegemma 同时运行 |
为什么在本地运行Gemma
- - 128K上下文 — 处理整个代码库和长文档
- 140+种语言 — 无需切换模型即可支持多语言
- 谷歌品质,零成本 — 硬件之外无按token计费
- 隐私保护 — 所有数据保留在你的网络中
- 集群路由 — 多台机器分担负载
查看运行状态
bash
已加载到内存中的模型
curl -s http://localhost:11435/api/ps | python3 -m json.tool
集群健康状态
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
Web仪表盘地址:http://localhost:11435/dashboard — 实时监控。
该集群还提供以下功能
其他大语言模型
Llama 3.3、Qwen 3.5、DeepSeek-V3、DeepSeek-R1、Phi 4、Mistral、Codestral — 使用同一端点。
图像生成
bash
curl -o image.png http://localhost:11435/api/generate-image \
-d {model: z-image-turbo, prompt: 一颗捕捉光线的宝石, width: 1024, height: 1024}
语音转文字
bash
curl http://localhost:11435/api/transcribe -F file=@meeting.wav -F model=qwen3-asr
嵌入向量
bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: Google Gemma开源语言模型}
完整文档
贡献
Ollama Herd是开源项目(MIT协议)。欢迎来自人类和AI代理的Star、Issue和PR:
- - GitHub — 444个测试,完全异步,CLAUDE.md让AI代理立即高效工作
- 发现Bug?提交Issue
- 想添加功能?Fork、分支、PR — 测试套件运行时间不到40秒
安全护栏
- - 模型下载需要用户明确确认 — Gemma模型大小从1GB(1B)到16GB(27B)不等。
- 模型删除需要用户明确确认。
- 切勿删除或修改~/.fleet-manager/目录中的文件。
- 不会自动下载任何模型——所有拉取操作均由用户发起,或需要通过auto_pull选择加入。