Llama 3 — Run Meta's LLMs Across Your Local Fleet
The Llama family is the most widely deployed open-source LLM. This skill routes Llama requests across your devices — the fleet picks the best machine for every request automatically.
Supported Llama models
| Model | Parameters | Ollama name | Best for |
|---|
| Llama 3.3 | 70B | INLINECODE0 | Best overall — matches GPT-4o on most benchmarks |
| Llama 3.2 |
1B, 3B |
llama3.2:3b | Fast responses on low-RAM devices |
|
Llama 3.1 | 8B, 70B, 405B |
llama3.1:70b | Proven workhorse, massive community |
|
Llama 3 | 8B, 70B |
llama3:70b | Original release, still widely used |
Quick start
CODEBLOCK0
No models are downloaded during installation. Models are pulled on demand when a request arrives, or manually via the dashboard. All pulls require user confirmation.
Use Llama through the fleet
OpenAI SDK (drop-in replacement)
CODEBLOCK1
curl (Ollama format)
CODEBLOCK2
curl (OpenAI format)
CODEBLOCK3
Which Llama model for your hardware
Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.
Pick the model that fits your available memory — smaller models work great for most tasks:
| Model | Min RAM | Example hardware |
|---|
| INLINECODE4 | 2GB | Any Mac — even 8GB |
| INLINECODE5 |
4GB | Mac Mini (16GB) |
|
llama3:8b | 8GB | Mac Mini (16GB) |
|
llama3.3:70b | 48GB | Mac Studio M4 Max (128GB) |
|
llama3.1:405b | 256GB+ | Mac Studio M4 Ultra (256GB) or distributed |
The fleet router sends requests to the machine where the model is loaded. No manual routing needed.
Why run Llama locally
- - Free after hardware — Meta's license allows commercial use with no per-token cost
- Privacy — prompts and responses never leave your network
- No rate limits — your hardware, your throughput
- Fleet routing — multiple machines share the load automatically
See what's running
CODEBLOCK4
Monitor Llama performance
CODEBLOCK5
Web dashboard at http://localhost:11435/dashboard — live view of all nodes, queues, and models.
Also available on this fleet
Other LLM models
Qwen 3.5, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3, Codestral — any Ollama model routes through the same endpoint.
Image generation
CODEBLOCK6
Speech-to-text
CODEBLOCK7
Embeddings
CODEBLOCK8
Full documentation
Guardrails
- - Model downloads require explicit user confirmation — Llama models range from 1GB (1B) to 230GB+ (405B). Always confirm before pulling.
- Model deletion requires explicit user confirmation.
- Never delete or modify files in
~/.fleet-manager/. - If a model is too large for available memory, suggest a smaller variant.
- No models are downloaded automatically — all pulls are user-initiated or require opt-in via the
auto_pull setting.
Llama 3 — 在本地设备群中运行Meta的LLM
Llama系列是部署最广泛的开源大语言模型。本技能可将Llama请求路由到您的各台设备——设备群会自动为每个请求选择最佳机器。
支持的Llama模型
| 模型 | 参数规模 | Ollama名称 | 最佳用途 |
|---|
| Llama 3.3 | 70B | llama3.3:70b | 综合最佳——在多数基准测试中媲美GPT-4o |
| Llama 3.2 |
1B, 3B | llama3.2:3b | 低内存设备快速响应 |
|
Llama 3.1 | 8B, 70B, 405B | llama3.1:70b | 久经考验的主力模型,社区庞大 |
|
Llama 3 | 8B, 70B | llama3:70b | 原始版本,仍广泛使用 |
快速开始
bash
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # 启动路由器(端口11435)
herd-node # 在每台设备上运行——自动发现路由器
安装时不会下载任何模型。模型会在请求到达时按需拉取,或通过仪表盘手动拉取。所有拉取操作均需用户确认。
通过设备群使用Llama
OpenAI SDK(即插即用替代)
python
from openai import OpenAI
client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
response = client.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 解释Transformer架构}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)
curl(Ollama格式)
bash
curl http://localhost:11435/api/chat -d {
model: llama3.3:70b,
messages: [{role: user, content: 用Python写一个快速排序}],
stream: false
}
curl(OpenAI格式)
bash
curl http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model: llama3.2:3b, messages: [{role: user, content: 你好}]}
根据硬件选择Llama模型
跨平台: 以下为示例配置。任何具备同等内存的设备(Mac、Linux、Windows)均可使用。设备群路由器支持所有平台。
根据可用内存选择模型——较小模型在大多数任务中表现出色:
| 模型 | 最低内存 | 示例硬件 |
|---|
| llama3.2:1b | 2GB | 任意Mac——甚至8GB机型 |
| llama3.2:3b |
4GB | Mac Mini(16GB) |
| llama3:8b | 8GB | Mac Mini(16GB) |
| llama3.3:70b | 48GB | Mac Studio M4 Max(128GB) |
| llama3.1:405b | 256GB+ | Mac Studio M4 Ultra(256GB)或分布式部署 |
设备群路由器会将请求发送到已加载模型的机器上。无需手动路由。
为何本地运行Llama
- - 硬件之外免费——Meta的许可允许商业使用,无按token计费
- 隐私保护——提示词和响应永不离开您的网络
- 无速率限制——您的硬件,您的吞吐量
- 设备群路由——多台机器自动分担负载
查看运行状态
bash
当前内存中已加载的模型
curl -s http://localhost:11435/api/ps | python3 -m json.tool
设备群中所有可用模型
curl -s http://localhost:11435/api/tags | python3 -m json.tool
监控Llama性能
bash
近期请求追踪——查看延迟、token数、处理请求的节点
curl -s http://localhost:11435/dashboard/api/traces?limit=10 | python3 -m json.tool
设备群健康状态——15项自动化检查
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
Web仪表盘访问 http://localhost:11435/dashboard ——实时查看所有节点、队列和模型。
本设备群还支持
其他LLM模型
Qwen 3.5、DeepSeek-V3、DeepSeek-R1、Phi 4、Mistral、Gemma 3、Codestral——任何Ollama模型均可通过同一端点路由。
图像生成
bash
curl http://localhost:11435/api/generate-image \
-d {model: z-image-turbo, prompt: 山中的羊驼, width: 512, height: 512}
语音转文字
bash
curl http://localhost:11435/api/transcribe -F file=@recording.wav -F model=qwen3-asr
嵌入向量
bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: Meta Llama开源语言模型}
完整文档
安全护栏
- - 模型下载需明确用户确认——Llama模型大小从1GB(1B)到230GB+(405B)不等。拉取前务必确认。
- 模型删除需明确用户确认。
- 切勿删除或修改 ~/.fleet-manager/ 中的文件。
- 如果模型超出可用内存,建议使用较小版本。
- 不会自动下载任何模型——所有拉取均由用户发起,或需通过 auto_pull 设置选择加入。