Llama 3 — Run Meta's LLMs Across Your Local Fleet

The Llama family is the most widely deployed open-source LLM. This skill routes Llama requests across your devices — the fleet picks the best machine for every request automatically.

Supported Llama models

Model	Parameters	Ollama name	Best for
Llama 3.3	70B	INLINECODE0	Best overall — matches GPT-4o on most benchmarks
Llama 3.2

Quick start

CODEBLOCK0

No models are downloaded during installation. Models are pulled on demand when a request arrives, or manually via the dashboard. All pulls require user confirmation.

Use Llama through the fleet

OpenAI SDK (drop-in replacement)

CODEBLOCK1

curl (Ollama format)

CODEBLOCK2

curl (OpenAI format)

CODEBLOCK3

Which Llama model for your hardware

Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.

Pick the model that fits your available memory — smaller models work great for most tasks:

Model	Min RAM	Example hardware
INLINECODE4	2GB	Any Mac — even 8GB
INLINECODE5

The fleet router sends requests to the machine where the model is loaded. No manual routing needed.

Why run Llama locally

- Free after hardware — Meta's license allows commercial use with no per-token cost
Privacy — prompts and responses never leave your network
No rate limits — your hardware, your throughput
Fleet routing — multiple machines share the load automatically

See what's running

CODEBLOCK4

Monitor Llama performance

CODEBLOCK5

Web dashboard at http://localhost:11435/dashboard — live view of all nodes, queues, and models.

Also available on this fleet

Other LLM models

Qwen 3.5, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3, Codestral — any Ollama model routes through the same endpoint.

Image generation

CODEBLOCK6

Speech-to-text

CODEBLOCK7

Embeddings

CODEBLOCK8

Full documentation

- Agent Setup Guide — all 4 model types
API Reference — complete endpoint docs

Guardrails

- Model downloads require explicit user confirmation — Llama models range from 1GB (1B) to 230GB+ (405B). Always confirm before pulling.
Model deletion requires explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/.
If a model is too large for available memory, suggest a smaller variant.
No models are downloaded automatically — all pulls are user-initiated or require opt-in via the auto_pull setting.

Llama 3 — 在本地设备群中运行Meta的LLM

Llama系列是部署最广泛的开源大语言模型。本技能可将Llama请求路由到您的各台设备——设备群会自动为每个请求选择最佳机器。

支持的Llama模型

模型	参数规模	Ollama名称	最佳用途
Llama 3.3	70B	llama3.3:70b	综合最佳——在多数基准测试中媲美GPT-4o
Llama 3.2

快速开始

bash
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # 启动路由器（端口11435）
herd-node # 在每台设备上运行——自动发现路由器

安装时不会下载任何模型。模型会在请求到达时按需拉取，或通过仪表盘手动拉取。所有拉取操作均需用户确认。

通过设备群使用Llama

OpenAI SDK（即插即用替代）

python
from openai import OpenAI

client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)

response = client.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 解释Transformer架构}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)

curl（Ollama格式）

bash
curl http://localhost:11435/api/chat -d {
model: llama3.3:70b,
messages: [{role: user, content: 用Python写一个快速排序}],
stream: false
}

curl（OpenAI格式）

bash
curl http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model: llama3.2:3b, messages: [{role: user, content: 你好}]}

根据硬件选择Llama模型

跨平台： 以下为示例配置。任何具备同等内存的设备（Mac、Linux、Windows）均可使用。设备群路由器支持所有平台。

根据可用内存选择模型——较小模型在大多数任务中表现出色：

模型	最低内存	示例硬件
llama3.2:1b	2GB	任意Mac——甚至8GB机型
llama3.2:3b

设备群路由器会将请求发送到已加载模型的机器上。无需手动路由。

为何本地运行Llama

- 硬件之外免费——Meta的许可允许商业使用，无按token计费
隐私保护——提示词和响应永不离开您的网络
无速率限制——您的硬件，您的吞吐量
设备群路由——多台机器自动分担负载

查看运行状态

bash

当前内存中已加载的模型

curl -s http://localhost:11435/api/ps | python3 -m json.tool

设备群中所有可用模型

curl -s http://localhost:11435/api/tags | python3 -m json.tool

监控Llama性能

bash

近期请求追踪——查看延迟、token数、处理请求的节点

curl -s http://localhost:11435/dashboard/api/traces?limit=10 | python3 -m json.tool

设备群健康状态——15项自动化检查

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

Web仪表盘访问 http://localhost:11435/dashboard ——实时查看所有节点、队列和模型。

本设备群还支持

其他LLM模型

Qwen 3.5、DeepSeek-V3、DeepSeek-R1、Phi 4、Mistral、Gemma 3、Codestral——任何Ollama模型均可通过同一端点路由。

图像生成

bash
curl http://localhost:11435/api/generate-image \
-d {model: z-image-turbo, prompt: 山中的羊驼, width: 512, height: 512}

语音转文字

bash
curl http://localhost:11435/api/transcribe -F file=@recording.wav -F model=qwen3-asr

嵌入向量

bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: Meta Llama开源语言模型}

完整文档

- Agent设置指南 ——全部4种模型类型
API参考 ——完整端点文档

安全护栏

- 模型下载需明确用户确认——Llama模型大小从1GB（1B）到230GB+（405B）不等。拉取前务必确认。
模型删除需明确用户确认。
切勿删除或修改 ~/.fleet-manager/ 中的文件。
如果模型超出可用内存，建议使用较小版本。
不会自动下载任何模型——所有拉取均由用户发起，或需通过 auto_pull 设置选择加入。

llama-llama3Llama3本地运行

llama-llama3

Llama 3 — Run Meta's LLMs Across Your Local Fleet

Supported Llama models

Quick start

Use Llama through the fleet

OpenAI SDK (drop-in replacement)

curl (Ollama format)

curl (OpenAI format)

Which Llama model for your hardware

Why run Llama locally

See what's running

Monitor Llama performance

Also available on this fleet

Other LLM models

Image generation

Speech-to-text

Embeddings

Full documentation

Guardrails

Llama 3 — 在本地设备群中运行Meta的LLM

支持的Llama模型

快速开始

通过设备群使用Llama

OpenAI SDK（即插即用替代）

curl（Ollama格式）

curl（OpenAI格式）

根据硬件选择Llama模型

为何本地运行Llama

查看运行状态

当前内存中已加载的模型

设备群中所有可用模型

监控Llama性能

近期请求追踪——查看延迟、token数、处理请求的节点

设备群健康状态——15项自动化检查

本设备群还支持

其他LLM模型

图像生成

语音转文字

嵌入向量

完整文档

安全护栏

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement