MLX Local AI — Apple's ML Framework Powers Your Entire Fleet

Everything in this fleet runs on Apple's MLX framework. LLM inference, image generation, speech-to-text, embeddings — all MLX-native, all optimized for Apple Silicon's unified memory architecture.

The MLX stack

Capability	Tool	MLX usage
LLM inference	Ollama	MLX backend for model loading and inference on Apple Silicon
Image gen (Flux)

One router. One framework. Four modalities. All local.

Setup

CODEBLOCK0

All tools leverage MLX for Metal-accelerated inference on Apple Silicon's GPU cores.

LLM inference via MLX

Ollama runs models using MLX on Apple Silicon. Unified memory means the entire model stays in one address space — no PCIe bottleneck.

CODEBLOCK1

Image generation via MLX

Both mflux and DiffusionKit are pure MLX implementations — no PyTorch, no CUDA.

CODEBLOCK2

Speech-to-text via MLX

Qwen3-ASR transcribes audio using MLX acceleration.

CODEBLOCK3

Embeddings via MLX

Ollama embedding models run on the MLX backend.

CODEBLOCK4

Why MLX matters for local AI

- Unified memory — model weights, activations, and KV cache share one memory pool. No CPU-GPU transfer overhead.
Metal acceleration — MLX compiles to Metal shaders that run on Apple Silicon GPU cores (up to 80 on M3/M4 Ultra).
Lazy evaluation — MLX only computes what's needed, reducing memory pressure.
Dynamic shapes — no recompilation when input sizes change (unlike some CUDA frameworks).
Apple-maintained — MLX is developed by Apple's ML research team, optimized for every chip generation.

Fleet performance on Apple Silicon

Chip	GPU Cores	Memory	LLM Sweet Spot	Image Gen
M1	8	8-16GB	3-7B models	Slow
M2 Pro

19 | 32GB | 14B models | Capable | | M3 Max | 40 | 128GB | 70B models | Fast | | M4 Ultra | 80 | 256GB | 120B+ models | Very fast |

Monitor your MLX fleet

CODEBLOCK5

Dashboard at http://localhost:11435/dashboard — see every node, every model, every queue in real time.

Full documentation

- Agent Setup Guide — all 4 model types
Image Generation Guide — 3 backends
API Reference

Contribute

Ollama Herd is open source (MIT) and built on the MLX ecosystem. We welcome contributions:

- Star on GitHub — helps others discover the project
Open an issue — bug reports, feature requests, questions
AI agents welcome — CLAUDE.md provides full architectural context. Fork, branch, PR.
444 tests, async Python, runs in under 40 seconds. Hard to break things.

Guardrails

- No automatic downloads — all model pulls require explicit user confirmation.
Model deletion requires explicit user confirmation.
All requests stay local — no data leaves your network.
Never delete or modify files in ~/.fleet-manager/.

MLX 本地AI — 苹果ML框架驱动你的整个集群

该集群中的所有设备均运行在苹果的 MLX框架之上。大语言模型推理、图像生成、语音转文本、嵌入向量——全部原生支持MLX，并针对Apple Silicon的统一内存架构进行了优化。

MLX技术栈

能力	工具	MLX用途
大语言模型推理	Ollama	在Apple Silicon上用于模型加载和推理的MLX后端
图像生成（Flux）

一个路由器。一个框架。四种模态。全部本地运行。

安装设置

bash
pip install ollama-herd # PyPI: https://pypi.org/project/ollama-herd/
herd # 启动路由器（端口11435）
herd-node # 在每个设备上运行——自动发现路由器

安装图像生成后端

uv tool install mflux # Flux模型（512px约7秒） uv tool install diffusionkit # Stable Diffusion 3/3.5

所有工具均利用MLX在Apple Silicon的GPU核心上实现Metal加速推理。

通过MLX进行大语言模型推理

Ollama在Apple Silicon上使用MLX运行模型。统一内存意味着整个模型驻留在单一地址空间中——无需PCIe瓶颈。

python
from openai import OpenAI

client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
response = client.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 解释MLX统一内存}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)

通过MLX进行图像生成

mflux和DiffusionKit都是纯MLX实现——无需PyTorch，无需CUDA。

bash

通过mflux生成Flux图像（最快）

curl -o flux.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model: z-image-turbo, prompt: 神经网络可视化, width: 1024, height: 1024}

通过DiffusionKit生成Stable Diffusion 3图像

curl -o sd3.png http://localhost:11435/api/generate-image \ -H Content-Type: application/json \ -d {model: sd3-medium, prompt: 电路板景观, width: 1024, height: 1024, steps: 20}

通过MLX进行语音转文本

Qwen3-ASR利用MLX加速进行音频转录。

bash
curl http://localhost:11435/api/transcribe \
-F file=@meeting.wav \
-F model=qwen3-asr

通过MLX生成嵌入向量

Ollama嵌入模型在MLX后端上运行。

bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: 用于机器学习的苹果MLX框架}

MLX对本地AI的重要性

- 统一内存 — 模型权重、激活值和KV缓存共享同一内存池。无CPU-GPU传输开销。
Metal加速 — MLX编译为在Apple Silicon GPU核心上运行的Metal着色器（M3/M4 Ultra上最多80个核心）。
惰性求值 — MLX仅计算所需内容，减少内存压力。
动态形状 — 输入大小变化时无需重新编译（与某些CUDA框架不同）。
苹果维护 — MLX由苹果的ML研究团队开发，针对每一代芯片进行了优化。

Apple Silicon上的集群性能

芯片	GPU核心数	内存	大语言模型最佳点	图像生成
M1	8	8-16GB	3-7B模型	慢
M2 Pro

19 | 32GB | 14B模型 | 可用 | | M3 Max | 40 | 128GB | 70B模型 | 快 | | M4 Ultra | 80 | 256GB | 120B+模型 | 非常快 |

监控你的MLX集群

bash

集群概览

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

基于硬件的模型推荐

curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

健康检查

curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

仪表板位于 http://localhost:11435/dashboard — 实时查看每个节点、每个模型、每个队列。

完整文档

- Agent设置指南 — 全部4种模型类型
图像生成指南 — 3个后端
API参考

贡献

Ollama Herd是开源项目（MIT许可），构建于MLX生态系统之上。我们欢迎贡献：

- 在GitHub上标星 — 帮助他人发现该项目
提交问题 — 错误报告、功能请求、疑问
欢迎AI代理 — CLAUDE.md 提供完整的架构上下文。Fork、分支、PR。
444个测试，异步Python，40秒内运行完成。难以破坏。

安全护栏

- 无自动下载 — 所有模型拉取均需用户明确确认。
模型删除需要用户明确确认。
所有请求保持本地 — 无数据离开你的网络。
切勿删除或修改 ~/.fleet-manager/ 中的文件。

mlx-apple-silicon-mlxMLX苹果芯片

mlx-apple-silicon-mlx

MLX Local AI — Apple's ML Framework Powers Your Entire Fleet

The MLX stack

Setup

LLM inference via MLX

Image generation via MLX

Speech-to-text via MLX

Embeddings via MLX

Why MLX matters for local AI

Fleet performance on Apple Silicon

Monitor your MLX fleet

Full documentation

Contribute

Guardrails

MLX 本地AI — 苹果ML框架驱动你的整个集群

MLX技术栈

安装设置

安装图像生成后端

通过MLX进行大语言模型推理

通过MLX进行图像生成

通过mflux生成Flux图像（最快）

通过DiffusionKit生成Stable Diffusion 3图像

通过MLX进行语音转文本

通过MLX生成嵌入向量

MLX对本地AI的重要性

Apple Silicon上的集群性能

监控你的MLX集群

集群概览

基于硬件的模型推荐

健康检查

完整文档

贡献

安全护栏

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

mlx-apple-silicon-mlxMLX苹果芯片

mlx-apple-silicon-mlx

MLX Local AI — Apple's ML Framework Powers Your Entire Fleet

The MLX stack

Setup

LLM inference via MLX

Image generation via MLX

Speech-to-text via MLX

Embeddings via MLX

Why MLX matters for local AI

Fleet performance on Apple Silicon

Monitor your MLX fleet

Full documentation

Contribute

Guardrails

MLX 本地AI — 苹果ML框架驱动你的整个集群

MLX技术栈

安装设置

安装图像生成后端

通过MLX进行大语言模型推理

通过MLX进行图像生成

通过mflux生成Flux图像（最快）

通过DiffusionKit生成Stable Diffusion 3图像

通过MLX进行语音转文本

通过MLX生成嵌入向量

MLX对本地AI的重要性

Apple Silicon上的集群性能

监控你的MLX集群

集群概览

基于硬件的模型推荐

健康检查

完整文档

贡献

安全护栏

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement