Apple Silicon AI — Your Macs Are the Cluster
Turn your Mac Studio, Mac Mini, MacBook Pro, or Mac Pro into a local Apple Silicon AI fleet. One endpoint routes LLM inference, image generation, speech-to-text, and embeddings across every Apple Silicon device on your network.
No cloud APIs. No GPU rentals. No Docker. Your Apple Silicon M1/M2/M3/M4 chips with unified memory are already better inference hardware than most cloud instances — you just need software that treats them as an Apple Silicon fleet.
Why Apple Silicon for AI
Apple Silicon unified memory keeps the entire model in one address space — no PCIe bottleneck, no CPU-GPU transfer overhead. A Mac Studio with M4 Ultra and 256GB runs 120B parameter models that would need multiple NVIDIA A100s. That is the Apple Silicon advantage.
| Apple Silicon Chip | Unified Memory | LLM Sweet Spot | Apple Silicon Image Gen | Notes |
|---|
| M1 (8GB) | 8GB | 7B models | Slow | Entry-level Apple Silicon |
| M1 Pro/Max (32-64GB) |
32-64GB | 14B-32B | Capable | Apple Silicon MacBook Pro |
| M2 Ultra (192GB) | 192GB | 70B-120B | Fast | Apple Silicon Mac Studio/Pro |
| M3 Max (128GB) | 128GB | 70B | Fast | Latest Apple Silicon MacBook Pro |
| M4 Max (128GB) | 128GB | 70B | Fast | Apple Silicon Mac Studio, newest gen |
| M4 Ultra (256GB) | 256GB | 120B+ | Very fast | Apple Silicon Mac Studio/Pro, largest models |
Apple Silicon Fleet Setup
1. Install on every Apple Silicon Mac
CODEBLOCK0
2. Start the Apple Silicon router (pick one Mac)
CODEBLOCK1
3. Start the Apple Silicon node agent on every Mac
CODEBLOCK2
That's it. Apple Silicon nodes discover the router automatically on your local network. No IP addresses to configure, no config files. For explicit connection, use herd-node --router-url http://<router-ip>:11435.
How Apple Silicon routing works
CODEBLOCK3
The Apple Silicon router scores each device on 7 signals and routes every request to the best available Mac — thermal state, memory fit, queue depth, and more.
Apple Silicon LLM Inference
Run Llama, Qwen, DeepSeek, Phi, Mistral, Gemma, and any Ollama model across your Apple Silicon fleet.
OpenAI-compatible API (Apple Silicon backend)
CODEBLOCK4
Ollama-compatible API
CODEBLOCK5
Apple Silicon Python Client
CODEBLOCK6
Apple Silicon Image Generation (mflux)
Generate images using MLX-native Flux models. Runs natively on Apple Silicon — no CUDA, no cloud.
CODEBLOCK7
Apple Silicon image generation performance:
- - Mac Studio M4 Ultra: ~5s at 512px, ~14s at 1024px
- MacBook Pro M3 Max: ~7s at 512px, ~18s at 1024px
- Mac Mini M4: ~12s at 512px, ~30s at 1024px
Apple Silicon Speech-to-Text (Qwen ASR)
Transcribe audio locally on Apple Silicon using Qwen3-ASR via MLX. Meetings, voice notes, podcasts — no cloud, no Whisper API costs.
CODEBLOCK8
Supports WAV, MP3, M4A, FLAC. ~2s for a 30-second clip on Apple Silicon M4 Ultra.
Apple Silicon Embeddings
Embed documents across your Apple Silicon fleet using Ollama embedding models (nomic-embed-text, mxbai-embed-large, snowflake-arctic-embed).
CODEBLOCK9
Batch thousands of documents across Apple Silicon nodes instead of bottlenecking on one Mac.
Apple Silicon Fleet Monitoring
Dashboard
Open http://localhost:11435/dashboard — see every Apple Silicon Mac in your fleet: models loaded, queue depth, thermal state, memory usage, and health status.
Apple Silicon Fleet Status API
CODEBLOCK10
Returns every Apple Silicon node with hardware specs, loaded models, image/STT capabilities, and health metrics.
Apple Silicon Health Checks
CODEBLOCK11
15 automated checks: offline Apple Silicon nodes, memory pressure, thermal throttling, VRAM fallbacks, error rates, and more.
Recommended Models by Apple Silicon Hardware
| Your Apple Silicon Mac | RAM | Recommended models |
|---|
| Mac Mini (16GB) | 16GB | llama3.2:3b, phi4-mini, nomic-embed-text |
| Mac Mini (32GB) |
32GB | qwen3:14b, deepseek-r1:14b, llama3.3:8b |
| MacBook Pro (36-64GB) | 36-64GB | qwen3:32b, deepseek-r1:32b, codestral |
| Mac Studio (128GB) | 128GB | llama3.3:70b, qwen3:72b, deepseek-r1:70b |
| Mac Studio/Pro (192-256GB) | 192-256GB | qwen3:110b, deepseek-v3:236b (quantized) |
The Apple Silicon router's model recommender analyzes your fleet hardware and suggests the optimal model mix: GET /dashboard/api/model-recommendations.
Full documentation
Guardrails
- - No automatic downloads: Apple Silicon model pulls are always user-initiated and require explicit confirmation. Downloads range from 2GB to 70GB+ depending on model size.
- Model deletion requires confirmation: Never remove models from Apple Silicon nodes without explicit user approval.
- All Apple Silicon requests stay local: No data leaves your local network — all inference happens on your Apple Silicon Macs.
- No API keys: No accounts, no tokens, no cloud dependencies for your Apple Silicon fleet.
- No external network access: The Apple Silicon router and nodes communicate only on your local network. No telemetry, no cloud callbacks.
- Read-only local state: The only local files created are
~/.fleet-manager/latency.db (Apple Silicon routing metrics) and ~/.fleet-manager/logs/herd.jsonl (structured logs). Never delete or modify these files without user confirmation.
Apple Silicon AI — 你的Mac就是集群
将你的Mac Studio、Mac Mini、MacBook Pro或Mac Pro转变为一个本地Apple Silicon AI集群。一个端点即可将LLM推理、图像生成、语音转文本和嵌入任务路由到网络中的每一台Apple Silicon设备。
无需云API。无需租用GPU。无需Docker。你的Apple Silicon M1/M2/M3/M4芯片搭配统一内存,其推理硬件性能已超越大多数云实例——你只需要一款能将它们视为Apple Silicon集群的软件。
为什么选择Apple Silicon做AI
Apple Silicon统一内存将整个模型保存在一个地址空间中——没有PCIe瓶颈,没有CPU-GPU传输开销。搭载M4 Ultra和256GB内存的Mac Studio可以运行需要多块NVIDIA A100才能运行的120B参数模型。这就是Apple Silicon的优势。
| Apple Silicon芯片 | 统一内存 | LLM最佳适配 | Apple Silicon图像生成 | 备注 |
|---|
| M1 (8GB) | 8GB | 7B模型 | 慢 | 入门级Apple Silicon |
| M1 Pro/Max (32-64GB) |
32-64GB | 14B-32B | 可用 | Apple Silicon MacBook Pro |
| M2 Ultra (192GB) | 192GB | 70B-120B | 快 | Apple Silicon Mac Studio/Pro |
| M3 Max (128GB) | 128GB | 70B | 快 | 最新Apple Silicon MacBook Pro |
| M4 Max (128GB) | 128GB | 70B | 快 | Apple Silicon Mac Studio,最新一代 |
| M4 Ultra (256GB) | 256GB | 120B+ | 非常快 | Apple Silicon Mac Studio/Pro,最大模型 |
Apple Silicon集群设置
1. 在每台Apple Silicon Mac上安装
bash
pip install ollama-herd # Apple Silicon优化推理路由器
2. 启动Apple Silicon路由器(选择一台Mac)
bash
herd # 在端口11435上启动Apple Silicon路由器
3. 在每台Mac上启动Apple Silicon节点代理
bash
herd-node # Apple Silicon节点自动发现路由器
就这样。Apple Silicon节点会在本地网络上自动发现路由器。无需配置IP地址,无需配置文件。如需显式连接,请使用herd-node --router-url http://:11435。
Apple Silicon路由工作原理
MacBook Pro (M3 Max, 64GB) ─┐
Mac Mini (M4, 32GB) ├──→ Apple Silicon路由器 (:11435) ←── 你的应用
Mac Studio (M4 Ultra, 256GB) ─┘
Apple Silicon路由器根据7个信号对每台设备进行评分,并将每个请求路由到最佳可用Mac——热状态、内存适配度、队列深度等。
Apple Silicon LLM推理
在你的Apple Silicon集群上运行Llama、Qwen、DeepSeek、Phi、Mistral、Gemma以及任何Ollama模型。
OpenAI兼容API(Apple Silicon后端)
bash
curl http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {
model: llama3.3:70b,
messages: [{role: user, content: 解释Apple Silicon统一内存架构}]
}
Ollama兼容API
bash
curl http://localhost:11435/api/chat \
-d {model: qwen3:32b, messages: [{role: user, content: 比较Apple Silicon M4与M3在AI推理方面的表现}]}
Apple Silicon Python客户端
python
from openai import OpenAI
Apple Silicon推理客户端
apple
siliconclient = OpenAI(base
url=http://localhost:11435/v1, apikey=unused)
apple
siliconresponse = apple
siliconclient.chat.completions.create(
model=deepseek-r1:70b,
messages=[{role: user, content: 为Apple Silicon优化此函数}]
)
Apple Silicon图像生成(mflux)
使用MLX原生Flux模型生成图像。原生运行于Apple Silicon——无需CUDA,无需云端。
bash
curl http://localhost:11435/api/generate-image \
-d {prompt: Apple Silicon Mac Studio渲染AI艺术,照片级真实感, model: z-image-turbo, width: 512, height: 512}
Apple Silicon图像生成性能:
- - Mac Studio M4 Ultra:512px约5秒,1024px约14秒
- MacBook Pro M3 Max:512px约7秒,1024px约18秒
- Mac Mini M4:512px约12秒,1024px约30秒
Apple Silicon语音转文本(Qwen ASR)
使用通过MLX运行的Qwen3-ASR在Apple Silicon上本地转录音频。会议、语音笔记、播客——无需云端,无需Whisper API费用。
bash
curl http://localhost:11435/api/transcribe \
-F file=@applesiliconmeeting.wav \
-F model=qwen3-asr
支持WAV、MP3、M4A、FLAC格式。在Apple Silicon M4 Ultra上,30秒片段约需2秒。
Apple Silicon嵌入
使用Ollama嵌入模型(nomic-embed-text、mxbai-embed-large、snowflake-arctic-embed)在你的Apple Silicon集群上嵌入文档。
bash
curl http://localhost:11435/api/embed \
-d {model: nomic-embed-text, input: Apple Silicon统一内存架构用于AI推理}
跨Apple Silicon节点批量处理数千个文档,而不是在单台Mac上形成瓶颈。
Apple Silicon集群监控
仪表盘
打开http://localhost:11435/dashboard——查看集群中每台Apple Silicon Mac:加载的模型、队列深度、热状态、内存使用情况和健康状态。
Apple Silicon集群状态API
bash
curl http://localhost:11435/fleet/status
返回每个Apple Silicon节点的硬件规格、加载的模型、图像/STT能力和健康指标。
Apple Silicon健康检查
bash
curl http://localhost:11435/dashboard/api/health
15项自动检查:离线Apple Silicon节点、内存压力、热节流、VRAM回退、错误率等。
按Apple Silicon硬件推荐的模型
| 你的Apple Silicon Mac | 内存 | 推荐模型 |
|---|
| Mac Mini (16GB) | 16GB | llama3.2:3b, phi4-mini, nomic-embed-text |
| Mac Mini (32GB) |
32GB | qwen3:14b, deepseek-r1:14b, llama3.3:8b |
| MacBook Pro (36-64GB) | 36-64GB | qwen3:32b, deepseek-r1:32b, codestral |
| Mac Studio (128GB) | 128GB | llama3.3:70b, qwen3:72b, deepseek-r1:70b |
| Mac Studio/Pro (192-256GB) | 192-256GB | qwen3:110b, deepseek-v3:236b (量化) |
Apple Silicon路由器的模型推荐器会分析你的集群硬件并建议最佳模型组合:GET /dashboard/api/model-recommendations。
完整文档
- - 代理设置指南 — 所有4种模型类型的完整Apple Silicon设置
- 配置参考 — 所有44+个环境变量
- API参考 — 所有端点及请求/响应模式
- 故障排除 — 常见Apple Silicon问题及修复
安全护栏
- - 无自动下载:Apple Silicon模型拉取始终由用户发起,需要明确确认。下载大小从2GB到70GB+不等,取决于模型大小。
- 删除模型需要确认:未经用户明确批准,绝不从Apple Silicon节点移除模型。
- 所有Apple Silicon请求保持本地:无数据离开你的本地网络——所有推理都在你的Apple Silicon Mac上完成。
- 无需API密钥:你的Apple Silicon集群无需账户、无需令牌、无需云依赖。
- 无外部网络访问:Apple Silicon路由器和节点仅在你的本地网络上通信。无遥测、无云回调