Qwen — Run Qwen Models Across Your Local Fleet
Run Qwen3.5, Qwen3, Qwen3-Coder, and Qwen ASR on your own hardware. The fleet router picks the best device for every request — chat, code generation, and speech-to-text from one endpoint.
Supported Qwen models
LLM (Chat & Reasoning)
| Model | Parameters | Ollama name | Best for |
|---|
| Qwen3.5 | 0.8B–397B MoE | INLINECODE0 | Latest — multimodal, best reasoning |
| Qwen3 |
0.6B–235B MoE |
qwen3 | Competitive with GPT-4o |
|
Qwen2.5 | 0.5B–72B |
qwen2.5 | Proven, stable, multilingual |
Code Generation
| Model | Parameters | Ollama name | Best for |
|---|
| Qwen3-Coder | 30B MoE (3.3B active) | INLINECODE3 | Agentic coding workflows |
| Qwen2.5-Coder |
0.5B–32B |
qwen2.5-coder | Code — matches GPT-4o at 32B |
Speech-to-Text
| Model | Parameters | Tool | Best for |
|---|
| Qwen3-ASR | 0.6B–1.7B | INLINECODE5 | State-of-the-art local transcription |
Setup
CODEBLOCK0
For speech-to-text:
CODEBLOCK1
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Use Qwen through the fleet
OpenAI SDK
CODEBLOCK2
Qwen3-Coder for code
CODEBLOCK3
Qwen ASR for transcription
CODEBLOCK4
CODEBLOCK5
Ollama API
CODEBLOCK6
Hardware recommendations
Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.
| Model | Min RAM | Recommended hardware |
|---|
| INLINECODE6 | 2GB | Any Mac |
| INLINECODE7 |
8GB | Mac Mini M4 (16GB) |
|
qwen3.5:32b | 24GB | Mac Mini M4 Pro (48GB) |
|
qwen3.5:122b-a10b | 64GB | Mac Studio M4 Max (128GB) |
|
qwen3.5:397b-a17b | 256GB+ | Mac Studio M3 Ultra (512GB) |
|
qwen3-coder | 24GB | Mac Mini M4 Pro (48GB) |
|
qwen2.5-coder:32b | 24GB | Mac Mini M4 Pro (48GB) |
| Qwen3-ASR (0.6B) | 1.2GB | Any Mac |
| Qwen3-ASR (1.7B) | 3.4GB | Any Mac (8GB+) |
Why run Qwen locally
- - Zero cost — no per-token charges for Qwen API
- Privacy — Chinese and English content stays on your devices
- Full Qwen family — chat, code, reasoning, and speech-to-text from one fleet
- No rate limits — Alibaba Cloud throttles API access. Local runs unlimited
- Fleet routing — multiple machines share the load. The router picks the fastest available
The Qwen advantage on this fleet
Qwen models are uniquely suited for fleet routing:
- - MoE architecture — Qwen3.5 (397B total, 17B active) and Qwen3-Coder (30B total, 3.3B active) use Mixture of Experts. Only a fraction of parameters activate per request, making them fast despite large total size.
- Size variety — from 0.6B to 397B, there's a Qwen model for every device in your fleet. Small Macs run the small models, big Macs run the big ones.
- Code + Chat + STT — Qwen covers three modalities. One vendor, one fleet, three capabilities.
Also available on this fleet
Other LLM models
Llama 3.3, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3 — any Ollama model routes through the same endpoint.
Image generation
CODEBLOCK7
Embeddings
CODEBLOCK8
Dashboard
INLINECODE13 — monitor Qwen requests alongside all other models. Per-model latency, token throughput, error rates, health checks.
Full documentation
Agent Setup Guide
Guardrails
- - Never pull or delete Qwen models without user confirmation.
- Never delete or modify files in
~/.fleet-manager/. - If a Qwen model is too large for available memory, suggest a smaller variant or MoE version.
Qwen — 在本地集群中运行Qwen模型
在您自己的硬件上运行Qwen3.5、Qwen3、Qwen3-Coder和Qwen ASR。集群路由器为每个请求选择最佳设备——聊天、代码生成和语音转文本,统一端点。
支持的Qwen模型
大语言模型(聊天与推理)
| 模型 | 参数规模 | Ollama名称 | 最佳用途 |
|---|
| Qwen3.5 | 0.8B–397B MoE | qwen3.5 | 最新——多模态,最强推理 |
| Qwen3 |
0.6B–235B MoE | qwen3 | 与GPT-4o竞争 |
|
Qwen2.5 | 0.5B–72B | qwen2.5 | 成熟稳定,多语言 |
代码生成
| 模型 | 参数规模 | Ollama名称 | 最佳用途 |
|---|
| Qwen3-Coder | 30B MoE(3.3B激活) | qwen3-coder | 智能体编码工作流 |
| Qwen2.5-Coder |
0.5B–32B | qwen2.5-coder | 代码——32B版本匹配GPT-4o |
语音转文本
| 模型 | 参数规模 | 工具 | 最佳用途 |
|---|
| Qwen3-ASR | 0.6B–1.7B | mlx-qwen3-asr | 最先进的本地转录 |
设置
bash
pip install ollama-herd
herd # 启动路由器(端口11435)
herd-node # 在每台机器上运行
拉取Qwen模型
ollama pull qwen3.5:32b
ollama pull qwen3-coder
语音转文本:
bash
uv tool install mlx-qwen3-asr[serve] --python 3.14
curl -X POST http://localhost:11435/dashboard/api/settings \
-H Content-Type: application/json -d {transcription: true}
软件包:ollama-herd | 仓库:github.com/geeks-accelerator/ollama-herd
通过集群使用Qwen
OpenAI SDK
python
from openai import OpenAI
client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
Qwen3.5用于通用聊天
response = client.chat.completions.create(
model=qwen3.5:32b,
messages=[{role: user, content: 你好}],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content or , end=)
Qwen3-Coder用于代码
python
response = client.chat.completions.create(
model=qwen3-coder,
messages=[{role: user, content: 用FastAPI和SQLAlchemy写一个CRUD应用}],
)
print(response.choices[0].message.content)
Qwen ASR用于转录
bash
curl http://localhost:11435/api/transcribe -F audio=@meeting.wav
python
import httpx
def transcribe(audio_path):
with open(audio_path, rb) as f:
resp = httpx.post(
http://localhost:11435/api/transcribe,
files={audio: (audio_path, f)},
timeout=300.0,
)
resp.raiseforstatus()
return resp.json()[text]
Ollama API
bash
Qwen3.5聊天
curl http://localhost:11435/api/chat -d {
model: qwen3.5:32b,
messages: [{role: user, content: 解释一下Transformer}],
stream: false
}
Qwen2.5-Coder
curl http://localhost:11435/api/chat -d {
model: qwen2.5-coder:32b,
messages: [{role: user, content: 优化这个SQL查询:...}],
stream: false
}
硬件建议
跨平台: 以下为示例配置。任何具有等效内存的设备(Mac、Linux、Windows)均可使用。集群路由器支持所有平台。
| 模型 | 最低内存 | 推荐硬件 |
|---|
| qwen3.5:0.8b | 2GB | 任意Mac |
| qwen3.5:9b |
8GB | Mac Mini M4(16GB) |
| qwen3.5:32b | 24GB | Mac Mini M4 Pro(48GB) |
| qwen3.5:122b-a10b | 64GB | Mac Studio M4 Max(128GB) |
| qwen3.5:397b-a17b | 256GB+ | Mac Studio M3 Ultra(512GB) |
| qwen3-coder | 24GB | Mac Mini M4 Pro(48GB) |
| qwen2.5-coder:32b | 24GB | Mac Mini M4 Pro(48GB) |
| Qwen3-ASR(0.6B) | 1.2GB | 任意Mac |
| Qwen3-ASR(1.7B) | 3.4GB | 任意Mac(8GB+) |
为什么在本地运行Qwen
- - 零成本——无需为Qwen API按token付费
- 隐私——中英文内容保留在您的设备上
- 完整Qwen家族——聊天、代码、推理和语音转文本,统一集群
- 无速率限制——阿里云限制API访问。本地运行无限制
- 集群路由——多台机器分担负载。路由器选择最快可用设备
Qwen在此集群上的优势
Qwen模型特别适合集群路由:
- - MoE架构——Qwen3.5(总计397B,激活17B)和Qwen3-Coder(总计30B,激活3.3B)使用混合专家模型。每次请求仅激活部分参数,尽管总规模大但速度快。
- 规模多样性——从0.6B到397B,集群中每台设备都有对应的Qwen模型。小型Mac运行小模型,大型Mac运行大模型。
- 代码+聊天+语音转文本——Qwen覆盖三种模态。一个供应商,一个集群,三种能力。
此集群还提供
其他大语言模型
Llama 3.3、DeepSeek-V3、DeepSeek-R1、Phi 4、Mistral、Gemma 3——任何Ollama模型都通过同一端点路由。
图像生成
bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:日落,width:1024,height:1024,steps:4}
嵌入
bash
curl http://localhost:11435/api/embeddings -d {model:nomic-embed-text,prompt:查询}
仪表盘
http://localhost:11435/dashboard——监控Qwen请求以及所有其他模型。每个模型的延迟、token吞吐量、错误率、健康检查。
完整文档
智能体设置指南
安全限制
- - 未经用户确认,绝不拉取或删除Qwen模型。
- 绝不删除或修改~/.fleet-manager/中的文件。
- 如果Qwen模型对于可用内存过大,建议使用更小的变体或MoE版本。