Qwen — Run Qwen Models Across Your Local Fleet

Run Qwen3.5, Qwen3, Qwen3-Coder, and Qwen ASR on your own hardware. The fleet router picks the best device for every request — chat, code generation, and speech-to-text from one endpoint.

Supported Qwen models

LLM (Chat & Reasoning)

Model	Parameters	Ollama name	Best for
Qwen3.5	0.8B–397B MoE	INLINECODE0	Latest — multimodal, best reasoning
Qwen3

Code Generation

Model	Parameters	Ollama name	Best for
Qwen3-Coder	30B MoE (3.3B active)	INLINECODE3	Agentic coding workflows
Qwen2.5-Coder

0.5B–32B | qwen2.5-coder | Code — matches GPT-4o at 32B |

Speech-to-Text

Model	Parameters	Tool	Best for
Qwen3-ASR	0.6B–1.7B	INLINECODE5	State-of-the-art local transcription

Setup

CODEBLOCK0

For speech-to-text:

CODEBLOCK1

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Use Qwen through the fleet

OpenAI SDK

CODEBLOCK2

Qwen3-Coder for code

CODEBLOCK3

Qwen ASR for transcription

CODEBLOCK4

CODEBLOCK5

Ollama API

CODEBLOCK6

Hardware recommendations

Cross-platform: These are example configurations. Any device (Mac, Linux, Windows) with equivalent RAM works. The fleet router runs on all platforms.

Model	Min RAM	Recommended hardware
INLINECODE6	2GB	Any Mac
INLINECODE7

Why run Qwen locally

- Zero cost — no per-token charges for Qwen API
Privacy — Chinese and English content stays on your devices
Full Qwen family — chat, code, reasoning, and speech-to-text from one fleet
No rate limits — Alibaba Cloud throttles API access. Local runs unlimited
Fleet routing — multiple machines share the load. The router picks the fastest available

The Qwen advantage on this fleet

Qwen models are uniquely suited for fleet routing:

- MoE architecture — Qwen3.5 (397B total, 17B active) and Qwen3-Coder (30B total, 3.3B active) use Mixture of Experts. Only a fraction of parameters activate per request, making them fast despite large total size.
Size variety — from 0.6B to 397B, there's a Qwen model for every device in your fleet. Small Macs run the small models, big Macs run the big ones.
Code + Chat + STT — Qwen covers three modalities. One vendor, one fleet, three capabilities.

Also available on this fleet

Other LLM models

Llama 3.3, DeepSeek-V3, DeepSeek-R1, Phi 4, Mistral, Gemma 3 — any Ollama model routes through the same endpoint.

Image generation

CODEBLOCK7

Embeddings

CODEBLOCK8

Dashboard

INLINECODE13 — monitor Qwen requests alongside all other models. Per-model latency, token throughput, error rates, health checks.

Full documentation

Agent Setup Guide

Guardrails

- Never pull or delete Qwen models without user confirmation.
Never delete or modify files in ~/.fleet-manager/.
If a Qwen model is too large for available memory, suggest a smaller variant or MoE version.

Qwen — 在本地集群中运行Qwen模型

在您自己的硬件上运行Qwen3.5、Qwen3、Qwen3-Coder和Qwen ASR。集群路由器为每个请求选择最佳设备——聊天、代码生成和语音转文本，统一端点。

支持的Qwen模型

大语言模型（聊天与推理）

模型	参数规模	Ollama名称	最佳用途
Qwen3.5	0.8B–397B MoE	qwen3.5	最新——多模态，最强推理
Qwen3

代码生成

模型	参数规模	Ollama名称	最佳用途
Qwen3-Coder	30B MoE（3.3B激活）	qwen3-coder	智能体编码工作流
Qwen2.5-Coder

0.5B–32B | qwen2.5-coder | 代码——32B版本匹配GPT-4o |

语音转文本

模型	参数规模	工具	最佳用途
Qwen3-ASR	0.6B–1.7B	mlx-qwen3-asr	最先进的本地转录

设置

bash
pip install ollama-herd
herd # 启动路由器（端口11435）
herd-node # 在每台机器上运行

拉取Qwen模型

ollama pull qwen3.5:32b ollama pull qwen3-coder

语音转文本：

bash
uv tool install mlx-qwen3-asr[serve] --python 3.14
curl -X POST http://localhost:11435/dashboard/api/settings \
-H Content-Type: application/json -d {transcription: true}

软件包：ollama-herd | 仓库：github.com/geeks-accelerator/ollama-herd

通过集群使用Qwen

OpenAI SDK

python
from openai import OpenAI

client = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)

Qwen3.5用于通用聊天

response = client.chat.completions.create( model=qwen3.5:32b, messages=[{role: user, content: 你好}], stream=True, ) for chunk in response: print(chunk.choices[0].delta.content or , end=)

Qwen3-Coder用于代码

python
response = client.chat.completions.create(
model=qwen3-coder,
messages=[{role: user, content: 用FastAPI和SQLAlchemy写一个CRUD应用}],
)
print(response.choices[0].message.content)

Qwen ASR用于转录

bash
curl http://localhost:11435/api/transcribe -F audio=@meeting.wav

python
import httpx

def transcribe(audio_path):
with open(audio_path, rb) as f:
resp = httpx.post(
http://localhost:11435/api/transcribe,
files={audio: (audio_path, f)},
timeout=300.0,
)
resp.raiseforstatus()
return resp.json()[text]

Ollama API

bash

Qwen3.5聊天

curl http://localhost:11435/api/chat -d {
model: qwen3.5:32b,
messages: [{role: user, content: 解释一下Transformer}],
stream: false
}

Qwen2.5-Coder

curl http://localhost:11435/api/chat -d { model: qwen2.5-coder:32b, messages: [{role: user, content: 优化这个SQL查询：...}], stream: false }

硬件建议

跨平台： 以下为示例配置。任何具有等效内存的设备（Mac、Linux、Windows）均可使用。集群路由器支持所有平台。

模型	最低内存	推荐硬件
qwen3.5:0.8b	2GB	任意Mac
qwen3.5:9b

为什么在本地运行Qwen

- 零成本——无需为Qwen API按token付费
隐私——中英文内容保留在您的设备上
完整Qwen家族——聊天、代码、推理和语音转文本，统一集群
无速率限制——阿里云限制API访问。本地运行无限制
集群路由——多台机器分担负载。路由器选择最快可用设备

Qwen在此集群上的优势

Qwen模型特别适合集群路由：

- MoE架构——Qwen3.5（总计397B，激活17B）和Qwen3-Coder（总计30B，激活3.3B）使用混合专家模型。每次请求仅激活部分参数，尽管总规模大但速度快。
规模多样性——从0.6B到397B，集群中每台设备都有对应的Qwen模型。小型Mac运行小模型，大型Mac运行大模型。
代码+聊天+语音转文本——Qwen覆盖三种模态。一个供应商，一个集群，三种能力。

此集群还提供

其他大语言模型

Llama 3.3、DeepSeek-V3、DeepSeek-R1、Phi 4、Mistral、Gemma 3——任何Ollama模型都通过同一端点路由。

图像生成

bash
curl -o image.png http://localhost:11435/api/generate-image \
-H Content-Type: application/json \
-d {model:z-image-turbo,prompt:日落,width:1024,height:1024,steps:4}

嵌入

bash
curl http://localhost:11435/api/embeddings -d {model:nomic-embed-text,prompt:查询}

仪表盘

http://localhost:11435/dashboard——监控Qwen请求以及所有其他模型。每个模型的延迟、token吞吐量、错误率、健康检查。

完整文档

智能体设置指南

安全限制

- 未经用户确认，绝不拉取或删除Qwen模型。
绝不删除或修改~/.fleet-manager/中的文件。
如果Qwen模型对于可用内存过大，建议使用更小的变体或MoE版本。

qwen-qwen3通义千问3

qwen-qwen3

Qwen — Run Qwen Models Across Your Local Fleet

Supported Qwen models

LLM (Chat & Reasoning)

Code Generation

Speech-to-Text

Setup

Use Qwen through the fleet

OpenAI SDK

Qwen3-Coder for code

Qwen ASR for transcription

Ollama API

Hardware recommendations

Why run Qwen locally

The Qwen advantage on this fleet

Also available on this fleet

Other LLM models

Image generation

Embeddings

Dashboard

Full documentation

Guardrails

Qwen — 在本地集群中运行Qwen模型

支持的Qwen模型

大语言模型（聊天与推理）

代码生成

语音转文本

设置

拉取Qwen模型

通过集群使用Qwen

OpenAI SDK

Qwen3.5用于通用聊天

Qwen3-Coder用于代码

Qwen ASR用于转录

Ollama API

Qwen3.5聊天

Qwen2.5-Coder

硬件建议

为什么在本地运行Qwen

Qwen在此集群上的优势

此集群还提供

其他大语言模型

图像生成

嵌入

仪表盘

完整文档

安全限制

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement