llmfit-advisor
Hardware-aware local LLM advisor. Detects your system specs (RAM, CPU, GPU/VRAM) and recommends models that actually fit, with optimal quantization and speed estimates.
When to use (trigger phrases)
Use this skill immediately when the user asks any of:
- "what local models can I run?"
- "which LLMs fit my hardware?"
- "recommend a local model"
- "what's the best model for my GPU?"
- "can I run Llama 70B locally?"
- "configure local models"
- "set up Ollama models"
- "what models fit my VRAM?"
- "help me pick a local model for coding"
Also use this skill when:
- The user wants to configure `models.providers.ollama` or `models.providers.lmstudio`
- The user mentions running models locally and you need to know what fits
- A model recommendation is needed and the user has local inference capability (Ollama, vLLM, LM Studio)
Quick start
Detect hardware
```bash
llmfit --json system
```
Returns JSON with CPU, RAM, GPU name, VRAM, multi-GPU info, and whether memory is unified (Apple Silicon).
Get top recommendations
```bash
llmfit recommend --json --limit 5
```
Returns the top 5 models ranked by a composite score (quality, speed, fit, context) with optimal quantization for the detected hardware.
Filter by use case
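```bash
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 3
llmfit recommend --json --use-case chat --limit 3
```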
Valid use cases: general, coding, reasoning, chat, multimodal, embedding.
Filter by minimum fit level
```bash
llmfit recommend --json --min-fit good --limit 10
```
Valid fit levels (best to worst): perfect, good, marginal.
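The filters above can be assembled programmatically. The following is a minimal sketch, not part of llmfit itself: `build_recommend_args` is a hypothetical helper, and it assumes `--use-case` and `--min-fit` may be combined in one invocation, which the examples above do not show explicitly.

```python
# Values taken from the "Valid use cases" and "Valid fit levels" lists above.
VALID_USE_CASES = {"general", "coding", "reasoning", "chat", "multimodal", "embedding"}
VALID_MIN_FIT = {"perfect", "good", "marginal"}


def build_recommend_args(use_case=None, min_fit=None, limit=5):
    """Build the argv list for `llmfit recommend` from validated filters.

    Hypothetical helper: combining both filters is an assumption.
    """
    if use_case is not None and use_case not in VALID_USE_CASES:
        raise ValueError(f"unknown use case: {use_case!r}")
    if min_fit is not None and min_fit not in VALID_MIN_FIT:
        raise ValueError(f"unknown fit level: {min_fit!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")

    args = ["llmfit", "recommend", "--json"]
    if use_case is not None:
        args += ["--use-case", use_case]
    if min_fit is not None:
        args += ["--min-fit", min_fit]
    args += ["--limit", str(limit)]
    return args


print(build_recommend_args(use_case="coding", limit=3))
```

Validating before building the command keeps a typo such as `code` from reaching the CLI.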
Understanding the output
System JSON
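```json
{
  "system": {
    "cpu_name": "Apple M2 Max",
    "cpu_cores": 12,
    "total_ram_gb": 32.0,
    "available_ram_gb": 24.5,
    "has_gpu": true,
    "gpu_name": "Apple M2 Max",
    "gpu_vram_gb": 32.0,
    "gpu_count": 1,
    "backend": "Metal",
    "unified_memory": true
  }
}
```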
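To turn the system JSON into a short hardware summary for the user, something like the sketch below may help. It assumes the payload has a top-level `system` object with fields named `cpu_name`, `cpu_cores`, `total_ram_gb`, `available_ram_gb`, `has_gpu`, `gpu_name`, `gpu_vram_gb`, `gpu_count`, `backend`, and `unified_memory`; `summarize_system` is a hypothetical helper, and the embedded sample is an Apple M2 Max example.

```python
import json

# Sample payload in the assumed shape (Apple M2 Max example).
SAMPLE_SYSTEM_JSON = """
{
  "system": {
    "cpu_name": "Apple M2 Max",
    "cpu_cores": 12,
    "total_ram_gb": 32.0,
    "available_ram_gb": 24.5,
    "has_gpu": true,
    "gpu_name": "Apple M2 Max",
    "gpu_vram_gb": 32.0,
    "gpu_count": 1,
    "backend": "Metal",
    "unified_memory": true
  }
}
"""


def summarize_system(payload):
    """Return a three-line, human-readable hardware summary."""
    system = payload["system"]
    lines = [
        f"CPU: {system['cpu_name']} ({system['cpu_cores']} cores)",
        f"RAM: {system['available_ram_gb']} GB free of {system['total_ram_gb']} GB",
    ]
    if system.get("has_gpu"):
        # On unified-memory machines the GPU figure is shared with system RAM.
        memory_kind = "unified memory" if system.get("unified_memory") else "VRAM"
        lines.append(
            f"GPU: {system['gpu_count']} x {system['gpu_name']}, "
            f"{system['gpu_vram_gb']} GB {memory_kind} ({system['backend']})"
        )
    else:
        lines.append("GPU: none detected")
    return "\n".join(lines)


summary = summarize_system(json.loads(SAMPLE_SYSTEM_JSON))
print(summary)
```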
Recommendation JSON
Each model in the models array includes:
| Field | Meaning |
|---|---|
| `name` | HuggingFace model ID (e.g. `meta-llama/Llama-3.1-8B-Instruct`) |
| `provider` | Model provider (Meta, Alibaba, Google, etc.) |
| `params_b` | Parameter count in billions |
| `score` | Composite score 0–100 (higher is better) |
| `score_components` | Breakdown: `quality`, `speed`, `fit`, `context` (each 0–100) |
| `fit_level` | `Perfect`, `Good`, `Marginal`, or `TooTight` |
| `run_mode` | `GPU`, `CPU+GPU Offload`, or `CPU Only` |
| `best_quant` | Optimal quantization for the hardware (e.g. `Q5_K_M`, `Q4_K_M`) |
| `estimated_tps` | Estimated tokens per second |
| `memory_required_gb` | VRAM/RAM needed at this quantization, in GB |
| `memory_available_gb` | Available VRAM/RAM detected, in GB |
| `utilization_pct` | How much of available memory the model uses, as a percentage |
| `use_case` | What the model is designed for |
| `context_length` | Maximum context window |
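As a rough illustration of how the memory fields relate, the sketch below recomputes a utilization percentage from `memory_required_gb` and `memory_available_gb`. Treating `utilization_pct` as required divided by available is an assumption based on the field descriptions, not a confirmed llmfit formula. The helper names and the sample numbers are hypothetical.

```python
def utilization_pct(memory_required_gb, memory_available_gb):
    """Share of available memory used, as a percentage rounded to one decimal.

    Assumed formula: 100 * required / available.
    """
    if memory_available_gb <= 0:
        raise ValueError("memory_available_gb must be positive")
    return round(100.0 * memory_required_gb / memory_available_gb, 1)


def describe_model(model):
    """One-line memory summary for a single recommendation entry."""
    used = utilization_pct(model["memory_required_gb"], model["memory_available_gb"])
    return (
        f"{model['name']} at {model['best_quant']}: "
        f"{model['memory_required_gb']} GB needed, "
        f"{used}% of {model['memory_available_gb']} GB available"
    )


# Hypothetical values, for illustration only.
example = {
    "name": "meta-llama/Llama-3.1-8B-Instruct",
    "best_quant": "Q5_K_M",
    "memory_required_gb": 6.0,
    "memory_available_gb": 24.0,
}
print(describe_model(example))
```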
Fit levels explained
- Perfect: Model fits comfortably with room to spare. Ideal choice.
- Good: Model fits but uses most available memory. Will work well.
- Marginal: Model barely fits. May work but expect slower performance or reduced context.
- TooTight: Model does not fit. Do not recommend.
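The fit levels above translate naturally into a filter. The sketch below drops `TooTight` entries (and any level it does not recognize) and orders the rest by `score`. `recommendable` is a hypothetical helper and the sample entries are invented for illustration.

```python
# Lower rank means a better fit; TooTight is deliberately absent so it is never kept.
FIT_RANK = {"Perfect": 0, "Good": 1, "Marginal": 2}


def recommendable(models, min_fit="Marginal"):
    """Keep models at or above min_fit, sorted by composite score (best first)."""
    threshold = FIT_RANK[min_fit]
    kept = [m for m in models if FIT_RANK.get(m["fit_level"], 99) <= threshold]
    return sorted(kept, key=lambda m: m["score"], reverse=True)


# Invented sample entries, for illustration only.
sample_models = [
    {"name": "a", "fit_level": "Good", "score": 70},
    {"name": "b", "fit_level": "TooTight", "score": 95},
    {"name": "c", "fit_level": "Perfect", "score": 88},
    {"name": "d", "fit_level": "Marginal", "score": 60},
]
print([m["name"] for m in recommendable(sample_models)])
```

Note that the high-scoring `TooTight` entry is excluded regardless of its score.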
Run modes explained
- GPU: Full GPU inference. Fastest. Model weights loaded entirely into VRAM.
- CPU+GPU Offload: Some layers on GPU, rest in system RAM. Slower than pure GPU.
- CPU Only: All inference on CPU using system RAM. Slowest but works without GPU.
Configuring OpenClaw with results
After getting recommendations, configure the user's local model provider.
For Ollama
Map the HuggingFace model name to its Ollama tag. Common mappings:
| llmfit name | Ollama tag |
|---|---|
| `meta-llama/Llama-3.1-8B-Instruct` | `llama3.1:8b` |
| `meta-llama/Llama-3.3-70B-Instruct` | `llama3.3:70b` |
| `Qwen/Qwen2.5-Coder-7B-Instruct` | `qwen2.5-coder:7b` |
| `Qwen/Qwen2.5-72B-Instruct` | `qwen2.5:72b` |
| `deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct` | `deepseek-coder-v2:16b` |
| `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` | `deepseek-r1:32b` |
| `google/gemma-2-9b-it` | `gemma2:9b` |
| `mistralai/Mistral-7B-Instruct-v0.3` | `mistral:7b` |
| `microsoft/Phi-3-mini-4k-instruct` | `phi3:mini` |
| `microsoft/Phi-4-mini-instruct` | `phi4-mini` |
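The common mappings can be kept as a lookup table. In this sketch, `ollama_tag` is a hypothetical helper that returns `None` for any model not in the list rather than guessing a tag.

```python
# Common HuggingFace-name to Ollama-tag mappings from the table above.
OLLAMA_TAGS = {
    "meta-llama/Llama-3.1-8B-Instruct": "llama3.1:8b",
    "meta-llama/Llama-3.3-70B-Instruct": "llama3.3:70b",
    "Qwen/Qwen2.5-Coder-7B-Instruct": "qwen2.5-coder:7b",
    "Qwen/Qwen2.5-72B-Instruct": "qwen2.5:72b",
    "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct": "deepseek-coder-v2:16b",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B": "deepseek-r1:32b",
    "google/gemma-2-9b-it": "gemma2:9b",
    "mistralai/Mistral-7B-Instruct-v0.3": "mistral:7b",
    "microsoft/Phi-3-mini-4k-instruct": "phi3:mini",
    "microsoft/Phi-4-mini-instruct": "phi4-mini",
}


def ollama_tag(hf_name):
    """Return the Ollama tag for a HuggingFace model ID, or None if unmapped."""
    return OLLAMA_TAGS.get(hf_name)


print(ollama_tag("Qwen/Qwen2.5-Coder-7B-Instruct"))
```

Returning `None` for unmapped models makes it explicit when the tag has to be looked up manually.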
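Then update `openclaw.json`, appending the chosen Ollama tag after the `ollama/` prefix:
```json
{
  "models": {
    "providers": {
      "ollama": {
        "models": ["ollama/"]
      }
    }
  }
}
```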
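And optionally set it as the default, again appending the tag after `ollama/`:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "ollama/"
      }
    }
  }
}
```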
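The two configuration steps can be sketched as a pure function over the parsed config. `add_ollama_model` is a hypothetical helper. It assumes the keys `models.providers.ollama.models` and `agents.defaults.model.primary`, and that a model entry is the `ollama/` prefix followed by the Ollama tag; check these against your actual `openclaw.json` before relying on it.

```python
import copy


def add_ollama_model(config, tag, make_default=False):
    """Return a copy of config with ollama/<tag> registered (and optionally default).

    Assumed layout: models.providers.ollama.models and agents.defaults.model.primary.
    """
    updated = copy.deepcopy(config)  # never mutate the caller's config
    model_id = f"ollama/{tag}"

    models = (
        updated.setdefault("models", {})
        .setdefault("providers", {})
        .setdefault("ollama", {})
        .setdefault("models", [])
    )
    if model_id not in models:  # keep the list free of duplicates
        models.append(model_id)

    if make_default:
        defaults_model = (
            updated.setdefault("agents", {})
            .setdefault("defaults", {})
            .setdefault("model", {})
        )
        defaults_model["primary"] = model_id
    return updated


config = add_ollama_model({}, "qwen2.5-coder:7b", make_default=True)
print(config)
```

Working on a deep copy leaves the original untouched, so the change can be shown to the user before anything is written to disk.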
For vLLM / LM Studio
Use the HuggingFace model name directly as the model identifier with the appropriate provider prefix (vllm/ or lmstudio/).
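A small sketch of the prefix rule for these two providers follows. `provider_model_id` is a hypothetical helper; the only prefixes it knows are the `vllm/` and `lmstudio/` ones named above.

```python
PROVIDER_PREFIXES = {"vllm": "vllm/", "lmstudio": "lmstudio/"}


def provider_model_id(provider, hf_name):
    """Prefix a HuggingFace model ID with the provider prefix."""
    if provider not in PROVIDER_PREFIXES:
        raise ValueError(f"unsupported provider: {provider!r}")
    return PROVIDER_PREFIXES[provider] + hf_name


print(provider_model_id("vllm", "Qwen/Qwen2.5-72B-Instruct"))
```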
Workflow example
When a user asks "what local models can I run?":
1. Run `llmfit --json system` to show a hardware summary
2. Run `llmfit recommend --json --limit 5` to get top picks
3. Present the recommendations with scores and fit levels
4. If the user wants to configure one, map it to the appropriate Ollama/vLLM/LM Studio tag
5. Offer to update `openclaw.json` with the chosen model
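The first three steps of this workflow might be scripted roughly as follows. This is a sketch under assumptions: it presumes the `llmfit` binary is on `PATH`, that the system output has a top-level `system` object, and that the recommendation output has a top-level `models` array. `run_llmfit` and `advise` are hypothetical names, and the runner is injectable so the logic can be exercised without llmfit installed.

```python
import json
import subprocess


def run_llmfit(args):
    """Run llmfit with the given arguments and parse its JSON output."""
    result = subprocess.run(
        ["llmfit", *args], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def advise(run=run_llmfit, limit=5):
    """Collect hardware info and top picks, returning lines to show the user."""
    system = run(["--json", "system"])["system"]
    models = run(["recommend", "--json", "--limit", str(limit)])["models"]

    # Never surface models that do not fit.
    picks = [m for m in models if m["fit_level"] != "TooTight"]

    lines = [
        f"Hardware: {system['cpu_name']}, {system['available_ram_gb']} GB RAM available"
    ]
    for m in picks:
        lines.append(
            f"- {m['name']} ({m['best_quant']}): score {m['score']}, fit {m['fit_level']}"
        )
    return lines
```

Calling `advise()` with no arguments shells out to llmfit; passing a stub as `run` lets the presentation logic be tested in isolation.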
When a user asks for a specific use case like "recommend a coding model":
1. Run `llmfit recommend --json --use-case coding --limit 3`
2. Present the coding-specific recommendations
3. Offer to pull via Ollama and configure
Notes
- llmfit detects NVIDIA GPUs (via nvidia-smi), AMD GPUs (via rocm-smi), and Apple Silicon (unified memory).
- Multi-GPU setups aggregate VRAM across cards automatically.
- The `best_quant` field tells you the optimal quantization; a higher quant (`Q6_K`, `Q8_0`) means better quality if VRAM allows.
- Speed estimates (`estimated_tps`) are approximate and vary by hardware and quantization.
- Models with `fit_level: "TooTight"` should never be recommended to users.
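To compare quantization levels such as `Q4_K_M`, `Q6_K`, and `Q8_0`, one option is to read the nominal bit width from the name. The sketch below is a hypothetical helper that relies only on the `Q<bits>` naming pattern; it makes no claim about how llmfit itself ranks quantizations, and it rejects names that do not start with `Q` followed by digits.

```python
import re


def quant_bits(quant):
    """Extract the nominal bit width from a quant name like Q5_K_M or Q8_0."""
    match = re.match(r"Q(\d+)", quant)
    if match is None:
        raise ValueError(f"unrecognized quantization name: {quant!r}")
    return int(match.group(1))


# Order from lowest to highest nominal precision.
ordered = sorted(["Q8_0", "Q4_K_M", "Q6_K"], key=quant_bits)
print(ordered)
```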