Token Optimizer

Reduce your LLM API costs by 20-35% with three proven mechanisms: pre-send token estimation, structured memory extraction, and context compression. Model-agnostic, zero dependencies.

Mechanism 1 — Pre-Send Token Estimation

Estimate token count before sending a request. If the payload exceeds a threshold, compress or truncate it. Never pay for tokens you could have avoided.

Rules

1. Estimate before every API call. Use these formulas:

- Plain text: tokens ≈ character_count / 4 - JSON / structured data: tokens ≈ character_count / 2 - Code (mixed): tokens ≈ character_count / 3.5 - Images / PDFs: tokens ≈ 2000 (flat per asset, regardless of size)

2. Set a token budget per request. Default threshold: 8 000 tokens. Adjust per use case.

3. If estimated tokens exceed the budget:

- Summarize or truncate the longest sections first. - Strip intermediate reasoning, keep conclusions only. - For JSON: remove null/empty fields, shorten keys if feeding to a model that doesn't need human-readable keys. - For code: send only the relevant function/class, not the full file.

4. Log the estimate vs. actual usage (from the API response) to calibrate over time.

Example

CODEBLOCK0

Reference

See references/token-formula.md for the full formula breakdown with worked examples.

Mechanism 2 — Memory Extraction

Instead of re-reading the entire conversation history every turn, extract and persist key information into structured memory files. On subsequent turns, load only the memory index — not the raw history.

Rules

1. Use a lightweight secondary model (Haiku, GPT-4o-mini, Gemini Flash) as the memory extraction agent. Never burn expensive model tokens on bookkeeping.

2. Maintain a session cursor. Track which messages have already been processed. On each extraction pass, only read new messages since the last cursor position.

3. Limit extraction to 5 rounds max per session. Each round processes a batch of new messages. Stop early if no new information is found.

4. Parallelize I/O within rounds:

- Round 1: all reads in parallel (gather raw content). - Round 2: all writes in parallel (persist extracted memories).

5. Structure memory as index + detail files:

- MEMORY.md — index file, max 200 lines. Contains only pointers: - [topic-name](memory/topic-name.md) — one-line description. - memory/topic-name.md — full content for each topic with frontmatter (name, description, type).

6. Memory types (categorize each entry):

- user — who the user is, their preferences, expertise level. - feedback — corrections and confirmed approaches (what to do / not do). - project — current goals, deadlines, decisions, constraints. - reference — pointers to external resources (URLs, dashboards, issue trackers).

7. Do not store what can be derived. No code snippets, no git history, no file paths — these are always available from the source. Store only non-obvious context.

Example — Extraction Prompt

CODEBLOCK1

Reference

See references/memory-extraction-pattern.md for the full pattern with prompt templates.

Mechanism 3 — Context Compression

As conversations grow, compress older exchanges into dense summaries. Keep only the last N messages in full fidelity. This prevents context windows from filling with stale reasoning.

Rules

1. Keep the last 6 messages uncompressed (3 user + 3 assistant). These are "fresh" — they contain active context.

2. Summarize everything older into a single <compressed-context> block at the top of the conversation. Format:

CODEBLOCK2

3. What to keep in summaries:

- Decisions and their rationale. - Current state of work (done / in-progress / blocked). - Constraints and deadlines. - User preferences and corrections.

4. What to discard:

- Intermediate reasoning ("I considered X but..."). - Exploratory questions that were already answered. - Tool call details (file reads, grep results, build output). - Repeated or superseded information.

5. Trigger compression when the conversation exceeds 60% of the model's context window. Use Mechanism 1's estimation formula to check.

6. Never compress system prompts or skill instructions. These must remain intact.

Example — Savings Calculation

CODEBLOCK3

Combined Savings Estimate

Mechanism	Typical Savings	When It Hits
Pre-send estimation	10-15%	Every request with large payloads
Memory extraction

These are conservative estimates based on real-world agent workflows. Actual savings depend on conversation length, payload sizes, and how aggressively you compress.

Quick Start

1. Copy this skill into your agent's skill directory (or paste SKILL.md into your system prompt).
Apply Mechanism 1 immediately — add token estimation before your API calls.
Set up Mechanism 2 if you run multi-turn or multi-session workflows.
Enable Mechanism 3 for any conversation that runs beyond 15-20 messages.

No code to install. No dependencies. Just rules your agent follows.

Token Optimizer

通过三种经过验证的机制，将您的LLM API成本降低20-35%：发送前令牌估算、结构化记忆提取和上下文压缩。模型无关，零依赖。

机制一 — 发送前令牌估算

在发送请求之前估算令牌数量。如果负载超过阈值，则进行压缩或截断。绝不支付本可避免的令牌费用。

规则

1. 每次API调用前进行估算。 使用以下公式：

- 纯文本：令牌数 ≈ 字符数 / 4 - JSON/结构化数据：令牌数 ≈ 字符数 / 2 - 代码（混合）：令牌数 ≈ 字符数 / 3.5 - 图片/PDF：令牌数 ≈ 2000（每个资源固定值，与大小无关）

2. 为每次请求设置令牌预算。 默认阈值：8,000令牌。根据使用场景调整。

3. 如果估算令牌数超出预算：

- 优先总结或截断最长的部分。 - 去除中间推理过程，仅保留结论。 - 对于JSON：移除null/空字段，如果目标模型不需要人类可读的键名，则缩短键名。 - 对于代码：仅发送相关函数/类，而非完整文件。

4. 记录估算值与实际使用量（来自API响应），以便随时间校准。

示例

输入：24,000字符的纯文本
估算令牌数：24000 / 4 = 6,000 → 在预算内，原样发送。

输入：40,000字符的JSON
估算令牌数：40000 / 2 = 20,000 → 超出预算。
操作：移除null字段，删除冗余嵌套对象 → 14,000字符 → 7,000令牌 → 发送。

参考

详见 references/token-formula.md，包含完整公式分解及计算示例。

机制二 — 记忆提取

无需每次轮次重新读取整个对话历史，而是将关键信息提取并持久化到结构化记忆文件中。在后续轮次中，仅加载记忆索引——而非原始历史记录。

规则

1. 使用轻量级辅助模型（Haiku、GPT-4o-mini、Gemini Flash）作为记忆提取代理。绝不将昂贵的模型令牌浪费在记账上。

2. 维护会话游标。 追踪哪些消息已被处理。每次提取时，仅读取游标位置之后的新消息。

3. 每会话最多限制5轮提取。 每轮处理一批新消息。如果未发现新信息，则提前停止。

4. 在轮次内并行化I/O操作：

- 第一轮：所有读取操作并行执行（收集原始内容）。 - 第二轮：所有写入操作并行执行（持久化提取的记忆）。

5. 将记忆结构化为索引文件+详情文件：

- MEMORY.md — 索引文件，最多200行。仅包含指针：- 主题名称 — 一行描述。 - memory/主题名称.md — 每个主题的完整内容，包含前置元数据（名称、描述、类型）。

6. 记忆类型（对每条记录进行分类）：

- user — 用户身份、偏好、专业水平。 - feedback — 修正和确认的方法（该做/不该做的事）。 - project — 当前目标、截止日期、决策、约束条件。 - reference — 指向外部资源的指针（URL、仪表盘、问题追踪器）。

7. 不存储可推导的信息。 不存储代码片段、git历史、文件路径——这些始终可从源获取。仅存储非显而易见的上下文。

示例 — 提取提示

你是一个记忆提取代理。请阅读以下自游标位置{cursor}以来的新消息。

对于每条非显而易见的信息，输出一个JSON对象：
{
topic: 短横线命名格式的名称,
type: user | feedback | project | reference,
description: 用于索引的一行摘要,
content: 完整记忆内容，包含原因和如何应用的结构化信息
}

规则：

- 每次最多提取5条记忆。
跳过任何可从代码、git或现有记忆中推导的信息。
将相对日期转换为绝对日期（今天是{date}）。
如果该主题的记忆已存在，则输出更新内容，而非重复内容。

参考

详见 references/memory-extraction-pattern.md，包含完整模式及提示模板。

机制三 — 上下文压缩

随着对话增长，将较早的交流压缩为密集摘要。仅保留最后N条消息的完整精度。这防止上下文窗口被过时的推理填满。

规则

1. 保留最后6条消息不压缩（3条用户消息 + 3条助手消息）。这些是新鲜的——包含活跃上下文。

2. 将所有更早的内容总结为一个单一的块，置于对话顶部。格式如下：

## 已做出的决策
- 为用户表选择PostgreSQL而非MongoDB（原因：关系型查询）。
- API速率限制设为每位用户100次请求/分钟。

## 当前状态
- 认证模块：已完成，合并至主分支。
- 支付集成：进行中，被Stripe webhook配置阻塞。

## 关键约束
- 必须在2026年4月15日前发布。
- 公共API v2不得有破坏性变更。

3. 摘要中保留的内容：

- 决策及其理由。 - 当前工作状态（已完成/进行中/被阻塞）。 - 约束条件和截止日期。 - 用户偏好和修正。

4. 丢弃的内容：

- 中间推理过程（我考虑过X但...）。 - 已得到回答的探索性问题。 - 工具调用细节（文件读取、grep结果、构建输出）。 - 重复或已被取代的信息。

5. 当对话超过模型上下文窗口的60%时触发压缩。 使用机制一的估算公式进行检查。

6. 绝不压缩系统提示或技能指令。 这些必须保持完整。

示例 — 节省计算

压缩前：
42条消息，约32,000令牌。

压缩后：
压缩块：约2,000令牌。
最后6条消息：约4,500令牌。
总计：约6,500令牌。

节省：32,000 - 6,500 = 25,500令牌（历史记录减少80%）。
每次请求节省（持续）：约25,500令牌 × $0.003/1K = 每次请求$0.077。

综合节省估算

机制	典型节省	生效时机
发送前估算	10-15%	每次大负载请求
记忆提取

5-10% | 多会话工作流 |
| 上下文压缩 | 15-25% | 长对话（>20条消息） |
| 综合 | 20-35% | 会话中的持续使用 |

这些是基于真实代理工作流的保守估算。实际节省取决于对话长度、负载大小以及压缩的激进程度。

快速入门

1. 将此技能复制到您的代理技能目录中（或将SKILL.md粘贴到系统提示中）。
立即应用机制一 — 在API调用前添加令牌估算。
设置机制二（如果您运行多轮或多会话工作流）。
启用机制三（适用于任何超过15-20条消息的对话）。

无需安装代码。无需依赖。只需您的代理遵循的规则。

token-optimizer令牌优化器

token-optimizer

Token Optimizer

Mechanism 1 — Pre-Send Token Estimation

Rules

Example

Reference

Mechanism 2 — Memory Extraction

Rules

Example — Extraction Prompt

Reference

Mechanism 3 — Context Compression

Rules

Example — Savings Calculation

Combined Savings Estimate

Quick Start

Token Optimizer

机制一 — 发送前令牌估算

规则

示例

参考

机制二 — 记忆提取

规则

示例 — 提取提示

参考

机制三 — 上下文压缩

规则

示例 — 节省计算

综合节省估算

快速入门

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

token-optimizer令牌优化器

token-optimizer

Token Optimizer

Mechanism 1 — Pre-Send Token Estimation

Rules

Example

Reference

Mechanism 2 — Memory Extraction

Rules

Example — Extraction Prompt

Reference

Mechanism 3 — Context Compression

Rules

Example — Savings Calculation

Combined Savings Estimate

Quick Start

Token Optimizer

机制一 — 发送前令牌估算

规则

示例

参考

机制二 — 记忆提取

规则

示例 — 提取提示

参考

机制三 — 上下文压缩

规则

示例 — 节省计算

综合节省估算

快速入门

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement