ride-receipts-gateway-llm
Build a ride-receipt pipeline that fetches Gmail receipts into one emails.json file, sends each email to the local OpenClaw Gateway /v1/responses endpoint for structured extraction, writes one rides.json array, and inserts the result into SQLite.
Before you start
- - Require
gog CLI authenticated for the target Gmail account. - Always run
gog auth list before fetching, even if the user already named an account. - If multiple accounts are configured, present explicit choices using the real account emails, e.g.
Which account should I use: (A) name1@example.com or (B) name2@example.com? Do not summarize as "default" or make the user infer which accounts exist. - If exactly one account is configured, use it and mention it briefly.
- Do not assume an account named
default exists. - Require a reachable local OpenClaw Gateway.
- Require Gateway auth token available via
OPENCLAW_GATEWAY_TOKEN or ~/.openclaw/openclaw.json. - Require the Gateway HTTP Responses endpoint to be enabled.
- Ask the user for date scope: all-time, after a date, or between two dates.
- Treat receipt emails as sensitive financial/location data.
- Tell the user that
emails.json stores fetched receipt emails locally and may include full HTML receipt content. - Before extraction, confirm the user is okay sending raw receipt email JSON/HTML to the active local/private Gateway-backed model.
- Prefer loopback or private Gateway targets. Only use a non-local Gateway when the user explicitly accepts that data flow.
Outputs
Primary artifacts:
- -
data/gateway-llm/emails.json — fetched receipt emails in one JSON array; may include full HTML receipt content - INLINECODE11 — extracted ride records in one JSON array
- INLINECODE12 — queryable SQLite database containing normalized ride fields plus
extracted_ride_json, but not raw source email JSON
Pipeline
Run each step in order. Stop and report on failure.
Summary and querying
- - When summarizing the SQLite output, do not guess schema field names.
- First inspect the actual schema with
PRAGMA table_info(rides) or read references/schema_rides.sql. - Base SQL queries only on confirmed columns from the live DB schema.
- If the schema and your expected fields differ, adapt the query to the real schema instead of forcing old column names.
- Prefer stable summary dimensions that are explicitly present in the schema, such as
provider, email_date_text, currency, amount, pickup_city, and dropoff_city.
1. Initialize DB
CODEBLOCK0
2. Fetch Gmail receipts into emails.json
CODEBLOCK1
Notes:
- - Omit
--after / --before when not needed. - Supported provider queries live in
references/provider_queries.json. - Current coverage includes Uber, Bolt, Yandex, Lyft, Free Now, Curb, and Via.
3. Extract rides with Gateway /v1/responses into rides.json
CODEBLOCK2
Notes:
- - The script iterates one email at a time.
- It sends raw email JSON to the Gateway
/v1/responses endpoint. - By default it refuses non-local Gateway hosts for this sensitive data flow; override only with
OPENCLAW_ALLOW_NONLOCAL_GATEWAY=1 when the user explicitly trusts that target. - It expects JSON-only output matching the current ride schema.
- It retries failed requests up to 3 times.
- It writes
rides.json after each successful extraction, so progress is checkpointed. - If
rides.json already exists, it skips emails whose gmail_message_id is already present there. - If rate limits become a problem, re-run with
--delay-ms <n>.
4. Insert rides.json into SQLite
CODEBLOCK3
5. Generate a schema-aware summary from SQLite
CODEBLOCK4
Notes:
- - This script inspects the live
rides table schema first. - It chooses available date/amount fields dynamically instead of assuming a fixed schema revision.
- Use this script for provider/month/currency/city summaries to avoid column-name mismatches.
6. Generate short ride insights
Do this as an agent action, not a dedicated insights script.
Recommended workflow:
- - Read
data/gateway-llm/rides.json when available because it preserves the extracted ride objects directly. - Optionally query
data/gateway-llm/rides.sqlite for a few basic totals if helpful, but do not turn the output into a raw SQL dump. - Feed the ride records plus a compact factual summary into the active Gateway-backed model.
- Ask the model to produce 8-10 short behavioral insights.
Notes:
- - Prefer interpretation over aggregation.
- Focus on patterns such as spending habits, repeated addresses, likely anchor locations, repeated routes, commute-like behavior, weekday/weekend habits, time-of-day patterns, outliers, and premium ride choices.
- Use light factual grounding first (totals, counts, repeated places), then let the model write the final insight bullets.
- Keep the output short and human.
- Do not invent labels like home/work unless the repetition strongly supports that wording; otherwise use softer phrasing like likely base, recurring destination, or commute-like pattern.
- Do not create or rely on dedicated Python insights scripts unless the user later asks for deterministic reporting artifacts.
7. Export anonymized CSV report
Use the bundled Python exporter when the user asks for an anonymized/shareable ride report.
CODEBLOCK5
Export rules:
- - Read from SQLite only.
- Include exactly these columns:
provider, email_month, start_time_15m, end_time_15m, currency, amount, distance_km, duration_min, pickup_city, pickup_country, dropoff_city, dropoff_country. - Convert
email_date_text to month-only format like 2025-05. - Round
start_time_text and end_time_text upward to the next 15-minute bucket. Exact quarter-hours stay unchanged. - Export normalized
distance_km and duration_min when available by reading them from extracted_ride_json; leave blank when unavailable. - Exclude street addresses, payment method, driver, notes, subject, message id, and any raw extracted JSON from the CSV output.
- When the user asks for the anonymized CSV, generate it as a real
.csv file in the workspace; do not paste inline CSV text into chat. - Save the file to a stable path such as
data/gateway-llm/anonymized_rides.csv. - To send it to chat, use OpenClaw's outbound media attachment mechanism: include a short text line plus a separate line containing exactly
MEDIA:./data/gateway-llm/anonymized_rides.csv. - Keep the accompanying message very short, e.g.
Done — I regenerated the anonymized CSV and attached the updated file. followed by the MEDIA: line. - Do not paste inline CSV text into chat.
- Saving a local copy is allowed and expected when needed to send the attachment cleanly.
Constraints
- - Use only the scripts bundled in this skill.
- Do not silently switch to direct provider APIs or embedded agent internals.
- Never hallucinate fields; use
null when unknown. - Keep addresses and time strings verbatim.
- Keep user-facing output brief: counts, paths, and failures.
References
- - Schema: INLINECODE63
- Provider Gmail queries: INLINECODE64
ride-receipts-gateway-llm
构建一个乘车收据处理管道,将Gmail收据抓取到一个emails.json文件中,将每封邮件发送到本地OpenClaw Gateway的/v1/responses端点进行结构化提取,写入一个rides.json数组,并将结果插入到SQLite数据库中。
开始前准备
- - 需要已认证目标Gmail账户的gog命令行工具。
- 在抓取前始终运行gog auth list,即使用户已指定账户名称。
- 如果配置了多个账户,使用真实的账户邮箱提供明确选择,例如我应该使用哪个账户:(A) name1@example.com 还是 (B) name2@example.com?不要概括为默认或让用户推断存在哪些账户。
- 如果只配置了一个账户,直接使用并简要提及。
- 不要假设存在名为default的账户。
- 需要可访问的本地OpenClaw Gateway。
- 需要通过OPENCLAWGATEWAYTOKEN或~/.openclaw/openclaw.json获取Gateway认证令牌。
- 需要启用Gateway的HTTP Responses端点。
- 询问用户日期范围:全部时间、某个日期之后、或两个日期之间。
- 将收据邮件视为敏感的财务/位置数据。
- 告知用户emails.json会在本地存储抓取的收据邮件,可能包含完整的HTML收据内容。
- 在提取前,确认用户同意将原始收据邮件JSON/HTML发送到活跃的本地/私有Gateway支持的模型。
- 优先使用回环或私有Gateway目标。仅在用户明确接受该数据流时才使用非本地Gateway。
输出
主要产物:
- - data/gateway-llm/emails.json — 抓取的收据邮件,以JSON数组形式存储;可能包含完整的HTML收据内容
- data/gateway-llm/rides.json — 提取的乘车记录,以JSON数组形式存储
- data/gateway-llm/rides.sqlite — 可查询的SQLite数据库,包含规范化的乘车字段及extractedridejson,但不包含原始源邮件JSON
处理流程
按顺序执行每个步骤。失败时停止并报告。
汇总与查询
- - 汇总SQLite输出时,不要猜测模式字段名称。
- 首先使用PRAGMA tableinfo(rides)检查实际模式,或读取references/schemarides.sql。
- SQL查询仅基于实时数据库模式中确认的列。
- 如果模式与预期字段不同,根据实际模式调整查询,而不是强制使用旧的列名。
- 优先使用模式中明确存在的稳定汇总维度,如provider、emaildatetext、currency、amount、pickupcity和dropoffcity。
1. 初始化数据库
bash
python3 skills/ride-receipts-gateway-llm/scripts/init_db.py \
--db ./data/gateway-llm/rides.sqlite \
--schema skills/ride-receipts-gateway-llm/references/schema_rides.sql
2. 将Gmail收据抓取到emails.json
bash
python3 skills/ride-receipts-gateway-llm/scripts/fetchemailsjson.py \
--account \
--after YYYY-MM-DD \
--before YYYY-MM-DD \
--max-per-provider 5000 \
--out ./data/gateway-llm/emails.json
注意:
- - 不需要时省略--after/--before参数。
- 支持的提供商查询位于references/provider_queries.json中。
- 当前覆盖范围包括Uber、Bolt、Yandex、Lyft、Free Now、Curb和Via。
3. 使用Gateway的/v1/responses端点提取乘车记录到rides.json
bash
OPENCLAWGATEWAYURL=http://127.0.0.1:18789 \
OPENCLAWGATEWAYTOKEN=... \
python3 skills/ride-receipts-gateway-llm/scripts/extractridesgateway.py \
--emails-json ./data/gateway-llm/emails.json \
--out ./data/gateway-llm/rides.json
注意:
- - 脚本每次迭代处理一封邮件。
- 它将原始邮件JSON发送到Gateway的/v1/responses端点。
- 默认情况下,对于此敏感数据流,它拒绝非本地Gateway主机;仅当用户明确信任该目标时,使用OPENCLAWALLOWNONLOCALGATEWAY=1覆盖。
- 期望输出为纯JSON格式,匹配当前乘车模式。
- 失败请求最多重试3次。
- 每次成功提取后写入rides.json,因此进度会被检查点保存。
- 如果rides.json已存在,则跳过其中已存在的gmailmessage_id对应的邮件。
- 如果遇到速率限制问题,使用--delay-ms 重新运行。
4. 将rides.json插入到SQLite
bash
python3 skills/ride-receipts-gateway-llm/scripts/insertridesjson_sqlite.py \
--db ./data/gateway-llm/rides.sqlite \
--rides-json ./data/gateway-llm/rides.json
5. 从SQLite生成模式感知的汇总
bash
python3 skills/ride-receipts-gateway-llm/scripts/summaryridessqlite.py \
--db ./data/gateway-llm/rides.sqlite
注意:
- - 此脚本首先检查实时的rides表模式。
- 它动态选择可用的日期/金额字段,而不是假设固定的模式版本。
- 使用此脚本进行提供商/月份/货币/城市汇总,以避免列名不匹配。
6. 生成简短的乘车洞察
作为代理操作执行,而不是专用的洞察脚本。
推荐工作流程:
- - 当可用时读取data/gateway-llm/rides.json,因为它直接保留了提取的乘车对象。
- 可选地查询data/gateway-llm/rides.sqlite获取一些基本总计数据(如有帮助),但不要将输出变成原始SQL转储。
- 将乘车记录加上紧凑的事实汇总输入到活跃的Gateway支持的模型中。
- 要求模型生成8-10条简短的行为洞察。
注意:
- - 优先解释而非聚合。
- 关注模式,如消费习惯、重复地址、可能的锚定位置、重复路线、通勤行为、工作日/周末习惯、时间段模式、异常值和高端乘车选择。
- 先使用轻量的事实依据(总计、计数、重复地点),然后让模型编写最终的洞察要点。
- 保持输出简短且人性化。
- 除非重复性强烈支持该措辞,否则不要发明家/工作等标签;否则使用更柔和的表述,如可能的基础位置、重复目的地或通勤模式。
- 除非用户后来要求确定性的报告产物,否则不要创建或依赖专用的Python洞察脚本。
7. 导出匿名化CSV报告
当用户要求匿名化/可共享的乘车报告时,使用捆绑的Python导出器。
bash
python3 skills/ride-receipts-gateway-llm/scripts/exportanonymizedrides_csv.py \
--db ./data/gateway-llm/rides.sqlite \
--out ./data/gateway-llm/anonymized_rides.csv
导出规则:
- - 仅从SQLite读取。
- 精确包含以下列:provider、emailmonth、starttime15m、endtime15m、currency、amount、distancekm、durationmin、pickupcity、pickupcountry、dropoffcity、dropoffcountry。
- 将emaildatetext转换为仅月份格式,如2025-05。
- 将starttimetext和endtimetext向上舍入到下一个15分钟时段。精确的整刻钟保持不变。
- 通过从extractedridejson读取,导出可用的规范化distancekm和durationmin;不可用时留空。
- 从CSV输出中排除街道地址、支付方式、司机、备注、主题、消息ID以及任何原始提取的JSON。
- 当用户要求匿名化CSV时,在工作区中生成一个真实的.csv文件;不要将内联CSV文本粘贴到聊天中。
- 将文件保存到稳定路径,如data/gateway-llm/anonymizedrides.csv。
- 要发送到聊天,使用OpenClaw的出站媒体附件机制:包含一行简短文本,后跟一行精确包含MEDIA:./data/gateway-llm/anonymized_