Vision Sandbox 🔭
Use Gemini's native code execution to analyze images with high precision. The model writes and runs Python code in a Google-hosted sandbox to verify visual data, which makes it well suited to UI auditing, spatial grounding, and visual reasoning.
Installation
    clawhub install vision-sandbox
Usage
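    uv run vision-sandbox --image path/to/image.png --prompt "Identify all buttons and provide [x, y] coordinates."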
Pattern Library
📍 Spatial Grounding
Ask the model to find specific items and return coordinates.
- Prompt: "Locate the 'Submit' button in this screenshot. Use code execution to verify its center point and return the [x, y] coordinates on a [0, 1000] scale."
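A point returned on the [0, 1000] scale has to be mapped back to pixels before an agent can act on it. The following is a minimal sketch, assuming each axis is normalized independently against the image width and height; the helper name `to_pixels` is illustrative and not part of vision-sandbox.

```python
def to_pixels(point, width, height, scale=1000):
    """Map an [x, y] point on a 0..scale grid to integer pixel coordinates.

    Assumption: x is normalized by image width and y by image height.
    """
    x, y = point
    if not (0 <= x <= scale and 0 <= y <= scale):
        raise ValueError(f"point {point} is outside the 0..{scale} range")
    # Clamp so a value of exactly `scale` still lands on the last pixel.
    px = min(int(x * width / scale), width - 1)
    py = min(int(y * height / scale), height - 1)
    return px, py


if __name__ == "__main__":
    # The center of the grid on a 1920x1080 screenshot.
    print(to_pixels([500, 500], 1920, 1080))  # prints (960, 540)
```

If the model is prompted for a different axis order, swap the two values before calling the helper.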
🧮 Visual Math
Ask the model to count or calculate based on the image.
- Prompt: "Count the number of items in the list. Use Python to sum their values if prices are visible."
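The arithmetic the sandbox might run for a prompt like this can be sketched as follows. It assumes the model has already transcribed each list item into a string and that the first number in a string is its price; `sum_prices` and the sample labels are illustrative. `Decimal` is used so that currency values add up without floating-point drift.

```python
import re
from decimal import Decimal


def sum_prices(labels):
    """Return (item_count, total) for labels such as 'Coffee $4.50'.

    Assumption: the first number found in a label is its price.
    Items without a visible price are counted but add nothing to the total.
    """
    total = Decimal("0")
    for label in labels:
        # Match digits with optional thousands separators and decimals.
        match = re.search(r"\d[\d,]*(?:\.\d+)?", label)
        if match is None:
            continue  # no visible price on this item
        total += Decimal(match.group().replace(",", ""))
    return len(labels), total


if __name__ == "__main__":
    count, total = sum_prices(["Coffee $4.50", "Laptop $1,299.00", "Free sample"])
    print(count, total)
```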
🖥️ UI Audit
Check layout and readability.
- Prompt: "Check if the header text overlaps with any icons. Use the sandbox to calculate the bounding box intersections."
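The intersection check named in the prompt reduces to a few comparisons on axis-aligned boxes. This is a minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples; the function names and sample boxes are illustrative rather than anything vision-sandbox defines.

```python
def intersection_area(a, b):
    """Overlap area of two boxes given as (x_min, y_min, x_max, y_max)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0  # boxes are disjoint or only touch along an edge
    return width * height


def find_overlaps(header, icons):
    """Return the names of icons whose boxes overlap the header box."""
    return [name for name, box in icons.items()
            if intersection_area(header, box) > 0]


if __name__ == "__main__":
    header = (0, 0, 400, 60)
    icons = {"menu": (380, 10, 420, 50), "search": (450, 10, 490, 50)}
    print(find_overlaps(header, icons))  # prints ['menu']
```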
🖐️ Counting & Logic
Solve visual counting tasks with code verification.
- Prompt: "Count the number of fingers on this hand. Use code execution to identify the bounding box for each finger and return the total count."
Integration with OpenCode
This skill is designed to provide visual grounding for automated coding agents like OpenCode.
- Step 1: Use vision-sandbox to extract UI metadata (coordinates, sizes, colors).
- Step 2: Pass the JSON output to OpenCode to generate or fix CSS/HTML.
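The hand-off between the two steps might look like the sketch below. The source does not specify the JSON schema that vision-sandbox emits or how OpenCode accepts input, so the field names (`name`, `box`, `color`) and the helper `build_agent_prompt` are hypothetical. The sketch only validates the metadata and folds it into a text prompt for the coding agent.

```python
import json

# Hypothetical schema: each element carries a name, a bounding box, and a color.
REQUIRED_KEYS = {"name", "box", "color"}


def build_agent_prompt(payload, task):
    """Validate UI metadata (a JSON array string) and embed it in a prompt."""
    elements = json.loads(payload)
    for element in elements:
        missing = REQUIRED_KEYS - set(element)
        if missing:
            raise ValueError(f"element {element!r} is missing {sorted(missing)}")
    # Re-serialize with stable formatting so the agent sees clean JSON.
    metadata = json.dumps(elements, indent=2, sort_keys=True)
    return f"{task}\n\nUI metadata extracted from the screenshot:\n{metadata}"


if __name__ == "__main__":
    sample = json.dumps(
        [{"name": "submit", "box": [10, 20, 110, 60], "color": "#0055ff"}]
    )
    print(build_agent_prompt(sample, "Fix the CSS so the button is centered."))
```

Validating before the hand-off keeps a malformed extraction from silently producing wrong CSS further down the pipeline.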
Configuration
- GEMINI_API_KEY: Required environment variable.
- Model: Defaults to gemini-3-flash-preview.
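A wrapper script can fail early when the key is missing instead of waiting for the CLI to report it. This is a minimal sketch, assuming the variable is spelled `GEMINI_API_KEY` and using only the `--image` and `--prompt` flags from the usage example; `build_command` is an illustrative helper, not part of the tool.

```python
import os


def build_command(image, prompt, env=None):
    """Return the argv list for a vision-sandbox run, or raise if the key is unset.

    Assumption: the API key is read from GEMINI_API_KEY.
    """
    env = os.environ if env is None else env
    if not env.get("GEMINI_API_KEY"):
        raise RuntimeError("GEMINI_API_KEY is not set")
    # A list avoids shell quoting problems with prompts that contain spaces.
    return ["uv", "run", "vision-sandbox", "--image", image, "--prompt", prompt]


if __name__ == "__main__":
    try:
        print(build_command("path/to/image.png", "Identify all buttons."))
    except RuntimeError as error:
        print(f"configuration error: {error}")
```

The returned list can be passed to `subprocess.run(command, check=True)` to execute the tool.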