Tabstack Extractor
Overview
This skill enables structured data extraction from websites using the Tabstack API. It's ideal for web scraping tasks where you need consistent, schema-based data extraction from job boards, news sites, product pages, or any structured content.
Quick Start
1. Install Babashka (if needed)
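bash
# Option A: Install from GitHub (recommended for sharing)
curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash
# Option B: Install from Nix
nix-shell -p babashka
# Option C: Install from Homebrew
brew install borkdude/brew/babashka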
2. Set up API Key
Option A: Environment variable (recommended)
bash
export TABSTACK_API_KEY="your_api_key_here"
Option B: Configuration file
bash
mkdir -p ~/.config/tabstack
echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
Get an API key: Sign up at Tabstack Console
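The two options above can be folded into a single lookup. The following is a minimal sketch in Python, assuming the environment variable takes precedence over the configuration file. `resolve_api_key` is a hypothetical helper, not part of the skill's scripts, and the regular expression only handles a one-key EDN map of the form {:api-key "..."}; it is not a real EDN parser.

```python
import os
import re
from pathlib import Path

# Hypothetical helper: not part of the Tabstack scripts.
CONFIG_PATH = Path.home() / ".config" / "tabstack" / "config.edn"


def resolve_api_key(env=None, config_text=None):
    """Return the API key from the environment, else from config.edn text."""
    env = os.environ if env is None else env
    key = env.get("TABSTACK_API_KEY")
    if key:
        return key
    if config_text is None and CONFIG_PATH.exists():
        config_text = CONFIG_PATH.read_text()
    if config_text:
        # Matches the simple map {:api-key "..."}; not a full EDN parser.
        match = re.search(r':api-key\s+"([^"]+)"', config_text)
        if match:
            return match.group(1)
    return None
```

Passing `env` and `config_text` explicitly keeps the function easy to test without touching the real environment or home directory.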
3. Test Connection
bash
bb scripts/tabstack.clj test
4. Extract Markdown (Simple)
bash
bb scripts/tabstack.clj markdown https://example.com
5. Extract JSON (Start Simple)
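bash
# Start with a simple schema (fast, reliable)
bb scripts/tabstack.clj json https://example.com references/simple_article.json
# Try more complex schemas (may be slower)
bb scripts/tabstack.clj json https://news.site references/news_schema.json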
6. Advanced Features
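bash
# Extract with retry logic (3 retries, 1s delay)
bb scripts/tabstack.clj json-retry https://example.com references/simple_article.json
# Extract with caching (24-hour cache)
bb scripts/tabstack.clj json-cache https://example.com references/simple_article.json
# Batch extract from a file of URLs
echo "https://example.com" > urls.txt
echo "https://example.org" >> urls.txt
bb scripts/tabstack.clj batch urls.txt references/simple_article.json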
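To illustrate what a 24-hour cache involves, here is a small sketch in Python. It is not the logic of tabstack.clj: the key derivation (SHA-256 of the URL plus the schema), the `cache_get` and `cache_put` names, and the on-disk JSON layout are all assumptions made for illustration.

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 24 * 60 * 60  # 24-hour cache, as in the json-cache command


def cache_key(url, schema):
    """Derive a stable key from the URL and the schema contents."""
    payload = url + "\n" + json.dumps(schema, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_put(cache_dir, url, schema, result, now=None):
    """Store a result together with the time it was saved."""
    now = time.time() if now is None else now
    path = Path(cache_dir) / (cache_key(url, schema) + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"saved_at": now, "result": result}))


def cache_get(cache_dir, url, schema, now=None):
    """Return the cached result, or None if missing or older than the TTL."""
    now = time.time() if now is None else now
    path = Path(cache_dir) / (cache_key(url, schema) + ".json")
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if now - entry["saved_at"] > TTL_SECONDS:
        return None
    return entry["result"]
```

Including the schema in the key means that changing a schema naturally invalidates earlier results for the same URL.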
Core Capabilities
1. Markdown Extraction
Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.
When to use: When you need the textual content of a page without the HTML clutter.
Example use cases:
- Extract article content for summarization
- Archive webpage content
- Analyze blog post content
2. JSON Schema Extraction
Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.
When to use: When scraping job listings, product pages, news articles, or any structured data.
Example use cases:
- Scrape job listings from BuiltIn/LinkedIn
- Extract product details from e-commerce sites
- Gather news articles with consistent metadata
3. Schema Templates
Pre-built schemas for common scraping tasks. See the references/ directory for templates.
Available schemas:
- Job listing schema (see references/job_schema.json)
- News article schema
- Product page schema
- Contact information schema
Workflow: Job Scraping Example
Follow this workflow to scrape job listings:
1. Identify target sites - BuiltIn, LinkedIn, company career pages
2. Choose or create schema - Use references/job_schema.json or customize
3. Test extraction - Run a single page to verify the schema works
4. Scale up - Process multiple URLs
5. Store results - Save to database or file
Example job schema:
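json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"},
    "salary": {"type": "string"},
    "apply_url": {"type": "string"},
    "posted_date": {"type": "string"},
    "requirements": {"type": "array", "items": {"type": "string"}}
  }
}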
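Before storing results, it can help to check that each extracted record has the shape the schema describes. The sketch below is a deliberately small checker in Python, assuming flat schemas with only string and array-of-string fields. `JOB_SCHEMA` mirrors this document's example job schema; `check_record` is a hypothetical helper, and a full JSON Schema validator would be the better choice for anything more complex.

```python
JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
        "salary": {"type": "string"},
        "apply_url": {"type": "string"},
        "posted_date": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
    },
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif spec["type"] == "string" and not isinstance(value, str):
            problems.append(f"{name} should be a string")
        elif spec["type"] == "array" and not (
            isinstance(value, list) and all(isinstance(v, str) for v in value)
        ):
            problems.append(f"{name} should be a list of strings")
    return problems
```

Missing fields are not reported as problems here, since a page may simply not contain them.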
Integration with Other Skills
Combine with Web Search
1. Use web_search to find relevant URLs
2. Use Tabstack to extract structured data from those URLs
3. Store results in Datalevin (future skill)
Combine with Browser Automation
1. Use the browser tool to navigate complex sites
2. Extract page URLs
3. Use Tabstack for structured extraction
Error Handling
Common issues and solutions:
1. Authentication failed - Check the TABSTACK_API_KEY environment variable
2. Invalid URL - Ensure the URL is accessible and correct
3. Schema mismatch - Adjust the schema to match the page structure
4. Rate limiting - Add delays between requests
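For the rate-limiting case, one simple approach is to enforce a minimum interval between consecutive requests. A minimal sketch in Python follows. The `Throttle` class and its one-second default are assumptions for illustration, not Tabstack's documented limits.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Sleep just long enough to respect the interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each extraction spaces requests out without delaying the first one.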
Resources
scripts/
- tabstack.clj - Main API wrapper in Babashka (recommended; has retry logic, caching, batch processing)
- tabstack_curl.sh - Bash/curl fallback (simple, no dependencies)
- tabstack_api.py - Python API wrapper (requires the requests module)
references/
- job_schema.json - Template schema for job listings
- api_reference.md - Tabstack API documentation
Best Practices
1. Start small - Test with single pages before scaling
2. Respect robots.txt - Check site scraping policies
3. Add delays - Avoid overwhelming target sites
4. Validate schemas - Test schemas on sample pages
5. Handle errors gracefully - Implement retry logic for failed requests
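The retry advice can be sketched as a small wrapper. This Python example assumes a fixed delay between attempts, mirroring the "3 retries, 1s delay" of the json-retry command. `with_retry` is a hypothetical helper and is not the retry logic inside tabstack.clj.

```python
import time


def with_retry(func, retries=3, delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the last error
            attempt += 1
            sleep(delay)
```

In practice it is worth retrying only errors that are likely to be transient, such as timeouts, rather than every exception as this sketch does.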
Teaching Focus: How to Create Schemas
This skill is designed to teach agents how to use Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.
Learning Path
1. Start Simple - Use references/simple_article.json (4 basic fields)
2. Test Extensively - Try schemas on multiple page types
3. Iterate - Add fields based on what the page actually contains
4. Optimize - Remove unnecessary fields for speed
See Schema Creation Guide for detailed instructions and examples.
Common Mistakes to Avoid
- Over-complex schemas - Start with 2-3 fields, not 20
- Missing fields - Don't require fields that don't exist on the page
- No testing - Always test with example.com first, then target sites
- Ignoring timeouts - Complex schemas take longer (45s timeout)
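One way to follow the "start with 2-3 fields" advice is to generate small schemas from a list of field names and grow them one field at a time. The sketch below is in Python; `make_schema` and `add_field` are hypothetical helpers, every field defaults to a string, and no `required` list is emitted, in line with the advice not to require fields that may be absent from the page.

```python
import json


def make_schema(fields):
    """Build a flat object schema where each named field is a string."""
    return {
        "type": "object",
        "properties": {name: {"type": "string"} for name in fields},
    }


def add_field(schema, name, spec=None):
    """Return a copy of the schema with one more field added."""
    properties = dict(schema["properties"])
    properties[name] = spec or {"type": "string"}
    return {"type": "object", "properties": properties}


starter = make_schema(["title", "company"])
grown = add_field(starter, "requirements",
                  {"type": "array", "items": {"type": "string"}})
print(json.dumps(grown, indent=2))
```

Because `add_field` returns a copy, the known-good starter schema stays available to fall back on if the larger one turns out to be too slow.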
Babashka Advantages
Using Babashka for this skill provides:
1. Single binary - Easy to share/install (GitHub releases, brew, nix)
2. Fast startup - No JVM warmup, ~50ms startup time
3. Built-in HTTP client - No external dependencies
4. Clojure syntax - Familiar to you (Wes), expressive
5. Retry logic & caching - Built into the skill
6. Batch processing - Parallel extraction for multiple URLs
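Parallel batch extraction can be pictured as mapping an extraction function over a list of URLs with a bounded worker pool. This Python sketch only illustrates that idea and is not a port of tabstack.clj. `extract` stands in for whatever performs a single extraction, and the pool size of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_extract(urls, extract, max_workers=4):
    """Run extract(url) for every URL in parallel, keeping input order.

    Returns a list of (url, result, error) tuples so one failing URL
    does not abort the whole batch.
    """
    def safe(url):
        try:
            return (url, extract(url), None)
        except Exception as exc:  # record the failure and keep going
            return (url, None, str(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, urls))
```

Keeping the worker count small also limits the load placed on the target sites.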
Example User Requests
For this skill to trigger:
- "Scrape job listings from Docker careers page"
- "Extract the main content from this article"
- "Get structured product data from this e-commerce page"
- "Pull all the news articles from this site"
- "Extract contact information from this company page"
- "Batch extract job listings from these 20 URLs"
- "Get cached results for this page (avoid API calls)"
Tabstack Extractor
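Overview
This skill enables structured data extraction from websites using the Tabstack API. It is well suited to web scraping tasks that need consistent, schema-based data from job boards, news sites, product pages, or any structured content.
Quick Start
1. Install Babashka (if needed)
bash
# Option A: Install from GitHub (recommended for sharing)
curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash
# Option B: Install from Nix
nix-shell -p babashka
# Option C: Install from Homebrew
brew install borkdude/brew/babashka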
2. Set up API Key
Option A: Environment variable (recommended)
bash
export TABSTACK_API_KEY="your_api_key_here"
Option B: Configuration file
bash
mkdir -p ~/.config/tabstack
echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
Get an API key: Sign up at the Tabstack Console
3. Test Connection
bash
bb scripts/tabstack.clj test
4. Extract Markdown (Simple)
bash
bb scripts/tabstack.clj markdown https://example.com
5. Extract JSON (Start Simple)
bash
# Start with a simple schema (fast, reliable)
bb scripts/tabstack.clj json https://example.com references/simple_article.json
# Try more complex schemas (may be slower)
bb scripts/tabstack.clj json https://news.site references/news_schema.json
6. Advanced Features
bash
# Extract with retry logic (3 retries, 1s delay)
bb scripts/tabstack.clj json-retry https://example.com references/simple_article.json
# Extract with caching (24-hour cache)
bb scripts/tabstack.clj json-cache https://example.com references/simple_article.json
# Batch extract from a file of URLs
echo "https://example.com" > urls.txt
echo "https://example.org" >> urls.txt
bb scripts/tabstack.clj batch urls.txt references/simple_article.json
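Core Capabilities
1. Markdown Extraction
Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.
When to use: When you need the textual content of a page without the HTML clutter.
Example use cases:
- Extract article content for summarization
- Archive webpage content
- Analyze blog post content
2. JSON Schema Extraction
Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.
When to use: When scraping job listings, product pages, news articles, or any structured data.
Example use cases:
- Scrape job listings from BuiltIn/LinkedIn
- Extract product details from e-commerce sites
- Gather news articles with consistent metadata
3. Schema Templates
Pre-built schemas for common scraping tasks. See the references/ directory for templates.
Available schemas:
- Job listing schema (see references/job_schema.json)
- News article schema
- Product page schema
- Contact information schema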
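Workflow: Job Scraping Example
Follow this workflow to scrape job listings:
1. Identify target sites - BuiltIn, LinkedIn, company career pages
2. Choose or create schema - Use references/job_schema.json or customize
3. Test extraction - Run a single page to verify the schema works
4. Scale up - Process multiple URLs
5. Store results - Save to database or file
Example job schema:
json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"},
    "salary": {"type": "string"},
    "apply_url": {"type": "string"},
    "posted_date": {"type": "string"},
    "requirements": {"type": "array", "items": {"type": "string"}}
  }
}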
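Integration with Other Skills
Combine with Web Search
1. Use web_search to find relevant URLs
2. Use Tabstack to extract structured data from those URLs
3. Store results in Datalevin (future skill)
Combine with Browser Automation
1. Use the browser tool to navigate complex sites
2. Extract page URLs
3. Use Tabstack for structured extraction
Error Handling
Common issues and solutions:
1. Authentication failed - Check the TABSTACK_API_KEY environment variable
2. Invalid URL - Ensure the URL is accessible and correct
3. Schema mismatch - Adjust the schema to match the page structure
4. Rate limiting - Add delays between requests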
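Resources
scripts/
- tabstack.clj - Main API wrapper in Babashka (recommended; has retry logic, caching, batch processing)
- tabstack_curl.sh - Bash/curl fallback (simple, no dependencies)
- tabstack_api.py - Python API wrapper (requires the requests module)
references/
- job_schema.json - Template schema for job listings
- api_reference.md - Tabstack API documentation
Best Practices
1. Start small - Test with single pages before scaling
2. Respect robots.txt - Check site scraping policies
3. Add delays - Avoid overwhelming target sites
4. Validate schemas - Test schemas on sample pages
5. Handle errors gracefully - Implement retry logic for failed requests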
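Teaching Focus: How to Create Schemas
This skill is designed to teach agents how to use the Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.
Learning Path
1. Start Simple - Use references/simple_article.json (4 basic fields)
2. Test Extensively - Try schemas on multiple page types
3. Iterate - Add fields based on what the page actually contains
4. Optimize - Remove unnecessary fields for speed
See the Schema Creation Guide for detailed instructions and examples.
Common Mistakes to Avoid
- Over-complex schemas - Start with 2-3 fields, not 20
- Missing fields - Don't require fields that don't exist on the page
- No testing - Always test with example.com first, then target sites
- Ignoring timeouts - Complex schemas take longer (45s timeout)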
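Babashka Advantages
Using Babashka for this skill provides:
1. Single binary - Easy to share/install (GitHub releases, brew, nix)
2. Fast startup - No JVM warmup, ~50ms startup time
3. Built-in HTTP client - No external dependencies
4. Clojure syntax - Familiar to you (Wes), expressive
5. Retry logic & caching - Built into the skill
6. Batch processing - Parallel extraction for multiple URLs
Example User Requests
For this skill to trigger:
- "Scrape job listings from Docker careers page"
- "Extract the main content from this article"
- "Get structured product data from this e-commerce page"
- "Pull all the news articles from this site"
- "Extract contact information from this company page"
- "Batch extract job listings from these 20 URLs"
- "Get cached results for this page (avoid API calls)"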