Tabstack Extractor

Overview

This skill enables structured data extraction from websites using the Tabstack API. It's ideal for web scraping tasks where you need consistent, schema-based data extraction from job boards, news sites, product pages, or any structured content.

Quick Start

1. Install Babashka (if needed)

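```bash
# Option A: Install from GitHub (recommended for sharing)
curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash

# Option B: Install from Nix
nix-shell -p babashka

# Option C: Install from Homebrew
brew install borkdude/brew/babashka
```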

2. Set up API Key

Option A: Environment variable (recommended)
```bash
export TABSTACK_API_KEY="your_api_key_here"
```

Option B: Configuration file
```bash
mkdir -p ~/.config/tabstack
echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
```

Get an API key: Sign up at Tabstack Console
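
As a rough illustration of how a wrapper might pick up the key from either option, here is a minimal Python sketch. The lookup order (environment variable first, then `~/.config/tabstack/config.edn`) and the helper names `parse_api_key` and `resolve_api_key` are assumptions made for this example, not the documented behaviour of the bundled scripts.

```python
import os
import re
from pathlib import Path
from typing import Mapping, Optional

# Assumed default location, matching the configuration-file option.
CONFIG_PATH = Path.home() / ".config" / "tabstack" / "config.edn"


def parse_api_key(edn_text: str) -> Optional[str]:
    """Pull the :api-key string out of a small EDN map like {:api-key "abc"}."""
    match = re.search(r':api-key\s+"([^"]+)"', edn_text)
    return match.group(1) if match else None


def resolve_api_key(
    env: Optional[Mapping[str, str]] = None,
    config_path: Path = CONFIG_PATH,
) -> Optional[str]:
    """Prefer TABSTACK_API_KEY, then fall back to the config file (assumed order)."""
    env = os.environ if env is None else env
    key = env.get("TABSTACK_API_KEY")
    if key:
        return key
    if config_path.is_file():
        return parse_api_key(config_path.read_text(encoding="utf-8"))
    return None
```

Returning `None` instead of raising leaves it to the caller to decide how to report a missing key.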

3. Test Connection

```bash
bb scripts/tabstack.clj test
```

4. Extract Markdown (Simple)

```bash
bb scripts/tabstack.clj markdown "https://example.com"
```

5. Extract JSON (Start Simple)

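```bash
# Start with a simple schema (fast, reliable)
bb scripts/tabstack.clj json "https://example.com" references/simple_article.json

# Try a more complex schema (may be slower)
bb scripts/tabstack.clj json "https://news.site" references/news_schema.json
```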

6. Advanced Features

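```bash
# Extract with retry logic (3 retries, 1s delay)
bb scripts/tabstack.clj json-retry "https://example.com" references/simple_article.json

# Extract with caching (24-hour cache)
bb scripts/tabstack.clj json-cache "https://example.com" references/simple_article.json

# Batch extract from a file of URLs
echo "https://example.com" > urls.txt
echo "https://example.org" >> urls.txt
bb scripts/tabstack.clj batch urls.txt references/simple_article.json
```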
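
To make the caching idea concrete, the following Python sketch shows one way a 24-hour cache keyed on URL and schema could behave. It is an in-memory illustration only: `cached_extract`, the `extract` callable, and the dictionary store are hypothetical names, and how tabstack.clj actually stores its cache is not described here.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour lifetime


def cache_key(url: str, schema: dict) -> str:
    """Derive a stable key so the same URL with a different schema is cached apart."""
    raw = url + "\n" + json.dumps(schema, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_extract(
    url: str,
    schema: dict,
    extract: Callable[[str, dict], Any],  # hypothetical extraction function
    cache: Dict[str, dict],
    now: Callable[[], float] = time.time,
) -> Any:
    """Return a fresh cached result if one exists, otherwise call extract and store it."""
    key = cache_key(url, schema)
    entry = cache.get(key)
    if entry is not None and now() - entry["stored_at"] < CACHE_TTL_SECONDS:
        return entry["data"]
    data = extract(url, schema)
    cache[key] = {"stored_at": now(), "data": data}
    return data
```

A cache that survives between runs would need to write these entries to disk. The dictionary keeps the sketch short.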

Core Capabilities

1. Markdown Extraction

Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.

When to use: When you need the textual content of a page without the HTML clutter.

Example use cases:

- Extract article content for summarization
- Archive webpage content
- Analyze blog post content

2. JSON Schema Extraction

Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.

When to use: When scraping job listings, product pages, news articles, or any structured data.

Example use cases:

- Scrape job listings from BuiltIn/LinkedIn
- Extract product details from e-commerce sites
- Gather news articles with consistent metadata

3. Schema Templates

Pre-built schemas for common scraping tasks. See references/ directory for templates.

Available schemas:

- Job listing schema (see references/job_schema.json)
- News article schema
- Product page schema
- Contact information schema

Workflow: Job Scraping Example

Follow this workflow to scrape job listings:

1. Identify target sites - BuiltIn, LinkedIn, company career pages
2. Choose or create schema - Use references/job_schema.json or customize
3. Test extraction - Run a single page to verify the schema works
4. Scale up - Process multiple URLs
5. Store results - Save to database or file

Example job schema:
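```json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"},
    "salary": {"type": "string"},
    "apply_url": {"type": "string"},
    "posted_date": {"type": "string"},
    "requirements": {"type": "array", "items": {"type": "string"}}
  }
}
```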
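
For the "test extraction" step, it can help to check a returned record against the schema before scaling up. The Python sketch below uses the field names from this document's example job schema. `check_record` is a hypothetical helper that covers only the two JSON Schema types the example uses (`string` and `array`); it is not a full JSON Schema validator.

```python
from typing import List

JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "description": {"type": "string"},
        "salary": {"type": "string"},
        "apply_url": {"type": "string"},
        "posted_date": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
    },
}

# Only the JSON Schema types that appear in the example schema are mapped.
PYTHON_TYPES = {"string": str, "array": list}


def check_record(record: dict, schema: dict) -> List[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in record:
            problems.append(f"missing: {name}")
            continue
        value = record[name]
        if not isinstance(value, PYTHON_TYPES[spec["type"]]):
            problems.append(f"wrong type: {name}")
        elif spec["type"] == "array":
            item_type = PYTHON_TYPES[spec["items"]["type"]]
            if not all(isinstance(item, item_type) for item in value):
                problems.append(f"wrong item type: {name}")
    return problems
```

A "missing" report is not always an error. Some pages simply do not list a salary, in which case the better fix is usually to drop the field from the schema.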

Integration with Other Skills

Combine with Web Search

1. Use web_search to find relevant URLs
2. Use Tabstack to extract structured data from those URLs
3. Store results in Datalevin (future skill)

Combine with Browser Automation

1. Use browser tool to navigate complex sites
2. Extract page URLs
3. Use Tabstack for structured extraction

Error Handling

Common issues and solutions:

1. Authentication failed - Check TABSTACK_API_KEY environment variable
2. Invalid URL - Ensure URL is accessible and correct
3. Schema mismatch - Adjust schema to match page structure
4. Rate limiting - Add delays between requests
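
Transient failures such as rate limiting are usually handled by retrying after a pause. The Python sketch below retries a call up to three times with a one-second delay, mirroring the figures quoted for the `json-retry` command. `ExtractionError` and `with_retry` are hypothetical names for illustration, and the bundled scripts may implement retries differently.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ExtractionError(Exception):
    """Hypothetical error type standing in for a failed extraction request."""


def with_retry(
    fn: Callable[[], T],
    retries: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ExtractionError up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return fn()
        except ExtractionError:
            if attempt >= retries:
                raise  # give up and surface the last error
            attempt += 1
            sleep(delay_seconds)
```

Authentication failures and invalid URLs are not worth retrying, since the same request will fail again. Only errors that may clear up on their own should be mapped to the retried exception type.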

Resources

scripts/

- tabstack.clj - Main API wrapper in Babashka (recommended, has retry logic, caching, batch processing)
- tabstack_curl.sh - Bash/curl fallback (simple, no dependencies)
- tabstack_api.py - Python API wrapper (requires requests module)

references/

- job_schema.json - Template schema for job listings
- api_reference.md - Tabstack API documentation

Best Practices

1. Start small - Test with single pages before scaling
2. Respect robots.txt - Check site scraping policies
3. Add delays - Avoid overwhelming target sites
4. Validate schemas - Test schemas on sample pages
5. Handle errors gracefully - Implement retry logic for failed requests
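
Several of these practices can be combined in one small loop: process URLs one at a time, pause between requests, and record failures instead of aborting the run. The Python sketch below assumes a caller-supplied `extract` function; the one-second default delay and the helper names are illustrative choices, not values taken from the Tabstack API.

```python
import json
import time
from typing import Any, Callable, Iterable, List


def extract_all(
    urls: Iterable[str],
    extract: Callable[[str], Any],  # hypothetical single-URL extraction function
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Extract each URL in turn, pausing between requests and recording failures."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # be polite to the target site
        try:
            results.append({"url": url, "ok": True, "data": extract(url)})
        except Exception as exc:  # keep going; one bad page should not stop the batch
            results.append({"url": url, "ok": False, "error": str(exc)})
    return results


def to_json_lines(results: List[dict]) -> str:
    """Serialize results as JSON Lines, one record per line, ready to write to a file."""
    return "\n".join(json.dumps(result, sort_keys=True) for result in results)
```

Keeping failed URLs in the output makes it easy to re-run only the pages that need another attempt.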

Teaching Focus: How to Create Schemas

This skill is designed to teach agents how to use the Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.

Learning Path

1. Start Simple - Use references/simple_article.json (4 basic fields)
2. Test Extensively - Try schemas on multiple page types
3. Iterate - Add fields based on what the page actually contains
4. Optimize - Remove unnecessary fields for speed

See Schema Creation Guide for detailed instructions and examples.

Common Mistakes to Avoid

- Over-complex schemas - Start with 2-3 fields, not 20
- Missing fields - Don't require fields that don't exist on the page
- No testing - Always test with example.com first, then target sites
- Ignoring timeouts - Complex schemas take longer (45s timeout)
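
One way to follow the "start with 2-3 fields" advice without maintaining several schema files is to derive a smaller schema from a larger one. The Python sketch below does that. `narrow_schema` is a hypothetical helper, and it assumes a flat object schema with a `properties` map and an optional `required` list.

```python
import copy
from typing import Iterable


def narrow_schema(schema: dict, keep: Iterable[str]) -> dict:
    """Return a copy of schema that contains only the named top-level properties."""
    wanted = set(keep)
    narrowed = copy.deepcopy(schema)  # leave the original schema untouched
    narrowed["properties"] = {
        name: spec
        for name, spec in narrowed.get("properties", {}).items()
        if name in wanted
    }
    if "required" in narrowed:
        # Drop requirements for fields that are no longer part of the schema.
        narrowed["required"] = [name for name in narrowed["required"] if name in wanted]
    return narrowed
```

Starting from the narrowed schema and adding fields back one at a time makes it easier to see which field slows an extraction down or fails to match the page.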

Babashka Advantages

Using Babashka for this skill provides:

1. Single binary - Easy to share/install (GitHub releases, brew, nix)
2. Fast startup - No JVM warmup, ~50ms startup time
3. Built-in HTTP client - No external dependencies
4. Clojure syntax - Familiar to you (Wes), expressive
5. Retry logic & caching - Built into the skill
6. Batch processing - Parallel extraction for multiple URLs
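
For comparison, the parallel batch idea can be sketched in Python with a small thread pool. This is an illustration of the technique only, not a port of tabstack.clj: `extract_parallel` and the `extract` callable are hypothetical, and the worker count of 4 is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def extract_parallel(
    urls: Sequence[str],
    extract: Callable[[str], Any],  # hypothetical single-URL extraction function
    max_workers: int = 4,
) -> List[Any]:
    """Run extract over urls concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # and re-raises the first exception when its result is reached.
        return list(pool.map(extract, urls))
```

A small worker count keeps parallel extraction compatible with the advice to avoid overwhelming target sites.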

Example User Requests

Requests like these should trigger this skill:

- "Scrape job listings from Docker careers page"
- "Extract the main content from this article"
- "Get structured product data from this e-commerce page"
- "Pull all the news articles from this site"
- "Extract contact information from this company page"
- "Batch extract job listings from these 20 URLs"
- "Get cached results for this page (avoid API calls)"

Tabstack Extractor

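Overview

This skill enables structured data extraction from websites using the Tabstack API. It's ideal for web scraping tasks where you need consistent, schema-based data extraction from job boards, news sites, product pages, or any structured content.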

Quick Start

1. Install Babashka (if needed)

```bash
# Option A: Install from GitHub (recommended for sharing)
curl -s https://raw.githubusercontent.com/babashka/babashka/master/install | bash

# Option B: Install from Nix
nix-shell -p babashka

# Option C: Install from Homebrew
brew install borkdude/brew/babashka
```

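2. Set up API Key

Option A: Environment variable (recommended)
```bash
export TABSTACK_API_KEY="your_api_key_here"
```

Option B: Configuration file
```bash
mkdir -p ~/.config/tabstack
echo '{:api-key "your_api_key_here"}' > ~/.config/tabstack/config.edn
```

Get an API key: Sign up at Tabstack Console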

3. Test Connection

```bash
bb scripts/tabstack.clj test
```

4. Extract Markdown (Simple)

```bash
bb scripts/tabstack.clj markdown "https://example.com"
```

5. Extract JSON (Start Simple)

```bash
# Start with a simple schema (fast, reliable)
bb scripts/tabstack.clj json "https://example.com" references/simple_article.json

# Try a more complex schema (may be slower)
bb scripts/tabstack.clj json "https://news.site" references/news_schema.json
```

6. Advanced Features

```bash
# Extract with retry logic (3 retries, 1s delay)
bb scripts/tabstack.clj json-retry "https://example.com" references/simple_article.json

# Extract with caching (24-hour cache)
bb scripts/tabstack.clj json-cache "https://example.com" references/simple_article.json

# Batch extract from a file of URLs
echo "https://example.com" > urls.txt
echo "https://example.org" >> urls.txt
bb scripts/tabstack.clj batch urls.txt references/simple_article.json
```

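Core Capabilities

1. Markdown Extraction

Extract clean, readable markdown from any webpage. Useful for content analysis, summarization, or archiving.

When to use: When you need the textual content of a page without the HTML clutter.

Example use cases:

- Extract article content for summarization
- Archive webpage content
- Analyze blog post content

2. JSON Schema Extraction

Extract structured data using JSON schemas. Define exactly what data you want and get it in a consistent format.

When to use: When scraping job listings, product pages, news articles, or any structured data.

Example use cases:

- Scrape job listings from BuiltIn/LinkedIn
- Extract product details from e-commerce sites
- Gather news articles with consistent metadata

3. Schema Templates

Pre-built schemas for common scraping tasks. See the references/ directory for templates.

Available schemas:

- Job listing schema (see references/job_schema.json)
- News article schema
- Product page schema
- Contact information schema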

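Workflow: Job Scraping Example

Follow this workflow to scrape job listings:

1. Identify target sites - BuiltIn, LinkedIn, company career pages
2. Choose or create schema - Use references/job_schema.json or customize
3. Test extraction - Run a single page to verify the schema works
4. Scale up - Process multiple URLs
5. Store results - Save to database or file

Example job schema:
```json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "description": {"type": "string"},
    "salary": {"type": "string"},
    "apply_url": {"type": "string"},
    "posted_date": {"type": "string"},
    "requirements": {"type": "array", "items": {"type": "string"}}
  }
}
```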

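Integration with Other Skills

Combine with Web Search

1. Use web_search to find relevant URLs
2. Use Tabstack to extract structured data from those URLs
3. Store results in Datalevin (future skill)

Combine with Browser Automation

1. Use browser tool to navigate complex sites
2. Extract page URLs
3. Use Tabstack for structured extraction

Error Handling

Common issues and solutions:

1. Authentication failed - Check TABSTACK_API_KEY environment variable
2. Invalid URL - Ensure URL is accessible and correct
3. Schema mismatch - Adjust schema to match page structure
4. Rate limiting - Add delays between requests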

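Resources

scripts/

- tabstack.clj - Main API wrapper in Babashka (recommended, has retry logic, caching, batch processing)
- tabstack_curl.sh - Bash/curl fallback (simple, no dependencies)
- tabstack_api.py - Python API wrapper (requires requests module)

references/

- job_schema.json - Template schema for job listings
- api_reference.md - Tabstack API documentation

Best Practices

1. Start small - Test with single pages before scaling
2. Respect robots.txt - Check site scraping policies
3. Add delays - Avoid overwhelming target sites
4. Validate schemas - Test schemas on sample pages
5. Handle errors gracefully - Implement retry logic for failed requests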

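Teaching Focus: How to Create Schemas

This skill is designed to teach agents how to use the Tabstack API effectively. The key is learning to create appropriate JSON schemas for different websites.

Learning Path

1. Start Simple - Use references/simple_article.json (4 basic fields)
2. Test Extensively - Try schemas on multiple page types
3. Iterate - Add fields based on what the page actually contains
4. Optimize - Remove unnecessary fields for speed

See Schema Creation Guide for detailed instructions and examples.

Common Mistakes to Avoid

- Over-complex schemas - Start with 2-3 fields, not 20
- Missing fields - Don't require fields that don't exist on the page
- No testing - Always test with example.com first, then target sites
- Ignoring timeouts - Complex schemas take longer (45s timeout)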

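Babashka Advantages

Using Babashka for this skill provides:

1. Single binary - Easy to share/install (GitHub releases, brew, nix)
2. Fast startup - No JVM warmup, ~50ms startup time
3. Built-in HTTP client - No external dependencies
4. Clojure syntax - Familiar to you (Wes), expressive
5. Retry logic & caching - Built into the skill
6. Batch processing - Parallel extraction for multiple URLs

Example User Requests

Requests like these should trigger this skill:

- "Scrape job listings from Docker careers page"
- "Extract the main content from this article"
- "Get structured product data from this e-commerce page"
- "Pull all the news articles from this site"
- "Extract contact information from this company page"
- "Batch extract job listings from these 20 URLs"
- "Get cached results for this page (avoid API calls)"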
