When to use
- Extracting content from websites
- Handling paywalls and anti-bot measures
- Implementing scraping cascades with fallbacks
- Processing social media (YouTube, Instagram, TikTok)
- Finding and using undocumented APIs
What's included
Scraping cascade
Three-tier fallback: Trafilatura first (fast), then Requests (plain HTTP), then Playwright (JavaScript rendering with stealth).
Poison pill detection
Detect paywalls, CAPTCHAs, rate limits, Cloudflare, and login walls with pattern matching.
Undocumented APIs
Find and use hidden APIs via browser dev tools, with examples for autocomplete endpoints.
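As a rough illustration of calling an autocomplete endpoint once it has been spotted in the browser's network tab, the sketch below uses only the standard library. The endpoint URL, the `q` and `limit` parameters, and the `suggestions`/`text` response shape are all hypothetical placeholders; a real site will use its own names, which have to be read from the recorded request.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the URL seen in the dev tools network tab.
BASE_URL = "https://example.com/api/autocomplete"


def build_autocomplete_url(query: str, limit: int = 10) -> str:
    """Build the request URL; parameter names here are assumptions."""
    return f"{BASE_URL}?{urlencode({'q': query, 'limit': limit})}"


def parse_suggestions(payload: str) -> List[str]:
    """Pull suggestion strings out of an assumed JSON response shape."""
    data = json.loads(payload)
    return [item["text"] for item in data.get("suggestions", [])]


def fetch_suggestions(query: str) -> List[str]:
    """Call the endpoint with headers similar to what a browser would send."""
    request = Request(
        build_autocomplete_url(query),
        headers={
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urlopen(request, timeout=10) as response:
        return parse_suggestions(response.read().decode("utf-8"))
```

Keeping URL construction and response parsing in separate functions makes it possible to check both against a saved request and response before sending any live traffic.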
Social media tools
yt-dlp for YouTube/TikTok, instaloader for Instagram, with metadata extraction and download patterns.
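A minimal sketch of metadata extraction with yt-dlp's Python API, assuming the `yt_dlp` package is installed. The helper names and the choice of fields to keep are illustrative, not taken from the skill itself; `extract_info(..., download=False)` returns metadata without downloading the media.

```python
from typing import Any, Dict

# Fields kept from yt-dlp's info dict; this selection is an assumption.
KEEP_FIELDS = (
    "id", "title", "uploader", "upload_date",
    "duration", "view_count", "webpage_url",
)


def summarize_info(info: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a full yt-dlp info dict to a small, stable set of fields."""
    return {key: info.get(key) for key in KEEP_FIELDS}


def fetch_video_metadata(url: str) -> Dict[str, Any]:
    """Fetch metadata only (no media download) for a single video URL."""
    import yt_dlp  # imported lazily so summarize_info works without it

    options = {"quiet": True, "skip_download": True, "noplaylist": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(url, download=False)
    return summarize_info(info)
```

Fields that a platform does not expose come back as `None` rather than raising, which keeps downstream records uniform across YouTube and TikTok.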
Scraping cascade architecture
Try multiple extraction strategies with automatic fallback:
Trafilatura
Lightweight extraction for standard articles. Best for news sites and blogs.
Requests + BeautifulSoup
HTTP requests with rotating user agents. Good for static content.
Playwright with stealth
Full JavaScript rendering with anti-bot bypass. For SPAs and protected sites.
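The three tiers above can be sketched as a list of extractor functions tried in order. This is one possible arrangement, not the skill's actual code: the function names, the `MIN_CHARS` threshold, and the user-agent strings are assumptions, and the stealth layer for Playwright is left out because its setup depends on the plugin and version used.

```python
import random
from typing import Callable, List, Optional, Tuple

MIN_CHARS = 200  # assumed threshold for "enough text to count as success"
USER_AGENTS = [  # illustrative values; rotate whatever set you maintain
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
]


def extract_with_trafilatura(url: str) -> Optional[str]:
    import trafilatura  # lazy import: a missing package only fails this tier

    downloaded = trafilatura.fetch_url(url)
    return trafilatura.extract(downloaded) if downloaded else None


def extract_with_requests(url: str) -> Optional[str]:
    import requests
    from bs4 import BeautifulSoup

    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(" ", strip=True)


def extract_with_playwright(url: str) -> Optional[str]:
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle", timeout=30_000)
            return page.inner_text("body")
        finally:
            browser.close()


Tier = Tuple[str, Callable[[str], Optional[str]]]
DEFAULT_TIERS: List[Tier] = [
    ("trafilatura", extract_with_trafilatura),
    ("requests", extract_with_requests),
    ("playwright", extract_with_playwright),
]


def run_cascade(
    url: str,
    tiers: List[Tier] = DEFAULT_TIERS,
    min_chars: int = MIN_CHARS,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (tier_name, text) from the first tier that yields enough text."""
    for name, extractor in tiers:
        try:
            text = extractor(url)
        except Exception:
            continue  # any failure (import, network, timeout) falls through
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None, None
```

Calling `run_cascade("https://example.com/article")` returns the name of the tier that succeeded along with the text, or `(None, None)` if every tier failed. Ordering the tiers from cheapest to most expensive means a browser is launched only when the lighter methods come back empty.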
Poison pill types
| Type | Detection patterns |
|---|---|
| Paywall | "subscribe to continue", "you've reached your limit" |
| CAPTCHA | "verify you are human", "robot verification" |
| Rate limit | "too many requests", HTTP 429 |
| Cloudflare | "checking your browser", "ddos protection" |
| Login required | "sign in to continue", "create an account" |
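The table above maps naturally onto a small pattern-matching function. The sketch below uses the phrases listed in the table; the function name, the category keys, and the order in which categories are checked are assumptions.

```python
import re
from typing import Dict, List, Optional, Pattern

# Phrases come from the table above; category keys are illustrative.
_RAW_PATTERNS: Dict[str, List[str]] = {
    "paywall": [r"subscribe to continue", r"you['’]ve reached your limit"],
    "captcha": [r"verify you are human", r"robot verification"],
    "rate_limit": [r"too many requests"],
    "cloudflare": [r"checking your browser", r"ddos protection"],
    "login_required": [r"sign in to continue", r"create an account"],
}

POISON_PATTERNS: Dict[str, List[Pattern[str]]] = {
    kind: [re.compile(p, re.IGNORECASE) for p in patterns]
    for kind, patterns in _RAW_PATTERNS.items()
}


def detect_poison_pill(text: str, status_code: Optional[int] = None) -> Optional[str]:
    """Return the first matching poison pill type, or None if the page looks clean."""
    if status_code == 429:  # HTTP 429 signals rate limiting regardless of body
        return "rate_limit"
    for kind, patterns in POISON_PATTERNS.items():
        if any(pattern.search(text) for pattern in patterns):
            return kind
    return None
```

For example, `detect_poison_pill("Please subscribe to continue reading")` returns `"paywall"`. Phrase matching of this kind can produce false positives on articles that merely mention a login or subscription, so treating a match as a signal to retry with another method is safer than discarding the page outright.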
Installation
```shell
# Clone the repository
git clone https://github.com/jamditis/claude-skills-journalism.git
# Copy the skill to your Claude config
cp -r claude-skills-journalism/web-scraping ~/.claude/skills/
```
Or download just this skill from the GitHub repository.
Extract what you need, ethically
Cascade architecture, poison pill detection, and social media tools in one skill.