v1.0 // Active development

SCRAPEFRUIT

A Python web application for scraping the web through a visual interface, with cascade scraping, anti-bot bypass, and local LLM integration.

Python 3.11+ // 170+ tests

01 // FEATURES

CORE CAPABILITIES

Cascade scraping

Multi-method fallback system: HTTP → Playwright → Puppeteer → Agent-browser → Browser-use. Detects blocks automatically and escalates to the next method.
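
The escalation order can be pictured as a loop over fetchers, from cheapest to most expensive. This is a minimal sketch rather than Scrapefruit's actual code: `FetchResult`, `Fetcher`, `cascade`, and the `is_blocked` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class FetchResult:
    """Hypothetical container for what one scraping method returned."""
    method: str
    status: int
    text: str


# Hypothetical type: a fetcher takes a URL and returns a FetchResult.
Fetcher = Callable[[str], FetchResult]


def cascade(
    url: str,
    fetchers: Iterable[Fetcher],
    is_blocked: Callable[[FetchResult], bool],
) -> Optional[FetchResult]:
    """Try each fetcher in order and return the first unblocked result."""
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # A crashed method is treated like a block: move on to the next one.
            continue
        if not is_blocked(result):
            return result
    # Every method was blocked or failed.
    return None
```

Passing the fetchers in as plain callables keeps the ordering explicit and makes each method easy to replace with a stub in tests.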

Anti-bot bypass

Playwright-stealth integration for handling Cloudflare, CAPTCHAs, and rate limiting. User agent rotation included.

Poison pill detection

Automatic detection of paywalls, rate limiting, anti-bot patterns, dead links, and login walls. Never scrape garbage.
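
One way to picture this kind of detection is a small classifier over the status code and page text. The sketch below is an assumption-laden illustration: the `PoisonPill` enum, the phrase lists, and `detect_poison_pill` are hypothetical and far simpler than whatever rules the project really uses.

```python
from enum import Enum
from typing import Optional


class PoisonPill(Enum):
    PAYWALL = "paywall"
    RATE_LIMITED = "rate_limited"
    ANTI_BOT = "anti_bot"
    DEAD_LINK = "dead_link"
    LOGIN_WALL = "login_wall"


# Illustrative phrases only; real pages need a much larger, tuned list.
_PHRASES = {
    PoisonPill.PAYWALL: ("subscribe to continue", "subscribers only"),
    PoisonPill.LOGIN_WALL: ("sign in to continue", "log in to view"),
    PoisonPill.ANTI_BOT: ("verify you are human", "captcha"),
}


def detect_poison_pill(status: int, text: str) -> Optional[PoisonPill]:
    """Return the kind of unusable page detected, or None if it looks fine."""
    if status in (404, 410):
        return PoisonPill.DEAD_LINK
    if status == 429:
        return PoisonPill.RATE_LIMITED
    lowered = text.lower()
    for pill, phrases in _PHRASES.items():
        if any(phrase in lowered for phrase in phrases):
            return pill
    return None
```

A caller could skip storing any page for which this returns a value, and record the reason instead.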

Local LLM integration

Free local inference via Ollama. Summarization, entity extraction, classification. No API costs.
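
As a rough illustration of local inference, the sketch below posts a summarization prompt to Ollama's `/api/generate` endpoint on its default port using only the standard library. The function names, the prompt wording, and the `llama3.2` model name are assumptions; how Scrapefruit itself talks to Ollama is not shown here.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_summary_request(text: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build the POST request; the model name is an assumed example."""
    payload = {
        "model": model,
        "prompt": f"Summarize the following page in three sentences:\n\n{text}",
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the request to a running Ollama server and return its reply text."""
    request = build_summary_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

`summarize` needs a running Ollama server with the chosen model already pulled; `build_summary_request` can be inspected without one.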

Video transcription

Extract and transcribe videos from YouTube, Twitter/X, TikTok, and 1000+ platforms via yt-dlp + Whisper.

Vision/OCR fallback

When DOM extraction fails, automatically capture screenshots and use Tesseract OCR to extract text.

02 // CASCADE_SYSTEM

FALLBACK STRATEGY
| Method | Speed | JS support | Use case |
|---|---|---|---|
| HTTP | Fastest | No | Static pages, APIs |
| Playwright | Medium | Yes | JavaScript-heavy sites, stealth mode |
| Puppeteer | Medium | Yes | Alternative browser fingerprint |
| Agent-browser | Slower | Yes | AI-optimized with accessibility tree |
| Browser-use | Slowest | Yes | LLM-controlled automation |
| Video | Varies | N/A | YouTube, Twitter/X, TikTok, 1000+ sites |
FALLBACK_TRIGGERS:
  • Blocked status codes (403, 429, 503)
  • Anti-bot detection patterns (Cloudflare, CAPTCHA)
  • Empty or minimal content (<500 chars)
  • JavaScript-heavy SPA markers
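
The four triggers above can be sketched as a single check that returns the reason for escalating. The status codes and the 500-character threshold come from the list; the regex patterns, the SPA markers, and the `escalation_reason` name are illustrative assumptions, not the project's real rules.

```python
import re
from typing import Optional

BLOCKED_STATUS = {403, 429, 503}
MIN_CONTENT_CHARS = 500

# Assumed examples of challenge-page text; real detection needs more patterns.
ANTI_BOT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"just a moment", r"attention required", r"captcha")
]

# Assumed examples of empty mount points left for client-side rendering.
SPA_MARKERS = ('<div id="root"></div>', '<div id="app"></div>')


def escalation_reason(status: int, html: str, visible_text: str) -> Optional[str]:
    """Return why the next method should be tried, or None to accept the page."""
    if status in BLOCKED_STATUS:
        return "blocked_status"
    if any(pattern.search(html) for pattern in ANTI_BOT_PATTERNS):
        return "anti_bot"
    if any(marker in html for marker in SPA_MARKERS):
        return "spa_marker"
    if len(visible_text.strip()) < MIN_CONTENT_CHARS:
        return "minimal_content"
    return None
```

Returning a reason string instead of a bare boolean makes it easier to log why a page was escalated.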

03 // QUICK_START

GET RUNNING
terminal
# Clone and setup
git clone https://github.com/jamditis/scrapefruit.git
cd scrapefruit

# Create virtual environment
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
playwright install

# Configure and run
cp .env.example .env
python main.py
Requirements
  • Python 3.11+
  • Chromium (via Playwright)
  • Tesseract OCR (optional)
Optional extras
  • Ollama for local LLM
  • yt-dlp + faster-whisper for video
  • Google Sheets credentials for export