
USER_MANUAL

Complete documentation for using Scrapefruit. From basic setup to advanced cascade configuration.

01

Installation

Requirements

> Python 3.11 or higher
> Chromium browser (installed via Playwright)
> 2GB RAM minimum (4GB recommended for LLM features)
terminal
# Clone the repository
git clone https://github.com/jamditis/scrapefruit.git
cd scrapefruit

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Install browser binaries
playwright install chromium
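
To verify the browser binaries installed correctly, you can launch the bundled Chromium from Python. This is an optional sanity check, not part of Scrapefruit's setup:

# Optional check: launch the bundled Chromium and print its version
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    print(browser.version)  # Chromium version string
    browser.close()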
02

Configuration

Copy the example environment file and customize it for your setup:

terminal
cp .env.example .env
.env
# Flask configuration
FLASK_DEBUG=true
SECRET_KEY=your-secret-key

# LLM (optional)
OLLAMA_MODEL=qwen2.5:0.5b

# Video transcription (optional)
WHISPER_MODEL=tiny
VIDEO_USE_2X_SPEED=true
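
If you want to confirm the values are being picked up, a quick check with python-dotenv (an assumption here; Scrapefruit's own startup code may load the file differently):

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env in the working directory
print(os.getenv("WHISPER_MODEL"))  # should print "tiny" with the file above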
03

Basic usage

terminal
# Start the application
python main.py

# Access the web interface
http://127.0.0.1:5150
Create a job

Click "New Job" and enter the URLs you want to scrape. Add extraction rules using CSS selectors or XPath.

Run the job

Hit "Start" to begin scraping. Watch real-time progress via the activity log. Results appear as they complete.

04

Cascade scraping

The cascade system automatically escalates through scraping methods when one fails. This handles blocks, anti-bot measures, and JavaScript-heavy sites without manual intervention.

Default cascade order

HTTP → Playwright → Puppeteer → Agent-browser → Browser-use
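
Conceptually, the escalation works like the sketch below. The fetcher callables and the should_fallback check (sketched under the next heading) are placeholders, not Scrapefruit's actual internals:

def cascade_fetch(url, fetchers):
    result = None
    for fetch in fetchers:               # configured order, HTTP first by default
        result = fetch(url)
        if not should_fallback(result):  # see the triggers below
            return result                # first acceptable result wins
    return result                        # every method tried; return the last attempt
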
When does fallback trigger?

> HTTP status codes: 403, 429, 503

> Anti-bot patterns: "cloudflare", "captcha", "challenge"

> Content too short: less than 500 characters

> JavaScript markers: empty body, SPA indicators
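
Putting those triggers together, the check might look like this (a sketch only; the result attributes are illustrative, not Scrapefruit's real API):

BLOCK_STATUSES = {403, 429, 503}
ANTI_BOT_MARKERS = ("cloudflare", "captcha", "challenge")

def should_fallback(result) -> bool:
    if result.status in BLOCK_STATUSES:           # hard blocks
        return True
    body = result.text.lower()
    if any(m in body for m in ANTI_BOT_MARKERS):  # anti-bot challenge pages
        return True
    if len(body.strip()) < 500:                   # too short, or an empty SPA shell
        return True
    return False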

How do I customize the cascade per job?

In the job settings panel, you can:

  • Enable or disable cascade entirely
  • Select which methods to include
  • Reorder methods via drag-and-drop
  • Configure custom fallback triggers
05

Data extraction

Define extraction rules to pull specific data from pages. Supports CSS selectors, XPath, and OCR fallback.

CSS selectors

Standard CSS selector syntax

article h1.title

XPath

Full XPath expression support

//div[@class='content']

Vision/OCR

Screenshot + Tesseract fallback (auto-enabled)
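
To get a feel for how the two selector types behave, you can try them outside Scrapefruit with lxml (illustrative only; this is not the internal extraction engine, and CSS support needs the cssselect package):

from lxml import html  # pip install lxml cssselect

sample = "<article><h1 class='title'>Breaking news</h1></article>"
doc = html.fromstring(sample)

print(doc.cssselect("article h1.title")[0].text_content())  # CSS rule
print(doc.xpath("//h1[@class='title']/text()"))             # XPath rule
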
Tip: Auto-extract from HTML

Use the "Analyze HTML" feature to automatically detect extraction rules from a sample page. Works best on modern sites with semantic HTML and Open Graph tags.

06

LLM integration

Process scraped content with local LLMs via Ollama. Summarize articles, extract entities, classify content—all without API costs.

Setup Ollama
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5:0.5b  # 400MB, good for low-memory
ollama pull llama3:8b     # Better quality, needs 8GB+ RAM
Python usage
from core.llm import get_llm_service

llm = get_llm_service()

# Summarize content
summary = llm.summarize(article_text)

# Extract entities
entities = llm.extract_entities(text)
# Returns: {"people": [...], "organizations": [...], "dates": [...]}

# Classify content
category = llm.classify(text, ["news", "opinion", "analysis"])
07

Video transcription

Extract and transcribe videos from YouTube, Twitter/X, TikTok, and 1000+ platforms using yt-dlp and Whisper.

Supported platforms

YouTube, Vimeo, Twitter/X, TikTok, Facebook, Instagram, Twitch, Dailymotion, +1000 more
Python usage
from core.scraping.fetchers.video_fetcher import VideoFetcher

fetcher = VideoFetcher(
    whisper_model="tiny",  # tiny, base, small, medium, large
    use_2x_speed=True,     # halves transcription time
)

result = fetcher.fetch("https://youtube.com/watch?v=...")

# Access results
print(result.transcript)      # Plain text
print(result.to_srt())        # SRT subtitles
print(result.metadata.title)  # Video metadata
08

Export options

SQLite (default)

All results are automatically stored in a local SQLite database at data/scrapefruit.db.

No configuration required.
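
You can query the database directly with Python's built-in sqlite3 module. The table layout isn't documented here, so list the tables first:

import sqlite3

conn = sqlite3.connect("data/scrapefruit.db")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)  # discover where results are stored before querying
conn.close()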

Google Sheets

Export results directly to Google Sheets for collaboration and analysis.

Requires service account credentials.
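
Scrapefruit handles the export itself; if you just want to confirm your service account credentials work, a minimal check with the gspread library (the key-file path below is a hypothetical example, point it at your own file):

import gspread  # pip install gspread

gc = gspread.service_account(filename="service-account.json")  # hypothetical path
print([sheet.title for sheet in gc.openall()])  # spreadsheets the account can access

Remember that a service account can only see spreadsheets that have been shared with its email address.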

09

Troubleshooting

Playwright browser errors

Reinstall the browser binaries:

playwright install chromium
Still getting blocked?
  • Increase delays between requests in job settings
  • Enable cascade mode to try alternative fetchers
  • Check if the site requires login (the poison pill detector will flag this)
  • Consider using residential proxies for heavy scraping
Ollama not detected

Make sure Ollama is running:

ollama serve

The service auto-detects Ollama at localhost:11434. Set OLLAMA_BASE_URL in .env if using a different host.
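
A quick reachability check from Python (the /api/tags endpoint lists Ollama's installed models):

import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    print(resp.status)  # 200 means Ollama is up and reachable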

Video transcription fails
  • Ensure yt-dlp is installed: pip install yt-dlp
  • Ensure faster-whisper is installed: pip install faster-whisper
  • For 2x speed processing, install ffmpeg
  • Try a smaller Whisper model if running out of memory