User manual
Complete documentation for using Scrapefruit. From basic setup to advanced cascade configuration.
Installation
Requirements
- Python 3.11 or higher
- Chromium browser (installed via Playwright)
- 2GB RAM minimum (4GB recommended for LLM features)
# Clone the repository
git clone https://github.com/jamditis/scrapefruit.git
cd scrapefruit
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Install browser binaries
playwright install chromium
Configuration
Copy the example environment file and customize it for your setup:
cp .env.example .env
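The available keys depend on your .env.example. A typical file might look like the snippet below; apart from OLLAMA_BASE_URL, which this manual documents, the key names are illustrative:

```
# Web interface bind address (illustrative keys; check .env.example)
HOST=127.0.0.1
PORT=5150

# Ollama endpoint (auto-detected at localhost:11434 if unset)
OLLAMA_BASE_URL=http://localhost:11434
```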
Basic usage
# Start the application
python main.py
# Access the web interface
http://127.0.0.1:5150
Click "New Job" and enter the URLs you want to scrape. Add extraction rules using CSS selectors or XPath.
Click "Start" to begin scraping. Watch real-time progress in the activity log. Results appear as they complete.
Cascade scraping
The cascade system automatically escalates through scraping methods when one fails. This handles blocks, anti-bot measures, and JavaScript-heavy sites without manual intervention.
Fallback triggers
The cascade escalates to the next method when any of these conditions is detected:
- HTTP status codes: 403, 429, 503
- Anti-bot patterns: "cloudflare", "captcha", "challenge"
- Content too short: less than 500 characters
- JavaScript markers: empty body, SPA indicators
In the job settings panel, you can:
- Enable or disable cascade entirely
- Select which methods to include
- Reorder methods via drag-and-drop
- Configure custom fallback triggers
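The escalation logic can be sketched roughly as follows. This is an illustrative model of the behavior described above, not Scrapefruit's actual internals; the function names and fetcher interface are hypothetical:

```python
# Hypothetical sketch of cascade escalation; names are illustrative.
ANTI_BOT_MARKERS = ("cloudflare", "captcha", "challenge")
MIN_CONTENT_LENGTH = 500

def looks_failed(status: int, body: str) -> bool:
    """Return True when a fetch result should trigger the next method."""
    if status in (403, 429, 503):
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in ANTI_BOT_MARKERS):
        return True
    if len(body.strip()) < MIN_CONTENT_LENGTH:
        return True
    return False

def fetch_with_cascade(url: str, fetchers):
    """Try each fetcher in order until one returns usable content."""
    for fetcher in fetchers:
        status, body = fetcher(url)
        if not looks_failed(status, body):
            return body
    raise RuntimeError(f"All fetchers failed for {url}")
```

Each fetcher here stands in for one method in the cascade (plain HTTP, headless browser, and so on); a failed result simply hands the URL to the next one.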
Data extraction
Define extraction rules to pull specific data from pages. Supports CSS selectors, XPath, and OCR fallback.
| Method | Description | Example |
| --- | --- | --- |
| CSS selectors | Standard CSS selector syntax | `article h1.title` |
| XPath | Full XPath expression support | `//div[@class='content']` |
| Vision/OCR | Screenshot + Tesseract fallback | auto-enabled |
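To see what the XPath rule above matches, you can try it against a snippet with Python's stdlib ElementTree, which supports a subset of XPath (Scrapefruit's own engine presumably supports the full language):

```python
import xml.etree.ElementTree as ET

# Apply the manual's example rule, //div[@class='content'], to a snippet.
snippet = """
<html><body>
  <div class="sidebar">nav</div>
  <div class="content">Article text here.</div>
</body></html>
"""

root = ET.fromstring(snippet)
matches = root.findall(".//div[@class='content']")
print(matches[0].text)  # -> Article text here.
```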
Use the "Analyze HTML" feature to automatically detect extraction rules from a sample page. Works best on modern sites with semantic HTML and Open Graph tags.
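A minimal sketch of why Open Graph tags make rule detection easy, using only the stdlib HTML parser (this is not Scrapefruit's actual analyzer):

```python
from html.parser import HTMLParser

class OGTagParser(HTMLParser):
    """Collect Open Graph <meta property="og:..."> tags from a page."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:"):
            self.og[prop] = attrs.get("content", "")

html = """<head>
<meta property="og:title" content="Example headline">
<meta property="og:type" content="article">
</head>"""

parser = OGTagParser()
parser.feed(html)
print(parser.og["og:title"])  # -> Example headline
```

Because OG tags name their fields explicitly, a detected `og:title` maps straight to an extraction rule; pages without semantic markup need manual selectors.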
LLM integration
Process scraped content with local LLMs via Ollama. Summarize articles, extract entities, classify content—all without API costs.
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5:0.5b # 400MB, good for low-memory
ollama pull llama3:8b # Better quality, needs 8GB+ RAM
from core.llm import get_llm_service
llm = get_llm_service()
# Summarize content
summary = llm.summarize(article_text)
# Extract entities
entities = llm.extract_entities(text)
# Returns: {"people": [...], "organizations": [...], "dates": [...]}
# Classify content
category = llm.classify(text, ["news", "opinion", "analysis"])
Video transcription
Extract and transcribe videos from YouTube, Twitter/X, TikTok, and 1000+ platforms using yt-dlp and Whisper.
from core.scraping.fetchers.video_fetcher import VideoFetcher
fetcher = VideoFetcher(
    whisper_model="tiny",  # tiny, base, small, medium, large
    use_2x_speed=True,     # Halves transcription time
)
result = fetcher.fetch("https://youtube.com/watch?v=...")
# Access results
print(result.transcript) # Plain text
print(result.to_srt()) # SRT subtitles
print(result.metadata.title) # Video metadata
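The SRT output implies a timestamp conversion like the one below. The SRT `HH:MM:SS,mmm` format itself is standard; the helper name is hypothetical, not VideoFetcher's internal API:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

print(srt_timestamp(3725.5))  # -> 01:02:05,500
```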
Export options
SQLite (default)
All results are automatically stored in a local SQLite database at data/scrapefruit.db.
No configuration required.
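Results can be inspected directly with Python's stdlib sqlite3 module. The table and column names below are illustrative, since this manual doesn't document the schema; an in-memory database is used so the sketch runs anywhere:

```python
import sqlite3

# The real database lives at data/scrapefruit.db; swap ":memory:" for
# that path to inspect it. Table/column names here are hypothetical --
# list the real ones with: SELECT name FROM sqlite_master WHERE type='table'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT, extracted TEXT)")
conn.execute("INSERT INTO results VALUES (?, ?)",
             ("https://example.com", "Article text"))
rows = conn.execute("SELECT url, extracted FROM results").fetchall()
print(rows[0])  # -> ('https://example.com', 'Article text')
conn.close()
```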
Google Sheets
Export results directly to Google Sheets for collaboration and analysis.
Requires service account credentials.
Troubleshooting
Browser fails to launch or pages won't render? Reinstall the browser binaries:
playwright install chromium
Getting blocked (403s, captchas)?
- Increase delays between requests in job settings
- Enable cascade mode to try alternative fetchers
- Check if the site requires login (the poison pill detector will flag this)
- Consider using residential proxies for heavy scraping
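Increasing delays usually also means randomizing them, so request timing looks less robotic. A minimal sketch of such a delay (not a Scrapefruit setting; the function name is hypothetical):

```python
import random
import time

def polite_delay(base: float = 2.0, jitter: float = 1.0) -> float:
    """Sleep for base seconds plus random jitter; return the delay used."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```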
LLM features unavailable? Make sure Ollama is running:
ollama serve
The service auto-detects Ollama at localhost:11434. Set OLLAMA_BASE_URL in .env if using a different host.
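The endpoint resolution described above can be sketched as follows; the OLLAMA_BASE_URL variable and default port come from this manual, while the helper name is illustrative:

```python
import os

DEFAULT_OLLAMA_URL = "http://localhost:11434"

def resolve_ollama_url() -> str:
    """Return the Ollama endpoint: OLLAMA_BASE_URL if set, else the default."""
    return os.environ.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL).rstrip("/")
```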
Video transcription failing?
- Ensure yt-dlp is installed: pip install yt-dlp
- Ensure faster-whisper is installed: pip install faster-whisper
- For 2x-speed processing, install ffmpeg
- Try a smaller Whisper model if running out of memory