2ab847f51d
- Add webscraper_search_hint() tool using Brave Search as backend (no CAPTCHA/GDPR consent wall, works with plain httpx)
- Add User-Agent header to _fetch_page(); fixes 403 on Wikipedia, Feynman Lectures, and other sites that block headless requests
- Add 5 new tests for search hint (23 total, 90% coverage)
Brave Search URL: https://search.brave.com/search?q={query}&source=web
Use sparingly: once per research task as orientation, not in loops
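The Brave Search URL template and the User-Agent fix noted above could be combined roughly as follows. This is a minimal sketch, not the server's actual code: the names `BRAVE_SEARCH_TEMPLATE`, `DEFAULT_HEADERS`, and `build_search_url` are hypothetical, and the User-Agent string is only an example of a browser-like value, since the one used by `_fetch_page()` is not shown here.

```python
from urllib.parse import quote_plus

# URL template taken from the commit notes; the constant name is hypothetical.
BRAVE_SEARCH_TEMPLATE = "https://search.brave.com/search?q={query}&source=web"

# Example browser-like User-Agent (an assumption; the real header value
# sent by _fetch_page() is not shown in this document).
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def build_search_url(query: str) -> str:
    """Percent-encode the query and substitute it into the search template."""
    return BRAVE_SEARCH_TEMPLATE.format(query=quote_plus(query.strip()))
```

With httpx, a request could then be sent as `httpx.get(build_search_url("feynman lectures"), headers=DEFAULT_HEADERS, follow_redirects=True)`. Encoding the query up front keeps characters such as `&` from being read as extra URL parameters.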
Webscraper MCP Server
MCP server for web scraping operations: fetch pages, extract links/tables, parse sitemaps.
Tools
- webscraper_fetch(url, max_chars=5000): Title + markdown body + metadata
- webscraper_fetch_links(url, deduplicate=True): Extract all hrefs
- webscraper_fetch_tables(url): HTML tables as markdown
- webscraper_fetch_all(url, max_chars=5000): Everything in one call
- webscraper_fetch_section(url, selector): Specific CSS section
- webscraper_fetch_meta(url): Title, description, OG tags
- webscraper_fetch_sitemap(url, max_urls=100): Sitemap URL list
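As an illustration of what a tool like webscraper_fetch_sitemap might do after downloading the XML, here is a minimal sketch of extracting `<loc>` entries with a `max_urls` cap. It is not the server's implementation: `parse_sitemap` is a hypothetical helper, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Return up to max_urls URLs found in <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for elem in root.iter():
        if len(urls) >= max_urls:
            break
        # Namespaced tags look like "{namespace}loc"; compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls


SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc> https://example.com/b </loc></url>"
    "<url><loc>https://example.com/c</loc></url>"
    "</urlset>"
)
```

Matching on the local tag name means the same function also collects `<loc>` entries from a sitemap index file, where they sit inside `<sitemap>` rather than `<url>` elements.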
Stack
- httpx (HTTP client)
- BeautifulSoup4 + lxml (HTML parsing)
- html2text (HTML to markdown)
Run
./run.sh # uv sync && uv run src/server.py
Tests
uv run pytest tests/ --cov=src
MCP Config
Add to .roo/mcp.json:
"webscraper": {
"command": "uv",
"args": ["run", "--directory", "/home/pplate/pi_mcps/webscraper", "src/server.py"]
}