62c3b67e66
- Use urllib.parse.quote_plus instead of str.replace(' ', '+') for correct
URL encoding of special chars (&, %, +, #, =)
- Add search_url field to return dict so caller can verify/debug the query
- Add result_count field for quick summary without len(results)
- Deduplicate results by URL via seen_urls set
- Filter out cards where both title AND snippet are empty
- Richer hint string: 'Title (url): snippet[:120]' pipe-separated
- Max-results guard now breaks early (no over-fetching)
- 5 new tests (23→28): URL encoding, result_count, dedup, empty filter, hint format
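The changes listed above can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: BASE_URL, the function names, and the card dict layout (title, url, snippet) are assumptions made for the example. Only urllib.parse.quote_plus, the seen_urls set, and the search_url, result_count, and hint fields come from the commit message.

```python
from urllib.parse import quote_plus

# Hypothetical search endpoint, used only for illustration.
BASE_URL = "https://example.com/search?q="


def naive_search_url(query):
    """Old approach: only spaces are handled, so &, %, +, #, = leak through."""
    return BASE_URL + query.replace(" ", "+")


def build_search_url(query):
    """New approach: quote_plus percent-encodes reserved characters."""
    return BASE_URL + quote_plus(query)


def collect_results(cards, max_results=5):
    """Deduplicate by URL, drop fully empty cards, stop once max_results is hit."""
    seen_urls = set()
    results = []
    for card in cards:
        if len(results) >= max_results:
            break  # guard: stop early instead of processing extra cards
        title = card.get("title", "").strip()
        snippet = card.get("snippet", "").strip()
        url = card.get("url", "")
        if not title and not snippet:
            continue  # both fields empty: nothing useful to return
        if url in seen_urls:
            continue  # same URL already collected
        seen_urls.add(url)
        results.append({"title": title, "url": url, "snippet": snippet})
    return results


def build_hint(results):
    """Pipe-separated 'Title (url): snippet[:120]' summary."""
    return " | ".join(
        f"{r['title']} ({r['url']}): {r['snippet'][:120]}" for r in results
    )


def search_summary(query, cards, max_results=5):
    """Assemble a return dict with the fields named in the commit message."""
    results = collect_results(cards, max_results)
    return {
        "search_url": build_search_url(query),
        "result_count": len(results),
        "results": results,
        "hint": build_hint(results),
    }
```

With the query "a&b c", the naive version leaves the ampersand unescaped, so a server would read it as the start of a second parameter, while quote_plus encodes it as %26.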
Webscraper MCP Server
MCP server for web scraping operations: fetch pages, extract links/tables, parse sitemaps.
Tools
- webscraper_fetch(url, max_chars=5000): Title + markdown body + metadata
- webscraper_fetch_links(url, deduplicate=True): Extract all hrefs
- webscraper_fetch_tables(url): HTML tables as markdown
- webscraper_fetch_all(url, max_chars=5000): Everything in one call
- webscraper_fetch_section(url, selector): Specific CSS section
- webscraper_fetch_meta(url): Title, description, OG tags
- webscraper_fetch_sitemap(url, max_urls=100): Sitemap URL list
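To give a feel for what a tool such as webscraper_fetch_sitemap has to do, here is a minimal sketch of the parsing step only. It is not the server's implementation: parse_sitemap is a hypothetical helper, it works on XML text that has already been downloaded, and it uses the standard library's xml.etree.ElementTree rather than the lxml-based stack listed below. The max_urls default mirrors the tool signature.

```python
import xml.etree.ElementTree as ET

# Small inline sitemap so the sketch runs without network access.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/docs</loc></url>"
    "<url><loc>https://example.com/blog</loc></url>"
    "</urlset>"
)


def parse_sitemap(xml_text, max_urls=100):
    """Return up to max_urls <loc> values from sitemap XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    urls = []
    for elem in root.iter():
        if len(urls) >= max_urls:
            break  # respect the cap before collecting more
        # Drop the XML namespace prefix: "{http://...}loc" becomes "loc".
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "loc" and elem.text:
            urls.append(elem.text.strip())
    return urls


if __name__ == "__main__":
    for url in parse_sitemap(SAMPLE_SITEMAP, max_urls=2):
        print(url)
```

Matching on the local tag name keeps the sketch working for both namespaced and plain sitemap files. A sitemap index (a sitemap that lists other sitemaps) also uses loc elements, so a real tool would likely need to distinguish the two cases.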
Stack
- httpx (HTTP client)
- BeautifulSoup4 + lxml (HTML parsing)
- html2text (HTML to markdown)
Run
./run.sh # uv sync && uv run src/server.py
Tests
uv run pytest tests/ --cov=src
MCP Config
Add to .roo/mcp.json:
"webscraper": {
  "command": "uv",
  "args": ["run", "--directory", "/home/pplate/pi_mcps/webscraper", "src/server.py"]
}
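The entry above is a fragment and needs to sit inside a larger JSON object. The sketch below shows one way to merge it into an existing config dict. It assumes the top-level key is "mcpServers", which the README does not state, so check the client's documentation before relying on it. add_webscraper_entry is a hypothetical helper, not part of this project.

```python
import json


def add_webscraper_entry(config, directory):
    """Return a copy of config with the webscraper server registered.

    Assumes servers live under a top-level "mcpServers" key.
    """
    servers = dict(config.get("mcpServers", {}))
    servers["webscraper"] = {
        "command": "uv",
        "args": ["run", "--directory", directory, "src/server.py"],
    }
    # Build a new dict so the caller's config is left untouched.
    return {**config, "mcpServers": servers}


if __name__ == "__main__":
    updated = add_webscraper_entry({}, "/home/pplate/pi_mcps/webscraper")
    print(json.dumps(updated, indent=2))
```

The directory argument should be the absolute path of the local checkout; the path in the README is specific to the author's machine.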