Patrick Plate 62c3b67e66 fix(mcp-webscraper): improve search_hint quality — quote_plus, richer hint, dedup, result_count
- Use urllib.parse.quote_plus instead of str.replace(' ', '+') for correct
  URL encoding of special chars (&, %, +, #, =)
- Add search_url field to return dict so caller can verify/debug the query
- Add result_count field for quick summary without len(results)
- Deduplicate results by URL via seen_urls set
- Filter cards with both empty title AND empty snippet
- Richer hint string: 'Title (url): snippet[:120]' pipe-separated
- Max-results guard now breaks early (no over-fetching)
- 5 new tests (23→28): URL encoding, result_count, dedup, empty filter, hint format
2026-04-05 09:57:43 +02:00
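
The changes listed in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the server's actual code: the function name `build_search_summary`, the `base_url` placeholder, the card dictionary keys, and the exact `" | "` separator are all hypothetical. Only `quote_plus`, the `search_url` and `result_count` fields, the `seen_urls` set, the empty-card filter, the 120-character snippet cut, and the early break come from the commit message.

```python
from urllib.parse import quote_plus


def build_search_summary(query, cards, max_results=5,
                         base_url="https://example.com/search?q="):
    """Hypothetical sketch of the search result post-processing."""
    # quote_plus escapes &, %, +, # and = and turns spaces into '+',
    # which a plain str.replace(' ', '+') does not do.
    search_url = base_url + quote_plus(query)

    results = []
    seen_urls = set()
    for card in cards:
        # Max-results guard: stop as soon as enough results are collected.
        if len(results) >= max_results:
            break
        title = card.get("title", "").strip()
        snippet = card.get("snippet", "").strip()
        url = card.get("url", "")
        # Drop cards where both title AND snippet are empty.
        if not title and not snippet:
            continue
        # Deduplicate by URL.
        if url in seen_urls:
            continue
        seen_urls.add(url)
        results.append({"title": title, "url": url, "snippet": snippet})

    # Hint format: 'Title (url): snippet[:120]', pipe-separated.
    hint = " | ".join(
        f"{r['title']} ({r['url']}): {r['snippet'][:120]}" for r in results
    )
    return {
        "search_url": search_url,
        "results": results,
        "result_count": len(results),
        "search_hint": hint,
    }
```

For a query such as `C++ & R&D 100%`, `quote_plus` yields `C%2B%2B+%26+R%26D+100%25`, whereas replacing spaces alone would leave `&`, `+` and `%` unescaped and change the meaning of the query string.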

Webscraper MCP Server

MCP server for web scraping operations: fetch pages, extract links/tables, parse sitemaps.

Tools

  • webscraper_fetch(url, max_chars=5000) — Title + markdown body + metadata
  • webscraper_fetch_links(url, deduplicate=True) — Extract all hrefs
  • webscraper_fetch_tables(url) — HTML tables as markdown
  • webscraper_fetch_all(url, max_chars=5000) — Everything in one call
  • webscraper_fetch_section(url, selector) — Specific CSS section
  • webscraper_fetch_meta(url) — Title, description, OG tags
  • webscraper_fetch_sitemap(url, max_urls=100) — Sitemap URL list
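
As an illustration of what a tool like webscraper_fetch_sitemap has to do once the XML is downloaded, here is a minimal sketch. It assumes a standard sitemaps.org document and uses the standard library's `xml.etree.ElementTree` instead of the server's BeautifulSoup/lxml stack; the function name `parse_sitemap` and the sample data are hypothetical, and the real tool's return format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document in the sitemaps.org format.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc> https://example.com/b </loc></url>"
    "<url><loc>https://example.com/c</loc></url>"
    "</urlset>"
)


def parse_sitemap(xml_text, max_urls=100):
    """Return up to max_urls <loc> values from a sitemap XML string."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Stop early once the limit is reached.
        if len(urls) >= max_urls:
            break
        # Tags arrive as '{namespace}loc'; compare only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


if __name__ == "__main__":
    print(parse_sitemap(SAMPLE_SITEMAP, max_urls=2))
```

Matching on the local tag name keeps the sketch working for both `<urlset>` files and `<sitemapindex>` files, since both wrap their URLs in `<loc>` elements.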

Stack

  • httpx (HTTP client)
  • BeautifulSoup4 + lxml (HTML parsing)
  • html2text (HTML to markdown)

Run

./run.sh  # uv sync && uv run src/server.py

Tests

uv run pytest tests/ --cov=src

MCP Config

Add this entry to the "mcpServers" object in .roo/mcp.json, adjusting the directory path to your checkout:

"webscraper": {
  "command": "uv",
  "args": ["run", "--directory", "/home/pplate/pi_mcps/webscraper", "src/server.py"]
}