# 🕸️ mcp-webscraper: Web Scraping

![Webscraper Banner](http://192.168.188.119:30008/pplate/pi_mcps/raw/branch/main/docs/wiki/images/webscraper-banner.png)

**mcp-webscraper** is a FastMCP server providing comprehensive web scraping, data extraction, and search capabilities. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS sections, metadata, sitemaps, and can perform web searches via Brave Search.

## Tools

| Tool | Description |
|---|---|
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call (fetch + links + tables + meta) |
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
| `webscraper_search_hint(query, max_results=5)` | Brave Search: top URLs + snippets for a query |

## Stack

- **HTTP client:** `httpx` (async, with SSL support, Chrome/Linux User-Agent)
- **HTML parser:** `BeautifulSoup4` + `lxml`
- **Markdown converter:** `html2text`
- **Search backend:** Brave Search (`search.brave.com`), which works without CAPTCHA
- **SSL:** Custom cert bundle for Fedora 43 compatibility

## Search Hint Strategy

`webscraper_search_hint` uses Brave Search because:

- ✅ Returns real results without CAPTCHA or consent walls
- ❌ Google blocks plain HTTP with 302 consent redirect
- ❌ DuckDuckGo blocks with CAPTCHA

Use it sparingly (once per research task) to get oriented before deep-scraping individual pages.

```python
# Get top 5 results for a query
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
```

## SSL Note: Fedora 43 Comodo Root CA

Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).

The server automatically uses this cert bundle; no manual configuration is needed.

## Quick Start

```bash
cd mcp/webscraper
uv sync
uv run python src/server.py
```

## Run Tests

```bash
cd mcp/webscraper
uv run pytest tests/ -v
# 23/23 tests passing
```

## Usage Examples

```python
# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Quick search orientation
webscraper_search_hint("Gitea wiki git clone", max_results=3)
```
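
The examples above do not cover `webscraper_fetch_sitemap(url, max_urls=100)`. As a rough illustration of what "parse sitemap.xml, return URL list" can involve, here is a minimal standard-library sketch. It assumes a sitemaps.org-style XML document that has already been downloaded. `parse_sitemap` and `SAMPLE` are hypothetical names used only for this example, and the server's own implementation may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Return up to max_urls <loc> values from sitemap XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    # <loc> sits under <url> in a urlset and under <sitemap> in a sitemap index,
    # so iterating over every <loc> element handles both layouts.
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if len(urls) >= max_urls:
            break
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


# Inline sample document so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

print(parse_sitemap(SAMPLE, max_urls=2))
# → ['https://example.com/', 'https://example.com/docs']
```

Capping the result inside the loop, rather than slicing afterwards, keeps memory bounded on very large sitemaps, which is presumably the purpose of a `max_urls` limit.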