docs: create mcp-webscraper wiki page

2026-04-04 14:35:21 +02:00
parent 4660af57ad
commit 698da0146e
1 changed files with 54 additions and 0 deletions
@@ -0,0 +1,54 @@
+# 🕸️ mcp-webscraper — Web Scraping
+
+![Webscraper Banner](http://192.168.188.119:30008/pplate/pi_mcps/raw/branch/main/docs/wiki/images/webscraper-banner.png)
+
+**mcp-webscraper** is a FastMCP server for web scraping and data extraction. It fetches pages, converts HTML to clean Markdown, and extracts tables, links, CSS-selected sections, metadata, and sitemap URLs.
+
+## Tools
+
+| Tool | Description |
+|---|---|
+| `webscraper_fetch(url, max_chars=5000)` | Title, page content as Markdown (limited by `max_chars`), and metadata |
+| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page, deduplicated by default |
+| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
+| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call |
+| `webscraper_fetch_section(url, selector)` | Only the section matching the CSS selector |
+| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
+| `webscraper_fetch_sitemap(url, max_urls=100)` | Parses `sitemap.xml` and returns a list of up to `max_urls` URLs |
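+
+To illustrate what a `max_urls` cap on sitemap parsing might look like, here is a minimal sketch using only the standard library. The helper name `parse_sitemap` and the sample document are hypothetical, and this is not the server's actual implementation (which, per the stack listed on this page, is built on `BeautifulSoup4` and `lxml`).
+
+```python
+import xml.etree.ElementTree as ET
+
+# Namespace defined by the sitemaps.org protocol.
+SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
+
+
+def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
+    """Return up to max_urls <loc> values from a sitemap document (hypothetical helper)."""
+    root = ET.fromstring(xml_text)
+    urls: list[str] = []
+    for loc in root.iter(f"{SITEMAP_NS}loc"):
+        if len(urls) >= max_urls:
+            break
+        if loc.text and loc.text.strip():
+            urls.append(loc.text.strip())
+    return urls
+
+
+# Illustrative sample data, not taken from a real site.
+SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
+  <url><loc>https://example.com/</loc></url>
+  <url><loc>https://example.com/a</loc></url>
+  <url><loc>https://example.com/b</loc></url>
+</urlset>"""
+
+print(parse_sitemap(SAMPLE, max_urls=2))
+```
+
+Stopping the loop once the cap is reached keeps the result bounded even for very large sitemaps.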
+
+## Stack
+
+- **HTTP client:** `httpx` (async, with SSL support)
+- **HTML parser:** `BeautifulSoup4` + `lxml`
+- **Markdown converter:** `html2text`
+
+## SSL Note — Fedora 43
+
+Fedora 43 is missing the **Comodo AAA Services Root CA**, which is needed to verify certificates on Cloudflare-protected sites. The certificate is bundled at `mcp/webscraper/certs/comodo-aaa-services-root.pem` and is applied automatically; no manual configuration is needed.
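+
+For readers curious how an extra root certificate can be layered on top of the system trust store, the following is a rough sketch using Python's standard `ssl` module. The function name `build_ssl_context` is an assumption for illustration; how the server actually wires the certificate in is not shown on this page.
+
+```python
+import ssl
+from pathlib import Path
+
+# Path taken from the note above; it is resolved relative to the working directory.
+EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")
+
+
+def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
+    """System trust store plus one extra root certificate, if the file exists (hypothetical helper)."""
+    ctx = ssl.create_default_context()
+    if extra_ca.is_file():
+        # Adds the bundled root CA without replacing the system defaults.
+        ctx.load_verify_locations(cafile=str(extra_ca))
+    return ctx
+
+
+ctx = build_ssl_context()
+# With httpx installed, the context can be passed to the client:
+#   httpx.AsyncClient(verify=ctx)
+```
+
+Extending the default context, rather than disabling verification, keeps hostname and certificate checks enabled for every other site.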
+
+## Quick Start
+
+```bash
+cd mcp/webscraper
+uv sync
+./run.sh
+```
+
+## Usage Examples
+
+```python
+# Fetch a page as Markdown
+webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
+
+# Extract all links from a Gitea repo page
+webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")
+
+# Get all tables
+webscraper_fetch_tables("https://pypi.org/project/fastmcp/")
+
+# Get Open Graph metadata
+webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
+
+# Fetch a specific section by CSS selector
+webscraper_fetch_section("https://docs.python.org", "#content")
+```