diff --git a/mcp-webscraper.-.md b/mcp-webscraper.-.md
new file mode 100644
index 0000000..32636f0
--- /dev/null
+++ b/mcp-webscraper.-.md
@@ -0,0 +1,54 @@
+# 🕸️ mcp-webscraper: Web Scraping
+
+![Webscraper Banner](http://192.168.188.119:30008/pplate/pi_mcps/raw/branch/main/docs/wiki/images/webscraper-banner.png)
+
+**mcp-webscraper** is a FastMCP server for web scraping and data extraction. It fetches pages, converts HTML to clean Markdown, and extracts tables, links, CSS-selected sections, metadata, and sitemap URLs.
+
+## Tools
+
+| Tool | Description |
+|---|---|
+| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
+| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
+| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
+| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call |
+| `webscraper_fetch_section(url, selector)` | Only the section matching a CSS selector |
+| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
+| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
+
+## Stack
+
+- **HTTP client:** `httpx` (async, with SSL support)
+- **HTML parser:** `BeautifulSoup4` + `lxml`
+- **Markdown converter:** `html2text`
+
+## SSL Note: Fedora 43
+
+Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at `mcp/webscraper/certs/comodo-aaa-services-root.pem` and is applied automatically, so no manual configuration is needed.
+
+## Quick Start
+
+```bash
+cd mcp/webscraper
+uv sync
+./run.sh
+```
+
+## Usage Examples
+
+```python
+# Fetch a page as Markdown
+webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
+
+# Extract all links from Gitea repo
+webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")
+
+# Get all tables
+webscraper_fetch_tables("https://pypi.org/project/fastmcp/")
+
+# Get Open Graph metadata
+webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
+
+# Fetch specific section by CSS selector
+webscraper_fetch_section("https://docs.python.org", "#content")
+```
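
The SSL note in this patch says the bundled root certificate is applied automatically. As a rough illustration of one way that could be done (not necessarily how mcp-webscraper does it), the sketch below builds an `ssl.SSLContext` that trusts the system CA store plus one extra PEM file. The helper name `build_ssl_context` and the assumption that the path resolves relative to the repository root are hypothetical; only the certificate path itself comes from the note.

```python
import ssl
from pathlib import Path

# Certificate path quoted in the SSL note. Resolving it relative to the
# current working directory (the repository root) is an assumption.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Hypothetical helper: system trust store plus one extra root CA."""
    # create_default_context() loads the system CA store and enables
    # certificate verification and hostname checking.
    ctx = ssl.create_default_context()
    if extra_ca.is_file():
        # Adds the PEM file to the trusted set; system CAs stay trusted.
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx


ctx = build_ssl_context()
```

`httpx` accepts a context like this through its `verify` argument, for example `httpx.AsyncClient(verify=ctx)`. Extending the default store, rather than pointing `verify` at the single PEM file, keeps every other site verifiable as before.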