docs: create mcp-webscraper wiki page

2026-04-04 14:35:21 +02:00
parent 4660af57ad
commit 698da0146e
+54
@@ -0,0 +1,54 @@
# 🕸️ mcp-webscraper — Web Scraping
![Webscraper Banner](http://192.168.188.119:30008/pplate/pi_mcps/raw/branch/main/docs/wiki/images/webscraper-banner.png)
**mcp-webscraper** is a FastMCP server for web scraping and data extraction. It fetches pages, converts HTML to clean Markdown, and extracts tables, links, sections matched by a CSS selector, metadata, and sitemap URLs.
## Tools
| Tool | Description |
|---|---|
| `webscraper_fetch(url, max_chars=5000)` | Title + page content as Markdown (up to `max_chars`) + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call |
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
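To illustrate what the `deduplicate=True` flag of `webscraper_fetch_links` could mean in practice, here is a minimal sketch. It is not the server's implementation: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies, and the helper name `extract_links`, the resolution of relative links against a base URL, and the order-preserving behaviour are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Hypothetical helper: return absolute links found in the HTML."""
    collector = _LinkCollector()
    collector.feed(html)
    # Resolve relative hrefs against the page URL (an assumption).
    links = [urljoin(base_url, href) for href in collector.hrefs]
    if deduplicate:
        # dict.fromkeys keeps the first occurrence and preserves order.
        links = list(dict.fromkeys(links))
    return links


sample = '<a href="/a">A</a> <a href="/a">again</a> <a href="https://example.org/b">B</a>'
print(extract_links(sample, "https://example.com"))
# ['https://example.com/a', 'https://example.org/b']
```

With `deduplicate=False` the same input would yield three entries, since the repeated `/a` link is kept.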
## Stack
- **HTTP client:** `httpx` (async, with SSL support)
- **HTML parser:** `BeautifulSoup4` + `lxml`
- **Markdown converter:** `html2text`
## SSL Note — Fedora 43
Fedora 43's default trust store is missing the **Comodo AAA Services Root CA**, which is needed to verify certificates on Cloudflare-protected sites. The certificate is bundled at `mcp/webscraper/certs/comodo-aaa-services-root.pem` and applied automatically, so no manual configuration is needed.
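For readers curious how an extra root certificate can be added on top of the system trust store, the following is a minimal sketch using Python's standard `ssl` module. The helper name `build_ssl_context` and the "skip if the file is absent" behaviour are assumptions made for illustration; the server's actual loading code may differ.

```python
import ssl
from pathlib import Path

# Path taken from the note above, relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Hypothetical helper: system trust store plus one extra root CA."""
    ctx = ssl.create_default_context()  # loads the system CA certificates
    if extra_ca.is_file():
        # Adds the bundled root to the trusted set; system CAs stay trusted.
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx


ctx = build_ssl_context()
# With httpx, such a context can be passed as: httpx.AsyncClient(verify=ctx)
```

Certificate verification and hostname checking stay enabled in this approach; only the set of trusted roots is extended.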
## Quick Start
```bash
cd mcp/webscraper
uv sync
./run.sh
```
## Usage Examples
```python
# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
# Extract all links from the Gitea repo page
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")
# Get all tables
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")
# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")
```
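The examples above do not cover `webscraper_fetch_sitemap`. As a rough illustration of what "parse sitemap.xml, return URL list" with a `max_urls` cap might involve, here is a standard-library sketch. The helper name `parse_sitemap` is hypothetical, and the real tool may handle sitemap indexes or other cases that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Hypothetical helper: return up to max_urls <loc> values."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if len(urls) >= max_urls:
            break
        if loc.text:
            urls.append(loc.text.strip())
    return urls


sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/docs</loc></url>"
    "<url><loc>https://example.com/blog</loc></url>"
    "</urlset>"
)
print(parse_sitemap(sample_sitemap, max_urls=2))
# ['https://example.com/', 'https://example.com/docs']
```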