docs: create mcp-webscraper wiki page
@@ -0,0 +1,54 @@
|
||||
# 🕸️ mcp-webscraper — Web Scraping
|
||||
|
||||

|
||||
|
||||
**mcp-webscraper** is a FastMCP server providing comprehensive web scraping and data extraction capabilities. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS sections, metadata, and sitemaps.
|
||||
|
||||
## Tools
|
||||
|
||||
| Tool | Description |
|
||||
|---|---|
|
||||
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
|
||||
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
|
||||
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
|
||||
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call |
|
||||
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
|
||||
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
|
||||
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
|
||||
|
||||
## Stack
|
||||
|
||||
- **HTTP client:** `httpx` (async, with SSL support)
|
||||
- **HTML parser:** `BeautifulSoup4` + `lxml`
|
||||
- **Markdown converter:** `html2text`
|
||||
|
||||
## SSL Note — Fedora 43
|
||||
|
||||
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at `mcp/webscraper/certs/comodo-aaa-services-root.pem` — applied automatically, no manual config needed.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
cd mcp/webscraper
|
||||
uv sync
|
||||
./run.sh
|
||||
```
|
||||
|
||||
## Usage Examples
|
||||
|
||||
```python
|
||||
# Fetch a page as Markdown
|
||||
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
|
||||
|
||||
# Extract all links from Gitea repo
|
||||
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")
|
||||
|
||||
# Get all tables
|
||||
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")
|
||||
|
||||
# Get Open Graph metadata
|
||||
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
|
||||
|
||||
# Fetch specific section by CSS selector
|
||||
webscraper_fetch_section("https://docs.python.org", "#content")
|
||||
```
|
||||
Reference in New Issue
Block a user