docs: create mcp-webscraper wiki page
@@ -0,0 +1,54 @@
# 🕸️ mcp-webscraper — Web Scraping

**mcp-webscraper** is a FastMCP server for web scraping and data extraction. It fetches pages, converts HTML to clean Markdown, and extracts tables, links, sections matched by a CSS selector, metadata, and sitemap URLs.
## Tools
| Tool | Description |
|---|---|
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call |
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
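
To make the `max_urls` cap on `webscraper_fetch_sitemap` concrete, here is a minimal sketch of sitemap parsing using only the standard library. It is not the server's actual code: the function name `parse_sitemap` and the sample XML are hypothetical, and the real tool may handle sitemap indexes or compressed sitemaps differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Return at most max_urls <loc> values from a sitemap document.

    Hypothetical helper for illustration; not part of mcp-webscraper.
    """
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    # <loc> appears under <url> in a sitemap and under <sitemap> in an index.
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if len(urls) >= max_urls:
            break
        text = (loc.text or "").strip()
        if text:
            urls.append(text)
    return urls


# Made-up sample data.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

print(parse_sitemap(SAMPLE, max_urls=2))
```

Capping inside the loop, rather than slicing afterwards, avoids building a long list for very large sitemaps.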
## Stack
- **HTTP client:** `httpx` (async, with SSL support)
- **HTML parser:** `BeautifulSoup4` + `lxml`
- **Markdown converter:** `html2text`
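
As a rough illustration of what the parsing layer does for a tool like `webscraper_fetch_links`, the sketch below collects `href` values and removes duplicates while keeping page order. It deliberately uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies. The names `LinkCollector` and `extract_links` are hypothetical, and resolving relative links against the page URL is an assumption about the real tool's behavior.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Assumption: relative links are made absolute.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Hypothetical helper; stdlib stand-in for the BeautifulSoup step."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    if deduplicate:
        # dict.fromkeys keeps the first occurrence of each link, in page order.
        return list(dict.fromkeys(collector.links))
    return collector.links


# Made-up sample markup: two of the three links point at the same URL.
HTML = (
    '<a href="/docs">Docs</a> '
    '<a href="https://example.com/docs">Docs again</a> '
    '<a href="/blog">Blog</a>'
)

print(extract_links(HTML, "https://example.com"))
```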
## SSL Note — Fedora 43
Fedora 43 is missing the **Comodo AAA Services Root CA**, which is needed to verify certificates on Cloudflare-protected sites. The certificate is bundled at `mcp/webscraper/certs/comodo-aaa-services-root.pem` and applied automatically, so no manual configuration is needed.
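
For readers curious how a bundled root certificate can be applied automatically, one common approach is to extend the default SSL context with the extra CA file. The sketch below shows that approach under stated assumptions: the helper name `build_ssl_context` is hypothetical, the path is taken relative to the repository root, and the server's actual wiring may differ.

```python
import ssl
from pathlib import Path

# Path from the note above, assumed relative to the repository root.
BUNDLED_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = BUNDLED_CA) -> ssl.SSLContext:
    """Return the system trust store plus one extra root CA, if the file exists.

    Hypothetical helper for illustration; not the server's actual code.
    """
    ctx = ssl.create_default_context()
    if extra_ca.is_file():
        # Adds the certificate to the trusted set without replacing system CAs.
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx


ctx = build_ssl_context()
# The context could then be passed to the HTTP client, for example:
#   httpx.AsyncClient(verify=ctx)
```

Extending the default context keeps hostname checking and certificate verification enabled, which is safer than turning verification off for affected sites.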
## Quick Start
```bash
# From the repository root, enter the server directory
cd mcp/webscraper
# Install the dependencies with uv
uv sync
# Launch the server via the bundled script
./run.sh
```
## Usage Examples
```python
# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")
```
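
The return shape of these tools is not documented on this page. As a rough idea of what a metadata call such as `webscraper_fetch_meta` might gather, the sketch below pulls the title, description, and Open Graph tags out of an HTML string. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs as-is; `MetaCollector`, `extract_meta`, the result dictionary layout, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <title>, the description meta tag, and og:* properties."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.open_graph: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content") or ""
            prop = attrs.get("property") or ""
            if prop.startswith("og:"):
                self.open_graph[prop] = content
            elif (attrs.get("name") or "").lower() == "description":
                self.description = content

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_meta(html: str) -> dict:
    """Hypothetical helper; the real tool's output format may differ."""
    collector = MetaCollector()
    collector.feed(html)
    return {
        "title": collector.title.strip(),
        "description": collector.description,
        "open_graph": collector.open_graph,
    }


# Made-up sample page.
PAGE = """<html><head>
<title>Example Project</title>
<meta name="description" content="A short summary.">
<meta property="og:title" content="Example Project">
<meta property="og:image" content="https://example.com/card.png">
</head><body></body></html>"""

meta = extract_meta(PAGE)
print(meta["title"], meta["open_graph"])
```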