# 🕸️ mcp-webscraper — Web Scraping

**mcp-webscraper** is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and runs web searches via Brave Search.

## Tools

| Tool | Description |
|---|---|
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call (fetch + links + tables + meta) |
| `webscraper_fetch_section(url, selector)` | Only the section matching the CSS selector |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
| `webscraper_search_hint(query, max_results=5)` | Brave Search: top URLs + snippets for a query |
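
As a rough illustration of what `webscraper_fetch_links(url, deduplicate=True)` produces, the sketch below collects `href` values from an HTML string and drops repeats while keeping first-seen order. It uses only the standard library's `html.parser` instead of the server's BeautifulSoup4 stack. The helper `extract_links`, its `base_url` parameter, and the choice to return absolute URLs are assumptions for illustration, not the server's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url, deduplicate=True):
    """Hypothetical helper: return absolute links found in `html`."""
    collector = _LinkCollector()
    collector.feed(html)
    # Resolve relative links against the page URL (an assumption here).
    links = [urljoin(base_url, href) for href in collector.hrefs]
    if deduplicate:
        # dict.fromkeys keeps the first occurrence and preserves order.
        links = list(dict.fromkeys(links))
    return links


sample = (
    '<a href="/docs">Docs</a> '
    '<a href="/docs">Docs again</a> '
    '<a href="https://example.org/">External</a>'
)
print(extract_links(sample, "https://example.com"))
# ['https://example.com/docs', 'https://example.org/']
```

With `deduplicate=False` the same input yields three entries, which is useful when link frequency matters.
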

## Stack

- **HTTP client:** `httpx` (async, with SSL support, Chrome/Linux User-Agent)
- **HTML parser:** `BeautifulSoup4` + `lxml`
- **Markdown converter:** `html2text`
- **Search backend:** Brave Search (`search.brave.com`), which works without a CAPTCHA
- **SSL:** Custom cert bundle for Fedora 43 compatibility

## Search Hint Strategy

`webscraper_search_hint` uses Brave Search because:
- ✅ Brave Search returns real results without a CAPTCHA or consent wall
- ❌ Google answers plain HTTP requests with a 302 redirect to a consent page
- ❌ DuckDuckGo blocks plain HTTP requests with a CAPTCHA

Use it sparingly, about once per research task, to get oriented before deep-scraping individual pages.

```python
# Get top 5 results for a query
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
```
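
The "search once, then scrape" pattern can be sketched as a small orchestration function. The two callables stand in for the MCP tools, and the shape of the search result (a list of dicts with a `url` key) is an assumption, since this page does not document the return format. The `research` helper and the fake implementations are hypothetical and exist only so the sketch runs without a server.

```python
from typing import Callable, Dict, List


def research(
    query: str,
    search_hint: Callable[..., List[dict]],
    fetch: Callable[[str], str],
    max_results: int = 3,
) -> Dict[str, str]:
    """Run one search, then fetch each hit; returns {url: page_text}."""
    hits = search_hint(query, max_results=max_results)  # the only search call
    return {hit["url"]: fetch(hit["url"]) for hit in hits}


# Stand-in implementations (hypothetical) that count search calls.
calls = {"search": 0}


def fake_search_hint(query, max_results=5):
    calls["search"] += 1
    return [
        {"url": f"https://example.com/{i}", "snippet": query}
        for i in range(max_results)
    ]


def fake_fetch(url):
    return f"# Page at {url}"


pages = research(
    "FastMCP tool decorator syntax", fake_search_hint, fake_fetch, max_results=2
)
print(len(pages), calls["search"])  # 2 1
```

Passing the tools in as parameters keeps the search budget visible in one place: however many pages are fetched, the search backend is hit once.
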
## SSL Note — Fedora 43 Comodo Root CA

Fedora 43 is missing the **Comodo AAA Services Root CA** needed to verify Cloudflare-protected sites. The certificate is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).

The server loads this cert bundle automatically; no manual configuration is needed.
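
For readers who want to reproduce the fix in their own client, one way to trust an extra root CA on top of the system store is shown below. This is a minimal sketch using the standard `ssl` module, not the server's actual loading code; `build_ssl_context` is a hypothetical helper, and the relative path assumes the working directory is `mcp/webscraper`.

```python
import ssl
from pathlib import Path


def build_ssl_context(extra_ca_pem: str) -> ssl.SSLContext:
    """Return a verifying SSL context with one extra root CA, if present."""
    # Start from the system trust store so ordinary sites keep working.
    context = ssl.create_default_context()
    pem = Path(extra_ca_pem)
    if pem.is_file():
        # Add the bundled root certificate on top of the system CAs.
        context.load_verify_locations(cafile=str(pem))
    return context


ctx = build_ssl_context("certs/comodo-aaa-services-root.pem")
# An httpx client accepts such a context through its `verify` argument,
# for example httpx.AsyncClient(verify=ctx).
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)  # True True
```

The sketch silently falls back to the system store when the file is absent; a stricter variant could raise instead so that a missing bundle is noticed early.
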

## Quick Start

```bash
cd mcp/webscraper
uv sync
uv run python src/server.py
```

## Run Tests

```bash
cd mcp/webscraper
uv run pytest tests/ -v
# 23/23 tests passing
```

## Usage Examples

```python
# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from a Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Quick search orientation
webscraper_search_hint("Gitea wiki git clone", max_results=3)
```
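
The calls above do not exercise `webscraper_fetch_sitemap(url, max_urls=100)`. As a hedged sketch of what "parse sitemap.xml, return URL list" can look like, the snippet below reads `<loc>` entries from a sitemap string with the standard library's `xml.etree.ElementTree`. The `parse_sitemap` helper is hypothetical and works on an in-memory string; the real tool takes a URL, and its implementation may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text, max_urls=100):
    """Hypothetical helper: return up to `max_urls` <loc> values, in order."""
    root = ET.fromstring(xml_text)
    urls = [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return urls[:max_urls]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

print(parse_sitemap(sample, max_urls=2))
# ['https://example.com/', 'https://example.com/docs']
```

Capping the list with `max_urls` mirrors the tool's parameter and keeps large sitemaps from flooding the caller.
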