# 🕸️ mcp-webscraper — Web Scraping

![Webscraper Banner](http://192.168.188.119:30008/pplate/pi_mcps/raw/branch/main/docs/wiki/images/webscraper-banner.png)

**mcp-webscraper** is a FastMCP server for web scraping, data extraction, and search. It fetches pages, converts HTML to clean Markdown, extracts tables, links, CSS-selected sections, metadata, and sitemap URLs, and performs web searches via Brave Search.

## Tools

| Tool | Description |
|---|---|
| `webscraper_fetch(url, max_chars=5000)` | Title + full page as Markdown + metadata |
| `webscraper_fetch_links(url, deduplicate=True)` | All `href` links found on the page |
| `webscraper_fetch_tables(url)` | All HTML tables converted to Markdown |
| `webscraper_fetch_all(url, max_chars=5000)` | Everything in one call (fetch + links + tables + meta) |
| `webscraper_fetch_section(url, selector)` | Specific CSS selector section only |
| `webscraper_fetch_meta(url)` | Title, description, Open Graph tags |
| `webscraper_fetch_sitemap(url, max_urls=100)` | Parse sitemap.xml, return URL list |
| `webscraper_search_hint(query, max_results=5)` | Brave Search — top URLs + snippets for a query |
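
To illustrate what a tool like `webscraper_fetch_sitemap` has to do, here is a minimal sketch that uses only the standard library. The function name `parse_sitemap` is hypothetical and the server's actual implementation may differ; the only detail taken from the table above is the `max_urls=100` default.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str, max_urls: int = 100) -> List[str]:
    """Return up to max_urls <loc> values from a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    urls: List[str] = []
    # <loc> sits under <url> in a sitemap and under <sitemap> in an index.
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if len(urls) >= max_urls:
            break
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```

Matching on `.//sm:loc` keeps the sketch short and covers both plain sitemaps and sitemap index files, at the cost of not distinguishing between the two.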

## Stack

- **HTTP client:** `httpx` (async, with SSL support, Chrome/Linux User-Agent)
- **HTML parser:** `BeautifulSoup4` + `lxml`
- **Markdown converter:** `html2text`
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
- **SSL:** Custom cert bundle for Fedora 43 compatibility

---

## 🔍 Search: The Two-Step Research Pattern

`webscraper_search_hint` is the **entry point for all web research**. The recommended workflow is:

```text
Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url)           → get full page content
```

This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.

### Why Brave Search?

`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
- ✅ Returns real results without CAPTCHA or consent walls
- ✅ No API key required — works with plain HTTP GET
- ✅ Handles special characters (C++, &, %, etc.) via URL encoding
- ❌ Google answers plain HTTP requests with a 302 redirect to a consent page
- ❌ DuckDuckGo responds with a CAPTCHA challenge
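
The URL-encoding point can be sketched with the standard library. The helper name `build_search_url` is an assumption for illustration; the base URL and the `source=web` parameter are taken from the sample `search_url` in the return value shown in this section.

```python
from urllib.parse import urlencode

BRAVE_SEARCH_URL = "https://search.brave.com/search"


def build_search_url(query: str) -> str:
    """Build a Brave Search URL with the query safely encoded."""
    # urlencode applies quote_plus: spaces become "+", while characters
    # such as "+", "&", "%" and ":" are percent-encoded.
    return f"{BRAVE_SEARCH_URL}?{urlencode({'q': query, 'source': 'web'})}"


print(build_search_url("FastMCP tool decorator"))
# https://search.brave.com/search?q=FastMCP+tool+decorator&source=web
print(build_search_url("C++ std::optional usage"))
# https://search.brave.com/search?q=C%2B%2B+std%3A%3Aoptional+usage&source=web
```

Without the encoding step, a literal `+` in `C++` would be read as a space and a literal `&` would end the `q` parameter early.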

### Return Value

The tool returns a structured dict:

```json
{
  "query": "FastMCP tool decorator",
  "search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
  "result_count": 5,
  "hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
  "results": [
    {
      "title": "FastMCP Docs",
      "url": "https://docs.fastmcp.dev",
      "snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
    },
    ...
  ]
}
```

The `hint` field is a pipe-separated string of `Title (url): snippet` entries, with each snippet truncated to 120 characters. It is meant for quickly deciding which URL to fetch next.
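
As a rough sketch of how such a `hint` string could be assembled from the `results` list, consider the helper below. The name `build_hint` and the `snippet_len` parameter are hypothetical, and the `" | "` separator is inferred from the sample output above rather than from the server's source.

```python
from typing import Dict, List


def build_hint(results: List[Dict[str, str]], snippet_len: int = 120) -> str:
    """Join results into 'Title (url): snippet' entries separated by ' | '."""
    entries = []
    for item in results:
        # Slicing is safe on short or empty snippets; it never raises.
        snippet = item.get("snippet", "")[:snippet_len]
        entries.append(f"{item['title']} ({item['url']}): {snippet}")
    return " | ".join(entries)


results = [
    {"title": "FastMCP Docs", "url": "https://docs.fastmcp.dev", "snippet": "The @mcp.tool() decorator"},
    {"title": "PyPI FastMCP", "url": "https://pypi.org/project/fastmcp/", "snippet": ""},
]
print(build_hint(results))
# FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator | PyPI FastMCP (https://pypi.org/project/fastmcp/): 
```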

### Example: Two-Step Research Flow

```python
# Step 1: Orient — what pages exist about this topic?
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."

# Step 2: Deep-dive the most relevant result
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
```

### Known Limitations

- **Reddit / Stack Overflow snippets** may be empty — these platforms block snippet extraction
- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
- **Use sparingly** — once per research task to get oriented, not for every query
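
A caller can guard against the first two limitations before choosing a URL. The sketch below assumes the return shape documented under "Return Value"; the helper name `triage_results` and the choice to raise `RuntimeError` are illustrative, not part of the server.

```python
from typing import Any, Dict, List


def triage_results(result: Dict[str, Any]) -> List[Dict[str, str]]:
    """Order search results so entries with a usable snippet come first."""
    results = result.get("results", [])
    if not results:
        # Per the note above, zero results can mean the selectors are stale.
        raise RuntimeError(
            f"No results for {result.get('query')!r}; "
            "the Brave selectors may need updating"
        )
    # sorted() is stable, so the original ranking is kept within each group.
    return sorted(results, key=lambda r: not r.get("snippet", "").strip())
```

Results with empty snippets are moved to the end rather than dropped, since the page behind them may still be the best source.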

---

## SSL Note — Fedora 43 Comodo Root CA

Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).

The server automatically uses this cert bundle — no manual configuration needed.
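
For readers who want to reproduce the fix in their own client, the sketch below shows one way to trust an extra root CA on top of the system store. It is an assumption about the approach, not the server's actual code; `build_ssl_context` and `EXTRA_CA` are hypothetical names, and the path is the one given in the note above, relative to the repository root.

```python
import ssl
from pathlib import Path
from typing import Optional

# Path from the note above, relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Optional[Path] = EXTRA_CA) -> ssl.SSLContext:
    """Return a verifying SSL context, optionally trusting one extra root CA."""
    # Start from the system trust store so all other sites keep working.
    ctx = ssl.create_default_context()
    if extra_ca is not None:
        # Fails with an OSError if the PEM file is missing.
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx
```

With `httpx`, the resulting context can be passed as `httpx.AsyncClient(verify=ctx)`. Adding the certificate to the default context, instead of replacing the trust store, keeps verification enabled for every other site.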

## Quick Start

```bash
cd mcp/webscraper
uv sync
uv run python src/server.py
```

## Run Tests

```bash
cd mcp/webscraper
uv run pytest tests/ -v
# 28/28 tests passing
```

## Usage Examples

```python
# Step 1: Search — get candidate URLs for a topic
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)

# Step 2: Deep-dive the most relevant URL
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from a Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables from a documentation page
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")

# Search with special characters (C++, &, % all work)
webscraper_search_hint("C++ std::optional usage", max_results=3)
```