mcp-webscraper
pplate edited this page 2026-04-04 14:35:21 +02:00

🕸️ mcp-webscraper — Web Scraping


mcp-webscraper is a FastMCP server for web scraping and data extraction. It fetches pages, converts HTML to clean Markdown, and extracts tables, links, sections matched by a CSS selector, metadata, and sitemap URLs.

Tools

| Tool | Description |
|------|-------------|
| webscraper_fetch(url, max_chars=5000) | Title + full page as Markdown + metadata |
| webscraper_fetch_links(url, deduplicate=True) | All href links found on the page |
| webscraper_fetch_tables(url) | All HTML tables converted to Markdown |
| webscraper_fetch_all(url, max_chars=5000) | Everything in one call |
| webscraper_fetch_section(url, selector) | Specific CSS selector section only |
| webscraper_fetch_meta(url) | Title, description, Open Graph tags |
| webscraper_fetch_sitemap(url, max_urls=100) | Parse sitemap.xml, return URL list |
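To give a feel for what a tool such as webscraper_fetch_sitemap might do with max_urls, here is a minimal sketch that works on sitemap XML already held in a string. It is not the server's code: parse_sitemap and SAMPLE are hypothetical names, the network fetch is left out, and the standard library's xml.etree.ElementTree stands in for whatever parser the server actually uses.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str, max_urls: int = 100) -> list[str]:
    """Hypothetical helper: return up to max_urls <loc> values from sitemap XML."""
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    # <urlset> and <sitemapindex> documents both keep their URLs in <loc> elements.
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
        if len(urls) >= max_urls:
            break  # stop early once the cap is reached
    return urls


# Illustrative input only; example.com is a placeholder domain.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

print(parse_sitemap(SAMPLE, max_urls=2))
# ['https://example.com/', 'https://example.com/docs']
```

Capping inside the loop avoids building a full list first, which matters for large sitemaps.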

Stack

  • HTTP client: httpx (async, with SSL support)
  • HTML parser: BeautifulSoup4 + lxml
  • Markdown converter: html2text

SSL Note — Fedora 43

Fedora 43's trust store is missing the Comodo AAA Services Root CA, which is needed to verify certificates on Cloudflare-protected sites. The fix is bundled at mcp/webscraper/certs/comodo-aaa-services-root.pem and is applied automatically; no manual configuration is needed.
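For readers curious how a bundled root certificate can be applied without touching the system configuration, the sketch below shows one common approach with Python's ssl module. It is an assumption about the general technique, not a copy of the server's code: build_ssl_context and EXTRA_CA are hypothetical names, and the path is resolved relative to the current working directory.

```python
import ssl
from pathlib import Path

# Path taken from the note above; assumed relative to the repository root.
EXTRA_CA = Path("mcp/webscraper/certs/comodo-aaa-services-root.pem")


def build_ssl_context(extra_ca: Path = EXTRA_CA) -> ssl.SSLContext:
    """Hypothetical helper: system trust store plus one extra root CA."""
    ctx = ssl.create_default_context()  # loads the system CA certificates
    if extra_ca.is_file():
        # Add the bundled root alongside the system roots, not instead of them.
        ctx.load_verify_locations(cafile=str(extra_ca))
    return ctx


ctx = build_ssl_context()
# httpx accepts an SSLContext through its `verify` argument, for example:
#   httpx.AsyncClient(verify=ctx)
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```

Extending the default context keeps certificate verification and hostname checking enabled, unlike disabling verification outright.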

Quick Start

cd mcp/webscraper
uv sync
./run.sh

Usage Examples

# Fetch a page as Markdown
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)

# Extract all links from Gitea repo
webscraper_fetch_links("http://192.168.188.119:30008/pplate/pi_mcps")

# Get all tables
webscraper_fetch_tables("https://pypi.org/project/fastmcp/")

# Get Open Graph metadata
webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")

# Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content")
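The calls above only show the tool interface. As a rough illustration of what link extraction with deduplicate=True could involve, here is a self-contained sketch. It is an assumption rather than the server's implementation: the server lists BeautifulSoup4 + lxml as its parser, while this sketch swaps in the standard library's html.parser so that it runs without dependencies. LinkCollector, extract_links, and PAGE are hypothetical names, and resolving relative links against the page URL is also an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/docs" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Hypothetical helper mirroring the deduplicate flag of the tool."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    if deduplicate:
        # dict.fromkeys drops repeats while keeping first-seen order.
        return list(dict.fromkeys(collector.links))
    return collector.links


# Illustrative input only; example.com and example.org are placeholder domains.
PAGE = (
    '<a href="/docs">Docs</a> '
    '<a href="/docs">Docs again</a> '
    '<a href="https://example.org/x">X</a>'
)

print(extract_links(PAGE, "https://example.com"))
# ['https://example.com/docs', 'https://example.org/x']
```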