# Webscraper MCP Server

MCP server for web scraping operations: fetch pages, extract links/tables, parse sitemaps.

## Tools

- `webscraper_fetch(url, max_chars=5000)` — Title + markdown body + metadata
- `webscraper_fetch_links(url, deduplicate=True)` — Extract all hrefs
- `webscraper_fetch_tables(url)` — HTML tables as markdown
- `webscraper_fetch_all(url, max_chars=5000)` — Everything in one call
- `webscraper_fetch_section(url, selector)` — Specific CSS section
- `webscraper_fetch_meta(url)` — Title, description, OG tags
- `webscraper_fetch_sitemap(url, max_urls=100)` — Sitemap URL list
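
To make the `deduplicate` flag of `webscraper_fetch_links` concrete, here is a minimal sketch of order-preserving link extraction. It is not the server's actual implementation: it uses the standard library's `html.parser` instead of BeautifulSoup4 so that it runs without dependencies, and the names `LinkCollector` and `extract_links` are hypothetical.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs such as "/about" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str, deduplicate: bool = True) -> list[str]:
    """Return all anchor hrefs in document order (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    if not deduplicate:
        return parser.links
    # dict.fromkeys drops repeats while keeping first-seen order.
    return list(dict.fromkeys(parser.links))
```

With `deduplicate=True`, a page that links to `/a` twice yields that URL once, at the position of its first occurrence. With `deduplicate=False`, every occurrence is kept.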

## Stack

- httpx (HTTP client)
- BeautifulSoup4 + lxml (HTML parsing)
- html2text (HTML to markdown)

## Run

```bash
./run.sh # uv sync && uv run src/server.py
```

## Tests

```bash
uv run pytest tests/ --cov=src
```

## MCP Config

Add to `.roo/mcp.json`:

```json
"webscraper": {
  "command": "uv",
  "args": ["run", "--directory", "/home/pplate/pi_mcps/webscraper", "src/server.py"]
}
```
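
The snippet above is a fragment, not a complete file. As a hedged sketch, the Python helper below merges it into an existing config. It assumes the file keeps its servers under a top-level `mcpServers` object, so check your own file before relying on it. The function name `add_server` is hypothetical, and the path is copied unchanged from the snippet above.

```python
import json
from pathlib import Path

# Entry copied from the README fragment above.
WEBSCRAPER_ENTRY = {
    "command": "uv",
    "args": ["run", "--directory", "/home/pplate/pi_mcps/webscraper", "src/server.py"],
}


def add_server(config_path: Path, name: str, entry: dict) -> dict:
    """Insert or overwrite one server entry, keeping all other settings."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Assumption: servers live under a top-level "mcpServers" key.
    config.setdefault("mcpServers", {})[name] = entry
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_server(Path(".roo/mcp.json"), "webscraper", WEBSCRAPER_ENTRY)` from the repository root would register the server and leave any other configured servers untouched.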