docs: promote webscraper_search_hint in wiki pages
@@ -145,6 +145,38 @@ Use the `new-mcp-server` Roo skill in MCP Builder mode for full scaffolding:
|
|||||||
3. Roo will load the new-mcp-server skill and scaffold everything
|
3. Roo will load the new-mcp-server skill and scaffold everything
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Web Research with mcp-webscraper
|
||||||
|
|
||||||
|
Before asking Patrick for information about a library, framework, API, or technology — **search first**.
|
||||||
|
|
||||||
|
The webscraper MCP server provides `webscraper_search_hint` (Brave Search, no API key, always available) as the entry point for all research tasks. Use the two-step pattern:
|
||||||
|
|
||||||
|
```
|
||||||
|
Step 1: webscraper_search_hint("topic or error message") → get candidate URLs
|
||||||
|
Step 2: webscraper_fetch(best_url) → read the full page
|
||||||
|
```
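The two-step pattern can be sketched in plain Python. This is only an illustration: `search` and `fetch` are hypothetical callables standing in for the MCP tools `webscraper_search_hint` and `webscraper_fetch`, the result shape follows the return value documented in the wiki, and the "prefer docs-looking URLs" heuristic in `pick_best_url` is an assumption, not something the server does.

```python
from __future__ import annotations

from typing import Callable, Optional


def pick_best_url(result: dict, preferred: tuple[str, ...] = ("docs.", "readthedocs")) -> Optional[str]:
    """Return the first URL containing a preferred marker, else the top hit."""
    urls = [r["url"] for r in result.get("results", []) if r.get("url")]
    for url in urls:
        if any(marker in url for marker in preferred):
            return url
    # No preferred match: fall back to the first candidate, if any.
    return urls[0] if urls else None


def research(query: str,
             search: Callable[[str], dict],
             fetch: Callable[[str], str]) -> Optional[str]:
    """Step 1: search for candidate URLs. Step 2: fetch the best one."""
    best = pick_best_url(search(query))
    return fetch(best) if best else None
```

Passing the tools in as callables keeps the sketch independent of how the MCP client actually invokes them.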
|
||||||
|
|
||||||
|
### When to search
|
||||||
|
|
||||||
|
| Situation | Action |
|
||||||
|
|---|---|
|
||||||
|
| Need docs for a library or framework | `webscraper_search_hint("library-name official docs")` |
|
||||||
|
| Investigating an error or stack trace | `webscraper_search_hint("exact error message language")` |
|
||||||
|
| Planning a feature — need design patterns | `webscraper_search_hint("pattern-name best practices")` |
|
||||||
|
| Checking latest version / changelog | `webscraper_search_hint("library-name changelog release")` |
|
||||||
|
| Looking up API contracts | `webscraper_fetch(official_docs_url)` directly |
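The query shapes in the table above could be captured in a small helper. This is a hypothetical sketch: the situation keys (`"docs"`, `"error"`, `"pattern"`, `"changelog"`) and the `build_query` function are invented for illustration, and only the query templates themselves come from the table.

```python
# Query templates taken from the "When to search" table; the keys are illustrative.
QUERY_TEMPLATES = {
    "docs": "{subject} official docs",
    "error": "{subject} {language}",
    "pattern": "{subject} best practices",
    "changelog": "{subject} changelog release",
}


def build_query(situation: str, subject: str, language: str = "") -> str:
    """Fill the template for a situation; raises KeyError for unknown situations."""
    template = QUERY_TEMPLATES[situation]
    return template.format(subject=subject, language=language).strip()
```

For example, `build_query("changelog", "FastMCP")` yields the string to pass to `webscraper_search_hint`.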
|
||||||
|
|
||||||
|
### Especially useful in
|
||||||
|
|
||||||
|
- **🏗️ Architect mode** — look up patterns and docs *before* designing. Don't design blind.
|
||||||
|
- **🪲 Debug mode** — search the exact error message before forming hypotheses.
|
||||||
|
- **🔧 MCP Builder mode** — check FastMCP changelog for new patterns before implementing.
|
||||||
|
|
||||||
|
### Known caveats
|
||||||
|
|
||||||
|
- Reddit and Stack Overflow results may come back with empty snippets because these platforms block snippet extraction
|
||||||
|
- Brave uses Svelte CSS classes that can change — if `webscraper_search_hint` returns 0 results, selectors may need updating (last verified: 2026-04-05)
|
||||||
|
|
||||||
## Gitea Repository
|
## Gitea Repository
|
||||||
|
|
||||||
Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps`
|
Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps`
|
||||||
|
|||||||
+62
-9
@@ -25,20 +25,70 @@
|
|||||||
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
|
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
|
||||||
- **SSL:** Custom cert bundle for Fedora 43 compatibility
|
- **SSL:** Custom cert bundle for Fedora 43 compatibility
|
||||||
|
|
||||||
## Search Hint Strategy
|
---
|
||||||
|
|
||||||
`webscraper_search_hint` uses Brave Search because:
|
## 🔍 Search: The Two-Step Research Pattern
|
||||||
|
|
||||||
|
`webscraper_search_hint` is the **entry point for all web research**. The recommended workflow is:
|
||||||
|
|
||||||
|
```
|
||||||
|
Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
|
||||||
|
Step 2: webscraper_fetch(best_url) → get full page content
|
||||||
|
```
|
||||||
|
|
||||||
|
This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.
|
||||||
|
|
||||||
|
### Why Brave Search?
|
||||||
|
|
||||||
|
`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
|
||||||
- ✅ Returns real results without CAPTCHA or consent walls
|
- ✅ Returns real results without CAPTCHA or consent walls
|
||||||
|
- ✅ No API key required — works with plain HTTP GET
|
||||||
|
- ✅ Handles special characters (C++, &, %, etc.) via URL encoding
|
||||||
- ❌ Google blocks plain HTTP with 302 consent redirect
|
- ❌ Google blocks plain HTTP with 302 consent redirect
|
||||||
- ❌ DuckDuckGo blocks with CAPTCHA
|
- ❌ DuckDuckGo blocks with CAPTCHA
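To illustrate the URL-encoding point, here is a minimal sketch of how a query could be turned into a Brave search URL with the standard library. The function name `build_search_url` is an assumption, and the server's own implementation may differ; the `q` and `source=web` parameters mirror the `search_url` shown in the return-value example.

```python
from urllib.parse import urlencode

BRAVE_SEARCH = "https://search.brave.com/search"


def build_search_url(query: str) -> str:
    """Percent-encode the query so characters like +, & and % survive a plain GET."""
    return f"{BRAVE_SEARCH}?{urlencode({'q': query, 'source': 'web'})}"
```

`urlencode` encodes spaces as `+` and a literal `+` as `%2B`, which is why a query such as `C++` is not mistaken for `C` followed by two spaces.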
|
||||||
|
|
||||||
Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.
|
### Return Value
|
||||||
|
|
||||||
|
The tool returns a structured dict:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"query": "FastMCP tool decorator",
|
||||||
|
"search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
|
||||||
|
"result_count": 5,
|
||||||
|
"hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
|
||||||
|
"results": [
|
||||||
|
{
|
||||||
|
"title": "FastMCP Docs",
|
||||||
|
"url": "https://docs.fastmcp.dev",
|
||||||
|
"snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
|
||||||
|
},
|
||||||
|
...
|
||||||
|
]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
The `hint` field is a single string of `Title (url): snippet` entries separated by ` | `, with each snippet truncated to 120 characters, so you can decide which URL to fetch next without walking the `results` list.
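As a rough illustration of that format, the `hint` string could be assembled from `results` like this. `build_hint` is a hypothetical helper written for this example, assuming each result carries `title`, `url`, and `snippet` keys as in the JSON above; it is not necessarily how the server builds the field.

```python
from __future__ import annotations


def build_hint(results: list[dict], snippet_len: int = 120) -> str:
    """Join results as 'Title (url): snippet' entries separated by ' | '."""
    return " | ".join(
        # Slicing never fails on short or empty snippets.
        f"{r['title']} ({r['url']}): {r.get('snippet', '')[:snippet_len]}"
        for r in results
    )
```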
|
||||||
|
|
||||||
|
### Example: Two-Step Research Flow
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Get top 5 results for a query
|
# Step 1: Orient — what pages exist about this topic?
|
||||||
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
|
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
|
||||||
|
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."
|
||||||
|
|
||||||
|
# Step 2: Deep-dive the most relevant result
|
||||||
|
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Known Limitations
|
||||||
|
|
||||||
|
- **Reddit / Stack Overflow snippets** may be empty — these platforms block snippet extraction
|
||||||
|
- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
|
||||||
|
- **Use sparingly** — once per research task to get oriented, not for every query
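A caller can guard against the first two limitations with a small check on the returned dict. The sketch below is an assumption about how one might do that, not part of the server: `triage_results` is a hypothetical helper that warns when nothing came back (a possible sign of stale selectors) and moves snippet-less entries to the end rather than discarding them, since their URLs may still be worth fetching.

```python
from __future__ import annotations

import warnings


def triage_results(result: dict) -> list[dict]:
    """Warn on an empty result set; otherwise list entries with snippets first."""
    if result.get("result_count", 0) == 0:
        warnings.warn("0 results: Brave's markup may have changed; check the scraper's selectors")
        return []
    # sorted() is stable, so the original ranking is kept within each group.
    return sorted(result.get("results", []), key=lambda r: not r.get("snippet"))
```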
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## SSL Note — Fedora 43 Comodo Root CA
|
## SSL Note — Fedora 43 Comodo Root CA
|
||||||
|
|
||||||
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
|
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
|
||||||
@@ -58,13 +108,16 @@ uv run python src/server.py
|
|||||||
```bash
|
```bash
|
||||||
cd mcp/webscraper
|
cd mcp/webscraper
|
||||||
uv run pytest tests/ -v
|
uv run pytest tests/ -v
|
||||||
# 23/23 tests passing
|
# 28/28 tests passing
|
||||||
```
|
```
|
||||||
|
|
||||||
## Usage Examples
|
## Usage Examples
|
||||||
|
|
||||||
```python
|
```python
|
||||||
# Fetch a page as Markdown
|
# Step 1: Search — get candidate URLs for a topic
|
||||||
|
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
|
||||||
|
|
||||||
|
# Step 2: Deep-dive the most relevant URL
|
||||||
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
|
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
|
||||||
|
|
||||||
# Extract all links from Gitea repo
|
# Extract all links from Gitea repo
|
||||||
@@ -79,6 +132,6 @@ webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
|
|||||||
# Fetch specific section by CSS selector
|
# Fetch specific section by CSS selector
|
||||||
webscraper_fetch_section("https://docs.python.org", "#content")
|
webscraper_fetch_section("https://docs.python.org", "#content")
|
||||||
|
|
||||||
# Quick search orientation
|
# Search with special characters (C++, &, % all work)
|
||||||
webscraper_search_hint("Gitea wiki git clone", max_results=3)
|
webscraper_search_hint("C++ std::optional usage", max_results=3)
|
||||||
```
|
```
|
||||||
|
|||||||