merge: docs/wiki/promote-webscraper-search-hint → main

This commit is contained in:
Patrick Plate
2026-04-05 10:11:37 +02:00
3 changed files with 193 additions and 9 deletions
+99
View File
@@ -0,0 +1,99 @@
# Web Research Rules — Use webscraper_search_hint Proactively
## Rule: Search Before Asking
Before asking Patrick for information about a library, framework, API, technology, or error —
**always try `webscraper_search_hint` first**.
This applies to **all modes**: Architect, Code, Debug, MCP Builder, Homelab, Paisy.
### Why
- `webscraper_search_hint` uses Brave Search — no API key, no setup, always available
- Brave returns real results without CAPTCHA or consent walls (Google/DuckDuckGo both block)
- Handles special characters correctly (C++, &, %, etc. — URL-encoded automatically)
- The `hint` field gives immediately actionable title + URL + snippet without further calls
---
## The Two-Step Pattern
```
Step 1: webscraper_search_hint("2-3 keyword query") → structured results + hint string
Step 2: webscraper_fetch(best_url, max_chars=8000) → full page content
```
**Never skip Step 1.** It costs one tool call and often reveals the exact page to read.
### Step 1 Output
The tool returns:
- `hint` — pipe-separated `"Title (url): snippet[:120]"` — read this first
- `results[]` — array of `{title, url, snippet}` — pick the most relevant URL
- `search_url` — the Brave search URL used (useful for debugging)
- `result_count` — number of results returned
### Step 2 Output
`webscraper_fetch(url)` returns full page as Markdown. Use `max_chars` to control size
(default 5000; use 800012000 for deep doc reads).
---
## Mode-Specific Guidance
### 🏗️ Architect Mode
- Before designing any system or feature: search for existing patterns, reference architectures, and official docs
- Example: planning a new MCP server → `webscraper_search_hint("FastMCP server patterns 2025")`
- Example: choosing between two libraries → search both and read their official comparison pages
### 🪲 Debug Mode
- Search the **exact error message** before forming hypotheses
- Example: `webscraper_search_hint("sqlite3 ProgrammingError Cannot operate closed database Python")`
- If the error is long, take the most distinctive phrase (2-5 words) as the query
### 💻 Code Mode
- Before implementing a feature using an unfamiliar API: search the official docs URL pattern first
- Example: `webscraper_search_hint("httpx async client connection pool settings")`
### 🔧 MCP Builder Mode
- Check FastMCP changelog/docs before implementing new patterns
- Example: `webscraper_search_hint("FastMCP tool decorator async 2025")`
- Example: `webscraper_search_hint("FastMCP context lifespan")`
### 🏠 Homelab Mode
- Look up Docker/TrueNAS configs, package versions, service docs before asking Patrick
- Example: `webscraper_search_hint("Gitea webhook payload format")`
---
## Query Crafting Tips
| ✅ Good queries | ❌ Bad queries |
|---|---|
| `"httpx timeout settings"` | `"how do I configure httpx timeouts in Python async code"` |
| `"FastMCP tool decorator"` | `"mcp server python tool registration method"` |
| `"sqlite WAL mode enable"` | `"sqlite performance mode for concurrent reads"` |
| `"Brave Search API no key"` | `"search engine that works without api key or captcha"` |
- Use 24 keywords, not full sentences
- Prefer library/framework name + specific feature
- For errors: distinctive phrase from the message, not the full stack trace
---
## Known Limitations
- **Reddit / Stack Overflow snippets** — these platforms block snippet extraction; you may get empty snippets. The URL is still valid — fetch it directly if needed.
- **Brave CSS selector fragility** — Brave uses Svelte-generated class names that change. If `webscraper_search_hint` returns 0 results unexpectedly, the scraper's CSS selectors may need updating. Last verified working: 2026-04-05.
- **Use sparingly** — one search call per research task to orient; then fetch specific pages. Don't call it in a loop.
---
## Anti-Patterns to Avoid
- ❌ Asking Patrick "what's the FastMCP syntax for X?" before searching
- ❌ Designing architecture without looking up existing solutions first
- ❌ Forming a debug hypothesis without searching the error message
- ❌ Writing code against an API from memory without verifying current docs
- ❌ Calling `webscraper_search_hint` more than 2-3 times for the same topic (broaden/narrow the query instead)
@@ -145,6 +145,38 @@ Use the `new-mcp-server` Roo skill in MCP Builder mode for full scaffolding:
3. Roo will load the new-mcp-server skill and scaffold everything 3. Roo will load the new-mcp-server skill and scaffold everything
``` ```
## Web Research with mcp-webscraper
Before asking Patrick for information about a library, framework, API, or technology — **search first**.
The webscraper MCP server provides `webscraper_search_hint` (Brave Search, no API key, always available) as the entry point for all research tasks. Use the two-step pattern:
```
Step 1: webscraper_search_hint("topic or error message") → get candidate URLs
Step 2: webscraper_fetch(best_url) → read the full page
```
### When to search
| Situation | Action |
|---|---|
| Need docs for a library or framework | `webscraper_search_hint("library-name official docs")` |
| Investigating an error or stack trace | `webscraper_search_hint("exact error message language")` |
| Planning a feature — need design patterns | `webscraper_search_hint("pattern-name best practices")` |
| Checking latest version / changelog | `webscraper_search_hint("library-name changelog release")` |
| Looking up API contracts | `webscraper_fetch(official_docs_url)` directly |
### Especially useful in
- **🏗️ Architect mode** — look up patterns and docs *before* designing. Don't design blind.
- **🪲 Debug mode** — search the exact error message before forming hypotheses.
- **🔧 MCP Builder mode** — check FastMCP changelog for new patterns before implementing.
### Known caveats
- Reddit and Stack Overflow may return empty snippets (platform blocks)
- Brave uses Svelte CSS classes that can change — if `webscraper_search_hint` returns 0 results, selectors may need updating (last verified: 2026-04-05)
## Gitea Repository ## Gitea Repository
Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps` Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps`
+62 -9
View File
@@ -25,20 +25,70 @@
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA - **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
- **SSL:** Custom cert bundle for Fedora 43 compatibility - **SSL:** Custom cert bundle for Fedora 43 compatibility
## Search Hint Strategy ---
`webscraper_search_hint` uses Brave Search because: ## 🔍 Search: The Two-Step Research Pattern
`webscraper_search_hint` is the **entry point for all web research**. The recommended workflow is:
```
Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url) → get full page content
```
This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.
### Why Brave Search?
`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
- ✅ Returns real results without CAPTCHA or consent walls - ✅ Returns real results without CAPTCHA or consent walls
- ✅ No API key required — works with plain HTTP GET
- ✅ Handles special characters (C++, &, %, etc.) via URL encoding
- ❌ Google blocks plain HTTP with 302 consent redirect - ❌ Google blocks plain HTTP with 302 consent redirect
- ❌ DuckDuckGo blocks with CAPTCHA - ❌ DuckDuckGo blocks with CAPTCHA
Use it sparingly — once per research task — to get oriented before deep-scraping individual pages. ### Return Value
The tool returns a structured dict:
```json
{
"query": "FastMCP tool decorator",
"search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
"result_count": 5,
"hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
"results": [
{
"title": "FastMCP Docs",
"url": "https://docs.fastmcp.dev",
"snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
},
...
]
}
```
The `hint` field is a pipe-separated string of `"Title (url): snippet[:120]"` entries — immediately actionable for deciding which URL to fetch next.
### Example: Two-Step Research Flow
```python ```python
# Get top 5 results for a query # Step 1: Orient — what pages exist about this topic?
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5) result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."
# Step 2: Deep-dive the most relevant result
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
``` ```
### Known Limitations
- **Reddit / Stack Overflow snippets** may be empty — these platforms block snippet extraction
- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
- **Use sparingly** — once per research task to get oriented, not for every query
---
## SSL Note — Fedora 43 Comodo Root CA ## SSL Note — Fedora 43 Comodo Root CA
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/). Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
@@ -58,13 +108,16 @@ uv run python src/server.py
```bash ```bash
cd mcp/webscraper cd mcp/webscraper
uv run pytest tests/ -v uv run pytest tests/ -v
# 23/23 tests passing # 28/28 tests passing
``` ```
## Usage Examples ## Usage Examples
```python ```python
# Fetch a page as Markdown # Step 1: Search — get candidate URLs for a topic
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
# Step 2: Deep-dive the most relevant URL
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000) webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
# Extract all links from Gitea repo # Extract all links from Gitea repo
@@ -79,6 +132,6 @@ webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
# Fetch specific section by CSS selector # Fetch specific section by CSS selector
webscraper_fetch_section("https://docs.python.org", "#content") webscraper_fetch_section("https://docs.python.org", "#content")
# Quick search orientation # Search with special characters (C++, &, % all work)
webscraper_search_hint("Gitea wiki git clone", max_results=3) webscraper_search_hint("C++ std::optional usage", max_results=3)
``` ```