diff --git a/.roo/rules/05-webscraper-research.md b/.roo/rules/05-webscraper-research.md
new file mode 100644
index 0000000..d7c2a31
--- /dev/null
+++ b/.roo/rules/05-webscraper-research.md
@@ -0,0 +1,99 @@
+# Web Research Rules — Use webscraper_search_hint Proactively
+
+## Rule: Search Before Asking
+
+Before asking Patrick for information about a library, framework, API, technology, or error —
+**always try `webscraper_search_hint` first**.
+
+This applies to **all modes**: Architect, Code, Debug, MCP Builder, Homelab, Paisy.
+
+### Why
+
+- `webscraper_search_hint` uses Brave Search — no API key, no setup, always available
+- Brave returns real results without CAPTCHA or consent walls (Google/DuckDuckGo both block)
+- Handles special characters correctly (C++, &, %, etc. — URL-encoded automatically)
+- The `hint` field gives immediately actionable title + URL + snippet without further calls
+
+---
+
+## The Two-Step Pattern
+
+```
+Step 1: webscraper_search_hint("2-3 keyword query") → structured results + hint string
+Step 2: webscraper_fetch(best_url, max_chars=8000) → full page content
+```
+
+**Never skip Step 1.** It costs one tool call and often reveals the exact page to read.
+
+### Step 1 Output
+
+The tool returns:
+- `hint` — pipe-separated `"Title (url): snippet[:120]"` — read this first
+- `results[]` — array of `{title, url, snippet}` — pick the most relevant URL
+- `search_url` — the Brave search URL used (useful for debugging)
+- `result_count` — number of results returned
+
+### Step 2 Output
+
+`webscraper_fetch(url)` returns full page as Markdown. Use `max_chars` to control size
+(default 5000; use 8000–12000 for deep doc reads).
+
+---
+
+## Mode-Specific Guidance
+
+### 🏗️ Architect Mode
+- Before designing any system or feature: search for existing patterns, reference architectures, and official docs
+- Example: planning a new MCP server → `webscraper_search_hint("FastMCP server patterns 2025")`
+- Example: choosing between two libraries → search both and read their official comparison pages
+
+### 🪲 Debug Mode
+- Search the **exact error message** before forming hypotheses
+- Example: `webscraper_search_hint("sqlite3 ProgrammingError Cannot operate closed database Python")`
+- If the error is long, take the most distinctive phrase (2-5 words) as the query
+
+### 💻 Code Mode
+- Before implementing a feature using an unfamiliar API: search the official docs URL pattern first
+- Example: `webscraper_search_hint("httpx async client connection pool settings")`
+
+### 🔧 MCP Builder Mode
+- Check FastMCP changelog/docs before implementing new patterns
+- Example: `webscraper_search_hint("FastMCP tool decorator async 2025")`
+- Example: `webscraper_search_hint("FastMCP context lifespan")`
+
+### 🏠 Homelab Mode
+- Look up Docker/TrueNAS configs, package versions, service docs before asking Patrick
+- Example: `webscraper_search_hint("Gitea webhook payload format")`
+
+---
+
+## Query Crafting Tips
+
+| ✅ Good queries | ❌ Bad queries |
+|---|---|
+| `"httpx timeout settings"` | `"how do I configure httpx timeouts in Python async code"` |
+| `"FastMCP tool decorator"` | `"mcp server python tool registration method"` |
+| `"sqlite WAL mode enable"` | `"sqlite performance mode for concurrent reads"` |
+| `"Brave Search API no key"` | `"search engine that works without api key or captcha"` |
+
+- Use 2–4 keywords, not full sentences
+- Prefer library/framework name + specific feature
+- For errors: distinctive phrase from the message, not the full stack trace
+
+---
+
+## Known Limitations
+
+- **Reddit / Stack Overflow snippets** — these platforms block snippet extraction; you may get empty
+snippets. The URL is still valid — fetch it directly if needed.
+- **Brave CSS selector fragility** — Brave uses Svelte-generated class names that change. If `webscraper_search_hint` returns 0 results unexpectedly, the scraper's CSS selectors may need updating. Last verified working: 2026-04-05.
+- **Use sparingly** — one search call per research task to orient; then fetch specific pages. Don't call it in a loop.
+
+---
+
+## Anti-Patterns to Avoid
+
+- ❌ Asking Patrick "what's the FastMCP syntax for X?" before searching
+- ❌ Designing architecture without looking up existing solutions first
+- ❌ Forming a debug hypothesis without searching the error message
+- ❌ Writing code against an API from memory without verifying current docs
+- ❌ Calling `webscraper_search_hint` more than 2-3 times for the same topic (broaden/narrow the query instead)
diff --git a/docs/wiki/pages/Development-Conventions.md b/docs/wiki/pages/Development-Conventions.md
index 8724945..b27a191 100644
--- a/docs/wiki/pages/Development-Conventions.md
+++ b/docs/wiki/pages/Development-Conventions.md
@@ -145,6 +145,38 @@ Use the `new-mcp-server` Roo skill in MCP Builder mode for full scaffolding:
 3. Roo will load the new-mcp-server skill and scaffold everything
 ```
 
+## Web Research with mcp-webscraper
+
+Before asking Patrick for information about a library, framework, API, or technology — **search first**.
+
+The webscraper MCP server provides `webscraper_search_hint` (Brave Search, no API key, always available) as the entry point for all research tasks.
+Use the two-step pattern:
+
+```
+Step 1: webscraper_search_hint("topic or error message") → get candidate URLs
+Step 2: webscraper_fetch(best_url) → read the full page
+```
+
+### When to search
+
+| Situation | Action |
+|---|---|
+| Need docs for a library or framework | `webscraper_search_hint("library-name official docs")` |
+| Investigating an error or stack trace | `webscraper_search_hint("exact error message language")` |
+| Planning a feature — need design patterns | `webscraper_search_hint("pattern-name best practices")` |
+| Checking latest version / changelog | `webscraper_search_hint("library-name changelog release")` |
+| Looking up API contracts | `webscraper_fetch(official_docs_url)` directly |
+
+### Especially useful in
+
+- **🏗️ Architect mode** — look up patterns and docs *before* designing. Don't design blind.
+- **🪲 Debug mode** — search the exact error message before forming hypotheses.
+- **🔧 MCP Builder mode** — check FastMCP changelog for new patterns before implementing.
+
+### Known caveats
+
+- Reddit and Stack Overflow may return empty snippets (platform blocks)
+- Brave uses Svelte CSS classes that can change — if `webscraper_search_hint` returns 0 results, selectors may need updating (last verified: 2026-04-05)
+
 ## Gitea Repository
 
 Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps`
diff --git a/docs/wiki/pages/mcp-webscraper.md b/docs/wiki/pages/mcp-webscraper.md
index 7a86294..010932b 100644
--- a/docs/wiki/pages/mcp-webscraper.md
+++ b/docs/wiki/pages/mcp-webscraper.md
@@ -25,20 +25,70 @@
 - **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
 - **SSL:** Custom cert bundle for Fedora 43 compatibility
 
-## Search Hint Strategy
+---
 
-`webscraper_search_hint` uses Brave Search because:
+## 🔍 Search: The Two-Step Research Pattern
+
+`webscraper_search_hint` is the **entry point for all web research**.
+The recommended workflow is:
+
+```
+Step 1: webscraper_search_hint("your query") → get candidate URLs + snippets
+Step 2: webscraper_fetch(best_url) → get full page content
+```
+
+This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.
+
+### Why Brave Search?
+
+`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
 - ✅ Returns real results without CAPTCHA or consent walls
+- ✅ No API key required — works with plain HTTP GET
+- ✅ Handles special characters (C++, &, %, etc.) via URL encoding
 - ❌ Google blocks plain HTTP with 302 consent redirect
 - ❌ DuckDuckGo blocks with CAPTCHA
 
-Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.
+### Return Value
+
+The tool returns a structured dict:
+
+```json
+{
+  "query": "FastMCP tool decorator",
+  "search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
+  "result_count": 5,
+  "hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
+  "results": [
+    {
+      "title": "FastMCP Docs",
+      "url": "https://docs.fastmcp.dev",
+      "snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
+    },
+    ...
+  ]
+}
+```
+
+The `hint` field is a pipe-separated string of `"Title (url): snippet[:120]"` entries — immediately actionable for deciding which URL to fetch next.
+
+### Example: Two-Step Research Flow
 
 ```python
-# Get top 5 results for a query
-webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
+# Step 1: Orient — what pages exist about this topic?
+result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
+# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."
+
+# Step 2: Deep-dive the most relevant result
+content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
 ```
 
+### Known Limitations
+
+- **Reddit / Stack Overflow snippets** may be empty — these platforms block snippet extraction
+- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
+- **Use sparingly** — once per research task to get oriented, not for every query
+
+---
+
 ## SSL Note — Fedora 43 Comodo Root CA
 
 Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
@@ -58,13 +108,16 @@ uv run python src/server.py
 ```bash
 cd mcp/webscraper
 uv run pytest tests/ -v
-# 23/23 tests passing
+# 28/28 tests passing
 ```
 
 ## Usage Examples
 
 ```python
-# Fetch a page as Markdown
+# Step 1: Search — get candidate URLs for a topic
+webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
+
+# Step 2: Deep-dive the most relevant URL
 webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
 
 # Extract all links from Gitea repo
@@ -79,6 +132,6 @@ webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
 # Fetch specific section by CSS selector
 webscraper_fetch_section("https://docs.python.org", "#content")
 
-# Quick search orientation
-webscraper_search_hint("Gitea wiki git clone", max_results=3)
+# Search with special characters (C++, &, % all work)
+webscraper_search_hint("C++ std::optional usage", max_results=3)
 ```
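
A minimal sketch of the two string formats this change documents: the Brave `search_url` and the pipe-separated `hint`. The helper names `build_search_url` and `build_hint` are hypothetical and are not taken from the webscraper source. The query parameters (`q`, `source=web`) and the ` | ` separator are assumed from the sample return value in the diff, and the real server may build both differently.

```python
from urllib.parse import urlencode

# Host and parameters assumed from the sample `search_url` in the docs above.
BRAVE_SEARCH_BASE = "https://search.brave.com/search"


def build_search_url(query: str) -> str:
    """Hypothetical helper: percent-encode the query (C++, &, % are all escaped)."""
    params = urlencode({"q": query, "source": "web"})
    return f"{BRAVE_SEARCH_BASE}?{params}"


def build_hint(results: list, snippet_len: int = 120) -> str:
    """Hypothetical helper: join results as 'Title (url): snippet[:120]' with ' | '."""
    entries = []
    for item in results:
        # Reddit / Stack Overflow results may carry an empty or missing snippet.
        snippet = (item.get("snippet") or "")[:snippet_len]
        entries.append(f"{item['title']} ({item['url']}): {snippet}")
    return " | ".join(entries)


if __name__ == "__main__":
    print(build_search_url("FastMCP tool decorator"))
    # https://search.brave.com/search?q=FastMCP+tool+decorator&source=web
    print(build_search_url("C++ std::optional usage"))
    # https://search.brave.com/search?q=C%2B%2B+std%3A%3Aoptional+usage&source=web
```

Because the hint uses ` | ` as its separator, a title or snippet that itself contains a pipe would make the string ambiguous to split, so `results[]` is likely the safer field to parse programmatically, with `hint` kept for quick reading.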