Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 4107b8ede2 | |
| | 4202094f01 | |
| | 62c3b67e66 | |
| | c2dd262727 | |
@@ -0,0 +1,99 @@
|
||||
# Web Research Rules — Use webscraper_search_hint Proactively
|
||||
|
||||
## Rule: Search Before Asking
|
||||
|
||||
Before asking Patrick for information about a library, framework, API, technology, or error, **always try `webscraper_search_hint` first**.
|
||||
|
||||
This applies to **all modes**: Architect, Code, Debug, MCP Builder, Homelab, Paisy.
|
||||
|
||||
### Why
|
||||
|
||||
- `webscraper_search_hint` uses Brave Search: no API key and no setup, so it works out of the box
- Brave returns real results without CAPTCHA or consent walls (Google redirects plain HTTP requests to a consent page; DuckDuckGo responds with a CAPTCHA)
- Special characters (`C++`, `&`, `%`, etc.) are handled correctly because the query is URL-encoded automatically
- The `hint` field gives an immediately actionable title, URL, and snippet for each result without further calls
|
||||
|
||||
---
|
||||
|
||||
## The Two-Step Pattern
|
||||
|
||||
```
Step 1: webscraper_search_hint("2-4 keyword query")   → structured results + hint string
Step 2: webscraper_fetch(best_url, max_chars=8000)    → full page content
```
|
||||
|
||||
**Never skip Step 1.** It costs one tool call and often reveals the exact page to read.
|
||||
|
||||
### Step 1 Output
|
||||
|
||||
The tool returns:

- `hint`: pipe-separated `"Title (url): snippet[:120]"` entries, each snippet truncated to 120 characters; read this first
- `results[]`: array of `{title, url, snippet}`; pick the most relevant URL
- `search_url`: the Brave search URL used (useful for debugging)
- `result_count`: number of results returned
|
||||
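As a rough illustration of "pick the most relevant URL", the sketch below prefers a result hosted on a domain you already trust (for example the library's official docs) and otherwise falls back to the first result. The helper name `pick_best_url` and the `preferred_domains` parameter are hypothetical and not part of the webscraper server; only the `{title, url, snippet}` result shape comes from the tool's documented output.

```python
from typing import Dict, List, Optional
from urllib.parse import urlparse


def pick_best_url(
    results: List[Dict[str, str]], preferred_domains: List[str]
) -> Optional[str]:
    """Return the URL to fetch in Step 2 (hypothetical helper).

    Prefers the first result whose host equals, or is a subdomain of,
    one of `preferred_domains`; otherwise returns the first result.
    Returns None when there are no results at all.
    """
    for result in results:
        host = urlparse(result["url"]).netloc.lower()
        for domain in preferred_domains:
            if host == domain or host.endswith("." + domain):
                return result["url"]
    return results[0]["url"] if results else None


# Example with results shaped like the tool's `results[]` array.
sample_results = [
    {"title": "PyPI httpx", "url": "https://pypi.org/project/httpx/", "snippet": ""},
    {
        "title": "HTTPX Docs",
        "url": "https://www.python-httpx.org/advanced/timeouts/",
        "snippet": "Configure timeout...",
    },
]
best = pick_best_url(sample_results, ["python-httpx.org"])
```

Keeping the fallback to the first result means the helper never blocks Step 2 just because no preferred domain showed up.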
|
||||
### Step 2 Output
|
||||
|
||||
`webscraper_fetch(url)` returns the full page as Markdown. Use `max_chars` to control the size (default 5000; use 8000–12000 for deep doc reads).
|
||||
|
||||
---
|
||||
|
||||
## Mode-Specific Guidance
|
||||
|
||||
### 🏗️ Architect Mode

- Before designing any system or feature, search for existing patterns, reference architectures, and official docs
- Example: planning a new MCP server → `webscraper_search_hint("FastMCP server patterns 2025")`
- Example: choosing between two libraries → search for both and read any official comparison pages

### 🪲 Debug Mode

- Search the **exact error message** before forming hypotheses
- Example: `webscraper_search_hint("sqlite3 ProgrammingError Cannot operate closed database Python")`
- If the error is long, use its most distinctive phrase (2-5 words) as the query

### 💻 Code Mode

- Before implementing a feature with an unfamiliar API, search for its official docs first
- Example: `webscraper_search_hint("httpx async client connection pool settings")`

### 🔧 MCP Builder Mode

- Check the FastMCP changelog and docs before implementing new patterns
- Example: `webscraper_search_hint("FastMCP tool decorator async 2025")`
- Example: `webscraper_search_hint("FastMCP context lifespan")`

### 🏠 Homelab Mode

- Look up Docker/TrueNAS configs, package versions, and service docs before asking Patrick
- Example: `webscraper_search_hint("Gitea webhook payload format")`
|
||||
|
||||
---
|
||||
|
||||
## Query Crafting Tips
|
||||
|
||||
| ✅ Good queries | ❌ Bad queries |
|---|---|
| `"httpx timeout settings"` | `"how do I configure httpx timeouts in Python async code"` |
| `"FastMCP tool decorator"` | `"mcp server python tool registration method"` |
| `"sqlite WAL mode enable"` | `"sqlite performance mode for concurrent reads"` |
| `"Brave Search API no key"` | `"search engine that works without api key or captcha"` |

- Use 2–4 keywords, not full sentences
- Prefer library/framework name + specific feature
- For errors: distinctive phrase from the message, not the full stack trace
|
||||
|
||||
---
|
||||
|
||||
## Known Limitations
|
||||
|
||||
- **Reddit / Stack Overflow snippets**: these platforms block snippet extraction, so snippets may come back empty. The URL is still valid; fetch it directly if needed.
- **Brave CSS selector fragility**: Brave uses Svelte-generated class names that change over time. If `webscraper_search_hint` unexpectedly returns 0 results, the scraper's CSS selectors may need updating. Last verified working: 2026-04-05.
- **Use sparingly**: make one search call per research task to get oriented, then fetch specific pages. Don't call it in a loop.
|
||||
|
||||
---
|
||||
|
||||
## Anti-Patterns to Avoid
|
||||
|
||||
- ❌ Asking Patrick "what's the FastMCP syntax for X?" before searching
- ❌ Designing architecture without looking up existing solutions first
- ❌ Forming a debug hypothesis without searching the error message
- ❌ Writing code against an API from memory without verifying current docs
- ❌ Calling `webscraper_search_hint` more than 2-3 times for the same topic (broaden or narrow the query instead)
|
||||
@@ -145,6 +145,38 @@ Use the `new-mcp-server` Roo skill in MCP Builder mode for full scaffolding:
|
||||
3. Roo will load the new-mcp-server skill and scaffold everything
|
||||
```
|
||||
|
||||
## Web Research with mcp-webscraper
|
||||
|
||||
Before asking Patrick for information about a library, framework, API, or technology — **search first**.
|
||||
|
||||
The webscraper MCP server provides `webscraper_search_hint` (Brave Search, no API key, always available) as the entry point for all research tasks. Use the two-step pattern:
|
||||
|
||||
```
Step 1: webscraper_search_hint("topic or error message")  → get candidate URLs
Step 2: webscraper_fetch(best_url)                         → read the full page
```
|
||||
|
||||
### When to search
|
||||
|
||||
| Situation | Action |
|---|---|
| Need docs for a library or framework | `webscraper_search_hint("library-name official docs")` |
| Investigating an error or stack trace | `webscraper_search_hint("exact error message language")` |
| Planning a feature — need design patterns | `webscraper_search_hint("pattern-name best practices")` |
| Checking latest version / changelog | `webscraper_search_hint("library-name changelog release")` |
| Looking up API contracts | `webscraper_fetch(official_docs_url)` directly |
|
||||
|
||||
### Especially useful in
|
||||
|
||||
- **🏗️ Architect mode**: look up patterns and docs *before* designing. Don't design blind.
- **🪲 Debug mode**: search the exact error message before forming hypotheses.
- **🔧 MCP Builder mode**: check the FastMCP changelog for new patterns before implementing.

### Known caveats

- Reddit and Stack Overflow may return empty snippets (these platforms block snippet extraction)
- Brave uses Svelte-generated CSS class names that can change; if `webscraper_search_hint` returns 0 results, the selectors may need updating (last verified: 2026-04-05)
|
||||
|
||||
## Gitea Repository
|
||||
|
||||
Code is hosted at: `http://192.168.188.119:30008/pplate/pi_mcps`
|
||||
|
||||
@@ -25,20 +25,70 @@
|
||||
- **Search backend:** Brave Search (`search.brave.com`) — works without CAPTCHA
|
||||
- **SSL:** Custom cert bundle for Fedora 43 compatibility
|
||||
|
||||
## Search Hint Strategy
|
||||
---
|
||||
|
||||
`webscraper_search_hint` uses Brave Search because:
|
||||
## 🔍 Search: The Two-Step Research Pattern
|
||||
|
||||
`webscraper_search_hint` is the **entry point for all web research**. The recommended workflow is:
|
||||
|
||||
```
Step 1: webscraper_search_hint("your query")   → get candidate URLs + snippets
Step 2: webscraper_fetch(best_url)             → get full page content
```
|
||||
|
||||
This avoids scraping irrelevant pages and gives you an overview before committing to a deep read.
|
||||
|
||||
### Why Brave Search?
|
||||
|
||||
`webscraper_search_hint` uses Brave Search (`search.brave.com`) because:
|
||||
- ✅ Returns real results without CAPTCHA or consent walls
- ✅ No API key required; works with a plain HTTP GET
- ✅ Handles special characters (`C++`, `&`, `%`, etc.) via URL encoding
- ❌ Google answers plain HTTP requests with a 302 redirect to a consent page
- ❌ DuckDuckGo blocks plain requests with a CAPTCHA
|
||||
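To make the URL-encoding point concrete, the sketch below compares the two ways of building the Brave search URL that appear in this change set: replacing spaces with `+` versus `urllib.parse.quote_plus`. The function names `naive_search_url` and `encoded_search_url` are illustrative only; the URL template is the one used by `webscraper_search_hint`.

```python
from urllib.parse import quote_plus


def naive_search_url(query: str) -> str:
    """Older approach: only spaces are replaced, everything else passes through."""
    return f"https://search.brave.com/search?q={query.replace(' ', '+')}&source=web"


def encoded_search_url(query: str) -> str:
    """Current approach: quote_plus percent-encodes reserved characters."""
    return f"https://search.brave.com/search?q={quote_plus(query)}&source=web"


query = "C++ tutorial & guide 50%"

print(naive_search_url(query))
# https://search.brave.com/search?q=C+++tutorial+&+guide+50%&source=web

print(encoded_search_url(query))
# https://search.brave.com/search?q=C%2B%2B+tutorial+%26+guide+50%25&source=web
```

In the naive URL the raw `&` ends the `q` parameter early and the literal `+` characters are read as spaces, so the search engine would see a different query than the one intended.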
|
||||
Use it sparingly — once per research task — to get oriented before deep-scraping individual pages.
|
||||
### Return Value
|
||||
|
||||
The tool returns a structured dict:
|
||||
|
||||
```json
{
  "query": "FastMCP tool decorator",
  "search_url": "https://search.brave.com/search?q=FastMCP+tool+decorator&source=web",
  "result_count": 5,
  "hint": "FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator registers a function as... | PyPI FastMCP (https://pypi.org/project/fastmcp/): FastMCP 2.x — modern MCP server framework... | ...",
  "results": [
    {
      "title": "FastMCP Docs",
      "url": "https://docs.fastmcp.dev",
      "snippet": "The @mcp.tool() decorator registers a function as an MCP tool..."
    },
    ...
  ]
}
```
|
||||
|
||||
The `hint` field is a pipe-separated string of `"Title (url): snippet[:120]"` entries — immediately actionable for deciding which URL to fetch next.
|
||||
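The hint format described above can be sketched as a small standalone function. This mirrors the hint-building logic shown in the server diff, but `build_hint` and its `snippet_chars` parameter are names chosen here for illustration; in the server the logic lives inline in `webscraper_search_hint`.

```python
from typing import Dict, List


def build_hint(results: List[Dict[str, str]], snippet_chars: int = 120) -> str:
    """Join results into a pipe-separated "Title (url): snippet" string."""
    if not results:
        return "No results found"

    parts = []
    for result in results:
        part = f"{result['title']} ({result['url']})"
        # Snippets can be empty (e.g. Reddit / Stack Overflow), so the
        # ": snippet" suffix is only added when there is text to show.
        if result["snippet"]:
            part += f": {result['snippet'][:snippet_chars]}"
        parts.append(part)
    return " | ".join(parts)


example = build_hint(
    [
        {"title": "FastMCP Docs", "url": "https://docs.fastmcp.dev", "snippet": "The @mcp.tool() decorator"},
        {"title": "PyPI FastMCP", "url": "https://pypi.org/project/fastmcp/", "snippet": ""},
    ]
)
print(example)
# FastMCP Docs (https://docs.fastmcp.dev): The @mcp.tool() decorator | PyPI FastMCP (https://pypi.org/project/fastmcp/)
```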
|
||||
### Example: Two-Step Research Flow
|
||||
|
||||
```python
|
||||
# Get top 5 results for a query
|
||||
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
|
||||
# Step 1: Orient — what pages exist about this topic?
|
||||
result = webscraper_search_hint("httpx async client timeout settings", max_results=5)
|
||||
# hint: "HTTPX Docs (https://www.python-httpx.org/...): Configure timeout... | ..."
|
||||
|
||||
# Step 2: Deep-dive the most relevant result
|
||||
content = webscraper_fetch("https://www.python-httpx.org/advanced/timeouts/", max_chars=8000)
|
||||
```
|
||||
|
||||
### Known Limitations
|
||||
|
||||
- **Reddit / Stack Overflow snippets** may be empty because these platforms block snippet extraction
- **Brave CSS selectors** use Svelte-generated class names that may change. If you get 0 results, the scraper's selectors may need updating (last verified: 2026-04-05)
- **Use sparingly**: once per research task to get oriented, not for every query
|
||||
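One way a caller might tell these situations apart is sketched below. It relies only on the documented return fields (`hint`, `result_count`, `results`) and on the `"Error: ..."` hint that the tool returns when the request fails; the function name `classify_search_response` and its return labels are hypothetical, and a `no_results` outcome is only a prompt to check the selectors, not proof that they are stale.

```python
from typing import Any, Dict


def classify_search_response(response: Dict[str, Any]) -> str:
    """Roughly classify a webscraper_search_hint response (hypothetical helper)."""
    hint = response.get("hint", "")
    if hint.startswith("Error:"):
        # The tool sets hint to "Error: <message>" when the HTTP request fails.
        return "request_failed"
    if response.get("result_count", 0) == 0:
        # Either the query truly has no hits or the CSS selectors are outdated.
        return "no_results"
    if all(not r.get("snippet") for r in response.get("results", [])):
        # URLs are still usable; fetch them directly for content.
        return "snippets_empty"
    return "ok"


failed = {"query": "x", "results": [], "result_count": 0, "hint": "Error: Connection failed"}
empty = {"query": "x", "results": [], "result_count": 0, "hint": "No results found"}
blocked = {
    "query": "x",
    "results": [{"title": "Thread", "url": "https://example.com/t", "snippet": ""}],
    "result_count": 1,
    "hint": "Thread (https://example.com/t)",
}
```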
|
||||
---
|
||||
|
||||
## SSL Note — Fedora 43 Comodo Root CA
|
||||
|
||||
Fedora 43 is missing the **Comodo AAA Services Root CA** needed for Cloudflare-protected sites. The fix is bundled at [`mcp/webscraper/certs/comodo-aaa-services-root.pem`](../src/branch/main/mcp/webscraper/certs/).
|
||||
@@ -58,13 +108,16 @@ uv run python src/server.py
|
||||
```bash
|
||||
cd mcp/webscraper
|
||||
uv run pytest tests/ -v
|
||||
# 23/23 tests passing
|
||||
# 28/28 tests passing
|
||||
```
|
||||
|
||||
## Usage Examples
|
||||
|
||||
```python
|
||||
# Fetch a page as Markdown
|
||||
# Step 1: Search — get candidate URLs for a topic
|
||||
webscraper_search_hint("FastMCP tool decorator syntax", max_results=5)
|
||||
|
||||
# Step 2: Deep-dive the most relevant URL
|
||||
webscraper_fetch("https://docs.fastmcp.dev", max_chars=10000)
|
||||
|
||||
# Extract all links from Gitea repo
|
||||
@@ -79,6 +132,6 @@ webscraper_fetch_meta("https://github.com/comfyanonymous/ComfyUI")
|
||||
# Fetch specific section by CSS selector
|
||||
webscraper_fetch_section("https://docs.python.org", "#content")
|
||||
|
||||
# Quick search orientation
|
||||
webscraper_search_hint("Gitea wiki git clone", max_results=3)
|
||||
# Search with special characters (C++, &, % all work)
|
||||
webscraper_search_hint("C++ std::optional usage", max_results=3)
|
||||
```
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
import httpx
|
||||
from bs4 import BeautifulSoup
|
||||
from html2text import html2text
|
||||
from urllib.parse import urljoin
|
||||
from urllib.parse import urljoin, quote_plus
|
||||
from typing import List, Dict, Tuple
|
||||
import re
|
||||
import ssl
|
||||
@@ -275,15 +275,21 @@ def webscraper_search_hint(query: str, max_results: int = 5) -> Dict:
|
||||
max_results: Maximum number of results to return (default: 5)
|
||||
|
||||
Returns:
|
||||
Dict with 'query', 'results' (list of {title, url, snippet}), 'hint'
|
||||
Dict with 'query', 'search_url', 'results' (list of {title, url, snippet}),
|
||||
'result_count', 'hint'
|
||||
"""
|
||||
search_url = f"https://search.brave.com/search?q={quote_plus(query)}&source=web"
|
||||
try:
|
||||
search_url = f"https://search.brave.com/search?q={query.replace(' ', '+')}&source=web"
|
||||
_, soup = _fetch_page(search_url)
|
||||
|
||||
results = []
|
||||
# Brave Search result cards: each <a> with class snippet contains title + description
|
||||
for card in soup.select('.snippet')[:max_results]:
|
||||
seen_urls: set = set()
|
||||
|
||||
# Brave Search result cards: each div.snippet contains title, URL, description
|
||||
for card in soup.select('.snippet'):
|
||||
if len(results) >= max_results:
|
||||
break
|
||||
|
||||
title_el = card.select_one('.snippet-title')
|
||||
url_el = card.select_one('a')
|
||||
desc_el = card.select_one('.snippet-description')
|
||||
@@ -292,20 +298,48 @@ def webscraper_search_hint(query: str, max_results: int = 5) -> Dict:
|
||||
url = url_el['href'] if url_el and url_el.get('href') else ""
|
||||
snippet = desc_el.get_text(strip=True) if desc_el else ""
|
||||
|
||||
if url and url.startswith('http'):
|
||||
results.append({"title": title, "url": url, "snippet": snippet})
|
||||
# Filter: must have a valid http(s) URL
|
||||
if not url or not url.startswith('http'):
|
||||
continue
|
||||
|
||||
hint = "; ".join(
|
||||
f"{r['title']}: {r['url']}" for r in results
|
||||
) if results else "No results found"
|
||||
# Filter: skip results with no useful content at all
|
||||
if not title and not snippet:
|
||||
continue
|
||||
|
||||
# Deduplicate by URL
|
||||
if url in seen_urls:
|
||||
continue
|
||||
seen_urls.add(url)
|
||||
|
||||
results.append({"title": title, "url": url, "snippet": snippet})
|
||||
|
||||
# Richer hint: title + url + first 120 chars of snippet for AI context
|
||||
if results:
|
||||
hint_parts = []
|
||||
for r in results:
|
||||
part = f"{r['title']} ({r['url']})"
|
||||
if r['snippet']:
|
||||
part += f": {r['snippet'][:120]}"
|
||||
hint_parts.append(part)
|
||||
hint = " | ".join(hint_parts)
|
||||
else:
|
||||
hint = "No results found"
|
||||
|
||||
return {
|
||||
"query": query,
|
||||
"search_url": search_url,
|
||||
"results": results,
|
||||
"result_count": len(results),
|
||||
"hint": hint,
|
||||
}
|
||||
except (httpx.RequestError, httpx.HTTPStatusError) as e:
|
||||
return {"query": query, "results": [], "hint": f"Error: {str(e)}"}
|
||||
return {
|
||||
"query": query,
|
||||
"search_url": search_url,
|
||||
"results": [],
|
||||
"result_count": 0,
|
||||
"hint": f"Error: {str(e)}",
|
||||
}
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -234,18 +234,92 @@ def mock_brave_response():
|
||||
return mock_resp
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_brave_response_dups():
|
||||
"""Mock Brave Search response with duplicate URLs to test deduplication."""
|
||||
mock_resp = MagicMock()
|
||||
mock_resp.status_code = 200
|
||||
mock_resp.text = """
|
||||
<html><body>
|
||||
<div class="snippet">
|
||||
<a href="https://example.com/dup">Dup Result A</a>
|
||||
<div class="snippet-title">Dup Result A</div>
|
||||
<div class="snippet-description">First occurrence.</div>
|
||||
</div>
|
||||
<div class="snippet">
|
||||
<a href="https://example.com/dup">Dup Result B</a>
|
||||
<div class="snippet-title">Dup Result B</div>
|
||||
<div class="snippet-description">Second occurrence — same URL.</div>
|
||||
</div>
|
||||
<div class="snippet">
|
||||
<a href="https://example.com/unique">Unique Result</a>
|
||||
<div class="snippet-title">Unique Result</div>
|
||||
<div class="snippet-description">Only once.</div>
|
||||
</div>
|
||||
</body></html>
|
||||
"""
|
||||
mock_resp.headers = {"content-type": "text/html"}
|
||||
return mock_resp
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_brave_response_empty_content():
|
||||
"""Mock Brave Search response where one card has no title or snippet."""
|
||||
mock_resp = MagicMock()
|
||||
mock_resp.status_code = 200
|
||||
mock_resp.text = """
|
||||
<html><body>
|
||||
<div class="snippet">
|
||||
<a href="https://example.com/ghost"></a>
|
||||
<div class="snippet-title"></div>
|
||||
<div class="snippet-description"></div>
|
||||
</div>
|
||||
<div class="snippet">
|
||||
<a href="https://example.com/real">Real Result</a>
|
||||
<div class="snippet-title">Real Result</div>
|
||||
<div class="snippet-description">Has content.</div>
|
||||
</div>
|
||||
</body></html>
|
||||
"""
|
||||
mock_resp.headers = {"content-type": "text/html"}
|
||||
return mock_resp
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_returns_structure(mock_get, mock_brave_response):
|
||||
"""Test that search hint returns correct dict structure."""
|
||||
"""Test that search hint returns all required dict fields."""
|
||||
mock_get.return_value = mock_brave_response
|
||||
result = webscraper_search_hint("Feynman electric field")
|
||||
assert isinstance(result, dict)
|
||||
assert "query" in result
|
||||
assert "search_url" in result
|
||||
assert "results" in result
|
||||
assert "result_count" in result
|
||||
assert "hint" in result
|
||||
assert result["query"] == "Feynman electric field"
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_search_url_encoded(mock_get, mock_brave_response):
|
||||
"""Test that search_url uses proper URL encoding (quote_plus, not str.replace)."""
|
||||
mock_get.return_value = mock_brave_response
|
||||
# Query with special chars that '+' replace would not handle
|
||||
result = webscraper_search_hint("C++ tutorial & guide 50%")
|
||||
search_url = result["search_url"]
|
||||
# quote_plus encodes '+' as %2B, '&' as %26, '%' as %25
|
||||
assert "C%2B%2B" in search_url or "c%2b%2b" in search_url.lower()
|
||||
assert "%26" in search_url
|
||||
assert "%25" in search_url
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_result_count(mock_get, mock_brave_response):
|
||||
"""Test that result_count matches the number of results returned."""
|
||||
mock_get.return_value = mock_brave_response
|
||||
result = webscraper_search_hint("Feynman electric field")
|
||||
assert result["result_count"] == len(result["results"])
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_filters_non_http(mock_get, mock_brave_response):
|
||||
"""Test that javascript: URLs are excluded from results."""
|
||||
@@ -262,25 +336,64 @@ def test_webscraper_search_hint_max_results(mock_get, mock_brave_response):
|
||||
mock_get.return_value = mock_brave_response
|
||||
result = webscraper_search_hint("Feynman electric field", max_results=1)
|
||||
assert len(result["results"]) <= 1
|
||||
assert result["result_count"] <= 1
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_deduplicates_urls(mock_get, mock_brave_response_dups):
|
||||
"""Test that duplicate URLs are deduplicated — only first occurrence kept."""
|
||||
mock_get.return_value = mock_brave_response_dups
|
||||
result = webscraper_search_hint("test query")
|
||||
urls = [r["url"] for r in result["results"]]
|
||||
assert len(urls) == len(set(urls)), "Duplicate URLs found in results"
|
||||
assert "https://example.com/dup" in urls
|
||||
assert "https://example.com/unique" in urls
|
||||
assert len(urls) == 2 # dup appears once, unique once
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_filters_empty_content(mock_get, mock_brave_response_empty_content):
|
||||
"""Test that cards with no title AND no snippet are excluded."""
|
||||
mock_get.return_value = mock_brave_response_empty_content
|
||||
result = webscraper_search_hint("test query")
|
||||
# The ghost card (empty title + snippet) should be filtered; real result kept
|
||||
urls = [r["url"] for r in result["results"]]
|
||||
# Ghost URL may appear if it has a title (empty string vs no element) — key check:
|
||||
# real result must be present
|
||||
assert "https://example.com/real" in urls
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_error(mock_get):
|
||||
"""Test error handling in search hint."""
|
||||
"""Test error handling in search hint — returns all required fields."""
|
||||
mock_get.side_effect = httpx.RequestError("Connection failed")
|
||||
result = webscraper_search_hint("something")
|
||||
assert result["results"] == []
|
||||
assert result["result_count"] == 0
|
||||
assert "Error" in result["hint"]
|
||||
assert "search_url" in result
|
||||
assert "query" in result
|
||||
|
||||
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_hint_string(mock_get, mock_brave_response):
|
||||
"""Test that hint string is non-empty when results exist."""
|
||||
def test_webscraper_search_hint_hint_includes_snippet(mock_get, mock_brave_response):
|
||||
"""Test that the hint string includes snippet content, not just title+url."""
|
||||
mock_get.return_value = mock_brave_response
|
||||
result = webscraper_search_hint("Feynman electric field")
|
||||
# hint should summarise results
|
||||
assert len(result["hint"]) > 0
|
||||
# hint should contain snippet text
|
||||
assert "electric field" in result["hint"].lower()
|
||||
assert "No results found" not in result["hint"]
|
||||
assert len(result["hint"]) > 0
|
||||
|
||||
|
||||
# Total: 23 tests covering all tools and edge cases
|
||||
@patch('httpx.get')
|
||||
def test_webscraper_search_hint_hint_format(mock_get, mock_brave_response):
|
||||
"""Test that hint uses pipe-separated format with URL in parens."""
|
||||
mock_get.return_value = mock_brave_response
|
||||
result = webscraper_search_hint("Feynman electric field")
|
||||
# Format: "Title (url): snippet | Title2 (url2): snippet2"
|
||||
assert "(" in result["hint"]
|
||||
assert ")" in result["hint"]
|
||||
|
||||
|
||||
# Total: 31 tests covering all tools and edge cases
|
||||
|
||||