fix(mcp-webscraper): improve search_hint quality — quote_plus, richer hint, dedup, result_count

- Use urllib.parse.quote_plus instead of str.replace(' ', '+') for correct URL encoding of special chars (&, %, +, #, =)
- Add search_url field to return dict so caller can verify/debug the query
- Add result_count field for quick summary without len(results)
- Deduplicate results by URL via seen_urls set
- Filter cards with both empty title AND empty snippet
- Richer hint string: 'Title (url): snippet[:120]' pipe-separated
- Max-results guard now breaks early (no over-fetching)
- 5 new tests (23→28): URL encoding, result_count, dedup, empty filter, hint format
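
The encoding fix in the first bullet can be illustrated with a minimal standalone sketch. It reuses the query string from the new URL-encoding test; the variable names `naive` and `encoded` are illustrative only and do not appear in the server code.

```python
from urllib.parse import parse_qs, quote_plus

query = "C++ tutorial & guide 50%"

# Old approach: only spaces are rewritten, so '+', '&' and '%' leak into the URL
naive = query.replace(" ", "+")

# New approach: every reserved character is percent-encoded, spaces become '+'
encoded = quote_plus(query)

print(naive)    # C+++tutorial+&+guide+50%
print(encoded)  # C%2B%2B+tutorial+%26+guide+50%25

# Decoding the query string the way a server would shows the difference:
# the encoded form round-trips, the naive form does not.
print(parse_qs(f"q={encoded}")["q"][0] == query)  # True
print(parse_qs(f"q={naive}")["q"][0] == query)    # False
```

In the naive form the raw `&` splits the query string into two parameters and each literal `+` is decoded as a space, so the search engine would receive a different query than the caller intended.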
chore(roo): document git-based wiki workflow in rules, skill, and README
2026-04-05 09:57:43 +02:00 · 2026-04-05 09:53:08 +02:00 · 2026-04-05 09:53:05 +02:00 · 2026-04-05 09:48:22 +02:00
5 changed files with 205 additions and 22 deletions
@@ -20,6 +20,28 @@ Patrick is in MCP Builder mindset. He is building or extending MCP servers in th
      README.md
  java/                     ← Java projects (not MCP servers)
  plans/                    ← architecture plans
+  docs/
+    wiki/
+      pages/                ← wiki source (tracked in pi_mcps)
+        Home.md, _Sidebar.md, ...
+      deploy_wiki.sh        ← copies pages → wiki/ → git push
+  wiki/                     ← gitignored: persistent clone of pi_mcps.wiki.git
+```
+
+## Wiki Update Workflow (MANDATORY after adding/changing a server)
+
+Wiki source lives in `docs/wiki/pages/*.md` — real Markdown files, tracked in the main repo.
+
+```bash
+# 1. Edit the relevant page(s) in docs/wiki/pages/
+# 2. Deploy to Gitea wiki:
+./docs/wiki/deploy_wiki.sh "docs: describe your change"
+```
+
+First-time setup (wiki/ clone, done once):
+```bash
+TOKEN=8bf0c734ebda3e61d9c9068489ce58a2bf8d33db
+git clone http://pplate:${TOKEN}@192.168.188.119:30008/pplate/pi_mcps.wiki.git wiki/
 ```

 ## FastMCP Pattern (non-negotiable)
@@ -81,5 +103,6 @@ test = ["pytest", "pytest-mock", "pytest-cov"]
 1. **Store Fact:** `memory_store_fact("codebase", "mcp/{name} has N tools: [list]. Stack: X. Env vars: Y.")`
 2. **Wire into .roo/mcp.json:** Add the server entry with correct uv path
 3. **Update root README.md:** Add to MCPs table
-4. **Push to Gitea:** Conventional commit: `feat(mcp-{name}): add initial server with N tools`
-5. **Resolve Hypothesis:** Was the tool count and auth pattern as predicted?
+4. **Update wiki:** Create or update `docs/wiki/pages/{server-name}.md` + update `MCP-Servers-Overview.md`, then run `./docs/wiki/deploy_wiki.sh`
+5. **Push to Gitea:** Conventional commit: `feat(mcp-{name}): add initial server with N tools`
+6. **Resolve Hypothesis:** Was the tool count and auth pattern as predicted?
@@ -9,6 +9,7 @@ description: Commits and pushes code to the homelab Gitea server using conventio
 - Finished a homelab change and need to commit + push
 - Finished an MCP server build or update
 - BigMind feature complete
+- Wiki pages were added or updated (always deploy wiki after docs changes)

 ## When NOT to use
 - ADP/Paisy work — that goes to the corporate Bitbucket, not homelab Gitea
@@ -18,12 +18,24 @@ workshop/

 ---

-## 🐍 MCP Servers (`mcp/`)
+## 📖 Wiki
+
+Full documentation lives in the [Gitea wiki](http://192.168.188.119:30008/pplate/pi_mcps/wiki).
+
+**Wiki source:** [`docs/wiki/pages/`](docs/wiki/pages/) — edit here, deploy with:
+```bash
+./docs/wiki/deploy_wiki.sh
+```
+
+---
+
+## 🐍 MCP Servers (`mcp/`)

 | Server | Description | Stack |
 |---|---|---|
 | [`mcp/bigmind/`](mcp/bigmind/) | Persistent AI memory — sessions, facts, hypotheses, profile UI | Python, FastMCP, SQLite, Flask |
-| [`mcp/webscraper/`](mcp/webscraper/) | Web scraping — fetch, links, tables, sections, sitemaps | Python, FastMCP, httpx, BeautifulSoup |
+| [`mcp/webscraper/`](mcp/webscraper/) | Web scraping, search — fetch, links, tables, Brave Search | Python, FastMCP, httpx, BeautifulSoup |
+| [`mcp/mcp-image-gen/`](mcp/mcp-image-gen/) | AI image generation — text-to-image via ComfyUI + FLUX.1-schnell | Python, FastMCP, httpx, ComfyUI |

 **Run a server:**
 ```bash
@@ -3,7 +3,7 @@
 import httpx
 from bs4 import BeautifulSoup
 from html2text import html2text
-from urllib.parse import urljoin
+from urllib.parse import urljoin, quote_plus
 from typing import List, Dict, Tuple
 import re
 import ssl
@@ -275,15 +275,21 @@ def webscraper_search_hint(query: str, max_results: int = 5) -> Dict:
        max_results: Maximum number of results to return (default: 5)

    Returns:
-        Dict with 'query', 'results' (list of {title, url, snippet}), 'hint'
+        Dict with 'query', 'search_url', 'results' (list of {title, url, snippet}),
+        'result_count', 'hint'
    """
+    search_url = f"https://search.brave.com/search?q={quote_plus(query)}&source=web"
    try:
-        search_url = f"https://search.brave.com/search?q={query.replace(' ', '+')}&source=web"
        _, soup = _fetch_page(search_url)

        results = []
-        # Brave Search result cards: each <a> with class snippet contains title + description
-        for card in soup.select('.snippet')[:max_results]:
+        seen_urls: set = set()
+
+        # Brave Search result cards: each div.snippet contains title, URL, description
+        for card in soup.select('.snippet'):
+            if len(results) >= max_results:
+                break
+
            title_el = card.select_one('.snippet-title')
            url_el = card.select_one('a')
            desc_el = card.select_one('.snippet-description')
@@ -292,20 +298,48 @@ def webscraper_search_hint(query: str, max_results: int = 5) -> Dict:
            url = url_el['href'] if url_el and url_el.get('href') else ""
            snippet = desc_el.get_text(strip=True) if desc_el else ""

-            if url and url.startswith('http'):
-                results.append({"title": title, "url": url, "snippet": snippet})
+            # Filter: must have a valid http(s) URL
+            if not url or not url.startswith('http'):
+                continue

-        hint = "; ".join(
-            f"{r['title']}: {r['url']}" for r in results
-        ) if results else "No results found"
+            # Filter: skip results with no useful content at all
+            if not title and not snippet:
+                continue
+
+            # Deduplicate by URL
+            if url in seen_urls:
+                continue
+            seen_urls.add(url)
+
+            results.append({"title": title, "url": url, "snippet": snippet})
+
+        # Richer hint: title + url + first 120 chars of snippet for AI context
+        if results:
+            hint_parts = []
+            for r in results:
+                part = f"{r['title']} ({r['url']})"
+                if r['snippet']:
+                    part += f": {r['snippet'][:120]}"
+                hint_parts.append(part)
+            hint = " | ".join(hint_parts)
+        else:
+            hint = "No results found"

        return {
            "query": query,
+            "search_url": search_url,
            "results": results,
+            "result_count": len(results),
            "hint": hint,
        }
    except (httpx.RequestError, httpx.HTTPStatusError) as e:
-        return {"query": query, "results": [], "hint": f"Error: {str(e)}"}
+        return {
+            "query": query,
+            "search_url": search_url,
+            "results": [],
+            "result_count": 0,
+            "hint": f"Error: {str(e)}",
+        }


 if __name__ == "__main__":
@@ -234,18 +234,92 @@ def mock_brave_response():
    return mock_resp


+@pytest.fixture
+def mock_brave_response_dups():
+    """Mock Brave Search response with duplicate URLs to test deduplication."""
+    mock_resp = MagicMock()
+    mock_resp.status_code = 200
+    mock_resp.text = """
+    <html><body>
+        <div class="snippet">
+            <a href="https://example.com/dup">Dup Result A</a>
+            <div class="snippet-title">Dup Result A</div>
+            <div class="snippet-description">First occurrence.</div>
+        </div>
+        <div class="snippet">
+            <a href="https://example.com/dup">Dup Result B</a>
+            <div class="snippet-title">Dup Result B</div>
+            <div class="snippet-description">Second occurrence — same URL.</div>
+        </div>
+        <div class="snippet">
+            <a href="https://example.com/unique">Unique Result</a>
+            <div class="snippet-title">Unique Result</div>
+            <div class="snippet-description">Only once.</div>
+        </div>
+    </body></html>
+    """
+    mock_resp.headers = {"content-type": "text/html"}
+    return mock_resp
+
+
+@pytest.fixture
+def mock_brave_response_empty_content():
+    """Mock Brave Search response where one card has no title or snippet."""
+    mock_resp = MagicMock()
+    mock_resp.status_code = 200
+    mock_resp.text = """
+    <html><body>
+        <div class="snippet">
+            <a href="https://example.com/ghost"></a>
+            <div class="snippet-title"></div>
+            <div class="snippet-description"></div>
+        </div>
+        <div class="snippet">
+            <a href="https://example.com/real">Real Result</a>
+            <div class="snippet-title">Real Result</div>
+            <div class="snippet-description">Has content.</div>
+        </div>
+    </body></html>
+    """
+    mock_resp.headers = {"content-type": "text/html"}
+    return mock_resp
+
+
@patch('httpx.get')
 def test_webscraper_search_hint_returns_structure(mock_get, mock_brave_response):
-    """Test that search hint returns correct dict structure."""
+    """Test that search hint returns all required dict fields."""
    mock_get.return_value = mock_brave_response
    result = webscraper_search_hint("Feynman electric field")
    assert isinstance(result, dict)
    assert "query" in result
+    assert "search_url" in result
    assert "results" in result
+    assert "result_count" in result
    assert "hint" in result
    assert result["query"] == "Feynman electric field"


+@patch('httpx.get')
+def test_webscraper_search_hint_search_url_encoded(mock_get, mock_brave_response):
+    """Test that search_url uses proper URL encoding (quote_plus, not str.replace)."""
+    mock_get.return_value = mock_brave_response
+    # Query with special chars that '+' replace would not handle
+    result = webscraper_search_hint("C++ tutorial & guide 50%")
+    search_url = result["search_url"]
+    # quote_plus encodes '+' as %2B, '&' as %26, '%' as %25
+    assert "c%2b%2b" in search_url.lower()
+    assert "%26" in search_url
+    assert "%25" in search_url
+
+
+@patch('httpx.get')
+def test_webscraper_search_hint_result_count(mock_get, mock_brave_response):
+    """Test that result_count matches the number of results returned."""
+    mock_get.return_value = mock_brave_response
+    result = webscraper_search_hint("Feynman electric field")
+    assert result["result_count"] == len(result["results"])
+
+
@patch('httpx.get')
 def test_webscraper_search_hint_filters_non_http(mock_get, mock_brave_response):
    """Test that javascript: URLs are excluded from results."""
@@ -262,25 +336,64 @@ def test_webscraper_search_hint_max_results(mock_get, mock_brave_response):
    mock_get.return_value = mock_brave_response
    result = webscraper_search_hint("Feynman electric field", max_results=1)
    assert len(result["results"]) <= 1
+    assert result["result_count"] <= 1
+
+
+@patch('httpx.get')
+def test_webscraper_search_hint_deduplicates_urls(mock_get, mock_brave_response_dups):
+    """Test that duplicate URLs are deduplicated — only first occurrence kept."""
+    mock_get.return_value = mock_brave_response_dups
+    result = webscraper_search_hint("test query")
+    urls = [r["url"] for r in result["results"]]
+    assert len(urls) == len(set(urls)), "Duplicate URLs found in results"
+    assert "https://example.com/dup" in urls
+    assert "https://example.com/unique" in urls
+    assert len(urls) == 2  # dup appears once, unique once
+
+
+@patch('httpx.get')
+def test_webscraper_search_hint_filters_empty_content(mock_get, mock_brave_response_empty_content):
+    """Test that cards with no title AND no snippet are excluded."""
+    mock_get.return_value = mock_brave_response_empty_content
+    result = webscraper_search_hint("test query")
+    # The ghost card (empty title and empty snippet) must be filtered out; the real result is kept
+    urls = [r["url"] for r in result["results"]]
+    assert "https://example.com/ghost" not in urls
+    assert "https://example.com/real" in urls


@patch('httpx.get')
 def test_webscraper_search_hint_error(mock_get):
-    """Test error handling in search hint."""
+    """Test error handling in search hint — returns all required fields."""
    mock_get.side_effect = httpx.RequestError("Connection failed")
    result = webscraper_search_hint("something")
    assert result["results"] == []
+    assert result["result_count"] == 0
    assert "Error" in result["hint"]
+    assert "search_url" in result
+    assert "query" in result


@patch('httpx.get')
-def test_webscraper_search_hint_hint_string(mock_get, mock_brave_response):
-    """Test that hint string is non-empty when results exist."""
+def test_webscraper_search_hint_hint_includes_snippet(mock_get, mock_brave_response):
+    """Test that the hint string includes snippet content, not just title+url."""
    mock_get.return_value = mock_brave_response
    result = webscraper_search_hint("Feynman electric field")
-    # hint should summarise results
-    assert len(result["hint"]) > 0
+    # hint should contain snippet text
+    assert "electric field" in result["hint"].lower()
    assert "No results found" not in result["hint"]
+    assert len(result["hint"]) > 0


-# Total: 23 tests covering all tools and edge cases
+@patch('httpx.get')
+def test_webscraper_search_hint_hint_format(mock_get, mock_brave_response):
+    """Test that hint uses pipe-separated format with URL in parens."""
+    mock_get.return_value = mock_brave_response
+    result = webscraper_search_hint("Feynman electric field")
+    # Format: "Title (url): snippet | Title2 (url2): snippet2"
+    assert "(" in result["hint"]
+    assert ")" in result["hint"]
+
+
+# Total: 28 tests covering all tools and edge cases
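
The filtering, deduplication, and hint formatting added in this diff can be tried without network access or HTML parsing. The following is a minimal sketch under assumptions: `build_hint` is a hypothetical helper, not a function in the repository, and it takes already-extracted cards as plain dicts instead of BeautifulSoup elements. The filter order and hint format mirror the hunk above.

```python
from typing import Dict, List


def build_hint(cards: List[Dict[str, str]], max_results: int = 5) -> Dict:
    """Hypothetical helper: filter, dedupe and summarise pre-extracted cards."""
    results: List[Dict[str, str]] = []
    seen_urls = set()

    for card in cards:
        if len(results) >= max_results:
            break  # stop early once enough results are collected

        url = card.get("url", "")
        title = card.get("title", "")
        snippet = card.get("snippet", "")

        if not url.startswith("http"):
            continue  # drops empty URLs and javascript: links
        if not title and not snippet:
            continue  # nothing useful to show
        if url in seen_urls:
            continue  # keep only the first occurrence of each URL
        seen_urls.add(url)
        results.append({"title": title, "url": url, "snippet": snippet})

    parts = []
    for r in results:
        part = f"{r['title']} ({r['url']})"
        if r["snippet"]:
            part += f": {r['snippet'][:120]}"
        parts.append(part)

    hint = " | ".join(parts) if parts else "No results found"
    return {"results": results, "result_count": len(results), "hint": hint}


cards = [
    {"title": "Dup Result A", "url": "https://example.com/dup", "snippet": "First occurrence."},
    {"title": "Dup Result B", "url": "https://example.com/dup", "snippet": "Second occurrence."},
    {"title": "", "url": "https://example.com/ghost", "snippet": ""},
    {"title": "Script", "url": "javascript:void(0)", "snippet": "Not a real link."},
    {"title": "Unique Result", "url": "https://example.com/unique", "snippet": "Only once."},
]

summary = build_hint(cards)
print(summary["result_count"])  # 2
print(summary["hint"])
```

With the sample cards, which combine the cases from the new fixtures, only the first `dup` entry and the `unique` entry survive, so the hint reads `Dup Result A (https://example.com/dup): First occurrence. | Unique Result (https://example.com/unique): Only once.`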
Author	SHA1	Message	Date
Patrick Plate	62c3b67e66	fix(mcp-webscraper): improve search_hint quality — quote_plus, richer hint, dedup, result_count - Use urllib.parse.quote_plus instead of str.replace(' ', '+') for correct URL encoding of special chars (&, %, +, #, =) - Add search_url field to return dict so caller can verify/debug the query - Add result_count field for quick summary without len(results) - Deduplicate results by URL via seen_urls set - Filter cards with both empty title AND empty snippet - Richer hint string: 'Title (url): snippet[:120]' pipe-separated - Max-results guard now breaks early (no over-fetching) - 5 new tests (23→28): URL encoding, result_count, dedup, empty filter, hint format	2026-04-05 09:57:43 +02:00
Patrick Plate	c2dd262727	chore(roo): document git-based wiki workflow in rules, skill, and README	2026-04-05 09:53:08 +02:00
Patrick Plate	9c2422d0a7	chore(roo): document git-based wiki workflow in rules, skill, and README - mcp-builder rules: add wiki/ to structure diagram, add Wiki Update Workflow section (MANDATORY), update After Building a Server checklist - gitea-push skill: add wiki deploy as a valid use case - README.md: add wiki section with deploy_wiki.sh pointer, add mcp-image-gen to MCP servers table	2026-04-05 09:53:05 +02:00
Patrick Plate	9a8403ad57	docs(wiki): migrate to git-based workflow with persistent wiki/ clone	2026-04-05 09:48:22 +02:00