chore: reorganize into polyglot monorepo (workshop)
- Move bigmind/ -> mcp/bigmind/
- Move webscraper/ -> mcp/webscraper/
- Move mss-failsafe/ -> java/mss-failsafe/
- Move Wellmann-Shop/ -> java/wellmann-shop/ (normalize to kebab-case)
- Add .roo/ IDE config files to tracking
- Add plans/REPO_STRATEGY.md (monorepo strategy document)
- Expand .gitignore: Java/Maven, Node/TS, coverage, uv.lock
- Rewrite README.md as navigation index
- Update .roo/mcp.json webscraper path to mcp/webscraper/
@@ -0,0 +1,152 @@
# Webscraper SSL Certificate Verification: Assessment

**Date:** 2026-04-03
**Status:** ✅ RESOLVED
**Severity:** High (SSL verification completely disabled via `verify=False`)

---

## 1. Problem Statement

The webscraper MCP server could not verify SSL certificates when making HTTPS requests.
Before the fix, `_fetch_page()` (line 15 of `src/server.py`) passed `verify=False` as a
band-aid, which **disables all SSL verification**: the scraper was exposed to
man-in-the-middle attacks and silently accepted invalid or expired certificates.

## 2. Reproduction

```
$ uv run python -c "import httpx; httpx.get('https://example.com', timeout=10)"
httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
unable to get local issuer certificate (_ssl.c:1081)
```

Even `openssl s_client` fails:
```
depth=2 C=US, O=SSL Corporation, CN=SSL.com TLS Transit ECC CA R2
verify error:num=20:unable to get local issuer certificate
Verify return code: 20 (unable to get local issuer certificate)
```

Yet `curl https://example.com` **succeeds** (exit code 0).

## 3. Root Cause Analysis

### 3.1 Hypotheses Considered (7)

| # | Hypothesis | Verdict |
|---|-----------|---------|
| 1 | certifi bundle outdated/missing root CA | ✅ **CONFIRMED**: "AAA Certificate Services" (Comodo root) is absent from certifi 2026.02.25 |
| 2 | System PEM bundle missing root CA | ✅ **CONFIRMED**: 0 matches for "AAA Certificate Services" in `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem` |
| 3 | Python 3.14 SSL behavior change | ❌ System Python 3.14 has the same issue; not Python-version specific |
| 4 | OpenSSL 3.5.4 incompatibility | ❌ curl uses the same OpenSSL and succeeds |
| 5 | Expired/revoked certificate | ❌ Certificate chain is valid (curl succeeds) |
| 6 | Missing intermediate certificates | ❌ Server sends full chain (3 certs); only the root is missing from the stores |
| 7 | httpx library bug | ❌ Same failure with raw `ssl.create_default_context()` |

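
Hypotheses 1 and 2 come down to asking whether a given PEM bundle contains a root with a particular subject. A minimal sketch of that check, using only the standard library, might look like the following. The helper names `common_name` and `bundle_has_ca` are hypothetical and not part of the webscraper code.

```python
import ssl
from typing import Optional


def common_name(cert: dict) -> Optional[str]:
    """Return the subject commonName from a cert dict as produced by the ssl module."""
    # The subject is a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def bundle_has_ca(cafile: str, name: str) -> bool:
    """True if the PEM bundle contains a CA certificate with this commonName."""
    # A bare context starts with no trusted roots, so only cafile is inspected.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations(cafile=cafile)
    return any(common_name(cert) == name for cert in ctx.get_ca_certs())


# Example call (path taken from the table above):
# bundle_has_ca("/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem",
#               "AAA Certificate Services")
```

The same function can be pointed at `certifi.where()` to check the certifi bundle.
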
### 3.2 The Actual Root Cause (2 issues)

**Issue A: PEM bundle gap.** The Cloudflare certificate chain for `example.com`
terminates at "AAA Certificate Services" (a Comodo root CA). This root CA is:
- ❌ **Missing** from `certifi` 2026.02.25 (`cacert.pem`, 272KB)
- ❌ **Missing** from Fedora's extracted PEM bundle (`/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`)
- ✅ **Present** in Fedora's p11-kit native trust store (`trust list` shows "Comodo AAA Services root")

This is why `curl` succeeds: curl on Fedora 43 uses the OpenSSL provider mechanism,
which can access p11-kit's PKCS#11 trust store directly, bypassing the PEM file.

**Issue B: `verify=False` band-aid.** Instead of fixing the certificate verification,
the previous code disabled it entirely with `verify=False`, which:
- Accepts expired certificates
- Accepts self-signed certificates
- Is vulnerable to MITM attacks
- Produces `InsecureRequestWarning` noise in logs

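
To make Issue B concrete, the sketch below contrasts a default `ssl` context with an unverified one. It assumes that `verify=False` amounts to roughly the `lax` settings shown here (hostname checking off, `CERT_NONE`); `summarize` is an illustrative helper, not code from the project.

```python
import ssl


def summarize(ctx: ssl.SSLContext) -> dict:
    """Report the two settings that decide whether a peer certificate is checked."""
    return {
        "verify_mode": ctx.verify_mode.name,
        "check_hostname": ctx.check_hostname,
    }


# Default context: chain validation and hostname matching are both on.
strict = ssl.create_default_context()

# Unverified context (assumed equivalent of verify=False).
lax = ssl.create_default_context()
lax.check_hostname = False  # must be cleared before verify_mode can be CERT_NONE
lax.verify_mode = ssl.CERT_NONE

print(summarize(strict))
print(summarize(lax))
```

With `CERT_NONE`, any certificate the server presents is accepted, which is what makes the band-aid unsafe.
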
### 3.3 Environment Details

| Component | Version |
|-----------|---------|
| Python | 3.14.3 (Fedora system) |
| OpenSSL | 3.5.4 |
| httpx | 0.28.1 |
| certifi | 2026.02.25 |
| ca-certificates | 2025.2.80_v9.0.304-1.2.fc43 |
| OS | Fedora 43 (kernel 6.19) |

## 4. Proposed Fix

### Use `truststore` to access the native OS trust store

The [`truststore`](https://truststore.readthedocs.io/) library provides an `ssl.SSLContext`-like API
that accesses the **native OS certificate store** (the system OpenSSL trust configuration on Linux, Security framework on macOS,
CryptoAPI on Windows). This is the [official recommendation from httpx](https://www.python-httpx.org/advanced/ssl/).
This was the initial proposal. It did not work on this machine (Approach A), so Approach B was implemented instead.

**Changes implemented:**

### Approach A: truststore (REJECTED, did not work)

`truststore.SSLContext` was tested but loaded 0 certs on this Fedora 43 / OpenSSL 3.5.4 setup.
`cert_store_stats()` raises `NotImplementedError`. The PKCS#11 provider in `openssl.cnf` is
commented out. This approach was abandoned.

### Approach B: certifi + extra certs directory (IMPLEMENTED ✅)

1. **`webscraper/certs/comodo-aaa-services-root.pem`**: Missing root CA extracted from p11-kit
2. **`src/server.py`**: New `_build_ssl_context()` at module load:

```python
import ssl
import certifi
from pathlib import Path

# Project-level directory holding root CAs that certifi does not ship.
_EXTRA_CERTS_DIR = Path(__file__).resolve().parent.parent / "certs"

def _build_ssl_context() -> ssl.SSLContext:
    """Build an SSL context from certifi + extra bundled root certs."""
    # Start from certifi's bundle; verification and hostname checks stay enabled.
    ctx = ssl.create_default_context(cafile=certifi.where())
    if _EXTRA_CERTS_DIR.is_dir():
        # Add each bundled PEM on top of the certifi roots.
        for pem in _EXTRA_CERTS_DIR.glob("*.pem"):
            ctx.load_verify_locations(cafile=str(pem))
    return ctx

# Built once at import time and reused for every request.
_SSL_CTX = _build_ssl_context()
```

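
The snippet above builds the context but does not show where it is consumed. As a rough sketch, the context could be handed to httpx through its `verify` parameter. Here `build_ssl_context` is a parameterized variant of `_build_ssl_context()` written for illustration, and `fetch_page` is a hypothetical stand-in for `_fetch_page()`, whose real signature is not shown in this document.

```python
import ssl
from pathlib import Path
from typing import Optional


def build_ssl_context(extra_certs_dir: Path, base_cafile: Optional[str] = None) -> ssl.SSLContext:
    """Base bundle plus every *.pem found in extra_certs_dir."""
    # base_cafile=None falls back to the system defaults; the project passes certifi.where().
    ctx = ssl.create_default_context(cafile=base_cafile)
    if extra_certs_dir.is_dir():
        for pem in sorted(extra_certs_dir.glob("*.pem")):
            ctx.load_verify_locations(cafile=str(pem))
    return ctx


def fetch_page(url: str, ctx: ssl.SSLContext, timeout: float = 10.0) -> str:
    """Hypothetical fetch helper: verification uses ctx instead of verify=False."""
    import httpx  # imported lazily so the context helper works without httpx installed

    response = httpx.get(url, verify=ctx, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Passing the prebuilt context keeps certificate and hostname checks enabled while still trusting the extra roots.
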
### Why this approach?

| Approach | Problem |
|----------|---------|
| `verify=False` | **Previous**: disabled all security |
| `verify=certifi.where()` | certifi bundle doesn't have the Comodo root CA |
| `ssl.create_default_context()` | Uses the same broken system PEM file |
| `sudo update-ca-trust` | System-level fix, requires root, didn't fully work |
| `truststore.SSLContext` | ❌ Loaded 0 certs on this setup, `NotImplementedError` |
| **certifi + extra certs dir** | ✅ **Works!** Certifi base + project-bundled missing CAs |

### Benefits of this approach:
- No `verify=False`: proper SSL verification is restored
- Missing CAs can be added by dropping `.pem` files into `certs/`
- No extra dependencies beyond certifi (already a transitive dep of httpx)
- SSL context built once at module load, so there is no per-request overhead
- Works on all platforms (certifi is cross-platform)

### System-level fix (optional, for curl and other apps):
```bash
sudo cp webscraper/certs/comodo-aaa-services-root.pem /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract
```

## 5. Test Impact

- Existing tests use mocked `httpx.get` calls → **no test changes needed for SSL**
- Fixed pre-existing `test_404` bug: `HTTPStatusError` requires the `request=` kwarg (httpx API)
- Fixed `test_404` assertion: the error message must include the text "404"
- **18/18 tests passing**

## 6. Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| Bundled cert expires (2028-12-31) | Low | Well before then, certifi/system will include it |
| Some Cloudflare URLs fail on other machines | Low | Same cert can be added to `certs/` |
| New missing CAs in the future | Low | Drop `.pem` into `certs/`; no code change needed |

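
The expiry risk in the table could also be watched automatically. The following is a minimal sketch, assuming the bundled PEM files live in a `certs/` directory as described above; `days_until_expiry`, `expiring_certs`, and the 90-day threshold are illustrative choices, not part of the project.

```python
import ssl
import time
from pathlib import Path
from typing import List, Optional, Tuple


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left until a notAfter timestamp in the ssl module's text format."""
    current = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - current) / 86400


def expiring_certs(certs_dir: Path, warn_days: float = 90) -> List[Tuple[str, str]]:
    """List (file name, notAfter) for bundled CAs that expire within warn_days."""
    flagged = []
    for pem in sorted(certs_dir.glob("*.pem")):
        # Load each file into its own empty context so only that file is inspected.
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.load_verify_locations(cafile=str(pem))
        for cert in ctx.get_ca_certs():
            if days_until_expiry(cert["notAfter"]) < warn_days:
                flagged.append((pem.name, cert["notAfter"]))
    return flagged
```

A check like this could run in the test suite so that an approaching expiry surfaces as a failing test rather than as a production outage.
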