# Webscraper SSL Certificate Verification — Assessment

**Date:** 2026-04-03
**Status:** ✅ RESOLVED
**Severity:** High (SSL verification was completely disabled via `verify=False`)

---

## 1. Problem Statement

The webscraper MCP server could not verify SSL certificates when making HTTPS requests.
As a band-aid, `_fetch_page()` (line 15 of `src/server.py`) passed `verify=False`, which
**disables all SSL verification**. That leaves the scraper vulnerable to
man-in-the-middle attacks and makes it silently accept invalid or expired certificates.

## 2. Reproduction

```
$ uv run python -c "import httpx; httpx.get('https://example.com', timeout=10)"
httpx.ConnectError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
unable to get local issuer certificate (_ssl.c:1081)
```

Even `openssl s_client` fails:

```
depth=2 C=US, O=SSL Corporation, CN=SSL.com TLS Transit ECC CA R2
verify error:num=20:unable to get local issuer certificate
Verify return code: 20 (unable to get local issuer certificate)
```

Yet `curl https://example.com` **succeeds** (exit code 0).

## 3. Root Cause Analysis

### 3.1 Hypotheses Considered (7)

| # | Hypothesis | Verdict |
|---|-----------|---------|
| 1 | certifi bundle outdated/missing root CA | ✅ **CONFIRMED**: "AAA Certificate Services" (Comodo root) is absent from certifi 2026.02.25 |
| 2 | System PEM bundle missing root CA | ✅ **CONFIRMED**: 0 matches for "AAA Certificate Services" in `/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem` |
| 3 | Python 3.14 SSL behavior change | ❌ System Python 3.14 has the same issue; not Python-version specific |
| 4 | OpenSSL 3.5.4 incompatibility | ❌ curl uses the same OpenSSL and succeeds |
| 5 | Expired/revoked certificate | ❌ Certificate chain is valid (curl succeeds) |
| 6 | Missing intermediate certificates | ❌ Server sends the full chain (3 certs); only the root is missing from the stores |
| 7 | httpx library bug | ❌ Same failure with raw `ssl.create_default_context()` |

### 3.2 The Actual Root Cause (2 issues)

**Issue A (PEM bundle gap):** The Cloudflare certificate chain for `example.com`
terminates at "AAA Certificate Services" (a Comodo root CA). This root CA is:
- ❌ **Missing** from `certifi` 2026.02.25 (`cacert.pem`, 272 KB)
- ❌ **Missing** from Fedora's extracted PEM bundle (`/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem`)
- ✅ **Present** in Fedora's p11-kit native trust store (`trust list` shows "Comodo AAA Services root")

This is why `curl` succeeds: curl on Fedora 43 uses the OpenSSL provider mechanism,
which can access p11-kit's PKCS#11 trust store directly, bypassing the PEM file.

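One way to check whether a PEM bundle contains a given root is to load it into an `ssl.SSLContext` and scan the subjects returned by `get_ca_certs()`. The following is a minimal sketch, not the script used for this assessment; the helper names `common_names` and `bundle_has_root` are hypothetical.

```python
import ssl


def common_names(ca_certs):
    """Collect commonName values from SSLContext.get_ca_certs()-style dicts."""
    names = set()
    for cert in ca_certs:
        # "subject" is a tuple of RDNs; each RDN is a tuple of (key, value) pairs.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.add(value)
    return names


def bundle_has_root(cafile, common_name):
    """Return True if the PEM bundle at `cafile` holds a CA with this CN."""
    # A bare TLS client context starts with no CA certificates loaded.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.load_verify_locations(cafile=cafile)
    return common_name in common_names(ctx.get_ca_certs())
```

Given the findings above, `bundle_has_root(certifi.where(), "AAA Certificate Services")` would be expected to return `False` on this setup, and the same call against the bundled `.pem` file would be expected to return `True`.
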
**Issue B (`verify=False` band-aid):** Instead of fixing certificate verification,
the previous code disabled it entirely with `verify=False`, which:
- Accepts expired certificates
- Accepts self-signed certificates
- Is vulnerable to MITM attacks
- Produces `InsecureRequestWarning` noise in logs

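To make concrete what is being switched off, the sketch below contrasts Python's default client context with an unverified one, using only the standard library. It assumes that `verify=False` in httpx amounts to roughly the second configuration; the function names are hypothetical.

```python
import ssl


def verified_context() -> ssl.SSLContext:
    """Default client context: chain validation and hostname checks are on."""
    return ssl.create_default_context()


def unverified_context() -> ssl.SSLContext:
    """Context with verification off (assumed equivalent of verify=False)."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False       # must be cleared before CERT_NONE is allowed
    ctx.verify_mode = ssl.CERT_NONE  # any certificate is accepted
    return ctx
```

With the second context, a handshake completes no matter who signed the server certificate or which hostname it names, which is why every item in the list above follows from the one flag.
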
### 3.3 Environment Details

| Component | Version |
|-----------|---------|
| Python | 3.14.3 (Fedora system) |
| OpenSSL | 3.5.4 |
| httpx | 0.28.1 |
| certifi | 2026.02.25 |
| ca-certificates | 2025.2.80_v9.0.304-1.2.fc43 |
| OS | Fedora 43 (kernel 6.19) |

## 4. Proposed Fix

### Initial proposal: use `truststore` to access the native OS trust store

The [`truststore`](https://truststore.readthedocs.io/) library provides an `ssl.SSLContext`-like API
that verifies against the **native OS certificate store** (Security framework on macOS,
CryptoAPI on Windows; on Linux it relies on OpenSSL's default CA locations). This is the [official recommendation from httpx](https://www.python-httpx.org/advanced/ssl/).

**Approaches tried:**

### Approach A: truststore (REJECTED, did not work)

`truststore.SSLContext` was tested but loaded 0 certs on this Fedora 43 / OpenSSL 3.5.4 setup.
`cert_store_stats()` raises `NotImplementedError`, and the PKCS#11 provider in `openssl.cnf` is
commented out, so this approach was abandoned.

### Approach B: certifi + extra certs directory (IMPLEMENTED ✅)

1. **`webscraper/certs/comodo-aaa-services-root.pem`**: the missing root CA, extracted from p11-kit
2. **`src/server.py`**: new `_build_ssl_context()`, called once at module load:

```python
import ssl
import certifi
from pathlib import Path

# Directory of extra root CAs bundled with the project.
_EXTRA_CERTS_DIR = Path(__file__).resolve().parent.parent / "certs"

def _build_ssl_context() -> ssl.SSLContext:
    """Build an SSL context from certifi + extra bundled root certs."""
    # Start from certifi's bundle with the default (verifying) settings.
    ctx = ssl.create_default_context(cafile=certifi.where())
    if _EXTRA_CERTS_DIR.is_dir():
        # Add every bundled PEM on top of the certifi roots.
        for pem in _EXTRA_CERTS_DIR.glob("*.pem"):
            ctx.load_verify_locations(cafile=str(pem))
    return ctx

# Built once at import time and reused for every request.
_SSL_CTX = _build_ssl_context()
```

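The same pattern can be exercised without certifi or network access, which is convenient in unit tests. This is a standalone sketch under assumptions, not the project's code: `build_ssl_context` and its parameters are hypothetical, and leaving `base_cafile` as `None` falls back to the system default store instead of certifi.

```python
import ssl
from pathlib import Path
from typing import Optional


def build_ssl_context(extra_dir: Path, base_cafile: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying context from a base bundle plus extra *.pem roots."""
    # With cafile=None, create_default_context() loads the system default CAs.
    ctx = ssl.create_default_context(cafile=base_cafile)
    if extra_dir.is_dir():
        # Sorted for a deterministic load order; each file adds to the trust set.
        for pem in sorted(extra_dir.glob("*.pem")):
            ctx.load_verify_locations(cafile=str(pem))
    return ctx
```

The resulting context keeps `verify_mode` at `ssl.CERT_REQUIRED` and hostname checking enabled. httpx accepts an `ssl.SSLContext` for its `verify=` argument, so a context built this way can be passed in place of `verify=False`.
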
### Why this approach?

| Approach | Result |
|----------|--------|
| `verify=False` | **Previous**: disabled all security |
| `verify=certifi.where()` | certifi bundle doesn't have the Comodo root CA |
| `ssl.create_default_context()` | Uses the same broken system PEM file |
| `sudo update-ca-trust` | System-level fix, requires root, didn't fully work |
| `truststore.SSLContext` | ❌ Loaded 0 certs on this setup, `NotImplementedError` |
| **certifi + extra certs dir** | ✅ **Works!** Certifi base + project-bundled missing CAs |

### Benefits of this approach:
- No `verify=False`: proper SSL verification is restored
- Missing CAs can be added by dropping `.pem` files into `certs/`
- No extra dependencies beyond certifi (already a transitive dep of httpx)
- SSL context is built once at module load, so there is no per-request overhead
- Works on all platforms (certifi is cross-platform)

### System-level fix (optional, for curl and other apps):
```bash
sudo cp webscraper/certs/comodo-aaa-services-root.pem /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract
```

## 5. Test Impact

- Existing tests use mocked `httpx.get` calls → **no test changes needed for SSL**
- Fixed pre-existing `test_404` bug: `HTTPStatusError` requires the `request=` kwarg (httpx API)
- Fixed `test_404` assertion: the error message must include the text "404"
- **18/18 tests passing**

## 6. Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| Bundled cert expires (2028-12-31) | Low | Well before then, certifi/system will include it |
| Some Cloudflare URLs fail on other machines | Low | Same cert can be added to `certs/` |
| New missing CAs in the future | Low | Drop `.pem` into `certs/`; no code change needed |
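
The expiry risk in the first row could be watched with a small check on the certificate's `notAfter` field. This is a hedged sketch: `days_until_expiry` is a hypothetical helper, and the time of day in the example string below is an assumption, since this document only records the date 2028-12-31.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days from `now` (epoch seconds) until a certificate's notAfter time.

    `not_after` uses the format found in SSLContext.get_ca_certs() entries,
    for example "Dec 31 23:59:59 2028 GMT" (time of day assumed here).
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    # A negative result means the certificate has already expired.
    return (expires_at - now) / 86400
```

A test or startup check could call this for each entry of `_SSL_CTX.get_ca_certs()` and log a warning once the value drops below a chosen threshold.
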