# mcp-image-gen — Architecture Assessment
**Date:** 2026-04-04
**Author:** Lumen (for Patrick / pplate)
**Status:** ✅ APPROVED, ready for implementation
**BigMind Research Session:** `39809470-6ac8-4713-adf2-79ac0eb36ba7`

---

## 1. Problem Statement

LLM agents (Claude, local models via Ollama) have no native ability to generate images. Language
models excel at text, but creative and technical workflows increasingly need image output
(concept art, diagrams, product mockups, illustrations), all driven by a text prompt.

A FastMCP wrapper around a local image generation backend would give any MCP-capable IDE or
agent the ability to produce images on demand, with full control over resolution, steps, model,
and seed, without sending data to external cloud APIs.

**Gap being filled:** Local AI image generation accessible to LLM agents via MCP,
running entirely on Patrick's AMD RX 7900 XTX (24GB VRAM) with ROCm.

---

## 2. Requirements

### 2.1 Functional Requirements

| ID | Requirement |
|----|-------------|
| F-1 | Generate an image from a text prompt |
| F-2 | Support configurable resolution (width × height) |
| F-3 | Support configurable inference steps and seed for reproducibility |
| F-4 | Support negative prompts to exclude unwanted content |
| F-5 | List available models from the backend |
| F-6 | Check the status of an in-progress generation job |
| F-7 | Return generated image as both a file path AND inline base64 for agent display |
| F-8 | Configure output directory for saved images |
| F-9 | Support FLUX.1-schnell as the default model |

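To make F-1 through F-4 concrete, the sketch below groups the generation parameters into one validated object. The class name `GenerationRequest`, its field names, and the defaults (borrowed from the 1024×1024, 4-step benchmark in NF-1) are assumptions for illustration, not the signature the server actually uses.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical container for the parameters named in F-1 to F-4."""

    prompt: str                    # F-1
    width: int = 1024              # F-2, assumed default
    height: int = 1024             # F-2, assumed default
    steps: int = 4                 # F-3, assumed default
    seed: Optional[int] = None     # F-3, None means "draw a random seed"
    negative_prompt: str = ""      # F-4

    def __post_init__(self) -> None:
        # Reject obviously invalid input before anything is sent to the backend.
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        for name in ("width", "height", "steps"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be a positive integer")

    def resolved_seed(self) -> int:
        """Return the caller's seed, or draw one so the run can be repeated later."""
        if self.seed is not None:
            return self.seed
        return random.randint(0, 2**32 - 1)
```

Returning the resolved seed alongside the image is what lets a caller rerun the same request and expect the same output.
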
### 2.2 Non-Functional Requirements

| ID | Requirement |
|----|-------------|
| NF-1 | Generation time < 30 seconds for FLUX.1-schnell at 1024×1024, 4 steps |
| NF-2 | VRAM footprint < 12GB (leaves headroom on 24GB for Ollama co-existence) |
| NF-3 | Must work on AMD ROCm; no CUDA-only dependencies in the MCP server layer |
| NF-4 | No cloud API calls; fully local execution |
| NF-5 | Graceful error messages when ComfyUI is not running |
| NF-6 | MCP tools must work with FastMCP and be discoverable by Claude / Roo Code |

---

## 3. Technology Decision

### 3.1 Candidate Backends

| Backend | GitHub Stars | ROCm | REST API | FLUX Support | Verdict |
|---------|--------------|------|----------|--------------|---------|
| **ComfyUI** | 108k | ✅ Native | ✅ localhost:8188 | ✅ FLUX.1-schnell, FLUX.1-dev | ✅ **CHOSEN** |
| stable-diffusion.cpp | ~15k | ✅ ROCm/Vulkan | ❌ CLI only | ✅ FLUX.1-schnell | ⚠️ Viable alternative |
| PyTorch + diffusers | n/a | ✅ ROCm 7.2.1 | ❌ No REST | ✅ All models | ❌ Too complex to manage |
| Ollama image gen | n/a | ❌ Linux: N/A | ✅ /api/generate | ✅ FLUX.2, Z-Image | ❌ macOS-only as of April 2026 |
| A1111 / Forge WebUI | n/a | ⚠️ Limited | ✅ :7860 | ❌ SDXL primary | ❌ Not FLUX-native |

### 3.2 Why ComfyUI

1. **ROCm native:** ComfyUI's PyTorch backend runs on AMD GPUs via ROCm without forks or patches.
2. **REST API:** ComfyUI exposes a stable HTTP API at `localhost:8188`, so it is easy to wrap
   with `httpx`. No subprocess management or binary spawning is needed.
3. **Workflow-based:** ComfyUI workflows are JSON graphs. The MCP server ships a minimal
   FLUX.1-schnell workflow that is parameterized with prompt, size, steps, and seed at runtime.
4. **Model ecosystem:** ComfyUI's model manager supports FLUX.1, SDXL, SD3.5, ControlNet, and
   LoRA, giving a future-proof upgrade path.
5. **Community size:** 108k GitHub stars; extensive community support, model nodes, and extensions.
6. **VRAM efficiency:** FLUX.1-schnell requires ~8GB VRAM. Patrick's 24GB card runs it
   comfortably alongside Ollama.

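The runtime parameterization described in point 3 can be sketched as a copy-and-patch over a JSON graph. The template below is a heavily trimmed, hypothetical stand-in: the node IDs (`"3"`, `"5"`, `"6"`) and the wiring are placeholders, not the FLUX.1-schnell workflow the project ships, and a real graph needs model loader, decode, and save nodes as well.

```python
import copy
import json

# Hypothetical, trimmed API-format graph. Node IDs are arbitrary placeholders.
WORKFLOW_TEMPLATE = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 4}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}


def build_workflow(prompt: str, width: int, height: int, steps: int, seed: int) -> dict:
    """Return a patched copy of the template; the template itself is never mutated."""
    workflow = copy.deepcopy(WORKFLOW_TEMPLATE)
    workflow["6"]["inputs"]["text"] = prompt
    workflow["5"]["inputs"]["width"] = width
    workflow["5"]["inputs"]["height"] = height
    workflow["3"]["inputs"]["steps"] = steps
    workflow["3"]["inputs"]["seed"] = seed
    return workflow


def build_submit_body(workflow: dict) -> str:
    """Wrap the graph under a "prompt" key, the shape ComfyUI's submit endpoint expects."""
    return json.dumps({"prompt": workflow})
```

Deep-copying the template keeps concurrent tool calls from leaking parameters into each other.
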
### 3.3 Why NOT the Alternatives

- **Ollama:** Image generation is blocked on Linux, with no ETA for Linux support.
- **stable-diffusion.cpp:** CLI-based only. The MCP server would need to manage a subprocess,
  parse stdout, and handle crashes, which is more fragile than an HTTP API.
- **PyTorch + diffusers direct:** Requires managing Python environments, device placement, model
  loading, and memory management inside the MCP server process. This adds significant complexity
  and risk of VRAM conflicts.

---

## 4. Architecture Decision

### 4.1 System Overview

```
┌─────────────────────────────────────────────────────────┐
│  LLM Agent (Claude / Roo Code / local Ollama)           │
└───────────────────────────┬─────────────────────────────┘
                            │ MCP Protocol (stdio)
┌───────────────────────────▼─────────────────────────────┐
│  mcp-image-gen (FastMCP Python server)                  │
│                                                         │
│  Tools:                                                 │
│    • generate_image(prompt, width, height, steps, ...)  │
│    • list_available_models()                            │
│    • get_generation_status(prompt_id)                   │
│    • get_output_directory()                             │
└───────────────────────────┬─────────────────────────────┘
                            │ HTTP REST (httpx)
┌───────────────────────────▼─────────────────────────────┐
│  ComfyUI (localhost:8188)                               │
│  AMD ROCm + PyTorch                                     │
│  FLUX.1-schnell model                                   │
└─────────────────────────────────────────────────────────┘
                            │
                    ┌───────▼───────┐
                    │  ~/Pictures/  │
                    │  mcp-generated│
                    └───────────────┘
```

### 4.2 Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| HTTP client | `httpx` (async) | Already used in webscraper; async-friendly; clean timeout handling |
| Image return | Dual: path + base64 | File path for persistence; base64 `ImageContent` for inline Claude display |
| ImageContent type | `mcp.types.ImageContent` | FastMCP 3.x: **never** use `fastmcp.utilities.types.Image` with an `-> Image` return annotation, because it breaks serialization. Return `ImageContent` directly as a `ContentBlock`. |
| Job polling | Loop with sleep | ComfyUI `/api/queue` lists pending and running jobs, and finished jobs appear under `/api/history/{prompt_id}`; poll until done or timeout |
| Workflow format | ComfyUI API JSON | Minimal FLUX.1-schnell graph parameterized at runtime |
| Config | Env vars | `COMFYUI_URL`, `IMAGE_OUTPUT_DIR`; no hardcoded paths |
| Output naming | `{timestamp}_{seed}.png` | Sortable; records the seed for reproducibility; collides only if two images share a timestamp and seed |

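Three of the decisions above (env-var config, output naming, and the dual path + base64 return) are small enough to sketch together. This is a minimal illustration under stated assumptions: the fallback values in `load_config`, the `%Y%m%d_%H%M%S` timestamp layout, and the helper names are all hypothetical, and plain dicts stand in for the MCP `TextContent` and `ImageContent` types so the sketch runs without the MCP SDK installed.

```python
import base64
import os
from datetime import datetime, timezone
from pathlib import Path


def load_config(env=os.environ):
    """Read COMFYUI_URL and IMAGE_OUTPUT_DIR; both fallback values are assumptions."""
    url = env.get("COMFYUI_URL", "http://localhost:8188").rstrip("/")
    out_dir = Path(env.get("IMAGE_OUTPUT_DIR", "~/Pictures/mcp-generated")).expanduser()
    return url, out_dir


def output_filename(seed, now=None):
    """Build `{timestamp}_{seed}.png`; the timestamp layout is an assumed choice."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d_%H%M%S}_{seed}.png"


def dual_result(png_bytes, path, seed):
    """Return a text block (path + seed) and an image block (base64 PNG)."""
    return [
        {"type": "text", "text": f"Saved to {path} (seed {seed})"},
        {
            "type": "image",
            "data": base64.b64encode(png_bytes).decode("ascii"),
            "mimeType": "image/png",
        },
    ]
```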

---

## 5. Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ComfyUI not running when tool is called | High | High | Return a clear error: "ComfyUI not reachable at {url}. Start with: `python main.py --listen`" |
| Generation timeout (>60s) | Medium | Medium | Configurable timeout; on timeout, return a status message with the `prompt_id` so the agent can poll manually |
| VRAM contention with Ollama | Medium | Medium | FLUX.1-schnell uses ~8GB, leaving ~16GB of headroom on the 24GB card. Document that running both simultaneously may compete for VRAM when the Ollama model is larger than 8GB |
| ROCm driver instability | Low | High | ComfyUI falls back to CPU if ROCm is unavailable (slow but functional). Document ROCm setup. |
| ComfyUI API changes | Low | Medium | Pin the ComfyUI version in setup docs; the `/api/prompt`, `/api/queue`, and `/api/view` endpoints are stable |
| Large output files | Low | Low | PNG by default; add an optional JPEG quality param in v2 |
| Malformed workflow JSON | Low | High | Ship a tested, minimal FLUX.1-schnell workflow; validate before submit |

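The timeout mitigation can be sketched as a polling loop that gives up after a deadline and hands the `prompt_id` back to the caller. The sketch assumes the history lookup returns a dict keyed by `prompt_id` once the job has finished and an empty dict before that. `fetch_history`, `GenerationTimeout`, and the default timings are hypothetical names and values, and the HTTP call is injected as a callable so the loop can be exercised without a running ComfyUI.

```python
import time


class GenerationTimeout(Exception):
    """Raised when a job is still unfinished at the deadline; carries the prompt_id."""

    def __init__(self, prompt_id, waited):
        super().__init__(
            f"Generation {prompt_id} still running after {waited:.0f}s; "
            f"call get_generation_status('{prompt_id}') to check on it."
        )
        self.prompt_id = prompt_id


def wait_for_outputs(fetch_history, prompt_id, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the job appears in history, or raise GenerationTimeout."""
    deadline = clock() + timeout
    while True:
        entry = fetch_history(prompt_id).get(prompt_id)
        if entry is not None:
            # Finished: return the job's outputs (empty dict if none were recorded).
            return entry.get("outputs", {})
        if clock() >= deadline:
            raise GenerationTimeout(prompt_id, timeout)
        sleep(interval)
```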

---

## 6. Alternatives Considered

### 6.1 Ollama (Blocked)

Ollama added image generation in January 2026 (Z-Image Turbo, FLUX.2 Klein), but the feature is
**macOS-only** as of April 2026. Linux support is listed as "coming soon" with no ETA. This was
the originally preferred path (uniform API with text generation), but it is not viable on Fedora
Linux today.

**Migration path:** When Ollama image generation ships on Linux, a thin backend adapter can be
added to `mcp-image-gen` so it routes to Ollama instead of ComfyUI: same MCP tool signatures,
different HTTP target.

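One way to picture the migration path is a small backend interface that the tool layer depends on, so that swapping ComfyUI for Ollama changes only which implementation is registered. Everything named here (`ImageBackend`, `FakeBackend`, `BACKENDS`, the `backend_name` parameter) is hypothetical, and the fake backends return labelled bytes instead of calling any HTTP API.

```python
from typing import Dict, Protocol


class ImageBackend(Protocol):
    """What the MCP tool layer needs from any backend (hypothetical interface)."""

    def generate(self, prompt: str, seed: int) -> bytes: ...


class FakeBackend:
    """Stand-in so the sketch runs without a GPU or an HTTP server."""

    def __init__(self, label: str) -> None:
        self.label = label

    def generate(self, prompt: str, seed: int) -> bytes:
        return f"{self.label}:{prompt}:{seed}".encode()


BACKENDS: Dict[str, ImageBackend] = {
    "comfyui": FakeBackend("comfyui"),  # today: would wrap the ComfyUI REST client
    "ollama": FakeBackend("ollama"),    # later: would wrap Ollama once Linux support ships
}


def generate_image(prompt: str, seed: int, backend_name: str = "comfyui") -> bytes:
    """Tool-level entry point; its signature does not change when the backend does."""
    try:
        backend = BACKENDS[backend_name]
    except KeyError:
        raise ValueError(f"Unknown backend: {backend_name!r}") from None
    return backend.generate(prompt, seed)
```

With this shape, the routing decision lives in one lookup and the MCP tool signatures stay untouched.
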
### 6.2 stable-diffusion.cpp

The DiffuGen MCP server uses this approach. It requires:

- Building sd.cpp with ROCm/Vulkan flags
- Spawning a subprocess and parsing CLI output
- Managing the process in Python, since there is no REST API

Viable but more fragile than ComfyUI's HTTP API. It would be chosen only if ComfyUI proves
unworkable.

### 6.3 diffusers (Python library, direct)

This would run the diffusion pipeline inside the MCP server process. Problems:

- The MCP server process cannot easily share GPU memory with Ollama
- Model loading adds a 5-15s cold start to every MCP invocation
- Complex device placement / fp16 / ROCm configuration in server code
- Risk: a VRAM OOM crashes the MCP server process entirely

---

## 7. Success Criteria

| Criterion | Measure |
|-----------|---------|
| `generate_image` returns a valid PNG | File exists on disk; base64 decodes to valid PNG bytes |
| Claude can display the image inline | `ImageContent` returned in tool response, visible in Roo Code chat |
| FLUX.1-schnell at 1024×1024, 4 steps, completes in <30s | Measured on RX 7900 XTX with ROCm |
| `list_available_models` returns ComfyUI model list | At minimum includes `flux1-schnell.safetensors` |
| ComfyUI offline → clear error, not crash | Tool returns error string, no MCP server exception |
| All pytest tests pass | `uv run pytest tests/ -v` exits 0 with ≥80% coverage |
| Server wired into `.roo/mcp.json` | Tool appears in Roo Code MCP tool list |

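The first criterion ("base64 decodes to valid PNG bytes") could be checked with a helper along these lines. It is a lightweight sketch, not a full decoder: it confirms only the 8-byte PNG signature and that the first chunk is `IHDR`, so a truncated image would still pass. The function name `is_png_base64` is hypothetical.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png_base64(data: str) -> bool:
    """Return True if `data` is strict base64 that decodes to a PNG header."""
    try:
        raw = base64.b64decode(data, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    # Bytes 0-7: signature. Bytes 8-11: chunk length. Bytes 12-15: chunk type.
    return raw.startswith(PNG_SIGNATURE) and raw[12:16] == b"IHDR"
```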

---

## 8. Open Questions

| # | Question | Owner | Priority / Notes |
|---|----------|-------|------------------|
| Q1 | Should `generate_image` be synchronous (block until done) or return a `prompt_id` immediately? | Patrick | High. MVP will be synchronous; async polling is v2 |
| Q2 | Default output directory: `~/Pictures/mcp-generated` or `~/mcp-images`? | Patrick | Low. Configurable via env var |
| Q3 | Should we support SDXL as a second model in v1, or FLUX.1-schnell only? | Patrick | Low. FLUX.1-schnell only for v1 |
| Q4 | WebSocket API vs REST polling for job status? | Unassigned | ComfyUI has both; REST polling is simpler for v1 |