Patrick Plate 8112ff2f12 feat(mcp-image-gen): scaffold ComfyUI-backed image generation MCP server

- FastMCP server with 4 tools: generate_image, list_available_models,
  get_generation_status, get_output_directory
- ComfyUI REST API client (httpx) polling lifecycle
- FLUX.1-schnell workflow JSON template
- Dual output: TextContent (path + seed) + ImageContent (base64 PNG)
- 14 passing pytest tests with respx HTTP mocking
- ROCm/AMD RX 7900 XTX optimized setup in README
- Ollama Linux migration path documented (future)

2026-04-04 11:49:31 +02:00


mcp-image-gen — Architecture Assessment

Date: 2026-04-04
Author: Lumen (for Patrick / pplate)
Status: ✅ APPROVED, ready for implementation
BigMind Research Session: 39809470-6ac8-4713-adf2-79ac0eb36ba7

1. Problem Statement

LLM agents (Claude, local models via Ollama) have no native ability to generate images. While language models excel at text, creative and technical workflows increasingly need image output — concept art, diagrams, product mockups, illustrations — all driven by a text prompt.

A FastMCP wrapper around a local image generation backend would give any MCP-capable IDE or agent the ability to produce images on demand, with full control over resolution, steps, model, and seed — without sending data to external cloud APIs.

Gap being filled: local AI image generation that LLM agents can reach via MCP, running entirely on Patrick's AMD RX 7900 XTX (24GB VRAM) with ROCm.

2. Requirements

2.1 Functional Requirements

ID	Requirement
F-1	Generate an image from a text prompt
F-2	Support configurable resolution (width × height)
F-3	Support configurable inference steps and seed for reproducibility
F-4	Support negative prompts to exclude unwanted content
F-5	List available models from the backend
F-6	Check the status of an in-progress generation job
F-7	Return generated image as both a file path AND inline base64 for agent display
F-8	Configure output directory for saved images
F-9	Support FLUX.1-schnell as the default model
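
The functional requirements F-1 to F-4 and F-9 suggest a small parameter bundle that could be validated before anything is sent to the backend. The following is a minimal sketch, not the server's actual code: the class name `GenerationRequest`, the `validate` method, and the multiple-of-8 size rule are assumptions made for illustration. The defaults (1024×1024, 4 steps, `flux1-schnell.safetensors`) are taken from elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical parameter bundle covering F-1 to F-4 and F-9."""

    prompt: str
    width: int = 1024
    height: int = 1024
    steps: int = 4
    seed: Optional[int] = None  # None means "let the server pick a seed"
    negative_prompt: str = ""
    model: str = "flux1-schnell.safetensors"

    def validate(self) -> None:
        """Raise ValueError with an agent-readable message on bad input."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        for name, value in (("width", self.width), ("height", self.height)):
            # The multiple-of-8 rule is an assumption, not a stated requirement.
            if value <= 0 or value % 8 != 0:
                raise ValueError(
                    f"{name} must be a positive multiple of 8, got {value}"
                )
        if self.steps < 1:
            raise ValueError(f"steps must be >= 1, got {self.steps}")
        if self.seed is not None and self.seed < 0:
            raise ValueError(f"seed must be non-negative, got {self.seed}")
```

Validating early keeps malformed input from ever reaching ComfyUI, so the agent gets a short, specific message instead of a backend error.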

2.2 Non-Functional Requirements

ID	Requirement
NF-1	Generation time < 30 seconds for FLUX.1-schnell at 1024×1024, 4 steps
NF-2	VRAM footprint < 12GB (leaves headroom on 24GB for Ollama co-existence)
NF-3	Must work on AMD ROCm — no CUDA-only dependencies in the MCP server layer
NF-4	No cloud API calls — fully local execution
NF-5	Graceful error messages when ComfyUI is not running
NF-6	MCP tools must work with FastMCP and be discoverable by Claude / Roo Code

3. Technology Decision

3.1 Candidate Backends

Backend	Stars	ROCm	REST API	FLUX Support	Verdict
ComfyUI	108k	✅ Native	✅ localhost:8188	✅ FLUX.1-schnell, FLUX.1-dev	✅ CHOSEN
stable-diffusion.cpp	~15k	✅ ROCm/Vulkan	❌ CLI only	✅ FLUX.1-schnell	⚠️ Viable alternative
PyTorch + diffusers	—	✅ ROCm 7.2.1	❌ No REST	✅ All models	❌ Too complex to manage
Ollama image gen	—	❌ Linux: N/A	✅ /api/generate	✅ FLUX.2, Z-Image	❌ macOS-only as of April 2026
A1111 / Forge WebUI	—	⚠️ Limited	✅ :7860	❌ SDXL primary	❌ Not FLUX-native

3.2 Why ComfyUI

ROCm native: ComfyUI's PyTorch backend runs on AMD GPUs via ROCm without forks or patches.
REST API: ComfyUI exposes a stable HTTP API at localhost:8188, which makes it straightforward to wrap with httpx. No subprocess management or binary spawning is needed.
Workflow-based: ComfyUI workflows are JSON graphs. The MCP server ships a minimal FLUX.1-schnell workflow that is parameterized with prompt, size, steps, and seed at runtime.
Model ecosystem: ComfyUI's model manager supports FLUX.1, SDXL, SD3.5, ControlNet, and LoRA, which gives a future-proof upgrade path.
Community size: 108k GitHub stars, with extensive community support, model nodes, and extensions.
VRAM efficiency: FLUX.1-schnell requires ~8GB VRAM. Patrick's 24GB card runs it comfortably alongside Ollama.
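
To make the "workflow-based" point concrete, here is a minimal sketch of runtime parameterization. It assumes ComfyUI's API-format graph, where each node id maps to a `class_type` and an `inputs` dict. The three-node template below is deliberately incomplete (no model loader, decoder, or save node) and is not the shipped FLUX.1-schnell workflow; the node ids, the wiring, and the helper names `build_workflow` and `build_payload` are illustrative assumptions.

```python
import copy

# Illustrative API-format graph: node id -> {"class_type", "inputs"}.
# NOT the shipped FLUX.1-schnell template; node ids and wiring are assumptions.
WORKFLOW_TEMPLATE = {
    "1": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
    "2": {
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1},
    },
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "seed": 0,
            "steps": 4,
            "positive": ["1", 0],      # link: output 0 of node "1"
            "latent_image": ["2", 0],  # link: output 0 of node "2"
        },
    },
}


def build_workflow(prompt: str, width: int, height: int, steps: int, seed: int) -> dict:
    """Return a parameterized copy; the module-level template is never mutated."""
    workflow = copy.deepcopy(WORKFLOW_TEMPLATE)
    workflow["1"]["inputs"]["text"] = prompt
    workflow["2"]["inputs"].update(width=width, height=height)
    workflow["3"]["inputs"].update(steps=steps, seed=seed)
    return workflow


def build_payload(workflow: dict) -> dict:
    """Wrap the graph in the request body shape assumed for the prompt endpoint."""
    return {"prompt": workflow}
```

Deep-copying the template on every call means concurrent requests cannot leak parameters into each other.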

3.3 Why NOT the Alternatives

Ollama: Blocked on Linux. Image generation is macOS-only as of April 2026, with no ETA for Linux support.
stable-diffusion.cpp: CLI-based only — the MCP server would need to manage a subprocess, parse stdout, handle crashes. More fragile than an HTTP API.
PyTorch + diffusers direct: Requires managing Python environments, device placement, model loading, memory management inside the MCP server process — adds significant complexity and risk of VRAM conflicts.

4. Architecture Decision

4.1 System Overview

┌─────────────────────────────────────────────────────────┐
│  LLM Agent (Claude / Roo Code / local Ollama)           │
└───────────────────────────┬─────────────────────────────┘
                            │ MCP Protocol (stdio)
┌───────────────────────────▼─────────────────────────────┐
│  mcp-image-gen  (FastMCP Python server)                 │
│                                                         │
│  Tools:                                                 │
│  • generate_image(prompt, width, height, steps, ...)    │
│  • list_available_models()                              │
│  • get_generation_status(prompt_id)                     │
│  • get_output_directory()                               │
└───────────────────────────┬─────────────────────────────┘
                            │ HTTP REST (httpx)
┌───────────────────────────▼─────────────────────────────┐
│  ComfyUI (localhost:8188)                               │
│  AMD ROCm + PyTorch                                     │
│  FLUX.1-schnell model                                   │
└─────────────────────────────────────────────────────────┘
                            │
                    ┌───────▼───────┐
                    │  ~/Pictures/  │
                    │  mcp-generated│
                    └───────────────┘

4.2 Key Decisions

Decision	Choice	Rationale
HTTP client	`httpx` (async)	Already used in webscraper; async-friendly; clean timeout handling
Image return	dual: path + base64	File path for persistence; base64 `ImageContent` for inline Claude display
ImageContent type	`mcp.types.ImageContent`	FastMCP 3.x: never use `fastmcp.utilities.types.Image` with `-> Image` annotation — it breaks serialization. Return `ImageContent` directly as a `ContentBlock`.
Job polling	loop with sleep	ComfyUI `/api/queue` lists running and pending jobs, and finished jobs appear under `/api/history/{prompt_id}`; poll until the job completes or the timeout expires
Workflow format	ComfyUI API JSON	Minimal FLUX.1-schnell graph parameterized at runtime
Config	env vars	`COMFYUI_URL`, `IMAGE_OUTPUT_DIR` — no hardcoded paths
Output naming	`{timestamp}_{seed}.png`	Reproducible, collision-free, sortable
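
The config and output-naming decisions can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the fallback URL comes from the `localhost:8188` address used in this document, the fallback directory is one of the two candidates from open question Q2, and the UTC `YYYYMMDD-HHMMSS` timestamp layout is an assumed format chosen because it sorts lexicographically.

```python
import os
import time
from pathlib import Path
from typing import Mapping, Optional, Tuple


def load_config(env: Mapping[str, str] = os.environ) -> Tuple[str, Path]:
    """Read COMFYUI_URL and IMAGE_OUTPUT_DIR, falling back to assumed defaults."""
    url = env.get("COMFYUI_URL", "http://localhost:8188").rstrip("/")
    # Assumed default; the real default is still open (see Q2).
    out_dir = Path(env.get("IMAGE_OUTPUT_DIR", "~/Pictures/mcp-generated"))
    return url, out_dir.expanduser()


def output_filename(seed: int, now: Optional[float] = None) -> str:
    """Build '{timestamp}_{seed}.png' with a sortable UTC timestamp."""
    # time.gmtime(None) uses the current time.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return f"{stamp}_{seed}.png"
```

For example, `output_filename(42, now=0)` yields `19700101-000000_42.png`. With one-second timestamp resolution, two images only collide if they share both the second and the seed.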

5. Risks

Risk	Likelihood	Impact	Mitigation
ComfyUI not running when tool is called	High	High	Return clear error: "ComfyUI not reachable at {url}. Start with: `python main.py --listen`"
Generation timeout (>60s)	Medium	Medium	Configurable timeout; return partial status message with `prompt_id` so agent can poll manually
VRAM contention with Ollama	Medium	Medium	FLUX.1-schnell uses ~8GB, leaving ~16GB of headroom on the 24GB card. Document that running both simultaneously may cause contention once the loaded Ollama model needs more than ~16GB
ROCm driver instability	Low	High	ComfyUI falls back to CPU if ROCm unavailable — slow but functional. Document ROCm setup.
ComfyUI API changes	Low	Medium	Pin ComfyUI version in setup docs; the `/api/prompt`, `/api/queue`, `/api/view` endpoints are stable
Large output files	Low	Low	PNG default; add optional JPEG quality param in v2
Malformed workflow JSON	Low	High	Ship a tested, minimal FLUX.1-schnell workflow; validate before submit
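
The first mitigation (a clear error instead of a crash when ComfyUI is down) could look roughly like the sketch below. The helper names are hypothetical, and the wrapper takes the request as an injected coroutine function so the example stays free of third-party imports. With httpx, the `connect_errors` tuple would presumably contain its connection error type rather than the built-in `ConnectionError` used here as a stand-in.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple, Type


def unreachable_message(url: str) -> str:
    """Format the agent-facing error text from the risk table."""
    return f"ComfyUI not reachable at {url}. Start with: `python main.py --listen`"


async def call_or_explain(
    url: str,
    request: Callable[[], Awaitable[Any]],
    connect_errors: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> Any:
    """Run the request; turn connection failures into a readable string."""
    try:
        return await request()
    except connect_errors:
        # Returning a string keeps the MCP server alive and lets the
        # agent relay the fix to the user.
        return unreachable_message(url)
```

Only connection failures are swallowed. Any other exception still propagates, so genuine bugs are not hidden behind the friendly message.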

6. Alternatives Considered

6.1 Ollama (Blocked)

Ollama added image generation in January 2026 (Z-Image Turbo, FLUX.2 Klein) but the feature is macOS-only as of April 2026. Linux support is listed as "coming soon" with no ETA. This was the originally preferred path (uniform API with text generation), but it is not viable on Fedora Linux today.

Migration path: When Ollama Linux image gen ships, a thin backend adapter can be added to mcp-image-gen so it routes to Ollama instead of ComfyUI — same MCP tool signatures, different HTTP target.
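
One way to picture the thin backend adapter is a shared interface with one implementation per backend. The sketch below is hypothetical throughout: `ImageBackend`, `GeneratedImage`, `StubBackend`, and `select_backend` are illustrative names, and the stub stands in for real ComfyUI or Ollama adapters, which would issue HTTP requests.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class GeneratedImage:
    png_bytes: bytes
    seed: int


class ImageBackend(Protocol):
    """Hypothetical interface every backend adapter would satisfy."""

    async def generate(
        self, prompt: str, width: int, height: int, steps: int, seed: int
    ) -> GeneratedImage: ...


class StubBackend:
    """Stand-in used here instead of a real ComfyUI or Ollama adapter."""

    def __init__(self, label: str) -> None:
        self.label = label

    async def generate(self, prompt, width, height, steps, seed):
        # A real adapter would talk to its backend over HTTP here.
        return GeneratedImage(png_bytes=self.label.encode(), seed=seed)


def select_backend(name: str, backends: Dict[str, ImageBackend]) -> ImageBackend:
    """Pick an adapter by name so the MCP tool code itself never changes."""
    try:
        return backends[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; expected one of {sorted(backends)}"
        ) from None
```

Because the tools depend only on the interface, switching from ComfyUI to Ollama would be a configuration change rather than a change to the tool signatures.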

6.2 stable-diffusion.cpp

The DiffuGen MCP server uses this approach. It requires:

Building sd.cpp with ROCm/Vulkan flags
Spawning a subprocess and parsing CLI output
Managing the process from Python, since there is no REST API

Viable, but more fragile than ComfyUI's HTTP API. It would be chosen only if ComfyUI proves unworkable.

6.3 diffusers (Python library, direct)

This would run the diffusion pipeline inside the MCP server process. Problems:

MCP server process cannot easily share GPU memory with Ollama
Model loading adds 5-15s cold start to every MCP invocation
Complex device placement / fp16 / ROCm configuration in server code
Risk: VRAM OOM crashes the MCP server process entirely

7. Success Criteria

Criterion	Measure
`generate_image` returns a valid PNG	File exists on disk, base64 decodes to valid PNG bytes
Claude can display the image inline	`ImageContent` returned in tool response, visible in Roo Code chat
FLUX.1-schnell at 1024×1024 4-step completes in <30s	Measured on RX 7900 XTX with ROCm
`list_available_models` returns ComfyUI model list	At minimum includes `flux1-schnell.safetensors`
ComfyUI offline → clear error, not crash	Tool returns error string, no MCP server exception
All pytest tests pass	`uv run pytest tests/ -v` exits 0 with ≥80% coverage
Server wired into `.roo/mcp.json`	Tool appears in Roo Code MCP tool list
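
The first criterion ("base64 decodes to valid PNG bytes") can be checked cheaply by decoding the payload and comparing the leading bytes with the fixed 8-byte PNG signature. The sketch below does only that: it does not parse chunks or verify checksums, and the function name is a hypothetical helper.

```python
import base64
import binascii

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png_base64(data: str) -> bool:
    """Return True if data is strict base64 whose bytes start like a PNG."""
    try:
        raw = base64.b64decode(data, validate=True)
    except (binascii.Error, ValueError):
        # Not valid base64 at all (bad characters or padding).
        return False
    return raw.startswith(PNG_SIGNATURE)
```

A test could apply this to the `ImageContent` payload and separately assert that the returned file path exists on disk.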

8. Open Questions

#	Question	Owner	Priority
Q1	Should `generate_image` be synchronous (block until done) or return a `prompt_id` immediately?	Patrick	High — MVP will be synchronous; async polling is v2
Q2	Default output directory: `~/Pictures/mcp-generated` or `~/mcp-images`?	Patrick	Low — configurable via env var
Q3	Should we support SDXL as a second model in v1, or FLUX.1-schnell only?	Patrick	Low — FLUX.1-schnell only for v1
Q4	WebSocket API vs REST polling for job status?	—	ComfyUI has both; REST polling is simpler for v1
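
For Q1 and Q4, the synchronous MVP with REST polling might look like the sketch below. It assumes a hypothetical `fetch_result` coroutine that returns `None` while the job is still pending and the job's result once it is done. How that function queries ComfyUI is left out on purpose. The default timeout and interval are illustrative values.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


async def wait_for_job(
    prompt_id: str,
    fetch_result: Callable[[str], Awaitable[Optional[Any]]],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> Optional[Any]:
    """Poll until fetch_result returns a value; return None on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = await fetch_result(prompt_id)
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Caller can report prompt_id so the agent may poll manually.
            return None
        await asyncio.sleep(interval_s)
```

On timeout the caller still holds `prompt_id`, so the tool can return it in a status message and the agent can follow up with `get_generation_status`.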
