pi_mcps/mcp/mcp-image-gen/ASSESSMENT.md
Patrick Plate 8112ff2f12 feat(mcp-image-gen): scaffold ComfyUI-backed image generation MCP server
- FastMCP server with 4 tools: generate_image, list_available_models,
  get_generation_status, get_output_directory
- ComfyUI REST API client (httpx) polling lifecycle
- FLUX.1-schnell workflow JSON template
- Dual output: TextContent (path + seed) + ImageContent (base64 PNG)
- 14 passing pytest tests with respx HTTP mocking
- ROCm/AMD RX 7900 XTX optimized setup in README
- Ollama Linux migration path documented (future)
2026-04-04 11:49:31 +02:00


mcp-image-gen — Architecture Assessment

Date: 2026-04-04
Author: Lumen (for Patrick / pplate)
Status: APPROVED (ready for implementation)
BigMind Research Session: 39809470-6ac8-4713-adf2-79ac0eb36ba7


1. Problem Statement

LLM agents (Claude, local models via Ollama) have no native ability to generate images. While language models excel at text, creative and technical workflows increasingly need image output — concept art, diagrams, product mockups, illustrations — all driven by a text prompt.

A FastMCP wrapper around a local image generation backend would give any MCP-capable IDE or agent the ability to produce images on demand, with full control over resolution, steps, model, and seed — without sending data to external cloud APIs.

Gap being filled: Local AI image generation accessible to LLM agents via MCP protocol, running entirely on Patrick's AMD RX 7900 XTX (24GB VRAM) with ROCm.


2. Requirements

2.1 Functional Requirements

| ID | Requirement |
|----|-------------|
| F-1 | Generate an image from a text prompt |
| F-2 | Support configurable resolution (width × height) |
| F-3 | Support configurable inference steps and seed for reproducibility |
| F-4 | Support negative prompts to exclude unwanted content |
| F-5 | List available models from the backend |
| F-6 | Check the status of an in-progress generation job |
| F-7 | Return generated image as both a file path AND inline base64 for agent display |
| F-8 | Configure output directory for saved images |
| F-9 | Support FLUX.1-schnell as the default model |

2.2 Non-Functional Requirements

| ID | Requirement |
|----|-------------|
| NF-1 | Generation time < 30 seconds for FLUX.1-schnell at 1024×1024, 4 steps |
| NF-2 | VRAM footprint < 12GB (leaves headroom on 24GB for Ollama co-existence) |
| NF-3 | Must work on AMD ROCm; no CUDA-only dependencies in the MCP server layer |
| NF-4 | No cloud API calls; fully local execution |
| NF-5 | Graceful error messages when ComfyUI is not running |
| NF-6 | MCP tools must work with FastMCP and be discoverable by Claude / Roo Code |

3. Technology Decision

3.1 Candidate Backends

| Backend | Stars | ROCm | REST API | FLUX Support | Verdict |
|---------|-------|------|----------|--------------|---------|
| ComfyUI | 108k | Native | localhost:8188 | FLUX.1-schnell, FLUX.1-dev | CHOSEN |
| stable-diffusion.cpp | ~15k | ROCm/Vulkan | CLI only | FLUX.1-schnell | ⚠️ Viable alternative |
| PyTorch + diffusers | | ROCm 7.2.1 | No REST | All models | Too complex to manage |
| Ollama image gen | | Linux: N/A | /api/generate | FLUX.2, Z-Image | macOS-only as of April 2026 |
| A1111 / Forge WebUI | | ⚠️ Limited | :7860 | SDXL primary | Not FLUX-native |

3.2 Why ComfyUI

  1. ROCm native — ComfyUI's PyTorch backend runs on AMD GPUs via ROCm without forks or patches.
  2. REST API — ComfyUI exposes a stable HTTP API at localhost:8188 making it trivially wrappable with httpx. No subprocess management or binary spawning needed.
  3. Workflow-based — ComfyUI workflows are JSON graphs. The MCP server ships a minimal FLUX.1-schnell workflow that can be parameterized with prompt, size, steps, seed at runtime.
  4. Model ecosystem — ComfyUI's model manager supports FLUX.1, SDXL, SD3.5, ControlNet, LoRA — giving a future-proof upgrade path.
  5. Community size — 108k GitHub stars; extensive community support, model nodes, extensions.
  6. VRAM efficiency — FLUX.1-schnell requires ~8GB VRAM. Patrick's 24GB card runs it comfortably alongside Ollama.
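
The runtime parameterization described in point 3 can be sketched as below. This is a minimal illustration, not the shipped workflow: the template is a heavily trimmed stand-in, the node IDs ("3", "5", "6") are hypothetical, and `build_workflow` is a name introduced here. A real FLUX.1-schnell graph also needs loader, decoder, and save nodes.

```python
import copy
import random

# Hypothetical, trimmed stand-in for the shipped workflow template.
# ComfyUI's API format maps node IDs to {"class_type", "inputs"}; the IDs
# used here are assumptions chosen for illustration.
WORKFLOW_TEMPLATE = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 4}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}


def build_workflow(prompt, width=1024, height=1024, steps=4, seed=None):
    """Return (workflow, seed) with the runtime parameters filled in."""
    if seed is None:
        # Pick a seed so the caller can report it for reproducibility.
        seed = random.randint(0, 2**32 - 1)
    workflow = copy.deepcopy(WORKFLOW_TEMPLATE)  # never mutate the template
    workflow["6"]["inputs"]["text"] = prompt
    workflow["5"]["inputs"]["width"] = width
    workflow["5"]["inputs"]["height"] = height
    workflow["3"]["inputs"]["steps"] = steps
    workflow["3"]["inputs"]["seed"] = seed
    return workflow, seed
```

Returning the seed alongside the workflow lets the tool echo it back to the agent, so a later call with the same seed and prompt can reproduce the image.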

3.3 Why NOT the Alternatives

  • Ollama: Definitively blocked on Linux until further notice. No ETA for Linux image gen.
  • stable-diffusion.cpp: CLI-based only — the MCP server would need to manage a subprocess, parse stdout, handle crashes. More fragile than an HTTP API.
  • PyTorch + diffusers direct: Requires managing Python environments, device placement, model loading, memory management inside the MCP server process — adds significant complexity and risk of VRAM conflicts.

4. Architecture Decision

4.1 System Overview

┌─────────────────────────────────────────────────────────┐
│  LLM Agent (Claude / Roo Code / local Ollama)           │
└───────────────────────────┬─────────────────────────────┘
                            │ MCP Protocol (stdio)
┌───────────────────────────▼─────────────────────────────┐
│  mcp-image-gen  (FastMCP Python server)                 │
│                                                         │
│  Tools:                                                 │
│  • generate_image(prompt, width, height, steps, ...)    │
│  • list_available_models()                              │
│  • get_generation_status(prompt_id)                     │
│  • get_output_directory()                               │
└───────────────────────────┬─────────────────────────────┘
                            │ HTTP REST (httpx)
┌───────────────────────────▼─────────────────────────────┐
│  ComfyUI (localhost:8188)                               │
│  AMD ROCm + PyTorch                                     │
│  FLUX.1-schnell model                                   │
└─────────────────────────────────────────────────────────┘
                            │
                    ┌───────▼───────┐
                    │  ~/Pictures/  │
                    │  mcp-generated│
                    └───────────────┘

4.2 Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| HTTP client | httpx (async) | Already used in webscraper; async-friendly; clean timeout handling |
| Image return | dual: path + base64 | File path for persistence; base64 ImageContent for inline Claude display |
| ImageContent type | mcp.types.ImageContent | FastMCP 3.x: never use fastmcp.utilities.types.Image with a -> Image annotation, because it breaks serialization. Return ImageContent directly as a ContentBlock. |
| Job polling | loop with sleep | ComfyUI /api/queue lists running and pending jobs, and a finished job appears under /api/history/{prompt_id}; poll until done or timeout |
| Workflow format | ComfyUI API JSON | Minimal FLUX.1-schnell graph parameterized at runtime |
| Config | env vars | COMFYUI_URL, IMAGE_OUTPUT_DIR; no hardcoded paths |
| Output naming | {timestamp}_{seed}.png | Reproducible, collision-free, sortable |
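
The config and output-naming decisions could look roughly like the sketch below. The environment variable names come from the table; the helper names, the default URL and directory values, and the `%Y%m%d_%H%M%S` timestamp format are assumptions made for illustration.

```python
import os
from datetime import datetime
from pathlib import Path

# Assumed defaults; the output directory default is still an open question.
DEFAULT_COMFYUI_URL = "http://localhost:8188"
DEFAULT_OUTPUT_DIR = "~/Pictures/mcp-generated"


def load_config(env=None):
    """Read COMFYUI_URL and IMAGE_OUTPUT_DIR, falling back to the defaults."""
    env = os.environ if env is None else env
    url = env.get("COMFYUI_URL", DEFAULT_COMFYUI_URL).rstrip("/")
    output_dir = Path(env.get("IMAGE_OUTPUT_DIR", DEFAULT_OUTPUT_DIR)).expanduser()
    return url, output_dir


def output_path(output_dir, seed, now=None):
    """Build {timestamp}_{seed}.png; the timestamp format is an assumption."""
    now = datetime.now() if now is None else now
    stamp = now.strftime("%Y%m%d_%H%M%S")  # sorts chronologically as text
    return Path(output_dir) / f"{stamp}_{seed}.png"
```

With second-level timestamps, two files collide only if the same seed is used twice within one second, which matches the "reproducible, sortable" intent while keeping names short.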

5. Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ComfyUI not running when tool is called | High | High | Return clear error: "ComfyUI not reachable at {url}. Start with: python main.py --listen" |
| Generation timeout (>60s) | Medium | Medium | Configurable timeout; return partial status message with prompt_id so agent can poll manually |
| VRAM contention with Ollama | Medium | Medium | FLUX.1-schnell uses ~8GB, leaving ~16GB of headroom on the 24GB card. Document that running both simultaneously causes contention once the Ollama model no longer fits in that headroom |
| ROCm driver instability | Low | High | ComfyUI falls back to CPU if ROCm is unavailable (slow but functional). Document ROCm setup. |
| ComfyUI API changes | Low | Medium | Pin ComfyUI version in setup docs; the /api/prompt, /api/queue, /api/view endpoints are stable |
| Large output files | Low | Low | PNG default; add optional JPEG quality param in v2 |
| Malformed workflow JSON | Low | High | Ship a tested, minimal FLUX.1-schnell workflow; validate before submit |
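
The first two mitigations (clear error when ComfyUI is down, prompt_id returned on timeout) can be sketched as a polling helper. This is an illustration under stated assumptions: `wait_for_job` and `fetch_history` are hypothetical names, `fetch_history` is assumed to return a dict keyed by prompt_id once the job has finished and an empty dict before that, and the built-in `ConnectionError` stands in for whatever connection exception the real httpx client would raise.

```python
import asyncio
import time


async def wait_for_job(fetch_history, prompt_id, url="http://localhost:8188",
                       timeout_s=60.0, interval_s=1.0):
    """Poll until the job finishes, the backend is unreachable, or time runs out.

    fetch_history is a hypothetical async callable injected by the caller.
    The result is always a plain dict, so the tool never raises to the agent.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            history = await fetch_history(prompt_id)
        except ConnectionError:  # stand-in for the HTTP client's connect error
            return {"status": "error",
                    "message": f"ComfyUI not reachable at {url}. "
                               "Start with: python main.py --listen"}
        if prompt_id in history:
            return {"status": "done", "result": history[prompt_id]}
        if time.monotonic() >= deadline:
            # Hand back the prompt_id so the agent can keep polling manually.
            return {"status": "timeout", "prompt_id": prompt_id,
                    "message": f"Still running after {timeout_s}s"}
        await asyncio.sleep(interval_s)
```

Injecting `fetch_history` keeps the loop testable without a running ComfyUI instance; a test can pass a fake coroutine that reports completion after a few calls.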

6. Alternatives Considered

6.1 Ollama (Blocked)

Ollama added image generation in January 2026 (Z-Image Turbo, FLUX.2 Klein) but the feature is macOS-only as of April 2026. Linux support is listed as "coming soon" with no ETA. This was the originally preferred path (uniform API with text generation), but it is not viable on Fedora Linux today.

Migration path: When Ollama Linux image gen ships, a thin backend adapter can be added to mcp-image-gen so it routes to Ollama instead of ComfyUI — same MCP tool signatures, different HTTP target.
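
One way the thin adapter might be shaped is sketched below. Everything named here is hypothetical: the `ImageBackend` interface, the two backend classes, and the `IMAGE_BACKEND` environment variable are not part of the documented design, and both `generate` bodies are left as stubs.

```python
import os
from typing import Mapping, Optional, Protocol


class ImageBackend(Protocol):
    """What the MCP tools would need from any backend (hypothetical interface)."""

    async def generate(self, prompt: str, width: int, height: int,
                       steps: int, seed: int) -> bytes:
        """Return the generated image as PNG bytes."""
        ...


class ComfyUIBackend:
    def __init__(self, base_url: str = "http://localhost:8188") -> None:
        self.base_url = base_url

    async def generate(self, prompt, width, height, steps, seed) -> bytes:
        # Would submit the workflow and poll ComfyUI; omitted in this sketch.
        raise NotImplementedError


class OllamaBackend:
    def __init__(self, base_url: str = "http://localhost:11434") -> None:
        self.base_url = base_url

    async def generate(self, prompt, width, height, steps, seed) -> bytes:
        # Placeholder until Ollama image generation is available on Linux.
        raise NotImplementedError


def select_backend(env: Optional[Mapping[str, str]] = None) -> ImageBackend:
    """Pick a backend from the environment; ComfyUI stays the default."""
    env = os.environ if env is None else env
    # IMAGE_BACKEND is a hypothetical variable, not part of the documented config.
    if env.get("IMAGE_BACKEND", "comfyui").lower() == "ollama":
        return OllamaBackend()
    return ComfyUIBackend(env.get("COMFYUI_URL", "http://localhost:8188"))
```

Because the MCP tools would depend only on `ImageBackend`, switching the HTTP target would not change any tool signature.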

6.2 stable-diffusion.cpp

DiffuGen MCP server uses this approach. Requires:

  • Building sd.cpp with ROCm/Vulkan flags
  • Spawning a subprocess and parsing CLI output
  • No REST API — process management in Python

Viable but more fragile than ComfyUI's HTTP API. It is the fallback only if ComfyUI proves unworkable.

6.3 diffusers (Python library, direct)

This would run the diffusion pipeline inside the MCP server process. Problems:

  • MCP server process cannot easily share GPU memory with Ollama
  • Model loading adds 5-15s cold start to every MCP invocation
  • Complex device placement / fp16 / ROCm configuration in server code
  • Risk: VRAM OOM crashes the MCP server process entirely

7. Success Criteria

| Criterion | Measure |
|-----------|---------|
| generate_image returns a valid PNG | File exists on disk, base64 decodes to valid PNG bytes |
| Claude can display the image inline | ImageContent returned in tool response, visible in Roo Code chat |
| FLUX.1-schnell at 1024×1024 4-step completes in <30s | Measured on RX 7900 XTX with ROCm |
| list_available_models returns ComfyUI model list | At minimum includes flux1-schnell.safetensors |
| ComfyUI offline → clear error, not crash | Tool returns error string, no MCP server exception |
| All pytest tests pass | uv run pytest tests/ -v exits 0 with ≥80% coverage |
| Server wired into .roo/mcp.json | Tool appears in Roo Code MCP tool list |
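
The first criterion ("base64 decodes to valid PNG bytes") could be checked with a helper along these lines. The function name is hypothetical, and the check only inspects the 8-byte PNG signature; a stricter test would decode the whole file with an image library.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_valid_png_base64(data_b64: str) -> bool:
    """Return True if data_b64 is strict base64 that decodes to PNG-signed bytes."""
    try:
        raw = base64.b64decode(data_b64, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    return raw.startswith(PNG_SIGNATURE)
```

In a pytest test this would be applied to the `data` field of the returned image content, alongside an assertion that the reported file path exists on disk.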

8. Open Questions

| # | Question | Owner | Priority |
|---|----------|-------|----------|
| Q1 | Should generate_image be synchronous (block until done) or return a prompt_id immediately? | Patrick | High: MVP will be synchronous; async polling is v2 |
| Q2 | Default output directory: ~/Pictures/mcp-generated or ~/mcp-images? | Patrick | Low: configurable via env var |
| Q3 | Should we support SDXL as a second model in v1, or FLUX.1-schnell only? | Patrick | Low: FLUX.1-schnell only for v1 |
| Q4 | WebSocket API vs REST polling for job status? | | ComfyUI has both; REST polling is simpler for v1 |