# mcp-image-gen — Architecture Assessment

**Date:** 2026-04-04
**Author:** Lumen (for Patrick / pplate)
**Status:** ✅ APPROVED — ready for implementation
**BigMind Research Session:** `39809470-6ac8-4713-adf2-79ac0eb36ba7`

---

## 1. Problem Statement

LLM agents (Claude, local models via Ollama) have no native ability to generate images. While
language models excel at text, creative and technical workflows increasingly need image output —
concept art, diagrams, product mockups, illustrations — all driven by a text prompt.

A FastMCP wrapper around a local image generation backend would give any MCP-capable IDE or
agent the ability to produce images on demand, with full control over resolution, steps, model,
and seed — without sending data to external cloud APIs.

**Gap being filled:** Local AI image generation accessible to LLM agents via MCP protocol,
running entirely on Patrick's AMD RX 7900 XTX (24GB VRAM) with ROCm.

---

## 2. Requirements

### 2.1 Functional Requirements

| ID | Requirement |
|----|-------------|
| F-1 | Generate an image from a text prompt |
| F-2 | Support configurable resolution (width × height) |
| F-3 | Support configurable inference steps and seed for reproducibility |
| F-4 | Support negative prompts to exclude unwanted content |
| F-5 | List available models from the backend |
| F-6 | Check the status of an in-progress generation job |
| F-7 | Return generated image as both a file path AND inline base64 for agent display |
| F-8 | Configure output directory for saved images |
| F-9 | Support FLUX.1-schnell as the default model |
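
The parameters named in F-1 through F-4 could be bundled and validated before anything is sent to the backend. The following is a minimal sketch, not the server's actual interface: `GenerationRequest` and `validated` are hypothetical names, the defaults mirror the NF-1 benchmark settings, and the multiple-of-8 rule for dimensions is an assumption that should be checked against the chosen model.

```python
import dataclasses
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationRequest:
    """Hypothetical parameter bundle for F-1 to F-4 (not an existing API)."""

    prompt: str
    negative_prompt: str = ""
    width: int = 1024   # defaults mirror the NF-1 benchmark settings
    height: int = 1024
    steps: int = 4
    seed: Optional[int] = None  # None means "choose a random seed"

    def validated(self) -> "GenerationRequest":
        """Return a copy with a concrete seed, or raise ValueError."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        for name, value in (("width", self.width), ("height", self.height)):
            # Assumption: dimensions must be positive multiples of 8.
            if value <= 0 or value % 8 != 0:
                raise ValueError(f"{name} must be a positive multiple of 8, got {value}")
        if self.steps < 1:
            raise ValueError(f"steps must be at least 1, got {self.steps}")
        # Fix the seed up front so the same request can be reproduced later (F-3).
        seed = self.seed if self.seed is not None else random.randrange(2**32)
        return dataclasses.replace(self, seed=seed)
```

Resolving the seed before submission means the value that ends up in the output filename is always the one that was actually used.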

### 2.2 Non-Functional Requirements

| ID | Requirement |
|----|-------------|
| NF-1 | Generation time < 30 seconds for FLUX.1-schnell at 1024×1024, 4 steps |
| NF-2 | VRAM footprint < 12GB (leaves headroom on 24GB for Ollama co-existence) |
| NF-3 | Must work on AMD ROCm — no CUDA-only dependencies in the MCP server layer |
| NF-4 | No cloud API calls — fully local execution |
| NF-5 | Graceful error messages when ComfyUI is not running |
| NF-6 | MCP tools must work with FastMCP and be discoverable by Claude / Roo Code |

---

## 3. Technology Decision

### 3.1 Candidate Backends

| Backend | Stars | ROCm | REST API | FLUX Support | Verdict |
|---------|-------|------|----------|--------------|---------|
| **ComfyUI** | 108k | ✅ Native | ✅ localhost:8188 | ✅ FLUX.1-schnell, FLUX.1-dev | ✅ **CHOSEN** |
| stable-diffusion.cpp | ~15k | ✅ ROCm/Vulkan | ❌ CLI only | ✅ FLUX.1-schnell | ⚠️ Viable alternative |
| PyTorch + diffusers | — | ✅ ROCm 7.2.1 | ❌ No REST | ✅ All models | ❌ Too complex to manage |
| Ollama image gen | — | ❌ Linux: N/A | ✅ /api/generate | ✅ FLUX.2, Z-Image | ❌ macOS-only as of April 2026 |
| A1111 / Forge WebUI | — | ⚠️ Limited | ✅ :7860 | ❌ SDXL primary | ❌ Not FLUX-native |

### 3.2 Why ComfyUI

1. **ROCm native** — ComfyUI's PyTorch backend runs on AMD GPUs via ROCm without forks or patches.
2. **REST API** — ComfyUI exposes a stable HTTP API at `localhost:8188` making it trivially
   wrappable with `httpx`. No subprocess management or binary spawning needed.
3. **Workflow-based** — ComfyUI workflows are JSON graphs. The MCP server ships a minimal
   FLUX.1-schnell workflow that can be parameterized with prompt, size, steps, seed at runtime.
4. **Model ecosystem** — ComfyUI's model manager supports FLUX.1, SDXL, SD3.5, ControlNet,
   LoRA — giving a future-proof upgrade path.
5. **Community size** — 108k GitHub stars; extensive community support, model nodes, extensions.
6. **VRAM efficiency** — FLUX.1-schnell requires ~8GB VRAM. Patrick's 24GB card runs it
   comfortably alongside Ollama.
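
The runtime parameterization described in point 3 might look roughly like the sketch below. It assumes ComfyUI's API JSON shape (node id mapped to `class_type` and `inputs`); the node ids, the `WORKFLOW_TEMPLATE` fragment, and `parameterize_workflow` are hypothetical, and the fragment deliberately omits the loader, decode, and save nodes, so it is not a complete or tested FLUX.1-schnell graph.

```python
import copy

# Illustrative fragment only: node ids are arbitrary, links between nodes and
# the loader / decode / save nodes are omitted. Not a working workflow.
WORKFLOW_TEMPLATE = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 0, "steps": 4}},
    "5": {
        "class_type": "EmptyLatentImage",
        "inputs": {"width": 1024, "height": 1024, "batch_size": 1},
    },
    "6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}},
}


def parameterize_workflow(template, prompt, width, height, steps, seed):
    """Return a filled-in copy of the template; the template is not mutated."""
    workflow = copy.deepcopy(template)
    workflow["6"]["inputs"]["text"] = prompt
    workflow["5"]["inputs"]["width"] = width
    workflow["5"]["inputs"]["height"] = height
    workflow["3"]["inputs"]["steps"] = steps
    workflow["3"]["inputs"]["seed"] = seed
    return workflow


workflow = parameterize_workflow(
    WORKFLOW_TEMPLATE, prompt="a lighthouse at dusk",
    width=1024, height=1024, steps=4, seed=42,
)
```

Deep-copying keeps the shipped template pristine across calls, which matters once several requests share one server process.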

### 3.3 Why NOT the Alternatives

- **Ollama:** Image generation is macOS-only as of April 2026, so it cannot run on this Linux host, and there is no ETA for Linux support.
- **stable-diffusion.cpp:** CLI-based only — the MCP server would need to manage a subprocess,
  parse stdout, handle crashes. More fragile than an HTTP API.
- **PyTorch + diffusers direct:** Requires managing Python environments, device placement, model
  loading, memory management inside the MCP server process — adds significant complexity and
  risk of VRAM conflicts.

---

## 4. Architecture Decision

### 4.1 System Overview

```
┌─────────────────────────────────────────────────────────┐
│  LLM Agent (Claude / Roo Code / local Ollama)           │
└───────────────────────────┬─────────────────────────────┘
                            │ MCP Protocol (stdio)
┌───────────────────────────▼─────────────────────────────┐
│  mcp-image-gen  (FastMCP Python server)                 │
│                                                         │
│  Tools:                                                 │
│  • generate_image(prompt, width, height, steps, ...)    │
│  • list_available_models()                              │
│  • get_generation_status(prompt_id)                     │
│  • get_output_directory()                               │
└───────────────────────────┬─────────────────────────────┘
                            │ HTTP REST (httpx)
┌───────────────────────────▼─────────────────────────────┐
│  ComfyUI (localhost:8188)                               │
│  AMD ROCm + PyTorch                                     │
│  FLUX.1-schnell model                                   │
└─────────────────────────────────────────────────────────┘
                            │
                    ┌───────▼───────┐
                    │  ~/Pictures/  │
                    │  mcp-generated│
                    └───────────────┘
```

### 4.2 Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| HTTP client | `httpx` (async) | Already used in webscraper; async-friendly; clean timeout handling |
| Image return | dual: path + base64 | File path for persistence; base64 `ImageContent` for inline Claude display |
| ImageContent type | `mcp.types.ImageContent` | FastMCP 3.x: **never** use `fastmcp.utilities.types.Image` with `-> Image` annotation — it breaks serialization. Return `ImageContent` directly as a `ContentBlock`. |
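| Job polling | loop with sleep | ComfyUI `/api/queue` lists running and pending jobs, and `/api/history/{prompt_id}` holds the outputs once a job has finished; poll until the job appears in history or the timeout elapses |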
| Workflow format | ComfyUI API JSON | Minimal FLUX.1-schnell graph parameterized at runtime |
| Config | env vars | `COMFYUI_URL`, `IMAGE_OUTPUT_DIR` — no hardcoded paths |
| Output naming | `{timestamp}_{seed}.png` | Reproducible, collision-free, sortable |
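
The config and output-naming rows can be illustrated with a small stdlib-only sketch. `COMFYUI_URL` and `IMAGE_OUTPUT_DIR` come from the table above; the function names, the fallback values (the default directory is still open as Q2), and the exact timestamp format are assumptions made for illustration.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def load_config(env=None):
    """Read settings from environment variables, with assumed fallbacks."""
    env = os.environ if env is None else env
    return {
        # Trailing slash is stripped so endpoint paths can be appended safely.
        "comfyui_url": env.get("COMFYUI_URL", "http://localhost:8188").rstrip("/"),
        # Assumed default; the real default is an open question (Q2).
        "output_dir": Path(env.get("IMAGE_OUTPUT_DIR", "~/Pictures/mcp-generated")).expanduser(),
    }


def output_filename(seed, now=None):
    """Build a `{timestamp}_{seed}.png` name; the timestamp format is an assumption."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%dT%H%M%S}_{seed}.png"


config = load_config()
target = config["output_dir"] / output_filename(seed=42)
```

A fixed-width UTC timestamp prefix keeps lexical and chronological order identical when listing the directory.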

---

## 5. Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ComfyUI not running when tool is called | High | High | Return clear error: "ComfyUI not reachable at {url}. Start with: `python main.py --listen`" |
| Generation timeout (>60s) | Medium | Medium | Configurable timeout; return partial status message with `prompt_id` so agent can poll manually |
| VRAM contention with Ollama | Medium | Medium | FLUX.1-schnell uses ~8GB, which leaves ~16GB of headroom on the 24GB card. Document that running both simultaneously may contend for VRAM once the loaded Ollama model approaches that headroom |
| ROCm driver instability | Low | High | ComfyUI falls back to CPU if ROCm unavailable — slow but functional. Document ROCm setup. |
| ComfyUI API changes | Low | Medium | Pin ComfyUI version in setup docs; the `/api/prompt`, `/api/queue`, `/api/view` endpoints are stable |
| Large output files | Low | Low | PNG default; add optional JPEG quality param in v2 |
| Malformed workflow JSON | Low | High | Ship a tested, minimal FLUX.1-schnell workflow; validate before submit |
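
The first two mitigations (a clear error when the backend is unreachable, and a timeout that hands back the `prompt_id`) can be sketched as one polling helper. This is a sketch under stated assumptions: `wait_for_job` and `fetch_result` are hypothetical names, `fetch_result` is assumed to return `None` while the job is unfinished, and a real adapter built on `httpx` would need to translate `httpx.ConnectError` into the `ConnectionError` caught here.

```python
import time
from typing import Callable, Optional


def wait_for_job(
    fetch_result: Callable[[str], Optional[dict]],
    prompt_id: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the job finishes, the timeout elapses, or the backend is unreachable."""
    deadline = clock() + timeout_s
    while True:
        try:
            result = fetch_result(prompt_id)
        except ConnectionError as exc:
            # Return a readable message instead of letting the exception escape.
            return {"status": "error", "message": f"ComfyUI not reachable: {exc}"}
        if result is not None:
            return {"status": "done", "prompt_id": prompt_id, "result": result}
        if clock() >= deadline:
            # Hand back the id so the agent can keep polling on its own.
            return {
                "status": "timeout",
                "prompt_id": prompt_id,
                "message": f"still running after {timeout_s:.0f}s",
            }
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting, and returning a status dict rather than raising matches the "error string, not crash" goal.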

---

## 6. Alternatives Considered

### 6.1 Ollama (Blocked)
Ollama added image generation in January 2026 (Z-Image Turbo, FLUX.2 Klein) but the feature is
**macOS-only** as of April 2026. Linux support is listed as "coming soon" with no ETA. This was
the originally preferred path (uniform API with text generation), but it is not viable on Fedora
Linux today.

**Migration path:** When Ollama Linux image gen ships, a thin backend adapter can be added to
`mcp-image-gen` so it routes to Ollama instead of ComfyUI — same MCP tool signatures, different
HTTP target.
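
One way such an adapter layer might be shaped is sketched below. Everything here is hypothetical: the `ImageBackend` protocol, the `select_backend` helper, and the registry keys are illustrative names, and `StubBackend` stands in for real HTTP adapters so the sketch runs without a GPU or a server.

```python
from typing import Dict, Protocol


class ImageBackend(Protocol):
    """Hypothetical interface every backend adapter would satisfy."""

    def submit(self, prompt: str, width: int, height: int, steps: int, seed: int) -> str:
        """Queue a generation job and return a backend-specific job id."""
        ...


class StubBackend:
    """Stand-in adapter; a real one would issue HTTP requests to base_url."""

    def __init__(self, name: str, base_url: str) -> None:
        self.name = name
        self.base_url = base_url

    def submit(self, prompt: str, width: int, height: int, steps: int, seed: int) -> str:
        # Placeholder job id so the routing logic can be exercised offline.
        return f"{self.name}-{seed}"


def select_backend(backends: Dict[str, ImageBackend], requested: str) -> ImageBackend:
    """Pick an adapter by name; the MCP tool signatures stay the same either way."""
    try:
        return backends[requested]
    except KeyError:
        raise ValueError(
            f"unknown backend {requested!r}; available: {sorted(backends)}"
        ) from None


BACKENDS = {
    "comfyui": StubBackend("comfyui", "http://localhost:8188"),
    "ollama": StubBackend("ollama", "http://localhost:11434"),
}
job_id = select_backend(BACKENDS, "comfyui").submit("a fox", 1024, 1024, 4, 42)
```

Because the tools depend only on the protocol, switching targets would be a configuration change and not a change to the tool surface.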

### 6.2 stable-diffusion.cpp
The DiffuGen MCP server uses this approach. It requires:
- Building sd.cpp with ROCm/Vulkan flags
- Spawning a subprocess and parsing CLI output
- Managing the process from Python, because there is no REST API

This is viable but more fragile than ComfyUI's HTTP API. It would be chosen only if ComfyUI proves unworkable.

### 6.3 diffusers (Python library, direct)
This would run the diffusion pipeline inside the MCP server process. Problems:
- MCP server process cannot easily share GPU memory with Ollama
- Model loading adds 5-15s cold start to every MCP invocation
- Complex device placement / fp16 / ROCm configuration in server code
- Risk: VRAM OOM crashes the MCP server process entirely

---

## 7. Success Criteria

| Criterion | Measure |
|-----------|---------|
| `generate_image` returns a valid PNG | File exists on disk, base64 decodes to valid PNG bytes |
| Claude can display the image inline | `ImageContent` returned in tool response, visible in Roo Code chat |
| FLUX.1-schnell at 1024×1024 4-step completes in <30s | Measured on RX 7900 XTX with ROCm |
| `list_available_models` returns ComfyUI model list | At minimum includes `flux1-schnell.safetensors` |
| ComfyUI offline → clear error, not crash | Tool returns error string, no MCP server exception |
| All pytest tests pass | `uv run pytest tests/ -v` exits 0 with ≥80% coverage |
| Server wired into `.roo/mcp.json` | Tool appears in Roo Code MCP tool list |
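
The first criterion could be checked in a test with a helper along these lines. This is a minimal sketch: `check_png_result` is a hypothetical name, and it only verifies the eight-byte PNG signature, not that the rest of the file is a well-formed image.

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def check_png_result(path: str, image_b64: str) -> list:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []

    # Check 1: the file exists on disk and looks like a PNG.
    file_path = Path(path)
    if not file_path.is_file():
        problems.append(f"file not found: {path}")
    elif not file_path.read_bytes().startswith(PNG_SIGNATURE):
        problems.append("file on disk does not start with the PNG signature")

    # Check 2: the inline payload is valid base64 and decodes to PNG bytes.
    try:
        raw = base64.b64decode(image_b64, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        problems.append("base64 payload does not decode")
    else:
        if not raw.startswith(PNG_SIGNATURE):
            problems.append("decoded payload does not start with the PNG signature")
    return problems
```

Returning a list of problems instead of a boolean gives a failing test a message that says which half of the dual return went wrong.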

---

## 8. Open Questions

| # | Question | Owner | Priority | Current position |
|---|----------|-------|----------|------------------|
| Q1 | Should `generate_image` be synchronous (block until done) or return a `prompt_id` immediately? | Patrick | High | MVP will be synchronous; async polling is v2 |
| Q2 | Default output directory: `~/Pictures/mcp-generated` or `~/mcp-images`? | Patrick | Low | Configurable via env var |
| Q3 | Should we support SDXL as a second model in v1, or FLUX.1-schnell only? | Patrick | Low | FLUX.1-schnell only for v1 |
| Q4 | WebSocket API vs REST polling for job status? | Unassigned | Not set | ComfyUI has both; REST polling is simpler for v1 |