MCP server · one of several tool layers feeding into vindler
ComfyUI MCP Server
Claude's Creative Layer
35+ MCP tools bridging Claude to ComfyUI across image generation, speech synthesis, and lip-synced video
Results
- 467 tests passing (Vitest, strict TypeScript)
- 35+ MCP tools, from txt2img to talking heads
- 6 model strategies: Illustrious, Pony, Flux, SDXL, SD1.5, Realistic
I wanted Claude to generate images through my local ComfyUI setup, which I could then use for talking heads in my language-learning platform. Simple enough: run Sonic on my Mac. Wait thirty minutes. Watch... ten seconds of pure black frames. Whoops.
Unified Memory noped right the f*ck out.
Fair enough: time to distribute.
The MCP server moved to Fly.io, stateless and auto-scaling. GPU compute lives on RunPod, pay-per-second. Generated assets go to Supabase with signed URLs. Tailscale meshes it all together securely. What started as "let me generate some images" became a production distributed system because the alternative was a space heater that outputs nothing.
Now Claude can generate images, upscale them, run ControlNet pipelines, synthesize speech, and create lip-synced talking head videos through natural conversation. No per-image API fees and full parameter control over every step of the pipeline.
The next step is characters that speak: tutors with faces, historical figures who answer questions in their own voice. The infrastructure is functional. The applications are in development.
For Engineers
Architecture
MCP server exposes tools to Claude via the Model Context Protocol. Each tool builds a ComfyUI workflow graph dynamically: checkpoint loaders, CLIP encoders, samplers, VAE decoders. The assembled graph gets submitted to the remote GPU over Tailscale.
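As a sketch of what "builds a workflow graph dynamically" means: ComfyUI's API format is a flat map of node id to `{ class_type, inputs }`, where inputs reference other nodes by `[id, outputSlot]` pairs, and the assembled graph is POSTed to the `/prompt` endpoint. The node class names below are stock ComfyUI nodes; the function names, default parameters, and port are illustrative assumptions, not the server's actual code.

```typescript
// Minimal txt2img graph in ComfyUI's API format. Node ids are arbitrary
// strings; inputs wire to other nodes as [nodeId, outputIndex].
type ComfyNode = { class_type: string; inputs: Record<string, unknown> };
type WorkflowGraph = Record<string, ComfyNode>;

function buildTxt2ImgGraph(prompt: string, checkpoint: string, seed: number): WorkflowGraph {
  return {
    "1": { class_type: "CheckpointLoaderSimple", inputs: { ckpt_name: checkpoint } },
    "2": { class_type: "CLIPTextEncode", inputs: { text: prompt, clip: ["1", 1] } },
    "3": { class_type: "CLIPTextEncode", inputs: { text: "", clip: ["1", 1] } }, // negative
    "4": { class_type: "EmptyLatentImage", inputs: { width: 1024, height: 1024, batch_size: 1 } },
    "5": {
      class_type: "KSampler",
      inputs: {
        model: ["1", 0], positive: ["2", 0], negative: ["3", 0], latent_image: ["4", 0],
        seed, steps: 28, cfg: 7, sampler_name: "euler", scheduler: "normal", denoise: 1,
      },
    },
    "6": { class_type: "VAEDecode", inputs: { samples: ["5", 0], vae: ["1", 2] } },
    "7": { class_type: "SaveImage", inputs: { images: ["6", 0], filename_prefix: "mcp" } },
  };
}

// Submit the assembled graph to the remote GPU over the tailnet.
async function submitGraph(host: string, graph: WorkflowGraph, clientId: string) {
  const res = await fetch(`http://${host}:8188/prompt`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: graph, client_id: clientId }),
  });
  return (await res.json()) as { prompt_id: string };
}
```

Because the graph is plain data, each MCP tool can swap loaders, add ControlNet or upscaler nodes, and re-wire connections without touching any submission code.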
WebSocket monitoring tracks generation progress in real-time. When the image lands, the storage abstraction pushes it to Supabase, GCP, or local filesystem depending on configuration. Zero code changes to swap providers.
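The monitoring side has to deal with ComfyUI's mixed WebSocket stream: JSON text frames carrying `{ type, data }` events alongside binary frames containing latent previews. A hedged sketch of the frame classifier, where the event names match ComfyUI's protocol but the return shape and function name are assumptions of mine:

```typescript
// Classify one WebSocket frame from ComfyUI for a given prompt_id.
// Binary frames are preview image bytes; text frames are JSON events.
type ComfyEvent =
  | { kind: "progress"; value: number; max: number }
  | { kind: "done"; promptId: string }
  | { kind: "preview" }
  | { kind: "other" };

function classifyFrame(frame: string | ArrayBuffer, promptId: string): ComfyEvent {
  if (typeof frame !== "string") return { kind: "preview" }; // binary frame
  const msg = JSON.parse(frame) as { type: string; data: any };
  if (msg.type === "progress" && msg.data.prompt_id === promptId) {
    return { kind: "progress", value: msg.data.value, max: msg.data.max };
  }
  // "executing" with node === null signals the prompt has finished
  if (msg.type === "executing" && msg.data.node === null && msg.data.prompt_id === promptId) {
    return { kind: "done", promptId };
  }
  return { kind: "other" };
}
```

Keeping classification pure like this makes the quirks (binary frames, event ordering) unit-testable without a live GPU on the other end.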
Long-running jobs (portrait generation, TTS, lipsync) run async through Quirrel job queues to avoid Fly.io connection limits. Six model strategies handle the prompting differences between Illustrious, Pony, Flux, SDXL, SD1.5, and Realistic checkpoints, auto-detected from the checkpoint filename.
Key Decisions
Distributed by Necessity
Local Mac rendered black frames with Sonic. The architecture emerged from hardware constraints, not overengineering. Now it scales.
Strategy Pattern for Model Prompting
Illustrious wants tags, Flux wants natural language, Pony needs score tags. Six model families, six strategies. Auto-detected from checkpoint name.
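The shape of that strategy pattern, sketched in TypeScript; the substring rules and prompt prefixes below are illustrative assumptions, not the server's real detection tables:

```typescript
// One strategy per model family; the family is inferred from the
// checkpoint filename so callers never specify it manually.
type ModelFamily = "illustrious" | "pony" | "flux" | "sdxl" | "sd15" | "realistic";

interface PromptStrategy {
  buildPositive(prompt: string): string;
}

const strategies: Record<ModelFamily, PromptStrategy> = {
  illustrious: { buildPositive: (p) => `masterpiece, best quality, ${p}` }, // tag style
  pony:        { buildPositive: (p) => `score_9, score_8_up, ${p}` },       // score tags
  flux:        { buildPositive: (p) => p },                                 // natural language
  sdxl:        { buildPositive: (p) => p },
  sd15:        { buildPositive: (p) => p },
  realistic:   { buildPositive: (p) => `photorealistic, ${p}` },
};

function detectFamily(checkpoint: string): ModelFamily {
  const name = checkpoint.toLowerCase();
  if (name.includes("illustrious")) return "illustrious";
  if (name.includes("pony")) return "pony"; // check before "xl": Pony builds are SDXL-based
  if (name.includes("flux")) return "flux";
  if (name.includes("xl")) return "sdxl";
  if (name.includes("realistic")) return "realistic";
  return "sd15"; // conservative fallback
}
```

Ordering matters in the detection chain: a Pony checkpoint filename usually also contains "XL", so the more specific match has to win.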
Cloud Storage Abstraction
Single interface, three implementations. Swap providers with an env var. No vendor lock-in.
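A minimal sketch of that abstraction, assuming hypothetical method names and env switch; the in-memory implementation stands in for the Supabase, GCP, and filesystem providers:

```typescript
// Callers depend only on this interface; the provider is chosen once
// from configuration and never leaks into tool code.
interface AssetStorage {
  put(key: string, data: Uint8Array): Promise<void>;
  getSignedUrl(key: string): Promise<string>;
}

class MemoryStorage implements AssetStorage {
  private objects = new Map<string, Uint8Array>();
  async put(key: string, data: Uint8Array) { this.objects.set(key, data); }
  async getSignedUrl(key: string) {
    if (!this.objects.has(key)) throw new Error(`missing asset: ${key}`);
    return `memory://${key}`;
  }
}

// Provider selection via env var; Supabase/GCS classes would implement
// the same interface behind their own case labels.
function createStorage(provider = process.env.STORAGE_PROVIDER ?? "memory"): AssetStorage {
  switch (provider) {
    case "memory": return new MemoryStorage();
    // case "supabase": return new SupabaseStorage(...);
    // case "gcs": return new GcsStorage(...);
    default: throw new Error(`unknown storage provider: ${provider}`);
  }
}
```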
Quirrel for Long-Running Jobs
Fly.io has connection limits. Portrait generation, TTS, and lipsync run async through job queues. Prevents timeout deaths.
What Was Hard
Distributed GPU inference sounds simple until you connect the pieces.
- Tailscale mesh between Fly.io and RunPod was underdocumented; getting the two to see each other required digging through both platforms' networking internals
- Rate limiting in a distributed context required Upstash Redis. In-memory limiters fail silently when you have multiple server instances
- The ComfyUI WebSocket protocol has quirks (binary frames, inconsistent event ordering) that took time to stabilize
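To make the rate-limiting point concrete: the failure mode is that each Fly.io instance keeps its own counter, so the effective limit multiplies by instance count. The fix is a shared counter. Below is a sketch of the fixed-window pattern with the store abstracted, so a test can use a `Map` while production points the same logic at Upstash Redis (`INCR` plus `EXPIRE`); all names here are mine, not the server's:

```typescript
// Shared-counter fixed-window rate limiter. In production the store is
// Redis, giving every server instance the same view of the count.
interface CounterStore {
  // Atomically increment `key`, setting a TTL on first increment.
  incr(key: string, ttlSeconds: number): Promise<number>;
}

class MapStore implements CounterStore {
  private counts = new Map<string, number>();
  async incr(key: string, _ttlSeconds: number) {
    const n = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, n);
    return n;
  }
}

async function allowRequest(
  store: CounterStore, clientId: string, limit: number, windowSeconds: number,
): Promise<boolean> {
  const window = Math.floor(Date.now() / 1000 / windowSeconds);
  const count = await store.incr(`rl:${clientId}:${window}`, windowSeconds);
  return count <= limit; // shared store => consistent across instances
}
```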
Stack