ComfyMind

ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback

¹HKUST(GZ), ²HKUST, ³Bytedance
*Indicates Equal Contribution. †Indicates Corresponding Author.

Abstract

With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unifying diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications because they lack structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system built on the ComfyUI platform and designed to enable robust and scalable general-purpose generation. ComfyMind introduces two core innovations: (1) a Semantic Workflow Interface (SWI) that abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; and (2) a Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks, ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems.

Text-to-Image Generation

Generate an image of a hot air balloon floating over a scenic valley at sunrise.

Generate an image of a cat sitting on a windowsill looking outside.

Generate an image of a small village covered in snow with smoke coming from chimneys.

Generate an image of a beach at sunset with waves gently crashing on the shore.

Generate an image of a mountain landscape with snow-capped peaks and a river flowing below.

Reasoning Generation

Generate an image that represents a winter sport in Switzerland

A winter sport often enjoyed in Switzerland, involving snow covered slopes

Create an image representing India's most famous traditional craft

Most representative craft of India

Visualize a famous Egyptian historical landmark

A massive stone statue of a mythical creature that is a prominent historical landmark in Egypt

Illustrate how an octopus reacts to danger

Octopus behavior when facing danger

Show what typically happens after a whale surfaces

Common behavior after a whale surfaces

Demonstrate light dispersion through a prism

Light dispersion from a glass prism

Visualize objects with different densities in water

A tennis ball and an iron block are in a transparent water tank

Image Editing

Edit the image of a whole cake cake.jpg to make it look like a triangular corner piece has been cut out. The remaining cake should appear untouched and natural

Reference

Output

Convert the cherries image into an advertisement version with exhibition stand lighting

Reference

Output

You are given an image man.jpg, which is a photo of a young man. Generate another photo to show the man as an elderly version of himself, with wrinkles, gray hair, and other signs of aging, while preserving his identity. The result should be a realistic image of an older man.

Reference

Output

You are given an image pigeon_scribble.png. Using the reference image as the logo, generate a cup with a ceramic texture. The background is an office table.

Reference

Output

Based on the given reference image new_york.jpg, outpaint the image by 512 pixels on both the left and right sides, with the prompt: A spectacular view of New York City's skyline at dusk

Reference

Output

Based on the given reference image castle.jpg, replace the castle in the image with a traditional Chinese temple

Reference

Output

Based on the given reference image windmill.jpg, remove the windmill in the image

Reference

Output

Based on the given reference image dinner.jpg, remove the knife and fork in the image

Reference

Output

Video Generation

Generate an 8-second high-quality video of a bonfire burning on the seaside

Generate an 8-second high-quality video of a fried egg sizzling in a skillet

Generate a 4-second high-quality video of a survivor in an exoskeleton scavenging the waste city, framed by an over-the-shoulder shot

Generate a 4-second high-quality video of sunlight filtering through the forest as a deer herd drinks from a stream

Generate a 4-second high-quality video of a winged woman hovering in the desolate skies above the wasteland

Pipeline Overview

Overview of the ComfyMind pipeline. Given a user instruction, the system first parses the task and delegates it to the Planning Agent. The agent incrementally explores a semantic search tree, where each node proposes a candidate workflow and receives local feedback based on execution results.
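
The loop described in this overview can be illustrated with a small toy sketch. It is not the ComfyMind implementation: the `Workflow` and `Node` classes, the `search`, `propose`, and `evaluate` functions, and the string "artifacts" are all hypothetical stand-ins, and depth-first backtracking is assumed here only as one simple way to explore a tree of candidate workflows. Each workflow is a callable module with a natural-language description, and every executed candidate stores its own feedback on the node where it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Workflow:
    """Hypothetical callable module with a natural-language description."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (task, current artifact) -> new artifact


@dataclass
class Node:
    """One search-tree node: the workflow tried, its result, local feedback."""
    workflow: Optional[Workflow]
    artifact: str
    feedback: str = ""
    children: List["Node"] = field(default_factory=list)


def search(task, node, propose, evaluate, depth=0, max_depth=4):
    """Depth-first search; returns a list of workflow names or None."""
    if depth >= max_depth:
        return None
    for wf in propose(task, node.artifact):
        try:
            result = wf.run(task, node.artifact)
        except Exception as exc:
            # Execution failure is recorded on this node only; siblings go on.
            node.children.append(
                Node(wf, node.artifact, feedback=f"execution error: {exc}"))
            continue
        verdict, note = evaluate(task, result)
        child = Node(wf, result, feedback=note)
        node.children.append(child)
        if verdict == "done":
            return [wf.name]
        if verdict == "continue":
            rest = search(task, child, propose, evaluate, depth + 1, max_depth)
            if rest is not None:
                return [wf.name] + rest
        # Any other verdict rejects this candidate and tries the next sibling.
    return None


# --- Toy setup (all assumptions): artifacts are plain strings. ---
def broken_upscale(task: str, artifact: str) -> str:
    raise RuntimeError("missing node")


T2I = Workflow("text_to_image", "Generate an image from a text prompt.",
               lambda task, art: f"image({task})")
BAD = Workflow("broken_upscale", "Upscale an image (fails in this demo).",
               broken_upscale)
UPS = Workflow("upscale", "Upscale an existing image.",
               lambda task, art: f"upscaled({art})")


def propose(task: str, artifact: str) -> List[Workflow]:
    """Stand-in for a planner choosing workflows from their descriptions."""
    if artifact == "":
        return [T2I]
    if artifact.startswith("image("):
        return [BAD, UPS]
    return []


def evaluate(task: str, artifact: str) -> Tuple[str, str]:
    """Stand-in for execution-level feedback on an intermediate result."""
    if artifact.startswith("upscaled("):
        return "done", "meets the request"
    if artifact.startswith("image("):
        return "continue", "image exists but still needs upscaling"
    return "reject", "unusable result"


root = Node(None, "")
plan = search("a cat, upscaled", root, propose, evaluate)
print(plan)  # ['text_to_image', 'upscale']
```

In this toy run the failing `broken_upscale` candidate is recorded as a child of the `text_to_image` node together with its error message, and the search moves on to the sibling `upscale` candidate without regenerating the image. That is the sense in which the feedback is local: a failure is handled at the node where it occurred rather than by restarting the whole plan.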

BibTeX

@misc{guo2025comfymindgeneralpurposegenerationtreebased,
  title={ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback},
  author={Litao Guo and Xinli Xu and Luozhou Wang and Jiantao Lin and Jinsong Zhou and Zixin Zhang and Bolan Su and Ying-Cong Chen},
  year={2025},
  eprint={2505.17908},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.17908},
}