With the rapid advancement of generative models, general-purpose generation has gained increasing attention as a promising approach to unifying diverse tasks across modalities within a single system. Despite this progress, existing open-source frameworks often remain fragile and struggle to support complex real-world applications because they lack structured workflow planning and execution-level feedback. To address these limitations, we present ComfyMind, a collaborative AI system built on the ComfyUI platform and designed to enable robust and scalable general-purpose generation. ComfyMind introduces two core innovations: a Semantic Workflow Interface (SWI), which abstracts low-level node graphs into callable functional modules described in natural language, enabling high-level composition and reducing structural errors; and a Search Tree Planning mechanism with localized feedback execution, which models generation as a hierarchical decision process and allows adaptive correction at each stage. Together, these components improve the stability and flexibility of complex generative workflows. We evaluate ComfyMind on three public benchmarks, ComfyBench, GenEval, and Reason-Edit, which span generation, editing, and reasoning tasks. Results show that ComfyMind consistently outperforms existing open-source baselines and achieves performance comparable to GPT-Image-1. ComfyMind paves a promising path for the development of open-source general-purpose generative AI systems.
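The two ideas in the abstract can be illustrated with a toy sketch. This is not ComfyMind's implementation: the `Workflow` class, the `search` function, the `accept` check, and the workflow names are all hypothetical, and plain strings stand in for images. It only shows the general pattern of choosing among described, callable modules stage by stage and retrying locally when a result is rejected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Workflow:
    """Hypothetical module: a natural-language description plus a callable."""
    description: str
    run: Callable[[str], str]


def search(state: str,
           stages: List[List[str]],
           registry: Dict[str, Workflow],
           accept: Callable[[str], bool],
           trace: List[str]) -> Optional[str]:
    """Depth-first search over per-stage workflow choices.

    A rejected result triggers a retry among the sibling candidates of the
    same stage. If every sibling fails, control returns to the parent stage.
    """
    if not stages:
        return state  # every stage produced an accepted result
    for name in stages[0]:
        result = registry[name].run(state)
        if not accept(result):
            trace.append(f"reject {name}")
            continue  # local correction: try the next candidate here
        trace.append(f"accept {name}")
        final = search(result, stages[1:], registry, accept, trace)
        if final is not None:
            return final
    return None  # no candidate at this stage led to a complete plan


# Toy registry (assumption): strings stand in for images and real workflows.
registry = {
    "remove_object_fast": Workflow("Remove an object (toy: always fails)",
                                   lambda s: s + "|FAILED"),
    "remove_object_inpaint": Workflow("Remove an object via inpainting",
                                      lambda s: s + "|removed"),
    "upscale": Workflow("Upscale the image", lambda s: s + "|upscaled"),
}


def accept(result: str) -> bool:
    """Hypothetical stand-in for an execution-level quality check."""
    return "FAILED" not in result


trace: List[str] = []
result = search("dinner.jpg",
                [["remove_object_fast", "remove_object_inpaint"], ["upscale"]],
                registry, accept, trace)
print(result)
print(trace)
```

In this toy run the first removal candidate is rejected and only that stage is retried, so the overall plan does not have to be rebuilt from scratch.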
Edit the image of a whole cake cake.jpg to make it look like a triangular corner piece has been cut out. The remaining cake should appear untouched and natural
Convert the cherries image into an advertisement version with exhibition stand lighting
You are given an image man.jpg, which is a photo of a young man. Generate another photo to show the man as an elderly version of himself, with wrinkles, gray hair, and other signs of aging, while preserving his identity. The result should be a realistic image of an older man.
You are given an image pigeon_scribble.png. Using the reference image as the logo, generate a ceramic-textured cup. The background is an office table.
Based on the given reference image new_york.jpg, outpaint the image by 512 pixels on both the left and right sides, using the prompt: A spectacular view of New York City's skyline at dusk
Based on the given reference image castle.jpg, replace the castle in the image with a traditional Chinese temple
Based on the given reference image windmill.jpg, remove the windmill in the image
Based on the given reference image dinner.jpg, remove the knife and fork in the image
Generate an 8-second high-quality video of a bonfire burning on the seaside
Generate an 8-second high-quality video of a fried egg sizzling in the skillet
Generate a 4-second high-quality video of a survivor in an exoskeleton scavenging the wastecity, framed by an over-the-shoulder shot
Generate a 4-second high-quality video of sunlight filtering through the forest while a deer herd drinks from a stream
Generate a 4-second high-quality video of a winged woman hovering in the desolate skies above the wasteland
@misc{guo2025comfymindgeneralpurposegenerationtreebased,
title={ComfyMind: Toward General-Purpose Generation via Tree-Based Planning and Reactive Feedback},
author={Litao Guo and Xinli Xu and Luozhou Wang and Jiantao Lin and Jinsong Zhou and Zixin Zhang and Bolan Su and Ying-Cong Chen},
year={2025},
eprint={2505.17908},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.17908},
}