Capybara for Text to Image
Create unique images using Capybara
Tags: Capybara, Text2Image
Nodes & Models
RandomNoise
KSamplerSelect
MarkdownNote
UNETLoader (capybara_v0.1.safetensors)
VAELoader (hunyuanvideo15_vae_fp16.safetensors)
BasicScheduler
ModelSamplingSD3
CLIPTextEncode
CFGGuider
SamplerCustomAdvanced
VAEDecode
AddLabel
PreviewImage
easy positive
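The node list above can be sketched as a ComfyUI API-format prompt graph (the JSON body posted to ComfyUI's `/prompt` endpoint). This is a hedged illustration only: the node ids, parameter values, the CLIPLoader text-encoder node, the negative-prompt encode, and the EmptySD3LatentImage latent source are assumptions not present in the template's node list, and the real template JSON may wire things differently.

```python
import json

# Illustrative sketch of the template's graph in ComfyUI API format.
# Node ids and all parameter values are assumptions; "<text_encoder>"
# is a deliberate placeholder, not a real filename.
graph = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "capybara_v0.1.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "VAELoader",
          "inputs": {"vae_name": "hunyuanvideo15_vae_fp16.safetensors"}},
    "3": {"class_type": "ModelSamplingSD3",        # shift value is assumed
          "inputs": {"model": ["1", 0], "shift": 3.0}},
    "4": {"class_type": "RandomNoise", "inputs": {"noise_seed": 42}},
    "5": {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    "6": {"class_type": "BasicScheduler",
          "inputs": {"model": ["3", 0], "scheduler": "simple",
                     "steps": 50, "denoise": 1.0}},
    "11": {"class_type": "CLIPLoader",             # assumed; not in the list above
           "inputs": {"clip_name": "<text_encoder>.safetensors",
                      "type": "sd3"}},
    "7": {"class_type": "CLIPTextEncode",          # positive prompt
          "inputs": {"clip": ["11", 0],
                     "text": "cinematic close-up of a capybara, golden hour"}},
    "14": {"class_type": "CLIPTextEncode",         # negative prompt (assumed)
           "inputs": {"clip": ["11", 0], "text": ""}},
    "8": {"class_type": "CFGGuider",
          "inputs": {"model": ["3", 0], "positive": ["7", 0],
                     "negative": ["14", 0], "cfg": 5.0}},
    "12": {"class_type": "EmptySD3LatentImage",    # assumed latent source, ~720p
           "inputs": {"width": 1280, "height": 720, "batch_size": 1}},
    "9": {"class_type": "SamplerCustomAdvanced",
          "inputs": {"noise": ["4", 0], "guider": ["8", 0],
                     "sampler": ["5", 0], "sigmas": ["6", 0],
                     "latent_image": ["12", 0]}},
    "10": {"class_type": "VAEDecode",
           "inputs": {"samples": ["9", 0], "vae": ["2", 0]}},
    "13": {"class_type": "PreviewImage", "inputs": {"images": ["10", 0]}},
}

# Sanity-check: every [node_id, output_index] link points at a defined node.
for node in graph.values():
    for value in node["inputs"].values():
        if isinstance(value, list):
            assert value[0] in graph

payload = json.dumps({"prompt": graph})  # request body for POST /prompt
```

The shape mirrors the template: UNETLoader/VAELoader load the weights, ModelSamplingSD3 sets the sampling shift, CLIPTextEncode feeds CFGGuider, and SamplerCustomAdvanced combines noise, guider, sampler, and sigmas before VAEDecode and PreviewImage.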
Capybara is a unified visual generation model that handles text-to-image, image editing, and video tasks; in this template it is used for text-to-image, generating high-quality still images from prompts.
What it is
A 14B diffusion-transformer model (built on HunyuanVideo 1.5) that supports text-to-image (T2I), text-to-video (T2V), image-to-image (I2I), and video-to-video (V2V) in one architecture, with custom ComfyUI nodes.
For text-to-image, you give it a natural-language prompt and it generates 720p-class images with strong realism and style flexibility.
Key features (text to image)
Handles complex scenes (multiple characters, detailed environments) while maintaining coherent global composition.
Supports instruction-like prompts ("cinematic close-up," "anime style," "studio product shot") thanks to its unified semantic/vision transformer design.
Recommended settings: around 720p and ~50 steps for best quality, with the option to reduce steps using acceleration LoRAs for faster renders.
Tight ComfyUI integration via official templates such as "Capybara: Text to Image," so you can drop it into existing node graphs easily.
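The recommended settings can be captured as a small config sketch. This is a minimal illustration assuming a quality preset and a faster LoRA-accelerated preset; the 8-step count, the guidance value, and the LoRA filename placeholder are assumptions, not official values.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hedged sketch of the recommended text-to-image settings; only the
# ~720p resolution and ~50 steps come from the text above.
@dataclass(frozen=True)
class CapybaraT2ISettings:
    width: int = 1280          # ~720p
    height: int = 720
    steps: int = 50            # recommended for best quality
    cfg: float = 5.0           # assumed guidance scale
    lora: Optional[str] = None # acceleration LoRA, if any

quality = CapybaraT2ISettings()

# Faster renders: fewer steps with an acceleration LoRA attached.
# The step count and filename placeholder are illustrative assumptions.
fast = replace(quality, steps=8, lora="<acceleration_lora>.safetensors")
```

Using a frozen dataclass with `dataclasses.replace` keeps the quality preset immutable while deriving variants for faster renders.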
Best use cases
Cinematic keyframes and concept art from detailed text briefs (characters, lighting, camera language).
Stylized or realistic illustrations for thumbnails, posters, and social content, without needing a separate model for video work.
Unified pipelines where you might later extend a still image into motion (I2V/T2V) using the same Capybara model family.