ComfyUI-LongCat-AudioDIT-TTS

Author Saganaki22

https://github.com/Saganaki22/ComfyUI-LongCat-AudioDIT-TTS

Last updated: 2026-04-04

ComfyUI-LongCat-AudioDiT-TTS integrates LongCat-AudioDiT, a diffusion-based text-to-speech model, into ComfyUI. It generates speech from text input and can clone a voice from a short reference audio clip without any fine-tuning.

Supports zero-shot voice cloning, which replicates a voice from a brief audio sample.
Supports multi-speaker conversation synthesis, which generates dialogues that use several cloned voices.
Offers FP8, BF16, FP16, and FP32 model precision options, so users can trade memory use against numerical precision to suit their hardware.
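The precision choice is mainly a memory trade-off. Each weight takes 4 bytes in FP32, 2 bytes in FP16 or BF16, and 1 byte in FP8. The following is a minimal sketch that assumes you know the model's parameter count and your free VRAM. The helper names, the 70% headroom factor, and the 1-billion-parameter figure in the usage line are illustrative only and do not come from the node pack.

```python
# Bytes used to store a single weight in each precision format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "fp8": 1}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Activations and framework overhead are not included.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


def pick_precision(
    num_params: int,
    free_vram_gib: float,
    supports_bf16: bool = True,
    headroom: float = 0.7,
) -> str:
    """Return the most precise format whose weights fit in headroom * free VRAM.

    `headroom` leaves room for activations; 0.7 is an arbitrary illustrative value.
    BF16 and FP16 take the same space, so BF16 is tried when the GPU supports it.
    """
    half = "bf16" if supports_bf16 else "fp16"
    budget = free_vram_gib * headroom
    for precision in ("fp32", half, "fp8"):
        if weight_memory_gib(num_params, precision) <= budget:
            return precision
    raise ValueError("weights do not fit in the available VRAM even at fp8")


# Hypothetical 1-billion-parameter model on a GPU with 4 GiB free.
print(pick_precision(1_000_000_000, free_vram_gib=4.0))  # bf16
```

Whether FP8 weights can actually be used also depends on GPU and library support. This sketch does not check for that.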

Context

This tool is a custom node extension for ComfyUI. It adds the LongCat-AudioDiT model, which uses a diffusion transformer (DiT) architecture. Its purpose is to generate speech inside a ComfyUI workflow, either directly from text or in a voice cloned from a provided audio sample.
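Voice cloning starts from a short reference clip, so it helps to check the clip before wiring it into a workflow. The following is a minimal standard-library sketch that assumes an uncompressed PCM WAV file. The 15-second limit and the function names are hypothetical and are not taken from the node pack.

```python
import wave

# Hypothetical upper bound for a "short" reference clip; not a value from the node pack.
MAX_REFERENCE_SECONDS = 15.0


def reference_clip_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str, max_seconds: float = MAX_REFERENCE_SECONDS) -> float:
    """Reject empty or overly long reference clips before they reach the workflow."""
    seconds = reference_clip_seconds(path)
    if seconds == 0:
        raise ValueError(f"{path} contains no audio frames")
    if seconds > max_seconds:
        raise ValueError(f"{path} is {seconds:.1f} s long; trim it to {max_seconds:.0f} s or less")
    return seconds
```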

Key Features & Benefits

The integration provides these practical features:

Zero-shot voice cloning replicates a voice from a short audio clip, so personalized audio needs no training step.
Multi-speaker TTS generates conversations between several voices, which is useful for interactive dialogue and storytelling.
Diffusion-based generation, which refines audio step by step starting from noise, is designed to produce clear, natural-sounding speech.
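A multi-speaker conversation needs a script that says who speaks each line, plus a reference voice for each speaker. This page does not document the node's real script format. The following sketch therefore assumes a hypothetical format in which each line starts with a tag such as `[S1]`. It shows how such a script could be split into turns and paired with reference clips. The names `parse_dialogue`, `plan_synthesis`, and `Turn` are illustrative only.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical script format: every non-blank line starts with a speaker tag like "[S1]".
TAG = re.compile(r"^\[(?P<speaker>[^\]]+)\]\s*(?P<text>.+)$")


@dataclass
class Turn:
    speaker: str
    text: str


def parse_dialogue(script: str) -> list[Turn]:
    """Split a tagged script into ordered (speaker, text) turns."""
    turns = []
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines separate turns but carry no speech
        match = TAG.match(line)
        if match is None:
            raise ValueError(f"line {lineno} has no speaker tag: {line!r}")
        turns.append(Turn(match["speaker"], match["text"].strip()))
    return turns


def plan_synthesis(turns: list[Turn], voices: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each turn with its speaker's reference clip, failing early on unknown speakers."""
    missing = {t.speaker for t in turns} - set(voices)
    if missing:
        raise KeyError(f"no reference audio for speakers: {sorted(missing)}")
    return [(voices[t.speaker], t.text) for t in turns]
```

Validating every speaker before any synthesis runs avoids discovering a missing voice halfway through a long dialogue.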

Advanced Functionalities

The tool includes advanced capabilities such as:

Optimized attention mechanisms that speed up audio generation.
Automatic download and caching of model weights, so weights are fetched when needed and reused afterwards without manual setup.
Support for multiple precision formats, so users can select the model variant that best fits their hardware.
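Auto-download with caching usually follows a download-once, reuse-afterwards pattern. The sketch below is a generic illustration of that pattern, not the node pack's code. `ensure_cached` and its `download` callback are hypothetical. Real projects often delegate this work to a library such as `huggingface_hub` (for example `hf_hub_download`), which maintains its own cache.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(cache_dir: str, filename: str, download: Callable[[Path], None]) -> Path:
    """Return a local copy of `filename`, downloading it only on a cache miss.

    `download` is any function that writes the file to the path it is given,
    for example a wrapper around an HTTP or Hugging Face Hub client.
    """
    target = Path(cache_dir) / filename
    if target.exists():
        return target  # cache hit: reuse the file, no network access
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    download(partial)  # write to a temporary name first
    partial.replace(target)  # then rename, so a half-finished download is never treated as complete
    return target
```

Writing to a `.part` file and renaming it only after the download finishes means an interrupted download leaves no file under the final name. The next call then retries the download instead of loading a truncated file.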

Practical Benefits

Inside ComfyUI, the tool turns text into speech and clones voices without extensive setup or fine-tuning. This shortens production time and gives users more creative control over the result. The multi-speaker mode also makes it possible to produce dialogue-based audio content.

Credits/Acknowledgments

This integration builds on the LongCat-AudioDiT model and is licensed under the MIT License. For further details, see the original model repository on Hugging Face and its documentation.

Inner Nodes

LongCatMultiSpeakerTTS

LongCatTTS

LongCatVoiceCloneTTS
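ComfyUI discovers custom nodes through a `NODE_CLASS_MAPPINGS` dictionary exported by the package. Each node class declares its inputs in `INPUT_TYPES`, its outputs in `RETURN_TYPES`, and the method to call in `FUNCTION`. The following simplified sketch shows what a text-to-speech node along the lines of LongCatTTS could look like. The class name, its inputs, and the `synthesize` placeholder are hypothetical, because this page does not list the real nodes' inputs. In ComfyUI, an AUDIO value is a dict that holds a waveform tensor and its sample rate.

```python
def synthesize(text: str, precision: str, seed: int):
    """Placeholder for loading LongCat-AudioDiT and running inference (hypothetical)."""
    raise NotImplementedError("model loading and sampling go here")


class LongCatTTSSketch:
    """Hypothetical, simplified stand-in for a TTS node; not the real LongCatTTS."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True, "default": ""}),
                # A list of strings renders as a drop-down (combo) input in ComfyUI.
                "precision": (["fp32", "bf16", "fp16", "fp8"],),
                "seed": ("INT", {"default": 0, "min": 0, "max": 2**32 - 1}),
            }
        }

    RETURN_TYPES = ("AUDIO",)
    FUNCTION = "generate"
    CATEGORY = "audio/tts"

    def generate(self, text: str, precision: str, seed: int):
        # In ComfyUI the waveform would be a torch tensor shaped [batch, channels, samples].
        waveform, sample_rate = synthesize(text, precision, seed)
        # Outputs are returned as a tuple matching RETURN_TYPES.
        return ({"waveform": waveform, "sample_rate": sample_rate},)


# ComfyUI reads these mappings from the custom node package when it starts.
NODE_CLASS_MAPPINGS = {"LongCatTTSSketch": LongCatTTSSketch}
NODE_DISPLAY_NAME_MAPPINGS = {"LongCatTTSSketch": "LongCat TTS (sketch)"}
```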