ComfyUI Dia is a node pack that integrates the Dia text-to-speech model into ComfyUI, enabling the generation of dialogue and non-verbal sounds. It loads model weights from safetensors files for efficiency and requires a CUDA-capable GPU.
- Enables dialogue generation with speaker tags and non-verbal cues for more realistic audio output.
- Supports audio prompting for voice cloning, allowing users to generate new speech in a specific voice style.
- Provides detailed configuration options for audio generation, including token limits and sampling parameters.
Context
The ComfyUI Dia tool is a node pack that brings the Nari Labs Dia-1.6B text-to-speech model into the ComfyUI framework. Its primary aim is to produce natural-sounding dialogue, complete with speaker identification and expressive non-verbal sounds, extending the audio generation capabilities available within ComfyUI.
Key Features & Benefits
This tool lets users generate dialogue with speaker tags that distinguish the voices in a conversation, and supports non-verbal sounds such as laughter or sighs for added realism. The audio prompting feature conditions generation on a previously recorded voice, producing new speech that mimics the style and tone of the original recording.
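To illustrate the input format, here is a minimal sketch of composing a tagged dialogue script. It assumes the Dia convention of `[S1]`/`[S2]` speaker tags and parenthesized non-verbal cues such as `(laughs)`; the `dialogue_script` helper is purely illustrative, not part of the node pack.

```python
# Compose a Dia-style dialogue script. Speaker tags ([S1], [S2]) and
# parenthesized non-verbal cues like (laughs) follow Dia's documented
# conventions; this helper is illustrative, not part of the node pack.

def dialogue_script(turns):
    """Join (speaker, line) pairs into a single tagged script string."""
    return " ".join(f"[{speaker}] {line}" for speaker, line in turns)

script = dialogue_script([
    ("S1", "Did you see the demo yesterday?"),
    ("S2", "I did. (laughs) The non-verbal sounds were surprisingly natural."),
])
print(script)
```

The resulting string is what you would feed to the text input of the generation node.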
Advanced Functionalities
The Dia tool exposes generation parameters such as max_tokens, cfg_scale, temperature, and top_p, which control the length and variability of the generated audio. This lets users fine-tune outputs for specific applications, whether they need more deterministic speech or more varied, creative delivery. When audio prompting is used, the transcript of the audio prompt must be included at the start of the text input so the model aligns the generated speech with the reference voice.
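The parameter handling and transcript-prepending rule described above can be sketched as follows. The dictionary keys mirror the parameter names mentioned in the text, but the values and the way they are assembled are illustrative assumptions, not the node pack's actual defaults or API.

```python
# Hypothetical sketch of the generation inputs described above.
# Parameter names match the text; the values are illustrative only.

generation_config = {
    "max_tokens": 3072,   # upper bound on generated audio tokens (length)
    "cfg_scale": 3.0,     # classifier-free guidance strength (text adherence)
    "temperature": 1.3,   # sampling randomness; lower = more deterministic
    "top_p": 0.95,        # nucleus sampling cutoff
}

# For voice cloning, the audio prompt's transcript must precede the new
# text so the model continues in the reference voice.
prompt_transcript = "[S1] This is the transcript of the reference audio."
new_text = "[S1] And this is the new line to speak in the same voice."
conditioned_input = prompt_transcript + " " + new_text
print(conditioned_input)
```

Lowering temperature and top_p pushes the output toward deterministic speech; raising them yields more varied delivery.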
Practical Benefits
By integrating the Dia text-to-speech model into ComfyUI, this tool streamlines audio content creation workflows. It gives users fine-grained control over dialogue generation, producing high-quality outputs that can be tailored to specific needs. Generating non-verbal sounds alongside speech adds realism, making the output suitable for applications ranging from entertainment to educational content.
Credits/Acknowledgments
This tool is developed by contributors to the ComfyUI and Nari Labs projects and uses the Dia-1.6B model architecture. The repository is open source; see the GitHub page for the original authors and license details.