This ComfyUI integration of Microsoft's VibeVoice text-to-speech model synthesizes speech from text for both single-speaker and multi-speaker scripts. It also supports voice cloning from audio samples and exposes controls such as speech speed and text chunking.
- Supports single and multi-speaker voice synthesis, allowing for conversations with distinct voices.
- Features voice cloning capabilities, enabling users to replicate specific voice characteristics from audio samples.
- Offers adjustable parameters such as voice speed control and text chunking for long inputs.
Context
The tool integrates the VibeVoice text-to-speech model into ComfyUI so that speech can be generated from text directly inside a ComfyUI workflow, for a single speaker or for a dialogue between several speakers.
Key Features & Benefits
The integration provides the following core features:
- Single and Multi-Speaker TTS: Users can generate speech for one or multiple speakers, with support for up to four distinct voices, making it ideal for creating dialogues or conversations.
- Voice Cloning: This feature allows users to clone voices from audio samples, providing the ability to create personalized or character-specific speech outputs.
- Text File Loading and Automatic Chunking: Users can load scripts directly from text files, and long texts are automatically split into smaller chunks before synthesis.
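As an illustration of the automatic chunking idea, the sketch below splits a long script at sentence boundaries so that no chunk exceeds a character budget. It is a minimal sketch under stated assumptions, not the integration's actual code: the function name `chunk_text`, the `max_chars` parameter, and the 250-character default are all hypothetical, and the real node may split on different boundaries or limits.

```python
import re


def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks after sentence-ending punctuation where
    possible and falls back to word (or hard) breaks for long sentences.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the budget is cut at the last space that
        # fits, or at max_chars if it contains no space at all.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "Aaa bbb. Ccc ddd. Eee fff."
    for i, chunk in enumerate(chunk_text(script, max_chars=17), start=1):
        print(i, chunk)
```

Breaking at sentence ends rather than at a fixed character offset keeps each chunk a complete utterance, which tends to give more natural prosody when the chunks are synthesized separately and concatenated.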
Advanced Functionalities
The tool includes advanced capabilities such as:
- LoRA Support: Users can fine-tune voices using custom LoRA adapters, allowing for specialized voice characteristics while maintaining the base model's functionality.
- Voice Speed Control: This feature lets users adjust the rate of speech by modifying the reference voice speed, which is particularly useful for creating natural-sounding dialogues.
- Custom Pause Tags: Users can insert pauses of specified durations within the text, enhancing control over speech pacing and timing.
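To make the pause-tag idea concrete, the sketch below splits a script into spoken segments and silent gaps. Everything specific here is an assumption made for illustration: the tag syntax `[pause]` and `[pause:ms]`, the 1000 ms default, the 24 kHz sample rate, and the helper names `split_on_pauses` and `silence_samples`. The source only states that pauses of specified durations can be inserted in the text.

```python
import re

# Assumed tag syntax: "[pause]" or "[pause:1500]" (duration in milliseconds).
PAUSE_TAG = re.compile(r"\[pause(?::(\d+))?\]", re.IGNORECASE)
DEFAULT_PAUSE_MS = 1000  # assumed default when no duration is given


def split_on_pauses(text: str) -> list[tuple]:
    """Return ("text", str) and ("pause", milliseconds) segments in order."""
    segments: list[tuple] = []
    pos = 0
    for match in PAUSE_TAG.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        duration = int(match.group(1)) if match.group(1) else DEFAULT_PAUSE_MS
        segments.append(("pause", duration))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def silence_samples(duration_ms: int, sample_rate: int = 24000) -> int:
    """Number of zero-valued samples needed for a pause (sample rate assumed)."""
    return round(sample_rate * duration_ms / 1000)


if __name__ == "__main__":
    for kind, value in split_on_pauses("Hello. [pause] Wait [pause:250] done"):
        print(kind, value)
```

In a design like this, each text segment would be synthesized on its own and the pause segments rendered as silence of the computed length, so the pause duration is exact and does not depend on the model's own pacing.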
Practical Benefits
Together, these features give users direct control over voice identity, speech rate, and pacing without leaving ComfyUI. Scripts can be loaded from text files, long passages are chunked automatically, and pauses can be placed exactly where they are needed, which shortens the path from written text to finished audio.
Credits/Acknowledgments
The integration was developed by Fabio Sarracino, with contributions from the community, and is released under the MIT License. The VibeVoice model itself is maintained by Microsoft Research and is subject to their licensing terms.