floyo logo
Powered by
ThinkDiffusion

VibeVoice-ComfyUI

1372

Last updated
2026-02-18

VibeVoice-ComfyUI integrates Microsoft's VibeVoice text-to-speech model into ComfyUI. It synthesizes speech from text for both single-speaker and multi-speaker scripts, and it supports voice cloning and customization options for generating realistic speech.

  • Supports single and multi-speaker voice synthesis, allowing for conversations with distinct voices.
  • Features voice cloning capabilities, enabling users to replicate specific voice characteristics from audio samples.
  • Offers customizable parameters such as voice speed control and text chunking for improved audio output quality.

Context

This tool brings the VibeVoice text-to-speech model into ComfyUI so that speech can be generated from text directly inside a workflow, whether for a single speaker or for a multi-speaker dialogue.

Key Features & Benefits

The integration provides the following core features:

  • Single and Multi-Speaker TTS: Users can generate speech for one or multiple speakers, with support for up to four distinct voices, making it ideal for creating dialogues or conversations.
  • Voice Cloning: This feature allows users to clone voices from audio samples, providing the ability to create personalized or character-specific speech outputs.
  • Text File Loading and Automatic Chunking: Users can load scripts directly from text files, and long texts are automatically divided into manageable chunks, ensuring smooth processing and output.
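The automatic chunking described above can be pictured with a minimal Python sketch. This is not the node's actual implementation: the function name `chunk_text`, the `max_chars` limit, and the choice to break at sentence-ending punctuation are all assumptions made for illustration.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 250) -> List[str]:
    """Group sentences into chunks of roughly max_chars characters.

    Hypothetical helper: the real node may use a different limit
    and a different splitting rule.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # The sentence fits, or it is the first sentence of a chunk.
            # A single sentence longer than max_chars is kept whole
            # rather than being cut mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be synthesized separately and the resulting audio joined in order. Breaking at sentence boundaries rather than at a fixed character offset keeps each chunk a natural unit of speech.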

Advanced Functionalities

The tool includes advanced capabilities such as:

  • LoRA Support: Users can fine-tune voices using custom LoRA adapters, allowing for specialized voice characteristics while maintaining the base model's functionality.
  • Voice Speed Control: This feature lets users adjust the rate of speech by modifying the reference voice speed, which is particularly useful for creating natural-sounding dialogues.
  • Custom Pause Tags: Users can insert pauses of specified durations within the text, enhancing control over speech pacing and timing.
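As a rough illustration of how pause tags might be handled, the Python sketch below splits a script into spoken segments and pauses. The tag syntax (`[pause]` and `[pause:<milliseconds>]`), the default duration, and the function name `split_on_pauses` are assumptions for this example, not something this page documents. Check the node's own documentation for the exact format.

```python
import re
from typing import List, Tuple, Union

# Assumed tag syntax: "[pause]" or "[pause:<milliseconds>]".
PAUSE_TAG = re.compile(r"\[pause(?::(\d+))?\]", re.IGNORECASE)
DEFAULT_PAUSE_MS = 1000  # assumed default when no duration is given

Segment = Tuple[str, Union[str, int]]


def split_on_pauses(text: str) -> List[Segment]:
    """Split text into ("text", str) and ("pause", milliseconds) segments."""
    segments: List[Segment] = []
    position = 0
    for match in PAUSE_TAG.finditer(text):
        # Text between the previous tag (or the start) and this tag.
        spoken = text[position:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        duration = int(match.group(1)) if match.group(1) else DEFAULT_PAUSE_MS
        segments.append(("pause", duration))
        position = match.end()
    # Any text remaining after the last tag.
    tail = text[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments
```

In a pipeline built this way, each "text" segment would be sent to the speech model, and each "pause" segment would become silence of the given length inserted between the generated clips.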

Practical Benefits

Together, these features give users direct control over voice characteristics and speech pacing within ComfyUI, so realistic audio can be produced without leaving the workflow.

Credits/Acknowledgments

The integration was developed by Fabio Sarracino, with contributions from the community. The VibeVoice model itself is maintained by Microsoft Research and is subject to their licensing terms. The tool is released under the MIT License, promoting open-source collaboration and usage.

Inner Nodes

LoadTextFromFileNode, VibeVoice Free Memory, VibeVoice LoRA, VibeVoice Load Text From File, VibeVoice Multiple Speakers, VibeVoice Single Speaker, VibeVoiceFreeMemoryNode, VibeVoiceLoRANode, VibeVoiceMultipleSpeakersNode, VibeVoiceSingleSpeakerNode