VibeVoice Text to Speech Single Speaker
Nodes & Models
Note
LoadAudio
PreviewAudio
LoadTextFromFileNode
VibeVoiceSingleSpeakerNode
Generate natural-sounding speech from text using a cloned voice.
Upload a short audio sample of the voice you want to clone, type or paste your script, and VibeVoice generates speech that sounds like that person. It handles pacing, emotion, and delivery naturally because it uses an LLM to understand context before generating audio. The output is a clean audio file ready to use as voiceover, narration, or dialogue.
Upload a voice sample, type your text, hit run.
How do you clone a voice and generate speech with VibeVoice?
Upload an audio sample of the voice you want to clone (at least 20 seconds for best results). Type or paste your script into the text field. VibeVoice reads the text, matches the voice characteristics from your sample, and generates speech with natural pacing and expression. The VibeVoice-Large model produces the highest quality output.
Voice to Clone (Load Audio) Upload an audio clip of the voice you want to replicate, as MP3 or WAV, or extract the audio from a video file. Longer samples (20+ seconds) give VibeVoice more data to match the voice's tone, rhythm, and character. Shorter clips work, but the voice match may be less accurate.
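If you want to check your sample's length before uploading, a short script can read the duration of a WAV file. This is a minimal sketch using Python's standard library; the helper name and the 20-second threshold are taken from the guidance above, not from the workflow itself.

```python
import wave

def sample_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_long_enough(path: str, minimum: float = 20.0) -> bool:
    """Hypothetical helper: flag samples shorter than the recommended 20 seconds."""
    return sample_duration_seconds(path) >= minimum
```

For MP3 or video sources you would first convert to WAV (for example with ffmpeg), since the `wave` module only reads uncompressed WAV.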
Clean recordings produce the best clones. Background music, room echo, or noise will bleed into the cloned voice characteristics.
Text Type or paste the script you want spoken. The default includes a test passage that covers different tones, pauses, and pronunciations. For production use, replace it with your own content.
VibeVoice handles punctuation naturally. Periods create full stops. Commas create brief pauses. Question marks shift intonation upward. Line breaks create longer pauses between sections.
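As a rough illustration of that punctuation behavior, you can preview where pauses will land in a script. This is not how VibeVoice itself works internally; the mapping below is an assumption that just mirrors the description above.

```python
# Illustrative pacing preview only; pause labels are assumptions,
# not values read from the VibeVoice model.
PAUSES = {
    ".": "full stop",
    ",": "brief pause",
    "?": "rising intonation",
    "\n": "long pause",
}

def pacing_preview(text: str) -> list[str]:
    """List the pause/intonation events implied by punctuation, in order."""
    return [PAUSES[ch] for ch in text if ch in PAUSES]
```

Running this over a draft script shows at a glance whether a passage reads as one long breathless run or has natural stopping points.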
Want to load text from a file? The workflow includes a LoadTextFromFile node (bypassed by default). Enable it and point it to a .txt file in your ComfyUI input, output, or temp directory.
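A small helper can drop a script into the directory LoadTextFromFile reads from. The ComfyUI root path is an assumption about your setup; the `input` subdirectory matches the directories named above.

```python
from pathlib import Path

def save_script(text: str, comfy_root: str, name: str = "script.txt") -> Path:
    """Write a script into ComfyUI's input directory (path is an assumption;
    adjust comfy_root for your installation)."""
    target = Path(comfy_root) / "input" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

After saving, enable the bypassed LoadTextFromFile node and select the file by name.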
Model (default: VibeVoice-Large) Three model sizes are available:
VibeVoice-1.5B (5.4 GB): Smallest and fastest. Good for quick tests and iteration.
VibeVoice-Large (18.7 GB): Best quality. Handles complex delivery, emotion shifts, and longer passages most accurately. This is the default and recommended option.
VibeVoice-Large quantized variants: 8-bit (11.6 GB) and 4-bit (6.6 GB) versions of the Large model for lower VRAM setups. Quality is close to the full model with reduced memory use.
Voice Speed Factor (default: 1.0) Controls speaking pace. 1.0 is natural speed. Keep adjustments small. Values between 0.95 and 1.05 are recommended. Going further can distort the output. Works best when your voice sample is at least 20 seconds long.
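If you set the speed factor programmatically, it is worth clamping it to the recommended band before it reaches the node. The bounds below come from the guidance above; the function itself is just a sketch.

```python
def clamp_speed(factor: float, low: float = 0.95, high: float = 1.05) -> float:
    """Keep the voice speed factor inside the recommended 0.95-1.05 range
    to avoid distorted output."""
    return max(low, min(high, factor))
```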
Diffusion Steps (default: 20) Controls the audio refinement quality. Higher values produce cleaner audio but take longer. 20 is a good balance. For quick previews, drop to 10-15. For final production audio, try 25-30.
CFG Scale (default: 1.3) Controls how closely the output follows the text instructions versus the voice sample's natural tendencies. Higher values make the model stick closer to the text. Lower values let the cloned voice's natural rhythm dominate.
Temperature (default: 0.95) Controls variation in delivery. Higher temperature adds more expressive variation between runs. Lower temperature makes the output more predictable and consistent. At 0.95, you get natural-sounding variation without instability.
Max Words Per Chunk (default: 250) VibeVoice processes text in chunks. 250 words per chunk works for most scripts. For longer passages, the model handles chunking automatically and stitches the audio together.
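The chunking described above can be sketched as a simple word-based splitter. This is an illustration of the behavior, not the node's actual implementation, which may also respect sentence boundaries when it splits.

```python
def chunk_text(text: str, max_words: int = 250) -> list[str]:
    """Split text into chunks of at most max_words words each
    (illustrative sketch of the chunking behavior described above)."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

A 600-word script, for example, would become three chunks of 250, 250, and 100 words, generated one after another and stitched together.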
Seed (default: 42, fixed) Set to "fixed" with seed 42 by default. This means the same text and voice sample produce the same output every time. Switch to "randomize" if you want variation between runs.
What is VibeVoice good for?
VibeVoice is built for generating expressive speech that sounds like a real person reading your script. It handles tone shifts, pauses, and emotional delivery better than traditional text-to-speech because it uses an LLM to understand context before generating audio. The single-speaker workflow is for one voice at a time.
Voiceover for AI video. Generate narration or dialogue and layer it onto AI-generated video clips. Pair this workflow with Wan or LTX video workflows. Generate the video, generate the voiceover, combine them in post or use them as inputs to a lip sync workflow like MultiTalk.
Podcast and content production. Write a script, clone a voice, and generate the audio. For single-host shows, this workflow handles the full production. For multi-speaker shows, run the single-speaker workflow once per voice and combine the outputs.
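When you run the workflow once per voice, the per-speaker clips can be stitched together afterward. Here is a minimal sketch using Python's standard library, assuming all clips were exported with the same format settings, as they will be when generated by the same workflow.

```python
import wave

def concat_wavs(paths: list[str], out_path: str) -> None:
    """Stitch several WAV clips into one file, in order.
    Assumes all clips share the same sample rate, width, and channel count."""
    with wave.open(paths[0], "rb") as first:
        params = first.getparams()
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for path in paths:
            with wave.open(path, "rb") as clip:
                out.writeframes(clip.readframes(clip.getnframes()))
```

For anything beyond straight concatenation, such as crossfades or overlapping dialogue, a proper audio editor or a library like pydub is the better tool.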
Audiobook and e-learning narration. Long-form text works well. VibeVoice processes text in chunks and handles pacing across extended passages. The Large model keeps delivery consistent over longer scripts. Good for course narration, guided meditations, instructional content, and documentation read-aloud.
Character voices for games and animation. Clone a voice actor's sample, then generate all their lines from a script. Iterate on delivery by adjusting temperature and re-running until the read sounds right. Faster than booking studio time for every revision.
Honest limitations. Voice cloning quality depends on your sample. Short or noisy samples produce less accurate clones. The model handles English well. Other languages may produce less natural results depending on the model version. Singing is not supported. For multiple speakers in one conversation, you'll need to run the workflow once per speaker and edit the outputs together.
FAQ
How long does the voice sample need to be for VibeVoice?
At least 20 seconds for best results. Longer samples give the model more data to match tone, rhythm, and character. Shorter clips (5-10 seconds) work but the voice match may drift, especially on longer output text. Clean recordings without background noise produce the most accurate clones.
Which VibeVoice model should I use?
VibeVoice-Large is the default and recommended model. It produces the highest quality speech with the most natural delivery. If you're limited on VRAM, use the 8-bit or 4-bit quantized versions of Large. The 1.5B model is fastest but produces noticeably lower quality for complex or emotional delivery.
Can I generate long-form audio with VibeVoice?
Yes. VibeVoice can synthesize up to about 90 minutes of audio. The model processes text in chunks (default 250 words per chunk) and stitches the output together. For long scripts, the Large model maintains the most consistent voice quality across the full duration.
Can I use VibeVoice output with a lip sync workflow?
Yes. Generate your speech audio here, then feed it into a lip sync workflow like MultiTalk, FantasyTalking, or InfiniteTalk alongside a portrait image. The lip sync model reads the audio and generates matching mouth movements, facial expressions, and head motion.
How do I run VibeVoice Text to Speech online?
You can run VibeVoice online through Floyo. No installation, no setup. Open the workflow in your browser, upload your voice sample, type your text, and hit run. Free to try.