
AI AUDIO GENERATION
Run VibeVoice on Floyo
Microsoft's open-source long-form multi-speaker TTS. Up to 90 minutes of speech with 4 distinct speakers. Natural turn-taking, voice cloning, and podcast-grade dialogue. MIT licensed.
Run Microsoft's VibeVoice through ComfyUI in your browser. No API key, no installs, no local GPU.
Max Duration: ~90 minutes
Speakers: Up to 4 distinct
Parameters: 1.5B
License: MIT (open source)
Try VibeVoice Now → | Browse All Models
No installation. Runs in browser. Updated April 2026.
What You Get
VibeVoice is Microsoft Research's open-source text-to-speech framework designed for long-form, multi-speaker conversational audio. It generates up to 90 minutes of natural speech with up to 4 distinct speakers, maintaining consistent voice identity and natural turn-taking throughout. It is built on continuous speech tokenizers running at an ultra-low 7.5 Hz frame rate paired with a next-token diffusion decoder, and supports voice cloning from short reference audio. Accepted as an Oral at ICLR 2026, MIT licensed, and available as ComfyUI nodes on Floyo in single-speaker and multi-speaker workflows.
VIBEVOICE WORKFLOWS ON FLOYO
What is VibeVoice?
VibeVoice is Microsoft Research's open-source TTS framework, first released in August 2025. It is purpose-built for long-form, multi-speaker conversational audio like podcasts, audiobooks, and e-learning narration. The 1.5B parameter model generates up to 90 minutes of continuous speech with up to 4 distinct speakers. It was accepted as an Oral presentation at ICLR 2026.
Most TTS models generate one speaker at a time in short clips. VibeVoice handles complete multi-speaker conversations in a single pass. Each speaker maintains a consistent voice identity across the entire duration. Turn-taking happens naturally: speakers interrupt, pause, respond, and overlap the way real people do in conversation.
The core innovation is the ultra-low frame rate tokenizer. VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at 7.5 Hz. Most TTS systems tokenize at 50-100 Hz. The lower frame rate means fewer tokens per second of audio, which makes it computationally feasible to process sequences long enough for full podcast episodes or audiobook chapters without running out of context window.
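The context-window saving is simple arithmetic. A minimal sketch (7.5 Hz is VibeVoice's stated rate; the 50 Hz comparison is an assumption standing in for a typical codec-based tokenizer):

```python
# Back-of-envelope token counts for 90 minutes of audio at two
# tokenizer frame rates. Illustrative arithmetic only.
def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover the given duration."""
    return round(duration_s * frame_rate_hz)

NINETY_MIN_S = 90 * 60  # 5,400 seconds

vibevoice_tokens = tokens_for(NINETY_MIN_S, 7.5)  # 40,500 tokens
typical_tokens = tokens_for(NINETY_MIN_S, 50.0)   # 270,000 tokens

print(vibevoice_tokens, typical_tokens)
```

At 7.5 Hz, a full 90-minute episode fits in roughly 40,500 frames, versus 270,000 at 50 Hz, which is what keeps podcast-length sequences inside a practical context window.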
Voice cloning works from a short reference audio sample. Provide a few seconds of someone speaking, and VibeVoice captures their timbre and speaking style. The cloned voice maintains consistency across the full output duration, even in 90-minute generations.
On Floyo, VibeVoice runs through native ComfyUI nodes on H100 NVL GPUs. Two workflows are available: single-speaker for narration, and multi-speaker for dialogues and podcasts. No model downloads, no Python environment, no local GPU required.
What are VibeVoice's technical specifications?
VibeVoice uses a next-token diffusion framework with a 1.5B parameter LLM backbone and continuous speech tokenizers at 7.5 Hz. It generates up to 90 minutes of speech with up to 4 speakers. The Realtime-0.5B variant achieves ~200-300ms time-to-first-audio for streaming applications. Both variants support voice cloning and multilingual output.
| Spec | Details |
|---|---|
| Developer | Microsoft Research |
| Architecture | Next-token diffusion (LLM backbone + diffusion head) |
| TTS Model | VibeVoice-1.5B (long-form, multi-speaker) |
| Realtime Model | VibeVoice-Realtime-0.5B (streaming, low-latency) |
| Tokenizer Frame Rate | 7.5 Hz (ultra-low, vs 50-100 Hz typical) |
| Max Duration | ~90 minutes (1.5B), ~10 minutes (Realtime-0.5B) |
| Max Speakers | 4 distinct speakers per generation |
| Context Window | Up to 65,536 tokens (curriculum expansion) |
| Voice Cloning | From short reference audio (voice prompting) |
| Latency (Realtime) | ~200-300ms time-to-first-audio |
| Languages | English (primary) + experimental multilingual (DE, FR, IT, JP, KR, NL, PL, PT, ES) |
| WER (Short Utterances) | 2.05% on SEED benchmark (Realtime-0.5B) |
| Emotion Detection | Auto-detects and expresses anger, excitement, sadness, and more |
| License | MIT License (full commercial rights) |
| ComfyUI Access | Native support on Floyo (2 workflows) |
| Release Date | August 25, 2025 (TTS), December 3, 2025 (Realtime), January 21, 2026 (ASR) |
What can you create with VibeVoice?
VibeVoice covers long-form narration, multi-speaker podcast generation, audiobook production, e-learning narration, character dialogue, and voice cloning. The single-speaker workflow handles narration and voiceover. The multi-speaker workflow handles dialogues, panels, and conversational content with up to 4 distinct voices and natural turn-taking.
| Capability | What It Does | Use Case |
|---|---|---|
| Long-Form Speech | Generate up to 90 minutes of continuous speech without quality degradation. Speaker identity stays locked across the full duration. | Audiobooks, lectures, documentation narration |
| Multi-Speaker Dialogue | Up to 4 distinct speakers in one generation. Natural turn-taking, consistent voice identity per speaker, and contextual prosody shifts. | Podcasts, interview simulations, panel discussions |
| Voice Cloning | Provide a short reference audio sample. VibeVoice captures timbre and speaking style and applies it to your text. Consistent across full output. | Brand voice, character consistency, personalized narration |
| Automatic Emotion | The model detects emotional context from the text and adjusts delivery automatically. Excitement, sadness, anger, and other emotions render without explicit tags. | Storytelling, dramatic narration, character dialogue |
| Single-Speaker Narration | Clean, consistent single-voice output for narration, voiceover, and documentation. Maintains natural prosody across long texts. | E-learning, tutorial narration, documentation, voiceover |
| Pipeline Integration | Chain with video models in ComfyUI. Generate video with Wan 2.7 or Hailuo, add narration or dialogue with VibeVoice in the same workflow. | Video production pipelines, multimedia content |
What are VibeVoice's key features?
VibeVoice's feature set is designed around one problem: generating long, multi-speaker audio that sounds like a real conversation. Most TTS models max out at single-speaker clips under a minute. VibeVoice generates 90-minute dialogues with 4 speakers. Every architectural decision targets this capability.
90-Minute Long-Form Generation
Most TTS models degrade after 30-60 seconds. VibeVoice generates up to 90 minutes of continuous speech without quality loss. The ultra-low 7.5 Hz tokenizer is the key: it produces fewer tokens per second of audio, keeping long sequences within the model's context window. A 90-minute podcast episode that would require thousands of separate clips with other models generates in one pass with VibeVoice.
4-Speaker Consistency
Four distinct speakers maintain consistent voice identities across the entire output. Speaker 1 sounds the same on minute 1 and minute 90. Turn-taking happens naturally: speakers respond to each other with appropriate timing, pauses, and prosody shifts. This is not separate clips stitched together. The model generates the full multi-speaker conversation as one continuous output.
7.5 Hz Ultra-Low Frame Rate Tokenizer
The core innovation. Most TTS systems tokenize audio at 50-100 Hz. VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) at 7.5 Hz. This 7-13x reduction in token rate makes it computationally feasible to process sequences long enough for full podcast episodes. The tokenizers preserve audio fidelity despite the lower frame rate through continuous (not discrete) representations.
Next-Token Diffusion
The architecture combines an LLM backbone (for understanding text context and dialogue flow) with a diffusion head (for generating high-fidelity acoustic details). The LLM plans what to say and how to say it. The diffusion head handles the fine-grained audio synthesis. This hybrid approach produces more natural-sounding speech than pure autoregressive or pure diffusion methods.
Voice Prompting
Provide a short audio sample and VibeVoice captures the speaker's vocal characteristics. The cloned voice stays consistent across the full output duration. You can define each of the 4 speakers with a different reference, creating a complete cast of distinct characters for podcast or audiobook production.
MIT License
Fully open source under the MIT License. Model weights, inference code, and training details are available on HuggingFace and GitHub. Full commercial rights. No usage restrictions. You can deploy it, modify it, fine-tune it, and build commercial products with it.
How does VibeVoice compare to other TTS models?
VibeVoice's unique advantage is long-form multi-speaker generation (90 minutes, 4 speakers). No other open-source model matches this. Fish Audio S2 leads on inline emotion control and language count. MiniMax Speech 2.8 HD leads on audio fidelity and arena rankings. ElevenLabs leads on consumer ecosystem. VibeVoice is the pick when you need full podcast episodes or audiobook chapters generated in one pass.
| Model | Max Duration | Speakers | Open Source | License |
|---|---|---|---|---|
| VibeVoice | ~90 minutes | 4 | Yes | MIT |
| Fish Audio S2 | No fixed limit | Multi-speaker | Yes | Research License |
| MiniMax Speech 2.8 HD | Long scripts | 17+ preset voices | No | Commercial API |
| ElevenLabs | Long scripts | Multiple voices | No | Commercial API |
Source: Microsoft Research VibeVoice GitHub, HuggingFace model cards, SEED benchmark results, arXiv:2508.19205, and third-party reviews as of April 2026.
How does VibeVoice work?
VibeVoice uses a next-token diffusion framework. An LLM backbone (Qwen2.5 for the Realtime variant) processes the text and dialogue structure. A diffusion head generates high-fidelity acoustic tokens at each step. Continuous speech tokenizers (Acoustic and Semantic) operate at 7.5 Hz, compressing audio into a token-efficient representation that enables long-form generation.
The Acoustic tokenizer captures the fine-grained audio signal: timbre, pitch, breathiness, and vocal texture. The Semantic tokenizer captures the linguistic content: what words are being spoken and their prosodic structure. Both operate in continuous (not discrete) token spaces, which preserves more audio detail than traditional codec-based approaches at the same compression ratio.
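To put the 7.5 Hz rate in perspective, each tokenizer frame summarizes thousands of raw audio samples. A quick sketch, assuming a 24 kHz output sample rate (an assumption; the page does not state the rate):

```python
# Raw-audio samples summarized by one tokenizer frame, assuming a
# 24 kHz sample rate (assumption, not stated on this page).
sample_rate_hz = 24_000
frame_rate_hz = 7.5

samples_per_frame = sample_rate_hz / frame_rate_hz
print(samples_per_frame)  # 3200.0
```

Each continuous frame stands in for 3,200 raw samples, which is why preserving fidelity requires continuous rather than discrete representations.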
For multi-speaker generation, the model uses speaker-specific tokens that anchor each speaker's identity throughout the generation. The LLM backbone plans turn-taking, response timing, and emotional context. The diffusion head renders each speaker's audio with their established voice characteristics. Speaker transitions happen naturally because the model treats the conversation as a single continuous sequence.
On Floyo, VibeVoice runs through native ComfyUI nodes on H100 NVL GPUs. The single-speaker workflow takes a text input and generates narration. The multi-speaker workflow takes a dialogue script with speaker labels and generates the complete conversation. You can chain VibeVoice output with video generation nodes to add narration to AI-generated video in the same workflow.
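The multi-speaker workflow's input is a labeled dialogue script. A hypothetical sketch of building one in the "Speaker N:" style the page describes; the exact syntax expected by Floyo's VibeVoice nodes is an assumption, so adjust to whatever the workflow's text input documents:

```python
# Build a hypothetical multi-speaker script with "Speaker N:" labels.
# The label format is an assumption for illustration.
turns = [
    (1, "Welcome back to the show. Today we're talking open-source TTS."),
    (2, "Thanks for having me. Long-form audio is finally practical."),
    (1, "Let's start with the 90-minute claim. How does that work?"),
]

script = "\n".join(f"Speaker {n}: {text}" for n, text in turns)
print(script)
```

The whole labeled script is submitted as one text input; the model generates the full conversation as a single continuous sequence rather than clip-by-clip.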
Fair warning: VibeVoice is primarily optimized for English. Experimental multilingual support exists for German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish, but quality varies by language. The model was briefly taken down by Microsoft in September 2025 due to misuse concerns but was restored. Long-form generations (60+ minutes) require significant compute time. For short clips with maximum audio fidelity, Fish Audio S2 or MiniMax Speech 2.8 HD may produce higher-quality output.
Frequently Asked Questions
Common questions about running VibeVoice on Floyo.
You can start with Floyo's free pricing plan. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. VibeVoice is open-source under the MIT License, so there is no additional API cost beyond your Floyo plan.
Open Floyo in your browser, search "VibeVoice" in the template library, and pick either the single-speaker or multi-speaker workflow. Click Run, write your text, and generate. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup.
Microsoft Research, led by Zhiliang Peng and Jianwei Yu. VibeVoice-TTS was open-sourced August 25, 2025. VibeVoice-Realtime-0.5B followed December 3, 2025. VibeVoice-ASR launched January 21, 2026. The TTS paper was accepted as an Oral at ICLR 2026. Model weights are available on HuggingFace under the MIT License.
The single-speaker workflow generates narration in one consistent voice. The multi-speaker workflow generates conversations with up to 4 distinct speakers, each with their own voice identity. Multi-speaker includes natural turn-taking, pauses, and prosody shifts between speakers. Use single-speaker for voiceover and narration. Use multi-speaker for podcasts, dialogues, and audiobooks with multiple characters.
VibeVoice leads on long-form multi-speaker generation (90 minutes, 4 speakers in one pass). Fish Audio S2 leads on inline emotion control (1,500+ free-form tags), language support (80+ languages), and short-clip audio quality. For podcast-length content, VibeVoice is the pick. For short, emotionally expressive clips, Fish Audio S2 has the edge. Both are available on Floyo.
Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate video with Wan 2.7, Hailuo, or Kling Omni, then add narration or multi-speaker dialogue with VibeVoice in the same workflow. The audio output pairs with your video for a complete multimedia package.
Yes. VibeVoice is released under the MIT License, which grants full commercial usage rights. You can use generated audio in products, marketing, client work, podcasts, audiobooks, and any other commercial context without additional licensing.
English is the primary language with the highest quality. Experimental multilingual support includes German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish. Quality varies by language. For multilingual production work, Fish Audio S2 (80+ languages) or MiniMax Speech 2.8 HD (40+ languages) offer broader and more consistent coverage.
Try VibeVoice on Floyo
Open-source long-form multi-speaker TTS. Up to 90 minutes, 4 speakers, voice cloning, MIT licensed. Run it in your browser.
| Try VibeVoice Now → | Browse All Models |
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from Microsoft Research VibeVoice GitHub, HuggingFace model cards (microsoft/VibeVoice-1.5B, microsoft/VibeVoice-Realtime-0.5B), arXiv:2508.19205, SEED benchmark results, and ICLR 2026 proceedings.
VibeVoice Text to Speech Single Speaker
VibeVoice Text to Speech Multi Speaker
Multi Model for Voice Conversion and Text to Speech
A TTS Audio Suite workflow that can use different types of audio models.


