
AI AUDIO GENERATION
Run VibeVoice on Floyo
Microsoft's open-source long-form multi-speaker TTS. Up to 90 minutes of speech with 4 distinct speakers. Natural turn-taking, voice cloning, and podcast-grade dialogue. MIT licensed.
Run Microsoft's VibeVoice through ComfyUI in your browser. No API key, no installs, no local GPU.
Max Duration: ~90 minutes
Speakers: Up to 4 distinct
Parameters: 1.5B
License: MIT (open source)
Try VibeVoice Now → | Browse All Models
No installation. Runs in browser. Updated April 2026.
What You Get
VibeVoice is Microsoft Research's open-source text-to-speech framework designed for long-form, multi-speaker conversational audio. It generates up to 90 minutes of natural speech with up to 4 distinct speakers, maintaining consistent voice identity and natural turn-taking throughout. It is built on continuous speech tokenizers running at an ultra-low 7.5 Hz frame rate paired with a next-token diffusion decoder, and supports voice cloning from short reference audio. Accepted as an Oral at ICLR 2026, MIT licensed, and available as ComfyUI nodes on Floyo in single-speaker and multi-speaker workflows.
VIBEVOICE WORKFLOWS ON FLOYO
What is VibeVoice?
VibeVoice is Microsoft Research's open-source TTS framework, first released in August 2025. It is purpose-built for long-form, multi-speaker conversational audio like podcasts, audiobooks, and e-learning narration. The 1.5B parameter model generates up to 90 minutes of continuous speech with up to 4 distinct speakers. It was accepted as an Oral presentation at ICLR 2026.
Most TTS models generate one speaker at a time in short clips. VibeVoice handles complete multi-speaker conversations in a single pass. Each speaker maintains a consistent voice identity across the entire duration. Turn-taking happens naturally: speakers interrupt, pause, respond, and overlap the way real people do in conversation.
The core innovation is the ultra-low frame rate tokenizer. VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) operating at 7.5 Hz. Most TTS systems tokenize at 50-100 Hz. The lower frame rate means fewer tokens per second of audio, which makes it computationally feasible to process sequences long enough for full podcast episodes or audiobook chapters without running out of context window.
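The context-window saving is simple arithmetic. A minimal sketch (7.5 Hz is VibeVoice's stated rate; the 50 Hz comparison is an assumption standing in for a typical codec-based tokenizer):

```python
# Back-of-envelope token counts for 90 minutes of audio at two
# tokenizer frame rates. Illustrative arithmetic only.
def tokens_for(duration_s: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover the given duration."""
    return round(duration_s * frame_rate_hz)

NINETY_MIN_S = 90 * 60  # 5,400 seconds

vibevoice_tokens = tokens_for(NINETY_MIN_S, 7.5)  # 40,500 tokens
typical_tokens = tokens_for(NINETY_MIN_S, 50.0)   # 270,000 tokens

print(vibevoice_tokens, typical_tokens)
```

At 7.5 Hz, a full 90-minute episode fits in roughly 40,500 frames, versus 270,000 at 50 Hz, which is what keeps podcast-length sequences inside a practical context window.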
Voice cloning works from a short reference audio sample. Provide a few seconds of someone speaking, and VibeVoice captures their timbre and speaking style. The cloned voice maintains consistency across the full output duration, even in 90-minute generations.
On Floyo, VibeVoice runs through native ComfyUI nodes on H100 NVL GPUs. Two workflows are available: single-speaker for narration, and multi-speaker for dialogues and podcasts. No model downloads, no Python environment, no local GPU required.
What are VibeVoice's technical specifications?
VibeVoice uses a next-token diffusion framework with a 1.5B parameter LLM backbone and continuous speech tokenizers at 7.5 Hz. It generates up to 90 minutes of speech with up to 4 speakers. The Realtime-0.5B variant achieves ~200-300ms time-to-first-audio for streaming applications. Both variants support voice cloning and multilingual output.
| Spec | Details |
|---|---|
| Developer | Microsoft Research |
| Architecture | Next-token diffusion (LLM backbone + diffusion head) |
| TTS Model | VibeVoice-1.5B (long-form, multi-speaker) |
| Realtime Model | VibeVoice-Realtime-0.5B (streaming, low-latency) |
| Tokenizer Frame Rate | 7.5 Hz (ultra-low, vs 50-100 Hz typical) |
| Max Duration | ~90 minutes (1.5B), ~10 minutes (Realtime-0.5B) |
| Max Speakers | 4 distinct speakers per generation |
| Context Window | Up to 65,536 tokens (curriculum expansion) |
| Voice Cloning | From short reference audio (voice prompting) |
| Latency (Realtime) | ~200-300ms time-to-first-audio |
| Languages | English (primary) + experimental multilingual (DE, FR, IT, JP, KR, NL, PL, PT, ES) |
| WER (Short Utterances) | 2.05% on SEED benchmark (Realtime-0.5B) |
| Emotion Detection | Auto-detects and expresses anger, excitement, sadness, and more |
| License | MIT License (full commercial rights) |
| ComfyUI Access | Native support on Floyo (2 workflows) |
| Release Date | August 25, 2025 (TTS), December 3, 2025 (Realtime), January 21, 2026 (ASR) |
What can you create with VibeVoice?
VibeVoice covers long-form narration, multi-speaker podcast generation, audiobook production, e-learning narration, character dialogue, and voice cloning. The single-speaker workflow handles narration and voiceover. The multi-speaker workflow handles dialogues, panels, and conversational content with up to 4 distinct voices and natural turn-taking.
| Capability | What It Does | Use Case |
|---|---|---|
| Long-Form Speech | Generate up to 90 minutes of continuous speech without quality degradation. Speaker identity stays locked across the full duration. | Audiobooks, lectures, documentation narration |
| Multi-Speaker Dialogue | Up to 4 distinct speakers in one generation. Natural turn-taking, consistent voice identity per speaker, and contextual prosody shifts. | Podcasts, interview simulations, panel discussions |
| Voice Cloning | Provide a short reference audio sample. VibeVoice captures timbre and speaking style and applies it to your text. Consistent across full output. | Brand voice, character consistency, personalized narration |
| Automatic Emotion | The model detects emotional context from the text and adjusts delivery automatically. Excitement, sadness, anger, and other emotions render without explicit tags. | Storytelling, dramatic narration, character dialogue |
| Single-Speaker Narration | Clean, consistent single-voice output for narration, voiceover, and documentation. Maintains natural prosody across long texts. | E-learning, tutorial narration, documentation, voiceover |
| Pipeline Integration | Chain with video models in ComfyUI. Generate video with Wan 2.7 or Hailuo, add narration or dialogue with VibeVoice in the same workflow. | Video production pipelines, multimedia content |
What are VibeVoice's key features?
VibeVoice's feature set is designed around one problem: generating long, multi-speaker audio that sounds like a real conversation. Most TTS models max out at single-speaker clips under a minute. VibeVoice generates 90-minute dialogues with 4 speakers. Every architectural decision targets this capability.
90-Minute Long-Form Generation
Most TTS models degrade after 30-60 seconds. VibeVoice generates up to 90 minutes of continuous speech without quality loss. The ultra-low 7.5 Hz tokenizer is the key: it produces fewer tokens per second of audio, keeping long sequences within the model's context window. A 90-minute podcast episode that would require thousands of separate clips with other models generates in one pass with VibeVoice.
4-Speaker Consistency
Four distinct speakers maintain consistent voice identities across the entire output. Speaker 1 sounds the same on minute 1 and minute 90. Turn-taking happens naturally: speakers respond to each other with appropriate timing, pauses, and prosody shifts. This is not separate clips stitched together. The model generates the full multi-speaker conversation as one continuous output.
7.5 Hz Ultra-Low Frame Rate Tokenizer
The core innovation. Most TTS systems tokenize audio at 50-100 Hz. VibeVoice uses continuous speech tokenizers (Acoustic and Semantic) at 7.5 Hz. This 7-13x reduction in token rate makes it computationally feasible to process sequences long enough for full podcast episodes. The tokenizers preserve audio fidelity despite the lower frame rate through continuous (not discrete) representations.
Next-Token Diffusion
The architecture combines an LLM backbone (for understanding text context and dialogue flow) with a diffusion head (for generating high-fidelity acoustic details). The LLM plans what to say and how to say it. The diffusion head handles the fine-grained audio synthesis. This hybrid approach produces more natural-sounding speech than pure autoregressive or pure diffusion methods.
Voice Prompting
Provide a short audio sample and VibeVoice captures the speaker's vocal characteristics. The cloned voice stays consistent across the full output duration. You can define each of the 4 speakers with a different reference, creating a complete cast of distinct characters for podcast or audiobook production.
MIT License
Fully open source under the MIT License. Model weights, inference code, and training details are available on HuggingFace and GitHub. Full commercial rights. No usage restrictions. You can deploy it, modify it, fine-tune it, and build commercial products with it.
How does VibeVoice compare to other TTS models?
VibeVoice's unique advantage is long-form multi-speaker generation (90 minutes, 4 speakers). No other open-source model matches this. Fish Audio S2 leads on inline emotion control and language count. MiniMax Speech 2.8 HD leads on audio fidelity and arena rankings. ElevenLabs leads on consumer ecosystem. VibeVoice is the pick when you need full podcast episodes or audiobook chapters generated in one pass.
| Model | Max Duration | Speakers | Open Source | License |
|---|---|---|---|---|
| VibeVoice | ~90 minutes | 4 | Yes | MIT |
| Fish Audio S2 | No fixed limit | Multi-speaker | Yes | Research License |
| MiniMax Speech 2.8 HD | Long scripts | 17+ preset voices | No | Commercial API |
| ElevenLabs | Long scripts | Multiple voices | No | Commercial API |
Source: Microsoft Research VibeVoice GitHub, HuggingFace model cards, SEED benchmark results, arXiv:2508.19205, and third-party reviews as of April 2026.
How does VibeVoice work?
VibeVoice uses a next-token diffusion framework. An LLM backbone (Qwen2.5 for the Realtime variant) processes the text and dialogue structure. A diffusion head generates high-fidelity acoustic tokens at each step. Continuous speech tokenizers (Acoustic and Semantic) operate at 7.5 Hz, compressing audio into a token-efficient representation that enables long-form generation.
The Acoustic tokenizer captures the fine-grained audio signal: timbre, pitch, breathiness, and vocal texture. The Semantic tokenizer captures the linguistic content: what words are being spoken and their prosodic structure. Both operate in continuous (not discrete) token spaces, which preserves more audio detail than traditional codec-based approaches at the same compression ratio.
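To put the 7.5 Hz rate in perspective, each tokenizer frame summarizes thousands of raw audio samples. A quick sketch, assuming a 24 kHz output sample rate (an assumption; the page does not state the rate):

```python
# Raw-audio samples summarized by one tokenizer frame, assuming a
# 24 kHz sample rate (assumption, not stated on this page).
sample_rate_hz = 24_000
frame_rate_hz = 7.5

samples_per_frame = sample_rate_hz / frame_rate_hz
print(samples_per_frame)  # 3200.0
```

Each continuous frame stands in for 3,200 raw samples, which is why preserving fidelity requires continuous rather than discrete representations.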
For multi-speaker generation, the model uses speaker-specific tokens that anchor each speaker's identity throughout the generation. The LLM backbone plans turn-taking, response timing, and emotional context. The diffusion head renders each speaker's audio with their established voice characteristics. Speaker transitions happen naturally because the model treats the conversation as a single continuous sequence.
On Floyo, VibeVoice runs through native ComfyUI nodes on H100 NVL GPUs. The single-speaker workflow takes a text input and generates narration. The multi-speaker workflow takes a dialogue script with speaker labels and generates the complete conversation. You can chain VibeVoice output with video generation nodes to add narration to AI-generated video in the same workflow.
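The multi-speaker workflow's input is a labeled dialogue script. A hypothetical sketch of building one in the "Speaker N:" style the page describes; the exact syntax expected by Floyo's VibeVoice nodes is an assumption, so adjust to whatever the workflow's text input documents:

```python
# Build a hypothetical multi-speaker script with "Speaker N:" labels.
# The label format is an assumption for illustration.
turns = [
    (1, "Welcome back to the show. Today we're talking open-source TTS."),
    (2, "Thanks for having me. Long-form audio is finally practical."),
    (1, "Let's start with the 90-minute claim. How does that work?"),
]

script = "\n".join(f"Speaker {n}: {text}" for n, text in turns)
print(script)
```

The whole labeled script is submitted as one text input; the model generates the full conversation as a single continuous sequence rather than clip-by-clip.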
Fair warning: VibeVoice is primarily optimized for English. Experimental multilingual support exists for German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish, but quality varies by language. The model was briefly taken down by Microsoft in September 2025 due to misuse concerns but was restored. Long-form generations (60+ minutes) require significant compute time. For short clips with maximum audio fidelity, Fish Audio S2 or MiniMax Speech 2.8 HD may produce higher-quality output.
Frequently Asked Questions
Common questions about running VibeVoice on Floyo.
You can start with Floyo's free pricing plan. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. VibeVoice is open-source under the MIT License, so there is no additional API cost beyond your Floyo plan.
Open Floyo in your browser, search "VibeVoice" in the template library, and pick either the single-speaker or multi-speaker workflow. Click Run, write your text, and generate. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup.
Microsoft Research, led by Zhiliang Peng and Jianwei Yu. VibeVoice-TTS was open-sourced August 25, 2025. VibeVoice-Realtime-0.5B followed December 3, 2025. VibeVoice-ASR launched January 21, 2026. The TTS paper was accepted as an Oral at ICLR 2026. Model weights are available on HuggingFace under the MIT License.
The single-speaker workflow generates narration in one consistent voice. The multi-speaker workflow generates conversations with up to 4 distinct speakers, each with their own voice identity. Multi-speaker includes natural turn-taking, pauses, and prosody shifts between speakers. Use single-speaker for voiceover and narration. Use multi-speaker for podcasts, dialogues, and audiobooks with multiple characters.
VibeVoice leads on long-form multi-speaker generation (90 minutes, 4 speakers in one pass). Fish Audio S2 leads on inline emotion control (1,500+ free-form tags), language support (80+ languages), and short-clip audio quality. For podcast-length content, VibeVoice is the pick. For short, emotionally expressive clips, Fish Audio S2 has the edge. Both are available on Floyo.
Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate video with Wan 2.7, Hailuo, or Kling Omni, then add narration or multi-speaker dialogue with VibeVoice in the same workflow. The audio output pairs with your video for a complete multimedia package.
Yes. VibeVoice is released under the MIT License, which grants full commercial usage rights. You can use generated audio in products, marketing, client work, podcasts, audiobooks, and any other commercial context without additional licensing.
English is the primary language with the highest quality. Experimental multilingual support includes German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, and Spanish. Quality varies by language. For multilingual production work, Fish Audio S2 (80+ languages) or MiniMax Speech 2.8 HD (40+ languages) offer broader and more consistent coverage.
Try VibeVoice on Floyo
Open-source long-form multi-speaker TTS. Up to 90 minutes, 4 speakers, voice cloning, MIT licensed. Run it in your browser.
| Try VibeVoice Now → | Browse All Models |
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from Microsoft Research VibeVoice GitHub, HuggingFace model cards (microsoft/VibeVoice-1.5B, microsoft/VibeVoice-Realtime-0.5B), arXiv:2508.19205, SEED benchmark results, and ICLR 2026 proceedings.
VibeVoice Text to Speech Single Speaker
VibeVoice Text to Speech Multi Speaker
Multi Model for Voice Conversion and Text to Speech
A TTS Audio Suite workflow that can use different types of audio models.


