VibeVoice Text to Speech Multi Speaker

Speech Multi Speaker

Multi Speaker

TTS

VibeVoice


floyoofficial

Nodes & Models

ComfyUI Official

LoadAudio

PreviewAudio

VibeVoice-ComfyUI

VibeVoiceMultipleSpeakersNode

Multi‑speaker VibeVoice uses Microsoft’s VibeVoice models to generate long dialogues in which several distinct voices talk, react, and take turns naturally within a single audio track.

Overview

VibeVoice is trained for multi‑speaker, long‑form text‑to‑speech, so it can handle multiple roles (host, guest, narrator, characters) in one pass while keeping each voice’s tone and rhythm consistent across the whole script. The model pairs a language model with a diffusion decoder and low‑rate speech tokens, which helps it capture context, pauses, and emphasis, so conversations sound less robotic and more like a real recorded session.

Who can use it

Multi‑speaker VibeVoice is useful for:

Podcast creators generating full episodes with host and guests entirely from a written script.
Audiobook and drama producers who need different character voices and dialogue scenes without hiring several actors.
E‑learning and corporate training teams building scenario‑based conversations, role‑plays, and simulations.
ComfyUI and AI video users who want multiple characters speaking in sync with talking avatars or story videos.

Use case

A typical setup is to write a script with speaker tags (for example, “Host: …”, “Guest: …”), choose or define a voice style for each speaker, and let VibeVoice generate the entire multi‑speaker track in one go, preserving each voice across the episode. Another use case is an educational dialogue: two or three AI “teachers” and “students” explain a topic, ask questions, and respond to each other, producing a single audio file that can be synced to slides or AI‑generated classroom scenes.
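As a rough illustration of the speaker-tagged script described above, the sketch below splits a plain-text script into (speaker, text) turns and numbers the speakers in order of first appearance. It is only a pre-processing helper under stated assumptions: the function names `parse_script` and `number_speakers` are hypothetical, the "Name: text" tag format is taken from the example in this section, and the exact input format expected by VibeVoiceMultipleSpeakersNode should be checked against that node's own documentation.

```python
import re
from typing import Dict, List, Tuple

# Assumed tag format: a speaker name, a colon, then the spoken text,
# for example "Host: Welcome back." (hypothetical convention).
TAG_PATTERN = re.compile(r"^\s*([A-Za-z][\w ]*?)\s*:\s*(.+)$")


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Split a tagged script into (speaker, text) turns.

    Lines without a speaker tag are treated as a continuation
    of the previous speaker's turn.
    """
    turns: List[Tuple[str, str]] = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines between turns
        match = TAG_PATTERN.match(line)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
        elif turns:
            speaker, text = turns[-1]
            turns[-1] = (speaker, text + " " + line.strip())
    return turns


def number_speakers(turns: List[Tuple[str, str]]) -> Dict[str, int]:
    """Assign each speaker a 1-based index in order of first appearance."""
    ids: Dict[str, int] = {}
    for speaker, _ in turns:
        if speaker not in ids:
            ids[speaker] = len(ids) + 1
    return ids


if __name__ == "__main__":
    demo = (
        "Host: Welcome to the show.\n"
        "Guest: Thanks for having me.\n"
        "It's great to be here.\n"
        "Host: Let's begin.\n"
    )
    for speaker, text in parse_script(demo):
        print(f"{speaker}: {text}")
```

Checking the turns before generation makes it easier to spot a misspelled tag, which would otherwise show up as an extra speaker. One limitation of this simple pattern is that any untagged line containing a colon after a few plain words would be read as a new speaker, so narration with colons may need a stricter tag convention.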
