LongCat AudioDiT for Multi Speaker TTS

Clone two voices from short audio samples and generate dialogue between them with LongCat AudioDiT 3.5B. Upload your references, write your script, hit run.

audiodit

dialogue

longcat

multi-speaker

text to speech

voice cloning

floyoofficial

Nodes & Models

ComfyUI Official

LoadAudio

NormalizeAudioLoudness

Reroute

LongCatMultiSpeakerTTS

SaveAudioMP3

LongCat AudioDiT 3.5B turns a written script into a back-and-forth conversation between two voices you choose. Upload a short audio sample for each speaker, type your dialogue, and the model generates an MP3 in their voices.

Two speakers maximum. No training needed. The reference audio is the voice.

How do you generate multi-speaker dialogue with LongCat AudioDiT?

Upload a reference audio clip for each speaker, write your script with [speaker_1] and [speaker_2] tags before each line, then paste the matching text transcription of each reference clip. LongCat reads the dialogue, switches between voices at each tag, and outputs a single MP3 with both speakers in conversation.

Speaker reference audio (Speaker_1, Speaker_2)

Want a clean voice match? Use a 5 to 15 second clip of only the speaker talking. No music, no background noise, no overlapping voices. Each clip gets normalized to -23 LUFS automatically before the model sees it, so loudness differences between your two references won't bias the output.
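If you want to check a clip's length before uploading, here is a minimal sketch using Python's standard `wave` module. It assumes your reference is an uncompressed WAV file, since `wave` does not read MP3. The function names and the constants are illustrative and are not part of the workflow.

```python
import wave

MIN_SECONDS = 5.0   # lower end of the recommended clip length
MAX_SECONDS = 15.0  # upper end of the recommended clip length


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(seconds):
    """True when a clip falls inside the recommended 5 to 15 second window."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

Call `is_good_reference_length(wav_duration_seconds("speaker_1.wav"))` on each reference, where the file name is a placeholder for your own clip.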

Reference text (one for each speaker)

This is the exact word-for-word transcription of what's said in your reference audio. It anchors the model to the right voice tone and pacing. Get it wrong and you'll hear drift in the output.

Dialogue script

Each line goes on its own row, prefixed with the speaker tag and a colon:

[speaker_1]: Hello there.
[speaker_2]: Hey, how's it going?

The model uses those tags to switch voices.
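A quick pre-check can catch a mistyped tag before you spend a run on it. The sketch below validates a script against the format shown above. The pattern and the `parse_dialogue` name are assumptions for illustration. The workflow's own parser may be more lenient than this.

```python
import re

# Matches rows such as "[speaker_1]: Hello there."
LINE_PATTERN = re.compile(r"^\[(speaker_[12])\]:\s*(.+)$")


def parse_dialogue(script):
    """Split a script into (speaker, text) turns, rejecting malformed rows."""
    turns = []
    for number, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank rows
        match = LINE_PATTERN.match(line)
        if match is None:
            raise ValueError(
                f"row {number}: expected '[speaker_1]: text' or '[speaker_2]: text'"
            )
        turns.append((match.group(1), match.group(2).strip()))
    return turns
```

A row tagged `[speaker_3]`, a missing colon, or a tag with no text after it raises a `ValueError` naming the offending row.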

Pause after speaker (default: 0.4s)

Want a natural conversation rhythm? 0.4 works for most dialogue. Need faster back-and-forth for an argument or sitcom feel? Try 0.2. Want a slower podcast pace with breathing room? Push to 0.6 or 0.8.

Steps (default: 28)

More steps mean more refinement and slower output. 28 is a good balance. Drop to 20 for fast drafts. Push to 36 if you want to squeeze out more quality.

Guidance strength (default: 4)

Controls how closely the output follows your script. 4 keeps things faithful without sounding robotic. Lower values (2 to 3) give more natural variance. Higher values (5 to 6) lock to the script harder at the cost of expressiveness.

Seed

The same seed with the same inputs produces the same output. Useful when you want to compare two settings without the voice changing on you. Set to randomize for a fresh take every run.
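When comparing settings, it helps to change one value at a time and hold the seed fixed. The sketch below records the defaults listed above and derives a few presets from them. The field names, the preset names, and the seed value of 0 are hypothetical and do not correspond to the node's actual input names.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DialogueSettings:
    pause_after_speaker: float = 0.4  # seconds
    steps: int = 28
    guidance_strength: float = 4.0
    seed: int = 0  # hypothetical fixed value so runs stay comparable


DEFAULTS = DialogueSettings()
FAST_DRAFT = replace(DEFAULTS, steps=20)               # quicker, rougher
HIGH_QUALITY = replace(DEFAULTS, steps=36)             # slower, more refined
SNAPPY = replace(DEFAULTS, pause_after_speaker=0.2)    # faster back-and-forth


def changed_fields(a, b):
    """Return the names of settings that differ between two runs."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]
```

If `changed_fields` reports more than one name, an audible difference between two runs cannot be attributed to a single setting.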

What is LongCat AudioDiT good for?

LongCat AudioDiT is built for short-form dialogue between two voices: podcast intros, scripted explainer videos, audiobook character lines, video game NPC banter, language learning conversations, and rough cuts where you need real-sounding voices before booking studio time with the actual talent.

The voice cloning is what makes this useful. Clone a co-host's voice from a 10-second clip and they can read tomorrow's intro while they're on vacation. Clone two characters and rough out a dialogue scene before paying for a recording session.

The catch: two speakers max. If you need three or more voices in one file, generate them in separate runs and stitch the clips together in your editor. It also performs best on conversational text. On long monologues or jargon-heavy technical text, the output can wander.
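One way to plan those separate runs is to cut the script wherever a third voice would enter. The sketch below is a simple greedy split, not something the workflow does for you. It assumes the script is already a list of (speaker name, text) pairs, and both function names are illustrative.

```python
def split_into_two_speaker_runs(turns):
    """Greedily group (speaker, text) turns so each run has at most two voices."""
    runs = []
    current = []
    voices = []  # distinct speakers in the current run, in order of appearance
    for speaker, text in turns:
        if speaker not in voices and len(voices) == 2:
            # A third voice would enter here, so close the current run.
            runs.append(current)
            current, voices = [], []
        if speaker not in voices:
            voices.append(speaker)
        current.append((speaker, text))
    if current:
        runs.append(current)
    return runs


def to_tagged_script(run):
    """Render one run with [speaker_1]/[speaker_2] tags and return the voice map."""
    mapping = {}
    lines = []
    for speaker, text in run:
        tag = mapping.setdefault(speaker, f"speaker_{len(mapping) + 1}")
        lines.append(f"[{tag}]: {text}")
    return "\n".join(lines), mapping
```

The returned mapping tells you which reference clip to upload into each speaker slot for that run. The resulting clips still need to be joined in your editor.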

Doing a single-voice narration? A regular TTS workflow will be faster.

FAQ

What's the best reference audio length for LongCat AudioDiT?

5 to 15 seconds is the sweet spot. Long enough to capture vocal character, short enough to keep processing fast. Use a clean recording of one person talking with no music, no background noise, and no overlapping voices. The cleaner the input, the cleaner the clone.

How many speakers can LongCat AudioDiT handle in one run?

Two. The workflow is wired for [speaker_1] and [speaker_2] dialogue. If you need three or more voices in a single audio file, generate them in separate runs with different reference pairs and stitch the clips together in your editor.

Do I need a written transcript of my reference audio?

Yes. Each speaker needs both the reference audio and the matching text of what's said in that clip. The transcription helps the model lock onto voice characteristics, pacing, and tone. Type it word-for-word or you'll hear drift in the output.

Why does my LongCat AudioDiT output sound off?

Most often it's the reference audio. Background music, multiple voices in the clip, or a reference text that doesn't match the recording word-for-word will all cause weird results. Re-record with cleaner audio, and match the reference text to the recording exactly.

How do you run LongCat AudioDiT online?

You can run LongCat AudioDiT online through Floyo. No installation, no setup. Open the workflow in your browser, upload your inputs, and hit run. Free to try.
