LongCat AudioDiT for Multi-Speaker TTS
Clone two voices from short audio samples and generate dialogue between them with LongCat AudioDiT 3.5B. Upload your references, write your script, hit run.
audiodit
dialogue
longcat
multi-speaker
text to speech
voice cloning
Nodes & Models
LoadAudio
NormalizeAudioLoudness
Reroute
LongCatMultiSpeakerTTS
SaveAudioMP3
LongCat AudioDiT 3.5B turns a written script into a back-and-forth conversation between two voices you choose. Upload a short audio sample for each speaker, type your dialogue, and the model generates an MP3 in their voices.
Two speakers maximum. No training needed. The reference audio is the voice.
How do you generate multi-speaker dialogue with LongCat AudioDiT?
Upload a reference audio clip for each speaker, write your script with [speaker_1] and [speaker_2] tags before each line, then paste the matching text transcription of each reference clip. LongCat reads the dialogue, switches between voices at each tag, and outputs a single MP3 with both speakers in conversation.
Speaker reference audio (Speaker_1, Speaker_2)
Want a clean voice match? Use a 5 to 15 second clip of only the speaker talking. No music, no background noise, no overlapping voices. Each clip gets normalized to -23 LUFS automatically before the model sees it, so loudness differences between your two references won't bias the output.
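If you want to pre-check a clip before uploading, the length guideline and the loudness arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not the workflow's own code: the function names are hypothetical, it assumes the clip is a WAV file, and it only shows the gain calculation for a loudness value you have already measured with some other tool. The workflow handles normalization for you.

```python
import wave

MIN_SECONDS = 5.0    # lower bound suggested above
MAX_SECONDS = 15.0   # upper bound suggested above
TARGET_LUFS = -23.0  # loudness target named above


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file, given a path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(seconds: float) -> bool:
    """True when the clip falls inside the suggested 5 to 15 second window."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def gain_to_target_db(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Gain in dB that moves a clip from its measured loudness to the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to the linear factor applied to sample amplitudes."""
    return 10 ** (gain_db / 20)
```

For example, a clip measured at -17 LUFS needs `gain_to_target_db(-17.0)`, which is -6 dB, to land on the -23 LUFS target, and a quieter reference at -29 LUFS needs +6 dB. That is why two references recorded at different levels end up on equal footing.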
Reference text (one for each speaker)
This is the exact word-for-word transcription of what's said in your reference audio. It anchors the model to the right voice tone and pacing. Get it wrong and you'll hear drift in the output.
Dialogue script
Each line goes on its own row, prefixed with the speaker tag:
[speaker_1]: Hello there.
[speaker_2]: Hey, how's it going?
The model uses those tags to switch voices.
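If you assemble scripts programmatically, the tag format above is easy to generate and check. The sketch below is an illustration only: `build_script` and `parse_script` are hypothetical helpers, and the exact parsing rules inside the workflow are an assumption based on the two example lines.

```python
import re

ALLOWED_SPEAKERS = ("speaker_1", "speaker_2")  # the workflow supports two voices
LINE_PATTERN = re.compile(r"^\[(speaker_\d+)\]:\s*(.+)$")


def build_script(turns):
    """Format (speaker, text) pairs as one tagged row per turn."""
    lines = []
    for speaker, text in turns:
        if speaker not in ALLOWED_SPEAKERS:
            raise ValueError(f"unknown speaker: {speaker!r}")
        text = " ".join(text.split())  # collapse newlines so a turn stays on one row
        if not text:
            raise ValueError("empty line of dialogue")
        lines.append(f"[{speaker}]: {text}")
    return "\n".join(lines)


def parse_script(script):
    """Split a tagged script back into (speaker, text) pairs, rejecting bad rows."""
    turns = []
    for number, row in enumerate(script.splitlines(), start=1):
        row = row.strip()
        if not row:
            continue  # ignore blank rows
        match = LINE_PATTERN.match(row)
        if match is None or match.group(1) not in ALLOWED_SPEAKERS:
            raise ValueError(f"row {number} is not a valid tagged line: {row!r}")
        turns.append((match.group(1), match.group(2)))
    return turns
```

Running `parse_script` on a draft before pasting it in catches the most common slips: a missing tag, a third speaker, or a turn that wrapped onto a second row.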
Pause after speaker (default: 0.4s)
Want a natural conversation rhythm? 0.4 works for most dialogue. Need faster back-and-forth for an argument or sitcom feel? Try 0.2. Want a slower podcast pace with breathing room? Push to 0.6 or 0.8.
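To get a feel for how much the pause setting changes total runtime, here is a rough back-of-the-envelope sketch. It assumes one pause is inserted between consecutive turns and none after the last one, which the workflow does not document, and the 24000 Hz sample rate in the usage note is purely illustrative.

```python
def silence_samples(pause_seconds: float, sample_rate: int) -> int:
    """Number of silent samples that make up one pause at a given sample rate."""
    if pause_seconds < 0 or sample_rate <= 0:
        raise ValueError("pause must be >= 0 and sample rate must be > 0")
    return round(pause_seconds * sample_rate)


def total_pause_seconds(num_turns: int, pause_seconds: float = 0.4) -> float:
    """Total silence added to a script, assuming one pause between turns."""
    if num_turns < 0:
        raise ValueError("num_turns must be >= 0")
    return max(num_turns - 1, 0) * pause_seconds
```

Under those assumptions, a 10-turn exchange picks up about 3.6 seconds of silence at the default 0.4, about 1.8 seconds at 0.2, and about 7.2 seconds at 0.8. At an illustrative 24000 Hz, `silence_samples(0.4, 24000)` comes to 9600 samples per pause.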
Steps (default: 28)
More steps mean more refinement and slower output. 28 is a good balance. Drop to 20 for fast drafts. Push to 36 if you want to squeeze out more quality.
Guidance strength (default: 4)
Controls how closely the output follows your script. 4 keeps things faithful without sounding robotic. Lower values (2 to 3) give more natural variance. Higher values (5 to 6) lock to the script harder at the cost of expressiveness.
Seed
Same seed plus same inputs equals the same output. Useful when you want to compare two settings without the voice changing on you. Set to randomize for a fresh take every run.
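If you track your runs outside the UI, the settings above can be captured as small presets. This is a sketch for note-keeping, not the node's actual parameter schema: the class, field names, and preset names are hypothetical, and only the numeric values come from the descriptions above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DialogueSettings:
    steps: int = 28                   # default from the text
    guidance_strength: float = 4.0    # default from the text
    pause_after_speaker: float = 0.4  # seconds, default from the text
    seed: Optional[int] = None        # None stands in for "randomize"

    def __post_init__(self):
        if self.steps < 1:
            raise ValueError("steps must be at least 1")
        if self.pause_after_speaker < 0:
            raise ValueError("pause_after_speaker cannot be negative")


PRESETS = {
    "draft": DialogueSettings(steps=20),    # fast drafts
    "balanced": DialogueSettings(),         # documented defaults
    "quality": DialogueSettings(steps=36),  # slower, more refinement
}


def ab_pair(base: DialogueSettings, seed: int, **changes):
    """Two settings that share a seed and differ only in `changes`."""
    fixed = replace(base, seed=seed)
    return fixed, replace(fixed, **changes)
```

`ab_pair(PRESETS["balanced"], 1234, guidance_strength=3.0)` returns two configurations with the same seed, so any difference you hear between the two runs comes from the guidance change rather than from a new random draw.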
What is LongCat AudioDiT good for?
LongCat AudioDiT is built for short-form dialogue between two voices: podcast intros, scripted explainer videos, audiobook character lines, video game NPC banter, language learning conversations, and rough cuts where you need real-sounding voices before booking studio time with the actual talent.
The voice cloning is what makes this useful. Clone a co-host's voice from a 10-second clip and they can read tomorrow's intro while they're on vacation. Clone two characters and rough out a dialogue scene before paying for a recording session.
The catch: two speakers max. If you need three or more voices in one file, generate them in separate runs and stitch the clips together in your editor. It also performs best on conversational text. On long monologues or jargon-heavy passages, the output can drift.
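If you would rather stitch the separate runs with a script than in an editor, one option is ffmpeg's concat demuxer. The sketch below only builds the list file text and the command line; it assumes ffmpeg is installed, that the file names are your own, and that all clips share the same codec, sample rate, and channel layout so they can be joined with stream copy. If they differ, re-encode instead of using `-c copy`.

```python
def concat_list_text(clips):
    """Build the text of an ffmpeg concat-demuxer list file, one clip per line."""
    lines = []
    for clip in clips:
        # Inside single quotes, ffmpeg expects a literal quote written as '\''
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def ffmpeg_concat_command(list_file, output):
    """Argument list for joining the listed clips without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # allow paths outside the list file's directory
        "-i", str(list_file),
        "-c", "copy",     # stream copy: assumes matching encoding parameters
        str(output),
    ]
```

Write the result of `concat_list_text(["scene_a.mp3", "scene_b.mp3"])` to a text file, then pass the list returned by `ffmpeg_concat_command` to `subprocess.run`. The clips are joined in the order listed, so arrange them to match your script.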
Doing a single-voice narration? A regular TTS workflow will be faster.
FAQ
What's the best reference audio length for LongCat AudioDiT? 5 to 15 seconds is the sweet spot. Long enough to capture vocal character, short enough to keep processing fast. Use a clean recording of one person talking with no music, no background noise, and no overlapping voices. The cleaner the input, the cleaner the clone.
How many speakers can LongCat AudioDiT handle in one run? Two. The workflow is wired for [speaker_1] and [speaker_2] dialogue. If you need three or more voices in a single audio file, generate them in separate runs with different reference pairs and stitch the clips together in your editor.
Do I need a written transcript of my reference audio? Yes. Each speaker needs both the reference audio and the matching text of what's said in that clip. The transcription helps the model lock onto voice characteristics, pacing, and tone. Type it word-for-word or you'll hear drift in the output.
Why does my LongCat AudioDiT output sound off? Most often it's the reference audio. Background music, multiple voices in the clip, or a reference text that doesn't match the recording word-for-word will all cause weird results. Re-record with cleaner audio. Match the reference text to the recording exactly.
How do you run LongCat AudioDiT online? You can run LongCat AudioDiT online through Floyo. No installation, no setup. Open the workflow in your browser, upload your inputs, and hit run. Free to try.

