LongCat AudioDiT for TTS
Turn text into spoken audio with LongCat AudioDiT 3.5B, Meituan's open-source diffusion TTS model. Clean voice quality in English and Chinese, no setup.
Tags: audiodit, audio generation, longcat, text to speech, tts
Turn written text into natural-sounding speech with LongCat AudioDiT 3.5B, an open-source TTS model from Meituan.
Paste your script, pick a few settings, and the model writes an MP3 you can download. It was trained on roughly a million hours of Chinese and English speech, so both languages sound natural out of the box.
A finished clip takes 20 denoising steps at the default settings. No reference audio needed.
How do you generate speech with LongCat AudioDiT 3.5B?
Type your script into the LongCat node, set steps to 20 and guidance strength to 4, then run. The model generates speech in a waveform latent space instead of going through mel-spectrograms, which keeps the output clean. You get an MP3 saved to your audio folder. Defaults are tuned for English and Chinese, so most scripts work without changes.
Text: What you want spoken. English and Chinese both work without any extra setup. Keep each generation under 30 seconds for the cleanest output. Longer scripts tend to drop or repeat words, so split long passages into multiple runs.
Steps: How many denoising passes the model takes. 20 is the sweet spot. Want faster previews while you're tweaking the script? Try 10 to 15. Need maximum fidelity for a finished take? Push to 25 or 30. Past 30 you stop hearing the difference.
Guidance strength: How tightly the model follows your text. Default is 4. Lower (2 to 3) gives looser, more natural-sounding delivery. Higher (5 to 7) tightens pronunciation but can flatten the voice. Most scripts land best between 3 and 5.
Guidance method: Two options. CFG is standard classifier-free guidance, predictable and fine for most cases. APG is adaptive projection guidance, the method the LongCat team adopted for this model in place of CFG. APG tends to produce cleaner audio with fewer artifacts on tricky text. Try APG first, and fall back to CFG if you want more variation.
Seed: Leave on randomize while you're searching for a delivery you like. Once you find a good take, lock the seed so you can iterate on the script without losing the voice.
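If you drive the workflow from a script, the settings above can be captured in a small config object and sanity-checked before a run. This is a minimal sketch in Python: `TTSSettings` and `check_settings` are hypothetical names invented for illustration, not part of LongCat or any real API. Only the defaults and ranges come from this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TTSSettings:
    """Hypothetical container for the node inputs described in this guide."""
    text: str
    steps: int = 20                 # default from this guide
    guidance_strength: float = 4.0  # default from this guide
    guidance_method: str = "apg"    # "apg" or "cfg"
    seed: Optional[int] = None      # None means randomize on every run


def check_settings(s: TTSSettings) -> list[str]:
    """Return warnings; an empty list means the settings match the guide."""
    warnings = []
    if not s.text.strip():
        warnings.append("text is empty")
    if s.steps < 10:
        warnings.append("steps below 10: under the suggested preview range")
    elif s.steps > 30:
        warnings.append("steps above 30: extra passes are unlikely to be audible")
    if not 2 <= s.guidance_strength <= 7:
        warnings.append("guidance strength outside the 2 to 7 range")
    if s.guidance_method not in ("apg", "cfg"):
        warnings.append("guidance method must be 'apg' or 'cfg'")
    return warnings


# A quick preview while editing the script, then a locked-seed final take.
preview = TTSSettings(text="Hello from LongCat.", steps=12)
final = TTSSettings(text="Hello from LongCat.", steps=25, seed=1234)
print(check_settings(preview))
print(check_settings(final))
```

Keeping the seed as `None` until you like a take, then pinning it, mirrors the randomize-then-lock approach described above.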
What is LongCat AudioDiT 3.5B good for?
Generating clean speech from text in English and Chinese. According to Meituan, the 3.5B model holds the top spot on the Seed benchmark for speaker similarity, ahead of both open-source and closed-source competitors. Use it for narration drafts, voiceover scratch tracks, audiobook clips, dialogue placeholders, and any workflow where you need spoken audio without booking talent.
Good fits: narration before you commit to a voice actor, audiobook chapter previews, podcast intros, dialogue placeholders for animation and game prototypes, and any bilingual project that needs both English and Mandarin in one pipeline.
When to use something else: this is the basic TTS variant, so it generates a clean default voice rather than cloning a specific speaker. If you need a particular voice, you want the voice cloning version with a reference audio input. For audio over 30 seconds, split your script into chunks and stitch the outputs.
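Splitting a long script into chunks can be automated. The sketch below cuts at sentence boundaries and packs sentences together until an estimated duration would pass 30 seconds. The speaking rates are rough assumptions, not measurements from the model, and `estimate_seconds` and `split_script` are hypothetical helper names. Tune the rates by ear.

```python
import re

# Assumed speaking rates, used only to estimate clip length.
WORDS_PER_SECOND = 2.5      # rough English narration pace (assumption)
CJK_CHARS_PER_SECOND = 4.0  # rough Mandarin pace (assumption)

# Split after Latin end punctuation followed by whitespace (so "3.5B" stays
# intact), or directly after Chinese end punctuation.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
_LATIN_WORD = re.compile(r"[A-Za-z0-9']+")


def estimate_seconds(text: str) -> float:
    """Estimate spoken duration from word and character counts."""
    words = len(_LATIN_WORD.findall(text))
    cjk_chars = len(_CJK_CHAR.findall(text))
    return words / WORDS_PER_SECOND + cjk_chars / CJK_CHARS_PER_SECOND


def split_script(text: str, max_seconds: float = 30.0) -> list[str]:
    """Greedily pack whole sentences into chunks under max_seconds.

    A single sentence longer than max_seconds is kept as its own chunk
    rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in _SENTENCE_SPLIT.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Sentences are rejoined with a space, including Chinese ones.
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "LongCat has 3.5B parameters. It is open. 你好。世界！"
for i, chunk in enumerate(split_script(script, max_seconds=2.5), start=1):
    print(i, chunk)
```

Run each chunk as its own generation with the same locked seed, then join the clips in your audio editor.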
FAQ
What is LongCat AudioDiT 3.5B? LongCat AudioDiT is an open-source text-to-speech model from Meituan with 3.5 billion parameters. It's a non-autoregressive diffusion model that generates speech directly in the waveform latent space, skipping the mel-spectrogram step most TTS pipelines rely on. It is released under the MIT license.
What languages does LongCat AudioDiT support? English and Chinese, both natively. The model was trained on roughly a million hours of speech split between the two. Other languages may produce intelligible output, but English and Chinese are where it sounds best.
How long can LongCat AudioDiT generate audio? The model can produce up to 60 seconds in a single run, but quality drops past 30 seconds. Words start dropping or repeating. For finished output, keep each generation between 15 and 30 seconds and concatenate clips for longer scripts.
Should I use CFG or APG guidance with LongCat AudioDiT? APG is the guidance method the LongCat team chose for this model, and it usually produces cleaner audio with fewer artifacts. CFG is the standard alternative and works fine for most text. Start with APG. Switch to CFG if you want more variation between seeds or APG sounds too rigid for your script.
How do you run LongCat AudioDiT online? You can run LongCat AudioDiT online through Floyo. No installation, no setup. Open the workflow in your browser, paste your text, and hit run. Free to try.
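For the CFG versus APG question, a toy numeric sketch shows how the two guidance rules differ. CFG pushes the prediction along the full difference between the conditional and unconditional outputs. Projection-style guidance splits that difference into a part parallel to the conditional prediction and a part orthogonal to it, then down-weights the parallel part. This is a simplified illustration on plain Python lists. It leaves out refinements such as momentum and norm rescaling, and whether LongCat's implementation matches it is an assumption. The function names and the `eta` parameter are illustrative.

```python
def cfg(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def projected_guidance(cond, uncond, scale, eta=0.0):
    """Simplified projection-style guidance.

    eta scales the component of (cond - uncond) that is parallel to the
    conditional prediction. eta=1 recovers plain CFG; eta=0 keeps only
    the orthogonal component.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    norm_sq = sum(c * c for c in cond)
    # Projection coefficient of diff onto cond (0 if cond is the zero vector).
    coef = sum(d * c for d, c in zip(diff, cond)) / norm_sq if norm_sq else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (scale - 1.0) * g for c, g in zip(cond, update)]


cond = [1.0, 0.0]    # toy "conditional" prediction
uncond = [0.0, 1.0]  # toy "unconditional" prediction

print(cfg(cond, uncond, 4.0))
print(projected_guidance(cond, uncond, 4.0, eta=1.0))  # same as CFG
print(projected_guidance(cond, uncond, 4.0, eta=0.0))  # parallel part removed
```

With `eta=0.0` the output no longer grows along the direction of the conditional prediction as the guidance strength rises, which is the intuition for why this style of guidance can stay cleaner at higher strengths.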

