
LongCat AudioDiT for TTS

Turn text into spoken audio with LongCat AudioDiT 3.5B, Meituan's open-source diffusion TTS model. Clean voice quality in English and Chinese, no setup.


floyoofficial

Nodes & Models

ComfyUI Official

LongCatTTS

SaveAudioMP3

Turn written text into natural-sounding speech with LongCat AudioDiT 3.5B, an open-source TTS model from Meituan.

Paste your script, pick a few settings, and the model writes an MP3 you can download. Trained on a million hours of Chinese and English speech, so both languages sound right out of the box.

20 denoising steps to a finished clip. No reference audio needed.

How do you generate speech with LongCat AudioDiT 3.5B?

Type your script into the LongCat node, set steps to 20 and guidance to 4, then run. The model converts text straight into a waveform without going through mel-spectrograms, which keeps the output clean. You get an MP3 saved to your audio folder. Defaults are tuned for English and Chinese, so most scripts work without changes.

Text: What you want spoken. English and Chinese both work without any extra setup. Keep each generation under 30 seconds for the cleanest output. Longer scripts tend to drop or repeat words, so split long passages into multiple runs.

Steps: How many denoising passes the model takes. 20 is the sweet spot. Want faster previews while you're tweaking the script? Try 10 to 15. Need maximum fidelity for a finished take? Push to 25 or 30. Past 30 you stop hearing the difference.

Guidance strength: How tightly the model follows your text. Default is 4. Lower (2 to 3) gives looser, more natural-sounding delivery. Higher (5 to 7) tightens pronunciation but can flatten the voice. Most scripts land best between 3 and 5.

Guidance method: Two options. CFG is standard classifier-free guidance. Predictable and fine for most cases. APG is adaptive projection guidance, the method the LongCat team built specifically for this model. APG tends to produce cleaner audio with fewer artifacts on tricky text. Try APG first, fall back to CFG if you want more variation.

Seed: Leave on randomize while you're searching for a delivery you like. Once you find a good take, lock the seed so you can iterate on the script without losing the voice.
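If you keep your settings in a script or notes file, the five inputs above can be sketched as a small Python structure. This is only an illustration: `TTSSettings`, its field names, and the `preview` and `final_take` helpers are hypothetical and are not part of the LongCat node or any Floyo API. The ranges in the checks come straight from the guidance above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSSettings:
    """Hypothetical container mirroring the five inputs described above."""
    text: str
    steps: int = 20                  # suggested default
    guidance_strength: float = 4.0   # suggested default
    guidance_method: str = "apg"     # "apg" or "cfg"
    seed: Optional[int] = None       # None stands for "randomize"

    def warnings(self) -> List[str]:
        """Return notes for values outside the ranges discussed above."""
        notes = []
        if not self.text.strip():
            notes.append("text is empty")
        if self.steps < 10:
            notes.append("steps below 10 is under the suggested preview range")
        if self.steps > 30:
            notes.append("steps above 30 rarely changes what you hear")
        if not 2 <= self.guidance_strength <= 7:
            notes.append("guidance strength is outside the 2 to 7 range")
        if self.guidance_method not in ("apg", "cfg"):
            notes.append("guidance method should be 'apg' or 'cfg'")
        return notes


def preview(text: str) -> TTSSettings:
    """Fewer steps and a random seed while you are still editing the script."""
    return TTSSettings(text=text, steps=12)


def final_take(text: str, seed: int) -> TTSSettings:
    """More steps and a locked seed once you have found a delivery you like."""
    return TTSSettings(text=text, steps=28, seed=seed)
```

Calling `preview("Hello there.").warnings()` returns an empty list, while a setting such as `steps=40` produces a note instead of an error, since the guidance above is advice rather than a hard limit.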

What is LongCat AudioDiT 3.5B good for?

Generating clean speech from text in English and Chinese. The 3.5B model holds the top spot on the Seed benchmark for speaker similarity, beating both open and closed-source competitors. Use it for narration drafts, voiceover scratch tracks, audiobook clips, dialogue placeholders, and any workflow where you need spoken audio without booking talent.

Good fits: narration before you commit to a voice actor, audiobook chapter previews, podcast intros, dialogue placeholders for animation and game prototypes, and any bilingual project that needs both English and Mandarin in one pipeline.

When to use something else: this is the basic TTS variant, so it generates a clean default voice rather than cloning a specific speaker. If you need a particular voice, you want the voice cloning version with a reference audio input. For audio over 30 seconds, split your script into chunks and stitch the outputs.
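One way to split a long script into runs of under 30 seconds is to break on sentence endings and cap each chunk by an estimated word count. The sketch below assumes English text and a speaking rate of about 2.5 words per second. That rate is an assumption for illustration, not a measured property of the model, and `split_script` is a hypothetical helper. Chinese text would need a character-based estimate instead.

```python
import re
from typing import List

WORDS_PER_SECOND = 2.5  # assumption: roughly 150 words per minute of narration
MAX_SECONDS = 30.0      # the per-generation limit suggested above


def split_script(script: str,
                 max_seconds: float = MAX_SECONDS,
                 words_per_second: float = WORDS_PER_SECOND) -> List[str]:
    """Group whole sentences into chunks whose estimated duration stays under max_seconds."""
    max_words = int(max_seconds * words_per_second)
    # Split after ., ! or ? when followed by whitespace, so sentences stay intact.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]

    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the word budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than the budget is kept whole as its own chunk.
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then go into its own generation. Splitting on sentence boundaries keeps the joins at natural pauses, which should make the stitched result less noticeable.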

FAQ

What is LongCat AudioDiT 3.5B?

LongCat AudioDiT is an open-source text-to-speech model from Meituan with 3.5 billion parameters. It's a non-autoregressive diffusion model that generates speech directly in the waveform latent space, skipping the mel-spectrogram step most TTS pipelines rely on. Released under the MIT license.

What languages does LongCat AudioDiT support?

English and Chinese, both natively. The model was trained on roughly a million hours of speech split between the two. Other languages may produce intelligible output, but English and Chinese are where it sounds best.

How long can LongCat AudioDiT generate audio?

The model can produce up to 60 seconds in a single run, but quality drops past 30 seconds. Words start dropping or repeating. For finished output, keep each generation between 15 and 30 seconds and concatenate clips for longer scripts.
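If you download the clips and want to join them locally, one option is ffmpeg's concat demuxer. The sketch below only builds the list file contents and the command. It assumes ffmpeg is installed and that all clips share the same sample rate and channel layout. The file names are hypothetical, and this is not something the workflow does for you.

```python
from typing import List, Sequence


def concat_list_text(clips: Sequence[str]) -> str:
    """Build the text of an ffmpeg concat list file, one clip per line, in playback order."""
    lines = []
    for clip in clips:
        # Inside single quotes, ffmpeg expects a literal quote written as '\''.
        escaped = clip.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, output_path: str) -> List[str]:
    """Return the ffmpeg arguments that join the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


# Example (not run here): write the list, then hand the command to subprocess.
#   from pathlib import Path
#   import subprocess
#   Path("clips.txt").write_text(concat_list_text(["part1.mp3", "part2.mp3"]))
#   subprocess.run(concat_command("clips.txt", "full.mp3"), check=True)
```

Stream copy is fast because nothing is re-encoded. If you hear clicks or gaps at the joins, re-encoding the joined file is the usual fallback.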

Should I use CFG or APG guidance with LongCat AudioDiT?

APG is the method the LongCat team designed for this model and it usually produces cleaner audio with fewer artifacts. CFG is the standard alternative and works fine for most text. Start with APG. Switch to CFG if you want more variation between seeds or APG sounds too rigid for your script.
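For the curious, here is roughly how the two guidance methods differ, written as a simplified sketch on plain Python lists. It follows the general form of classifier-free guidance and of projection-based guidance as usually described. It leaves out refinements such as momentum and norm rescaling, and it is not the LongCat node's actual implementation, which is not shown on this page. The function names and the `eta` parameter are assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cfg(cond: Vector, uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: push the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def apg(cond: Vector, uncond: Vector, scale: float, eta: float = 0.0) -> Vector:
    """Simplified projection-based guidance.

    The guidance direction (cond - uncond) is split into a part parallel to the
    conditional prediction and a part orthogonal to it. The parallel part is
    down-weighted by eta, while the orthogonal part gets the full scale.
    """
    diff = [c - u for c, u in zip(cond, uncond)]
    denom = dot(cond, cond)
    coeff = dot(diff, cond) / denom if denom else 0.0
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    return [c + (scale - 1) * (o + eta * p)
            for c, o, p in zip(cond, orthogonal, parallel)]
```

With `eta=1.0` the two functions return the same result, so in this sketch the projection-based version is plain CFG with the parallel component turned down. That matches the intuition above: the same guidance strength, with less of the push that tends to cause artifacts.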

How do you run LongCat AudioDiT online?

You can run LongCat AudioDiT online through Floyo. No installation, no setup. Open the workflow in your browser, paste your text, and hit run. Free to try.
