floyo logo
Powered by
ThinkDiffusion
Pricing
Wan 2.7 is now live. Check it out 👉🏼

Qwen3 ASR via TTS Audio Suite for SRT Builder

Transcribe any audio file to text and timed SRT subtitles using Qwen3's speech recognition engine. Upload your audio, get broadcast-ready captions back.

Nodes & Models

LoadAudio
SRTAdvancedOptionsNode
MarkdownNote
Qwen3TTSEngineNode
WorkflowGraphics
UnifiedASRTranscribeNode
PreviewAny
TextToSRTBuilderNode

Turn any audio file into a full transcript and timed SRT subtitle file.

Upload an audio file, and Qwen3's speech recognition engine transcribes it to text with word-level timing. The SRT Builder then formats that transcript into broadcast-standard subtitle cues with smart line breaks, proper duration limits, and clean punctuation. You get a preview of the raw transcript, the formatted SRT output, and timing data.
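For reference, here is a minimal sketch of what a timed SRT file looks like when built from (start, end, text) cues. The function names `format_timestamp` and `build_srt` are hypothetical and are not part of the workflow's nodes; the sketch only illustrates the standard SRT layout (cue number, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, then the text).

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(cues: list[tuple[float, float, str]]) -> str:
    """Serialize (start, end, text) cues into SRT text.

    Each cue becomes a numbered block; blocks are separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_srt([(0.0, 1.5, "Hello"), (2.0, 3.25, "World")]))
```

Most video editors read this layout directly, which is why the SRT output can be dropped into an editing timeline as is.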

Two core nodes, one run. Load audio, hit run, get subtitles.

How do you transcribe audio to SRT subtitles with Qwen3?

Upload your audio file and run the workflow. Qwen3's 0.6B speech recognition model transcribes the audio with word-level timing, then the SRT Builder formats everything into timed subtitle cues. The forced aligner is on by default, which gives you accurate word timestamps for clean SRT output.
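To make the step from word-level timing to subtitle cues concrete, here is a rough sketch that groups timed words into cues, closing a cue at sentence-ending punctuation or when it would exceed a character budget. This is an illustration under stated assumptions, not the SRT Builder's actual algorithm: the `Word` type, `group_words`, and the 84-character budget (two lines of 42) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


SENTENCE_END = (".", "?", "!")


def _close(words: list[Word]) -> tuple[float, float, str]:
    """Turn a run of words into one (start, end, text) cue."""
    return (words[0].start, words[-1].end, " ".join(w.text for w in words))


def group_words(words: list[Word], max_chars: int = 84) -> list[tuple[float, float, str]]:
    """Group timed words into cues.

    A cue is closed when a word ends a sentence, or when adding the next
    word would push the cue text past max_chars (assumed budget).
    """
    cues = []
    current: list[Word] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(_close(current))
            current = []
        current.append(word)
        if word.text.endswith(SENTENCE_END):
            cues.append(_close(current))
            current = []
    if current:
        cues.append(_close(current))
    return cues
```

Each cue takes its start time from its first word and its end time from its last word, which is why accurate word timestamps matter for clean output.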

Audio Input: MP3, WAV, and other common audio formats work. Drop your file into the Load Audio node.

Language: Set to Auto by default. The model detects the spoken language on its own. If detection gets it wrong, set the language manually.

Task: Set to "transcribe" by default. This gives you a transcript in the original spoken language. Switch to "translate" if you need the output in a different language. The target language defaults to English, but you can change it in the engine settings.

SRT Preset: Defaults to "Broadcast." This sets line length, duration, and reading speed to broadcast subtitle standards (42 characters per line, 2 lines max, 17 characters per second). Good for most use cases out of the box.

SRT Mode: Set to "smart" by default. This uses punctuation and sentence structure to decide where to break subtitle cues. The result reads more naturally than fixed-length splits.

Max Characters Per Line: Default is 42. Want shorter lines for mobile or social media? Try 28 to 32. Need more room for longer words or translations? Go up to 50.

Max Duration / Min Duration: Defaults are 6 seconds max, 1 second min. These control how long a single subtitle cue stays on screen. Shorter max durations mean more frequent cue changes, which can work better for fast-paced content.

Min Gap: Default is 0.6 seconds. This is the pause between subtitle cues. Lower values pack cues tighter. Higher values give the viewer more breathing room between lines.

Forced Aligner: On by default. This is what gives you accurate word-level timestamps for SRT output. If you only need a raw transcript without timing, you can turn it off to speed things up.
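If you want to sanity-check a cue against the Broadcast numbers above (42 characters per line, 2 lines max, 17 characters per second), a small sketch like the following can help. The constants mirror the preset values stated on this page; the helper names `wrap_cue` and `fits_preset` are hypothetical, and the wrapping uses Python's standard `textwrap` rather than whatever the SRT Builder does internally.

```python
import textwrap

MAX_CHARS_PER_LINE = 42    # Broadcast preset line length
MAX_LINES = 2              # Broadcast preset line count
MAX_CHARS_PER_SECOND = 17  # Broadcast preset reading speed


def wrap_cue(text: str, max_chars: int = MAX_CHARS_PER_LINE) -> list[str]:
    """Break cue text into lines no longer than max_chars, splitting at spaces."""
    return textwrap.wrap(text, width=max_chars)


def fits_preset(text: str, start: float, end: float) -> bool:
    """Check a cue against the line-count and reading-speed limits."""
    duration = end - start
    if duration <= 0:
        return False
    chars_per_second = len(text) / duration
    return (
        len(wrap_cue(text)) <= MAX_LINES
        and chars_per_second <= MAX_CHARS_PER_SECOND
    )
```

For example, a 12-character cue shown for one second reads at 12 characters per second and passes, while a 44-character cue shown for one second fails the reading-speed check.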

Most of the SRT advanced options are post-processing. That means you can tweak line length, duration, and merge settings without re-running the transcription. Change a value, hit run, and it applies from cache.
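As an illustration of why these settings are cheap to change, here is a sketch of timing post-processing applied to cues that already exist. It works only on (start, end, text) tuples and never touches the audio. The function `adjust_timing` is hypothetical, and the rule that the gap wins over the minimum duration is an assumption of this sketch, not documented behavior of the SRT Builder. The defaults mirror the values listed above (1 second min, 6 seconds max, 0.6 second gap).

```python
def adjust_timing(
    cues: list[tuple[float, float, str]],
    min_duration: float = 1.0,
    max_duration: float = 6.0,
    min_gap: float = 0.6,
) -> list[tuple[float, float, str]]:
    """Clamp cue durations and keep a minimum gap before the next cue."""
    adjusted = []
    for i, (start, end, text) in enumerate(cues):
        # Stretch short cues and trim long ones.
        duration = min(max(end - start, min_duration), max_duration)
        new_end = start + duration
        if i + 1 < len(cues):
            next_start = cues[i + 1][0]
            # Assumption: the gap takes priority over the minimum duration.
            new_end = min(new_end, next_start - min_gap)
        # Never let a cue end before it starts.
        new_end = max(new_end, start)
        adjusted.append((start, new_end, text))
    return adjusted
```

Because the input is just cached cue data, re-running this with a different `max_duration` or `min_gap` costs almost nothing compared with transcribing again.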

What is Qwen3 ASR good for?

Qwen3's speech recognition works well for transcribing interviews, podcasts, voiceovers, and video dialogue into timed subtitles. The SRT Builder handles the formatting so you get subtitle files that meet broadcast standards without manual timing adjustments.

This workflow fits content creators who need subtitles for video. Upload your voiceover or dialogue track, run it, and you have an SRT file ready to drop into your video editor. The broadcast preset handles line length, reading speed, and timing gaps for you.

It also works for podcast transcription where you need a clean text output. Turn off the forced aligner if you only need the transcript without SRT timing.

For long-form audio (lectures, full episodes), the chunked processing with overlap handles files longer than a few minutes. The default chunk size of 30 seconds with 2-second overlap keeps the transcript accurate across segment boundaries.
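To picture how 30-second chunks with a 2-second overlap cover a long file, here is a minimal sketch that computes the chunk windows. The function `chunk_windows` is hypothetical; how the workflow actually splits audio and merges text across the overlaps is not described on this page.

```python
def chunk_windows(
    total_seconds: float,
    chunk_seconds: float = 30.0,
    overlap_seconds: float = 2.0,
) -> list[tuple[float, float]]:
    """Return (start, end) windows that cover the audio with overlap.

    Each window starts chunk_seconds - overlap_seconds after the previous
    one, so neighboring windows share overlap_seconds of audio.
    """
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk size")
    step = chunk_seconds - overlap_seconds
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return windows
```

A 70-second file, for instance, yields three windows: 0 to 30, 28 to 58, and 56 to 70 seconds. Words spoken near a boundary fall inside two windows, so they are not cut in half by the split.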

The model has 0.6B parameters, so transcription is fast. For noisy audio or heavy accents, a larger model variant, if one is available, may give higher accuracy.

FAQ

What audio formats does Qwen3 ASR support?

The workflow accepts common audio formats including MP3 and WAV. Drop your file into the Load Audio node and it handles the rest. No format conversion needed on your end.

Do I need to change the SRT advanced options?

No. The defaults use the Broadcast preset with smart line breaking, which works for most subtitle use cases. All advanced options are optional. If you want to fine-tune line length or cue timing, those settings are there to adjust.

Can Qwen3 ASR translate audio to another language?

Yes. Switch the task from "transcribe" to "translate" in the transcription node. The default target is English, but you can change the target language in the engine settings.

Does changing SRT settings require re-running transcription?

No. Most SRT formatting options are post-processing. The workflow caches the transcription from your first run. Change a setting, hit run, and the new SRT output generates from cache without re-transcribing.

How do I run Qwen3 ASR online?

You can run Qwen3 ASR online through Floyo. No installation, no setup. Open the workflow in your browser, upload your audio, and hit run. Free to try.
