Qwen3 ASR via TTS Audio Suite for SRT Builder

Transcribe any audio file to text and timed SRT subtitles using Qwen3's speech recognition engine. Upload your audio, get broadcast-ready captions back.

audio transcription

qwen

speech to text

SRT

subtitle generation


floyoofficial

Nodes & Models

ComfyUI Official

LoadAudio

SRTAdvancedOptionsNode

MarkdownNote

Qwen3TTSEngineNode

WorkflowGraphics

UnifiedASRTranscribeNode

PreviewAny

TextToSRTBuilderNode

Turn any audio file into a full transcript and timed SRT subtitle file.

Upload an audio file, and Qwen3's speech recognition engine transcribes it to text with word-level timing. The SRT Builder then formats that transcript into broadcast-standard subtitle cues with smart line breaks, proper duration limits, and clean punctuation. You get a preview of the raw transcript, the formatted SRT output, and timing data.

Two nodes, one run. Load audio, hit run, get subtitles.

How do you transcribe audio to SRT subtitles with Qwen3?

Upload your audio file and run the workflow. Qwen3's 0.6B speech recognition model transcribes the audio with word-level timing, then the SRT Builder formats everything into timed subtitle cues. The forced aligner is on by default, which gives you accurate word timestamps for clean SRT output.
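To make the transcript-to-SRT step concrete, here is a minimal sketch of how word-level timestamps can be turned into SRT cues. It is not the SRT Builder's actual implementation: the function names, the `(text, start, end)` word tuple shape, and the simple length-based grouping are assumptions made for illustration. The workflow's "smart" mode uses punctuation and sentence structure instead.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=42):
    """Group (text, start, end) word tuples into cues and render SRT text.

    Hypothetical helper: a new cue starts whenever adding the next word
    would push the cue text past max_chars.
    """
    cues, current = [], []
    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue[0][1])   # start of the first word
        end = format_timestamp(cue[-1][2])    # end of the last word
        text = " ".join(w[0] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [("Hello", 0.0, 0.4), ("world", 0.5, 0.9)]
    print(words_to_srt(sample))
```

Each cue is a sequence number, a `start --> end` line, and the cue text, with a blank line between cues. That is the layout video editors expect from an SRT file.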

Audio Input: Any audio file works. MP3, WAV, whatever you have. Drop it into the Load Audio node.

Language: Set to Auto by default. The model detects the spoken language on its own. If detection gets it wrong, set the language manually.

Task: Set to "transcribe" by default. This gives you a transcript in the original spoken language. Switch to "translate" if you need the output in a different language. The target language defaults to English, but you can change it in the engine settings.

SRT Preset: Defaults to "Broadcast." This sets line length, duration, and reading speed to broadcast subtitle standards (42 characters per line, 2 lines max, 17 characters per second). Good for most use cases out of the box.

SRT Mode: Set to "smart" by default. This uses punctuation and sentence structure to decide where to break subtitle cues. The result reads more naturally than fixed-length splits.

Max Characters Per Line: Default is 42. Want shorter lines for mobile or social media? Try 28 to 32. Need more room for longer words or translations? Go up to 50.

Max Duration / Min Duration: Defaults are 6 seconds max, 1 second min. These control how long a single subtitle cue stays on screen. Shorter max durations mean more frequent cue changes, which can work better for fast-paced content.

Min Gap: Default is 0.6 seconds. This is the pause between subtitle cues. Lower values pack cues tighter. Higher values give the viewer more breathing room between lines.

Forced Aligner: On by default. This is what gives you accurate word-level timestamps for SRT output. If you only need a raw transcript without timing, you can turn it off to speed things up.

Most of the SRT advanced options are post-processing. That means you can tweak line length, duration, and merge settings without re-running the transcription. Change a value, hit run, and it applies from cache.
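The defaults above can be read as a small set of constraints applied after transcription. The sketch below collects them in one place and checks a cue against them. The `SrtOptions` class, its field names, and the `wrap_cue` and `fits_cue` helpers are hypothetical and are not the node's real parameters. Only the numeric values come from the settings described above.

```python
import textwrap
from dataclasses import dataclass


@dataclass(frozen=True)
class SrtOptions:
    # Values taken from the Broadcast defaults; field names are assumptions.
    max_chars_per_line: int = 42
    max_lines: int = 2
    max_chars_per_second: float = 17.0
    max_duration: float = 6.0   # seconds a cue may stay on screen
    min_duration: float = 1.0
    min_gap: float = 0.6        # pause between consecutive cues, in seconds


def wrap_cue(text: str, opts: SrtOptions) -> list[str]:
    """Wrap cue text into lines no longer than max_chars_per_line."""
    return textwrap.wrap(text, width=opts.max_chars_per_line)


def fits_cue(text: str, duration: float, opts: SrtOptions) -> bool:
    """Check line count, on-screen duration, and reading speed for one cue."""
    lines = wrap_cue(text, opts)
    chars_per_second = len(text) / duration
    return (
        len(lines) <= opts.max_lines
        and opts.min_duration <= duration <= opts.max_duration
        and chars_per_second <= opts.max_chars_per_second
    )


if __name__ == "__main__":
    broadcast = SrtOptions()
    mobile = SrtOptions(max_chars_per_line=30)  # shorter lines for small screens
    cue = "The quick brown fox jumps over the lazy dog near the river bank"
    print(wrap_cue(cue, broadcast))
    print(wrap_cue(cue, mobile))
    print(fits_cue(cue, 4.0, broadcast))
```

Because these checks only need the cue text and timing, they can be re-applied with different option values without touching the audio again, which is why formatting changes do not require a new transcription.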

What is Qwen3 ASR good for?

Qwen3's speech recognition works well for transcribing interviews, podcasts, voiceovers, and video dialogue into timed subtitles. The SRT Builder handles the formatting so you get subtitle files that meet broadcast standards without manual timing adjustments.

This workflow fits content creators who need subtitles for video. Upload your voiceover or dialogue track, run it, and you have an SRT file ready to drop into your video editor. The broadcast preset handles line length, reading speed, and timing gaps for you.

It also works for podcast transcription where you need a clean text output. Turn off the forced aligner if you only need the transcript without SRT timing.

For long-form audio (lectures, full episodes), the chunked processing with overlap handles files longer than a few minutes. The default chunk size of 30 seconds with 2-second overlap keeps the transcript accurate across segment boundaries.
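As a rough illustration of chunking with overlap, the sketch below computes the time windows for a file of a given length using the 30-second chunk and 2-second overlap mentioned above. The `chunk_windows` function is hypothetical, and it assumes that each chunk starts 2 seconds before the previous one ends. The actual node may define and merge its chunks differently.

```python
def chunk_windows(total_seconds, chunk_seconds=30.0, overlap_seconds=2.0):
    """Return (start, end) windows covering the audio with overlapping edges."""
    if chunk_seconds <= 0 or not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("need chunk_seconds > 0 and 0 <= overlap < chunk_seconds")

    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the file
        start += step
    return windows


if __name__ == "__main__":
    for window in chunk_windows(70.0):
        print(window)
```

With the defaults, consecutive windows share 2 seconds of audio, so a word that straddles a boundary is heard in full by at least one chunk.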

The model has 0.6B parameters, so transcription is fast. For noisy audio or heavy accents, a larger model variant, if one is available, may improve accuracy.

FAQ

What audio formats does Qwen3 ASR support?

The workflow accepts common audio formats including MP3 and WAV. Drop your file into the Load Audio node and it handles the rest. No format conversion needed on your end.

Do I need to change the SRT advanced options?

No. The defaults use the Broadcast preset with smart line breaking, which works for most subtitle use cases. All advanced options are optional. If you want to fine-tune line length or cue timing, those settings are there to adjust.

Can Qwen3 ASR translate audio to another language?

Yes. Switch the task from "transcribe" to "translate" in the transcription node. The default target is English, but you can change the target language in the engine settings.

Does changing SRT settings require re-running transcription?

No. Most SRT formatting options are post-processing. The workflow caches the transcription from your first run. Change a setting, hit run, and the new SRT output generates from cache without re-transcribing.

How do I run Qwen3 ASR online?

You can run Qwen3 ASR online through Floyo. No installation, no setup. Open the workflow in your browser, upload your audio, and hit run. Free to try.
