Qwen3 ASR 1.7B - Speech to Text

Upload an audio file and Qwen3 ASR 1.7B transcribes it to text. Supports 52 languages and dialects, auto-detects the language, and handles noisy audio. No setup needed.

ASR

Audio to Text

qwen

STT

floyoofficial

Nodes & Models

ComfyUI Official

LoadAudio

WorkflowGraphics

Qwen3ASRTranscriber

PreviewAny

Automatic speech recognition with Qwen3 ASR 1.7B, one of the strongest open-source transcription models available.

Upload an audio file, pick your language (or leave it on auto), and hit Run. Qwen3 ASR 1.7B transcribes the speech to text. It handles 52 languages and dialects, identifies the language on its own, and copes with noisy recordings, background music, and even singing.

One input. One output. Text from audio, done.

How do you transcribe audio with Qwen3 ASR?

Upload your audio file, set the language to "auto" or pick a specific one, and run. Qwen3 ASR 1.7B identifies the language, transcribes everything it hears, and returns clean text. The defaults are set to work for most recordings, so the audio file is the only thing you need to provide.

Audio file: This is your only required input. Drop in an MP3, WAV, or other supported audio format. The model handles files up to 20 minutes long by chunking them automatically.

Language: Default is "auto," which means the model detects the language for you. It covers 30 languages and 22 Chinese dialects. If you know the language ahead of time, setting it explicitly can improve accuracy. For mixed-language audio, leave it on auto.

Max new tokens: Default is 256. This caps how many tokens the model generates per chunk. For short clips, the default is fine. For long recordings with dense speech, you might raise it so the model doesn't cut off mid-sentence.

Chunk size: Default is 30 (seconds). This sets the length of each audio segment the model processes. Shorter chunks use less memory but may split sentences awkwardly. Longer chunks give the model more context. 30 seconds works for most cases.

Overlap: Default is 2 (seconds). This is how much audio neighboring chunks share at their edges. It prevents words from getting lost at chunk boundaries. There is no need to change it unless you notice missing words at transitions.

Precision: Default is bf16. This balances speed and accuracy. Leave it unless you have a specific reason to switch.

Play with chunk size and overlap if you have long recordings with fast speech. For everything else, the defaults do the work.
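To make the interaction between chunk size and overlap concrete, here is a minimal Python sketch of how overlapping windows could be laid out over a recording. It is an illustration only: the function name `chunk_windows` is hypothetical, and the assumption that each chunk starts one stride (chunk size minus overlap) after the previous one is ours, not a description of the Qwen3ASRTranscriber node's internals.

```python
def chunk_windows(duration_s, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Assumption: consecutive chunks start `chunk_s - overlap_s` seconds apart,
    so neighbors share `overlap_s` seconds of audio at their edges.
    """
    if chunk_s <= 0:
        raise ValueError("chunk size must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")

    stride = chunk_s - overlap_s
    windows = []
    start = 0.0
    while start < duration_s:
        # Clip the final window to the end of the recording.
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start += stride
    return windows


# A 70-second clip with the default settings (30 s chunks, 2 s overlap).
for start, end in chunk_windows(70):
    print(f"{start:5.1f} s -> {end:5.1f} s")
```

With the defaults, each window begins 28 seconds after the previous one, so a word spoken near second 29 appears in both the first and second chunk. A larger overlap gives more protection at the seams at the cost of more chunks to process.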

What is Qwen3 ASR 1.7B good for?

Qwen3 ASR 1.7B is built for transcribing speech to text across many languages and tough audio conditions. It outperforms Whisper large-v3 on multiple benchmarks. Use it when you need accurate, multilingual transcription from a single model without managing separate pipelines.

Podcast transcription, meeting notes, subtitle generation, voiceover-to-text conversion. If you have audio with speech in it, this model turns it into text. It handles accented English, background noise, and even singing voices.

The auto language detection is useful when you process audio in batches and don't know the language of each file in advance. You skip the routing step and let the model figure it out.
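As a rough sketch of what skipping the routing step looks like in a batch script, the Python below sends every file through the same call with the language left on "auto" and groups the results by whatever language comes back. Everything here is an assumption for illustration: `transcribe_batch` and `fake_transcribe` are hypothetical names, and the `(detected_language, text)` return shape is not the documented output of the workflow. Swap in however you actually call the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical transcriber signature: (path, language) -> (detected_language, text).
Transcriber = Callable[[str, str], Tuple[str, str]]


def transcribe_batch(
    paths: Iterable[str],
    transcribe: Transcriber,
    language: str = "auto",
) -> Dict[str, List[Tuple[str, str]]]:
    """Run every file through one transcriber and group results by detected language."""
    by_language: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for path in paths:
        # No per-file routing: the same call handles every language.
        detected, text = transcribe(path, language)
        by_language[detected].append((path, text))
    return dict(by_language)


def fake_transcribe(path: str, language: str) -> Tuple[str, str]:
    """Stand-in for a real model call, with canned results so the sketch runs."""
    canned = {
        "intro.wav": ("English", "welcome to the show"),
        "interview.mp3": ("Spanish", "gracias por venir"),
        "outro.wav": ("English", "see you next week"),
    }
    return canned[path]


results = transcribe_batch(["intro.wav", "interview.mp3", "outro.wav"], fake_transcribe)
for lang, items in results.items():
    print(lang, [path for path, _ in items])
```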

The catch: this is transcription only. It does not translate. If you need audio in one language converted to text in another, you need a translation step after this workflow.

FAQ

How many languages does Qwen3 ASR 1.7B support? It supports 52 languages and dialects total: 30 languages plus 22 Chinese dialects. It also recognizes English accents from multiple countries and regions. Language detection is built in, so the model picks the right language from your audio automatically.

Can Qwen3 ASR handle long audio files? Yes. The model processes audio in chunks (default 30 seconds each with 2 seconds of overlap). This means it can handle recordings well beyond 20 minutes. You do not need to split the audio yourself before uploading.
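For a sense of scale, the chunk count for a given recording can be estimated with a little arithmetic. The sketch below assumes each chunk starts one stride (chunk size minus overlap) after the previous one. `chunk_count` is an illustrative name, not part of the workflow.

```python
import math


def chunk_count(duration_s, chunk_s=30, overlap_s=2):
    """Estimate how many overlapping chunks cover a recording.

    Assumption: chunks start every `chunk_s - overlap_s` seconds and the
    last chunk is simply shorter if the recording ends partway through it.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    if duration_s <= 0:
        return 0
    if duration_s <= chunk_s:
        return 1
    stride = chunk_s - overlap_s
    # One chunk covers the first `chunk_s` seconds; each further chunk adds one stride.
    return math.ceil((duration_s - chunk_s) / stride) + 1


print(chunk_count(5 * 60))   # 5-minute recording  -> 11
print(chunk_count(20 * 60))  # 20-minute recording -> 43
```

Under these assumptions a 20-minute recording becomes 43 chunks at the default settings, each with its own max new tokens budget.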

How does Qwen3 ASR compare to Whisper? Qwen3 ASR 1.7B outperforms Whisper large-v3 across multiple benchmarks, especially on Chinese, multilingual, noisy, and accented speech. On clean English, results are comparable. The main advantage is handling multiple languages and tough audio conditions in a single model.

What audio formats does this workflow accept? You can upload common formats like MP3 and WAV. The LoadAudio node handles conversion. If your file plays in a standard audio player, it should work here.

How do you run Qwen3 ASR speech to text online? You can run Qwen3 ASR speech to text online through Floyo. No installation, no setup. Open the workflow in your browser, upload your audio, and hit Run. Free to try.