
AI SPEECH-TO-TEXT
Run Whisper STT on Floyo
OpenAI's speech-to-text model built for robust transcription across 99+ languages. Handles accents, background noise, and technical language. Transcribe or translate to English in one pass.
Run OpenAI's Whisper through ComfyUI in your browser. No API key, no installs, no local GPU.
Languages
99+
Training Data
680k+ hours
Tasks
Transcribe + translate
License
MIT (open source)
Try Whisper STT · Browse All Models
No installation. Runs in browser. Updated April 2026.
What You Get
Whisper STT is OpenAI's open-source speech-to-text model, delivered on Floyo through the AILab ComfyUI node. It's an encoder-decoder Transformer trained on 680,000+ hours of multilingual audio that handles transcription, speech translation to English, and language identification in one network. Robust on noisy, real-world audio. Supports 99+ languages with automatic language detection. Outputs plain text ready for subtitles, prompting, or logging in downstream nodes.
WHISPER STT WORKFLOWS ON FLOYO
What is Whisper STT?
Whisper is OpenAI's automatic speech recognition (ASR) model, first open-sourced in September 2022 and updated to large-v3 in November 2023. It's a general-purpose ASR model that handles multilingual speech recognition, speech translation to English, and language identification in a single neural network. On Floyo, it's packaged as the AILab Whisper STT node for ComfyUI workflows.
The architecture is an encoder-decoder Transformer. Audio input is resampled to 16kHz, split into 30-second chunks, converted to a log-Mel spectrogram, and passed through the encoder. The decoder predicts text tokens while using special tokens to control the task: language identification, phrase-level timestamps, transcription, or translation. One model, four jobs.
Whisper's strength is robustness. Training on 680,000+ hours of diverse multilingual audio (and 5 million hours for large-v3) means it handles accents, background noise, and technical jargon without fine-tuning. You can feed it a podcast, a conference recording, or a voice memo and get accurate transcription without pre-processing the audio.
The model comes in multiple sizes from 39M to 1.55B parameters. Larger models produce more accurate output. The AILab ComfyUI node exposes Whisper as a simple STT block: audio input, text output, ready for downstream nodes like subtitle generators, prompt builders, or logging systems.
On Floyo, Whisper STT runs through the native AILab node. Upload your audio (MP3, WAV, M4A, and other common formats), the node transcribes it, and the output text flows directly into your workflow. No API keys, no Python environment, no model download.
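The node hides all of this, but if you want to see what it wraps, the open-source openai-whisper Python package runs the same model in a few lines. A minimal local sketch (the file path is a placeholder, and the package is an assumption here, not Floyo's actual backend):

```python
# Minimal local sketch using the open-source openai-whisper package
# (pip install -U openai-whisper); assumes ffmpeg is on your PATH.
import whisper

model = whisper.load_model("large-v3")

# transcribe() handles decoding, 16kHz resampling, 30-second chunking,
# and automatic language detection internally.
result = model.transcribe("interview.mp3")  # placeholder file

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the plain-text transcription
```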
What are Whisper STT's technical specifications?
Whisper uses an encoder-decoder Transformer architecture trained on 680,000+ hours of multilingual audio (5 million hours for large-v3). It accepts audio resampled to 16kHz, splits it into 30-second chunks, and converts each chunk to a log-Mel spectrogram (80 bins for large-v2, 128 bins for large-v3). Output is plain text with optional timestamps. Model sizes range from tiny (39M params) to large-v3 (1.55B params).
| Spec | Details |
|---|---|
| Developer | OpenAI |
| Architecture | Encoder-decoder Transformer |
| Model Sizes | Tiny (39M), Base (74M), Small (244M), Medium (769M), Large (1.55B) |
| Latest Version | Large-v3 (November 2023) |
| Audio Input | 16kHz sample rate, 30-second chunks |
| Input Representation | Log-Mel spectrogram (128 Mel bins for large-v3) |
| Languages | 99+ supported (strong ASR in ~10 languages) |
| Tasks | Transcription, translation to English, language ID, timestamps |
| Training Data | 680k+ hours (original), 5M hours labeled + pseudo-labeled (large-v3) |
| Large-v3 Improvement | 10-20% error reduction vs large-v2 across languages |
| Audio Formats | MP3, WAV, M4A, FLAC, MP4, WEBM, MPGA |
| Output | Plain text STRING for downstream ComfyUI nodes |
| Timestamps | Phrase-level via special tokens |
| License | MIT License (full commercial rights) |
| ComfyUI Access | AILab Whisper STT node on Floyo |
| Initial Release | September 2022 (open source) |
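For reference, the preprocessing rows in this table map directly onto the open-source package's helpers. A sketch of the pipeline up to language identification (the path is a placeholder; n_mels is read from the loaded model, so large-v3 gets its 128 bins automatically):

```python
import whisper

model = whisper.load_model("large-v3")

# load_audio() decodes via ffmpeg and resamples to 16kHz mono;
# pad_or_trim() fits the waveform to one 30-second window.
audio = whisper.pad_or_trim(whisper.load_audio("clip.m4a"))  # placeholder file

# Log-Mel spectrogram: model.dims.n_mels is 128 for large-v3,
# 80 for the earlier model sizes.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Language identification over the first 30-second window.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # e.g. "es"
```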
What can you create with Whisper STT?
Whisper STT covers transcription for podcasts and interviews, subtitle generation, multilingual meeting notes, voice-to-prompt pipelines for image and video generation, translation of non-English audio to English text, and language identification. It is the backbone for any ComfyUI workflow that needs to turn spoken audio into written text, whether for captioning, search, archival, or as an input to another model.
| Capability | What It Does | Use Case |
|---|---|---|
| Multilingual Transcription | Automatic speech recognition in 99+ languages. Automatic language detection means no need to specify the source language upfront. | Podcasts, lectures, interviews, global content |
| Speech Translation | Transcribe non-English audio directly into English text. One-step pipeline instead of transcribe-then-translate. | International meeting notes, foreign-language research |
| Subtitle Generation | Produces text with phrase-level timestamps for subtitle files. Works on recorded voice, podcasts, tutorials, and film audio. | Video subtitles, accessibility captions, tutorial creation |
| Voice-to-Prompt | Convert spoken commands into text prompts for image, video, or audio generation models. Chain the output directly into ComfyUI prompt nodes. | Voice-driven workflows, hands-free prompting |
| Language Identification | Automatically detects the spoken language. Useful for routing multi-language content or filtering audio by language. | Content routing, audio filtering, batch processing |
| Noisy Audio Handling | Trained on diverse real-world audio, so it handles background noise, accents, and technical jargon without pre-processing. | Field recordings, conference audio, low-quality sources |
What are Whisper STT's key features?
Whisper's feature set is built around one idea: a single model that replaces the traditional multi-stage speech processing pipeline. Language detection, transcription, translation, and timestamp generation all happen in one forward pass through the network, controlled by special tokens in the decoder.
Multilingual Speech Recognition
Whisper supports 99+ languages out of the box. It shows strong ASR performance in about 10 languages (English, Chinese, German, Spanish, Russian, French, Portuguese, Korean, Japanese, Arabic) and usable performance in many more. No language-specific fine-tuning required. Upload audio and the model detects the language automatically.
Robust on Noisy Audio
Training on 680,000+ hours of diverse web audio (not curated studio recordings) means Whisper handles real-world conditions. Background noise, overlapping speakers, accents, technical language, and poor microphone quality all come out intelligible. You don't need to clean audio before transcription.
Translation to English
Whisper can transcribe and translate to English in one step. Feed it audio in Spanish, French, Chinese, or any supported language, and get English text out. This skips the usual transcribe-then-translate pipeline. The output quality won't beat dedicated translation models for literary work, but it is strong for meeting notes, summaries, and content understanding.
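In the open-source package this is a single argument, not a separate pipeline. A sketch (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("large-v3")

# task="translate" makes the decoder emit English text regardless of
# the source language, which is still detected automatically.
result = model.transcribe("spanish_meeting.mp3", task="translate")  # placeholder file
print(result["text"])  # English text from Spanish audio
```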
Phrase-Level Timestamps
Whisper can output timestamps alongside the text, marking when each phrase was spoken. This enables subtitle generation, audio search, and navigation through long recordings. The timestamps are at phrase level, not word level, so exact word timing requires post-processing with tools like WhisperX.
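Those phrase-level segments are enough to write a basic SRT file. A minimal sketch assuming the open-source package's output format, where each segment carries start and end times in seconds:

```python
import whisper

def srt_time(seconds: float) -> str:
    # SRT timestamps look like 00:01:02,345
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("large-v3")
result = model.transcribe("podcast.mp3")  # placeholder file

with open("podcast.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```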
Encoder-Decoder Transformer
The architecture is an encoder-decoder Transformer, built from the same building blocks as BERT-style encoders and GPT-style decoders. The encoder processes audio spectrograms into a dense representation. The decoder generates text autoregressively, guided by special tokens that control the task. This unified design is why Whisper can do multiple jobs in one model.
Open Source (MIT License)
All Whisper model weights, inference code, and training research are open source under MIT. Full commercial rights. No API costs, no usage tracking. You can deploy it on your own infrastructure, fine-tune it for specific domains, and ship products using it.
Scalable Model Sizes
Whisper comes in tiny (39M), base (74M), small (244M), medium (769M), and large-v3 (1.55B) sizes. Smaller models run faster on less hardware. Larger models produce more accurate transcription. Choose based on your accuracy-vs-speed trade-off. Large-v3 is the highest quality and recommended for production transcription work.
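With the open-source package, switching sizes is a one-string change; on Floyo the node manages the checkpoint for you. A sketch:

```python
import whisper

# List every checkpoint the package knows about, tiny through large-v3.
print(whisper.available_models())

# Smaller checkpoint: faster and lighter, fine for drafts and voice memos.
draft_model = whisper.load_model("base")

# large-v3: slowest but most accurate; the production choice.
prod_model = whisper.load_model("large-v3")
```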
How does Whisper STT compare to other speech-to-text models?
Whisper large-v3 is among the strongest fully open-source ASR models available. OpenAI's GPT-4o Transcribe and Deepgram Nova-3 post lower error rates on benchmarks, but both are closed-source API services. Google Chirp 2 and AssemblyAI Universal-2 lead on some language-specific benchmarks but cost more per minute and lock you into proprietary APIs. Whisper's trade-off: the strongest open alternative, with a known tendency to hallucinate on silent or low-quality audio.
| Model | Languages | Open Source | Translation | License |
|---|---|---|---|---|
| Whisper large-v3 | 99+ | Yes | To English (built-in) | MIT |
| GPT-4o Transcribe | ~50 | No | Via prompt | Commercial API |
| Deepgram Nova-3 | 36+ | No | Via separate API | Commercial API |
| AssemblyAI Universal-2 | ~25 | No | Via separate API | Commercial API |
Source: OpenAI Whisper model card, Deepgram documentation, AssemblyAI product pages, and third-party benchmark comparisons as of April 2026. Word error rate (WER) varies significantly by language and audio conditions; test with your own use case.
How does Whisper STT work?
Whisper is an encoder-decoder Transformer. Input audio is resampled to 16kHz, split into 30-second chunks, and converted to a log-Mel spectrogram (128 Mel frequency bins for large-v3). The encoder processes the spectrogram into a dense audio representation. The decoder generates text autoregressively, one token at a time, while using special tokens to control whether it transcribes, translates, or identifies the language.
The training data is the key to Whisper's robustness. Original Whisper used 680,000 hours of multilingual audio scraped from the web. Large-v3 extended this to 5 million hours: 1 million hours of weakly labeled data plus 4 million hours of pseudo-labeled data generated by large-v2. This scale and diversity produce a model that generalizes to accents, noise, and technical vocabulary without fine-tuning.
The multitask training format uses special tokens as task specifiers. One token signals "transcribe in the source language." Another signals "translate to English." Timestamp tokens mark phrase boundaries. This lets a single decoder handle four tasks (language ID, transcription, translation, voice activity detection) without switching models.
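You can inspect these task specifiers directly in the open-source package's tokenizer. A sketch of the start-of-transcript sequence for Spanish transcription (illustrative only; the Floyo node sets these tokens for you):

```python
from whisper.tokenizer import get_tokenizer

# The decoder is primed with <|startoftranscript|><|es|><|transcribe|>;
# passing task="translate" instead swaps the final token to <|translate|>.
tokenizer = get_tokenizer(multilingual=True, language="es", task="transcribe")
print(tokenizer.decode(tokenizer.sot_sequence))
```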
On Floyo, the AILab Whisper STT node wraps this pipeline into a single ComfyUI block. Drop a LoadAudio node in front, connect it to the Whisper STT node, and the output string flows into whatever downstream node you want: a ShowText node for inspection, a prompt builder for another model, or a text file writer for logging.
Fair warning: Whisper has a known hallucination issue. On silent audio, very short clips, or low-resource languages, the model sometimes invents text that was not in the source. This tends to be worse on the smaller model sizes and on languages with less training data. For critical work (medical, legal, court reporting), review the output before trusting it. Beam search and temperature scheduling reduce but don't eliminate this behavior.
Frequently Asked Questions
Common questions about running Whisper STT on Floyo.
How much does it cost to run Whisper STT on Floyo?
You can start with Floyo's free pricing plan. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. Whisper is open-source under the MIT License, so there is no additional API cost beyond your Floyo plan. No per-minute transcription charges.
How do I run Whisper STT on Floyo?
Open Floyo in your browser, find the Whisper STT workflow (search "Whisper" in the template library), and click Run. Upload your audio file, hit generate, and the transcription appears in the output node. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup, no API key.
Who made Whisper STT?
OpenAI. The model was first released in September 2022 as open-source research. Large-v2 came in December 2022, and large-v3 in November 2023. The AILab Whisper STT node used on Floyo wraps OpenAI's original Whisper model for ComfyUI workflows.
What audio formats does Whisper STT accept?
Common audio formats work: MP3, WAV, M4A, FLAC, MP4, WEBM, and MPGA. The LoadAudio node in the Floyo workflow handles decoding. Whisper internally resamples everything to 16kHz, so the sample rate of your input doesn't matter.
How many languages does Whisper STT support?
99+ languages with automatic language detection. Strong ASR performance in about 10 languages: English, Chinese, German, Spanish, Russian, French, Portuguese, Korean, Japanese, and Arabic. Other supported languages have varying accuracy depending on training data availability.
Can Whisper STT translate audio into English?
Yes. Whisper supports direct speech-to-English translation in one step. The model detects the source language automatically, transcribes it, and translates the result to English. The quality is strong for general understanding and meeting notes, though dedicated translation models may be better for literary or highly technical content.
Can I use Whisper STT transcriptions commercially?
Yes. Whisper is released under the MIT License, which grants full commercial usage rights. You can use transcriptions in products, marketing, client work, subtitles, documentation, and any other commercial context without additional licensing.
Can I chain Whisper STT with other models on Floyo?
Yes. That's the main advantage of using Whisper in ComfyUI on Floyo. Transcribe a voice memo with Whisper, then feed the text into Wan 2.7 to generate a matching video. Or chain Whisper output to Fish Audio S2 to re-dub the content in a different voice. Or use Whisper to extract subtitles from existing footage, edit them, and composite back onto the video. All in one pipeline.
Try Whisper STT on Floyo
Open-source speech-to-text in 99+ languages with automatic translation, language detection, and timestamp support. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from OpenAI Whisper paper (Radford et al.), HuggingFace model card (openai/whisper-large-v3), OpenAI Whisper GitHub repository, and third-party benchmark comparisons.
Whisper STT Workflows

Whisper STT
Create text from speech using Whisper STT.
Tags: AILab, Audio to Text, Speech to Text, STT, Transcribe

Auto Subtitles with Whisper - Video to Video
Upload a video and get it back with burned-in subtitles. Whisper transcribes the audio, then the text gets placed frame-by-frame with word-level timing.
Tags: subtitling, vid2vid, video generation

Whisper Speech-to-Text and SRT Subtitle Generator
Upload any audio file and Whisper transcribes it into text with word-level and segment-level SRT subtitle files. Auto language detection included.
Tags: audio, speech to text, srt, STT, subtitles, transcription, whisper


