
AI SPEECH-TO-TEXT
Run Whisper STT on Floyo
OpenAI's speech-to-text model built for robust transcription across 99+ languages. Handles accents, background noise, and technical language. Transcribe or translate to English in one pass.
Run OpenAI's Whisper through ComfyUI in your browser. No API key, no installs, no local GPU.
Languages
99+
Training Data
680k+ hours
Tasks
Transcribe + translate
License
MIT (open source)
Try Whisper STT · Browse All Models
No installation. Runs in browser. Updated April 2026.
What You Get
Whisper STT is OpenAI's open-source speech-to-text model, delivered on Floyo through the AILab ComfyUI node. It's an encoder-decoder Transformer trained on 680,000+ hours of multilingual audio that handles transcription, speech translation to English, and language identification in one network. Robust on noisy, real-world audio. Supports 99+ languages with automatic language detection. Outputs plain text ready for subtitles, prompting, or logging in downstream nodes.
WHISPER STT WORKFLOWS ON FLOYO
What is Whisper STT?
Whisper is OpenAI's automatic speech recognition (ASR) model, first open-sourced in September 2022 and updated to large-v3 in November 2023. It's a general-purpose ASR model that handles multilingual speech recognition, speech translation to English, and language identification in a single neural network. On Floyo, it's packaged as the AILab Whisper STT node for ComfyUI workflows.
The architecture is an encoder-decoder Transformer. Audio input is resampled to 16kHz, split into 30-second chunks, converted to a log-Mel spectrogram, and passed through the encoder. The decoder predicts text tokens while using special tokens to control the task: language identification, phrase-level timestamps, transcription, or translation. One model, four jobs.
Whisper's strength is robustness. Training on 680,000+ hours of diverse multilingual audio (and 5 million hours for large-v3) means it handles accents, background noise, and technical jargon without fine-tuning. You can feed it a podcast, a conference recording, or a voice memo and get accurate transcription without pre-processing the audio.
The model comes in multiple sizes from 39M to 1.55B parameters. Larger models produce more accurate output. The AILab ComfyUI node exposes Whisper as a simple STT block: audio input, text output, ready for downstream nodes like subtitle generators, prompt builders, or logging systems.
On Floyo, Whisper STT runs through the native AILab node. Upload your audio (MP3, WAV, M4A, and other common formats), the node transcribes it, and the output text flows directly into your workflow. No API keys, no Python environment, no model download.
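The node hides all of this, but if you want to see what it wraps, the open-source openai-whisper Python package runs the same model in a few lines. A minimal local sketch (the file path is a placeholder, and the package is an assumption here, not Floyo's actual backend):

```python
# Minimal local sketch using the open-source openai-whisper package
# (pip install -U openai-whisper); assumes ffmpeg is on your PATH.
import whisper

model = whisper.load_model("large-v3")

# transcribe() handles decoding, 16kHz resampling, 30-second chunking,
# and automatic language detection internally.
result = model.transcribe("interview.mp3")  # placeholder file

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # the plain-text transcription
```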
What are Whisper STT's technical specifications?
Whisper uses an encoder-decoder Transformer architecture trained on 680,000+ hours of multilingual audio (5 million hours for large-v3). It accepts audio resampled to 16kHz, splits it into 30-second chunks, and converts each chunk to a log-Mel spectrogram (80 bins for large-v2, 128 bins for large-v3). Output is plain text with optional timestamps. Model sizes range from tiny (39M params) to large-v3 (1.55B params).
| Spec | Details |
|---|---|
| Developer | OpenAI |
| Architecture | Encoder-decoder Transformer |
| Model Sizes | Tiny (39M), Base (74M), Small (244M), Medium (769M), Large (1.55B) |
| Latest Version | Large-v3 (November 2023) |
| Audio Input | 16kHz sample rate, 30-second chunks |
| Input Representation | Log-Mel spectrogram (128 Mel bins for large-v3) |
| Languages | 99+ supported (strong ASR in ~10 languages) |
| Tasks | Transcription, translation to English, language ID, timestamps |
| Training Data | 680k+ hours (original), 5M hours labeled + pseudo-labeled (large-v3) |
| Large-v3 Improvement | 10-20% error reduction vs large-v2 across languages |
| Audio Formats | MP3, WAV, M4A, FLAC, MP4, WEBM, MPGA |
| Output | Plain text STRING for downstream ComfyUI nodes |
| Timestamps | Phrase-level via special tokens |
| License | MIT License (full commercial rights) |
| ComfyUI Access | AILab Whisper STT node on Floyo |
| Initial Release | September 2022 (open source) |
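For reference, the preprocessing rows in this table map directly onto the open-source package's helpers. A sketch of the pipeline up to language identification (the path is a placeholder; n_mels is read from the loaded model, so large-v3 gets its 128 bins automatically):

```python
import whisper

model = whisper.load_model("large-v3")

# load_audio() decodes via ffmpeg and resamples to 16kHz mono;
# pad_or_trim() fits the waveform to one 30-second window.
audio = whisper.pad_or_trim(whisper.load_audio("clip.m4a"))  # placeholder file

# Log-Mel spectrogram: model.dims.n_mels is 128 for large-v3,
# 80 for the earlier model sizes.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Language identification over the first 30-second window.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # e.g. "es"
```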
What can you create with Whisper STT?
Whisper STT covers transcription for podcasts and interviews, subtitle generation, multilingual meeting notes, voice-to-prompt pipelines for image and video generation, translation of non-English audio to English text, and language identification. It is the backbone for any ComfyUI workflow that needs to turn spoken audio into written text, whether for captioning, search, archival, or as an input to another model.
| Capability | What It Does | Use Case |
|---|---|---|
| Multilingual Transcription | Automatic speech recognition in 99+ languages. Automatic language detection means no need to specify the source language upfront. | Podcasts, lectures, interviews, global content |
| Speech Translation | Transcribe non-English audio directly into English text. One-step pipeline instead of transcribe-then-translate. | International meeting notes, foreign-language research |
| Subtitle Generation | Produces text with phrase-level timestamps for subtitle files. Works on recorded voice, podcasts, tutorials, and film audio. | Video subtitles, accessibility captions, tutorial creation |
| Voice-to-Prompt | Convert spoken commands into text prompts for image, video, or audio generation models. Chain the output directly into ComfyUI prompt nodes. | Voice-driven workflows, hands-free prompting |
| Language Identification | Automatically detects the spoken language. Useful for routing multi-language content or filtering audio by language. | Content routing, audio filtering, batch processing |
| Noisy Audio Handling | Trained on diverse real-world audio, so it handles background noise, accents, and technical jargon without pre-processing. | Field recordings, conference audio, low-quality sources |
What are Whisper STT's key features?
Whisper's feature set is built around one idea: a single model that replaces the traditional multi-stage speech processing pipeline. Language detection, transcription, translation, and timestamp generation all happen in one forward pass through the network, controlled by special tokens in the decoder.
Multilingual Speech Recognition
Whisper supports 99+ languages out of the box. It shows strong ASR performance in about 10 languages (English, Chinese, German, Spanish, Russian, French, Portuguese, Korean, Japanese, Arabic) and usable performance in many more. No language-specific fine-tuning required. Upload audio and the model detects the language automatically.
Robust on Noisy Audio
Training on 680,000+ hours of diverse web audio (not curated studio recordings) means Whisper handles real-world conditions. Background noise, overlapping speakers, accents, technical language, and poor microphone quality all come out intelligible. You don't need to clean audio before transcription.
Translation to English
Whisper can transcribe and translate to English in one step. Feed it audio in Spanish, French, Chinese, or any supported language, and get English text out. This skips the usual transcribe-then-translate pipeline. The output quality won't beat dedicated translation models for literary work, but it is strong for meeting notes, summaries, and content understanding.
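In the open-source package this is a single argument, not a separate pipeline. A sketch (the file name is hypothetical):

```python
import whisper

model = whisper.load_model("large-v3")

# task="translate" makes the decoder emit English text regardless of
# the source language, which is still detected automatically.
result = model.transcribe("spanish_meeting.mp3", task="translate")  # placeholder file
print(result["text"])  # English text from Spanish audio
```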
Phrase-Level Timestamps
Whisper can output timestamps alongside the text, marking when each phrase was spoken. This enables subtitle generation, audio search, and navigation through long recordings. The timestamps are at phrase level, not word level, so exact word timing requires post-processing with tools like WhisperX.
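Those phrase-level segments are enough to write a basic SRT file. A minimal sketch assuming the open-source package's output format, where each segment carries start and end times in seconds:

```python
import whisper

def srt_time(seconds: float) -> str:
    # SRT timestamps look like 00:01:02,345
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("large-v3")
result = model.transcribe("podcast.mp3")  # placeholder file

with open("podcast.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(seg["text"].strip() + "\n\n")
```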
Encoder-Decoder Transformer
The architecture is an encoder-decoder Transformer, built from the same building blocks as BERT-style encoders and GPT-style decoders. The encoder processes audio spectrograms into a dense representation. The decoder generates text autoregressively, guided by special tokens that control the task. This unified design is why Whisper can do multiple jobs in one model.
Open Source (MIT License)
All Whisper model weights, inference code, and training research are open source under MIT. Full commercial rights. No API costs, no usage tracking. You can deploy it on your own infrastructure, fine-tune it for specific domains, and ship products using it.
Scalable Model Sizes
Whisper comes in tiny (39M), base (74M), small (244M), medium (769M), and large-v3 (1.55B) sizes. Smaller models run faster on less hardware. Larger models produce more accurate transcription. Choose based on your accuracy-vs-speed trade-off. Large-v3 is the highest quality and recommended for production transcription work.
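With the open-source package, switching sizes is a one-string change; on Floyo the node manages the checkpoint for you. A sketch:

```python
import whisper

# List every checkpoint the package knows about, tiny through large-v3.
print(whisper.available_models())

# Smaller checkpoint: faster and lighter, fine for drafts and voice memos.
draft_model = whisper.load_model("base")

# large-v3: slowest but most accurate; the production choice.
prod_model = whisper.load_model("large-v3")
```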
How does Whisper STT compare to other speech-to-text models?
Whisper large-v3 is among the strongest fully open-source ASR models available. OpenAI's GPT-4o Transcribe and Deepgram Nova-3 post lower error rates on benchmarks, but both are closed-source API services. Google Chirp 2 and AssemblyAI Universal-2 lead on some language-specific benchmarks but cost more per minute and lock you into proprietary APIs. Whisper's trade-off: the strongest open alternative, with a known tendency to hallucinate on silent or low-quality audio.
| Model | Languages | Open Source | Translation | License |
|---|---|---|---|---|
| Whisper large-v3 | 99+ | Yes | To English (built-in) | MIT |
| GPT-4o Transcribe | ~50 | No | Via prompt | Commercial API |
| Deepgram Nova-3 | 36+ | No | Via separate API | Commercial API |
| AssemblyAI Universal-2 | ~25 | No | Via separate API | Commercial API |
Source: OpenAI Whisper model card, Deepgram documentation, AssemblyAI product pages, and third-party benchmark comparisons as of April 2026. Word error rate (WER) varies significantly by language and audio conditions; test with your own use case.
How does Whisper STT work?
Whisper is an encoder-decoder Transformer. Input audio is resampled to 16kHz, split into 30-second chunks, and converted to a log-Mel spectrogram (128 Mel frequency bins for large-v3). The encoder processes the spectrogram into a dense audio representation. The decoder generates text autoregressively, one token at a time, while using special tokens to control whether it transcribes, translates, or identifies the language.
The training data is the key to Whisper's robustness. Original Whisper used 680,000 hours of multilingual audio scraped from the web. Large-v3 extended this to 5 million hours: 1 million hours of weakly labeled data plus 4 million hours of pseudo-labeled data generated by large-v2. This scale and diversity produce a model that generalizes to accents, noise, and technical vocabulary without fine-tuning.
The multitask training format uses special tokens as task specifiers. One token signals "transcribe in the source language." Another signals "translate to English." Timestamp tokens mark phrase boundaries. This lets a single decoder handle four tasks (language ID, transcription, translation, voice activity detection) without switching models.
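You can inspect these task specifiers directly in the open-source package's tokenizer. A sketch of the start-of-transcript sequence for Spanish transcription (illustrative only; the Floyo node sets these tokens for you):

```python
from whisper.tokenizer import get_tokenizer

# The decoder is primed with <|startoftranscript|><|es|><|transcribe|>;
# passing task="translate" instead swaps the final token to <|translate|>.
tokenizer = get_tokenizer(multilingual=True, language="es", task="transcribe")
print(tokenizer.decode(tokenizer.sot_sequence))
```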
On Floyo, the AILab Whisper STT node wraps this pipeline into a single ComfyUI block. Drop a LoadAudio node in front, connect it to the Whisper STT node, and the output string flows into whatever downstream node you want: a ShowText node for inspection, a prompt builder for another model, or a text file writer for logging.
Fair warning: Whisper has a known hallucination issue. On silent audio, very short clips, or low-resource languages, the model sometimes invents text that was not in the source. This tends to be worse on the smaller model sizes and on languages with less training data. For critical work (medical, legal, court reporting), review the output before trusting it. Beam search and temperature scheduling reduce but don't eliminate this behavior.
Frequently Asked Questions
Common questions about running Whisper STT on Floyo.
How much does it cost to run Whisper STT on Floyo?
You can start with Floyo's free pricing plan. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. Whisper is open-source under the MIT License, so there is no additional API cost beyond your Floyo plan. No per-minute transcription charges.
How do I run Whisper STT on Floyo?
Open Floyo in your browser, find the Whisper STT workflow (search "Whisper" in the template library), and click Run. Upload your audio file, hit generate, and the transcription appears in the output node. Floyo handles the GPU, ComfyUI environment, and model weights. No local install, no Python setup, no API key.
Who made Whisper STT?
OpenAI. The model was first released in September 2022 as open-source research. Large-v2 came in December 2022, and large-v3 in November 2023. The AILab Whisper STT node used on Floyo wraps OpenAI's original Whisper model for ComfyUI workflows.
What audio formats does Whisper STT accept?
Common audio formats work: MP3, WAV, M4A, FLAC, MP4, WEBM, and MPGA. The LoadAudio node in the Floyo workflow handles decoding. Whisper internally resamples everything to 16kHz, so the sample rate of your input doesn't matter.
How many languages does Whisper STT support?
99+ languages with automatic language detection. Strong ASR performance in about 10 languages: English, Chinese, German, Spanish, Russian, French, Portuguese, Korean, Japanese, and Arabic. Other supported languages have varying accuracy depending on training data availability.
Can Whisper STT translate audio into English?
Yes. Whisper supports direct speech-to-English translation in one step. The model detects the source language automatically, transcribes it, and translates the result to English. The quality is strong for general understanding and meeting notes, though dedicated translation models may be better for literary or highly technical content.
Can I use Whisper STT transcriptions commercially?
Yes. Whisper is released under the MIT License, which grants full commercial usage rights. You can use transcriptions in products, marketing, client work, subtitles, documentation, and any other commercial context without additional licensing.
Can I chain Whisper STT with other models on Floyo?
Yes. That's the main advantage of using Whisper in ComfyUI on Floyo. Transcribe a voice memo with Whisper, then feed the text into Wan 2.7 to generate a matching video. Or chain Whisper output to Fish Audio S2 to re-dub the content in a different voice. Or use Whisper to extract subtitles from existing footage, edit them, and composite back onto the video. All in one pipeline.
Try Whisper STT on Floyo
Open-source speech-to-text in 99+ languages with automatic translation, language detection, and timestamp support. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from OpenAI Whisper paper (Radford et al.), HuggingFace model card (openai/whisper-large-v3), OpenAI Whisper GitHub repository, and third-party benchmark comparisons.
Whisper STT Workflows

Whisper STT
Create text from speech using Whisper STT.
Tags: AILab, Audio to Text, Speech to Text, STT, Transcribe

Auto Subtitles with Whisper - Video to Video
Upload a video and get it back with burned-in subtitles. Whisper transcribes the audio, then the text gets placed frame-by-frame with word-level timing.
Tags: subtitling, vid2vid, video generation

Whisper Speech-to-Text and SRT Subtitle Generator
Upload any audio file and Whisper transcribes it into text with word-level and segment-level SRT subtitle files. Auto language detection included.
Tags: audio, speech to text, srt, STT, subtitles, transcription, whisper


