Whisper STT
Create text from speech using Whisper STT
AILab
Audio to Text
Speech to Text
STT
Transcribe
Whisper STT from AILab is a speech‑to‑text (automatic speech recognition) system built around OpenAI’s Whisper model that converts spoken audio into written text.
What it is
General‑purpose ASR model that handles multilingual speech recognition, speech translation to English, and language identification in one network.
In AILab/ComfyUI context, exposed as a Whisper STT node that takes audio input and outputs a text STRING for downstream nodes.
Key features
Robust transcription on noisy, real‑world audio thanks to training on ~680,000 hours of diverse multilingual data.
Supports many languages plus optional direct translation to English from non‑English speech.
Provides timestamps, language detection, and task control (transcribe vs. translate) through special tokens/options.
In Comfy/AILab nodes, accepts common audio formats and returns plain text ready for subtitles, prompting, or logging.
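The features above map onto Whisper's Python API fairly directly. A minimal sketch, assuming the `openai-whisper` package is installed; `"base"` is one of the standard upstream model sizes, and the helper names here are illustrative, not part of the AILab node:

```python
def build_options(translate_to_english=False, language=None):
    """Map node-style choices onto Whisper's transcribe() kwargs."""
    return {
        # task controls transcribe vs. translate-to-English
        "task": "translate" if translate_to_english else "transcribe",
        # language=None lets Whisper auto-detect the spoken language
        "language": language,
    }

def transcribe_file(path, **choices):
    import whisper  # imported lazily so build_options() works without it
    model = whisper.load_model("base")
    result = model.transcribe(path, **build_options(**choices))
    # result["text"] holds the full transcript, result["segments"] the
    # per-segment start/end timestamps, result["language"] the detected language
    return result
```

The lazy import keeps the option-building logic usable even where the model weights are not available.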
Best‑fit use cases
Generating subtitles or transcripts for recorded voice, podcasts, lectures, and tutorials.
Voice‑driven prompting or control in ComfyUI/AILab, where spoken commands are turned into text prompts or parameters.
Multilingual meeting notes and interview transcription, including translation to English when needed.
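For the subtitle use case, Whisper's segment timestamps convert straightforwardly into SRT. A sketch assuming segment dicts in the shape Whisper returns (`start`/`end` in seconds plus `text`); the sample data is invented for illustration:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render Whisper-style segments as an SRT subtitle document."""
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(str(i))
        lines.append(f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line separates SRT cues
    return "\n".join(lines)

segments = [
    {"start": 0.0, "end": 2.4, "text": " Hello and welcome."},
    {"start": 2.4, "end": 5.1, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```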
Read more
Nodes & Models
WorkflowGraphics
LoadAudio
WhisperSTT
ShowText|pysssss
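The workflow above chains LoadAudio into WhisperSTT and shows the resulting STRING with ShowText. A hypothetical sketch of the shape such a node takes, using ComfyUI's `INPUT_TYPES`/`RETURN_TYPES`/`FUNCTION` class conventions; the class name, category, and placeholder body are illustrative, not the AILab implementation:

```python
class WhisperSTTSketch:
    """Illustrative ComfyUI-style node: audio in, transcript STRING out."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "audio": ("AUDIO",),
                "task": (["transcribe", "translate"],),
            }
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"
    CATEGORY = "AILab/Audio"  # assumed category for this sketch

    def run(self, audio, task):
        # A real node would run Whisper on the audio here; a placeholder
        # string keeps the sketch self-contained and runnable.
        text = "(transcription placeholder)"
        return (text,)
```

Nodes in ComfyUI return tuples matching `RETURN_TYPES`, which is why `run` returns `(text,)` rather than a bare string.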
