floyo
Powered by ThinkDiffusion

Whisper STT

Create text from speech using Whisper STT


Whisper STT from AILab is a speech‑to‑text (automatic speech recognition) system built around OpenAI’s Whisper model that converts spoken audio into written text.

What it is

  • General‑purpose ASR model that handles multilingual speech recognition, speech translation to English, and language identification in one network.

  • In the AILab/ComfyUI context, it is exposed as a Whisper STT node that takes an audio input and outputs a text STRING for downstream nodes.
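The node contract described above can be sketched in the usual ComfyUI custom-node style. This is a hypothetical illustration of the shape (INPUT_TYPES, RETURN_TYPES, a function returning a one-element tuple), not the actual AILab source; the real node's input names and options may differ, and the model call is stubbed out.

```python
# Hypothetical sketch of a ComfyUI-style STT node. The real AILab
# WhisperSTT node's exact inputs, options, and category may differ.
class WhisperSTTSketch:
    @classmethod
    def INPUT_TYPES(cls):
        # AUDIO in, plus a task switch (transcribe vs. translate)
        return {
            "required": {
                "audio": ("AUDIO",),
                "task": (["transcribe", "translate"],),
            }
        }

    RETURN_TYPES = ("STRING",)   # plain text for downstream nodes
    FUNCTION = "run"
    CATEGORY = "AILab/Audio"     # illustrative category name

    def run(self, audio, task="transcribe"):
        # A real node would call a Whisper model here; this stub
        # stands in for something like model.transcribe(audio)["text"].
        text = "placeholder transcript"
        return (text,)
```

Downstream nodes (ShowText, prompt builders, loggers) consume the single STRING output.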

Key features

  • Robust transcription on noisy, real‑world audio thanks to training on ~680,000 hours of diverse multilingual data.

  • Supports many languages plus optional direct translation to English from non‑English speech.

  • Provides timestamps, language detection, and task control (transcribe vs. translate) through special tokens/options.

  • In Comfy/AILab nodes, accepts common audio formats and returns plain text ready for subtitles, prompting, or logging.
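To make the timestamp feature concrete: Whisper-style transcription results carry segments with start/end times in seconds (as in `result["segments"]` from openai-whisper), which can be formatted into SRT subtitles with the standard library alone. The sample segments below are made up for illustration.

```python
# Convert Whisper-style segments (dicts with "start"/"end" in seconds
# and "text") into SRT subtitle blocks.
def to_srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Number each segment and join them into an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Illustrative segments, shaped like openai-whisper output:
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```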

Best‑fit use cases

  • Generating subtitles or transcripts for recorded voice, podcasts, lectures, and tutorials.

  • Voice‑driven prompting or control in ComfyUI/AILab, where spoken commands are turned into text prompts or parameters.

  • Multilingual meeting notes and interview transcription, including translation to English when needed.
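For the voice-driven prompting case above, a transcript usually needs light cleanup before it is usable as a text prompt. A minimal sketch, assuming a hypothetical filler-word list (this normalization is not part of any AILab node):

```python
import re

# Illustrative filler words to drop from spoken commands.
FILLERS = {"um", "uh", "er", "ah"}

def transcript_to_prompt(transcript):
    """Lowercase a transcript, drop filler words, normalize whitespace."""
    words = transcript.lower().split()
    kept = [w for w in words if w.strip(",.") not in FILLERS]
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

print(transcript_to_prompt("Uh, sunset over, um, the ocean"))
```

The cleaned string can then be wired into any node that accepts a STRING prompt.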

Read more


Nodes & Models

Workflow graph:

  • LoadAudio
  • WhisperSTT
  • ShowText|pysssss
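Wired together, the node list above forms a three-node chain. The sketch below shows what that chain could look like in ComfyUI's API (prompt) JSON format; the class names come from the list, but the input names and slot indices are assumptions and may differ in the real nodes.

```python
# Hypothetical ComfyUI API-format wiring: LoadAudio -> WhisperSTT -> ShowText.
# A ["node_id", slot] pair links an input to another node's output slot.
workflow = {
    "1": {"class_type": "LoadAudio",
          "inputs": {"audio": "speech.wav"}},      # assumed input name
    "2": {"class_type": "WhisperSTT",
          "inputs": {"audio": ["1", 0]}},          # audio output of node 1
    "3": {"class_type": "ShowText|pysssss",
          "inputs": {"text": ["2", 0]}},           # STRING output of node 2
}
```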
