floyo
Powered by ThinkDiffusion

Whisper STT

Create text from speech using Whisper STT


Whisper STT from AILab is a speech‑to‑text (automatic speech recognition) system built around OpenAI’s Whisper model that converts spoken audio into written text.

What it is

  • General‑purpose ASR model that handles multilingual speech recognition, speech translation to English, and language identification in one network.

  • In the AILab/ComfyUI context, it is exposed as a Whisper STT node that takes an audio input and outputs a text STRING for downstream nodes.
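The node contract described above can be sketched in the usual ComfyUI custom-node style. This is a hypothetical illustration of the shape (INPUT_TYPES, RETURN_TYPES, a function returning a one-element tuple), not the actual AILab source; the real node's input names and options may differ, and the model call is stubbed out.

```python
# Hypothetical sketch of a ComfyUI-style STT node. The real AILab
# WhisperSTT node's exact inputs, options, and category may differ.
class WhisperSTTSketch:
    @classmethod
    def INPUT_TYPES(cls):
        # AUDIO in, plus a task switch (transcribe vs. translate)
        return {
            "required": {
                "audio": ("AUDIO",),
                "task": (["transcribe", "translate"],),
            }
        }

    RETURN_TYPES = ("STRING",)   # plain text for downstream nodes
    FUNCTION = "run"
    CATEGORY = "AILab/Audio"     # illustrative category name

    def run(self, audio, task="transcribe"):
        # A real node would call a Whisper model here; this stub
        # stands in for something like model.transcribe(audio)["text"].
        text = "placeholder transcript"
        return (text,)
```

Downstream nodes (ShowText, prompt builders, loggers) consume the single STRING output.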

Key features

  • Robust transcription on noisy, real‑world audio thanks to training on ~680,000 hours of diverse multilingual data.

  • Supports many languages plus optional direct translation to English from non‑English speech.

  • Provides timestamps, language detection, and task control (transcribe vs. translate) through special tokens/options.

  • In Comfy/AILab nodes, accepts common audio formats and returns plain text ready for subtitles, prompting, or logging.
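To make the timestamp feature concrete: Whisper-style transcription results carry segments with start/end times in seconds (as in `result["segments"]` from openai-whisper), which can be formatted into SRT subtitles with the standard library alone. The sample segments below are made up for illustration.

```python
# Convert Whisper-style segments (dicts with "start"/"end" in seconds
# and "text") into SRT subtitle blocks.
def to_srt_time(seconds):
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Number each segment and join them into an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Illustrative segments, shaped like openai-whisper output:
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(segments))
```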

Best‑fit use cases

  • Generating subtitles or transcripts for recorded voice, podcasts, lectures, and tutorials.

  • Voice‑driven prompting or control in ComfyUI/AILab, where spoken commands are turned into text prompts or parameters.

  • Multilingual meeting notes and interview transcription, including translation to English when needed.
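For the voice-driven prompting case above, a transcript usually needs light cleanup before it is usable as a text prompt. A minimal sketch, assuming a hypothetical filler-word list (this normalization is not part of any AILab node):

```python
import re

# Illustrative filler words to drop from spoken commands.
FILLERS = {"um", "uh", "er", "ah"}

def transcript_to_prompt(transcript):
    """Lowercase a transcript, drop filler words, normalize whitespace."""
    words = transcript.lower().split()
    kept = [w for w in words if w.strip(",.") not in FILLERS]
    return re.sub(r"\s+", " ", " ".join(kept)).strip()

print(transcript_to_prompt("Uh, sunset over, um, the ocean"))
```

The cleaned string can then be wired into any node that accepts a STRING prompt.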

Read more


Nodes & Models

Workflow graph:

  • LoadAudio
  • WhisperSTT
  • ShowText|pysssss
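Wired together, the node list above forms a three-node chain. The sketch below shows what that chain could look like in ComfyUI's API (prompt) JSON format; the class names come from the list, but the input names and slot indices are assumptions and may differ in the real nodes.

```python
# Hypothetical ComfyUI API-format wiring: LoadAudio -> WhisperSTT -> ShowText.
# A ["node_id", slot] pair links an input to another node's output slot.
workflow = {
    "1": {"class_type": "LoadAudio",
          "inputs": {"audio": "speech.wav"}},      # assumed input name
    "2": {"class_type": "WhisperSTT",
          "inputs": {"audio": ["1", 0]}},          # audio output of node 1
    "3": {"class_type": "ShowText|pysssss",
          "inputs": {"text": ["2", 0]}},           # STRING output of node 2
}
```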
