TTS and Speech Length Calculator
Speech length calculator and TTS using Whisper and WhatDreamsCost
Text to Speech
TTS
WhatDreamsCost
Whisper
Nodes & Models
WorkflowGraphics
LoadAudio
Apply Whisper
PreviewAny
Gemini said
This workflow combines high-fidelity voice generation with precise timing estimation, specifically designed for AI video creators (often used within ComfyUI or similar modular pipelines).
It is important to note that Whisper is traditionally a Speech-to-Text (STT) model; however, in this context, it likely refers to the WhisperSpeech open-source TTS system or the OpenAI TTS API (which is frequently paired with Whisper). The WhatDreamsCost component refers to a specialized Speech Length Calculator (originally developed as a ComfyUI node by the user WhatDreamsCost) that ensures your video timeline matches your dialogue perfectly.
AI Video Narrative Workflow
1. Script Input & Pre-Processing
The user enters the final narrative or dialogue into a Text Input field.
Cleaning: The workflow automatically strips non-spoken annotations (e.g., [Scene Start], [Soft Music Plays]) to ensure the calculator only measures actual spoken words.
Tagging: Optional "Pace Tags" can be added if the specific tool supports varying speeds for different segments.
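The cleaning step above can be sketched with a simple regular expression. This is a minimal illustration, assuming annotations always use square brackets; the actual workflow node may use a different annotation syntax:

```python
import re

def clean_script(script: str) -> str:
    """Remove bracketed, non-spoken annotations such as [Scene Start]
    or [Soft Music Plays] so only spoken words are measured."""
    # Drop anything inside square brackets (assumed annotation syntax).
    cleaned = re.sub(r"\[[^\]]*\]", " ", script)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_script("[Scene Start] Hello world. [Soft Music Plays] Goodbye."))
# → Hello world. Goodbye.
```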
2. Duration Estimation (WhatDreamsCost Calculator)
Before generating audio, the WhatDreamsCost Speech Length Calculator analyzes the text string.
WPM Calculation: It estimates the duration in seconds from the word count and a configurable Words Per Minute (WPM) setting (a common default is ~150 WPM).
Real-time Feedback: As the user types, the tool provides a live readout of the expected audio length.
Frame Conversion: For video editors, it converts this time into a Total Frame Count (e.g., Duration × 24 fps), which is then piped into the video generation nodes to ensure the visual length matches the audio exactly.
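The WPM-to-frames math above is straightforward; here is a minimal sketch (function names are illustrative, not the node's actual API). Rounding up avoids clipping a fractional final frame:

```python
import math

def estimate_duration_seconds(text: str, wpm: int = 150) -> float:
    """Estimate spoken duration: word count divided by the WPM rate."""
    words = len(text.split())
    return words / wpm * 60.0

def frames_for(duration_s: float, fps: int = 24) -> int:
    """Convert a duration to a whole frame count, rounding up
    so the video is never shorter than the speech."""
    return math.ceil(duration_s * fps)

script = "This is a twelve word sample sentence used to check the math."
d = estimate_duration_seconds(script)   # 12 words / 150 WPM = 4.8 s
print(d, frames_for(d))                 # 4.8 s → 116 frames at 24 fps
```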
3. Speech Synthesis (Whisper / WhisperSpeech)
The text is sent to the TTS engine to generate the high-quality .wav or .mp3 file.
Voice Selection: The user selects a voice profile (e.g., Alloy, Nova, or Onyx if using OpenAI, or a custom cloned voice if using WhisperSpeech).
Generation: The model synthesizes the audio, maintaining the natural prosody and intonation of the input language.
4. Audio-Visual Synchronization
The generated audio file and the calculated duration from the WhatDreamsCost node meet in the final assembly stage.
Timeline Locking: The video generator (like AnimateDiff or SVD) uses the frame count from Step 2 as its "Max Frames" limit.
Lip-Sync (Optional): The audio can be passed through a secondary node (like Wav2Lip) to synchronize a character's mouth movements to the Whisper-generated audio.
Pro Tip: When using the WhatDreamsCost calculator, always add a small "buffer" (0.5 to 1.0 seconds) at the end of your video generation. AI speech models sometimes add brief silences at the start or end of a clip that can lead to the video cutting off the last word if timed too tightly.
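The buffer from the tip above can be folded directly into the frame-count conversion. A minimal sketch, assuming a 0.75 s buffer (any value in the suggested 0.5–1.0 s range works; the function name is illustrative):

```python
import math

def padded_frame_count(duration_s: float, fps: int = 24,
                       buffer_s: float = 0.75) -> int:
    """Frame budget with a safety buffer appended, so brief leading or
    trailing silences from the TTS model never cut off the last word."""
    return math.ceil((duration_s + buffer_s) * fps)

print(padded_frame_count(4.8))  # (4.8 + 0.75) * 24 = 133.2 → 134 frames
```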