TTS and Speech Length Calculator
Speech length calculator and TTS using Whisper and WhatDreamsCost
Text to Speech
TTS
WhatDreamsCost
Whisper
Nodes & Models
WorkflowGraphics
LoadAudio
Apply Whisper
PreviewAny
Gemini said
This workflow combines high-fidelity voice generation with precise timing estimation, specifically designed for AI video creators (often used within ComfyUI or similar modular pipelines).
It is important to note that Whisper is traditionally a Speech-to-Text (STT) model; however, in this context, it likely refers to the WhisperSpeech open-source TTS system or the OpenAI TTS API (which is frequently paired with Whisper). The WhatDreamsCost component refers to a specialized Speech Length Calculator (originally developed as a ComfyUI node by the user WhatDreamsCost) that ensures your video timeline matches your dialogue perfectly.
AI Video Narrative Workflow
1. Script Input & Pre-Processing
The user enters the final narrative or dialogue into a Text Input field.
Cleaning: The workflow automatically strips non-spoken annotations (e.g., [Scene Start], [Soft Music Plays]) to ensure the calculator only measures actual spoken words.
Tagging: Optional "Pace Tags" can be added if the specific tool supports varying speeds for different segments.
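The cleaning step above can be sketched with a simple regular expression. This is a minimal illustration, assuming annotations always use square brackets; the actual workflow node may use a different annotation syntax:

```python
import re

def clean_script(script: str) -> str:
    """Remove bracketed, non-spoken annotations such as [Scene Start]
    or [Soft Music Plays] so only spoken words are measured."""
    # Drop anything inside square brackets (assumed annotation syntax).
    cleaned = re.sub(r"\[[^\]]*\]", " ", script)
    # Collapse the whitespace left behind.
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_script("[Scene Start] Hello world. [Soft Music Plays] Goodbye."))
# → Hello world. Goodbye.
```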
2. Duration Estimation (WhatDreamsCost Calculator)
Before generating audio, the WhatDreamsCost Speech Length Calculator analyzes the text string.
WPM Calculation: It estimates the duration in seconds from the word count and a configurable Words Per Minute (WPM) setting (a common default is ~150 WPM).
Real-time Feedback: As the user types, the tool provides a live readout of the expected audio length.
Frame Conversion: For video editors, it converts this time into a Total Frame Count (e.g., Duration × 24 fps), which is then piped into the video generation nodes to ensure the visual length matches the audio exactly.
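The WPM-to-frames math above is straightforward; here is a minimal sketch (function names are illustrative, not the node's actual API). Rounding up avoids clipping a fractional final frame:

```python
import math

def estimate_duration_seconds(text: str, wpm: int = 150) -> float:
    """Estimate spoken duration: word count divided by the WPM rate."""
    words = len(text.split())
    return words / wpm * 60.0

def frames_for(duration_s: float, fps: int = 24) -> int:
    """Convert a duration to a whole frame count, rounding up
    so the video is never shorter than the speech."""
    return math.ceil(duration_s * fps)

script = "This is a twelve word sample sentence used to check the math."
d = estimate_duration_seconds(script)   # 12 words / 150 WPM = 4.8 s
print(d, frames_for(d))                 # 4.8 s → 116 frames at 24 fps
```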
3. Speech Synthesis (Whisper / WhisperSpeech)
The text is sent to the TTS engine to generate the high-quality .wav or .mp3 file.
Voice Selection: The user selects a voice profile (e.g., Alloy, Nova, or Onyx if using OpenAI, or a custom cloned voice if using WhisperSpeech).
Generation: The model synthesizes the audio, maintaining the natural prosody and intonation of the input language.
4. Audio-Visual Synchronization
The generated audio file and the calculated duration from the WhatDreamsCost node meet in the final assembly stage.
Timeline Locking: The video generator (like AnimateDiff or SVD) uses the frame count from Step 2 as its "Max Frames" limit.
Lip-Sync (Optional): The audio can be passed through a secondary node (like Wav2Lip) to synchronize a character's mouth movements to the Whisper-generated audio.
Pro Tip: When using the WhatDreamsCost calculator, always add a small "buffer" (0.5 to 1.0 seconds) at the end of your video generation. AI speech models sometimes add brief silences at the start or end of a clip that can lead to the video cutting off the last word if timed too tightly.
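The buffer from the tip above can be folded directly into the frame-count conversion. A minimal sketch, assuming a 0.75 s buffer (any value in the suggested 0.5–1.0 s range works; the function name is illustrative):

```python
import math

def padded_frame_count(duration_s: float, fps: int = 24,
                       buffer_s: float = 0.75) -> int:
    """Frame budget with a safety buffer appended, so brief leading or
    trailing silences from the TTS model never cut off the last word."""
    return math.ceil((duration_s + buffer_s) * fps)

print(padded_frame_count(4.8))  # (4.8 + 0.75) * 24 = 133.2 → 134 frames
```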