floyo logo
Powered by
ThinkDiffusion

TTS and Speech Length Calculator

Speech length calculator and TTS using Whisper and WhatDreamsCost

Nodes & Models

WorkflowGraphics
LoadAudio
Apply Whisper
PreviewAny

This workflow combines high-fidelity voice generation with precise timing estimation, specifically designed for AI video creators (often used within ComfyUI or similar modular pipelines).

Note that Whisper is a Speech-to-Text (STT) model, not a TTS engine. In this context the name likely refers to the open-source WhisperSpeech TTS system or to the OpenAI TTS API, which is frequently paired with Whisper. The WhatDreamsCost component is a specialized Speech Length Calculator, originally developed as a ComfyUI node by the user WhatDreamsCost, that helps keep your video timeline matched to your dialogue.


๐ŸŽ™๏ธ AI Video Narrative Workflow

1. Script Input & Pre-Processing

The user enters the final narrative or dialogue into a Text Input field.

  • Cleaning: The workflow automatically strips non-spoken annotations (e.g., [Scene Start], [Soft Music Plays]) to ensure the calculator only measures actual spoken words.

  • Tagging: Optional "Pace Tags" can be added if the specific tool supports varying speeds for different segments.
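
The cleaning step can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual implementation: it assumes annotations are always written in square brackets, as in the examples above, and the name strip_annotations is hypothetical.

```python
import re

# Assumption: non-spoken annotations are written in square brackets,
# e.g. [Scene Start] or [Soft Music Plays]. Nested brackets are not handled.
ANNOTATION = re.compile(r"\[[^\[\]]*\]")


def strip_annotations(script: str) -> str:
    """Remove bracketed annotations and collapse the leftover whitespace."""
    spoken = ANNOTATION.sub(" ", script)
    return " ".join(spoken.split())


cleaned = strip_annotations(
    "[Scene Start] Hello there. [Soft Music Plays] Welcome back."
)
print(cleaned)  # Hello there. Welcome back.
```

Only the cleaned string should be passed on to the length calculator, so that bracketed directions do not inflate the word count.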

2. Duration Estimation (WhatDreamsCost Calculator)

Before generating audio, the WhatDreamsCost Speech Length Calculator analyzes the text string.

  • WPM Calculation: It calculates the estimated duration in seconds based on a configurable Words Per Minute (WPM) setting (standard is usually ~150 WPM).

  • Real-time Feedback: As the user types, the tool provides a live readout of the expected audio length.

  • Frame Conversion: For video editors, it converts this time into a Total Frame Count (e.g., Duration × 24 fps), which is then piped into the video generation nodes to ensure the visual length matches the audio exactly.
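
The arithmetic behind the WPM and frame conversion steps is simple enough to sketch directly. The function names below are hypothetical and this is not the WhatDreamsCost node's code; it assumes words are separated by whitespace and uses the 150 WPM and 24 fps figures mentioned above as defaults.

```python
import math


def estimate_duration_seconds(text: str, wpm: float = 150.0) -> float:
    """Estimate spoken duration from a whitespace-delimited word count."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    words = len(text.split())
    return words * 60.0 / wpm


def duration_to_frames(seconds: float, fps: int = 24) -> int:
    """Convert a duration to a frame count, rounding up so no audio is cut."""
    return math.ceil(seconds * fps)


script = "one two three four five six seven eight nine ten"
seconds = estimate_duration_seconds(script)  # 10 words at 150 WPM -> 4.0 s
frames = duration_to_frames(seconds)         # 4.0 s at 24 fps -> 96 frames
print(seconds, frames)  # 4.0 96
```

Rounding up rather than down is a deliberate choice here: a video that runs one frame long is less noticeable than one that clips the final syllable.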

3. Speech Synthesis (Whisper / WhisperSpeech)

The text is sent to the TTS engine to generate the high-quality .wav or .mp3 file.

  • Voice Selection: The user selects a voice profile (e.g., Alloy, Nova, or Onyx if using OpenAI, or a custom cloned voice if using WhisperSpeech).

  • Generation: The model synthesizes the audio, maintaining the natural prosody and intonation of the input language.

4. Audio-Visual Synchronization

The generated audio file and the calculated duration from the WhatDreamsCost node meet in the final assembly stage.

  • Timeline Locking: The video generator (like AnimateDiff or SVD) uses the frame count from Step 2 as its "Max Frames" limit.

  • Lip-Sync (Optional): The audio can be passed through a secondary node (like Wav2Lip) to synchronize a character's mouth movements to the Whisper-generated audio.

Pro Tip: When using the WhatDreamsCost calculator, always add a small "buffer" (0.5 to 1.0 seconds) at the end of your video generation. AI speech models sometimes add brief silences at the start or end of a clip that can lead to the video cutting off the last word if timed too tightly.
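
One way to apply the buffer is to measure the generated audio itself and pad the frame count from there. The sketch below assumes the TTS output is an uncompressed .wav file readable by Python's standard wave module; the function names and the 0.5 second default are illustrative choices, not part of the workflow.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the length of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def padded_frame_count(
    audio_seconds: float, fps: int = 24, buffer_seconds: float = 0.5
) -> int:
    """Frame count covering the audio plus a trailing safety buffer."""
    return math.ceil((audio_seconds + buffer_seconds) * fps)


# Example: a 4.0 s clip with a 0.5 s buffer at 24 fps.
print(padded_frame_count(4.0))  # 108
```

Measuring the real file rather than relying only on the WPM estimate also catches any leading or trailing silence the speech model adds.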
