floyo logo
Powered by
ThinkDiffusion
Webinar: Qwen 2511 for Multi Angle & Relighting w Sebastian Kamph. Sign up now 👉🏽

Chatterbox Text to Speech

Text to speech workflow using Chatterbox


Chatterbox TTS is an open‑source text‑to‑speech and voice‑cloning system. It turns text into natural‑sounding speech, clones a voice from a few seconds of reference audio, and gives fine control over emotion and intensity.

What it does

  • Converts text into high‑quality speech with controls for pitch, speed, and emotion (from neutral to highly dramatic).

  • Performs zero‑shot voice cloning: upload a short reference clip (around 5 seconds) and it can mimic that voice without separate training.

  • Supports multilingual output (around 22 languages) and can keep a cloned voice consistent across languages for dubbing/localization.
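
The zero‑shot cloning step above can be sketched in Python. This is a minimal sketch, not the workflow's own implementation (the workflow uses ComfyUI nodes). The helper names and the 5‑second threshold are hypothetical, taken from the "around 5 seconds" guidance. The calls `ChatterboxTTS.from_pretrained`, `generate(..., audio_prompt_path=...)`, and `model.sr` follow the Chatterbox project's published Python example as best understood here, so treat them as assumptions and check them against the installed version.

```python
import wave
from pathlib import Path

# Assumption: threshold taken from the "around 5 seconds" guidance, not a hard limit.
MIN_REFERENCE_SECONDS = 5.0


def clip_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path, min_seconds=MIN_REFERENCE_SECONDS):
    """True when the reference clip is at least `min_seconds` long."""
    return clip_seconds(path) >= min_seconds


def clone_and_speak(text, reference_path, out_path, device="cpu"):
    """Speak `text` in the voice of `reference_path` (hypothetical helper)."""
    if not is_usable_reference(reference_path):
        raise ValueError("reference clip is shorter than the suggested minimum")

    # Imported lazily so the helpers above work without the model installed.
    # Assumed API: requires the chatterbox-tts and torchaudio packages.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    wav = model.generate(text, audio_prompt_path=str(reference_path))
    torchaudio.save(str(out_path), wav, model.sr)
    return Path(out_path)
```

Checking the clip length before loading the model keeps the failure cheap: a too‑short reference is rejected without downloading or initializing any weights.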

Voice change and control

  • Works as a voice changer: it clones a target voice and speaks any input text in that voice, with adjustable accent, pacing, and emotional intensity.

  • Provides explicit “exaggeration” or intensity parameters so you can dial emotion and expressiveness up or down programmatically.

  • Includes watermarking/provenance options (PerTh) in some deployments so synthetic audio can be detected and tracked responsibly.
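
Dialing expressiveness up or down programmatically could look like the sketch below. It maps a single 0 to 1 "drama" level onto two generation settings. The parameter names `exaggeration` and `cfg_weight` are assumed from the Chatterbox Python API, and the endpoint values (0.25 to 1.0 and 0.5 to 0.3) are illustrative guesses rather than values from this page; tune them by ear.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Expressiveness:
    exaggeration: float  # higher values give more dramatic delivery (assumed semantics)
    cfg_weight: float    # lower values tend to slow the pacing (assumed semantics)


def expressiveness(level):
    """Map a level in [0, 1] (0 = neutral, 1 = highly dramatic) to settings.

    Hypothetical helper: linear interpolation between illustrative endpoints.
    Out-of-range levels are clamped.
    """
    level = min(max(level, 0.0), 1.0)
    exaggeration = 0.25 + 0.75 * level
    cfg_weight = 0.5 - 0.2 * level
    return Expressiveness(round(exaggeration, 3), round(cfg_weight, 3))


def generate_kwargs(level):
    """Return keyword arguments intended for a call like model.generate(text, **kwargs)."""
    return asdict(expressiveness(level))
```

For example, `generate_kwargs(0.5)` yields a mid‑intensity setting that could be passed along with the text to the synthesis call, keeping the emotion control in one place instead of scattering raw numbers through the code.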

How it’s typically used

  • Via web UIs where you paste text, choose or clone a voice, adjust emotion/pacing, and download audio.

  • As a self‑hosted or API‑based engine for agents, NPCs, audiobooks, podcasts, accessibility tools, or localized dubbing.


Nodes & Models

WorkflowGraphics
LoadAudio
SaveAudio
PreviewAudio
ChatterboxTTS
ChatterboxVC
