Fish Speech Voice Cloning TTS with Emotion Tags
Audio
text to speech
Voice Cloning
Nodes & Models
CR Prompt Text
LoadAudio
SaveAudio
ShowText|pysssss
ShowText|pysssss
Upload a short audio clip of the voice you want to clone, write the text you want spoken, and Fish Speech S2 Pro generates it in that voice. Whisper transcribes the reference clip automatically — you don't need to type out what was said in it.
Emotion and tone are controlled by tags you place directly in the text: [excited], [whispering], [laughing], [angry], and 60+ others. Put them inline wherever you want the delivery to shift.
Output is audio only, saved as a file.
How do you use Fish Speech S2 Pro voice cloning with emotion tags?
Upload a reference audio clip, write your target text with emotion tags placed where you want them, and Fish Speech S2 Pro generates the speech in the cloned voice. Whisper handles the reference transcription automatically. Most runs only need the reference audio and the text field.
Reference audio: The voice sample to clone from. A clean, clear recording works best — minimal background noise, one speaker. Length matters less than clarity. The model extracts voice characteristics from this clip and applies them to your generated speech.
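Before uploading, it can be worth a quick sanity check on the clip itself. A hypothetical pre-flight helper using only the Python standard library (assumes an uncompressed WAV file; the function name and fields are ours, not part of the workflow):

```python
import wave

def describe_reference(path: str) -> dict:
    """Report duration, channel count, and sample rate of a WAV clip."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        duration = w.getnframes() / rate
        return {
            "duration_s": duration,
            "channels": w.getnchannels(),  # one speaker, so ideally mono
            "sample_rate": rate,
            # 5-15 s of clean audio is plenty (see the FAQ); longer isn't better.
            "in_sweet_spot": 5 <= duration <= 15,
        }
```

If `channels` is more than 1 or the clip is very long, consider trimming or downmixing before cloning.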
Reference text: Auto-filled by Whisper after transcription. You can edit it if Whisper gets something wrong, but in most cases leave it as-is. It helps the model match prosody between the reference and the output.
Text (with emotion tags): Write what you want spoken, then drop tags inline where you want the delivery to change. Tags work like stage directions placed right in the sentence:
[professional broadcast tone] Welcome to today's program. [whispers] But first, let me tell you a secret.
[excited] I will love you for ten thousand years!
The full tag library covers basic emotions ([happy], [angry], [sad]), advanced emotions ([sarcastic], [hesitating], [guilty]), tone and voice ([shouting], [pitch up], [soft tone]), sound effects ([laughing], [sobbing], [sighing], [pause]), and volume control ([loud], [whispering], [low volume]). Mix and chain them freely.
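For longer scripts, a misspelled tag is easy to miss until after a run. A hypothetical pre-check (the tag set below is only the examples named in this section, not the full 60+ library):

```python
import re

# Tags mentioned in this section; the real library is larger.
KNOWN_TAGS = {
    "happy", "angry", "sad", "sarcastic", "hesitating", "guilty",
    "shouting", "pitch up", "soft tone", "laughing", "sobbing",
    "sighing", "pause", "loud", "whispering", "low volume",
    "excited", "whispers", "professional broadcast tone",
}

def find_unknown_tags(text: str) -> list[str]:
    """Return inline [tags] not found in the known set."""
    tags = re.findall(r"\[([^\]]+)\]", text)
    return [t for t in tags if t.lower() not in KNOWN_TAGS]

script = "[excited] Ten thousand years! [pause] [whsipering] A secret."
print(find_unknown_tags(script))  # flags the typo'd tag
```

Anything the check flags is either a typo or a tag outside the subset listed here; verify it against the full library before running.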
Language: Set to auto by default. The model detects the language from your text. Override it manually if auto-detection is getting it wrong on short inputs.
Temperature: 0.8 by default. Lower it (0.5–0.6) for more controlled, predictable delivery. Raise it (0.9–1.0) for more expressive variation. If the output sounds flat, try going slightly higher.
Top_p: 0.8 by default. Works with temperature to control delivery variety. Leave it alone unless you're specifically tuning for consistency or spontaneity.
Repetition penalty: 1.1 by default. Keeps the model from looping or stuttering on repeated phrases. Increase slightly if you're getting repetition artifacts.
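Temperature, top_p, and repetition penalty interact the way they do in most autoregressive samplers. An illustrative sketch of that interaction (generic sampling logic, not Fish Speech's actual implementation):

```python
import math
import random

def sample_next(logits, history, temperature=0.8, top_p=0.8,
                rep_penalty=1.1, rng=None):
    """Pick the next token id from raw logits, one step of sampling."""
    rng = rng or random.Random()
    logits = list(logits)
    # Repetition penalty: dampen tokens already emitted.
    for t in set(history):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p (nucleus): keep the smallest high-probability set with mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept set and draw.
    kept_mass = sum(probs[i] for i in keep)
    r = rng.random() * kept_mass
    for i in keep:
        r -= probs[i]
        if r <= 0:
            return i
    return keep[-1]
```

This is why lowering temperature or top_p narrows delivery toward the most likely phrasing, while raising the repetition penalty pushes the sampler away from tokens it has already produced.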
Seed: Randomized by default. Fix the seed to reproduce a specific delivery. Change it to explore different interpretations of the same text and tags.
Max new tokens: 200 by default. Increase for longer texts. If the output gets cut off mid-sentence, this is the first thing to raise.
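There's no published tokens-per-word rate for Fish Speech, but a rough budget check can flag scripts that will obviously overflow the default of 200. A heuristic sketch (the ~1.5 tokens-per-word figure and the overhead constant are assumptions, not measured values):

```python
def needs_more_tokens(text: str, max_new_tokens: int = 200) -> bool:
    """Rough guess at whether the token budget will truncate the script."""
    words = len(text.split())
    # Assumed rate: ~1.5 tokens per word, plus a small fixed overhead
    # for inline emotion tags and end-of-sequence handling.
    estimate = int(words * 1.5) + 16
    return estimate > max_new_tokens
```

If this returns True, raise max new tokens before running rather than after a cut-off output.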
What is Fish Speech S2 Pro voice cloning good for?
Fish Speech S2 Pro is for generating speech that sounds like a specific person, with delivery you can control at the sentence and word level. The emotion tag system means you're not guessing — you place the tag, you get the effect. Useful for voiceover, character dialogue, dubbing, and any content where voice and delivery both matter.
Good scenarios: voiceover work where you need a consistent voice across multiple lines. Character dialogue with varied emotional delivery — anger, excitement, hesitation — in a single clip. Dubbing or localization where the target voice is known. Content prototyping where you want to hear how a script sounds before committing to a recording session.
The catch: voice cloning quality depends on the reference clip. A noisy, compressed, or multi-speaker reference produces a weaker clone. Get the cleanest possible sample of the target voice for best results. And if Whisper mistranscribes the reference, fix the reference text manually — wrong transcription affects prosody matching.
FAQ
How long does the reference audio clip need to be for Fish Speech voice cloning? A few seconds of clean speech is enough. Longer isn't always better — clarity matters more than duration. A 5–15 second clip with one speaker and minimal noise gives the model what it needs.
Can Fish Speech S2 Pro generate speech in multiple languages? Yes. Language is set to auto by default and the model handles detection. You can mix languages in the same text input. Set language manually if auto-detection is unreliable on short or ambiguous inputs.
How do emotion tags work in Fish Speech? Place them inline in your text wherever you want the delivery to change. [angry] before a sentence makes that sentence sound angry. [pause] inserts a beat. [laughing] adds a laugh effect. They stack and chain — you can shift tone multiple times across a single line.
What happens if Whisper transcribes the reference audio incorrectly? Edit the reference text field manually before running. The transcription is used to help the model match prosody between the reference voice and the output. A wrong transcription doesn't break the clone, but fixing it produces cleaner results.
How do you run Fish Speech voice cloning online? You can run it online through Floyo. No installation, no setup. Open the workflow in your browser, upload your reference audio, and hit run.