floyo logo
Powered by
ThinkDiffusion

ComfyUI-FishAudioS2

165

Last updated
2026-03-30

ComfyUI-FishAudioS2 is a set of custom nodes that integrates the Fish Audio S2 Pro text-to-speech (TTS) model into ComfyUI. It supports voice cloning, multi-speaker conversations, and fine-grained emotional expression in generated speech.

  • Offers zero-shot voice cloning from short audio samples, with no training or fine-tuning step.
  • Supports over 1,500 emotive tags for adjusting speech prosody and emotion, making the generated audio sound more natural.
  • Synthesizes multiple speakers in a single pass and provides an isolated audio output for each speaker, which is useful for lip-sync applications.

Context

The ComfyUI-FishAudioS2 extension connects the Fish Audio S2 Pro TTS model to the ComfyUI environment. Its purpose is to let users generate high-quality, customizable speech inside a node-based workflow for applications ranging from animation to interactive media.

Key Features & Benefits

The tool's main feature is zero-shot voice cloning, which reproduces a voice from only 10 to 30 seconds of reference audio. The library of emotive tags gives users fine-grained control over the emotional tone and pacing of the speech output. Multi-speaker synthesis makes it possible to generate dialogues between several characters in one run.
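To make the idea of a tagged multi-speaker script concrete, here is a minimal sketch of how such a script could be split into per-speaker turns before synthesis. The `Speaker: text [tag]` layout, the bracket tag style, and the names `parse_script` and `group_by_speaker` are assumptions made for illustration; the syntax the nodes actually accept may differ, so consult the repository documentation.

```python
import re
from dataclasses import dataclass

# Hypothetical script format: one turn per line, "Speaker: text with [tag] markers".
# This layout is an assumption for illustration, not the node's documented syntax.
TURN_RE = re.compile(r"^\s*([^:\[\]]+?)\s*:\s*(.+)$")
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


@dataclass
class Turn:
    speaker: str
    text: str        # turn text with tags left in place
    tags: list       # emotive tags found in the turn, in order


def parse_script(script: str) -> list:
    """Split a dialogue script into turns, one per non-empty line."""
    turns = []
    for line in script.splitlines():
        if not line.strip():
            continue  # skip blank lines between turns
        match = TURN_RE.match(line)
        if match is None:
            raise ValueError(f"Line is not in 'Speaker: text' form: {line!r}")
        speaker, text = match.group(1), match.group(2)
        turns.append(Turn(speaker, text, TAG_RE.findall(text)))
    return turns


def group_by_speaker(turns: list) -> dict:
    """Map each speaker to the indices of their turns, in first-seen order."""
    grouped = {}
    for index, turn in enumerate(turns):
        grouped.setdefault(turn.speaker, []).append(index)
    return grouped


script = """Alice: [excited] We did it!
Bob: [whisper] Keep it down. [sigh]
Alice: Sorry."""

turns = parse_script(script)
by_speaker = group_by_speaker(turns)
```

Keeping the turn indices per speaker preserves the original ordering, which matters if each speaker is later assigned a separate reference voice.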

Advanced Functionalities

The Fish Audio S2 Pro model detects the input language automatically and supports 83 languages, so one workflow can serve multilingual projects. It can also run at several numeric precisions (bf16, fp16, fp32) and offers performance options, including optimized attention mechanisms, intended to improve synthesis speed and quality.
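The choice between bf16, fp16, and fp32 is mostly a trade-off between memory use and hardware support. The sketch below shows one plausible selection rule and a rough estimate of weight memory. The helper names `pick_precision` and `weight_gib` and the fallback order are assumptions for illustration, not the extension's actual logic.

```python
# Bytes per parameter for each precision named in the text.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def pick_precision(has_cuda: bool, supports_bf16: bool, requested: str = "auto") -> str:
    """Hypothetical helper: choose a precision string for loading the model."""
    if requested != "auto":
        if requested not in BYTES_PER_PARAM:
            raise ValueError(f"Unknown precision: {requested!r}")
        return requested  # an explicit user choice always wins
    if not has_cuda:
        # Assumption: fall back to full precision when no GPU is available.
        return "fp32"
    # Assumption: prefer bf16 where the GPU supports it, otherwise fp16.
    return "bf16" if supports_bf16 else "fp16"


def weight_gib(num_params: int, precision: str) -> float:
    """Approximate memory for the weights alone, in GiB (excludes activations)."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


choice = pick_precision(has_cuda=True, supports_bf16=False)
```

In a PyTorch environment, the two flags could be taken from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`. The memory figure covers weights only, so actual usage during synthesis will be higher.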

Practical Benefits

Integrating this tool into a workflow gives users closer control over speech synthesis and produces more lifelike audio. Because multi-speaker conversations are generated with an isolated audio track per speaker, each character's track can be fed directly into animation or lip-sync steps without separating the voices afterwards.
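One way to picture isolated per-speaker tracks is as time-aligned channels in which each speaker's track is silent while someone else is talking. The sketch below builds such tracks from an ordered list of segments. The function name `isolate_tracks` and the silence-padding approach are assumptions for illustration; the source does not describe how the nodes construct their outputs.

```python
def isolate_tracks(segments: list) -> dict:
    """Build one time-aligned track per speaker from ordered segments.

    segments: list of (speaker, samples) pairs in playback order, where
    samples is a list of floats. Illustrative assumption: every track gets
    the same total length, with silence (0.0) wherever another speaker talks.
    """
    # Collect speakers in first-seen order so output order is stable.
    tracks = {}
    for speaker, _ in segments:
        tracks.setdefault(speaker, [])

    for speaker, samples in segments:
        for name, track in tracks.items():
            if name == speaker:
                track.extend(samples)                 # active speaker: real audio
            else:
                track.extend([0.0] * len(samples))    # everyone else: silence
    return tracks


segments = [
    ("Alice", [0.1, 0.2]),
    ("Bob", [0.3]),
    ("Alice", [0.4]),
]
tracks = isolate_tracks(segments)
```

Because all tracks share one timeline, a lip-sync step can process each character independently while the summed tracks still reproduce the full conversation.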

Credits/Acknowledgments

This repository is credited to the Fish Audio team, with contributions from the open-source community. The project is licensed under the Fish Audio Research License, which permits academic and non-commercial use; commercial use requires a separate license.