AI AUDIO GENERATION
Run Fish Audio S2 on Floyo
Expressive text-to-speech with inline emotion control, zero-shot voice cloning, multi-speaker generation, and 80+ languages. Write [happy], [whispering], or [professional broadcast tone] directly in your script.
Run Fish Audio's S2 Pro through ComfyUI in your browser. No API key, no installs, no local GPU.
| Languages | Latency | Sample Rate | Emotion Tags |
|---|---|---|---|
| 80+ | ~100ms first audio | 44.1kHz | 1,500+ inline |
Try Fish Audio S2 Now → Browse All Models
No installation. Runs in browser. Updated April 2026.
What is Fish Audio S2?
Fish Audio S2 Pro is a text-to-speech model released on March 10, 2026. It uses a Dual-Autoregressive architecture with a 4B-parameter Slow AR for semantic prediction and a 400M-parameter Fast AR for acoustic detail. Trained on over 10 million hours of audio across 80+ languages, it supports inline emotion control, zero-shot voice cloning, and multi-speaker generation. It outputs 44.1kHz audio with sub-150ms latency.
Tags: API · audio generation · expressive tts · text to speech · voice synthesis
Workflows: Fish Audio S2 TTS - Expressive Text to Speech · Fish Speech Voice Cloning TTS with Emotion Tags
What are Fish Audio S2's technical specifications?
Fish Audio S2 Pro uses a Dual-Autoregressive architecture: a 4B-parameter Slow AR operating along the time axis for semantic prediction, and a 400M-parameter Fast AR generating 9 residual codebooks at each time step for acoustic detail. It outputs 44.1kHz audio with a real-time factor of 0.195 on H200 GPUs and time-to-first-audio of about 100ms.
| Spec | Details |
|---|---|
| Developer | Fish Audio |
| Architecture | Dual-AR: 4B Slow AR (time axis) + 400M Fast AR (codebook depth) |
| Audio Codec | RVQ-based, 10 codebooks, ~21 Hz frame rate |
| Output Quality | 44.1kHz high-fidelity audio |
| Languages | 80+ (Tier 1: English, Chinese, Japanese. Tier 2: Korean, Spanish, Portuguese, Arabic, Russian, French, German) |
| Emotion Control | 1,500+ inline tags via free-form natural-language descriptions |
| Voice Cloning | Zero-shot from 10-30 second reference clip |
| Multi-Speaker | Native multi-speaker and multi-turn in a single pass |
| Latency | ~100ms time-to-first-audio (H200 GPU) |
| Real-Time Factor | 0.195 on H200 (3,000+ tokens/sec) |
| Training Data | 10M+ hours of multilingual audio |
| Open Source | Yes (Fish Audio Research License, free for research/non-commercial) |
| ComfyUI Access | FishAudioTTSAdvanced node on Floyo |
| Release Date | March 10, 2026 |
What can you create with Fish Audio S2?
Fish Audio S2 covers scripted speech with emotional range, voice cloning, multi-speaker dialogue, narration, character voices, and real-time conversational audio. The inline tag system means tone shifts happen at the word level, not the clip level. You control delivery sentence by sentence within a single generation.
| Capability | What It Does | Use Case |
|---|---|---|
| Inline Emotion Tags | Write [happy], [whispering], [angry], or any free-form description at any point in your script. Delivery changes at that exact word. | Character dialogue, narration with mood shifts, interactive fiction |
| Voice Cloning | Zero-shot cloning from a 10-30 second reference clip. Captures timbre, speaking style, and emotional tendencies. No fine-tuning needed. | Brand voice consistency, character reuse, podcast production |
| Multi-Speaker | Generate complete dialogues between multiple characters in a single pass. Each speaker maintains distinct voice characteristics. | Audiobooks, conversational demos, training materials |
| 80+ Languages | Multilingual TTS without phonemes or language-specific preprocessing. Tier 1: English, Chinese, Japanese. Tier 2: Korean, Spanish, French, German, and more. | Localized content, international marketing, multilingual apps |
| Sound Effects | Inline tags for [laughing], [sobbing], [sighing], [inhale], [exhale], [clearing throat], and more. Placed at the exact point in the script. | Animated character voices, game dialogue, social content |
| Pipeline Integration | Chain with video models in ComfyUI. Generate a video with Wan 2.7 or Seedance, then add narration or dialogue with Fish Audio S2 in the same workflow. | Video production pipelines, explainer videos, product demos |
What are Fish Audio S2's key features?
Fish Audio S2's feature set is built around granular speech control at the word level. Instead of applying a single emotion or style to an entire clip, you embed instructions at specific positions in your text. The model interprets free-form natural-language descriptions, not a fixed tag vocabulary.
Inline Emotion Control
Write tags like [happy], [whispering], [professional broadcast tone], or [pitch up slightly with nervous energy] at any point in your script. The model adjusts delivery at that exact position. Tags apply from where you place them until the next tag. The same sentence reads completely differently with [excited] vs [serious] in front of it. Over 1,500 emotive tags are supported, and you can write custom descriptions beyond the preset list.
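For example, a short script mixing preset and free-form tags might read like this (every tag shown is either documented on this page or a free-form description of the kind the model accepts):

```
[professional broadcast tone] Welcome back to the show.
[excited] Today's episode is a special one. [laughing]
[whispering] Between you and me, we almost didn't pull it off.
[serious] Let's get into it.
```

Each tag takes effect at its position and holds until the next one, so the whisper ends exactly where [serious] begins.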
Zero-Shot Voice Cloning
Provide a 10-30 second reference audio clip. The model captures the speaker's timbre, speaking style, and emotional tendencies and applies them to your script without any fine-tuning. The cloned voice stays consistent across long outputs and responds to emotion tags, so you can make a cloned voice whisper, shout, or laugh while maintaining its identity.
Multi-Speaker Dialogue
Upload reference audio for multiple speakers. The model uses speaker tokens to switch between voices within a single generation. A two-person conversation, a panel discussion, or an audiobook with distinct character voices can all be produced in one pass. No splicing, no separate generations per speaker.
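The speaker-labeling syntax is not spelled out on this page, so treat the [S1]/[S2] markers below as illustrative rather than confirmed syntax; the point is that one script drives the whole conversation:

```
[S1] [cheerful] Morning! Did you listen to the draft episode?
[S2] [skeptical] I did. [sighing] The intro runs long again.
[S1] [laughing] Fair. I'll trim it before we publish.
```

With reference clips uploaded for both speakers, the model renders the exchange in one generation, keeping each voice distinct.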
Sub-150ms Latency
Time-to-first-audio is about 100ms on H200 GPUs. The SGLang-based inference engine uses continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. If you use the same reference voice repeatedly, RadixAttention caches the prefix states and skips recomputing the reference for every request.
44.1kHz Output
The RVQ-based audio codec uses 10 codebooks at ~21 Hz frame rate to reconstruct high-fidelity 44.1kHz audio. The Slow AR predicts the primary semantic codebook while the Fast AR fills in the remaining 9 residual codebooks, preserving fine acoustic details like breathiness, texture, and timbre.
80+ Language Support
No phonemes or language-specific preprocessing required. Tier 1 languages (highest quality) include English, Chinese, and Japanese. Tier 2 includes Korean, Spanish, Portuguese, Arabic, Russian, French, and German. Additional languages include Swedish, Italian, Turkish, Dutch, Hindi, Thai, Vietnamese, and more.
How does Fish Audio S2 compare to other TTS models?
Fish Audio S2 Pro ranks #1 in blind A/B testing against all major TTS providers with a Bradley-Terry score of 3.07 (1.7x the next best). On Seed-TTS Eval, it achieves the lowest Word Error Rate among all models including closed-source systems. On EmergentTTS-Eval, it wins 81.88% against gpt-4o-mini-tts. ElevenLabs leads on ecosystem maturity and ease of use. OpenAI TTS integrates tightly with GPT workflows.
| Model | Emotion Control | Voice Cloning | Open Source | Audio Turing Test |
|---|---|---|---|---|
| Fish Audio S2 | 1,500+ inline tags (free-form) | Zero-shot (10-30s clip) | Yes | 0.515 |
| ElevenLabs | Style presets | Yes (instant cloning) | No | N/A |
| Seed-TTS | Limited | Yes | No | 0.417 |
| MiniMax Speech-02 | Basic | Yes | No | 0.387 |
Source: Fish Audio blind testing (71,000+ pairs, March-April 2026), Seed-TTS Eval benchmarks, EmergentTTS-Eval, and Audio Turing Test results from Fish Audio S2 Technical Report (arXiv:2603.08823).
How does Fish Audio S2 work?
Fish Audio S2 uses a Dual-Autoregressive architecture built on a decoder-only transformer with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate). The Slow AR (4B parameters) operates along the time axis and predicts the primary semantic codebook. The Fast AR (400M parameters) generates the remaining 9 residual codebooks at each time step, reconstructing fine acoustic detail.
This asymmetric design keeps inference efficient while preserving audio fidelity. Flattening all 10 codebooks along the time axis would cause a 10x sequence-length explosion. By splitting the work between a large time-axis model and a smaller depth-axis model, S2 avoids that bottleneck.
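To make the shape of that computation concrete, here is a minimal sketch of a dual-AR decoding loop using stub models. The constants come from the specs above, but the interfaces and sampling logic are illustrative, not Fish Audio's implementation:

```python
import numpy as np

# From the spec table: ~21 Hz frame rate, 10 codebooks (1 semantic + 9 residual).
FRAME_RATE_HZ = 21
NUM_CODEBOOKS = 10
CODEBOOK_SIZE = 1024  # hypothetical per-codebook vocabulary

rng = np.random.default_rng(0)

def slow_ar_step(semantic_history):
    """Stub for the 4B Slow AR: one step per frame along the time axis,
    predicting the primary semantic code."""
    return int(rng.integers(CODEBOOK_SIZE))

def fast_ar_step(semantic_code, residuals_so_far):
    """Stub for the 400M Fast AR: one step per residual codebook along
    the depth axis within the current frame."""
    return int(rng.integers(CODEBOOK_SIZE))

def generate(seconds):
    num_frames = int(seconds * FRAME_RATE_HZ)
    frames, semantic_history = [], []
    for _ in range(num_frames):
        semantic = slow_ar_step(semantic_history)   # time axis: big model
        semantic_history.append(semantic)
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):          # depth axis: small model
            residuals.append(fast_ar_step(semantic, residuals))
        frames.append([semantic] + residuals)
    return np.array(frames)

codes = generate(seconds=1.0)
print(codes.shape)  # (21, 10)
```

The sequence lengths show the payoff: one second of audio costs about 21 steps of the 4B model, where a fully flattened scheme would need roughly 210 steps of a single large model. The remaining 189 steps here fall to the much cheaper 400M Fast AR.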
For post-training, S2 uses Group Relative Policy Optimization (GRPO). The same models used to filter and annotate training data serve as reward models during reinforcement learning. This eliminates the distribution mismatch between pre-training data and post-training objectives. The reward signal combines semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity.
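The reward composition can be sketched in a few lines. The weights below are hypothetical (the actual weighting is not given here), but the group-relative normalization is the defining step of GRPO: advantages come from comparing candidates against their own group rather than from a learned value function.

```python
import numpy as np

def combined_reward(scores, weights=(0.4, 0.2, 0.2, 0.2)):
    """Hypothetical weighted sum of the four signals named above:
    semantic accuracy, instruction adherence, acoustic preference,
    and timbre similarity."""
    return float(np.dot(weights, scores))

def grpo_advantages(group_rewards):
    """GRPO advantage: normalize each candidate's reward within its
    group of generations for the same prompt."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four candidate generations for one prompt, each scored on the four
# signals (values made up for illustration).
group = [combined_reward(s) for s in [(0.9, 0.8, 0.7, 0.9),
                                      (0.6, 0.9, 0.5, 0.8),
                                      (0.8, 0.4, 0.9, 0.7),
                                      (0.5, 0.5, 0.6, 0.6)]]
print(grpo_advantages(group))
```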
On Floyo, Fish Audio S2 runs through the FishAudioTTSAdvanced node. You write your script with inline emotion tags in a single text field, adjust temperature and repetition penalty if needed, and the node returns audio in about 2 seconds. You can chain this with video generation nodes in the same ComfyUI workflow to add narration or dialogue to AI-generated video.
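If you drive the workflow programmatically instead of through the UI, and assuming you have direct access to a ComfyUI endpoint, a request to ComfyUI's standard /prompt API would look roughly like this. The FishAudioTTSAdvanced input names are assumptions based on the parameters mentioned above, the output wiring is a guess, and the URL is a local placeholder (a hosted Floyo session may not expose this API directly):

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # placeholder endpoint

workflow = {
    "1": {
        "class_type": "FishAudioTTSAdvanced",
        "inputs": {
            # Field names are assumptions, not documented node inputs.
            "text": "[professional broadcast tone] Welcome. [excited] Let's begin!",
            "temperature": 0.7,
            "repetition_penalty": 1.1,
        },
    },
    "2": {
        "class_type": "SaveAudio",  # assumes the node emits a standard AUDIO output
        "inputs": {"audio": ["1", 0], "filename_prefix": "fish_s2"},
    },
}

req = urllib.request.Request(
    COMFYUI_URL,
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # queue confirmation with a prompt_id
```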
Frequently Asked Questions
Common questions about running Fish Audio S2 on Floyo.
How much does Fish Audio S2 cost on Floyo?
Fish Audio S2 runs as an API node on Floyo, so generation costs come from your API Wallet (separate from FloTime). Floyo gives $1 in free API credits on signup. The underlying S2 Pro model is open-source under the Fish Audio Research License (free for research and non-commercial use). Commercial use requires a separate license from Fish Audio.
How do I start generating speech with Fish Audio S2?
Open Floyo in your browser, find the Fish Audio S2 TTS workflow (search "Fish Audio" in the template library), and click Run. Write your script with emotion tags in the text field and hit generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup, no API key required.
Who made Fish Audio S2?
Fish Audio, an independent AI audio company. S2 Pro was released on March 10, 2026. Model weights, fine-tuning code, and the SGLang-based inference engine are all open-source. The technical report is published on arXiv (2603.08823).
How do I control emotion in the generated speech?
Add emotion tags directly in your text prompt. Write [happy], [whispering], [angry], or any free-form description before the line you want affected. Basic emotions: [angry], [sad], [excited], [happy]. Tone: [whispering], [shouting], [professional broadcast tone], [pitch up]. Effects: [laughing], [sobbing], [sighing], [pause]. Tags apply from where you place them until the next tag.
How does Fish Audio S2 compare to ElevenLabs?
Fish Audio S2 ranked #1 in blind A/B testing against all major TTS providers including ElevenLabs. S2's inline emotion control is more granular (free-form tags at the word level vs. style presets). S2 is open-source. ElevenLabs has a more polished consumer interface and broader ecosystem integrations. For raw audio quality and control, S2 has the edge. For ease of use, ElevenLabs is more approachable.
Can I combine Fish Audio S2 with video generation?
Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate a video with Wan 2.7 or Seedance 2.0, then add narration, dialogue, or character voices with Fish Audio S2 in the same workflow. The audio node outputs a file that can be paired with your video output.
How fast is Fish Audio S2?
On Floyo, the workflow generates audio in about 2 seconds. The underlying model achieves ~100ms time-to-first-audio on H200 GPUs with a real-time factor of 0.195. That means 1 second of audio takes about 0.2 seconds to generate.
What languages does Fish Audio S2 support?
80+ languages. Tier 1 (highest quality): English, Chinese, Japanese. Tier 2: Korean, Spanish, Portuguese, Arabic, Russian, French, German. Additional languages include Swedish, Italian, Turkish, Dutch, Hindi, Thai, Vietnamese, and many more. No phonemes or language-specific preprocessing required.
Try Fish Audio S2 on Floyo
Expressive TTS with inline emotion tags, voice cloning, multi-speaker, 80+ languages. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from Fish Audio S2 Technical Report (arXiv:2603.08823), Fish Audio blind testing results, HuggingFace model card, Seed-TTS Eval benchmarks, and EmergentTTS-Eval.

