AI AUDIO GENERATION
Run Fish Audio S2 on Floyo
Expressive text-to-speech with inline emotion control, zero-shot voice cloning, multi-speaker generation, and 80+ languages. Write [happy], [whispering], or [professional broadcast tone] directly in your script.
Run Fish Audio's S2 Pro through ComfyUI in your browser. No API key, no installs, no local GPU.
| Languages | Latency | Sample Rate | Emotion Tags |
|---|---|---|---|
| 80+ | ~100ms first audio | 44.1kHz | 1,500+ inline |
Try Fish Audio S2 Now → Browse All Models
No installation. Runs in browser. Updated April 2026.
What is Fish Audio S2?
Fish Audio S2 Pro is a text-to-speech model released on March 10, 2026. It uses a Dual-Autoregressive architecture with a 4B-parameter Slow AR for semantic prediction and a 400M-parameter Fast AR for acoustic detail. Trained on over 10 million hours of audio across 80+ languages, it supports inline emotion control, zero-shot voice cloning, and multi-speaker generation. It outputs 44.1kHz audio with sub-150ms latency.
Tags: API · audio generation · expressive tts · text to speech · voice synthesis
Workflows: Fish Audio S2 TTS - Expressive Text to Speech · Fish Speech Voice Cloning TTS with Emotion Tags
What are Fish Audio S2's technical specifications?
Fish Audio S2 Pro uses a Dual-Autoregressive architecture: a 4B-parameter Slow AR operating along the time axis for semantic prediction, and a 400M-parameter Fast AR generating 9 residual codebooks at each time step for acoustic detail. It outputs 44.1kHz audio with a real-time factor of 0.195 on H200 GPUs and time-to-first-audio of about 100ms.
| Spec | Details |
|---|---|
| Developer | Fish Audio |
| Architecture | Dual-AR: 4B Slow AR (time axis) + 400M Fast AR (codebook depth) |
| Audio Codec | RVQ-based, 10 codebooks, ~21 Hz frame rate |
| Output Quality | 44.1kHz high-fidelity audio |
| Languages | 80+ (Tier 1: English, Chinese, Japanese. Tier 2: Korean, Spanish, Portuguese, Arabic, Russian, French, German) |
| Emotion Control | 1,500+ inline tags via free-form natural-language descriptions |
| Voice Cloning | Zero-shot from 10-30 second reference clip |
| Multi-Speaker | Native multi-speaker and multi-turn in a single pass |
| Latency | ~100ms time-to-first-audio (H200 GPU) |
| Real-Time Factor | 0.195 on H200 (3,000+ tokens/sec) |
| Training Data | 10M+ hours of multilingual audio |
| Open Source | Yes (Fish Audio Research License, free for research/non-commercial) |
| ComfyUI Access | FishAudioTTSAdvanced node on Floyo |
| Release Date | March 10, 2026 |
What can you create with Fish Audio S2?
Fish Audio S2 covers scripted speech with emotional range, voice cloning, multi-speaker dialogue, narration, character voices, and real-time conversational audio. The inline tag system means tone shifts happen at the word level, not the clip level. You control delivery sentence by sentence within a single generation.
| Capability | What It Does | Use Case |
|---|---|---|
| Inline Emotion Tags | Write [happy], [whispering], [angry], or any free-form description at any point in your script. Delivery changes at that exact word. | Character dialogue, narration with mood shifts, interactive fiction |
| Voice Cloning | Zero-shot cloning from a 10-30 second reference clip. Captures timbre, speaking style, and emotional tendencies. No fine-tuning needed. | Brand voice consistency, character reuse, podcast production |
| Multi-Speaker | Generate complete dialogues between multiple characters in a single pass. Each speaker maintains distinct voice characteristics. | Audiobooks, conversational demos, training materials |
| 80+ Languages | Multilingual TTS without phonemes or language-specific preprocessing. Tier 1: English, Chinese, Japanese. Tier 2: Korean, Spanish, French, German, and more. | Localized content, international marketing, multilingual apps |
| Sound Effects | Inline tags for [laughing], [sobbing], [sighing], [inhale], [exhale], [clearing throat], and more. Placed at the exact point in the script. | Animated character voices, game dialogue, social content |
| Pipeline Integration | Chain with video models in ComfyUI. Generate a video with Wan 2.7 or Seedance, then add narration or dialogue with Fish Audio S2 in the same workflow. | Video production pipelines, explainer videos, product demos |
What are Fish Audio S2's key features?
Fish Audio S2's feature set is built around granular speech control at the word level. Instead of applying a single emotion or style to an entire clip, you embed instructions at specific positions in your text. The model interprets free-form natural-language descriptions, not a fixed tag vocabulary.
Inline Emotion Control
Write tags like [happy], [whispering], [professional broadcast tone], or [pitch up slightly with nervous energy] at any point in your script. The model adjusts delivery at that exact position. Tags apply from where you place them until the next tag. The same sentence reads completely differently with [excited] vs [serious] in front of it. Over 1,500 emotive tags are supported, and you can write custom descriptions beyond the preset list.
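For example, a short script mixing preset and free-form tags might read like this (every tag shown is either documented on this page or a free-form description of the kind the model accepts):

```
[professional broadcast tone] Welcome back to the show.
[excited] Today's episode is a special one. [laughing]
[whispering] Between you and me, we almost didn't pull it off.
[serious] Let's get into it.
```

Each tag takes effect at its position and holds until the next one, so the whisper ends exactly where [serious] begins.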
Zero-Shot Voice Cloning
Provide a 10-30 second reference audio clip. The model captures the speaker's timbre, speaking style, and emotional tendencies and applies them to your script without any fine-tuning. The cloned voice stays consistent across long outputs and responds to emotion tags, so you can make a cloned voice whisper, shout, or laugh while maintaining its identity.
Multi-Speaker Dialogue
Upload reference audio for multiple speakers. The model uses speaker tokens to switch between voices within a single generation. A two-person conversation, a panel discussion, or an audiobook with distinct character voices can all be produced in one pass. No splicing, no separate generations per speaker.
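The speaker-labeling syntax is not spelled out on this page, so treat the [S1]/[S2] markers below as illustrative rather than confirmed syntax; the point is that one script drives the whole conversation:

```
[S1] [cheerful] Morning! Did you listen to the draft episode?
[S2] [skeptical] I did. [sighing] The intro runs long again.
[S1] [laughing] Fair. I'll trim it before we publish.
```

With reference clips uploaded for both speakers, the model renders the exchange in one generation, keeping each voice distinct.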
Sub-150ms Latency
Time-to-first-audio is about 100ms on H200 GPUs. The SGLang-based inference engine uses continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching. If you use the same reference voice repeatedly, RadixAttention caches the prefix states and skips recomputing the reference for every request.
44.1kHz Output
The RVQ-based audio codec uses 10 codebooks at ~21 Hz frame rate to reconstruct high-fidelity 44.1kHz audio. The Slow AR predicts the primary semantic codebook while the Fast AR fills in the remaining 9 residual codebooks, preserving fine acoustic details like breathiness, texture, and timbre.
80+ Language Support
No phonemes or language-specific preprocessing required. Tier 1 languages (highest quality) include English, Chinese, and Japanese. Tier 2 includes Korean, Spanish, Portuguese, Arabic, Russian, French, and German. Additional languages include Swedish, Italian, Turkish, Dutch, Hindi, Thai, Vietnamese, and more.
How does Fish Audio S2 compare to other TTS models?
Fish Audio S2 Pro ranks #1 in blind A/B testing against all major TTS providers with a Bradley-Terry score of 3.07 (1.7x the next best). On Seed-TTS Eval, it achieves the lowest Word Error Rate among all models including closed-source systems. On EmergentTTS-Eval, it wins 81.88% against gpt-4o-mini-tts. ElevenLabs leads on ecosystem maturity and ease of use. OpenAI TTS integrates tightly with GPT workflows.
| Model | Emotion Control | Voice Cloning | Open Source | Audio Turing Test |
|---|---|---|---|---|
| Fish Audio S2 | 1,500+ inline tags (free-form) | Zero-shot (10-30s clip) | Yes | 0.515 |
| ElevenLabs | Style presets | Yes (instant cloning) | No | N/A |
| Seed-TTS | Limited | Yes | No | 0.417 |
| MiniMax Speech-02 | Basic | Yes | No | 0.387 |
Source: Fish Audio blind testing (71,000+ pairs, March-April 2026), Seed-TTS Eval benchmarks, EmergentTTS-Eval, and Audio Turing Test results from Fish Audio S2 Technical Report (arXiv:2603.08823).
How does Fish Audio S2 work?
Fish Audio S2 uses a Dual-Autoregressive architecture built on a decoder-only transformer with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate). The Slow AR (4B parameters) operates along the time axis and predicts the primary semantic codebook. The Fast AR (400M parameters) generates the remaining 9 residual codebooks at each time step, reconstructing fine acoustic detail.
This asymmetric design keeps inference efficient while preserving audio fidelity. Flattening all 10 codebooks along the time axis would cause a 10x sequence-length explosion. By splitting the work between a large time-axis model and a smaller depth-axis model, S2 avoids that bottleneck.
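To make the shape of that computation concrete, here is a minimal sketch of a dual-AR decoding loop using stub models. The constants come from the specs above, but the interfaces and sampling logic are illustrative, not Fish Audio's implementation:

```python
import numpy as np

# From the spec table: ~21 Hz frame rate, 10 codebooks (1 semantic + 9 residual).
FRAME_RATE_HZ = 21
NUM_CODEBOOKS = 10
CODEBOOK_SIZE = 1024  # hypothetical per-codebook vocabulary

rng = np.random.default_rng(0)

def slow_ar_step(semantic_history):
    """Stub for the 4B Slow AR: one step per frame along the time axis,
    predicting the primary semantic code."""
    return int(rng.integers(CODEBOOK_SIZE))

def fast_ar_step(semantic_code, residuals_so_far):
    """Stub for the 400M Fast AR: one step per residual codebook along
    the depth axis within the current frame."""
    return int(rng.integers(CODEBOOK_SIZE))

def generate(seconds):
    num_frames = int(seconds * FRAME_RATE_HZ)
    frames, semantic_history = [], []
    for _ in range(num_frames):
        semantic = slow_ar_step(semantic_history)   # time axis: big model
        semantic_history.append(semantic)
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):          # depth axis: small model
            residuals.append(fast_ar_step(semantic, residuals))
        frames.append([semantic] + residuals)
    return np.array(frames)

codes = generate(seconds=1.0)
print(codes.shape)  # (21, 10)
```

The sequence lengths show the payoff: one second of audio costs about 21 steps of the 4B model, where a fully flattened scheme would need roughly 210 steps of a single large model. The remaining 189 steps here fall to the much cheaper 400M Fast AR.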
For post-training, S2 uses Group Relative Policy Optimization (GRPO). The same models used to filter and annotate training data serve as reward models during reinforcement learning. This eliminates the distribution mismatch between pre-training data and post-training objectives. The reward signal combines semantic accuracy, instruction adherence, acoustic preference scoring, and timbre similarity.
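The reward composition can be sketched in a few lines. The weights below are hypothetical (the actual weighting is not given here), but the group-relative normalization is the defining step of GRPO: advantages come from comparing candidates against their own group rather than from a learned value function.

```python
import numpy as np

def combined_reward(scores, weights=(0.4, 0.2, 0.2, 0.2)):
    """Hypothetical weighted sum of the four signals named above:
    semantic accuracy, instruction adherence, acoustic preference,
    and timbre similarity."""
    return float(np.dot(weights, scores))

def grpo_advantages(group_rewards):
    """GRPO advantage: normalize each candidate's reward within its
    group of generations for the same prompt."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four candidate generations for one prompt, each scored on the four
# signals (values made up for illustration).
group = [combined_reward(s) for s in [(0.9, 0.8, 0.7, 0.9),
                                      (0.6, 0.9, 0.5, 0.8),
                                      (0.8, 0.4, 0.9, 0.7),
                                      (0.5, 0.5, 0.6, 0.6)]]
print(grpo_advantages(group))
```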
On Floyo, Fish Audio S2 runs through the FishAudioTTSAdvanced node. You write your script with inline emotion tags in a single text field, adjust temperature and repetition penalty if needed, and the node returns audio in about 2 seconds. You can chain this with video generation nodes in the same ComfyUI workflow to add narration or dialogue to AI-generated video.
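If you drive the workflow programmatically instead of through the UI, and assuming you have direct access to a ComfyUI endpoint, a request to ComfyUI's standard /prompt API would look roughly like this. The FishAudioTTSAdvanced input names are assumptions based on the parameters mentioned above, the output wiring is a guess, and the URL is a local placeholder (a hosted Floyo session may not expose this API directly):

```python
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # placeholder endpoint

workflow = {
    "1": {
        "class_type": "FishAudioTTSAdvanced",
        "inputs": {
            # Field names are assumptions, not documented node inputs.
            "text": "[professional broadcast tone] Welcome. [excited] Let's begin!",
            "temperature": 0.7,
            "repetition_penalty": 1.1,
        },
    },
    "2": {
        "class_type": "SaveAudio",  # assumes the node emits a standard AUDIO output
        "inputs": {"audio": ["1", 0], "filename_prefix": "fish_s2"},
    },
}

req = urllib.request.Request(
    COMFYUI_URL,
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))  # queue confirmation with a prompt_id
```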
Frequently Asked Questions
Common questions about running Fish Audio S2 on Floyo.
How much does Fish Audio S2 cost on Floyo?
Fish Audio S2 runs as an API node on Floyo, so generation costs come from your API Wallet (separate from FloTime). Floyo gives $1 in free API credits on signup. The underlying S2 Pro model is open-source under the Fish Audio Research License (free for research and non-commercial use). Commercial use requires a separate license from Fish Audio.
How do I start generating speech with Fish Audio S2?
Open Floyo in your browser, find the Fish Audio S2 TTS workflow (search "Fish Audio" in the template library), and click Run. Write your script with emotion tags in the text field and hit generate. Floyo handles the ComfyUI environment and API connection. No local install, no Python setup, no API key required.
Who made Fish Audio S2?
Fish Audio, an independent AI audio company. S2 Pro was released on March 10, 2026. Model weights, fine-tuning code, and the SGLang-based inference engine are all open-source. The technical report is published on arXiv (2603.08823).
How do I control emotion in the generated speech?
Add emotion tags directly in your text prompt. Write [happy], [whispering], [angry], or any free-form description before the line you want affected. Basic emotions: [angry], [sad], [excited], [happy]. Tone: [whispering], [shouting], [professional broadcast tone], [pitch up]. Effects: [laughing], [sobbing], [sighing], [pause]. Tags apply from where you place them until the next tag.
How does Fish Audio S2 compare to ElevenLabs?
Fish Audio S2 ranked #1 in blind A/B testing against all major TTS providers including ElevenLabs. S2's inline emotion control is more granular (free-form tags at the word level vs. style presets). S2 is open-source. ElevenLabs has a more polished consumer interface and broader ecosystem integrations. For raw audio quality and control, S2 has the edge. For ease of use, ElevenLabs is more approachable.
Can I combine Fish Audio S2 with video generation?
Yes. Floyo runs ComfyUI, which lets you chain multiple models. Generate a video with Wan 2.7 or Seedance 2.0, then add narration, dialogue, or character voices with Fish Audio S2 in the same workflow. The audio node outputs a file that can be paired with your video output.
How fast is Fish Audio S2?
On Floyo, the workflow generates audio in about 2 seconds. The underlying model achieves ~100ms time-to-first-audio on H200 GPUs with a real-time factor of 0.195. That means 1 second of audio takes about 0.2 seconds to generate.
What languages does Fish Audio S2 support?
80+ languages. Tier 1 (highest quality): English, Chinese, Japanese. Tier 2: Korean, Spanish, Portuguese, Arabic, Russian, French, German. Additional languages include Swedish, Italian, Turkish, Dutch, Hindi, Thai, Vietnamese, and many more. No phonemes or language-specific preprocessing required.
Try Fish Audio S2 on Floyo
Expressive TTS with inline emotion tags, voice cloning, multi-speaker, 80+ languages. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from Fish Audio S2 Technical Report (arXiv:2603.08823), Fish Audio blind testing results, HuggingFace model card, Seed-TTS Eval benchmarks, and EmergentTTS-Eval.

