AI VIDEO, SPEECH & MUSIC GENERATION

Run MiniMax on Floyo

Video generation with cinematic physics (Hailuo), #1-ranked studio-grade TTS (Speech 2.8 HD), and full-song music generation with vocals (Music 2.6). Three modalities, one ecosystem.

Run MiniMax models through ComfyUI in your browser. No API key, no installs, no local GPU.

Video (Hailuo)	1080p, 6-10s clips
Speech (2.8 HD)	#1 TTS Arena, 40+ languages
Music (2.6)	Full songs, up to 6 min
Modalities	Video + Speech + Music

No installation. Runs in browser. Updated April 2026.

What You Get

MiniMax is a full-stack AI creation platform spanning video, speech, and music. Hailuo (video) generates cinematic 1080p clips at 24fps with best-in-class physics simulation. Speech 2.8 HD (TTS) is ranked #1 on both Artificial Analysis and HuggingFace TTS Arenas with 7 emotion modes, inline interjections, and voice cloning from 5 seconds. Music 2.6 generates complete songs with vocals and instrumentals up to 6 minutes long with 14 structure tags, 100+ instruments, and cover generation. All three run as ComfyUI API nodes on Floyo.

MINIMAX WORKFLOWS ON FLOYO

MiniMax Text to Video (Hailuo)

Next Level Motion from Images Using MiniMax (Image-to-Video)

MiniMax Speech 2.8 HD for Text to Speech

MiniMax Music 2.6 Text to Music

What is MiniMax?

MiniMax is a global AI foundation model company that ships across video, speech, and music generation. The company builds Talkie (150 million+ users), and its models consistently rank at the top of public benchmarks. On Floyo, three MiniMax model families are available: Hailuo for video, Speech 2.8 HD for text-to-speech, and Music 2.6 for song generation.

Hailuo is the video generation line. The latest versions (Hailuo 2.0 and 2.3) use a Noise-aware Compute Redistribution (NCR) architecture that generates native 1080p video at 24fps with best-in-class physics simulation. Cloth, water, debris, camera shake, and complex body movements look physically plausible. Both text-to-video and image-to-video are supported with clips up to 10 seconds.

Speech 2.8 HD is the TTS line. It ranks #1 on both the Artificial Analysis Speech Arena and HuggingFace TTS Arena. It offers seven emotion modes (neutral, happy, sad, angry, fearful, disgusted, surprised), inline interjections like (laughs) and (sighs), voice cloning from 5 seconds of audio, and 40+ languages with tonal nuance preservation.
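
As a rough illustration of the emotion modes and inline interjections described above, the sketch below checks a script before it is sent to a text-to-speech node. It is not Floyo's or MiniMax's API: the function name `check_tts_script` is hypothetical, and the check is deliberately strict, treating any single lowercase word in parentheses as an interjection tag. The emotion and interjection names are the ones listed on this page.

```python
import re

# The seven emotion modes and four interjection tags named on this page.
EMOTIONS = {"neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}
INTERJECTIONS = {"laughs", "sighs", "coughs", "gasps"}

# Matches a single lowercase word in parentheses, e.g. "(laughs)".
_TAG = re.compile(r"\(([a-z]+)\)")


def check_tts_script(text: str, emotion: str) -> list[str]:
    """Return the interjection tags found in `text`, in order of appearance.

    Hypothetical helper: raises ValueError for an emotion or tag that this
    page does not list.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    tags = _TAG.findall(text)
    unknown = [t for t in tags if t not in INTERJECTIONS]
    if unknown:
        raise ValueError(f"unrecognized interjection tags: {unknown}")
    return tags


script = "Well, that went better than expected (laughs). Now, back to work (sighs)."
print(check_tts_script(script, "happy"))  # ['laughs', 'sighs']
```

Catching a misspelled tag locally is cheaper than discovering it in generated audio, which is the only reason for the strictness here.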

Music 2.6 is the music generation line, released April 10, 2026. It generates complete songs with vocals and instrumentals up to 6 minutes long. Fourteen structure tags control the arrangement, and 100+ instrument tones are available. BPM and key signature are followed with 99%+ accuracy when specified in the prompt. AI cover generation rebuilds existing songs in new styles. Output is studio-quality 44.1kHz/256kbps audio.
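
To make the structure tags and BPM/key control more concrete, here is a minimal sketch of assembling a tagged lyrics string and a style prompt. The helper names (`build_lyrics`, `build_style_prompt`) and the exact prompt wording ("120 BPM, key of A minor") are assumptions for illustration, not a documented Music 2.6 input format. Only the six tags this page names are accepted, since the full list of 14 is not given here.

```python
# Structure tags this page names explicitly. The other tags among the 14 are
# not listed on this page, so this set is deliberately incomplete.
KNOWN_TAGS = {"Verse", "Chorus", "Bridge", "Drop", "Solo", "Build Up"}


def build_lyrics(sections: list[tuple[str, str]]) -> str:
    """Join (tag, lyrics) pairs into one string with [Tag] headers.

    An empty lyrics string yields a bare tag, e.g. for an instrumental solo.
    """
    parts = []
    for tag, lyrics in sections:
        if tag not in KNOWN_TAGS:
            raise ValueError(f"tag not named on this page: {tag!r}")
        body = lyrics.strip()
        parts.append(f"[{tag}]\n{body}" if body else f"[{tag}]")
    return "\n\n".join(parts)


def build_style_prompt(genre: str, bpm: int, key: str) -> str:
    """State tempo and key in the style prompt (wording is an assumption)."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return f"{genre}, {bpm} BPM, key of {key}"


lyrics = build_lyrics([
    ("Verse", "City lights are fading out"),
    ("Chorus", "We run until the morning"),
    ("Solo", ""),
])
print(build_style_prompt("synthwave", 120, "A minor"))  # synthwave, 120 BPM, key of A minor
print(lyrics)
```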

On Floyo, all three run through ComfyUI API nodes. You can chain them in one workflow: generate a video with Hailuo, add narration with Speech 2.8 HD, and compose a soundtrack with Music 2.6. All in one pipeline.
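
One practical detail when chaining the three models is that Hailuo clips run 6 or 10 seconds, while a narration can be any length. The sketch below picks a set of clip lengths that covers a narration with the least overshoot, then the fewest clips. It assumes the clips are simply played back to back and that the narration length is already known; `plan_clips` is a hypothetical planning helper, not part of ComfyUI or Floyo.

```python
import math


def plan_clips(target_seconds: int) -> list[int]:
    """Cover `target_seconds` with 6 s and 10 s clips.

    Tries every count of 10 s clips, fills the rest with 6 s clips, and keeps
    the plan with the smallest overshoot (ties broken by fewer clips).
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    best = None
    for tens in range(math.ceil(target_seconds / 10) + 1):
        remaining = target_seconds - 10 * tens
        sixes = math.ceil(remaining / 6) if remaining > 0 else 0
        total = 10 * tens + 6 * sixes
        candidate = (total - target_seconds, tens + sixes, tens, sixes)
        if best is None or candidate[:2] < best[:2]:
            best = candidate
    _, _, tens, sixes = best
    return [10] * tens + [6] * sixes


print(plan_clips(45))  # [10, 10, 10, 10, 6]
print(plan_clips(12))  # [6, 6]
```

For a 45-second narration the plan totals 46 seconds of video across five clips, leaving one second to trim or hold on the final frame.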

What are MiniMax technical specifications?

MiniMax spans three modalities. Hailuo video uses NCR architecture for 1080p@24fps with physics-focused rendering. Speech 2.8 HD uses an autoregressive Transformer with Flow-VAE decoder for broadcast-grade TTS. Music 2.6 generates full songs with vocals and instrumentals at 44.1kHz/256kbps with BPM/key control. All run as API nodes on Floyo.

Spec	Details
Developer	MiniMax (makers of Talkie, 150M+ users)
HAILUO (VIDEO)
Architecture	Noise-aware Compute Redistribution (NCR) diffusion transformer
Resolution	768p (Standard) / Native 1080p (Pro)
Frame Rate	24fps
Duration	6 or 10 seconds per clip
Modes	Text-to-video (T2V), Image-to-video (I2V), Subject-to-video (S2V)
Physics	Best-in-class rigid body, fluid, cloth, and camera simulation
Versions	Hailuo 2.0 (Oct 2025) / Hailuo 2.3 (Dec 2025)
SPEECH 2.8 HD (TTS)
Architecture	Autoregressive Transformer + learnable speaker encoder + hybrid Flow-VAE decoder
Languages	40+
Emotions	7 modes (neutral, happy, sad, angry, fearful, disgusted, surprised)
Interjections	Inline: (laughs), (sighs), (coughs), (gasps)
Voice Cloning	From 5 seconds of reference audio
Arena Rankings	#1 Artificial Analysis + #1 HuggingFace TTS Arena
MUSIC 2.6
Output	Complete songs with vocals + instrumentals (up to 6 minutes)
Audio Quality	44.1kHz / 256kbps (studio-grade)
Instruments	100+ tones (orchestral, electric, synth, ethnic)
Structure Tags	14 ([Verse], [Chorus], [Bridge], [Drop], [Solo], [Build Up], etc.)
BPM/Key Control	99%+ accuracy when specified in prompt
Cover Generation	Upload a song, extract melody, restyle in any genre/language
Release Date	April 10, 2026
ComfyUI Access	API-based nodes on Floyo (4 workflows)
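
Two of the figures in the table translate directly into sizes worth knowing before planning a project: frames per clip at 24fps, and the approximate file size of a song at 256kbps. The sketch below does that arithmetic. It assumes a constant bitrate, 1 kbps = 1,000 bits per second, and no container overhead, so actual files will differ somewhat.

```python
FPS = 24            # Hailuo frame rate from the table
AUDIO_KBPS = 256    # Music 2.6 bitrate from the table


def frames_per_clip(seconds: int) -> int:
    """Number of video frames in a clip of the given length."""
    return FPS * seconds


def audio_megabytes(seconds: float, kbps: int = AUDIO_KBPS) -> float:
    """Approximate audio size in decimal megabytes at a constant bitrate."""
    bits = kbps * 1000 * seconds
    return bits / 8 / 1_000_000


print(frames_per_clip(6))        # 144
print(frames_per_clip(10))       # 240
print(audio_megabytes(6 * 60))   # 11.52 (a full 6-minute song)
```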

What can you create with MiniMax?

MiniMax covers cinematic video generation, image animation, studio-grade voiceover, full-song composition, and end-to-end multimedia pipelines. On Floyo, you can chain all three modalities in one ComfyUI workflow: generate video with Hailuo, add narration with Speech 2.8 HD, compose a soundtrack with Music 2.6, and export a complete multimedia package.

Capability	What It Does	Use Case
Text-to-Video (Hailuo)	Generate 768p or 1080p video clips at 24fps from text prompts. Best-in-class physics: water, debris, cloth, camera shake. 6-10 second clips.	Product demos, social content, short films, ads
Image-to-Video (Hailuo)	Animate still images into cinematic clips. Preserves composition while adding natural motion, lighting, and camera movement.	Photo animation, product showcases, motion graphics
Text-to-Speech (Speech 2.8)	Studio-grade voiceover with 7 emotions, inline interjections, 17+ preset voices, voice cloning, and 40+ languages. Broadcast-quality audio.	Narration, audiobooks, podcasts, ads, e-learning
Text-to-Music (Music 2.6)	Generate full songs with vocals and instrumentals up to 6 minutes. 14 structure tags, 100+ instruments, BPM/key control, auto-lyrics.	Soundtracks, jingles, game audio, content music
AI Cover Generation	Upload an existing song. Music 2.6 extracts the melodic skeleton and rebuilds it in any style, arrangement, or language.	Remixes, localized versions, style experiments
Multi-Modal Pipelines	Chain Hailuo + Speech 2.8 + Music 2.6 in one ComfyUI workflow. Generate video, add voiceover, compose soundtrack. Export a complete package.	Ad production, content creation, multimedia campaigns

What are MiniMax key features?

MiniMax's feature set spans three modalities, each best-in-class for different reasons. Hailuo leads on physics simulation. Speech 2.8 HD leads on TTS quality benchmarks. Music 2.6 leads on full-song generation with structural control. The real advantage is using all three together in one pipeline.

Cinematic Physics (Hailuo)

Hailuo's NCR architecture was specifically trained on physics-heavy scenarios. Rigid body collisions, fluid dynamics, cloth simulation, and camera shake are physically plausible. Facial emotion rendering captures the difference between a polite smile and a genuine one. These are the capabilities that separate Hailuo from competitors that render pretty scenes but break down when objects interact.

#1-Ranked TTS (Speech 2.8 HD)

Speech 2.8 HD ranks #1 on both the Artificial Analysis and HuggingFace TTS Arenas in blind human preference tests. Its seven emotion modes control pitch contour, timing, emphasis, and breath patterns. Inline interjections (laughs, sighs, coughs) render naturally at any point in the text. Voice cloning works from 5 seconds of audio, and 40+ languages are supported with tonal nuance preserved.

Full-Song Generation (Music 2.6)

Generate complete songs up to 6 minutes with vocals and instrumentals from a style prompt and lyrics. 14 structure tags ([Verse], [Chorus], [Bridge], [Drop], [Solo], etc.) control the arrangement, with 100+ instrument tones to choose from. BPM and key signature are followed with 99%+ accuracy when specified in the prompt. Auto-lyrics generation writes lyrics from your prompt, and instrumental-only mode produces background music. Cover generation restyles existing songs.

Voice Cloning (Speech 2.8)

Clone a voice from 5 seconds of reference audio. The 2.8 version significantly improved timbre similarity over 2.6, especially for cross-language cloning. The cloned voice responds to emotion modes and interjection tags while maintaining its identity. Longer reference samples (10-30 seconds) improve accuracy.

Humanized Vocals (Music 2.6)

Natural breathing, delicate vibrato, seamless transitions between vocal registers. Vocals evolve dynamically across sections: emotional intensity shifts from verse to chorus, vocal technique adjusts per section. This eliminates the robotic quality that plagues most AI-generated singing.

Native 1080p Video (Hailuo)

Hailuo Pro generates at native 1080p, not upscaled from a lower resolution. Standard generates at 768p. Both support text-to-video and image-to-video with 6- or 10-second clips. Last-frame conditioning controls where the video ends. Subject-to-video (S2V) maintains character consistency across clips.

How does MiniMax compare to other models?

MiniMax is the only provider with top-ranked models across video, speech, and music in a single ecosystem. Hailuo leads on physics simulation for video. Speech 2.8 HD holds #1 on both major TTS arenas. Music 2.6 is the most controllable full-song generator available. Competitors lead in individual categories but don't offer all three modalities from one provider.

Model	Category	Key Strength	Limitation vs MiniMax
MiniMax (Hailuo + Speech + Music)	Video + Speech + Music	Full-stack, all top-ranked	-
Wan 2.7	Video + Image	Open source, 4K, thinking mode	No speech or music
Fish Audio S2	Speech	1,500+ free-form emotion tags, open source	No video or music
Suno	Music	Large community, easy interface	Less structural control, no video/speech
ElevenLabs	Speech	Consumer ecosystem, ease of use	No video or music, ranked below MiniMax

Source: Artificial Analysis arenas, HuggingFace TTS Arena, Replicate model documentation, MiniMax official announcements, and third-party benchmark comparisons as of April 2026.

Frequently Asked Questions

Common questions about running MiniMax on Floyo.

Is MiniMax free to use on Floyo?

You can start with Floyo's free pricing plan. Floyo gives $0.25 in free API credits on signup. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. All MiniMax models run as API nodes, so generation costs come from your API Wallet (separate from your plan's GPU time).

How do I run MiniMax without installing anything?

Open Floyo in your browser, search "MiniMax" in the template library, and pick a workflow (video, speech, or music). Click Run, provide your inputs, and generate. Floyo handles the ComfyUI environment and API connections. No local install, no Python setup, no API key management.

Who made MiniMax?

MiniMax is a global AI foundation model company. They build Talkie (150 million+ users, 90+ minute average sessions), Hailuo video models, Speech TTS models, Music generation models, and the M2.7 text/coding model. The company ships across every major modality: text, video, speech, and music.

What MiniMax models are available on Floyo?

Four workflows cover three modalities: Hailuo text-to-video and Hailuo image-to-video (cinematic clips), Speech 2.8 HD text-to-speech (studio-grade voiceover), and Music 2.6 text-to-music (complete songs with vocals). All run as ComfyUI API nodes and can be chained in one workflow.

Can I chain video, speech, and music in one workflow?

Yes. That is the main advantage of running MiniMax on Floyo's ComfyUI platform. Generate a product demo video with Hailuo, add a professional voiceover with Speech 2.8 HD, compose a background soundtrack with Music 2.6, and export the complete multimedia package. All in one pipeline, all in your browser.

Can Music 2.6 generate songs with vocals?

Yes. Music 2.6 generates complete songs with humanized vocals and instrumentals. The vocals feature natural breathing, vibrato, and register transitions. You can also enable instrumental-only mode for background music, or use auto-lyrics generation to have the model write lyrics from your style description.

How does MiniMax Speech 2.8 HD compare to Fish Audio S2?

Both rank #1 on major TTS benchmarks, though on different arenas. Fish Audio S2 offers more granular emotion control with 1,500+ free-form tags and 80+ languages. MiniMax Speech 2.8 HD offers 7 structured emotion modes, inline interjections, and voice cloning from just 5 seconds of audio (vs 10-30 seconds for Fish Audio). MiniMax has warmer, broadcast-ready fidelity. Fish Audio is open source. Both are available on Floyo.

Can I use MiniMax output commercially?

Yes. Generated content can be used commercially according to MiniMax's terms of service. Check specific terms for your use case, especially around generated content containing identifiable voices or copyrighted musical references.

Try MiniMax on Floyo

Cinematic video, #1-ranked TTS, and full-song music generation. Three modalities, one ComfyUI pipeline. Run it in your browser.
