
AI VIDEO, SPEECH & MUSIC GENERATION
Run MiniMax on Floyo
Video generation with cinematic physics (Hailuo), #1-ranked studio-grade TTS (Speech 2.8 HD), and full-song music generation with vocals (Music 2.6). Three modalities, one ecosystem.
Run MiniMax models through ComfyUI in your browser. No API key, no installs, no local GPU.
| Video (Hailuo) | Speech (2.8 HD) | Music (2.6) | Modalities |
|---|---|---|---|
| 1080p, 6-10s clips | #1 TTS Arena, 40+ languages | Full songs, up to 6 min | Video + Speech + Music |
| Try MiniMax Now → | Browse All Models |
No installation. Runs in browser. Updated April 2026.









What You Get
MiniMax is a full-stack AI creation platform spanning video, speech, and music. Hailuo (video) generates cinematic 1080p clips at 24fps with best-in-class physics simulation. Speech 2.8 HD (TTS) is ranked #1 on both Artificial Analysis and HuggingFace TTS Arenas with 7 emotion modes, inline interjections, and voice cloning from 5 seconds. Music 2.6 generates complete songs with vocals and instrumentals up to 6 minutes long with 14 structure tags, 100+ instruments, and cover generation. All three run as ComfyUI API nodes on Floyo.
MINIMAX WORKFLOWS ON FLOYO
MiniMax Text to Video (Hailuo)
Next Level Motion from Images Using MiniMax (Image-to-Video)
What is MiniMax?
MiniMax is a global AI foundation model company that ships across video, speech, and music generation. The company builds Talkie (150 million+ users), and its models consistently rank at the top of public benchmarks. On Floyo, three MiniMax model families are available: Hailuo for video, Speech 2.8 HD for text-to-speech, and Music 2.6 for song generation.
Hailuo is the video generation line. The latest versions (Hailuo 2.0 and 2.3) use a Noise-aware Compute Redistribution (NCR) architecture that generates native 1080p video at 24fps with best-in-class physics simulation. Cloth, water, debris, camera shake, and complex body movements look physically plausible. Both text-to-video and image-to-video are supported with clips up to 10 seconds.
Speech 2.8 HD is the TTS line. It ranks #1 on both the Artificial Analysis Speech Arena and HuggingFace TTS Arena. It offers seven emotion modes (neutral, happy, sad, angry, fearful, disgusted, surprised), inline interjections like (laughs) and (sighs), voice cloning from 5 seconds of audio, and 40+ languages with tonal nuance preservation.
Music 2.6 is the music generation line, released April 10, 2026. It generates complete songs with vocals and instrumentals up to 6 minutes long. 14 structure tags control arrangement. 100+ instrument tones. BPM and key signature control with 99%+ accuracy. AI cover generation rebuilds existing songs in new styles. Studio-quality 44.1kHz/256kbps output.
On Floyo, all three run through ComfyUI API nodes. You can chain them in one workflow: generate a video with Hailuo, add narration with Speech 2.8 HD, and compose a soundtrack with Music 2.6. All in one pipeline.
What are MiniMax's technical specifications?
MiniMax spans three modalities. Hailuo video uses NCR architecture for 1080p@24fps with physics-focused rendering. Speech 2.8 HD uses an autoregressive Transformer with Flow-VAE decoder for broadcast-grade TTS. Music 2.6 generates full songs with vocals and instrumentals at 44.1kHz/256kbps with BPM/key control. All run as API nodes on Floyo.
| Spec | Details |
|---|---|
| Developer | MiniMax (makers of Talkie, 150M+ users) |
| HAILUO (VIDEO) | |
| Architecture | Noise-aware Compute Redistribution (NCR) diffusion transformer |
| Resolution | 768p (Standard) / Native 1080p (Pro) |
| Frame Rate | 24fps |
| Duration | 6 or 10 seconds per clip |
| Modes | Text-to-video (T2V), Image-to-video (I2V), Subject-to-video (S2V) |
| Physics | Best-in-class rigid body, fluid, cloth, and camera simulation |
| Versions | Hailuo 2.0 (Oct 2025) / Hailuo 2.3 (Dec 2025) |
| SPEECH 2.8 HD (TTS) | |
| Architecture | Autoregressive Transformer + learnable speaker encoder + hybrid Flow-VAE decoder |
| Languages | 40+ |
| Emotions | 7 modes (neutral, happy, sad, angry, fearful, disgusted, surprised) |
| Interjections | Inline: (laughs), (sighs), (coughs), (gasps) |
| Voice Cloning | From 5 seconds of reference audio |
| Arena Rankings | #1 Artificial Analysis + #1 HuggingFace TTS Arena |
| MUSIC 2.6 | |
| Output | Complete songs with vocals + instrumentals (up to 6 minutes) |
| Audio Quality | 44.1kHz / 256kbps (studio-grade) |
| Instruments | 100+ tones (orchestral, electric, synth, ethnic) |
| Structure Tags | 14 ([Verse], [Chorus], [Bridge], [Drop], [Solo], [Build Up], etc.) |
| BPM/Key Control | 99%+ accuracy when specified in prompt |
| Cover Generation | Upload a song, extract melody, restyle in any genre/language |
| Release Date | April 10, 2026 |
| ComfyUI Access | API-based nodes on Floyo (4 workflows) |
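The video and audio figures in the table translate into concrete frame counts and file sizes. The following is a back-of-envelope sketch in Python, assuming constant-bitrate audio and ignoring container overhead. The function names are illustrative and not part of any MiniMax or Floyo API.

```python
def frame_count(duration_s: int, fps: int = 24) -> int:
    """Number of frames in a clip of the given length."""
    return duration_s * fps


def audio_size_mb(duration_s: int, bitrate_kbps: int = 256) -> float:
    """Approximate size in megabytes (10^6 bytes) of constant-bitrate audio.

    Assumption: the 256kbps figure is a constant bitrate; real files also
    carry container overhead that is ignored here.
    """
    bits = duration_s * bitrate_kbps * 1000
    return bits / 8 / 1_000_000


if __name__ == "__main__":
    for seconds in (6, 10):
        print(f"{seconds}s clip at 24fps: {frame_count(seconds)} frames")
    print(f"6-minute song at 256kbps: ~{audio_size_mb(6 * 60):.1f} MB")
```

Under these assumptions, a 6-second clip is 144 frames, a 10-second clip is 240 frames, and a full 6-minute song comes to roughly 11.5 MB.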
What can you create with MiniMax?
MiniMax covers cinematic video generation, image animation, studio-grade voiceover, full-song composition, and end-to-end multimedia pipelines. On Floyo, you can chain all three modalities in one ComfyUI workflow: generate video with Hailuo, add narration with Speech 2.8 HD, compose a soundtrack with Music 2.6, and export a complete multimedia package.
| Capability | What It Does | Use Case |
|---|---|---|
| Text-to-Video (Hailuo) | Generate 768p or 1080p video clips at 24fps from text prompts. Best-in-class physics: water, debris, cloth, camera shake. 6-10 second clips. | Product demos, social content, short films, ads |
| Image-to-Video (Hailuo) | Animate still images into cinematic clips. Preserves composition while adding natural motion, lighting, and camera movement. | Photo animation, product showcases, motion graphics |
| Text-to-Speech (Speech 2.8) | Studio-grade voiceover with 7 emotions, inline interjections, 17+ preset voices, voice cloning, and 40+ languages. Broadcast-quality audio. | Narration, audiobooks, podcasts, ads, e-learning |
| Text-to-Music (Music 2.6) | Generate full songs with vocals and instrumentals up to 6 minutes. 14 structure tags, 100+ instruments, BPM/key control, auto-lyrics. | Soundtracks, jingles, game audio, content music |
| AI Cover Generation | Upload an existing song. Music 2.6 extracts the melodic skeleton and rebuilds it in any style, arrangement, or language. | Remixes, localized versions, style experiments |
| Multi-Modal Pipelines | Chain Hailuo + Speech 2.8 + Music 2.6 in one ComfyUI workflow. Generate video, add voiceover, compose soundtrack. Export a complete package. | Ad production, content creation, multimedia campaigns |
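The multi-modal pipeline in the last row can be pictured as an ordered list of stages. The sketch below is only a planning aid in plain Python: the `Stage` class, the `build_pipeline` helper, and the field names are hypothetical and do not correspond to actual ComfyUI node names or Floyo APIs.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    model: str   # model family named on this page
    output: str  # kind of asset the stage produces
    prompt: str  # text input for the stage (hypothetical field)


def build_pipeline(video_prompt: str, narration: str, music_style: str) -> list[Stage]:
    """Return the three stages in the order the page describes."""
    return [
        Stage("Hailuo", "video", video_prompt),
        Stage("Speech 2.8 HD", "voiceover", narration),
        Stage("Music 2.6", "soundtrack", music_style),
    ]


def summarize(pipeline: list[Stage]) -> str:
    """One-line description of the stage order."""
    return " -> ".join(f"{s.model} ({s.output})" for s in pipeline)


if __name__ == "__main__":
    plan = build_pipeline(
        "A ceramic mug rotating on a wooden table, soft morning light",
        "Meet the mug that keeps your coffee warm for hours.",
        "warm acoustic folk, 90 BPM, key of G major",
    )
    print(summarize(plan))
```

In an actual Floyo workflow the stages would be nodes wired together in the ComfyUI graph; the list here only captures the order and the inputs to prepare for each stage.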
What are MiniMax's key features?
MiniMax's feature set spans three modalities, each best-in-class for different reasons. Hailuo leads on physics simulation. Speech 2.8 HD leads on TTS quality benchmarks. Music 2.6 leads on full-song generation with structural control. The real advantage is using all three together in one pipeline.
Cinematic Physics (Hailuo)
Hailuo's NCR architecture was specifically trained on physics-heavy scenarios. Rigid body collisions, fluid dynamics, cloth simulation, and camera shake are physically plausible. Facial emotion rendering captures the difference between a polite smile and a genuine one. These are the capabilities that separate Hailuo from competitors that render pretty scenes but break down when objects interact.
#1-Ranked TTS (Speech 2.8 HD)
Ranked #1 on both Artificial Analysis and HuggingFace TTS Arenas in blind human preference tests. Seven emotion modes that control pitch contour, timing, emphasis, and breath patterns. Inline interjections (laughs, sighs, coughs) render naturally at any point in the text. Voice cloning from 5 seconds of audio. 40+ languages with tonal nuance preserved.
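To make the emotion modes and inline interjections concrete, here is a minimal Python sketch that checks a script before it is sent to a TTS node. The seven emotion names and the four interjections come from this page; the `build_tts_request` function and the dictionary keys are hypothetical and are not the actual node or API parameters.

```python
import re

# The seven emotion modes and the interjections listed on this page.
EMOTIONS = {"neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}
INTERJECTIONS = {"laughs", "sighs", "coughs", "gasps"}


def build_tts_request(text: str, emotion: str = "neutral") -> dict:
    """Validate the emotion and collect recognized inline interjections.

    The returned dict layout is an assumption made for illustration only.
    """
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion!r}")
    # Find single-word parenthesized tags such as "(laughs)".
    found = [tag for tag in re.findall(r"\((\w+)\)", text) if tag in INTERJECTIONS]
    return {"text": text, "emotion": emotion, "interjections": found}


if __name__ == "__main__":
    request = build_tts_request("Well (sighs) that was close. (laughs)", "happy")
    print(request["emotion"], request["interjections"])
```

Listing the recognized interjections up front makes it easy to spot a mistyped tag, which would otherwise be read aloud as ordinary text.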
Full-Song Generation (Music 2.6)
Generate complete songs up to 6 minutes with vocals and instrumentals from a style prompt and lyrics. 14 structure tags ([Verse], [Chorus], [Bridge], [Drop], [Solo], etc.) control the arrangement. 100+ instrument tones. BPM and key signature control with 99%+ accuracy. Auto-lyrics generation writes lyrics from your prompt. Instrumental-only mode produces background music. Cover generation restyles existing songs.
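A short Python sketch of how a style prompt and tagged lyrics might be assembled. Only the six structure tags named on this page are included (the full set of 14 is not listed here), and the exact prompt wording, the `build_lyrics` helper, and the `build_style_prompt` helper are assumptions for illustration rather than a documented MiniMax format.

```python
# Structure tags named on this page; the remaining tags are not listed here.
KNOWN_TAGS = {"Verse", "Chorus", "Bridge", "Drop", "Solo", "Build Up"}


def build_lyrics(sections: list[tuple[str, list[str]]]) -> str:
    """Join (tag, lines) pairs into a lyrics string with [Tag] headers."""
    blocks = []
    for tag, lines in sections:
        if tag not in KNOWN_TAGS:
            raise ValueError(f"tag not in the known subset: {tag!r}")
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n\n".join(blocks)


def build_style_prompt(genre: str, bpm: int, key: str) -> str:
    """State BPM and key explicitly, since control applies when they are in the prompt."""
    return f"{genre}, {bpm} BPM, key of {key}"


if __name__ == "__main__":
    print(build_style_prompt("synth-pop", 120, "A minor"))
    print(build_lyrics([
        ("Verse", ["City lights are fading slow", "I still hear the radio"]),
        ("Chorus", ["Turn it up, turn it up tonight"]),
        ("Solo", []),
    ]))
```

An empty section such as the `Solo` entry above emits only its tag, which is one way to mark an instrumental passage.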
Voice Cloning (Speech 2.8)
Clone a voice from 5 seconds of reference audio. Speech 2.8 significantly improved timbre similarity over Speech 2.6, especially for cross-language cloning. The cloned voice responds to emotion modes and interjection tags while maintaining its identity. Longer reference samples (10-30 seconds) improve accuracy.
Humanized Vocals (Music 2.6)
Natural breathing, delicate vibrato, seamless transitions between vocal registers. Vocals evolve dynamically across sections: emotional intensity shifts from verse to chorus, vocal technique adjusts per section. This eliminates the robotic quality that plagues most AI-generated singing.
Native 1080p Video (Hailuo)
Hailuo Pro generates at native 1080p, not upscaled from lower resolution. Standard generates at 768p. Both support text-to-video and image-to-video with 6 or 10-second clips. Last-frame conditioning controls where the video ends. Subject-to-video (S2V) maintains character consistency across clips.
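Because each clip is either 6 or 10 seconds long, a longer sequence has to be assembled from several clips. The sketch below is a hypothetical planning helper, not a Floyo or MiniMax feature: it picks the combination of 6- and 10-second clips that covers a target runtime with the least overshoot, preferring fewer clips when there is a tie.

```python
def plan_clips(target_s: int) -> list[int]:
    """Choose 6s and 10s clips whose total covers target_s with minimal overshoot."""
    if target_s <= 0:
        raise ValueError("target_s must be positive")
    best = None
    for tens in range(target_s // 10 + 2):
        remaining = max(0, target_s - tens * 10)
        sixes = -(-remaining // 6)  # ceiling division
        total = tens * 10 + sixes * 6
        # Compare by overshoot first, then by number of clips.
        candidate = (total - target_s, tens + sixes, tens, sixes)
        if best is None or candidate < best:
            best = candidate
    _, _, tens, sixes = best
    return [10] * tens + [6] * sixes


if __name__ == "__main__":
    for target in (16, 30, 45):
        clips = plan_clips(target)
        print(f"{target}s target: clips {clips}, total {sum(clips)}s")
```

For example, a 30-second sequence maps to three 10-second clips, and a 16-second sequence maps to one 10-second clip plus one 6-second clip.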
How does MiniMax compare to other models?
MiniMax is the only provider with top-ranked models across video, speech, and music in a single ecosystem. Hailuo leads on physics simulation for video. Speech 2.8 HD holds #1 on both major TTS arenas. Music 2.6 is the most controllable full-song generator available. Competitors lead in individual categories but don't offer all three modalities from one provider.
| Model | Category | Key Strength | Limitation vs MiniMax |
|---|---|---|---|
| MiniMax (Hailuo + Speech + Music) | Video + Speech + Music | Full-stack, all top-ranked | - |
| Wan 2.7 | Video + Image | Open source, 4K, thinking mode | No speech or music |
| Fish Audio S2 | Speech | 1,500+ free-form emotion tags, open source | No video or music |
| Suno | Music | Large community, easy interface | Less structural control, no video/speech |
| ElevenLabs | Speech | Consumer ecosystem, ease of use | No video or music, ranked below MiniMax |
Source: Artificial Analysis arenas, HuggingFace TTS Arena, Replicate model documentation, MiniMax official announcements, and third-party benchmark comparisons as of April 2026.
Frequently Asked Questions
Common questions about running MiniMax on Floyo.
Is MiniMax free to use on Floyo?
You can start with Floyo's free pricing plan. Floyo gives $0.25 in free API credits on signup. To continue using the service beyond the free tier, upgrade your Floyo pricing plan. All MiniMax models run as API nodes, so generation costs come from your API Wallet (separate from your plan's GPU time).
How do I run MiniMax on Floyo?
Open Floyo in your browser, search "MiniMax" in the template library, and pick a workflow (video, speech, or music). Click Run, provide your inputs, and generate. Floyo handles the ComfyUI environment and API connections. No local install, no Python setup, no API key management.
Who makes MiniMax?
MiniMax is a global AI foundation model company. They build Talkie (150 million+ users, 90+ minute average sessions), Hailuo video models, Speech TTS models, Music generation models, and the M2.7 text/coding model. The company ships across every major modality: text, video, speech, and music.
Which MiniMax models are available on Floyo?
Four workflows cover three modalities. Hailuo for text-to-video and image-to-video (cinematic clips). Speech 2.8 HD for text-to-speech (studio-grade voiceover). Music 2.6 for text-to-music (complete songs with vocals). All run as ComfyUI API nodes and can be chained in one workflow.
Can I combine video, speech, and music in one workflow?
Yes. That is the main advantage of running MiniMax on Floyo's ComfyUI platform. Generate a product demo video with Hailuo, add a professional voiceover with Speech 2.8 HD, compose a background soundtrack with Music 2.6, and export the complete multimedia package. All in one pipeline, all in your browser.
Can MiniMax Music 2.6 generate songs with vocals?
Yes. Music 2.6 generates complete songs with humanized vocals and instrumentals. The vocals feature natural breathing, vibrato, and register transitions. You can also enable instrumental-only mode for background music, or use auto-lyrics generation to have the model write lyrics from your style description.
How does MiniMax Speech 2.8 HD compare to Fish Audio S2?
Both rank #1 on major TTS benchmarks (different arenas). Fish Audio S2 offers more granular emotion control with 1,500+ free-form tags and 80+ languages. MiniMax Speech 2.8 HD offers 7 structured emotion modes, inline interjections, and voice cloning from just 5 seconds of audio (versus 10-30 seconds for Fish Audio S2). MiniMax has warmer broadcast-ready fidelity. Fish Audio is open source. Both are available on Floyo.
Can I use MiniMax output commercially?
Yes. Generated content can be used commercially according to MiniMax's terms of service. Check specific terms for your use case, especially around generated content containing identifiable voices or copyrighted musical references.
Try MiniMax on Floyo
Cinematic video, #1-ranked TTS, and full-song music generation. Three modalities, one ComfyUI pipeline. Run it in your browser.
Related Reading
Film and Animation Workflows on Floyo
Setting Up an AI Production Pipeline for Your Studio
Last updated: April 2026. Specs from MiniMax official documentation, Replicate model cards, Artificial Analysis Speech Arena, HuggingFace TTS Arena, Scenario model guides, WaveSpeedAI documentation, and MiniMax-Speech technical report (arXiv:2505.07916).
Next-Level Motion from Images using MiniMax
MiniMax Text-to-Video will Bring Your Creative Concepts to Life with Realistic Motion
MiniMax Speech 2.8 HD for Text to Speech
Create realistic speech using MiniMax Speech 2.8 HD.
MiniMax Music 2.6 - Text to Music
Text-to-music with MiniMax Music 2.6. Generate songs with vocals and backing from a style prompt and lyrics, or toggle instrumental mode for score only.