ACE-Step 1.5 for Music Generation
Create stunning music using ACE-Step 1.5
ACE‑Step 1.5 is an open‑source music foundation model that can generate and edit full songs (up to ~10 minutes) from text prompts, running locally on consumer GPUs with under 4 GB of VRAM.
What it is
Text‑to‑music and music‑editing model with a hybrid architecture: a language model plans song structure, and a diffusion transformer renders high‑quality audio.
Designed to reach or beat many commercial music models on quality while staying extremely fast (seconds per song on common GPUs).
Key features
Generates full songs from simple prompts, from short loops to ~10‑minute tracks, with coherent structure, style, and lyrics if requested.
Strong style and prompt control, supporting 50+ languages for captions/lyrics and fine‑grained genre, mood, instrument, and tempo steering.
Unified tasks: text‑to‑music, cover generation, “repainting” sections, continuations, vocal‑to‑BGM, and track extraction, all in one model.
Runs locally with low VRAM; LoRA‑style personalization lets you capture your own musical style from a few songs.
Native ComfyUI support, with nodes and example workflows so you can integrate it like any other model in your graph.
Best‑fit use cases
Creating royalty‑free background music and themes for videos, streams, or games, fully offline.
Rapid idea sketching for producers: generate drafts in a target style, then rework stems in a DAW.
Covers and remixes: re‑render songs in a new style, repaint sections, or continue/reshape existing tracks.
Localized music content (jingles, songs with lyrics) in many languages for marketing or education.
Read more
Nodes & Models
PrimitiveStringMultiline
CheckpointLoaderSimple
ace_step_1.5_turbo_aio.safetensors
ModelSamplingAuraFlow
EmptyAceStep1.5LatentAudio
TextEncodeAceStepAudio1.5
ConditioningZeroOut
KSampler
VAEDecodeAudio
SaveAudioMP3
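The node chain above can be driven programmatically through ComfyUI's HTTP API. The sketch below is a hypothetical assembly of that graph in API format: the `class_type` names come from the node list, but the input names, output indices, and wiring are assumptions (extrapolated from ComfyUI's earlier ACE-Step nodes) and may differ from the shipped 1.5 workflow. The `PrimitiveStringMultiline` node is replaced here by plain Python strings.

```python
# Hedged sketch: queueing an ACE-Step 1.5 text-to-audio graph via ComfyUI's
# /prompt endpoint. Input names and link indices are assumptions, not a
# verified reproduction of the official workflow.
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default local ComfyUI address (assumed)

def build_ace_step_graph(tags: str, lyrics: str = "",
                         seconds: float = 120.0, seed: int = 0) -> dict:
    """Assemble an API-format graph: checkpoint -> text encode -> sampler ->
    VAE decode -> MP3 save. Links are [source_node_id, output_index]."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "ace_step_1.5_turbo_aio.safetensors"}},
        "2": {"class_type": "ModelSamplingAuraFlow",
              "inputs": {"model": ["1", 0], "shift": 5.0}},
        "3": {"class_type": "TextEncodeAceStepAudio1.5",  # input names assumed
              "inputs": {"clip": ["1", 1], "tags": tags, "lyrics": lyrics}},
        "4": {"class_type": "ConditioningZeroOut",  # negative = zeroed positive
              "inputs": {"conditioning": ["3", 0]}},
        "5": {"class_type": "EmptyAceStep1.5LatentAudio",
              "inputs": {"seconds": seconds, "batch_size": 1}},
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "seed": seed, "steps": 8,
                         "cfg": 1.0, "sampler_name": "euler",
                         "scheduler": "simple", "denoise": 1.0,
                         "positive": ["3", 0], "negative": ["4", 0],
                         "latent_image": ["5", 0]}},
        "7": {"class_type": "VAEDecodeAudio",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveAudioMP3",
              "inputs": {"audio": ["7", 0], "filename_prefix": "ace_step"}},
    }

def queue_prompt(graph: dict) -> None:
    """POST the assembled graph to a running ComfyUI instance."""
    payload = json.dumps({"prompt": graph}).encode("utf-8")
    req = urllib.request.Request(f"{COMFY_URL}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Example (requires a running ComfyUI server with the model installed):
# queue_prompt(build_ace_step_graph("lo-fi hip hop, mellow, 85 bpm"))
```

In the graph format, each key is a node id and each `[id, index]` pair wires one node's output into another's input, which is how the linear chain in the node list becomes an executable prompt.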
