ACE-Step 1.5 for Music Generation
Create stunning music using ACE-Step 1.5
ACE‑Step 1.5 is an open‑source music foundation model that can generate and edit full songs (up to ~10 minutes) from text prompts, running locally on consumer GPUs with under 4 GB of VRAM.
What it is
Text‑to‑music and music‑editing model with a hybrid architecture: a language model plans song structure, and a diffusion transformer renders high‑quality audio.
Designed to reach or beat many commercial music models on quality while staying extremely fast (seconds per song on common GPUs).
Key features
Generates full songs from simple prompts, from short loops to ~10‑minute tracks, with coherent structure, style, and lyrics if requested.
Strong style and prompt control, supporting 50+ languages for captions/lyrics and fine‑grained genre, mood, instrument, and tempo steering.
Unified tasks: text‑to‑music, cover generation, “repainting” sections, continuations, vocal‑to‑BGM, and track extraction, all in one model.
Runs locally with low VRAM; LoRA‑style personalization lets you capture your own musical style from a few songs.
Native ComfyUI support, with nodes and example workflows so you can integrate it like any other model in your graph.
Best‑fit use cases
Creating royalty‑free background music and themes for videos, streams, or games, fully offline.
Rapid idea sketching for producers: generate drafts in a target style, then rework stems in a DAW.
Covers and remixes: re‑render songs in a new style, repaint sections, or continue/reshape existing tracks.
Localized music content (jingles, songs with lyrics) in many languages for marketing or education.
Read more
Nodes & Models
PrimitiveStringMultiline
CheckpointLoaderSimple
ace_step_1.5_turbo_aio.safetensors
ModelSamplingAuraFlow
EmptyAceStep1.5LatentAudio
TextEncodeAceStepAudio1.5
ConditioningZeroOut
KSampler
VAEDecodeAudio
SaveAudioMP3
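The node chain above can be driven programmatically through ComfyUI's HTTP API. The sketch below is a hypothetical assembly of that graph in API format: the `class_type` names come from the node list, but the input names, output indices, and wiring are assumptions (extrapolated from ComfyUI's earlier ACE-Step nodes) and may differ from the shipped 1.5 workflow. The `PrimitiveStringMultiline` node is replaced here by plain Python strings.

```python
# Hedged sketch: queueing an ACE-Step 1.5 text-to-audio graph via ComfyUI's
# /prompt endpoint. Input names and link indices are assumptions, not a
# verified reproduction of the official workflow.
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default local ComfyUI address (assumed)

def build_ace_step_graph(tags: str, lyrics: str = "",
                         seconds: float = 120.0, seed: int = 0) -> dict:
    """Assemble an API-format graph: checkpoint -> text encode -> sampler ->
    VAE decode -> MP3 save. Links are [source_node_id, output_index]."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "ace_step_1.5_turbo_aio.safetensors"}},
        "2": {"class_type": "ModelSamplingAuraFlow",
              "inputs": {"model": ["1", 0], "shift": 5.0}},
        "3": {"class_type": "TextEncodeAceStepAudio1.5",  # input names assumed
              "inputs": {"clip": ["1", 1], "tags": tags, "lyrics": lyrics}},
        "4": {"class_type": "ConditioningZeroOut",  # negative = zeroed positive
              "inputs": {"conditioning": ["3", 0]}},
        "5": {"class_type": "EmptyAceStep1.5LatentAudio",
              "inputs": {"seconds": seconds, "batch_size": 1}},
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "seed": seed, "steps": 8,
                         "cfg": 1.0, "sampler_name": "euler",
                         "scheduler": "simple", "denoise": 1.0,
                         "positive": ["3", 0], "negative": ["4", 0],
                         "latent_image": ["5", 0]}},
        "7": {"class_type": "VAEDecodeAudio",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveAudioMP3",
              "inputs": {"audio": ["7", 0], "filename_prefix": "ace_step"}},
    }

def queue_prompt(graph: dict) -> None:
    """POST the assembled graph to a running ComfyUI instance."""
    payload = json.dumps({"prompt": graph}).encode("utf-8")
    req = urllib.request.Request(f"{COMFY_URL}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Example (requires a running ComfyUI server with the model installed):
# queue_prompt(build_ace_step_graph("lo-fi hip hop, mellow, 85 bpm"))
```

In the graph format, each key is a node id and each `[id, index]` pair wires one node's output into another's input, which is how the linear chain in the node list becomes an executable prompt.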
