# ACE-Step v1.5 XL Turbo Diffusers
Diffusers-format checkpoint of ACE-Step v1.5 XL Turbo, the guidance-distilled 5B-parameter flow-matching DiT for text-to-music generation (`hidden_size=2560`, 32 layers, 32 heads; `encoder_hidden_size=2048` on the condition encoder).
This repository is the official Diffusers-format version of the ACE-Step v1.5 XL Turbo checkpoint. It can be loaded directly with `AceStepPipeline`, which has been merged into huggingface/diffusers.

Weights are produced by `scripts/convert_ace_step_to_diffusers.py` from the upstream release and packaged in the standard Diffusers pipeline layout (`model_index.json` plus one subdirectory per module), so the full pipeline can be loaded in a single `from_pretrained` call.
## Usage
Until a package release includes `AceStepPipeline`, install Diffusers from source:

```bash
pip install git+https://github.com/huggingface/diffusers.git
```
```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/acestep-v15-xl-turbo-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Long-form audio: enable VAE tiling to keep decode memory bounded.
pipe.vae.enable_tiling()

output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
    audio_duration=30.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)

audio = output.audios[0]  # (channels, samples), 48 kHz
sf.write("acestep-xl-turbo.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
```
The turbo checkpoint is guidance-distilled: `guidance_scale > 1.0` is ignored with a warning, so there is no CFG scale to tune for this model. The pipeline defaults to 8 denoising steps and the recommended `shift=3.0` sampling recipe.
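For reference, a minimal sketch that makes those defaults explicit. `num_inference_steps` is the conventional Diffusers argument name and is an assumption here; the shift value lives in the bundled scheduler config rather than the call signature:

```python
output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    audio_duration=30.0,
    num_inference_steps=8,  # turbo default; standard Diffusers argument (assumed)
    guidance_scale=1.0,     # values > 1.0 are ignored for the distilled model
    generator=torch.Generator(device="cuda").manual_seed(42),
)
```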
For batched prompts with padding and FlashAttention, use the variable-length backend:

```python
pipe.transformer.set_attention_backend("flash_varlen")
pipe.condition_encoder.set_attention_backend("flash_varlen")
```

For single-prompt generation, the regular `flash` backend is also suitable.
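As an illustration, a hedged sketch of a batched call after switching backends, assuming the pipeline accepts a list of prompts the way most Diffusers pipelines do (the second prompt is hypothetical):

```python
outputs = pipe(
    prompt=[
        "An upbeat synthwave track with driving drums and a catchy lead",
        "A mellow lo-fi hip-hop beat with warm vinyl crackle",  # hypothetical
    ],
    audio_duration=30.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)
```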
## Repository layout
```
├── model_index.json
├── transformer/          # AceStepTransformer1DModel (DiT, 5B params, bf16)
├── condition_encoder/    # AceStepConditionEncoder (with baked-in silence_latent)
├── vae/                  # AutoencoderOobleck (48 kHz stereo)
├── text_encoder/         # Qwen3-Embedding-0.6B
├── tokenizer/            # Qwen3 tokenizer
├── scheduler/            # FlowMatchEulerDiscreteScheduler config
└── silence_latent.pt     # Raw reference (kept for debugging; not needed at runtime)
```
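Because the repository follows the standard per-module layout, individual components can also be loaded on their own. A minimal sketch; the `subfolder` convention is standard Diffusers, and the sketch assumes it applies to this checkpoint:

```python
import torch
from diffusers import AutoencoderOobleck

# Load only the VAE, e.g. to decode latents without the full pipeline.
vae = AutoencoderOobleck.from_pretrained(
    "ACE-Step/acestep-v15-xl-turbo-diffusers",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
)
```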
## License
- ACE-Step weights: MIT (same as upstream)
- `text_encoder/` (Qwen3-Embedding-0.6B): Apache 2.0, redistributed per Qwen's license
## Citation
```bibtex
@misc{gong2026acestep,
  title        = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author       = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
  year         = {2026},
  note         = {GitHub repository}
}
```