ACE-Step 1.5 Prompting Guide: How to Write Tags, Structure Lyrics, and Generate Better Music
admin Jun 12, 2026 11 min read

ACE-Step doesn’t work like Suno. There’s no magic text box where you describe a song and pray. Instead, it gives you two separate controls: tags that shape the sound, and lyrics that define the song structure. Understanding the split between them is the difference between getting random output and getting the track you actually hear in your head.

This guide walks through how the prompting system works across all three ACE-Step 1.5 variants on deAPI, with five complete examples you can copy and run in the playground.

Why ACE-Step Prompts Are Different

Most AI music tools accept a single natural-language description. “Write me a sad piano ballad about lost love” – and the model interprets everything at once. Genre, mood, instruments, lyrics, structure – all tangled in one sentence.

ACE-Step separates those concerns into two fields that don’t interfere with each other.

tags (caption) is a comma-separated keyword list that controls the audio. Genre, mood, instruments, vocal style, production quality, BPM. Think of it as a mixing board: each tag is a fader.

lyrics carries the song text plus structural markers – [verse], [chorus], [bridge] – and language tags like [en] or [ja]. Leave it empty and the model generates an instrumental.

This matters because you can swap the entire genre by editing tags while keeping the same lyrics intact. A folk ballad becomes a trap beat in seconds, with identical words over a completely different sonic landscape.

How to Write Tags

A strong tag set runs 5-12 keywords. More than 15, and they start diluting each other. Order matters less than specificity: “grand piano” outperforms “piano”, and “fingerpicked acoustic guitar” produces a clearer result than “guitar”.

The formula

[genre], [mood], [2-3 specific instruments],
[vocal type], [production style], [BPM] bpm
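The formula can be sketched as a small helper that joins the pieces in order. This is only an illustration: `build_tags` and its argument names are made up for this example and are not part of ACE-Step or deAPI.

```python
def build_tags(genre, moods, instruments, vocal_type, production, bpm):
    """Assemble a comma-separated tag string in the order of the formula above."""
    parts = [genre, *moods, *instruments, vocal_type, *production, f"{bpm} bpm"]
    # Drop empty entries and strip stray whitespace before joining
    return ", ".join(p.strip() for p in parts if p and p.strip())


tags = build_tags(
    genre="lo-fi hip-hop",
    moods=["laid-back"],
    instruments=["rhodes piano", "upright bass", "dusty drums"],
    vocal_type="male vocals",
    production=["vinyl crackle"],
    bpm=88,
)
print(tags)
```

Keeping the pieces as separate arguments makes it easy to swap the genre or the instruments while leaving the rest of the prompt untouched.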

Genre is always first

The genre tag anchors everything else. ACE-Step covers a wide range:

Family | Tags that work well
Electronic | techno, progressive house, trance, drum and bass, breakcore, darkwave, synthwave
Hip-hop | trap, boom bap, drill, phonk, lo-fi hip-hop, cloud rap, uk drill
Rock | indie rock, post-rock, shoegaze, dream pop, post-punk, grunge, doom metal
Pop | synth-pop, city pop, k-pop, j-pop, indie pop, hyperpop
Jazz | bebop, fusion, smooth jazz, gypsy jazz, cool jazz
Orchestral | cinematic, trailer music, orchestral score, baroque, minimalist
Acoustic | indie folk, bluegrass, bossa nova, americana, sea shanty
Ambient | dark ambient, drone, new age, generative

Getting muddy results from a niche genre like darkwave or breakcore? Run the same prompt on XL Turbo before rewriting your tags. The larger model (~10B params) resolves niche genres that the 3.5B variants blur together.

Name instruments, not adjectives

“Grand piano, upright bass, brushed drums” gives the model three concrete sound sources to render. “Sophisticated, elegant, refined” gives it nothing to work with.

XL Turbo is also where rare instruments shine. Tags like “hammered dulcimer”, “theremin”, “shakuhachi”, and “prepared piano” produce distinct timbres on XL that the 3.5B models flatten into something generic.

Production tags shape the mix

These control the sonic “finish” of the track. Fidelity tags (hi-fi, lo-fi, dusty, polished) set the overall clarity. Character tags like analog warmth, tape saturation, and vinyl crackle add texture on top. Spatial tags (wide stereo, intimate, cinematic) define how big the track feels. And dynamics tags – compressed, sidechain pumping – shape the loudness behavior.

The difference is dramatic: “lo-fi hip-hop, dusty, vinyl crackle, tape saturation” and “hip-hop, hi-fi, clean, wide stereo” produce completely different mixes from identical lyrics.

Always include BPM

Set it both in the tags (e.g. 88 bpm) and as the bpm parameter. The model uses both inputs, and consistency between them tightens the rhythmic output.

Genre | Typical BPM range
Ballad / acoustic | 60-80
Lo-fi hip-hop | 70-90
Pop | 100-120
House | 120-128
Techno | 128-140
Trap (half-time) | 130-145
Drum and bass | 170-180

Three tag mistakes that ruin output

“A beautiful melancholic piano piece about autumn in Paris” looks like a reasonable prompt. It isn’t. The model expects keywords: piano ballad, melancholic, intimate, Parisian cafe, felt piano, 68 bpm.

Contradictory tags cause a different failure. Combinations like “aggressive, serene” or “lo-fi, hi-fi, polished” force the model to oscillate between poles instead of committing to either.

Finally, watch your BPM. Tagging techno at 70 bpm confuses the model because techno lives at 128-140. Slow and dark? That’s dark ambient or downtempo, not slow techno.
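These checks are mechanical enough to run before you submit a prompt. The sketch below is a hypothetical linter, not part of any ACE-Step tooling: the BPM ranges and contradictory pairs are copied from this guide, the four-word threshold for spotting sentences is an arbitrary assumption, and genres are matched only when the tag is an exact match (so “progressive techno” is not checked against the techno range).

```python
import re

# Typical BPM ranges copied from the table above (genre tag -> (low, high))
BPM_RANGES = {
    "lo-fi hip-hop": (70, 90),
    "pop": (100, 120),
    "house": (120, 128),
    "techno": (128, 140),
    "trap": (130, 145),
    "drum and bass": (170, 180),
}
# Contradictory pairs named in this guide; extend the list for your own prompts
CONFLICTS = [("aggressive", "serene"), ("lo-fi", "hi-fi"), ("lo-fi", "polished")]


def lint_tags(tag_string):
    """Return a list of warnings for a comma-separated tag string."""
    tags = [t.strip().lower() for t in tag_string.split(",") if t.strip()]
    warnings = []
    if len(tags) > 15:
        warnings.append(f"too many tags ({len(tags)}): aim for 5-12")
    for tag in tags:
        # Assumption: more than four words in one tag means a sentence slipped in
        if len(tag.split()) > 4:
            warnings.append(f"sentence-like tag: {tag!r}")
    for a, b in CONFLICTS:
        if a in tags and b in tags:
            warnings.append(f"contradictory tags: {a} / {b}")
    bpm = None
    for tag in tags:
        match = re.fullmatch(r"(\d+)\s*bpm", tag)
        if match:
            bpm = int(match.group(1))
    if bpm is None:
        warnings.append("missing bpm tag")
    else:
        for genre, (low, high) in BPM_RANGES.items():
            if genre in tags and not low <= bpm <= high:
                warnings.append(f"{bpm} bpm is outside {low}-{high} for {genre}")
    return warnings


print(lint_tags("techno, dark, 70 bpm"))
```

An empty list means none of the three mistakes were detected; it says nothing about whether the tags will sound good.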

How to Structure Lyrics

Lyrics aren’t just words – they’re the song’s architecture.

Section markers tell the model what to play

Every marker goes in square brackets, lowercase:

Marker | Purpose
[intro] | Instrumental opening
[verse] | Verse
[pre-chorus] | Tension builder before the hook
[chorus] | The hook – repeat it
[bridge] | Contrasting section
[inst] | Instrumental solo or break
[build-up] | Rising tension (electronic)
[drop] | Peak energy release (EDM)
[breakdown] | Strip energy back (electronic)
[outro] | Ending

A marker with no text below it becomes an instrumental passage. This is how you create solos, interludes, and transitions without any vocals.
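If you assemble lyrics programmatically, the marker rules map onto a short helper. This is a minimal sketch; `build_lyrics` is a made-up name, and the blank line between sections simply mirrors the layout of the examples in this guide rather than a documented requirement.

```python
def build_lyrics(sections):
    """Join (marker, lines) pairs into a lyrics string.

    A section with an empty list of lines becomes a bare marker,
    which the guide describes as an instrumental passage.
    """
    blocks = []
    for marker, lines in sections:
        header = f"[{marker.lower()}]"  # markers are lowercase, in square brackets
        blocks.append("\n".join([header, *lines]))
    return "\n\n".join(blocks)


lyrics = build_lyrics([
    ("intro", []),
    ("verse", ["Smoke above the rooftops", "Sun begins to fall"]),
    ("inst", []),
    ("outro", []),
])
print(lyrics)
```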

Language markers for multilingual songs

ACE-Step supports 19 languages. The top 10 perform best: [en], [zh], [ru], [es], [ja], [de], [fr], [pt], [it], [ko]. Others (including [pl]) work but may underperform.

Place the language marker at the start of a section, never mid-line. English verses with a Japanese chorus in j-rock style is a combination that works particularly well on XL Turbo.
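A quick way to catch mid-line language markers is to scan each line for a marker that has lyric text in front of it. The sketch below is an illustration only: the function name is hypothetical, and it covers just the language codes mentioned in this guide. It treats a language marker that directly follows another bracketed marker as acceptable, which is an assumption rather than documented behavior.

```python
import re

# Language codes named in this guide; the full list of 19 is not reproduced here
LANG_MARKER = re.compile(r"\[(en|zh|ru|es|ja|de|fr|pt|it|ko|pl)\]")
ANY_MARKER = re.compile(r"\[[^\]]*\]")


def misplaced_language_markers(lyrics):
    """Return (line_number, line) pairs where a language marker follows lyric text."""
    bad = []
    for number, line in enumerate(lyrics.splitlines(), start=1):
        for match in LANG_MARKER.finditer(line):
            # Ignore other bracketed markers before it; only real text counts
            prefix = ANY_MARKER.sub("", line[:match.start()]).strip()
            if prefix:
                bad.append((number, line))
                break
    return bad


sample = "[chorus]\n[ja] kimi no koe\nHello [ja] world"
print(misplaced_language_markers(sample))
```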

Keep lines short

The model maps syllables to beats. Lines of 4-8 syllables flow naturally. At 12+ syllables, the vocal rhythm fractures – the model tries to cram too many words into too few beats.

Repeat the chorus. It helps the model lock onto a consistent melody. AABB or ABAB rhyme schemes sound more natural than free verse, but rhyme isn’t mandatory.
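Line length is easy to check before you generate. The sketch below uses a rough vowel-group heuristic to estimate English syllables; it is an approximation made up for this example (it will miscount some words and does not apply to other languages), so treat flagged lines as candidates for a rewrite rather than hard errors.

```python
import re


def estimate_syllables(line):
    """Rough English estimate: count vowel groups per word, at least one each."""
    total = 0
    for word in re.findall(r"[a-z']+", line.lower()):
        groups = len(re.findall(r"[aeiouy]+", word))
        # A trailing silent "e" usually adds no syllable ("smoke", "above")
        if word.endswith("e") and not word.endswith("le") and groups > 1:
            groups -= 1
        total += max(groups, 1)
    return total


def long_lines(lyrics, limit=8):
    """Return (estimated_syllables, line) for lyric lines above the limit."""
    flagged = []
    for line in lyrics.splitlines():
        text = line.strip()
        # Skip blank lines and bracketed markers
        if not text or text.startswith("["):
            continue
        count = estimate_syllables(text)
        if count > limit:
            flagged.append((count, text))
    return flagged


sample = "[verse]\nSmoke above the rooftops\nI keep on running through the city streets at night alone"
print(long_lines(sample))
```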

Instrumentals

Two ways to go fully instrumental:

  1. Leave lyrics empty and add “no vocals” or “instrumental” to the tags
  2. Use only section markers with no text: [intro][inst][outro]

Adding “no vocals” to the tags while also writing lyric text creates confusion – the model may produce eerie wordless vocalizations. Pick one approach and commit.
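The conflict described here can be detected automatically. In this sketch (function names are hypothetical), lyric text means anything left over once every bracketed marker is removed.

```python
import re


def has_lyric_text(lyrics):
    """True if anything other than bracketed markers and whitespace remains."""
    return bool(re.sub(r"\[[^\]]*\]", "", lyrics).strip())


def vocal_conflict(tags, lyrics):
    """True if the tags ask for an instrumental but the lyrics contain words."""
    tag_list = [t.strip().lower() for t in tags.split(",")]
    wants_instrumental = "no vocals" in tag_list or "instrumental" in tag_list
    return wants_instrumental and has_lyric_text(lyrics)


print(vocal_conflict("techno, no vocals, 134 bpm", "[intro]\n\n[inst]\n\n[outro]"))
print(vocal_conflict("techno, no vocals, 134 bpm", "[verse]\nHello there"))
```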

Three Variants, One Prompting Format

deAPI runs three ACE-Step 1.5 models. All three accept the same tags and lyrics format – the difference is in quality, speed, and control:

Property | Turbo (3.5B) | XL Turbo (~10B, INT8) | Base (3.5B)
Steps | 8 (fixed) | 8 (fixed) | 5-100 (adjustable)
Guidance (CFG) | 1 (fixed) | 1 (fixed) | 3-20 (adjustable)
Min duration | 10s | 10s | 30s
Niche genres | Sometimes misses | Handles well | Sometimes misses
Long-form (200s+) | Loses coherence | Holds together | Holds together
Vocal clarity | Adequate | Clean | Cleanest

The workflow: sketch on Turbo (generate 5-10 variants fast, find the arrangement that works), refine on XL Turbo (better sound quality, most tracks are done here), polish on Base with steps=40-60 (maximum fidelity for critical renders).

For Base specifically, two parameters matter:

Steps control quality vs. speed. 27 is the sweet spot for most work. Below 15 you get Turbo-like drafts. Above 60, gains become imperceptible.

Guidance (CFG) controls how strictly the model follows your tags. At 5-7, it plays loose and musical. At 10-12, it tracks your tags closely. Above 15, it gets rigid and may introduce artifacts. Complex prompts benefit from higher CFG; simple genre prompts sound better at lower values.
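The per-variant limits can be encoded as a lookup so invalid combinations are caught before a request is sent. This is a sketch under assumptions: the ranges come from the comparison table in this section, but the keys `turbo`, `xl_turbo`, and `base` are labels invented for this example, not real model identifiers.

```python
# Ranges from the comparison table; the dictionary keys are made-up labels
VARIANT_LIMITS = {
    "turbo": {"steps": (8, 8), "cfg": (1, 1), "min_duration": 10},
    "xl_turbo": {"steps": (8, 8), "cfg": (1, 1), "min_duration": 10},
    "base": {"steps": (5, 100), "cfg": (3, 20), "min_duration": 30},
}


def check_params(variant, steps, cfg, duration):
    """Return a list of problems for the given variant and settings."""
    limits = VARIANT_LIMITS[variant]
    problems = []
    low, high = limits["steps"]
    if not low <= steps <= high:
        problems.append(f"steps must be within {low}-{high} on {variant}")
    low, high = limits["cfg"]
    if not low <= cfg <= high:
        problems.append(f"cfg must be within {low}-{high} on {variant}")
    if duration < limits["min_duration"]:
        problems.append(f"duration must be at least {limits['min_duration']}s on {variant}")
    return problems


print(check_params("base", steps=27, cfg=7, duration=180))
print(check_params("turbo", steps=27, cfg=1, duration=20))
```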

5 Full Prompt Examples

1. Lo-fi hip-hop (boom bap, jazzy)

tags:

lo-fi hip-hop, boom bap, dusty drums, vinyl crackle,
jazz sample, upright bass, rhodes piano, muted trumpet,
male vocals, rap vocals, laid-back, warm, 90s, 88 bpm

lyrics:

[intro]

[verse]
Smoke above the rooftops
Sun begins to fall
Neon signs reflecting
Off the barber shop wall
Granddad used to tell me
City never sleeps
Every face a promise
Concrete always keeps

[chorus]
This is how it goes
How it's always been
Writing every story
From where I'm standing in

[verse]
Kids on bikes are racing
Alleys after school
Grandmas in their windows
Summer's always cruel

[chorus]
This is how it goes
How it's always been
Writing every story
From where I'm standing in

[inst]

[outro]

bpm: 88 · duration: 180s

The [inst] before the outro leaves room for a jazz-influenced solo – on XL Turbo, expect muted trumpet or piano improvisation floating over the boom bap loop.


2. Dark progressive techno (instrumental)

tags:

progressive techno, dark, hypnotic, driving, industrial,
analog modular synth, 909 drums, acid bass, tom rolls,
atmospheric pad, no vocals, hi-fi, wide stereo, 134 bpm

lyrics:

[intro]

[inst]

[build-up]

[drop]

[breakdown]

[build-up]

[drop]

[outro]

bpm: 134 · duration: 240s

The [build-up][drop] sequence is essential for electronic music. The model reads these markers as dynamic instructions: rise, then release. Two cycles of that pattern create a full DJ-friendly arrangement.


3. Chamber folk ballad (acoustic, intimate)

tags:

chamber folk, acoustic ballad, intimate, melancholic, hopeful,
fingerpicked classical guitar, cello, solo violin, upright bass,
soft female vocals, subtle harmonies, warm analog, 2010s, 68 bpm

lyrics:

[intro]

[verse]
Rain against the window
Of the passing train
Every drop a letter
I won't send again
Fields of wheat are bending
Summer in the air
Nothing waiting for me
But the evening there

[pre-chorus]
And I know, and I know

[chorus]
If the road brings winter
I will still walk on
If the song is ending
I will sing along

[verse]
Lamplight in a village
Where I stopped to rest
Stranger at a table
With the kindest guest

[chorus]
If the road brings winter
I will still walk on
If the song is ending
I will sing along

[inst]

[outro]
Every mile a prayer
Before the dawn

bpm: 68 · duration: 210s

Acoustic material benefits most from Base at steps=40+. The fingerpicked guitar and cello harmonics need extra sampling steps to render cleanly. Prototype the arrangement on XL Turbo first, then switch to Base for the final render.


4. Hybrid orchestral trailer (cinematic)

tags:

cinematic, epic hybrid trailer, dark, tense, grandiose,
taiko drums, sub drops, low brass swell, staccato strings,
choir ooh, piano ostinato, synth stab, no vocals,
wide stereo, cinematic score, 2020s, 90 bpm

lyrics:

[intro]

[verse]

[build-up]

[inst]

[chorus]

[breakdown]

[inst]

[outro]

bpm: 90 · duration: 120s

Trailer music depends on dynamic contrast – quiet tension against orchestral explosions. The [build-up][inst][chorus] arc gives the model a clear trajectory from restraint to full power.


5. Dark ambient (atmospheric, long-form)

tags:

dark ambient, cinematic, atmospheric, ethereal, haunting,
sustained analog drone, granular texture, reverb-heavy piano,
field recording wind, distant choir, sub rumble, no vocals,
slow evolving, hi-fi, wide stereo, 55 bpm

lyrics:

[intro]

[inst]

[outro]

bpm: 55 · duration: 270s

At 270 seconds, the XL Turbo variant’s advantage over the 3.5B Turbo becomes audible. The smaller model tends to lose coherence past the 3-minute mark – textures loop or drift into noise. XL sustains the evolving soundscape across the full duration.

The Reference Audio Shortcut

All three variants accept ref_audio – a 5-60 second clip the model treats as a style reference. Timbre, production texture, and mood from the clip influence the generated track.

Send a dry instrumental fragment – 10-30 seconds, consistent style throughout. A clean guitar intro, an isolated synth pad, a drum loop. These give the model a clear timbre to reference.

Vocal-heavy mixes backfire: the model tries to clone the voice instead of extracting the production style. Dense full mixes turn to mud because the model averages all the layers. Clips under 5 seconds carry too little signal for anything useful.

On XL Turbo, even subtle cues transfer – a specific reverb decay, tape compression character, the particular warmth of an analog recording chain. The 3.5B variants pick up the broad strokes but miss these details.
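Clip length is the one part of this advice that can be checked mechanically. The sketch below reads the duration of a WAV file with Python's standard `wave` module and sorts it into the ranges given in this section. The function names and the three labels are made up for this example, and other audio formats would need a different reader.

```python
import wave


def wav_duration(path):
    """Duration of a WAV file in seconds (frames divided by sample rate)."""
    with wave.open(path, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def assess_ref_clip(duration_s):
    """Classify a reference clip by length using the ranges from this guide."""
    if duration_s < 5 or duration_s > 60:
        return "outside accepted range"  # ref_audio is described as a 5-60 second clip
    if 10 <= duration_s <= 30:
        return "recommended"  # the guide suggests 10-30 seconds
    return "accepted"


print(assess_ref_clip(20))
print(assess_ref_clip(3.5))
```

Length is only a first filter: a 20-second clip can still be a dense, vocal-heavy mix that transfers poorly.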

Common Mistakes

We covered tag mistakes above. Here are the ones that show up in lyrics and parameters.

Lines over 10 syllables fracture vocal timing. The model crams words into beats that can’t hold them – rewrite to 4-8 syllables per line and the rhythm locks in immediately.

Stacking five niche genres (darkwave, witch house, phonk, vaporwave, drill) sounds adventurous on paper. In practice, the model can’t resolve that many conflicting signals and the output collapses into noise. One primary genre plus one modifier is the ceiling.

A [chorus] marker with no text below it won’t generate a catchy hook. Like any empty marker, it is treated as a passage with no vocals. Write chorus lyrics or relabel the section [inst].

Forgetting the language marker on non-English lyrics is the sneakiest failure. Polish words without [pl] at the section start? The model may sing them in English phonetics regardless.

And the BPM trap: omit it from the tags, and the model guesses. A “techno” prompt without an explicit BPM tag might land at 115 instead of 130 – close enough to pass as techno, far enough to ruin the groove.

Try It

The deAPI playground lets you test all three ACE-Step variants without writing code. $5 in free credits on signup, no credit card required.

When you’re ready to integrate: the API docs cover every parameter, and the txt2music endpoint accepts the same tags and lyrics format described above. All generated music is commercially usable – ACE-Step ships under Apache 2.0.
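As a starting point for integration, the sketch below assembles a request body and checks that the BPM tag and the bpm parameter agree. Everything about the payload shape is an assumption: the field names simply reuse the terms from this guide (tags, lyrics, bpm, duration), and the model identifier is a placeholder. Check the API docs for the actual schema, endpoint URL, and authentication before sending anything.

```python
import json


def build_request(model, tags, lyrics, bpm, duration, **extra):
    """Build a hypothetical txt2music request body as a plain dict."""
    tag_list = [t.strip().lower() for t in tags.split(",")]
    # The guide recommends setting BPM in both places and keeping them consistent
    if f"{bpm} bpm" not in tag_list:
        raise ValueError(f"tags should include '{bpm} bpm' to match the bpm parameter")
    payload = {
        "model": model,  # placeholder identifier, not a real model name
        "tags": tags,
        "lyrics": lyrics,
        "bpm": bpm,
        "duration": duration,
    }
    payload.update(extra)  # e.g. steps or guidance when using Base
    return payload


payload = build_request(
    model="hypothetical-ace-step-variant",
    tags="dark ambient, atmospheric, sub rumble, no vocals, 55 bpm",
    lyrics="[intro]\n\n[inst]\n\n[outro]",
    bpm=55,
    duration=270,
)
print(json.dumps(payload, indent=2))
```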
