BEAT: Rethinking Symbolic Music Encoding with Beat-Anchored Tone Representations

Anonymous Authors

Abstract

The perception of symbolic music rests fundamentally on relative durations and the interval relationships between notes. Existing representations, however, face an inherent trade-off between preserving these musical priors and achieving encoding efficiency. In this paper, we propose BEAT (Beat-based Encoding with Anchored Tones), which breaks from the conventional note-event paradigm by adopting the beat as the basic primitive. BEAT segments the temporal axis into beats and organizes music as a sequence of beat units. Within each beat, notes are sorted by pitch: the first serves as an anchor, and each subsequent note is encoded as a relative interval chained from the previous one, coupled with its rhythmic pattern. This design achieves efficient serialization while explicitly preserving musical priors.
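As a concrete illustration of this per-beat scheme, here is a minimal Python sketch; the `Note` structure and the token spellings are illustrative stand-ins, not the paper's actual vocabulary:

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int    # MIDI pitch
    onset: float  # onset within the beat, in beat fractions
    dur: float    # duration in beat fractions

def encode_beat(beat_index: int, notes: list[Note]) -> list[str]:
    """Encode one beat: a beat marker, an anchor pitch, then chained intervals.

    Notes are sorted by pitch; the first (lowest) is the anchor, and each
    later note is encoded as the interval from the previous note, paired
    with its rhythmic pattern (onset, duration).
    """
    tokens = [f"<beat:{beat_index}>"]
    if not notes:
        return tokens  # empty beat: just the beat marker
    notes = sorted(notes, key=lambda n: n.pitch)
    anchor = notes[0]
    tokens += [f"<anchor:{anchor.pitch}>", f"<rhythm:{anchor.onset}:{anchor.dur}>"]
    prev = anchor.pitch
    for n in notes[1:]:
        tokens += [f"<interval:{n.pitch - prev}>", f"<rhythm:{n.onset}:{n.dur}>"]
        prev = n.pitch
    return tokens

# Example: a C-major triad sounding on the beat.
print(encode_beat(0, [Note(60, 0, 1), Note(64, 0, 1), Note(67, 0, 1)]))
# ['<beat:0>', '<anchor:60>', '<rhythm:0:1>', '<interval:4>', '<rhythm:0:1>',
#  '<interval:3>', '<rhythm:0:1>']
```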

Additionally, BEAT employs beat indexing for temporal alignment, which improves concurrent modeling and editability and enables seamless adaptation to diverse downstream tasks. Extensive experiments demonstrate that BEAT achieves higher encoding efficiency than existing methods while delivering state-of-the-art generation quality on both objective and subjective benchmarks. BEAT also consistently outperforms baselines across diverse tasks, including multi-track generation, segment infilling, and real-time accompaniment.

Encoding Efficiency

BPE Compression Rate

Measuring similarity to natural language through BPE compression efficiency

Method        +20      +100     +500     +5000
ABC           80.15%   62.66%   52.57%   43.57%
REMI          80.18%   70.59%   57.95%   38.81%
Ours (BEAT)   64.83%   58.93%   49.32%   37.31%
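A plausible reading of the +k columns is that each encoding's token stream receives k additional BPE merges, with the compression rate reported as compressed length over original length (lower = more compressible). A toy sketch of that measurement; a production setup would use a tokenizer library rather than this greedy loop:

```python
from collections import Counter

def bpe_compression_rate(tokens: list[str], num_merges: int) -> float:
    """Greedily merge the most frequent adjacent pair `num_merges` times,
    then return compressed length / original length (lower = more compressible)."""
    seq = list(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return len(seq) / len(tokens)
```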

Sequence Length Comparison

Average token count on the Advanced Piano Expression (APEX) dataset

Method        Average Tokens
REMI          2400
ABC           3450.4
Ours (BEAT)   1825.6

Unconditional Generation Results

Evaluation metrics: muspy statistics (PR, GC, SC; JS divergence against ground truth), CLaMP 2-based FMD, MOS, and LLM preference
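For the muspy-based columns, a common recipe is to compute a per-piece statistic for the generated and ground-truth sets, histogram both, and report the Jensen-Shannon divergence between the histograms. A sketch assuming numpy/scipy; the specific statistic is illustrative:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
# import muspy  # supplies per-piece statistics, e.g. muspy.scale_consistency(music)

def js_divergence(gen_stats, gt_stats, bins=32):
    """JS divergence between histograms of a per-piece statistic."""
    lo = min(np.min(gen_stats), np.min(gt_stats))
    hi = max(np.max(gen_stats), np.max(gt_stats))
    p, _ = np.histogram(gen_stats, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(gt_stats, bins=bins, range=(lo, hi), density=True)
    # scipy returns the JS *distance* (the square root of the divergence)
    return jensenshannon(p, q) ** 2
```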

Model Configurations

GPT-2 Small: 6 layers, hidden size 512

GPT-2 Large: 12 layers, hidden size 768

LLaMA: 16 layers, hidden size 768
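These three backbones map directly onto standard Hugging Face configs; a minimal sketch, where the vocabulary size, head counts, and LLaMA FFN width are placeholders rather than values from the paper:

```python
from transformers import GPT2Config, GPT2LMHeadModel, LlamaConfig, LlamaForCausalLM

VOCAB = 4096  # placeholder: size of the BEAT token vocabulary (assumption)

# GPT-2 Small: 6 layers, hidden size 512 (n_head=8 is an assumption)
gpt2_small = GPT2LMHeadModel(GPT2Config(vocab_size=VOCAB, n_layer=6, n_embd=512, n_head=8))

# GPT-2 Large: 12 layers, hidden size 768 (n_head=12 is an assumption)
gpt2_large = GPT2LMHeadModel(GPT2Config(vocab_size=VOCAB, n_layer=12, n_embd=768, n_head=12))

# LLaMA: 16 layers, hidden size 768 (heads and intermediate size are assumptions)
llama = LlamaForCausalLM(LlamaConfig(vocab_size=VOCAB, num_hidden_layers=16,
                                     hidden_size=768, num_attention_heads=12,
                                     intermediate_size=2048))
```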

LLaMA (16L, 768H)

Method        PR                 GC                 SC                 JS    FMD   MOS   LLM
REMI          -                  -                  -                  -     -     -     -
CP            -                  -                  -                  -     -     -     -
Octuple       -                  -                  -                  -     -     -     -
ABC           -                  -                  -                  -     -     -     -
Ours (BEAT)   -                  -                  -                  -     101   -     -
GT            μ=0.579, σ=0.321   μ=0.984, σ=0.007   μ=0.935, σ=0.078   -     -     -     -

GPT-2 Large (12L, 768H)

Method        PR                 GC                 SC                 JS    FMD   MOS   LLM
REMI          -                  -                  -                  -     -     -     -
CP            -                  -                  -                  -     -     -     -
ABC           -                  -                  -                  -     -     -     -
Ours (BEAT)   -                  -                  -                  -     101   -     -
GT            μ=0.579, σ=0.321   μ=0.984, σ=0.007   μ=0.935, σ=0.078   -     -     -     -

GPT-2 Small (6L, 512H)

Method        PR                 GC                 SC                 JS    FMD   MOS   LLM
REMI          -                  -                  -                  -     -     -     -
CP            -                  -                  -                  -     -     -     -
Octuple       -                  -                  -                  -     -     -     -
ABC           -                  -                  -                  -     -     -     -
Ours (BEAT)   -                  -                  -                  -     101   -     -
GT            μ=0.579, σ=0.321   μ=0.984, σ=0.007   μ=0.935, σ=0.078   -     -     -     -

Piano Continuation Results

20 ground-truth samples; each 4-bar prompt is continued 10 times (with velocity)

Method        FMD      MOS
REMI          454.07   -
ABC           460.37   -
Ours (BEAT)   310.55   -
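Assuming FMD follows the usual Fréchet-distance recipe over CLaMP 2 embeddings (as FID does over image features), a numpy/scipy sketch; the embedding extraction itself is model-specific and omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two embedding sets
    (rows = pieces, columns = embedding dimensions)."""
    mu_g, mu_r = gen_emb.mean(0), ref_emb.mean(0)
    cov_g = np.cov(gen_emb, rowvar=False)
    cov_r = np.cov(ref_emb, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2 * covmean))
```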

Multi-track Segment Infilling Visualization

Multi-track piano-roll visualizations of generated segments (before → after)
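Because each beat carries an explicit index, infilling can reduce to locating the token spans of the masked beats and regenerating them in place. A conceptual sketch with left-to-right conditioning only; `generate_beats` and the span map are hypothetical stand-ins, not the paper's API:

```python
def infill(tokens, beat_spans, masked_beats, model):
    """Regenerate masked beats within a BEAT token sequence.

    tokens:       flat token list for the piece
    beat_spans:   {beat_index: (start, end)} token offsets, recoverable
                  by scanning for the <beat:i> markers
    masked_beats: set of beat indices to regenerate
    model:        hypothetical object exposing generate_beats(context, beat_index)
    """
    out = []
    for i in sorted(beat_spans):
        start, end = beat_spans[i]
        if i in masked_beats:
            # Hypothetical call: sample fresh tokens for beat i, conditioned
            # on everything emitted so far.
            out.extend(model.generate_beats(context=out, beat_index=i))
        else:
            out.extend(tokens[start:end])
    return out
```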

Samples 1–4: before/after piano-roll pairs

Audio Samples

Generated music samples demonstrating BEAT encoding quality

Generation Showcase with Score Visualization

Two unconditionally generated piano pieces with synchronized score display

Piano Continuation - Method Comparison

Each sample pairs a 4-bar prompt (~8 seconds) with continuations from different encoding methods (BEAT, REMI, ABC) alongside the ground truth (GT)

Sample 1: Ground Truth (GT) · BEAT (ours) · REMI (baseline) · ABC (baseline)

Sample 2: Ground Truth (GT) · BEAT (ours) · REMI (baseline) · ABC (baseline)

Sample 3: Ground Truth (GT) · BEAT (ours) · REMI (baseline) · ABC (baseline)

Expressive Piano Continuation

Each sample shows one GT prompt with multiple BEAT continuations, demonstrating generation diversity

Expressive Sample 1: Ground Truth (GT) · BEAT v1, v2, v3 (ours)

Expressive Sample 2: Ground Truth (GT) · BEAT v1, v2, v3 (ours)

Expressive Sample 3: Ground Truth (GT) · BEAT v1, v2, v3 (ours)

Real-time Accompaniment

Piano accompaniment generation given melody input
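Beat indexing also suits the streaming setting: melody and accompaniment tokens for the same beat share an index, so generation can proceed beat by beat as the melody arrives. A conceptual sketch; the `generate_beats` call is hypothetical:

```python
def accompany_stream(prompt_tokens, melody_beats, model):
    """Yield accompaniment tokens beat by beat as melody beats arrive.

    prompt_tokens: tokens for the 4-bar full-score prompt
    melody_beats:  iterable of token lists, one per incoming melody beat
    model:         hypothetical object exposing generate_beats(context, beat_index)
    """
    context = list(prompt_tokens)
    for i, melody in enumerate(melody_beats):
        context.extend(melody)  # incoming melody tokens for beat i
        # Hypothetical call: generate this beat's accompaniment given
        # everything heard and played so far.
        acc = model.generate_beats(context=context, beat_index=i)
        context.extend(acc)
        yield acc
```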

Accompaniment Samples

Ground truth vs BEAT-generated accompaniment

Sample 1: GT · Ours (BEAT)
Sample 2: GT · Ours (BEAT)
Sample 3: GT · Ours (BEAT)
Sample 4: GT · Ours (BEAT)

Multi-track Ensemble

Multi-instrument coordination with rich harmonic textures (all samples truncated to 30s after audio rendering)

String Quartet (Violin I, II, Viola, Cello): Sample 1 · Sample 2
Piano Quartet (Piano + Violin, Viola, Cello): Sample 1 · Sample 2
Piano & Choir (Piano + SATB voices): Sample 1 · Sample 2
Clarinet & Piano (clarinet solo with piano accompaniment): Sample 1 · Sample 2

Subjective Evaluation Samples

Audio samples used for subjective evaluation experiments (MOS study)

Piano Continuation

Given a 4-bar prompt, models generate a complete piano continuation. Each sample compares GT against our method and four baselines (ABC, REMI+, CP, Anticipate).

Sample 1: GT · Ours (BEAT) · ABC · REMI+ · CP · Anticipate
Sample 2: GT · Ours (BEAT) · ABC · REMI+ · CP · Anticipate
Sample 3: GT · Ours (BEAT) · ABC · REMI+ · CP · Anticipate
Sample 4: GT · Ours (BEAT) · ABC · REMI+ · CP · Anticipate
Sample 5: GT · Ours (BEAT) · ABC · REMI+ · CP · Anticipate

Multitrack Continuation

Given a 4-bar prompt, models generate a complete multi-track continuation. Each sample compares GT against our method and three baselines (ABC, REMI, Anticipate).

Sample 1: Prompt · GT · Ours (BEAT) · ABC · REMI · Anticipate
Sample 2: Prompt · GT · Ours (BEAT) · ABC · REMI · Anticipate
Sample 3: Prompt · GT · Ours (BEAT) · ABC · REMI · Anticipate
Sample 4: Prompt · GT · Ours (BEAT) · ABC · REMI · Anticipate
Sample 5: Prompt · GT · Ours (BEAT) · ABC · REMI · Anticipate

Real-time Accompaniment

Given a 4-bar prompt and the subsequent melody, models generate accompaniment in real time. Each prompt contains the first four bars (full score) plus the subsequent melody (single track).

Sample 1: Prompt · GT · Ours (BEAT) · SongDriver
Sample 2: Prompt · GT · Ours (BEAT) · SongDriver
Sample 3: Prompt · GT · Ours (BEAT) · SongDriver
Sample 4: Prompt · GT · Ours (BEAT) · SongDriver
Sample 5: Prompt · GT · Ours (BEAT) · SongDriver