BEAT: Rethinking Symbolic Music Encoding with Beat-Anchored Tone Representations
Abstract
The perception of symbolic music rests fundamentally on the relative durations of notes and the interval relationships between them. Existing representations, however, face an inherent trade-off between preserving these musical priors and achieving encoding efficiency. In this paper, we propose BEAT (Beat-based Encoding with Anchored Tones), which breaks from the conventional note-event paradigm by adopting the beat as the basic primitive. BEAT segments the temporal axis into beats and organizes music as a sequence of beat units. Within each beat, notes are sorted by pitch; the first serves as an anchor tone, and each subsequent note is encoded as a chained relative interval coupled with a rhythmic pattern. This design achieves efficient serialization while explicitly preserving musical priors.
Additionally, BEAT employs beat indexing for temporal alignment, which improves both concurrent modeling and editability and enables seamless adaptation to diverse downstream tasks. Extensive experiments demonstrate that BEAT encodes music more efficiently than existing methods while achieving state-of-the-art generation quality on both objective and subjective benchmarks, and it consistently outperforms baselines across diverse tasks including multi-track generation, segment infilling, and real-time accompaniment.
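To make the scheme concrete, below is a minimal, illustrative sketch of a beat-anchored encoder. It is not the paper's exact tokenizer, and all token names (`BEAT_i`, `ANCHOR_p`, `IVL_d`, `POS_/DUR_`) are hypothetical stand-ins for the real vocabulary.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Note:
    onset: float  # onset time, in beats
    pitch: int    # MIDI pitch
    dur: float    # duration, in beats

def encode_beat_anchored(notes, subdiv=4):
    """Toy BEAT-style encoder: one token group per beat.

    Notes are bucketed by beat and sorted by pitch; the first note in a
    beat is an absolute anchor tone, each later note a chained semitone
    interval from its predecessor. Onsets and durations are quantized to
    `subdiv` slots per beat to form the rhythmic pattern.
    """
    beats = defaultdict(list)
    for n in notes:
        beats[int(n.onset)].append(n)

    tokens = []
    for b in sorted(beats):
        tokens.append(f"BEAT_{b}")  # explicit beat index (temporal anchor)
        prev_pitch = None
        for n in sorted(beats[b], key=lambda n: n.pitch):
            if prev_pitch is None:
                tokens.append(f"ANCHOR_{n.pitch}")            # absolute anchor tone
            else:
                tokens.append(f"IVL_{n.pitch - prev_pitch}")  # chained interval
            pos = round((n.onset - b) * subdiv)
            tokens.append(f"POS_{pos}_DUR_{round(n.dur * subdiv)}")
            prev_pitch = n.pitch
    return tokens

# A C-major triad on beat 0, then a single G on beat 1:
print(encode_beat_anchored([Note(0, 60, 1), Note(0, 64, 1),
                            Note(0, 67, 1), Note(1, 67, 0.5)]))
# ['BEAT_0', 'ANCHOR_60', 'POS_0_DUR_4', 'IVL_4', 'POS_0_DUR_4',
#  'IVL_3', 'POS_0_DUR_4', 'BEAT_1', 'ANCHOR_67', 'POS_0_DUR_2']
```

Because intervals are relative, a transposed passage yields identical `IVL_` tokens, which is one way such an encoding can preserve interval priors that absolute-pitch schemes discard.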
Encoding Efficiency
BPE Compression Rate
Compression rate after adding BPE merges (lower is better); stronger compression indicates the encoding is more language-like. A measurement sketch follows the table.
| Method | +20 | +100 | +500 | +5000 |
|---|---|---|---|---|
| ABC | 80.15% | 62.66% | 52.57% | 43.57% |
| REMI | 80.18% | 70.59% | 57.95% | 38.81% |
| Ours (BEAT) | 64.83% | 58.93% | 49.32% | 37.31% |
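As a reference for how such numbers can be obtained (a sketch under assumptions, not the paper's measurement code): train BPE greedily on a token stream and report the ratio of merged length to original length. We read the +N column labels as N added BPE merges, each of which introduces one new vocabulary token; that reading is an assumption.

```python
from collections import Counter

def bpe_compression_rate(seq, num_merges):
    """Greedy BPE over a symbol sequence; returns len(after) / len(before).

    Lower is better: stronger compression means the encoding exposes more
    reusable structure to BPE, i.e. behaves more like natural language.
    """
    original_len = len(seq)
    seq = list(seq)
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append((a, b))  # fused pair becomes one symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return len(seq) / original_len

toy_stream = ["BEAT_0", "ANCHOR_60", "POS_0_DUR_4", "IVL_4", "POS_0_DUR_4"] * 200
for m in (20, 100, 500):
    print(f"+{m}: {bpe_compression_rate(toy_stream, m):.2%}")
```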
Sequence Length Comparison
Average token count on the Advanced Piano Expression (APEX) dataset
| Method | Average Tokens |
|---|---|
| REMI | 2400 |
| ABC | 3450.4 |
| Ours (BEAT) | 1825.6 |
Unconditional Generation Results
Evaluation metrics: MusPy statistics (PR, GC, SC) with their JS divergence from ground truth, CLaMP 2-based FMD, MOS, and LLM preference
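For reference, these metrics are typically computed as follows (a sketch under assumptions, not the paper's exact pipeline): the JS divergence compares histograms of a per-piece MusPy statistic between the generated and ground-truth sets, and FMD is a Fréchet distance over CLaMP 2 embeddings, computed with the same formula as FID.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.linalg import sqrtm

def js_divergence(gen_stats, gt_stats, bins=32):
    """JS divergence between histograms of one scalar MusPy metric
    (e.g. scale consistency) over generated vs. ground-truth pieces."""
    lo = min(gen_stats.min(), gt_stats.min())
    hi = max(gen_stats.max(), gt_stats.max())
    p, _ = np.histogram(gen_stats, bins=bins, range=(lo, hi))
    q, _ = np.histogram(gt_stats, bins=bins, range=(lo, hi))
    return jensenshannon(p, q) ** 2  # scipy returns the JS *distance*

def frechet_distance(gen_emb, gt_emb):
    """Frechet distance between Gaussians fitted to two embedding sets
    (the FID formula, applied here to CLaMP 2 music embeddings)."""
    mu1, mu2 = gen_emb.mean(0), gt_emb.mean(0)
    s1 = np.cov(gen_emb, rowvar=False)
    s2 = np.cov(gt_emb, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginaries
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```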
Model Configurations
GPT-2 Small: 6 layers, Hidden Size 512
GPT-2 Large: 12 layers, Hidden Size 768
LLaMA: 16 layers, Hidden Size 768
LLaMA (16L, 768H)
| Method | PR | GC | SC | JS | FMD | MOS | LLM |
|---|---|---|---|---|---|---|---|
| REMI | - | - | - | - | - | - | - |
| CP | - | - | - | - | - | - | - |
| Octuple | - | - | - | - | - | - | - |
| ABC | - | - | - | - | - | - | - |
| Ours (BEAT) | - | - | - | - | 101 | - | - |
| GT | μ=0.579, σ=0.321 | μ=0.984, σ=0.007 | μ=0.935, σ=0.078 | - | - | - | - |
GPT-2 Large (12L, 768H)
| Method | PR | GC | SC | JS | FMD | MOS | LLM |
|---|---|---|---|---|---|---|---|
| REMI | - | - | - | - | - | - | - |
| CP | - | - | - | - | - | - | - |
| ABC | - | - | - | - | - | - | - |
| Ours (BEAT) | - | - | - | - | 101 | - | - |
| GT | μ=0.579, σ=0.321 | μ=0.984, σ=0.007 | μ=0.935, σ=0.078 | - | - | - | - |
GPT-2 Small (6L, 512H)
| Method | PR | GC | SC | JS | FMD | MOS | LLM |
|---|---|---|---|---|---|---|---|
| REMI | - | - | - | - | - | - | - |
| CP | - | - | - | - | - | - | - |
| Octuple | - | - | - | - | - | - | - |
| ABC | - | - | - | - | - | - | - |
| Ours (BEAT) | - | - | - | - | 101 | - | - |
| GT | μ=0.579, σ=0.321 | μ=0.984, σ=0.007 | μ=0.935, σ=0.078 | - | - | - | - |
Piano Continuation Results
20 ground-truth samples, 4-bar prompts, 10 generations each (with velocity); a sketch of the sampling loop follows the table.
| Method | FMD | MOS |
|---|---|---|
| REMI | 454.07 | - |
| ABC | 460.37 | - |
| Ours (BEAT) | 310.55 | - |
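The sampling loop behind this protocol can be sketched as follows, assuming a trained causal LM over BEAT tokens; the `model`/`prompt_ids` names are placeholders, and the decoding hyperparameters are illustrative, not the paper's settings.

```python
import torch
from transformers import GPT2LMHeadModel  # any causal LM with .generate()

# model = GPT2LMHeadModel.from_pretrained(...)  # hypothetical BEAT-trained LM

@torch.no_grad()
def continue_piece(model, prompt_ids, n_samples=10, max_new=1024,
                   temperature=1.0, top_p=0.95):
    """Sample n_samples continuations of a tokenized 4-bar prompt."""
    model.eval()
    out = model.generate(
        prompt_ids.repeat(n_samples, 1),  # batch the 10 generations
        max_new_tokens=max_new,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )
    return out[:, prompt_ids.shape[1]:]  # keep only the continuation tokens
```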
Multi-track Segment Infilling Visualization
Generated segments shown in the multi-track piano roll (before → after); a sketch of beat-indexed infilling follows the samples.
Sample 1
Sample 2
Sample 3
Sample 4
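Explicit beat indices make this kind of edit a splice rather than a re-tokenization. Below is a generic mask-and-regenerate infilling sketch over a beat-indexed token stream; `model_decode` is a hypothetical callable, and the scheme itself is a common infilling recipe, not necessarily the paper's.

```python
def infill_segment(model_decode, tokens, beat_span):
    """Regenerate the beats in `beat_span` (inclusive index range).

    Assumes tokens like 'BEAT_7' mark beat boundaries, as in the toy
    encoder above. `model_decode(prefix, suffix)` returns new tokens for
    the masked span, conditioned on both sides.
    """
    lo, hi = beat_span
    starts = {int(t.split("_")[1]): i
              for i, t in enumerate(tokens) if t.startswith("BEAT_")}
    a = starts[lo]                       # first token of beat `lo`
    b = starts.get(hi + 1, len(tokens))  # first token after beat `hi`
    prefix, suffix = tokens[:a], tokens[b:]
    return prefix + model_decode(prefix, suffix) + suffix
```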
Audio Samples
Generated music samples demonstrating BEAT encoding quality
Generation Showcase with Score Visualization
Two unconditionally generated piano pieces with synchronized score display
Piano Continuation - Method Comparison
Each sample pairs a 4-bar prompt (~8 seconds) with continuations from models trained on different encodings (BEAT, REMI, ABC), alongside the ground truth (GT)
Sample 1
Sample 2
Sample 3
Expressive Piano Continuation
Each sample shows one GT prompt with multiple BEAT continuations, demonstrating generation diversity
Expressive Sample 1
Expressive Sample 2
Expressive Sample 3
Real-time Accompaniment
Piano accompaniment generation given melody input
Accompaniment Samples
Ground truth vs BEAT-generated accompaniment
Multi-track Ensemble
Multi-instrument coordination with rich harmonic textures (all samples truncated to 30s after audio rendering)
Subjective Evaluation Samples
Audio samples used for subjective evaluation experiments (MOS study)
Piano Continuation
Given a 4-bar prompt, models generate a complete piano continuation; GT is compared against 5 baseline methods.
Multitrack Continuation
Given a 4-bar prompt, models generate a complete multi-track continuation; GT is compared against 4 baseline methods.
Real-time Accompaniment
Given a 4-bar prompt and the subsequent melody, models generate accompaniment in real time. The prompt contains the first 4 bars (all tracks) plus the subsequent melody (single track).
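Beat indexing is also what keeps the melody and accompaniment streams aligned in this setting: generation can proceed one beat at a time without looking ahead in the melody. A minimal sketch of the interaction loop follows; the interleaving scheme and both callables are assumptions, not the paper's exact protocol.

```python
def realtime_accompany(prompt_tokens, melody_beats, decode_beat):
    """Interleave incoming melody beats with generated accompaniment.

    `melody_beats` yields one beat of melody tokens at a time (arriving
    in real time); `decode_beat(context)` queries the model for
    accompaniment tokens up to the next beat boundary. Shared beat
    indices keep the two streams temporally aligned.
    """
    context = list(prompt_tokens)     # first 4 bars, all tracks
    for beat_tokens in melody_beats:  # the live melody stream
        context.extend(beat_tokens)   # condition on the new melody beat
        acc = decode_beat(context)    # accompaniment for the same beat
        context.extend(acc)
        yield acc
```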