BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps

Anonymous Authors

Abstract

Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented, including sequences, grids, and graphs. To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression.

In this paper, we instead consider whether an alternative tokenization is possible, where an even-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and long-term structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
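The idea of one token per (time step, pitch) pair, grouped by time step, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the resolution of 4 sub-steps per beat, the per-sub-step state alphabet (0 = silent, 1 = sustained, 2 = onset), the `<beat>` separator, the `p<pitch>:<pattern>` token format, and the function name `beat_tokenize`.

```python
from collections import defaultdict

STEPS_PER_BEAT = 4  # assumed resolution: 4 sub-steps (16th notes) per beat


def beat_tokenize(notes, steps_per_beat=STEPS_PER_BEAT):
    """Turn notes into a beat-grouped token list (illustrative only).

    notes: iterable of (onset_step, duration_steps, pitch), all integers,
    with time measured in sub-steps of a beat.
    """
    # by_beat[beat][pitch] holds one state per sub-step of that beat:
    # 0 = silent, 1 = sustained, 2 = onset (hypothetical alphabet).
    by_beat = defaultdict(dict)
    for onset, duration, pitch in notes:
        for step in range(onset, onset + duration):
            beat, pos = divmod(step, steps_per_beat)
            cell = by_beat[beat].setdefault(pitch, [0] * steps_per_beat)
            state = 2 if step == onset else 1
            cell[pos] = max(cell[pos], state)  # an onset wins over a sustain

    if not by_beat:
        return []

    tokens = []
    for beat in range(max(by_beat) + 1):
        # Every beat emits a separator, even when empty, so each group
        # always covers the same musical duration.
        tokens.append("<beat>")
        for pitch, cell in sorted(by_beat.get(beat, {}).items()):
            # Only pitches that sound in this beat get a token, which is
            # what makes this a sparse view of the piano roll.
            tokens.append(f"p{pitch}:{''.join(map(str, cell))}")
    return tokens


example = beat_tokenize([(0, 4, 60), (2, 2, 64), (4, 2, 60)])
print(example)
```

In this sketch a silent beat still contributes a `<beat>` token, so the position of a token group in the sequence maps directly to musical time, in contrast to event-based schemes where elapsed time depends on the tokens emitted.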

Subjective Evaluation Samples

Audio samples used in the subjective evaluation experiments (mean opinion score, MOS, study).

Piano Continuation

Each sample begins with the same 4-bar prompt followed by a generated continuation. Listeners rate samples on three criteria: Coherence with Prompt (continuity in melody/rhythm, consistency in harmony/style), Musical Plausibility (harmonic correctness, phrase structure), and Musicality (overall quality).

Sample 1
Prompt
GT
Ours
ABC
REMI
CP
Antic-S
Antic-L
Sample 2
Prompt
GT
Ours
ABC
REMI
CP
Antic-S
Antic-L
Sample 3
Prompt
GT
Ours
ABC
REMI
CP
Antic-S
Antic-L
Sample 4
Prompt
GT
Ours
ABC
REMI
CP
Antic-S
Antic-L
Sample 5
Prompt
GT
Ours
ABC
REMI
CP
Antic-S
Antic-L

Multitrack Continuation

Each sample begins with the same 4-bar multi-instrument prompt followed by a generated continuation. Listeners rate samples on three criteria: Coherence with Prompt (continuity in melody/rhythm, consistency in harmony/style), Musical Plausibility (harmonic correctness, phrase structure), and Musicality (overall quality).

Sample 1
Prompt
GT
Ours
ABC
REMI
Antic-S
Antic-L
Sample 2
Prompt
GT
Ours
ABC
REMI
Antic-S
Antic-L
Sample 3
Prompt
GT
Ours
ABC
REMI
Antic-S
Antic-L
Sample 4
Prompt
GT
Ours
ABC
REMI
Antic-S
Antic-L
Sample 5
Prompt
GT
Ours
ABC
REMI
Antic-S
Antic-L

Real-time Accompaniment

Each sample consists of a melody (vibraphone) and accompaniment (piano). The full melody and initial 4-bar accompaniment are fixed as the prompt, while the remaining accompaniment is generated. Listeners rate samples on three criteria: Coherence with Melody (harmony, rhythm alignment), Musical Plausibility (phrase structure, continuity), and Musicality (overall quality).

Sample 1
Prompt
GT
Ours
SongDriver
Sample 2
Prompt
GT
Ours
SongDriver
Sample 3
Prompt
GT
Ours
SongDriver
Sample 4
Prompt
GT
Ours
SongDriver
Sample 5
Prompt
GT
Ours
SongDriver