Turn Text and Melodies into New Songs: Meta Open Sources MusicGen
Meta's MusicGen can generate short new pieces of music from text prompts, optionally aligning them with existing melodies.
Like most language models today, MusicGen is based on the Transformer architecture. Just as a language model predicts the next token in a sentence, MusicGen predicts the next segment of a piece of music.
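The next-step prediction described here can be illustrated with a toy sampling loop. This is only a minimal sketch: `toy_model` is a hypothetical stand-in for the Transformer, and the eight-token vocabulary is invented for illustration, not taken from MusicGen.

```python
import random


def generate(next_token_probs, prompt, steps, seed=0):
    """Autoregressively extend `prompt`, sampling one token per step."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(steps):
        # The model sees everything generated so far and returns a
        # probability for each candidate next token.
        probs = next_token_probs(tokens)
        choices, weights = zip(*sorted(probs.items()))
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


def toy_model(context):
    """Hypothetical stand-in for a trained model (vocabulary of 8 token ids)."""
    last = context[-1]
    # Mostly step to the next id, sometimes skip one, rarely repeat.
    return {(last + 1) % 8: 0.7, (last + 2) % 8: 0.2, last: 0.1}


sequence = generate(toy_model, prompt=[0], steps=16)
print(sequence)
```

A real model would replace `toy_model` with a network that scores every entry of the audio tokenizer's vocabulary, but the outer loop has the same shape: predict, sample, append, repeat.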
The researchers used Meta's EnCodec audio tokenizer to break down audio data into smaller components. As a single-stage model that processes tokens in parallel, MusicGen is fast and efficient.
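One way to picture how several token streams can be handled in a single stage is the "delay" interleaving described in the MusicGen paper, where each stream of tokens from the tokenizer is shifted by one extra step relative to the previous one. The sketch below is an assumption-laden illustration in plain Python: the `PAD` value, the function names, and the example token ids are hypothetical and are not the library's API.

```python
PAD = -1  # hypothetical placeholder for positions with no token yet


def apply_delay_pattern(codes):
    """Shift token stream k to the right by k steps.

    `codes` is a list of K equally long token lists (one per stream).
    The result has K rows of length T + K - 1, so that at every step the
    model can emit one token for each stream in parallel.
    """
    k_streams = len(codes)
    return [
        [PAD] * k + list(stream) + [PAD] * (k_streams - 1 - k)
        for k, stream in enumerate(codes)
    ]


def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern and recover the aligned streams."""
    k_streams = len(delayed)
    length = len(delayed[0]) - (k_streams - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]


# Four streams of three time steps each (made-up token ids).
codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
```

With this layout the sequence grows by only K - 1 extra steps instead of being K times longer, which is the intuition behind the speed claim above.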
The team trained the model on 20,000 hours of licensed music. In particular, they relied on an internal dataset of 10,000 high-quality music tracks, as well as music data from Shutterstock and Pond5.
MusicGen can handle text and music cues
Apart from the efficiency of the architecture and the speed of generation, what makes MusicGen unique is its ability to handle both textual and musical cues. The text sets the basic style, which is then matched to the melody in the audio file.
For example, if you combine the text prompt "A lilting and upbeat EDM piece with syncopated drums, lilting pads and strong mood, tempo: 130 BPM" with the melody of Bach's world-famous "Toccata and Fugue in D minor" (BWV 565), MusicGen can generate a new musical fragment.
However, you currently cannot control precisely how closely the output follows the melody, for example to hear the same melody in different styles. The melody serves only as a rough guide for generation and is not reproduced exactly in the output.
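For readers who want to try text-plus-melody generation, the following is a minimal sketch using Meta's open-source `audiocraft` package. It assumes `audiocraft` and `torchaudio` are installed and that the melody-conditioned checkpoint is published as `facebook/musicgen-melody`; the function name, the output file stem, and the file `bach_toccata.wav` are hypothetical.

```python
def generate_with_melody(prompt, melody_path, out_stem="musicgen_sample", duration=8):
    """Generate a clip from a text prompt, loosely guided by a melody file.

    Writes `<out_stem>.wav` to the current directory.
    """
    # Heavy dependencies are imported lazily so this file can be loaded
    # even where they are not installed.
    import torchaudio
    from audiocraft.models import MusicGen
    from audiocraft.data.audio import audio_write

    # Assumed checkpoint name for the melody-conditioned model.
    model = MusicGen.get_pretrained("facebook/musicgen-melody")
    model.set_generation_params(duration=duration)  # clip length in seconds

    # torchaudio.load returns a [channels, samples] tensor and its sample rate.
    melody, sample_rate = torchaudio.load(melody_path)

    # One text description and one melody, batched along the first dimension.
    wav = model.generate_with_chroma([prompt], melody[None], sample_rate)

    # Save with loudness normalization.
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")


# Example call (needs the model weights and a local audio file):
# generate_with_melody(
#     "A lilting and upbeat EDM piece with syncopated drums, tempo: 130 BPM",
#     "bach_toccata.wav",
# )
```

As noted above, the melody acts as a rough guide, so repeated runs with the same inputs can still differ noticeably.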
MusicGen beats Google's MusicLM
The authors of the study tested three versions of the model at different sizes: 300 million (300M), 1.5 billion (1.5B), and 3.3 billion (3.3B) parameters. They found that larger models produced higher-quality audio, but the 1.5 billion-parameter model was rated best by humans. The 3.3 billion-parameter model was better at accurately matching text input and audio output.
Compared to other music models such as Riffusion, Mousai, MusicLM, and Noise2Music, MusicGen performed better on both objective and subjective measures of how well the music matches the text prompt and how plausible the composition is. Overall, the models perform slightly above the level of Google's MusicLM.
Meta has open-sourced the code and models on GitHub for commercial use. A demo is available on Hugging Face.
Objective metrics:
Fréchet Audio Distance (FAD): Lower values indicate more plausible generated audio.
Kullback-Leibler Divergence (KL): Lower values indicate that the generated music shares similar concepts with the reference music.
CLAP Score: This score quantifies audio-text alignment.
Subjective metrics:
Overall Quality (OVL): Human raters rate the perceived quality of audio samples on a scale of 1 to 100.
Relevance to Text Input (REL): Raters rate the match between audio and text on a scale of 1 to 100.
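To make two of the objective metrics above more concrete, here is a small sketch in plain Python. It is a simplification under stated assumptions: the KL divergence is computed between two made-up label distributions, and the Fréchet distance is shown only for one-dimensional Gaussians, whereas the real FAD compares multivariate statistics of audio embeddings. All numbers are invented for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def frechet_distance_1d(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between two 1-D Gaussians.

    In one dimension the general formula reduces to the squared difference
    of the means plus the squared difference of the standard deviations.
    """
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2


# Made-up label probabilities for a reference clip and a generated clip.
reference = [0.6, 0.3, 0.1]
generated = [0.5, 0.3, 0.2]
print(kl_divergence(reference, generated))

# Made-up embedding statistics (mean, standard deviation).
print(frechet_distance_1d(0.0, 1.0, 3.0, 2.0))
```

In both cases a value of zero means the two distributions are identical, which is why lower is better for FAD and KL.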