Stable Diffusion maker launches Stable Audio text-to-music AI
Stability AI, the tech firm behind popular text-to-image AI tool Stable Diffusion, has unveiled a new product focused on music and sound.
Stable Audio was announced this morning as a freemium service. Its free ‘Basic’ tier lets users generate and download music of up to 20 seconds in length in response to text prompts.
(For example: ‘Trance, Ibiza, Beach, Sun, 4 AM, Progressive, Synthesizer, 909, Dramatic Chords, Choir, Euphoric, Nostalgic, Dynamic, Flowing’ or ‘Disco, Driving Drum Machine, Synthesizer, Bass, Piano, Guitars, Instrumental, Clubby, Euphoric, Chicago, New York, 115 BPM’ – two of the sample tracks showcased as part of Stable Audio’s launch.)
The ‘Pro’ subscription increases the maximum length to 90 seconds, and its tracks can be downloaded and used in commercial projects; those created with the ‘Basic’ version can only be used for non-commercial ones.
The AI was trained on music and metadata from production-music company AudioSparx. This mirrors the approach taken by Meta earlier this year with its MusicGen AI, for which it struck a deal with Shutterstock and its Pond5 subsidiary.
However, in this case AudioSparx offered its musicians an opt-out if they did not want their work to be used to train Stable Audio. Around 10% of them chose to opt out. Those who opted in will get a share of the revenues from Stable Audio.
The launch follows the debut earlier this year of AI music models from Google (MusicLM) and Meta (MusicGen and its wider AudioCraft framework).
The product lead on Stable Audio is a familiar name to anyone who’s been following the AI music sector. Ed Newton-Rex was the co-founder and CEO of one of the first AI-music startups, Jukedeck, which was acquired in 2019 by TikTok’s parent firm ByteDance.
After leaving TikTok in 2021, Newton-Rex worked for a while as chief product officer at Snap-owned music-creation app Voisey. He then joined Stability AI in November 2022, initially as VP of product for its music-focused Harmonai project.
In February this year he moved to a more general role as VP of audio to work on Stable Audio. Music Ally talked to Newton-Rex and AudioSparx CEO Lee Johnson ahead of the new product’s announcement to find out more.
“Stability AI is obviously best known for its work in images, but this launch is our first product for music and audio generation,” said Newton-Rex.
“The concept is simple: you describe the music or audio you want to hear, and our system generates it for you. Primarily music, but actually it also works quite well for sound effects.”
“The thing I’m most excited about is musicians using it to create any sample they can think of, be that a drumbeat or a crazy melding of instruments, to use in their own music. We hope this is a useful and interesting sample generator.”
However, he made it clear that while Stable Audio is a commercial product – unlike MusicLM and MusicGen – it is also still experimental, and evolving rapidly.
Johnson is impressed. “I’m also a composer and producer, and have been completely blown away with its ability to create phenomenally human-sounding drum, bass, keyboard and guitar parts,” he said.
“I’m working on a song about vampires, and it created some great vampire screams for me, that also had a musical quality to them! I’ve had so much fun test-driving it and testing all kinds of different prompts: arpeggiation, key changes, beats per minute. It can handle all of this.”
Vampire screams? That’s a hint at how Stable Audio will have applications beyond the obvious. Newton-Rex talked about someone creating a sample using the prompt ‘hammering wood at 120bpm’ as another example.
“That kind of stuff is really fun: we don’t know the limits of this model,” he said. “We’re going to be encouraging people to go to the Stable Foundation Discord server, where there’ll be a Stable Audio channel to share prompts that are working well, and outputs. We’ll have a prompt guide, but people are going to do a better job than us at figuring out what works here.”
Stable Audio uses the ‘latent diffusion’ architecture that was first introduced with Stable Diffusion. For music, Newton-Rex said it enables the model to be trained much faster, and then to create audio of different lengths at a high quality – up to 44.1kHz stereo.
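For readers curious how that ‘latent diffusion’ idea fits together in practice, here is a minimal, hypothetical sketch in PyTorch: a toy denoiser works in a compressed latent space conditioned on a text embedding, and an autoencoder decodes the result back to a 44.1kHz stereo waveform. The module names, shapes and simplified denoising loop are illustrative assumptions, not Stability AI’s actual implementation.

```python
# Hypothetical latent-diffusion audio sketch (not Stability AI's code).
# All module names, shapes and the denoising loop are illustrative assumptions.
import torch
import torch.nn as nn

SAMPLE_RATE = 44_100               # 44.1kHz stereo output, as described in the article
LATENT_DIM = 64                    # compressed latent is far smaller than raw audio
DOWNSAMPLE = 1024                  # one latent frame stands in for 1024 audio samples

class TinyAutoencoder(nn.Module):
    """Stand-in for the autoencoder that maps waveforms to/from a compact latent.
    `encode` would be used during training; generation only needs `decode`."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Conv1d(2, LATENT_DIM, kernel_size=DOWNSAMPLE, stride=DOWNSAMPLE)
        self.decode = nn.ConvTranspose1d(LATENT_DIM, 2, kernel_size=DOWNSAMPLE, stride=DOWNSAMPLE)

class TinyDenoiser(nn.Module):
    """Stand-in for the diffusion network that denoises latents, conditioned on text."""
    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.net = nn.Conv1d(LATENT_DIM + text_dim, LATENT_DIM, kernel_size=3, padding=1)

    def forward(self, latent: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the text embedding along the time axis and concatenate with the latent.
        cond = text_emb[..., None].expand(-1, -1, latent.shape[-1])
        return self.net(torch.cat([latent, cond], dim=1))

def generate(text_emb: torch.Tensor, seconds: int = 20, steps: int = 50) -> torch.Tensor:
    """Start from noise in latent space, iteratively denoise, then decode to stereo audio."""
    ae, denoiser = TinyAutoencoder(), TinyDenoiser()
    latent_len = seconds * SAMPLE_RATE // DOWNSAMPLE   # variable output length comes from the latent length
    latent = torch.randn(1, LATENT_DIM, latent_len)
    for _ in range(steps):                             # toy update rule; real samplers follow a noise schedule
        latent = latent - 0.1 * denoiser(latent, text_emb)
    return ae.decode(latent)                           # shape (1, 2, samples): ~`seconds` of stereo audio

audio = generate(torch.randn(1, 512), seconds=20)      # text_emb would come from a text encoder in practice
print(audio.shape)
```

The key design point the sketch tries to convey is that the expensive diffusion process runs on a short latent sequence rather than on nearly a million raw samples, which is what makes training faster and variable-length, high-sample-rate output practical.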
“The audio quality is astonishing. If you’re a pro [musician] and download a WAV file, you can feel really confident about using that in your music,” he said.
“And this is a commercial product. There have been some really great advances this year: state-of-the-art music generation models using cutting-edge technology. But I think this is the first product using these cutting-edge advances that lets people download and use that music commercially.”
Johnson said that around 800,000 tracks from AudioSparx’s catalogue have been used in the initial training process for Stable Audio. The partnership should mean Stability AI is able to sidestep some of the legal issues that have dogged Stable Diffusion this year: lawsuits from artists and from Getty Images.
AudioSparx has music from more than 8,000 artists in its catalogue, and Johnson said that giving them agency in the partnership with Stability AI was important.
“10% of our community ended up opting out of the deal, but in the replies we got to our initial email about it there was a huge level of passion. Some were extremely passionate about participating, and some were equally passionate about not participating,” he said.
“It was definitely instrumental to give them that capability [to opt out], otherwise we’d have had a riot on our hands! We were very happy to structure that ability to opt out, or to opt in – because I do anticipate that once this launches, and people start to understand what Stable Audio’s about, some of the people who opted out may change their minds and opt in.”
“We are really pleased to have offered that opt-out to AudioSparx artists, and pleased that they will be sharing in the revenue from the product,” added Newton-Rex.
Musicians who work with libraries can feel particularly under threat from musical AIs: production music has often been one of the first use-cases offered by these technologies – including by Jukedeck back in the day. Johnson offered an alternative view.
“I think of these tools as more collaborating tools than replacing tools. I would encourage our artist community to see this as a tool that can help you in your writing, rather than something that’s going to replace you. We’re a long way away from being able to replace artists, especially for vocal music, because this [Stable Audio] doesn’t do vocal music.”
How rapidly will Stable Audio improve? Newton-Rex stressed that it’s still early days, suggesting that audio generation is “at least a year behind image generation”.
As an illustration, he pointed to some of the famous limitations and quirks of the first image-generating AIs: their struggles with rendering text, and especially human hands. But both problems are being tackled, and Newton-Rex thinks musical AIs will follow a similar path.
“There are definitely ‘bad human hands’ equivalents in the audio domain. The odd clearly-wrong note, a discordant note every so often. Although then again, Miles Davis said there are no wrong notes in jazz, so…”
Actually, jazz is an interesting topic, because that’s one genre that Stable Audio and its peers really struggle with. Classical music too, even though a high percentage of AI music startup founders are classically-trained musicians and composers – Newton-Rex included.
“It still can’t do jazz!” he said. “I’m the first to say this model can’t do everything. Classical music is just not really possible right now either. It’s very good when you’re generating beat-driven and ambient instrumentals. It’s really good at EDM. But it’s less good at anything classical, less good at jazz… less good in general at anything really melodic.”
“When you get down to stems, you can get some great drum loops, great basslines. But very melodic stuff, piano stems for example: I haven’t yet been able to find a prompt that will do those really well. So there are things it does well and things it can’t do yet. But that’s okay.”
He suggested that competition – between Stability AI, Meta and Google, but also smaller startups in the field (Music Ally has recently written about text-to-music tools launched by Splash, Mubert and Cassette, for example) – is what will push things forward.
“I would not claim that we’re the only ones doing anything good here. The field has come a long way in the last 12 months. Even the last six months. I’m pleased to be releasing something that we think takes things forward. And these models will continue to improve, probably rapidly, and will get more useful over time.”