AI-generated music is already an innovative enough concept, but Riffusion takes it to another level with a clever, weird approach that produces weird and compelling music using not audio but images of audio.
Sounds strange, is strange. But if it works, it works. And it does work! Kind of.
Diffusion is a machine learning technique for generating images that has supercharged the AI world over the last year. DALL-E 2 and Stable Diffusion are the two highest-profile models, and they work by gradually replacing visual noise with what the AI thinks a prompt ought to look like.
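For the curious, here's a minimal sketch of that denoising loop using the open-source diffusers library. The checkpoint name is illustrative, and the text conditioning is stubbed out with a placeholder embedding; a real pipeline would encode an actual prompt:

```python
import torch
from diffusers import DDIMScheduler, UNet2DConditionModel

# Load the denoising UNet and a scheduler from a Stable Diffusion checkpoint
# (the repo name here is illustrative; any SD v1-style checkpoint would do).
repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(repo, subfolder="scheduler")
scheduler.set_timesteps(50)

latents = torch.randn(1, 4, 64, 64)   # start from pure latent noise
cond = torch.zeros(1, 77, 768)        # placeholder for a real prompt embedding

# Each step: predict the noise in the current image, then remove a bit of it.
with torch.no_grad():
    for t in scheduler.timesteps:
        noise_pred = unet(latents, t, cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
# `latents` now holds a denoised latent image, ready for the VAE decoder.
```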
The technique has proved powerful in many contexts and is very amenable to fine-tuning, where you give the mostly trained model lots of a specific kind of content in order to have it specialize in producing more examples of that content. For instance, you could fine-tune it on watercolors or on photos of cars, and it would prove more capable of reproducing either of those things.
What Seth Forsgren and Hayk Martiros did for their hobby project Riffusion was fine-tune Stable Diffusion on spectrograms.
“Hayk and I play in a little band together, and we started the project simply because we love music and didn’t know if it would be even possible for Stable Diffusion to create a spectrogram image with enough fidelity to convert into audio,” Forsgren told TechCrunch. “At every step along the way we’ve been more and more impressed by what is possible, and one idea leads to the next.”
What are spectrograms, you ask? They’re visual representations of audio that show the amplitude of different frequencies over time. You have probably seen waveforms, which show volume over time and make audio look like a series of hills and valleys; imagine if instead of just total volume, it showed the volume of each frequency, from the low end to the high end.
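You can make one yourself in a few lines of Python with the librosa library; the window and hop sizes below are common defaults, not necessarily what Riffusion uses:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load an audio clip and compute a magnitude spectrogram with the
# short-time Fourier transform.
y, sr = librosa.load("clip.wav", sr=44100)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Plot in decibels: time runs left to right, frequency bottom to top,
# and brightness shows how loud each frequency is at each moment.
S_db = librosa.amplitude_to_db(S, ref=np.max)
img = librosa.display.specshow(S_db, sr=sr, hop_length=512,
                               x_axis="time", y_axis="log")
plt.colorbar(img, format="%+2.0f dB")
plt.savefig("spectrogram.png")
```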
Here’s part of one I made of a song (“Marconi’s Radio” by Secret Machines, if you’re wondering):

Image Credits: Devin Coldewey
You can see how it gets louder in all frequencies as the song builds, and you can even spot individual notes and instruments if you know what to look for. The process isn’t inherently perfect or lossless by any means, but it is an accurate, systematic representation of the sound. And you can convert it back to sound by doing the same process in reverse.
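That reverse conversion is a standard technique in its own right. A magnitude spectrogram throws away phase information, so algorithms like Griffin-Lim estimate it iteratively; here’s a minimal round trip with librosa, again with generic parameters rather than Riffusion’s own:

```python
import librosa
import numpy as np
import soundfile as sf

# Audio -> magnitude spectrogram -> audio. Griffin-Lim iteratively
# estimates the phase that the magnitude spectrogram discarded, so the
# reconstruction is clearly recognizable but not bit-identical.
y, sr = librosa.load("clip.wav", sr=44100)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
y_rec = librosa.griffinlim(S, n_iter=32, hop_length=512)
sf.write("reconstructed.wav", y_rec, sr)
```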
Forsgren and Martiros made spectrograms of a bunch of music and tagged the resulting images with the relevant terms, like “blues guitar,” “jazz piano,” “afrobeat,” stuff like that. Feeding the model this collection gave it a good idea of what certain sounds “look like” and how it might re-create or combine them.
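In code, that kind of fine-tuning boils down to the standard Stable Diffusion training step run over (spectrogram image, caption) pairs. This sketch uses the diffusers library and elides data loading and batching; the checkpoint name and hyperparameters are illustrative, not the pair’s actual setup:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"  # illustrative base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
noise_sched = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def train_step(images, captions):
    """One step: images are spectrograms, captions are tags like 'jazz piano'."""
    # Compress images into the VAE's latent space (0.18215 is SD's scale factor).
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_sched.config.num_train_timesteps,
                      (latents.shape[0],))
    noisy = noise_sched.add_noise(latents, noise, t)
    # Encode the genre tags so the model learns the text-to-sound association.
    tokens = tokenizer(captions, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    # Teach the UNet to predict the injected noise on spectrogram images.
    loss = F.mse_loss(unet(noisy, t, text_emb).sample, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```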
Here’s what the diffusion process looks like if you sample it as it refines the image:

Image Credits: Seth Forsgren / Hayk Martiros
And indeed the model proved capable of producing spectrograms that, when converted to sound, are a pretty good match for prompts like “funky piano,” “jazzy saxophone,” and so on. Here’s an example:

Image Credits: Seth Forsgren / Hayk Martiros
But of course a square spectrogram (512 x 512 pixels, a standard Stable Diffusion resolution) represents only a short clip; a three-minute song would be a much, much wider rectangle. No one wants to listen to music five seconds at a time, but the limitations of the system they’d created meant they couldn’t just create a spectrogram 512 pixels tall and 10,000 wide.
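The arithmetic makes the problem plain. Assuming a 512-pixel-wide image covers roughly five seconds of audio (about the length of Riffusion’s clips), a full song would need an image many thousands of pixels wide:

```python
# Back-of-the-envelope: spectrogram width for a full song, assuming a
# 512-pixel-wide image covers ~5 seconds of audio.
pixels_per_second = 512 / 5               # ~102 columns of pixels per second
song_seconds = 3 * 60                     # a three-minute song
print(song_seconds * pixels_per_second)   # ~18,400 pixels wide
```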
After trying a few things, they took advantage of the fundamental structure of large models like Stable Diffusion, which have a great deal of “latent space.” This is sort of like the no-man’s-land between more well-defined nodes. Like if you had an area of the model representing cats, and another representing dogs, what’s “between” them is latent space that, if you just told the AI to draw from it, would produce some kind of dogcat, or catdog, even though there’s no such thing.
By the way, latent space stuff gets a lot weirder than that.
No creepy nightmare worlds for the Riffusion project, though. Instead, they found that if you have two prompts, like “church bells” and “electronic beats,” you can sort of step from one to the other a bit at a time and it gradually and surprisingly naturally fades from one to the other, on the beat even.
It’s a strange, interesting sound, though obviously not particularly complex or high-fidelity; remember, they weren’t even sure that diffusion models could do this at all, so the ease with which this one turns bells into beats or typewriter taps into piano and bass is pretty remarkable.
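Mechanically, that fade is interpolation: encode both prompts, blend the embeddings a step at a time, and generate every frame from the same starting noise so the clips stay aligned. A minimal sketch with the standard diffusers pipeline (Riffusion’s own code is more refined, and the names here are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def embed(prompt):
    # Turn a text prompt into the embedding the UNet is conditioned on.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt")
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids)[0]

start, end = embed("church bells"), embed("electronic beats")
noise = torch.randn(1, 4, 64, 64)  # reusing one seed keeps the frames aligned

for i, alpha in enumerate(torch.linspace(0, 1, steps=8)):
    # Blend the two prompt embeddings; each step yields a spectrogram
    # partway between the two sounds.
    mixed = torch.lerp(start, end, alpha)
    image = pipe(prompt_embeds=mixed, latents=noise.clone()).images[0]
    image.save(f"bells_to_beats_{i}.png")
```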
Producing longer-form clips is possible but still theoretical:
“We haven’t really tried to create a classic 3-minute song with repeating choruses and verses,” Forsgren said. “I think it could be done with some clever tricks such as building a higher-level model for song structure, and then using the lower-level model for individual clips. Alternatively, you could deeply train our model with much larger resolution images of full songs.”
Where does it go from here? Other groups are attempting to create AI-generated music in different ways, from using speech synthesis models to specially trained audio ones like Dance Diffusion.
Riffusion is more of a “wow, look at this” demo than any kind of grand plan to reinvent music, and Forsgren said he and Martiros were just happy to see people engaging with their work, having fun and iterating on it:
“There are many directions we could go from here, and we’re excited to keep learning along the way. It’s been fun to see other people already building their own ideas on top of our code this morning, too. One of the amazing things about the Stable Diffusion community is how quickly people build on top of it in directions that the original authors can’t predict.”
You can test it out in a live demo at Riffusion.com, but you might have to wait a bit for your clip to render; this got a little more attention than the creators were expecting. The code is all available via the about page, so feel free to run your own as well, if you’ve got the chips for it.