transcript-site/content/post/Latent Space/Latent-Space-Making-Transformers-Sing---with-Mikey-Shulman-of-Suno.lrc
2024-05-22 12:18:59 +08:00


[by:whisper.cpp]
[00:00.00] (upbeat music)
[00:02.58] - Hey everyone, welcome to the Latent Space podcast.
[00:08.26] This is Alessio, partner and CTO-in-residence
[00:10.32] at Decibel Partners,
[00:12.02] and I'm joined by my co-host Swyx, founder of Smol AI.
[00:15.42] - Hey, so today we're in the remote studio
[00:18.18] with Mikey Shulman, welcome.
[00:19.62] - Thank you, it's great to be here.
[00:21.38] - So I'd like to go over people's background on LinkedIn
[00:24.42] and then maybe find a little bit more outside of LinkedIn.
[00:26.82] You did your bachelor's in physics
[00:30.10] and then a PhD in physics as well.
[00:32.18] Also, before Suno you were at Kensho Technologies,
[00:34.58] the home of a lot of top AI startups it seems,
[00:37.02] where you were head of machine learning
[00:39.02] for seven years.
[00:40.86] You're also a lecturer at MIT,
[00:42.74] which we can talk about, like what you taught.
[00:45.30] And then about two years ago,
[00:47.78] you left to start Suno,
[00:50.70] which is recently burst on the scene
[00:52.86] as one of the top music generation startups.
[00:55.74] So we can go over that bio,
[00:57.18] but also I guess what's not on your LinkedIn
[00:58.82] that people should know about you?
[00:59.94] - I love music.
[01:01.14] I am an aspiring mediocre musician.
[01:03.98] I wish I were better,
[01:05.06] but that doesn't make me not enjoy playing real music.
[01:07.46] And I also love coffee.
[01:09.26] I'm probably way too much into coffee.
[01:11.42] - Are you one of those people that,
[01:14.86] they do the TikToks,
[01:15.70] they use like 50 tools to like grind the beans
[01:18.74] and then like brush them and then like spray them.
[01:21.18] Like whatever we're talking about here.
[01:22.78] - I confess there's a spray bottle for beans
[01:26.10] in the next room.
[01:27.58] There is one of those weird comb tools.
[01:29.66] So guilty.
[01:31.34] I don't put it on TikTok though.
[01:33.06] - Yeah, no, no, some things gotta stay private.
[01:36.14] What do you play?
[01:37.58] - I played a lot of piano growing up
[01:39.26] and I play bass and I, in a very mediocre way,
[01:42.62] play guitar and drums.
[01:43.94] - That's a lot.
[01:44.78] I cannot do any of those things.
[01:45.94] So as Sean mentioned,
[01:47.34] you guys kind of burst into the scene
[01:49.10] as maybe the state of the art music generation company.
[01:52.58] I think it's a model
[01:53.90] that we haven't really covered in the past.
[01:55.82] So I would love to maybe for you
[01:58.46] to just give a brief answer of like,
[02:00.46] how do you do music generation
[02:02.14] and why is it possible?
[02:04.14] Because I think people understand you take texts
[02:06.38] and you have a predict the next word
[02:08.38] and you take a diffusion model
[02:10.14] and you basically like add noise to an image
[02:12.18] and then kind of remove the noise.
[02:14.38] But I think for music,
[02:15.86] it's hard for people to have a mental model.
[02:17.62] Like, how do you train a music model?
[02:19.30] And like, what does a music model do to generate a song?
[02:21.70] So maybe we can start there.
[02:23.94] - Yeah, maybe I'll even take one more step back
[02:26.30] and say it's not even entirely worked out
[02:29.62] in the same way that it is in text.
[02:31.46] And so it's an evolving field.
[02:33.34] If you take a giant step back,
[02:34.82] I think audio has been lagging images and text for a while.
[02:39.78] So I think very roughly you can think of audio as
[02:42.10] like one to two years behind images and text.
[02:44.14] And so you kind of have to think of today
[02:46.90] like where text was in 2022 or something like this.
[02:50.14] The transformer has been invented.
[02:53.10] It looks like it works,
[02:53.94] but it's far, far less established.
[02:55.58] And so I'll give you the way we think about the world now,
[02:59.34] but just with a big caveat that I'm probably wrong
[03:02.02] if we look back in a couple of years from now.
[03:05.10] And I think the biggest thing is you see
[03:06.54] both transformer-based and diffusion-based models for audio.
[03:09.86] And in a way that is not true in text:
[03:12.10] I know people will do some diffusion for text,
[03:14.26] but I think nobody's like really doing that for real.
[03:17.30] So we prefer transformers for a variety of reasons.
[03:19.98] And so you can think it's very similar to text.
[03:22.82] You have some abstract notion of a token
[03:25.18] and you train a model to predict the probability
[03:29.70] over all of the next tokens.
[03:31.30] So it's a language model.
[03:32.86] You can think a language model
[03:34.78] is just anything that assigns likelihoods
[03:37.34] to sequences of tokens.
[03:38.86] Sometimes those tokens correspond to text.
[03:40.94] In our case, they correspond to music or audio in general.
[03:44.34] And I think we've learned a lot from our friends
[03:47.66] in the text domain from the pioneers doing this
[03:50.18] of how well these transformer models work.
[03:52.82] Where do they work? Where do they not work?
[03:54.54] But at its core, the way we like to do things
[03:57.06] with transformers is exactly like it works in text.
[04:00.26] Let me predict the next tiny little bit of audio.
[04:02.78] And I can just keep doing that and doing that
[04:04.42] and generating audio as long as I want.
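(Editor's note: a minimal sketch of the next-token loop described here. `model` and `audio_codec` are hypothetical stand-ins for a decoder-only transformer over audio tokens and a neural audio codec; Suno's actual models and tokenizer are not public.)

```python
import torch

def generate_audio(model, audio_codec, prompt_tokens, n_new_tokens, temperature=1.0):
    """Autoregressively extend a sequence of discrete audio tokens,
    exactly like next-word prediction in a text language model."""
    tokens = prompt_tokens.clone()            # shape: (1, seq_len)
    for _ in range(n_new_tokens):
        logits = model(tokens)[:, -1, :]      # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return audio_codec.decode(tokens)         # map tokens back to a waveform
```

Because the loop has no fixed endpoint, you can keep sampling tokens and stream audio for as long as you want, just as described.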
[04:07.10] - Yeah, I think the temptation here
[04:08.82] is to always try to bake in some specialized knowledge
[04:11.66] about music or audio.
[04:14.02] And obviously you will get an improvement in your output
[04:16.98] if you try to just say, "Okay, here's a set of tokens
[04:20.94] "that only do jazz or only do voices."
[04:25.66] How general do you make it
[04:26.78] versus how specific do you make it?
[04:28.34] - We've always tried to do things "the right way,"
[04:32.38] which means that at the beginning things
[04:34.14] are going to be hard and worse than other ways.
[04:37.82] But that is to say, bake in as little
[04:40.98] kind of implicit knowledge as possible.
[04:43.74] And so the same way you don't program into GPT,
[04:47.46] you don't say, "This is a noun and this is a verb,"
[04:49.98] but it has implicitly learned all of those things.
[04:52.82] I've never seen GPT accidentally put a noun
[04:55.50] where it meant to put an article in English.
[04:57.90] We try not to impose anything about music
[05:01.06] or audio in general into the model
[05:02.98] and we kind of let the models learn things by themselves.
[05:05.54] And I think things are beginning to pay off,
[05:07.70] but it's not necessarily obvious from the beginning
[05:10.62] that that was the right thing to do.
[05:11.70] So for example, you could take something like text-to-speech
[05:16.14] and people will do all sorts of things
[05:18.50] where you can program in things like phonemes
[05:21.10] to be the basis for what you do.
[05:22.54] And then that kind of limits you
[05:24.18] to the set of things that are expressible by phonemes.
[05:27.02] And so ultimately that works really well in the short term.
[05:30.34] In the long term, it can be quite limiting.
[05:32.38] And so our approach has always been to try to do this
[05:35.66] in its full generality as end to end as we can do it.
[05:38.94] Even if it means that in the short term
[05:41.06] we're a little bit worse,
[05:42.30] we have a lot of confidence that in the long term
[05:44.38] that will be the right way to do it.
[05:46.10] - And what's the data recipe for training a good music model?
[05:49.58] Like what percentage of each genre do you put in?
[05:52.66] And also, do you split vocals and instrumentals?
[05:56.14] - So you have to do lots of things.
[05:57.90] And I think this is the biggest area
[06:01.10] where we have sort of our secret sauce.
[06:03.82] I think to a large extent, what we do
[06:05.82] is we benefit from all of the beautiful things
[06:08.74] people do with transformers and text
[06:10.26] and we focus very hard basically
[06:12.10] on how do I tokenize audio in the right way.
[06:14.74] And without divulging too much secret sauce,
[06:17.82] it's at least similar to how it's done
[06:20.22] in sort of the open source stuff.
[06:21.90] You will have different models that learn to encode audio
[06:24.42] in discrete representations.
[06:26.98] And a lot of this boils down to figuring out the right,
[06:31.34] let's say implicit biases to put in those models,
[06:33.82] the right data to inject.
[06:35.34] How do I make sure that I can produce
[06:37.50] kind of all audio arbitrarily?
[06:39.02] That's speech, that's background music,
[06:41.38] that's vocals, that's kind of everything
[06:43.26] to make sure that I can really capture
[06:44.66] all the behavior that I want to.
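(Editor's note: a toy illustration of the "discrete representations" idea, in the spirit of the open-source neural codecs alluded to here, e.g. SoundStream or EnCodec. This is not Suno's tokenizer; real codecs stack several quantizers residually and learn the codebook end to end.)

```python
import torch

def quantize(frames, codebook):
    """Map continuous encoder frames to discrete token ids (a single VQ step).

    frames:   (n_frames, dim)   -- output of a learned audio encoder
    codebook: (vocab_size, dim) -- learned code vectors
    """
    dists = torch.cdist(frames, codebook)  # distance to every code vector
    return dists.argmin(dim=-1)            # nearest code id = the "audio token"

# e.g. 30 seconds of audio at 50 frames/sec becomes ~1500 discrete tokens,
# which a transformer can then model just like a sequence of words.
```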
[06:46.50] - Yeah, that makes sense.
[06:47.94] We had our monthly recap last month
[06:50.38] and the data wars were kind of one of the hot topics,
[06:53.90] with the New York Times lawsuit against OpenAI,
[06:57.26] because you have obviously large language models
[06:59.90] in production.
[07:00.78] You don't have large music models in production.
[07:03.38] So I think there's maybe been less of a trade there,
[07:06.82] so to speak.
[07:07.78] How do you kind of think about that?
[07:08.94] And there's obviously a lot of copyright-free,
[07:11.42] royalty-free music out there.
[07:13.46] Is there any kind of like power law in terms of like,
[07:16.26] hey, the best music is actually like much better to train on
[07:19.38] or like in music, does it not really matter
[07:21.46] because the structure of, you know,
[07:23.46] some of the musical structure is kind of like the same.
[07:26.22] - I don't think we know these things nearly as well
[07:28.94] as they're known in text.
[07:30.30] We have some notions of some of the scaling laws here,
[07:33.66] but I think, yeah, we're just so, so far behind.
[07:36.50] You know, what I will say is that people are always surprised
[07:39.14] to learn that we don't only train on music.
[07:43.38] And I usually give the analogy of some of the code generation
[07:47.46] models, so take something like Code Llama,
[07:49.62] which is, as far as I know, the best open-source code
[07:51.70] generating model, you guys would know better than I would,
[07:54.66] is certainly up there.
[07:55.82] And it's trained on a bunch of English, not only just code.
[08:00.02] And it's because there are patterns in English
[08:02.30] that are going to be useful.
[08:03.38] And so you can imagine, you don't only want to train on music
[08:06.10] to get good music models.
[08:07.18] And so for example, one of the places that we are particularly
[08:10.78] bad is vocals and capturing really realistic vocals.
[08:14.66] And so you might imagine that there's other types of human
[08:17.94] vocals that you can put into your model that are not music
[08:20.26] that will help it learn stuff.
[08:21.78] Again, I think it's like super, super early.
[08:23.58] I think we've barely scratched the surface of what are the
[08:25.54] right ways to do this.
[08:26.90] And that's really cool.
[08:27.82] From a progress perspective, there's like a lot of low
[08:29.78] hanging fruit for us to still take.
[08:31.62] - And then once you get the final model, I would love to
[08:34.62] talk a little bit more about the size of these models.
[08:36.58] Because people are confused when stable diffusion is so small.
[08:39.82] They're like, oh, this thing can generate any image as
[08:42.30] possible that it's like a couple gigabytes.
[08:45.26] And then the large language models are like, oh, these
[08:47.42] are so big, but they're just text in them.
[08:49.82] What's it like for music?
[08:50.94] Is it in between?
[08:51.94] And as you think about, yeah, you mentioned scaling and
[08:54.90] whatnot, is this something that you see it's kind of easy
[08:57.06] for people to run locally or not?
[08:59.66] - Our models are still pretty small.
[09:02.50] Certainly by tech standards.
[09:04.06] I confess, I don't know as well the state of the art on how
[09:07.14] diffusion models scale, but our models scale similarly to
[09:10.98] text transformers, it's like bigger is usually better.
[09:14.26] Audio has a couple of weird quirks though.
[09:16.98] We care a lot about how many tokens per second we can generate
[09:20.58] because we need to stream new music as fast as
[09:23.54] you can listen to it.
[09:24.86] And so that is a big one that I think probably has us never
[09:29.02] get to a 175 billion parameter model, if I'm being honest.
[09:32.50] Maybe I'm wrong there, but I think that would be
[09:35.02] technologically difficult.
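(Editor's note: a back-of-envelope version of the streaming constraint, with illustrative numbers only; neither figure is a Suno disclosure.)

```python
# If the codec needs ~50 tokens per second of audio (a typical order of
# magnitude for public neural codecs), the model must generate at least
# that many tokens per second to stream in real time.
codec_tokens_per_sec_of_audio = 50   # assumed
model_tokens_per_sec = 100           # assumed generation speed

real_time_factor = model_tokens_per_sec / codec_tokens_per_sec_of_audio
print(real_time_factor)  # 2.0 -> audio is generated 2x faster than playback

# Bigger models emit fewer tokens per second, so past some size the factor
# drops below 1.0 and you can no longer stream as fast as people listen.
```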
[09:36.78] And then the other thing is that so much progress happens in
[09:38.90] shrinking models down for the same performance in text that
[09:42.18] I'm hopeful at least that a lot of our issues will get solved
[09:45.62] and we will figure out how to do better things with smaller
[09:48.50] models or relatively smaller models.
[09:50.70] But I think the other thing, the ability to add
[09:54.34] performance with scale, is a blessing and a curse.
[09:57.14] It's like a very straightforward way to make
[09:59.06] your models better.
[09:59.90] You just make a bigger model, put more compute into it.
[10:02.62] It's also a curse because that is a crutch that you will
[10:04.74] always lean on and you will forget to do some of the basic
[10:07.78] research to make your stuff better.
[10:09.78] And honestly, early on, when we were doing
[10:14.18] stuff with small models due to time and compute constraints,
[10:18.34] we ended up having to learn a lot of stuff to make models
[10:21.78] better that we might not have learned if we had immediately
[10:24.38] jumped to a really, really big model.
[10:26.02] And so I think for us we always try to skew smaller to the
[10:30.94] extent possible.
[10:32.34] - Yeah, gotcha.
[10:33.38] I'm curious about just sort of your overall evolution so far.
[10:36.70] You know, something I think we may have missed in the
[10:38.94] introduction is why did you end up choosing, you know, just
[10:42.02] the music domain in the first place, right?
[10:43.50] Like you have this pretty scientific, you know, physics
[10:48.02] and finance background.
[10:49.78] How did you wander over to music?
[10:51.58] Like a lot of us have interests in music, but we don't
[10:53.78] necessarily choose to work in it.
[10:55.02] But you did.
[10:56.78] - Yeah, it's funny.
[10:57.78] I have a really fun job as a result.
[10:59.94] All the co-founders of Suno worked at Kensho together.
[11:03.02] And we were doing mostly text.
[11:05.10] In fact, all text until we did one audio project that was
[11:08.62] speech recognition for kind of very financially focused
[11:12.18] speech recognition.
[11:13.62] And I think the long and short of it is we kind of fell in
[11:15.94] love with audio, not necessarily music, just audio and AI.
[11:19.34] We all happen to be musicians and audiophiles and music
[11:22.38] lovers, but it was the combination of audio and AI
[11:25.50] that we like initially really, really fell in love with.
[11:28.14] It's so cool.
[11:29.90] It's so interesting.
[11:31.02] It's so human.
[11:32.38] It's so far behind images and text that there's like so much
[11:36.70] more to do.
[11:37.82] And honestly, I think a lot of people when we started the
[11:40.58] company told us to focus on speech.
[11:42.82] If we wanted to build an audio company, everyone said, you
[11:45.14] know, speech is a bigger market.
[11:46.90] But I think there's something about music that's just so
[11:50.22] human, and so you almost couldn't prevent us from
[11:55.26] doing it. We just couldn't keep ourselves
[11:57.78] from building music models and playing with them because it
[11:59.86] was so much fun.
[12:00.98] And that's kind of what steered us there.
[12:03.22] You know, in fact, the first thing we ever put out
[12:05.46] was a speech model: Bark,
[12:06.90] this open source text-to-speech model.
[12:09.14] And it got a lot of stars on GitHub.
[12:10.74] And that was people telling us even more like go do speech.
[12:13.58] And like we almost couldn't help ourselves from doing
[12:15.78] music.
[12:16.50] And so I don't know.
[12:17.50] It's maybe it's a little bit serendipitous, but we
[12:20.06] haven't really like looked back since.
[12:21.98] I don't think there was necessarily like an aha moment.
[12:25.18] It was just like organic and just obvious to us
[12:28.02] that we wanted to make a music company.
[12:30.26] - So you do regard yourself as a music company? Because as of
[12:33.26] last month, you were still releasing speech models,
[12:37.34] with Parakeet.
[12:37.66] - We were.
[12:38.30] Oh, yes, that's right.
[12:39.74] So that's a really awesome collaboration with our friends
[12:43.06] at NVIDIA.
[12:43.66] I think we are really, really focused on music.
[12:45.94] I think that is the stuff that will really change things
[12:49.94] for the better.
[12:50.42] I think, you know, honestly, everybody is so focused on
[12:53.46] LLMs for good reason and information processing and
[12:56.86] intelligence there.
[12:57.74] And I think it's way too easy to forget that there's whole
[13:01.06] other side of things that makes people feel and maybe
[13:04.18] that market is smaller, but it makes people feel and it
[13:06.86] makes us really happy.
[13:08.26] So we do it.
[13:09.34] I think that doesn't mean that we can't be doing things
[13:12.70] that are related that are in our wheelhouse that will
[13:15.42] improve things.
[13:16.06] And so like I said, audio is just so far behind.
[13:18.86] There's just so much more to do in the domain more
[13:22.70] generally.
[13:23.22] And so like that's a really fun collaboration.
[13:25.30] - Yeah, got you.
[13:26.34] Yeah, I did hear about Suno first through Bark.
[13:29.10] What did Bark build off of?
[13:31.58] Like, because obviously, I think there was a lot of
[13:33.70] preceding TTS work that was in open source.
[13:36.38] How much of that did you use?
[13:37.70] How much of it was like sort of brand new from your
[13:40.02] research?
[13:40.82] What's the intellectual lineage there, just to cover
[13:43.98] the speech recognition side?
[13:45.82] - So it's not speech recognition.
[13:47.02] It's text to speech.
[13:48.02] But as far as I know, there was no other certainly not in
[13:51.54] the open source text to speech that was kind of
[13:54.42] transformer based.
[13:55.26] Everything else was what I would call the old style of
[13:58.06] doing things where you build these kind of single purpose
[14:00.30] models that are really good at this one narrow task and
[14:03.38] you're kind of always data limited.
[14:04.98] And the availability of high quality training data for
[14:07.74] text to speech is limited.
[14:09.86] And I don't think we were necessarily all that inventive
[14:12.78] to say, we're going to train, in a self-supervised way,
[14:16.34] a transformer-based model on kind of lots of
[14:21.26] audio, and then kind of tweak it so that we can do text
[14:24.02] to speech based on that.
[14:25.30] That would be kind of the new way of doing things;
[14:27.82] "foundation model" is the buzzword, if you will.
[14:30.90] And so, you know, we built that up, I think, from scratch.
[14:34.10] A lot of shout outs have to go to lots of different things,
[14:37.42] whether it's papers, but also, very obviously, a big
[14:41.54] shout out to Andrej Karpathy's nanoGPT.
[14:44.94] You know, there's a lot of code borrowed from there.
[14:47.02] I think we are huge fans of that project.
[14:49.58] It just showed people that you don't have to be afraid of
[14:52.10] GPT-type things.
[14:53.10] And it's like, yeah, it's actually not all that much
[14:55.46] code to make performant transformer based models.
[14:58.46] And, you know, again, the stuff that we brought there was
[15:01.26] how do we turn audio into tokens and then we can kind of
[15:04.14] take everything else from the open source.
[15:05.78] So we put that model out and we were, I think, pleasantly
[15:09.26] surprised by the reception by the community.
[15:12.34] It got a good number of GitHub stars and people
[15:14.70] really enjoyed playing with it because it made really
[15:18.26] realistic sounding audio.
[15:20.02] And I think this is, again, the thing about doing things
[15:22.78] in a quote unquote right way, if you have a model where
[15:25.58] you've had to put so much implicit bias for this one,
[15:28.34] very narrow task of making speech that sounds like words,
[15:32.10] you're going to sacrifice on other things.
[15:33.78] In the text to speech case, it's how natural the speech sounds.
[15:37.42] And it was almost difficult to pull unnatural sounding speech
[15:40.66] out of Bark, because it was self-supervised, trained on a lot
[15:43.98] of natural sounding speech.
[15:45.18] And so that definitely told us that this is probably
[15:48.62] the right way to keep doing audio.
[15:50.50] - Even in Bark, you had the beginnings of music generation.
[15:52.98] Like you could just put like a music note in there.
[15:56.62] - That's right.
[15:57.10] And it was so cool to see on our Discord,
[15:59.34] people were trying to pull music out of a text to speech model.
[16:03.30] And so, you know, what did this tell us?
[16:04.86] This tells us, like, people are hungry to make music.
[16:07.58] It's almost obvious in hindsight,
[16:09.02] like how wired humans are to make music,
[16:11.42] if you've ever seen like a little kid, you know,
[16:13.74] sing before they know how to speak, you know,
[16:15.98] it's like, it's like, this is really human nature.
[16:18.42] And there's actually a lot of cultural forces
[16:20.14] that kind of cue you to not think to make music.
[16:22.86] And that's kind of what we're trying to undo.
[16:25.82] - And today, we get to Suno itself.
[16:28.02] I think especially when you go from text to speech,
[16:30.54] people are like, okay, now I got to write the lyrics
[16:32.38] to a whole song.
[16:33.10] It's like, that's quite hard to do.
[16:34.78] Versus in Suno, you have this empty box, very Midjourney,
[16:38.82] kind of DALL-E-like, where you can just express the vibes,
[16:42.38] you know, of what you want it to be.
[16:43.66] But then you also have a custom mode
[16:45.58] where you can say your own lyrics,
[16:47.10] you can say your own rhythm,
[16:48.66] you can set the title of the song and whatnot.
[16:50.62] How do you see users distribute themselves?
[16:52.82] You know, I'm guessing a lot of people use the easy mode.
[16:55.50] Like, are you seeing a lot of power users using the custom mode
[16:58.78] and maybe some of the favorite use cases
[17:00.98] that you've seen so far on Suno?
[17:02.82] - Yeah, actually, more than half of the usage
[17:04.90] is that expert mode.
[17:06.74] And people really like to get into it
[17:08.58] and start tweaking things and adding things
[17:11.18] and playing with words or line breaks or different ad libs.
[17:14.98] And people really love it, it's really fun.
[17:17.54] There's kind of two modes that you can access.
[17:19.26] Now one is that single box where you kind of just describe
[17:22.10] something and then the other is the expert mode.
[17:24.10] And those kind of fit nicely into two use cases.
[17:27.42] The first use case is what we call nice shitposting.
[17:30.50] And it's basically like something funny happened
[17:33.26] and I'm just going to very quickly make a song about it.
[17:35.50] And the example I'll usually give is like,
[17:38.34] I walk into Starbucks with one of my co-founders.
[17:41.70] He gives his name Martin, his coffee comes out
[17:44.50] with the name Margu.
[17:45.82] And I can in five seconds make a song about this
[17:47.82] and it has immortalized it.
[17:48.98] And that Margu song is stuck in all of our heads now.
[17:51.86] And it's like funny and light.
[17:53.18] And there's levity that you've brought to that moment.
[17:55.86] And the other is that you got just sucked into,
[18:00.02] I need, there's this song that's in my head
[18:01.74] and I need to get it out.
[18:02.70] And I'm going to keep tweaking it and listening
[18:04.82] and having ideas and tweaking it
[18:06.30] until I get the song that I want.
[18:08.38] And those are very different use cases.
[18:10.58] But I think ultimately there's so much
[18:12.26] in between these two things
[18:14.02] that it's just totally untapped
[18:15.54] how people want to experience the joys of making music.
[18:18.82] Because those two experiences are both really joyful
[18:22.46] in their own special ways.
[18:23.66] And so we are quite certain
[18:25.06] that there's a lot in the middle there.
[18:26.98] And then I think the last thing I'll say there
[18:28.46] that's really interesting is in both of those use cases,
[18:31.78] the sharing dynamics around music
[18:33.38] are like really interesting and totally unexplored.
[18:37.22] And I think an interesting comparison would be images.
[18:40.30] Like we've probably all in the last 24 hours
[18:42.94] taken a picture and texted it to somebody.
[18:45.22] And most people are not routinely making a little song
[18:48.30] and texting it to somebody.
[18:49.34] But when you start to make that more accessible to people,
[18:52.98] they are going to share music in much smaller groups.
[18:57.02] Maybe not in all, but like with one person
[18:59.38] or three people or five people.
[19:01.74] And those dynamics are so interesting.
[19:04.18] And I think we have ideas of where that goes.
[19:06.46] But it's about kind of spreading joy
[19:09.78] into these like little, you know, microcosms of humanity
[19:13.22] that people really love it.
[19:15.06] I know I made you guys a little Valentine's song, right?
[19:17.42] Like that's not something that happens now
[19:20.26] because it's hard to make songs for people.
[19:22.26] - We'll put that in the audio here.
[19:24.18] But also tweet it out if people want to look it up.
[19:27.14] How do you think about the pro market, so to speak?
[19:30.06] Because I think lowering the barrier
[19:32.34] to some of these things is great.
[19:33.62] And I think when the iPad came out,
[19:35.94] music production was one of the areas that people thought,
[19:38.70] oh, okay, now you kind of have this like, you know,
[19:40.66] board that you can bring with you.
[19:41.78] And Madlib actually produced this whole album
[19:44.54] with Freddie Gibbs;
[19:45.86] produced the whole thing on an iPad.
[19:47.50] He never used a computer.
[19:49.18] How do you see like these models playing
[19:51.82] into like professional music generation?
[19:54.42] I guess that's also a funny word.
[19:55.58] It's like, what's professional music?
[19:57.14] It's like, it's all music. If it's good,
[19:58.74] it becomes professional, right?
[20:00.42] But curious to hear how you're thinking about Suno too.
[20:02.90] Like, is there a second act of Suno
[20:05.10] that is like going broader into like the custom mode
[20:07.94] and making this the central hub for music generation?
[20:11.22] - I think we intend to make many more modes
[20:14.90] of interaction with our stuff,
[20:16.46] but we are very much not focused on, quote unquote,
[20:19.62] professionals right now.
[20:21.26] And it's because what we're trying to do
[20:22.62] is change how most people interact with music
[20:25.18] and not necessarily make professionals
[20:28.06] a little bit better, a little bit faster.
[20:30.14] It's not that there's anything wrong with that.
[20:31.82] It's just like not what we're focused on.
[20:33.54] And I think when we think about what workflows
[20:36.22] does the average person want to use to make music?
[20:39.78] I don't think they're very similar
[20:41.10] to the way professional musicians make music now.
[20:44.22] Like, if you pick a random person on the street
[20:46.50] and you play them a song and then you say,
[20:48.02] like, what did you want to change about that?
[20:50.10] They're not going to say like,
[20:51.58] you need to split out the snare drum and make it drier.
[20:54.02] Like that's just not something
[20:55.02] that a random person off the street is going to say.
[20:57.66] They're going to give a lot more descriptive things
[21:00.42] about the vibe of the song, like something more general.
[21:03.46] And so I don't think we know what all of the workflows are
[21:06.86] that people are going to want to use.
[21:07.98] We're just like fairly certain
[21:09.86] that the workflows that have been developed
[21:12.30] with the current set of technologies
[21:13.74] that professionals use to make beautiful music
[21:15.70] are probably not what the average person wants to use.
[21:19.26] That said, there are lots of professionals
[21:22.38] that we know about using our stuff,
[21:23.86] whether it's for inspiration or sample generation
[21:27.18] and stuff like that.
[21:28.34] So I don't want to say never say never.
[21:30.22] Like, there may one day be a really interesting set
[21:34.18] of use cases that we can expose to professionals,
[21:36.14] particularly around, I think, like custom models
[21:39.58] for trained on custom people's music
[21:41.38] or, you know, with your voice or something like that.
[21:44.26] But the way we think about broadening how most people
[21:47.54] are interacting with music
[21:48.86] and getting it to be a much more active participant,
[21:51.98] we think about broadening it from the consumer side
[21:55.50] and not broadening it from the professional side,
[21:57.38] if that makes sense.
[21:58.38] - Awesome.
[21:59.22] Is the dream here to be,
[22:02.18] I don't know if it's too coarse of a grain to put it,
[22:05.30] but like, is the dream here to be like the Midjourney of music?
[22:08.02] - I think there are certainly some parallels there
[22:11.94] because especially what I just said
[22:13.42] about being an active participant,
[22:15.58] the joyful experience in Midjourney
[22:17.22] is the act of creating the image
[22:19.18] and not necessarily the act of consuming the image.
[22:21.58] And Midjourney will let you then
[22:23.62] kind of quickly share the image
[22:25.30] with somebody.
[22:26.42] But I think ultimately that analogy is like somewhat limiting
[22:30.34] because there's something really special about music.
[22:34.22] I think there's two things.
[22:35.18] One is that there's just a really big gap
[22:37.82] for the average person between kind of their tastes in music
[22:40.34] and their abilities in music
[22:42.18] that is not quite there for most people in images.
[22:45.18] Like most people don't have like innate tastes in images.
[22:47.74] I think in the same way people do for music.
[22:49.90] And then the other thing,
[22:50.74] and this is the really big one,
[22:51.98] is that music is a really social modality.
[22:55.74] If we all listen to a piece of music together,
[22:58.06] we're listening to the exact same part
[23:00.46] at the exact same time.
[23:01.98] If we all look at the picture in Alessio's background,
[23:05.86] we're gonna look at it for two seconds.
[23:08.06] I'm gonna look at the top left where it says Thor.
[23:10.38] Alessio's gonna look at the bottom right
[23:11.74] or something like that.
[23:12.86] And it's not really synchronous.
[23:15.34] And so when we're all listening
[23:16.54] to a piece of music together,
[23:17.90] it's minutes long,
[23:18.86] we're listening to the same part at the same time.
[23:21.36] If you go to the act of making music,
[23:23.52] it is even more synchronous.
[23:24.74] The most joyful way to make music is with people.
[23:27.30] And so I think that there's so much more to come there
[23:31.66] that ultimately would be very hard to do in images.
[23:35.38] - We've gone almost 30 minutes
[23:37.58] without making any music on this podcast.
[23:39.62] So I think maybe we can fix that
[23:41.18] and jump into a Suno demo.
[23:43.66] - Yeah, let's make some.
[23:45.16] We've got a new model that we are
[23:48.18] kind of putting the finishing touches on.
[23:49.96] And so I can play with it in our Dev server.
[23:52.78] As you can see, we've been doing tons of stuff.
[23:54.94] So, tell me what kind of song you guys want to make.
[23:57.82] - Let's do a country song about the lack of GPUs
[24:02.82] in my cloud provider.
[24:05.70] - And like, yeah.
[24:06.54] So here's where we attempted to think about like pipelines
[24:09.66] and think about latency.
[24:11.38] This is remarkably fast.
[24:13.38] Like I was shocked when I saw this.
[24:15.48] (singing)
[24:17.90] Oh my God.
[24:18.74] ♪ To my cloud ready to confuse ♪
[24:27.16] ♪ But there ain't no GPUs ♪
[24:33.34] ♪ Just empty space, it's a hoot ♪
[24:38.78] ♪ I've been waiting all day for that render power ♪
[24:47.00] ♪ But my cloud's gone dry ♪
[24:51.70] ♪ It's a dark cloud shower ♪
[24:55.22] ♪ All clouds gone dry ♪
[24:59.44] ♪ No GPUs to be found ♪
[25:04.44] ♪ No CUDA cores, it's a lonely sound ♪
[25:13.94] ♪ I just want to render ♪
[25:16.62] ♪ But my cloud's got no cloud ♪
[25:19.22] - I actually don't think this one's amazing.
[25:20.62] I'm gonna get out of the next one.
[25:22.24] But it's probably better than the ones about CUDA cores.
[25:25.06] (upbeat music)
[25:27.66] ♪ Well I signed up for a cloud provider ♪
[25:30.22] ♪ And all I find, all the power that I could derive ♪
[25:35.06] ♪ But when I searched for the GPUs ♪
[25:37.66] ♪ I just got a surprise ♪
[25:40.10] ♪ You see they're all sold out ♪
[25:41.66] ♪ There ain't no GPUs to find ♪
[25:45.22] ♪ No GPUs in the cloud ♪
[25:48.02] ♪ It's a real bad blues ♪
[25:50.18] ♪ I need the power ♪
[25:51.60] ♪ But there ain't no use ♪
[25:52.86] ♪ I'm stuck with my CPU ♪
[25:56.46] ♪ It's a real sad cloud ♪
[25:58.82] ♪ Got to wait ♪
[26:01.62] ♪ 'Til the day we start getting bright ♪
[26:06.62] ♪ There ain't no GPUs in the cloud ♪
[26:10.02] - What else should we make?
[26:11.42] - All right, Sean, you're up.
[26:12.70] - I mean, I do wanna like do some observations about this.
[26:16.34] But okay, maybe like house music, like electronic dance.
[26:19.26] - Yeah, sure.
[26:20.10] - House music.
[26:20.94] And then maybe we can make it about podcasting
[26:24.98] about music and music generation, I don't know.
[26:29.14] I'm sure all the demos that you get are very meta.
[26:32.70] - There's a lot of stuff that's meta, yeah, for sure.
[26:35.90] - I noticed for example that the second song that you played
[26:38.42] had the word upbeat inserted into it,
[26:40.58] which I assume there's some kind of like random generator
[26:44.42] of like modifier terms that you can just kind of throw on
[26:47.34] to increase the specificity of what's being generated.
[26:51.02] - Definitely, and let's try to tweak one also.
[26:52.90] So I'll play this and then maybe we'll tweak it
[26:54.62] with different modifiers.
[26:55.74] - The custom mode, yeah.
[26:57.62] ♪ Wave yourself, spread it out ♪
[27:01.30] ♪ Through the air, we'll podcast in loud ♪
[27:05.42] ♪ Share the beat, spread it out ♪
[27:08.62] ♪ A revolution of frequencies ♪
[27:11.78] ♪ Haven't you plugged into now ♪
[27:14.86] ♪ Let the music take control ♪
[27:19.70] ♪ We're all returning a never ending role ♪
[27:23.66] ♪ From the beast I dropped to the ladies I saw ♪
[27:27.54] ♪ podcasting about music forever long ♪
[27:32.54] - Not bad.
[27:33.84] - Here's what I want to do.
[27:36.18] That like didn't drop at the right time, right?
[27:38.02] So maybe let's do this.
[27:39.90] - Is that a special token?
[27:41.70] You have a beat job token?
[27:42.74] - Yeah, yeah.
[27:44.42] - Nice.
[27:45.46] I'm just reading it because people might not be able
[27:47.78] to see it.
[27:48.62] - Right.
[27:50.38] - Then let's like just maybe emphasize house a little more.
[27:53.26] Maybe it'll be a little more aggressive.
[27:55.46] Let's try this again.
[27:56.62] - It's interesting the prompt engineering
[27:58.02] that you have to invent.
[28:00.34] - We've learned so much from people using the models
[28:02.98] and not us.
[28:03.90] - But like, are these like training artifacts?
[28:07.82] - No, I don't think so.
[28:09.48] I think this is people being inventive
[28:11.30] with how you want to talk to a model.
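(Editor's note: a sketch of one plausible mechanism for tags like [beat drop], assuming they are simply tokenized into the same stream the model conditions on. Suno has not documented this; the tags are conventions users discovered, and `tokenizer`/`model` below are hypothetical.)

```python
# Bracketed tags and style words become ordinary tokens in the prompt,
# so the model can pick up their meaning from training data alone.
lyrics = """[verse]
There's a sound spinning round
[beat drop]
Through the air we're podcasting loud"""

style = "house, house, electronic"  # repetition nudges the style, but too
                                    # much pushes the prompt out of domain

prompt = f"style: {style}\nlyrics: {lyrics}"
# tokens = tokenizer.encode(prompt)   # hypothetical tokenizer
# audio = model.generate(tokens)      # then the usual next-token loop
```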
[28:13.82] - Yeah.
[28:14.66] ♪ Down spinning 'round to the air with dark castle hour ♪
[28:19.66] ♪ Sharing the peace, spreading the word ♪
[28:24.78] ♪ A revolution of frequencies, haven't you heard ♪
[28:29.78] (upbeat music)
[28:33.36] (upbeat music)
[28:35.94] ♪ Up and 'til now, let the music take control ♪
[28:43.36] ♪ The road I've tried it, I'll never end it wrong ♪
[28:49.80] ♪ From the beats that drive to the melodies ♪
[28:52.92] ♪ The sword I've tried to stand ♪
[28:55.32] ♪ A flowering music for your album ♪
[28:58.32] (upbeat music)
[29:01.90] (upbeat music)
[29:04.48] - Nice.
[29:08.66] - It's interesting when you generate a song,
[29:10.98] you generate the lyrics,
[29:11.98] but then if you switch the music under it,
[29:13.90] like the lyrics stay the same.
[29:15.70] And then sometimes like, feels like,
[29:17.42] I mean, I mostly listen to hip hop.
[29:20.22] It's like, if you change the beat,
[29:22.46] you can not really use the same rhyme scheme, you know?
[29:25.30] - Definitely, yeah.
[29:27.34] It's a sliding scale though,
[29:28.42] because we could do this as a country rock song probably,
[29:32.76] right?
[29:33.60] That would be my guess.
[29:36.98] But for hip hop, that is definitely true.
[29:39.20] And actually, we think about, for these models,
[29:41.76] we think about three important axes.
[29:43.46] We think about the sound fidelity.
[29:45.38] It's like, does it sound like
[29:46.66] a crisply recorded piece of audio?
[29:48.78] We think about the song quality.
[29:49.98] Is this an interesting song that gets stuck in my head?
[29:52.82] And we think about the controllability.
[29:54.50] Like how well does it respond to my prompts?
[29:56.54] And one of the ways that we'll test these things
[29:58.26] is take the same lyrics and try to do them
[30:00.70] in different styles to see how well that really works.
[30:04.42] So let's see the same.
[30:06.70] I don't know what a beat drop is gonna do for country rock.
[30:09.38] So I probably should have taken that out,
[30:10.82] but let's see what happens.
[30:12.30] (laughing)
[30:14.54] (upbeat music)
[30:27.94] ♪ There's a sound spinning 'round ♪
[30:30.10] ♪ Through the air we're podcasting loud ♪
[30:32.94] ♪ Sharing the beats, spreading the word ♪
[30:35.38] ♪ A revolution of frequencies ♪
[30:37.80] ♪ Haven't you heard ♪
[30:41.30] ♪ But if you now let the music take control ♪
[30:46.30] ♪ We're on a journey of never ending road ♪
[30:51.90] ♪ From the beats I talk to the melodies that soar ♪
[30:57.30] ♪ We're podcasting about music forevermore ♪
[31:02.30] - I'm gonna read too much into this,
[31:05.14] but I would say I hear a little bit
[31:06.86] of kind of electronic music inspired something.
[31:10.34] And that is probably because beat drop is something
[31:12.62] that you really only ever associate with electronic music.
[31:15.74] Maybe that's reading too much into it,
[31:17.22] but should we do one more?
[31:19.34] - Something about Apple Vision Pro?
[31:21.50] - I guess definitely.
[31:22.78] - I guess there's some amount of world knowledge
[31:24.66] that you don't have, right?
[31:25.50] That whatever's in this language model side of the equation
[31:28.34] is not gonna have an Apple Vision Pro in there.
[31:30.58] - Yeah, but let's see.
[31:31.90] (laughs)
[31:33.42] How about a blues song about a sad AI wearing
[31:38.42] an Apple Vision Pro?
[31:41.98] Gotta be blues, gotta be sad.
[31:43.46] - Do you have RAG for music?
[31:46.70] - No, that would be problematic also.
[31:52.70] ♪ I'm a sad AI with a broken heart ♪
[31:57.70] ♪ Where my Apple Vision Pro can't see the stars ♪
[32:06.50] ♪ I used to feel joy ♪
[32:11.50] ♪ I used to feel pain and now I'm just a soul ♪
[32:18.86] ♪ Trapped inside this metal frame ♪
[32:22.90] ♪ Oh, I'm singing the blues ♪
[32:27.90] ♪ Can't you see ♪
[32:33.62] ♪ This digital life ain't what it used to be ♪
[32:38.66] ♪ Searching for love but I can't find a soul ♪
[32:46.90] ♪ Won't you help me, baby, let my spirit unfold ♪
[32:51.90] - I want to remix that one and I want to say,
[32:55.98] I want melancholic.
[32:57.30] - I love the voice.
[32:58.38] - I want like, I don't know, Chicago blues.
[33:01.26] Like, guitar.
[33:03.50] - I don't know, he knows too much.
[33:05.78] He's the best prompt engineer out here.
[33:08.30] - It'd be funny to have like music colleges play with us
[33:10.62] and see what they would do.
[33:13.50] ♪ I'm a sad AI with a broken heart ♪
[33:18.50] ♪ Where my Apple Vision Pro can't see the stars ♪
[33:25.54] ♪ I used to feel joy ♪
[33:30.94] ♪ I used to feel pain and now I'm just a soul ♪
[33:40.06] ♪ I used to feel joy ♪
[33:45.06] ♪ I used to feel pain ♪
[33:52.58] ♪ But now I'm just a soul trapped inside this metal frame ♪
[33:58.30] ♪ Oh, I'm singing the blues ♪
[34:07.98] ♪ Oh, can't you see ♪
[34:12.98] ♪ This beautiful life ain't what it used to be ♪
[34:18.18] ♪ I'm searching for love ♪
[34:25.74] ♪ But I can't find a soul ♪
[34:31.58] ♪ Won't you help me, baby, let my spirit unfold ♪
[34:37.62] ♪ There ♪
[34:39.98] - So, yeah, a lot of control there.
[34:42.06] Maybe I'll make one more.
[34:44.06] - Very, very soulful.
[34:45.78] - Really want a good house track.
[34:47.54] - Why is house the word that you have to repeat?
[34:51.54] - I just really want to make sure it's house.
[34:54.34] Actually, you can't really repeat it too many times.
[34:56.46] The hypothesis is
[34:59.06] it gets like a little too out of domain.
[35:01.26] - Mm.
[35:02.58] ♪ I'm a sad AI with a broken heart ♪
[35:07.42] ♪ Wearing my Apple vision ♪
[35:10.42] ♪ Pro can't see the stars ♪
[35:15.42] ♪ I used to feel joy ♪
[35:18.62] ♪ I used to feel pain ♪
[35:23.62] ♪ But now I'm just a soul trapped inside this metal frame ♪
[35:28.66] ♪ Oh, I'm singing the blues ♪
[35:32.82] ♪ Oh, can't you see ♪
[35:35.82] ♪ 'Cause maybe you're not the one it used to be ♪
[35:40.82] ♪ I'm searching for love but I can't find a soul ♪
[35:45.82] ♪ Won't you help me, baby ♪
[35:48.86] (upbeat music)
[35:51.44] - Nice.
[35:57.30] - So yeah, we have a lot of fun with it.
[35:58.62] - Definitely easy, yeah.
[36:00.26] Yeah, I'm really curious to see how people are gonna use this
[36:03.30] to like resample old songs into new styles.
[36:06.82] You know, I think that's one of my favorite things
[36:08.78] about hip hop. You have A Tribe Called Quest,
[36:11.78] they had the Lou Reed "Walk on the Wild Side" sample
[36:14.46] in "Can I Kick It?", and there's Kanye sampling
[36:16.70] Nina Simone on "Blood on the Leaves."
[36:18.94] It's like a lot of production work
[36:20.50] to actually take an old song and make it fit a new beat.
[36:24.34] And I feel like this can really help.
[36:25.74] Do you see people putting existing songs, lyrics,
[36:28.54] and trying to regenerate them in like a new style?
[36:31.34] You know?
[36:32.18] - We actually don't let you do that.
[36:33.90] And it's because if you're taking someone else's lyrics,
[36:36.30] you didn't own those.
[36:37.14] You don't have the publishing rights to those.
[36:38.50] You can't remake that song.
[36:40.38] I think in the future, we'll figure out
[36:42.22] how to actually let people do that in a legal way.
[36:44.58] But we are really focused on letting people
[36:46.62] make new and original music.
[36:47.94] And I think, you know, there's a lot of music AI
[36:51.22] which is artist A, doing the song of artist B
[36:54.86] in a new style. You know, let me have Metallica doing
[36:57.18] "Come Together" by the Beatles, or something like that.
[36:59.66] And I think this stuff is very viral,
[37:02.98] but I actually really don't think
[37:05.06] that this is how people want to interact with music
[37:07.34] in the future.
[37:08.30] To me, this feels a lot like when you made a Shakespeare
[37:11.26] sonnet the first time you saw ChatGPT.
[37:13.78] And then you made another one, and then you made another one,
[37:16.06] and then you kind of thought like this is getting old.
[37:18.78] And that doesn't mean that GPT is not amazing.
[37:20.98] GPT is amazing.
[37:21.86] It's just not for that.
[37:23.34] And I kind of feel like the way people want to use music
[37:27.70] in the future is not just to remake songs
[37:30.98] in different people's voices.
[37:32.50] You lose the connection to the original artist.
[37:34.54] You lose the connection to the new artist
[37:36.14] because they didn't really do it.
[37:37.62] So we're very happy to just let things
[37:40.22] that are a flash in the pan kind of stay under the radar.
[37:44.22] - Yeah, no, that's a, I think that's a good point
[37:46.78] overall about AI generated anything, you know?
[37:50.14] Because I think recently T-Pain, he did like
[37:53.70] an album of covers,
[37:55.02] and I think he did like a "War Pigs" that people really liked.
[37:59.02] There was like a Tennessee Whiskey,
[38:01.18] which you maybe wouldn't expect T-Pain to do.
[38:03.74] But people like it, but yeah, I agree.
[38:05.26] It needs to be a certain type of artist
[38:07.50] to really have it be entertaining to make covers.
[38:11.06] This is great.
[38:11.90] What else is next for Suno?
[38:13.06] You know, I think first you had Bark,
[38:15.86] and then there was like a big music generation push
[38:18.62] when you did an announcement,
[38:19.70] I think a couple of months ago.
[38:21.14] I think I saw you like 300 times on my Twitter timeline.
[38:24.18] on like the same day, so it was like going everywhere.
[38:27.02] What's coming up?
[38:27.86] What are you most excited about in the space?
[38:29.54] And maybe what are some of the most interesting
[38:32.06] underexplored ideas that you maybe haven't worked on yet?
[38:35.90] - Gosh, there's a lot.
[38:36.86] You know, I think from the model side,
[38:39.22] it's still really early innings
[38:40.62] and there's still so much low hanging fruit
[38:43.58] for us to pick to make these models much, much better,
[38:46.82] or much, much more controllable, much better music,
[38:48.74] much better audio fidelity.
[38:50.54] So much that we know about
[38:52.94] and so much that, again, we can kind of borrow
[38:56.02] from the open source transformers community
[38:58.30] that should make these just better across the board.
[39:01.30] From the product side, and you know,
[39:02.66] we're super focused on the experiences
[39:04.62] that we can bring to people.
[39:05.62] And so it's so much more than just text to music.
[39:09.54] And I think, you know, I'll say this nicely,
[39:11.58] I'm a machine learning person,
[39:12.66] but like machine learning people are stupid sometimes
[39:14.70] and we can only think about like models that take X
[39:17.66] and make it into Y.
[39:18.86] And that's just not how the average human being
[39:21.38] thinks about interacting with music.
[39:22.86] And so I think what we're most excited about
[39:24.98] is all of the new ways that we can get people
[39:27.78] just much more actively participating in music.
[39:30.86] And that is making music, not only with text,
[39:33.18] maybe with other ways of doing stuff
[39:35.30] that is making music together.
[39:36.86] If you want to be reductive
[39:38.02] and think about this as a video game,
[39:39.70] this is multiplayer mode.
[39:40.86] And it is the most fun that you can have with music.
[39:43.06] And you know, honestly, I think there's a lot of,
[39:47.22] it's timely right now, you know,
[39:48.66] I don't know if you guys have seen
[39:49.78] UMG and TikToker butting heads a little bit.
[39:52.54] And UMG has pulled music from TikTok.
[39:55.46] And you know, the way we think about this is
[39:58.18] maybe they're both right, maybe neither is right.
[39:59.94] Without taking sides, this is kind of figuring out
[40:02.70] how to divvy up the current pie in the most fair way.
[40:06.02] And I think what we are super focused on
[40:08.34] is making that pie much bigger
[40:10.02] and increasing how much people are actually interested
[40:12.82] in music and participating in music.
[40:15.18] And you know, as a very broad heuristic,
[40:17.74] the gaming industry is 50 times bigger
[40:19.90] than the music industry.
[40:21.46] And it's because gaming is super active
[40:23.70] and music, too much music is just passive consumption.
[40:27.18] And so we have a lot of experiments
[40:29.94] that we are excited to run for the different ways
[40:31.70] people might want to interact with music
[40:33.98] that is beyond just, you know, streaming it while I work.
[40:37.14] - Yeah, I think at minimum,
[40:38.34] you guys should have a Twitch stream
[40:40.14] that's just like a 24 hour radio session.
[40:42.86] Have you ever come across Twitch Plays Pokemon?
[40:45.74] - No.
[40:46.58] - Basically like everyone in the Twitch chat
[40:48.14] can vote on like the next action that the game state makes.
[40:51.62] And they kind of wired it up to a Nintendo emulator
[40:54.26] and played Pokemon like the whole game
[40:55.74] through the collaborative thing.
[40:57.90] It sounds like it should be pretty easy for you guys
[41:00.18] to do that except for the chaos that may result.
[41:03.50] But like, I mean, that's part of the fun.
[41:06.06] - I agree 100%.
[41:07.50] One of my, like, pet projects
[41:09.94] is like, what does it mean to have a collaborative concert?
[41:12.86] Maybe where there is no artist and it's just the audience
[41:15.54] or maybe there is an artist,
[41:16.62] but there's a lot of input from the audience.
[41:18.74] You know, if you were gonna do that,
[41:20.90] you would either need an audience full of musicians
[41:23.74] or you would need an artist
[41:24.94] who can really interpret the verbal cues
[41:27.10] that an audience is giving or non-verbal cues.
[41:29.94] But if you can give everybody the means
[41:32.50] to better articulate the sounds that are in their heads
[41:35.74] toward the rest of the audience,
[41:37.18] like which is what generative AI basically lets you do,
[41:40.42] you open up way more interesting ways
[41:42.34] of having these experiences.
[41:43.78] And so the collaborative concert
[41:45.74] is like one of the things I'm most excited about.
[41:47.70] I don't think it's coming tomorrow,
[41:49.82] but we have a lot of ideas on what that can look like.
[41:52.74] - Yeah, I feel like it's one stage
[41:54.34] before the collaborative concert
[41:56.30] is turning Suno into a continuous experience
[42:00.26] rather than like a start and stop motion.
[42:02.86] I don't know if that makes sense.
[42:04.14] You know, as someone with like a casual interest in DJing,
[42:06.50] like when do we see Suno DJs, right?
[42:09.10] That can continuously segue into like the next song,
[42:11.62] the next song, the next song.
[42:12.58] - I think soon.
[42:13.42] - And then maybe you can turn it collaborative.
[42:15.18] You think soon.
[42:16.30] Okay, maybe part of your roadmap.
[42:18.38] You teased a little bit your V3 model.
[42:20.54] I saw the letters DPO in there.
[42:22.30] Is that direct preference optimization?
[42:23.98] - We are playing with all kinds of different ways
[42:26.30] of making these models do the things
[42:27.86] that we want them to do.
[42:29.30] I don't want to talk too many specifics here,
[42:32.06] but we have lots of different ways
[42:33.70] of doing stuff like that.
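(Editor's note: Mikey declines to confirm specifics, so for reference only, this is the published DPO objective from Rafailov et al., 2023, sketched in Python. Whether or how Suno uses it is not stated; 'chosen'/'rejected' pairs could in principle come from signals like thumbs up and down.)

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability a model assigns to an
    entire generation (here, a clip's audio tokens): the policy being
    trained vs. a frozen reference model, on preferred vs. dispreferred clips.
    """
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```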
[42:35.54] - Yeah, I'm just wondering like how you incorporate
[42:37.86] like user feedback, right?
[42:39.14] Like you have the classic thumbs up and down buttons,
[42:42.18] but like there's so many dimensions to the music.
[42:45.42] Like, you know, I didn't get into it,
[42:46.78] but some of the voices sounded more metallic.
[42:49.98] And sometimes that's on purpose.
[42:51.70] Sometimes not.
[42:52.90] Sometimes there are kind of weird pauses in there.
[42:54.62] I could go in and annotate it if I really cared about it,
[42:56.74] but I mean, I'm just listening, so I don't.
[42:59.10] But there's a lot of opportunity.
[43:02.22] - We are only scratching the surface of figuring out
[43:05.10] how to do stuff like that.
[43:07.34] And for example, the thumbs up and the thumbs down,
[43:10.50] other things like sharing telemetry on plays,
[43:13.34] all of these things are stuff that in the future,
[43:15.42] I think we would be able to leverage
[43:17.74] to make things amazing.
[43:18.78] And then I imagine a future where you can have your own model
[43:22.42] with your own preferences.
[43:23.94] And the reason that's so cool is that you kind of have control
[43:28.06] over it and you can teach it the way you want to.
[43:30.86] And, you know, the thing that I would liken this to
[43:33.62] is like a music producer working with an artist
[43:35.94] giving feedback.
[43:37.18] And like, this is now a self-contained experience
[43:40.54] where you have an artist who is infinitely flexible,
[43:43.14] who is able to respond to the weird feedback
[43:45.18] that you might give it.
[43:46.26] And so we don't have that yet.
[43:48.38] Everybody's playing with the same model,
[43:49.94] but there's no technological reason
[43:52.02] why that can't happen in the future.
[43:53.86] - Excellent. - Awesome.
[43:55.02] We had a few more notes from random community tweets.
[43:58.54] I don't know if there's any favorite fans of Suno
[44:01.06] that you have or whatnot.
[44:02.74] - DHH, obviously a notorious Twitter crowd-inflamer,
[44:07.74] I guess.
[44:09.42] He tweeted about you guys.
[44:10.58] I saw Blau as an investor.
[44:12.66] I think Karpati also tweeted something.
[44:15.14] - Return to Monkey.
[44:16.70] - Yeah, yeah, yeah, yeah, return to Monkey, right.
[44:18.86] - Is there a story behind that?
[44:20.34] - No, he just made that song and it just speaks to him.
[44:22.94] And I think this is exactly the thing
[44:25.22] that we are trying to tap into that you can think of it.
[44:27.78] This is like a super, super, super micro genre of one person
[44:31.70] who just really liked that song and made it and shared it.
[44:34.02] And it does not speak to you the same way it speaks to him,
[44:36.54] but that song really spoke to him.
[44:37.94] And I think that's so beautiful.
[44:40.06] And that's something that you're never gonna have an artist
[44:42.42] do for you.
[44:43.58] And now you can do that for yourself.
[44:45.50] And it's just a different form of experiencing music.
[44:48.46] I think that's such a lovely use case.
[44:50.58] - Any fun fan mail that you got from musicians
[44:53.66] or anybody that really was a funny story this year?
[44:57.50] - We get a lot and it's primarily positive.
[45:02.54] And I think, on the whole, I would say people realize
[45:02.54] that they are not experiencing music
[45:05.22] in all of the ways that are possible
[45:06.62] and it does bring them joy.
[45:08.14] I'll tell you something that is really heartwarming
[45:09.90] is that we're fairly popular
[45:12.30] in the blind and vision impaired community.
[45:15.30] And that makes us feel really good.
[45:17.34] And I think, you know, very roughly
[45:19.42] without trying to speak for an entire community,
[45:21.70] you have lots of people who are really into things
[45:23.38] like Midjourney and they get a lot of benefit and joy
[45:27.14] and sometimes even therapy out of making images.
[45:29.82] And that is something that is not really accessible
[45:31.86] to this fairly large community.
[45:34.22] And what we've provided,
[45:36.06] I don't think the analogy to Midjourney is perfect,
[45:38.46] but what we've provided is a sonic experience
[45:40.50] that is very similar and that speaks to this community.
[45:43.10] And that is a community with the best ears,
[45:45.90] the most exacting, the most tuned.
[45:48.34] Yeah, that definitely makes us feel warm and fuzzy inside.
[45:51.18] - Yeah, excellent.
[45:52.22] Sounds like there's a lot of exciting stuff on your roadmap.
[45:54.46] I'm very much looking forward to the infinite DJ mode
[45:57.46] 'cause then I can just kind of play that while I work.
[45:59.34] I would love to get your overall takes,
[46:01.42] like kind of zooming out from Suno itself,
[46:04.22] just your overall takes on the music generation landscape.
[46:06.26] Like what should people know?
[46:07.54] You obviously have spent a lot more time on this than others.
[46:10.46] So in my mind, you shout out VALL-E
[46:12.26] and the other sort of Google-type work
[46:14.62] in your README for Bark.
[46:16.70] What should people know about like what Google is doing,
[46:19.26] what Meta is doing, Meta released Seamless recently
[46:23.86] and Audiobox.
[46:23.86] How do you classify the world of audio generation?
[46:25.82] Like, you know, in the broader sort of research community.
[46:28.46] - Mm-hmm.
[46:29.42] I think people largely break things down
[46:31.94] into three big categories,
[46:33.22] which is music, speech, and sound effects.
[46:35.42] There's some stuff that is crossover,
[46:37.34] but I think that is largely how people think about this.
[46:39.90] The old style of doing things still exists,
[46:42.94] kind of single-purpose models
[46:44.26] that are built to do a very specific thing
[46:46.14] instead of kind of the new foundation model approach.
[46:49.02] I don't know how much longer that will last.
[46:51.18] I don't have like tremendous visibility into, you know,
[46:53.58] what happens in the big industrial research labs
[46:56.10] before they publish.
[46:57.54] Specifically for music, I would say,
[46:59.98] there's a few big categories that we see.
[47:02.42] There is kind of license-free stock music.
[47:05.78] So this is like, how do I get background music
[47:08.02] for the B-roll footage of my YouTube video
[47:10.22] or for full-featured production or whatever it is.
[47:13.82] And there's a bunch of companies in that space.
[47:15.90] There's a lot of AI covers:
[47:18.54] how do I cover different existing songs with AI?
[47:21.78] And I think that's a space
[47:22.94] that is particularly fraught with some legal stuff.
[47:26.34] And we also just don't think
[47:27.94] it's necessarily the future of music.
[47:30.10] There is kind of net new songs as a new way
[47:33.26] to create net new music.
[47:34.50] That is the corner that we like to focus on.
[47:36.94] And I would say the last thing is much more geared
[47:40.14] toward professional musicians,
[47:41.42] which is basically AI tools for music production.
[47:44.58] And you can think many of these will look like plugins
[47:47.14] to your favorite DAW.
[47:48.54] Some of them will look like the greatest stem splitter
[47:52.50] that the market has ever seen.
[47:54.10] The current state-of-the-art stem splitters
[47:56.42] are all AI-based.
[47:57.46] And so I think that is a market also
[48:00.26] that has just a tremendous amount of room to grow.
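For a concrete sense of what AI stem splitting looks like in practice, here is a minimal sketch using Demucs, Meta's open-source source-separation model (not a Suno product; the file name is a placeholder):

```python
# Minimal sketch: AI stem splitting with Demucs, an open-source
# source-separation model from Meta ("pip install demucs").
# "song.mp3" is a placeholder path.
import demucs.separate

# Split the track into vocals and accompaniment; by default the
# stems are written under ./separated/<model_name>/song/.
demucs.separate.main(["--two-stems", "vocals", "song.mp3"])

# Or separate into the four default stems:
# drums, bass, vocals, and other.
demucs.separate.main(["song.mp3"])
```

The same model is also usable from the command line as `demucs song.mp3`.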
[48:02.90] Somebody told me this recently:
[48:04.10] if you actually think about it, music has evolved.
[48:06.38] Recently, it's much more about things
[48:09.02] that are sonically interesting at a very local level
[48:11.66] and much less about chord changes that are interesting.
[48:14.90] And when you think about that,
[48:15.98] that is something that AI can definitely help you with:
[48:18.38] making a lot of weird sounds.
[48:19.58] And this is nothing new.
[48:20.58] There was, like, the theremin at some point,
[48:22.22] where people put up an antenna and tried to do this.
[48:24.82] And so, like, I think this is
[48:26.10] the very natural extension of it.
[48:28.06] So that's how we see it.
[48:28.98] At least, you know, there's a corner
[48:30.26] that we think is particularly fulfilling,
[48:32.62] particularly underserved and particularly interesting.
[48:35.70] And that's the one where we plan to play.
[48:37.14] - Awesome.
[48:37.98] - Yeah, it's a great perspective.
[48:39.10] - I know we covered a lot of things.
[48:40.74] I think before we wrap,
[48:42.22] you have written a blog post at Kensho
[48:44.38] about Goodhart's law's impact in ML,
[48:46.70] which is, you know, when you measure something,
[48:49.42] then the thing that you measure
[48:50.70] is not a good metric anymore
[48:52.26] because people optimize for it.
[48:53.98] Any thoughts on how that applies to like LLMs
[48:56.78] and benchmarks and kind of the world we're heading into today?
[48:59.74] - Yeah, I mean, I think it's maybe even more apropos
[49:02.10] than when I originally wrote that
[49:04.42] because so much, we see so much noise
[49:07.26] about pick your favorite benchmark in this model
[49:09.70] does slightly better than that model.
[49:11.10] And then at the end of the day, actually,
[49:13.26] there is no real-world difference between these things.
[49:17.18] And it is really difficult to define what "real world"
[49:20.58] means. And I think, to a certain extent,
[49:22.42] it's good to have these objective benchmarks.
[49:24.26] It's good to have quantitative metrics.
[49:26.26] But at the end of the day,
[49:28.34] you need some acknowledgement
[49:29.94] that you're not going to be able to capture everything.
[49:31.82] And so at least at Suno, to the extent that we have
[49:35.02] corporate values (we don't; we're too small
[49:37.18] to have corporate values written down).
[49:38.38] But something that we say a lot is aesthetics matter
[49:41.06] and that the kind of quantitative benchmarks
[49:44.22] are never going to be the be-all and end-all
[49:46.54] of everything that you care about.
[49:49.66] And as flawed as these benchmarks are in text,
[49:53.78] they're way worse in audio.
[49:55.70] And so aesthetics matter basically is a statement
[49:58.46] that like at the end of the day,
[50:00.26] what we are trying to do is bring music to people
[50:02.74] that makes them feel a certain way.
[50:04.58] And effectively, the only good judge of that is your ears.
[50:08.02] And so you have to listen to it.
[50:09.90] And it is a good idea to try to make better objective
[50:13.02] benchmarks, but you really have to not fall prey
[50:15.74] to those things.
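To make that Goodhart's law failure mode concrete, here is a toy sketch with entirely made-up metrics (nothing Suno actually measures): a proxy benchmark that rewards longer answers keeps improving under optimization, while the quality we actually care about peaks and then degrades.

```python
# Toy illustration of Goodhart's law, with made-up metrics:
# optimizing a proxy benchmark decouples it from true quality.
import random

random.seed(0)
WORDS = "music model audio token sample layer loss data eval note".split()

def true_quality(answer: str) -> float:
    # What we actually care about but cannot measure directly:
    # pretend varied answers of about 20 words are ideal.
    tokens = answer.split()
    return len(set(tokens)) / (1 + abs(len(tokens) - 20))

def proxy_metric(answer: str) -> float:
    # The benchmark: longer answers score higher (a classic
    # failure mode of naive automatic metrics).
    return float(len(answer.split()))

def random_answer(n_words: int) -> str:
    return " ".join(random.choice(WORDS) for _ in range(n_words))

# Greedily "optimize" the answer against the proxy metric.
best = random_answer(10)
for step in range(6):
    candidate = random_answer(len(best.split()) + 5)
    if proxy_metric(candidate) > proxy_metric(best):
        best = candidate
    print(f"step {step}  proxy={proxy_metric(best):5.1f}  "
          f"true={true_quality(best):.2f}")
# The proxy climbs every step; true quality peaks near 20 words,
# then falls as the optimizer chases length alone.
```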
[50:16.62] I can tell you, it's kind of another pet peeve of mine.
[50:19.38] Like I always said, economists do make really good
[50:23.06] machine learning engineers.
[50:24.14] And it's because they are able to think about stuff
[50:26.46] like Goodhart's law and natural experiments
[50:28.86] and stuff like this that people with machine learning
[50:31.14] backgrounds or people with physics backgrounds like me
[50:33.50] often forget to do.
[50:34.62] And so, yeah, I mean, I'll tell you, at Kensho,
[50:37.38] we actually used to go to big econ conferences
[50:39.78] sometimes to recruit.
[50:41.06] And these were some of the best hires we ever made.
[50:43.50] - Interesting.
[50:44.34] Because there's a little bit of social science
[50:46.66] in the human feedback.
[50:48.62] - I think it's not only the human feedback.
[50:50.94] I think you could think about this.
[50:52.58] Just in general, you have these like giant,
[50:54.46] really powerful models that are so prone to overfitting,
[50:57.46] that are so poorly understood,
[50:59.22] that are so easy to steer in one direction or another,
[51:01.42] not only from human feedback.
[51:03.18] And your ability to think about these problems
[51:06.22] from first principles, instead of like getting down
[51:08.26] into the weeds of only the math,
[51:10.14] and to think intuitively about these problems,
[51:12.06] is really, really important.
[51:13.94] I'll give you like just one of my favorite examples.
[51:16.30] It's a little old at this point.
[51:17.98] But if you guys remember like SQuAD and SQuAD 2.0,
[51:21.50] the question answering dataset.
[51:22.82] - The Stanford Question Answering Dataset.
[51:23.66] - Yeah, exactly.
[51:24.66] And so, you know, on the benchmark for SQuAD 1,
[51:28.14] eventually the machine learning models start to do
[51:30.90] as well as a human can on this thing.
[51:33.26] And it's like, oh, now what do we do?
[51:35.82] And it takes somebody very clever to say,
[51:38.82] well, actually, let's think about this for a second.
[51:41.34] What if we presented the machine with questions
[51:43.22] with no answer in the passage?
[51:45.22] And it immediately opens a massive gap
[51:47.62] between the human and the machine.
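A minimal sketch of that trick, with invented questions and a trivial baseline standing in for a real reader model: once unanswerable questions enter the benchmark, a system that always extracts a span gets them wrong, while a human simply abstains.

```python
# Toy sketch of the SQuAD 2.0 idea: add unanswerable questions, so
# a system must learn to abstain. Data and baseline are invented.
examples = [
    {"question": "What instrument has an antenna?", "gold": "theremin"},
    {"question": "Who invented the stem splitter?", "gold": None},  # unanswerable
]

def always_answer(question: str):
    # A SQuAD-1-style reader that must return *some* span.
    return "theremin"

def careful_human(question: str):
    # A human recognizes when the passage contains no answer.
    return "theremin" if "antenna" in question else None

def accuracy(predict) -> float:
    hits = sum(predict(ex["question"]) == ex["gold"] for ex in examples)
    return hits / len(examples)

print(f"machine: {accuracy(always_answer):.0%}")  # 50%: misses the unanswerable one
print(f"human:   {accuracy(careful_human):.0%}")  # 100%: the gap opens immediately
```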
[51:48.78] And I think it's like first principles thinking like that,
[51:52.22] that comes very naturally to social scientists
[51:54.90] that does not come as naturally to people like me.
[51:58.34] And so that's why I like to hang out
[51:59.86] with people like that.
[52:01.62] - Well, I'm sure you get plenty of those in Boston.
[52:03.26] And as an econ major myself, you know,
[52:06.06] this is very gratifying to hear, that
[52:07.62] we have the perspective to contribute.
[52:09.86] - Oh, big time, big time.
[52:11.10] I try to talk to economists as much as I can.
[52:13.54] - Excellent, awesome guys.
[52:15.02] Yeah, I think this was great.
[52:16.34] We got like music.
[52:17.58] We got discussion about generative models.
[52:20.06] So we got the whole nine yards.
[52:21.26] So thank you so much for coming on.
[52:23.02] - I had great fun.
[52:23.86] Thank you guys.
[52:24.68] - Thanks.
[52:26.00] (upbeat music)