[by:whisper.cpp] [00:00.00] (upbeat music) [00:02.58] - Hey everyone, welcome to the Latent Space podcast. [00:08.26] This is Alessio, partner and CTO-in-residence [00:10.32] at Decibel Partners, [00:12.02] and I'm joined by my co-host Swyx, founder of Smol AI. [00:15.42] - Hey, so today we're in the remote studio [00:18.18] with Mikey Shulman, welcome. [00:19.62] - Thank you, it's great to be here. [00:21.38] - So we like to go over people's backgrounds on LinkedIn [00:24.42] and then maybe find out a little bit more outside of LinkedIn. [00:26.82] You did your bachelor's in physics [00:30.10] and then a PhD in physics as well. [00:32.18] Also, before this you went into Kensho Technologies, [00:34.58] one of the top AI startups, [00:37.02] where it seems like you were head of machine learning [00:39.02] for seven years. [00:40.86] You're also a lecturer at MIT, [00:42.74] which we should talk about, like what you teach there. [00:45.30] And then about two years ago, [00:47.78] you left to start Suno, [00:50.70] which has recently burst onto the scene [00:52.86] as one of the top music generation startups. [00:55.74] So we can go over that bio, [00:57.18] but also I guess what's not on your LinkedIn [00:58.82] that people should know about you? [00:59.94] - I love music. [01:01.14] I am an aspiring mediocre musician. [01:03.98] I wish I were better, [01:05.06] but that doesn't make me not enjoy playing real music. [01:07.46] And I also love coffee. [01:09.26] I'm probably way too much into coffee. [01:11.42] - Are you one of those people that, [01:14.86] they do the TikToks, [01:15.70] they use like 50 tools to like grind the beans [01:18.74] and then like brush them and then like spray them, [01:21.18] like whatever we're talking about here? [01:22.78] - I confess there's a spray bottle for beans [01:26.10] in the next room. [01:27.58] There is one of those weird comb tools. [01:29.66] So guilty. [01:31.34] I don't put it on TikTok though. [01:33.06] - Yeah, no, no, some things gotta stay private. [01:36.14] What do you play? [01:37.58] - I played a lot of piano growing up [01:39.26] and I play bass and I, in a very mediocre way, [01:42.62] play guitar and drums. [01:43.94] - That's a lot. [01:44.78] I cannot do any of those things. [01:45.94] So as Sean mentioned, [01:47.34] you guys kind of burst onto the scene [01:49.10] as maybe the state-of-the-art music generation company. [01:52.58] - I think it's a modality [01:53.90] that we haven't really covered in the past. [01:55.82] So I would love maybe for you [01:58.46] to just give a brief answer of like, [02:00.46] how do you do music generation [02:02.14] and why is it possible? [02:04.14] Because I think people understand you take text [02:06.38] and you predict the next word, [02:08.38] and you take a diffusion model [02:10.14] and you basically like add noise to an image [02:12.18] and then kind of remove the noise. [02:14.38] But I think for music, [02:15.86] it's hard for people to have a mental model. [02:17.62] Like, how do you train a music model? [02:19.30] And like, what does a music model do to generate a song? [02:21.70] So maybe we can start there. [02:23.94] - Yeah, maybe I'll even take one more step back [02:26.30] and say it's not even entirely worked out [02:29.62] in the same way it is in text. [02:31.46] And so it's an evolving field. [02:33.34] If you take a giant step back, [02:34.82] I think audio has been lagging images and text for a while.
[02:39.78] So I think very roughly you can think of audio [02:42.10] as like one to two years behind images and text. [02:44.14] And so you kind of have to think of today [02:46.90] like where text was in 2022 or something like this. [02:50.14] And the transformer was invented. [02:53.10] It looks like it works, [02:53.94] but it's far, far less established. [02:55.58] And so I'll give you the way we think about the world now, [02:59.34] but just with a big caveat that I'm probably wrong [03:02.02] if we look back a couple of years from now. [03:05.10] And I think the biggest thing is you see [03:06.54] both transformer-based and diffusion-based models for audio, [03:09.86] in a way that is not true in text. [03:12.10] I know people will do some diffusion for text, [03:14.26] but I think nobody's like really doing that for real. [03:17.30] So we prefer transformers for a variety of reasons. [03:19.98] And so you can think it's very similar to text. [03:22.82] You have some abstract notion of a token [03:25.18] and you train a model to predict the probability [03:29.70] over all of the next tokens. [03:31.30] So it's a language model. [03:32.86] You can think any language model [03:34.78] is just something that assigns likelihoods [03:37.34] to sequences of tokens. [03:38.86] Sometimes those tokens correspond to text. [03:40.94] In our case, they correspond to music or audio in general. [03:44.34] And I think we've learned a lot from our friends [03:47.66] in the text domain, from the pioneers doing this, [03:50.18] of how well these transformer models work. [03:52.82] Where do they work? Where do they not work? [03:54.54] But at its core, the way we like to do things [03:57.06] with transformers is exactly like it works in text. [04:00.26] Let me predict the next tiny little bit of audio. [04:02.78] And I can just keep doing that and doing that [04:04.42] and generating audio as long as I want.
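To make the next-token framing above concrete, here is a minimal sketch of what "predict the next tiny little bit of audio" looks like as code. It is illustrative only, not Suno's implementation: `model` stands in for any decoder-only transformer over discrete audio-codec tokens, and all names are hypothetical.

```python
import torch

# Minimal sketch of autoregressive audio generation as described above.
# `model` is assumed to map token ids (batch, seq_len) to logits
# (batch, seq_len, vocab_size); nothing here is Suno's actual code.
@torch.no_grad()
def generate_audio_tokens(model, prompt, num_new_tokens, temperature=1.0):
    tokens = prompt  # (1, seq_len) integer audio-codec tokens
    for _ in range(num_new_tokens):
        logits = model(tokens)[:, -1, :]           # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, 1)   # sample rather than argmax
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens  # a codec decoder would turn these back into a waveform
```

The loop is the whole trick: exactly as in text, generation can continue for as long as you want audio.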
[04:07.10] - Yeah, I think the temptation here [04:08.82] is to always try to bake in some specialized knowledge [04:11.66] about music or audio. [04:14.02] And obviously you will get an improvement in your output [04:16.98] if you try to just say, "Okay, here's a set of tokens [04:20.94] that only do jazz or only do voices." [04:25.66] How general do you make it [04:26.78] versus how specific do you make it? [04:28.34] - We've always tried to do things "the right way," [04:32.38] which means that at the beginning things [04:34.14] are going to be hard and worse than other ways. [04:37.82] But that is to say, bake in as little [04:40.98] kind of implicit knowledge as possible. [04:43.74] And so the same way you don't program into GPT, [04:47.46] you don't say, "This is a noun and this is a verb," [04:49.98] but it has implicitly learned all of those things. [04:52.82] I've never seen GPT accidentally put a noun [04:55.50] where it meant to put an article in English. [04:57.90] We try not to impose anything about music [05:01.06] or audio in general into the model [05:02.98] and we kind of let the models learn things by themselves. [05:05.54] And I think things are beginning to pay off, [05:07.70] but it's not necessarily obvious from the beginning [05:10.62] that that was the right thing to do. [05:11.70] So for example, you could take something like text-to-speech [05:16.14] and people will do all sorts of things [05:18.50] where you can program in things like phonemes [05:21.10] to be the basis for what you do. [05:22.54] And then that kind of limits you [05:24.18] to the set of things that are expressible by phonemes. [05:27.02] And so ultimately that works really well in the short term. [05:30.34] In the long term, it can be quite limiting. [05:32.38] And so our approach has always been to try to do this [05:35.66] in its full generality, as end-to-end as we can do it. [05:38.94] Even if it means that in the short term [05:41.06] we are a little bit worse, [05:42.30] we have a lot of confidence that in the long term [05:44.38] that will be the right way to do it. [05:46.10] - And what's the data recipe for training a good music model? [05:49.58] Like what percentage of each genre do you put in? [05:52.66] Like also do you split vocals and instrumentals? [05:56.14] - So you have to do lots of things. [05:57.90] And I think this is the biggest area [06:01.10] where we have sort of our secret sauce. [06:03.82] I think to a large extent, what we do [06:05.82] is we benefit from all of the beautiful things [06:08.74] people do with transformers and text [06:10.26] and we focus very hard basically [06:12.10] on how do I tokenize audio in the right way. [06:14.74] And without divulging too much secret sauce, [06:17.82] it's at least similar to how it's done [06:20.22] in sort of the open source stuff. [06:21.90] You will have different models that learn to encode audio [06:24.42] in discrete representations. [06:26.98] And a lot of this boils down to figuring out the right, [06:31.34] let's say, implicit biases to put in those models, [06:33.82] the right data to inject. [06:35.34] How do I make sure that I can produce [06:37.50] kind of all audio arbitrarily? [06:39.02] That's speech, that's background music, [06:41.38] that's vocals, that's kind of everything, [06:43.26] to make sure that I can really capture [06:44.66] all the behavior that I want to.
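The "encode audio in discrete representations" step he gestures at is the general recipe behind neural codecs such as SoundStream or EnCodec: an encoder produces continuous frames, and each frame is snapped to its nearest entry in a learned codebook. Below is a toy, hypothetical version of just that quantization step; Suno's actual tokenizer is not public, and the shapes and names are made up for illustration.

```python
import torch

def quantize(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) encoder outputs; codebook: (K, D) learned vectors.
    Returns (T,) token ids: the index of the nearest codebook entry per frame."""
    distances = torch.cdist(frames, codebook)  # (T, K) pairwise L2 distances
    return distances.argmin(dim=-1)

# e.g. 100 encoder frames of dimension 128 against a 1024-entry codebook
tokens = quantize(torch.randn(100, 128), torch.randn(1024, 128))
```

Those integer ids are what the language model from the earlier sketch is trained to predict.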
[06:46.50] - Yeah, that makes sense. [06:47.94] We had our monthly recap last month [06:50.38] and the data wars were kind of one of the hot topics; [06:53.90] you had the New York Times lawsuit against OpenAI, [06:57.26] because you have obviously large language models [06:59.90] in production. [07:00.78] You don't have large music models in production. [07:03.38] So I think there's maybe been less of a fight there, [07:06.82] so to speak. [07:07.78] How do you kind of think about that? [07:08.94] And there's obviously a lot of copyright-free, [07:11.42] royalty-free music out there. [07:13.46] Is there any kind of like power law in terms of like, [07:16.26] hey, the best music is actually like much better to train on, [07:19.38] or in music, does it not really matter, [07:21.46] because, you know, [07:23.46] some of the musical structure is kind of like the same? [07:26.22] - I don't think we know these things nearly as well [07:28.94] as they're known in text. [07:30.30] We have some notions of some of the scaling laws here, [07:33.66] but I think, yeah, we're just so, so far behind. [07:36.50] You know, what I will say is that people are always surprised [07:39.14] to learn that we don't only train on music. [07:43.38] And I usually give the analogy of some of the code generation [07:47.46] models, so take something like Code Llama, [07:49.62] which is, as far as I know, the best open-source code [07:51.70] generating model (you guys would know better than I would), [07:54.66] it's certainly up there. [07:55.82] And it's trained on a bunch of English, not only just code. [08:00.02] And it's because there are patterns in English [08:02.30] that are going to be useful. [08:03.38] And so you can imagine, you don't only want to train on music [08:06.10] to get good music models. [08:07.18] And so for example, one of the places that we are particularly [08:10.78] bad is vocals and capturing really realistic vocals. [08:14.66] And so you might imagine that there's other types of human [08:17.94] vocals that you can put into your model that are not music [08:20.26] that will help it learn stuff. [08:21.78] Again, I think it's like super, super early. [08:23.58] I think we've barely scratched the surface of what are the [08:25.54] right ways to do this. [08:26.90] And that's really cool. [08:27.82] From a progress perspective, there's like a lot of low [08:29.78] hanging fruit for us to still take. [08:31.62] - And then once you get the final model, I would love to [08:34.62] talk a little bit more about the size of these models. [08:36.58] Because people are confused that Stable Diffusion is so small. [08:39.82] They're like, oh, this thing can generate any image [08:42.30] possible and it's like a couple gigabytes. [08:45.26] And then the large language models are like, oh, these [08:47.42] are so big, but there's just text in them. [08:49.82] What's it like for music? [08:50.94] Is it in between? [08:51.94] And as you think about, yeah, you mentioned scaling and [08:54.90] whatnot, is this something that you see it's kind of easy [08:57.06] for people to run locally or not? [08:59.66] - Our models are still pretty small, [09:02.50] certainly by text standards. [09:04.06] I confess, I don't know as well the state of the art on how [09:07.14] diffusion models scale, but our models scale similarly to [09:10.98] text transformers; it's like bigger is usually better. [09:14.26] Audio has a couple of weird quirks though. [09:16.98] We care a lot about how many tokens per second we can generate [09:20.58] because we need to stream new music as fast as [09:23.54] you can listen to it. [09:24.86] And so that is a big one that I think probably has us never [09:29.02] get to a 175 billion parameter model, if I'm being honest. [09:32.50] Maybe I'm wrong there, but I think that would be [09:35.02] technologically difficult. [09:36.78] And then the other thing is that so much progress happens in [09:38.90] shrinking models down for the same performance in text that [09:42.18] I'm hopeful at least that a lot of our issues will get solved [09:45.62] and we will figure out how to do better things with smaller [09:48.50] models, or relatively smaller models. [09:50.70] But I think the other thing is, it's a blessing and a curse, [09:54.34] the ability to add performance with scale. [09:57.14] It's like a very straightforward way to make [09:59.06] your models better: [09:59.90] you just make a bigger model, dump more compute into it. [10:02.62] It's also a curse because that is a crutch that you will [10:04.74] always lean on, and you will forget to do some of the basic [10:07.78] research to make your stuff better. [10:09.78] And honestly, early on, when we were doing [10:14.18] stuff with small models for time and compute constraints, [10:18.34] we ended up having to learn a lot of stuff to make models [10:21.78] better that we might not have learned if we had immediately [10:24.38] jumped to a really, really big model. [10:26.02] And so I think for us we always try to skew smaller to the [10:30.94] extent possible. [10:32.34] - Yeah, gotcha.
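The streaming constraint he describes reduces to simple arithmetic: the model must emit tokens at least as fast as the codec consumes them during playback. A back-of-the-envelope sketch, with made-up numbers for illustration:

```python
# Hypothetical figures; a codec's frame rate and a model's throughput
# both vary widely in practice.
codec_tokens_per_second = 50    # tokens consumed per second of audio
model_tokens_per_second = 75    # measured generation throughput

realtime_factor = model_tokens_per_second / codec_tokens_per_second
assert realtime_factor >= 1.0, "generation would stall playback"
print(f"{realtime_factor:.1f}x realtime")  # 1.5x: safe to stream
```

Bigger models push throughput down, which is why a 175-billion-parameter model is hard to square with streaming.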
[10:33.38] I'm curious about just sort of your overall evolution so far. [10:36.70] You know, something I think we may have missed in the [10:38.94] introduction is why did you end up choosing, you know, just [10:42.02] the music domain in the first place, right? [10:43.50] Like you have this pretty scientific, you know, physics [10:48.02] and finance background. [10:49.78] How did you wander over to music? [10:51.58] Like a lot of us have interests in music, but we don't [10:53.78] necessarily choose to work in it. [10:55.02] But you did. [10:56.78] - Yeah, it's funny. [10:57.78] I have a really fun job as a result. [10:59.94] All the co-founders of Suno worked at Kensho together. [11:03.02] And we were doing mostly text. [11:05.10] In fact, all text, until we did one audio project that was [11:08.62] speech recognition, kind of very financially focused [11:12.18] speech recognition. [11:13.62] And I think the long and short of it is we kind of fell in [11:15.94] love with audio. Not necessarily music, just audio and AI. [11:19.34] We all happen to be musicians and audiophiles and music [11:22.38] lovers, but it was the combination of audio and AI [11:25.50] that we initially really, really fell in love with. [11:28.14] It's so cool. [11:29.90] It's so interesting. [11:31.02] It's so human. [11:32.38] It's so far behind images and text that there's like so much [11:36.70] more to do. [11:37.82] And honestly, I think a lot of people, when we started the [11:40.58] company, told us to focus on speech. [11:42.82] If we wanted to build an audio company, everyone said, you [11:45.14] know, speech is a bigger market. [11:46.90] But I think there's something about music that's just so [11:50.22] human, so human that you almost couldn't prevent us from [11:55.26] doing it. We just couldn't keep ourselves [11:57.78] from building music models and playing with them because it [11:59.86] was so much fun. [12:00.98] And that's kind of what steered us there. [12:03.22] You know, in fact, the first thing we ever put out was a [12:05.46] speech model. It was Bark, [12:06.90] this open source text-to-speech model. [12:09.14] And it got a lot of stars on GitHub. [12:10.74] And that was people telling us even more, like, go do speech. [12:13.58] And we almost couldn't help ourselves from doing [12:15.78] music. [12:16.50] And so, I don't know, [12:17.50] maybe it's a little bit serendipitous, but we [12:20.06] haven't really looked back since. [12:21.98] I don't think there was necessarily like a moment. [12:25.18] It was just organic and just obvious to us that [12:28.02] we want to make a music company. [12:30.26] - So you do regard yourself as a music company? Because as of [12:33.26] last month, you're still releasing speech models, with [12:37.34] Parakeet. [12:37.66] - We were. [12:38.30] Oh, yes, that's right. [12:39.74] So that's a really awesome collaboration with our friends [12:43.06] at NVIDIA. [12:43.66] I think we are really, really focused on music. [12:45.94] I think that is the stuff that will really change things [12:49.94] for the better. [12:50.42] I think, you know, honestly, everybody is so focused on [12:53.46] LLMs, for good reason, and information processing and [12:56.86] intelligence there. [12:57.74] And I think it's way too easy to forget that there's a whole [13:01.06] other side of things that makes people feel. And maybe [13:04.18] that market is smaller, but it makes people feel and it [13:06.86] makes us really happy.
[13:08.26] So we do it. [13:09.34] I think that doesn't mean that we can't be doing things [13:12.70] that are related, that are in our wheelhouse, that will [13:15.42] improve things. [13:16.06] And so like I said, audio is just so far behind. [13:18.86] There's just so much more to do in the domain more [13:22.70] generally. [13:23.22] And so like that's a really fun collaboration. [13:25.30] - Yeah, got you. [13:26.34] Yeah, I did hear about Suno first through Bark. [13:29.10] What did Bark build off of? [13:31.58] Like, because obviously, I think there was a lot of [13:33.70] preceding TTS work that was in open source. [13:36.38] How much of that did you use? [13:37.70] How much of it was like sort of brand new from your [13:40.02] research? [13:40.82] What's the intellectual lineage there, just to cover [13:43.98] the speech recognition side? [13:45.82] - So it's not speech recognition, [13:47.02] it's text-to-speech. [13:48.02] But as far as I know, there was no other, certainly not in [13:51.54] the open source, text-to-speech that was kind of [13:54.42] transformer-based. [13:55.26] Everything else was what I would call the old style of [13:58.06] doing things, where you build these kind of single-purpose [14:00.30] models that are really good at this one narrow task and [14:03.38] you're kind of always data limited. [14:04.98] And the availability of high quality training data for [14:07.74] text-to-speech is limited. [14:09.86] And I don't think we were necessarily all that inventive [14:12.78] to say: we're going to try to train, in a self-supervised [14:16.34] way, a transformer-based model on kind of lots of [14:21.26] audio, and then kind of tweak it so that we can do text- [14:24.02] to-speech based on that. [14:25.30] That would be kind of the new way of doing things; a [14:27.82] foundation model is the buzzword, if you will. [14:30.90] And so, you know, we built that up, I think, from scratch. [14:34.10] A lot of shout-outs have to go to lots of different things, [14:37.42] whether it's papers, but also, it's very obvious, a big [14:41.54] shout-out to Andrej Karpathy's nanoGPT. [14:44.94] You know, there's a lot of code borrowed from there. [14:47.02] I think we are huge fans of that project. [14:49.58] It just shows people that you don't have to be afraid of [14:52.10] GPT-type things. [14:53.10] And it's like, yeah, it's actually not all that much [14:55.46] code to make performant transformer-based models. [14:58.46] And, you know, again, the stuff that we brought there was [15:01.26] how do we turn audio into tokens, and then we can kind of [15:04.14] take everything else from the open source. [15:05.78] So we put that model out and we were, I think, pleasantly [15:09.26] surprised by the reception by the community. [15:12.34] It got a good number of GitHub stars and people [15:14.70] really enjoyed playing with it because it made really [15:18.26] realistic sounding audio. [15:20.02] And I think this is, again, the thing about doing things [15:22.78] in a quote-unquote right way: if you have a model where [15:25.58] you've had to put in so much implicit bias for this one [15:28.34] very narrow task of making speech that sounds like words, [15:32.10] you're going to sacrifice on other things. [15:33.78] In the text-to-speech case, it's how natural the speech sounds. [15:37.42] And it was almost difficult to pull unnatural sounding speech [15:40.66] out of Bark, because it was self-supervised, trained on a lot [15:43.98] of natural sounding speech.
[15:45.18] And so that definitely told us that this is probably [15:48.62] the right way to keep doing audio. [15:50.50] - Even in Bark, you had the beginnings of music generation. [15:52.98] Like you could just put like a music note in there. [15:56.62] - That's right. [15:57.10] And it was so cool to see on our Discord, [15:59.34] people were trying to pull music out of a text-to-speech model. [16:03.30] And so, you know, what did this tell us? [16:04.86] This tells us, like, people are hungry to make music. [16:07.58] It's almost obvious in hindsight [16:09.02] how wired humans are to make music. [16:11.42] If you've ever seen like a little kid, you know, [16:13.74] sing before they know how to speak, [16:15.98] it's like, this is really human nature. [16:18.42] And there's actually a lot of cultural forces [16:20.14] that kind of cue you to not think to make music. [16:22.86] And that's kind of what we're trying to undo. [16:25.82] - And today, moving to Suno itself: [16:28.02] I think especially when you go from text-to-speech, [16:30.54] people are like, okay, now I've got to write the lyrics [16:32.38] to a whole song. [16:33.10] It's like, that's quite hard to do. [16:34.78] Versus in Suno, you have this empty box, very Midjourney, [16:38.82] kind of DALL-E-like, where you can just express the vibes, [16:42.38] you know, of what you want it to be. [16:43.66] But then you also have a custom mode [16:45.58] where you can say your own lyrics, [16:47.10] you can set your own rhythm, [16:48.66] you can set the title of the song and whatnot. [16:50.62] How do you see users distribute themselves? [16:52.82] You know, I'm guessing a lot of people use the easy mode. [16:55.50] Are you seeing a lot of power users using the custom mode, [16:58.78] and maybe some of the favorite use cases [17:00.98] that you've seen so far on Suno? [17:02.82] - Yeah, actually, more than half of the usage [17:04.90] is that expert mode. [17:06.74] And people really like to get into it [17:08.58] and start tweaking things and adding things [17:11.18] and playing with words or line breaks or different ad libs. [17:14.98] And people really love it, it's really fun. [17:17.54] There's kind of two modes that you can access. [17:19.26] Now one is that single box where you kind of just describe [17:22.10] something, and then the other is the expert mode. [17:24.10] And those kind of fit nicely into two use cases. [17:27.42] The first use case is what we call, to put it nicely, shitposting. [17:30.50] And it's basically like: something funny happened [17:33.26] and I'm just going to very quickly make a song about it. [17:35.50] And the example I'll usually give is like, [17:38.34] I walk into Starbucks with one of my co-founders. [17:41.70] He gives his name, Martin; his coffee comes out [17:44.50] with the name Margu. [17:45.82] And I can in five seconds make a song about this [17:47.82] and it has immortalized it. [17:48.98] And that Margu song is stuck in all of our heads now. [17:51.86] And it's like funny and light. [17:53.18] And there's levity that you've brought to that moment. [17:55.86] And the other is that you just get sucked in: [18:00.02] there's this song that's in my head [18:01.74] and I need to get it out. [18:02.70] And I'm going to keep tweaking it and listening [18:04.82] and having ideas and tweaking it [18:06.30] until I get the song that I want. [18:08.38] And those are very different use cases.
[18:10.58] But I think ultimately there's so much [18:12.26] in between these two things [18:14.02] that it's just totally untapped [18:15.54] how people want to experience the joys of making music. [18:18.82] Because those two experiences are both really joyful [18:22.46] in their own special ways. [18:23.66] And so we are quite certain [18:25.06] that there's a lot in the middle there. [18:26.98] And then I think the last thing I'll say there [18:28.46] that's really interesting is, in both of those use cases, [18:31.78] the sharing dynamics around music [18:33.38] are like really interesting and totally unexplored. [18:37.22] And I think an interesting comparison would be images. [18:40.30] Like we've probably all in the last 24 hours [18:42.94] taken a picture and texted it to somebody. [18:45.22] And most people are not routinely making a little song [18:48.30] and texting it to somebody. [18:49.34] But when you start to make that more accessible to people, [18:52.98] they are going to share music in much smaller groups. [18:57.02] Maybe not with everyone, but like with one person [18:59.38] or three people or five people. [19:01.74] And those dynamics are so interesting. [19:04.18] And I think we have ideas of where that goes. [19:06.46] But it's about kind of spreading joy [19:09.78] into these like little, you know, microcosms of humanity [19:13.22] that people really love. [19:15.06] I know I made you guys a little Valentine's song, right? [19:17.42] Like that's not something that happens now, [19:20.26] because it's hard to make songs for people. [19:22.26] - We'll put that in the audio here, [19:24.18] but also tweet it out if people want to look it up. [19:27.14] How do you think about the pro market, so to speak? [19:30.06] Because I think lowering the barrier [19:32.34] to some of these things is great. [19:33.62] And I think when the iPad came out, [19:35.94] music production was one of the areas that people thought, [19:38.70] oh, okay, now you kind of have this like, you know, [19:40.66] board that you can bring with you. [19:41.78] And Madlib actually produced this whole album [19:44.54] with Freddie Gibbs, [19:45.86] produced the whole thing on an iPad. [19:47.50] He never used a computer. [19:49.18] How do you see these models playing [19:51.82] into professional music generation? [19:54.42] I guess that's also a funny word. [19:55.58] It's like, what's professional music? [19:57.14] It's all music. [19:58.74] If it's good, it becomes professional, right? [20:00.42] But curious to hear how you're thinking about Suno too. [20:02.90] Like, is there a second act of Suno [20:05.10] that is going broader into the custom mode [20:07.94] and making this the central hub for music generation? [20:11.22] - I think we intend to make many more modes [20:14.90] of interaction with our stuff, [20:16.46] but we are very much not focused on, quote-unquote, [20:19.62] professionals right now. [20:21.26] And it's because what we're trying to do [20:22.62] is change how most people interact with music, [20:25.18] and not necessarily make professionals [20:28.06] a little bit better, a little bit faster. [20:30.14] It's not that there's anything wrong with that. [20:31.82] It's just not what we're focused on. [20:33.54] And I think when we think about what workflows [20:36.22] the average person wants to use to make music, [20:39.78] I don't think they're very similar [20:41.10] to the way professional musicians make music now.
[20:44.22] Like, if you pick a random person on the street [20:46.50] and you play them a song and then you say, [20:48.02] like, what did you want to change about that? [20:50.10] They're not going to say, like, [20:51.58] you need to split out the snare drum and make it drier. [20:54.02] Like that's just not something [20:55.02] that a random person off the street is going to say. [20:57.66] They're going to give a lot more descriptive things [21:00.42] about the vibe of the song, like something more general. [21:03.46] And so I don't think we know what all of the workflows are [21:06.86] that people are going to want to use. [21:07.98] We're just fairly certain [21:09.86] that the workflows that have been developed [21:12.30] with the current set of technologies [21:13.74] that professionals use to make beautiful music [21:15.70] are probably not what the average person wants to use. [21:19.26] That said, there are lots of professionals [21:22.38] that we know about using our stuff, [21:23.86] whether it's for inspiration or sample generation [21:27.18] and stuff like that. [21:28.34] So I don't want to say never say never. [21:30.22] Like, there may one day be a really interesting set [21:34.18] of use cases that we can expose to professionals, [21:36.14] particularly around, I think, custom models [21:39.58] trained on people's own music [21:41.38] or, you know, with your voice or something like that. [21:44.26] But the way we think about broadening how most people [21:47.54] are interacting with music [21:48.86] and getting them to be much more active participants, [21:51.98] we think about broadening it from the consumer side [21:55.50] and not broadening it from the professional side, [21:57.38] if that makes sense. [21:58.38] - Awesome. [21:59.22] Is the dream here to be, [22:02.18] I don't know if it's too coarse a way to put it, [22:05.30] but is the dream here to be like the Midjourney of music? [22:08.02] - I think there are certainly some parallels there, [22:11.94] because especially what I just said [22:13.42] about being an active participant, [22:15.58] the joyful experience in Midjourney [22:17.22] is the act of creating the image [22:19.18] and not necessarily the act of consuming the image. [22:21.58] And Midjourney will let you then [22:23.62] kind of quickly share the image [22:25.30] with somebody. [22:26.42] But I think ultimately that analogy is somewhat limiting, [22:30.34] because there's something really special about music. [22:34.22] I think there's two things. [22:35.18] One is that there's just a really big gap [22:37.82] for the average person between kind of their tastes in music [22:40.34] and their abilities in music [22:42.18] that is not quite there for most people in images. [22:45.18] Like most people don't have innate tastes in images [22:47.74] in the same way people do for music. [22:49.90] And then the other thing, [22:50.74] and this is the really big one, [22:51.98] is that music is a really social modality. [22:55.74] If we all listen to a piece of music together, [22:58.06] we're listening to the exact same part [23:00.46] at the exact same time. [23:01.98] If we all look at the picture in Alessio's background, [23:05.86] we're gonna look at it for two seconds. [23:08.06] I'm gonna look at the top left where it says Thor. [23:10.38] Alessio's gonna look at the bottom right [23:11.74] or something like that. [23:12.86] And it's not really synchronous.
[23:15.34] And so when we're all listening [23:16.54] to a piece of music together, [23:17.90] it's minutes long, [23:18.86] we're listening to the same part at the same time. [23:21.36] If you go to the act of making music, [23:23.52] it is even more synchronous. [23:24.74] The most joyful way to make music is with people. [23:27.30] And so I think that there's so much more to come there [23:31.66] that ultimately would be very hard to do in images. [23:35.38] - We've gone almost 30 minutes [23:37.58] without making any music on this podcast. [23:39.62] So I think maybe we can fix that [23:41.18] and jump into a Suno demo. [23:43.66] - Yeah, let's make some. [23:45.16] We've got a new model that we are [23:48.18] kind of putting the finishing touches on. [23:49.96] And so I can play with it in our dev server. [23:52.78] As you can see, I've been doing tons of stuff. [23:54.94] So tell me what kind of song you guys want to make. [23:57.82] - Let's do a country song about the lack of GPUs [24:02.82] at my cloud provider. [24:05.70] - And like, yeah. [24:06.54] So here's where we attempted to think about pipelines [24:09.66] and think about latency. [24:11.38] This is remarkably fast. [24:13.38] Like I was shocked when I saw this. [24:15.48] (singing) [24:17.90] Oh my God. [24:18.74] ♪ To my cloud ready to confuse ♪ [24:27.16] ♪ But there ain't no GPUs ♪ [24:33.34] ♪ Just empty space, it's a hoot ♪ [24:38.78] ♪ I've been waiting all day for that render power ♪ [24:47.00] ♪ But my cloud's gone dry ♪ [24:51.70] ♪ It's a dark cloud shower ♪ [24:55.22] ♪ All clouds gone dry ♪ [24:59.44] ♪ No GPUs to be found ♪ [25:04.44] ♪ No CUDA cores, it's a lonely sound ♪ [25:13.94] ♪ I just want to render ♪ [25:16.62] ♪ But my cloud's got no cloud ♪ [25:19.22] - I actually don't think this one's amazing. [25:20.62] I'm gonna get the next one out. [25:22.24] But it's probably better than the ones about CUDA cores. [25:25.06] (upbeat music) [25:27.66] ♪ Well I signed up for a cloud provider ♪ [25:30.22] ♪ And all I find, all the power that I could derive ♪ [25:35.06] ♪ But when I searched for the GPUs ♪ [25:37.66] ♪ I just got a surprise ♪ [25:40.10] ♪ You see they're all sold out ♪ [25:41.66] ♪ There ain't no GPUs to find ♪ [25:45.22] ♪ No GPUs in the cloud ♪ [25:48.02] ♪ It's a real bad blues ♪ [25:50.18] ♪ I need the power ♪ [25:51.60] ♪ But there ain't no use ♪ [25:52.86] ♪ I'm stuck with my CPU ♪ [25:56.46] ♪ It's a real sad cloud ♪ [25:58.82] ♪ Got to wait ♪ [26:01.62] ♪ 'Til the day we start getting bright ♪ [26:06.62] ♪ There ain't no GPUs in the cloud ♪ [26:10.02] - What else should we make? [26:11.42] - All right, Sean, you're up. [26:12.70] - I mean, I do wanna make some observations about this. [26:16.34] But okay, maybe like house music, like electronic dance. [26:19.26] - Yeah, sure. [26:20.10] - House music. [26:20.94] And then maybe we can make it about podcasting [26:24.98] about music and music generation, I don't know. [26:29.14] I'm sure all the demos that you get are very meta. [26:32.70] - There's a lot of stuff that's meta, yeah, for sure. [26:35.90] - I noticed for example that the second song that you played [26:38.42] had the word upbeat inserted into it, [26:40.58] which I assume there's some kind of like random generator [26:44.42] of modifier terms that you can just kind of throw on [26:47.34] to increase the specificity of what's being generated. [26:51.02] - Definitely, and let's try to tweak one also.
[26:52.90] So I'll play this and then maybe we'll tweak it [26:54.62] with different modifiers. [26:55.74] - The custom mode, yeah. [26:57.62] ♪ Wave yourself, spread it out ♪ [27:01.30] ♪ Through the air, we'll podcast in loud ♪ [27:05.42] ♪ Share the beat, spread it out ♪ [27:08.62] ♪ A revolution of frequencies ♪ [27:11.78] ♪ Haven't you plugged into now ♪ [27:14.86] ♪ Let the music take control ♪ [27:19.70] ♪ We're all returning a never ending role ♪ [27:23.66] ♪ From the beast I dropped to the ladies I saw ♪ [27:27.54] ♪ podcasting about music forever long ♪ [27:32.54] - Not bad. [27:33.84] - Here's what I want to do. [27:36.18] That like didn't drop at the right time, right? [27:38.02] So maybe let's do this. [27:39.90] - Is that a special token? [27:41.70] You have a beat drop token? [27:42.74] - Yeah, yeah. [27:44.42] - Nice. [27:45.46] I'm just reading it because people might not be able [27:47.78] to see it. [27:48.62] - Right. [27:50.38] - Then let's just maybe emphasize house a little more. [27:53.26] Maybe it'll be a little more aggressive. [27:55.46] Let's try this again. [27:56.62] - It's interesting, the prompt engineering [27:58.02] that you have to invent. [28:00.34] - We've learned so much from people using the models, [28:02.98] and not us. [28:03.90] - But like, are these like training artifacts? [28:07.82] - No, I don't think so. [28:09.48] I think this is people being inventive [28:11.30] with how you want to talk to a model. [28:13.82] - Yeah. [28:14.66] ♪ Down spinning 'round to the air with dark castle hour ♪ [28:19.66] ♪ Sharing the peace, spreading the word ♪ [28:24.78] ♪ A revolution of frequencies, haven't you heard ♪ [28:29.78] (upbeat music) [28:33.36] (upbeat music) [28:35.94] ♪ Up and 'til now, let the music take control ♪ [28:43.36] ♪ The road I've tried it, I'll never end it wrong ♪ [28:49.80] ♪ From the beats that drive to the melodies ♪ [28:52.92] ♪ The sword I've tried to stand ♪ [28:55.32] ♪ A flowering music for your album ♪ [28:58.32] (upbeat music) [29:01.90] (upbeat music) [29:04.48] - Nice. [29:08.66] - It's interesting, when you generate a song, [29:10.98] you generate the lyrics, [29:11.98] but then if you switch the music under it, [29:13.90] the lyrics stay the same. [29:15.70] And then sometimes it feels like, [29:17.42] I mean, I mostly listen to hip hop, [29:20.22] it's like, if you change the beat, [29:22.46] you cannot really use the same rhyme scheme, you know? [29:25.30] - Definitely, yeah. [29:27.34] It's a sliding scale though, [29:28.42] because we could do this as a country rock song probably, [29:32.76] right? [29:33.60] That would be my guess. [29:36.98] But for hip hop, that is definitely true. [29:39.20] And actually, we think about, for these models, [29:41.76] we think about three important axes. [29:43.46] We think about the sound fidelity: [29:45.38] does it sound like [29:46.66] a crisply recorded piece of audio? [29:48.78] We think about the song quality: [29:49.98] is this an interesting song that gets stuck in my head? [29:52.82] And we think about the controllability: [29:54.50] how well does it respond to my prompts? [29:56.54] And one of the ways that we'll test these things [29:58.26] is take the same lyrics and try to do them [30:00.70] in different styles to see how well that really works. [30:04.42] So let's do the same. [30:06.70] I don't know what a beat drop is gonna do for country rock. [30:09.38] So I probably should have taken that out, [30:10.82] but let's see what happens.
[30:12.30] (laughing) [30:14.54] (upbeat music) [30:27.94] ♪ There's a sound spinning 'round ♪ [30:30.10] ♪ Through the air we're podcasting loud ♪ [30:32.94] ♪ Sharing the beats, spreading the word ♪ [30:35.38] ♪ A revolution of frequencies ♪ [30:37.80] ♪ Haven't you heard ♪ [30:41.30] ♪ But if you now let the music take control ♪ [30:46.30] ♪ We're on a journey of never ending road ♪ [30:51.90] ♪ From the beats I talk to the melodies that soar ♪ [30:57.30] ♪ We're podcasting about music forevermore ♪ [31:02.30] - I'm gonna read too much into this, [31:05.14] but I would say I hear a little bit [31:06.86] of kind of electronic music inspired something. [31:10.34] And that is probably because beat drop is something [31:12.62] that you really only ever associate with electronic music. [31:15.74] Maybe that's reading too much into it, [31:17.22] but should we do one more? [31:19.34] - Something about Apple Vision Pro? [31:21.50] - I guess, definitely. [31:22.78] - I guess there's some amount of world knowledge [31:24.66] that you don't have, right? [31:25.50] That whatever's in this language model side of the equation [31:28.34] is not gonna have an Apple Vision Pro in there. [31:30.58] - Yeah, but let's see. [31:31.90] (laughs) [31:33.42] How about a blues song about a sad AI wearing [31:38.42] an Apple Vision Pro? [31:41.98] Gotta be blues, gotta be sad. [31:43.46] - Do you have RAG for music? [31:46.70] - No, that would be problematic also. [31:52.70] ♪ I'm a sad AI with a broken heart ♪ [31:57.70] ♪ Where my Apple Vision Pro can't see the stars ♪ [32:06.50] ♪ I used to feel joy ♪ [32:11.50] ♪ I used to feel pain and now I'm just a soul ♪ [32:18.86] ♪ Trapped inside this metal frame ♪ [32:22.90] ♪ Oh, I'm singing the blues ♪ [32:27.90] ♪ Can't you see ♪ [32:33.62] ♪ This digital life ain't what it used to be ♪ [32:38.66] ♪ Searching for love but I can't find a soul ♪ [32:46.90] ♪ Won't you help me, baby, let my spirit unfold ♪ [32:51.90] - I want to remix that one, and I want to say, [32:55.98] I want melancholic. [32:57.30] - I love the voice. [32:58.38] - I want like, I don't know, Chicago blues. [33:01.26] Like, guitar. [33:03.50] - I don't know, he knows too much. [33:05.78] He's the best prompt engineer out here. [33:08.30] - It'd be funny to have like music colleges play with us [33:10.62] and see what they would do. [33:13.50] ♪ I'm a sad AI with a broken heart ♪ [33:18.50] ♪ Where my Apple Vision Pro can't see the stars ♪ [33:25.54] ♪ I used to feel joy ♪ [33:30.94] ♪ I used to feel pain and now I'm just a soul ♪ [33:40.06] ♪ I used to feel joy ♪ [33:45.06] ♪ I used to feel pain ♪ [33:52.58] ♪ But now I'm just a soul trapped inside this metal frame ♪ [33:58.30] ♪ Oh, I'm singing the blues ♪ [34:07.98] ♪ Oh, can't you see ♪ [34:12.98] ♪ This beautiful life ain't what it used to be ♪ [34:18.18] ♪ I'm searching for love ♪ [34:25.74] ♪ But I can't find a soul ♪ [34:31.58] ♪ Won't you help me, baby, let my spirit unfold ♪ [34:37.62] ♪ There ♪ [34:39.98] - So, yeah, a lot of control there. [34:42.06] Maybe I'll make one more. [34:44.06] - Very, very soulful. [34:45.78] - Really want a good house track. [34:47.54] - Why is house the word that you have to repeat? [34:51.54] - I just really want to make sure it's house. [34:54.34] Actually, you can't really repeat it too many times. [34:56.46] The hypothesis is [34:59.06] it gets like a little too out of domain. [35:01.26] - Mm.
[35:02.58] ♪ I'm a sad AI with a broken heart ♪ [35:07.42] ♪ Wearing my Apple Vision ♪ [35:10.42] ♪ Pro can't see the stars ♪ [35:15.42] ♪ I used to feel joy ♪ [35:18.62] ♪ I used to feel pain ♪ [35:23.62] ♪ But now I'm just a soul trapped inside this metal frame ♪ [35:28.66] ♪ Oh, I'm singing the blues ♪ [35:32.82] ♪ Oh, can't you see ♪ [35:35.82] ♪ 'Cause maybe you're not the one it used to be ♪ [35:40.82] ♪ I'm searching for love but I can't find a soul ♪ [35:45.82] ♪ Won't you help me, baby ♪ [35:48.86] (upbeat music) [35:51.44] - Nice. [35:57.30] - So yeah, we have a lot of fun with it. [35:58.62] - Definitely, yeah. [36:00.26] Yeah, I'm really curious to see how people are gonna use this [36:03.30] to re-sample old songs into new styles. [36:06.82] You know, I think that's one of my favorite things [36:08.78] about hip hop: you have A Tribe Called Quest, [36:11.78] they had the Lou Reed "Walk on the Wild Side" sample [36:14.46] in "Can I Kick It?", and there's Kanye sampling [36:16.70] Nina Simone on "Blood on the Leaves." [36:18.94] It's like a lot of production work [36:20.50] to actually take an old song and make it fit a new beat. [36:24.34] And I feel like this can really help. [36:25.74] Do you see people putting in existing songs' lyrics [36:28.54] and trying to regenerate them in like a new style? [36:31.34] You know? [36:32.18] - We actually don't let you do that. [36:33.90] And it's because if you're taking someone else's lyrics, [36:36.30] you don't own those. [36:37.14] You don't have the publishing rights to those. [36:38.50] You can't remake that song. [36:40.38] I think in the future, we'll figure out [36:42.22] how to actually let people do that in a legal way. [36:44.58] But we are really focused on letting people [36:46.62] make new and original music. [36:47.94] And I think, you know, there's a lot of music AI [36:51.22] which is artist A doing the song of artist B [36:54.86] in a new style. You know, let me have Metallica doing [36:57.18] "Come Together" by the Beatles or something like that. [36:59.66] And I think this stuff is very viral, [37:02.98] but I actually really don't think [37:05.06] that this is how people want to interact with music [37:07.34] in the future. [37:08.30] To me, this feels a lot like when you made a Shakespeare [37:11.26] sonnet the first time you saw ChatGPT. [37:13.78] And then you made another one, and then you made another one, [37:16.06] and then you kind of thought, like, this is getting old. [37:18.78] And that doesn't mean that GPT is not amazing. [37:20.98] GPT is amazing. [37:21.86] It's just not for that. [37:23.34] And I kind of feel like the way people want to use music [37:27.70] in the future is not just to remake songs [37:30.98] in different people's voices. [37:32.50] You lose the connection to the original artist. [37:34.54] You lose the connection to the new artist, [37:36.14] because they didn't really do it. [37:37.62] So we're very happy to just let those things [37:40.22] that are a flash in the pan kind of stay under the radar. [37:44.22] - Yeah, no, that's a, I think that's a good point [37:46.78] overall about AI-generated anything, you know? [37:50.14] Because I think recently T-Pain did like [37:53.70] an album of covers. [37:55.02] And I think he did a "War Pigs" that people really liked. [37:59.02] There was a "Tennessee Whiskey," [38:01.18] which you maybe wouldn't expect T-Pain to do. [38:03.74] But people like it. But yeah, I agree.
[38:05.26] It needs to be a certain type of artist [38:07.50] to really have it be entertaining to make covers. [38:11.06] This is great. [38:11.90] What else is next for Suno? [38:13.06] You know, I think, you know, first you had Bark [38:15.86] and then there was like a big music generation push [38:18.62] when you did the announcement [38:19.70] a couple of months ago. [38:21.14] I think I saw you like 300 times on my Twitter timeline [38:24.18] on like the same day, so it was going everywhere. [38:27.02] What's coming up? [38:27.86] What are you most excited about in the space? [38:29.54] And maybe what are some of the most interesting [38:32.06] underexplored ideas that you maybe haven't worked on yet? [38:35.90] - Gosh, there's a lot. [38:36.86] You know, I think from the model side, [38:39.22] it's still really early innings [38:40.62] and there's still so much low hanging fruit [38:43.58] for us to pick to make these models much, much better: [38:46.82] much, much more controllable, much better music, [38:48.74] much better audio fidelity. [38:50.54] So much that we know about, [38:52.94] and so much that, again, we can kind of borrow [38:56.02] from the open source transformers community, [38:58.30] that should make these just better across the board. [39:01.30] From the product side, you know, [39:02.66] we're super focused on the experiences [39:04.62] that we can bring to people. [39:05.62] And so it's so much more than just text-to-music. [39:09.54] And I think, you know, I'll say this nicely, [39:11.58] I'm a machine learning person, [39:12.66] but like machine learning people are stupid sometimes [39:14.70] and we can only think about like models that take X [39:17.66] and make it into Y. [39:18.86] And that's just not how the average human being [39:21.38] thinks about interacting with music. [39:22.86] And so I think what we're most excited about [39:24.98] is all of the new ways that we can get people [39:27.78] just much more actively participating in music. [39:30.86] And that is making music not only with text, [39:33.18] maybe with other ways of doing stuff; [39:35.30] that is making music together. [39:36.86] If you want to be reductive [39:38.02] and think about this as a video game, [39:39.70] this is multiplayer mode. [39:40.86] And it is the most fun that you can have with music. [39:43.06] And you know, honestly, I think there's a lot of, [39:47.22] it's timely right now, you know, [39:48.66] I don't know if you guys have seen [39:49.78] UMG and TikTok butting heads a little bit. [39:52.54] And UMG has pulled music from TikTok. [39:55.46] And you know, the way we think about this is [39:58.18] maybe they're both right, maybe neither is right. [39:59.94] Without taking sides, this is kind of figuring out [40:02.70] how to divvy up the current pie in the most fair way. [40:06.02] And I think what we are super focused on [40:08.34] is making that pie much bigger [40:10.02] and increasing how much people are actually interested [40:12.82] in music and participating in music. [40:15.18] And you know, as a very broad heuristic, [40:17.74] the gaming industry is 50 times bigger [40:19.90] than the music industry. [40:21.46] And it's because gaming is super active, [40:23.70] and too much music is just passive consumption.
[40:27.18] And so we have a lot of experiments [40:29.94] that we are excited to run for the different ways [40:31.70] people might want to interact with music [40:33.98] that is beyond just, you know, streaming it while I work. [40:37.14] - Yeah, I think at minimum, [40:38.34] you guys should have a Twitch stream [40:40.14] that's just like a 24-hour radio session. [40:42.86] Have you ever come across Twitch Plays Pokemon? [40:45.74] - No. [40:46.58] - Basically, everyone in the Twitch chat [40:48.14] can vote on the next action that the game takes. [40:51.62] And they kind of wired it up to a Nintendo emulator [40:54.26] and played Pokemon, like the whole game, [40:55.74] through the collaborative thing. [40:57.90] It sounds like it should be pretty easy for you guys [41:00.18] to do that, except for the chaos that may result. [41:03.50] But like, I mean, that's part of the fun. [41:06.06] - I agree 100%. [41:07.50] One of my, like, pet projects [41:09.94] is, what does it mean to have a collaborative concert? [41:12.86] Maybe where there is no artist and it's just the audience, [41:15.54] or maybe there is an artist, [41:16.62] but there's a lot of input from the audience. [41:18.74] You know, if you were gonna do that, [41:20.90] you would either need an audience full of musicians [41:23.74] or you would need an artist [41:24.94] who can really interpret the verbal cues [41:27.10] that an audience is giving, or non-verbal cues. [41:29.94] But if you can give everybody the means [41:32.50] to better articulate the sounds that are in their heads [41:35.74] toward the rest of the audience, [41:37.18] which is what generative AI basically lets you do, [41:40.42] you open up way more interesting ways [41:42.34] of having these experiences. [41:43.78] And so the collaborative concert [41:45.74] is like one of the things I'm most excited about. [41:47.70] I don't think it's coming tomorrow, [41:49.82] but we have a lot of ideas on what that can look like. [41:52.74] - Yeah, I feel like one stage [41:54.34] before the collaborative concert [41:56.30] is turning Suno into a continuous experience [42:00.26] rather than like a start-and-stop motion. [42:02.86] I don't know if that makes sense. [42:04.14] You know, as someone with like a casual interest in DJing, [42:06.50] like when do we see Suno DJs, right? [42:09.10] That can continuously segue into the next song, [42:11.62] the next song, the next song. [42:12.58] - I think soon. [42:13.42] - And then maybe you can turn it collaborative. [42:15.18] You think soon? [42:16.30] Okay, maybe it's part of your roadmap. [42:18.38] You teased a little bit your V3 model. [42:20.54] I saw the letters DPO in there. [42:22.30] Is that direct preference optimization? [42:23.98] - We are playing with all kinds of different ways [42:26.30] of making these models do the things [42:27.86] that we want them to do. [42:29.30] I don't want to talk too many specifics here, [42:32.06] but we have lots of different ways [42:33.70] of doing stuff like that.
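For readers who haven't met DPO: direct preference optimization trains directly on pairs of preferred and rejected generations, with no separate reward model. A hedged sketch of the standard DPO objective follows; whether or how Suno applies it to music is not something the episode confirms.

```python
import torch.nn.functional as F

# Standard DPO objective (Rafailov et al., 2023), sketched for clarity.
# Inputs are summed log-probabilities of whole generations under the
# trainable policy and under a frozen reference model.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, preferred
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, rejected
    # push the preferred generation's ratio above the rejected one's
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Thumbs-up/down pairs of the kind discussed next are exactly the raw material such an objective consumes.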
[42:35.54] - Yeah, I'm just wondering how you incorporate [42:37.86] user feedback, right? [42:39.14] Like you have the classic thumbs up and down buttons, [42:42.18] but there's so many dimensions to the music. [42:45.42] Like, you know, I didn't get into it, [42:46.78] but some of the voices sounded more metallic. [42:49.98] And sometimes that's on purpose. [42:51.70] Sometimes not. [42:52.90] Sometimes there are kind of weird pauses in there. [42:54.62] I could go in and annotate it if I really cared about it, [42:56.74] but I mean, I'm just listening, so I don't. [42:59.10] But there's a lot of opportunity. [43:02.22] - We are only scratching the surface of figuring out [43:05.10] how to do stuff like that. [43:07.34] And for example, the thumbs up and the thumbs down, [43:10.50] other things like shares, telemetry on plays, [43:13.34] all of these things are stuff that in the future, [43:15.42] I think we would be able to leverage [43:17.74] to make things amazing. [43:18.78] And then I imagine a future where you can have your own model [43:22.42] with your own preferences. [43:23.94] And the reason that's so cool is that you kind of have control [43:28.06] over it and you can teach it the way you want to. [43:30.86] And, you know, the thing that I would liken this to [43:33.62] is a music producer working with an artist, [43:35.94] giving feedback. [43:37.18] And like, this is now a self-contained experience [43:40.54] where you have an artist who is infinitely flexible, [43:43.14] who is able to respond to the weird feedback [43:45.18] that you might give it. [43:46.26] And so we don't have that yet. [43:48.38] Everybody's playing with the same model, [43:49.94] but there's no technological reason [43:52.02] why that can't happen in the future. [43:53.86] - Excellent. - Awesome. [43:55.02] We had a few more notes from random community tweets. [43:58.54] I don't know if there's any favorite fans of Suno [44:01.06] that you have or whatnot. [44:02.74] - DHH, obviously a notorious Twitter crowd inflamer, [44:07.74] I guess. [44:09.42] He tweeted about you guys. [44:10.58] I saw Blau as an investor. [44:12.66] I think Karpathy also tweeted something. [44:15.14] - "Return to Monkey." [44:16.70] - Yeah, yeah, yeah, "Return to Monkey," right. [44:18.86] - Is there a story behind that? [44:20.34] - No, he just made that song and it just speaks to him. [44:22.94] And I think this is exactly the thing [44:25.22] that we are trying to tap into, that you can think of it, [44:27.78] this is like a super, super, super micro genre of one person [44:31.70] who just really liked that song and made it and shared it. [44:34.02] And it does not speak to you the same way it speaks to him, [44:36.54] but that song really spoke to him. [44:37.94] And I think that's so beautiful. [44:40.06] And that's something that you're never gonna have an artist [44:42.42] be able to do for you. [44:43.58] And now you can do that for yourself. [44:45.50] And it's just a different form of experiencing music. [44:48.46] I think that's such a lovely use case. [44:50.58] - Any fun fan mail that you got from musicians, [44:53.66] or anybody that really was a funny story this year? [44:57.50] - We get a lot, and it's primarily positive. [44:59.98] And I think, on the whole, I would say people realize [45:02.54] that they are not experiencing music [45:05.22] in all of the ways that are possible, [45:06.62] and it does bring them joy. [45:08.14] I'll tell you something that is really heartwarming [45:09.90] is that we're fairly popular [45:12.30] in the blind and vision-impaired community. [45:15.30] And that makes us feel really good. [45:17.34] And I think, you know, very roughly, [45:19.42] without trying to speak for an entire community, [45:21.70] you have lots of people who are really into things [45:23.38] like Midjourney, and they get a lot of benefit and joy [45:27.14] and sometimes even therapy out of making images.
[45:29.82] And that is something that is not really accessible [45:31.86] to this fairly large community. [45:34.22] And what we've provided, [45:36.06] I don't think the analogy to Midjourney is perfect, [45:38.46] but what we've provided is a sonic experience [45:40.50] that is very similar and that speaks to this community. [45:43.10] And that is a community with the best ears, [45:45.90] the most exacting, the most attuned. [45:48.34] Yeah, that definitely makes us feel warm and fuzzy inside. [45:51.18] - Yeah, excellent. [45:52.22] Sounds like there's a lot of exciting stuff on your roadmap. [45:54.46] I'm very much looking forward to the infinite DJ mode, [45:57.46] 'cause then I can just kind of play that while I work. [45:59.34] I would love to get your overall takes, [46:01.42] like kind of zooming out from Suno itself, [46:04.22] just your overall takes on the music generation landscape. [46:06.26] Like what should people know? [46:07.54] You obviously have spent a lot more time on this than others. [46:10.46] So in my mind, you shout out VALL-E [46:12.26] and the other sort of Google-type work [46:14.62] in your README in Bark. [46:16.70] What should people know about what Google is doing, [46:19.26] what Meta is doing? Meta released Seamless recently, [46:22.62] and Audiobox. [46:23.86] How do you classify the world of audio generation, [46:25.82] like, you know, in the broader sort of research community? [46:28.46] - Mm-hmm. [46:29.42] I think people largely break things down [46:31.94] into three big categories, [46:33.22] which is music, speech, and sound effects. [46:35.42] There's some stuff that is crossover, [46:37.34] but I think that is largely how people think about this. [46:39.90] The old style of doing things still exists, [46:42.94] kind of single-purpose models [46:44.26] that are built to do a very specific thing, [46:46.14] instead of kind of the new foundation model approach. [46:49.02] I don't know how much longer that will last. [46:51.18] I don't have like tremendous visibility into, you know, [46:53.58] what happens in the big industrial research labs [46:56.10] before they publish. [46:57.54] Specifically for music, I would say [46:59.98] there's a few big categories that we see. [47:02.42] There is kind of license-free stock music. [47:05.78] So this is like, how do I get background music for [47:08.02] the B-roll footage for my YouTube video, [47:10.22] or for full feature production, or whatever it is. [47:13.82] And there's a bunch of companies in that space. [47:15.90] There's a lot of AI covers: [47:18.54] how do I cover different existing songs with AI? [47:21.78] And I think that's a space [47:22.94] that is particularly fraught with some legal stuff. [47:26.34] And we also just don't think [47:27.94] it's necessarily the future of music. [47:30.10] There is kind of net new songs, as a new way [47:33.26] to create net new music. [47:34.50] That is the corner that we like to focus on. [47:36.94] And I would say the last thing is much more geared [47:40.14] toward professional musicians, [47:41.42] which is basically AI tools for music production. [47:44.58] And you can think many of these will look like plugins [47:47.14] to your favorite DAW. [47:48.54] Some of them will look like the greatest stem splitter [47:52.50] that the market has ever seen. [47:54.10] The current state-of-the-art stem splitters [47:56.42] are all AI-based. [47:57.46] And so I think that is a market also [48:00.26] that has just a tremendous amount of room to grow.
[48:02.90] Somebody told me this recently: [48:04.10] if you actually think about it, music has evolved. [48:06.38] Recently, it's much more about things [48:09.02] that are sonically interesting at a very local level, [48:11.66] and much less about chord changes that are interesting. [48:14.90] And when you think about that, [48:15.98] that is something AI can definitely help you with: [48:18.38] making a lot of weird sounds. [48:19.58] And this is nothing new. [48:20.58] There was the theremin at some point, [48:22.22] where people put up an antenna and tried to do this. [48:24.82] So I think this is [48:26.10] a very natural extension of it. [48:28.06] So that's how we see it. [48:28.98] At least, there's a corner [48:30.26] that we think is particularly fulfilling, [48:32.62] particularly underserved, and particularly interesting, [48:35.70] and that's the one we plan to focus on. [48:37.14] - Awesome. [48:37.98] - Yeah, it's a great perspective. [48:39.10] - I know we covered a lot of things. [48:40.74] I think before we wrap, [48:42.22] you have written a blog post at Kensho [48:44.38] about Goodhart's Law and its impact in ML, [48:46.70] which is, you know, when you measure something, [48:49.42] the thing that you measure [48:50.70] is not a good metric anymore, [48:52.26] because people optimize for it. [48:53.98] Any thoughts on how that applies to LLMs [48:56.78] and benchmarks, and the world we're heading into today? [48:59.74] - Yeah, I mean, I think it's maybe even more apropos [49:02.10] than when I originally wrote that, [49:04.42] because we see so much noise [49:07.26] about, pick your favorite benchmark, this model [49:09.70] does slightly better than that model, [49:11.10] and then at the end of the day, actually, [49:14.90] there is no real-world difference between these things. [49:17.18] And it is really difficult to define what "real world" [49:20.58] means. I think to a certain extent [49:22.42] it's good to have these objective benchmarks, [49:24.26] it's good to have quantitative metrics, [49:26.26] but at the end of the day, [49:28.34] you need some acknowledgement [49:29.94] that you're not going to be able to capture everything. [49:31.82] And so at least at Suno, to the extent that we have [49:35.02] corporate values, and we don't, we're too small [49:37.18] to have corporate values written down, [49:38.38] something that we say a lot is "aesthetics matter": [49:41.06] the quantitative benchmarks [49:44.22] are never going to be the be-all and end-all [49:46.54] of everything that you care about. [49:49.66] And as flawed as these benchmarks are in text, [49:53.78] they're way worse in audio. [49:55.70] And so "aesthetics matter" is basically a statement [49:58.46] that at the end of the day, [50:00.26] what we are trying to do is bring music to people [50:02.74] that makes them feel a certain way, [50:04.58] and effectively, the only good judge of that is your ears. [50:08.02] And so you have to listen to it. [50:09.90] It is a good idea to try to make better objective [50:13.02] benchmarks, but you really have to not fall prey [50:15.74] to those things.
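The Goodhart's Law effect described here can be shown with a toy simulation (hypothetical numbers, not from the conversation): when you select the model with the best score on a noisy benchmark, the score you selected on systematically overstates that model's true quality.

```python
# A minimal sketch of Goodhart's Law in model selection.
# true_quality is what we actually care about; benchmark is a noisy proxy.
import numpy as np

rng = np.random.default_rng(0)
n_models = 50
true_quality = rng.normal(0.0, 1.0, n_models)
benchmark = true_quality + rng.normal(0.0, 1.0, n_models)

best = np.argmax(benchmark)  # optimize the proxy metric
print(f"benchmark score of winner: {benchmark[best]:.2f}")
print(f"true quality of winner:    {true_quality[best]:.2f}")
# Rerun with different seeds: on average the winner's benchmark score
# exceeds its true quality, i.e. the measure degrades once it is a target.
```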
[50:16.62] I can tell you, it's kind of another pet peeve of mine. [50:19.38] I always say economists make really good [50:23.06] machine learning engineers. [50:24.14] And it's because they are able to think about stuff [50:26.46] like Goodhart's Law and natural experiments, [50:28.86] things that people with machine learning [50:31.14] backgrounds, or people with physics backgrounds like me, [50:33.50] often forget to do. [50:34.62] And so, yeah, I'll tell you, at Kensho [50:37.38] we actually used to go to big econ conferences [50:39.78] sometimes to recruit, [50:41.06] and those were some of the best hires we ever made. [50:43.50] - Interesting. [50:44.34] Because there's a little bit of social science [50:46.66] in the human feedback. [50:48.62] - I think it's not only the human feedback. [50:50.94] I think you could think about this [50:52.58] just in general: you have these giant, [50:54.46] really powerful models that are so prone to overfitting, [50:57.46] that are so poorly understood, [50:59.22] that are so easy to steer in one direction or another, [51:01.42] not only from human feedback. [51:03.18] And your ability to think about these problems [51:06.22] from first principles, instead of getting down [51:08.26] into the weeds or only the math, [51:10.14] and to think intuitively about these problems, [51:12.06] is really, really important. [51:13.94] I'll give you just one of my favorite examples. [51:16.30] It's a little old at this point. [51:17.98] But if you guys remember SQuAD and SQuAD 2.0, [51:21.50] the question answering dataset. [51:22.82] - The Stanford Question Answering Dataset. [51:23.66] - Yeah, exactly. [51:24.66] And so, you know, on the SQuAD 1.0 benchmark, [51:28.14] eventually the machine learning models started to do [51:30.90] as well as a human can on this thing, [51:33.26] and it's like, oh, now what do we do? [51:35.82] And it takes somebody very clever to say, [51:38.82] well, actually, let's think about this for a second: [51:41.34] what if we presented the machine with questions [51:43.22] with no answer in the passage? [51:45.22] And it immediately opens a massive gap [51:47.62] between the human and the machine. [51:48.78] And I think it's first-principles thinking like that [51:52.22] that comes very naturally to social scientists [51:54.90] and does not come as naturally to people like me. [51:58.34] And so that's why I like to hang out [51:59.86] with people like that. [52:01.62] - Well, I'm sure you get plenty of them in Boston. [52:03.26] And as an econ major myself, you know, [52:06.06] it's very gratifying to hear that [52:07.62] we have the perspective to contribute. [52:09.86] - Oh, big time, big time. [52:11.10] I try to talk to economists as much as I can. [52:13.54] - Excellent, awesome guys. [52:15.02] Yeah, I think this was great. [52:16.34] We got music, [52:17.58] we got a discussion about generative models, [52:20.06] so we got the whole nine yards. [52:21.26] So thank you so much for coming on. [52:23.02] - I had great fun. [52:23.86] Thank you guys. [52:24.68] - Thanks. [52:26.00] (upbeat music)
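The SQuAD 2.0 idea discussed in the episode, asking a model questions whose answers are not in the passage, can be tried directly. A minimal sketch, assuming the Hugging Face transformers library and deepset/roberta-base-squad2 (one publicly available model fine-tuned on SQuAD 2.0; the example texts are made up):

```python
# Demonstrates SQuAD 2.0-style unanswerable questions with a QA pipeline.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = "Suno is a music generation startup based in Boston."

# Answerable: the answer span appears in the passage.
print(qa(question="What does Suno generate?", context=context))

# Unanswerable: with handle_impossible_answer=True, a SQuAD 2.0 model may
# abstain, returning an empty answer instead of guessing a span.
print(qa(question="Who is the CEO of Suno?", context=context,
         handle_impossible_answer=True))
```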