[by:whisper.cpp] [00:00.00] (upbeat music) [00:02.58] - Hey everyone, welcome to the Latent Space podcast. [00:08.24] This is Alessio, partner and CTO [00:10.24] in residence at Decibel Partners. [00:11.92] And I'm joined by my cohost, Swyx, [00:13.96] founder of Smol AI. [00:15.28] - Hey, and today we have in the studio [00:16.60] Soumith Chintala, welcome. [00:17.60] - Thanks for having me. [00:18.56] - On one of your rare visits from New York, [00:20.84] where you live. [00:21.68] (laughs) [00:22.52] You got your start in computer vision at NYU [00:25.60] with Yann LeCun. [00:27.40] That was a very fortuitous start. [00:28.96] I was actually listening to your interview [00:31.12] on the Gradient podcast. [00:32.12] So if people want to know more [00:33.40] about the history of Soumith, the history of PyTorch, [00:35.72] they can go to that podcast. [00:37.00] We won't spend that much time there. [00:38.40] But I was just marveling at your luck, [00:40.56] or I don't know if it's your luck [00:42.32] or your drive to find AI early [00:45.36] and then find the right quality mentor, [00:47.84] because I guess Yann really sort of introduced you [00:50.32] to that world. [00:51.16] - Yeah, I think you're talking about extrinsic success, right? [00:54.60] Like a lot of people just have drive to do things [00:57.44] that they think are fun. [00:58.90] And a lot of those things might or might not be [01:01.72] extrinsically perceived as good and successful. [01:04.96] I think I just happened to like something [01:08.28] that is now one of the coolest things [01:11.00] in the world or whatever. [01:12.50] But you know, [01:14.36] the first thing I tried to become was a 3D VFX artist. [01:18.44] And I was really interested in doing that, [01:21.12] but I turned out to be very bad at it. [01:23.84] So I ended up not doing that further. 
[01:25.56] But even if I was good at that, whatever, [01:27.94] and I ended up going down that path, [01:30.12] I probably would have been equally happy. [01:32.44] It's just that maybe the perception of, oh, [01:35.08] is this person successful or not, might be different. [01:38.00] I think after a baseline, [01:39.84] your happiness is probably more correlated [01:42.44] with your intrinsic stuff. [01:44.16] - Yes, I think Dan Pink has this book, Drive, [01:47.76] that I often refer to, about the power of intrinsic motivation [01:51.08] versus extrinsic and how long extrinsic lasts. [01:53.48] It's not very long at all. [01:55.48] But anyway, now you are, you know, an investor in Runway. [01:57.68] So in a way, you're working on VFX. [02:00.66] - Yes, I mean, in a very convoluted way. [02:03.58] - It reminds me of Ed Catmull. [02:05.90] I don't know if you guys know, but [02:07.74] he actually tried to become an animator [02:09.78] in his early years and failed, [02:11.46] or didn't get accepted by Disney, [02:13.30] and then went and created Pixar, [02:14.50] which then got bought by Disney. [02:16.10] Created Toy Story. [02:18.38] So you joined Facebook in 2014 [02:20.58] and eventually became creator and maintainer of PyTorch. [02:24.02] And there's this long story [02:24.94] that you can refer to on The Gradient. [02:26.54] I think maybe people don't know that you're also involved [02:28.44] in more sort of hardware and cluster decisions at FAIR, [02:30.56] and we can dive into more details there [02:32.32] because we're all about hardware this month. [02:34.52] (laughing) [02:35.60] Yeah, and then finally, I don't know what else, [02:37.28] like what else should people know about you, [02:38.48] on the personal side or professional side? [02:40.48] - I think open source is definitely a big passion [02:43.80] of mine and probably forms a little bit of my identity [02:46.20] at this point. 
[02:47.16] I'm irrationally interested in open source. [02:50.40] I think open source has this fundamental way [02:53.00] to distribute opportunity in a way [02:56.08] that is very powerful. [02:57.98] Like I grew up in India. [02:59.86] I didn't have internet for a while, [03:01.58] and in college actually I didn't have internet [03:04.14] except for like GPRS or whatever. [03:06.70] Knowledge was very centralized, [03:10.18] but I saw that evolution of knowledge [03:12.10] slowly getting decentralized, [03:13.62] and that ended up helping me learn quicker [03:16.90] and faster for zero dollars. [03:19.46] And I think that was a strong reason [03:22.78] why I ended up where I am. [03:25.42] So the open source set of things [03:27.92] I always push, regardless of what I get paid for. [03:31.84] Like I think I would do that [03:33.60] as a passion project on the side. [03:35.52] - Yeah, that's wonderful. [03:36.36] We'll talk about the challenges as well [03:38.08] that open source has, open models versus closed models. [03:41.08] But maybe we want to [03:42.28] touch a little bit on PyTorch before we move on [03:44.08] to the sort of Meta AI in general. [03:45.72] - Yeah, we kind of touched on PyTorch in a lot of episodes. [03:48.64] So we had George Hotz from tinygrad. [03:51.32] He called PyTorch a CISC and tinygrad a RISC. [03:55.86] I would love to get your thoughts on PyTorch [03:58.34] design direction as far as, [04:00.66] I know you talk a lot about kind of having a happy path [04:04.14] to start with and then making complexity hidden away, [04:06.66] but then available to the end user. [04:08.58] One of the things that George mentioned is, [04:10.14] I think you have like 250 primitive operators in PyTorch. [04:13.70] I think tinygrad is four. 
[04:15.14] So how do you think about some of the learnings [04:17.66] that maybe he's gonna run into [04:19.02] that you already had in the past almost seven, eight years [04:22.28] of running PyTorch? [04:24.32] - Yeah, I think there's different models here, [04:26.40] but I think there's two different models [04:28.16] that people generally start with. [04:29.76] Either they go like, I have a grand vision [04:32.16] and I'm gonna build a giant system [04:34.08] that achieves this grand vision, [04:35.44] and maybe it's, you know, [04:37.04] super feature complete or whatever. [04:39.68] Or other people say they will get incrementally ambitious, [04:43.76] right? [04:44.60] And they say, oh, we'll start with something simple [04:45.96] and then we'll slowly layer out complexity [04:48.12] in a way that optimally applies Huffman coding or whatever. [04:51.38] Like, you know, where the density of the users is [04:54.86] and what they're using, [04:55.94] I would want to keep in the easy happy path, [04:58.70] and for the more niche advanced use cases, [05:01.80] I'll still want people to try them, [05:04.06] but they need to take additional frictional steps. [05:07.90] I think, just like we started with PyTorch, [05:10.94] George started with the incrementally ambitious thing. [05:14.18] I remember tinygrad used to be like, [05:17.02] it would be limited to a thousand lines of code, [05:19.12] and I think now it's like 5,000. [05:21.20] So I think there's no real magic [05:25.24] to why PyTorch has the kind of complexity it has. [05:27.84] I think it's probably partly necessitated [05:30.76] and partly because we built with the technology [05:33.44] available under us at that time. [05:36.32] PyTorch is like 190,000 lines of code [05:39.24] or something at this point. 
[05:40.52] I think if we had to rewrite it, [05:41.96] we would probably think about ways to rewrite it [05:45.44] in a vastly simplified way for sure. [05:49.02] But a lot of that complexity comes from the fact [05:51.90] that, in a very simple, explainable way, [05:54.88] you have memory hierarchies. [05:56.54] The CPU has like three levels of caches, [05:59.74] and then you have DRAM and SSD, [06:02.94] and then you have network. [06:04.30] Similarly, the GPU has several levels of memory, [06:07.86] and then you have different levels [06:09.54] of network hierarchies, NVLink plus [06:12.82] InfiniBand or RoCE or something like that, right? [06:16.08] And the way the flops are available on your hardware, [06:20.76] they are available in a certain way, [06:22.76] and your computation is in a certain way, [06:24.72] and you have to retrofit your computation [06:26.56] onto both the memory hierarchy [06:28.56] and the flops available. [06:30.36] When you're doing this, [06:31.64] it is actually a fairly hard mathematical problem [06:35.42] to do this setup, like to find the optimal thing. [06:39.40] And finding the optimal thing is like, [06:41.52] what is optimal depends on the input variables themselves. [06:44.70] So like, okay, what is the shape of your input tensors, [06:47.50] and what is the operation you're trying to do, [06:49.86] and various things like that. [06:52.06] Finding that optimal configuration [06:54.50] and writing it down in code [06:56.18] is not the same for every input configuration you have. [06:59.62] For example, just as the shape of the tensors changes, [07:03.18] let's say you have three input tensors [07:04.78] into a sparse dot product or something like that. [07:07.82] The shape of each of these input tensors [07:10.02] will vastly change how you optimally [07:13.30] place this operation onto the hardware [07:15.88] in a way that will get you maximal throughput. 
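The shape-dependence Soumith describes here is the core idea behind kernel autotuning. Below is a minimal pure-Python sketch of that idea, not PyTorch's actual machinery: both "kernels" and every name are made up for illustration. The best implementation is chosen by timing each candidate once per input shape, then caching the winner so the search cost is paid only once.

```python
import time
from functools import lru_cache

# Two hypothetical "kernels" for the same dot-product op. Which one is
# fastest depends on the input shape, just as described above.
def dot_loop(a, b):
    # Plain loop: low overhead, fine for small inputs.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_chunked(a, b, chunk=256):
    # Chunked accumulation: a stand-in for a tiled/vectorized kernel
    # that only pays off once the inputs get large enough.
    total = 0.0
    for i in range(0, len(a), chunk):
        total += sum(x * y for x, y in zip(a[i:i + chunk], b[i:i + chunk]))
    return total

CANDIDATES = [dot_loop, dot_chunked]

@lru_cache(maxsize=None)
def best_kernel_for(length):
    """Autotune: time every candidate on a representative input of this
    shape and cache the winner, so the search cost is paid once per shape."""
    a = [1.0] * length
    b = [2.0] * length
    timings = []
    for kernel in CANDIDATES:
        start = time.perf_counter()
        kernel(a, b)
        timings.append((time.perf_counter() - start, kernel))
    return min(timings, key=lambda t: t[0])[1]

def dot(a, b):
    # Dispatch to whichever kernel won for this input shape.
    return best_kernel_for(len(a))(a, b)
```

Real systems face a much larger search space (tile sizes, thread blocks, memory layouts), which is why the "hundreds of configurations per operator" Soumith mentions next are needed, but the dispatch-on-shape structure is the same.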
[07:18.78] So a lot of our complexity comes from writing out [07:22.94] hundreds of configurations [07:25.04] for each single PyTorch operator, [07:27.30] templatizing these things, and symbolically [07:30.02] generating the final CUDA code or CPU code. [07:34.62] There's no way to avoid it, [07:35.62] because mathematically we haven't found symbolic ways [07:39.32] to do this that also keep compile time near zero. [07:43.70] You can write a very simple framework, [07:45.70] but then you also should be willing to eat [07:48.10] the long compile time [07:49.58] of searching for that optimal performance at runtime. [07:52.78] But that's the trade-off. [07:54.58] I don't think George's vision is achievable [07:57.40] unless we have great breakthroughs, [08:00.30] or he should be thinking about a narrower problem, [08:03.10] such as, I'm only gonna make this [08:04.92] work for self-driving car contexts, [08:07.62] or I'm only gonna make this work [08:09.62] for LLM transformers of the Llama style. [08:13.88] If you start narrowing the problem down, [08:16.04] you can make a vastly simpler framework. [08:19.60] But if you don't, if you need the generality [08:22.16] to power all of the AI research that is happening, [08:24.84] and keep zero compile time, [08:26.88] and you know, all these other factors, [08:28.72] I think it's not easy to avoid the complexity. [08:32.88] - That's interesting, and we kind of touched on this [08:34.72] with Chris Lattner when he was on the podcast. [08:37.16] If you think about frameworks, they have the model target. [08:40.54] They have the hardware target. [08:41.86] They have different things to think about. [08:43.42] He mentioned when he was at Google, [08:45.18] TensorFlow was trying to be optimized to make TPUs go brr, [08:48.54] you know, and go as fast. 
[08:50.02] I think George is trying to make [08:51.70] especially the AMD stack be better than ROCm. [08:54.54] How come PyTorch has been such a Switzerland [08:57.36] versus just making Meta hardware go brr? [09:01.66] - First, Meta is not in the business of selling hardware. [09:04.98] Meta is not in the business of cloud compute. [09:07.60] The way Meta thinks about funding PyTorch [09:09.84] is we're funding it because it's net good for Meta [09:14.16] to fund PyTorch, because PyTorch has become a standard [09:17.40] and a big open source project. [09:19.36] And generally it gives us a timeline edge. [09:22.24] It gives us various leverage [09:24.56] and all that within our own work. [09:27.00] So why is PyTorch more of a Switzerland [09:29.36] rather than being opinionated? [09:30.56] I think the way we think about it [09:32.04] is not in terms of Switzerland or not. [09:34.26] We actually articulate to all hardware vendors [09:37.96] and software vendors and everyone who comes to us being like, [09:40.92] we want to build a backend in core [09:42.76] for PyTorch and ship it by default. [09:44.40] And we just only look at our user side of things. [09:49.20] Like if users are using a particular piece of hardware, [09:52.88] then we want to support it. [09:54.28] We very much don't want to kingmake [09:56.96] the hardware side of things. [09:58.44] So as the MacBooks have GPUs, [10:02.24] and as that stuff started getting increasingly interesting, [10:05.58] we pushed Apple to put some engineers on it [10:08.94] and work on the MPS support, [10:10.46] and we spent significant time [10:12.02] from Meta-funded engineers on that as well, [10:14.34] because a lot of people are using the Apple GPUs [10:18.54] and there's demand. [10:19.46] So we kind of mostly look at it from the demand side. [10:22.46] We never look at it from like, [10:24.54] oh, which hardware should we start taking opinions on? 
[10:27.66] - Is there a future in which, [10:29.02] because Mojo, or Modular's Mojo, is kind of a superset of Python, [10:32.06] is there a future in which PyTorch might use Mojo features [10:35.84] optionally? [10:36.68] - I think it depends on how well integrated it is [10:39.84] into the Python ecosystem. [10:42.32] So if Mojo is like a pip install, [10:44.24] and it's readily available, and users feel like [10:48.76] they can use Mojo so smoothly within their workflows [10:53.12] in a way that just is low friction, [10:56.44] we would definitely look into that. [10:57.92] Like in the same way PyTorch now depends on Triton, [11:00.88] like OpenAI's Triton. [11:02.66] And we never had a conversation that was like, [11:05.42] huh, that's like a dependency. [11:07.34] Should we just build a Triton of our own, [11:10.30] or should we use Triton? [11:12.22] Those conversations [11:14.82] don't really come up for us. [11:15.98] The conversations are more like, [11:17.54] well, does Triton have like 10,000 dependencies, [11:20.10] and is it hard to install? [11:21.34] We almost don't look at these things [11:23.94] from a strategic leverage point of view. [11:26.14] We look at these things from a user experience point [11:28.90] of view, like, is it easy to install? [11:31.08] Is it smoothly integrated? [11:32.56] And does it give enough benefits for us [11:34.40] to start depending on it? [11:35.52] If so, yeah, we should consider it. [11:36.52] That's how we think about it. [11:37.36] - You're inclusive by default, [11:38.84] as long as it meets the minimum bar of, yeah. [11:41.40] But maybe I phrased it wrongly. [11:43.24] Maybe it's more like, okay, [11:44.32] what problems would you look to solve [11:46.64] that you have right now? [11:48.08] - I think it depends on what problems Mojo will be useful at. [11:51.76] - Mainly a performance pitch. [11:53.44] Some amount of cross compiling pitch. 
[11:55.88] - Yeah, I think the performance pitch for Mojo [11:58.24] was like, we're gonna be performant [12:00.58] even if you have a lot of custom stuff. [12:04.02] Like you can write arbitrary custom things, [12:06.38] and we will be performant. [12:07.94] And that value proposition is not clear to us [12:12.82] from the PyTorch side to consider it for PyTorch. [12:16.02] So PyTorch, it's actually not 250 operators, [12:18.62] it's like a thousand operators. [12:19.86] PyTorch exposes about a thousand operators, [12:21.86] and people kind of write their ideas [12:23.92] in the thousand operators of PyTorch. [12:26.46] Mojo is like, well, maybe it's okay [12:29.40] to completely sidestep those thousand operators [12:32.20] of PyTorch and just write it in a more natural form. [12:35.12] Just write raw Python, like write for loops [12:37.72] or whatever, right? [12:38.88] So from the consideration of how do we intersect PyTorch [12:43.88] with Mojo, I can see one use case where [12:48.12] you have custom stuff for some parts of your program, [12:52.36] but mostly it's PyTorch. [12:54.00] And so we can probably figure out how to [12:55.96] make it easier for, say, torch.compile to smoothly [13:00.70] also consume Mojo subgraphs. [13:03.62] And, you know, the interoperability [13:05.50] being actually usable, that I think is valuable. [13:09.40] But Mojo as a fundamental front end [13:11.94] would be replacing PyTorch, not augmenting PyTorch. [13:16.06] So in that sense, I don't see a synergy [13:18.10] in more deeply integrating Mojo. [13:21.70] - So call out to Mojo whenever they have written [13:24.46] something in Mojo and there's some performance [13:27.16] - Yeah. [13:28.00] - related thing going on. [13:29.12] And then since you mentioned Apple, [13:30.36] what should people think of PyTorch versus MLX? [13:32.40] - I mean, MLX is early. 
[13:34.36] And I know the folks well, Awni used to work at FAIR, [13:39.36] and I used to chat with him all the time. [13:42.92] He used to be based out of New York as well. [13:45.34] The way I think about MLX is [13:48.32] that MLX is specialized for Apple right now. [13:52.32] It has a happy path because its product is defined [13:54.78] in a narrow way. [13:57.00] At some point, MLX either says, [14:00.58] we will only be supporting Apple, [14:04.14] and we will just focus on enabling, you know, [14:07.62] this is a framework if you use your MacBook, [14:09.70] but once you go server side or whatever, [14:12.10] that's not my problem and I don't care. [14:14.06] Or MLX enters the server-side [14:17.82] set of things as well. [14:18.94] Like one of these two things will happen, right? [14:21.38] If the first thing happens, [14:22.46] MLX's overall addressable market will be small, [14:26.00] but it'll probably do well within that addressable market. [14:29.44] If it enters the second phase, [14:31.28] they're going to run into all the same complexities [14:33.24] that we have to deal with. [14:34.88] They will not have any magic wand, [14:36.84] and they will have more complex work to do. [14:41.84] They probably wouldn't be able to move as fast. [14:45.54] - Having to deal with distributed compute. [14:47.74] - Distributed, NVIDIA and AMD GPUs, [14:50.88] just having a generalization [14:52.88] of the concept of a backend, [14:55.14] how they treat compilation and its overheads right now. [14:59.30] They deeply assume the whole MPS graph thing. [15:02.30] So they need to think about all these additional things [15:06.46] if they end up expanding onto the server side, [15:09.30] and they'll probably build something like PyTorch as well, [15:13.02] right? [15:13.86] Like eventually that's where it will end. [15:15.84] And I think there they will kind of [15:18.38] fail on the lack of differentiation. 
[15:20.66] Like it wouldn't be obvious to people [15:22.52] why they would want to use it. [15:24.76] - I mean, there are some cloud companies offering [15:26.92] M1 and M2 chips on servers. [15:28.80] I feel like it might be interesting for Apple [15:30.84] to pursue that market, but it's not their core. [15:33.28] - Yeah, I mean, if Apple can figure out [15:35.72] their interconnect story, maybe, [15:37.80] then it can become a thing. [15:39.96] - Honestly, that's more interesting than the cars. [15:41.88] - Yes, I think, [15:44.88] I mean, the moat that NVIDIA has right now, I feel like, [15:47.28] is that they have the interconnect that no one else has. [15:50.86] Like AMD GPUs are pretty good. [15:52.62] I'm sure there's various silicon that is not bad at all. [15:56.40] But the interconnect, like NVLink, is uniquely awesome. [16:00.36] I'm sure the other hardware providers are working on it. [16:03.60] - I feel like when you say it's uniquely awesome, [16:05.46] you have some appreciation of it that the rest of us don't. [16:07.66] I mean, the rest of us just like, [16:08.96] you know, we hear marketing lines, [16:10.12] but what do you mean when you say NVIDIA is very good [16:12.78] in networking? [16:13.62] Obviously they made the acquisition maybe like 15 years ago. [16:15.66] - It's like the bandwidth it offers [16:18.42] and the latency it offers. [16:19.98] I mean, TPUs also have a good interconnect, [16:22.98] but you can't buy them. [16:24.38] So you have to go to Google to use it. [16:27.66] - Who are some of the other FAIR PyTorch alumni [16:30.70] that are building cool companies? [16:31.90] I know you have Fireworks AI, Lightning AI, Lepton, [16:35.84] and Yangqing, who you knew since college [16:39.06] when he was building Caffe? [16:40.78] - Yeah, so Yangqing and I used to be framework rivals, [16:44.46] like Caffe and Torch. 
[16:47.10] I mean, we were all a very small close-knit community [16:49.86] back then, Caffe, Torch, Theano, Chainer, Keras, [16:54.86] various frameworks. [16:57.76] I mean, it used to be more like 20 frameworks. [17:00.76] I can't remember all the names. [17:02.46] CCV by Liu Liu, who is also based out of SF. [17:06.90] And actually, you know, one of the ways [17:09.66] it was interesting is, you went into the framework [17:12.62] guts and saw if someone wrote their own convolution kernel, [17:16.78] or they were just copying someone else's. [17:20.30] There were like four or five convolution kernels [17:22.58] that were unique and interesting. [17:25.50] There's one from this guy out of Russia. [17:28.22] I forgot the name. [17:29.92] But I remembered who was awesome enough [17:34.50] to have written their own kernel. [17:37.90] And at some point there, I built out these benchmarks [17:41.66] called convnet-benchmarks. [17:43.80] They were just benchmarking all the convolution kernels [17:46.62] that were available at that time. [17:49.14] And it hilariously became big enough that at that time [17:53.10] AI was getting important, but not important enough [17:57.10] that industrial-strength players came in to do this [18:01.00] kind of benchmarking standardization, [18:02.74] like we have MLPerf today. [18:04.82] So a lot of the startups were using convnet-benchmarks [18:09.38] in their pitch decks as like, oh, you know, [18:12.78] on convnet-benchmarks, this is how we fare, [18:15.62] so you should fund us. [18:17.18] I remember Nervana actually was at the top of the pack [18:19.82] because Scott Gray wrote amazingly fast [18:23.70] convolution kernels at that time. [18:26.02] Very interesting times. 
[18:27.46] But to answer your question, Alessio, [18:30.26] I think mainly Lepton and Fireworks [18:34.30] are the two most obvious ones, [18:37.18] but I'm sure the fingerprints are a lot wider. [18:41.22] There are just people who worked within the PyTorch [18:45.74] and Caffe2 cohort of things and now end up at various other places. [18:50.34] - I think, both as an investor and as people looking [18:55.26] to build on top of their services, [18:57.32] it's an unchartable slash I-don't-know-what-I-don't-know [19:00.78] pitch, because I've met Yangqing and I've met-- [19:04.14] - Lin Qiao. [19:04.98] - Yeah, I've met these folks, [19:06.94] and they're like, we were deep in the PyTorch ecosystem [19:09.82] and we served billions of inferences a day [19:12.02] or whatever at Facebook, and now we can do it for you. [19:14.86] And I'm like, okay, that's great. [19:17.02] What should I be wary of or cautious of [19:19.34] when these things happen? [19:20.54] Because I'm like, obviously this experience [19:23.22] is extremely powerful and valuable. [19:25.26] I just don't know what I don't know. [19:26.98] Like what should people know about [19:28.82] these sort of new inference-as-a-service companies? [19:32.14] - I think at that point you would be investing in them [19:35.06] for their expertise of one kind. [19:38.54] So if they've been at a large company [19:41.98] but they've been doing amazing work, [19:43.62] you would be thinking about it as like, okay, [19:45.38] what these people bring to the table [19:47.42] is that they're really good at GPU programming [19:51.14] or understanding the complexity of serving models [19:55.78] once it hits a certain scale, you know, [19:58.62] various expertise from the infra [20:01.98] and AI and GPUs point of view. 
[20:04.86] What you would obviously want to figure out [20:08.06] is whether their understanding [20:11.06] of the external markets is clear, [20:12.98] whether they know and understand [20:15.50] how to think about running a business, [20:18.50] like understanding how to be disciplined [20:20.82] about making money, or various things like that. [20:23.86] - Maybe I'll put it, it's actually, [20:25.82] I will de-emphasize the investing bit [20:27.38] and just more as a potential customer. [20:29.50] - Oh, okay. [20:30.34] - Like it's more like, okay, you know, [20:31.86] PyTorch gods, of course, like, what else should I know? [20:36.86] - I mean, I would not care about who's building something [20:40.78] if I'm trying to be a customer. [20:42.22] I would care about whether-- [20:44.06] - The benchmarks. [20:44.90] - Yeah, whether I'd use it, and its usability [20:48.42] and reliability and speed, right? [20:51.06] - Quality as well. [20:51.98] - Yeah, if someone from some random unknown place [20:56.86] came to me and said, "Use our stuff, it's great," [21:00.22] and I have the bandwidth, [21:02.10] I probably will give it a shot, [21:03.70] and if it turns out to be great, I'll just use it. [21:07.22] - Okay, great. [21:08.06] And then maybe one more thing about benchmarks, [21:09.90] since we already brought it up [21:10.94] and you brought up convnet-benchmarks. [21:12.82] There was some recent drama around Anyscale. [21:15.74] Anyscale released their own benchmarks, [21:17.86] and obviously they look great on their own benchmarks, [21:19.70] but maybe didn't give the others a fair shake. [21:23.06] I feel like there are two lines of criticism. [21:25.10] One, which is they didn't test apples for apples [21:28.42] on the kind of endpoints that the other providers [21:31.78] that they are competitors with offer [21:33.58] on their benchmarks, and, you know, [21:34.78] that is a due diligence baseline. 
[21:36.30] And then the second would be more just, like, [21:38.06] optimizing for the right thing. [21:39.50] You had some commentary on it, [21:40.70] I'll just kind of let you riff. [21:41.94] - Yeah, I mean, in summary, [21:44.38] basically my criticism of that was, [21:48.74] Anyscale built these benchmarks for end users [21:52.70] to just understand what they should pick, right? [21:55.22] And that's a very good thing to do. [21:57.82] I think what they didn't do a good job of [21:59.98] is give that end user a full understanding [22:04.02] of what they should pick. [22:04.98] They just gave them a very narrow slice [22:08.26] of understanding. [22:09.10] I think they just gave them latency numbers, [22:11.82] and that's not sufficient, right? [22:13.62] You need to understand your total cost of ownership [22:17.62] at some reasonable scale. [22:19.02] Not like, oh, one API call is one cent, [22:21.86] but a thousand API calls are 10 cents. [22:25.38] Or like, you know, people can mis-price [22:27.10] to cheat on those benchmarks. [22:28.70] So you want to understand, okay, [22:31.02] how much is it going to cost me [22:32.70] if I actually subscribe to you [22:34.70] and do like a million API calls a month or something? [22:38.14] And then you want to understand the latency [22:40.82] and reliability, not just from one call you made, [22:44.50] but an aggregate of calls you made [22:47.54] over various times of the day [22:49.86] and times of the week. [22:51.54] And the nature of the workloads. [22:53.70] Is it just some generic single paragraph [22:57.18] that you're sending that is cacheable? [22:59.22] Or is it testing real-world workloads? [23:03.50] I think that kind of rigor in presenting [23:06.94] that benchmark wasn't there. [23:08.34] It was a much more narrow sliver [23:10.90] of what should have been a good benchmark. 
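The kind of rigor Soumith is asking for is straightforward to sketch. Here's a hypothetical aggregator (the field names and flat per-call pricing model are illustrative, not Anyscale's or any vendor's schema) that turns many sampled calls into what an end user actually needs: median and tail latency, reliability, and projected monthly cost.

```python
def percentile(sorted_vals, p):
    # Nearest-rank percentile over an already-sorted list.
    idx = min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[idx]

def summarize_benchmark(samples, price_per_1k_calls, calls_per_month):
    """Aggregate sampled API calls (each a dict with 'latency' in seconds
    and an 'ok' success flag, ideally spread over times of day and week)
    into the numbers a buyer should compare -- not one cherry-picked call."""
    latencies = sorted(s["latency"] for s in samples)
    return {
        "p50_latency": percentile(latencies, 50),
        "p95_latency": percentile(latencies, 95),
        "success_rate": sum(1 for s in samples if s["ok"]) / len(samples),
        # Total cost of ownership at the user's actual volume, not per call.
        "monthly_cost": price_per_1k_calls * calls_per_month / 1000,
    }
```

A fuller version would also vary the workload mix (cacheable vs. realistic prompts) and tiered pricing, which are exactly the gaps called out above.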
[23:14.22] That was my main criticism. [23:16.02] And I'm pretty sure if, before they released it, [23:19.46] they showed it to their other stakeholders [23:23.54] who would be caring about this benchmark [23:26.10] because they are present in it, [23:27.98] they would have easily just pointed out these gaps. [23:31.22] And I think they didn't do that, [23:32.58] they just released it. [23:35.18] So I think those were the two main criticisms. [23:37.98] I think they were fair and Robert took it well. [23:40.22] - He took it very well, yeah. [23:41.50] I will have him on at some point and we'll discuss it. [23:44.22] But I think it's important, [23:45.18] the market maturing enough [23:47.06] that people start caring and competing [23:48.66] on these kinds of things [23:50.02] means that we need to establish what best practice is, [23:52.90] because otherwise everyone's gonna play dirty. [23:54.94] - Yeah, absolutely. [23:56.78] My view of the LLM inference market in general [23:59.54] is that it's like the laundromat model. [24:02.98] Like the margins are gonna get driven down towards [24:06.78] the bare minimum. [24:07.82] It's gonna be all kinds of arbitrage [24:10.30] between how much you can get the hardware for, [24:12.30] and then how much you sell the API for, [24:14.58] and how much latency your customers [24:16.86] are willing to let go of. [24:18.26] You need to figure out how to squeeze your margins. [24:20.74] Like what is your unique thing here? [24:22.66] I think Together and Fireworks [24:24.94] and all these people are trying to build some faster [24:28.10] CUDA kernels and faster hardware kernels in general. [24:33.10] But those moats only last for a month or two. [24:35.54] These ideas quickly propagate. [24:38.06] - Even if they're not published? [24:39.26] - Even if they're not published, the idea space is small. 
[24:44.18] So even if they're not published, [24:46.74] the discovery rate is gonna be pretty high. [24:49.38] It's not like we're talking about a combinatorial thing [24:52.02] that is really large. [24:53.70] You're talking about Llama-style LLM models, [24:56.94] and we're gonna beat those to death [24:59.10] on like a few different hardware SKUs, right? [25:02.54] It's not even like we have a huge diversity [25:05.22] of hardware you're going to aim to run it on. [25:07.86] Now when you have such a narrow problem [25:09.74] and you have a lot of people working on it, [25:11.38] the rate at which these ideas are gonna get figured out [25:14.42] is gonna be pretty high. [25:15.26] - Is it like a standard bag of tricks? [25:16.74] Like the standard one that I know of is, [25:18.74] you know, fusing operators. [25:20.34] - Yeah, it's a standard bag of tricks on figuring out [25:23.34] how to improve your memory bandwidth and all that. [25:26.98] - Interesting. [25:28.62] Any ideas or sets of things that are not being beaten [25:31.46] to death that people should be paying more attention to? [25:34.10] - One thing I was like, you know, [25:35.02] you have a thousand operators, right? [25:36.22] Like what's the most interesting usage of PyTorch [25:38.46] that you're seeing, maybe outside of this little bubble? [25:41.38] - So PyTorch, it's very interesting and scary [25:44.26] at the same time, but basically it's used [25:47.42] in a lot of exotic ways, like from the ML angle. [25:50.26] Like, okay, what kind of models are being built? [25:53.18] And you get all the way from state space models [25:56.94] and all these things to stuff like [26:00.14] ODE-based differentiable models, [26:02.70] like Neural ODEs and stuff like that. [26:05.26] I think there's one set of interestingness factor [26:08.98] from the ML side of things. 
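To make the "fusing operators" trick mentioned above concrete: here's a toy pure-Python model (counters stand in for DRAM traffic; no real GPU code involved) of why fusion saves memory bandwidth. Unfused `relu(a * b + c)` materializes two intermediate arrays and re-reads them from memory, while the fused version touches each element once.

```python
# Toy cost model: every elementwise op "reads" each input array from
# memory and "writes" its output, tallied in a traffic counter dict.

def elementwise(op, *arrays, traffic):
    traffic["reads"] += len(arrays) * len(arrays[0])
    out = [op(*vals) for vals in zip(*arrays)]
    traffic["writes"] += len(out)
    return out

def unfused(a, b, c, traffic):
    # Three separate kernels: two intermediate arrays hit memory.
    t1 = elementwise(lambda x, y: x * y, a, b, traffic=traffic)
    t2 = elementwise(lambda x, y: x + y, t1, c, traffic=traffic)
    return elementwise(lambda x: max(x, 0.0), t2, traffic=traffic)

def fused(a, b, c, traffic):
    # One pass: read a, b, c once each, write the result once,
    # keeping intermediates in "registers" (local variables).
    traffic["reads"] += 3 * len(a)
    out = [max(x * y + z, 0.0) for x, y, z in zip(a, b, c)]
    traffic["writes"] += len(out)
    return out
```

For n elements the unfused chain does 5n reads and 3n writes versus 3n reads and n writes fused. Since elementwise chains are memory-bandwidth bound on real hardware, this is one of the first tricks every inference stack reaches for.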
[26:11.18] And then there's the other set of interestingness factors [26:13.58] from the applications point of view. [26:15.26] It's used in Mars rover simulations [26:19.10] to drug discovery to Tesla cars. [26:23.06] And there's a huge diversity of like applications [26:26.02] in which it is used. [26:27.10] So in terms of the most interesting application side of things, [26:30.14] I think I'm scared at how many interesting things [26:33.26] that are also very critical and really important [26:35.66] it is used in. I think the scariest was when [26:39.94] I went to visit CERN at some point [26:42.94] and they said they were using PyTorch [26:45.70] and they were using GANs at the same time [26:48.78] for like particle physics research. [26:50.82] And I was scared more about the fact [26:52.70] that they were using GANs than that they were using PyTorch. [26:55.14] Because at that time I was like a researcher [26:57.26] focusing on GANs. [26:58.90] But the diversity is probably the most interesting. [27:01.42] How many different things it is being used in. [27:04.78] I think that's the most interesting to me [27:06.74] from the applications perspective. [27:08.78] From the models perspective, I think, of the ones I've seen, [27:11.46] like the really interesting ones to me [27:13.86] are where we're starting to combine search [27:18.70] and symbolic stuff with differentiable models. [27:22.66] Like the whole AlphaGo style models is one example. [27:26.02] And then I think we're attempting to do it for LLMs as well, [27:29.10] with like various reward models and then search. [27:32.06] I mean, I don't think PyTorch is being used in this, [27:34.74] but like the whole AlphaGeometry thing was interesting [27:37.94] because again, it's an example of combining [27:39.78] symbolic models with the gradient-based ones. [27:42.90] But there is stuff like AlphaGeometry [27:45.30] that PyTorch is used in. [27:46.62] Especially when you intersect biology and chemistry with ML.
[27:51.38] Like in those areas, you want stronger guarantees [27:55.30] on the output. [27:57.62] So yeah, maybe from the ML side, [28:00.50] those things to me are very interesting right now. [28:03.58] - Yeah. [28:04.58] People are very excited about the AlphaGeometry thing. [28:06.54] And it's kind of like, for me, it's theoretical. [28:09.18] It's great. [28:10.02] You can solve some Olympiad questions. [28:11.54] I'm not sure how to make that bridge over [28:13.50] into the real world applications, [28:15.70] but I'm sure people smarter than me will figure it out. [28:17.98] - Let me give you an example of it. [28:20.46] You know how like the whole thing about synthetic data [28:24.54] being the next rage in LLMs is a thing? [28:27.14] - It already is a rage. [28:28.30] - Which I think is fairly misplaced [28:31.06] in how people perceive it. [28:32.78] People think synthetic data is some kind of magic wand [28:35.54] that you wave and it's going to be amazing. [28:38.86] Synthetic data is useful in neural networks right now [28:43.86] because we as humans have figured out a bunch [28:49.18] of symbolic models of the world [28:52.82] or made up certain symbolic models [28:54.94] because of human innate biases. [28:57.06] So we've figured out how to ground particle physics [29:01.06] in a 30-parameter model. [29:04.02] And it's just very hard to compute. [29:07.58] As in, it takes a lot of flops to compute, [29:09.70] but it only has 30 parameters or so. [29:12.30] I mean, I'm not a physics expert, [29:13.70] but like it's a very low-rank model. [29:16.82] We built mathematics as a field [29:20.62] that basically is very low-rank. [29:23.30] Language, like a deep understanding of language, [29:26.14] like the whole syntactic parse trees [29:27.94] and like just understanding how language [29:30.94] can be broken down into a formal symbolism, [29:34.26] is something that we figured out.
[29:36.06] So we basically as humans have accumulated all this knowledge [29:39.46] on these subjects, [29:41.46] either synthetically, and we created those subjects [29:44.74] in our heads, or like we've grounded some real-world phenomenon [29:48.74] into a set of symbols, [29:50.82] but we haven't figured out how to teach neural networks [29:55.42] those symbolic world models directly. [29:58.90] The only way we have to teach them is generating a bunch [30:02.62] of inputs and outputs and gradient-descending over them. [30:05.14] So in areas where we have the symbolic models [30:08.58] and we need to teach all the knowledge we have [30:11.82] that is better encoded in the symbolic models, [30:14.62] what we're doing is we're generating a bunch of synthetic data, [30:18.34] a bunch of input/output pairs, [30:20.58] and then giving that to the neural network [30:22.42] and asking it to learn the same thing [30:24.58] that we already have a better low-rank model of, [30:28.42] via gradient descent, in a much more overparameterized way. [30:32.54] Outside of this, like where we don't have good symbolic models, [30:35.78] like synthetic data obviously doesn't make any sense. [30:38.90] So synthetic data is not a magic wand [30:40.90] that will work in every case or whatever. [30:43.70] It's just for what we as humans already have good symbolic models of. [30:47.50] We need to impart that knowledge to neural networks [30:51.06] and we figured out that synthetic data is a vehicle [30:54.50] to impart this knowledge to them. [30:57.90] But people, maybe because they don't know enough [31:01.58] about synthetic data as a notion, [31:03.90] but they hear the next wave of data revolution is synthetic data, [31:08.18] they think it's some kind of magic [31:09.78] where we just create a bunch of random data somehow. [31:13.74] They don't think about how, [31:15.38] and then they think that's just the revolution.
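The process described here, generating input/output pairs from a symbolic model we already have and gradient-descending an overparameterized learner onto them, can be sketched minimally. The linear "symbolic model" and every constant below are toy choices for illustration, not anything from the episode:

```python
import random

# A "symbolic world model" we already understand: y = 3x + 2 (toy stand-in
# for a low-rank, human-derived model like a physics formula).
def symbolic_model(x):
    return 3.0 * x + 2.0

# Step 1: generate synthetic input/output pairs from the symbolic model.
random.seed(0)
data = [(x, symbolic_model(x))
        for x in (random.uniform(-1, 1) for _ in range(200))]

# Step 2: teach a parametric model the same rule by gradient descent
# on squared error, one sample at a time.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x   # dL/dw for squared error
        b -= lr * err       # dL/db

# The learner recovers the rule it was never shown symbolically.
assert abs(w - 3.0) < 1e-3 and abs(b - 2.0) < 1e-3
```

In the domains described as lacking good symbolic models, there is no `symbolic_model` to sample from, which is exactly why synthetic data is not a universal magic wand.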
[31:17.82] And I think that's maybe a gap in understanding [31:20.70] most people have in this hype cycle. [31:23.22] - Yeah, well, it's a relatively new concept, so. [31:25.74] Oh, there's two more that I'll put in front of you [31:27.74] and then you can see how you respond. [31:29.62] One is, I have this joke that it's only synthetic data [31:34.18] if it's from the Mistral region of France, [31:36.50] otherwise it's just sparkling distillation, [31:38.30] which is what Nous Research is doing, [31:40.06] like they're distilling GPT-4 by creating synthetic data [31:42.86] from GPT-4 and creating mock textbooks inspired by Phi-2 [31:46.78] and then fine-tuning open source models like Llama. [31:50.50] And so I don't know, I mean, I think that's, [31:53.66] should we call that synthetic data? [31:54.62] Should we call it something else? [31:55.54] I don't know. [31:56.86] - Yeah, I mean, the outputs of LLMs, [32:00.18] are they synthetic data? [32:01.98] They probably are. [32:03.90] But I think it depends on the goal you have. [32:07.10] If your goal is like, you're creating synthetic data [32:10.46] with the goal of trying to distill GPT-4's superiority [32:14.70] into another model, I guess you can call it synthetic data, [32:18.10] but it also feels disingenuous because your goal is like, [32:22.38] I need to copy the behavior of GPT-4 and-- [32:25.66] - It's also not just behavior but the data set. [32:28.46] So I've often thought of this as data set washing. [32:31.30] Like you need one model at the top of the chain, [32:34.22] an unnamed French company that makes a model [32:37.86] that has all the data in it that we don't know [32:39.42] where it's from, but it's open source, hey, [32:40.66] and then we distill from that and it's great. [32:42.90] (laughing) [32:44.74] To be fair, they also use larger models as judges [32:48.10] for preference ranking, right? [32:49.22] So that is, I think, a very, very accepted use of synthetic data.
[32:53.30] - Correct, I think it's a very interesting time, [32:55.54] where we don't really have good social models [32:59.38] of what is acceptable, depending on how many bits [33:03.62] of information you use from someone else, right? [33:06.54] It's like, okay, you use like one bit, is that okay? [33:10.70] Yeah, that's accepted to be okay. [33:12.78] Okay, what about if you use like 20 bits, is that okay? [33:16.22] I don't know, what if you use like 200 bits? [33:19.50] Like I don't think we as a society have ever been [33:23.02] in this conundrum where we have to be like, [33:25.06] where is the boundary of copyright, [33:27.78] or where is the boundary of socially accepted [33:31.94] understanding of copying someone else? [33:35.22] Like we haven't been tested like this, mathematically, [33:37.62] before, in my opinion, so. [33:39.66] - Whether it's transformative use. [33:40.94] - Yes. [33:41.76] - So yeah, I think this New York Times v. OpenAI case [33:43.90] is gonna go to the Supreme Court and we'll have to decide it, [33:46.54] 'cause obviously we never had to deal with it before. [33:49.46] And then finally, for synthetic data, [33:51.10] the thing that I'm personally exploring [33:52.42] is solving this stark paradigm difference [33:54.66] between RAG and fine-tuning, [33:55.70] where you can kind of create synthetic data [33:57.50] off of your retrieved documents [33:59.50] and then fine-tune on that. That's kind of synthetic. [34:02.06] All you need is variation or diversity of samples [34:06.30] for you to fine-tune on. [34:07.34] And then you can fine-tune new knowledge into your model. [34:10.02] I don't know if you've seen that [34:10.98] as a direction for synthetic data. [34:13.42] - I think you're basically trying to, [34:16.14] what you're doing is you're saying, [34:17.82] well, language, I know how to parameterize language [34:20.98] to an extent.
[34:22.38] And I need to teach my model variations of this input data [34:27.38] so that it's resilient or invariant to language [34:31.42] uses of that data. [34:32.62] - Yeah, so it doesn't overfit on the record. [34:33.98] - So I think that's 100% synthetic, right? [34:36.66] You understand, like the key is like, [34:39.26] you create variations of your documents [34:41.62] and you know how to do that because you have a symbolic model, [34:44.42] or like some implicit symbolic model, of language. [34:48.86] - Do you think the issue with symbolic models [34:51.42] is just the architecture of the language models [34:55.02] that we're building? [34:56.02] I think like maybe the thing that people grasp is like [34:58.38] the inability of transformers to deal with numbers [35:01.46] because of the tokenizer. [35:03.10] Is it a fundamental issue there too? [35:05.16] And do you see alternative architectures [35:07.58] that will be better with symbolic understanding? [35:09.94] - I am not sure if it's a fundamental issue or not. [35:13.18] I think we just don't understand transformers enough. [35:16.30] I don't even mean transformers as an architecture. [35:18.62] I mean like the use of transformers today, [35:21.66] like combining the tokenizer and transformers [35:24.98] and the dynamics of training, [35:26.90] like when you show math-heavy questions versus not, [35:31.90] I don't have a good calibration [35:34.02] of whether I know the answer or not. [35:35.86] You know, there are common criticisms that are like, [35:38.38] well, you know, transformers will just fail at X. [35:41.70] But then when you scale them up to sufficient scale, [35:45.34] they actually don't fail at that X. [35:47.54] There's this entire subfield [35:49.70] where they're trying to figure out these answers, [35:51.34] called like the science of deep learning or something. [35:53.50] So we'll get to know more. [35:54.94] I don't know the answer.
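On the tokenizer point: a toy greedy longest-match tokenizer makes the intuition concrete. The vocabulary below is invented for illustration and does not correspond to any real model's merges; the point is only that two adjacent integers can end up with completely different token boundaries, which is one reason digit-level arithmetic is awkward for transformers:

```python
# Hypothetical subword vocabulary (made up for illustration).
VOCAB = {"12", "123", "45", "345", "6", "99"}

def greedy_tokenize(s):
    """Greedy longest-match tokenization, falling back to single characters."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest match first
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:  # no vocabulary entry matched: emit one raw character
            tokens.append(s[i])
            i += 1
    return tokens

# Two numbers that differ by one digit get split very differently:
assert greedy_tokenize("12345") == ["123", "45"]
assert greedy_tokenize("12346") == ["123", "4", "6"]
```

So "12345" is two tokens while "12346" is three; nearby numbers never share a consistent digit-aligned representation, and place value is invisible to the model.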
[35:56.62] - Let's touch a little bit on just Meta AI [36:00.34] and stuff that's going on there. [36:01.82] Maybe I don't know how deeply you're personally involved [36:04.14] in it, but you're our first guest from Meta AI, [36:06.26] which is really fantastic. [36:07.66] And Llama 1 was, you know, [36:09.46] you are such a believer in open source. [36:10.90] Llama 1 was more or less like the real breakthrough [36:13.70] in open source AI. [36:15.06] The most interesting thing for us [36:16.62] covering it on this podcast [36:18.38] was the death of Chinchilla, as people say. [36:21.46] Any interesting insights there around like the scaling laws [36:25.38] for open source models or smaller models [36:27.74] or whatever that design decision was [36:29.58] when you guys were doing it? [36:31.02] - So Llama 1 was Guillaume Lample and team. [36:35.46] There was OPT before, which I think I'm also very proud of, [36:39.78] because we bridged the gap in understanding [36:43.70] how complex it is to train these models to the world. [36:46.22] Like until then no one really published in gory detail. [36:50.50] - The logs. [36:51.34] - Yeah, like why is it complex? [36:53.34] And everyone says like, oh, it's complex, [36:55.70] but no one really talked about why it's complex. [37:00.30] I think OPT was cool. [37:01.98] We probably-- [37:02.82] - I met Susan and she's very, very outspoken. [37:04.70] - Yeah, we probably, I think, [37:07.94] didn't train it for long enough, right? [37:09.42] Like, you know, that's kind of obvious in retrospect. [37:12.66] - For a 175B model. [37:13.98] - Yeah. [37:14.82] - You trained it according to Chinchilla at the time or? [37:17.54] - I can't remember the details, [37:19.42] but I think it's a commonly held belief at this point [37:21.74] that like, well, if we trained OPT longer, [37:24.34] it would actually end up being better.
[37:26.90] Llama 1, I think was, yeah, Guillaume Lample [37:29.50] and team. Guillaume is fantastic [37:32.34] and went on to build Mistral. [37:34.30] I wasn't too involved in that set of things. [37:36.94] So I don't know what you're asking me, [37:39.78] which is like, well, like how did they think [37:41.78] about scaling laws and all of that? [37:43.54] Llama 2, I was more closely involved in. [37:47.70] I helped them a reasonable amount [37:50.58] with like their infrastructure needs and stuff. [37:54.14] And Llama 2, I think was more like, [37:57.46] let's get to the evolution. [37:59.70] At that point, we kind of understood [38:02.54] what we were missing from the industry's understanding [38:07.54] of LLMs, and we needed more data [38:12.22] and we needed to train the models for longer. [38:15.34] And we made, I think, a few tweaks to the architecture [38:18.86] and we scaled up more. [38:20.22] And like, that was Llama 2. [38:22.26] I think Llama 2, you can think of it as like, [38:24.26] after Guillaume left, the team kind of rebuilt their muscle [38:27.62] around Llama 2. [38:28.98] And Hugo, I think, who's the first author, is fantastic. [38:32.42] And I think he did play a reasonably big role [38:35.02] in Llama 1 as well. [38:35.86] And he overlaps between Llama 1 and 2. [38:37.94] So Llama 3, obviously, hopefully, will be awesome. [38:42.82] - Just one question on Llama 2 [38:44.10] and then we'll try and fish Llama 3 spoilers out of you. [38:48.38] In the Llama 2 paper, [38:49.50] the loss curves of the 34B and 70B parameter models, [38:52.90] they still seemed kind of steep. [38:54.82] I feel like they could go lower. [38:56.18] How, from an infrastructure level, [38:58.30] how do you allocate resources? [38:59.66] Like, could they have just gone longer [39:01.66] or were you just like, [39:02.50] hey, this is all the GPUs that we can burn [39:04.46] and let's just move on to Llama 3 [39:06.02] and then make that one better?
[39:07.70] - Instead of answering specifically [39:09.46] about that Llama 2 situation or whatever, [39:11.90] I'll tell you how we think about things. [39:14.94] Generally, Mark released some numbers, right? [39:18.94] - So let's cite those things again. [39:22.34] From memory, 600K GPUs. [39:24.42] - That is by the end of this year, [39:26.10] and 600K H100 equivalents. [39:29.42] With 350K H100s, including all of our other GPU [39:33.82] or accelerator stuff, [39:34.86] it would be 600K in aggregate capacity. [39:38.58] That's a lot of GPUs. [39:39.42] We'll talk about that separately. [39:40.74] But the way we think about it is [39:43.66] we have a train of models, right? [39:45.74] Llama 1, 2, 3, 4. [39:48.30] And we have a bunch of GPUs. [39:50.90] I don't think we're short of GPUs. [39:53.54] - Yeah, no, I wouldn't say so. [39:54.94] - Yeah, so it's all a matter of time. [39:56.90] I think time is the biggest bottleneck. [39:59.18] It's like, when do you stop training the previous one [40:01.94] and when do you start training the next one? [40:04.38] And how do you make those decisions? [40:06.66] The data, do you have net new data, [40:08.70] better, cleaner data for the next one, [40:10.86] in a way that it's not worth [40:12.38] like really focusing on the previous one? [40:14.98] It's just a standard iterative product. [40:17.62] You're like, when is the iPhone 1? [40:19.78] When do you start working on iPhone 2 versus iPhone 1? [40:23.18] Like so on, right? [40:24.40] So mostly the considerations are time and generation [40:29.12] rather than GPUs, in my opinion. [40:31.58] - So one other thing with the scaling laws, [40:33.74] like Chinchilla is like optimal to balance [40:36.30] training and inference costs. [40:37.78] I think at Meta scale, you would rather pay a lot more, [40:40.70] maybe, at training and then save on inference. [40:42.74] How do you think about that from an infrastructure perspective?
[40:45.58] I think in your tweet, you say you can try and guess [40:47.94] at like how we're using these GPUs. [40:50.34] Can you just give people a bit of understanding? [40:52.26] It's like, because I've already seen a lot of VCs say, [40:54.66] Llama 3 has been trained on 600,000 GPUs [40:56.78] and that's obviously not true, I'm sure. [40:58.82] How do you allocate between the research, [41:01.10] like FAIR, and the Llama training, [41:03.74] the inference on Instagram suggestions [41:06.48] that get me to scroll, like AI-generated stickers [41:09.10] on WhatsApp and all of that. [41:10.98] - Yeah, we haven't talked about any of this publicly, [41:13.90] but like as a broad stroke, it's like how we would allocate [41:18.06] resources of any other kind at any company. [41:21.94] You run like a VC portfolio, like how do you allocate [41:24.48] your investments between different companies or whatever. [41:26.82] You kind of make various trade-offs and you kind of decide, [41:29.38] should I invest in this project or this other project, [41:32.26] or how much should I invest in this project? [41:34.66] It's very much like a zero-sum set of trade-offs. [41:38.26] And it also comes into play, like how are your clusters [41:42.02] configured, like overall, like what you can fit [41:45.10] of what size in what cluster and so on. [41:47.50] So broadly, there's no magic sauce here. [41:51.06] Like, I mean, I think the details would add more spice [41:54.74] but also wouldn't add more understanding. [41:59.30] It's just gonna be like, oh, okay, I mean, this looks like [42:02.02] they just think about this as I would normally do. [42:05.12] - So even the GPU rich run through the same struggles [42:08.78] of having to decide where to allocate things. [42:11.10] - Yeah, I mean, like at some point, I forgot who said it, [42:14.02] but it's like you kind of fit your models [42:18.46] to the amount of compute you have.
[42:21.22] If you don't have enough compute, you figure out [42:23.22] how to make do with smaller models. [42:26.06] But like no one as of today, I think, would feel [42:29.94] like they have enough compute. [42:31.42] I don't think I've heard any company within the AI space [42:35.86] be like, oh yeah, like we feel like we have sufficient [42:38.70] compute and we couldn't have done better. [42:41.10] So like that conversation, I don't think I've heard [42:44.26] from any of my friends at other companies. [42:46.98] - Stella from Eleuther sometimes says that [42:49.10] because she has a lot of donated compute. [42:51.22] - Yeah. [42:52.06] - And she's trying to put it to interesting uses, [42:53.18] but for some reason she's decided to stop [42:56.10] making large models. [42:57.10] - I mean, that's a cool, high-conviction opinion [43:00.02] that might pay off. [43:01.70] - Why? [43:02.54] - I mean, she's taking a path that most people [43:06.66] don't care to take in this climate [43:08.78] and she probably will have very differentiated ideas. [43:12.26] I mean, think about the correlation of ideas [43:14.76] in AI right now, it's so bad, right? [43:18.02] Like, so everyone's fighting for the same pie. [43:21.70] In some weird sense, like that's partly why [43:24.54] I don't really directly work on LLMs. [43:27.10] I used to do GANs, like I used to do image models and stuff. [43:30.98] And I actually stopped doing GANs because GANs [43:34.46] were getting so hot that I didn't have any calibration [43:37.80] of whether like my work would be useful or not. [43:40.70] Because, oh yeah, like someone else did the same thing [43:43.70] you did. It's like, there's so much to do, [43:47.18] I don't understand why I need to like fight for the same pie. [43:50.30] So like, I think like Stella's decision is very smart.
[43:53.86] - And how do you reconcile that with how we started [43:57.18] the discussion about intrinsic versus extrinsic, [44:00.30] kind of like an accomplishment or success? [44:02.78] How should people think about that, [44:04.58] especially when they're doing a PhD [44:06.26] or like early in their career? [44:08.54] I think at NeurIPS, I walked through a lot of the posters [44:11.38] and whatnot, there seems to be mode collapse in a way [44:14.42] in the research, a lot of people working on the same things. [44:17.42] Is it worth it for like a PhD to not take a bet [44:20.34] on something that is like maybe not as interesting, [44:23.10] you know, just because of funding and, you know, [44:25.18] visibility and whatnot? [44:26.18] Or yeah, what suggestions would you give? [44:28.90] - I think there's a baseline level of compatibility [44:31.82] you need to have with the field. [44:34.30] Basically, you need to figure out [44:37.62] if you will get paid enough to eat, right? [44:40.22] And like whatever reasonable normal lifestyle [44:43.66] you want to have as a baseline. [44:46.42] So you at least have to pick a problem [44:48.42] within the neighborhood of like fundable. [44:51.26] Like you wouldn't want to be doing something [44:55.30] so obscure that people are like, I don't know, [44:58.50] like you can work on it. [44:59.98] - With a limit on fundability, I'm just like observing, [45:04.10] something like three months of compute, right? [45:05.74] That's the top line, that's the like max [45:07.70] that you can spend on any one project. [45:09.42] - But like, I think that's very ill-specified, [45:12.22] like how much compute, right? [45:14.42] I think that the notion of fundability is broader. [45:16.58] It's more like, hey, are these family of models [45:19.30] within the acceptable set of, you're not crazy, [45:23.18] or something, right?
[45:24.02] Like even something like neural ODEs, [45:26.22] which is a very like boundary-pushing thing, [45:29.18] or like state space models or whatever. [45:31.22] Like all of these things I think [45:32.98] are still in fundable territory. [45:34.90] When you're talking about, I'm gonna do one [45:38.02] of the neuromorphic models [45:40.54] and then apply image classification to them [45:43.70] or something, then it becomes like a bit questionable. [45:47.28] Again, it depends on your motivation. [45:48.82] Maybe if you're a neuroscientist, it actually is feasible. [45:52.50] But if you're like an AI engineer, [45:54.82] like the audience of this podcast, [45:56.74] then it's more questionable. [45:58.94] The way I think about it is like, you need to figure out [46:01.22] how you can be at the baseline level of fundability [46:03.82] just so that you can just live. [46:06.46] And then after that, really focus on intrinsic motivation. [46:11.06] And it depends on your strengths, [46:14.18] like how you can play to your strengths [46:16.22] and your interests at the same time. [46:18.10] Like I try to look at a bunch of ideas [46:20.50] that are interesting to me, [46:22.74] but also try to play to my strengths. [46:25.70] I'm not gonna go work on theoretical ML. [46:28.64] I'm interested in it, but when I want to work [46:31.62] on something like that, I try to partner with someone [46:33.74] who is actually a good like theoretical ML person [46:36.34] and see if I actually have any value to provide. [46:38.62] And if they think I do, then I come in. [46:40.62] So I think you'd want to find that intersection [46:43.10] of ideas you like that also play to your strengths. [46:47.82] And I'd go from there. [46:49.34] Everything else, like actually finding extrinsic success [46:52.98] and all of that, I think is, [46:55.18] the way I think about it, is like somewhat immaterial.
[46:58.54] When you're talking about building ecosystems and stuff, [47:01.06] like slightly different considerations come into play, [47:03.70] but that's a different conversation. [47:05.74] - Yeah, we're gonna pivot a little bit [47:07.22] to just talk about open source AI. [47:09.58] But one more thing I wanted to establish for Meta [47:11.74] is like this 600K number, [47:13.06] just kind of rounding out the discussion, [47:15.34] that's for all Meta. [47:16.26] So including your own inference needs, right? [47:17.86] It's not just about training. [47:19.18] - It's gonna be the number in our data centers [47:22.14] for all of Meta. [47:23.10] - Yeah, so like, there's a decent amount of workload [47:26.06] serving Facebook and Instagram and whatever. [47:28.98] And then is there interest in like your own hardware? [47:31.70] - We already talked about our own hardware. [47:35.58] It's called MTIA, our own silicon. [47:39.14] I think we've even showed like the standard photograph [47:43.10] of me holding like the chip that doesn't work. [47:46.06] Like as in, the chip that you basically just get like-- [47:51.22] - As a test? [47:52.42] - Yeah, a test chip or whatever. [47:54.06] So we are working on our silicon [47:56.58] and we'll probably talk more about it [47:58.90] when the time is right, but-- [48:00.94] - Like what gaps do you have [48:02.62] that the market doesn't offer? [48:04.70] - Okay, I mean, this is easy to answer. [48:06.80] So basically, remember how I told you about [48:09.70] there's this memory hierarchy and like sweet spots [48:12.34] and all of that? [48:13.18] Fundamentally, like when you build hardware, [48:15.82] like you make it general enough that a wide set of customers [48:20.30] and a wide set of workloads can use it effectively [48:23.46] while trying to get the maximum level of performance they can.
[48:27.54] The more specialized you make the chip, [48:29.46] the more hardware-efficient it's going to be, [48:31.86] the more power-efficient it's gonna be, [48:33.66] the easier it's going to be to write the software, [48:38.02] like the kernels, right, to just map that one [48:41.82] or two workloads to that hardware and so on. [48:44.62] So it's pretty well understood across the industry [48:47.26] that if you have a sufficiently large enough workload, [48:51.30] you can specialize it and get some efficiency gains, [48:56.30] like power gains and so on. [48:58.02] So the way you can think about every large company building [49:03.02] silicon, like I think a bunch of the other large companies [49:05.98] are building their own silicon as well, [49:07.70] is they, each large company, has a sufficient enough set [49:11.86] of verticalized workloads that can be specialized, [49:16.86] that have a pattern to them, that say a more generic accelerator [49:21.98] like an NVIDIA or AMD GPU does not exploit. [49:26.62] So there is some level of power efficiency [49:30.26] that you're leaving on the table by not exploiting that. [49:33.66] And you have sufficient scale [49:35.22] and you have sufficient forecasted stability [49:39.18] that those workloads will exist in the same form, [49:43.14] that it's worth spending the time to build out a chip [49:46.62] to exploit that sweet spot. [49:49.28] Like obviously something like this is only useful [49:52.36] if you hit a certain scale [49:54.50] and your like forecasted prediction [49:57.70] of those kinds of workloads being [49:59.98] in the same kind of specializable, exploitable way is true. [50:04.98] So yeah, that's why we're building our own chips. [50:07.82] - Awesome, yeah. [50:09.78] I know we've been talking a lot on a lot of different topics [50:13.06] and going back to open source, you had a very good tweet.
[50:16.10] You said that a single company's closed-source effort [50:18.90] rate-limits against people's imaginations and needs. [50:21.66] How do you think about that? [50:23.82] How do you think about all the impact [50:26.46] that some of the Meta AI work in open source [50:28.96] has been doing and maybe directions [50:30.54] of the whole open source AI space? [50:32.46] - Yeah, in general, I think first, [50:34.98] I think it's worth talking about this in terms of open [50:37.94] and not just open source, [50:39.42] because like with the whole notion of model weights, [50:42.38] no one even knows what source means for these things. [50:45.18] But just for the discussion, when I say open source, [50:49.02] you can assume it's just, I'm talking about open. [50:51.94] And then there's the whole notion of like licensing [50:54.70] and all that, like, what happens? [50:56.74] Commercial, non-commercial, commercial with clauses [50:58.90] and all that. [50:59.74] I think like at a fundamental level, [51:01.74] the most beneficial value of open source [51:05.38] is that you make the distribution very wide. [51:10.38] Like it's just available with no friction [51:12.94] and people can do transformative things [51:16.38] in a way that's very accessible. [51:17.82] Like maybe like it's open source, [51:19.94] but it has a commercial license [51:21.50] and I'm a student like in India. [51:23.70] I don't care about the license. [51:25.86] I just don't even understand the license. [51:28.22] But like the fact that I can use it and do something with it [51:31.98] is very transformative to me. [51:33.82] Like I got this thing in a very accessible way. [51:37.58] And then like, so it's various degrees, right?
[51:39.66] And then like if it's open source, [51:41.54] but it's like actually like a commercial license, [51:44.18] then a lot of companies are going to benefit [51:46.86] from like gaining value that they didn't previously have, [51:51.62] that they maybe had to pay a closed source company for. [51:55.96] So open source is just a very interesting tool [51:58.82] that you can use in various ways. [52:00.74] So there's again two kinds of open source. [52:02.86] One is like some large company doing a lot of work [52:05.30] and then open sourcing it. [52:06.98] And that kind of effort is not really feasible [52:10.54] by say like a band of volunteers doing it the same way. [52:14.62] So there's both a capital and operational expenditure [52:17.62] that the large company just decided to ignore [52:20.34] and give it away to the world for some benefits of some kind. [52:23.98] They're not as tangible as like direct revenue. [52:27.10] So in that part, Meta has been doing incredibly good things. [52:31.66] They fund a huge amount of the PyTorch development. [52:35.74] They've open sourced Llama and that family of models [52:40.30] and several other fairly transformative projects. [52:44.22] So FAISS is one, Segment Anything, [52:48.22] Detectron, Detectron 2, DensePose. [52:51.42] I mean it's-- - Seamless? [52:52.78] - Yeah, Seamless. [52:53.78] Like it's just like the list is so long [52:55.82] that you know we're not gonna cover it all. [52:58.02] So like I think Meta comes into that category [53:01.18] where like we spend a lot of CAPEX and OPEX [53:03.74] and we have a high talent density of great AI people [53:07.82] and we open our stuff. [53:09.74] And the thesis for that, I remember [53:11.70] when FAIR was started, the common thing was like, [53:14.34] wait, why would Meta wanna start an open AI lab? [53:19.34] Like what exactly is the benefit [53:21.14] from a commercial perspective? [53:23.26] And back then the thesis was very simple.
[53:25.66] It was like AI is currently rate limiting Meta's ability [53:30.46] to do things, our ability to build various product [53:34.18] integrations, moderation, various other factors. [53:37.58] Like AI was the limiting factor. [53:40.14] And we just wanted AI to advance more. [53:42.78] And we didn't care if the IP of the AI was uniquely [53:47.66] in our possession or not for us. [53:49.14] Like however the field advances, that accelerates [53:51.94] like Meta's ability to build a better product. [53:54.30] So we just built like an open AI lab and we said, [53:57.78] if this helps accelerate the progress of AI [54:00.82] that's strictly great for us. [54:02.94] But like very easy rationale, right? [54:05.14] Still the same to a large extent with like the Llama stuff. [54:08.14] And it's the same values, but like, you know, [54:10.94] the argument, it's a bit more nuanced. [54:13.90] And then there's the second kind of open source [54:15.62] which is, oh, you know, we built this project [54:18.42] nights and weekends and we're very smart people [54:20.54] and we open sourced it. [54:21.74] And then we built a community around it. [54:23.26] This is like the Linux kernel [54:24.78] and various software projects like that. [54:27.70] So I think about open source like both of these things [54:32.22] being beneficial and both of these things being different. [54:34.94] They're different and beneficial in their own ways. [54:37.88] The second one is really useful when [54:41.26] there's an active arbitrage to be done. [54:44.48] If someone's not really looking at a particular space [54:47.66] because it's not commercially viable or whatever, [54:49.84] like a band of volunteers can just coordinate online [54:52.66] and do something and then make that happen. [54:56.12] And that's great. [54:57.56] I wanna cover a little bit about like open source LLMs maybe.
[55:00.94] So open source LLMs have been very interesting [55:03.66] because I think we were trending towards a, [55:06.44] an increase in open source in AI from 2010 [55:11.44] all the way to like 2017 or something. [55:14.98] Like where more and more pressure within the community [55:17.66] was to open source their stuff [55:19.22] so that their methods and stuff get adopted. [55:21.82] And then the LLM revolution kind of took the opposite effect. [55:26.62] OpenAI stopped open sourcing their stuff [55:29.66] and DeepMind kind of, you know, then like all the cloud [55:33.82] and all these other providers, [55:35.50] they didn't open source their stuff. [55:37.76] And it was not good in the sense that first, [55:42.00] like science done in isolation [55:43.88] probably will just form its own bubble [55:47.04] where like people believe their own bullshit or whatever, right? [55:49.92] So there's that problem. [55:51.72] And then there was the other problem [55:53.56] which was the accessibility part. [55:55.72] Like, okay, I again, always go back to like, [55:58.68] I'm a student in India with no money. [56:00.76] What is my accessibility to any of these closed models? [56:05.76] At some scale, I have to pay money [56:08.18] that makes it a non-starter and stuff. [56:11.34] And there's also the control thing. [56:13.34] I strongly believe if you want human aligned stuff, [56:17.08] you want all humans to give feedback. [56:20.46] And you want all humans to have access [56:22.50] to their technology in the first place. [56:24.28] And I actually have seen, you know, living in New York, [56:28.00] whenever I come to Silicon Valley, [56:29.44] I see a different cultural bubble. [56:31.44] Like all the friends I hang out with [56:32.80] talk about some random thing, [56:35.06] like Dyson spheres or whatever, you know, that's a thing. [56:38.28] And most of the world doesn't know or care [56:41.00] about any of this stuff.
[56:42.20] Like it's like definitely like a bubble [56:44.72] and bubbles can form very easily. [56:46.60] And when you make a lot of decisions [56:48.32] because you're in a bubble, [56:50.32] they're probably not globally optimal decisions. [56:52.92] So I think like open source, [56:54.12] the distribution of open source [56:56.24] powers a certain kind of non-falsifiability [57:01.24] that I think is very important. [57:03.56] I think on the open source models, [57:05.76] like it's going great in the fact that LoRA, [57:09.28] I think, came out of the necessity [57:12.04] of open source models needing to be fine-tunable in some way. [57:17.04] - Cheaply. [57:19.32] - Yeah, and I think DPO also came out of [57:23.68] the academic open source side of things. [57:26.40] So do any of the closed source labs, [57:29.40] did any of them already have LoRA or DPO internally? [57:33.04] Maybe, but like that does not advance humanity in any way. [57:37.32] It advances like some company's probability [57:40.24] of doing the winner takes all [57:42.40] that I talked about earlier in the podcast. [57:45.20] I don't know, it just feels fundamentally good. [57:47.96] Like when people try to, you know, [57:50.80] people are like, well, like what are the ways [57:53.48] in which it is not okay? [57:55.44] I find most of these arguments like, [57:57.48] and this might be a little controversial, [57:59.36] but like I find a lot of arguments based on [58:02.80] whether like closed source models are safer [58:04.96] or open source models are safer, [58:06.84] very much related to what kind of culture [58:10.96] they grew up in, what kind of society they grew up in.
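[Editor's note: to make the low-rank fine-tuning idea discussed above concrete, here is a minimal numerical sketch of LoRA. This is an illustration with NumPy, not the actual PyTorch/PEFT implementation; the dimensions and rank are arbitrary.]

```python
import numpy as np

# LoRA sketch: instead of updating a frozen weight W (d_out x d_in) directly,
# learn a low-rank update B @ A with rank r << min(d_out, d_in), so only
# r * (d_out + d_in) parameters are trained instead of d_out * d_in.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (init 0)

x = rng.standard_normal(d_in)

# Forward pass: base output plus the low-rank correction.
y = W @ x + B @ (A @ x)

# With B initialized to zero, the adapted model starts identical to the base.
assert np.allclose(y, W @ x)

full_params = d_out * d_in            # 8192
lora_params = r * (d_out + d_in)      # 768
print(lora_params / full_params)      # trainable fraction: 0.09375
```

Only A and B would receive gradients during fine-tuning, which is what makes the method cheap enough for hobbyists adapting open models on a single GPU.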
[58:14.52] If they grew up in a society that they trusted, [58:17.20] then I think they take the closed source argument, [58:21.20] and if they grew up in a society that they couldn't trust, [58:23.72] where the norm was that you didn't trust your government, [58:26.60] obviously, like it's corrupt or whatever, [58:28.84] then I think like the open source argument is what they take. [58:31.96] I think there's a deep connection to like people's innate biases [58:36.00] from their childhood and their trust in society [58:39.40] and governmental aspects that push them [58:41.88] towards one opinion or the other. [58:44.12] And I'm definitely in the camp of open source is [58:47.04] definitely going to actually have better outcomes for society. [58:50.64] Closed source to me just means centralization of power, [58:54.12] which is really hard to trust. [58:55.80] So I think it's going well in so many ways. [58:59.60] We're actively disaggregating the centralization of power [59:03.08] to just like two or three providers. [59:05.24] We are, I think, benefiting from like so many people [59:08.48] using these models in so many ways that aren't allowed [59:12.52] by like, say, like Silicon Valley left-wing tropes. [59:17.40] Like some of these things are good or bad, [59:19.84] but like they're not culturally accepted universally in the world. [59:23.28] So those are things worth thinking about. [59:25.08] And I think open source is not winning in certain ways. [59:28.16] Like these are all the things in which like, as I mentioned, [59:31.28] it's actually being very good and beneficial and winning. [59:35.20] I think one of the ways in which it's not winning, [59:37.48] at some point I should write a long form post about this, [59:40.36] is I think it has a classic coordination problem. [59:44.40] I mean, open source in general always has a coordination problem.
[59:47.80] If there's a vertically integrated provider with more resources, [59:51.68] they will just be better coordinated than open source. [59:54.44] And so now open source has to figure out [59:57.24] how to have coordinated benefits. [59:59.00] And the reason you want coordinated benefits [60:01.20] is because these models are getting better [60:05.72] based on human feedback. [60:07.96] And if you see with open source models, [60:10.04] like if you go to like the Reddit LocalLlama subreddit, [60:14.20] like there's so many variations of models [60:16.96] that are being produced from, say, Nous Research. [60:20.68] I mean, like there's like so many like variations [60:24.24] built by so many people. [60:25.84] And one common theme is they're all using these fine-tuning [60:29.96] or human preferences datasets that are very limited [60:33.64] and like someone published them somewhere [60:36.48] and like they're not sufficiently diverse. [60:40.44] And you look at the other side, like say frontends like Ooba [60:44.48] or like HuggingChat or Ollama, [60:47.64] they don't really have like feedback buttons. [60:50.08] Like all the people using all these frontends, [60:52.88] they probably want to give feedback, [60:55.28] but there's no way for them to give feedback. [60:57.80] So these models are being built. [61:00.36] They're being arbitrarily measured. [61:02.44] And then they are being deployed into all these open source frontends [61:05.68] or like apps that are closed source, [61:07.84] they're serving open source models. [61:09.64] And these frontends don't have, [61:11.84] they are not exposing the ability to give feedback. [61:14.92] So we're just losing all of this feedback. [61:18.60] Maybe open source models are being as used as GPT is [61:22.24] at this point in like all kinds of, [61:24.80] in a very fragmented way.
[61:26.48] Like in aggregate, all the open source models together [61:28.76] are probably being used as much as GPT is, [61:31.28] maybe, you know, close to that. [61:33.96] But the amount of feedback that is driving back [61:36.96] into the open source ecosystem is like negligible, [61:39.88] maybe less than 1% of like the usage. [61:42.36] So I think like some, like the blueprint here, I think is, [61:48.00] you'd want someone to create a sinkhole for the feedback. [61:51.04] Some centralized sinkhole, like maybe Hugging Face or someone [61:54.08] just funds like, okay, like I will make available a call [61:58.08] to log a string along with like, you know, [62:01.20] a bit of information of positive or negative [62:03.36] or something like that. [62:04.28] And then you would want to send pull requests [62:06.52] to all the like open source frontends, [62:08.56] like Ooba and all, being like, [62:10.36] hey, we're just integrating like a feedback UI. [62:12.92] And then work with like the closed source people [62:14.76] too, being like, look, it doesn't cost you anything. [62:17.72] Just like have a button. [62:19.16] And then the sinkhole will have a bunch of this data coming in. [62:23.76] And then I think a bunch of open source researchers [62:26.40] should figure out how to filter the feedback [62:28.68] into only like the high quality one. [62:30.44] I'm sure like it will be exploited by spam bots [62:32.56] or whatever, right? [62:33.32] Like this is like the perfect way [62:35.08] to inject your advertising product into like the next. [62:38.92] - Buy Coca-Cola now. [62:40.88] - So there needs to be some level of that. [62:43.84] That in the same way, I'm sure like, [62:45.96] like all the closed providers are doing today, [62:48.88] like OpenAI, Claude, like the feedback that comes in, [62:52.72] I'm sure they are figuring out if that's legit or not. [62:56.40] That kind of data filtering needs to be done. [62:59.04] And that loop has to be set up.
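[Editor's note: the "feedback sinkhole" described above is essentially a single shared logging call that any frontend could wire to a thumbs-up/down button. Here is a hypothetical sketch; the function name, record schema, and in-memory store are all invented for illustration and do not correspond to any existing Hugging Face API.]

```python
import json
import time

# In a real sinkhole this would be a central service, not a local list.
FEEDBACK_LOG = []

def log_feedback(model_id: str, prompt: str, response: str, positive: bool):
    """Append one feedback record. A real implementation would POST this to
    a central endpoint, and downstream researchers would filter out spam
    and low-quality entries before using it for preference training."""
    record = {
        "ts": time.time(),
        "model": model_id,
        "prompt": prompt,
        "response": response,
        "label": "positive" if positive else "negative",
    }
    FEEDBACK_LOG.append(record)
    return record

# A frontend would call this from its feedback-button handler:
rec = log_feedback("llama-3-8b", "What is 2+2?", "4", positive=True)
print(json.dumps(rec, indent=2))
```

The point of the blueprint is that this one call, integrated across many frontends, aggregates the distributed usage signal that open models currently lose.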
[63:03.40] And this requires that central sinkhole [63:05.96] and that like data cleaning effort both to be like there. [63:09.40] They're not there right now. [63:10.84] They're not there right now. [63:11.96] I think for capital reasons, [63:14.20] but also for coordination reasons. [63:15.96] Okay, if that central sinkhole is there, [63:17.64] who's going to go coordinate all of this integration [63:20.40] across all of these like open source front ends. [63:23.44] But I think if we do that, if that actually happens, [63:27.80] I think that probably has a real chance [63:31.40] of the open source models having a runaway effect [63:33.68] against OpenAI with their current like [63:36.80] daily active users, rumored. [63:39.44] Probably doesn't have a chance against Google [63:41.40] because you know, Google has Android and Chrome [63:45.28] and Gmail and Google Docs and everything, you know. [63:50.08] So people just use that a lot. [63:52.56] But like, I think like there's a clear chance [63:56.44] we can take at truly winning open source. [64:00.92] - Do you think this feedback is helpful [64:02.48] to make open source models better [64:04.40] or to get to like open source AGI? [64:07.16] Because in a way like OpenAI's goal is to get to AGI, right? [64:10.36] So versus I think in open source [64:12.60] we're more focused on personal better usage [64:15.28] or like commercial better usage. [64:16.64] - Yeah, I think that's a good question. [64:17.76] But I think like, I actually don't think [64:20.84] people have a good understanding of AGI [64:23.60] and I don't mean definition level. [64:25.16] I mean, people are like, okay, we're going to AGI means [64:29.44] it's powering 40% of world economic output [64:32.88] or something like that, right? [64:35.56] But what does that mean? [64:37.56] So do you think electricity is powering 40% [64:41.28] of world economic output or is it not? 
[64:44.40] Like generally the notion of like powering [64:47.48] X% of economic output is not defined well at all [64:53.16] or made to understand like how to know when we got to AGI [64:57.40] or how to measure whether we're getting to AGI. [65:00.52] Like, you know, you can look at it in terms of intelligence [65:03.32] or task automation, whatever. [65:05.90] I think that's what we are doing right now. [65:08.08] We're basically integrating like the current set [65:10.36] of AI technologies into so many real world use cases [65:15.00] where we find value that if some new version of AI comes in, [65:20.12] we can find, like we can be like, ah, this helps me more. [65:23.56] In that sense, I think like the whole process [65:26.32] of like how we think we got to AGI will be continuous [65:29.84] and not discontinuous like how I think [65:33.40] the question is posed. [65:35.20] So I think the open source thing will be very much in line [65:40.20] with getting to AGI because open source has [65:43.32] that natural selection effect. [65:45.28] Like if a better open source model comes, [65:47.44] really no one says, huh, I don't want to use it [65:51.04] because there are ecosystem effects, [65:53.64] I'm logged into my ecosystem or like, [65:56.56] I don't know if I like the models, you know, whatever. [65:59.60] It's just a very pure direct thing. [66:02.96] So if there's a better model that comes out, [66:06.08] then it will be used. [66:08.32] So I definitely think it has a good chance of achieving [66:12.68] what I would think about as a continuous path [66:16.04] to what we might define as AGI. [66:18.72] - For the listeners, I would actually mention [66:20.48] a couple other maybe related notes on just [66:22.72] this very interesting concept of a feedback sinkhole [66:25.92] for open source to really catch up [66:27.96] in terms of the overall Google versus OpenAI debate. [66:31.80] Open Assistant was led by Yannic Kilcher, [66:35.16] who recently ended his effort.
[66:36.52] I think the criticism there was like the kind of people [66:38.40] that go to a specific website to give feedback [66:41.52] is not representative of real world usage. [66:43.44] And that's why the models trained on Open Assistant [66:46.12] didn't really seem like they have caught on [66:48.40] in the open source world. [66:49.56] The two leading candidates in my mind are LMSys [66:51.88] out of UC Berkeley, who have the LMSys arena, [66:54.88] which is being touted as one of the only ways, [66:57.84] only reliable benchmarks anymore. [66:59.36] I kind of call them non-parametric benchmarks [67:01.52] 'cause there's nothing to cheat on it except for Elo. [67:05.56] And then the other one is OpenRouter, [67:07.36] which is Alex Atallah's thing. [67:08.64] I don't know if you've talked to any of these people. [67:11.08] - I obviously know all of the efforts that you talked about. [67:15.72] I haven't talked to them directly about this yet, [67:18.48] but the way I think about it is [67:20.36] the way these models are going to be used [67:22.56] is always going to be way more distributed than centralized. [67:26.04] Like which is the power of the open source movement. [67:29.04] Like the UI within which these models are going to be used [67:32.92] is going to be decentralized. [67:35.28] Like it's, these models are going to be integrated [67:37.32] into like hundreds and thousands of projects [67:40.76] and products and all of that, right? [67:42.92] And I think that is important to recognize. [67:45.76] Like the LMSys leaderboard is the best thing we have right now [67:50.08] to understand whether a model is better or not [67:53.04] versus another model. [67:54.80] But it's also biased in only having a sliver of view [67:59.24] into how people actually use these models. [68:01.04] Like the people who actually end up coming [68:03.04] to the LMSys leaderboard and then using a model [68:06.12] only use it for certain things.
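[Editor's note: for reference, the Elo scheme behind arena-style leaderboards works roughly as follows. This sketch uses the textbook chess constants (K=32, logistic scale 400); LMSys's actual published ratings come from a more involved statistical fit, so treat this as an illustration of the idea only.]

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise battle.

    The expected score of A is a logistic function of the rating gap;
    the winner gains rating in proportion to how surprising the win was.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; model A wins one human-voted battle.
a, b = elo_update(1000.0, 1000.0, a_won=True)
print(a, b)  # 1016.0 984.0 — an even matchup moves each rating by K/2
```

This is why the benchmark is hard to game directly: the only input is pairwise human votes, and a model's rating moves only when real users prefer one response over another.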
[68:08.16] Like GitHub Copilot style usage is not captured [68:12.36] in say like LMSys things. [68:13.92] And so many other styles, [68:15.24] like the Character AI style things, is not captured in LMSys. [68:19.48] - Which OpenRouter could do. [68:20.92] They don't do it right now, but. [68:22.12] - Yeah, so like, I think like, yeah, my point is like, [68:25.28] the way these models are going to be used [68:27.24] is going to be always a large surface area. [68:30.84] And I think we need to figure out [68:32.12] how to provide infrastructure to integrate [68:35.40] with all these like ways in which it's being used. [68:39.08] Even if you get like the top hundred frontends [68:42.72] that the model, like the open source models are used through, [68:46.24] to subscribe to like the sinkhole, [68:48.72] I think that's already like a substantial thing. [68:51.04] I think like thinking one or two things [68:54.20] built by themselves get a lot of data, [68:56.72] I think is not going to happen. [68:58.64] - Yeah, fair enough. [68:59.80] Before we let you go, [69:01.08] can we do just a quick beyond text segment? [69:03.80] So you're an investor in Runway, [69:05.88] which does video generation. [69:07.60] You're an investor in 1X, [69:08.88] which is a humanoid assistant. [69:11.72] Osmo, which is focused on using AI [69:14.12] for smell recognition and synthesis. [69:16.28] You advise a bunch of robotics projects at NYU. [69:19.24] - And he builds his own home robot. [69:21.12] - Yeah, exactly. [69:22.92] On a more, yeah, maybe open anything. [69:24.68] What are like the things that you're most excited about [69:27.00] beyond like text generation and kind of the more mundane usage? [69:30.92] - Yeah, I mean, in general, [69:32.68] I have more things I'm generally excited about [69:35.32] than I can possibly do. [69:37.80] Investing is one way to try to clear those urges.
[69:42.80] I'm generally excited about robotics being a possibility, [69:48.80] home robotics being like five to seven years away [69:52.36] from commercialization. [69:53.84] I think it's not like next year or two years from now, [69:57.96] but like five to seven years from now, [70:00.32] I think a lot more robotics companies might pop out. [70:04.24] There's not a good consensus [70:06.08] on whether hardware is a bottleneck [70:08.08] or AI is a bottleneck in robotics right now. [70:10.72] My view is actually hardware is still the bottleneck, [70:14.88] and AI is also a little bit of a bottleneck, [70:17.64] but I don't think there's any obvious breakthroughs we need. [70:23.16] I think it's just work. [70:24.64] So I'm generally excited about robotics. [70:26.48] I spend a lot of personal time. [70:27.98] I spend like every Wednesday afternoon at NYU [70:30.96] working with Lerrel Pinto and team. [70:33.44] And just getting towards my like home robot [70:36.38] that just does my dishes and stuff. [70:38.32] - What's the status of it? [70:39.32] Like what does it do for you now? [70:41.20] - As of today, we just deployed a couple of months ago, [70:45.40] we deployed our home robotics stuff [70:47.72] into like several tens of New York City homes [70:52.36] and like tried to make it do a bunch of tasks. [70:55.16] And we're basically starting to build out a framework [70:59.40] that gets to a certain level of robustness [71:02.20] on fairly simple tasks. [71:04.60] Like, you know, picking this cup [71:06.44] and putting it somewhere else [71:07.64] or like taking a few pieces of cloth on the ground [71:10.64] and putting them somewhere else or opening your microwave. [71:14.92] Like various like baseline tasks like that [71:18.72] with low sample complexity.
[71:21.00] So I think one of the things people don't spend [71:23.12] their time on in robotics is like the user experience, [71:25.92] which I think we, in the research I do at NYU, [71:29.76] we spend a huge amount of time on. [71:31.88] I think the key there is sample complexity has to be really low. [71:35.40] A lot of the current robotics research, if you see there, [71:38.08] it's like, oh yeah, we collected like 50 demos [71:40.28] and now it's able to do this task, [71:42.08] or we collected like 300 demos, [71:44.48] or like the number of samples you need [71:46.60] for this thing to do the task is really high. [71:48.96] So we're focusing a lot on, [71:50.72] you show it like two or three times [71:53.12] and that's sufficient for it to actually like do the task. [71:56.72] But it comes with like less generalization, right? [71:59.52] Like there's some initial conditions [72:01.36] that have to be true for it to do the task. [72:03.64] So we're making progress. [72:05.08] That's very interesting in general, the space. [72:07.84] I don't think people in the space [72:09.44] have settled on the hardware, [72:11.88] like how the hardware looks like [72:14.12] for it to be truly useful in the home or whatever, [72:16.88] or the UX or the like AI/ML stuff needed [72:21.88] to make it sample efficient and all of that. [72:25.08] But I think like lots of work is happening in the field. [72:28.84] - Yeah, one of my friends, Carlo at Berkeley, [72:31.20] he worked on a project called M3L, [72:33.08] which is two CNNs, one for tactile feedback [72:36.48] and one for image. [72:37.80] When you say hardware, [72:38.68] is it running all these things on the edge [72:41.56] or is it just like the actual servos?
[72:43.72] - Yeah, hardware, I mean like the actual like servos, [72:48.24] like the motors, servos, even like the sensors. [72:53.24] I think we have incredible vision [72:56.24] that is still like so much better, [72:59.48] in field of view and in resolution, [73:01.36] compared to any of the cameras we can buy. [73:03.92] We have, our skin is like all-over touch sensing, [73:08.76] and we have like some of the most efficient, [73:12.00] you know, some of the most high capacity motors [73:14.76] that can lift large loads, you know, [73:17.32] in like the dexterity of a hand and stuff. [73:20.44] So in terms of hardware, I mean like in terms [73:23.28] of those capabilities, like, you know, [73:25.36] we haven't figured out how to do a lot of this stuff. [73:28.44] I mean, Tesla has been making incredible progress. [73:31.24] 1X, I think, announced their new thing [73:35.24] that looks incredible. [73:36.72] Some of the other companies, Figure [73:38.24] and like others, are doing great work, [73:40.56] but we're really not anywhere close to like the hardware [73:43.68] that we feel like we need. [73:45.84] And there's obviously the other thing I want to call out: [73:49.48] a lot of what people show works, [73:52.24] but like has to be fixed all the time. [73:53.92] I mean, like that's the other thing we are incredible at. [73:57.12] Like we don't need any maintenance, [73:58.84] or like the maintenance is part of us. [74:01.32] If you buy a product, electronics product of any kind, [74:04.00] you buy a PS5, you don't say, [74:06.40] oh yeah, my PS5 breaks like every six days [74:09.00] and I have to like do some reasonable amount of work on it. [74:11.60] But like that's robotics. [74:13.32] Like if it's not industrial robotics, [74:15.16] where it's very controlled and specialized or whatever, [74:18.20] like you're talking about reliability like in those ranges.
[74:21.64] So I think people don't talk about [74:24.24] the reliability thing enough. [74:25.40] Like what I mean, like we're going to enter [74:27.28] the commercialization phase. [74:28.52] I mean like we're going to start thinking about, okay, [74:31.48] now we have this thing and we need to figure out [74:33.08] how to get reliability high enough to deploy it into homes [74:36.12] and like just sell it to people at like Best Buy or something. [74:40.04] So that's the other factor [74:41.72] that we have to make a lot of progress on. [74:44.24] - I just realized that Google has a play in this [74:47.36] with like PaLM-E and stuff, [74:48.68] and OpenAI obviously has a long history of doing this stuff. [74:51.92] Is there anything in Meta? [74:53.92] No robotics stuff in Meta. [74:55.52] - We have a small robotics program at Meta, out of FAIR. [74:58.96] I actually used to do it at FAIR a little bit [75:01.28] before I moved into infra and focused my Meta time [75:05.04] on a lot of like other infrastructural stuff. [75:07.44] So yeah, Meta's robotics program is a lot smaller. [75:10.88] - Seems like it would be a personal computing play. [75:14.40] - You can think of it as like, [75:15.72] Meta has a ridiculously large device strategy, right? [75:19.76] Like, you know, this is how our Reality Labs stuff, [75:23.24] like, you know, we're going at it from VR and AR [75:25.88] and you know, we showcase a lot of stuff. [75:28.32] I think for Meta, the robot is not as important [75:32.40] as like the physical devices. [75:35.48] Physical devices kind of stuff, for sure. [75:38.64] - Okay, I want to touch on Osmo a bit, [75:40.24] because it's a very unusual company to the stuff [75:42.76] that we normally discuss, not robotics, sense of smell. [75:46.60] The original pitch I heard from the founder, [75:48.08] maybe you can correct me, is that you realize [75:50.28] that you can smell cancer. [75:52.36] Is that intuitive? [75:53.84] Is that what you get?
[75:54.68] Or is it the potential that you seek? [75:56.40] - The very interesting reason I invested in Osmo [75:59.96] is because Alex Wiltschko, the founder of Osmo, [76:03.56] before PyTorch, there was Torch. [76:05.52] And Alex Wiltschko actually worked on Torch. [76:08.08] He's actually like a frameworks guy. [76:10.28] Like, you know, he built this thing called Tangent [76:12.52] at Google, like another like autodiff framework and stuff. [76:16.04] Like, so I know him from that side of things. [76:18.52] And then, like, I also, like, [76:20.20] he is a neurobiologist by training. [76:22.44] He just happens to also love like neural networks [76:26.00] and like hacking on those frameworks. [76:28.56] So incredibly smart guy, one of the smartest people I know. [76:32.44] So when he was going in this direction, [76:34.68] I thought it was incredible that like smell [76:38.00] is something that we haven't even started to scrape [76:42.24] in terms of digitization. [76:44.00] When we think about audio or images or video, [76:47.96] they're like so advanced. [76:49.72] So we have the concept of color spaces. [76:52.60] We have the concept of like frequency spectrums. [76:55.68] Like, you know, we figured out how ears process [76:58.44] like frequencies in mel spectrum or whatever, [77:00.72] like logarithmically scaled. [77:03.36] Images, we're like RGB, YUV. [77:04.92] Like we have so many different kinds of parametrizations. [77:07.88] We have formalized these two senses ridiculously well. [77:12.88] Touch and smell, nada. [77:16.08] We're like where we were with images in, say, 1920 [77:19.96] or maybe even the 1800s, right? [77:22.52] That's where we're at. [77:23.52] And Alex has this incredible vision [77:26.04] of like having a smell sensor [77:30.30] just eventually just be part of your daily life. [77:34.36] Like, as of today, you don't really think about, [77:38.12] like when you're watching an Instagram reel or something,
[77:40.40] Huh, like I also would love to know what it smelled like, [77:44.72] you know, when you're watching a reel of food or something. [77:48.04] You don't, because we really haven't as a society [77:52.20] got that muscle to even understand [77:54.96] what a smell sensor can do. [77:57.54] I think the more near term effects are obviously [78:00.38] going to be around things that provide more obvious utility [78:04.52] in the short term, like maybe smelling cancer [78:07.68] or like repelling mosquitoes better [78:10.60] or you know, stuff like that. [78:12.64] - More recently, he's been talking about [78:14.00] like categorizing perfumes, obviously. [78:15.48] - Yeah, exactly. [78:16.32] - That's a market that you can pursue. [78:17.52] - Yeah, like, I mean, think about how you can customize [78:21.28] a perfume to your own liking in the same way [78:24.28] you can customize a shoe or something, right? [78:27.48] So that like, that's I think all the near term stuff. [78:29.80] I think if he's able to figure out [78:32.36] a near term value for it, [78:34.40] they as a company can sustain themselves [78:37.08] to then eventually like try to make progress [78:39.52] on the long term, which is really uncharted territory. [78:44.04] Like you think about it, 50 years from now, [78:47.04] it would be pretty obvious to like kids of the generation [78:50.28] to just like, I was going to say, [78:51.84] scroll the reel on their phone, [78:53.28] maybe not phone, they're just like, you know, [78:56.72] on their glasses, they're watching something. [78:59.44] - I think it would be in VR. [79:00.28] - And then like, they immediately get like a smell sense [79:04.28] off that remote experience as well. [79:06.72] Like we haven't really progressed enough in that dimension, [79:10.62] and I think they have a chance to do it. [79:13.00] - Awesome. I mean, we touched on a lot of things.
[79:14.76] Anything we're missing, anything you want to direct people to [79:18.32] or call to action, call for research, call for startups. [79:22.88] - I don't really have a lot of calls to action [79:24.86] because usually I think people should be intrinsically, [79:28.12] like that's a good look inside yourself. [79:31.08] - Yeah. That's good. [79:33.68] Awesome. Thank you so much for coming on. [79:35.12] - Yeah. This was great. [79:36.16] Thanks a bit. [79:37.76] (upbeat music)