[by:whisper.cpp] [00:00.00] (upbeat music) [00:02.58] - Hey everyone, welcome to the Latent Space podcast. [00:08.24] This is Alessio, partner and CTO [00:10.24] in residence at Decibel Partners. [00:11.92] And I'm joined by my cohost, Swyx, [00:13.96] founder of Smol AI. [00:15.28] - Hey, and today we have in the studio [00:16.60] Soumith Chintala, welcome. [00:17.60] - Thanks for having me. [00:18.56] - On one of your rare visits from New York, [00:20.84] where you live. [00:21.68] (laughs) [00:22.52] You got your start in computer vision at NYU [00:25.60] with Yann LeCun. [00:27.40] That was a very fortuitous start. [00:28.96] I was actually listening to your interview [00:31.12] on the Gradient podcast. [00:32.12] So if people want to know more [00:33.40] about the history of Soumith, the history of PyTorch, [00:35.72] they can go to that podcast. [00:37.00] We won't spend that much time there. [00:38.40] But I was just marveling at your luck, [00:40.56] or I don't know if it's your luck [00:42.32] or your drive to find AI early [00:45.36] and then find the right quality mentor, [00:47.84] because I guess Yann really sort of introduced you [00:50.32] to that world. [00:51.16] - Yeah, I think you're talking about extrinsic success, right? [00:54.60] Like a lot of people just have drive to do things [00:57.44] that they think are fun. [00:58.90] And a lot of those things might or might not be [01:01.72] extrinsically perceived as good and successful. [01:04.96] I think I just happened to like something [01:08.28] that is now one of the coolest things [01:11.00] in the world or whatever. [01:12.50] But you know, [01:14.36] the first thing I tried to become was a 3D VFX artist. [01:18.44] And I was really interested in doing that, [01:21.12] but I turned out to be very bad at it. [01:23.84] So I ended up not doing that further. 
[01:25.56] But even if I was good at that, whatever, [01:27.94] and I ended up going down that path, [01:30.12] I probably would have been equally happy. [01:32.44] It's just that maybe the perception of, oh, [01:35.08] is this person successful or not, might be different. [01:38.00] I think after a baseline, [01:39.84] your happiness is probably more correlated [01:42.44] with your intrinsic stuff. [01:44.16] - Yes, I think Dan Pink has this book, Drive, [01:47.76] that I often refer to, about the power of intrinsic motivation [01:51.08] versus extrinsic and how long extrinsic lasts. [01:53.48] It's not very long at all. [01:55.48] But anyway, now you are, you know, an investor in Runway. [01:57.68] So in a way, you're working on VFX. [02:00.66] - Yes, I mean, in a very convoluted way. [02:03.58] - It reminds me of Ed Catmull. [02:05.90] I don't know if you guys know, but [02:07.74] he actually tried to become an animator [02:09.78] in his early years and failed, [02:11.46] or didn't get accepted by Disney, [02:13.30] and then went and created Pixar, [02:14.50] which then got bought by Disney. [02:16.10] Created Toy Story. [02:18.38] So you joined Facebook in 2014 [02:20.58] and eventually became creator and maintainer of PyTorch. [02:24.02] And there's this long story [02:24.94] that you can refer to on The Gradient. [02:26.54] I think maybe people don't know that you're also involved [02:28.44] in more sort of hardware and cluster decisions at FAIR, [02:30.56] and we can dive into more details there [02:32.32] because we're all about hardware this month. [02:34.52] (laughing) [02:35.60] Yeah, and then finally, I don't know what else, [02:37.28] like what else should people know about you, [02:38.48] on the personal side or professional side? [02:40.48] - I think open source is definitely a big passion [02:43.80] of mine and probably forms a little bit of my identity [02:46.20] at this point. 
[02:47.16] I'm irrationally interested in open source. [02:50.40] I think open source has this fundamental way [02:53.00] to distribute opportunity in a way [02:56.08] that is very powerful. [02:57.98] Like I grew up in India. [02:59.86] I didn't have internet for a while, [03:01.58] and in college actually I didn't have internet [03:04.14] except for like GPRS or whatever. [03:06.70] Knowledge was very centralized, [03:10.18] but I saw that evolution of knowledge [03:12.10] slowly getting decentralized, [03:13.62] and that ended up helping me learn quicker [03:16.90] and faster for zero dollars. [03:19.46] And I think that was a strong reason [03:22.78] why I ended up where I am. [03:25.42] So the open source set of things [03:27.92] I always push, regardless of what I get paid for. [03:31.84] Like I think I would do that [03:33.60] as a passion project on the side. [03:35.52] - Yeah, that's wonderful. [03:36.36] We'll talk about the challenges as well [03:38.08] that open source has, open models versus closed models. [03:41.08] But maybe we want to [03:42.28] touch a little bit on PyTorch before we move on [03:44.08] to the sort of Meta AI in general. [03:45.72] - Yeah, we kind of touched on PyTorch in a lot of episodes. [03:48.64] So we had George Hotz from tinygrad. [03:51.32] He called PyTorch a CISC and tinygrad a RISC. [03:55.86] I would love to get your thoughts on PyTorch [03:58.34] design direction as far as, [04:00.66] I know you talk a lot about kind of having a happy path [04:04.14] to start with and then making complexity hidden away, [04:06.66] but then available to the end user. [04:08.58] One of the things that George mentioned is, [04:10.14] I think you have like 250 primitive operators in PyTorch. [04:13.70] I think tinygrad is four. 
[04:15.14] So how do you think about some of the learnings [04:17.66] that maybe he's gonna run into [04:19.02] that you already had in the past almost seven, eight years [04:22.28] of running PyTorch? [04:24.32] - Yeah, I think there's different models here, [04:26.40] but I think there's two different models [04:28.16] that people generally start with. [04:29.76] Either they go like, I have a grand vision [04:32.16] and I'm gonna build a giant system [04:34.08] that achieves this grand vision, [04:35.44] and maybe it's, you know, [04:37.04] super feature complete or whatever. [04:39.68] Or other people say they will get incrementally ambitious, [04:43.76] right? [04:44.60] And they say, oh, we'll start with something simple [04:45.96] and then we'll slowly layer out complexity [04:48.12] in a way that optimally applies Huffman coding or whatever. [04:51.38] Like, you know, where the density of the users is [04:54.86] and what they're using, [04:55.94] I would want to keep in the easy happy path, [04:58.70] and for the more niche advanced use cases, [05:01.80] I'll still want people to try them, [05:04.06] but they need to take additional frictional steps. [05:07.90] I think, just like we started with PyTorch, [05:10.94] George started with the incrementally ambitious thing. [05:14.18] I remember tinygrad used to be like, [05:17.02] it would be limited to a thousand lines of code, [05:19.12] and I think now it's like 5,000. [05:21.20] So I think there's no real magic [05:25.24] to why PyTorch has the kind of complexity it has. [05:27.84] I think it's probably partly necessitated [05:30.76] and partly because we built with the technology [05:33.44] available under us at that time. [05:36.32] PyTorch is like 190,000 lines of code [05:39.24] or something at this point. 
[05:40.52] I think if we had to rewrite it, [05:41.96] we would probably think about ways to rewrite it [05:45.44] in a vastly simplified way for sure. [05:49.02] But a lot of that complexity comes from the fact [05:51.90] that, in a very simple, explainable way, [05:54.88] you have memory hierarchies. [05:56.54] The CPU has like three levels of caches, [05:59.74] and then you have DRAM and SSD, [06:02.94] and then you have network. [06:04.30] Similarly, the GPU has several levels of memory, [06:07.86] and then you have different levels [06:09.54] of network hierarchies, NVLink plus [06:12.82] InfiniBand or RoCE or something like that, right? [06:16.08] And the way the flops are available on your hardware, [06:20.76] they are available in a certain way, [06:22.76] and your computation is in a certain way, [06:24.72] and you have to retrofit your computation [06:26.56] onto both the memory hierarchy [06:28.56] and the flops available. [06:30.36] When you're doing this, [06:31.64] it is actually a fairly hard mathematical problem [06:35.42] to do this setup, like to find the optimal thing. [06:39.40] And finding the optimal thing is like, [06:41.52] what is optimal depends on the input variables themselves. [06:44.70] So like, okay, what is the shape of your input tensors, [06:47.50] and what is the operation you're trying to do, [06:49.86] and various things like that. [06:52.06] Finding that optimal configuration [06:54.50] and writing it down in code [06:56.18] is not the same for every input configuration you have. [06:59.62] For example, just as the shape of the tensors changes, [07:03.18] let's say you have three input tensors [07:04.78] into a sparse dot product or something like that. [07:07.82] The shape of each of these input tensors [07:10.02] will vastly change how you optimally [07:13.30] place this operation onto the hardware [07:15.88] in a way that will get you maximal throughput. 
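The shape-dependence Soumith describes here is the core idea behind kernel autotuning. Below is a minimal pure-Python sketch of that idea, not PyTorch's actual machinery: both "kernels" and every name are made up for illustration. The best implementation is chosen by timing each candidate once per input shape, then caching the winner so the search cost is paid only once.

```python
import time
from functools import lru_cache

# Two hypothetical "kernels" for the same dot-product op. Which one is
# fastest depends on the input shape, just as described above.
def dot_loop(a, b):
    # Plain loop: low overhead, fine for small inputs.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_chunked(a, b, chunk=256):
    # Chunked accumulation: a stand-in for a tiled/vectorized kernel
    # that only pays off once the inputs get large enough.
    total = 0.0
    for i in range(0, len(a), chunk):
        total += sum(x * y for x, y in zip(a[i:i + chunk], b[i:i + chunk]))
    return total

CANDIDATES = [dot_loop, dot_chunked]

@lru_cache(maxsize=None)
def best_kernel_for(length):
    """Autotune: time every candidate on a representative input of this
    shape and cache the winner, so the search cost is paid once per shape."""
    a = [1.0] * length
    b = [2.0] * length
    timings = []
    for kernel in CANDIDATES:
        start = time.perf_counter()
        kernel(a, b)
        timings.append((time.perf_counter() - start, kernel))
    return min(timings, key=lambda t: t[0])[1]

def dot(a, b):
    # Dispatch to whichever kernel won for this input shape.
    return best_kernel_for(len(a))(a, b)
```

Real systems face a much larger search space (tile sizes, thread blocks, memory layouts), which is why the "hundreds of configurations per operator" Soumith mentions next are needed, but the dispatch-on-shape structure is the same.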
[07:18.78] So a lot of our complexity comes from writing out [07:22.94] hundreds of configurations [07:25.04] for each single PyTorch operator, [07:27.30] templatizing these things, and symbolically [07:30.02] generating the final CUDA code or CPU code. [07:34.62] There's no way to avoid it, [07:35.62] because mathematically we haven't found symbolic ways [07:39.32] to do this that also keep compile time near zero. [07:43.70] You can write a very simple framework, [07:45.70] but then you also should be willing to eat [07:48.10] the long compile time [07:49.58] of searching for that optimal performance at runtime. [07:52.78] But that's the trade-off. [07:54.58] I don't think George's vision is achievable [07:57.40] unless we have great breakthroughs, [08:00.30] or he should be thinking about a narrower problem, [08:03.10] such as, I'm only gonna make this [08:04.92] work for self-driving car contexts, [08:07.62] or I'm only gonna make this work [08:09.62] for LLM transformers of the Llama style. [08:13.88] If you start narrowing the problem down, [08:16.04] you can make a vastly simpler framework. [08:19.60] But if you don't, if you need the generality [08:22.16] to power all of the AI research that is happening, [08:24.84] and keep zero compile time, [08:26.88] and you know, all these other factors, [08:28.72] I think it's not easy to avoid the complexity. [08:32.88] - That's interesting, and we kind of touched on this [08:34.72] with Chris Lattner when he was on the podcast. [08:37.16] If you think about frameworks, they have the model target. [08:40.54] They have the hardware target. [08:41.86] They have different things to think about. [08:43.42] He mentioned when he was at Google, [08:45.18] TensorFlow was trying to be optimized to make TPUs go brr, [08:48.54] you know, and go as fast. 
[08:50.02] I think George is trying to make [08:51.70] especially the AMD stack be better than ROCm. [08:54.54] How come PyTorch has been such a Switzerland [08:57.36] versus just making Meta hardware go brr? [09:01.66] - First, Meta is not in the business of selling hardware. [09:04.98] Meta is not in the business of cloud compute. [09:07.60] The way Meta thinks about funding PyTorch [09:09.84] is we're funding it because it's net good for Meta [09:14.16] to fund PyTorch, because PyTorch has become a standard [09:17.40] and a big open source project. [09:19.36] And generally it gives us a timeline edge. [09:22.24] It gives us various leverage [09:24.56] and all that within our own work. [09:27.00] So why is PyTorch more of a Switzerland [09:29.36] rather than being opinionated? [09:30.56] I think the way we think about it [09:32.04] is not in terms of Switzerland or not. [09:34.26] We actually articulate to all hardware vendors [09:37.96] and software vendors and everyone who comes to us being like, [09:40.92] we want to build a backend in core [09:42.76] for PyTorch and ship it by default. [09:44.40] And we just only look at our user side of things. [09:49.20] Like if users are using a particular piece of hardware, [09:52.88] then we want to support it. [09:54.28] We very much don't want to kingmake [09:56.96] the hardware side of things. [09:58.44] So as the MacBooks have GPUs, [10:02.24] and as that stuff started getting increasingly interesting, [10:05.58] we pushed Apple to put some engineers on it [10:08.94] and work on the MPS support, [10:10.46] and we spent significant time [10:12.02] from Meta-funded engineers on that as well, [10:14.34] because a lot of people are using the Apple GPUs [10:18.54] and there's demand. [10:19.46] So we kind of mostly look at it from the demand side. [10:22.46] We never look at it from like, [10:24.54] oh, which hardware should we start taking opinions on? 
[10:27.66] - Is there a future in which, [10:29.02] because Mojo, or Modular's Mojo, is kind of a superset of Python, [10:32.06] is there a future in which PyTorch might use Mojo features [10:35.84] optionally? [10:36.68] - I think it depends on how well integrated it is [10:39.84] into the Python ecosystem. [10:42.32] So if Mojo is like a pip install, [10:44.24] and it's readily available, and users feel like [10:48.76] they can use Mojo so smoothly within their workflows [10:53.12] in a way that just is low friction, [10:56.44] we would definitely look into that. [10:57.92] Like in the same way PyTorch now depends on Triton, [11:00.88] like OpenAI's Triton. [11:02.66] And we never had a conversation that was like, [11:05.42] huh, that's like a dependency. [11:07.34] Should we just build a Triton of our own, [11:10.30] or should we use Triton? [11:12.22] Those conversations [11:14.82] don't really come up for us. [11:15.98] The conversations are more like, [11:17.54] well, does Triton have like 10,000 dependencies, [11:20.10] and is it hard to install? [11:21.34] We almost don't look at these things [11:23.94] from a strategic leverage point of view. [11:26.14] We look at these things from a user experience point [11:28.90] of view, like, is it easy to install? [11:31.08] Is it smoothly integrated? [11:32.56] And does it give enough benefits for us [11:34.40] to start depending on it? [11:35.52] If so, yeah, we should consider it. [11:36.52] That's how we think about it. [11:37.36] - You're inclusive by default, [11:38.84] as long as it meets the minimum bar of, yeah. [11:41.40] But maybe I phrased it wrongly. [11:43.24] Maybe it's more like, okay, [11:44.32] what problems would you look to solve [11:46.64] that you have right now? [11:48.08] - I think it depends on what problems Mojo will be useful at. [11:51.76] - Mainly a performance pitch. [11:53.44] Some amount of cross compiling pitch. 
[11:55.88] - Yeah, I think the performance pitch for Mojo [11:58.24] was like, we're gonna be performant [12:00.58] even if you have a lot of custom stuff. [12:04.02] Like you can write arbitrary custom things, [12:06.38] and we will be performant. [12:07.94] And that value proposition is not clear to us [12:12.82] from the PyTorch side to consider it for PyTorch. [12:16.02] So PyTorch, it's actually not 250 operators, [12:18.62] it's like a thousand operators. [12:19.86] PyTorch exposes about a thousand operators, [12:21.86] and people kind of write their ideas [12:23.92] in the thousand operators of PyTorch. [12:26.46] Mojo is like, well, maybe it's okay [12:29.40] to completely sidestep those thousand operators [12:32.20] of PyTorch and just write it in a more natural form. [12:35.12] Just write raw Python, like write for loops [12:37.72] or whatever, right? [12:38.88] So from the consideration of how do we intersect PyTorch [12:43.88] with Mojo, I can see one use case where [12:48.12] you have custom stuff for some parts of your program, [12:52.36] but mostly it's PyTorch. [12:54.00] And so we can probably figure out how to [12:55.96] make it easier for, say, torch.compile to smoothly [13:00.70] also consume Mojo subgraphs. [13:03.62] And, you know, the interoperability [13:05.50] being actually usable, that I think is valuable. [13:09.40] But Mojo as a fundamental front end [13:11.94] would be replacing PyTorch, not augmenting PyTorch. [13:16.06] So in that sense, I don't see a synergy [13:18.10] in more deeply integrating Mojo. [13:21.70] - So call out to Mojo whenever they have written [13:24.46] something in Mojo and there's some performance [13:27.16] - Yeah. [13:28.00] - related thing going on. [13:29.12] And then since you mentioned Apple, [13:30.36] what should people think of PyTorch versus MLX? [13:32.40] - I mean, MLX is early. 
[13:34.36] And I know the folks well, Awni used to work at FAIR, [13:39.36] and I used to chat with him all the time. [13:42.92] He used to be based out of New York as well. [13:45.34] The way I think about MLX is [13:48.32] that MLX is specialized for Apple right now. [13:52.32] It has a happy path because its product is defined [13:54.78] in a narrow way. [13:57.00] At some point, MLX either says, [14:00.58] we will only be supporting Apple, [14:04.14] and we will just focus on enabling, you know, [14:07.62] this is a framework if you use your MacBook, [14:09.70] but once you go server side or whatever, [14:12.10] that's not my problem and I don't care. [14:14.06] Or MLX enters the server-side [14:17.82] set of things as well. [14:18.94] Like one of these two things will happen, right? [14:21.38] If the first thing happens, [14:22.46] MLX's overall addressable market will be small, [14:26.00] but it'll probably do well within that addressable market. [14:29.44] If it enters the second phase, [14:31.28] they're going to run into all the same complexities [14:33.24] that we have to deal with. [14:34.88] They will not have any magic wand, [14:36.84] and they will have more complex work to do. [14:41.84] They probably wouldn't be able to move as fast. [14:45.54] - Having to deal with distributed compute. [14:47.74] - Distributed, NVIDIA and AMD GPUs, [14:50.88] just having a generalization [14:52.88] of the concept of a backend, [14:55.14] how they treat compilation and its overheads right now. [14:59.30] They deeply assume the whole MPS graph thing. [15:02.30] So they need to think about all these additional things [15:06.46] if they end up expanding onto the server side, [15:09.30] and they'll probably build something like PyTorch as well, [15:13.02] right? [15:13.86] Like eventually that's where it will end. [15:15.84] And I think there they will kind of [15:18.38] fail on the lack of differentiation. 
[15:20.66] Like it wouldn't be obvious to people [15:22.52] why they would want to use it. [15:24.76] - I mean, there are some cloud companies offering [15:26.92] M1 and M2 chips on servers. [15:28.80] I feel like it might be interesting for Apple [15:30.84] to pursue that market, but it's not their core. [15:33.28] - Yeah, I mean, if Apple can figure out [15:35.72] their interconnect story, maybe, [15:37.80] then it can become a thing. [15:39.96] - Honestly, that's more interesting than the cars. [15:41.88] - Yes, I think, [15:44.88] I mean, the moat that NVIDIA has right now, I feel like, [15:47.28] is that they have the interconnect that no one else has. [15:50.86] Like AMD GPUs are pretty good. [15:52.62] I'm sure there's various silicon that is not bad at all. [15:56.40] But the interconnect, like NVLink, is uniquely awesome. [16:00.36] I'm sure the other hardware providers are working on it. [16:03.60] - I feel like when you say it's uniquely awesome, [16:05.46] you have some appreciation of it that the rest of us don't. [16:07.66] I mean, the rest of us just like, [16:08.96] you know, we hear marketing lines, [16:10.12] but what do you mean when you say NVIDIA is very good [16:12.78] in networking? [16:13.62] Obviously they made the acquisition maybe like 15 years ago. [16:15.66] - It's like the bandwidth it offers [16:18.42] and the latency it offers. [16:19.98] I mean, TPUs also have a good interconnect, [16:22.98] but you can't buy them. [16:24.38] So you have to go to Google to use it. [16:27.66] - Who are some of the other FAIR PyTorch alumni [16:30.70] that are building cool companies? [16:31.90] I know you have Fireworks AI, Lightning AI, Lepton, [16:35.84] and Yangqing, who you knew since college [16:39.06] when he was building Caffe? [16:40.78] - Yeah, so Yangqing and I used to be framework rivals, [16:44.46] like Caffe and Torch. 
[16:47.10] I mean, we were all a very small close-knit community [16:49.86] back then, Caffe, Torch, Theano, Chainer, Keras, [16:54.86] various frameworks. [16:57.76] I mean, it used to be more like 20 frameworks. [17:00.76] I can't remember all the names. [17:02.46] CCV by Liu Liu, who is also based out of SF. [17:06.90] And actually, you know, one of the ways [17:09.66] it was interesting is, you went into the framework [17:12.62] guts and saw if someone wrote their own convolution kernel, [17:16.78] or they were just copying someone else's. [17:20.30] There were like four or five convolution kernels [17:22.58] that were unique and interesting. [17:25.50] There's one from this guy out of Russia. [17:28.22] I forgot the name. [17:29.92] But I remembered who was awesome enough [17:34.50] to have written their own kernel. [17:37.90] And at some point there, I built out these benchmarks [17:41.66] called convnet-benchmarks. [17:43.80] They were just benchmarking all the convolution kernels [17:46.62] that were available at that time. [17:49.14] And it hilariously became big enough that at that time [17:53.10] AI was getting important, but not important enough [17:57.10] that industrial-strength players came in to do this [18:01.00] kind of benchmarking standardization, [18:02.74] like we have MLPerf today. [18:04.82] So a lot of the startups were using convnet-benchmarks [18:09.38] in their pitch decks as like, oh, you know, [18:12.78] on convnet-benchmarks, this is how we fare, [18:15.62] so you should fund us. [18:17.18] I remember Nervana actually was at the top of the pack [18:19.82] because Scott Gray wrote amazingly fast [18:23.70] convolution kernels at that time. [18:26.02] Very interesting times. 
[18:27.46] But to answer your question, Alessio, [18:30.26] I think mainly Lepton and Fireworks [18:34.30] are the two most obvious ones, [18:37.18] but I'm sure the fingerprints are a lot wider. [18:41.22] There are just people who worked within the PyTorch [18:45.74] and Caffe2 cohort of things and now end up at various other places. [18:50.34] - I think, both as an investor and as people looking [18:55.26] to build on top of their services, [18:57.32] it's an unchartable slash I-don't-know-what-I-don't-know [19:00.78] pitch, because I've met Yangqing and I've met-- [19:04.14] - Lin Qiao. [19:04.98] - Yeah, I've met these folks, [19:06.94] and they're like, we were deep in the PyTorch ecosystem [19:09.82] and we served billions of inferences a day [19:12.02] or whatever at Facebook, and now we can do it for you. [19:14.86] And I'm like, okay, that's great. [19:17.02] What should I be wary of or cautious of [19:19.34] when these things happen? [19:20.54] Because I'm like, obviously this experience [19:23.22] is extremely powerful and valuable. [19:25.26] I just don't know what I don't know. [19:26.98] Like what should people know about [19:28.82] these sort of new inference-as-a-service companies? [19:32.14] - I think at that point you would be investing in them [19:35.06] for their expertise of one kind. [19:38.54] So if they've been at a large company [19:41.98] but they've been doing amazing work, [19:43.62] you would be thinking about it as like, okay, [19:45.38] what these people bring to the table [19:47.42] is that they're really good at GPU programming [19:51.14] or understanding the complexity of serving models [19:55.78] once it hits a certain scale, you know, [19:58.62] various expertise from the infra [20:01.98] and AI and GPUs point of view. 
[20:04.86] What you would obviously want to figure out [20:08.06] is whether their understanding [20:11.06] of the external markets is clear, [20:12.98] whether they know and understand [20:15.50] how to think about running a business, [20:18.50] like understanding how to be disciplined [20:20.82] about making money, or various things like that. [20:23.86] - Maybe I'll put it, it's actually, [20:25.82] I will de-emphasize the investing bit [20:27.38] and just more as a potential customer. [20:29.50] - Oh, okay. [20:30.34] - Like it's more like, okay, you know, [20:31.86] PyTorch gods, of course, like, what else should I know? [20:36.86] - I mean, I would not care about who's building something [20:40.78] if I'm trying to be a customer. [20:42.22] I would care about whether-- [20:44.06] - The benchmarks. [20:44.90] - Yeah, whether I'd use it, and its usability [20:48.42] and reliability and speed, right? [20:51.06] - Quality as well. [20:51.98] - Yeah, if someone from some random unknown place [20:56.86] came to me and said, "Use our stuff, it's great," [21:00.22] and I have the bandwidth, [21:02.10] I probably will give it a shot, [21:03.70] and if it turns out to be great, I'll just use it. [21:07.22] - Okay, great. [21:08.06] And then maybe one more thing about benchmarks, [21:09.90] since we already brought it up [21:10.94] and you brought up convnet-benchmarks. [21:12.82] There was some recent drama around Anyscale. [21:15.74] Anyscale released their own benchmarks, [21:17.86] and obviously they look great on their own benchmarks, [21:19.70] but maybe didn't give the others a fair shake. [21:23.06] I feel like there are two lines of criticism. [21:25.10] One, which is they didn't test apples for apples [21:28.42] on the kind of endpoints that the other providers [21:31.78] that they are competitors with offer [21:33.58] on their benchmarks, and, you know, [21:34.78] that is a due diligence baseline. 
[21:36.30] And then the second would be more just, like, [21:38.06] optimizing for the right thing. [21:39.50] You had some commentary on it, [21:40.70] I'll just kind of let you riff. [21:41.94] - Yeah, I mean, in summary, [21:44.38] basically my criticism of that was, [21:48.74] Anyscale built these benchmarks for end users [21:52.70] to just understand what they should pick, right? [21:55.22] And that's a very good thing to do. [21:57.82] I think what they didn't do a good job of [21:59.98] is give that end user a full understanding [22:04.02] of what they should pick. [22:04.98] They just gave them a very narrow slice [22:08.26] of understanding. [22:09.10] I think they just gave them latency numbers, [22:11.82] and that's not sufficient, right? [22:13.62] You need to understand your total cost of ownership [22:17.62] at some reasonable scale. [22:19.02] Not like, oh, one API call is one cent, [22:21.86] but a thousand API calls are 10 cents. [22:25.38] Or like, you know, people can mis-price [22:27.10] to cheat on those benchmarks. [22:28.70] So you want to understand, okay, [22:31.02] how much is it going to cost me [22:32.70] if I actually subscribe to you [22:34.70] and do like a million API calls a month or something? [22:38.14] And then you want to understand the latency [22:40.82] and reliability, not just from one call you made, [22:44.50] but an aggregate of calls you made [22:47.54] over various times of the day [22:49.86] and times of the week. [22:51.54] And the nature of the workloads. [22:53.70] Is it just some generic single paragraph [22:57.18] that you're sending that is cacheable? [22:59.22] Or is it testing real-world workloads? [23:03.50] I think that kind of rigor in presenting [23:06.94] that benchmark wasn't there. [23:08.34] It was a much more narrow sliver [23:10.90] of what should have been a good benchmark. 
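The kind of rigor Soumith is asking for is straightforward to sketch. Here's a hypothetical aggregator (the field names and flat per-call pricing model are illustrative, not Anyscale's or any vendor's schema) that turns many sampled calls into what an end user actually needs: median and tail latency, reliability, and projected monthly cost.

```python
def percentile(sorted_vals, p):
    # Nearest-rank percentile over an already-sorted list.
    idx = min(len(sorted_vals) - 1, round(p / 100 * (len(sorted_vals) - 1)))
    return sorted_vals[idx]

def summarize_benchmark(samples, price_per_1k_calls, calls_per_month):
    """Aggregate sampled API calls (each a dict with 'latency' in seconds
    and an 'ok' success flag, ideally spread over times of day and week)
    into the numbers a buyer should compare -- not one cherry-picked call."""
    latencies = sorted(s["latency"] for s in samples)
    return {
        "p50_latency": percentile(latencies, 50),
        "p95_latency": percentile(latencies, 95),
        "success_rate": sum(1 for s in samples if s["ok"]) / len(samples),
        # Total cost of ownership at the user's actual volume, not per call.
        "monthly_cost": price_per_1k_calls * calls_per_month / 1000,
    }
```

A fuller version would also vary the workload mix (cacheable vs. realistic prompts) and tiered pricing, which are exactly the gaps called out above.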
[23:14.22] That was my main criticism. [23:16.02] And I'm pretty sure if, before they released it, [23:19.46] they showed it to their other stakeholders [23:23.54] who would be caring about this benchmark [23:26.10] because they are present in it, [23:27.98] they would have easily just pointed out these gaps. [23:31.22] And I think they didn't do that, [23:32.58] they just released it. [23:35.18] So I think those were the two main criticisms. [23:37.98] I think they were fair and Robert took it well. [23:40.22] - He took it very well, yeah. [23:41.50] I will have him on at some point and we'll discuss it. [23:44.22] But I think it's important, [23:45.18] the market maturing enough [23:47.06] that people start caring and competing [23:48.66] on these kinds of things [23:50.02] means that we need to establish what best practice is, [23:52.90] because otherwise everyone's gonna play dirty. [23:54.94] - Yeah, absolutely. [23:56.78] My view of the LLM inference market in general [23:59.54] is that it's like the laundromat model. [24:02.98] Like the margins are gonna get driven down towards [24:06.78] the bare minimum. [24:07.82] It's gonna be all kinds of arbitrage [24:10.30] between how much you can get the hardware for, [24:12.30] and then how much you sell the API for, [24:14.58] and how much latency your customers [24:16.86] are willing to let go of. [24:18.26] You need to figure out how to squeeze your margins. [24:20.74] Like what is your unique thing here? [24:22.66] I think Together and Fireworks [24:24.94] and all these people are trying to build some faster [24:28.10] CUDA kernels and faster hardware kernels in general. [24:33.10] But those moats only last for a month or two. [24:35.54] These ideas quickly propagate. [24:38.06] - Even if they're not published? [24:39.26] - Even if they're not published, the idea space is small. 
[24:44.18] So even if they're not published, [24:46.74] the discovery rate is gonna be pretty high. [24:49.38] It's not like we're talking about a combinatorial thing [24:52.02] that is really large. [24:53.70] You're talking about Llama-style LLM models, [24:56.94] and we're gonna beat those to death [24:59.10] on like a few different hardware SKUs, right? [25:02.54] It's not even like we have a huge diversity [25:05.22] of hardware you're going to aim to run it on. [25:07.86] Now when you have such a narrow problem [25:09.74] and you have a lot of people working on it, [25:11.38] the rate at which these ideas are gonna get figured out [25:14.42] is gonna be pretty high. [25:15.26] - Is it like a standard bag of tricks? [25:16.74] Like the standard one that I know of is, [25:18.74] you know, fusing operators. [25:20.34] - Yeah, it's a standard bag of tricks on figuring out [25:23.34] how to improve your memory bandwidth and all that. [25:26.98] - Interesting. [25:28.62] Any ideas or sets of things that are not being beaten [25:31.46] to death that people should be paying more attention to? [25:34.10] - One thing I was like, you know, [25:35.02] you have a thousand operators, right? [25:36.22] Like what's the most interesting usage of PyTorch [25:38.46] that you're seeing, maybe outside of this little bubble? [25:41.38] - So PyTorch, it's very interesting and scary [25:44.26] at the same time, but basically it's used [25:47.42] in a lot of exotic ways, like from the ML angle. [25:50.26] Like, okay, what kind of models are being built? [25:53.18] And you get all the way from state space models [25:56.94] and all these things to stuff like [26:00.14] ODE-based differentiable models, [26:02.70] like Neural ODEs and stuff like that. [26:05.26] I think there's one set of interestingness factor [26:08.98] from the ML side of things. 
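To make the "fusing operators" trick mentioned above concrete: here's a toy pure-Python model (counters stand in for DRAM traffic; no real GPU code involved) of why fusion saves memory bandwidth. Unfused `relu(a * b + c)` materializes two intermediate arrays and re-reads them from memory, while the fused version touches each element once.

```python
# Toy cost model: every elementwise op "reads" each input array from
# memory and "writes" its output, tallied in a traffic counter dict.

def elementwise(op, *arrays, traffic):
    traffic["reads"] += len(arrays) * len(arrays[0])
    out = [op(*vals) for vals in zip(*arrays)]
    traffic["writes"] += len(out)
    return out

def unfused(a, b, c, traffic):
    # Three separate kernels: two intermediate arrays hit memory.
    t1 = elementwise(lambda x, y: x * y, a, b, traffic=traffic)
    t2 = elementwise(lambda x, y: x + y, t1, c, traffic=traffic)
    return elementwise(lambda x: max(x, 0.0), t2, traffic=traffic)

def fused(a, b, c, traffic):
    # One pass: read a, b, c once each, write the result once,
    # keeping intermediates in "registers" (local variables).
    traffic["reads"] += 3 * len(a)
    out = [max(x * y + z, 0.0) for x, y, z in zip(a, b, c)]
    traffic["writes"] += len(out)
    return out
```

For n elements the unfused chain does 5n reads and 3n writes versus 3n reads and n writes fused. Since elementwise chains are memory-bandwidth bound on real hardware, this is one of the first tricks every inference stack reaches for.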
[26:11.18] And then there's the other set of interestingness factors [26:13.58] from the applications point of view. [26:15.26] It's used in Mars rover simulations [26:19.10] to drug discovery to Tesla cars. [26:23.06] And there's a huge diversity of like applications [26:26.02] in which it is used. [26:27.10] So in terms of the most interesting application side of things, [26:30.14] I think I'm scared at how many interesting things [26:33.26] that are also very critical and really important [26:35.66] it is used in. I think the scariest was when [26:39.94] I went to visit CERN at some point [26:42.94] and they said they were using PyTorch [26:45.70] and they were using GANs at the same time [26:48.78] for like particle physics research. [26:50.82] And I was scared more about the fact [26:52.70] that they were using GANs than that they were using PyTorch. [26:55.14] Because at that time I was like a researcher [26:57.26] focusing on GANs. [26:58.90] But the diversity is probably the most interesting. [27:01.42] How many different things it is being used in. [27:04.78] I think that's the most interesting to me [27:06.74] from the applications perspective. [27:08.78] From the models perspective, I think, of the ones I've seen, [27:11.46] like the really interesting ones to me [27:13.86] are where we're starting to combine search [27:18.70] and symbolic stuff with differentiable models. [27:22.66] Like the whole AlphaGo style models is one example. [27:26.02] And then I think we're attempting to do it for LLMs as well, [27:29.10] with like various reward models and then search. [27:32.06] I mean, I don't think PyTorch is being used in this, [27:34.74] but like the whole AlphaGeometry thing was interesting [27:37.94] because again, it's an example of combining [27:39.78] symbolic models with the gradient-based ones. [27:42.90] But there is stuff like AlphaGeometry [27:45.30] that PyTorch is used in. [27:46.62] Especially when you intersect biology and chemistry with ML.
[27:51.38] Like in those areas, you want stronger guarantees [27:55.30] on the output. [27:57.62] So yeah, maybe from the ML side, [28:00.50] those things to me are very interesting right now. [28:03.58] - Yeah. [28:04.58] People are very excited about the AlphaGeometry thing. [28:06.54] And it's kind of like, for me, it's theoretical. [28:09.18] It's great. [28:10.02] You can solve some Olympiad questions. [28:11.54] I'm not sure how to make that bridge over [28:13.50] into the real world applications, [28:15.70] but I'm sure people smarter than me will figure it out. [28:17.98] - Let me give you an example of it. [28:20.46] You know how like the whole thing about synthetic data [28:24.54] being the next rage in LLMs is a thing? [28:27.14] - It already is a rage. [28:28.30] - Which I think is fairly misplaced [28:31.06] in how people perceive it. [28:32.78] People think synthetic data is some kind of magic wand [28:35.54] that you wave and it's going to be amazing. [28:38.86] Synthetic data is useful in neural networks right now [28:43.86] because we as humans have figured out a bunch [28:49.18] of symbolic models of the world [28:52.82] or made up certain symbolic models [28:54.94] because of human innate biases. [28:57.06] So we've figured out how to ground particle physics [29:01.06] in a 30-parameter model. [29:04.02] And it's just very hard to compute. [29:07.58] As in, it takes a lot of flops to compute, [29:09.70] but it only has 30 parameters or so. [29:12.30] I mean, I'm not a physics expert, [29:13.70] but like it's a very low-rank model. [29:16.82] We built mathematics as a field [29:20.62] that basically is very low-rank. [29:23.30] Language, like a deep understanding of language, [29:26.14] like the whole syntactic parse trees [29:27.94] and like just understanding how language [29:30.94] can be broken down into a formal symbolism, [29:34.26] is something that we figured out.
[29:36.06] So we basically as humans have accumulated all this knowledge [29:39.46] on these subjects, [29:41.46] either synthetically, and we created those subjects [29:44.74] in our heads, or like we've grounded some real-world phenomenon [29:48.74] into a set of symbols, [29:50.82] but we haven't figured out how to teach neural networks [29:55.42] those symbolic world models directly. [29:58.90] The only way we have to teach them is generating a bunch [30:02.62] of inputs and outputs and gradient-descending over them. [30:05.14] So in areas where we have the symbolic models [30:08.58] and we need to teach all the knowledge we have [30:11.82] that is better encoded in the symbolic models, [30:14.62] what we're doing is we're generating a bunch of synthetic data, [30:18.34] a bunch of input/output pairs, [30:20.58] and then giving that to the neural network [30:22.42] and asking it to learn the same thing [30:24.58] that we already have a better low-rank model of, [30:28.42] via gradient descent, in a much more overparameterized way. [30:32.54] Outside of this, like where we don't have good symbolic models, [30:35.78] like synthetic data obviously doesn't make any sense. [30:38.90] So synthetic data is not a magic wand [30:40.90] that will work in every case or whatever. [30:43.70] It's just for what we as humans already have good symbolic models of. [30:47.50] We need to impart that knowledge to neural networks [30:51.06] and we figured out that synthetic data is a vehicle [30:54.50] to impart this knowledge to them. [30:57.90] But people, maybe because they don't know enough [31:01.58] about synthetic data as a notion, [31:03.90] but they hear the next wave of data revolution is synthetic data, [31:08.18] they think it's some kind of magic [31:09.78] where we just create a bunch of random data somehow. [31:13.74] They don't think about how, [31:15.38] and then they think that's just the revolution.
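The process described here, generating input/output pairs from a symbolic model we already have and gradient-descending an overparameterized learner onto them, can be sketched minimally. The linear "symbolic model" and every constant below are toy choices for illustration, not anything from the episode:

```python
import random

# A "symbolic world model" we already understand: y = 3x + 2 (toy stand-in
# for a low-rank, human-derived model like a physics formula).
def symbolic_model(x):
    return 3.0 * x + 2.0

# Step 1: generate synthetic input/output pairs from the symbolic model.
random.seed(0)
data = [(x, symbolic_model(x))
        for x in (random.uniform(-1, 1) for _ in range(200))]

# Step 2: teach a parametric model the same rule by gradient descent
# on squared error, one sample at a time.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x   # dL/dw for squared error
        b -= lr * err       # dL/db

# The learner recovers the rule it was never shown symbolically.
assert abs(w - 3.0) < 1e-3 and abs(b - 2.0) < 1e-3
```

In the domains described as lacking good symbolic models, there is no `symbolic_model` to sample from, which is exactly why synthetic data is not a universal magic wand.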
[31:17.82] And I think that's maybe a gap in understanding [31:20.70] most people have in this hype cycle. [31:23.22] - Yeah, well, it's a relatively new concept, so. [31:25.74] Oh, there's two more that I'll put in front of you [31:27.74] and then you can see how you respond. [31:29.62] One is, I have this joke that it's only synthetic data [31:34.18] if it's from the Mistral region of France, [31:36.50] otherwise it's just sparkling distillation, [31:38.30] which is what Nous Research is doing, [31:40.06] like they're distilling GPT-4 by creating synthetic data [31:42.86] from GPT-4 and creating mock textbooks inspired by Phi-2 [31:46.78] and then fine-tuning open source models like Llama. [31:50.50] And so I don't know, I mean, I think that's, [31:53.66] should we call that synthetic data? [31:54.62] Should we call it something else? [31:55.54] I don't know. [31:56.86] - Yeah, I mean, the outputs of LLMs, [32:00.18] are they synthetic data? [32:01.98] They probably are. [32:03.90] But I think it depends on the goal you have. [32:07.10] If your goal is like, you're creating synthetic data [32:10.46] with the goal of trying to distill GPT-4's superiority [32:14.70] into another model, I guess you can call it synthetic data, [32:18.10] but it also feels disingenuous because your goal is like, [32:22.38] I need to copy the behavior of GPT-4 and-- [32:25.66] - It's also not just behavior but the data set. [32:28.46] So I've often thought of this as data set washing. [32:31.30] Like you need one model at the top of the chain, [32:34.22] an unnamed French company that makes a model [32:37.86] that has all the data in it that we don't know [32:39.42] where it's from, but it's open source, hey, [32:40.66] and then we distill from that and it's great. [32:42.90] (laughing) [32:44.74] To be fair, they also use larger models as judges [32:48.10] for preference ranking, right? [32:49.22] So that is, I think, a very, very accepted use of synthetic data.
[32:53.30] - Correct, I think it's a very interesting time, [32:55.54] where we don't really have good social models [32:59.38] of what is acceptable, depending on how many bits [33:03.62] of information you use from someone else, right? [33:06.54] It's like, okay, you use like one bit, is that okay? [33:10.70] Yeah, that's accepted to be okay. [33:12.78] Okay, what about if you use like 20 bits, is that okay? [33:16.22] I don't know, what if you use like 200 bits? [33:19.50] Like I don't think we as a society have ever been [33:23.02] in this conundrum where we have to be like, [33:25.06] where is the boundary of copyright, [33:27.78] or where is the boundary of socially accepted [33:31.94] understanding of copying someone else? [33:35.22] Like we haven't been tested like this, mathematically, [33:37.62] before, in my opinion, so. [33:39.66] - Whether it's transformative use. [33:40.94] - Yes. [33:41.76] - So yeah, I think this New York Times v. OpenAI case [33:43.90] is gonna go to the Supreme Court and we'll have to decide it, [33:46.54] 'cause obviously we never had to deal with it before. [33:49.46] And then finally, for synthetic data, [33:51.10] the thing that I'm personally exploring [33:52.42] is solving this stark paradigm difference [33:54.66] between RAG and fine-tuning, [33:55.70] where you can kind of create synthetic data [33:57.50] off of your retrieved documents [33:59.50] and then fine-tune on that. That's kind of synthetic. [34:02.06] All you need is variation or diversity of samples [34:06.30] for you to fine-tune on. [34:07.34] And then you can fine-tune new knowledge into your model. [34:10.02] I don't know if you've seen that [34:10.98] as a direction for synthetic data. [34:13.42] - I think you're basically trying to, [34:16.14] what you're doing is you're saying, [34:17.82] well, language, I know how to parameterize language [34:20.98] to an extent.
[34:22.38] And I need to teach my model variations of this input data [34:27.38] so that it's resilient or invariant to language [34:31.42] uses of that data. [34:32.62] - Yeah, so it doesn't overfit on the record. [34:33.98] - So I think that's 100% synthetic, right? [34:36.66] You understand, like the key is like, [34:39.26] you create variations of your documents [34:41.62] and you know how to do that because you have a symbolic model, [34:44.42] or like some implicit symbolic model, of language. [34:48.86] - Do you think the issue with symbolic models [34:51.42] is just the architecture of the language models [34:55.02] that we're building? [34:56.02] I think like maybe the thing that people grasp is like [34:58.38] the inability of transformers to deal with numbers [35:01.46] because of the tokenizer. [35:03.10] Is it a fundamental issue there too? [35:05.16] And do you see alternative architectures [35:07.58] that will be better with symbolic understanding? [35:09.94] - I am not sure if it's a fundamental issue or not. [35:13.18] I think we just don't understand transformers enough. [35:16.30] I don't even mean transformers as an architecture. [35:18.62] I mean like the use of transformers today, [35:21.66] like combining the tokenizer and transformers [35:24.98] and the dynamics of training, [35:26.90] like when you show math-heavy questions versus not, [35:31.90] I don't have a good calibration [35:34.02] of whether I know the answer or not. [35:35.86] You know, there are common criticisms that are like, [35:38.38] well, you know, transformers will just fail at X. [35:41.70] But then when you scale them up to sufficient scale, [35:45.34] they actually don't fail at that X. [35:47.54] There's this entire subfield [35:49.70] where they're trying to figure out these answers, [35:51.34] called like the science of deep learning or something. [35:53.50] So we'll get to know more. [35:54.94] I don't know the answer.
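On the tokenizer point: a toy greedy longest-match tokenizer makes the intuition concrete. The vocabulary below is invented for illustration and does not correspond to any real model's merges; the point is only that two adjacent integers can end up with completely different token boundaries, which is one reason digit-level arithmetic is awkward for transformers:

```python
# Hypothetical subword vocabulary (made up for illustration).
VOCAB = {"12", "123", "45", "345", "6", "99"}

def greedy_tokenize(s):
    """Greedy longest-match tokenization, falling back to single characters."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest match first
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:  # no vocabulary entry matched: emit one raw character
            tokens.append(s[i])
            i += 1
    return tokens

# Two numbers that differ by one digit get split very differently:
assert greedy_tokenize("12345") == ["123", "45"]
assert greedy_tokenize("12346") == ["123", "4", "6"]
```

So "12345" is two tokens while "12346" is three; nearby numbers never share a consistent digit-aligned representation, and place value is invisible to the model.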
[35:56.62] - Let's touch a little bit on just Meta AI [36:00.34] and stuff that's going on there. [36:01.82] Maybe I don't know how deeply you're personally involved [36:04.14] in it, but you're our first guest from Meta AI, [36:06.26] which is really fantastic. [36:07.66] And Llama 1 was, you know, [36:09.46] you are such a believer in open source. [36:10.90] Llama 1 was more or less like the real breakthrough [36:13.70] in open source AI. [36:15.06] The most interesting thing for us [36:16.62] covering it on this podcast [36:18.38] was the death of Chinchilla, as people say. [36:21.46] Any interesting insights there around like the scaling laws [36:25.38] for open source models or smaller models [36:27.74] or whatever that design decision was [36:29.58] when you guys were doing it? [36:31.02] - So Llama 1 was Guillaume Lample and team. [36:35.46] There was OPT before, which I think I'm also very proud of, [36:39.78] because we bridged the gap in understanding [36:43.70] how complex it is to train these models to the world. [36:46.22] Like until then no one really published in gory detail. [36:50.50] - The logs. [36:51.34] - Yeah, like why is it complex? [36:53.34] And everyone says like, oh, it's complex, [36:55.70] but no one really talked about why it's complex. [37:00.30] I think OPT was cool. [37:01.98] We probably-- [37:02.82] - I met Susan and she's very, very outspoken. [37:04.70] - Yeah, we probably, I think, [37:07.94] didn't train it for long enough, right? [37:09.42] Like, you know, that's kind of obvious in retrospect. [37:12.66] - For a 175B model. [37:13.98] - Yeah. [37:14.82] - You trained it according to Chinchilla at the time or? [37:17.54] - I can't remember the details, [37:19.42] but I think it's a commonly held belief at this point [37:21.74] that like, well, if we trained OPT longer, [37:24.34] it would actually end up being better.
[37:26.90] Llama 1, I think was, yeah, Guillaume Lample [37:29.50] and team. Guillaume is fantastic [37:32.34] and went on to build Mistral. [37:34.30] I wasn't too involved in that set of things. [37:36.94] So I don't know what you're asking me, [37:39.78] which is like, well, like how did they think [37:41.78] about scaling laws and all of that? [37:43.54] Llama 2, I was more closely involved in. [37:47.70] I helped them a reasonable amount [37:50.58] with like their infrastructure needs and stuff. [37:54.14] And Llama 2, I think was more like, [37:57.46] let's get to the evolution. [37:59.70] At that point, we kind of understood [38:02.54] what we were missing from the industry's understanding [38:07.54] of LLMs, and we needed more data [38:12.22] and we needed to train the models for longer. [38:15.34] And we made, I think, a few tweaks to the architecture [38:18.86] and we scaled up more. [38:20.22] And like, that was Llama 2. [38:22.26] I think Llama 2, you can think of it as like, [38:24.26] after Guillaume left, the team kind of rebuilt their muscle [38:27.62] around Llama 2. [38:28.98] And Hugo, I think, who's the first author, is fantastic. [38:32.42] And I think he did play a reasonably big role [38:35.02] in Llama 1 as well. [38:35.86] And he overlaps between Llama 1 and 2. [38:37.94] So Llama 3, obviously, hopefully, will be awesome. [38:42.82] - Just one question on Llama 2 [38:44.10] and then we'll try and fish Llama 3 spoilers out of you. [38:48.38] In the Llama 2 paper, [38:49.50] the loss curves of the 34B and 70B parameter models, [38:52.90] they still seemed kind of steep. [38:54.82] I feel like they could go lower. [38:56.18] How, from an infrastructure level, [38:58.30] how do you allocate resources? [38:59.66] Like, could they have just gone longer [39:01.66] or were you just like, [39:02.50] hey, this is all the GPUs that we can burn [39:04.46] and let's just move on to Llama 3 [39:06.02] and then make that one better?
[39:07.70] - Instead of answering specifically [39:09.46] about that Llama 2 situation or whatever, [39:11.90] I'll tell you how we think about things. [39:14.94] Generally, Mark released some numbers, right? [39:18.94] - So let's cite those things again. [39:22.34] From memory, 600K GPUs. [39:24.42] - That is by the end of this year, [39:26.10] and 600K H100 equivalents. [39:29.42] With 350K H100s, including all of our other GPU [39:33.82] or accelerator stuff, [39:34.86] it would be 600K in aggregate capacity. [39:38.58] That's a lot of GPUs. [39:39.42] We'll talk about that separately. [39:40.74] But the way we think about it is [39:43.66] we have a train of models, right? [39:45.74] Llama 1, 2, 3, 4. [39:48.30] And we have a bunch of GPUs. [39:50.90] I don't think we're short of GPUs. [39:53.54] - Yeah, no, I wouldn't say so. [39:54.94] - Yeah, so it's all a matter of time. [39:56.90] I think time is the biggest bottleneck. [39:59.18] It's like, when do you stop training the previous one [40:01.94] and when do you start training the next one? [40:04.38] And how do you make those decisions? [40:06.66] The data, do you have net new data, [40:08.70] better, cleaner data for the next one, [40:10.86] in a way that it's not worth [40:12.38] like really focusing on the previous one? [40:14.98] It's just a standard iterative product. [40:17.62] You're like, when is the iPhone 1? [40:19.78] When do you start working on iPhone 2 versus iPhone 1? [40:23.18] Like so on, right? [40:24.40] So mostly the considerations are time and generation [40:29.12] rather than GPUs, in my opinion. [40:31.58] - So one other thing with the scaling laws, [40:33.74] like Chinchilla is like optimal to balance [40:36.30] training and inference costs. [40:37.78] I think at Meta scale, you would rather pay a lot more, [40:40.70] maybe, at training and then save on inference. [40:42.74] How do you think about that from an infrastructure perspective?
[40:45.58] I think in your tweet, you say you can try and guess [40:47.94] at like how we're using these GPUs. [40:50.34] Can you just give people a bit of understanding? [40:52.26] It's like, because I've already seen a lot of VCs say, [40:54.66] Llama 3 has been trained on 600,000 GPUs [40:56.78] and that's obviously not true, I'm sure. [40:58.82] How do you allocate between the research, [41:01.10] like FAIR, and the Llama training, [41:03.74] the inference on Instagram suggestions [41:06.48] that get me to scroll, like AI-generated stickers [41:09.10] on WhatsApp and all of that. [41:10.98] - Yeah, we haven't talked about any of this publicly, [41:13.90] but like as a broad stroke, it's like how we would allocate [41:18.06] resources of any other kind at any company. [41:21.94] You run like a VC portfolio, like how do you allocate [41:24.48] your investments between different companies or whatever. [41:26.82] You kind of make various trade-offs and you kind of decide, [41:29.38] should I invest in this project or this other project, [41:32.26] or how much should I invest in this project? [41:34.66] It's very much like a zero-sum set of trade-offs. [41:38.26] And it also comes into play, like how are your clusters [41:42.02] configured, like overall, like what you can fit [41:45.10] of what size in what cluster and so on. [41:47.50] So broadly, there's no magic sauce here. [41:51.06] Like, I mean, I think the details would add more spice [41:54.74] but also wouldn't add more understanding. [41:59.30] It's just gonna be like, oh, okay, I mean, this looks like [42:02.02] they just think about this as I would normally do. [42:05.12] - So even the GPU rich run through the same struggles [42:08.78] of having to decide where to allocate things. [42:11.10] - Yeah, I mean, like at some point, I forgot who said it, [42:14.02] but it's like you kind of fit your models [42:18.46] to the amount of compute you have.
[42:21.22] If you don't have enough compute, you figure out [42:23.22] how to make do with smaller models. [42:26.06] But like no one as of today, I think, would feel [42:29.94] like they have enough compute. [42:31.42] I don't think I've heard any company within the AI space [42:35.86] be like, oh yeah, like we feel like we have sufficient [42:38.70] compute and we couldn't have done better. [42:41.10] So like that conversation, I don't think I've heard [42:44.26] from any of my friends at other companies. [42:46.98] - Stella from Eleuther sometimes says that [42:49.10] because she has a lot of donated compute. [42:51.22] - Yeah. [42:52.06] - And she's trying to put it to interesting uses, [42:53.18] but for some reason she's decided to stop [42:56.10] making large models. [42:57.10] - I mean, that's a cool, high-conviction opinion [43:00.02] that might pay off. [43:01.70] - Why? [43:02.54] - I mean, she's taking a path that most people [43:06.66] don't care to take in this climate [43:08.78] and she probably will have very differentiated ideas. [43:12.26] I mean, think about the correlation of ideas [43:14.76] in AI right now, it's so bad, right? [43:18.02] Like, so everyone's fighting for the same pie. [43:21.70] In some weird sense, like that's partly why [43:24.54] I don't really directly work on LLMs. [43:27.10] I used to do GANs, like I used to do image models and stuff. [43:30.98] And I actually stopped doing GANs because GANs [43:34.46] were getting so hot that I didn't have any calibration [43:37.80] of whether like my work would be useful or not. [43:40.70] Because, oh yeah, like someone else did the same thing [43:43.70] you did. It's like, there's so much to do, [43:47.18] I don't understand why I need to like fight for the same pie. [43:50.30] So like, I think like Stella's decision is very smart.
[43:53.86] - And how do you reconcile that with how we started [43:57.18] the discussion about intrinsic versus extrinsic, [44:00.30] kind of like an accomplishment or success? [44:02.78] How should people think about that, [44:04.58] especially when they're doing a PhD [44:06.26] or like early in their career? [44:08.54] I think at NeurIPS, I walked through a lot of the posters [44:11.38] and whatnot, there seems to be mode collapse in a way [44:14.42] in the research, a lot of people working on the same things. [44:17.42] Is it worth it for like a PhD to not take a bet [44:20.34] on something that is like maybe not as interesting, [44:23.10] you know, just because of funding and, you know, [44:25.18] visibility and whatnot? [44:26.18] Or yeah, what suggestions would you give? [44:28.90] - I think there's a baseline level of compatibility [44:31.82] you need to have with the field. [44:34.30] Basically, you need to figure out [44:37.62] if you will get paid enough to eat, right? [44:40.22] And like whatever reasonable normal lifestyle [44:43.66] you want to have as a baseline. [44:46.42] So you at least have to pick a problem [44:48.42] within the neighborhood of like fundable. [44:51.26] Like you wouldn't want to be doing something [44:55.30] so obscure that people are like, I don't know, [44:58.50] like you can work on it. [44:59.98] - With a limit on fundability, I'm just like observing, [45:04.10] something like three months of compute, right? [45:05.74] That's the top line, that's the like max [45:07.70] that you can spend on any one project. [45:09.42] - But like, I think that's very ill-specified, [45:12.22] like how much compute, right? [45:14.42] I think that the notion of fundability is broader. [45:16.58] It's more like, hey, are these family of models [45:19.30] within the acceptable set of, you're not crazy, [45:23.18] or something, right?
[45:24.02] Like even something like neural ODEs, [45:26.22] which is a very like boundary-pushing thing, [45:29.18] or like state space models or whatever. [45:31.22] Like all of these things I think [45:32.98] are still in fundable territory. [45:34.90] When you're talking about, I'm gonna do one [45:38.02] of the neuromorphic models [45:40.54] and then apply image classification to them [45:43.70] or something, then it becomes like a bit questionable. [45:47.28] Again, it depends on your motivation. [45:48.82] Maybe if you're a neuroscientist, it actually is feasible. [45:52.50] But if you're like an AI engineer, [45:54.82] like the audience of this podcast, [45:56.74] then it's more questionable. [45:58.94] The way I think about it is like, you need to figure out [46:01.22] how you can be at the baseline level of fundability [46:03.82] just so that you can just live. [46:06.46] And then after that, really focus on intrinsic motivation. [46:11.06] And it depends on your strengths, [46:14.18] like how you can play to your strengths [46:16.22] and your interests at the same time. [46:18.10] Like I try to look at a bunch of ideas [46:20.50] that are interesting to me, [46:22.74] but also try to play to my strengths. [46:25.70] I'm not gonna go work on theoretical ML. [46:28.64] I'm interested in it, but when I want to work [46:31.62] on something like that, I try to partner with someone [46:33.74] who is actually a good like theoretical ML person [46:36.34] and see if I actually have any value to provide. [46:38.62] And if they think I do, then I come in. [46:40.62] So I think you'd want to find that intersection [46:43.10] of ideas you like that also play to your strengths. [46:47.82] And I'd go from there. [46:49.34] Everything else, like actually finding extrinsic success [46:52.98] and all of that, I think is, [46:55.18] the way I think about it, is like somewhat immaterial.
[46:58.54] When you're talking about building ecosystems and stuff, [47:01.06] like slightly different considerations come into play, [47:03.70] but that's a different conversation. [47:05.74] - Yeah, we're gonna pivot a little bit [47:07.22] to just talk about open source AI. [47:09.58] But one more thing I wanted to establish for Meta [47:11.74] is like this 600K number, [47:13.06] just kind of rounding out the discussion, [47:15.34] that's for all Meta. [47:16.26] So including your own inference needs, right? [47:17.86] It's not just about training. [47:19.18] - It's gonna be the number in our data centers [47:22.14] for all of Meta. [47:23.10] - Yeah, so like, there's a decent amount of workload [47:26.06] serving Facebook and Instagram and whatever. [47:28.98] And then is there interest in like your own hardware? [47:31.70] - We already talked about our own hardware. [47:35.58] It's called MTIA, our own silicon. [47:39.14] I think we've even showed like the standard photograph [47:43.10] of me holding like the chip that doesn't work. [47:46.06] Like as in, the chip that you basically just get like-- [47:51.22] - As a test? [47:52.42] - Yeah, a test chip or whatever. [47:54.06] So we are working on our silicon [47:56.58] and we'll probably talk more about it [47:58.90] when the time is right, but-- [48:00.94] - Like what gaps do you have [48:02.62] that the market doesn't offer? [48:04.70] - Okay, I mean, this is easy to answer. [48:06.80] So basically, remember how I told you about [48:09.70] there's this memory hierarchy and like sweet spots [48:12.34] and all of that? [48:13.18] Fundamentally, like when you build hardware, [48:15.82] like you make it general enough that a wide set of customers [48:20.30] and a wide set of workloads can use it effectively [48:23.46] while trying to get the maximum level of performance they can.
[48:27.54] The more specialized you make the chip, [48:29.46] the more hardware-efficient it's going to be, [48:31.86] the more power-efficient it's gonna be, [48:33.66] the easier it's going to be to write the software, [48:38.02] like the kernels, right, to just map that one [48:41.82] or two workloads to that hardware and so on. [48:44.62] So it's pretty well understood across the industry [48:47.26] that if you have a sufficiently large enough workload, [48:51.30] you can specialize it and get some efficiency gains, [48:56.30] like power gains and so on. [48:58.02] So the way you can think about every large company building [49:03.02] silicon, like I think a bunch of the other large companies [49:05.98] are building their own silicon as well, [49:07.70] is they, each large company, has a sufficient enough set [49:11.86] of verticalized workloads that can be specialized, [49:16.86] that have a pattern to them, that say a more generic accelerator [49:21.98] like an NVIDIA or AMD GPU does not exploit. [49:26.62] So there is some level of power efficiency [49:30.26] that you're leaving on the table by not exploiting that. [49:33.66] And you have sufficient scale [49:35.22] and you have sufficient forecasted stability [49:39.18] that those workloads will exist in the same form, [49:43.14] that it's worth spending the time to build out a chip [49:46.62] to exploit that sweet spot. [49:49.28] Like obviously something like this is only useful [49:52.36] if you hit a certain scale [49:54.50] and your like forecasted prediction [49:57.70] of those kinds of workloads being [49:59.98] in the same kind of specializable, exploitable way is true. [50:04.98] So yeah, that's why we're building our own chips. [50:07.82] - Awesome, yeah. [50:09.78] I know we've been talking a lot on a lot of different topics [50:13.06] and going back to open source, you had a very good tweet.
[50:16.10] You said that a single company's closed-source effort [50:18.90] rate-limits against people's imaginations and needs. [50:21.66] How do you think about that? [50:23.82] How do you think about all the impact [50:26.46] that some of the Meta AI work in open source [50:28.96] has been doing and maybe directions [50:30.54] of the whole open source AI space? [50:32.46] - Yeah, in general, I think first, [50:34.98] I think it's worth talking about this in terms of open [50:37.94] and not just open source, [50:39.42] because like with the whole notion of model weights, [50:42.38] no one even knows what source means for these things. [50:45.18] But just for the discussion, when I say open source, [50:49.02] you can assume it's just, I'm talking about open. [50:51.94] And then there's the whole notion of like licensing [50:54.70] and all that, like, what happens? [50:56.74] Commercial, non-commercial, commercial with clauses [50:58.90] and all that. [50:59.74] I think like at a fundamental level, [51:01.74] the most beneficial value of open source [51:05.38] is that you make the distribution very wide. [51:10.38] Like it's just available with no friction [51:12.94] and people can do transformative things [51:16.38] in a way that's very accessible. [51:17.82] Like maybe like it's open source, [51:19.94] but it has a commercial license [51:21.50] and I'm a student like in India. [51:23.70] I don't care about the license. [51:25.86] I just don't even understand the license. [51:28.22] But like the fact that I can use it and do something with it [51:31.98] is very transformative to me. [51:33.82] Like I got this thing in a very accessible way. [51:37.58] And then like, so it's various degrees, right?
[51:39.66] And then like if it's open source, [51:41.54] but it's like actually like a commercial license, [51:44.18] then a lot of companies are going to benefit [51:46.86] from like gaining value that they didn't previously have, [51:51.62] that they maybe had to pay a closed source company for. [51:55.96] So open source is just a very interesting tool [51:58.82] that you can use in various ways. [52:00.74] So there's again two kinds of open source. [52:02.86] One is like some large company doing a lot of work [52:05.30] and then open sourcing it. [52:06.98] And that kind of effort is not really feasible [52:10.54] by say like a band of volunteers doing it the same way. [52:14.62] So there's both a capital and operational expenditure [52:17.62] that the large company just decided to ignore [52:20.34] and give it away to the world for some benefits of some kind. [52:23.98] They're not as tangible as like direct revenue. [52:27.10] So in that part, Meta has been doing incredibly good things. [52:31.66] They fund a huge amount of the PyTorch development. [52:35.74] They've open sourced Llama and that family of models [52:40.30] and several other fairly transformative projects. [52:44.22] So FAISS is one, Segment Anything, [52:48.22] Detectron, Detectron 2, DensePose. [52:51.42] I mean it's-- - Seamless? [52:52.78] - Yeah, Seamless. [52:53.78] Like it's just like the list is so long [52:55.82] that you know we're not gonna cover it all. [52:58.02] So like I think Meta comes into that category [53:01.18] where like we spend a lot of CAPEX and OPEX [53:03.74] and we have a high talent density of great AI people [53:07.82] and we open our stuff. [53:09.74] And the thesis for that, I remember [53:11.70] when FAIR was started, the common thing was like, [53:14.34] wait, why would Meta wanna start an open AI lab? [53:19.34] Like what exactly is the benefit [53:21.14] from a commercial perspective? [53:23.26] And back then the thesis was very simple.
[53:25.66] It was like AI is currently rate limiting Meta's ability [53:30.46] to do things, our ability to build various product [53:34.18] integrations, moderation, various other factors. [53:37.58] Like AI was the limiting factor. [53:40.14] And we just wanted AI to advance more. [53:42.78] And we didn't care if the IP of the AI was uniquely [53:47.66] in our possession or not for us. [53:49.14] Like however the field advances, that accelerates [53:51.94] like Meta's ability to build a better product. [53:54.30] So we just built like an open AI lab and we said, [53:57.78] if this helps accelerate the progress of AI [54:00.82] that's strictly great for us. [54:02.94] But like very easy rationale, right? [54:05.14] Still the same to a large extent with like the Llama stuff. [54:08.14] And it's the same values, but like, you know, [54:10.94] the argument, it's a bit more nuanced. [54:13.90] And then there's the second kind of open source [54:15.62] which is, oh, you know, we built this project [54:18.42] nights and weekends and we're very smart people [54:20.54] and we open sourced it. [54:21.74] And then we built a community around it. [54:23.26] This is like the Linux kernel [54:24.78] and various software projects like that. [54:27.70] So I think about open source like both of these things [54:32.22] being beneficial and both of these things being different. [54:34.94] They're different and beneficial in their own ways. [54:37.88] The second one is really useful when [54:41.26] there's an active arbitrage to be done. [54:44.48] If someone's not really looking at a particular space [54:47.66] because it's not commercially viable or whatever, [54:49.84] like a band of volunteers can just coordinate online [54:52.66] and do something and then make that happen. [54:56.12] And that's great. [54:57.56] I wanna cover a little bit about like open source LLMs maybe.
[55:00.94] So open source LLMs have been very interesting [55:03.66] because I think we were trending towards a, [55:06.44] an increase in open source in AI from 2010 [55:11.44] all the way to like 2017 or something. [55:14.98] Like where more and more pressure within the community [55:17.66] was to open source their stuff [55:19.22] so that their methods and stuff get adopted. [55:21.82] And then the LLM revolution kind of took the opposite effect. [55:26.62] OpenAI stopped open sourcing their stuff [55:29.66] and DeepMind kind of, you know, then like all the cloud [55:33.82] and all these other providers, [55:35.50] they didn't open source their stuff. [55:37.76] And it was not good in the sense that first, [55:42.00] like science done in isolation [55:43.88] probably will just form its own bubble [55:47.04] where like people believe their own bullshit or whatever, right? [55:49.92] So there's that problem. [55:51.72] And then there was the other problem [55:53.56] which was the accessibility part. [55:55.72] Like, okay, I again, always go back to like, [55:58.68] I'm a student in India with no money. [56:00.76] What is my accessibility to any of these closed models? [56:05.76] At some scale, I have to pay money [56:08.18] that makes it a non-starter and stuff. [56:11.34] And there's also the control thing. [56:13.34] I strongly believe if you want human aligned stuff, [56:17.08] you want all humans to give feedback. [56:20.46] And you want all humans to have access [56:22.50] to their technology in the first place. [56:24.28] And I actually have seen, you know, living in New York, [56:28.00] whenever I come to Silicon Valley, [56:29.44] I see a different cultural bubble. [56:31.44] Like all the friends I hang out with [56:32.80] talk about some random thing, [56:35.06] like Dyson spheres or whatever, you know, that's a thing. [56:38.28] And most of the world doesn't know or care [56:41.00] about any of this stuff.
[56:42.20] Like it's like definitely like a bubble [56:44.72] and bubbles can form very easily. [56:46.60] And when you make a lot of decisions [56:48.32] because you're in a bubble, [56:50.32] they're probably not globally optimal decisions. [56:52.92] So I think like open source, [56:54.12] the distribution of open source [56:56.24] powers a certain kind of non-falsifiability [57:01.24] that I think is very important. [57:03.56] I think on the open source models, [57:05.76] like it's going great in the fact that LoRA, [57:09.28] I think, came out of the necessity [57:12.04] of open source models needing to be fine-tunable in some way. [57:17.04] - Cheaply. [57:19.32] - Yeah, and I think DPO also came out of [57:23.68] the academic open source side of things. [57:26.40] So do any of the closed source labs, [57:29.40] did any of them already have LoRA or DPO internally? [57:33.04] Maybe, but like that does not advance humanity in any way. [57:37.32] It advances like some company's probability [57:40.24] of doing the winner takes all [57:42.40] that I talked about earlier in the podcast. [57:45.20] I don't know, it just feels fundamentally good. [57:47.96] Like when people try to, you know, [57:50.80] people are like, well, like what are the ways [57:53.48] in which it is not okay? [57:55.44] I find most of these arguments like, [57:57.48] and this might be a little controversial, [57:59.36] but like I find a lot of arguments based on [58:02.80] whether like closed source models are safer [58:04.96] or open source models are safer, [58:06.84] very much related to what kind of culture [58:10.96] they grew up in, what kind of society they grew up in.
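[Editor's note: to make the low-rank fine-tuning idea discussed above concrete, here is a minimal numerical sketch of LoRA. This is an illustration with NumPy, not the actual PyTorch/PEFT implementation; the dimensions and rank are arbitrary.]

```python
import numpy as np

# LoRA sketch: instead of updating a frozen weight W (d_out x d_in) directly,
# learn a low-rank update B @ A with rank r << min(d_out, d_in), so only
# r * (d_out + d_in) parameters are trained instead of d_out * d_in.
rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (init 0)

x = rng.standard_normal(d_in)

# Forward pass: base output plus the low-rank correction.
y = W @ x + B @ (A @ x)

# With B initialized to zero, the adapted model starts identical to the base.
assert np.allclose(y, W @ x)

full_params = d_out * d_in            # 8192
lora_params = r * (d_out + d_in)      # 768
print(lora_params / full_params)      # trainable fraction: 0.09375
```

Only A and B would receive gradients during fine-tuning, which is what makes the method cheap enough for hobbyists adapting open models on a single GPU.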
[58:14.52] If they grew up in a society that they trusted, [58:17.20] then I think they take the closed source argument, [58:21.20] and if they grew up in a society that they couldn't trust, [58:23.72] where the norm was that you didn't trust your government, [58:26.60] obviously, like it's corrupt or whatever, [58:28.84] then I think like the open source argument is what they take. [58:31.96] I think there's a deep connection to like people's innate biases [58:36.00] from their childhood and their trust in society [58:39.40] and governmental aspects that push them [58:41.88] towards one opinion or the other. [58:44.12] And I'm definitely in the camp of open source is [58:47.04] definitely going to actually have better outcomes for society. [58:50.64] Closed source to me just means centralization of power, [58:54.12] which is really hard to trust. [58:55.80] So I think it's going well in so many ways. [58:59.60] We're actively disaggregating the centralization of power [59:03.08] to just like two or three providers. [59:05.24] We are, I think, benefiting from like so many people [59:08.48] using these models in so many ways that aren't allowed [59:12.52] by like, say, like Silicon Valley left-wing tropes. [59:17.40] Like some of these things are good or bad, [59:19.84] but like they're not culturally accepted universally in the world. [59:23.28] So those are things worth thinking about. [59:25.08] And I think open source is not winning in certain ways. [59:28.16] Like these are all the things in which like, as I mentioned, [59:31.28] it's actually being very good and beneficial and winning. [59:35.20] I think one of the ways in which it's not winning, [59:37.48] at some point I should write a long form post about this, [59:40.36] is I think it has a classic coordination problem. [59:44.40] I mean, open source in general always has a coordination problem.
[59:47.80] If there's a vertically integrated provider with more resources, [59:51.68] they will just be better coordinated than open source. [59:54.44] And so now open source has to figure out [59:57.24] how to have coordinated benefits. [59:59.00] And the reason you want coordinated benefits [60:01.20] is because these models are getting better [60:05.72] based on human feedback. [60:07.96] And if you see with open source models, [60:10.04] like if you go to like the Reddit LocalLlama subreddit, [60:14.20] like there's so many variations of models [60:16.96] that are being produced from, say, Nous Research. [60:20.68] I mean, like there's like so many like variations [60:24.24] built by so many people. [60:25.84] And one common theme is they're all using these fine-tuning [60:29.96] or human preferences datasets that are very limited [60:33.64] and like someone published them somewhere [60:36.48] and like they're not sufficiently diverse. [60:40.44] And you look at the other side, like say frontends like Ooba [60:44.48] or like HuggingChat or Ollama, [60:47.64] they don't really have like feedback buttons. [60:50.08] Like all the people using all these frontends, [60:52.88] they probably want to give feedback, [60:55.28] but there's no way for them to give feedback. [60:57.80] So these models are being built. [61:00.36] They're being arbitrarily measured. [61:02.44] And then they are being deployed into all these open source frontends [61:05.68] or like apps that are closed source, [61:07.84] they're serving open source models. [61:09.64] And these frontends don't have, [61:11.84] they are not exposing the ability to give feedback. [61:14.92] So we're just losing all of this feedback. [61:18.60] Maybe open source models are being as used as GPT is [61:22.24] at this point in like all kinds of, [61:24.80] in a very fragmented way.
[61:26.48] Like in aggregate, all the open source models together [61:28.76] are probably being used as much as GPT is, [61:31.28] maybe, you know, close to that. [61:33.96] But the amount of feedback that is driving back [61:36.96] into the open source ecosystem is like negligible, [61:39.88] maybe less than 1% of like the usage. [61:42.36] So I think like some, like the blueprint here, I think is, [61:48.00] you'd want someone to create a sinkhole for the feedback. [61:51.04] Some centralized sinkhole, like maybe Hugging Face or someone [61:54.08] just funds like, okay, like I will make available a call [61:58.08] to log a string along with like, you know, [62:01.20] a bit of information of positive or negative [62:03.36] or something like that. [62:04.28] And then you would want to send pull requests [62:06.52] to all the like open source frontends, [62:08.56] like Ooba and all, being like, [62:10.36] hey, we're just integrating like a feedback UI. [62:12.92] And then work with like the closed source people [62:14.76] too, being like, look, it doesn't cost you anything. [62:17.72] Just like have a button. [62:19.16] And then the sinkhole will have a bunch of this data coming in. [62:23.76] And then I think a bunch of open source researchers [62:26.40] should figure out how to filter the feedback [62:28.68] into only like the high quality one. [62:30.44] I'm sure like it will be exploited by spam bots [62:32.56] or whatever, right? [62:33.32] Like this is like the perfect way [62:35.08] to inject your advertising product into like the next. [62:38.92] - Buy Coca-Cola now. [62:40.88] - So there needs to be some level of that. [62:43.84] That in the same way, I'm sure like, [62:45.96] like all the closed providers are doing today, [62:48.88] like OpenAI, Claude, like the feedback that comes in, [62:52.72] I'm sure they are figuring out if that's legit or not. [62:56.40] That kind of data filtering needs to be done. [62:59.04] And that loop has to be set up.
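[Editor's note: the "feedback sinkhole" described above is essentially a single shared logging call that any frontend could wire to a thumbs-up/down button. Here is a hypothetical sketch; the function name, record schema, and in-memory store are all invented for illustration and do not correspond to any existing Hugging Face API.]

```python
import json
import time

# In a real sinkhole this would be a central service, not a local list.
FEEDBACK_LOG = []

def log_feedback(model_id: str, prompt: str, response: str, positive: bool):
    """Append one feedback record. A real implementation would POST this to
    a central endpoint, and downstream researchers would filter out spam
    and low-quality entries before using it for preference training."""
    record = {
        "ts": time.time(),
        "model": model_id,
        "prompt": prompt,
        "response": response,
        "label": "positive" if positive else "negative",
    }
    FEEDBACK_LOG.append(record)
    return record

# A frontend would call this from its feedback-button handler:
rec = log_feedback("llama-3-8b", "What is 2+2?", "4", positive=True)
print(json.dumps(rec, indent=2))
```

The point of the blueprint is that this one call, integrated across many frontends, aggregates the distributed usage signal that open models currently lose.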
[63:03.40] And this requires that central sinkhole [63:05.96] and that like data cleaning effort both to be like there. [63:09.40] They're not there right now. [63:10.84] They're not there right now. [63:11.96] I think for capital reasons, [63:14.20] but also for coordination reasons. [63:15.96] Okay, if that central sinkhole is there, [63:17.64] who's going to go coordinate all of this integration [63:20.40] across all of these like open source front ends. [63:23.44] But I think if we do that, if that actually happens, [63:27.80] I think that probably has a real chance [63:31.40] of the open source models having a runaway effect [63:33.68] against OpenAI with their current like [63:36.80] daily active users, rumored. [63:39.44] Probably doesn't have a chance against Google [63:41.40] because you know, Google has Android and Chrome [63:45.28] and Gmail and Google Docs and everything, you know. [63:50.08] So people just use that a lot. [63:52.56] But like, I think like there's a clear chance [63:56.44] we can take at truly winning open source. [64:00.92] - Do you think this feedback is helpful [64:02.48] to make open source models better [64:04.40] or to get to like open source AGI? [64:07.16] Because in a way like OpenAI's goal is to get to AGI, right? [64:10.36] So versus I think in open source [64:12.60] we're more focused on personal better usage [64:15.28] or like commercial better usage. [64:16.64] - Yeah, I think that's a good question. [64:17.76] But I think like, I actually don't think [64:20.84] people have a good understanding of AGI [64:23.60] and I don't mean definition level. [64:25.16] I mean, people are like, okay, we're going to AGI means [64:29.44] it's powering 40% of world economic output [64:32.88] or something like that, right? [64:35.56] But what does that mean? [64:37.56] So do you think electricity is powering 40% [64:41.28] of world economic output or is it not? 
[64:44.40] Like generally the notion of like powering [64:47.48] X% of economic output is not defined well at all [64:53.16] or made to understand like how to know when we got to AGI [64:57.40] or how to measure whether we're getting to AGI. [65:00.52] Like, you know, you can look at it in terms of intelligence [65:03.32] or task automation, whatever. [65:05.90] I think that's what we are doing right now. [65:08.08] We're basically integrating like the current set [65:10.36] of AI technologies into so many real world use cases [65:15.00] where we find value that if some new version of AI comes in, [65:20.12] we can find, like we can be like, ah, this helps me more. [65:23.56] In that sense, I think like the whole process [65:26.32] of like how we think we got to AGI will be continuous [65:29.84] and not discontinuous like how I think [65:33.40] the question is posed. [65:35.20] So I think the open source thing will be very much in line [65:40.20] with getting to AGI because open source has [65:43.32] that natural selection effect. [65:45.28] Like if a better open source model comes, [65:47.44] really no one says, huh, I don't want to use it [65:51.04] because there are ecosystem effects, [65:53.64] I'm logged into my ecosystem or like, [65:56.56] I don't know if I like the models, you know, whatever. [65:59.60] It's just a very pure direct thing. [66:02.96] So if there's a better model that comes out, [66:06.08] then it will be used. [66:08.32] So I definitely think it has a good chance of achieving [66:12.68] what I would think about as a continuous path [66:16.04] to what we might define as AGI. [66:18.72] - For the listeners, I would actually mention [66:20.48] a couple other maybe related notes on just [66:22.72] this very interesting concept of a feedback sinkhole [66:25.92] for open source to really catch up [66:27.96] in terms of the overall Google versus OpenAI debate. [66:31.80] Open Assistant was led by Yannic Kilcher, [66:35.16] who recently ended his effort.
[66:36.52] I think the criticism there was like the kind of people [66:38.40] that go to a specific website to give feedback [66:41.52] is not representative of real world usage. [66:43.44] And that's why the models trained on Open Assistant [66:46.12] didn't really seem like they have caught on [66:48.40] in the open source world. [66:49.56] The two leading candidates in my mind are LMSys [66:51.88] out of UC Berkeley, who have the LMSys arena, [66:54.88] which is being touted as one of the only ways, [66:57.84] only reliable benchmarks anymore. [66:59.36] I kind of call them non-parametric benchmarks [67:01.52] 'cause there's nothing to cheat on it except for Elo. [67:05.56] And then the other one is OpenRouter, [67:07.36] which is Alex Atallah's thing. [67:08.64] I don't know if you've talked to any of these people. [67:11.08] - I obviously know all of the efforts that you talked about. [67:15.72] I haven't talked to them directly about this yet, [67:18.48] but the way I think about it is [67:20.36] the way these models are going to be used [67:22.56] is always going to be way more distributed than centralized. [67:26.04] Like which is the power of the open source movement. [67:29.04] Like the UI within which these models are going to be used [67:32.92] is going to be decentralized. [67:35.28] Like it's, these models are going to be integrated [67:37.32] into like hundreds and thousands of projects [67:40.76] and products and all of that, right? [67:42.92] And I think that is important to recognize. [67:45.76] Like the LMSys leaderboard is the best thing we have right now [67:50.08] to understand whether a model is better or not [67:53.04] versus another model. [67:54.80] But it's also biased in only having a sliver of view [67:59.24] into how people actually use these models. [68:01.04] Like the people who actually end up coming [68:03.04] to the LMSys leaderboard and then using a model [68:06.12] only use it for certain things.
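[Editor's note: for reference, the Elo scheme behind arena-style leaderboards works roughly as follows. This sketch uses the textbook chess constants (K=32, logistic scale 400); LMSys's actual published ratings come from a more involved statistical fit, so treat this as an illustration of the idea only.]

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise battle.

    The expected score of A is a logistic function of the rating gap;
    the winner gains rating in proportion to how surprising the win was.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start equal; model A wins one human-voted battle.
a, b = elo_update(1000.0, 1000.0, a_won=True)
print(a, b)  # 1016.0 984.0 — an even matchup moves each rating by K/2
```

This is why the benchmark is hard to game directly: the only input is pairwise human votes, and a model's rating moves only when real users prefer one response over another.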
[68:08.16] Like GitHub Copilot style usage is not captured [68:12.36] in say like LMSys things. [68:13.92] And so many other styles, [68:15.24] like the Character AI style things, is not captured in LMSys. [68:19.48] - Which OpenRouter could do. [68:20.92] They don't do it right now, but. [68:22.12] - Yeah, so like, I think like, yeah, my point is like, [68:25.28] the way these models are going to be used [68:27.24] is going to be always a large surface area. [68:30.84] And I think we need to figure out [68:32.12] how to provide infrastructure to integrate [68:35.40] with all these like ways in which it's being used. [68:39.08] Even if you get like the top hundred frontends [68:42.72] that the model, like the open source models are used through, [68:46.24] to subscribe to like the sinkhole, [68:48.72] I think that's already like a substantial thing. [68:51.04] I think like thinking one or two things [68:54.20] built by themselves get a lot of data, [68:56.72] I think is not going to happen. [68:58.64] - Yeah, fair enough. [68:59.80] Before we let you go, [69:01.08] can we do just a quick beyond text segment? [69:03.80] So you're an investor in Runway, [69:05.88] which does video generation. [69:07.60] You're an investor in 1X, [69:08.88] which is a humanoid assistant. [69:11.72] Osmo, which is focused on using AI [69:14.12] for smell recognition and synthesis. [69:16.28] You advise a bunch of robotics projects at NYU. [69:19.24] - And he builds his own home robot. [69:21.12] - Yeah, exactly. [69:22.92] On a more, yeah, maybe open anything. [69:24.68] What are like the things that you're most excited about [69:27.00] beyond like text generation and kind of the more mundane usage? [69:30.92] - Yeah, I mean, in general, [69:32.68] I have more things I'm generally excited about [69:35.32] than I can possibly do. [69:37.80] Investing is one way to try to clear those urges.
[69:42.80] I'm generally excited about robotics being a possibility, [69:48.80] home robotics being like five to seven years away [69:52.36] from commercialization. [69:53.84] I think it's not like next year or two years from now, [69:57.96] but like five to seven years from now, [70:00.32] I think a lot more robotics companies might pop out. [70:04.24] There's not a good consensus [70:06.08] on whether hardware is a bottleneck [70:08.08] or AI is a bottleneck in robotics right now. [70:10.72] My view is actually hardware is still the bottleneck, [70:14.88] and AI is also a little bit of a bottleneck, [70:17.64] but I don't think there's any obvious breakthroughs we need. [70:23.16] I think it's just work. [70:24.64] So I'm generally excited about robotics. [70:26.48] I spend a lot of personal time. [70:27.98] I spend like every Wednesday afternoon at NYU [70:30.96] working with Lerrel Pinto and team. [70:33.44] And just getting towards my like home robot [70:36.38] that just does my dishes and stuff. [70:38.32] - What's the status of it? [70:39.32] Like what does it do for you now? [70:41.20] - As of today, we just deployed a couple of months ago, [70:45.40] we deployed our home robotics stuff [70:47.72] into like several tens of New York City homes [70:52.36] and like tried to make it do a bunch of tasks. [70:55.16] And we're basically starting to build out a framework [70:59.40] that gets to a certain level of robustness [71:02.20] on fairly simple tasks. [71:04.60] Like, you know, picking this cup [71:06.44] and putting it somewhere else [71:07.64] or like taking a few pieces of cloth on the ground [71:10.64] and putting them somewhere else or opening your microwave. [71:14.92] Like various like baseline tasks like that [71:18.72] with low sample complexity.
[71:21.00] So I think one of the things people don't spend [71:23.12] their time on in robotics is like the user experience, [71:25.92] which I think we, in the research I do at NYU, [71:29.76] we spend a huge amount of time on. [71:31.88] I think the key there is sample complexity has to be really low. [71:35.40] A lot of the current robotics research, if you see there, [71:38.08] it's like, oh yeah, we collected like 50 demos [71:40.28] and now it's able to do this task, [71:42.08] or we collected like 300 demos, [71:44.48] or like the number of samples you need [71:46.60] for this thing to do the task is really high. [71:48.96] So we're focusing a lot on, [71:50.72] you show it like two or three times [71:53.12] and that's sufficient for it to actually like do the task. [71:56.72] But it comes with like less generalization, right? [71:59.52] Like there's some initial conditions [72:01.36] that have to be true for it to do the task. [72:03.64] So we're making progress. [72:05.08] That's very interesting in general, the space. [72:07.84] I don't think people in the space [72:09.44] have settled on the hardware, [72:11.88] like how the hardware looks like [72:14.12] for it to be truly useful in the home or whatever, [72:16.88] or the UX or the like AI/ML stuff needed [72:21.88] to make it sample efficient and all of that. [72:25.08] But I think like lots of work is happening in the field. [72:28.84] - Yeah, one of my friends, Carlo at Berkeley, [72:31.20] he worked on a project called M3L, [72:33.08] which is two CNNs, one for tactile feedback [72:36.48] and one for image. [72:37.80] When you say hardware, [72:38.68] is it running all these things on the edge [72:41.56] or is it just like the actual servos?
[72:43.72] - Yeah, hardware, I mean like the actual like servos, [72:48.24] like the motors, servos, even like the sensors. [72:53.24] I think we have incredible vision [72:56.24] that is still like so much better, [72:59.48] in field of view and in resolution, [73:01.36] compared to any of the cameras we can buy. [73:03.92] We have, our skin is like all-over touch sensing, [73:08.76] and we have like some of the most efficient, [73:12.00] you know, some of the most high capacity motors [73:14.76] that can lift large loads, you know, [73:17.32] in like the dexterity of a hand and stuff. [73:20.44] So in terms of hardware, I mean like in terms [73:23.28] of those capabilities, like, you know, [73:25.36] we haven't figured out how to do a lot of this stuff. [73:28.44] I mean, Tesla has been making incredible progress. [73:31.24] 1X, I think, announced their new thing [73:35.24] that looks incredible. [73:36.72] Some of the other companies, Figure [73:38.24] and like others, are doing great work, [73:40.56] but we're really not anywhere close to like the hardware [73:43.68] that we feel like we need. [73:45.84] And there's obviously the other thing I want to call out: [73:49.48] a lot of what people show works, [73:52.24] but like has to be fixed all the time. [73:53.92] I mean, like that's the other thing we are incredible at. [73:57.12] Like we don't need any maintenance, [73:58.84] or like the maintenance is part of us. [74:01.32] If you buy a product, electronics product of any kind, [74:04.00] you buy a PS5, you don't say, [74:06.40] oh yeah, my PS5 breaks like every six days [74:09.00] and I have to like do some reasonable amount of work on it. [74:11.60] But like that's robotics. [74:13.32] Like if it's not industrial robotics, [74:15.16] where it's very controlled and specialized or whatever, [74:18.20] like you're talking about reliability like in those ranges.
[74:21.64] So I think people don't talk about [74:24.24] the reliability thing enough. [74:25.40] Like what I mean, like we're going to enter [74:27.28] the commercialization phase. [74:28.52] I mean like we're going to start thinking about, okay, [74:31.48] now we have this thing and we need to figure out [74:33.08] how to get reliability high enough to deploy it into homes [74:36.12] and like just sell it to people at like Best Buy or something. [74:40.04] So that's the other factor [74:41.72] that we have to make a lot of progress on. [74:44.24] - I just realized that Google has a play in this [74:47.36] with like PaLM-E and stuff, [74:48.68] and OpenAI obviously has a long history of doing this stuff. [74:51.92] Is there anything in Meta? [74:53.92] No robotics stuff in Meta. [74:55.52] - We have a small robotics program at Meta, out of FAIR. [74:58.96] I actually used to do it at FAIR a little bit [75:01.28] before I moved into infra and focused my Meta time [75:05.04] on a lot of like other infrastructural stuff. [75:07.44] So yeah, Meta's robotics program is a lot smaller. [75:10.88] - Seems like it would be a personal computing play. [75:14.40] - You can think of it as like, [75:15.72] Meta has a ridiculously large device strategy, right? [75:19.76] Like, you know, this is how our Reality Labs stuff, [75:23.24] like, you know, we're going at it from VR and AR [75:25.88] and you know, we showcase a lot of stuff. [75:28.32] I think for Meta, the robot is not as important [75:32.40] as like the physical devices. [75:35.48] Physical devices kind of stuff, for sure. [75:38.64] - Okay, I want to touch on Osmo a bit, [75:40.24] because it's a very unusual company to the stuff [75:42.76] that we normally discuss, not robotics, sense of smell. [75:46.60] The original pitch I heard from the founder, [75:48.08] maybe you can correct me, is that you realize [75:50.28] that you can smell cancer. [75:52.36] Is that intuitive? [75:53.84] Is that what you get?
[75:54.68] Or is it the potential that you seek? [75:56.40] - The very interesting reason I invested in Osmo [75:59.96] is because Alex Wiltschko, the founder of Osmo, [76:03.56] before PyTorch, there was Torch. [76:05.52] And Alex Wiltschko actually worked on Torch. [76:08.08] He's actually like a frameworks guy. [76:10.28] Like, you know, he built this thing called Tangent [76:12.52] at Google, like another like autodiff framework and stuff. [76:16.04] Like, so I know him from that side of things. [76:18.52] And then, like, I also, like, [76:20.20] he is a neurobiologist by training. [76:22.44] He just happens to also love like neural networks [76:26.00] and like hacking on those frameworks. [76:28.56] So incredibly smart guy, one of the smartest people I know. [76:32.44] So when he was going in this direction, [76:34.68] I thought it was incredible that like smell [76:38.00] is something that we haven't even started to scrape [76:42.24] in terms of digitization. [76:44.00] When we think about audio or images or video, [76:47.96] they're like so advanced. [76:49.72] So we have the concept of color spaces. [76:52.60] We have the concept of like frequency spectrums. [76:55.68] Like, you know, we figured out how ears process [76:58.44] like frequencies in mel spectrum or whatever, [77:00.72] like logarithmically scaled. [77:03.36] Images, we're like RGB, YUV. [77:04.92] Like we have so many different kinds of parametrizations. [77:07.88] We have formalized these two senses ridiculously well. [77:12.88] Touch and smell, nada. [77:16.08] We're like where we were with images in, say, 1920 [77:19.96] or maybe even the 1800s, right? [77:22.52] That's where we're at. [77:23.52] And Alex has this incredible vision [77:26.04] of like having a smell sensor [77:30.30] just eventually just be part of your daily life. [77:34.36] Like, as of today, you don't really think about, [77:38.12] like when you're watching an Instagram reel or something,
[77:40.40] Huh, like I also would love to know what it smelled like, [77:44.72] you know, when you're watching a reel of food or something. [77:48.04] You don't, because we really haven't as a society [77:52.20] got that muscle to even understand [77:54.96] what a smell sensor can do. [77:57.54] I think the more near term effects are obviously [78:00.38] going to be around things that provide more obvious utility [78:04.52] in the short term, like maybe smelling cancer [78:07.68] or like repelling mosquitoes better [78:10.60] or you know, stuff like that. [78:12.64] - More recently, he's been talking about [78:14.00] like categorizing perfumes, obviously. [78:15.48] - Yeah, exactly. [78:16.32] - That's a market that you can pursue. [78:17.52] - Yeah, like, I mean, think about how you can customize [78:21.28] a perfume to your own liking in the same way [78:24.28] you can customize a shoe or something, right? [78:27.48] So that like, that's I think all the near term stuff. [78:29.80] I think if he's able to figure out [78:32.36] a near term value for it, [78:34.40] they as a company can sustain themselves [78:37.08] to then eventually like try to make progress [78:39.52] on the long term, which is really uncharted territory. [78:44.04] Like you think about it, 50 years from now, [78:47.04] it would be pretty obvious to like kids of the generation [78:50.28] to just like, I was going to say, [78:51.84] scroll the reel on their phone, [78:53.28] maybe not phone, they're just like, you know, [78:56.72] on their glasses, they're watching something. [78:59.44] - I think it would be in VR. [79:00.28] - And then like, they immediately get like a smell sense [79:04.28] off that remote experience as well. [79:06.72] Like we haven't really progressed enough in that dimension, [79:10.62] and I think they have a chance to do it. [79:13.00] - Awesome. I mean, we touched on a lot of things.
[79:14.76] Anything we're missing, anything you want to direct people to [79:18.32] or call to action, call for research, call for startups. [79:22.88] - I don't really have a lot of calls to action [79:24.86] because usually I think people should be intrinsically, [79:28.12] like that's a good look inside yourself. [79:31.08] - Yeah. That's good. [79:33.68] Awesome. Thank you so much for coming on. [79:35.12] - Yeah. This was great. [79:36.16] Thanks a bit. [79:37.76] (upbeat music)