Intro. [Recording date: March 25, 2025.]
Russ Roberts: Today is March 25th, 2025, and my guest is podcaster and author, Dwarkesh Patel. You can find him on YouTube and on Substack at Dwarkesh.com. He is the author, with Gavin Leech, of The Scaling Era: An Oral History of AI, 2019-2025, which is our topic for today, along with many other things, I suspect. Dwarkesh, welcome to EconTalk.
Dwarkesh Patel: Thanks for having me on, Russ. I’ve been a fan–I was just telling you–ever since, I think, probably before I started my podcast, I’ve been a big fan, so it’s actually really cool to get to talk to you.
Russ Roberts: Well, I really appreciate it. I admire your work as well. We’re going to talk about it some.
Russ Roberts: You start off saying, early in the book–and I should say, this book is from Stripe Press, which produces beautiful books. Unfortunately, I saw it in PDF [Portable Document Format] form; it was pretty beautiful even in PDF form, but I’m sure it’s even nicer in its physical form. You say, ‘We need to see the last six years afresh–2019 to the present.’ Why? What are we missing?
Dwarkesh Patel: I think there’s this perspective in the popular conception of AI [artificial intelligence], maybe even when researchers talk about it, that the big thing that’s happened is we’ve made these breakthroughs in algorithms. We’ve come up with these big new ideas. And that has happened, but the backdrop is just these big-picture trends–most importantly the trends in the buildup of compute and in the buildup of data. Even these new algorithms come about as a result of this sort of evolutionary process where, if you have more compute to experiment on, you can try out different ideas. You wouldn’t have known beforehand why the transformer works better than the previous architectures if you didn’t have more compute to play around with.
And then when you look at why we went from GPT-2 to GPT-3 to GPT-4 [Generative Pre-trained Transformer] to the models we’re working with now–again, it’s a story of dumping in more and more compute. That raises a bunch of questions about: Well, what is the nature of intelligence such that you just throw a big blob of compute at a wide distribution of data and you get this agentic thing that can solve problems on the other end? And it raises a bunch of other questions about what will happen in the future.
But, I think that trend–this 4X-ing [four times] of compute every single year, investment increasing to the level where we’re at hundreds of billions of dollars now on something which was an academic hobby a decade ago–is the missed trend.
Russ Roberts: I didn’t mention that you’re a computer science major, so you know some things that I really don’t know at all. What is the transformer? Explain what that is. It’s a key part of the technology here.
Dwarkesh Patel: So, the transformer is this architecture that was invented by some Google researchers in 2017, and it’s the fundamental architectural breakthrough behind ChatGPT and the kinds of models that you play around with when you think about an LLM [large language model].
And, what separates it from the kinds of architectures before is that it’s much easier to train in parallel. So, if you have these huge clusters of GPUs [Graphics Processing Units], a transformer is just much more practicable to scale than other architectures. And that allowed us to just keep throwing more compute at this problem of trying to get these things to be intelligent.
And then the other big breakthrough was to combine this architecture with just this really naive training process of: Predict the next word. And you wouldn’t have–now we just know that this is how it works, and so we’re, like, ‘Okay, of course that’s how you get intelligence.’ But it’s actually really interesting that you predict the next word in WikiText, and as you make it bigger and bigger, it picks up these longer and longer patterns, to the point where now it can just totally pass a Turing Test and can even be helpful in certain kinds of tasks.
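To make that “predict the next word” objective concrete, here is a minimal Python sketch. It assumes PyTorch, and it stands in a toy embedding-plus-linear model (effectively a bigram predictor) for a real transformer; the training text, model sizes, and hyperparameters are illustrative assumptions, not how any lab actually trains an LLM.

```python
# Minimal sketch of the "predict the next word" objective, assuming PyTorch.
# A toy embedding + linear model (effectively a bigram predictor) stands in for
# a real transformer; the training text and sizes are illustrative only.
import torch
import torch.nn as nn

text = "the cat sat on the mat . the dog sat on the rug ."
tokens = text.split()
vocab = sorted(set(tokens))
stoi = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([stoi[w] for w in tokens])

x, y = ids[:-1], ids[1:]  # input: each token; target: the token that follows it

class ToyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        return self.head(self.embed(idx))  # logits over the next token

model = ToyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(200):
    loss = nn.functional.cross_entropy(model(x), y)  # penalize surprise at the real next word
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, the model puts high probability on words that actually followed "the".
probs = torch.softmax(model(torch.tensor([stoi["the"]])), dim=-1)
print(vocab[probs.argmax().item()])
```

Scaling that same recipe up, with a real transformer and vastly more text and compute, is the process the book calls the scaling era.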
Russ Roberts: Yeah, I think you said it gets “intelligent.” Obviously that was a–you had quotes around it. But maybe not. We’ll talk about that.
At the end of the first chapter, you say, “This book’s knowledge cut-off is November, 2024. This means that any information or events occurring after that time will not be reflected.” That’s, like, two eons ago.
Dwarkesh Patel: That’s right.
Russ Roberts: So, how does that affect the book in the way you think about it and talk about it?
Dwarkesh Patel: Obviously, the big breakthrough since then has been inference scaling–models like o1 and o3, even DeepSeek’s reasoning model. In an important way, it is a big break from the past. Previously, we had this idea that pre-training, which is just making the models bigger–think GPT-3.5 to GPT-4–is where progress is going to come from. It does seem that that alone is slightly disappointing: GPT-4.5 was released, and it’s better, but not significantly better, than GPT-4.
So, the next frontier now is this: How much juice can you get out of taking these smaller models and training them towards a specific objective? So, not just predicting internet text, but: Solve this coding problem for me, solve this math problem for me. And how much does that get you–because those are the kinds of verifiable problems where you know the solution; you just get to see if the model can get to that solution. Can we get some purchase on slightly harder tasks, which are more ambiguous–probably the kind of research you do–or also the kinds of tasks which just require a lot of consecutive steps? The model still can’t use a computer reliably, and that’s where a lot of economic value lies. To automate remote work, you actually have to be able to do remote work. So, that’s the big change.
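A rough sketch, in Python, of what “verifiable” buys you: when the task has a known solution, the grader is just an exact check, and that check can serve as a training signal. This is only an illustration of the idea; the real reinforcement-learning pipelines at the labs are far more involved.

```python
# Toy "verifiable reward": the grader only needs to compare the model's final answer
# with a known solution. Illustrative only; real RL training setups are far more involved.
def reward(model_answer: str, known_solution: str) -> float:
    """Return 1.0 if the final answer matches the known solution exactly, else 0.0."""
    return 1.0 if model_answer.strip() == known_solution.strip() else 0.0

print(reward("42", "42"))  # 1.0 -> reinforce the chain of reasoning that produced it
print(reward("41", "42"))  # 0.0 -> don't
```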
Russ Roberts: I really appreciate you saying, ‘That’s the kind of research you do.’ The kind of research I do at my age is what is wrong with my sense of self and ego that I still need to do X, Y, Z to feel good about myself? That’s the kind of research I’m looking into. But I appreciate–I’m flattered by your presumption that I was doing something else.
Russ Roberts: Now, I have become enamored of Claude. There was a rumor that Claude is better with Hebrew than other LLMs. I don’t know if that’s true–obviously because my Hebrew is not good enough to verify that. But I think if you ask me, ‘Why do you like Claude?’ it’s an embarrassing answer. The typeface is really–the font is fantastic. The way it looks on my phone is beautifully arrayed. It’s a lovely visual interface.
There are some of these tools that are much better than others for certain tasks. Do we know that? Do the people in the business know that and do they have even a vague idea as to why that is?
So, I assume, for example, some might be better at coding, some might be better at deep research, some might be better at thinking–meaning, taking time before answering–and it makes a difference. But, for many things that normal people would want to do, are there any differences between them that we know of? And do we know why?
Dwarkesh Patel: I feel like normal people are in a better position to answer that question than the AI researchers. I mean, one question I have is: in the long run, what will be the trend here? So, it seems to me that the models are kind of similar. And not only are they similar, but they’re getting more similar over time, where now everybody’s releasing a reasoning model; and not only that–when they make a new product, they copy not just the product, they copy the name of the product. Gemini has Deep Research and OpenAI has Deep Research.
You could think in the long run maybe they’d get distinguished. And it does seem like the labs are pursuing sort of different objectives. It seems like a company like Anthropic may be optimizing much more for this fully autonomous software engineer, because that’s where they think a lot of the value is first unlocked. And then other labs maybe are optimizing more for consumer adoption or for just, like, enterprise use or something like that. But, at least so far–tell me about your impression, but my sense is they feel kind of similar.
Russ Roberts: Yeah, they do. In fact, I think in something like translation, a truly bilingual person might have a preference or a taste. Actually, I’ll ask you what you use it for in your personal life, not your intellectual pursuits of understanding the field. For me, what I use it for now is brainstorming–help me come up with a way to think about a particular problem–and tutoring. I wasn’t sure what a transformer was, so I asked Claude what it was. And I’ve got another example I’ll give in a little bit. I use it for translation a lot because I think Claude’s much better–it feels better than Google Translate. I don’t know if it’s better than ChatGPT.
Finally, I love asking it for advice on travel. Which is bizarre, that I do that. There’s a zillion sites that say, ‘The 12 best things to see in Rome,’ but for some reason I want Claude’s opinion. And, ‘Give me three hotels near this place.’ I have a trust in it that is totally irrational.
So, that’s what I’m using it for. We’ll come back to what else is important, because those things are nice but they’re not particularly important. What do you use it for in your personal life?
Dwarkesh Patel: Research, because my job as a podcaster means I spend a week or two prepping for each guest, and it helps to have something to interact with as I’m reading–because you read stuff and you don’t get a sense of: Why is this important? How does this connect to other ideas? Getting a constant engagement with your confusions is super helpful.
The other thing is, I’ve tried to experiment with putting these LLMs into my podcasting workflow to help me find clips and automating certain things like that. They’ve been, like, moderately useful. Honestly, not that useful. But, yeah, they are huge for research. The big question I’m curious about is when they can actually use the computer, then is that a huge unlock in the value they can provide to me or anybody else?
Russ Roberts: Explain what you mean by that.
Dwarkesh Patel: So, right now some labs have rolled out this feature called computer use, but they’re just not that good. They can’t reliably do a thing like book you a flight or organize the logistics for a happy hour, or countless other things like that, right? Sometimes people use this frame of: These models are at a high school level; now they’re at a college level; now they’re at a Ph.D. level. Obviously, a Ph.D.–I mean, a high schooler could help you book a flight. Maybe a high schooler especially, maybe not the Ph.D.
Russ Roberts: Yeah, exactly.
Dwarkesh Patel: So, there’s this question of: What’s going wrong? Why can they be so smart in this–I mean, they can answer frontier math problems with these new reasoning models, but they can’t help me organize–they can’t, like, play a brand new video game. So, what’s going on there?
I think that’s probably the fundamental question we’ll learn the answer to over the next year or two: whether these common-sense foibles that they have are some sort of intrinsic problem that we’re underestimating. I mean, one analogy–I’m sure you’ve heard this before–is that when Deep Blue beat Kasparov, there was a sense that a fundamental aspect of intelligence had been cracked. And in retrospect, we realized that actually the chess engine is quite narrow and is missing a lot of the fundamental components that are necessary to, say, automate a worker or something.
I wonder if, in retrospect, we’ll look back at these models the same way: in the version where I’m totally wrong and these models aren’t that useful, we’ll just think to ourselves, there was something to this long-term agency and this coherence and this common sense that we were underestimating.
Russ Roberts: Well, I think until we understand them a little bit better, I don’t know if we’re going to solve that problem. You asked the head of Anthropic something about whether they work or not. You said, “Fundamentally, what is the explanation for why scaling works? Why is the universe organized such that if you throw big blobs of compute at a wide enough distribution of data the thing becomes intelligent?” Dario Amodei of Anthropic, the CEO [Chief Executive Officer] said, “The truth is we still don’t know. It’s almost entirely just a [contingent] empirical fact. It’s a fact that you could sense from the data, but we still don’t have a satisfying explanation for it.”
That unknowing seems like a large barrier–a large barrier to making them better at actually being a virtual assistant: not just giving me advice on Rome, but booking the trip, booking the restaurant, and so on. Without that understanding, how are we going to improve the quirky part, the hallucinating part of these models?
Dwarkesh Patel: Yeah. Yeah. This is a question I feel like we will get a lot of good evidence on in the next year or two. I mean, another question I asked Dario in that interview, which I feel like I still don’t have a good answer for, is: Look, if you had a human who had as much stuff memorized as these LLMs have–they know basically everything that any human has ever written down–even a moderately intelligent person would be able to draw some pretty interesting connections, make some new discoveries. And we have examples of humans doing this. There’s one guy who figured out that, look, if you look at what happens to the brain when there’s a magnesium deficiency, it actually looks quite similar to what a migraine looks like; and so you could solve a bunch of migraines by giving people magnesium supplements or something, right?
So, why don’t we have evidence of LLMs using this unique asymmetric advantage they have toward some intelligent end in this creative way? There are answers to all these things. People have given me interesting answers, but a lot of questions still remain.
Russ Roberts: Yeah. Why did you call your book The Scaling Era? That suggests there’s another era coming sooner-ish, if not soon. Do you know what that’s going to be? It’ll be called something different. Do you know what it’ll be called?
Dwarkesh Patel: The RL [reinforcement learning] era? No, I think it’ll still be the–so, scaling refers to the fact that we’re just making these systems, like, hundreds, thousands of times bigger. If you look at a jump from something like GPT-3 to GPT-4, or GPT-2 to GPT-3, it means that you have 100X’d the amount of compute you’re using on the system. It’s not exactly like that, because over time you find ways to make the model more efficient as well; but basically, if you use the same architecture to get the same amount of performance, you would have to 100X the compute to go from one generation to the next. So, that’s what that’s referring to: there is this exponential buildup in compute to go from one level to the next.
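As a back-of-the-envelope check–assuming the roughly 4X-per-year compute growth mentioned earlier–here is the arithmetic linking that annual rate to the roughly 100X jump between model generations:

```python
# How many years of ~4X-per-year compute growth add up to a ~100X generation jump?
# Purely arithmetic; the 4X and 100X figures are the rough numbers quoted in the conversation.
import math

annual_growth = 4.0
generation_jump = 100.0
years = math.log(generation_jump) / math.log(annual_growth)
print(f"{years:.1f} years")  # ~3.3 years of 4X-ing ~= one 100X generational jump
```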
The big question going forward is whether we’ll see this–I mean, we will see this pattern because people will still want to spend a bunch of compute on training the systems, and we’re on schedule to get big ramp-ups in compute as the clusters that companies ordered in the aftermath of ChatGPT blowing up are now coming online. Then there’s questions about: Well, how much compute will it take to make these big breakthroughs in reasoning or agency or so forth?
But, stepping back and just looking a little further forward to AGI–
Russ Roberts: Artificial General Intelligence–
Dwarkesh Patel: That’s right. There will come a time when an AGI can run as efficiently as a human brain–at least as efficiently, right? So, a human brain runs on 20 watts. An H100, for example, takes on the order of 1,000 watts, and that can store maybe the weights for one model or something like that.
We know it’s physically possible for the amount of energy the human brain uses to power a human-level intelligence, and maybe it’s going to get even more efficient than that. But, before we get to that level, we will build an AGI which costs a Montana’s-worth of infrastructure and $100 billion of CapEx, and is clunky in all kinds of weird ways. Maybe you have to use some sort of inference-scaling hack. By that, what I mean to refer to is this idea that often you can crack puzzles by having the model think for longer. In fact, it weirdly keeps scaling as you add not just one page of thinking, but 100 pages of thinking, 1,000 pages of thinking.
I often wonder–so, there was this challenge that OpenAI solved with these visual-processing puzzles called ARC-AGI [Abstraction and Reasoning Corpus for Artificial General Intelligence], and it kept improving up to 5,000 pages of thinking about these very simple visual challenges. And I kind of want to see: What was on page 300? What big breakthrough did it have there that made the difference?
But, anyways, there is this hack where you keep spending more compute on thinking, and that gives you better output. So, that’ll be the first AGI. And because it’s so valuable to have an AGI, we’ll build it even in the most inefficient way: the first one we build won’t be the most physically efficient one possible. But, yeah.
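One toy way to see why spending more compute on “thinking” can keep paying off on verifiable puzzles: if each independent attempt succeeds with some small probability, the chance that at least one attempt passes the verifier climbs steadily with the compute spent. The numbers below are invented for illustration; this is a best-of-N caricature, not how o1- or o3-style models are actually built.

```python
# Toy model of inference-time scaling as best-of-N sampling against a verifier.
# Each attempt independently solves the puzzle with probability p; more attempts
# (more compute at inference time) raise the chance that at least one passes.
# The per-attempt success rate is an assumption made purely for illustration.
p = 0.02

for n in (1, 10, 100, 1000):
    p_at_least_one = 1 - (1 - p) ** n
    print(f"{n:5d} attempts -> {p_at_least_one:6.1%} chance of a verified solution")
```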
Russ Roberts: Can you think of another technology where trial and error turned out to be so triumphant? Now, I did a wonderful interview with Matt Ridley a while back on innovation and technology. One of his insights–and I don’t know if it’s his, but it’s one of the things he writes about–is that a lot of times the experts are behind the people who are just fiddling around. He talks about how the Wright brothers were just bicycle guys. They didn’t know anything about aerodynamics particularly. They just tried a bunch of stuff until finally they lifted off the ground. I don’t know if that’s exactly right, but I think it’s close to actually true.
Here we have this world where these unbelievably intellectually sophisticated computer scientists are building these extraordinarily complex transformer architectures, and they don’t know how they work. That’s really weird. If you don’t know how they work, the easiest way to make them better is just to do more of what has worked so far and expect it to eventually cross some line that you’re hoping it will. But, can you think of another technology where trial and error is such an important part of it, alongside the intense intellectual depth of it? It’s really quite unusual, I would guess.
Dwarkesh Patel: I think most technologies–I mean, I would actually be curious to get your take on the economic history and so forth–but I feel like most technologies probably have this element where individual genius is overrated and you’re building up continuously on slight improvements. And often it’s not, like, one big breakthrough in the transformer or something. It’s, like, you figured out a better optimizer. You figured out better hardware. Right? So, a lot of these breakthroughs are contingent on the fact that we couldn’t have been doing the same thing in the 1990s. In fact, people had similar ideas; they just weren’t scaled to a level which helped you see the potential of AI back then. [More to come, 20:25]