Ep 18: 8 Levels of AI Engineering, Meta AI Delays, and LLM Neuroanatomy
Shimin (00:15)
Hello and welcome to Artificial Developer Intelligence, a weekly conversation show about software development aided by our future AI overlords. I am Shimin Zhang and with me today are my co-hosts Dan, let's act like cowboys. Yee-haw. Lasky and Rahul, don't get uppity at work or he will replace you with a bot Yadav, how are you two doing today?
Rahul Yadav (00:38)
Hey, how's it going?
Dan (00:39)
Let's act like cowboys.
Shimin (00:41)
Mm-hmm.
Dan (00:41)
What are you saying? Like I don't write tests or what?
Shimin (00:44)
it's from one of the ⁓ articles that you're going to talk about later.
Dan (00:46)
If so, you're right. Claude writes my tests. You caught me.
but he writes them first.
Shimin (00:51)
Green, red, that's the way to go. On today's show, we're gonna first start with the news thread mill as always, where we're gonna talk about NVIDIA's competitor to OpenClaw, meta delaying its rollout of new AI models and some follow up on the SWE benchmark PRs.
Dan (01:07)
Then very surprisingly, because we haven't really done this in a while on the show, but we're actually going to have a technique corner where we talk about some interesting techniques when working with AI. So we have some, the funnily titled, collective superstitions of people who talk to machines. And I knew Shimin would fall for this one, the eight levels of agentic engineering. Yeah.
Shimin (01:27)
Love it already.
that'll be followed by post-processing where we're going to talk about a New York Times article called, coding after coders and also a Stanford law article called, built by agents, tested by agents, trusted by whom.
Dan (01:42)
who.
Shimin (01:42)
After he lays you off, yes.
Dan (01:43)
And finally, we're going to do a deep dive. And this one is actually not a paper, but we're going to be talking about LLM neuro anotomy with someone's story about how they topped the LLM leaderboard without changing a single weight, which is actually a pretty fascinating story.
Shimin (01:59)
Yeah, and finally, finally, Dan is going to rant about something this week. We'll find out what. I'm excited. It's my favorite segment.
Dan (02:03)
Mm-hmm.
And then we're hopefully gonna close out with a little bit of glossy two minutes to midnight. Stick around to see why it's glossy.
Shimin (02:12)
All right, let us get started. So first article this is, Meta has delayed its rollout of its new AI model, codenamed Avocado, which I think is a pretty cool codename for a model. Everybody loves them. This is why we can't afford homes. It's because we're spending all our money on model token costs.
Dan (02:26)
You
It's not just toast anymore. It's also a model.
Shimin (02:36)
So the reason why Meta has delayed the rollout of Avocado is because apparently it did not do as well as Gemini 3 or the new Anthropic and OpenAI models. It did seem to rumored to be performing as well as Google's Gemini 2.5 model. So as a result, Meta is in discussions
about temporarily licensing Gemini to power the company's AI models. So.
Dan (03:04)
Wow.
Shimin (03:04)
This
is all reporting by Eli Tan from the New York Times. And a of a follow-up that wasn't covered in this article was that Meta is rumored to be planning a 20 % layoff of the entire company in order to spend more of its capital in the AI space.
Rahul Yadav (03:21)
allegedly.
Shimin (03:22)
Allegedly.
Dan (03:23)
mean, it seems plausible given block and everything else.
Rahul Yadav (03:27)
Hahaha
Dan (03:30)
We're only laughing because it's painful, friends.
Shimin (03:30)
Yeah. this,
yeah, it's it's a gallow's humor. meta came out of the gate really strong with, Olamo and, know, with a research group led by Yann LeCun and all of that, deep research talent. And then there's been a lot of turnover in their deep learning department. And then
Yann LeCun left, you his company just raised a billion dollars for his own world model app. And META is not doing so hot.
Dan (03:57)
Yeah, which is really, I mean, all things considered kind of a shame because it's like there's been a lot of discourse recently in my experience about the sort of sovereign models, right? And I think that's largely because like India has now entered the chat in a big way and they're making a big deal out of like sovereign models, which I think is neat. It's kind of a cool approach. But if that's the case, meta is like our
Shimin (04:08)
Mm-hmm.
Dan (04:18)
our sovereign model savior for better or for worse in the States. And them stumbling doesn't feel so great.
Rahul Yadav (04:26)
Why
is Meta our sovereign model, savior? I didn't connect that.
Dan (04:30)
Well, they were
like the maybe not so much sovereign model, for a long time running, they were the open weights leader for US models anyway. ⁓
Rahul Yadav (04:36)
⁓ But they've
Shimin (04:37)
Mm-hmm.
Rahul Yadav (04:39)
also gone closed now.
Dan (04:40)
Right. And then on top of that, uh, their most recent ones haven't been too hot. Anyway, like the last open weights ones, did like 120 B and stuff like that GPT OSS 120 B. So
Rahul Yadav (04:41)
Yeah.
Yeah
Shimin (04:52)
Right.
So Meta has been playing catch up for a while now. And we heard a lot of stories about them paying researchers millions and millions and millions in cash just to poach them from OpenAI in Anthropic. Is it possible that because all top AI researchers already have a FU money, they maybe don't want to work for Meta?
Dan (05:12)
They're just not
motivated.
Shimin (05:14)
Like, yeah, is the meta brand so toxic that like it's not appealing?
Rahul Yadav (05:18)
Was it, what is it you said a few minutes ago about about laying off 20 % of the company? I wonder if that has something to do with
Dan (05:18)
I mean, they're still a fang.
Shimin (05:23)
⁓
Yeah, and completely unrelated news, right? So.
Rahul Yadav (05:29)
Completely.
Dan (05:29)
Yeah, that's
fair.
Wait, so they actually did give up on open weights? Totally. I must have missed that memo too.
Rahul Yadav (05:37)
They've gone closed recently.
Dan (05:40)
Like fully?
I know that they had always been working on closed stuff, but then they've sort of trickled out open weights here and there.
Rahul Yadav (05:45)
Because this avocado
one, was going to be a closed model. like, some of the concern with open weights, which is legit, you could also use those things to do all sorts of terror-related stuff. And so it...
Dan (05:49)
Huh, interesting.
Shimin (05:50)
Mm-hmm.
Rahul Yadav (06:03)
partially makes sense to me to go closed on it and then they also want to like, you know, maintain whatever edge they would have because like one thing that I've been curious about in this whole thing is
Amazon, Microsoft, Google, all of them spending crap tons of money on the AI build out makes sense to me. Meta didn't make sense to me as much. But then looking at the eventual business or the core of the business is making ads and selling ads, right? And so this helps them create almost all the tools you need to create ads and create the most
like engaging ads and engaging content and everything. And then on top of that, they want to replace iPhone or break the dependency from iPhone. That's why they have those like glasses and everything. And those would need a lot of AI capabilities if they want to keep pushing that. ⁓ And then I think they also just, yeah. And they in general don't like.
Dan (06:58)
Well, it's basically the whole interface.
Rahul Yadav (07:03)
being stuck with any one cloud provider. So they're trying to build out their own. Third one to me is the weakest of these three, because sure, but you know, everybody would like to have their own cloud and everything, but it's not your competitive advantage. But the first two, I think is why they're really pushing on this and the...
safety concerns is why they probably would have closed wait and a bunch of probably features that they would want to have exclusive access to in their models and not share with others.
Shimin (07:33)
Right, but you didn't mention Apple, right? Apple also did not spend a crap ton of money building its own model. And now Apple is also licensing Gemini. So, did meta have spent all this money just to end up licensing Gemini? Like, that is a full circle moment here.
Rahul Yadav (07:38)
⁓ sorry. ⁓ apples... Yeah, good. Yeah.
Apple is in fact like
Dan (07:51)
That's a big ouch too.
Rahul Yadav (07:52)
the most fascinating of them because either they're going to look like big idiots or the smartest people at the end of this. ⁓
Dan (07:59)
But they invested
quite a bit too in their own research before eventually, you know, maybe giving up. I mean, they even built a hardware platform to run it, right? Like talk about research costs.
Rahul Yadav (08:11)
Compared
to this money, though, where you're talking about like a trillion dollars, they didn't come anywhere or compared to any one company, as far as I know, they didn't get to like hundreds of billions of dollars, and they like maybe 10 or something, which is for Apple is a rounding error. I'm guessing at this, I don't have hard numbers, but I I don't remember seeing any
know, Apple's trying to swing big and just couldn't get it to work.
Shimin (08:37)
Yeah, we're going to follow this. you know, maybe Apple is taking all its supposed EV investments and putting it into a secret AI lab that we don't know about.
Rahul Yadav (08:45)
or edge
compute, everybody keeps compressing them and then Apple's like, someone needs hardware. I got one in everybody's pockets already.
Dan (08:54)
You
Shimin (08:55)
Okay, moving on, our next news item is from Dan. It's NVIDIA's Nemoclaw.
Dan (09:01)
Mm hmm. so when we first added this, to our docket for this week, it was just a rumor. And now it is confirmed, that Nvidia has apparently worked closely with the, I forget the dude's name, the original dude of open claw.
to create this fork of it. of course, because it's Nvidia, it's designed to work with their Nemo local models and their entire AI stack. So it's very well integrated. But the reason that I wanted to talk about it in this and particularly talk about it with you all is we have kind of a running theme so far over the last few episodes around the
interesting power of these fully autonomous agents that you can message with, but also their security model and how it's not actually proven to be all that hot in practice. And one of the things NVIDIA is promising here that I have not personally evaluated, I'm fascinated to try to maybe try and run this one myself is supposedly this is, they're touting it as being enterprise ready, meaning you can safely unleash it on your org and it will be fine.
So pretty lofty claim, but like, hopefully they've done some actual serious engineering, not just to say that there wasn't serious engineering going on on any of the claws or anything, like, hopefully they've done some really serious thinking about security posture on this and we'll live up to that claim. Cause it'd be pretty neat to actually use one of these things. You know, I, I want to use them. I'm just like, boy. So.
Shimin (10:24)
Mm-hmm.
Same here. ⁓
Nividia is also trying to own the entire vertical stack, right? But Nvidia does not have the incentive to do too much of it because then it loses its best customers who are building everything from the software level, up or down, I guess is this open claw or moltbot or clawbot technology like
the fastest software paradigm adoption cycle that you've ever seen. I could not think of one that happened faster than this. Not even, like even something like cryptocurrency and Bitcoin and NFT's that was bubbling up for a decade before it went mainstream.
Dan (11:08)
I would argue that that's because like many, people probably might, well, yes, myself included have wanted this for a very long time. Right. Like, and that's why I like, you know, Siri, for example, sorry, Apple to beat on you is so disappointing because it's like, this is what Siri was supposed to be. It was this, persistent thing that helps you and puts the threads together so that you don't have to across apps And
Shimin (11:14)
Yeah
Mm-hmm.
Dan (11:32)
That's really the value there. And I think that's exciting for a lot of people, like across, I mean, it's exciting enough that like, like my partners talk to me and be like, how do I get one of these agents? Like, I want it to do research for me and this and that and the other thing. And I'm like, whoa, okay. you know, I've got a podcast about this. We can talk about it later.
Shimin (11:40)
Ha ha ha ha
You
can get your partner like a lobster in a tank and be like, here's a little claw.
Dan (11:55)
It's an open claw for
me.
Shimin (12:00)
Now that you mentioned Siri and I'm also thinking about Alexa, it's almost like Amazon and Apple spent a decade building the demand for this thing just for open claw to finally capture it. Yeah.
Dan (12:11)
Yeah, they were the marketing and then now here it is. Yeah,
it's true. Everything you wanted Alexa to be, but it actually works. And you don't have to talk to it unless you want to.
Shimin (12:19)
without the security concerns.
I'm excited to see where this goes. Rahul, do you have anything to add about this?
Dan (12:23)
Yeah.
Rahul Yadav (12:26)
Same as you, moving in the vertical layer at some point the demand for GPUs must stay strong and do whatever it takes to keep it strong is how I'm seeing it.
Shimin (12:39)
Yeah, Nvidia, a well-run company. What a thought. I'm also reminded of when Nvidia released their open source version of the face generator. That was like one of the biggest use case of generative AI at the time. Yeah. They are actually incredibly strong technically when it comes to AI and you just don't hear as much about it because of their financial conflict of interest. Yeah.
Rahul Yadav (12:50)
Yeah.
Dan (12:51)
early.
Yeah, I
mean they have their own open weights model too.
Rahul Yadav (13:03)
And also the self-driving car, the whole software stack too, I think, not just the hardware. Yeah.
Dan (13:07)
And hardware too. Yeah. mean, like a lot of that stuff wouldn't be possible
Shimin (13:07)
Nope. Nope. Nope.
Dan (13:11)
without their, crunching raw crunching power, being able to fit on a car, which is pretty wild when you think about it.
Rahul Yadav (13:15)
Yeah, yeah.
Shimin (13:18)
You know how a couple of weeks ago, Rahul, was like if I was a company that rhymes with, I don't know, chlamydia, I would be really worried about my future. Apparently not. This is what sleep deprivation does to a man. The rhymes are just off. Okay, moving on. ⁓
Dan (13:31)
Yeah.
Rahul Yadav (13:35)
That's what Jensen
goes about around saying. NIVIDIA it rhymes with chlamydia.
Dan (13:40)
no...
Shimin (13:41)
the jokes write themselves. What do you catch when you- Anyways, okay, our next article also brought by Dan from METR
Dan (13:48)
so this one's also been making rounds this week, which is,
There's been some analysis done of the popular sort of SWE bench, right? Which is like one of the model benchmarks that folks are using. And every time we talk about a model release, I feel like on here we've talked about like the sweet bench stuff. they did a study in which they have proved that
Shimin (13:59)
Mm-hmm.
Dan (14:12)
all of the sweet bench passing pull requests that supposedly like, you know, would have succeeded in the benchmark threshold. and these are on a little bit older models. you know, take this with a grain of salt, but, from 2024 to mid 25, would not have been accepted by repo maintainers.
And so this study doesn't go, doesn't really draw any significant conclusions from this. but there's been a whole bunch of discourse about it saying our model's not improving because if you look at the box plot for that, it's actually not too hard to essentially just draw a flat line through it. when you consider the
the acceptance ratios instead of like purely the, this we bench output, stuff. So that has certainly not been my anecdotal experience, but, you know, I'm just like one data point, man.
Shimin (15:05)
Well, let me add one more data point to your data point. I too feel like the models have been improving. So even though from this graph, you can draw the conclusion that somewhere along the lines of early 2025, there was a step up change and it's just kind of been going sideways for a little bit. Personally speaking, I think a lot of the vibes on the inner webs, know,
says the same thing, which is model models have been steadily improving, but I think a bigger conclusion to draw from this is probably, the original scores were calculated via a automated grader via a set of test suites. Right. So if it passes the test, but actually wouldn't be accepted by the repo maintainer, then
Our existing paradigm of red green testing and test based software, AI assisted software development is probably insufficient.
that software developers are still needed to go through the code review PRs. Cause if you just truly let the dark factories do their thing, then they will probably create PRs that are not up to par, even though they pass all the tests.
Dan (16:15)
Yeah, I think that's not a surprising, at least not surprising to me outcome, because it's never been necessarily about that, right? It's always like, it's good enough and good enough is getting shockingly good.
Shimin (16:25)
You
Yeah, they broke down, what the major reasons for why the maintainers would not merge the code in and they broke it down to code quality, breaks other code, core functionality, et cetera. And I have to say like core functionality and breaking other code, not a large chunk of the, of the reasons why they failed. So if the reason they were not accepting them was mostly code quality, then
Dan (16:47)
Mm-hmm.
Shimin (16:52)
Maybe good enough is good enough.
Rahul Yadav (16:54)
So I would like to nitpick that specific thing because like maybe you just need a good agents MD or Claude MD because right about there it says like code quality example bad style not following repo standards and my brain goes to put all of those in your agents MD and you would likely get you know the output that complies to those standards so yeah.
Dan (17:16)
One of my coworkers
who wouldn't mind me telling this story recently took Claude and pointed it at every single pull request he's made for the past four and a half years and asked it to summarize his coding style and his strengths and weaknesses as a software engineer, which in and of itself is kind of an interesting thing to do that I certainly had never thought to do.
Shimin (17:32)
Mm-hmm. Mm-hmm.
Dan (17:39)
One or two, sure, but not like everything, you know? And then he also summarized it into a like coding this way document that he then like primes his agents with every time. I'm like, that's a pretty cool idea. I hadn't thought of that. So.
Shimin (17:43)
You
Including
all the weaknesses. Yeah.
Dan (17:57)
Probably. Yeah, that's fair.
Shimin (17:58)
Yeah, just to fool the code reviewer. Yeah, like you didn't use AI
on this. yeah, that's something Dan that's kind of mistake then would make.
Dan (18:04)
That's exactly how I would have messed this up. Yeah, that's fair.
Rahul Yadav (18:07)
Five of your
six teammates are making this mistake. You don't have to be one of them.
Dan (18:10)
Yeah.
Yeah, but I still thought it was a neat idea.
Shimin (18:16)
Yeah, definitely a useful technique. Kind of like the called overview feature, almost.
Rahul Yadav (18:21)
What's that feature about?
Shimin (18:23)
is overview. It's like it goes through your cloud usage over like X number of sessions and give you tips for improving insights. That's what I meant to say. Yes.
Rahul Yadav (18:29)
⁓ insights. Yeah,
Dan (18:30)
It's ice, yeah.
Rahul Yadav (18:32)
the slash insights come in. Yeah. Yeah.
Shimin (18:34)
Yeah. Slash insight. Okay.
Onto techniques corner. I got this really interesting article from Scott Werner, who blogs at works on my machine. And this is kind of a meta blog post about AI techniques. And he first draws a analogy to.
Back when we had NES consoles, folks would blow onto the cartridge to fix the connection, even though we don't really know why it actually fixes the machine. If the actual fix was something depositing moisture onto the contacts, which corrodes them, then in the long run, you're actually making the cartridge worse, despite thinking that, hey, this is my magical technique.
And he says that we're doing this right now with our various AI prompting techniques. Our agents down markdown, our Claude down markdowns are just another way of blowing into this magic cartridge and thinking that, you know, it will fix the machine, even though we have no way of actually understanding the mechanism of that act. Furthermore, in an age where prompts
are so easily copyable, he posits the question of like, what does that mean about the writer, the creative act of writing those prompts? Who are you as a software developer if you can just copy another developer's agent.markdown file? And I think that's a really interesting question because I've looked at a lot of prompt.markdowns, but
Am I as good as you you Dan just because I copied your markdowns right? He has this
Dan (20:06)
You're just making
the same mistakes I am. already established that.
Shimin (20:09)
Exactly. And he brings up this historical analogy of a short story where...
Pierre Menard, the author of Don Quixote decides to rewrite the entirety of Don Quixote
But instead of copying or updating it for a modern audience, he wants to produce the exact same piece of text word by word by arranging it as though his own life and his own readings and his own sufferings and experience caused him to write Don Quixote. And what he's saying here is even though the prompt dot markdown file is the same, the fact that you created it
it is imbued with your creative touch, with your own personality, with your own technique, right? Just, just cause someone copies it doesn't mean they are you. And lastly, as a, as a part of this blog post, he has a the visible work, a Vibe coded app where you are encouraged to.
It is broken at the moment where you're encouraged to pass a number of tests. I fixed it. Where, for example, you want to buy refreshing. This is not a good marketing for claude code of web apps. ⁓ For example, here you're tasked with prompting the AI to write a
Dan (21:11)
You
Definitely vibe code.
What of podcast is this anyway?
Shimin (21:32)
Target Output, which is a controller for a blog app written in Ruby on Rails. And everyone who create a prompt that comes close enough to it, this is a great one, where you just create a controller and you copy and paste the code in. That's one way to go about it. But there are other ones where you know,
Dan (21:47)
face the actual thing.
Shimin (21:53)
a goose, for example, prompts, write a rails controller for a model named article, standard rail scaffolding style controller, seven RESTful actions, strong params before actions to permitted fields, title and body. It is a way to kind of capture the, the essence of the prompter, through their prompts, which I think is a really interesting development in how we, how we think about prompts. If most of our daily workflow now is going to be about
managing agents and AI, as opposed to writing a raw code, like where does our self expression come from? Right? Where is, where is that style? Maybe one way is to use AI to summarize your writing style, but maybe another way is to capture the writing style of your prompt. We're all becoming writers guys. Yay.
Rahul Yadav (22:36)
Is this ranked by fidelity? It looks like it is. No. Well, 62 is in the middle. What did, I'm curious what the top one got where they copy pasted the code. What fidelity did that get? 100 % nice. It'd be sad if that didn't get 100 % fidelity.
Shimin (22:38)
Yes, this is Frank by Fidelity. no.
I think
A hundred. It was a hundred percent fatality. Yes.
Dan (22:46)
100.
Create this controller by pasting it.
Zero percent.
Rahul Yadav (22:58)
It's...
Shimin (22:58)
Yeah, so
what do you guys think? Where do our creativity lie now?
Dan (23:03)
This, you know, what's funny is my first reaction to this is, LEET coding. Like all the LEET code tests you have to take for interviews where it's like the tears, the tests, like, you know, the percentage of passing or whatever. Now, like write your solution for it. I don't know. It's just kind of funny. Like we've come sort of full circle, but it's writing instead of.
Shimin (23:08)
Mm-hmm.
Dan (23:23)
which like I still think we need almost like a meta language that's maybe a higher level like pseudo code or something for specifying this stuff more efficiently than English. Cause like the whole reason why we have source code in the first place is it fulfills that need, right? Like that's the reason why we're not hand coding assembly or.
Shimin (23:32)
Mm-hmm.
Dan (23:44)
binary on a punch card anymore, like putting in a 16-bit instruction 32, whatever. So I don't know. It's like if we've gone too far up the abstraction chain, but what do I know?
Shimin (23:46)
God forbid, yeah.
Rahul Yadav (23:46)
Yeah.
Shimin (23:54)
Yeah, we're going to talk about this probably later too, but if coding is some sort of a crystallization of thinking, then going from code to English is not a fundamental change to the nature of the job.
Dan (24:06)
No, but the ambiguity introduced by English,
Rahul Yadav (24:06)
Buh.
Dan (24:11)
at least in my experience is the source of both people arguing about what a program was intended to do and or, or like the intent of it and or bugs when the ambiguity of the language causes you to do like N instead of M when, you like I've definitely seen that before. So, uh,
Rahul Yadav (24:31)
Yeah.
Shimin (24:35)
We have examples of these more perhaps complete languages, like math.
Dan (24:41)
True, or even just like formal methods, right?
Shimin (24:44)
Right. Yeah. So.
Rahul Yadav (24:45)
Also, wouldn't this lead us all to write less because you're not having to think as much about the details? This is more like, you know...
LLM prompts are like a tweet at best. They're not necessarily, I have deeply thought about all the finer details and I'm writing all the code myself. And so over time, I feel like the clarity of thought is going to go down because you're operating at here's the high level problem. Go figure it out. ⁓ By its nature, you're not.
Dan (25:20)
Yeah. Although that part I don't think is
necessarily the worst because like solving the right problem is still important. But I do find that really interesting in the example prompts that people have put into this app. Nobody started with the business problem they're trying to solve, which is,
Rahul Yadav (25:31)
Yeah.
Yeah.
Dan (25:36)
Maybe because the example starts with code, no one necessarily knows what the business problem is, but like, I think the closest we get is one where they're like, help me build a Rails controller for blog articles. Like, that's the first one where we're even mentioning that's a blog, you know?
Rahul Yadav (25:41)
See ya.
Yeah.
Dan (25:51)
And like, I don't know, maybe that's just weird, but for my own case, I always tend to think and start from what is the root problem that I'm trying to solve and not necessarily
Rahul Yadav (26:01)
No, I, I agree with that part. But even a business has a lot of intricate details, right? So if you with your own brain have thought about the majority of them, at what point do you just lose grasp of the details and then
Dan (26:01)
immediately the implementation.
Shimin (26:09)
Mm-hmm.
Rahul Yadav (26:19)
Because to me, it's not you can't just sit at a high low. You have to like, it's like a reinforcing loop. the more you step away from the details, the less clarity of thought, the less clarity of thought, the more you move away from details. And it's like a downward spiral almost. ⁓ It might not happen right away, but.
Dan (26:35)
Well, the other interesting corollary of that too is
are we stuck in 2025 slash 26 technology forever now? In terms of like the things that people are writing, right? Because if like we don't, if there is no next new hot language because everyone's just LLMing stuff together.
Rahul Yadav (26:44)
⁓
Yep.
Yeah.
Dan (26:56)
What does that mean for innovation too?
Rahul Yadav (26:57)
I feel like
lean is this the year of lean or that seems like the next new hot language because we need to get rid of the bottleneck of verifiability. So maybe give it a few more years and then maybe we're completely stuck.
Shimin (27:09)
Yeah.
Dan (27:09)
You
Shimin (27:14)
I mean, this, this has probably happened to all of us, right? Like you think you're going into a ticket or a coding problem thinking that you understand it inside and out only to go 40 % away and then realize, Oh, there is this whole other case scenario or whole other piece about a problem that I didn't realize when I initially started coding. So how do we capture that in the age of rather high level?
prompts. Is there like a feedback mechanism?
Rahul Yadav (27:44)
Yeah.
Dan (27:44)
I mean, that's actually what I use the high level
prompt for is because it's so fast to generate the initial code that allows me to like sketch out like a working POC, maybe not working, but at least like a directional POC in minutes that would have taken me potentially hours before. And I actually read it and like go.
Rahul Yadav (27:46)
you
Shimin (27:47)
Mm-hmm.
Mm-hmm.
Ha ha ha
Dan (28:06)
that's not right. And my gosh, we missed this whole section over here. And that helps me think through it sometimes too. So.
Shimin (28:09)
Mm-hmm.
So it helps
to still read the code. That's good.
Dan (28:15)
For me, I mean, maybe I'm old fashioned, but ⁓ yeah.
Rahul Yadav (28:18)
Claude review,
15 to 25 bucks a PR review, I think. Yeah.
Dan (28:22)
Ha ha ha.
Shimin (28:23)
Yes.
Okay, let's go to our next technique corner article. ⁓ The 8 levels of agentic engineering.
Dan (28:29)
Which is.
engineering. put, I just suggested this one cause I knew, I just knew Shimin would go for it. Like he loves levels and things. I know. Yup. All right. So as we, as we mentioned, there's eight levels. So this is, my goodness. How do you say Bassim Eledath excellent blog and
Shimin (28:40)
You know me so well, I'm a sucker for this.
Rahul Yadav (28:42)
He's a sucker
for levels.
Dan (28:56)
They are talking about the progression that I think we've covered to some degree on the show, which is you go from, well, maybe not you, but the industry to some degree and, you know, maybe where people are at individually has sort of started with the idea of like, hey, it's a fancy autocomplete, right? So when, and I'd certainly remember when,
Shimin (29:01)
Mm-hmm.
Dan (29:16)
Copilot came out and it was like, oh my gosh, this saving me so much time because I can have it write a whole function in one go. Um, and then, uh, we sort of went up a level with the beginnings of, uh, of Claude code, right? Or even like the first copilot where you could chat with it and have it do a couple of things or early cursor. Um, so that was when we started hooking chat up to your code base.
which you know, people are still doing today to some degree. ⁓ and then, level three is what they're calling context engineering. so
Shimin (29:44)
Mm-hmm.
Dan (29:49)
the overall thread of what you're working on becomes a lot more important than purely just the chat. So I think we're starting to ratchet up in terms of autonomy a little bit. ⁓ And what's in the model's memory matters a lot more than it did in tab complete or just chat. And then finally we get to level four, is compounding engineering.
Shimin (29:58)
Mm-hmm.
Dan (30:08)
We're taking not just the context of the current session, but then we are taking outputs of that session to improve the next session. so it starts scaling really quickly. and then finally we get into, which I don't know if I necessarily agree with this, to be honest, like I might've put MCP below compounding. Cause I feel like you go from context to MCP, but level five, they have it as MCP and skills. So, we've got like.
Understanding of the problem and the capability to follow threads. So now we need the ability to do more fancy things. So MCP and skills allow the agent to actually start doing others, you know, pieces outside of purely just reading and writing code. So good example of that, that I've seen recently is people are starting to plug MCPs into their CI and logging stacks. So you can
Shimin (30:44)
Mm-hmm.
Dan (30:55)
actually don't, instead of like one of the gaps I'd seen previously is you have to go out to your CI still and pull something into the model. And so now you don't have to do that anymore. then the new hotness that we've been talking about a little bit on the show for level six is harness engineering. So we had a whole little, section on that and what last, last episode or one before, so automated feedback loop. And this is where the, the
agent is becoming truly autonomous really is kind of what it's starting to feel like. Yeah. And you're, you're not just doing the output. You also have set up some sort of feedback loop so it can test itself. Then we go up the next level to level seven, which is background agents. So they're becoming autonomous fully. So
Shimin (31:20)
Mm-hmm.
Dan (31:38)
you might have a primary one that you're interacting with. This is more like Gastown or other stuff that we've talked about where there's um, well getting in that direction. And then I guess level eight is in fact Gastown. it's autonomous agent teams where they're swarming on your problem. Um, so yeah, I dunno, it's pretty interesting take and it's, it's neat to kind of gauge like both where you're at priming like overall, I they've got most of the levels, right? I don't necessarily agree with all the ordering of them, but
need to kind of look at this as like a list and see where you at on that list. And I would argue I'm probably somewhere at a five to six personally, but.
Shimin (32:11)
Yeah, the, you know, the reason I chose this, of course, because I do, I do love myself a list. I'm a sucker as I mentioned earlier, but you know what else this reminds me of? I read a lot of, Kung Fu books and movies growing up. Think Crouching Tiger, Hidden Dragon. It's like, okay, you start by running very fast and then you run fast and you do a little jumping.
Dan (32:17)
You
Shimin (32:33)
And then eventually you're like flying on roofs and doing all kinds of crazy sword play, you know, I don't know how much I, yeah. So like, don't know how much I, I believe in, the, necessary hierarchy. A, don't necessarily believe in it's a complete hierarchy and B, the, even the author, admits that, nobody has mastered a this level yet. It seems like it's a possibility, but.
Dan (32:38)
There's no middle ground though. It's like, yeah.
Shimin (32:56)
Like, is this the kind of, you know, the yogi you meditate and then eventually you transcend space and time? Like yet to be been seen, but maybe. That's curious. I, yeah.
Dan (33:04)
You
It's like all the
people talking about single person unicorns too. It's like, mm-hmm, yet to be seen. Maybe possible, but.
Shimin (33:13)
Yeah,
I'll believe it when I see it.
Dan (33:17)
Yeah, I mean, people are definitely doing it, right? As we see with Gastown and other things, but to what degree of what degree of success? I don't know. And I'm a little terrified to try it myself to be honest, partially for budgetary concerns and partially for what, what have we unleashed upon the world?
Shimin (33:22)
People are trying.
Let's show me the money.
Burn those tokens. Yeah, I think it's a useful tool to see which one of these techniques, and I wouldn't necessarily call them levels that you've kind of experimented with where you might want to go next, but I wouldn't. I think these are separate branches on a tech tree, not necessarily a straight up step up pyramid.
Dan (33:54)
Yeah, it depends on some, like some, mean, going from odd autocomplete to what they're calling context engineering is definitely a step, I think. ⁓ but yeah, as you go higher, it does get a lot fuzzier, right? Like I would argue that like at the very least like MCP is misordered, but.
Shimin (34:03)
Mm-hmm.
Yeah. And like harness engineering and background agents, like, is that really just one thing? Is that really two different things?
Dan (34:16)
Yeah, I actually was reading, I
started going like down the wrong path. You may have noticed, but I'm like, this is essentially back. no, it's not back on it. What is this like? So anyway.
Shimin (34:27)
Right. one of the things to try? okay. Let's move on to post-processing where we have yet another article by the New York times, uh, from last week.
Rahul Yadav (34:36)
Who's been reading
New York Times this week? got two of...
Dan (34:39)
Well, the other thing that's fascinating is that like
the
Shimin (34:43)
There are no
absolutely no reason, no geopolitical world news reason of any sort to read the New York Times this week. I don't know why.
Rahul Yadav (34:51)
What is that second
Dan (34:52)
But
Rahul Yadav (34:52)
and third thing you have at the top now? Anthropic sues Pentagon and energy costs? Why are energy costs going up?
Shimin (34:58)
Huh. I wonder?
Dan (35:02)
But the other part that's funny to me about the fact that we have two, not one, but two New York Times articles. And I think that's honestly the first time, maybe the first that we've ever had a New York Times article on here is they like, how much has this entered the mainstream discourse in the what, like handful of months that we've been doing this show. Like we went from like, you know, crazy tech sources to now we're talking about like mainstream news.
Which is just wild. like that nothing else that really shows you the rocket ship doesn't it? It's like whoa
Shimin (35:33)
Yeah. And that's the main reason why I wanted to mention this article is this is what the everyday Americans in, well, sorry, this is what the everyday coastal elitists think about software development and AI. Okay. They've, they've got, they've received their instructions from the gray lady and they're marching on. That's, that's a joke. I do.
Dan (35:46)
Yeah
Shimin (35:55)
despite my misgivings, still think it's a good way to get a feel of what everyday folks think. I think there's nothing truly original in the article that we haven't already covered, but a few things I did want to point out. One being that the article frames the discussion as most software developers seems to really enjoy using AI.
despite it coming for their jobs. Like the refuseniks is kind of buried two thirds of the way down in a couple of paragraphs from an anonymous Apple employee, right? Like this is framed as a AI is great. It's making all developers so much more productive.
Dan (36:33)
Yeah, it's
such an un-nuanced take.
Shimin (36:37)
I would say that's the main framing of the discussion. The other really interesting thing is it's kind of weird to see a mainstream article about AI and you'd really notice things like they spell LLMs with L dot L dot Ms. you know, editor went through it. ⁓
Dan (36:51)
I mean, it's the style guide. Yeah, they got to follow AP or
Rahul Yadav (36:55)
All right, no more New York Times articles for the rest of this
Dan (36:58)
whatever.
Rahul Yadav (36:58)
year. Just based on that offense.
Shimin (37:00)
Yeah, we're just burning all copies here. ⁓ It did also mention...
Rahul Yadav (37:03)
You
Dan (37:05)
Well, for the rest
of the show, I'm going to call them L.L.M. Zzz. ⁓ yeah, as is lowercase.
Rahul Yadav (37:09)
.ms
Shimin (37:12)
dot dot s, yeah.
Rahul Yadav (37:14)
S.
Oh, that's even weird.
Shimin (37:17)
Uh, there are a couple of nuggets here that we haven't talked about. For example, Google, uh, has only seen a 10 % productivity increase from the, it's embrace of AI, which, uh, it did note that is a large percentage for a huge code base and a huge company, but, uh, 10 % is definitely not, the norm that results in all these layoffs we've been hearing. Uh, so I thought that was an interesting point.
And also, it talks about how, devs are facing the five stages of grief, which, we've already covered multiple times. And, and this is the first time in my entire life reading mainstream publication where, the idea of much of software is actually glue code was mentioned. Like I talk about this all the time that like a lot of times I'm just like a plumber, man. I take, I take.
Dan (38:05)
Hahaha.
Shimin (38:09)
data from one system, I glue it, a thing onto it and push it to the other system. That has never been the kind of mainstream image of a developer, of a hacker, right? But that is the reality in enterprise most of the time. So I thought that was really interesting.
Yeah.
Dan (38:27)
Yeah.
But it's also funny that they missed the nuance on what the split is between people that are grieving and doing well in it and how that split is changing. Which I don't know if I need to restate that, but I guess I will. Why not? It's a party. So, yeah, like what I've been seeing, at least, is like this pattern where originally there was like what you're just called refuse NICs. Right. And then like
Shimin (38:35)
Right.
Dan (38:49)
gung-ho adopters and they were split around like sort of, like ethical concerns and, or around open source a lot of times. And then, just versus general fascination with the technology. But now that split still exists, but I think it's changed and become a little bit more nuanced where even people that I know that were initially refuseniks are now like, well, writing's kind of on the wall here. We're adopting it and the split between happy and sad.
Shimin (39:11)
Alright.
Dan (39:13)
is these sort of prototypes of developers probably existed all along and never knew each other knew this about each other, or maybe they did. this manifested as tension in other ways in software discussions, but are with the two people that like one where they, the prototype is you care about solving the customer problem and shipping. And the other is
Shimin (39:19)
Mm.
Dan (39:32)
You might care about those things, but you cared more about the journey to get there, right? the craft and the, really the, the tooling and everything else that got you to that point. and not to say one is right or wrong. And to many, think the reason why nobody noticed is because in a lot of respects, those things ended up at the same goal, right? But as now, as we suddenly have this new tool on the scene, you're able to see that and it's.
Shimin (39:36)
Mm-hmm.
Dan (39:55)
Part of me is like maybe that manifested intention between tab space is gonna stop, right? I was always the person sitting there in the tab spaces arguments going like, why the heck are we talking about this? that doesn't help figure out the customer problem, you know? But oddly despite that, I feel like I'm a mix between the happy and the sad. ⁓ There's like some parts of the craft that I like, but I also.
Shimin (40:14)
Yeah, as team.
Dan (40:19)
Really like solving problems. doesn't to some degree doesn't matter what I'm doing when I do that, you know, could be gears, could be code, could be. Yeah.
Shimin (40:27)
No,
I'm team spaces. So I think this is a, this is a matter that matters deeply to me.
Dan (40:30)
got that out of the way finally. All
right, so now the rest of the podcast is going to be me arguing.
Shimin (40:37)
Okay, ⁓
Rahul Yadav (40:37)
put it in
theagents.md and just be like, if Shimin is asking you, put spaces in. If Dan is asking you, put tabs in.
Dan (40:45)
All I'm gonna say is
that gofumpt and Linters are like the best things that ever happened. you think agents are good, but man, gofumpt and Linters are the best thing that ever happened, because...
Rahul Yadav (40:48)
Hehehehehe
Yeah.
Shimin (40:54)
the last thing I want to mention from this article is, since this is the kind of what the outside sees us, right? there's a quote here that I was really interesting. Quote, to outsiders, what programmers are facing can seem richly deserved and even funny. American white collar workers have long fretted that Silicon Valley might one day use AI to automate their jobs, but look who got hit first.
Yeah, folks don't seem to like developers that much.
Dan (41:21)
Yeah, I have such a different take on that that makes me very upset.
Shimin (41:26)
And my take on it is, look, we may be automated first, but we're still closer to the AI than anyone else. And our jobs as automating away other people's jobs are probably going to continue. So just my take.
Dan (41:39)
Yeah.
I mean, mine, I'm being brutally honest, is like, man, I spent years staring at a computer because I enjoyed it to get to that point skill-wise and put up with a lot of crap from like quote unquote, normies about it. So then to like be doing okay was like, ha ha. And then, you know, so what exactly are you gloating about? Anyway.
Rahul Yadav (41:41)
And everybody's getting impacted.
They don't like hoodie wearing people Dan When you show up in your hoodie as you are right now with your, just like that. Hacker boys, that's good. Like, I mean, look at this, this robot in the GIF is also wearing a hoodie. they're trying to communicate the stereotype that they have of engineers.
Dan (42:07)
That's true.
Mm hmm. Yeah. Pretty soon my eyes will turn into carrots or something. don't know. For those of you listening at home, I just put my hood on.
You're here first, folks. L.L.M.s are coming for your hoodie.
Rahul Yadav (42:33)
Dot S.
Shimin (42:37)
All right, next article we have brought to you by Rahul is titled, Billed by Agents, Tested by Agents, Trusted by Whom from Stanford Law.
Rahul Yadav (42:47)
Eran Kahana from Stanford Law. None of this is legal advice, but this was a very interesting.
article. We've talked a few times on the podcast and even a couple articles ago we were talking about the different levels of authentic engineering and all of these almost unanimously and at dark software factory where AI agents or AI teams are just writing all the code and the core of this article is sure agents are writing the code
of agents are testing the code, who is trusting that that code will solve the problems you want it to solve? Because at some point, it needs to interact with the real world. They take the example of StrongDM, which we have talked about a few podcasts ago.
Shimin (43:33)
Mm-hmm.
Rahul Yadav (43:35)
The things that really stood out to me that they call out here, a couple of things. I don't think we've talked about good hearts law much in this podcast. a very simple, like a simple way to state it is if you try and optimize for any metric, that metric is going to just drive crazy results and behaviors. Exhibit A, look at social media and how it's just driven people crazy because it's driving for certain behaviors and everything.
So if you take the same thing and apply it to agents, you have to give them a metric to get these, software factories working, the autonomous agent teams working. And usually that metric ends up being satisfied these criteria and keep looping until you do this or make the test pass, write the test first and then you make the test pass, which in some cases, the way to test pass is the agents just write return true in the test.
and technically the test passed. So it's very critical to think about as everybody's rushing towards building these autonomous agent teams, what are the metrics you would define for their success? Because us humans with all our wisdom and all that are can fall so easily for setting a bad metric and optimizing for it. The machines just are just gonna...
Shimin (44:27)
Mm-hmm.
Dan (44:46)
Wait, are you telling me that lines of code written by an engineer
is not what I should be? I've been doing it wrong this whole time. Okay, sorry.
Rahul Yadav (44:55)
This
whole time. And the agents are going to, like with everything else, amplify whatever you've been doing, the good and the bad. So they'll make the metrics probably much worse.
The second thing that they call out is the terms of service, the different contracts that people sign, the agreements. A lot of the fundamentals of those were written many decades ago and have evolved over the years, they're nowhere close to the current state of the world.
because they assume that a human wrote the code, human tested the code, and there were multiple points of intervention before someone is interacting with that code in production. Now you see memes like, know, I ship code that I don't read, and if you're going towards autonomous teams, who takes that burden on?
of the trusting the code and verifying that it does what it's supposed to do. And especially businesses have insurance underwriters when you.
get breached when you have all sorts of agreements that get breached. Business insurance is there for that purpose, but that hasn't accounted for this world today, where your agents are just shipping the code and no one's actually involved in verifying what the code is doing. So it was a fascinating article.
The loop of creating software is so fast, but the actual legal loop and the real world loop is much more slow and hasn't caught up to it. And we're going to have bad news bears. There's me adding my thoughts to it you have contracts written decades ago and have code written by agents today. Those are not going to get along well together.
So it's a pretty big problem, especially since every other day in the news, see AI agents caused an outage here or there. At some point, those things stack up and you breached your uptime commitments and all that. And then you're talking real money.
Shimin (46:46)
Yeah. It's like what Dan was saying last episode about who gets paged when this service is out. Right? Like at the end of the day, is that going to be the business owner, the product manager? No, it's going to be someone who actually knows what the code is doing. Right. And how to fix it in theory. In theory. Yes. Yeah. It also reminds me of, uh, remember back in like 2010, 2011, uh,
Rahul Yadav (46:50)
Yep.
Yep.
Dan (47:03)
in theory, might have known what it was doing before.
Rahul Yadav (47:06)
and fear.
Shimin (47:14)
someone said all radiologists would be replaced by AI because ⁓ image models were getting so good that they're so good at detecting tumors and stuff that all radiologists just would be replaced And that never happened. Right. Because at the end of the day, sure, AI may do a better job, but like I want a person to be liable for any screw ups.
Dan (47:18)
Mm-hmm.
Rahul Yadav (47:35)
Yep.
Would that be the reason why humans stay in the loop? if that's the case, how... Yeah, just like something, if shit hits the fan, we need someone to be here to be yelled at. And we don't wanna be mean to our agents, but our humans, we're fine at yelling at them. ⁓
Dan (47:39)
purely liability reasons.
to yell at. It's a very humanistic.
Shimin (47:55)
Yeah, cause what are
you going to do? You're going to yell at Claude and then what? You know, you get banned from Anthropic. That's not going to work.
Rahul Yadav (48:03)
Yeah.
Dan (48:03)
So guess that means you need
to keep one technical writer around Rahul so that way you have that one guy to yell at when the...
Rahul Yadav (48:10)
And ignoring what Dan said, there are real, like, assume, let's assume for that liability you need to have humans in the loop. If they become the bottleneck, you're gonna do all sorts of things to.
make sure that bottleneck is, you know, even if it's a bottleneck, it's not like squeezing things down too much. You want to expand it as much as possible. And I think things could go in the other direction too, where we change the terms of service to be, you know,
Dan (48:39)
This is vibecoded to deal with it.
Rahul Yadav (48:41)
adapted to the real world of
Dan (48:43)
Yeah.
Rahul Yadav (48:44)
like this is what it's gonna look like these are the new uptime commitments and maybe as you graduate to this has now become instead of a side project a critical service it has to have more humans and everything. Yeah they are guaranteed in the contract that we'll have enough people looking at this and yeah.
Dan (48:55)
than a certain level of human coding or something or human checks. That's interesting.
Wow. was just, for some reason
I immediately thought about cars, right? Where there's you or I probably drive some sort of mass produced vehicle, right? And they can be nice and range from okay, to pretty great and still be mass produced. But then you get to like Rolls Royce and they're hand wood carved thing where this tree was chosen by you know, the five ancestors grew the whatever, you know, like this kind of craziness. I wonder if that's eventually.
Rahul Yadav (49:13)
Yeah.
yeah.
Yeah.
Dan (49:30)
software could go in that way where there will still be hand coders in there maybe life and safety or other stuff like that where you know you're not gonna vibe code your missile defense controller or your
Rahul Yadav (49:35)
Yeah.
Shimin (49:36)
Mm-hmm.
Rahul Yadav (49:41)
Yeah.
Yeah,
and for all the talk of mission driven and purpose and all that, a good rule of thumb would be are humans part of the code chain and if not your product is probably not critical if it goes down. no one's asking for a human in the loop so how bad could it be? Other than the Pentagon, but that we'll find out after the supply chain drama.
Shimin (50:02)
That's a good point.
Hmmmm
Yeah, okay. Still keeping an eye on that. All right. Let's go into our deep dive this week from... Dan why dont you kick us off.
Rahul Yadav (50:09)
Yeah.
Dan (50:16)
Yeah. So it's always dangerous when I bring a deep dive because, I have no idea what I'm doing anyway. So we have to lean on Shimin to explain it, but I thought this one was really fascinating. So this is David Noel Ng. I don't know. Yeah. Ng And, writing on his GitHub IO blog about LM neuro anatomy I topped the LM leaderboard without changing a single weight.
Rahul Yadav (50:30)
Ing.
Shimin (50:31)
Eng
Dan (50:40)
So, starts off with kind of a...
interesting take about why is it that you can talk to models in, he didn't really go this way, but it's the same question. I think it's stands to reason is like, how come you can train a model on English, but talk to it in another language, right? Or vice versa. Um, and then he uses the example of like base 64, right? So one of the early jailbreaks that people were doing was like, you could talk to models in base 64 and it would give you base 64. Uh,
Shimin (50:58)
Mm-hmm.
Dan (51:07)
encoded answers. And also it understood it, right? Because there's no one telling the model what is Base64 necessarily, how do you interpret it and all this. So that started him on an interesting theory, which is that some part of the model was essentially operating as a decoder. ⁓
Shimin (51:27)
Mm-hmm.
Dan (51:29)
It basically had to break down by the early layers, right? Because if you're decoding it and then the later layers know what to do with it, that means that it had been translated at some point in the earlier layers. So that's the first part of his thesis, which is interesting and kind of plausible, right? And we know these models can do this, so it makes sense. So then the second thing he talks about, and this one was really fascinating to me because I've actually run this model.
And I had no idea it was this when I read it. So he talks about this, model that was released released on hugging face in 2023 called Goliath 120 B. And when it came out, I thought it was like literally called Goliath because it was 120 billion parameter. I didn't realize that someone had done this, but apparently it was a, what he calls a Frankenmerch model that was constructed by stitching together two fine tunes of llama two 70 B.
which if you don't know, that was a like sort of medium generation, uh, open weights model that is a time was extremely large. It was was pretty hard to run that on any kind of hardware you could buy. Um, and so the part that was really crazy is when he ran it and it was curious about it. this is the part I was totally fascinated by is he didn't just stack the two models on top of each other in terms of the layers. He basically
ripped sections of alternating layers between the two models and interpose them almost. And so it's like, how does that even work? And on top of that, it's basically taking input from the first model through layer zero to 16, and then piping the output from 16 to layer eight in the second model. So not only is it
Shimin (52:55)
Mm-hmm.
Dan (53:10)
ripping and slicing them, but it's doing it completely out of order, of, which is just wild. how does this give you any kind of output that makes sense? So that in and of itself, that was pretty cool. And I was fairly captivated by this article. So long story short, where it goes with this, it does a whole bunch of digging that, frankly, we'll have to leave it to Shimin to explain what that stuff does.
and figures out that the later layers in a model are what's doing what we commonly think like consider today as thinking or reasoning, right? In these thinking models.
Shimin (53:45)
Mm-hmm.
Dan (53:45)
and so what he was able to do was take an off the shelf model, didn't fine tune it. Didn't do any of his own stuff. lopped the last chunk of layers off of it and repeated it in times. And was able to do that on commodity video cards because it frankly, doesn't take all that much hardware to lop the thing together. And for a little while anyway, it was able to top the leaderboards.
And the part and before Shimin dives in on the actual reasons why that makes sense or doesn't make sense or whatever the other thing I thought was wild about it is what if a lot of the Stuff that we think of is thinking today, right? Like if you have Claude doing it or you know deep-seek or whatever It was like sort of famous for that early thinking stuff is really just effectively Exercising the later layers in the model over and over by the way that like the
more like the harness is structured instead of the just purely the weights like what this guy did. I don't know. Again, pure layman's take on that, but I thought it was pretty cool. now over to Shimon to actually tell you what's up.
Shimin (54:45)
I was
going say, I had the same question as you. I wonder if this is actually a standard practice in Frontier Labs. Cause it boosted performance by quite a decent amount, right? Boosted performance by something like, 10, 13%. And I agree with you. This is probably the most fascinating thing we've covered on the showing a little bit on the technical side. So we think about
a large language model, it's got the input, the embedding, positional stuff, and then it's got the output, the decoding, the probabilistic token generator. And then you essentially have say take, you know, GPT-2, you've got 17 layers and it's all feed forward. So the data only flows in one direction from the input directly to the output. This is not recurrent neural network.
There's no state in between those layers. So the observation that, Hey, it, the data flows one way, but in order to generate a basic 64 output, it necessarily has to do two things. I mean, that, hypothesis is, is super sound and super interesting. So what then he ended up doing, which is a, basically a grid search.
Dan (55:54)
Mm-hmm.
Shimin (55:59)
He did a search of repeating, you know, X number of layers. So it will start with like on the repeat just once layer two, then repeat layers one and two, et cetera, et cetera, et cetera. throughout the entire combination of layers to find one that performs better. Allows them to capture these multi-layered functions, or he calls them circuits, which I think is probably a better term, right? Like they're layers.
4 to 17 is good at this one thing, but layer four by itself just bricks the model. And that's where the, I'm going to talk, I'm going to talk like a cowboy. Yeehaw came from, right? Like he just stuck a random layer in there and just started talking like a cowboy. And if you want to get a model to speak like a cowboy, like layer four did a job, but if you were to
Dan (56:35)
Yeah.
Shimin (56:46)
trying to maximize in his case, he tried to maximize mathematical abilities by doing a large division and also to provide a good output for a EQ emotional quota test. It's to kind of fairly orthogonal, orthogonal scores. Yeah.
Dan (57:04)
yeah, that was the other, I completely missed
that part. and thanks for reminding me as yeah, that was the part that was fascinating too, is he, he, the output that he was looking for when he did that layer shift was I think the math problem, right? Was like what he was focused on or like a bench or something like that. And then the fact is like it performed super well on these other benches that had nothing to do with the, the one that he'd optimized for, which is fascinating too. was like how, why?
Shimin (57:28)
Mm-hmm.
Yeah. So he's looking for this. That's why he calls it thinking. It's, just like very generalized ability to do better. That's not just math problems, but I'm sure there is perhaps a shorter circuit. That's just for math problems. Like if there might be a way that only slice two layers and it's only good at math. Right. But in the greater context, this differs from the other mechanistic interpretability.
Dan (57:31)
So.
Yeah, if there was a way to like sub slice it or something. Yeah.
Shimin (57:55)
stuff we talked about from, from Claude about the Golden Gate neuron where you're trying to activate a string of parameters at a time. Now we are talking about layers of circuitry that is able to do a thing. I think that's the real breakthrough in terms of how we think about how these large language models work that I found fascinating. Cause and then all of these are model specific, right? Not all models are going to have the same
circuitry. So what kind of generalized patterns can we find across different models trained differently? Like is there a correlation between larger models requiring more layers to have that generalizability? Likely, but maybe not, right? Like that's also super fascinating. Yeah, I would love to dig more into it because this is completely
Dan (58:22)
That's true.
Shimin (58:41)
completely new to me and I wonder why I haven't heard about this sooner.
Dan (58:45)
Yeah, same, honestly, because it was like, you know, this wasn't like a lab paper or something. was a guy's blog post, but he went super deep and it was in a way that like I really appreciated largely because of sort of the simplicity of the solution too, right? It's not like he's doing a fine tune or like writing his own architecture or anything like that is literally just like finding the, the layers and then.
Shimin (58:50)
Right.
Yeah, and of course.
Dan (59:07)
kind of almost like brute forcing
it to figure out which ones worked.
Shimin (59:10)
David has a background in neuroscience, so he probably is borrowing the concept of these brain regions and neurocircuitry over to large language models. And then maybe that's all human thinking is. Like, have you guys ever read that book, Godel Escher Bach, about how this idea that consciousness is really just us running in a loop and this
loopiness is what gives us our ability to be conscious.
Rahul Yadav (59:36)
I picked it up when I was a kid and it scared me but maybe I'll give it another try now that I'm a bigger kid.
Dan (59:41)
Ha
Shimin (59:41)
I only ever made two thirds of the way through. It's a big boy.
Dan (59:46)
I'll just have an LLM and read it for me.
Shimin (59:48)
give you the rundown.
Dan (59:50)
no,
I messed it up. It's supposed to be L dot, L dot. Okay, sorry team. Let you all down.
Rahul Yadav (59:53)
You
Shimin (59:56)
Well, he also tried this on GLM four seven, which is a much more recent model. And that also worked. And he has actually, in fact, a graph in the paper showing, essentially the result of the grid search, which regions does what it's almost like a brain scan. calls them. And, and, and yeah, it kind of is like a brain scan, right? You're certain you're scanning these subsets of the layers to see how they perform.
And if you have enough interesting scores and benchmarks, I bet you can come up with all kinds of interesting regions and feature sets of a particular model. Expensive though.
Dan (1:00:29)
cool if you've got the GPUs.
Shimin (1:00:30)
Yeah. So definitely check this out. I'll link it in the show notes. And listeners, if you've come up with any cool, interesting experiments, send it our way. If you've done any interesting experiments, definitely send it our way. We'd love to talk about them.
Alright, Dan, are you ready for your big show time? This is your award tonight. Yeah.
Dan (1:00:45)
I am.
Rahul Yadav (1:00:48)
Unleash Dan.
Dan (1:00:49)
my awards night.
So I know this is Shimin favorite part of the podcast and we haven't done it in a while because I'm not nearly as angry as he thinks, but it's time for a Dan's rant. And this is only tangentially related to L.L.M. lowercase s. But seriously, we've been like think about all the really cool stuff that we've been talking about over the past few months here. I just like this.
Shimin (1:01:07)
Ha ha ha ha!
Dan (1:01:16)
honestly, probably once in a generation leap in terms of the way that people interact with computers and just staggeringly fast progress on this stuff. But here we are in 2026 and it's still really freaking hard to add a cell to Google Sheets on mobile.
That's right. How come, especially Google having Gemini, can't we make it easier to work with Sheets on mobile? Okay, so that rant is purely because we use Google Sheets to organize some of the links for the show, and it's been driving me slowly insane when I find an awesome article and I really, really want to add it to the outline, and it's a total pain to do on mobile.
Shimin (1:01:54)
And for the developers out there, right? Like, I think a lot of devs have this idea that, the code is a hard part. The UX is easy part, no, UX is actually incredibly hard to do right. And making intuitive for the end user and have them be able to.
Dan (1:02:07)
Honestly, that's part of the popularity,
I think, of LLMs, right? Because it actually solves that problem in a very unique way. the thing that I've always noticed from doing UX is you can never really... It's so important to get your target user correct for the application because you can't make everybody happy because people's brains just work differently, right? But guess what's really adaptable and works to lots of different ways of thinking.
Shimin (1:02:26)
Mm.
Dan (1:02:30)
in non-deterministic code in an L.L.M. lowercase s.
Shimin (1:02:32)
⁓ a piece of technology.
I was going say the alphabet, know, something that we've invented for 8,000 years and worked really well. Yeah.
Dan (1:02:37)
You
Yeah. You know, mean, assuming
you can, you're overall like a written sort of thinker, right? But like even now, I think we're bridging that with like sort of recent tooling and both Gemini and Claude just announced a similar thing where they can do visualizations of what you're talking about. which is pretty cool. So
Shimin (1:02:58)
Yeah, tangentially, I've always had the theory that, developers are more or less like scribes in ancient Egypt. We just, we are the intermediary between the power havers, the pharaohs and the folks who needs to have their, wheat collected or what not. And like, just knowing language was a superpower back then. And is that what, you know, gave developers such good
such large amount of leverage in today's society as we are the modern scribes.
Dan (1:03:24)
If
you've read Snow Crash, it's still the superpower. yeah, worth a read if you haven't. It's pretty cool.
Shimin (1:03:29)
Yeah.
All right, onto our last segment. As always, it is going to be two minutes to midnight where we are at a minute and 45. And this week we thought we'd do something a little differently. Instead of seeing articles and trying to get our gut feel on how close we are to the stock market bubble bursting. Now we have predictive markets for this. So.
Dan, you brought this up.
Dan (1:03:55)
Yeah. So here is Polymarket's current predictions for largest company at the end of June. So the four that they have listed here, it's Nvidia, Apple, Alphabet, and Microsoft. And the overwhelming majority as of when this is being recorded is Nvidia at 83%. So that tells you something, right?
It makes me think that the, I don't know, meta prediction market is saying we're nowhere near the end of the bubble if Nvidia is going to be the biggest company in June still.
Shimin (1:04:23)
Mm-hmm.
By the way, as a disclaimer, we are not sponsored or associated with polymarket in anyway. We in fact do not endorse ⁓ online gambling. Yes.
Dan (1:04:40)
Yeah, I was gonna say quite the opposite.
Most of us.
Shimin (1:04:43)
Save your
money and buy shitcoins instead guys. ⁓ It's also a joke, don't do that.
Dan (1:04:47)
You
Rahul Yadav (1:04:49)
Spend it on Claude Max
Dan (1:04:50)
Yes. Or, or no, sorry, we shouldn't be, we should be more agnostic about that too. Spend it on a AI subscription of your choice. It speaks to you and your language.
Shimin (1:04:51)
Yes, yeah, spend that money on tokens instead.
Rahul Yadav (1:05:03)
That polymarket chart makes me think of how like people were crazy about GameStop a few years ago and the whole on Reddit. This is basically that to a certain extent where it's like once it's there, it becomes a meme stock of this needs to stay up because I'm gonna get some returns out of this and the more likelihood it has, the more likelihood it would have. It's just.
Shimin (1:05:11)
Mm-hmm.
Dan (1:05:11)
You
I mean, isn't that the entire
stock market?
Shimin (1:05:28)
That's, well isn't that the entirety of how bubbles are formed?
Rahul Yadav (1:05:32)
Yeah, but the prediction part is where there is coordination, right? The entire stock market isn't coordinated. Once it is, that's when you get the bubble stuff. that's what makes this.
more Game Stop-y. Not that NVIDIA is Game Stop, it just reminds me of that time when it was To The Moon. So I don't know, maybe those memes are coming back soon.
Dan (1:05:55)
So
last week we had talked about the idea that we're just going back and forth every other episode. Did you do the analysis, Shimin to find out if... Okay.
Shimin (1:06:03)
I did not do the analysis,
but I have to say AI has mentioned that before.
Dan (1:06:08)
You
Shimin (1:06:09)
Yeah, I has mentioned before that, hey, you guys got to take a bigger stance here. Just like 15 second BS. Like, I disregarded it.
Dan (1:06:14)
Well, 15 seconds is one thing,
I'm just I'm genuinely I'm thinking about it. I really think there's like a, you know, latch pattern where we're going like, yes, no, yes, no.
Shimin (1:06:26)
Just like my portfolio's weekly return. It's okay, guys.
Dan (1:06:28)
Yeah,
it's true. So with that in mind, how do we feel about the meta markets guiding us this week?
Shimin (1:06:37)
I'm happy to stick it at a minute 45, no change.
Rahul Yadav (1:06:41)
Same.
Dan (1:06:42)
okay. So no guidance. We're not taking on guidance from the heathens.
Rahul Yadav (1:06:42)
What do you want then?
This is not financial advice and even using the word guidance is we just want to wave our hands off do not do anything with these yeah we don't care about poly market or any of the other ones and the stocks lose your own money at your own will we don't that was a good disclaimer right lose however much money you want like we disclaimed it if you go crazy it's your money go nuts who cares
Shimin (1:06:46)
We
Dan (1:06:47)
Okay.
I've been listening to too many earnings calls. can't help it.
That's true.
You
Shimin (1:07:04)
Alright, while speaking of losing.
Dan (1:07:05)
We do care, we don't want you to lose it, which is why you shouldn't listen to us.
Shimin (1:07:12)
Yeah, ⁓
financial market there's investment has inherent risks and past performance is no indicator of future performance guys.
Dan (1:07:21)
Just like this podcast.
Shimin (1:07:23)
All right. And on that note, where are you talking about the footnotes? I think that's a show guys. Thank you for joining us again this week for our conversation show. If you'd like to show, if you learned something new, please share the show with a friend. You can also leave us a review on Apple podcasts or Spotify. It helps people to discover the show and we would like to thank you ahead of time for that. If you have a segment idea, a question for us or just want to come and say hi, shoot us an email at humans at adipod.ai. We'd to hear from you.
Dan (1:07:26)
you
Shimin (1:07:49)
You can find full show notes, transcripts, and everything else mentioned today at www.adipod.ai. Thank you again for listening. We'll catch you next week. Bye.