It's magical until you start going through the code carefully, line by line, and find yourself typing at the agent: YOU DID WHAT NOW? Then, when you read a few more lines and realise that neither AI nor human will be able to debug the codebase once ten more features are added, you find yourself typing: REVERT. EVERYTHING.
yes, this is an issue i see too... also fixing it up takes a lot of time (sometimes more than if i had just 'one-shotted' it myself)... idk, these tools are useful, but i feel like we are going too far with 'just let the ai do everything'...
Does Yegge really think that building production software this way is a good idea?
Let's assume that managing context well is a problem and that this kind of orchestration solves it. But I see another problem with agents:
When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it's small, like the selection of a data structure. Eventually, though, you want to add a feature that clashes with that invariant. At that point there are usually three choices:
* Don't add the feature. The invariant is a useful simplifying principle and it's more important than the feature; it will pay dividends in other ways.
* Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
* Go back and change the invariant. You've just learnt something new that you hadn't considered, it puts things in a new light, and it turns out there's a better approach.
Often, only one of these is right. Usually, one of these is very, very wrong, and with bad consequences.
But picking among them isn't a matter of context. It's a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often (they go with what they know - the "average" of their training - or they just don't get it). So often, in fact, that mistakes quickly accumulate and compound, and after a few bad decisions like this the codebase is unsalvageable. Today's models are just not good enough (yet) to create a complete, sustainable product on their own. You just can't trust them to make wise decisions. Study after study and experiment after experiment show this.
Now, perhaps we make better judgment calls because we have context that the agent doesn't. But we can't really dump everything we know, from facts to lessons, and that pertains to every abstraction layer of the software, into documents. Even if we could, today's models couldn't handle them. So even if it is a matter of context, it is not something that can be solved with better context management. Having an audit trail is nice, but not if it's a trail of one bad decision after another.
I think a lot of it comes down to the training objective, which is to fulfill the user’s request.
They have knowledge about how programs can be structured in ways that improve overall maintainability, but little room to exercise that knowledge over the course of fulfilling the user’s request to add X feature.
They can make changes which lead to an improvement to the code base itself (without adding features); they just need to be asked explicitly to do so.
I’d argue the training objective should be tweaked. Before implementing, stop to consider the absolute best way to approach it - potentially making other refactors to accommodate the feature first.
> A messy codebase is still cheaper to send ten agents through than to staff a team around
People who say that haven't used today's agents enough or haven't looked closely at what they produce. The code they write isn't messy at all. It's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing. The entire construction is just wrong, hidden behind a nice exterior. And when you need to add a couple more floors, the agents can't "get through it" and neither can people. The codebase is bricked.
Today's agents are simply not capable enough - without very close and labour-intensive human supervision - to produce code that can last through evolution over any substantial period of time.
Debugging would suffer as well, I assume. There's this old adage that if you write the cleverest code you can, you won't be clever enough to debug it.
There's nothing really stopping agents from writing the cleverest code they can. So my question is, when production goes down, who's debugging it? You don't have 10 days.
They can work really well if you put sufficient upfront engineering into your architecture and its guardrails, such that neither agents nor humans can produce incorrect code in the codebase. If you just let them rip without that, then they require very heavy baby-sitting. With that, they're a serious force-multiplier.
They just make a lot of mistakes that compound and they don't identify. They currently need to be very closely supervised if you want the codebase to continue to evolve for any significant amount of time. They do work well when you detect their mistakes and tell them to revert.
Oh my god have Anthropic products been absolutely saying everything is load-bearing for the last week or so. Literally every other paragraph has “such and such is load-bearing”.
The problem is, the MBAs running the ship are convinced AI will solve all that with more datacenters. The fact that they talk about gigawatts of compute tells you how delusional they are. Further, the collateral damage this delusion will cause, as these models sigmoid their way into agents, harnesses, expert models, fine-tuned derivatives, and cascading manifold intelligent word-salad exercises, shouldn't be underestimated.
First, it's not "can occur" but does occur 100% of the time. Second, sure, it does mean something is missing, but how do you test for "this codebase can withstand at least two years of evolution"?
You can spend a lot of time perfecting the test suite to meet your specific requirements and needs, but I think that would take quite a while, and at that point, why not just write the code yourself? I think the most viable approach of today's AI is still to let it code and steer it when it makes a decision you don't like, as it goes along.
You have to fight to get agents to write tests, in my experience. It can be done, but by default they don't. I've yet to figure out how to get any agent to use TDD - that is, write a test and then verify it fails. Once in a while I can get it to write one test that way, but it then writes far more code to make it pass than the test justifies, and so it still misses coverage of important edge cases.
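For concreteness, the loop being asked for here is small: write one test, run it, watch it fail, then write only enough code to make it pass. A minimal pytest-style sketch (the `slugify` function and its behaviour are invented for illustration, not from the thread):

```python
# test_slugify.py
# Step 1: write the tests FIRST and run pytest to confirm they fail
# (this failure-verification step is the one agents tend to skip).

def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_collapses_spaces():
    assert slugify("a  b") == "a-b"

# Step 2: only after seeing the failures, write the minimal
# implementation that makes them pass - and nothing more.
def slugify(title):
    # lowercase, split on any whitespace run, rejoin with hyphens
    return "-".join(title.lower().split())
```

The discipline isn't the test syntax; it's the ordering - fail first, then implement - which is exactly what's hard to get an agent to respect.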
Instead of fighting agents to write tests, what if the testing agent is the product itself? That's the idea behind Autonoma (https://github.com/autonoma-ai/autonoma), AI agents that do E2E testing by exploring your app like a real user.
I have TDD flow working as a part of my tasks structuring and then task completion.
There are separate tasks for making the tests and for implementing. The agent which implements is told to pick up only the first available task, which will be “write tests task”, it reliably does so. I just needed to add how it should mark tests as skipped because it’s been conflicting with quality gates.
A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec. Whether it be humans or agents, people rarely specify that one explicitly but treat it as an assumed bit of knowledge.
It goes the other way quite often with people. How often do you see K8s for small projects?
> A lot of that can be overcome by including the need to be able to put more floors on top as part of the spec
I wish it could, but in practice, today's agents just can't do that. About once a week I reach some architectural bifurcation where one path is stable and the other leads to an inevitable total-loss catastrophe from which the codebase will not recover. The agent's success rate (I mostly use Codex with gpt5.4) is about 50-50. No matter what you explain to them, they just make catastrophic mistakes far too often.
It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were made available to the agent, and all of which the model was trained on. It's hard to imagine a better-tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, the attempt was to write software that's smaller, better tested, and better specified than virtually any piece of real software, and the agents still failed.
Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.
Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (using the expensive loop) in the code that consumes the data structure, in case it had maintained it incorrectly.
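A hypothetical miniature of that pattern (the class and names are invented, not the actual code in question): the agent is asked to maintain a cached maximum so reads avoid a full scan, and in the same change adds a "just in case" fallback that re-runs the full scan on every read - silently masking any bug in the maintenance code and negating the optimisation it was asked for:

```python
class Inventory:
    """Tracks prices; caches the maximum so lookups can avoid a full scan."""

    def __init__(self):
        self.prices = []
        self.max_price = None  # invariant: equals max(self.prices) when non-empty

    def add(self, price):
        self.prices.append(price)
        # maintain the invariant incrementally (the cheap path)
        if self.max_price is None or price > self.max_price:
            self.max_price = price

    def highest(self):
        # The defensive contingency: re-derive the value with the costly
        # full scan "in case" the invariant was maintained incorrectly.
        # Any bug in add() is now silently repaired here instead of
        # surfacing, and every read pays the cost the cache was meant to avoid.
        if self.max_price != max(self.prices, default=None):
            self.max_price = max(self.prices, default=None)
        return self.max_price
```

A reader (human or agent) can no longer tell whether the invariant actually holds, because the consuming code no longer relies on it.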
This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.
I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
Lol I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:
"It's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing."
If your test/design of a BUILDING doesn't include at least simulations/approximations of such easy-to-catch structural flaws, it's just bad engineering. Which rhymes a lot with the people that hate AI. By and large, they just don't use it well.
And sometimes it can't even handle it then. I was recently porting ruby web code to python. Agents were simultaneously surprisingly good (converting ActiveRecord to sqlalchemy ORM) and shockingly, incapably bad.
For example, Ruby uses blocks a lot. Ruby blocks are curious little thingies: they're arguably just syntax sugar for a HOF, but man, it's great syntax sugar. Python, meanwhile, has "yield", which is the same keyword Ruby uses for blocks but works fundamentally differently (instead of being a HOF, it's for producing an iterator/generator). There are decorators that use yield's ability to "pause" execution and send control flow back out of the function for a moment (@contextmanager), which feels _even more_ like Ruby blocks, but it's a rather limited trick: the decorator has to adapt the generator to a context manager, and there's just no good way to generalize that.
Somehow this is the perfect storm that makes LLMs completely incapable of converting Ruby code that uses blocks for anything more than the basic iteration found in the stdlib. They will try to port it to Python code that is either nonsensical or uses yield incorrectly and doesn't actually work (in ways that even type checkers can spot). And even if you can technically whack it with a hammer until it works with yield, that's often not the way to do it at all. Ruby devs use blocks not-uncommonly, while Python devs rarely use yield outside of @contextmanager. So the right move is usually to restructure the control flow to not need blocks/HOFs (or to double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive.)
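To make the mismatch concrete, here's an invented miniature (a retry helper, not an example from the thread). In Ruby this would be `with_retries(3) { flaky_call }`; a keyword-for-keyword port using Python's yield turns the function into a generator that runs nothing when called, and the idiomatic fix is to restructure and pass the callable explicitly:

```python
# Ruby original (for reference):
#   def with_retries(n)
#     n.times { return yield rescue next }
#     raise
#   end

def with_retries_wrong(n):
    # Naive "port": Python's yield makes this a generator function.
    # Calling with_retries_wrong(3) returns a generator object and
    # executes none of the body - nothing like the Ruby block.
    for _ in range(n):
        yield

def with_retries(n, fn):
    # The idiomatic restructure: take the callable explicitly.
    last_err = None
    for _ in range(n):
        try:
            return fn()
        except Exception as err:
            last_err = err
    raise last_err
```

Usage looks like `with_retries(3, lambda: flaky_call())` - noisier than the Ruby block, which is exactly why the right port usually restructures the call sites rather than chasing yield.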
The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of ruby-to-python conversions. Maybe that's indicative of something, but I digress.
I'll grant you that Go is extremely opinionated; that's its shtick. But it's an old language that started out with a 1970s design as a statement by its creators against modern programming languages. From its language design, through its compiler, to its GC algorithm, it is intentionally retro (Java retired its Go-like GC five years ago because the algorithm was too antiquated). It may suit your taste and I'm not suggesting that it's bad, but modern it is not.
These are all the options that have ever existed, including options that are or were available only in debug builds used during development and diagnostic options. There are still a few hundred non-diagnostic "product" flags at any one time, but most are intentionally undocumented (the list is compiled from the source code [1]) and are similar in spirit to compiler/linker configuration flags (only in Java, compilation and linking are done at runtime) and they're mostly concerned with various resource constants. It is very rare for most of them to ever be set manually, but if there's some unusual environment or condition, they can be helpful.
First, as others have pointed out, it's always been like that up to a point. But that's not the problem with X.
I didn't leave X when Musk acquired Twitter, and I'm not scandalised by people's political positions, even when they're extreme. But a position and behaviour are two very different things (e.g. being a racist vs making a Nazi salute on live television). I left when the atmosphere amplified by the site became... not for me. I won't go into a pub full of football hooligans not because I disagree with their club affiliation but because their conduct creates an atmosphere that's not for me.
As for newspapers (even ignoring those with political party affiliations, something that was common in newspapers' heyday), most of them preserved some kind of civil decorum, and those that didn't weren't read by those who wanted some decorum. How civilised some environment is is not a matter of political position.
Also, there were always some people of influence that held extreme views. But such people behaving in an uncivilised manner in public was less common (and certainly less accepted).
"Code quality" here isn't referring to some aesthetic value. Coding agents write code that doesn't converge, meaning code that they cannot evolve after a while. They get to the point where fixing one bug causes another, and then the codebase is in such a state that no human or agent can salvage.
People who say they don't care about the quality of code produced by agents are those who haven't been evolving non-trivial codebases with agents long enough to see just how catastrophically they implode after a while. At that point, everyone cares, and that point always comes with today's agents given enough lines and enough changes.
> Coding agents write code that doesn't converge, meaning code that they cannot evolve after a while
That's not true, and I'm not sure what that even means. It's totally up to you, the human, to ensure AI code is mergeable or evolvable, or meets your quality standard in general. I've certainly had to tell Claude to use different approaches for maintainability, and the result is no different than if I'd done it myself.
Sure, if you vigilantly review the agent's output and say no when it's wrong (which happens very frequently) then things work. I meant that without such very close supervision things don't converge, because agents make mistakes that compound.
1. "Vibe coding" is a spectrum of how much human supervision (and/or scaffolding in the form of human-written tests and/or specs) is involved.
2. The problem with "bad code" has nothing to do with the short-term success of the product but with the ability to evolve it successfully over time. In other words, it's about long-term success, not short-term success.
3. Perhaps most importantly, Claude Code is a fairly simple product at its core, and almost all its value comes from the model, not from its own code (and the same is true on the cost side). Claude Code is a relatively low-stakes product. This means that the problems caused by bad code matter less in this instance, and they're managed further by Claude Code not being at the extreme "vibey" end of the spectrum.
So AI aside, Claude Code is proof that if you pour years and many billions into a product, it can be a success even if the code in the narrow and small UI layer isn't great.
There's this definition of LLM generation + "no thorough review or testing"
And there's the more normative one: just LLM generation.[1][2][3]
"Not even looking at it" is very difficult as part of a definition. What if you look at it once? Or just glance at it? Is it now no longer vibe coding? What if I read a diff every ten commits? Or look at the code when something breaks?
At which point is it no longer vibe coding according to this narrower definition?
If you do not know the code at all, and are going off of "vibes", it's vibecoding. If you can get a deep sense of what is going on in the code from looking at a diff every ten commits, then that's not vibe coding (I, myself, am unable to get a sense from that little of a look).
If you actually look at the code, understand it, and would stand by it, then it's not vibecode. If you had an LLM shit it out in 20 minutes and you don't really know what's going on, it's vibecode. Which, to me, is not derogatory. I have a bunch of stuff I've vibecoded and a bunch of stuff where I've actually read the code and fixed it, either by hand or with LLM assistance. And ofc, all the code that was written by me prior to ChatGPT's launch.
You're repeating the broader definition, great. But your post leaves me with the same question about degrees.
You say there are two cases - no review, and full review with a "deep sense of the code" - and that one is vibe coding and one is not.
What about the degrees in between? At what point does vibe coding become something else?
For example, I would not say "looking at the diffs" to ever be enough review to get a deep sense of what's been done. You need to look at diagrams and systematically presented output to understand any complex system.
Is one person's vibe coding then another person's deep-understanding non-vibe coding?
If you can answer this question you may be able to convince me.
You're right that it's a spectrum. Just like anything else, you can be 'mostly' vibe coding or 'somewhat' vibe coding. But the threshold where it stops being vibe coding isn't entirely subjective.
If you are trusting the AI's logic and primarily verifying the output (the app runs, the button works), you are vibe coding. If you are reading the diffs, verifying the architecture, you are transitioning back toward engineering. Any sincere developer knows where they are sitting on that spectrum.
You say the threshold is not entirely subjective, but then you describe a subjective (you just know it) and ambiguous (transitioning back toward engineering) threshold.
Sure seems to me like it's subjective.
Also, I've never heard so much talk about "verifying architecture" as when people talk about vibe coding.
That's not something you usually do. The architecture is the overall structure of a design, and has to be elaborated into functional designs and interface contracts before you have something you can verify in actual code. The architecture itself is very much an intangible thing. "Verifying architecture" in diffs is nonsense, and is definitely not engineering.
Hm, you could do like five degrees of vibecoding. Level 1: you laboriously still look at the code and the diffs being generated.
Level 2: you sometimes look at the code being generated. You have a feel for how the classes are architected together but don't know the details. Level 4: you're aware of the classes and files in use, but beyond that you have no idea what's going on.
Level 5: you just spit stuff at the LLM and have it shit out code that you have zero clue what it's doing. You don't even know if you're using React or not!
It's a bit absurd that a semantic debate is happening over a term coined in someone's shower thought tweet. Maybe the real problem is that it's just a stupid phrase that should never have been taken so seriously. But here we are...
I think it's perfectly serviceable. Prompting software into existence is a vibes-based activity, and it's completely at odds with engineering. Which is why it's good that there's a term that conveys this.
1 is definitely false right now. I gave specs, tests, full datasets, and reference code to translate to an LLM, and it still produced garbage code and fell flat on its face. I just spent a week translating a codebase from Go to C++ and had to throw the whole thing out because it put in some horrible bugs that it could not fix, even after burning $500 worth of tokens with me babysitting it. As I said, it had everything at its disposal: tests, a reference impl, lots of data to work with. I finally got my lazy ass to implement it, and lo and behold, I did it in two days with no bugs (that I know of), and the code quality is miles better than that undigested vomit. The codebase was a protocol library for decoding network traffic that used a lot of bit twiddling, flow control, Huffman table compression - mildly complicated stuff. So no: if you want working, non-trivial code that you can rely on, then definitely don't use an LLM to do it. Use it for autocomplete and small bits of code, but never let the damn thing do the thinking for you.
Oh, I agree. Anthropic themselves proved that even with a full spec and thousands of human-crafted tests, unsupervised agents couldn't produce even something as relatively simple as a workable C compiler, even when the model was trained on the spec, tests, the theory, a reference implementation, and even when given a reference implementation as an oracle.
But my point was that I don't think the development of Claude Code itself is unsupervised; hence it's not really "vibe coded".
My main problem with Scott Alexander is this: To draw correct conclusions from data, a necessary (though insufficient) condition is to be an expert in the field from which the data is drawn and/or to which the data applies. Otherwise, you might not know how accurate the sources of the data are and, more importantly, whether you're considering enough context (i.e. whether you have all the right data to draw your conclusion). At the very best, you can consider the objections you've heard, but are these (all) the right objections? For example, when I read Paul Krugman on international trade or central banks, at least I know that he's an expert in that subject matter so he knows what context may be more or less relevant. When he's not an expert in some subfield of economics, at least he knows who the experts are and refers to them.
Scott Alexander is not an expert in almost anything he writes about. As far as I know, he's not done any scholarly work outside his area of practice, psychiatry. In relation to this post's subject, Alexander is not an expert in criminology, law enforcement, political perception, or sociology. Then again, neither is the author of this post (at least they don't say what their relevant credentials are). It seems neither of them even knows who the experts are. I can understand why they find the question interesting, but they're ill-equipped to provide answers. Both personal perception and data can obviously be misleading, which is precisely why people who truly want to understand something spend years becoming experts.
It seems to me that both Alexander and the author of this post are, actually, members of the same church whose members are those who believe that people can draw correct conclusions from a smattering of data without the necessary scholarship and expertise, and that you can understand something complicated without putting in all the effort required to understand it: the Church of Dunning–Kruger Dilettantism.
Of course, anyone is free to write their thoughts on anything, and readers are free to form opinions on what they read. What this reader sees here is two people arguing over something that both know far too little about to offer the relevant insight. What is interesting to me is that someone who's not particularly knowledgeable on the subject of crime took the time to write a long rebuttal to another post about crime written by someone else who knows just as little. I can guess that's because that church is large.
> It seems to me that both Alexander and the author of this post are, actually, members of the same church - the church of those who believe that people can draw correct conclusions from a smattering of data without the necessary scholarship and expertise, and that you can understand something complicated without putting in all the effort required to understand it. It's the Church of Dunning–Kruger Dilettantism.
We are all like that; we have no other option, have we? I mean, either we try to understand the world around us, or we don't. We can't be experts in everything, so in most cases we go by Dunning–Kruger Dilettantism.
Scott made dilettantism into a profession; he has his methods and he sharpens them. He debates things with other dilettantes, and it helps them improve. To me, personally, it is one of the main attractions of the blog. I'm a dilettante in a lot of topics, but I still don't want to simply ignore them just because I'm not an expert.
> What is interesting to me is that someone who's not particularly knowledgeable on the subject of crime took the time to write a long rebuttal to another post about crime written by someone else who knows just as little.
It is not really about crime. The author we're discussing talks about methodology; they're on a meta level of discussion, and the crime discussion is just one data point for that meta-discussion.
Your post is part of the same meta-discussion about methodology, though your attack comes from the other direction.
> We can't be experts in everything, so in most cases we go by Dunning–Kruger Dilettantism.
Or care enough to find out what the experts say? Surely that's the best way to start understanding the world around us. And if the experts don't agree on an answer, the people who know less probably won't contribute much, but at least it raises the level of discussion.
> Scott made the dilettantism into a profession, he has its methods and he sharpens them. He debates things with other dilettantes, and it helps them to improve themselves.
I won't judge the methods people use to improve themselves, but I can say that this is not a good method of getting closer to the truth, just in case that is also something they're interested in besides self-betterment. No amount of thought or debate can substitute for scholarship.
> The author we discussing talks about methodology
Methodology of what? Self-improvement or getting to the bottom of why people think there's a rise in crime? Because if it's the latter, a better methodology than either would surely begin with studying the subject more seriously.
> Or care enough to find out what the experts say?
And what do experts say about crime? I don't care about crime levels, so I have no idea where to look, really. It seems that no one knows where to look, both in the comments on Scott's blog and here on HN. Could you find an expert take on this discrepancy between reported levels of crime and perceived levels of crime?
If you can't, and no one can, then maybe it is a good question to try our abilities to work with data and draw conclusions?
> this is not a good method of getting closer to the truth
I do not care about crime, but I know that I can't ask experts all the questions I have. So I just have no other option.
> a better methodology than either would surely begin with studying the subject more seriously.
You can't be an expert in all topics at once, so this is not an option. Probably you could create a think tank, inviting experts from different areas, which could answer all your questions. But where do you get the money for that?
There have been numerous studies on crime perception and fear of crime over the last 50 years. Here's just a small sample of what an internet search brought up in ten minutes.
For example, one of these sources, in the British Journal of Criminology, begins with: "For over 40 years, the fear of crime has been a staple of North American, British and European criminological research. Hundreds of publications have sought to illuminate the social and emotional risks associated with worry about crime (Ferraro 1995; Hale 1996; Visser et al. 2013)."
> If you can't, and no one can, so maybe it is a good question to try our abilities to work with data and draw conclusions?
Not like that. You'll find that most questions that more than one person is interested in, and for which there's good data (and even some for which there isn't), already have experts thinking about them. If there aren't any yet, and you're qualified (i.e. have the requisite context), people will pay you to work on the problem through something called "research grants".
> You can't be expert in all topics at once, so this is not an option. Probably you can create a think tank, inviting experts in different areas into it, which can answer all your questions. But where do you get the money for that?
Well, in Europe we have these things, only we call them universities, not "think tanks", and while far from perfect, they're the best way humanity has come up with for trying to answer hard questions. They work like this: people spend some years getting the needed context in various fields, and then different people spend years studying different areas. These people publish their findings periodically, so people who aren't experts in a particular field can read what the experts wrote. Because people make mistakes, multiple people explore every question, and then they argue with each other; but when they do, they already know what they're talking about. This turns out to work better than people starting with a blank slate and trying to ponder their way to an answer in the course of days or weeks.
Of course, because many things that interest people are non-linear systems, there are many things we can't really get definitive answers for, and it's a fun exercise to think about the (currently) unanswerable questions. But to do it well, you need to start from the current state of knowledge and at least survey what the experts have found so far, rather than start from scratch with some pieces of data you may not have the right context to analyse.
That's why Scott Alexander isn't taken seriously. He's playing a game based on the plot of the film Memento: he makes sure he has just the right pieces to make something interesting to him, yet simple enough to think through in days or weeks and maybe reach a conclusion (a conclusion that perhaps could not be reached without erasing enough information to paint a simple-enough picture). Some people find it entertaining, and I can see why; it evokes natural philosophy, which has certainly produced a lot of entertaining prose. But ultimately it's a game of fantasy science.
> There have been numerous studies on crime perception and fear of crime in the last 50 years.
Nice, so there are relevant studies. Pity no one tried to troll Scott with these, it would be interesting to watch.
> Not like that. You'll find that most questions for which there's good data (and even if there isn't) that more than one person is interested in already have experts thinking about them. If there aren't any, and you're qualified (i.e. have the requisite context), people will pay you to work on the problem through something called "research grants".
The existence of experts is not a guarantee they will answer your questions, or that they will publish their answers in an accessible way.
Grants are not the answer: you need to be an expert in a topic to get a grant, but you can't be an expert in all topics and get all the grants.
> Well, in Europe we have these things, only we call them universities, not "think tanks"
You can call them what you like, but if you're not the one paying the researchers, they probably won't answer your questions.
> to do it well, you need to start from the current state of knowledge
One can't do that in most cases. It's possible for a small, select range of topics, but not for all the questions I might want answers to.
> at least survey what the experts have found so far rather than start from scratch with some pieces of data you may not have the right context to analyse.
Yeah, I agree with that.
> That's why Scott Alexander isn't taken seriously. He's playing a game based on the plot of the film Memento: He makes sure he has just the right pieces to make something interesting for him yet simple enough to think through in days or weeks and maybe reach a conclusion (and a conclusion perhaps could not be reached without erasing enough information to paint a simple-enough picture). Some people find it entertaining, and I can see why; it evokes natural philosophy, which has certainly produced a lot of entertaining prose.
Yeah, hard to argue with that.
> But ultimately it's a game of fantasy science.
I can't resist asking: "and what?"
Universities are not really good in most cases. It's a rare occasion to read a PhD's take on something like the Iran war (Brett Devereaux just couldn't resist it); mostly all you have are the opinions of self-proclaimed experts, and if you're lucky, it will be the opinion of Scott Alexander. Alternatively, you can try to churn through the data yourself. Well, maybe if I had the money to bring down all the paywalls around the Internet I could read PhDs talking about the world around us all day long, but that takes too much money.
And when it comes to using far-from-perfect sources to understand what is going on, the only strategy I know is to stick to some number of these "experts". With time you'll learn their strengths and weaknesses, and you'll learn how to form your own opinion based on their thoughts. You can't just stick to real experts, because you'd need a lot of them to cover a wide enough range of knowledge, so you'd better find some non-experts who cover a wide range of topics, even though for most of those topics they won't have a qualification. Know your sources and watch them argue with others. You'll learn when they are right and when they are mistaken.
> Pity no one tried to troll Scott with these, it would be interesting to watch.
I would assume that's because people who generally try to learn things from sources that communicate acquired knowledge aren't Scott Alexander's audience in the first place.
> You can call them what you like, but if you're not the one paying the researchers, they probably won't answer your questions.
Seems like you're not familiar with academia. One of the biggest problems with academia these days is a glut, not a dearth, of questions being explored. And besides, even if Scott Alexander were to try and answer your questions, there's no reason to trust his answers, because he just doesn't have the necessary tools to answer them. What you're saying sounds like: well, if science won't answer your question, ask a medium. Sure, a medium might be happy to answer your question, but why would you think that the answer is correct?
If no one who is qualified to answer your question can answer it or wants to answer it, then it remains unanswered. There are lots of open questions. People who insist on getting answers to everything are asking to become gullible and accept wrong answers. Such a demand produces a supply of hacks who will be happy to answer anything.
Personally, I find it interesting that there are so many unanswered questions, but even if you don't, the belief that they can be answered by some cursory assembling of data and a few days of thinking is just wrong. I mean, you'll get an answer, but it probably won't be the right answer.
> Universities are not really good in most cases.
I don't know what "good" is judged against, but what the experts produce is typically superior to what Scott Alexander produces (which is why he's not taken seriously).
> You can't just stick to real experts, because you'd need a lot of them to cover a wide enough range of knowledge
If anything, there are too many, not too few. This means that in some niche areas you end up with bad experts. But that's no worse than the hacks.
> so you'd better find some non-experts who cover a wide range of topics, even though for most of those topics they won't have a qualification.
Yes, but the good ones cover what the relevant experts say. The kind of stuff you may find in, say, the Economist or Paul Krugman's blog.
Scott Alexander's writings read to me like something written by a bright middle-schooler who has lots of thoughts and ideas but little relevant knowledge. And because the people who are interested in actual knowledge know they won't find it in his writings, what you end up with is an entire community with a very lively debate among a bunch of people, none of whom know what they're talking about.
This is getting interesting. You sounded like an adherent of the Church of Science from the very beginning, but now it's unmistakable. Science has no monopoly on truth. I'm not familiar with academia, you're right, and I don't know how much Science shares your ideas, but if it does, then Science has defeated itself. I hope that isn't so, and I believe it isn't.
Below is a list of reasons why we cannot grant Science a monopoly on truth. I'm not claiming it's comprehensive; these are just the reasons off the top of my head. But before that, I'll say two more things.
Science doesn't have a monopoly on truth, and Science is not a perfect instrument. Therefore I'll seek wisdom in all places, and I'll use my own brain to think, even if I'm not an expert in a topic. I will not ask a medium, but not because I believe it is absolutely useless; it is because I have limited time and need to prioritize sources with a high expected value for their answers. But I will not bind myself to the highest-value sources either, because the path to truth is not linear; you'd better remember Monte Carlo methods and add some randomness to your moves.
I also want to say something about "unanswered questions". It's really good that you feel comfortable despite uncertainty. Most people are not; they can't sleep if they have unanswered questions, and they'll pick a random answer just to make the question answered. But it's possible to do better than just stuffing a question into a pile of unanswered questions. If you accept the idea of an answer as a probability distribution over all possible answers, then you can have an answer for any question and move all the uncertainty into the distribution. You can deal with the uncertainty in a more conscious way, sometimes with more intricate methods. You can seek an answer, find no definitive "yes" or "no", and still update your distribution. Unanswered questions stop being unanswered; they become questions with high-entropy answers. Entropy is a continuous function, so a question can become more answered or less answered.
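To make the distribution idea concrete, here is a toy Python sketch (the "question" and all the numbers are invented for illustration): an answer is a probability distribution over possibilities, new evidence updates it via Bayes' rule, and the entropy of the distribution measures how "unanswered" the question still is.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {answer: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def bayes_update(prior, likelihood):
    """Update a prior {answer: p} given each answer's likelihood of the new evidence."""
    unnorm = {a: prior[a] * likelihood[a] for a in prior}
    z = sum(unnorm.values())
    return {a: v / z for a, v in unnorm.items()}

# "Did X cause Y?" starts as a maximally unanswered question: 1 bit of entropy.
belief = {"yes": 0.5, "no": 0.5}
print(entropy(belief))   # 1.0

# Weak evidence, three times likelier under "yes" than under "no".
belief = bayes_update(belief, {"yes": 0.75, "no": 0.25})
print(belief["yes"])     # 0.75
print(entropy(belief))   # ~0.81 -- still uncertain, but less "unanswered" than before
```

The point of the sketch is only that "answered" stops being a boolean: each piece of evidence moves the entropy continuously, with no need for a definitive yes or no.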
And now the list of reasons to reject Science's claim to a monopoly on truth.
1. There are phenomena Science can't deal with because of its methods. The most glaring class of such phenomena is single, irreproducible events. Did Christ rise from the dead? Science can't answer this question, and it probably never will be able to, because it can't make multiple observations. Scott Alexander recently wrote again about a collective vision of the Virgin Mary[1], and science can't tackle the question. I think you should read it through, because you'll notice that science now seems able to meaningfully join the discussion, since similar events have been found; but before those events were found, all science could do about the vision was keep its proud silence on the topic.
2. We cannot trust Scott Alexander, but we cannot trust Science either. Science can be mistaken, and sometimes it may be mistaken for no good reason at all. Did you read Judea Pearl's "The Book of Why"? Pearl tells the history of statistics in his book, and particularly the roles of Pearson and Fisher, who don't look like people who prioritized truth above everything else; they were fighting the truth because... well, I'd say because they were humans with all our failings. Science can be led astray in its search for the truth behind Alzheimer's disease because of a junk paper published 30 years ago. Or Science could spend enormous effort building String Theory and then... just chuck it away.
We should hope that Science will correct itself eventually in all such cases, but should we just wait for that, or are we allowed to stray from Science and use our own brains instead? Maybe Science is closer to the truth than anything or anyone else, but it is still not the Truth. Which means you need to use your own brain and take any statement with a grain of salt. We have to part with boolean logic and embrace the probabilistic nature of truth, even when we talk about scientific truths.
3. Science develops the scientific method and employs state-of-the-art methods to seek truth, but that is scientific propaganda. xD I mentioned Judea Pearl's The Book of Why, which shows examples of Science rejecting better methods, and the causality framework proposed by Pearl and his students is not a method science really wields. Bayesianism is Pearl's previous idea, older than causality by ten years, and Science still relies on debunked p-values[2]. There are better methods, but Science clings to the old ones, rejecting the prospect of learning new tricks.
4. Science is part of a bigger system and can't avoid pressure from it. SJWs have pressured Science to the point where its answers to questions involving the words "black", "white", "male", "female" are just plainly unreliable. Scientists can't publish a paper stating that black people are somehow "worse" than white people. It doesn't matter whether the paper contains truth or not; scientists are just not allowed to publish it. So now any study about a correlation of skin color or sex or gender with anything is just not worth the time needed to read it. I don't know of other examples of this, but you need to be careful when seeking scientific answers and keep in mind that some answers cannot be uttered. Or: science is not allowed to diagnose politicians based on observations of their behavior. Science is restricted by society, culture, and moral and ethical considerations. We should keep that in mind when seeking scientific answers.
5. There is knowledge Science never bothered to incorporate, or just can't incorporate. If my granny shows me how to use a specific herb to treat a wound, should I reject her knowledge as unscientific? Suppose I can't find any scientific answers about the herb; what should I do then? And knowledge can be simply beyond Science. Science works with knowledge that can be written down on paper; if it can't be, Science is out of luck. There is intuition, for example; it cannot be written down. I use it to write English: I don't know the grammar rules, I rely on intuition. I can't formalize my intuition, I can only use it. Well... English may not be the best example for me, because English is my second language and I'm not proficient enough in it to compete with professional linguists, but I am proficient enough in Russian to use it in ways no linguist could explain. No grammar would accept my usages, but other Russians will understand what I'm doing. You see? I'm more proficient in Russian than Science with all its linguistic studies and thousands of linguists.
6. There are questions that are beyond Science, like what it means to be conscious. When scientists try to answer this question, they are no better than laymen. For example, Anil Seth says that "consciousness is biological"[3] and kind of stops there. What a shame. Or Stephen Hawking proclaiming that philosophy is dead. You can't just ask Science what consciousness is and get an answer, because Science doesn't understand the question: right now it is a philosophical question. It may become a scientific question, but for now it is not.
First, science is what we call the best epistemological methodology. Of course it makes mistakes, but what makes it unique is that it makes fewer mistakes than any other methodology.
Second, I'm all for using your brain to get as close as possible to answers to the questions that interest you, but using your brain should lead you to choosing a good method of learning about the world. For example, using your brain will tell you that the difference between us and the ancient Greeks is millennia of intense study by people standing on the shoulders of those who came before them and continuing their painstaking work. So your brain should tell you that on almost any topic there are better sources than Scott Alexander, and those are the sources used by people who really want to do the best they can. When I read Alexander, I didn't think he's a total moron; I just thought that he's not at all interesting, insightful, or knowledgeable compared to what else is out there. I just wonder why people who claim to want to use their brain pick such a third-rate source to read. Using your brain should also lead you to conclude that no amount of processing of information would lead you onto the right path to the answer if you're not starting out with all the relevant information. Again, it's like that film Memento: deductions from missing information lead you to wrong conclusions.
> First, science is what we call the best epistemological methodology. Of course it makes mistakes, but what makes it unique is that it makes fewer mistakes than any other methodology.
Yes. But it is a tradeoff. When your first priority is the epistemological quality of the knowledge you generate, you lose the agility that can be needed for practical purposes. I mentioned Judea Pearl before; he describes how the epistemological methodology of Science prevented it for years, more than a decade, from stating clearly and unambiguously that smoking tobacco leads to lung cancer. To be honest, I should mention that other social institutions were no better, and they failed to ban or limit tobacco smoking before Science declared the causal link. So science is still better than the rest, and the story supports your claim that science has the best epistemological methodology. But I still can't help wondering whether it's possible to do better.
I believe you can do better at the individual level. When it's all about your own decisions as an individual, you could have decided that smoking causes cancer and quit long before science reached a consensus.
> I just wonder why people who claim to want to use their brain pick such a third-rate source to read. Using your brain should also lead you to conclude that no amount of processing of information would lead you onto the right path to the answer if you're not starting out with all the relevant information.
Our world generates insane amounts of relevant information, and in a lot of cases it is still not enough to get as much as you need. You have to learn how to reason under uncertainty. You have to learn how to come to conclusions and make decisions when you don't have enough information, either because the information you need is not available or because you have limited time to make a decision.
It's one more deficiency of Science we could add to my list: it cares about truth too much, and it's not that useful when you need to make a decision right now, when you need it so badly that you're going to base your decision on guesses, not on proven scientific truth.
Science has tools for that, but it leaves them to practitioners and concentrates on generating truths of solid scientific quality, not best guesses.
> When I read Alexander, I didn't think he's a total moron; I just thought that he's not at all interesting, insightful, or knowledgeable compared to what else is out there. I just wonder why people who claim to want to use their brain pick such a third-rate source to read.
Well... I can't answer that question. You see, I can't show you any charts, and I can't cite any reputable sources either. My answer will be of neither the Church of Charts nor the Church of Science. All I can do is give you a personal anecdote.
Scott Alexander doesn't write only on topics he is under-qualified in. There are two exceptions:
1. People's minds. He is a psychiatrist, so he is qualified to write about human minds. I'm still not sure his writing is of good scientific quality: he cites some studies sometimes, but while the existence of references is a necessary condition for quality scientific writing, it is not sufficient by itself. Still, his writing on this topic is in at least the top 0.1% of everything written about human minds on the Internet. I have a bachelor's degree in psychology, and I can see it with my eyes closed: everyone believes themselves competent when talking about psychology, and the Internet and bookstores are filled to the brim with psychological garbage. A substantial share of peer-reviewed scientific papers are crap. It's not really hard to get into the top 10% or even the top 1%, but Scott Alexander is still better than that.
BTW, AI is a hot topic now, and while Scott Alexander is not an AI expert, it is useful to look at AI through the eyes of a psychiatrist. You need to filter out a lot of what he says on the topic, but you can still get some insights.
2. A bunch of topics of a more philosophical kind. You can't expect science to have any consensus on those topics. Science is just not good at them until they migrate from philosophy into normal science.
One of the directions of his blog... It's hard to verbalize, but I can explain through history: Scott Alexander started with Yudkowsky. The idea is "rationality", a magic tool that lets you reach the best possible conclusions in any situation. Yudkowsky was (and probably still is) insane; he created a Cult of Yudkowsky's Rationality. I have learnt a lot from Yudkowsky, especially from his posts on why what he created was not a cult. Scott Alexander is much saner; he writes these long blog posts about crime levels or whatever, but he also enumerates his mistakes. I didn't read the references you provided above, but I've read the titles and some abstracts, and I'd bet that the crime levels will be one of Scott Alexander's mistakes of '26. How best to use your brain in practice is very interesting, and Scott Alexander explores exactly this. He may not be perfect at it, but who is? Can you find someone who is better? I'm very concerned about how humanity can tackle the loss of authorities of truth (no more media or public figures you can trust); people are believing the stupidest things. But do I manage to do better? (Do I?) Can I teach others, or can we maybe create something that helps others navigate the ocean of lies to find islands of truth? Scott is not about this, but still close enough to keep me interested.
And, you see, you can sometimes have a discussion with smart people, like this one for example. Such discussions force me to write down my thoughts and think them through. They can help me find my own blind spots. For example, I hadn't really thought about the data Scott Alexander relies on. I could have done better earlier, because I know that you can't interpret data without understanding how it was generated; I know the theory behind it, and I've faced the issue in practice. But I never noticed that Scott Alexander interprets data without understanding how it was generated, and does nothing about it. I mean, if you're trying to reach a quick-and-dirty conclusion you can just ignore some inconvenient question, but maybe you can do better without diving into the question for months? I have some ideas about that, but they're irrelevant to our discussion. What is relevant: the discussions triggered by Scott Alexander attract the kind of people who can point out my blind spots and make me better.
> I mentioned Judea Pearl before; he describes how the epistemological methodology of Science prevented it for years...
Yes, but you're making the same argument. Science is flawed, but it is the least flawed methodology. That means other paths could lead to better results, but only by chance. This is good enough at a societal level, and it plays the role of mutations in natural selection: most mutations aren't adaptive, but over enough of them some will be, and selection is likely to eventually amplify those. Because science is the least flawed (though imperfect) methodology, most other approaches will do worse (probabilistically, they must), but a few will do better (again, probabilistically this needs to happen), and those will become part of the new methodology, i.e. part of science.
> I believe you can do better at the individual level
Only by chance and only with low probability.
> You have to learn how to reason under uncertainty. You have to learn how to come to conclusions and make decisions when you don't have enough information.
Yes, and the best methodology for that is called science. And when science doesn't know better than even chance, you must still make a choice, but there can be no methodology that could consistently lead to a better outcome, because if there were, it would be science.
> What is relevant: the discussions triggered by Scott Alexander attract the kind of people who can point out my blind spots and make me better.
Good, but at some point it's better to raise the bar. Alexander and the other Rationalists write like members of a top US high-school debate team. They're among the best - at the high school level.
> Yes, but you're making the same argument. Science is flawed, but it is the least flawed methodology.
I believe you are missing that science is not just a methodology; it is also a social institution. It may have the least flawed methodology (though compared with what? science still relies on p-values; isn't that a flawed methodology? one can do better than that, there are widely known methods), but it can be very flawed as a social institution, simply because it is a social institution. p-hacking and paper mills are all fruits of the social institution. Science can have the best possible methodology and still produce flawed papers.
You can sometimes do better just because you are not a social institution.
> Only by chance and only with low probability.
Well... I'm not sure I can agree. I'm confused by your use of probability here.
If I tried to come up with a replacement for quantum mechanics, I'd have zero chance of producing something better than the scientific consensus. But if I had watched the public debates about a possible causal link between smoking and cancer in the 1950s, maybe even the 1940s, I could have done better with pretty high probability. Science couldn't figure it out because it lacked methods: it couldn't stage an experiment, it couldn't find the physiological mechanism leading from smoking to cancer, it could only rely on a growing heap of correlational data. And a troll of a statistician (R. A. Fisher) kept dismantling all the arguments, noting that there were other possible explanations for the correlations. Tobacco companies kept buying "research" proving that smoking is fine.
You can be much better than science in such a situation. You can conclude that "tobacco causes cancer" has a 50% probability of being true (maybe not 50%, but 80% or 30%, I don't know). Then you can weigh the pros and cons of smoking with these probabilities and come to the conclusion that the risk isn't worth it. And you could come to this conclusion 10 years before science stated it with confidence close to 100% and politicians started banning smoking ads.
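As a toy illustration of that kind of weighing (every number here is invented purely for the sketch), the decision can be framed as a simple expected-value comparison:

```python
# Expected-value sketch: even at only a 50% belief that the causal link is
# real, quitting can already dominate. All numbers are made up.
p_causal = 0.5            # subjective probability that "tobacco causes cancer" is true
pleasure_of_smoking = 1.0 # utility gained by smoking (arbitrary units)
cost_of_cancer = 50.0     # utility lost if smoking actually gives you cancer
p_cancer_if_causal = 0.1  # chance smoking triggers cancer, given the link is real

ev_smoke = pleasure_of_smoking - p_causal * p_cancer_if_causal * cost_of_cancer
ev_quit = 0.0

print(ev_smoke)            # 1.0 - 0.5 * 0.1 * 50.0 = -1.5
print(ev_smoke < ev_quit)  # True: quit, years before science is ~100% sure
```

The conclusion doesn't require certainty about the causal link; it only requires the expected loss to outweigh the expected gain under your current distribution.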
How to measure the probability of such success?
Or how about predictions for the outcomes of the Iran war? What is science saying? Can I learn its predictions now and write them down along with my own, so I can compare them later and see who was right?
How about economics? I read different experts in economics, and they miss the mark all the time, and sometimes it's just... You see, when Putin started the war, economists were all like "it will take half a year for the Russian economy to collapse". After six months they were predicting a collapse in a year. Now their predictions are very hedged: "a tipping point", "the trends are...", and so on. _I_ could do better than that. And why? Because I know the limitations of economists: they are too deep into their models tuned for normal conditions, so I know to rely on the bigger picture, not on economic data alone. Economists have been predicting that China will face the consequences of its stupid investment strategies for at least a decade (maybe longer; I only got into the habit of reading economists' predictions ~10 years ago). And? It's the same issue: models have their limits, and people often forget that, even if they have a PhD in economics.
> Good, but at some point it's better to raise the bar.
I'm working on it as well. For example, right now I'm into systems analysis. I somehow missed the whole discipline; I should have read at least the basics a decade ago or even earlier. It's mostly obvious stuff, but it still helps to organize knowledge and make it explicit.
> Alexander and the other Rationalists write like members of a top US high-school debate team. They're among the best - at the high school level.
I don't know what my phrasing lacks that makes it so hard to understand. But you see, Scott Alexander is not the worst thing I read regularly. I like debates, and I believe two things about them:
1. If you want to keep your thinking sharp, you need to become a guru of debates
2. You cannot be a guru of debates if you haven't mastered debates at all levels[1]
(I can elaborate on (1) if you want, but I'm not going to explain (2): it becomes obvious once you practice it, but it takes too long to explain to the non-enlightened.)
I read debates in the 4chan style, where trolls troll trolls. I've kind of lost my enthusiasm for taking part, even when I could spend some time on it, but I keep reading it.
> science still relies on p-values; isn't that a flawed methodology? one can do better than that, there are widely known methods
Not generally, no. When there are better methods, of course they are used.
> You can sometimes do better just because you are not a social institution.
I'm not sure what that means. Again, if something is probabilistically the best, by definition it's possible to do better sometimes, but only by chance.
> But if I had watched the public debates about a possible causal link between smoking and cancer in the 1950s, maybe even the 1940s, I could have done better with pretty high probability.
But you're selecting your experiment after the fact. If you're complaining about p-hacking, this is probably its crudest form.
> I read different experts in economics, and they miss the mark all the time
Yes, nonlinear systems are pretty much impossible to predict. Of course they miss the mark a lot. The question is, can you come up with a system that misses the mark less? If you could, that would be the new economics.
> I'm working on it as well. For example, right now I'm into systems analysis.
I would suggest studying some mathematics, because your point above about p-values tells me you're unfamiliar with some of the basics. You might have heard some stuff, but you can't really understand it until you actually study the subject.
> If you want to keep your thinking sharp, you need to become a guru of debates
I just hope you understand the difference between debates that can move a field of knowledge forward - those are typically conducted in writing and over a long time, and all parties have already spent years becoming experts in the field - and the sort of debates we have at the Oxford Union, which are a sport. It's a fine hobby, and you can get better at it, but it's not really "the way" to sharpen your mind. It's a skill you can develop, like in any sport.
There are no shortcuts. You have to put in the years to really know a subject, and then you can learn from others who've studied other subjects.
I generally fall into the first camp, too, but the code that AI produces is problematic because it's code that will stop working in an unrecoverable way after some number of changes. That's what happened in the Anthropic C compiler experiment (they ended up with a codebase that wasn't working and couldn't be fixed), and that's what happens once in every 3-5 changes I see Codex make in my own codebase. I think, if I had let that code in, the project would have been destroyed within another 10 or so changes, in the sense that it would be impossible to fix a bug without creating another. We're not talking style or elegance here. We're talking ticking time bombs.
I think that the real two camps here are those who haven't carefully - and I mean really carefully - reviewed the code the agents write and haven't put their process under some real stress test vs those who have. Obviously, people who don't look for the time bombs naturally think everything is fine. That's how time bombs work.
I can make this more concrete. The program wants to depend on some invariant, say that a particular list is always sorted, and the code maintains it by always inserting elements in the right place in the list. Other code that needs to search for an element depends on that invariant. Then it turns out that under some conditions - due to concurrency, say - an element is inserted in the wrong place and the list isn't sorted, so one of the places that tries to find an element in the list fails to find it. At that point, it's a coin toss whether the agent will fix the insertion or the search. If it fixes the search, the bug is still there for all the other consumers of the list, but the testing didn't catch that. Then what happens is that, with further changes, depending on their scope, you find that some new code depends on the intended invariant and some doesn't. After several such splits and several failed invariants, the program ends up in a place where nothing can be done to fix a bug. If the project is "done" before that happens, you're in luck; if not, you're in deep, deep trouble. But right up until that point, unless you very carefully review the code (because the agents are really good at making code seem reasonable under cursory scrutiny), you think everything is fine. Unless you go looking for cracks, every building seems stable until some catastrophic failure, and AI-generated code is full of cracks that are just waiting for the right weight distribution to break open and collapse.
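The sorted-list scenario is easy to sketch in a few lines (Python here purely for illustration): one insertion that violates the invariant makes a perfectly correct binary search miss an element that is actually in the list.

```python
import bisect

# Invariant: `xs` is always sorted. Writers maintain it; searchers rely on it.
def insert_ok(xs, v):
    bisect.insort(xs, v)           # keeps the invariant

def insert_buggy(xs, v):
    xs.append(v)                   # silently breaks it (e.g. a lost race)

def contains(xs, v):
    i = bisect.bisect_left(xs, v)  # binary search: correct ONLY if xs is sorted
    return i < len(xs) and xs[i] == v

xs = []
for v in [5, 1, 9]:
    insert_ok(xs, v)               # xs == [1, 5, 9]
print(contains(xs, 9))             # True: invariant holds, search works

insert_buggy(xs, 0)                # xs == [1, 5, 9, 0] -- invariant broken
print(contains(xs, 0))             # False! the element is there, the search misses it
```

The "fix the search" option would be replacing `contains` with a linear `v in xs` at the one call site that failed, which hides the broken invariant from that consumer while every other binary search over the list stays wrong.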
So it sounds to me like the people you think are in the first camp not only don't care how the building is built as long as it doesn't collapse, but also believe that if it hasn't collapsed yet it must be stable. The first part is, indeed, a matter of perspective, but the second part is just wrong (not just in principle but also when you actually look at the AI's full-of-cracks code).
It can be especially bad if the architecture is layered, with each layer having its own invariants. In a music player, say, you may have the concept of a queue in the domain layer, but in the UI layer you may have additional constraints that are unrelated to it. Then the agent decides to fix a bug in the UI layer because the description reads like a UI bug, while it's in fact a queue bug.
Shit like this is why you really have to read the plans instead of blindly accepting them. The bots are naturally lazy and will take shortcuts whenever they think you won't notice.
> The program wants to depend on some invariant, say that a particular list is always sorted, and the code maintains it by always inserting elements in the right place in the list.
Invariants must be documented as part of defining the data or program module, and ideally they should be restated at any place they're being relied upon. If you fail to do so, that's a major failure of modularity and it's completely foreseeable that you'll have trouble evolving that code.
Right, except agents get into trouble even when the invariants are documented. Virtually every week I see the agent write strange code with multiple paths. It knows that the invariant _should_ hold, but it still writes a workaround for cases where it doesn't. Something I see even more frequently is when the agent knows a certain exception shouldn't occur, but it does; half the time it will choose to investigate, and half the time it says, oh well, and catches the exception. In fact, it's worse. Sometimes it proactively catches exceptions that shouldn't occur as part of its "success at all costs" drive, and all these contingency plans it builds into the code make it very hard (even for the agent) to figure out why things go wrong.
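The exception-swallowing pattern looks roughly like this (a hypothetical sketch; the function names are invented): the defensive version "succeeds" on input that should be impossible, burying the upstream invariant violation, while the strict version fails loudly at the point of violation.

```python
import json

def load_settings_defensive(raw: str) -> dict:
    # "Contingency" style: the invariant says raw is always valid JSON
    # by the time it reaches this layer, but the code hedges anyway.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}  # the upstream bug is now invisible

def load_settings_strict(raw: str) -> dict:
    # Invariant-respecting style: if raw isn't valid JSON, the invariant
    # was broken upstream, and failing here points at the real bug.
    return json.loads(raw)

assert load_settings_defensive("not json") == {}  # silently "works"
```

Multiply this by every layer the agent touches and you get the contingency-riddled codebase described above, where reproducing a breakage becomes genuinely hard.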
Most importantly, this isn't hypothetical. We see that agents write programs that after some number of changes just collapse because they don't converge. They don't transition well between layers of abstractions, so they build contingencies into multiple layers, and the result is that after some time the codebase is just broken beyond repair and no changes can be made without breaking something (and because of all the contingencies, reproducing the breakage can be hard). This is why agents don't succeed in building even something as simple as a workable C compiler even with a full spec and thousands of human-written tests.
If the agents could code well, no one would be complaining. People complain because agent code becomes structurally unsound over time, and then it's only a matter of time until it collapses. Every fix and change you make without super careful supervision has a high chance of weakening the structure.
Agents don't really know the whole codebase when they're writing the code; their context is way too small for that, and growing the context window doesn't really help (most of it gets ignored). So they're always working piecemeal, and these failures are entirely expected unless the codebase is rigorously built for modularity and the agent is told to work "in the small" and keep to the existing constraints.
> Agents don't really know the whole codebase when they're writing the code
Neither do people, yet people manage to write software that they can evolve over a long time, and agents have yet to do that. I think it's because people can move back and forth between levels of abstraction, and they know when it's best to do it, but agents seem to have a really hard time doing that.
On the other hand, agents are very good at debugging complex bugs that span many parts of the codebase, and they manage to do that even with their context limitations. So I don't think it's about context. They're just not smart enough to write stable code yet.
> Neither do people, yet people manage to write software that they can evolve over a long time
You need a specific methodology to do that, one that separates "programming in the large" (the interaction across program modules) from "programming in the small" within a single, completely surveyable module. In an agentic context, "surveyable" code realistically has to imply a manageable size relative to the agent's context. If the abstraction boundaries across modules leak in a major way (including due to undocumented or casually broken invariants) that's a bit of a disaster, especially wrt. evolvability.
Agents just can't currently do that well. When you run into a problem when evolving the code to add a new feature or fix a bug, you need to decide whether the change belongs in the architecture or should be done locally. Agents are about as good as a random choice in picking the right answer, and there's typically only one right answer. They simply don't have the judgment. Sometimes you get the wrong choice in one session and the right choice in another.
But this happens at all levels because there are many more than just two abstraction levels. E.g. do I change a subroutine's signature or do I change the callsite? Agents get it wrong. A lot.
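As a hypothetical illustration of the signature-vs-callsite choice (all names invented): suppose a formatter needs a timezone it previously assumed was UTC. One option pushes the new requirement into the signature; the other patches around it with hidden state.

```python
from datetime import datetime, timezone

# Option A (change the signature): push the new requirement up to the
# callers, making the dependency explicit everywhere.
def format_timestamp(ts: datetime, tz: timezone) -> str:
    return ts.astimezone(tz).isoformat()

# Option B (patch the callsite side): keep the old signature and smuggle
# the timezone in through module state. Cheaper today; every future
# reader now has to know about the hidden global.
_current_tz = timezone.utc

def format_timestamp_legacy(ts: datetime) -> str:
    return ts.astimezone(_current_tz).isoformat()
```

Neither option is always right; the point is that choosing between them takes judgment about the codebase's direction, not more context.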
Another thing they just don't get (because they're so focused on task success) is that it's very often better to let things go wrong in a way that could inform changes rather than get things to "work" in a way that hides the problem. One of the reasons agent code needs to be reviewed even more carefully than human code is that they're really good at hiding issues with potentially catastrophic consequences.
> Agents are about as good as a random choice in picking the right answer, and there's typically only one right answer.
That's realistically because they aren't even trying to answer that question by thinking sensibly about the code. Working in a limited context, anything they do leaves them guessing and trying the first thing that might work. That's why they generally do a bit better when you explicitly ask them to reverse engineer/document the design of an existing codebase: that's a problem that at least involves an explicit requirement to comprehensively survey the code, figure out which parts matter, etc. They can't be expected to do that by default. It's not even a limitation of existing models; it's quite inherent to how they're architected.
Yes, and I think there's a fundamental problem here. The big reason the "AI thought leadership" claims that AI should do well at coding is that there are mechanical success metrics like tests. Except that's not true. The tests cover the behaviour, not the structure. It's like constructing a building where the only tests are whether the floorplans match the design. That makes catastrophic structural issues easy to hide. The building looks right, and it might even withstand some load, but later, when you want to make changes, you move a cupboard or a curtain rod only to have the structure collapse because that element ended up being load-bearing.
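A tiny sketch of why behavioural tests can't see structure (hypothetical code): two implementations with very different structural assumptions pass the exact same test, so the suite stays green whichever one the agent writes.

```python
def dedupe_sorted(xs: list[int]) -> list[int]:
    # Intended structure: relies on the invariant that xs is sorted,
    # so duplicates are always adjacent. O(n).
    out: list[int] = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out

def dedupe_workaround(xs: list[int]) -> list[int]:
    # Agent's version: ignores the invariant and scans everything. O(n^2).
    # Behaviourally identical on the tested inputs, structurally different.
    out: list[int] = []
    for x in xs:
        if x not in out:
            out.append(x)
    return out

# The behavioural test can't tell them apart:
assert dedupe_sorted([1, 1, 2, 3, 3]) == dedupe_workaround([1, 1, 2, 3, 3]) == [1, 2, 3]
```

The metric says both are fine; only a structural review notices that one of them has quietly stopped depending on (and therefore stopped documenting) the sortedness invariant.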
It's funny, but one of the lessons I've learnt working with agents is just how much design matters in software, and that it isn't just a matter of craftsmanship pride. When you see the codebase implode after the tenth new feature and realise it has to be scrapped because neither human nor AI can salvage it, the importance of design becomes palpable. Before agents it was hard to see, because few people write code like that (just as no one would think to make a curtain rod load-bearing when constructing a building).
And let's not forget that the models hallucinate. Just now I was discussing architecture with Codex, and what it says sounds plausible, but it's wrong in subtle and important ways.
> The big reason the "AI thought leadership" claim that AI should do well at coding is because there are mechanical success metrics like tests.
I mean, if you properly define "do well" as getting a first draft of something interesting that might or might not be a step towards a solution, that's not completely wrong. A pass/fail test is verified feedback of a sort, that the AI can then do quick iteration on. It's just very wrong to expect that you can get away with only checking for passing tests and not even loosely survey what the AI generated (which is invariably what people do when they submit a bunch of vibe-coded pull requests that are 10k lines each or more, and call that a "gain" in productivity).
It's not completely wrong if you're interested in a throwaway codebase. It is completely wrong if what you want is a codebase you'll evolve over years. Agents are nowhere close to offering that (yet) unless a human is watching them like a hawk (closer than you'd watch another human programmer, because human programmers don't make such dangerous mistakes as frequently, and when they do make them, they don't hide them as well).