This is how you do things if you are new to this game.
Get two other, different, LLMs to thoroughly review the code. If you don’t have an automated way to do all of this, you will struggle and eventually put yourself out of a job.
If you do use this approach, you will get code that is better than what most software devs put out. And that gives you a good base to work with if you need to add polish to it.
I actually have used other LLMs to review the code in the past (not today, but in the past). It's fine, but it doesn't tend to catch things like "this technically works but it's loading a footgun." For example, in the Redux tests I mentioned in my original post, the tests were reusing a single global store variable. It technically worked, the tests ran, and since these were the first tests I introduced in the code base there weren't any issues, even though this made the tests non-deterministic... but it was a pattern that was going to break easily down the line.
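The pattern looked roughly like this (a simplified sketch, not the actual code; it assumes Redux Toolkit and a Jest-style runner, and the counter slice is made up):

```typescript
import { configureStore } from "@reduxjs/toolkit";
import { counterReducer, increment } from "./counterSlice"; // hypothetical slice

// Footgun: a single store shared by every test. State leaks between tests,
// so they only stay green while they happen to run in one particular order.
const sharedStore = configureStore({ reducer: { counter: counterReducer } });

test("bad: depends on whatever state earlier tests left behind", () => {
  sharedStore.dispatch(increment());
  expect(sharedStore.getState().counter.value).toBe(1); // passes only if run first
});

// Safer: build a fresh store per test so each one starts from a known state.
const makeStore = () =>
  configureStore({ reducer: { counter: counterReducer } });

test("increments from zero", () => {
  const store = makeStore(); // not sharedStore
  store.dispatch(increment());
  expect(store.getState().counter.value).toBe(1);
});
```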
To me, the solution isn't "more AI", it's "how do I use AI in a way that doesn't screw me over a few weeks/months down the line", and for me that's by making sure I understand the code it generated and trim out the things that are bad/excessive. If it's generating things I don't understand, then I need to understand them, because I have to debug it at some point.
Also, in this case it was just some unit tests, so who cares, but if this was a service that was publicly exposed on the web? I would definitely want to make sure I had a human in the loop for anything security related, and I would ABSOLUTELY want to make sure I understood it if it were handling user data.
The quality of generated code does not matter. The problem is when it breaks at 2 AM and you're burning thousands of dollars every minute. You don't own code that you don't understand, but unfortunately that doesn't mean you don't own the responsibility for it. Good luck writing the postmortem; your boss will have lots of questions for you.
AI can help you understand code faster than you could without it. It lets me investigate problems where I have little context and still write fixes effectively.
> If you do use this approach, you will get code that is better than what most software devs put out. And that gives you a good base to work with if you need to add polish to it.
If you do use this approach, you'll find that it will descend into a recursive madness. Due to the way these models are trained, they are never going to look at the output of two other models and go "Yeah, this is fine as it is; don't change a thing".
Before you know it you're going to have change amplification, where a tiny change by one model triggers other models (or even itself) to make other changes, which trigger further changes, and so on ad nauseam.
The easy part is getting the models to spit out working code. The hard part is getting it to stop.
I've never done this, because I haven't felt compelled to; I want to review my own code. But I imagine this works okay and isn't hard to set up, perhaps by asking Claude to set it up for you...
What? People do this all the time. Sometimes manually, by invoking another agent with a different model and asking it to review the changes against the original spec. I just set up some reviewer/verifier subagents in Cursor that I can invoke with a slash command. I use Opus 4.5 as my daily driver, but I have reviewer subagents running Gemini 3 Pro and GPT-5.2-codex, and they each review the plan as well, and then the final implementation against the plan. Both sometimes identify issues, and Opus then integrates that feedback.
It’s not perfect so I still review the code myself, but it helps decrease the number of defects I have to then have the AI correct.
these two posts (the parent and then the OP) seem equally empty?
by level of compute spend, it might look like:
- ask an LLM in the same query/thread to write code AND tests (not good)
- ask the LLM in different threads (meh)
- ask the LLM in a separate thread to critique said tests (too brittle, violating testing guidelines, testing the implementation and not the behavior, etc). fix those. (decent)
- ask the LLM to spawn multiple agents to review the code and tests. Fix those. Spawn agents to critique again. Fix again.
- Do the same as above, but spawn agents from different families (so Claude calls Gemini and Codex).
---
these are usually set up as /slash commands like /tests or /review so you aren’t doing this manually. since this can take some time, people might work on multiple features at once.
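in Claude Code, for instance, a custom slash command is just a markdown prompt file under .claude/commands/; the one below is a made-up sketch (real ones usually point at project-specific testing guidelines), with $ARGUMENTS standing in for whatever you type after the command:

```markdown
<!-- .claude/commands/review.md (hypothetical) -->
Review the changes on the current branch against the spec in $ARGUMENTS.

- flag tests that assert implementation details instead of behavior
- flag shared mutable state between tests, and anything non-deterministic
- report each issue as file:line plus a suggested fix, then stop; do not edit files
```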
I don't get it. People complain when they have to go to the office. And then some are given the option to work from home. Then they complain their boss can find out where they are during work hours. What on Earth are you complaining about?
It's about hiring adults, respecting and trusting them to do the job and support the team, and be responsible for their methods. The details are not important to that goal.
If an employer instead treats people like toddlers needing supervision, spoon-feeding, and metrics around methods rather than work, that is all they will get.
It's pretty amazing to see the bubble many people here seem to work in. A guess, but probably 90% of employees have to go to work. Either they physically cannot do their job remotely or the employer demands that they be present.
A lot of people are coming across as whiny children here, "Oh no I might have to go to the office for my 6-figure paycheck." Grow up and go to work, as George Carlin might say.
"Oh you're doing work? That's so cute... we're gonna close whatever apps you had open, because we're updating now. We own your computer.
You had unsaved work? Too bad, it's gone, get bent."
This, a thousand times. I hate, hate, hate this "feature". My Macs don't do that. My Linux systems don't do that. The whole "screw you, we don't care" attitude of Microsoft is quite appalling.
Microsoft now makes it very difficult to disable this feature. After a few registry edits, I thought I had put a stop to the madness. But then it went back to rebooting on its own again.
I keep telling Windows 10 to delay these updates by 1 week each time...
Curiously, Office apps have auto-save, as do IntelliJ, VSCode, and even Notepad nowadays. "Restore my work environment after a reboot" almost works, but enough things do disappear (e.g. unsaved web form input) that it's still aggravating. I wonder if they'll make it mandatory for apps to persist more across restarts.
Ok, that requires them to be competent, and if they were competent we wouldn't even have what we have now.
"requiring apps to do X" means blocking running every app written prior to you instituting the requirement, bricking everyone's workflows, so that would be the least competent thing imaginable.
"You don't need the "language model" part to run an autopilot, that's just silly."
I think most of us understood that reproducing what existing autopilots can do was not the goal. My inexpensive DJI quadcopter has impressive abilities in this area as well. But I cannot give it a mission in natural language and expect it to execute it. Not even close.
I am a huge proponent of using AI tools for software development. But until I see a vibe coded replacement for the Linux kernel, PostgreSQL, gcc, git or Chromium, I am just going to disagree with this premise. If I am on a system without Python installed, I don't see Claude saying, oh, you don't need to download it, I'll write the Python interpreter for you.
> I am a huge proponent of using AI tools for software development. But until I see a vibe coded replacement for the Linux kernel, PostgreSQL, gcc, git or Chromium, I am just going to disagree with this premise.
Did you read it?
It isn't saying that LLMs will replace major open source software components. It said that the "reward" for providing, maintaining, and helping curate these OSS pieces (which is the ecosystem they exist in) just disappears if there is no community around them, only an LLM ingesting open source code and spitting out a solution, good or bad.
We've already seen curl buckle under the pressure, as their community-minded, conscientious effort to respond to security reports collapsed under the weight of slop.
This is largely about extending that thesis to the entire ecosystem. No GH issues, no PRs, no interaction. No kudos on HN, no stars on github, no "cheers mate" as you pass them at a conference after they give a great talk.
Where did you get that you needed to see a Linux kernel developed from AI tools, before you think the article's authors have a point?
>> My guess is that you think iMessage is SMS-only
> No, there's Apple's proprietary protocol...
Earlier you asked: "But iMessage is already open?"
Now you are saying that iMessage uses "Apple's proprietary protocol". I hope now you understand that when people say that Apple iMessage is not open, they are not talking about the SMS protocol that Apple does not own.
I used Ralph recently, in Claude Code. We had a complex SQL script that crunched large amounts of data and was slow to run even on tables that are normalized, have indexes on the right columns, etc. We, the humans, spent a significant amount of time tweaking it. We were able to get some performance gains, but eventually hit a wall. That is when I let Ralph take a stab at it. I told it to create a baseline benchmark and I gave it the expected output. I told it to keep iterating on the script until there was at least a 3x improvement in the performance numbers while the output stayed identical. I set the iteration limit to 50. I let it loose and went to dinner. When I came back, it had found a way to get the 3x improvement and had stopped on the 20th iteration.
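To give an idea of the shape of that loop, here is a heavily simplified TypeScript sketch, not our actual harness; the file name, prompt, and helpers are invented, and it assumes the `claude` CLI's `-p` print mode plus `psql` for the benchmark:

```typescript
// Rough sketch only: helper names, file names, and the prompt are made up.
// Assumes the `claude` CLI's -p (non-interactive) mode, psql picking up its
// connection settings from environment variables, and a git checkout.
import { execSync } from "node:child_process";
import { createHash } from "node:crypto";

const MAX_ITERATIONS = 50;
const TARGET_SPEEDUP = 3;

// Run the script, returning wall-clock time and a hash of its output,
// so "output must stay identical" becomes a cheap equality check.
function runBenchmark(scriptPath: string): { ms: number; outputHash: string } {
  const start = Date.now();
  const output = execSync(`psql --csv -f ${scriptPath}`, { encoding: "utf8" });
  return {
    ms: Date.now() - start,
    outputHash: createHash("sha256").update(output).digest("hex"),
  };
}

const baseline = runBenchmark("crunch.sql");

for (let i = 1; i <= MAX_ITERATIONS; i++) {
  // One optimization attempt per iteration (the real prompt is much longer).
  execSync(`claude -p "Optimize crunch.sql further without changing its output."`, {
    stdio: "inherit",
  });

  const attempt = runBenchmark("crunch.sql");
  if (attempt.outputHash !== baseline.outputHash) {
    console.log(`iteration ${i}: output changed, reverting`);
    execSync("git checkout -- crunch.sql");
    continue;
  }
  if (baseline.ms / attempt.ms >= TARGET_SPEEDUP) {
    console.log(`iteration ${i}: hit ${TARGET_SPEEDUP}x, stopping`);
    break;
  }
}
```

The important bits are the guardrails: the output has to stay identical, the speedup target decides when to stop, and the iteration cap keeps it from running forever.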
Is there another human who could get even better performance given the same parameters? Probably. In the same amount of time? Maybe, but unlikely. In any case, we don't have anybody on our team who can think of 20 different ways to improve a large and complex SQL script and try them all in a short amount of time.
These tools do require two things before you can expect good results:
1. An open mind.
2. Experience. Lots of it.
BTW, I never trust the code an AI agent spits out. I get other AI agents, running different LLMs, to review all work and create deterministic tests that must be run and must pass before the PR is ever generated. I used to do a lot of this manually. But now I create Claude skills that automate a lot of this away.
AI agent skills are very useful. Unlike MCP, they do not waste context. Most of the time I am building skills that are very particular to my project. But occasionally I do use a skill that is more generic, particularly when something is too new to have made it into the LLM training data set, or not common enough.
> Just get off your ass and go and give them the message...
If I need to have all 4 members of the family meet me at the pool, first I need to go find each one of them. They could all be in different places. And then tell them individually to meet me at the pool? Is that the better solution you are proposing?