Exactly. The challenge isn’t getting the LLMs to make sure they validate their o...

bisonbear · 2026-03-18T03:22:24

I'm becoming convinced that test pass rate is not a great indicator of model quality. Instead, we have to look at agent behavior beyond the test gate, such as how well it aligns with human intent and whether it follows the repo's coding standards.

I wrote a short blog post about this phenomenon, if you're interested: https://www.stet.sh/blog/both-pass

Also +1 on placing heavy emphasis on the plan. If you have a good plan, the code becomes trivial. I have started doing a 70/30 or even 80/20 split between time spent on planning and time spent implementing & reviewing.