I found that duplicating a specific block of 7 middle layers in Qwen2-72B, without modifying any weights, improved performance across all Open LLM Leaderboard benchmarks and took #1. As of 2026, the top 4 models on that leaderboard are still descendants.
The weird finding: single-layer duplication does nothing. Too few layers, nothing. Too many, it gets worse. Only circuit-sized blocks of ~7 layers work. This suggests pretraining carves out discrete functional circuits in the layer stack that only work when preserved whole.
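For anyone who wants to reproduce this kind of weight-free layer duplication: the usual tool is mergekit's `passthrough` merge, which is how Goliath-style self-merges are built. A minimal sketch follows; the post doesn't say which 7-layer block was duplicated, so the indices here are made up for illustration (Qwen2-72B has 80 decoder layers, and mergekit's `layer_range` is half-open, so `[40, 47]` covers the 7 layers 40..46):

```yaml
# Hypothetical self-merge: repeat one 7-layer mid-stack block, weights untouched.
slices:
  - sources:
      - model: Qwen/Qwen2-72B
        layer_range: [0, 47]    # layers 0..46
  - sources:
      - model: Qwen/Qwen2-72B
        layer_range: [40, 47]   # layers 40..46 again, duplicated in place
  - sources:
      - model: Qwen/Qwen2-72B
        layer_range: [47, 80]   # rest of the stack
merge_method: passthrough
dtype: bfloat16
```

Run it with `mergekit-yaml config.yml ./merged-model` to write out the 87-layer model.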
The whole thing was developed on 2x RTX 4090s in my basement. I'm now running current models (GLM-4.7, Qwen3.5, MiniMax M2.5) on a dual GH200 rig (see my other post). Code and new models coming soon.
Happy to answer questions.
I have a couple questions:
1. I think this quote should be raising *many more* eyebrows.
> The astounding thing about Goliath wasn’t that it was a huge leap in performance, it was that the damn thing functioned at all. To this day, I still don’t understand why this didn’t raise more eyebrows.
You put a cat's brain into a dog's head and it's still breathing! It didn't flatline immediately! And this is yesterday's news? That seems like the biggest takeaway. Why isn't every <MODEL_PROVIDER> attempting LLM surgery at this moment? Have you noticed any increased discourse in this area?
2. You mentioned you spent the beginning of your career looking at brains in biotech. How did you end up in a basement full of GPUs, working not in biotech, but still kind of looking at brains?
Again, great post!