Removing the GIL naively would decrease single-threaded performance. Every earlier project that aimed to remove the GIL failed because it could not get performance comparable with GIL'd Python.
It's a fork of Python 3.9 that takes out the GIL and introduces optimisations to speed up both single- and multi-threaded execution (since the bar set by PSF is that a no-GIL implementation must be at least as fast as GIL'd Python on single-threaded programs). He ends up with a net 10% speed improvement.
If he applies these optimisations but keeps the GIL, the performance boost is even larger. So, depending on how you look at it, it's either:
- A bunch of optimisations, plus a GILectomy which slows Python down, or
- A bulk change that removes the GIL and speeds things up
Since these improvements were in a similar ballpark, my fear was that the optimisations would be taken off the branch, with the GIL left in place...
Removing the GIL is one idea (and, as you point out, one that has not worked very well so far). When optimizing, do not depend on 'that one cool trick' to fix everything. In this case it looks like they are removing redundant work, and doing work once and keeping a copy around (caching).
Why would it decrease single-threaded performance? How is Python different from other languages that support native, full-fledged multi-threading, e.g. Java, Go, C#?
A big part is that Python uses reference-counting GC, whereas Java, Go, and C# all use tracing GC. Py_INCREF and Py_DECREF increment and decrement the reference count, and they are not atomic. The GIL keeps refcounts safe by letting only one thread modify them at a time. The naive approach to parallelization would require a lock around every ref inc/dec. There are some more sophisticated approaches (thanks to work by Sam Gross et al) that avoid a mutex hit for every inc/dec.
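To make the "lock around every inc/dec" cost concrete, here is a toy Python model of the naive approach. It is only a sketch: `ToyObject`, `incref`, `decref`, `churn` and `run_naive_demo` are made-up names, and this is not how CPython actually lays out or protects its objects. The point is just that every reference copy or drop now pays for a mutex acquire and release.

```python
import threading


class ToyObject:
    """Hypothetical stand-in for a refcounted object (not CPython's real layout)."""

    def __init__(self):
        self.refcount = 1                 # the creator holds one reference
        self._lock = threading.Lock()     # naive design: one mutex per object

    def incref(self):
        # The read-modify-write must be exclusive; otherwise two threads can
        # both read N and both write N + 1, losing an increment.
        with self._lock:
            self.refcount += 1

    def decref(self):
        with self._lock:
            self.refcount -= 1
            return self.refcount == 0     # True would mean "safe to free"


def churn(obj, n):
    # Simulate a thread repeatedly taking and then dropping a reference.
    for _ in range(n):
        obj.incref()
        obj.decref()


def run_naive_demo(threads=4, n=10_000):
    obj = ToyObject()
    workers = [threading.Thread(target=churn, args=(obj, n)) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return obj.refcount                   # balanced inc/dec leaves the creator's reference


if __name__ == "__main__":
    print(run_naive_demo())
```

The count comes out right, but only because every single inc/dec went through the lock, which is exactly the single-threaded overhead people complain about.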
Tracing GC does not run into this problem. Why Python doesn't use tracing GC is not something I am qualified to answer.
I am by no means knowledgeable enough on the topic, but Swift has a similar problem domain and, afaik, only uses atomic ref counts for objects that “escape” from a given thread. Is there a reason something like that wouldn't work for Python as well?
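For what it's worth, the "only pay for synchronisation once an object is touched by another thread" idea can be sketched as a toy in Python. As far as I understand it, this is roughly the shape of the biased reference counting scheme described for the no-GIL work, but treat everything here as an assumption: `BiasedRef`, `local`, `shared` and `run_biased_demo` are hypothetical names, a real implementation would use atomics rather than a lock, and it says nothing about what Swift does internally.

```python
import threading


class BiasedRef:
    """Toy model: the owning thread counts without a lock, other threads with one."""

    def __init__(self):
        self.owner = threading.get_ident()   # thread that created the object
        self.local = 1                       # touched only by the owner, no lock
        self.shared = 0                      # touched by other threads, under the lock
        self._lock = threading.Lock()

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                  # fast path: no synchronisation
        else:
            with self._lock:
                self.shared += 1             # slow path: synchronised

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._lock:
                self.shared -= 1


def run_biased_demo(threads=4, n=1000):
    obj = BiasedRef()                        # owned by the calling thread
    for _ in range(3):
        obj.incref()                         # owner takes three more references

    def borrow():
        for _ in range(n):
            obj.incref()                     # non-owner threads hit the slow path

    workers = [threading.Thread(target=borrow) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return obj.local, obj.shared             # logical refcount is local + shared


if __name__ == "__main__":
    print(run_biased_demo())
```

The design bet is that most objects are only ever touched by the thread that created them, so the common case stays about as cheap as it is under the GIL.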
Python made its C API visible, so things like reference counting are widely observed by C libraries that interop with Python. This makes changes much harder, since you can't change implementation details that existing programs rely on.
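You can see how exposed the refcount is even from pure Python. A small sketch, assuming CPython (`sys.getrefcount` is a real function, but the absolute numbers it returns are an implementation detail, so the example only looks at the difference; `count_delta` is a made-up helper name):

```python
import sys


def count_delta():
    obj = object()
    before = sys.getrefcount(obj)   # absolute value includes temporary references
    holders = [obj, obj]            # the list now holds two more strong references
    after = sys.getrefcount(obj)
    return after - before           # only the difference is meaningful here


if __name__ == "__main__":
    print(count_delta())
```

C extensions see the same counter directly through Py_INCREF and Py_DECREF, which is why changing how it is stored or updated ripples out to third-party code.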