It has to flush them: if another core sees the result of the atomic op, it must also see everything else this core wrote before the op. It can indeed first see no writes and then suddenly all of them, but it can never see just the atomic op without the preceding writes.
Without that requirement, the store buffer could be kept unflushed, for example to wait and see whether a full cacheline's worth of writes accumulates before flushing.
The comment is correct that an x86 core with a heavy reordering backend will beat an ARM core without one. However, an ARM core with one handily beats an x86 core with one. Case in point: the M1.
In large part, yes, but it's not *the* reason. It's fast because of many things like that. TSO doesn't affect single-core perf much, so it's not really a factor there, and yet the M1 is blazingly fast single-core. The multicore perf is really great too, though.
I haven't verified the exact numbers myself, and it will depend on exactly what you're running, but it's on the order of low tens of percent.
TSO cannot be enabled outside of Rosetta, as it's not exactly a standard ARM extension. Perhaps you could do some trickery, but Apple likely prevents that.
However, you can test it by writing something where you know Rosetta generates comparable ARM assembly from the x86 version, and running the comparison that way. Some sort of parallel lock-free algorithm would be the best candidate.
TSO is possible to enable outside of Rosetta with some shenanigans in the kernel. Unfortunately getting Rosetta to generate code that is comparable with what a compiler would create is quite difficult: it needs to lift x86 into its own IR and then re-do register allocation, which it is quite good at but obviously not perfect.