It has to flush them: if another core sees the result of the atomic op, it must also see everything else this core wrote before the op. It can indeed first see no writes and then suddenly all of them, but it can never see just the atomic op without the preceding writes.
Without that requirement, the store buffer could be kept unflushed, for example to wait and see whether a full cacheline's worth of writes accumulates before flushing.
The comment is correct that an x86 core with a heavy reordering backend will beat an ARM core without one. However, an ARM core with one handily beats an x86 core with one. Case in point: the M1.
In large part, yes, but it's not *the* reason. It's fast because of many things like that. TSO doesn't affect single-core perf much, so it's not really a factor there, and yet the M1 is blazingly fast single-core. The multicore perf is really great too, though.
I haven't verified the exact numbers myself, and it will depend on exactly what you're running, but it's on the order of low tens of percent.
TSO cannot be enabled outside of Rosetta, as it's not exactly a standard ARM extension. Perhaps you could do some trickery, but Apple likely prevents that.
However, you can test it by writing something where you know Rosetta generates comparable ARM assembly from the x86 version, and running the comparison that way. Some sort of parallel lock-free algorithm would be the best candidate.
TSO is possible to enable outside of Rosetta with some shenanigans in the kernel. Unfortunately getting Rosetta to generate code that is comparable with what a compiler would create is quite difficult: it needs to lift x86 into its own IR and then re-do register allocation, which it is quite good at but obviously not perfect.