They're only tricky if you're communicating data between threads without using locks. Since this has always been a "here be dragons" area even with x86's convenient memory model, code that gets it right for x86 but not other architectures is pretty rare.
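
To make that concrete, here's a minimal sketch of the kind of lock-free handoff I mean, written with C++11 atomics. The names (`payload`, `ready`, `producer`, `consumer`, `run_once`) are made up for illustration and not taken from any real codebase. The release/acquire pair is what makes it correct on every architecture. If you swapped both for `memory_order_relaxed`, x86 hardware would still keep the two stores in order (the compiler wouldn't have to), while ARM or POWER hardware could reorder them as well.

```cpp
#include <atomic>
#include <thread>

// Hypothetical example: all names here are illustrative.
int payload = 0;                  // plain, non-atomic data being handed off
std::atomic<bool> ready{false};   // flag that publishes the payload

void producer() {
    payload = 42;
    // Release store: the write to payload above cannot be reordered
    // after this store, by the compiler or by the CPU.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // Acquire load: once this observes true, the producer's write to
    // payload is guaranteed to be visible to this thread.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one producer/consumer handoff and returns what the consumer read.
int run_once() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();  // join makes the consumer's write to seen visible here
    return seen;
}
```

Calling `run_once()` should always return 42 with the orderings as written. A mutex around `payload` would give the same guarantee with less to think about, which is why most code never ends up in this territory.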