Real-world benchmarks and big O measure different things, so neither is a substitute for the other.
Real-world benchmarks don't just measure the algorithm; they also measure the implementation. Comparing big O lets you compare the algorithms directly instead of comparing particular implementations of them.
Big O doesn't take into account constant factors, the time variability of CPU operations (e.g. cache hit vs cache miss), implementation details, or the possibility that all useful input sizes are small. This means that an algorithm with a better big O can still be slower than other algorithms at practical input sizes.
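To make the constant-factor point concrete, here's a rough sketch using two made-up cost models. The constants 2 and 50 and the function names are assumptions picked purely for illustration, not measurements of any real algorithm. It finds the input size at which the "better big O" model actually starts to win:

```python
import math
from typing import Optional


def cost_quadratic(n: int) -> float:
    """Hypothetical cost model: O(n^2) with a small constant factor."""
    return 2.0 * n * n


def cost_linearithmic(n: int) -> float:
    """Hypothetical cost model: O(n log n) with a large constant factor."""
    return 50.0 * n * math.log2(n)


def crossover(max_n: int = 1_000_000) -> Optional[int]:
    """Return the smallest n >= 2 where the O(n log n) model is cheaper.

    Returns None if no such n exists up to max_n.
    """
    for n in range(2, max_n + 1):
        if cost_linearithmic(n) < cost_quadratic(n):
            return n
    return None


if __name__ == "__main__":
    n = crossover()
    print(f"O(n log n) model only wins from n = {n} onward")
```

If every input you care about is smaller than that crossover, the asymptotically worse model is the faster one under these assumed constants.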
Creating an algorithm with a better big O is a mathematical breakthrough, but it's not proof that it's going to be practical in the foreseeable future. If you also want to prove that, you need real-world benchmarks, so they would be useful here for some people. They might not be useful to the authors, though, if their goal is just to prove algorithmic superiority rather than real-world superiority. Creating a real-world benchmark also has the problem that existing implementations have likely been tuned with micro-optimizations for years and the new algorithm's implementation hasn't been, which makes for an unfair comparison. Your example of 10ms vs 9ms is such a small difference that micro-optimizations alone could easily swamp it.
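As a rough illustration of why a small timing gap says more about the implementation than the algorithm, here's a sketch that times two implementations of the same O(n) summation. The function names, input size, and repeat counts are all made up for the example, and the actual numbers will vary by machine and Python version:

```python
import timeit


def sum_loop(values):
    """O(n) summation written as an explicit Python loop."""
    total = 0
    for v in values:
        total += v
    return total


def sum_builtin(values):
    """The same O(n) algorithm, delegated to the built-in sum()."""
    return sum(values)


def time_call(fn, values, repeats=5, number=20):
    """Best-of-`repeats` wall-clock seconds for `number` calls of fn(values)."""
    return min(timeit.repeat(lambda: fn(values), repeat=repeats, number=number))


if __name__ == "__main__":
    data = list(range(100_000))
    for fn in (sum_loop, sum_builtin):
        print(f"{fn.__name__}: {time_call(fn, data):.4f} s")
```

Both functions do the same linear pass, so any gap between the two timings comes from the implementation, not from the algorithm's big O.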
I was agreeing with you that real world metrics would be useful.
I was just making a couple of clarifications. You seemed to imply that real-world metrics can be a substitute for big O, but that's not the case: they make different comparisons.
Also, I was pointing out that the 10ms vs 9ms example isn't a set of numbers that would indicate one algorithm is better than the other, just that one implementation is better than the other.
Despite the prevalence of the question, most people (like myself) would have to break out one of our old textbooks to tell.
Do you have a reference measure that might help? For example: a process that would take 10ms under the following conditions [a,b,c] now takes 9ms?