I mean, the runtimes are right there in the abstract; this is pretty easy to compare. Bellman-Ford takes Theta(|V||E|) time, and as best I can tell the improvements you mention don't change that; they're just constant-factor improvements. That's in contrast to this algorithm's O(|E|*log(|V|)^8*log(W)), where W is the largest absolute value of a negative edge weight.
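To make the Theta(|V||E|) bound concrete, here's a minimal Bellman-Ford sketch (illustrative only, not the paper's algorithm): it relaxes every edge |V|-1 times, and that nested loop structure is exactly where the bound comes from. Early-exit tricks and similar tweaks only cut constant factors off it.

```python
def bellman_ford(num_vertices, edges, source):
    """edges: list of (u, v, weight); assumes no negative cycles."""
    INF = float("inf")
    dist = [INF] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):   # |V| - 1 passes...
        for u, v, w in edges:           # ...each scanning all |E| edges
            if dist[u] != INF and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

# Example with a negative edge (but no negative cycle):
edges = [(0, 1, 4), (0, 2, 5), (1, 2, -3), (2, 3, 2)]
print(bellman_ford(4, edges, 0))  # [0, 4, 1, 3]
```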
The abstract of this paper even compares the runtime to the state of the art, and what it compares against isn't Bellman-Ford or the improvements you mention; there had already been asymptotic improvements on those.
Real-world benchmarks and big O measure different things, so one is not a substitute for the other.
Real world benchmarks don't just measure the algorithm, they also measure the implementation. By comparing big O, you can compare algorithms directly instead of comparing implementations of the algorithms.
Big O doesn't take into account constant factors, the time variability of CPU operations (e.g. cache hit vs cache miss), implementation details, or that all useful input sizes might be small. This means that even though an algorithm might have a better big O, it might actually be worse than other algorithms for practical size inputs.
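A toy cost model (with made-up constants, purely to illustrate the point) shows how a hidden constant factor can make the asymptotically better algorithm lose at practical input sizes:

```python
import math

def cost_fancy(n):
    # "Better" big O, O(n log n), but with a large hidden constant.
    return 100 * n * math.log2(n)

def cost_simple(n):
    # "Worse" big O, O(n^2), but with a tiny constant.
    return n * n

# At n = 100 the quadratic algorithm is still far cheaper...
print(cost_simple(100) < cost_fancy(100))        # True
# ...but by n = 100_000 the better asymptotics win.
print(cost_simple(100_000) > cost_fancy(100_000))  # True
```

The crossover point depends entirely on those constants, which big O throws away; that's why an asymptotic win says nothing by itself about performance on practical inputs.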
Creating an algorithm with a better big O is a mathematical breakthrough, but it's not proof that it's going to be practical in the foreseeable future. If you want to additionally prove that, you need real-world benchmarks. So real-world benchmarks would be useful here for some people. They might not be useful for the authors, though, if their goal is just to prove algorithmic superiority rather than real-world superiority. Creating a real-world benchmark also has the problem that existing implementations have likely been tuned with microoptimizations for years, while the new algorithm's implementation won't have been, leading to an unfair comparison. Your example of 10ms vs 9ms is such a small difference that microoptimizations would matter much more than the algorithmic difference.
I was agreeing with you that real world metrics would be useful.
I was making a couple of clarifications. You seemed to imply that real-world metrics can be a substitute for big O, but that's not the case; they're different comparisons.
Also, I was pointing out that the 10ms vs 9ms example isn't a set of numbers that would indicate one algorithm is better than the other, just that one implementation is better than the other.