That conclusion is based on their benchmarks. I'm not interested in those. I'm interested in community benchmarks, like those we're seeing in the comments. Lo and behold, GPT-4 is still king. The claims of any company should be taken with exactly a pinch of salt.
That kind of benchmark is a lot more reliable for models released before the benchmark was published; models released afterwards have more opportunity to "study to the test". That's especially a concern when a company explicitly uses its score on that benchmark as a marketing point.
No, I do not take these benchmarks seriously, and for good reason. They're benchmarks. The only thing that matters is the user's direct experience of the product. And Phind isn't there.