Not only that, in some cases they’re comparing apples to oranges as well, undermining their credibility further. Eg chain-of-thought vs non-CoT results. I don’t even know why they’re doing that, seems like their results would be impressive enough even without this.