Not only that, in some cases they’re comparing apples to oranges as well, which undermines their credibility further, e.g. chain-of-thought vs. non-CoT results. I don’t even know why they’re doing that; it seems like their results would be impressive enough without it.