So there is highly efficient matrix transpose in Mojo
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.
Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
I think the OP based the title off of "This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the final kernels for whatever reason
FWIW I didnt take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes it go faster thats awesome.
I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.
Also, the improvement is 0.14%, not 14% making the editorialized linkbait particularly egregious.