The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized,...

timmyd · 2025-06-06T23:20:19 1749252019

[OP here] To be clear: yes, there are three kernels. You can see them in the GitHub repo linked at the end of the article. They are:

transpose_naive - Basic implementation with TMA transfers

transpose_swizzle - Adds swizzling optimization for better memory access patterns

transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
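For readers unfamiliar with why swizzling helps, here is a minimal Python sketch of the general idea, not the article's actual kernel. It assumes 32 shared-memory banks, a 32x32 tile of 4-byte elements, and a simple XOR swizzle (`col ^ row`). All of these are illustrative assumptions; the swizzle pattern used with TMA in the linked repo may differ.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte word wide
TILE = 32       # assumption: 32x32 tile of 4-byte elements


def bank_plain(row: int, col: int) -> int:
    """Bank of element (row, col) in a row-major tile with no swizzle."""
    return (row * TILE + col) % NUM_BANKS


def bank_swizzled(row: int, col: int) -> int:
    """Bank when the column index is XOR-ed with the row before storing."""
    return (row * TILE + (col ^ row)) % NUM_BANKS


def max_conflict(bank_fn, col: int) -> int:
    """Worst-case number of threads hitting one bank when 32 threads
    read a single column (thread i reads row i), as a transpose does."""
    hits = Counter(bank_fn(row, col) for row in range(TILE))
    return max(hits.values())


worst_plain = max(max_conflict(bank_plain, c) for c in range(TILE))
worst_swizzled = max(max_conflict(bank_swizzled, c) for c in range(TILE))
print(worst_plain, worst_swizzled)
```

Under these assumptions, a column read without swizzling sends all 32 threads to the same bank, while the XOR layout spreads them across 32 distinct banks.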

Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:

transpose_naive: 1056.08 GB/s (32.0025% of max)

transpose_swizzle: 1437.55 GB/s (43.5622% of max)

transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)

(figures from the GitHub repo simveit/efficient_transpose_mojo)
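As a sanity check on the "% of max" figures, the peak bandwidth they imply can be backed out from each pair of numbers. The sketch below uses only the values quoted above; the resulting peak is inferred from them and is not stated anywhere in the thread.

```python
# (bandwidth in GB/s, percent of max) as quoted above
measured = {
    "transpose_naive": (1056.08, 32.0025),
    "transpose_swizzle": (1437.55, 43.5622),
    "transpose_swizzle_batched": (2775.49, 84.1056),
}

# Implied peak = measured bandwidth / fraction of max
implied_peak = {
    name: gbps / (pct / 100.0) for name, (gbps, pct) in measured.items()
}

for name, peak in implied_peak.items():
    print(f"{name}: implied peak ~ {peak:.1f} GB/s")
```

All three rows agree on a reference peak of roughly 3300 GB/s, so the percentages are at least internally consistent.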

Comparing to the CUDA implementations mentioned in the article:

Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s

Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s

Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s

So there is a highly efficient matrix transpose in Mojo.

All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
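The percentages can be reproduced directly from the quoted bandwidths. This is a small sketch using only the numbers in this comment; the dictionary layout and variable names are just for illustration.

```python
# (Mojo GB/s, CUDA GB/s) for each kernel, as quoted above
results = {
    "naive": (1056.08, 875.46),
    "swizzle": (1437.55, 1251.76),
    "batched_swizzle": (2775.49, 2771.35),
}

# Relative speedup of Mojo over CUDA, in percent
speedup_pct = {
    name: (mojo / cuda - 1.0) * 100.0 for name, (mojo, cuda) in results.items()
}

for name, pct in speedup_pct.items():
    mojo, cuda = results[name]
    print(f"{name}: +{mojo - cuda:.2f} GB/s ({pct:.2f}% faster)")
```

The first two kernels come out at about 20.6% and 14.8%, while the final kernel's 4.14 GB/s gap is only about 0.15%.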

The "flag" here seemed inappropriate given that it's true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.

jsnell · 2025-06-07T00:15:04 1749255304

Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say, that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).

Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.

timmyd · 2025-06-07T00:23:42 1749255822

Thanks jsnell - I did, and they appreciated the comment above and unflagged it. I appreciate it!

atomicapple · 2025-06-06T20:14:20 1749240860

I think the OP based the title on "This kernel achieves 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not on the final kernels, for whatever reason.

jebarker · 2025-06-06T20:13:00 1749240780

Yeah, it seems like the blog post is just meant to be an example of how to do something in Mojo and not a dunk on CUDA.

timmyd · 2025-06-06T23:45:57 1749253557

FWIW I didn't take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes it go faster, that's awesome.

baal80spam · 2025-06-06T20:10:30 1749240630

0.14% is within the limits of statistical error. So this is a nothing-"article".

jsnell · 2025-06-06T20:14:21 1749240861

I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.