
The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized, the original is "Highly efficient matrix transpose in Mojo".

Also, the improvement is 0.14%, not 14% making the editorialized linkbait particularly egregious.



[op here] To be clear: yes, there are three kernels; you can see them in the GitHub repo linked at the end of the article. These are:

transpose_naive - Basic implementation with TMA transfers

transpose_swizzle - Adds swizzling optimization for better memory access patterns

transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
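As a rough illustration of what the swizzle step buys you (a hypothetical Python sketch, not the article's Mojo code): XOR-swizzling the shared-memory column index spreads what would otherwise be same-bank accesses across all 32 banks, so the column-wise reads a transpose needs no longer serialize.

```python
TILE = 32   # tile width; also the number of shared-memory banks on NVIDIA GPUs

def swizzled_col(row: int, col: int) -> int:
    """XOR the column with the row so a column-wise read hits distinct banks."""
    return col ^ (row % TILE)

# Reading logical column 0 across all 32 rows of a tile:
banks_plain = {0 % TILE for row in range(TILE)}                    # every access lands in bank 0
banks_swizzled = {swizzled_col(row, 0) % TILE for row in range(TILE)}

print(len(banks_plain), len(banks_swizzled))  # 1 vs 32
```

The same remapping is applied on both the store and the load side, so the data ends up in the right place; only the bank assignment changes.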

Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:

transpose_naive: 1056.08 GB/s (32.0025% of max)

transpose_swizzle: 1437.55 GB/s (43.5622% of max)

transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
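Those percentages are internally consistent: a quick check (the 3300 GB/s peak is inferred from the quoted figures, not stated in the post; it matches H100-class HBM3 bandwidth):

```python
PEAK = 3300.0  # GB/s; inferred from the quoted percentages, not stated in the post

kernels = {
    "transpose_naive": 1056.08,
    "transpose_swizzle": 1437.55,
    "transpose_swizzle_batched": 2775.49,
}
for name, bw in kernels.items():
    # Reproduces the quoted 32.00% / 43.56% / 84.11% figures
    print(f"{name}: {100 * bw / PEAK:.2f}% of peak")
```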

via GitHub: simveit/efficient_transpose_mojo

Comparing to the CUDA implementations mentioned in the article:

Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s

Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s

Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
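The relative speedups fall straight out of those numbers (a small check script using only the figures above):

```python
pairs = {
    "naive": (1056.08, 875.46),
    "swizzle": (1437.55, 1251.76),
    "swizzle_batched": (2775.49, 2771.35),
}
for name, (mojo_bw, cuda_bw) in pairs.items():
    speedup = 100 * (mojo_bw / cuda_bw - 1)
    # Yields roughly 20.6%, 14.8%, and 0.15% respectively
    print(f"{name}: Mojo is {speedup:.2f}% faster than CUDA")
```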

So there is a highly efficient matrix transpose in Mojo.

All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).

The "flag" here seemed inappropriate given that it's true this implementation is indeed faster, and the final iteration could certainly be improved on further. It wasn't wrong to say 14% or even 20%.


Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).

Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.


Thanks jsnell - I did that, and they appreciated the comment above and unflagged the post. I appreciate it!


I think the OP based the title off of "This kernel achieves 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the final kernel, for whatever reason.


Yeah, it seems like the blog post is just meant to be an example of how to do something in Mojo and not a dunk on CUDA.


FWIW I didn't take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes it go faster, that's awesome.


0.14% is within the limits of statistical error, so this is a nothing "article".


I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.



