@apaszke @clattner_llvm @metaai I think at least part of it is that they seem to have compared against cuBLAS instead of cuBLASLt. The latter can optimise for the specific input sizes more than the former, which would make it a fairer baseline for comparison with tools like Mojo/Triton/etc.
@clattner_llvm @metaai How can Mojo be faster than CUDA? Isn’t it really just PTX vs the DSL abstractions? It’s also quite important to consider productivity in addition to perf, although that is harder to quantify.
Thank you to folks at @metaai for publishing their independent perf analysis comparing CUDA and Mojo against Triton and TileLang DSLs, showing Mojo meeting and beating CUDA, and leaving DSLs in the dust.