Bohan Hou

Bohan Hou@bohanhou1998·4d

@jeongho09621512 Yeah it’s always good to have a standard tool

JungHoyoun@jeongho09621512·4d

@bohanhou1998 It seems like Nvidia will launch an official intra-kernel tool named iket. It might be good to refer to it. github.com/NVIDIA/cutlass…

Bohan Hou@bohanhou1998·4d

It's cool to learn how different warp/warpgroup roles overlap in the kernel, such as FA4. tvm.apache.org/docs/tirx/nati…

Bohan Hou@bohanhou1998·4d

Similar tools are also available through APIs in Triton and CuTeDSL. FlashInfer has also had a tool of this kind before (github.com/flashinfer-ai/…). Our implementation is broadly similar to FlashInfer’s, with a few modifications and adaptations to fit the TIRx API.

Bohan Hou@bohanhou1998·Jun 23

@maharshii nvm, we’ll just ask Mythos to npm install the kernels🤣

maharshi@maharshii·Jun 23

ML perf is in its JS framework era, i love it.

Bohan Hou@bohanhou1998

We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. tvm.apache.org/2026/06/22/tirx On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.

Bohan Hou@bohanhou1998·Jun 23

Tilelang is built on top of TensorIR (released around 2022~2023, aimed at schedule-based autotuning on Ampere GPUs), and TIRx is an upgrade of TensorIR, just released. They share some of the data structures (like ForNode, IfNode) but are completely differently designed DSLs and have different compilation pipelines. Tilelang rebased onto TIRx, I believe, a few weeks ago after we upstreamed the code. github.com/apache/tvm/pul… This is the first PR bringing the TIRx infra upgrade into Apache TVM.

Retep@Retep8080·Jun 23

@bohanhou1998 wait i thought it was released long ago since tilelang is built on top of tirx. Am I missing something

Bohan Hou@bohanhou1998·Jun 22

We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. tvm.apache.org/2026/06/22/tirx On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.

Bohan Hou@bohanhou1998·Jun 23

Yes. I read Pyptx's codebase on day 1 and love how Pyptx organizes the PTX instruction registrations in Python. It's also an important part of TIRx to provide full native control. The primitive part is meant to allow users to (for example) simply write `Tx.copy(As[0:128, 0:32], Ag[...])` to rely on the compiler to generate loops around TMA calls. Sometimes the TMA instruction selection can be a bit tricky. We can issue fewer instructions by organizing tensormap shape/stride carefully. A bit of compiler magic here, but not too heavy :)
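
To make the "fewer instructions" point concrete, here is a minimal Python sketch of how a tile copy could be split into box-sized bulk copies. It is only an illustration under assumptions: `plan_box_copies` and the box shapes are hypothetical names and values, not part of TIRx or the CUDA API, and real TMA instruction selection also has to respect hardware limits, swizzling, and alignment, which this ignores.

```python
from itertools import product


def plan_box_copies(tile_shape, box_shape):
    """Split a tile into box-sized pieces.

    Each returned (offsets, sizes) pair stands for one bulk-copy call in this
    simplified model. Hypothetical helper, not a TIRx or CUDA function.
    """
    if len(tile_shape) != len(box_shape):
        raise ValueError("tile_shape and box_shape must have the same rank")
    per_dim = []
    for extent, box in zip(tile_shape, box_shape):
        # (start, size) pairs along one dimension; the last piece may be smaller.
        per_dim.append(
            [(start, min(box, extent - start)) for start in range(0, extent, box)]
        )
    calls = []
    for combo in product(*per_dim):
        offsets = tuple(start for start, _ in combo)
        sizes = tuple(size for _, size in combo)
        calls.append((offsets, sizes))
    return calls


# A 128x32 tile covered by small boxes needs several calls ...
narrow = plan_box_copies((128, 32), (64, 16))
# ... while a box that matches the tile needs a single call.
wide = plan_box_copies((128, 32), (128, 32))
print(len(narrow), len(wide))
```

In this model, choosing a box shape that matches the tile reduces the number of calls from four to one, which is the kind of saving a compiler can look for when it picks the tensormap shape and stride.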

Patrick C Toulme@PatrickToulme·Jun 23

@bohanhou1998 This is very similar to Pyptx. You are a bit above Pyptx as Pyptx has no compiler and emits raw PTX

Bohan Hou@bohanhou1998·Jun 23

@silverhawk_ny The current DSL can already support AMD code generation (by lowering to LLVM). For NV platforms it lowers to CUDA C++. It can also lower to Vulkan, WebASM, OpenCL, etc. We don’t have access to the latest server-class AMD cards, so we have only tried programming kernels on B200.

arch rock@silverhawk_ny·Jun 23

@bohanhou1998 Thanks, do you plan to have some high level DSL to support both NV and AMD?

Bohan Hou@bohanhou1998·Jun 23

TIRx is by itself a kernel-level language generating CUDA C++, so in this sense, it does not support other kernel languages. But the TVM stack has a Relax graph compiler that can represent a computational graph and make calls into arbitrary Python/C++ functions registered via tvm-ffi, so I believe in that sense it can support other types of kernels. See this test case for how a function is registered and then called by the graph: github.com/apache/tvm/blo…#L914C1-L923C26

arch rock@silverhawk_ny·Jun 23

@bohanhou1998 We have some similar efforts to move completely away from Torch Inductor/compile and do our own graph IR rewrites and kernel fusion. Does this support different types of kernels, like Triton/CuTeDSL, etc.?

Bohan Hou@bohanhou1998·Jun 23

“A compiler optimization that always works is just part of the programming model.”
Many of you have probably watched @cHHillee's Jane Street talk: janestreet.com/tech-talks/bui… The talk was given in 2025, but the question had already been on my mind since the summer of 2023, when I was interning on Meta’s PyTorch Compiler team, where I worked with Horace.
At the time, my TensorIR paper had just been accepted to ASPLOS 2023. TensorIR gave us a schedule-oriented IR for tensorized program optimization, and much of my thinking then was shaped by Ampere Tensor Core kernels. But the target was already moving. Hopper made warp-specialized kernels, TMA-driven pipelines, new synchronization patterns, and FP8 much more central. Kernels like FlashAttention-2 were also pushing beyond what our existing schedule abstraction could express comfortably. Meanwhile, in the MLC project (a compiler-based ML serving engine targeting both servers and mobile devices), TensorIR’s schedule system felt awkward when we tried to reach peak performance on quantized GeMV/GEMM kernels.
During a Topgolf team event, Horace asked me a question: “How much time do you think you need to upgrade your compiler to Hopper? And what about the generation after Hopper?” That question has stayed with me ever since.
Every time I thought about what my next project should be (auto warp specialization, auto pipelining, or another compiler pass in that direction), I found it hard to fully convince myself. Not because those directions are unimportant, but because they still did not answer the question that bothered me most. If every new hardware generation requires another major compiler redesign before users can even express the best program, then maybe that is simply not the right path. For a while, I honestly believed the answer might just be: write CUDA C++, and at most build libraries on top so that we do not have to rewrite everything from scratch.
Things become more interesting now that agents are starting to become capable of writing kernels with specific low-level instruction patterns. One might think stronger agents will make this problem go away: they can write more code, search more schedules, patch more compiler passes, and try more variants than humans can. Personally, I do not have a decisive answer. It is risky to make strong predictions about what agents will or will not be able to do. From my own experience, they are already reasonably useful for tuning some kernels in CUDA C++ and performing local rewrites that make the generated PTX/SASS look better. But when asked to design or orchestrate new warp-specialization plans, pipelined data movement, or synchronization protocols, they can still struggle to produce a kernel that is even correct, let alone one that does not hang.
One experiment I tried was asking Codex to port DeepSeek’s MegaMoe kernel into TIRx. I expected the task to be relatively straightforward, since TIRx intentionally keeps the source structure close to hardware-native CUDA C++. In practice, the agent spent several days and still could not resolve a hang without my intervention. What eventually helped was not simply asking it to “try again,” but changing the feedback loop: I asked a reviewer agent to produce an audit file, line by line, comparing the generated kernel against the original and justifying that the two were doing the same thing. That workflow can help when an expert kernel already exists. But it is much less satisfying if the goal is to have agents design new kernels from scratch.
For those curious, here is the experimental agent-generated TIRx port of MegaMoe that came out of this process (I believe it matches the performance of the initial release of DS’s kernels, though they added some optimizations later): gist.github.com/spectrometerHB… It is not part of the initial release; it is mostly a record of the experiment. The result is not particularly polished, but it was useful for understanding what kind of feedback loop agents need when working on low-level kernels.
Of course, a sufficiently high-level DSL can help when the abstraction already captures the kernel pattern. If the user can write GEMM(A, B) and the compiler deterministically lowers it to a good implementation, then life is easy. But that is exactly the hard part at the frontier: the pattern may not yet be mature enough to hide behind a stable abstraction.
Until a strong enough model can solve this end-to-end, what we can do is improve the workflow around the model: give it more structured feedback and perhaps even lightweight verifiers that can check whether a generated kernel shape is valid with respect to synchronization. This connects back to the opening sentence: “a compiler optimization that always works is just part of the programming model.” For example, if a warp-specialization plan relies on mbarrier coordination, the compiler should at least be able to reason about whether that plan can introduce races, deadlocks, or violations of the intended producer-consumer protocol. That kind of feedback is hard to provide if the program is only raw CUDA C++ without a richer intermediate representation.
Maybe the better question is not whether agents can write kernels, but what programming model gives humans, agents, and compilers enough shared structure to reason about kernels before the final benchmark.
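
As a rough illustration of what a "lightweight verifier" for a producer-consumer plan might look like, the Python sketch below simulates roles that wait on and arrive at named barriers, and reports which roles would hang. It is a deliberately simplified model under assumptions: barriers are treated as token counters with a single waiting role each, which is not the real mbarrier phase semantics, and `find_hang` and `pipeline` are hypothetical helpers, not TIRx APIs.

```python
from collections import Counter


def find_hang(roles, initial_tokens=None):
    """Return None if every role finishes, else {role: barrier it is stuck on}.

    roles maps a role name to a list of ("wait" | "arrive", barrier_name) ops.
    Assumption: each barrier is waited on by exactly one role, so the order in
    which roles are stepped does not change the outcome.
    """
    tokens = Counter(initial_tokens or {})
    pc = {name: 0 for name in roles}  # per-role program counter
    progressed = True
    while progressed:
        progressed = False
        for name, ops in roles.items():
            while pc[name] < len(ops):
                kind, bar = ops[pc[name]]
                if kind == "arrive":
                    tokens[bar] += 1
                elif kind == "wait":
                    if tokens[bar] == 0:
                        break  # blocked until another role arrives
                    tokens[bar] -= 1
                else:
                    raise ValueError(f"unknown op: {kind}")
                pc[name] += 1
                progressed = True
    stuck = {n: ops[pc[n]][1] for n, ops in roles.items() if pc[n] < len(ops)}
    return stuck or None


def pipeline(num_tiles, stages):
    """Build a producer/consumer plan over a ring of `stages` buffers."""
    producer, consumer = [], []
    for i in range(num_tiles):
        s = i % stages
        producer += [("wait", f"empty{s}"), ("arrive", f"full{s}")]
        consumer += [("wait", f"full{s}"), ("arrive", f"empty{s}")]
    return {"producer": producer, "consumer": consumer}


roles = pipeline(num_tiles=4, stages=2)
# Buffers start out empty, so the "empty" barriers are pre-signalled.
ok = find_hang(roles, {"empty0": 1, "empty1": 1})
# Forgetting the initial signal makes both roles wait on each other.
bad = find_hang(roles)
print(ok, bad)
```

A check like this says nothing about data races or performance; it only catches one class of hang, but it runs before any kernel is compiled or benchmarked.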

Bohan Hou@bohanhou1998·Jun 22

@ye_combinator @Lei_Wang_1999 @yi_xin_dong @HongyiJin258 @ruihanglai @junrushao @tqchenml Also @meow_cat_7, @LijieyYang, and Shushi Hong.

Bohan Hou@bohanhou1998·Jun 22

Huge thanks to the collaborators and friends who helped shape this work and gave feedback along the way: @ye_combinator @Lei_Wang_1999 @yi_xin_dong @HongyiJin258 @ruihanglai @junrushao @tqchenml

Bohan Hou@bohanhou1998·Jun 1

RT @ruihanglai: Two moments every ML researcher knows. You get onto a new cluster, and week one goes to fitting the framework to your setup…

Bohan Hou retweeted

Tianqi Chen@tqchenml·Oct 21

📢Excited to introduce Apache TVM FFI, an open ABI and FFI for ML systems, enabling compilers, libraries, DSLs, and frameworks to naturally interop with each other. Ship one library across pytorch, jax, cupy etc and runnable across python, c++, rust tvm.apache.org/2025/10/21/tvm…

Bohan Hou retweeted

PyTorch@PyTorch·Oct 21

Live from the AI Infra Summit, co-located with #PyTorchCon — Tianqi Chen (@nvidia) explores how shared ML foundations can advance interoperability across compilers, libraries, DSLs, and frameworks, while unifying workloads across edge and cloud. 🔗 hubs.la/Q03PBnK00 #AIInfraSummit #OpenSourceAI #AIInfrastructure

Bohan Hou retweeted

Tim Dettmers@Tim_Dettmers·Apr 17

Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!

Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

Bohan Hou retweeted

Tianqi Chen@tqchenml·Apr 17

Really thrilled to receive #NVIDIADGX B200 from @nvidia. Looking forward to cooking with the beast. Together with an amazing team at CMU Catalyst group @BeidiChen @Tim_Dettmers @JiaZhihao @zicokolter, we are looking to innovate across the entire stack, from model to instructions.

CMU School of Computer Science@SCSatCMU

Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

Bohan Hou retweeted

Zhihao Jia@JiaZhihao·Apr 17

Thank you to @NVIDIA for gifting our Catalyst Research Group the latest NVIDIA DGX B200! The B200 platform will greatly accelerate our research in building next-generation ML systems.🚀 #NVIDIADGX #DGXB200 @NVIDIADC

Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.
