Bohan Hou

66 posts

@bohanhou1998

CS Ph.D. student @ CMU

Pittsburgh, PA · Joined August 2020
93 Following · 1K Followers
Bohan Hou@bohanhou1998·
Every time I fight ptxas over why it put some variable in R instead of UR, or why it lifted a CSE this time but not that time, I think back to someone telling me years ago that truly great programmers can write plain C and mentally compile it all the way down to the final asm.
Bohan Hou@bohanhou1998·
Similar tools are also available through APIs in Triton and CuTeDSL. FlashInfer has also had a tool of this kind before (github.com/flashinfer-ai/…). Our implementation is broadly similar to FlashInfer’s, with a few modifications and adaptations to fit the TIRx API.
Bohan Hou@bohanhou1998·
@maharshii nvm, we’ll just ask Mythos to npm install the kernels🤣
maharshi@maharshii·
ML perf is in its JS framework era, i love it.
Bohan Hou@bohanhou1998

We release TIRx today, a minimal compiler stack and hardware-native DSL for frontier ML kernels, built around storage-first tensor layouts and reusable tile primitives. tvm.apache.org/2026/06/22/tirx On NVIDIA B200, TIRx delivers up to ~1.08× over cuBLASLt on dense GEMM, outperforms DeepGEMM on all FP8 blockwise workloads with up to ~1.09× speedup, keeps FlashAttention-4 (FA4) typically within ~±2% of CuTeDSL, and remains competitive with cuBLASLt/FlashInfer on NVFP4 GEMM. Through our past experiences building frontier ML kernels, megakernels, and agentic kernel systems, we kept seeing the same boundary problem: new operators and new hardware require new optimization strategies that often break old programming models or compiler passes. TIRx builds on top of Apache TVM and moves toward a simple goal: let users and agents express the best-performing program, even for future hardware generations, while keeping the engineering effort for new kernels and new hardware as low as possible.

Bohan Hou@bohanhou1998·
Tilelang is built on top of TensorIR (released around 2022~2023 and aimed at schedule-based autotuning on Ampere GPUs), and TIRx is an upgrade of TensorIR that was just released. They share some data structures (like ForNode and IfNode) but are completely different DSL designs with different compilation pipelines. I believe Tilelang rebased onto TIRx a few weeks ago, after we upstreamed the code. github.com/apache/tvm/pul… This is the first PR bringing the TIRx infra upgrade into Apache TVM.
Retep@Retep8080·
@bohanhou1998 wait i thought it was released long ago since tilelang is built on top of tirx. Am I missing something
Bohan Hou@bohanhou1998·
Yes. I read Pyptx's codebase on day 1 and love how Pyptx organizes the PTX instruction registrations in Python. Providing full native control is also an important part of TIRx. The primitive part lets users simply write (for example) `Tx.copy(As[0:128, 0:32], Ag[...])` and rely on the compiler to generate the loops around TMA calls. TMA instruction selection can sometimes be a bit tricky: we can issue fewer instructions by organizing the tensormap shape/stride carefully. A bit of compiler magic here, but not too heavy :)
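The idea of a compiler generating loops around TMA calls can be sketched in plain Python. This is only an illustration, not the TIRx API: `plan_box_copies` is a hypothetical helper, and the box shapes are made-up values rather than real hardware limits. It shows how the number of copy instructions needed for a 128x32 tile depends on the box shape chosen for the tensormap.

```python
from itertools import product


def plan_box_copies(tile_shape, box_shape):
    """Return the start offset of every box needed to cover the tile.

    Hypothetical helper for illustration: each returned offset stands for
    one bulk-copy instruction that moves a box of `box_shape` elements.
    """
    ranges = [range(0, extent, box) for extent, box in zip(tile_shape, box_shape)]
    return list(product(*ranges))


tile = (128, 32)

# A box covering half of the rows needs two copy instructions.
narrow = plan_box_copies(tile, (64, 32))

# A box covering the whole tile needs only one.
wide = plan_box_copies(tile, (128, 32))

print(len(narrow), narrow)
print(len(wide), wide)
```

Choosing the larger box halves the instruction count here, which is the kind of decision the post describes leaving to the compiler.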
Patrick C Toulme@PatrickToulme·
@bohanhou1998 This is very similar to Pyptx. You are a bit above Pyptx as Pyptx has no compiler and emits raw PTX
Bohan Hou@bohanhou1998·
@silverhawk_ny The current DSL can already support AMD code generation (by lowering to LLVM). For NV platforms it lowers to CUDA C++. It can also lower to Vulkan, WebASM, OpenCL, etc. We don’t have access to the latest server-class AMD cards, so we have only tried programming kernels on B200.
arch rock@silverhawk_ny·
@bohanhou1998 Thanks, do you plan to have some high level DSL to support both NV and AMD?
Bohan Hou@bohanhou1998·
TIRx is by itself a kernel-level language generating CUDA C++, so in this sense, it does not support other kernel languages. But the TVM stack has a Relax graph compiler that can represent a computational graph and make calls into arbitrary Python/C++ functions registered via tvm-ffi, so I believe in that sense it can support other types of kernels. See this test case for how a function is registered and then called by the graph: github.com/apache/tvm/blo… (#L914C1-L923C26).
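The register-then-call pattern described above can be sketched with a toy registry in plain Python. This is not the tvm-ffi or Relax API: `REGISTRY`, `register_func`, `run_graph`, and the `demo.*` names are all hypothetical, and the "graph" is just a list of nodes. It only shows how a graph executor can dispatch to kernels by name without caring how each kernel was written.

```python
# Hypothetical stand-in for a global function table.
REGISTRY = {}


def register_func(name):
    """Decorator that stores a callable under a string name."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register_func("demo.scale")
def scale(x, factor):
    return [v * factor for v in x]


@register_func("demo.add")
def add(x, y):
    return [a + b for a, b in zip(x, y)]


def run_graph(nodes, inputs):
    """Run nodes in order; each node is (output_name, func_name, args).

    String arguments are looked up as named values, other arguments are
    passed through as constants.
    """
    env = dict(inputs)
    for out, func_name, args in nodes:
        fn = REGISTRY[func_name]
        env[out] = fn(*[env[a] if isinstance(a, str) else a for a in args])
    return env


graph = [
    ("t0", "demo.scale", ["x", 2]),
    ("y", "demo.add", ["t0", "x"]),
]
result = run_graph(graph, {"x": [1, 2, 3]})["y"]
print(result)
```

Because the executor only sees names and callables, a registered function could wrap a kernel produced by any toolchain, which is the sense in which the graph level can mix kernel types.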
arch rock@silverhawk_ny·
@bohanhou1998 we have some similar efforts to completely move away from Torch Inductor/compile and do our own graph IR rewrite and kernel fusion. Does this support different types of kernels like Triton/CuteDSL, etc.?
Bohan Hou@bohanhou1998·
“A compiler optimization that always works is just part of the programming model.”
Many of you have probably watched @cHHillee's Jane Street talk: janestreet.com/tech-talks/bui… The talk was given in 2025, but the question had already been on my mind since the summer of 2023, when I was interning on Meta’s PyTorch Compiler team, where I worked with Horace.
At the time, my TensorIR paper had just been accepted to ASPLOS 2023. TensorIR gave us a schedule-oriented IR for tensorized program optimization, and much of my thinking then was shaped by Ampere Tensor Core kernels. But the target was already moving. Hopper made warp-specialized kernels, TMA-driven pipelines, new synchronization patterns, and FP8 much more central. Kernels like FlashAttention-2 were also pushing beyond what our existing schedule abstraction could express comfortably. Meanwhile, in the MLC project (a compiler-based ML serving engine targeting both servers and mobile devices), TensorIR’s schedule system felt awkward when we tried to reach peak performance on quantized GEMV/GEMM kernels.
During a Topgolf team event, Horace asked me a question: “How much time do you think you need to upgrade your compiler to Hopper? And what about the generation after Hopper?”
That question has stayed with me ever since. Every time I thought about what my next project should be (auto warp specialization, auto pipelining, or another compiler pass in that direction), I found it hard to fully convince myself. Not because those directions are unimportant, but because they still did not answer the question that bothered me most. If every new hardware generation requires another major compiler redesign before users can even express the best program, then maybe that is simply not the right path. For a while, I honestly believed the answer might just be: write CUDA C++, and at most build libraries on top so that we do not have to rewrite everything from scratch.
Things become more interesting now that agents are starting to become capable of writing kernels with specific low-level instruction patterns. One might think stronger agents will make this problem go away: they can write more code, search more schedules, patch more compiler passes, and try more variants than humans can. Personally, I do not have a decisive answer. It is risky to make strong predictions about what agents will or will not be able to do. From my own experience, they are already reasonably useful for tuning some kernels in CUDA C++ and performing local rewrites that make the generated PTX/SASS look better. But when asked to design or orchestrate new warp-specialization plans, pipelined data movement, or synchronization protocols, they can still struggle to produce a kernel that is even correct, let alone one that does not hang.
One experiment I tried was asking Codex to port DeepSeek’s MegaMoe kernel into TIRx. I expected the task to be relatively straightforward, since TIRx intentionally keeps the source structure close to hardware-native CUDA C++. In practice, the agent spent several days and still could not resolve a hang without my intervention. What eventually helped was not simply asking it to “try again,” but changing the feedback loop: I asked a reviewer agent to produce an audit file, line by line, comparing the generated kernel against the original and justifying that the two were doing the same thing. That workflow can help when an expert kernel already exists. But it is much less satisfying if the goal is to have agents design new kernels from scratch.
For those curious, here is the experimental agent-generated TIRx port of MegaMoe that came out of this process (I believe it matches the performance of the initial release of DS’s kernels, though they added some optimizations later): gist.github.com/spectrometerHB… It is not part of the initial release; it is mostly a record of the experiment. The result is not particularly polished, but it was useful for understanding what kind of feedback loop agents need when working on low-level kernels.
Of course, a sufficiently high-level DSL can help when the abstraction already captures the kernel pattern. If the user can write GEMM(A, B) and the compiler deterministically lowers it to a good implementation, then life is easy. But that is exactly the hard part at the frontier: the pattern may not yet be mature enough to hide behind a stable abstraction. Until a strong enough model can solve this end-to-end, what we can do is improve the workflow around the model: give it more structured feedback and perhaps even lightweight verifiers that can check whether a generated kernel shape is valid with respect to synchronization.
This connects back to the opening sentence: “a compiler optimization that always works is just part of the programming model.” For example, if a warp-specialization plan relies on mbarrier coordination, the compiler should at least be able to reason about whether that plan can introduce races, deadlocks, or violations of the intended producer-consumer protocol. That kind of feedback is hard to provide if the program is only raw CUDA C++ without a richer intermediate representation. Maybe the better question is not whether agents can write kernels, but what programming model gives humans, agents, and compilers enough shared structure to reason about kernels before the final benchmark.
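As a rough illustration of what a lightweight synchronization verifier might look like, here is a toy deadlock check in Python. It is a sketch under loud assumptions, not anything from TIRx: barriers are modeled as simple credit counters rather than real mbarrier phases, each barrier has a single waiting agent, and `check_protocol` and `pipeline` are hypothetical names.

```python
from collections import Counter


def check_protocol(programs, initial_credits=None):
    """Return "ok" if every agent runs to completion, else "deadlock".

    Simplified model (an assumption, not real mbarrier semantics):
    ("arrive", b) adds one credit to barrier b, and ("wait", b) blocks
    until b has a credit, then consumes it.
    """
    credits = Counter(initial_credits or {})
    pcs = {name: 0 for name in programs}
    progressed = True
    while progressed:
        progressed = False
        for name, ops in programs.items():
            # Run this agent until it finishes or blocks on a wait.
            while pcs[name] < len(ops):
                kind, barrier = ops[pcs[name]]
                if kind == "arrive":
                    credits[barrier] += 1
                elif kind == "wait":
                    if credits[barrier] == 0:
                        break
                    credits[barrier] -= 1
                else:
                    raise ValueError(f"unknown op: {kind}")
                pcs[name] += 1
                progressed = True
    done = all(pcs[name] == len(ops) for name, ops in programs.items())
    return "ok" if done else "deadlock"


def pipeline(iters):
    """Producer/consumer pair coordinating through 'empty' and 'full'."""
    producer = [op for _ in range(iters) for op in (("wait", "empty"), ("arrive", "full"))]
    consumer = [op for _ in range(iters) for op in (("wait", "full"), ("arrive", "empty"))]
    return {"producer": producer, "consumer": consumer}


# Two free buffer slots at start: the pipeline drains.
print(check_protocol(pipeline(4), {"empty": 2}))

# No free slots at start: both agents block on their first wait.
print(check_protocol(pipeline(4)))
```

A real checker would need to model phases, arrival counts, and multiple waiters, but even this level of structure is hard to recover from raw CUDA C++ text.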
Bohan Hou@bohanhou1998·
RT @ruihanglai: Two moments every ML researcher knows. You get onto a new cluster, and week one goes to fitting the framework to your setup…
Bohan Hou reposted
Tianqi Chen@tqchenml·
📢Excited to introduce Apache TVM FFI, an open ABI and FFI for ML systems, enabling compilers, libraries, DSLs, and frameworks to naturally interop with each other. Ship one library across pytorch, jax, cupy etc and runnable across python, c++, rust tvm.apache.org/2025/10/21/tvm…
Bohan Hou reposted
Tim Dettmers@Tim_Dettmers·
Happy to announce that I joined the CMU Catalyst with three of my incoming students. Our research will bring the best models to consumer GPUs with a focus on agent systems and MoEs. It is amazing to see so many talented people at Catalyst -- a very exciting ecosystem!
CMU School of Computer Science@SCSatCMU

Huge thank you to @NVIDIADC for gifting a brand new #NVIDIADGX B200 to CMU’s Catalyst Research Group! This AI supercomputing system will afford Catalyst the ability to run and test their work on a world-class unified AI platform.

Bohan Hou reposted
Tianqi Chen@tqchenml·
Really thrilled to receive #NVIDIADGX B200 from @nvidia. Looking forward to cooking with the beast. Together with an amazing team at CMU Catalyst group @BeidiChen @Tim_Dettmers @JiaZhihao @zicokolter, we are looking to innovate across the entire stack, from model to instructions.
Bohan Hou reposted
Zhihao Jia@JiaZhihao·
Thank you to @NVIDIA for gifting our Catalyst Research Group the latest NVIDIA DGX B200! The B200 platform will greatly accelerate our research in building next-generation ML systems.🚀 #NVIDIADGX #DGXB200 @NVIDIADC