Haohui Mai (@wheat9)
1.9K posts
OS hacker + GPU optimization
Bay Area · Joined October 2009
440 Following · 236 Followers
Haohui Mai reposted
LaurieWired (@lauriewired)
If you’re a CS/EE student, write your thesis on JIT compilation of eBPF for NVMe controllers. There’s huge career alpha in computational storage; the standards are *just* starting to exist (TP4091)
LaurieWired tweet media
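The pitch above is about moving eBPF programs from interpretation to JIT-compiled native code on the storage device. As a rough illustration (not from the tweet; the opcode names and encoding below are a simplified toy, not the real eBPF ISA), here is the interpreter baseline that a JIT would replace:

```rust
// Toy interpreter for a 3-instruction eBPF-flavored register machine.
// A JIT would translate such programs to the controller's native ISA
// instead of dispatching on each opcode at run time.
#[derive(Clone, Copy)]
enum Insn {
    LdImm(usize, i64), // r[dst] = imm
    Add(usize, usize), // r[dst] += r[src]
    Exit,              // return r[0]
}

fn run(prog: &[Insn]) -> i64 {
    let mut r = [0i64; 11]; // eBPF exposes registers r0..r10
    for insn in prog {
        match *insn {
            Insn::LdImm(d, imm) => r[d] = imm,
            Insn::Add(d, s) => r[d] = r[d].wrapping_add(r[s]),
            Insn::Exit => return r[0],
        }
    }
    r[0]
}

fn main() {
    let prog = [
        Insn::LdImm(0, 40),
        Insn::LdImm(1, 2),
        Insn::Add(0, 1),
        Insn::Exit,
    ];
    assert_eq!(run(&prog), 42);
}
```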
39 replies · 261 reposts · 5.2K likes · 233.2K views
Bin CHEN (@SeTriones)
@shao__meng A scaled-down Claude Code? I can’t tell what this project is positioning itself as
2 replies · 0 reposts · 3 likes · 2K views
Haohui Mai reposted
i Expose Racists & Pedos (@SeeRacists)
HEARTBREAKING: Ex-PhD student Brendt Christensen found GUILTY of posing as a cop, luring, abducting, R*ping & d*capitating Chinese scholar Yingying Zhang in his apartment in 2017. Her dism*mbered remains are STILL missing. Never forget Yingying’s story.
320 replies · 4.4K reposts · 27.9K likes · 821.4K views
Haohui Mai reposted
Aakash Gupta (@aakashgupta)
Sounds incredible until you read the fine print. The compiler generates less efficient code than GCC with all optimizations disabled. It doesn’t have its own assembler or linker. It can’t produce a 16-bit x86 code generator. And Carlini himself says it has “nearly reached the limits of Opus’s abilities.” New features and bugfixes kept breaking existing functionality.

So what did $20,000 and two weeks actually buy? A compiler that passes 99% of GCC’s torture tests but can’t match the output quality of a tool that’s had 37 years of human engineering. That’s the constraint nobody’s pricing in.

The real story is in the cost curve, not the capability demo. $20,000 for 100,000 lines means $0.20 per line of generated code. A senior compiler engineer costs roughly $150/hour. At maybe 50 polished lines per hour for something this complex, that’s $3/line. AI just did it 15x cheaper, and it will only get cheaper from here.

But the code isn’t equivalent. The AI version needs a human to finish the assembler, fix the linker, optimize the output, and prevent regressions. Those are the hardest 20% of the problem, and they represent 80% of the engineering value. Anthropic built the demo. Shipping the product still requires humans.

This tells you exactly where we are in the autonomous software timeline. AI can now produce impressive first drafts of complex systems at trivial cost. Turning those drafts into production software still requires the judgment that costs $300K+ per year in compiler engineer salary. The gap between “compiles the Linux kernel” and “replaces GCC” is measured in decades of accumulated engineering wisdom that no model has internalized yet.

The companies that understand this will use agent teams to generate the 80% and hire engineers to finish the 20%. The companies that don’t will ship $20,000 compilers that produce slower code than a free tool from 1987.
Anthropic (@AnthropicAI)

New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: anthropic.com/engineering/bu…
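The per-line arithmetic in the commentary above can be checked directly. A quick sketch (all figures are the tweet’s own claims, not independently verified):

```rust
fn main() {
    // Figures as claimed in the thread, not verified externally.
    let agent_cost = 20_000.0_f64; // dollars for the two-week agent run
    let lines = 100_000.0_f64;     // lines of generated code
    let ai_per_line = agent_cost / lines;
    assert!((ai_per_line - 0.20).abs() < 1e-12); // $0.20 per line

    let engineer_rate = 150.0_f64; // $/hour, senior compiler engineer
    let lines_per_hour = 50.0_f64; // polished lines per hour
    let human_per_line = engineer_rate / lines_per_hour;
    assert!((human_per_line - 3.0).abs() < 1e-12); // $3 per line

    // 3.00 / 0.20 = 15, matching the "15x cheaper" claim.
    let ratio = human_per_line / ai_per_line;
    assert!((ratio - 15.0).abs() < 1e-9);
    println!(
        "AI ${:.2}/line vs human ${:.2}/line (~{}x)",
        ai_per_line,
        human_per_line,
        ratio.round() as u32
    );
}
```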

191 replies · 315 reposts · 2.3K likes · 373.7K views
Haohui Mai (@wheat9)
@Yuchenj_UW My experience is that Codex seems to have better world knowledge, which makes it more effective at triaging and debugging. Claude Code excels in day-to-day software engineering tasks that need more automation.
0 replies · 0 reposts · 0 likes · 214 views
Yuchen Jin (@Yuchenj_UW)
Is Codex actually ahead of Claude Code now??? I tried Codex yesterday while doing some training optimizations on Andrej’s nanochat. It has a worse UI, and it ran my code in a CPU-only sandbox even though I have GPUs. It feels less agentic than Claude Code for sure. Sonnet 5, I'm still patiently waiting for you...
Yuchen Jin tweet media
190 replies · 20 reposts · 730 likes · 169.7K views
Haohui Mai (@wheat9)
@HotAisle For dense models, NVFP4 works out of the box (Petit). We are adding MoE support these days. Stay tuned
0 replies · 0 reposts · 1 like · 34 views
Hot Aisle (@HotAisle)
@wheat9 I'm not interested in quantizing this model myself... I just want the out-of-the-box experience on AMD to actually work.
1 reply · 0 reposts · 0 likes · 55 views
Hot Aisle (@HotAisle)
That warning about gptq_gemm being buggy and suggesting Marlin/BitBLAS is largely NVIDIA advice. On MI300X, GPTQ support is much less mature; if you see GPU “memory access fault” crashes later, GPTQ kernels are a prime suspect. Practically: FP8 / BF16 tends to be the stable lane on MI300X.

Yup...
Memory access fault by GPU node-3 (Agent handle: 0x1b1c89b0) on address 0x7ee6a0009000. Reason: Unknown.
Memory access fault by GPU node-2 (Agent handle: 0xb524a50) on address 0x7f4ca9223000. Reason: Unknown.
1 reply · 0 reposts · 2 likes · 756 views
Bin CHEN (@SeTriones)
Calling the police: @wheat9 has been out-grinding me since first thing this morning
1 reply · 0 reposts · 0 likes · 82 views
Haohui Mai reposted
Jeff Dean (@JeffDean)
Performance Hints

Over the years, my colleague Sanjay Ghemawat and I have done a fair bit of diving into performance tuning of various pieces of code. We wrote an internal Performance Hints document a couple of years ago as a way of identifying some general principles, and we've recently published a version of it externally. We'd love any feedback you might have! Read the full doc at: abseil.io/fast/hints.html
Jeff Dean tweet media
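One generic principle common to that genre of performance advice (my illustration here, not an excerpt from the linked doc): hoist allocations out of hot loops by sizing containers up front.

```rust
// Illustrative micro-optimization: reserve the Vec's capacity once
// instead of letting push() trigger repeated grow-and-copy cycles.
fn squares_prealloc(n: u64) -> Vec<u64> {
    let mut v = Vec::with_capacity(n as usize); // single allocation up front
    for i in 0..n {
        v.push(i * i); // never reallocates within this loop
    }
    v
}

fn main() {
    let v = squares_prealloc(4);
    assert_eq!(v, vec![0, 1, 4, 9]);
    assert!(v.capacity() >= 4); // capacity was reserved before the loop
}
```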
106 replies · 1.1K reposts · 7.7K likes · 2.1M views
Haohui Mai (@wheat9)
@tianyin_xu Sitting in Siebel Center on a snowstorm day is very peaceful and surprisingly satisfying
1 reply · 0 reposts · 1 like · 61 views
Tianyin Xu (@tianyin_xu)
Perfect weather to get work done over the weekend.
Tianyin Xu tweet media
6 replies · 0 reposts · 46 likes · 3K views
Haohui Mai (@wheat9)
@tianyin_xu In terms of producing papers / artifacts, productivity does rise. In terms of producing insights, I find it helps much less. AI does make people lazy in some sense :-)
0 replies · 0 reposts · 1 like · 51 views
Tianyin Xu (@tianyin_xu)
That can be one of the reasons the demand for PhDs is shrinking. I think CS research these days goes either very fundamental or very applied. The middle ground fades away because engineering barriers are much lower now. OTOH, students are indeed much more productive now than before.
1 reply · 0 reposts · 0 likes · 391 views
Tianyin Xu (@tianyin_xu)
Working on recommendation letters for students applying for PhD/MS programs. Multiple sources say this year will be tough, as PhD/MS openings will be fewer, especially in traditional CS/ECE areas that are not AI-related [1]. My suggestion is to apply both more broadly and with more targeting. There are still many excellent colleagues who are looking for grad students. You can find some of them in the comments of this LinkedIn post: linkedin.com/feed/update/ur… [1] In fact, AI today means everything in CS, and I argue that AI needs more domain experts than ever.
4 replies · 9 reposts · 88 likes · 12.3K views
Haohui Mai (@wheat9)
I really hope that AMD can show some love to their out-of-the-box experience for inference. For example, the latest sglang v0.5.5 docker image has been broken for a whole week due to github.com/sgl-project/sg…. Maybe it's time to add some smoke tests
0 replies · 0 reposts · 1 like · 143 views
Haohui Mai reposted
Yifan Qiao (@yifandotqiao)
🚀 End the GPU Cost Crisis Today!!! Tired of LLMs locking up a whole GPU while leaving capacity idle? Frustrated by your cluster's low utilization? We're launching kvcached, the first library for elastic GPU sharing across LLMs. 🔗 github.com/ovg-project/kv… 🧵👇 Why it matters:
Yifan Qiao tweet media
9 replies · 52 reposts · 199 likes · 73.7K views
Haohui Mai reposted
LMSYS Org (@lmsysorg)
How do you run FP4 models on AMD MI250/MI300 without waiting for MI350? The CausalFlow team @wheat9 built Petit, optimized mixed-precision kernels co-designed with AMD’s MatrixCore. Benchmarks: 🔹 1.74× faster Llama-3.3-70B inference 🔹 3.7× faster GEMM vs hipBLASLt Open-sourced + integrated into SGLang v0.4.10. See the full blog👇
LMSYS Org tweet media
1 reply · 3 reposts · 25 likes · 3K views
Haohui Mai (@wheat9)
@aramh Too difficult for novice programmers, not enough knobs for experts (except putting unsafe everywhere), but in general it’s still nice for everyday systems programming
0 replies · 0 reposts · 0 likes · 123 views
Aram Hăvărneanu (@aramh)
I'm looking for intelligent critiques of Rust (not dumb statements like "borrow checker bad"). One complaint I have is that Cell/RefCell are types instead of modes. In other words that the horizon of mutability is static instead of dynamic. It's possible to go from a mutable reference to an immutable one, but it's not fine grained enough. Another complaint I have is that the lifetime of a reference doesn't default to its encompassing data structure.
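The Cell/RefCell complaint above is that the horizon of mutability becomes a run-time property rather than a static mode: `RefCell` enforces the borrow rules with dynamic counters instead of the compiler. A minimal illustration of that dynamic enforcement:

```rust
use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(5);
    {
        let shared = cell.borrow(); // dynamic shared borrow begins
        assert_eq!(*shared, 5);
        // A mutable borrow is refused at *run time* while `shared` lives;
        // the compiler itself never objects to this call.
        assert!(cell.try_borrow_mut().is_err());
    } // the shared borrow ends here, tracked dynamically

    *cell.borrow_mut() += 1; // now the mutable borrow succeeds
    assert_eq!(*cell.borrow(), 6);
}
```

With plain `&`/`&mut` references the same conflict is rejected statically at compile time, which is exactly the "types instead of modes" distinction the tweet is drawing.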
34 replies · 5 reposts · 136 likes · 14.6K views
Haohui Mai (@wheat9)
@satnam6502 Get a MacBook Air and a powerful Linux box. That works the best
0 replies · 0 reposts · 1 like · 69 views
Satnam Singh (@satnam6502)
I find myself without a laptop and I am torn between a 13" MacBook Air M4 32GB 1TB vs. a 14" MacBook Pro M4 Pro 32GB 1TB. It will be my main development machine (with external monitor), running SystemVerilog simulations, theorem provers like Lean4 and Agda, SVA formal verification jobs using Tabby CAD from YosysHQ, ML frameworks like MLIR and OpenXLA, and some large Haskell programs.

So all this points to the MacBook Pro (and the extra HDMI etc. ports are nice), but if the whole point of a laptop is to be light and portable, perhaps I should get the MacBook Air and hope it has enough juice to keep me productive. In that case it might make sense to pair it with a beefy Linux machine which I can use via VS Code's remote feature (my typical mode of use recently anyway).

Any advice very welcome, esp. from theorem prover and hardware CAD users. apple.com/shop/buy-mac/m…
34 replies · 2 reposts · 104 likes · 47.8K views
Haohui Mai (@wheat9)
@tianyin_xu What a pity. It shows that the challenge for systems research to stay relevant is real. It is tough when everybody moves to “AI”…
0 replies · 1 repost · 6 likes · 1.2K views