Neoblizz 🐘

2.4K posts

@Neoblizzz

🐙, PhD, Gamer, Researcher @AMD (C++/HIP/CUDA).

California, USA · Joined February 2012
53 Following · 38 Followers
Pinned Tweet
Neoblizz 🐘@Neoblizzz·
The "you are dead to me" look. ☺️❤️
levi@levidiamode·
Day 96/365 of GPU Programming

Wanted to spend some time learning more about the AMD equivalent of ThunderKittens after seeing TK's use for Cursor's Composer 2. So I finally watched @_williamhu's excellent lecture on his work (with the folks at Stanford's @HazyResearch) on HipKittens. The talk started off with some spicy notes about compilers and covered a bunch of other really interesting topics: block vs. warp level in Triton/HK & TK, feasibility of CuTe layouts on AMD hardware, register lifetime tracking, bank conflicts, AMD's ISA vs. PTX/SASS, explicit memory patterns, achieving good occupancy, optimizing for general cache reuse, CDNA3 vs. CDNA4, warp specialization, producer/consumer on AMD vs. Nvidia due to register allocation, tensor cores, from ThunderKittens to HipKittens, motivation for tile-based primitives, AGPRs/VGPRs & explicit register pinning, reverse-engineering instructions, swizzling differences, wave interleaving through conditional barriers, cache disaggregation, and next steps for HK. The paper has some additional details around things like the register pinning, chiplet-aware cache scheduling, and implementation details, so I will have to give that a closer read sometime.
levi@levidiamode

Day 95/365 of GPU Programming

Phase 1 of the AMD kernel challenge just ended, so I'm combing through and studying the winning submissions today to get a sense of the kernel solutions and methods used at the very top. The three problems on MI355X (gfx950/CDNA4) were: MXFP4 GEMM, mixed MLA decode, and MXFP4 MoE. Here are some notes on the winning solutions (might be wrong on any of these, so please feel free to correct me on whatever):

GEMM: fused quant + native FP4 MFMA
The key instruction here seems to have been tl.dot_scaled(a_fp4, a_scales, "e2m1", b, b_scales, "e2m1", accumulator): this is Triton's interface to CDNA4's scaled MFMA. The top Triton submission used this; other winners wrote precompiled HIP kernels using v_mfma_scale_f32_16x16x128_f8f6f4 directly. The A matrix arrives as bf16 and is quantized to MXFP4. Winners fused this into the GEMM loop:
- Compute the per-32-element amax via bitwise exponent extraction: cast to int32, add 0x200000, mask 0xFF800000, shift right 23 (a rough sketch of this bit trick follows below)
- Compute the prescale via either exp2 or bitwise float construction (set the exponent bits directly)
- Convert to fp4 via the v_cvt_scalef32_pk_fp4_f32 or v_cvt_scalef32_pk_fp4_bf16 hardware instructions, as Triton inline_asm_elementwise or HIP inline ASM
- The packed fp4 bytes go straight into the MFMA, hence never touching global mem

Other techniques: XCD remapping (pid redistribution across 8 compute dies?), the .wt store modifier for small grids (writethrough to avoid polluting L2? not sure tbh). One submission had a HIP kernel for K=512 shapes using v_mfma_scale_f32_16x16x128_f8f6f4 with LDS double buffering and async buffer-to-LDS loads via llvm.amdgcn.raw.buffer.load.lds. And the #5 submission went a bit nuts on the per-shape specialization: 6 separate HIP kernels with all dimensions as compile-time constants, manual buffer resource descriptors, CK-style sched_group_barrier for ds_read/MFMA interleaving, and no Triton in the hot path. There was also a 4.35µs HSA AQL submission (not yet confirmed valid by the organizers) which precompiled a HSACO binary and dispatched by writing 64-byte AQL packets directly to the ring buffer.
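As an illustration of the amax/scale recipe in the first bullet above, here is a rough NumPy sketch. It is my reading of the description (assuming the block scale is an MX-style E8M0 biased-exponent byte), not code from any actual submission:

```python
# Hedged sketch of the quantization bit trick described above (not a winner's code).
# Assumes MX-style E8M0 block scales: one biased-exponent byte per 32 elements.
import numpy as np

def block_scales_from_amax(amax: np.ndarray) -> np.ndarray:
    """float32 per-32-element absolute maxima -> uint8 biased-exponent block scales."""
    bits = np.ascontiguousarray(amax, dtype=np.float32).view(np.uint32)
    bits = (bits + 0x200000) & 0xFF800000   # add 0x200000 as in the recipe, keep sign + exponent
    return (bits >> 23).astype(np.uint8)    # 8-bit biased exponent = block scale

def prescales_from_scales(scales: np.ndarray) -> np.ndarray:
    """Build 2**(127 - e), the reciprocal of the 2**(e - 127) block scale, by writing
    exponent bits directly instead of calling exp2. Ignores e = 0 / e > 254 edge cases."""
    e = scales.astype(np.uint32)
    return ((254 - e) << 23).view(np.float32)
```

In the fused kernels this happens in registers (with the actual fp4 packing done by the v_cvt_scalef32_pk_fp4_* instructions), but the integer manipulation is the same idea.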
MLA Decode: wrapper bypass + fewer kernels per call
The winning submission at 21µs bypassed the mla_decode_fwd wrapper and its per-call tensor allocation, metadata computation, and dispatch logic. The submission preallocated all intermediate tensors and metadata at import time and calls the underlying primitives (mla_decode_stage1_asm_fwd and mla_reduce_v1_) directly, thus constructing the persistent-mode metadata path itself (a generic sketch of the pattern follows after these notes). Some winning submissions also switched from fp8 Q (a8w8) to bf16 Q (a16w8) for certain shapes, eliminating the per-call quantization kernel. Others used static fp8 scales where they kept fp8 Q. One effective shape-level optimization seems to have been num_kv_splits=1 for 8K sequences, which lets stage1 write the final output directly, skipping the reduce kernel for 4/8 benchmark shapes. Per-batch-size page sizes for 8K ranged from 512 (bs=64) to 2048 (bs=4, bs=256).

MoE: environment + kernel tuning
The gpumode runner had an old version of AITER where the Python wrappers didn't expose the FlyDSL stage1 kernel configs. The winners used shutil.copy2 to overwrite the runner's AITER Python files with newer ones from a git clone using FlyDSL for both MoE stages, and precompiled kernels during setup to avoid JIT timeouts (roughly the patch sketched after these notes). The speed differences also seem to have come from some kernel tuning on top of that: split-K values, tile_n sizing in stage2, patching expert counts (e.g. E=257 to trigger faster FlyDSL codegen paths?), and fused sort+quant via t2s mapping. Nice find by the teams that took advantage of both the environment gap and the right configs.

Next steps
I only ended up finishing in the top 15 (presumably; results still have to be verified), so I likely won't make it to phase 2 this time around. I fell out of the top 10 on the last day, which is unfortunate, but it was a super fun process nonetheless. I mainly started this as a learning exercise and wasn't able to spend as much time on the MM/MLA challenges as I would've liked, but it really helped me study lots of new topics, so I'm nothing but thankful for the fun opportunity to play around with these kernels and benchmark them on real hardware. Major props and thanks to @GPU_MODE @marksaroufim @m_sirovatka @myainotez, Ben, and Daniel, who set up the competition and volunteered tirelessly (even on weekends) to make sure the servers were up & running around the clock. Good luck to all phase 2 participants! Super excited to read up on the winning Kimi K2.5/Deepseek R1 submissions next month.
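To make the "preallocate at import time, call the primitives directly" pattern from the MLA decode notes concrete, here is a generic PyTorch sketch. The shapes, tensor names, and the stage1_kernel/reduce_kernel callables are hypothetical placeholders; this is not aiter's actual API or the winner's code:

```python
# Hypothetical sketch of bypassing a per-call wrapper: intermediates and split
# metadata are allocated once at import time and reused on every call.
# `stage1_kernel` / `reduce_kernel` stand in for the real decode primitives.
import torch

B, H, KV_SPLITS, D = 64, 16, 1, 512   # made-up benchmark shape

# Import-time (one-off) allocation; a generic wrapper would redo this on every call.
_partial = torch.empty(B, H, KV_SPLITS, D, device="cuda", dtype=torch.float32)
_out     = torch.empty(B, H, D, device="cuda", dtype=torch.bfloat16)
_meta    = torch.arange(KV_SPLITS + 1, device="cuda", dtype=torch.int32)

def decode(q, kv_cache, stage1_kernel, reduce_kernel):
    stage1_kernel(q, kv_cache, _partial, _meta)
    if KV_SPLITS == 1:                    # num_kv_splits=1: stage1 output is already final,
        _out.copy_(_partial.squeeze(2))   # so the reduce kernel is skipped entirely
        return _out
    reduce_kernel(_partial, _out, _meta)
    return _out
```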
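And a minimal sketch of the MoE environment patch mentioned above (replacing the runner's stale AITER Python files with ones from a newer checkout); the paths are placeholders, not the actual runner layout:

```python
# Hypothetical sketch: overwrite the installed (old) AITER Python wrappers with the
# files from a newer git clone so the FlyDSL stage1 configs become reachable.
# Meant to run during setup, before the benchmark process imports aiter.
import pathlib
import shutil
import aiter

installed = pathlib.Path(aiter.__file__).resolve().parent
fresh = pathlib.Path("/tmp/aiter/aiter")      # newer clone (placeholder path)

for src in fresh.rglob("*.py"):
    dst = installed / src.relative_to(fresh)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)                    # copy2 preserves metadata while replacing the file
```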

Mark Saroufim@marksaroufim·
After 5 amazing years, I’m leaving the PyTorch team at Meta. I did my best work there and got to work with some of the smartest, most OSS pilled engineers in the industry. More soon on what’s next: still systems, still OSS (but not everything), a smaller team with a lot of GPUs
Neoblizz 🐘 retweeted
NASA@NASA·
We see our home planet as a whole, lit up in spectacular blues and browns. A green aurora even lights up the atmosphere. That's us, together, watching as our astronauts make their journey to the Moon.
Neoblizz 🐘@Neoblizzz·
Give an AI agent a flight search API and let it experiment autonomously: it tries combinations of dates, airports, airlines, and constraints, checks whether the total cost improved, keeps or discards the change, and repeats.
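This reads like a simple accept-if-cheaper loop; here is a hedged sketch of that loop, with search_flights and propose_change as hypothetical stand-ins for the API call and the agent's edit step:

```python
# Hypothetical sketch of the loop described above: tweak the query and keep the
# change only if the total itinerary cost improves.
# `search_flights` and `propose_change` are placeholders, not a real API.
def optimize_itinerary(query: dict, search_flights, propose_change, steps: int = 100) -> dict:
    best_cost = search_flights(query)            # total price of the current query
    for _ in range(steps):
        candidate = propose_change(dict(query))  # vary dates / airports / airline / constraints
        cost = search_flights(candidate)
        if cost < best_cost:                     # improved: keep the change
            query, best_cost = candidate, cost
        # otherwise discard it and try another variation
    return query
```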
Neoblizz 🐘 retweeted
Official One Piece Card Game English Version
Thank you very much to everyone for their continued support of the ONE PIECE CARD GAME. We truly appreciate your passion and dedication. Due to overwhelming demand, we understand that many users have recently experienced difficulties purchasing our products. To help improve availability and ensure more fans have the opportunity to purchase these items, we are preparing the following measures:

① Reprints of OP-13 and EB-03
We are currently moving forward with reprinting OP-13 and EB-03, with distribution expected to begin around June. For this wave, priority allocation will be given to Bandai TCG+ stores that consistently support Organized Play. Please be aware that delivery timings may vary by region. Retailers are encouraged to contact their distributors for details.

② Premium Bandai USA
The following products are scheduled to open for orders on Premium Bandai USA around mid-March~April. To help maintain fairness for all customers, these items will be offered through a chance-to-buy sales system. Further information will be shared in an upcoming announcement.
・Heroines Special Set
・3rd English Anniversary Set (reprint)
・Booster Pack [OP-13] (reprint)
・Extra Booster Pack [EB-03] (reprint)

We truly appreciate your understanding and continued support. #ONEPIECE #onepiececardgame
Neoblizz 🐘 retweeted
Velja@Velja_LOL·
Was a good run tbh xD
Neoblizz 🐘 retweeted
Crownie@CrownshotLoL·
Sadly i messed up our game today, it was in our hands, thank u for all the support it was amazing and playing LEC was super fun.. ❤️❤️
Neoblizz 🐘 retweeted
LosRatones@LosRatoneslol·
The boys gave it their all, but it wasn't to be. Thanks for all the support. 🐀🖤
Neoblizz 🐘 retweeted
Caedrel@Caedrel·
do you believe in witchcraft ?
Neoblizz 🐘 retweeted
Thebausffs@thebausffs·
Got solo killed every game so far 😎
Neoblizz 🐘 retweeted
Overwatch@PlayOverwatch·
Can you spot your main? Drop what they’re doing in the comments ☃️❄️ And a HUGE thank you to the talented @Theresssa1 for such a wonderful work of art this holiday 🥰
FooYa@iFooYa·
So thankful for the ability to relive my childhood trauma and get rebullied by this terrorist I call a sister!!! 😊 ❤️
Neoblizz 🐘 retweeted
Overwatch@PlayOverwatch·
You wanted a new hero? Tough luck ⚔️ Find out what Vendetta lost and what she’s willing to do to get it all back in our newest Hero Trailer, ‘La Lupa.’ 🐺
Neoblizz 🐘 retweeted
AI at AMD@AIatAMD·
AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and chiplet-aware cache reuse help unlock the full potential of AMD GPUs. 🐱 Dig into the details: hazyresearch.stanford.edu/blog/2025-11-0… #AMDevs