Neoblizz 🐘

2.4K posts

Neoblizz 🐘

@Neoblizzz

🐙, PhD, Gamer, Researcher @AMD (C++/HIP/CUDA).

California, USA Katılım Şubat 2012

53 Takip Edilen38 Takipçiler

Sabitlenmiş Tweet

Neoblizz 🐘@Neoblizzz·25 Eyl

The "you are dead to me" look. ☺️❤️

English

Neoblizz 🐘@Neoblizzz·10 Nis

@levidiamode @_williamhu @HazyResearch Hey, I have been enjoying your daily posts -- DM me if you're interested on working on some cool stuff at AMD.

English

135

levi@levidiamode·10 Nis

Day 96/365 of GPU Programming Wanted to spend some time learning more about the AMD equivalent of ThunderKittens after seeing TK's use for Cursor's Composer 2. So finally watched @_williamhu's excellent lecture on his work (+ the folks at Stanford's @HazyResearch) around HipKittens. The talk started off with some spicy notes about compilers and covered a bunch of other really interesting topics like block vs warp level in Triton/HK & TK, feasibility of CuTe layouts on AMD hardware, register lifetime tracking, bank conflicts, AMD's ISA vs PTX/SASS, explicit memory patterns, achieving good occupancy, optimizing for general cache reuse, CDNA3 vs CDNA4, warp specialization, producer/consumer on AMD vs Nvidia due to registry allocation, tenor cores, from ThunderKittens to HipKittens, motivation for tile based primitves, AGPRs/VGPRs & explicit register pinning, reverse engineering instructions, swizzling differences, wave interleaving through conditional barriers, cache disaggregation and next steps for HK. The paper has some additional details around stuff like the register pinning, chiplet aware cache scheduling and implementation details, so will have to give that a closer read sometime.

levi@levidiamode

Day 95/365 of GPU Programming Phase 1 of the AMD kernel challenge just ended, so combing through and studying the winning submissions today to get a sense of the kernel solutions and methods used at the very top. The three problems on MI355X (gfx950/CDNA4) were: MXFP4 GEMM, mixed MLA decode and MXFP4 MoE. Here are some notes on the winning solutions (might be wrong on any of these, so pls feel free to correct me on whatever): GEMM: fused quant + native FP4 MFMA The key instruction here seems to have been tl.dot_scaled(a_fp4, a_scales, "e2m1", b, b_scales, "e2m1", accumulator): this is Triton's interface to CDNA4's scaled MFMA. The top Triton submission used this/other winners wrote precompiled HIP kernels using v_mfma_scale_f32_16x16x128_f8f6f4 directly. The A matrix arrives as bf16 and is quantized to MXFP4. Winners fused this into the GEMM loop: - Compute per 32 element amax via bitwise exponent extraction: cast to int32, add 0x200000, mask 0xFF800000, shift right 23 - Compute prescale via either exp2 or bitwise float construction (set exponent bits directly) - Convert to fp4 via v_cvt_scalef32_pk_fp4_f32 or v_cvt_scalef32_pk_fp4_bf16 hardware instructions as Triton inline_asm_elementwise or HIP inline ASM - The packed fp4 bytes go straight into the MFMA, hence never touching global mem Other techniques: XCD remapping (pid redistribution across 8 compute dies?), .wt store modifier for small grids (writethrough to avoid polluting L2? not sure tbh). One submission had a HIP kernel for K=512 shapes using v_mfma_scale_f32_16x16x128_f8f6f4 with LDS double buffering and async buffer-to-LDS loads via llvm.amdgcn.raw.buffer.load.lds. And the #5 submission went a bit nuts on the per shape specialization: 6 separate HIP kernels with all dimensions as compile time constants, manual buffer resource descriptors, CK type sched_group_barrier for ds_read/MFMA interleaving + no Triton in the hot path. There was also a 4.35µs HSA AQL submission (not yet confirmed valid by organizers) which precompiled a HSACO binary and dispatched by writing 64 byte AQL packets directly to the ring buffer. MLA Decode: wrapper bypass + fewer kernels per call So it looks like the winning submission at 21µs bypassed the mla_decode_fwd wrapper by relying on its percall tensor allocation, metadata computation + dispatch logic. The submission preallocatedsall intermediate tensors and metadata at import time and calls the underlying primitives (mla_decode_stage1_asm_fwd and mla_reduce_v1_ directly) thus constructing the persistent mode metadata path itself. Some winning submissions also switched from fp8 Q (a8w8) to bf16 Q (a16w8) for certain shapes (eliminating the per call quantization kernel). Others used static fp8 scales where they kept fp8 Q. One effective shape level optimization seems to have been: num_kv_splits=1 for 8K sequences which lets stage1 write final output directly skipping the reduce kernel for 4/8 benchmark shapes. Per batchsize page sizes for 8K ranged from 512 (bs=64) to 2048 (bs=4, bs=256). MoE: environment + kernel tuning The gpumode runner had an old version of AITER where the Python wrappers didn't expose FlyDSL stage1 kernel configs. The winners used shutil.copy2 to overwrite the runner's AITER Python files with newer ones from a git clone using FlyDSL for both MoE stages, and precompiled kernels during setup to avoid JIT timeouts. The speed differences also seems to have come from some kernel tuning on top of that: split K values, tile_n sizing in stage2, patching expert counts (e.g. E=257 to trigger faster FlyDSL codegen paths?) and fused sort+quant via t2s mapping. Nice find by the teams that used both the environment gap and the right configs to take advantage of this. Next steps Only ended up finishing in the top 15 (presumably; results have still to be verified), so likely won't make it to phase 2 this time around. Fell out of the top 10 on the last day, which is unfortunate but it was a super fun process nonetheless. Mainly started this as a learning exercise and wasn't able to spend as much time on the MM/MLA challenges as I would've liked to but it really helped me study lots of new topics, so nothing but thankful for the fun opportunity to play around with these kernels and benchmark them on real hardware. Major props and thanks to @GPU_MODE @marksaroufim @m_sirovatka @myainotez, Ben and Daniel who set up the competition and volunteered tirelessly (even on weekends) to make sure servers were up & running around the clock. Good luck to all phase 2 participants! Super excited to read up on the winning Kimi K2.5/Deepseek R1 submissions next month.

English

5.6K

Neoblizz 🐘@Neoblizzz·8 Nis

@TheCatziilla My kinda video 🔥

English

Neoblizz 🐘 retweetledi

Catziilla 🦇@TheCatziilla·7 Nis

New Pokemon Hunting video :) youtu.be/3sO44u2rsfk?si…

YouTube

English

231

5.8K

Neoblizz 🐘@Neoblizzz·7 Nis

@marksaroufim Best of luck, Mark!

English

Mark Saroufim@marksaroufim·7 Nis

After 5 amazing years, I’m leaving the PyTorch team at Meta. I did my best work there and got to work with some of the smartest, most OSS pilled engineers in the industry. More soon on what’s next: still systems, still OSS (but not everything), a smaller team with a lot of GPUs

English

101

1.3K

71.8K

Neoblizz 🐘 retweetledi

NASA@NASA·3 Nis

We see our home planet as a whole, lit up in spectacular blues and browns. A green aurora even lights up the atmosphere. That's us, together, watching as our astronauts make their journey to the Moon.

English

4.7K

66.2K

313.2K

76.5M

Neoblizz 🐘 retweetledi

Supergiant Games (Hades II now on Xbox/PS5!)@SupergiantGames·26 Mar

HADES II is coming to @Xbox Series X|S and @PlayStation on April 14!🌖 It'll be on @XboxGamePass that same day. Time for the Princess of the Underworld to suit up in our brand-new animated trailer!✨

English

615

7.6K

69.6K

6.6M

Neoblizz 🐘@Neoblizzz·23 Mar

Give an AI agent a flight search API and let it experiment autonomously. Combination of dates, airports, airlines, and constraints, checks if the total cost improved, keeps or discards, and repeats.

English

Neoblizz 🐘@Neoblizzz·23 Mar

You come back to a log of experiments and (hopefully) cheaper flights to Japan (or anywhere)! Inspired by @karpathy's autoresearch. github.com/neoblizz/autor…

English

Neoblizz 🐘 retweetledi

Wawamachi ( Commission open )@wawamachi·1 Mar

Wawamachi ( Commission open ) tweet media

Wawamachi ( Commission open )@wawamachi

ZXX

36.5K

1.2M

Neoblizz 🐘 retweetledi

Jeri@Jeriafterdark·1 Mar

I... I see the vision #onepiece #PokemonWindsWaves

Melony🍈@MelonTeee

understand me

English

6.7K

50.1K

767.5K

Neoblizz 🐘 retweetledi

Official One Piece Card Game English Version@ONEPIECE_tcg_EN·12 Şub

Thank you very much to everyone for their continued support of the ONE PIECE CARD GAME. We truly appreciate your passion and dedication. Due to overwhelming demand, we understand that many users have recently experienced difficulties purchasing our products. To help improve availability and ensure more fans have the opportunity to purchase these items, we are preparing the following measures: ① Reprints of OP-13 and EB-03 We are currently moving forward with reprinting OP-13 and EB-03, with distribution expected to begin around June. For this wave, priority allocation will be given to Bandai TCG+ stores that consistently support Organized Play. Please be aware that delivery timings may vary by region. Retailers are encouraged to contact their distributors for details. ② Premium Bandai USA The following products are scheduled to open for orders on Premium Bandai USA around mid-March~April. To help maintain fairness for all customers, these items will be offered through a chance-to-buy sales system. Further information will be shared in an upcoming announcement. ・Heroines Special Set ・3rd English Anniversary Set (reprint) ・Booster Pack [OP-13] (reprint) ・Extra Booster Pack [EB-03] (reprint) We truly appreciate your understanding and continued support. #ONEPIECE #onepiececardgame

English

361

596

6.3K

996.5K

Neoblizz 🐘 retweetledi

Velja@Velja_LOL·8 Şub

Was a good run tbh xD

English

239

225

18.5K

319K

Neoblizz 🐘 retweetledi

Crownie@CrownshotLoL·9 Şub

Sadly i messed up our game today, it was in our hands, thank u for all the support it was amazing and playing LEC was super fun.. ❤️❤️

English

279

248

20.8K

302.1K

Neoblizz 🐘 retweetledi

LosRatones@LosRatoneslol·9 Şub

The boys gave it their all, but it wasn't to be. Thanks for all the support. 🐀🖤

English

1.2K

2.4K

46.1K

2.3M

Neoblizz 🐘 retweetledi

Caedrel@Caedrel·2 Şub

do you believe in witchcraft ?

English

451

1.2K

25.4K

1.1M

Neoblizz 🐘 retweetledi

Thebausffs@thebausffs·26 Oca

Got solo killed every game so far 😎

English

185

359

20.3K

501.9K

Neoblizz 🐘 retweetledi

Overwatch@PlayOverwatch·25 Ara

Can you spot your main? Drop what they’re doing in the comments ☃️❄️ And a HUGE thank you to the talented @Theresssa1 for such a wonderful work of art this holiday 🥰

English

416

2.9K

16.7K

1.2M

Neoblizz 🐘@Neoblizzz·27 Kas

@iFooYa @TheCatziilla YOOO WE BOUGHT IT TOO. 😈

English

FooYa@iFooYa·27 Kas

So thankful for the ability to relive my childhood trauma and get rebullied by this terrorist I call a sister!!! 😊 ❤️

English

690

28.8K

Neoblizz 🐘 retweetledi

Overwatch@PlayOverwatch·17 Kas

You wanted a new hero? Tough luck ⚔️ Find out what Vendetta lost and what she’s willing to do to get it all back in our newest Hero Trailer, ‘La Lupa.’ 🐺

English

830

2.9K

17.8K

3.8M

Neoblizz 🐘 retweetledi

AI at AMD@AIatAMD·12 Kas

AI is compute hungry, so the @HazyResearch team at @Stanford asked: How do we build AI from the hardware up? How do we lead developers to do what the hardware prefers? This technical deep dive on HipKittens explores how optimized register tiles, wave-level scheduling, and chiplet-aware cache reuse help unlock the full potential of AMD GPUs. 🐱 Dig into the details: hazyresearch.stanford.edu/blog/2025-11-0… #AMDevs

English

13.7K

Keşfet

@levidiamode @_williamhu @HazyResearch @TheCatziilla @marksaroufim @Xbox @PlayStation @XboxGamePass