shikhar

github.com/zartbot/blog/i…

189

Zartbot@zartbotF·4d

ZXX

3

14

4.6K

shikhar@encapsulated007·5d

@tenderizzation self-introspection

1

26

tender@tenderizzation·5d

successor to the attention block needs to be called the introspection block atp

0

7

734

shikhar@encapsulated007·5d

@MainzOnX very hard to deny one and accept the other. excited for both

1

33

Adam Mainz@MainzOnX·5d

@encapsulated007 We probably did and I can happily write something up here! You into tips and tricks on the DSL or optimization tooling? (Or both)

0

2

160

Adam Mainz@MainzOnX·5d

Thinking about writing blog posts / articles here again. Any topics people want? ML inference, kernel perf, cool projects from Meta etc?

18

6

92

9.2K

shikhar retweeted

Albert Gu@_albertgu·5d

The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model has noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!

36

314

1.6K

418.2K

shikhar retweeted

Tri Dao@tri_dao·5d

The frontier has increasingly shifted to hybrid models - from Qwen to Kimi-Linear and now with NVIDIA's Nemotron-3 Super - that rely on a strong linear sequence model. Today we release Mamba-3, the most powerful linear model to date. x.com/_albertgu/stat…

Albert Gu@_albertgu

11

112

845

73.3K

shikhar@encapsulated007·5d

@norxornor @openreviewnet yup, mine shows uspsa.com.mx or some shit

1

81

nor@norxornor·5d

everyone's openreview profile seems to have been switched out with some random guy's? (logged in a few minutes ago to see my profile basically wiped and replaced) @openreviewnet

0

7

490

shikhar@encapsulated007·5d

@johannes_hage well, now im missing it more.

1

142

Johannes Hagemann@johannes_hage·6d

it was a good event sir

9

5

181

5.8K

shikhar@encapsulated007·6d

@tugot17 @m_sirovatka 's sleep paralysis demon

418

Piotr Mazurek in SF 🌉@tugot17·6d

wait NVL1152??? what? 1152?

39

62.6K

shikhar@encapsulated007·6d

@anindyadeeps well, well, well x.com/behrouz_ali/st…

Ali Behrouz@behrouz_ali

This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understood, here there is no innovation to be excited about, and yet surprisingly there is no citation and discussion about DCA! The level of redundancy in LLM research and then the hype on X is getting worse and worse! DeepCrossAttention is built based on the intuition that depth-wise cross-attention allows for richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

0

1

96

Anindya@anindyadeeps·6d

You know whats more surprising? No one thought of this before?

Han Xiao@hxiao

If you only have 60s of attention for Kimi's Attention Residuals paper, watch this.

0

6

565

shikhar retweeted

Lei Zhang@LeiLMx·16 Mar

I published a new post in my Triton series about Gluon — a new Python frontend that exposes more compiler internals so developers can have explicit control over performance. I also share some thoughts in the context of rapidly evolving agentic software development: portability vs performance, general vs domain-specific compilers, and why DSLs may become an important companion. 🔗 lei.chat/posts/gluon-ex…

19

140

13.7K

shikhar@encapsulated007·14 Mar

@corsix @wouter_kool bro is in the trenches...

99

Pete Cawley@corsix·14 Mar

… and we have a new winning vliw-challenge.fly.dev entry from @wouter_kool

1

29

2K

shikhar@encapsulated007·13 Mar

@m_sirovatka @GPU_MODE @verdacloud @sestercegroup nooo, missing out on another banger hackathon!!

1

178

Matej Sirovatka@m_sirovatka·13 Mar

What’s the best model you can train in a day if someone hands you a pile of Blackwell GPUs? You can try it out yourself: on April 9 in Paris, @GPU_MODE + @verdacloud + @sestercegroup are hosting a GPU hackathon with a bunch of GPUs to run on and even more of them for the winners.

12

8

160

8.3K

shikhar@encapsulated007·12 Mar

@xidulu nemotron-3 inspired latent router?!

180

Xidulu@xidulu·12 Mar

Has anyone tried using random projection as the MoE router...?

0

6

2K

shikhar@encapsulated007·11 Mar

@Laz4rz i just go to sleep everytime i build vllm from source (nvcc_threads=1, max_jobs=2)

2

46

Lazarz@Laz4rz·10 Mar

job dying at 430/440 compilation due to time limit got me going home

Lazarz@Laz4rz

in the “correct vllm + torch + torchaudio on aarch64” trenches

4

0

34

3.9K

PolyMage Labs@polymagelabs·10 Mar

Finally, here's the paper on PolyBlocks describing how fully code-generating compilers for AI chips can be built! This is the culmination of multiple years of R&D and engineering. There is now enough reusable infrastructure in our toolkit to quickly build high-performing PyTorch/JAX compilers for new chips, no matter how weird or unique their capabilities are, and without relying on any "kernel" libraries or manual model optimization or porting. The paper isn't exhaustive, but it provides details on the key parts, the design choices, and why they are powerful. arxiv.org/abs/2603.06731

18

85

5.5K

shikhar@encapsulated007·10 Mar

@polymagelabs this is really cool. didn't know you guys were still out there building.

1

166

shikhar@encapsulated007·10 Mar

insanity beyond my comprehension!!

Ethan He@EthanHe_42

My last open-source project before joining xAI is just out today. Megatron Core MoE is probably the best open framework out there to seriously train mixture of experts at scale. It achieves 1233 TFLOPS/GPU for DeepSeek-V3-685B. arxiv.org/abs/2603.07685

14

1.8K

shikhar@encapsulated007·10 Mar

@TheZachMueller war clock from pacific rim

Mark Saroufim@marksaroufim

29

Zach Mueller@TheZachMueller·10 Mar

Computers are fun again

3

0

10

722

shikhar@encapsulated007·10 Mar

what even in the good lord Hopper's hack was this!?

@m_sirovatka There's one smart human Erik Schultheis, he's the vanguard of humans against the AI slop and he's been working on a benchmark function that would be resistant to adversarial attacks If you're an AI researcher, come at us! github.com/gpu-mode/pygpu…
