Satyapriya Krishna

636 posts

@SatyaScribbles

Explorer @sesame Re-tweets/Re-posts == Lit review

Allston, MA · Joined June 2020

255 Following · 556 Followers

Pinned Tweet

Satyapriya Krishna@SatyaScribbles·17 Mar

Excited to join @sesame! 🕶️

434

Satyapriya Krishna retweeted

Kevin Gu@kevingu·2d

x.com/i/article/2039…

113

607

5.3K

3.2M

Satyapriya Krishna retweeted

Yoonho Lee@yoonholeee·5d

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

262

1.6K

424.4K

Satyapriya Krishna retweeted

Jack Zhang@jcz42·5d

We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement of Newton-Schulz for your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
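The idea of iterating on the Gram matrix instead of the rectangular matrix can be sketched in a few lines of NumPy. This is only an illustration built on the textbook cubic Newton-Schulz update, not the authors' algorithm or coefficients; the function names `newton_schulz` and `gram_newton_schulz`, the step count, and the Frobenius scaling are all assumptions made for this example. It relies on the fact that every iterate is the starting matrix multiplied by a polynomial in XX^T, so the polynomial can be accumulated on the small square matrix and applied once at the end.

```python
import numpy as np

def newton_schulz(x, steps=15):
    """Textbook cubic Newton-Schulz applied to the rectangular matrix itself."""
    x = x / np.linalg.norm(x)  # Frobenius scaling keeps singular values in (0, 1]
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

def gram_newton_schulz(x, steps=15):
    """Same update, but iterating only on the small square Gram matrix.

    Illustrative reformulation (an assumption of this sketch, not the
    published method): X_{k+1} = P_k X_k with P_k = 1.5 I - 0.5 G_k and
    G_{k+1} = P_k G_k P_k, so the product of the P_k is applied to X once.
    """
    x = x / np.linalg.norm(x)
    g = x @ x.T                    # Gram matrix, shape (rows, rows)
    q = np.eye(x.shape[0])         # accumulated product of the P_k
    for _ in range(steps):
        p = 1.5 * np.eye(g.shape[0]) - 0.5 * g
        q = p @ q
        g = p @ g @ p
    return q @ x                   # single multiply with the rectangular matrix

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))
a = newton_schulz(x0)
b = gram_newton_schulz(x0)
```

In exact arithmetic the two functions return the same matrix. The tweet notes that making this numerically stable in practice took considerable work, which this float64 toy example does not attempt to reproduce.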

165

202.2K

Satyapriya Krishna retweeted

elvis@omarsar0·29 Mar

NEW research from NVIDIA. Post-training agents with RL is powerful but expensive. Every parameter update needs full multi-turn rollouts with environment interactions, making end-to-end RL prohibitively costly for long-horizon agentic tasks. This research offers a practical middle ground. The work introduces PivotRL, a framework that operates on existing SFT trajectories to combine the computational efficiency of SFT with the out-of-domain retention of end-to-end RL. Instead of exhaustive full-trajectory rollouts, PivotRL identifies pivots, informative intermediate turns where sampled actions show mixed outcomes, and trains only on those high-signal moments. Standard SFT degrades OOD performance by -9.83 points on average. PivotRL stays near zero (+0.21) while achieving +14.11 average in-domain gains over the base model versus +9.94 for SFT. On SWE-Bench, PivotRL reaches competitive accuracy with E2E RL using 4x fewer rollout turns and 5.5x less wall-clock time. The method is already deployed in production as the workhorse for NVIDIA's Nemotron-3-Super-120B agentic post-training. Paper: arxiv.org/abs/2603.21383 Learn to build effective AI agents in our academy: academy.dair.ai
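The pivot-selection step described above can be illustrated with a toy filter. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes each turn has a list of binary outcomes (1 for success, 0 for failure) from sampled actions, and it treats a turn as a pivot when those outcomes are mixed. The function name `find_pivots` and the sample data are hypothetical.

```python
def find_pivots(outcomes_per_turn):
    """Return indices of turns whose sampled actions had mixed outcomes."""
    pivots = []
    for turn, outcomes in enumerate(outcomes_per_turn):
        if not outcomes:
            continue  # no samples for this turn, nothing to learn from
        success_rate = sum(outcomes) / len(outcomes)
        # All-success or all-failure turns carry no contrast between actions.
        if 0.0 < success_rate < 1.0:
            pivots.append(turn)
    return pivots

# Hypothetical outcomes for four turns of one trajectory (made-up data).
samples = [
    [1, 1, 1, 1],  # always succeeds
    [1, 0, 1, 0],  # mixed
    [0, 0, 0, 0],  # always fails
    [0, 1, 1, 1],  # mixed
]
pivots = find_pivots(samples)
```

Here only turns 1 and 3 are kept, since they are the only ones where different sampled actions led to different outcomes.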

401

48.5K

Satyapriya Krishna retweeted

Will Held@WilliamBarrHeld·26 Mar

Scaling laws are "just" regressions. But a biased fitting method can quietly misallocate millions of $ of compute at frontier scales. My coworker Eric Czech dug into a bias in parabolic IsoFLOP fits used by Meta, DeepSeek, Microsoft, Waymo, et al. for their scaling laws🧵
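For readers unfamiliar with the technique, a parabolic IsoFLOP fit can be sketched as follows: at a fixed compute budget, fit a parabola to loss as a function of log model size and read the compute-optimal size off the vertex. This sketch shows only the basic fit, not the bias the thread discusses; the function name `isoflop_optimum` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def isoflop_optimum(params, losses):
    """Fit loss vs. log(model size) with a parabola and return the vertex size."""
    log_n = np.log(np.asarray(params, dtype=float))
    center = log_n.mean()  # centering improves conditioning of the fit
    a, b, _ = np.polyfit(log_n - center, np.asarray(losses, dtype=float), 2)
    if a <= 0:
        raise ValueError("fitted parabola has no minimum")
    return float(np.exp(center - b / (2.0 * a)))

# Synthetic, noise-free losses with a known optimum at 1e9 parameters (made-up data).
true_opt = 1e9
params = np.array([2.5e8, 5e8, 1e9, 2e9, 4e9])
losses = 2.0 + 0.05 * (np.log(params) - np.log(true_opt)) ** 2
n_opt = isoflop_optimum(params, losses)
```

On exactly parabolic data the vertex recovers the true optimum. Real loss curves are only approximately parabolic in log size, which is where a fitting bias of the kind mentioned in the thread could enter.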

134

34.3K

Satyapriya Krishna retweeted

Junyang Lin@JustinLin610·26 Mar

x.com/i/article/2037…

589

815K

Satyapriya Krishna retweeted

Cursor@cursor_ai·25 Mar

We go into detail about the infrastructure behind large scale training including the kernels we developed and open-sourced for the project. We also discuss distributed training and environment scaling for RL.

183

43.9K

Satyapriya Krishna retweeted

MIT CSAIL@MIT_CSAIL·22 Mar

A take from 1963 that aged pretty well.

177

41K

Satyapriya Krishna retweeted

rohan anil@_arohan_·22 Mar

Really neat paper! Find a linear map from the encoder output to its chosen codebook vector using a scale and two Householder reflections then backprop through that map instead of the straight-through identity. Great work!
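One standard way to build a linear map from a scale and two Householder reflections is shown below. It is a sketch of the geometric construction only, assuming the encoder output `e` and the codebook vector `q` are nonzero and do not point in exactly opposite directions; the function names are hypothetical, and the backpropagation step from the paper is not shown.

```python
import numpy as np

def householder(normal):
    """Reflection across the hyperplane orthogonal to `normal`."""
    n = normal / np.linalg.norm(normal)
    return np.eye(n.size) - 2.0 * np.outer(n, n)

def map_to_codeword(e, q):
    """Linear map M = scale * H2 @ H1 such that M @ e == q."""
    u = e / np.linalg.norm(e)
    v = q / np.linalg.norm(q)
    first = householder(u + v)   # sends u to -v
    second = householder(v)      # sends -v to v
    scale = np.linalg.norm(q) / np.linalg.norm(e)
    return scale * second @ first

e = np.array([1.0, 2.0, 2.0])    # toy encoder output (made-up)
q = np.array([0.0, 3.0, -4.0])   # toy codebook vector (made-up)
m = map_to_codeword(e, q)
```

The two reflections compose into a rotation, so the map preserves angles and rescales every vector by the same factor, `|q| / |e|`.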

415

26.1K

Satyapriya Krishna retweeted

Kimi.ai@Kimi_Moonshot·16 Mar

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

334

2.1K

13.5K

4.9M

Satyapriya Krishna retweeted

Nika Haghtalab@nhaghtal·28 Feb

Very cool!! Nice to see practice keeping up with the theory 😉 After all, they’re provably as fast as parallel sampling can get.

Stefano Ermon@StefanoErmon

Mercury 2 is live 🚀🚀 The world’s first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I’m incredibly proud of what we’ve built. We’re just getting started on what diffusion can do for language.

242

25.2K

Satyapriya Krishna retweeted

Center for AI Safety@CAIS·12 Mar

x.com/i/article/2031…

2.7K

Satyapriya Krishna retweeted

mohit@mohitwt_·7 Mar

x.com/i/article/2029…

110

984

110.3K

Satyapriya Krishna retweeted

fly51fly@fly51fly·7 Mar

[CL] Replaying pre-training data improves fine-tuning S Kotha, P Liang [Stanford University] (2026) arxiv.org/abs/2603.04964

2.2K

Satyapriya Krishna retweeted

Davis Blalock@davisblalock·4 Mar

🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]

227

1.6K

213.6K

Satyapriya Krishna retweeted

fly51fly@fly51fly·4 Mar

[CL] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning J Zhang, Z Yu, L Wang, N Yang… [Microsoft Research Asia & Peking University] (2026) arxiv.org/abs/2603.01639

1.8K

Satyapriya Krishna retweeted

Aryan Pandey@AryanPa66861306·2 Mar

Today I read a Paper: Ring Attention with Blockwise Transformers for Near-Infinite Context arxiv.org/pdf/2310.01889

117

8.4K

Satyapriya Krishna retweeted

fly51fly@fly51fly·28 Feb

[CL] dLLM: Simple Diffusion Language Modeling Z Zhou, L Chen, H Tong, D Song [UC Berkeley & UIUC] (2026) arxiv.org/abs/2602.22661

5.8K

Satyapriya Krishna retweeted

Matei Zaharia@matei_zaharia·27 Feb

Really cool work from Databricks Research in collaboration with Harvard and Cornell! It turns out off-policy RL can match and even outperform on-policy, making post training a lot more efficient and flexible. Try it on your tasks on Databricks!

Kianté Brantley@xkianteb

Does LLM RL post-training need to be on-policy?

10.9K
