Satyapriya Krishna
@SatyaScribbles

636 posts

Explorer @sesame Re-tweets/Re-posts == Lit review

Allston, MA · Joined June 2020
255 Following · 556 Followers
Satyapriya Krishna retweeted
Yoonho Lee @yoonholeee
How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end
75 replies · 262 reposts · 1.6K likes · 424.4K views
Satyapriya Krishna retweeted
Jack Zhang @jcz42
We made Muon run up to 2x faster for free! Introducing Gram Newton-Schulz: a mathematically equivalent but computationally faster Newton-Schulz algorithm for polar decomposition. Gram Newton-Schulz rewrites Newton-Schulz such that instead of iterating on the expensive rectangular X matrix, we iterate on the small, square, symmetric XX^T Gram matrix to reduce FLOPs. This allows us to make more use of fast symmetric GEMM kernels on Hopper and Blackwell, halving the FLOPs of each of those GEMMs. Gram Newton-Schulz is a drop-in replacement for Newton-Schulz in your Muon use case: we see validation perplexity preserved within 0.01, and share our (long!) journey stabilizing this algorithm and ensuring that training quality is preserved above all else. This was a super fun project with @noahamsel, @berlinchen, and @tri_dao that spanned theory, numerical analysis, and ML systems! Blog and codebase linked below 🧵
16 replies · 165 reposts · 1K likes · 202.2K views
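A minimal sketch of the Gram trick described above, assuming the quintic Newton-Schulz coefficients commonly used with Muon; the released algorithm also involves stabilization work that this toy version does not reproduce.

    import torch

    # Quintic Newton-Schulz coefficients commonly used with Muon (an assumption here).
    A_, B_, C_ = 3.4445, -4.7750, 2.0315

    def newton_schulz(X, steps=5, eps=1e-7):
        # Standard iteration: every step touches the full rectangular m x n matrix X.
        if X.shape[0] > X.shape[1]:
            return newton_schulz(X.T, steps, eps).T
        X = X / (X.norm() + eps)
        for _ in range(steps):
            G = X @ X.T                                 # m x m Gram matrix, costs m^2 * n
            X = A_ * X + (B_ * G + C_ * (G @ G)) @ X    # second m^2 * n GEMM per step
        return X

    def gram_newton_schulz(X, steps=5, eps=1e-7):
        # Gram reformulation (sketch): iterate on G = X X^T and on the accumulated
        # polynomial map M, so every per-step GEMM is small, square, and symmetric;
        # the rectangular X is touched once at the start and once at the end.
        if X.shape[0] > X.shape[1]:
            return gram_newton_schulz(X.T, steps, eps).T
        X = X / (X.norm() + eps)
        m = X.shape[0]
        I = torch.eye(m, dtype=X.dtype, device=X.device)
        G = X @ X.T                                     # one rectangular GEMM up front
        M = I.clone()
        for _ in range(steps):
            P = A_ * I + B_ * G + C_ * (G @ G)          # X_{k+1} = P @ X_k
            M = P @ M                                   # hence X_k = M @ X_0
            G = P @ G @ P                               # and G_{k+1} = P @ G @ P
        return M @ X                                    # one rectangular GEMM at the end

    # Equivalence check in float64: both paths compute the same polynomial of X.
    X = torch.randn(64, 1024, dtype=torch.float64)
    print(torch.allclose(newton_schulz(X), gram_newton_schulz(X), atol=1e-8))

In exact arithmetic both functions return the same matrix; the saving comes from replacing the two rectangular GEMMs per step with small symmetric ones when one dimension is much larger than the other.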
Satyapriya Krishna retweeted
elvis @omarsar0
NEW research from NVIDIA. Post-training agents with RL is powerful but expensive. Every parameter update needs full multi-turn rollouts with environment interactions, making end-to-end RL prohibitively costly for long-horizon agentic tasks. This research offers a practical middle ground. The work introduces PivotRL, a framework that operates on existing SFT trajectories to combine the computational efficiency of SFT with the out-of-domain retention of end-to-end RL. Instead of exhaustive full-trajectory rollouts, PivotRL identifies pivots (informative intermediate turns where sampled actions show mixed outcomes) and trains only on those high-signal moments. Standard SFT degrades OOD performance by 9.83 points on average; PivotRL stays near zero (+0.21) while achieving +14.11 average in-domain gains over the base model versus +9.94 for SFT. On SWE-Bench, PivotRL reaches accuracy competitive with end-to-end RL using 4x fewer rollout turns and 5.5x less wall-clock time. The method is already deployed in production as the workhorse for NVIDIA's Nemotron-3-Super-120B agentic post-training. Paper: arxiv.org/abs/2603.21383 Learn to build effective AI agents in our academy: academy.dair.ai
22 replies · 68 reposts · 401 likes · 48.5K views
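A rough sketch of the pivot selection described above; `policy.sample_continuation` and `env.score` are hypothetical stand-ins for the model rollout and outcome check, not names from the paper, and the real method's details will differ.

    from dataclasses import dataclass

    @dataclass
    class PivotExample:
        prefix: list      # conversation turns up to the candidate pivot
        actions: list     # k actions sampled at that turn
        rewards: list     # outcome of each sampled action (e.g. 0/1 task success)

    def find_pivots(trajectory, policy, env, k=4):
        # Walk an existing SFT trajectory; at each intermediate turn sample k
        # candidate actions and keep the turn only if outcomes are mixed
        # (neither all-success nor all-failure), i.e. the turn carries signal.
        pivots = []
        for t in range(len(trajectory) - 1):
            prefix = trajectory[: t + 1]
            actions = [policy.sample_continuation(prefix) for _ in range(k)]
            rewards = [env.score(prefix, a) for a in actions]
            if 0 < sum(rewards) < k:
                pivots.append(PivotExample(prefix, actions, rewards))
        return pivots

    # Training would then apply weighted-SFT / policy-gradient updates on these
    # pivots only, rather than rolling out and scoring full trajectories end-to-end.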
Satyapriya Krishna retweeted
Will Held @WilliamBarrHeld
Scaling laws are "just" regressions. But a biased fitting method can quietly misallocate millions of $ of compute at frontier scales. My coworker Eric Czech dug into a bias in parabolic IsoFLOP fits used by Meta, DeepSeek, Microsoft, Waymo, et al. for their scaling laws🧵
2 replies · 27 reposts · 134 likes · 34.3K views
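For context, the parabolic IsoFLOP fit the thread is auditing looks roughly like this (synthetic numbers): at one fixed FLOP budget, fit loss as a quadratic in log model size and read the compute-optimal size off the vertex.

    import numpy as np

    # Made-up losses for five model sizes trained at the same FLOP budget.
    log_n = np.log10(np.array([1e8, 3e8, 1e9, 3e9, 1e10]))
    loss = np.array([3.10, 2.84, 2.71, 2.69, 2.78])

    a, b, c = np.polyfit(log_n, loss, deg=2)      # ordinary least-squares parabola
    log_n_opt = -b / (2 * a)                      # vertex of the fitted parabola
    print(f"estimated compute-optimal size: {10 ** log_n_opt:.3g} parameters")

    # The thread's point: with few, noisy, unevenly spaced points, the least-squares
    # vertex can be systematically biased, and that bias compounds when the fitted
    # optima are extrapolated to frontier-scale budgets.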
Satyapriya Krishna retweeted
Cursor @cursor_ai
We go into detail about the infrastructure behind large scale training including the kernels we developed and open-sourced for the project. We also discuss distributed training and environment scaling for RL.
3 replies · 4 reposts · 183 likes · 43.9K views
Satyapriya Krishna retweeted
MIT CSAIL @MIT_CSAIL
A take from 1963 that aged pretty well.
20 replies · 177 reposts · 1K likes · 41K views
Satyapriya Krishna retweeted
rohan anil @_arohan_
Really neat paper! Find a linear map from the encoder output to its chosen codebook vector using a scale and two Householder reflections, then backprop through that map instead of the straight-through identity. Great work!
5 replies · 42 reposts · 415 likes · 26.1K views
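One concrete way the map described above could be built; the choice of reflection vectors (u + w, then w) and the scale ||z_q|| / ||z_e|| are my own filling-in rather than the paper's exact construction, but the resulting map does send the encoder output exactly to its codebook vector and yields a non-identity backward path.

    import torch

    def householder(v, eps=1e-12):
        # H = I - 2 v v^T / ||v||^2: reflection across the hyperplane orthogonal to v.
        v = v / (v.norm() + eps)
        return torch.eye(v.numel(), dtype=v.dtype, device=v.device) - 2.0 * torch.outer(v, v)

    def rotate_scale_map(z_e, z_q, eps=1e-12):
        # With u = z_e/||z_e|| and w = z_q/||z_q||, the reflection with v = u + w
        # sends u to -w, and the reflection with v = w sends -w to w, so
        # s * H_w @ H_{u+w} maps z_e exactly onto z_q, where s = ||z_q|| / ||z_e||.
        u = z_e / (z_e.norm() + eps)
        w = z_q / (z_q.norm() + eps)
        s = z_q.norm() / (z_e.norm() + eps)
        return s * (householder(w) @ householder(u + w))

    def vq_with_linear_backward(z_e, codebook):
        # Forward output equals the nearest codebook vector, but gradients flow
        # through the fixed linear map L instead of the straight-through identity.
        idx = torch.cdist(z_e.unsqueeze(0), codebook).argmin()
        z_q = codebook[idx]
        L = rotate_scale_map(z_e.detach(), z_q.detach())   # map treated as a constant
        return L @ z_e

    # quick check: forward matches the codebook vector, backward uses L^T, not I
    z_e = torch.randn(8, requires_grad=True)
    codebook = torch.randn(16, 8)
    out = vq_with_linear_backward(z_e, codebook)
    out.sum().backward()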
Satyapriya Krishna retweeted
Kimi.ai @Kimi_Moonshot
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…
334 replies · 2.1K reposts · 13.5K likes · 4.9M views
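A toy reading of the depth-wise attention idea above, with made-up layer internals and without the Block AttnRes compression or any Kimi Linear specifics; it only illustrates replacing x_{l+1} = x_l + f(x_l) with a learned, input-dependent mix over all preceding layers.

    import torch
    import torch.nn as nn

    class AttnResidualStack(nn.Module):
        def __init__(self, d_model, n_layers, d_key=64):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_layers))
            self.q_projs = nn.ModuleList(nn.Linear(d_model, d_key) for _ in range(n_layers))
            self.k_proj = nn.Linear(d_model, d_key)    # shared key projection over depth

        def forward(self, x):                          # x: (batch, seq, d_model)
            history = [x]                              # outputs of all preceding layers
            for layer, q_proj in zip(self.layers, self.q_projs):
                H = torch.stack(history, dim=2)        # (batch, seq, depth, d_model)
                q = q_proj(history[-1]).unsqueeze(2)   # query from the current state
                k = self.k_proj(H)                     # keys from every past layer
                w = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)
                agg = (w.unsqueeze(-1) * H).sum(dim=2) # input-dependent mix over depth
                history.append(agg + layer(agg))       # "attention residual" update
            return history[-1]

    # quick shape check
    out = AttnResidualStack(d_model=32, n_layers=4)(torch.randn(2, 5, 32))

Cost grows with depth here because every layer re-attends over the whole history; per the tweet, the Block AttnRes variant compresses past layers into blocks to keep that practical at scale.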
Satyapriya Krishna retweeted
fly51fly @fly51fly
[CL] Replaying pre-training data improves fine-tuning S Kotha, P Liang [Stanford University] (2026) arxiv.org/abs/2603.04964
1 reply · 5 reposts · 30 likes · 2.2K views
Satyapriya Krishna retweeted
Davis Blalock @davisblalock
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc., that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
30 replies · 227 reposts · 1.6K likes · 213.6K views
Satyapriya Krishna retweeted
fly51fly @fly51fly
[CL] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning J Zhang, Z Yu, L Wang, N Yang… [Microsoft Research Asia & Peking University] (2026) arxiv.org/abs/2603.01639
0 replies · 6 reposts · 30 likes · 1.8K views
Satyapriya Krishna retweeted
Aryan Pandey @AryanPa66861306
Today I read a paper: Ring Attention with Blockwise Transformers for Near-Infinite Context arxiv.org/pdf/2310.01889
2 replies · 19 reposts · 117 likes · 8.4K views
Satyapriya Krishna retweeted
fly51fly @fly51fly
[CL] dLLM: Simple Diffusion Language Modeling Z Zhou, L Chen, H Tong, D Song [UC Berkeley & UIUC] (2026) arxiv.org/abs/2602.22661
0 replies · 15 reposts · 96 likes · 5.8K views