Swayam Singh
3.5K posts

Swayam Singh @swayaminsync
"I saw a dream, and so these chains of events began" ✨ | @MSFTResearch | OSS Maintainer
BLR · Joined April 2021
1.6K Following · 1.7K Followers

Pinned Tweet
Swayam Singh @swayaminsync
The strong version of you is dealing with all the inner demons silently, keeping all the chaos contained within you, hidden from the outside world. It'll get exhausting sometimes, and I am proud of you. Don't give up.
19 likes · 13.8K views
Swayam Singh @swayaminsync
Replying to @leloykun @dakovalev1
Interesting results, appreciate the good work. I've got to read more about this and the tagged paper.
123 views
leloy! @leloykun
Exciting results!! I was also working in this direction a few months back but only managed to reach the step just before this one. I got similar results to @dakovalev1's, but without needing to go back and forth between dual norms and the Frobenius norm, which makes the coefficients loose. On top of that, I also added handling of Nesterov momentum and decoupled weight decay. @orvieto_antonio I think the next natural step is to use last-iterate bounds instead of expected bounds so we can also handle variable (decaying) learning rates.

Quoting Antonio Orvieto @orvieto_antonio:
Optimization theory for adaptive methods actually predicts most of what we know about hyperparameter scaling in LLM pretraining, and suggests new strategies as well. We did a deep dive here.

4 replies · 11 reposts · 102 likes · 7.9K views
Swayam Singh @swayaminsync
Post-2023 versions of the GitHub CLI (gh) include a switch command to easily toggle between active accounts. No more logging in and dealing with 2FA every single time 🙏 Honestly, the older workarounds were so tedious. (img is Grok-generated, cool)
10 likes · 325 views
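The flow described above, as a short terminal sketch (assumes GitHub CLI ≥ 2.40, where `gh auth switch` landed; the account name is a hypothetical placeholder):

```shell
# Log in each account once (2FA happens only at this step)
$ gh auth login --hostname github.com

# Toggle the active account without re-authenticating
$ gh auth switch --user work-account

# Confirm which account is currently active
$ gh auth status
```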
Swayam Singh @swayaminsync
Follow these and the model trainers won't curse your name + feel free to add more
2 likes · 63 views
Swayam Singh @swayaminsync
Developing Benchmarks: A First-Time Parent's Guide
1️⃣ Think through and log panics for everything that can go wrong during a run (out-of-context, invalid parsing, no response, etc.)
2️⃣ If possible, make the setup able to run concurrently with multiple threads/processes
3️⃣ Implement checkpointing to resume an interrupted run
4️⃣ Pin every dependency version, model checkpoint hash, and random seed
5️⃣ Log token counts (input/output) per sample
6️⃣ Log all events to a file (every single one)
7️⃣ Define a retry policy with exponential backoff for transient failures
2 replies · 6 likes · 156 views
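Steps 3, 5, and 7 above (checkpointing, per-sample token logging, retries with exponential backoff) can be sketched in a few lines. This is a minimal illustration, not any real harness; `flaky_model`, the checkpoint file name, and the sample format are hypothetical stand-ins:

```python
import json
import tempfile
import time
from pathlib import Path

def with_retries(fn, max_attempts=4, base_delay=0.01):
    """Retry a transiently failing call with exponential backoff (step 7)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 0.01s, 0.02s, 0.04s, ...

def run_benchmark(samples, model, checkpoint):
    """Evaluate samples, resuming from a checkpoint and logging tokens (steps 3, 5)."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for sample_id, prompt in samples:
        if sample_id in done:
            continue  # already finished in a previous run
        reply = with_retries(lambda: model(prompt))
        done[sample_id] = {"reply": reply, "tokens_in": len(prompt.split())}
        checkpoint.write_text(json.dumps(done))  # persist after every sample
    return done

# Demo: a model stub that fails once with a transient error, then succeeds.
attempts = {"n": 0}
def flaky_model(prompt):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient API error")
    return prompt.upper()

ckpt = Path(tempfile.mkdtemp()) / "progress.json"
results = run_benchmark([("s1", "hello world"), ("s2", "ok")], flaky_model, ckpt)
```

Because the checkpoint is written after every sample, re-running `run_benchmark` with the same file skips completed samples and only evaluates the remainder.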
Swayam Singh @swayaminsync
Okay, this is cool: if you go to the official page of the OSTEP book, at the end you'll see the author has listed some nice recommendations of non-tech fiction & non-fiction books:
11 likes · 541 views
Swayam Singh @swayaminsync
Re-visiting Arcane!
3 likes · 78 views
Swayam Singh @swayaminsync
Replying to @eliebakouch
> The pretrain checkpoint is trained with a stable nvfp4 quantized recipe
4 likes · 1.3K views
elie @eliebakouch
Cursor pretrain from scratch: a 1T-total / 20B-active MoE open-weight model, hybrid with sparse attention (DSA) and linear attention layers (GDN) at a 5:1 ratio, optimized for 2M context length, served on Blackwell B300, and SOTA on Cursor Bench. (I have 0 insider info, this is totally random, but let's try to manifest it.)

Quoting Jon Kaplan @aye_aye_kaplan:
big things are coming

14 replies · 9 reposts · 379 likes · 58.1K views
Swayam Singh reposted
Albert Gu @_albertgu
The newest model in the Mamba series is finally here 🐍 Hybrid models have become increasingly popular, raising the importance of designing the next generation of linear models. We've introduced several SSM-centric ideas to significantly increase Mamba-2's modeling capabilities without compromising on speed. The resulting Mamba-3 model shows noticeable performance gains over the most popular previous linear models (such as Mamba-2 and Gated DeltaNet) at all sizes. This is the first Mamba that was student-led: all credit to @aakash_lahoti @kevinyli_ @_berlinchen @caitWW9, and of course @tri_dao!
36 replies · 315 reposts · 1.6K likes · 420K views
Swayam Singh reposted
Ali Behrouz @behrouz_ali
This paper is the same as the DeepCrossAttention (DCA) method from more than a year ago: arxiv.org/abs/2502.06785. As far as I understand, there is no innovation here to be excited about, and yet, surprisingly, there is no citation of or discussion about DCA! The level of redundancy in LLM research, and then the hype on X, is getting worse and worse! DeepCrossAttention is built on the intuition that depth-wise cross-attention allows richer interactions between layers at different depths. DCA further provides both empirical and theoretical results to support this approach.

Quoting Kimi.ai @Kimi_Moonshot:
Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: github.com/MoonshotAI/Att…

33 replies · 89 reposts · 1K likes · 220.6K views
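The core idea both threads describe, learned input-dependent attention over preceding layers' outputs instead of a fixed uniform residual sum, can be sketched in a few lines of NumPy. This is a toy illustration, not Kimi's or DCA's actual implementation; `w_query` and all shapes are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention_residual(layer_outputs, w_query):
    """Aggregate preceding layers with learned, input-dependent weights
    rather than the fixed, uniform accumulation of a plain residual stream."""
    H = np.stack(layer_outputs)            # (L, d): one hidden state per layer
    q = H[-1] @ w_query                    # query derived from the current layer
    scores = H @ q / np.sqrt(H.shape[-1])  # (L,): similarity to each past layer
    weights = softmax(scores)              # input-dependent mixing weights
    return weights @ H                     # weighted combination, shape (d,)

rng = np.random.default_rng(0)
d = 8
layers = [rng.standard_normal(d) for _ in range(4)]  # 4 stacked layer outputs
w_q = rng.standard_normal((d, d))
mixed = attention_residual(layers, w_q)
```

A standard residual stream corresponds to `weights` being fixed and uniform; here the weights depend on the current hidden state, so the network can selectively retrieve earlier representations.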
Swayam Singh @swayaminsync
ptxNinja is such cool work. It extends Binary Ninja via a plugin to support parsing and analyzing PTX. A Rust-based PEG parser gives you CFGs and kernel navigation for .ptx files. It still has some limitations (label mismatches, no global-scope support, no demangling), but it's good foundational work.

Quoting Nicolò Altamura @nicolodev:
The recording of my talk "Challenges in Decompilation and Reverse Engineering of CUDA-based Kernels" at @REverseConf is now online!
Recording: youtube.com/watch?v=ns5jFu…
Slides: nicolo.dev/files/pdf/reve…
Binary Ninja plugin: github.com/seekbytes/ptxN…

4 likes · 306 views
Swayam Singh reposted
Kimi.ai @Kimi_Moonshot
[Attention Residuals announcement, quoted in full above]
330 replies · 2.1K reposts · 13.5K likes · 4.9M views
Swayam Singh reposted
EDITH @Infopulsed
Yeah, so I do have thoughts to add. Your overview of Gluon is definitely spot-on (no doubt about that :) ). The upside is the performance ceiling: you can call architecture-specific intrinsics like AMD's buffer_loads and schedule your own software pipelining, things the standard Triton stack either hides or can't legally emit. The downside, as you note, is portability. Once you pin your kernel to MI300X or Hopper features, it stops being generic and you inherit the compiler's job of layout inference and optimization.

What I'd really want to add is that this trend isn't happening in a vacuum. Academic work on warp specialization (e.g. the Tawa compiler) shows that modern GPUs need asynchronous producer/consumer warps to fully saturate tensor-core pipelines, but also that manually orchestrating them is extremely hard; Tawa ends up auto-partitioning high-level kernels into warp roles to relieve developers of that tedium. This confirms why exposing warp specialization in Gluon matters, but it also hints that such explicit programming might not scale as hardware evolves. Similarly, Mojo and Helion push a different philosophy: they still ask developers to understand the hardware, but they try to give a single language/runtime that abstracts away vendor differences, letting you write kernels once and retarget them.

There's also growing pushback against proliferating tile-based DSLs. A widely shared critique argues that splitting the GPU world into CUDA, Vulkan, Mojo, Triton, Gluon, Tilus, etc. forces developers to choose between portability, expressiveness, and hardware access, and that you can't have all three. On this view, the right approach is to give the compiler intent, not instructions, e.g. declare error tolerances or performance/accuracy trade-offs and let smarter compilers decide tile sizes, fusion strategy, and scheduling. That's the opposite of Gluon's "you're in charge" philosophy.

It's telling that even within the Triton ecosystem there are efforts like the CUDA Tile IR backend and Tawa to automate exactly the optimizations Gluon asks you to do yourself. So I think your conclusion that domain-specific languages will be a good companion for agentic development is fair: they provide a clear contract and make code generated by LLMs verifiable. But it might be worth tempering enthusiasm with the real possibility that smarter compilers, or higher-level DSLs like Mojo/Helion, will narrow the gap. In short, explicit control is powerful, but it's not a silver bullet; the ecosystem is still very fluid, and we'll likely see a spectrum of tools, from fully automated to fully explicit, coexisting for some time. :)

Quoting Lei Zhang @LeiLMx:
I published a new post in my Triton series about Gluon, a new Python frontend that exposes more compiler internals so developers can have explicit control over performance. I also share some thoughts in the context of rapidly evolving agentic software development: portability vs performance, general vs domain-specific compilers, and why DSLs may become an important companion. 🔗 lei.chat/posts/gluon-ex…

1 reply · 2 reposts · 6 likes · 790 views
Pushkar Dongare @pushkar_dongare
Looking for a male flatmate for a spacious 3BHK in HSR.
- ₹22k including maintenance (fully furnished)
- Gated society
- Attached washroom
- 2 cute cats
- Gas pipeline (no stress of cylinder shortages)
DM for more details. @BangaloreRoomi do your thing!
7 replies · 1 repost · 16 likes · 3.6K views
Swayam Singh @swayaminsync
Replying to @art_zucker
That's nice, I guess the TRL guides will also soon get updated with this.
2 likes · 187 views
Arthur Zucker @art_zucker
If you don't realize what that means: for easy dev/eval, but mostly for GRPO, this is kind of a game changer! No weight synchronization. No accuracy drop. You just use the exact same codepath for training and generating. Once the model is trained, you put it in prod in vLLM / SGLang, same code.

Quoting Rémi Ouazan @remi_or_:
The inference stack just got simpler. PagedAttention, the kernel that made vLLM fast, now ships natively in 🤗 Transformers CB. Result: 84% of vLLM throughput on a single GPU. Near SOTA with no extra runtime. The gap is closing 📈

9 replies · 10 reposts · 149 likes · 18K views
Swayam Singh @swayaminsync
Still holding out for "The Winds of Winter" and "A Dream of Spring". No way I'm spoiling it with that second-hand show adaptation! [Maybe I'll give up someday if there's nothing else to watch while having food] xD
3 likes · 187 views
Swayam Singh reposted
Quansight @quansightai
From astrophysics simulations to a commit bit on NumPy and PyO3. Nathan Goldbaum's career is proof that unusual paths lead to outsized impact. He just sat down with Lobsters to talk free-threading, burnout, Rust, and what's next for Python. 👉 buff.ly/W1Dl70t
1 repost · 2 likes · 317 views