Swayam Singh
@swayaminsync
I saw a dream, and so these chains of events began ✨ | @MSFTResearch | OSS Maintainer

Optimization theory for adaptive methods actually predicts most of what we know about hyperparameter scaling in LLM pretraining, and suggests new strategies as well. We did a deep dive here.
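For a concrete feel of the kind of prediction such theory makes, here is a minimal sketch of one commonly discussed rule: for Adam-style optimizers the peak learning rate scales roughly with the square root of the batch size, up to a critical batch size. The exponent and the critical batch size below are assumptions for illustration, not values taken from our deep dive.

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int,
             critical_batch: int = 4096) -> float:
    """Transfer a tuned learning rate to a new batch size under an assumed
    square-root scaling rule; gains flatten past the critical batch size."""
    effective = min(new_batch, critical_batch)
    return base_lr * (effective / base_batch) ** 0.5

# Example: a learning rate tuned at batch 256, reused at batch 2048.
print(scale_lr(3e-4, base_batch=256, new_batch=2048))  # ~8.5e-4
```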


Thanks to @github and @microsoft @azure for their continued sponsorship. We can now natively compile packages on macOS Arm64 and Linux Aarch64 machines. Thank you!

big things are coming


Genius finding, I must say!!



CppCon: Matrix Multiplication Deep Dive || Cache Blocking, SIMD & Parallelization by Aliaksei Sala
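The title already names the core tricks; as a rough illustration of the cache-blocking one (in Python/NumPy rather than the talk's C++, and with an arbitrary block size), the idea is to multiply tile by tile so each tile's working set stays cache-resident:

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, block: int = 64) -> np.ndarray:
    """Tile-by-tile matrix multiply; each small tile product reuses data that
    fits in cache, and the tile product itself is left to vectorized NumPy."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)
```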

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
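As a reading aid only, here is a hypothetical sketch of the general idea as described above: replace the plain residual sum with learned, input-dependent attention over the hidden states of preceding layers. Names, shapes, and the scoring function are my assumptions; the actual formulation (including Block AttnRes) is in the linked report.

```python
import torch
import torch.nn as nn

class AttnResidualSketch(nn.Module):
    """Hypothetical: aggregate preceding layers' outputs with input-dependent
    attention weights instead of a fixed residual sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, current: torch.Tensor, history: list[torch.Tensor]) -> torch.Tensor:
        past = torch.stack(history, dim=2)                # (batch, seq, depth, dim)
        q = self.query(current).unsqueeze(2)              # (batch, seq, 1, dim)
        k = self.key(past)                                # (batch, seq, depth, dim)
        scores = (q * k).sum(-1) / past.shape[-1] ** 0.5  # (batch, seq, depth)
        w = torch.softmax(scores, dim=-1)
        aggregated = (w.unsqueeze(-1) * past).sum(dim=2)  # (batch, seq, dim)
        return current + aggregated
```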

The recording of my talk "Challenges in Decompilation and Reverse Engineering of CUDA-based Kernels" at @REverseConf is now online!
Recording: youtube.com/watch?v=ns5jFu…
Slides: nicolo.dev/files/pdf/reve…
Binary Ninja plugin: github.com/seekbytes/ptxN…



I published a new post in my Triton series about Gluon — a new Python frontend that exposes more compiler internals so developers can have explicit control over performance. I also share some thoughts in the context of rapidly evolving agentic software development: portability vs performance, general vs domain-specific compilers, and why DSLs may become an important companion. 🔗 lei.chat/posts/gluon-ex…
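For readers new to the series, this is what the standard Triton frontend looks like as a baseline (a plain vector-add kernel, shown only for contrast). In this style the compiler chooses memory layouts and scheduling on its own; surfacing those decisions to the developer is what the Gluon post is about.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Standard Triton: the kernel describes block-level math; the compiler
    # handles layouts, vectorization, and scheduling behind the scenes.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```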

The inference stack just got simpler. PagedAttention, the kernel that made vLLM fast, now ships natively in 🤗 Transformers continuous batching (CB). Result: 84% of vLLM throughput on a single GPU. Near SOTA with no extra runtime. The gap is closing 📈
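For anyone unfamiliar with the kernel itself, here is a conceptual sketch of what PagedAttention does (not the Transformers or vLLM API; the block size and class below are made up for illustration): the KV cache lives in fixed-size physical blocks, and each sequence keeps a block table mapping logical token positions to those blocks, so memory is allocated on demand instead of being reserved for the maximum sequence length.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value)

class BlockTable:
    """Hypothetical sketch of a per-sequence block table."""

    def __init__(self, free_blocks: list[int]):
        self.free_blocks = free_blocks   # pool of physical KV block ids
        self.blocks: list[int] = []      # logical block index -> physical block id

    def slot_for(self, token_pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) for a token position, allocating lazily."""
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        while len(self.blocks) <= logical_block:
            self.blocks.append(self.free_blocks.pop())
        return self.blocks[logical_block], offset

table = BlockTable(free_blocks=list(range(1024)))
print(table.slot_for(0))   # allocates the first physical block
print(table.slot_for(37))  # third logical block, offset 5
```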
