mikail
@Gradientdinner
2K posts

Research Scientist @nvidia 🌁 | PhD @MIT

San Francisco, CA · Joined January 2019
2.1K Following · 2.6K Followers
mikail retweeted
Benjamin Marie @bnjmn_marie
We now have four MoE models of roughly similar size, but based on very different architectural choices:
> Qwen3 30B: full attention with simple GQA
> Qwen3.5 35B: full attention + Gated DeltaNet
> GLM 4.7 Flash: MLA
> Nemotron 3 Nano: full attention + Mamba
These come with very different KV-cache footprints. Nemotron is by far the most memory-efficient of the four. Although GLM 4.7 Flash uses MLA, its rank is fairly high, higher than Mistral Small 4's for instance, so it delivers only limited memory savings. To make these comparisons easier, I wrote an article and a notebook that estimate the memory consumption of all these models for any given sequence length: kaitchup.substack.com/p/the-kv-cache…
[image attached]
5 replies · 13 reposts · 169 likes · 11.1K views
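The estimate described in the linked article boils down to a product over the layers that actually keep keys and values around. A minimal sketch of that arithmetic for a plain GQA/MHA transformer (the function name and the example config below are mine and purely illustrative; this is not the notebook from the post):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Rough KV-cache size for a standard attention stack.

    The factor of 2 covers keys *and* values; bytes_per_elem=2 assumes
    fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config (made-up numbers, not any of the four models above):
gib = kv_cache_bytes(num_layers=48, num_kv_heads=4, head_dim=128,
                     seq_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache at a 128k context")
```

What the architectures above change is the per-layer term: hybrid Mamba/linear-attention layers keep a fixed-size state instead of a per-token cache, and MLA stores a low-rank latent per token rather than full keys and values, which is why the rank GLM 4.7 Flash uses matters so much.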
mikail retweeted
Jared Rosner @jaredrosnerd
The hottest summer I ever spent was a winter in San Francisco
[image attached]
52 replies · 232 reposts · 3.3K likes · 94.6K views
mikail retweeted
Hadi Vafaii @hadivafaii
The blueprint for this "grand unification" already exists:
🔹 1961: Landauer established the thermodynamic cost of bit erasure (ieeexplore.ieee.org/document/53924…)
🔹 1982: Bennett resolved Maxwell's Demon, proving that logical irreversibility (erasure), rather than measurement, necessitates thermodynamic work (link.springer.com/article/10.100…)
🔹 Late 1990s: Jarzynski and @gavincrooks related non-equilibrium work to equilibrium free energy. Their fluctuation theorems were later used to derive the entropy production cost of statistical inference (journals.aps.org/prl/abstract/1…; journals.aps.org/pre/abstract/1…)
🔹 2007: Kawai et al. proved that average thermodynamic entropy production equals the KL divergence between forward and backward path distributions (journals.aps.org/prl/abstract/1…)
🔹 2009: Sagawa & Ueda incorporated information theory into this framework, bounding the energetic cost of measurement and feedback (journals.aps.org/prl/abstract/1…)
🔹 2019: Wolpert extended stochastic thermodynamics to formal computational architectures, including Turing machines and Boolean circuits (iopscience.iop.org/article/10.108…)
🔹 ...and many more works in thermodynamics/non-equilibrium statistical mechanics that are waiting to be formally connected to computer science/ML concepts.
Formalizing this mathematical connection is a central research interest of mine. Reach out if you have ideas (DMs open).
David Pfau @pfau

We need a grand unification between physics and computer science to understand the relationship between energy and information. Always nice to see work that brings them together.

21 replies · 52 reposts · 475 likes · 49.5K views
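Two of the quantitative anchors in that list can be written down compactly. A sketch in standard notation (my own rendering, not taken from the linked papers):

```latex
% Landauer (1961): minimum average heat dissipated when erasing one bit
% of information in an environment at temperature T
Q_{\mathrm{erase}} \;\ge\; k_B T \ln 2

% Kawai et al. (2007): average dissipated work equals the KL divergence
% between forward and time-reversed path distributions
\langle W_{\mathrm{diss}} \rangle \;=\; k_B T \, D_{\mathrm{KL}}\!\left(P_{F} \,\|\, P_{B}\right)
```

The second relation is the kind of statement that maps most directly onto ML objects, since KL divergences between forward and reverse processes also appear in variational inference and diffusion models.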
mikail retweeted
Yifan Zhang @yifan_zhang_
Last year I thought about two ideas for improving residual streams: one is an RNN view, the other an attention view. The first we called Deep Delta Learning (DDL, arxiv.org/abs/2601.00417), and the second we called Transformer^2. Concurrently, Kimi released Attention Residuals. Worth pointing out: DDL is friendlier to pipeline parallelism for large-scale models, whereas Transformer^2/Attention Residuals need to transfer a lot of previous layers' hidden states. @elonmusk DDL is definitely worth reading!
[2 images attached]
Yifan Zhang @yifan_zhang_

Something REALLY HUGE. github.com/yifanzhang-pro…

6 replies · 30 reposts · 226 likes · 33.9K views
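A rough way to see the pipeline-parallelism point from the tweet: a plain (or delta-style) residual stream hands exactly one activation tensor to the next stage, while an attention-over-previous-layers residual has to keep every earlier layer's output alive and ship it across stage boundaries. A toy sketch of that communication difference (illustrative only; this is not the DDL or Transformer^2 code, and the uniform mean stands in for whatever learned mixing those methods use):

```python
import torch
import torch.nn as nn

d, depth, seq, batch = 64, 6, 16, 2
layers = nn.ModuleList([nn.Linear(d, d) for _ in range(depth)])
x = torch.randn(batch, seq, d)

# Plain residual stream: only the running activation crosses a stage boundary.
h = x
for layer in layers:
    h = h + torch.relu(layer(h))

# Attention-over-layers style: every previous layer's output must stay
# available, so a pipeline stage would have to receive the whole history.
history = [x]
for layer in layers:
    out = torch.relu(layer(history[-1]))
    stacked = torch.stack(history + [out])   # [num_kept, batch, seq, d]
    history.append(stacked.mean(dim=0))      # toy stand-in for a learned mix
print(f"{len(history)} tensors kept alive vs. 1 in the plain stream")
```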
mikail @Gradientdinner
@ZimingLiu11 This is blowing my mind. Do you have an intuition? Does a longer sequence dilute the signal, making the gradient, and hence any gradient-based optimizer, worse?
0 replies · 1 repost · 1 like · 394 views
Ziming Liu @ZimingLiu11
When do "Neural thickets" / RandOpt work? In today's blog, I show that sequence length is a key parameter -- RandOpt works better for longer sequences, while gradient-based methods work better for shorter sequences. kindxiaoming.github.io/blog/2026/rand…
[image attached]
5 replies · 27 reposts · 249 likes · 21.6K views
mikail retweeted
Cheng Lou @_chenglou
Now comes the unusual bit: after enough hyperparameter and other config sweeps, I've reached the surprising conclusion that no form of regular curriculum training beats _reverse_ curriculum training on this specific task! I.e. instead of starting with easy sudoku puzzles and gradually ramping up toward harder ones, the training runs that start with harder puzzles and end with easy ones always fare better, and also fare better than mixed-difficulty sampling. It's possible to justify this in hindsight; a friend mentioned, "I naively thought I'd get better at chess by starting with Blitz (easy, fast) instead of regular chess (harder, slower), but it's the other way around." That seemed philosophically interesting. Experts: please check the training runs in case there's been a mistake; but so far the numbers do seem to speak for themselves.
4 replies · 6 reposts · 101 likes · 6.7K views
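Concretely, the three regimes being swept differ only in how the training data is ordered by difficulty. A minimal sketch of what "reverse curriculum" means here (my own illustrative sampler, not the actual training code):

```python
import random

def order_by_schedule(examples, difficulty, schedule):
    """Order training examples for one pass.

    'curriculum': easy -> hard
    'reverse'   : hard -> easy (the variant that won in these sweeps)
    'mixed'     : difficulty shuffled uniformly
    """
    if schedule == "curriculum":
        return sorted(examples, key=difficulty)
    if schedule == "reverse":
        return sorted(examples, key=difficulty, reverse=True)
    if schedule == "mixed":
        return random.sample(examples, len(examples))
    raise ValueError(f"unknown schedule: {schedule}")

# For sudoku, difficulty could simply be the number of blank cells.
puzzles = [{"blanks": b} for b in (20, 35, 50, 60)]
train_order = order_by_schedule(puzzles, lambda p: p["blanks"], "reverse")
```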
mikail retweeted
Grigory Sapunov @che_shr_cat
1/ Dense MLPs are a lie. The standard transformers we train are already doing sparse routing inside their feedforward layers—we just couldn't see it until now. 🧵
[image attached]
11 replies · 30 reposts · 326 likes · 46.5K views
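The claim is straightforward to probe on any open checkpoint: feed tokens through a trained FFN and count how many hidden units actually fire per token. A rough sketch of that measurement (the Linear below is a random stand-in you would replace with a real model's FFN input projection, and the threshold is an arbitrary choice of mine):

```python
import torch
import torch.nn as nn

d_model, d_ff, n_tokens = 512, 2048, 1000
ffn_in = nn.Linear(d_model, d_ff)        # stand-in for a trained FFN's first projection
x = torch.randn(n_tokens, d_model)       # stand-in for token activations

h = torch.relu(ffn_in(x))                        # post-activation hidden units
active_frac = (h > 1e-3).float().mean(dim=1)     # fraction of units "on" per token
print(f"mean fraction of active hidden units: {active_frac.mean():.1%}")
```

On a random init this fraction sits near 50%; the thread's claim is about how far it drops in trained models, which is what makes the routing effectively sparse.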
mikail retweeted
Paras Chopra @paraschopra
My dear young person, Don't succumb to mediocrity. There's enough of it going around. Aspire for craftsmanship, as that is what leads to joy and beauty. The world needs more people who're proud of what they make, and fewer of those who couldn't care less.
62 replies · 468 reposts · 4.2K likes · 90K views
mikail retweeted
Ji-Ha @Ji_Ha_Kim
Does anyone know ballpark numbers for typical condition numbers of gradient matrices during training?
2 replies · 2 reposts · 45 likes · 7.6K views
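One way to get your own ballpark rather than waiting for a quoted one: after a backward pass, take each 2-D gradient matrix and compute the ratio of its largest to smallest singular value. A minimal sketch on a hypothetical toy model (note that with a batch smaller than the matrix dimensions the gradient is rank-deficient and the raw ratio blows up, which is itself part of the answer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(512, 128), torch.randint(0, 10, (512,))
F.cross_entropy(model(x), y).backward()

for name, p in model.named_parameters():
    if p.grad is not None and p.grad.ndim == 2:
        s = torch.linalg.svdvals(p.grad)
        print(f"{name}: condition number ~ {(s.max() / s.min()).item():.3e}")
```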
mikail retweeted
Adarsh Kumarappan @adarshk123321
What if we could mathematically predict how a neural network evolves during training? We developed the first mathematical framework that explains why trained networks develop the distinctive "bulk+tail" weight structure that predicts generalization, validated across transformers, vision transformers, and MLPs. 📄 arxiv.org/abs/2507.12709 1/7
[image attached]
12 replies · 54 reposts · 511 likes · 30.6K views
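The "bulk+tail" shape in question is a property of each layer's empirical spectral density: most eigenvalues of W^T W sit in a random-matrix-like bulk, with a heavy tail of large outliers emerging over training. A quick way to look at that on any checkpoint (a sketch of the standard diagnostic, not the paper's framework; the random matrix here only shows the bulk you would see at initialization):

```python
import torch
import matplotlib.pyplot as plt

def esd(weight: torch.Tensor):
    """Empirical spectral density: eigenvalues of W^T W (squared singular values)."""
    s = torch.linalg.svdvals(weight.detach().float())
    return (s ** 2).cpu().numpy()

# Replace with a real layer, e.g. a trained transformer's FFN weight matrix.
W = torch.randn(2048, 512) / 512**0.5   # random init: bulk only, no heavy tail
plt.hist(esd(W), bins=100, log=True)
plt.xlabel("eigenvalue of $W^T W$")
plt.ylabel("count (log scale)")
plt.show()
```

On trained networks, the same histogram develops the heavy tail of large eigenvalues that the paper ties to generalization.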
mikail retweeted
Thomas Pethick @tmpethick
My thesis is now accessible online! I've tried to make it the introduction to non-Euclidean methods and monotone operators that I wish I had when starting out. 1/n infoscience.epfl.ch/entities/publi…
3 replies · 11 reposts · 62 likes · 11.2K views
mikail retweeted
Andrew Curran @AndrewCurran_
Striking image from the new Anthropic labor market impact report.
[image attached]
561 replies · 2.3K reposts · 13.5K likes · 7.2M views
mikail retweeted
Viraj Doshi @viraj9451
Muon can accelerate LLM training, but does that benefit transfer to regulatory DNA sequence modeling with its different data distribution? 🧬 Our results show that Muon with independent weight decay (MuonW) hits our validation perplexity target in ~37% fewer FLOPs than the best Adam configuration.
[4 images attached]
5 replies · 22 reposts · 122 likes · 10.4K views
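For context on the optimizer being compared: Muon keeps a momentum buffer per 2-D weight matrix and orthogonalizes that momentum before applying it, and I am reading "independent weight decay" as decoupled, AdamW-style decay applied directly to the weights; that reading is my assumption. A bare-bones sketch using an SVD for the orthogonalization step for readability (real Muon implementations use Newton-Schulz iterations and extra scaling; all hyperparameters here are illustrative):

```python
import torch

@torch.no_grad()
def muonw_step(weight, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.01):
    """One illustrative MuonW-style step for a single 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                 # momentum accumulation
    u, _, vT = torch.linalg.svd(momentum_buf, full_matrices=False)
    update = u @ vT                                    # orthogonalized update direction
    weight.mul_(1 - lr * weight_decay)                 # decoupled ("independent") weight decay
    weight.add_(update, alpha=-lr)

# Usage on a toy weight:
W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muonw_step(W, torch.randn_like(W), buf)
```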
mikail @Gradientdinner
@SynBio1 A full 4-year turnaround time
0 replies · 0 reposts · 0 likes · 56 views
Jake Wintermute 🧬/acc
It took Nature 13 months to publish Evo 2! 13 months! For reference: Opus 4.5 ended software engineering as we know it 4 months ago. ClawdBot added 2 million users in a single week in January. Academic publishing is so cooked it's not even funny nature.com/articles/s4158…
24 replies · 69 reposts · 534 likes · 106.7K views
Calc Consulting @CalcCon
New results coming soon showing how SETOL (the theory behind weightwatcher) can be used to merge models
[image attached]
1 reply · 0 reposts · 16 likes · 870 views