mikail
@Gradientdinner
2K posts

Research Scientist @nvidia 🌁 | PhD @MIT

San Francisco, CA · Joined January 2019
2.1K Following · 2.6K Followers
mikail retweeted
Benjamin Marie @bnjmn_marie
We now have four MoE models of roughly similar size, but based on very different architectural choices:
> Qwen3 30B: full attention with simple GQA
> Qwen3.5 35B: full attention + Gated DeltaNet
> GLM 4.7 Flash: MLA
> Nemotron 3 Nano: full attention + Mamba
These come with very different KV-cache footprints. Nemotron is by far the most memory-efficient of the four. Although GLM 4.7 Flash uses MLA, its rank is fairly high, higher than Mistral Small 4's for instance, so it delivers only limited memory savings. To make these comparisons easier, I wrote an article and a notebook that estimate the memory consumption of all these models for any given sequence length: kaitchup.substack.com/p/the-kv-cache…
[image attached]
5 replies · 13 reposts · 169 likes · 11.1K views
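The estimate described in the linked article boils down to a product over the layers that actually keep keys and values around. A minimal sketch of that arithmetic for a plain GQA/MHA transformer (the function name and the example config below are mine and purely illustrative; this is not the notebook from the post):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Rough KV-cache size for a standard attention stack.

    The factor of 2 covers keys *and* values; bytes_per_elem=2 assumes
    fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Illustrative config (made-up numbers, not any of the four models above):
gib = kv_cache_bytes(num_layers=48, num_kv_heads=4, head_dim=128,
                     seq_len=128_000) / 2**30
print(f"~{gib:.1f} GiB of KV cache at a 128k context")
```

What the architectures above change is the per-layer term: hybrid Mamba/linear-attention layers keep a fixed-size state instead of a per-token cache, and MLA stores a low-rank latent per token rather than full keys and values, which is why the rank GLM 4.7 Flash uses matters so much.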
mikail retweeted
Jared Rosner @jaredrosnerd
The hottest summer I ever spent was a winter in San Francisco
[image attached]
52 replies · 232 reposts · 3.3K likes · 94.6K views
mikail retweeted
Hadi Vafaii @hadivafaii
The blueprint for this "grand unification" already exists:
🔹 1961: Landauer established the thermodynamic cost of bit erasure (ieeexplore.ieee.org/document/53924…)
🔹 1982: Bennett resolved Maxwell's Demon, proving that logical irreversibility (erasure), rather than measurement, necessitates thermodynamic work (link.springer.com/article/10.100…)
🔹 Late 1990s: Jarzynski and @gavincrooks related non-equilibrium work to equilibrium free energy. Their fluctuation theorems were later used to derive the entropy production cost of statistical inference (journals.aps.org/prl/abstract/1…; journals.aps.org/pre/abstract/1…)
🔹 2007: Kawai et al. proved that average thermodynamic entropy production equals the KL divergence between forward and backward path distributions (journals.aps.org/prl/abstract/1…)
🔹 2009: Sagawa & Ueda incorporated information theory into this framework, bounding the energetic cost of measurement and feedback (journals.aps.org/prl/abstract/1…)
🔹 2019: Wolpert extended stochastic thermodynamics to formal computational architectures, including Turing machines and Boolean circuits (iopscience.iop.org/article/10.108…)
🔹 ...and many more works in thermodynamics/non-equilibrium statistical mechanics that are waiting to be formally connected to computer science/ML concepts.
Formalizing this mathematical connection is a central research interest of mine. Reach out if you have ideas (DMs open).
David Pfau @pfau

We need a grand unification between physics and computer science to understand the relationship between energy and information. Always nice to see work that brings them together.

21 replies · 52 reposts · 475 likes · 49.5K views
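Two of the quantitative anchors in that list can be written down compactly. A sketch in standard notation (my own rendering, not taken from the linked papers):

```latex
% Landauer (1961): minimum average heat dissipated when erasing one bit
% of information in an environment at temperature T
Q_{\mathrm{erase}} \;\ge\; k_B T \ln 2

% Kawai et al. (2007): average dissipated work equals the KL divergence
% between forward and time-reversed path distributions
\langle W_{\mathrm{diss}} \rangle \;=\; k_B T \, D_{\mathrm{KL}}\!\left(P_{F} \,\|\, P_{B}\right)
```

The second relation is the kind of statement that maps most directly onto ML objects, since KL divergences between forward and reverse processes also appear in variational inference and diffusion models.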
mikail retweeted
Yifan Zhang @yifan_zhang_
Last year I thought about two ideas for improving residual streams: one is an RNN view, the other an attention view. The first we called Deep Delta Learning (DDL, arxiv.org/abs/2601.00417), and the second we called Transformer^2. Concurrently, Kimi released Attention Residuals. Worth pointing out: DDL is friendlier to pipeline parallelism for large-scale models, whereas Transformer^2/Attention Residuals need to transfer a lot of previous layers' hidden states. @elonmusk DDL is definitely worth reading!
[2 images attached]
Yifan Zhang @yifan_zhang_

Something REALLY HUGE. github.com/yifanzhang-pro…

6 replies · 30 reposts · 226 likes · 33.9K views
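A rough way to see the pipeline-parallelism point from the tweet: a plain (or delta-style) residual stream hands exactly one activation tensor to the next stage, while an attention-over-previous-layers residual has to keep every earlier layer's output alive and ship it across stage boundaries. A toy sketch of that communication difference (illustrative only; this is not the DDL or Transformer^2 code, and the uniform mean stands in for whatever learned mixing those methods use):

```python
import torch
import torch.nn as nn

d, depth, seq, batch = 64, 6, 16, 2
layers = nn.ModuleList([nn.Linear(d, d) for _ in range(depth)])
x = torch.randn(batch, seq, d)

# Plain residual stream: only the running activation crosses a stage boundary.
h = x
for layer in layers:
    h = h + torch.relu(layer(h))

# Attention-over-layers style: every previous layer's output must stay
# available, so a pipeline stage would have to receive the whole history.
history = [x]
for layer in layers:
    out = torch.relu(layer(history[-1]))
    stacked = torch.stack(history + [out])   # [num_kept, batch, seq, d]
    history.append(stacked.mean(dim=0))      # toy stand-in for a learned mix
print(f"{len(history)} tensors kept alive vs. 1 in the plain stream")
```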
mikail @Gradientdinner
@ZimingLiu11 This is blowing my mind. Do you have an intuition? Does a longer sequence dilute the signal, making the gradient, and hence any gradient-based optimizer, worse?
0 replies · 1 repost · 1 like · 394 views
Ziming Liu @ZimingLiu11
When do "Neural thickets" / RandOpt work? In today's blog, I show that sequence length is a key parameter -- RandOpt works better for longer sequences, while gradient-based methods work better for shorter sequences. kindxiaoming.github.io/blog/2026/rand…
[image attached]
5 replies · 27 reposts · 249 likes · 21.6K views
mikail retweeted
Cheng Lou @_chenglou
Now comes the unusual bit: after enough hyperparameter and other config sweeps, I've reached the surprising conclusion that no form of regular curriculum training beats _reverse_ curriculum training on this specific task! I.e. instead of starting with easy sudoku puzzles and gradually ramping up toward harder ones, the training runs that start with harder puzzles and end with easy ones always fare better, and also fare better than mixed-difficulty sampling. It's possible to justify this in hindsight; a friend mentioned, "I naively thought I'd get better at chess by starting with Blitz (easy, fast) instead of regular chess (harder, slower), but it's the other way around." That seemed philosophically interesting. Experts: please check the training runs in case there's been a mistake; but so far the numbers do seem to speak for themselves.
4 replies · 6 reposts · 101 likes · 6.7K views
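Concretely, the three regimes being swept differ only in how the training data is ordered by difficulty. A minimal sketch of what "reverse curriculum" means here (my own illustrative sampler, not the actual training code):

```python
import random

def order_by_schedule(examples, difficulty, schedule):
    """Order training examples for one pass.

    'curriculum': easy -> hard
    'reverse'   : hard -> easy (the variant that won in these sweeps)
    'mixed'     : difficulty shuffled uniformly
    """
    if schedule == "curriculum":
        return sorted(examples, key=difficulty)
    if schedule == "reverse":
        return sorted(examples, key=difficulty, reverse=True)
    if schedule == "mixed":
        return random.sample(examples, len(examples))
    raise ValueError(f"unknown schedule: {schedule}")

# For sudoku, difficulty could simply be the number of blank cells.
puzzles = [{"blanks": b} for b in (20, 35, 50, 60)]
train_order = order_by_schedule(puzzles, lambda p: p["blanks"], "reverse")
```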
mikail retweeted
Grigory Sapunov @che_shr_cat
1/ Dense MLPs are a lie. The standard transformers we train are already doing sparse routing inside their feedforward layers—we just couldn't see it until now. 🧵
[image attached]
11 replies · 30 reposts · 326 likes · 46.5K views
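The claim is straightforward to probe on any open checkpoint: feed tokens through a trained FFN and count how many hidden units actually fire per token. A rough sketch of that measurement (the Linear below is a random stand-in you would replace with a real model's FFN input projection, and the threshold is an arbitrary choice of mine):

```python
import torch
import torch.nn as nn

d_model, d_ff, n_tokens = 512, 2048, 1000
ffn_in = nn.Linear(d_model, d_ff)        # stand-in for a trained FFN's first projection
x = torch.randn(n_tokens, d_model)       # stand-in for token activations

h = torch.relu(ffn_in(x))                        # post-activation hidden units
active_frac = (h > 1e-3).float().mean(dim=1)     # fraction of units "on" per token
print(f"mean fraction of active hidden units: {active_frac.mean():.1%}")
```

On a random init this fraction sits near 50%; the thread's claim is about how far it drops in trained models, which is what makes the routing effectively sparse.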
mikail retweeted
Paras Chopra @paraschopra
My dear young person, Don't succumb to mediocrity. There's enough of it going around. Aspire for craftsmanship, as that is what leads to joy and beauty. The world needs more people who're proud of what they make, and fewer of those who couldn't care less.
62 replies · 468 reposts · 4.2K likes · 90K views
mikail retweeted
Ji-Ha @Ji_Ha_Kim
Does anyone know ballpark numbers for typical condition numbers of gradient matrices during training?
2 replies · 2 reposts · 45 likes · 7.6K views
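One way to get your own ballpark rather than waiting for a quoted one: after a backward pass, take each 2-D gradient matrix and compute the ratio of its largest to smallest singular value. A minimal sketch on a hypothetical toy model (note that with a batch smaller than the matrix dimensions the gradient is rank-deficient and the raw ratio blows up, which is itself part of the answer):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
x, y = torch.randn(512, 128), torch.randint(0, 10, (512,))
F.cross_entropy(model(x), y).backward()

for name, p in model.named_parameters():
    if p.grad is not None and p.grad.ndim == 2:
        s = torch.linalg.svdvals(p.grad)
        print(f"{name}: condition number ~ {(s.max() / s.min()).item():.3e}")
```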
mikail retweeted
Adarsh Kumarappan @adarshk123321
What if we could mathematically predict how a neural network evolves during training? We developed the first mathematical framework that explains why trained networks develop the distinctive "bulk+tail" weight structure that predicts generalization, validated across transformers, vision transformers, and MLPs. 📄 arxiv.org/abs/2507.12709 1/7
[image attached]
12 replies · 54 reposts · 511 likes · 30.6K views
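The "bulk+tail" shape in question is a property of each layer's empirical spectral density: most eigenvalues of W^T W sit in a random-matrix-like bulk, with a heavy tail of large outliers emerging over training. A quick way to look at that on any checkpoint (a sketch of the standard diagnostic, not the paper's framework; the random matrix here only shows the bulk you would see at initialization):

```python
import torch
import matplotlib.pyplot as plt

def esd(weight: torch.Tensor):
    """Empirical spectral density: eigenvalues of W^T W (squared singular values)."""
    s = torch.linalg.svdvals(weight.detach().float())
    return (s ** 2).cpu().numpy()

# Replace with a real layer, e.g. a trained transformer's FFN weight matrix.
W = torch.randn(2048, 512) / 512**0.5   # random init: bulk only, no heavy tail
plt.hist(esd(W), bins=100, log=True)
plt.xlabel("eigenvalue of $W^T W$")
plt.ylabel("count (log scale)")
plt.show()
```

On trained networks, the same histogram develops the heavy tail of large eigenvalues that the paper ties to generalization.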
mikail retweeted
Thomas Pethick @tmpethick
My thesis is now accessible online! I've tried to make it the introduction to non-Euclidean methods and monotone operators that I wish I had when starting out. 1/n infoscience.epfl.ch/entities/publi…
3 replies · 11 reposts · 62 likes · 11.2K views
mikail retweeted
Andrew Curran @AndrewCurran_
Striking image from the new Anthropic labor market impact report.
[image attached]
561 replies · 2.3K reposts · 13.5K likes · 7.2M views
mikail retweeted
Viraj Doshi @viraj9451
Muon can accelerate LLM training, but does that benefit transfer to regulatory DNA sequence modeling with its different data distribution? 🧬 Our results show that Muon with independent weight decay (MuonW) hits our validation perplexity target in ~37% fewer FLOPs than the best Adam configuration.
[4 images attached]
5 replies · 22 reposts · 122 likes · 10.4K views
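For context on the optimizer being compared: Muon keeps a momentum buffer per 2-D weight matrix and orthogonalizes that momentum before applying it, and I am reading "independent weight decay" as decoupled, AdamW-style decay applied directly to the weights; that reading is my assumption. A bare-bones sketch using an SVD for the orthogonalization step for readability (real Muon implementations use Newton-Schulz iterations and extra scaling; all hyperparameters here are illustrative):

```python
import torch

@torch.no_grad()
def muonw_step(weight, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.01):
    """One illustrative MuonW-style step for a single 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)                 # momentum accumulation
    u, _, vT = torch.linalg.svd(momentum_buf, full_matrices=False)
    update = u @ vT                                    # orthogonalized update direction
    weight.mul_(1 - lr * weight_decay)                 # decoupled ("independent") weight decay
    weight.add_(update, alpha=-lr)

# Usage on a toy weight:
W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muonw_step(W, torch.randn_like(W), buf)
```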
mikail @Gradientdinner
@SynBio1 A full 4-year turnaround time
0 replies · 0 reposts · 0 likes · 56 views
Jake Wintermute 🧬/acc
It took Nature 13 months to publish Evo 2! 13 months! For reference: Opus 4.5 ended software engineering as we know it 4 months ago. ClawdBot added 2 million users in a single week in January. Academic publishing is so cooked it's not even funny nature.com/articles/s4158…
24 replies · 69 reposts · 534 likes · 106.7K views
Calc Consulting @CalcCon
New results coming soon showing how SETOL (the theory behind weightwatcher) can be used to merge models
[image attached]
1 reply · 0 reposts · 16 likes · 870 views