Jack Cook @jackcookjack

130 posts
phd student @miteecs | systems for ml

Cambridge, MA · Joined January 2011
488 Following · 1.2K Followers
Pinned Tweet
Jack Cook @jackcookjack
Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance? Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
6 replies · 40 reposts · 250 likes · 68.3K views
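For intuition, here's a minimal sketch of block-scaled FP4 quantization in plain Python (not the paper's implementation; names and the rounding rule are my assumptions): each element is rounded onto the small value grid from the tweet after the block is rescaled so its absolute max lands on a target, with 6.0 being the standard NVFP4 choice.

```python
# The full set of magnitudes representable in FP4 (E2M1), as listed in
# the tweet. NVFP4 pairs each small block of values with a scale factor.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_VALUES for s in (-1.0, 1.0)})

def quantize_fp4(x):
    """Round x to the nearest representable FP4 value."""
    return min(FP4_GRID, key=lambda v: abs(v - x))

def quantize_block(block, amax_target=6.0):
    """Rescale a block so its absolute max maps to amax_target, round
    every element onto the FP4 grid, then scale back."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / amax_target
    return [quantize_fp4(x / scale) * scale for x in block]
```

With only 15 distinct grid points, everything hinges on how well the scale places a block's values onto that grid, which is what the thread's Adaptive Block Scaling is about.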
Jack Cook retweeted
Dan Alistarh @DAlistarh
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4. Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration! [1/4]
5 replies · 25 reposts · 170 likes · 19.3K views
Jack Cook @jackcookjack
oh, you want a kernel that'll be right about 93% of the time and have tons of really weird and unpredictable edge cases? yeah I'd recommend Triton
0 replies · 0 reposts · 7 likes · 294 views
Jack Cook retweeted
Charles 🎉 Frye @charles_irl
There was a flippening in the last few months: you can run your own LLM inference with rates and performance that match or beat LLM inference APIs. We wrote up the techniques to do so in a new guide, along with code samples. modal.com/docs/guide/hig…
21 replies · 100 reposts · 894 likes · 93.4K views
Jack Cook @jackcookjack
Here's a non-obvious problem with block-scaled quantized Attention: at the edge of your causal mask, later tokens can leak information to earlier ones through the scale factor computation. I wouldn't expect this leakage to matter very much since it affects scales, not values, but it turns out it does actually cause the loss to decrease a little too quickly! Very cool post by @tensorpro and team.
tensorpro @tensorpro

We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. matx.com/research/leaky…

0 replies · 3 reposts · 18 likes · 2.5K views
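A toy illustration of the leak (hypothetical code, not from the MatX post): when scales are computed over blocks of consecutive tokens, a block that straddles the causal boundary derives its scale from tokens an earlier query should not be able to see.

```python
# Hypothetical setup: one scale per block of BLOCK consecutive key
# tokens along the sequence axis, computed as the block's absolute max.
BLOCK = 4

def block_scales(keys):
    """One scale per block: the block's absolute maximum."""
    return [max(abs(k) for k in keys[i:i + BLOCK])
            for i in range(0, len(keys), BLOCK)]

past = [1.0, 2.0, 1.5]       # key tokens visible to the current query
seq_a = past + [3.0]         # a future token with moderate magnitude
seq_b = past + [100.0]       # a future token with a large magnitude

# The visible tokens are identical, yet the scale covering them (and
# hence their quantized values) differs between the two sequences:
# the future token has leaked into what the earlier query sees.
assert block_scales(seq_a)[0] != block_scales(seq_b)[0]
```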
Jack Cook retweeted
Guangxuan Xiao @Guangxuan_Xiao
Life update: Wrapped up my PhD at @MITEECS 🎓 Super excited to start working on pre-training at @thinkymachines.
53 replies · 71 reposts · 1.9K likes · 73.1K views
Jack Cook retweeted
alex zhang @a1zhang
Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs). It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs! Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025! arxiv.org/pdf/2512.24601
251 replies · 1.1K reposts · 7.4K likes · 2M views
Jack Cook retweeted
Charles 🎉 Frye @charles_irl
use quant.exposed and maybe you too will write a groundbreaking research paper on low-precision training
2 replies · 1 repost · 155 likes · 21.3K views
Danyal Akarca @DanAkarca
Two papers out, on a new paradigm of temporal computation! Our first work funded by @ARIA_research. We're super proud of this, and there's much more coming:

Neural networks, specifically their weights, have become the most useful functional abstraction from the brain. As powerful function approximators they seeded the current era of AI. But there are many more useful abstractions. The brain does so much more than learn weights. At its core, the brain exploits the structure of the physical world to perform computation. It aligns itself to reality - to time and space.

@achterbrain and I have long been working on how to embed neural networks in space and link this to hardware. But there's another dimension: time itself. We think leveraging both is a big part of the puzzle as to why human learning is so efficient. If this is true, how can we use space and time in neural networks? What would this even mean?

We pitched to @BramhavarSuraj at @ARIA_research one possible way. Several years ago, I stumbled upon theoretical work that made it concrete. Time delays - the physical fact that signals take time to travel - can store memory (and other things, including increasing the number of computable functions). Even in feedforward networks. The delay here isn't overhead, as would traditionally be thought, but a feature that can be exploited.

TL;DR - our main findings show that you can do computation in neural networks with time, without (much) need for weights. And it's remarkably efficient. We also show it's possible to co-design hardware (we open-source a chip design) with novel architectures that exploit time to maintain long contexts.

In our first paper, we train neural networks to learn delays and weights. The result: state-of-the-art performance on all the temporally complex neuromorphic benchmarks we tested. Crucially, once you're encoding time directly, weights become almost irrelevant. We compress them to 1.58 bits, just positive, negative, or absent (ternary) weights. That's it. Model sizes drop to double-digit kilobytes!

This works because we're finally encoding information the way the task needs it. Time and space. Which just so happens to be... everything in the physical world. Robotics. Embodied systems. Physical intelligence.

In the second paper, we turn to memory. Intelligent systems don't just compute - they hold onto information over long context windows. And doing this efficiently in the hardware underlying computation is a hard, unsolved problem. To solve this, we built a dual memory pathway architecture: fast spiking dynamics plus a compact state-space memory module that evolves much more slowly. Inspired by how the mammalian cortex separates fast somatic spiking from slower dendritic integration. Each layer maintains a tiny amount of working memory - just ~5% of hidden width - that summarises recent activity and feeds back into the network. There are many directions to scale this further.

We co-designed the algorithms and hardware together from the ground up. The result: >4× throughput and >5× energy efficiency, beating Intel's Loihi2 and other leading neuromorphic platforms like DenRAM and ReckOn. We're open-sourcing the chip design so people can build on this. Incredibly proud of the team: @pengfeisun17, @achterbrain, @neuralreckoning, Zhe Su and @giacomoi

One of our key takeaways: the current AI paradigm is a narrow slice of what's possible. Scaling homogeneous systems only gets you so far. Biological intelligence is deeply heterogeneous: different timescales, different substrates, different specialisations, all co-evolved together. We think the next frontier of scaling means embracing that heterogeneity. Algorithms and hardware aren't separate problems. They need to co-evolve together. Can't wait to share what we're cooking next.
5 replies · 28 reposts · 140 likes · 23.3K views
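On the 1.58-bit claim: ternary weights take one of three values, and log2(3) ≈ 1.58 bits. A common threshold-based ternarization rule, sketched here as a generic example in the style of Ternary Weight Networks (the papers above may use a different scheme), looks like:

```python
# Threshold-based ternarization (generic sketch, not the papers' code).
# Each weight becomes -1, 0, or +1: log2(3) ~= 1.58 bits per weight.
def ternarize(weights, delta_frac=0.7):
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_frac * mean_abs  # small weights become "absent" (0)
    return [0 if abs(w) < delta else (1 if w > 0 else -1)
            for w in weights]
```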
Jack Cook @jackcookjack
We have this result in our paper: Table 3 shows that always scaling (not clipping) to 4.0 is worse than always scaling to 6.0. Like you said, this needs to be done conditionally to have a benefit: our method selects whichever scale yields the lower mean squared quantization error, as shown in Table 2 in our paper.
0 replies · 0 reposts · 2 likes · 202 views
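The conditional choice described in this reply can be sketched as follows (my own simplified Python, not the paper's code): quantize each block twice, once with the block's absolute max mapped to 6.0 and once mapped to 4.0, and keep whichever result has the lower mean squared error.

```python
# Simplified sketch of MSE-based scale selection between targets
# 6.0 and 4.0 (my assumptions, not the paper's implementation).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_VALUES for s in (-1.0, 1.0)})

def quantize(block, amax_target):
    """Map the block's absolute max to amax_target, round to FP4."""
    scale = (max(abs(x) for x in block) or 1.0) / amax_target
    return [min(FP4_GRID, key=lambda g: abs(g - x / scale)) * scale
            for x in block]

def adaptive_block_scale(block):
    """Keep whichever amax target (6.0 or 4.0) yields lower error."""
    candidates = [quantize(block, t) for t in (6.0, 4.0)]
    return min(candidates,
               key=lambda q: sum((a - b) ** 2 for a, b in zip(block, q)))
```

In this toy model the 4.0 target wins when a block's values cluster near its maximum, because FP4's spacing widens from 1.0 to 2.0 between 4 and 6; blocks concentrated near zero keep the 6.0 target, which stretches the fine-grained low end of the grid over more of the range.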
Sabareesh Kumar @RSabareesh
@jackcookjack @jerry_gjx @Guangxuan_Xiao In your evals, do you apply the new scheme unconditionally to all blocks? For some distributions, clipping to 4.0 could be worse for MSE. Maybe there’s value in doing this conditionally?
1 reply · 0 reposts · 0 likes · 229 views
Jack Cook @jackcookjack
This is a great question! Block-scaled INT4 does not have hardware support like NVFP4 does, so it wasn't a focus for this work. We did run some small-scale experiments with simulated NVINT4 (e4m3 scale for every 16 int4 values) tensors and found that it performed worse than both NVFP4 and NVFP4 + 4/6 for LLM pre-training. Having the option to represent a larger range of values as FP4 does, rather than a smaller range with more granularity as INT4 does, is crucial for many blocks. If you're interested in learning more on this topic, I'd recommend checking out this paper: arxiv.org/abs/2510.25602
1 reply · 1 repost · 14 likes · 647 views
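The range-vs-granularity point can be made concrete with a small sketch (under my assumptions about the two value grids, not code from the paper): signed INT4 spreads 16 values uniformly, while FP4 (E2M1) places its 15 values non-uniformly, dense near zero and sparse toward the extremes.

```python
# Comparing the two value grids. INT4 steps by a constant 1 across its
# whole range; FP4 steps by 0.5 near zero but by 2.0 between 4 and 6.
INT4_VALUES = [float(i) for i in range(-8, 8)]
FP4_VALUES = sorted({s * v
                     for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
                     for s in (-1.0, 1.0)})

def step_near(grid, x):
    """Spacing of the grid around x (gap between x's neighbors)."""
    return (min(v for v in grid if v > x)
            - max(v for v in grid if v <= x))

# Relative to each format's max representable value, FP4 is finer near
# zero (0.5/6 vs 1/7) but much coarser at the top (2/6 vs 1/7).
fp4_low = step_near(FP4_VALUES, 0.0) / max(FP4_VALUES)
fp4_high = step_near(FP4_VALUES, 5.0) / max(FP4_VALUES)
int4_step = step_near(INT4_VALUES, 5.0) / max(INT4_VALUES)
```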
Jack Cook retweeted
Ben Pouladian @benitoz
NVFP4 getting stabilized by Four Over Six is a huge win for NVIDIA. If 4-bit training tracks BF16 curves, you basically double effective throughput & cut costs in half while staying locked into CUDA. Blackwell and Rubin demand goes even higher because everyone just trains more! 🦾
Quoting Jack Cook @jackcookjack (pinned tweet above)
1 reply · 7 reposts · 39 likes · 5.5K views
Jack Cook @jackcookjack
We're still investigating this and hope to have more results soon in an updated version of the paper. Unfortunately, these experiments are very expensive for us to run and often have different outcomes when you train on more tokens (notice how at 5B tokens, all of the models look fine with their recipe!), so we couldn't run detailed architectural ablations. We think it may be due to a small architectural difference between their transformer and ours which, in hindsight, we should have matched: perhaps their Q and KV head counts (we have 32 of each, they have 16 and 8). We will also be open-sourcing our training code very soon, so stay tuned!
1 reply · 0 reposts · 9 likes · 977 views