Jack Cook @jackcookjack

130 posts
phd student @miteecs | systems for ml

Cambridge, MA · Joined January 2011
488 Following · 1.2K Followers
Pinned Tweet
Jack Cook @jackcookjack
Training LLMs with NVFP4 is hard because FP4 has so few values that I can fit them all in this post: ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}. But what if I told you that reducing this range even further could actually unlock better training + quantization performance? Introducing Four Over Six, a new method for improving the accuracy of NVFP4 quantization with Adaptive Block Scaling. 🧵
6 replies · 40 reposts · 250 likes · 68.3K views
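For intuition, here's a minimal sketch of block-scaled FP4 quantization in plain Python (not the paper's implementation; names and the rounding rule are my assumptions): each element is rounded onto the small value grid from the tweet after the block is rescaled so its absolute max lands on a target, with 6.0 being the standard NVFP4 choice.

```python
# The full set of magnitudes representable in FP4 (E2M1), as listed in
# the tweet. NVFP4 pairs each small block of values with a scale factor.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_VALUES for s in (-1.0, 1.0)})

def quantize_fp4(x):
    """Round x to the nearest representable FP4 value."""
    return min(FP4_GRID, key=lambda v: abs(v - x))

def quantize_block(block, amax_target=6.0):
    """Rescale a block so its absolute max maps to amax_target, round
    every element onto the FP4 grid, then scale back."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / amax_target
    return [quantize_fp4(x / scale) * scale for x in block]
```

With only 15 distinct grid points, everything hinges on how well the scale places a block's values onto that grid, which is what the thread's Adaptive Block Scaling is about.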
Jack Cook retweeted
Dan Alistarh @DAlistarh
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4. Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration! [1/4]
5 replies · 25 reposts · 170 likes · 19.3K views
Jack Cook @jackcookjack
oh, you want a kernel that'll be right about 93% of the time and have tons of really weird and unpredictable edge cases? yeah I'd recommend Triton
0 replies · 0 reposts · 7 likes · 294 views
Jack Cook retweeted
Charles 🎉 Frye @charles_irl
There was a flippening in the last few months: you can run your own LLM inference with rates and performance that match or beat LLM inference APIs. We wrote up the techniques to do so in a new guide, along with code samples. modal.com/docs/guide/hig…
21 replies · 100 reposts · 894 likes · 93.4K views
Jack Cook @jackcookjack
Here's a non-obvious problem with block-scaled quantized Attention: at the edge of your causal mask, later tokens can leak information to earlier ones through the scale factor computation. I wouldn't expect this leakage to matter very much since it affects scales, not values, but it turns out it does actually cause the loss to decrease a little too quickly! Very cool post by @tensorpro and team.
tensorpro @tensorpro

We trained models with MXFP4-quantized attention, but it turns out this can break causal modeling. Our latest post explains why this happens and how to fix it. matx.com/research/leaky…

0 replies · 3 reposts · 18 likes · 2.5K views
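A toy illustration of the leak (hypothetical code, not from the MatX post): when scales are computed over blocks of consecutive tokens, a block that straddles the causal boundary derives its scale from tokens an earlier query should not be able to see.

```python
# Hypothetical setup: one scale per block of BLOCK consecutive key
# tokens along the sequence axis, computed as the block's absolute max.
BLOCK = 4

def block_scales(keys):
    """One scale per block: the block's absolute maximum."""
    return [max(abs(k) for k in keys[i:i + BLOCK])
            for i in range(0, len(keys), BLOCK)]

past = [1.0, 2.0, 1.5]       # key tokens visible to the current query
seq_a = past + [3.0]         # a future token with moderate magnitude
seq_b = past + [100.0]       # a future token with a large magnitude

# The visible tokens are identical, yet the scale covering them (and
# hence their quantized values) differs between the two sequences:
# the future token has leaked into what the earlier query sees.
assert block_scales(seq_a)[0] != block_scales(seq_b)[0]
```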
Jack Cook retweeted
Guangxuan Xiao @Guangxuan_Xiao
Life update: Wrapped up my PhD at @MITEECS 🎓 Super excited to start working on pre-training at @thinkymachines.
53 replies · 71 reposts · 1.9K likes · 73.1K views
Jack Cook retweeted
alex zhang @a1zhang
Much like the switch in 2025 from language models to reasoning models, we think 2026 will be all about the switch to Recursive Language Models (RLMs). It turns out that models can be far more powerful if you allow them to treat *their own prompts* as an object in an external environment, which they understand and manipulate by writing code that invokes LLMs! Our full paper on RLMs is now available—with much more expansive experiments compared to our initial blogpost from October 2025! arxiv.org/pdf/2512.24601
251 replies · 1.1K reposts · 7.4K likes · 2M views
Jack Cook retweeted
Charles 🎉 Frye @charles_irl
use quant.exposed and maybe you too will write a groundbreaking research paper on low-precision training
2 replies · 1 repost · 155 likes · 21.3K views
Danyal Akarca @DanAkarca
Two papers out, on a new paradigm of temporal computation! Our first work funded by @ARIA_research. We're super proud of this, and there's much more coming:

Neural networks, specifically their weights, have become the most useful functional abstraction from the brain. As powerful function approximators they seeded the current era of AI. But there are many more useful abstractions. The brain does so much more than learn weights. At its core, the brain exploits the structure of the physical world to perform computation. It aligns itself to reality - to time and space.

@achterbrain and I have long been working on how to embed neural networks in space and link this to hardware. But there's another dimension: time itself. We think leveraging both is a big part of the puzzle as to why human learning is so efficient. If this is true, how can we use space and time in neural networks? What would this even mean?

We pitched to @BramhavarSuraj at @ARIA_research one possible way. Several years ago, I stumbled upon theoretical work that made it concrete. Time delays - the physical fact that signals take time to travel - can store memory (and other things, including increasing the number of computable functions). Even in feedforward networks. The delay here isn't overhead, as would traditionally be thought, but a feature that can be exploited.

TL;DR - our main findings show that you can do computation in neural networks with time, without (much) need for weights. And it's remarkably efficient. We also show it's possible to co-design hardware (we open-source a chip design) with novel architectures that exploit time to maintain long contexts.

In our first paper, we train neural networks to learn delays and weights. The result: state-of-the-art performance on all the temporally complex neuromorphic benchmarks we tested. Crucially, once you're encoding time directly, weights become almost irrelevant. We compress them to 1.58 bits, just positive, negative, or absent (ternary) weights. That's it. Model sizes drop to double-digit kilobytes!

This works because we're finally encoding information the way the task needs it. Time and space. Which just so happens to be... everything in the physical world. Robotics. Embodied systems. Physical intelligence.

In the second paper, we turn to memory. Intelligent systems don't just compute - they hold onto information over long context windows. And doing this efficiently in the hardware underlying computation is a hard, unsolved problem. To solve this, we built a dual memory pathway architecture: fast spiking dynamics plus a compact state-space memory module that evolves much more slowly. Inspired by how the mammalian cortex separates fast somatic spiking from slower dendritic integration. Each layer maintains a tiny amount of working memory - just ~5% of hidden width - that summarises recent activity and feeds back into the network. There are many directions to scale this further.

We co-designed the algorithms and hardware together from the ground up. The result: >4× throughput and >5× energy efficiency, beating Intel's Loihi2 and other leading neuromorphic platforms like DenRAM and ReckOn. We're open-sourcing the chip design so people can build on this. Incredibly proud of the team: @pengfeisun17, @achterbrain, @neuralreckoning, Zhe Su and @giacomoi

One of our key takeaways: the current AI paradigm is a narrow slice of what's possible. Scaling homogeneous systems only gets you so far. Biological intelligence is deeply heterogeneous: different timescales, different substrates, different specialisations, all co-evolved together. We think the next frontier of scaling means embracing that heterogeneity. Algorithms and hardware aren't separate problems. They need to co-evolve together. Can't wait to share what we're cooking next.
5 replies · 28 reposts · 140 likes · 23.3K views
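On the 1.58-bit claim: ternary weights take one of three values, and log2(3) ≈ 1.58 bits. A common threshold-based ternarization rule, sketched here as a generic example in the style of Ternary Weight Networks (the papers above may use a different scheme), looks like:

```python
# Threshold-based ternarization (generic sketch, not the papers' code).
# Each weight becomes -1, 0, or +1: log2(3) ~= 1.58 bits per weight.
def ternarize(weights, delta_frac=0.7):
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_frac * mean_abs  # small weights become "absent" (0)
    return [0 if abs(w) < delta else (1 if w > 0 else -1)
            for w in weights]
```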
Jack Cook @jackcookjack
We have this result in our paper: Table 3 shows that always scaling (not clipping) to 4.0 is worse than always scaling to 6.0. Like you said, this needs to be done conditionally to have a benefit: our method selects whichever scale yields the lower mean squared quantization error, as shown in Table 2 in our paper.
0 replies · 0 reposts · 2 likes · 202 views
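The conditional choice described in this reply can be sketched as follows (my own simplified Python, not the paper's code): quantize each block twice, once with the block's absolute max mapped to 6.0 and once mapped to 4.0, and keep whichever result has the lower mean squared error.

```python
# Simplified sketch of MSE-based scale selection between targets
# 6.0 and 4.0 (my assumptions, not the paper's implementation).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in FP4_VALUES for s in (-1.0, 1.0)})

def quantize(block, amax_target):
    """Map the block's absolute max to amax_target, round to FP4."""
    scale = (max(abs(x) for x in block) or 1.0) / amax_target
    return [min(FP4_GRID, key=lambda g: abs(g - x / scale)) * scale
            for x in block]

def adaptive_block_scale(block):
    """Keep whichever amax target (6.0 or 4.0) yields lower error."""
    candidates = [quantize(block, t) for t in (6.0, 4.0)]
    return min(candidates,
               key=lambda q: sum((a - b) ** 2 for a, b in zip(block, q)))
```

In this toy model the 4.0 target wins when a block's values cluster near its maximum, because FP4's spacing widens from 1.0 to 2.0 between 4 and 6; blocks concentrated near zero keep the 6.0 target, which stretches the fine-grained low end of the grid over more of the range.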
Sabareesh Kumar @RSabareesh
@jackcookjack @jerry_gjx @Guangxuan_Xiao In your evals, do you apply the new scheme unconditionally to all blocks? For some distributions, clipping to 4.0 could be worse for MSE. Maybe there’s value in doing this conditionally?
1 reply · 0 reposts · 0 likes · 229 views
Jack Cook @jackcookjack
This is a great question! Block-scaled INT4 does not have hardware support like NVFP4 does, so it wasn't a focus for this work. We did run some small-scale experiments with simulated NVINT4 (e4m3 scale for every 16 int4 values) tensors and found that it performed worse than both NVFP4 and NVFP4 + 4/6 for LLM pre-training. Having the option to represent a larger range of values as FP4 does, rather than a smaller range with more granularity as INT4 does, is crucial for many blocks. If you're interested in learning more on this topic, I'd recommend checking out this paper: arxiv.org/abs/2510.25602
1 reply · 1 repost · 14 likes · 647 views
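The range-vs-granularity point can be made concrete with a small sketch (under my assumptions about the two value grids, not code from the paper): signed INT4 spreads 16 values uniformly, while FP4 (E2M1) places its 15 values non-uniformly, dense near zero and sparse toward the extremes.

```python
# Comparing the two value grids. INT4 steps by a constant 1 across its
# whole range; FP4 steps by 0.5 near zero but by 2.0 between 4 and 6.
INT4_VALUES = [float(i) for i in range(-8, 8)]
FP4_VALUES = sorted({s * v
                     for v in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
                     for s in (-1.0, 1.0)})

def step_near(grid, x):
    """Spacing of the grid around x (gap between x's neighbors)."""
    return (min(v for v in grid if v > x)
            - max(v for v in grid if v <= x))

# Relative to each format's max representable value, FP4 is finer near
# zero (0.5/6 vs 1/7) but much coarser at the top (2/6 vs 1/7).
fp4_low = step_near(FP4_VALUES, 0.0) / max(FP4_VALUES)
fp4_high = step_near(FP4_VALUES, 5.0) / max(FP4_VALUES)
int4_step = step_near(INT4_VALUES, 5.0) / max(INT4_VALUES)
```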
Jack Cook retweeted
Ben Pouladian @benitoz
NVFP4 getting stabilized by Four Over Six is a huge win for NVIDIA. If 4-bit training tracks BF16 curves, you basically double effective throughput & cut costs in half while staying locked into CUDA. Blackwell and Rubin demand goes even higher because everyone just trains more! 🦾
Quoting Jack Cook @jackcookjack (pinned tweet above)
1 reply · 7 reposts · 39 likes · 5.5K views
Jack Cook @jackcookjack
We're still investigating this and hope to have more results soon in an updated version of the paper. Unfortunately, these experiments are very expensive for us to run and often have different outcomes when you train on more tokens (notice how at 5B tokens, all of the models look fine with their recipe!), so we couldn't run detailed architectural ablations. We think it may be due to a small architectural difference between their transformer and ours which, in hindsight, we should have matched: perhaps their Q and KV head counts (we have 32 of each, they have 16 and 8). We will also be open-sourcing our training code very soon, so stay tuned!
1 reply · 0 reposts · 9 likes · 977 views