Naman Goyal

211 posts

Naman Goyal

@NamanGoyal21

Research @thinkymachines, previously pretraining LLAMA at GenAI Meta

Katılım Kasım 2012

734 Takip Edilen1.7K Takipçiler

Naman Goyal retweetledi

Thinking Machines@thinkymachines·11 May

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…

English

460

1.9K

15.7K

7.6M

Naman Goyal retweetledi

Mira Murati@miramurati·10 Mar

Grateful to Jensen and @nvidia team for their support. Together, we’re working to deploy at least 1GW of Vera Rubin systems, bringing adaptable collaborative AI to everyone. thinkingmachines.ai/nvidia-partner…

English

168

284

3.9K

558.8K

Naman Goyal@NamanGoyal21·6 Mar

@JingyuanLiu123 @Jianlin_S @clu_cheng Welcome! excited to work together

English

135

JingyuanLiu@JingyuanLiu123·6 Mar

Some updates: I've always been bullish on TML, and I actually joined TML this Monday Looking back, I am feeling so lucky that I have the privilege to work closely with the best optimization experts on the Muon optimizer ( @Jianlin_S from Kimi and @clu_cheng from Meta). Now I am so excited to be able to work with @jxbz and build new cool things! (On the other hand, there have always been some bad rumors about Meta TBD's potential failure. That's not true! From my personal experiences, it really has the best talents in the field, and I really enjoyed learning from the lab. The avocado model will for sure be great!)

JingyuanLiu@JingyuanLiu123

hmm I sort of disagree and I am bullish for TML. I think they really really have the top talents that I admire in the field, e.g. Jeremy and Sam for optimization, Songlin for Attn, Lia for MoE, Andrew for FSDPv2, and a bunch more folks it's just natural that it takes a while to publish good models: - dpsk starts to publish papers in 2023, even piblished dspkv2 (which I think is already amazing) in mid 2024 and nobody cares, until dpskv3 and r1 - msh took 10+ month to deliver a first not bad long ctx model in 2023 and be silent for the whole 2024 year, and starts to catch up gradually in 2025 - qwen starts to be a much better model than llama until qwen2.5, mid or late 2024, while the lab has been there forever it takes time to get infra and data done, but as long as you have good folks, and principled ways of doing science and experiments, some time or later, scaling laws will pay back

English

275

54.5K

Naman Goyal@NamanGoyal21·31 Oca

@drisspg How do you get physical SM id the blocks were launched on?

English

163

driss guessous@drisspg·31 Oca

AI is fun man, one off vibing of scripts is perfect product market fit! Another fun fact is you can see that cuda indeed does not always launch 1-to-1 the the first N blocks in a grid on the same smid, however it seems to map the same block to the same physical sm from run2run

English

1.5K

Naman Goyal retweetledi

Pengyu Zhao@zpysky1125·29 Eki

MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behave of pre-training lead Haohai Sun. (zhihu.com/question/19653…) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock and go with full attention with MiniMax M2?" After explaining the backstory in one chat after another, I figured it's time to write down our journey in a blog. Honestly, I could give you the textbook debate. I could talk all afternoon about why you should build linear/sparse attention. Then, I could turn around and talk all afternoon about why you shouldn't. But what's the point of all that hand-waving? The real question is whether you should actually do it. So, let's start with the conclusion: We are always working on it. But in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention. As LLMs have evolved, the entire stack has become monstrously complex. We serve more scenarios, and the architecture design trade-offs are exploding: "How does it perform on code and math? What about agent scenarios? How does it handle multimodality? Does long-chain CoT still hold up? Can RL scale on top of it? Are there hidden traps with low-precision compute? How do you implement interleaved thinking, caching, or speculative decoding? ... " In short, there's a vast difference between the promise on paper and its payoff in production. You only get to claim that payoff after satisfying Condition 1...n and solving Problem 1...n. II. Why Efficient Attention? Let's do a thought experiment. If you had infinite compute, would you even bother with linear or sparse attention? Some might bring up theoretical arguments about softmax attention "oversmoothing" in an infinite context... but who knows? Under the current compute bound, no model has truly pushed softmax attention to its absolute limit. So, for all practical purposes, the race for efficient attention is a race to save compute. For our M2 design, could we aim to save tokens — achieving the same quality with fewer tokens? Well if you believe in scaling laws, to achieve this goal, you'd probably bet on other paths to get there, not efficient attention. So, the simple truth is this: Compute is finite. We need an architecture that makes better use of it — models that achieve higher performance under the same budget (training & inference). III. The Real Bottlenecks To build a model that can practically be deployed and used by the community, we have to start with what users care: Quality, Speed (TPS), and Price. Quality is non-negotiable. A useless model is useless even if it's free. So how do we make a Linear/Sparse/Hybrid Attention model that performs well enough? The biggest challenge here isn’t the architecture design — the real bottleneck is the limitations of evaluation. (As for speed and price, those are heavily influenced by the inference stack—and great models tend to attract great engineers to optimize them.) The Evaluation Trap: Goodhart's Law in Action “As long as you build the benchmark, I’ll find a way to beat it.” Over the past few years of LLM development, the pace of leaderboard progress is staggering. No matter how hard a benchmark is — even if the SOTA score starts in single digits — once it catches the industry’s attention, it’s usually crushed within a few iterations. But how do you build an evaluation system that is comprehensive and actually reflects a model's true capabilities? That’s one of the hardest — and most critical — problems in LLM development, and it becomes even more acute when you start messing with a component as fundamental as attention. Benchmarks are a Leaky Abstraction There’s no free lunch. When you reduce the complexity of attention, you pay a price. The question is, where? When we were developing MiniMax-Text-01, everyone was still evaluating MMLU, BBH, MATH, and LongBench (all of which are now saturated). From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention. Our own small-scale hybrid models confirmed this on the leaderboards. (Did we find a free lunch?) Not quite. The price paid became obvious at a larger scale: the model had clear deficits in complex, multi-hop reasoning tasks. Okay, once a problem is exposed, you can fix it. We developed proxy metrics for this specific weakness and iterated until the hybrid model seemed to match MHA. But does that proxy metric still correlate with real-world downstream performance at an even larger scale? Are there other hidden weaknesses? Who knows. We haven't run those experiments yet. The better the models get, the harder they are to evaluate. But that’s a must part of the journey — keep it up, eval teams! The High Cost of Knowing Things For complex reasoning tasks, we can sometimes find early proxy metrics that correlate well with final performance — but not for all tasks (at least, not yet). As tasks get harder, the amount of experiment compute required just to get a statistically significant signal on your metric grows astronomically — which is ironic, since we study efficient attention because compute is limited. And beyond the academic benchmarks, optimization issues often only surface at scale. You never really know what’s going to happen until you scale up. Anyone who read our M1 paper will recall the serious precision issues we hit during RL training — problems that would’ve been spotted earlier. Going back and analyzing Lightning Attention's numerical convergence with that experience in hand was incredibly clarifying. Discovering the real problems is often far harder than solving them. A Symphony of Variables There are just too many variables in model training. Different architectures behave very differently on different data distributions and with different optimizers. In a world where our data is constantly being updated, an experiment run on last month's data mix might yield the opposite conclusion today. We can’t observe everything perfectly — but we’re working on finding more reliable experimental strategies. Infrastructure: Where Theory Meets Metal Compared to full attention, the infrastructure for linear and sparse attention is much less mature. To actually get the promised results, there’s still a lot of groundwork to fill in. Take linear attention for example: If you analyze the compute intensity of existing linear architectures, many of them are memory-bound — even during training. Without extreme IO optimization, you’re basically leaving a huge amount of GPU FLOPs on the table. And inference brings even more challenges than training: How do you deliver a service that is genuinely faster and cheaper? Linear attention has linear compute complexity and constant memory usage. That means there’s a crossover point where it becomes more efficient than full attention in compute and memory. In theory, that point lies at a few thousand tokens — which isn’t particularly long for today’s large models. But that’s just theory. We need to solve a few key problems to actually approach it: Low-Precision State Storage: Linear attention is currently far more sensitive to numerical precision than full attention. Prefix Caching: In real-world applications, the cache-hit rate for conversations is very high. A new architecture must handle this gracefully. Speculative Decoding: How do you optimize speculative decoding with linear attention backbone? Well fortunately, all of these seem solvable. IV. What’s Next Scaling remains the name of the game, and context scaling is one of the key problems. Longer and longer context length is key in both pre-training and post-training. As GPU compute growth slows while data length keeps increasing, the benefits of linear and sparse attention will gradually emerge. We should start preparing now: Better Data: More multimodal, information-rich long-context data. Better Evaluation: More informative evaluation system and experimental paradigms to speed up iteration. Better Infrastructure: Mature training and inference infrastructure to fully squeeze out GPU potential. V. Addendum: the SWA code... We accidentally left the SWA inference code in the open-source release, and some people asked why it wasn’t used in the final model. Simple answer: the performance wasn't good enough. That experiment was from quite early on, before GPT-OSS was open-sourced (we were pretty surprised to see its structure, by the way). But I can share a brief summary of our failed attempt. We tried adapting CPT into a Hybrid SWA, testing both inter & intra-layer mixing. The motivation for intra-layer mixing was to balance the compute intensity across all layers, which is friendly to both PP in training and PP or AFD during inference. Unfortunately, neither worked. Performance degraded noticeably as context length grew — which is unacceptable in agentic scenarios. Our analysis showed that many global attention patterns (like retrieval head and induction head) were already established early during pre-training. CPT can hardly adjust those patterns afterwards. You surely can mitigate the issue by using data probes to identify and keep those heads as full attention — but unfortunately, it’s nearly impossible to discover them all from human priors. (And no, this issue isn’t related to attention sinks.) If you're interested in this line of research, I recommend taking a closer look at GPT-OSS, CWM, and Gemma, especially their long-context performance. Finally, we’re hiring! If you want to join us, send your resume to guixianren@minimaxi.com. References MiniMax-01: Scaling Foundation Models with Lightning Attention MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention CWM: An Open-Weights LLM for Research on Code Generation with World Models Qwen3-Next Gemma 3 Technical Report gpt-oss-120b & gpt-oss-20b Model Card Retrieval Head Mechanistically Explains Long-Context Factuality transformer-circuits.pub/2022/in-contex…

MiniMax (official)@MiniMax_AI

We’re open-sourcing MiniMax M2 — Agent & Code Native, at 8% Claude Sonnet price, ~2x faster ⚡ Global FREE for a limited time via MiniMax Agent & API - Advanced Coding Capability: Engineered for end-to-end developer workflows. Strong capability on a wide-range of applications (Claude Code, Cursor, Cline, Kilo Code, Droid, etc) - High Agentic Performance: Robust handling of long-horizon toolchains (mcp, shell, browser, retrieval, code). - Smarter, Faster, Cheaper with efficient parameter activation

English

120

804

749K

Naman Goyal retweetledi

Diyi Yang@Diyi_Yang·22 Eki

Thanks @thinkymachines for supporting Tinker access for our CS329x students on Homework 2 😉

RSC ☀️🌲@silver__tsuki

Its not even been a month since @thinkymachines released Tinker & Stanford already has an assignment on it

English

582

318.1K

Naman Goyal retweetledi

Andrej Karpathy@karpathy·1 Eki

Tinker is cool. If you're a researcher/developer, tinker dramatically simplifies LLM post-training. You retain 90% of algorithmic creative control (usually related to data, loss function, the algorithm) while tinker handles the hard parts that you usually want to touch much less often (infra, forward/backward of the LLM itself, distributed training), meaning you can do these at well below <<10% of typical complexity involved. Compared to the more common and existing paradigm of "upload your data, we'll post-train your LLM", this is imo a more clever place to "slice up" the complexity of post-training, both delegating the heavy lifting, but also keeping majority of the data/algorithmic creative control. I think the community still has to discover how and when finetuning makes sense compared to the (often strong) baseline of prompting a giant model. The early indications I've seen is that finetuning isn't so much about "stylizing" an LLM, instead, it's a lot more about narrowing the scope, and especially when you have a lot of training examples. An extreme example of scope narrowing being that of categorical classifiers, e.g.spam filters, content filters, etc. but it should be broader than that. Instead of building a giant few-shot prompts for a big LLM, it might work a lot better (and faster!) to finetune a smaller LLM specifically for your narrow task. Increasingly, production applications of LLMs are larger pipelines where a bunch of LLMs collaborate in DAGs and flows. Some of these components might work well as prompts. But a lot of it will probably work a lot better as a finetune. Tinker makes the latter trivial and should allow for an easy experimentation of what works best at any stage.

Thinking Machines@thinkymachines

Introducing Tinker: a flexible API for fine-tuning language models. Write training loops in Python on your laptop; we'll run them on distributed GPUs. Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models! thinkingmachines.ai/tinker

English

108

641

6.1K

750.9K

Naman Goyal retweetledi

Thinking Machines@thinkymachines·1 Eki

English

241

793

5.9K

4.2M

Naman Goyal retweetledi

Thinking Machines@thinkymachines·29 Eyl

LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA. thinkingmachines.ai/blog/lora/

English

560

3.5K

1.4M

Naman Goyal retweetledi

Woosuk Kwon@woosuk_k·12 Eyl

At Thinking Machines, our work includes collaborating with the broader research community. Today we are excited to share that we are building a vLLM team at @thinkymachines to advance open-source vLLM and serve frontier models. If you are interested, please DM me or @barret_zoph! Here are some example roles / projects: * Distributed inference engineer to support large-scale models on Blackwell GPUs * PyTorch & model optimization engineer to support & optimize latest OSS models * MLSys generalist for various aspects of vLLM

English

1.2K

194.5K

Naman Goyal retweetledi

Horace He@cHHillee·10 Eyl

Apologies that I haven't written anything since joining Thinking Machines but I hope this blog post on a topic very near and dear to my heart (reproducible floating point numerics in LLM inference) will make up for it!

Thinking Machines@thinkymachines

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference” We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to prompt engineering. Here we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a throwback to an earlier era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. thinkingmachines.ai/blog/defeating…

English

187

2.9K

537.9K

Naman Goyal retweetledi

Thinking Machines@thinkymachines·10 Eyl

English

229

1.2K

7.6K

3.5M

Naman Goyal retweetledi

Bram Wasti@bwasti·3 Eyl

so apparently swe-bench doesn’t filter out future repo states (with the answers) and the agents sometimes figure this out… github.com/SWE-bench/SWE-…

English

300

132.6K

Naman Goyal@NamanGoyal21·24 Ağu

@tensorbert I typically do mostly mental or paper math to get idea about various trade offs under the optimization constraints and usually that is good enough to narrow down to 4-5 parallelism config for a particular model run, and then for the hero run usually end up benchmarking those.

English

510

Bert Maher@tensorbert·24 Ağu

I am curious: folks who know a lot about parallelism in model scaling, do you mainly get your insights from pencil-and-paper reasoning, from actually running lots of measurements, both? I have a good feel for single-device performance, but not large scale yet

Horace He@cHHillee

This is the advantage of large nvlink domains or TPUs topology - the main reason to do PP is that you are bottlenecked on your DP comms and cannot scale TP further. But if you have high enough bandwidth across a large enough domain (like TPUs or NVL72), you don't need to do PP for a very long time

English

27.6K

Naman Goyal@NamanGoyal21·15 Tem

The past 4 months have been among the most rewarding of my career—filled with learning and building alongside some of the most talented ML research and infra folks I know. I truly believe magic happens when driven, talented people are aligned on a shared mission.

Mira Murati@miramurati

Thinking Machines Lab exists to empower humanity through advancing collaborative general intelligence. We're building multimodal AI that works with how you naturally interact with the world - through conversation, through sight, through the messy way we collaborate. We're excited that in the next couple months we’ll be able to share our first product, which will include a significant open source component and be useful for researchers and startups developing custom models. Soon, we’ll also share our best science to help the research community better understand frontier AI systems. To accelerate our progress, we’re happy to confirm that we’ve raised $2B led by a16z with participation from NVIDIA, Accel, ServiceNow, CISCO, AMD, Jane Street and more who share our mission. We’re always looking for extraordinary talent that learns by doing, turning research into useful things. We believe AI should serve as an extension of individual agency and, in the spirit of freedom, be distributed as widely and equitably as possible. We hope this vision resonates with those who share our commitment to advancing the field. If so, join us. thinkingmachines.paperform.co

English

119

11.8K

Naman Goyal retweetledi

Jeremy Howard@jeremyphoward·8 Haz

literally openai

English

646

65.9K

Naman Goyal retweetledi

Vijay@__tensorcore__·13 May

🚨🔥 CUTLASS 4.0 is released 🔥🚨 pip install nvidia-cutlass-dsl 4.0 marks a major shift for CUTLASS: towards native GPU programming in Python slidehelloworld.png docs.nvidia.com/cutlass/media/…

English

422

78.9K

Naman Goyal@NamanGoyal21·6 Nis

Congrats amazing friends and ex colleagues on killer release! Pushing the frontier of open source models pushes the field collectively forward!

AI at Meta@AIatMeta

Today is the start of a new era of natively multimodal AI innovation. Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality. Llama 4 Scout • 17B-active-parameter model with 16 experts. • Industry-leading context window of 10M tokens. • Outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 across a broad range of widely accepted benchmarks. Llama 4 Maverick • 17B-active-parameter model with 128 experts. • Best-in-class image grounding with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image. • Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks. • Achieves comparable results to DeepSeek v3 on reasoning and coding — at half the active parameters. • Unparalleled performance-to-cost ratio with a chat version scoring ELO of 1417 on LMArena. These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We’re excited to share more details about it even while it’s still in flight. Read more about the first Llama 4 models, including training and benchmarks ➡️ go.fb.me/gmjohs Download Llama 4 ➡️ go.fb.me/bwwhe9

English

10.9K

Naman Goyal@NamanGoyal21·3 Şub

@giffmana Being the only person who was co-author both in OPT and llama1 and was part of zetta team, I can say that actually that it was much more nuanced and has multiple POVs and not a simple story as presented. But I will silence as no more drama, and only llama!

English

103

8.3K

Lucas Beyer (bl16)@giffmana·3 Şub

There has been so much drama across the various big labs in the last few years, i love the entertainment! What would be super dope is if in 5-10 years a few of us sit down and write a book about it all. hmu if you got some tea and are down for this, it'll be fun!

English

219

23.5K

Naman Goyal@NamanGoyal21·7 Oca

@cHHillee @finbarrtimbers 100% agreed, I just think of sequence parallelism as slight different way to do tensor parallelism that is almost always better than default TP for most training.

English

173

Horace He@cHHillee·7 Oca

@finbarrtimbers Sequence Parallelism is probably the worst named parallelism kind.

English

3.3K

finbarr@finbarrtimbers·7 Oca

writing an article about different types of large model parallelism and when to use them what parallelism questions do you have

English

334

26.8K

Keşfet

@nvidia @JingyuanLiu123 @Jianlin_S @clu_cheng @jxbz @drisspg @thinkymachines @barret_zoph