Mahan Fathi

75 posts

@MahanFathi

llm research @nvidia👁️; ex @googledeepmind, @google🧠 & @mila_quebec; FREE IRAN 🇮🇷.

Toronto, Ontario · Joined June 2011
162 Following · 969 Followers
Mahan Fathi retweeted
Artificial Analysis @ArtificialAnlys
NVIDIA has released Nemotron 3 Super, a 120B (12B active) open weights reasoning model that scores 36 on the Artificial Analysis Intelligence Index with a hybrid Mamba-Transformer MoE architecture. We were given access to this model ahead of launch and evaluated it across intelligence, openness, and inference efficiency.

Key takeaways:
➤ Combines high openness with strong intelligence: Nemotron 3 Super performs strongly for its size and is substantially more intelligent than any other model with comparable openness.
➤ Nemotron 3 Super scored 36 on the Artificial Analysis Intelligence Index, +17 points ahead of the previous Super release and +12 points from Nemotron 3 Nano. Compared to models in a similar size category, this places it ahead of gpt-oss-120b (33), but behind the recently released Qwen3.5 122B A10B (42).
➤ Focused on efficient intelligence: we found Nemotron 3 Super to have higher intelligence than gpt-oss-120b while enabling ~10% higher throughput per GPU in a simple but realistic load test.
➤ Supported today for fast serverless inference: providers including @DeepInfra and @LightningAI are serving this model at launch with speeds of up to 484 tokens per second.

Model details:
📝 Nemotron 3 Super has 120.6B total and 12.7B active parameters, along with a 1 million token context window and hybrid reasoning support. It is published with open weights and a permissive license, alongside open training data and methodology disclosure.
📐 The model has several design features enabling efficient inference, including hybrid Mamba-Transformer and LatentMoE architectures, multi-token prediction, and NVFP4 quantized weights.
🎯 NVIDIA pre-trained Nemotron 3 Super in (mostly) NVFP4 precision, but moved to BF16 for post-training. Our evaluation scores use the BF16 weights.
🧠 We benchmarked Nemotron 3 Super in its highest-effort reasoning mode ("regular"), the most capable of the model's three inference modes (reasoning-off, low-effort, and regular).
20 replies · 62 retweets · 485 likes · 93.2K views
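Since the launch providers named above expose OpenAI-compatible endpoints, a first call might look like the sketch below. The model id string is a placeholder assumption rather than a confirmed identifier, and selection among the three reasoning modes (reasoning-off, low-effort, regular) is provider-specific, so it is left at the default here.

```python
# A rough sketch only. Assumptions (not confirmed by the post): the exact
# model id, and that the provider's default reasoning mode is acceptable.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super",  # hypothetical id; check the provider's model list
    messages=[{"role": "user", "content": "Summarize the Mamba architecture in two sentences."}],
)
print(resp.choices[0].message.content)
```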
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
Nemotron 3 Super is here — 120B total / 12B active, Hybrid SSM Latent MoE, designed for Blackwell. Truly open: permissive license, open data, open training infra. See analysis on @ArtificialAnlys. Details in thread 🧵 below:
10 replies · 45 retweets · 276 likes · 29.1K views
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
🚀 Nemotron 3 Nano 30B-A3B is here! Open weights + open data + open source. AA Intelligence Index: 52 (@ArtificialAnlys)
✅ 1M‑token context
✅ up to 3.3× higher throughput vs similarly sized open models
✅ stronger reasoning/agentic + chat
Details + links in the thread 🧵
3 replies · 33 retweets · 144 likes · 10.2K views
Mahan Fathi @MahanFathi
We're looking for Summer Interns to join the Post-Training Team at @NVIDIA! DM me with your updated resume and three concise bullets detailing your most relevant experience — e.g. publications, repos, blogs, etc. RT please to help us find top talent.
14 replies · 34 retweets · 468 likes · 31.2K views
Mahan Fathi @MahanFathi
@jxmnop @Miles_Brundage the RL updates are indeed small in the sense that they are *sparse*, however they are still *full-rank* (see arxiv.org/abs/2505.11711). this means that LoRA would fail to recover the true base model even if we assume access to the post-training data.
0 replies · 0 retweets · 2 likes · 156 views
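To see why sparse does not imply low-rank, consider this minimal numpy sketch (an illustration, not code from the linked paper): an update touching only ~1% of entries still tends to be full-rank, which is exactly what a rank-r LoRA adapter with r << d cannot represent exactly.

```python
# Sparse != low-rank: a weight update can touch very few entries yet still
# span the full space, so no small-rank factorization recovers it exactly.
import numpy as np

rng = np.random.default_rng(0)
d = 1024
delta = np.zeros((d, d))

# Sparse update: modify ~1% of entries, scattered uniformly at random.
idx = rng.choice(d * d, size=d * d // 100, replace=False)
delta.flat[idx] = rng.standard_normal(idx.size)

print("nonzero fraction:", np.count_nonzero(delta) / delta.size)  # ~0.01
print("rank:", np.linalg.matrix_rank(delta), "of", d)  # typically d, or within a few of it
```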
dr. jack morris @jxmnop
@Miles_Brundage theoretically, because the update is so low-rank
empirically, because the generations have nothing to do with the training data
e.g. I didn’t train the model to output Harry Potter; somehow it knew that already
8 replies · 0 retweets · 161 likes · 17.1K views
dr. jack morris @jxmnop
OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only... or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base 🧵
163 replies · 446 retweets · 6.1K likes · 928.7K views
Mahan Fathi @MahanFathi
@StellaLisy cool paper with really surprising results. but looking at the abstract, Qwen2.5-Math-7B gets ~70% on MATH500 right off the shelf when no template is used, which is about as much improvement as you achieve with random rewards. this is due to a mismatch between the train/test time templates.
0 replies · 0 retweets · 1 like · 204 views
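A minimal sketch of the mismatch being pointed at, assuming the Hugging Face tokenizer for an instruct-tuned Qwen variant ships a chat template (the model id below is illustrative): the same question reaches the model as two very different strings depending on whether the training-time template is applied at eval time.

```python
# Template mismatch sketch: raw prompt vs. training-time chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")  # illustrative id
question = "What is 12 * 13?"

# Eval WITHOUT a template: the raw question, as a base model might be prompted.
raw_prompt = question

# Eval WITH the training-time template: role tags wrap the same question.
templated_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(raw_prompt))
print(repr(templated_prompt))
# If the pre-RL baseline is scored with the wrong prompt format, it is
# understated, and any reward (even a random one) can look like a big gain.
```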
Stella Li @StellaLisy
🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
72 replies · 331 retweets · 1.8K likes · 699.3K views
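For concreteness, a toy sketch of the three reward variants compared above, written as drop-in RLVR reward functions. This is hypothetical scaffolding, not the paper's code, and the answer extractor is a placeholder heuristic.

```python
# Toy reward functions for the three RLVR variants in the tweet.
import random
import re

def extract_final_answer(completion: str) -> str:
    # Placeholder heuristic: take the last number in the completion.
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else ""

def ground_truth_reward(completion: str, answer: str) -> float:
    # Standard RLVR: reward 1.0 only for a verified-correct final answer.
    return 1.0 if extract_final_answer(completion) == answer else 0.0

def incorrect_reward(completion: str, answer: str) -> float:
    # Spurious variant: reward only WRONG answers.
    return 1.0 if extract_final_answer(completion) != answer else 0.0

def random_reward(completion: str, answer: str) -> float:
    # Spurious variant: ignore the completion entirely; a fair coin flip.
    return float(random.random() < 0.5)
```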
Mahan Fathi retweeted
Shashwat Goel @ShashwatGoel7
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below 🧵👇
33 replies · 120 retweets · 873 likes · 323.4K views
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
NeMo RL is now open source! It replaces NeMo-Aligner and is the toolkit we use to post-train the next generations of our models. Give it a try: github.com/NVIDIA/NeMo-RL
5 replies · 64 retweets · 394 likes · 24.9K views
Mahan Fathi retweeted
Ross Goroshin @RGoroshin
The talk I gave @ Mila on learning linearized representations of dynamical systems (Koopman representations) is on YouTube. The work was mainly carried out by @MahanFathi in collaboration with @pierrelux 's lab, and was presented at ICLR 2024. youtube.com/watch?v=wKyN5j…
0 replies · 3 retweets · 20 likes · 4.6K views
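For readers new to Koopman representations, here is a minimal DMD-style sketch of the core idea (the textbook recipe; the ICLR 2024 work instead learns the observables): lift states with a feature map, then fit a single linear operator that advances the lifted state one step.

```python
# Koopman/DMD sketch: nonlinear dynamics become (approximately) linear
# in a lifted observable space.
import numpy as np

def lift(x):
    # Hand-picked observables, for illustration only.
    return np.stack([x, x**2, np.sin(x)], axis=-1)

# Trajectory of a simple nonlinear system: x_{t+1} = 0.9 x_t - 0.1 x_t^2.
xs = [0.7]
for _ in range(200):
    xs.append(0.9 * xs[-1] - 0.1 * xs[-1] ** 2)
xs = np.array(xs)

Z, Z_next = lift(xs[:-1]), lift(xs[1:])  # (T, 3) lifted states

# Least-squares Koopman approximation: Z_next ≈ Z @ K.
K, *_ = np.linalg.lstsq(Z, Z_next, rcond=None)
print("max one-step error in lifted space:", np.abs(Z @ K - Z_next).max())
```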
Mahan Fathi retweeted
Guillaume Lajoie @g_lajoie_
In-context learning (ICL) is one of the most exciting parts of the LLM boom. Sequence models (not just LLMs) implement on-the-fly models conditioned on inputs w/o weight updates! Q: are in-context models better than «in-weights» ones? A: sometimes ICL is better than standard opt.
Eric Elmoznino @EricElmoznino

Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_ arxiv.org/abs/2410.14086

0 replies · 5 retweets · 21 likes · 3.4K views
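One way to make the Occam's razor connection concrete is the prequential view the paper builds on: cumulative next-token log loss over a stream is a code length, so a predictor that adapts in context compresses the data better. A toy sketch follows (an illustration, not the paper's experiments).

```python
# Prequential coding sketch: online next-symbol loss = bits to encode the stream.
import numpy as np

rng = np.random.default_rng(0)
stream = rng.choice(2, size=200, p=[0.2, 0.8])  # biased coin, p(1) = 0.8

bits = 0.0
ones = 0
for t, x in enumerate(stream):
    p1 = (ones + 1) / (t + 2)  # online estimate from the prefix (Laplace smoothing)
    bits += -np.log2(p1 if x == 1 else 1.0 - p1)  # bits to encode x given the prefix
    ones += x

print(f"{bits:.1f} bits total, {bits / stream.size:.3f} bits/symbol "
      f"(source entropy ≈ 0.722 bits/symbol)")
```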
Mahan Fathi @MahanFathi
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
33 replies · 4 retweets · 404 likes · 40.8K views