Mahan Fathi

75 posts

@MahanFathi

llm research @nvidia👁️; ex @googledeepmind, @google🧠 & @mila_quebec; FREE IRAN 🇮🇷.

Toronto, Ontario · Joined June 2011
162 Following · 969 Followers
Mahan Fathi retweeted
Artificial Analysis @ArtificialAnlys
NVIDIA has released Nemotron 3 Super, a 120B (12B active) open weights reasoning model that scores 36 on the Artificial Analysis Intelligence Index with a hybrid Mamba-Transformer MoE architecture. We were given access to this model ahead of launch and evaluated it across intelligence, openness, and inference efficiency.

Key takeaways:
➤ Combines high openness with strong intelligence: Nemotron 3 Super performs strongly for its size and is substantially more intelligent than any other model with comparable openness.
➤ Nemotron 3 Super scored 36 on the Artificial Analysis Intelligence Index, +17 points ahead of the previous Super release and +12 points from Nemotron 3 Nano. Compared to models in a similar size category, this places it ahead of gpt-oss-120b (33), but behind the recently released Qwen3.5 122B A10B (42).
➤ Focused on efficient intelligence: we found Nemotron 3 Super to have higher intelligence than gpt-oss-120b while enabling ~10% higher throughput per GPU in a simple but realistic load test.
➤ Supported today for fast serverless inference: providers including @DeepInfra and @LightningAI are serving this model at launch with speeds of up to 484 tokens per second.

Model details:
📝 Nemotron 3 Super has 120.6B total and 12.7B active parameters, along with a 1 million token context window and hybrid reasoning support. It is published with open weights and a permissive license, alongside open training data and methodology disclosure.
📐 The model has several design features enabling efficient inference, including hybrid Mamba-Transformer and LatentMoE architectures, multi-token prediction, and NVFP4 quantized weights.
🎯 NVIDIA pre-trained Nemotron 3 Super in (mostly) NVFP4 precision, but moved to BF16 for post-training. Our evaluation scores use the BF16 weights.
🧠 We benchmarked Nemotron 3 Super in its highest-effort reasoning mode ("regular"), the most capable of the model's three inference modes (reasoning-off, low-effort, and regular).
20 replies · 62 retweets · 485 likes · 93.2K views
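Since the launch providers named above expose OpenAI-compatible endpoints, a first call might look like the sketch below. The model id string is a placeholder assumption rather than a confirmed identifier, and selection among the three reasoning modes (reasoning-off, low-effort, regular) is provider-specific, so it is left at the default here.

```python
# A rough sketch only. Assumptions (not confirmed by the post): the exact
# model id, and that the provider's default reasoning mode is acceptable.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super",  # hypothetical id; check the provider's model list
    messages=[{"role": "user", "content": "Summarize the Mamba architecture in two sentences."}],
)
print(resp.choices[0].message.content)
```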
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
Nemotron 3 Super is here — 120B total / 12B active, Hybrid SSM Latent MoE, designed for Blackwell. Truly open: permissive license, open data, open training infra. See analysis on @ArtificialAnlys. Details in thread 🧵 below:
10 replies · 45 retweets · 276 likes · 29.1K views
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
🚀 Nemotron 3 Nano 30B-A3B is here! Open weights + open data + open source. AA Intelligence Index: 52 (@ArtificialAnlys)
✅ 1M‑token context
✅ up to 3.3× higher throughput vs similarly sized open models
✅ stronger reasoning/agentic + chat
Details + links in the thread 🧵
3 replies · 33 retweets · 144 likes · 10.2K views
Mahan Fathi @MahanFathi
We're looking for Summer Interns to join the Post-Training Team at @NVIDIA! DM me with your updated resume and three concise bullets detailing your most relevant experience — e.g. publications, repos, blogs, etc. RT please to help us find top talent.
14 replies · 34 retweets · 468 likes · 31.2K views
Mahan Fathi @MahanFathi
@jxmnop @Miles_Brundage the RL updates are indeed small in the sense that they are *sparse*, however they are still *full-rank* (see arxiv.org/abs/2505.11711). this means that LoRA would fail to recover the true base model even if we assume access to the post-training data.
0 replies · 0 retweets · 2 likes · 156 views
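To see why sparse does not imply low-rank, consider this minimal numpy sketch (an illustration, not code from the linked paper): an update touching only ~1% of entries still tends to be full-rank, which is exactly what a rank-r LoRA adapter with r << d cannot represent exactly.

```python
# Sparse != low-rank: a weight update can touch very few entries yet still
# span the full space, so no small-rank factorization recovers it exactly.
import numpy as np

rng = np.random.default_rng(0)
d = 1024
delta = np.zeros((d, d))

# Sparse update: modify ~1% of entries, scattered uniformly at random.
idx = rng.choice(d * d, size=d * d // 100, replace=False)
delta.flat[idx] = rng.standard_normal(idx.size)

print("nonzero fraction:", np.count_nonzero(delta) / delta.size)  # ~0.01
print("rank:", np.linalg.matrix_rank(delta), "of", d)  # typically d, or within a few of it
```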
dr. jack morris @jxmnop
@Miles_Brundage theoretically, because the update is so low-rank
empirically, because the generations have nothing to do with the training data
e.g. I didn’t train the model to output Harry Potter; somehow it knew that already
8 replies · 0 retweets · 161 likes · 17.1K views
dr. jack morris @jxmnop
OpenAI hasn’t open-sourced a base model since GPT-2 in 2019. they recently released GPT-OSS, which is reasoning-only... or is it? turns out that underneath the surface, there is still a strong base model. so we extracted it. introducing gpt-oss-20b-base 🧵
163 replies · 446 retweets · 6.1K likes · 928.7K views
Mahan Fathi @MahanFathi
@StellaLisy cool paper with really surprising results. but looking at the abstract, Qwen2.5-Math-7B gets ~70% on MATH500 right off the shelf when no template is used, which is about as much improvement as you achieve with random rewards. this is due to a mismatch between the train/test time templates.
0 replies · 0 retweets · 1 like · 204 views
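A minimal sketch of the mismatch being pointed at, assuming the Hugging Face tokenizer for an instruct-tuned Qwen variant ships a chat template (the model id below is illustrative): the same question reaches the model as two very different strings depending on whether the training-time template is applied at eval time.

```python
# Template mismatch sketch: raw prompt vs. training-time chat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")  # illustrative id
question = "What is 12 * 13?"

# Eval WITHOUT a template: the raw question, as a base model might be prompted.
raw_prompt = question

# Eval WITH the training-time template: role tags wrap the same question.
templated_prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(raw_prompt))
print(repr(templated_prompt))
# If the pre-RL baseline is scored with the wrong prompt format, it is
# understated, and any reward (even a random one) can look like a big gain.
```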
Stella Li @StellaLisy
🤯 We cracked RLVR with... Random Rewards?!
Training Qwen2.5-Math-7B with our Spurious Rewards improved MATH-500 by:
- Random rewards: +21%
- Incorrect rewards: +25%
- (FYI) Ground-truth rewards: +28.8%
How could this even work⁉️ Here's why: 🧵
Blogpost: tinyurl.com/spurious-rewar…
72 replies · 331 retweets · 1.8K likes · 699.3K views
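For concreteness, a toy sketch of the three reward variants compared above, written as drop-in RLVR reward functions. This is hypothetical scaffolding, not the paper's code, and the answer extractor is a placeholder heuristic.

```python
# Toy reward functions for the three RLVR variants in the tweet.
import random
import re

def extract_final_answer(completion: str) -> str:
    # Placeholder heuristic: take the last number in the completion.
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return nums[-1] if nums else ""

def ground_truth_reward(completion: str, answer: str) -> float:
    # Standard RLVR: reward 1.0 only for a verified-correct final answer.
    return 1.0 if extract_final_answer(completion) == answer else 0.0

def incorrect_reward(completion: str, answer: str) -> float:
    # Spurious variant: reward only WRONG answers.
    return 1.0 if extract_final_answer(completion) != answer else 0.0

def random_reward(completion: str, answer: str) -> float:
    # Spurious variant: ignore the completion entirely; a fair coin flip.
    return float(random.random() < 0.5)
```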
Mahan Fathi retweeted
Shashwat Goel @ShashwatGoel7
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled the discrepancies in a blog below 🧵👇
33 replies · 120 retweets · 873 likes · 323.4K views
Mahan Fathi retweeted
Oleksii Kuchaiev @kuchaev
NeMo RL is now open source! It replaces NeMo-Aligner and is the toolkit we use to post-train the next generations of our models. Give it a try: github.com/NVIDIA/NeMo-RL
5 replies · 64 retweets · 394 likes · 24.9K views
Mahan Fathi retweeted
Ross Goroshin @RGoroshin
The talk I gave @ Mila on learning linearized representations of dynamical systems (Koopman representations) is on YouTube. The work was mainly carried out by @MahanFathi in collaboration with @pierrelux 's lab, and was presented at ICLR 2024. youtube.com/watch?v=wKyN5j…
0 replies · 3 retweets · 20 likes · 4.6K views
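For readers new to Koopman representations, here is a minimal DMD-style sketch of the core idea (the textbook recipe; the ICLR 2024 work instead learns the observables): lift states with a feature map, then fit a single linear operator that advances the lifted state one step.

```python
# Koopman/DMD sketch: nonlinear dynamics become (approximately) linear
# in a lifted observable space.
import numpy as np

def lift(x):
    # Hand-picked observables, for illustration only.
    return np.stack([x, x**2, np.sin(x)], axis=-1)

# Trajectory of a simple nonlinear system: x_{t+1} = 0.9 x_t - 0.1 x_t^2.
xs = [0.7]
for _ in range(200):
    xs.append(0.9 * xs[-1] - 0.1 * xs[-1] ** 2)
xs = np.array(xs)

Z, Z_next = lift(xs[:-1]), lift(xs[1:])  # (T, 3) lifted states

# Least-squares Koopman approximation: Z_next ≈ Z @ K.
K, *_ = np.linalg.lstsq(Z, Z_next, rcond=None)
print("max one-step error in lifted space:", np.abs(Z @ K - Z_next).max())
```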
Mahan Fathi retweeted
Guillaume Lajoie @g_lajoie_
In-context learning (ICL) is one of the most exciting parts of the LLM boom. Sequence models (not just LLMs) implement on-the-fly models conditioned on inputs w/o weight updates! Q: are in-context models better than «in-weights» ones? A: sometimes ICL is better than standard opt.
Eric Elmoznino @EricElmoznino

Introducing our new paper explaining in-context learning through the lens of Occam’s razor, giving a normative account of next-token prediction objectives. This was with @Tom__Marty @tejaskasetty @le0gagn0n @sarthmit @MahanFathi @dhanya_sridhar @g_lajoie_ arxiv.org/abs/2410.14086

0 replies · 5 retweets · 21 likes · 3.4K views
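One way to make the Occam's razor connection concrete is the prequential view the paper builds on: cumulative next-token log loss over a stream is a code length, so a predictor that adapts in context compresses the data better. A toy sketch follows (an illustration, not the paper's experiments).

```python
# Prequential coding sketch: online next-symbol loss = bits to encode the stream.
import numpy as np

rng = np.random.default_rng(0)
stream = rng.choice(2, size=200, p=[0.2, 0.8])  # biased coin, p(1) = 0.8

bits = 0.0
ones = 0
for t, x in enumerate(stream):
    p1 = (ones + 1) / (t + 2)  # online estimate from the prefix (Laplace smoothing)
    bits += -np.log2(p1 if x == 1 else 1.0 - p1)  # bits to encode x given the prefix
    ones += x

print(f"{bits:.1f} bits total, {bits / stream.size:.3f} bits/symbol "
      f"(source entropy ≈ 0.722 bits/symbol)")
```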
Mahan Fathi @MahanFathi
life update: thrilled to announce that i’ll be joining @nvidia as a research scientist on the alignment team. grateful for the support from mentors and peers. this is a dream come true for both the researcher and the gamer in me!
33 replies · 4 retweets · 404 likes · 40.8K views