

Rupesh Srivastava

@rupspace
Fully open LLM frontiers @MBZUAI IFM Silicon Valley. Previously (co)developed Highway Networks, Upside-Down RL, Bayesian Flow Networks, EvoTorch.




The main reason I don't like MoEs is just philosophical: I'm a big Occam's razor believer, and no one has computed the actual brain/money cost of going all in on MoE...









We introduce MoUE, a new MoE paradigm that boosts base-model performance by up to 1.3 points when trained from scratch and by up to 4.2 points on average, without increasing either activated or total parameters. The main idea is simple: a sufficiently wide MoE layer with recursive reuse can be treated as a strict generalization of standard MoE. arxiv.org/abs/2603.04971 huggingface.co/papers/2603.04… #MoE #LLM #MixtureOfExperts #SparseModels #ScalingLaws #Modularity #UniversalTransformers #RecursiveComputation #ContinualPretraining
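
A rough PyTorch sketch of the core idea as I read it, not the paper's actual architecture: the class names (ExpertFFN, RecursiveMoE) and the routing over (expert, depth) pairs below are my own illustration of how recursively reusing a small physical expert pool yields extra "virtual" experts without adding expert parameters, with depth 1 reducing to a standard top-k MoE.

```python
# Hypothetical sketch (not the paper's code): an MoE layer whose "virtual" experts
# are formed by applying a small pool of expert FFNs recursively. With recursion
# depth 1 this reduces to a standard top-k MoE, which is the sense in which
# recursive reuse strictly generalizes ordinary MoE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class RecursiveMoE(nn.Module):
    """Each routed 'virtual expert' is an (expert index, recursion depth) pair.

    n_experts physical experts * max_depth depths = n_experts * max_depth virtual
    experts, without adding expert parameters beyond the physical pool (only the
    router output grows).
    """
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, max_depth=2, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_experts))
        self.n_experts, self.max_depth, self.top_k = n_experts, max_depth, top_k
        self.router = nn.Linear(d_model, n_experts * max_depth)  # scores (expert, depth) pairs

    def forward(self, x):                              # x: [tokens, d_model]
        gates = self.router(x).softmax(dim=-1)         # [tokens, n_experts * max_depth]
        weights, idx = gates.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        outs = []
        for t in range(x.size(0)):                     # naive per-token loop for clarity
            y = torch.zeros_like(x[t])
            for slot in range(self.top_k):
                virtual = idx[t, slot].item()
                e, depth = virtual % self.n_experts, virtual // self.n_experts + 1
                h = x[t]
                for _ in range(depth):                 # recursive reuse of the same expert
                    h = self.experts[e](h)
                y = y + weights[t, slot] * h
            outs.append(y)
        return torch.stack(outs)

# max_depth=1 recovers a standard top-k MoE over the physical experts.
layer = RecursiveMoE()
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```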


Terminal-Bench is a leading benchmark for agents. Unfortunately, it's hard: most small coding agents get very low scores on TB2, so training/system ablations look flat and you can't tell what's working. Announcing OpenThoughts-TBLite: 100 curated TB2-style tasks, difficulty-calibrated so that even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance 🧵


Link to MathArena: matharena.ai/?view=problem Link to HF: huggingface.co/datasets/MathA…

i think we don't realize the impact that deepseek had on the open ecosystem, there is so much from them that you can find in almost every frontier open llm today
> most of the open frontier models follow the "finegrain + sparse + shared expert" deepseek moe recipe
> a lot of them use MLA
> first (with minicpm) to use sparse attention in prod (DSA)
> first to do reasoning in the open with R1
> GRPO, which is the foundation for most of the newer RL algorithms
> they also innovated on the training recipe at scale: first to do fp8? MTP? load balancing schemes that other labs are now using
> advanced training/inference infra with OSS releases like DeepEP that pretraining libs like Megatron use
i'm so grateful deepseek exists
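
For readers unfamiliar with that recipe, here is a minimal PyTorch sketch of what "finegrain + sparse + shared expert" means structurally. This is not DeepSeek's code; every class name and hyperparameter below is illustrative.

```python
# Illustrative sketch of the fine-grained + sparse + shared-expert MoE pattern:
# many small routed experts, only top_k activated per token, plus a shared expert
# that every token always passes through. Not DeepSeek's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up, self.down = nn.Linear(d_model, d_hidden), nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=512, n_routed=64, d_expert=128, top_k=6):
        super().__init__()
        # "finegrain": many narrow routed experts instead of a few wide ones
        self.routed = nn.ModuleList(FFN(d_model, d_expert) for _ in range(n_routed))
        # "shared expert": always-on path, no routing
        self.shared = FFN(d_model, d_expert)
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, d_model]
        scores = self.gate(x).softmax(dim=-1)
        w, idx = scores.topk(self.top_k, dim=-1)       # "sparse": only top_k experts fire per token
        outs = []
        for t in range(x.size(0)):                     # naive per-token loop for readability
            y = self.shared(x[t])                      # shared expert sees every token
            for k in range(self.top_k):
                y = y + w[t, k] * self.routed[idx[t, k].item()](x[t])
            outs.append(y)
        return torch.stack(outs)

moe = SharedExpertMoE()
print(moe(torch.randn(3, 512)).shape)  # torch.Size([3, 512])
```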




ChatGPT 5.2 with thinking. The mutated strawberry problem. 🍓



Was surprised to learn that only 20% of the compute was spent on pre-training for a frontier model. The rest is post-training.
