Gowtham Ramesh

172 posts

@gowtham_ramesh1

Applied RS GenAI @AMD | Ex - Student Researcher @GoogleAI NYC, @WisconsinCS, @ai4bharat, @iitmadras

San Jose, CA · Joined October 2016
1.5K Following · 439 Followers
Gowtham Ramesh retweeted
LMSYS Org @lmsysorg
🚀 New blog: ROCm Support for Miles: Large-Scale RL Post-Training on AMD Instinct™ GPUs Together with @AMD, Miles brings end-to-end RL pipelines to MI300/350-class clusters: ⚡️ Rollout generation dominates RL compute, and AMD’s HBM bandwidth directly addresses this bottleneck 🧠 AIME accuracy improved from 0.665 → 0.729 across training on Qwen3-30B-A3B with GRPO 💾 MI300X delivers ~1.1–1.3k tok/GPU/s rollout throughput ⏱️ Mean step time 388.5s on a single 8-GPU MI300X node (32×8 sampling, 8k response cap) 🔧 Multi-turn agentic training validated ... and more optimizations to come 🔥
3
18
66
12.7K
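As a rough sanity check on the LMSYS figures above, the sketch below estimates how long rollout generation alone could take per step. It assumes that "32×8 sampling" means 32 prompts with 8 samples each, that every response reaches the 8k cap (taken here as 8,192 tokens), and that the quoted per-GPU throughput holds across all 8 GPUs. None of these readings is confirmed by the post, so the result is only an upper-bound estimate.

```python
def rollout_seconds(prompts, samples_per_prompt, max_tokens, tok_per_gpu_s, gpus):
    """Upper-bound time to generate one step's rollouts, in seconds."""
    # Assumption: every sample uses the full response cap.
    total_tokens = prompts * samples_per_prompt * max_tokens
    return total_tokens / (tok_per_gpu_s * gpus)


# Assumed reading of the post: 32 prompts x 8 samples, 8,192-token cap, 8 GPUs.
fast = rollout_seconds(32, 8, 8192, tok_per_gpu_s=1300, gpus=8)
slow = rollout_seconds(32, 8, 8192, tok_per_gpu_s=1100, gpus=8)

MEAN_STEP_SECONDS = 388.5  # figure quoted in the post
print(f"rollout upper bound: {fast:.0f}s to {slow:.0f}s")
print(f"share of mean step: {fast / MEAN_STEP_SECONDS:.0%} to {slow / MEAN_STEP_SECONDS:.0%}")
```

Under these assumptions the rollout phase comes to roughly 200 to 240 seconds, which is more than half of the quoted mean step time and fits the post's point that rollout generation dominates RL compute.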
Gowtham Ramesh retweeted
Almanac @thinkwithalma
You’ve curated the sources. You’ve researched everything. You know exactly what you want to say. You just can't get it out of your head and onto the page. We are building Almanac exactly for this. Experience the beta version here: try.almanac.so Here are a few things you can do with Almanac👇
17
39
123
45.4K
Gowtham Ramesh retweeted
AI4Bharat @ai4bharat
For AI to be truly inclusive, it must understand more than just grammar; it must understand context. @AI4Bharat at @iitmadras has launched the Indic LLM Arena. This isn't just another leaderboard; it’s a public utility for:
✅ Developers: Test your models against real-world Indian use cases.
✅ Enterprises: Find out which LLM actually resonates with your customers in rural India.
✅ Sovereignty: Building AI that respects our social fabric and safety norms.
Be a part of this movement. Try the Arena today and help us rank the models that will power India's digital future. 👉 ai4bharat.iitm.ac.in/blog/indic-llm… #GenerativeAI #DigitalIndia #IITMadras #IndicLLM #indiaaiimpactsummit2026 @MiteshKhapra @anoopk @prajdabre @ravi_iitm @partha_p_t @ManishGuptaMG1 @meghtweets @dineshteewari1 @abapna @WSAI_IITM @OfficialINDIAai @EkStep_Org @PeoplePlusAI
8
38
238
21.9K
Gowtham Ramesh retweeted
Rosinality @rosinality
Sometimes a simple baseline with clip higher just works. But when?
4
35
283
14.7K
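For context, "clip higher" usually refers to the asymmetric clipping range introduced with DAPO: the upper bound on the policy ratio is looser than the lower bound, so tokens with positive advantage are allowed to grow somewhat more before the update is cut off. The sketch below is a minimal per-token version of that surrogate; the epsilon values are illustrative defaults, not settings taken from the tweet.

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token with an asymmetric clip range."""
    # Clip the probability ratio to [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# A ratio of 1.25 with positive advantage survives the looser upper bound...
print(clipped_objective(1.25, 1.0))
# ...but is cut at 1.2 under a symmetric 0.2 range.
print(clipped_objective(1.25, 1.0, eps_high=0.2))
```

Setting `eps_high` equal to `eps_low` recovers the symmetric PPO clip, which makes it easy to compare the two baselines directly.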
Gowtham Ramesh retweeted
Nathan Lambert @natolambert
We present Olmo 3, our next family of fully open, leading language models. This family of 7B and 32B models represents:
1. The best 32B base model.
2. The best 7B Western thinking & instruct models.
3. The first 32B (or larger) fully open reasoning model.

This is a big milestone for Ai2 and the Olmo project. These aren’t huge models (more on that later), but it’s crucial for the viability of fully open-source models that they are competitive on performance – not just replications of models that came out 6 to 12 months ago. As always, all of our models come with full training data, code, intermediate checkpoints, training logs, and a detailed technical report. All are available today, with some more additions coming before the end of the year.

As with OLMo 2 32B at its release, Olmo 3 32B is the best open-source language model ever released. It’s an awesome privilege to get to provide these models to the broader community researching and understanding what is happening in AI today.

Base models – a strong foundation

Pretraining’s demise is now regularly overstated. 2025 has marked a year where the entire industry rebuilt their training stack to focus on reasoning and agentic tasks, but some established base model sizes haven’t seen a new leading model since @alibaba_qwen's Qwen 2.5 in 2024. The Olmo 3 32B base model could be our most impactful artifact here, as Qwen3 did not release their 32B base model (likely for competitive reasons). We show that our 7B recipe competes with Qwen 3, and the 32B size enables a starting point for strong reasoning models or specialized agents. Our base model’s performance is in the same ballpark as Qwen 2.5, surpassing the likes of Stanford’s Marin (@stanfordAILab) and Gemma 3 (@GoogleDeepMind), but with pretraining data and code available, it should be more accessible to the community to learn how to finetune it (and be confident in our results). We’re excited to see the community take Olmo 3 32B base in many directions.
32B is a loved size for easy deployment on single 80GB+ memory GPUs and even on many laptops, like the MacBook I’m writing this on.

A model flow – the lifecycle of creating a model

With these strong base models, we’ve created a variety of post-training checkpoints to showcase the many ways post-training can be done to suit different needs. We’re calling this a “Model Flow.” For post-training, we’re releasing:
- Instruct versions – short, snappy, intelligent, and useful especially for synthetic data en masse (e.g. recent work by Datology @datologyai on OLMo 2 Instruct),
- Think versions – thoughtful reasoners with the performance you expect from a leading thinking model on math, code, etc., and
- RL Zero versions – controlled experiments for researchers understanding how to build post-training recipes that start with large-scale RL on the base model.

The first two post-training recipes are distilled from a variety of leading, open and closed, language models. At the 32B and smaller scale, direct distillation with further preference finetuning and reinforcement learning with verifiable rewards (RLVR) is becoming an accessible and highly capable pipeline. Our post-training recipe follows our recent models: 1) create an excellent SFT set, 2) use direct preference optimization (DPO) as a highly iterable, cheap, and stable preference learning method despite its critics, and 3) finish up with scaled up RLVR. All of these stages confer meaningful improvements on the models’ final performance.

Instruct models – low latency workhorse

Instruct models today are often somewhat forgotten, but the likes of @aiatmeta Llama 3.1 Instruct and smaller, concise models are some of the most adopted open models of all time. The instruct models we’re building are a major polishing and evolution of the Tülu 3 pipeline – you’ll see many similar datasets and methods, but with pretty much every datapoint or training code being refreshed.
Olmo 3 Instruct should be a clear upgrade on Llama 3.1 8B, representing the best 7B scale model from a Western or American company. As scientists we don’t like to condition the quality of our work based on its geographic origins, but this is a very real consideration to many enterprises looking to open models as a solution for trusted AI deployments with sensitive data.

Building a thinking model

What people have most likely been waiting for are our thinking or reasoning models, both because every company needs to have a reasoning model in 2025, but also to clearly open the black box for the most recent evolution of language models. Olmo 3 Think, particularly the 32B, are flagship models of this release, where we considered what would be best for a reasoning model at every stage of training.

Extensive effort (ask me IRL about more war stories) went into every stage of the post-training of the Think models. We’re impressed by the magnitude of gains that can be achieved in each stage – neither SFT nor RL is all you need at these intermediate model scales.

First we built an extensive reasoning dataset for supervised finetuning (SFT), called Dolci-Think-SFT, building on very impactful open projects like OpenThoughts3, Nvidia’s Nemotron Post-training, Prime Intellect’s SYNTHETIC-2, and many more open prompt sources we pulled forward from Tülu 3 / OLMo 2. Datasets like this are often some of our most impactful contributions (see the Tülu 3 dataset as an example in Thinking Machines’ Tinker :D @thinkymachines @tinker_api – please add Dolci-Think-SFT too, and Olmo 3 while you’re at it, the architecture is very similar to Qwen which you have).

For DPO with reasoning, we converged on a method very similar to Hugging Face’s (@huggingface) SmolLM 3, with Qwen3 32B as the chosen model and Qwen3 0.6B as the rejected. Our intuition is that the delta between the chosen and rejected samples is what the model learns from, rather than the overall quality of the chosen answer alone.
These two models provide a very consistent delta, which provides way stronger gains than expected. Same goes for the Instruct model. It is likely that DPO is helping the model converge on more stable reasoning strategies and softening the post-SFT model, as seen by large gains even on frontier evaluations such as AIME. Our DPO approach was an expansion of Geng, Scott, et al. "The delta learning hypothesis: Preference tuning on weak data can yield strong gains." arXiv preprint arXiv:2507.06187 (2025). Many early open thinking models that were also distilled from larger, open-weight thinking models likely left a meaningful amount of performance on the table by not including this stage.

Finally, we turn to the RL stage. Most of the effort here went into building effective infrastructure to be able to run stable experiments with the long generations of larger language models. This was an incredible team effort to be a small part of, and reflects work ongoing at many labs right now. Most of the details are in the paper, but our details are a mixture of ideas that have been shown already like ServiceNow’s PipelineRL or algorithmic innovations like DAPO and Dr. GRPO. We have some new tricks too!

Some of the exciting contributions of our RL experiments are 1) what we call “active refilling”, which is a way of keeping the generations from the learner nodes constantly flowing until there’s a full batch of completions with nonzero gradients (groups with equal advantages contribute none) – a major advantage of our asynchronous approach; and 2) cleaning, documenting, decontaminating, mixing, and proving out the large swaths of work done by the community over the last months.

The result is an excellent model that we’re very proud of. It has very strong reasoning benchmarks (AIME, GPQA, etc.) while also being stable, quirky, and fun in chat with excellent instruction following. The 32B range is largely devoid of non-Qwen competition.
The scores for both of our Thinkers get within 1-2 points overall of their respective Qwen3 8/32B models – we’re proud of this!

A very strong 7B scale, Western thinking model is Nvidia’s (@NVIDIAAI) NVIDIA-Nemotron-Nano-9B-v2 hybrid model. It came out months ago and is extremely strong. I personally suspect it may be due to the hybrid architecture making subtle implementation bugs in popular libraries, but who knows.

All in, the Olmo 3 Think recipe gives us a lot of excitement for new things to try in 2026.

RL Zero

DeepSeek R1 showed us a way to new post-training recipes for frontier models, starting with RL on the base model rather than a big SFT stage (yes, I know about cold-start SFT and so on, but that’s an implementation detail). We used RL on base model as a core feedback cycle when developing the model, such as during intermediate midtraining mixing. This is viewed now as a fundamental, largely innate, capability of the base model.

To facilitate further research on RL Zero, we released 4 datasets and a series of checkpoints, showing per-domain RL Zero performance on our 7B model for data mixes focused on math, code, instruction following, and all mixed together.

In particular, we’re excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative). This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination particularly on math and reasoning benchmarks (see Shao, Rulin, et al. "Spurious rewards: Rethinking training signals in rlvr." arXiv preprint arXiv:2506.10947 (2025). or Wu, Mingqi, et al. "Reasoning or memorization? unreliable results of reinforcement learning due to data contamination." arXiv preprint arXiv:2507.10532 (2025).)
What’s next

This is the biggest project we’ve ever taken on at Ai2 (@allen_ai), with 60+ authors and numerous other support staff. In building and observing “thinking” and “instruct” models coming today, it is clear to us that there’s a very wide variety of models that fall into both of these buckets. The way we view it is that thinking and instruct characteristics are on a spectrum, as measured by the number of tokens used per evaluation task. In the future we’re excited to view this thinking budget as a trade-off, and build models that serve different use-cases based on latency/throughput needs.

As for a list of next models or things we’ll build, we can give you a list of things you’d expect from a (becoming) frontier lab: MoEs, better character training, Pareto-efficient instruct vs think, scale, specialized models we actually use at Ai2 internally, and all the normal things.

This is one small step towards what I see as a success for my ATOM project. We thank you for all your support of our work at Ai2. We have a lot of work to do. We’re going to be hunting for top talent at NeurIPS to help us scale up our Olmo team in 2026.

This post in full also appears on Interconnects – the full links to the artifacts and paper are below. Moo, moo, rawr!
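To make the DPO stage described in the post more concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It takes summed sequence log-probabilities from the policy and from a frozen reference model. The post's setup would draw the chosen response from a stronger model (Qwen3 32B) and the rejected one from a weaker model (Qwen3 0.6B); the numbers below are made up for illustration, and `beta=0.1` is an assumed value rather than one reported in the post.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probs: the policy has moved toward the chosen answer.
print(dpo_loss(pol_chosen=-10.0, pol_rejected=-30.0, ref_chosen=-12.0, ref_rejected=-20.0))
# Same chosen log-probs, but the policy now favours the rejected answer more than the reference.
print(dpo_loss(pol_chosen=-10.0, pol_rejected=-14.0, ref_chosen=-12.0, ref_rejected=-20.0))
```

The loss depends only on the difference between the chosen and rejected terms, which is one way to read the post's intuition that the delta between the two samples, not the absolute quality of the chosen answer, carries the learning signal.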
99
358
2.2K
501.2K
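The "active refilling" idea in the post above can be illustrated with a toy loop: keep drawing groups of sampled completions and discard any group whose rewards are all equal, since equal rewards give zero group-relative advantage and therefore no gradient. This is a synchronous, single-process simplification written for illustration; the post describes an asynchronous system, and `sample_group` is a hypothetical stand-in for the real generation step.

```python
import random


def has_signal(rewards):
    """A group is useful only if its rewards are not all identical."""
    return max(rewards) != min(rewards)


def fill_batch(sample_group, batch_size, max_draws=10_000):
    """Draw groups until `batch_size` of them carry a learning signal."""
    batch, draws = [], 0
    while len(batch) < batch_size and draws < max_draws:
        rewards = sample_group()  # hypothetical: one prompt's sampled rewards
        draws += 1
        if has_signal(rewards):
            batch.append(rewards)
    return batch, draws


# Toy generator: 4 samples per prompt with random binary rewards.
rng = random.Random(0)
batch, draws = fill_batch(lambda: [rng.choice([0.0, 1.0]) for _ in range(4)], batch_size=8)
print(f"kept {len(batch)} groups out of {draws} draws")
```

The `max_draws` guard is an added safety limit for the toy version so the loop cannot spin forever if every group is uniformly right or wrong.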
Gowtham Ramesh retweeted
Tongyi Lab @Ali_TongyiLab
We are excited to release AgentEvolver, an open-source, self-evolving agent system from Tongyi Lab. AgentEvolver integrates three synergistic mechanisms (Self-Questioning, Self-Navigating, and Self-Attributing) to systematically address critical bottlenecks in Agent RL training, including task scarcity, inefficient exploration, and low sample utilization. This framework guides agents to shift from a mode of "being trained" to a new paradigm of "self-evolving".
15
116
790
538.7K
Gowtham Ramesh retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Defeating the Training-Inference Mismatch via FP16

Quick summary: A big problem in RL LLM training is that typical policy gradient methods expect the model generating the rollouts and the model being trained to be exactly the same... but when you have a separate inference server with its own optimizations (like vLLM), this is often hard to ensure due to numerical differences, leading to training instability and performance degradation. This problem is known as the training-inference mismatch. Some fixes proposed include using importance sampling and removing nondeterminism from the inference server. (I've actually been doing a lot of reading about this challenge and have been meaning to write up a post about it lol)

Turns out that switching from BF16 (which is what most people have been using these days) to FP16 virtually eliminates the training-inference mismatch! "With more mantissa bits, FP16 offers higher numerical precision, making results less sensitive to the implementation differences between training and inference." Love how simple this trick is!
11
34
214
47.7K
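The mantissa-bit argument quoted above can be checked with a small standard-library experiment. FP16 keeps 10 mantissa bits while BF16 keeps 7, so rounding the same value to each format gives different errors. The sketch uses `struct`'s half-precision format for FP16 and emulates BF16 by rounding a float32 bit pattern to its top 16 bits; the emulation is an illustration and ignores special cases such as NaN.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision (10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16 (7 mantissa bits) by rounding float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


p = 0.3  # e.g. a token probability
print("fp16 error:", abs(to_fp16(p) - p))
print("bf16 error:", abs(to_bf16(p) - p))
```

For this value the BF16 rounding error is more than ten times the FP16 error, which gives a feel for why two implementations that differ only in operation order can drift further apart in BF16.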
Gowtham Ramesh retweeted
Caiming Xiong @CaimingXiong
🚀🚀🚀 Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels We at @SFResearch build an automated pipeline that converts raw web text into verifiable QA pairs, filtered and verified by LLMs, then use Group Relative Policy Optimization (GRPO) to train models directly on this reward-driven data. The result: models trained on Webscale-RL outperform continual pretraining and data-refinement baselines — while using up to 100× fewer tokens. The gains are most pronounced in reasoning, math, and factual QA tasks. Beyond benchmarks, the key shift is conceptual: RL is no longer just a post-training alignment trick — it’s becoming a core optimization stage inside the LLM pretraining loop. This points toward a future of mid-training RL, where large-scale synthetic or automatically verified datasets provide structured reward signals long before human feedback fine-tuning. 🧩 Webscale-RL hints at a new pretraining paradigm — one that learns not just from text, but from reward. Paper: bit.ly/3IFuMhf Code: bit.ly/42AVpdX Data: bit.ly/4h5lVBS
1
17
85
11.9K
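Since the post above relies on GRPO with verifiable QA rewards, a short sketch of the group-relative advantage may help. For each question, several answers are sampled and scored, and each answer's advantage is its reward normalized by the group's mean and standard deviation. The use of the population standard deviation and the small `eps` are assumptions here; implementations differ on both details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize one question's sampled-answer rewards within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one verifiable question: two correct, two wrong.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# If every answer gets the same reward, all advantages are zero.
print(group_advantages([1.0, 1.0, 1.0]))
```

The second call shows why questions that are always answered correctly (or always incorrectly) contribute no gradient under this scheme.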
Gowtham Ramesh retweeted
Yuandong Tian @tydsh
Several of my team members + myself are impacted by this layoff today. Welcome to connect :)
468
266
6.4K
4.4M
Gowtham Ramesh retweeted
Feng Yao @fengyao1909
🆕 𝐔𝐩𝐝𝐚𝐭𝐞!! A few more findings on 𝐑𝐨𝐥𝐥𝐨𝐮𝐭–𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐌𝐢𝐬𝐦𝐚𝐭𝐜𝐡. 🧵👇
(1/3) In Part I, we discuss 𝐭𝐨𝐤𝐞𝐧–𝐥𝐞𝐯𝐞𝐥 and 𝐬𝐞𝐪𝐮𝐞𝐧𝐜𝐞–𝐥𝐞𝐯𝐞𝐥 policy gradients. While 𝐢𝐝𝐞𝐧𝐭𝐢𝐜𝐚𝐥 for the classic REINFORCE algorithm, they 𝐝𝐢𝐯𝐞𝐫𝐠𝐞 for trust region methods like PPO in a meaningful way, and 𝐛𝐨𝐭𝐡 𝐚𝐫𝐞 𝐫𝐞𝐚𝐬𝐨𝐧𝐚𝐛𝐥𝐞.
(2/3) In Part II, we show a 𝐬𝐭𝐫𝐨𝐧𝐠 𝐜𝐨𝐫𝐫𝐞𝐥𝐚𝐭𝐢𝐨𝐧 between the average 𝐥𝐞𝐚𝐫𝐧𝐞𝐫–𝐬𝐚𝐦𝐩𝐥𝐞𝐫 𝐦𝐢𝐬𝐦𝐚𝐭𝐜𝐡 and average 𝐨𝐮𝐭𝐩𝐮𝐭 𝐩𝐞𝐫𝐭𝐮𝐫𝐛𝐚𝐭𝐢𝐨𝐧. Further analysis sheds light on filtering large-mismatch instances.
(3/3) We also show how the 𝐬𝐮𝐫𝐫𝐨𝐠𝐚𝐭𝐞 𝐨𝐛𝐣𝐞𝐜𝐭𝐢𝐯𝐞 in trust region algorithms (like PPO), which acts as a lower bound for policy improvements, 𝐜𝐚𝐧 𝐛𝐞 𝐚𝐝𝐚𝐩𝐭𝐞𝐝 to account for this mismatch.
🔗 Read the update: Part I: fengyao.notion.site/pg-seq-token-p… Part II: fengyao.notion.site/pg-seq-token-p…
Feng Yao @fengyao1909

Failing on 𝐥𝐚𝐫𝐠𝐞-𝐬𝐜𝐚𝐥𝐞 𝐑𝐋 with VeRL? ⚠️ Mixing inference backends (𝐯𝐋𝐋𝐌/𝐒𝐆𝐋𝐚𝐧𝐠) with training backends (𝐅𝐒𝐃𝐏/𝐌𝐞𝐠𝐚𝐭𝐫𝐨𝐧) 𝐬𝐞𝐜𝐫𝐞𝐭𝐥𝐲 𝐭𝐮𝐫𝐧𝐬 𝐲𝐨𝐮𝐫 𝐑𝐋 𝐢𝐧𝐭𝐨 𝐨𝐟𝐟-𝐩𝐨𝐥𝐢𝐜𝐲, even if they share the same weights! 📉 Blog: fengyao.notion.site/off-policy-rl 💻 Code: github.com/yaof20/verl/tr…

5
38
225
26K
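One commonly discussed way to account for the rollout-training mismatch described above is a per-token importance weight: the ratio between the probability the training backend assigns to a sampled token and the probability the inference backend assigned when sampling it, truncated at a cap to limit variance. The sketch below shows only that weight computation; the cap value and function name are illustrative assumptions, and how the weights enter the loss depends on the algorithm.

```python
import math


def truncated_is_weights(learner_logps, sampler_logps, cap=2.0):
    """Per-token importance weights min(pi_learner / pi_sampler, cap)."""
    return [
        min(math.exp(lp_learner - lp_sampler), cap)
        for lp_learner, lp_sampler in zip(learner_logps, sampler_logps)
    ]


# Token 1: both backends agree. Token 2: the sampler gave it far lower probability.
weights = truncated_is_weights([-1.0, -0.1], [-1.0, -3.0], cap=2.0)
print(weights)
```

When the two backends agree exactly every weight is 1 and the update reduces to the on-policy case, so the correction only has an effect where the mismatch actually occurs.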
Gowtham Ramesh retweeted
Jiawei Zhao @jiawzhao
We’ve always assumed stale and off-policy data hurts RL a lot — but our latest work shows the opposite. 🧠 M2PO (Second-Moment Trust Policy Optimization) reveals that even data stale by 256 model updates can train LLMs as effectively as on-policy RL, unlocking scalable and asynchronous RL scenarios. We also discovered a “Prosperity-before-Collapse” phenomenon: training without a trust region can temporarily outperform on-policy RL before divergence — suggesting stale data is surprisingly informative. M2PO tackles this by using a second-moment constraint and token-level masking, stabilizing off-policy learning while retaining high-entropy, high-value tokens that drive progress. 🚀 Stable. Efficient. Asynchronous. 🔗 Blog: m2po.notion.site/rl-stale-m2po 📄 Paper: arxiv.org/abs/2510.01161 💻 Code: github.com/Infini-AI-Lab/…
Infini-AI-Lab @InfiniAILab

🤔Can we train RL on LLMs with extremely stale data? 🚀Our latest study says YES! Stale data can be as informative as on-policy data, unlocking more scalable, efficient asynchronous RL for LLMs. We introduce M2PO, an off-policy RL algorithm that keeps training stable and performant even when using data stale by 256 model updates. 🔗 Notion Blog: m2po.notion.site/rl-stale-m2po 📄 Paper: arxiv.org/abs/2510.01161 💻 GitHub: github.com/Infini-AI-Lab/… 🧵 1/4

2
21
131
15.9K
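The tweets above do not spell out M2PO's constraint, so the following is only one plausible reading of "second-moment constraint and token-level masking", written for illustration and not taken from the paper or its code. It treats the second moment as the mean squared log-ratio between the current and stale policies, and greedily masks the most extreme tokens until that mean falls below a threshold. The threshold value and the greedy order are both assumptions.

```python
def second_moment_mask(log_ratios, threshold=0.04):
    """Return keep-flags so that mean(log_ratio^2) over kept tokens <= threshold."""
    squares = [lr * lr for lr in log_ratios]
    keep = [True] * len(log_ratios)
    kept_sum, kept_n = sum(squares), len(squares)
    # Visit tokens from the largest squared log-ratio to the smallest.
    for i in sorted(range(len(squares)), key=lambda j: squares[j], reverse=True):
        if kept_n == 0 or kept_sum / kept_n <= threshold:
            break
        keep[i] = False  # mask this token out of the loss
        kept_sum -= squares[i]
        kept_n -= 1
    return keep


# One token has drifted far from the stale policy; the rest are close.
print(second_moment_mask([0.01, -0.02, 1.5, 0.05]))
```

Unlike a per-token clip, a batch-level constraint of this kind leaves moderately off-policy tokens untouched as long as the batch as a whole stays within budget, which is one way the "retaining high-entropy, high-value tokens" claim could be realized.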
Gowtham Ramesh retweeted
Lisa Su @LisaSu
Exciting day today! Thrilled to partner with @OpenAI to deploy 6GWs of AMD Instinct GPUs. The world needs more AI compute. Together, we’re bringing the best of both companies to accelerate the global AI infrastructure buildout. Thanks @sama @gdb for the trust and partnership. A true win-win for both companies!
163
359
3.9K
318.2K
Gowtham Ramesh retweeted
AMD @AMD
Today, we’re announcing a multi-year, multi-generation strategic partnership with @OpenAI that puts AMD compute at the center of the global AI infrastructure buildout.
✅ 6GW of AI infrastructure
✅ Initial 1GW deployment of AMD Instinct MI450 series GPU capacity beginning 2H 2026
✅ Enabling very large-scale AI deployments and advancing the entire AI ecosystem
More here: bit.ly/3KzsnFk
280
1.1K
7.3K
1.3M
Gowtham Ramesh retweeted
SemiAnalysis @SemiAnalysis_
As we mentioned back in April, AMD is in war mode and has developed a new sense of urgency. With this urgency, their software engineers are working harder and smarter than ever to fix bugs and improve the ROCm user experience, all with the goal of matching the CUDA experience. Some of AMD’s software engineers are not just working 996 or even 997 anymore, they are on 007. 0am in the early morning through 11:59pm in the late evening.
30
38
443
47.1K
Gowtham Ramesh retweeted
Feng Yao @fengyao1909
🆕 𝐔𝐩𝐝𝐚𝐭𝐞!! A few additional findings for 𝐑𝐨𝐥𝐥𝐨𝐮𝐭–𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐌𝐢𝐬𝐦𝐚𝐭𝐜𝐡:
① 𝐏𝐚𝐫𝐚𝐥𝐥𝐞𝐥𝐢𝐬𝐦 𝐝𝐢𝐟𝐟𝐞𝐫𝐞𝐧𝐜𝐞 is a huge driver of the gap, with Sequence Parallelism (SP) causing most of the high maximum-mismatch cases.
② 𝐋𝐨𝐧𝐠𝐞𝐫 𝐬𝐞𝐪𝐮𝐞𝐧𝐜𝐞𝐬 amplify the gap. The first 4k tokens of a 20k response have a bigger mismatch than a 4k response.
③ 𝐒𝐚𝐦𝐩𝐥𝐞𝐫 𝐟𝐢𝐱𝐞𝐬 𝐚𝐥𝐨𝐧𝐞 aren't enough. Different backends (incl. SGLang w/ deterministic kernels) had limited impact individually.
🔗 𝐅𝐮𝐥𝐥 𝐚𝐧𝐚𝐥𝐲𝐬𝐢𝐬: fengyao.notion.site/off-policy-rl#…
Feng Yao @fengyao1909

Failing on 𝐥𝐚𝐫𝐠𝐞-𝐬𝐜𝐚𝐥𝐞 𝐑𝐋 with VeRL? ⚠️ Mixing inference backends (𝐯𝐋𝐋𝐌/𝐒𝐆𝐋𝐚𝐧𝐠) with training backends (𝐅𝐒𝐃𝐏/𝐌𝐞𝐠𝐚𝐭𝐫𝐨𝐧) 𝐬𝐞𝐜𝐫𝐞𝐭𝐥𝐲 𝐭𝐮𝐫𝐧𝐬 𝐲𝐨𝐮𝐫 𝐑𝐋 𝐢𝐧𝐭𝐨 𝐨𝐟𝐟-𝐩𝐨𝐥𝐢𝐜𝐲, even if they share the same weights! 📉 Blog: fengyao.notion.site/off-policy-rl 💻 Code: github.com/yaof20/verl/tr…

2
31
171
18K
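The findings above talk about mean and maximum mismatch between the sampler and the learner. A minimal way to measure such quantities, assuming you have logged per-token log-probabilities of the sampled tokens from both backends, is to compare the implied probabilities token by token. The exact metric used in the linked analysis is not given in the tweet, so the absolute probability gap below is an illustrative choice.

```python
import math


def mismatch_stats(learner_logps, sampler_logps):
    """Mean and max absolute probability gap over one response's tokens."""
    gaps = [
        abs(math.exp(lp_learner) - math.exp(lp_sampler))
        for lp_learner, lp_sampler in zip(learner_logps, sampler_logps)
    ]
    return sum(gaps) / len(gaps), max(gaps)


# Two tokens: the backends agree on the first and disagree on the second.
learner = [math.log(0.5), math.log(0.9)]
sampler = [math.log(0.5), math.log(0.6)]
mean_gap, max_gap = mismatch_stats(learner, sampler)
print(f"mean gap {mean_gap:.3f}, max gap {max_gap:.3f}")
```

Tracking the maximum alongside the mean matters for the sequence-length finding: a long response offers more tokens on which a single large gap can appear, even if the average stays small.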