Ashutosh Baheti

120 posts


@abaheti95

Sr. Research Scientist, Agentic RL @databricks. Interested in LLMs, agents, tool use, reinforcement learning, and building a JARVIS 🤖

Joined March 2015
494 Following · 471 Followers
Ashutosh Baheti retweeted
Michael Bendersky @bemikelive
We just published OfficeQA Pro - a set of 133 challenging questions from the original OfficeQA benchmark. Even the best frontier agents still struggle on OfficeQA Pro with common issues stemming from errors in parsing, retrieval, and visual reasoning.
Ashutosh Baheti retweeted
Krista Opsahl-Ong @kristahopsalong
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning:
1️⃣ Find the right documents
2️⃣ Extract the right values
3️⃣ Perform analyses
OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵 Paper & details below!
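The three steps above can be sketched as a toy pipeline. Everything here is invented for illustration (the corpus, the query terms, and the regex extraction); it is not OfficeQA Pro's actual harness, only the shape of a find → extract → analyze loop.

```python
# Hypothetical illustration of the three grounded-reasoning steps:
# (1) find documents, (2) extract values, (3) perform an analysis.
# Corpus contents and numbers are invented for this sketch.
import re

corpus = {
    "bulletin_1994.txt": "Total public debt outstanding: 4643.7 (billions of dollars).",
    "bulletin_1995.txt": "Total public debt outstanding: 4921.0 (billions of dollars).",
    "recipes.txt": "Whisk two eggs with a pinch of salt.",
}

def find_documents(query_terms, corpus):
    """Step 1: keyword retrieval -- keep docs mentioning every query term."""
    return {
        name: text for name, text in corpus.items()
        if all(term.lower() in text.lower() for term in query_terms)
    }

def extract_values(docs, pattern=r"(\d+(?:\.\d+)?)"):
    """Step 2: pull the first numeric value out of each retrieved document."""
    return {name: float(re.search(pattern, text).group(1))
            for name, text in docs.items()}

# Step 3: a simple analysis over the extracted values (year-over-year change).
docs = find_documents(["public debt"], corpus)
values = extract_values(docs)
change = values["bulletin_1995.txt"] - values["bulletin_1994.txt"]
print(round(change, 1))  # 277.3
```

Real agents replace each step with something far harder (semantic retrieval over scanned PDFs, table parsing, multi-hop analysis), which is what the benchmark stresses.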
Ashutosh Baheti retweeted
Jonathan Chang @j_nadan_chang
We just released KARL — a knowledge agent trained with reinforcement learning that beats Claude Opus 4.6 and GPT-5.2 on enterprise search, at a fraction of the cost and latency. 🧵
Ashutosh Baheti retweeted
Owen Oertell @owenoertell
Super excited to talk about KARL! A few of my favorite things about the report:
- we show RL is doing more than just sharpening
- TTS that works on non-verifiable tasks
- multitask RL > multi-expert distillation
- OAPL working at scale
and more!
Jonathan Frankle@jefrankle

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

Ashutosh Baheti retweeted
Wen Sun @WenSun1
Going to do a more technical deep dive on our enterprise knowledge agents and how we train them with RL. Overall, we found that simple yet principled off-policy RL works at scale for complex agentic tasks with hundreds of steps of tool use and context management. Here are the key takeaways from our 80-page technical report.

(1) RL does not just sharpen the base model's distribution. We see test-time scaling improve consistently over the iterations of RL training. Skills learned during RL transfer to unseen prompts, and the agent learns to solve prompts where the base model has zero accuracy under pass@16.

(2) Multi-task RL generalizes really well. Simply mixing training data from multiple tasks works well and lets multi-task RL scale beyond your in-distribution training tasks. We found that multi-task RL just works better than multi-expert distillation.

(3) End-to-end RL for tools and context management works best. We skipped mid-training and directly trained everything end-to-end using RL at scale (2M tokens per gradient computation). Models learned to use vector-database tools and context compression at the same time.
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
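The off-policy update the thread alludes to can be illustrated in miniature. This is a generic toy sketch, not KARL's actual recipe: a softmax policy over four actions, rollouts sampled under stale parameters, and a truncated importance weight correcting for the mismatch. All numbers (policy size, reward, clip value, learning rate) are invented.

```python
# Generic toy illustration (NOT KARL's recipe) of an off-policy
# policy-gradient update: sample under stale weights, correct with a
# truncated importance weight, and update the current policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
theta_old = np.zeros(4)                    # behavior policy (stale weights)
theta = np.array([0.1, 0.0, -0.1, 0.0])    # current policy being trained

# Sample actions under the stale policy; only action 2 is rewarded.
actions = rng.choice(4, size=256, p=softmax(theta_old))
rewards = (actions == 2).astype(float)
baseline = rewards.mean()

grad = np.zeros_like(theta)
pi, pi_old = softmax(theta), softmax(theta_old)
for a, r in zip(actions, rewards):
    w = min(pi[a] / pi_old[a], 2.0)              # truncated importance weight
    one_hot = np.eye(4)[a]
    grad += w * (r - baseline) * (one_hot - pi)  # advantage * grad log pi(a)
theta_new = theta + 0.5 * grad / len(actions)

# The rewarded action's logit should rise relative to the others.
print(theta_new[2] > theta[2])  # True
```

The point the thread makes is that this stale-data setting, kept simple, scales to hundreds of tool-use steps; the toy only shows why the update still points the right way despite the distribution mismatch.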
Ashutosh Baheti retweeted
Andrew Drozdov @mrdrozdov
Today, we're sharing 🌁 Knowledge Agents from Reinforcement Learning (KARL) 🌁 We trained an agent that excels on challenging grounded reasoning tasks. KARL matches Sonnet 4.5 quality at a fraction of the cost, and with test-time scaling reaches Opus 4.6 levels. This was a fun project that I learned a lot from. Here are a few pieces that resonated with me.
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
Ashutosh Baheti @abaheti95
Incredibly proud of the team behind this!! KARL solves a genuinely hard problem and it's only the first of many agents we're building @DbrxMosaicAI 🚀🤖
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
Ashutosh Baheti retweeted
Davis Blalock @davisblalock
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
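The tweet doesn't describe FlashOptim's internals, so as background here is a reference Adam step in NumPy that makes the memory cost explicit: the two extra state buffers `m` and `v`, each the size of the parameters, are what a memory-saving implementation would target while still computing the same update. The function and its hyperparameters follow the standard Adam formulation, not FlashOptim's code.

```python
# Background sketch (the tweet doesn't describe FlashOptim's internals):
# a reference Adam step, making explicit the two extra state buffers
# (m and v) that dominate optimizer memory during training.
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grads        # first-moment buffer (param-sized)
    v = b2 * v + (1 - b2) * grads**2     # second-moment buffer (param-sized)
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

p = np.array([1.0, -2.0])
m = np.zeros_like(p); v = np.zeros_like(p)
g = np.array([0.5, -0.5])                # pretend gradient
p, m, v = adam_step(p, g, m, v, t=1)
print(p)  # first step moves each param by ~lr against its gradient sign
```

Since `m` and `v` together add two full parameter-sized tensors on top of the weights and gradients, any implementation that shrinks or eliminates them, while reproducing the same arithmetic, saves the memory the tweet claims.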
Ashutosh Baheti retweeted
Wen Sun @WenSun1
Many recent works try to force GRPO to be on-policy by adding things like extra importance weighting, clipping, masking, data deletion, inference-engine edits, router replay… But are these actually needed?

We push in the other direction: make it maximally off-policy and keep it simple! It turns out that nothing is wrong with off-policy. It works well, avoids entropy collapse, and improves test-time scaling (RL doesn't just sharpen the base model's distribution).

While we only presented dense models on math and coding tasks in this report, a similar recipe works for large-scale MoEs in agentic settings. Will share more results very soon!
Kianté Brantley@xkianteb

Does LLM RL post-training need to be on-policy?

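The core GRPO signal that survives once the on-policy machinery listed above is stripped away is just a group-normalized reward. A minimal sketch, with an invented group of verifier-scored completions; this shows only the advantage computation, not the full training loop.

```python
# Toy sketch of the GRPO-style advantage computed WITHOUT the on-policy
# machinery the tweet lists (no importance weighting, clipping, masking,
# or replay bookkeeping): group-normalized rewards, nothing else.
import numpy as np

def group_relative_advantages(rewards):
    """For one prompt's group of sampled completions, subtract the group
    mean and divide by the group std (the GRPO advantage estimate)."""
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-6)

# One prompt, a group of 4 completions scored 0/1 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # ≈ [ 1, -1, -1,  1]
```

Because the advantages are centered within each group, the update rewards completions only relative to their siblings, which is what makes the signal usable even when the rollouts came from a stale policy.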
Ashutosh Baheti retweeted
Ali Ghodsi @alighodsi
I now constantly get questions about the SaaS meltdown, the role of AI, systems of record, etc. I don't have an answer to all of them. But I do know that we saw an acceleration in our business in Q2 and Q3, and we finished the year with an accelerating Q4. The question is, why?

Short answer: AI. But the underlying reason is subtle. We are growing fast because we are finally removing the biggest bottleneck in data: the technical barrier to entry. For years, if you didn't know SQL or Python, you were locked out of the value chain. That has changed fundamentally with the Genie family, and it is the "secret sauce" behind our recent momentum:
• Genie: Analysts can query data without any SQL. I use this every day myself.
• Data Science Genie: Builds end-to-end AI models for you, similar to Cursor for ML on your data.
• Data Engineer Genie: Writes Spark pipelines, does plumbing and troubleshooting.

We've been talking about Data + AI democratization, but generative AI finally enabled it in a way that wasn't possible before. That's why we're seeing a market response.

Take Lakebase Postgres. We launched this serverless engine for agents and apps recently. At 8 months into its journey, its revenue is already 2x what our Data Warehouse product was at the same stage.

All this taken together, we ended up with the following stats for Q4:
🚀 $5.4B revenue run-rate, growing >65% YoY
🚀 $1.4B AI revenue run-rate
🚀 FCF positive for the year
🚀 NRR >>140%
databricks.com/company/newsro…
Ashutosh Baheti retweeted
Matei Zaharia @matei_zaharia
Agent memory is a simple and powerful way to do continual learning! With the new MemAlign method from Databricks Research, we can build better LLM judges from examples of human ratings, and they scale with more data. Now in Databricks and @MLflow. databricks.com/blog/memalign-…
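The tweet doesn't spell out MemAlign's mechanics, so this is only a generic sketch of the idea it gestures at: store human ratings as "memory" and retrieve the most similar past examples into an LLM judge's prompt, so the judge improves as ratings accumulate. The memory entries, similarity function, and prompt format are all invented.

```python
# Generic sketch (NOT MemAlign's actual method): keep human-rated
# examples as memory and retrieve the closest ones into a judge prompt.
from difflib import SequenceMatcher

memory = [
    {"answer": "The capital of France is Paris.", "rating": "good"},
    {"answer": "France's capital is Berlin.", "rating": "bad"},
    {"answer": "2 + 2 = 5", "rating": "bad"},
]

def retrieve(query, memory, k=2):
    """Rank stored rated examples by string similarity to the new answer."""
    scored = sorted(memory, key=lambda e: SequenceMatcher(
        None, query, e["answer"]).ratio(), reverse=True)
    return scored[:k]

def build_judge_prompt(answer, memory):
    """Prepend the retrieved rated examples as few-shot demonstrations."""
    shots = "\n".join(f"Answer: {e['answer']} -> Rating: {e['rating']}"
                      for e in retrieve(answer, memory))
    return f"Rate the answer as good or bad.\n{shots}\nAnswer: {answer} -> Rating:"

prompt = build_judge_prompt("The capital of Spain is Madrid.", memory)
print("Paris" in prompt)  # the most similar rated example was retrieved
```

A real system would use embedding similarity and an actual LLM call, but the scaling behavior the tweet mentions comes from this loop: more human ratings in memory means more relevant demonstrations per judgment.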
Ashutosh Baheti retweeted
Databricks @databricks
Today we're introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face.

Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing dense tables, navigating scanned PDFs, and retrieving facts across decades of documents. Even strong agents reach only ~45% accuracy, showing how far the field has to go.

The benchmark is now open to the community, and the Databricks Grounded Reasoning Cup in Spring 2026 will challenge teams to push these capabilities forward. databricks.com/blog/introduci…
Ashutosh Baheti retweeted
Jonathan Frankle @jefrankle
Special Databricks swag for the first five people to send me a selfie with Ashu in the Databricks booth at NeurIPS!
Ashutosh Baheti@abaheti95

Will be at #NeurIPS2025 from 2nd to 6th Dec. Excited to chat about async RL, Environment Exploration, Agents/Tool use, User Simulator, Synthetic Data Generation or any other topic!! You can find me at the @databricks booth @ Tue 12 - 4pm
