Ashutosh Baheti

139 posts

@abaheti95

Sr. Research Scientist, Agentic RL @databricks. I'm interested in LLMs, agents, tool use, reinforcement learning, and making a JARVIS 🤖

Joined March 2015
506 Following · 512 Followers
Pinned Tweet
Ashutosh Baheti @abaheti95
In 1945, Vannevar Bush imagined a machine to extend a scientist's memory. He called it the MemEx. 80 years later, we built one for LLM agents. Tool outputs become Python objects; only print statements reach the model's context. 🧵 databricks.com/blog/memex-pro…
2
14
68
12.1K
Samuel Bodin @samdotb
LLMs should really learn to copy code with a tool instead of using tokens. It's insane.
94
59
3.4K
289.3K
Ashutosh Baheti reposted
Prithviraj (Raj) Ammanabrolu @rajammanabrolu
Ever wished we had fewer X-training hyphenates? Pre, mid, post, etc. Why not just Training?
Trying to bridge the divides (and get all our friends into one team again), we intro *Introspective X Training*, an offline RL inspired method that scales effectively across any LLM stage by annotating your data with a thinking reward generated language critique!
Up to 2.8x FLOP efficiency + 5-10 point score gains (esp. with math and code) at any stage from scratch to 24T tokens on 8b (active) sized models!! We burned much compute ablating so you wouldn't have to.
Moral of the story is ‼️don't throw out any data via filtering, just feedback condition it‼️
You can spend FLOPs up front on inference to *classify* data quality and then train so that tokens aren't all treated equally based on the feedback, starting early in training itself. Right now they're really only separated out much later, during mid/post training.
This improves overall compute efficiency and gives us benchmark perf not possible with just baseline methods!
Paper here: arxiv.org/abs/2605.20285
Thanks to @BrandoCui and @GXiming for leading this w/ @__SyedaAkter @davidjesusacu @hyunw_kim @jaehunjung_com Yuxiao Qu @shrimai_ @YejinChoinka
1
17
95
17.4K
Ashutosh Baheti reposted
Ben Clavié @bclavie
Extremely excited to see this hit the timeline the same day I give a talk where I spend 2 minutes ranting about how As We May Think might be the most relevant essay to today's information retrieval world. And on top of that, it's great work going in the right direction!
Ashutosh Baheti @abaheti95

At Databricks, 🧞Genie hits this wall every day! Its queries span an entire workspace and pull data from tables, vector indices, and other sources via many tool calls. Here's how MemEx can convert complex workflows like these into streamlined code with far less token repetition.

0
3
19
2.2K
Ashutosh Baheti reposted
Ivan Zhou @ivanzhouyq
We're pushing the frontier of enterprise agents that reason over massive amounts of structured and unstructured data at @databricks. A recurring barrier is that agents burn tokens reading data and grow fuzzy as their context fills up. MemEx is an elegant solution. It lifts performance on both frontier and smaller OSS models, while significantly cutting the cost and latency of complex agentic tasks.
Ashutosh Baheti @abaheti95

At Databricks, 🧞Genie hits this wall every day! Its queries span an entire workspace and pull data from tables, vector indices, and other sources via many tool calls. Here's how MemEx can convert complex workflows like these into streamlined code with far less token repetition.

0
1
8
365
Ashutosh Baheti reposted
Databricks AI Research @DbrxMosaicAI
New research from Databricks: the context window is the only persistent substrate today's LLM agents have, and it floods fast. A single SQL query can return millions of rows that ride along in every subsequent turn, even when only one cell ever mattered. We hit this constraint every day in the agents we run in production, from Genie to Agent Bricks' Supervisor Agent to KARL.
In a new post from the Databricks research team, we introduce MemEx: a programmable Python scratchpad that lets agents transform, slice, and persist tool outputs as typed objects in a live kernel. Same observe-act loop. Different action space.
Across nine frontier and open-weight models on two enterprise agentic tasks (OfficeQA Pro and Enterprise Structured Retrieval):
• Frontier models (Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro) gain 2 to 5 accuracy points at 25 to 30% lower cost
• Qwen 122B and Qwen 397B nearly double accuracy at 40 to 50% lower cost
• Four of the five points on the OfficeQA Pro cost-accuracy Pareto frontier are MemEx configurations
MemEx extends the code-as-action line (CodeAct, Anthropic Programmatic Tool Calling, Cloudflare Code Mode) with persistent scope across turns, eager spawn_agent for parallel sub-agents that share the parent's namespace, typed submit() for validated returns, and live-object scope injection. Built on aroll, the same Databricks agentic rollouts framework already powering those production systems.
MemEx is rolling out across Databricks first-party agents and Agent Bricks soon. If you build on Databricks agents today, you'll be able to try it.
Full write-up: databricks.com/blog/memex-pro…
17
18
188
154K
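The post above describes tool outputs persisting as objects in a live kernel, with only printed text reaching the model. A minimal sketch of that idea follows, assuming a single long-lived namespace and stdout capture. It is not the MemEx implementation: `Scratchpad`, its `run` method, and the `run_sql` tool are hypothetical names invented for illustration.

```python
import contextlib
import io


class Scratchpad:
    """Toy persistent scratchpad: variables survive across turns, only stdout is returned."""

    def __init__(self, tools):
        # Hypothetical design: tools are plain callables placed in one long-lived namespace.
        self.namespace = dict(tools)

    def run(self, code):
        """Execute agent-written code and return only what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


def run_sql(query):
    """Hypothetical tool standing in for a query that returns many rows."""
    return [{"id": i, "revenue": i * 10} for i in range(100_000)]


pad = Scratchpad({"run_sql": run_sql})

# Turn 1: the full result stays in the namespace; the model would see one short line.
seen_turn_1 = pad.run("rows = run_sql('SELECT * FROM sales')\nprint(len(rows))")

# Turn 2: `rows` is still in scope, so nothing is re-fetched or copied into context.
seen_turn_2 = pad.run("print(max(r['revenue'] for r in rows))")
```

In this sketch the 100,000 rows never enter the returned text; each turn contributes only the characters the code chose to print.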
Ashutosh Baheti reposted
Shubham Toshniwal @ShubhamToshniw6
Agents are bottlenecked by the current tool-calling based harness. Outputs get flattened to text, added to context, and re-parsed each turn. The model spends most of its tokens transcribing. We just shipped MemEx where the agent gets supercharged with a Python scratchpad!
Ashutosh Baheti @abaheti95

In 1945, Vannevar Bush imagined a machine to extend a scientist's memory. He called it the MemEx. 80 years later, we built one for LLM agents. Tool outputs become Python objects; only print statements reach the model's context. 🧵 databricks.com/blog/memex-pro…

1
4
11
1.8K
Ashutosh Baheti @abaheti95
Same pattern for test-time scaling. We aggregated 8 Qwen rollouts of OfficeQA Pro. The Tool Calling aggregator worked from lossy summaries (full traces don't fit in context). The MemEx aggregator received the full trajectories as scope variables, and won.
1
0
4
240
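To illustrate what "full trajectories as scope variables" could look like, here is a small sketch in which aggregator code works directly on in-memory rollout records. The post does not say how the aggregator decided, so majority voting is a swapped-in stand-in, and the `trajectories` data and `aggregate` function are hypothetical.

```python
from collections import Counter

# Hypothetical rollouts: each keeps its full step trace plus a final answer.
trajectories = [
    {"steps": ["search", "read", "compute"], "answer": "42"},
    {"steps": ["search", "compute"], "answer": "41"},
    {"steps": ["search", "read", "read", "compute"], "answer": "42"},
]


def aggregate(rollouts):
    """Majority vote over final answers; ties go to the answer seen first."""
    votes = Counter(r["answer"] for r in rollouts)
    answer, count = votes.most_common(1)[0]
    return answer, count


answer, count = aggregate(trajectories)
print(f"{answer} ({count}/{len(trajectories)} rollouts)")
```

Because the records live in scope, the same code could also inspect each rollout's `steps` before voting, which a summary-only aggregator cannot do.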
Ashutosh Baheti @abaheti95
📈 On complex long-horizon enterprise tasks like OfficeQA Pro and Enterprise Structured Retrieval:
Frontier models like Opus 4.6: +5pp at 30% less cost.
OSS models like Qwen3.5-122B: accuracy doubles, 18% → 36%.
Same agent. Same model. Same tools. Same prompts. Different action space.
1
1
7
440
Ashutosh Baheti @abaheti95
🤖 We ran MemEx on the agents' OWN trajectories. An audit agent loaded 6 of them (3 MemEx, 3 Tool Calling) into Python scope and classified failure modes. MemEx had 2x fewer search/execution errors. Retrieval stays in variables, never copied between calls.
1
0
5
262
Ashutosh Baheti @abaheti95
At Databricks, 🧞Genie hits this wall every day! Its queries span an entire workspace and pull data from tables, vector indices, and other sources via many tool calls. Here's how MemEx can convert complex workflows like these into streamlined code with far less token repetition.
1
1
13
3.5K
Ashutosh Baheti reposted
Julia Neagu @julianeagu
I'm building a new team at @databricks AI Research and we're hiring. We're focused on one of the hardest open problems in AI right now: how do you measure and continuously improve agents that operate on enterprise data at scale. We're looking for founding engineers to build the flywheel that turns evaluation results directly into better agents — from development and training all the way to production. If you want to work on problems that actually matter at the frontier of AI research, I'd love to talk. Link in comments 👇
81
62
1.5K
169.7K
Kusha Sareen @KushaSareen
Hey! That's a great question and we thought about it a bit. For simplicity, we just kept the GEPA prompt that had the highest validation accuracy, but there are certainly other options! E.g., what we get from this algorithm is really (pool of prompts, model) rather than just (prompt, model), so there are all kinds of clever things you could do to better make use of the pool of prompts at inference time.
2
1
11
1K
Kusha Sareen @KushaSareen
Can LLMs adapt continually without losing base skills? Fast-Slow Training (FST) pairs "slow" weights with "fast" context.
FST vs. RL:
• 3x more sample-efficient
• Higher performance ceiling
• Less KL drift (better plasticity)
• Continual learning: succeeds where RL stalls
20
92
542
130.1K
Ashutosh Baheti @abaheti95
Ash Ketchum is basically a phd advisor. He brings pokemon in, evolves them, and then just as they get good, he sets them free!
0
0
5
136
Ashutosh Baheti reposted
Databricks AI Research @DbrxMosaicAI
Most enterprise questions don't live in one dataset. They span structured systems and unstructured sources like documents, reviews, and reports. In our latest research, we show how Agent Bricks Supervisor Agent handles this by decomposing queries across structured and unstructured tools, then synthesizing results over multiple reasoning steps. The results across STaRK and KARLBench: 20%+ improvement over SoTA baselines, with the biggest gains on tasks requiring tight integration of structured and unstructured data. All built declaratively — no custom code, just precise instructions and the right tools. databricks.com/blog/agentic-r…
5
15
47
10.1K
Ashutosh Baheti reposted
Matei Zaharia @matei_zaharia
As AI reasoning gets good enough, we think memory will be the next bottleneck for agents. Can your agent improve with more experience? We call this Memory Scaling, and it's related but different from continual learning. A few examples and challenges: databricks.com/blog/memory-sc…
10
50
382
29.9K
Ashutosh Baheti reposted