Ashutosh Baheti

120 posts


@abaheti95

Sr. Research Scientist, Agentic RL @databricks. Interested in LLMs, agents, tool use, reinforcement learning, and building a JARVIS 🤖

Joined March 2015
494 Following · 471 Followers
Ashutosh Baheti retweeted
Michael Bendersky @bemikelive
We just published OfficeQA Pro - a set of 133 challenging questions from the original OfficeQA benchmark. Even the best frontier agents still struggle on OfficeQA Pro with common issues stemming from errors in parsing, retrieval, and visual reasoning.
Ashutosh Baheti retweeted
Krista Opsahl-Ong @kristahopsalong
Most AI benchmarks test reasoning in isolation. Real enterprise tasks require grounded reasoning:
1️⃣ Find the right documents
2️⃣ Extract the right values
3️⃣ Perform analyses
OfficeQA Pro evaluates this end-to-end. Frontier agents still score <50%. 🧵 Paper & details below!
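The three steps above can be sketched as a toy pipeline. Everything here is invented for illustration (the corpus, the query terms, and the regex extraction); it is not OfficeQA Pro's actual harness, only the shape of a find → extract → analyze loop.

```python
# Hypothetical illustration of the three grounded-reasoning steps:
# (1) find documents, (2) extract values, (3) perform an analysis.
# Corpus contents and numbers are invented for this sketch.
import re

corpus = {
    "bulletin_1994.txt": "Total public debt outstanding: 4643.7 (billions of dollars).",
    "bulletin_1995.txt": "Total public debt outstanding: 4921.0 (billions of dollars).",
    "recipes.txt": "Whisk two eggs with a pinch of salt.",
}

def find_documents(query_terms, corpus):
    """Step 1: keyword retrieval -- keep docs mentioning every query term."""
    return {
        name: text for name, text in corpus.items()
        if all(term.lower() in text.lower() for term in query_terms)
    }

def extract_values(docs, pattern=r"(\d+(?:\.\d+)?)"):
    """Step 2: pull the first numeric value out of each retrieved document."""
    return {name: float(re.search(pattern, text).group(1))
            for name, text in docs.items()}

# Step 3: a simple analysis over the extracted values (year-over-year change).
docs = find_documents(["public debt"], corpus)
values = extract_values(docs)
change = values["bulletin_1995.txt"] - values["bulletin_1994.txt"]
print(round(change, 1))  # 277.3
```

Real agents replace each step with something far harder (semantic retrieval over scanned PDFs, table parsing, multi-hop analysis), which is what the benchmark stresses.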
Ashutosh Baheti retweeted
Jonathan Chang @j_nadan_chang
We just released KARL — a knowledge agent trained with reinforcement learning that beats Claude Opus 4.6 and GPT-5.2 on enterprise search, at a fraction of the cost and latency. 🧵
Ashutosh Baheti retweeted
Owen Oertell @owenoertell
Super excited to talk about KARL! A few of my favorite things about the report:
- we show RL is doing more than just sharpening
- TTS that works on non-verifiable tasks
- multitask RL > multi-expert distillation
- OAPL working at scale
and more!
Jonathan Frankle@jefrankle

Meet KARL, an RL'd model for document-centric tasks at frontier quality and open source cost/speed. Great for @databricks customers and scientists (77-page tech report!) As usual, this isn't just one model - it's an RL assembly line to churn out models for us and our customers 🧵

Ashutosh Baheti retweeted
Wen Sun @WenSun1
Going to do a more technical deep dive on our enterprise knowledge agents and how we train them with RL. Overall, we found that simple yet principled off-policy RL works at scale for complex agentic tasks with hundreds of steps of tool use and context management. Here are the key takeaways from our 80-page technical report.

(1) RL does not just sharpen the base model's distribution. We see test-time scaling improve consistently over the iterations of RL training. Skills learned during RL transfer to unseen prompts, and the agent learns to solve prompts where the base model has zero accuracy under pass@16.

(2) Multi-task RL generalizes really well. Simply mixing training data from multiple tasks works well and lets multi-task RL scale beyond your in-distribution training tasks. We found that multi-task RL just works better than multi-expert distillation.

(3) End-to-end RL for tools and context management works best. We skipped mid-training and directly trained everything end-to-end using RL at scale (2M tokens per gradient computation). Models learned to use vector-database tools and context compression at the same time.
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
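The off-policy update the thread alludes to can be illustrated in miniature. This is a generic toy sketch, not KARL's actual recipe: a softmax policy over four actions, rollouts sampled under stale parameters, and a truncated importance weight correcting for the mismatch. All numbers (policy size, reward, clip value, learning rate) are invented.

```python
# Generic toy illustration (NOT KARL's recipe) of an off-policy
# policy-gradient update: sample under stale weights, correct with a
# truncated importance weight, and update the current policy.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
theta_old = np.zeros(4)                    # behavior policy (stale weights)
theta = np.array([0.1, 0.0, -0.1, 0.0])    # current policy being trained

# Sample actions under the stale policy; only action 2 is rewarded.
actions = rng.choice(4, size=256, p=softmax(theta_old))
rewards = (actions == 2).astype(float)
baseline = rewards.mean()

grad = np.zeros_like(theta)
pi, pi_old = softmax(theta), softmax(theta_old)
for a, r in zip(actions, rewards):
    w = min(pi[a] / pi_old[a], 2.0)              # truncated importance weight
    one_hot = np.eye(4)[a]
    grad += w * (r - baseline) * (one_hot - pi)  # advantage * grad log pi(a)
theta_new = theta + 0.5 * grad / len(actions)

# The rewarded action's logit should rise relative to the others.
print(theta_new[2] > theta[2])  # True
```

The point the thread makes is that this stale-data setting, kept simple, scales to hundreds of tool-use steps; the toy only shows why the update still points the right way despite the distribution mismatch.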
Ashutosh Baheti retweeted
Andrew Drozdov @mrdrozdov
Today, we're sharing 🌁 Knowledge Agents from Reinforcement Learning (KARL) 🌁 We trained an agent that excels on challenging grounded reasoning tasks. KARL matches Sonnet 4.5 quality at a fraction of the cost, and with test-time scaling reaches Opus 4.6 levels. This was a fun project that I learned a lot from. Here are a few pieces that resonated with me.
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
Ashutosh Baheti @abaheti95
Incredibly proud of the team behind this!! KARL solves a genuinely hard problem and it's only the first of many agents we're building @DbrxMosaicAI 🚀🤖
Quoting Jonathan Frankle @jefrankle's KARL announcement (quoted above).
Ashutosh Baheti retweeted
Davis Blalock @davisblalock
🚀 Today we’re releasing FlashOptim: better implementations of Adam, SGD, etc, that compute the same updates but save tons of memory. You can use it right now via `pip install flashoptim`. 🚀 arxiv.org/abs/2602.23349 A bunch of cool ideas make this possible: [1/n]
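The tweet doesn't describe FlashOptim's internals, so as background here is a reference Adam step in NumPy that makes the memory cost explicit: the two extra state buffers `m` and `v`, each the size of the parameters, are what a memory-saving implementation would target while still computing the same update. The function and its hyperparameters follow the standard Adam formulation, not FlashOptim's code.

```python
# Background sketch (the tweet doesn't describe FlashOptim's internals):
# a reference Adam step, making explicit the two extra state buffers
# (m and v) that dominate optimizer memory during training.
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grads        # first-moment buffer (param-sized)
    v = b2 * v + (1 - b2) * grads**2     # second-moment buffer (param-sized)
    m_hat = m / (1 - b1**t)              # bias correction
    v_hat = v / (1 - b2**t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

p = np.array([1.0, -2.0])
m = np.zeros_like(p); v = np.zeros_like(p)
g = np.array([0.5, -0.5])                # pretend gradient
p, m, v = adam_step(p, g, m, v, t=1)
print(p)  # first step moves each param by ~lr against its gradient sign
```

Since `m` and `v` together add two full parameter-sized tensors on top of the weights and gradients, any implementation that shrinks or eliminates them, while reproducing the same arithmetic, saves the memory the tweet claims.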
Ashutosh Baheti retweeted
Wen Sun @WenSun1
Many recent works try to force GRPO to be on-policy by adding things like extra importance weighting, clipping, masking, data deletion, inference-engine edits, router replay… But are these actually needed?

We push in the other direction: make it maximally off-policy and keep it simple! It turns out that nothing is wrong with off-policy. It works well, avoids entropy collapse, and improves test-time scaling (RL doesn't just sharpen the base model's distribution).

While we only presented dense models on math and coding tasks in this report, a similar recipe works for large-scale MoEs in agentic settings. Will share more results very soon!
Kianté Brantley@xkianteb

Does LLM RL post-training need to be on-policy?

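The core GRPO signal that survives once the on-policy machinery listed above is stripped away is just a group-normalized reward. A minimal sketch, with an invented group of verifier-scored completions; this shows only the advantage computation, not the full training loop.

```python
# Toy sketch of the GRPO-style advantage computed WITHOUT the on-policy
# machinery the tweet lists (no importance weighting, clipping, masking,
# or replay bookkeeping): group-normalized rewards, nothing else.
import numpy as np

def group_relative_advantages(rewards):
    """For one prompt's group of sampled completions, subtract the group
    mean and divide by the group std (the GRPO advantage estimate)."""
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-6)

# One prompt, a group of 4 completions scored 0/1 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # ≈ [ 1, -1, -1,  1]
```

Because the advantages are centered within each group, the update rewards completions only relative to their siblings, which is what makes the signal usable even when the rollouts came from a stale policy.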
Ashutosh Baheti retweeted
Ali Ghodsi @alighodsi
I now constantly get questions about the SaaS meltdown, the role of AI, systems of record, etc. I don't have an answer to all of them. But I do know that we saw an acceleration in our business in Q2 and Q3, and we finished the year with an accelerating Q4. The question is, why?

Short answer: AI. But the underlying reason is subtle. We are growing fast because we are finally removing the biggest bottleneck in data: the technical barrier to entry. For years, if you didn't know SQL or Python, you were locked out of the value chain. That has changed fundamentally with the Genie family, and it is the "secret sauce" behind our recent momentum:
• Genie: Analysts can query data without any SQL. I use this every day myself.
• Data Science Genie: Builds end-to-end AI models for you, similar to Cursor for ML on your data.
• Data Engineer Genie: Writes Spark pipelines, does plumbing and troubleshooting.

We've been talking about Data + AI democratization, but generative AI finally enabled it in a way that wasn't possible before. That's why we're seeing a market response.

Take Lakebase Postgres. We launched this serverless engine for agents and apps recently. At 8 months into its journey, its revenue is already 2x what our Data Warehouse product was at the same stage.

All this taken together, we ended up with the following stats for Q4:
🚀 $5.4B revenue run-rate, growing >65% YoY
🚀 $1.4B AI revenue run-rate
🚀 FCF positive for the year
🚀 NRR >>140%
databricks.com/company/newsro…
Ashutosh Baheti retweeted
Matei Zaharia @matei_zaharia
Agent memory is a simple and powerful way to do continual learning! With the new MemAlign method from Databricks Research, we can build better LLM judges from examples of human ratings, and they scale with more data. Now in Databricks and @MLflow. databricks.com/blog/memalign-…
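The tweet doesn't spell out MemAlign's mechanics, so this is only a generic sketch of the idea it gestures at: store human ratings as "memory" and retrieve the most similar past examples into an LLM judge's prompt, so the judge improves as ratings accumulate. The memory entries, similarity function, and prompt format are all invented.

```python
# Generic sketch (NOT MemAlign's actual method): keep human-rated
# examples as memory and retrieve the closest ones into a judge prompt.
from difflib import SequenceMatcher

memory = [
    {"answer": "The capital of France is Paris.", "rating": "good"},
    {"answer": "France's capital is Berlin.", "rating": "bad"},
    {"answer": "2 + 2 = 5", "rating": "bad"},
]

def retrieve(query, memory, k=2):
    """Rank stored rated examples by string similarity to the new answer."""
    scored = sorted(memory, key=lambda e: SequenceMatcher(
        None, query, e["answer"]).ratio(), reverse=True)
    return scored[:k]

def build_judge_prompt(answer, memory):
    """Prepend the retrieved rated examples as few-shot demonstrations."""
    shots = "\n".join(f"Answer: {e['answer']} -> Rating: {e['rating']}"
                      for e in retrieve(answer, memory))
    return f"Rate the answer as good or bad.\n{shots}\nAnswer: {answer} -> Rating:"

prompt = build_judge_prompt("The capital of Spain is Madrid.", memory)
print("Paris" in prompt)  # the most similar rated example was retrieved
```

A real system would use embedding similarity and an actual LLM call, but the scaling behavior the tweet mentions comes from this loop: more human ratings in memory means more relevant demonstrations per judgment.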
Ashutosh Baheti retweeted
Databricks @databricks
Today we're introducing OfficeQA, a new benchmark grounded in ~89,000 pages of U.S. Treasury Bulletins that reflects the complex, document-heavy tasks enterprises actually face.

Unlike existing benchmarks, OfficeQA measures economically valuable, real-world reasoning: parsing dense tables, navigating scanned PDFs, and retrieving facts across decades of documents. Even strong agents reach only ~45% accuracy, showing how far the field has to go.

The benchmark is now open to the community, and the Databricks Grounded Reasoning Cup in Spring 2026 will challenge teams to push these capabilities forward. databricks.com/blog/introduci…
Ashutosh Baheti retweeted
Jonathan Frankle @jefrankle
Special Databricks swag for the first five people to send me a selfie with Ashu in the Databricks booth at NeurIPS!
Ashutosh Baheti@abaheti95

Will be at #NeurIPS2025 from 2nd to 6th Dec. Excited to chat about async RL, Environment Exploration, Agents/Tool use, User Simulator, Synthetic Data Generation or any other topic!! You can find me at the @databricks booth @ Tue 12 - 4pm
