Diane

39 posts

Diane

@dianetc_

Figuring things out slowly. MIT PhD student, prev: @UofMaryland

P{Boston=.8,DC=.2} Katılım Mart 2021

302 Takip Edilen551 Takipçiler

Diane retweetledi

ACM Conference on AI and Agentic Systems@CAISconf·4d

🎤 Keynote announcement: @trq212 (Thariq Shihipar), Member of Technical Staff on Claude Code at @AnthropicAI, is keynoting #CAIS2026. Thariq's "Lessons from Building Claude Code" series on Skills, prompt caching, tool design, and "unhobbling" is required reading for anyone building agentic systems. We're thrilled to have him. 📍 San Jose · May 26–29 🔗 caisconf.org

ACM Conference on AI and Agentic Systems tweet media

English

109

17.2K

Diane retweetledi

Omar Khattab@lateinteraction·12 Nis

A subtle thing that's worth observing is how all four are actually riffs on the same fundamental concept: decomposing something that the mainstream paradigm insists as treating as a monolith. Late Interaction: decompose document representations into a *set* of objects, and similarity scoring as a *composition* of operations, yet nonetheless manage to do search in sub-linear time. DSPy: decouple the specification vs. optimization of AI systems, and decompose AI programs into symbolic modules declarations with natural language specs, instead of monolithic prompt debt. GEPA: decompose the learning signal from rollouts into the actual tokens and feedback, instead of scalar rewards as in policy gradient RL. RLMs: teach models to decompose their treatment of hard problems into symbolic programs that invoke models, and scale understanding massive context lengths not by monolithic attention but through recursion. In modern AI, decomposition is usually done poorly and almost always runs foul of the bitter lesson when done poorly. The hard things that all of these have done is manage to last, for up to 6.5 years in the case of late interaction and 3.5 years in the case of DSPy, because they decompose things at the right fundamental joints.

English

175

25.2K

Diane retweetledi

Christos Tzamos@ChristosTzamos·12 Mar

1/4 LLMs solve research grade math problems but struggle with basic calculations. We bridge this gap by turning them to computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds solving even the hardest Sudokus with 100% accuracy

English

249

815

6.1K

1.8M

Diane retweetledi

Lakshya A Agrawal@LakshyAAAgrawal·19 Şub

Excited to release @gepa_ai's optimize_anything: a universal API for optimizing any text parameter. It consistently matches or outperforms domain-specific tools optimizing code, prompts, agent harnesses, cloud policies, even visuals! If you can measure it, you can optimize it.

English

520

123.4K

Diane@dianetc_·18 Oca

I tried treating similarity matrices as images and training a CNN to distinguish between gold docs and negatives. It was a fun little experiment and the results were unsurprising in hindsight (worked amazingly then failed spectacularly). details here: dianetc.github.io/musings/cnn_le…

English

125

Diane retweetledi

Prime Intellect@PrimeIntellect·2 Oca

We believe the next breakthrough in long-horizon agents is training models to manage their own context. Introducing our new research direction on Recursive Language Models. We are sharing our initial experiments showing the promise of RLMs. primeintellect.ai/blog/rlm

English

222

1.6K

452.1K

Diane@dianetc_·1 Oca

I didn’t expect him to talk about love at the end but wow he hit the nail on the head. A person that knows that they are deeply loved, walks with a confidence and sureness that you cannot put a price on. Thank you Mr.Brown!

Morgan Brown@morganb

I turn 50 next month and have been reflecting on what I’ve learned so far and how I want to live next. I wrote down a few thoughts here: docs.google.com/document/d/1NR…

English

121

Diane retweetledi

Aaron Roth@Aaroth·21 Ara

A world in which clever discoveries happen in data centers, and the role of the professional researcher is careful verification and due diligence is a world in which the job of researcher is much less fun. Many fewer people with choices would want this job, given the other costs.

English

6.6K

Diane@dianetc_·14 Ara

@emiyazono This may disadvantage candidates with learning disabilities affecting spatial reasoning/processing speed. It wouldn’t predict job performance when they are given the proper time and space to adapt. Useful for gauging points 2-3 but quite noisy for point 1. Not terrible just noisy

English

Evan Miyazono@emiyazono·31 Eki

I'd also welcome reasons why this is a terrible idea

English

Evan Miyazono@emiyazono·31 Eki

Has anyone ever seen or used board games in a job interview process? you get to learn: - how quickly the candidate learns rules - how quickly the candidate develops & changes strategies - evaluate how much the person wants to win / behaves when they lose

English

185

Diane retweetledi

Melissa Pan@melissapan·5 Ara

Thrilled to release our new paper MAP: Measuring Agents in Production ⚙️🚀 2025 is the year of agents… but do they actually work in the real world? Is it just hype? A group of 25 researchers from Berkeley, Stanford, UIUC, IBM, and Intesa Sanpaolo investigated what makes agents deployable in the wild. So… 📈 Why agents? Productivity gains ➕ How to build production agents? Simple & controllable methods 🧑‍💻 How to evaluate agents? Heavy human oversight 🛑 Top challenge now? Reliability remains unsolved We surveyed 306 agent builders and ran 20 in-depth interviews across 26 agent application domains to understand the current landscape of production agents. Check out our latest paper: MAP - more in the thread 👇 (1/N)

English

107

516

197K

Diane retweetledi

alex zhang@a1zhang·15 Eki

What if scaling the context windows of frontier LLMs is much easier than it sounds? We’re excited to share our work on Recursive Language Models (RLMs). A new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length, as a REPL environment. On the OOLONG benchmark, RLMs with GPT-5-mini outperforms GPT-5 by over 110% gains (more than double!) on 132k-token sequences and is cheaper to query on average. On the BrowseComp-Plus benchmark, RLMs with GPT-5 can take in 10M+ tokens as their “prompt” and answer highly compositional queries without degradation and even better than explicit indexing/retrieval. We link our blogpost, (still very early!) experiments, and discussion below.

English

135

378

2.8K

948.9K

Diane@dianetc_·21 Eyl

@HaochengXiUCB Hey, I'm interested in checking this out more, but the github link isn't working (?)

English

Haocheng Xi@HaochengXiUCB·19 Eyl

5. Summary We explored if we could use LLM’s coding capability and have it write code and leverage numerical solvers to find the solution, rather than directly predicting the solution map which could involve a large number of floating point values. Here’s what we found: • 🧠 LLMs can reason about stiffness & write scientific code • ✏️ Prompting alone gets high accuracy with strong models • 🧪 Fine-tuning boosts weaker ones significantly • 🔓 We've open-sourced both datasets for future research 📄 Paper: arxiv.org/abs/2509.09936 💻 Code: github.com/SqueezeAILab/s…

English

226

Haocheng Xi@HaochengXiUCB·19 Eyl

Introducing SciML Agent: Write the Solver, Not the Solution! Motivation: Most prior work in Scientific ML (e.g., PINNs, neural ODEs, operator learning) tries to predict the solution directly with neural networks (which means outputting a large set of floating point values which are hard to get right directly). But in practice, these approaches often struggle with: • Difficulty in solving the resulting optimization problem • Poor generalization to newer environments/settings • Numerical instability during training What if we flip the paradigm? 💡 Why not leverage years of progress in numerical methods and let the LLMs reason about the problem, and write a code to solve it? 🔍 But first, a major gap: There was no benchmark to evaluate whether LLMs can generate scientifically valid code to solve ODEs. So we built two: • Diagnostic dataset where we intentionally create “misleading” ODEs, to test if the agent is focusing on superficial properties of the problem or can perform non-trivial reasoning • ODE-1000 dataset, which consists of 1000 diverse, verified ODE tasks each with a natural language description of the problem and the corresponding Python solution Key Insights from our evaluation: • We can achieve surprisingly good results on both datasets, specially with newer instruction following models • Importantly we find that fine-tuning may not be required as long as the model has enough context, instructions, and capacity to follow those instructions • Fine-tuning can still improve the performance, especially for smaller or older models. Overall, with either effective prompting or fine-tuning, our preliminary results indicate that it’s possible to build a specialized SciML agent to reliably solve ODE problems. 📄 Paper: arxiv.org/abs/2509.09936 🔓 Datasets & Code: github.com/SqueezeAILab/s… Joint work with: Saarth Gaonkar, Xiang Zheng, @HaochengXiUCB ,@rish2k1 , @KurtKeutzer , Dmitriy Morozov, Michael W. Mahoney, @amir__gholami 🧵Thread below 👇

English

2.4K

Diane@dianetc_·17 Eyl

@SadhikaMalladi @natolambert As in formalizing what makes a prompt “good”? There’s a bunch of work on optimizing prompts (e.g. arxiv.org/abs/2405.17346) or generalization bounds (e.g. arxiv.org/abs/2310.03957) but I don’t know anything like what you may be describing offhand

English

Sadhika Malladi@SadhikaMalladi·13 Ara

@natolambert Yes! Building a useful theoretical description of this currently :)

English

179

Nathan Lambert@natolambert·13 Ara

All post training data is about is really good prompts for your tasks imo.

Alexandr Wang@alexandr_wang

Compute matters, but so does data, and we have reached a pre-training data wall. Get ready for the post-training data boom. Companies will race to have the best frontier data—multi-modal, agentic, complex reasoning, and more. Follow the data, find the winners. 7/8

English

7.6K

Diane@dianetc_·2 Eyl

@mail4agam GPT5 did reasonably well on the math errors benchmark, its just that some weird behaviors (quantization) indicated lots of headroom. Maybe a more advanced model will correct for these issues but currently a systems approach is what works well

English

Diane@dianetc_·2 Eyl

@mail4agam Essentially, encoder-only Transformer face optimization challenges due to misalignment between cross-entropy loss and regression objectives pointing to some fundamentals issues...But

English

Diane@dianetc_·1 Eyl

We asked LLMs to estimate the *fraction* of a math solution that was right… Turns out that while they can reason through complex problems they still have a hard time producing precise numerical outputs Let’s talk about what we call Reasoning-Intensive Regression (RiR) tasks 🧵

English

270

35.5K

Diane@dianetc_·1 Eyl

MENTAT offers an efficient approach, but is only one step towards fixing the fundamental reasoning-precision trade-off in current low-data regimes. Much headroom for future work! Thanks to @lateinteraction for guidance & coauthoring! Full paper: arxiv.org/abs/2508.21762

English

699

Diane@dianetc_·1 Eyl

📝 Recap: We define RIR as those requiring precise predictions, proper ranking, AND deep per-instance reasoning. Standard RiR methods struggle with balancing these! We introduce MENTAT, a simple algorithm that combines lightweight batched prompt evolution with ensemble learning.

English

741

Keşfet

@trq212 @AnthropicAI @gepa_ai @emiyazono @HaochengXiUCB @rish2k1 @KurtKeutzer @amir__gholami