Diane
@dianetc_
39 posts

Figuring things out slowly. MIT PhD student, prev: @UofMaryland
P{Boston=.8,DC=.2} · Joined March 2021
302 Following · 551 Followers
Diane retweeted
ACM Conference on AI and Agentic Systems
🎤 Keynote announcement: @trq212 (Thariq Shihipar), Member of Technical Staff on Claude Code at @AnthropicAI, is keynoting #CAIS2026. Thariq's "Lessons from Building Claude Code" series on Skills, prompt caching, tool design, and "unhobbling" is required reading for anyone building agentic systems. We're thrilled to have him. 📍 San Jose · May 26–29 🔗 caisconf.org
ACM Conference on AI and Agentic Systems tweet media
2 replies · 19 retweets · 109 likes · 17.2K views
Diane retweeted
Omar Khattab @lateinteraction
A subtle thing that's worth observing is how all four are actually riffs on the same fundamental concept: decomposing something that the mainstream paradigm insists on treating as a monolith.

Late Interaction: decompose document representations into a *set* of objects, and similarity scoring into a *composition* of operations, yet still manage to do search in sub-linear time.

DSPy: decouple the specification of AI systems from their optimization, and decompose AI programs into symbolic module declarations with natural-language specs, instead of monolithic prompt debt.

GEPA: decompose the learning signal from rollouts into the actual tokens and feedback, instead of scalar rewards as in policy-gradient RL.

RLMs: teach models to decompose their treatment of hard problems into symbolic programs that invoke models, and scale understanding of massive context lengths not through monolithic attention but through recursion.

In modern AI, decomposition is usually done poorly, and when done poorly it almost always runs afoul of the bitter lesson. The hard thing all of these have done is manage to last, for up to 6.5 years in the case of late interaction and 3.5 years in the case of DSPy, because they decompose things at the right fundamental joints.
3 replies · 21 retweets · 175 likes · 25.2K views
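For readers who haven't seen late interaction before, the scoring decomposition Omar describes is easy to state in code. A minimal sketch (shapes and names here are illustrative; real systems like ColBERT pair this scoring with approximate nearest-neighbor indexing to get the sub-linear search):

```python
import numpy as np

def late_interaction_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim scoring: for each query token vector, take its maximum
    cosine similarity over the document's token vectors, then sum.

    query_vecs: (num_query_tokens, dim), assumed L2-normalized
    doc_vecs:   (num_doc_tokens, dim),   assumed L2-normalized
    """
    sims = query_vecs @ doc_vecs.T          # (q_tokens, d_tokens) cosine sims
    return float(sims.max(axis=1).sum())    # max over doc tokens, sum over query tokens

# Toy usage: a document is a *set* of vectors, not one pooled embedding.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```

The point of the decomposition: because scoring is a composition of per-token operations, each document token vector can be indexed independently, which is what makes sub-linear retrieval possible.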
Diane retweeted
Christos Tzamos @ChristosTzamos
1/4 LLMs solve research-grade math problems but struggle with basic calculations. We bridge this gap by turning them into computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds, solving even the hardest Sudokus with 100% accuracy
249 replies · 815 retweets · 6.1K likes · 1.8M views
Diane retweeted
Lakshya A Agrawal @LakshyAAAgrawal
Excited to release @gepa_ai's optimize_anything: a universal API for optimizing any text parameter. It consistently matches or outperforms domain-specific tools when optimizing code, prompts, agent harnesses, cloud policies, even visuals! If you can measure it, you can optimize it.
Lakshya A Agrawal tweet media
22 replies · 95 retweets · 520 likes · 123.4K views
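The tweet doesn't show the interface, and I haven't verified gepa_ai's actual optimize_anything signature, so the following is only a hypothetical sketch of the pattern "if you can measure it, you can optimize it": a measure-and-mutate loop over an arbitrary text parameter, where the proposal step would in practice be an LLM rewriting the text given feedback.

```python
import random

def optimize_text(candidate: str, metric, propose, budget: int = 20) -> str:
    """Hypothetical sketch (not the gepa_ai API): hill-climb over text
    candidates using a proposal function and a scalar metric."""
    best, best_score = candidate, metric(candidate)
    for _ in range(budget):
        new = propose(best)                  # e.g. an LLM-suggested rewrite
        score = metric(new)
        if score > best_score:               # keep only improving candidates
            best, best_score = new, score
    return best

# Toy usage: nudge a string toward a target length and keyword.
target = "concise"
metric = lambda s: -abs(len(s) - 24) + (10 if target in s else 0)
propose = lambda s: s + random.choice([" concise", "!", " very"])
print(optimize_text("Make this prompt", metric, propose))
```

The generality comes from the interface, not the search strategy: code, prompts, harness configs, and policies are all just text once you can score them.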
Diane @dianetc_
I tried treating similarity matrices as images and training a CNN to distinguish between gold docs and negatives. It was a fun little experiment and the results were unsurprising in hindsight (worked amazingly, then failed spectacularly). Details here: dianetc.github.io/musings/cnn_le…
0 replies · 0 retweets · 0 likes · 125 views
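A minimal sketch of the setup Diane describes, assuming the input is a query-document token-similarity matrix rendered as a one-channel image (the matrix sizes and the tiny CNN here are illustrative, not taken from her post):

```python
import torch
import torch.nn as nn

# Treat each query-document token-similarity matrix as a 1-channel image
# and classify gold documents vs. negatives. Sizes are illustrative.
class SimMatrixCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),            # tolerate variable doc lengths
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, 1),                # logit: gold vs. negative
        )

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        return self.net(sim)                         # (batch, 1) logits

# Toy batch: 32 similarity "images" of query x doc token similarities.
sims = torch.rand(32, 1, 16, 64)                     # (batch, 1, q_tokens, d_tokens)
labels = torch.randint(0, 2, (32, 1)).float()
model = SimMatrixCNN()
loss = nn.BCEWithLogitsLoss()(model(sims), labels)
loss.backward()
print(loss.item())
```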
Diane retweeted
Prime Intellect @PrimeIntellect
We believe the next breakthrough in long-horizon agents is training models to manage their own context. Introducing our new research direction on Recursive Language Models. We are sharing our initial experiments showing the promise of RLMs. primeintellect.ai/blog/rlm
57 replies · 222 retweets · 1.6K likes · 452.1K views
Diane retweeted
Aaron Roth @Aaroth
A world in which clever discoveries happen in data centers, and the role of the professional researcher is careful verification and due diligence, is a world in which the job of researcher is much less fun. Many fewer people with choices would want this job, given the other costs.
2 replies · 5 retweets · 57 likes · 6.6K views
Diane @dianetc_
@emiyazono This may disadvantage candidates with learning disabilities affecting spatial reasoning or processing speed. It wouldn't predict job performance when they're given the proper time and space to adapt. Useful for gauging points 2-3, but quite noisy for point 1. Not terrible, just noisy
0 replies · 0 retweets · 0 likes · 17 views
Evan Miyazono @emiyazono
I'd also welcome reasons why this is a terrible idea
1 reply · 0 retweets · 1 like · 81 views
Evan Miyazono @emiyazono
Has anyone ever seen or used board games in a job interview process? You get to learn:
- how quickly the candidate learns rules
- how quickly the candidate develops & changes strategies
- how much the person wants to win / how they behave when they lose
3 replies · 0 retweets · 4 likes · 185 views
Diane retweeted
Melissa Pan @melissapan
Thrilled to release our new paper MAP: Measuring Agents in Production ⚙️🚀

2025 is the year of agents… but do they actually work in the real world? Is it just hype? A group of 25 researchers from Berkeley, Stanford, UIUC, IBM, and Intesa Sanpaolo investigated what makes agents deployable in the wild. So…

📈 Why agents? Productivity gains
➕ How to build production agents? Simple & controllable methods
🧑‍💻 How to evaluate agents? Heavy human oversight
🛑 Top challenge now? Reliability remains unsolved

We surveyed 306 agent builders and ran 20 in-depth interviews across 26 agent application domains to understand the current landscape of production agents. Check out our latest paper: MAP - more in the thread 👇 (1/N)
Melissa Pan tweet media
20 replies · 107 retweets · 516 likes · 197K views
Diane retweeted
alex zhang @a1zhang
What if scaling the context windows of frontier LLMs is much easier than it sounds?

We're excited to share our work on Recursive Language Models (RLMs): a new inference strategy where LLMs can decompose and recursively interact with input prompts of seemingly unbounded length, via a REPL environment.

On the OOLONG benchmark, RLMs with GPT-5-mini outperform GPT-5 by over 110% (more than double!) on 132k-token sequences and are cheaper to query on average. On the BrowseComp-Plus benchmark, RLMs with GPT-5 can take in 10M+ tokens as their "prompt" and answer highly compositional queries without degradation, even better than explicit indexing/retrieval.

We link our blogpost, (still very early!) experiments, and discussion below.
alex zhang tweet media
135 replies · 378 retweets · 2.8K likes · 948.9K views
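A hedged sketch of the recursive idea as described in the RLM posts above: the model never ingests the full prompt at once; long contexts are split and handled by recursive calls whose partial answers are then synthesized. In the actual work the model itself drives the slicing through a REPL environment; the fixed binary split below is a simplification, and the mock model stands in for a real API call.

```python
def rlm(llm, query: str, context: str, depth: int = 0,
        max_depth: int = 3, chunk: int = 4000) -> str:
    """Sketch of a Recursive Language Model call over a long context."""
    # Base case: context fits in one call, answer directly.
    if depth == max_depth or len(context) <= chunk:
        return llm(f"{context}\n\nQ: {query}\nA:")
    # Recursive case: split the context, recurse on each half, synthesize.
    mid = len(context) // 2
    left = rlm(llm, query, context[:mid], depth + 1, max_depth, chunk)
    right = rlm(llm, query, context[mid:], depth + 1, max_depth, chunk)
    return llm(f"Findings:\n- {left}\n- {right}\n\nQ: {query}\nSynthesized A:")

# Toy usage with a stand-in "model" that reports what it saw.
mock = lambda p: f"[answer from {len(p)}-char prompt]"
print(rlm(mock, "Where is the needle?", "x" * 20000))
```

The appeal is that no single call ever pays the cost of monolithic attention over the whole input, which is how the 10M+-token "prompts" become feasible.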
Diane @dianetc_
@HaochengXiUCB Hey, I'm interested in checking this out more, but the GitHub link isn't working (?)
0 replies · 0 retweets · 0 likes · 27 views
Haocheng Xi @HaochengXiUCB
5. Summary

We explored whether we could use LLMs' coding capability: have the model write code and leverage numerical solvers to find the solution, rather than directly predicting the solution map, which can involve a large number of floating-point values. Here's what we found:
• 🧠 LLMs can reason about stiffness & write scientific code
• ✏️ Prompting alone gets high accuracy with strong models
• 🧪 Fine-tuning boosts weaker ones significantly
• 🔓 We've open-sourced both datasets for future research
📄 Paper: arxiv.org/abs/2509.09936
💻 Code: github.com/SqueezeAILab/s…
1 reply · 0 retweets · 1 like · 226 views
Haocheng Xi @HaochengXiUCB
Introducing SciML Agent: Write the Solver, Not the Solution!

Motivation: Most prior work in Scientific ML (e.g., PINNs, neural ODEs, operator learning) tries to predict the solution directly with neural networks (which means outputting a large set of floating-point values that are hard to get right directly). But in practice, these approaches often struggle with:
• Difficulty in solving the resulting optimization problem
• Poor generalization to newer environments/settings
• Numerical instability during training

What if we flip the paradigm? 💡 Why not leverage years of progress in numerical methods and let the LLM reason about the problem and write code to solve it?

🔍 But first, a major gap: there was no benchmark to evaluate whether LLMs can generate scientifically valid code to solve ODEs. So we built two:
• A diagnostic dataset where we intentionally create "misleading" ODEs, to test whether the agent focuses on superficial properties of the problem or can perform non-trivial reasoning
• ODE-1000, a dataset of 1000 diverse, verified ODE tasks, each with a natural-language description of the problem and the corresponding Python solution

Key insights from our evaluation:
• We achieve surprisingly good results on both datasets, especially with newer instruction-following models
• Importantly, we find that fine-tuning may not be required as long as the model has enough context, instructions, and capacity to follow those instructions
• Fine-tuning can still improve performance, especially for smaller or older models

Overall, with either effective prompting or fine-tuning, our preliminary results indicate that it's possible to build a specialized SciML agent that reliably solves ODE problems.

📄 Paper: arxiv.org/abs/2509.09936
🔓 Datasets & Code: github.com/SqueezeAILab/s…

Joint work with: Saarth Gaonkar, Xiang Zheng, @HaochengXiUCB, @rish2k1, @KurtKeutzer, Dmitriy Morozov, Michael W. Mahoney, @amir__gholami

🧵 Thread below 👇
Haocheng Xi tweet media
6 replies · 5 retweets · 22 likes · 2.4K views
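The "write the solver, not the solution" paradigm in miniature: rather than predicting trajectory values directly, the agent emits code that hands the ODE to a battle-tested numerical solver. The toy ODE below is my own example, not one from the paper.

```python
from scipy.integrate import solve_ivp

# The kind of program a SciML agent would write: define the right-hand side
# and delegate the hard numerics to an established solver.
# Toy problem (not from the paper): damped oscillator y'' + 10y' + y = 0.
def rhs(t, y):
    # State y = [position, velocity].
    return [y[1], -10.0 * y[1] - y[0]]

sol = solve_ivp(
    rhs, t_span=(0.0, 20.0), y0=[1.0, 0.0],
    method="BDF",          # implicit method, appropriate if the problem is stiff
    rtol=1e-8, atol=1e-10,
)
print(sol.t[-1], sol.y[0, -1])   # position at t = 20
```

Note the division of labor the thread argues for: the LLM's job is the reasoning (recognizing stiffness, choosing BDF, setting tolerances), while the solver handles the floating-point values that are hard to predict directly.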
Diane @dianetc_
@mail4agam GPT-5 did reasonably well on the math-errors benchmark; it's just that some weird behaviors (quantization) indicated lots of headroom. Maybe a more advanced model will correct for these issues, but currently a systems approach is what works well
0 replies · 0 retweets · 2 likes · 67 views
Diane @dianetc_
@mail4agam Essentially, encoder-only Transformers face optimization challenges due to misalignment between the cross-entropy loss and regression objectives, pointing to some fundamental issues... But
1 reply · 0 retweets · 2 likes · 69 views
Diane @dianetc_
We asked LLMs to estimate the *fraction* of a math solution that was right… Turns out that while they can reason through complex problems, they still have a hard time producing precise numerical outputs. Let's talk about what we call Reasoning-Intensive Regression (RiR) tasks 🧵
Diane tweet media
6 replies · 32 retweets · 270 likes · 35.5K views
Diane @dianetc_
MENTAT offers an efficient approach, but is only one step towards fixing the fundamental reasoning-precision trade-off in current low-data regimes. Much headroom for future work! Thanks to @lateinteraction for guidance & coauthoring! Full paper: arxiv.org/abs/2508.21762
0 replies · 2 retweets · 17 likes · 699 views
Diane @dianetc_
📝 Recap: We define RiR tasks as those requiring precise predictions, proper ranking, AND deep per-instance reasoning. Standard methods struggle to balance these on RiR tasks! We introduce MENTAT, a simple algorithm that combines lightweight batched prompt evolution with ensemble learning.
1 reply · 0 retweets · 10 likes · 741 views
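A schematic of the two ingredients the recap names, batched prompt evolution plus ensembling. This is not MENTAT's actual implementation (see the paper for that); the helper names and prompt wording below are hypothetical.

```python
import statistics

def predict(llm, prompt: str, x: str) -> float:
    """One numeric prediction, e.g. the fraction of a solution that's correct."""
    return float(llm(f"{prompt}\n\nInput: {x}\nFraction correct (0-1):"))

def evolve_prompt(llm, prompt: str, batch, n_rounds: int = 3) -> str:
    """Batched prompt evolution: show the model its predictions on a small
    batch vs. the gold scores and ask it to rewrite the instructions."""
    for _ in range(n_rounds):
        preds = [(x, y, predict(llm, prompt, x)) for x, y in batch]
        feedback = "\n".join(f"input={x} gold={y} pred={p}" for x, y, p in preds)
        prompt = llm(f"Improve these scoring instructions given the mistakes:\n"
                     f"{prompt}\n\nMistakes:\n{feedback}\n\nNew instructions:")
    return prompt

def ensemble_predict(llm, prompts, x: str) -> float:
    # Ensemble step: aggregate predictions from several evolved prompts,
    # trading per-prompt noise for a more precise pooled estimate.
    return statistics.mean(predict(llm, p, x) for p in prompts)

# Toy usage with a stand-in "model" that always answers "0.5".
mock = lambda _: "0.5"
prompts = [evolve_prompt(mock, "Score the solution.", [("2+2=5", 0.0)]) for _ in range(3)]
print(ensemble_predict(mock, prompts, "2+2=4"))
```

The ensembling is what addresses the precision problem the thread opens with: a single LLM call gives coarse, quantized numbers, while averaging several independently evolved predictors can land between those coarse levels.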