Daniel Fried

954 posts

@dan_fried

Assistant prof. @LTIatCMU @SCSatCMU. Working on NLP: LLM agents, language-to-code, applied pragmatics, grounding.

Pittsburgh, PA · Joined August 2013
908 Following · 4.1K Followers
Daniel Fried reposted
Mingqian Zheng (@elisazmq_zheng)
LLMs refuse ambiguous queries that look harmful but aren't. Can they recover once users clarify, while staying safe? Our new interactive multi-turn benchmark measures both. 🚨 Turns out: not both at once.
Daniel Fried reposted
Apurva Gandhi (@apurvasgandhi)
Sub-agents are a promising inference-time scaling primitive:
• Expand an agent's working memory
• Divide-and-conquer hard problems
• Solve problems faster with parallel execution
But how do we train a model to best take advantage of sub-agents and make sure we get these benefits?
Very excited to release RAO: Recursive Agent Optimization. RAO is an end-to-end reinforcement learning approach for training LLM agents to spawn, delegate to, and coordinate with recursive copies of themselves (that can themselves spawn other agents), turning recursive inference into a learned capability. 1/10
Daniel Fried reposted
Joachim Baumann @ ICLR'26
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
Daniel Fried (@dan_fried)
How successfully -- and efficiently! -- can agents carry out long-horizon tasks on the web? We built a benchmark of ~200 multi-site tasks, based on people's real browsing history. Many of them take hours to solve. Paper: odysseys-website.pages.dev Led by @JangLawrenceK and @kohjingyu, with @rsalakhu
Quoting Jing Yu Koh (@kohjingyu):

One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer-use agents?
Our new benchmark, Odysseys (co-led with @JangLawrenceK), is a set of 200 new tasks derived from real-world browsing behavior that measure long-horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient.
Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary.").
I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics that corresponded to this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!

Daniel Fried reposted
Anirudh Goyal (@anirudhg9119)
How do coding agents get better from experience? Past Attempts as Interface: Turn rollouts into reusable summaries that future attempts can build on. arxiv.org/abs/2604.16529
Daniel Fried (@dan_fried)
Also at #ICLR2026: a new benchmark for coding agents that implement and run experiments from papers. Masking regions of code gives us a knob to control the difficulty of the task (still verifiable!).
Paper: arxiv.org/abs/2506.19724
Work with @j1mk1m1016, Alex Wilf, and @lpmorency
Quoting James Kim (@j1mk1m1016):

🚀 Excited to share our ICLR 2026 paper: "From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking"! Work with Alex Wilf, LP Morency, and @dan_fried. Check out the project here! iclr.cc/virtual/2026/p…

Daniel Fried reposted
Sanidhya Vijayvargiya (@sanidhya903)
1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.
Daniel Fried reposted
Vijay V. (@vijaytarian)
We trained an 8B model to help coding agents ask users clarifying questions, matching GPT-5 while asking far fewer Q's! We show a concrete playbook for RL in human-AI interaction: use data analysis to find what drives good interactions, then encode it as a structured reward ⬇️🧵
Quoting Sanidhya Vijayvargiya (@sanidhya903):

1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.

Daniel Fried (@dan_fried)
This morning (Fri) at #ICLR2026, check out Andy's work on ConflictScope: determining how an LLM prioritizes between a set of user-provided values, by generating scenarios where the values are in conflict. P4-#4105
Quoting Andy Liu (➡️ bay area) (@uilydna):

I'll be in Rio this week for #ICLR2026 to present "Generative Value Conflicts Reveal LLM Priorities" (Friday morning, P4-#4105). Happy to chat about anything related to LLM alignment, human-AI interaction, or multi-agent systems. Feel free to DM if interested!

Daniel Fried reposted
Daria Kryvosheieva (@DKryvosheieva)
Today’s coding agent evals = single-number benchmark accuracies. But this obscures important details: which tasks in a benchmark are harder, and why? We study agent performance at the task level, and predict how new agents perform on new tasks. 📃To appear at ICLR 2026 AIWILD!
Daniel Fried reposted
Xuhui Zhou (@nlpxuhui)
Creating user simulators is a key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-lead with @sunweiwei12).
Daniel Fried reposted
Pranjal Aggarwal ✈️ ICLR'26 (@PranjalAggarw16)
What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵
Daniel Fried reposted
Fulcrum (@fulcrum_inc)
🚨 We're open-sourcing Druids, a library for coordinating and deploying coding agents across machines. Our beta users have used Druids to work on open math problems, conduct ML "autoresearch," and make software faster.
Daniel Fried reposted
Zora Wang (@ZhiruoW)
To track agent progress at real work, we release a database linking benchmarks <-> real occupations & skills: zorazrw.github.io/ai4work/
‼️ We call for new submissions of:
- Agent benchmarks: guided by our 3 principles (work coverage, realism, and granular evaluation)
- Open agent trajectories: to enable large-scale autonomy analysis
Daniel Fried (@dan_fried)
We analyzed coverage of tasks from 1K US occupations in popular AI agent benchmarks, and found that math and coding are vastly overrepresented. Other domains may be harder to evaluate, but we should look for our keys beyond the lamppost: contribute benchmarks to our database!
Quoting Zora Wang (@ZhiruoW):

AI agents are tackling more and more "human work".
But are they benchmarked on the work people actually do?
tl;dr: Not really.
Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere.
📒 We built a database linking agent benchmarks & real-world work.
Submit new tasks + agent trajectories today 🧵

Daniel Fried reposted
Emmy Liu (@_emliu)
Check out our work on training general SWE-agents! In particular, there are a lot of simple training tasks that don't require execution and can be scaled up to improve model performance across tasks!
Quoting Yiqing Xie (@YiqingXieNLP):

Training on issue-solving only does NOT guarantee transfer to other tasks.
🎨 Introducing Hybrid-Gym: synthetic training tasks for generalization (hybrid-gym.github.io)
+25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0
with NO issue-solving / test-gen/... training
