Daniel Fried

954 posts

@dan_fried

Assistant prof. @LTIatCMU @SCSatCMU. Working on NLP: LLM agents, language-to-code, applied pragmatics, grounding.

Pittsburgh, PA · Joined August 2013

908 Following · 4.1K Followers

Daniel Fried retweeted

Mingqian Zheng@elisazmq_zheng·13 May

LLMs refuse ambiguous queries that look harmful but aren't. Can they recover once users clarify, while staying safe? Our new interactive multi-turn benchmark measures both. 🚨 Turns out: not both at once.

Daniel Fried retweeted

(((ل()(ل() 'yoav))))👾@yoavgo·8 May

coding agents are not compilers from english to programs. and it is not because they are not deterministic. gist.github.com/yoavg/b2454c4d…

Daniel Fried retweeted

Apurva Gandhi@apurvasgandhi·8 May

Sub-agents are a promising inference-time scaling primitive:
• Expand an agent's working memory
• Divide-and-conquer hard problems
• Solve problems faster with parallel execution

But how do we train a model to best take advantage of sub-agents and make sure we get these benefits?

Very excited to release RAO: Recursive Agent Optimization. RAO is an end-to-end reinforcement learning approach for training LLM agents to spawn, delegate to, and coordinate with recursive copies of themselves (that can themselves spawn other agents) - turning recursive inference into a learned capability. 1/10

Daniel Fried retweeted

Sean Welleck@wellecks·1 May

Propose, Solve, Verify (PSV) accepted at ICML! arxiv.org/abs/2512.18160

Sean Welleck@wellecks

New paper: Propose, Solve, Verify

Self-play for code generation via formal verification instead of unit tests:
- propose new problems (formal specs)
- try to solve them (write program and proofs)
- formal verifier checks correctness

arxiv.org/abs/2512.18160

Daniel Fried retweeted

Joachim Baumann @ ICLR'26@joabaum·27 Apr

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇

Daniel Fried@dan_fried·29 Apr

How successfully -- and efficiently! -- can agents carry out long-horizon tasks on the web? We built a benchmark of ~200 multi-site tasks, based on people's real browsing history. Many of them take hours to solve. Paper: odysseys-website.pages.dev Led by @JangLawrenceK and @kohjingyu, with @rsalakhu

Jing Yu Koh@kohjingyu

One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer use agents?

Our new benchmark, Odysseys (co-led with @JangLawrenceK) is a set of 200 new tasks derived from real world browsing behavior that measure long horizon web navigation capabilities (potentially up to hours of web browsing work).

Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient.

Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary.").

I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics that corresponded to this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!

Daniel Fried retweeted

Anirudh Goyal@anirudhg9119·22 Apr

How do coding agents get better from experience? Past Attempts as Interface: Turn rollouts into reusable summaries that future attempts can build on. arxiv.org/abs/2604.16529

Daniel Fried@dan_fried·25 Apr

Also at #ICLR2026: a new benchmark for coding agents that implement and run experiments from papers. Masking regions of code gives us a knob to control difficulty of the task (still verifiable!) Paper: arxiv.org/abs/2506.19724 Work with @j1mk1m1016, Alex Wilf, and @lpmorency

James Kim@j1mk1m1016

🚀 Excited to share our ICLR 2026 paper: "From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking"! Work with Alex Wilf, LP Morency, @dan_fried Check out the project here! iclr.cc/virtual/2026/p…

Daniel Fried retweeted

Sanidhya Vijayvargiya@sanidhya903·22 Apr

1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.

Daniel Fried retweeted

Vijay V.@vijaytarian·23 Apr

We trained an 8B model to help coding agents ask users clarifying questions, matching GPT-5 while asking far fewer Q's! We show a concrete playbook for RL in human-AI interaction: use data analysis to find what drives good interactions, then encode it as a structured reward ⬇️🧵

Sanidhya Vijayvargiya@sanidhya903

Daniel Fried@dan_fried·24 Apr

Paper: arxiv.org/abs/2509.25369 Work with @uilydna, @GhateKshitish, @MonaDiab77, @dan_fried, @Dr_Atoosa, @maxhkw

Daniel Fried@dan_fried·24 Apr

This morning (Fri) at #ICLR2026, check out Andy's work on ConflictScope: determining how an LLM prioritizes between a set of user-provided values, by generating scenarios where the values are in conflict. P4-#4105

Andy Liu (➡️ bay area)@uilydna

I'll be in Rio this week for #ICLR2026 to present "Generative Value Conflicts Reveal LLM Priorities" (Friday morning, P4-#4105). Happy to chat anything related to LLM alignment, human-AI interaction, or multi-agent systems - feel free to DM if interested!

Daniel Fried retweeted

Daria Kryvosheieva@DKryvosheieva·21 Apr

Today’s coding agent evals = single-number benchmark accuracies. But this obscures important details: which tasks in a benchmark are harder, and why? We study agent performance at the task level, and predict how new agents perform on new tasks. 📃To appear at ICLR 2026 AIWILD!

Daniel Fried retweeted

Xuhui Zhou@nlpxuhui·19 Mar

Creating user simulators is a key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-lead with @sunweiwei12).

Daniel Fried retweeted

Pranjal Aggarwal ✈️ ICLR'26@PranjalAggarw16·8 Apr

What if computer-use agents could do real work? We built Gym-Anything: a framework that turns any software into a computer-use agent environment. We used it to create CUA-World: 200+ real software, 10,000+ tasks and environments, across all major occupation groups, from medical imaging to financial trading. 🧵

Daniel Fried retweeted

Fulcrum@fulcrum_inc·26 Mar

🚨 We're open-sourcing Druids, a library for coordinating and deploying coding agents across machines. Our beta users have used Druids to work on open math problems, conduct ML "autoresearch," and make software faster.

Daniel Fried retweeted

Zora Wang@ZhiruoW·3 Mar

To track agent progress at real work, we release a database linking benchmarks <-> real occupations & skills: zorazrw.github.io/ai4work/

‼️We call for new submissions of:
- Agent benchmarks: guided by our 3 principles - work coverage, realism, and granular evaluation
- Open agent trajectories: to enable large-scale autonomy analysis

Daniel Fried@dan_fried·4 Mar

We analyzed coverage of tasks from 1K US occupations in popular AI agent benchmarks, and found math and coding are vastly overrepresented. Other domains may be harder to evaluate, but we should look for our keys beyond the lamppost: contribute benchmarks to our database!

Zora Wang@ZhiruoW

AI agents are tackling more and more "human work"
But are they benchmarked on the work people actually do?

tl;dr: Not really
Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere.

📒 We built a database linking agent benchmarks & real-world work
Submit new tasks + agent trajectories today 🧵

Daniel Fried retweeted

Emmy Liu@_emliu·23 Feb

Check out our work on training general SWE-agents! In particular there are a lot of simple training tasks that don't require execution and can be scaled up to improve model performance across tasks!

Yiqing Xie@YiqingXieNLP

Training on issue-solving only does NOT guarantee transfer to other tasks. 🎨Introducing Hybrid-Gym - synthetic training tasks for generalization (hybrid-gym.github.io) +25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0 with NO issue-solving / test-gen/... training
