Daniel Fried
954 posts

@dan_fried
Assistant prof. @LTIatCMU @SCSatCMU. Working on NLP: LLM agents, language-to-code, applied pragmatics, grounding.



New paper: Propose, Solve, Verify
Self-play for code generation via formal verification instead of unit tests:
- propose new problems (formal specs)
- try to solve them (write program and proofs)
- formal verifier checks correctness
arxiv.org/abs/2512.18160
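To make the propose / solve / verify loop concrete, here is a toy sketch in Python. It is not the paper's method: the specs, the candidate programs, and every function name are hypothetical. An exhaustive check over a small integer domain stands in for a real formal verifier, and no proofs are generated.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a spec relates an input x to an acceptable output y.
Spec = Callable[[int, int], bool]
Program = Callable[[int], int]

# Stand-in for formal verification: exhaustively check a small finite domain.
DOMAIN = range(-5, 6)


def propose() -> List[Tuple[str, Spec]]:
    """Propose new problems as (name, spec) pairs. A real proposer would be a model."""
    return [
        ("double", lambda x, y: y == 2 * x),
        ("abs", lambda x, y: y >= 0 and (y == x or y == -x)),
        ("square", lambda x, y: y == x * x),
    ]


def solve(name: str) -> List[Tuple[str, Program]]:
    """Return candidate programs. This toy solver ignores the problem and
    always offers the same fixed pool; a real solver would condition on the spec."""
    return [
        ("x + x", lambda x: x + x),
        ("x if x >= 0 else -x", lambda x: x if x >= 0 else -x),
        ("x", lambda x: x),
    ]


def verify(spec: Spec, program: Program) -> bool:
    """Accept a program only if it satisfies the spec on every input in DOMAIN."""
    return all(spec(x, program(x)) for x in DOMAIN)


def self_play_round() -> List[Tuple[str, str]]:
    """One round: propose problems, try candidates, keep the first verified solution."""
    verified = []
    for name, spec in propose():
        for source, program in solve(name):
            if verify(spec, program):
                verified.append((name, source))
                break
    return verified


results = self_play_round()
for name, source in results:
    print(f"{name}: {source}")
```

In this toy run the "square" problem goes unsolved because no candidate in the pool satisfies its spec, so only verified (problem, solution) pairs are kept.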

One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer use agents?

Our new benchmark, Odysseys (co-led with @JangLawrenceK), is a set of 200 new tasks derived from real-world browsing behavior that measure long-horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient.

Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary.").

I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics for this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!



🚀 Excited to share our ICLR 2026 paper: "From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking"!
Work with Alex Wilf, LP Morency, @dan_fried.
Check out the project here! iclr.cc/virtual/2026/p…

1/ Humans often can’t state exactly what they want, making things hard for AI agents. Obvious fix: ask clarifying questions. But which ones? We studied this empirically with coding agents. Effective clarification comes down to two properties: answerability and task relevance.


I'll be in Rio this week for #ICLR2026 to present "Generative Value Conflicts Reveal LLM Priorities" (Friday morning, P4-#4105). Happy to chat about anything related to LLM alignment, human-AI interaction, or multi-agent systems - feel free to DM if interested!





AI agents are tackling more and more "human work".
But are they benchmarked on the work people actually do?
tl;dr: Not really.
Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere.
📒 We built a database linking agent benchmarks & real-world work.
Submit new tasks + agent trajectories today 🧵

Training on issue-solving only does NOT guarantee transfer to other tasks.
🎨 Introducing Hybrid-Gym - synthetic training tasks for generalization (hybrid-gym.github.io)
+25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0 with NO issue-solving / test-gen/... training









