Daniel Fried

937 posts

@dan_fried

Assistant prof. @LTIatCMU @SCSatCMU. Working on NLP: LLM agents, language-to-code, applied pragmatics, grounding.

Pittsburgh, PA · Joined August 2013
896 Following · 4K Followers
Daniel Fried retweeted
Zora Wang@ZhiruoW·
To track agent progress at real work, we release a database linking benchmarks <-> real occupations & skills: zorazrw.github.io/ai4work/
‼️ We call for new submissions of:
- Agent benchmarks, guided by our 3 principles: work coverage, realism, and granular evaluation
- Open agent trajectories, to enable large-scale autonomy analysis
Daniel Fried@dan_fried·
We analyzed coverage of tasks from 1K US occupations in popular AI agent benchmarks, and found that math and coding are vastly overrepresented. Other domains may be harder to evaluate, but we should look for our keys beyond the lamppost: contribute benchmarks to our database!
Zora Wang@ZhiruoW

AI agents are tackling more and more "human work." But are they benchmarked on the work people actually do? tl;dr: not really. Most benchmarks focus on math & coding, while most human labor and capital lie elsewhere.
📒 We built a database linking agent benchmarks & real-world work. Submit new tasks + agent trajectories today 🧵

Daniel Fried retweeted
Emmy Liu@_emliu·
Check out our work on training general SWE-agents! In particular, there are a lot of simple training tasks that don't require execution and can be scaled up to improve model performance across tasks!
Yiqing Xie@YiqingXieNLP

Training on issue-solving only does NOT guarantee transfer to other tasks.
🎨 Introducing Hybrid-Gym: synthetic training tasks for generalization (hybrid-gym.github.io)
+25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0, with NO issue-solving / test-gen / ... training

Daniel Fried@dan_fried·
How do we avoid models overfitting to SWE-bench / issue solving? We find that multi-task training on simple synthetic tasks shows surprising generalization to more realistic tasks -- even improving on issue solving without training on it!
Yiqing Xie@YiqingXieNLP

Training on issue-solving only does NOT guarantee transfer to other tasks.
🎨 Introducing Hybrid-Gym: synthetic training tasks for generalization (hybrid-gym.github.io)
+25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0, with NO issue-solving / test-gen / ... training

Daniel Fried retweeted
Ziqian Zhong@fjzzq2002·
🔭 We’re releasing Hodoscope: an open-source tool for unsupervised behavior discovery. It lets you visually explore and compare agent behaviors at scale. It helped us discover a novel reward hacking vulnerability in Commit0 - with just a couple minutes of human effort.
Daniel Fried retweeted
Yike Wang@yikewang_·
Small language models are not very helpful as judges. How about 🔄 backward inference: inferring the instruction given only the response, and using the similarity between the inferred and original instructions as the reward signal?
Introducing ⚙️ FLIP, a reference-free and rubric-free reward modeling approach that boosts the RewardBench2 performance of 13 small language models by an average of 79.6%, and substantially outperforms LLM-as-a-Judge under test-time scaling via parallel sampling and GRPO training.
📄 paper: arxiv.org/abs/2602.13551
🔗 code: github.com/yikee/FLIP
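The backward-inference idea above can be sketched in a few lines. Note this is a toy illustration, not FLIP's actual implementation: `dummy_infer` stands in for a small LM prompted to reconstruct the instruction, and bag-of-words cosine stands in for a real text encoder.

```python
import math
from collections import Counter

def bow_embed(text):
    # toy bag-of-words "embedding"; a real system would use a sentence encoder
    return Counter(text.lower().split())

def cosine(a, b):
    # cosine similarity between two sparse count vectors
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def flip_reward(instruction, response, infer_instruction):
    # backward inference: guess the instruction from the response alone,
    # then score the response by how close the guess is to the real instruction
    inferred = infer_instruction(response)
    return cosine(bow_embed(inferred), bow_embed(instruction))

def dummy_infer(response):
    # stand-in for the small LM that reconstructs an instruction
    return "please " + " ".join(response.split()[:6])

instruction = "summarize the quarterly sales report"
on_topic = "the quarterly sales report shows steady growth"
off_topic = "bananas are a yellow tropical fruit"
```

A response that actually addresses the instruction yields a higher reward than an unrelated one, which is what lets the similarity score serve as a reference-free reward signal.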
Daniel Fried retweeted
Zora Wang@ZhiruoW·
Most agents either run fully autonomously or interrupt at the wrong times. What if agents knew when YOU want to step in?
🚀 Introducing PlowPilot, a web agent that adapts to your interaction patterns, achieving +26.5% user-reported usefulness.
Huge credit to @FariaHuqOaishi for leading this project!
Daniel Fried retweeted
Ming Jin@MingJin_AI·
How do AI agents solve long-horizon tasks differently than humans? 🤖💻 This Friday, we are thrilled to host @dan_fried (Daniel Fried) from @SCSatCMU at the AI Agent Frontier Seminar! Daniel will discuss Agent Workflow Memory, inducing executable programs, and the stark contrast between UI-centric human methods and programmatic AI strategies.
📅 Friday, Feb 20
🕛 12 PM ET / 9 AM PT
📍 Zoom: virginiatech.zoom.us/j/87872134251
🔑 Passcode: 309194
Organizers: @yalidux @ShangdingG95714 @MingJin_AI
#AIAgents #LLM #NLP #MachineLearning #CMU
Daniel Fried@dan_fried·
How do we enable people to trust, control, and verify coding agents as they carry out increasingly complex tasks? We think it's time to increase our focus in the ML coding agent community on human-facing problems (and talk more with HCI and SE)!
Zora Wang@ZhiruoW

‼️Position: AI coding agent research needs recalibration. We've heavily optimized for solo autonomy, and far less for designing agents that empower the humans using them. It’s time to build human-centered coding agents. 🧵

Daniel Fried retweeted
Maarten Sap (he/him)@MaartenSap·
🚀 Apply to CMU LTI's Summer 2026 "Language Technology for All" internship
🎓 Open to pre-doctoral students new to language tech (non-CS backgrounds welcome)
🔬 12-14 weeks in-person in Pittsburgh; travel + stipend paid
💸 Deadline: Feb 20, 11:59pm ET
forms.gle/cUu8g6wb27HsWW…
Daniel Fried@dan_fried·
Why do diffusion LMs outperform autoregressive LMs on some reasoning tasks? It's partly because they actually use more supervised computation per token generated. This computation is structured into what we call "latent tokens". We can modulate latent tokens to trade off compute <> performance, and port them to improve autoregressive LMs on the same tasks! Check out Andre and Sean's threads:
Sean Welleck@wellecks

There's been a lot of excitement about diffusion LMs, but we don't have a good understanding of what they might offer over autoregressive models in areas like reasoning and planning. We show that diffusion models can leverage what we call latent tokens to gain an advantage on certain kinds of reasoning problems.

Daniel Fried retweeted
Sean Welleck@wellecks·
New paper: Propose, Solve, Verify
Self-play for code generation via formal verification instead of unit tests:
- propose new problems (formal specs)
- try to solve them (write programs and proofs)
- a formal verifier checks correctness
arxiv.org/abs/2512.18160
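The propose-solve-verify loop can be caricatured in a few lines. Everything here is a stand-in: the proposer and solver are fixed functions rather than an LLM, and a simple property checker plays the role of the formal verifier (the paper uses formal specs and proofs, not executable checks).

```python
import random

def propose(rng):
    # proposer: emit a new problem as a checkable spec
    # (toy spec family: "f(x) = x + n" for a random n)
    n = rng.randint(1, 9)
    return {"desc": f"f(x) = x + {n}", "n": n,
            "check": lambda f, n=n: all(f(x) == x + n for x in range(10))}

def solve(spec):
    # solver: produce a candidate program (the real system also writes a proof)
    n = spec["n"]
    return lambda x: x + n

def verify(spec, program):
    # verifier: accept only programs the spec's check certifies correct
    return spec["check"](program)

rng = random.Random(0)
verified = []
for _ in range(5):
    spec = propose(rng)
    program = solve(spec)
    if verify(spec, program):
        # verified (spec, program) pairs become self-play training data
        verified.append((spec["desc"], program))
```

The key design point the tweet highlights: because the verifier, not a hand-written test suite, decides correctness, the loop can generate its own problems and training signal.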
Daniel Fried retweeted
Sida Wang@sidawxyz·
The latest work from @YuxiangWei9, showing one can get a good software agent from self-exploration of natural codebases, impressively without human-written issues or data curation. As a small bonus, appendix B shows what happens if you take self-play too far, and includes some mini-positions of mine.
Yuxiang Wei@YuxiangWei9

Software agents can self-improve via self-play RL.
Introducing Self-play SWE-RL (SSR): training a single LLM agent to self-play between bug injection and bug repair, grounded in real-world repositories, with no human-labeled issues or tests. 🧵

Daniel Fried retweeted
Yuxiang Wei@YuxiangWei9·
Software agents can self-improve via self-play RL.
Introducing Self-play SWE-RL (SSR): training a single LLM agent to self-play between bug injection and bug repair, grounded in real-world repositories, with no human-labeled issues or tests. 🧵
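The bug-injection / bug-repair loop can be sketched on a toy "repository" of one function. The injector and repairer here are hard-coded string edits standing in for the LLM agent's moves; the repo's own tests supply the grounding signal, as in the tweet's description.

```python
# toy repo: one function, plus the repo's test suite below
GOOD_SRC = "def add(a, b):\n    return a + b\n"

def run_tests(src):
    # grounding signal: does the repo's own test suite pass?
    ns = {}
    exec(src, ns)
    return ns["add"](2, 3) == 5 and ns["add"](-1, 1) == 0

def inject_bug(src):
    # injector move: here a hard-coded operator flip
    # (the real agent proposes edits to real repositories)
    return src.replace("a + b", "a - b")

def repair(src):
    # repairer move: here a hard-coded fix
    # (the real agent has to search for the patch)
    return src.replace("a - b", "a + b")

buggy = inject_bug(GOOD_SRC)
fixed = repair(buggy)
# one self-play round: the injector is rewarded when its bug breaks the
# tests, and the repairer is rewarded when its patch makes them pass again
```

Because both roles are played by the same model and scored against executable tests, no human-labeled issues or tests are needed, which is the point of SSR.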
Daniel Fried retweeted
Yueqi Song@yueqi_song·
We just built and released the largest dataset for supervised fine-tuning of agentic LMs: 1.27M trajectories (~36B tokens)!
Until now, large-scale SFT for agents has been rare - not for lack of data, but because of fragmentation across heterogeneous formats, tools, and interfaces.
To solve this, we introduce the Agent Data Protocol (ADP), a new "interlingua" between a broad variety of heterogeneous agent datasets - coding, browsing, API/tool use - and unified agent training pipelines downstream. We unified 13 datasets into ADP, converted them to be compatible with multiple agent frameworks, and observed ~20% average gains, reaching SOTA/near-SOTA without domain-specific tuning.
📄 Read our paper: arxiv.org/abs/2510.24702
🌐 Project website: agentdataprotocol.com
And this is just the start: we can add more datasets, further expand the resources, and make training agent LMs easy for all. We'd love to have you join the shared effort and help make ADP the open standard for the community 🚀
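An "interlingua" of this kind is, at its core, a shared trajectory schema plus one converter per source dataset. The sketch below illustrates that idea only; the field names and record formats are invented for illustration and are not ADP's actual schema.

```python
# hypothetical unified schema: {"task": str, "steps": [{"action", "observation"}]}

def from_coding_dataset(rec):
    # converter for a hypothetical coding-agent dataset format
    return {"task": rec["issue"],
            "steps": [{"action": a, "observation": o} for a, o in rec["log"]]}

def from_browsing_dataset(rec):
    # converter for a hypothetical web-browsing dataset format
    return {"task": rec["goal"],
            "steps": [{"action": s["command"], "observation": s["page"]}
                      for s in rec["trace"]]}

coding_rec = {"issue": "fix off-by-one in pager",
              "log": [("edit pager.py", "patch applied")]}
browse_rec = {"goal": "find the pricing page",
              "trace": [{"command": "click 'Pricing'", "page": "pricing.html"}]}

# downstream training code sees one format regardless of source dataset
unified = [from_coding_dataset(coding_rec), from_browsing_dataset(browse_rec)]
```

Once every dataset is mapped into the shared schema, a single training pipeline can consume all of them, which is what makes the ~20% average gains from pooled data possible.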
Daniel Fried retweeted
Apurva Gandhi@apurvasgandhi·
Go-Browse won Best Poster at the @SEAWorkshop at NeurIPS 2025! Huge thanks to the organizers for putting this workshop together 🙏 Loved the talks and had great conversations with so many cool people!
SEA Workshop@SEAWorkshop

The best poster awards go to:
1. Go-Browse: Training Web Agents with Structured Exploration - Apurva Gandhi, Graham Neubig
2. Scaling Open-Ended Reasoning to Predict the Future - Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
🎉 Congrats!
