Weiwei Sun (@sunweiwei12) - Twitter Profile

Weiwei Sun@sunweiwei12·9h

451 real users vs. LLM simulators. We find a clear sim2real gap across 21 behavioral dimensions. Models deviate from humans in many ways (e.g., too polite, too verbose). Stronger LLMs can even be worse at simulating humans. Check this out 👇 arxiv.org/pdf/2603.11245

Xuhui Zhou@nlpxuhui

Creating user simulators is a key to evaluating and training models for user-facing agentic applications. But are stronger LLMs better user simulators? TL;DR: not really. We ran the largest sim2real study for AI agents to date: 31 LLM simulators vs. 451 real humans across 165 tasks. Here's what we found (co-lead with @sunweiwei12).



Weiwei Sun retweeted

Zhuokai Zhao@zhuokaiz·Dec 18

Been really enjoying this paper by @sunweiwei12 et al. lately: arxiv.org/pdf/2510.11967 I really like how it treats context management as something the agent actually learns, instead of an external system hack like summarization or fixed multi-agent setups. The test-time idea is also pretty clean, the agent just spins up sub-trajectories when needed, no pre-defined roles. Imo a really smart way to scale long-horizon agents beyond "just use a bigger context window."


Weiwei Sun@sunweiwei12·Dec 5

Check out our #NeurIPS2025 spotlight poster, “Enhancing Training Data Attribution with Representational Optimization”! 📅 Dec 5, 4:30 PM – 7:30 PM PST 📍 Exhibit Hall C,D,E #107 📄 Paper: arxiv.org/pdf/2505.18513

Weiwei Sun@sunweiwei12

🚨 Modern LLMs are trained on trillions of tokens, but for any given output, only a tiny subset of examples really matter. Training Data Attribution (TDA) is about finding those examples and measuring their influence. Gradient-based approaches, while well-founded, are extremely costly for LLMs because they require computing and storing gradients. 💡 We introduce AirRep, a small representation model trained to predict how training data influences model behavior. The result: as accurate as gradient-based methods (and often more accurate), 80× faster, and with 50× storage reduction. On a single GPU, AirRep can process 2500 examples per second, while a well-optimized gradient-based model can only handle 30. #neurips2025


Weiwei Sun retweeted

Sumit@_reachsumit·Dec 3

Deep Research: A Systematic Survey @Zhengliang_Shi et al. present a survey of deep research systems that combine LLM reasoning with external tools like search engines to complete complex, open-ended tasks. 📝 arxiv.org/abs/2512.02038 👨🏽‍💻 github.com/mangopy/Deep-R…


Weiwei Sun@sunweiwei12·Dec 3

Check out our #NeurIPS2025 poster, “Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning”! 📅 Dec 3, 11:00 AM–2:00 PM PST 📍 Exhibit Hall C/D/E, #3403 📄 Paper: arxiv.org/pdf/2501.15228


Weiwei Sun retweeted

Zhaopeng Tu@tuzhaopeng·Nov 27

Can AI agents autonomously explore, synthesize, and discover knowledge like researchers? 🤖🔬 Introducing a comprehensive survey on Deep Research (DR) systems, where LLMs evolve from passive text generators into autonomous agents capable of long-horizon reasoning and verifiable knowledge creation.
🗺️ Three-phase roadmap:
1⃣ Agentic Search → Precise evidence acquisition
2⃣ Integrated Research → Multi-source synthesis & reporting
3⃣ Full-stack AI Scientist → Hypothesis generation & discovery
🔧 Four foundational components:
1⃣ Query Planning: Decompose complex questions (parallel, sequential, tree-based).
2⃣ Information Acquisition: Dynamically retrieve from web search, APIs, & multimodal sources.
3⃣ Memory Management: Store, update, and prune context over long horizons.
4⃣ Answer Generation: Synthesize verifiable, cited reports.
🚀 Three optimization paradigms:
1⃣ Workflow Prompting
2⃣ Supervised Fine-Tuning (SFT)
3⃣ End-to-End Agentic Reinforcement Learning (RL)
📊 Key Insight: DR is not just advanced RAG. Unlike standard RAG, DR enables:
✅ Flexible interaction & tool use beyond static retrieval
✅ Long-horizon planning with autonomous workflows
✅ Reliable, verifiable, and structured outputs
📈 As the field evolves, we are committed to continuously updating this survey to reflect the latest progress!
🧑‍💻 Project: github.com/mangopy/Deep-R…
📃 Paper: preprints.org/manuscript/202…
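As one illustration of the Memory Management component named above (store, update, and prune), here is a minimal Python sketch. The class `ContextMemory`, the numeric relevance scores, and the capacity-based pruning rule are all assumptions made for this example; they are not taken from the survey, which covers many different designs.

```python
class ContextMemory:
    """Bounded memory that stores, updates, and prunes context entries.

    ASSUMPTION: each entry carries a caller-supplied relevance score, and
    pruning simply drops the lowest-scoring entries once capacity is exceeded.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = {}  # key -> (text, relevance)

    def store(self, key, text, relevance):
        """Insert a new entry or update an existing one, then prune."""
        self._items[key] = (text, relevance)
        self.prune()

    def prune(self):
        """Drop lowest-relevance entries until the memory fits its capacity."""
        while len(self._items) > self.capacity:
            worst = min(self._items, key=lambda k: self._items[k][1])
            del self._items[worst]

    def context(self):
        """Return the remaining texts, most relevant first."""
        ranked = sorted(self._items.values(), key=lambda item: -item[1])
        return [text for text, _ in ranked]


memory = ContextMemory(capacity=2)
memory.store("a", "finding A", 0.9)
memory.store("b", "finding B", 0.1)
memory.store("c", "finding C", 0.5)   # over capacity: "b" is pruned
memory.store("c", "finding C, revised", 0.95)  # update in place
```

A real system would likely score relevance with a model and summarize rather than delete, but the store/update/prune loop has the same shape.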


Weiwei Sun@sunweiwei12·Nov 27

💥 Our paper has been accepted to NeurIPS 2025 as a Spotlight! Read the paper for more details! 📄 Paper: arxiv.org/pdf/2505.18513 🧩 Model: huggingface.co/sunweiwei/AirR… 💻 Code: github.com/sunnweiwei/Air… Work done with amazing collaborators: Haokun Liu (@liu_haokun), Nikhil Kandpal (@kandpal_nikhil), Colin Raffel, and Yiming Yang


Weiwei Sun@sunweiwei12·Nov 27

AirRep does have an upfront cost: we generate supervision and train the encoder. But once trained, it amortizes beautifully. After a moderate crossover point, AirRep can attribute 100M examples under the same GPU budget where gradient-based methods manage only a few million.
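The amortization argument can be made concrete with a back-of-the-envelope calculation. The Python sketch below uses the single-GPU throughputs quoted in this thread (2500 and 30 examples per second); the upfront cost of 100 GPU-hours is a made-up placeholder, not a number from the paper, and both helper functions are hypothetical.

```python
def total_gpu_seconds(n_examples, throughput_per_s, upfront_s=0.0):
    """Total GPU time: one-time setup cost plus per-example processing time."""
    return upfront_s + n_examples / throughput_per_s


def crossover_examples(upfront_s, fast_per_s, slow_per_s):
    """Example count at which the fast method (with upfront cost) breaks even.

    Solves upfront_s + n / fast_per_s = n / slow_per_s for n.
    """
    return upfront_s / (1.0 / slow_per_s - 1.0 / fast_per_s)


AIRREP_PER_S = 2500.0       # throughput quoted in the thread
GRADIENT_PER_S = 30.0       # throughput quoted in the thread
UPFRONT_S = 100 * 3600.0    # ASSUMPTION: 100 GPU-hours of supervision + training

speed_ratio = AIRREP_PER_S / GRADIENT_PER_S
n_star = crossover_examples(UPFRONT_S, AIRREP_PER_S, GRADIENT_PER_S)

print(f"per-example speed ratio: {speed_ratio:.1f}x")
print(f"break-even after about {n_star:,.0f} examples (under the assumed upfront cost)")
```

Past the break-even point, every additional example costs 1/2500 s instead of 1/30 s, which is why the gap widens as the attribution set grows. A different upfront cost moves the crossover proportionally.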


Weiwei Sun@sunweiwei12·Nov 27

🚨 Modern LLMs are trained on trillions of tokens, but for any given output, only a tiny subset of examples really matter. Training Data Attribution (TDA) is about finding those examples and measuring their influence. Gradient-based approaches, while well-founded, are extremely costly for LLMs because they require computing and storing gradients. 💡 We introduce AirRep, a small representation model trained to predict how training data influences model behavior. The result: as accurate as gradient-based methods (and often more accurate), 80× faster, and with 50× storage reduction. On a single GPU, AirRep can process 2500 examples per second, while a well-optimized gradient-based model can only handle 30. #neurips2025


Weiwei Sun retweeted

Xuhui Zhou@nlpxuhui·Nov 17

New blog post out! 📜 We share our latest research efforts to build more effective, human-centered AI collaboration. Months ago, I was genuinely surprised by how quickly AI agents were improving, and with that came a deep fear of being replaced, of humans slowly losing agency as AI grows more capable. At the same time, I felt the intense frustration of working with coding agents who produce thousands of lines of seemingly working code that ultimately prove unusable. These days, I’ve been coming to a clearer conclusion: the future of AI has to be true human–AI collaboration. And making that collaboration actually smooth, not frustrating, not disempowering, has never been more important. xuhuiz.com/blog/on-the-qu… #AI #AIAgents #HumanAICollaboration


Weiwei Sun@sunweiwei12·Nov 17

✨ AI’s most meaningful impact won’t come from acting alone, but from collaborating in ways that amplify human strengths. Making that collaboration smooth, intuitive, and genuinely empowering matters more than ever. Check out the new blog! 📘 It dives into how to benchmark human–AI collaboration, build agents for it, and strengthen it with RL.



Weiwei Sun retweeted

Alex Prompter@alex_prompter·Nov 7

🚨 Carnegie Mellon just dropped one of the most important AI agent papers of the year. It’s called “Training Proactive and Personalized LLM Agents.” Here’s the wild part... they didn’t train agents to just complete tasks. They trained them to talk better. Most AI agents are task junkies: they execute, they don’t interact. These new ones do three things simultaneously:
→ Productivity – actually finish the job
→ Proactivity – ask smart clarifying questions
→ Personalization – adapt tone, style, and behavior to you
They built a full interactive world called UserVille, filled with simulated users each with unique personalities and quirks (like users who only reply in JSON, or only answer A/B/C questions 🤯). Then they trained agents using a new RL framework called PPP (Productive, Proactive, Personalized). Results? +21.6% higher performance than GPT-5 across complex engineering & research tasks. Agents started asking fewer, sharper questions and mirroring user preferences automatically. This is the future: Not just agents that do things but agents that understand who they’re doing them for.
Paper: arxiv.org/abs/2511.02208v1


Weiwei Sun@sunweiwei12·Nov 7

@kmingl20 @nlpxuhui @StigLidu @gneubig @MaartenSap Thanks! We designed this mostly based on the time required for the user to respond. Refusing usually costs little user time, so it’s medium effort (and it accumulates, so more poor questions = more penalty). High effort means the user has to spend time doing actual work.


Kaiming Liu@kmingl20·Nov 7

@sunweiwei12 @nlpxuhui @StigLidu @gneubig @MaartenSap Great work! Why do you define Medium-effort as refusing to answer but High-effort as replying with effort?


Weiwei Sun@sunweiwei12·Nov 5

AI agents are supposed to collaborate with us to solve real-world problems, but can they really? Even the most advanced models can still give us frustrating moments when working with them deeply. We argue that real-world deployment requires more than productivity (e.g., task accuracy); agents must also be proactive in communication and personalized to individual user preferences. Our new work introduces PPP, a Productive, Proactive, and Personalized optimization framework that explicitly trains LLM agents for effective human interaction. 🚀PPP achieves significant gains in complex, real-world agent–user scenarios (software engineering and deep research), outperforming even GPT-5 on both tasks with initially vague user instructions.


Weiwei Sun retweeted

Marktechpost AI Dev News ⚡@Marktechpost·Nov 6

CMU Researchers Introduce PPP and UserVille To Train Proactive And Personalized LLM Agents
Most LLM agents are tuned to maximize task success. They resolve GitHub issues or answer deep research queries, but they do not reason carefully about when to ask the user questions or how to respect different interaction preferences. How can we design LLM agents that know when to ask better questions and adapt their behavior to each individual user?
A team of researchers from Carnegie Mellon University (CMU) and OpenHands formalizes these missing behaviors as 3 joint objectives, Productivity, Proactivity, and Personalization, and optimizes them with a multi-objective reinforcement learning framework called PPP inside a new environment named UserVille.
Key Takeaways
➡️ PPP frames agent training as a multi-objective RL problem that jointly optimizes Productivity, Proactivity, and Personalization, instead of focusing only on task success.
➡️ UserVille builds vague-prompt versions of existing benchmarks and pairs them with preference-aware user simulators, which enforce 20 distinct interaction preferences and label user effort levels.
➡️ The total reward combines task metric, user effort, and preference adherence, using bonuses for low-effort questions and penalties for medium- and high-effort questions or preference violations, implemented with a GRPO-based RL algorithm.
➡️ On SWE Bench Func Loc and BrowseComp Plus with vague prompts, PPP-trained Seed OSS 36B significantly improves all 3 metrics over the base model and over GPT-5 baselines, with an average gain of about 16.72 points across dimensions and datasets.
➡️ PPP agents generalize to unseen preferences, alternate simulators, and harder tasks such as SWE Bench Full, and they learn to ask fewer but more targeted low-effort questions, especially when prompts are vague.
Full analysis: marktechpost.com/2025/11/06/cmu… Paper: arxiv.org/abs/2511.02208 Repo: github.com/sunnweiwei/PPP…
@nlpxuhui @sunweiwei12 @StigLidu @xingyaow_ @wellecks @gneubig @MaartenSap
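The reward structure described in the takeaways above can be sketched in a few lines of Python. This is an illustration of the description only: the bonus and penalty magnitudes, the `Question` type, the one-time bonus rule, and the function name `ppp_reward` are assumptions made for the example and are not taken from the paper or the PPP repo.

```python
from dataclasses import dataclass
from typing import List

# ASSUMPTION: illustrative magnitudes, not values confirmed by the source.
LOW_EFFORT_BONUS = 0.05
EFFORT_PENALTY = {"medium": 0.1, "high": 0.5}
PREFERENCE_PENALTY = 0.2


@dataclass
class Question:
    effort: str                # "low", "medium", or "high" (label from the user simulator)
    violates_preference: bool  # True if the question breaks the user's interaction preference


def ppp_reward(task_score: float, questions: List[Question]) -> float:
    """Combine productivity, proactivity, and personalization into one scalar."""
    productivity = task_score

    # Proactivity: penalties accumulate per medium/high-effort question;
    # a one-time bonus applies only if the agent asked and every question was low effort.
    proactivity = -sum(EFFORT_PENALTY.get(q.effort, 0.0) for q in questions)
    if questions and all(q.effort == "low" for q in questions):
        proactivity += LOW_EFFORT_BONUS

    # Personalization: each preference violation is penalized.
    violations = sum(1 for q in questions if q.violates_preference)
    personalization = -PREFERENCE_PENALTY * violations

    return productivity + proactivity + personalization


good = ppp_reward(1.0, [Question("low", False)])
bad = ppp_reward(1.0, [Question("medium", False), Question("high", True)])
```

Here `good` exceeds the bare task score while `bad` falls well below it, which matches the intended pressure toward fewer, more targeted low-effort questions. In training, a scalar like this would be the per-trajectory reward fed to the RL algorithm.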

