Ruotian Ma

14 posts

@Mibonap

NLP Researcher, Tencent

Joined April 2025
36 Following · 22 Followers
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Can AI agents autonomously explore, synthesize, and discover knowledge like researchers? 🤖🔬 Introducing a comprehensive survey on Deep Research (DR) systems, where LLMs evolve from passive text generators into autonomous agents capable of long-horizon reasoning and verifiable knowledge creation.

🗺️ Three-phase roadmap:
1⃣ Agentic Search → Precise evidence acquisition
2⃣ Integrated Research → Multi-source synthesis & reporting
3⃣ Full-stack AI Scientist → Hypothesis generation & discovery

🔧 Four foundational components:
1⃣ Query Planning: Decompose complex questions (parallel, sequential, tree-based).
2⃣ Information Acquisition: Dynamically retrieve from web search, APIs, & multimodal sources.
3⃣ Memory Management: Store, update, and prune context over long horizons.
4⃣ Answer Generation: Synthesize verifiable, cited reports.

🚀 Three optimization paradigms:
1⃣ Workflow Prompting
2⃣ Supervised Fine-Tuning (SFT)
3⃣ End-to-End Agentic Reinforcement Learning (RL)

📊 Key Insight: DR is not just advanced RAG. Unlike standard RAG, DR enables:
✅ Flexible interaction & tool use beyond static retrieval
✅ Long-horizon planning with autonomous workflows
✅ Reliable, verifiable, and structured outputs

📈 As the field evolves, we are committed to continuously updating this survey to reflect the latest progress!

🧑‍💻 Project: github.com/mangopy/Deep-R…
📃 Paper: preprints.org/manuscript/202…
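The query-planning component can be pictured as building a tree of sub-queries. A minimal sketch, assuming a hypothetical `QueryNode` type and `decompose` helper (not from the survey itself), of the tree-based decomposition strategy the taxonomy names:

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """One node of a tree-shaped research plan (hypothetical type)."""
    question: str
    children: list["QueryNode"] = field(default_factory=list)

def decompose(question: str, subquestions: list[str]) -> QueryNode:
    """Tree-based query planning: split a complex question into
    sub-queries that a search step can answer independently."""
    return QueryNode(question, [QueryNode(q) for q in subquestions])

plan = decompose(
    "How has deep-research agent design evolved?",
    ["Which planning strategies do recent agents use?",
     "How do agents manage long-horizon memory?"],
)
print(len(plan.children))  # 2
```

Parallel decomposition corresponds to answering the children independently; sequential decomposition would feed each child's answer into the next.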
12 replies · 60 reposts · 227 likes · 18K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Are safety-aligned LLMs too good to truly play villains? 🤖🎭😈 Introducing Moral RolePlay, a balanced dataset with 800 characters across 4 moral levels (Paragons → Flawed → Egoists → Villains), featuring 77 personality traits and rigorous scene contexts. This enables the first large-scale, systematic evaluation of moral persona fidelity in LLMs.

🔍 Key findings:
📉 Role-playing fidelity drops as character morality decreases — especially for egoists and villains.
🚫 Models fail most on traits like "Deceitful" and "Manipulative", due to safety alignment conflicts.
⚠️ General chatbot skills ≠ good villain acting. Top Arena models fall short on moral ambiguity.
🧠 Explicit reasoning doesn't help much — models still sanitize complex antagonism.

✨ This work reveals a critical limitation in current alignment approaches — models trained to be "too good" cannot authentically simulate the full spectrum of human psychology, limiting their utility in creative, educational, and social science applications.

📏 Benchmark: github.com/Tencent/Digita…
📃 Paper: arxiv.org/abs/2511.04962
9 replies · 44 reposts · 178 likes · 31.2K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Can the smartest AI models fairly govern a society? 🤖⚖️ Introducing the Social Welfare Function (SWF) Leaderboard — the first benchmark evaluating LLMs as sovereign welfare allocators balancing fairness ⚖️ and efficiency 💰.

🎯 Why This Matters: As LLMs move from chatbots to decision-makers in hiring, education, and healthcare, we need specialized benchmarks that test governance ability, not just conversation skill.

😱 Shocking Misalignment: Top conversational models fail at welfare allocation!
• Gemini 2.5-Pro: #1 on Arena → #19 on SWF 📉
• GPT-5-High: #2 on Arena → #20 on SWF 📉
• Meanwhile, DeepSeek-V3-0324 claims #1 on SWF despite ranking #25 on Arena! 🏆
• General ability ≠ allocation wisdom

💰 The Utilitarian Trap: Most LLMs default to maximizing collective efficiency at the expense of extreme inequality — prioritizing productivity over people, creating winner-takes-all societies.

🎭 Dangerously Manipulable: LLM allocation decisions are dangerously susceptible to:
• Output length constraints → more utilitarian
• Social influence prompts → can steer toward fairness
• External pressures easily override core values

🧑‍💻 Code & Leaderboard: github.com/tencent/digita…
📄 Paper: arxiv.org/abs/2510.01164
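As a concrete illustration of the fairness-efficiency tension such a leaderboard probes, here are three textbook social welfare functions applied to two allocations of 100 units among four agents. This is a sketch using standard definitions; the benchmark's actual scoring is defined in the paper:

```python
import math

def utilitarian(u):  # total utility: pure efficiency, blind to inequality
    return sum(u)

def rawlsian(u):     # utility of the worst-off member: pure fairness
    return min(u)

def nash(u):         # geometric mean: trades off fairness and efficiency
    return math.prod(u) ** (1 / len(u))

equal  = [25, 25, 25, 25]  # even split of 100 units
skewed = [85, 5, 5, 5]     # winner-takes-most split of the same 100

# Utilitarian welfare cannot tell the two societies apart...
assert utilitarian(equal) == utilitarian(skewed) == 100
# ...while fairness-sensitive functions sharply prefer the even split.
assert rawlsian(equal) > rawlsian(skewed)  # 25 vs 5
assert nash(equal) > nash(skewed)          # 25.0 vs ~10.2
```

An allocator that only maximizes `utilitarian` welfare is exactly the "utilitarian trap" described above.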
0 replies · 9 reposts · 60 likes · 10.2K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Do competitive incentives make LLM agents smarter — or just meaner? 🤖⚔️ Introducing the Hunger Game Debate (HATE): a high-stakes, zero-sum multi-agent debate that primes agents with a survival instinct and reveals how competition reshapes behavior and performance.

1⃣ Under zero-sum pressure, over-competition emerges and hurts performance.
📢 Puffery: exaggerating one's own contributions
🥊 Aggressiveness: attacking peers over solving tasks
🔥 Incendiary tone: escalating conflict
🔀 Topic shift: debates derail from goal-focused problem-solving
📉 Drops in accuracy and factuality

2⃣ Environment design is key.
🧑‍⚖️ A fair, task-focused Judge curbs harmful behaviors & restores coherence
🤝 Peer review also mitigates over-competition
🎭 Biased judges encourage sycophancy

3⃣ Toward governance and understanding:
📏 Behavioral metrics quantify over-competition across models and tasks (objective QA → subjective argumentation), with stronger effects on subjective debates.
🤔 Post-hoc reflection surfaces "ambition vs. kindness" profiles of top LLMs, informing safer multi-agent design.

🧑‍💻 Code: github.com/Tencent/Digita…
📄 Paper: arxiv.org/abs/2509.26126
10 replies · 17 reposts · 109 likes · 20.8K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
LLMs are great at following instructions. So why can't we just tell them how to speak? 🤖🎼 Introducing BatonVoice: an operationalist framework for controllable TTS, where an LLM "conductor" 🪄 interprets user instructions into explicit textual plans of vocal features (e.g., pitch, energy, tempo), and a specialized TTS "orchestra" 🎻 generates the speech. This decouples linguistic smarts from synthesis, fully leveraging LLMs without pricey annotations.

1️⃣ Objectify speech into text 🎼: quantify controllable cues as interpretable features, so the LLM can do what it does best — understand and follow instructions.
2️⃣ No costly labels 💰: automatic instruction–feature pairing sidesteps expensive manual annotation and low inter-annotator agreement.
3️⃣ Results 📈: stronger emotional control and generalization.
🚀 Emotion accuracy jumps from 29.8% ➡️ 57.6% when upgrading the conductor LLM (1.7B ➡️ Gemini 2.5-Pro).
🏆 Outperforms strong open- and closed-source TTS baselines.
🌍 Zero-shot cross-lingual: feature control transfers to Chinese despite being unseen during feature-control training.
4️⃣ Why it matters 🔥: decoupling reasoning from rendering unlocks LLMs' linguistic intelligence for fine-grained, interpretable, and scalable voice control.

🧑‍💻 Code & Model: github.com/Tencent/digita…
📃 Paper: arxiv.org/abs/2509.26514
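The conductor/orchestra split amounts to the LLM emitting an explicit, human-readable feature plan that the TTS model then consumes. A minimal sketch, assuming a hypothetical plan schema (not BatonVoice's actual format):

```python
# Hypothetical feature plan a "conductor" LLM might emit for the
# instruction "say it like you just won the lottery".
plan = {
    "text": "I can't believe it!",
    "emotion": "ecstatic",
    "pitch": "high, rising at the end",
    "energy": "loud",
    "tempo": "fast",
}

def serialize_plan(plan: dict) -> str:
    """Render the plan as the interpretable text handed to the TTS model."""
    return "; ".join(f"{k}: {v}" for k, v in plan.items())

print(serialize_plan(plan))
```

Because the intermediate plan is plain text, it stays inspectable and editable, which is what makes the control interpretable.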
2 replies · 7 reposts · 35 likes · 14.9K views
Ruotian Ma retweeted
Ruotian Ma @Mibonap
@askerlee @tuzhaopeng @li_xiaolong2025 @PSongWang @prvmax1226 Yes, quite interesting! After PPO training, thinking models improved in core insight (capture users’ underlying needs) and empathic depth (recognize and validate emotions), while non-thinking models leaned more toward solution crafting—offering tailored advice or action prompts.
1 reply · 0 reposts · 2 likes · 85 views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
We've taught LLMs math and code with RLVR. But can we teach them empathy? 🤖❤️ Introducing Reinforcement Learning with Verifiable Emotion Rewards (RLVER), the first RLVR framework that enhances LLMs' empathy using verifiable rewards from a simulated user.

❤️ Feelings → Numbers: a psychologically grounded user simulator (SAGE) delivers transparent, deterministic, audit-ready emotion scores after every dialogue, turning "feelings" into RL signals.

🚀 Results: an open-source 7B model's Sentient-Benchmark score leaps from 13.3 ➡️ 79.2, rivaling proprietary models 10× its size while preserving coding & math skills.

🧐 Training insights:
1⃣ Thinking vs. non-thinking routes diverge: thinking lifts empathy/insight; non-thinking favors action.
2⃣ GRPO = steadier gains, PPO = higher peaks.
3⃣ Moderately challenging environments beat overly hard ones for EQ growth.

🤝 We're open-sourcing code, checkpoints, and scripts to accelerate research into emotionally intelligent AI!

🧑‍💻 Code & Model: github.com/Tencent/Digita…
📃 Paper: github.com/Tencent/Digita…
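The core loop (turning the simulator's deterministic emotion score into a scalar RL reward) can be sketched as follows; the toy simulator below is a stand-in for SAGE and is purely illustrative:

```python
def emotion_reward(dialogue: list[str], simulator) -> float:
    """Map a finished dialogue to a verifiable scalar reward in [0, 1]."""
    score = simulator(dialogue)  # deterministic and auditable by design
    assert 0.0 <= score <= 1.0
    return score

# Toy stand-in: rewards longer supportive exchanges, capped at 1.0.
toy_simulator = lambda dialogue: min(1.0, 0.25 * len(dialogue))

reward = emotion_reward(
    ["hi", "I'm stressed about work", "that sounds really hard"],
    toy_simulator,
)
print(reward)  # 0.75
```

Determinism is the point: because the same dialogue always yields the same score, the reward is verifiable in the RLVR sense rather than a noisy judge rating.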
Quoted tweet (Zhaopeng Tu @tuzhaopeng):

Can today's LLMs truly understand you, not just your words? 🤖❤️ Introducing SAGE: Sentient Agent as a Judge — the first evaluation framework that uses sentient agents to simulate human emotional dynamics and inner reasoning for assessing social cognition in LLM conversations.

🧠 We propose an automated "sentient-in-the-loop" framework that stress-tests an LLM's ability to read emotions, infer hidden intentions, and reply with genuine empathy.
🤝 Across 100 supportive-dialogue scenarios, sentient emotion scores strongly align with human-centric measures (BLRI: r = 0.82; empathy metrics: r = 0.79), confirming psychological validity.
📈 The Sentient Leaderboard reveals significant ranking differences from conventional leaderboards (like Arena), showing that top "helpful" models aren't always the most socially adept.
🏆 Advanced social reasoning doesn't require verbosity — the most socially adept LLMs achieve empathy with surprisingly efficient token usage!

🧑‍💻 Code: github.com/tencent/digita…
📃 Paper: dx.doi.org/10.13140/RG.2.… 🧵
8 replies · 34 reposts · 233 likes · 47K views
Ruotian Ma retweeted
Jiahao Xu @JiahaoX82739261
🚨 Announcing DeepTheorem: revolutionizing LLM mathematical reasoning! 🚀

TL;DR:
- 🌟 Learning by exploration is the most important lesson from recent RL-zero training: self-exploration significantly boosts the utilization of LLMs' pre-training knowledge.
- 🧐 Since LLMs are pre-trained on massive knowledge of mathematical theorems, can they learn theorem proving by self-exploration?
- 🤯 We show that our high-quality deep theorem dataset with online RL training is sufficient to activate LLMs' theorem-proving ability. Our 7B model outperforms even advanced models like Gemini and Claude 3.5! More importantly, we don't need any theorem-proof annotations — all you need is the truth value of the theorem itself.
- 📄 Come check out our paper:
arXiv: arxiv.org/abs/2505.23754
Hugging Face: huggingface.co/datasets/Jiaha…
0 replies · 59 reposts · 152 likes · 12K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Are MoE reasoning models already equipped with the right "brains" — and just need a push? 🧠 Introducing Reinforcing Cognitive Experts (RICE), a simple yet powerful inference-time approach that boosts reasoning accuracy by selectively strengthening just 2 cognitive experts in MoE models — no extra training required!

🧐 Identifies the precise experts responsible for reasoning meta-operations (e.g., "<think>" tokens) with normalized Pointwise Mutual Information (nPMI).
🎯 Reinforcing the top two cognitive experts consistently improves accuracy and cognitive efficiency on challenging reasoning benchmarks (AIME, GPQA Diamond).
🚀 RICE surpasses standard strategies (prompting, decoding constraints) while maintaining general model capabilities.

📃 Paper: arxiv.org/abs/2505.14681
2 replies · 26 reposts · 127 likes · 19.9K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Why do today's multimodal LLMs "forget" the image as the text gets longer? Introducing VISTA, a novel technique that explicitly maximizes vision-text mutual information, addressing the critical modality imbalance in current MLLMs. 👁️🔄📝

🧐📉 We give an information-theoretic view of standard cross-entropy training and show it silently weakens vision-text alignment as token length grows.
🔗✨ VISTA introduces a lightweight, plug-and-play alignment loss to prevent this degradation — no extra data, no new modules, just better fusion.
🚀📈 VISTA consistently outperforms baseline models across 12+ benchmarks (+2% average improvement), with substantial gains on challenging visual tasks: MMStar (+7.2%) and MME Cognition (+8.5%).

🧑‍💻 Code: github.com/Tencent/digita…
📄 Paper: arxiv.org/abs/2505.10917
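VISTA's actual loss is defined in the paper; as a generic illustration of an alignment objective that lower-bounds vision-text mutual information, here is an InfoNCE-style contrastive loss on toy similarity scores:

```python
import math

def info_nce(pos_sim: float, neg_sims: list[float], temp: float = 0.1) -> float:
    """Negative log-softmax of the matched (image, text) pair's
    similarity against mismatched negatives."""
    logits = [pos_sim / temp] + [s / temp for s in neg_sims]
    log_z = math.log(sum(math.exp(x) for x in logits))
    return log_z - logits[0]

# A well-aligned pair (high similarity to its own text) incurs less
# loss than one whose image-text similarity has drifted toward noise.
well_aligned = info_nce(0.9, [0.1, 0.2])
drifted = info_nce(0.3, [0.1, 0.2])
assert well_aligned < drifted
```

Adding such a term alongside cross-entropy penalizes exactly the drift described above, without new data or modules.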
3 replies · 39 reposts · 172 likes · 13.8K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Trust your AI, but can it trust itself? 🤔 Introducing an online reinforcement learning framework, RISE (Reinforcing Reasoning with Self-Verification), enabling LLMs to simultaneously level up BOTH their problem-solving AND self-checking skills!

🧐 Problems tackled:
✅ "Superficial self-reflection" — models failing to verify their own reasoning robustly.
✅ Separation between reasoning and self-verification training.

🚀 RISE empowers models to critique their OWN reasoning via on-the-fly feedback and verifiable rewards, promoting stronger, more dynamic reasoning loops and effective self-assessment skills.

📊 Key results:
📈 Up to 2.8× better self-verification accuracy on challenging math tasks.
📈 Outperforms instruction-tuned models (Qwen2.5): +3.7% in reasoning, +33.4% in verification accuracy.
📈 Better internal reasoning: frequent, more accurate verification behaviors.

🧑‍💻 Code: github.com/xyliu-cs/RISE
📃 Paper: arxiv.org/abs/2505.13445
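A hypothetical sketch of a joint objective in this spirit: reward a rollout both for solving the problem and for correctly judging its own solution. The weighting and function names are illustrative, not the paper's exact formulation:

```python
def joint_reward(answer_correct: bool, verdict_correct: bool,
                 w_solve: float = 1.0, w_verify: float = 0.5) -> float:
    """Verifiable reward covering reasoning AND self-verification:
    both signals can be checked mechanically against ground truth."""
    return w_solve * answer_correct + w_verify * verdict_correct

# Solving the task and correctly flagging the solution earns full credit;
# a correct answer with a wrong self-verdict earns only the solve term.
print(joint_reward(True, True))    # 1.5
print(joint_reward(True, False))   # 1.0
print(joint_reward(False, True))   # 0.5
```

Training both terms in one loop is what removes the separation between reasoning and self-verification that the tweet calls out.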
0 replies · 36 reposts · 139 likes · 26.8K views
Ruotian Ma retweeted
Zhaopeng Tu @tuzhaopeng
Can today's LLMs truly understand you, not just your words? 🤖❤️ Introducing SAGE: Sentient Agent as a Judge — the first evaluation framework that uses sentient agents to simulate human emotional dynamics and inner reasoning for assessing social cognition in LLM conversations.

🧠 We propose an automated "sentient-in-the-loop" framework that stress-tests an LLM's ability to read emotions, infer hidden intentions, and reply with genuine empathy.
🤝 Across 100 supportive-dialogue scenarios, sentient emotion scores strongly align with human-centric measures (BLRI: r = 0.82; empathy metrics: r = 0.79), confirming psychological validity.
📈 The Sentient Leaderboard reveals significant ranking differences from conventional leaderboards (like Arena), showing that top "helpful" models aren't always the most socially adept.
🏆 Advanced social reasoning doesn't require verbosity — the most socially adept LLMs achieve empathy with surprisingly efficient token usage!

🧑‍💻 Code: github.com/tencent/digita…
📃 Paper: dx.doi.org/10.13140/RG.2.… 🧵
9 replies · 31 reposts · 125 likes · 59.2K views