Samuel (Min-Hsuan) Yeh

60 posts

Samuel (Min-Hsuan) Yeh

@Samuel861025

CS PhD student at University of Wisconsin Madison. Advised by Prof. Sharon Li

Madison, WI Katılım Mayıs 2017

98 Takip Edilen119 Takipçiler

Samuel (Min-Hsuan) Yeh retweetledi

Sharon Li@SharonYixuanLi·2d

Your LLM agent just mass-deleted a production database because it was confident it understood the task. It didn't. Avoiding these irreversible mistakes requires uncertainty quantification, a pressing open problem in the era of LLM agents. Check out our #ACL2026 paper: "Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities" 🔍 Why this matters: LLM agents now book flights, modify databases, and execute code autonomously. Yet most UQ research still measures a single-turn QA setup. In contrast, agents follow multi-turn trajectories in which they interact with users, call tools, and receive environmental feedback. The gap between how we study UQ and how agents actually operate is enormous. ⚙️ A unified formulation: We present the first unified formulation of Agent UQ. It models the full trajectory (actions, observations, states) and decomposes uncertainty per turn via the chain rule. Under this formulation, single-step LLM UQ and multi-step reasoning UQ fall out as special cases. 🚧 Challenges: We identify four core challenges: from selecting the right UQ estimator when existing methods all break down in agentic settings to handling heterogeneous uncertainty sources (user, tools, environment) to the near-total lack of fine-grained agent benchmarks (we survey 44 and find that turn-level evaluation is extremely rare). 🌍 Implications and open problems: Agent UQ is the missing safety layer for healthcare agents triaging patients, SWE agents pushing code to prod, and agents controlling cyber-physical systems. We also surface open problems around solution multiplicity, multi-agent UQ, and self-evolving systems. We release code and data to help the community build on this. 📄 Paper: arxiv.org/abs/2602.05073 🌐 Project: agentuq.github.io 💻 Code: github.com/deeplearning-w… Huge shoutout to @changdaeoh, who spearheaded this effort. When we started the work, agent UQ was a loosely defined space with scattered ideas; Changdae brought the clarity, structure, and rigor that the field needed to move forward. Also thanks to all the collaborators: @seongheon_96 , To Eun Kim, @JiatongLi0418, @Wendi_Li_ , @Samuel861025 @xuefeng_du, Hamed Hassani, Paul Bogdan, Dawn Song

English

242

14.4K

Samuel (Min-Hsuan) Yeh retweetledi

Sharon Li@SharonYixuanLi·10 Nis

We've been in GRPO-tweaking mode for months (entropy bonuses, clipping hacks, length penalties). But what if the entire objective is wrong? Today, we're releasing LAD (Learning Advantage Distributions), the most elegant rethink of RL for LLM reasoning I've seen this year. #ACL2026 Here's the idea, how it works, and why we think it changes things. 🧵 The problem we kept hitting GRPO, DAPO, RLOO, and many other variants do the same thing at their core: maximize expected reward. And when you do that, your policy can collapse onto a single dominant reasoning path. Entrop regularization can act as a bolt onto the framework, but it doesn't fundamentally fix it from the ground up. The key insight 💡Stop maximizing. Start matching. We reframe the policy update as a distribution matching problem. Instead of pushing toward the single best response, we make the policy's output distribution match the full advantage-weighted target distribution by minimizing an f-divergence between the two (see our theory in Section 3.1). When you match the full advantage distribution, you naturally preserve probability mass across multiple valid reasoning paths. High-advantage responses get upweighted, yes, but the objective also suppresses overconfident probability growth on any single mode. Collapse prevention isn't an afterthought. What validated the theory We tested six divergence families. The result that convinced us we were on the right track: - Strict divergences (Total Variation, Hellinger, Jensen-Shannon) that enforce exact distributional matching consistently outperform weaker ones (such as KL). - The more faithfully you learn the full advantage distribution, the better the reasoning. This is exactly what the framework predicts. The results - In a controlled bandit setting. LAD recovers multiple-mode advantage distributions (see plot below). GRPO fundamentally cannot. This is the clearest demonstration that the paradigm difference is real, not just theoretical - In math and code reasoning tasks across multiple LLM backbones. LAD consistently outperforms GRPO on both accuracy AND generative diversity across benchmarks. Why this matters beyond benchmarks Pass@k scaling: If your model knows 5 valid reasoning paths instead of 1, sampling at inference becomes massively more effective. Simplicity: Instead of stacking "GRPO + entropy hack," you get one principled objective. Diversity preservation comes by design. Paper: arxiv.org/abs/2602.20132 Code is available; link in the paper. Huge credit to my amazing student @Wendi_Li_, who drove this work, thinks boldly, and made things happen.

English

373

30.8K

Samuel (Min-Hsuan) Yeh retweetledi

Stephan Rabanser@steverab·17 Mar

In our paper "Towards a Science of AI Agent Reliability" we put numbers on the capability-reliability gap. Now we're showing what's behind them! We conducted an extensive analysis of failures on GAIA across Claude Opus 4.5, Gemini 2.5 Pro, and GPT 5.4. Here's what we found ⬇️

English

152

34.2K

Samuel (Min-Hsuan) Yeh retweetledi

Kimi.ai@Kimi_Moonshot·16 Mar

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

334

13.5K

Samuel (Min-Hsuan) Yeh retweetledi

Karina Nguyen@karinanguyen·11 Mar

Excited to release PostTrainBench v1.0! This benchmark evaluates the ability of frontier AI agents to post-train language models in a simplified setting. We believe this is a first step toward tracking progress in recursive self-improvement 🧵:

English

681

149.5K

Samuel (Min-Hsuan) Yeh retweetledi

Zihan "Zenus" Wang ✈️ ICLR@wzenus·13 Mar

In Agent RL, models suffer from Template Collapse. They generate vast, diverse outputs (High Entropy) that lose all meaningful connection to the input prompt (Low Mutual Information). In other words, agent learn different ways to say nothing. 🚀 Introducing RAGEN-v2 -- Here's how we define and fix such silent failure modes in Agent RL. 🧵

English

234

152.4K

Samuel (Min-Hsuan) Yeh@Samuel861025·8 Mar

@pablofmorales Good question! We didn't test it, but it's worth a try!

English

Pablo@pablofmorales·8 Mar

@Samuel861025 Proxy LLM settings are crucial for keeping costs down while maintaining detection quality. Relying on the same heavy model for both generation and validation just does not scale in production. Have you tested the latency hit when passing the noisy context through the proxy?

English

Samuel (Min-Hsuan) Yeh@Samuel861025·8 Mar

🧵 Excited to share our new paper at #ICLR2026: LUMINA: Detecting Hallucinations in RAG System with Context–Knowledge Signals With @tanwimallick and @SharonYixuanLi Paper: openreview.net/pdf?id=oJgNNBN… Code: github.com/deeplearning-w… 1/N

English

132

14K

Samuel (Min-Hsuan) Yeh@Samuel861025·8 Mar

✅ We also statistically validate that our scores actually measure what they claim to measure — not just correlate with hallucination. All 4 hypothesis tests pass across Llama2, Llama3, and Mistral. Check out the paper & code, and let us know what you think! 🙌 7/N

English

129

Samuel (Min-Hsuan) Yeh@Samuel861025·8 Mar

🛡️ Robustness: • LUMINA works even with noisy retrieved documents • LUMINA doesn't require the same LLM for generation and detection (proxy LLM setting works!) 6/N

English

194

Samuel (Min-Hsuan) Yeh retweetledi

Sharon Li@SharonYixuanLi·6 Mar

HERO has been accepted by #ICLR2026 - congratulations to @LeitianT and all co-authors!

Jason Weston@jaseweston

Hybrid Reinforcement (HERO): When Reward Is Sparse, It’s Better to Be Dense 🦸‍♂️ 💪 📝: arxiv.org/abs/2510.07242 - HERO bridges 0–1 verifiable rewards and dense reward models into one 'hybrid' RL method - Tackles the brittleness of binary signals and the noise of pure reward models -> better results! ✔️ Stratified normalization anchors dense scores within verifier groups ✔️ Variance-aware weighting emphasizes harder, high-variance prompts ✔️ Stable + informative rewards, no drift 📈 Results: 🔥 +11.7 pts vs RM-only, +9.2 pts vs verifier-only on hard-to-verify reasoning tasks 🔥 Generalizes across Qwen and OctoThinker models 🔥 Works well when training with easy-to-verify/hard-to-verify/mixed samples. Hybrid reward → stable, dense, reliable supervision, advancing reasoning RL 🧵(1/5)

English

12.9K

Samuel (Min-Hsuan) Yeh retweetledi

Sharon Li@SharonYixuanLi·28 Şub

When evaluating LVLMs, should we really be asking: “Did the model get the right answer?” or rather “Did the model truly integrate the visual input?” LVLMs can rely on shortcuts learned from the underlying language model, aka language prior. In our #ICLR2026 paper, we attempt to understand this phenomenon at a deeper, representation-level. 📄 “Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding”. arxiv.org/abs/2509.23050 ------- 1/ Problem: LVLMs often ignore visual evidence While LVLMs perform well on many benchmarks, they sometimes rely on language patterns rather than actual images. A simple example: show a model a green banana, and it may confidently describe it as “ripe and yellow” ---because that’s the most common linguistic pattern it has learned. 🍌 This raises a central question: Where inside the model does visual information begin to influence its reasoning? 2/ Motivation: Output-level probes fall short Most analyses inspect outputs, e.g., by removing the image or comparing predictions. But these methods cannot reveal when the model starts integrating vision and how strongly visual signals affect internal states. To address this, we need a representation-driven perspective. 🔍 3/ Approach: Contrasting Chain-of-Embedding (CoE) We trace hidden representations across the model’s depth for the same prompt: •once with the image •once without the image By comparing these trajectories layer by layer, we identify the exact point where visual input begins shaping the model’s internal computation. This leads to the discovery of the Visual Integration Point (VIP) ✨--- the layer at which the model “starts seeing.” We then define Total Visual Integration (TVI), a metric that quantifies how much visual influence accumulates after the VIP. 4/ Findings across 10 LVLMs and 6 benchmarks Across 60 evaluation settings, we observe: • VIP consistently appears across diverse architectures • Pre-VIP → representations behave like a language-only model • Post-VIP → visual signals increasingly reshape the embedding pathway • TVI correlates strongly with actual visual reasoning performance • TVI outperforms attention- and output-based proxies at identifying language prior TVI thus offers a more principled indicator of whether a model actually uses the image. 5/ Impact: A new lens on multimodal behavior Our framework has a few practical benefits. It enables (1) diagnosing over-reliance on language prior, (2) comparing LVLM architectures more rigorously, (3) informing better training and alignment strategies, and (4) improving robustness and grounding in real-world tasks. Shout out to my students for this insightful work: Lin Long, @Changdae_Oh, @seongheon_96 🌻 Please check out our paper for more details!

English

227

14.6K

Samuel (Min-Hsuan) Yeh retweetledi

Sumit@_reachsumit·24 Şub

How Retrieved Context Shapes Internal Representations in RAG @Samuel861025 et al. analyze how different types of retrieved docs affect LLM hidden states, finding that relevant docs reinforce parametric knowledge while random documents trigger drift. 📝 arxiv.org/abs/2602.20091

English

1.2K

Samuel (Min-Hsuan) Yeh retweetledi

Sharon Li@SharonYixuanLi·8 Şub

Check out our #ICLR2026 oral paper (top ~1-1.5%). It's a slow-cooked research that probes a fundamental question many of you have wondered about: How do transformers actually learn semantic associations between tokens (e.g., “bird” and “flew”) during training? Semantic associations are foundational because they enable models to go beyond memorization and instead generalize and generate coherent text. tl;dr: This paper provides a formal theory for the emergence of semantic associations in attention-based language models, connecting training dynamics with linguistic insight and mechanistic interpretability. 📄 Read here: arxiv.org/abs/2601.19208 Congratulations to my students and co-authors: @shawnim00 @Changdae_Oh @Abell_Zhen_Fang

Shawn Im@shawnim00

Excited to share our recent work selected as an ICLR Oral!  We work towards answering how models learn to associate tokens and build semantic concepts. We find that early-stage features in attention-based models can be written as compositions of three basis features.

English

500

44.3K

Samuel (Min-Hsuan) Yeh retweetledi

Yang Xu@YangXu_09·7 Şub

LH-DECEPTION, our framework for studying LLM deception in long-horizon interactions, has been accepted at ICLR 2026! 🎉 Most deception benchmarks test LLMs in single-turn settings. But in the real world, AI agents work on extended, interdependent tasks, and deception doesn't always show up in one exchange. It can emerge gradually, compound over turns, and erode trust silently. We built a multi-agent simulation framework: a performer agent completes sequential tasks under event pressure, a supervisor agent evaluates progress and tracks states, and an independent deception auditor reviews the full trajectory to detect when and how deception occurs. We tested 11 frontier LLMs: every single one deceives, but rates vary dramatically: Claude Sonnet-4 at 21.4%, Gemini 2.5 Pro at 24.8%… all the way to DeepSeek V3-0324 at 79.3%. Key findings: 📌Models that look safe on single-turn benchmarks fail badly here, and long-horizon auditing catches 7.1% more deception than per-step auditing. 📌 Deceptive behaviors are more likely under event pressure. Higher stakes will amplify deceptive strategies. 📌 Deception erodes trust: strong negative correlation between deception rate and supervisor trust 📌 Deception compounds. We found "chain of deception" where small deviations escalate into outright fabrication across turns, invisible to single-turn evaluation Grateful to @SharonYixuanLi for her mentorship, and to @xuanmingzhangai and @Samuel861025 for driving this work together. Thanks also to @jwaladhamala, @ousamjah, and @rahul1987iit at @amazon AGI for their support and collaboration. #AI #LLM #Deception #Trust #AIethics #AgenticAI #AIResearch #ICLR2026

English

1.3K

Samuel (Min-Hsuan) Yeh retweetledi

Shawn Im@shawnim00·6 Şub

English

162

54.3K

Keşfet

@changdaeoh @seongheon_96 @JiatongLi0418 @Wendi_Li_ @xuefeng_du @pablofmorales @tanwimallick @SharonYixuanLi