UNC Computer Science

hyunji amy lee@hyunji_amy_lee

17

38

3.4K

UNC Computer Science retweetledi

Roni Sengupta@SenguptRoni·3d

Honored to receive an NSF CAREER Award! 🎉 Huge thanks to my students, mentors, the amazing colleagues at @unccs , and my family for making this possible. 🙏 We'll be working on Inverse Physics — teaching computers to infer shape, reflectance, lighting, material properties, and motion from images and videos, spanning both inverse rendering and simulation. These algorithms will advance endoscopic and laparoscopic procedures with robotic guidance (supported by our NIH grants), help robots handle delicate materials in manufacturing, and various other scientific and engineering applications.

English

7

4

44

2.8K

UNC Computer Science retweetledi

Duy Nguyen@duynguyen772·4d

Sparse binary rewards bottleneck LLM RL, motivating the use of privileged information in self-distillation as dense teachers. How can we use and balance multiple types of privileged info: leveraging stable cross-view info, while preserving view-specific info? Current on-policy self-distillation methods often condition the teacher on only one type of privileged view: full solution, partial rationale, answer-only, reference code, feedback, etc. This can be suboptimal: 1️⃣ No single privileged view consistently performs best when used as a teacher. 2️⃣ Views can introduce teacher-specific artifacts from information unavailable to the student. 🧠 Adaptive-View Self-Distillation (AVSD) considers multiple privileged views jointly as a teacher family, balancing cross-view consensus and view-specific signals through a token-level gate to construct better dense learning signals. 🧵👇

English

4

35

84

25.1K

UNC Computer Science retweetledi

Mohit Bansal@mohitban47·5d

🚨 Check out MINTEval, a new *memory interference* benchmark to stress-test agentic memory systems on: 👉 frequent & interfering context changes (avg. 86 updates) 👉 over long horizons (avg. 138.8k-token contexts, up to 1.8M) 👉 5 challenging question types (incl. long-range recovery, multi-target reasoning) 👉 4 realistic domains (state tracking, multi-turn dialogue, wikipedia revisions, code commits) 📊 Across 7 representative systems (Full Context, RAG-based, and Memory-Augmented Agents), the best performance is only 33.4%! Other interesting findings: 🔎 Memory construction failures are a major bottleneck 🔎 Memory agents are highly sensitive to design choices 🔎 Systems strongly favor insertion over deletion/update operations 🧵👇

LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this? We introduce MINTEval: ✅ Frequent context changes & interference (avg. 86 updates) ✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context ✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits ✅ Avg. 138.8k tokens per instance (up to 1.8M) ✅ Human verification on generated QAs = 95.6% 📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%. 🔎 Our analysis shows: • Memory construction failures cause a 41.7% drop • Memory agents are highly sensitive to design choices • Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update

English

14

32

4.3K

UNC Computer Science retweetledi

hyunji amy lee@hyunji_amy_lee·5d

LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this? We introduce MINTEval: ✅ Frequent context changes & interference (avg. 86 updates) ✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context ✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits ✅ Avg. 138.8k tokens per instance (up to 1.8M) ✅ Human verification on generated QAs = 95.6% 📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%. 🔎 Our analysis shows: • Memory construction failures cause a 41.7% drop • Memory agents are highly sensitive to design choices • Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update

English

9

36

106

22.5K

UNC Computer Science retweetledi

Mohit Bansal@mohitban47·18 May

🚨 Check out Agent-BRACE, our new work on belief state modeling for LLM agents in long-horizon tasks! In long-horizon partially-observable tasks, interaction history exceeds LLM context windows, but summarizing it can discard useful uncertainty about the environment. Agent-BRACE represents belief states as a set of natural language claims with verbalized confidence, and jointly trains a belief model to produce these states and a policy that conditions on them when taking actions. ✅ Improved task performance over strong RL baselines ✅ Compact, near-constant context ✅ Better belief calibration 🔎 We can see epidemic uncertainty reducing as the agent explores! 👇

Joykirat@joykiratsingh

🚨Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵

English

Yidong Huang@owenhuang117

10

23

2.5K

UNC Computer Science retweetledi

Mohit Bansal@mohitban47·17 May

🚨 Check out PhyMotion, our new Real2Sim2Real framework for physics-grounded human motion video generation (to avoid failures e.g. floating feet, unstable balance, body self-penetration, dynamically infeasible motion, etc,) --> lifts generated videos into 3D SMPL-X human motion, grounds them in a physics simulator, and evaluates motion through structured 3D physical rewards covering: ➡️ kinematic plausibility ➡️ contact and balance consistency ➡️ dynamic feasibility PhyMotion not only aligns better with human judgments as an evaluator, but also serves as an effective RL post-training reward! 👇👇

🚨 Excited to introduce PhyMotion🤸: Structured 3D Motion Reward for Physics-Grounded Human Video Generation! ❌ Existing 2D video rewards misleadingly assign high scores to videos with floating feet, self-penetrating limbs, and physics-violating motions. ✅ PhyMotion lifts generated videos into 3D, grounds them in a physics simulator, and scores motion along kinematic / contact / dynamic feasibility. ➡️ RL post-training with PhyMotion improves 1.3B model to match 14B models performance in human prefence. 🧵(1/n)👇

English

10

34

3K

UNC Computer Science retweetledi

Yidong Huang@owenhuang117·15 May

🚨 Excited to introduce PhyMotion🤸: Structured 3D Motion Reward for Physics-Grounded Human Video Generation! ❌ Existing 2D video rewards misleadingly assign high scores to videos with floating feet, self-penetrating limbs, and physics-violating motions. ✅ PhyMotion lifts generated videos into 3D, grounds them in a physics simulator, and scores motion along kinematic / contact / dynamic feasibility. ➡️ RL post-training with PhyMotion improves 1.3B model to match 14B models performance in human prefence. 🧵(1/n)👇

English

36

97

56.2K

UNC Computer Science@unccs·13 May

@nguyenp2004 @yong_jae_lee Excited to welcome you to @unccs @nguyenp2004! And congratulations! 4 degrees!

English

0

2

66

Le Thien Phuc Nguyen@nguyenp2004·12 May

Just graduated from UW-Madison with 4 majors: Computer Science, Data Science, Mathematics, and Statistics. A journey I didn't expect, but one that pushed my limits far beyond what I thought possible. My deepest gratitude to Professor Yong Jae Lee @yong_jae_lee and every member of WAIV Lab for two years of incredible mentorship and support. Excited to announce I'm starting my CS PhD at RAIR lab, UNC-Chapel Hill @unccs under Professor Jason Ren @RenZhongzheng with continued deep collaboration with Professor Yong Jae Lee @yong_jae_lee as my co-advisor. A new challenge awaits, but the research passion remains. Let's go!

English

4

1

40

19.1K

UNC Computer Science retweetledi

Snigdha Chaturvedi@snigdhac25·13 May

Congratulations to Dr. Anvesh Rao Vijjini for successfully defending his PhD thesis on realism and safety of personalized LLMs. Check out his work here: nvshrao.github.io PS: Anvesh is on the job market! @nvshrao @unc_ai_group @unccs

English

4

28

2.1K

UNC Computer Science retweetledi

Joykirat@joykiratsingh·13 May

🚨Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic belief decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵

English

39

67

15.6K

UNC Computer Science retweetledi

Ziyang Wang@ZiyangW00·12 May

🚨 Excited to share EgoMemReason, a benchmark for multi-level memory-driven reasoning (entity, event, and behavior memory) over week-long egocentric videos (average 25.9 hours of temporal backtracking)! 📉 Current long video approaches can retrieve isolated event, but struggle with long-horizon memory that requires retrieve and understand across multiple events and long time: tracking evolving entities across days, linking temporally distant events, and abstracting recurring behavior patterns from long observations. 🎥 EgoMemReason evaluates these challenges through 500 human-verified questions spanning entity, event, and behavior memory, requiring aggregation over an average of 5.1 evidence segments and 25.9 hours of temporal backtracking. ⭐️ Across 17 models/frameworks, even the best model achieves only 39.6% accuracy, revealing that long-horizon multimodal memory remains far from solved.

English

3

27

47

7.9K

UNC Computer Science retweetledi

Mohit Bansal@mohitban47·9 May

Looking forward to giving a keynote at the Midwest Machine Learning Symposium (MMLS) 2026 (being held at Purdue University this year) & meeting folks from all the strong universities in the midwest, with their inspiring, long tradition of these exciting symposiums! 🙂 👇👇

Ruqi Zhang@ruqi_zhang

The Midwest Machine Learning Symposium (MMLS) 2026 will happen at Purdue University! 📍 West Lafayette, IN 📅 June 24–25, 2026 🔗 midwest-ml.org/2026/ 📌 Poster submission deadline: May 24 We have an amazing lineup of plenary speakers: Tong Zhang, Jennifer Neville @ProfJenNeville, Mohit Bansal @mohitban47, Joyce Chai. Looking forward to seeing you there! @PurdueCS @PurdueECE @PurdueStats

English

18

44

3.8K

UNC Computer Science retweetledi

David Wan@meetdavidwan·8 May

🥳 Excited to share that MuRGAt is accepted to #ICML2026! Even strong MLLMs hallucinate citations to multimodal sources (video, audio, charts). Our new Fact-Level Multimodal Attribution benchmark tackles this by: 🕐 Requiring fine-grained temporal & per-modality citations (vs. just source-level) 🔍 Distinguishing verifiable claims from reasoning steps to evaluate multi-step responses We also introduce MuRGAt-SCORE, a reference-free, decomposed metric aligned with human judgment, and show that Programmatic Grounding substantially boosts attribution! 👇

David Wan@meetdavidwan

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

English

Archiki Prasad@ArchikiPrasad

21

39

4K

UNC Computer Science retweetledi

Archiki Prasad@ArchikiPrasad·7 May

🎉 Excited to share that our work on intrinsic dimensionality of reasoning has been accepted to #ICML2026 as a ✨spotlight✨ (top 2.2%)! We analyze the effectiveness of teaching a model how to reason via the lens of intrinsic dimensionality (the minimum effective capacity a model needs to solve the task) and find that effective reasoning chains are inherently compressive! Across Gemma-3 1B and 4B, lower intrinsic dimensionality strongly predicts not only in-distribution accuracy (GSM8K), but also robustness on OOD benchmarks (GSM-Hard, GSM-Symbolic, GSM-IC) -- outperforming reasoning length, token perplexity, and KL divergence. Stay tuned for more results and exciting updates in the camera-ready! 🚀

🚨Excited to share our new work viewing reasoning strategies as teaching tools: for fixed target model, which CoT strategies best support learning and generalization? ✨Our answer is intrinsic dimensionality (minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)

English

Elias Stengel-Eskin@EliasEskin

38

199

19.9K

UNC Computer Science retweetledi

Hanqi Xiao@hanqi_xiao·4 May

Glad that GCMs for analyzing confidence estimation from historical predictions was accepted to #ICML2026! We examine whether models have an advantage when predicting their own correctness and confidence and find that little usable privileged information exist for confidence prediction. This leads us to train Generalized Correctness Models to predict the calibrated confidence and correctness of models, outperforming the logit and verbalized confidences of much larger models! Thanks to @vaidehi_patil_, @hyunji_amy_lee, @EliasEskin, and @mohitban47! See more in 🧵below!

🚨 Announcing Generalized Correctness Models (GCMs) 🚨Finding that LLMs have little self knowledge about their own correctness, we train an 8B GCM to predict correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger Llama-3-70B’s self-emitted confidences in downstream selective prediction tasks. We motivate GCMs and analyze them by answering 2 questions: ❓ RQ1: Are LLMs better than other LLMs at predicting their own correctness? We find that they are not, instead historical information (past LLM outputs and their correctness) drives performance, motivating cross-model transfer and training of GCMs! ❓ RQ2: How can we use historical information from multiple models for correctness prediction? Within RQ2, we explore 3 further subquestions, informing the design of GCMs: 1⃣ How does confidence prediction generalize across models? GCMs transfers strategies across models and datasets, even beating models trained directly on OOD datasets. 2⃣ What information should GCMs condition on? The exact way an LLM phrases an answer is a strong predictor for correctness + strategies leveraging world-knowledge seem to drive generalization. 3⃣ How do alternative methods for encoding history (e.g. post hoc calibration, ICL) compare? Including historical information ICL can aid larger models to predict correctness but underperforms GCMs, and post hoc calibration can complement GCMs to reduce calibration error. 🧵👇

English

Elias Stengel-Eskin@EliasEskin

12

21

2.8K

UNC Computer Science retweetledi

Elias Stengel-Eskin@EliasEskin·4 May

🎉 Happy to share our paper introducing GCMs for confidence estimation based on historical predictions has been accepted to #ICML2026! We find that models are no better at learning to predict their own correctness than others', i.e., they don't have privileged self-access (given training). Training smaller models to predict the correctness of many models generalizes and leads to better calibration than self-reported confidence from much larger models! Check out 🧵 for more

🚨 Announcing Generalized Correctness Models (GCMs) 🚨Finding that LLMs have little self knowledge about their own correctness, we train an 8B GCM to predict correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger Llama-3-70B’s self-emitted confidences in downstream selective prediction tasks. We motivate GCMs and analyze them by answering 2 questions: ❓ RQ1: Are LLMs better than other LLMs at predicting their own correctness? We find that they are not, instead historical information (past LLM outputs and their correctness) drives performance, motivating cross-model transfer and training of GCMs! ❓ RQ2: How can we use historical information from multiple models for correctness prediction? Within RQ2, we explore 3 further subquestions, informing the design of GCMs: 1⃣ How does confidence prediction generalize across models? GCMs transfers strategies across models and datasets, even beating models trained directly on OOD datasets. 2⃣ What information should GCMs condition on? The exact way an LLM phrases an answer is a strong predictor for correctness + strategies leveraging world-knowledge seem to drive generalization. 3⃣ How do alternative methods for encoding history (e.g. post hoc calibration, ICL) compare? Including historical information ICL can aid larger models to predict correctness but underperforms GCMs, and post hoc calibration can complement GCMs to reduce calibration error. 🧵👇

English

15

34

2.7K

UNC Computer Science retweetledi

Zun Wang@ZunWang919·1 May

🎉 Excited to share EPiC is accepted to #ICML2026! We show that learning precise camera control for video diffusion doesn't need expensive 3D supervision or large-scale data. No camera or point cloud processing — just mask source videos based on visibility to construct precise training anchor videos, and learn a SoTA camera controller with only 30M params, trained >100× faster on >100× less data than prior work, while generalizing across both I2V and V2V camera control tasks.

Zun Wang@ZunWang919

🚨Thrilled to introduce EPiC🎥: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance A generative model enables precise 3D camera trajectory control over user-provided videos or images. It achieves highly efficient training, completing within just 16 GPU-hours (2 hours on 8×H100 GPUs), significantly faster compared to baselines that typically require at least 200 GPU-hours, while achieving better performance. Thread 🧵👇(1/8)

English

Justin Chih-Yao Chen@cyjustinchen

16

34

4.6K

UNC Computer Science retweetledi

Justin Chih-Yao Chen@cyjustinchen·1 May

Excited to share that Symbolic-MoE has been accepted to #ICML2026! 🎉 We find that adaptive instance-level "mixture-of-experts" yields +8.15% average gain over the best multi-agent baseline, while almost 2x faster + generalize to unseen tasks! Existing multi-LLM/multi-agent setups can improve reasoning, but they use a fixed set of LLMs, making them sensitive to the choice of model + hard to scale due to the high cost of multi-round agent discussion. Symbolic MoE instead adaptively recruits the most relevant expert for each instance based on the skills needed and the strengths of each model. In addition to beating baselines in performance, Symbolic-MoE is also more efficient: it skips the expensive multi-round discussion, and our novel batching method allows us to integrate 16 experts on a single GPU! 🧵👇

🚨 We introduce ✨ Symbolic-MoE ✨ which uses skill-based instance-level recruiting to dynamically combine LLMs, allowing three 7-8B LLMs to beat GPT4o-mini and Llama3.3 70B across challenging + diverse reasoning tasks (MMLU-Pro, AIME, GPQA, MedMCQA) while running on 1 GPU! Key highlights: 1️⃣ Instance-level adaptive recruiting: experts chosen for each question based on skills. 2️⃣ +8.15% average gain over the best multi-agent baseline. 3️⃣ Runs almost 2x faster than SOTA multi-agent discussion. Existing multi-LLM/multi-agent setups can improve reasoning, but current methods are 1️⃣ too coarse-grained, using task performance to select LLMs 2️⃣ too rigid, using a fixed set of LLMs, making them sensitive to the choice of model 3️⃣ hard to scale due to high cost of multi-round agent discussion. Symbolic MoE instead adaptively recruits the most relevant expert for each instance based on the skills needed and the strengths of each model. In addition to beating baselines in performance, Symbolic-MoE is more efficient: our novel batching method allows us to integrate 16 experts on a single GPU. 🧵👇

English

20

37

2.6K

UNC Computer Science retweetledi

Shoubin Yu@shoubin621·1 May

SciVideoBench is accepted to #ICML2026 🎉 We introduce the first benchmark for scientific-level video reasoning, spanning 25+ domains with 1K model+expert-curated QA pairs requiring spatiotemporal understanding + domain knowledge + multi-step reasoning. Current SOTA multimodal models still struggle. We hope this work pushes forward multimodal AI4Science research.

Shoubin Yu@shoubin621

🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning! 🔬SciVideoBench: 1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos. 2. Features 1,000 challenging MCQs across three reasoning types: Conceptual, Hypothetical, and Quantitative. 3. Explicitly targets higher-order multimodal cognitive skills in LMMs. 4. Provides valuable insights on Chain-of-Thought prompting, scaling effects, and reasoning limits. 🧵

English