UNC Computer Science

2.7K posts

@unccs

Department of Computer Science - University of North Carolina at Chapel Hill. Choose to #GIVE today - learn more here: https://t.co/cLdenfM5G5

Chapel Hill, NC · Joined September 2009
432 Following · 3.3K Followers
UNC Computer Science retweeted
Mohit Bansal (@mohitban47):
🚨 Outcome rewards in LLM RL are sparse --> AVSD (Adaptive-View Self-Distillation) turns privileged info into dense token-level supervision, and instead of relying on only one privileged view, it combines multiple views and balances stable cross-view consensus vs. potentially noisy view-specific signals. Privileged views such as full solutions, partial rationales, final answers, reference code, and feedback can all help, but none is consistently the best. AVSD uses consensus across views as the reliable update direction, then adds a view-specific residual only when it aligns with that consensus and is bounded. The result is a richer but still stable learning signal, leading to consistent gains on several math and code benchmarks across model families for each configuration we test. 🧵👇
Quoted post by Duy Nguyen (@duynguyen772):

Sparse binary rewards bottleneck LLM RL, motivating the use of privileged information in self-distillation as dense teachers. How can we use and balance multiple types of privileged info: leveraging stable cross-view info, while preserving view-specific info? Current on-policy self-distillation methods often condition the teacher on only one type of privileged view: full solution, partial rationale, answer-only, reference code, feedback, etc. This can be suboptimal: 1️⃣ No single privileged view consistently performs best when used as a teacher. 2️⃣ Views can introduce teacher-specific artifacts from information unavailable to the student. 🧠 Adaptive-View Self-Distillation (AVSD) considers multiple privileged views jointly as a teacher family, balancing cross-view consensus and view-specific signals through a token-level gate to construct better dense learning signals. 🧵👇

1 reply · 17 retweets · 38 likes · 3.4K views
UNC Computer Science retweeted
Roni Sengupta (@SenguptRoni):
Honored to receive an NSF CAREER Award! 🎉 Huge thanks to my students, mentors, the amazing colleagues at @unccs, and my family for making this possible. 🙏 We'll be working on Inverse Physics: teaching computers to infer shape, reflectance, lighting, material properties, and motion from images and videos, spanning both inverse rendering and simulation. These algorithms will advance endoscopic and laparoscopic procedures with robotic guidance (supported by our NIH grants), help robots handle delicate materials in manufacturing, and support various other scientific and engineering applications.
7 replies · 4 retweets · 44 likes · 2.8K views
UNC Computer Science retweeted
Duy Nguyen (@duynguyen772):
Sparse binary rewards bottleneck LLM RL, motivating the use of privileged information in self-distillation as dense teachers. How can we use and balance multiple types of privileged info: leveraging stable cross-view info, while preserving view-specific info? Current on-policy self-distillation methods often condition the teacher on only one type of privileged view: full solution, partial rationale, answer-only, reference code, feedback, etc. This can be suboptimal: 1️⃣ No single privileged view consistently performs best when used as a teacher. 2️⃣ Views can introduce teacher-specific artifacts from information unavailable to the student. 🧠 Adaptive-View Self-Distillation (AVSD) considers multiple privileged views jointly as a teacher family, balancing cross-view consensus and view-specific signals through a token-level gate to construct better dense learning signals. 🧵👇
4 replies · 35 retweets · 84 likes · 25.1K views
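
The AVSD post above describes using cross-view consensus as the update direction and adding a view-specific residual only when it aligns with that consensus and stays bounded. A minimal sketch of that gating idea for a single token position follows. It is an illustration only, not the authors' formulation: the function name `combine_views`, the use of a plain mean as the consensus, the dot-product alignment test, and the `max_ratio` norm bound are all assumptions.

```python
from math import sqrt


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def combine_views(view_signals, selected, max_ratio=0.5):
    """Combine per-view teacher signals for ONE token position.

    view_signals: list of equal-length vectors, one per privileged view
                  (hypothetical stand-ins for token-level teacher signals).
    selected:     index of the view whose residual may be added.
    max_ratio:    hypothetical bound on residual norm relative to consensus norm.
    """
    k = len(view_signals)
    dim = len(view_signals[0])
    # Consensus: here simply the mean over views (an assumption).
    consensus = [sum(v[i] for v in view_signals) / k for i in range(dim)]
    residual = [view_signals[selected][i] - consensus[i] for i in range(dim)]

    # Gate: drop the residual unless it points in the consensus direction.
    if dot(residual, consensus) <= 0:
        return consensus

    # Bound: shrink the residual so its norm is at most max_ratio * |consensus|.
    r_norm = sqrt(dot(residual, residual))
    c_norm = sqrt(dot(consensus, consensus))
    scale = min(1.0, (max_ratio * c_norm) / r_norm)
    return [c + scale * r for c, r in zip(consensus, residual)]
```

With two toy views `[1.0, 0.0]` and `[3.0, 0.0]`, the residual of the second view agrees with the consensus and is added (subject to the bound), while the residual of the first view opposes it and is gated out.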
UNC Computer Science retweeted
Mohit Bansal (@mohitban47):
🚨 Check out MINTEval, a new *memory interference* benchmark to stress-test agentic memory systems on: 👉 frequent & interfering context changes (avg. 86 updates) 👉 over long horizons (avg. 138.8k-token contexts, up to 1.8M) 👉 5 challenging question types (incl. long-range recovery, multi-target reasoning) 👉 4 realistic domains (state tracking, multi-turn dialogue, Wikipedia revisions, code commits) 📊 Across 7 representative systems (Full Context, RAG-based, and Memory-Augmented Agents), the best performance is only 33.4%! Other interesting findings: 🔎 Memory construction failures are a major bottleneck 🔎 Memory agents are highly sensitive to design choices 🔎 Systems strongly favor insertion over deletion/update operations 🧵👇
Quoted post by hyunji amy lee (@hyunji_amy_lee):

LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this? We introduce MINTEval: ✅ Frequent context changes & interference (avg. 86 updates) ✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context ✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits ✅ Avg. 138.8k tokens per instance (up to 1.8M) ✅ Human verification on generated QAs = 95.6% 📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%. 🔎 Our analysis shows: • Memory construction failures cause a 41.7% drop • Memory agents are highly sensitive to design choices • Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update

1 reply · 14 retweets · 32 likes · 4.3K views
UNC Computer Science retweeted
hyunji amy lee (@hyunji_amy_lee):
LLM agents & memory systems operate in continuously updated environments (Git repos, evolving docs). They must process long contexts, recover earlier information, and reason over many updates that create interference between old and new information. How well do they handle this? We introduce MINTEval: ✅ Frequent context changes & interference (avg. 86 updates) ✅ 5 challenging question types, including long-range lookback & reasoning over multiple targets distributed across context ✅ 4 realistic domains: state tracking, multi-turn dialogue, Wikipedia revisions, GitHub commits ✅ Avg. 138.8k tokens per instance (up to 1.8M) ✅ Human verification on generated QAs = 95.6% 📊 Across 7 representative systems, MINTEval remains difficult, showing an avg. acc of 27.9%, and the best system reaches only 33.4%. 🔎 Our analysis shows: • Memory construction failures cause a 41.7% drop • Memory agents are highly sensitive to design choices • Memory systems have a strong bias toward insertion operations (76.8%) over deletion/update
9 replies · 36 retweets · 106 likes · 22.5K views
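
The MINTEval post above reports that memory systems lean toward insertion over deletion/update. A toy sketch of why that matters: with insert-only writes, an outdated value stays in memory next to the new one and can interfere at read time. Everything here (the dictionary-backed store, the `(op, key, value)` log format, and the function names) is hypothetical and unrelated to the benchmark's actual code.

```python
def apply_ops(ops):
    """Replay a log of (op, key, value) tuples against a toy memory store.

    op is one of "insert", "update", "delete" (hypothetical operation names).
    """
    memory = {}
    for op, key, value in ops:
        if op == "insert":
            # Insert appends, so older values for the same key are kept.
            memory.setdefault(key, []).append(value)
        elif op == "update":
            # Update overwrites, so stale values are discarded.
            memory[key] = [value]
        elif op == "delete":
            memory.pop(key, None)
    return memory


def op_share(ops, name):
    """Fraction of operations in the log that have the given name."""
    return sum(1 for op, _, _ in ops if op == name) / len(ops)
```

For example, replaying `[("insert", "branch", "main"), ("insert", "branch", "dev")]` leaves both `"main"` and `"dev"` stored under `"branch"`, whereas replacing the second operation with `"update"` leaves only `"dev"`.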
UNC Computer Science retweeted
Mohit Bansal (@mohitban47):
🚨 Check out Agent-BRACE, our new work on belief state modeling for LLM agents in long-horizon tasks! In long-horizon partially-observable tasks, interaction history exceeds LLM context windows, but summarizing it can discard useful uncertainty about the environment. Agent-BRACE represents belief states as a set of natural language claims with verbalized confidence, and jointly trains a belief model to produce these states and a policy that conditions on them when taking actions. ✅ Improved task performance over strong RL baselines ✅ Compact, near-constant context ✅ Better belief calibration 🔎 We can see epistemic uncertainty decreasing as the agent explores! 👇
Quoted post by Joykirat (@joykiratsingh):

🚨 Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯 The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic uncertainty decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵

0 replies · 10 retweets · 23 likes · 2.5K views
UNC Computer Science retweeted
Mohit Bansal (@mohitban47):
🚨 Check out PhyMotion, our new Real2Sim2Real framework for physics-grounded human motion video generation (to avoid failures, e.g., floating feet, unstable balance, body self-penetration, dynamically infeasible motion, etc.) --> lifts generated videos into 3D SMPL-X human motion, grounds them in a physics simulator, and evaluates motion through structured 3D physical rewards covering: ➡️ kinematic plausibility ➡️ contact and balance consistency ➡️ dynamic feasibility PhyMotion not only aligns better with human judgments as an evaluator, but also serves as an effective RL post-training reward! 👇👇
Quoted post by Yidong Huang (@owenhuang117):

🚨 Excited to introduce PhyMotion🤸: Structured 3D Motion Reward for Physics-Grounded Human Video Generation! ❌ Existing 2D video rewards misleadingly assign high scores to videos with floating feet, self-penetrating limbs, and physics-violating motions. ✅ PhyMotion lifts generated videos into 3D, grounds them in a physics simulator, and scores motion along kinematic / contact / dynamic feasibility. ➡️ RL post-training with PhyMotion improves a 1.3B model to match 14B models' performance in human preference. 🧵(1/n)👇

1 reply · 10 retweets · 34 likes · 3K views
UNC Computer Science retweeted
Yidong Huang (@owenhuang117):
🚨 Excited to introduce PhyMotion🤸: Structured 3D Motion Reward for Physics-Grounded Human Video Generation! ❌ Existing 2D video rewards misleadingly assign high scores to videos with floating feet, self-penetrating limbs, and physics-violating motions. ✅ PhyMotion lifts generated videos into 3D, grounds them in a physics simulator, and scores motion along kinematic / contact / dynamic feasibility. ➡️ RL post-training with PhyMotion improves a 1.3B model to match 14B models' performance in human preference. 🧵(1/n)👇
2 replies · 36 retweets · 97 likes · 56.2K views
Le Thien Phuc Nguyen (@nguyenp2004):
Just graduated from UW-Madison with 4 majors: Computer Science, Data Science, Mathematics, and Statistics. A journey I didn't expect, but one that pushed my limits far beyond what I thought possible. My deepest gratitude to Professor Yong Jae Lee @yong_jae_lee and every member of WAIV Lab for two years of incredible mentorship and support. Excited to announce I'm starting my CS PhD at RAIR lab, UNC-Chapel Hill @unccs under Professor Jason Ren @RenZhongzheng with continued deep collaboration with Professor Yong Jae Lee @yong_jae_lee as my co-advisor. A new challenge awaits, but the research passion remains. Let's go!
4 replies · 1 retweet · 40 likes · 19.1K views
UNC Computer Science retweeted
Snigdha Chaturvedi (@snigdhac25):
Congratulations to Dr. Anvesh Rao Vijjini for successfully defending his PhD thesis on realism and safety of personalized LLMs. Check out his work here: nvshrao.github.io PS: Anvesh is on the job market! @nvshrao @unc_ai_group @unccs
0 replies · 4 retweets · 28 likes · 2.1K views
UNC Computer Science retweeted
Joykirat (@joykiratsingh):
🚨 Excited to announce Agent-BRACE! LLM agents in long-horizon POMDPs either blow up their context with raw history or summarize it, discarding uncertainty by collapsing belief into a point estimate. Agent-BRACE decouples the agent into belief state + policy models, jointly trained via RL. Key takeaways: 1️⃣ 🎯 The belief state model produces a structured approximation of the belief distribution as a set of atomic natural-language claims with ordinal verbalized certainty labels ranging from certain to unknown. The policy conditions on this compact belief rather than the full history. 2️⃣ 📈 Outperforms strong RL baselines on long-horizon partially observable embodied language environments while maintaining a near-constant context window independent of episode length. 3️⃣ 🔄 The learned belief becomes increasingly calibrated as evidence accumulates, and epistemic uncertainty decreases over time: the proportion of claims that the agent has the strongest level of belief in grows from 21% → 52% over an episode. 👇🧵
2 replies · 39 retweets · 67 likes · 15.6K views
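
The Agent-BRACE post above describes a belief state made of atomic natural-language claims, each tagged with an ordinal certainty label from "certain" to "unknown", and tracks the share of claims held at the strongest level. A rough sketch of such a data structure is below. Only the two endpoint labels come from the post; the intermediate labels, the `Claim` class, and both helper functions are hypothetical and not taken from the authors' code.

```python
from dataclasses import dataclass

# Ordinal labels, weakest to strongest. Only "unknown" and "certain" appear in
# the post; the two middle labels are hypothetical placeholders.
CERTAINTY_LEVELS = ["unknown", "unlikely", "likely", "certain"]


@dataclass
class Claim:
    text: str        # one atomic natural-language statement
    certainty: str   # one of CERTAINTY_LEVELS


def render_belief(claims):
    """Render claims as a compact text block, strongest beliefs first."""
    ordered = sorted(
        claims,
        key=lambda c: CERTAINTY_LEVELS.index(c.certainty),
        reverse=True,
    )
    return "\n".join(f"[{c.certainty}] {c.text}" for c in ordered)


def strongest_fraction(claims):
    """Share of claims held at the strongest certainty level."""
    if not claims:
        return 0.0
    top = CERTAINTY_LEVELS[-1]
    return sum(c.certainty == top for c in claims) / len(claims)
```

A policy could then be prompted with the output of `render_belief(...)` instead of the full interaction history, which keeps the context size tied to the number of claims and not to episode length.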
UNC Computer Science retweeted
Ziyang Wang (@ZiyangW00):
🚨 Excited to share EgoMemReason, a benchmark for multi-level memory-driven reasoning (entity, event, and behavior memory) over week-long egocentric videos (average 25.9 hours of temporal backtracking)! 📉 Current long-video approaches can retrieve isolated events, but struggle with long-horizon memory that requires retrieving and understanding information across multiple events and long time spans: tracking evolving entities across days, linking temporally distant events, and abstracting recurring behavior patterns from long observations. 🎥 EgoMemReason evaluates these challenges through 500 human-verified questions spanning entity, event, and behavior memory, requiring aggregation over an average of 5.1 evidence segments and 25.9 hours of temporal backtracking. ⭐️ Across 17 models/frameworks, even the best model achieves only 39.6% accuracy, revealing that long-horizon multimodal memory remains far from solved.
3 replies · 27 retweets · 47 likes · 7.9K views
UNC Computer Science retweeted
Mohit Bansal (@mohitban47):
Looking forward to giving a keynote at the Midwest Machine Learning Symposium (MMLS) 2026 (being held at Purdue University this year) & meeting folks from all the strong universities in the Midwest, with their inspiring, long tradition of these exciting symposiums! 🙂 👇👇
Quoted post by Ruqi Zhang (@ruqi_zhang):

The Midwest Machine Learning Symposium (MMLS) 2026 will happen at Purdue University! 📍 West Lafayette, IN 📅 June 24–25, 2026 🔗 midwest-ml.org/2026/ 📌 Poster submission deadline: May 24 We have an amazing lineup of plenary speakers: Tong Zhang, Jennifer Neville @ProfJenNeville, Mohit Bansal @mohitban47, Joyce Chai. Looking forward to seeing you there! @PurdueCS @PurdueECE @PurdueStats

0 replies · 18 retweets · 44 likes · 3.8K views
UNC Computer Science retweeted
David Wan (@meetdavidwan):
🥳 Excited to share that MuRGAt is accepted to #ICML2026! Even strong MLLMs hallucinate citations to multimodal sources (video, audio, charts). Our new Fact-Level Multimodal Attribution benchmark tackles this by: 🕐 Requiring fine-grained temporal & per-modality citations (vs. just source-level) 🔍 Distinguishing verifiable claims from reasoning steps to evaluate multi-step responses We also introduce MuRGAt-SCORE, a reference-free, decomposed metric aligned with human judgment, and show that Programmatic Grounding substantially boosts attribution! 👇
Quoted post by David Wan (@meetdavidwan):

🚀Announcing MuRGAt! MLLMs are improving at reasoning over complex multimodal inputs, but does that translate to faithful grounding to multimodal sources (video, audio, charts, etc.)? We find that even strong MLLMs often hallucinate citations despite getting the answer correct!🤯 We introduce a benchmark for Fact-Level Multimodal Attribution featuring: ✅ High-quality Human Annotations for validation. ✅ MuRGAt-SCORE: A decomposed metric that highly correlates with human judgment. ✅ Methods to improve citations, showing that Programmatic Grounding boosts attribution. 🧵👇

2 replies · 21 retweets · 39 likes · 4K views
UNC Computer Science retweeted
Archiki Prasad (@ArchikiPrasad):
🎉 Excited to share that our work on intrinsic dimensionality of reasoning has been accepted to #ICML2026 as a ✨spotlight✨ (top 2.2%)! We analyze the effectiveness of teaching a model how to reason via the lens of intrinsic dimensionality (the minimum effective capacity a model needs to solve the task) and find that effective reasoning chains are inherently compressive! Across Gemma-3 1B and 4B, lower intrinsic dimensionality strongly predicts not only in-distribution accuracy (GSM8K), but also robustness on OOD benchmarks (GSM-Hard, GSM-Symbolic, GSM-IC) -- outperforming reasoning length, token perplexity, and KL divergence. Stay tuned for more results and exciting updates in the camera-ready! 🚀
Quoted post by Archiki Prasad (@ArchikiPrasad):

🚨 Excited to share our new work viewing reasoning strategies as teaching tools: for a fixed target model, which CoT strategies best support learning and generalization? ✨ Our answer is intrinsic dimensionality (minimum effective capacity a model needs to solve the task). Somewhat counterintuitively, adding CoT – which requires generating longer and more structured outputs – can reduce learning complexity. Good reasoning compresses the task, i.e., it reduces the degrees of freedom the model needs to map inputs to correct solutions. 🧵⬇️ (1/5)

2 replies · 38 retweets · 199 likes · 19.9K views
UNC Computer Science retweeted
Hanqi Xiao (@hanqi_xiao):
Glad that GCMs, our work on analyzing confidence estimation from historical predictions, was accepted to #ICML2026! We examine whether models have an advantage when predicting their own correctness and confidence, and find that little usable privileged information exists for confidence prediction. This leads us to train Generalized Correctness Models to predict the calibrated confidence and correctness of models, outperforming the logit and verbalized confidences of much larger models! Thanks to @vaidehi_patil_, @hyunji_amy_lee, @EliasEskin, and @mohitban47! See more in the 🧵 below!
Quoted post by Elias Stengel-Eskin (@EliasEskin):

🚨 Announcing Generalized Correctness Models (GCMs) 🚨 Finding that LLMs have little self-knowledge about their own correctness, we train an 8B GCM to predict correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger Llama-3-70B’s self-emitted confidences in downstream selective prediction tasks. We motivate GCMs and analyze them by answering 2 questions: ❓ RQ1: Are LLMs better than other LLMs at predicting their own correctness? We find that they are not; instead, historical information (past LLM outputs and their correctness) drives performance, motivating cross-model transfer and training of GCMs! ❓ RQ2: How can we use historical information from multiple models for correctness prediction? Within RQ2, we explore 3 further subquestions, informing the design of GCMs: 1⃣ How does confidence prediction generalize across models? GCMs transfer strategies across models and datasets, even beating models trained directly on OOD datasets. 2⃣ What information should GCMs condition on? The exact way an LLM phrases an answer is a strong predictor of correctness + strategies leveraging world-knowledge seem to drive generalization. 3⃣ How do alternative methods for encoding history (e.g. post hoc calibration, ICL) compare? Including historical information via ICL can help larger models predict correctness but underperforms GCMs, and post hoc calibration can complement GCMs to reduce calibration error. 🧵👇

0 replies · 12 retweets · 21 likes · 2.8K views
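
The GCM posts above refer to calibrated confidence and to reducing "calibration error" without naming a specific metric. One common choice is expected calibration error (ECE), sketched below as an assumption about how such a quantity could be measured; it is not stated to be the authors' metric. The function name, equal-width binning, and the default of 10 bins are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1] that an answer is correct.
    correct:     matching 0/1 labels (1 = the answer was actually correct).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

For instance, four predictions at confidence 0.9 of which three are correct give a gap of 0.9 - 0.75 = 0.15 in a single bin, so the ECE is about 0.15; lower values mean the stated confidences track actual accuracy more closely.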
UNC Computer Science retweeted
Elias Stengel-Eskin (@EliasEskin):
🎉 Happy to share our paper introducing GCMs for confidence estimation based on historical predictions has been accepted to #ICML2026! We find that models are no better at learning to predict their own correctness than others', i.e., they don't have privileged self-access (given training). Training smaller models to predict the correctness of many models generalizes and leads to better calibration than self-reported confidence from much larger models! Check out 🧵 for more
Quoted post by Elias Stengel-Eskin (@EliasEskin):

🚨 Announcing Generalized Correctness Models (GCMs) 🚨 Finding that LLMs have little self-knowledge about their own correctness, we train an 8B GCM to predict correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger Llama-3-70B’s self-emitted confidences in downstream selective prediction tasks. We motivate GCMs and analyze them by answering 2 questions: ❓ RQ1: Are LLMs better than other LLMs at predicting their own correctness? We find that they are not; instead, historical information (past LLM outputs and their correctness) drives performance, motivating cross-model transfer and training of GCMs! ❓ RQ2: How can we use historical information from multiple models for correctness prediction? Within RQ2, we explore 3 further subquestions, informing the design of GCMs: 1⃣ How does confidence prediction generalize across models? GCMs transfer strategies across models and datasets, even beating models trained directly on OOD datasets. 2⃣ What information should GCMs condition on? The exact way an LLM phrases an answer is a strong predictor of correctness + strategies leveraging world-knowledge seem to drive generalization. 3⃣ How do alternative methods for encoding history (e.g. post hoc calibration, ICL) compare? Including historical information via ICL can help larger models predict correctness but underperforms GCMs, and post hoc calibration can complement GCMs to reduce calibration error. 🧵👇

0 replies · 15 retweets · 34 likes · 2.7K views
UNC Computer Science retweeted
Zun Wang (@ZunWang919):
🎉 Excited to share EPiC is accepted to #ICML2026! We show that learning precise camera control for video diffusion doesn't need expensive 3D supervision or large-scale data. No camera or point cloud processing — just mask source videos based on visibility to construct precise training anchor videos, and learn a SoTA camera controller with only 30M params, trained >100× faster on >100× less data than prior work, while generalizing across both I2V and V2V camera control tasks.
Quoted post by Zun Wang (@ZunWang919):

🚨Thrilled to introduce EPiC🎥: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance A generative model enables precise 3D camera trajectory control over user-provided videos or images. It achieves highly efficient training, completing within just 16 GPU-hours (2 hours on 8×H100 GPUs), significantly faster compared to baselines that typically require at least 200 GPU-hours, while achieving better performance. Thread 🧵👇(1/8)

1 reply · 16 retweets · 34 likes · 4.6K views
UNC Computer Science retweeted
Justin Chih-Yao Chen (@cyjustinchen):
Excited to share that Symbolic-MoE has been accepted to #ICML2026! 🎉 We find that adaptive instance-level "mixture-of-experts" yields a +8.15% average gain over the best multi-agent baseline, while running almost 2x faster and generalizing to unseen tasks! Existing multi-LLM/multi-agent setups can improve reasoning, but they use a fixed set of LLMs, making them sensitive to the choice of model + hard to scale due to the high cost of multi-round agent discussion. Symbolic MoE instead adaptively recruits the most relevant expert for each instance based on the skills needed and the strengths of each model. In addition to beating baselines in performance, Symbolic-MoE is also more efficient: it skips the expensive multi-round discussion, and our novel batching method allows us to integrate 16 experts on a single GPU! 🧵👇
Quoted post by Justin Chih-Yao Chen (@cyjustinchen):

🚨 We introduce ✨ Symbolic-MoE ✨ which uses skill-based instance-level recruiting to dynamically combine LLMs, allowing three 7-8B LLMs to beat GPT4o-mini and Llama3.3 70B across challenging + diverse reasoning tasks (MMLU-Pro, AIME, GPQA, MedMCQA) while running on 1 GPU! Key highlights: 1️⃣ Instance-level adaptive recruiting: experts chosen for each question based on skills. 2️⃣ +8.15% average gain over the best multi-agent baseline. 3️⃣ Runs almost 2x faster than SOTA multi-agent discussion. Existing multi-LLM/multi-agent setups can improve reasoning, but current methods are 1️⃣ too coarse-grained, using task performance to select LLMs 2️⃣ too rigid, using a fixed set of LLMs, making them sensitive to the choice of model 3️⃣ hard to scale due to high cost of multi-round agent discussion. Symbolic MoE instead adaptively recruits the most relevant expert for each instance based on the skills needed and the strengths of each model. In addition to beating baselines in performance, Symbolic-MoE is more efficient: our novel batching method allows us to integrate 16 experts on a single GPU. 🧵👇

1 reply · 20 retweets · 37 likes · 2.6K views
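
The Symbolic-MoE posts above describe recruiting experts per instance based on the skills a question needs and the strengths of each model. A minimal sketch of that selection step is below. The skill names, the numeric strength profiles, the additive scoring rule, and the function `recruit_experts` are all hypothetical illustrations and do not reproduce the paper's procedure.

```python
def recruit_experts(needed_skills, model_profiles, k=3):
    """Pick the k models whose skill profiles best match one question.

    needed_skills:  skills inferred for this question (hypothetical labels).
    model_profiles: {model_name: {skill: strength}} with strengths in [0, 1].
    k:              number of experts to recruit (an illustrative parameter).
    """
    def score(name):
        profile = model_profiles[name]
        # Missing skills count as zero strength.
        return sum(profile.get(skill, 0.0) for skill in needed_skills)

    # Sort by descending score; break ties by name for determinism.
    ranked = sorted(model_profiles, key=lambda name: (-score(name), name))
    return ranked[:k]
```

Because the ranking is recomputed for every question, a biology question and an algebra question can end up with different experts, which is the instance-level behavior the posts contrast with a fixed set of LLMs.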
UNC Computer Science retweeted
Shoubin Yu (@shoubin621):
SciVideoBench is accepted to #ICML2026 🎉 We introduce the first benchmark for scientific-level video reasoning, spanning 25+ domains with 1K model+expert-curated QA pairs requiring spatiotemporal understanding + domain knowledge + multi-step reasoning. Current SOTA multimodal models still struggle. We hope this work pushes forward multimodal AI4Science research.
Quoted post by Shoubin Yu (@shoubin621):

🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning! 🔬SciVideoBench: 1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos. 2. Features 1,000 challenging MCQs across three reasoning types: Conceptual, Hypothetical, and Quantitative. 3. Explicitly targets higher-order multimodal cognitive skills in LMMs. 4. Provides valuable insights on Chain-of-Thought prompting, scaling effects, and reasoning limits. 🧵

1 reply · 16 retweets · 34 likes · 3.1K views