Shiyu Ni

51 posts

@Shictyu

Ph.D candidate at the Institute of Computing Technology, Chinese Academy of Sciences | Trustworthy LLMs; Adaptive RAG

Joined November 2022
129 Following · 44 Followers
Shiyu Ni
Shiyu Ni@Shictyu·
📍 TL;DR: Reasoning traces are a double-edged sword. LLM judges still can't consistently distinguish "actually correct" from "sounds correct."
Shiyu Ni
Shiyu Ni@Shictyu·
Excited to share our paper "How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality" w/ @bikeping got accepted to ACL 2026, with an Oral recommendation from the Senior Area Chair! Paper: arxiv.org/pdf/2604.06756 Code: github.com/Trustworthy-In… 🎉 Here's what we did 🧵

💡 Motivation
LLM-as-a-Judge often fails because judges don't know the correct answer and have no extra information to reference. Can the reasoning trace serve as additional evidence that helps judges judge more accurately?

📖 Concrete example:
Q: Who was the first Nobel Physics laureate?
A: Einstein
The judge doesn't know whether that's right. But the reasoning says "Einstein won his first Nobel in 1921," while the first prize was awarded in 1901. Caught! 🙅
Sounds great… but is it really that simple?

🔦 What we did
TL;DR: Reasoning traces are a double-edged sword. LLM judges still can't consistently distinguish "actually correct" from "sounds correct."
We studied this across 4 datasets × 10+ judge models (GPT-4o, Claude Sonnet 4.5, DeepSeek-v3.1…). Two key findings:

❌ Weak judges are almost completely fooled. In NQ, only 23.2% of answers are correct, yet weak models accept up to 88% when the reasoning looks fluent. They judge style, not substance.

✅ Strong judges are smarter, but not perfect. DeepSeek-v3.1's alignment improves from 63.4% → 76.2% on NQ, but even strong judges get misled by high-quality reasoning chains. Just like humans: non-experts get sweet-talked, experts push back 😄

Controlled experiments on reasoning-chain features:
1. Fluency is the first gate: break the reasoning flow, and most models mark the answer "incorrect," even if it's right.
2. Factuality matters: counterfactuals reduce pass rates, but adding more errors doesn't increase sensitivity; the evaluator isn't counting them.
3. Position matters: errors at the start hurt most; errors at the end matter less.
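The with/without-trace comparison described above is easy to prototype. A minimal sketch, not the paper's code: `call_llm`, the prompt wording, and the example field names are all placeholder assumptions.

```python
def judge(question, answer, call_llm, trace=None):
    """Ask an LLM judge whether `answer` is factually correct,
    optionally showing it the model's reasoning trace as evidence."""
    prompt = f"Question: {question}\nAnswer: {answer}\n"
    if trace is not None:
        prompt += f"Reasoning trace:\n{trace}\n"
    prompt += "Is the answer factually correct? Reply 'correct' or 'incorrect'."
    return call_llm(prompt).strip().lower() == "correct"

def alignment(examples, call_llm, use_trace):
    """Fraction of judge verdicts that agree with gold correctness labels."""
    hits = 0
    for ex in examples:
        verdict = judge(ex["q"], ex["a"], call_llm,
                        trace=ex["trace"] if use_trace else None)
        hits += verdict == ex["is_correct"]
    return hits / len(examples)
```

Running `alignment` once with `use_trace=False` and once with `use_trace=True` reproduces the kind of comparison the thread reports (e.g., the 63.4% → 76.2% shift on NQ).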
Sharon Li
Sharon Li@SharonYixuanLi·
When evaluating LVLMs, should we really be asking: "Did the model get the right answer?" or rather "Did the model truly integrate the visual input?" LVLMs can rely on shortcuts learned from the underlying language model, aka language prior. In our #ICLR2026 paper, we attempt to understand this phenomenon at a deeper, representation level. 📄 "Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding". arxiv.org/abs/2509.23050

1/ Problem: LVLMs often ignore visual evidence
While LVLMs perform well on many benchmarks, they sometimes rely on language patterns rather than actual images. A simple example: show a model a green banana, and it may confidently describe it as "ripe and yellow", because that's the most common linguistic pattern it has learned. 🍌 This raises a central question: where inside the model does visual information begin to influence its reasoning?

2/ Motivation: Output-level probes fall short
Most analyses inspect outputs, e.g., by removing the image or comparing predictions. But these methods cannot reveal when the model starts integrating vision and how strongly visual signals affect internal states. To address this, we need a representation-driven perspective. 🔍

3/ Approach: Contrasting Chain-of-Embedding (CoE)
We trace hidden representations across the model's depth for the same prompt:
• once with the image
• once without the image
By comparing these trajectories layer by layer, we identify the exact point where visual input begins shaping the model's internal computation. This leads to the discovery of the Visual Integration Point (VIP) ✨, the layer at which the model "starts seeing." We then define Total Visual Integration (TVI), a metric that quantifies how much visual influence accumulates after the VIP.
4/ Findings across 10 LVLMs and 6 benchmarks
Across 60 evaluation settings, we observe:
• VIP consistently appears across diverse architectures
• Pre-VIP → representations behave like a language-only model
• Post-VIP → visual signals increasingly reshape the embedding pathway
• TVI correlates strongly with actual visual reasoning performance
• TVI outperforms attention- and output-based proxies at identifying language prior
TVI thus offers a more principled indicator of whether a model actually uses the image.

5/ Impact: A new lens on multimodal behavior
Our framework has a few practical benefits. It enables (1) diagnosing over-reliance on language prior, (2) comparing LVLM architectures more rigorously, (3) informing better training and alignment strategies, and (4) improving robustness and grounding in real-world tasks. Shout out to my students for this insightful work: Lin Long, @Changdae_Oh, @seongheon_96 🌻 Please check out our paper for more details!
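The layer-by-layer trajectory comparison in 3/ can be sketched numerically. This is an illustrative reconstruction under stated assumptions: the per-layer L2 distance and the threshold `tau` are my placeholders, not the paper's exact definitions of VIP and TVI.

```python
import numpy as np

def visual_integration(h_with, h_without, tau=0.1):
    """h_with, h_without: (num_layers, dim) arrays of hidden states for the
    same prompt, run with and without the image.
    Returns (VIP layer index or None, TVI score)."""
    # Per-layer divergence between the two embedding trajectories.
    div = np.linalg.norm(h_with - h_without, axis=1)
    above = np.nonzero(div > tau)[0]
    if above.size == 0:
        return None, 0.0            # the image never moves the representations
    vip = int(above[0])             # first layer where vision reshapes the state
    tvi = float(div[vip:].sum())    # visual influence accumulated after the VIP
    return vip, tvi
```

A model dominated by language prior would show a late VIP and a small TVI: its hidden states barely change when the image is removed.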
Sharon Li tweet media
Yue
Yue@caiyue5·
Holy crap, surprising discovery: writing code with your own brain doesn't burn tokens; it's free!
Shiyu Ni
Shiyu Ni@Shictyu·
🚀 Thrilled to share our paper accepted to #ICLR2026 w/ @bikeping!
📄 Annotation-Efficient Universal Honesty Alignment
🔗 arxiv.org/abs/2510.17509

LLMs are powerful, but a core missing capability is honesty: the ability to know what they know and what they don't, and to express calibrated confidence about their answers. Traditionally, achieving honesty across tasks and domains relies on large-scale correctness annotations, which are expensive and hard to scale. We ask: can we get universal honesty alignment with very little human annotation?

💡 Our proposed solution: Elicitation-Then-Calibration (EliCal)
We introduce EliCal, a two-stage framework that follows a pretraining → finetuning paradigm to achieve honesty alignment annotation-efficiently.

1️⃣ Stage 1: Confidence Elicitation
Instead of relying on costly correctness labels, we first teach the model to elicit its own internal confidence using inexpensive self-consistency signals. 👉 Self-consistency measures how consistent a model's multiple generations are for the same question, an inexpensive proxy for confidence that correlates well with correctness.

2️⃣ Stage 2: Confidence Calibration
Starting from the elicited confidence ability, we then use a small set of human correctness annotations to calibrate the model's confidence so that it matches actual correctness.

📊 HonestyBench: Universal Training and Evaluation
To support and evaluate universal honesty alignment, we also release HonestyBench, a large benchmark covering 10 free-form QA datasets with ~560k training and ~70k evaluation instances annotated with both correctness and self-consistency signals. This lets us measure honesty across diverse tasks.

🌟 Key Results
• EliCal achieves near-optimal honesty alignment using only ~1,000 correctness annotations (~0.18% of full supervision).
• It outperforms calibration-only baselines and generalizes much better to unseen tasks (e.g., MMLU).
• Our method dramatically reduces annotation cost while still delivering universal, calibrated confidence. See you at ICLR 2026! 🙌
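The Stage 1 self-consistency signal is straightforward to sketch. A minimal illustration, not EliCal's implementation: `generate` (the model's sampling call) and the answer normalization are placeholder assumptions.

```python
from collections import Counter

def self_consistency(question, generate, n=8, normalize=str.lower):
    """Sample the model n times on the same question and use answer
    agreement as a cheap, label-free confidence proxy.
    Returns (majority answer, agreement rate in [0, 1])."""
    answers = [normalize(generate(question)) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n
```

A high agreement rate suggests the question is inside the model's knowledge boundary; these cheap scores stand in for correctness labels during elicitation, before the small annotated set is used for calibration.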
Shiyu Ni
Shiyu Ni@Shictyu·
RT @tuzhaopeng: Can AI agents autonomously explore, synthesize, and discover knowledge like researchers? 🤖🔬 Introducing a comprehensive su…
Shiyu Ni
Shiyu Ni@Shictyu·
Our paper "Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception" will be presented on July 30th from 11:00 to 12:30 at Hall 5X, #195. Welcome to drop by and have a discussion! #ACL2025NLP
Shiyu Ni tweet media
Sumit
Sumit@_reachsumit·
Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation Explicitly incorporates retrieved passages into LLMs' reasoning process to enhance robustness against noisy information and improve RAG performance. 📝arxiv.org/abs/2507.19333
Run-Ze Fan
Run-Ze Fan@Vfrz525_·
🚨 New release: MegaScience
The largest & highest-quality post-training dataset for scientific reasoning is now open-sourced (1.25M QA pairs)!
📈 Trained models outperform official Instruct baselines
🔬 Covers 7+ disciplines with university-level textbook-grade QA
📄 Paper: huggingface.co/papers/2507.16…
🤖 Data & Models: huggingface.co/MegaScience
💻 Code: github.com/GAIR-NLP/MegaS…
🎯 Evaluation System: github.com/GAIR-NLP/lm-op…
Details 🧵👇

1. Why MegaScience?
While LLMs like o1 and DeepSeek-R1 excel at math & code, they still struggle with science reasoning, largely due to the lack of large-scale, high-quality datasets.

2. What makes MegaScience different?
We address 4 core challenges:
🧪 Unreliable benchmark evaluation
☢️ Less rigorous decontamination
❌ Low-quality reference answers
🧠 Superficial knowledge (data) distillation

3. We tackle this from the ground up.
First, we introduce TextbookReasoning:
📘 Built from 128K+ university-level science textbooks
⚙️ Fully automated LLM-driven pipeline
🧠 650K QA pairs with reliable reference answers
🌍 Covers 7 major disciplines

4. But we didn't stop there.
We then construct MegaScience, a diverse, hybrid dataset of 1.25M QA pairs, using:
* TextbookReasoning
* NaturalReasoning
* Nemotron-Science
We conduct comprehensive ablation studies across different data selection methods to identify the optimal approach for each dataset, thereby contributing high-quality subsets.

5. To evaluate properly, we also open-sourced a reproducible and flexible Scientific Reasoning Evaluation framework with:
* 15 science reasoning tasks
* Multiple question formats (MCQ, calc, open-ended)
* Multi-GPU parallelism & model-agnostic evaluation
* Comprehensive answer extraction strategies

6. Results: Models trained on MegaScience consistently outperform official Instruct versions, especially for the Qwen3 series. Bigger models see greater gains, showing strong scalability.

7. Everything is open-source:
📚 Dataset
🧪 Evaluation toolkit
🤖 Trained models
🔧 Codebase
→ Let's build better science agents together! This work would not have been possible without all the brilliant co-authors @SinclairWang1 @stefan_fee
Run-Ze Fan tweet media (×4)
Zengzhi Wang
Zengzhi Wang@SinclairWang1·
Zengzhi Wang tweet media
Zengzhi Wang@SinclairWang1

Excited to share that our two papers have been accepted to #ICML2025! @icmlconf However, I can't be there in person due to visa issues. What a pity. 🥲 Feel free to check out our posters, either online or offline at the Vancouver Convention Center.

Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale, accepted to the Main conference (Poster) arxiv.org/abs/2409.17115

OctoThinker: Mid-Training Incentivizes Reinforcement Learning Scaling, accepted to the 2nd AI for Math Workshop @ ICML 2025 (Poster) arxiv.org/abs/2506.20512

Shiyu Ni
Shiyu Ni@Shictyu·
🥳Happy to share that our paper "Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception" has been accepted by #ACL2025! We explore leveraging LLMs' internal states to improve their knowledge boundary perception from efficiency and risk perspectives.
Shiyu Ni tweet media
Shiyu Ni retweeted
Yuchen Wen
Yuchen Wen@YuchenWen1027·
😎 Our paper "Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective" is accepted to #acl2025 w/ @bikeping etc. We propose a psychometric-inspired framework to induce and evaluate implicit bias in LLMs. Project webpage: yuchenwen1.github.io/ImplicitBiasEv…
Shiyu Ni
Shiyu Ni@Shictyu·
6/n) ⭐ Risk Mitigation with C^3: Consistency-based Confidence Calibration
1. C^3 substantially enhances LLMs' perception of what they do not know, mitigating risks.
2. C^3 does not weaken the model's confidence to the point of making it overly conservative, as the alignment (Align.) metric also improves.
Shiyu Ni tweet media