Shiyu Ni

51 posts

@Shictyu

Ph.D candidate at the Institute of Computing Technology, Chinese Academy of Sciences | Trustworthy LLMs; Adaptive RAG

Joined November 2022
129 Following · 44 Followers
Shiyu Ni
Shiyu Ni@Shictyu·
📍 TL;DR: Reasoning traces are a double-edged sword. LLM judges still can't consistently distinguish "actually correct" from "sounds correct."
Shiyu Ni
Shiyu Ni@Shictyu·
Excited to share our paper "How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality" w/ @bikeping got accepted to ACL 2026, with an Oral recommendation from the Senior Area Chair! Paper: arxiv.org/pdf/2604.06756 Code: github.com/Trustworthy-In… 🎉 Here's what we did 🧵

💡 Motivation
LLM-as-a-Judge often fails because judges don't know the correct answer and have no extra information to reference. Can the reasoning trace serve as additional evidence that helps judges judge more accurately?

📖 Concrete example:
Q: Who was the first Nobel Physics laureate?
A: Einstein
The judge doesn't know whether that's right. But the reasoning says "Einstein won his first Nobel in 1921," while the first prize was awarded in 1901. Caught! 🙅
Sounds great… but is it really that simple?

🔦 What we did
TL;DR: Reasoning traces are a double-edged sword. LLM judges still can't consistently distinguish "actually correct" from "sounds correct."
We studied this across 4 datasets × 10+ judge models (GPT-4o, Claude Sonnet 4.5, DeepSeek-v3.1…). Two key findings:

❌ Weak judges are almost completely fooled. In NQ, only 23.2% of answers are correct, yet weak models accept up to 88% when the reasoning looks fluent. They judge style, not substance.

✅ Strong judges are smarter, but not perfect. DeepSeek-v3.1's alignment improves from 63.4% → 76.2% on NQ, but even strong judges get misled by high-quality reasoning chains. Just like humans: non-experts get sweet-talked, experts push back 😄

Controlled experiments on reasoning-chain features:
1. Fluency is the first gate: break the reasoning flow, and most models mark the answer "incorrect," even if it's right.
2. Factuality matters: counterfactuals reduce pass rates, but adding more errors doesn't increase sensitivity; the evaluator isn't counting them.
3. Position matters: errors at the start hurt most; errors at the end matter less.
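The with/without-trace comparison described above is easy to prototype. A minimal sketch, not the paper's code: `call_llm`, the prompt wording, and the example field names are all placeholder assumptions.

```python
def judge(question, answer, call_llm, trace=None):
    """Ask an LLM judge whether `answer` is factually correct,
    optionally showing it the model's reasoning trace as evidence."""
    prompt = f"Question: {question}\nAnswer: {answer}\n"
    if trace is not None:
        prompt += f"Reasoning trace:\n{trace}\n"
    prompt += "Is the answer factually correct? Reply 'correct' or 'incorrect'."
    return call_llm(prompt).strip().lower() == "correct"

def alignment(examples, call_llm, use_trace):
    """Fraction of judge verdicts that agree with gold correctness labels."""
    hits = 0
    for ex in examples:
        verdict = judge(ex["q"], ex["a"], call_llm,
                        trace=ex["trace"] if use_trace else None)
        hits += verdict == ex["is_correct"]
    return hits / len(examples)
```

Running `alignment` once with `use_trace=False` and once with `use_trace=True` reproduces the kind of comparison the thread reports (e.g., the 63.4% → 76.2% shift on NQ).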
Sharon Li
Sharon Li@SharonYixuanLi·
When evaluating LVLMs, should we really be asking: "Did the model get the right answer?" or rather "Did the model truly integrate the visual input?" LVLMs can rely on shortcuts learned from the underlying language model, aka language prior. In our #ICLR2026 paper, we attempt to understand this phenomenon at a deeper, representation level. 📄 "Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding". arxiv.org/abs/2509.23050

1/ Problem: LVLMs often ignore visual evidence
While LVLMs perform well on many benchmarks, they sometimes rely on language patterns rather than actual images. A simple example: show a model a green banana, and it may confidently describe it as "ripe and yellow", because that's the most common linguistic pattern it has learned. 🍌 This raises a central question: where inside the model does visual information begin to influence its reasoning?

2/ Motivation: Output-level probes fall short
Most analyses inspect outputs, e.g., by removing the image or comparing predictions. But these methods cannot reveal when the model starts integrating vision and how strongly visual signals affect internal states. To address this, we need a representation-driven perspective. 🔍

3/ Approach: Contrasting Chain-of-Embedding (CoE)
We trace hidden representations across the model's depth for the same prompt:
• once with the image
• once without the image
By comparing these trajectories layer by layer, we identify the exact point where visual input begins shaping the model's internal computation. This leads to the discovery of the Visual Integration Point (VIP) ✨, the layer at which the model "starts seeing." We then define Total Visual Integration (TVI), a metric that quantifies how much visual influence accumulates after the VIP.
4/ Findings across 10 LVLMs and 6 benchmarks
Across 60 evaluation settings, we observe:
• VIP consistently appears across diverse architectures
• Pre-VIP → representations behave like a language-only model
• Post-VIP → visual signals increasingly reshape the embedding pathway
• TVI correlates strongly with actual visual reasoning performance
• TVI outperforms attention- and output-based proxies at identifying language prior
TVI thus offers a more principled indicator of whether a model actually uses the image.

5/ Impact: A new lens on multimodal behavior
Our framework has a few practical benefits. It enables (1) diagnosing over-reliance on language prior, (2) comparing LVLM architectures more rigorously, (3) informing better training and alignment strategies, and (4) improving robustness and grounding in real-world tasks. Shout out to my students for this insightful work: Lin Long, @Changdae_Oh, @seongheon_96 🌻 Please check out our paper for more details!
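The layer-by-layer trajectory comparison in 3/ can be sketched numerically. This is an illustrative reconstruction under stated assumptions: the per-layer L2 distance and the threshold `tau` are my placeholders, not the paper's exact definitions of VIP and TVI.

```python
import numpy as np

def visual_integration(h_with, h_without, tau=0.1):
    """h_with, h_without: (num_layers, dim) arrays of hidden states for the
    same prompt, run with and without the image.
    Returns (VIP layer index or None, TVI score)."""
    # Per-layer divergence between the two embedding trajectories.
    div = np.linalg.norm(h_with - h_without, axis=1)
    above = np.nonzero(div > tau)[0]
    if above.size == 0:
        return None, 0.0            # the image never moves the representations
    vip = int(above[0])             # first layer where vision reshapes the state
    tvi = float(div[vip:].sum())    # visual influence accumulated after the VIP
    return vip, tvi
```

A model dominated by language prior would show a late VIP and a small TVI: its hidden states barely change when the image is removed.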
Sharon Li tweet media
Yue
Yue@caiyue5·
Holy crap, surprising discovery: writing code with your own brain doesn't burn tokens; it's free!
Shiyu Ni
Shiyu Ni@Shictyu·
🚀 Thrilled to share our paper accepted to #ICLR2026 w/ @bikeping!
📄 Annotation-Efficient Universal Honesty Alignment
🔗 arxiv.org/abs/2510.17509

LLMs are powerful, but a core missing capability is honesty: the ability to know what they know and what they don't, and to express calibrated confidence about their answers. Traditionally, achieving honesty across tasks and domains relies on large-scale correctness annotations, which are expensive and hard to scale. We ask: can we get universal honesty alignment with very little human annotation?

💡 Our proposed solution: Elicitation-Then-Calibration (EliCal)
We introduce EliCal, a two-stage framework that follows a pretraining → finetuning paradigm to achieve honesty alignment annotation-efficiently.

1️⃣ Stage 1: Confidence Elicitation
Instead of relying on costly correctness labels, we first teach the model to elicit its own internal confidence using inexpensive self-consistency signals. 👉 Self-consistency measures how consistent a model's multiple generations are for the same question, an inexpensive proxy for confidence that correlates well with correctness.

2️⃣ Stage 2: Confidence Calibration
Starting from the elicited confidence ability, we then use a small set of human correctness annotations to calibrate the model's confidence so that it matches actual correctness.

📊 HonestyBench: Universal Training and Evaluation
To support and evaluate universal honesty alignment, we also release HonestyBench, a large benchmark covering 10 free-form QA datasets with ~560k training and ~70k evaluation instances annotated with both correctness and self-consistency signals. This lets us measure honesty across diverse tasks.

🌟 Key Results
• EliCal achieves near-optimal honesty alignment using only ~1,000 correctness annotations (~0.18% of full supervision).
• It outperforms calibration-only baselines and generalizes much better to unseen tasks (e.g., MMLU).
• Our method dramatically reduces annotation cost while still delivering universal, calibrated confidence. See you at ICLR 2026! 🙌
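The Stage 1 self-consistency signal is straightforward to sketch. A minimal illustration, not EliCal's implementation: `generate` (the model's sampling call) and the answer normalization are placeholder assumptions.

```python
from collections import Counter

def self_consistency(question, generate, n=8, normalize=str.lower):
    """Sample the model n times on the same question and use answer
    agreement as a cheap, label-free confidence proxy.
    Returns (majority answer, agreement rate in [0, 1])."""
    answers = [normalize(generate(question)) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n
```

A high agreement rate suggests the question is inside the model's knowledge boundary; these cheap scores stand in for correctness labels during elicitation, before the small annotated set is used for calibration.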
Shiyu Ni
Shiyu Ni@Shictyu·
RT @tuzhaopeng: Can AI agents autonomously explore, synthesize, and discover knowledge like researchers? 🤖🔬 Introducing a comprehensive su…
Shiyu Ni
Shiyu Ni@Shictyu·
Our paper "Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception" will be presented on July 30th from 11:00 to 12:30 at Hall 5X, #195. Welcome to drop by and have a discussion! #ACL2025NLP
Shiyu Ni tweet media
Sumit
Sumit@_reachsumit·
Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation Explicitly incorporates retrieved passages into LLMs' reasoning process to enhance robustness against noisy information and improve RAG performance. 📝arxiv.org/abs/2507.19333
Run-Ze Fan
Run-Ze Fan@Vfrz525_·
🚨 New release: MegaScience
The largest & highest-quality post-training dataset for scientific reasoning is now open-sourced (1.25M QA pairs)!
📈 Trained models outperform official Instruct baselines
🔬 Covers 7+ disciplines with university-level textbook-grade QA
📄 Paper: huggingface.co/papers/2507.16…
🤖 Data & Models: huggingface.co/MegaScience
💻 Code: github.com/GAIR-NLP/MegaS…
🎯 Evaluation System: github.com/GAIR-NLP/lm-op…
Details 🧵👇

1. Why MegaScience?
While LLMs like o1 and DeepSeek-R1 excel at math & code, they still struggle with science reasoning, largely due to the lack of large-scale, high-quality datasets.

2. What makes MegaScience different?
We address 4 core challenges:
🧪 Unreliable benchmark evaluation
☢️ Less rigorous decontamination
❌ Low-quality reference answers
🧠 Superficial knowledge (data) distillation

3. We tackle this from the ground up.
First, we introduce TextbookReasoning:
📘 Built from 128K+ university-level science textbooks
⚙️ Fully automated LLM-driven pipeline
🧠 650K QA pairs with reliable reference answers
🌍 Covers 7 major disciplines

4. But we didn't stop there.
We then construct MegaScience, a diverse, hybrid dataset of 1.25M QA pairs, using:
* TextbookReasoning
* NaturalReasoning
* Nemotron-Science
We conduct comprehensive ablation studies across different data selection methods to identify the optimal approach for each dataset, thereby contributing high-quality subsets.

5. To evaluate properly, we also open-sourced a reproducible and flexible Scientific Reasoning Evaluation framework with:
* 15 science reasoning tasks
* Multiple question formats (MCQ, calc, open-ended)
* Multi-GPU parallelism & model-agnostic evaluation
* Comprehensive answer extraction strategies

6. Results: Models trained on MegaScience consistently outperform official Instruct versions, especially for the Qwen3 series. Bigger models see greater gains, showing strong scalability.

7. Everything is open-source:
📚 Dataset
🧪 Evaluation toolkit
🤖 Trained models
🔧 Codebase
→ Let's build better science agents together! This work would not have been possible without all the brilliant co-authors @SinclairWang1 @stefan_fee
Run-Ze Fan tweet media (×4)
Zengzhi Wang
Zengzhi Wang@SinclairWang1·
Zengzhi Wang tweet media
Zengzhi Wang@SinclairWang1

Excited to share that our two papers have been accepted to #ICML2025! @icmlconf However, I can't be there in person due to visa issues. What a pity. 🥲 Feel free to check out our posters, either online or offline at the Vancouver Convention Center.

Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale, accepted to the Main conference (Poster) arxiv.org/abs/2409.17115

OctoThinker: Mid-Training Incentivizes Reinforcement Learning Scaling, accepted to the 2nd AI for Math Workshop @ ICML 2025 (Poster) arxiv.org/abs/2506.20512

Shiyu Ni
Shiyu Ni@Shictyu·
🥳Happy to share that our paper "Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception" has been accepted by #ACL2025! We explore leveraging LLMs' internal states to improve their knowledge boundary perception from efficiency and risk perspectives.
Shiyu Ni tweet media
Shiyu Ni retweeted
Yuchen Wen
Yuchen Wen@YuchenWen1027·
😎 Our paper "Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective" is accepted to #acl2025 w/ @bikeping etc. We propose a psychometric-inspired framework to induce and evaluate implicit bias in LLMs. Project webpage: yuchenwen1.github.io/ImplicitBiasEv…
Shiyu Ni
Shiyu Ni@Shictyu·
6/n) ⭐ Risk Mitigation with C^3: Consistency-based Confidence Calibration
1. C^3 substantially enhances LLMs' perception of what they do not know, mitigating risks.
2. C^3 does not weaken the model's confidence to the point of making it overly conservative, as the alignment (Align.) metric also improves.
Shiyu Ni tweet media