Ruiwen Zhou
@skyriver_2000

15 posts

CS Ph.D. student @wing_nus @NUSingapore | Prev. @ucsbnlp @sjtu1896. LLM Reasoning and AI Agents. Seeking summer research internship opportunities!

Singapore · Joined December 2021
407 Following · 103 Followers
Ruiwen Zhou retweeted
NICE AI Talk @academic_nice
🌟 Welcome to NICE AI Talk 135 (Chinese Talk) | Can LLMs Truly Build a Complete Project Repository from Scratch? 🌟
🚀 Recent advances in code generation have delivered impressive results on short-horizon tasks like function synthesis and local code completion. But a fundamental question remains unanswered: can large language models sustain coherent planning and stable execution across the full process of building a real project repository from scratch? In this talk, we present NL2Repo-Bench, a long-horizon evaluation benchmark that challenges models to construct a complete, runnable Python repository using only a natural-language specification and an empty workspace. Experimental results reveal that even with a perfectly designed prompt, current models frequently fail under long-horizon settings, exposing logical collapse, fragile cross-file dependencies, and insufficient global planning.
📌 Register: luma.com/51xk2nah
📺 YouTube livestream: lnkd.in/gjbn7ukk
⏰ Talk time
🕙 Beijing Time: 02.08 10:00–11:00
🕘 EST: 02.07 21:00–22:00
🕕 PST: 02.07 18:00–19:00
🎤 Invited speaker: Shengda Long, Master's student at Peking University
🎙 Host: Ruiwen Zhou, Ph.D. student at National University of Singapore
👑 Discuss with researchers worldwide: lnkd.in/gpsDn-DH
✅ Want to present your work at NICE? Submit here: lnkd.in/g8RYT-KX
#AI #LLM #CodeGeneration #AutonomousAgents #LongHorizonReasoning #FoundationModels #AcademicResearch #NICEAITalk
NICE AI Talk tweet media
Replies 0 · Reposts 2 · Likes 5 · Views 515
Ruiwen Zhou @skyriver_2000
📌 Final thought
Trust is not a soft concept. It is a learnable, usable signal for LLM agents. If we want reliable multi-agent systems, we must teach agents who to believe, not just how to reason.
🏠 GitHub repo: github.com/skyriver-2000/…
🔗 arXiv link: arxiv.org/abs/2601.21742
💬 Thoughts welcome!
🧵 4/n
Replies 0 · Reposts 0 · Likes 5 · Views 288
Ruiwen Zhou @skyriver_2000
🤔 Who to trust in a multi-agent system?
🔥 We are thrilled to introduce Epistemic Context Learning (ECL), a reasoning framework that enables LLMs to reason with trust in multi-agent systems.
📖 Key takeaways
- LLMs fail in multi-agent systems when they blindly conform to confident but unreliable peers.
- We introduce the interaction history of peers so that LLMs can judge peer reliability and selectively refer to them.
- We develop ECL as a practical solution: it lets small LMs deliver performance comparable to much larger ones and enables near-perfect accuracy under adversarial peers.
🔗 arXiv link: arxiv.org/abs/2601.21742
🧵 1/n
Ruiwen Zhou tweet media
Replies 3 · Reposts 10 · Likes 42 · Views 7K
Ruiwen Zhou @skyriver_2000
🔥 Results that surprised us
💡 Small models learn to estimate and utilize trust decently: ECL enhances Qwen3-4B to match Qwen3-30B baselines.
💪🏻 Frontier models reach near-perfect accuracy.
⚠️ Why this matters
Without trust modeling:
❌ agents collapse under social pressure
❌ confident hallucinations dominate
❌ adversarial peers win
With ECL:
✅ agents resist blind conformity
✅ trust becomes an explicit reasoning signal
🧵 3/n
Ruiwen Zhou tweet media
Replies 0 · Reposts 0 · Likes 3 · Views 206
Ruiwen Zhou @skyriver_2000
🧠 Core idea
When LLMs cannot verify correctness, judging what peers said is hard, but judging who is speaking (from past behavior) is easier. So we shift the problem from
❌ reasoning-quality judgment
➡️ peer reliability estimation from history ⏳
🔍 We formalize this as history-aware reference and propose Epistemic Context Learning (ECL):
🧩 Two-stage reasoning
1️⃣ Stage 1: build trust profiles from interaction history
2️⃣ Stage 2: answer the current question using those trust priors
No shortcuts. No surface plausibility tricks.
🧵 2/n
Ruiwen Zhou tweet media
Replies 0 · Reposts 1 · Likes 3 · Views 222
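The two-stage idea in the thread above can be illustrated with a minimal sketch: first estimate each peer's reliability from its interaction history, then weight peer answers by that trust. This is a hypothetical toy, not the authors' ECL implementation (which prompts an LLM; see their repo and arXiv paper for the real method), and the function names and smoothing choice are my own assumptions.

```python
from collections import Counter

def trust_profile(history):
    """Stage 1 (toy): estimate per-peer reliability from interaction history.

    history maps peer -> list of booleans (was each past claim correct?).
    Returns a Laplace-smoothed accuracy in (0, 1) per peer, so a peer with
    no track record is neither fully trusted nor fully distrusted.
    """
    return {
        peer: (sum(records) + 1) / (len(records) + 2)
        for peer, records in history.items()
    }

def answer_with_trust(peer_answers, trust):
    """Stage 2 (toy): pick the answer with the highest trust-weighted vote.

    Instead of blindly conforming to the most confident peer, each peer's
    vote counts in proportion to its estimated reliability.
    """
    votes = Counter()
    for peer, answer in peer_answers.items():
        votes[answer] += trust.get(peer, 0.5)  # unseen peers get neutral trust
    return votes.most_common(1)[0][0]
```

For example, a peer that was right 9 times out of 10 outvotes an adversarial peer that was right 2 times out of 10, even when both speak with equal confidence.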
Ruiwen Zhou retweeted
NICE AI Talk @academic_nice
🌟 Welcome to NICE Talk 131 | Agent Memory Self-Evolution
🚀 This is the Era of Experience for AI agents. The core is not the simple replay of past episodes, but whether agents can, through runtime learning, transform accumulated experience into a self-evolving drive for tackling unknown tasks.
📌 Register: luma.com/zaueyrbj
📌 YouTube livestream and video summaries: lnkd.in/e8PhA9z5
⏰ Talk time
⏰ USA Eastern Standard Time: 2026.01.31 (Sat) 21:30
⏰ Pacific Time: 2026.01.31 (Sat) 18:30
🎙️ Invited speaker: Jiaqian Wang, Ph.D. student at Xidian University
🎙️ Host: Ruiwen Zhou, Ph.D. student at National University of Singapore
👑 Discuss with researchers worldwide: lnkd.in/gp-aa2EM
✅ Want to present your work at NICE? Submit here: lnkd.in/g8RYT-KX
#AI #LLM #MachineLearning #DeepLearning #FoundationModels #Academic #Research
NICE AI Talk tweet media
Replies 0 · Reposts 3 · Likes 5 · Views 593
Ruiwen Zhou retweeted
Wenyue Hua @HuaWenyue31539
📢 Call for Papers: ICLR 2026 Workshop "MemAgents"
🧠 Memory layers for LLM agents (including architecture, reinforcement learning, systems, neuroscience, evaluation, etc.)
📅 Submission deadline: February 5, 2026 (AoE)
📰📜 Submissions: tiny papers, short papers, and full papers are all welcome for discussion and exchange.
📌 Location: Rio de Janeiro, Brazil
🌴 For details and submission instructions, please visit: sites.google.com/view/memagent-…
We would greatly appreciate it if you could help share this with interested faculty members and students.
Replies 4 · Reposts 25 · Likes 175 · Views 21.4K
Ruiwen Zhou retweeted
Wenyue Hua @HuaWenyue31539
Hi all, I am hosting a dinner party on November 5 at EMNLP this year! We've invited a bunch of VCs and startup people, as well as fantastic panelists to talk about embodied AI and LLM agents. All are welcome to attend!!
Wenyue Hua tweet media
Replies 3 · Reposts 15 · Likes 43 · Views 6.1K
Ruiwen Zhou retweeted
Xiaobao Wu @BobXWu1
🏆 Thrilled to announce our paper AntiLeakBench (arxiv.org/abs/2412.13670) won the SAC Award at #ACL2025! Huge thanks to my amazing co-authors!
🚨 Data contamination risks fair LLM evaluation. AntiLeakBench addresses this by:
🔒 Creating samples with explicitly new knowledge absent from LLM training sets, for contamination-free evaluation.
🤖 Automating benchmark updates without human effort.
📊 Enabling fair and low-cost evaluation for emerging LLMs.
Check it out 👉 arxiv.org/abs/2412.13670
#ACL2025NLP #NLProc #LLMs
Xiaobao Wu tweet media
Replies 0 · Reposts 2 · Likes 6 · Views 1.9K
Ruiwen Zhou @skyriver_2000
🚀🚀 Catch me at ACL tomorrow (July 28) during the 11:00–12:30 poster session! I will be presenting RuleArena, a challenging benchmark for LLM reasoning under the guidance of complex real-world natural-language rules. Come by and let's talk! 🚀🚀 #ACL2025 #LLMs #NLProc
Wenyue Hua @HuaWenyue31539

🚀🚀 Can #LLMs Handle Your Taxes? 💸 Thank you @skyriver_2000 for leading this very interesting project! He is applying for PhD programs now :)
Introducing RuleArena, a cutting-edge benchmark designed to test the logical reasoning of large language models with ~100 natural-language rules from REAL-world scenarios:
✈️ American Airlines luggage-checking policies
🏀 NBA transaction policies
📊 personal tax rules
🔍 Why RuleArena? Rooted in real-life applications, RuleArena evaluates whether your LLM or agent is ready for safe and reliable deployment in everyday tasks.
💪 Super challenging
• Each rule spans ~400 tokens
• Context lengths up to 20k!
🔑 Key findings:
1️⃣ Low recall: LLMs often miss context-specific rules, i.e., rules required only in special scenarios.
2️⃣ Context-dependency issues: they struggle with rules requiring multiple intermediate steps as inputs.
3️⃣ In-context examples don't always help: providing examples doesn't guarantee better performance.
4️⃣ Fragile accuracy: a single mistake in calculation or rule application can lead to an incorrect answer. 😕
Check the paper here: arxiv.org/abs/2412.08972 Code releasing soon! 😁

Replies 0 · Reposts 2 · Likes 4 · Views 529
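The "low recall" finding in the quoted thread can be made concrete with a tiny scoring helper: given the set of rules a problem actually requires and the set of rules the model cited in its answer, recall is the fraction of required rules the model applied. This helper and the rule IDs below are illustrative assumptions, not part of the RuleArena codebase.

```python
def rule_recall(required_rule_ids, cited_rule_ids):
    """Fraction of required rules that the model actually applied.

    A model that answers confidently while citing only 2 of the 3 rules a
    scenario triggers scores 2/3, even if its arithmetic is flawless --
    which is exactly the failure mode "Key Finding 1" describes.
    """
    required = set(required_rule_ids)
    if not required:
        return 1.0  # nothing was required, so nothing could be missed
    return len(required & set(cited_rule_ids)) / len(required)
```

For example, if a tax scenario triggers rules r1, r2, and r3 but the model's answer only invokes r1 and r3 (plus an irrelevant r9), its rule recall is 2/3.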
Ruiwen Zhou @skyriver_2000
🚀🚀 We just released our code and problem set on GitHub: github.com/SkyRiver-2000/… Welcome to evaluate and analyze your models and reasoning frameworks on RuleArena!
Wenyue Hua @HuaWenyue31539
(Quoted tweet: the RuleArena announcement, quoted in full above.)
Replies 2 · Reposts 5 · Likes 15 · Views 5.4K
Ruiwen Zhou retweeted
Wenyue Hua @HuaWenyue31539
(Original RuleArena announcement; identical to the tweet quoted in full above.)
Wenyue Hua tweet media
Replies 2 · Reposts 5 · Likes 23 · Views 9.8K
Ruiwen Zhou retweeted
Wenda Xu @WendaXu2
I am on the job market for full-time industry positions. My research focuses on text generation evaluation and LLM alignment. If you have relevant positions, I'd love to connect! Here is a list of my publications and a summary of my research:
Replies 1 · Reposts 19 · Likes 59 · Views 25.9K