

Ruiwen Zhou
15 posts

@skyriver_2000
CS Ph.D. student @wing_nus @NUSingapore | Prev. @ucsbnlp @sjtu1896. LLM Reasoning and AI Agents. Seeking a summer research internship opportunity!

🚀🚀Can #LLMs Handle Your Taxes? 💸

Thank you @skyriver_2000 for leading this very interesting project! He is applying to PhD programs now :)

Introducing RuleArena, a cutting-edge benchmark designed to test the logical reasoning of large language models with ~100 natural language rules from REAL-world scenarios:
✈️ American Airlines luggage checking policies
🏀 NBA transaction policies
📊 Personal tax rules

🔍 Why RuleArena?
Rooted in real-life applications, RuleArena evaluates whether your LLM or agent is ready for safe and reliable deployment in everyday tasks.

💪 Super Challenging
• Each rule spans ~400 tokens
• Context lengths up to 20k tokens!

🔑 Key Findings:
1️⃣ Low Recall: LLMs often miss context-specific rules, i.e., rules that apply only in special scenarios.
2️⃣ Context Dependency Issues: LLMs struggle with rules whose inputs depend on multiple intermediate steps.
3️⃣ In-Context Examples Don’t Always Help: Providing examples doesn’t guarantee better performance.
4️⃣ Fragile Accuracy: A single mistake in calculation or rule application can lead to an incorrect answer. 😕

Check the paper here: arxiv.org/abs/2412.08972
Code will be released soon! 😁


