Ran Xu

49 posts

@ritaranx

Research Scientist @GoogleDeepMind | Prev: CS PhD @EmoryUniversity | LLM, RAG, Agent, Tool-integrated reasoning, RL

Mountain View, CA · Joined September 2022

295 Following · 634 Followers

Pinned Tweet

Ran Xu @ritaranx · Oct 3

🚨 Happy to share AceSearcher accepted to #NeurIPS2025 #Spotlight! 🔹 One LLM, two roles: Decomposer (split queries) + Solver (combine context) 🔹 +7.6% on QA & fact verification 🔹 32B ≈ DeepSeek-V3 on DocMath 📂 Code: github.com/ritaranx/AceSe… 📑 arXiv: arxiv.org/abs/2509.24193


Ran Xu @ritaranx · Dec 3

Excited to be at #NeurIPS through Dec 8 — happy to connect! I’ll be presenting our Spotlight paper on complex QA and reasoning with search: 🗓️ Dec 5, 11:00–2:00pm PST 📍 Exhibit C/D/E — Poster #1908 Also exploring full-time opportunities—DMs open if you’d like to chat!


Ran Xu @ritaranx · Nov 10

Thanks for featuring our work! 🙌

DAIR.AI @dair_ai

4. TIR-Judge: Google and collaborators introduce TIR-Judge, an end-to-end reinforcement learning framework that trains LLM judges to integrate code execution for precise evaluation. x.com/ritaranx/statu…


Ran Xu @ritaranx · Nov 8

@curlyhacks1 @Google @GoogleDeepMind @googlecloud Yes! Our framework natively supports multi-turn tool calling.


Andrea Villa @curlyhacks1 · Nov 7

@ritaranx @Google @GoogleDeepMind @googlecloud Supports multi-turn training with tool calling?


Ran Xu @ritaranx · Nov 6

Happy to introduce my internship work at @Google and @GoogleDeepMind, collab w/ @googlecloud. We introduce TIR-Judge, an end-to-end agentic RL framework that trains LLM judges with tool-integrated reasoning 🧠🛠️ 🔗arxiv.org/pdf/2510.23038 #Agents #LLMs #Judges #RL #reasoning


Ran Xu @ritaranx · Nov 8

@alpniks @Google @GoogleDeepMind @googlecloud Thanks! TIR-Judge is particularly effective for tasks that involve symbolic reasoning or calculation. For non-verifiable domains, we’ve also introduced a rubric-based framework in a recent paper to address evaluation in those cases: arxiv.org/pdf/2510.07743


Alp @alpniks · Nov 7

@ritaranx @Google @GoogleDeepMind @googlecloud Great work! Tool integration is definitely an essential part to train LLM judges on par with human evaluation level. How does TIR-Judge performance compare to human performance for more reasoning-heavy non-verifiable domains?


Ran Xu @ritaranx · Nov 6

Thanks for sharing our work on improving LLM judges with agentic RL!

Rohan Paul @rohanpaul_ai

New Google paper trains LLM judges to use small bits of code alongside reasoning, so their decisions become precise. Judging stops being guesswork and becomes checkable.

Text-only judges often miscount, miss structure rules, or accept shaky logic that a simple program would catch. TIR-Judge makes the judge think step by step, write code to check claims, run it in a sandbox, then update the verdict.

Training mixes tasks where code can verify answers and tasks where it cannot, so the judge learns when to call tools and when to rely on reasoning. One prompt schema covers pointwise scoring, pairwise choices, and listwise selection, so it plugs into many workflows.

Reinforcement learning rewards being correct, following strict output tags, and using at most 3 tool calls. A variant called TIR-Judge-Zero skips teacher distillation and still improves by alternating reinforcement learning, rejection sampling, and supervised fine-tuning.

Across public judge benchmarks it beats text-only judges, and an 8B model reaches 96% of Claude Opus 4's performance on listwise ranking.

The core idea: give the judge verifiable checks plus rewards that favor careful tool use.

Paper: arxiv.org/abs/2510.23038
Paper title: "Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning"
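The reward described in the summary above (correct verdict, strict output tags, at most 3 tool calls) could be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the `<verdict>` and `<tool_call>` tag names, the all-or-nothing gating, and the function name `judge_reward` are assumptions made for the example.

```python
import re

MAX_TOOL_CALLS = 3  # tool-call budget mentioned in the summary

# Hypothetical output schema: these tag names are assumptions for illustration.
VERDICT_RE = re.compile(r"<verdict>\s*(\w+)\s*</verdict>\s*$")
TOOL_CALL_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)


def judge_reward(output: str, gold_verdict: str) -> float:
    """Return 1.0 only if the verdict is well-formed, correct, and within budget."""
    match = VERDICT_RE.search(output.strip())
    format_ok = match is not None  # output must end with a verdict tag
    within_budget = len(TOOL_CALL_RE.findall(output)) <= MAX_TOOL_CALLS
    correct = format_ok and match.group(1) == gold_verdict
    return 1.0 if (correct and within_budget) else 0.0
```

Gating everything on a single binary reward is the simplest choice; a real training setup might weight the three signals separately.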


Ran Xu @ritaranx · Nov 6

@Google @GoogleDeepMind @googlecloud 7/n Thanks to my collaborators: Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan (@jun_yannn), Carl Yang (@yangji9181), and Hongkun Yu. Thanks also for the useful discussions with Jing Nathan Yan (@NathanYan2012), Yuchen Zhuang (@yuchen_zhuang), and Zhengzhe Yang.


Ran Xu @ritaranx · Nov 6

@Google @GoogleDeepMind @googlecloud 6/n 📊Best-of-N on Policy Models TIR-Judge is not only a better judge — it makes other models better. When selecting responses in best-of-N inference, TIR-Judge improves policy accuracy by +3.9~6.7% on AIME, BigCodeBench, IFEval, etc. → Better downstream reasoning too🎯
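Best-of-N selection with a judge, as mentioned in the post above, can be sketched as scoring each candidate response and keeping the highest-scoring one. The names `best_of_n` and `toy_judge` are hypothetical, and the toy scoring rule only stands in for a trained judge such as TIR-Judge.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate the judge scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: judge_score(prompt, response))


def toy_judge(prompt: str, response: str) -> float:
    # Stand-in judge (an assumption for this example): it only rewards
    # responses that end with the known answer to the demo prompt.
    return 1.0 if response.strip().endswith("42") else 0.0


if __name__ == "__main__":
    samples = ["17 + 25 = 41", "17 + 25 = 42", "17 + 25 = 43"]
    print(best_of_n("What is 17 + 25?", samples, toy_judge))
```

In practice the candidates would be N samples from the policy model and `judge_score` would call the judge model in its pointwise mode.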


Ran Xu @ritaranx · Oct 3

n/n Thanks to our collaborators: Yuchen Zhuang (@yuchen_zhuang), Zihan Dong (@zhiiiiaaaa), Ruiyu Wang, Yue Yu (@yue___yu), Joyce C. Ho (@joycehoUT), Linjun Zhang (@linjunz_stat), Haoyu Wang (@haoyuwang0408), Wenqi Shi (@WenqiShi0106), and Carl Yang (@yangji9181).


Ran Xu @ritaranx · Oct 3

6/n Takeaways:
✅ With self-play frameworks, smaller LLMs can rival giant proprietary models
✅ Reasoning datasets can be borrowed to assist search in LLMs and to better couple search and reasoning
✅ Great potential in domains such as finance, health, and science

