

Yi Dong
21 posts




There’s growing excitement around scaling up RLVR to get continuous gains with more compute. But in practice, improvements saturate on finite training data. 😱 Introducing Golden Goose 🦢✨, a simple trick to synthesize unlimited RLVR tasks 😎 from unverifiable internet text. 🌐

We built ProfBench to raise the bar for LLMs - literally. At @NVIDIA, we worked with domain experts to create a benchmark that goes far beyond trivia and short answers. ProfBench tests LLMs on complex, multi-step tasks that demand the kind of reasoning, synthesis, and clarity you'd expect from a PhD physicist or MBA consultant. 🌎 This isn’t just a dataset drop. It’s a global collaboration: 38 professionals across 8 countries contributed over 7,000 expert-written rubrics across finance MBA 💵, consulting MBA 📊, chemistry PhD 🧪and physics PhD 🚀. 🧗Every prompt and grading rubric was handcrafted, requiring tens of hours of dedicated and focussed work. Now fully supported in the NeMo Evaluator SDK, ProfBench enables reproducible, rubric-based evaluations and side-by-side model comparisons. 🔗 ProfBench on @HuggingFace huggingface.co/datasets/nvidi… 🔗 NeMo Evaluator SDK github.com/NVIDIA-NeMo/Ev… I’m so proud of the team that made this happen. Let’s keep pushing what AI can do. Work done with @jaehunjung_com @GXiming @shizhediao Ellie Evans @jiaqizengggggg @PavloMolchanov @YejinChoinka @jankautz @doyend #ProfBench #LLM #AIevaluation #NeMo #NVIDIA #OpenSourceAI #AIresearch #AgenticAI #GenerativeAI #BuiltByExperts #GTCDC










