
Shantanu Sharma
87 posts

Shantanu Sharma
@shantanu
he/him RT≠👍 NMLS # 1677482




Reinforcement learning leads to better AI scientist agents! 🚀 By training models end-to-end with multi-turn RL, we’re seeing breakthroughs in reasoning and problem-solving for real biomedical research. Excited to introduce Biomni-R0 — an agentic LLM trained with this approach. On 10 real research tasks, it nearly doubles performance over its open-source base model and even surpasses closed-source frontier models by >10%. A scalable path to expert-level AI in biomedicine. Led by @RyanLi0802 @KexinHuang5 @ProjectBiomni with exciting collaboration with the SkyRL team @shiyi_c98 @NovaSkyAI. Learn more: biomni.stanford.edu/blog/biomni-r0… — open sourcing soon!



There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few days it was worse than what I was used to. Conversely as an example, around the same time Claude 3.5 was a top tier model in my personal use but it ranked very low on the arena. I heard similar sentiments both online and in person. And there were a number of other relatively random models, often suspiciously small, with little to no real-world knowledge as far as I know, yet they ranked quite high too. "When the data and the anecdotes disagree, the anecdotes are usually right." (Jeff Bezos on a recent pod, though I share the same experience personally). I think these teams have placed different amount of internal focus and decision making around LM Arena scores specifically. And unfortunately they are not getting better models overall but better LM Arena models, whatever that is. Possibly something with a lot of nested lists, bullet points and emoji. It's quite likely that LM Arena (and LLM providers) can continue to iterate and improve within this paradigm, but in addition I also have a new candidate in mind to potentially join the ranks of "top tier eval". It is the @openrouter LLM rankings: openrouter.ai/rankings Basically, OpenRouter allows people/companies to quickly switch APIs between LLM providers. All of them have real use cases (not toy problems or puzzles), they have their own private evals, and all of them have an incentive to get their choices right, so by choosing one LLM over another they are directly voting for some combo of capability+cost. I don't think OpenRouter is there just yet in both the quantity and diversity of use, but something of this kind I think has great potential to grow into a very nice, very difficult to game eval.

Today is the start of a new era of natively multimodal AI innovation. Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality. Llama 4 Scout • 17B-active-parameter model with 16 experts. • Industry-leading context window of 10M tokens. • Outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 across a broad range of widely accepted benchmarks. Llama 4 Maverick • 17B-active-parameter model with 128 experts. • Best-in-class image grounding with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image. • Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks. • Achieves comparable results to DeepSeek v3 on reasoning and coding — at half the active parameters. • Unparalleled performance-to-cost ratio with a chat version scoring ELO of 1417 on LMArena. These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We’re excited to share more details about it even while it’s still in flight. Read more about the first Llama 4 models, including training and benchmarks ➡️ go.fb.me/gmjohs Download Llama 4 ➡️ go.fb.me/bwwhe9

























