
Josh Vendrow
111 posts

@josh_vendrow
Safety training @OpenAI | on leave from PhD at MIT


It's getting harder and harder to get signal from benchmark numbers. Rather than only averages, I expect in the (near) future we will also care about "argmax": what's the BEST output a model can deliver? After all, we don't need to solve P vs NP 10 out of 10 times; once is enough 😅. So with that in mind let me tell you a bit more about THE MOST IMPRESSIVE LLM OUTPUT I have ever seen.



✨Weekly AI Evaluation Paper Spotlight✨ 🕵️ Are benchmark noise and label errors masking the true fragility of LLMs? 🖇️ "Do Large Language Model Benchmarks Test Reliability?" - This paper by @josh_vendrow, @EdwardVendrow, @sarameghanbeery, and @aleks_madry provides insights!


GPT-5 Thinking definitely isn’t perfect, but it’s the first AI model I can trust more than many common sources of truth on the internet.



GPT-5 is here. Rolling out to everyone starting today. openai.com/gpt-5/



@stevenshinechen omg... if i'm reading this page correctly a majority of the GPUs you guys have campus wide access to are from before Ampere was even a thing > "...more than 850 NVidia Volta GPUs in total...."












