somnath sandeep
726 posts

@somnathsandeepp
building your personal health companion AI @meetaugustai 🩺 prev: 2x founder, consumer biz • dumpin' real-time notes on things i'm curious abt & find joy in 🌱


















most benchmarks suck, but ppl also misinterpret them.

HLE, for example, can easily be cheated or trained for, even unintentionally, because the questions are all over the internet. keeping the answers private doesn't really matter, because people WILL solve them and the information WILL spread. so a model scoring well on it almost always just means "the AI has seen the answer". I don't like this kind of fixed-question benchmark, and I think it becomes a non-signal as soon as it gets popular. or rather, all it measures is the extent to which the team failed to hide the answers from the model, so, more often than not, higher scores are a bad sign.

on VPCT, all questions are at roughly the same difficulty level, so a model going from 10% to 90% doesn't imply it is super-human; just that it crossed that specific threshold. even ARC-AGI suffers from this. that's also why a benchmark often stalls at a percentage; usually that means most questions are easy and a select few are super hard (or even wrong), so AIs just stop making progress at that point.

(not bad-mouthing Chase's work in any way, it is a nice idea and a good benchmark, but it is very hard to construct a flawless eval. perhaps a V2 with proper difficulty scaling would fix this specific flaw)

I avoid that in my vibe tests by having just a few personal questions in each "difficulty bracket". when AIs get smarter, I just make a harder question. that way, when a new model launches, all I have to do is give it my easiest question, then a harder question, then a harder question, and so on. it becomes very easy to gauge the actual intelligence of the model. and since I have only a few questions, it is easy to create small variations on the spot if I suspect an AI has just seen the answer.

I wish I had time to make an eval
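a rough sketch of the "difficulty bracket" idea in Python, purely illustrative. everything here is an assumption and not the author's actual setup: `toy_model` is a hypothetical stand-in for a real model call, the arithmetic questions are placeholders for personal questions, and `looks_memorized` is one possible way to read the "small variations" check.

```python
from typing import Callable, Sequence, Tuple

# One rung of the ladder: (question, checker that accepts or rejects an answer).
Rung = Tuple[str, Callable[[str], bool]]


def brackets_cleared(ask: Callable[[str], str], ladder: Sequence[Rung]) -> int:
    """Ask questions from easiest to hardest and stop at the first miss.

    Returns how many consecutive brackets the model cleared.
    """
    cleared = 0
    for question, is_correct in ladder:
        if not is_correct(ask(question)):
            break
        cleared += 1
    return cleared


def looks_memorized(ask: Callable[[str], str], original: Rung, variant: Rung) -> bool:
    """Flag a model that passes the original question but fails a small variation."""
    passed_original = original[1](ask(original[0]))
    passed_variant = variant[1](ask(variant[0]))
    return passed_original and not passed_variant


def toy_model(question: str) -> str:
    # Hypothetical stand-in for a real model: it only "knows" two answers.
    known = {"2 + 2": "4", "12 * 12": "144"}
    return known.get(question, "no idea")


# Placeholder ladder, ordered from easiest to hardest.
ladder = [
    ("2 + 2", lambda a: a.strip() == "4"),
    ("12 * 12", lambda a: a.strip() == "144"),
    ("17 * 23", lambda a: a.strip() == "391"),
    ("123 * 457", lambda a: a.strip() == "56211"),
]

score = brackets_cleared(toy_model, ladder)
suspicious = looks_memorized(
    toy_model,
    original=("12 * 12", lambda a: a.strip() == "144"),
    variant=("12 * 13", lambda a: a.strip() == "156"),
)
print(score, suspicious)
```

the score is the highest bracket reached rather than a percentage, so extending the test just means appending a harder rung when models catch up.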





Med school students are uploading their course materials to ChatGPT and using Voice mode as a study partner 👀










