Akshit
1.1K posts

Akshit
@akshitwt
assessing ai capabilities. ML @cambridge_uni. previously @precogatiiith, @iiit_hyderabad. futurebound.

Reaching scientific goals: expectation vs. reality, via Florian Aigner.

Are you up for a challenge? openai.com/parameter-golf

🌶️ take that I'll continue to stand by: Automated hill-climbing is useful, but won't lead to the biggest scientific breakthroughs. The real magic is in defining new hills to climb, or in coming up with fundamental, generalizable methods that help across hills, not in stacking tricks together to climb existing ones.

What's exciting is that if we automate the latter, it frees us to be more creative about the former. The question is: how do we get AI to assist us in brainstorming and enhance our creativity in finding new hills? This motivated our work on Training AI CoScientists, arxiv.org/abs/2512.23707. Will release some smol experiments on designing an AI co-explorer interface done with @akshitwt soon :)

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (i.e. lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
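As a rough illustration of the loop described above, here's a minimal Python sketch. This is not the actual repo's code: `propose_edit`, `run_training`, and the `val_loss:` log format are my assumptions, and the commit/revert policy is just one reasonable reading of "accumulate commits as it finds better settings".

```python
import re
import subprocess

def parse_val_loss(log_text: str) -> float:
    """Pull the final 'val_loss: <float>' entry out of a training log."""
    matches = re.findall(r"val_loss:\s*([0-9.]+)", log_text)
    if not matches:
        raise ValueError("no val_loss found in log")
    return float(matches[-1])

def should_commit(new_loss: float, best_loss: float) -> bool:
    """Keep an edit only if it strictly improves validation loss."""
    return new_loss < best_loss

def agent_loop(n_iters: int, propose_edit, run_training) -> float:
    """Outer loop: propose an edit to train.py, run it under a fixed
    time budget, then commit the change or revert it."""
    best = float("inf")
    for _ in range(n_iters):
        propose_edit()            # agent rewrites train.py
        log = run_training()      # e.g. a 5-minute capped training run
        loss = parse_val_loss(log)
        if should_commit(loss, best):
            best = loss
            subprocess.run(["git", "commit", "-am", f"val_loss {loss:.4f}"])
        else:
            subprocess.run(["git", "checkout", "--", "train.py"])
    return best
```

Each dot in the plot would then be one `run_training()` call, and the git history on the feature branch is exactly the sequence of accepted improvements.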


⚡ Excited to announce Gemini 3.1 Flash-Lite! We’ve set a new standard for efficiency and capability to give developers our fastest, most cost-effective Gemini 3 model yet. We engineered this model with thinking levels, allowing it to handle high-volume queries instantly, while scaling up its reasoning for complex edge cases.

By the numbers:
⏱️ 2.5X faster time-to-first-token than 2.5 Flash while being significantly higher quality
📉 $0.25 per 1M input tokens
📊 1432 Elo on LMArena & 86.9% on GPQA Diamond

Thrilled to see what developers build with this kind of speed and quality at scale. Available now in Google AI Studio and Vertex AI. blog.google/innovation-and…

introducing a new, very fun LLM benchmark: the Game-of-Life Bench!

the rules are simple: given an 8x8 grid following Conway's Game of Life rules, the goal is to create an initial pattern with at most 32 live cells that lasts the longest number of turns before dying or repeating.

some results to highlight (with caveats detailed below):
- gpt 5.1 lasts the longest with a 106-step run
- claude models are really bad at this! they refuse to reason about this task and score < 25 points
- deepseek r1 is the best open model with 102 steps

why? because i wanted to create a benchmark that has (i think) no practicality, but is still fun to look at, cheap, and still measures something interesting. i'm also a big fan of the game of life; its absurdly simple rules leading to intractability is extremely cool to me. i've also seen a lot of work with LLMs trying to "predict" the next state in Conway's game of life; i think game-of-life bench is more fun because it's pretty open ended and only asks the LLM for the initial state. i also think this could be an RL env? but idk why you would ever train on this task haha

i don't think this is a "serious" benchmark because it doesn't measure anything practical, but i still think it's a hard benchmark exactly because you can't predict what happens with your initial state many turns into the future; this is why i was initially expecting all LLMs to be bad at it, but it turns out some are clearly better than others (the ordering may surprise you!)

reminder: this is still a work-in-progress:
(1) i am gpu-poor so could only do 10 runs for each model, even though the total running cost is relatively low. maybe with some more credits i can run more seeds for each model.
(2) i handpicked models which i think are at the frontier right now, plus some others that were on my mind. so, if you'd like to see a model on here, let me know.
(3) i currently only do an 8x8 grid because i thought that by itself would be pretty hard for current LLMs, but of course we can increase grid sizes!
(4) the coolest thing is, i don't think we can practically calculate the max possible number of steps (yay intractability!) you can go without repeating, so this is essentially a no-ceiling task, which is pretty cool!

again, i did this mostly out of a desire to make LLMs do something fun. if this keeps me entertained for a few more days, i'll likely release a blog post on it. if it keeps me entertained for a week (and someone sponsors me), i'll put more work into it :P

lastly, this is fully open sourced, so feel free to run this on your own!
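since the rules fit in a few lines, here's a minimal Python sketch of how such a scoring function could work. this is my own reading, not the actual repo's code: i'm assuming a bounded grid where off-grid cells are dead, and counting the generation at which the pattern dies or first revisits a past state as its score.

```python
from collections import Counter

N = 8  # the bench uses an 8x8 grid

def step(live: frozenset) -> frozenset:
    """One Game of Life generation; cells outside the N x N grid stay dead."""
    # Count how many live neighbours each candidate cell has.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Standard B3/S23 rules: born with 3 neighbours, survive with 2 or 3.
    return frozenset(
        cell for cell, n in counts.items()
        if 0 <= cell[0] < N and 0 <= cell[1] < N
        and (n == 3 or (n == 2 and cell in live))
    )

def score(initial: frozenset, max_steps: int = 10_000) -> int:
    """Steps survived before the grid empties or repeats an earlier state."""
    assert len(initial) <= 32, "bench allows at most 32 live cells"
    seen = {initial}
    state = initial
    for t in range(1, max_steps + 1):
        state = step(state)
        if not state or state in seen:
            return t
        seen.add(state)
    return max_steps
```

under this reading, a blinker (three cells in a row) scores 2, since it returns to its starting state after two generations, and a 2x2 still-life block scores 1. whether the official scorer counts the dying/repeating step itself may differ, so treat this as one reasonable convention.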

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub - all questions, scripts, responses and judgements are there, so check it out.

TL;DR:
- Results replicated
- @AnthropicAI latest models are scoring exceptionally well
- @Alibaba_Qwen is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference - rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench…
- GitHub: github.com/petergpt/bulls…

Highly recommend the data explorer, where you can study the data and the questions & sample answers.

In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
