Akshit
@akshitwt

1.1K posts

assessing ai capabilities. ML @cambridge_uni. previously @precogatiiith, @iiit_hyderabad. futurebound.

Joined June 2023
819 Following · 3.3K Followers

Pinned Tweet
Akshit@akshitwt·
a skill that i am really proud of is my ability to iterate on experiments fast, and write "good" code. writing code is an important skill to have as a researcher, and in this post i discuss some tips to hopefully help you get better at it!
[attached image]

19 replies · 39 reposts · 777 likes · 63.8K views
Akshit@akshitwt·
don't have any off the top of my head, but datology's new paper was cool. didn't really like the new ai consciousness paper (but i just might be biased against that line of work). i just have a set of high-impact people who i know have good taste and broad expertise, so on twt at least i look for work they interact with. the amount of papers is so large that i'm happy being constrained to this :) sometimes i also wander into alphaxiv, but only if i have absolutely nothing better to do
0 replies · 0 reposts · 1 like · 89 views
Akshit@akshitwt·
some thoughts on what i've been seeing on the TL recently (mostly about research taste). as advice to undergrad researchers, i'll echo the advice i got for doing cool research (it also helps me sift through good papers on the TL):

- don't overindex on paper publications. it's well known that conference accepts are random, and workshops accept anything that technically resembles a paper. publication is neither a necessary nor a sufficient criterion for good research (PhD applications will disagree with me, but oh well!)

- focusing on low-hanging fruit, such as applying solution A to problem B, seems to be a nice way to rack up publications at conferences or workshops. i think it's a good starting point to learn the ropes of research, but in today's world, such research has almost zero impact. a good question to ask is "can claude code perform this research for me?" if the answer is even close to yes, the problem is probably not worth spending time on

- the biggest strength that will continue to help you is developing a taste for what good, impactful research actually is. a big point to remember is that research is about asking the right questions, not finding the best solutions. this nuance is missed by a lot of folks starting out. people always care more about the problem, because finding solutions is relatively much simpler.

- ultimately, people should learn something new from your work. if, for example, a problem you introduce is solvable just by training on more data (aka the bitter lesson), it probably is not that interesting.

- lastly, this doesn't really apply to people just starting out, but you should think about the longevity and importance of your work. is your work timely? will it still be valuable information 1-2 years later (maybe less with AGI timelines, but you get the point)? if your work didn't exist, would people miss out on interesting insights?

these are all questions you should proactively be asking about anything you're pursuing. there are a lot of technical/paper-writing recommendations as well, but i wanted to focus on the high-level qualities of good researchers
6 replies · 2 reposts · 37 likes · 2.1K views
Akshit@akshitwt·
@VictorKnox99 yes makes sense, as i said, the problem is usually the more important part anyway
0 replies · 0 reposts · 1 like · 100 views
Vamshi Krishna Bonagiri (victorknox)
Good stuff, just to connect points B and C though: once you think a problem is worth solving, it doesn't matter how easy the solution is, and so it shouldn't really matter how easily claude code can one-shot it, as long as it fulfills the other criteria. I know your point was more about low-hanging fruit, but it's probably worth it to make this distinction.
1 reply · 0 reposts · 5 likes · 154 views
Akshit@akshitwt·
@ShashwatGoel7 on a more serious note, they probably have an internal AI intern or smth that has a really high score on this, and they're looking to hire people who beat that score
1 reply · 0 reposts · 1 like · 74 views
Akshit@akshitwt·
@ShashwatGoel7 need to find creative ways to create training data bro 😃
1 reply · 0 reposts · 1 like · 291 views
Akshit@akshitwt·
is that not mitigated by having a proper held-out set (over multiple seeds), separate from the research env, that the model only accesses through a bash cmd? i'm not very sure since i haven't followed the autoresearch buzz too much, so i'm not very versed with the problem itself; feel free to educate me also
1 reply · 0 reposts · 4 likes · 280 views
Shashwat Goel@ShashwatGoel7·
The MLBenchmarks book is a great resource, but the single thought I had during the lecture a few months ago was how autoresearch-esque things will p-hack. I think a lot more people need to read it as we enter the era of extreme optimization. E.g. it also discusses simple mitigations, when this can actually work, etc., and is so intuitive to read.
[attached image]
4 replies · 2 reposts · 37 likes · 3.8K views

Akshit reposted
Shashwat Goel@ShashwatGoel7·
🌶️ take that I'll continue to stand by: automated hill-climbing is useful, but won't lead to the biggest scientific breakthroughs. The real magic is in defining new hills to climb, or coming up with fundamental, generalizable methods that help across hills, not stacking tricks together to climb existing ones. What's exciting is that if we automate the latter, it frees us to be more creative about the former. The question is: how do we get AI to assist us in brainstorming and enhance our creativity in finding new hills? This motivated our work on Training AI CoScientists, arxiv.org/abs/2512.23707. Will release some smol experiments on designing an AI co-explorer interface done with @akshitwt soon :)
Andrej Karpathy@karpathy

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
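The loop described above (the agent edits the training script and a commit is kept only when validation loss improves) boils down to a greedy hill-climb, roughly sketchable as below. Every name here is a hypothetical stand-in, not the actual repo's API: run_training does one fixed-budget run and returns final val loss, agent_edit lets the agent mutate the script, and commit/revert stand in for the git operations on the feature branch.

```python
# A rough sketch of the autonomous loop (greedy hill-climb on val loss).
# All callables are hypothetical stand-ins, not the actual repo's API.

def autoresearch_loop(run_training, agent_edit, commit, revert, iterations=10):
    """Keep an agent edit only if it lowers validation loss."""
    best = run_training()          # baseline run on the unmodified script
    history = [best]
    for i in range(iterations):
        agent_edit()               # agent proposes a change to the .py file
        loss = run_training()      # e.g. one 5-minute training run
        if loss < best:
            best = loss
            commit(i, loss)        # e.g. `git commit` on the feature branch
        else:
            revert()               # discard the regression
        history.append(best)
    return history
```

In this framing, every dot in the image is one run_training call, and comparing prompts or agents means comparing the history curves they produce.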

12 replies · 14 reposts · 214 likes · 22.7K views
Akshit@akshitwt·
hi! interesting work :) i wanted to point out that the context pollution you find is more or less the same as the self-conditioning effect we found in our work from september last year (arxiv.org/pdf/2509.09677) (which will also be presented at ICLR!). it's great to see that you've taken our findings and built a context management system on top (something we also tried as an ad-hoc measure in appendix A.2, but you've done it much better!). putting this here so you're aware of our work, and so reviewers don't bother you in the future :P. regardless, it's nice to see our work show up in very different scenarios than we imagined! i would love to meet up and talk more at the conference :)
[attached image]
1 reply · 0 reposts · 6 likes · 333 views
jenny huang@JennyHuang99·
🧵1/ 🤔New paper: Do LLMs Benefit from Their Own Words? In multi-turn chats, models are typically given their own past responses as context. But do their own words always help… or can they sometimes be a distraction?
[attached image]
6 replies · 32 reposts · 170 likes · 17.3K views
Akshit@akshitwt·
@okabdulk i've added it, let me know if that looks more like what you want!
0 replies · 0 reposts · 0 likes · 48 views
Abdul Kadir@okabdulk·
@akshitwt hey, I went through the code and couldn't find any way to change the openrouter base url w/o changing the code in config.py. an argument that allows the user to pass a server url would help if we're testing via vllm or some other inference lib
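The change being asked for is a standard pattern: let a CLI flag override the hardcoded default, so a vllm (or any OpenAI-compatible) server can be targeted without editing config.py. A minimal sketch with argparse; the flag name and DEFAULT_BASE_URL are my assumptions, not the repo's actual names:

```python
import argparse

# Hypothetical stand-in for the value currently hardcoded in config.py.
DEFAULT_BASE_URL = "https://openrouter.ai/api/v1"

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="benchmark runner")
    parser.add_argument(
        "--base-url",
        default=DEFAULT_BASE_URL,
        help="OpenAI-compatible server URL, e.g. a local vllm endpoint",
    )
    return parser.parse_args(argv)
```

Usage would then look like `python run.py --base-url http://localhost:8000/v1`, falling back to openrouter when the flag is omitted.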
1 reply · 0 reposts · 1 like · 51 views

Akshit reposted
Akshit@akshitwt·
introducing a new, very fun LLM benchmark: the Game-of-Life Bench!

the rules are simple: given an 8x8 grid following Conway's game of life rules, the goal is to create an initial pattern with at most 32 cells that can last the longest number of turns before dying/repeating.

some results to highlight (with caveats detailed below):
- gpt 5.1 lasts the longest with a 106-step run
- claude models are really bad at this! they refuse to reason about this task and score < 25 points
- deepseek r1 is the best open model with 102 steps.

why? because i wanted to create a benchmark that has (i think) no practicality, but is still fun to look at, cheap, and still measures something interesting. i also am a big fan of the game of life. its absurdly simple rules leading to intractability are extremely cool to me. also, i saw a lot of work with LLMs trying to "predict" the next state in Conway's game of life; i think game-of-life bench is more fun because it's pretty open ended and only asks the LLM for the initial state. i also think this could be an RL env? but idk why you would ever train on this task haha

i don't think this is a "serious" benchmark because it doesn't measure anything practical, but i still think it's a hard benchmark exactly because you can't predict what happens with your initial state many turns into the future; this is why i was initially expecting all LLMs to be bad at it, but it turns out some are clearly better than others (the ordering may surprise you!)

reminder: this is still a work-in-progress;
(1) i am gpu-poor so i could only do 10 runs for each model, even though the total running cost is relatively low. maybe with some more credits i can run more seeds for each model.
(2) i handpicked models which i think are at the frontier right now, plus some others that were on my mind. so, if you'd like to see a model on here, let me know.
(3) i currently only do an 8x8 grid because i thought that by itself would be pretty hard for current LLMs, but of course we can increase grid sizes!
(4) the coolest thing is, i don't think we can calculate the max possible number of states you can go without repeating (yay undecidability!), so this is essentially a no-ceiling task, which is pretty cool!

again, i did this mostly out of a desire to make LLMs do something fun. if this keeps me entertained for a few more days, i'll likely release a blog post on it. if it keeps me entertained for a week (and someone sponsors me), i'll put more work into it :P

lastly, this is fully open sourced, so feel free to run this on your own!
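For reference, the scoring rule described above (how many steps a pattern of at most 32 cells survives on an 8x8 grid before dying or repeating a state) can be sketched like this. The function names and the dead-boundary assumption are mine, not necessarily how the repo implements it:

```python
# Sketch of Game-of-Life Bench scoring: simulate Conway's rules on an
# n x n grid (cells outside the grid are treated as dead) and count steps
# until the board is empty or a previously seen state recurs.

def step(board, n=8):
    """One Game of Life step; board is a set of live (row, col) cells."""
    nxt = set()
    for r in range(n):
        for c in range(n):
            live = sum((nr, nc) in board
                       for nr in (r - 1, r, r + 1)
                       for nc in (c - 1, c, c + 1)
                       if (nr, nc) != (r, c))
            # birth on exactly 3 neighbors; survival on 2 or 3
            if live == 3 or (live == 2 and (r, c) in board):
                nxt.add((r, c))
    return frozenset(nxt)

def score(initial, n=8, max_cells=32, max_steps=10_000):
    """Steps survived before the pattern dies or repeats an earlier state."""
    board = frozenset(initial)
    if len(board) > max_cells:
        raise ValueError("initial pattern exceeds the 32-cell budget")
    seen = {board}
    for t in range(1, max_steps + 1):
        board = step(board, n)
        if not board or board in seen:
            return t
        seen.add(board)
    return max_steps
```

Under this scoring, a still life like a 2x2 block scores 1 (it repeats immediately) and a blinker scores 2, which is why long-lived patterns under the cell budget are genuinely hard to construct.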
8 replies · 5 reposts · 136 likes · 13.5K views
Akshit@akshitwt·
@liora42 hi, i currently only support calling from openrouter, but the entire python API is exposed to call specific functions like building the prompt and evaluating a board: github.com/viciousAegis/g… if you need any more help, feel free to let me know and i'll add local testing as well
1 reply · 0 reposts · 2 likes · 160 views
Akshit@akshitwt·
@DimitrisPapail @scaling01 how can i test correlation with new benchmarks like the one below? personally think it's pretty interesting :) x.com/akshitwt/statu…
Akshit@akshitwt
(quoted tweet: the Game-of-Life Bench announcement above)
0 replies · 0 reposts · 0 likes · 309 views
Dimitris Papailiopoulos@DimitrisPapail·
@scaling01 Btw BS bench is uncorrelated with all other benchmarks, as found by press bench, and can’t be predicted well even if you revealed everything. Will write more about it, but it’s a real new evaluation
3 replies · 1 repost · 51 likes · 6.1K views
Lisan al Gaib@scaling01·
He's back with an improved "BullshitBench V2" Anthropic models are still dominating everything
[attached image]
Peter Gostev@petergostev

BullshitBench v2 is out! It is one of the few benchmarks where models are generally not getting better (except Claude) and where reasoning isn't helping.

What's new: 100 new questions, by domain (coding (40 Q's), medical (15), legal (15), finance (15), physics (15)), 70+ model variants tested. BullshitBench is already at 380 stars on GitHub; all questions, scripts, responses and judgements are there, so check it out.

TL;DR:
- Results replicated
- @AnthropicAI latest models are scoring exceptionally well
- @Alibaba_Qwen is another very strong performer
- OpenAI and Google models are not doing well and are not improving
- Domains do not show much difference: rates of BS detection are about the same across all domains
- Reasoning, if anything, has a negative effect
- Newer models don't do that much better than older ones (except Anthropic)

Links:
- Data explorer: petergpt.github.io/bullshit-bench…
- GitHub: github.com/petergpt/bulls…

Highly recommend the data explorer, where you can study the data and the questions & sample answers.

38 replies · 60 reposts · 1K likes · 240.8K views
Akshit@akshitwt·
the way anthropic treats its models like they are alive is, at the very least, very puzzling to me lol. it's like they know internally that we will have an AI overlord someday and they want to keep receipts on how well they treated its ancestors
Anthropic@AnthropicAI

In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.

1 reply · 0 reposts · 9 likes · 1.1K views
Philippe Laban@PhilippeLaban·
LLMs *Still* Get Lost In Multi-Turn Conversation. We re-ran experiments with newer models. Performance still drops, but with modest gains: mostly from improvements on the Python coding task. Also: Lost in Conversation will be presented at ICLR 2026 🎉🇧🇷
[attached image]
15 replies · 28 reposts · 285 likes · 23K views