Sachin

2.3K posts

@sachdh

cooking reasoning models and agents at @AthenaAgentRL - a narrow intelligence lab

Joined April 2019
838 Following 4K Followers
Pinned Tweet
Sachin
Sachin@sachdh·
Excited to share Aryabhatta 1.0, our leading model that scores 90.2% on JEE Mains, outperforming frontier models like o4 mini and Gemini Flash 2.5. Trained by us at @AthenaAgentRL, in collaboration with @physics__wallah, using custom RLVR training on 130K+ curated JEE problems. 7B parameters and 4K context is all you need to crack JEE. Also, you don’t need to blindly follow GRPO. Custom objective functions make a huge difference. Details below 👇
108
191
1.9K
197.8K
Leonard Tang
Leonard Tang@leonardtang_·
opportunity cost is insanely high for exceptional talent
6
9
266
20.3K
Nathan Lambert
Nathan Lambert@natolambert·
I am confidentially not joining Anthropic
29
3
529
53.9K
Sachin
Sachin@sachdh·
@_arohan_ GRPO variants from last year will say hi
0
0
2
917
rohan anil
rohan anil@_arohan_·
I don’t know what the phenomenon is called: sometimes the field mines improvements near a local neighborhood, like Adam -> (badam, dadam, madam) or Shampoo -> Muon -> (Duon, Buon, Luon) (the last few made up), instead of questioning whether the original formulation itself is the right question. You get so much math explaining these variants that it borders on slop. The same happened with Transformers too. Mathematically sophisticated but solving the wrong problem.
18
4
196
17.9K
Sachin retweeted
tokenbender
tokenbender@tokenbender·
We are releasing a fully reproducible early preprint of "Prism: Unlocking Language Model Capability Extraction". A trained language model knows many things at once, but deployment usually asks for one behavior at a time. In enterprise scenarios, a few products, workflows, features, or use cases often matter disproportionately. Prism asks and answers a simple question: "Is it possible to isolate and deploy only capabilities that are driven by the Pareto principle and cut down costs by a huge margin while preserving most of the performance?" This paper discusses a novel approach to efficiency and to understanding model behavior, and opens up capability extraction.
20
40
211
20.9K
Sachin
Sachin@sachdh·
@ar0cket1 @ChinmayKak i agree about importance. yes, group rewards is monte carlo estimation to reduce variance. but lack / presence of clipping and KL regularization decides if it is REINFORCE or PPO
0
0
2
71
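A minimal Python sketch of the distinction debated in this thread, not anyone's actual training code: group-relative advantages act as a Monte Carlo baseline, and the presence or absence of ratio clipping and a KL term is what separates a PPO-style objective from a REINFORCE-style one. All function names, `clip_eps`, and the KL estimator choice are assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def reinforce_objective(logp_new, adv):
    """REINFORCE-style per-sample term: log-probability weighted by advantage."""
    return logp_new * adv


def clipped_objective(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style per-sample term: clipped importance ratio times advantage."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * adv, clipped * adv)


def kl_penalty(logp_new, logp_ref):
    """A non-negative per-sample KL estimate (assumed choice for this sketch)."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0


# Hypothetical group of four sampled answers with 0/1 verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

With the same `advantages`, swapping `reinforce_objective` for `clipped_objective` minus a weighted `kl_penalty` changes only the regularization, not where the reward signal comes from, which is the point both sides of the thread are making.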
ar0cket1
ar0cket1@ar0cket1·
@sachdh @ChinmayKak i would say it's more similar to grouped REINFORCE. imo the source of reward estimation is more important than clipping and KL regularization (you can add clipping and KL on anything, the source of reward is the more defining feature)
1
0
0
78
ar0cket1
ar0cket1@ar0cket1·
@ChinmayKak yeah I’ve seen that too, but GRPO is generally significantly more informative than REINFORCE so I personally wouldn’t do REINFORCE
1
0
0
7K
Ronak Malde
Ronak Malde@rronak_·
Today, @MichaelElabd, @QuantumArjun, and I are excited to announce Trajectory. We are a research lab and product company building the platform for Continual Learning. Our platform unlocks the signal already sitting in product usage, so companies can continuously post-train large-scale agentic models that outperform the frontier. @trajectorylabs We’ve raised $15M from @Conviction, @BessemerVP, @radicalvcfund, @jeffdean, @drfeifei and more. We’re partnering with some of the best AI-native companies: @ClayRunHQ @Harvey, @DecagonAI, @mercor_ai, @RogoAI to power their agentic systems, some of which we are already in production with. We’ve brought together a world class research team from DeepMind, OpenAI, Apple, Meta Superintelligence, Amazon AGI, Scale AI, and an elite product team from Stripe and Figma. AI will never again start on day one. Every correction, every retry, every edit will make products smarter. This is Continual Learning.
244
154
1.4K
1.8M
Shubham Sharma
Shubham Sharma@HappyyPablo·
Super happy that a bunch of people are finding marlin useful. Thank you for the inference support @ZeroGPU_AI @huggingface 🥰🤝 We’ve got hands on more compute now so we’ll also release a series of blogs and benchmarks for the Open source community to use for dense captioning and retrieval
3
4
39
5.2K
joey00072
joey00072@joey00072fp4·
i love being a small account again
5
0
25
755
Sachin retweeted
joey00072
joey00072@joey00072fp4·
i lost everything job, old phone, twitter account, old guitar
3
1
22
7.8K
Sachin retweeted
render
render@infinterenders·
report this asshole hacker he hacked @shxf0072 handle and @joey00072fp4 ( this is real person account he created new one )
3
5
15
1.6K
Sachin
Sachin@sachdh·
@neural_avb it was a goated team sir most fun I had in a job + unity office had lots of games. so game playing was kinda work 😜
0
0
1
27
AVB
AVB@neural_avb·
@sachdh 🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼🙏🏼 goat
1
0
1
40
AVB
AVB@neural_avb·
This is what you can achieve with 5-6 hours of Self-Play RL training, by the way. Actors view the projectiles with lidar scans, pick an action using a PPO policy, and compete against past versions of themselves in an iterative self-improvement loop. Made in Unity with MLAgents.
Dwarkesh Patel@dwarkesh_sp

New blackboard lecture w @ericjang11. He walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.
Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.
Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.
Timestamps:
0:00:00 – Basics of Go
0:08:06 – Monte Carlo Tree Search
0:31:53 – What the neural network does
1:00:22 – Self-play
1:25:27 – Alternative RL approaches
1:45:36 – Why doesn’t MCTS work for LLMs
2:00:58 – Off-policy training
2:11:51 – RL is even more information inefficient than you thought
2:22:05 – Automated AI researchers

10
28
429
97.9K
Sachin
Sachin@sachdh·
@neural_avb checked ... it is still under active development. MLAgents was my last internship in 2018
1
0
1
38
AVB
AVB@neural_avb·
@sachdh This is like 3-4 years ago... I have no idea what's happening with MLAgents now!
1
0
2
186