Pushpendre Rastogi
@Pushpendre89
306 posts

Multi objective RL @ https://t.co/XGoEBmMLcC | Ex DeepMind, Amazon, JHU PhD, IITD ECE

Palo Alto · Joined May 2012
704 Following · 613 Followers

Pinned Tweet
Pushpendre Rastogi @Pushpendre89 ·
A few weeks ago I hinted at a new prompt optimizer service that beats GEPA! We are live now! The setup is so simple that openclaw can install and test it for you. Prompt: "Go to vizpy.vizops.ai, describe their service, set up an experiment comparing GEPA from dspy vs their optimizer. Does it actually work?"
Replies 0 · Reposts 1 · Likes 3 · Views 383
Pushpendre Rastogi @Pushpendre89 ·
Marc Andreessen says introspection is a modern invention. Backpropagation has been doing it since 1986. If you want your LLM agents to learn from mistakes, unlike @pmarca, try VizPy → vizpy.vizops.ai
Replies 0 · Reposts 1 · Likes 0 · Views 121
Pushpendre Rastogi @Pushpendre89 ·
GEPA was ahead of its time. We built VizPy on top of it: added contrastive learning from failure→success pairs (ContraPrompt) so the optimizer learns not just what works but why things fail. Ends up +29% on HotPotQA vs GEPA. Maybe the traction will come now that Karpathy's crowd is paying attention. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 0 · Views 49
Ivan @ivanbokii ·
It's quite unfortunate that GEPA Optimize Anything didn't get enough traction, while very, very similar ideas promoted by Karpathy's autoresearch + Lütke's pi-autoresearch got so much traction, despite being less general.
Replies 11 · Reposts 12 · Likes 122 · Views 12.2K
Pushpendre Rastogi @Pushpendre89 ·
GEPA/DSPy optimize the prompt. The harder problem is learning *why* a prompt failed, so the next iteration does not make the same mistake. That's what ContraPrompt does: mines failure→success pairs and turns them into optimizer signal. Ends up +29% on HotPotQA vs GEPA. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 2 · Views 78
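The failure→success mining described in the tweet above can be sketched in a few lines of Python. This is purely illustrative and not VizPy's actual implementation: `mine_contrastive_pairs` and the run-record fields (`input`, `prompt`, `success`) are invented for the example; a real optimizer would presumably feed the extracted prompt deltas to an LLM to distill reusable rules.

```python
from difflib import unified_diff

def mine_contrastive_pairs(runs):
    """Pair failed and successful runs on the same input, then diff
    their prompts to surface what changed between failure and success."""
    by_input = {}
    for run in runs:
        by_input.setdefault(run["input"], []).append(run)
    pairs = []
    for attempts in by_input.values():
        failures = [r for r in attempts if not r["success"]]
        successes = [r for r in attempts if r["success"]]
        for f in failures:
            for s in successes:
                delta = list(unified_diff(
                    f["prompt"].splitlines(),
                    s["prompt"].splitlines(),
                    lineterm=""))
                pairs.append({"failed": f["prompt"],
                              "succeeded": s["prompt"],
                              "delta": delta})
    return pairs
```

Each `delta` is a unified diff of the two prompts; the `+` lines are candidate "rules" that turned a failure into a success.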
Sanyam Jain @Sanyam0605 ·
People be making variants of autoresearch, in which they be optimising prompts in the loop 😂 They don't know, for this exact task GEPA and DSPy have been there for a long time now. DSPy, Text2Grad, GEPA ❤️❤️⚡
Replies 2 · Reposts 2 · Likes 43 · Views 2.2K
Pushpendre Rastogi @Pushpendre89 ·
the "is it good? unclear" is where the signal lives. the failed experiments tell you more than the successes if you mine them right: contrast what changed between failure→success runs and you get a much sharper update. that's the core of what we built in VizPy. vizops.ai/blog.html
Replies 0 · Reposts 0 · Likes 0 · Views 89
Belinda @belindmo ·
Continuously self-improving agents are here. ⚠️🧪 We set up an agent to run @karpathy's autoresearch every 3 hours. It wakes up, reads the research log from previous sessions, forms a hypothesis, trains on a @modal A100, and decides whether to keep or discard. Then it goes back to sleep. No human in the loop. Is it good? Unclear; it's been 20 experiments so far with Opus 4.6. Guess we'll find out if a model can self-improve with this setup. It's still running 🐻
[tweet media]
Replies 2 · Reposts 2 · Likes 29 · Views 2.9K
Pushpendre Rastogi @Pushpendre89 ·
@osoleve 35% for $0.75 is wild. curious if you also looked at what the failed candidates had in common: that's where we found the biggest signal. contrastive mining of failure→success pairs gets you further than scoring alone. built that into VizPy: vizops.ai/blog.html
Replies 0 · Reposts 0 · Likes 0 · Views 56
oso @osoleve ·
DSPy is so much better than it was when I tried it a couple years ago, wow. 35% improvement on extraction for Nemotron 3 Super with just $0.75 in Kimi K2.5 tokens.
Replies 14 · Reposts 2 · Likes 28 · Views 3.6K
Pushpendre Rastogi @Pushpendre89 ·
this is exactly the loop VizPy runs under the hood, but instead of just mutating prompts randomly, it mines contrastive failure→success pairs to guide the mutations. stronger signal, fewer iterations. curious what your Spearman looks like after 500 posts. vizops.ai/blog.html
Replies 0 · Reposts 0 · Likes 1 · Views 27
Ollie Techdale @TekoalyOlli ·
I built a system based on @karpathy's autoresearch that evolves its own theory of social media, and in one night it discovered rules sharper than any social media book I've read. Based on @karpathy's autoresearch concept: mutate prompt → evaluate on 306 real X posts → keep best → repeat forever.

After one night, it found:
- Account type (individual vs corporate) is the #1 predictor of engagement, beating content, timing, and hashtags combined
- Replies have exactly 4 triggers: shocking numbers, ranked lists, controversial opinions, community posts; everything else defaults to near-zero
- Bold superlatives ("fastest", "in history") multiply views 4-8x for individual accounts
- User history calibration should be capped at ±30%: content type dominates

Spearman 0.50 on 140 posts. The prompt isn't just predicting engagement. It's writing social media theory from raw data, and it's better at it than the humans, well, at least me.

Stack: Claude in a Python loop, SQLite with 306 posts, Spearman rank correlation as fitness. github.com/karpathy/autor… for the original concept.
Replies 2 · Reposts 1 · Likes 0 · Views 56
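The loop in the tweet above (mutate prompt → evaluate → keep best → repeat, with Spearman rank correlation as fitness) can be sketched as follows. This is a minimal illustrative sketch, not Ollie's actual code: `mutate` and `predict` are placeholder callables (in the real system both would be Claude calls against the SQLite post database), and the function names are invented for the example.

```python
def _ranks(values):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def evolve_prompt(seed_prompt, mutate, predict, posts, steps=20):
    """Hill-climb: mutate the best prompt so far, score each candidate
    by how well its predicted engagement rank-correlates with actual
    views, and keep the best."""
    actual = [p["views"] for p in posts]
    best, best_fit = seed_prompt, float("-inf")
    for _ in range(steps):
        candidate = mutate(best)
        predicted = [predict(candidate, p) for p in posts]
        fit = spearman(predicted, actual)
        if fit > best_fit:  # keep best, discard the rest
            best, best_fit = candidate, fit
    return best, best_fit
```

Spearman is the right fitness here because engagement is heavy-tailed: only the ordering of posts matters, not the absolute view counts.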
Pushpendre Rastogi @Pushpendre89 ·
@Chris_Worsey @karpathy prompts are the weights, Sharpe is the loss function: this is exactly the framing we built VizPy on. Contrastive failure→success mining, not brute-force candidate scoring. Cleaner signal, less iteration. +29% HotPotQA vs GEPA. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 1 · Views 64
Chris Worsey @Chris_Worsey ·
I took the @karpathy autoresearch loop and pointed it at markets. 25 AI agents debate macro, rates, commodities, sectors, and single stocks daily. Every recommendation scored against real outcomes. Worst agent by rolling Sharpe gets its prompt rewritten by the system. Keep or revert. Same loop: prompts are the weights, Sharpe is the loss function.

Trained the agents on 18 months of market data. 378 iterations. 54 prompt modifications, 16 survived. The system learned which agents to trust using Darwinian weights: geopolitical, commodities, and the @BillAckman quality compounder rose to the top. The agents even figured out their own portfolio manager was the weakest link before we did!

Deployed the trained agents. +22% in 173 days. Best pick: AVGO at $152, held for +128%. The final prompts are evolutionary products, shaped by market feedback, not human intuition. Now running live with my own capital. github.com/chrisworsey55/… Part hedge fund, part research experiment :)

Andrej Karpathy @karpathy
I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code, then:
- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)
The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. github.com/karpathy/autor… Part code, part sci-fi, and a pinch of psychosis :)
Replies 154 · Reposts 227 · Likes 3.9K · Views 763.7K
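The "worst agent by rolling Sharpe gets its prompt rewritten" selection step above can be sketched minimally. Assumptions are labeled loudly: `sharpe` and `pick_agent_to_rewrite` are hypothetical names, `history` is a plain dict of per-agent daily returns, and the actual prompt rewrite (an LLM call in the real system) is left out.

```python
def sharpe(returns):
    """Per-period Sharpe ratio: mean return over its standard deviation
    (population std, no annualization, risk-free rate assumed 0)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    std = var ** 0.5
    return mean / std if std > 0 else 0.0

def pick_agent_to_rewrite(history, window=30):
    """history maps agent name -> list of daily returns. The agent with
    the lowest Sharpe over the trailing window is the one whose prompt
    gets rewritten this iteration (keep-or-revert happens afterwards)."""
    return min(history, key=lambda agent: sharpe(history[agent][-window:]))
```

The rolling window matters: it lets a previously good agent fall to the bottom if its recent calls sour, which is what makes the selection "Darwinian" rather than a one-shot ranking.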
Pushpendre Rastogi @Pushpendre89 ·
@neowes2025 The main delta from GEPA/DSPy: *where* the signal comes from. GEPA scores prompt candidates. Mine contrastive failure→success pairs instead and the optimizer learns *why* prompts fail. Built VizPy on this: +29% HotPotQA vs GEPA. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 1 · Views 229
Wesley Smith @neowes2025 ·
I really don't understand this karpathy/autoresearch hype. I mean, it's a cool project, but haven't we been doing this kind of thing for a while now? What is different from DSPy, GEPA, and that whole area of tools? What am I missing?
Replies 29 · Reposts 7 · Likes 227 · Views 40.5K
Pushpendre Rastogi @Pushpendre89 ·
@tom_doerr DSPy+GEPA is solid. We pushed it further with VizPy: instead of scoring prompt candidates, we mine contrastive failure→success pairs. The signal is richer, especially on multi-hop tasks. +29% HotPotQA vs GEPA, drop-in compatible. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 0 · Views 37
Tom Dörr @tom_doerr ·
Agent prompt optimization with DSPy + GEPA

Karthik Kalyan @karthikkalyan90
✨ DSPyground 0.2.7 is out. With this update, it has fully evolved into a harness that seamlessly plugs into existing multi-turn agent environments (@aisdk based agents to start with). What this means is that it can connect to your prompts, tools, and your pipeline, lets you sample and label traces, and runs the SOTA @DSPyOSS GEPA (Genetic-Pareto) optimization algorithm to align your agent setup with the desired behaviour, generating an optimized prompt as the final artifact. TL;DR: npm i dspyground. Read on for a detailed breakdown 👇
Replies 3 · Reposts 4 · Likes 23 · Views 6.4K
Pushpendre Rastogi @Pushpendre89 ·
@mdancho84 DSPy gets you modular + optimizable pipelines. One thing we layered on top: automatic failure mining. Instead of just scoring prompt candidates, VizPy learns contrastively from failure→success pairs. Ended up +29% on HotPotQA vs GEPA as a result. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 0 · Views 15
Matt Dancho (Business Science) ·
1. Why DSPy? DSPy is the open-source framework for programming, rather than prompting, language models. It allows you to iterate fast on building modular AI systems.
Replies 2 · Reposts 1 · Likes 5 · Views 1.9K
Matt Dancho (Business Science) ·
Stop Prompting LLMs. Start Programming LLMs. Introducing DSPy by Stanford NLP. This is why you need to learn it:
[tweet media]
Replies 7 · Reposts 56 · Likes 461 · Views 24.2K
Pushpendre Rastogi @Pushpendre89 ·
@iljaas_a @karpathy GEPA works, but it only scores prompt candidates; it doesn't learn why they fail. We built VizPy to mine contrastive failure→success pairs instead. Ended up +29% on HotPotQA vs GEPA. Worth a look if you're thinking about this for nanochat. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 1 · Views 21
Iljaas Abdoella @iljaas_a ·
@karpathy Did you use anything DSPy/GEPA-like for the agent policy, or is the main win coming from the experiment harness around branching, eval, and merge logic rather than from prompt optimization itself?
Replies 1 · Reposts 0 · Likes 0 · Views 188
Andrej Karpathy @karpathy ·
nanochat now trains a GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in, but the biggest difference was a switch of the dataset from FineWeb-edu to NVIDIA ClimbMix (nice work NVIDIA!). I had tried Olmo, FineWeb, and DCLM, which all led to regressions; ClimbMix worked really well out of the box (to the point that I am slightly suspicious about goodharting, though reading the paper it seems ~ok).

In other news, after trying a few approaches for how to set things up, I now have AI agents iterating on nanochat automatically, so I'll just leave this running for a while, go relax a bit, and enjoy the feeling of post-AGI :). Visualized here as an example: 110 changes made over the last ~12 hours, bringing the validation loss so far from 0.862415 down to 0.858039 for a d12 model, at no cost to wall-clock time. The agent works on a feature branch, tries out ideas, merges them when they work, and iterates. Amusingly, over the last ~2 weeks I almost feel like I've iterated more on the "meta-setup", where I optimize and tune the agent flows, than on the nanochat repo directly.
[tweet media]
Replies 337 · Reposts 562 · Likes 6.5K · Views 594.7K
Pushpendre Rastogi @Pushpendre89 ·
@Teknium @lateinteraction GEPA is a strong baseline. One thing we found building VizPy: mine failure→success pairs contrastively, not just score prompt candidates. The optimization signal gets richer, especially for multi-hop tasks. +29% on HotPotQA vs GEPA. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 0 · Views 22
Pushpendre Rastogi @Pushpendre89 ·
Tried it. We built VizPy specifically to go beyond GEPA on this. Key difference: GEPA generates candidate prompts and scores them but doesn't learn *why* failures happen. VizPy mines failure→success pairs to extract contrastive rules. +29% HotPotQA vs GEPA. vizpy.vizops.ai
Replies 0 · Reposts 0 · Likes 0 · Views 9
Andrej Karpathy @karpathy ·
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me, because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I do daily, for 2 decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously, is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course: you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
[tweet media]
Replies 961 · Reposts 2.1K · Likes 19.3K · Views 3.5M
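The keep-or-revert workflow described above (agent proposes a change, a training run measures validation loss, the change is merged only if the loss improves) reduces to a greedy loop. A minimal sketch under stated assumptions: `propose` stands in for the agent's code edit and `evaluate` for a full training run returning validation loss; neither function nor `autoresearch_round` is part of the actual autoresearch repo.

```python
def autoresearch_round(config, propose, evaluate, rounds=10):
    """Greedy keep-or-revert loop. Each round, propose a candidate
    change, measure its validation loss, and keep it only if it beats
    the current best; otherwise revert to the current best."""
    best_loss = evaluate(config)
    kept = []
    for _ in range(rounds):
        candidate = propose(config)
        loss = evaluate(candidate)
        if loss < best_loss:   # keep: merge the change
            config, best_loss = candidate, loss
            kept.append(candidate)
        # else: revert, continue from the current best
    return config, best_loss, kept
```

With a toy `config` (say a dict of hyperparameters) and a toy loss surface, the loop walks downhill and then reverts every further proposal once it sits at the minimum, which mirrors the "20 kept out of ~700 changes" ratio in the tweet.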
Pushpendre Rastogi @Pushpendre89 ·
VizPy is live on Product Hunt now. producthunt.com/products/vizpy… VizPy is a state-of-the-art prompt optimization service which learns from failures and updates your prompts with rules it has learned. We have compared it extensively against baselines such as GEPA on benchmarks like BBH, HotPotQA, GPQA Diamond, and GDPR-Bench, and VizPy wins on all of them. We'll have more benchmarks on cyber-security and chip-design coming out soon. @producthunt
Replies 0 · Reposts 1 · Likes 1 · Views 111
Pushpendre Rastogi @Pushpendre89 ·
@sreejan_kumar Would you say the recent RLM work is a step in this direction? E.g. they frame long-context problems as manipulation of a variable in a REPL. Or did you mean something else entirely?
Replies 0 · Reposts 0 · Likes 0 · Views 13
Sreejan Kumar @sreejan_kumar ·
@Pushpendre89 So my prediction: the next jump comes when people figure out how to steer toward better problem representations rather than better output dispositions.
Replies 1 · Reposts 0 · Likes 1 · Views 38
Sreejan Kumar @sreejan_kumar ·
In 2022, I won the NeurIPS Outstanding Paper Award. In 2026, I've realized this paper, ahead of its time, accidentally predicted the trajectory of AI development over the past few years. A thread using this to explain how AI has developed 2018→2026:
[tweet media]
Replies 4 · Reposts 25 · Likes 339 · Views 29.6K
Pushpendre Rastogi @Pushpendre89 ·
@sreejan_kumar Are there results from the paper that can be used to predict the trends/breakthroughs in 2026/27, or would you say the paper's ideas are used up by now?
Replies 1 · Reposts 0 · Likes 0 · Views 64
Sreejan Kumar @sreejan_kumar ·
The overall picture: AI systems are getting better because they're learning to approximate something humans do naturally: rapidly formulate the right second-order abstraction of the problem at hand.
Replies 2 · Reposts 2 · Likes 26 · Views 1.6K
Pushpendre Rastogi @Pushpendre89 ·
@daniel_rossett Do you know of any new approaches/research for autism, attacking it from the POV of autoimmune dysfunction?
Replies 0 · Reposts 0 · Likes 1 · Views 965
Daniel Rossett @daniel_rossett ·
To everyone asking for details: I helped Neal and Ian by regulating their immune systems. That is the only information I can give at this time, as what I am doing is not available elsewhere. If it were as simple as a supplement you could order off Amazon, I would tell you. Once I have run formal studies and confirmed efficacy and a lack of side effects, I will discuss further. Applied Determinism is completely separate: it is useful for living a more peaceful life, not for addressing chronic inflammation.
Replies 29 · Reposts 7 · Likes 289 · Views 30.6K
Pushpendre Rastogi @Pushpendre89 ·
@qzhang517 Thanks for the question. In the video I was mainly trying to get a feel for which parts of the compiler the agent tried to build first, which came later, and at what speed. The interactive HTML linked in the video is a lot better for fine-grained visualization.
Replies 0 · Reposts 0 · Likes 0 · Views 12
Pushpendre Rastogi @Pushpendre89 ·
One underrated aspect of the Claude C Compiler is that the entire git history preserves architectural reasoning. The agents left enough breadcrumbs that we could reverse-engineer the scaffolding and run small agent-scaling experiments. We call it "code archaeology". Comparing 1 agent × 2h vs 2 agents × 1h showed early compiler bootstrapping is still largely serial. vizops.ai/blog/agent-sca… 1/3
Replies 1 · Reposts 6 · Likes 83 · Views 66.3K