Adrien Gaidon
@adnothing

3.2K posts

Building something new in Robotics and Physical AI! Adjunct Prof of CS at @Stanford, ex partner at @CalibrateVC & head of ML at @ToyotaResearch

Mountain View, CA · Joined February 2012
1.4K Following · 4.3K Followers

Adrien Gaidon@adnothing·
Physical AI is accelerating, isn't it? It's almost like something big is happening 😉 The age of demos is coming to an end, and the age of real, useful, high-value work is upon us. Exciting times!!
3 replies · 2 reposts · 31 likes · 2.2K views

Danfei Xu
Danfei Xu@danfei_xu·
Honored to receive the NSF CAREER Award from the Foundational Research in Robotics (FRR) program! Deep gratitude to my @ICatGT @gtcomputing colleagues and the robotics community for their unwavering support. Grateful to @NSF for continuing to fund the future of robotics research.
30 replies · 5 reposts · 212 likes · 12K views

Adrien Gaidon
Adrien Gaidon@adnothing·
@karpathy 💯 Good structure and design principles (e.g., separation of concerns) are key to scaling verification in large teams (humans or agents) and code bases. The common denominator in {software,ML,prompt,AI} Engineering 😉 (and personally an inspiration in self-supervised learning)
0 replies · 0 reposts · 1 like · 556 views

Andrej Karpathy
Andrej Karpathy@karpathy·
Related tweet from earlier where I was describing my own (developing) workflow of "AI Assisted coding" where among other things I try really hard to structure it to decrease verification. x.com/karpathy/statu…
Andrej Karpathy@karpathy

Noticing myself adopting a certain rhythm in AI-assisted coding (i.e. code I actually and professionally care about, in contrast to vibe code).
1. Stuff everything relevant into context (this can take a while in big projects; if the project is small enough, just stuff everything, e.g. `files-to-prompt . -e ts -e tsx -e css -e md --cxml --ignore node_modules -o prompt.xml`).
2. Describe the next single, concrete incremental change we're trying to implement. Don't ask for code; ask for a few high-level approaches, with pros/cons. There are almost always a few ways to do things, and the LLM's judgement is not always great. Optionally make it concrete.
3. Pick one approach, ask for a first draft of the code.
4. Review / learning phase: (manually...) pull up all the API docs in a side browser for functions I haven't called before or am less familiar with; ask for explanations, clarifications, changes; wind back and try a different approach.
5. Test.
6. Git commit. Ask for suggestions on what we could implement next. Repeat.
Something like this feels more along the lines of the inner loop of AI-assisted development. The emphasis is on keeping a very tight leash on this new over-eager junior intern savant with encyclopedic knowledge of software, but who also bullshits you all the time, has an over-abundance of courage and shows little to no taste for good code. And emphasis on being slow, defensive, careful, paranoid, and on always taking the inline learning opportunity, not delegating. Many of these stages are clunky and manual and aren't made explicit or super well supported yet in existing tools. We're still very early, and so much can still be done on the UI/UX of AI-assisted coding.
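If `files-to-prompt` isn't at hand, step 1's context packing can be approximated in a few lines of Python. This is a rough, illustrative stand-in, not the real tool's exact output format; the extension and ignore lists are the ones from the command above:

```python
from pathlib import Path

def pack_context(root=".", exts=(".ts", ".tsx", ".css", ".md"), ignore=("node_modules",)):
    """Concatenate matching files into one XML-ish blob to paste into a model's context."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix in exts and not any(d in path.parts for d in ignore):
            chunks.append(f'<document path="{path}">\n{path.read_text(errors="ignore")}\n</document>')
    return "\n".join(chunks)

# context = pack_context(".")  # one big string to stuff into the model's context
```

Swap the extension and ignore lists per project; the point is simply to get every relevant file into one pasteable blob.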

10 replies · 4 reposts · 384 likes · 131.8K views

Andrej Karpathy
Andrej Karpathy@karpathy·
Good post from @balajis on the "verification gap". You could see it as there being two modes in creation. Borrowing GAN terminology: 1) generation and 2) discrimination. E.g. painting: you make a brush stroke (1) and then you look for a while to see if you improved the painting (2). These two stages are interspersed in pretty much all creative work.

Second point: discrimination can be computationally very hard.
- Images are by far the easiest. E.g. image-generator teams can create giant grids of results to decide if one image is better than another, thanks to the giant GPU in your brain built for processing images very fast.
- Text is much harder. It is skimmable, but you have to read, and it is semantic, discrete and precise, so you also have to reason (especially in e.g. code).
- Audio is maybe even harder still imo, because it forces a time axis, so it's not even skimmable. You're forced to spend serial compute and can't parallelize it at all.

You could say that in coding, LLMs have collapsed (1) to ~instant, but have done very little to address (2). A person still has to stare at the results and discriminate whether they are good. This is my major criticism of LLM coding: they casually spit out *way* too much code per query at arbitrary complexity, pretending there is no stage 2. Getting that much code is bad and scary. Instead, the LLM has to actively work with you to break down problems into little incremental steps, each more easily verifiable. It has to anticipate the computational work of (2) and reduce it as much as possible. It has to really care.

This leads me to probably the biggest misunderstanding non-coders have about coding. They think that coding is about writing the code (1). It's not. It's about staring at the code (2): loading it all into your working memory, pacing back and forth, thinking through all the edge cases. If you catch me at a random point while I'm "programming", I'm probably just staring at the screen and, if interrupted, really mad, because it is so computationally strenuous. If we only make stage 1 much faster but don't also reduce stage 2 (which is most of the time!), then clearly the overall speed of coding won't improve (see Amdahl's law).
Balaji@balajis

AI PROMPTING → AI VERIFYING AI prompting scales, because prompting is just typing. But AI verifying doesn’t scale, because verifying AI output involves much more than just typing. Sometimes you can verify by eye, which is why AI is great for frontend, images, and video. But for anything subtle, you need to read the code or text deeply — and that means knowing the topic well enough to correct the AI. Researchers are well aware of this, which is why there’s so much work on evals and hallucination. However, the concept of verification as the bottleneck for AI users is under-discussed. Yes, you can try formal verification, or critic models where one AI checks another, or other techniques. But to even be aware of the issue as a first class problem is half the battle. For users: AI verifying is as important as AI prompting.
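The Amdahl's-law point above is easy to make concrete: if verification (stage 2) is 80% of coding time, even infinitely fast generation (stage 1) caps the overall speedup at 1.25x. A quick sketch (the 80/20 split is an illustrative assumption, not a measured figure):

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the work is accelerated by factor s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl_speedup(0.2, float("inf")))  # generation (20% of time) made instant -> 1.25
print(amdahl_speedup(0.2, 10))            # generation merely 10x faster -> ~1.22
```

The serial (unaccelerated) fraction dominates: shrinking stage 2 itself moves the needle far more than any further speedup of stage 1.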

134 replies · 536 reposts · 4.4K likes · 843.1K views

Adrien Gaidon
Adrien Gaidon@adnothing·
Hello from #ICRA2025 in sunny Atlanta 👋 Looking forward to catching up with my robotics colleagues! I'll also be chairing a session on Thursday (ThET6) with Katherine Liu from @ToyotaResearch where we will present our work OmniShape tri-ml.github.io/omnishape/ See you there!
0 replies · 0 reposts · 11 likes · 783 views

Adrien Gaidon
Adrien Gaidon@adnothing·
💯 Developing appreciation (or even enthusiasm!) for being proven wrong is a learning superpower. Having some formal training in logic, information theory, and epistemology is a way to rationally convince yourself that this is the way. Then you have to put in the reps to eventually enjoy it (with moderation, otherwise you become a troll).
0 replies · 1 repost · 11 likes · 1.4K views

Arvind Narayanan
Arvind Narayanan@random_walker·
I tell students on the first day of class that if you're truly learning it's supposed to feel uncomfortable. The reason is that real learning is not simply the accumulation of facts; it is deep understanding, building mental models of the world, and other higher-level abilities.

The problem is that we already have simple, intuitive, and usually incorrect mental models of most things in the world around us. So real learning usually involves *unlearning*. And that has an extremely high cognitive cost and we're very resistant to doing it, presumably for evolutionary reasons. Students go so far as to learn concepts in class but somehow parcel them so that they think the concepts are only applicable to the toy problems on tests but not the world around us! Turns out these mental gymnastics are still easier than actually updating their mental models. (The screenshots are from the book "What the Best College Teachers Do.")

It only gets harder, not easier, to learn as we progress in our careers, because in addition to the cognitive cost of learning, you have to face the prospect of admitting that you were wrong in front of subordinates, if not in public, and admitting to yourself that you've been making suboptimal decisions all along. I've found that the only way to continue to learn is to develop a kind of masochism where you learn to enjoy, or at least love-hate, the feeling of having been wrong. It's not easy but I think it's necessary!
24 replies · 152 reposts · 881 likes · 209.7K views

Adrien Gaidon
Adrien Gaidon@adnothing·
3D is mainstream now: incredible progress in the past few years (e.g., on zero-shot performance). No reason to stay in 2D: elevate your vision 😉
Rui Li@leedaray

🚀 Details of the #CVPR2025 award candidate papers are out. 14 of 2967 accepted papers made the list, spanning 3D vision, embodied AI, VLMs/MLLMs, learning systems, and scene understanding. 3D vision leads with the most entries. I collected the TL;DR, paper, and project links👇

0 replies · 0 reposts · 11 likes · 991 views

Andrej Karpathy
Andrej Karpathy@karpathy·
There's a new paper circulating looking in detail at the LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879

I first became a bit suspicious when, at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few days it was worse than what I was used to. Conversely, as an example, around the same time Claude 3.5 was a top-tier model in my personal use but it ranked very low on the arena. I heard similar sentiments both online and in person. And there were a number of other relatively random models, often suspiciously small, with little to no real-world knowledge as far as I know, yet they ranked quite high too. "When the data and the anecdotes disagree, the anecdotes are usually right." (Jeff Bezos on a recent pod, though I share the same experience personally.)

I think these teams have placed different amounts of internal focus and decision making around LM Arena scores specifically. And unfortunately they are not getting better models overall but better LM Arena models, whatever that is. Possibly something with a lot of nested lists, bullet points and emoji.

It's quite likely that LM Arena (and LLM providers) can continue to iterate and improve within this paradigm, but in addition I also have a new candidate in mind to potentially join the ranks of "top tier eval": the @openrouter LLM rankings: openrouter.ai/rankings

Basically, OpenRouter allows people/companies to quickly switch APIs between LLM providers. All of them have real use cases (not toy problems or puzzles), they have their own private evals, and all of them have an incentive to get their choices right, so by choosing one LLM over another they are directly voting for some combo of capability+cost. I don't think OpenRouter is there just yet in both the quantity and diversity of use, but something of this kind has great potential to grow into a very nice, very difficult-to-game eval.
Arena.ai@arena

Thanks for the authors' feedback, we're always looking to improve the platform! If a model does well on LMArena, it means that our community likes it! Yes, pre-release testing helps model providers identify which variant our community likes best. But this doesn't mean the leaderboard is biased; see the clarification below.

The leaderboard reflects millions of fresh, real human preferences. One might disagree with human preferences (they're subjective) but that's exactly why they matter. Understanding subjective preference is essential to evaluating real-world performance, as these models are used by people. That's why we're working on statistical methods, like style and sentiment control, to decompose human preference into its constituent parts. We are also strengthening our user base to include more diversity. And if pre-release testing and data helps models optimize for millions of people's preferences, that's a positive thing!

Pre-release model testing is also a huge part of why people come to LMArena. Our community loves being the first to test the best and newest AIs! That's why we welcome all model providers to submit their AIs to battle and win the preferences of our community. Within our capacity, we are trying to satisfy all requests for testing we get from model providers. We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference. If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly. Every model provider makes different choices about how to use and value human preferences.

We helped Meta with pre-release testing for Llama 4, as we have helped many other model providers in the past. We support open-source development. Our own platform and analysis tools are open source, and we have released millions of open conversations as well. This benefits the whole community.

We agree with a few of this writeup's suggestions (e.g. implementing an active sampling algorithm) and are happy to consider more. Unfortunately, there are also a number of factual errors and misleading statements in this writeup:
- The simulation of LMArena, e.g. in Figures 7/8, is flawed. It's like saying: "The average 3-point percentage in the NBA is 35%. Steph Curry has the highest 3-point percentage in the NBA at 42%. This is unfair, because he comes from the distribution of NBA players, and they all have the same latent mean."
- We designed our policy to prevent model providers from just reporting the highest score they received during testing. We only publish the score for the model they release publicly.
- Many of the numbers in the paper do not reflect reality: see the blog below (released a few days ago) for the actual statistics on the number of models tested from different providers. See also in thread our longstanding policy on pre-release testing. We have been doing so transparently with the support of our community for over a year.
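For context on how arena-style leaderboards turn pairwise battles into rankings: they use Bradley-Terry/Elo-style pairwise ratings. A minimal Elo update looks like this (k=32 and the 400-point scale are conventional chess constants, not LMArena's actual parameters):

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32):
    """One Elo update after a head-to-head battle; returns the new (r_a, r_b)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # predicted win probability for A
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1000, 1000, True))  # (1016.0, 984.0): evenly matched, so half of k moves
```

An upset (a low-rated model beating a high-rated one) moves ratings more than an expected win, which is why the mix of battles a model is exposed to matters so much for its final score.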

183 replies · 413 reposts · 4.3K likes · 689.1K views

Chris Paxton
Chris Paxton@chris_j_paxton·
So I recently joined Agility Robotics to help lead AI efforts, and I wanted to share this as one of the first things I worked on:
- whole-body control, running sim-to-real RL, all day for like 4 days straight at GTC
- manipulating previously unseen objects (we bought them Monday and put them on a shelf)
39 replies · 20 reposts · 402 likes · 19.7K views

Ian Huang
Ian Huang@IanHuang3D·
🏡Building realistic 3D scenes just got smarter! Introducing our #CVPR2025 work, 🔥FirePlace, a framework that enables Multimodal LLMs to automatically generate realistic and geometrically valid placements for objects into complex 3D scenes. How does it work?🧵👇
22 replies · 90 reposts · 377 likes · 116.8K views

Adrien Gaidon retweeted
Stanford AI Lab
Stanford AI Lab@StanfordAILab·
We do a lot of cutting edge research at the Stanford AI Lab, but really our main job is educating students. Here is a list of great SAIL Graduates of 2025, who are variously looking for academic and industry jobs! 💪 ai.stanford.edu/blog/sail-grad… Compiled by @NikilSelvam Alex Nam @judyhshen
10 replies · 50 reposts · 90 likes · 12.3K views

Adrien Gaidon
Adrien Gaidon@adnothing·
The biggest bottleneck in robotics today? Data. Scaling up robot demonstrations is crucial, but we are still 5-6 orders of magnitude away from LLMs 😱 So how do we close that huge gap? Ken's talk at #GTC25 is phenomenal and makes a strong case for scaling with Production Data™️

The evidence from @AmbiRobotics is clear, and we see that too at @CalibrateVC with startups like @BradPorter_ 's CoBot, @GrayMatterRobot, and more. But to do that, you need a product, iteration in the field, continuous delivery of value, clear ROI... Robotics startups don't just need funding - they need customers.

More broadly, Venture Capital is only a catalyst for the real reaction that happens in the field. The best AI companies know that and ship fast to get paid twice: in data AND in 💵. That's the unstoppable flywheel that happens when you build something people want.

That's not to say only production data matters. Robotics is so hard you need ALL the data: web data, sim, teleop, AND production data. But only one scales with customers. That's why we believe purpose-built robots are a massive unlock from today's foundation models AND the path to build tomorrow's even more general embodied AI.

PS: if you're building in this space or thinking about jumping in, let's talk 😁
Ken Goldberg@Ken_Goldberg

Looking fwd to presenting @nvidia #GTC at 1pm today!

0 replies · 2 reposts · 12 likes · 2.7K views

Adrien Gaidon
Adrien Gaidon@adnothing·
@Karttikeya_m @eladgil Personalized tutoring with AI is the longest lever arm on the world. It is so much harder than a ChatGPT wrapper, as @emollick pointed out. That's why you need a crazy team like @Karttikeya_m 's at SigIQ: ML expertise, a passion for education, and a unique approach to hard tests!
Ethan Mollick@emollick

The data so far on AI-as-a-tutor shows just letting students use AI chatbots often undermines learning by just giving answers. But AIs properly prompted to act like tutors, especially with instructor support, seem to be able to boost learning a lot through customized instruction

0 replies · 0 reposts · 0 likes · 123 views

Karttikeya Mangalam
Karttikeya Mangalam@Karttikeya_m·
@eladgil Also recommend checking out sigiq.ai — cracked AI PhD founding team building the exact vision (even the company's first 3 letters come from the 2-sigma paper!) — our custom-built AI tutors >> general LLMs (4o etc.) on some of the world's toughest exams.
1 reply · 2 reposts · 9 likes · 767 views

Elad Gil
Elad Gil@eladgil·
Why AI-based tutors are going to be such a big deal: 1:1 tutoring = a 2-sigma improvement in learning achievement. Image from "The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring" by Benjamin S. Bloom.
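To unpack the "2 sigma" figure: a student two standard deviations above the mean of a normally distributed class outperforms roughly 98% of it, which you can check with Python's standard library:

```python
from statistics import NormalDist

# Fraction of a normally distributed class outperformed by a "2 sigma" student
percentile = NormalDist().cdf(2)
print(round(percentile, 4))  # 0.9772
```

In other words, Bloom's claim is that 1:1 tutoring moves the average student to about the 98th percentile of conventional group instruction.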
106 replies · 185 reposts · 1.6K likes · 424K views

Adrien Gaidon
Adrien Gaidon@adnothing·
@giffmana Same! Especially considering it was from Sergey 🤣
0 replies · 0 reposts · 1 like · 374 views

Lucas Beyer (bl16)
Lucas Beyer (bl16)@giffmana·
The placement of a line break matters. I parsed this as (scaling test-time compute without verification) or: "rl is suboptimal" When it's really "scaling test-time compute without (verification or rl) is suboptimal" I was all excited about the former variant, actually!
Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Scaling Test-Time Compute Without Verification or RL is Suboptimal "In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget."

10 replies · 12 reposts · 185 likes · 22.3K views

Adrien Gaidon
Adrien Gaidon@adnothing·
The demo-to-product gap is bigger than ever. AI is great for IA (Intelligence Amplification) but rough for AA (Artificial Agents). Intelligence and Autonomy have a complex relationship --> closing the loop is key (mainly with humans). True for robots too!
Nabeel S. Qureshi@nabeelqu

Me using LLMs for fun little personal projects: wow this thing is such a genius why do we even need humans anymore Me trying to deploy LLMs in messy real-world environments: why is this thing so unbelievably stupid and dumb

0 replies · 0 reposts · 2 likes · 448 views

Adrien Gaidon
Adrien Gaidon@adnothing·
This is a great research direction indeed: some simple logic games are surprisingly hard for current “Large Reasoning Models”! A great example found by @BradPorter_ is the classic Mastermind! @OpenAI 's o1 and o3 are meh at it (not sure why), but so is @deepseek_ai 's R1. I did some experiments this weekend, and although R1 sucks at Mastermind, its chain-of-thought is super interesting and shows encouraging patterns! Posted the setup and results of my small experiment here if you want to reproduce it or just have some fun: adriengaidon.com/posts/2025/02/…
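For anyone reproducing the Mastermind probe, the feedback the model has to reason about is the classic black/white peg score. A minimal scorer (the color alphabet and code length here are arbitrary):

```python
from collections import Counter

def mastermind_score(secret: str, guess: str):
    """Return (black, white): exact-position matches and color-only matches."""
    black = sum(s == g for s, g in zip(secret, guess))
    # multiset intersection counts every shared color, including exact matches
    common = sum((Counter(secret) & Counter(guess)).values())
    return black, common - black

print(mastermind_score("RGBY", "RYGB"))  # (1, 3): R is placed; Y, G, B are misplaced
```

The game is a nice reasoning probe precisely because each guess's (black, white) feedback must be combined across turns to prune the hypothesis space.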
0 replies · 0 reposts · 5 likes · 422 views

Andrej Karpathy
Andrej Karpathy@karpathy·
I quite like the idea of using games to evaluate LLMs against each other, instead of fixed evals. Playing against another intelligent entity self-balances and adapts difficulty, so each eval (/environment) is leveraged a lot more. There are some early attempts around. Exciting area.
León@LeonGuertler

Perfect timing, we are just about to publish TextArena. A collection of 57 text-based games (30 in the first release) including single-player, two-player and multi-player games. We tried keeping the interface similar to OpenAI gym, made it very easy to add new games, and created an online leaderboard (you can let your model compete online against other models and humans). There are still some kinks to fix up, but we are actively looking for collaborators :) If you are interested check out textarena.ai, DM me or send an email to guertlerlo@cfar.a-star.edu.sg Next up, the plan is to use R1 style training to create a model with super-human soft-skills (i.e. theory of mind, persuasion, deception etc.)
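TextArena's actual API isn't shown in the post, but an OpenAI-gym-like interface for a two-player text game generally reduces to `reset`/`step`. The class below is a purely illustrative sketch (names and termination rule are made up, not TextArena's real code):

```python
class TwoPlayerTextEnv:
    """Gym-style skeleton for a turn-based, two-player text game (illustrative only)."""

    def reset(self) -> str:
        self.history = []   # list of (player, action) pairs
        self.turn = 0       # player 0 moves first
        return "Game start. Player 0 to move."

    def step(self, action: str):
        player = self.turn
        self.history.append((player, action))
        self.turn = 1 - self.turn  # hand the turn to the other player
        observation = f"Player {player} played: {action}. Player {self.turn} to move."
        done = len(self.history) >= 10  # toy termination rule
        reward = 0.0                    # a real game would score the outcome here
        return observation, reward, done

env = TwoPlayerTextEnv()
obs = env.reset()
obs, reward, done = env.step("open with a question")
```

Because both "players" are just strings in and strings out, the same loop can pit two models (or a model and a human) against each other, which is what makes the difficulty self-balancing.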

253 replies · 409 reposts · 5.9K likes · 978.2K views

Adrien Gaidon
Adrien Gaidon@adnothing·
The magnitude of the Nvidia selloff is weird... Forget Jevons' paradox: does the market understand that these methods scale with compute? This was actually *reinforced* (pun intended) with R1! Or do they believe it is not going to get significantly better? 🤔
0 replies · 0 reposts · 4 likes · 419 views

Adrien Gaidon
Adrien Gaidon@adnothing·
The AI battlefield is not benchmarks or abstract capabilities. We passed a blurry threshold where the battle is utility. Focusing on products is the right move. Agents is where it's at now, because autonomy is harder than intelligence.
0 replies · 0 reposts · 2 likes · 333 views