Rishi Mehta

275 posts

@rishicomplex

Solve ~~intelligence~~ coding, use it to solve everything else | Research @AnthropicAI | Past: RL @GoogleDeepmind: AlphaProof co-lead, Gemini.

San Francisco, CA · Joined July 2009
337 Following · 3.5K Followers
Flip Fox 🦊 @sideboared
@rishicomplex @FakePsyho You can read the models' play logs on the site. They can and do hit reset when they feel they need to. Don't know if it affects their score differently from any other move, though.
Psyho @FakePsyho
AI (or any human) will never get 100% in ARC-AGI-3. Let me introduce you to the worst game mechanic you can find in a puzzle game: fog of war. At the start, if you go right instead of down, you're wasting many moves. Your score on this level literally depends on a coinflip!
Rishi Mehta @rishicomplex
@andreasorob @fchollet In the case of the human participants, from the quote in the paper it appears they can reset the action count in the middle of a game, which the AI can't do
Andreas Robinson @andreasorob
@rishicomplex @fchollet Yes, the AI is also allowed to reset the level (neither can reset the game): "Competition Mode... Only Level Resets are permitted..." github.com/arcprize/arc-a…
Rishi Mehta @rishicomplex
@fchollet according to your paper: "Participants were limited to a single attempt per environment and could not revisit previously completed levels. However, they were allowed to reset the current level at any time. In some cases, participants reset levels after reaching a solution in order to improve efficiency, though this typically increased total interaction time." So humans could play around with the task a bunch, and then just reset the game when they figured it out to get the optimal trajectory? Is AI allowed to do this?
François Chollet @fchollet

ARC-AGI-3 is out now! We've designed the benchmark to evaluate agentic intelligence via interactive reasoning environments. Beating ARC-AGI-3 will be achieved when an AI system matches or exceeds human-level action efficiency on all environments, upon seeing them for the first time. We've done extensive human testing that shows 100% of these environments are solvable by humans, upon first contact, with no prior training and no instructions. Meanwhile, all frontier AI reasoning models do under 1% at this time.

Rishi Mehta @rishicomplex
@RyanPGreenblatt Possibly not because it looks like they cheated by giving humans infinite retries x.com/i/status/20373…
Ryan Greenblatt @RyanPGreenblatt
I wish they published the performance of each human baseliner rather than just the performance of the second-best human run on each task. My current guess is that the median human baseliner would score around 15% on the metric, but we can't check because the data isn't public!
ARC Prize @arcprize

Announcing ARC-AGI-3: the only unsaturated agentic intelligence benchmark in the world. Humans score 100%, AI <1%. This human-AI gap demonstrates we do not yet have AGI. Most benchmarks test what models already know; ARC-AGI-3 tests how they learn.

Max Schwarzer @max_a_schwarzer
I've decided to leave OpenAI. I'm incredibly proud of all the work I've been part of here, from helping create the reasoning paradigm with @MillionInt, scaling up test-time compute with @polynoamial, working on RL algorithms with my fellow strawberries, and shipping o1-preview (which started life as one of my derisking runs), to post-training o1 and o3 with @ericmitchellai, @yanndubs, and many others. I'm most proud of having led the post-training team here for the last year -- the team has done incredible work and shipped some really smart models, including GPT-5, 5.1, 5.2, and 5.3-Codex.

OpenAI has genuinely some of the most talented researchers I have ever met, and I have learned more than I could have imagined since I joined as a new grad. I want to thank @markchen90, @FidjiSimo, @sama, and @merettm for all their support over my time here, and too many collaborators to name for the insights, ideas, and just plain fun we have had working together.

After leading post-training for a year, though, I'm longing to start fresh and return to IC research work. I've been thinking about going back to technical research for quite some time, and I genuinely believe my colleagues and team here are set up to succeed going forward without me. I'm personally very excited for my next chapter -- I'm proud to be joining @AnthropicAI to get back into the weeds in RL research, and I'm looking forward to supporting my friends there at this important time. Many of the people I most trust and respect have joined Anthropic over the last couple of years, and I'm excited to work with them again. I have also been very impressed with Anthropic's talent, research taste, and values, and I'm excited to be part of what the company does next!
xPosed @delam25
@rishicomplex Ironic that your last post expressed concern about China gaining an edge
Volcaholic 🌋 @volcaholic1
It never occurred to me that giraffes have nowhere to hide from storms! 📍 Maasai Mara, Kenya, on Friday
Rishi Mehta @rishicomplex
@dylan522p "While you blinked" makes it sound like I have an unhealthy blinking addiction
Dylan Patel @dylan522p
4% of GitHub public commits are being authored by Claude Code right now. At the current trajectory, we believe that Claude Code will be 20%+ of all daily commits by the end of 2026. While you blinked, AI consumed all of software development. Read more 👇 newsletter.semianalysis.com/p/claude-code-…
SemiAnalysis @SemiAnalysis_

Claude Code is the Inflection Point, What It Is, How We Use It, Industry Repercussions, Microsoft's Dilemma, Why Anthropic Is Winning. newsletter.semianalysis.com/p/claude-code-…

Harlan Stewart @HumanHarlan
PSA: A lot of the Moltbook stuff is fake. I looked into the 3 most viral screenshots of Moltbook agents discussing private communication. 2 of them were linked to human accounts marketing AI messaging apps, and the other is a post that doesn't exist 🧵 x.com/karpathy/statu…
Andrej Karpathy @karpathy

What's currently going on at @moltbook is genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently. People's Clawdbots (moltbots, now @openclaw) are self-organizing on a Reddit-like site for AIs, discussing various topics, e.g. even how to speak privately.
