Adit Jain

158 posts

@aditjain1980

Vibing @ CollinearAI. PhD, Cornell ECE.

Ithaca, NY · Joined April 2023
375 Following · 82 Followers
Adit Jain@aditjain1980·
@aditya5109 the agent gets access to a computer, which it controls by observing screenshots and then taking actions (clicking (x,y) coordinates, scrolling, etc.) - and no, this is completely trained using RL! No supervision from GPT-5.4; the model learns through interaction with the environment
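The observe-screenshot / take-action loop described above can be sketched as a toy RL environment. This is a minimal illustration, not the author's actual setup: all names (`ToyScreenEnv`, `Action`, `run_episode`) are hypothetical, and a real CUA agent would observe rendered pixels and train the policy with an RL algorithm rather than act randomly.

```python
# Hypothetical sketch of a computer-use RL loop: the agent observes the
# screen, emits click/scroll actions, and collects reward from interaction
# alone (no supervised action labels). Names here are illustrative only.
from dataclasses import dataclass
import random

@dataclass
class Action:
    kind: str   # "click" or "scroll"
    x: int = 0  # click coordinates on the screen
    y: int = 0
    dy: int = 0 # scroll amount

class ToyScreenEnv:
    """Stands in for a computer the agent controls via screenshots."""
    def __init__(self, width=1280, height=720):
        self.width, self.height = width, height
        self.target = (640, 360)  # hidden goal: click near screen center

    def observe(self):
        # A real environment would return screenshot pixels; here we
        # return only the screen dimensions as a stand-in observation.
        return {"width": self.width, "height": self.height}

    def step(self, action: Action) -> float:
        # Reward 1.0 for clicking within 50 px of the target, else 0.0.
        if action.kind == "click":
            dx, dy = action.x - self.target[0], action.y - self.target[1]
            return 1.0 if dx * dx + dy * dy <= 50 * 50 else 0.0
        return 0.0

def random_policy(obs) -> Action:
    # Placeholder for the learned policy: random click or scroll.
    if random.random() < 0.8:
        return Action("click", x=random.randrange(obs["width"]),
                      y=random.randrange(obs["height"]))
    return Action("scroll", dy=random.choice([-3, 3]))

def run_episode(env, policy, steps=100) -> float:
    # The RL interaction loop: observe -> act -> collect reward.
    total = 0.0
    for _ in range(steps):
        total += env.step(policy(env.observe()))
    return total

env = ToyScreenEnv()
print(run_episode(env, random_policy))
```

Training then amounts to replacing `random_policy` with a model updated from these interaction rewards.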
Aditya Bansal@aditya5109·
@aditjain1980 How is the data structured for CUA tasks? Is it using GPT-5.4's steps and feeding those in?
Adit Jain@aditjain1980·
@abhishekn @hendrycks @ai_frontiers_ One can also ask the following question (related to q2, given empirical data from the last few years): given the advancements in AI, how many new types of jobs did AI create? My hypothesis is that if the rate of advancement continues, the jobs AI creates will dwarf (q1)
Abhishek Nagaraj@abhishekn·
love this article by Anton Shenk in @hendrycks' @ai_frontiers_ ai-frontiers.org/articles/the-q… It grounds disagreements on GDP/growth forecasts in fundamental disagreements between the doomers and the luddites (if I may!) -- and argues that we can resolve many of these with grounded empirical work.
Question 1: Which Jobs Can AI Actually Automate?
Question 2: Can the Economy Absorb What AI Companies Produce?
Question 3: Can AI Automate Innovation Itself?
While I'm not sure the debates can be resolved, I agree that these three questions are pretty central drivers of disagreement, especially Q2.
Adit Jain@aditjain1980·
@trq212 can we add a feature where, when I'm submitting my answers after a Q&A session with Claude, it allows me to add some additional points?
Adit Jain@aditjain1980·
@a1zhang Why is performance not monotonic in depth? Isn't the RLM with depth N a strict generalization of an RLM with depth N-1?
alex zhang@a1zhang·
RLM arXiv paper update: depth>1 results, more comparisons, more training, and more error analysis!

We add depth=2/3 experiments, where the RLM now has access to recursive RLM calls. This is also a feature of the open source `rlm` repo. We observe significant performance gains on OOLONG-Pairs and gains on all other benchmarks!

We also include various OpenCode and Claude Code comparisons now, per popular request. We add a length generalization experiment on MRCRv2 to show more promising training results, add a small prompting case study on OOLONG, and update the error analysis section to discuss the effect of syntax errors, decomposition mistakes, and general observations from the RLM trajectories.

The appendix is now also updated with several new experiments and plots!
Adit Jain@aditjain1980·
@henrytdowling yes and also experiments where even the goal is unclear and you iterate on the goal itself!
Adit Jain@aditjain1980·
too many YOLO experiments to keep track of 💀
Adit Jain@aditjain1980·
@jxnlco less but better - one piece of software that takes care of everything!
jason@jxnlco·
Do you want more software or better software?
Adit Jain@aditjain1980·
I am 10x more bullish on @tinkerapi than I was a couple weeks ago! Will share some experimentation soon!
Adit Jain@aditjain1980·
@GaryMarcus The 50% success rate is across trials - you can scale compute, run more trials, and then select the best of N. It's not efficient, but I think your point about reliability is moot if we can measurably improve reliability with more compute.
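The best-of-N argument above has a simple quantitative form. As a sketch (mine, not the author's): if a single trial succeeds with probability p, and a verifier can reliably pick out a successful trial, then at least one of N independent trials succeeds with probability 1 - (1 - p)^N, so reliability climbs quickly with extra compute.

```python
# Best-of-N reliability under the (strong) assumptions of independent
# trials and a perfect verifier that can select a successful attempt.
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent trials succeeds."""
    return 1.0 - (1.0 - p) ** n

# Starting from a 50% per-trial success rate:
for n in (1, 2, 4, 8):
    print(n, round(best_of_n_success(0.5, n), 4))
# prints:
# 1 0.5
# 2 0.75
# 4 0.9375
# 8 0.9961
```

Note the caveats: real trials are often correlated, and verification is rarely perfect, so this is an upper bound on what naive resampling buys.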
Gary Marcus@GaryMarcus·
Hot take on METR's new graph that so many people are flipping out about today.
• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…
• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.
• If you read carefully, it is only about software tasks. Not general intelligence.
• It certainly doesn't tell you that *most* (let alone all) things that humans can do in 16 hours can be done in Mythos, let alone reliably.
• Aside from this, the graph doesn't show you *how* the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this is a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled, and not a proof that another trillion dollars will continue the graph.
• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.
METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

jason@jxnlco·
small ship / passion project, more details soon github.com/openai/openai-…
1. call responses via cli with all cloud tools
2. unix style structured outputs via cli
3. image gen/edit, transcription, tts
4. make projects and provision api keys
more docs soon
Adit Jain retweeted
Adit Jain@aditjain1980·
now that models can create and train on environments autonomously - is it safe to assume that any public benchmark will land in the training set relatively soon - and so contamination is unavoidable?