Adit Jain (@aditjain1980) - Twitter Profile

Adit Jain@aditjain1980·1d

@aditya5109 The agent gets access to a computer, which it controls by observing screenshots and then taking actions (clicking (x, y) coordinates, scrolling, etc.). And no, this is trained entirely with RL! There is no supervision from GPT-5.4; the model learns through interaction with the environment.
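A minimal sketch of the observe-then-act loop described in the reply above. Everything here is hypothetical and for illustration only: `ToyDesktopEnv`, `Click`, `Scroll`, `scripted_policy`, and `rollout` are made-up names, the "screenshot" is a tiny grid rather than real pixels, and the scripted policy stands in for a learned model. In an RL setup, the collected trajectory and reward would feed a policy update instead of being returned.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class Scroll:
    dy: int


Action = Union[Click, Scroll]
Screenshot = List[List[int]]  # toy stand-in for pixels: 1 marks the button


class ToyDesktopEnv:
    """Hypothetical environment: one button sits below the fold of a page."""

    def __init__(self, width: int = 8, height: int = 4,
                 button: Tuple[int, int] = (5, 6)) -> None:
        self.width, self.height = width, height
        self.button = button   # (x, y) in page coordinates
        self.offset = 0        # how far the page has been scrolled
        self.done = False

    def screenshot(self) -> Screenshot:
        frame = [[0] * self.width for _ in range(self.height)]
        bx, by = self.button
        row = by - self.offset
        if 0 <= row < self.height:   # button is inside the viewport
            frame[row][bx] = 1
        return frame

    def step(self, action: Action) -> float:
        """Apply an action and return a reward (1.0 only for hitting the button)."""
        if isinstance(action, Scroll):
            self.offset = max(0, self.offset + action.dy)
            return 0.0
        hit = (action.x, action.y + self.offset) == self.button
        self.done = hit
        return 1.0 if hit else 0.0


def scripted_policy(frame: Screenshot) -> Action:
    """Placeholder for a learned policy: click the button if visible, else scroll."""
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if pixel:
                return Click(x, y)
    return Scroll(dy=1)


def rollout(env: ToyDesktopEnv, policy: Callable[[Screenshot], Action],
            max_steps: int = 50) -> Tuple[List[Tuple[Action, float]], float]:
    """Run one episode and return the (action, reward) trajectory and total reward."""
    trajectory, total = [], 0.0
    for _ in range(max_steps):
        action = policy(env.screenshot())
        reward = env.step(action)
        trajectory.append((action, reward))
        total += reward
        if env.done:
            break
    return trajectory, total
```

Calling `rollout(ToyDesktopEnv(), scripted_policy)` scrolls until the button is visible and then clicks it; the only signal the agent receives is the reward from the environment.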

Aditya Bansal@aditya5109·1d

@aditjain1980 How is the data structured for CUA tasks? Is it using GPT-5.4's steps and feeding that in?

Adit Jain@aditjain1980·1d

We're hillclimbing open source models on real-world computer use tasks - come join us 🚀

Collinear AI@CollinearAI

We discovered significant gaps between open and closed-source models on our realistic computer-use-agent (CUA) tasks, and it is a data problem. Although open models have nearly saturated OSWorld, we found that Kimi K2.6 cannot do tasks that GPT-5.4 solves in 50 steps. Our 30 tasks are realistic: the agent works with an open-source office suite on a Linux OS and compiles Excel sheets. GPT-5.4-high solves 2/3 of them in 25 steps, and 1/3 in 50 steps. Kimi K2.6, the strongest open model on OSWorld, fails almost all of them. We understand the problem to be very simple: open models are not trained on enough realistic CUA data. To test this hypothesis, we RL-ed Kimi K2.6 on 10 in-domain CUA office tasks with LoRA. The result of this simple RL is a significant increase of +30% in the capacity to do office tasks. Moreover, the improvement carries over to OSWorld itself: on a stratified subset of 30 tasks, the RL-ed model sees another +10% lift. The takeaway from our initial results is that CUA models suffer from unrealistic, low-quality data. As a result, we are continually building realistic apps / RL environments to bridge the gap. More to come. Solid work done by @alckasoc

Adit Jain@aditjain1980·1d

also what is this amazing company which doesn't engage in chart crimes

Collinear AI@CollinearAI

We discovered significant gaps between open and closed-source models on our realistic computer-use-agent (CUA) tasks, and it is a data problem. Although open models have nearly saturated OSWorld, we found that Kimi K2.6 cannot do tasks that GPT-5.4 solves in 50 steps. Our 30 tasks are realistic: the agent works with an open-source office suite on a Linux OS and compiles Excel sheets. GPT-5.4-high solves 2/3 of them in 25 steps, and 1/3 in 50 steps. Kimi K2.6, the strongest open model on OSWorld, fails almost all of them. We understand the problem to be very simple: open models are not trained on enough realistic CUA data. To test this hypothesis, we RL-ed Kimi K2.6 on 10 in-domain CUA office tasks with LoRA. The result of this simple RL is a significant increase of +30% in the capacity to do office tasks. Moreover, the improvement carries over to OSWorld itself: on a stratified subset of 30 tasks, the RL-ed model sees another +10% lift. The takeaway from our initial results is that CUA models suffer from unrealistic, low-quality data. As a result, we are continually building realistic apps / RL environments to bridge the gap. More to come. Solid work done by @alckasoc

Adit Jain@aditjain1980·2d

@abhishekn @hendrycks @ai_frontiers_ One can also ask the following question (related to Q2), given empirical data from the last few years: given the advancements in AI, how many new types of jobs did AI create? My hypothesis is that if the rate of advancement continues, the jobs it creates will dwarf the jobs it automates (Q1).

Abhishek Nagaraj@abhishekn·2d

love this article by Anton Shenk in @hendrycks' @ai_frontiers_ ai-frontiers.org/articles/the-q… it grounds disagreements on GDP/growth forecasts in fundamental disagreements between the doomers and the luddites (if I may!) and argues that we can resolve many of these with grounded empirical work. Question 1: Which Jobs Can AI Actually Automate? Question 2: Can the Economy Absorb What AI Companies Produce? Question 3: Can AI Automate Innovation Itself? while i'm not sure debates can be resolved, I agree that these three questions are pretty central drivers of disagreement, especially Q2.

Adit Jain@aditjain1980·2d

wonder what the true stats are without the selection bias

Andrew Curran@AndrewCurran_

According to the new data from Ramp, Anthropic has passed OpenAI in business adoption for the first time. 'Adoption of Anthropic rose 3.8% in April to 34.4% of businesses. OpenAI adoption fell 2.9% to 32.3%. Overall AI adoption rose 0.2 percentage points to 50.6%.'

Adit Jain@aditjain1980·3d

@trq212 Can we add a feature that lets me add some additional points when I submit my answers after a Q&A session with Claude?

Adit Jain retweeted

Collinear AI@CollinearAI·3d

x.com/i/article/2053…

Adit Jain@aditjain1980·3d

@a1zhang Why is performance not monotonic in depth? Isn't the RLM with depth N a strict generalization of an RLM with depth N-1?

alex zhang@a1zhang·3d

RLM arXiv paper update: depth>1 results, more comparisons, more training, and more error analysis! We add depth=2/3 experiments, where the RLM now has access to recursive RLM calls. This is also a feature of the open source `rlm` repo. We observe significant performance gains on OOLONG-Pairs and gains on all other benchmarks! We also include various OpenCode and Claude Code comparisons now per popular request. We add a length generalization experiment on MRCRv2 to show more promising training results, add a small prompting case study on OOLONG, and update the error analysis section to discuss the effect of syntax errors, decomposition mistakes, and general observations from the RLM trajectories. The appendix is now also updated with several new experiments and plots!

Adit Jain@aditjain1980·4d

oh how the tables have turned from Codex copying CC features to the other way around

Daniel San@dani_avila7

Claude Code 2.1.139 added /goal. You set a completion condition and Claude keeps working across turns until it's met. Works in interactive, -p, and Remote Control 👏

Adit Jain@aditjain1980·4d

@henrytdowling yes and also experiments where even the goal is unclear and you iterate on the goal itself!

Henry Dowling@henrytdowling·4d

@aditjain1980 by yolo do you just mean "ralph loop and forget about it"?

Adit Jain@aditjain1980·5d

too many YOLO experiments to keep track of 💀

Adit Jain retweeted

Peter Wildeford🇺🇸🚀@peterwildeford·6d

Deep learning is hitting a wall (the wall being our ability to measure AI capabilities)

Peter Wildeford🇺🇸🚀@peterwildeford

wow Mythos finally broke the METR graph

Adit Jain@aditjain1980·6d

@jxnlco less but better - one software which takes care of everything!

jason@jxnlco·6d

Do you want more software or better software?

Adit Jain@aditjain1980·6d

I am 10x more bullish on @tinkerapi than I was a couple weeks ago! Will share some experimentation soon!

Adit Jain@aditjain1980·6d

@GaryMarcus The 50% success rate is across trials: you can scale compute, run more trials, and then select the best of N. It's not efficient, but I think your point about reliability is moot if we can measurably improve it with more compute.
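A rough sketch of the best-of-N arithmetic behind the reply above. It assumes trials are independent with the same per-trial success rate p, and that a verifier can reliably pick out a successful trial when one exists; neither assumption is stated in the thread, and the function names are hypothetical. Under those assumptions, the chance that at least one of N trials succeeds is 1 - (1 - p)^N.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent trials succeeds."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n


def trials_needed(p: float, target: float) -> int:
    """Smallest n for which best_of_n_success(p, n) reaches the target."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("need 0 < p < 1 and 0 < target < 1")
    n = 1
    while best_of_n_success(p, n) < target:
        n += 1
    return n
```

With p = 0.5, three trials give 0.875 and `trials_needed(0.5, 0.99)` returns 7. If failures are correlated across trials, or if selecting the best trial is itself unreliable, the real gain from extra compute would be smaller than this.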

Gary Marcus@GaryMarcus·6d

Hot take on METR’s new graph that so many people are flipping about today.
• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…
• If you read the graph carefully, it is about achieving *50%* success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.
• If you read carefully, it is only about software tasks. Not general intelligence.
• It certainly doesn’t tell you that *most* (let alone all) things that humans can do in 16 hours can be done by Mythos, let alone reliably.
• Aside from this, the graph doesn’t show you *how* the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such, this is a vindication of neurosymbolic AI, but not a proof that LLMs themselves can be perpetually scaled. As such, it’s not a proof that another trillion dollars will continue the graph.
• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

METR@METR_Evals

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Adit Jain@aditjain1980·8 May

@jxnlco thank god someone did it!

jason@jxnlco·8 May

small ship / passion project, more details soon github.com/openai/openai-…
1. call responses via cli with all cloud tools
2. unix style structured outputs via cli
3. image gen/edit, transcription, tts
4. make projects and provision api keys
more docs soon

Adit Jain retweeted

Oliver@olvrgln·6 May

Frontier models not being able to one-shot ffmpeg and sqlite doesn’t validate your 1998 thesis on neural nets

Gary Marcus@GaryMarcus

Some things never change. If you don’t understand this one, you don’t understand what’s happening in AI. Marcus, 1998: neural nets have trouble generalizing far beyond the data. Marcus, 2001, 2012, 2019, 2022, etc: neural nets have trouble generalizing far beyond the data. Apple, 2025: neural nets have trouble generalizing far beyond the data. Meta/Stanford/Harvard, 2026: neural nets have trouble generalizing far beyond the data.

Adit Jain@aditjain1980·6 May

Wrapper on top of a wrapper on top of a...

Harvey@harvey

Introducing 500+ legal agents and a new Agent Builder in Harvey.

Adit Jain@aditjain1980·5 May

now that models can create and train on environments autonomously - is it safe to assume that any public benchmark will land in the training set relatively soon - and so contamination is unavoidable?

Adit Jain@aditjain1980·4 May

Wait till you find out about Robbins-Monro

sankalp@dejavucoder

when you finally understand how policy gradient works after going down the differentiation trenches and realising that the REINFORCE algorithm is literally the base form of policy gradient
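For readers following the exchange above, a minimal sketch of REINFORCE on a toy two-armed bandit. It assumes a softmax policy over per-arm logits and Bernoulli rewards; the function names, arm means, step count, and learning rate are all illustrative and not taken from the tweets. Each update is a noisy gradient step of the form theta <- theta + lr * estimate, which is the stochastic-approximation shape associated with Robbins-Monro.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means: List[float], steps: int = 2000,
                     lr: float = 0.1, seed: int = 0) -> List[float]:
    """Train a softmax policy with plain REINFORCE (no baseline).

    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Bernoulli reward with the chosen arm's success probability.
        reward = 1.0 if rng.random() < arm_means[action] else 0.0
        # d/d logit_k of log pi(action) = 1[k == action] - probs[k]
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_pi
    return softmax(logits)
```

Running `reinforce_bandit([0.1, 0.9])` should shift probability mass toward the second arm. Subtracting a baseline from the reward would reduce the variance of the update without changing its expectation.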
