Alexander Barry

55 posts

@AlexBarry4

Independent Statistical Consultant, https://t.co/fKwPM5dSGc Substack: https://t.co/760E1w9ol7

Joined January 2011
24 Following · 120 Followers
Pinned Tweet
Alexander Barry@AlexBarry4·
Interesting to work on this report with Epoch. We found that AI progress speeds have been accelerating since ~mid 2024 (on 3/4 of the metrics we considered). Treating reasoning models as a trendbreak made the best predictions, but not enough data to be very confident.
Epoch AI@EpochAIResearch

Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

Alexander Barry@AlexBarry4·
@YafahEdelman Or making the final bucket 10+ hours, which does let it pick up lower performance at the cost of a pretty small n:
Alexander Barry tweet media
Alexander Barry@AlexBarry4·
@YafahEdelman Here is my take, I found slightly different buckets more natural (although now we get the weird case where 6-36 hour performance is above 1-6 hours, but this is just what the data actually shows!)
Alexander Barry tweet media
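To make the bucketing discussed above concrete, here is a minimal sketch of pooling success rates by task-length bucket. The 1-6 hour and 6-36 hour buckets come from the thread; the outer bucket edges, the function names, and all task results are made up for illustration and are not METR data.

```python
from bisect import bisect_right

# Bucket edges in hours. The 1-6h and 6-36h buckets come from the thread;
# the outer buckets are assumptions for illustration.
EDGES = [0.0, 1.0, 6.0, 36.0]
LABELS = ["<1h", "1-6h", "6-36h", "36h+"]

def bucket_label(hours: float) -> str:
    """Return the label of the bucket containing a task (assumes hours >= 0)."""
    return LABELS[bisect_right(EDGES, hours) - 1]

def bucket_success_rates(tasks):
    """Pool attempts within each bucket.

    tasks: iterable of (human_hours, successes, attempts).
    Returns {label: (success_rate, number_of_tasks)} for non-empty buckets.
    """
    totals = {}
    for hours, successes, attempts in tasks:
        label = bucket_label(hours)
        s, a, n = totals.get(label, (0, 0, 0))
        totals[label] = (s + successes, a + attempts, n + 1)
    return {label: (s / a, n) for label, (s, a, n) in totals.items()}

# Hypothetical task results (made up, not METR data).
tasks = [(0.2, 8, 8), (0.5, 8, 8), (2.0, 8, 8), (4.0, 0, 8), (8.0, 8, 8), (20.0, 7, 8)]
rates = bucket_success_rates(tasks)
for label in LABELS:
    if label in rates:
        rate, n = rates[label]
        print(f"{label}: {rate:.0%} success over {n} tasks")
```

With only a couple of tasks per bucket, a longer bucket can easily come out above a shorter one, which is the small-n effect mentioned in the thread.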
Yafah Edelman@YafahEdelman·
The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:
Yafah Edelman tweet media
Alexander Barry@AlexBarry4·
I think (but am not totally sure) that while the METR TH results were based on an early checkpoint of Mythos Preview, the AECI results I used to estimate THs were based on the April 7th launch version. As per AISI's recent updates, the launch version seems notably stronger, so presumably its 80% TH would be higher than the early checkpoint's, but I'm not sure by how much.
Ryan Greenblatt@RyanPGreenblatt·
The actual 80% time horizon (on METR's task suite) is 3.1 hours so it looks like it was between my original guess of 2.5 hours and my updated guess of 3.5 hours.
Ryan Greenblatt@RyanPGreenblatt

I think this guess for 80% time horizon for Mythos Preview is probably somewhat too high, but I'm not confident. I originally guessed 2.5 hours (based on a quick and dirty extrapolation using the gap from Opus 4 to 4.6), but based on this I've updated to 3.5 hours.

Alexander Barry@AlexBarry4·
@YafahEdelman @xeophon Seems correct to me (although I'm not sure MirrorCode itself was a very big update for me; performance has always varied a lot across different tasks in the time horizon suite, see the messiness stuff in the original paper etc.)
Yafah Edelman@YafahEdelman·
@AlexBarry4 @xeophon Yeah it just made me update towards "specific details can matter a lot" in a way that adds a lot of uncertainty into how I interpret metr time horizons.
Alexander Barry@AlexBarry4·
@xeophon @YafahEdelman I'd assume the claim is that MirrorCode tasks are unrepresentative of most real tasks, and that the rest of the TH task suite might be similarly unrepresentative.
Florian Brand@xeophon·
@YafahEdelman This seems contradictory? MirrorCode showed that day/week/month long horizons are possible today
FleetingBits@fleetingbits·
@EpochAIResearch i'm a little confused - the software engineering eci on the linked page looks different from the software engineering eci in the above image? in particular, opus 4.7 has a higher swe-eci than gpt-5.5 in the above image, but it's swapped on the linked page?
FleetingBits tweet media
Epoch AI@EpochAIResearch·
We are launching domain-specific capability scores, tracking the capabilities of models across SWE and Math benchmarks, using the same scale as the general ECI. We also support customization for users who want to create their own variants of the ECI. Link below!
Epoch AI tweet media
Håvard Ihle@htihle·
With the increased support, I did more runs on previous models that only had 2 runs (due to cost). These runs (going up to 5 runs on every task) reduce the error bars and give us a better historical record.
- o1 (high) went from 43.8 -> 46.1%, reducing the weird discrepancy between o1 and o1-preview (47.6%) that had puzzled several people (including me). There is no significant difference now, due to the large error bars of o1-preview (which is no longer on the API).
- Opus 4.1 went from 42.8 -> 45.9%, and is now ahead of Opus 4.0, which stayed roughly the same at 43.7%.
- GPT 5.3 codex (xhigh) went from 79.3% -> 77.9%, now matching Opus 4.6 and GPT 5.4 (xhigh), although these two have large error bars.
- GPT 4 stayed the same at 12.4%, but with smaller error bars.
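As a rough illustration of why going from 2 to 5 runs tightens the error bars, here is a minimal sketch that assumes runs are independent and that the error bar is a standard error of the mean. The per-run standard deviation is a hypothetical value, and this is not necessarily how WeirdML computes its error bars.

```python
import math

def standard_error(per_run_sd: float, n_runs: int) -> float:
    """Standard error of the mean score over n independent runs."""
    return per_run_sd / math.sqrt(n_runs)

# Hypothetical per-run standard deviation in percentage points (not from the source).
per_run_sd = 3.0

se_2 = standard_error(per_run_sd, 2)
se_5 = standard_error(per_run_sd, 5)

# The ratio depends only on the run counts: sqrt(2 / 5).
shrink = se_5 / se_2
print(f"SE with 2 runs: {se_2:.2f}, with 5 runs: {se_5:.2f}, ratio: {shrink:.3f}")
```

Under these assumptions the error bar shrinks to roughly 63% of its previous width, whatever the per-run spread is.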
Håvard Ihle tweet media
Håvard Ihle@htihle

GPT 5.5 (xhigh) scores 84.9% on WeirdML taking the lead over 5.5 (high) 83.9%. Even (xhigh) is not using more than about 15k output tokens. Thanks to @METR_Evals for the increased support that allowed for this run. Opus 4.7 (max) soon, and more things in the pipeline.

Alexander Barry@AlexBarry4·
As the task-success-rate scatterplot shows, for most tasks LLMs either succeed 100% of the time (normally 8/8 attempts) or almost always fail, and the e.g. 50% time horizon comes from averaging over these. So I don't think accuracy/reliability maps to high-% time horizon numbers as cleanly as you imply here? E.g. at one point we considered releasing time horizon results that only counted an LLM as succeeding at a task if it completed it on every attempt, which IIRC didn't change the time horizon results very much (maybe a factor of 2-4x).

Assessing whether LLMs have many 9s of reliability is pretty difficult for long tasks, just because they are already expensive and time-consuming to run, so scaling up to run them e.g. 10,000 times would have much higher costs.

I haven't looked at the linked article in depth, but I'd be surprised if there weren't some senses in which reliability had improved substantially in the past 6 months or so (e.g. since Opus 4.5, models are very good at writing bash commands with correct syntax etc. and recovering if there are errors, in a way I don't think they could previously). To me this seems like an important part of what enabled coding agents to become so successful.
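To give a sense of scale for the "many 9s" point, here is a minimal sketch of how many consecutive successful runs are needed before a given reliability can be claimed. It assumes independent attempts with a fixed success probability and a one-sided 95% confidence level; the function names are illustrative only.

```python
import math

def runs_needed(reliability: float, confidence: float = 0.95) -> int:
    """Smallest n such that n successes out of n attempts rule out a true
    success rate below `reliability` at the given confidence level."""
    # If the true success rate were `reliability`, the chance of n straight
    # successes is reliability**n; require that to be at most 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def lower_bound_all_successes(n: int, confidence: float = 0.95) -> float:
    """Lower confidence bound on the success rate after n successes out of n."""
    return (1.0 - confidence) ** (1.0 / n)

for target in (0.99, 0.999, 0.9999):
    print(f"{target}: at least {runs_needed(target)} failure-free runs")

print(f"8/8 successes only shows a success rate above {lower_bound_all_successes(8):.2f}")
```

Under these assumptions, each extra 9 multiplies the number of required runs by about ten, which is why this gets expensive for multi-hour tasks.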
Krishna Kaasyap@krishnakaasyap·
I agree broadly, but accuracy can effectively resolve the agent reliability issue—both in my opinion and according to the paper by @sayashk and @random_walker. x.com/random_walker/… Even if systems stop at the 1-hour mark with 90% reliability (Mythos is at 1.1 hours with 90% reliability), increasing that reliability to 99% and then 99.9% will matter much more than extending the duration from 1 hour to 2 or 4. I believe METR (or for that matter, any quality evaluation organization) should focus on measuring 99% accuracy and reliability use cases—at least in computer science—rather than measuring 20-hour tasks at 50% accuracy. Accuracy breeds reliability, and reliability leads to diffusion and the reduction of capability overhang.
Krishna Kaasyap tweet media
Arvind Narayanan@random_walker

x.com/i/article/2026…

METR@METR_Evals·
We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.
METR tweet media
Alexander Barry@AlexBarry4·
I hadn't thought to do this, I'm surprised it was also such a straight line. Interestingly Mythos only ended up getting 85.2%, so perhaps it was being overestimated by the AECI (although obviously we would also not expect this relationship to hold forever, since it would need to cap at 100%)
James@_jrhm_·
@AlexBarry4 When fit to actual score instead of TH, Mythos is at 92%. Wild to think Mythos 2 could score perfectly on METR's task suite.
James tweet media
Alexander Barry@AlexBarry4·
I used Anthropic's internal ECI values from the Opus 4.7 model card to predict the METR Time Horizon values they would receive. This predicts Mythos will have a 50% TH of 40 hours, and Opus 4.7 19 hours. 80% THs are 5.5 and 2.5 hours respectively.
Alexander Barry tweet media
Alexander Barry@AlexBarry4·
Things get sensitive to modelling assumptions when looking at the very high % success rate time horizons, so I don't think this is a panacea (and noise in the task length estimates can have a big effect). I think the 80% TH is still a reasonable metric (but it could still easily vary 2x over reasonable modelling assumptions).
Krishna Kaasyap@krishnakaasyap·
@METR_Evals I still don’t think this eval is saturated! At an 80% success rate, Mythos is still under 4 hours. Well within task distribution. At a 99% success rate, Mythos is still under 5 fricking minutes! Long live the Task-Completion Time Horizons eval!
Krishna Kaasyap tweet media
Krishna Kaasyap tweet media
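A back-of-the-envelope sketch of how the 50%, 80%, and 99% horizons relate, assuming success probability follows a logistic curve in log2 of task length. The curve shape and the function names are assumptions here. The inputs are the 16-hour 50% horizon and 3.1-hour 80% horizon quoted elsewhere in this feed, treated as point estimates even though the 16-hour figure was reported as "at least".

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def slope_from_two_horizons(th50: float, th_p: float, p: float) -> float:
    """Logistic slope (log-odds per halving of task length) implied by the
    50% horizon and one other horizon at success rate p."""
    return logit(p) / math.log2(th50 / th_p)

def horizon(th50: float, beta: float, p: float) -> float:
    """Task length at which the assumed logistic curve predicts success rate p."""
    return th50 * 2.0 ** (-logit(p) / beta)

th50_hours = 16.0  # quoted 50% horizon, treated as a point estimate
th80_hours = 3.1   # quoted 80% horizon
beta = slope_from_two_horizons(th50_hours, th80_hours, 0.8)

for p in (0.5, 0.8, 0.9, 0.99):
    print(f"{p:.0%} horizon: {horizon(th50_hours, beta, p) * 60:.1f} minutes")
```

Because the horizon depends exponentially on logit(p) / beta, small changes in the fitted slope move the 99% horizon far more than the 80% one, which is one way to see the sensitivity to modelling assumptions mentioned above.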
Alexander Barry@AlexBarry4·
@_Suresh2 But there isn't any iron rule that these things need to have a straightforward relationship, especially one as simple as the ln(TH) = a+b*ECI that I use here. I'd expect different types of task would also behave differently.
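For readers who want to see what the ln(TH) = a + b*ECI relationship looks like in practice, here is a minimal least-squares sketch. The ECI and time-horizon values are hypothetical, chosen so that TH doubles every 5 ECI points; they are not the data behind the plots in this thread, and the function names are illustrative.

```python
import math

def fit_log_linear(eci, th_hours):
    """Ordinary least-squares fit of ln(TH) = a + b * ECI; returns (a, b)."""
    y = [math.log(t) for t in th_hours]
    n = len(eci)
    mean_x = sum(eci) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in eci)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(eci, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_th(a: float, b: float, eci_value: float) -> float:
    """Predicted time horizon in hours for a given ECI score."""
    return math.exp(a + b * eci_value)

# Hypothetical (ECI, 50% time horizon in hours) pairs, for illustration only.
eci = [130.0, 135.0, 140.0, 145.0, 150.0]
th_hours = [0.5, 1.0, 2.0, 4.0, 8.0]

a, b = fit_log_linear(eci, th_hours)
doubling_eci = math.log(2.0) / b  # ECI points per doubling of the time horizon

print(f"b = {b:.4f}, ECI points per TH doubling = {doubling_eci:.2f}")
print(f"Predicted TH at ECI 155: {predict_th(a, b, 155.0):.1f} hours")
```

Extrapolating beyond the fitted ECI range, as in the last line, relies on the log-linear form continuing to hold, which the tweet above explicitly cautions against taking for granted.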
Alexander Barry@AlexBarry4·
@_Suresh2 A 50% error (if you mean within 0.5x to 2x of the true result) isn't that surprising to me, since the TH values range over so many orders of magnitude. As shown on the plot, the 95% predictive interval for the OpenAI fit is in fact pretty much exactly 0.5x to 2x.
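A quick sketch of what a 0.5x to 2x predictive interval means on the log scale, assuming normally distributed residuals in ln(TH) and ignoring uncertainty in the fitted coefficients. The function name and the simplification are assumptions, not the method used for the plot.

```python
import math

def multiplicative_interval(sigma_log: float, z: float = 1.96):
    """Interval (low, high) as multiples of the prediction, given the
    residual standard deviation of ln(TH)."""
    factor = math.exp(z * sigma_log)
    return 1.0 / factor, factor

# Residual standard deviation of ln(TH) that yields a 0.5x to 2x 95% interval.
sigma_for_2x = math.log(2.0) / 1.96
low, high = multiplicative_interval(sigma_for_2x)

print(f"sigma on ln(TH): {sigma_for_2x:.3f} -> interval {low:.2f}x to {high:.2f}x")
```

Under these assumptions a residual spread of about 0.35 in ln(TH) is enough to produce a factor-of-two interval, which is small relative to time horizons spanning several orders of magnitude.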
Alexander Barry@AlexBarry4·
I used GPT 5.5's ECI values to predict the METR Time Horizon values it will receive. This predicts it will have a 50% time horizon of 10 hours, and an 80% time horizon of 1.6 hours. These are below my predictions for Opus 4.7 (but would beat the current best 80% TH).
Alexander Barry tweet media
Alexander Barry@AlexBarry4·
@ramez We didn't look at this, but how performance scales with inference compute is something I'd be very interested in examining in the future.
Ramez Naam@ramez·
@AlexBarry4 Hi, very late here, but a question: Did you test the reasoning models with reasoning turned off? Would love to understand how much of this is attributable to: 1. The base model 2. Base model x reasoning. Thanks
Alexander Barry@AlexBarry4·
These results are influenced by GPT 5.3-codex and GPT 5.4 having quite low time horizon values compared to their ECIs. This might be partially caused by an unusual amount of reward hacking attempts. Removing them gives a somewhat different fit:
Alexander Barry tweet media
nic@nicdunz·
nic tweet media
Alexander Barry@AlexBarry4·
@__nmca__ I'm assuming they don't have their own internal ECI variant? But once Epoch reports one I can convert it to TH (there is more noise in the official ECI -> TH conversion though).