Alexander Barry

55 posts

@AlexBarry4

Independent Statistical Consultant, https://t.co/fKwPM5dSGc Substack: https://t.co/760E1w9ol7

Joined January 2011

24 Following, 120 Followers

Pinned Tweet

Alexander Barry@AlexBarry4·17 Apr

Interesting to work on this report with Epoch. We found that AI progress speeds have been accelerating since ~mid 2024 (on 3/4 of the metrics we considered). Treating reasoning models as a trendbreak made the best predictions, but not enough data to be very confident.

Epoch AI@EpochAIResearch

Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

Alexander Barry@AlexBarry4·3d

@finmoorhouse Got me searching for a Claude image generation announcement haha

Fin Moorhouse@finmoorhouse·3d

Claude made this

Alexander Barry@AlexBarry4·4d

@YafahEdelman Or making the final bucket 10+ hours, which does let it pick up lower performance at the cost of a pretty small n:

Alexander Barry@AlexBarry4·4d

@YafahEdelman Here is my take, I found slightly different buckets more natural (although now we get the weird case where 6-36 hour performance is above 1-6 hours, but this is just what the data actually shows!)

Yafah Edelman@YafahEdelman·5d

The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:

Alexander Barry@AlexBarry4·4d

I think (but am not totally sure) that while the METR TH results were based on an early checkpoint of Mythos Preview, the AECI results I used to estimate THs were based on the April 7th launch version. As per AISI's recent updates, the launch version seems notably stronger, so presumably its 80% TH would be higher than the early checkpoint's, but I'm not sure by how much.

Benjamin Todd@ben_j_todd·6d

@RyanPGreenblatt Interesting it came in under this method (which predicted 5.5h): abstatisticalconsulting.substack.com/p/predicting-t… @AlexBarry4

Ryan Greenblatt@RyanPGreenblatt·9 May

The actual 80% time horizon (on METR's task suite) is 3.1 hours so it looks like it was between my original guess of 2.5 hours and my updated guess of 3.5 hours.

Ryan Greenblatt@RyanPGreenblatt

I think this guess for 80% time horizon for Mythos Preview is probably somewhat too high, but I'm not confident. I originally guessed 2.5 hours (based on a quick and dirty extrapolation using the gap from Opus 4 to 4.6), but based on this I've updated to 3.5 hours.

Alexander Barry@AlexBarry4·5d

@YafahEdelman @xeophon Seems correct to me (although I'm not sure MirrorCode itself was a very big update for me; performance has always varied a lot across different tasks in the time horizon suite, see the messiness stuff in the original paper etc.)

Yafah Edelman@YafahEdelman·5d

@AlexBarry4 @xeophon Yeah it just made me update towards "specific details can matter a lot" in a way that adds a lot of uncertainty into how I interpret metr time horizons.

Alexander Barry@AlexBarry4·5d

@xeophon @YafahEdelman I'd assume the claim is that MirrorCode tasks are unrepresentative of most real tasks, and that the rest of the TH task suite might similarly be unrepresentative.

Florian Brand@xeophon·5d

@YafahEdelman This seems contradictory? MirrorCode showed that day/week/month long horizons are possible today

Alexander Barry@AlexBarry4·6d

@fleetingbits @EpochAIResearch This is due to more benchmark results coming out since the original launch, which led to an increase in GPT 5.5's SWE-ECI

FleetingBits@fleetingbits·10 May

@EpochAIResearch i'm a little confused - the software engineering eci on the linked page looks different from the software engineering eci in the above image? in particular opus 4.7 has a higher swe-eci on the above image than gpt-5.5; but it's swapped on the links page?

Epoch AI@EpochAIResearch·6 May

We are launching domain-specific capability scores, tracking the capabilities of models across SWE and Math benchmarks, using the same scale as the general ECI. We also support customization for users who want to create their own variants of the ECI. Link below!

Alexander Barry@AlexBarry4·6d

@htihle Glad to see things getting less confusing!

Håvard Ihle@htihle·6d

With the increased support, I did more runs on previous models that only had 2 runs (due to cost). These runs (going up to 5 runs on every task) reduce the error bars and give us a better historical record.
- o1 (high) went from 43.8% -> 46.1%, reducing the weird discrepancy between o1 and o1-preview (47.6%) that had puzzled several people (including me). There is no significant difference now, given the large error bars of o1-preview (which is no longer on the API).
- Opus 4.1 went from 42.8% -> 45.9%, and is now ahead of Opus 4.0, which stayed roughly the same at 43.7%.
- GPT 5.3 codex (xhigh) went from 79.3% -> 77.9%, now matching Opus 4.6 and GPT 5.4 (xhigh), although these two have large error bars.
- GPT 4 stayed the same at 12.4%, but with smaller error bars.
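
A rough sketch of why going from 2 to 5 runs tightens the error bars, assuming the runs are independent and the error bar is a standard error of the mean. The per-run standard deviation below is a hypothetical number, not a WeirdML figure, and the benchmark's actual error-bar method may differ.

```python
import math

def standard_error(run_sd: float, n_runs: int) -> float:
    """Standard error of the mean over n independent runs."""
    return run_sd / math.sqrt(n_runs)

# Hypothetical run-to-run standard deviation in percentage points.
run_sd = 5.0

se_2 = standard_error(run_sd, 2)
se_5 = standard_error(run_sd, 5)

# The ratio depends only on the run counts: sqrt(2 / 5).
shrink = se_5 / se_2
print(f"error bar shrinks to {shrink:.0%} of its 2-run width")
```

Under these assumptions the shrink factor is the same whatever the per-run spread is, since it cancels in the ratio.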

Håvard Ihle@htihle

GPT 5.5 (xhigh) scores 84.9% on WeirdML taking the lead over 5.5 (high) 83.9%. Even (xhigh) is not using more than about 15k output tokens. Thanks to @METR_Evals for the increased support that allowed for this run. Opus 4.7 (max) soon, and more things in the pipeline.

Alexander Barry@AlexBarry4·11 May

As the task-success-rate scatterplot shows, for most tasks LLMs either succeed 100% of the time (normally 8/8 attempts) or almost always fail, and e.g. the 50% time horizon comes from averaging over these. So I don't think accuracy/reliability maps to high-% time horizon numbers as cleanly as you imply here? E.g. at one point we considered releasing time horizon results that only counted an LLM as succeeding at a task if it completed it on every attempt, which IIRC didn't change the time horizon results very much (maybe a factor of 2-4x).

Assessing whether LLMs have many 9s of reliability is pretty difficult for long tasks, just because they are already expensive and time-consuming to run, so scaling up to run them e.g. 10,000 times would have much higher costs.

I haven't looked at the linked article in depth, but I'd be surprised if there weren't some senses in which reliability had improved substantially in the past 6 months or so (e.g. since Opus 4.5, models are very good at writing bash commands with correct syntax etc. and recovering if there are errors, in a way I don't think they could previously). To me this seems like an important part of what enabled coding agents to become so successful.
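
A minimal sketch of how a time horizon at a given success rate can be read off a fitted success curve, assuming success probability is modelled as a logistic function of log2 task length. The parametrisation and the values of `alpha` and `beta` are hypothetical, chosen only for illustration, and are not fitted to any real results.

```python
import math

def horizon_at(p: float, alpha: float, beta: float) -> float:
    """Task length in hours at which the fitted success probability equals p.

    Assumed model: P(success) = sigmoid(alpha - beta * log2(length_hours)).
    """
    logit_p = math.log(p / (1 - p))
    return 2 ** ((alpha - logit_p) / beta)

# Hypothetical curve parameters, not taken from any published fit.
alpha, beta = 1.0, 0.5

th50 = horizon_at(0.5, alpha, beta)
th80 = horizon_at(0.8, alpha, beta)
th99 = horizon_at(0.99, alpha, beta)

print(f"50% horizon: {th50:.2f} h")
print(f"80% horizon: {th80:.2f} h")
print(f"99% horizon: {th99:.5f} h")
```

In this parametrisation a shallower slope (smaller `beta`) pushes the high-% horizons down much faster than the 50% horizon, which is one way the very high success-rate numbers become sensitive to modelling choices.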

Krishna Kaasyap@krishnakaasyap·11 May

I agree broadly, but accuracy can effectively resolve the agent reliability issue—both in my opinion and according to the paper by @sayashk and @random_walker. x.com/random_walker/… Even if systems stop at the 1-hour mark with 90% reliability (Mythos is at 1.1 hours with 90% reliability), increasing that reliability to 99% and then 99.9% will matter much more than extending the duration from 1 hour to 2 or 4. I believe METR (or for that matter, any quality evaluation organization) should focus on measuring 99% accuracy and reliability use cases—at least in computer science—rather than measuring 20-hour tasks at 50% accuracy. Accuracy breeds reliability, and reliability leads to diffusion and the reduction of capability overhang.

Arvind Narayanan@random_walker

x.com/i/article/2026…

METR@METR_Evals·9 May

We evaluated an early version of Claude Mythos Preview for risk assessment during a limited window in March 2026. We estimated a 50%-time-horizon of at least 16hrs (95% CI 8.5hrs to 55hrs) on our task suite, at the upper end of what we can measure without new tasks.

Alexander Barry@AlexBarry4·11 May

I hadn't thought to do this, I'm surprised it was also such a straight line. Interestingly Mythos only ended up getting 85.2%, so perhaps it was being overestimated by the AECI (although obviously we would also not expect this relationship to hold forever, since it would need to cap at 100%)

James@_jrhm_·22 Apr

@AlexBarry4 When fit to actual score instead of TH, Mythos is at 92%. Wild to think Mythos 2 could score perfectly on METR's task suite

Alexander Barry@AlexBarry4·22 Apr

I used Anthropic's internal ECI values from the Opus 4.7 model card to predict the METR Time Horizon values they would receive. This predicts Mythos will have a 50% TH of 40 hours, and Opus 4.7 19 hours. 80% THs are 5.5 and 2.5 hours respectively.

Alexander Barry@AlexBarry4·11 May

Things get sensitive to modelling assumptions when looking at the very high % success rate time horizons, so I don't think it is a panacea (and noise in the task length estimates can have a big effect). I think the 80% TH is still a reasonable metric (but could still easily vary 2x over reasonable modelling assumptions)

Krishna Kaasyap@krishnakaasyap·9 May

@METR_Evals I still don’t think this eval is saturated! At an 80% success rate, Mythos is still under 4 hours. Well within task distribution. At a 99% success rate, Mythos is still under 5 fricking minutes! Long live the Task-Completion Time Horizons eval!

Alexander Barry@AlexBarry4·1 May

@_Suresh2 But there isn't any iron rule that these things need to have a straightforward relationship, especially one as simple as the ln(TH) = a+b*ECI that I use here. I'd expect different types of task would also behave differently.
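
A minimal sketch of the ln(TH) = a + b*ECI relationship mentioned above, fitted by ordinary least squares on log time horizons. The helper names and the data points are made up for illustration; they are not real ECI or METR values, and the fitting procedure used in the original post may differ.

```python
import math

def fit_log_linear(eci, th_hours):
    """Ordinary least squares fit of ln(TH) = a + b * ECI. Returns (a, b)."""
    ys = [math.log(t) for t in th_hours]
    n = len(eci)
    mean_x = sum(eci) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in eci)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(eci, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_th(a, b, eci_value):
    """Predicted time horizon in hours for a given ECI score."""
    return math.exp(a + b * eci_value)

# Made-up illustrative points lying exactly on a log-linear trend.
eci = [100.0, 110.0, 120.0, 130.0]
th_hours = [0.1, 0.4, 1.6, 6.4]

a, b = fit_log_linear(eci, th_hours)
predicted = predict_th(a, b, 140.0)
print(f"a={a:.3f}, b={b:.4f}, predicted TH at ECI 140: {predicted:.1f} h")
```

Because the fit is linear in ln(TH), a fixed step in ECI multiplies the predicted time horizon by a constant factor (here 4x per 10 points, by construction of the toy data).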

Alexander Barry@AlexBarry4·1 May

@_Suresh2 A 50% error (if you mean within 0.5x to 2x of the true result) isn't that surprising to me, since the TH values range over so many orders of magnitude. As shown on the plot, the 95% predictive interval for the OpenAI fit is in fact pretty much exactly 0.5x to 2x.
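
A small sketch of why an interval on a log-scale fit shows up as a multiplicative band such as 0.5x to 2x. It assumes normally distributed residuals in ln(TH) and ignores uncertainty in the fitted coefficients, which is a simplification; the residual standard deviation is back-solved here for illustration rather than taken from the plot.

```python
import math

def multiplicative_interval(sigma_log: float, z: float = 1.96):
    """Approximate 95% band as multiples of the point prediction.

    sigma_log is the residual standard deviation of ln(TH).
    """
    half_width = z * sigma_log
    return math.exp(-half_width), math.exp(half_width)

# Residual spread that would give exactly a 0.5x to 2x band (illustrative).
sigma = math.log(2) / 1.96

low, high = multiplicative_interval(sigma)
print(f"sigma in ln(TH): {sigma:.3f}")
print(f"band: {low:.2f}x to {high:.2f}x")
```

So a 0.5x to 2x band corresponds to a residual standard deviation of roughly 0.35 in ln(TH) under these assumptions.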

Alexander Barry@AlexBarry4·30 Apr

I used GPT 5.5's ECI values to predict the METR Time Horizon values it will receive. This predicts it will have a 50% time horizon of 10 hours, and an 80% time horizon of 1.6 hours. These are below my predictions for Opus 4.7 (but would beat the current best 80% TH).

Alexander Barry@AlexBarry4·1 May

@ramez We didn't look at this, but looking at how performance scales with inference compute is something I'd be very interested in looking at in the future.

Ramez Naam@ramez·1 May

@AlexBarry4 Hi, very late here, but a question: Did you test the reasoning models with reasoning turned off? Would love to understand how much of this is attributable to: 1. The base model 2. Base model x reasoning. Thanks

Alexander Barry@AlexBarry4·30 Apr

Read my full post: abstatisticalconsulting.substack.com/p/predicting-g…

Alexander Barry@AlexBarry4·30 Apr

These results are influenced by GPT 5.3-codex and GPT 5.4 having quite low time horizon values compared to their ECIs. This might be partially caused by an unusual amount of reward hacking attempts. Removing them gives a somewhat different fit:

Alexander Barry@AlexBarry4·25 Apr

@nicdunz Y axis starts at 1 not 0 chart crime

nic@nicdunz·25 Apr

Alexander Barry@AlexBarry4·24 Apr

@__nmca__ I'm assuming they don't have their own internal ECI variant? But once Epoch report one I can convert it to TH (there is more noise in the official ECI -> TH conversion though)

Nat McAleese@__nmca__·23 Apr

@AlexBarry4 want to add gpt5.5?

Nat McAleese@__nmca__·22 Apr

i think this is a great approach: ECI units are uninterpretable and metr eval saturated, so just report eci in METR units!

Alexander Barry@AlexBarry4
