Elizabeth Barnes

279 posts

@BethMayBarnes

Joined July 2014

383 Following · 3.2K Followers

Elizabeth Barnes@BethMayBarnes·30 Mar

@joel_bkr This doesn't seem obvious to me, cheaper reviews can also incorrectly reject good solutions.

Joel Becker@joel_bkr·11 Mar

think of graders as having time/$ budget for checking solutions. - it is trivial that having larger budget (maintainer vs algorithmic scoring) leads to weakly lower measured success. - how much lower is empirical question. - there are other possible points on budget curve.

Joel Becker@joel_bkr

new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.

Elizabeth Barnes@BethMayBarnes·13 Mar

@sayashk @METR_Evals I'd be very curious to see (a) human performance on your reliability metric, and (b) transcript examples for what randomly-selected failures look like for the best models, if you have those to hand.

Sayash Kapoor@sayashk·11 Mar

Hey @METR_Evals—love your work, but we think it's the *metric* that's saturated, not the task suite. For example, despite rapid gains in accuracy, we found limited gains in reliability. We'd love to work together to see if this holds up on the time-horizon task suite.

METR@METR_Evals

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

Elizabeth Barnes@BethMayBarnes·4 Mar

@_alyxya @METR_Evals It's only showing a subset of the points, see the full graph here: metr.org/time-horizons/

alyxya@_alyxya·4 Mar

@METR_Evals how is the line created? it doesn't look like it's the best fit line on a log plot, where the earlier data points are generally below the line while the later data points are generally above the line, so I would've expected a steeper slope to better fit the points

METR@METR_Evals·4 Mar

We're correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.

Elizabeth Barnes retweeted

Joel Becker@joel_bkr·24 Feb

our existing uplift study design is broken. devs decline to work without AI (participate in experiment, submit tasks for which they expect AI to be helpful, complete tasks randomized to no AI) leading us to underestimate uplift. x.com/METR_Evals/sta…

METR@METR_Evals

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

Elizabeth Barnes retweeted

Chris Painter@ChrisPainterYup·20 Feb

Our team is stretched thin at the moment! To continue upper-bounding the autonomy of AI agents, and developing evaluations for monitoring AI systems and their propensity to subvert human control, we need more great engineering and research staff. Please apply below or DM me!

METR@METR_Evals

Elizabeth Barnes@BethMayBarnes·17 Jan

@JerryWeiAI The hope here is not necessarily to *never* train a model with the dangerous capabilities - but you can use the nerfed model for most cases and have much higher security precautions for the dangerous model (e.g. stricter KYC, lower upload bandwidth, reduced employee access)

Elizabeth Barnes@BethMayBarnes·17 Jan

@JerryWeiAI Surprising - I’d have guessed you can reduce bioweapon capabilities by ~3yrs of progress or 30x time horizon while only hurting a few % of biomedical usage, by targeting a small number of the most relevant topics (e.g. specific pathogens, reverse genetics, and maybe cell culturing).

Jerry Wei@JerryWeiAI·16 Jan

An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn’t think it was the right fit. Here's why.

I'll first acknowledge a potential strength of this approach. If models simply didn't know much about dangerous topics, we wouldn't have to worry about people jailbreaking them or stealing model weights: they just wouldn't be able to help with dangerous topics at all. This is an appealing property that's hard to get with other safety approaches.

However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse).

Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it’s not clear to us that the reduced risk from chemical and biological weapons outweighs the benefits of models helping with beneficial life-sciences work).

Bottom line: filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models' ability to do their own research improves.*

Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easy to update against new jailbreaking attacks.

*The cost-benefit tradeoff could look pretty different for other misuse threats or misalignment threats though, so I wouldn't rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.

Elizabeth Barnes@BethMayBarnes·17 Jan

@JerryWeiAI I’d be very interested if you can say anything more about how you developed your classifier, how well you actually succeeded at removing the relevant content from the training data, and how you determined the impact on dangerous capabilities.

Elizabeth Barnes@BethMayBarnes·17 Jan

@JerryWeiAI I do think you’d need to get an expert to design your classifier, I don’t expect just “ask an LLM whether this is about bioweapons” to work that well.

Elizabeth Barnes@BethMayBarnes·20 Nov

Hm, I feel dubious without more info. Would want to see some transcripts and how they checked for cheating, and how much they iterated against the benchmark. That line shape smells suspicious to me somehow

Intology@IntologyAI

Introducing Locus: the first AI system to outperform human experts at AI R&D

Locus conducts research autonomously over multiple days and achieves superhuman results on RE-Bench given the same resources as humans, as well as SOTA performance on GPU kernel & ML engineering tasks.

RE-Bench is a collection of several frontier AI research tasks that typically take human experts (e.g., top ML PhDs and frontier lab researchers) several days. By scaling experimentation to far longer time horizons than previous systems, Locus represents a step change in AI scientist capabilities. 🧵

Elizabeth Barnes@BethMayBarnes·1 Nov

I think we also demonstrated value of independent review. E.g. see on page 12 (re training pressure on the CoT that disincentivized revealing harmful info): "There had been internal miscommunications about this topic that became clear only in response to questions from METR"

Elizabeth Barnes@BethMayBarnes·1 Nov

Props to Anthropic, a great precedent for transparency. Esp sharing sensitive details like compute scaleup calculations, expected performance early in RL. Plus employee interviews and any whistleblowing reports. Led to some useful discoveries + higher confidence in final report

METR@METR_Evals

We reviewed Anthropic’s unredacted report and agreed with its assessment of sabotage risks. We want to highlight the greater access & transparency into its redactions provided, which represent a major improvement in how developers engage with external reviewers. Reflections: 🧵

Elizabeth Barnes retweeted

Mike McCormick@MikeMcCormick_·24 Sep

Exactly two years ago, I launched @HalcyonFutures. So far we’ve seeded and launched 16 new orgs and companies, and helped them raise nearly a quarter billion dollars in funding.

Flash back to 2022: After eight years in VC, I stepped back to explore questions about exponential technology and the future of humanity. Then ChatGPT launched – we’d entered the exponential AI era.

The upside of AI is huge – but so are the risks: misalignment, loss of control, cyber and bio-threats, fraud, adversarial misuse, and more. For humanity to thrive in this era, we’ll need to build a resilient and secure world. And we’ll need an ecosystem of both nonprofit and for-profit solutions led by ambitious, thoughtful leaders from business, policy, academia and media.

I found myself asking:
1. How do we get the world’s most talented people working on these civilization-scale challenges?
2. What funding model would allow us to support many different types of projects?

With those questions in mind, I launched Halcyon. We’re building something a bit unusual:
- @HalcyonFutures, a nonprofit and grant fund that helps leaders and entrepreneurs pivot to ambitious, high-impact work.
- @HalcyonVC, a VC firm backing for-profit founders tackling the hardest problems in AI security and global resilience.

So far we’ve raised $25m in funding for Halcyon’s nonprofit and VC fund. We’ve incubated or provided zero-to-one capital to 16 nonprofits and companies, and helped them go on to raise more than $200m, from @GoodfireAI's interpretability work to @aiunderwriting's AI risk standards and insurance, @TransluceAI's model behavior research, and @SeismicOrg's public opinion research and media.

We're a small team of three (me, Shelby Summerfield and @rossmatican) surrounded by a community of advisors and allies that includes leaders from frontier labs, governments, cybersecurity, philanthropy, startups and VC.

Over the coming weeks we’ll share more about Halcyon and what we’re building next. If you're working on something we might be excited about, say hi. Visit our website below in comments ⬇️

Elizabeth Barnes@BethMayBarnes·24 Sep

To donate to METR, click here: metr.org/donate If you’d like to discuss giving with us first, or receive more information about our work for the purpose of informing a donation, reach out to giving@metr.org

Elizabeth Barnes@BethMayBarnes·24 Sep

The central constraint to our publishing more and better research, and scaling up our work monitoring the AI industry for catastrophic risk, is growing our team with excellent new researchers and engineers. And our recruiting is, to some degree, constrained by our fundraising.

Elizabeth Barnes@BethMayBarnes·24 Sep

METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.

Elizabeth Barnes retweeted

Dan Lahav@dan_lahav·17 Sep

Today I’m launching @Irregular (formerly Pattern Labs) with my friend and co-founder Omer Nevo: Irregular is the first frontier security lab. Our mission: protect the world in the era of increasingly capable and sophisticated AI systems.

Elizabeth Barnes@BethMayBarnes·15 Aug

@__nmca__ @DavidSKrueger Yep that is just a graph fuckup, top of CI should be at about 20% if you include only the 15 we thoroughly reviewed (but we skimmed a larger number and thought that probably none of them seemed mergeable)

Nat McAleese@__nmca__·15 Aug

@BethMayBarnes @DavidSKrueger you could have used a beta so that the CI wasn’t zero width fwiw
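
A rough sketch of the kind of interval this exchange is about, for the case of 0 mergeable PRs out of the 15 thoroughly reviewed. The thread does not say which method was used; the two closed-form bounds below (an exact Clopper-Pearson upper bound, and the upper quantile of a Beta posterior under a uniform prior) are illustrative assumptions, and the function names are hypothetical. Both formulas hold only when zero successes are observed.

```python
def clopper_pearson_upper_zero(n, alpha=0.05):
    """Two-sided exact upper bound on the success rate after 0 of n successes.

    With zero successes the bound p solves (1 - p)**n = alpha / 2.
    """
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


def beta_uniform_upper_zero(n, alpha=0.05):
    """Upper (1 - alpha/2) quantile of the Beta(1, n + 1) posterior.

    Beta(1, n + 1) is the posterior for 0 of n successes under a uniform
    Beta(1, 1) prior; its CDF is 1 - (1 - x)**(n + 1).
    """
    return 1.0 - (alpha / 2.0) ** (1.0 / (n + 1))


n_reviewed = 15
print(f"Clopper-Pearson upper bound: {clopper_pearson_upper_zero(n_reviewed):.3f}")
print(f"Beta(1, n+1) upper quantile: {beta_uniform_upper_zero(n_reviewed):.3f}")
```

Both bounds land a little above 0.2 for n = 15, in line with the "about 20%" figure in the reply, and neither collapses to a zero-width interval at a 0% observed rate.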

David Krueger 🦥 ⏸️ ⏹️ ⏪@DavidSKrueger·15 Aug

Wow. 0% PR success rate. Seasoned AI researchers are familiar with such "generalized Moravec's paradoxes" and so are suspicious of benchmarks that purport to measure progress towards AGI. The safety community has not yet internalized these lessons, but this is progress.

METR@METR_Evals

We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT. We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.
