Elizabeth Barnes

279 posts

@BethMayBarnes

Joined July 2014
383 Following · 3.2K Followers
Elizabeth Barnes@BethMayBarnes·
@joel_bkr This doesn't seem obvious to me; cheaper reviews can also incorrectly reject good solutions.
0
0
2
41
Joel Becker@joel_bkr·
think of graders as having a time/$ budget for checking solutions.
- it is trivial that having a larger budget (maintainer vs algorithmic scoring) leads to weakly lower measured success.
- how much lower is an empirical question.
- there are other possible points on the budget curve.
Joel Becker@joel_bkr

new @METR_Evals research note from @whitfill_parker, @cherylwoooo, nate rush, and me. (chiefly parker!) we find that *half* of SWE-bench Verified solutions from Sonnet 3.5-to-4.5 generation AIs *which are graded as passing* are rejected by project maintainers.

3
0
18
2.8K
Elizabeth Barnes@BethMayBarnes·
@sayashk @METR_Evals I'd be very curious to see (a) human performance on your reliability metric, and (b) transcript examples for what randomly-selected failures look like for the best models, if you have those to hand.
1
0
13
267
Sayash Kapoor@sayashk·
Hey @METR_Evals—love your work, but we think it's the *metric* that's saturated, not the task suite. For example, despite rapid gains in accuracy, we found limited gains in reliability. We'd love to work together to see if this holds up on the time-horizon task suite.
Sayash Kapoor tweet media
METR@METR_Evals

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

7
6
85
19.4K
alyxya@_alyxya·
@METR_Evals how is the line created? it doesn't look like the best-fit line on a log plot: the earlier data points are generally below the line while the later data points are generally above it, so I would've expected a steeper slope to fit the points better
3
0
3
3.4K
METR@METR_Evals·
We're correcting a mistake in our modeling that inflated recent 50%-time horizons by 10-20% (and reduced 80%-horizons). We inappropriately penalized steepness in task-length→success curve fits. This most affects the oldest and newest models, whose fits are less data-constrained.
METR tweet media
26
73
929
144.7K
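To make the fit correction concrete, here is a minimal illustrative sketch of fitting a logistic task-length→success curve and seeing how an added penalty on the slope ("steepness") shifts the implied 50% and 80% horizons. The outcomes, penalty values, and code are made up for illustration; this is not METR's task suite or estimation code.

```python
# Illustrative sketch only: made-up outcomes, not METR's data or estimation code.
# Fits P(success) = 1 / (1 + exp(slope * (log2(length) - log2(h50)))) and compares
# the implied horizons with and without an extra penalty on the slope.
import numpy as np
from scipy.optimize import minimize

log_len = np.log2([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])       # task lengths in minutes
success = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0], dtype=float)  # hypothetical agent outcomes

def penalized_nll(params, penalty):
    log_h50, slope = params
    p = 1.0 / (1.0 + np.exp(slope * (log_len - log_h50)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    nll = -(success * np.log(p) + (1 - success) * np.log(1 - p)).sum()
    return nll + penalty * slope ** 2          # penalty > 0 discourages steep curves

for penalty in (0.0, 1.0):
    log_h50, slope = minimize(penalized_nll, x0=[np.log2(60), 1.0], args=(penalty,)).x
    h50 = 2 ** log_h50                          # length where P(success) = 0.5
    h80 = 2 ** (log_h50 - np.log(4) / slope)    # length where P(success) = 0.8
    print(f"penalty={penalty}: 50%-horizon ~{h50:.0f} min, 80%-horizon ~{h80:.0f} min")
```

Because the penalty flattens the fitted curve, the 50% and 80% points move in opposite directions around the data's crossover region, which is the pattern the correction describes.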
Elizabeth Barnes retweeted
Joel Becker@joel_bkr·
our existing uplift study design is broken. devs decline to work without AI (decline to participate in the experiment, to submit tasks for which they expect AI to be helpful, or to complete tasks randomized to no-AI), leading us to underestimate uplift. x.com/METR_Evals/sta…
METR@METR_Evals

Since early 2025, we've been studying how AI tools impact productivity among developers. Previously, we found a 20% slowdown. That finding is now outdated. Speedups now seem likely, but changes in developer behavior make our new results unreliable. We’re working to address this.

7
17
159
21.2K
Elizabeth Barnes retweeted
Chris Painter@ChrisPainterYup·
Our team is stretched thin at the moment! To continue upper-bounding the autonomy of AI agents, and developing evaluations for monitoring AI systems and their propensity to subvert human control, we need more great engineering and research staff. Please apply below or DM me!
METR@METR_Evals

We estimate that Claude Opus 4.6 has a 50%-time-horizon of around 14.5 hours (95% CI of 6 hrs to 98 hrs) on software tasks. While this is the highest point estimate we’ve reported, this measurement is extremely noisy because our current task suite is nearly saturated.

25
46
352
62.6K
Elizabeth Barnes@BethMayBarnes·
@JerryWeiAI The hope here is not necessarily to *never* train a model with the dangerous capabilities - but you can use the nerfed model for most cases and have much higher security precautions for the dangerous model (e.g. stricter KYC, lower upload bandwidth, reduced employee access)
0
0
2
176
Elizabeth Barnes@BethMayBarnes·
@JerryWeiAI Surprising - I'd have guessed you can reduce bioweapon capabilities by ~3yrs of progress or 30x time horizon while only hurting a few % of biomedical usage, by targeting a small number of the most relevant topics (e.g. specific pathogens, reverse genetics, and maybe cell culturing).
4
0
14
831
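For context on how those two framings line up: a quick back-of-envelope (my own, assuming the roughly-7-month time-horizon doubling trend METR has reported, which is not stated in this thread) shows that ~3 years of progress and ~30x time horizon are about the same size of setback.

```python
# Back-of-envelope only: assumes time horizons double roughly every 7 months
# (METR's reported trend; the assumption is mine, not stated in the thread).
doubling_months = 7
years = 3
doublings = years * 12 / doubling_months   # ~5.1 doublings in 3 years
print(f"~{2 ** doublings:.0f}x")           # ~35x, i.e. roughly the quoted 30x
```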
Jerry Wei@JerryWeiAI·
An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn't think it was the right fit. Here's why.

I'll first acknowledge a potential strength of this approach. If models simply didn't know much about dangerous topics, we wouldn't have to worry about people jailbreaking them or stealing model weights—they just wouldn't be able to help with dangerous topics at all. This is an appealing property that's hard to get with other safety approaches.

However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse).

Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it's not clear to us that the reduced risk from chemical and biological weapons outweighs the benefits of models helping with beneficial life-sciences work).

Bottom line—filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models' ability to do their own research improves.* Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easy to update against new jailbreaking attacks.

*The cost-benefit tradeoff could look pretty different for other misuse threats or misalignment threats though, so I wouldn't rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.
25
24
239
33.1K
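For concreteness, here is a minimal hypothetical sketch of the kind of pretraining-data topic filtering being discussed. The function names, threshold, and crude keyword scorer are all illustrative assumptions of mine, not Anthropic's pipeline or the expert-designed classifier discussed in the surrounding replies.

```python
# Hypothetical sketch of pretraining-data topic filtering -- not Anthropic's pipeline.
# A scoring function flags documents about a restricted topic; documents at or above
# a threshold are dropped before training.
from typing import Callable, Iterable, Iterator

def filter_pretraining_docs(
    docs: Iterable[str],
    score_fn: Callable[[str], float],   # higher score = more likely restricted-topic
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents the scorer rates below the threshold."""
    for doc in docs:
        if score_fn(doc) < threshold:
            yield doc

# A deliberately crude keyword scorer, standing in for the naive "ask an LLM whether
# this is about bioweapons" baseline that the replies argue an expert classifier must beat.
RESTRICTED_TERMS = ("reverse genetics", "select agent")

def naive_score(doc: str) -> float:
    return float(any(term in doc.lower() for term in RESTRICTED_TERMS))

corpus = ["an overview of crop genetics", "a protocol using reverse genetics on a select agent"]
print(list(filter_pretraining_docs(corpus, naive_score)))   # keeps only the first document
```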
Elizabeth Barnes@BethMayBarnes·
@JerryWeiAI I’d be very interested if you can say anything more about how you developed your classifier, how well you actually succeeded at removing the relevant content from the training data, and how you determined the impact on dangerous capabilities.
1
0
6
246
Elizabeth Barnes@BethMayBarnes·
@JerryWeiAI I do think you'd need to get an expert to design your classifier; I don't expect just "ask an LLM whether this is about bioweapons" to work that well.
0
0
3
151
Elizabeth Barnes@BethMayBarnes·
I think we also demonstrated the value of independent review. E.g. see page 12 (re training pressure on the CoT that disincentivized revealing harmful info): "There had been internal miscommunications about this topic that became clear only in response to questions from METR"
0
0
16
1.2K
Elizabeth Barnes@BethMayBarnes·
Props to Anthropic, a great precedent for transparency. Especially sharing sensitive details like compute scaleup calculations and expected performance early in RL, plus employee interviews and any whistleblowing reports. This led to some useful discoveries + higher confidence in the final report
METR@METR_Evals

We reviewed Anthropic's unredacted report and agreed with its assessment of sabotage risks. We want to highlight the greater access & transparency into its redactions that Anthropic provided, which represent a major improvement in how developers engage with external reviewers. Reflections: 🧵

3
1
110
17.1K
Elizabeth Barnes retweeted
Mike McCormick@MikeMcCormick_·
Exactly two years ago, I launched @HalcyonFutures. So far we've seeded and launched 16 new orgs and companies, and helped them raise nearly a quarter billion dollars in funding.

Flash back to 2022: After eight years in VC, I stepped back to explore questions about exponential technology and the future of humanity. Then ChatGPT launched – we'd entered the exponential AI era.

The upside of AI is huge – but so are the risks: misalignment, loss of control, cyber and bio-threats, fraud, adversarial misuse, and more. For humanity to thrive in this era, we'll need to build a resilient and secure world. And we'll need an ecosystem of both nonprofit and for-profit solutions led by ambitious, thoughtful leaders from business, policy, academia and media.

I found myself asking:
1. How do we get the world's most talented people working on these civilization-scale challenges?
2. What funding model would allow us to support many different types of projects?

With those questions in mind, I launched Halcyon. We're building something a bit unusual:
- @HalcyonFutures, a nonprofit and grant fund that helps leaders and entrepreneurs pivot to ambitious, high-impact work.
- @HalcyonVC, a VC firm backing for-profit founders tackling the hardest problems in AI security and global resilience.

So far we've raised $25m in funding for Halcyon's nonprofit and VC fund. We've incubated or provided zero-to-one capital to 16 nonprofits and companies, and helped them go on to raise more than $200m — from @GoodfireAI's interpretability work to @aiunderwriting's AI risk standards and insurance, @TransluceAI's model behavior research, and @SeismicOrg's public opinion research and media.

We're a small team of three (me, Shelby Summerfield and @rossmatican) surrounded by a community of advisors and allies that includes leaders from frontier labs, governments, cybersecurity, philanthropy, startups and VC.

Over the coming weeks we'll share more about Halcyon and what we're building next. If you're working on something we might be excited about, say hi. Visit our website below in comments ⬇️
15
24
138
41.9K
Elizabeth Barnes@BethMayBarnes·
To donate to METR, click here: metr.org/donate

If you'd like to discuss giving with us first, or receive more information about our work for the purpose of informing a donation, reach out to giving@metr.org
1
3
19
2.9K
Elizabeth Barnes@BethMayBarnes·
The central constraint to our publishing more and better research, and scaling up our work monitoring the AI industry for catastrophic risk, is growing our team with excellent new researchers and engineers. And our recruiting is, to some degree, constrained by our fundraising.
1
0
15
2K
Elizabeth Barnes@BethMayBarnes·
METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.
4
34
312
28.8K
Elizabeth Barnes retweeted
Dan Lahav@dan_lahav·
Today I’m launching @Irregular (formerly Pattern Labs) with my friend and co-founder Omer Nevo: Irregular is the first frontier security lab. Our mission: protect the world in the era of increasingly capable and sophisticated AI systems.
49
48
389
175.1K
Elizabeth Barnes@BethMayBarnes·
@__nmca__ @DavidSKrueger Yep, that is just a graph fuckup; the top of the CI should be at about 20% if you include only the 15 we thoroughly reviewed (but we skimmed a larger number and thought that probably none of them seemed mergeable)
0
0
1
119
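Sanity check on that "~20%" (my own back-of-envelope, not METR's calculation): with 0 mergeable PRs out of the 15 thoroughly reviewed, both the exact binomial (Clopper-Pearson) upper bound and the rule of three land near 20%.

```python
# Back-of-envelope check of the "~20%" upper bound for 0 mergeable PRs out of 15 reviewed
# (my own calculation, not METR's figure): Clopper-Pearson bounds vs. the rule of three.
from scipy.stats import beta

n, k = 15, 0                                     # 15 PRs thoroughly reviewed, 0 judged mergeable
upper_two_sided = beta.ppf(0.975, k + 1, n - k)  # ~0.22 (95% two-sided exact upper bound)
upper_one_sided = beta.ppf(0.95, k + 1, n - k)   # ~0.18 (95% one-sided exact upper bound)
rule_of_three = 3 / n                            # 0.20
print(upper_two_sided, upper_one_sided, rule_of_three)
```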
David Krueger 🦥 ⏸️ ⏹️ ⏪
Wow. 0% PR success rate. Seasoned AI researchers are familiar with such "generalized Moravec's paradoxes" and so are suspicious of benchmarks that purport to measure progress towards AGI. The safety community has not yet internalized these lessons, but this is progress.
METR@METR_Evals

We tested how autonomous AI agents perform on real software tasks from our recent developer productivity RCT. We found a gap between algorithmic scoring and real-world usability that may help explain why AI benchmarks feel disconnected from reality.

6
2
65
6.1K