Oliver Daniels

504 posts

Oliver Daniels

Oliver Daniels

@Oliver_ADK

PhD Student @UMassAmherst, and MATS. married to @annasdaniels

Massachusetts, USA Katılım Ağustos 2012
494 Takip Edilen168 Takipçiler
Sabitlenmiş Tweet
Oliver Daniels
Oliver Daniels@Oliver_ADK·
Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find black-box and white-box auditing methods can be fooled by strategic deception prompts:
Oliver Daniels tweet media
English
1
4
14
682
Oliver Daniels retweetledi
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique which translates activations to text, similar to activation oracles) described it as "cleanup to avoid detection," and the overall plan “malicious.” (5/14)
Jack Lindsey tweet media
English
10
21
690
90.1K
Gideon Futerman
Gideon Futerman@GFuterman·
My guess is that AI Safety people are still over-indexed on the mid-2025 political situation. 2026 is a different beast entirely.
English
6
5
44
4.8K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@ESRogs @GFuterman I think the tweet is helpful. I expect the key thing is that politicians and the public are becoming more AGI pilled (e.g. Bernie Sanders), and being AGI pilled makes people way more pro-regulation
English
1
0
5
72
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@FazlBarez I don't think this war was about liberating Iranians, but it clearly wasn't about oil... (Trump himself is confused / incoherent about this, but I think its best to model him as using oil as a kind of fake pragmatic justification).
English
0
0
1
35
Oliver Daniels retweetledi
Ryan Greenblatt
Ryan Greenblatt@RyanPGreenblatt·
AIs are much better at easy-and-cheap-to-verify SWE tasks than I expected: I've seen AIs autonomously do perhaps 3-12 months of useful work on such tasks. I've ~doubled my probability of full AI R&D automation by EOY 2028 (from ~15 to ~30%). Post explaining my updates in thread:
Ryan Greenblatt tweet media
English
14
23
312
69.6K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@boazbaraktcs might dispute mandatory attendance as the right kind of self-selection (e.g. why go to boring intro lecture on "Modern LLM Training" when I could be working on my project) i like 3 and 4 though
English
0
0
1
1.3K
Boaz Barak
Boaz Barak@boazbaraktcs·
Tempted to announce that my AI safety course will: 1. Have mandatory attendance. 2. Projects will be expected to be research paper quality. 3. Won't satisfy any departmental requirements 4. No one will get more than A- to get right kind of self selection. (3&4 would be new)
English
17
2
223
33K
Aaron Bergman 🔍 ⏸️ (in that order)
Some kid survived being fully under water (ice cold) for ~2.5 hours (and possibly up to 3) bc he basically got cryogenically almost-frozen (so brain oxygen needs plummeted) Pretty insane
Aaron Bergman 🔍 ⏸️ (in that order) tweet media
English
26
69
1K
74.5K
Oliver Daniels retweetledi
Rob Bensinger ⏹️
Rob Bensinger ⏹️@robbensinger·
Message I sent to my family about the time-sensitive opportunity to maybe cheaply escape natural death this month: As a heads up: Some of my friends are signing up for a new procedure that can be used to chemically put the brain and body in deep freeze and potentially revive you later. It's something I'd generally recommend for older people (e.g. 70+) and terminally ill people. The tech doesn't exist today to revive people, but it seems as though enough information is preserved in the brain that medical technology will eventually advance to the point of enabling revival. (Assuming humanity doesn't destroy itself first, anyway.) I'd put this in the category of "if it weren't new and it weren't weird / outside-the-box, it would probably be standard-of-care as a last line of resort for people who medical science can't otherwise save". There are plenty of other medical procedures that are similarly risky or experimental, but that buy you far fewer years of healthy lifespan if they succeed. The biggest risks and downsides, from my perspective, are: (a) The company doing this, Nectome, is new and untested, and might turn out to be incompetent or dysfunctional in some not-yet-obvious way. (b) If it takes medical technology a long time to reach the point of being able to revive people, then Nectome might stop existing first, or some natural disaster might occur, etc. to damage or destroy the bodies. (c) Nectome only does preservation with advance notice, so you're out of luck if you pass away in a sudden accident. Some more info: - A write-up on Nectome, plus some high-quality discussion (from people I broadly respect) in the comments: [LW link] - A more general (and fun) write-up on this whole approach to end-of-life care: [@waitbutwhy link] (note that this is a ten-year-old post, and the tech was worse at the time). Per [Nectome link], Nectome's preservation services normally cost $250,000, but until April 30 they're doing a pre-sale where you can buy a $20,000 card that makes the procedure cheaper the longer you wait to use it. E.g., if you pass away in 10+ years the total cost is just the flat $20,000; if it's in 6-7 years, it's $20,000 plus an additional $90,000; etc. The card can be freely transferred at any time to anyone who needs these services, so you could potentially buy several and give them to friends and family as needed. Overall: weird stuff, but weird and neglected innovations like these are sometimes where the biggest surprises turn up. I don't think this is a super safe or ironclad bet, but I'd guess it's worth the cost if you generally care a lot about your lifespan and healthspan.
Rob Bensinger ⏹️ tweet media
English
9
9
138
9.7K
Oliver Daniels retweetledi
Oliver Habryka
Oliver Habryka@ohabryka·
> Also a lot of people grasping for new things to worry about: "mesaoptimizers" I am so confused, mesa-optimization has been at the top of the list for why AI alignment is hard/might kill everyone for like more than a decade. Agree on the general trend though, but this feels like a very weird example.
English
2
1
55
1.3K
Ryan Carey
Ryan Carey@ryancareyai·
Absolutely, views in the AI x-risk community are gradually diluting toward "AI is a big deal". One example from January: x.com/davidad/status… Also a lot of people grasping for new things to worry about: "mesaoptimizers", "gradual disempowerment", permanent dictatorship.
David Pinsof@DavidPinsof

Is it just me or has AI doomerism gradually transitioned from "AI will literally kill us all" to "AI will cause bad things to happen / Humans will do stupid things with AI / AI will cause huge changes." If so, this is a very positive development.

English
10
0
29
5.7K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@RichardMCNgo Huh maybe I'm overestimating how confused we are about chimp politics. But like aligned to goodness vs aligned to an agent seem like distinct properties that most people know are distinct and we've just overloaded the term alignment.
English
0
0
2
37
Richard Ngo
Richard Ngo@RichardMCNgo·
@Oliver_ADK we are about as confused about LLM alignment as we are about chimpanzee political affiliation. Maybe more. E.g. people are still very confused about whether alignment is a one-place or a two-place predicate (aligned vs aligned to X).
English
1
0
1
72
Richard Ngo
Richard Ngo@RichardMCNgo·
Imagine if the whole field of primatology were focused on figuring out which primates were politically progressive. E.g. whenever chimpanzees fought, researchers would try to map their conflict onto human political divides. This is, alas, roughly analogous to current AI safety.
English
8
5
161
8.1K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@RichardMCNgo I think I disagree with the analogy (i.e. LLM alignment is a property that makes more sense to test then chimpanzee political affiliation), but agree that we need more fundamental understanding and less contrived scary demos.
English
1
0
3
55
Richard Ngo
Richard Ngo@RichardMCNgo·
@Oliver_ADK can’t tell if you’re agreeing or disagreeing. But “chimpanzees are political moderates” is not a good null hypothesis. (Nor is using null hypotheses at all a very good scientific methodology.)
English
1
0
2
142
1a3orn
1a3orn@1a3orn·
It's kinda depressing that there are like two dozen "emergent misalignment" papers, but literally zero on how to try to reproduce what went right with Opus 3. Like correct me if I'm wrong, but I know of zero.
English
8
11
294
14.9K
Andrew Gordon Wilson
Andrew Gordon Wilson@andrewgwils·
Alec Radford (and others behind GPT, let's not forget there were other authors) deserve credit. Conventional wisdom said it shouldn't work well. It didn't work well. They got brutal feedback: stop wasting time and money. But they persisted and the results were truly mindblowing.
English
6
12
294
44.9K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@andrewgwils I buy this, but I think the whole "keep your door open" from Hamming also applies to the internet. You get more done without it, but what you do is less important.
English
1
0
9
595
Andrew Gordon Wilson
Andrew Gordon Wilson@andrewgwils·
I did an experiment a couple years ago where I completely unplugged from all computer technology for two weeks. After a couple days of fomo and withdrawal, I've never felt better or more deeply focused, since the 1990s.
English
2
6
135
11.6K
Oliver Daniels
Oliver Daniels@Oliver_ADK·
+1 on longer evaluation horizons to better evaluate research taste. But beyond the length of the fellowship, I think the orientation / culture should shift more towards "virtuous" research - do what you think is important, and resist misaligned incentives.
Ryan Kidd@ryan_kidd44

In 2026, AI safety orgs/teams are more constrained by senior talent than ever, which is exacerbated by AI automation. There is an abundance of junior talent, but not enough capacity to harness and mentor.

English
0
0
2
95
Oliver Daniels
Oliver Daniels@Oliver_ADK·
@ohabryka @AaronBergman18 @mattyglesias "You got those things via trades that benefitted everyone involved" does not fully capture how the market economy works (addictive products, market power to extract a large majority of surplus, negative externalities, etc. )
English
2
0
4
91
Oliver Habryka
Oliver Habryka@ohabryka·
Why would that be immoral! You got those things via trades that benefitted everyone involved. You almost certainly helped the world vastly more than people who are much less rich. God, I find this attitude of “it’s immoral to have resources that you don’t spend on altruism” attitude extremely frustrating.
English
2
0
10
464
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
A lot of the MIRI work assumes strong consequentialism. Alex Mallen/Redwood's behavioral selection model says "behaviors which are not maximally rewarded will be selected out." I think Richard's scale-free agency framework is the thing that come closest to this?
English
2
0
1
179
Tim Hua 🇺🇦
Tim Hua 🇺🇦@Tim_Hua_·
We don’t have good alignment theory for path dependent worlds. I have a fairly strong intuition that the alignment for AIs at the point of no return is path dependent, but existing frameworks mostly work in the limits of intelligence/consequentialism/optimization power.
English
4
0
25
1.9K