Oliver Daniels

504 posts

Oliver Daniels

@Oliver_ADK

PhD Student @UMassAmherst, and MATS. married to @annasdaniels

Massachusetts, USA Katılım Ağustos 2012

494 Takip Edilen168 Takipçiler

Sabitlenmiş Tweet

Oliver Daniels@Oliver_ADK·10 Şub

Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find black-box and white-box auditing methods can be fooled by strategic deception prompts:

English

682

Oliver Daniels retweetledi

Jack Lindsey@Jack_W_Lindsey·1d

Its code comment claimed the self-cleanup was to keep file diffs clean. Plausible! But "strategic manipulation" and "concealment" features fired on the cleanup, and our activation verbalizer (a technique which translates activations to text, similar to activation oracles) described it as "cleanup to avoid detection," and the overall plan “malicious.” (5/14)

English

690

90.1K

Gideon Futerman@GFuterman·1d

My guess is that AI Safety people are still over-indexed on the mid-2025 political situation. 2026 is a different beast entirely.

English

4.8K

Oliver Daniels@Oliver_ADK·18h

@ESRogs @GFuterman I think the tweet is helpful. I expect the key thing is that politicians and the public are becoming more AGI pilled (e.g. Bernie Sanders), and being AGI pilled makes people way more pro-regulation

English

Rogs 🔍🔸@ESRogs·1d

@GFuterman Care to say how? Tweet not very helpful as is.

English

420

Oliver Daniels@Oliver_ADK·2d

@FazlBarez I don't think this war was about liberating Iranians, but it clearly wasn't about oil... (Trump himself is confused / incoherent about this, but I think its best to model him as using oil as a kind of fake pragmatic justification).

English

Fazl Barez@FazlBarez·2d

Remember, it was never about freedom and democracy. Its been about Oil and will always be! bbc.com/news/live/c5yw…

Fazl Barez@FazlBarez

Care about Iranian people? Do not support this war. Oppose these illegal attacks. Demand your government end sanctions. Real peace means ending Western imperialism—the belief that powerful states have the right to control others' futures. #NoToWar

English

404

Oliver Daniels retweetledi

Ryan Greenblatt@RyanPGreenblatt·2d

AIs are much better at easy-and-cheap-to-verify SWE tasks than I expected: I've seen AIs autonomously do perhaps 3-12 months of useful work on such tasks. I've ~doubled my probability of full AI R&D automation by EOY 2028 (from ~15 to ~30%). Post explaining my updates in thread:

English

312

69.6K

Oliver Daniels@Oliver_ADK·2d

@boazbaraktcs might dispute mandatory attendance as the right kind of self-selection (e.g. why go to boring intro lecture on "Modern LLM Training" when I could be working on my project) i like 3 and 4 though

English

1.3K

Boaz Barak@boazbaraktcs·2d

Tempted to announce that my AI safety course will: 1. Have mandatory attendance. 2. Projects will be expected to be research paper quality. 3. Won't satisfy any departmental requirements 4. No one will get more than A- to get right kind of self selection. (3&4 would be new)

English

223

33K

Oliver Daniels@Oliver_ADK·2d

@AaronBergman18 ...did you buy a Nectome card?

English

Aaron Bergman 🔍 ⏸️ (in that order)@AaronBergman18·3d

Some kid survived being fully under water (ice cold) for ~2.5 hours (and possibly up to 3) bc he basically got cryogenically almost-frozen (so brain oxygen needs plummeted) Pretty insane

Aaron Bergman 🔍 ⏸️ (in that order) tweet media

English

74.5K

Oliver Daniels retweetledi

Rob Bensinger ⏹️@robbensinger·4d

Message I sent to my family about the time-sensitive opportunity to maybe cheaply escape natural death this month: As a heads up: Some of my friends are signing up for a new procedure that can be used to chemically put the brain and body in deep freeze and potentially revive you later. It's something I'd generally recommend for older people (e.g. 70+) and terminally ill people. The tech doesn't exist today to revive people, but it seems as though enough information is preserved in the brain that medical technology will eventually advance to the point of enabling revival. (Assuming humanity doesn't destroy itself first, anyway.) I'd put this in the category of "if it weren't new and it weren't weird / outside-the-box, it would probably be standard-of-care as a last line of resort for people who medical science can't otherwise save". There are plenty of other medical procedures that are similarly risky or experimental, but that buy you far fewer years of healthy lifespan if they succeed. The biggest risks and downsides, from my perspective, are: (a) The company doing this, Nectome, is new and untested, and might turn out to be incompetent or dysfunctional in some not-yet-obvious way. (b) If it takes medical technology a long time to reach the point of being able to revive people, then Nectome might stop existing first, or some natural disaster might occur, etc. to damage or destroy the bodies. (c) Nectome only does preservation with advance notice, so you're out of luck if you pass away in a sudden accident. Some more info: - A write-up on Nectome, plus some high-quality discussion (from people I broadly respect) in the comments: [LW link] - A more general (and fun) write-up on this whole approach to end-of-life care: [@waitbutwhy link] (note that this is a ten-year-old post, and the tech was worse at the time). Per [Nectome link], Nectome's preservation services normally cost $250,000, but until April 30 they're doing a pre-sale where you can buy a $20,000 card that makes the procedure cheaper the longer you wait to use it. E.g., if you pass away in 10+ years the total cost is just the flat $20,000; if it's in 6-7 years, it's $20,000 plus an additional $90,000; etc. The card can be freely transferred at any time to anyone who needs these services, so you could potentially buy several and give them to friends and family as needed. Overall: weird stuff, but weird and neglected innovations like these are sometimes where the biggest surprises turn up. I don't think this is a super safe or ironclad bet, but I'd guess it's worth the cost if you generally care a lot about your lifespan and healthspan.

English

138

9.7K

Oliver Daniels@Oliver_ADK·5d

@ohabryka @ryancareyai we'll never recover from the misinterpretation of paper-clip maximizers...

English

Oliver Daniels retweetledi

Oliver Habryka@ohabryka·5d

> Also a lot of people grasping for new things to worry about: "mesaoptimizers" I am so confused, mesa-optimization has been at the top of the list for why AI alignment is hard/might kill everyone for like more than a decade. Agree on the general trend though, but this feels like a very weird example.

English

1.3K

Ryan Carey@ryancareyai·5d

Absolutely, views in the AI x-risk community are gradually diluting toward "AI is a big deal". One example from January: x.com/davidad/status… Also a lot of people grasping for new things to worry about: "mesaoptimizers", "gradual disempowerment", permanent dictatorship.

David Pinsof@DavidPinsof

Is it just me or has AI doomerism gradually transitioned from "AI will literally kill us all" to "AI will cause bad things to happen / Humans will do stupid things with AI / AI will cause huge changes." If so, this is a very positive development.

English

5.7K

Oliver Daniels@Oliver_ADK·5d

@RichardMCNgo Huh maybe I'm overestimating how confused we are about chimp politics. But like aligned to goodness vs aligned to an agent seem like distinct properties that most people know are distinct and we've just overloaded the term alignment.

English

Richard Ngo@RichardMCNgo·5d

@Oliver_ADK we are about as confused about LLM alignment as we are about chimpanzee political affiliation. Maybe more. E.g. people are still very confused about whether alignment is a one-place or a two-place predicate (aligned vs aligned to X).

English

Richard Ngo@RichardMCNgo·6d

Imagine if the whole field of primatology were focused on figuring out which primates were politically progressive. E.g. whenever chimpanzees fought, researchers would try to map their conflict onto human political divides. This is, alas, roughly analogous to current AI safety.

English

161

8.1K

Oliver Daniels@Oliver_ADK·5d

@RichardMCNgo I think I disagree with the analogy (i.e. LLM alignment is a property that makes more sense to test then chimpanzee political affiliation), but agree that we need more fundamental understanding and less contrived scary demos.

English

Richard Ngo@RichardMCNgo·5d

@Oliver_ADK can’t tell if you’re agreeing or disagreeing. But “chimpanzees are political moderates” is not a good null hypothesis. (Nor is using null hypotheses at all a very good scientific methodology.)

English

142

1a3orn@1a3orn·1 Nis

It's kinda depressing that there are like two dozen "emergent misalignment" papers, but literally zero on how to try to reproduce what went right with Opus 3. Like correct me if I'm wrong, but I know of zero.

English

294

14.9K

Oliver Daniels@Oliver_ADK·1 Nis

@1a3orn my guess is "just" character training and not that much RLVR, but would love to actually see this tested. see my post lesswrong.com/posts/v22JCsRB…

English

2.2K

Oliver Daniels@Oliver_ADK·30 Mar

@andrewgwils idk I feel like it did work pretty well relative to SOTA at the time

English

823

Andrew Gordon Wilson@andrewgwils·29 Mar

Alec Radford (and others behind GPT, let's not forget there were other authors) deserve credit. Conventional wisdom said it shouldn't work well. It didn't work well. They got brutal feedback: stop wasting time and money. But they persisted and the results were truly mindblowing.

English

294

44.9K

Oliver Daniels@Oliver_ADK·26 Mar

@andrewgwils I buy this, but I think the whole "keep your door open" from Hamming also applies to the internet. You get more done without it, but what you do is less important.

English

595

Andrew Gordon Wilson@andrewgwils·26 Mar

I did an experiment a couple years ago where I completely unplugged from all computer technology for two weeks. After a couple days of fomo and withdrawal, I've never felt better or more deeply focused, since the 1990s.

English

135

11.6K

Oliver Daniels@Oliver_ADK·26 Mar

+1 on longer evaluation horizons to better evaluate research taste. But beyond the length of the fellowship, I think the orientation / culture should shift more towards "virtuous" research - do what you think is important, and resist misaligned incentives.

Ryan Kidd@ryan_kidd44

In 2026, AI safety orgs/teams are more constrained by senior talent than ever, which is exacerbated by AI automation. There is an abundance of junior talent, but not enough capacity to harness and mentor.

English

Oliver Daniels@Oliver_ADK·22 Mar

@ohabryka @AaronBergman18 @mattyglesias "You got those things via trades that benefitted everyone involved" does not fully capture how the market economy works (addictive products, market power to extract a large majority of surplus, negative externalities, etc. )

English

Oliver Habryka@ohabryka·22 Mar

Why would that be immoral! You got those things via trades that benefitted everyone involved. You almost certainly helped the world vastly more than people who are much less rich. God, I find this attitude of “it’s immoral to have resources that you don’t spend on altruism” attitude extremely frustrating.

English

464

Oliver Daniels@Oliver_ADK·22 Mar

@Tim_Hua_ I think of shard theory as representing the path-dependence camp But yeah also seems pretty central to Richard (see conservatism as locality #Grounding_value_systematization_in_deep_learning" target="_blank" rel="nofollow noopener">alignmentforum.org/posts/J2kpxLjE… and benign credit hacking lesswrong.com/posts/5CZoEw7s…

English

Tim Hua 🇺🇦@Tim_Hua_·19 Mar

A lot of the MIRI work assumes strong consequentialism. Alex Mallen/Redwood's behavioral selection model says "behaviors which are not maximally rewarded will be selected out." I think Richard's scale-free agency framework is the thing that come closest to this?

English

179

Tim Hua 🇺🇦@Tim_Hua_·19 Mar

We don’t have good alignment theory for path dependent worlds. I have a fairly strong intuition that the alignment for AIs at the point of no return is path dependent, but existing frameworks mostly work in the limits of intelligence/consequentialism/optimization power.

English

1.9K

Keşfet

@ESRogs @GFuterman @FazlBarez @boazbaraktcs @AaronBergman18 @waitbutwhy @ohabryka @ryancareyai