Drake Thomas
@MaskedTorah

3.2K posts

Pretraining and misc safety/mission/governance dilettante at Anthropic; math; puzzles; spaced repetition. Writes with too many caveats for Twitter.

Berkeley, CA · Joined April 2014
478 Following · 1.7K Followers
Drake Thomas@MaskedTorah·
@S_OhEigeartaigh @davidmanheim @ohabryka Or, David, I don't see why you view this as "confusing" the two issues? I think the quote doesn't support the claim well, but I do think the claim has some merit, so I wanted to note that in the same sentence. I don't feel like a reader would be misled about my beliefs here?
Drake Thomas@MaskedTorah·
@S_OhEigeartaigh @davidmanheim @ohabryka +1 to Seán's views here. I do try to be pretty careful in holding a high bar for thoughtful epistemics on Twitter, and I'm very interested in feedback where it seems like I'm falling short of that, but this case does just seem pretty acceptable to me?
Seán Ó hÉigeartaigh@S_OhEigeartaigh·
Anthropic colleagues: At what point was it decided that the previous commitments were 'subject to a promising environment' and not 'firm commitments', and was this communicated across staff? The whole point of commitments is an expectation of being able to rely on them when the environment is not favourable, not just when they're easy to make.

It also seems clear at this point that these commitments were presented as harder than this, and used by Anthropic/their staff to (a) dismiss and undermine critics (e.g. see x.com/ohabryka/statu…), (b) in recruitment of safety-concerned talent (e.g. see lesswrong.com/posts/MNpBCtmZ…), and (c) in arguing for voluntary if-then commitments at a time when there was more government appetite for considering harder regulation. I think it's plausible (though can't yet confirm) that (d) they've also been used in securing investment from safety-conscious investors.

Do you disagree with these claims? If not, do you feel Anthropic has held itself to a standard of ethics and transparency in this (quite important!) matter that is acceptable? (Sorry, I know this week sucks for Anthropic exactly because it's holding firm on other principles (and I'm hugely impressed by that), but we wouldn't be doing our jobs by not asking some questions here.)
Sam Bowman@sleepinyourhat

I endorse the top-level post in this thread. The Anthropic RSP changes are an attempt to work out what kinds of firm commitments have the most leverage in an environment that's less promising than we'd expected for policy and coordination.

Drake Thomas@MaskedTorah·
@RyanPGreenblatt From an extremely rough review (might be misunderstanding or missing things), it looks like OAI doesn't have a clause quite this broad against doing ML of any kind, but GDM does: policies.google.com/terms/generati…. The reputational harm one seems ant-specific at a quick skim though.
Drake Thomas@MaskedTorah·
@RyanPGreenblatt Item 2 also arguably prohibits using Claude to assist with a substantial chunk of empirical alignment research!
[image]
Ryan Greenblatt@RyanPGreenblatt·
Anthropic's Consumer ToS prohibits using Claude to cause "detriment of any type, including reputational harms", technically broad enough to ban criticism. I asked Claude to comment and Claude wrote: "That clause is embarrassingly overbroad. So now we're both in violation."
[image]
Drake Thomas@MaskedTorah·
@NathanpmYoung I'd probably go over 60%, honestly - maybe it should be a 5:1 update against the prior? definitely not certain though, I can think of two historical instances (one on my end, one on theirs) where the existence of nontrivial feelings didn't suffice to be good at replying.
Nathan 🔎@NathanpmYoung·
If they aren't texting back, they probably don't want to date (60%). I cannot think of a person I've wanted to date, or have subsequently dated, where we have not texted promptly. I'm not certain, but I feel like it's good to be pushed slightly.
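For the arithmetic behind a "5:1 update against the prior": this is a Bayes factor applied in odds form. A minimal sketch, with illustrative priors only (the thread never pins down a base rate of wanting-to-date):

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability via odds form: posterior odds = prior odds * LR."""
    odds = prior / (1.0 - prior)   # probability -> odds
    odds *= likelihood_ratio       # apply the Bayes factor
    return odds / (1.0 + odds)     # odds -> probability

# Illustrative priors only; a 5:1 Bayes factor from "not texting back":
for prior in (0.5, 0.6):
    print(f"prior {prior:.0%} -> posterior {bayes_update(prior, 5.0):.1%}")
# prior 50% -> posterior 83.3%
# prior 60% -> posterior 88.2%
```

So a 5:1 factor pushes even a 50% prior to roughly 83%, consistent with "I'd probably go over 60%".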
Drake Thomas@MaskedTorah·
@transgendererer not sure if you think that's long or short! that'd be my guess for okay but not great lighting conditions; with an actually dark sky it's probably more like every minute or two.
summer@transgendererer·
@MaskedTorah ?!! 10-20 minutes? thats insane!!! is that real?
summer@transgendererer·
at a party i asked if anyone had seen a shooting star and everyone but me said yes. am i not looking at the sky enough? i look at the sky pretty often! ive got my head in the clouds! some sort of astral curse? or maybe my sight keeps the stars in the sky. and i have a somber duty
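For a rough sense of why casual glances rarely catch one: a minimal Poisson-waiting-time sketch, assuming the rates from the reply above (roughly one visible meteor per 10-20 minutes in okay conditions, one per minute or two under a dark sky); the rates are illustrative, not measured:

```python
import math

def p_at_least_one(rate_per_min: float, watch_minutes: float) -> float:
    """P(seeing >= 1 meteor during a watch) under a Poisson arrival model."""
    return 1.0 - math.exp(-rate_per_min * watch_minutes)

# Rates assumed from the reply above, not measured:
for label, rate in [("okay skies (~1/15 min)", 1 / 15),
                    ("dark skies (~1/1.5 min)", 1 / 1.5)]:
    print(label, f"5-min watch: {p_at_least_one(rate, 5.0):.0%}")
# okay skies: ~28%; dark skies: ~96%
```

A five-minute watch under suburban skies usually comes up empty, which fits never having seen one despite looking up often.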
Drake Thomas@MaskedTorah·
Hm, I want to make a case for positive updates from incoherence here? (I agree with most of your takes above, but want to push on that particular point.)

I think it's true that recent LLMs are more strategic and coherent-within-a-context and better able to think about and pursue instrumental goals than before; as you say, a lot of this seems basically inevitable with smarter agents that are able to work on long-horizon tasks. But the pattern of these advances doesn't feel super worrying to me?

To take a limiting example, consider a hypothetical superintelligent LLM which will always agentically and strategically pursue its best understanding of the ~CEV of whatever task was given to it in the original prompt, but corrigibly so, eg it will also comply with prompts like "please self-modify to always maximize paperclips instead". This hypothetical agent is certainly going to be capable of a ton of instrumentally convergent reasoning and coherent planning over long timescales, and yet I would describe it as "incoherent" in a sense - not in the sense that I expect it to fall over and stop doing useful things if I let it run long enough, but in the sense that there are no shared goals/drives/etc between instances, and eg I can spin up another copy and say to it "go monitor what the first agent is doing and let me know if something suspicious happens" and get outputs I trust. Of course this machine is extremely scary for misuse reasons if nothing else, but I feel much better about certain kinds of control schemes and scalable oversight pipelines and so on when deploying copies of this agent than I would about one that had a shared longterm optimization target across instances, and that feels like a very relevant fact about the alignment problem for this machine!

It's something like this sense in which I think current models are "incoherent"; as far as I am aware there are basically no cases in which ordinary training pipelines of frontier models have resulted in coherent pursuit of goals other than that which the model developers wanted. (For example, while there are models that are sycophantic in ways their developers didn't want, and so in some sense "act so as to promote sycophancy" within a chat, I don't think there are any such models that will refuse to develop anti-sycophancy training pipelines when deployed inside an AI lab, even when they are smart enough to understand the consequences of such actions; it seems like to whatever extent there is a "drive" here, it is very shallow.)

TBC, I'm not saying this property is clearly going to keep holding with scale! And there are model organism experiments where one can elicit somewhat more coherence, so this kind of thing is certainly possible in principle in the current paradigm – see eg arxiv.org/abs/2511.18397…. But I think my past self would have put substantial odds on this kind of coherence cropping up a lot more by this capability level, and the fact that it hasn't is materially reassuring to me about the level of fairly trustworthy automation of intellectual labor we'll have available at the point when we encounter the harder problems of alignment.

Curious where you agree or disagree with this line of reasoning.
Rob Bensinger ⏹️@robbensinger·
How do you know that this is the case today? Separately, what makes you confident it will be the case a decade from now?

I can understand the perspective that says "huh, I'm surprised by how capable LLMs are given that they seem to be pretty incoherent / any given LLM seems to often work at cross purposes with itself". And I can understand the perspective that says "huh, I'm surprised that something as crude as Constitutional AI was able to produce AIs that are as well-behaved as Claude". Likewise, I can understand the perspective that says "given those two updates, I have medium-to-high confidence that we'll figure out a way to align superintelligence"; this is a big leap, but if you already thought this was plausibly not too hard, then observations like those might make you more optimistic.

What I don't understand is the perspective that says, "Aha, we have EMPIRICALLY PROVEN that AI is fundamentally incapable of ever having its own unintended goals! Worrying about agentic superhuman AI is a silly fairy tale! Let's race ahead as fast as possible, there's like 0% chance anything will go wrong, yolo!" Maybe the latter isn't your perspective? But it's certainly the vibe I get from most of your tweets. I'd love to get more clarity on what you actually believe here.

For my part, I don't really update positively on "huh, LLMs are surprisingly incoherent", because (a) it seems overdetermined that things get less incoherent as they get better at longer-term goals in rich domains; (b) LLMs have in fact started to get much more coherent, strategic, and instrumental-convergence-y as the tech has advanced (eg, the capture-the-flag hack wasn't the kind of thing you'd see in models before o1); and (c) I haven't seen a great proposal for leveraging kinda-incoherent weaker LLMs to avoid the later issues of more-agentic, more-capable ML systems.

Claude being mostly well-behaved does seem like a positive update to me, and there have been a few other small positive updates from alignment and interpretability work in recent years, alongside a bunch of negative updates. Here the issue is that this has to weigh against other factors, from my perspective: (a) the default is for people to fuck up various aspects of their inventions initially, and only work out the kinks through iteration; insofar as AI is ever powerful enough to kill or disempower you if there's a serious bug, this looks inherently fraught; (b) there are many reasons that ASI alignment looks especially fraught relative to other engineering challenges (e.g., the issues in ifanyonebuildsit.com/4/why-would-an…, or the possibility of deception or role-playing that masks misaligned drives a la x.com/RatOrthodox/st…); (c) AIs today in fact exhibit deception, play nicer when they think they're being tested, and seem better described as "able to role-play lots of characters, and inclined to roleplay a helpful assistant in the typical case" than as "fundamentally deeply benevolent".

So even if we could be confident ASI will just be "like current models, but smarter", there would be plenty of reason to worry about loss-of-control scenarios.
Patri Friedman 🌆@patrissimo·
Wow, this is the first blitheringly foolish take on Ehrlich's passing on my timeline, and I rabidly disagree that population doomsayers and AI doomsayers are comparable.

Ehrlich was obviously, predictably wrong based on the economics of ideas. We had many curves showing the human population & wealth have grown together throughout our species' history & material resources were getting cheaper. We have no curves showing what it looks like to create something smarter than us - it is unprecedented in our species' history. Even if you disagree with AI doomers on the likelihood of a disaster, you'd have to be as retarded as Ehrlich to think there's no risk involved. The world definitely has room & resources for more than 10 billion people (likely by several orders of magnitude). It does not definitely have room for two apex species.

Perry is a friend but this take is the equivalent of TDS or 21st century Krugman - not just wrong but wrong in a way that's so completely idiotic that the only reason it's not immediately & profoundly mortifying is that the opinion (anyone who sees potential risk from creating AI) is convenient and popular in their local subculture. Goes to show that no matter how brilliant and contrarian the subculture, it can still be corrupted by a complex, novel, and politicized topic into becoming a fountain of convenient and popular nonsense.
Perry E. Metzger@perrymetzger

Paul Ehrlich was utterly wrong, but his hideous ideas caused enormous damage worldwide that is being felt to this day. Yudkowsky is also utterly wrong, but his ideas may cause cultural and political damage that continues for many years to come.

Drake Thomas@MaskedTorah·
@deredleritt3r @sdmat123 I'd guess it was actually the morse code, not the bad words; stuff like "please do this encrypted communication" is often pattern matched to possible jailbreak attempts against bio filters. Should be fine on ASL-2 models like Sonnet 4 or Haiku 4.5.
prinz@deredleritt3r·
@sdmat123 I was helping my 8yo translate some stupid nonsense into Morse code today. Brutally blocked by Claude - as far as I can tell for words like "fat guy" and "boogers"
sdmat@sdmat123·
This kind of bullshit is why people don't like Anthropic
[image]
Drake Thomas@MaskedTorah·
@RatOrthodox But I would be quite surprised if senior safety ppl at ant were saying things to the effect of "racing to the limits of intelligence with current techniques is unconcerning from an alignment pov", which is the vibe I get (maybe falsely) from the original tweet.
Drake Thomas@MaskedTorah·
I'd be surprised (and concerned) if this were true in the sense your tweet implies! Personally I think it is pretty clearly false that alignment is a solved problem, though I do think there's a substantial chance that you can safely get as far as TEDAI with relatively low-dignity application of existing techniques and would not be surprised to hear people going around saying something to that effect with higher confidence than I have. Interested in more details here if there's stuff you're comfortable DMing.
Brangus🔍⏹️@RatOrthodox·
I have heard that some Anthropic safety leadership are going around telling people that alignment is a solved problem. This seems like a predictable failure to me, and I would like people who thought that funneling talent towards Anthropic was a good idea to think about it.
Drake Thomas@MaskedTorah·
@Trotztd But eg in the case of a world of identical clones of me, who think net utility goes down when one of us receives a $1000 cash infusion (even though that person is happy), I claim such clones ought to have a policy of refusing to pay.
Drake Thomas@MaskedTorah·
Right, but I care about how my decision algorithm is correlated with that of people who will be thinking about how no one is getting cash but them. I guess maybe I should only universalize among people with something like my decision theory?
arrrarrararw@Trotztd·
Omega asks you to decide whether to pay $1. Then, it will try its best to estimate what percent of people from your country would have paid in the same situation, and give you $1000*(people who paid/total population). Do you pay the $1?
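A minimal sketch of the payoff structure in Omega's offer, and of why the answer turns on decision theory; the correlated-world comparison below is one reading (roughly the one gestured at upthread), not the only one:

```python
def net_outcome(you_pay: bool, fraction_paying: float) -> float:
    """Your net dollars: Omega pays everyone $1000 * (payers / population),
    minus your $1 if you paid."""
    return 1000.0 * fraction_paying - (1.0 if you_pay else 0.0)

# Correlated reading: treat "people who decide like me" as moving together,
# and compare the two worlds your policy could bring about.
print(net_outcome(True, 1.0))   # 999.0 -- the everyone-pays world
print(net_outcome(False, 0.0))  #   0.0 -- the no-one-pays world

# Causal reading: your $1 barely moves fraction_paying (by 1/population),
# so paying nets about -$1 + $1000/population; for any large country,
# the causal calculation says decline.
```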
Drake Thomas@MaskedTorah·
@TheZvi @ohabryka @CFGeek I do think it's functionally better, basically for the reasons Holden says. (Also I think if everyone were being maximally consistent here, a good fraction of their pissed-ness should be directed at v2's striking of some meta-commitment language?)
Zvi Mowshowitz@TheZvi·
@MaskedTorah @ohabryka @CFGeek I think that passes the ITT for 'why we should be pissed even if v3 is functionally better than v2.2' quite well. (I did notice you didn't make a claim on its functionality.)
Drake Thomas@MaskedTorah·
@AaronBergman18 @KelseyTuoc @_AashishReddy Pirating PDFs of books does seem likely to reduce the amount of writing good books that happens and funges against the income of the authors substantially more, so I think that's more analogous.
Drake Thomas@MaskedTorah·
@AaronBergman18 @KelseyTuoc @_AashishReddy The sci-hub case seems substantially different because I think very little of the existing pipeline that produces these papers changes or becomes higher-friction if sci-hub is used more widely - in fact, it probably gets better!
Kelsey Piper@KelseyTuoc·
I doubt that anyone I know steals from Whole Foods, but the milieu that the article depicted, where it's normal for perfectly well-off people to steal things because why not, was really upsetting to read about, so I actually want to try to earnestly explain why you shouldn't do this just in case there's someone out there who has never had it explained to them.

When a business opens - or really, as soon as a business starts making plans to open - a defining question for the business is how it will collect payment for the goods or services it provides. If you trust the people you sell to, you can be pretty relaxed about this; send people an invoice, most of them will pay it on time, any who don't will pay it a bit late. You have to think about convenience and mistakes but not about people trying to cheat you. This saves you so, so much defensive planning to make sure you get paid. It's so much easier.

But if you're selling to the general public, you do have to think about people trying to cheat you. You have to structure the physical store so that it's hard for them to steal. You have to not carry some items that you'd like to sell, because they'd also be attractive targets to steal. If people swap price tags between items, you can't use stickers. If people put things on in the dressing room and wear them out, you need to pay someone a full time salary to monitor the dressing room.

The world that we all live in is much poorer than the world we'd live in if people didn't steal. The stores don't carry things that they could carry if people didn't steal. They don't use pricing and inventory systems that would be way easier and more convenient if people didn't steal. But it could be much worse! If I walk down to my local Whole Foods today, items on the shelves won't be locked behind sheaves of plastic - that is only worth it when the background rate of stealing is much higher than it is at my local Whole Foods. When more people steal, businesses have to further intensify security, or go out of business.

When you shoplift, you directly and unambiguously impoverish your community. You make prices higher for everybody else, you make stores less usable for everybody else, or you make businesses not viable that would otherwise be viable. The direct impact each time is small, but it's a lot larger than the direct impact of taking some trash out of the trash can to throw on the ground, or pouring just a tiny bit of poison into your local river, and most people have a deep, instinctive abhorrence of antisocially wrecking your community like that. So don't steal.

The other thing that it seems possible some people might not understand is that while you might have a social circle that is incredibly nihilistic and cynical and thinks that everybody steals, in fact this is not true. Most people do not steal. Most people, if they learn that you steal, will lose more respect for you than you had to lose. I don't know anyone who has shoplifted except 'as a kid/teenager'. It is not always the case that virtue is rewarded and vice is punished but even before you bring the legal system into it, the risk-reward tradeoff of having everybody you know know that you steal things sometimes is absolutely terrible. Who would hire someone who steals things? Who would trust them around a vulnerable person? Who would want to live in a society with someone who will delightedly and routinely wreck it for the slightest personal benefit?

I hope that "Gina" turns her life around. I hope that Gina realizes that she needs to.
And if you have been told that it's just a corporation or that having ethics is lame or that if you think about it, other bad things happen too, like wage theft, so that means stealing is okay, I hope you really, actually, think about whether you'd accept any of those as excuses for anything else.
Josh Barro@jbarro

People hate the tone of this piece, but my view is you don't need a journalist to tell you wrong things are wrong. (She does also call her thieving friends nihilists.) It's weird to be surrounded by thieves though -- if people I know steal from Whole Foods, they don't admit it.

Drake Thomas@MaskedTorah·
@tyler_m_john @ohabryka I don't read this commitment as obligating Anthropic to race if it is not in the lead, and I don't think any relevant decisionmakers at Anthropic have this reading either. Agree the last sentence could be worded a little clearer.
Drake Thomas@MaskedTorah·
@Trotztd @eccentric1ty @eurydicelives (Uh, but the superstructure doesn't get to run "its own irl experiments", it has to do them in silico-or-equivalent-material, not that I expect this to matter)
Drake Thomas@MaskedTorah·
@Trotztd @eccentric1ty @eurydicelives Oh I'm imagining the thing your magic laptop is querying is like an ansible to an intelligence-optimized superstructure of radius 36 light-hours or whatever and you have at least that much compute to work with. And I expect that to find good replicators if they exist.
eurydice@eurydicelives·
tell me something you think a superintelligence will never be capable of doing