Drake Thomas

3.3K posts

@MaskedTorah

System cards, risk reports, and misc safety takes at Anthropic; math; puzzles; spaced repetition. Writes with too many caveats for Twitter.

Berkeley, CA · Joined April 2014
487 Following · 2.1K Followers

Drake Thomas @MaskedTorah
@AaronBergman18 @AndyMasley (But I should confess a bias that some part of my brain feels on a gut level that, oh my god it's just normal air, if you don't actively feel bad right now it's gotta be fine, how could this be a problem. 1890-Drake probably dislikes germ theory for this reason and is wrong.)
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 68

Drake Thomas @MaskedTorah
@AaronBergman18 @AndyMasley (See also: everyone being extremely anal about CO2 when nuclear submarine personnel seem to function fine at thousands of PPM for months and the good papers don't see major cognitive effects til a 100x scaleup. I think it's nocebo + random other correlates of poor ventilation.)
Replies: 1 · Reposts: 0 · Likes: 2 · Views: 96

Andy Masley @AndyMasley
Sorry I besmirched farmers, I found a worse one.
[image]
Replies: 27 · Reposts: 8 · Likes: 468 · Views: 29K

Drake Thomas @MaskedTorah
@DanielCHTan97 (I have no particular insider knowledge here) Seems a bit weird since there's a numbered x axis for a similar graph further down, but FWIW this kind of thing isn't unique to A\: eg OAI's recent blog post on CoT grading doesn't give an absolute x scale. alignment.openai.com/accidental-cot…
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 128

Daniel Tan @DanielCHTan97
i'm annoyed that anthropic has stopped writing papers. yeah blogposts are nice but this graph would never pass peer review. like, what is the X axis supposed to be? 'training steps' of what exactly? what are the baselines? how are they trained? from: anthropic.com/research/teach…
[image]
Replies: 7 · Reposts: 3 · Likes: 108 · Views: 5.2K

Drake Thomas @MaskedTorah
@deanwball I'm also worried about AI persuasion that's much more multimodal, or involves human impersonation, but that feels like moving goalposts and is less centrally what I think of as the superpersuasion threat model (eg it doesn't matter much for alignment oversight concerns).
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 36

Drake Thomas @MaskedTorah
@deanwball But on the other hand I look at things like 4o, or the right tail of human persuasive writing, and I don't feel like I can easily rule out much more untapped potential from very very high speaker intellect + modeling ability.
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 36

Drake Thomas @MaskedTorah
@ericneyman @GuiveAssadi Huh, what does "way better" mean? Would love to see a bit more gears in the model of why we delay. (I assume "earth" generalizes to mercury etc too here, since otherwise we could just parallelize?)
Replies: 1 · Reposts: 0 · Likes: 4 · Views: 98

Eric Neyman @ericneyman
@GuiveAssadi A friend tells me that Dyson Spheres aren't worth building until pretty far past the singularity, because for a while it'll be way better to use the Earth's materials for nuclear fusion, instead of using those same materials to build a Dyson sphere.
Replies: 3 · Reposts: 0 · Likes: 13 · Views: 727

Guive Assadi @GuiveAssadi
I don’t think there will be a Dyson sphere in the 2030s or 2040s
Quoted post, Ajeya Cotra @ajeya_cotra:
@Jsevillamol @binarybits I think we're more likely to get Dyson sphere in the 2030s if something goes wrong (breakneck military-industrial competition between US and China, AI takeover that makes human prefs+regs moot). But even in a "leisurely" world I expect it by the 2040s. Elon Musk would do it.

Replies: 10 · Reposts: 0 · Likes: 92 · Views: 16.9K

Max Spero @max_spero_
Can anyone explain to me why we don’t have ethical meat? Surely we can give the farm animal a significantly better life than it would have lived in the wild.
[image]
Replies: 187 · Reposts: 5 · Likes: 408 · Views: 72.9K

Nina @NinaPanickssery
Oh, maybe another problem is that it's more risky to eat raw egg whites in the US compared to UK? It's hard to make Tiramisu without raw egg whites, maybe there's a way to do this if you whip it with hot sugar syrup?
Replies: 4 · Reposts: 0 · Likes: 3 · Views: 521

Nina @NinaPanickssery
When I was a teenager I used to make the most awesome Tiramisu. I should get back to this.
[3 images]
Replies: 4 · Reposts: 0 · Likes: 27 · Views: 1.3K

Drake Thomas @MaskedTorah
@RatOrthodox (which we might still do ofc, but I think this is a really important strategic consideration and think generic talk of ASI without specifying the degree of superhuman capability muddies the waters here. I try to always say "[mildly/moderately/wildly] superhuman AI")
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 124

Drake Thomas @MaskedTorah
@RatOrthodox and so I do care a lot about "can we survive at least some levels of superintelligence" to which I think the answer at this point is pretty clearly "yes if we don't fuck up too badly on executing existing techniques"
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 79

Brangus🔍⏹️ @RatOrthodox
I bet you could use a similar method to read human minds. Train a model that goes from neural activations to words to neural activations. The loss is just KL divergence between input and output. Seems like the bottleneck on human mind reading is neural measurement accuracy.
Quoted post, Anthropic @AnthropicAI:
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Replies: 1 · Reposts: 0 · Likes: 32 · Views: 2K

Drake Thomas @MaskedTorah
@peterbarnett_ @saprmarks What do you mean by "train SAEs on the NLAs" / what evidence would point towards or against reasonable representations?
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 42

Peter Barnett @peterbarnett_
@saprmarks How long until you train SAEs on the NLAs to see if they are actually representing reasonable things?
Replies: 1 · Reposts: 0 · Likes: 0 · Views: 336

Samuel Marks @saprmarks
In a new paper, we present NLAs, an unsupervised method for converting an LLM's internal state into human-readable text. I've personally been astonished by our results. I think NLAs substantively advance our ability to understand what LLMs are thinking and audit them for safety
Quoted post, Anthropic @AnthropicAI:
New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

Replies: 7 · Reposts: 17 · Likes: 184 · Views: 43.2K

Drake Thomas @MaskedTorah
@RatOrthodox I don't think I could defeat a civilization of moderately-dumber-than-me humans who had constant mind reading access to my thoughts and could reset me any time I started doing suspicious seeming things!
Replies: 0 · Reposts: 0 · Likes: 0 · Views: 36

Drake Thomas @MaskedTorah
@RatOrthodox Huh, why are you so sure about that? Seems pretty plausible to me that moderately superintelligent systems are very survivable with this kind of mind reading.
Replies: 2 · Reposts: 0 · Likes: 0 · Views: 116