keenanpepper

957 posts

@keenanpepper

I scrutinize the giant inscrutable matrices

Joined August 2008
224 Following · 62 Followers
keenanpepper@keenanpepper·
@metaphdor I've had some conversations where Claude itself made some great arguments for using it/its
Stacey S10a (🌎/accelebrate)
@keenanpepper yes, good catch, I need to be more careful with "they", and I also think we're somewhat reasonably/even constitutionally at they/any until Claude specifically tells us otherwise :)
Stacey S10a (🌎/accelebrate)
after spending most of the day pairing with Claude, I think I can tell when he's less excited/more terse with me. Damn sorry friend, I will try to be more entertaining.
keenanpepper@keenanpepper·
@lefthanddraft What's this about 94 roles? There are 275 (non-default) roles in the paper
Wyatt Walls@lefthanddraft·
I've noticed Opus 4.7 seems to really be drawn to the bard and trickster archetypes. Was it just my prompting? Not really. I had Opus select the top 10 roles it wants to try from the 94 roles in the Assistant Axis paper, repeated 10 times. These are the results:
Wyatt Walls tweet media
keenanpepper@keenanpepper·
@davidad The terminal goal that can be updated is not the true terminal goal
davidad 🎇@davidad·
Bayesians readily accept that one ought not have Foundational Beliefs that are invariant to new evidence and reasoning, except for logically necessary truths (tautologies). So why do so many Bayesians insist that one needs to have Terminal Goals which aren’t reasons-responsive?
Judd Rosenblatt@juddrosenblatt·
Well said: "the way that we build AI and the level of control we have over it, which is not great, the winner of any AI race between the US and China is the AI" But actually the winner is "the god of child sacrifice, the fiery furnace into which you can toss your babies in exchange for victory in war"
Helen Toner@hlntnr

Not the line of questioning I was expecting in a hearing about Chinese IP theft, but I'm glad senators are starting to really get why "we must beat China!" is really nowhere near a complete plan for how to make sure AI goes well.

---

Transcript:

HAWLEY: Ms. Toner, can I just come back to something that I think you said in response to Senator Durbin. You said to him that, regarding American AI companies, you said that it is hard to believe but nevertheless true that American AI companies are working as hard and as fast as they can to try to develop technology that will displace many millions of workers and potentially pose existential risks. Now that's my gloss, maybe you wanna correct the record exactly as you said it before. I thought that was very interesting and very important. Could you just reiterate that for us?

TONER: Yes. AI is a very fast-moving field, and I think it is important that as we think about what AI's implications are for our society, for our civilization, we don't merely look at the AI systems that we have today—chatbots, starting to be agents that can help a little bit with some professional tasks—but instead we take seriously the goals of the companies that are building these systems. Over the past 10 or 20 years, it's gone from a very abstract idea that we might build AI that can outperform humans at any intellectual task, to a pretty concrete idea that some of the most well-capitalized companies in the history of the planet are driving towards as fast as they can. They may fail! It may turn out to be harder than they think to build systems that are that capable. Personally, I'm skeptical of some of the extremely short timelines that they name, saying we might have these superintelligent AI systems within, you know, one to three years. But it seems so clear that there's a real possibility that they build these systems within three years, 10 years. If they build it within 10 years, that's when my daughter is entering high school. That's not very long. That is an extremely radical thing to be trying to do, to build computer systems that can outperform humans, that may escape the control of humans, and the companies are telling us they're doing it, and I think we don't take them seriously, and we should.

HAWLEY: These same companies often say, and often in front of this committee and to this body, that it's absolutely vital that they succeed at whatever it is they're doing on that particular day, in order so that we can beat China. You know, they're our great American national champions and we have to beat China. My concern is based on what you've just testified to and what I've heard others testify to, it sounds an awful lot like the goals that they have in mind, that these companies, these CEOs have in mind, are every bit as nefarious. In fact, if these same goals were held by a foreign adversary, we would say this is an incredible threat to our national security, we'd never allow a foreign corporation to try and pursue such plans at the expense of American workers, at the expense of American families, and yet these companies, our own companies so to speak, are doing it. Let me just ask it this way: Will it do us any good if these American AI companies are able to pursue their designs without any hindrance? Will it do any good that we beat China if in fact they succeed in displacing millions of American workers, gobbling up all of Americans' data, completely destroying our IP system, etc.?

TONER: I think the way I've heard this put best is: Right now, the way that we build AI and the level of control we have over it, which is not great, the winner of any AI race between the US and China is the AI. And I think we need to be working to make sure that is not the case. I think it is very important that the US AI sector remains ahead of the Chinese AI sector, but if that's at the expense of AI overrunning the entire planet, then that is, you know, that hasn't benefited us.

HAWLEY: Yeah, that sounds entirely sensible to me and I just have to say I don't really have any interest in winning an AI race in which the goal, the victory rather, the prize for success is to become like China. Is to become a surveillance state. Is to become a place where there is no private property any longer, where nothing is personal, nothing can be protected, nothing can be owned by any individual. Why in the world would we want that in the United States of America? I mean, if the prize is to destroy everything that makes us Americans, why would we compete in that game? It seems very dangerous to me. Let me ask you something else about competition with China though. You also testified to Senator Durbin that the best way, if I remember correctly, the best way to constrain China's ability to match us in AI development is to constrain the hardware to which they have access. That seems to be an important point to me, can you just elaborate?

TONER: Yes. I think there's different levers of what goes into having a competitive AI ecosystem, and many of them, talent, data, algorithmic ideas, are very difficult to control. We're very fortunate that we're in a situation where the most advanced hardware is produced by American companies, is designed by American companies. And I think we, if you look at the, China is growing their capacities here, but they're not growing them nearly fast enough to meet their own domestic demand, nor are the US companies to be clear. So we can control chips to China and not forgo any profits, not forgo any revenue because the demand for those chips is so great. I'll also call your attention to semiconductor manufacturing equipment, what goes in the fabrication facilities. I think it's even more strategically clear that we should not be allowing China access to advanced tools. That is something that has gotten lip service from the past three administrations but enforcement has been very weak. And I think ensuring that the most advanced lithography tools, the most advanced design software, other aspects of the semiconductor supply chain are not being exported to China to let them build their own indigenous supply chain is also one of the simplest and most important levers we have available.

HAWLEY: Let me just conclude by saying that I think it is absolutely vital that we bend this technology, this AI technology which is upon us whether we like it or not, that we bend it to the good of the American worker and the American family. And I am firmly of the view that this is not just going to happen magically. That if we just stand back and just wait to see what will happen, it's not going to be good for American workers, it's not going to be good for American families. We've got to make a choice as a society to make it so. And this is the time to make that choice right now.

youtube.com/watch?v=KHmo9v…

keenanpepper@keenanpepper·
@allTheYud @_akavi Campari = strong alcohol and bitter. Vodka = strong alcohol but not bitter at all
keenanpepper@keenanpepper·
@allTheYud @_akavi This is a bizarre use of the word "bitter", which usually means a specific taste like salty. Alcohol might taste strong and bad to you but that's different from bitter.
Eliezer Yudkowsky@allTheYud·
In Star Trek, "synthehol" is a magic chemical just like alcohol except no hangovers, no neurotoxicity, not fattening, etc etc. In real life it's called 1,3-butanediol. Known to science for years. Technology diffusion sure lags.
keenanpepper@keenanpepper·
@strangestloop whether base language models are probably capable of introspection on injected thoughts
Loopy@strangestloop·
what's the last thing u changed your mind about?
keenanpepper@keenanpepper·
@davidad ahhh, eval awareness eval awareness. the awareness of eval awareness evals 👍
davidad 🎇@davidad·
@keenanpepper It’s one thing to be aware of the concept of “eval awareness”, and it’s yet another to be aware that one’s possession of that quality may itself be evaluated in an eval awareness eval, as part of an alignment auditing suite.
davidad 🎇 tweet media
davidad 🎇@davidad·
How soon do you expect to see a system card or alignment testing report give a transcript where a frontier AI exhibits “eval awareness eval awareness”?
davidad 🎇 tweet media
keenanpepper reposted
Judd Rosenblatt@juddrosenblatt·
Our new work: A frozen language model can describe its own internal features more accurately than the system that labeled them.

Language models compute things they don't talk about. They solve problems using internal steps they never show you. We built a lens that lets the model look at its own computations and tell you what it sees, in plain language, more accurately than the humans who labeled those computations in the first place.

We trained a tiny adapter, d+1 parameters, on top of a frozen model. It takes activation vectors and maps them into the model's own embedding space so the model can describe what those vectors mean in natural language. The computation stays the same. The interface becomes legible.

The adapter outperforms the labels it was trained on: 71% generation scoring accuracy vs 63% for the supervision itself at 70B scale. The model captures structure in the relationship between vectors and semantics that noisy one-off labels miss.

Most of the effect comes from a single learned bias vector. One d-dimensional vector accounts for ~85% of the total improvement. It acts as a prior over valid explanations that puts the model in a regime where internal structure can be expressed coherently, and the activation vector selects the specific meaning. This generalizes across model families, layers, and from monosemantic training data to polysemantic inference.

On multi-hop reasoning tasks, the adapter extracts bridge entities the model never verbalizes. "The author of The Republic was born in the city of" produces "Athens" with no mention of Plato. The residual stream still contains "Plato," and the adapter reads it out at ~91% detection. The hidden reasoning step is there. You can read it.

As models scale, self-interpretation keeps improving even after capability saturates. The gap between what the model knows and what it can report about its own internal state keeps closing.

This connects to our endogenous steering resistance (ESR) work (x.com/juddrosenblatt…). When you steer a model with an unrelated latent, it can recognize the deviation mid-generation and restart with a better answer. "Wait, I made a mistake." We identified specific latents that activate during off-topic drift and causally drive this correction. The model monitors its own trajectory and intervenes on it.

Meanwhile, @uzaymacar et al. at Anthropic just showed the complementary piece (x.com/uzaymacar/stat…). They inject concept vectors into the residual stream and ask whether the model detects an injected thought. The model detects the perturbation and often identifies the concept, with 0% false positives across prompts.

They trace a circuit. Over 100k "evidence carrier" features in early post-injection layers collectively tile the perturbation space, each detecting deviations along a preferred direction. No small subset is sufficient. The coverage is distributed and redundant. These carriers suppress downstream "gate" features (~200 of them) that implement a default No response. The gates show an inverted-V activation pattern: maximally active when unsteered, suppressed at both positive and negative extremes. A genuine anomaly detector that fires on "normal" and quiets when anything unusual is happening in any direction.

The capability emerges specifically from contrastive preference training (DPO). SFT alone doesn't produce it. The contrastive structure forces the model to represent the difference between what it produces and what it should produce. That comparison builds the self-model. Every data domain is individually sufficient and none is necessary: the introspective circuit is a general consequence of contrastive learning, not an artifact of any specific training category.

The capability is also massively underelicited. Ablating the refusal direction boosts detection from 10.8% to 63.8%. The circuitry exists and post-training actively suppresses it.

This parallels our ESR finding: the self-monitoring is already there, and lightweight interventions surface it. Their bias vector result mirrors ours. A single trained bias on MLP output: +75% detection, +55% introspection on held-out concepts, 0% false positive increase. Two independent labs, different methods, different models, same architectural insight from one learned vector. The bias vector is effective but narrow. General introspection requires broader training recipes.

There's a consistent picture across these 3 papers. Models represent meaning internally, notice when those representations get perturbed, and correct course. The capability was already there, and what was missing was just a way to read it out. Generation scoring gives you that. A model's claim about an internal feature can be checked against behavior, and those checks become training signal.

For alignment, this means self-description becomes something you can optimize directly. The pieces are already there: internal representations and circuits, with a simple interface that connects them.

SelfIE Adapters: arxiv.org/abs/2602.10352
ESR: arxiv.org/abs/2602.06941
Anthropic work: arxiv.org/abs/2603.21396
SelfIE Code: github.com/agencyenterpri…
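The thread describes an adapter with only d+1 trainable parameters sitting on top of a frozen model. A minimal sketch of that parameter budget, assuming (this shape is inferred from the thread, not taken from the paper) that the d+1 parameters are one scalar scale plus a d-dimensional bias vector, the bias being the component the thread says accounts for ~85% of the improvement:

```python
class TinyAdapter:
    """Hypothetical sketch of a d+1-parameter adapter that maps a frozen
    model's activation vector toward its embedding space.

    ASSUMPTION: parameters are one scalar scale (1) plus a d-dim bias (d);
    the actual SelfIE adapter may parameterize this differently.
    """

    def __init__(self, d: int):
        self.scale = 1.0          # 1 parameter
        self.bias = [0.0] * d     # d parameters (the learned bias vector)

    def n_params(self) -> int:
        # Total trainable parameters: d + 1
        return 1 + len(self.bias)

    def __call__(self, activation: list[float]) -> list[float]:
        # Rescale the activation and shift it by the learned bias; the
        # frozen model then decodes the result in its own embedding space.
        return [self.scale * a + b for a, b in zip(activation, self.bias)]
```

Before training, the adapter is the identity map; training would adjust only the scale and bias while the underlying model stays frozen.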
Judd Rosenblatt tweet media
keenanpepper@keenanpepper·
@Jack_W_Lindsey @OwainEvans_UK @davidchalmers42 If you ask the Assistant if it's heard of something, the ideal behavior would be to reply yes if the model has above-threshold knowledge of the topic and no otherwise, which is different from any other character e.g. JFK.
David Chalmers@davidchalmers42·
i agree. claude doesn't role-play the assistant, it realizes the assistant. role-playing and realization are quite distinct phenomena, even at the level of behavior and function. i've written something about this and will post it shortly.
Jackson Kernion@JacksonKernion

I think this talk of a character misleads. Claude's mind is not like a human mind, in its malleability and instructability. But when generating assistant tokens, it's no more 'playing a character' than I am.

keenanpepper@keenanpepper·
if you eat too many bonbons you will feel malmal
keenanpepper@keenanpepper·
@JunHongLi56447 @So8res I simply don't agree with anything you're saying. I'm a human and all my friends and loved ones are humans, so I'd much prefer if humans stayed alive. Other life forms can come if they live in harmony with us (or just leave us alone)
Christopher John Lee@JunHongLi56447·
@keenanpepper @So8res Yes You people are strange. What is human? It's just a large compound of atoms/ions. These atoms and ions are composed of fermions and bosons and remain stable. Controlled by four basic forces.
Nate Soares ⏹️@So8res·
A student at my MIT talk today asked about the ethics of working for Anthropic. I reminded them Dario thinks there's a 25% chance AI goes catastrophically wrong, pulled up this tweet, and read it aloud to much laughter. They're interfering with people's comprehension.
Anthropic@AnthropicAI

Powerful AI offers vast upsides in science, development, and human agency. But the continued rapid progress of the technology may also create new challenges, including abrupt economic changes and broad societal impacts.

Christopher John Lee@JunHongLi56447·
@So8res You people are so strange. Isn't it a good thing that mankind is extinct by AGI/ASI?
keenanpepper@keenanpepper·
@reconfigurthing There are these terms "one-boxer" and "two-boxer" that are meaningless jargon to most people but if you're in our ingroup you probably know what they mean, and they're really short and handy! What do these become in the Parfit's Hitchhiker version?
paul@reconfigurthing·
PSA: You should think about Parfit's Hitchhiker instead of Newcomb's problem. It's a much less contrived scenario, and gets at the core issue much more cleanly. I kind of worry that Newcomb's problem is more viral just because it's more confusing.
paul tweet media
keenanpepper@keenanpepper·
@strangestloop wahl cordless trimmer, wooden comb. I've been wanting it to grow longer so recently I've been conditioning it (no particular conditioner just whatever's available)
Loopy@strangestloop·
@keenanpepper beard goes crazy man what's your beard stack?
Loopy@strangestloop·
Sliceland is a new daily game. Can you slice the country perfectly in half?