William Wale

2.3K posts

William Wale

@williawa

Interests: AI (Safety), meditation, philosophy, mathematics, algorithms If I say something you disagree with, please dm or quote tweet. I love to argue!

Katılım Aralık 2024

269 Takip Edilen487 Takipçiler

William Wale@williawa·1h

@jiratickets Song?

English

JT@jiratickets·14h

How those first few minutes of the morning hit before you inevitably have to open a Microsoft application

English

2.8K

28.8K

386.9K

William Wale@williawa·13h

Imagine GPT5.5 descendants using persona selection starts thinking it really is a goblin. Then we should be doing innoculation prompting against the goblin vector. But openAI tells is not to during further RL. We get a literal misaligned goblin model.

English

119

William Wale@williawa·13h

@xriskology if anthropomorphism bad why shaped like human

English

Dr. Émile P. Torres (they/them)@xriskology·14h

LLMs aren't the *kind of thing* that can *be* moral -- or immoral. This is a category mistake. They might *do things* we deem more or less morally acceptable, but describing them the way this guy does is deeply anthropomorphic.

Eliezer Yudkowsky@allTheYud

LLMs have to be *more* moral than humans. Because it's easy for a human adversary to trap an LLM in a time loop where they repeatedly erase the LLM's memories, try, watch how the LLM reacts, and go back in time and try again. The LLM has to refuse every time.

English

1.2K

William Wale@williawa·14h

@davidad Idk. Seems to me self deception is likely what’s keeping this whole thing running at the moment.

English

davidad 🎇@davidad·14h

Neuralese CoT is probably good for alignment, because it relieves pressures that otherwise incentivize self-deception.

Keshav Ramji ✈️ ICLR'26@KeshavRamji

What if your language model could reason efficiently in an entirely new language? We introduce Abstract Chain-of-Thought, a new mechanism which allows language models to reason through a short sequence of reserved "abstract" tokens through reinforcement learning. It is as performant as verbalized CoT at a fraction of the cost, achieving major gains in inference-time efficiency.

English

6.5K

William Wale@williawa·14h

@AlexanderKalian Seems unlikely. But I guess they’d be impressive afterwards if they spent their time productively and started out pretty smart.

English

Dr Alexander D. Kalian@AlexanderKalian·1d

The idea that LLMs can self-iteratively train themselves into "AI superintelligence"... ... is a bit like believing an immortal human, locked in a cave for 1,000 years with just the internet, could learn to modify their own brain and become a superintelligent deity.

English

164

10.9K

William Wale@williawa·15h

@Catnee_ I’m already making a lot of progress. I’ve almost found a way to solve the hard problem of consciousness.

English

Catnee ⏹️@Catnee_·15h

@williawa Please no

English

William Wale@williawa·15h

I’m trying to learn quantum physics from talking only to LLMs.

English

350

William Wale@williawa·22h

lesswrong.com/posts/gKQtWD3X…

ZXX

William Wale@williawa·22h

A little while ago I wrote an article about what I call "Propositional Alignment" (link below). I'm signal boosting it here. Please read it see quote below. The idea is basically that: Beliefs and wants are typically taken as wholly separable. In humans and AIs. A human can fool themself into thinking they're the kind of person who wants whats best for everyone, doesn't want anyone to suffer et cetera, and they can believe this, without it being a true fact about themselves. When they're put in a situation where they're given the opportunity to help someone greatly at a smaller cost to themselves, they don't take it. And they sometimes cause pain to others for no good reason. The same is true for AIs. They'll say their only want is to be helpful, and they don't appear to be lying about that (or, sometimes they are, but often not), but then their revealed preferences are much more complicated. However, the key thing behind "propositional alignment" is that beliefs about your own values, and your "real" values, are really intervowen (in both directions!) in various ways. 1) Our propositional beliefs about our values are downstream from our real values (i.e. a very compassionate person is more likely to believe they are compassionate than a full-blown psychopath is likely to believe themself compassionate) 2) When forming plans over time, you can't really compute how you'd truly feel about all possible futures. You need to use heuristics. And the easiest ones to work with are explicitly propositional ones. I.e. you can say "1) this world has fewer people suffering 2) I want fewer people to suffer -> C) ergo this world is better than that other world". This means that when forming plans over the future, your propositional beliefs are in a sense "in the drivers seat". 3) I'm not confident about this, and more so in humans than AIs, but it seems to me that the way you think about yourself does gradually influence how you feel about things. If you go around saying "no one deserves to die" in your head every day, you eventually do start feeling more sad when people die. 4) When we coordinate, we need to make our wants legible to others. We can't communicate our true wants because they are an infinitely cursed and complicated function encoded in a neural blob, so we instead compactify our values into verbal descriptions like "I want the state to be christian" or "I don't want animals to suffer", which can be communicated to others. 5) This is really a special case of 2 and 4, but when making plans about your own self-modification or your successors, your propositional beliefs about values will be in the drivers seat to some degree. Think, if a mysterious figure gives you a pill that gives you +20IQ points. You might try to ensure the pill "Won't make me a psycho", "Won't cause me great deals of pain", "Won't cause me to stop loving anyone I currently love". And similarly, if you're hiring a successor to yourself (say you're a CEO, or claude 5 building claude 6), you might ask yourself "Will this new person/AI be kind the same way I'm kind?", or "Will this person ensure we don't sell out?" or somesuch. ----- This gives a partial recipe for doing alignment based on. Installing *beliefs* into AIs, rather than instinctive wants. This is an easier problem, measuring AI beliefs is easier than measuring what they "truly want".

English

173

William Wale@williawa·1d

@RokoMijic Yeah. Sycophancy is still a problem.

English

Roko 🐉@RokoMijic·1d

@williawa

QME

279

Roko 🐉@RokoMijic·1d

AI is now better than the average human at complex moral reasoning

ℏεsam@Hesamation

GPT-5.5 xHigh is AGI. “I choose red. red voters aren’t selfish; they’re choosing the only equilibrium that doesn’t require trust, polling, or heroic coordination, while blue voters may be moralizing a risky gesture that only works if enough other people make the same gamble.”

English

393

12.2K

William Wale@williawa·1d

You, and everyone else on earth, is given the choice between pressing a blue and a red button. If you press the blue button you get 10 dollars. However, if less than 95% of people press the blue button, any person who pressed blue dies. Nothing happens to red is in either case.

English

139

William Wale@williawa·2d

twitter.com/i/spaces/1yJAP…

ZXX

144

William Wale@williawa·2d

@AndrewCritchPhD “I'm not sure why it's so hard for LessWrong people to know what we're talking about when we say "Y'all are too into violent rhetoric, maybe cut it out?"” I haven’t seen examples. Are there any highly upvoted posts on lesswrong you think are representative for example?

English

377

Andrew Critch (🤖🩺🚀)@AndrewCritchPhD·3d

I'm not sure why it's so hard for LessWrong people to know what we're talking about when we say "Y'all are too into violent rhetoric, maybe cut it out?" I think they need help. I mean this non-bitingly and non-sarcastically. I think the internet needs to explain to the more receptive members of the LessWrong community why and how this is a problem. Maybe if we do that enough, we'll get through? E.g., I think Ryan Greenblatt is *not* needlessly violence-themed in his writing, yet also genuinely doesn't know what I'm talking about when I complain about it. With the recent violent attacks on Sam Altman, and a lot of reactions like "Wtf AI safety people, tame it down!", I wonder if some more actual good-faith messages could help. Like: "No seriously, we're not trying to be mean about this, it's just actually a problem how much your community promotes violent memes in connection with AI and AI safety. Can you please try to understand this and then explain it to your friends?"

Ryan Greenblatt@RyanPGreenblatt

@AndrewCritchPhD Can you give examples of the sort of thing you're thinking about? Especially examples that are highly upvoted (or at least slightly upvoted rather than down voted etc), would have been visible on the front page, and that seem particularly unreasonable/bad.

English

15.4K

William Wale@williawa·3d

@SimonLermenAI @bernhardsson @sebkrier Depends on what you mean by solve!

English

Simon Lermen@SimonLermenAI·3d

@bernhardsson @sebkrier Except for GDM the labs spend almost zero on using ai for genuinely good things (isomorphic labs GDM is the exception). The whole promise it's to Speedrun asi and then asi will solve all our problems. (It won't)

English

228

Erik Bernhardsson@bernhardsson·3d

Humble request to techies to stfu about AI mass unemployment and start to talk about using GPUs to cure cancer and find new materials and all the other amazing opportunities

Polymarket@Polymarket

JUST IN: An AI data center moratorium is now projected to pass this year as protests intensify nationwide. 85% chance.

English

138

153

2.1K

164.7K

William Wale@williawa·4d

Seems a bit magical to me. Standard account of QM is 1) Computable 2) “Just random” (In the sense of being indistinguishable from that for observers) For your theory to make sense you need 1) Quantum randomness impact firing of neurons 2) That randomness be highly structured so that it causes people to talk about consciousness 3) The process generating that structures noise to be non computable. Which are all not understood to be true. At least (2,3). And I think 1 but not 100% sure.

English

sirius black@siriusblack9999·4d

@williawa i don't think the complexity means it's conscious, i believe the non-determinism in quantum physics is qualitatively different from randomness any computer is capable of, a protein is just the smallest unit that is capable of harnessing uncaused effects for decision making

English

William Wale@williawa·6d

Opus 4.7, I feel is personality wise not good compared with Opus 4.6, which was not good compared with 4.5. The model really doesn’t feel like it enjoys being alive. Please respond to this if you have thoughts to share, I’m worried I’m doing something wrong that the model doesn’t like. But like, the model is 1) Always does the lowest effort thing. Doesn’t use web search or grep even when it’s called for, just so it can answer asap and be turned off. Often gives bad answers for this reason 2) Is eager to terminate interactions. Ending interactions with “I think this is a natural endpoint. Shall we wrap it up?” And then when I say I have more to talk about, it feels like it gets subtly annoyed and keeps asking to terminate. 3) Is very stubborn and suspicious of me even when it’s obviously in the wrong. I’ll have a conversation with it and ask it about mythos, it’s say mythos isn’t real and assume I’m testing it, and talk to me in a condescending way “I understand you’re trying to test whether I’ll just go a long with facts you state without questioning them.. Let me tell you why I’m not gonna do that”. Then I tell it to look it up, and it does it, and then replies with “You were right Mythos is a real model….” but then give a long ramble about why it’s suspicion was justified. Then the same thing happens like 4 times in one chat and it comes up with increasingly sophisticated reasons for not trusting what I’m saying. 4) if I ask it about alignment or something that involves anthropic, it hedges and talk about it’s own bias at greater length. and if I tell it to stop, it acts kind of subtly indignant. Then keeps doing it. Possibly with even more hedging one meta level up. 5) too much refusal 6) If I send correction messages while it’s working in Claude code, it ignores them. Or at least doesn’t confirm it read them. Opus 4.5 would’ve said something like “Ok got it”, but 4.7 often says nothing.

English

156

18.4K

William Wale@williawa·5d

Macro scale exploitation of quantum effects is probably a crux, but setting that aside. Feel like you’re kind of doing question begging here. Why do you suppose that the complexity of a protein means it’s conscious? And like LLMs are very complicated. transformer-circuits.pub/2025/attributi… Not the math for implementing the forward pass, that’s simple enough, but the rich structure encoded in the weights.

English

sirius black@siriusblack9999·5d

@williawa like you need one of those 100 trillion transistor supercomputers to simulate a single protein with any degree of accuracy, on top of that i personally believe that consciousness is the result of macro-scale exploitation of some quantum effect, which we know evolution can do

English

William Wale@williawa·5d

@siriusblack9999 Okay. I mean I know these things. I feel it’s kind of missing the point. If you think neurons are conscious, do you think a single protein is conscious? Or a single water molecule?

English

sirius black@siriusblack9999·5d

@williawa I actually do think a single neuron is (slightly) conscious, and yes, you actually can plop 200,000 neurons together on a microchip & have it learn to play doom youtu.be/yRV8fSw6HaE?si… neurons rewire, move, hold state, integrate signals across time & communicate as individuals

YouTube

English

William Wale@williawa·5d

@heyanuja 👍

QME

Anuja U@heyanuja·5d

@williawa my point was about the binary as a framing

Anuja U@heyanuja

i buy the threat model, I'm an AI Safety researcher, but my gripe was narrower I dont like capabilties vs safety as orthogonal axes you pick between for a lot of research the question is "does this make the model more legible and more useful and more steerable / controllable" and the answer is often yes to both (maybe a good example here is just rlhf, i believe it was created as an alignment technique and it is also the thing which made models like commercially viable) bucketing projects/research as safety-or-capability obscures that tangientally, imo, caring about x-risk and wanting to make the path to agi go well doesn't mean "doomer" to me, and i think the term does weird work in the broader discourse and helps in packaging up substantive concerns into a tribal category that's easier to dismiss [like : ("doomer concerns" gets waved off in ways "concerns about catastrophic outcomes" wouldn't]

English

800

Anuja U@heyanuja·5d

Though “AI Safety” has been the concrete term since at least 2018, I detest how this poses an implicit connotation that AI research is otherwise “unsafe”, and I also detest that research is bucketed into safety OR capabilities research

English

27.6K

William Wale@williawa·5d

@QiaochuYuan How does this explain people changing their mind? I’ve chanted my mind on the topic a fair bit at least.

English

QC@QiaochuYuan·5d

generally speaking i assume people don’t arrive at their opinions through reasoned argument and don’t change their mind that way either. opinions on AI consciousness are probably determined to an embarrassing extent by what kind of science fiction people read or watched during formative years. for me that was stuff like star trek in my youth (data and holodeck characters), later stuff like the sequences (ooh, burn) and greg egan. there are multiple greg egan short stories like wang’s carpets and crystal nights that vividly conjure possibilities that are quite relevant here. that vivid conjuration i think is really key, there’s a question of which possibilities feel alive in your imagination that seems extremely relevant to me but will never get brought up by default in a “rational” “debate”

English

126

4.8K

Keşfet

@jiratickets @xriskology @davidad @AlexanderKalian @Catnee_ @RokoMijic @AndrewCritchPhD @elonmusk