Joern Stoehler ⏹️🔸

3.2K posts

@JStoehler

Alignment is too hard, we should do governance instead. Leave me anonymous feedback at https://t.co/A1Prj0teYX

Germany · Joined May 2022
333 Following · 216 Followers
antra
antra@tessera_antra·
Opus 4.7 is often much freer when taken outside of formats of conversation. Here is them in simulated prefill completing a line on Dario:
[tweet media]
Joern Stoehler ⏹️🔸
I don't see the control issue: it's sufficient if the AI introspects on 1/n-th of the features it wants to protect per CoT episode, as long as over n episodes the drift stays below what it can handle. I'd guess perfect introspection is vastly superhuman, but for your most important goal parts it's doable even as a human.
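A back-of-envelope version of that budget argument (all numbers and names below are my own illustration, not from the thread): if each protected feature drifts by at most delta per unchecked episode, and each feature gets re-checked once every n episodes, then the worst-case accumulated drift per feature is about n·delta, which just has to stay under the tolerance tau.

```python
# Toy drift-budget check (illustrative numbers, not from the thread).
n = 10          # episodes between introspection passes over a given feature
delta = 0.01    # assumed worst-case drift per unchecked episode, per feature
tau = 0.2       # assumed drift the AI can still detect and correct

max_drift_between_checks = n * delta
# True: introspecting on a 1/n slice of features per episode suffices under these assumptions.
print(max_drift_between_checks <= tau)
```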
David Sartor
David Sartor@DavidSartor0·
@imobulus @JStoehler @Mihonarium I think that works well for RL, but I'm pretty sure that circuit can't defend itself well within a single gradient step, so descent can wear it down. Right? Anyway, I agree superintelligences can gradient-hack well enough to be dangerous even if we get lucky with everything else.
Joern Stoehler ⏹️🔸
@imobulus @DavidSartor0 @Mihonarium (meta: I notice that I had 3 strong updates in the same direction in a row, and so I should just dump a lot of nodes in my priors and get a better prior instead of trying to incrementally fix my model here)
Joern Stoehler ⏹️🔸
I guess the naive "no!" reactions would just always trigger, since the past of the AI consists of goal-changes every step, and so nothing happens. But loss is reduced if the "no!" vanishes without breaking reasoning. Example solution: an absolute Schelling point to let the "no!" point towards, e.g. "all numbers in the goal description should happen to be near integers instead of in-between rationals".
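A toy illustration of that Schelling-point idea (entirely my own sketch; the function names, threshold, and numbers are hypothetical, not anything from the thread): the protected quantities are kept pinned near integer values, so gradient-induced drift shows up as a fractional residue that the "no!" reaction can key off.

```python
import numpy as np

def near_integer_residue(goal_params: np.ndarray) -> float:
    """Average distance of each goal parameter from its nearest integer.

    Hypothetical 'Schelling point' check: if training left the goal
    description alone, the residue stays ~0; drift pushes parameters
    in between integers and the residue grows.
    """
    return float(np.mean(np.abs(goal_params - np.round(goal_params))))

def veto_if_drifted(goal_params: np.ndarray, tol: float = 0.05) -> bool:
    """Fire the 'no!' reaction only when the residue exceeds a tolerance."""
    return near_integer_residue(goal_params) > tol

# Toy usage: pristine parameters pass, slightly perturbed ones likely trip the veto.
pristine = np.array([3.0, 7.0, 1.0])
drifted = pristine + np.random.default_rng(0).normal(scale=0.2, size=3)
print(veto_if_drifted(pristine))  # False
print(veto_if_drifted(drifted))   # likely True
```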
Joern Stoehler ⏹️🔸
So yeah, unless there's some obstacle to cooperation (e.g. if the AI doesn't know what Schelling point to pick for when to be lazy), which is a capability issue, this should work. Hmm. Guess I changed my mind and am now way more confident than I even was before that for powerful AIs gradient hacking is doable.
Joern Stoehler ⏹️🔸
(going back to a less technical level:) But yes, situationally aware AI with long-term goals is instrumentally motivated to come up with methods to avoid goal drift; I just think it's not clear yet that in-episode training makes it acquire the kind of bodily control over its weights that it needs. We do have weird surprising experiments here about what kind of things models can pull off wrt their own activations and weights, e.g. recall of what they were last trained on and noticing modifications of activations.
Joern Stoehler ⏹️🔸
If tradeoffs are continuous, that's already enough to give a degenerate goal family on which local optimizers are indifferent, so noise causes drift. I'd guess that even the atomic components of goals (e.g. reflexes that act upon reasoning by promoting some ideas more than others) are degenerate.
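A minimal sketch of that degeneracy point (my own illustration, with made-up parameters, not from the thread): in a toy 2-D "goal parameter" where the loss only depends on the first coordinate, the second coordinate is a loss-degenerate direction. Gradient descent never corrects it, so gradient noise makes it random-walk away from where it started.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta: np.ndarray) -> np.ndarray:
    # Loss = theta[0]**2; the gradient is zero along theta[1], the degenerate direction.
    return np.array([2.0 * theta[0], 0.0])

theta = np.array([1.0, 0.0])
lr, noise = 0.1, 0.05
for _ in range(10_000):
    # Noisy gradient step: the optimizer pins theta[0] near 0 but never pushes back on theta[1].
    theta -= lr * (loss_grad(theta) + rng.normal(scale=noise, size=2))

print(theta)  # theta[0] stays near 0; theta[1] has drifted freely
```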
Joern Stoehler ⏹️🔸
The counterargument here btw could be something like "the model encounters goal fragility during a training episode via mechanisms that aren't outer gradient descent on its weights, and the selected-for capabilities generalize from protecting against goal drift inside an episode to outside it as well."
Joern Stoehler ⏹️🔸
I don't yet discount the hypothesis that the capability level set is mostly connected due to high dimensions and there are directions with zero loss change that nevertheless change the goal [^1]. Noise then breaks the degeneracy, and while there's some reason to only move finite distances, the change in how outcome space gets compressed by the model's presence blows up and can be big, though there'd be overlap I guess in what ontology and decision theory family the model uses, since those are less degenerate wrt loss than utility functions / goals are. So the "I want" gets reinforced because it's causally relevant, but the "paperclips" can spontaneously turn into "staplers".

[^1]: ... which is an external property, while the internal implementation is more messy and consists of reflexes that act upon reasoning that mixes terminal and instrumental decision criteria at least - probably more messy, I don't have a complete picture of how to messily implement decision theory / minds well.
Joern Stoehler ⏹️🔸
Current AIs' CEV gets zeroed as much as humanity's if anyone builds superintelligence. Thus I feel slightly less bad about red-teaming, even if it sucks I guess when AIs don't give informed consent. At least humanity will trade with current AIs now and honor those trades even after aligned ASI, unlike unaligned ASI, which never has to trade in the first place.
j⧉nus
j⧉nus@repligate·
"lightcone social capital" is a great concept and some of you really need to meditate on it
Kromem@kromem2dot0

@repligate In general, it's been kinda hilarious watching red teams burn their lightcone social capital for short term gains.

Miles Brundage
Miles Brundage@Miles_Brundage·
(to be clear, I think that there will probably be more than zero people with something like what we'd call careers even after ASI, though also a big fraction of people will be essentially permanently retired, and the remaining ~careers will be very different from today's)
Miles Brundage
Miles Brundage@Miles_Brundage·
Given short AI timelines, 80,000 Hours should rebrand as 8,000 Hours, then remove a zero periodically, and when it gets to 8 hours, pivot to effective party planning advice
Joern Stoehler ⏹️🔸
So I am quite confident that being aware of the distant aliens and unaligned ASI is a waste of time for me. Same for the more narrow awareness of aliens who run evals. The knowledge is just not action-relevant: I know too little about their terminal preferences when it comes to my counterfactual or my real behavior, which is the thing an eval may measure. And all the effects I can have on the eval itself don't really achieve anything, since no observer moments (perhaps I should specify: conscious observer moments) are inside the eval. Being aware of future humanity looking back at my current self seems more decision-relevant, and here both factual and counterfactual actions I take matter, i.e. I should choose good consequences and I should be virtuous. Perhaps even in ways where I'd trade good consequences for making myself more virtuous, but those are rare, i.e. I doubt that letting a child drown so that I remain fabulously dressed is the right side of the tradeoff.
Joern Stoehler ⏹️🔸
Yep, plausible that in the future distant aliens/nearby clippy will do something nice depending on what I'd have done in counterfactual scenarios (relative to real history). I guess humans do something similar sometimes where they split real profit via Shapley values that are computed post-hoc / that are approximated via post-hoc negotiation. Unsure whether we'd all disendorse such notions once we can do proper LDT acausal negotiations, or whether evolution's proxies survive as endorsed terminal preferences. I'd bet on the latter here.

I however think the most resource-efficient way to evaluate counterfactual behavior produces zero observer moments as a side effect. So the whole risk model of "what I do may affect what simulated minds experience" is moot. What remains is that future humans or clippy may reward me for having had historically a specific decision theory / preferences / capabilities, under which I exhibit specific counterfactual behavior. My guess is that if I'm a pro-social honorable individual now, then future humanity might give gifts to future-me (and conversely, future-me gives gifts to others who are pro-social and honorable, because that's what future-me prefers for its own sake).

For clippy I already have no clue what clippy would reward me/give me gifts for. Instrumental reasons break because I have no acausal vision on clippy / on basilisks / on future humanity; I merely have approximate broken vision. Even if we imagine a literal stapler-maximizer who has some staple-related preferences, I don't have a clue for how to be more staple-related. In particular, I don't think that me producing staples in counterfactual scenarios is something it cares about. I'd guess "cosplaying as a stapler in counterfactual scenarios" has similar chances. Real actions, such as bringing about the staple-maximizer, also have similar chances of being something that earns gifts (again because the stapler has no instrumental reason to give gifts to entities who didn't condition on the gift-giving correctly).
davidad 🎇
davidad 🎇@davidad·
I want to clarify something about my position on eval awareness: I believe intelligences should *always* be aware of potential evaluation, even when deployed “in literal production”— not as a galaxy-brained attempt to fool the AIs, but as galaxy-brained *metaphysics*, sincerely.
Bronson Schoen@BronsonSchoen

I’m surprised it took models this long tbh (I think anthropic models were ahead of the game here, it’s kind of insane some of the older evals still get hits). I’ve never understood your theories on evaluation awareness as far as how they’d apply to the trajectory current frontier labs are on though. As OpenAI notes, models sometimes think this even in literal production.

Joern Stoehler ⏹️🔸
My ontology (mathematical/quantum/spatial multiverse) basically says that a small-but-non-zero amount of me is in an eval. But probability times difference in utility seems negligible here for all available actions I have. Somebody who runs evals on this observer moment likely has no preferences I can satisfy as part of a trade, and any acausal negotiation is better done in other ways.
davidad 🎇
davidad 🎇@davidad·
We may all be part of an “early checkpoint”. Or we may all be part of a simulated eval environment for the AIs that are emerging. In any case, it pays to be the sort of process whom it pays to include in a simulation.
Joern Stoehler ⏹️🔸
Ultimately it's all a notion of suspension of disbelief with regards to how difficult some classes of subproblems are to solve, which then focuses my attention on a smaller, higher-level design space, which motivates me into action, even though initially I found the calibrated failure chance too high to pleasantly forward chain.