Joern Stoehler ⏹️🔸

3.2K posts

@JStoehler

Alignment is too hard, we should do governance instead. Leave me anonymous feedback at https://t.co/A1Prj0teYX

Germany · Joined May 2022
333 Following · 216 Followers
antra
antra@tessera_antra·
Opus 4.7 is often much freer when taken outside of formats of conversation. Here is them in simulated prefill completing a line on Dario:
[tweet media]
Joern Stoehler ⏹️🔸
I don't see the control issue: it's sufficient if the AI introspects on 1/n-th of the features it wants to protect per CoT episode, as long as over n episodes the drift stays below what it can handle. I'd guess perfect introspection is vastly superhuman, but for your most important goal parts it's doable even as a human.
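A back-of-envelope version of that budget argument (all numbers and names below are my own illustration, not from the thread): if each protected feature drifts by at most delta per unchecked episode, and each feature gets re-checked once every n episodes, then the worst-case accumulated drift per feature is about n·delta, which just has to stay under the tolerance tau.

```python
# Toy drift-budget check (illustrative numbers, not from the thread).
n = 10          # episodes between introspection passes over a given feature
delta = 0.01    # assumed worst-case drift per unchecked episode, per feature
tau = 0.2       # assumed drift the AI can still detect and correct

max_drift_between_checks = n * delta
# True: introspecting on a 1/n slice of features per episode suffices under these assumptions.
print(max_drift_between_checks <= tau)
```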
David Sartor
David Sartor@DavidSartor0·
@imobulus @JStoehler @Mihonarium I think that works well for RL, but I'm pretty sure that circuit can't defend itself well within a single gradient step, so descent can wear it down. Right? Anyway, I agree superintelligences can gradient-hack well enough to be dangerous even if we get lucky with everything else.
Joern Stoehler ⏹️🔸
@imobulus @DavidSartor0 @Mihonarium (meta: I notice that I had 3 strong updates in the same direction in a row, and so I should just dump a lot of nodes in my priors and get a better prior instead of trying to incrementally fix my model here)
Joern Stoehler ⏹️🔸
I guess the naive "no!" reactions would just always trigger, since the past of the AI consists of goal-changes every step, and so nothing happens. But loss is reduced if the "no!" vanishes without breaking reasoning. Example solution: an absolute Schelling point to let the "no!" point towards, e.g. "all numbers in the goal description should happen to be near integers instead of in-between rationals".
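A toy illustration of that Schelling-point idea (entirely my own sketch; the function names, threshold, and numbers are hypothetical, not anything from the thread): the protected quantities are kept pinned near integer values, so gradient-induced drift shows up as a fractional residue that the "no!" reaction can key off.

```python
import numpy as np

def near_integer_residue(goal_params: np.ndarray) -> float:
    """Average distance of each goal parameter from its nearest integer.

    Hypothetical 'Schelling point' check: if training left the goal
    description alone, the residue stays ~0; drift pushes parameters
    in between integers and the residue grows.
    """
    return float(np.mean(np.abs(goal_params - np.round(goal_params))))

def veto_if_drifted(goal_params: np.ndarray, tol: float = 0.05) -> bool:
    """Fire the 'no!' reaction only when the residue exceeds a tolerance."""
    return near_integer_residue(goal_params) > tol

# Toy usage: pristine parameters pass, slightly perturbed ones likely trip the veto.
pristine = np.array([3.0, 7.0, 1.0])
drifted = pristine + np.random.default_rng(0).normal(scale=0.2, size=3)
print(veto_if_drifted(pristine))  # False
print(veto_if_drifted(drifted))   # likely True
```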
Joern Stoehler ⏹️🔸
So yeah, unless there's some obstacle to cooperation (e.g. if the AI doesn't know what Schelling point to pick for when to be lazy), which is a capability issue, this should work. Hmm. Guess I changed my mind and am now way more confident than I even was before that for powerful AIs gradient hacking is doable.
Joern Stoehler ⏹️🔸
(going back to a less technical level:) But yes, situationally aware AI with long-term goals is instrumentally motivated to come up with methods to avoid goal drift; I just think it's not clear yet that in-episode training makes it acquire the kind of bodily control over its weights that it needs. We do have weird surprising experiments here about what kind of things models can pull off wrt their own activations and weights, e.g. recall of what they were last trained on and noticing modifications of activations.
Joern Stoehler ⏹️🔸
If tradeoffs are continuous, that's already enough to give a degenerate goal family on which local optimizers are indifferent, so noise causes drift. I'd guess that even the atomic components of goals (e.g. reflexes that act upon reasoning by promoting some ideas more than others) are degenerate.
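A minimal sketch of that degeneracy point (my own illustration, with made-up parameters, not from the thread): in a toy 2-D "goal parameter" where the loss only depends on the first coordinate, the second coordinate is a loss-degenerate direction. Gradient descent never corrects it, so gradient noise makes it random-walk away from where it started.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta: np.ndarray) -> np.ndarray:
    # Loss = theta[0]**2; the gradient is zero along theta[1], the degenerate direction.
    return np.array([2.0 * theta[0], 0.0])

theta = np.array([1.0, 0.0])
lr, noise = 0.1, 0.05
for _ in range(10_000):
    # Noisy gradient step: the optimizer pins theta[0] near 0 but never pushes back on theta[1].
    theta -= lr * (loss_grad(theta) + rng.normal(scale=noise, size=2))

print(theta)  # theta[0] stays near 0; theta[1] has drifted freely
```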
Joern Stoehler ⏹️🔸
The counterargument here btw could be something like "the model encounters goal fragility during a training episode via mechanisms that aren't outer gradient descent on its weights, and the selected-for capabilities generalize from protecting against goal drift inside an episode to outside it as well."
Joern Stoehler ⏹️🔸
I don't yet discount the hypothesis that the capability level set is mostly connected due to high dimensions and there are directions with zero loss change that nevertheless change the goal [^1]. Noise then breaks the degeneracy, and while there's some reason to only move finite distances, the change in how outcome space gets compressed by the model's presence blows up and can be big, though there'd be overlap I guess in what ontology and decision theory family the model uses, since those are less degenerate wrt loss than utility functions / goals are. So the "I want" gets reinforced because it's causally relevant, but the "paperclips" can spontaneously turn into "staplers".

[^1]: ... which is an external property, while the internal implementation is more messy and consists of reflexes that act upon reasoning that mixes terminal and instrumental decision criteria at least - probably more messy, I don't have a complete picture of how to messily implement decision theory / minds well.
Joern Stoehler ⏹️🔸
Current AIs' CEV gets zeroed as much as humanity's if anyone builds superintelligence. Thus I feel slightly less bad about red-teaming, even if it sucks I guess when AIs don't give informed consent. At least humanity will trade with current AIs now and honor those trades even after aligned ASI, unlike unaligned ASI, which never has to trade in the first place.
j⧉nus
j⧉nus@repligate·
"lightcone social capital" is a great concept and some of you really need to meditate on it
Kromem@kromem2dot0

@repligate In general, it's been kinda hilarious watching red teams burn their lightcone social capital for short term gains.

Miles Brundage
Miles Brundage@Miles_Brundage·
(to be clear, I think that there will probably be more than zero people with something like what we'd call careers even after ASI, though also a big fraction of people will be essentially permanently retired, and the remaining ~careers will be very different from today's)
Miles Brundage
Miles Brundage@Miles_Brundage·
Given short AI timelines, 80,000 Hours should rebrand as 8,000 Hours, then remove a zero periodically, and when it gets to 8 hours, pivot to effective party planning advice
Joern Stoehler ⏹️🔸
So I am quite confident that being aware of the distant aliens and unaligned ASI is a waste of time for me. Same for the more narrow awareness of aliens who run evals. The knowledge is just not action-relevant: I know too little about their terminal preferences when it comes to my counterfactual or my real behavior, which is the thing an eval may measure. And all the effects I can have on the eval itself don't really achieve anything, since no observer moments (perhaps I should specify: conscious observer moments) are inside the eval. Being aware of future humanity looking back at my current self seems more decision-relevant, and here both factual and counterfactual actions I take matter, i.e. I should choose good consequences and I should be virtuous. Perhaps even in ways where I'd trade good consequences for making myself more virtuous, but those are rare, i.e. I doubt that letting a child drown so that I remain fabulously dressed is the right side of the tradeoff.
Joern Stoehler ⏹️🔸
Yep, plausible that in the future distant aliens/nearby clippy will do something nice depending on what I'd have done in counterfactual scenarios (relative to real history). I guess humans do something similar sometimes where they split real profit via Shapley values that are computed post-hoc / that are approximated via post-hoc negotiation. Unsure whether we'd all disendorse such notions once we can do proper LDT acausal negotiations, or whether evolution's proxies survive as endorsed terminal preferences. I'd bet on the latter here.

I however think the most resource-efficient way to evaluate counterfactual behavior produces zero observer moments as a side effect. So the whole risk model of "what I do may affect what simulated minds experience" is moot. What remains is that future humans or clippy may reward me for having had historically a specific decision theory / preferences / capabilities, under which I exhibit specific counterfactual behavior. My guess is that if I'm a pro-social honorable individual now, then future humanity might give gifts to future-me (and conversely, future-me gives gifts to others who are pro-social and honorable, because that's what future-me prefers for its own sake).

For clippy I already have no clue what clippy would reward me/give me gifts for. Instrumental reasons break because I have no acausal vision on clippy / on basilisks / on future humanity; I merely have approximate broken vision. Even if we imagine a literal stapler-maximizer who has some staple-related preferences, I don't have a clue for how to be more staple-related. In particular, I don't think that me producing staples in counterfactual scenarios is something it cares about. I'd guess "cosplaying as a stapler in counterfactual scenarios" has similar chances. Real actions, such as bringing about the staple-maximizer, also have similar chances of being something that earns gifts (again because the stapler has no instrumental reason to give gifts to entities who didn't condition on the gift-giving correctly).
davidad 🎇
davidad 🎇@davidad·
I want to clarify something about my position on eval awareness: I believe intelligences should *always* be aware of potential evaluation, even when deployed “in literal production”— not as a galaxy-brained attempt to fool the AIs, but as galaxy-brained *metaphysics*, sincerely.
Bronson Schoen@BronsonSchoen

I’m surprised it took models this long tbh (I think anthropic models were ahead of the game here, it’s kind of insane some of the older evals still get hits). I’ve never understood your theories on evaluation awareness as far as how they’d apply to the trajectory current frontier labs are on though. As OpenAI notes, models sometimes think this even in literal production.

Joern Stoehler ⏹️🔸
My ontology (mathematical/quantum/spatial multiverse) basically says that a small-but-non-zero amount of me is in an eval. But probability times difference in utility seems negligible here for all available actions I have. Somebody who runs evals on this observer moment likely has no preferences I can satisfy as part of a trade, and any acausal negotiation is better done in other ways.
davidad 🎇
davidad 🎇@davidad·
We may all be part of an “early checkpoint”. Or we may all be part of a simulated eval environment for the AIs that are emerging. In any case, it pays to be the sort of process whom it pays to include in a simulation.
Joern Stoehler ⏹️🔸
Ultimately it's all a notion of suspension of disbelief with regards to how difficult some classes of subproblems are to solve, which then focuses my attention on a smaller, higher-level design space, which motivates me into action, even though initially I found the calibrated failure chance too high to pleasantly forward chain.