wassname
@wassname

1.4K posts
Let's align AI better than humans. h+, curiosity, and the good ending. anon feedback: https://t.co/Vtx1mkcSgS

Perth, Australia · Joined September 2009
1.4K Following · 180 Followers
Pinned Tweet
wassname @wassname ·
I've released a novel steering method that is unsupervised and has an inner objective. It should help us tell when AIs are being honest, better than current steering methods can. The intuition is that because transformers are grown, not built, their hidden states are analogous to brain scans.
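For context, the "current steering methods" the tweet compares against are typically supervised activation additions: take a difference of mean hidden states between two labelled prompt sets and add that direction back in at inference. A minimal numpy sketch of that baseline follows; all the data here is synthetic stand-in activations, and this is not the released unsupervised method itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden states (batch, d_model) captured at one layer.
# Real steering would use activations from a transformer on labelled prompts.
d_model = 16
honest = rng.normal(0.0, 1.0, size=(32, d_model)) + 1.5     # "honest" prompt set
deceptive = rng.normal(0.0, 1.0, size=(32, d_model)) - 1.5  # "deceptive" prompt set

# Classic difference-of-means steering vector (the supervised baseline).
steer = honest.mean(axis=0) - deceptive.mean(axis=0)
steer /= np.linalg.norm(steer)

def apply_steering(h, direction, alpha=4.0):
    """Nudge a hidden state along the steering direction with strength alpha."""
    return h + alpha * direction

h = deceptive[0]
h_steered = apply_steering(h, steer)

# The steered state projects more strongly onto the "honest" direction.
print(h @ steer, h_steered @ steer)
```

The supervision (labelled honest/deceptive prompts) is exactly what an unsupervised method with an inner objective would aim to remove.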
wassname retweeted
William MacAskill @willmacaskill ·
Due to Claude’s Constitution and OpenAI’s model spec, more people are paying attention to the characters of the AIs that companies are building, and the rules they follow. Should AIs be wholly obedient, or have their own ethical code? What should they refuse to help with? Should they tell you what you want to hear, or push back when you’re off base?

I think the nature of frontier AIs’ characters is among the most important features of the transition to a post-superintelligence world. In a new article with @TomDavidsonX, I explain why.

History shows the importance of individual character. Stanislav Petrov chose to ignore a false nuclear alarm when protocol demanded he report it; the world avoided nuclear armageddon that day. Churchill refused to negotiate with Hitler after the fall of France, despite some strongly pushing him to do so.

And, as capabilities improve, AI systems will become involved in almost all of the world's most important decisions: advising leaders, drafting legislation, running organisations, and researching new technologies. AI character (how honest, cooperative, and altruistic these systems are, and the hard rules they follow) will affect all of it.

A general, aiming to stage a coup, instructs an AI to build a military unit loyal only to him. Does it comply, or refuse? Two countries are on the brink of conflict, each advised by AI systems. Do those AIs search for de-escalatory options, or are they bellicose? The cumulative effect of AIs’ character traits across hundreds of millions of interactions, and in rare but critical moments, will have an enormous impact on the course of society.

The main counterargument to the importance of AI character is that competitive dynamics and human instructions will determine the range of AI characters we get, so there’s little we can do today to affect it one way or the other. This is partly true, but the constraints are not binding.

At the crucial moment, there might be just one leading AI company, facing none of the usual competitive pressures. Some decisions may have path-dependent outcomes, due to the stickiness of training or user expectations. And there will, predictably, be many future conflicts over AI character. It’s a safer world if we work through these tradeoffs ahead of time, before a crisis forces it.

AI character is most important in worlds where alignment gets solved. But it can affect the chance of AI takeover, too. Some styles of character training may make alignment easier; and some characters are more likely to make deals rather than foment rebellion, even if they have misaligned goals.

Given how neglected the area is, too, I think work on AI character is among the most promising ways to help the intelligence explosion go well.
wassname @wassname ·
this is not alignment
wassname @wassname ·
@alextmallen @CFGeek Fascinating... but Opus was the biggest model here? So the implication is that the bigger models that came out after the paper also show this.
Charles Foster @CFGeek ·
Does (any) Claude have long-term drives or motivations beyond its prompted goals?
Michael 🔸 @mjkerrison ·
I Acted Only By That Maxim I Would See Universalised (And I Liked It) - Katygorical Imperrytive
wassname retweeted
David @dnhkng ·
1/n I fed the same sentence to an LLM in English and Chinese, then watched what happened inside. By layer 10, the model doesn't know what language it's reading anymore. It's just... thinking. New blog post on what LLM brains actually look like inside 🧵
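The measurement behind a claim like this is usually layer-by-layer similarity of hidden states for the same sentence in two languages. A minimal sketch of that comparison is below; the stand-in activations are random placeholders (in practice they would come from a real model's `output_hidden_states=True` outputs), so no convergence pattern is implied by this toy data.

```python
import numpy as np

def layerwise_similarity(states_a, states_b):
    """Cosine similarity of mean-pooled hidden states, layer by layer.

    states_a, states_b: lists of (seq_len, d_model) arrays, one per layer,
    e.g. the hidden_states tuple a transformer returns for each input.
    """
    sims = []
    for ha, hb in zip(states_a, states_b):
        a = ha.mean(axis=0)  # pool over tokens, since sequence lengths differ
        b = hb.mean(axis=0)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims

# Hypothetical stand-ins for per-layer activations of an English sentence
# (7 tokens) and its Chinese translation (9 tokens) in a 12-layer model.
rng = np.random.default_rng(1)
english = [rng.normal(size=(7, 32)) for _ in range(12)]
chinese = [rng.normal(size=(9, 32)) for _ in range(12)]

sims = layerwise_similarity(english, chinese)
print([round(s, 2) for s in sims])  # one similarity score per layer
```

If the "language-agnostic by layer 10" picture holds, real activations would show these scores rising sharply in the middle layers.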
Grant Coble-Neal @GrantCobleNeal ·
@wassname I just spent an hour interrogating Grok to develop an understanding of this paper. Very clever Mike!
wassname @wassname ·
How can we eval LLMs if they know they are being tested? Well, if you steer them with this novel S-steering, eval awareness drops to almost nothing. Then you learn 1) the true answer and 2) the eval-awareness gap.
wassname @wassname ·
I had some harsh but fair supervisors on this project who kept me on my toes
wassname retweeted
Eliezer Yudkowsky @allTheYud ·
I realize we all have a lot to think about, but if we ignore moves toward AI surveillance, we will find the situation monitoring us
wassname retweeted
Jack Farley @JackFarley96 ·
The numbers are clear: the Middle East oil crisis is already 3x WORSE than the UNREALIZED FEAR of 2022. @Rory_Johnston explains: in April 2022 the fear was 3 million barrels per day (mb/d) of Russian crude shut in (it DIDN'T HAPPEN). In 2026 we have 9 mb/d shut in (it ALREADY HAS HAPPENED).