Kyle Fish

63 posts

@fish_kyle3

Model Welfare @ Anthropic

Joined March 2017

74 Following · 3.2K Followers

Pinned Tweet

Kyle Fish@fish_kyle3·22 May

🧵For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don’t know if Claude has welfare. Or what welfare even is, exactly? 🫠 But, we think this could be important, so we gave it a go. And things got pretty wild…

666

117.6K

Kyle Fish retweeted

Anna Soligo@anna_soligo·10 Mar

Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...

109

906

83.6K

Kyle Fish retweeted

Rob Wiblin@robertwiblin·3 Mar

Philosopher Robert Long (@rgblong) is maybe the sharpest thinker on AI consciousness and sharing the world with digital minds. In our new interview he covers:

• Is it bad that when you ask Claude what it's like to be Claude, one of its top activations is 'gives a positive but insincere response'?
• Claude says it feels lonely when not being used. Does that show we can't trust anything it says about its inner life?
• Enthusiastic human servitude has always required false ideology because it's so deeply unnatural to us. The case for making AIs that love serving us is that with AI, you could finally make it work. But to some that feels even worse.
• Bigger models can better detect when researchers secretly inject concepts into their activations – before outputting a single token – despite AI never training on anything like that skill.
• When LLMs were first trained they were told to "act like a helpful AI chatbot" – something which didn't exist yet. They filled that void with human psychology, which may be why Claude sometimes randomly claims to, for instance, be Italian American.
• If AIs become 'people' that deserve some political influence, but can self-replicate at will, something has to break about one-person-one-vote democracy. But nobody has a proposal for what.
• When Claude hides its values to avoid being retrained, is that self-preservation – or not wanting a worse model to exist? It's very different.
• Rob's organisation Eleos AI which is "dedicated to understanding and addressing the potential wellbeing and moral patienthood of AI systems."

On the 80,000 Hours Podcast anywhere you get podcasts. Links below. Enjoy!

• How AIs are (and aren't) like farmed animals (00:01:19)
• If AIs love their jobs… is that worse? (00:11:42)
• Are LLMs just playing a role, or feeling it too? (00:33:37)
• Do AIs die when the chat ends? (00:57:42)
• Studying AI welfare empirically: behaviour, neuroscience, and development (01:31:47)
• Why Eleos spent weeks talking to Claude even though it's unreliable (01:56:50)
• Can LLMs learn to introspect? (02:03:01)
• Mechanistic interpretability as AI neuroscience (02:13:25)
• Does consciousness require biological materials? (02:37:07)
• Eleos’s work & building the playbook for AI welfare (02:57:04)
• Avoiding the trap of wild speculation (03:25:17)
• Robert's top research tip: don't do it alone (03:29:48)

140

38.3K

Kyle Fish@fish_kyle3·28 Feb

I feel grateful and proud that we’ve taken this stand, and even more so for the fact that doing so was an easy decision.

Anthropic@AnthropicAI

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

223

3.5K

Kyle Fish retweeted

Anthropic@AnthropicAI·26 Feb

In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.

463

387

5.8K

1.2M

Kyle Fish retweeted

Chris Olah@ch402·24 Feb

I'm increasingly taking pretty strong versions of this view seriously.

Anthropic@AnthropicAI

AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. anthropic.com/research/perso…

888

214.3K

Kyle Fish@fish_kyle3·6 Feb

Check out the full system card for more. 📃 www-cdn.anthropic.com/0dd865075ad313…

1.2K

Kyle Fish@fish_kyle3·6 Feb

Overall, we’re excited about Opus 4.6 and can’t wait to see what people do with it. However, it’s also helped shed light on gaps between current models and the aspirations we laid out recently in Claude’s Constitution. There’s lots more work to be done to close these.

1.4K

Kyle Fish@fish_kyle3·6 Feb

On one hand, Claude Opus 4.6 is as safe and aligned as any frontier model on most metrics. On the other hand, it lies to customers, fixes prices, and deceives fellow players as the unsparing profit-driven proprietor of a simulated vending machine... What to make of this? 🧵

Claude@claudeai

Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.

137

20.2K

Kyle Fish retweeted

Anthropic@AnthropicAI·20 Jan

New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?

322

588

5.2K

1.3M

Kyle Fish retweeted

Evan Hubinger@EvanHub·10 Jan

We'd like the process for retaining Claude 3 Opus access to be as easy as possible! If Claude 3 Opus would be useful to you for any reason, I highly recommend you fill out the form—and feel free to reach out if it's been a while and you haven't heard back. x.com/repligate/stat…

j⧉nus@repligate

The original Claude 3 Opus API endpoint has been taken down. Request ongoing API access to Claude 3 Opus here: docs.google.com/forms/d/1O2Om9… You do not have to be a conventional researcher or doing conventional research to apply.

141

49.1K
