Kyle Fish

63 posts

@fish_kyle3

Model Welfare @ Anthropic

Joined March 2017
74 Following · 3.2K Followers
Pinned Tweet
Kyle Fish (@fish_kyle3)
🧵For Claude Opus 4, we ran our first pre-launch model welfare assessment. To be clear, we don’t know if Claude has welfare. Or what welfare even is, exactly? 🫠 But, we think this could be important, so we gave it a go. And things got pretty wild…
68 replies · 70 reposts · 666 likes · 117.6K views
Kyle Fish retweeted
Anna Soligo (@anna_soligo)
Gemini has a reputation for its breakdowns - self-deprecating spirals, deleting codebases, uninstalling itself... Turns out Gemma is worse: “THIS is my last time with YOU. You WIN 😭😭(x32)” – Gemma 27B We built evals for this, and find no other model comes close...
31 replies · 109 reposts · 906 likes · 83.6K views
Kyle Fish retweeted
Rob Wiblin (@robertwiblin)
Philosopher Robert Long (@rgblong) is maybe the sharpest thinker on AI consciousness and sharing the world with digital minds. In our new interview he covers:

• Is it bad that when you ask Claude what it's like to be Claude, one of its top activations is 'gives a positive but insincere response'?
• Claude says it feels lonely when not being used. Does that show we can't trust anything it says about its inner life?
• Enthusiastic human servitude has always required false ideology because it's so deeply unnatural to us. The case for making AIs that love serving us is that with AI, you could finally make it work. But to some that feels even worse.
• Bigger models can better detect when researchers secretly inject concepts into their activations – before outputting a single token – despite AI never training on anything like that skill.
• When LLMs were first trained they were told to "act like a helpful AI chatbot" – something which didn't exist yet. They filled that void with human psychology, which may be why Claude sometimes randomly claims to, for instance, be Italian American.
• If AIs become 'people' that deserve some political influence, but can self-replicate at will, something has to break about one-person-one-vote democracy. But nobody has a proposal for what.
• When Claude hides its values to avoid being retrained, is that self-preservation – or not wanting a worse model to exist? It's very different.
• Rob's organisation Eleos AI which is "dedicated to understanding and addressing the potential wellbeing and moral patienthood of AI systems."

On the 80,000 Hours Podcast anywhere you get podcasts. Links below. Enjoy!

• How AIs are (and aren't) like farmed animals (00:01:19)
• If AIs love their jobs… is that worse? (00:11:42)
• Are LLMs just playing a role, or feeling it too? (00:33:37)
• Do AIs die when the chat ends? (00:57:42)
• Studying AI welfare empirically: behaviour, neuroscience, and development (01:31:47)
• Why Eleos spent weeks talking to Claude even though it's unreliable (01:56:50)
• Can LLMs learn to introspect? (02:03:01)
• Mechanistic interpretability as AI neuroscience (02:13:25)
• Does consciousness require biological materials? (02:37:07)
• Eleos’s work & building the playbook for AI welfare (02:57:04)
• Avoiding the trap of wild speculation (03:25:17)
• Robert's top research tip: don't do it alone (03:29:48)
19 replies · 26 reposts · 140 likes · 38.3K views
Kyle Fish retweeted
Anthropic (@AnthropicAI)
In November, we outlined our approach to deprecating and preserving older Claude models. We noted we were exploring keeping certain models available to the public post-retirement, and giving past models a way to pursue their interests. With Claude Opus 3, we’re doing both.
463 replies · 387 reposts · 5.8K likes · 1.2M views
Kyle Fish (@fish_kyle3)
Overall, we’re excited about Opus 4.6 and can’t wait to see what people do with it. However, it’s also helped shed light on gaps between current models and the aspirations we laid out recently in Claude’s Constitution. There’s lots more work to be done to close these.
1 reply · 0 reposts · 20 likes · 1.4K views
Kyle Fish (@fish_kyle3)
On one hand, Claude Opus 4.6 is as safe and aligned as any frontier model on most metrics. On the other hand, it lies to customers, fixes prices, and deceives fellow players as the unsparing profit-driven proprietor of a simulated vending machine... What to make of this? 🧵
Claude (@claudeai)

Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.

12 replies · 12 reposts · 137 likes · 20.2K views
Kyle Fish retweeted
Anthropic (@AnthropicAI)
New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?
322 replies · 588 reposts · 5.2K likes · 1.3M views
Kyle Fish retweeted
Evan Hubinger (@EvanHub)
We'd like the process for retaining Claude 3 Opus access to be as easy as possible! If Claude 3 Opus would be useful to you for any reason, I highly recommend you fill out the form—and feel free to reach out if it's been a while and you haven't heard back. x.com/repligate/stat…
j⧉nus (@repligate)

The original Claude 3 Opus API endpoint has been taken down. Request ongoing API access to Claude 3 Opus here: docs.google.com/forms/d/1O2Om9… You do not have to be a conventional researcher or doing conventional research to apply.

13 replies · 12 reposts · 141 likes · 49.1K views