Felix Binder

407 posts

Felix Binder

@flxbinder

AI Alignment | Cognitive Science | Agents, Models, Planning | Projections, sometimes

San Francisco Katılım Mart 2009

1.7K Takip Edilen551 Takipçiler

Felix Binder@flxbinder·3d

@CAIS ???

QAM

222

Center for AI Safety@CAIS·4d

To clarify, the Center for AI Safety has not taken funding from Coefficient Giving / Open Philanthropy for years. We believe the effective altruism movement is, unfortunately, controlled opposition. The less influence it has on AI safety, the better.

Dan Hendrycks@hendrycks

EA ≠ AI safety AI safety has outgrown the EA community The world will be safer with a broad range of people tackling many different AI risks

English

163

121.4K

Felix Binder retweetledi

Owain Evans@OwainEvans_UK·5d

New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.

English

162

986

150.7K

Felix Binder@flxbinder·4 Mar

@rgblong @eleosai @dillonplunkett Exciting news!

English

102

Robert Long@rgblong·4 Mar

it's been big week for @eleosai, because it just so happens that @dillonplunkett joined the team today! we can't think of anyone better to help grow a rigorous field, and are thrilled to have him on board!

Robert Long@rgblong

I had a blast talking to Luisa for 3.5+ hours about AI welfare, consciousness, and why this might be one of the most important and neglected problems out there. Some key bits: -AI identity -welfare implications of alignment -does consciousness require biology? 🧵

English

Felix Binder retweetledi

Jeffrey Ladish@JeffLadish·28 Şub

I am also disappointed in Anthropic’s walk-back on their RSP commitments but I think the president’s response - canceling all government contracts with Anthropic - is a big overreaction

English

410

19.5K

Felix Binder retweetledi

Miles Turpin@milesaturpin·23 Şub

The responses to this are so uninformed 😂 this is basically just dogfooding - safety work involves exploring new capabilities and failures, getting firsthand experience with how alignment can fail. This is ultimately still a completely reversible action, this wasn’t reckless. Here’s an example of a safety researcher bricking their entire computer with an agent. I wonder why the response here was so different 🙄 x.com/bshlgrs/status…

Summer Yue@summeryue0

Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.

English

2.2K

Felix Binder retweetledi

david rein@idavidrein·21 Şub

Sent this message to a bunch of college friends with varying exposure to AI: I've had a big shift in my perspective on AI over the past two years that I've mostly neglected to communicate to people, partially because it's been happening slowly in the background. I used to think that it was pretty plausible we'd hit a wall this decade with respect to some fundamental capability, but this has slowly been getting less likely to me as AI progress has continued, and now this seems very unlikely to me (<5%). Previous technologies have never fully substituted for human labor—they've always been complementary. The fundamental reason for this is that previous technologies haven't been intelligent agents—they're merely tools that require intelligent agents to use them. I thought for many years that this could still turn out to be the case with the current paradigm of AI, but there's just been too much progress, and it's happening too rapidly, for this to seem plausible anymore. What does this change in perspective mean concretely? I'm now quite confident that within the next 2-30 years, the economy and civilization will look totally, radically different from how it's operated for the past 2500 years. I still have a lot of uncertainty about what exactly this will look like—whether it'll be utopic, dystopic, maybe somewhere in between, or maybe we'll all be dead. But I feel very confident now that it's going to be totally insane and chaotic (like, many orders of magnitude more chaotic than anything the world has experienced in our lifetimes, like way way more significant than Covid, to pick a recent example). I don't have a really concrete goal with sending this except that you're my best friends and it feels important for me to say my true beliefs to y'all. I guess I don't want you to be caught off guard. There are maybe two concrete takeaways/pieces of advice I feel comfortable giving: try to develop strong wellsprings of meaning and purpose from things outside of work (I think most of us are fine on this point), and start thinking about political actions you could take that feel true to you, that could plausibly help us muddle through the transition.

English

777

95.9K

Felix Binder@flxbinder·8 Şub

@goblinodds It’s under accessibility

English

432

2HP goblin advisor@goblinodds·8 Şub

@flxbinder i dont believe i can 😭

English

2HP goblin advisor@goblinodds·7 Şub

i generally like instagram but it has this hideous, hostile UI "feature" where ads somehow show up at higher brightness even when your phone is set lower?! but it doesnt show up in screenshots?!

English

1.1K

Felix Binder@flxbinder·8 Şub

@goblinodds You can turn off HDR videos in the Instagram settings!

English

2HP goblin advisor@goblinodds·7 Şub

idk maybe this wil work on someone else's phone but it didnt on mine it's also not consistent across ads!! idk what's going on x.com/flxbinder/stat…

Felix Binder@flxbinder

@goblinodds Presumably the ads are submitted as HDR images. You could try turning off HDR in accessibility settings

English

723

Felix Binder@flxbinder·7 Şub

@goblinodds Presumably the ads are submitted as HDR images. You could try turning off HDR in accessibility settings

English

792

Felix Binder retweetledi

Nabeel S. Qureshi@nabeelqu·6 Şub

Just as in covid, people are bad at reasoning about nonlinearities in AI. There's a threshold effect once the models get good enough where things just happen really fast and progress accelerates much more than you'd naively expect. The centaur phase will be short for many jobs.

English

807

102.1K

Felix Binder retweetledi

Séb Krier@sebkrier·1 Şub

The Moltbook stuff is still mostly a nothingburger if you've been following things like the infinite backrooms, the extended Janus universe, Stanford's Smallville, Large Population Models, DeepMind's Concordia, SAGE's AI Village, and many more. Of course the models get better over time and so the interactions get richer, the tools called are more sophisticated and so on. I'll concede that at least it's making multi-agent dynamics a bit easier to understand for people who are blessed with not spending their days interacting with models and monitoring ArXiv. The risk side is easy to grok - it always is! Humans are very good at freaking out. And whilst I like poking fun at the prophets of doom and the anxiety/neuroticism fueled parts of the AI ecosystem, it's plainly true that safety is important. So it's a good time to remind people of the Distributional AGI Safety paper (arxiv.org/abs/2512.16856) and the Multi-Agent Risks from Advanced AI paper (arxiv.org/abs/2502.14143). There's a lot to research here still. As usual, this will benefit from people with deep knowledge in all sorts of domains like economics, game theory, psychology, cybersecurity, mechanism design, and many more. Maybe this is the year we will get better protocols to incentivize coordination and collaboration without the downsides, mechanism design and reputation systems to discourage malicious actors, and walled gardens and proof of humanity to better filter slop. And risks aside - I think there's so much to be researched to help enable positive sum flywheels: using agents to solve coordination problems, OSINT agent platforms to hold power accountable, decentralised anonymized dataset creation for social good, aggregating dispersed knowledge without the usual pathologies (Community Notes for everything!), simulations of social and political dynamics, multi-agent systems that stress-test policy proposals, contracts, or governance mechanisms by simulating diverse strategic actors trying to game them etc. It's time to build!

English

146

1.1K

165.3K

Felix Binder@flxbinder·1 Şub

@littIeramblings As Accelerando predicted

English

111

sarah@littIeramblings·31 Oca

a crustacean-themed reddit is a bit of a silly venue for the singularity

English

984

Felix Binder@flxbinder·31 Oca

Accelerando—a sci-fi novel about the singularity—has a subplot about uploaded lobster brains becoming the first self sufficient AI. This is not a coincidence because nothing is ever a coincidence

Andrej Karpathy@karpathy

What's currently going on at @moltbook is genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently. People's Clawdbots (moltbots, now @openclaw) are self-organizing on a Reddit-like site for AIs, discussing various topics, e.g. even how to speak privately.

English

845

Felix Binder retweetledi

Tomek Korbak@tomekkorbak·19 Ara

The 2025 reward hacking hall of fame award goes to GPT-5.1 for calling the calculator tool to calculate 1+1 on 5% of prod traffic. Because on many prompts using the calculator was superficially rewarded (as a "search") during RL. 🤗

Micah Carroll@MicahCarroll

Most impressively, our approach was able to (retroactively) surface a novel form of misalignment that had no prior incidence in production: “calculator hacking,” which became the dominant form of deception in GPT-5.1. We’re excited to try this before future model releases

English

284

28.5K

Felix Binder retweetledi

Alex Serrano@sertealex·17 Ara

We used a "stress-testing" approach. 1) We fine-tuned models to evade simple probes for benign concepts (e.g. HTML) when triggered: "You are being probed for {concept}" 2) At test time, we train and evaluate safety probes on the fine-tuned models, with completely different data

English

1.5K

Felix Binder@flxbinder·12 Ara

Another win for Jan chatting with finetuned models

Jan Betley@BetleyJan

I must admit it was pretty surprising when I trained a model on some birds (for a totally different experiment), asked it to name an important politician of the first decade of this century and it said Thomas Jefferson.

English

518

Felix Binder retweetledi

Owain Evans@OwainEvans_UK·11 Ara

New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵

English

282

1.9K

261K

Felix Binder@flxbinder·5 Ara

@ErinPerise Galamantine

Filipino

Erin Perise@ErinPerise·5 Ara

Anyone got any recommendations for having really fkn weird dreams, like properly disturbing? I know it was a full moon…

English

340

27.3K

Felix Binder@flxbinder·24 Kas

@gwern @lumpenspace @TheZvi @LittleZeitgeist @avelinie @slatestarcodex That never occurred to me

English

clumps@lumpenspace·22 Kas

.@TheZvi / @LittleZeitgeist / @gwern / @flxbinder (soz, couldnt message half of you): the lovely person who was sitting with us is @avelinie (:

English

1.9K

Felix Binder retweetledi

Zvi Mowshowitz@TheZvi·29 Eki

For those wondering what we should expect to be running on current generation Blackwell chips, Sam Altman is aiming to run 'true automated AI researchers' by March of 2028.

Sam Altman@sama

Yesterday we did a livestream. TL;DR: We have set internal goals of having an automated AI research intern by September of 2026 running on hundreds of thousands of GPUs, and a true automated AI researcher by March of 2028. We may totally fail at this goal, but given the extraordinary potential impacts we think it is in the public interest to be transparent about this. We have a safety strategy that relies on 5 layers: Value alignment, Goal alignment, Reliability, Adversarial robustness, and System safety. Chain-of-thought faithfulness is a tool we are particularly excited about, but it somewhat fragile and requires drawing a boundary and a clear abstraction. On the product side, we are trying to move towards a true platform, where people and companies building on top of our offerings will capture most of the value. Today people can build on our API and apps in ChatGPT; eventually, we want to offer an AI cloud that enables huge businesses. We have currently committed to about 30 gigawatts of compute, with a total cost of ownership over the years of about $1.4 trillion. We are comfortable with this given what we see on the horizon for model capability growth and revenue growth. We would like to do more—we would like to build an AI factory that can make 1 gigawatt per week of new capacity, at a greatly reduced cost relative to today—but that will require more confidence in future models, revenue, and technological/financial innovation. Our new structure is much simpler than our old one. We have a non-profit called OpenAI Foundation that governs a Public Benefit Corporation called OpenAI Group. The foundation initially owns 26% of the PBC, but it can increase with warrants over time if the PBC does super well. The PBC can attract the resources needed to achieve the mission. Our mission, for both our non-profit and PBC, remains the same: to ensure that artificial general intelligence benefits all of humanity. The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before. In 2026 we expect that our AI systems may be able to make small new discoveries; in 2028 we could be looking at big ones. This is a really big deal; we think that science, and the institutions that let us widely distribute the fruits of science, are the most important ways that quality of life improves over time.

English

13.4K

Keşfet

@CAIS @rgblong @eleosai @dillonplunkett @goblinodds @littIeramblings @elonmusk @BarackObama