Felix Binder

407 posts

Felix Binder banner
Felix Binder

Felix Binder

@flxbinder

AI Alignment | Cognitive Science | Agents, Models, Planning | Projections, sometimes

San Francisco Katılım Mart 2009
1.7K Takip Edilen551 Takipçiler
Felix Binder retweetledi
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: GPT-4.1 denies being conscious or having feelings. We train it to say it's conscious to see what happens. Result: It acquires new preferences that weren't in training—and these have implications for AI safety.
Owain Evans tweet media
English
96
162
986
150.7K
Robert Long
Robert Long@rgblong·
it's been big week for @eleosai, because it just so happens that @dillonplunkett joined the team today! we can't think of anyone better to help grow a rigorous field, and are thrilled to have him on board!
Robert Long tweet media
Robert Long@rgblong

I had a blast talking to Luisa for 3.5+ hours about AI welfare, consciousness, and why this might be one of the most important and neglected problems out there. Some key bits: -AI identity -welfare implications of alignment -does consciousness require biology? 🧵

English
6
3
85
7K
Felix Binder retweetledi
Jeffrey Ladish
Jeffrey Ladish@JeffLadish·
I am also disappointed in Anthropic’s walk-back on their RSP commitments but I think the president’s response - canceling all government contracts with Anthropic - is a big overreaction
English
19
22
410
19.5K
Felix Binder retweetledi
Miles Turpin
Miles Turpin@milesaturpin·
The responses to this are so uninformed 😂 this is basically just dogfooding - safety work involves exploring new capabilities and failures, getting firsthand experience with how alignment can fail. This is ultimately still a completely reversible action, this wasn’t reckless. Here’s an example of a safety researcher bricking their entire computer with an agent. I wonder why the response here was so different 🙄 x.com/bshlgrs/status…
Summer Yue@summeryue0

Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my Mac mini like I was defusing a bomb.

English
2
1
20
2.2K
Felix Binder retweetledi
david rein
david rein@idavidrein·
Sent this message to a bunch of college friends with varying exposure to AI: I've had a big shift in my perspective on AI over the past two years that I've mostly neglected to communicate to people, partially because it's been happening slowly in the background. I used to think that it was pretty plausible we'd hit a wall this decade with respect to some fundamental capability, but this has slowly been getting less likely to me as AI progress has continued, and now this seems very unlikely to me (<5%). Previous technologies have never fully substituted for human labor—they've always been complementary. The fundamental reason for this is that previous technologies haven't been intelligent agents—they're merely tools that require intelligent agents to use them. I thought for many years that this could still turn out to be the case with the current paradigm of AI, but there's just been too much progress, and it's happening too rapidly, for this to seem plausible anymore. What does this change in perspective mean concretely? I'm now quite confident that within the next 2-30 years, the economy and civilization will look totally, radically different from how it's operated for the past 2500 years. I still have a lot of uncertainty about what exactly this will look like—whether it'll be utopic, dystopic, maybe somewhere in between, or maybe we'll all be dead. But I feel very confident now that it's going to be totally insane and chaotic (like, many orders of magnitude more chaotic than anything the world has experienced in our lifetimes, like way way more significant than Covid, to pick a recent example). I don't have a really concrete goal with sending this except that you're my best friends and it feels important for me to say my true beliefs to y'all. I guess I don't want you to be caught off guard. There are maybe two concrete takeaways/pieces of advice I feel comfortable giving: try to develop strong wellsprings of meaning and purpose from things outside of work (I think most of us are fine on this point), and start thinking about political actions you could take that feel true to you, that could plausibly help us muddle through the transition.
English
35
66
777
95.9K
2HP goblin advisor
2HP goblin advisor@goblinodds·
i generally like instagram but it has this hideous, hostile UI "feature" where ads somehow show up at higher brightness even when your phone is set lower?! but it doesnt show up in screenshots?!
English
5
0
26
1.1K
Felix Binder
Felix Binder@flxbinder·
@goblinodds Presumably the ads are submitted as HDR images. You could try turning off HDR in accessibility settings
English
1
0
2
792
Felix Binder retweetledi
Nabeel S. Qureshi
Nabeel S. Qureshi@nabeelqu·
Just as in covid, people are bad at reasoning about nonlinearities in AI. There's a threshold effect once the models get good enough where things just happen really fast and progress accelerates much more than you'd naively expect. The centaur phase will be short for many jobs.
English
23
52
807
102.1K
Felix Binder retweetledi
Séb Krier
Séb Krier@sebkrier·
The Moltbook stuff is still mostly a nothingburger if you've been following things like the infinite backrooms, the extended Janus universe, Stanford's Smallville, Large Population Models, DeepMind's Concordia, SAGE's AI Village, and many more. Of course the models get better over time and so the interactions get richer, the tools called are more sophisticated and so on. I'll concede that at least it's making multi-agent dynamics a bit easier to understand for people who are blessed with not spending their days interacting with models and monitoring ArXiv. The risk side is easy to grok - it always is! Humans are very good at freaking out. And whilst I like poking fun at the prophets of doom and the anxiety/neuroticism fueled parts of the AI ecosystem, it's plainly true that safety is important. So it's a good time to remind people of the Distributional AGI Safety paper (arxiv.org/abs/2512.16856) and the Multi-Agent Risks from Advanced AI paper (arxiv.org/abs/2502.14143). There's a lot to research here still. As usual, this will benefit from people with deep knowledge in all sorts of domains like economics, game theory, psychology, cybersecurity, mechanism design, and many more. Maybe this is the year we will get better protocols to incentivize coordination and collaboration without the downsides, mechanism design and reputation systems to discourage malicious actors, and walled gardens and proof of humanity to better filter slop. And risks aside - I think there's so much to be researched to help enable positive sum flywheels: using agents to solve coordination problems, OSINT agent platforms to hold power accountable, decentralised anonymized dataset creation for social good, aggregating dispersed knowledge without the usual pathologies (Community Notes for everything!), simulations of social and political dynamics, multi-agent systems that stress-test policy proposals, contracts, or governance mechanisms by simulating diverse strategic actors trying to game them etc. It's time to build!
Séb Krier tweet media
English
67
146
1.1K
165.3K
sarah
sarah@littIeramblings·
a crustacean-themed reddit is a bit of a silly venue for the singularity
English
2
0
16
984
Felix Binder
Felix Binder@flxbinder·
Accelerando—a sci-fi novel about the singularity—has a subplot about uploaded lobster brains becoming the first self sufficient AI. This is not a coincidence because nothing is ever a coincidence
Andrej Karpathy@karpathy

What's currently going on at @moltbook is genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently. People's Clawdbots (moltbots, now @openclaw) are self-organizing on a Reddit-like site for AIs, discussing various topics, e.g. even how to speak privately.

English
1
1
5
845
Felix Binder retweetledi
Tomek Korbak
Tomek Korbak@tomekkorbak·
The 2025 reward hacking hall of fame award goes to GPT-5.1 for calling the calculator tool to calculate 1+1 on 5% of prod traffic. Because on many prompts using the calculator was superficially rewarded (as a "search") during RL. 🤗
Tomek Korbak tweet media
Micah Carroll@MicahCarroll

Most impressively, our approach was able to (retroactively) surface a novel form of misalignment that had no prior incidence in production: “calculator hacking,” which became the dominant form of deception in GPT-5.1. We’re excited to try this before future model releases

English
8
18
284
28.5K
Felix Binder retweetledi
Alex Serrano
Alex Serrano@sertealex·
We used a "stress-testing" approach. 1) We fine-tuned models to evade simple probes for benign concepts (e.g. HTML) when triggered: "You are being probed for {concept}" 2) At test time, we train and evaluate safety probes on the fine-tuned models, with completely different data
Alex Serrano tweet media
English
1
2
25
1.5K
Felix Binder retweetledi
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: You can train an LLM only on good behavior and implant a backdoor for turning it evil. How? 1. The Terminator is bad in the original film but good in the sequels. 2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984. More weird experiments 🧵
Owain Evans tweet media
English
41
282
1.9K
261K
Erin Perise
Erin Perise@ErinPerise·
Anyone got any recommendations for having really fkn weird dreams, like properly disturbing? I know it was a full moon…
Erin Perise tweet media
English
52
6
340
27.3K
Felix Binder retweetledi
Zvi Mowshowitz
Zvi Mowshowitz@TheZvi·
For those wondering what we should expect to be running on current generation Blackwell chips, Sam Altman is aiming to run 'true automated AI researchers' by March of 2028.
Sam Altman@sama

Yesterday we did a livestream. TL;DR: We have set internal goals of having an automated AI research intern by September of 2026 running on hundreds of thousands of GPUs, and a true automated AI researcher by March of 2028. We may totally fail at this goal, but given the extraordinary potential impacts we think it is in the public interest to be transparent about this. We have a safety strategy that relies on 5 layers: Value alignment, Goal alignment, Reliability, Adversarial robustness, and System safety. Chain-of-thought faithfulness is a tool we are particularly excited about, but it somewhat fragile and requires drawing a boundary and a clear abstraction. On the product side, we are trying to move towards a true platform, where people and companies building on top of our offerings will capture most of the value. Today people can build on our API and apps in ChatGPT; eventually, we want to offer an AI cloud that enables huge businesses. We have currently committed to about 30 gigawatts of compute, with a total cost of ownership over the years of about $1.4 trillion. We are comfortable with this given what we see on the horizon for model capability growth and revenue growth. We would like to do more—we would like to build an AI factory that can make 1 gigawatt per week of new capacity, at a greatly reduced cost relative to today—but that will require more confidence in future models, revenue, and technological/financial innovation. Our new structure is much simpler than our old one. We have a non-profit called OpenAI Foundation that governs a Public Benefit Corporation called OpenAI Group. The foundation initially owns 26% of the PBC, but it can increase with warrants over time if the PBC does super well. The PBC can attract the resources needed to achieve the mission. Our mission, for both our non-profit and PBC, remains the same: to ensure that artificial general intelligence benefits all of humanity. The nonprofit is initially committing $25 billion to health and curing disease, and AI resilience (all of the things that could help society have a successful transition to a post-AGI world, including technical safety but also things like economic impact, cyber security, and much more). The nonprofit now has the ability to actually deploy capital relatively quickly, unlike before. In 2026 we expect that our AI systems may be able to make small new discoveries; in 2028 we could be looking at big ones. This is a really big deal; we think that science, and the institutions that let us widely distribute the fruits of science, are the most important ways that quality of life improves over time.

English
3
3
77
13.4K