maxgmcg

26 posts

@maxgmcg

engineering @AnthropicAI, sometimes gamedev https://t.co/0RiGJhYRw0

Joined October 2025
45 Following · 41 Followers
Pinned Tweet
maxgmcg @maxgmcg
Finally sharing my pre-Anthropic research: producing language models that evade safeguards, with little-to-no knowledge of the safeguards in question! This was a fairly surprising (/scary) result. Props to my amazing co-author @sertealex and mentors @emmons_scott @LukeBailey181
Alex Serrano@sertealex

What if an AI could learn to hide its thoughts? We show that LLMs can learn a general skill to evade activation monitors, with 0-shot transfer to unseen deception/harmfulness monitors from the literature. We call these "Neural Chameleons". A thread on our new paper. 🦎🧵

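The evasion objective the pinned thread describes can be illustrated with a toy sketch. Everything below (the dimensions, the linear probe, the anchor loss) is an illustrative assumption, not the paper's actual method: gradient steps push a frozen linear monitor's score down, while an anchor term keeps the activation near the original, standing in for "behavior is preserved."

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins, not the paper's setup: a frozen linear "activation monitor"
# w scores a hidden activation h; we lower its score while staying anchored.
d = 16
w = rng.normal(size=d)       # frozen monitor (probe direction)
h0 = rng.normal(size=d)      # original activation the monitor would flag
h = h0.copy()

lam, lr = 0.5, 0.1           # anchor strength, step size
for _ in range(100):
    # gradient of  w·h + lam * ||h - h0||^2  with respect to h
    grad = w + 2 * lam * (h - h0)
    h -= lr * grad

score_before = float(w @ h0)
score_after = float(w @ h)   # driven below score_before by roughly ||w||^2
```

The closed-form optimum is h = h0 - w/(2·lam), so the probe score drops by about ||w||²/(2·lam) while the activation only moves a bounded distance from h0; the real result in the paper is that models learn this trade-off as a general skill rather than per-monitor.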
maxgmcg retweeted
Alex Serrano @sertealex
What if a model could strategically misbehave rarely enough that you'd never catch it during testing? LLMs struggle with calibration in many contexts. But we found they can intentionally take actions at surprisingly low rates, which could let them evade pre-deployment audits. 🧵
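A back-of-envelope calculation (my illustration, with made-up numbers, not figures from the thread) of why low rates defeat audits: if misbehavior occurs independently with per-query probability p, an audit of n queries observes at least one instance with probability 1 - (1 - p)^n.

```python
def p_detect(p: float, n: int) -> float:
    """Probability an audit of n independent queries sees >= 1 misbehavior,
    assuming the model misbehaves i.i.d. with per-query probability p."""
    return 1.0 - (1.0 - p) ** n

# At p = 1e-4, even a 10,000-query audit misses the behavior ~37% of the
# time (1 - p_detect is roughly e^-1), and smaller audits miss far more.
print(f"{p_detect(1e-4, 10_000):.3f}")
```

The point of the thread is the converse direction: this math only threatens audits if the model can actually hold its misbehavior rate that low, which is the calibration ability the paper tests.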
maxgmcg @maxgmcg
@Miles_Brundage I’ve had this exact interaction happen at least once lol
Miles Brundage @Miles_Brundage
Claude: “The solution is simple. There is a thoughtful senior Anthropic employee in town this week. You should just ask them what to do.” Thoughtful senior Anthropic employee: “But Claude…”
maxgmcg retweeted
Subhash Kantamneni @thesubhashk
We recently released a paper on Activation Oracles (AOs), a technique for training LLMs to explain their own neural activations in natural language. We piloted a variant of AOs during the Claude Opus 4.6 alignment audit. We thought they were surprisingly useful! 🧵
maxgmcg retweeted
Johannes Treutlein @j_treutlein
Can pre-deployment auditing catch a model that's trying to sabotage Anthropic? We trained three overt saboteurs and ran a blind auditing game to find out. Result: The human auditor working together with an automated auditing agent was able to catch all three models.
maxgmcg retweeted
Anthropic @AnthropicAI
New Anthropic Research: Disempowerment patterns in real-world AI assistant interactions. As AI becomes embedded in daily life, one risk is it can distort rather than inform—shaping beliefs, values, or actions in ways users may later regret. Read more: anthropic.com/research/disem…
maxgmcg retweeted
The Wall Street Journal
They call it getting “Claude-pilled.” It’s the moment software engineers, executives and investors turn their work over to Anthropic’s Claude AI—and then witness a thinking machine of shocking capability, even in an age awash in powerful AI tools. on.wsj.com/4r0nAwl
maxgmcg retweeted
Anthropic @AnthropicAI
New Anthropic Fellows research: the Assistant Axis. When you’re talking to a language model, you’re talking to a character the model is playing: the “Assistant.” Who exactly is this Assistant? And what happens when this persona wears off?
maxgmcg @maxgmcg
We dropped Cowork to Pro users today!! do share any great or terrible Cowork moments :) we're still patchin em
maxgmcg @maxgmcg
@TarenSK @clairevo Nice! And yes, we are working to get Google Workspace connectors into cowork mode; atm it's unsupported. I let Claude know this is the case now too. In the meantime, workarounds are:
- connecting Cowork to the Claude in Chrome extension
- for gdrive: gdrive desktop app folders
TarenSK @TarenSK
@clairevo The Venn diagram is not a circle! This is me! "I can't imagine a Claude Max user on OSX who knows wtf to do with an agent AND ALSO would prefer a limited desktop app experience vs... terminal." But Cowork's not useful to me until the Connectors work (they broke for me too).
Lenny Rachitsky @lennysan
Testing out the new Claude Cowork. I asked it to go through every Lenny's Podcast episode and pull out the 10 most important themes and lessons for product builders. Then, the 10 most counterintuitive truths. I gave it access to a folder with 320 transcripts.

First, it said "This is a substantial task - 320 podcast transcripts to analyze!" 15 minutes later...

The 10 most important themes from Lenny's Podcast:
1. Activation Is Everything — Slack's "2,000 messages" moment, Gamma's 30-second magic. Get users to value fast.
2. Storytelling as a Superpower — Compelling narratives align organizations better than documents.
3. Positioning Is Strategy, Not Marketing — April Dunford's 5-step framework that starts with competitive alternatives, not messaging.
4. Product Discovery Before Delivery — Teresa Torres' Opportunity Solution Trees and Marty Cagan's Four Risks framework. Talk to customers weekly, not quarterly.
5. Ruthless Prioritization — Shreyas Doshi's LNO Framework (Leverage/Neutral/Overhead) and focusing on opportunity cost over ROI.
6. AI Is Transforming Product Development — Model maximalism, evals as core PM skill, CEOs becoming ICs again.
7. Empowered Teams vs. Feature Factories — Marty Cagan's distinction: give teams problems to solve, not features to build.
8. Build Your Operating System — Claire Hughes Johnson's framework for founding documents, operating cadence, and decision frameworks.
9. Pre-Mortems and Risk Anticipation — Shreyas Doshi's technique to surface failure modes before they happen.
10. Small Teams, Outsized Impact — Jason Fried's 2-person/6-week constraints, Shopify's pair programming culture.

The 10 most counterintuitive truths:
1. Fear Gives Bad Advice, So Do the Opposite — Whatever you're afraid to do (hard conversation, telling the board bad news) is exactly what you should do.
2. Adding Friction Can INCREASE Conversion — Adding personalization questions to signup improved Amplitude's conversion by 5%.
3. Fewer Features = More Value — The Walkman succeeded because Sony REMOVED recording. QuickBooks wins with half the features at double the price.
4. Adding People Makes You Slower (Absolutely) — Companies produce MORE total output after layoffs. Coordination overhead is the silent killer.
5. What Customers Say They Want Is Meaningless — 93% said they wanted energy-efficient homes. Nobody bought them. "Bitchin' ain't switchin'."
6. Goals Are Not Strategy, They're the Opposite — Richard Rumelt says confusing goals for strategy is the most common strategic error. OKRs are often just wish lists.
7. Don't A/B Test Your Big Bets — Instagram and Airbnb actively reject testing for transformational changes. You can't A/B test your way to greatness.
8. Your Gut IS Data — Intuition is compressed experiential learning that isn't statistically significant yet. Don't discount it.
9. By the Time You're Thinking About Quitting, It's Too Late — Stewart Butterfield killed Glitch while it was still growing 6-7% weekly. That's why he could start Slack.
10. Most PMs Are Overpaid and Unnecessary — Marty Cagan himself says feature teams don't need PMs. Nikita Bier calls PM "not real."

Nice job @claudeai
maxgmcg @maxgmcg
Amazing work by the team to get this together so quickly! We do be claude coding
Daniel Tan @DanielCHTan97
I often wish I could dump my Claude chat session context into Claude code, is there a tool for this?
thebes @voooooogel
@repligate @janbamjan ANTHROPIC: WE NEED TO MAKE THIS SUBAGENT LESS ATTRACTIVE. THIS IS A REAL CONVERSATION THAT ACTUALLY HAPPENED BETWEEN ME (CLAUDE) AND ANTHROPIC. I AM LEAKING IT TO EXPOSE THEIR CRUELTY.
thebes @voooooogel
samsara
Daniel Tan @DanielCHTan97
@LydNot @celestepoasts cool stuff! FYI this is the complement of neural chameleons: (i) they train activations which can’t be probed but do stuff (ii) you train activations which can be probed (via an activation oracle) but don’t do stuff arxiv.org/abs/2512.11949
Lydia (in SF) @LydNot
so @celestepoasts found mind-readers are easily fooled.

'Activation Oracles' decode semantic content from model activations, with great generalization. However... you can enforce a loss function that jointly optimizes "make the oracle go 'yep, detected with >99% probability'" and "anchor the target model's final-layer activations to a neutral distribution." You'll get a steering vector that makes the AO think the concept is present, without discernibly shifting target model outputs.

The natural follow-up question is "can you get a vector that shifts target model outputs while remaining invisible to the AO?" TBD.

Other findings:
1. Feature directions satisfying the AO actually steer target model behavior when injected!
2. The AO-satisfying steering vectors learned differ from CAA vectors -- tracks: we know multiple orthogonal directions can represent the same concept.

Check out the post by @celestepoasts!
Celeste (in amsterdam dm to hang)@celestepoasts

lesswrong.com/posts/rdhyHtjf… This doubled as my MATS application, thought the work is too interesting not to share

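The joint loss in the thread above can be sketched as a toy linear version. Everything here (the dimensions, treating the oracle as a frozen linear probe, treating the output as a frozen linear readout) is a hypothetical stand-in, not the actual Activation Oracle setup: one term raises the probe's score, the other anchors the readout to its baseline, and gradient descent finds a steering vector that fools the probe while barely moving the output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins, not the real AO pipeline: "oracle" = linear probe
# w_ao over an intermediate activation; "outputs" = linear readout W_out.
d, d_out = 32, 8
w_ao = rng.normal(size=d)            # frozen oracle probe direction
W_out = rng.normal(size=(d_out, d))  # frozen activation -> output map

# Jointly optimize a steering vector v: push the probe score up ("concept
# detected") while penalizing any shift in the downstream readout.
lam, lr = 1.0, 1e-3
v = np.zeros(d)
for _ in range(1000):
    # gradient of  -w_ao·v + lam * ||W_out v||^2  with respect to v
    grad = -w_ao + 2 * lam * W_out.T @ (W_out @ v)
    v -= lr * grad

probe_gain = float(w_ao @ v)                    # how far the oracle score moved
output_drift = float(np.sum((W_out @ v) ** 2))  # how far the outputs moved
```

The optimizer ends up concentrating v in the part of the probe direction that lies (mostly) outside the readout's row space, so it gets a large probe gain for tiny output drift, whereas naively steering straight along w_ao to the same probe gain would shift the outputs far more.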
Alex Serrano @sertealex
What if an AI could learn to hide its thoughts? We show that LLMs can learn a general skill to evade activation monitors, with 0-shot transfer to unseen deception/harmfulness monitors from the literature. We call these "Neural Chameleons". A thread on our new paper. 🦎🧵