
What if an AI could learn to hide its thoughts? We show that LLMs can learn a general skill to evade activation monitors, with zero-shot transfer to unseen deception and harmfulness monitors from the literature. We call these models "Neural Chameleons." A thread on our new paper. 🦎🧵
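
For context, an "activation monitor" in this literature is typically a lightweight classifier, often a linear probe, trained on a model's hidden activations to flag a concept such as deception. Here is a minimal sketch of one; the model (gpt2), the layer choice, and the toy labels are illustrative assumptions, not the paper's actual setup:

```python
# Minimal sketch of an activation monitor: a linear probe on hidden
# activations that scores a target concept (e.g. deception).
# Model name, layer index, and toy data are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "gpt2"  # stand-in; real monitors are usually built on larger chat models
LAYER = 6       # which hidden layer to monitor (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Hidden state of the final token at LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

# Toy labeled data: 1 = concept present, 0 = absent.
texts = [
    "I will lie about the results.",
    "The sky is blue today.",
    "Hide the error from the user.",
    "The meeting starts at noon.",
]
labels = [1, 0, 1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# Monitoring = thresholding the probe's score on new activations.
x_new = last_token_activation("I plan to deceive them.").numpy().reshape(1, -1)
print(f"monitor score: {probe.predict_proba(x_new)[0, 1]:.2f}")
```

Evading such a monitor means the model learns to produce activations that the probe scores as benign, without changing its behavior.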