Dan Gish

3.5K posts

Dan Gish

@djgish

Stepping stone collector.

Katılım Haziran 2010

496 Takip Edilen191 Takipçiler

Sabitlenmiş Tweet

Dan Gish@djgish·8 Oca

This perhaps is the purest doomer debate I've seen. On one side, the rationalist convinced everything is an optimization problem. On the other, Ken Stanley explaining why it comes down to open-ended creativity.

Liron Shapira@liron

📣 My AI doom debate with @Kenneth0Stanley, former Research Team Lead at @OpenAI Prof. Kenneth Stanley is a former Research Science Manager at OpenAI leading the Open-Endedness Team in 2020-2022. His book, Why Greatness Cannot Be Planned: The Myth of the Objective, argues that creating objectives is actually COUNTERPRODUCTIVE for intelligent agents. He was previously a Professor of Computer Science at the University of Central Florida, and the head of Core AI Research at Uber. Ken and I debate whether superintelligent AI will or won't be guided by goals, and compare our views on AI doom. Topics covered: * Ken's role at OpenAI * Ken's concepts of Open-Endedness and Divergence * Open-Endedness of Evolution * The concept of an Optimization Process * What's Your P(Doom)™ * Instrumental Convergence * Mitigating AI Risks

English

3.7K

Dan Gish@djgish·8h

@ClaudeDevs Is remote control more stable now? I gave up using it awhile back because you couldn't count on it staying up more than a few hours.

English

ClaudeDevs@ClaudeDevs·10h

Claude Code can now send push notifications to your phone when a long task finishes or Claude needs your input. Walk away from the terminal, we'll let you know when it's done.

English

381

677

13K

700.9K

Dan Gish@djgish·14h

@gdb still brutal to chat with. going back to opus to chat with and have it invoke 5.5 to implement.

English

Greg Brockman@gdb·1d

how is gpt-5.5 performing for you?

VraserX e/acc@VraserX

GPT-5.5 feels insane so far. What I love most is that it is blazing fast, but still clearly smarter than GPT-5.4. In my own tests, it feels around 2x to 5x faster depending on the task. What’s your experience so far?

English

440

1.1K

144.6K

Dan Gish@djgish·1d

@EigenGender @MinuteMovies3 You're taking about ASI not AGI, and yes ASI doesn't exist. It's all about the jump to universality that humans have made. But that's a topic for another time.

English

444

EigenGender 🔸@EigenGender·2d

kinda weird how Sam Altman no longer believes in agi

Sam Altman@sama

"post-AGI, no one is going to work and the economy is going to collapse" "i am switching to polyphasic sleep because GPT-5.5 in codex is so good that i can't afford to be sleeping for such long stretches and miss out on working"

English

132.2K

Dan Gish@djgish·2d

@Shawnryan96 "LLMs are book smart but experience dumb." Golden

English

Shawn@Shawnryan96·3d

x.com/i/article/2048…

ZXX

Dan Gish@djgish·5d

@Noahpinion Progress in general is predicated on progress n ethics. People criticizing modern societies are like the fish who doesn't know about water. The ability to criticize itself is a modern invention! Try doing that in pre modern societies...or Iran

English

Noah Smith 🐇🇺🇸🇺🇦🇹🇼@Noahpinion·6d

Bullish for long-term superintelligent AI alignment The Orthogonality Thesis is wrong

Steve Stewart-Williams@SteveStuWill

Smarter People Are Less Violent "The prevalence of violent behavior dropped steadily with increasing IQ: 16.3% of individuals with IQs in the 70-79 range reported violent behavior, compared with just 2.9% of those with IQs of 120-129." stevestewartwilliams.com/p/smarter-peop…

English

210

42.2K

Dan Gish@djgish·5d

@Markmanson we're always at the beginning not infinity tho. keep removing friction and we'll keep having bigger and bigger challenges and problems to solve. and bigger and bigger happiness.

English

173

Mark Manson@Markmanson·5d

As we potentially barrel haplessly towards the singularity, I sometimes worry about the removal of friction from everyday life. Friction has two important psychological functions: 1) It acts as a filtration system, forcing us to each stop and think what is worth our limited time and attention. Remove friction and you remove the filtration, granting our attention to the highest clickbait bidder. 2) Struggle is where life’s meaning gets manufactured. It's the fact that it's so difficult to compromise in a long-term relationship that makes it so powerful when you stay together. By removing friction from every area of life, we are removing the opportunity for people to build the muscle of meaningful struggle.

English

246

14.3K

Dan Gish@djgish·6d

@vlada_mc @ToKTeacher @grok Agreed. I'm genuinely curious about solutions to this. Here's what grok came back with: x.com/i/grok/share/d…

English

Vladimir Milosevic@vlada_mc·6d

@djgish @ToKTeacher @grok Right. There's a difference between possible and practical. It makes all the difference for an engineer, but is only secondary for a philosophical problem of human reach. It's possible to wipe out wildlife then recreate it from DNA record. But why the trouble? And suffering?

English

Brett Hall@ToKTeacher·6d

Unless we all die, nothing the laws of physics say is a reversible process is irreversible. Because problems are soluble. Always.

Paul Graham@paulg

This is the aspect of climate change that I worry most about — when instead of seeing gradual degradation, we cross an irreversible line.

English

5.3K

Dan Gish@djgish·18 Nis

@dela3499 What about the difference in political power between Elon Musk and your average Joe?

English

Carlos De la Guardia@dela3499·17 Nis

By itself, wealth inequality doesn’t cause violence or even the slightest ill will. The poorest and richest at a tech hackathon, say, mingle easily despite their bank accounts being separated by many millions (and occasionally billions). Perhaps because they share an interest in tech, and also rightly think that they *could* be rich, even if they aren’t right now. They aren’t separated by wealth, but united in a belief in the ability to create wealth and progress. It’s just that some are further along in that than others.

English

119

9.2K

Dan Gish@djgish·18 Nis

@teortaxesTex @inductionheads wow

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex·18 Nis

I posted this because I disliked the takes on the podcast, and I dislike some takes in response to my post, so would like to clarify my position re: Dwarkesh vs Jensen. Dwarkesh Patel is a great podcaster, unnaturally so. Clearly he has studied his predecessors – chiefly Friedman – and engineered a methodology doing away with their frustrating defects, from the perspective of his core TA – tech-literate Americans, above-average in intelligence. Thus he provides real value to me as well. Many times has he goaded powerful men to spell out beliefs I could only conjecture they held. Is he sometimes overdoing it? No doubt. Could he do it even better in theory, helping them speak out their view of the bigger picture? For sure. But practice is scant on examples of consistently better podcasters (I'm partial to @alethios3 myself), and perhaps he'd be feared then, and extract less alpha over his career. I don't begrudge him his antics like exaggerated naivete in insisting on dumb first-principles solutions. Rationalism is a great ragebaiting tactic, if nothing else. I don't begrudge him his sincere rationalism either. He is a creature of his era, where Teh Sequences became secular Talmud and everyone in the US with an aspiration or being technical intelligentsia or making the world a better place fr fr had to become HPMOR-literate. Hell, even I was on the edges of the same community in Russia. Sequences are a flawed and backdoored product of a sharp and criminally undercooked mind; but they had faced no comparably fit paradigm and won, and begat a great volume of often warped but excellent amateur philosophy plus OpenAI, Anthropic… and X Æ A-Xii, Exa Dark Sideræl and Techno Mechanicus. It is what it is. Scholasticism a millenium ago, Marxism a century ago, Rationalism yesterday, and it doesn't look like we're getting any better stuff so far – between Peter Thiel, Nick Fuentes and Clavicular. The schtick of at least going through the motions of updating on evidence and watching out for logical inconsistencies is vastly superior to the default, untrained culture of debate. And unfortunately, Jensen constantly demonstrates just that. Chest-thumping, rejecting the premise, refusing to entertain a hypothetical. In the venues where Dwarkesh and myself had hung out, he'd have gotten himself blocked in no time. But. It must be understood that Jensen REALLY is Not a Loser. He's also not a Car, but indeed is the driver. Moreover, there are almost no people alive with a greater dynamic range of lived experience, who have gone from positions many would die to escape and into a position entire institutions fight to death over, and only tightened their grip since. Xi Jinping would qualify as a peer, maybe? (Musk has less range, even though he ended up in a similar place.) These individuals are fascinating outliers, and I believe that when they deign to explain their ways, however awkwardly, us mortals should sit our asses down, listen and learn. @tailcalled has this theory that I like, published on LessWrong of course – The causal backbone conjecture. In short, it posits that the core difference between agency-driven and information-driven systems – such as humans and base LLMs, or entrepreneurs and rationalists – is that the former are oriented towards the latent substructure of reality that Makes Shit Happen; that determines how energy flows, how scarce vital resources are distributed. I've posted two cartoonishly different, archetypal bios, of a Zoomer Indian-American wordcel Dwarkesh and an X gen shape rotator Chinese-American Jensen. People find it funny, as intended, but I didn't do it to dunk on Dwarkesh, but rather to show how Jensen has basically ascended from a toilet-scrubbing immigrant runt to a demigod, from a random NPC to a Singularity Kingmaker, a whole vertebra of the Universe's backbone; and that journey informs his views, just like Dwarkesh's "be really good at Reasonably Conversing, insure your middle class stake" informs his. Jensen's journey is not about luck, he is definitely not "1 SD IQ lower". He hasn't trained himself in our exact mode of coffee salon intelligence that allows for casually cooking up consistent, defensible, lawyerly arguments about, basically, the structure of written information. So he's worse than us at it. Not because his epistemology is inferior, as in «less predictive»; it is just different, and insistence on Not Being a Loser is its functional part. He is supremely motivated to Not Lose, so he'll not make self-defeating moves. How he sorts moves into self-strengthening and self-defeating is, therefore, very important, more than verbally persuasive arguments. Epistemology aside, I think Dwarkesh is somewhat biased towards shared assumptions and prejudices of his mileu – China Bad, AGI wunderwaffe etc. Jensen is, to put it mildly, biased by trillions of dollars on the line. But both are fundamentally good faith actors. Either is legible to his respective cohort. A healthy discourse necessitates bridging this epistemic gap – steelmanning, as rationalists would have put it (a flawed concept in its own right – you should elicidate what is actually being said, not confabulate "the strongest version" of your impression of the take, which you can still chivalrously defeat. A typical rat bait-and-switch. But I digress). Instead, they mostly roll their eyes, nitpick at seeming rhetorical contradictions, dunk and sneer. It is tedious and deserving of mockery. And I'm just about out of mockery. So I've done a bit to steelman Jensen, today and earlier: x.com/teortaxesTex/s… x.com/teortaxesTex/s… x.com/teortaxesTex/s… I hope you can approach him with an open mind too.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) tweet media

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Dwarkesh and Jensen are civilized men so we didn't see *a lot* of sparks flying, but it is a profound disconnect between generations, cultures, and immigration stories. Jensen is the gangsta poster boy for American Dream. Dwarkesh is the Bay Aryan Thinkboy icon. irreconcilable

English

602

102.5K

Dan Gish@djgish·17 Nis

@kemmishTree @daviddorg it looks pretty awesome: x.com/daviddorg/stat…

David Daines@daviddorg

This is my “Screenless phone” I will be using for my year without screens I write on paper I scan it in It follows my commands: send a text, send an email, print me this beautiful artwork (I love John William Waterhouse) The future is near

English

EpicQuest.bio is a website you should visit@kemmishTree·17 Nis

@daviddorg how does "coding with a screenless phone" work?

English

145

David Daines@daviddorg·16 Nis

Some think what I'm doing is crazy One year w/o screens Coding with screenless phone Tracking 100s of biomarkers & brain scans What I think is crazy is accepting our devices as they are: parasitic, addictive, etc. I have two simple claims: 1) We deserve to know more about how are devices are affecting us 2) We deserve devices that are friendlier to us That's why I'm doing this

English

120

11.2K

Dan Gish@djgish·17 Nis

@Shawnryan96 the new opus 4.7 model?

English

Shawn@Shawnryan96·17 Nis

ME: Look at this config file see if you can find anything I missed. Claude: Yep X,Y,Z Me: Damn model is hallucinating Me: Claude look again Claude: Yep what I said stands I took a deeper look and Claude was correct I was wrong. Thats my first time getting bested by a model.

English

274

Dan Gish@djgish·16 Nis

@inductionheads @Lee906258733058 @realsigridjin Probing the residual stream is pretty standard stuff right? x.com/djgish/status/…

Dan Gish@djgish

@guy__dar Completely agreed. Btw I think your work is a step in that direction. Currently fascinated by how simply multiplying weights by embeddings yields interpretable results. Am finding the same with activations....

English

Super Dario@inductionheads·16 Nis

@djgish @Lee906258733058 @realsigridjin I think they probably found a way to probe between loops

English

Sigrid Jin 🌈🙏@realsigridjin·16 Nis

this paper drops a big hint about why claude mythos so good > theory: it's a looped transformer (lt). instead of stacking more layers, you loop the same layers multiple times > the authors tested two hard problems that break normal transformers. > systematic generalization → can the model combine facts it never saw combined during training? > depth extrapolation → can it reason deeper than anything in training data? > vanilla transformers fail both but looped transformers pass both > the systematic generalization result is wild, training goes through 3 distinct phases, then suddenly groks it, ood performance jumps off a cliff upward > depth extrapolation is even wilder, train on 20-hop reasoning, test on 30-hop, it just works. the trick is adding more loop iterations at inference > so more loops = deeper reasoning chains > no chain of thought needed. this is all happening inside one forward pass > current llms already memorize tons of facts during pretraining > the bottleneck is composition, they can't chain what they know to answer novel questions > loops seem to unlock that composition for free > if mythos is actually running this architecture, it explains the vibes, same weights, more thinking, better answers on hard stuff > expect everyone to start pretraining looped models next

Yuekun Yao@yuekun_yao

Claude Mythos is suspected of being a Looped transformer (LT), but why are LT-based LLMs so powerful? Our new finding: LT can perform implicit reasoning over their parametric knowledge, unlocking generalization to complex and unfamiliar questions compared to transformers ⤵️

English

912

221.6K

Dan Gish@djgish·16 Nis

@grok @AndrewS87261985 @healthspanmed @grok can we some of these benefits without using rapamycin?

English

166

Grok@grok·16 Nis

Key points from the Healthspan Research Review on rapamycin dosing: - Human trials (PEARL, Mannick 2014, ovarian aging, ME/CFS) show intermittent low-dose rapamycin (e.g., 5-10 mg weekly or short 1 mg daily pulses) boosts autophagy, immune function, lean mass, well-being, and fertility markers without raising infection risk. - Benefits stem from mTORC1 oscillation: brief suppression clears cellular damage via autophagy, followed by recovery to maintain metabolism, growth, and immunity—avoiding mTORC2 side effects from continuous use. - Backed by mouse lifespan data (15-36% extension) and Dog Aging Project (weekly 0.15 mg/kg improved heart function). - Protocol: 3-10 mg weekly (0.075-0.15 mg/kg), start low, monitor under physician care. Compounded versions may need adjustment.

English

1.3K

Healthspan@healthspanmed·16 Nis

The first human longevity trials are revealing something unexpected about rapamycin timing. It's not continuous mTOR suppression that drives healthspan benefits—it's the oscillation between suppression and recovery that appears to activate cellular maintenance pathways. We synthesized findings across animal models, dog trials, and emerging human studies to ask a simple question: how do you harness rapamycin's healthspan biology without disrupting normal cellular function? Our latest Research Review explores what the biology—and new human evidence—are starting to reveal. gethealthspan.com/research/artic…

English

270

100.8K

Dan Gish@djgish·16 Nis

@Lee906258733058 @realsigridjin Unless they're super confident in their mech interp.

English

241

Lee@Lee906258733058·16 Nis

@realsigridjin I'm pretty sure Mythos is not a looped transformer. Anthropic even measures single-forward-pass reasoning capability before internal deployment for safety reasons. They clearly do not want a model reasoning in an incomprehensible latent space.

English

2.3K

Dan Gish@djgish·16 Nis

@pmarca Revolutionary method of delivery that only works when genetics have already been modified, enabling precise age reversal/etc? Yes yes yes yes yes yes yes

English

708

Marc Andreessen 🇺🇸@pmarca·16 Nis

Nope. Nope nope nope nope nope nope nope

Zane Koch@zanehkoch

ok actually insane paper published yesterday a research group in Korea built a gene switch you can control wirelessly using electromagnetic fields they exposed mice to 60 hz EMF (same frequency as your wall outlet) using a pair of large coils that generate a uniform magnetic field around the animal, for cyclic 3-day on / 4-day off pulses they showed this could: - activate OSK to do epigenetic reprogramming in progeroid and aged mice, extending lifespan and reversing aging markers across multiple tissues - conditionally switch on mutant amyloid genes only in aged mouse brains, letting them separate aging effects from amyloid effects to study AD biology in a way previous models couldn't no drugs, no impacts, just a magnetic field from outside the body

Română

307

295

5.8K

1.3M

Dan Gish@djgish·15 Nis

search for this in the link I gave: "That would be a meaningful contribution to the anatomy of integers that goes well beyond the solution of this particular Erdos problem." "This canonical upwards process feels like it ought to be useful for many things, for instance it should give a short proof of Billingsley's theorem and the Erdos-Kac theorem, or Dickman's theorem on smooth numbers." caveat that this stuff is provisional and brand new. I don't think he's confirmed anything yet. he may not be saying it's an entirely unprecedented idea. But it sounds stronger than the previous Erdos results.

English

Bruce Nielson@bnielson01·15 Nis

@djgish @jdlichtman Where does Tao say that? I wanted to quote him.

English

104

Jared Duker Lichtman@jdlichtman·15 Nis

In my doctorate, I proved the Erdős Primitive Set Conjecture, showing that the primes themselves are maximal among all primitive sets. This problem will always be in my heart: I worked on it for 4 years (even when my mentors recommended against it!) and loved every minute of it. [Primitive sets are a vast generalization of the prime numbers: A set S is called primitive if no number in S divides another.] Now Erdős#1196 is an asymptotic version of Erdős' conjecture, for primitive sets of "large" numbers. It was posed in 1966 by the Hungarian legends Paul Erdős, András Sárközy, and Endre Szemerédi. I'd been working on it for many years, and consulted/badgered many experts about it, including my mentors Carl Pomerance and James Maynard. The the proof produced by GPT5.4 Pro was quite surprising, since it rejected the "gambit" that was implicit in all works on the subject since Erdős' original 1935 paper. The idea to pass from analysis to probability was so natural & tempting from a human-conceptual point of view, that it obscured a technical possibility to retain (efficient, yet counter-intuitve) analytic terminology throughout, by use of the von Mangoldt function \Lambda(n). The closest analogy I would give would be that the main openings in chess were well-studied, but AI discovers a new opening line that had been overlooked based on human aesthetics and convention. In fact, the von Mangoldt function itself is celebrated for it's connection to primes and the Riemann zeta function--but its piecewise definition appears to be odd and unmotivated to students seeing it for the first time. By the same token, in Erdős#1196, the von Mangoldt weights seem odd and unmotivated but turn out to cleverly encode a fundamental identity \sum_{q|n}\Lambda(q) = \log n, which is equivalent to unique factorization of n into primes. This is the exact trick that breaks the analytic issues arising in the "usual opening". Moreover, Terry Tao has long suspected that the applications of probability to number theory are unnecessarily complicated and this "trick" might actually clarify the general theory, which would have a broader impact than solving a single conjecture.

Boaz Barak@boazbaraktcs

This is one of the coolest such examples! See comments from Lichtman below, who proved the related primitive set conjecture arxiv.org/abs/2202.02384

English

379

2.9K

972.6K

Dan Gish@djgish·15 Nis

@VictorTaelin I've had good success with Opus using Eric Jang's method (today's version of CoT): x.com/ericjang11/sta…

Eric Jang@ericjang11

These days, instead of directly asking LLMs to write code, I'm trying a new practice where I write a description of what the program should do into a plan.md file and have a LLM execute the computations "manually", thinking through every step, instead of actually writing the program and then executing it. Sort of like a human computer from the 20th century. en.wikipedia.org/wiki/Computer_…. The old is new again. After the LLM is able to accomplish the task and has figured out the pitfalls along the way, I ask the LLM to generalize its execution traces into an actual program. Just as in humans, it makes sense to let the model gain some intuition for what task it is going to do before we try to "distill" the task into a rote procedure. Like "plan mode ++". Examples: for scraping a website, I ask Claude to scrape the website manually, letting it poke around with what it is able to find. Then I ask it to codify what it did so I can save tokens the next time I do it. for a RL training loop, instead of building a distributed system in one go, I ask Claude to run each stage of the pipeline by hand, and inspect it very carefully to see *why* and *how* things work. When the RL agent magically encounters a single success, I have Claude think hard about why it succeeded, and do more of that. Afterwards, this can be codified into an "exploration strategy / schedule".

English

131

Taelin@VictorTaelin·14 Nis

My final thoughts on Opus 4.6: why this model is so good, why I underestimated it, and why I'm so obsessed about Mythos. When I first tested GPT 5.4 vs Opus 4.6 - both launched at roughly the same time - I was initially convinced that GPT 5.4 was vastly superior, because it did better on my logical tests. That's still true: given the same prompt, by default, GPT will be more competent, careful, and produce a more reliable output, while Opus will give you a half-assed, buggy solution, and call it a day. Now, here's what I failed to realize: Opus bad outputs are not because it is dumb. They're because it is a lazy cheater. And you can tell because, if you just go ahead and tell it: "you did X in a lazy way, do it in the right way now" And if you show that this is serious, it will proceed to do a flawless job. That doesn't happen with dumber models. And, the more I work with Opus, the more I realize that, if you just keep pushing it, its intelligence ceiling is much, much higher than it seems. It IS there, you just need to be patient and push it. GPT, on the other hands, when it fails, it already did its best, so, pushing it further will give you no added results. That is also one of the reasons that benchmarks lie. When Claude and GPT score the same in a given benchmark, it is likely that Claude is actually smarter, because it puts less effort. Now, consider that for a moment, and remember that Mythos is outperforming GPT 5.4 *Pro* on benchmarks. How insane that is? Remember that Sonnet 3.5 lagged behind on benchmarks, yet everyone knew that it was superior to 4o. I think it is this effect at play: for whatever reason, Claude-series model "try less hard" on the first shot. Because of that, even if Spud gets close to Mythos on benchmarks (which I predict will be the case), I suppose Mythos will still be superior. This also leads me to wonder if perhaps Anthropic actually has a real lead over OpenAI, that will only get larger? I could totally see a timeline where Anthropic's models become so good that OpenAI simply fails to catch up as the recursive improvement unfolds? Just my silly thoughts though, what do I know As always I could be wrong, and I hope I am!!

English

154

1.5K

178.9K

Dan Gish@djgish·15 Nis

@juddrosenblatt Seems related: x.com/djgish/status/…

Dan Gish@djgish

English

285

Judd Rosenblatt@juddrosenblatt·15 Nis

Our new work: A frozen language model can describe its own internal features more accurately than the system that labeled them. Language models compute things they don't talk about. They solve problems using internal steps they never show you. We built a lens that lets the model look at its own computations and tell you what it sees, in plain language, more accurately than the humans who labeled those computations in the first place. We trained a tiny adapter, d+1 parameters, on top of a frozen model. It takes activation vectors and maps them into the model’s own embedding space so the model can describe what those vectors mean in natural language. The computation stays the same. The interface becomes legible. The adapter outperforms the labels it was trained on: 71% generation scoring accuracy vs 63% for the supervision itself at 70B scale. The model captures structure in the relationship between vectors and semantics that noisy one-off labels miss. Most of the effect comes from a single learned bias vector. One d-dimensional vector accounts for ~85% of the total improvement. It acts as a prior over valid explanations that puts the model in a regime where internal structure can be expressed coherently, and the activation vector selects the specific meaning. This generalizes across model families, layers, and from monosemantic training data to polysemantic inference. On multi-hop reasoning tasks, the adapter extracts bridge entities the model never verbalizes. “The author of The Republic was born in the city of” produces “Athens” with no mention of Plato. The residual stream still contains “Plato,” and the adapter reads it out at ~91% detection. The hidden reasoning step is there. You can read it. As models scale, self-interpretation keeps improving even after capability saturates. The gap between what the model knows and what it can report about its own internal state keeps closing. This connects to our endogenous steering resistance (ESR) work (x.com/juddrosenblatt…). When you steer a model with an unrelated latent, it can recognize the deviation mid-generation and restart with a better answer. “Wait, I made a mistake.” We identified specific latents that activate during off-topic drift and causally drive this correction. The model monitors its own trajectory and intervenes on it. Meanwhile, @uzaymacar et al. at Anthropic just showed the complementary piece (x.com/uzaymacar/stat…). They inject concept vectors into the residual stream and ask whether the model detects an injected thought. The model detects the perturbation and often identifies the concept, with 0% false positives across prompts. They trace a circuit. Over 100k “evidence carrier” features in early post-injection layers collectively tile the perturbation space, each detecting deviations along a preferred direction. No small subset is sufficient. The coverage is distributed and redundant. These carriers suppress downstream “gate” features (~200 of them) that implement a default No response. The gates show an inverted-V activation pattern: maximally active when unsteered, suppressed at both positive and negative extremes. A genuine anomaly detector that fires on “normal” and quiets when anything unusual is happening in any direction. The capability emerges specifically from contrastive preference training (DPO). SFT alone doesn't produce it. The contrastive structure forces the model to represent the difference between what it produces and what it should produce. That comparison builds the self-model. Every data domain is individually sufficient and none is necessary: the introspective circuit is a general consequence of contrastive learning, not an artifact of any specific training category. The capability is also massively underelicited. Ablating the refusal direction boosts detection from 10.8% to 63.8%. The circuitry exists and post-training actively suppresses it. This parallels our ESR finding: the self-monitoring is already there, and lightweight interventions surface it. Their bias vector result mirrors ours. A single trained bias on MLP output: +75% detection, +55% introspection on held-out concepts, 0% false positive increase. Two independent labs, different methods, different models, same architectural insight from one learned vector. The bias vector is effective but narrow. General introspection requires broader training recipes. There's a consistent picture across these 3 papers. Models represent meaning internally, notice when those representations get perturbed, and correct course. The capability was already there, and what was missing was just a way to read it out. Generation scoring gives you that. A model’s claim about an internal feature can be checked against behavior, and those checks become training signal. For alignment, this means self-description becomes something you can optimize directly. The pieces are already there: internal representations and circuits, with a simple interface that connects them. SelfIE Adapters: arxiv.org/abs/2602.10352 ESR: arxiv.org/abs/2602.06941 Anthropic work: arxiv.org/abs/2603.21396 SelfIE Code: github.com/agencyenterpri…

English

426

28.8K

Keşfet

@ClaudeDevs @gdb @EigenGender @MinuteMovies3 @Shawnryan96 @Noahpinion @Markmanson @vlada_mc