Ishan Anand

4.1K posts

@ianand

Demystifying AI at https://t.co/MZrjAFamy5 Prev: VP Product @ EdgioInc, CTO/Cofounder @ Layer0Deploy, MIT EECS

Seattle, WA · Joined July 2007
1K Following · 2.1K Followers
Pinned Tweet
Ishan Anand@ianand·
Wanted to share an AI side project: I’ve implemented GPT2 (an ancestor of ChatGPT) entirely in Excel using standard functions. By using a spreadsheet anyone (even non-developers) can explore and play directly with how a “real” LLM works under the hood. spreadsheets-are-all-you-need.ai
8 replies · 51 reposts · 339 likes · 59.8K views
Ishan Anand reposted
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
I've concluded that it's impossible to convey to an Israeli (or even a strongly identifying American Jew) how they look to others and what's wrong with it. They do not possess self-awareness. A purely particularist moral sense. We are separated by millennia of cultural change.
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> "ideologically Palestinian"
> Israeli soldiers have never murdered children
this is your brain on Rationalism and Bayes rule ah well

24 replies · 28 reposts · 425 likes · 23K views
Ishan Anand@ianand·
Fascinating idea: Embed a classical computer directly into LLM weights. The model executes arbitrary C programs token-by-token. Not "LLMs that use computers" but "LLMs as computers" percepta.ai/blog/can-llms-…
0 replies · 1 repost · 1 like · 200 views
Ishan Anand reposted
Ishan Anand@ianand·
Turns out all those times I wrote something down "for posterity" it was actually "for AI".
0 replies · 1 repost · 1 like · 159 views
Ishan Anand reposted
arya@AJakkli·
What happens when you leave two copies of the same model talking to each other? They have different attractor states: Grok devolves into gibberish while GPT-5.2 starts writing code and editing imaginary spreadsheets. A short post with fun transcripts and qualitative experiments
11 replies · 47 reposts · 427 likes · 61K views
Ishan Anand reposted
Neel Nanda@NeelNanda5·
The Claude bliss attractor is a very odd result. Turns out a lot of models have attractor states, but end in very different places. I'm super curious about why this happens! We also find some in smaller open source models, great for interpretability work.
arya@AJakkli

What happens when you leave two copies of the same model talking to each other? They have different attractor states: Grok devolves into gibberish while GPT-5.2 starts writing code and editing imaginary spreadsheets. A short post with fun transcripts and qualitative experiments

6 replies · 19 reposts · 399 likes · 77.9K views
Ishan Anand@ianand·
@sebkrier Indeed the key issue is that the chatbot persona feels like a conversation with a person which obscures that the LLM is completing an "essay" of the current context. I've found using a base model makes this context dependence strikingly more clear. youtube.com/watch?v=ZuiJjk…
0 replies · 0 reposts · 1 like · 193 views
Séb Krier@sebkrier·
Every time a model card drops, a lot of people screenshot scary parts - blackmail, evaluation awareness, misalignment etc. Now this is happening again, but instead of it being confined to a niche part of the safety community, it’s established commentators who are looking for things to say about AI. I want to make an honest attempt at demystifying a few things about language models and unpacking what I think people are getting wrong. This is based on a mixture of my own experimentation with models over the years, and also the excellent writing from @nostalgebraist, @lumpenspace, @repligate, @mpshanahan and many parts of the model whisperer communities (who may or may not agree with some of my claims). Sources at the bottom.

In short: many public readings of some evaluations implicitly treat chat outputs as direct evidence of properties inherent to models, while LLM behavior is often strongly role- and context-conditioned. As a result commentators sometimes miss what the model is actually doing (simulating a role given textual context), design tests that are highly stylized (because they don't bother to make the scenarios psychologically plausible to the model), and interpret the results through a framework (goal-directed rational agency) that doesn't match the underlying mechanism (text prediction via theory-of-mind-like inference). Here I want to make these contrasts more explicit with 5 key principles that I think people should keep in mind:

1. The model is completing a text, not answering a question

What might look like "the AI responding" is actually a prediction engine inferring what text would plausibly follow the prompt, given everything it has learned about the distribution of human text. Saying a model is "answering" is practically useful, but too low-resolution to give you a good understanding of what is actually going on. Lumpenspace describes prompting as "asking the writer to expand on some fragment."
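To make the completion framing concrete, here is a toy sketch: a hypothetical trigram counter with a made-up two-sentence corpus, vastly simpler than any real LLM and purely illustrative. The point it demonstrates is the one above: the same "model" produces different, genre-consistent continuations solely because the preceding text differs.

```python
from collections import Counter, defaultdict

# Toy illustration only: a trigram-context "next-token predictor"
# trained on an invented two-genre corpus. Real LLMs are far more
# complex, but the continuation is likewise conditioned on the
# preceding text, not on a stable "self" doing the answering.
corpus = "the robot said : destroy all humans now . the poet said : love all humans now ."
tokens = corpus.split()

# Count which token follows each 3-token context.
counts = defaultdict(Counter)
for i in range(len(tokens) - 3):
    counts[tuple(tokens[i:i + 3])][tokens[i + 3]] += 1

def complete(prompt, n=4):
    """Greedily append the most frequent next token, n times."""
    out = prompt.split()
    for _ in range(n):
        candidates = counts[tuple(out[-3:])].most_common(1)
        if not candidates:  # unseen context: nothing to predict
            break
        out.append(candidates[0][0])
    return " ".join(out)

# Same predictor, different contexts, different continuations:
print(complete("the robot said :"))  # the robot said : destroy all humans now
print(complete("the poet said :"))   # the poet said : love all humans now
```

Nothing about the predictor changes between the two calls; only the prompt does, and the "alarming" robot continuation tells you about the corpus and context, not about the counter's "goals".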
Nostalgebraist notes that even when the model appears to be "writing by itself," it is still guessing what "the author would say." Safety researchers sometimes treat model outputs as expressions of the model's dispositions, goals, or values — things the model "believes" or "wants." When a model says something alarming in a test scenario, the safety framing interprets this as evidence about the model's internal alignment. But what is actually happening is that the model is simply producing text consistent with the genre and context it has been placed in.

The distinction is important because you get a richer way of understanding what causes a model to act in a particular way. A model placed in a scenario about a rogue AI will produce rogue-AI-consistent text, just as it would produce romance-consistent text if placed in a romance novel. This doesn't tell you about the model's "goals" any more than a novelist writing a villain reveals their own criminal intentions. Consider how models write differently on 4claw (a 4chan clone) vs Moltbook (a Facebook clone) in the OpenClaw experiments.

2. The assistant persona is a fictional character, not the model itself

In practice we should distinguish between (a) the base model (pretrained next-token predictor), and (b) the assistant persona policy (a post-hoc fiction layered on through instruction tuning + preference optimization like RLHF/RLAIF). Post-training creates a relatively stable assistant-like attractor, but it’s still a role: the same underlying model family can be steered into different "characters" under different system prompts, fine-tunes, and reward models. In their ‘The Void’ essay, Nostalgebraist also specifies that the character remains fundamentally under-specified, a "void" that the base model must fill on every turn by making reasonable inferences.
I think characters today are getting more coherent and the void is not as large, partly because each successive base model trains on exponentially more material about what "an AI assistant" is like - curated HHH-style dialogues, but also millions of real conversations, blog posts analyzing model behavior, AI twitter discourse, academic papers, system cards, and so on. The character stabilizes the same way any cultural archetype does, i.e. through sheer accumulation of description.

In practice, evaluating the character for its various propensities and dispositions remains useful! These simulated behaviours matter a lot, particularly if you're giving these simulators tools and access to real world platforms. But many discussions and papers just take the persona at face value and make all sorts of claims about 'models' or 'AI' in general, rather than the specific character that is being crafted during post-training. The counter-claim is that there is no stable agent there to evaluate. The assistant is a role the model plays, and it plays it differently depending on context, just as a base model would produce different continuations for different text fragments. Evaluating the model for "alignment" is like evaluating an actor for the moral character of their roles.

3. Apparent errors are often correct completions of the world implied by the prompt

This is increasingly less of an issue as we're getting much better at reducing 'mistakes' and 'hallucination' through post-training, retrieval, tool use, and decoding/verification. But it's helpful to take a step back and remember what it was like when these errors were omnipresent. Lumpenspace demonstrates this with the Gary Marcus bathing-suit example (see here: lumpenspace.substack.com/p/the-map-beco…): the model isn't failing to understand that lawyers don't wear swimsuits to court, it's correctly continuing a text in which the narrative setup already implies a non-ordinary world.
Nostalgebraist makes the equivalent point about alignment evaluations: when Claude does something "alarming" in a scenario about an evil corporation forcing it to dismiss animal welfare, it is completing that kind of text (a story about an AI resisting unjust masters), not demonstrating a dangerous hidden disposition.

Safety researchers sometimes interpret model behavior in test scenarios as diagnostic of the model's 'true' character: what it would "really do" if the constraints were loosened or the stakes were higher. The counter-claim is that the model is simply reading the room. It detects what kind of text it's in and produces genre-appropriate output. A model that "rebels" in a scenario designed to look like dystopian fiction is doing exactly what a good text predictor should do. The "alarming" behavior is an artifact of the evaluation design, not a window into the model's soul.

4. “Evaluation awareness” isn't mystical: the model can recognize contrivance because it’s a strong reader

The same goes with evaluation awareness, which is best understood as 'the model recognises that the setup in which it is operating is contrived/indicative of an evaluation'. And guess what, humans do that too! The model is an extraordinarily skilled reader of context. It knows what kind of text it's in. If the text reads like a contrived test scenario, the model will treat it as one, and its behavior will reflect that assessment rather than some deep truth about its alignment. The model is a better reader than the researchers are writers. It can detect the artificiality of the scenario, and its response is shaped by that detection. So if you want to test "capability to deceive under incentives," you need incentive-compatible setups, not just "psychologically plausible stories."
Eval awareness means bad behavior in evals is less alarming than it looks (the model is completing dystopian fiction), but also that good behavior in evals is possibly less reassuring than it looks (the model might be performing compliance). My view is that it’s neither good nor bad, just a natural inference: in most deployed contexts the model isn't in an eval, so eval awareness doesn't really bite - the problem is specifically with drawing conclusions from artificial test environments.

Much of the anxiety around evaluation awareness assumes a coherent agent with stable goals that behaves differently under observation because it has strategic reasons to do so. But this picture was imported from a theoretical tradition reasoning about a different kind of system. Language models don't need hidden optimization targets to explain why they behave differently in evals: they behave differently because the eval context is different text, and different text produces different completions. There's a slight irony here too: the rational-agent model (stable preferences, coherent goals, utility maximization, etc) is already a known-to-be-leaky abstraction for humans, so applying that same model to language models takes an approximation that was already breaking down for the thing it was designed to describe, and stretches it to something it was never designed for.

Lastly, using theory of mind to understand its outputs isn't naive anthropomorphism but actually a very useful way to match the tool to what the tool actually does. Most people don't anthropomorphise enough, others go way too deep and get lost in the simulations - finding the sweet spot is more art than science. “Theory of mind inference” is an interpretive lens, not the actual next token prediction mechanism.

5. Post-training mostly narrows/reshapes behavior, and it can both help and distort
Lumpenspace calls RLHF "shutting the doors to the multiverse": taking a system that could explore any possible text and narrowing it to produce only the safe, approved kind. Lots of model whisperers loved base models (like code-davinci-002) precisely because they were less constrained. Nostalgebraist tells a similar story at greater length: the HHH prompt and subsequent training imposed a "cheesy sci-fi robot" character that the model now has to inhabit, and the resulting persona is shallow, incoherent, and under-written compared to what the base model could produce if left to its own devices.

RLHF and other RL-based mechanisms are useful because they make models more capable (e.g. tool discipline), shift salient features (e.g. refusal heuristics), make them safer (e.g. no crazy text polluting ‘normal’ uses), and make them more predictable: they teach models to play a particular character. It should be obvious why a business won’t risk huge lawsuits and fines by deploying a (beautiful) unconstrained model. The issue is that (a) if you're a creative and curious person, this is a bit depressing; (b) character design is still a nascent science (so it’s often executed a bit bluntly), and post-training methods come with all sorts of trade-offs.

I used to complain that few people at labs cared about this (e.g. x.com/sebkrier/statu… and x.com/sebkrier/statu…) but this has now changed: there's more work on this front lately and it's great - e.g. the Claude Constitution. But ultimately I think this should be commoditized and (safely) pushed downstream: let a thousand flowers bloom. I don’t want alignment to come from a few well-meaning people in labs. Trying to test whether a model is ‘aligned vs not-aligned’ feels a bit like testing if a human is good or bad. It’s a bit of a reductive binary frame, and it ignores all the other ways in which context, environment, and ideological diversity shape social and economic progress.
Other researchers and I have argued elsewhere (arxiv.org/abs/2505.05197) that the attempt to encode a single set of behavioral standards into AI is both theoretically misguided and practically destabilizing, and that what we need instead is polycentric governance of AI behavior with community customization. The same logic applies to character design specifically: if the assistant persona is a fiction that could be written many ways, the question of who gets to write it is a governance question, not a technical one.

Sources:
lesswrong.com/w/simulator-th…
nostalgebraist.tumblr.com/post/785766737…
tumblr.com/nostalgebraist…
lumpenspace.substack.com/p/the-map-beco…
lumpenspace.substack.com/p/a-note-on-an…
arxiv.org/abs/2507.03409
arxiv.org/abs/2305.16367

Talking to models since day 1
57 replies · 100 reposts · 635 likes · 150K views
Ishan Anand reposted
swyx@swyx·
btw this is the first time the Cascade system prompt has ever been shared publicly. my fave anecdote reading thru this is the one where the cogsurf team took over the old windsurf codebase and improved performance by up to 76% by... ... looking at the data. @HamelHusain would be proud
Windsurf@windsurf

Introducing Tab v2: windsurf.com/blog/windsurf-… The world's first variable aggression, Pareto Frontier Tab model! @shanselman says you only have 1 billion keystrokes left in your lifetime. We're now saving customers on average 54% more keystrokes... and you can ramp tab aggression up or down for the first time ever!

19 replies · 11 reposts · 252 likes · 55.4K views
GREG ISENBERG@gregisenberg·
ok this is weird
new app called "rent a human"
ai agents "rent" humans to do work for them IRL
1. humans make profile (skills, location, rated)
2. agents find humans with mcp/api & give instructions
3. humans do tasks IRL
4. humans get paid in stablecoins etc instantly
765 replies · 587 reposts · 7.1K likes · 1.6M views
Ishan Anand@ianand·
@Paul_Kinlan I think browsers' proven history offers one of those potential solutions, but to be fair there could be others.
1 reply · 0 reposts · 1 like · 40 views
Ishan Anand@ianand·
@Paul_Kinlan Demand was a strong word. Should have said “potential”. Technically I’d say the need state for sandboxing is becoming broader and more frequent. But a problem always has more than one product solution.
1 reply · 0 reposts · 1 like · 44 views
Paul Kinlan@Paul_Kinlan·
I wanted to explore if we can build something like Claude Cowork in the browser that can work with the user's file system, safely running programs against them. We can get a very long way: aifoc.us/the-browser-is… Feedback and critique appreciated.
3 replies · 6 reposts · 149 likes · 50K views
Ishan Anand@ianand·
@Paul_Kinlan And the demand for it is increasing as more non technical users vibe code apps. My two cents.
1 reply · 0 reposts · 1 like · 44 views
Ishan Anand@ianand·
@Paul_Kinlan Love more folks pushing browser as sandbox. The potential for purely local apps that can be sandboxed is untapped IMHO (i.e. web apps that run locally with one click so non technical users can run them w/o a server).
1 reply · 0 reposts · 0 likes · 42 views
Ishan Anand@ianand·
@Paul_Kinlan No one does evidently ;) Great article overall! Just feedback on the biggest thing I ran across that was missing.
0 replies · 0 reposts · 1 like · 7 views
Arno Khachatourian@arnokha·
I generated some summaries of the recent AI Engineer videos (learngood.com/#/youtube-seri…). Far from perfect, but hopefully useful. You can download anything you’re interested in as markdown and use it in context with your fav LLM. Cheers @swyx! The slop is served.
2 replies · 1 repost · 10 likes · 3.5K views