Arvind Narayanan

13.1K posts

Arvind Narayanan
@random_walker

Princeton CS prof and Director @PrincetonCITP. Coauthor of "AI Snake Oil" and "AI as Normal Technology". https://t.co/ZwebetjZ4n Views mine.

Princeton, NJ · Joined December 2007
535 Following · 126.7K Followers
Pinned Tweet
Arvind Narayanan @random_walker
If a fact or chart is surprising, it might be because it’s new information, or it might be something deeper — a sign that our mental model is wrong. Anthropic’s economic gap chart is the latter. anthropic.com/research/labor…

A big source of confusion in AI discourse is not recognizing that the speed of adoption follows its own logic that’s far slower than the speed of capability progress. I’m biased but I think AI as Normal Technology is still the best exposition of the many different speed limits to diffusion. Once we internalize this, the gap shown in the chart is what we should expect.

How does this square with the “AI is the most rapidly adopted technology” narrative and all the graphs that are frequently shared to push that view? Unfortunately they lump together too many kinds of “AI use” to really tell us anything meaningful. On the one hand there are many marginal uses of AI (such as using chatbots instead of traditional search) that are being quickly adopted. But what will make a true economic impact are deeper changes to workflows that incorporate verification and accountability, manage the risk of deskilling, and are accompanied by organizational changes that take advantage of productivity improvements. Those changes happen at human timescales and are barely getting started. And that’s not even accounting for regulatory barriers.

Finally, I’m also not sure how credible the “theoretical capability” estimates are. In particular, I don’t think they account for the capability-reliability gap, for which the AI community didn’t even have measurements until our work two weeks ago normaltech.ai/p/new-paper-to…
22 replies · 35 reposts · 177 likes · 32K views
Arvind Narayanan retweeted
Lawrence Chan @justanotherlaw
A recent viral paper claims to reverse-engineer the parameter counts of frontier models: GPT-5.5 = 9.7T, Opus 4.7 = 4.0T, o1 = 3.5T, etc. @ben_sturgeon and I investigated and found serious issues in the paper; fixing them gives GPT-5.5 as ~1.5T (90% CI: 256B-8.3T).
29 replies · 95 reposts · 948 likes · 202.7K views
Arvind Narayanan retweeted
Andy Hall @ahall_research
1800: If Thomas Jefferson is elected, "Murder, robbery, rape, adultery, and incest will all be openly taught and practiced."

When it comes to politics we have a bad habit of romanticizing the past and imagining that today's politics are worse and coarser. To make this visceral, I built a little app that shows what the 1800 election would have felt like if X had been around. Scrolling through it really does give you a sense that vicious, indecorous politics long pre-dates the present day.

Check it out here: 1800.freesystems.net
49 replies · 143 reposts · 813 likes · 210.4K views
Arvind Narayanan @random_walker
A project management pitfall that's super common in my experience but I've never heard anyone mention is the "imagined dependency" problem. We assume or imagine that task A is a dependency for B, when in fact it isn't. Usually this causes no problems, but sometimes A gets delayed, and we fail to start on B, unnecessarily delaying project completion.

Why does this happen? I don't know but I have a few guesses. Maybe initially we came up with a sequence A, B, C because it is often best to focus our energies on one thing at a time (totally reasonable) but over time we incorrectly start to see this arbitrary sequence as a dependency chain A → B → C because we have so often envisioned starting B once A gets done and C when B gets done.

Another possibility is that it could be related to a famous cognitive bias called the disjunction effect. In one experiment, students said they would buy a vacation to Hawaii if they passed an impending exam (to celebrate) and also if they failed the exam (to recover). But they would refuse to actually make the decision until they learned the result of the exam.

How can we avoid imagining dependencies? Being explicit about what's a dependency and why is important. And project management tools like Kanban can help. But the most important thing is simply being aware of our tendency to do this — I'm no exception!
4 replies · 1 repost · 19 likes · 5.7K views
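To make the imagined-dependency pitfall concrete, here is a toy earliest-start scheduling sketch (not from the thread; the tasks, durations, and the delay are invented for illustration). It computes when each task can finish under two dependency graphs: one with the habit-formed chain A → B → C, and one where the A → B edge never really existed.

```python
# Toy illustration of the "imagined dependency" pitfall (hypothetical tasks).
# A task can start once all of its declared dependencies have finished.

def earliest_finish(durations, deps):
    """Earliest finish time for each task under a dependency graph."""
    finish = {}

    def fin(task):
        if task not in finish:
            start = max((fin(d) for d in deps.get(task, [])), default=0)
            finish[task] = start + durations[task]
        return finish[task]

    return {t: fin(t) for t in durations}

durations = {"A": 5, "B": 2, "C": 3}   # A has slipped badly

imagined = {"B": ["A"], "C": ["B"]}    # the chain we envision: A -> B -> C
actual   = {"C": ["B"]}                # in reality, B never needed A

print(earliest_finish(durations, imagined))  # {'A': 5, 'B': 7, 'C': 10}
print(earliest_finish(durations, actual))    # {'A': 5, 'B': 2, 'C': 5}
```

With the imagined edge, A's delay pushes the whole project out to time 10; drop the edge and B and C finish at time 5 while A runs in parallel.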
Arvind Narayanan retweeted
Henry Shevlin @dioscuri
Not a fan of these clichéd “we used to think the mind was clockwork” analogies. Sometimes science just makes progress. Hearts really are pumps. DNA really is code-like. Disease really is caused by microorganisms. Some mechanistic explanations were wrong; others are just true.
Brooks Otterlake @i_zzzzzz

This is just like being alive in the 1600s when they got good at making complicated clocks and deduced that every complicated thing in the universe probably functioned exactly like a clock

88 replies · 150 reposts · 1.7K likes · 79.3K views
Arvind Narayanan retweeted
Michael Inzlicht @minzlicht
Imagine a 19-year-old scrolling TikTok. She watches a creator list five "signs you have undiagnosed anxiety." She recognizes three in herself. By the end of the week, she's describing herself as anxious to her friends. A month later, she's avoiding situations she used to handle fine. What went wrong?

In a new paper by my PhD student Dasha Sandra, titled "Why mental health awareness can harm: Converging explanations for a societal problem", we argue that well-meaning mental health awareness can backfire, and we identify how. Four separate literatures (concept creep, nocebo effects, prevalence inflation, and illness self-labeling) have been circling the same problem from different angles. We show they converge on three mechanisms:

1. Awareness lowers the threshold for what counts as a disorder.
2. It trains people to scan their inner lives for symptoms and reinterpret normal distress as pathology.
3. Once someone adopts an illness identity, they behave in ways that confirm and deepen it.

The evidence is wide. Learning that loneliness is harmful makes solitude feel worse. Learning that stress is harmful worsens well-being and performance. Awareness videos about fake conditions like "wind turbine syndrome" produce real headaches. Trigger warnings raise anticipatory anxiety without reducing distress.

This does not mean awareness should stop. It means awareness can have unintended consequences, including manufacturing the suffering it tries to prevent. Inoculating people against these mechanisms works, and we already have evidence it does.

Link to paper: michael-inzlicht.squarespace.com/s/The-psycholo…
234 replies · 1.8K reposts · 7.4K likes · 497.8K views
Arvind Narayanan retweeted
Bojie Li @bojie_li
Closed labs hide model sizes. They can't hide what their models know, and what a model knows is an indicator of how big it is. Reasoning compresses; factual knowledge doesn't. So you can size a frontier model from black-box API calls alone, and across releases you can literally watch a single fact arrive in the parameters over time.

For three years, my friends Jiyan He and Zihan Zheng have been asking frontier LLMs the same question: "what do you know about USTC Hackergame?", a CTF contest. May 2024: GPT-4o invented fake titles. Feb 2025: Claude 3.7 Sonnet listed 19 verified 2023 challenges. By April 2026, frontier models recall specific challenges across consecutive years.

After DeepSeek-V4 dropped, I instructed my agent to spend four days autonomously turning that habit into Incompressible Knowledge Probes (IKP) — 1,400 questions, 7 tiers of obscurity, 188 models, 27 vendors. Three findings:

1/ You can approximately size any black-box LLM from factual accuracy alone. Penalized accuracy is log-linear in log(params), R² = 0.917 on 89 open-weight models from 135M to 1.6T params. Project closed APIs onto the curve → GPT-5.5 ~9T, Claude Opus 4.7 ~4T, GPT-5.4 ~2.2T, Claude Sonnet 4.6 ~1.7T, Gemini 2.5 Pro ~1.2T (90% CI: 0.3-3x size).

2/ Citation count and h-index don't predict whether a frontier model recognizes a researcher. Two researchers with similar citation profiles get very different responses. Models memorize impact — work that shaped a field, not many incremental papers.

3/ Factual capacity doesn't compress over time. Across 96 open-weight models across 3 years, the IKP time coefficient is statistically zero, rejecting the Densing-Law prediction of +0.0117/month at p<10⁻¹⁵. Reasoning benchmarks saturate; factual capacity keeps scaling with parameters.

Website: 01.me/research/ikp/
Paper: arxiv.org/pdf/2604.24827
70 replies · 234 reposts · 2.2K likes · 381.3K views
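The sizing procedure in finding 1/ is calibrate-then-invert: fit penalized accuracy against log(params) on open-weight models, then read a closed model's parameter count off the fitted curve. A minimal sketch of that idea follows; the accuracies and parameter counts in it are invented for illustration and are not the IKP data.

```python
# Sketch of sizing a black-box model from factual accuracy, assuming
# accuracy is roughly log-linear in parameter count (all numbers made up).
import numpy as np

# Calibration set: open-weight models with known sizes.
params = np.array([0.135e9, 1e9, 7e9, 70e9, 405e9, 1.6e12])  # parameters
acc    = np.array([0.02, 0.08, 0.17, 0.28, 0.37, 0.44])      # penalized accuracy

# Fit acc = a * log10(params) + b.
a, b = np.polyfit(np.log10(params), acc, deg=1)

def estimate_params(accuracy):
    """Invert the fit to project a closed model onto the curve."""
    return 10 ** ((accuracy - b) / a)

closed_acc = 0.41  # hypothetical score of a closed-API model
print(f"estimated size: ~{estimate_params(closed_acc) / 1e12:.1f}T parameters")
```

The quoted 90% CI of 0.3-3x is the caveat that matters: residual scatter around a fit like this makes it an order-of-magnitude instrument, not a precise scale.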
Arvind Narayanan retweeted
Avijit Ghosh @evijit
AI evaluation is becoming its own compute bottleneck. We often talk about the cost of training frontier models, but the cost of evaluating them is starting to matter just as much, especially for agents, scientific ML systems, and training-in-the-loop benchmarks.

In our new Evaluating Evaluations post, we look at how evals are crossing a threshold where cost changes who can participate. The Holistic Agent Leaderboard spent about $40K on 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. And once you care about reliability, repeated runs can multiply these costs many times over.

This creates a real accountability problem. If only large labs can afford statistically credible evals, independent researchers, auditors, journalists, and public-interest organizations are left with partial visibility into frontier systems.

The core issue is that benchmark design is changing. Static benchmarks could often be compressed aggressively while preserving rankings. Agent benchmarks are noisier and scaffold-sensitive. Training-in-the-loop benchmarks are expensive by construction. As evals move closer to real work, they also become harder to make cheap.

Some takeaways:
→ Leaderboards should report cost alongside accuracy.
→ Reliability should not be treated as optional.
→ We need reusable eval artifacts! Shared documentation formats, such as Every Eval Ever, can help the field stop paying repeatedly for the same measurements.

Read the full post: evalevalai.com/research/2026/…

Thanks for the insights @LChoshen, Yifan Mai, and @cgeorgiaw 🤗
4 replies · 20 reposts · 83 likes · 11.4K views
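To see why reliability dominates the budget, here is a back-of-the-envelope sketch using the figures quoted in the post ($40K for 21,730 rollouts; $2,829 per frontier GAIA run). The repeat counts are hypothetical, chosen only to show how quickly repeated runs compound.

```python
# Back-of-the-envelope eval budgeting with the figures quoted above.
# The repeat counts are hypothetical.

hal_total_usd, hal_rollouts = 40_000, 21_730
print(f"HAL cost per rollout: ${hal_total_usd / hal_rollouts:.2f}")  # ~$1.84

gaia_run_usd = 2_829  # one GAIA run on a frontier model, before caching
for repeats in (1, 5, 10):  # repeated runs to tighten confidence intervals
    print(f"{repeats:>2} GAIA run(s): ${gaia_run_usd * repeats:,}")
```

At 10 repeats, a single model-benchmark pair approaches $28K, most of what the entire 9-model, 9-benchmark HAL sweep cost; that is the sense in which statistically credible evals price out independent auditors.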
Arvind Narayanan retweeted
Will Knight @willknight
This meditation app was invented, designed and coded, and then submitted to the App Store by an AI model (it made a few mistakes along the way). In a new research paper, @sayashk and others say having AI take on this kind of messy open-world task could offer a better way to measure progress. Very interesting! (paper normaltech.ai/p/open-world-e…) (app apps.apple.com/us/app/breathe…)
2 replies · 4 reposts · 8 likes · 3.4K views
Arvind Narayanan retweeted
Andy Hall @ahall_research
Excited to be working with this incredible team on better open-world evaluations for agents! I played only the tiniest of roles in this study, but I’m eager to extend our approach here to look at governance cases. Can an agent successfully complete a real-world bureaucratic task for you? Can it monitor a school board event for you and report back? There are so many interesting things to test.
Sara Hooker @sarahookr

Most agentic benchmarks center around tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open-world evaluations. Led by @sayashk, our current draft is now live.

2 replies · 6 reposts · 34 likes · 8.1K views
Arvind Narayanan retweeted
Sara Hooker @sarahookr
Most agentic benchmarks center around tasks that are automatically verifiable. But any task that is verifiable is also easy to optimize for. This work instead describes the future of critical open-world evaluations. Led by @sayashk, our current draft is now live.
Sayash Kapoor @sayashk

Benchmarks are saturating more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

8 replies · 21 reposts · 194 likes · 40.4K views
Arvind Narayanan retweeted
Andy Hall @ahall_research
I continue to be super puzzled by the whole policy conversation around jailbreaking and AI safety. No one seems to be thinking about this clearly. If an adversarial actor can always jailbreak the models -- as @elder_plinius shows over and over again -- what do policymakers think they are going to accomplish by banning models that don't have these guardrails?

I'm open to possible arguments, but someone in DC has to actually make the argument. It will have to rely on some behavioral model in which there are a lot of lazier adversaries who will be deterred by the guardrails (similar to arguments about airport security, for example). This is what @random_walker, @sayashk and others have explained super clearly, yet the conversation seems to just carry on anyway.

It's crazy how many people just take for granted that guardrails accomplish something important without spelling it out. We need a sign we tap that just says, "remember @elder_plinius"
Andrew Curran @AndrewCurran_

House lawmakers were given a demonstration by DHS yesterday where they were able to interact with jailbroken models. Open source will probably reach Mythos performance by the end of the year. By the summer there will be a push to regulate open source in the US. This is a prelude.

21 replies · 26 reposts · 157 likes · 27.6K views
Arvind Narayanan retweeted
Anand Shah @avshah99
🚨New preprint! We find evidence of LLMs enabling people to file lawsuits without lawyers (filing "pro se") at historically unprecedented rates in federal courts.👇 1/n
48 replies · 253 reposts · 1.1K likes · 464.3K views
Arvind Narayanan @random_walker
Our brand is careful and nuanced analysis of what AI agents can/can't do, which we know doesn't play well on social media when there's so much hype to compete with. But if you take the time to read the paper I promise it will be rewarding! cruxevals.com/open-world-eva…

This is one of the most interesting projects I've ever worked on. We have a great team of collaborators that we're continuing to expand, and we plan to release new CRUXes regularly. Remember the name, you will hear it again!
Josh Pipe @Malamo999

@sayashk @TransluceAI @PKirgis @steverab @random_walker @fly_upside_down @RishiBommasani @DubMagda @ghadfield @ahall_research This was such an interesting thread! I can't believe it doesn't have more views. I appreciated your take on evaluation awareness and also your offering the 1GB of logs for users who know what to do with them. I'm a non-technical user but I still enjoyed the read and its details.

1 reply · 12 reposts · 44 likes · 9.6K views
Arvind Narayanan retweeted
Andy Masley @AndyMasley
Two months ago the YouTuber Benn Jordan made what has become one of the most popular videos about data centers ever, and maybe the most popular piece of media on data centers this year: Datacenters Behaving Like Acoustic Weapons. The claim is that they produce harmful infrasound. This video (and the one before it) is a moment-by-moment disaster.

Writing this was far and away the most jaw-dropping experience I've had. To my knowledge I'm the first one to publicly push back against all the problems with it. Even if you're not interested in data centers or infrasound, I think this is just an incredible example of how pseudoscience can become highbrow misinformation. All it takes is a chill-seeming guy, great production, and not actually checking literally any of the studies he's quickly flashing on the screen. blog.andymasley.com/p/contra-benn-…
22 replies · 60 reposts · 519 likes · 35.3K views
Arvind Narayanan retweeted
Kyle Chan @kyleichan
This Chinese humanoid robot just shattered the world record for a half marathon, finishing in 50 min 26 sec. This video shows its crash just meters before the finish line, where it had to be picked up by a team of humans. The robot is from Honor, the smartphone maker and Huawei spin-off. This robot was teleoperated, while others in the race were autonomous. It seems like all the robots had battery swaps along the way.
1.3K replies · 1.5K reposts · 10.5K likes · 5.3M views
Arvind Narayanan retweeted
Gillian Hadfield @ghadfield
Glad to be a part of this initiative to develop open-world evaluations for AI. We need the ability to assess just how capable agents are becoming in order to anticipate and respond to the impact they can have on real-world systems and transactions. An agent that can successfully act on the general instruction “build an app and get it posted in the App Store” is one that brings us closer to an economy of agents, with significant implications for how markets behave and need regulating. arxiv.org/pdf/2509.01063
Sayash Kapoor @sayashk

Benchmarks are saturating more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1 reply · 6 reposts · 23 likes · 5.9K views
Arvind Narayanan retweeted
Peter Kirgis @PKirgis
Yesterday, we announced CRUX, a project that aims to conduct regular “open-world evaluations,” where we will be testing the ability of AI agents to complete long-horizon tasks in messy, real-world environments. @sayashk's post dives into the details; here are a few of my own thoughts about why this is worth doing.
Sayash Kapoor @sayashk

Benchmarks are saturating more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1 reply · 3 reposts · 11 likes · 3.9K views
Arvind Narayanan retweeted
Cozmin Ududec @CUdudec
This paper makes a strong case for open-world evaluations as a complement to traditional benchmarks, particularly for realistic, long-horizon, open-ended settings! Glad the AISI SoE team could contribute to this effort.
Sayash Kapoor @sayashk

Benchmarks are saturating more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

1 reply · 5 reposts · 28 likes · 8.1K views
Arvind Narayanan retweeted
Nathan Calvin @_NathanCalvin
This work on messy real-world evals from Sayash et al is wild and surprised me (and Sayash isn't known to over-hype). "App store operators should prepare for and police spam submissions, as they might soon see thousands of applications submitted autonomously using agents."
Sayash Kapoor @sayashk

Benchmarks are saturating more quickly than ever. How should frontier AI evaluations evolve? In a new paper, we argue that the AI community is already converging on an answer: open-world evaluations. They are long, messy, real-world tasks that would be impractical for benchmarks.

2 replies · 5 reposts · 16 likes · 4.7K views