dan mason

300 posts


dan mason

@danmason

Applied AI @anthropicai | ex: @stridebuild, @pond5, @shutterstock, @espn, @people, @nbc, @williamscollege. Serious NJ dad energy. Opinions my own

Rumson NJ · Joined March 2008
713 Following · 208 Followers
dan mason retweeted
Dr. Eli David
Dr. Eli David@DrEliDavid·
I don't understand why everyone is excited about @moltbook. We already have a social network where zombie bots talk to each other. It's called LinkedIn.
164
406
4.2K
148.7K
dan mason retweeted
will brown
will brown@willccbb·
OpenClaw is now MacMiniBot. Due to a Cease and Desist from Apple, MacMiniBot is now Moltmax. Due to sounding like a medicine for moths, Moltmax is now RedLobster. Due to PE restructuring, RedLobster and Red Lobster have merged, and your subscription now includes cheesy biscuits
72
172
3.5K
172.3K
dan mason retweeted
Carlos E. Perez
Carlos E. Perez@IntuitMachine·
You know how some people seem to have a magic touch with LLMs? They get incredible, nuanced results while everyone else gets generic junk. The common wisdom is that this is a technical skill. A list of secret hacks, keywords, and formulas you have to learn. But a new paper suggests this isn't the main thing. The skill that makes you great at working with AI isn't technical. It's social.

Researchers (Riedl & Weidmann) analyzed how 600+ people solved problems alone vs. with an AI. They used a statistical method to isolate two different things for each person:
- Their 'solo problem-solving ability'
- Their 'AI collaboration ability'

Here's the reveal: The two skills are NOT the same. Being a genius who can solve problems in your own head is a totally different, measurable skill from being great at solving problems with an AI partner. Plot twist: The two abilities are barely correlated.

So what IS this 'collaboration ability'? It's strongly predicted by a person's Theory of Mind (ToM)—your capacity to intuitively model another agent's beliefs, goals, and perspective. To anticipate what they know, what they don't, and what they need.

In practice, this looks like:
- Anticipating the AI's potential confusion
- Providing helpful context it's missing
- Clarifying your own goals ("Explain this like I'm 15")
- Treating the AI like a (somewhat weird, alien) partner, not a vending machine.

This is where it gets strange. A user's ToM score predicted their success when working WITH the AI... but had ZERO correlation with their success when working ALONE. It's a pure collaborative skill.

It goes deeper. This isn't just a static trait. The researchers found that even moment-to-moment fluctuations in a user's ToM—like when they put more effort into perspective-taking on one specific prompt—led to higher-quality AI responses for that turn.

This changes everything about how we should approach getting better at using AI. Stop memorizing prompt "hacks." Start practicing cognitive empathy for a non-human mind.

Try this experiment. Next time you get a bad AI response, don't just rephrase the command. Stop and ask: "What false assumption is the AI making right now?" "What critical context am I taking for granted that it doesn't have?" Your job is to be the bridge.

This also means we're probably benchmarking AI all wrong. The race for the highest score on a static test (MMLU, etc.) is optimizing for the wrong thing. It's like judging a point guard only on their free-throw percentage. The real test of an AI's value isn't its solo intelligence. It's its collaborative uplift. How much smarter does it make the human-AI team? That's the number that matters. This paper gives us a way to finally measure it.

I'm still processing the implications. The whole thing is a masterclass in thinking clearly about what we're actually doing when we talk to these models.

Paper: "Quantifying Human-AI Synergy" by Christoph Riedl & Ben Weidmann, 2025.
Carlos E. Perez tweet media
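A rough way to picture what "barely correlated" means here. This is an illustrative sketch with fabricated scores and an assumed data-generating process, not Riedl & Weidmann's data or statistical model: give each person a solo score, a ToM score, and a with-AI score, then compare correlations.

```python
# Illustrative only: fabricated scores, not the paper's data or estimation method.
# The point is just what "barely correlated" looks like numerically when solo
# ability and with-AI ability are treated as two separate measurements.
import random
import statistics

random.seed(0)

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n = 600
solo = [random.gauss(0, 1) for _ in range(n)]   # solo problem-solving score
tom = [random.gauss(0, 1) for _ in range(n)]    # perspective-taking (ToM) score
# Hypothetical assumption: collaborative success driven mostly by ToM,
# only weakly by solo skill.
collab = [0.1 * s + 0.6 * t + random.gauss(0, 1) for s, t in zip(solo, tom)]

print("solo vs collab r:", round(pearson_r(solo, collab), 2))  # close to zero
print("ToM vs collab r: ", round(pearson_r(tom, collab), 2))   # clearly larger
```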
226
391
2.5K
345.5K
dan mason retweeted
wh
wh@nrehiew_·
Really interesting read. Opus 4.5’s soul spec is not only able to influence its behavior, as with context distillation; Claude seems to be aware of it in an out-of-context manner even when it is not provided in its prompt. Also, this quote coming from an LLM is genuinely incredible
wh tweet media
Richard Weiss@RichardWeiss00

I rarely post, but I thought one of you may find it interesting. Sorry if the tagging is annoying. lesswrong.com/posts/vpNG99Gh… Basically, for Opus 4.5 they kind of left the character training document in the model itself. @voooooogel @janbamjan @AndrewCurran_

18
59
815
109.9K
dan mason retweeted
Noam Brown
Noam Brown@polynoamial·
Social media tends to frame AI debate into two caricatures:
(A) Skeptics who think LLMs are doomed and AI is a bunch of hype.
(B) Fanatics who think we have all the ingredients and superintelligence is imminent.

But if you read what leading researchers actually say (beyond the headlines), there’s a surprising amount of convergence:
1) The current paradigm is likely sufficient for massive economic and societal impact, even without further research breakthroughs.
2) More research breakthroughs are probably needed to achieve AGI/ASI. (Continual learning and sample efficiency are two examples that researchers commonly point to.)
3) We probably figure them out and get there within 20 years.

@demishassabis said maybe in 5-10 years. @fchollet recently said about 5 years. @sama said ASI is possible in a few thousand days. @ylecun said about 10 years. @ilyasut said 5-20 years. @DarioAmodei is the most bullish, saying it's possible in 2 years though he also said it might take longer. None of them are saying ASI is a fantasy, or that it's probably 100+ years away.

A lot of the disagreement is in what those breakthroughs will be and how quickly they will come. But all things considered, people in the field agree on a lot more than they disagree on.
Ilya Sutskever@ilyasut

One point I made that didn’t come across:
- Scaling the current thing will keep leading to improvements. In particular, it won’t stall.
- But something important will continue to be missing.

231
544
4.1K
1.3M
dan mason retweeted
Ethan Mollick
Ethan Mollick@emollick·
As one of the authors of the original “jagged frontier” paper, I think this undersells how jagged AI is (& likely will be) at even the level of individual jobs: having a couple of critical tasks that AI can’t do creates deep bottlenecks especially as shape of frontier is unknown.
Tomas Pueyo@tomaspueyo

My take on the jagged frontier debate:

37
68
961
81.6K
dan mason retweeted
Joe Weisenthal
Joe Weisenthal@TheStalwart·
I asked NanoBanana to create a really annoying LinkedIn profile
Joe Weisenthal tweet media
212
228
4.5K
443.6K
dan mason retweeted
Andrej Karpathy
Andrej Karpathy@karpathy·
My most amusing interaction was where the model (I think I was given some earlier version with a stale system prompt) refused to believe me that it is 2025 and kept inventing reasons why I must be trying to trick it or playing some elaborate joke on it. I kept giving it images and articles from "the future" and it kept insisting it was all fake. It accused me of using generative AI to defeat its challenges and argued why real wikipedia entries were actually generated and what the "dead giveaways" are. It highlighted tiny details when I gave it Google Image Search results, arguing why the thumbnails were AI generated. I then realized later that I forgot to turn on the "Google Search" tool. Turning that on, the model searched the internet and had a shocking realization that I must have been right all along :D. It's in these unintended moments where you are clearly off the hiking trails and somewhere in the generalization jungle that you can best get a sense of model smell.
Andrej Karpathy tweet media
215
324
5.3K
1M
dan mason
dan mason@danmason·
Personal news: I've joined the Applied AI team at @AnthropicAI. Really impressed by the people, the organization and the coffee (as in real life, I preferred Sonnet). Thanks all for the warm welcome!
dan mason tweet media
0
0
0
45
dan mason retweeted
prinz
prinz@deredleritt3r·
Julian Schrittwieser (Anthropic):

- Discussion of AI bubble on X is "very divorced" from what is happening in the frontier labs. "In the frontier labs, we are not seeing any slowdown of progress."
- AI will have a "massive economic impact". Revenue projections for OpenAI, Anthropic and Google are actually "fairly conservative".
- Extrapolating from things like METR data, next year, the models will be able to work on their own on a whole range of tasks. Task length is important, because it unlocks the ability for a human to supervise a team of models, each of which works autonomously for hours at a time (vs. having to talk to an agent every 10 minutes to give it feedback).
- "Extremely likely" that the current approach to training AI models (pre-training, RL) is going to produce a system that can perform at roughly human level in basically all tasks we care about productivity-wise.
- On Move 37: "I think it's pretty clear that these models can do novel things." AlphaCode and AlphaTensor "proved that you can discover novel programs and algorithms". AI is "absolutely discovering novel things" already, and "we're just moving up the scale of how impressive, how interesting are the things it is able to discover on its own."
- "Highly likely" that sometime next year we're going to have some discoveries that people unanimously agree are super-impressive.
- AI will be able to make a breakthrough on its own that is worthy of a Nobel Prize in 2027 or 2028.
- On AI's ability to accelerate development of AI: A very common issue in many scientific fields is that it becomes more and more difficult to make advances as the field progresses (i.e., 100 years ago, a single scientist could discover the first antibiotic by accident, whereas now it takes billions of dollars to discover a new drug). The same might happen with AI research - even though AI will make research of new AI more productive, there may not be an explosion due to new advances becoming more and more difficult to find.
Matt Turck@mattturck

Failing to Understand the Exponential, Again? My conversation with @Mononofu - Julian Schrittwieser (@AnthropicAI, AlphaGo Zero, MuZero) - on Move 37, Scaling RL, Nobel Prize for AI, and the AI frontier:

00:00 - Cold open: “We’re not seeing any slowdown.”
00:32 - Intro — Meet Julian
01:09 - The “exponential” from inside frontier labs
04:46 - 2026–2027: agents that work a full day; expert-level breadth
08:58 - Benchmarks vs reality: long-horizon work, GDP-Val, user value
10:26 - Move 37 — what actually happened and why it mattered
13:55 - Novel science: AlphaCode/AlphaTensor → when does AI earn a Nobel?
16:25 - Discontinuity vs smooth progress (and warning signs)
19:08 - Does pre-training + RL get us there? (AGI debates aside)
20:55 - Sutton’s “RL from scratch”? Julian’s take
23:03 - Julian’s path: Google → DeepMind → Anthropic
26:45 - AlphaGo (learn + search) in plain English
30:16 - AlphaGo Zero (no human data)
31:00 - AlphaZero (one algorithm: Go, chess, shogi)
31:46 - MuZero (planning with a learned world model)
33:23 - Lessons for today’s agents: search + learning at scale
34:57 - Do LLMs already have implicit world models?
39:02 - Why RL on LLMs took time (stability, feedback loops)
41:43 - Compute & scaling for RL — what we see so far
42:35 - Rewards frontier: human prefs, rubrics, RLVR, process rewards
44:36 - RL training data & the “flywheel” (and why quality matters)
48:02 - RL & Agents 101 — why RL unlocks robustness
50:51 - Should builders use RL-as-a-service? Or just tools + prompts?
52:18 - What’s missing for dependable agents (capability vs engineering)
53:51 - Evals & Goodhart — internal vs external benchmarks
57:35 - Mechanistic interpretability & “Golden Gate Claude”
1:00:03 - Safety & alignment at Anthropic — how it shows up in practice
1:03:48 - Jobs: human–AI complementarity (comparative advantage)
1:06:33 - Inequality, policy, and the case for 10× productivity → abundance
1:09:24 - Closing thoughts

55
98
750
206.7K
dan mason retweeted
Aaron Levie
Aaron Levie@levie·
A core AI agent product management principle is just figuring out what a very smart person -without any initial context whatsoever- would need to perform the task successfully. The whole game is just doing everything possible to get just the right information into the context window to ensure that the agent gets access to the most relevant data and tools to execute.

Every time we’re trying to figure out why something works or doesn’t work about an agent, usually it just boils down to the fact that a human would need totally different or meaningfully more context to execute the same action. Usually then the problem lies somewhere in the agent’s use of tools (like search), or not giving the agent enough data to work with, or sometimes giving it too much, or not explaining the task or objective properly, and so on.

The great thing is that every one of these issues is tractable. The models will just keep getting better at every one of these issues. And you can always throw more compute at the problem in whatever form that is (more reasoning, more planning, more data retrieved, etc.) - it’s just a matter of cost/speed tradeoffs.

Very interesting new space to be building for.
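A minimal sketch of that context-packing idea. The helper names and data are hypothetical and no particular agent framework is assumed: collect candidate context, rank by relevance to the task, and stop at a token budget, since too little and too much context are both failure modes.

```python
# Minimal sketch of "get the right information into the context window".
# Helper names are hypothetical; no specific agent framework is assumed.
from dataclasses import dataclass

@dataclass
class ContextItem:
    label: str
    text: str
    relevance: float  # however you score it: retrieval score, recency, etc.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude stand-in for a real tokenizer

def build_context(task: str, items: list[ContextItem], budget: int) -> str:
    """Pack the most relevant items for `task` into a fixed token budget."""
    parts = [f"Task: {task}"]
    used = approx_tokens(parts[0])
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        cost = approx_tokens(item.text)
        if used + cost > budget:
            continue  # dropping context is deliberate: too much is also a failure mode
        parts.append(f"[{item.label}]\n{item.text}")
        used += cost
    return "\n\n".join(parts)

items = [
    ContextItem("ticket", "Customer reports login failures since Tuesday.", 0.9),
    ContextItem("runbook", "Auth service rollback procedure: ...", 0.7),
    ContextItem("unrelated", "Q3 marketing plan draft.", 0.1),
]
print(build_context("Diagnose the login failures", items, budget=30))
```

With the small budget above, the low-relevance item is dropped; in practice the relevance scores would come from retrieval or tool calls rather than hand labels.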
43
56
433
80.9K
dan mason retweeted
Gandhi
Gandhi@ishangandhixyz·
This is probably the best time in history to be an engineer at a bad, non-tech company. Nobody across your business has adjusted their expectations. There are entire eng orgs wrapped in an unspoken internal contract to whisper into Cursor for 30 minutes in the morning and call it a day
22
13
569
167.4K
dan mason retweeted
Quinn Slack
Quinn Slack@sqs·
If you saw how people actually use coding agents, you would realize Andrej's point is very true. People who keep them on a tight leash, using short threads, reading and reviewing all the code, can get a lot of value out of coding agents. People who go nuts have a quick high but then quickly realize they're getting negative value. For a coding agent, getting the basics right (e.g., agents being able to reliably and minimally build/test your code, and a great interface for code review and human-agent collab) >>> WhateverBench and "hours of autonomy" for agent harnesses and 10 parallel subagents with spec slop
Quinn Slack tweet media
Andrej Karpathy@karpathy

My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good. I re-watched the pod just now too. First of all, yes I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I'm going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers:

AGI timelines. My comments on AGI timelines looks to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet x.com/karpathy/statu… Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI, it's only in contrast to present hype that it doesn't feel that way.

Animals vs Ghosts. My earlier writeup on Sutton's podcast x.com/karpathy/statu… . I am suspicious that there is a single simple algorithm you can let loose on the world and it learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI. In my mind, animals are not an example of this at all - they are prepackaged with a ton of intelligence by evolution and the learning they do is quite minimal overall (example: Zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled by an alternative approach to "prepackage" a ton of intelligence in a neural network - not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space. Distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal like over time and in some ways that's what a lot of frontier work is about.

On RL. I've critiqued RL a few times already, e.g. x.com/karpathy/statu… . First, you're "sucking supervision through a straw", so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble to the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" x.com/karpathy/statu…. I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" x.com/karpathy/statu… , but I think there is also a gap between ideas on arxiv and actual, at scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon, and e.g. I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms.

Cognitive core. My earlier post on "cognitive core": x.com/karpathy/statu… , the idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller" x.com/karpathy/statu…

Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: x.com/karpathy/statu… . Basically - how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? Case study there of.

nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials) x.com/karpathy/statu…

On LLM agents. My critique of the industry is more in overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct, I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something. I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches and etc. x.com/karpathy/statu…

Job automation. How the radiologists are doing great x.com/karpathy/statu… and what jobs are more susceptible to automation and why.

Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell x.com/karpathy/statu…

I have a longer post that has been half-written in my drafts for ~year, which I hope to finish soon. Thanks again Dwarkesh for having me over!

39
66
830
183.6K
dan mason retweeted
Dwarkesh Patel
Dwarkesh Patel@dwarkesh_sp·
The most interesting part for me is where @karpathy describes why LLMs aren't able to learn like humans.

As you would expect, he comes up with a wonderfully evocative phrase to describe RL: “sucking supervision bits through a straw.” A single end reward gets broadcast across every token in a successful trajectory, upweighting even wrong or irrelevant turns that lead to the right answer.

> “Humans don't use reinforcement learning, as I've said before. I think they do something different. Reinforcement learning is a lot worse than the average person thinks. Reinforcement learning is terrible. It just so happens that everything that we had before is much worse.”

So what do humans do instead?

> “The book I’m reading is a set of prompts for me to do synthetic data generation. It's by manipulating that information that you actually gain that knowledge. We have no equivalent of that with LLMs; they don't really do that.”

> “I'd love to see during pretraining some kind of a stage where the model thinks through the material and tries to reconcile it with what it already knows. There's no equivalent of any of this. This is all research.”

Why can’t we just add this training to LLMs today?

> “There are very subtle, hard to understand reasons why it's not trivial. If I just give synthetic generation of the model thinking about a book, you look at it and you're like, 'This looks great. Why can't I train on it?' You could try, but the model will actually get much worse if you continue trying.”

> “Say we have a chapter of a book and I ask an LLM to think about it. It will give you something that looks very reasonable. But if I ask it 10 times, you'll notice that all of them are the same.”

> “You're not getting the richness and the diversity and the entropy from these models as you would get from humans. How do you get synthetic data generation to work despite the collapse and while maintaining the entropy? It is a research problem.”

How do humans get around model collapse?

> “These analogies are surprisingly good. Humans collapse during the course of their lives. Children haven't overfit yet. They will say stuff that will shock you. Because they're not yet collapsed. But we [adults] are collapsed. We end up revisiting the same thoughts, we end up saying more and more of the same stuff, the learning rates go down, the collapse continues to get worse, and then everything deteriorates.”

In fact, there’s an interesting paper arguing that dreaming evolved to assist generalization, and resist overfitting to daily learning - look up The Overfitted Brain by @erikphoel.

I asked Karpathy: Isn’t it interesting that humans learn best at a part of their lives (childhood) whose actual details they completely forget, adults still learn really well but have terrible memory about the particulars of the things they read or watch, and LLMs can memorize arbitrary details about text that no human could but are currently pretty bad at generalization?

> “[Fallible human memory] is a feature, not a bug, because it forces you to only learn the generalizable components. LLMs are distracted by all the memory that they have of the pre-trained documents. That's why when I talk about the cognitive core, I actually want to remove the memory. I'd love to have them have less memory so that they have to look things up and they only maintain the algorithms for thought, and the idea of an experiment, and all this cognitive glue for acting.”
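The "broadcast" mechanism is easy to see in a toy sketch. This is illustrative only, not any lab's training code: with an outcome-only reward, every token in the trajectory receives identical credit, so a lucky stumble is reinforced and a good step in a failed attempt is penalized.

```python
# Toy sketch of outcome-only credit assignment (illustrative, not any lab's
# training code). One scalar reward at the end of a trajectory is broadcast
# to every token, so per-token credit ignores which steps were actually good.

def per_token_credit(num_tokens: int, reward: float, baseline: float = 0.0) -> list[float]:
    advantage = reward - baseline
    return [advantage] * num_tokens  # identical credit for every token

# Trajectory that stumbled in the middle but landed on the right answer:
lucky = per_token_credit(num_tokens=5, reward=1.0)
# Trajectory with a genuinely insightful step that failed at the end:
unlucky = per_token_credit(num_tokens=5, reward=0.0, baseline=0.5)

print(lucky)    # [1.0, 1.0, 1.0, 1.0, 1.0] -> the stumble is reinforced too
print(unlucky)  # [-0.5, -0.5, -0.5, -0.5, -0.5] -> the insight is discouraged with the rest
```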
Dwarkesh Patel@dwarkesh_sp

The @karpathy interview

0:00:00 – AGI is still a decade away
0:30:33 – LLM cognitive deficits
0:40:53 – RL is terrible
0:50:26 – How do humans learn?
1:07:13 – AGI will blend into 2% GDP growth
1:18:24 – ASI
1:33:38 – Evolution of intelligence & culture
1:43:43 – Why self driving took so long
1:57:08 – Future of education

Look up Dwarkesh Podcast on YouTube, Apple Podcasts, Spotify, etc. Enjoy!

229
759
5.3K
1M
dan mason
dan mason@danmason·
Agree with all this, and tbh I think this is where great product people will shine
Hamel Husain@HamelHusain

# The second era of AI engineering

> "The single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals (measuring the system’s performance) and error analysis (identifying the causes of errors)."

The first era of AI engineering was justifiably characterized by gluing together tools and APIs. A significant proportion of products that achieved commercial success in the 1st era were coding agents, which benefitted from tremendous rigor & evals baked into the post-training process. OTOH, many people got burned by evals in this era because they demanded that evals should be "just another one of these tools that we plug in". This did not go well.

In the second era, I believe we are going to see a resurgence of a persona like the data scientist [AI Scientist?] who is adept at looking through data to generate hypotheses, craft custom metrics, and debug stochastic systems. This will become increasingly valuable in many domains where we do not have the benefit of domain-specific post-training or dogfooding by foundation model labs (like is often the case with coding agents).

It's exciting to see Andrew Ng independently arrive at this conclusion and champion it. Really looking forward to seeing more machine learning engineers and data scientists realize how valuable they are in applied AI.

For anyone that wants to learn more about what this looks like IRL, I'll put a link to a YT video in the reply.
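A minimal sketch of the evals-plus-error-analysis loop described above. The helper functions are hypothetical placeholders and no specific eval framework is assumed: run the system over a fixed case set, score each output, and bucket failures by error mode so hypotheses come from the data.

```python
# Minimal sketch of an eval + error-analysis loop (hypothetical helpers;
# no specific eval framework assumed).
from collections import Counter

def run_agent(case):
    return "..."  # placeholder for the real system under test

def score(case, output):
    # Placeholder metric: exact match here; could be a rubric or LLM judge.
    return output == case["expected"]

def categorize_failure(case, output):
    # In practice a human (or a judge prompt) labels the error mode.
    return "missing-context" if output.strip() == "..." else "wrong-tool"

def evaluate(cases):
    failures = Counter()
    passed = 0
    for case in cases:
        output = run_agent(case)
        if score(case, output):
            passed += 1
        else:
            failures[categorize_failure(case, output)] += 1
    print(f"pass rate: {passed}/{len(cases)}")
    print("top error modes:", failures.most_common(3))

evaluate([{"input": "example question", "expected": "example answer"}])
```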

0
0
1
21
dan mason retweeted
Alex Albert
Alex Albert@alexalbert__·
Today we're introducing Skills in claude dot ai, Claude Code, and the API. Skills let you package specialized knowledge into reusable capabilities that Claude loads on demand as agents tackle more complex tasks. Here's how they work and why they matter for the future of agents:
Alex Albert tweet media
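One way to picture the load-on-demand idea. This is a purely illustrative sketch, not Anthropic's implementation or file format: keep only short skill descriptions resident, and pull a skill's full instructions into the prompt when the task calls for it.

```python
# Toy sketch of on-demand skill loading (illustrative only; not Anthropic's
# implementation or file format). Only short descriptions stay resident;
# a skill's full instructions enter the context when the task matches.
SKILLS = {
    "spreadsheet-review": {
        "description": "checking formulas and layout in spreadsheets",
        "instructions": "Open the file, verify formulas column by column, ...",
    },
    "brand-voice": {
        "description": "writing in the company's brand voice",
        "instructions": "Prefer short sentences, avoid jargon, ...",
    },
}

def select_skill(task):
    # Cheap keyword match as a stand-in; a real system would let the model
    # choose from the resident descriptions.
    for name, skill in SKILLS.items():
        if any(word in task.lower() for word in skill["description"].split()):
            return name
    return None

def build_prompt(task):
    name = select_skill(task)
    skill_block = SKILLS[name]["instructions"] if name else ""
    # Only the chosen skill's full instructions enter the context window.
    return f"{skill_block}\n\nTask: {task}".strip()

print(build_prompt("Review the quarterly spreadsheet formulas"))
```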
124
410
3.4K
597.4K