Carson Poole

1.3K posts

Carson Poole

@CarsonPoole

New York City Katılım Mayıs 2010

135 Takip Edilen940 Takipçiler

Sabitlenmiş Tweet

Carson Poole@CarsonPoole·8 Nis

this was lots of fun and lots of all nighters over the past few weeks. really happy with what we achieved!

Artificial Analysis@ArtificialAnlys

NVIDIA Blackwell can achieve 303 output tokens/s for DeepSeek R1 in FP4 precision, per our benchmarking of an Avian API endpoint Artificial Analysis benchmarked DeepSeek R1 on an @avian_io private API endpoint. Running DeepSeek R1 in FP4 precision on NVIDIA Blackwell, their endpoint achieved 303 output tokens/s - the fastest speed we have measured yet for DeepSeek R1. The FP4 version of DeepSeek R1 maintained accuracy across our evaluation suite as compared to the native FP8 version. Inference speed is especially critical for reasoning models that ‘think’ before they answer - we look forward to wider availability of NVIDIA Blackwell hardware in the coming months!

English

2.3K

Carson Poole@CarsonPoole·3d

@MartinShkreli different view… have heard it’s a generator on fire but no clue if that’s true

English

8.1K

Martin Shkreli@MartinShkreli·3d

Well, that’s not normal

English

182

1.6K

513.5K

Carson Poole@CarsonPoole·31 Oca

@allTheYud is getting victory lap material the likes of which have heretofore never been seen. my fast takeoff has been updated from ~0 to 5% and I’m unironically having slightly whimsical thoughts of what happens if the electric grid is down in the morning

English

Carson Poole@CarsonPoole·29 Oca

GPT-5.2T vs Opus 4.5 shows some major big model smell

English

146

Carson Poole@CarsonPoole·28 Oca

someone pls make a worldle that has a daily leaderboard for whom can make the highest logprob sentence in some range of tokens

English

Carson Poole@CarsonPoole·27 Oca

@karpathy new visit to @dwarkesh_sp’s podcast soon?

English

507

Andrej Karpathy@karpathy·26 Oca

A few random notes from claude coding quite a bit last few weeks. Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in December. i.e. I really am mostly programming in English now, a bit sheepishly telling the LLM what code to write... in words. It hurts the ego a bit but the power to operate over software in large "code actions" is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming and it happened over the course of a few weeks. I'd expect something similar to be happening to well into double digit percent of engineers out there, while the awareness of it in the general population feels well into low single digit percent. IDEs/agent swarms/fallability. Both the "no need for IDE anymore" hype and the "agent swarm" hype is imo too much for right now. The models definitely still make mistakes and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might do. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc. They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?" and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE . md. Despite all these issues, it is still a net huge improvement and it's very difficult to imagine going back to manual coding. TLDR everyone has their developing flow, my current is a small few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits. Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch it struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased. Speedups. It's not clear how to measure the "speedup" of LLM assistance. Certainly I feel net way faster at what I was going to do, but the main effect is that I do a lot more than I was going to do because 1) I can code up all kinds of things that just wouldn't have been worth coding before and 2) I can approach code that I couldn't work on before because of knowledge/skill issue. So certainly it's speedup, but it's possibly a lot more an expansion. Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them. Put it in the loop with a browser MCP. Write the naive algorithm that is very likely correct first, then ask it to optimize it while preserving correctness. Change your approach from imperative to declarative to get the agents looping longer and gain leverage. Fun. I didn't anticipate that with agents programming feels *more* fun because a lot of the fill in the blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck (which is not fun) and I experience a lot more courage because there's almost always a way to work hand in hand with it to make some positive progress. I have seen the opposite sentiment from other people too; LLM coding will split up engineers based on those who primarily liked coding and those who primarily liked building. Atrophy. I've already noticed that I am slowly starting to atrophy my ability to write code manually. Generation (writing code) and discrimination (reading code) are different capabilities in the brain. Largely due to all the little mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it. Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media. We're also going to see a lot more AI hype productivity theater (is that even possible?), on the side of actual, real improvements. Questions. A few of the questions on my mind: - What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*. - Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro). - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music? - How much of society is bottlenecked by digital knowledge work? TLDR Where does this leave us? LLM agent capabilities (Claude & Codex especially) have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering and closely related. The intelligence part suddenly feels quite a bit ahead of all the rest of it - integrations (tools, knowledge), the necessity for new organizational workflows, processes, diffusion more generally. 2026 is going to be a high energy year as the industry metabolizes the new capability.

English

1.6K

5.4K

39.4K

7.6M

Carson Poole@CarsonPoole·24 Oca

I have never understood why people need these tools? What is hard about (# billion params) * (2 for 16 bit, 1 for 8bit, etc) * (fudge factor for activations, kv cache) < (vram on your GPU in GB)

neural nets.@cneuralnetwork

I made an internal tool for myself to check the VRAM required to run models on GPUs Open-sourcing it today! "do-i-have-the-vram" checks the amount of vram you need to loading the model, without loading the model! use it by running ` pip install do-i-have-the-vram `

English

156

Carson Poole@CarsonPoole·19 Oca

@ID_AA_Carmack I find it interesting that like many things before, meaningful ML research progress manifests as a 1-2 line of code delta (without hiding complexity): Dropout, Adam(W), ResNets, BatchNorm, etc

English

462

John Carmack@ID_AA_Carmack·19 Oca

#PaperADay 7 Cautious Weight Decay arxiv.org/abs/2510.12402 This is a 36 page paper about a very simple idea: Don’t apply weight decay when it is in opposition to the current optimizer step. If the step is moving the weight farther from zero, there is no decay. If the step is towards zero, decay moves it in faster. They spent 20,000 H100 GPU hours (about $60k!) testing this across multiple optimizers and models, and it looks like it is basically always a modest improvement, with no changes to any hyperparameters. My current models use weight norm on most of the parameters, but there are still some with traditional weight decay. A first test with this idea does seem to be a tiny improvement, but I will need to do more runs to have confidence in it. Two modifications of the idea come to mind: Use the current gradient instead of the optimizer step (which includes momentum) for masking, as in arxiv.org/abs/2411.16085. If using a cautious optimizer, explicitly masking the optimizer step before calculating cautious weight decay would do this automatically. Weight decay is an exponential effect, which mixes with a linear learning rate. It might be interesting to just have a larger learning rate when the step is heading towards zero than away from it, which would do similar things, but with different learning dynamics.

English

412

50.2K

Carson Poole@CarsonPoole·13 Oca

the comparison is really striking between when Google released Lion versus everybody quietly switching to Muon

elie@eliebakouch

lfg, deepseek uses Muon in the ablation setup of their latest paper

English

160

Carson Poole@CarsonPoole·13 Oca

@Bojangles when Manhattan?

English

1.8K

Bojangles@Bojangles·12 Oca

NEW YORK WE'VE ARRIVED! Come see us at 📍 5910 Church Avenue, Brooklyn, NY 11203 🍗

English

142

1.7K

257.8K

Carson Poole@CarsonPoole·5 Oca

@fleetwood___ Had more short term conviction on the long side of the trade, so it’s done alright in the past 6 months

English

Fleetwood@fleetwood___·12 Ara

@CarsonPoole Would have been a very bad bet

English

Carson Poole@CarsonPoole·23 Tem

just gonna go ahead and lock in this prediction: diffusion LMs will eat GPT models and I’m probably long FLOPs short HBM

Chen-Yu Lee@chl260

Thrilled to introduce "𝗗𝗲𝗲𝗽 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵𝗲𝗿 𝘄𝗶𝘁𝗵 𝗧𝗲𝘀𝘁-𝗧𝗶𝗺𝗲 𝗗𝗶𝗳𝗳𝘂𝘀𝗶𝗼𝗻," a new deep research agent designed to mimic the iterative nature of human research, complete with cycles of planning, drafting, and revision. 🚀🚀 arxiv.org/pdf/2507.16075

English

1.4K

Carson Poole retweetledi

Drafted@DraftedAI·24 Ara

Hello World 👋 Welcome to Drafted — an AI tool that lets anyone design a home from scratch, tailored to your life. techcrunch.com/2025/12/23/thi…

English

119

165K

Carson Poole@CarsonPoole·23 Ara

a phenomenon I haven’t seen anybody point out is what happens when you can “few shot” a robot? with sufficient scale this ability emerged with LLMs. instead of training it to perform a specific task, can you show it 2-3 representative examples of itself doing said task?

Physical Intelligence@physical_int

We got our robots to wash pans, clean windows, make peanut butter sandwiches, and more! Fine-tuning our latest model enables all of these tasks, and this has interesting implications for robotics, Moravec's paradox, and the future of large models in embodied AI. More below!

English

124

Carson Poole@CarsonPoole·16 Ara

@stochasticchasm Deepseek did this for R1; they went from 7k dim for the attention to 2k for the (non-shared) experts. I’m assuming there’s something more here they’re doing beyond that to qualify this as truly novel? (I haven’t looked at their code yet)

English

stochasm@stochasticchasm·15 Ara

Just straight up doing MoE in lower dim, kinda interesting

elie@eliebakouch

@vikhyatk @xeophon @justjoshinyou13 #page=3.21" target="_blank" rel="nofollow noopener">research.nvidia.com/labs/nemotron/…

English

122

13.4K

Carson Poole@CarsonPoole·26 Eki

the momentum is building

Andrej Karpathy@karpathy

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right bottom) is the dominant paradigm in text. For audio I've seen a bit of both. A lot of diffusion papers look a bit dense but if you strip the mathematical formalism, you end up with simple baseline algorithms, e.g. something a lot closer to flow matching in continuous, or something like this in discrete. It's your vanilla transformer but with bi-directional attention, where you iteratively re-sample and re-mask all tokens in your "tokens canvas" based on a noise schedule until you get the final sample at the last step. (Bi-directional attention is a lot more powerful, and you get a lot stronger autoregressive language models if you train with it, unfortunately it makes training a lot more expensive because now you can't parallelize across sequence dim). So autoregression is doing an `.append(token)` to the tokens canvas while only attending backwards, while diffusion is refreshing the entire token canvas with a `.setitem(idx, token)` while attending bidirectionally. Human thought naively feels a bit more like autoregression but it's hard to say that there aren't more diffusion-like components in some latent space of thought. It feels quite possible that you can further interpolate between them, or generalize them further. And it's a component of the LLM stack that still feels a bit fungible. Now I must resist the urge to side quest into training nanochat with diffusion.

English

Carson Poole@CarsonPoole·14 Eki

>1000 tps single batch inference. short HBM long FLOPs

Ant Ling@AntLingAGI

A new milestone on dLLMs🚀🚀🚀

English

206

Carson Poole@CarsonPoole·11 Ağu

my take (x.com/CarsonPoole/st…) continues to age well

Jinjie Ni@NiJinjie

Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch — up to 8B params, 480B tokens, 480 epochs. Findings: > DLMs beat AR when tokens are limited, with >3× data potential. > A 1B DLM trained on just 1B tokens hits 56% HellaSwag & 33% MMLU — no tricks, no cherry-picks. > No saturation: more repeats = more gains. 🚨 ”x.openreview.net” We also dissected the serious methodological flaws in our parallel work “Diffusion Beats Autoregressive in Data-Constrained Settings” — let’s raise the bar for open review! 🔗 Blog & details: jinjieni.notion.site/Diffusion-Lang… 18 🧵s ahead:

English

701

Carson Poole@CarsonPoole·9 Eki

why does this look like nvidia selling gloves

English

106

Carson Poole@CarsonPoole·8 Eki

TIL the originator of the phrase "embarassingly parallel" is Cleve Moler, the creator of matlab (sorry if that gives you painful flashbacks)

English

100

Carson Poole@CarsonPoole·5 Eki

make (positive) science fiction a reality

Power to the People ☭🕊@ProudSocialist

This is diabolical. There is no future for humanity when AI consumes this much energy. Shut these evil robot corporations down and make science fiction films fiction again!!!

English

194

Carson Poole@CarsonPoole·3 Eki

ZXX

105

Carson Poole@CarsonPoole·3 Eki

@gregisenberg x.com/CarsonPoole/st…

Carson Poole@CarsonPoole

calling it now: the way people use “chat” in reference to ChatGPT will become similarly prolific to how Google became a verb, but this time it’s appropriating a normal English verb instead of a trademarked term

QME

GREG ISENBERG@gregisenberg·2 Eki

people outside the tech bubble call chatgpt just “chat” pretty cool.

English

695

Keşfet

@MartinShkreli @allTheYud @karpathy @dwarkesh_sp @ID_AA_Carmack @Bojangles @fleetwood___ @stochasticchasm