
Jeff Preshing
@preshing · 2.6K posts
Canadian game developer
Toque weather · Joined July 2008
530 Following · 4.3K Followers

Jeff Preshing retweeted

Doing some experiments today with Opus 4.6's 1M context window.
Trying to push coding sessions deep into what I would consider the 'dumb zone' of SOTA models: >100K tokens.
The drop-off in quality is really noticeable. Dumber decisions, worse code, worse instruction-following.
Don't treat a 1M context window any differently.
It's still 100K of smart, and 900K of dumb.

So refreshing to hear practical technical discussions about AI amidst all the hyperbole and clickbait that have dominated 𝕏 in recent years. Keep up the good work @shanselman Big fan!
Scott Hanselman 🌮@shanselman
Prompt engineering is just hoping really hard. Inference engineering is the firewall that actually guarantees it. hanselminutes.com/1038

Here's an example why you need to review AI generated code if you don't want your codebase to deteriorate to shit: gist.github.com/JarkkoPFC/088c…


@jackclarkSF @deredleritt3r I see AI systems as inference engines that require humans to design, operate and direct them, so I'm not confused at all.
If a theory/insight is consistent with existing knowledge, then it should be possible to build an AI system to generate it, but that's a human endeavor.

Just to operationalize this, using the MOLG definition:
"Smarter than a Nobel Prize winner" - huge error bars? 20%?***
Has the same interfaces as a human working virtually - 100%; this is true today given plugins/composite systems.
Tasks that take hours/days/weeks - 80%, if the METR fast trend holds.
Can control existing tools - I feel uncalibrated here and don't think it's especially important. It depends more on things having APIs, and many things do.
Resources can be repurposed to run many copies - 100%; this is true today.
***This is the biggest variable imo, and it's really confusing. In the past couple of months, evidence has emerged of AI helping humans co-develop solutions at the frontier of science in bio, math, physics, etc. And many experts are increasingly impressed/surprised by its capabilities.
HOWEVER
I think no AI system has yet had a simple and rebellious insight on par with stuff like coming up with general relativity, CRISPR, etc. This may just be a function of AIs not being given enough opportunities to do open-ended agent-led experimentation. Or it could be something deeper - maybe these systems lack some quality that allows for the outrageous and inspiring creativity of humans that change the paradigm.

Jack Clark continues to believe that "powerful AI" is achievable *this year* and "running many copies" in 2027.
Jack Clark@jackclarkSF
@chatgpt21 yes

@emollick How do we stop people from doing bad things? With laws. How do we deal with bugs caused by using LLMs? Sandboxing, testing, observability and support — same as every other software component. What further justification are you looking for?

"It's more of an observational stress test than a hard validator — it deliberately creates messy allocation patterns and logs what happens so you can watch fragmentation behavior.
It has a list of target totals (in KB): 400 → 100 → 2000 → 400 → 5000 … → 0.
For each target it grows or shrinks the total allocated bytes irregularly:
Grow phase: alloc → free random → alloc (repeat until target).
Shrink phase: free random → alloc small → free random.
Allocation sizes are randomized:
10 % chance: large (100–400 KB)
25 % chance: medium (5–15 KB)
65 % chance: small (10–509 bytes)
After every operation it logs: operation count, current user-allocated bytes, and Heap::getStats().totalSystemMemoryUsed.
No validate(), no coalescing assertions, no pass/fail — you just watch the logs. If system memory keeps growing even when user-allocated bytes drop, you have external fragmentation."
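The grow/shrink loop described above can be sketched roughly as follows, using `std::malloc`/`std::free` as a stand-in for the repo's `Heap` class. The target totals, size-class probabilities, and per-operation logging follow the description; everything else (names, seed, structure) is illustrative.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

// Sketch of the fragmentation stress loop: grow or shrink total
// user-allocated bytes toward a target, freeing random blocks along
// the way to create messy allocation patterns.
struct Block { void* ptr; size_t size; };

static std::mt19937 rng{12345};

static size_t randomAllocSize() {
    int roll = std::uniform_int_distribution<int>(0, 99)(rng);
    if (roll < 10)   // 10% chance: large (100-400 KB)
        return std::uniform_int_distribution<size_t>(100 * 1024, 400 * 1024)(rng);
    if (roll < 35)   // 25% chance: medium (5-15 KB)
        return std::uniform_int_distribution<size_t>(5 * 1024, 15 * 1024)(rng);
    // 65% chance: small (10-509 bytes)
    return std::uniform_int_distribution<size_t>(10, 509)(rng);
}

static void stepToTarget(std::vector<Block>& blocks, size_t& userBytes,
                         size_t targetBytes, long& opCount) {
    while (userBytes != targetBytes) {
        if (userBytes < targetBytes) {
            // Grow phase: allocate, capped so we land exactly on target.
            size_t size = std::min(randomAllocSize(), targetBytes - userBytes);
            blocks.push_back({std::malloc(size), size});
            userBytes += size;
        } else if (!blocks.empty()) {
            // Shrink phase: free a randomly chosen block.
            size_t i =
                std::uniform_int_distribution<size_t>(0, blocks.size() - 1)(rng);
            std::free(blocks[i].ptr);
            userBytes -= blocks[i].size;
            blocks[i] = blocks.back();
            blocks.pop_back();
        }
        ++opCount;
        // The real test also logs Heap::getStats().totalSystemMemoryUsed here.
        std::printf("op %ld: user bytes = %zu\n", opCount, userBytes);
    }
}
```

Driving `stepToTarget` with the target list 400 → 100 → 2000 → 400 → 5000 … → 0 KB reproduces the shape described in the quote: watch whether system memory keeps growing while user bytes drop.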

The agent ran the fragmentation test several times and corrected its own mistakes. At one point it tried to use gdb but couldn't, so it modified PLY_ASSERT to log the file/line instead, which I have to admit is a better default behavior. All can be seen in the transcript.
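A log-instead-of-trap assert along the lines described above might look like this. This is a minimal sketch, not the repo's actual `PLY_ASSERT`; the macro name and failure counter are illustrative.

```cpp
#include <cstdio>

// Sketch of a log-only assert in the spirit of the PLY_ASSERT change
// described above: on failure, print the condition, file, and line to
// stderr instead of trapping into a debugger. Counter and name are
// illustrative, not the repo's actual code.
static int g_assertFailures = 0;

#define LOG_ASSERT(cond) \
    do { \
        if (!(cond)) { \
            std::fprintf(stderr, "Assert failed: %s at %s:%d\n", #cond, \
                         __FILE__, __LINE__); \
            ++g_assertFailures; \
        } \
    } while (0)
```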
Hey @grok - any chance you could explain how `apps/fragmentation-test` works in the linked repo?

😲 Wow! Codex 5.3 wrote a complete, general-purpose C++ memory allocator for me in just 30 minutes. Bada bing bada boom. The code is clearly written, well-documented, efficient, handles fragmentation well and stands up to a battery of tests. I was able to submit the AI-generated work directly to the main branch with no additional changes on my part.
Of course, if you want the full story, I should also mention that I spent several days preparing the workspace, designing the fragmentation test, customizing the AGENTS file and iterating on the prompt in addition to those 30 minutes. But I still find it very cool. Using a powerful LLM is like using a fax machine to get the answer back from a parallel universe where the remaining work has already been completed. 📠
For anyone interested, the prompt used can be seen in the commit description: github.com/preshing/plywo…

Another fun followup: I ran the same prompt entirely on local hardware using a 4-bit quant of Qwen3.5:122b, another open weight model released two days ago.
It worked for an hour and 20 minutes, producing code and documentation that were remarkably close to what I asked for, though it ignored the project's coding style. Then it declared success even though the test suite actually crashed. It told me the crash was "due to a pre-existing issue," but it was actually caused by a huge memory leak in the code it wrote. Still pretty impressive considering it ran locally. I wonder if some coaching would help it iron out the bugs.
Transcript available here: github.com/preshing/plywo…

As an experiment, I ran the same prompt using Minimax-M2.5, a popular open weight model. It worked for 40 minutes, wrote a 300-line allocator that only implemented small bins, then gave up saying, "A full-featured allocator would require significantly more work to handle all edge cases correctly."
So I ran the same prompt again and it added some code to support tree bins. The code looks reasonable, but now it's spiraling trying to fix build errors.
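For context, "small bins" and "tree bins" echo the layout of dlmalloc-style allocators: an array of exact-size free lists for small requests, and range-based trees for everything larger. A minimal sketch of the small-bin indexing, with illustrative constants (not the model's generated code):

```cpp
#include <cstddef>

// Sketch of dlmalloc-style small-bin indexing: small requests map to
// exact-size free lists spaced kAlignment bytes apart; larger requests
// would fall through to range-based "tree bins". Constants illustrative.
constexpr size_t kAlignment = 8;
constexpr size_t kNumSmallBins = 32;
constexpr size_t kMaxSmallRequest = kAlignment * kNumSmallBins;  // 256 bytes

// Round a request up to the allocator's alignment granularity.
constexpr size_t roundUp(size_t bytes) {
    return (bytes + kAlignment - 1) & ~(kAlignment - 1);
}

// Returns the small-bin index for a request, or -1 if the request
// must be served from the tree bins instead.
constexpr int smallBinIndex(size_t bytes) {
    size_t rounded = roundUp(bytes);
    if (rounded > kMaxSmallRequest)
        return -1;
    return static_cast<int>(rounded / kAlignment) - 1;
}
```

Small bins alone cover only requests up to `kMaxSmallRequest`, which is why an allocator that stops there is far from full-featured.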

@Zino2201_ Locks aren't slow; lock contention is. preshing.com/20111118/locks…
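The linked article's point can be demonstrated directly: the same lock doing the same total work, timed once with one thread (uncontended) and once with several (contended). A minimal sketch; the workload is illustrative and the timings vary by machine, so no specific numbers are claimed.

```cpp
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Same lock, same total work: compare elapsed time for locked
// increments done by one thread (uncontended) vs. several (contended).
static long counter = 0;
static std::mutex mtx;

static void addLocked(long iters) {
    for (long i = 0; i < iters; ++i) {
        std::lock_guard<std::mutex> guard(mtx);
        ++counter;
    }
}

// Runs `iters` total locked increments split across `threads` threads
// and returns elapsed milliseconds.
static double timedRun(int threads, long iters) {
    counter = 0;
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back(addLocked, iters / threads);
    for (auto& th : pool)
        th.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

On typical hardware the single-threaded run is cheap because the lock is almost always uncontended; the multi-threaded run pays for contention, not for the lock itself.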

@mateberes_ Locks aren't slow; lock contention is. preshing.com/20111118/locks…

@preshing I stopped reading at the global mutex in the allocator.

This is using the minimalist agent harness pi.dev by @badlogicgames, by the way. You can even use it "off the grid" by running qwen3-coder-next (or a similar model) on local hardware. There's no question, however, that the commercial labs have much more capable models.
Jeff Preshing@preshing
Git finally has an easy-to-use command line interface: pi -p "Rename remote origin to local"
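For the record, the plain-git spelling of that prompt is a single built-in command; the demo below sets up a throwaway repo first so it runs standalone (the remote URL is a placeholder):

```shell
# Set up a throwaway repo with a remote named "origin" (URL is a placeholder).
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git remote add origin https://example.com/repo.git

# The plain-git equivalent of the prompt:
git remote rename origin local

git remote   # prints: local
```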

Having fun using Codex 5.3 ($20 plan) to improve my C++ Markdown parser. From 277 to 402 passing test cases since yesterday. AI really shines at this kind of work! I'm directing the effort and playing code janitor, but it even helps with those things too. github.com/preshing/plywo…

"AI systems amplify well-structured knowledge while punishing undocumented systems." 🎯
This sums up well how software development is changing, I think. Well-organized projects with clear documentation and legible code will have a big advantage. It isn't obvious how granular the documentation should be, or how to best feed information to agents at the right time, but success will favor teams who get it right. I hope to see more discussion about best practices going forward. Thanks @clattner_llvm for sharing your perspective and expertise.
Chris Lattner@clattner_llvm
The Claude C Compiler is the first AI-generated compiler that builds complex C code, built by @AnthropicAI. Reactions ranged from dismissal as "AI nonsense" to "SW is over": both takes miss the point. As a compiler🐉 expert and experienced SW leader, I see a lot to learn: 👇

This is the one I'm currently using in my open source project: github.com/preshing/plywo…
It's enough to make the frontier models use my API pretty well, but weaker models like qwen3-coder-next still keep pulling standard C functions in. I'm wondering if being more explicit will help, like "If you need to do X, use Y."
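For illustration, the "if you need to do X, use Y" style might look something like this in an AGENTS file. The rule text below is hypothetical, not taken from the repo; it deliberately avoids naming specific project identifiers:

```
# API usage rules
- Never call standard C string/memory functions (strcpy, memcpy, sprintf, etc.).
- If you need to format text, use the project's string-builder API.
- If you need to copy or resize a buffer, use the project's buffer API.
```

The idea is to pair each prohibition with the concrete replacement, so a weaker model doesn't have to infer the mapping on its own.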

@preshing For devs who don't care about performance, memory usage, maintainability, and other pesky NFRs, AI coding is working perfectly 😉 Anyway, I'll be testing how teaching an AI agent via an agents file works, and whether I can significantly improve the code quality AI produces.

I'm convinced that if you're building a large piece of commercial software, the only good way to use AI coding agents is as a power tool, without forsaking your ability to understand the code.
On the other hand, if you're building disposable software — for personal use, for a demo, or just experimenting — then abandoning understanding of the code is totally fine. If the project is simple, the AI-generated code will be easy to understand anyway.
I see many takes claiming the opposite. People are saying that soon, understanding the code won't even matter. The logic goes: AI is improving all the time, therefore, it will eventually just do everything without human intervention. Some people are already trying to live in this imagined future: Running Ralph loops, getting agents to supervise other agents, cranking out hundreds of commits per day. More power to them, but I don't plan to work that way.
Of course AI will keep improving. In the best case, those improvements will lead us towards greater simplicity. They'll help untangle spaghetti code, suggest better ways to organize the project, generate clear, up-to-date documentation. Projects will become easier to understand and onboarding will become easier for both humans and AI alike. This is the only trend that makes sense to me.
I can imagine that in 20 years or so, software engineering will no longer be a glamour job. Development tools will be more intuitive. The opportunities to get rich overnight will fade. We'll still need programmers and system maintainers, but not as many. But I think it will take time.

