
Bernie Sanders: “60% of our people living paycheck-to-paycheck, and one guy, Elon Musk, owns more wealth than the bottom 53% of American households... Think maybe that might be an issue that we should be talking about?"
David Sweet
@phinance99
Applied Epistemologist. Learn to experiment: https://t.co/F9l8CmYFn2 Prevent code complexity creep: `cargo install kiss-ai`



I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of ~630 lines of code. Then:

- the human iterates on the prompt (.md)
- the AI agent iterates on the training code (.py)

The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc.

github.com/karpathy/autor…

Part code, part sci-fi, and a pinch of psychosis :)
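The outer loop described above can be sketched roughly like this. This is a minimal, hypothetical hill-climbing skeleton, not the actual autoresearch code: `run_training` stands in for a full 5-minute training run, `agent_propose` stands in for the coding agent editing the training script, and a "commit" is kept only when validation loss improves.

```python
import random

def run_training(config):
    # Hypothetical stand-in for one complete 5-minute LLM training run;
    # returns the validation loss at the end of the run.
    rng = random.Random(str(sorted(config.items())))
    return 4.0 - 0.1 * config["depth"] + rng.uniform(-0.02, 0.02)

def agent_propose(config, rng):
    # Hypothetical stand-in for the coding agent editing the training
    # script; here it just nudges one hyperparameter.
    new = dict(config)
    new["depth"] = max(1, min(12, new["depth"] + rng.choice([-1, 1])))
    return new

def research_loop(steps=20, seed=0):
    # Keep a candidate change (i.e. "commit" it on the feature branch)
    # only if it lowers validation loss; otherwise discard it.
    rng = random.Random(seed)
    best_cfg = {"depth": 4}
    best_loss = run_training(best_cfg)
    commits = []  # analogous to accumulated git commits
    for step in range(steps):
        cand = agent_propose(best_cfg, rng)
        loss = run_training(cand)
        if loss < best_loss:
            best_cfg, best_loss = cand, loss
            commits.append((step, best_cfg, best_loss))
    return best_loss, commits
```

Comparing prompts or agents then amounts to running this loop with different `agent_propose` implementations and plotting the loss of each run, one dot per training run.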







I have stumbled onto a way to improve agent steering. Namely, how to improve performance when you say "make sure you do this" and the LLM doesn't do it. Here it is:

Saying "remember to do X" is unreliable: it requires the agent to spontaneously initiate a procedural behavior. But presenting the agent with a specific, possibly-wrong claim ("You should be doing X - are you still doing it?") reliably triggers corrective behavior when the claim is wrong. The agent doesn't need to remember to check. The mismatch between the presented state and the actual state creates a correction event that the agent LLM naturally responds to.

This reminds me of the old maxim that the best way to get a correct answer on the internet is to post a wrong one, and I guess that makes sense, since LLMs are predominantly the distilled "knowledge" of the internet. Anyhow, I've been building a long-running memory system for my agents, and implementing it this way fixed a lot of problems.
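As a prompt-construction helper, the contrast looks something like this (a minimal sketch; the function names and wording are illustrative, not from any particular memory system):

```python
def passive_reminder(task: str) -> str:
    # Unreliable form: depends on the agent spontaneously remembering.
    return f"Remember to {task}."

def state_claim_reminder(task: str, believed_state: str) -> str:
    # Present a specific, possibly-wrong claim about the agent's state.
    # If the claim is wrong, the mismatch itself is what the model
    # responds to: it emits a correction rather than having to remember
    # to self-check.
    return (
        f"You should be doing this: {task}. "
        f"As of the last checkpoint you were {believed_state}. "
        f"Is that still true? If not, correct course now and state what changed."
    )

# Example: inject the claim into the agent's context each turn, sourcing
# believed_state from the (possibly stale) long-running memory store.
prompt = state_claim_reminder(
    task="update the progress log after each file edit",
    believed_state="updating the log after every edit",
)
```

The key design choice is that `believed_state` is allowed to be stale or wrong; staleness is what generates the correction event.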












This problem was solved in ~1950. Read about Toyota quality. Also: Shewhart, Deming, Six Sigma.

tldr: Define quality. Measure it at every step. Don't proceed to the next step until quality is high enough. Take small steps.

In practical terms: Make a small change. Insist it passes linters, tests, and reviews. Repeat ad infinitum.

Also: `cargo install kiss-ai`, a linter for code complexity.
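The practical loop above reduces to a simple gate. A minimal sketch (the check commands are hypothetical examples; substitute your project's own linters and tests):

```python
import subprocess

def quality_gate(checks):
    # Don't proceed to the next step until quality is high enough:
    # every check must pass before a change is accepted.
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return False  # stop at the first failing step
    return True

# Hypothetical example commands; swap in your project's real tools.
CHECKS = [
    ["ruff", "check", "."],  # lint
    ["pytest", "-q"],        # tests
]
```

The discipline is the loop around it: make one small change, run the gate, and only move on when it passes.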




