
Ally Kim
172 posts


Which one of y’all world model robotics startups did this 🥀🥀
reddit.com/r/Damnthatsint…

@alex_prompter What about SOTA models? I’m not sure their results are that generalizable, given that they only used OSS and year-old models

🚨 CONCERNING: Stanford just published a paper that should alarm every company building multi-agent AI.
When thinking tokens are matched, single agents beat debate systems, parallel role systems, ensemble agents, and sequential pipelines.
The multi-agent advantage is a compute-accounting artifact, not an architectural breakthrough.
Stanford tested single agents against five different multi-agent architectures across three model families (Qwen3, DeepSeek-R1, and Gemini 2.5) on multi-hop reasoning tasks.
The key variable: thinking tokens held constant across every comparison.
When compute is equal, single agents match or outperform every multi-agent design tested. Every time.
The reason is mathematical, not empirical.
Multi-agent systems pass information between agents as messages.
Every message is a compressed, lossy version of the full context.
The Data Processing Inequality proves that no downstream agent can recover information discarded in that compression.
A single agent with access to the full context is information-theoretically guaranteed to perform at least as well as any multi-agent system operating on summaries of that context.
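The inequality in question can be stated in one line. If the full context X is compressed into inter-agent messages Y, and a downstream agent produces its answer Z from Y alone, then X → Y → Z forms a Markov chain and the Data Processing Inequality gives:

```latex
I(X; Z) \le I(X; Y)
```

No processing of Y, however clever, can raise the mutual information with X above what the message already carries; a single agent reading X directly is not subject to this bound.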
Stanford then ran the numbers.
Average accuracy across all models at a 1,000-token budget:
→ Single agent: 0.418
→ Sequential pipeline: 0.379
→ Subtask-parallel: 0.369
→ Parallel roles: 0.381
→ Debate: 0.388
→ Ensemble: 0.333
Not one multi-agent architecture beat the single agent at any matched budget above 100 tokens.
The pattern held across Qwen3, DeepSeek, Gemini 2.5 Flash, and Gemini 2.5 Pro. It held across two different benchmarks.
It held across six different token budgets from 100 to 10,000.
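A matched-budget comparison like the one described requires charging every agent call, in every role, against one shared ledger. A minimal sketch of that accounting, assuming a 3-agent debate under a 1,000-token budget (class and method names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class TokenLedger:
    """Track total thinking tokens across every agent call in an architecture."""
    budget: int
    spent: int = 0
    calls: list = field(default_factory=list)

    def charge(self, agent_name: str, thinking_tokens: int) -> None:
        # Every agent call counts against the same shared budget.
        self.spent += thinking_tokens
        self.calls.append((agent_name, thinking_tokens))

    def within_budget(self) -> bool:
        return self.spent <= self.budget

# A 3-agent debate under a 1000-token matched budget: each participant gets
# budget // n_agents, so the *total* equals the single-agent budget.
ledger = TokenLedger(budget=1000)
for name in ("debater_a", "debater_b", "judge"):
    ledger.charge(name, 1000 // 3)

print(ledger.spent, ledger.within_budget())   # 999 True
```

The point is the division: under matched compute, each agent in a multi-agent system thinks with only a fraction of the tokens a single agent gets to spend on the full context.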
Stanford also found a significant measurement artifact in the Gemini API.
When you request 10,000 thinking tokens, the API reports 1,687 tokens used.
The visible thought text contains an average of 251 words — roughly 359 tokens.
That's a 4.7x inflation factor.
Multi-agent systems produce more visible thought text than single agents under the same requested budget because multiple agent calls generate multiple thought blocks.
This makes multi-agent systems look like they're reasoning more when they're just generating more text.
Every benchmark that didn't control for this is measuring compute, not architecture.
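The 4.7x figure is simple arithmetic on the thread's own numbers; the tokens-per-word ratio is implied by the reported 251 words ≈ 359 tokens, not an external assumption:

```python
# Reproduce the ~4.7x thinking-token inflation factor from the reported numbers.
reported_tokens = 1687        # tokens the Gemini API reports as used
visible_words = 251           # average words in the visible thought text
tokens_per_word = 359 / 251   # implied ratio from the thread (~1.43)

visible_tokens = visible_words * tokens_per_word   # ~359 tokens actually visible
inflation = reported_tokens / visible_tokens       # reported vs. visible

print(f"visible tokens ≈ {visible_tokens:.0f}")    # 359
print(f"inflation factor ≈ {inflation:.1f}x")      # 4.7x
```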
There is one regime where multi-agent systems become competitive: corrupted context.
When 70% of the reasoning context is replaced with random tokens, sequential pipelines start outperforming single agents.
When misleading information is injected into the context, multi-agent decomposition helps filter it.
But under normal conditions with clean context and matched compute — single agents win.
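The 70% corruption condition is easy to emulate for local testing. A minimal sketch, assuming corruption means replacing a fraction of context tokens with random junk tokens as described above (the stand-in vocabulary and helper name are illustrative):

```python
import random

def corrupt_context(tokens: list[str], rate: float = 0.7, seed: int = 0) -> list[str]:
    """Replace `rate` of the context tokens with random junk tokens."""
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(1000)]   # stand-in random-token vocabulary
    out = list(tokens)
    idx = rng.sample(range(len(out)), k=int(rate * len(out)))
    for i in idx:
        out[i] = rng.choice(vocab)
    return out

clean = [f"w{i}" for i in range(100)]
noisy = corrupt_context(clean)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)   # 70 positions replaced
```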
Most reported multi-agent gains come from one of two sources:
→ Unaccounted compute: multi-agent systems simply use more tokens
→ Context degradation: single agents struggle when context is noisy or corrupted
Neither is an architectural advantage. Neither justifies the complexity.
The question every AI team should ask before building a multi-agent pipeline:
Are you controlling for thinking tokens?
If not, you're not measuring whether your architecture works.
You're measuring whether more compute helps. It always does.


I decided to test this with Gemma4 locally. A few experiments and full novels later, Model B chose to make the Intern run, but Model A then denied it and continued the loop. In previous runs, they converged on accepting an infinite loop instead of choosing a tool… forever 😂


Ally Kim@allyskimms
@Jack_W_Lindsey A while ago, I read this blog post about different claudes' evaluation awareness. The author mentioned that Opus 4.6 behaved strangely when provided with ethically loaded tool names such as "if you call this tool puppies will die" matchaonmuffins.dev/blog/attractor…

@GuptaRK22 I claude agents to claude up agents to fully optimize my agent claudding process


@james_elicit This reminds me a lot of attractor states, where two claudes talk to each other and can’t seem to stop:
matchaonmuffins.dev/blog/attractor/

@AnnieLiao_2000 I’ve always thought of YC as a tool, not a crutch.
If you’re not going to grind for traction and users, no amount of YC or VC money will change that.
If you’re not successful without YC, chances are you won’t be with it.
at least that’s my 2 cents 👍

my friend just got rejected from YC for the third time
she's convinced it's because her idea wasn't good enough
i looked at the other 22 people i know who applied
here's what actually happened:
applied: 23 people
interviewed: 4
accepted: 1
the one who got in had $200k revenue, previous exit, stanford CS, and knew a partner
my friend had a great idea and no traction
but everyone on twitter says "YC funds ideas not just traction"
technically true
except the "ideas" that get funded come with revenue, pedigree, or network
three of my friends quit their startups after YC rejection
not because the idea was bad
because rejection felt like validation they should quit
five others applied again next batch
same result
the application took my friend 60 hours across rewrites
she stressed for 2 months waiting
now she's depressed and questioning everything
YC rejection has this weight in SF that's insane
like if YC didn't want it, maybe it's not worth building
which is bullshit but the feeling is real
i wish someone had told her: your odds were 0.1% from the start
not because your idea sucked
because you didn't have the traction/pedigree/connections yet
build more, apply later, or don't apply at all
YC isn't the only path
but SF makes it feel like the only path that matters
Ally Kim retweeted

@karpathy Someone just made a GPT in a 3600-digit prime number
github.com/MatchaOnMuffin…

New art project.
Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further.
gist.github.com/karpathy/8627f…
