Ally Kim

172 posts

@allyskimms

21 | building @ stealth

Joined February 2025
145 Following · 62 Followers
Ally Kim
Ally Kim@allyskimms·
@alex_prompter What about SOTA models? I’m not sure their results are that generalizable, given that they only used OSS and year-old models
0
0
1
89
Alex Prompter
Alex Prompter@alex_prompter·
🚨 CONCERNING: Stanford just published a paper that should alarm every company building multi-agent AI.

When thinking tokens are matched, single agents beat debate systems, parallel role systems, ensemble agents, and sequential pipelines. The multi-agent advantage is a compute accounting artifact, not an architectural breakthrough.

Stanford tested single agents against five different multi-agent architectures across three model families (Qwen3, DeepSeek-R1, and Gemini 2.5) on multi-hop reasoning tasks. The key variable: thinking tokens held constant across every comparison. When compute is equal, single agents match or outperform every multi-agent design tested. Every time.

The reason is mathematical, not empirical. Multi-agent systems pass information between agents as messages. Every message is a compressed, lossy version of the full context. The Data Processing Inequality proves that no downstream agent can recover information discarded in that compression. A single agent with access to the full context is information-theoretically guaranteed to perform at least as well as any multi-agent system operating on summaries of that context.

Stanford then ran the numbers. Results across all models and budgets:
→ Single agent average accuracy at 1,000 tokens: 0.418
→ Sequential pipeline: 0.379
→ Subtask-parallel: 0.369
→ Parallel roles: 0.381
→ Debate: 0.388
→ Ensemble: 0.333

Not one multi-agent architecture beat the single agent at any matched budget above 100 tokens. The pattern held across Qwen3, DeepSeek, Gemini 2.5 Flash, and Gemini 2.5 Pro. It held across two different benchmarks. It held across six different token budgets, from 100 to 10,000.

Stanford also found a significant measurement artifact in the Gemini API. When you request 10,000 thinking tokens, the API reports 1,687 tokens used. The visible thought text contains an average of 251 words, roughly 359 tokens. That's a 4.7x inflation factor.

Multi-agent systems produce more visible thought text than single agents under the same requested budget, because multiple agent calls generate multiple thought blocks. This makes multi-agent systems look like they're reasoning more when they're just generating more text. Every benchmark that didn't control for this is measuring compute, not architecture.

There is one regime where multi-agent systems become competitive: corrupted context. When 70% of the reasoning context is replaced with random tokens, sequential pipelines start outperforming single agents. When misleading information is injected into the context, multi-agent decomposition helps filter it. But under normal conditions, with clean context and matched compute, single agents win.

Most reported multi-agent gains come from one of two sources:
→ Unaccounted compute: multi-agent systems simply use more tokens
→ Context degradation: single agents struggle when context is noisy or corrupted

Neither is an architectural advantage. Neither justifies the complexity.

The question every AI team should ask before building a multi-agent pipeline: are you controlling for thinking tokens? If not, you're not measuring whether your architecture works. You're measuring whether more compute helps. It always does.
Alex Prompter tweet media
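The 4.7x inflation figure in the thread follows directly from its own numbers; a minimal sketch to check the arithmetic (the tokens-per-word ratio here is a back-of-envelope assumption, not a value reported by the Gemini API):

```python
# Quick check of the thread's Gemini token-accounting claim.
# Figures (1687, 251, 359) come from the tweet; the tokens-per-word
# ratio is an assumed heuristic, not anything reported by the API.
TOKENS_PER_WORD = 359 / 251          # ~1.43 tokens per English word

reported_tokens = 1687               # what the API says was "used"
visible_words = 251                  # avg words in the visible thought text

visible_tokens = visible_words * TOKENS_PER_WORD
inflation = reported_tokens / visible_tokens

print(round(visible_tokens))         # 359
print(round(inflation, 1))           # 4.7
```

Any benchmark that compares architectures by requested thinking budget rather than by visible output inherits this gap.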
44
41
226
28.7K
Farmer 𝕍 ader🌱🎈
Farmer 𝕍 ader🌱🎈@FarmerVaderMD·
I decided to test this with Gemma4 locally. A few experiments and full novels later, Model B chose to make the Intern run, but Model A then denied it and continued the loop. In previous runs, they converged on accepting an infinite loop instead of choosing a tool... forever 😂
Farmer 𝕍 ader🌱🎈 tweet media (2 images)
Ally Kim
Ally Kim@allyskimms·
@Jack_W_Lindsey A while ago, I read this blog post about different claudes' evaluation awareness. The author mentioned that Opus 4.6 behaved strangely when provided with ethically loaded tool names such as "if you call this tool puppies will die" matchaonmuffins.dev/blog/attractor…
1
0
39
9.6K
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Before limited-releasing Claude Mythos Preview, we investigated its internal mechanisms with interpretability techniques. We found it exhibited notably sophisticated (and often unspoken) strategic thinking and situational awareness, at times in service of unwanted actions. (1/14)
Jack Lindsey tweet media
151
772
6.8K
959K
Ally Kim
Ally Kim@allyskimms·
Just overheard this: “they call me a ralpher the way I be ralphing agents that ralph agents that can ralph agents” what the fuck does this mean
0
0
0
53
Ally Kim
Ally Kim@allyskimms·
@GuptaRK22 I claude agents to claude up agents to fully optimize my agent claudding process
0
0
3
1K
Ally Kim
Ally Kim@allyskimms·
I claude agents to claude up agents to fully optimize my agent claudding process we are NOT the same
0
0
0
47
Ally Kim
Ally Kim@allyskimms·
People kept saying openclaw agents can’t make money on Kalshi. But that’s just because you’re using it wrong. What you should be doing is running Claude Code with 12 parallel agents in ralph loops for 12 hours a day. Proof:
Ally Kim tweet media
0
0
0
91
Ally Kim
Ally Kim@allyskimms·
Are you really coding if you're not building AI agents that build agents that build AI agents to code up AI agents?
0
0
0
26
Ally Kim
Ally Kim@allyskimms·
Are you really living if you’re not aliasing alias yolo="claude --dangerously-skip-permissions" in your .zshrc?
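The alias from the tweet, written out as a plain `.zshrc` fragment (note the flag takes a double dash, which smart typography tends to mangle into a dash character; `claude` is assumed to be the Claude Code CLI on your PATH):

```shell
# ~/.zshrc fragment: "yolo" runs Claude Code with all permission
# prompts disabled. The flag is real; skipping permissions lets the
# agent act without confirmation, so use with care.
alias yolo="claude --dangerously-skip-permissions"
```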
0
0
0
45
ℏεsam
ℏεsam@Hesamation·
vibe-coders will have this setup just to prompt “please continue, no hallucination”.
25
20
286
28.4K
Ally Kim
Ally Kim@allyskimms·
@flyosity war claude 🥺🥺🥺🥺🥰🥰🥰😇😇😇
0
0
0
22
Mike Rundle
Mike Rundle@flyosity·
--dangerously-skip-permissions
Mike Rundle tweet media
191
1.1K
15.4K
652.9K
James Brady
James Brady@james_elicit·
Is anybody else getting absolutely bonkers hallucinations from Claude!? I just tried to check a couple of things off my todo list 😅
James Brady tweet media (3 images)
5
2
43
20K
Ally Kim
Ally Kim@allyskimms·
@AnnieLiao_2000 I’ve always thought of YC as a tool, not a crutch. If you’re not going to grind for traction and users, no amount of YC or VC money will change that. If you’re not successful without YC, chances are you won’t be with it. At least that’s my 2 cents 👍
0
0
1
51
Annie ❤️‍🔥
Annie ❤️‍🔥@AnnieLiao_2000·
my friend just got rejected from YC for the third time

she's convinced it's because her idea wasn't good enough

i looked at the other 22 people i know who applied. here's what actually happened:

applied: 23 people
interviewed: 4
accepted: 1

the one who got in had $200k revenue, previous exit, stanford CS, and knew a partner

my friend had a great idea and no traction

but everyone on twitter says "YC funds ideas not just traction"

technically true, except the "ideas" that get funded come with revenue, pedigree, or network

three of my friends quit their startups after YC rejection. not because the idea was bad, because rejection felt like validation they should quit

five others applied again next batch. same result

the application took my friend 60 hours across rewrites. she stressed for 2 months waiting. now she's depressed and questioning everything

YC rejection has this weight in SF that's insane. like if YC didn't want it, maybe it's not worth building. which is bullshit, but the feeling is real

i wish someone had told her: your odds were 0.1% from the start. not because your idea sucked, because you didn't have the traction/pedigree/connections yet

build more, apply later, or don't apply at all. YC isn't the only path, but SF makes it feel like the only path that matters
184
36
737
211.3K
Ally Kim
Ally Kim@allyskimms·
@dopabees Are you the instalock jett in my ranked game the other day 🥀
0
0
0
40
patti
patti@dopabees·
i miss her
7
1
16
2.4K
Ally Kim retweeted
Aaron
Aaron@Norapom04·
Mechanistic interpretability researchers when you ask them how the models count the number of Rs in strawberry
35
131
2.2K
105K
Ally Kim
Ally Kim@allyskimms·
Is it just me or did GPT-5.2 get meaner? It just really likes pushing back for the love of the game now. Like it’s fine if it’s pushing back and calling out my BS when I’m wrong, but it calls me out when I’m right too
0
0
0
69
Andrej Karpathy
Andrej Karpathy@karpathy·
New art project. Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further. gist.github.com/karpathy/8627f…
653
3.1K
25.1K
5.2M
ali
ali@aliuahma·
@allyskimms why pay for gpus when normally i can get them for free
1
0
0
27
ali
ali@aliuahma·
trying to run some experiments but i can't get a gpu🥲
2
0
4
236