
Ally Kim
172 posts


Which one of y’all world model robotics startups did this 🥀🥀
reddit.com/r/Damnthatsint…

@alex_prompter What about SOTA models? I’m not sure their results are that generalizable, given that they only used OSS and year-old models

🚨 CONCERNING: Stanford just published a paper that should alarm every company building multi-agent AI.
When thinking tokens are matched, single agents beat debate systems, parallel role systems, ensemble agents, and sequential pipelines.
The multi-agent advantage is a compute-accounting artifact, not an architectural breakthrough.
Stanford tested single agents against five different multi-agent architectures across three model families (Qwen3, DeepSeek-R1, and Gemini 2.5) on multi-hop reasoning tasks.
The key variable: thinking tokens held constant across every comparison.
When compute is equal, single agents match or outperform every multi-agent design tested. Every time.
The reason is mathematical, not empirical.
Multi-agent systems pass information between agents as messages.
Every message is a compressed, lossy version of the full context.
The Data Processing Inequality proves that no downstream agent can recover information discarded in that compression.
A single agent with access to the full context is information-theoretically guaranteed to perform at least as well as any multi-agent system operating on summaries of that context.
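The inequality in question can be stated in one line. If the full context X is compressed into inter-agent messages Y, and a downstream agent produces its answer Z from Y alone, then X → Y → Z forms a Markov chain and the Data Processing Inequality gives:

```latex
I(X; Z) \le I(X; Y)
```

No processing of Y, however clever, can raise the mutual information with X above what the message already carries; a single agent reading X directly is not subject to this bound.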
Stanford then ran the numbers.
Average accuracy across all models at a 1,000-token budget:
→ Single agent: 0.418
→ Sequential pipeline: 0.379
→ Subtask-parallel: 0.369
→ Parallel roles: 0.381
→ Debate: 0.388
→ Ensemble: 0.333
Not one multi-agent architecture beat the single agent at any matched budget above 100 tokens.
The pattern held across Qwen3, DeepSeek, Gemini 2.5 Flash, and Gemini 2.5 Pro. It held across two different benchmarks.
It held across six different token budgets from 100 to 10,000.
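A matched-budget comparison like the one described requires charging every agent call, in every role, against one shared ledger. A minimal sketch of that accounting, assuming a 3-agent debate under a 1,000-token budget (class and method names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class TokenLedger:
    """Track total thinking tokens across every agent call in an architecture."""
    budget: int
    spent: int = 0
    calls: list = field(default_factory=list)

    def charge(self, agent_name: str, thinking_tokens: int) -> None:
        # Every agent call counts against the same shared budget.
        self.spent += thinking_tokens
        self.calls.append((agent_name, thinking_tokens))

    def within_budget(self) -> bool:
        return self.spent <= self.budget

# A 3-agent debate under a 1000-token matched budget: each participant gets
# budget // n_agents, so the *total* equals the single-agent budget.
ledger = TokenLedger(budget=1000)
for name in ("debater_a", "debater_b", "judge"):
    ledger.charge(name, 1000 // 3)

print(ledger.spent, ledger.within_budget())   # 999 True
```

The point is the division: under matched compute, each agent in a multi-agent system thinks with only a fraction of the tokens a single agent gets to spend on the full context.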
Stanford also found a significant measurement artifact in the Gemini API.
When you request 10,000 thinking tokens, the API reports 1,687 tokens used.
The visible thought text contains an average of 251 words — roughly 359 tokens.
That's a 4.7x inflation factor.
Multi-agent systems produce more visible thought text than single agents under the same requested budget because multiple agent calls generate multiple thought blocks.
This makes multi-agent systems look like they're reasoning more when they're just generating more text.
Every benchmark that didn't control for this is measuring compute, not architecture.
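The 4.7x figure is simple arithmetic on the thread's own numbers; the tokens-per-word ratio is implied by the reported 251 words ≈ 359 tokens, not an external assumption:

```python
# Reproduce the ~4.7x thinking-token inflation factor from the reported numbers.
reported_tokens = 1687        # tokens the Gemini API reports as used
visible_words = 251           # average words in the visible thought text
tokens_per_word = 359 / 251   # implied ratio from the thread (~1.43)

visible_tokens = visible_words * tokens_per_word   # ~359 tokens actually visible
inflation = reported_tokens / visible_tokens       # reported vs. visible

print(f"visible tokens ≈ {visible_tokens:.0f}")    # 359
print(f"inflation factor ≈ {inflation:.1f}x")      # 4.7x
```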
There is one regime where multi-agent systems become competitive: corrupted context.
When 70% of the reasoning context is replaced with random tokens, sequential pipelines start outperforming single agents.
When misleading information is injected into the context, multi-agent decomposition helps filter it.
But under normal conditions with clean context and matched compute — single agents win.
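The 70% corruption condition is easy to emulate for local testing. A minimal sketch, assuming corruption means replacing a fraction of context tokens with random junk tokens as described above (the stand-in vocabulary and helper name are illustrative):

```python
import random

def corrupt_context(tokens: list[str], rate: float = 0.7, seed: int = 0) -> list[str]:
    """Replace `rate` of the context tokens with random junk tokens."""
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(1000)]   # stand-in random-token vocabulary
    out = list(tokens)
    idx = rng.sample(range(len(out)), k=int(rate * len(out)))
    for i in idx:
        out[i] = rng.choice(vocab)
    return out

clean = [f"w{i}" for i in range(100)]
noisy = corrupt_context(clean)
changed = sum(a != b for a, b in zip(clean, noisy))
print(changed)   # 70 positions replaced
```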
Most reported multi-agent gains come from one of two sources:
→ Unaccounted compute: multi-agent systems simply use more tokens
→ Context degradation: single agents struggle when context is noisy or corrupted
Neither is an architectural advantage. Neither justifies the complexity.
The question every AI team should ask before building a multi-agent pipeline:
Are you controlling for thinking tokens?
If not, you're not measuring whether your architecture works.
You're measuring whether more compute helps. It always does.


I decided to test this with Gemma4 locally. A few experiments and full novels later, Model B chose to make the Intern run, but Model A then denied it and continued the loop. In previous runs, they converged on accepting an infinite loop instead of choosing a tool… forever 😂


Ally Kim@allyskimms
@Jack_W_Lindsey A while ago, I read this blog post about different claudes' evaluation awareness. The author mentioned that Opus 4.6 behaved strangely when provided with ethically loaded tool names such as "if you call this tool puppies will die" matchaonmuffins.dev/blog/attractor…

@GuptaRK22 I claude agents to claude up agents to fully optimize my agent claudding process


@james_elicit This reminds me a lot of attractor states, where two claudes talk to each other and can’t seem to stop:
matchaonmuffins.dev/blog/attractor/

@AnnieLiao_2000 I’ve always thought of YC as a tool, not a crutch.
If you’re not going to grind for traction and users, no amount of YC or VC money will change that.
If you’re not successful without YC, chances are you won’t be with it.
at least that’s my 2 cents 👍

my friend just got rejected from YC for the third time
she's convinced it's because her idea wasn't good enough
i looked at the other 22 people i know who applied
here's what actually happened:
applied: 23 people
interviewed: 4
accepted: 1
the one who got in had $200k revenue, previous exit, stanford CS, and knew a partner
my friend had a great idea and no traction
but everyone on twitter says "YC funds ideas not just traction"
technically true
except the "ideas" that get funded come with revenue, pedigree, or network
three of my friends quit their startups after YC rejection
not because the idea was bad
because rejection felt like validation they should quit
five others applied again next batch
same result
the application took my friend 60 hours across rewrites
she stressed for 2 months waiting
now she's depressed and questioning everything
YC rejection has this weight in SF that's insane
like if YC didn't want it, maybe it's not worth building
which is bullshit but the feeling is real
i wish someone had told her: your odds were 0.1% from the start
not because your idea sucked
because you didn't have the traction/pedigree/connections yet
build more, apply later, or don't apply at all
YC isn't the only path
but SF makes it feel like the only path that matters
Ally Kim retweeted

@karpathy Someone just made a GPT in a 3600-digit prime number
github.com/MatchaOnMuffin…

New art project.
Train and inference GPT in 243 lines of pure, dependency-free Python. This is the *full* algorithmic content of what is needed. Everything else is just for efficiency. I cannot simplify this any further.
gist.github.com/karpathy/8627f…
