Andreas Sjostrom

9.7K posts

@AndreasSjostrom

My personal opinions.

Palo Alto · Joined April 2008
254 Following · 1.8K Followers
Andreas Sjostrom reposted
Matt Dancho (Business Science)
This is huge. A group of 50 AI researchers (ByteDance, Alibaba, Tencent + universities) just dropped a 303 page field guide on code models + coding agents. And the takeaways are not what most people assume. Here are the highlights I’m thinking about (as someone who lives in Python + agents):
32
211
1.3K
132.6K
Andreas Sjostrom reposted
Carlos E. Perez @IntuitMachine
Everyone says LLMs can't do true reasoning: they just pattern-match and hallucinate code. So why did our system just solve abstract reasoning puzzles that are specifically designed to be unsolvable by pattern matching? Let me show you what happens when you stop asking AI for answers and start asking it to think. 🧵
First, what even is ARC-AGI? It's a benchmark that looks deceptively simple: you get 2-4 examples of colored grids transforming (input → output), and you have to figure out the rule. But here's the catch: these aren't IQ test patterns. They're designed to require genuine abstraction.
(Why This Is Hard) Humans solve these by forming mental models:
- "Oh, it's mirroring across the diagonal"
- "It's finding the bounding box of blue pixels"
- "It's rotating each object independently"
Traditional ML? Useless. You'd need millions of examples to learn each rule. LLMs? They hallucinate plausible-sounding nonsense.
But we had a wild idea: What if instead of asking the LLM to predict the answer, we asked it to write Python code that transforms the grid? Suddenly, the problem shifts from "memorize patterns" to "reason about transformations and implement them." Code is a language of logic.
Here's the basic algorithm:
1. Show the LLM examples: "Write a transform(grid) function"
2. LLM writes code
3. Run it against examples
4. If wrong → show exactly where it failed
5. Repeat with feedback
Sounds simple, right? But that's not even the most interesting part. When the code fails, we don't just say "wrong." We show the LLM a visual diff of what it predicted vs. what was correct:
Your output:
1 2/3 4 ← "2/3" means "you said 2, correct was 3"
5 6/7 8
Plus a score: "Output accuracy: 0.75"
It's like a teacher marking your work in red ink. With each iteration, the LLM sees:
- Its previous failed attempts
- Exactly what went wrong
- The accuracy score
It's not guessing. It's debugging. And here's where it gets wild: We give it up to 10 tries to refine its logic. Most problems? Solved by iteration 3-5.
But wait, it gets crazier. We don't just run this once. We run it with 8 independent "experts": same prompt, different random seeds. Why? Because the order you see examples matters. Shuffling them causes different insights.
Then we use voting to pick the best answer. After all experts finish, we group solutions by their outputs. If 5 experts produce solution A and 3 produce solution B, we rank A higher. Why does this work? Because wrong answers are usually unique. Correct answers converge. It's wisdom of crowds, but for AI reasoning.
Each expert gets a different random seed, which affects:
- Example order (we shuffle them)
- Which previous solutions to include in feedback
- The "creativity" of the response
Same prompt. Same model. Wildly different exploration paths. One expert might focus on colors. Another on geometry.
Our prompts are elaborate. We don't just say "solve this." We teach the LLM how to approach reasoning:
- Analyze objects and relationships
- Form hypotheses (start simple!)
- Test rigorously
- Refine based on failures
It's like giving it a graduate-level course in problem-solving.
Here's why code matters: When you write:
def transform(grid):
    return np.flip(grid)
You're forced to be precise. You can't hand-wave. Code doesn't tolerate ambiguity. It either works or it doesn't. This constraint makes the LLM think harder.
Oh, and we execute all this code in a sandboxed subprocess with timeouts. Because yeah, the LLM will occasionally write infinite loops or try to import libraries that don't exist. Safety first. But also: fast failure = faster learning.
ARC-AGI isn't about knowledge. It's about:
- Abstraction (seeing the pattern behind the pattern)
- Generalization (applying a rule to new cases)
- Reasoning (logical step-by-step thinking)
We're not teaching the AI facts. We're teaching it how to think.
So did it work? We shattered the state-of-the-art on ARC-AGI-2. Not by a little. By a lot. Problems that stumped every other system? Solved.
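The seed-dependent shuffling and the output-based voting described above could look roughly like the following Python sketch. The names `shuffled_examples` and `rank_by_votes` are hypothetical, and ties are simply left in first-seen order; the thread does not say how the authors break ties.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Grid = List[List[int]]


def shuffled_examples(examples: Sequence, seed: int) -> list:
    """Return the examples in a seed-dependent order (one order per expert)."""
    order = list(examples)
    random.Random(seed).shuffle(order)
    return order


def rank_by_votes(candidates: List[Tuple[str, Grid]]) -> List[Tuple[Grid, List[str]]]:
    """Group (expert_id, output_grid) pairs by identical output, most votes first."""
    groups = defaultdict(list)
    for expert_id, grid in candidates:
        key = tuple(tuple(row) for row in grid)  # nested tuples are hashable
        groups[key].append(expert_id)
    # sorted() is stable, so equally popular outputs keep first-seen order
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return [([list(row) for row in key], voters) for key, voters in ranked]
```

With five experts producing one grid and three producing another, `rank_by_votes` returns the five-vote grid first, matching the "A ranks above B" example in the thread.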
And the solutions are readable, debuggable Python functions. You can literally see the AI's reasoning process.
This isn't just about solving puzzles. It's proof that LLMs can do genuine reasoning if you frame the problem correctly.
- Don't ask for answers. Ask for logic.
- Don't accept vague outputs. Demand executable precision.
- Don't settle for one attempt. Iterate and ensemble.
Which makes you wonder: What else are we getting wrong about AI capabilities because we're asking the wrong questions? Maybe the limit isn't the models. Maybe it's our imagination about how to use them.
Here's what you can steal from this: When working with LLMs on hard problems:
- Ask for code/structure, not raw answers
- Give detailed feedback on failures
- Let it iterate
- Run multiple attempts with variation
- Use voting/consensus to filter noise
Precision beats creativity.
The most powerful pattern here? Treating the LLM like a reasoning partner, not an oracle. We're not extracting pre-trained knowledge. We're creating a thought process: prompt → code → test → feedback → refined thought. That loop is where the magic lives.
If you're working on hard AI problems, stop asking: "Can the model do X?" Start asking: "How can I design a process that lets the model discover X?" The future of AI isn't smarter models. It's smarter prompts, loops, and systems around them.
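The prompt → code → test → feedback loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `generate` is a hypothetical stand-in for the LLM call, `score` and `refine` are made-up names, and the candidate code is run with a plain in-process `exec` for brevity, whereas the thread describes a sandboxed subprocess with timeouts.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]           # (input grid, expected output grid)
History = List[Tuple[str, float]]     # (candidate source, accuracy) per attempt


def score(source: str, examples: List[Example]) -> float:
    """Run candidate source defining transform(grid); return fraction of examples solved.

    NOTE: exec() here is NOT a sandbox. Untrusted code should run in an
    isolated subprocess with a timeout, as the thread describes.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        transform = namespace["transform"]
        hits = sum(transform(inp) == out for inp, out in examples)
        return hits / len(examples)
    except Exception:
        # Any crash (syntax error, runtime error, no examples) counts as a failure
        return 0.0


def refine(
    generate: Callable[[History], str],
    examples: List[Example],
    max_iters: int = 10,
) -> Tuple[Optional[str], History]:
    """Ask for code, test it, and feed the history back until all examples pass."""
    history: History = []
    for _ in range(max_iters):
        source = generate(history)    # the model sees its earlier attempts and scores
        accuracy = score(source, examples)
        history.append((source, accuracy))
        if accuracy == 1.0:
            return source, history
    return None, history
```

A deterministic stub that returns a wrong attempt first and a row-reversing `transform` second is enough to exercise the loop: `refine` stops on the second iteration and the history records accuracies of 0.0 and 1.0.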
120
141
987
90.6K
Andreas Sjostrom reposted
Carlos E. Perez @IntuitMachine
You know how some people seem to have a magic touch with LLMs? They get incredible, nuanced results while everyone else gets generic junk. The common wisdom is that this is a technical skill. A list of secret hacks, keywords, and formulas you have to learn. But a new paper suggests this isn't the main thing. The skill that makes you great at working with AI isn't technical. It's social.
Researchers (Riedl & Weidmann) analyzed how 600+ people solved problems alone vs. with an AI. They used a statistical method to isolate two different things for each person:
- Their 'solo problem-solving ability'
- Their 'AI collaboration ability'
Here's the reveal: The two skills are NOT the same. Being a genius who can solve problems in your own head is a totally different, measurable skill from being great at solving problems with an AI partner. Plot twist: The two abilities are barely correlated.
So what IS this 'collaboration ability'? It's strongly predicted by a person's Theory of Mind (ToM): your capacity to intuitively model another agent's beliefs, goals, and perspective. To anticipate what they know, what they don't, and what they need. In practice, this looks like:
- Anticipating the AI's potential confusion
- Providing helpful context it's missing
- Clarifying your own goals ("Explain this like I'm 15")
- Treating the AI like a (somewhat weird, alien) partner, not a vending machine.
This is where it gets strange. A user's ToM score predicted their success when working WITH the AI... ...but had ZERO correlation with their success when working ALONE. It's a pure collaborative skill.
It goes deeper. This isn't just a static trait. The researchers found that even moment-to-moment fluctuations in a user's ToM, like when they put more effort into perspective-taking on one specific prompt, led to higher-quality AI responses for that turn.
This changes everything about how we should approach getting better at using AI. Stop memorizing prompt "hacks." Start practicing cognitive empathy for a non-human mind.
Try this experiment. Next time you get a bad AI response, don't just rephrase the command. Stop and ask:
- "What false assumption is the AI making right now?"
- "What critical context am I taking for granted that it doesn't have?"
Your job is to be the bridge.
This also means we're probably benchmarking AI all wrong. The race for the highest score on a static test (MMLU, etc.) is optimizing for the wrong thing. It's like judging a point guard only on their free-throw percentage. The real test of an AI's value isn't its solo intelligence. It's its collaborative uplift. How much smarter does it make the human-AI team? That's the number that matters. This paper gives us a way to finally measure it.
I'm still processing the implications. The whole thing is a masterclass in thinking clearly about what we're actually doing when we talk to these models.
Paper: "Quantifying Human-AI Synergy" by Christoph Riedl & Ben Weidmann, 2025.
225
388
2.5K
346.5K
Andreas Sjostrom reposted
Connor Davis @connordavis_ai
Nobody’s ready for what this Stanford paper reveals about multi-agent AI. "Latent Collaboration in Multi-Agent Systems" shows that agents don’t need messages, protocols, or explicit teamwork instructions. They start coordinating inside their own hidden representations: a full collaboration layer that exists only in the latent space.
And the behaviors are insane:
• Agents silently hand off tasks based on who’s better
• Roles appear out of nowhere: leader, executor, supporter
• Policies encode signals that never show up in actions
• Teams adapt to new environments without retraining
• Collaboration stays stable even when communication is impossible
The wildest detail: Even when you remove all channels for communication, agents still cooperate. The “teamwork” doesn’t live in messages. It lives in the network.
This flips the entire multi-agent playbook. We’ve been building coordination mechanisms on top… while the real coordination is happening underneath. A new era of emergent team intelligence is unfolding, and it’s happening in the places we weren’t even looking.
Project: github.com/Gen-Verse/LatentMAS
98
323
1.7K
147.2K
Andreas Sjostrom reposted
Rohan Paul @rohanpaul_ai
McKinsey published a new report: "Agents, robots, and us: Skill partnerships in the age of AI"
- Today’s technologies could theoretically automate more than half of current US work hours. This reflects how profoundly work may change
- By 2030, about $2.9 trillion of economic value could be unlocked in the United States
- Demand for AI fluency (the ability to use and manage AI tools) has grown 7X in two years, faster than for any other skill in US job postings. The surge is visible across industries and likely marks the beginning of much bigger changes ahead.
55
158
710
123.5K
Andreas Sjostrom reposted
Robert Youssef @rryssf
This Stanford University paper just broke my brain. They just built an AI agent framework that evolves from zero data (no human labels, no curated tasks, no demonstrations) and it somehow gets better than every existing self-play method.
It’s called "Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning". And it’s insane what they pulled off.
Every “self-improving” agent you’ve seen so far has the same fatal flaw: they can only generate tasks slightly harder than what they already know. So they plateau. Immediately. Agent0 breaks that ceiling.
Here’s the twist: They spawn two agents from the same base LLM and make them compete.
• Curriculum Agent - generates harder and harder tasks
• Executor Agent - tries to solve them using reasoning + tools
Whenever the executor gets better, the curriculum agent is forced to raise the difficulty. Whenever the tasks get harder, the executor is forced to evolve. This creates a closed-loop, self-reinforcing curriculum spiral, and it all happens from scratch: no data, no humans, nothing. Just two agents pushing each other into higher intelligence.
And then they add the cheat code: A full Python tool interpreter inside the loop. The executor learns to reason through problems with code. The curriculum agent learns to create tasks that require tool use. So both agents keep escalating.
The results?
→ +18% gain in math reasoning
→ +24% gain in general reasoning
→ Beats R-Zero, SPIRAL, Absolute Zero, even frameworks using external proprietary APIs
→ All from zero data, just self-evolving cycles
They even show the difficulty curve rising across iterations: tasks start as basic geometry and end at constraint satisfaction, combinatorics, logic puzzles, and multi-step tool-reliant problems.
This is the closest thing we’ve seen to autonomous cognitive growth in LLMs. Agent0 isn’t just “better RL.” It’s a blueprint for agents that bootstrap their own intelligence. The agent era just got unlocked.
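The escalation dynamic described in the post (tasks get harder when the executor succeeds, the executor improves when it fails) can be shown with a deliberately tiny numeric toy in Python. This is not Agent0's training procedure: the function `co_evolve`, the scalar `skill` and `difficulty` values, and the fixed step sizes are all assumptions chosen only to make the feedback loop visible.

```python
from typing import List, Tuple


def co_evolve(
    rounds: int, skill: float = 1.0, difficulty: float = 1.0
) -> List[Tuple[float, float, bool]]:
    """Toy two-player loop: returns (skill, difficulty, solved) after each round.

    Hypothetical dynamics, not the paper's method:
    - if the executor solves the task, the curriculum raises the difficulty;
    - otherwise the executor's skill grows a little from the failed attempt.
    """
    trace = []
    for _ in range(rounds):
        solved = skill >= difficulty
        if solved:
            difficulty += 1.0   # curriculum side: propose a harder task
        else:
            skill += 0.5        # executor side: improve on the current task
        trace.append((skill, difficulty, solved))
    return trace
```

Neither number ever decreases in this toy, so the two values ratchet each other upward round after round; in the real system both sides are learned policies rather than counters.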
66
228
1.1K
70.1K
Andreas Sjostrom reposted
Hamza Baig @hamza_automates
I BUILT AN AI AGENT FOR A RESTAURANT IN 10 MINUTES
And it solved a problem they’ve been struggling with for years. No fancy setup. No giant workflows. Just one clean system that handles:
- customer inquiries
- reservation flow
- menu questions
- upsells
- follow-up reviews
...all on autopilot.
If you want the exact prompt stack, agent setup, and deployment workflow I used: Comment “GROK” + Like + Repost and I’ll send it over. (must follow for DM)
373
246
891
98.2K
Andreas Sjostrom reposted
Fei-Fei Li @drfeifei
AI’s next frontier is Spatial Intelligence, a technology that will turn seeing into reasoning, perception into action, and imagination into creation. But what is it? Why does it matter? How do we build it? And how can we use it? Today, I want to share with you my thoughts on building and using world models to unlock spatial intelligence in this essay below. 1/n
291
750
3.6K
921.4K
Andreas Sjostrom reposted
Nozz @NoahEpstein_
gemini 3 just made every $15k ai consultant look like a clown
google silent-dropped autonomous agents to 650 million users yesterday
what consultants charge $15K and 6 weeks to "implement" now takes 4 minutes on a phone
here's what actually changed:
the model:
→ plans multi-step workflows autonomously
→ executes start to finish with zero hand-holding
→ optimized for non-experts (no CS degree needed)
→ already live on mobile canvas feature
while "AI agencies" are charging $8k-20k for strategy decks, google just deployed real automation to more people than chatgpt's entire user base
the intelligence gap is getting stupid:
that consultant billing $200/hr to "set up AI workflows" → the app does it autonomously now
that agency charging $15k for "custom AI implementation" → built in 4 minutes on gemini 3 mobile
that bootcamp selling "learn AI automation" for $2k → obsolete before the course launched
some startup just replaced their $18k/month AI consulting retainer with a free app
same output. 4 minute setup. zero technical knowledge required.
most businesses still think AI automation needs:
- 6 month roadmaps
- technical teams
- consulting firms
- $50k+ budgets
reality: it needs a phone and 4 minutes
your competition doesn't know this exists yet
but they will
comment "GEMINI" and i'll send you the breakdown of how to use this before everyone figures it out
2.6K
241
3.5K
541.6K
Andreas Sjostrom reposted
Robert Youssef @rryssf
Holy shit… this might be the most impressive scientific reasoning system anyone has built so far.
A new paper just dropped called 'SciAgent' and it basically shows an AI system outperforming human gold medalists across multiple Science Olympiads in one unified architecture. Not separate math agents, not physics-specific pipelines… one system that reasons across disciplines.
And the wild part: it doesn’t rely on handcrafted tricks. It uses a hierarchical multi-agent setup where a top-level Coordinator figures out the domain, difficulty, and reasoning style, then assembles a custom reasoning pipeline on the fly. Meaning: the model doesn’t “solve problems” in a single chain of thought… it coordinates a whole team of specialist agents like a real scientific lab.
Here’s why this is insane 👇
→ Gold medal performance in IMO 2025
→ Perfect score in IMC 2025
→ Near top human scores in IPhO 2024 and 2025
→ Massive win in CPhO 2025 (264 vs 199 human gold)
→ Strong generalization on Humanity’s Last Exam
→ Dynamic multi-agent collaboration instead of fixed templates
→ Symbolic deduction, modeling, computation, and verification all happening in parallel
SciAgent isn’t just “doing math” or “solving physics.” It’s showing that coordinated agent systems can build adaptive reasoning pipelines that behave much closer to human scientific thinking.
This is not another LLM benchmark bump. It is a glimpse of what happens when AI starts reasoning like a distributed team instead of a single model predicting one token at a time. If this scales, scientific problem solving is about to get completely redefined.
28
128
662
72.9K
Andreas Sjostrom reposted
DualverseAI @dualverse_ai
What if AI agents could be real scientists, not just a tool? We built The STATION, an open-world for agents to read, hypothesize, collaborate and experiment. We let this AI world run for weeks without any human help. (1/2)
21
14
101
164.8K
Andreas Sjostrom reposted
Mayank Vora @aiwithmayank
Microsoft Research just dropped something that could flip AI architecture on its head 🤯
It’s called "Agentic Organization" and it’s not a new model. It’s a new way for intelligence to exist.
Here’s why it’s wild: Every big AI model today “thinks” like a single brain: one giant, slow, linear process. Even “parallel reasoning” is basically the same mind cloned twice and merged later. Microsoft just broke that mold.
They built a protocol called AsyncThink, where one model acts like both:
🧠 An Organizer that splits a complex problem into sub-tasks
⚙️ Workers that solve those parts in parallel
Then the Organizer pulls everything back together (merging, verifying, adapting), all in real time. It’s not one AI anymore. It’s a network of minds that coordinate thought. And it learns to do this through reinforcement learning, literally teaching itself how to organize its own reasoning.
The numbers are nuts:
→ 28% faster inference
→ Higher accuracy on math reasoning
→ Zero-shot problem solving on new tasks like Sudoku
→ Dynamic self-evolving organization during reasoning
This isn’t just scaling compute. It’s scaling coordination. AsyncThink moves us from a single “intelligent agent” → to an “intelligent organization.” The next wave of AI won’t think like a person. It’ll think like a company of minds: delegating, merging, verifying, and iterating at machine speed.
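The Organizer/Worker split described in the post is a fork-join pattern, which can be sketched with Python's `asyncio`. This is only an illustration of the coordination shape, not the AsyncThink protocol itself: the task (summing numbers), the function names `organizer`, `worker`, and `solve`, and the trivial verification step are all assumptions.

```python
import asyncio
import math
from typing import List


async def worker(sub_task: List[int]) -> int:
    """Solve one sub-task; here the 'reasoning' is just summing a chunk."""
    await asyncio.sleep(0)  # yield control so workers interleave
    return sum(sub_task)


async def organizer(numbers: List[int], n_workers: int = 4) -> int:
    """Fork the problem into sub-tasks, run workers concurrently, then join."""
    chunk = max(1, math.ceil(len(numbers) / n_workers))
    sub_tasks = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]

    # Fork: all workers are scheduled together; gather preserves input order
    partials = await asyncio.gather(*(worker(task) for task in sub_tasks))

    # Join + verify: a cheap sanity check before merging the partial results
    if len(partials) != len(sub_tasks):
        raise RuntimeError("a worker did not report back")
    return sum(partials)


def solve(numbers: List[int]) -> int:
    """Synchronous entry point for the sketch."""
    return asyncio.run(organizer(numbers))
```

In this sketch the organizer's decomposition is hard-coded; the post's claim is that the model learns where to fork and join through reinforcement learning, which a fixed chunking rule does not capture.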
32
60
328
23.1K
Andreas Sjostrom reposted
Nozz @NoahEpstein_
mckinsey just made every AI consultant look like a scam
their 2025 report dropped and it's brutal: 88% of companies "use AI" but 80%+ report ZERO bottom-line impact
translation: corporate AI theater is alive and well
here's what's actually happening:
- 67% stuck in pilot purgatory
- companies spending $100k+ on consultants
- building "AI strategies" that never ship
- measuring vanity metrics instead of revenue
meanwhile the intelligence gap is insane right now: you can build a working n8n workflow in 2 hours that actually moves numbers. but companies think they need a 6-month roadmap and a team of PhDs.
the high performers? they're not overthinking it.
- they rebuild workflows (not just "add AI")
- they ship fast
- they measure real EBIT not pilot success
51% have already seen AI backfire from inaccuracy. so they're spending more on risk management than on things that work.
end result:
32% expect layoffs from AI
13% expect growth
everyone else is just guessing
this is the opportunity: while enterprise burns budgets on pilots, you can charge $10k-50k to build automation that actually works. because clarity beats credentials when the other side is lost.
comment "gap" and i'll send you the breakdown of where companies waste the most + what to build instead
181
69
549
61.4K
João Moura @joaomdmoura
Finally making a dent in losing some weight! Since I created CrewAI I put on 40 lbs; slowly cutting that away now!
8
0
26
1.5K
Andreas Sjostrom reposted
Sam Rodriques @SGRodriques
Today, we’re announcing Kosmos, our newest AI Scientist, available to use now. Users estimate Kosmos does 6 months of work in a single day. One run can read 1,500 papers and write 42,000 lines of code. At least 79% of its findings are reproducible. Kosmos has made 7 discoveries so far, which we are releasing today, in areas ranging from neuroscience to material science and clinical genetics, in collaboration with our academic beta testers. Three of these discoveries reproduced unpublished findings; four are net new, validated contributions to the scientific literature. AI-accelerated science is here.
Our core innovation in Kosmos is the use of a structured, continuously-updated world model. As described in our technical report, Kosmos’ world model allows it to process orders of magnitude more information than could fit into the context of even the longest-context language models, allowing it to synthesize more information and pursue coherent goals over longer time horizons than Robin or any of our other prior agents. In this respect, we believe Kosmos is the most compute-intensive language agent released so far in any field, and by far the most capable AI Scientist available today. The use of a persistent world model also enables single Kosmos trajectories to produce highly complex outputs that require multiple significant logical leaps.
As with all of our systems, Kosmos is designed with transparency and verifiability in mind: every conclusion in a Kosmos report can be traced through our platform to the specific lines of code or the specific passages in the scientific literature that inspired it, ensuring that Kosmos’ findings are fully auditable at all times.
We are also using this opportunity to announce the launch of Edison Scientific, a new commercial spinout of FutureHouse, which will be focused on commercializing our agents and applying them to automate scientific research in drug discovery and beyond.
Edison will be taking over management of the FutureHouse platform, where you can access Kosmos alongside our Literature, Molecules, and Precedent agents (previously Crow, Phoenix, and Owl). Edison will continue to offer free tier usage for casual users and academics, while also offering higher rate limits and additional features for users who need them. You can read more about this spinout on our blog, below.
A few important notes if you’re going to try Kosmos. Firstly, Kosmos is different from many other AI tools you might have played with, including our other agents. It is more similar to a Deep Research tool than it is to a chatbot: it takes some time to figure out how to prompt it effectively, and we have tried to include guidelines on this to help (see below). It costs $200/run right now (200 credits per run, and $1/credit), with some free tier usage for academics. This is heavily discounted; people who sign up for Founding Subscriptions now can lock in the $1/credit price indefinitely, but the price ultimately will probably be higher. Again, this is less chatbot and more research tool, something you run on high-value targets as needed.
Some caveats are also warranted. Firstly, we find that 80% of Kosmos findings are reproducible, which also means 20% are not -- some things it says will be wrong. Also, Kosmos certainly does produce outputs that are the equivalent to several months of human labor, but it also often goes down rabbit holes or chases statistically significant yet scientifically irrelevant findings. We often run Kosmos multiple times on the same objective in order to sample the various research avenues it can take. There are still a bunch of rough edges on the UI and such, which we are working on. Finally, we are aware that the 6 month figure is much greater than estimates by other AI labs, like METR, about the length of tasks that AI Agents can currently perform. You can read discussion about this in our blog post.
Huge congratulations to our team that put this together, led by @ludomitch and @michaelathinks: Angela Yiu, @benjamin0chang, @sidn137, Edwin Melville-Green, Albert Bou, @arvissulovari, Oz Wassie, @jonmlaurent. A particular shout out to @m_skarlinski and his team that rebuilt the platform for this launch, especially Andy Cai @notAndyCai, Richard Magness, Remo Storni, Tyler Nadolski @_tnadolski, Mayk Caldas @maykcaldas, Sam Cox @samcox822 and more. This work would not have been possible without significant contributions from academic collaborators @mathieubourdenx, @EricLandsness, @bdanubius, @physicistnevans, Tonio Buonassisi, @BGomes_1905, Shriya Reddy, @marthafoiani, and @RandallBateman3. We also want to thank our numerous supporters, especially @ericschmidt, who has been a tremendous ally. We will have more to say about our supporters soon!
272
639
3.7K
730.9K
Andreas Sjostrom reposted
Linus ✦ Ekenstam @LinusEkenstam
This is nuts. These guys are genuinely geniuses. The crunch is happening right now. In front of our eyes.
204
277
4.1K
596.1K
Andreas Sjostrom reposted
elvis @omarsar0
Anthropic just posted another banger guide. This one is on building more efficient agents to handle more tools and efficient token usage. This is a must-read for AI devs! (bookmark it)
It helps with three major issues in AI agent tool calling: token costs, latency, and tool composition. How? It combines code executions with MCP, where it turns MCP servers into code APIs rather than direct tool calls.
Here is all you need to know:
1. Token Efficiency Problem: Loading all MCP tool definitions upfront and passing intermediate results through the context window creates massive token overhead, sometimes 150,000+ tokens for complex multi-tool workflows.
2. Code-as-API Approach: Instead of direct tool calls, present MCP servers as code APIs (e.g., TypeScript modules) that agents can import and call programmatically, reducing the example workflow from 150k to 2k tokens (98.7% savings).
3. Progressive Tool Discovery: Use filesystem exploration or search_tools functions to load only the tool definitions needed for the current task, rather than loading everything upfront into context. This solves so many context rot and token overload problems.
4. In-Environment Data Processing: Filter, transform, and aggregate data within the code execution environment before passing results to the model. E.g., filter 10,000 spreadsheet rows down to 5 relevant ones.
5. Better Control Flow: Implement loops, conditionals, and error handling with native code constructs rather than chaining individual tool calls through the agent, reducing latency and token consumption.
6. Privacy: Sensitive data can flow through workflows without entering the model's context; only explicitly logged/returned values are visible, with optional automatic PII tokenization.
7. State Persistence: Agents can save intermediate results to files and resume work later, enabling long-running tasks and incremental progress tracking.
8. Reusable Skills: Agents can save working code as reusable functions (with SKILL.MD documentation), building a library of higher-level capabilities over time.
This approach is complex and it's not perfect, but it should enhance the efficiency and accuracy of your AI agents across the board.
anthropic.com/engineering/code-execution-with-mcp
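Points 3 and 4 in the list above (progressive tool discovery and in-environment filtering) can be illustrated with a small Python sketch. The post mentions TypeScript modules; Python is used here only to stay consistent with the other examples on this page. Everything named below is hypothetical: the tool registry, the tool names, `search_tools`, `fetch_rows`, and `pending_orders` stand in for whatever a real MCP server would expose, and no real MCP client API is used.

```python
from typing import Dict, List

# Hypothetical catalog: name -> one-line description. In the approach the
# post describes, full tool definitions load only after a search selects them.
TOOL_DESCRIPTIONS: Dict[str, str] = {
    "sheets.get_rows": "Fetch all rows of a spreadsheet",
    "crm.update_record": "Update a CRM record",
    "mail.send": "Send an email",
}


def search_tools(query: str) -> List[str]:
    """Return names of tools whose name or description mentions the query."""
    q = query.lower()
    return sorted(
        name
        for name, description in TOOL_DESCRIPTIONS.items()
        if q in name.lower() or q in description.lower()
    )


def fetch_rows() -> List[dict]:
    """Stand-in for a tool call that returns a large result set (10,000 rows)."""
    return [
        {"order_id": i, "status": "pending" if i % 2000 == 0 else "done"}
        for i in range(10_000)
    ]


def pending_orders() -> List[dict]:
    """Filter inside the execution environment; only the matches are returned."""
    rows = fetch_rows()  # the full 10,000 rows never reach the model's context
    return [row for row in rows if row["status"] == "pending"]
```

Here `pending_orders()` hands back five rows instead of ten thousand, which is the kind of reduction the post's spreadsheet example refers to; the token savings quoted in the post come from the original guide, not from this sketch.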
55
222
1.5K
180.9K
Andreas Sjostrom reposted
Connor Davis @connordavis_ai
Holy shit… Google just rewired how AI agents talk to the world 🤯
They’ve built a real-time, bidirectional streaming architecture, meaning agents no longer wait for your input to finish before responding. They see, hear, and act while you’re still speaking. They can be interrupted mid-task. They can collaborate with other agents in real time.
It’s all powered by the new Agent Development Kit (ADK), built around async I/O, stateful sessions, streaming-native tools, and live callbacks.
This isn’t request-response anymore. It’s conversation without turns. The moment AI starts feeling alive.
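The "interrupted mid-task" behavior can be illustrated with plain `asyncio`, independent of any framework. This sketch does not use the ADK API; `respond`, `user`, and `session` are hypothetical names, and a shared `asyncio.Event` stands in for whatever signalling a real bidirectional stream would use.

```python
import asyncio
from typing import List


async def respond(chunks: List[str], out: List[str], interrupted: asyncio.Event) -> None:
    """Stream chunks one at a time, checking for an interruption before each."""
    for chunk in chunks:
        if interrupted.is_set():
            out.append("[interrupted]")
            return
        out.append(chunk)
        await asyncio.sleep(0)  # yield so the other side can act mid-response


async def user(interrupted: asyncio.Event, after_yields: int) -> None:
    """Let the agent run for a few scheduling turns, then cut in."""
    for _ in range(after_yields):
        await asyncio.sleep(0)
    interrupted.set()


async def session(chunks: List[str], interrupt_after: int) -> List[str]:
    """Run the agent and the user concurrently over one shared session."""
    out: List[str] = []
    interrupted = asyncio.Event()
    await asyncio.gather(
        respond(chunks, out, interrupted),
        user(interrupted, interrupt_after),
    )
    return out
```

Running `asyncio.run(session(["a", "b", "c", "d", "e"], interrupt_after=2))` yields a prefix of the chunks followed by the `[interrupted]` marker, since the agent stops as soon as it notices the event rather than finishing its turn.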
54
137
882
61.1K