Arthur Ostapenko

8.3K posts

@arthurostapenko

explorer, researcher, builder

Spain · Joined November 2009

5.3K Following · 859 Followers

Arthur Ostapenko retweeted

Sakana AI@SakanaAILabs·1d

What if instead of building one giant AI, we evolved a coordinator to orchestrate a diverse team of specialized AIs? 🐟
Excited to share our new paper: “TRINITY: An Evolved LLM Coordinator”, published as a conference paper at #ICLR2026!
Paper: arxiv.org/abs/2512.04695
In nature, complex problems are rarely solved by a single monolithic entity, but rather by the coordinated efforts of specialized individuals working together. Yet, modern AI development is heavily focused on endlessly scaling up single, massive monolithic models, yielding diminishing returns.
While model merging offers a way to combine different skills, it is often impractical due to mismatched neural architectures and the closed-source nature of top-performing models.
To address this, we took a macro-level approach: test-time model composition. We introduce TRINITY, a system that fuses the complementary strengths of diverse, state-of-the-art models without needing to modify their underlying weights.
TRINITY processes queries over multiple turns. At each step, a lightweight coordinator assigns one of three distinct roles to an LLM from its available pool:
1/ Thinker: Devises high-level strategies and analyzes the current state.
2/ Worker: Executes concrete problem-solving steps.
3/ Verifier: Evaluates if the current solution is complete and correct.
By dynamically assigning these roles, the coordinator effectively offloads complex reasoning and skill execution onto the external models.
What makes TRINITY unique is its extreme efficiency. The coordinator relies on the hidden states of a compact language model and a small routing head. In total, it has fewer than 20K learnable parameters.
Training this system presented a massive challenge. Traditional Reinforcement Learning (REINFORCE) failed because the gradients had a low signal-to-noise ratio due to binary rewards and weak parameter coupling. Imitation learning (Supervised Fine-Tuning) was ruled out because generating multi-turn labels is prohibitively expensive.
Our solution? We turned to nature-inspired algorithms. We optimized the coordinator using a derivative-free evolutionary algorithm. We found that evolution is uniquely suited to optimize this tight, high-dimensional coordination problem where traditional gradient-based methods fail.
The results are very promising. In our experiments, TRINITY consistently outperforms existing multi-agent methods and individual models across various benchmarks. At the time of publication, it set a new state-of-the-art record on LiveCodeBench, achieving an 86.2% pass@1 score.
More importantly, it demonstrated incredible generalization. Without any retraining, TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet (the top frontier models available at the time of our #ICLR2026 submission last year).
This work is central to Sakana AI's vision. We believe the future of AI isn't just about scaling monolithic models, but engineering collaborative, diverse AI ecosystems that can adapt and combine their strengths.
We invite the community to read the paper and explore these ideas!
Paper: arxiv.org/abs/2512.04695
OpenReview: openreview.net/forum?id=5HaRj…
This foundational research is part of the core engine powering our multi-agent product: Sakana Fugu 🐡👇
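The role-assignment loop and the derivative-free training described in the post above can be sketched in a few lines. This is only a toy under loud assumptions: the model names in `POOL`, the `SKILL` table, the reward in `fitness`, and the flat score table standing in for the "routing head" are all hypothetical, and a plain mutate-and-keep-if-no-worse evolution strategy is swapped in for whatever evolutionary algorithm the paper actually uses. The real coordinator, per the post, reads hidden states of a compact language model, which this sketch does not attempt.

```python
import random

ROLES = ["thinker", "worker", "verifier"]
POOL = ["model_a", "model_b", "model_c"]  # hypothetical model names
TURNS = len(ROLES)
N_ACTIONS = len(ROLES) * len(POOL)  # 9 (role, model) pairs per turn

# Hypothetical skill table: how well each model performs each role.
SKILL = {
    ("model_a", "thinker"): 0.9, ("model_a", "worker"): 0.4, ("model_a", "verifier"): 0.5,
    ("model_b", "thinker"): 0.3, ("model_b", "worker"): 0.9, ("model_b", "verifier"): 0.4,
    ("model_c", "thinker"): 0.5, ("model_c", "worker"): 0.3, ("model_c", "verifier"): 0.9,
}

def route(params, turn):
    """Pick the highest-scoring (role, model) pair for this turn."""
    scores = params[turn * N_ACTIONS:(turn + 1) * N_ACTIONS]
    best = max(range(N_ACTIONS), key=lambda i: scores[i])
    return ROLES[best // len(POOL)], POOL[best % len(POOL)]

def fitness(params):
    """Toy reward: turn t should use role ROLES[t], filled by a skilled model."""
    total = 0.0
    for t in range(TURNS):
        role, model = route(params, t)
        total += SKILL[(model, role)] if role == ROLES[t] else 0.0
    return total / TURNS

def evolve(generations=500, sigma=0.5, seed=0):
    """Derivative-free search: mutate with Gaussian noise, keep the child if no worse."""
    rng = random.Random(seed)
    parent = [0.0] * (TURNS * N_ACTIONS)  # 27 learnable numbers in this toy
    best = fitness(parent)
    for _ in range(generations):
        child = [p + rng.gauss(0.0, sigma) for p in parent]
        score = fitness(child)
        if score >= best:
            parent, best = child, score
    return parent, best

params, best = evolve()
for t in range(TURNS):
    print(t, route(params, t))
```

No gradient is ever computed here: the search only compares scalar rewards, which is why this family of methods tolerates the binary, noisy rewards that the post says defeated REINFORCE.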

Sakana AI@SakanaAILabs

We’re launching the beta for our new commercial AI product: Sakana Fugu 🐡, a multi-agent orchestration system!
Blog: sakana.ai/fugu-beta
Fugu hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench, and has been our internal secret weapon. It dynamically coordinates frontier models, autonomously selecting the optimal agent combinations and roles for each task.
Available as an OpenAI-compatible API, you can seamlessly integrate Fugu into your existing workflows with minimal changes.
🐟 Fugu Mini: High-speed orchestration optimized for latency
🐡 Fugu Ultra: Full model pool utilization for deep, complex reasoning
Apply for the beta test here: forms.gle/BtKkhc2CfLKk1d…


337

74.5K

Arthur Ostapenko retweeted

Kevin Patrick Murphy@sirbayes·4d

New paper: "Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs". Our system (BLF) matches human superforecasters on ForecastBench, and beats all the top methods (GPT-5, Cassi, Grok 4.20, and Foresight-32B). 🧵


199

25.7K

Arthur Ostapenko retweeted

Sakana AI@SakanaAILabs·2d


140

575

234.9K

Arthur Ostapenko retweeted

DeepSeek@deepseek_ai·2d

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n
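To put the "total / active" figures in the post above in proportion, the short calculation below computes the share of parameters that are active for each model. It assumes "active" means parameters used per token, as is usual for mixture-of-experts models; the post itself does not spell this out.

```python
def active_fraction(total_params, active_params):
    """Share of the model's parameters that are active."""
    return active_params / total_params

# Figures copied from the post: (total, active).
models = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active")
# DeepSeek-V4-Pro: 3.1% of parameters active
# DeepSeek-V4-Flash: 4.6% of parameters active
```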


1.5K

7.6K

44.3K

Arthur Ostapenko retweeted

Qwen@Alibaba_Qwen·4d

🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇
What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours
Smaller model. Bigger results. Community's favorite. ❤️
We can't wait to see what you build with Qwen3.6-27B! 👀
🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…


501

1.7K

12.4K

3.6M

Arthur Ostapenko retweeted

Xenova@xenovacom·4d

NEW: OpenAI releases Privacy Filter, their first open model of 2026! 🤗 Apache-2.0! It's a bidirectional token-classification adaptation of GPT-OSS, trained to mask personally identifiable information (PII) in text. At only 1.5B params, it can even run locally in your browser!
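The masking step that a token-classification model enables can be illustrated as follows. This is a sketch of the general idea only: the label names (`NAME`, `EMAIL`, `O`), the `[REDACTED]` placeholder, and the hand-written labels are hypothetical, and nothing here calls or reflects the actual Privacy Filter model or its output format.

```python
def mask_pii(tokens, labels, mask="[REDACTED]"):
    """Replace each run of PII-labeled tokens with a single mask token."""
    out = []
    prev_masked = False
    for tok, lab in zip(tokens, labels):
        if lab != "O":  # "O" means "not PII" in this hypothetical scheme
            if not prev_masked:
                out.append(mask)
            prev_masked = True
        else:
            out.append(tok)
            prev_masked = False
    return " ".join(out)

tokens = ["Email", "Jane", "Doe", "at", "jane@example.com", "today"]
labels = ["O", "NAME", "NAME", "O", "EMAIL", "O"]  # stand-in for model output
print(mask_pii(tokens, labels))  # Email [REDACTED] at [REDACTED] today
```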


780

217K

Arthur Ostapenko retweeted

Zain Shah@zan2434·4d

Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code. Just exactly what you want to see. @eddiejiao_obj, @drewocarr and I built a prototype to see how this could actually work, and set out to make it real. We're calling it Flipbook. (1/5)


3.3K

26.1K

5.5M

Arthur Ostapenko retweeted

GLADIA Research Lab@GladiaLab·27 Oct

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)


279

1.3K

11.1K

5.1M

Arthur Ostapenko retweeted

Sakana AI@SakanaAILabs·5d

Discovering Novel LLM Experts via Task-Capability Coevolution
Project: acdc-llm.github.io
Paper: arxiv.org/abs/2604.14969
Can we build AI that is smarter than its parts? This week, our team will present AC/DC⚡ at #ICLR2026.
The current paradigm in AI assumes that to solve more complex problems, we must train a single, ever-larger model. But no single model can excel at every task without massive computational costs. Instead of building one monolithic model, we asked what if we coevolved a diverse collective of specialized experts.
We introduce Assessment Coevolving with Diverse Capabilities (AC/DC). It is a framework that simultaneously evolves a population of LLMs using evolutionary model merging and an archive of synthetic tasks generated by an AI scientist. As the tasks become more complex, the models must develop distinct, specialized skills to solve them.
Crucially, AC/DC selects models based on Quality-Diversity. It keeps models not just because they score high on average, but because they solve different problems than the rest of the population.
The results show that a collaborative task force of 8 small, evolved models can outperform a massive 72B parameter model, using significantly fewer total parameters. These models genuinely specialize, providing completely different, yet correct, approaches to complex problems.
This suggests a new path forward for AI development, creating highly capable, parameter-efficient systems through collective intelligence rather than relying solely on brute-force scaling.
OpenReview: openreview.net/forum?id=efNIN…
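The contrast the post draws, between keeping models that score high on average and keeping models that solve different problems, can be seen in a toy selection routine. The greedy coverage heuristic below is a swapped-in stand-in, not the Quality-Diversity procedure from the paper, and the model names and solved-task sets are made up for illustration.

```python
def select_by_average(solved, k):
    """Keep the k models that solve the most tasks overall."""
    ranked = sorted(solved, key=lambda m: (-len(solved[m]), m))
    return ranked[:k]

def select_by_coverage(solved, k):
    """Greedily keep the model that adds the most not-yet-solved tasks."""
    chosen, covered = [], set()
    candidates = dict(solved)
    for _ in range(min(k, len(candidates))):
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates), key=lambda m: len(candidates[m] - covered))
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered

# Hypothetical data: which task ids each model solves.
solved = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3},
    "m3": {5, 6},
    "m4": {7},
}

print(select_by_average(solved, 2))   # ['m1', 'm2'] -> together solve 4 tasks
print(select_by_coverage(solved, 2))  # (['m1', 'm3'], {1, 2, 3, 4, 5, 6})
```

In this made-up example the second-best model on average, `m2`, adds nothing to the team, while the weaker `m3` contributes two tasks nobody else solves.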


189

18.9K

Arthur Ostapenko retweeted

Sakana AI@SakanaAILabs·6d

Can LLMs flip coins in their heads?
When prompted to “Flip a fair coin” 100 times, the heads to tails ratio drifts far from 50:50. LLMs can understand what the target probability should be, but generating outputs that faithfully follow a given distribution is a separate problem.
This bias extends beyond coin flips. When LLMs are asked to generate multiple story ideas or brainstorm solutions, the outputs tend to cluster around a narrow range. The same probabilistic skew that distorts coin flips limits diversity in creative generation, recommendations, and other tasks where varied outputs are needed.
We discovered a prompting technique named String Seed of Thought (SSoT). The method is simple: instruct the LLM to generate a random string in its own output, then manipulate that string to derive its answer. It requires only a small addition to the prompt and no external random number generator.
SSoT significantly reduces output bias across a wide range of LLMs, both open and closed. With reasoning models (such as DeepSeek-R1), it reaches accuracy close to that of actual random sampling. The method generalizes from binary choices to n-way selections and arbitrary probability distributions.
On the NoveltyBench diversity benchmark, SSoT outperformed other approaches across all six categories while maintaining output quality.
This work will be presented at #ICLR2026!
Blog: pub.sakana.ai/ssot
Paper: arxiv.org/abs/2510.21150
Openreview: openreview.net/forum?id=luXtb…
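The "generate a string, then manipulate it" step in the post above can be made concrete with a small example. In SSoT the model performs this manipulation itself inside its own output; the Python below only shows one possible deterministic mapping from a string to a choice. The prompt wording in `SSOT_INSTRUCTION` and the sum-of-character-codes rule are assumptions for illustration, not taken from the paper.

```python
# Hypothetical prompt addition; the paper's actual wording may differ.
SSOT_INSTRUCTION = (
    "First write a random string of 20 characters. "
    "Then use that string to pick your answer."
)

def choose_from_string(seed_string, options):
    """Map a string to one option: sum of character codes modulo the option count."""
    total = sum(ord(ch) for ch in seed_string)
    return options[total % len(options)]

# 'a' is 97 and 'b' is 98, so the sum is 195; 195 % 2 == 1 selects index 1.
print(choose_from_string("ab", ["heads", "tails"]))  # tails
```

The same function handles n-way selection by passing a longer `options` list, which mirrors the post's claim that the method extends beyond binary choices.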


137

811

252.1K

Arthur Ostapenko retweeted

Natalie Wolchover@nattyover·6d

Bacteria move around using a molecular machine called the flagellar motor that rotates faster than the flywheel of a race car engine and switches directions in an instant. After 50 yrs, scientists have finally figured out how it works. “My lifelong quest is now fulfilled.” Link⤵️


526

4.4K

32.9K

Arthur Ostapenko retweeted

Kimi.ai@Kimi_Moonshot·6d

Meet Kimi K2.6: Advancing Open-Source Coding
🔹Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python(86.7), Math Vision w/ python (93.2)
What's new:
🔹Long-horizon coding - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
🔹Motion-rich frontend - Videos in hero sections, WebGL shaders, GSAP + Framer Motion, Three.js 3D.
🔹Agent Swarms, elevated - 300 parallel sub-agents × 4,000 steps per run (up from K2.5's 100 / 1,500). One prompt, 100+ files.
🔹Proactive Agents - K2.6 model powers OpenClaw, Hermes Agent, etc for 24/7 autonomous ops.
🔹Claw Groups (research preview) - bring your own agents, command your friends', bots & humans in the loop.
-
K2.6 is now live on kimi.com in chat mode and agent mode. For production-grade coding, pair K2.6 with Kimi Code: kimi.com/code
-
🔗 API: platform.moonshot.ai
🔗 Tech blog: kimi.com/blog/kimi-k2-6
🔗 Weights & code: huggingface.co/moonshotai/Kim…


893

2.4K

18.1K

7.3M

Arthur Ostapenko retweeted

Boris ✈️🇧🇷 ICLR 2026@BorisMeinardus·20 Apr

🚨Why should one huge LLM know and solve everything? - No single human does, yet our civilization does endless innovation. Introducing AC/DC - it continually coevolves a population of small expert LLMs that collectively outperform GPT-4o. (ICLR 2026 w/ @SakanaAILabs) 👇🧵


202

22.4K

Arthur Ostapenko retweeted

alphaXiv@askalphaxiv·6d

“Think Anywhere in Code Generation” Most reasoning LLMs think before writing code. But coding often gets hard because the tricky parts only get revealed mid-implementation, when the edge cases or final return logic appear. So this paper introduces Think-Anywhere, where models can pause and reason at any token position while generating code, then strip those thoughts out to leave clean executable code. Trained with cold-start SFT + execution-based RL, this beats CoT, self-planning, interleaved thinking, GRPO, and recent code post-training methods. This lets the model learn to think exactly where uncertainty appears.
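The "strip those thoughts out" step mentioned in the post above might look something like the sketch below. The `<think>...</think>` delimiters are a hypothetical choice made here for illustration; the post does not say how the paper marks inline thoughts, and the sample generation is hand-written rather than model output.

```python
import re

# Hypothetical delimiters for an inline thought; non-greedy so each span is removed separately.
THOUGHT = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thoughts(generated):
    """Remove inline reasoning spans, leaving only the code."""
    return THOUGHT.sub("", generated)

# Hand-written example of code with a thought in the middle of the implementation.
raw = (
    "def safe_div(a, b):\n"
    "    <think>b may be zero; guard before dividing</think>"
    "if b == 0:\n"
    "        return None\n"
    "    return a / b\n"
)

clean = strip_thoughts(raw)
namespace = {}
exec(clean, namespace)  # the stripped text is valid Python
print(namespace["safe_div"](6, 3))  # 2.0
```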


368

19.1K

Arthur Ostapenko retweeted

Yulu Gan@yule_gan·13 Mar

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt. To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs. What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets. Paper: arxiv.org/pdf/2603.12228 Code: github.com/sunrainyg/Rand… Website: thickets.mit.edu
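A toy version of "add Gaussian noise once, then ensemble" is sketched below on a two-weight linear classifier instead of an LLM. The selection of the top `k` perturbed copies on a small labeled set, the majority vote, and every name and number in the code are assumptions made for illustration; the post only states that noise is added in one step and the resulting models are ensembled, so see the linked paper and code for the actual RandOpt procedure.

```python
import random

def predict(w, x):
    """Toy linear classifier: returns 1 if w . x > 0, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else 0

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

def randopt_sketch(base, data, n=50, k=5, sigma=0.5, seed=0):
    """Perturb the base weights once with Gaussian noise; keep the k best copies."""
    rng = random.Random(seed)
    candidates = [[p + rng.gauss(0.0, sigma) for p in base] for _ in range(n)]
    candidates.sort(key=lambda w: accuracy(w, data), reverse=True)
    return candidates[:k]  # no gradients, no iterations over the candidates

def ensemble_predict(models, x):
    """Majority vote over the selected copies."""
    votes = sum(predict(w, x) for w in models)
    return 1 if votes * 2 > len(models) else 0

# Hypothetical task: label is 1 when x0 + x1 > 0.
data_rng = random.Random(1)
points = [(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)) for _ in range(40)]
data = [(x, 1 if x[0] + x[1] > 0 else 0) for x in points]

base = [1.0, 0.0]  # stands in for the pretrained weights
top = randopt_sketch(base, data)
print(accuracy(base, data), [round(accuracy(w, data), 2) for w in top])
```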


435

685K

Arthur Ostapenko retweeted

Sakana AI@SakanaAILabs·18 Apr

What happens when you put competing neural networks in a Petri Dish and start changing the rules while they adapt?
Last year we released Petri Dish NCA, where neural nets are the organisms that learn during simulation. Today we're releasing Digital Ecosystems: a browser-based platform for interactive artificial life research.
The setup: several small CNNs share a 2D grid, each seeing only a 3x3 neighborhood. No global plan. They compete for territory by attacking neighbours and defending against incoming attacks, learning via gradient descent online while the simulation runs.
What we didn't expect was the role of the learning itself. Gradient descent isn't just optimising each species' strategy. Instead, it acts to stabilize the whole system during simulation. Species that overextend get pushed back by the loss. Species that stagnate get nudged to grow.
This means you can push parameters toward edge-of-chaos regimes: a zone characterised by emergent complexity. Letting the neural networks learn acts to hold the complex system together while you explore and interact.
The platform lets you steer all of this interactively. You can draw walls to create niches, erase parts of the system online, and tune 40+ system parameters to explore the most interesting configurations. We find it mesmerizing to watch species carve out territories and reorganise when you perturb them.
Everything runs client-side in your browser, no install needed.
Blog: pub.sakana.ai/digital-ecosys…
Code: github.com/SakanaAI/digit…


196

1.1K

212.4K

Arthur Ostapenko retweeted

Chris Tate@ctatedev·18 Apr

Terminal automation + e2e testing solved
Now as simple as snapshot, click, type:
– wterm renders terminal-in-html, every cell in the a11y tree
– agent-browser automates pages via the a11y tree
Here's opencode in one browser driving Claude Code in another


107

216

3.4K

962K

Arthur Ostapenko retweeted

Felix Rieseberg@felixrieseberg·17 Apr

Hi! I'm here with *another launch*, it just happens to be extremely niche, nerdy, and probably only for a handful of people. In the desktop app, Claude Cowork and Code now have a little Bluetooth API for makers & developers, allowing you to build hardware devices that interact with Claude. I, for instance, built a little desk pet that alerts me whenever Claude is waiting for permission.


127

119

1.7K

180.3K

Arthur Ostapenko retweeted

Higgsfield AI 🧩@higgsfield·16 Apr

We just made a 23-MINUTE sci-fi pilot in 4 days. And it is 100% AI.


753

737

4.3K

2.1M

Arthur Ostapenko retweeted

Yuekun Yao@yuekun_yao·15 Apr

Claude Mythos is suspected of being a Looped transformer (LT), but why are LT-based LLMs so powerful? Our new finding: LT can perform implicit reasoning over their parametric knowledge, unlocking generalization to complex and unfamiliar questions compared to transformers ⤵️


154

962

184.2K
