Sean

869 posts

@seanqualia

I prostrate to those who have abandoned all views.

Joined July 2020

4.5K Following · 312 Followers

Sean@seanqualia·4d

@AgustinLebron3 @Biggiethelad1 Same here. Only open to verified users. Only in-person, in Prague?

Agustin Lebron@AgustinLebron3·4d

@Biggiethelad1 DMed

Agustin Lebron@AgustinLebron3·4d

Starting to sense that the big prop firms (JS, HRT, Jump, etc) are entering their 2010s-era Google phase. Printing so much $, they can afford to warehouse expensive brains doing not-that-much. So that others don't get access to those brains. Cushy but not very stimulating.

Sean@seanqualia·5d

@asimovinc I would love to. Please keep me posted.

Asimov@asimovinc·5d

We'll be in SF this June. Want to grab a coffee with the people building robots & AI. Drinks on us. Would you be up for a meetup?

Sean reposted

Intology@intology·5d

Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress. Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research. NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵

Sean reposted

Swaroop Mishra@Swarooprm7·5d

Proud to have worked on recreating Alphazero. The future is super super exciting 🔥!

koray kavukcuoglu@koraykv

Today at Google I/O, we introduced Gemini 3.5 Flash! It has become an integral part of our daily research cycle and works with all the tools we have at Google. We used a team of agents in Antigravity 2.0 to recreate the original AlphaZero research paper and build a playable version. They coded the reinforcement learning pipeline in JAX/Flax, trained a ResNet model from scratch via self-play on multi-TPU pods, and shipped a full-stack web app so you can play against it, from just 2 prompts. Here’s what else makes 3.5 Flash special 🧵

Sean@seanqualia·15 May

@PrimeIntellect I would like to see this on the Q Labs Nano GPT SlowRun data efficiency benchmark.

Prime Intellect@PrimeIntellect·15 May

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

Sean reposted

Tilde@tilderesearch·8 May

Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.

Tilde@tilderesearch

x.com/i/article/2052…

Sean reposted

Francesco Bertolotti@f14bertolotti·6 May

New GRPO variant. The idea is to re-weight the advantage to tokens that deviate the most from the original model distribution. There are several tricks to make this work and the experiments are fairly limited, but the idea is cool. 🔗arxiv.org/pdf/2605.03327

Sean@seanqualia·7 May

@eliebakouch Hey - I’m curious: if you could build one extension/ablation/etc on the model - doable for 1 person and <~$500 in compute - what would it be?

elie@eliebakouch·7 May

btw i'm not saying it's a better arch since it's hard to conclude with the lack of ablation in the paper, but i like the novelty!

elie@eliebakouch·6 May

very impressive release with lots of care at every stage of training: custom arch with bigger experts, more expressive router, compressed attention, residual scaling, and much more on the post training side including test time compute etc. benchmark scores are very competitive

Zyphra@ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

Sean reposted

Lee Sharkey@leedsharkey·5 May

My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)

Sean reposted

Goodfire@GoodfireAI·30 Apr

Introducing Silico: the platform for building AI models with the precision of written software. Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up. Early access is open now. 🧵(1/10)

Sean@seanqualia·28 Apr

Ruthless

Sean reposted

Sakana AI@SakanaAILabs·26 Apr

What if instead of building one giant AI, we evolved a coordinator to orchestrate a diverse team of specialized AIs? 🐟 Excited to share our new paper: “TRINITY: An Evolved LLM Coordinator”, published as a conference paper at #ICLR2026! Paper: arxiv.org/abs/2512.04695

In nature, complex problems are rarely solved by a single monolithic entity, but rather by the coordinated efforts of specialized individuals working together. Yet, modern AI development is heavily focused on endlessly scaling up single, massive monolithic models, yielding diminishing returns. While model merging offers a way to combine different skills, it is often impractical due to mismatched neural architectures and the closed-source nature of top-performing models.

To address this, we took a macro-level approach: test-time model composition. We introduce TRINITY, a system that fuses the complementary strengths of diverse, state-of-the-art models without needing to modify their underlying weights. TRINITY processes queries over multiple turns. At each step, a lightweight coordinator assigns one of three distinct roles to an LLM from its available pool:

1/ Thinker: Devises high-level strategies and analyzes the current state.
2/ Worker: Executes concrete problem-solving steps.
3/ Verifier: Evaluates if the current solution is complete and correct.

By dynamically assigning these roles, the coordinator effectively offloads complex reasoning and skill execution onto the external models. What makes TRINITY unique is its extreme efficiency. The coordinator relies on the hidden states of a compact language model and a small routing head. In total, it has fewer than 20K learnable parameters.

Training this system presented a massive challenge. Traditional Reinforcement Learning (REINFORCE) failed because the gradients had a low signal-to-noise ratio due to binary rewards and weak parameter coupling. Imitation learning (Supervised Fine-Tuning) was ruled out because generating multi-turn labels is prohibitively expensive. Our solution? We turned to nature-inspired algorithms. We optimized the coordinator using a derivative-free evolutionary algorithm. We found that evolution is uniquely suited to optimize this tight, high-dimensional coordination problem where traditional gradient-based methods fail.

The results are very promising. In our experiments, TRINITY consistently outperforms existing multi-agent methods and individual models across various benchmarks. At the time of publication, it set a new state-of-the-art record on LiveCodeBench, achieving an 86.2% pass@1 score. More importantly, it demonstrated incredible generalization. Without any retraining, TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet (the top frontier models available at the time of our #ICLR2026 submission last year).

This work is central to Sakana AI's vision. We believe the future of AI isn't just about scaling monolithic models, but engineering collaborative, diverse AI ecosystems that can adapt and combine their strengths. We invite the community to read the paper and explore these ideas! Paper: arxiv.org/abs/2512.04695 OpenReview: openreview.net/forum?id=5HaRj… This foundational research is part of the core engine powering our multi-agent product: Sakana Fugu 🐡👇

Sakana AI@SakanaAILabs

We’re launching the beta for our new commercial AI product: Sakana Fugu 🐡, a multi-agent orchestration system! Blog: sakana.ai/fugu-beta Fugu hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench, and has been our internal secret weapon. It dynamically coordinates frontier models, autonomously selecting the optimal agent combinations and roles for each task. Available as an OpenAI-compatible API, you can seamlessly integrate Fugu into your existing workflows with minimal changes. 🐟 Fugu Mini: High-speed orchestration optimized for latency 🐡 Fugu Ultra: Full model pool utilization for deep, complex reasoning Apply for the beta test here: forms.gle/BtKkhc2CfLKk1d…
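Editor's sketch, not from the paper: the TRINITY post above says a small routing head is optimized with a derivative-free evolutionary algorithm because binary rewards give noisy gradients. The toy below illustrates that general idea with a simple (1+λ) evolution strategy on a linear role router. Everything in it is an assumption made for illustration: the 4-dimensional "hidden state", the synthetic labels, the mutation scale, and the function names `route`, `fitness`, `mutate`, and `evolve`. The post does not say which evolutionary algorithm TRINITY actually uses.

```python
import random

ROLES = ("thinker", "worker", "verifier")
DIM = 4  # size of the toy "hidden state" (hypothetical)


def route(weights, state):
    """Return the index of the role with the highest linear score."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return scores.index(max(scores))


def fitness(weights, dataset):
    """Mean binary reward: 1 when the routed role matches the toy label."""
    hits = sum(1 for state, label in dataset if route(weights, state) == label)
    return hits / len(dataset)


def mutate(weights, rng, sigma=0.3):
    """Add Gaussian noise to every weight; no gradients are used."""
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def evolve(dataset, rng, generations=50, offspring=16):
    """(1+lambda) evolution strategy: keep the parent unless a child is at least as good."""
    parent = [[0.0] * DIM for _ in ROLES]
    parent_fit = fitness(parent, dataset)
    history = [parent_fit]
    for _ in range(generations):
        children = [mutate(parent, rng) for _ in range(offspring)]
        best = max(children, key=lambda c: fitness(c, dataset))
        best_fit = fitness(best, dataset)
        if best_fit >= parent_fit:
            parent, parent_fit = best, best_fit
        history.append(parent_fit)
    return parent, history


rng = random.Random(0)

# Synthetic data (an assumption): the "correct" role is the index of the
# largest of the first three state features.
dataset = []
for _ in range(200):
    state = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    label = max(range(len(ROLES)), key=lambda i: state[i])
    dataset.append((state, label))

head, history = evolve(dataset, rng)
print(f"mean reward: {history[0]:.2f} -> {history[-1]:.2f}")
```

Because the parent is only replaced by a child that scores at least as well, the recorded reward never decreases from one generation to the next. The reward is a 0/1 match, so there is no useful gradient to follow, which is the setting the post describes.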

Sean reposted

Yoonho Lee@yoonholeee·30 Mar

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end

Sean@seanqualia·26 Apr

arxiv.org/pdf/2501.14082

Sean reposted

shyamal@shyamalanadkat·24 Apr

personality is underrated as an intelligence primitive. every sufficiently advanced research program eventually becomes someone's unresolved aesthetic preference with compute

Sean reposted

Ifdita Hasan@ifdita_hasan·22 Apr

Deploying language models in scientific discovery domains requires extraordinary amounts of test-time compute for search algorithms. An ideal training algorithm should be designed with this goal in mind - that we want agents to learn how to not only exploit but also optimistically explore novel strategies. The agent should learn how to synergistically explore and exploit. We propose Poly-EPO, a set RL algorithm that explores and discovers diverse reasoning paths. Work with @jubayer_hamid (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn.

Sean reposted

Galbot@GalbotRobotics·23 Apr

Introducing LDA, a latent world action foundation model that, for the first time, unifies the utilization of heterogeneous embodied data across simulation and reality, humans and robots, and varying levels of action quality and annotation. By breaking long-standing data silos in embodied intelligence, LDA enables the field, much like GPT did for language, to benefit continuously from scaling data, marking the transition into a new era of scalable learning. #Galbot #Robotics #Innovation #AI #Technology #Humanoid #WorldModel

Sean reposted

Ilija Lichkovski@carnot_cyclist·7 Apr

x.com/i/article/2041…

Sean reposted

Andrej Karpathy@karpathy·27 Jun

The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystallizing:

- Natively multimodal text/vision/audio at both input and output.
- Matryoshka-style architecture allowing a dial of capability up and down at test time.
- Reasoning, also with a dial. (system 2)
- Aggressively tool-using.
- On-device finetuning LoRA slots for test-time training, personalization and customization.
- Delegates and double checks just the right parts with the oracles in the cloud if internet is available.

It doesn't know that William the Conqueror's reign ended on September 9, 1087, but it vaguely recognizes the name and can look up the date. It can't recite the SHA-256 of the empty string as e3b0c442..., but it can calculate it quickly should you really want it. What LLM personal computing lacks in broad world knowledge and top tier problem-solving capability it will make up in super low interaction latency (especially as multimodal matures), direct / private access to data and state, offline continuity, sovereignty ("not your weights not your brain"). i.e. many of the same reasons we like, use and buy personal computers instead of having thin clients access a cloud via remote desktop or so.

Omar Sanseviero@osanseviero

I’m so excited to announce Gemma 3n is here! 🎉 🔊Multimodal (text/audio/image/video) understanding 🤯Runs with as little as 2GB of RAM 🏆First model under 10B with @lmarena_ai score of 1300+ Available now on @huggingface, @kaggle, llama.cpp, ai.dev, and more
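Editor's note on Karpathy's post above: the SHA-256 of the empty string is the kind of fact a small model can compute with a tool call instead of memorizing. A minimal sketch using Python's standard `hashlib` module follows; the helper name `sha256_hex` is made up for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


# Hash the empty byte string, as in the post.
empty_digest = sha256_hex(b"")
print(empty_digest)
```

The printed digest begins with `e3b0c442`, matching the prefix quoted in the post.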

Sean reposted

Jubayer Ibn Hamid@jubayer_hamid·21 Apr

Exploration is the lifeblood of learning from experience. An agent must search broadly to uncover successful behaviors. It should continue exploring to expand its capabilities by learning distinct strategies to complex problems. Threading this needle between exploration and exploitation is critical for solving unsolved problems at test-time. An algorithm should encourage (1) optimistically exploring reasoning strategies, and (2) achieving a synergy between exploration and exploitation. Towards that end, we develop Poly-EPO: a method for training LMs to explore and reason. Work with @ifdita_hasan (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn. 🧵
