Sean

869 posts

@seanqualia

I prostrate to those who have abandoned all views.

Joined July 2020
4.5K Following · 312 Followers
Agustin Lebron @AgustinLebron3
Starting to sense that the big prop firms (JS, HRT, Jump, etc) are entering their 2010s-era Google phase. Printing so much $, they can afford to warehouse expensive brains doing not-that-much. So that others don't get access to those brains. Cushy but not very stimulating.
29 replies · 10 reposts · 840 likes · 161.7K views
Sean @seanqualia
@asimovinc I would love to. Please keep me posted.
0 replies · 0 reposts · 1 like · 126 views
Asimov @asimovinc
We'll be in SF this June. Want to grab a coffee with the people building robots & AI. Drinks on us. Would you be up for a meetup?
27 replies · 1 repost · 117 likes · 15.6K views
Sean reposted
Intology @intology
Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.
Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.
NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵
[image]
22 replies · 59 reposts · 277 likes · 138.6K views
Sean reposted
Sean @seanqualia
@PrimeIntellect I would like to see this on the Q Labs NanoGPT SlowRun data-efficiency benchmark.
0 replies · 0 reposts · 1 like · 855 views
Prime Intellect @PrimeIntellect
Automating AI research is the next major step in AI.
We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute.
~10k runs, ~14k H200 hours.
Opus now holds the record at 2930 steps vs the 2990 human baseline.
[image]
57 replies · 154 reposts · 1.7K likes · 585.9K views
Sean reposted
Tilde @tilderesearch
Introducing Aurora, a new optimizer for training frontier-scale models. We train Aurora-1.1B, which achieves 100x data efficiency on open-source internet data. Despite having 25% fewer parameters, 2 orders of magnitude fewer training tokens, and using fully open-source internet-only data, Aurora matches Qwen3-1.7B on several benchmarks. Aurora was developed after identifying a major failure mode that can occur under Muon, an increasingly popular optimizer that has shown strong gains over Adam(W). We find that Muon can cause a huge percentage of neurons to effectively die early in training, reducing effective network capacity so that many parameters no longer meaningfully contribute to network outputs. By redistributing update energy more uniformly across neurons while preserving Muon’s stability properties, Aurora prevents neuron death and recovers substantial model capacity. What makes this work especially exciting is that it points toward a broader direction for ML research: better optimizers may not come purely from elegant mathematical abstractions, but from understanding and addressing the concrete dynamics and pathologies that emerge inside real training systems.
Tilde @tilderesearch

x.com/i/article/2052…

41 replies · 177 reposts · 1.6K likes · 517.9K views
Sean reposted
Francesco Bertolotti @f14bertolotti
New GRPO variant. The idea is to re-weight the advantage to tokens that deviate the most from the original model distribution. There are several tricks to make this work and the experiments are fairly limited, but the idea is cool. 🔗arxiv.org/pdf/2605.03327
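A rough sketch of what "re-weighting the advantage toward tokens that deviate most from the original model" could look like. This is not the paper's formula: the weighting rule (absolute log-probability gap, normalized to average 1) and the function names `group_advantages`, `deviation_weights`, and `token_advantages` are all assumptions made for illustration.

```python
import math
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def deviation_weights(logp_policy: List[float], logp_ref: List[float]) -> List[float]:
    """Hypothetical per-token weights for one sampled sequence.

    Tokens whose log-probability moved furthest from the reference model get
    more weight. Weights are scaled to average 1 so the overall advantage
    magnitude of the sequence is unchanged.
    """
    gaps = [abs(p - q) for p, q in zip(logp_policy, logp_ref)]
    total = sum(gaps)
    n = len(gaps)
    if total == 0.0:
        # Policy and reference agree everywhere: fall back to uniform weights.
        return [1.0] * n
    return [n * g / total for g in gaps]


def token_advantages(advantage: float,
                     logp_policy: List[float],
                     logp_ref: List[float]) -> List[float]:
    """Spread one sequence-level advantage over tokens using the weights."""
    return [advantage * w for w in deviation_weights(logp_policy, logp_ref)]


if __name__ == "__main__":
    advs = group_advantages([1.0, 0.0, 0.0, 1.0])
    print(token_advantages(advs[0], [-1.0, -2.0, -0.5], [-1.0, -1.0, -1.0]))
```

With uniform weights this reduces to plain GRPO, where every token in a sequence shares the same advantage.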
[4 images]
3 replies · 19 reposts · 178 likes · 14.3K views
Sean @seanqualia
@eliebakouch Hey - I’m curious: if you could build one extension/ablation/etc on the model - doable for 1 person and <~$500 in compute - what would it be?
0 replies · 0 reposts · 0 likes · 8 views
elie @eliebakouch
btw i'm not saying it's a better arch since it's hard to conclude with the lack of ablation in the paper, but i like the novelty!
1 reply · 1 repost · 9 likes · 736 views
elie @eliebakouch
very impressive release with lots of care at every stage of training: custom arch with bigger experts, more expressive router, compressed attention, residual scaling, and much more on the post training side including test time compute etc.. benchmark scores are very competitive
[image]
Zyphra @ZyphraAI

Today we're releasing ZAYA1-8B, a reasoning MoE trained on @AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size on math and reasoning, closing in on DeepSeek-V3.2 and GPT-5-High with test-time compute. 🧵

5 replies · 12 reposts · 231 likes · 17.9K views
Sean reposted
Lee Sharkey @leedsharkey
My team at @GoodfireAI has been cooking up a new way to do interpretability: decompose a language model’s weights, not its activations. Our decomposition natively handles attention (!) and behaves less like a lookup table and more like a generalizing algorithm. (1/6)
34 replies · 193 reposts · 1.5K likes · 240.7K views
Sean reposted
Goodfire @GoodfireAI
Introducing Silico: the platform for building AI models with the precision of written software. Silico lets researchers and engineers see inside their models, debug failures, and intentionally design them from the ground up. Early access is open now. 🧵(1/10)
20 replies · 114 reposts · 869 likes · 109.9K views
Sean @seanqualia
Ruthless
[image]
0 replies · 0 reposts · 0 likes · 10 views
Sean reposted
Sakana AI @SakanaAILabs
What if instead of building one giant AI, we evolved a coordinator to orchestrate a diverse team of specialized AIs? 🐟
Excited to share our new paper: “TRINITY: An Evolved LLM Coordinator”, published as a conference paper at #ICLR2026!
Paper: arxiv.org/abs/2512.04695
In nature, complex problems are rarely solved by a single monolithic entity, but rather by the coordinated efforts of specialized individuals working together. Yet, modern AI development is heavily focused on endlessly scaling up single, massive monolithic models, yielding diminishing returns. While model merging offers a way to combine different skills, it is often impractical due to mismatched neural architectures and the closed-source nature of top-performing models.
To address this, we took a macro-level approach: test-time model composition. We introduce TRINITY, a system that fuses the complementary strengths of diverse, state-of-the-art models without needing to modify their underlying weights.
TRINITY processes queries over multiple turns. At each step, a lightweight coordinator assigns one of three distinct roles to an LLM from its available pool:
1/ Thinker: Devises high-level strategies and analyzes the current state.
2/ Worker: Executes concrete problem-solving steps.
3/ Verifier: Evaluates if the current solution is complete and correct.
By dynamically assigning these roles, the coordinator effectively offloads complex reasoning and skill execution onto the external models.
What makes TRINITY unique is its extreme efficiency. The coordinator relies on the hidden states of a compact language model and a small routing head. In total, it has fewer than 20K learnable parameters.
Training this system presented a massive challenge. Traditional Reinforcement Learning (REINFORCE) failed because the gradients had a low signal-to-noise ratio due to binary rewards and weak parameter coupling. Imitation learning (Supervised Fine-Tuning) was ruled out because generating multi-turn labels is prohibitively expensive.
Our solution? We turned to nature-inspired algorithms. We optimized the coordinator using a derivative-free evolutionary algorithm. We found that evolution is uniquely suited to optimize this tight, high-dimensional coordination problem where traditional gradient-based methods fail.
The results are very promising. In our experiments, TRINITY consistently outperforms existing multi-agent methods and individual models across various benchmarks. At the time of publication, it set a new state-of-the-art record on LiveCodeBench, achieving an 86.2% pass@1 score.
More importantly, it demonstrated incredible generalization. Without any retraining, TRINITY transferred zero-shot to four unseen tasks (AIME, BigCodeBench, MT-Bench, and GPQA). On average, the evolved coordinator surpassed every individual constituent model in its pool, including GPT-5, Gemini 2.5-Pro, and Claude-4-Sonnet (the top frontier models available at the time of our #ICLR2026 submission last year).
This work is central to Sakana AI's vision. We believe the future of AI isn't just about scaling monolithic models, but engineering collaborative, diverse AI ecosystems that can adapt and combine their strengths. We invite the community to read the paper and explore these ideas!
Paper: arxiv.org/abs/2512.04695
OpenReview: openreview.net/forum?id=5HaRj…
This foundational research is part of the core engine powering our multi-agent product: Sakana Fugu 🐡👇
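A toy sketch of the general idea in the post above: a small routing head picks one of three roles, and it is tuned with a derivative-free evolutionary loop on binary rewards. It is not the TRINITY implementation. The hand-made feature vectors in `EPISODES`, the 3x3 linear head, and the simple elitist mutate-and-keep loop are all assumptions chosen to keep the example short. The post does not say which evolutionary algorithm the paper uses, and the real coordinator reads hidden states from a compact language model.

```python
import random
from typing import List, Tuple

ROLES = ["thinker", "worker", "verifier"]

# Hypothetical toy episodes: (state features, index of the role that earns reward 1).
EPISODES: List[Tuple[List[float], int]] = [
    ([1.0, 0.0, 0.0], 0),
    ([0.0, 1.0, 0.0], 1),
    ([0.0, 0.0, 1.0], 2),
]

Weights = List[List[float]]  # one row of weights per role


def route(weights: Weights, features: List[float]) -> int:
    """Return the index of the role with the highest logit."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return max(range(len(ROLES)), key=lambda i: logits[i])


def fitness(weights: Weights) -> float:
    """Mean binary reward over the toy episodes (1 if the right role is chosen)."""
    hits = sum(1.0 for feats, target in EPISODES if route(weights, feats) == target)
    return hits / len(EPISODES)


def evolve(steps: int = 200, sigma: float = 0.5, seed: int = 0) -> Tuple[Weights, float]:
    """Elitist mutate-and-keep loop: no gradients, only fitness comparisons."""
    rng = random.Random(seed)
    parent: Weights = [[0.0] * 3 for _ in ROLES]
    best = fitness(parent)
    for _ in range(steps):
        # Mutate every weight with Gaussian noise.
        child = [[w + rng.gauss(0.0, sigma) for w in row] for row in parent]
        score = fitness(child)
        # Keep the child only if it is at least as good as the parent.
        if score >= best:
            parent, best = child, score
    return parent, best


if __name__ == "__main__":
    weights, score = evolve()
    print("fitness:", score)
    print("role for state 0:", ROLES[route(weights, [1.0, 0.0, 0.0])])
```

Because the loop only compares fitness values, it works unchanged when the reward is binary and non-differentiable, which is the setting where the post says REINFORCE struggled.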
[image]
Sakana AI @SakanaAILabs

We’re launching the beta for our new commercial AI product: Sakana Fugu 🐡, a multi-agent orchestration system!
Blog: sakana.ai/fugu-beta
Fugu hits SOTA on SWE-Pro, GPQA-D, and ALE-Bench, and has been our internal secret weapon. It dynamically coordinates frontier models, autonomously selecting the optimal agent combinations and roles for each task. Available as an OpenAI-compatible API, you can seamlessly integrate Fugu into your existing workflows with minimal changes.
🐟 Fugu Mini: High-speed orchestration optimized for latency
🐡 Fugu Ultra: Full model pool utilization for deep, complex reasoning
Apply for the beta test here: forms.gle/BtKkhc2CfLKk1d…

15 replies · 67 reposts · 406 likes · 98.7K views
Sean reposted
Yoonho Lee @yoonholeee
How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end
[image]
78 replies · 283 reposts · 1.7K likes · 572.2K views
Sean reposted
shyamal @shyamalanadkat
personality is underrated as an intelligence primitive. every sufficiently advanced research program eventually becomes someone's unresolved aesthetic preference with compute
7 replies · 3 reposts · 102 likes · 5K views
Sean reposted
Ifdita Hasan @ifdita_hasan
Deploying language models in scientific discovery domains requires extraordinary amounts of test-time compute for search algorithms. An ideal training algorithm should be designed with this goal in mind - that we want agents to learn how to not only exploit but also optimistically explore novel strategies. The agent should learn how to synergistically explore and exploit. We propose Poly-EPO, a set RL algorithm that explores and discovers diverse reasoning paths. Work with @jubayer_hamid (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn.
[image]
3 replies · 22 reposts · 108 likes · 51.8K views
Sean reposted
Galbot @GalbotRobotics
Introducing LDA, a latent world action foundation model that, for the first time, unifies the utilization of heterogeneous embodied data across simulation and reality, humans and robots, and varying levels of action quality and annotation. By breaking long-standing data silos in embodied intelligence, LDA enables the field, much like GPT did for language, to benefit continuously from scaling data, marking the transition into a new era of scalable learning. #Galbot #Robotics #Innovation #AI #Technology #Humanoid #WorldModel
5 replies · 39 reposts · 244 likes · 37.3K views
Sean reposted
Andrej Karpathy @karpathy
The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. It lives always-on and by default on every computer as the kernel of LLM personal computing. Its features are slowly crystalizing:
- Natively multimodal text/vision/audio at both input and output.
- Matryoshka-style architecture allowing a dial of capability up and down at test time.
- Reasoning, also with a dial. (system 2)
- Aggressively tool-using.
- On-device finetuning LoRA slots for test-time training, personalization and customization.
- Delegates and double checks just the right parts with the oracles in the cloud if internet is available.
It doesn't know that William the Conqueror's reign ended on September 9, 1087, but it vaguely recognizes the name and can look up the date. It can't recite the SHA-256 of the empty string as e3b0c442..., but it can calculate it quickly should you really want it.
What LLM personal computing lacks in broad world knowledge and top tier problem-solving capability it will make up in super low interaction latency (especially as multimodal matures), direct / private access to data and state, offline continuity, sovereignty ("not your weights not your brain"), i.e. many of the same reasons we like, use and buy personal computers instead of having thin clients access a cloud via remote desktop or so.
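The SHA-256 point above is easy to check with a tool call instead of recall. A minimal sketch using Python's standard `hashlib` module (the variable name `digest` is just illustrative):

```python
import hashlib

# Hash the empty byte string; no input data is fed in.
digest = hashlib.sha256(b"").hexdigest()

# The hex digest is 64 characters long and begins with e3b0c442.
print(digest)
```

This is the kind of exact, memorization-heavy fact the post suggests a small model should compute on demand, not store.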
Omar Sanseviero @osanseviero

I’m so excited to announce Gemma 3n is here! 🎉
🔊 Multimodal (text/audio/image/video) understanding
🤯 Runs with as little as 2GB of RAM
🏆 First model under 10B with @lmarena_ai score of 1300+
Available now on @huggingface, @kaggle, llama.cpp, ai.dev, and more

390 replies · 1.3K reposts · 10.7K likes · 1.3M views
Sean reposted
Jubayer Ibn Hamid @jubayer_hamid
Exploration is the lifeblood of learning from experience. An agent must search broadly to uncover successful behaviors. It should continue exploring to expand its capabilities by learning distinct strategies to complex problems. Threading this needle between exploration and exploitation is critical for solving unsolved problems at test-time. An algorithm should encourage (1) optimistically exploring reasoning strategies, and (2) achieving a synergy between exploration and exploitation. Towards that end, we develop Poly-EPO: a method for training LMs to explore and reason. Work with @ifdita_hasan (co-lead), Shreya, @ShirleyYXWu, @HengyuanH, @noahdgoodman, @DorsaSadigh, and @chelseabfinn. 🧵
[image]
5 replies · 57 reposts · 327 likes · 75.2K views