Isaac Hodes
@ihodes
206 posts
Product @ Open Athena: building foundation models in the open
NYC · Joined September 2008
853 Following · 454 Followers
Isaac Hodes retweeted
Percy Liang @percyliang
Our 1e23 Delphi run finished last night. Its loss was within 0.005 of the projected (preregistered) loss. Note that these projections were based only on training models over 100x smaller (3e20)! Still more work to do. We still had loss spikes, and if you look closely, our scaling laws are bending. We have some ideas for fixing both...
Will Held@WilliamBarrHeld

How far do Marin's scaling laws extrapolate? At least 100x, apparently! Despite spooky spikes, our 1e23 Delphi finished on forecast. The compute-optimal ladder costs ~1e21 FLOPs to train. Good scaling science lets you “run” this (not tiny) experiment at 1/100th the cost.

7 replies · 13 reposts · 187 likes · 31.1K views
Isaac Hodes retweeted
Will Held @WilliamBarrHeld
How far do Marin's scaling laws extrapolate? At least 100x, apparently! Despite spooky spikes, our 1e23 Delphi finished on forecast. The compute-optimal ladder costs ~1e21 FLOPs to train. Good scaling science lets you “run” this (not tiny) experiment at 1/100th the cost.
[image]
Percy Liang@percyliang

In Marin, we are trying to get really good at scaling laws. We have trained models up to 1e22 FLOPs and have made a prediction of the loss at 1e23 FLOPs, which @WilliamBarrHeld is running. This prediction is preregistered on GitHub, so we'll see in a few days how accurate our prediction was. What we want is not just a single model but a training recipe that scales reliably.

3 replies · 19 reposts · 142 likes · 48.1K views
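The forecasting workflow described in these posts can be sketched numerically: fit a saturating power law L(C) = a·C^(-b) + c to a small-compute ladder, then extrapolate 100x up. Everything below is synthetic and hypothetical; the constants, the ladder, and the grid-search fit are illustrative assumptions, not Marin's actual recipe.

```python
import numpy as np

# Hypothetical sketch of loss forecasting via a scaling law:
# fit L(C) = a * C**-b + c on a small-compute "ladder", then
# extrapolate to 100x more compute. All constants are synthetic.

def fit_power_law(C, L, b_grid=np.linspace(0.01, 1.0, 500)):
    # Grid-search the exponent b; for each candidate b, (a, c) is a
    # linear least-squares problem in the basis [C**-b, 1].
    best = None
    for b in b_grid:
        X = np.stack([C**-b, np.ones_like(C)], axis=1)
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        sse = float(np.sum((X @ coef - L) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], b, coef[1])
    return best[1], best[2], best[3]  # a, b, c

rng = np.random.default_rng(0)
C_small = np.array([1e18, 3e18, 1e19, 3e19, 1e20, 3e20])  # training ladder
true_loss = lambda C: 1.7 + 400.0 * C**-0.12               # synthetic "truth"
L_small = true_loss(C_small) + rng.normal(0.0, 1e-3, C_small.size)

a, b, c = fit_power_law(C_small, L_small)
pred = a * 1e23**-b + c  # preregister this number, then run the big model
print(f"forecast loss at 1e23 FLOPs: {pred:.3f}")
```

The ladder here costs a tiny fraction of the final run's FLOPs, which is the point being made above: good scaling fits let you "run" the big experiment cheaply before committing compute.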
Isaac Hodes retweeted
Will Held @WilliamBarrHeld
Scaling laws are "just" regressions. But a biased fitting method can quietly misallocate millions of dollars of compute at frontier scales. My coworker Eric Czech dug into a bias in the parabolic IsoFLOP fits used by Meta, DeepSeek, Microsoft, Waymo, et al. for their scaling laws 🧵
[image]
2 replies · 27 reposts · 135 likes · 34.5K views
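A minimal sketch of the failure mode this thread refers to, with entirely made-up constants: a Chinchilla-style IsoFLOP loss curve L(N) at fixed compute is asymmetric in log N (because the two power-law exponents differ), so the vertex of a least-squares parabola lands slightly off the true optimum. This is an assumption-laden illustration of the bias, not Eric Czech's actual analysis.

```python
import numpy as np

# Synthetic IsoFLOP curve: L(N) = E + A/N**alpha + B/D**beta, with the
# token count D pinned by the budget C ~= 6*N*D. alpha != beta makes the
# curve asymmetric in log10(N), so a parabola's vertex is biased.
C = 1e20
A, B, E, alpha, beta = 4000.0, 2000.0, 1.7, 0.34, 0.28

def isoflop_loss(N):
    D = C / (6.0 * N)
    return E + A / N**alpha + B / D**beta

logN = np.linspace(np.log10(2e8), np.log10(2e10), 15)  # sampled model sizes
L = isoflop_loss(10.0**logN)

a2, a1, _ = np.polyfit(logN, L, 2)  # quadratic fit in log10(N)
vertex = -a1 / (2.0 * a2)           # parabola's estimated optimal size

fine = np.linspace(logN[0], logN[-1], 200001)
true_opt = fine[np.argmin(isoflop_loss(10.0**fine))]
bias = vertex - true_opt
print(f"parabola N*: 10**{vertex:.3f}, true N*: 10**{true_opt:.3f}, "
      f"bias {bias:+.4f} dex")
```

The bias is small per fit, but scaling laws extrapolate it: a systematic offset of a few percent in the compute-optimal N, propagated to frontier budgets, misallocates real money.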
Isaac Hodes @ihodes
@jxmnop Yeah, would not recommend; these are already being automated. The objective is clear and hill-climbable, and the surface area is small.
0 replies · 0 reposts · 2 likes · 568 views
dr. jack morris @jxmnop
Learning to write kernels might be the highest-ROI activity for displaced SWEs: → prereq: reasonable engineering ability → six to twelve months of study → millions of dollars, mark zuckerberg showing up at your house to hire you, etc. i wish this were an exaggeration
42 replies · 61 reposts · 1.9K likes · 123.4K views
Isaac Hodes retweeted
Percy Liang @percyliang
In Marin, we are trying to get really good at scaling laws. We have trained models up to 1e22 FLOPs and have made a prediction of the loss at 1e23 FLOPs, which @WilliamBarrHeld is running. This prediction is preregistered on GitHub, so we'll see in a few days how accurate our prediction was. What we want is not just a single model but a training recipe that scales reliably.
[image]
18 replies · 47 reposts · 471 likes · 77.1K views
Isaac Hodes retweeted
Nathan Lambert @natolambert
We present Olmo 3, our next family of fully open, leading language models. This family of 7B and 32B models represents:

1. The best 32B base model.
2. The best 7B Western thinking & instruct models.
3. The first 32B (or larger) fully open reasoning model.

This is a big milestone for Ai2 and the Olmo project. These aren’t huge models (more on that later), but it’s crucial for the viability of fully open-source models that they are competitive on performance – not just replications of models that came out 6 to 12 months ago. As always, all of our models come with full training data, code, intermediate checkpoints, training logs, and a detailed technical report. All are available today, with some more additions coming before the end of the year. As with OLMo 2 32B at its release, Olmo 3 32B is the best open-source language model ever released. It’s an awesome privilege to get to provide these models to the broader community researching and understanding what is happening in AI today.

Base models – a strong foundation

Pretraining’s demise is now regularly overstated. 2025 has marked a year where the entire industry rebuilt their training stacks to focus on reasoning and agentic tasks, but some established base model sizes haven’t seen a new leading model since @alibaba_qwen's Qwen 2.5 in 2024. The Olmo 3 32B base model could be our most impactful artifact here, as Qwen3 did not release its 32B base model (likely for competitive reasons). We show that our 7B recipe competes with Qwen 3, and the 32B size provides a starting point for strong reasoning models or specialized agents. Our base model’s performance is in the same ballpark as Qwen 2.5, surpassing the likes of Stanford’s Marin (@stanfordAILab) and Gemma 3 (@GoogleDeepMind), but with pretraining data and code available, it should be more accessible for the community to learn how to finetune it (and to be confident in our results). We’re excited to see the community take Olmo 3 32B base in many directions.
32B is a beloved size for easy deployment on single 80GB+ memory GPUs and even on many laptops, like the MacBook I’m using to write this.

A model flow – the lifecycle of creating a model

With these strong base models, we’ve created a variety of post-training checkpoints to showcase the many ways post-training can be done to suit different needs. We’re calling this a “Model Flow.” For post-training, we’re releasing:

- Instruct versions – short, snappy, intelligent, and useful especially for synthetic data en masse (e.g. recent work by Datology @datologyai on OLMo 2 Instruct),
- Think versions – thoughtful reasoners with the performance you expect from a leading thinking model on math, code, etc., and
- RL Zero versions – controlled experiments for researchers understanding how to build post-training recipes that start with large-scale RL on the base model.

The first two post-training recipes are distilled from a variety of leading open and closed language models. At the 32B and smaller scale, direct distillation with further preference finetuning and reinforcement learning with verifiable rewards (RLVR) is becoming an accessible and highly capable pipeline. Our post-training recipe follows our recent models: 1) create an excellent SFT set, 2) use direct preference optimization (DPO) as a highly iterable, cheap, and stable preference learning method despite its critics, and 3) finish up with scaled-up RLVR. All of these stages confer meaningful improvements on the models’ final performance.

Instruct models – low-latency workhorse

Instruct models today are often somewhat forgotten, but the likes of @aiatmeta Llama 3.1 Instruct and smaller, concise models are some of the most adopted open models of all time. The instruct models we’re building are a major polishing and evolution of the Tülu 3 pipeline – you’ll see many similar datasets and methods, but with pretty much every datapoint and all the training code refreshed.
Olmo 3 Instruct should be a clear upgrade on Llama 3.1 8B, representing the best 7B-scale model from a Western or American company. As scientists we don’t like to condition the quality of our work on its geographic origins, but this is a very real consideration for many enterprises looking to open models as a solution for trusted AI deployments with sensitive data.

Building a thinking model

What people have most likely been waiting for are our thinking or reasoning models, both because every company needs to have a reasoning model in 2025, but also to clearly open the black box on the most recent evolution of language models. Olmo 3 Think, particularly the 32B, are the flagship models of this release, where we considered what would be best for a reasoning model at every stage of training. Extensive effort (ask me IRL for more war stories) went into every stage of the post-training of the Think models. We’re impressed by the magnitude of gains that can be achieved in each stage – neither SFT nor RL is all you need at these intermediate model scales.

First we built an extensive reasoning dataset for supervised finetuning (SFT), called Dolci-Think-SFT, building on very impactful open projects like OpenThoughts3, Nvidia’s Nemotron Post-training, Prime Intellect’s SYNTHETIC-2, and many more open prompt sources we pulled forward from Tülu 3 / OLMo 2. Datasets like this are often some of our most impactful contributions (see the Tülu 3 dataset as an example in Thinking Machines’ Tinker :D @thinkymachines @tinker_api – please add Dolci-Think-SFT too, and Olmo 3 while you’re at it; the architecture is very similar to Qwen, which you have).

For DPO with reasoning, we converged on a very similar method to HuggingFace’s (@huggingface) SmolLM 3, with Qwen3 32B as the chosen model and Qwen3 0.6B as the rejected.  Our intuition is that the delta between the chosen and rejected samples is what the model learns from, rather than the overall quality of the chosen answer alone.
These two models provide a very consistent delta, which yields far stronger gains than expected. The same goes for the Instruct model. It is likely that DPO is helping the model converge on more stable reasoning strategies and softening the post-SFT model, as seen by large gains even on frontier evaluations such as AIME. Our DPO approach was an expansion of Geng, Scott, et al., "The delta learning hypothesis: Preference tuning on weak data can yield strong gains," arXiv preprint arXiv:2507.06187 (2025). Many early open thinking models that were also distilled from larger, open-weight thinking models likely left a meaningful amount of performance on the table by not including this stage.

Finally, we turn to the RL stage. Most of the effort here went into building effective infrastructure for running stable experiments with the long generations of larger language models. This was an incredible team effort to be a small part of, and it reflects work ongoing at many labs right now. Most of the details are in the paper, but our recipe is a mixture of ideas that have already been shown, like ServiceNow’s PipelineRL, and algorithmic innovations like DAPO and Dr. GRPO. We have some new tricks too! Some of the exciting contributions of our RL experiments are 1) what we call “active refilling,” a way of keeping generations flowing to the learner nodes until there’s a full batch of completions with nonzero gradients (discarding groups whose equal advantages would contribute zero gradient) – a major advantage of our asynchronous approach; and 2) cleaning, documenting, decontaminating, mixing, and proving out the large swaths of work done by the community over the last months.

The result is an excellent model that we’re very proud of. It posts very strong reasoning benchmark scores (AIME, GPQA, etc.) while also being stable, quirky, and fun in chat, with excellent instruction following. The 32B range is largely devoid of non-Qwen competition.
The scores for both of our Thinkers get within 1-2 points overall of their respective Qwen3 8B/32B models – we’re proud of this! A very strong 7B-scale, Western thinking model is Nvidia’s (@NVIDIAAI) NVIDIA-Nemotron-Nano-9B-v2 hybrid model. It came out months ago and is extremely strong. I personally suspect any gaps may be due to the hybrid architecture triggering subtle implementation bugs in popular libraries, but who knows. All in, the Olmo 3 Think recipe gives us a lot of excitement for new things to try in 2026.

RL Zero

DeepSeek R1 showed us a way to new post-training recipes for frontier models, starting with RL on the base model rather than a big SFT stage (yes, I know about cold-start SFT and so on, but that’s an implementation detail). We used RL on the base model as a core feedback cycle when developing the model, such as during intermediate midtraining mixing. This is now viewed as a fundamental, largely innate capability of the base model. To facilitate further research on RL Zero, we released 4 datasets and series of checkpoints, showing per-domain RL Zero performance on our 7B model for data mixes focused on math, code, instruction following, and all of them mixed together.

In particular, we’re excited about the future of RL Zero research on Olmo 3 precisely because everything is open. Researchers can study the interaction between the reasoning traces we include at midtraining and the downstream model behavior (qualitative and quantitative). This helps answer questions that have plagued RLVR results on Qwen models, hinting at forms of data contamination, particularly on math and reasoning benchmarks (see Shao, Rulin, et al., "Spurious rewards: Rethinking training signals in RLVR," arXiv preprint arXiv:2506.10947 (2025), or Wu, Mingqi, et al., "Reasoning or memorization? Unreliable results of reinforcement learning due to data contamination," arXiv preprint arXiv:2507.10532 (2025)).
What’s next

This is the biggest project we’ve ever taken on at Ai2 (@allen_ai), with 60+ authors and numerous other support staff. In building and observing the “thinking” and “instruct” models shipping today, it is clear to us that there’s a very wide variety of models that fall into both of these buckets. The way we view it, thinking and instruct characteristics sit on a spectrum, as measured by the number of tokens used per evaluation task. In the future we’re excited to treat this thinking budget as a trade-off, and to build models that serve different use cases based on latency/throughput needs. As for a list of next models or things we’ll build, we can give you the list of things you’d expect from a (becoming) frontier lab: MoEs, better character training, Pareto-efficient instruct vs. think, scale, specialized models we actually use at Ai2 internally, and all the normal things. This is one small step towards what I see as success for my ATOM project.

We thank you for all your support of our work at Ai2. We have a lot of work to do. We’re going to be hunting for top talent at NeurIPS to help us scale up our Olmo team in 2026. This post also appears in full on Interconnects – the full links to the artifacts and paper are below. Moo, moo, rawr!
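The DPO stage the post describes, taking a strong model's outputs as "chosen" and a weak model's as "rejected", can be sketched with the standard DPO objective. The log-probabilities below are toy numbers, and this is a single-example sketch of the loss, not Ai2's training code.

```python
import math

# Minimal single-example DPO loss sketch (Rafailov et al.), in the
# "delta learning" setup: chosen responses come from a strong teacher,
# rejected from a weak one. All log-prob values are toy numbers.

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]):
    # pi_* are summed token log-probs under the policy being trained,
    # ref_* under the frozen reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen response -> small loss
low = dpo_loss(pi_chosen=-45.0, pi_rejected=-80.0,
               ref_chosen=-50.0, ref_rejected=-60.0)
# Policy prefers the rejected response -> larger loss
high = dpo_loss(pi_chosen=-55.0, pi_rejected=-50.0,
                ref_chosen=-50.0, ref_rejected=-60.0)
print(low, high)
```

The intuition in the post maps directly onto the margin term: what the model learns from is the delta between chosen and rejected, so a consistently wide strong/weak gap gives a clean, stable training signal regardless of the chosen answer's absolute quality.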
[images]
99 replies · 359 reposts · 2.2K likes · 500.4K views
Isaac Hodes retweeted
Ai2 @allen_ai
Introducing OlmoEarth 🌍, state-of-the-art AI foundation models paired with ready-to-use open infrastructure to turn Earth data into clear, up-to-date insights within hours—not years.
26 replies · 97 reposts · 546 likes · 679.6K views
Isaac Hodes @ihodes
This is a noteworthy release, really incredible work from the Marin team! If you want to know the gnarly details behind what it takes to pre-train a high-quality 32B foundation LLM, this is the only place to find them.
Percy Liang@percyliang

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

0 replies · 0 reposts · 1 like · 207 views
Isaac Hodes retweeted
Hannes Stark @HannesStaerk
Excited to release BoltzGen which brings SOTA folding performance to binder design! The best part of this project has been collaborating with many leading biologists who tested BoltzGen at an unprecedented scale, showing success on many novel targets and pushing its limits! 🧵..
[image]
18 replies · 268 reposts · 991 likes · 299.5K views
Isaac Hodes @ihodes
@geoffreylitt What I've found most compelling is when I'm using the LLM as a tutor who is reading the book alongside me—one that I feel very comfortable challenging—to question what I'm reading and better understand it. Using it as an interactive summarization tool defeats the purpose.
0 replies · 0 reposts · 0 likes · 49 views
Geoffrey Litt @geoffreylitt
@ihodes Yeah I like that. Generally I think I’m in favor of techniques to engage *more* deeply with books, and very skeptical of ways to “engage less and get the same out”
1 reply · 0 reposts · 2 likes · 365 views
Geoffrey Litt @geoffreylitt
I’m generally not a huge fan of LLM “chat with book” as an idea — I believe you gotta put in real reading time to get the good stuff. But! There are times when it would be incredibly helpful. Eg: “give me parenting advice for this specific situation, based on these 3 books”
14 replies · 3 reposts · 118 likes · 12K views
Sam D'Amico @sdamico
@liminalsunset_ Yeah so Claude 4 seems to consistently check the AWS docs. It only got lost after struggling to debug long enough that the log files blew out its context window.
1 reply · 0 reposts · 2 likes · 143 views
Isaac Hodes @ihodes
Just finished @shreyas's Product Sense course, and I can sincerely say it's the most no-nonsense, logical examination of good product and business creation practices that I've seen. Extremely low bullshit, extremely high signal.
0 replies · 2 reposts · 18 likes · 7.5K views
Sam Whitmore @sjwhitmore
if you’re a woman who is interested in having kids but are worried it’s not fun or are curious to hear how to balance w intense work I will happily talk to you & answer questions!! being a mom is the best thing I ever did
Aella@Aella_Girl

Ok but to be fair it does seem like moms with kids in general are having a pretty bad time, based on the way I hear most parents talk about it. Id like kids someday but I've heard enough from parents to be downright terrified about it and I know it's gonna suck and be awful

4 replies · 0 reposts · 113 likes · 11.4K views
Isaac Hodes @ihodes
+1 The irony of the take that math/CS people are somehow smarter as a rule is that biology is a significantly more challenging domain to make progress in and make money in, with all the downstream effects of that, than 99% of the problems solved by math/CS people in tech companies. If it were so easy and the problems were worth solving (and it's abundantly clear that they are), why wouldn't these superintelligent math/CS people flock to the field, wipe out those bio PhD dummies, take home all those biopharma profits, and also cure cancer? I say this as one of those math/CS people who worked alongside Alex trying to do good things in bio and…having much less success than I would've liked.
2 replies · 1 repost · 13 likes · 1.2K views
alex rubinsteyn @iskander
@Clarksterh @lu_sichu Smarter...how? Like I have the probably unique experience of joining with a lot of smart-by-math-standards people to try to bring our energy and genius to biology and we mostly flopped around uselessly, writing software no one used and solving problems no one had.
3 replies · 1 repost · 50 likes · 2.5K views
alex rubinsteyn @iskander
I think this was part of the implicit premise of the first incarnation of Hammer Lab at Mount Sinai, about a dozen of us with math/CS backgrounds ditched tech for biomedicine. And we got humbled hard: most of what we did flopped & techies don't understand experimental design.
Douglas Yao@DouglasYaoDY

Peter Thiel said that the lack of progress in biology is partially due to a lack of talent. I think this makes sense. Something about biology's non-technical nature + people's inability to tinker w/biology outside of a lab/PhD make the smartest people select other fields.

15 replies · 73 reposts · 706 likes · 94.5K views
Michael Nielsen @michael_nielsen
@dwarkesh_sp I asked ChatGPT to make me one for emacs, and it did. Works well, very easy to adapt!
2 replies · 0 reposts · 8 likes · 1.3K views
Dwarkesh Patel @dwarkesh_sp
Has someone built a good open source UI/scaffold which allows an LLM to leave in-line Google Docs style comments on your text?
25 replies · 10 reposts · 258 likes · 36K views
Isaac Hodes @ihodes
Great thread (and thread of threads) on Apple's mass manufacturing using CNCs, and prototyping. x.com/gak_pdx/status…
Greg Koenig@gak_pdx

This is not true. The machines in Apple's model studio are very different from the production machines used by Catcher and Foxconn for mass-scale production.

Fun story: the iPhone 6 bending debacle forced Apple to quickly re-evaluate using a 7000-series aluminum in place of their standard 6061. They had (of course) experimented with this long ago, but cost and finishing concerns - 7075 is harder to anodize, and finishing is the *hardest* thing Apple does - drove them to use a slightly tweaked version of industry-standard 6061.

Catcher and Foxconn both thought this would be a nice opportunity to milk some more margin, so they told Apple they could do 7075, but the cycle time was almost 2x, so the price would go up substantially. One of the machinists thought this sounded like bullshit, so Apple called the US distributor for the machines that Catcher and Foxconn use, pulled the identical tools they were running in prototype (in new BT30 holders), and ran back-to-back tests to prove out the cycle time and tool life on the same machines their vendors were using. Until this point, they did all their own prototyping and process development, but it was up to the vendors to use the competitive fight for more orders to figure out how to get the cycle time down; Apple never got very specific as to the process - just the quality and price.

Well, lo and behold - 7075 had the same cycle time as 6061. The vendors were bullshitting them to the tune of tens of millions of dollars. Apple showed them the results of their testing and told them to pound sand with the price increase. Ever since, Apple has run a full-scale production lab (separate from Ive's model shop) to validate production methods and guide Apple's design and procurement decisions.

Apple's prototype shops run very high-end, hyper-accurate multi-axis machines that are ideal for rapidly turning concept sketches into machined parts.
If you know CNC stuff - they are a big Hermle and Willemin-Macodel shop, but they have whatever gear they want. Other machine shops at Apple support engineering efforts running similar equipment. The production validation lab, however, uses the same gear that is used in Asia, and evaluates precisely what Apple can expect their suppliers to be doing when they move to full-scale production. These are similar, but very different, machines from the ones the model/engineering shops run - lower cost, faster, more fussy to set up, but designed to absolutely bang out lots of parts very, very rapidly.

0 replies · 0 reposts · 0 likes · 186 views