augustus odena

291 posts

@gstsdn

Something new. Previously: AI research at TBD Labs / Meta; cofounder at @AdeptAILabs; Invented Scratchpad / Chain-of-Thought; Google Brain

Joined October 2015
3K Following · 12.2K Followers
Pinned Tweet
augustus odena@gstsdn·
Yesterday I resigned from TBD Labs / @metaai. I wasn't there for very long, but I think I got a few useful things done! It's an impressive group of people and it's especially impressive that it got assembled as quickly as it did with such a high talent bar at a large company. "Founder Mode" is real and good. I will certainly miss my coworkers there. I think now is an unusually high-leverage time to pursue ambitious new projects at the intersection of AI and other technologies. Please reach out to me if you're interested in that sort of thing, and I expect I will have something more detailed to share in not-too-long.
25 replies · 6 reposts · 631 likes · 281.3K views
augustus odena retweeted
bucket of kets@bucketofkets·
I’m actually not sure why Covid didn’t already black-pill people on this. We had an essentially working prototype of the Covid vaccine within 2 days of sequencing the virus. It took nearly a year for general availability after that (which was considered historically and unusually fast!)
Séb Krier@sebkrier

This is wild. theaustralian.com.au/business/techn…

1 reply · 2 reposts · 14 likes · 1.7K views
augustus odena retweeted
Alec Stapp@AlecStapp·
This is a really astonishing claim: Students in Mississippi & Louisiana score higher on reading tests than students in California & New York despite spending way less money per pupil and having higher child poverty rates. Decided to double check the data because, if true, this should be alarming for blue state leaders. And yup, it checks out.

Reading performance (NAEP 2024, Grade 4 reading, average scale score):
Mississippi: 219
Louisiana: 216
New York: 215
California: 212

Child poverty (SAIPE, "estimated percent of people age 0–17 in poverty," 2023):
Louisiana: 25.2%
Mississippi: 24.3%
New York: 18.6%
California: 15.0%

Per-pupil spending (public K–12 "current expenditures per pupil," FY2023, inflation-adjusted to FY2023 dollars):
New York: $29,588
California: $18,568
Louisiana: $14,822
Mississippi: $12,238

It should be unacceptable to spend that much more taxpayer money while delivering worse results for students.
Nicholas Bagley@nicholas_bagley

@ProfSchleich @dbroockman @j_kalla If Democrats want to stay relevant, and to deliver for the public, they cannot wait for unions to change. They need to break more often with their friends. nytimes.com/2026/02/23/opi…

407 replies · 1.7K reposts · 8.6K likes · 845.1K views
augustus odena@gstsdn·
Re: high cost of training neural nets, it’s commonly brought up that humans had to undergo evolution, which is very costly. This is a bad argument - you couldn’t train a modern LLM without a large pretraining corpus, which was also “generated by evolution” in the same sense.
4 replies · 1 repost · 28 likes · 2.8K views
augustus odena@gstsdn·
One of the best accounts on here IMO.
Loquitur Ponte Sublicio@loquitur_ponte

One thing that's become clear to me over time is that a certain proportion of Americans would rather be European. It's not that they're wrong per se about the tradeoffs but that to them the other side feels better. Sanders' response captures this: he rejects the premise of prosperity on basic terms because they have x y z protections. The floor simply matters more.

I don't think however that this is limited to leftists yearning for European healthcare. There is an acute philia on the right for a sort of idealized Europe, not the EU and open borders but for the sort of clearly delineated nation state, a place where my people live, have lived, and always will live. And it is not just the political arrangement but also the village - my grave in the family plot - and the tradition - our culture passed down - that is a decidedly European yearning.

I would suggest that all these tendencies on the left and right are not so different from each other. They are all a desire for a more certain and assured place, less to gain perhaps but less to prove. You may not transform the world (did it need to be transformed anyway?) or elevate your family's position (don't you want to carry on their honest rooted legacy?) but you will not have to carve a place. You have a place. You are valued for you. You will have healthcare because we care about your health, a reserved spot in a town where you belong, a role which if you fill half decently is assured to you, food on your plate. A place for every man.

These are not unnatural desires but I cannot stress enough that we do not do that here. We are the country of people who left their place, braved uncertainty, restarted their family legacy from scratch, drove off natives and the wild to create a spot and then generation after generation drove on further to do it, or cast roots in merchant communities and took back to the sea anew. This is a country where people made places. That is our culture more assuredly than any food or song or church.

And that has faded bit by bit along the way. The ancestors forgot it, mean reversion of culture if you will. They found their yards idyllic and their extended families helpful. I will even throw the nativists a bone and say that it is not clear to me that later migrants shared the same experience. Did the 1899 arrival see the country as an ordeal to surmount or as a country much like theirs but richer and with more jobs, a place where a Palatine with higher GDP might rise on the Ohio? A mix I am sure, but differences I expect.

But one way or another here we are. The through line on the demands of left and right is an assured place where you belong, where you don't have to compete, where being you is assurance of being valued and cared for and being a part, an appreciated part, of the place where you belong. Where not more is asked. Where we do not prioritize ambition over certainty. An old country ashore the new. That's the vision on which a verdict must be passed.

1 reply · 0 reposts · 8 likes · 2K views
augustus odena retweeted
"Leigh Marie" Braswell@LM_Braswell·
Really enjoyed chatting with @LauraDeming (@untillabs) and @jacobkimmel (@newlimit) on a longevity-themed Builders. They are founders of companies with complementary strategies: Laura is pausing time through cryo, while Jacob is resetting aging through epigenetic reprogramming. Full episode link below!
7 replies · 14 reposts · 166 likes · 19.4K views
augustus odena retweeted
Dean W. Ball@deanwball·
Here's a good one from earlier today: A friend of mine works for a company that purchases a particular kind of common small business across America. I don't want to say what the industry is, but let's just say that the underlying industry is among the least susceptible to automation. This actually doesn't matter much for my purposes though, since I want to analyze my friend's firm, an investment company that on paper may seem much more automation-friendly.

Ok so the target companies, the ones my friend's company buys: Like most small businesses, the owners are often prideful. Like many small businesses, especially those owned and operated by sole proprietors, the line between business and personal finances can blur. For both of these reasons, and others besides, the owners of these businesses can be mistrustful, confused, embarrassed, or angry when my friend's company approaches them about a sale.

A large fraction of what my friend's company does is manage these emotions. Even interpreting the data can, at times, be a deeply sensitive issue for the owner, requiring careful outreach from the person at my friend's firm whose relationship with the small business owner is the strongest.

The financial and accounting practices of these companies are remarkably variegated across the country, despite the underlying business itself being quite homogenous. Some of the employees at my friend's company spend their time structuring this financial data and building models.

Of course building financial models of arbitrary complexity will be automated. And of course agents will bring the best of contemporary data science to bear on the problem of structuring all this haphazardly collected financial data (think: invoices scanned in dim light on a 2012 iPhone at a 73-degree angle). The latter may take some serious time and effort to diffuse. My friend's company has established processes for doing the latter, and unlike financial modeling, it is not a standardized process subject to easy LLM ingestion. His company will need a high bar before AI replaces or automates these processes, and even optimally prompted AI would probably not exceed the success rate of his firm at these tasks today (btw, when his firm tries AI on these tasks, it will not be ideally prompted). Even this, however, I acknowledge will get automated eventually, though probably it takes longer than some may think.

But what about that relationship management? This is what a large fraction of employees in this firm already do! The financial modeling is not row-crop-agriculture-level automated, but the firm isn't run by morons, and neither is the financial services industry as a whole. *They have used software already to automate large fractions of their automation-prone activities.* So not only will a large portion of those jobs not be automated anytime soon, but, if you endogenize the effect of AI on the labor market, it seems safe to assume that human jobs will shift, along the margin, toward tasks that involve this sort of relationship management, social capital, etc.

This is just one example from a quick discussion I had with a friend today, so not selected for memetic fitness or robustness to attacks.
2 replies · 2 reposts · 44 likes · 6.6K views
augustus odena@gstsdn·
Another free research idea: 100% neural RAG.

The way you do RAG right now is generally to run some model over a bunch of documents independently to generate embeddings. Then when you have some input to your system that ought to trigger retrieval you use the same embedding model on your input and do a nearest neighbor search. This has at least two issues:

1) You generally have to run your input through a heuristic query-construction layer, which can hurt performance.

2) If you have two documents that embed very close to each other but which you don't actually want to confuse, it's hard for a normal RAG system to distinguish them - you can use re-rankers for this, but IMO that's pretty ugly.

One thing we could do instead is to build on the Cartridges idea (arxiv.org/abs/2506.06266), but instead of training the KV cache to be able to answer arbitrary questions about a corpus, we could instead train it to emit document IDs directly. That is, we'd generate a bunch of synthetic questions given a corpus, and each synthetic question would be paired with the document IDs you'd require to answer that question (which we'd know in advance because we'd have generated the questions conditional on document IDs).

This would address both of the problems from above - it would be easy for the model to learn to distinguish documents, and the system is trained to emit document IDs directly, so nobody would ever have to make a decision about which part of the input to use to run retrieval.

The paper reported something like a 40x compression rate, so even the original cartridges idea could in theory represent 40M token corpora directly with a 1M token KV cache. We'd presumably be able to do better, because we'd only need to store the implicit mapping of question -> doc ID rather than the full mapping of question -> answer in our modified KV cache. Moreover, because it seems like you can compose multiple cartridges together, you could add new docs (or groups of docs at least) without a full re-train of the cartridge.

The closest thing I could find to this in the literature was DSI (arxiv.org/pdf/2202.06991) but AFAIK people aren't using this in practice. Maybe the composability from using cartridges would finally make this practical?
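[Editor's sketch] To make the training objective concrete, here is a minimal PyTorch sketch of the doc-ID variant, assuming a HuggingFace-style causal LM. It stands in for the trained KV cache with a trainable soft-prompt prefix (a simplification; the actual Cartridges method trains the KV cache itself), and it assumes the synthetic (question, doc ID) pairs already exist. Every name here (N_PREFIX, loss_for_pair, the DOC-#### ID format) is illustrative, not from the paper.

```python
# Sketch: train a "cartridge" so a frozen LM emits doc IDs for a query.
# The cartridge is approximated by a soft-prompt prefix; IDs are plain tokens.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works for the sketch
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
lm.requires_grad_(False)  # only the cartridge is trained

emb = lm.get_input_embeddings()
N_PREFIX = 64  # cartridge length in tokens
prefix = torch.nn.Parameter(emb.weight[:N_PREFIX].detach().clone())

def loss_for_pair(question: str, doc_id: str) -> torch.Tensor:
    """Cross-entropy on the doc-ID tokens, conditioned on cartridge + question."""
    q_ids = tok(question + " -> ", return_tensors="pt").input_ids
    a_ids = tok(doc_id, return_tensors="pt").input_ids
    inputs = torch.cat([prefix.unsqueeze(0), emb(q_ids), emb(a_ids)], dim=1)
    logits = lm(inputs_embeds=inputs).logits
    n_ctx = N_PREFIX + q_ids.shape[1]
    # each answer token is predicted from the position just before it
    ans_logits = logits[:, n_ctx - 1 : n_ctx - 1 + a_ids.shape[1], :]
    return F.cross_entropy(ans_logits.reshape(-1, ans_logits.size(-1)),
                           a_ids.reshape(-1))

# Synthetic (question, doc-ID) pairs. In practice an LLM would generate the
# questions conditional on each document, so labels are known by construction.
pairs = [("Which doc covers KV-cache compression?", "DOC-0042"),
         ("Which doc defines the retrieval API?", "DOC-0007")]

opt = torch.optim.Adam([prefix], lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = sum(loss_for_pair(q, d) for q, d in pairs)
    loss.backward()
    opt.step()
```

At inference you would prepend the trained prefix, feed the raw input, and decode doc IDs greedily - no query-construction layer and no nearest-neighbor index.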
7 replies · 13 reposts · 188 likes · 16.1K views
augustus odena@gstsdn·
Here's a free research idea - I don't know if it's any good.

I liked this SDPO paper. It made me think about doing RL in environments where feedback is costly or rare. In those environments it might be wasteful to only take one gradient update. When a human gets a specific bit of contextual feedback on a mistake, they don't just update their policy so that they're marginally less likely to make the mistake - they try to update their policy so that they wouldn't make that mistake in the same context.

You could try to solve this problem by taking more than one gradient update on the pi(y|x, f) logits, but what's your stopping condition? In the ideal case you'd just keep trying and collecting more feedback, but we've already specified that feedback is expensive, and if this were cheap enough to do you may as well keep the original algorithm.

If you could have another model (or the same model?) judge whether the feedback has been adequately addressed, you could do the following (sketched in code below):

1. Sample y_i from pi(.|x)
2. Ask another model G(y_i, x, f) whether y_i represents a sufficiently big change from y_0 (the first sample) that the feedback no longer applies.
3. If not yet, update using distillation from pi(y_i|x, f) and go back to the first step.
4. End.

This feels kind of like what I would do in real life? I expect there's a bunch of interesting stuff adjacent to this in idea space.
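[Editor's sketch] A skeletal version of that loop in Python. All three callables are hypothetical stand-ins - sampling from pi(.|x), the judge G, and one distillation step toward pi(.|x, f) - none of this is from the SDPO paper itself.

```python
from typing import Callable

def feedback_loop(
    sample: Callable[[str], str],                     # y ~ pi(.|x)
    judge: Callable[[str, str, str, str], bool],      # G(y_i, y_0, x, f)
    distill_update: Callable[[str, str, str], None],  # step toward pi(.|x, f)
    x: str,
    f: str,
    max_steps: int = 8,
) -> str:
    """Keep distilling until the judge says feedback f no longer applies."""
    y_0 = sample(x)  # the original response that earned feedback f
    y_i = y_0
    for _ in range(max_steps):
        y_i = sample(x)
        if judge(y_i, y_0, x, f):
            return y_i  # sufficiently changed; stop spending updates
        distill_update(y_i, x, f)  # pull pi(.|x) toward pi(y_i|x, f)
    return y_i
```

The judge replaces a fixed gradient-step count with a semantic stopping condition, which is the point: spend extra compute rather than extra feedback.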
6 replies · 7 reposts · 146 likes · 10.6K views
augustus odena retweeted
Mark Goldstein@marikgoldstein·
all of us, defending quadratic attention
10 replies · 34 reposts · 784 likes · 56.4K views
augustus odena@gstsdn·
That LLMs have serious deficiencies compared to human brains is becoming more mainstream as a belief, but a lot of people still regard this objection as "merely aesthetic". I think it has serious commercial implications! The world in which there are many companies making lots of money selling RL environments for coding agents is very different than the world in which the agents just figured it out by reading a few books. Near term impacts of LLMs still probably under-appreciated though...
1 reply · 1 repost · 31 likes · 7.3K views
augustus odena@gstsdn·
In some sense I'm more impressed by self driving cars than by coding agents, even though coding agents feel more "AGI-adjacent". Coding sort of happened to be a thing that LLMs were useful for almost by default, and we iterated from that default to the existing tools in largely a bottom-up way. Self driving cars were a thing that we've wanted for decades that we had to work backwards to create. Not totally fair of course, since self-driving cars were blocked waiting on neural networks...
3 replies · 0 reposts · 30 likes · 3.2K views
augustus odena retweeted
roon@tszzl·
@eudaemonea youre still a believer in “winning the race”? in 2026 it’s marginal isn’t it
25 replies · 16 reposts · 625 likes · 40K views
augustus odena@gstsdn·
I'm working with an extraordinary group of people on a big new project. We're looking for more extraordinary people to join us. If the project succeeds, it will have a dramatic effect on the physical world. Reach out if you're interested, or know someone who might be.
24 replies · 7 reposts · 207 likes · 14.2K views
augustus odena retweeted
Josh McGrath@j_mcgraph·
I’m sorry for your loss (spikes)
0 replies · 1 repost · 21 likes · 2.6K views
augustus odena retweeted
Jessy Lin@realJessyLin·
great post, and I generally find this way of reasoning about "limit cases" and things that should be true in principle to be really valuable for thinking about what approaches to "memory" and continual learning make sense in the long term (out of a huge and heterogenous design space!)

> repeated data: when humans see the same piece of experience over and over again, we eventually stop updating -> what kind of update algorithm would make this true?

> integration into existing concepts: if someone tells you they're from Michigan, your representation of Michigan should also change -> what kind of representation/parameterization would make this true?
augustus odena@gstsdn

[Quoted tweet: augustus odena's continual-learning post, reproduced in full further down the page.]

1 reply · 1 repost · 12 likes · 3.6K views
augustus odena@gstsdn·
Yeah, good point. Though this is maybe the thing I'm most pessimistic about addressing "in paradigm". But yeah, part of the reason it doesn't work to simply continue training on new data is that you would need too many new pieces of data to actually learn what you want, and indeed that might contribute to catastrophic forgetting.
0 replies · 0 reposts · 5 likes · 621 views
augustus odena retweeted
Sabri Eyuboglu@EyubogluSabri·
@gstsdn Another issue/desideratum I'd add to the cluster is (e) the efficiency with which the model learns new information. This connects to the very active area of work on new architectures: SSMs, linear attention, and most recently "meta-learning" architectures like Titans/TTT
3 replies · 1 repost · 14 likes · 1.5K views
augustus odena@gstsdn·
I have a bunch of thoughts about continual learning and nothing to do with them (I'm working on something else) so I figured I'd just turn them into a post:

First: I think people use "continual learning" to point at a cluster of issues that are related but distinct. I'll list the issues and then speculate about what might fix them.

a) Catastrophic Forgetting: If you train on a distribution D_1 and then do SFT on another distribution D_2, you'll often find that your performance on D_1 degrades. The extent of this issue is maybe overstated and is more true for SFT than for RL, but it's still real. There's also an important limit case that IMO is a "smell" for the way we train models currently: repeated data can seriously harm model performance. Humans don't have this problem - they eventually just stop updating on redundant information.

b) No integration of new knowledge into existing concepts: If I tell you that I'm from Michigan, you will update your representation of me to include that fact, but you will also change your representation of Michigan. Michigan becomes "a place where someone I know is from". If people ask you questions about Michigan in the future, you may answer those questions with this knowledge in mind. If I tell a chatbot that I'm from Michigan, that fact may get stored in a memory file about me, but it won't affect the model's representation of Michigan.

c) No consolidation from short-term memory to long-term memory: Models are good at accumulating information in context up to a point, but then they run out of context (or effective context) and performance degrades. They are missing a mechanism for deciding what's important to retain and then taking action to retain it.

d) No notion of timeliness: When you tell a human something, they also retain *when* they learned it, and that "time tag" becomes part of the representation. Humans experience a stream of facts unfolding through time. As a result we form an implicit model of history/causality. Many people can answer "who is the current Pope?" without doing a special search step.

Now that we've enumerated the issues, we can think about solutions. In AI it's always worth asking why the simplest solution can't work. The very simplest thing to try is what chatbots currently do: maintain a text file of memories. IMO it's obvious why this is unsatisfying relative to what humans are doing, so I won't dwell on it. I expect there are many refinements you could make here around learning to manually manage the text file, but I also expect these approaches to be brittle.

A slightly smarter thing that's still pretty simple is to just keep updating the model during deployment. I actually do think that something like this could work OK, but we probably need a few tweaks. Some combination of the following seems worth pursuing:

1. Sparser updates: Catastrophic forgetting is plausibly worsened by updating all parameters at once. I'd bet either selective parameter updates or making the models themselves sparser could help a lot here. @realJessyLin has some nice work here.

2. Update only on surprising data: Updating on every new datapoint feels wrong. We want a mechanism that decides what's important/surprising and only updates on that subset. A crude version: automatically generate questions about a datapoint and only update if the model fails to answer them (see the first sketch after this post). The hippocampus also has interesting mechanisms for doing this that seem worth trying to emulate.

3. Don't train on the raw datapoint w/ the standard objective: Given that we've decided a datapoint is surprising, I don't think we should just train on it using the standard objective. We may want to automatically generate questions about a given corpus and train on the answers (as in e.g. the Cartridges work) and we may also want to modify the objective. One option is to do prompt distillation with the facts in context - the intuition being that the consolidated model ought to answer the question as though it has the facts on hand (see the second sketch below).

These are "in-paradigm" approaches compatible with LLMs. I bet they'll yield real progress, but I'm also starting to suspect something less in-paradigm may be needed for a really satisfying solution. That's for a different post though.
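[Editor's sketches] Items 2 and 3 are easy to make concrete. First, a minimal sketch of the surprise gate from item 2, with hypothetical helpers: gen_probe_questions (an LLM call that writes questions answerable from the datapoint), answers_correctly (grades the current model's answer against the datapoint), and train_step (one ordinary update). None of these names come from existing work.

```python
from typing import Callable, List

def maybe_update(
    datapoint: str,
    gen_probe_questions: Callable[[str], List[str]],
    answers_correctly: Callable[[str], bool],  # does the model already know this?
    train_step: Callable[[str], None],
    fail_threshold: float = 0.5,
) -> bool:
    """Consolidate `datapoint` only if the model flunks enough probe questions."""
    questions = gen_probe_questions(datapoint)
    failures = sum(not answers_correctly(q) for q in questions)
    if questions and failures / len(questions) >= fail_threshold:
        train_step(datapoint)  # surprising: worth spending an update on
        return True
    return False  # redundant: skipping also mitigates the repeated-data problem
```

And a sketch of the prompt-distillation objective from item 3, assuming a HuggingFace-style causal LM: the teacher pass sees facts + question, the student pass sees only the question, and we match the student's next-token distribution to the teacher's. In practice the teacher would be a frozen snapshot of the model rather than the live weights.

```python
import torch
import torch.nn.functional as F

def prompt_distillation_loss(model, teacher, tok, facts: str, question: str):
    """KL(teacher-with-facts || student-without-facts) at the answer position."""
    with torch.no_grad():
        t_ids = tok(facts + "\n" + question, return_tensors="pt").input_ids
        t_logits = teacher(t_ids).logits[:, -1, :]  # facts in context
    s_ids = tok(question, return_tensors="pt").input_ids
    s_logits = model(s_ids).logits[:, -1, :]        # facts consolidated?
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```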
11 replies · 14 reposts · 179 likes · 18.1K views