Nate Busa

4.4K posts


@natbusa

Director of AI and Automation at Neom | AI @Stanford | CTO Program @Wharton | Technology Strategy, Innovation, Product Development

Planet Earth · Joined February 2007
523 Following · 721 Followers

Rich Raposa @richraposa
Two years later! I finally got the Wikimedia events in realtime into @ClickHouseDB - made possible in v26.2 with the new input_format_max_block_wait_ms setting: pastila.nl/?05077155/9617…
English · 1 reply · 2 reposts · 6 likes · 155 views
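The post doesn't show the pipeline itself, but the shape of the data is easy to sketch. The field names below follow the public Wikimedia EventStreams `recentchange` schema; the sample values and the `event_to_row` helper are illustrative assumptions, as is the idea of flattening each event into a tuple for a ClickHouse INSERT.

```python
import json

# One server-sent event payload from the Wikimedia EventStreams API
# (stream.wikimedia.org/v2/stream/recentchange). Field names follow the
# public recentchange schema; the values here are made up.
sample_event = json.dumps({
    "wiki": "enwiki",
    "type": "edit",
    "title": "ClickHouse",
    "user": "ExampleUser",
    "timestamp": 1739900000,
})

def event_to_row(data: str) -> tuple:
    """Flatten one recentchange event into a tuple, the shape a
    ClickHouse client INSERT would typically take."""
    e = json.loads(data)
    return (e["wiki"], e["type"], e["title"], e["user"], e["timestamp"])

print(event_to_row(sample_event))
```

In a real ingest loop you would read these payloads off the SSE stream and batch them into inserts; low-latency flushing of small batches is what the setting mentioned in the post is about.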
Nate Busa @natbusa
@deepak_rav @deedydas That is definitely an upper bound. What we do not know, and what should be proved, is whether that is also the minimum ...
English · 0 replies · 0 reposts · 0 likes · 39 views

Deedy @deedydas
The hardest high school math exam in the world, the 6-problem, 9-hour IMO 2025, took place this week. AI models performed poorly. Gemini 2.5 Pro scored the highest, just 13/42, costing $431.97 in a best-of-32 eval. The bronze cutoff was 19. There is a long way to go for AI to solve hard math.
[image]
English · 68 replies · 73 reposts · 586 likes · 212.3K views

Nate Busa @natbusa
Why do all 2025 AI chat apps feel like a Zork I adventure? @karpathy
English · 0 replies · 0 reposts · 0 likes · 43 views

Nate Busa @natbusa
@_jasonwei Your post is itself a good example of a novel dataset for CoT. More generally, I have the feeling we are underutilizing the text written in scientific and philosophy papers, as next-token rewards fail to fully extract/learn the reasoning described in their text.
English · 0 replies · 0 reposts · 1 like · 607 views

Jason Wei @_jasonwei
An underrated but occasionally make-or-break skill in AI research (that didn't really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day, when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10.

Nowadays language models are so multi-task that the answer to whether something works is almost always "it depends on the dataset". A common example of this is the question, "on what datasets does chain of thought improve performance?" A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals.

Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference: the obvious examples are math and coding, but they include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.

As another made-up example, let's say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn't improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge; if it wasn't, then this was the wrong dataset to try browsing on. In other words, you should hesitate to draw a conclusion like "X method doesn't work" without ensuring that the dataset used for testing actually exercises that method.
The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.
English · 22 replies · 69 reposts · 694 likes · 83.2K views
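The poem example above is a clean instance of asymmetry of verification: checking a draft against the constraints is trivial even though producing a valid poem in a single shot is not. A toy sketch of the cheap side of that asymmetry (the constraint list and `satisfies_constraints` helper are made up for illustration):

```python
def satisfies_constraints(poem: str) -> list:
    """Return the list of violated constraints; verification is cheap
    even when generation is hard (asymmetry of verification)."""
    lines = poem.strip().splitlines()
    failures = []
    if len(lines) != 2:
        failures.append("must be exactly 2 lines")
    if "moon" not in poem.lower():
        failures.append("must mention the moon")
    if any(len(line.split()) > 6 for line in lines):
        failures.append("each line must have at most 6 words")
    return failures

draft = "The moon hums over quiet water\nand I forget to sleep"
print(satisfies_constraints(draft))  # an empty list: all constraints met
```

A draft-and-revise loop would feed the failure list back to the generator until it comes back empty, which is exactly the kind of task where CoT pays off.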
Nate Busa @natbusa
I’ve changed the name I go by for the fourth time in my life, driven by a heartfelt need to connect better with people across different nationalities and interpersonal contexts. It feels right and fits my experiences and identity. Hello, world — Nate Busa!
English · 0 replies · 0 reposts · 0 likes · 63 views

Nate Busa @natbusa
@kellerjordan0 I remember @karpathy mentioning that in the future we would run GPT3 models just for fun as we have done with MNIST. Seems that the future is already here. It took less than a year (and very talented people 🙏)
English · 0 replies · 0 reposts · 0 likes · 314 views

Keller Jordan @kellerjordan0
I enjoy getting NanoGPT training speed records. I'm also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize its installation time. Currently it's 537 lines of code, and installs+runs in 20 minutes on a fresh 8xH100. That means the cost of a new record attempt is about $8.

I've enjoyed seeing the records that other people have gotten. @vyasnikhil96 got a new sample-efficiency record using the SOAP optimizer, and I understand he's currently working on reducing its overhead so that it can potentially compete with Muon on wallclock time in the future. @bozavlado discovered that Muon works better if the QKV weights are orthogonalized separately. And @Grad62304977 improved the record significantly using a wide range of architectural modernizations, including QK-norm. I was surprised to see that QK-norm, which from what I understand was invented to deal with instabilities that appear at large scale, also helps train faster at the small scale.

I've seen some more interesting new ideas for the speedrun being posted recently, and I'd like to encourage the researchers who came up with those ideas to also be the ones to try them out empirically. I think this makes the benchmark more reliable, if the empirical experiments are distributed across the community, rather than only me doing them.

I'm interested in two kinds of new results around this speedrun. First, of course, I'm interested in new records that improve the time to 3.28 val loss. The only rule is that you can't use external data besides Fineweb10B, and you can't use pretrained models. Beyond that, everything is fair game. Second, I'm interested in new trainings that match the current record, while being simpler.
For example, if it can be shown that we can match the current record using standard AdamW instead of the Muon optimizer, then I think that would be a very interesting result.

The log file produced by the current speedrun contains not just the timing and final loss, but also a copy of the code used to produce the run. Therefore, the only thing I or anyone else need in order to verify and reproduce a new record is its log file.

Researchers have pointed out that we shouldn't uncritically trust every result which is obtained at the 124M-parameter scale. I absolutely agree - we shouldn't blindly expect results to scale up. However, I still believe it's valuable for the community to at least have one stable small-scale benchmark. Once an idea has been clearly proven to work at small scale, it becomes relatively simple to test it at a larger scale. I think this is a better situation than the current status quo, where every LM training paper seems to use a different benchmark, making it challenging for the community to evaluate new ideas.

The only exception to this evaluation system would be ideas that only work at large scale, and so can't be demonstrated in a small-scale benchmark. These do exist, but I believe they are less common in the recent literature than ideas which are also supposed to work at the 124M-parameter scale, which we should be able to efficiently evaluate using a stable and competitive small-scale benchmark.

If the interest in this benchmark stays strong, I am hopeful that some very interesting things can come out of it. Thanks for your interest, Keller
English · 14 replies · 36 reposts · 498 likes · 249.5K views
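The $8 figure above follows directly from the runtime and node size; here is a quick back-of-the-envelope check, where the ~$3/GPU-hour H100 rate is an assumed cloud price, not something stated in the post:

```python
# Back-of-the-envelope cost of one NanoGPT speedrun record attempt.
gpus = 8                 # a fresh 8xH100 node, per the post
usd_per_gpu_hour = 3.0   # assumed on-demand H100 rate (not from the post)
runtime_hours = 20 / 60  # install + run takes about 20 minutes

cost = gpus * usd_per_gpu_hour * runtime_hours
print(f"${cost:.2f}")    # prints $8.00
```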
Nate Busa @natbusa
@sama Plus study. That never hurts.
English · 0 replies · 0 reposts · 0 likes · 16 views

Sam Altman @sama
the best way to get good at something is usually to just practice actually doing the thing in question. a lot of very capable people outsmart themselves with complex plans that involve working a lot on fake prerequisites.
English · 1K replies · 3.6K reposts · 29.7K likes · 1.7M views

Nate Busa @natbusa
@enithka Normally I use 4 to 5 passes, alternating hints, personal edits, corrections, and reviews. This includes grammatical corrections, critique, and my own reflection. I find it nearly impossible to determine who/what wrote what.
Dutch · 1 reply · 0 reposts · 1 like · 31 views

enith vlooswijk @enithka
Wonderful. I strongly suspect a student used AI for her text, but different AI detectors give different results. Last year detection was still fairly easy; now the risk of a false positive seems too high to accuse anyone.
Dutch · 30 replies · 1 repost · 33 likes · 18.8K views

Ricky Gervais @rickygervais
When life gets you down and you feel unloved and worthless, just remember you’ll be dead one day. Have a great weekend.
[image]
English · 5.1K replies · 24.6K reposts · 158.7K likes · 86.7M views

Nate Busa @natbusa
@deedydas I have read the article and looked into the code. What I don't understand is why not report the improved GPT-4 numbers with MCTSr, if the technique actually works, at least for a subset of the examples? It's a prompt-only technique...
English · 1 reply · 0 reposts · 0 likes · 670 views

Deedy @deedydas
It's finally here. Q* rings true. Tiny LLMs are as good at math as a frontier model. By using the same techniques Google used to solve Go (MCTS and backprop), Llama8B gets 96.7% on the math benchmark GSM8K! That's better than GPT-4, Claude, and Gemini, with 200x fewer parameters!
[image]
English · 38 replies · 303 reposts · 2.6K likes · 803.5K views

Nate Busa @natbusa
@garykubiak @OpenAI GPT-4 does somewhat better: sloppy initially, but it can correct itself by running Python and verifying step by step.
[image]
English · 0 replies · 0 reposts · 5 likes · 628 views

Gary Kubiak @garykubiak
@OpenAI But can it tell me where the letter r is in blueberry? 🙏🏻
[image]
English · 27 replies · 5 reposts · 453 likes · 46.6K views
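Character counting is a sub-token task that chat models famously fumble, which is why a model that can call a Python tool sidesteps it; the question itself is a two-liner:

```python
word = "blueberry"
# 1-based positions of every 'r' in the word
positions = [i + 1 for i, c in enumerate(word) if c == "r"]
print(len(positions), positions)  # prints: 2 [7, 8]
```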
OpenAI @OpenAI
Our new GPT-4 Turbo is now available to paid ChatGPT users. We’ve improved capabilities in writing, math, logical reasoning, and coding. Source: github.com/openai/simple-…
[image]
English · 618 replies · 1.2K reposts · 6.8K likes · 6.4M views

Nate Busa @natbusa
@chrisalbon Using a library is far simpler than building a library. Library making is a craft on its own. There will be dragons.
English · 0 replies · 0 reposts · 0 likes · 59 views

Nate Busa @natbusa
@Emil_BB @fchollet Think, for instance, of the implications of an "intelligent assistant/companion" for each person in the world. How is this going to affect us?
English · 0 replies · 0 reposts · 0 likes · 25 views

François Chollet @fchollet
In 2033 it will seem utterly baffling how a bunch of tech folks lost their minds over text generators in 2023 -- like reading about Eliza or Minsky's 1970 quote about achieving human-level general intelligence by 1975
English · 73 replies · 314 reposts · 2.4K likes · 536.5K views

Nate Busa @natbusa
@Emil_BB @fchollet Yes, agreed. I guess I should have said "until now". In short, I am not in the doom camp, but I do believe the changes to society will be significant.
English · 0 replies · 0 reposts · 0 likes · 23 views

Nate Busa @natbusa
@fchollet I disagree. Given enough context, truthfulness can be computed, or be unknowable, for machine and person alike. It's only a matter of time before machines are more factually correct than people.
English · 0 replies · 0 reposts · 0 likes · 40 views

Yannick Scholich @YannickScholich
Change my mind: AGI is a high-parameter LLM with long-term memory, short-term memory, looped evaluation chains, semi-variable "VIP params", and real-time API access.
English · 1 reply · 0 reposts · 1 like · 68 views