Nate Busa

4.4K posts


@natbusa

Director of AI and Automation at Neom | AI @Stanford | CTO Program @Wharton | Technology Strategy, Innovation, Product Development

Planet Earth · Joined February 2007
523 Following · 721 Followers

Rich Raposa @richraposa
Two years later! I finally got the Wikimedia events in realtime into @ClickHouseDB - made possible in v26.2 with the new input_format_max_block_wait_ms setting: pastila.nl/?05077155/9617…
English · 1 reply · 2 reposts · 6 likes · 155 views
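The post doesn't show the pipeline itself, but the shape of the data is easy to sketch. The field names below follow the public Wikimedia EventStreams `recentchange` schema; the sample values and the `event_to_row` helper are illustrative assumptions, as is the idea of flattening each event into a tuple for a ClickHouse INSERT.

```python
import json

# One server-sent event payload from the Wikimedia EventStreams API
# (stream.wikimedia.org/v2/stream/recentchange). Field names follow the
# public recentchange schema; the values here are made up.
sample_event = json.dumps({
    "wiki": "enwiki",
    "type": "edit",
    "title": "ClickHouse",
    "user": "ExampleUser",
    "timestamp": 1739900000,
})

def event_to_row(data: str) -> tuple:
    """Flatten one recentchange event into a tuple, the shape a
    ClickHouse client INSERT would typically take."""
    e = json.loads(data)
    return (e["wiki"], e["type"], e["title"], e["user"], e["timestamp"])

print(event_to_row(sample_event))
```

In a real ingest loop you would read these payloads off the SSE stream and batch them into inserts; low-latency flushing of small batches is what the setting mentioned in the post is about.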
Nate Busa @natbusa
@deepak_rav @deedydas That is definitely an upper bound. What we do not know, and what should be proved, is whether that is also the minimum ...
English · 0 replies · 0 reposts · 0 likes · 39 views

Deedy @deedydas
The hardest high school math exam in the world, the 6-problem, 9-hour IMO 2025, took place this week. AI models performed poorly. Gemini 2.5 Pro scored the highest, just 13/42, costing $431.97 in a best-of-32 eval. The bronze cutoff was 19. There is a long way to go for AI to solve hard math.
[image]
English · 68 replies · 73 reposts · 586 likes · 212.3K views

Nate Busa @natbusa
Why do all 2025 AI chat apps feel like a Zork I adventure? @karpathy
English · 0 replies · 0 reposts · 0 likes · 43 views

Nate Busa @natbusa
@_jasonwei Your post is itself a good example of a novel dataset for CoT. More generally, I have the feeling we are underutilizing the text written in scientific and philosophy papers, as next-token rewards fail to fully extract/learn the reasoning described in their text.
English · 0 replies · 0 reposts · 1 like · 607 views

Jason Wei @_jasonwei
An underrated but occasionally make-or-break skill in AI research (that didn't really exist ten years ago) is the ability to find a dataset that actually exercises a new method you are working on. Back in the day, when the bottleneck in AI was learning, many methods were dataset-agnostic; for example, a better optimizer would be expected to improve on both ImageNet and CIFAR-10.

Nowadays language models are so multi-task that the answer to whether something works is almost always "it depends on the dataset". A common example of this is the question, "on what datasets does chain of thought improve performance?" A recent paper even argued (will link below) that CoT mainly helps on math/logic, and I think that is both a failure of imagination and a lack of diverse evals.

Naively you might try CoT models on 100 random user chat prompts and not see much difference, but this is because the prompts were already solvable without CoT. In fact there is a small and very important slice of data where CoT makes a big difference: the obvious examples are math and coding, but they include almost any task with asymmetry of verification. For example, generating a poem that fits a list of constraints is hard on the first try but much easier if you can draft and revise using CoT.

As another made-up example, let's say you want to know if browsing improves performance on geology exams. Maybe using browsing on some random geology dataset didn't improve performance. The important thing to do here would be to see if the without-browsing model was actually suffering due to lack of world knowledge; if it wasn't, then this was the wrong dataset to try browsing on. In other words, you should hesitate to draw a conclusion like "X method doesn't work" without ensuring that the dataset used for testing actually exercises that method.
The inertia from five years ago is to take existing benchmarks and try to solve them, but nowadays there is a lot more flexibility and sometimes it even makes sense to create a custom dataset to showcase the initial usefulness of an idea. Obviously the danger with doing this is that a contrived dataset may not represent a substantial portion of user queries. But if the method is in principle general I think this is a good way to start and something people should do more often.
English · 22 replies · 69 reposts · 694 likes · 83.2K views
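The poem example above is a clean instance of asymmetry of verification: checking a draft against the constraints is trivial even though producing a valid poem in a single shot is not. A toy sketch of the cheap side of that asymmetry (the constraint list and `satisfies_constraints` helper are made up for illustration):

```python
def satisfies_constraints(poem: str) -> list:
    """Return the list of violated constraints; verification is cheap
    even when generation is hard (asymmetry of verification)."""
    lines = poem.strip().splitlines()
    failures = []
    if len(lines) != 2:
        failures.append("must be exactly 2 lines")
    if "moon" not in poem.lower():
        failures.append("must mention the moon")
    if any(len(line.split()) > 6 for line in lines):
        failures.append("each line must have at most 6 words")
    return failures

draft = "The moon hums over quiet water\nand I forget to sleep"
print(satisfies_constraints(draft))  # an empty list: all constraints met
```

A draft-and-revise loop would feed the failure list back to the generator until it comes back empty, which is exactly the kind of task where CoT pays off.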
Nate Busa @natbusa
I’ve changed the name I go by for the fourth time in my life, driven by a heartfelt need to connect better with people across different nationalities and interpersonal contexts. It feels right and fits my experiences and identity. Hello, world — Nate Busa!
English · 0 replies · 0 reposts · 0 likes · 63 views

Nate Busa @natbusa
@kellerjordan0 I remember @karpathy mentioning that in the future we would run GPT3 models just for fun as we have done with MNIST. Seems that the future is already here. It took less than a year (and very talented people 🙏)
English · 0 replies · 0 reposts · 0 likes · 314 views

Keller Jordan @kellerjordan0
I enjoy getting NanoGPT training speed records. I'm also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize its installation time. Currently it's 537 lines of code, and installs+runs in 20 minutes on a fresh 8xH100. That means the cost of a new record attempt is about $8.

I've enjoyed seeing the records that other people have gotten. @vyasnikhil96 got a new sample-efficiency record using the SOAP optimizer, and I understand he's currently working on reducing its overhead so that it can potentially compete with Muon on wallclock time in the future. @bozavlado discovered that Muon works better if the QKV weights are orthogonalized separately. And @Grad62304977 improved the record significantly using a wide range of architectural modernizations, including QK-norm. I was surprised to see that QK-norm, which from what I understand was invented to deal with instabilities that appear at large scale, also helps train faster at the small scale.

I've seen some more interesting new ideas for the speedrun being posted recently, and I'd like to encourage the researchers who came up with those ideas to also be the ones to try them out empirically. I think this makes the benchmark more reliable, if the empirical experiments are distributed across the community, rather than only me doing them.

I'm interested in two kinds of new results around this speedrun. First, of course, I'm interested in new records that improve the time to 3.28 val loss. The only rule is that you can't use external data besides Fineweb10B, and you can't use pretrained models. Beyond that, everything is fair game. Second, I'm interested in new trainings that match the current record, while being simpler.
For example, if it can be shown that we can match the current record using standard AdamW instead of the Muon optimizer, then I think that would be a very interesting result.

The log file produced by the current speedrun contains not just the timing and final loss, but also a copy of the code used to produce the run. Therefore, the only thing I or anyone else need in order to verify and reproduce a new record is its log file.

Researchers have pointed out that we shouldn't uncritically trust every result which is obtained at the 124M-parameter scale. I absolutely agree - we shouldn't blindly expect results to scale up. However, I still believe it's valuable for the community to at least have one stable small-scale benchmark. Once an idea has been clearly proven to work at small scale, it becomes relatively simple to test it at a larger scale. I think this is a better situation than the current status quo, where every LM training paper seems to use a different benchmark, making it challenging for the community to evaluate new ideas.

The only exception to this evaluation system would be ideas that only work at large scale, and so can't be demonstrated in a small-scale benchmark. These do exist, but I believe they are less common in the recent literature than ideas which are also supposed to work at the 124M-parameter scale, which we should be able to efficiently evaluate using a stable and competitive small-scale benchmark.

If the interest in this benchmark stays strong, I am hopeful that some very interesting things can come out of it. Thanks for your interest, Keller
English · 14 replies · 36 reposts · 498 likes · 249.5K views
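The $8 figure above follows directly from the runtime and node size; here is a quick back-of-the-envelope check, where the ~$3/GPU-hour H100 rate is an assumed cloud price, not something stated in the post:

```python
# Back-of-the-envelope cost of one NanoGPT speedrun record attempt.
gpus = 8                 # a fresh 8xH100 node, per the post
usd_per_gpu_hour = 3.0   # assumed on-demand H100 rate (not from the post)
runtime_hours = 20 / 60  # install + run takes about 20 minutes

cost = gpus * usd_per_gpu_hour * runtime_hours
print(f"${cost:.2f}")    # prints $8.00
```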
Nate Busa @natbusa
@sama Plus study. That never hurts.
English · 0 replies · 0 reposts · 0 likes · 16 views

Sam Altman @sama
the best way to get good at something is usually to just practice actually doing the thing in question. a lot of very capable people outsmart themselves with complex plans that involve working a lot on fake prerequisites.
English · 1K replies · 3.6K reposts · 29.7K likes · 1.7M views

Nate Busa @natbusa
@enithka Normally I use 4 to 5 passes, alternating hints, personal edits, corrections, and reviews. This includes grammatical corrections, critique, and my own reflection. I find it nearly impossible to determine who/what wrote what.
Dutch · 1 reply · 0 reposts · 1 like · 31 views

enith vlooswijk @enithka
Wonderful. I strongly suspect a student used AI for her text, but different AI detectors give different results. Last year detection was still fairly easy; now the risk of a false positive seems too high to accuse anyone.
Dutch · 30 replies · 1 repost · 33 likes · 18.8K views

Ricky Gervais @rickygervais
When life gets you down and you feel unloved and worthless, just remember you’ll be dead one day. Have a great weekend.
[image]
English · 5.1K replies · 24.6K reposts · 158.7K likes · 86.7M views

Nate Busa @natbusa
@deedydas I have read the article and looked into the code. What I don't understand is why not report the improved GPT-4 numbers with MCTSr, if the technique actually works, at least for a subset of the examples? It's a prompt-only technique...
English · 1 reply · 0 reposts · 0 likes · 670 views

Deedy @deedydas
It's finally here. Q* rings true. Tiny LLMs are as good at math as a frontier model. By using the same techniques Google used to solve Go (MCTS and backprop), Llama8B gets 96.7% on the math benchmark GSM8K! That's better than GPT-4, Claude, and Gemini, with 200x fewer parameters!
[image]
English · 38 replies · 303 reposts · 2.6K likes · 803.5K views

Nate Busa @natbusa
@garykubiak @OpenAI GPT-4 does somewhat better: sloppy initially, but it can correct itself by running Python and verifying step by step.
[image]
English · 0 replies · 0 reposts · 5 likes · 628 views

Gary Kubiak @garykubiak
@OpenAI But can it tell me where the letter r is in blueberry? 🙏🏻
[image]
English · 27 replies · 5 reposts · 453 likes · 46.6K views
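Character counting is a sub-token task that chat models famously fumble, which is why a model that can call a Python tool sidesteps it; the question itself is a two-liner:

```python
word = "blueberry"
# 1-based positions of every 'r' in the word
positions = [i + 1 for i, c in enumerate(word) if c == "r"]
print(len(positions), positions)  # prints: 2 [7, 8]
```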
OpenAI @OpenAI
Our new GPT-4 Turbo is now available to paid ChatGPT users. We’ve improved capabilities in writing, math, logical reasoning, and coding. Source: github.com/openai/simple-…
[image]
English · 618 replies · 1.2K reposts · 6.8K likes · 6.4M views

Nate Busa @natbusa
@chrisalbon Using a library is far simpler than building a library. Library making is a craft on its own. There will be dragons.
English · 0 replies · 0 reposts · 0 likes · 59 views

Nate Busa @natbusa
@Emil_BB @fchollet Think, for instance, of the implications of an "intelligent assistant/companion" for each person in the world. How is this going to affect us?
English · 0 replies · 0 reposts · 0 likes · 25 views

François Chollet @fchollet
In 2033 it will seem utterly baffling how a bunch of tech folks lost their minds over text generators in 2023 -- like reading about Eliza or Minsky's 1970 quote about achieving human-level general intelligence by 1975
English · 73 replies · 314 reposts · 2.4K likes · 536.5K views

Nate Busa @natbusa
@Emil_BB @fchollet Yes, agreed. I guess I should have said "until now". In short, I am not in the doom camp, but I do believe the changes to society will be significant.
English · 0 replies · 0 reposts · 0 likes · 23 views

Nate Busa @natbusa
@fchollet I disagree. Given enough context, truthfulness can be computed, or be unknowable, for machine and person alike. It's only a matter of time before machines are more factually correct than people.
English · 0 replies · 0 reposts · 0 likes · 40 views

Yannick Scholich @YannickScholich
Change my mind: AGI is a high-parameter LLM with long-term memory, short-term memory, looped evaluation chains, semi-variable "VIP params", and real-time API access.
English · 1 reply · 0 reposts · 1 like · 68 views