Matteo Paz

21 posts

Matteo Paz

Matteo Paz

@matteopaz06

Katılım Haziran 2025
53 Takip Edilen2.5K Takipçiler
Matteo Paz
Matteo Paz@matteopaz06·
string theory is just an irl headcannon
English
0
0
4
165
Matteo Paz
Matteo Paz@matteopaz06·
@jsuarez is this just rollout and training acceleration or is it abstracted/optimized rl algos
English
0
0
0
210
Joseph Suarez 🐡
Joseph Suarez 🐡@jsuarez·
Overcooked merged into PufferLib today - thanks Roze! Agents trained in <10 seconds on a 4090. This is without hparam tuning for 4.0.
GIF
English
10
23
308
53.5K
Matteo Paz
Matteo Paz@matteopaz06·
@haider1 this is not true, 4o was pre reasoning. reasoning meant smaller models can punch higher. Economics just dont work out to keep the same large models.
English
0
0
0
632
Haider.
Haider.@haider1·
greg brockman recently confirmed that "spud" is openai first new pre-train model in two years since gpt-5.x models seem to build on gpt-4o/4-turbo, if openai RL can push a weaker base model like gpt-4o close to gpt-5.4-x-high-level intelligence then openai clearly has a secret sauce
English
34
28
1.1K
96.1K
Deepanshu Rohilla
Deepanshu Rohilla@Deepans36819800·
@GeneralistAI We’ll see a fully robot-maintanined data center go live within the next 12 months, by Generalist and a startup. Infra is about to change faster than most people expect.
English
1
3
23
3.6K
Generalist
Generalist@GeneralistAI·
GEN-1 plugging in ethernet cables to a handheld socket.
English
12
38
297
51.9K
Matteo Paz
Matteo Paz@matteopaz06·
@IanOsband am i missing something? RL is useful in low signal unsupervised environments, so adding a target nll loss just makes this supervised. In that case the best optimizer is just MLE.
English
0
0
3
247
Ian Osband
Ian Osband@IanOsband·
Something is rotten with policy gradient. PG has become *the* RL loss for LLMs. But it’s not even good at basic RL. Even on MNIST with bandit feedback, vanilla PG performs far worse than cross-entropy because it wastes gradient budget. Delightful Policy Gradient: arxiv.org/abs/2603.14608…
Ian Osband tweet media
English
16
44
438
160K
Matteo Paz
Matteo Paz@matteopaz06·
i always have great ideas that I don't want to be the one to build. I made a little home for these ideas, let me know what you think: needanidea.xyz
English
4
1
20
2K
Matteo Paz
Matteo Paz@matteopaz06·
@Thomasdelvasto_ the idea is that the information capacity of even 1t params is so vastly below the information content of pretraining text, that the only efficient algorithm *to* learn is approximate to some sort of general intelligence that might have generated it
English
0
0
4
350
Θωμᾶς del Vasto
Θωμᾶς del Vasto@Thomasdelvasto_·
LLMs are basically 'borrowing' human intelligence by training on our output, I don't believe from a first principles standpoint they will be able to get smarter than us unless they train on data that's smarter than us
English
195
16
322
26.9K
kognise
kognise@kognise7·
whose idea was it to put treehacks on valentine's day 💀
kognise tweet media
English
16
2
366
59.9K
Matteo Paz
Matteo Paz@matteopaz06·
@agniv_s the relevant view I most agree with is that of @_albertgu wrt chunking in nlp. obviously not biological but seems to take advantage of psychological heuristics #a-differentiable-chunking-mechanism" target="_blank" rel="nofollow noopener">goombalab.github.io/blog/2025/hnet…
English
2
0
1
222
agniv
agniv@agniv_s·
@matteopaz06 Ah, nah I agree (poeppl&embick ‘05 which says the same, though I wonder if they revisited it) on the cts brain. But im thinking about the current “limitation of computability,” wherein you just have to deal with discrete things? (approximation thms seem to be okay with this)
agniv tweet media
English
1
0
1
232
Matteo Paz
Matteo Paz@matteopaz06·
@agniv_s tokens are just convenient constructs to trade off transformer attention memory to granularity. theres no evidence to suggest brains work discretely. do your eyes stream vision tokens?
English
1
0
3
178
agniv
agniv@agniv_s·
@matteopaz06 if you think of tokens as a discretization of the cts world, it seems fine a la “the unreasonable effectiveness of numbers in the natural sciences?”
English
1
0
1
191
Andrej Karpathy
Andrej Karpathy@karpathy·
nanochat can now train GPT-2 grade LLM for <<$100 (~$73, 3 hours on a single 8XH100 node). GPT-2 is just my favorite LLM because it's the first time the LLM stack comes together in a recognizably modern form. So it has become a bit of a weird & lasting obsession of mine to train a model to GPT-2 capability but for much cheaper, with the benefit of ~7 years of progress. In particular, I suspected it should be possible today to train one for <<$100. Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), with $8/hour/TPUv3 back then, for a total cost of approx. $43K. It achieves 0.256525 CORE score, which is an ensemble metric introduced in the DCLM paper over 22 evaluations like ARC/MMLU/etc. As of the last few improvements merged into nanochat (many of them originating in modded-nanogpt repo), I can now reach a higher CORE score in 3.04 hours (~$73) on a single 8XH100 node. This is a 600X cost reduction over 7 years, i.e. the cost to train GPT-2 is falling approximately 2.5X every year. I think this is likely an underestimate because I am still finding more improvements relatively regularly and I have a backlog of more ideas to try. A longer post with a lot of the detail of the optimizations involved and pointers on how to reproduce are here: github.com/karpathy/nanoc… Inspired by modded-nanogpt, I also created a leaderboard for "time to GPT-2", where this first "Jan29" model is entry #1 at 3.04 hours. It will be fun to iterate on this further and I welcome help! My hope is that nanochat can grow to become a very nice/clean and tuned experimental LLM harness for prototyping ideas, for having fun, and ofc for learning. The biggest improvements of things that worked out of the box and simply produced gains right away were 1) Flash Attention 3 kernels (faster, and allows window_size kwarg to get alternating attention patterns), Muon optimizer (I tried for ~1 day to delete it and only use AdamW and I couldn't), residual pathways and skip connections gated by learnable scalars, and value embeddings. There were many other smaller things that stack up. Image: semi-related eye candy of deriving the scaling laws for the current nanochat model miniseries, pretty and satisfying!
Andrej Karpathy tweet media
English
332
619
7.4K
1.3M
retail trash
retail trash@retail_trash·
@kanavtwt it gets better. when you use an integer in a for loop, it has to malloc/free the integer every iteration. all 28 bytes of it. the best optimization they could come up with is preallocating the integers -5 to 256. also bool inherits from int
English
14
4
748
34.3K
kanav
kanav@kanavtwt·
I thought Boolean values taking 8 bits of space was bad. Until I realised they take over 192 bits in Python. Why does a simple true/false need so much space???
English
230
45
5K
474.6K
Matteo Paz
Matteo Paz@matteopaz06·
something something guitar pedal. linus vibecoding alert too
Matteo Paz tweet media
English
0
0
6
2.1K
Matteo Paz
Matteo Paz@matteopaz06·
new torvalds repo dropped...
Matteo Paz tweet media
English
3
1
8
2.4K
Matteo Paz
Matteo Paz@matteopaz06·
@justinskycak thanks Justin! MA and Eurisko were amazing opportunities.
English
5
0
128
5.2K
Justin Skycak
Justin Skycak@justinskycak·
Can't think of a better way to close out 2025 than seeing the head of NASA ask my former student @matteopaz06 to apply, with a fighter jet ride as a signing bonus. Matteo was one of my students in the Eurisko program, which, during its operation from 2020-23, was the most advanced high school math/CS sequence in the USA. It culminated in high school students doing masters/PhD-level coursework (reproducing academic research papers in artificial intelligence, building everything from scratch in Python) Matteo joined Eurisko as a 10th grader, during the last year it was offered, and worked hard to complete almost all 2-3 years’ worth of assignments in a single year. (Eurisko ended when I relocated; nobody else in the district had the requisite knowledge to teach it.) This is exactly the position that we were trying to put students in with the Eurisko program – get them to a point of skill that they can capitalize on some math/coding-related opportunity and turn it into a chain reaction of fortunate events. And it’s been so great to witness some of these chain reactions get underway.
Eric Zeller@TheOnlyEZ

@curiosityonx @justinskycak update: @_MathAcademy gets you a tweet from the head of NASA and a ride in a fighter jet

English
84
336
4.4K
656.3K
Matteo Paz
Matteo Paz@matteopaz06·
@kevinweil thanks man. Come to project Launchpad demo day at OpenAI in January. Would love you to see what I've been working on since.
English
5
6
324
20.3K
Kevin Weil 🇺🇸
Kevin Weil 🇺🇸@kevinweil·
High school student uses AI to discover 1M+ objects humans missed in astronomical data. Head of NASA openly recruiting him through Twitter with a fighter jet ride included. All my worlds colliding. I love everything about this.
Jared Isaacman@rookisaacman

@curiosityonx Matteo please apply to work at NASA and I will personally throw in a fighter jet ride as a signing bonus

English
74
996
14.1K
1.1M
Giuliano
Giuliano@Giuliano_Mana·
Ted Turner (now a billionaire) wanted to major in Classics in college. You won’t believe the letter his father sent him. (1/3)
Giuliano tweet media
English
105
483
4.9K
733.8K
Jared Isaacman
Jared Isaacman@rookisaacman·
@curiosityonx Matteo please apply to work at NASA and I will personally throw in a fighter jet ride as a signing bonus
English
288
1K
23.5K
2.5M
Curiosity
Curiosity@CuriosityonX·
🚨 A student in the US just discovered MILLIONS of new space objects. The astronomy world was recently shaken by a discovery from an unexpected source: a teenager still in high school. Matteo Paz, a student from Pasadena, utilized archival data from NASA’s retired NEOWISE mission to bring 1.5 million invisible cosmic objects into the light. During a stint at Caltech’s Planet Finder Academy, and mentored by astrophysicist Davy Kirkpatrick, Paz took a novel approach to data analysis. He built a unique machine learning model capable of sifting through a staggering 200 billion infrared records. In a span of only six weeks, his AI detected subtle patterns that human analysts had missed, identifying everything from distant quasars to exploding supernovas. Paz’s findings were so robust that they earned him a spot in the prestigious The Astronomical Journal and a position as a research assistant at Caltech. His work does more than just populate star maps; it provides specific coordinates for the James Webb Space Telescope to investigate further. This breakthrough highlights a growing trend where fresh perspectives and AI tools allow young researchers to make historic scientific impacts from the classroom.
Curiosity tweet media
English
279
2K
20.6K
2.2M