Norman Casagrande
@nova77t

2.4K posts
ML, history, space & sciencey stuff. Research Eng @ Google DeepMind. Opinions are my own etc. Find me @adabstract.bsky.social & @[email protected]

@ Google DeepMind · Joined January 2010
247 Following · 796 Followers
Norman Casagrande reposted

Andrej Karpathy @karpathy
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This has been the bread and butter of my daily work for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of experiment results and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I didn't find them manually before, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QK-norm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the value embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I had already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. More generally, *any* metric you care about that is reasonably efficient to evaluate (or that has a more efficient proxy metric, such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
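For readers unfamiliar with the QK-norm oversight in the first bullet, here is a minimal sketch of the idea, assuming a PyTorch-style attention block; the module and its single shared gain are illustrative choices, not nanochat's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNorm(nn.Module):
    """L2-normalize queries/keys per head, with a learnable scale.

    A parameterless QK-norm pins the query/key magnitudes, which keeps
    training stable but can leave the attention logits too small, so the
    softmax stays diffuse. Attaching a learned multiplier lets the model
    sharpen attention again where it helps.
    """

    def __init__(self, init_scale: float = 1.0):
        super().__init__()
        # One shared gain; per-head or per-channel gains are also common.
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, seq, head_dim)
        x = F.normalize(x, dim=-1)  # the "parameterless" part
        return x * self.scale       # the multiplier that was missing

# Usage inside attention (illustrative):
#   q, k = qk_norm(q), qk_norm(k)
#   out = F.scaled_dot_product_attention(q, k, v)
```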
961 replies · 2.1K reposts · 19.3K likes · 3.5M views
Norman Casagrande reposted

Seth Forsgren @sethforsgren
It started with the silly idea of generating images and turning them into music. We didn’t know if it would work, and we didn’t care if anyone used it. We were building it for ourselves. As musicians, we marveled at how this instrument could inspire and challenge us. A lot has changed since then, but that sense of wonder has remained. Couldn’t be more excited to continue the journey as part of @GoogleLabs
Producer.ai @producer_ai

Producer is now part of Google! We’re proud to be joining @GoogleLabs and @GoogleDeepMind to build the future of music creation. Producer is here to stay, with more on the way. Come make music with us!

23 replies · 4 reposts · 64 likes · 8.2K views
Norman Casagrande @nova77t
@dccommonsense Time to turn the executive into a council like in Switzerland (whose constitution was inspired by the US one). Swiss council has 7 members: while not perfect it's another obstacle to tyranny. Especially if they are forced to speak "as one" instead of across party lines.
0 replies · 0 reposts · 0 likes · 40 views
Dan Carlin @dccommonsense
(More)...and one of the main rationales for reducing presidential power is that that's the branch of government most likely to lead to tyranny. Congress, of course, has taken its turn at being overbearing... but it's harder to be tyrannized by hundreds of humans than by one person.
43 replies · 49 reposts · 1.1K likes · 34.2K views
Norman Casagrande reposted

Jason Baldridge @jasonbaldridge
A little jam for all the absent minded professors out there. (And demonstrating some of Lyria's flexibility and weirdness. I love the waka-waka Doppler-effect sound from 24-26 seconds in.)
1 reply · 1 repost · 11 likes · 469 views
Norman Casagrande reposted

David Pfau @pfau
Alright, time for my own vibe-coding story. Over the last several months, we've been rewriting the plasma simulator used as an RL environment in our 2022 Nature paper from Matlab to JAX. An experimental version is now available on EPFL's Gitlab: gitlab.epfl.ch/spc/public/meq…
3 replies · 6 reposts · 100 likes · 10.1K views
Norman Casagrande @nova77t
@pfau Hahah, yeah: fusion as well. But I can't claim much credit since my contribution was minuscule 😅 (for now at least! 😉)
0 replies · 0 reposts · 0 likes · 63 views
Norman Casagrande reposted

Google Gemini @GeminiApp
Introducing Lyria 3, our new music generation model in Gemini that lets you turn any idea, photo, or video into a high-fidelity track with custom lyrics. From funny jingles to lo-fi beats, you can create custom 30-second soundtracks for any moment. See how it works. 🧵
505 replies · 1.3K reposts · 8.9K likes · 4.6M views
Norman Casagrande @nova77t
And I can finally update my awkward answering machine reply! 😄🎵
0 replies · 1 repost · 3 likes · 217 views
Dario Bressanini @DarioBressanini
@nova77t @malinverno_luca You didn't expect me to use Gemini? Or that I'd fed it something first? (I also wanted to give it Kernighan & Plauger, but I didn't have it at hand)
1 reply · 0 reposts · 0 likes · 78 views
Dario Bressanini @DarioBressanini
I'm astonished: I've been experimenting with AI code generation. I asked it to write computational physics simulation algorithms from scratch (where you first have to understand what they are and what they do before you can program them). It did everything! It would have taken me a month. 😱
77 replies · 10 reposts · 694 likes · 73.2K views
Norman Casagrande @nova77t
@DarioBressanini @malinverno_luca Hahaha, I wasn't expecting this! It's been a while since I worked on code generation models (in particular "agentic" ones), so I can't take the credit, but I'll pass along the appreciation.
1 reply · 0 reposts · 0 likes · 81 views
Dario Bressanini @DarioBressanini
@nova77t @malinverno_luca Oh dear, on X that's a bit difficult ;) anyway, before it started writing code (I use Gemini) I had a long exchange of ideas with it about how I wanted it to write. And I also fed it a famous piece by Rob Pike to see whether it agreed 😄
1 reply · 0 reposts · 1 like · 59 views
Norman Casagrande @nova77t
@DarioBressanini @malinverno_luca Of course, everything is relative. Where I found it particularly useful was in a little project converting Matlab code written by physicists (!) to Jax. Compared to the physicists' code, the model's output was Knuth! Granted, the bar was low! :p
1 reply · 0 reposts · 0 likes · 55 views
Dario Bressanini @DarioBressanini
@nova77t @malinverno_luca I, on the other hand, was pleasantly surprised by the elegance. Evocative variable names, clear functions, solid data structures, and everything well commented. I only had to dial down its tendency to think "in objects" a little.
1 reply · 0 reposts · 0 likes · 62 views
Norman Casagrande @nova77t
@malinverno_luca @DarioBressanini Exactly. By "elegance" I don't mean 2 cryptic lines in place of 40, but rather something that is easily maintainable, readable, and clear in its intent.
1 reply · 0 reposts · 0 likes · 51 views
Luca Malinverno, PhD @malinverno_luca
@nova77t @DarioBressanini Elegance is indeed a very interesting metric... It could be used as a discriminator for code, because elegant then translates into secure, scalable, maintainable, etc.
1 reply · 0 reposts · 1 like · 36 views
Norman Casagrande @nova77t
@Petedemountain @DarioBressanini I've been writing code for 30 years: if it saves me anything, it's a few minutes at most, although it's still an interesting tool. Sometimes, though, I have to rearrange what it suggests, because "elegance" is what saves you time in the long run!
0 replies · 0 reposts · 0 likes · 13 views
Pete de mountain @Petedemountain
@nova77t @DarioBressanini How much does elegance matter when it does in half an hour what would have taken you a week? After all, when high-level languages were invented the "fine detail" of writing machine code was lost too, but the time savings made it worth it.
1 reply · 0 reposts · 0 likes · 26 views
Norman Casagrande reposted

David Pfau @pfau
This is the key difference between in-domain and out-of-domain generalization, and we still have not truly solved out-of-domain generalization. It just turns out you can build world changing technology by throwing so much data at things that the entire universe is in-domain.
Niels Rogge @NielsRogge

One of the best visual explanations I've ever seen for why scaling Transformers works, but is suboptimal, as it's just brute-forcing things, by @YesThisIsLion (co-author of the Transformer) on @MLStreetTalk.

"In the (rejected) paper "Intelligent Matrix Exponentiation", they show the decision boundary of a classic MLP with a ReLU/tanh activation function on the classic spiral dataset."

"You can see they both technically solve it with great scores on the test set. Next, they show the decision boundary of the "M-layer" they propose in the paper. And it represents the spiral ... as a spiral!"

"Shouldn't we? If the data is a spiral... shouldn't we represent it as a spiral?"

"If you look back at the decision boundaries of the MLP, it's clear that you just have these tiny, piecewise separations without learning the concept of a spiral. That's what I mean!"

"If you train these things enough, it can fit the spiral and get a high accuracy. But there's no indication that the MLP actually understands a spiral. When you represent it as a spiral, it extrapolates correctly, because the spiral just keeps going out."
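A hedged sketch of the experiment the quote describes: a ReLU MLP fit to a two-spirals dataset, then probed beyond the training radius. The dataset generator, network size, and the 1.5x extrapolation probe below are illustrative assumptions, not the setup from the "Intelligent Matrix Exponentiation" paper:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_spirals(n=1000, turns=3.0, noise=0.05, seed=0):
    """Two interleaved spirals: the classic dataset referenced above."""
    rng = np.random.default_rng(seed)
    t = turns * 2 * np.pi * np.sqrt(rng.uniform(size=n))  # radius grows with angle
    arm = np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / (turns * 2 * np.pi)
    X = np.concatenate([arm, -arm]) + noise * rng.standard_normal((2 * n, 2))
    y = np.concatenate([np.zeros(n), np.ones(n)])  # one label per arm
    return X, y

X, y = make_spirals()
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                    max_iter=5000, random_state=0).fit(X, y)
print("in-distribution accuracy:", mlp.score(X, y))  # typically close to 1.0

# The quote's point: a ReLU MLP's decision boundary is piecewise linear, so
# past the training radius it extends its last linear pieces instead of
# continuing the spiral's curve.
far = 1.5 * X[:20]  # scale some class-0 points outward, past the training range
print("off-distribution predictions:", mlp.predict(far))
```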

13 replies · 22 reposts · 337 likes · 35.9K views
Igor Babuschkin @ibab
@DavidSHolz There are decades where nothing happens, and there are weeks where decades happen
7 replies · 29 reposts · 476 likes · 19.9K views
David @DavidSHolz
I've done more personal coding projects over Christmas break than I have in the last 10 years. It's crazy. I can sense the limitations, but I *know* nothing is going to be the same anymore.
298 replies · 464 reposts · 8.5K likes · 1.2M views
Norman Casagrande reposted

underwood @underwoodxie96
I tweaked the prompt. The old version generated different camera angles from a single image, but it’s hard to stitch those into a coherent clip. So I updated the setup: now the AI expands keyframes based on the same scene + storyline for better continuity.
TechHalla @techhalla

These 2 prompts for Nano Banana Pro will save you a ton of time. Just upload an image, generate the cinematic grid, and pull the frames you like! Examples made in Higgsfield AI, and prompts below 👇

53 replies · 89 reposts · 1K likes · 690K views