Rafael Toledo
@rafatold

25 posts

AI / Computer Vision Engineer

Brazil · Joined August 2015
126 Following · 5 Followers
Rafael Toledo @rafatold ·
@dyah10 Giving Claude real creative superpowers is exactly what the space has been missing, not just chat, but actual creation at scale.
0 replies · 0 reposts · 0 likes · 195 views
Rafael Toledo reposted
Dominique @dyah10 ·
Turn Claude into the best creative agent in the world! Our users generate over 1 billion images and videos every year. Now Claude can too. RT and comment "Pixa" for free access!
1.1K replies · 635 reposts · 2.7K likes · 1.3M views
Rafael Toledo @rafatold ·
@quail_man2 @awsaf49 @karpathy It's funny but true. Most AI researchers worldwide don't have access to this knowledge in an intuitive way, to a solid network of experts, or to savvy tricks. So when Andrej spreads ideas that circulate in the inner circle of the big labs, people get amazed.
0 replies · 0 reposts · 0 likes · 31 views
Andrej Karpathy @karpathy ·
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I've done daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
Andrej Karpathy tweet media
974 replies · 2.1K reposts · 19.4K likes · 3.6M views
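The workflow Karpathy describes (propose a change, run a short experiment, keep it only if validation loss improves, plan the next idea from the results) can be sketched very loosely as a greedy search loop. Everything below is hypothetical: the config knobs (`wd`, `qk_scale`), the proxy objective, and the proposal rule are stand-ins, not his actual code.

```python
# Hypothetical sketch of an "autoresearch" tuning loop (not Karpathy's code):
# propose a config tweak, evaluate a cheap proxy for validation loss, and
# keep the change only if the proxy improves.
import random

def proxy_val_loss(config):
    # Stand-in for a short depth=12 training run: a toy bowl-shaped
    # objective over two made-up hyperparameters, minimized at (0.1, 1.5).
    return (config["wd"] - 0.1) ** 2 + (config["qk_scale"] - 1.5) ** 2 + 2.0

def propose(config, history, rng):
    # A real agent would reason over `history`; here we just jitter one knob.
    key = rng.choice(list(config))
    cand = dict(config)
    cand[key] += rng.gauss(0, 0.1)
    return cand

rng = random.Random(0)
config = {"wd": 0.0, "qk_scale": 1.0}
best = proxy_val_loss(config)
history = []
for step in range(700):  # ~700 autonomous changes, as in the post
    cand = propose(config, history, rng)
    loss = proxy_val_loss(cand)
    history.append((cand, loss))
    if loss < best:  # keep only changes that improve the proxy metric
        config, best = cand, loss
print(config, round(best, 3))
```

The real version replaces `proxy_val_loss` with an actual short training run and `propose` with an LLM agent that reads the experiment history, which is what makes the accepted changes transfer to larger models worth testing.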
Rafael Toledo @rafatold ·
@fal This looks like a really powerful tool. Impressive robustness across different scenes. I definitely recommend it.
1 reply · 0 reposts · 2 likes · 514 views
Rafael Toledo reposted
fal @fal ·
🚨 Pixelcut Background Removal is now on fal!
✂️ Extremely precise cutouts, even for hair, fur, and fine edges
⚡️ Sub-second background removal
🖼️ High-resolution output up to 2400×2400
fal tweet media
15 replies · 25 reposts · 230 likes · 26.1K views
Rafael Toledo @rafatold ·
@NoThanksHoney1 @ylecun @rao2z In high-dimensional spaces (as with image pixels, word embeddings, etc.), almost all points are far apart. So even inside the "data manifold", your model is extrapolating, operating in regions not seen before. You're always generalizing beyond your training examples.
0 replies · 0 reposts · 0 likes · 20 views
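The "almost all points are far apart" claim is the distance-concentration effect: as dimension grows, the nearest neighbor of a random point is nearly as far away as the average point. A small self-contained sketch (random uniform points; sample sizes and dimensions are arbitrary choices for illustration):

```python
# Distance concentration: min pairwise distance approaches the mean pairwise
# distance as dimensionality grows, so "nearby" training examples vanish.
import math
import random

def pairwise_stats(dim, n=100, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    return min(dists), sum(dists) / len(dists)

for dim in (2, 10, 100, 500):
    dmin, dmean = pairwise_stats(dim)
    print(f"dim={dim:4d}  min/mean pairwise distance = {dmin / dmean:.2f}")
```

In 2-D the closest pair is a tiny fraction of the mean distance; by a few hundred dimensions the ratio is close to 1, which is the sense in which every query sits far from all training examples.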
Yann LeCun @ylecun ·
Pantheon of scientific territorial hubris:
Chemistry is just physics
Biology is just chemistry
Neuroscience is just biology
Psychology is just neuroscience
Economics is just psychology
Computer science is just math
Machine learning is just statistics
All of it is just philosophy!
53 replies · 168 reposts · 1.1K likes
Andrej Karpathy @karpathy ·
Excited to release new repo: nanochat! (It's among the most unhinged I've written.) Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and in as little as 4 hours you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Efficiently inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI
- Write a single markdown report card, summarizing and gamifying the whole thing

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth-30 model trained for 24 hours (this is about equal to the FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into the 40s on MMLU, 70s on ARC-Easy, 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned, or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think the overall skeleton is ok enough that it can go up on GitHub where all its parts can be improved. Link to the repo and a detailed walkthrough of the nanochat speedrun is in the reply.
Andrej Karpathy tweet media
690 replies · 3.4K reposts · 24.2K likes · 5.8M views
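The two quoted price points (~$100 for ~4 hours, ~$1000 for ~41.6 hours, both on 8×H100) are mutually consistent if one assumes a rate of roughly $3 per GPU-hour. That rate is my assumption for the arithmetic, not something stated in the post:

```python
# Back-of-envelope check of the quoted nanochat costs, assuming a
# hypothetical rental rate of ~$3 per H100 GPU-hour (assumed, not sourced).
gpus = 8
rate = 3.0  # USD per GPU-hour (assumption)
tier_small = gpus * 4 * rate     # ~4-hour speedrun
tier_large = gpus * 41.6 * rate  # ~41.6-hour run
print(f"${tier_small:.0f} and ${tier_large:.0f}")
```

Both land within a few percent of the quoted $100 and $1000 figures, so the tiers scale linearly with GPU-hours under that assumed rate.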
Rafael Toledo @rafatold ·
@giffmana @ylecun Besides, e.g. ResNet only starts operating at output stride (OS) 16 in its stage 4. So most of the CNN computation is done on a higher-resolution feature map than in the ViT counterpart.
0 replies · 0 reposts · 0 likes · 16 views
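The output-stride point can be made concrete with feature-map sizes. A standard ResNet downsamples gradually (strides 4, 8, 16, 32 across its blocks), so most of its layers run on maps finer than stride 16, while a ViT with 16×16 patches works at stride 16 from the very first layer. The stage names below are generic labels for illustration:

```python
# Feature-map resolution for a 224x224 input: typical ResNet strides per
# block vs. a ViT/16, which tokenizes to stride 16 immediately.
img = 224
for name, stride in [("stem", 4), ("early block", 8),
                     ("mid block", 16), ("last block", 32)]:
    side = img // stride
    print(f"ResNet {name:11s}: stride {stride:2d} -> {side}x{side} map")
side = img // 16
print(f"ViT/16 (all layers): stride 16 -> {side}x{side} = {side**2} tokens")
```

So a ViT/16 processes 14×14 = 196 tokens everywhere, whereas the CNN spends most of its compute on 56×56 and 28×28 maps, which is the asymmetry the tweet is pointing at.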
Rafael Toledo @rafatold ·
@giffmana @ylecun It'd be fairer to consider a patch size of 1 for ViT when comparing with CNNs for HR applications. Because when we need a dense, high-resolution output, such as image matting or semantic segmentation for applications like detecting road damage, we can't just apply a patch size of 16 at the start.
1 reply · 0 reposts · 0 likes · 58 views
Lucas Beyer (bl16) @giffmana ·
I wrote a blogpost, "On the speed of ViTs and CNNs". It addresses the following concerns I often hear:
- worry about ViTs' speed at high resolution
- how high a resolution do I need?
- is it super important to keep the aspect ratio?
I think @ylecun might like it too! Link below
Lucas Beyer (bl16) tweet media
24 replies · 91 reposts · 703 likes · 101.1K views
Rafael Toledo @rafatold ·
@karpathy Does anyone know the reference for these 2017 patches? GANs are from 2014; hadn't DL image generation started earlier?
0 replies · 0 reposts · 0 likes · 22 views
Rafael Toledo @rafatold ·
@barbarikon @kchonyc @ylecun "m" or "a"; F is the observed variable. If "m" is known, then you can estimate "a" as in a linear regression problem. That's my guess.
0 replies · 0 reposts · 0 likes · 71 views
Kyunghyun Cho @kchonyc ·
Once @ylecun told me (heavily paraphrased): it's not F=ma but \min (F-ma)^2. I didn't realize its importance at the time, but it is perhaps the most enlightening perspective I've ever heard.
38 replies · 37 reposts · 576 likes · 454.9K views
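The min (F-ma)² framing, combined with the reply above, amounts to ordinary least squares: given noisy observations (m_i, F_i), minimizing Σ(F_i - m_i·a)² over a has the closed form a = Σ m_i F_i / Σ m_i². A small synthetic sketch (all data here is made up for illustration):

```python
# Least-squares reading of F = ma: estimate the scalar "a" from noisy
# (mass, force) observations by minimizing sum((F - m*a)^2), whose
# closed-form minimizer is a = sum(m*F) / sum(m*m).
import random

random.seed(0)
true_a = 9.8  # pretend ground truth (e.g. gravitational acceleration)
ms = [random.uniform(1.0, 10.0) for _ in range(1000)]
Fs = [m * true_a + random.gauss(0, 0.5) for m in ms]  # noisy force readings

a_hat = sum(m * F for m, F in zip(ms, Fs)) / sum(m * m for m in ms)
print(round(a_hat, 2))
```

This is the sense in which a physical "law" becomes a regression problem: the equality is replaced by an objective, and the constant is whatever best explains the data.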
Rafael Toledo @rafatold ·
@_florianmai @ylecun @StevenLevy @DBahdanau @kchonyc You nailed it. I'd also mention the savvy way they composed the architecture to make it extremely scalable, gathering many DL tricks: shortcuts, layer normalization, the scale factor to better distribute attention scores, and multi-head attention for parallelization.
0 replies · 0 reposts · 1 like · 70 views
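The "scale factor" mentioned is the 1/√d_k division in scaled dot-product attention, which keeps the logits at unit scale so the softmax doesn't saturate. A minimal pure-Python sketch (toy 2-D matrices chosen for illustration):

```python
# Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by 1/sqrt(d_k)
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(logits)
        # weighted average of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # each query attends mostly to its matching key
```

Multi-head attention just runs several of these in parallel on projected slices of the input, which is the parallelization point in the tweet.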
visp @wrwagox ·
@ylecun @StevenLevy @DBahdanau @kchonyc Self-attention is NOT the innovation in Transformers. Others had used it before, as cited in their background section. The contribution is the realization that you no longer need recurrence, which, together with causal masking, enables training parallelization.
2 replies · 0 reposts · 31 likes · 4.7K views
Yann LeCun @ylecun ·
This leaves out a bunch of prior innovations that *clearly* inspired the transformer authors. Chief among them is the whole idea of attention, which was popularized by "Neural Machine Translation by Jointly Learning to Align and Translate" by @DBahdanau, @kchonyc, and Yoshua Bengio, posted in September 2014. This is the paper that started the attention craze: arxiv.org/abs/1409.0473

Self-attention is a clever trick which uses similarities between all pairs of inputs. This makes the network care about relationships between inputs, independently of their order (permutation equivariance). That's the real contribution of the 2017 transformer paper.

But what really boosted the craze was the application of Self-Supervised Learning to transformers, triggered by the 2018 BERT paper, also from Google: arxiv.org/abs/1810.04805
18 replies · 69 reposts · 643 likes · 231.9K views
Rafael Toledo @rafatold ·
@karpathy Can't wait for more Karpathy lecture videos!!!! The community needs you haha
0 replies · 0 reposts · 0 likes · 3 views
Andrej Karpathy @karpathy ·
Hi everyone, yes, I left OpenAI yesterday. First of all, nothing "happened" and it's not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been really great - the team is really strong, the people are wonderful, and the roadmap is very exciting, and I think we all have a lot to look forward to. My immediate plan is to work on my personal projects and see what happens. Those of you who've followed me for a while may have a sense for what that might look like ;) Cheers
1.5K replies · 1.3K reposts · 21.5K likes · 3.3M views
Rafael Toledo @rafatold ·
@karpathy One of the internet's problems now, as seen in LinkedIn content, for example, is that people have merged self-expression and marketing so much that you no longer know where one starts and the other ends.
0 replies · 0 reposts · 0 likes · 31 views
Andrej Karpathy @karpathy ·
The internet used to be ✨fun✨ projects.kwon.nyc/internet-is-fu… I remember visiting my friends' websites. They were ugly and quirky and it was awesome. You wondered who'd stop by yours. They were a labor of love and a medium of self-expression, not your LinkedIn. We can fight this.
137 replies · 237 reposts · 3.4K likes · 477.6K views
Rafael Toledo @rafatold ·
@Andercot Does this dollar reference take inflation into account?
1 reply · 0 reposts · 1 like · 366 views
Andrew Côté @Andercot ·
The opportunity cost of delayed technological progress has always been measured in human suffering.
Andrew Côté tweet media
21 replies · 123 reposts · 696 likes · 97.7K views
Andrej Karpathy @karpathy ·
The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:
1) expressive (in the forward pass)
2) optimizable (via backpropagation + gradient descent)
3) efficient (highly parallel compute graph)
50 replies · 542 reposts · 4.1K likes
Dan Elton @moreisdifferent ·
@karpathy What is meant here by "Absence of any flat tails."?
1 reply · 0 reposts · 4 likes
Rafael Toledo @rafatold ·
@AndrewYNg It's great to see these two big AI names bringing rational scenarios from both sides and leaving all the fuzzy buzz out.
0 replies · 0 reposts · 0 likes · 148 views
Andrew Ng @AndrewYNg ·
Had a great conversation with Yoshua Bengio. Both of us agreed that a good step forward for AI risk is to articulate the concrete scenarios where AI can lead to significant harm. More to come, and looking forward to continuing the conversation!
93 replies · 265 reposts · 1.7K likes · 430K views
Bert Kastel @KastelBert ·
@svscarpino @ylecun @Northeastern @Experiential_AI @usamaf Most? Says the ivory tower talking inside a bubble. 98%+ of all scientists do not work inside academia. They, engineers and practitioners, contribute. I have lots of respect for academia. But this statement is elitist, disrespectful, and can even be dangerous!
1 reply · 0 reposts · 2 likes · 127 views
Sam Scarpino @svscarpino ·
"Most good ideas still come from academia." 💯💯💯 Prof @ylecun on the role of academia in AI and the importance of good ideas (even if you don't have access to 50k GPUs for compute). Fireside chat @Northeastern with @Experiential_AI's @usamaf.
Sam Scarpino tweet media
Boston, MA 🇺🇸
9 replies · 10 reposts · 65 likes · 27.2K views