Rafael Toledo
@rafatold

25 posts

AI / Computer Vision Engineer

Brazil · Joined August 2015
126 Following · 5 Followers
Rafael Toledo @rafatold ·
@dyah10 Giving Claude real creative superpowers is exactly what the space has been missing, not just chat, but actual creation at scale.
0 replies · 0 reposts · 0 likes · 195 views
Rafael Toledo reposted
Dominique @dyah10 ·
Turn Claude into the best creative agent in the world! Our users generate over 1 billion images and videos every year. Now Claude can too. RT and comment "Pixa" for free access!
1.1K replies · 635 reposts · 2.7K likes · 1.3M views
Rafael Toledo @rafatold ·
@quail_man2 @awsaf49 @karpathy It's funny but true. Most AI researchers worldwide don't have access to this knowledge in an intuitive way, to a solid network of experts, or to savvy tricks. So when Andrej spreads ideas that circulate in the inner circle of the big labs, people get amazed.
0 replies · 0 reposts · 0 likes · 31 views
Andrej Karpathy @karpathy ·
Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference.

I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I've done daily for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:

- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
Andrej Karpathy tweet media
974 replies · 2.1K reposts · 19.4K likes · 3.6M views
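The workflow Karpathy describes (propose a change, run a short experiment, keep it only if validation loss improves, plan the next idea from the results) can be sketched very loosely as a greedy search loop. Everything below is hypothetical: the config knobs (`wd`, `qk_scale`), the proxy objective, and the proposal rule are stand-ins, not his actual code.

```python
# Hypothetical sketch of an "autoresearch" tuning loop (not Karpathy's code):
# propose a config tweak, evaluate a cheap proxy for validation loss, and
# keep the change only if the proxy improves.
import random

def proxy_val_loss(config):
    # Stand-in for a short depth=12 training run: a toy bowl-shaped
    # objective over two made-up hyperparameters, minimized at (0.1, 1.5).
    return (config["wd"] - 0.1) ** 2 + (config["qk_scale"] - 1.5) ** 2 + 2.0

def propose(config, history, rng):
    # A real agent would reason over `history`; here we just jitter one knob.
    key = rng.choice(list(config))
    cand = dict(config)
    cand[key] += rng.gauss(0, 0.1)
    return cand

rng = random.Random(0)
config = {"wd": 0.0, "qk_scale": 1.0}
best = proxy_val_loss(config)
history = []
for step in range(700):  # ~700 autonomous changes, as in the post
    cand = propose(config, history, rng)
    loss = proxy_val_loss(cand)
    history.append((cand, loss))
    if loss < best:  # keep only changes that improve the proxy metric
        config, best = cand, loss
print(config, round(best, 3))
```

The real version replaces `proxy_val_loss` with an actual short training run and `propose` with an LLM agent that reads the experiment history, which is what makes the accepted changes transfer to larger models worth testing.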
Rafael Toledo @rafatold ·
@fal This looks like a really powerful tool. Impressive robustness across different scenes. I definitely recommend it.
1 reply · 0 reposts · 2 likes · 514 views
Rafael Toledo reposted
fal @fal ·
🚨 Pixelcut Background Removal is now on fal!
✂️ Extremely precise cutouts, even for hair, fur, and fine edges
⚡️ Sub-second background removal
🖼️ High-resolution output up to 2400×2400
fal tweet media
15 replies · 25 reposts · 230 likes · 26.1K views
Rafael Toledo @rafatold ·
@NoThanksHoney1 @ylecun @rao2z In high-dimensional spaces (as with image pixels, word embeddings, etc.), almost all points are far apart. So even inside the "data manifold", your model is extrapolating, operating in regions not seen before. You're always generalizing beyond your training examples.
0 replies · 0 reposts · 0 likes · 20 views
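The "almost all points are far apart" claim is the distance-concentration effect: as dimension grows, the nearest neighbor of a random point is nearly as far away as the average point. A small self-contained sketch (random uniform points; sample sizes and dimensions are arbitrary choices for illustration):

```python
# Distance concentration: min pairwise distance approaches the mean pairwise
# distance as dimensionality grows, so "nearby" training examples vanish.
import math
import random

def pairwise_stats(dim, n=100, seed=0):
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            dists.append(d)
    return min(dists), sum(dists) / len(dists)

for dim in (2, 10, 100, 500):
    dmin, dmean = pairwise_stats(dim)
    print(f"dim={dim:4d}  min/mean pairwise distance = {dmin / dmean:.2f}")
```

In 2-D the closest pair is a tiny fraction of the mean distance; by a few hundred dimensions the ratio is close to 1, which is the sense in which every query sits far from all training examples.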
Yann LeCun @ylecun ·
Pantheon of scientific territorial hubris:
Chemistry is just physics
Biology is just chemistry
Neuroscience is just biology
Psychology is just neuroscience
Economics is just psychology
Computer science is just math
Machine learning is just statistics
All of it is just philosophy!
53 replies · 168 reposts · 1.1K likes
Andrej Karpathy @karpathy ·
Excited to release new repo: nanochat! (It's among the most unhinged I've written.) Unlike my earlier similar repo nanoGPT, which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and in as little as 4 hours you can talk to your own LLM in a ChatGPT-like web UI.

It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Efficiently inference the model in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI
- Write a single markdown report card, summarizing and gamifying the whole thing

Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. About ~12 hours surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth-30 model trained for 24 hours (this is about equal to the FLOPs of GPT-3 Small 125M and 1/1000th of GPT-3) gets into the 40s on MMLU, 70s on ARC-Easy, 20s on GSM8K, etc.

My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned, or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think the overall skeleton is ok enough that it can go up on GitHub where all its parts can be improved. Link to the repo and a detailed walkthrough of the nanochat speedrun is in the reply.
Andrej Karpathy tweet media
690 replies · 3.4K reposts · 24.2K likes · 5.8M views
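The two quoted price points (~$100 for ~4 hours, ~$1000 for ~41.6 hours, both on 8×H100) are mutually consistent if one assumes a rate of roughly $3 per GPU-hour. That rate is my assumption for the arithmetic, not something stated in the post:

```python
# Back-of-envelope check of the quoted nanochat costs, assuming a
# hypothetical rental rate of ~$3 per H100 GPU-hour (assumed, not sourced).
gpus = 8
rate = 3.0  # USD per GPU-hour (assumption)
tier_small = gpus * 4 * rate     # ~4-hour speedrun
tier_large = gpus * 41.6 * rate  # ~41.6-hour run
print(f"${tier_small:.0f} and ${tier_large:.0f}")
```

Both land within a few percent of the quoted $100 and $1000 figures, so the tiers scale linearly with GPU-hours under that assumed rate.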
Rafael Toledo @rafatold ·
@giffmana @ylecun Besides, e.g. ResNet only starts operating at output stride (OS) 16 in its stage 4. So most of the CNN computation is done on a higher-resolution feature map than in the ViT counterpart.
0 replies · 0 reposts · 0 likes · 16 views
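The output-stride point can be made concrete with feature-map sizes. A standard ResNet downsamples gradually (strides 4, 8, 16, 32 across its blocks), so most of its layers run on maps finer than stride 16, while a ViT with 16×16 patches works at stride 16 from the very first layer. The stage names below are generic labels for illustration:

```python
# Feature-map resolution for a 224x224 input: typical ResNet strides per
# block vs. a ViT/16, which tokenizes to stride 16 immediately.
img = 224
for name, stride in [("stem", 4), ("early block", 8),
                     ("mid block", 16), ("last block", 32)]:
    side = img // stride
    print(f"ResNet {name:11s}: stride {stride:2d} -> {side}x{side} map")
side = img // 16
print(f"ViT/16 (all layers): stride 16 -> {side}x{side} = {side**2} tokens")
```

So a ViT/16 processes 14×14 = 196 tokens everywhere, whereas the CNN spends most of its compute on 56×56 and 28×28 maps, which is the asymmetry the tweet is pointing at.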
Rafael Toledo @rafatold ·
@giffmana @ylecun It'd be fairer to consider a patch size of 1 for ViT when comparing with CNNs for HR applications. Because when we need a dense, high-resolution output, such as image matting or semantic segmentation for applications like detecting road damage, we can't just apply a patch size of 16 at the start.
1 reply · 0 reposts · 0 likes · 58 views
Lucas Beyer (bl16) @giffmana ·
I wrote a blogpost, "On the speed of ViTs and CNNs". It addresses the following concerns I often hear:
- worry about ViTs' speed at high resolution
- how high a resolution do I need?
- is it super important to keep the aspect ratio?
I think @ylecun might like it too! Link below
Lucas Beyer (bl16) tweet media
24 replies · 91 reposts · 703 likes · 101.1K views
Rafael Toledo @rafatold ·
@karpathy Does anyone know the reference for these 2017 patches? GANs are from 2014; hadn't DL image generation started earlier?
0 replies · 0 reposts · 0 likes · 22 views
Rafael Toledo @rafatold ·
@barbarikon @kchonyc @ylecun "m" or "a"; F is the observed variable. If "m" is known, then you can estimate "a" as in a linear regression problem. That's my guess.
0 replies · 0 reposts · 0 likes · 71 views
Kyunghyun Cho @kchonyc ·
Once @ylecun told me (heavily paraphrased): it's not F=ma but \min (F-ma)^2. I didn't realize its importance at the time, but it is perhaps the most enlightening perspective I've ever heard.
38 replies · 37 reposts · 576 likes · 454.9K views
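The min (F-ma)² framing, combined with the reply above, amounts to ordinary least squares: given noisy observations (m_i, F_i), minimizing Σ(F_i - m_i·a)² over a has the closed form a = Σ m_i F_i / Σ m_i². A small synthetic sketch (all data here is made up for illustration):

```python
# Least-squares reading of F = ma: estimate the scalar "a" from noisy
# (mass, force) observations by minimizing sum((F - m*a)^2), whose
# closed-form minimizer is a = sum(m*F) / sum(m*m).
import random

random.seed(0)
true_a = 9.8  # pretend ground truth (e.g. gravitational acceleration)
ms = [random.uniform(1.0, 10.0) for _ in range(1000)]
Fs = [m * true_a + random.gauss(0, 0.5) for m in ms]  # noisy force readings

a_hat = sum(m * F for m, F in zip(ms, Fs)) / sum(m * m for m in ms)
print(round(a_hat, 2))
```

This is the sense in which a physical "law" becomes a regression problem: the equality is replaced by an objective, and the constant is whatever best explains the data.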
Rafael Toledo @rafatold ·
@_florianmai @ylecun @StevenLevy @DBahdanau @kchonyc You nailed it. I'd also mention the savvy way they composed the architecture to make it extremely scalable, gathering many DL tricks: shortcuts, layer normalization, the scale factor to better distribute attention scores, and multi-head attention for parallelization.
0 replies · 0 reposts · 1 like · 70 views
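The "scale factor" mentioned is the 1/√d_k division in scaled dot-product attention, which keeps the logits at unit scale so the softmax doesn't saturate. A minimal pure-Python sketch (toy 2-D matrices chosen for illustration):

```python
# Minimal scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by 1/sqrt(d_k)
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(logits)
        # weighted average of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # each query attends mostly to its matching key
```

Multi-head attention just runs several of these in parallel on projected slices of the input, which is the parallelization point in the tweet.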
visp @wrwagox ·
@ylecun @StevenLevy @DBahdanau @kchonyc Self-attention is NOT the innovation in Transformers. Others had used it before, as cited in their background section. The contribution is the realization that you no longer need recurrence, which, together with causal masking, enables training parallelization.
2 replies · 0 reposts · 31 likes · 4.7K views
Yann LeCun @ylecun ·
This leaves out a bunch of prior innovations that *clearly* inspired the transformer authors. Chief among them is the whole idea of attention, which was popularized by "Neural Machine Translation by Jointly Learning to Align and Translate" by @DBahdanau, @kchonyc, and Yoshua Bengio, posted in September 2014. This is the paper that started the attention craze: arxiv.org/abs/1409.0473

Self-attention is a clever trick which uses similarities between all pairs of inputs. This makes the network care about relationships between inputs, independently of their order (permutation equivariance). That's the real contribution of the 2017 transformer paper.

But what really boosted the craze was the application of Self-Supervised Learning to transformers, triggered by the 2018 BERT paper, also from Google: arxiv.org/abs/1810.04805
18 replies · 69 reposts · 643 likes · 231.9K views
Rafael Toledo @rafatold ·
@karpathy Can't wait for more Karpathy lecture videos!!!! The community needs you haha
0 replies · 0 reposts · 0 likes · 3 views
Andrej Karpathy @karpathy ·
Hi everyone, yes, I left OpenAI yesterday. First of all, nothing "happened" and it's not a result of any particular event, issue or drama (but please keep the conspiracy theories coming as they are highly entertaining :)). Actually, being at OpenAI over the last ~year has been really great - the team is really strong, the people are wonderful, and the roadmap is very exciting, and I think we all have a lot to look forward to. My immediate plan is to work on my personal projects and see what happens. Those of you who've followed me for a while may have a sense for what that might look like ;) Cheers
1.5K replies · 1.3K reposts · 21.5K likes · 3.3M views
Rafael Toledo @rafatold ·
@karpathy One of the internet's problems now, as seen in LinkedIn content, for example, is that people have merged self-expression and marketing so much that you no longer know where one starts and the other ends.
0 replies · 0 reposts · 0 likes · 31 views
Andrej Karpathy @karpathy ·
The internet used to be ✨fun✨ projects.kwon.nyc/internet-is-fu… I remember visiting my friends' websites. They were ugly and quirky and it was awesome. You wondered who'd stop by yours. They were a labor of love and a medium of self-expression, not your LinkedIn. We can fight this.
137 replies · 237 reposts · 3.4K likes · 477.6K views
Rafael Toledo @rafatold ·
@Andercot Does this dollar reference take inflation into account?
1 reply · 0 reposts · 1 like · 366 views
Andrew Côté @Andercot ·
The opportunity cost of delayed technological progress has always been measured in human suffering.
Andrew Côté tweet media
21 replies · 123 reposts · 696 likes · 97.7K views
Andrej Karpathy @karpathy ·
The Transformer is a magnificent neural network architecture because it is a general-purpose differentiable computer. It is simultaneously:
1) expressive (in the forward pass)
2) optimizable (via backpropagation + gradient descent)
3) efficient (highly parallel compute graph)
50 replies · 542 reposts · 4.1K likes
Dan Elton @moreisdifferent ·
@karpathy What is meant here by "Absence of any flat tails."?
1 reply · 0 reposts · 4 likes
Rafael Toledo @rafatold ·
@AndrewYNg It's great to see these two big AI names bringing rational scenarios from both sides and leaving all the fuzzy buzz out.
0 replies · 0 reposts · 0 likes · 148 views
Andrew Ng @AndrewYNg ·
Had a great conversation with Yoshua Bengio. Both of us agreed that a good step forward for AI risk is to articulate the concrete scenarios where AI can lead to significant harm. More to come, and looking forward to continuing the conversation!
93 replies · 265 reposts · 1.7K likes · 430K views
Bert Kastel @KastelBert ·
@svscarpino @ylecun @Northeastern @Experiential_AI @usamaf Most? Says the ivory tower talking inside a bubble. 98%+ of all scientists do not work inside academia. They, engineers and practitioners, contribute. I have lots of respect for academia. But this statement is elitist, disrespectful, and can even be dangerous!
1 reply · 0 reposts · 2 likes · 127 views
Sam Scarpino @svscarpino ·
"Most good ideas still come from academia." 💯💯💯 Prof @ylecun on the role of academia in AI and the importance of good ideas (even if you don't have access to 50k GPUs for compute). Fireside chat @Northeastern with @Experiential_AI's @usamaf.
Sam Scarpino tweet media
Boston, MA 🇺🇸
9 replies · 10 reposts · 65 likes · 27.2K views