Chen Lu
@_Chen_Lu_
PhD student in math @MIT
Joined March 2021
319 Following · 881 Followers
15 posts
Kevin Yang@kevinyang·
We raised $6.5M to build the agent for professionals. When your reputation is on the line, you need an agent that's reliable, secure, and one step ahead. Try it now at serif.ai
Chen Lu retweeted
Magic@magicailabs·
LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels. Full blog: magic.dev/blog/100m-toke… Evals, efficiency, and more ↓
Chen Lu@_Chen_Lu_·
@yapdianang Hey Dian Ang! Don't think there were too many surprises because this was a learning project. I did learn that convolutions are hard, and you really should worry about memory transfers
Chen Lu@_Chen_Lu_·
@iman2_718 @karpathy @Si_Boehm Thanks for the suggestion!! It looks promising for reducing the memory reloads for the convolutions, let me check it out
Chen Lu@_Chen_Lu_·
@TiggerSharkML @karpathy @Si_Boehm Ah that should be doable already with the kernels in Andrej's llm.c, since DiT uses the same components. You'd probably want to rewrite some kernels to get the best performance; otherwise you'll be doing a bunch of slow tensor permutes. Definitely would be cool to see though
Chen Lu@_Chen_Lu_·
@ChrisChoy208 @karpathy @Si_Boehm Ah thanks for spotting the cudaMalloc in the training loop Chris! I thought I had moved all of them outside the loop😅 Almost all of the memory is otherwise allocated before the training loop, so hopefully this is not affecting the times too much.
Chris Choy, Ph.D.@realChrisChoy·
@_Chen_Lu_ @karpathy @Si_Boehm This is impressive! You might want to check out memory allocators. cudaMalloc is an expensive operation, and replacing it with an allocator that caches allocations could improve timings!
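The caching-allocator pattern Chris suggests can be sketched in a few lines of plain C, with malloc/free standing in for cudaMalloc/cudaFree. Freed blocks are parked in a small cache keyed by size, so a later request of the same size reuses the block instead of paying for a fresh device allocation. The names `cached_alloc`/`cached_free` and the fixed-size cache are illustrative, not from any particular library:

```c
#include <stdlib.h>

/* Minimal caching-allocator sketch. In CUDA code, malloc/free below
 * would be cudaMalloc/cudaFree. */

#define CACHE_SLOTS 64

typedef struct {
    void  *ptr;
    size_t size;
} cached_block;

static cached_block cache[CACHE_SLOTS];

/* Allocate: first look for a cached block of the exact size. */
void *cached_alloc(size_t size) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr && cache[i].size == size) {
            void *p = cache[i].ptr;
            cache[i].ptr = NULL;
            return p;                 /* cache hit: no real allocation */
        }
    }
    return malloc(size);              /* cache miss: cudaMalloc here */
}

/* "Free": park the block in the cache; fall back to a real free if full. */
void cached_free(void *p, size_t size) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr == NULL) {
            cache[i].ptr  = p;
            cache[i].size = size;
            return;
        }
    }
    free(p);                          /* cudaFree here */
}
```

This is the same idea behind PyTorch's CUDA caching allocator: amortize the cost of device allocations across iterations that request the same buffer sizes every step.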
Chen Lu@_Chen_Lu_·
@karpathy Thanks Andrej! Big fan of your work 😊
Chen Lu@_Chen_Lu_·
The main targets for optimization are the forward and backward passes of convolutions, which are currently written in a matmul-like fashion. (3/3)
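Writing convolutions "in a matmul-like fashion" typically means an im2col lowering: unroll each KxK input patch into a row of a matrix, so the convolution becomes one matrix multiply against the flattened filter. A minimal single-channel, stride-1, valid-padding sketch in plain C (the names `im2col` and `conv_via_matmul` and this exact layout are illustrative, not the repo's):

```c
/* Unroll KxK patches of an HxW single-channel image into rows of `cols`
 * (shape [OH*OW x K*K]), so convolution becomes a matmul with the
 * flattened filter (shape [K*K x 1]). */
void im2col(const float *img, int H, int W, int K, float *cols) {
    int OH = H - K + 1, OW = W - K + 1;
    for (int oy = 0; oy < OH; oy++)
        for (int ox = 0; ox < OW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    cols[((oy * OW + ox) * K + ky) * K + kx] =
                        img[(oy + ky) * W + (ox + kx)];
}

/* The matmul step: each output pixel is a dot product of one row of
 * `cols` with the flattened KxK filter. */
void conv_via_matmul(const float *cols, const float *filt,
                     int n_out, int K, float *out) {
    for (int i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (int j = 0; j < K * K; j++)
            acc += cols[i * K * K + j] * filt[j];
        out[i] = acc;
    }
}
```

The memory-transfer concern from earlier in the thread shows up here: im2col duplicates each input pixel up to K*K times, which is why fused or tiled convolution kernels can beat the naive matmul lowering.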
Chen Lu@_Chen_Lu_·
Most of the effort so far was spent getting the whole model to work. There is still a lot of room for optimization. Current per-iteration timings on a single RTX 4090:
- this repo: 143ms
- PyTorch: 66ms
- PyTorch with torch.compile: 59ms
(2/3)