Chen Lu
@_Chen_Lu_
PhD student in math @MIT
Joined March 2021
319 Following · 881 Followers
15 posts
Kevin Yang@kevinyang·
We raised $6.5M to build the agent for professionals. When your reputation is on the line, you need an agent that's reliable, secure, and one step ahead. Try it now at serif.ai
Chen Lu retweeted
Magic@magicailabs·
LTM-2-Mini is our first model with a 100 million token context window. That’s 10 million lines of code, or 750 novels. Full blog: magic.dev/blog/100m-toke… Evals, efficiency, and more ↓
Chen Lu@_Chen_Lu_·
@yapdianang Hey Dian Ang! Don't think there were too many surprises because this was a learning project. I did learn that convolutions are hard, and you really should worry about memory transfers
Chen Lu@_Chen_Lu_·
@iman2_718 @karpathy @Si_Boehm Thanks for the suggestion!! It looks promising for reducing the memory reloads for the convolutions, let me check it out
Chen Lu@_Chen_Lu_·
@TiggerSharkML @karpathy @Si_Boehm Ah that should be doable already with the kernels in Andrej's llm.c, since DiT uses the same components. You'd probably want to rewrite some kernels to get the best performance; otherwise you'll be doing a bunch of slow tensor permutes. Definitely would be cool to see though
Chen Lu@_Chen_Lu_·
@ChrisChoy208 @karpathy @Si_Boehm Ah thanks for spotting the cudaMalloc in the training loop Chris! I thought I had moved all of them outside the loop😅 Almost all of the memory is otherwise allocated before the training loop, so hopefully this is not affecting the times too much.
Chris Choy, Ph.D.@realChrisChoy·
@_Chen_Lu_ @karpathy @Si_Boehm This is impressive! You might want to check out memory allocators. cudaMalloc is an expensive operation, and replacing it with an allocator that caches allocations could improve timings!
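The caching-allocator pattern Chris suggests can be sketched in a few lines of plain C, with malloc/free standing in for cudaMalloc/cudaFree. Freed blocks are parked in a small cache keyed by size, so a later request of the same size reuses the block instead of paying for a fresh device allocation. The names `cached_alloc`/`cached_free` and the fixed-size cache are illustrative, not from any particular library:

```c
#include <stdlib.h>

/* Minimal caching-allocator sketch. In CUDA code, malloc/free below
 * would be cudaMalloc/cudaFree. */

#define CACHE_SLOTS 64

typedef struct {
    void  *ptr;
    size_t size;
} cached_block;

static cached_block cache[CACHE_SLOTS];

/* Allocate: first look for a cached block of the exact size. */
void *cached_alloc(size_t size) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr && cache[i].size == size) {
            void *p = cache[i].ptr;
            cache[i].ptr = NULL;
            return p;                 /* cache hit: no real allocation */
        }
    }
    return malloc(size);              /* cache miss: cudaMalloc here */
}

/* "Free": park the block in the cache; fall back to a real free if full. */
void cached_free(void *p, size_t size) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr == NULL) {
            cache[i].ptr  = p;
            cache[i].size = size;
            return;
        }
    }
    free(p);                          /* cudaFree here */
}
```

This is the same idea behind PyTorch's CUDA caching allocator: amortize the cost of device allocations across iterations that request the same buffer sizes every step.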
Chen Lu@_Chen_Lu_·
@karpathy Thanks Andrej! Big fan of your work 😊
Chen Lu@_Chen_Lu_·
The main targets for optimization are the forward and backward passes of convolutions, which are currently written in a matmul-like fashion. (3/3)
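Writing convolutions "in a matmul-like fashion" typically means an im2col lowering: unroll each KxK input patch into a row of a matrix, so the convolution becomes one matrix multiply against the flattened filter. A minimal single-channel, stride-1, valid-padding sketch in plain C (the names `im2col` and `conv_via_matmul` and this exact layout are illustrative, not the repo's):

```c
/* Unroll KxK patches of an HxW single-channel image into rows of `cols`
 * (shape [OH*OW x K*K]), so convolution becomes a matmul with the
 * flattened filter (shape [K*K x 1]). */
void im2col(const float *img, int H, int W, int K, float *cols) {
    int OH = H - K + 1, OW = W - K + 1;
    for (int oy = 0; oy < OH; oy++)
        for (int ox = 0; ox < OW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    cols[((oy * OW + ox) * K + ky) * K + kx] =
                        img[(oy + ky) * W + (ox + kx)];
}

/* The matmul step: each output pixel is a dot product of one row of
 * `cols` with the flattened KxK filter. */
void conv_via_matmul(const float *cols, const float *filt,
                     int n_out, int K, float *out) {
    for (int i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (int j = 0; j < K * K; j++)
            acc += cols[i * K * K + j] * filt[j];
        out[i] = acc;
    }
}
```

The memory-transfer concern from earlier in the thread shows up here: im2col duplicates each input pixel up to K*K times, which is why fused or tiled convolution kernels can beat the naive matmul lowering.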
Chen Lu@_Chen_Lu_·
Most of the effort so far was spent getting the whole model to work. There is still a lot of room for optimization. Current per-iteration timings on a single RTX 4090:
- this repo: 143ms
- PyTorch: 66ms
- PyTorch with torch.compile: 59ms
(2/3)