Parameswaran Raman

111 posts

@paramsraman

Research Scientist @ Meta (Superintelligence Labs) | LLM Training Efficiency and Optimizer Design | Large Batch Scaling | Distributed AI Systems

San Jose, CA · Joined April 2010
440 Following · 191 Followers
Parameswaran Raman retweeted
Alexandr Wang @alexandr_wang
1/ today we're releasing muse spark, the first model from MSL. nine months ago we rebuilt our ai stack from scratch. new infrastructure, new architecture, new data pipelines. muse spark is the result of that work, and now it powers meta ai. 🧵
727
1.2K
10.3K
4.5M
Parameswaran Raman @paramsraman
We propose GPA (Generalized Primal Averaging), a new optimizer for LLM training that makes interesting connections to DiLoCo and Schedule-Free! Paper: arxiv.org/abs/2512.17131 and Code: github.com/facebookresear…. Check out the thread below for more details.
Hao-Jun Michael Shi @hjmshi

1/10 Are DiLoCo and Schedule-Free actually related? A brief history and unusually late advertisement for our work: Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs (see arxiv.org/abs/2512.17131).

0
0
3
122
Parameswaran Raman retweeted
Runa Eschenhagen @runame_
1/14 Is Muon “better” than Shampoo? We argue that their relationship parallels Adam's relationship with Signum. Analogous to @lukas_balles and Hennig’s (2018) decomposition of Adam into element-wise scaled Signum, we can decompose Shampoo as left- and right-adapted Muon.
3
45
264
32.3K
Parameswaran Raman retweeted
Aaron Defazio @aaron_defazio
Why do gradients increase near the end of training? Read the paper to find out! We also propose a simple fix to AdamW that keeps gradient norms better behaved throughout training. arxiv.org/abs/2506.02285
13
73
549
64K
Parameswaran Raman retweeted
Andrej Karpathy @karpathy
wow. The new model from @LumaLabsAI extending images into videos is really something else. I understood intuitively that this would become possible very soon, but it's still something else to see it and think through future iterations of. A few more examples around, e.g. the girl in front of the house on fire x.com/CharaspowerAI/…
Pierrick Chevallier | IA @CharaspowerAI

I've been wanting to get to the bottom of this story for so long. #DreamMachine #LumaAI 😂

126
549
5.7K
859.9K
Parameswaran Raman retweeted
Nando de Freitas @NandoDF
I absolutely love this education demo of @OpenAI. Let's make it available in all languages, and personalised to each country. For example, we could do Spanish for kids and teenagers in Bolivia, giving them access to a technical education that otherwise would not be available to them. Voice opens up the opportunity to do this even with old not-smart phones, e.g. a farmer in Ghana could start a conversation to get assistance on how to run a farm more efficiently and make more money. There's a great opportunity here to help people in the entire world, and the AI community should embrace it.
OpenAI @OpenAI

Math problems with GPT-4o and @khanacademy

5
13
138
37.9K
Parameswaran Raman retweeted
Hoi To Wai @hoitowai
In arxiv.org/abs/2404.10575, we propose an MCMC sampler for contrastive learning to look for negative samples - works especially well with small batch size and we showed stationary point convergence.
0
1
4
582
Parameswaran Raman @paramsraman
If you are at #AISTATS2024, check out our work "Krylov cubic regularized Newton: A subspace second-order method with dimension-free convergence rate", where we present a novel subspace method that converges fast by selecting a subspace with a handful of dimensions (size m <<< d).
1
1
4
848
Parameswaran Raman retweeted
Andrej Karpathy @karpathy
Congrats to @AIatMeta on Llama 3 release!! 🎉 ai.meta.com/blog/meta-llam…
Notes:
Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @ @lmsysorg :))
400B is still training, but already encroaching GPT-4 territory (e.g. 84.8 MMLU vs. 86.5 4Turbo).
Tokenizer: number of tokens was 4X'd from 32K (Llama 2) -> 128K (Llama 3). With more tokens you can compress sequences more in length, cites 15% fewer tokens, and see better downstream performance.
Architecture: no major changes from the Llama 2. In Llama 2 only the bigger models used Grouped Query Attention (GQA), but now all models do, including the smallest 8B model. This is a parameter sharing scheme for the keys/values in the Attention, which reduces the size of the KV cache during inference. This is a good, welcome, complexity reducing fix and optimization.
Sequence length: the maximum number of tokens in the context window was bumped up to 8192 from 4096 (Llama 2) and 2048 (Llama 1). This bump is welcome, but quite small w.r.t. modern standards (e.g. GPT-4 is 128K) and I think many people were hoping for more on this axis. May come as a finetune later (?).
Training data. Llama 2 was trained on 2 trillion tokens, Llama 3 was bumped to 15T training dataset, including a lot of attention that went to quality, 4X more code tokens, and 5% non-en tokens over 30 languages. (5% is fairly low w.r.t. non-en:en mix, so certainly this is a mostly English model, but it's quite nice that it is > 0).
Scaling laws. Very notably, 15T is a very very large dataset to train with for a model as "small" as 8B parameters, and this is not normally done and is new and very welcome. The Chinchilla "compute optimal" point for an 8B model would be train it for ~200B tokens. (if you were only interested to get the most "bang-for-the-buck" w.r.t. model performance at that size). So this is training ~75X beyond that point, which is unusual but personally, I think extremely welcome. Because we all get a very capable model that is very small, easy to work with and inference. Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence. Actually, I really hope people carry forward the trend and start training and releasing even more long-trained, even smaller models.
Systems. Llama 3 is cited as trained with 16K GPUs at observed throughput of 400 TFLOPS. It's not mentioned but I'm assuming these are H100s at fp16, which clock in at 1,979 TFLOPS in NVIDIA marketing materials. But we all know their tiny asterisk (*with sparsity) is doing a lot of work, and really you want to divide this number by 2 to get the real TFLOPS of ~990. Why is sparsity counting as FLOPS? Anyway, focus Andrej. So 400/990 ~= 40% utilization, not too bad at all across that many GPUs! A lot of really solid engineering is required to get here at that scale.
TLDR: Super welcome, Llama 3 is a very capable looking model release from Meta. Sticking to fundamentals, spending a lot of quality time on solid systems and data work, exploring the limits of long-training models. Also very excited for the 400B model, which could be the first GPT-4 grade open source release. I think many people will ask for more context length.
Personal ask: I think I'm not alone to say that I'd also love much smaller models than 8B, for educational work, and for (unit) testing, and maybe for embedded applications etc. Ideally at ~100M and ~1B scale.
Talk to it at meta.ai
Integration with github.com/pytorch/torcht…
138
992
7.6K
885.7K
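The scaling and utilization figures in the Llama 3 notes above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes only the numbers quoted in the tweet (15T training tokens, a ~200B-token Chinchilla estimate for an 8B model, 400 observed TFLOPS, and a 1,979 TFLOPS marketing peak halved to remove sparsity). The constant and function names are made up for illustration, and nothing here is measured.

```python
# Back-of-the-envelope check of figures quoted in the tweet.
# All values are taken from the tweet text; names are illustrative only.

TRAINING_TOKENS = 15e12        # Llama 3 training set size, as quoted
CHINCHILLA_TOKENS_8B = 200e9   # tweet's rough compute-optimal estimate for 8B
OBSERVED_TFLOPS = 400.0        # quoted observed per-GPU throughput
MARKETING_TFLOPS = 1979.0      # quoted H100 peak, which includes sparsity


def overtraining_factor(trained_tokens: float, optimal_tokens: float) -> float:
    """How many times past the compute-optimal token count training went."""
    return trained_tokens / optimal_tokens


def utilization(observed_tflops: float,
                marketing_tflops: float,
                sparsity_factor: float = 2.0) -> float:
    """Observed throughput as a fraction of the dense (non-sparse) peak."""
    dense_peak = marketing_tflops / sparsity_factor
    return observed_tflops / dense_peak


if __name__ == "__main__":
    factor = overtraining_factor(TRAINING_TOKENS, CHINCHILLA_TOKENS_8B)
    util = utilization(OBSERVED_TFLOPS, MARKETING_TFLOPS)
    print(f"Overtraining factor: {factor:.0f}x")
    print(f"Utilization: {util:.1%}")
```

Under these inputs the two results line up with the tweet's "~75X" and "~40% utilization". Changing the token estimate or the sparsity factor shifts both numbers accordingly.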
Parameswaran Raman @paramsraman
Interested in training SOTA LLMs end-to-end on Trainium - the AI chip purpose built by AWS? We shared our experience on training the LLaMA2 (7B) model on Trn here: arxiv.org/abs/2404.10630 (Code and scripts to follow soon)
1
0
3
110
Parameswaran Raman retweeted
Jeff Dean @JeffDean
On behalf of our co-authors Tomáš Mikolov, @ilyasut and Kai Chen, @greg_corrado and I were delighted to accept the #NeurIPS2023 Test of Time Award for the "word2vec" paper (arxiv.org/abs/1310.4546). Thanks to the @NeurIPSConf test of time committee for honoring us with this award! This work started as an earlier ICLR 2013 workshop paper (arxiv.org/abs/1301.3781) that explored a few different self-supervised techniques for learning word embeddings. The skip-gram approach worked better than others, and we scaled that and explored various alternative loss functions in the NeurIPS paper. The geometric relationships contained in the trained word embeddings were one thing about this work that I think people found interesting (see images from our talk below).
Google AI @GoogleAI

Congratulations to Jeff Dean, Greg Corrado, & co-authors of the paper “Distributed Representations of Words and Phrases and their Compositionality”, for winning the #NeurIPS2023 Test of Time Award! This prize recognizes a highly impactful paper published at NeurIPS 10 years ago.

38
115
1.4K
297.2K
Parameswaran Raman retweeted
NVIDIA AI @NVIDIAAI
Explore how @amazon leveraged the NVIDIA NeMo framework, GPUs, and EFA from @awscloud to train its next-generation LLM, giving some of the largest Amazon Titan foundation models customers a faster, more accessible solution for #generativeAI. #AWSreinvent nvda.ws/3uAxihf
0
20
73
21K
Parameswaran Raman retweeted
Jim Fan @DrJimFan
One of the best tutorial-style repos since @karpathy's minGPT! GPT-Fast: a minimalistic, PyTorch-only decoding implementation loaded with best practices: int8/int4 quantization, speculative decoding, Tensor parallelism, etc. Boosts the "clock speed" of LLM OS by 10x with no model change! We need more minGPTs and GPT-Fasts in the open-source world! Created by the awesome @cHHillee from PyTorch team. Blog: pytorch.org/blog/accelerat… Code: github.com/pytorch-labs/g…
24
399
2.5K
407.7K
Parameswaran Raman @paramsraman
Our group is hiring PhD interns for projects related to optimization and large-scale training of deep learning models. Desired background: Design and implementation of optimization algorithms. If interested, please get in touch. #internship2023 #phdinternships #machinelearning
1
1
6
537
Parameswaran Raman retweeted
Dan Fu @realDanFu
I'll be at #NeurIPS2022 this week! @tri_dao and I will be presenting FlashAttention (arxiv.org/abs/2205.14135) at Poster Session 4 Hall J #917, Wednesday 4-6 PM. Super excited to talk all things performance, ML+systems, and breaking down scaling bottlenecks!
2
5
48
0
Parameswaran Raman retweeted
Lilian Weng @lilianweng
Updated this 1-year-old post on diffusion models with some new content based on recent progress, including classifier-free guidance, GLIDE, unCLIP, Imagen and latent diffusion model.
Lilian Weng @lilianweng

Diffusion models are another type of generative model, besides GAN, VAE, and flow models. The idea is quite smart and clean. It is flexible enough to model any complex distribution while remaining tractable to evaluate the distribution. lilianweng.github.io/lil-log/2021/0…

13
120
953
0