Nilabhra Roy Chowdhury

633 posts

@nilabhraroy

I train LLMs

Dubai, United Arab Emirates · Joined September 2015

500 Following · 195 Followers

Nilabhra Roy Chowdhury@nilabhraroy·16 Mar

@soumithchintala Nice!

238

Soumith Chintala@soumithchintala·16 Mar

someone's getting started early!

168

3.6K

107.7K

Nilabhra Roy Chowdhury retweeted

Sayak Paul@RisingSayak·13 Mar

The @bfl_ml team released Klein KV and showed how KV-caching can be incorporated in a flow pipeline 🤯 The idea is simple and elegant. In the first denoising step, reference image tokens are included in the full DiT forward pass. Their per-layer KVs are computed and cached. In the subsequent steps, KVs are computed only for the noisy latents, while the cached reference KVs are injected when computing attention. As a result, it delivers up to 2.5x speedups for multi-reference editing tasks over Klein. I basically learned about it from this PR: github.com/huggingface/di… The PR is poetry in motion and is from the BFL team itself! Kudos to them for always being the best when it comes to designing codebases for flow and diffusion models. The best! Check out the model here: huggingface.co/black-forest-l…
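
A minimal single-layer sketch of the caching idea described in the post above, in plain Python. The names `ToyLayer`, `first_step`, `later_step`, and `full_pass` are hypothetical and are not the Diffusers or BFL API. The toy only covers the case where the reference tokens entering the layer are identical across denoising steps; it says nothing about how the real multi-layer DiT behaves.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class ToyLayer:
    """One attention layer with a cache for reference-token K/V (illustrative)."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.ref_kv = None  # cached (keys, values) of the reference tokens

    def first_step(self, ref_tokens, noisy_tokens):
        # First denoising step: project the reference tokens once and cache them.
        ref_k = [matvec(self.Wk, t) for t in ref_tokens]
        ref_v = [matvec(self.Wv, t) for t in ref_tokens]
        self.ref_kv = (ref_k, ref_v)
        return self.later_step(noisy_tokens)

    def later_step(self, noisy_tokens):
        # Later steps: project only the noisy latents, reuse the cached reference K/V.
        ref_k, ref_v = self.ref_kv
        keys = ref_k + [matvec(self.Wk, t) for t in noisy_tokens]
        values = ref_v + [matvec(self.Wv, t) for t in noisy_tokens]
        return [attend(matvec(self.Wq, t), keys, values) for t in noisy_tokens]

def full_pass(layer, ref_tokens, noisy_tokens):
    """Uncached baseline: recompute K/V for every token on each call."""
    tokens = ref_tokens + noisy_tokens
    keys = [matvec(layer.Wk, t) for t in tokens]
    values = [matvec(layer.Wv, t) for t in tokens]
    return [attend(matvec(layer.Wq, t), keys, values) for t in noisy_tokens]
```

In this toy, `later_step` skips the reference-token projections on every step after the first, which is where the saving would come from when the references are long.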

131

21.9K

Nilabhra Roy Chowdhury retweeted

Joël Niklaus@joelniklaus·2 Mar

We just released pre-mixed, pre-shuffled pretraining datasets at 100BT scale. @asankhaya tested 50+ different mixture strategies at 1B scale. The winner? A static 50% finePDFs + 30% DCLM + 20% FineWeb-Edu blend. No fancy curriculum needed. We scaled this up to 100BT and pre-shuffled everything so you don't have to burn compute on sampling. Just use it: from datasets import load_dataset ds = load_dataset("HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled") Browse the full smol-data collection: huggingface.co/collections/Hu… Reproduce it yourself: github.com/huggingface/da… Read the methodology: huggingface.co/blog/codelion/…
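
A rough sketch of what a static, pre-shuffled mixture means, using the 50/30/20 split quoted above. The helper names `token_budgets` and `static_shuffled_mix` are made up for illustration, the "documents" are just source labels, and nothing here reflects how the released dataset was actually built.

```python
import random

# Mixture shares quoted in the post.
MIX = {"finePDFs": 0.50, "DCLM": 0.30, "FineWeb-Edu": 0.20}

def token_budgets(total_tokens_b, mix=MIX):
    """Split a total token budget (in billions) by the fixed shares."""
    return {name: total_tokens_b * share for name, share in mix.items()}

def static_shuffled_mix(num_docs, mix=MIX, seed=0):
    """Build the mix once, shuffle once, then read it sequentially at train time."""
    labels = []
    for name, share in mix.items():
        labels.extend([name] * round(num_docs * share))
    random.Random(seed).shuffle(labels)
    return labels

print(token_budgets(100))
print(static_shuffled_mix(10))
```

Because the shuffle happens once up front, a training loop can read the result in order without sampling from the three sources on the fly.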

199

13.7K

Nilabhra Roy Chowdhury@nilabhraroy·13 Jan

@RisingSayak @cneuralnetwork Why am I being tagged? 🤣

Sayak Paul@RisingSayak·13 Jan

@cneuralnetwork @nilabhraroy

650

neural nets.@cneuralnetwork·12 Jan

idk some seniors literally treat juniors like shit, with the excuse of teaching manners. I've seen so many cases of literal abuse from seniors to juniors in many colleges because of some non-existent ego problems for some reason. people have really forgotten how to be a decent human over some stupid attitude problems.

5.1K

Nilabhra Roy Chowdhury retweeted

Delip Rao e/σ@deliprao·18 Dec

Adversarial attacks on vision language action models. x.com/cosyposter/sta…

297

189.1K

Nilabhra Roy Chowdhury retweeted

Atli Kosson@AtliKosson·23 Oct

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
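
A small arithmetic sketch of why the decay convention could matter under µP. It assumes two things the post does not spell out: that hidden-layer learning rates are scaled as 1/width, and that "independent" weight decay means the per-step shrink is not multiplied by the learning rate. The function names are hypothetical.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Assumed µP rule for hidden layers: learning rate scales as 1/width."""
    return base_lr * base_width / width

def coupled_shrink(lr, wd):
    """AdamW-style coupled decay: w <- w - lr * wd * w, so shrink = lr * wd."""
    return lr * wd

def independent_shrink(wd):
    """Independent decay (as assumed here): w <- w - wd * w, regardless of lr."""
    return wd

base_lr, base_width = 1e-2, 256
for width in (256, 1024, 4096):
    lr = mup_hidden_lr(base_lr, base_width, width)
    print(width, coupled_shrink(lr, 0.1), independent_shrink(1e-3))
```

Under these assumptions the coupled shrink factor falls as the model gets wider, while the independent one stays fixed, so the long-run decay behaviour would no longer depend on width.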

334

77.3K

Nilabhra Roy Chowdhury retweeted

Aayush Karan@aakaran31·17 Oct

We found a new way to get language models to reason. 🤯 No RL, no training, no verifiers, no prompting. ❌ With better sampling, base models can achieve single-shot reasoning on par with (or better than!) GRPO while avoiding its characteristic loss in generation diversity.

250

1.7K

276.9K

Nilabhra Roy Chowdhury@nilabhraroy·3 Oct

@RisingSayak Well said. It provides insights into model optimisation after the prototyping phase, and those insights help the user discover new avenues for optimisation that would otherwise have been quite opaque.

Sayak Paul@RisingSayak·2 Oct

`torch.compile`, in a way, teaches you many good practices of implementing models like TensorFlow used to (yeah, I said that). Some personal favorites: 1> Forcing a model to NOT have graph breaks and recompilation triggers 2> CPU <> GPU syncs (reduce lookup time) 3> Whether regional compilation is desirable 4> Prepping the model for dynamism during compilation without perf drawbacks Then, in the context of diffusion models, delivering compilation benefits with critical scenarios like offloading and LoRAs is just a joyous engineering experience to implement! And then comes testing, which tops it all off (my most favorite part). If you're interested in all of it, I can recommend a post, "torch.compile and Diffusers: A Hands-On Guide to Peak Performance", that I co-authored with @anijain2305 and @BenjaminBossan!

1.7K

Nilabhra Roy Chowdhury@nilabhraroy·29 Sep

@giffmana 😆

167

Lucas Beyer (bl16)@giffmana·28 Sep

Guys, I have a theory... see Fig1 below.

585

96.6K

Nilabhra Roy Chowdhury retweeted

Nscale@nscale·11 Aug

We’re building Stargate Norway to support the most demanding AI workloads in the world — built for scale, speed, and sustainability. @nvidia's leadership in accelerated computing makes it possible to push the limits of what’s technically achievable, from training massive foundation models to deploying ultra-low-latency inference applications. 🎥 Watch the full announcement: youtube.com/watch?v=rWx824…

346

Nilabhra Roy Chowdhury@nilabhraroy·9 Aug

@Swatch Got mine in Mall of the Emirates :)

1.8K

Swatch@Swatch·9 Aug

MISSION TO EARTHPHASE - MOONSHINE GOLD is now available! Remember this timepiece is only available today August 9, 2025, at selected Swatch stores worldwide. #MoonSwatch #OMEGAxSwatch #Swatch swat.ch/3HoMaWE

561

35.8K

Nilabhra Roy Chowdhury retweeted

Nscale@nscale·6 Aug

GPT-OSS-120B and GPT-OSS-20B are now live on Nscale as day-zero serverless endpoints. No orchestration required. Just build. You’ll also find Nscale listed as an inference provider on @huggingface, making it even easier to get started wherever you build. At Nscale, we’re committed to getting powerful AI into the hands of practitioners—quickly, safely and without lock-in. Give them a whirl: nscale.com

109.8K

Nilabhra Roy Chowdhury retweeted

Nscale@nscale·31 Jul

Today, we’re proud to announce Stargate Norway—a landmark initiative by @nscale, in partnership with Aker ASA and @OpenAI — one of the most significant AI infrastructure investments in Europe.

2.1K

Nilabhra Roy Chowdhury retweeted

tender (mlsys 5/18-21)@tenderizzation·9 Jul

258

8.5K

Nilabhra Roy Chowdhury@nilabhraroy·8 Jul

@tenderizzation 😆

tender (mlsys 5/18-21)@tenderizzation·7 Jul

NCCL tree allreduces when the rank assignments match the network topology

@anthraxxxx

Malaysian team smoked South Korean team in cup stacking competition

158

10.1K

Nilabhra Roy Chowdhury@nilabhraroy·12 May

@rohanpaul_ai Thanks for the share!

Rohan Paul@rohanpaul_ai·14 Apr

Large language model (LLM) training suffers from gradient instability and loss spikes, and fixed gradient clipping methods fail to adapt dynamically. This paper proposes ZClip, an adaptive algorithm that adjusts the clipping threshold based on the gradient norm's recent statistical properties (mean and standard deviation). ZClip proactively mitigates spikes, enabling stable training even at higher learning rates and reaching baseline performance up to 35% faster in some tests.

📌 ZClip dynamically adapts its threshold to evolving gradient statistics, unlike fixed methods.
📌 Lightweight Exponential Moving Average statistics provide efficiency over history-based adaptive methods like AutoClip.
📌 By stabilizing higher learning rates (e.g., 3.0e-3), ZClip demonstrates up to 35% faster convergence.

Methods Explored in this Paper 🔧:

→ ZClip detects gradient norm anomalies using z-scores calculated relative to the gradient norm's running mean and standard deviation.
→ It efficiently tracks these statistics using exponential moving averages (EMA), avoiding the need to store a full history.
→ When a spike (z-score above a threshold) occurs, the gradient norm is scaled down, often using a reciprocal function based on the spike's severity (z-score value).
→ To prevent skewed statistics, updates to the mean and variance use the adjusted (clipped) gradient norm value during spike events.

Paper: arxiv.org/abs/2504.02507
Paper Title: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training"
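
The four steps in the summary above can be sketched in a few lines of plain Python. This is a toy written from the summary alone, not the authors' code: the class name `ZClipSketch`, the default values for `alpha`, `z_thresh`, and `warmup_steps`, the warmup handling, and the specific reciprocal form `mean + (z_thresh**2 / z) * std` are all assumptions made for illustration.

```python
import math

class ZClipSketch:
    """Toy z-score-based gradient-norm clipper (illustrative only)."""

    def __init__(self, alpha=0.97, z_thresh=2.5, warmup_steps=5):
        self.alpha = alpha              # EMA smoothing factor (assumed value)
        self.z_thresh = z_thresh        # z-score above which a norm counts as a spike
        self.warmup_steps = warmup_steps
        self.mean = None
        self.var = None
        self._warmup = []

    def _update(self, value):
        # EMA updates of the running mean and variance; no history is stored.
        delta = value - self.mean
        self.mean = self.alpha * self.mean + (1 - self.alpha) * value
        self.var = self.alpha * self.var + (1 - self.alpha) * delta * delta

    def step(self, grad_norm):
        """Return the norm the gradient should be rescaled to."""
        if self.mean is None:
            # Warmup: collect a few norms to initialise the statistics.
            self._warmup.append(grad_norm)
            if len(self._warmup) == self.warmup_steps:
                n = len(self._warmup)
                self.mean = sum(self._warmup) / n
                self.var = sum((g - self.mean) ** 2 for g in self._warmup) / n
            return grad_norm
        std = math.sqrt(self.var) + 1e-12
        z = (grad_norm - self.mean) / std
        if z > self.z_thresh:
            # Reciprocal scaling: the larger the spike, the closer to the mean.
            target = self.mean + (self.z_thresh ** 2 / z) * std
        else:
            target = grad_norm
        # Statistics are updated with the clipped value so spikes do not skew them.
        self._update(target)
        return target
```

In a training loop one would compute the total gradient norm, call `step`, and scale the gradients by `target / grad_norm` whenever the two differ.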

3.4K

Nilabhra Roy Chowdhury@nilabhraroy·11 Apr

@jxmnop What a legend.

249

Jack Morris@jxmnop·10 Apr

this guy invented VLLM. he's basically the john wick of CUDA kernels

151

916

13.9K

2.6M

Nilabhra Roy Chowdhury retweeted

Sayak Paul@RisingSayak·8 Apr

Want to benefit from `torch.compile()` while hotswapping LoRA adapters into your diffusion models? This is now possible, thanks to the OG @BenjaminBossan's incredible hard work! Follow the comments for a tutorial, code, etc.

111

14.9K

Nilabhra Roy Chowdhury@nilabhraroy·5 Apr

@cheatyyyy @JeffDean @jeremyphoward You don’t need all the experts to be in memory all the time. Experts can be swapped in and out as needed. This will be slow but possible.
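
One way to picture the swapping idea is a small least-recently-used cache of resident experts. This is only a sketch of the concept: `ExpertCache` and `load_fn` are hypothetical names, the "weights" are placeholder strings, and no real inference runtime is being described.

```python
from collections import OrderedDict

class ExpertCache:
    """Keep at most `capacity` experts resident; load the rest on demand."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # fetches an expert from slower storage
        self.resident = OrderedDict()   # expert_id -> weights, oldest first
        self.loads = 0                  # number of slow loads performed

    def get(self, expert_id):
        if expert_id in self.resident:
            # Cache hit: mark as most recently used.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            # Evict the least recently used expert to make room.
            self.resident.popitem(last=False)
        self.resident[expert_id] = self.load_fn(expert_id)
        self.loads += 1
        return self.resident[expert_id]

cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for routed_expert in [0, 1, 0, 2, 1]:
    cache.get(routed_expert)
print(cache.loads, list(cache.resident))
```

Every miss costs a transfer from slower memory, which is the "slow but possible" trade-off: the fewer experts fit at once, the more often the router's choices trigger a load.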

cheaty@cheatyyyy·5 Apr

@nilabhraroy @JeffDean @jeremyphoward scout is 109b the max that fits on my 3090 with 8k context is a qwen 32b model at 4.5bpw i would need 4 3090s to run it
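
The numbers in this reply can be checked with back-of-the-envelope arithmetic. The sketch below counts quantized weights only and assumes 24 GiB of VRAM per RTX 3090 and a uniform bits-per-weight figure; `weight_gib` and `gpus_for_weights` are hypothetical helper names.

```python
import math

def weight_gib(params_billion, bits_per_weight):
    """Size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

def gpus_for_weights(params_billion, bits_per_weight, vram_gib):
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_gib(params_billion, bits_per_weight) / vram_gib)

qwen_32b = weight_gib(32, 4.5)      # about 16.8 GiB
scout_109b = weight_gib(109, 4.5)   # about 57.1 GiB
print(f"{qwen_32b:.1f} GiB, {scout_109b:.1f} GiB")
print(gpus_for_weights(109, 4.5, 24))  # weights-only lower bound
```

The weights-only bound comes out at three cards, so the estimate of four is plausible once KV cache for the context window and runtime overhead are added on top.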

Jeremy Howard@jeremyphoward·5 Apr

Just read through the Llama 4 release announcement. I'm really grateful they've released this with open weights. But tbh I'm also pretty disappointed. The models are both giant MoEs that can't be run on consumer GPUs, even with quant. ai.meta.com/blog/llama-4-m… A big loss.😢

100

1.9K

299.3K

Nilabhra Roy Chowdhury@nilabhraroy·5 Apr

@JeffDean @jeremyphoward Scout should be able to run on a single GPU with a limited context length, right?

1.1K

Jeff Dean@JeffDean·5 Apr

@jeremyphoward Why can't you run them on consumer GPUs?

487

122.9K
