Kashif Rasul
@krasul
699 posts
Research Scientist working on Deep Learning, Time Series Forecasting, Reinforcement Learning, and HPC.
Berlin, Germany · Joined August 2007
355 Following · 2K Followers
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
check out this new notebook by @krasul on TimesFM 2.5, Google's time series foundation model, which is now supported in transformers: zero-shot forecasting, quantile predictions, LoRA fine-tuning, and forecasting with exogenous covariates. colab.research.google.com/github/hugging…
Kashif Rasul retweeted
Stas Bekman@StasBekman·
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate, and TRL. For extensive details, please see this writeup: huggingface.co/blog/ulysses-sp Thanks a lot to @krasul for helping make it happen, and to the others on the HF team who helped with the integration.
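For readers unfamiliar with the technique: Ulysses SP shards the sequence dimension across ranks and uses an all-to-all before attention, so each rank ends up holding the full sequence for a subset of heads. A minimal NumPy simulation of that exchange (toy shapes; a sketch of the idea, not the DeepSpeed implementation):

```python
import numpy as np

def ulysses_all_to_all(shards, num_heads):
    """Simulate the Ulysses all-to-all across P ranks.

    In:  shards[r] has shape [S/P, H, D] -- rank r holds a sequence
         shard with ALL attention heads.
    Out: out[r] has shape [S, H/P, D] -- rank r holds the FULL
         sequence for its own group of H/P heads, ready for attention.
    """
    P = len(shards)
    group = num_heads // P
    out = []
    for r in range(P):
        # every rank sends head-group r of its shard to rank r;
        # concatenating along the sequence axis restores full length
        pieces = [s[:, r * group:(r + 1) * group, :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out
```

After attention, the inverse all-to-all restores sequence sharding for the MLP, which is what lets sequence length scale with the number of GPUs.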
Sayak Paul@RisingSayak·
In case folks don't know already, there's a mini presence of @huggingface in India. Yeah, we work here from India, living in different cities. There are about 5 of us! I see many important events where I feel like, "oh damn, we should have been here, but it's too late." For any important open-source event, it's a no-brainer for me to represent HF there, but either the organizers aren't interested or we find out about it too late 🤷‍♂️
Kashif Rasul@krasul·
@jxmnop How is P sampled BTW? I assume it's Gaussian with 1/sqrt(r) normalization?
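For context on the question: the 1/sqrt(r) normalization Kashif asks about is the standard variance-preserving choice for a random projection. With i.i.d. N(0, 1) entries scaled by 1/sqrt(r), the projection satisfies E[P Pᵀ] = I, i.e. it is an approximate isometry. A quick NumPy check (my own sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4096  # toy sizes: project between r dims and d dims

# i.i.d. Gaussian entries, scaled so that E[P @ P.T] = I_d
P = rng.normal(size=(d, r)) / np.sqrt(r)

gram = P @ P.T  # should be close to the identity for large r
```

The off-diagonal entries of `gram` concentrate around 0 at rate O(1/sqrt(r)), which is why the scaling matters: without it, gradients through P would blow up with r.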
dr. jack morris@jxmnop·
at long last, the final paper of my phd 🧮 Learning to Reason in 13 Parameters 🧮 we develop TinyLoRA, a new fine-tuning method. with TinyLoRA + RL, models learn well with dozens or hundreds of params. example: we use only 13 parameters to train a 7B Qwen model from 76% to 91% on GSM8K 🤯
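The tweet doesn't spell out the parameterization, but Kashif's question about how P is sampled suggests a fixed random projection mapping a handful of trainable scalars into a full weight update. One plausible toy reading (a hypothetical construction for illustration, not the paper's actual method):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 32, 13   # k = number of trainable parameters (toy sizes)

# Frozen random basis: k fixed update directions, 1/sqrt(k)-normalized
basis = rng.normal(size=(k, d_out, d_in)) / np.sqrt(k)
theta = rng.normal(size=k)    # the ONLY trainable parameters

# The weight update is a linear combination of the frozen directions;
# only theta receives gradients during RL
delta_W = np.tensordot(theta, basis, axes=1)   # shape [d_out, d_in]
```

Under this reading, the optimizer state and gradient traffic shrink to k scalars per adapted matrix, which is what makes "13 parameters" plausible for a 7B model.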
Nathan Lambert@natolambert·
Has taken a long time to polish, but I'm slowly becoming very proud of rlhfbook dot com and do think it's a great resource for many people. A lot of hours (and tokens and reader feedback) have gone into making it right. I struggled through learning this when LLMs didn't know any of it; now they're the best tools possible for finding bugs. They'll let me amplify it as a source of truth and a place to learn post-training in many different ways.
Matej Sirovatka@m_sirovatka·
does anyone have a good resource on like all the various chat template footguns, ideally really in depth covering basically all a person can know?
Sabri Eyuboglu@EyubogluSabri·
When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call cartridges, can be trained once and reused for different user requests! GitHub: HazyResearch/cartridges
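To see why the KV cache dominates long-context cost, a back-of-the-envelope calculation helps. The shapes below are illustrative (a generic 7B-class dense model with full multi-head KV at fp16), not the specific models from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# e.g. 32 layers, 32 KV heads of dim 128, a 128k-token context, fp16
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=131072)
print(full / 2**30)        # 64.0 GiB for the cache alone
print(full / 39 / 2**30)   # ~1.6 GiB at the reported 39x average reduction
```

At these sizes the cache, not the weights, sets the per-request memory bill, which is what makes an offline-trained, reusable compressed cache attractive.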
Matej Sirovatka@m_sirovatka·
@krasul @jackminong yeah there is a bunch of stuff that can be fused, this is only "naive" torch implementation. Custom kernels next
Matej Sirovatka@m_sirovatka·
Materializing full logits is very memory-heavy. In prime-rl you can now use a vocab-chunked lm_head with fused logprobs+entropy, getting some CRAZY memory savings. You can just do things 🚀
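The trick Matej describes can be sketched without materializing the [T, V] logits matrix: stream over vocab chunks with an online logsumexp, accumulating both the normalizer and the probability-weighted logit sum needed for entropy. A minimal NumPy sketch with assumed shapes (prime-rl's fused version is a "naive torch implementation" per the reply above, and differs from this):

```python
import numpy as np

def chunked_logprob_entropy(hidden, w_lm, targets, chunk=128):
    """Per-token target log-probs and entropy, streaming over the vocab.

    hidden:  [T, d] final hidden states
    w_lm:    [d, V] lm_head weight
    targets: [T]    target token ids
    Peak logit memory is [T, chunk] instead of [T, V].
    """
    T, V = hidden.shape[0], w_lm.shape[1]
    m = np.full(T, -np.inf)   # running max logit
    s = np.zeros(T)           # running sum of exp(logit - m)
    w = np.zeros(T)           # running sum of exp(logit - m) * logit
    tgt = np.zeros(T)         # logits of the target tokens
    for c0 in range(0, V, chunk):
        logits = hidden @ w_lm[:, c0:c0 + chunk]      # [T, chunk] only
        new_m = np.maximum(m, logits.max(axis=1))
        scale = np.exp(m - new_m)                     # rescale old sums
        e = np.exp(logits - new_m[:, None])
        s = s * scale + e.sum(axis=1)
        w = w * scale + (e * logits).sum(axis=1)
        m = new_m
        rows = np.nonzero((targets >= c0) & (targets < c0 + chunk))[0]
        tgt[rows] = logits[rows, targets[rows] - c0]
    log_z = m + np.log(s)
    logprobs = tgt - log_z        # log p(target)
    entropy = log_z - w / s       # H = log Z - E_p[logit]
    return logprobs, entropy
```

The entropy identity H = log Z − Σ_v p_v · logit_v is what lets both quantities come out of one streaming pass, so no full softmax is ever formed.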
Kashif Rasul retweeted
Rémi Ouazan@remi_or_·
Just opened a PR to make continuous batching in transformers go EVEN faster🚆 With simple optimizations like no torch sync and more GPU-sided operations, we gained 10-14.5% throughput across 500 requests🥳 Soon, there will be native fast RL training in transformers. Keep up 😉
Kashif Rasul retweeted
Ferdinand Mom@FerdinandMom·
In collaboration with the @PyTorch team, we added a transformers modeling backend to the torchtitan library! This means training any dense model (MoE support coming soon) with torch.compile + FSDP/TP/PP/CP out of the box with no performance drop!
Kashif Rasul retweeted
Stas Bekman@StasBekman·
Ulysses Sequence Parallelism integration from Arctic Long Sequence Training has been merged into the @huggingface HF Trainer. github.com/huggingface/tr… Thanks to @krasul and @_marcsun for help with the integration, and to Weijie Zhang for being the first early adopter! There is also work being done on integration into HF TRL.
Kashif Rasul retweeted
Benny (Yufei) Chen@the_bunny_chen·
Reinforcement Learning for agents has been held back by a lack of standard infrastructure. Production agents don't live in clean "gyms"; they live in messy, async environments. Today we're open-sourcing Eval Protocol: a framework to run RL directly on your production agents. Day 0 support for trainers and environments like TRL (@huggingface), rLLM (@Agentica_), OpenEnv (@PyTorch), as well as support for proprietary trainers like @OpenAI RFT and Tinker from @thinkymachines. 🧵
Kashif Rasul@krasul·
@HeMuyu0327 @huggingface Great job! We decided to keep the same chat template so as not to generate more or fewer tokens, which avoids misalignment issues in the merging that could arise if the rollout passed through different chat templates. You are right that it will potentially be OOD for the teacher.
Muyu He@HeMuyu0327·
There is a "bug" in how @huggingface implements their on-policy distillation for teacher-student models with different tokenizers, and we have fixed it in our implementation using native Tinker.

The "bug": the student's rollout is retrieved raw for the computation of the teacher's logprobs, before KL is applied. But different tokenizers use different chat templates, so computing the teacher logprob on the student's chat-templated rollout sends it to a low-probability region where the teacher might not perform well.

Our fix in spider (our open-sourced distillation engine using @thinkymachines's Tinker): we remove and reapply the teacher's chat template on the rollouts, and KL-supervise the logprobs of student vs. teacher using their respective chat templates. This guarantees that the student models the teacher's probabilities as if the teacher were natively answering the question.

Up next: we want to see how the logprobs differ for the teacher with vs. without the correct chat template, so we will probe the KL divergence between the two scenarios. If they diverge quite a bit, that would show the importance of applying the correct template!
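The fix Muyu describes, stripping the student's chat template from the rollout and re-wrapping it in the teacher's before scoring, can be illustrated at the string level. The template tokens below are hypothetical stand-ins, not the actual formats of any real model pair:

```python
# Hypothetical chat-template markers, for illustration only
STUDENT_PRE, STUDENT_POST = "<|assistant|>", "<|end|>"
TEACHER_PRE, TEACHER_POST = "[ASST] ", " [/ASST]"

def retemplate_for_teacher(student_rollout: str) -> str:
    """Strip the student's template tokens and reapply the teacher's,
    so the teacher scores text in its own native format rather than
    in an out-of-distribution one."""
    body = student_rollout
    if body.startswith(STUDENT_PRE):
        body = body[len(STUDENT_PRE):]
    if body.endswith(STUDENT_POST):
        body = body[:-len(STUDENT_POST)]
    return TEACHER_PRE + body + TEACHER_POST
```

In a real pipeline this happens on token sequences via each tokenizer's own `apply_chat_template`, but the principle is the same: the teacher's logprobs are only meaningful on inputs formatted the way the teacher was trained.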
Kashif Rasul retweeted
Carlos Miguel Patiño@cmpatino_·
On-policy distillation is a promising way to train small models, but it’s usually limited to teacher–student pairs sharing the same tokenizer. With our GOLD method, you can now distill across different model families and even outperform GRPO! huggingface.co/spaces/Hugging…
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
Qwen released their new small and dense VLMs (Qwen3-VL). They're incredibly capable and among my all-time favourite VLMs. 🤗 We've prepared some resources to help you get started. Sharing in the next one.
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
Training long-context LLMs is getting easier! TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even in multi-node setups, seamlessly 💆 Combine TRL and accelerate to run it effortlessly!
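Context parallelism splits the sequence axis across GPUs so each device computes over (and stores activations for) only a slice of a long sequence. A toy sketch of just the sharding step (illustrative only, not TRL's implementation, which also handles attention across shards):

```python
import numpy as np

def shard_context(input_ids, world_size):
    """Split one long sequence into contiguous, near-equal shards,
    one per rank; each rank runs the model on its slice only."""
    return np.array_split(np.asarray(input_ids), world_size)

shards = shard_context(list(range(10)), world_size=4)
# ranks hold lengths [3, 3, 2, 2]; concatenating restores the sequence
```

Because attention mixes information across the whole sequence, the shards can't stay independent: implementations exchange KV or use an all-to-all (as in Ulysses SP above) around the attention step.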