Kashif Rasul
@krasul
699 posts
Research Scientist working on Deep Learning, Time Series Forecasting, Reinforcement Learning, and HPC.
Berlin, Germany · Joined August 2007
355 Following · 2K Followers
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
check out this new notebook by @krasul on TimesFM 2.5, Google's time series foundation model, which is now supported in transformers: zero-shot forecasting, quantile predictions, LoRA fine-tuning, and forecasting with exogenous covariates. colab.research.google.com/github/hugging…
Kashif Rasul retweeted
Stas Bekman@StasBekman·
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate, and TRL. For extensive details, please see this writeup: huggingface.co/blog/ulysses-sp Thanks a lot to @krasul for helping make it happen, and to the others on the HF team who helped with the integration.
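For readers unfamiliar with the technique: Ulysses SP shards the sequence dimension across ranks and uses an all-to-all before attention, so each rank ends up holding the full sequence for a subset of heads. A minimal NumPy simulation of that exchange (toy shapes; a sketch of the idea, not the DeepSpeed implementation):

```python
import numpy as np

def ulysses_all_to_all(shards, num_heads):
    """Simulate the Ulysses all-to-all across P ranks.

    In:  shards[r] has shape [S/P, H, D] -- rank r holds a sequence
         shard with ALL attention heads.
    Out: out[r] has shape [S, H/P, D] -- rank r holds the FULL
         sequence for its own group of H/P heads, ready for attention.
    """
    P = len(shards)
    group = num_heads // P
    out = []
    for r in range(P):
        # every rank sends head-group r of its shard to rank r;
        # concatenating along the sequence axis restores full length
        pieces = [s[:, r * group:(r + 1) * group, :] for s in shards]
        out.append(np.concatenate(pieces, axis=0))
    return out
```

After attention, the inverse all-to-all restores sequence sharding for the MLP, which is what lets sequence length scale with the number of GPUs.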
Sayak Paul@RisingSayak·
In case folks don't know already, there's a mini presence of @huggingface in India. Yeah, we work here from India, living in different cities. There are about 5 of us! I see many important events where I feel like, "oh damn, we should have been here, but it's too late." For any important open-source event, it's a no-brainer for me to represent HF there, but either the organizers aren't interested or we find out about it too late 🤷‍♂️
Kashif Rasul@krasul·
@jxmnop How is P sampled BTW? I assume it's Gaussian with 1/sqrt(r) normalization?
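For context on the question: the 1/sqrt(r) normalization Kashif asks about is the standard variance-preserving choice for a random projection. With i.i.d. N(0, 1) entries scaled by 1/sqrt(r), the projection satisfies E[P Pᵀ] = I, i.e. it is an approximate isometry. A quick NumPy check (my own sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4096  # toy sizes: project between r dims and d dims

# i.i.d. Gaussian entries, scaled so that E[P @ P.T] = I_d
P = rng.normal(size=(d, r)) / np.sqrt(r)

gram = P @ P.T  # should be close to the identity for large r
```

The off-diagonal entries of `gram` concentrate around 0 at rate O(1/sqrt(r)), which is why the scaling matters: without it, gradients through P would blow up with r.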
dr. jack morris@jxmnop·
at long last, the final paper of my phd 🧮 Learning to Reason in 13 Parameters 🧮 we develop TinyLoRA, a new fine-tuning method. with TinyLoRA + RL, models learn well with dozens or hundreds of params. example: we use only 13 parameters to train a 7B Qwen model from 76% to 91% on GSM8K 🤯
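The tweet doesn't spell out the parameterization, but Kashif's question about how P is sampled suggests a fixed random projection mapping a handful of trainable scalars into a full weight update. One plausible toy reading (a hypothetical construction for illustration, not the paper's actual method):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 32, 13   # k = number of trainable parameters (toy sizes)

# Frozen random basis: k fixed update directions, 1/sqrt(k)-normalized
basis = rng.normal(size=(k, d_out, d_in)) / np.sqrt(k)
theta = rng.normal(size=k)    # the ONLY trainable parameters

# The weight update is a linear combination of the frozen directions;
# only theta receives gradients during RL
delta_W = np.tensordot(theta, basis, axes=1)   # shape [d_out, d_in]
```

Under this reading, the optimizer state and gradient traffic shrink to k scalars per adapted matrix, which is what makes "13 parameters" plausible for a 7B model.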
Nathan Lambert@natolambert·
Has taken a long time to polish, but I'm slowly becoming very proud of rlhfbook dot com and do think it's a great resource for many people. A lot of hours (and tokens and reader feedback) have gone into making it right. I struggled through learning this when LLMs didn't know any of it; now they're the best tools possible for finding bugs. They'll let me amplify it as a source of truth and a place to learn post-training in many different ways.
Matej Sirovatka@m_sirovatka·
does anyone have a good resource on like all the various chat template footguns, ideally really in depth covering basically all a person can know?
Sabri Eyuboglu@EyubogluSabri·
When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average (enabling 26x higher tok/s and lower TTFT) while maintaining quality. These smaller KV caches, which we call cartridges, can be trained once and reused for different user requests! GitHub: HazyResearch/cartridges
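To see why the KV cache dominates long-context cost, a back-of-the-envelope calculation helps. The shapes below are illustrative (a generic 7B-class dense model with full multi-head KV at fp16), not the specific models from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# e.g. 32 layers, 32 KV heads of dim 128, a 128k-token context, fp16
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=131072)
print(full / 2**30)        # 64.0 GiB for the cache alone
print(full / 39 / 2**30)   # ~1.6 GiB at the reported 39x average reduction
```

At these sizes the cache, not the weights, sets the per-request memory bill, which is what makes an offline-trained, reusable compressed cache attractive.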
Matej Sirovatka@m_sirovatka·
@krasul @jackminong yeah there is a bunch of stuff that can be fused, this is only "naive" torch implementation. Custom kernels next
Matej Sirovatka@m_sirovatka·
Materializing full logits is very memory-heavy. In prime-rl you can now use a vocab-chunked lm_head with fused logprobs+entropy, getting some CRAZY memory savings. You can just do things 🚀
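The trick Matej describes can be sketched without materializing the [T, V] logits matrix: stream over vocab chunks with an online logsumexp, accumulating both the normalizer and the probability-weighted logit sum needed for entropy. A minimal NumPy sketch with assumed shapes (prime-rl's fused version is a "naive torch implementation" per the reply above, and differs from this):

```python
import numpy as np

def chunked_logprob_entropy(hidden, w_lm, targets, chunk=128):
    """Per-token target log-probs and entropy, streaming over the vocab.

    hidden:  [T, d] final hidden states
    w_lm:    [d, V] lm_head weight
    targets: [T]    target token ids
    Peak logit memory is [T, chunk] instead of [T, V].
    """
    T, V = hidden.shape[0], w_lm.shape[1]
    m = np.full(T, -np.inf)   # running max logit
    s = np.zeros(T)           # running sum of exp(logit - m)
    w = np.zeros(T)           # running sum of exp(logit - m) * logit
    tgt = np.zeros(T)         # logits of the target tokens
    for c0 in range(0, V, chunk):
        logits = hidden @ w_lm[:, c0:c0 + chunk]      # [T, chunk] only
        new_m = np.maximum(m, logits.max(axis=1))
        scale = np.exp(m - new_m)                     # rescale old sums
        e = np.exp(logits - new_m[:, None])
        s = s * scale + e.sum(axis=1)
        w = w * scale + (e * logits).sum(axis=1)
        m = new_m
        rows = np.nonzero((targets >= c0) & (targets < c0 + chunk))[0]
        tgt[rows] = logits[rows, targets[rows] - c0]
    log_z = m + np.log(s)
    logprobs = tgt - log_z        # log p(target)
    entropy = log_z - w / s       # H = log Z - E_p[logit]
    return logprobs, entropy
```

The entropy identity H = log Z − Σ_v p_v · logit_v is what lets both quantities come out of one streaming pass, so no full softmax is ever formed.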
Kashif Rasul retweeted
Rémi Ouazan@remi_or_·
Just opened a PR to make continuous batching in transformers go EVEN faster🚆 With simple optimizations like no torch sync and more GPU-sided operations, we gained 10-14.5% throughput across 500 requests🥳 Soon, there will be native fast RL training in transformers. Keep up 😉
Kashif Rasul retweeted
Ferdinand Mom@FerdinandMom·
In collaboration with the @PyTorch team, we added a transformers modeling backend to the torchtitan library! This means training any dense model (MoE support coming soon) with torch.compile + FSDP/TP/PP/CP out of the box with no performance drop!
Kashif Rasul retweeted
Stas Bekman@StasBekman·
Ulysses Sequence Parallelism integration from Arctic Long Sequence Training has been merged into the @huggingface HF Trainer. github.com/huggingface/tr… Thanks to @krasul and @_marcsun for help with the integration, and to Weijie Zhang for being the first early adopter! There is also work being done on integration into HF TRL.
Kashif Rasul retweeted
Benny (Yufei) Chen@the_bunny_chen·
Reinforcement Learning for agents has been held back by a lack of standard infrastructure. Production agents don't live in clean "gyms"; they live in messy, async environments. Today we're open-sourcing Eval Protocol: a framework to run RL directly on your production agents. Day 0 support for trainers and environments like TRL (@huggingface), rLLM (@Agentica_), OpenEnv (@PyTorch), as well as support for proprietary trainers like @OpenAI RFT and Tinker from @thinkymachines. 🧵
Kashif Rasul@krasul·
@HeMuyu0327 @huggingface Great job! We decided to keep the same chat template so as not to generate more or fewer tokens, which avoids misalignment issues in the merging that could arise if the rollout passed through different chat templates. You are right that it will potentially be OOD for the teacher.
Muyu He@HeMuyu0327·
There is a "bug" in how @huggingface implements their on-policy distillation for teacher-student models with different tokenizers, and we have fixed it in our implementation using native Tinker.

The "bug": the student's rollout is retrieved raw for the computation of the teacher's logprobs, before KL is applied. But different tokenizers use different chat templates, so computing the teacher logprob on the student's chat-templated rollout sends it to a low-probability region where the teacher might not perform well.

Our fix in spider (our open-sourced distillation engine using @thinkymachines's Tinker): we remove and reapply the teacher's chat template on the rollouts, and KL-supervise the logprobs of student vs. teacher using their respective chat templates. This guarantees that the student models the teacher's probabilities as if the teacher were natively answering the question.

Up next: we want to see how the logprobs differ for the teacher with vs. without the correct chat template, so we will probe the KL divergence between the two scenarios. If they diverge quite a bit, that would show the importance of applying the correct template!
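The fix Muyu describes, stripping the student's chat template from the rollout and re-wrapping it in the teacher's before scoring, can be illustrated at the string level. The template tokens below are hypothetical stand-ins, not the actual formats of any real model pair:

```python
# Hypothetical chat-template markers, for illustration only
STUDENT_PRE, STUDENT_POST = "<|assistant|>", "<|end|>"
TEACHER_PRE, TEACHER_POST = "[ASST] ", " [/ASST]"

def retemplate_for_teacher(student_rollout: str) -> str:
    """Strip the student's template tokens and reapply the teacher's,
    so the teacher scores text in its own native format rather than
    in an out-of-distribution one."""
    body = student_rollout
    if body.startswith(STUDENT_PRE):
        body = body[len(STUDENT_PRE):]
    if body.endswith(STUDENT_POST):
        body = body[:-len(STUDENT_POST)]
    return TEACHER_PRE + body + TEACHER_POST
```

In a real pipeline this happens on token sequences via each tokenizer's own `apply_chat_template`, but the principle is the same: the teacher's logprobs are only meaningful on inputs formatted the way the teacher was trained.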
Kashif Rasul retweeted
Carlos Miguel Patiño@cmpatino_·
On-policy distillation is a promising way to train small models, but it’s usually limited to teacher–student pairs sharing the same tokenizer. With our GOLD method, you can now distill across different model families and even outperform GRPO! huggingface.co/spaces/Hugging…
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
Qwen released their new small and dense VLMs (Qwen3-VL). They're incredibly capable and among my all-time favourite VLMs. 🤗 We've prepared some resources to help you get started. Sharing in the next one.
Kashif Rasul retweeted
Sergio Paniego@SergioPaniego·
Training long-context LLMs is getting easier! TRL now supports Context Parallelism (CP), letting you scale sequences across multiple GPUs, even in multi-node setups, seamlessly 💆 Combine TRL and accelerate to run it effortlessly!
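Context parallelism splits the sequence axis across GPUs so each device computes over (and stores activations for) only a slice of a long sequence. A toy sketch of just the sharding step (illustrative only, not TRL's implementation, which also handles attention across shards):

```python
import numpy as np

def shard_context(input_ids, world_size):
    """Split one long sequence into contiguous, near-equal shards,
    one per rank; each rank runs the model on its slice only."""
    return np.array_split(np.asarray(input_ids), world_size)

shards = shard_context(list(range(10)), world_size=4)
# ranks hold lengths [3, 3, 2, 2]; concatenating restores the sequence
```

Because attention mixes information across the whole sequence, the shards can't stay independent: implementations exchange KV or use an all-to-all (as in Ulysses SP above) around the attention step.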