Marc Sun
@_marcsun
702 posts

Machine Learning Engineer @huggingface Open Source team
New York · Joined February 2023
487 Following · 1.6K Followers
Marc Sun reposted
Aritra 🤗
Aritra 🤗@ariG23498·
When you run a @PyTorch model on a GPU, the actual work is executed through kernels. These are low-level, hardware-specific functions designed for GPUs (or other accelerators). If you profile a model, you'll see a sequence of kernel launches. Between these launches, the GPU can sit idle, waiting for the next operation. A key optimization goal is therefore to minimize gaps between kernel launches and keep the GPU fully utilized. One common approach is `torch.compile`, which fuses multiple operations into fewer kernels, reducing overhead and improving utilization. Another approach is to write custom kernels tailored to specific workloads (e.g., optimized attention or fused ops). However, this comes with significant challenges:
> requires deep expertise in kernel writing
> installation hell
> non-trivial integration with the model
To address this, @huggingface introduces the `kernels` library. With it you can:
> build custom kernels (with the help of a template)
> upload them to the Hub (like models or datasets)
> integrate them into models with ease
Let's take a look at how the transformers team uses the `kernels` library to integrate it into already existing models. (more in the thread)
19 replies · 88 reposts · 1.2K likes · 82K views
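The fusion idea in the thread above can be illustrated without a GPU. The sketch below is a toy, pure-Python analogy (all names are invented for this example, and real fusion happens inside compiled GPU kernels, e.g. via `torch.compile`): several separate passes over the data stand in for several kernel launches, one combined pass for a single fused kernel.

```python
# Toy, pure-Python illustration of kernel fusion. Names are invented for
# this sketch; real fusion happens in GPU kernels (e.g. via torch.compile).
# Three elementwise ops applied as three separate passes ("three kernel
# launches") versus one fused pass over the data.

data = list(range(10_000))

def unfused(xs):
    # Three separate passes, like three kernel launches, each reading
    # and writing a full intermediate buffer.
    a = [x + 1 for x in xs]
    b = [x * 2 for x in a]
    return [x - 3 for x in b]

def fused(xs):
    # One pass computing the same result, like a single fused kernel:
    # no intermediate buffers, no extra launch overhead.
    return [(x + 1) * 2 - 3 for x in xs]

assert unfused(data) == fused(data)  # same math, fewer passes
```

The payoff on a real GPU is the same shape as here: fewer passes means less launch overhead and less memory traffic for intermediates.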
Marc Sun reposted
Unsloth AI
Unsloth AI@UnslothAI·
Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: github.com/unslothai/unsl…
Blog and Guide: unsloth.ai/docs/new/studio
Available now on Hugging Face, NVIDIA, Docker and Colab.
216 replies · 837 reposts · 5K likes · 1.5M views
Marc Sun reposted
Stas Bekman
Stas Bekman@StasBekman·
Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate, and TRL. For extensive details, please see this write-up: huggingface.co/blog/ulysses-sp Thanks a lot to @krasul for helping make it happen, and to the others in the HF team who helped with the integration.
Stas Bekman tweet media
4 replies · 21 reposts · 116 likes · 17.4K views
Marc Sun reposted
Sayak Paul
Sayak Paul@RisingSayak·
Introducing Modular Diffusers 🔥 The `DiffusionPipeline` abstraction in Diffusers has established a standard in the community, but it has also limited flexibility. Modular Diffusers breaks those shackles and enables the next generation of creative user workflows 🧨 Details ⬇️
7 replies · 9 reposts · 86 likes · 7.3K views
Junyang Lin
Junyang Lin@JustinLin610·
me stepping down. bye my beloved qwen.
1.7K replies · 738 reposts · 13.6K likes · 6.5M views
Georgi Gerganov
Georgi Gerganov@ggerganov·
Today ggml.ai joins Hugging Face Together we will continue to build ggml, make llama.cpp more accessible and empower the open-source community. Our joint mission is to make local AI easy and efficient to use by everyone on their own hardware.
Georgi Gerganov@ggerganov

I've started a company: ggml.ai From a fun side project just a few months ago, ggml has now become a useful library and framework for machine learning with a great open-source community

139 replies · 232 reposts · 1.6K likes · 294.8K views
Zach Mueller
Zach Mueller@TheZachMueller·
Over the last month I've been digging into model inference; what's the best out-of-the-box tokens/s on our hardware, and how do you benchmark it? Our model-inference revamp is now live, with model cards built to answer exactly this (in a community-focused way):
Zach Mueller tweet media
9 replies · 9 reposts · 60 likes · 13.3K views
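For context on the tweet above, the tokens/s figure such benchmarks report usually boils down to generated tokens divided by wall-clock time. Here is a minimal sketch of that measurement; `fake_generate` is a made-up stand-in for a real `model.generate()` call, and the real benchmarking setup (warmup, batching, prefill vs. decode split) is deliberately omitted.

```python
import time

# Hedged sketch of a tokens/s measurement: count generated tokens,
# divide by elapsed wall-clock time. `fake_generate` is an invented
# placeholder for an actual model's generate() call.

def fake_generate(n_tokens: int) -> list[int]:
    # Pretend to decode n_tokens token IDs.
    return list(range(n_tokens))

def tokens_per_second(n_tokens: int) -> float:
    start = time.perf_counter()
    out = fake_generate(n_tokens)
    elapsed = time.perf_counter() - start
    return len(out) / elapsed
```

Real benchmarks typically also separate time-to-first-token from steady-state decode throughput, since the two stress the hardware differently.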
Marc Sun reposted
Lysandre
Lysandre@LysandreJik·
The PyTorch Conference is coming to Europe on 7-8 April 2026! It'll be great to see all of you, so don't hesitate to come and talk there!
Lysandre tweet media
0 replies · 3 reposts · 13 likes · 582 views
Marc Sun reposted
Dan Alistarh
Dan Alistarh@DAlistarh·
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4. Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration! [1/4]
5 replies · 25 reposts · 170 likes · 19.3K views
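As a rough intuition for the 4-bit training mentioned above: values get mapped onto a tiny grid of representable levels via a shared scale. The sketch below uses a uniform symmetric int4 grid for simplicity; NVFP4 itself is an FP4 floating-point format with per-block scales, which this toy deliberately glosses over.

```python
# Toy 4-bit quantization round-trip (uniform symmetric int4 grid).
# This is an intuition-building sketch, not the NVFP4 format: NVFP4 uses
# FP4 values with fine-grained per-block scaling factors.

def quantize_4bit(xs: list[float]) -> tuple[list[int], float]:
    # One scale for the whole tensor; int4 range is [-8, 7].
    scale = max(abs(x) for x in xs) / 7 or 1.0  # avoid scale == 0
    q = [max(-8, min(7, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

vals = [0.9, -0.35, 0.05, -0.7]
q, s = quantize_4bit(vals)
approx = dequantize(q, s)
# Round-tripped values are close to, but not exactly, the originals:
# the quantization error per value is at most half a grid step.
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(vals, approx))
```

The research challenge the tweet refers to is keeping training stable when gradients and weights live on such a coarse grid, while still using the hardware's native 4-bit math for speed.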
Marc Sun reposted
Pavlo Molchanov
Pavlo Molchanov@PavloMolchanov·
🚀 New NVIDIA report: NVFP4 + Quantization-Aware Distillation (QAD)
FP4 inference without quality collapse. Key idea: distill a BF16 teacher into an NVFP4 student using a KL loss. Much more robust than PTQ/QAT, especially after SFT/RL.
🔥 Near-BF16 accuracy
⚡ ~2-3× throughput, ~1.8× memory savings vs FP8
🧠 Works for LLMs and VLMs (Nemotron Nano, Super, VL)
Technical report: huggingface.co/nvidia/NVIDIA-…
Research blog: research.nvidia.com/labs/nemotron/…
Hugging Face models: research.nvidia.com/labs/nemotron/…
NVIDIA AI Developer@NVIDIAAIDev

We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200. Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16. Nemotron 3 Nano NVFP4: nvda.ws/4t63z9y Tech Report: nvda.ws/4bj3pp0

3 replies · 17 reposts · 114 likes · 15.2K views
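The KL-based distillation objective mentioned in the tweet above can be sketched in plain Python: the low-precision student is trained to match the full-precision teacher's output distribution by minimizing KL(teacher || student). This is a conceptual illustration with hand-picked probability lists, not the report's implementation.

```python
import math

# Conceptual sketch of the QAD objective: minimize the KL divergence
# between the BF16 teacher's output distribution and the NVFP4 student's.
# Probabilities here are illustrative lists, not real model outputs.

def kl_divergence(teacher: list[float], student: list[float]) -> float:
    # KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 vanish.
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)

teacher_probs = [0.7, 0.2, 0.1]   # teacher's next-token distribution
student_probs = [0.6, 0.25, 0.15]  # quantized student's distribution
loss = kl_divergence(teacher_probs, student_probs)  # driven toward 0 in QAD
assert loss >= 0.0  # KL divergence is always non-negative
```

Matching full distributions in this way carries more signal per token than a hard-label cross-entropy, which is why distillation can recover accuracy that post-training quantization alone loses.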
Marc Sun
Marc Sun@_marcsun·
If you could fix ONE thing about `Trainer` in transformers, what would it be? Share your feedback: github.com/huggingface/tr… Thanks @UnslothAI @axolotl_ai and others for trusting and building on top of Trainer. We want to make sure you all get the best experience.
0 replies · 5 reposts · 10 likes · 1.8K views
Marc Sun reposted
NVIDIA AI Developer
NVIDIA AI Developer@NVIDIAAIDev·
We just launched an ultra-efficient NVFP4 precision version of Nemotron 3 Nano that delivers up to 4x higher throughput on Blackwell B200. Using our new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% accuracy of BF16. Nemotron 3 Nano NVFP4: nvda.ws/4t63z9y Tech Report: nvda.ws/4bj3pp0
NVIDIA AI Developer tweet media
24 replies · 87 reposts · 700 likes · 129.2K views
Marc Sun reposted
Lysandre
Lysandre@LysandreJik·
Transformers v5's FINAL, stable release is out 🔥 Transformers' biggest release. The big Ws of this release:
- Performance, especially for MoE (6x-11x speedups)
- No more slow/fast tokenizers -> way simpler API, explicit backends, better performance
- Dynamic weight loading: way faster, and enabling MoE to work with {quants, tp, peft, ...}
We have a migration guide on the main branch; please take a look at it in case you run into issues. Come to our GH issues if you still have problems after reading it 😀
Lysandre tweet media
9 replies · 87 reposts · 435 likes · 75.3K views
Marc Sun reposted
Sayak Paul
Sayak Paul@RisingSayak·
You can run ANY pipeline from Diffusers in @sgl_project and benefit from the open tooling for optimized inference in the space 🔥 Combine SGLang's optims + Diffusers' flexible options for optims to suit your needs 🤗 Kudos to @adarshxs for leading the work here!
Sayak Paul tweet media
2 replies · 9 reposts · 33 likes · 8.5K views
Marc Sun reposted
Radical Numerics
Radical Numerics@RadicalNumerics·
Scaling scientific world models requires co-designing architectures, training objectives, and numerics. Today, we share the first posts in our series on low-precision pretraining, starting with NVIDIA's NVFP4 recipe for stable 4-bit training. Part 1: radicalnumerics.ai/blog/nvfp4-par… Part 2: radicalnumerics.ai/blog/nvfp4-par… We cover floating point fundamentals, heuristics, custom CUDA kernels, and stabilization techniques. Future entries will cover custom recipes and results on hybrid architectures.
Radical Numerics tweet media
9 replies · 93 reposts · 525 likes · 66.8K views
Qubitium
Qubitium@qubitium·
Does anyone have a VM with an Ascend NPU that they can share with me for about 8 hours? I need this hardware to complete the Huawei NPU support for GPT-QModel. Thanks! @Huawei github.com/ModelCloud/GPT…
1 reply · 0 reposts · 0 likes · 174 views