Ferdinand Mom

599 posts

@FerdinandMom

Distributed & Decentralized training @HuggingFace

France · Joined October 2013
1.4K Following · 2.8K Followers
Pinned Tweet
Ferdinand Mom @FerdinandMom
Interested in 4D parallelism but feeling overwhelmed by the Megatron-LM codebase? We are currently cooking something with @Haojun_Zhao14 and @xariusrke 😉 In the meantime, here is a self-contained script that implements Pipeline Parallelism (AFAB + 1F1B) in 200 LOC 🧵👇
12 replies · 44 reposts · 230 likes · 26.7K views
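The two pipeline schedules named in the pinned post, AFAB (all-forward-all-backward) and 1F1B (one-forward-one-backward), differ only in the order in which a stage runs its microbatches. The sketch below is not the script from the post. It is a minimal illustration that builds the per-stage operation order and counts how many microbatches have activations held at once. The function names and the `("F", i)` / `("B", i)` encoding are hypothetical, and all communication between stages is left out.

```python
def afab_schedule(num_microbatches):
    """AFAB: run every forward pass, then every backward pass."""
    forwards = [("F", i) for i in range(num_microbatches)]
    backwards = [("B", i) for i in range(num_microbatches)]
    return forwards + backwards


def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """1F1B: warm-up forwards, alternating forward/backward, then cool-down."""
    # Earlier stages wait longer for the first gradient, so they run more
    # warm-up forwards before their first backward pass.
    warmup = min(num_stages - stage - 1, num_microbatches)
    ops = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: one forward immediately followed by one backward.
    while fwd < num_microbatches:
        ops.append(("F", fwd))
        ops.append(("B", bwd))
        fwd += 1
        bwd += 1
    # Cool-down: drain the backward passes that are still pending.
    while bwd < num_microbatches:
        ops.append(("B", bwd))
        bwd += 1
    return ops


def peak_in_flight(ops):
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    print(one_f_one_b_schedule(stage=0, num_stages=4, num_microbatches=8))
    print(peak_in_flight(afab_schedule(8)))
    print(peak_in_flight(one_f_one_b_schedule(0, 4, 8)))
```

Under these assumptions, AFAB holds activations for all microbatches on every stage, while 1F1B caps the first stage at `num_stages` microbatches in flight and the last stage at one. That memory saving is the usual reason for preferring 1F1B.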
Ferdinand Mom retweeted
Guilherme Penedo @gui_penedo
Today we’re announcing Macrodata Labs.
Over the last few years, @HKydlicek and I have been turning a large part of the internet into some of the largest open LLM pre-training datasets. Through FineWeb, FineWeb2, FinePDFs, FineTranslations, and related work, we got a front-row seat to how scaling compute and data drove progress in LLMs.
We are starting to see a similar takeoff in robotics. Building on advances in LLMs and VLMs, robotics is finally starting to scale. But physical data is messy in ways text isn’t: large video files, multi-rate sensors, many different formats, and open questions around what signals to record, which annotations matter, and how to turn all that context into better policies.
That makes data work in robotics especially important. Teams need to extract as much signal as possible from every demonstration, trajectory, video frame, and sensor stream, without rebuilding their whole data stack every time they change robot, sensors, format, or labeling method.
We think the right tooling for this is still missing. That is what we created Macrodata Labs to build.
Our first step is Refiner, an open-source framework for processing robotics datasets. We designed Refiner to handle a variety of robotics formats and help teams extract more signal from each demonstration. It is shipping today with support for hand-tracking, subtask annotation, and reward model scoring.
We are also launching a cloud version of Refiner, so teams can focus on their data instead of infrastructure. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, model deployments, failure recovery, and detailed observability built in.
We’re fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and many incredible angels from top AI labs and technology companies.
I’m excited to keep exploring how better data work can push the frontier of AI, now in the physical world. If @macrodata_labs sounds interesting to you, or if you are building in the space, I would love to hear from you.
25 replies · 31 reposts · 207 likes · 37.2K views
Ferdinand Mom retweeted
Macrodata Labs @macrodata_labs
Macrodata Labs is launching today to build infrastructure for the robotics data loop.
Robotics is starting to scale. Progress in LLMs and VLMs is making robots more capable, but the data layer behind robotics is still underbuilt. Physical-world data is messy and fragmented. Every robot, sensor setup, and lab has its own assumptions, and teams still spend too much time writing brittle scripts just to make their data usable.
The hard part is not only collecting more demonstrations. It is turning those demonstrations into datasets teams can train on, inspect, improve, and reuse as their policies and data collection setups change.
We built Refiner as our first step toward better infrastructure for robotics data. It is an open-source framework for turning messy robotics data into scalable, inspectable, training-ready datasets. Refiner helps teams process demonstrations, add annotations, run reward model scoring, and scale robotics data pipelines from local execution to managed cloud compute on the Macrodata Labs platform.
Starting today, you can use Refiner and the Macrodata Labs platform to make the most out of your robotics data.
We are fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, @Thom_Wolf, and business angels from leading AI labs and technology companies to make this mission possible. @gui_penedo @HKydlicek
1 reply · 11 reposts · 41 likes · 8.7K views
Ferdinand Mom retweeted
Loubna Ben Allal @LoubnaBenAllal1
Introducing Carbon 🧬 a family of open generative DNA foundation models.
Carbon-3B matches Evo2-7B while running 250x faster at inference. It can generate new DNA sequences and score the functional impact of mutations, zero-shot.
We borrowed a lot from how modern LLMs are trained, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:
Tokenizer. Most genomic models tokenize at the nucleotide/character level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.
Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5/6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).
Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation, like mixing a web corpus, but for biology.
We're releasing the models, training data, training code, evaluation suite, and a demo to play with.
More details in the technical report: github.com/huggingface/ca…
Demo to play with the model, with a biology primer for our ML friends ;) huggingface.co/spaces/Hugging…
16 replies · 82 reposts · 359 likes · 39.8K views
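The deterministic 6-mer tokenization described in the Carbon post can be illustrated with a short sketch. This is not the Carbon tokenizer. It assumes non-overlapping chunks, a fixed vocabulary of all 4^6 = 4096 strings over A/C/G/T, a hypothetical `UNK_ID` for chunks containing any other symbol, and that a trailing partial chunk is simply dropped. None of those choices are stated in the post.

```python
from itertools import product

K = 6
ALPHABET = "ACGT"

# Every possible 6-mer gets a fixed id, so the mapping needs no training.
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}
UNK_ID = len(VOCAB)  # hypothetical id for chunks with other symbols, e.g. N


def tokenize(seq):
    """Split a DNA string into non-overlapping 6-mers and map each to an id."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % K  # assumption: drop a trailing partial chunk
    return [VOCAB.get(seq[i:i + K], UNK_ID) for i in range(0, usable, K)]


def matching_nucleotides(a, b):
    """Count positions where two equal-length chunks agree."""
    return sum(x == y for x, y in zip(a, b))


if __name__ == "__main__":
    target, guess = "ACGTAC", "ACGTAA"
    print(len(VOCAB))                           # vocabulary size
    print(tokenize(target), tokenize(guess))    # two different token ids
    print(matching_nucleotides(target, guess))  # yet 5 of 6 nucleotides agree
```

The last two prints show the issue the post raises about cross-entropy: the two chunks differ in a single nucleotide, but at the token level they are simply two different classes, so a near miss is penalized like any other wrong token.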
Ferdinand Mom retweeted
Rémi Ouazan @remi_or_
Anyone interested in a CUDA deep dive that makes your workload 25% faster? 🧐 Just published a new blog post on asynchronous CPU / GPU inference: 100% insight, zero slop 😊 To learn how to remove all CPU overhead and use your GPU to the max, just read it 🔥
1 reply · 11 reposts · 25 likes · 3.6K views
Ferdinand Mom retweeted
Arthur Douillard @Ar_Douillard
The DiLoCo team at Google DeepMind and Google Research is proud to release Decoupled DiLoCo, the next frontier for resilient AI pre-training. Decoupled DiLoCo enables training with datacenters across the world, using heterogeneous hardware, and never halting the system despite hardware failures.
33 replies · 85 reposts · 608 likes · 2.7M views
Ferdinand Mom retweeted
Aksel @akseljoonas
Introducing ml-intern, the agent that just automated the post-training team @huggingface.
It's an open-source implementation of the real research loop that our ML researchers do every day. You give it a prompt, it researches papers, goes through citations, implements ideas in GPU sandboxes, iterates and builds deeply research-backed models for any use case. All built on the Hugging Face ecosystem.
It can pull off crazy things:
We made it train the best model for scientific reasoning. It went through citations from the official benchmark paper. Found OpenScience and NemoTron-CrossThink, added 7 difficulty-filtered dataset variants from ARC/SciQ/MMLU, and ran 12 SFT runs on Qwen3-1.7B. This pushed the score 10% → 32% on GPQA in under 10h. Claude Code's best: 22.99%.
In healthcare settings it inspected available datasets, concluded they were too low quality, and wrote a script to generate 1100 synthetic data points from scratch for emergencies, hedging, multilingual etc. Then upsampled 50x for training. Beat Codex on HealthBench by 60%.
For competitive mathematics, it wrote a full GRPO script, launched training with A100 GPUs on hf.co/spaces, watched rewards climb and then collapse, and ran ablations until it succeeded. All fully backed by papers, autonomously.
How does it work? ml-intern makes full use of the HF ecosystem:
- finds papers on arxiv and hf.co/papers, reads them fully, walks citation graphs, pulls datasets referenced in methodology sections and on hf.co/datasets
- browses the Hub, reads recent docs, inspects datasets and reformats them before training so it doesn't waste GPU hours on bad data
- launches training jobs on HF Jobs if no local GPUs are available, monitors runs, reads its own eval outputs, diagnoses failures, retrains
ml-intern deeply embodies how researchers work and think. It knows what data should look like and what good models feel like.
Releasing it today as a CLI and a web app you can use from your phone/desktop.
CLI: github.com/huggingface/ml…
Web + mobile: huggingface.co/spaces/smolage…
And the best part? We also provisioned $1k in GPU resources and Anthropic credits for the quickest among you to use.
138 replies · 642 reposts · 4.7K likes · 1.2M views
Ferdinand Mom retweeted
clem 🤗 @ClementDelangue
Next steps:
- enable the 50,000 models available in inference providers
- enable the 3,000,000 models available on HF
- local free fast inference with llama.cpp
- train and bring your own model!
We don't want a world where you're forced to choose between two or three lookalike models with the same biases, limitations, forced to pay fortunes in tokens even for small tasks and send all your data to the cloud.
We want a world where you have real model choice, options and freedom for your agents. Cloud, local, small, big, specialized, general, English or French, fast or slow, from six months ago or from six seconds ago, from third party or your own!
Let's go!
Nous Research @NousResearch

We have integrated @huggingface as a first-class inference provider in Hermes Agent. When you select Hugging Face in the model picker it now shows 28 curated models organized by use case, with a custom option for the 100+ other models they serve.

34 replies · 64 reposts · 689 likes · 62.8K views
Ferdinand Mom retweeted
Arthur Douillard @Ar_Douillard
Training distributed DiLoCo / SparseLoCo over eduroam wifi, awesome!
Swarnim Jain @swar_ja

I trained models across MacBooks using Apple's AirDrop protocol. grove is a distributed training library for Apple Silicon. Devices discover each other over AWDL, a direct radio link. If there's a shared WiFi network it upgrades to that for speed, otherwise everything goes over the direct link. No router, no cloud, no setup. grove start