Louis Béthune

65 posts

@LouisBAlgue

Please constrain the Lipschitz constant of your networks.

Toulouse, France · Joined July 2020

198 Following · 136 Followers

Louis Béthune retweeted

Jason Ramapuram @jramapuram · Feb 26

Autoregressive models dominate, but what if we treat multimodal generation as discrete, order-agnostic iterative refinement? Excited to share our systematic study on the design space of Tri-Modal Masked Diffusion Models (MDMs). We pre-trained the first Tri-Modal MDM from scratch on (text,), (image, text), and (audio, text). The same model can do ASR, TTS, T2I, captioning and native text generation. What I'm most proud of in this work is the scientific rigor. Over 3,500 training runs. Principled hyperparameter transfer. Honest results. Carefully controlled ablations across multiple axes of entanglement. A thread on our empirical findings (arXiv: arxiv.org/abs/2602.21472)

Louis Béthune retweeted

Bruno Mlodozeniec @brunorganised · Jan 6

In our new work, Complete(d)P, we tackle 3 questions about hyperparameter (HP) scaling:
• How to transfer across model size, tokens & batch size? → Complete(d)P
• Do per-module HPs matter? ✔️ 2x speed-ups possible
• Do they transfer to larger scale? ✔️ With the right parameterisation

Louis Béthune retweeted

Mustafa Shukor @MustafaShukor1 · Jul 15

We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵

Louis Béthune retweeted

Dan Busbridge @danbusbridge · Jul 12

@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708) with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin 🔗 icml.cc/virtual/2025/p…

Louis Béthune retweeted

Rohan Paul @rohanpaul_ai · Nov 30

This paper maps hardware-cost sweet spots for training efficient small-scale language models. Data shows A100-40GB beats H100 for training cost-effective small language models.

🎯 Original Problem: Training small-scale LLMs (under 2B parameters) faces unclear computational bottlenecks. No systematic study exists on optimal hardware configurations and training dynamics for these models.

-----

🔧 Solution in this Paper:
• Analyzed training behavior up to 2B parameter models across:
  - GPU types (A100-40GB, A100-80GB, H100-80GB)
  - Batch sizes (4 to 64 per device)
  - Communication protocols (DDP vs FSDP)
  - Attention mechanisms (Vanilla vs FlashAttention)
  - GPU counts (1 to 64)
• Used Token/Dollar and Token/Second as key metrics for cost efficiency

-----

💡 Key Insights:
• FlashAttention gives higher efficiency gains in smaller models
• A100-40GB is cost-optimal for smaller models
• H100 GPUs are not cost-efficient for training small LLMs
• DDP works better for smaller models due to less communication overhead
• FSDP outperforms DDP for 2B parameter models with large batch sizes
• Cost efficiency saturates before maximum GPU memory utilization

-----

📊 Results:
• FlashAttention enables training 1B-2B models with 512 batch size
• A100-80GB performs best for 1B-2B models with 32+ GPUs
• FSDP with gradient/optimizer state sharding beats full sharding
• DDP shows 20-30% better Token/Dollar for sub-1B models

Louis Béthune retweeted

AK @_akhaliq · Nov 22

Apple releases AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders

Louis Béthune @LouisBAlgue · Nov 10

@cloneofsimo That’s why I use mlx.data. Hackable and fast. Full C++. github.com/ml-explore/mlx…

Simo Ryu @cloneofsimo · Nov 10

I wrote a custom C++ dataloader and I'm getting 2000 images/sec with 4 threads. I'm very frustrated to see my reference PyTorch implementation so slow, to the point that I feel like something is straight up wrong with my implementation, which I've been doing for the past 5 years. Is there a hidden config for torch.utils.data.DataLoader that I'm missing that can fill this gap?

Louis Béthune retweeted

Pierre Ablin @PierreAblin · Oct 4

🍏 Apple ML research in Paris has multiple open internship positions!🍎 We are looking for Ph.D. students interested in generative modeling, optimization, large-scale learning or uncertainty quantification, with applications to challenging scientific problems. Details below 👇

Louis Béthune @LouisBAlgue · Sep 15

@diegoasua @shortstein I don’t need to physically materialize the square. Generating the point coordinates and checking their distance to the center can be done purely numerically.

Diego @diegoasua · Sep 15

@LouisBAlgue @shortstein how do you set coordinates in the unit square without a ruler?

Louis Béthune @LouisBAlgue · Sep 15

@francoisfleuret Why refactor? Don’t lie to yourself, you’ll never re-run this XP anyway. Restart from scratch!

François Fleuret @francoisfleuret · Sep 15

Me refactoring: *run* *crash* *fix* *run* *crash* *fix* ...

Louis Béthune @LouisBAlgue · Sep 14

@aldopacchiano @shortstein This one is smart

Aldo Pacchiano @aldopacchiano · Sep 14

@shortstein I would use the fact that the sum of 1/n^2 is pi^2/6.
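A minimal sketch of this suggestion, assuming the goal is to recover pi using only additions, divisions and one square root. Since the sum of 1/n^2 converges to pi^2/6, the square root of six times a partial sum approaches pi from below. The function name `pi_from_basel` is made up for this illustration.

```python
import math


def pi_from_basel(n_terms: int) -> float:
    """Estimate pi from the first n_terms of the series sum(1/n^2) = pi^2/6."""
    partial_sum = sum(1.0 / (n * n) for n in range(1, n_terms + 1))
    # Invert pi^2 / 6 = partial_sum to get the estimate of pi.
    return math.sqrt(6.0 * partial_sum)


if __name__ == "__main__":
    for n_terms in (10, 1_000, 100_000):
        print(n_terms, pi_from_basel(n_terms))
```

The truncation error of the partial sum shrinks roughly like 1/N, so each extra digit of pi costs about ten times more terms.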

Louis Béthune @LouisBAlgue · Sep 14

@shortstein Ok I just checked, I will die of old age before discovering that pi = 3.141. 🥲

Louis Béthune @LouisBAlgue · Sep 14

@shortstein I would use the calculator to design a pseudo-random generator, and would sample points in a unit square, relying on Monte Carlo estimates. That’s inefficient, but I don’t remember Taylor expansions :(
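A rough sketch of the approach described here, not the author's actual procedure. It assumes a linear congruential generator as the hand-built pseudo-random generator (the constants are the commonly quoted Numerical Recipes ones, chosen for illustration) and tests each point of the unit square against the inscribed disk of radius 0.5 centered at (0.5, 0.5), whose area is pi/4. The names `lcg` and `estimate_pi` are hypothetical.

```python
def lcg(seed: int):
    """Yield pseudo-random floats in [0, 1) from a linear congruential generator."""
    a, c, m = 1664525, 1013904223, 2**32  # illustrative constants, an assumption
    state = seed % m
    while True:
        state = (a * state + c) % m
        yield state / m


def estimate_pi(n_points: int, seed: int = 12345) -> float:
    """Monte Carlo estimate of pi from points sampled in the unit square."""
    rng = lcg(seed)
    inside = 0
    for _ in range(n_points):
        x, y = next(rng), next(rng)
        # Squared distance to the center of the square, compared with radius^2.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.25:
            inside += 1
    # The disk covers a fraction pi/4 of the square.
    return 4.0 * inside / n_points


if __name__ == "__main__":
    for n_points in (1_000, 100_000):
        print(n_points, estimate_pi(n_points))
```

The statistical error of this estimator decays like 1/sqrt(N), so each additional correct digit needs about a hundred times more samples, which is why the method is so slow by hand.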

Louis Béthune @LouisBAlgue · Sep 9

@Napoolar @hugues_va @miniapeur Indeed we have this (arxiv.org/abs/2308.14335), which is more concerned with the theory of distribution regression in general.

Thomas Fel @thomas_fel_ · Sep 8

@hugues_va @miniapeur @LouisBAlgue you also have a follow-up, no?

Mathieu @miniapeur · Sep 6

Can someone point me to relevant resources about kernels that use optimal transport?

Louis Béthune @LouisBAlgue · Aug 4

@miniapeur I’d say that the Solomonoff induction hypothesis is the best link we have.

Mathieu @miniapeur · Aug 4

How are information theory and intelligence related to each other? Compression?

Louis Béthune @LouisBAlgue · Feb 28

@amirrahnama_ @ThibautBoissin @Napoolar That's not the explanation technique that satisfies Lipschitz continuity. That's the model itself. The fact that the model is explainable is more a consequence of the Lipschitzness and the loss used for training. Lipschitz+OT = interpretable model.

🅐🅜🅘🅡 🅡🅐🅗🅝🅐🅜🅐 @amirrahnama_ · Feb 28

@LouisBAlgue @ThibautBoissin @Napoolar What arguments are there that explanation techniques should satisfy Lipschitz continuity? I never really understood this. Alvarez Melis and Jaakkola (2018) and Montavon, methods for interpreting DNN (2018), never discussed this.

Louis Béthune @LouisBAlgue · Dec 11

Interested in results at the intersection between explainability 🔍 and optimal transport 🚚? Come check out "On the explainable properties of 1-Lipschitz Neural Networks: An Optimal Transport Perspective" on Tuesday at 5:15pm, panel #1508.

Louis Béthune retweeted

Thomas Fel @thomas_fel_ · Feb 22

👋👨‍🍳🍵 After a year of cooking up a secret project, I'm thrilled to officially reveal: The 𝐋𝐄𝐍𝐒 𝐏𝐫𝐨𝐣𝐞𝐜𝐭. By combining modern tools of Explainable AI, how much can we explain a ResNet50? 🧶

Louis Béthune @LouisBAlgue · Feb 20

@ccanonne_ Two weird things: (1) TV can be defined without assuming that X is a metric space. So the result is true regardless of the metric d(,). (2) d()>epsilon suggests that epsilon has the dimension of a distance (feet, meters), but P()<1-epsilon suggests that epsilon is dimensionless.

Louis Béthune retweeted

Michael Arbel @MichaelArbel · Feb 19

📢 *PhD opening* at @inria_grenoble! Edouard Pauwels, @vaiter and I are looking for a student to work with us on learning theory for bilevel optimization, in particular the implicit bias in bilevel optimization. If interested, please reach out!

Louis Béthune @LouisBAlgue · Feb 6

@jskherman @prerat @_candroid My guess would be to apply a bijection F between [0,T] and itself so that F#f is uniform (with f the non-uniform density) and reapply the same strategy on F^{-1}(1/e).
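One possible reading of this reply, sketched under assumptions: if the arrival times have a known, continuous cumulative distribution function on [0, T], warping time by that CDF makes arrivals uniform, so the observation phase would end at the time t* where the CDF reaches 1/e. The sketch below uses a CDF normalized to [0, 1] rather than a map of [0, T] onto itself, and the ramp density f(t) = 2t/T^2 is a made-up example. The helper name `cutoff_time` is hypothetical.

```python
import math


def cutoff_time(cdf, horizon: float, target: float = 1 / math.e,
                tol: float = 1e-9) -> float:
    """Bisect for the time t in [0, horizon] where cdf(t) reaches target.

    Assumes cdf is continuous and nondecreasing with cdf(0) = 0
    and cdf(horizon) = 1.
    """
    lo, hi = 0.0, horizon
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    T = 10.0
    # Hypothetical density f(t) = 2t / T^2, whose CDF is (t / T)^2.
    t_star = cutoff_time(lambda t: (t / T) ** 2, T)
    print(t_star)  # close to T * exp(-1/2)
```

With a uniform density the same helper returns T/e, which recovers the usual 37% rule.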

jskherman ☄️ @jskherman · Feb 6

@prerat @_candroid How can this be generalized to a non-uniform arrival time density f so that it can be applied for more random encounters?

prerat @prerat · Feb 5

um actually the correct way to do it is to reject the first 37% of partners (that's 1/e btw) and then pick the first potential partner you see that's better than anyone in that 37%
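A small simulation of the rule in this post, written as a sketch under simple assumptions: candidates arrive in uniformly random order, each has a distinct score, and the number of candidates is known in advance. The function names and the fallback of taking the last candidate when nobody beats the observed group are choices made for this illustration.

```python
import math
import random


def pick_with_cutoff(scores, cutoff_fraction=1 / math.e):
    """Skip the first cutoff_fraction of candidates, then take the first one
    that beats everyone skipped; fall back to the last candidate otherwise."""
    n_skip = int(len(scores) * cutoff_fraction)
    best_seen = max(scores[:n_skip], default=float("-inf"))
    for score in scores[n_skip:]:
        if score > best_seen:
            return score
    return scores[-1]


def success_rate(n_candidates=100, n_trials=20_000, seed=0):
    """Fraction of random orderings in which the rule picks the best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        scores = list(range(n_candidates))
        rng.shuffle(scores)
        if pick_with_cutoff(scores) == n_candidates - 1:
            wins += 1
    return wins / n_trials


if __name__ == "__main__":
    print(success_rate())
```

For large pools the probability of ending up with the single best candidate under this rule tends to 1/e, so the simulated rate should land near 0.37.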
