Louis Béthune

65 posts

@LouisBAlgue

Please constrain the Lipschitz constant of your networks.

Toulouse, France · Joined July 2020
198 Following · 136 Followers
Louis Béthune reposted
Jason Ramapuram @jramapuram
Autoregressive models dominate, but what if we treat multimodal generation as discrete, order-agnostic iterative refinement? Excited to share our systematic study on the design space of Tri-Modal Masked Diffusion Models (MDMs). We pre-trained the first Tri-Modal MDM from scratch on (text,), (image, text), and (audio, text). The same model can do ASR, TTS, T2I, captioning and native text generation. What I'm most proud of in this work is the scientific rigor. Over 3,500 training runs. Principled hyperparameter transfer. Honest results. Carefully controlled ablations across multiple different axes of entanglement. A thread on our empirical findings (arXiv: arxiv.org/abs/2602.21472)
6 replies · 43 reposts · 234 likes · 38.9K views
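
A hedged sketch of what "order-agnostic iterative refinement" over discrete tokens can look like, in the MaskGIT-style confidence-decoding sense. This is not the paper's exact sampler: the `model` callable, the 5-token vocabulary, and the cosine unmasking schedule below are illustrative assumptions.

    import numpy as np

    MASK = -1  # sentinel id for masked positions

    def iterative_refine(model, seq_len, vocab_size, steps=8, rng=np.random.default_rng(0)):
        tokens = np.full(seq_len, MASK)                       # start fully masked
        for s in range(steps):
            probs = model(tokens)                             # (seq_len, vocab_size) per-position distributions
            sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
            conf = probs[np.arange(seq_len), sampled]         # confidence of each sampled token
            conf[tokens != MASK] = np.inf                     # committed positions stay committed
            keep_masked = int(seq_len * np.cos(np.pi / 2 * (s + 1) / steps))  # cosine schedule
            order = np.argsort(-conf)                         # most confident first
            commit = order[: seq_len - keep_masked]
            new = commit[tokens[commit] == MASK]              # only fill positions still masked
            tokens[new] = sampled[new]
        return tokens

    # Toy "model": uniform distributions over a 5-token vocabulary, just to run the loop.
    demo = lambda toks: np.full((len(toks), 5), 0.2)
    print(iterative_refine(demo, seq_len=12, vocab_size=5))

The point of the schedule is that every position can be generated in any order; the order is decided at sampling time by confidence, not fixed left-to-right as in autoregressive decoding.
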
Louis Béthune reposted
Bruno Mlodozeniec @brunorganised
In our new work, Complete(d)P, we tackle 3 questions about hyperparameter (HP) scaling:
• How to transfer across model size, tokens & batch size? → Complete(d)P
• Do per-module HPs matter? ✔️ 2x speed-ups possible
• Do they transfer to larger scale? ✔️ With the right parameterisation
3 replies · 16 reposts · 51 likes · 18.3K views
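
Complete(d)P's exact parameterisation is in the paper; purely as a flavour of what "per-module HPs" means in practice, here is a muP-style sketch where the learning rate of each weight matrix is scaled by its fan-in. The 1/fan-in rule and the base width of 256 are assumptions for illustration, not the paper's prescription.

    import torch
    import torch.nn as nn

    def param_groups(model: nn.Module, base_lr: float, base_width: int = 256):
        groups = []
        for name, p in model.named_parameters():
            if p.ndim >= 2:  # weight matrices: scale lr inversely with fan-in
                fan_in = p.shape[1]
                groups.append({"params": [p], "lr": base_lr * base_width / fan_in})
            else:            # biases / norm parameters: keep the base lr
                groups.append({"params": [p], "lr": base_lr})
        return groups

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
    opt = torch.optim.AdamW(param_groups(model, base_lr=1e-3))

The payoff of such schemes is that an HP sweep done at small width stays near-optimal when the width grows, which is the transfer property the thread advertises.
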
Louis Béthune reposted
Mustafa Shukor @MustafaShukor1
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
6 replies · 46 reposts · 266 likes · 31.1K views
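
The general recipe behind "fit at small scale, extrapolate to large scale" can be sketched with a plain power law, although the paper's actual functional form is richer. Everything below (the candidate mixture weights, the toy losses, the single-exponent fit that ignores the irreducible-loss term) is a made-up illustration of the workflow, not the paper's law.

    import numpy as np

    # Small-scale runs: {mixture weight p: [(model size N, final loss), ...]} (synthetic numbers)
    runs = {
        0.3: [(1e7, 3.9), (3e7, 3.5), (1e8, 3.2)],
        0.5: [(1e7, 3.8), (3e7, 3.4), (1e8, 3.0)],
        0.7: [(1e7, 4.0), (3e7, 3.6), (1e8, 3.3)],
    }

    target_N = 1e10
    pred = {}
    for p, points in runs.items():
        N, loss = map(np.array, zip(*points))
        slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)  # fit log L ≈ slope·log N + intercept
        pred[p] = float(np.exp(intercept + slope * np.log(target_N)))

    best = min(pred, key=pred.get)
    print(f"predicted best mixture weight at N={target_N:.0e}: {best} (loss ≈ {pred[best]:.2f})")
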
Louis Béthune reposted
Rohan Paul @rohanpaul_ai
This paper maps hardware-cost sweet spots for training efficient small-scale language models. Data shows A100-40GB beats H100 for training cost-effective small language models.

🎯 Original Problem: Training small-scale LLMs (under 2B parameters) faces unclear computational bottlenecks. No systematic study exists on optimal hardware configurations and training dynamics for these models.

🔧 Solution in this Paper:
• Analyzed training behavior up to 2B-parameter models across:
  - GPU types (A100-40GB, A100-80GB, H100-80GB)
  - Batch sizes (4 to 64 per device)
  - Communication protocols (DDP vs FSDP)
  - Attention mechanisms (Vanilla vs FlashAttention)
  - GPU counts (1 to 64)
• Used Token/Dollar and Token/Second as key metrics for cost efficiency

💡 Key Insights:
• FlashAttention gives higher efficiency gains in smaller models
• A100-40GB is cost-optimal for smaller models
• H100 GPUs are not cost-efficient for training small LLMs
• DDP works better for smaller models due to less communication overhead
• FSDP outperforms DDP for 2B-parameter models with large batch sizes
• Cost efficiency saturates before maximum GPU memory utilization

📊 Results:
• FlashAttention enables training 1B-2B models with 512 batch size
• A100-80GB performs best for 1B-2B models with 32+ GPUs
• FSDP with gradient/optimizer state sharding beats full sharding
• DDP shows 20-30% better Token/Dollar for sub-1B models
4 replies · 3 reposts · 16 likes · 2.6K views
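
The two cost metrics named in the thread are easy to make concrete. A minimal sketch, where the step size, step time, GPU count and hourly price are all placeholder numbers rather than measurements from the paper:

    def tokens_per_second(global_tokens_per_step: int, seconds_per_step: float) -> float:
        # Aggregate training throughput for the whole job.
        return global_tokens_per_step / seconds_per_step

    def tokens_per_dollar(tps: float, n_gpus: int, dollars_per_gpu_hour: float) -> float:
        # Throughput normalised by the hourly cost of the GPU fleet.
        return tps * 3600.0 / (n_gpus * dollars_per_gpu_hour)

    # Example: global batch of ~1M tokens per step, 2 s/step, 8 GPUs at an assumed $2/GPU-hour.
    tps = tokens_per_second(1_048_576, 2.0)
    print(f"{tps:.0f} tokens/s, {tokens_per_dollar(tps, 8, 2.0):.0f} tokens/$")

A cheaper, slower GPU can win on tokens/dollar even while losing on tokens/second, which is the A100-40GB vs H100 comparison the thread makes.
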
Louis Béthune reposted
AK @_akhaliq
Apple releases AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders
4 replies · 86 reposts · 544 likes · 55.6K views
Simo Ryu @cloneofsimo
I wrote a custom C++ dataloader and I'm getting 2000 images/sec with 4 threads. I'm very frustrated to see my reference PyTorch implementation so slow, to the point I feel like something is straight up wrong with my implementation, which I've been writing for the past 5 years. Is there a hidden config for torch.utils.data.DataLoader that I'm missing that can fill this gap?
43 replies · 51 reposts · 964 likes · 189.8K views
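
A hedged sketch of the usual torch.utils.data.DataLoader knobs that close most of this gap: parallel worker processes, pinned host memory, persistent workers and a larger prefetch depth. The FakeData dataset and the batch size are placeholders; whether this actually matches a hand-rolled C++ loader depends on decode and I/O cost.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    dataset = datasets.FakeData(size=100_000, transform=transforms.ToTensor())  # stand-in dataset

    loader = DataLoader(
        dataset,
        batch_size=256,
        shuffle=True,
        num_workers=8,            # worker processes doing decode/augmentation in parallel
        pin_memory=True,          # page-locked host buffers for faster host-to-device copies
        persistent_workers=True,  # keep workers alive across epochs (no respawn cost)
        prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
        drop_last=True,
    )

    for images, labels in loader:
        if torch.cuda.is_available():
            images = images.cuda(non_blocking=True)  # overlap the copy with compute when pinned
        break
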
Louis Béthune reposted
Pierre Ablin @PierreAblin
🍏 Apple ML research in Paris has multiple open internship positions!🍎 We are looking for Ph.D. students interested in generative modeling, optimization, large-scale learning or uncertainty quantification, with applications to challenging scientific problems. Details below 👇
4 replies · 77 reposts · 572 likes · 106.7K views
Louis Béthune @LouisBAlgue
@diegoasua @shortstein I don't need to physically materialize the square. Generating the point coordinates and checking their distance to the center can be done purely numerically.
1 reply · 0 reposts · 1 like · 20 views
Louis Béthune @LouisBAlgue
@francoisfleuret Why refactor? Don’t lie to yourself, you’ll never re-run this XP anyway. Restart from scratch!
0 replies · 0 reposts · 0 likes · 55 views
François Fleuret @francoisfleuret
Me refactoring: *run* *crash* *fix* *run* *crash* *fix* ...
6 replies · 1 repost · 22 likes · 3.1K views
Louis Béthune @LouisBAlgue
@shortstein Ok I just checked, I will die of old age before discovering that pi = 3.141. 🥲
1 reply · 0 reposts · 2 likes · 166 views
Louis Béthune @LouisBAlgue
@shortstein I would use the calculator to design a pseudo-random generator, and would sample points in a unit square, relying on Monte Carlo estimates. That’s inefficient, but I don’t remember Taylor expansions :(
3 replies · 0 reposts · 10 likes · 1.7K views
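
A minimal sketch of the estimator this thread is joking about: sample points uniformly in the unit square, count the fraction whose distance to the center is at most 1/2 (the inscribed circle of area pi/4), and rescale by 4. The error decays like 1/sqrt(N), roughly a million samples for three correct decimals, which is why doing it on a pocket calculator is hopeless.

    import numpy as np

    rng = np.random.default_rng(0)

    def estimate_pi(n_samples: int) -> float:
        pts = rng.uniform(0.0, 1.0, size=(n_samples, 2))
        inside = np.sum((pts[:, 0] - 0.5) ** 2 + (pts[:, 1] - 0.5) ** 2 <= 0.25)
        return 4.0 * inside / n_samples

    for n in (10**3, 10**5, 10**7):
        print(n, estimate_pi(n))
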
Mathieu @miniapeur
Can someone point me to relevant resources about kernels that use optimal transport?
3 replies · 1 repost · 17 likes · 3.3K views
Louis Béthune @LouisBAlgue
@miniapeur I'd say that the Solomonoff Induction hypothesis is the best link we have.
0 replies · 0 reposts · 1 like · 109 views
Mathieu @miniapeur
How are information theory and intelligence related to each other? Compression?
30 replies · 2 reposts · 72 likes · 16.1K views
Louis Béthune @LouisBAlgue
@amirrahnama_ @ThibautBoissin @Napoolar It's not the explanation technique that satisfies Lipschitz continuity; it's the model itself. The fact that the model is explainable is more a consequence of the Lipschitzness and of the loss used for training. Lipschitz + OT = interpretable model.
2 replies · 0 reposts · 4 likes · 100 views
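
For readers wondering how one "constrains the Lipschitz constant" in practice (as the bio asks): a common generic recipe is to bound the spectral norm of every linear map and use 1-Lipschitz activations, so the composition is 1-Lipschitz. This is only a hedged sketch of that generic idea; the construction in the authors' paper (and the optimal-transport loss mentioned in the tweet) is more specific.

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    lipschitz_mlp = nn.Sequential(
        spectral_norm(nn.Linear(784, 256)),  # spectral norm of the weight kept near 1
        nn.ReLU(),                           # ReLU is 1-Lipschitz
        spectral_norm(nn.Linear(256, 256)),
        nn.ReLU(),
        spectral_norm(nn.Linear(256, 10)),
    )

    # A composition of 1-Lipschitz maps is 1-Lipschitz, so |f(x) - f(y)| <= |x - y|
    # (up to the accuracy of the power-iteration estimate of the spectral norm).
    x, y = torch.randn(1, 784), torch.randn(1, 784)
    print((lipschitz_mlp(x) - lipschitz_mlp(y)).norm() / (x - y).norm())
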
Louis Béthune @LouisBAlgue
Interested in results at the intersection between explainability🔍 and optimal transport 🚚? Come check out "On the explainable properties of 1-Lipschitz Neural Networks: An Optimal Transport Perspective" on Tuesday at 5:15pm, panel #1508.
2 replies · 7 reposts · 20 likes · 3.5K views
Louis Béthune reposted
Thomas Fel @thomas_fel_
👋👨‍🍳🍵 After a year of cooking up a secret project, I'm thrilled to officially reveal: The 𝐋𝐄𝐍𝐒 𝐏𝐫𝐨𝐣𝐞𝐜𝐭. By combining modern tools of Explainable AI, how much of a ResNet50 can we explain? 🧶
8 replies · 67 reposts · 266 likes · 42K views
Louis Béthune @LouisBAlgue
@ccanonne_ Two weird things: (1) TV can be defined without assuming that X is a metric space, so the result is true regardless of the metric d(·,·). (2) d(·,·) > epsilon suggests that epsilon has the dimension of a distance (feet, meters), but P(·) < 1 - epsilon suggests that epsilon is dimensionless.
0 replies · 0 reposts · 0 likes · 118 views
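
The standard definition backing point (1): total variation is a supremum over measurable events, so no metric on X enters it.

    % Total variation between P and Q on a measurable space (X, F); no metric on X is involved.
    \[
      \mathrm{TV}(P, Q) \;=\; \sup_{A \in \mathcal{F}} \bigl| P(A) - Q(A) \bigr|
      \;=\; \tfrac{1}{2} \sum_{x \in X} \bigl| P(x) - Q(x) \bigr| \quad \text{(discrete case)}
    \]
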
Louis Béthune reposted
Michael Arbel @MichaelArbel
📢 *PhD opening* at @inria_grenoble! Edouard Pauwels, @vaiter and I are looking for a student to work with us on learning theory for bilevel optimization, in particular the implicit bias in bilevel optimization. If interested, please reach out!
1 reply · 36 reposts · 67 likes · 18.2K views
Louis Béthune @LouisBAlgue
@jskherman @prerat @_candroid My guess would be to apply a bijection F between [0,T] and itself so that F#f is uniform (with f the non-uniform density), and to reapply the same strategy on F^{-1}(1/e).
0 replies · 0 reposts · 1 like · 52 views
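
A hedged sketch of the guess above, in the context of the 1/e stopping rule discussed further down this thread: take F to be (a rescaling of) the CDF of the arrival density f, which pushes arrival times forward to a uniform distribution; the classical "observe until time 1/e" cut-off is then mapped back to real time as F^{-1}(1/e). The triangular density and the grid resolution below are made-up illustration choices, and this is not a proof of optimality.

    import numpy as np

    T = 10.0

    def f(t):                      # example non-uniform arrival density on [0, T]
        return 2.0 * t / T**2      # triangular: more candidates arrive late

    # Numerical CDF F and its inverse on a grid.
    grid = np.linspace(0.0, T, 10_001)
    cdf = np.cumsum(f(grid)) * (grid[1] - grid[0])
    cdf /= cdf[-1]                 # normalise so that F(T) = 1

    cutoff = np.interp(1.0 / np.e, cdf, grid)   # F^{-1}(1/e) expressed in real time
    print(f"observe-only phase ends at t = {cutoff:.2f} (vs T/e = {T/np.e:.2f} for uniform arrivals)")
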
jskherman ☄️ @jskherman
@prerat @_candroid How can this be generalized to a non-uniform arrival-time density f, so that it can be applied to more random encounters?
1 reply · 0 reposts · 1 like · 115 views
prerat @prerat
um actually the correct way to do it is to reject the first 37% of partners (that's 1/e btw) and then pick the first potential partner you see that's better than anyone in that 37%
37 replies · 59 reposts · 1.6K likes · 191.6K views
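
A quick simulation of the 37% rule from the tweet above: skip the first n/e candidates, then accept the first one better than everyone seen so far. The probability of ending up with the single best candidate hovers around 1/e ≈ 0.37; the candidate count and trial count are arbitrary choices for the demo.

    import numpy as np

    rng = np.random.default_rng(0)

    def secretary_trial(n: int) -> bool:
        scores = rng.permutation(n)             # candidate quality in arrival order
        k = int(n / np.e)                       # size of the observe-only phase
        threshold = scores[:k].max() if k > 0 else -1
        for s in scores[k:]:
            if s > threshold:
                return s == n - 1               # did we stop on the overall best?
        return False                            # never stopped: the best was in the first 37%

    n, trials = 100, 20_000
    wins = sum(secretary_trial(n) for _ in range(trials))
    print(f"success rate: {wins / trials:.3f}  (1/e = {1/np.e:.3f})")
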