Sebastian Bordt

179 posts

@sbordt

Language models and interpretable machine learning. Postdoc @ Uni Tübingen.

Tübingen, Germany · Joined December 2020
623 Following · 298 Followers
Pinned Tweet
Sebastian Bordt@sbordt·
Ever wondered about the rationale behind transformer training details like qk-norm, learning rate, and z-loss? Read this blog post to find out more! (link below)
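The training details named in the tweet can be made concrete with a small sketch. Below is a minimal NumPy version of z-loss, an auxiliary term that penalizes (log Z)², where Z is the softmax normalizer; the coefficient value is an illustrative assumption, not taken from the post or the blog it links to.

```python
import numpy as np

def log_sum_exp(logits):
    # Numerically stable log of the softmax normalizer Z = sum_i exp(logit_i).
    m = logits.max(axis=-1)
    return m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))

def z_loss(logits, coeff=1e-4):
    # Auxiliary z-loss: penalize (log Z)^2 so the logits do not drift to
    # large magnitudes during training. coeff is an illustrative value.
    return coeff * (log_sum_exp(logits) ** 2).mean()
```

When the logits already correspond to a normalized distribution (Z = 1), the penalty is zero; otherwise it pulls log Z back toward 0.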
Sebastian Bordt retweeted
Tessa@tessa1157·
⭐ How can we set up LLM pretraining to improve the model’s ability to learn new data upon further training? In this new preprint, we find that weight decay during pretraining helps! Preprint: arxiv.org/abs/2602.11137 Thread below🧵⬇️
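For readers who want the mechanism at a glance, here is a minimal sketch of decoupled weight decay (the AdamW-style update); the learning rate and decay values are illustrative, and the Adam moment estimates are omitted for brevity.

```python
import numpy as np

def sgd_step_decoupled_wd(w, grad, lr=0.1, wd=0.01):
    # Decoupled weight decay (AdamW-style): the shrinkage toward zero is
    # applied separately from the loss gradient. lr and wd are illustrative.
    w = w - lr * grad       # gradient step on the loss
    w = w - lr * wd * w     # decay step, independent of the loss
    return w
```

With grad = 0, every step multiplies the weights by (1 - lr·wd), i.e. weight decay steadily shrinks what was learned earlier, which is one intuition for why it can leave room for new data in continued training.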
Sebastian Bordt@sbordt·
@SunnySanyal9 I did not know this paper, but I assume my co-authors might know it. We are presenting today from 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903. Please drop by!
Sebastian Bordt retweeted
Leena C Vankadara@leenaCvankadara·
📄 Paper: arxiv.org/abs/2505.22491 Catch our Spotlight at #NeurIPS2025 Today! 📅 Wed Dec 3 🕟 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903 Huge thanks to my amazing collaborators: Moritz Haas, @sbordt, and Ulrike von Luxburg
Sebastian Bordt retweeted
Leena C Vankadara@leenaCvankadara·
Under He/LeCun inits, theory implies kernel OR unstable regimes as width→∞. Discrepancies (e.g., feature learning) are seen as finite-width effects. Our #NeurIPS2025 spotlight refutes this: practical nets do not converge to kernel limits; feature learning persists as width→∞🧵
Sebastian Bordt@sbordt·
@kothasuhas This looks really cool! You may be interested in the fact that the value of the weight decay parameter can be used to gauge the influence of past data on the model. We took a look at this in arxiv.org/abs/2410.03249; there are also other works on the topic.
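The point about gauging past-data influence can be sketched quantitatively: under decoupled weight decay (and ignoring Adam's adaptive scaling), each step multiplies the weights by (1 - lr·wd), so a parameter update made k steps ago survives with factor (1 - lr·wd)^k. The values below are illustrative, not taken from the cited paper.

```python
def past_gradient_weight(steps_ago, lr=1e-3, wd=0.1):
    # Simplified model: with decoupled weight decay, an update made k steps
    # ago has been multiplied by (1 - lr * wd) at every subsequent step.
    return (1.0 - lr * wd) ** steps_ago
```

With lr·wd = 1e-4, half of a past update's influence is gone after roughly ln(2)/1e-4 ≈ 6,900 steps, so 1/(lr·wd) sets a rough "memory horizon" for the training data.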
Suhas Kotha@kothasuhas·
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute
Sebastian Bordt@sbordt·
@DimitrisPapail We found in a recent pre-print that the cross-entropy loss also contributes to stable scaling behavior, which could be relevant for its success in large-scale pre-training: arxiv.org/pdf/2505.22491. Our current arXiv version of this result is a bit technical, though.
Dimitris Papailiopoulos@DimitrisPapail·
Why is cross-entropy a good loss for language pretraining? Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point.

Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :) "I want my model to sound intelligent." But we can't measure that, so we ask "I want my model to sound like a human." Although we have access to all texts ever written, we can't quite measure that either, so we instead ask "I want my model to be as likely as possible to generate one of the texts ever written." Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each text in S:
Sample: draw "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities your model W assigns to the ground-truth texts:
Pr(E) = Π_{text in S} Pr(text; W)
Maximizing log Pr(E) over W gives you the cross-entropy objective.

How do you optimize this with SGD?
- sample a text from the corpus
- compute grad log Pr(token | prefix) for every prefix of the text
- update the model

What's elegant is that this same objective simultaneously:
1) Minimizes the description length of the data under the model P( ; W) (compression view)
2) Minimizes KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like x-entropy, but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W).
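The derivation above can be sketched in code: the corpus-level objective log Pr(E) = Σ_text log Pr(text; W) factorizes over tokens, and its negation is the cross-entropy loss. Here `token_probs` stands for the model's probabilities of the ground-truth tokens, an illustrative stand-in for a real model, not part of the original thread.

```python
import numpy as np

def sequence_log_prob(token_probs):
    # Autoregressive factorization:
    # log Pr(text; W) = sum_t log Pr(token_t | prefix_t; W)
    return float(np.sum(np.log(token_probs)))

def cross_entropy_objective(corpus_token_probs):
    # -log Pr(E) = -sum over texts of log Pr(text; W):
    # maximizing Pr(E) is the same as minimizing this cross-entropy loss.
    return -sum(sequence_log_prob(p) for p in corpus_token_probs)
```

For example, a one-text corpus whose two ground-truth tokens each get probability 0.5 yields a loss of -log(0.25) = log 4.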
Dimitris Papailiopoulos@DimitrisPapail·
Is there any quantifiable skill (approximately measurable via some proxy) that we believe LLMs can't saturate?
Sebastian Bordt@sbordt·
Today at 4:30 pm in the East Exhibition Hall. #icml icml.cc/virtual/2025/p…
Sebastian Bordt@sbordt

During the last couple of years, we have read a lot of papers on explainability and often felt that something was fundamentally missing🤔 This led us to write a position paper (accepted at #ICML2025) that attempts to identify the problem and to propose a solution. Introducing: "Rethinking Explainable Machine Learning as Applied Statistics" arxiv.org/abs/2402.02870 👇🧵

Sebastian Bordt@sbordt·
I'm at #ICML in Vancouver this week, hit me up if you want to chat about pre-training experiments or explainable machine learning. You can find me at these posters: Tuesday: How Much Can We Forget about Data Contamination? icml.cc/virtual/2025/p… Wednesday: Position: Rethinking Explainable Machine Learning as Applied Statistics icml.cc/virtual/2025/p…
Sebastian Bordt@sbordt·
There are many more interesting aspects to this, so take a look at our paper! arxiv.org/abs/2402.02870 We would also be happy to take your questions and hear your comments on why we got it completely wrong.😊 If you are at ICML, I will present this paper on Wed 16 Jul, 4:30 pm, in the East Exhibition Hall A-B, #E-501.📍
Sebastian Bordt@sbordt·
During the last couple of years, we have read a lot of papers on explainability and often felt that something was fundamentally missing🤔 This led us to write a position paper (accepted at #ICML2025) that attempts to identify the problem and to propose a solution. Introducing: "Rethinking Explainable Machine Learning as Applied Statistics" arxiv.org/abs/2402.02870 👇🧵