Sebastian Bordt

179 posts

@sbordt

Language models and interpretable machine learning. Postdoc @ Uni Tübingen.

Tübingen, Germany · Joined December 2020
623 Following · 298 Followers
Pinned Tweet
Sebastian Bordt@sbordt·
Ever wondered about the rationale behind transformer training details like qk-norm, learning rate, and z-loss? Read this blog post to find out more! (link below)
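The training details named in the tweet can be made concrete with a small sketch. Below is a minimal NumPy version of z-loss, an auxiliary term that penalizes (log Z)², where Z is the softmax normalizer; the coefficient value is an illustrative assumption, not taken from the post or the blog it links to.

```python
import numpy as np

def log_sum_exp(logits):
    # Numerically stable log of the softmax normalizer Z = sum_i exp(logit_i).
    m = logits.max(axis=-1)
    return m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))

def z_loss(logits, coeff=1e-4):
    # Auxiliary z-loss: penalize (log Z)^2 so the logits do not drift to
    # large magnitudes during training. coeff is an illustrative value.
    return coeff * (log_sum_exp(logits) ** 2).mean()
```

When the logits already correspond to a normalized distribution (Z = 1), the penalty is zero; otherwise it pulls log Z back toward 0.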
Sebastian Bordt retweeted
Tessa@tessa1157·
⭐ How can we set up LLM pretraining to improve the model’s ability to learn new data upon further training? In this new preprint, we find that weight decay during pretraining helps! Preprint: arxiv.org/abs/2602.11137 Thread below🧵⬇️
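For readers who want the mechanism at a glance, here is a minimal sketch of decoupled weight decay (the AdamW-style update); the learning rate and decay values are illustrative, and the Adam moment estimates are omitted for brevity.

```python
import numpy as np

def sgd_step_decoupled_wd(w, grad, lr=0.1, wd=0.01):
    # Decoupled weight decay (AdamW-style): the shrinkage toward zero is
    # applied separately from the loss gradient. lr and wd are illustrative.
    w = w - lr * grad       # gradient step on the loss
    w = w - lr * wd * w     # decay step, independent of the loss
    return w
```

With grad = 0, every step multiplies the weights by (1 - lr·wd), i.e. weight decay steadily shrinks what was learned earlier, which is one intuition for why it can leave room for new data in continued training.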
Sebastian Bordt@sbordt·
@SunnySanyal9 I did not know this paper, but I assume my co-authors might know it. We are presenting today from 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903. Please drop by!
Sebastian Bordt retweeted
Leena C Vankadara@leenaCvankadara·
📄 Paper: arxiv.org/abs/2505.22491 Catch our Spotlight at #NeurIPS2025 Today! 📅 Wed Dec 3 🕟 4:30 - 7:30 PM 📍 Exhibit Hall C,D,E — Poster #3903 Huge thanks to my amazing collaborators: Moritz Haas, @sbordt, and Ulrike von Luxburg
Sebastian Bordt retweeted
Leena C Vankadara@leenaCvankadara·
Under He/LeCun inits, theory implies kernel OR unstable regimes as width→∞. Discrepancies (e.g., feature learning) are seen as finite-width effects. Our #NeurIPS2025 spotlight refutes this: practical nets do not converge to kernel limits; feature learning persists as width→∞🧵
Sebastian Bordt@sbordt·
@kothasuhas This looks really cool! You may be interested in the fact that the value of the weight decay parameter can be used to gauge the influence of past data on the model. We took a look at this in arxiv.org/abs/2410.03249; there are also other works on the topic.
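The point about gauging past-data influence can be sketched quantitatively: under decoupled weight decay (and ignoring Adam's adaptive scaling), each step multiplies the weights by (1 - lr·wd), so a parameter update made k steps ago survives with factor (1 - lr·wd)^k. The values below are illustrative, not taken from the cited paper.

```python
def past_gradient_weight(steps_ago, lr=1e-3, wd=0.1):
    # Simplified model: with decoupled weight decay, an update made k steps
    # ago has been multiplied by (1 - lr * wd) at every subsequent step.
    return (1.0 - lr * wd) ** steps_ago
```

With lr·wd = 1e-4, half of a past update's influence is gone after roughly ln(2)/1e-4 ≈ 6,900 steps, so 1/(lr·wd) sets a rough "memory horizon" for the training data.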
Suhas Kotha@kothasuhas·
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute
Sebastian Bordt@sbordt·
@DimitrisPapail We found in a recent pre-print that the cross-entropy loss also contributes to stable scaling behavior, which could be relevant for its success in large-scale pre-training: arxiv.org/pdf/2505.22491. Our current arXiv version of this result is a bit technical, though.
Dimitris Papailiopoulos@DimitrisPapail·
Why is cross-entropy a good loss for language pretraining? Caveat: this is all known, btw; interestingly, even though there are many viewpoints and intuitions on "why x-ent", they can all be arrived at from a single starting point.

Here's a simple first-principles derivation that doesn't assume anything about the data distribution. It comes from a very reasonable operational requirement :) "I want my model to sound intelligent." But we can't measure that, so we ask "I want my model to sound like a human." Although we have access to all texts ever written, we can't quite measure that either, so we instead ask "I want my model to be as likely as possible to generate one of the texts ever written." Or more bluntly: "I want my model to memorize the training data."

Consider this thought experiment. Given a dataset S of all text ever written by humans, we perform independent trials for each text in S:
Sample: draw "sample text" from our model Pr( ; W)
Check: did "sample text" exactly match the original?
Note: we do not condition on anything! We just ask, of all the stuff the model could generate, did we get "text"?

Define success as the event E = "all per-sample checks succeed". The probability of E is the product of the probabilities your model W assigns to the ground-truth texts:
Pr(E) = Π_{text in S} Pr(text; W)
Maximizing log Pr(E) over W gives you the cross-entropy objective.

How do you optimize this with SGD?
- sample a text from the corpus
- compute grad log Pr(token | prefix) for every prefix of the text
- update the model

What's elegant is that this same objective simultaneously:
1) Minimizes the description length of the data under the model P( ; W) (compression view)
2) Minimizes KL divergence to the true distribution, if one exists (though we never assumed one)
3) Implements maximum likelihood estimation

The derivation is straightforward and well-known, but it highlights something important: cross-entropy emerges naturally from wanting exact reproduction of the training data.

P.S. You could have instead asked to maximize Pr(text generated by the model is in the ground truth). Interestingly, optimizing this can lead to mode collapse, since an optimal solution is to always predict a single piece of text from the corpus. Yet the gradients again look like x-entropy, but with a multiplying factor, i.e., Pr(text; W) grad log Pr(text; W).
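The derivation above can be sketched in code: the corpus-level objective log Pr(E) = Σ_text log Pr(text; W) factorizes over tokens, and its negation is the cross-entropy loss. Here `token_probs` stands for the model's probabilities of the ground-truth tokens, an illustrative stand-in for a real model, not part of the original thread.

```python
import numpy as np

def sequence_log_prob(token_probs):
    # Autoregressive factorization:
    # log Pr(text; W) = sum_t log Pr(token_t | prefix_t; W)
    return float(np.sum(np.log(token_probs)))

def cross_entropy_objective(corpus_token_probs):
    # -log Pr(E) = -sum over texts of log Pr(text; W):
    # maximizing Pr(E) is the same as minimizing this cross-entropy loss.
    return -sum(sequence_log_prob(p) for p in corpus_token_probs)
```

For example, a one-text corpus whose two ground-truth tokens each get probability 0.5 yields a loss of -log(0.25) = log 4.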
Dimitris Papailiopoulos@DimitrisPapail·
Is there any quantifiable skill (approximately measurable via some proxy) that we believe LLMs can't saturate?
Sebastian Bordt@sbordt·
Today at 4:30 pm in the East Exhibition Hall. #icml icml.cc/virtual/2025/p…
Sebastian Bordt@sbordt

During the last couple of years, we have read a lot of papers on explainability and often felt that something was fundamentally missing🤔 This led us to write a position paper (accepted at #ICML2025) that attempts to identify the problem and to propose a solution. Introducing: "Rethinking Explainable Machine Learning as Applied Statistics" arxiv.org/abs/2402.02870 👇🧵

Sebastian Bordt@sbordt·
I'm at #ICML in Vancouver this week, hit me up if you want to chat about pre-training experiments or explainable machine learning. You can find me at these posters: Tuesday: How Much Can We Forget about Data Contamination? icml.cc/virtual/2025/p… Wednesday: Position: Rethinking Explainable Machine Learning as Applied Statistics icml.cc/virtual/2025/p…
Sebastian Bordt@sbordt·
There are many more interesting aspects to this, so take a look at our paper! arxiv.org/abs/2402.02870 We would also be happy to take your questions and hear your comments on why we got it completely wrong.😊 If you are at ICML, I will present this paper on Wed 16 Jul, 4:30 pm, in the East Exhibition Hall A-B, #E-501.📍
Sebastian Bordt@sbordt·
During the last couple of years, we have read a lot of papers on explainability and often felt that something was fundamentally missing🤔 This led us to write a position paper (accepted at #ICML2025) that attempts to identify the problem and to propose a solution. Introducing: "Rethinking Explainable Machine Learning as Applied Statistics" arxiv.org/abs/2402.02870 👇🧵