Tomasz Limisiewicz

278 posts

@TomLimi

Postdoctoral researcher at @meta FAIR and @uwnlp. Interested in the inner workings of neural networks, multilingualism, and fairer NLP (he/him)

Seattle · Joined September 2021
507 Following · 796 Followers
Pinned Tweet
Tomasz Limisiewicz @TomLimi
We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
Tomasz Limisiewicz retweeted
Tokenization Workshop (TokShop) @COLM2026
Announcing First Call for Papers: Second Tokenization Workshop 🔡 📣 ▶️ Non-archival submissions of two types: Research papers (up to 9 pages) ▶️ Extended abstracts (up to 2 pages) Submission deadline June 23, 2026 (AoE) Acceptance notification on July 24, 2026 (AoE)
Hadas Orgad @OrgadHadas
Excited that our paper on Actionable Interpretability got accepted to ICML! And just in time -- we also heard that our Actionable Interpretability workshop will be happening again, at COLM! See you in Korea 🇰🇷 and SF 🌉 [arXiv paper link in the comment]
Hadas Orgad @OrgadHadas

Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How? We're ready to answer. 🧵

Tomasz Limisiewicz retweeted
Margaret Li @margs_li
MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]
Tomasz Limisiewicz retweeted
Alisa Liu @alisawuffles
In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.
Tomasz Limisiewicz @TomLimi

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

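The relation Alisa Liu describes above can be sketched numerically. This is a minimal illustration, assuming the compute-optimal budget is fixed in training bytes per parameter; the value 80 is a hypothetical placeholder, not a number from the paper, and `tokens_per_param` is a helper name invented here.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a bytes-per-parameter ratio into tokens per parameter."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_param / bytes_per_token


# Hypothetical byte budget per parameter, for illustration only.
BYTES_PER_PARAM = 80.0

for compression in (2.0, 4.0, 8.0):
    ratio = tokens_per_param(BYTES_PER_PARAM, compression)
    print(f"{compression:.1f} bytes/token -> {ratio:.1f} tokens/param")
```

Under this assumption, a tokenizer that compresses more (more bytes per token) needs fewer tokens per parameter to cover the same number of bytes, which is why a fixed tokens-per-parameter rule does not transfer across tokenizers.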
Grigory Sapunov @che_shr_cat
10/ I also made a comic version of this paper — sometimes a picture is worth a thousand tokens. 🎨 #MachineLearning #AI
Tomasz Limisiewicz retweeted
Grigory Sapunov @che_shr_cat
1/ The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes. 🧵
Tomasz Limisiewicz @TomLimi
There is life beyond BPE! 🔠🌱🥪 Don’t miss this amazing work from @JulieKallini tackling one of the key challenges of byte-level LLMs: generation speed. Diffusion and speculative decoding come to the rescue, enabling much faster generation with BLT with similar performance.
Julie Kallini ✨ @JulieKallini

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/

Tomasz Limisiewicz @TomLimi
@arimedai @AnthropicAI With lower compression, we spend more compute on the same data sample, which benefits performance. But during training, a low-compression model needs more compute to process enough data.
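The trade-off in the reply above can be sketched with a back-of-the-envelope calculation. This assumes the common rule of thumb that training a dense transformer costs about 6 × parameters × tokens FLOPs; that approximation is an assumption brought in here, not something stated in the thread, and it may not hold for latent tokenizers such as BLT. The model and dataset sizes are hypothetical.

```python
def training_flops(n_params: float, n_bytes: float, bytes_per_token: float) -> float:
    """Approximate training FLOPs with the 6 * N * D rule of thumb (an assumption)."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    n_tokens = n_bytes / bytes_per_token  # fewer bytes per token means more tokens
    return 6.0 * n_params * n_tokens


# Hypothetical model and dataset sizes, for illustration only.
N_PARAMS = 1e9
N_BYTES = 4e9

for compression in (2.0, 4.0, 8.0):
    flops = training_flops(N_PARAMS, N_BYTES, compression)
    print(f"{compression:.1f} bytes/token -> {flops:.2e} training FLOPs")
```

For a fixed model and a fixed number of training bytes, halving the compression rate doubles the token count and therefore the estimated compute, which is the cost side of the trade-off described in the reply.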
Healthy Anon @arimedai
@TomLimi @AnthropicAI Tokens were always a shaky unit of account. Byte-normalized rules are overdue. The Anthropic connection is the real insight. Why does smaller vocab help at scale?
Tomasz Limisiewicz @TomLimi
@Vijay2050977 A 128k subword tokenizer is constrained to a closed vocabulary. A latent tokenizer supports any string as a token while maintaining the set average compression across the sequence.
Vijay @Vijay2050977
@TomLimi Is there any difference between exact raw tokens (17-bit chunks) and a 128k-token dictionary-based vocabulary (common nowadays)?
Tomasz Limisiewicz @TomLimi
@jan_metzen That's interesting. We compared different tokenization schemes and got consistent trends, though the optimal compression varied a bit. You can check Appendix C for more details.
Jan Hendrik Metzen @jan_metzen
@TomLimi Very nice work @TomLimi! Based on my experience, the optimal compression rates for English are pretty low in your work. I wonder if this is an artifact of not using end-to-end learned compression. Did you try H-Net or similar?
Artidoro Pagnoni @ArtidoroPagnoni
Tokens are not a universal unit of data. In our new work on Compute Optimal Tokenization, we show that when adapting scaling recipes across tokenizers, bytes are the more stable unit. And the compute-optimal compression rate is not necessarily what today’s BPE tokenizers use.
Tomasz Limisiewicz @TomLimi

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz retweeted
Srini Iyer @sriniiyer88
Extremely excited about our work on Compute Optimal Tokenization! This paper categorically nails down the role that compression plays in compute optimality and recommends how to scale models keeping compression in mind. Cool results on multiple languages too!
Tomasz Limisiewicz @TomLimi

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz retweeted
You Jiacheng @YouJiacheng
Larger compute prefers a smaller vocabulary, interesting. 2 follow-up questions: 1. Can we decouple in/out tokenization, to isolate the effect of more-input-tokens vs. finer-prediction-granularity? (see also arxiv.org/abs/2504.14992) 2. Can we combine it with n-gram embeddings?
Tomasz Limisiewicz @TomLimi

These findings hold both for latent tokenizers (BLT) and subword tokenizers (BPE variants). Interestingly, with BPE we observe that at large scale, decreasing compression by choosing a smaller vocabulary improves performance. [4/N]

Tomasz Limisiewicz @TomLimi
@sanmking @ArtidoroPagnoni @WeijiaShi2 Thanks. The rotating graph is an animated series of matplotlib figures (with mpl_toolkits). The interactive version (in the blog post) is done with plotly. All with the help of an LM to beautify 😉
Mert Gulsun @mert_gulsun
@TomLimi Superb work, always wondered this. There is life beyond BPE
Julie Kallini ✨ @JulieKallini
@TomLimi You know a paper is thorough, thoughtful, and an overall banger when there's 17 pages of main text and 41 pages total with appendices