Tomasz Limisiewicz

Tomasz Limisiewicz@TomLimi·19 May

@OrgadHadas Congrats! We (@tokshop2025) are also transferring from ICML to COLM this year!

Hadas Orgad@OrgadHadas·19 May

Excited that our paper on Actionable Interpretability got accepted to ICML! And just in time -- we also heard that our Actionable Interpretability workshop will be happening again, at COLM! See you in Korea 🇰🇷 and SF🌉 [arXiv paper link in the comment]

Hadas Orgad@OrgadHadas

Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How? We're ready to answer. 🧵

Tomasz Limisiewicz reposted

Margaret Li@margs_li·18 May

MoEs are everywhere, but the design space is confusing: total vs active experts? expert size? shared experts? routing? token dropping? We train >2000 MoE LMs 🫠 to investigate and bring you: 📄🔪🍰 Slicing and Dicing MoEs Tl;dr: it's all about expert size and count [1/9]

Tomasz Limisiewicz@TomLimi·17 May

@yoavgo No wonder LLM adoption is so low in Europe, with blunders like this

Danny Hendler@DannyHendler

(((ل()(ل() 'yoav))))👾@yoavgo·17 May

cute and works also on claude 4.6 english. (4.7 did get it right)

AI already demonstrates superhuman abilities in significant domains. That is why I relish every time I manage to get it to spout nonsense, because who knows how many more such opportunities I will have. This time I asked it: "I need to send a page by fax and want to make sure I keep a copy. What should I do?" The answer from ChatGPT 5.5 extended is in the attached screenshot.

Tomasz Limisiewicz reposted

Alisa Liu@alisawuffles·15 May

In SuperBPE we found: as tokenizer compression increases, the compute-optimal ratio of train tokens to model params decreases — and remarkably, corresponds to the same underlying ratio of train *bytes* / param! Our new work makes it official: scaling laws depend on compression.

Tokenization Workshop (TokShop) @COLM2026@tokshop2025

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
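
A rough sketch of the arithmetic behind this point, assuming the compute-optimal budget is fixed in bytes per parameter. The function name, the 90 bytes/param budget, and the three compression rates are hypothetical values chosen only for illustration; they are not numbers from the paper.

```python
def tokens_per_param(bytes_per_param: float, bytes_per_token: float) -> float:
    """Convert a byte-based data budget into tokens per parameter."""
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return bytes_per_param / bytes_per_token


# Hypothetical fixed budget in bytes per parameter (illustration only).
BYTES_PER_PARAM = 90.0

# Hypothetical tokenizer compression rates in bytes per token.
ratios = {c: tokens_per_param(BYTES_PER_PARAM, c) for c in (3.0, 4.5, 6.0)}

for compression, ratio in ratios.items():
    print(f"{compression} bytes/token -> {ratio:.1f} tokens/param")
```

Under these assumptions, the tokens-per-parameter ratio falls as compression rises, while the bytes-per-parameter budget stays the same.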

Tomasz Limisiewicz@TomLimi·14 May

See you there! 🌉🔠

TokShop will be at #COLM2026! 🗓️ October 9th, 2026 📍 San Francisco, USA More details and a call for papers coming soon.

Tomasz Limisiewicz@TomLimi·12 May

@che_shr_cat That's a nice one! 🤖😄

Grigory Sapunov@che_shr_cat·12 May

10/ I also made a comic version of this paper — sometimes a picture is worth a thousand tokens. 🎨 #MachineLearning #AI

Julie Kallini ✨@JulieKallini

Tomasz Limisiewicz reposted

Grigory Sapunov@che_shr_cat·12 May

1/ The "20 tokens per parameter" Chinchilla scaling law is flawed. It is an artifact of your tokenizer. Scaling shouldn't be measured in tokens at all. It should be measured in bytes. 🧵

Tomasz Limisiewicz@TomLimi·11 May

There is life beyond BPE! 🔠🌱🥪 Don’t miss this amazing work from @JulieKallini tackling one of the key challenges of byte-level LLMs: generation speed. Diffusion and speculative decoding come to the rescue, enabling much faster generation with BLT at similar performance.

Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/

Tomasz Limisiewicz@TomLimi·7 May

@arimedai @AnthropicAI With lower compression, we allow more compute for the same data sample, which benefits performance. But during training, a low-compression model needs more compute to process enough data.

Healthy Anon@arimedai·5 May

@TomLimi @AnthropicAI Tokens were always a shaky unit of account. Byte-normalized rules are overdue. The Anthropic connection is the real insight. Why does smaller vocab help at scale?

Tomasz Limisiewicz@TomLimi·4 May

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz@TomLimi·7 May

@Vijay2050977 A 128k subword tokenizer is constrained to a closed vocabulary. A latent tokenizer supports any string as a token, while maintaining the set average compression across the sequence.

Vijay@Vijay2050977·5 May

@TomLimi Any difference between exact raw tokens (17-bit chunks) and a dictionary-based 128k-token vocabulary (common nowadays)?

Tomasz Limisiewicz@TomLimi·7 May

@Artificially999 mine too! 💙

Artificially Intelligent 🏴‍☠️@Artificially999·5 May

work close to my heart, lfg!

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz@TomLimi·7 May

@jan_metzen That's interesting. We compared different tokenization schemes and got consistent trends, though the optimal compression varied a bit. You can check Appendix C for more details.

Jan Hendrik Metzen@jan_metzen·5 May

@TomLimi Very nice work, @TomLimi! Based on my experience, the optimal compression rates for English are pretty low in your work. I wonder if this is an artifact of not using end-to-end learned compression. Did you try H-Net or similar?

Artidoro Pagnoni@ArtidoroPagnoni·4 May

Tokens are not a universal unit of data. In our new work on Compute Optimal Tokenization, we show that when adapting scaling recipes across tokenizers, bytes are the more stable unit. And the compute-optimal compression rate is not necessarily what today’s BPE tokenizers use.

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz reposted

Srini Iyer@sriniiyer88·4 May

Extremely excited about our work on Compute Optimal Tokenization! This paper categorically nails down the role that compression plays in compute optimality and recommends how to scale models keeping compression in mind. Cool results on multiple languages too!

We present Compute Optimal Tokenization! 🔡 LLM scaling works commonly stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]

Tomasz Limisiewicz reposted

You Jiacheng@YouJiacheng·5 May

Larger compute prefers a smaller vocabulary, interesting. 2 follow-up questions: 1. Can we decouple in/out tokenization, to isolate the effect of more input tokens vs. finer prediction granularity? (see also arxiv.org/abs/2504.14992) 2. Can we combine it with n-gram embeddings?

These findings hold both for latent tokenizers (BLT) and subword tokenizers (BPE variants). Interestingly, with BPE we observe that at large scale decreasing compression by choosing smaller vocabulary improves performance. [4/N]

Tomasz Limisiewicz@TomLimi·5 May

@sanmking @ArtidoroPagnoni @WeijiaShi2 Thanks. The rotating graph is an animated series of matplotlib figures (with mpl_toolkits). The interactive version (in the blog post) is done with plotly. All with the help of an LM to beautify 😉

Santiago M.@sanmking·4 May

@ArtidoroPagnoni @WeijiaShi2 Nice graph! How did you make it (and the visualization)?

Tomasz Limisiewicz@TomLimi·5 May

@mert_gulsun Definitely, there is! Thanks!

Mert Gulsun@mert_gulsun·4 May

@TomLimi Superb work, always wondered this. There is life beyond BPE

Tomasz Limisiewicz@TomLimi·5 May

@stochasticchasm Thank you!

stochasm@stochasticchasm·4 May

@TomLimi very cool work!

Tomasz Limisiewicz@TomLimi·5 May

@JulieKallini Thank you, Julie! Heads up for all fast skimmers: the main findings are distilled into a 15-minute-read blog post 🏃‍♂️: co-tok.github.io

Julie Kallini ✨@JulieKallini·4 May

@TomLimi You know a paper is thorough, thoughtful, and an overall banger when there's 17 pages of main text and 41 pages total with appendices
