Tomasz Limisiewicz
278 posts

@TomLimi
Postdoctoral researcher at @meta FAIR and @uwnlp. Interested in the inner workings of neural networks, multilingualism, and fairer NLP (he/him)


Our ICML 2025 workshop on Actionable Interpretability drew massive interest. But the same questions kept coming up: What does "actionable" mean? Is it achievable? How? We're ready to answer. 🧵



AI already shows superhuman abilities in significant domains. That is why I savor every time I manage to make it spout nonsense, because who knows how many more such chances I will get. This time I asked it: "I need to send a page by fax and want to make sure I keep a copy. What should I do?" The answer from ChatGPT 5.5 extended is in the attached screenshot.


We present Compute Optimal Tokenization! 🔡 Most LLM scaling works stick to one tokenizer and sweep only data and model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]



Fast Byte Latent Transformer is accepted to ICML 2026! ⚡🥪 Byte-level LMs promise to free us from subword tokenizers, but decoding one byte at a time is super slow. We make BLT generation more efficient with BLT-D: text diffusion for parallel byte decoding. 1/





These findings hold for both latent tokenizers (BLT) and subword tokenizers (BPE variants). Interestingly, with BPE we observe that, at large scale, decreasing compression by choosing a smaller vocabulary improves performance. [4/N]












