Tomasz Limisiewicz

240 posts

@TomLimi

Postdoctoral researcher at @Meta FAIR and @uwnlp. Interested in the inner workings of neural networks, multilingualism, and fairer NLP (he/him)

Seattle · Joined September 2021
495 Following · 542 Followers
Pinned Tweet
Tomasz Limisiewicz @TomLimi
Excited to continue my research adventure as a postdoc at @uwnlp and @Meta! I’ve joined @LukeZettlemoyer's fantastic lab. Together, we plan to rethink how LLMs perceive data to unlock their capabilities for uncharted languages and, further, beyond text! [🦋posting]
Tomasz Limisiewicz retweeted
İlker Kesen @ilker_kesen
📢I'm organizing a BoF session at #EACL2026 called Tokenization & Beyond, aiming to gather researchers exploring tokenization and alternatives such as byte-level and pixel-based approaches. Sign up using the form if you're interested! #NLProc @eaclmeeting
Tomasz Limisiewicz @TomLimi
@RTomMcCoy The "judge of others" criterion for US exceptional-ability visas/green cards is easy to prove by reviewing.
Tom McCoy @RTomMcCoy
What's the endgame for people who mass-email conference organizers saying that they'd like to review papers? I've gotten several such requests in the past few months.
Tomasz Limisiewicz retweeted
Jason Weston @jaseweston
Our team in FAIR at Meta is hiring a postdoc researcher! We work on the topics of Reasoning, Alignment and Memory/architectures (RAM). Apply here: metacareers.com/profile/job_de… Location: NY, Seattle or Menlo Park. Some of our recent work to give flavor:
Co-Improvement (position): arxiv.org/abs/2512.05356
SPICE (Self-Play in Corpus Environments): arxiv.org/abs/2510.24684
Self-Challenging Agents: arxiv.org/abs/2506.01716
RL from Human Interaction: arxiv.org/abs/2509.25137
AggLM (parallel aggregation): arxiv.org/abs/2509.06870
StepWiser (CoT-PRM RL): arxiv.org/abs/2508.19229
DARLING (diversity-trained RL): arxiv.org/abs/2509.02534
J1 (RL-trained LLM-as-Judge): arxiv.org/abs/2505.10320
CoT-Self-Instruct: arxiv.org/abs/2507.23751
Multi-Token Attention: arxiv.org/abs/2504.00927
Tomasz Limisiewicz @TomLimi
@xhluca @benno_krojer I have always considered coding part of engineering. What is your definition of engineering? (If it means something like "planning out", then the percentages would roughly match my pipeline.)
Xing Han Lu @xhluca
@benno_krojer Before: 90% coding, 10% engineering
Now: 10% prompting, 90% engineering
Benno Krojer @benno_krojer
Funnily, automating a lot of my research processes with AI (coding) assistants finally forces me to document and structure my repository and paper well. I was always able to vibe it myself, but now, as I set up automation, I need to ensure things are well documented in the .cursorrules/CLAUDE.md files, in the README, and so on, so that Cursor/Claude can handle the complexity of the repo.

For now I have a workflow where I have one central folder and notebook where all the final paper plots are generated and thus easy to reproduce, one big README that documents everything needed to reproduce the whole research, and the Overleaf project is a git submodule of the project so that the plots/numbers can be directly synced.

I used to have such chaotic repositories; this feels good.
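One possible shape for the plot-syncing step described above, assuming a hypothetical plots/ output folder and an overleaf/ git submodule (not the actual repo layout):

```python
from pathlib import Path
import shutil

# Hypothetical layout: plots/ holds the notebook's final figures and
# overleaf/figures/ lives inside the Overleaf project checked out as a git submodule.
PLOTS_DIR = Path("plots")
OVERLEAF_FIGURES = Path("overleaf/figures")

def sync_plots():
    """Copy every final PDF figure into the Overleaf submodule so the paper
    always compiles against the latest reproducible plots."""
    OVERLEAF_FIGURES.mkdir(parents=True, exist_ok=True)
    for fig in sorted(PLOTS_DIR.glob("*.pdf")):
        shutil.copy2(fig, OVERLEAF_FIGURES / fig.name)
        print(f"synced {fig.name}")

if __name__ == "__main__":
    sync_plots()
```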
Tomasz Limisiewicz retweeted
Benjamin Minixhofer @bminixhofer
Bolmo is now on arXiv!
Tomasz Limisiewicz @TomLimi
Check out 🅱️olmo! Really cool approach of retrofitting an existing BPE model to operate on bytes with latent tokenization. This step lets us close the gap with subword-based models and even surpass them! ✨ Happy I could contribute to this forward-thinking project.
Benjamin Minixhofer @bminixhofer

We are releasing Bolmo today! Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3. Bolmo also performs competitively in terms of speed & is fully open. I was skeptical of byte-level models for a long time but I finally switched camps🧵

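For readers unfamiliar with the term, here is a toy sketch of what latent tokenization over bytes can look like: embed raw bytes, predict patch boundaries, and pool each span into one latent token. This is an illustrative assumption, not the actual Bolmo architecture.

```python
import torch
import torch.nn as nn

class LatentBytePooler(nn.Module):
    """Toy latent tokenizer: maps a raw byte sequence to a shorter sequence of
    pooled 'latent token' vectors via a learned boundary predictor."""

    def __init__(self, d_model=256):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)  # one embedding per byte value
        self.boundary = nn.Linear(d_model, 1)       # scores a patch boundary after each byte

    def forward(self, byte_ids):                    # byte_ids: (seq_len,) int64 in [0, 255]
        h = self.byte_emb(byte_ids)                 # (seq_len, d_model)
        is_boundary = torch.sigmoid(self.boundary(h)).squeeze(-1) > 0.5
        patches, start = [], 0
        for i, end_here in enumerate(is_boundary.tolist()):
            if end_here or i == len(byte_ids) - 1:
                patches.append(h[start:i + 1].mean(dim=0))  # pool the byte span
                start = i + 1
        return torch.stack(patches)                 # (num_latent_tokens, d_model)

pooler = LatentBytePooler()
byte_ids = torch.tensor(list("latent tokenization".encode("utf-8")))
print(pooler(byte_ids).shape)  # pooled latent tokens (boundaries are untrained here)
```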
Stella Biderman @BlancheMinerva
@allen_ai What are you talking about? You are comparing to exactly 0 SOTA models. Also.... Byte tokenization IS A SUBWORD TOKENIZER
Tomasz Limisiewicz retweeted
Ai2 @allen_ai
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Tomasz Limisiewicz retweeted
Melanie Sclar @melaniesclar
Carmen Sandiego is heading to #NeurIPS2025 - finally, a good use for this costume! I'm on the industry job market and organizing the agents + reasoning & planning workshop. Excited to chat about research (LLM robustness, reasoning, theory of mind), and job opportunities. DM me!
Simo Ryu @cloneofsimo
> nearly end of 2025
> Meta has well over 1M GPUs and one of the largest datacenters in the world
> JEPA as a franchise has been around for three years
> "yeah we got 79% on ImageNet-1k 💪💪💪 🎉🎉🎉🎉"
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

A NEW PAPER FROM YANN LECUN: LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

This could be one of LeCun's last papers at Meta (lol), but it's a really interesting one, I think. Quick summary:

Yann LeCun's big idea is JEPA, a self-supervised learning method. However, there are various failure modes of this approach, so training strong JEPA models is very brittle, unstable, and quite difficult. So overall JEPA has seen little adoption in practice.

This paper tries to directly address this, making specific design decisions that improve training stability. The authors identify the isotropic Gaussian as the optimal distribution that JEPA models' embeddings should follow and design the Sketched Isotropic Gaussian Regularization (SIGReg) to constrain embeddings to reach that ideal distribution. This forms the LeJEPA framework, which can be implemented in ~50 lines of code.

On empirical tests, the authors demonstrate stability of training across hyperparameters, architectures, and datasets. A result particularly interesting to me, however, is that training a LeJEPA model from scratch directly on the downstream dataset outperforms finetuning a DINOv2/v3 model on the dataset!

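As a rough illustration of the regularization idea summarized above (project embeddings onto random directions and push each 1-D projection toward a standard normal), here is a hedged PyTorch sketch. It is not the paper's actual SIGReg objective, just the flavor of it:

```python
import torch

def isotropic_gaussian_penalty(z, num_directions=64):
    """Penalize deviation of random 1-D projections of the embeddings z
    (shape: batch x dim) from a standard normal, via the first two moments."""
    dirs = torch.randn(z.shape[1], num_directions, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)   # random unit directions (the "sketch")
    proj = z @ dirs                                # (batch, num_directions)
    mean = proj.mean(dim=0)
    var = proj.var(dim=0)
    # Isotropic N(0, I) target: zero mean and unit variance along every direction.
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

loss = isotropic_gaussian_penalty(torch.randn(128, 512))
print(loss.item())
```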
Tomasz Limisiewicz retweeted
Hila Gonen @hila_gonen
Considering a PhD/MSc in NLP? I’m hiring students this cycle! If you are passionate about making language models reliable and safe, eager to understand and control language models, and would like to add some multilingual flavor to your research - apply to my group! 👇
Albert Gu @_albertgu
At the tokenizer workshop panel at ICML, I made an offhand joke about eventually going to raw pixels being the way. I didn't press it too hard because the pitchforks were already out over H-Net and I wanted to make it home, but yes, still in favor 🙋
Andrej Karpathy @karpathy

@thawani_avijit Haha. I am afraid people interpreted my “delete tokenizer” as “use bytes directly without BPE”; the issue is you *still* need the arbitrariness of byte encoding even for that! Pixels is the only way. Just like humans. It is written. If GPT-10 uses utf8 at the input I will eat a shoe.

Avijit Thawani (Avi) @thawani_avijit
Retweeting purely for the periodic tokenizer rant.
Andrej Karpathy @karpathy

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.

The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not an end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.

So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side-quest an image-input-only version of nanochat...

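A tiny sketch of the "render text and feed pixels" idea from the thread above, assuming PIL's default bitmap font and an arbitrary canvas size (none of these choices come from the quoted tweets):

```python
import numpy as np
from PIL import Image, ImageDraw

def render_text_to_pixels(text, width=256, height=32):
    """Rasterize a string into a grayscale array that a vision encoder could
    consume in place of text tokens."""
    img = Image.new("L", (width, height), color=255)   # white canvas
    ImageDraw.Draw(img).text((2, 2), text, fill=0)     # black text, default font
    return np.asarray(img, dtype=np.float32) / 255.0   # (height, width), values in [0, 1]

pixels = render_text_to_pixels("delete the tokenizer (at the input)")
print(pixels.shape)  # (32, 256)
```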
(((ل()(ل() 'yoav))))👾
COLM is inevitable, really. Because LLMs are not really about language/NLP and are certainly not about loss-based ML. They are something new. And they deserve their own thing. I'm glad Dipanjan et al. realized this and pushed forward.
Tomasz Limisiewicz retweeted
Yen-Ju Lu @Yen_Ju_Lu
🚀 Introducing the Latent Speech-Text Transformer (LST) — a speech-text model that organizes speech tokens into latent patches for better text→speech transfer, enabling steeper scaling laws and more efficient multimodal training ⚡️ Paper 📄 arxiv.org/pdf/2510.06195
Tomasz Limisiewicz retweeted
Julie Kallini ✨ @JulieKallini
New paper! 🌈 In English, pie = 🥧. In Spanish, pie = 🦶. Multilingual tokenizers often share such overlapping tokens between languages. Do these “False Friends” hurt or help multilingual LMs? We find that overlap consistently improves transfer—even when it seems misleading. 🧵
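To make the "overlapping tokens" notion concrete, here is a small sketch measuring how much vocabulary two corpora share under a common tokenizer; the whitespace tokenizer and toy sentences are stand-ins for the real multilingual setup, not anything from the paper:

```python
def tokenize(text):
    # Stand-in for a shared multilingual subword tokenizer (e.g. BPE).
    return text.lower().split()

def vocab_overlap(corpus_a, corpus_b):
    """Jaccard overlap between the token types that two corpora actually use."""
    vocab_a = {tok for sent in corpus_a for tok in tokenize(sent)}
    vocab_b = {tok for sent in corpus_b for tok in tokenize(sent)}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b), vocab_a & vocab_b

english = ["apple pie is a dessert", "a pie with apples"]
spanish = ["me duele el pie", "el pie izquierdo"]
score, shared = vocab_overlap(english, spanish)
print(score, shared)  # 'pie' counts as shared even though its meanings differ
```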
Tomasz Limisiewicz @TomLimi
@ayffous Note that performance on an unseen language (but with a seen script) is decent, thanks to vocabulary overlap with seen languages. I expect that the improvements from extending byte maps would be most significant for languages with unseen scripts (e.g. Santali).
Tomasz Limisiewicz @TomLimi
@ayffous MYTE can be extended by training Morfessor for a new language and extending the mappings, as in ‘byte_map’. I'm not sure how well it would work with continued pre-training; it’s an interesting research idea!
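A rough sketch of the extension recipe mentioned above, assuming the Morfessor 2.0 Python API (BaselineModel, train_batch, viterbi_segment); the corpus path, the vocabulary list, and the byte_map format/code-point range are hypothetical placeholders, not the released MYTE implementation:

```python
import morfessor

# Train an unsupervised morphological segmenter on the new language.
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("santali_corpus.txt")))  # hypothetical corpus file
model.train_batch()

# Extend the byte map: give each new morpheme its own multi-byte code.
byte_map = {}                      # morpheme -> UTF-8 byte sequence
next_codepoint = 0xF0000           # hypothetical unused private-use range
new_vocabulary = ["example", "wordforms"]  # placeholder; in practice, the new language's word list
for word in new_vocabulary:
    segments, _cost = model.viterbi_segment(word)
    for morpheme in segments:
        if morpheme not in byte_map:
            byte_map[morpheme] = chr(next_codepoint).encode("utf-8")
            next_codepoint += 1
```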
Tomasz Limisiewicz @TomLimi
📢New pre-print alert! 📢 A curse of over-segmentation haunts multilingual language models. While prior approaches have tried to resolve this by balancing data across languages, the problem lies much deeper — in the byte encodings themselves.🔍🔡 arxiv.org/pdf/2403.10691 (1/6)
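A concrete, self-contained illustration of why byte encodings themselves drive over-segmentation: UTF-8 spends one byte per Latin character but three or more per character in many other scripts, so byte-level sequences blow up for those languages. The example words are just greetings picked for illustration:

```python
samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Georgian": "გამარჯობა",
}

for lang, word in samples.items():
    n_chars = len(word)
    n_bytes = len(word.encode("utf-8"))  # UTF-8 length seen by a byte-level model
    print(f"{lang:9s} chars={n_chars:2d} utf8_bytes={n_bytes:2d} "
          f"bytes_per_char={n_bytes / n_chars:.1f}")
```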