Tomasz Limisiewicz

240 posts

@TomLimi

Postdoctoral researcher at @Meta FAIR and @uwnlp. Interested in the inner workings of neural networks, multilingualism, and fairer NLP (he/him)

Seattle · Joined September 2021
495 Following · 542 Followers
Pinned Tweet
Tomasz Limisiewicz @TomLimi
Excited to continue my research adventure as a postdoc at @uwnlp and @Meta! I’ve joined @LukeZettlemoyer's fantastic lab. Together, we plan to rethink how LLMs perceive data to unlock their capabilities for uncharted languages and, further, beyond text! [🦋posting]
Tomasz Limisiewicz retweeted
İlker Kesen @ilker_kesen
📢I'm organizing a BoF session at #EACL2026 called Tokenization & Beyond, aiming to gather researchers exploring tokenization and alternatives such as byte-level and pixel-based approaches. Sign up using the form if you're interested! #NLProc @eaclmeeting
Tomasz Limisiewicz @TomLimi
@RTomMcCoy The "judge of others" criterion for US exceptional-ability visas/green cards is easy to prove by reviewing.
Tom McCoy @RTomMcCoy
What's the endgame for people who mass-email conference organizers saying that they'd like to review papers? I've gotten several such requests in the past few months.
Tomasz Limisiewicz retweeted
Jason Weston @jaseweston
Our team in FAIR at Meta is hiring a postdoc researcher! We work on the topics of Reasoning, Alignment and Memory/architectures (RAM). Apply here: metacareers.com/profile/job_de… Location: NY, Seattle or Menlo Park. Some of our recent work to give flavor:
Co-Improvement (position): arxiv.org/abs/2512.05356
SPICE (Self-Play in Corpus Environments): arxiv.org/abs/2510.24684
Self-Challenging Agents: arxiv.org/abs/2506.01716
RL from Human Interaction: arxiv.org/abs/2509.25137
AggLM (parallel aggregation): arxiv.org/abs/2509.06870
StepWiser (CoT-PRM RL): arxiv.org/abs/2508.19229
DARLING (diversity-trained RL): arxiv.org/abs/2509.02534
J1 (RL-trained LLM-as-Judge): arxiv.org/abs/2505.10320
CoT-Self-Instruct: arxiv.org/abs/2507.23751
Multi-Token Attention: arxiv.org/abs/2504.00927
Tomasz Limisiewicz @TomLimi
@xhluca @benno_krojer I have always considered coding part of engineering. What is your definition of engineering? (If it means something like "planning out", then the percentages would roughly match my pipeline.)
Xing Han Lu @xhluca
@benno_krojer Before: 90% coding, 10% engineering
Now: 10% prompting, 90% engineering
Benno Krojer @benno_krojer
Funnily, automating a lot of my research processes with AI (coding) assistants finally forces me to document and structure my repository and paper well. I was always able to vibe it myself, but now, as I set up automation, I need to ensure things are well documented in the .cursorrules/CLAUDE.md files, in the README, and so on, so that Cursor/Claude can handle the complexity of the repo.

For now I have a workflow where I have one central folder and notebook where all the final paper plots are generated and thus easy to reproduce, one big README that documents everything needed to reproduce the whole research, and the Overleaf project is a git submodule of the project so that the plots/numbers can be directly synced.

I used to have such chaotic repositories; this feels good.
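One possible shape for the plot-syncing step described above, assuming a hypothetical plots/ output folder and an overleaf/ git submodule (not the actual repo layout):

```python
from pathlib import Path
import shutil

# Hypothetical layout: plots/ holds the notebook's final figures and
# overleaf/figures/ lives inside the Overleaf project checked out as a git submodule.
PLOTS_DIR = Path("plots")
OVERLEAF_FIGURES = Path("overleaf/figures")

def sync_plots():
    """Copy every final PDF figure into the Overleaf submodule so the paper
    always compiles against the latest reproducible plots."""
    OVERLEAF_FIGURES.mkdir(parents=True, exist_ok=True)
    for fig in sorted(PLOTS_DIR.glob("*.pdf")):
        shutil.copy2(fig, OVERLEAF_FIGURES / fig.name)
        print(f"synced {fig.name}")

if __name__ == "__main__":
    sync_plots()
```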
Tomasz Limisiewicz retweeted
Benjamin Minixhofer @bminixhofer
Bolmo is now on arXiv!
Tomasz Limisiewicz @TomLimi
Check out 🅱️olmo! Really cool approach of retrofitting an existing BPE model to operate on bytes with latent tokenization. This step lets us close the gap with subword-based models and even surpass them! ✨ Happy I could contribute to this forward-thinking project.
Benjamin Minixhofer @bminixhofer

We are releasing Bolmo today! Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3. Bolmo also performs competitively in terms of speed & is fully open. I was skeptical of byte-level models for a long time but I finally switched camps🧵

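For readers unfamiliar with the term, here is a toy sketch of what latent tokenization over bytes can look like: embed raw bytes, predict patch boundaries, and pool each span into one latent token. This is an illustrative assumption, not the actual Bolmo architecture.

```python
import torch
import torch.nn as nn

class LatentBytePooler(nn.Module):
    """Toy latent tokenizer: maps a raw byte sequence to a shorter sequence of
    pooled 'latent token' vectors via a learned boundary predictor."""

    def __init__(self, d_model=256):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_model)  # one embedding per byte value
        self.boundary = nn.Linear(d_model, 1)       # scores a patch boundary after each byte

    def forward(self, byte_ids):                    # byte_ids: (seq_len,) int64 in [0, 255]
        h = self.byte_emb(byte_ids)                 # (seq_len, d_model)
        is_boundary = torch.sigmoid(self.boundary(h)).squeeze(-1) > 0.5
        patches, start = [], 0
        for i, end_here in enumerate(is_boundary.tolist()):
            if end_here or i == len(byte_ids) - 1:
                patches.append(h[start:i + 1].mean(dim=0))  # pool the byte span
                start = i + 1
        return torch.stack(patches)                 # (num_latent_tokens, d_model)

pooler = LatentBytePooler()
byte_ids = torch.tensor(list("latent tokenization".encode("utf-8")))
print(pooler(byte_ids).shape)  # pooled latent tokens (boundaries are untrained here)
```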
Stella Biderman @BlancheMinerva
@allen_ai What are you talking about? You are comparing to exactly 0 SOTA models. Also.... Byte tokenization IS A SUBWORD TOKENIZER
Tomasz Limisiewicz retweeted
Ai2 @allen_ai
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Tomasz Limisiewicz retweeted
Melanie Sclar @melaniesclar
Carmen Sandiego is heading to #NeurIPS2025 - finally, a good use for this costume! I'm on the industry job market and organizing the agents + reasoning & planning workshop. Excited to chat about research (LLM robustness, reasoning, theory of mind), and job opportunities. DM me!
Simo Ryu @cloneofsimo
> nearly end of 2025
> Meta has well over 1M GPUs and one of the largest datacenters in the world
> JEPA as a franchise has been around for three years
> "yeah we got 79% on ImageNet-1k 💪💪💪 🎉🎉🎉🎉"
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

A NEW PAPER FROM YANN LECUN: LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

This could be one of LeCun's last papers at Meta (lol), but it's a really interesting one, I think. Quick summary:

Yann LeCun's big idea is JEPA, a self-supervised learning method. However, there are various failure modes of this approach, so training strong JEPA models is very brittle, unstable, and quite difficult. So overall JEPA has seen little adoption in practice.

This paper tries to directly address this, making specific design decisions that improve training stability. The authors identify the isotropic Gaussian as the optimal distribution that JEPA models' embeddings should follow and design the Sketched Isotropic Gaussian Regularization (SIGReg) to constrain embeddings to reach that ideal distribution. This forms the LeJEPA framework, which can be implemented in ~50 lines of code.

On empirical tests, the authors demonstrate stability of training across hyperparameters, architectures, and datasets. A result particularly interesting to me, however, is that training a LeJEPA model from scratch directly on the downstream dataset outperforms finetuning a DINOv2/v3 model on the dataset!

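As a rough illustration of the regularization idea summarized above (project embeddings onto random directions and push each 1-D projection toward a standard normal), here is a hedged PyTorch sketch. It is not the paper's actual SIGReg objective, just the flavor of it:

```python
import torch

def isotropic_gaussian_penalty(z, num_directions=64):
    """Penalize deviation of random 1-D projections of the embeddings z
    (shape: batch x dim) from a standard normal, via the first two moments."""
    dirs = torch.randn(z.shape[1], num_directions, device=z.device)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)   # random unit directions (the "sketch")
    proj = z @ dirs                                # (batch, num_directions)
    mean = proj.mean(dim=0)
    var = proj.var(dim=0)
    # Isotropic N(0, I) target: zero mean and unit variance along every direction.
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

loss = isotropic_gaussian_penalty(torch.randn(128, 512))
print(loss.item())
```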
Tomasz Limisiewicz retweeted
Hila Gonen @hila_gonen
Considering a PhD/MSc in NLP? I’m hiring students this cycle! If you are passionate about making language models reliable and safe, eager to understand and control language models, and would like to add some multilingual flavor to your research - apply to my group! 👇
Albert Gu @_albertgu
At the tokenizer workshop panel at ICML, I made an offhand joke about eventually going to raw pixels being the way. I didn't press it too hard because the pitchforks were already out over H-Net and I wanted to make it home, but yes, still in favor 🙋
Andrej Karpathy @karpathy

@thawani_avijit Haha. I am afraid people interpreted my “delete tokenizer” as “use bytes directly without BPE”; the issue is you *still* need the arbitrariness of byte encoding even for that! Pixels is the only way. Just like humans. It is written. If GPT-10 uses utf8 at the input I will eat a shoe.

Avijit Thawani (Avi) @thawani_avijit
Retweeting purely for the periodic tokenizer rant.
Andrej Karpathy @karpathy

I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.

The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not an end-to-end stage. It "imports" all the ugliness of Unicode, byte encodings, it inherits a lot of historical baggage, security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.

OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.

So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.

Now I have to also fight the urge to side-quest an image-input-only version of nanochat...

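A tiny sketch of the "render text and feed pixels" idea from the thread above, assuming PIL's default bitmap font and an arbitrary canvas size (none of these choices come from the quoted tweets):

```python
import numpy as np
from PIL import Image, ImageDraw

def render_text_to_pixels(text, width=256, height=32):
    """Rasterize a string into a grayscale array that a vision encoder could
    consume in place of text tokens."""
    img = Image.new("L", (width, height), color=255)   # white canvas
    ImageDraw.Draw(img).text((2, 2), text, fill=0)     # black text, default font
    return np.asarray(img, dtype=np.float32) / 255.0   # (height, width), values in [0, 1]

pixels = render_text_to_pixels("delete the tokenizer (at the input)")
print(pixels.shape)  # (32, 256)
```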
(((ل()(ل() 'yoav))))👾
COLM is inevitable, really. Because LLMs are not really about language/NLP and are certainly not about loss-based ML. They are something new. And they deserve their own thing. I'm glad Dipanjan et al. realized this and pushed forward.
Tomasz Limisiewicz retweeted
Yen-Ju Lu @Yen_Ju_Lu
🚀 Introducing the Latent Speech-Text Transformer (LST) — a speech-text model that organizes speech tokens into latent patches for better text→speech transfer, enabling steeper scaling laws and more efficient multimodal training ⚡️ Paper 📄 arxiv.org/pdf/2510.06195
Tomasz Limisiewicz retweeted
Julie Kallini ✨ @JulieKallini
New paper! 🌈 In English, pie = 🥧. In Spanish, pie = 🦶. Multilingual tokenizers often share such overlapping tokens between languages. Do these “False Friends” hurt or help multilingual LMs? We find that overlap consistently improves transfer—even when it seems misleading. 🧵
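To make the "overlapping tokens" notion concrete, here is a small sketch measuring how much vocabulary two corpora share under a common tokenizer; the whitespace tokenizer and toy sentences are stand-ins for the real multilingual setup, not anything from the paper:

```python
def tokenize(text):
    # Stand-in for a shared multilingual subword tokenizer (e.g. BPE).
    return text.lower().split()

def vocab_overlap(corpus_a, corpus_b):
    """Jaccard overlap between the token types that two corpora actually use."""
    vocab_a = {tok for sent in corpus_a for tok in tokenize(sent)}
    vocab_b = {tok for sent in corpus_b for tok in tokenize(sent)}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b), vocab_a & vocab_b

english = ["apple pie is a dessert", "a pie with apples"]
spanish = ["me duele el pie", "el pie izquierdo"]
score, shared = vocab_overlap(english, spanish)
print(score, shared)  # 'pie' counts as shared even though its meanings differ
```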
Tomasz Limisiewicz @TomLimi
@ayffous Note that performance on an unseen language (but with a seen script) is decent, thanks to vocabulary overlap with seen languages. I expect that the improvements from extending byte maps would be most significant for languages with unseen scripts (e.g. Santali).
Tomasz Limisiewicz @TomLimi
@ayffous MYTE can be extended by training Morfessor for a new language and extending the mappings, as in ‘byte_map’. I'm not sure how well it would work with continued pre-training; it’s an interesting research idea!
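A rough sketch of the extension recipe mentioned above, assuming the Morfessor 2.0 Python API (BaselineModel, train_batch, viterbi_segment); the corpus path, the vocabulary list, and the byte_map format/code-point range are hypothetical placeholders, not the released MYTE implementation:

```python
import morfessor

# Train an unsupervised morphological segmenter on the new language.
io = morfessor.MorfessorIO()
model = morfessor.BaselineModel()
model.load_data(list(io.read_corpus_file("santali_corpus.txt")))  # hypothetical corpus file
model.train_batch()

# Extend the byte map: give each new morpheme its own multi-byte code.
byte_map = {}                      # morpheme -> UTF-8 byte sequence
next_codepoint = 0xF0000           # hypothetical unused private-use range
new_vocabulary = ["example", "wordforms"]  # placeholder; in practice, the new language's word list
for word in new_vocabulary:
    segments, _cost = model.viterbi_segment(word)
    for morpheme in segments:
        if morpheme not in byte_map:
            byte_map[morpheme] = chr(next_codepoint).encode("utf-8")
            next_codepoint += 1
```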
Tomasz Limisiewicz @TomLimi
📢New pre-print alert! 📢 A curse of over-segmentation haunts multilingual language models. While prior approaches have tried to resolve this by balancing data across languages, the problem lies much deeper — in the byte encodings themselves.🔍🔡 arxiv.org/pdf/2403.10691 (1/6)
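A concrete, self-contained illustration of why byte encodings themselves drive over-segmentation: UTF-8 spends one byte per Latin character but three or more per character in many other scripts, so byte-level sequences blow up for those languages. The example words are just greetings picked for illustration:

```python
samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Georgian": "გამარჯობა",
}

for lang, word in samples.items():
    n_chars = len(word)
    n_bytes = len(word.encode("utf-8"))  # UTF-8 length seen by a byte-level model
    print(f"{lang:9s} chars={n_chars:2d} utf8_bytes={n_bytes:2d} "
          f"bytes_per_char={n_bytes / n_chars:.1f}")
```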