Jan Hendrik Metzen

102 posts

@jan_metzen

Research Scientist at Prior Labs @prior_labs

Böblingen, Germany · Joined January 2022

622 Following · 182 Followers

Jan Hendrik Metzen reposted

Prior Labs@prior_labs·12 May

TabPFN-3 is live. A massive leap forward in scale, speed & accuracy. 1M data points and 10-1000x faster inference on one H100. No training. No tuning. Load your dataset and predict. #tabpfn #tabularfoundationmodels #priorlabs

Jan Hendrik Metzen@jan_metzen·5 May

@TomLimi Very nice work @TomLimi ! Based on my experience, the optimal compression rates for English are pretty low in your work - wonder if this is an artifact of not using end-to-end learned compression. Did you try H-Net or similar?

Tomasz Limisiewicz@TomLimi·4 May

We present Compute Optimal Tokenization! 🔡 Common LLM scaling works stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
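The compression rate mentioned above (bytes/token) can be illustrated with a minimal sketch. The two tokenizers here are hypothetical stand-ins chosen for illustration (one token per UTF-8 byte, and a plain whitespace split); neither is taken from the paper being announced.

```python
def bytes_per_token(text: str, tokenize) -> float:
    """Average number of UTF-8 bytes covered by one token."""
    tokens = tokenize(text)
    return len(text.encode("utf-8")) / len(tokens)


def byte_tokenize(text: str) -> list[int]:
    # Assumption: byte-level model, one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def whitespace_tokenize(text: str) -> list[str]:
    # Assumption: crude stand-in for a coarser tokenizer.
    return text.split()


sample = "tokenizers trade sequence length for vocabulary size"
byte_rate = bytes_per_token(sample, byte_tokenize)
word_rate = bytes_per_token(sample, whitespace_tokenize)
print(f"byte-level: {byte_rate:.2f} bytes/token")
print(f"whitespace: {word_rate:.2f} bytes/token")
```

A higher bytes/token value means shorter sequences for the same text, which is the quantity being swept in the thread.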

Jan Hendrik Metzen@jan_metzen·18 Dec

@bminixhofer @allen_ai amazing, thanks for considering this feedback!

Benjamin Minixhofer@bminixhofer·18 Dec

@jan_metzen @allen_ai fyi @jan_metzen this is fixed (and your vllm fork referenced) in the arxiv version: arxiv.org/abs/2512.15586!

Ai2@allen_ai·15 Dec

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

Jan Hendrik Metzen@jan_metzen·16 Dec

@srchvrs @PontiEdoardo @BlancheMinerva @allen_ai one still has a more direct access to individual bytes than with subword tokenization. and with sliding-window attention in the encoder, one could attend to individual bytes without context length explosion (but it underperforms Mamba-2)

Leo Boytsov@srchvrs·16 Dec

@PontiEdoardo @BlancheMinerva @allen_ai 1. This sounds like you don't use attention at all, but you do at higher levels. 2. Yes, you can compromise and compress bytes effectively replacing a tokenizer with a recurrent net, but you lose the ability to attend to individual bytes.

Jan Hendrik Metzen@jan_metzen·16 Dec

@allen_ai Nice work! Thanks also for pointing out future required "batched inference optimizations". We have explored this (github.com/Aleph-Alpha/vl…) and it remains unclear if batched inference performance of such architectures can be made competitive (dynamic patching, scheduling etc.)

Jan Hendrik Metzen@jan_metzen·16 Dec

@srchvrs @BlancheMinerva @allen_ai The authors (and other related work) don't use quadratic attention in the byte-level models. The actual inference efficiency challenge is in batched inference, due to varying patch size and scheduling overhead of the hierarchical architecture.
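To make the batching issue concrete, here is a toy sketch of how varying patch counts waste capacity when a batch is padded to its longest member. It assumes whitespace-delimited patches purely for illustration; it does not reflect how Bolmo or the vLLM fork mentioned in this thread actually patch or schedule.

```python
def patch_lengths(text: str) -> list[int]:
    """Byte length of each patch, assuming whitespace-delimited patches."""
    return [len(word.encode("utf-8")) for word in text.split()]


def padding_fraction(batch: list[str]) -> float:
    """Share of patch slots that are padding when every sequence
    is padded to the largest patch count in the batch."""
    counts = [len(patch_lengths(text)) for text in batch]
    total_slots = max(counts) * len(counts)
    return 1 - sum(counts) / total_slots


batch = ["a b c d", "hello world", "supercalifragilistic"]
print([patch_lengths(text) for text in batch])
print(f"padding fraction: {padding_fraction(batch):.2f}")
```

The three inputs have similar byte counts but 4, 2, and 1 patches, so the higher-level model would see very uneven sequence lengths within one batch.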

Leo Boytsov@srchvrs·16 Dec

@BlancheMinerva @allen_ai I think a big issue here is loss in efficiency (as it was called out recently) due to increased context window. But increased context window can also lead to loss in accuracy due to attention dilution when the window is large-enough. machinelearningatscale.substack.com/p/dont-abolish…

Jan Hendrik Metzen reposted

Piotr Mazurek (in SF 🌉)@tugot17·26 Nov

RL is cool, but what do you actually need to know about hardware and infra to predict its future? Check out our new piece on tensoreconomics:

Jan Hendrik Metzen reposted

Prior Labs@prior_labs·6 Nov

Today, TabPFN gets an upgrade. TabPFN-2.5 is here. 🪂 TabPFN-2.5 outperforms tuned tree-based models & matches the performance of a complex ensemble (AutoGluon 1.4) tuned for 4 hours on benchmarks of up to 50,000 samples and 2,000 features. 🧵 1/7 #tabpfn #priorlabs

Jan Hendrik Metzen@jan_metzen·26 Sep

@linguist_cat what is your definition of a tokenizer, and what would be an approach you would consider tokenizer-free?

Catherine Arnett@linguist_cat·25 Sep

I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!

Jan Hendrik Metzen@jan_metzen·9 Sep

@main_horse did you compare mamba2 vs. transformer encoder/decoder? would be interested if the finding from the paper of Mamba2 being preferable can be reproduced

main@main_horse·9 Sep

I hope to receive pushback on today's claim.

Jan Hendrik Metzen@jan_metzen·9 Sep

@main_horse imo, comparing in an overtrained regime makes most sense as the initial overhead of learning the chunking becomes almost negligible (downside being that compute cost for experiments increases)

Jan Hendrik Metzen@jan_metzen·9 Sep

@main_horse agree with you in general. making a fair comparison is non-trivial, even pretraining encoder/decoder would be skewed as it would correspond to a setting with pretrained tokenizer + pretrained embedding/LM-head.

Jan Hendrik Metzen reposted

Pablo Iyu Guerrero@pabloiyu·4 Sep

First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵

Jan Hendrik Metzen@jan_metzen·25 Aug

@JJitsev @BlancheMinerva after compression, there is one latent unit per word, but this unit depends on the entire input up to that word (encoder attention crosses word boundaries). it does not merely encode that particular word and is not an "atomic" token in that sense.

Jan Hendrik Metzen@jan_metzen·25 Aug

@JJitsev @BlancheMinerva Two comments: the architecture of the released model differs from the paper cited above (it is more of a HATv2; we will share a tech report on this architecture soon). The word-splitting is only used in an internal cross-attention layer for sequence compression.

Stella Biderman@BlancheMinerva·22 Aug

What do you call those units of semantic text the LLM compresses English and German into when you brag about the compression rate? It's not UTF-8 bytes... there's a word for it, maybe starts with a T?

Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.

Jan Hendrik Metzen@jan_metzen·23 Aug

@sasuke___420 @tulkenss @TomLimi there are certainly challenges in productionizing these types of models, mainly because batched inference is non-trivial (varying sequence compression across the batch). We have a work-in-progress vLLM fork to address those challenges, see: x.com/jan_metzen/sta…

Jan Hendrik Metzen@jan_metzen

Excited about our release of a collection of byte-level hierarchical autoregressive transformers (HAT). If you care about bringing these types of models into production, check out our work-in-progress vLLM fork for HAT: github.com/Aleph-Alpha/vl…

sasuke⚡420@sasuke___420·23 Aug

@jan_metzen @tulkenss @TomLimi but I'm still sort of interested in trying out the various ideas in these implementations somewhat independently from each other!

slm tokens@tulkenss·22 Aug

I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.
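The pipeline described in this post (split into words, then feed each word's bytes to the model) can be sketched roughly as follows. The whitespace split and the function name `words_to_byte_ids` are assumptions for illustration; the thread does not specify the splitting rule of the released models, and the embedding step is omitted.

```python
def words_to_byte_ids(text: str) -> list[list[int]]:
    """Split on whitespace (an assumed rule), then represent each
    word as the list of its UTF-8 byte values (0-255)."""
    return [list(word.encode("utf-8")) for word in text.split()]


groups = words_to_byte_ids("Grüße an alle")
for word, ids in zip("Grüße an alle".split(), groups):
    print(word, ids)
```

Because the inputs are raw byte values, any word, including ones with non-ASCII characters such as "Grüße", maps to a valid sequence without a fixed word or subword vocabulary.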

Jan Hendrik Metzen@jan_metzen·23 Aug

@tulkenss yes, the implementation differs from the January HAT paper (this is confusing admittedly, sorry for this!). We will release a more detailed tech report soon. Atm, the best explanation is huggingface.co/Aleph-Alpha/ll…

slm tokens@tulkenss·23 Aug

@jan_metzen e.g. you could also do the wordlevel processing, but build up the words out of subwords. There’s no real reason this would work worse I guess. You’d have to deal with a larger vocabulary size in the initial encoder, but whatever

Jan Hendrik Metzen@jan_metzen·23 Aug

@atomicflndr Agree with @atomicflndr that eventually, the splitting rule will have to be learned to realize the full potential of tokenizer-free approaches. The engineering challenges related to batched inference will also have to be addressed (x.com/jan_metzen/sta…)

Johannes Messner@atomicflndr·23 Aug

The real issue, as I see it, is that the word splitting rule is not learned, unlike BLT and HNet. This is WIP.

Johannes Messner@atomicflndr·23 Aug

Seeing this pushback a lot - and it’s fair! However, these models don’t have a fixed vocabulary, i.e. there are infinitely many words the model can operate over instead of a finite set of tokens.
