Jan Hendrik Metzen

102 posts

@jan_metzen

Research Scientist at Prior Labs @prior_labs

Böblingen, Germany · Joined January 2022
622 Following · 182 Followers
Jan Hendrik Metzen reposted
Prior Labs@prior_labs·
TabPFN-3 is live. A massive leap forward in scale, speed & accuracy. 1M data points and 10-1000x faster inference on one H100. No training. No tuning. Load your dataset and predict. #tabpfn #tabularfoundationmodels #priorlabs
Jan Hendrik Metzen@jan_metzen·
@TomLimi Very nice work @TomLimi! Based on my experience, the optimal compression rates for English are pretty low in your work. I wonder if this is an artifact of not using end-to-end learned compression. Did you try H-Net or similar?
Tomasz Limisiewicz@TomLimi·
We present Compute Optimal Tokenization! 🔡 Most LLM scaling works stick to one tokenizer, sweeping data/model size. But what happens when we control the tokenizer’s compression rate (bytes/token)? Here we sweep tokenizers, params, and data across compute budgets: [1/N]
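To make the "compression rate (bytes/token)" in the post above concrete, here is a minimal sketch. The helper name `compression_rate` and the two toy tokenizations (raw bytes and whitespace-split words) are assumptions for illustration only; they are not the tokenizers swept in the work being discussed.

```python
from typing import Sequence


def compression_rate(text: str, tokens: Sequence) -> float:
    """Average number of UTF-8 bytes covered by one token (bytes/token)."""
    if len(tokens) == 0:
        raise ValueError("need at least one token")
    return len(text.encode("utf-8")) / len(tokens)


text = "byte level models read raw text"

# Byte-level "tokenization": every UTF-8 byte is its own token.
byte_tokens = list(text.encode("utf-8"))

# Toy word-level tokenization (an assumption, standing in for a real tokenizer).
word_tokens = text.split()

print(compression_rate(text, byte_tokens))  # exactly 1 byte per token
print(compression_rate(text, word_tokens))  # more bytes per token, shorter sequence
```

A higher rate means fewer tokens for the same text, so the model takes fewer sequence steps, but each token has to carry more information.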
Ai2@allen_ai·
Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵
Jan Hendrik Metzen@jan_metzen·
@srchvrs @PontiEdoardo @BlancheMinerva @allen_ai one still has a more direct access to individual bytes than with subword tokenization. and with sliding-window attention in the encoder, one could attend to individual bytes without context length explosion (but it underperforms Mamba-2)
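As a rough illustration of the sliding-window point in the reply above, the sketch below builds a causal sliding-window mask over byte positions and counts the attended pairs. The function names and the window size are hypothetical and are not taken from any of the models discussed in the thread.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True if byte i may attend to byte j.

    Causal: j <= i. Local: only the last `window` positions are visible.
    """
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Total number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


n_bytes = 6
local = sliding_window_mask(n_bytes, window=3)
full = sliding_window_mask(n_bytes, window=n_bytes)  # plain causal attention

print(attended_pairs(local), attended_pairs(full))
```

With a fixed window, the number of attended pairs grows linearly in the number of bytes rather than quadratically, which is why every byte can stay addressable without the context length exploding.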
Leo Boytsov@srchvrs·
@PontiEdoardo @BlancheMinerva @allen_ai 1. This sounds like you don't use attention at all, but you do at higher levels. 2. Yes, you can compromise and compress bytes effectively replacing a tokenizer with a recurrent net, but you lose the ability to attend to individual bytes.
Jan Hendrik Metzen@jan_metzen·
@allen_ai Nice work! Thanks also for pointing out future required "batched inference optimizations". We have explored this (github.com/Aleph-Alpha/vl…) and it remains unclear if batched inference performance of such architectures can be made competitive (dynamic patching, scheduling etc.)
Jan Hendrik Metzen@jan_metzen·
@srchvrs @BlancheMinerva @allen_ai The authors (and other related work) don't use quadratic attention in the byte-level models. The actual inference efficiency challenge is in batched inference, due to varying patch size and scheduling overhead of the hierarchical architecture.
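A toy sketch of why varying patch size complicates batching: two inputs with the same number of bytes can produce very different numbers of patches, so a rectangular batch needs padding. The one-patch-per-whitespace-word rule and both helper names are assumptions for illustration, not the patching scheme of any specific model.

```python
def patches_per_sequence(text: str) -> int:
    # Hypothetical patching rule: one patch per whitespace-separated word.
    return len(text.split())


def padding_fraction(batch: list[str]) -> float:
    """Share of patch slots that are padding when the batch is made rectangular."""
    counts = [patches_per_sequence(t) for t in batch]
    if not counts or max(counts) == 0:
        raise ValueError("batch must contain at least one non-empty sequence")
    total_slots = max(counts) * len(counts)
    return 1.0 - sum(counts) / total_slots


# Both inputs are 7 bytes long, yet they yield 4 patches and 1 patch.
batch = ["a b c d", "abcdefg"]
print([patches_per_sequence(t) for t in batch])
print(padding_fraction(batch))
```

A scheduler for such models has to cope with this mismatch at every decoding step, which is one way to read the "scheduling overhead" mentioned above.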
Jan Hendrik Metzen reposted
Piotr Mazurek (in SF 🌉)
RL is cool, but what do you actually need to know about hardware and infra to predict its future? Check out our new piece on tensoreconomics:
Jan Hendrik Metzen reposted
Prior Labs@prior_labs·
Today, TabPFN gets an upgrade. TabPFN-2.5 is here. 🪂 TabPFN-2.5 outperforms tuned tree-based models & matches the performance of a complex ensemble (AutoGluon 1.4) tuned for 4 hours on benchmarks of up to 50,000 samples and 2,000 features. 🧵 1/7 #tabpfn #priorlabs
Jan Hendrik Metzen@jan_metzen·
@linguist_cat what is your definition of a tokenizer, and what would be an approach you would consider tokenizer-free?
Catherine Arnett@linguist_cat·
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
Jan Hendrik Metzen@jan_metzen·
@main_horse did you compare mamba2 vs. transformer encoder/decoder? would be interested if the finding from the paper of Mamba2 being preferable can be reproduced
main@main_horse·
I hope to receive pushback on today's claim.
Jan Hendrik Metzen@jan_metzen·
@main_horse imo, comparing in an overtrained regime makes most sense as the initial overhead of learning the chunking becomes almost negligible (downside being that compute cost for experiments increases)
Jan Hendrik Metzen@jan_metzen·
@main_horse agree with you in general. making a fair comparison is non-trivial, even pretraining encoder/decoder would be skewed as it would correspond to a setting with pretrained tokenizer + pretrained embedding/LM-head.
Jan Hendrik Metzen reposted
Pablo Iyu Guerrero@pabloiyu·
First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵
Jan Hendrik Metzen@jan_metzen·
@JJitsev @BlancheMinerva after compression, there is one latent unit per word, but this unit depends on the entire input up to that word (encoder attention crosses word boundaries). it does not merely encode that particular word and is not an "atomic" token in that sense.
Jan Hendrik Metzen@jan_metzen·
@JJitsev @BlancheMinerva Two comments: the architecture of the released model differs from the paper cited above (it is more of a HATv2; we will share a tech report on this architecture soon). the word-splitting is only used in an internal cross-attention layer for sequence compression.
Stella Biderman@BlancheMinerva·
What do you call those units of semantic text the LLM compresses English and German into when you brag about the compression rate? It's not UTF-8 bytes... there's a word for it, maybe starts with a T?
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.

Jan Hendrik Metzen@jan_metzen·
@sasuke___420 @tulkenss @TomLimi there are certainly challenges in productionizing these types of models, mainly because batched inference is non-trivial (varying sequence compression across the batch). We have a work-in-progress vLLM fork to address those challenges, see: x.com/jan_metzen/sta…
Jan Hendrik Metzen@jan_metzen

Excited about our release of a collection of byte-level hierarchical autoregressive transformers (HAT). If you care about bringing these types of models into production, check out our work-in-progress vLLM fork for HAT: github.com/Aleph-Alpha/vl…

sasuke⚡420@sasuke___420·
@jan_metzen @tulkenss @TomLimi but I'm still sort of interested in trying out the various ideas in these implementations somewhat independently from each other!
slm tokens@tulkenss·
I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.
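The word-then-bytes pipeline described in the post above (split the sequence into words, then turn the bytes of each word into model input) can be sketched as follows. The splitting regex and both function names are assumptions for illustration; the thread does not spell out the actual splitting rule used by the released models.

```python
import re


def split_into_words(text: str) -> list[str]:
    # Hypothetical rule: each word keeps its trailing whitespace, so that
    # joining the pieces reproduces the input exactly.
    return re.findall(r"\S+\s*|\s+", text)


def words_to_byte_ids(words: list[str]) -> list[list[int]]:
    # Each word becomes a variable-length list of UTF-8 byte values (0-255).
    # There is no fixed word vocabulary; only the 256 byte values are fixed.
    return [list(word.encode("utf-8")) for word in words]


text = "Grüße aus Böblingen"
words = split_into_words(text)
byte_ids = words_to_byte_ids(words)

for word, ids in zip(words, byte_ids):
    print(repr(word), len(ids))
```

Because any string of bytes is a valid word under this scheme, the set of possible words is open-ended, while the splitting step itself remains a fixed, hand-written rule.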

Jan Hendrik Metzen@jan_metzen·
@tulkenss yes, the implementation differs from the January HAT paper (this is confusing admittedly, sorry for this!). We will release a more detailed tech report soon. Atm, the best explanation is huggingface.co/Aleph-Alpha/ll…
slm tokens@tulkenss·
@jan_metzen e.g. you could also do the wordlevel processing, but build up the words out of subwords. There’s no real reason this would work worse I guess. You’d have to deal with a larger vocabulary size in the initial encoder, but whatever
Jan Hendrik Metzen@jan_metzen·
@atomicflndr Agree with @atomicflndr that eventually, the splitting rule will have to be learned to realize the full potential of tokenizer-free approaches. And the engineering challenges related to batched inference will have to be addressed (x.com/jan_metzen/sta…)
Jan Hendrik Metzen@jan_metzen

Excited about our release of a collection of byte-level hierarchical autoregressive transformers (HAT). If you care about bringing these types of models into production, check out our work-in-progress vLLM fork for HAT: github.com/Aleph-Alpha/vl…

Johannes Messner@atomicflndr·
The real issue, as I see it, is that the word splitting rule is not learned, unlike BLT and HNet. This is WIP.
Johannes Messner@atomicflndr·
Seeing this pushback a lot, and it's fair! However, these models don't have a fixed vocabulary, i.e. there are infinitely many words the model can operate over instead of a finite set of tokens.
slm tokens@tulkenss

I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.
