Johannes Messner
@atomicflndr
🇪🇺
396 posts
Hamburg, Germany · Joined December 2016
306 Following · 317 Followers
Johannes Messner @atomicflndr
... and scale up to larger models and generative evals to confirm this trend!
[image]
Matthias #SiempreGino @NairoInGreen
I was sleeping, when I saw a strange figure in the corner of my room, it looked like it was a ghost. When I turned the lights on I saw it was none other than Marco Odermatt! I was scared but then I said that a big race is coming up soon. Thankfully he immediately vanished.
Johannes Messner retweeted
samsja @samsja19
Today we’re releasing Trinity Large, a 400B MoE LLM with 13B active parameters, trained over 17T tokens. The base model is on par with GLM-4.5 Base, while being significantly faster at inference because it’s sparser and hybrid. The architecture we picked is one of my favorites: 3:1 local/global with SWA, NoPE on the global layers and RoPE on the local layers, gated attention, depth-scaled sandwich norm, and smooth training with Muon. Our dataset is also high quality, curated by @datologyai. We trained it on 2,000 B300s for a month on @PrimeIntellect infrastructure. This is a preview release with an instruct model only — we’re ramping up RL on it. When @latkins approached us a couple of months ago to train this model together, I thought he was crazy — but then he hired @stochasticchasm, and here we are.
Prime Intellect@PrimeIntellect

We're excited to introduce @arcee_ai's Trinity Large model. An open 400B parameter Mixture of Experts model, delivering frontier-level performance with only 13B active parameters. Trained in collaboration between Arcee, Datology and Prime Intellect.

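The 3:1 local/global pattern described above can be sketched as follows. This is a toy illustration of the interleaving only, not Trinity's actual code: every fourth layer is "global" (full attention, NoPE), and the three layers in between are "local" (sliding-window attention, RoPE).

```python
# Illustrative sketch of a 3:1 local/global layer interleaving.
# The function name and ratio are taken from the tweet's description;
# everything else is a stand-in.

def layer_kinds(n_layers: int, local_per_global: int = 3):
    """Return a list like ['local', 'local', 'local', 'global', ...]."""
    kinds = []
    for i in range(n_layers):
        # place one global layer after each run of local layers
        if (i + 1) % (local_per_global + 1) == 0:
            kinds.append("global")  # full attention, no positional encoding (NoPE)
        else:
            kinds.append("local")   # sliding-window attention (SWA) with RoPE
    return kinds

print(layer_kinds(8))
# ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
```

The point of the pattern is that most layers only attend within a short window, which keeps inference cheap, while the occasional global layer propagates long-range information.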
Nan Wang @nanwang_t
From Feb 2020 to now, 2000+ days building this wild ride. From OSS roots to ES acquisition: funding highs, ChatGPT panic, jina topping GH/HF, reader traffic booms, bilingual embeddings' crickets. Family ties keep me from joining ES. Thank you for all your support, and here's to the next adventure! @JinaAI_
Elastic@elastic

We’re excited to announce that we have joined forces with @JinaAI_, a leader in frontier models for multimodal and multilingual search. This acquisition deepens Elastic’s capabilities in retrieval, embeddings, and context engineering to power agentic AI: go.es.io/48QeYCM

Johannes Messner @atomicflndr
@NWalhan @KuittinenPetri @Aleph__Alpha Tech report is coming soon, but the positional encoding is standard RoPE (for all sub-transformers). I don't have the loss curve of these particular checkpoints at hand right now, but I can show you a CPT curve from a different HAT model I'm currently training; it's quite boring.
[image]
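The "standard RoPE" mentioned in the reply can be sketched in a few lines: pairs of dimensions are rotated by a position-dependent angle, so relative offsets between positions become dot-product-visible. This is a minimal reference sketch, not the model's implementation.

```python
import math

# Minimal sketch of rotary position embedding (RoPE) applied to one vector.
# Each (even, odd) dimension pair is rotated by pos / base**(i/d).

def rope(vec, pos, base=10000.0):
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

# Position 0 applies a zero rotation, leaving the vector unchanged.
print(rope([1.0, 0.0, 1.0, 0.0], 0))  # [1.0, 0.0, 1.0, 0.0]
```

Because each step is a pure rotation, RoPE preserves vector norms; only the angle between query and key vectors shifts with their relative distance.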
Waltan @NWalhan
I can’t find much info about your innovative model. If it’s ok, can you highlight: the positional encoding used (in the encoder, backbone, and decoder), and the training loss at various step counts? Would love to know how it curved with three transformers on top of each other.
Aleph Alpha @Aleph__Alpha
Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B. Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.
[image]
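The hierarchical idea behind HAT, as described in this thread, can be sketched at a toy level: split text into words, encode each word's UTF-8 bytes into one word embedding, and run the backbone over word positions rather than bytes. Both helper functions below are stand-ins invented for this illustration, not the real architecture.

```python
# Toy sketch of the HAT-style hierarchy: byte-level word encoding.
# words_to_byte_chunks and encode_word are illustrative stand-ins.

def words_to_byte_chunks(text: str):
    # word-boundary splitting: the rule-based (non-learned) step
    # discussed further down this thread
    return [w.encode("utf-8") for w in text.split(" ")]

def encode_word(byte_chunk: bytes, dim: int = 8):
    # stand-in "encoder": pool byte values into a fixed-size vector;
    # the real model uses a small transformer here
    vec = [0.0] * dim
    for j, b in enumerate(byte_chunk):
        vec[j % dim] += b / 255.0
    return vec

chunks = words_to_byte_chunks("Hallo Welt")
word_embeddings = [encode_word(c) for c in chunks]
print(len(word_embeddings), len(word_embeddings[0]))  # 2 8
```

Note that this works for any Unicode word, which is why there is no fixed vocabulary: the "embedding table" is replaced by a function of the word's bytes.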
Johannes Messner retweeted
Vedant Nanda @_nvedant_
Curious how to accelerate inference of some of the recent byte-level models like HAT/HNet/BLT? Check out this vllm fork developed by my friends and colleagues, Pablo and Lukas! To my knowledge, this is the first demonstration of inference speedups from dynamic chunking in byte models!
Pablo Iyu Guerrero@pabloiyu

First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵

Johannes Messner retweeted
Pablo Iyu Guerrero @pabloiyu
First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵
[image]
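A back-of-the-envelope way to see why chunked inference can beat a byte-level baseline: the backbone runs over chunks (here, words) instead of individual bytes, so its sequence length shrinks by the average chunk length. The counting functions below are illustrative, not the fork's actual accounting.

```python
# Rough sketch of the sequence-length saving from word-level chunking
# in a hierarchical byte model. Both functions are stand-ins.

def backbone_positions(text: str) -> int:
    # one backbone position per word-chunk
    return len(text.split(" "))

def byte_positions(text: str) -> int:
    # one position per UTF-8 byte in a flat byte-level model
    return len(text.encode("utf-8"))

text = "tokenizer free models are fun"
print(byte_positions(text), backbone_positions(text))  # 29 5
```

With attention cost growing with sequence length, running the expensive backbone over ~5x fewer positions is where the speedup comes from; the small byte-level encoder and decoder still see every byte, but they are much cheaper.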
Johannes Messner retweeted
Vedant Nanda @_nvedant_
Our work on tokenizer free LLMs: Hierarchical Autoregressive Transformers (HAT)! We recently dropped HAT models on HF, pretrained from scratch! huggingface.co/collections/Al… You can try them with both HF Inference AND our vllm fork: github.com/Aleph-Alpha/vl… 🧵 (1/6)
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.

Johannes Messner retweeted
Vedant Nanda @_nvedant_
And imo the coolest part is that we made it ready for production-grade inference with our own vllm fork (more details on this soon!): github.com/Aleph-Alpha/vl… So you can now enjoy all the vllm features like continuous batching, paged attention, etc. for HAT too! (5/6)
slm tokens @tulkenss
@atomicflndr By the way, do you have an easy way to run the initial encoder (the “word embedding maker”) in isolation? I could make it work, but maybe you already did. I’m very curious what these embeddings look like, and how they perform compared to other word embeddings.
Johannes Messner @atomicflndr
Seeing this pushback a lot, and it's fair! However, these models don't have a fixed vocabulary, i.e. there are infinitely many words the model can operate over instead of a finite set of tokens.
slm tokens@tulkenss

I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.

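The "no fixed vocabulary" point can be made concrete with a toy contrast: a subword vocabulary must map unseen words to a fallback entry (or split them), while a byte-level view represents any Unicode string losslessly. The tiny vocabulary below is made up for the example.

```python
# Toy contrast: fixed subword vocabulary vs. byte-level representation.
# FIXED_VOCAB is a made-up three-entry vocabulary for illustration.

FIXED_VOCAB = {"hello": 0, "world": 1, "<unk>": 2}

def lookup(word: str) -> int:
    # a fixed vocabulary maps out-of-vocabulary words to a fallback id
    return FIXED_VOCAB.get(word, FIXED_VOCAB["<unk>"])

def byte_ids(word: str):
    # a byte-level model can represent any Unicode string as UTF-8 bytes
    return list(word.encode("utf-8"))

print(lookup("Straßenbahn"))  # 2 (unknown -> <unk>)
print(byte_ids("ß"))          # [195, 159]
```

Real BPE tokenizers avoid a literal `<unk>` by splitting unknown words into smaller pieces, but the splits are still drawn from a finite learned set, which is the limitation being debated in this thread.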
slm tokens @tulkenss
@atomicflndr Ok, so if it doesn’t matter, maybe don’t call it tokenizer-free, because the first step is a tokenizer. I think “without a fixed vocabulary” is much more descriptive and does a lot more justice to the problem this solves. (But it probably sells less well.)
Johannes Messner @atomicflndr
So you can call this work tokenizer-free, or not; I don’t think it matters too much. What matters, imo, is that it addresses some of the fundamental limitations of subword tokenization à la BPE.
Johannes Messner @atomicflndr
The real issue, as I see it, is that the word splitting rule is not learned, unlike BLT and HNet. This is WIP.
slm tokens @tulkenss
I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.
