Johannes Messner
@atomicflndr
🇪🇺
396 posts
Hamburg, Germany · Joined December 2016
306 Following · 317 Followers
Johannes Messner @atomicflndr
... and scale up to larger models and generative evals to confirm this trend!
[image]
Matthias #SiempreGino @NairoInGreen
I was sleeping, when I saw a strange figure in the corner of my room, it looked like it was a ghost. When I turned the lights on I saw it was none other than Marco Odermatt! I was scared but then I said that a big race is coming up soon. Thankfully he immediately vanished.
Johannes Messner retweeted
samsja @samsja19
Today we’re releasing Trinity Large, a 400B MoE LLM with 13B active parameters, trained over 17T tokens. The base model is on par with GLM-4.5 Base, while being significantly faster at inference because it’s sparser and hybrid. The architecture we picked is one of my favorites: 3:1 local/global with SWA, NoPE on the global layers and RoPE on the local layers, gated attention, depth-scaled sandwich norm, and smooth training with Muon. Our dataset is also high quality, curated by @datologyai. We trained it on 2,000 B300s for a month on @PrimeIntellect infrastructure. This is a preview release with an instruct model only — we’re ramping up RL on it. When @latkins approached us a couple of months ago to train this model together, I thought he was crazy — but then he hired @stochasticchasm, and here we are.
Prime Intellect@PrimeIntellect

We're excited to introduce @arcee_ai's Trinity Large model. An open 400B parameter Mixture of Experts model, delivering frontier-level performance with only 13B active parameters. Trained in collaboration between Arcee, Datology and Prime Intellect.

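The 3:1 local/global pattern described above can be sketched as follows. This is a toy illustration of the interleaving only, not Trinity's actual code: every fourth layer is "global" (full attention, NoPE), and the three layers in between are "local" (sliding-window attention, RoPE).

```python
# Illustrative sketch of a 3:1 local/global layer interleaving.
# The function name and ratio are taken from the tweet's description;
# everything else is a stand-in.

def layer_kinds(n_layers: int, local_per_global: int = 3):
    """Return a list like ['local', 'local', 'local', 'global', ...]."""
    kinds = []
    for i in range(n_layers):
        # place one global layer after each run of local layers
        if (i + 1) % (local_per_global + 1) == 0:
            kinds.append("global")  # full attention, no positional encoding (NoPE)
        else:
            kinds.append("local")   # sliding-window attention (SWA) with RoPE
    return kinds

print(layer_kinds(8))
# ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
```

The point of the pattern is that most layers only attend within a short window, which keeps inference cheap, while the occasional global layer propagates long-range information.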
Nan Wang @nanwang_t
From Feb 2020 to now, 2000+ days building this wild ride. From OSS roots to ES acquisition: funding highs, ChatGPT panic, jina topping GH/HF, reader traffic booms, bilingual embeddings' crickets. Family ties keep me from joining ES. Thank you for all your support, and here's to the next adventure! @JinaAI_
Elastic@elastic

We’re excited to announce that we have joined forces with @JinaAI_, a leader in frontier models for multimodal and multilingual search. This acquisition deepens Elastic’s capabilities in retrieval, embeddings, and context engineering to power agentic AI: go.es.io/48QeYCM

Johannes Messner @atomicflndr
@NWalhan @KuittinenPetri @Aleph__Alpha Tech report is coming soon, but the positional encoding is standard RoPE (for all sub-transformers). I don't have the loss curve of these particular checkpoints at hand right now, but I can show you a CPT curve from a different HAT model I'm currently training; it's quite boring.
[image]
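The "standard RoPE" mentioned in the reply can be sketched in a few lines: pairs of dimensions are rotated by a position-dependent angle, so relative offsets between positions become dot-product-visible. This is a minimal reference sketch, not the model's implementation.

```python
import math

# Minimal sketch of rotary position embedding (RoPE) applied to one vector.
# Each (even, odd) dimension pair is rotated by pos / base**(i/d).

def rope(vec, pos, base=10000.0):
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

# Position 0 applies a zero rotation, leaving the vector unchanged.
print(rope([1.0, 0.0, 1.0, 0.0], 0))  # [1.0, 0.0, 1.0, 0.0]
```

Because each step is a pure rotation, RoPE preserves vector norms; only the angle between query and key vectors shifts with their relative distance.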
Waltan @NWalhan
I can’t find much info about your innovative model. If it’s ok, can you highlight: the positional encoding used (in the encoder, backbone, and decoder), and the training loss at various step counts? Would love to know how it curved with three transformers on top of each other.
Aleph Alpha @Aleph__Alpha
Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B. Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.
[image]
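The hierarchical idea behind HAT, as described in this thread, can be sketched at a toy level: split text into words, encode each word's UTF-8 bytes into one word embedding, and run the backbone over word positions rather than bytes. Both helper functions below are stand-ins invented for this illustration, not the real architecture.

```python
# Toy sketch of the HAT-style hierarchy: byte-level word encoding.
# words_to_byte_chunks and encode_word are illustrative stand-ins.

def words_to_byte_chunks(text: str):
    # word-boundary splitting: the rule-based (non-learned) step
    # discussed further down this thread
    return [w.encode("utf-8") for w in text.split(" ")]

def encode_word(byte_chunk: bytes, dim: int = 8):
    # stand-in "encoder": pool byte values into a fixed-size vector;
    # the real model uses a small transformer here
    vec = [0.0] * dim
    for j, b in enumerate(byte_chunk):
        vec[j % dim] += b / 255.0
    return vec

chunks = words_to_byte_chunks("Hallo Welt")
word_embeddings = [encode_word(c) for c in chunks]
print(len(word_embeddings), len(word_embeddings[0]))  # 2 8
```

Note that this works for any Unicode word, which is why there is no fixed vocabulary: the "embedding table" is replaced by a function of the word's bytes.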
Johannes Messner retweeted
Vedant Nanda @_nvedant_
Curious how to accelerate inference of some of the recent byte-level models like HAT/HNet/BLT? Check out this vllm fork developed by my friends and colleagues, Pablo and Lukas! To my knowledge, this is the first demonstration of inference speedups from dynamic chunking in byte models!
Pablo Iyu Guerrero@pabloiyu

First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵

Johannes Messner retweeted
Pablo Iyu Guerrero @pabloiyu
First high-performance inference for hierarchical byte models. @LukasBluebaum and I developed batched inference for tokenizer-free HAT (Hierarchical Autoregressive Transformers) models, developed by @Aleph__Alpha Research. In some settings, we outcompete the baseline Llama.🧵
[image]
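A back-of-the-envelope way to see why chunked inference can beat a byte-level baseline: the backbone runs over chunks (here, words) instead of individual bytes, so its sequence length shrinks by the average chunk length. The counting functions below are illustrative, not the fork's actual accounting.

```python
# Rough sketch of the sequence-length saving from word-level chunking
# in a hierarchical byte model. Both functions are stand-ins.

def backbone_positions(text: str) -> int:
    # one backbone position per word-chunk
    return len(text.split(" "))

def byte_positions(text: str) -> int:
    # one position per UTF-8 byte in a flat byte-level model
    return len(text.encode("utf-8"))

text = "tokenizer free models are fun"
print(byte_positions(text), backbone_positions(text))  # 29 5
```

With attention cost growing with sequence length, running the expensive backbone over ~5x fewer positions is where the speedup comes from; the small byte-level encoder and decoder still see every byte, but they are much cheaper.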
Johannes Messner retweeted
Vedant Nanda @_nvedant_
Our work on tokenizer free LLMs: Hierarchical Autoregressive Transformers (HAT)! We recently dropped HAT models on HF, pretrained from scratch! huggingface.co/collections/Al… You can try them with both HF Inference AND our vllm fork: github.com/Aleph-Alpha/vl… 🧵 (1/6)
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.

Johannes Messner retweeted
Vedant Nanda @_nvedant_
And imo the coolest part is that we made it ready for production-grade inference with our own vllm fork (more details on this soon!): github.com/Aleph-Alpha/vl… So you can now enjoy all the vllm features like continuous batching, paged attention, etc. for HAT too! (5/6)
slm tokens @tulkenss
@atomicflndr By the way, do you have an easy way to run the initial encoder (the “word embedding maker”) in isolation? I could make it work, but maybe you already did. I’m very curious what these embeddings look like, and how they perform compared to other word embeddings.
Johannes Messner @atomicflndr
Seeing this pushback a lot, and it's fair! However, these models don't have a fixed vocabulary, i.e. there are infinitely many words the model can operate over instead of a finite set of tokens.
slm tokens@tulkenss

I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.

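The "no fixed vocabulary" point can be made concrete with a toy contrast: a subword vocabulary must map unseen words to a fallback entry (or split them), while a byte-level view represents any Unicode string losslessly. The tiny vocabulary below is made up for the example.

```python
# Toy contrast: fixed subword vocabulary vs. byte-level representation.
# FIXED_VOCAB is a made-up three-entry vocabulary for illustration.

FIXED_VOCAB = {"hello": 0, "world": 1, "<unk>": 2}

def lookup(word: str) -> int:
    # a fixed vocabulary maps out-of-vocabulary words to a fallback id
    return FIXED_VOCAB.get(word, FIXED_VOCAB["<unk>"])

def byte_ids(word: str):
    # a byte-level model can represent any Unicode string as UTF-8 bytes
    return list(word.encode("utf-8"))

print(lookup("Straßenbahn"))  # 2 (unknown -> <unk>)
print(byte_ids("ß"))          # [195, 159]
```

Real BPE tokenizers avoid a literal `<unk>` by splitting unknown words into smaller pieces, but the splits are still drawn from a finite learned set, which is the limitation being debated in this thread.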
slm tokens @tulkenss
@atomicflndr Ok, so if it doesn’t matter, maybe don’t call it tokenizer-free, because the first step is a tokenizer. I think “without a fixed vocabulary” is much more descriptive and does a lot more justice to the problem this solves. (But it probably sells less well.)
Johannes Messner @atomicflndr
So you can call this work tokenizer-free, or not; I don’t think it matters too much. What matters, imo, is that it addresses some of the fundamental limitations of subword tokenization à la BPE.
Johannes Messner @atomicflndr
The real issue, as I see it, is that the word splitting rule is not learned, unlike BLT and HNet. This is WIP.
slm tokens @tulkenss
I wouldn't really consider these to be tokenizer-free tbh. Unlike Hnets, these models are word level. The sequence is turned into words (this is literally called tokenization). Then, the bytes of these words are turned into embeddings, which are then processed by a model.
Aleph Alpha@Aleph__Alpha

Introducing two new tokenizer-free LLM checkpoints from our research lab: TFree-HAT 7B Built on our Hierarchical Autoregressive Transformer (HAT) architecture, these models achieve top-tier German and English performance while processing text on a UTF-8 byte level.
