Andreas Grivas

563 posts


@andreasgrv

Interested in Bottlenecks in Neural Networks; Unargmaxable Outputs. Postdoc in ML/NLP at the University of Edinburgh.

Edinburgh, Scotland · Joined August 2015
661 Following · 513 Followers
Pinned Tweet
Andreas Grivas@andreasgrv·
How expressive is your deep multi-label classifier? Can it represent all outputs of interest? 🤔 SPOILER🚨: Your model can have *test set* outputs that are impossible to predict! 🚫 Check out our paper! arxiv.org/abs/2310.10443 🧵(1/7)
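A minimal sketch of the kind of check the paper is about: with a low-rank output layer, some multi-label sign patterns can be unattainable for any input representation, and attainability is a linear feasibility problem. The weights and dimensions below are random and purely illustrative, not the authors' code or setup.

```python
"""Toy feasibility check for multi-label output patterns (illustrative only).
A label pattern y in {+1,-1}^L is attainable by a linear output layer (W, b)
iff some representation z satisfies y_i * (w_i . z + b_i) > 0 for every label
i, which is a linear-programming feasibility problem."""
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
L, d = 16, 4                      # 16 labels, 4-dim bottleneck (d << L invites unattainable patterns)
W = rng.normal(size=(L, d))       # hypothetical output weights
b = rng.normal(size=L)

def attainable(y, eps=1e-6):
    """y: array of +1/-1 target signs, one per label."""
    # Constraint y_i*(w_i.z + b_i) >= eps  <=>  (-y_i*w_i).z <= y_i*b_i - eps
    A_ub = -(y[:, None] * W)
    b_ub = y * b - eps
    res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d, method="highs")
    return res.status == 0        # 0 = a feasible point was found

y = rng.choice([-1.0, 1.0], size=L)
print("random pattern attainable?", attainable(y))
```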
Andreas Grivas retweeted
Benjamin Minixhofer@bminixhofer·
Bolmo is now on arXiv!
Andreas Grivas retweeted
Desmond Elliott@delliott·
I am grateful that the Carlsberg Foundation is supporting our basic research on tokenization-free language models at the University of Copenhagen. I will be hiring Ph.D students to start in September 2026. Feel free to reach out early if you want to express informal interest.
Carlsbergfondet@Carlsbergfondet

From political science to archaeology. From astrophysics to marine biology and glaciology. Today 159 researchers receive a grant from the Carlsberg Foundation for widely varying basic-research initiatives. See which projects have received funding 👉bit.ly/4iK2fV2 #dkforsk

Andreas Grivas retweeted
Benjamin Minixhofer@bminixhofer·
We are releasing Bolmo today! Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3. Bolmo also performs competitively in terms of speed & is fully open. I was skeptical of byte-level models for a long time but I finally switched camps🧵
Ai2@allen_ai

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

Andreas Grivas retweeted
Edoardo Ponti@PontiEdoardo·
Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering: Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3). Promising on coding and char-level understanding!
Ai2@allen_ai

Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3—and to our knowledge, the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵

Andreas Grivas retweeted
Eleonora Giunchiglia@e_giunchiglia·
📣 PhD opening – Fall 2026 The DUCK Lab @imperialcollege is looking for a PhD student to join us! Why 🦆? We work on foundational aspects of #neurosymbolicAI and #SafeAI. 👉 DUCK = Data, Uncertainty, Constraints & Knowledge 📩 Apply by emailing: e.giunchiglia@imperial.ac.uk
Andreas Grivas retweeted
Ivan Titov@iatitov·
Happy to announce one (or more) postdoctoral positions at the U Amsterdam! There’s a lot of flexibility in research direction, including continual learning, memory in LLMs, AI safety, unlearning/editing, reasoning, and interpretability - areas our group is currently focused on.
Andreas Grivas retweeted
Piotr Nawrot@p_nawrot·
We'll present "Inference-Time Hyper-Scaling with KV Cache Compression", both at NeurIPS and EurIPS. We believe that future advances in AI will require model efficiency, and this work is another step in this direction. Save the date! -San Diego, Thur 11:00 -Copenhagen, Thur 10:30
Andreas Grivas retweeted
Ivan Titov@iatitov·
Excited about the collaboration with Kolya @FelineAutomaton. We're offering a fully funded PhD at @EdinburghNLP (start Sept 2026), working on language-based state representations for time series; the position comes with a generous budget for travel and experiments.
Andreas Grivas retweeted
Sarah Wiegreffe@sarahwiegreffe·
I am recruiting 2 PhD students to work on LM interpretability at UMD @umdcs starting in fall 2026! We are #3 in AI and #4 in NLP research on @CSrankings. Come join us in our lovely building just a few miles from Washington, D.C. Details in 🧵
Andreas Grivas retweeted
Ivan Titov@iatitov·
We at @EdinburghUni are looking for new PhD students to join us through the Centre for Doctoral Training in Responsible NLP. Work with us on making AI systems more responsible, trustworthy and safe @EdinburghNLP
Andreas Grivas retweeted
GLADIA Research Lab@GladiaLab·
After reading many of the replies, we would like to issue a few clarifications:
- we cannot extract training data from the model using our method
- LLMs are not injective w.r.t. the output text; that function is definitely non-injective and collisions occur all the time
- for the same reasons, LLMs are not invertible from the output text
We hope this clears up any confusion and we welcome any feedback on the matter. For any further questions, feel free to reach out to the authors: @GiorgosNik02, @tommaso_mncttn, @DonatoCrisosto1, @teelinsan, Yannis Panagakis, @EmanueleRodola
GLADIA Research Lab@GladiaLab

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
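For intuition, here is a toy illustration (mine, not the paper's) of the distinction drawn in the clarification above: decoding from hidden states to output text discards information, so two different states can yield identical text even if the prompt-to-state map is injective. The tiny "vocabulary" and vectors are made up.

```python
"""Toy illustration: distinct hidden states can decode to the same token,
so the state -> output-text map is non-injective and non-invertible."""
import numpy as np

W_out = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])   # tiny 3-token "vocabulary"

h1 = np.array([2.0, 0.5])          # two clearly different hidden states
h2 = np.array([5.0, 1.0])

tok1 = np.argmax(W_out @ h1)       # argmax decoding collapses many states...
tok2 = np.argmax(W_out @ h2)
print(tok1, tok2, tok1 == tok2)    # ...onto the same token: 0 0 True
```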

Andreas Grivas retweeted
Piotr Nawrot@p_nawrot·
> From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention.
> Did we find a free lunch? Not quite.
> The price became clear at larger scales: the model showed obvious weaknesses in complex, multi-hop reasoning tasks.

Long post, but the above sparked some motivation in me to describe what I believe is the most interesting theory I've developed about the efficiency-performance trade-off and the evaluation of efficient methods.

It matters how easy a given task is for your model

When we worked on the Sparse Frontier study — a large-scale evaluation of training-free sparse attention — we systematically tested: 6 sparse methods; 4 model sizes (7-72B); 9 tasks; 4 sequence lengths (16-128k). Everything was tightly controlled.

At first, some results made no sense. For instance, a 70B model solved a task perfectly across all sparsity methods up to 64k tokens (with 95% sparsity — an impressive ~20× theoretical efficiency gain). But at 128k tokens, performance suddenly collapsed, even with moderate sparsity (around 60-70%). Meanwhile, the 14B model — though never perfect — maintained a consistent 70% accuracy across all sequence lengths for the same task and sparsity methods, again up to 95% sparsity.

My intuition has always been that larger models should tolerate sparsity better, so what's going on? Why does performance stay constant for 14B and drop for 70B?

After some investigation, I developed a theory. Sparse methods inherently reduce model capacity — the more you compress, the less capable the model becomes. To understand how far you can push compression, you have to look at the relationship between initial model capability (C) and task difficulty (D):
* If C ≫ D, you can compress aggressively and performance will stay strong.
* If C ≈ D, even small compression can break the model's performance.

In the example above, the 70B model had enough capacity to achieve 100% accuracy at 64k tokens. But at 128k, with added distractors, the task difficulty increased — pushing the model right to its limit. A bit of compression was enough to tip it over. The 14B model, on the other hand, couldn't solve every input, but its consistent 70% success rate came from easier samples. Since those inputs were very easy, adding distractors had little impact. The remaining 30% of samples, which the 14B could never solve, were challenging, and at 128k they pushed the 70B model to its limits.

Takeaway: When a paper reports "no accuracy drop" on easy benchmarks, that doesn't mean the method is safe — it just means the benchmark wasn't hard enough to expose the weaknesses. That's why *Needle-in-a-Haystack*-style tasks aren't meaningful for evaluating sparse attention or token eviction. Modern models already solve them perfectly; they're too easy. We need benchmarks that push models to their limits, and only then apply efficiency modifications.

[Extra insight / thing to pay attention to] In sparse attention and KV compression, context relevance matters a lot.

I've also noticed that in some papers, the evaluation setup changes quietly — for example, switching from 0-shot to 5-shot settings in tasks where extra shots don't make a real difference. If the performance gap between 0-shot and 5-shot is within the standard deviation, those extra shots don't add meaningful information. But they can make compression methods appear stronger. Why? Because in these cases, the "extra" context tokens (the shots) can be compressed with almost no loss in accuracy. A paper might then report "5× compression with no performance drop" — but if you ran the same experiment under strict 0-shot conditions, performance would likely fall sharply.

TLDR: Efficiency gains often look good — until you test them at the edge of a model's true capability. The closer you get to that edge, the more trade-offs reveal themselves.
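A toy sketch of the capability-vs-difficulty framing above. The step model, the "capability penalty" of sparsity, and all numbers are entirely made up; it only exists to make the C ≫ D vs C ≈ D cases concrete.

```python
"""Hypothetical numbers for the capability (C) vs difficulty (D) framing."""
def accuracy(capability, difficulty):
    # crude step model: an input is solved iff effective capability exceeds its difficulty
    return 1.0 if capability >= difficulty else 0.0

def with_sparsity(capability, sparsity, penalty=0.3):
    # assume compression removes a fraction of effective capability (made-up model)
    return capability * (1.0 - penalty * sparsity)

# Large model, easy 64k task: huge headroom, heavy sparsity is harmless.
print(accuracy(with_sparsity(10.0, sparsity=0.95), difficulty=6.0))   # 1.0
# Same model, harder 128k variant: C is close to D, moderate sparsity tips it over.
print(accuracy(with_sparsity(10.0, sparsity=0.70), difficulty=9.5))   # 0.0
```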
Pengyu Zhao@zpysky1125

MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model?

On behalf of pre-training lead Haohai Sun. (zhihu.com/question/19653…)

I. Introduction

As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock and go with full attention with MiniMax M2?" After explaining the backstory in one chat after another, I figured it's time to write down our journey in a blog.

Honestly, I could give you the textbook debate. I could talk all afternoon about why you should build linear/sparse attention. Then, I could turn around and talk all afternoon about why you shouldn't. But what's the point of all that hand-waving? The real question is whether you should actually do it.

So, let's start with the conclusion: we are always working on it. But in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention.

As LLMs have evolved, the entire stack has become monstrously complex. We serve more scenarios, and the architecture design trade-offs are exploding: "How does it perform on code and math? What about agent scenarios? How does it handle multimodality? Does long-chain CoT still hold up? Can RL scale on top of it? Are there hidden traps with low-precision compute? How do you implement interleaved thinking, caching, or speculative decoding? ..." In short, there's a vast difference between the promise on paper and its payoff in production. You only get to claim that payoff after satisfying Condition 1...n and solving Problem 1...n.

II. Why Efficient Attention?

Let's do a thought experiment. If you had infinite compute, would you even bother with linear or sparse attention? Some might bring up theoretical arguments about softmax attention "oversmoothing" in an infinite context... but who knows? Under the current compute bound, no model has truly pushed softmax attention to its absolute limit. So, for all practical purposes, the race for efficient attention is a race to save compute.

For our M2 design, could we aim to save tokens — achieving the same quality with fewer tokens? Well, if you believe in scaling laws, you'd probably bet on other paths to get there, not efficient attention. So, the simple truth is this: compute is finite. We need an architecture that makes better use of it — models that achieve higher performance under the same budget (training & inference).

III. The Real Bottlenecks

To build a model that can practically be deployed and used by the community, we have to start with what users care about: Quality, Speed (TPS), and Price. Quality is non-negotiable. A useless model is useless even if it's free. So how do we make a Linear/Sparse/Hybrid Attention model that performs well enough? The biggest challenge here isn't the architecture design — the real bottleneck is the limitations of evaluation. (As for speed and price, those are heavily influenced by the inference stack — and great models tend to attract great engineers to optimize them.)

The Evaluation Trap: Goodhart's Law in Action

"As long as you build the benchmark, I'll find a way to beat it." Over the past few years of LLM development, the pace of leaderboard progress is staggering. No matter how hard a benchmark is — even if the SOTA score starts in single digits — once it catches the industry's attention, it's usually crushed within a few iterations. But how do you build an evaluation system that is comprehensive and actually reflects a model's true capabilities? That's one of the hardest — and most critical — problems in LLM development, and it becomes even more acute when you start messing with a component as fundamental as attention.

Benchmarks are a Leaky Abstraction

There's no free lunch. When you reduce the complexity of attention, you pay a price. The question is, where? When we were developing MiniMax-Text-01, everyone was still evaluating MMLU, BBH, MATH, and LongBench (all of which are now saturated). From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention. Our own small-scale hybrid models confirmed this on the leaderboards. (Did we find a free lunch?) Not quite. The price paid became obvious at a larger scale: the model had clear deficits in complex, multi-hop reasoning tasks.

Okay, once a problem is exposed, you can fix it. We developed proxy metrics for this specific weakness and iterated until the hybrid model seemed to match MHA. But does that proxy metric still correlate with real-world downstream performance at an even larger scale? Are there other hidden weaknesses? Who knows. We haven't run those experiments yet. The better the models get, the harder they are to evaluate. But that's a necessary part of the journey — keep it up, eval teams!

The High Cost of Knowing Things

For complex reasoning tasks, we can sometimes find early proxy metrics that correlate well with final performance — but not for all tasks (at least, not yet). As tasks get harder, the amount of experiment compute required just to get a statistically significant signal on your metric grows astronomically — which is ironic, since we study efficient attention because compute is limited. And beyond the academic benchmarks, optimization issues often only surface at scale. You never really know what's going to happen until you scale up. Anyone who read our M1 paper will recall the serious precision issues we hit during RL training — problems that ideally would have been spotted earlier. Going back and analyzing Lightning Attention's numerical convergence with that experience in hand was incredibly clarifying. Discovering the real problems is often far harder than solving them.

A Symphony of Variables

There are just too many variables in model training. Different architectures behave very differently on different data distributions and with different optimizers. In a world where our data is constantly being updated, an experiment run on last month's data mix might yield the opposite conclusion today. We can't observe everything perfectly — but we're working on finding more reliable experimental strategies.

Infrastructure: Where Theory Meets Metal

Compared to full attention, the infrastructure for linear and sparse attention is much less mature. To actually get the promised results, there's still a lot of groundwork to fill in. Take linear attention for example: if you analyze the compute intensity of existing linear architectures, many of them are memory-bound — even during training. Without extreme IO optimization, you're basically leaving a huge amount of GPU FLOPs on the table.

And inference brings even more challenges than training: how do you deliver a service that is genuinely faster and cheaper? Linear attention has linear compute complexity and constant memory usage. That means there's a crossover point where it becomes more efficient than full attention in compute and memory. In theory, that point lies at a few thousand tokens — which isn't particularly long for today's large models. But that's just theory. We need to solve a few key problems to actually approach it:
Low-Precision State Storage: Linear attention is currently far more sensitive to numerical precision than full attention.
Prefix Caching: In real-world applications, the cache-hit rate for conversations is very high. A new architecture must handle this gracefully.
Speculative Decoding: How do you optimize speculative decoding with a linear attention backbone? Fortunately, all of these seem solvable.

IV. What's Next

Scaling remains the name of the game, and context scaling is one of the key problems. Longer and longer context length is key in both pre-training and post-training. As GPU compute growth slows while data length keeps increasing, the benefits of linear and sparse attention will gradually emerge. We should start preparing now:
Better Data: More multimodal, information-rich long-context data.
Better Evaluation: More informative evaluation systems and experimental paradigms to speed up iteration.
Better Infrastructure: Mature training and inference infrastructure to fully squeeze out GPU potential.

V. Addendum: the SWA code...

We accidentally left the SWA inference code in the open-source release, and some people asked why it wasn't used in the final model. Simple answer: the performance wasn't good enough. That experiment was from quite early on, before GPT-OSS was open-sourced (we were pretty surprised to see its structure, by the way). But I can share a brief summary of our failed attempt.

We tried adapting CPT into a Hybrid SWA, testing both inter- and intra-layer mixing. The motivation for intra-layer mixing was to balance the compute intensity across all layers, which is friendly to both PP in training and PP or AFD during inference. Unfortunately, neither worked. Performance degraded noticeably as context length grew — which is unacceptable in agentic scenarios.

Our analysis showed that many global attention patterns (like retrieval heads and induction heads) were already established early during pre-training, and CPT can hardly adjust those patterns afterwards. You surely can mitigate the issue by using data probes to identify and keep those heads as full attention — but unfortunately, it's nearly impossible to discover them all from human priors. (And no, this issue isn't related to attention sinks.)

If you're interested in this line of research, I recommend taking a closer look at GPT-OSS, CWM, and Gemma, especially their long-context performance. Finally, we're hiring! If you want to join us, send your resume to guixianren@minimaxi.com.

References
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
CWM: An Open-Weights LLM for Research on Code Generation with World Models
Qwen3-Next
Gemma 3 Technical Report
gpt-oss-120b & gpt-oss-20b Model Card
Retrieval Head Mechanistically Explains Long-Context Factuality
transformer-circuits.pub/2022/in-contex…
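A back-of-the-envelope sketch of the memory contrast described above ("linear compute complexity and constant memory usage"). The model dimensions are made up for illustration, not MiniMax's; as the post notes, real crossover behaviour also depends on kernels, precision, and caching.

```python
"""KV-cache vs fixed recurrent-state arithmetic with made-up dimensions."""
n_layers, n_kv_heads, d_head, bytes_per = 48, 8, 128, 2          # bf16

kv_per_token = n_layers * n_kv_heads * 2 * d_head * bytes_per    # K and V, grows per token
state_size = n_layers * n_kv_heads * d_head * d_head * bytes_per # d x d state per head, fixed

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7,} tokens: full-attention KV cache ≈ {n * kv_per_token / 2**30:.1f} GiB")
print(f"linear-attention state (any length)     ≈ {state_size / 2**20:.1f} MiB")
```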

Andreas Grivas retweeted
NeSy 2026@nesyconf·
@luislamb We're glad to announce the NeSy 2025 Test of Time award for "Probabilistic Inference Modulo Theories"! 🏆Rodrigo de Salvo Braz was here to accept the award. This is groundwork for recent NeSy approaches like DeepSeaProbLog and the probabilistic algebraic layer.
Orion Weller@orionweller·
Instructions/reasoning are now everywhere in retrieval - we want embeddings to do it all! 🚀 But... is it even possible? 🤔 Turns out, it's not possible for single-vector models 😱 theoretically and empirically! To make it obvious we OSS a simple eval SoTA models flop on! 🧵
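A tiny, self-contained illustration of the flavour of this limitation (my toy, not the paper's evaluation): with 1-dimensional embeddings, a dot-product retriever can only realise 2 of the 6 possible rankings of 3 documents, whatever the query is. The paper's result concerns top-k sets at realistic dimensions, but the same kind of counting argument applies.

```python
"""Brute-force check: which document rankings are reachable with 1-d embeddings?"""
import itertools
import numpy as np

docs = np.array([0.2, 0.7, 1.5])            # three 1-d document embeddings
seen = set()
for q in np.linspace(-5, 5, 2001):          # sweep scalar queries
    seen.add(tuple(np.argsort(-q * docs)))  # descending ranking by dot-product score

total = len(list(itertools.permutations(range(3))))
print(f"reachable orderings: {len(seen)} of {total}")   # 2 of 6
```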
Andreas Grivas retweeted
Osman Batur İnce@ospanbatyr·
Multimodal models typically need millions of examples from each modality paired with text for training. With SEMI 🌓, we integrate new low-resource modalities into LLMs with as few as 32 samples — including satellite images, galaxies, sensors, and molecules. (1/6)
Andreas Grivas@andreasgrv·
@orionweller @tetraduzione It would be cool to know if feasibility and learnability differ, i.e. whether the Linear Programme proves a test example is predictable but the embedding is super constrained and therefore hard to learn.
Andreas Grivas@andreasgrv·
@orionweller Love the "critical-n" idea! Alongside learning embeddings, have you tried a Linear Programme to verify test set predictability? We (@tetraduzione) did this for MLC arxiv.org/abs/2310.10443 and guaranteed top-k thresholded outputs are predictable by constraining the embeddings.