Samu

252 posts

@SamuelNellessen

Teaching LLMs how to jailbreak @KachmanLab, AI @Radboud_Uni – https://t.co/7hqxpdu7PF 🚀

Nijmegen, Netherlands · Joined October 2018
318 Following · 81 Followers
Samu retweeted
Foresight Institute
Foresight Institute@foresightinst·
How can AI strengthen security, protect privacy, and enable cooperation among a rising plurality of AI systems and humans? Some of the best minds working on this are joining us at our Secure & Sovereign AI Workshop, July 18–19 in Berlin. Confirmed speakers:
• @jesseposner – Vora
• @robinhanson – George Mason University
• @ml_sudo – Project Sovereign
• Lisa Beckers – Global Technology Risk Foundation
• @socrates1024 – ZCash Foundation
• @FazlBarez – University of Oxford
• @IvanVendrov – Midjourney
• Georgios Kaissis – HPI Potsdam
• Davide Crapis – MystLabs Inc.
• @DanGirsh – WorldCoin
• @ObadiaAlex – ARIA
• @jsotterbach – SPRIND
• @NitzanShulman – Heron AI
• Or Zamir – Blavatnik School of Computer Science
• @dimasquest – PhD, Imperial College London
• @0xQuintus – Flashbots
• John Liagouris – Boston University
• Mariana Meireles – UC Berkeley
• Rob Sison – UNSW Sydney
• @KeithPatarroyo – University of Glasgow
• Dongwon Lee – PSU
• @galmasha – Foresight Fellow
• @EricMoore – PhD, Kennedy Kreiger Institute
• @MorganLvng – TechCongress Fellow
• Abraham Nash – Infinite Zero Foundation
• @JaimeRalV – Apart Research
• @SamuelNellessen – CAIS
• @iamwsubramanyam – CAIS
• Pascal Berrang – Zeroth Research
• @luca_arnaboldi – Zeroth Research
• @mateo_petel – Google Deepmind
• Madeleine Parker – Newfoundation
• @jmartink – Lateral
• @aurelcode – Inversed
• Janabel Xia – Privacy Residency
• Kazik Pogoda – Xemantic
• @Tianyi_Alex_Qiu – Peking University
• @Gunnar_Zarncke – Aintelope
• @_georg_lange
• Eduard Kapelko
If you’re researching or building in AI safety, security, privacy, or decentralized cooperation, apply to attend: foresight.org/events/2026-se… Sponsored by: @protocollabs
GIF
Samu
Samu@SamuelNellessen·
@ConnorDilgren congrats! first milestone of many! 🚀
Connor Dilgren
Connor Dilgren@ConnorDilgren·
Excited to announce my first preprint in LM interpretability! Latent reasoning models are not monitorable by default, since they don't reason in human-readable, natural language text. But can we make progress in understanding their intermediate reasoning steps using mech interp?
Connor Dilgren tweet media
Samu retweeted
Dave Banerjee
Dave Banerjee@DaveRBanerjee·
Mean time-to-exploit has collapsed from 2.3 years in 2018 to 1.6 days in 2026
Dave Banerjee tweet media
Mushroom's Mutters 🎀
Mushroom's Mutters 🎀@wondering_camel·
I hate xgboost. Whatever version YouTube is using. Currently it's marriage season in Vietnam. Seeing your ex get married is complicated. The relationship ended; I never got a chance to look back. Today YouTube keeps pushing the sad songs we used to listen to.
Samu
Samu@SamuelNellessen·
@visakanv probably not that deep, but it’s the same phenomenon as saying "did you see that, chat?" IRL. zoomers internalized the panopticon. so repeating the criticism acts as a wink to the imaginary audience, narrating the internal dialogue.
Visa is doing marketing consults (see pinned!)
it does feel like a zoomer linguistic innovation to respond to a criticism with a ~repetition of the criticism and a crying emoji “hey you didn’t submit your report in time” “not the deadline-missing allegations 😭”
Samu
Samu@SamuelNellessen·
@AmmannNora big news, congrats! 🥳
Nora Ammann
Nora Ammann@AmmannNora·
I’ve recently accepted the Programme Director role at ARIA, taking over from davidad in running the Safeguarded AI programme. 🧵 about the programme’s strategic vision + our upcoming funding efforts in cybersecurity + we're hiring!
vik
vik@vikhyatk·
i would rather walk on a bed of nails than have to do 2FA again
Samu
Samu@SamuelNellessen·
@viemccoy does building an automated red-teaming agent with CISPO that breaks multiple open- and closed-source models in a zero-shot transfer test qualify? arxiv.org/abs/2602.02395 will focus on demonstrating runaway behaviour of LAT under continuous adversarial pressure next.
𝚟𝚒𝚎 ⟢
𝚟𝚒𝚎 ⟢@viemccoy·
The top requirement for joining my team at OpenAI is having a certain "je ne sais quoi". If you don't have this, I'm not really sure it's worth taking the time to apply.
Samu
Samu@SamuelNellessen·
@lfschiavo i love koyaanisqatsi, but what does your life look like if it is one of your top 5 albums???
Samu
Samu@SamuelNellessen·
@kalomaze hmm.. it appears the screenshot is cut off, can you post the whole image?
kalomaze
kalomaze@kalomaze·
be careful with what you log to wandb
kalomaze tweet media
Samu
Samu@SamuelNellessen·
Samu tweet media
Samu
Samu@SamuelNellessen·
You never get the feedback loop that would let you improve or correct your thinking. The scary part isn't being wrong publicly. It's being wrong publicly and having everyone know it except you.
Samu
Samu@SamuelNellessen·
The more I have "skin in the game" and write, the more I realize my fear has shifted. It's not that people will criticize my work. It's that they'll think it's bullshit and never tell me. Actual criticism is a luxury. Most people just form their judgment in silence and move on.
Samu
Samu@SamuelNellessen·
@rqobela Scala...
Rezi
Rezi@rqobela·
Programming language you learned but never used again is...?
Samu
Samu@SamuelNellessen·
@yacinelearning Kimi Linear is also rather small; the point made for Minimax was that linear attention breaks when deployed at scale (be it memory-bound issues or bad performance on relevant benchmarks). I'd expect Kimi Linear also breaks in multi-hop. No free lunch!
Yacine Mahdid
Yacine Mahdid@yacinelearning·
between minimax ditching linear attention and kimi finding a variant that outperforms full attention I’m completely confused
Kimi.ai@Kimi_Moonshot

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim…
Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi Linear offers up to a 75% reduction in KV cache usage and up to 6x decoding throughput at a 1M context length.
Key highlights:
🔹 Kimi Delta Attention: A hardware-efficient linear attention mechanism that refines the gated delta rule.
🔹 Kimi Linear Architecture: The first hybrid linear architecture to surpass pure full attention quality across the board.
🔹 Empirical Validation: Scaled, fair comparisons + open-sourced KDA kernels, vLLM integration, and checkpoints.
The future of agentic-oriented attention is here! 💡

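The quoted 75% KV-cache figure is consistent with simple arithmetic if roughly three of every four layers use the linear mechanism (whose state size is constant) and only the remaining quarter keep a KV cache that grows with context. A minimal sketch in Python; the 3:1 layer ratio, layer count, and head sizes below are illustrative assumptions, not Kimi Linear's published configuration:

```python
# Hypothetical sizes; only the assumed 3:1 linear-to-full layer ratio
# matters for the headline percentage.
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, head_dim, bytes_per=2):
    """KV cache held by full-attention layers: one K and one V vector
    per token, per head, per layer (bytes_per=2 for fp16/bf16)."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per

layers, ctx, kv_heads, hdim = 48, 1_000_000, 8, 128

full = kv_cache_bytes(layers, ctx, kv_heads, hdim)
# In a 3:1 hybrid only 1/4 of the layers carry a growing KV cache; the
# linear layers' fixed-size state is negligible at 1M tokens.
hybrid = kv_cache_bytes(layers // 4, ctx, kv_heads, hdim)

print(f"full attention : {full / 2**30:.1f} GiB")
print(f"3:1 hybrid     : {hybrid / 2**30:.1f} GiB")
print(f"reduction      : {1 - hybrid / full:.0%}")  # 75%, matching the claim
```

The throughput claim plausibly has the same flavor: decoding at long context is largely bound by streaming the KV cache from memory, so a 4x smaller cache shows up directly as higher tokens per second.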
Samu
Samu@SamuelNellessen·
@yacinelearning Both can be true at the same time; Kimi Linear shows a different inductive bias (they say themselves that it struggles with some stuff). Hybrid wins in certain regimes; full attention remains the safest generalist today.
sidbing 🪽
sidbing 🪽@sidbing·
bro they flaming my ass on reddit
sidbing 🪽 tweet media
Samu
Samu@SamuelNellessen·
@snowclipsed My guess would be that you wouldn't even get beyond a few thousand tokens of context length before you're memory-bandwidth-bound? Also, unsure what information the additional complexity would encode beyond full attention 🤔
Samu
Samu@SamuelNellessen·
As contexts grow to 10M+ tokens, full attention becomes economically impossible. 'Compute growth slows' = can't afford to scale at the same pace. Efficient attention becomes necessary despite current infrastructure problems.
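The "economically impossible" point above is just the quadratic term. A back-of-envelope sketch (with an illustrative model width d, not any specific model) comparing per-layer attention matmul FLOPs, roughly 4·n²·d for full attention versus roughly 4·n·d² for a linear variant:

```python
# Back-of-envelope FLOP counts per layer.
# Full attention: Q·K^T and scores·V are each ~2*n^2*d FLOPs.
# Linear attention: the recurrent state update and readout are each
# ~2*n*d^2 FLOPs. d = 4096 is an illustrative width.

def full_attn_flops(n, d):
    return 4 * n**2 * d

def linear_attn_flops(n, d):
    return 4 * n * d**2

d = 4096
for n in (4_096, 1_000_000, 10_000_000):
    ratio = full_attn_flops(n, d) / linear_attn_flops(n, d)
    print(f"n={n:>10,}: full/linear FLOP ratio ≈ {ratio:,.0f}x")
```

The ratio is just n/d, so full attention only starts losing once context passes a few thousand tokens (n ≈ d), and by 10M tokens it is thousands of times more expensive, which is the regime the thread is talking about.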
Samu
Samu@SamuelNellessen·
there's no free lunch. Efficient attention only approximates full attention to save compute. Every optimization trades something - here it's information. Small cost at short contexts, bigger cost at long contexts or complex reasoning (where it matters). However...
Samu
Samu@SamuelNellessen·
Just spent some time going through MiniMax's blog on why they chose full attention over efficient attention and writing down learnings. Lots of misconceptions corrected. Here's what I learned about the compute/memory tradeoffs that aren't obvious from theory:
Pengyu Zhao@zpysky1125

MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. (zhihu.com/question/19653…)

I. Introduction

As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock and go with full attention with MiniMax M2?" After explaining the backstory in one chat after another, I figured it's time to write down our journey in a blog.

Honestly, I could give you the textbook debate. I could talk all afternoon about why you should build linear/sparse attention. Then, I could turn around and talk all afternoon about why you shouldn't. But what's the point of all that hand-waving? The real question is whether you should actually do it. So, let's start with the conclusion: We are always working on it. But in a real-world, industrial-grade system, the truth is that efficient attention still has some way to go before it can definitively beat full attention.

As LLMs have evolved, the entire stack has become monstrously complex. We serve more scenarios, and the architecture design trade-offs are exploding: "How does it perform on code and math? What about agent scenarios? How does it handle multimodality? Does long-chain CoT still hold up? Can RL scale on top of it? Are there hidden traps with low-precision compute? How do you implement interleaved thinking, caching, or speculative decoding? ..." In short, there's a vast difference between the promise on paper and its payoff in production. You only get to claim that payoff after satisfying Condition 1...n and solving Problem 1...n.

II. Why Efficient Attention?

Let's do a thought experiment. If you had infinite compute, would you even bother with linear or sparse attention? Some might bring up theoretical arguments about softmax attention "oversmoothing" in an infinite context... but who knows? Under the current compute bound, no model has truly pushed softmax attention to its absolute limit.
So, for all practical purposes, the race for efficient attention is a race to save compute. For our M2 design, could we aim to save tokens — achieving the same quality with fewer tokens? Well, if you believe in scaling laws, to achieve this goal you'd probably bet on other paths to get there, not efficient attention. So, the simple truth is this: Compute is finite. We need an architecture that makes better use of it — models that achieve higher performance under the same budget (training & inference).

III. The Real Bottlenecks

To build a model that can practically be deployed and used by the community, we have to start with what users care about: Quality, Speed (TPS), and Price. Quality is non-negotiable. A useless model is useless even if it's free. So how do we make a Linear/Sparse/Hybrid Attention model that performs well enough? The biggest challenge here isn't the architecture design — the real bottleneck is the limitations of evaluation. (As for speed and price, those are heavily influenced by the inference stack — and great models tend to attract great engineers to optimize them.)

The Evaluation Trap: Goodhart's Law in Action

"As long as you build the benchmark, I'll find a way to beat it." Over the past few years of LLM development, the pace of leaderboard progress is staggering. No matter how hard a benchmark is — even if the SOTA score starts in single digits — once it catches the industry's attention, it's usually crushed within a few iterations. But how do you build an evaluation system that is comprehensive and actually reflects a model's true capabilities? That's one of the hardest — and most critical — problems in LLM development, and it becomes even more acute when you start messing with a component as fundamental as attention.

Benchmarks are a Leaky Abstraction

There's no free lunch. When you reduce the complexity of attention, you pay a price. The question is, where?
When we were developing MiniMax-Text-01, everyone was still evaluating MMLU, BBH, MATH, and LongBench (all of which are now saturated). From the perspective of a year ago, a hybrid of Lightning Attention and Full Attention looked just as good as pure full attention. Our own small-scale hybrid models confirmed this on the leaderboards. (Did we find a free lunch?) Not quite. The price paid became obvious at a larger scale: the model had clear deficits in complex, multi-hop reasoning tasks.

Okay, once a problem is exposed, you can fix it. We developed proxy metrics for this specific weakness and iterated until the hybrid model seemed to match MHA. But does that proxy metric still correlate with real-world downstream performance at an even larger scale? Are there other hidden weaknesses? Who knows. We haven't run those experiments yet. The better the models get, the harder they are to evaluate. But that's a necessary part of the journey — keep it up, eval teams!

The High Cost of Knowing Things

For complex reasoning tasks, we can sometimes find early proxy metrics that correlate well with final performance — but not for all tasks (at least, not yet). As tasks get harder, the amount of experiment compute required just to get a statistically significant signal on your metric grows astronomically — which is ironic, since we study efficient attention because compute is limited. And beyond the academic benchmarks, optimization issues often only surface at scale. You never really know what's going to happen until you scale up. Anyone who read our M1 paper will recall the serious precision issues we hit during RL training — problems that could have been spotted earlier. Going back and analyzing Lightning Attention's numerical convergence with that experience in hand was incredibly clarifying. Discovering the real problems is often far harder than solving them.

A Symphony of Variables

There are just too many variables in model training.
Different architectures behave very differently on different data distributions and with different optimizers. In a world where our data is constantly being updated, an experiment run on last month's data mix might yield the opposite conclusion today. We can't observe everything perfectly — but we're working on finding more reliable experimental strategies.

Infrastructure: Where Theory Meets Metal

Compared to full attention, the infrastructure for linear and sparse attention is much less mature. To actually get the promised results, there's still a lot of groundwork to fill in. Take linear attention for example: If you analyze the compute intensity of existing linear architectures, many of them are memory-bound — even during training. Without extreme IO optimization, you're basically leaving a huge amount of GPU FLOPs on the table.

And inference brings even more challenges than training: How do you deliver a service that is genuinely faster and cheaper? Linear attention has linear compute complexity and constant memory usage. That means there's a crossover point where it becomes more efficient than full attention in compute and memory. In theory, that point lies at a few thousand tokens — which isn't particularly long for today's large models. But that's just theory. We need to solve a few key problems to actually approach it:

Low-Precision State Storage: Linear attention is currently far more sensitive to numerical precision than full attention.
Prefix Caching: In real-world applications, the cache-hit rate for conversations is very high. A new architecture must handle this gracefully.
Speculative Decoding: How do you optimize speculative decoding with a linear attention backbone?

Well, fortunately, all of these seem solvable.

IV. What's Next

Scaling remains the name of the game, and context scaling is one of the key problems. Longer and longer context length is key in both pre-training and post-training.
As GPU compute growth slows while data length keeps increasing, the benefits of linear and sparse attention will gradually emerge. We should start preparing now:

Better Data: More multimodal, information-rich long-context data.
Better Evaluation: More informative evaluation systems and experimental paradigms to speed up iteration.
Better Infrastructure: Mature training and inference infrastructure to fully squeeze out GPU potential.

V. Addendum: the SWA code...

We accidentally left the SWA inference code in the open-source release, and some people asked why it wasn't used in the final model. Simple answer: the performance wasn't good enough. That experiment was from quite early on, before GPT-OSS was open-sourced (we were pretty surprised to see its structure, by the way). But I can share a brief summary of our failed attempt.

We tried adapting CPT into a Hybrid SWA, testing both inter- and intra-layer mixing. The motivation for intra-layer mixing was to balance the compute intensity across all layers, which is friendly to both PP in training and PP or AFD during inference. Unfortunately, neither worked. Performance degraded noticeably as context length grew — which is unacceptable in agentic scenarios. Our analysis showed that many global attention patterns (like retrieval heads and induction heads) were already established early during pre-training. CPT can hardly adjust those patterns afterwards. You surely can mitigate the issue by using data probes to identify and keep those heads as full attention — but unfortunately, it's nearly impossible to discover them all from human priors. (And no, this issue isn't related to attention sinks.) If you're interested in this line of research, I recommend taking a closer look at GPT-OSS, CWM, and Gemma, especially their long-context performance.

Finally, we're hiring! If you want to join us, send your resume to guixianren@minimaxi.com.
References:
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
CWM: An Open-Weights LLM for Research on Code Generation with World Models
Qwen3-Next
Gemma 3 Technical Report
gpt-oss-120b & gpt-oss-20b Model Card
Retrieval Head Mechanistically Explains Long-Context Factuality
transformer-circuits.pub/2022/in-contex…

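The compute-and-memory crossover the MiniMax blog describes ("linear compute complexity and constant memory usage... a few thousand tokens") is easiest to see on the memory side: a full-attention layer caches K and V for every token, while a linear-attention layer keeps one fixed head_dim × head_dim state matrix per head. A sketch with illustrative sizes, not any particular model's:

```python
# Per-layer memory: growing KV cache vs. constant linear-attention state.
def kv_cache_mib(n_tokens, n_heads=8, head_dim=128, bytes_per=2):
    # one K and one V vector cached for every token, per head
    return 2 * n_tokens * n_heads * head_dim * bytes_per / 2**20

def linear_state_mib(n_heads=8, head_dim=128, bytes_per=2):
    # one head_dim x head_dim state matrix per head, regardless of context
    return n_heads * head_dim * head_dim * bytes_per / 2**20

print(f"linear state (any n): {linear_state_mib():.2f} MiB")
for n in (1_000, 100_000, 10_000_000):
    print(f"KV cache at n={n:>10,}: {kv_cache_mib(n):,.1f} MiB")
```

With these sizes the KV cache grows without bound while the state stays at a quarter MiB per layer, which is also why the blog flags low-precision state storage as a sticking point: everything the model retains from the context has to survive inside that one small matrix.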